
CASL

Permanent URI for this collection: https://eresearch.qmu.ac.uk/handle/20.500.12289/22


Search Results

Now showing 1 - 8 of 8
  • Item
    Beyond the edge: Markerless pose estimation of speech articulators from ultrasound and camera images using DeepLabCut
    (MDPI, 2022-02-02) Wrench, Alan A.; Balch-Tomes, Jonathan
    Automatic feature extraction from images of speech articulators is currently achieved by detecting edges. Here, we investigate the use of pose-estimation deep neural nets with transfer learning to perform markerless estimation of speech articulator keypoints, using only a few hundred hand-labelled images as training input. Midsagittal ultrasound images of the tongue, jaw, and hyoid and camera images of the lips were hand-labelled with keypoints, used to train DeepLabCut models, and evaluated on unseen speakers and systems. Tongue surface contours interpolated from estimated and hand-labelled keypoints produced an average mean sum of distances (MSD) of 0.93 mm, s.d. 0.46 mm, compared with 0.96 mm, s.d. 0.39 mm, for two human labellers, and 2.3 mm, s.d. 1.5 mm, for the best-performing edge-detection algorithm. A pilot set of simultaneous electromagnetic articulography (EMA) and ultrasound recordings demonstrated partial correlation between three physical sensor positions and the corresponding estimated keypoints, which requires further investigation. The accuracy of estimating lip aperture from camera video was high, with a mean MSD of 0.70 mm, s.d. 0.56 mm, compared with 0.57 mm, s.d. 0.48 mm, for two human labellers. DeepLabCut was found to be a fast, accurate, and fully automatic method of providing unique kinematic data for the tongue, hyoid, jaw, and lips.
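    The mean sum of distances (MSD) reported in this abstract is a standard contour-comparison measure. A minimal sketch of one common bidirectional nearest-neighbour formulation is given below; the exact averaging convention used in the paper is not specified here, so treat this as an illustration rather than the authors' implementation.

```python
import numpy as np

def mean_sum_of_distances(contour_a, contour_b):
    """Bidirectional nearest-neighbour distance between two 2-D contours.

    contour_a, contour_b: arrays of shape (n, 2) and (m, 2) holding (x, y)
    points in mm. Returns the average, over every point on both contours,
    of the distance to the closest point on the other contour.
    """
    a = np.asarray(contour_a, dtype=float)
    b = np.asarray(contour_b, dtype=float)
    # Pairwise Euclidean distances, shape (n, m)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(a) + len(b))

# Hypothetical example: a hand-labelled contour vs. an estimated one (mm)
hand = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.8]])
est = np.array([[0.1, 0.1], [1.1, 0.4], [2.0, 1.0]])
print(mean_sum_of_distances(hand, est))
```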
  • Item
    Coarticulation across morpheme boundaries: An ultrasound study of past-tense inflection in Scottish English
    (Elsevier, 2021-09-15) Mousikou, Petroula; Strycharczuk, Patrycja; Turk, Alice; Scobbie, James M.
    It has been hypothesized that morphologically-complex words are mentally stored in a decomposed form, often requiring online composition during processing. Morphologically-simple words, by contrast, can only be stored as wholes. The way a word is stored and retrieved is thought to influence its realization during speech production, so that when retrieval requires less time, the articulatory plan is executed faster. Faster articulatory execution could result in more coarticulation. Accordingly, we hypothesized that morphologically-simple words might be produced with more coarticulation than apparently homophonous morphologically-complex words, because the retrieval of monomorphemic forms is direct, in contrast to morphologically-complex ones, which might need to be composed online into full word forms. Using Ultrasound Tongue Imaging, we tested this hypothesis with nine speakers of Scottish English. Over two days of training, participants learned phonemically identical monomorphemic and morphologically-complex nonce words; on the third consecutive day, they produced them in two prosodic contexts. Two types of articulatory analyses revealed no systematic differences in coarticulation between monomorphemic and morphologically-complex items, although a few speakers did idiosyncratically produce some morphological effects on articulation. Our work contributes to our understanding of how morphologically complex words are stored and processed during speech production.
  • Item
    Quantifying changes in ultrasound tongue-shape pre- and post-intervention in speakers with submucous cleft palate: An illustrative case study
    (Taylor & Francis, 2021-09-08) Roxburgh, Zoe; Cleland, Joanne; Scobbie, James M.; Wood, Sara
    Ultrasound Tongue Imaging is increasingly used during assessment and treatment of speech sound disorders. Recent literature has shown that ultrasound is also useful for the quantitative analysis of a wide range of speech errors. So far, the compensatory articulations of speakers with cleft palate have only been analysed qualitatively. This study provides a pilot quantitative ultrasound analysis, drawing on longitudinal intervention data from a child with submucous cleft palate. Two key ultrasound metrics were used: (1) articulatory t-tests compared tongue shapes for perceptually collapsed phonemes on a radial measurement grid, and (2) the Mean Radial Difference quantified the overall extent to which two tongue shapes differ. This articulatory analysis supplemented impressionistic phonetic transcriptions and identified covert contrasts. Articulatory errors identified in this study using ultrasound were in line with errors identified in the speech of children with cleft palate in previous literature. While compensatory error patterns commonly found in speakers with cleft palate have been argued to facilitate functional phonological development, our findings suggest that the compensatory articulations uncovered here are articulatory in nature.
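    As a rough illustration of the two metrics named in this abstract, the sketch below runs per-gridline t-tests and computes a mean radial difference over a measurement fan, assuming each token is stored as a vector of tongue-surface radii, one per fan line. The published definitions may differ in detail; the data and column layout here are hypothetical.

```python
import numpy as np
from scipy import stats

def radial_ttests(radii_a, radii_b, alpha=0.05):
    """Per-fan-line comparison of two sets of tongue contours.

    radii_a, radii_b: arrays of shape (n_tokens, n_fanlines) giving the
    distance (mm) from the probe origin to the tongue surface along each
    radial gridline. Returns t and p values per fan line plus a boolean
    mask of lines where the two phoneme categories differ significantly.
    """
    t, p = stats.ttest_ind(radii_a, radii_b, axis=0)
    return t, p, p < alpha

def mean_radial_difference(radii_a, radii_b):
    """Overall difference between the two mean tongue shapes (mm)."""
    return np.mean(np.abs(radii_a.mean(axis=0) - radii_b.mean(axis=0)))

# Hypothetical data: 10 tokens each of two target phonemes, 42 fan lines
rng = np.random.default_rng(0)
target_t = 55 + rng.normal(0, 2, size=(10, 42))
target_k = 57 + rng.normal(0, 2, size=(10, 42))
_, _, significant = radial_ttests(target_t, target_k)
print(significant.sum(), "fan lines differ;",
      "MRD =", round(mean_radial_difference(target_t, target_k), 2), "mm")
```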
  • Item
    Lenition and fortition of /r/ in utterance-final position, an ultrasound tongue imaging study of lingual gesture timing in spontaneous speech
    (Elsevier, 2021-04-16) Lawson, Eleanor; Stuart-Smith, Jane; Cho, Taehong
    The most fundamental division among English dialects is that between rhotic and non-rhotic varieties. The mechanisms of historical /r/-loss sound change are not well understood, but studying a contemporary /r/-loss sound change in a rhotic variety of English can provide new insights. We know that /r/ weakening in contemporary Scottish English is a gesture-timing-based phenomenon and that it is socially indexical, but we have no phonetic explanation for the predominance of weak /r/ variants in utterance-final position. Using a socially-stratified conversational ultrasound tongue imaging speech corpus, this study investigates the effects of boundary context, along with other linguistic and social factors such as syllable stress, following-consonant place, and social class, on lingual gesture timing in /r/ and strength of rhoticity. Mixed-effects modelling showed that the utterance-final context conditions greater anterior lingual gesture delay in /r/ and weaker-sounding /r/s, but only in working-class speech. Middle-class speech shows no anterior lingual gesture delay for /r/ in utterance-final position, and /r/ is audibly strengthened in this position. It is unclear whether this divergence is due to variation in underlying tongue shape for /r/ in these social-class communities, or whether utterance-final position provides a key location for the performance of social class using salient variants of /r/.
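    A hedged sketch of the kind of mixed-effects model described in this abstract is given below, using statsmodels with hypothetical column names; the paper's actual predictors, coding scheme, and random-effects structure are not reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame: one row per /r/ token, with the anterior
# lingual gesture delay (ms) as the outcome and a random intercept per
# speaker. Column names and file are illustrative, not the paper's.
df = pd.read_csv("r_tokens.csv")  # columns: delay_ms, boundary, stress,
                                  # following_place, social_class, speaker

model = smf.mixedlm(
    "delay_ms ~ boundary * social_class + stress + following_place",
    data=df,
    groups=df["speaker"],
)
result = model.fit()
print(result.summary())
```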
  • Item
    The effects of syllable and utterance position on tongue shape and gestural magnitude in /l/ and /r/
    (International Phonetic Association, 2019-08-10) Lawson, Eleanor; Leplatre, Gregory; Stuart-Smith, Jane; Scobbie, James M.
    This paper is an ultrasound-based articulatory study of the impact of syllable position and utterance position on tongue shape and tongue-gesture magnitude in liquid consonants in American, Irish and Scottish English. Mixed-effects modelling was used to analyse variation in normalised tongue-gesture magnitude for /r/ and /l/ in syllable-onset and coda position and in utterance-initial, medial and final position. Variation between onset and coda mean midsagittal tongue surfaces was also quantified using normalised root-mean-square distances, and patterns of articulatory onset-coda allophony were identified. Although some speakers in all varieties used tip-up /r/ in syllable-onset position and bunched /r/ in coda position, RMS distance results showed greater similarity between onset and coda /r/ than between onset and coda /l/. Gesture magnitude was significantly reduced for both /l/ and /r/ in coda position. Utterance position had a significant effect on /l/ only.
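    The normalised root-mean-square distance between onset and coda mean tongue surfaces might be computed along the lines sketched below, assuming the two mean contours are resampled to the same number of corresponding points and that normalisation divides by a per-speaker scale; the paper's own normalisation procedure is not reproduced here.

```python
import numpy as np

def rms_distance(mean_contour_a, mean_contour_b):
    """RMS of point-to-point Euclidean distances between two mean
    midsagittal tongue surfaces sampled at corresponding points.

    Both inputs have shape (n_points, 2), in mm.
    """
    diffs = np.linalg.norm(mean_contour_a - mean_contour_b, axis=1)
    return np.sqrt(np.mean(diffs ** 2))

def normalised_rms_distance(mean_contour_a, mean_contour_b, scale):
    """Divide by a per-speaker scale factor (e.g. an estimate of
    vocal-tract size) so distances are comparable across speakers."""
    return rms_distance(mean_contour_a, mean_contour_b) / scale

# Hypothetical onset vs. coda mean contours for one speaker (mm)
onset = np.array([[0.0, 10.0], [5.0, 14.0], [10.0, 12.0]])
coda = np.array([[0.0, 9.0], [5.0, 12.5], [10.0, 11.0]])
print(normalised_rms_distance(onset, coda, scale=60.0))
```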
  • Item
    The impact of real-time articulatory information on phonetic transcription: Ultrasound-aided transcription in cleft lip and palate speech
    (Karger, 2019-05-24) Cleland, Joanne; Lloyd, Susan; Campbell, Linsay; Crampin, Lisa; Palo, Pertti; Sugden, Eleanor; Wrench, Alan A.; Zharkova, Natalia
    Objective: This study investigated whether adding an additional modality, namely ultrasound tongue imaging, to perception-based phonetic transcription affected the identification of compensatory articulations and interrater reliability. Patients and Methods: Thirty-nine English-speaking children aged 3 to 12 with cleft lip and palate (CLP) were recorded producing repetitions of /aCa/ for all places of articulation with simultaneous audio and probe-stabilised ultrasound. Three types of transcription were performed: (1) descriptive observations from the live ultrasound by the clinician recording the data; (2) ultrasound-aided transcription by two ultrasound-trained clinicians; and (3) traditional phonetic transcription from the audio recordings by two CLP specialists. We compared the number of consonants identified as in error by each transcriber and then classified errors into eight different subcategories. Results: Both the ultrasound-aided and traditional transcriptions yielded similar error-detection rates; however, these were significantly higher than the observations recorded live in the clinic. Interrater reliability for the ultrasound transcribers was substantial (κ = 0.65), compared to moderate (κ = 0.47) for the traditional transcribers. Ultrasound-aided transcribers were more likely to identify covert errors such as double articulations and retroflexion than the audio-only transcribers. Conclusion: Ultrasound tongue imaging is a useful complement to traditional phonetic transcription for CLP speech.
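    The kappa values above are interrater agreement statistics; a minimal sketch of computing Cohen's kappa with scikit-learn over hypothetical per-consonant judgements is shown below. The error-category labels are invented for illustration and are not the study's coding scheme.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-consonant judgements ("correct" vs. an error category)
# from two transcribers listening to / viewing the same recordings.
transcriber_1 = ["correct", "backing", "correct", "double_articulation", "glottal"]
transcriber_2 = ["correct", "backing", "correct", "correct", "glottal"]

kappa = cohen_kappa_score(transcriber_1, transcriber_2)
print(f"Interrater agreement (Cohen's kappa): {kappa:.2f}")
```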
  • Item
    Complex patterns in silent speech preparation: Preparing for fast response might be different to preparing for fast speech in a reaction time experiment
    (International Phonetic Association, 2015-08-15) Schaeffler, Sonja; Scobbie, James M.; Schaeffler, Felix
    This paper presents articulatory data on silent preparation in a standard Verbal Reaction Time experiment. We have reported in a previous study [6] that Reaction Time is reliably detectable in Ultrasound Tongue Imaging and lip video data, between 120 and 180 ms ahead of standard acoustics-based measurements. The aim of the current study was to investigate in more detail how silent speech preparation is timed in relation to faster and slower Reaction Times, and faster and slower articulation rates of the verbal response. The results suggest that standard acoustics-based measurements of Reaction Time may not only routinely underestimate the speed of the response but also obscure considerable variation in actual response behaviour. In particular, tokens with fast Reaction Times seem to exhibit substantial variation in when the response is actually initiated, as detected in the articulatory data.
  • Item
    Measuring reaction times: Vocalisation vs. articulation
    (2014-05-09) Schaeffler, Sonja; Scobbie, James M.; Schaeffler, Felix; Fuchs, S.; Grice, M.; Hermes, A.; Lancia, L.; Mücke, D.
    There is a sizeable delay between any formulation of an intention to speak and the audible vocalisation that results. Silent articulatory movements in preparation for audible speech comprise a proportion of this phase of speech production. The extensive literature on Reaction Time (RT) is based on the delay between a stimulus and the acoustic onset of the speech it elicits, ignoring the preceding silent elements of speech production in utterance-initial position. We used a standard Snodgrass and Vanderwart picture-naming task to elicit speech in a standard Reaction Time protocol, but recorded the behaviour of two typical speakers with audio plus Ultrasound Tongue Imaging (201 frames per second) and de-interlaced NTSC video of the mouth and lips (60 fps). On average, acoustic RT occurred between 120 and 180 ms later than a clearly observable articulatory movement, with no consistent advantage for lip-based or tongue-based measures.
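    One way to reproduce the kind of comparison described in the last two abstracts would be to take an acoustic onset from a simple energy threshold and an articulatory onset from frame-to-frame change in the ultrasound image, then compare the two times. This is only an illustrative sketch under those assumptions, not the authors' detection method, and the threshold values are arbitrary.

```python
import numpy as np

def acoustic_onset(audio, sr, frame_ms=10, threshold_ratio=0.1):
    """Time (s) of the first frame whose RMS energy exceeds a fraction of
    the utterance's peak RMS energy: a crude stand-in for an
    acoustics-based Reaction Time measurement."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = np.asarray(audio[: n_frames * frame_len]).reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    onset_frame = np.argmax(rms > threshold_ratio * rms.max())
    return onset_frame * frame_ms / 1000

def articulatory_onset(ultrasound_frames, fps, threshold_ratio=0.1):
    """Time (s) of the first ultrasound frame whose mean pixel-wise change
    from the previous frame exceeds a fraction of the maximum change."""
    frames = np.asarray(ultrasound_frames, dtype=float)  # (n_frames, h, w)
    change = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    onset_frame = np.argmax(change > threshold_ratio * change.max()) + 1
    return onset_frame / fps

# The articulatory lead for one token would then be:
#   acoustic_onset(audio, sr) - articulatory_onset(frames, fps)
```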