
    A “Roziah” by Any Other Name: A Simple Bayesian Method for Determining Ethnicity From Names

    Date
    2014-08-01
    Author
    Komahan, Kridaraan
    Reidpath, Daniel
    Citation
    Komahan, K. and Reidpath, D.D. (2014) ‘A “Roziah” by any other name: a simple Bayesian method for determining ethnicity from names’, American Journal of Epidemiology, 180(3), pp. 325–329. Available at: https://doi.org/10.1093/aje/kwu129.
    Abstract
    Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were provided by a health and demographic surveillance site operating in Malaysia from 2011–2013. The data comprised a training data set (n = 10,104) and a test data set (n = 9,992). Names were spliced into contiguous 3-letter substrings, and these were used as the basis for the Bayesian analysis. Performance was evaluated on both data sets using Cohen's κ and measures of sensitivity and specificity. There was little difference between the classification performance in the training and test data (κ = 0.93 and 0.94, respectively). For the test data, the sensitivity values for the Malay, Indian, and Chinese names were 0.997, 0.855, and 0.932, respectively, and the specificity values were 0.907, 0.998, and 0.997, respectively. A naïve Bayesian strategy for the classification of ethnicity is promising. It performs at least as well as more sophisticated approaches. The possible application to smaller data sets is particularly appealing. Further research examining other substring lengths and other ethnic groups is warranted.
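    The approach described in the abstract (3-letter substrings of names fed to a naïve Bayesian classifier, evaluated with Cohen's κ) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `trigrams`, `TrigramNaiveBayes`, and `cohen_kappa` are hypothetical, as are the choices to use overlapping substrings, to drop spaces before splitting, and to apply Laplace smoothing; the abstract specifies none of these. The six training names are invented toy examples, not study data.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Overlapping 3-letter substrings of a name.

    Assumption: non-letters (including spaces) are dropped and case is
    ignored, so substrings may span word boundaries.
    """
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return [letters[i:i + 3] for i in range(len(letters) - 2)]


class TrigramNaiveBayes:
    """Multinomial naive Bayes over name trigrams (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumed)
        self.class_counts = Counter()            # names seen per class
        self.gram_counts = defaultdict(Counter)  # trigram counts per class
        self.vocab = set()                       # all trigrams seen in training

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for gram in trigrams(name):
                self.gram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def log_scores(self, name):
        """Unnormalised log posterior for each class."""
        total_names = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_names in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_names / total_names)  # log prior
            for gram in trigrams(name):
                score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


def cohen_kappa(truth, predicted):
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    t_counts, p_counts = Counter(truth), Counter(predicted)
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented toy data for illustration only.
train_names = ["Roziah binti Ahmad", "Nurul Aisyah",
               "Kumar Subramaniam", "Priya Raman",
               "Tan Wei Ling", "Lim Chee Keong"]
train_labels = ["Malay", "Malay", "Indian", "Indian", "Chinese", "Chinese"]

model = TrigramNaiveBayes().fit(train_names, train_labels)
print(trigrams("Roziah"))       # ['roz', 'ozi', 'zia', 'iah']
print(model.predict("Roziah"))  # Malay
```

    In practice, κ would be computed by passing the known ethnicities and the model's predictions for a held-out test set to `cohen_kappa`, mirroring the training/test split reported in the abstract.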
    URI
    https://eresearch.qmu.ac.uk/handle/20.500.12289/12956
    Official URL
    https://doi.org/10.1093/aje/kwu129
    Collections
    • The Institute for Global Health and Development

    Queen Margaret University: Research Repositories
    Accessibility Statement | Repository Policies | Contact Us | Send Feedback | HTML Sitemap

     
