Unsupervised clustering analysis of population and subpopulation structure using dense SNP markers

Document Type: Research Paper


1 Ph.D. Student, Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran

2 Professor, Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran

3 Assistant Professor, Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran


High-throughput sequencing of single nucleotide polymorphisms (SNPs) has revolutionized the fine-scale analysis of population structure in different species. Various methods have been proposed and used to study population structure from whole-genome marker data, each with its own advantages and disadvantages. In this study, Super Paramagnetic Clustering (SPC), a data-mining method, was used to investigate population and sub-population structure in simulated populations. The aim was to recover population structure without using any information about the ancestral populations. After data editing, 29,209 autosomal markers from 159 animals were analyzed. The results showed that animals were assigned correctly to their respective populations and sub-populations on the basis of their similarities and dissimilarities. The main advantages of this method are its computational efficiency and the fact that it requires no prior assumptions. It could therefore be used to analyze data from thousands of animals, without any pedigree or ancestry information, to reveal their population structure.
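
Clustering on "similarities and dissimilarities" starts from a pairwise distance between animals. The abstract does not say which measure was used, so the following is only a minimal sketch under stated assumptions: genotypes are coded as 0/1/2 copies of one allele, the distance is an allele-sharing (identity-by-state) distance, and each animal is linked to its k nearest neighbours. The function names and the toy data are hypothetical and are not taken from the study, and the sketch stops before the SPC step itself.

```python
from itertools import combinations


def ibs_distance(g1, g2):
    """Allele-sharing distance between two animals.

    Genotypes are assumed to be coded 0/1/2 (copies of one allele),
    with None for a missing call. Returns a value in [0, 1].
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip markers missing in either animal
        shared += 2 - abs(a - b)  # alleles shared at this marker (0, 1 or 2)
        compared += 2
    if compared == 0:
        raise ValueError("no markers genotyped in both animals")
    return 1.0 - shared / compared


def distance_matrix(genotypes):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = ibs_distance(genotypes[i], genotypes[j])
        dist[i][j] = dist[j][i] = d
    return dist


def nearest_neighbours(dist, k):
    """For each animal, the indices of its k closest other animals."""
    neighbours = []
    for i, row in enumerate(dist):
        others = sorted((j for j in range(len(row)) if j != i),
                        key=lambda j: row[j])
        neighbours.append(others[:k])
    return neighbours


# Toy data (hypothetical): 4 animals x 6 markers, two obvious groups.
toy = [
    [0, 0, 2, 2, 1, 0],
    [0, 0, 2, 2, 0, 0],
    [2, 2, 0, 0, 1, 2],
    [2, 2, 0, 1, 1, 2],
]
dist = distance_matrix(toy)
knn = nearest_neighbours(dist, k=1)
print(knn)
```

On the toy data, animals 0 and 1 choose each other as nearest neighbours, as do animals 2 and 3. A neighbour graph of this kind is the sort of input a similarity-based clustering method works on without any pedigree or ancestry labels.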

