Tuesday, March 3, 2015

My experience with Chromopainter and Finestructure



I used MaCH (http://csg.sph.umich.edu/abecasis/mach/tour/input_files.html and http://csg.sph.umich.edu/abecasis/mach/tour/imputation.html ) for imputation and phasing. The imputation statistics indicated good reliability, which was expected because only a few SNPs were missing. The proportion of missing alleles was 0.03% (three per 10000), at approximately random positions. Both stages were run chromosome by chromosome, yet the processing time was still quite long, typically hours per run (PC: quad-core Intel 4770K at 3500 MHz, 16 GB memory).

Data

The data were selected from the following studies, with additional populations from the 1000 Genomes Project (Finnish, CEU, British and Tuscan samples):
http://mbe.oxfordjournals.org/content/29/1/359
http://www.nature.com/nature/journal/v466/n7303/pdf/nature09103.pdf
http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature12736.pdf
http://digitalcommons.wayne.edu/humbiol_preprints/41/

These studies limited the total number of SNPs per sample to around 300000.

Problems that emerged when running Chromopainter/Finestructure

Two problems occur with Chromopainter and Finestructure. I also tested the software on smaller "synthetic" data to see in detail how it works and where the problems arise. The first problem concerns isolated "daughter" populations and is caused by the Markov chain process. The process can’t by itself be aware of the population history, and it leads to a result in which more homogeneous and possibly oversampled isolated populations appear as sources more strongly than the actual donating populations, although this is not possible when the isolates are younger "daughters". It is hard to avoid this error, because Chromopainter/Finestructure doesn't give enough factual information to steer the process and to take into account the known history and the origin of “chunks” or haplotypes, i.e. causality. You can in fact supervise Chromopainter, which gives you a chance to correct this problem, or to make it even worse. In practice the only way to avoid these errors is to cut known isolated populations out of the input, but this is entirely up to you and the result can still be subjective.
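The oversampling effect can be illustrated with a very small toy calculation. This is only a sketch under a strong assumption: every potential donor is copied in proportion to a fixed similarity weight, which is not the real Chromopainter model. The function name and the numbers are made up for illustration.

```python
def isolate_share(n_isolate, n_source, w_isolate=1.0, w_source=1.0):
    """Expected fraction of a recipient's chunks copied from the isolate.

    Toy assumption (not the Chromopainter HMM): each donor individual is
    copied in proportion to a fixed similarity weight, so a population's
    total share grows with the number of its samples in the donor panel.
    """
    isolate_mass = n_isolate * w_isolate
    source_mass = n_source * w_source
    return isolate_mass / (isolate_mass + source_mass)


if __name__ == "__main__":
    # Same per-individual similarity, only the isolate's sample size changes.
    for n in (5, 20, 80):
        print(n, round(isolate_share(n, n_source=20), 3))
```

Even with equal per-individual similarity, the isolate's share of copied chunks rises purely because it contributes more samples, which is why sample sizes need attention before reading the isolate as a "source".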

Here is a picture showing how the clustering works:
   






The number of additional chunks multiplies as the A-B population grows.

This would be a perfect way to form clusters if only we knew the directions of gene flow between individuals, and it would be a reasonable way if we knew them between countries or putative populations. But if we have to guess, the result will be just a guess, or even worse.
Another question related to the donor populations of Chromopainter is that we simply don’t know of unidirectional gene flows in Europe. It is a great idea to mark Scandinavians, Spaniards and Germans as donors when analyzing American populations, but this doesn’t work in Europe, because here we have barely any unidirectional gene flows. Any attempt to mark donors in this analysis would simply be a guess. I didn’t want to guess, so I ran Chromopainter in a neutral mode in which every individual is compared to all other individuals. Maybe I could use high-quality ancient samples as donors, but if I saw a Finestructure analysis targeting only Europeans with an asymmetric admixture matrix, I would be interested in how the donor haplotypes were determined.
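The difference between the neutral mode and a donor/recipient setup can be sketched as two ways of building donor lists. This is only an illustration of the design, not Chromopainter's own input format. The sample IDs, population labels and function names are hypothetical.

```python
from typing import Dict, List, Set


def all_vs_all(samples: List[str]) -> Dict[str, List[str]]:
    """Neutral design: each individual is painted using everyone else."""
    return {r: [d for d in samples if d != r] for r in samples}


def donor_recipient(labels: Dict[str, str], donor_pops: Set[str]) -> Dict[str, List[str]]:
    """Supervised design: only individuals outside the donor populations
    are painted, and only donor-population individuals may be copied from."""
    donors = [s for s, pop in labels.items() if pop in donor_pops]
    return {s: list(donors) for s, pop in labels.items() if pop not in donor_pops}


if __name__ == "__main__":
    # Hypothetical sample IDs and population labels.
    labels = {"ES1": "Spain", "ES2": "Spain", "NA1": "Amerindian", "MX1": "Mestizo"}
    print(all_vs_all(list(labels)))
    print(donor_recipient(labels, {"Spain", "Amerindian"}))
```

In the supervised design the choice of `donor_pops` fixes the direction of copying in advance, so the resulting matrix is asymmetric by construction. The neutral design makes no such choice.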

The second problem is also caused by the Markov chain process and concerns mixed populations. It is basically very similar to the first problem, but requires different data preparation. When the process finds mixed individuals, it treats the ancestral populations as mixed as well. This happens because the process is relative and has no understanding of the causality between individuals. So the Markov chain process clusters both ancestral populations together with the mixed one, regardless of the history, geography and genetic distances shown in the input data. How strong this clustering is depends on the sample sizes of all three populations, the two ancestral ones and the mixed one. Again, thorough preparation of the data is needed to avoid wrong results. In the worst case some populations are both mixed and isolated, combining both errors in the result.

The following picture demonstrates the problem with mixed populations. In Chromopainter/Finestructure it is even worse, because they use chunk/haplotype counts instead of the haplotypes themselves.





You may say that this is okay, but it is not. If we put 20 Spaniards, 20 Amerindians and 20 Mestizos into the Markov chain process and get one cluster containing all three populations, that would in my opinion be acceptable in itself and I would not object to it, but it would still be very misleading, because Spaniards and Native Americans are not relatives and live on two continents thousands of miles apart. Chromopainter solves this problem: you can mark the Spanish and Native American phased data as donor data and the Mestizos as recipients. But this strategy doesn’t work within Europe, where there are no donor-recipient pairs like those between Europe and the Americas.
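The way a mixed population pulls its two ancestral populations into one cluster can be shown with a simple linking rule. This is only a sketch: Finestructure uses a Bayesian MCMC model, not a threshold, and the sharing values below are invented for illustration. The function name is hypothetical.

```python
def linked_groups(sharing, threshold):
    """Group populations that are connected, directly or through others,
    by pairwise chunk sharing at or above the threshold."""
    pops = sorted({p for pair in sharing for p in pair})
    neighbours = {p: set() for p in pops}
    for (a, b), value in sharing.items():
        if value >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in pops:
        if start in seen:
            continue
        # Depth-first walk collects everything reachable from `start`.
        stack, group = [start], set()
        while stack:
            p = stack.pop()
            if p in group:
                continue
            group.add(p)
            stack.extend(neighbours[p] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    # Invented sharing values: the ancestral pair shares almost nothing.
    with_mixed = {
        ("Spaniards", "Mestizos"): 0.45,
        ("Amerindians", "Mestizos"): 0.40,
        ("Spaniards", "Amerindians"): 0.02,
    }
    without_mixed = {("Spaniards", "Amerindians"): 0.02}
    print(linked_groups(with_mixed, 0.3))
    print(linked_groups(without_mixed, 0.3))
```

With the mixed population present, all three end up in one group although the two ancestral populations share almost nothing directly. Without it they stay apart.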

Because I am especially interested in the Finnish results, here are some details. The Finnish samples include 18 samples estimated to be from the old settlements. Please check the Finnish settlement definitions explained in Finnish studies (Jakkula: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2668058/    Palo: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2986642/ ). The estimate is based on a comparison with previously analyzed, better-known data sources and on PCA analyses; the 1000 Genomes data itself is not documented well enough to make this decision. The good news is that now that Karelian and Vepsian samples have become available, it is possible to add them to the Finestructure tests, along with the Finnish late settlements, without the drawback of showing too much genetic drift, i.e. Finnish clusters being captured by strong intra-population chunk sharing. My next tests will include all those samples.

Western Finns show the highest similarity to the south, with Estonians, West Russians and Poles, but two individuals show more North Russian similarity, and some West Finns show weaker similarity with Scandinavians. It is possible that the pre-selection made using PCA was not perfect and two Karelians or Savonians were included, or that those two belong partly to some other ethnicity and fell into the same PCA category. A low amount of Scandinavian chunks doesn’t necessarily mean low Scandinavian ancestry, only low chunk sharing, which could mean that the western ancestry is older than the southern and eastern ancestry. Mosaic patterns also show that the Scandinavian affinity based on chunk sharing (in the linked results) reflects East European ancestry in Scandinavia more than the reverse, although Swedish admixture in Finland is also detectable. This reasoning about old Scandinavian ancestry in Finland may surprise some people, but it is perhaps supported by the small amount of young Scandinavian-specific Y-DNA in Western Finland (see for example Lappalainen et al. 2006). Swedish admixture estimated from the proportion of R1b is 8/21=38% among the Swedish-speaking population in Ostrobothnia, while Swedish speakers make up around 5% of the Finnish population.

Maybe it is also worth mentioning the results from 23andme and FtDna, which sometimes give high amounts of western admixture for West Finns. There is a fundamental difference between what they do and this analysis. While 23andme and FtDna create a Finnish “average Joe” and compare individuals to him, Finestructure in this analysis compares everyone to everyone, and there are no inferred archetypes, stereotypes or hypothetical ancestors for any ethnic group. Another question is how to create genetic averages, whatever that might mean.

Abbreviations: CEU=Utah-Europeans, FR=France, NRG=Norway, HU=Hungary, RO=Romania, BL=Bulgaria, SE=Sweden, CR=Croatia, EE=Estonia, BR=Belarus, PL=Poland, UKR=Ukraine, WRU=West Russia, WFI=Western Finland, MA=Mari, CH=Chuvash, MR=Mordva, NRU=Russia-Vologda, TSI=Tuscan, SP=Spain, ITALY=Abruzzo

Inferred groups, on average
1: UK-Kent CEU FR NRG HU RO BL SE CR
2: EE BR PL UKR WRU
3: WFI, mixed
4: MA CH TATAR 
5: MR NRU
6: TSI SP SICILY ITALY

Imputation and phasing were done with MaCH using 50 rounds and 200 states for each chromosome, creating around 1500 shared chunks between individuals. This really reaches deep into haplotype history.
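A per-chromosome phasing run with these settings could be scripted roughly as follows. This is a hedged sketch, not the exact command line used here: the flag names are the ones recalled from the MaCH tour pages linked above and should be checked against them, and the file naming (chr1.dat, chr1.ped, ...) and the restriction to autosomes 1-22 are assumptions. The script only builds the command lines and does not run anything.

```python
def mach_phase_command(chrom, rounds=50, states=200):
    """Build a MaCH phasing command for one chromosome.

    File names are hypothetical; flags should be verified against the
    MaCH documentation before use.
    """
    prefix = f"chr{chrom}"  # assumed naming scheme, not from the post
    return [
        "mach1",
        "-d", f"{prefix}.dat",
        "-p", f"{prefix}.ped",
        "--rounds", str(rounds),
        "--states", str(states),
        "--phase",
        "--prefix", f"{prefix}.phased",
    ]


# One command per autosome (assumption: chromosomes 1 to 22).
commands = [mach_phase_command(c) for c in range(1, 23)]

if __name__ == "__main__":
    for cmd in commands:
        print(" ".join(cmd))
```

Running one chromosome at a time keeps memory use bounded and allows the runs to be spread across the four cores.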


Run parameters in Finestructure:  50000 burnin, 500000 MCMC rounds, tree climbing 100000.


You can download results here (compressed .zip).







