Tuesday, 18 October 2016

European coarse population structure using 14.4 million markers

I had already run a Finestructure analysis before my previous Admixture-based work, but didn't publish it because it gave so little additional information. I used the same data as with Admixture. The workflow:

1 extracting chromosomes 1 and 6
2 phasing haplotypes (running HAPI-UR ten times and making a consensus)
3 running Chromopainter in linked mode, without defining donor haplotypes
4 running Finestructure with a burn-in of 200000 and a run length of 2000000 iterations
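
Step 2 combines several phasing runs into one consensus. The sketch below shows one simple way such a consensus could be formed for a single individual, and it is not necessarily what HAPI-UR's own tooling does. It assumes biallelic 0/1 alleles and takes a majority vote on whether the phase stays the same or switches between consecutive heterozygous sites. The function name `consensus_phase` and the input layout are hypothetical.

```python
from collections import Counter

def consensus_phase(runs):
    """Majority-vote consensus over several phasing runs of ONE individual.

    runs: list of (hap_a, hap_b) pairs, each haplotype a list of 0/1 alleles.
    All runs must describe the same genotypes; only the phase may differ.
    With an even number of runs a tie is resolved as "no switch"
    (an arbitrary choice made for this sketch).
    """
    first_a, first_b = runs[0]
    # Heterozygous sites are the only ones where phase matters.
    het = [i for i, (a, b) in enumerate(zip(first_a, first_b)) if a != b]
    cons_a, cons_b = list(first_a), list(first_b)
    for prev, cur in zip(het, het[1:]):
        # "Same allele on one haplotype at both sites" does not depend on
        # which haplotype a run happened to label as a or b.
        votes = Counter(a[prev] == a[cur] for a, _ in runs)
        same = votes[True] >= votes[False]
        cons_a[cur] = cons_a[prev] if same else 1 - cons_a[prev]
        cons_b[cur] = 1 - cons_a[cur]
    return cons_a, cons_b

# Toy example: run2 is run1 with the haplotype labels swapped,
# run3 disagrees about the phase of the last heterozygous site.
run1 = ([0, 1, 0, 1], [1, 1, 1, 0])
run2 = ([1, 1, 1, 0], [0, 1, 0, 1])
run3 = ([0, 1, 0, 0], [1, 1, 1, 1])
print(consensus_phase([run1, run2, run3]))
```

Voting on switches rather than on raw alleles keeps the result independent of which haplotype each run lists first.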

The result shows a very obvious grouping: the members of each ethnic group cluster together. Some caveats about the Chromopainter-Finestructure combination are in order:

- First of all, Finestructure doesn't really use specific haplotypes, but the number of shared haplotypes and the haplotype lengths between individuals. So in a three-sample case (individuals a, b and c) there is no guarantee that all three share the same haplotypes, even when the Finestructure result shows haplotype sharing among all three. This can lead to pseudo-ancestry between individuals and also to a wrong tree grouping.

- Using donor haplotypes can be methodologically unreliable. We can assign donor haplotypes for people living in the Americas, but this is not equally reliable for people living in the Old World. It is a chicken-and-egg question: if we really know the donors before testing, we know the result before we have it. I have seen methods that create donor types (selections of prepared haplotypes), but I can't see how this could really work reliably. Note also that speaking about donor populations (I have seen it) makes the question even more problematic: to know the donor populations we must already know the population grouping before the analysis, and we bind the donor populations to something that exists today but did not necessarily exist thousands of years ago.

While checking the data I noticed a questionable sample group: Swedes. They look more eastern than can reasonably be expected.

In general, when looking at any result the first question is "does the result look obvious?". If we have two different results based on any kind of supervised method (like using donor haplotypes/populations), it is only common sense to consider the more obvious result the better one. Here we have a philosophical question: what "obvious" means for you and for me. That makes sense, but the idea of a result being "too obvious" leads us to tin foil hat theories. Perfection is suspicious. We don't want it, although it is possible in practice. Another, much more sensible question regarding donor haplotypes would be whether we could assign donor haplotypes of Bronze Age Europeans based on ancient samples. That would make sense.

Download the Finestructure picture here.

Friday, 14 October 2016

Worldwide admixture analysis based on 14.4 million SNPs

The EGDP data, available from the Estonian Biocentre, made it possible to reach 15-30 times higher marker density than previously available data. The new data lacks West European samples, but this was not a big problem because western data is publicly available from the 1000 Genomes Project. So I merged these two data sets. As a quality check I computed heterozygosity rates for all European samples in both data sets and found the two sets to be very close to each other, although the read depth of the 1000 Genomes data is lower. In fact, the Finnish samples in both sets showed exactly the same level of heterozygosity.
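
The heterozygosity check can be illustrated with a minimal sketch. It assumes genotypes are coded per SNP as the count of one allele (0, 1 or 2) with `None` for a missing call. This coding and the function name are assumptions made for the example, not the format of the actual data.

```python
def heterozygosity_rate(genotypes):
    """Fraction of called genotypes that are heterozygous.

    genotypes: iterable of 0, 1, 2 (allele counts) or None (no call).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for g in called if g == 1) / len(called)

# Toy sample: four called SNPs, two of them heterozygous.
print(heterozygosity_rate([0, 1, 2, 1, None]))  # 0.5
```

Comparing the per-sample rates of the same population across the two data sets is then a quick sanity check on the merge.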

After the successful merge I had 14.4 million SNPs over all 22 chromosomes, which was far too much to process in a few days on my desktop (i7, 3.5 GHz, 32 GB memory). Instead of thinning the whole data set to 1-2 million SNPs I decided to use chromosomes 1 and 6 and leave the marker density untouched. So I had two chromosomes with a bit over 2 million SNPs, still showing 15-30 times more genotype information per chromosome than other available genotype sets. Thinning across all chromosomes to make the dataset small enough for my computer would likely have introduced more algorithm-dependent bias, which I wanted to avoid.

The process

1 merging the EGDP and 1000g data sets
2 quality checks, including homozygosity/heterozygosity ratios per population
3 extracting chromosomes 1 and 6
4 thinning the data with Plink: plink --file data --indep 50 5 2, resulting in 1.1 million SNPs
5 running admixture analyses with K values from 3 to 13 in unsupervised mode and without reference populations (i.e. without projection).

Each K value was run in unsupervised mode without reference data, because projection reference data is not available for this SNP set. You can see analyses using a projection reference, for example, in works analysing ancient and modern genomes together. Analyses based on any kind of projection are cool, because we have no other way to relate ancient samples proportionally to modern ones. I am not saying that unsupervised analysis without references is error-free, but its errors are systematic and not user-dependent.

All analyses (K values from 3 to 13) were run here as individual runs without user supervision, and for that reason the colors on the charts are not consistent (at least it sounded like painful work to get the colors consistent). Each analysis is optimized separately by the Admixture algorithm. All this makes it more difficult to perceive differences between K values, but as soon as you get the idea I am sure you can also see the big picture and understand the details.

Hopefully this test is helpful for you. In my opinion it gives interesting hints about Finnish relations with other populations, but the analysis itself is worldwide.

- Mordvins seem to differ from other Volga-Finnic populations and belong to Balto-Slavic ancestry; they probably shifted language from a Baltic to a Volga-Finnic language.

- Estonians are just what can be expected: some Estonians have Baltic ancestry, others Baltic-Finnic ancestry. We should, however, be cautious about using linguistic terms when we speak about ancestry.

- North Russian Finno-Ugric populations seem to be Baltic-Finnic people with Siberian admixture. The Siberian admixture is present in a smaller amount among Finns and Estonians (note that the amount of minor admixtures depends on the data/populations used, and that Admixture is based on a selective method that processes admixture proportions relatively).

- To some extent Swedes also show Baltic-Finnic ancestry, but the Swedish sample size is rather small for a sure conclusion. However, if this is true, we can assume that present-day Baltic-Finnic people have largely Fennoscandian ancestry.

- Ingrian samples show up as pure unadmixed Baltic-Finnic people, which surprises me given their long-lasting minority status in Russia. The sample collectors have done good work. Those samples are valuable indeed.

- Thinking about all this and trying to reconstruct the history of the Baltic-Finnic people, it looks like they lived north of the Latvia-Moscow axis (with Balts living to the south before the East Slavic expansion). Mixing between Baltic and Finnic people happened, and people also shifted language.

- Open questions are how strong the Baltic-Finnic influence is/was in Scandinavia and conversely how strong the Germanic influence is/was in Finland and Estonia. For certain political reasons this is a difficult topic today.

CV errors indicate quality in general: the lower the value, the better the quality. The absolute values depend on the data used and can't be compared with other Admixture tests.

K3: 0.19708
K4: 0.19503
K5: 0.19480
K6: 0.19451
K7: 0.19432
K8: 0.19503
K9: 0.19508
K10: 0.19576
K11: 0.19708
K12: 0.19797
K13: 0.20221
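
As a small illustration of reading the list above, the following sketch picks the K with the lowest CV error. The values are copied from the list. Treating the minimum as the "best" K is the usual rule of thumb and not a claim about which K is historically meaningful.

```python
# Cross-validation errors copied from the list above (K -> CV error).
cv_errors = {
    3: 0.19708, 4: 0.19503, 5: 0.19480, 6: 0.19451, 7: 0.19432,
    8: 0.19503, 9: 0.19508, 10: 0.19576, 11: 0.19708, 12: 0.19797,
    13: 0.20221,
}

# The K whose run had the lowest cross-validation error.
best_k = min(cv_errors, key=cv_errors.get)
print(best_k, cv_errors[best_k])  # 7 0.19432
```

The differences between K4 and K9 are small, so neighbouring K values are worth looking at as well.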

Population abbreviations, download here

Analysis, download here.

You definitely need a picture viewer that can handle big GIF files.

Thursday, 8 September 2016

Worldwide diversity based on 3.2 million X chromosome markers

Genetic diversity tests are usually done using around 300-500 thousand markers. It is, however, possible to use many more markers (SNPs) using already available data from the 1000 Genomes Project. The downside is that we have only a few populations, and the upside is that we see the big picture accurately, without possible bad sampling.

I made this test using Chromopainter and Finestructure. Unfortunately Chromopainter is a rather inefficient tool and unable to use the available computing resources (threads, memory). Without this drawback I would have made this using 25 million markers instead of only 3.2 million.

The process:

1 Vcftools, parameters --remove-indels --chr 23
2 Haplotyping using HAPI-UR and all samples, run three times and combined into a consensus
3 Manual selection of random samples, 10-20 from each population
4 Chromopainter, without specifying donor haplotypes
5 Finestructure with run parameters 30000/300000
6 MDS using Past.

Additionally I ran Vcftools using the parameters --keep-only-indels and --chr 23. The result was filtered and biallelic deletions (CN=0) were counted. Male results were treated as biallelic, so CN=0 should give the number of effective deletions in both cases, for females and males.


MDS done by Past:

All previous pictures are downloadable with better resolution, here.

Deletions per 3.2 million markers (averages per sample):

The British subgrouping was gathered from the internet and can be unreliable. The Finnish subgroups are: those with the highest Siberian admixture, the group that is "most Finnish" / local, those closest to ancient Corded Ware samples, and the rest of all 99 samples. The last Finnish group includes all outliers.

Saturday, 20 August 2016

Mitochondrial diversity in Europe


I have seen several mitochondrial statistics using the main haplogroups, H, U, I etc. Haplogroups, being tens of thousands of years old, are a very coarse way to analyze geographic areas where people have moved and mixed during recent centuries and at most during some thousands of years. Because of this I decided to use mutation information based on the RSRS reference. The RSRS was introduced a few years ago and lists mitochondrial mutations relative to the so-called "mito-Eve", the reconstructed first woman in the human ancestral tree. Even the RSRS leaves a lot to be desired, because many mutations are common to several mitochondrial branches.


The data was collected from FamilyTreeDNA's publicly available projects and includes two hypervariable regions, HVR1 and HVR2. HVR2 is not available for all samples; in those cases it is marked as "no call". Otherwise all mutations are included.

Countries and sample sizes

The Finnish sample size is probably the biggest ever seen in academic or any other studies. Even taking into account some bias in regional personal activity, this has to be the best sample data ever seen from Finland.

Some geographical areas, like the White Sea Karelians, are underrepresented, but I expected some interest and included them.


Fst distances

Because I was looking for country-level rather than individual statistics, I first ran Fst statistics between countries. Given the nature of mitochondrial data and mutations, one should not expect a strict summary of ancestry; rather, the results mirror European migrations over thousands of years.
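
For readers unfamiliar with Fst on haploid data, here is a minimal sketch of one common estimator based on pairwise differences (Fst = 1 - Hw/Hb, where Hw is the mean number of differences within populations and Hb the mean between them). The post does not say which estimator or software was used, so this choice, the function names and the toy sequences are all assumptions. Missing calls are not handled.

```python
from itertools import combinations, product

def differences(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_within(pop):
    """Mean pairwise difference among sequences of one population."""
    pairs = list(combinations(pop, 2))
    return sum(differences(a, b) for a, b in pairs) / len(pairs)

def mean_between(pop1, pop2):
    """Mean pairwise difference between sequences of two populations."""
    pairs = list(product(pop1, pop2))
    return sum(differences(a, b) for a, b in pairs) / len(pairs)

def pairwise_fst(pop1, pop2):
    """Fst = 1 - Hw / Hb from pairwise differences (haploid data)."""
    hw = (mean_within(pop1) + mean_within(pop2)) / 2
    hb = mean_between(pop1, pop2)
    return 0.0 if hb == 0 else 1 - hw / hb

# Toy populations: each string is one sample, each character one site
# (1 = mutation present, 0 = absent). These are made-up values.
pop_a = ["0000", "0001"]
pop_b = ["1110", "1111"]
print(round(pairwise_fst(pop_a, pop_b), 3))  # 0.714
```

Each population needs at least two samples for the within-population term to be defined.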

Fst distances

An image with better resolution can be downloaded here

 MDS-plot based on Fst-distances:

The two leftmost dots are Poland and Germany.

And a classical Euclidean tree plot:

edit 20.0.2016 13:40

Here I reconstructed the mitochondrial genome instead of using the hypervariable mutations directly. The reconstructed SNP data was analyzed with standard analysis tools. I am very sure that analyses done using only mutation indicators will not be successful.

22.9.2016 11:30

Added Fst and genome data.  Notice that the genome data is reconstructed using minimum labor input and original kit-id numbers are substituted by surrogates!

Fst-data download here
Genome data download here

Monday, 27 June 2016

Global ROH-results

ROH (runs of homozygosity) estimate individual autozygosity within a subpopulation. After reading the study "Genetic characterization of northeastern Italian population isolates in the context of broader European genetic diversity" I stopped to ponder its statistics, because Figure 5 shows decimal values for country ROH averages. Since ROH counts are integers, averages below 1 are not possible unless some individuals have zero segments, and zero values in practice mean some lost data. It seemed necessary to count shorter ROH segments to get more precise results. Although my statistics look reasonable in general, I can't take responsibility for possible bad sampling of some ethnic groups.

Data and processes

Primary data: 600k SNPs, with a very low no-call rate
LD-pruning:  ./plink --noweb --bfile LARGEDATA --indep-pairwise 200 25 0.4
Pruned data: 160k SNPs
ROH process: ./plink --noweb --bfile LARGEDATA --extract plink.prune.in --homozyg --homozyg-window-kb 5000 --homozyg-window-snp 25 --homozyg-snp 50 --homozyg-window-het 1 --homozyg-window-missing 1  --homozyg-density 50 --homozyg-window-threshold 0.05 --homozyg-gap 100 --homozyg-kb 1000

My goal was to find smaller ROH segments, and this was done by changing three parameters: --homozyg-density, --homozyg-snp and --homozyg-kb. The changes are not big, but they are enough. There is an optimum combination of SNP count and base pair length; compared to the study I picked a smaller base pair length (1500->1000) and a larger SNP count (25->50). This did the trick.

ROH count on the X-axis, total ROH length in basepairs on the Y-axis.   
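
A plot like this can be built from PLINK's per-individual ROH summary. The sketch below assumes the `.hom.indiv` output of PLINK 1.x with the columns FID, IID, PHE, NSEG, KB and KBAVG, and it assumes that the FID column carries the population label, which may not be true of the actual data. The example rows are made up. Note that PLINK reports lengths in kilobases.

```python
from collections import defaultdict

def summarize_roh(lines):
    """Per-population mean ROH count and mean total ROH length (kb).

    lines: lines of a PLINK .hom.indiv file, header first.
    Returns {population: (mean NSEG, mean KB)}.
    """
    col = {name: i for i, name in enumerate(lines[0].split())}
    totals = defaultdict(lambda: [0, 0, 0.0])  # individuals, segments, kb
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        t = totals[fields[col["FID"]]]
        t[0] += 1
        t[1] += int(fields[col["NSEG"]])
        t[2] += float(fields[col["KB"]])
    return {pop: (nseg / n, kb / n) for pop, (n, nseg, kb) in totals.items()}

# Made-up rows for illustration only.
example_lines = [
    "FID IID PHE NSEG KB KBAVG",
    "POPA ind1 -9 1 1200 1200",
    "POPA ind2 -9 0 0 0",
    "POPB ind3 -9 3 4500 1500",
]
print(summarize_roh(example_lines))
```

In the toy data POPA gets a mean count of 0.5, which shows how an average below 1 arises only when some individuals have no detected segment at all.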

Large picture:

Small picture covering the left bottom corner:

Pictures with better resolution:


Zoom in

Tuesday, 31 May 2016

I1-L22 revised

I revised my earlier test of I1-L22, trying to figure out the Scandinavian and Finnish clades using the TMRCA method based on 67 STR markers. The main reason for doing this is the newly available CTS2208 samples. It is really fascinating to see how CTS2208 divides the L22 subclades into two branches, implying that the Finnish "Bothnian" clade is older than its estimated age of 1850 years. Here are the recent TMRCA estimates:

L22 - 4100 BP
Z74 - 4100 BP  (It is not credible that both clades are 4100 years old; L22 is likely older than estimated)

P109 - 3400 BP
CTS2208 - 2800 BP 
L205 - 1400 BP
L287 - 1850 BP
L258 - 1700 BP

The logic is that a downstream clade can be older than its calculated TMRCA, at most as old as the TMRCA of its nearest known upstream clade.
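
The bounding logic can be written down as a small sketch. The TMRCA values are copied from the list above. The parent links in `upstream` are purely illustrative assumptions made for this example (the post does not spell out the tree), and the function name is hypothetical.

```python
# TMRCA estimates in years BP, copied from the list above.
tmrca = {
    "L22": 4100, "Z74": 4100, "P109": 3400, "CTS2208": 2800,
    "L205": 1400, "L287": 1850, "L258": 1700,
}

# ASSUMED parent links, for illustration only; replace with the real tree.
upstream = {"L287": "CTS2208", "L258": "L287"}

def age_bounds(clade):
    """Return (minimum age, maximum age) of a clade in years BP.

    The minimum is the clade's own TMRCA; the maximum is the TMRCA of its
    nearest known upstream clade, or None when no upstream clade is known.
    """
    parent = upstream.get(clade)
    return tmrca[clade], (tmrca[parent] if parent else None)

print(age_bounds("L287"))  # (1850, 2800) under the assumed parent link
```

With a different tree only the `upstream` mapping needs to change.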

Here is also a tree figure. 67 STR markers are not enough to create a perfect tree, but it still gives a certain idea of the close relation between the "Bothnian" clade and CTS2208.


Thursday, 5 May 2016

Comparison of Ice Age and modern Europeans, Ice Age remix

Thanks to the new study "The genetic history of Ice Age Europe" and the corresponding data we now have a lot more really old human samples. As a quick experiment I made some comparisons between those ancient samples, following the grouping presented in the study, and modern Europeans. Using dstat and selected third populations from America, Asia and Europe, I try to infer the amount of common ancestry between selected Europeans and the Karitiana, Han and French with respect to selected ancient samples.

The dstat formula was D(European population, Karitiana/Han/French; ancient sample group, Chimp)
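
To show what such a statistic measures, here is a minimal sketch of the frequency-based form of D(W, X; Y, Z): the sum over SNPs of (w - x)(y - z), divided by the sum of (w + x - 2wx)(y + z - 2yz), where the letters are allele frequencies in the four populations. It is an illustration, not a replacement for dstat: there is no block jackknife and therefore no standard error or Z-score, and sign conventions can differ between tools. The function name and the toy frequencies are made up.

```python
def d_statistic(freqs):
    """D(W, X; Y, Z) from per-SNP allele frequencies.

    freqs: iterable of (w, x, y, z) tuples, each value the frequency of the
    same reference allele in populations W, X, Y and Z at one SNP.
    With this sign convention a positive value means W shares more alleles
    with Y than X does.
    """
    numerator = denominator = 0.0
    for w, x, y, z in freqs:
        numerator += (w - x) * (y - z)
        denominator += (w + x - 2 * w * x) * (y + z - 2 * y * z)
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator

# Toy data: two SNPs where W matches Y, one SNP where X matches Y.
toy = [(1, 0, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0)]
print(d_statistic(toy))
```

In the setup above W would be the European population, X one of Karitiana/Han/French, Y the ancient sample group and Z the chimp outgroup.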

06.05.16 20:05  There was a small error in El Mirón numbers, showing somewhat too low similarity for Europeans.  Now corrected.

15.05.16 11:00  Added dstat graphics (as above) for Northeast Europe:

16.05.16 18:45

Added GoyetQ116-1 to the first series of graphics.