Thursday, 8 September 2016

Worldwide diversity based on 3.2 million X chromosome markers

Genetic diversity tests are usually done using around 300-500 thousand markers.  It is, however, possible to use many more markers (SNPs) by taking already available data from the 1000 Genomes Project.  The downside is that we have only a few populations; the upside is that we see the big picture accurately, without the risk of bad sampling.

I made this test using Chromopainter and Finestructure.  Unfortunately Chromopainter is a rather inefficient tool and cannot make use of the available computing resources (threads, memory).  Without this drawback I would have run the test using 25 million markers instead of only 3.2 million.

The process:

1 Vcftools, parameters --remove-indels --chr 23
2 Haplotyping using HAPI-UR with all samples, run three times and combined into a consensus
3 Manual selection of random samples, 10-20 from each population
4 Chromopainter, without specifying donor haplotypes
5 Finestructure with run parameters 30000/300000
6 MDS using Past.

Additionally I ran Vcftools with the parameters --keep-only-indels and --chr 23.  The result was filtered and homozygous (biallelic) deletions (CN=0) were counted.  Male genotypes were treated as biallelic, so CN=0 should give the number of effective deletions in both cases, for females and males.
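The counting rule above can be sketched in a few lines of Python.  This is only an illustration, not the filter actually used: the genotype strings (`1|1`, `0|1`, or a single `1` for a haploid male call) and the function names are assumptions, with `1` standing for the deletion allele.

```python
def is_effective_deletion(genotype):
    """True if every allele copy carries the deletion (copy number 0).

    Assumed encoding: '0' = reference, '1' = deletion, '.' = missing,
    alleles separated by '|' or '/'. A single allele is a haploid male call.
    """
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) == 1:
        # Haploid male call on X: treat it as if both copies were identical.
        alleles = alleles * 2
    return all(allele == "1" for allele in alleles)


def count_deletions(genotypes):
    """Number of sites in one sample where no intact copy remains."""
    return sum(1 for genotype in genotypes if is_effective_deletion(genotype))
```

With this rule a heterozygous female call (`0|1`) is not counted, while a male `1` and a female `1|1` each count once.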


MDS done by Past:

All the pictures above can be downloaded in better resolution here.

Deletions per 3.2 million markers (averages per sample):

The British subgrouping is gathered from the internet and can be unreliable.  The Finnish subgroups are: those with the highest Siberian admixture, the "most Finnish" (local) group, those closest to the ancient Corded Ware samples, and the rest of all 99 samples.  The last Finnish group includes all outliers.

Saturday, 20 August 2016

Mitochondrial diversity in Europe


I have seen several mitochondrial statistics based on the main haplogroups, H, U, I etc.  Haplogroups, being tens of thousands of years old, are a very coarse tool for analyzing geographic areas where people have moved and mixed during recent centuries, or at most over a few thousand years.  Because of this I decided to use mutation information based on the RSRS reference.  The RSRS was introduced a few years ago and lists mitochondrial mutations relative to the so-called "mito-Eve", the reconstructed first woman in the human ancestral tree.  Even the RSRS leaves a lot to be desired, because many mutations are shared by several mitochondrial branches.


The data is collected from FamilyTreeDNA's publicly available projects and includes the two hypervariable regions, HVR1 and HVR2.  HVR2 is not available for all samples; in those cases it is marked as "no call".  Otherwise all mutations are included.

Countries and sample sizes

The Finnish sample size is probably the biggest ever seen in academic or any other studies.  Even taking into account some regional bias in how active people are in testing, this has to be the best sample data from Finland seen so far.

Some geographical areas are underrepresented, like White Sea Karelians, but I was expecting some interest and included them.


Fst distances

Since I was looking for country-level rather than individual statistics, I first ran Fst statistics between countries.  Keeping in mind the nature of mitochondrial data and mutations, one should not expect the results to add up to any strict ancestry proportions; instead, they mirror European migrations over thousands of years.
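For readers unfamiliar with Fst, here is a minimal sketch of one common textbook form, (Ht - Hs) / Ht, computed from per-site mutation frequencies in two populations.  It is not necessarily the estimator behind the results in this post; the function name and the input layout are assumptions, and the two populations are weighted equally regardless of sample size.

```python
def fst_pair(freqs_a, freqs_b):
    """Fst between two populations from per-site mutation frequencies (0..1)."""
    h_total = 0.0   # expected diversity in the pooled population, summed over sites
    h_within = 0.0  # mean expected diversity within the populations, summed over sites
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_mean = (p_a + p_b) / 2.0
        h_total += 2.0 * p_mean * (1.0 - p_mean)
        h_within += (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
    if h_total == 0.0:
        return 0.0  # no variation at any site: define the distance as zero
    return (h_total - h_within) / h_total
```

Calling the function for every pair of countries fills a symmetric distance matrix of the kind shown in this section.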

Fst distances

 Image with better resolution can be downloaded here

 MDS-plot based on Fst-distances:

The two leftmost dots are Poland and Germany.
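An MDS plot like this can be produced from any symmetric distance matrix.  The sketch below shows classical (Torgerson) MDS in plain Python.  It is an illustration only: the plots in this blog were made with Past, whose settings are not listed here, and all function names are assumptions.

```python
import math


def double_center(dist):
    """Turn a distance matrix into the centered Gram matrix B = -0.5 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand)
             for j in range(n)] for i in range(n)]


def top_eigenpair(mat, iterations=500):
    """Dominant eigenvalue and eigenvector by power iteration."""
    n = len(mat)
    vec = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iterations):
        nxt = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            return 0.0, [0.0] * n  # nothing left to extract
        vec = [x / norm for x in nxt]
    # The Rayleigh quotient keeps the sign of the eigenvalue.
    value = sum(vec[i] * mat[i][j] * vec[j] for i in range(n) for j in range(n))
    return value, vec


def classical_mds(dist, dims=2):
    """Return one coordinate list per sample, with `dims` coordinates each."""
    n = len(dist)
    gram = double_center(dist)
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, vec = top_eigenpair(gram)
        scale = math.sqrt(value) if value > 0 else 0.0
        for i in range(n):
            coords[i].append(scale * vec[i])
        # Deflate: remove the component that was just extracted.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords
```

Fst distances are not guaranteed to be Euclidean, so some eigenvalues can be negative; this sketch simply gives such axes zero length instead of trying to correct for them.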

And a classical Euclidean tree plot:

edit 20.0.2016 13:40

Here I reconstructed the mitochondrial genome instead of using the hypervariable mutations directly.  The reconstructed SNP data was analyzed with standard analysis tools.  I am very sure that analyses done using only mutation indicators would not be successful.

22.9.2016 11:30

Added Fst and genome data.  Note that the genome data is reconstructed with minimum labor input and the original kit ID numbers are replaced by surrogates!

Fst-data download here
Genome data download here

Monday, 27 June 2016

Global ROH-results

ROH (runs of homozygosity) estimates individual autozygosity within a subpopulation.  After reading the study "Genetic characterization of northeastern Italian population isolates in the context of broader European genetic diversity" I stopped to ponder its statistics, because Figure 5 shows decimal values for country ROH averages.  With integer counts, averages below 1 are not possible unless some individuals have a value of zero, and zero values in practice mean lost data.  It seemed necessary to count shorter ROH segments to get more precise results.  Although my statistics look reasonable in general, I can't take responsibility for possible bad sampling of some ethnic groups.

Data and processes

Primary data: 600 ksnp, with a very low no-call rate
LD-pruning:  ./plink --noweb --bfile LARGEDATA --indep-pairwise 200 25 0.4
Pruned data: 160 ksnp
ROH process: ./plink --noweb --bfile LARGEDATA --extract --homozyg --homozyg-window-kb 5000 --homozyg-window-snp 25 --homozyg-snp 50 --homozyg-window-het 1 --homozyg-window-missing 1  --homozyg-density 50 --homozyg-window-threshold 0.05 --homozyg-gap 100 --homozyg-kb 1000

My goal was to find smaller ROH segments, and this was done by changing three parameters: --homozyg-density, --homozyg-snp and --homozyg-kb.  The changes are not big, but enough to do it.  There is an optimum combination of SNP and base-pair lengths; compared to the study I picked a smaller base-pair length (1500->1000) and a longer SNP length (25->50).  This did the trick.
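To make the effect of the SNP-count and length thresholds concrete, here is a deliberately simplified ROH scan.  It is not PLINK's sliding-window algorithm: it only looks for unbroken runs of homozygous calls and ignores the window, heterozygote, missing-call and density allowances.  The function names and the input layout are assumptions; the default thresholds mirror the --homozyg-snp 50, --homozyg-kb 1000 and --homozyg-gap 100 values in the command above.

```python
def find_roh(positions, homozygous, min_snps=50, min_kb=1000, max_gap_kb=100):
    """Return (start_bp, end_bp, n_snps) for each qualifying homozygous run.

    positions  -- sorted base-pair positions of the SNPs on one chromosome
    homozygous -- True/False per SNP for one individual
    """
    segments = []
    run = []  # positions in the run currently being extended

    def flush():
        # Keep the run only if it passes both the SNP-count and length limits.
        if len(run) >= min_snps and run[-1] - run[0] >= min_kb * 1000:
            segments.append((run[0], run[-1], len(run)))
        run.clear()

    for pos, is_hom in zip(positions, homozygous):
        if not is_hom:
            flush()           # a heterozygous call ends the run
            continue
        if run and pos - run[-1] > max_gap_kb * 1000:
            flush()           # too large a gap between neighboring SNPs
        run.append(pos)
    flush()
    return segments


def roh_summary(segments):
    """ROH count and total ROH length in base pairs for one individual."""
    return len(segments), sum(end - start for start, end, _ in segments)
```

`roh_summary` returns the two quantities per individual that the plots use: the ROH count and the total ROH length in base pairs.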

ROH count on the X-axis, total ROH length in basepairs on the Y-axis.   

Large picture:

Small picture covering the left bottom corner:

Pictures with better resolution:


Zoom in

Tuesday, 31 May 2016

I1-L22 revised

I revised my earlier test of I1-L22, which tries to figure out the Scandinavian and Finnish clades using the TMRCA method based on 67 STR markers.  The main reason for doing this is the newly available CTS2208 samples.  It is really fascinating to see how CTS2208 divides the L22 subclades into two branches, implying that the Finnish "Bothnian" clade is older than its estimated age of 1850 years.  Here are the recent TMRCA estimates:

L22 - 4100 BP
Z74 - 4100 BP  (It is not credible that both clades are 4100 years old; L22 is likely older than predicted)

P109 - 3400 BP
CTS2208 - 2800 BP 
L205 - 1400 BP
L287 - 1850 BP
L258 - 1700 BP

The logic is that a downstream clade can be older than its calculated TMRCA, at most as old as the TMRCA of its nearest known upstream clade.
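This rule can be written as a simple age interval per clade.  In the sketch below the TMRCA values are the estimates listed above, while the parent links are only the two relationships implied by the text (Z74 under L22, and the "Bothnian" L287 under CTS2208).  They are assumptions for illustration, not a full tree, and the names are hypothetical.

```python
# TMRCA estimates in years before present, as listed in this post.
TMRCA_BP = {
    "L22": 4100, "Z74": 4100, "P109": 3400, "CTS2208": 2800,
    "L205": 1400, "L287": 1850, "L258": 1700,
}

# Assumed upstream links, limited to those implied by the text.
UPSTREAM = {"Z74": "L22", "L287": "CTS2208"}


def age_range(clade):
    """Return (youngest, oldest) possible age of a clade in years BP.

    The youngest age is the clade's own TMRCA estimate; the oldest is the
    TMRCA of its nearest known upstream clade, or None if none is known.
    """
    youngest = TMRCA_BP[clade]
    parent = UPSTREAM.get(clade)
    oldest = TMRCA_BP[parent] if parent is not None else None
    return youngest, oldest
```

For example, `age_range("L287")` gives the interval from 1850 to 2800 years, which is the sense in which the "Bothnian" clade can be older than its own estimate.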

Here is also a tree figure.  67 STR markers are not enough to create a perfect tree, but it still gives some idea of the close relation between the "Bothnian" clade and CTS2208.


Thursday, 5 May 2016

Comparison of Ice Age and modern Europeans, Ice Age remix

Thanks to the new study "The genetic history of Ice Age Europe" and the corresponding data, we now have many more really old human samples.  As a quick experiment I made some comparisons between those ancient samples, following the grouping presented in the study, and modern Europeans.  Using dstat and selected third populations from America, Asia and Europe, I try to infer how much ancestry selected Europeans share with the selected ancient samples, relative to the Karitiana, Han and French.

The dstat formula was D(European population, Karitiana/Han/French; ancient sample group, Chimp)
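For reference, a D statistic of the form D(W, X; Y, Z) can be computed from per-SNP allele frequencies as sketched below.  This follows the commonly used frequency-based definition and is meant as an illustration only; it omits the block jackknife that dstat tools use for standard errors and Z-scores.  The function name and the input layout are assumptions.

```python
def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP frequencies of the same allele (values 0..1).

    Here W would be the European population, X the third population,
    Y the ancient sample group and Z the outgroup (Chimp).
    """
    numerator = 0.0
    denominator = 0.0
    for w_i, x_i, y_i, z_i in zip(w, x, y, z):
        numerator += (w_i - x_i) * (y_i - z_i)
        denominator += (w_i + x_i - 2.0 * w_i * x_i) * (y_i + z_i - 2.0 * y_i * z_i)
    if denominator == 0.0:
        return 0.0  # no informative SNPs
    return numerator / denominator
```

A positive value means that W shares more alleles with Y than X does; a negative value means the opposite.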

06.05.16 20:05  There was a small error in El Mirón numbers, showing somewhat too low similarity for Europeans.  Now corrected.

15.05.16 11.00  Added dstat graphics (as above) for Northeast Europe:

16.05.16 18:45

Added GoyetQ116-1 to the first series of graphics.

Wednesday, 20 April 2016

Neolithic and Bronze Age Irish samples, compared to modern populations

It was worth waiting a few weeks to see these Irish samples, especially because I already expected that insular Irish samples could reveal new things about the ancient people who lived in Northwest Europe.  You can see the original study here.  There are four samples, three from Rathlin Island in Northern Ireland and one from Ballynahatty, which is also located in Northern Ireland.  Two of the Rathlin samples are of low quality and don't work well with my database based on the Estonian Biocentre's data.  Maybe I'll add them to the Lazaridis database later.  The third Rathlin sample and the Ballynahatty sample are, however, excellent.

Picking from the study

- Ballynahatty, a Neolithic woman (3343–3020 cal BC)
- Rathlin, in context of  an early megalithic passage-like grave, an Early Bronze Age man from Rathlin Island (2026–1885 cal BC)

I was really excited when I started to analyze the Rathlin samples, because they might reveal new knowledge about the ancient people who lived in North Europe before the eastern Bronze Age steppe migrations.  I decided to compare them to present-day populations instead of ancient samples, to make the results tangible.  At first I tested which modern populations are closest to the Rathlin and Ballynahatty samples and found that the Rathlin genome still points to Irish people.  The Ballynahatty sample was closest to present-day Sardinians, which is typical for the Neolithic era.

After processing all this from fastq files 1) I made two qpDstat comparisons to find out which modern populations best resemble those ancient Irish samples, compared against the best-fitting modern populations.  In the comparison with the Rathlin man I also included my project samples, mainly Finnish and Swedish individuals.

Rathlin and modern populations

Ballynahatty and modern populations

Inspired by the western origin of the Saami people, I made one more comparison using another database to get reliable results with the Saami sample introduced by Haak et al. 2015.  It looks like, despite their remarkable North Asian admixture, they have more Rathlin-like ancestry than Eastern Finns, who have less North Siberian admixture.

Saami between ancient samples, using the arrangement seen already in my previous post

FI15 is from Northern Karelia, FI12 is western Finnish, FI10 is from Finnish Lapland.

Finally, after downloading and testing DNA.LAND's admixture program, I made some admixture analyses.  You can find and download the software from their site, here.  This small program is based on allele frequencies and the method is probably Markov chain Monte Carlo.  It is not based on the original alleles and genetic drift, thus there is always some residual admixture.  There are also other weaknesses, which could be a topic of their own.  For now I only say that in my opinion it has problems in modeling closely related populations with different minor admixtures.

Two results using references downloaded from DNA.LAND


CSAMERICA 0.00697236
KALASH 0.0165295
NEEUROPE 0.223957
NEUROPE 0.731415
SWEUROPE 0.0185991


ITALY 0.0116662
SARDINIA 0.565326
SWEUROPE 0.423008

Two results using my Estonian-BC database as reference


Bulgaria 0.0524726
Colombian 0.00995061
Ireland 0.213786
Kalash 0.00510618
Latvia 0.0102334
Lithuania 0.221711
Orcadian 0.140857
RU_Smolensk 0.0244413
Scotland 0.207837
Udmurtia 0.0262457
Welsh 0.0840646


Basque 0.0446958
Ireland 0.0547766
NorthItaly 0.0953578
Sardinian 0.569061
Scotland 0.0409351
Sicily 0.0670495
Spain 0.12727
Tuscany 0.000854723

1)  I have changed my fastq process.  Although BWA is an excellent program for mapping reads, its automatic trimming is not powerful enough, and I have now also rerun all older samples using a separate trimming program.


Wednesday, 23 March 2016

Two-fold ancestry of Finnish people

It has been a common idea, especially among linguists, that the Baltic Finnic languages came from the Volga region, from the so-called Volga river bend near Samara. It is a carefully cherished tradition in Finnish science, but any movement of people from there to Finland is still without genetic evidence. Now I am going to prove something that contradicts this idea of the Volga origin of Finns, or at least gives a new view of it.  I'll show plausible genetic evidence of a Volga-Saami connection using the Saami sample (Haak et al. 2015 and Lazaridis et al. 2014), which shows very high similarity with the ancient Eneolithic Samara sample (Mathieson et al.).

The other half of my Finnish story is about ancient Central European influence in Finland.  Around 20% of the Finnish samples from the 1000 Genomes Project show Corded Ware similarity comparable to that of Estonians and Lithuanians, and Western Finnish project samples show Corded Ware similarity equal to that of Swedes, some even more, despite the fact that they are much more "eastern" than present-day Swedes.

This Finnish duality doesn't tell where and when the mixing occurred, and so far I have not seen any genetic evidence about the Baltic Finnic origin. It looks very possible that genetically the Baltic Finns were born somewhere in the region from Estonia to the White Sea, no matter what the origin of the Baltic Finnic language may have been.

Saami results

The Saami are genetically closer to the Eneolithic Samara people than Mordovians (Mordva) and Chuvashes are.  It is worth noticing that Mordovians, who live near the Volga, are not closer to those ancient people who lived in Samara.  The Saami live thousands of kilometers and thousands of years away from the suggested Volga home range.  The Siberian admixture of Chuvashes roughly equals that of the Saami.  This statistic has, however, very limited use, because the Saami are not Central Europeans, but still the statistic shows them being comparable to Central Europeans when compared to ancient East European samples.  What could be the best outcome?

Probably some readers think that the Eneolithic Samara - Saami - Finnish genetic connection is based only on the amount of Siberian admixture.  That is not true and is easily proved false.  Chuvashes and Mansi people (and Komis, not included), with high Siberian admixture, are far away from the Eneolithic Samara, definitely not comparable to the Saami.  Similarly, those Finns who are closest to the Eneolithic Samara have less Siberian admixture than Russians living in the Archangel and Pinega regions of Russia (see the project results).

Only people in northernmost Europe beat Saami_WGA in comparison with the Eneolithic Samara.  I have to admit, this is a somewhat complicated question. So let's look at another perspective on the supposed Finnish ancestry, the Corded Ware samples.  It is less complicated.

Corded Ware results

Only Lithuanians beat the Finnish CW group (20% of the Finnish samples from the 1000g project after removing outliers) when the test is done using over half a million SNPs.  Even Lithuanians would be beaten by a more homogeneous Finnish sample group.  There is every variation from very CW-looking to only moderately CW-looking. They don't look like they come from the Volga bend.  Not really.

Next, I combine the Saami and CW results with the project members.  To do this I have to use my smaller database, based on the Estonian Biocentre's data.  The accuracy is somewhat poorer.  The numbers show the difference between Eneolithic Samara and German Corded Ware affinities in Finland and in neighboring countries, as well as results for project members.  Using the Eneolithic Samara and CW samples excludes the Siberian-like admixture, and the results show only affinities common to those two groups, even if the tested populations or project members have extra Siberian admixture.  It is important to understand that this table alone doesn't tell how much of those two ancient affinities individuals and populations have (it tells only a ratio).  To see the big picture you also have to take into account the two previous tables, which show how significant the relation between the ancient and modern populations is.

Project results