Louis Kessler's Behold Blog

the Development of my Genealogy Program named Behold

Genealogy is Virtually Everywhere

Wednesday, November 13, 2019, 21:34:23

Last night, I attended a talk by a prominent genealogy speaker. This is a speaker who keynotes conferences and attracts thousands to her talks.

Diahan Southard gave her talk “Your Slice of the DNA Pie”, and I watched it on my computer at home. It was a presentation of the Virtual Genealogical Association, an organization formed in April 2018 to provide a forum for genealogists to connect online. Webinars such as Diahan’s are just one of their offerings. Membership is just $20 a year.

image

The VGA just completed their first highly successful Virtual Conference. There was one track with 6 well-known genealogical speakers on the Friday, 6 more on Saturday and 5 on Sunday, so the Conference lasted three full days. In addition, three prerecorded talks were included. All talks are available to attendees for re-watching (or watching if they missed the live talk) for the next six months.

image

As at any physical conference, the speakers' handouts were made available to attendees for each of their talks individually, or as a syllabus. Attendees were told about a surprise bonus at the end of the conference, which was a special offer from the National Institute for Genealogical Studies of a free 7- or 10-week online course worth $89 authored by two of the VGA Conference speakers: Gena Philibert-Ortega and Lisa Alzo.

The VGA Conference was hosted and directed by their delightful president Katherine Willson. She said they were very happy that over 250 people paid the $59 (members) or $79 (non-members) fee to attend the 3-day online conference, something that was really the first of its kind. 

The VGA plans to continue these annual conferences. The next is already scheduled for Nov 13-15, 2020, so be sure to block those days off now in your calendar.

  
What is Virtual Genealogy?

Most of us are used to attending live genealogy conferences. You know, the ones where you have to physically be there, be semi-awake, have showered, look decent, and be pleasant even if you're not feeling pleasant.

They may be offered by your local genealogical society in the city where you live, a regional conference in a city you can drive to, or a national or international conference that you usually have to fly to. Live conferences require many people to organize and run. They are expensive to put on, requiring the booking of a venue, sponsors to cover the costs, vendors to fill an exhibition hall, and rooms and logistics to enable the speakers to speak, etc., etc.

By comparison, I would say:

Virtual Genealogy is simply any genealogical activity you can do on your computer or smartphone in your pajamas.

This includes everything from:

  • attending online lectures
  • taking online courses or workshops
  • watching conference livestreams
  • communicating with other genealogists via social media
  • researching your family online
  • using genealogy software to record your family tree information

It's only in the past couple of years that many of these virtual genealogical activities have become available. I can truly say now that you can be a bedroom genealogist and learn and do almost everything you need to without slipping out of bed (as long as your laptop or smartphone is within arm's reach).

This wasn’t possible just a few years ago, but it is possible now.

  
Legacy Family Tree Webinars

The big kid on the block as far as online genealogy lectures go is Legacy Family Tree Webinars. They have been around since 2010 and started off simply as a way for the genealogy software program Legacy Family Tree to make instructional videos available for their software. They offer a webinar membership for a $50 annual fee giving you full access to their webinar library. Many new webinars on any and every topic are made available free for the live presentation.

In August 2017, Legacy software and the Family Tree Webinars were purchased by MyHeritage. MyHeritage has allowed them to continue running, with the added advantage of making MyHeritage instructional videos and talks available to everyone for free.

The long-time host of most of the videos is Geoff Rasmussen. He just celebrated Family Tree Webinars' 1,000th webinar in September with this wonderful, amusing behind-the-scenes video.

image

  
Family History Fanatics

Another not-to-be-missed webinar producer is the family of Andy Lee, Devon Lee and their son Caleb Lee, who call themselves Family History Fanatics. They have their own YouTube channel with 16.7K subscribers where they post their numerous instructional videos and live streams.

image

They also produce online webinars and workshops which are well worth the modest $30 ($25 early bird) price they charge for them. Their next is a DNA Workshop: Integrated Tools, which will be three instructional talks of 2 hours each on Dec 5, 12 and 19.

They also from time to time host one-day eConferences. I paid the $20 early bird fee to attend their A Summer of DNA eConference last August, which included 4 talks by Daniel Horowitz, Donna Rutherford, Emily Aulicino and Leah Larkin.

Their next eConference will be on January 25, called the "A Winter of DNA" Virtual Conference.  It will feature four DNA experts. I know three of them will be Jonny Perl (DNA Painter), Paul Woodbury (LegacyTree Genealogists) and myself.

I have given many lectures at genealogy conferences around the world, but this will be my first ever live webinar. It will be about double match triangulation, the ideas behind it, and what it can be used for. I'm really looking forward to this.

Andy doesn’t have the details up yet for the January conference but likely will soon and will then accept registrations. I’ll write a dedicated blog post when registration becomes available.

  
APG

The Association of Professional Genealogists (APG) has webinars for anyone interested.

The APG also has a Virtual Chapter (membership $20 annually) with monthly online presentations by a prominent speaker.

  
Live Streaming of Genealogical Conferences

Another wonderful trend happening more and more is the live streaming now being offered by genealogy conferences. Many of the livestreams have been recorded and made available following the conference, so you don’t always have to wake up at 3 a.m. to catch the talk you want.

MyHeritage Live in Amsterdam took place in September. They have made many of the 2019 lectures available. Lectures from their MyHeritage Live 2018 from Oslo are also still available for free.

The first ever RootsTech in London took place last month. A few of their live stream videos have been made available. You can find quite a few Salt Lake City RootsTech sessions from 2019 and from 2018 still available for free.

The National Genealogical Society offered 10 live stream sessions for their conference last May for $149.

With regard to big conferences, nothing compares to being there in person. But when you can't make it, you can still feel the thrill of the conference while it happens with live streams, and enjoy the recordings of some of the sessions later.



What’s Next?

The next webinar I plan to watch is another Virtual Genealogical Association webinar: "Artificial Intelligence & the Coming Revolution of Family History" presented by Ben Baker this Saturday morning, Nov 16.

Never stop learning.

What’s next on your agenda?

Using DMT, Part 2: My GEDmatch data

Friday, October 25, 2019, 4:47:05

In my last blog post, I analyzed my segment matches at 23andMe with Double Match Triangulator. This time let's do the same, but with my GEDmatch segment matches.



Getting Segment Match Data from GEDmatch

At GEDmatch you need their Tier 1 services (currently $10 a month) in order to download your segment matches. But you can download anyone's segment matches, not just your own. But+But they don't include close matches of 2100 cM or more, meaning they won't include anyone's parents, children, siblings, and maybe even some of their aunts, uncles, nephews or nieces. The But+But could in some cases be problematic, because people who should triangulate will not if you don't include their close matches. But+But+But even that should still be okay in DMT, since DMT's premise is to use the matches and triangulations that do exist, with the idea that there will generally be enough of those to be able to determine something.

GEDmatch not too long ago merged their original GEDmatch system and their Genesis system into one. Now all the testers on GEDmatch, who used to be in two separate pools, can be compared with each other. While doing so, GEDmatch also changed their Segment Search report, which now provides a download link. The download is in a different format than their on-screen report is (and than it used to be). With all these changes, if you have old GEDmatch or Genesis match file reports that DMT helped you download, you should recreate each of them in the new format. Check DMT's new download instructions for GEDmatch.

When running GEDmatch's Segment Search, the default is to give you your closest 1000 kits. I would suggest increasing that to give you your closest 10,000 kits. I've determined that it is definitely worth the extra time needed to download the 10,000 kits. For example, in my tests, comparing 1000 kits versus 1000 kits will average about 50 people in common, whereas comparing 1000 kits versus 10,000 kits will average 350 people in common. So 300 of the 1000 people in common with Person A are not in the first 1000 kits in common with Person B, but are in the next 9,000 kits. When using the 1000s, DMT can cluster 18% of the people. When using 10,000s, that number goes up to 56%.
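
In rough code terms, that "people in common" comparison is just a set intersection over two downloads. Here's a minimal sketch, assuming hypothetical file names and a MatchedKit column identifying each matched kit (the actual column names in a GEDmatch download may differ):

    import csv

    def kits_in(path, kit_col="MatchedKit"):
        """Return the set of kit identifiers appearing in a segment match download."""
        with open(path, newline="", encoding="utf-8") as f:
            return {row[kit_col] for row in csv.DictReader(f)}

    # Hypothetical file names; use your own GEDmatch Segment Search downloads.
    kits_a = kits_in("personA_segments.csv")
    kits_b = kits_in("personB_segments.csv")
    common = kits_a & kits_b

    print(f"A matches {len(kits_a)} kits, B matches {len(kits_b)} kits, "
          f"{len(common)} in common.")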

It can take anywhere from a few minutes to an hour to run the 10,000-kit Segment Search at GEDmatch, so if you have 10 kits you want to get segment matches for, it could take the better part of a day to complete.

I downloaded the segment match data for myself and 8 people I match to using the 10,000 kit option. They include 3 relatives I know, and 5 other people who I am interested in. 

My closest match is my uncle who shares 1958 cM with me. GEDmatch says:

image

I really don't know why GEDmatch does this. To find the one-to-one matches of a few close relatives and include them in the segment match list would use only a tiny fraction of the resources required overall by their segment match report. The penalty of leaving out those close matches is huge, as matches with all siblings and some uncles/aunts, nephews and nieces are left out. Parents and children are also left out, but they should match everywhere.

I have added into DMT the ability to download the one-to-one matches from GEDmatch for your matches that the Segment Search does not include. In my case, my uncle is 1958 cM, so I didn't need to do this for him. You can also use the one-to-one matches to include more distant relatives who didn't make your top 1000 or 10,000 people.



Entering my Known Relatives’ MRCAs in my People File

These are the 3 people at GEDmatch that I know my relationship to, along with our Most Recent Common Ancestor (MRCA):

  1. My uncle on my father’s side, MRCA = FR, 1958 cM
  2. A daughter of my first cousin on my mother’s side, MRCA = MR, 459 cM
  3. A third cousin on my father’s mother’s father’s side, MRCA = FMFR, 54 cM

(The F, M, and R in the MRCA refer to Father, Mother and paiR of paRents. So an MRCA of MR is your mother's parents. FMFR is your father's mother's father's parents. MRCAs are always from the tester's point of view.)
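
To make the notation concrete, here is a tiny sketch of my own (just an illustration, not part of DMT) that expands an MRCA code into words:

    def describe_mrca(code):
        """Expand an MRCA code such as 'FMFR' into a readable ancestral path,
        always from the tester's point of view."""
        words = {"F": "father", "M": "mother", "R": "paiR of paRents"}
        return "'s ".join(words[letter] for letter in code.upper())

    print(describe_mrca("MR"))    # mother's paiR of paRents
    print(describe_mrca("FMFR"))  # father's mother's father's paiR of paRents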

    Since DNA shared with my uncle can come from either of my paternal grandparents, and since DNA from my 1C1R can come from either of my maternal grandparents, their double matches and triangulations do not help in the determination of the grandparent. However, they should do a good job separating my paternal relatives from my maternal relatives.

    The third cousin will allow me to map people to my FM (father’s mother) grandparent and to my FMF (father’s mother’s father) great-grandparent.

    Let's see how this goes.



    Painting

    At GEDmatch, I only have the one third cousin to work with to determine grandparents. My cousin only shares 54 cM or about 1.5% of my DNA. The process DMT uses of automating the triangulations and extending them to grandparents, then clustering the matches and repeating, results in DMT being able to map 44% of my paternal side to grandparents or deeper. This is using just this one match along with my uncle and my 1C1R.

    Loading the mappings into DNA Painter gives:

    image

    If you compare the above diagram to the analysis from my 23andMe data I did in my previous post, you’ll see a few disagreements where this diagram is showing FM regions and the 23andMe results show FF regions. These estimates are not perfect. They are the best possible prediction based on the data given. If I was a betting man, I would tend to trust the 23andMe results more than the above GEDmatch results in these conflicting regions because 23andMe had many more MRCAs to work with, including some on both the FF and FM sides, than just the one FMF match I have here at GEDmatch.

    The bottom line is that the more MRCAs you know, the better a job DMT can do in determining triangulation groups and the ancestral segments they belong to. Nonetheless, using only one MRCA that specifies a grandparent, this isn't bad.



    Clustering

    DMT clusters the 9998 people in my GEDmatch segment match file as follows:

    image

    DMT clustered 39% of the people I match to as paternal and 17% as maternal.

    34% were clustered into one big group, my Father's Mother's Father (FMF), which is higher than the expected percentage (12.5%). My third cousin, whose MRCA is FMFR, might be biasing this a little bit. Every additional MRCA you know will add information that DMT can work with to help improve its segment mappings and clustering. I just don't happen to know any more at GEDmatch, so these are the best estimates I can make from just the GEDmatch data. As more people test, and I figure out how some more of my existing matches are connected, I should be able to add new MRCAs to my GEDmatch runs.



    Grandparents on My Mother’s Side

    I have one relative on my mother's side included here at GEDmatch. Being the daughter of my first cousin, she is connected through both my maternal grandparents. Her MRCA is MR, and DMT can't use her on her own for anything more than classifying people as maternal.

    I don't like the idea of being unable to map grandparents on my mother's side, but I don't have any MRCAs to do that.

    But I have an idea and there is something I can try. If I take a look at the DMT People File and scroll down to where the M cluster starts, I see my 1C1R named jaaaa Saaaaaa Aaaaaaa with her MRCA of MR. Listed after her are all the other people that DMT assigned the cluster M to, sorted by highest total cM.

    The next highest M person matches me with 98.2 cM. That is a small enough number that the person will be at least at the 2nd cousin level, but more likely the 3rd or 4th cousin level. Since they are further than a 1st cousin, they should not be sharing both my maternal grandparents with me, but should only share one of them. I don't know which grandparent that would be, but I'm going to pick one and assign them an MRCA to it. This will allow DMT to distinguish between the two grandparents. I'll just have to remember that the grandparent I picked might be the wrong one.

    So I assign the person Maaaa Kaaaaa Aaaaaaaaa, who shares 98.2 cM with me, the MRCA of MF as shown below. Note I do not add the R at the end and make it MFR. The R indicates the paiR of paRents, meaning you know the exact MRCA and share both of those ancestors. DMT accepts partial MRCAs like this.

    image

    DMT starts by assigning the MRCA you give it. Unlike known MRCAs, whose cluster is always based on the MRCA, a partial MRCA may end up with a different cluster that DMT determines.

    I run DMT adding the one partial MRCA to this one person. Without even downloading the segment match file for Maaaa Kaaaaa Aaaaaaaaa, DMT assigned 976 of the 1674 people that were in the M cluster to the MF cluster, leaving the other 698 people in the M cluster. In so doing, it painted 11.5% of the maternal chromosome by calculating triangulations with Maaaa Kaaaaa Aaaaaaaaa and then extending the grandparents based on other AC matches on the maternal side that overlap with the triangulations.

    That's good. Now what do you think I'll try? Let's try doing it again. Likely many of those 698 people still assigned cluster M are on the MM side. So let's take the person in the M cluster with the highest cM and assign them an MRCA of MM. But we have to be a bit careful. That person with 86.3 cM named Jaaaaaaa Eaaaaa Kaaaa Aaaaaaaa has a status of "In Common With B". That means they don't triangulate with any of the B people. If they don't triangulate, DMT cannot assign the ancestral path to others who also triangulate in the same triangulation group at that spot. So I'll move down to the next person, named Eaaaa Jaaaaaa Maaaaaa, whose status is "Has Triangulations" and who shares 78.5 cM, and assign the MRCA of MM to her, like this:

    image

    After I did that, I ended up with 296 people assigned the MM cluster, 859 assigned MF (so 117 were changed), and 519 still left at M, meaning there were still some people that didn't have matches in any of the triangulation groups that DMT had formed, or maybe they had the same number of MF matches as they do MM matches, so there was no consensus. Or maybe the people I labelled MM and MF are really MMF and MFM, and these people left over are MMM or MFF, thus not triangulating with the first two. It could be any of those reasons, or maybe their segment is false; I don't know which. In any case, I now have 23% of my maternal chromosomes painted to at least the grandparent level.

    If I load this mapping into DNA Painter, it gives me:

    image

    and yahoo!  I even have a bit of my one X chromosome painted to my MM side.



    Confirming the Grandparent Side

    This is nice. But I still have to remember that I'm not sure whether MF and MM are correct as MF and MM, or if they are reversed. As it turns out, I can use the clustering that I did at Ancestry DNA to tell. Over at Ancestry, I have 14 people whose relationships I know who have tested there. And some of them are on my mother's side.

    I have previously used the Leeds Method and other clustering techniques to try to cluster my Ancestry DNA matches into grandparents. Here’s what my Leeds analysis was, with the blue, green, yellow and pink being my FF, FM, MF and MM sides.

    And lucky me, I did find a couple of people on GEDmatch who tested on Ancestry who were in my MM clusters from DMT. And in fact they were in my pink grouping, which is my mother's mother's side in my Leeds method. So this does not prove it, but it gives me good reason to believe that I've got the MF and MM clusters designated correctly.



    Extending Beyond Grandparents

    You would think I could extend this procedure. After all, now that I have a number of people who are on the MF side, some of them should be MFM and some should be MFF.

    Well, I probably could, but I'd have to be careful, because now I have to be sure the people I pick are 3rd cousins or further. Otherwise, they would match me on both sides. So there is a limit here. This might be something I explore in the future. If you can understand anything of what I've been saying up to now, feel free to try it yourself.



    Same for Paternal Grandfather

    I now have people mapped to 3 of my grandparents: FM, MF and MM. I can do the same thing to get some mapped to my father's father, FF. In exactly the same way that I did it for my maternal grandparents, I can assume that my highest F match is likely FF, because it would have been mapped FM if it was not. So I'll give an MRCA of FF to Eaaaa Jaaaaaaaa Kaaaaaaaa, who matches me 71 cM on 9 segments, where one of those segments triangulates with some of my B people.

    I run everything all again, and now people are clustered this way:

    image

    The FF assignments stole some people from the other Fxx groups. There are still a lot of people clustered into FMF, but that is possible. Some sides of your family may have more relatives who DNA tested, even a lot more than others do.

    Loading my grandparent mappings into DNA Painter now gives me this:

    image

    You can see all 4 of my grandparents (FF=blue, FM=green, MF=pink, MM=yellow) and a bit more detail on my FM side.



    Filling in the Entire Genome

    If I had enough people who I know the MRCA for who trace back to one of my grandparents or further, and all of my triangulations with them covered the complete genome, then theoretically I’d be able to map ancestral paths completely. Endogamy does make this more difficult, because many of my DNA relatives are related on several sides. But the MRCA, because it is “most recent”, should on average pass down more segments than the other more distant ancestors do. Using a consensus approach, the MRCA segments should on average outnumber the segments of the more distant relatives and should be expected to suggest the ancestor who likely passed down the segment.
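
    Roughly speaking, the consensus idea is just a majority vote over the ancestral paths suggested by the matches that overlap a segment. Here is a simplified sketch of that idea (my own illustration, not DMT's actual algorithm):

        from collections import Counter

        def consensus_path(candidate_paths):
            """Pick the ancestral path suggested by the most overlapping matches.
            Return None when the top two are tied (no consensus)."""
            counts = Counter(candidate_paths).most_common()
            if len(counts) > 1 and counts[0][1] == counts[1][1]:
                return None
            return counts[0][0]

        # Hypothetical candidate labels for one segment region:
        print(consensus_path(["FMF", "FMF", "FM", "FMFM"]))  # FMF
        print(consensus_path(["MF", "MM"]))                  # None (tie)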

    Right now in the above diagram, I have 34% of my genome mapped.

    So if I go radical, and assume that DMT has got most of its cluster assignments correct, then why don't I try copying those clusters into the MRCA column and running the whole thing again? That will allow DMT to use them in triangulations, and those should cover most of my genome. Let's see what happens.

    image

    Doing this has increased my coverage from 34% to 60% of my genome. About 25% of the original ancestral path assignments were changed because of the new assumptions and because the “majority rules” changed.  

    With DMT, the more real data you include, the better the results should be. What I'm doing here is not really adding data, but telling DMT to assume that its assumptions are correct. That sort of technique in simulations is called bootstrapping. It works when an algorithm is known to converge to the correct solution. I haven't worked enough with my data to know yet whether its algorithms converge to the correct solution, so at this point, I'm still hypothesizing. The way I will be able to tell is if, with different sets of data, I get very similar solutions. My matches from different testing companies are likely different enough to determine this. But I'm not sure if I have enough known MRCAs to get what would be the correct solution.

    Let’s iterate a second time. I copy the ancestral paths of the 25% that were changed over to the MRCA column. This time, only 3% of the ancestral path assignments got changed. So we are converging in on a solution that at least DMT thinks makes sense. 

    I’ll do this one more time, copying the ancestral paths of the 3% that were changed over to the MRCA column.  This time, only 1.5% changed.
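
    In code terms, the iterate-until-stable procedure I'm doing by hand looks roughly like the loop below. This is only a sketch of the idea; run_assignment is a hypothetical stand-in for a DMT run that maps the current MRCAs to new ancestral path assignments for every person:

        def bootstrap(run_assignment, initial_mrcas, threshold=0.02, max_rounds=10):
            """Copy each round's assignments back in as MRCAs and rerun, stopping
            once the fraction of assignments that change falls below threshold."""
            assignments = dict(initial_mrcas)
            for round_no in range(1, max_rounds + 1):
                new = run_assignment(assignments)
                changed = sum(assignments.get(p) != path for p, path in new.items()) / len(new)
                print(f"round {round_no}: {changed:.1%} of assignments changed")
                assignments.update(new)
                if changed < threshold:
                    break
            return assignments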

    Stopping here and loading into DNA Painter gives 63% coverage, remaining very similar to the previous diagram, although I’m surprised the X chromosome segment keeps popping in and out of the various diagrams I have here.

    image



    What Have I Done?

    What I did above was to document and illustrate some of the experimentation I have been doing with DMT so I can see what it can do, figure out how best to use it, and hopefully map my genome in the process. Nobody has ever done this type of automated determination of ancestral paths before, not even me.

    I still don’t know if the above results are mostly correct or mostly incorrect. A future blog post (Part 3) will see if I can determine that.

    By the way, in all this analysis, I found a few small things to fix in DMT, so feel free to download the new version 3.1.1.

    Using DMT, Part 1: My 23andMe Data

    Friday, October 18, 2019, 0:22:19

    I am going to show you how I am using Double Match Triangulator, and some of the information it provides, at least for me.

    My own DNA match data is difficult to analyze. I come from a very endogamous population on all my sides that originates in Romania and Ukraine. The endogamy gives me many more matches than most people, but because my origins are from an area that has scarce records prior to 1850, I can only trace my tree back about 5 generations. Therefore the vast majority of my matches are with people I may never be able to figure out the connection to. But there should be some that I can, and that is my goal: to find how I'm related to any of my 5th cousins and closer who have DNA tested.

    I’ve tested at Family Tree DNA, 23andMe, Ancestry, MyHeritage, and I’ve uploaded my DNA to GEDmatch. I have also tested my uncle (father’s brother) at Family Tree DNA and uploaded his DNA to MyHeritage and GEDmatch.

    Ancestry does not provide segment match data, so the only way to compare segments with an Ancestry tester is if they’ve uploaded to GEDmatch.



    Where to Start

    With DMT, the best place to start is with the company where you have the most DNA relatives whose relationship you know. I know relationships with:

    • 14 people at Ancestry, but they don’t give you segment match data
    • 9 people at 23andMe.
    • 3 people at GEDmatch.
    • 2 people at MyHeritage.
    • 2 people at Family Tree DNA

    So I'll start first with the 9 people at 23andMe. The somewhat odd thing about those 9 people is that they are all related on my father's side, meaning I won't be able to do much on my mother's side.

    We’ll see what this provides.



    Getting Segment Match Data from 23andMe

    At 23andMe, you can only download the segment match data of people you administer. That means you have to ask your matches if they would download and send you their match data if you want to use it.

    But there is an alternative. If you subscribe to DNAGedcom, you can get the segment match files of any of the people you match to. I have a section of the DMT help file that describes how to get 23andMe match data.

    I used DNAGedcom to download my own segment matches, as well as the segment matches of the 9 relatives I know relationships with at 23andMe, plus 7 other people I’m interested in that I don’t know how I’m related to.



    Entering my Known Relatives' MRCAs in my People File

    First I load my own 23andMe match file as Person A in DMT. Then I run that file alone to create my People file. DMT tells me I have 8067 single segment matches. DMT excludes those that are less than 7 cM and produces a People file for me with the 892 people with whom I share 4244 segments of at least 7 cM. It is sorted by total matching cM, highest to lowest, so that I can see my closest relatives first.
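
    That People file summary is essentially a filter-and-total over the raw segment matches. Here's a minimal sketch of that step, with made-up file and column names (real downloads differ by source):

        import csv
        from collections import defaultdict

        def people_summary(path, min_cm=7.0, name_col="MatchName", cm_col="cM"):
            """Total the cM per matched person, ignoring segments under min_cm,
            and return people sorted from highest to lowest total."""
            totals = defaultdict(float)
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    if float(row[cm_col]) >= min_cm:
                        totals[row[name_col]] += float(row[cm_col])
            return sorted(totals.items(), key=lambda item: item[1], reverse=True)

        for name, total in people_summary("my_segment_matches.csv")[:10]:
            print(f"{name}: {total:.1f} cM")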

    I go down the list and find the 9 relatives I know and enter our Most Recent Common Ancestors (MRCAs). Here’s the first few people in my list whose names I altered to keep them private:

    image

    Naa is the daughter of my first cousin, i.e. my 1C1R. She is my closest match at 23andMe and we share 541 cM on 22 segments that are 7 cM or greater. Our MRCA from my point of view is my Father’s paRents, so I enter FR as our MRCA.

    Daaaaa Paaaaa is my father's first cousin, sharing 261 cM. So he is also my 1C1R, but since he is my Father's Father's paRents' daughter's son, he gets an MRCA of FFR.

    Similarly, Baaaa Raaaaaa and Raaa Raaaaaa are brothers who are both 2nd cousins, sharing 153 and 143 cM with me. Their MRCA is FFR.

    The other two people I marked FFR are my 2C1R sharing 152 cM and 94 cM.

    I also have 3 people related on my father’s mother’s side. The two FMFR’s shown above are 3rd cousins on my father’s mother’s father’s side sharing 90 cM and 84 cM. Also on line 96 (not shown above) is a 3rd cousin once removed sharing 58 cM who I’ve given an MRCA of FMFMR.



    Single Matching

    I save the People file with the 9 MRCAs entered, and I'll run that file alone again. This time, it uses the MRCAs and paints any matching segments of at least 15 cM to the MRCA ancestral path.

    This is exactly what you do when you use DNA Painter. You are painting the segment matches whose ancestral paths you know onto their places on the chromosome.

    The reason why single segment matches must be at least 15 cM to be painted is that shorter single segment matches might be random matches by chance. That can be true even if it is a close relative you are painting. It's better to be safe than sorry and paint just the segments you are fairly certain are valid. Beware of small segments.
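
    As a rough illustration of this single-match painting step (a sketch of the idea only, with made-up segment positions, not DMT's code):

        MIN_SINGLE_MATCH_CM = 15  # shorter single matches may be random, so don't paint them

        def paint_single_matches(segments, mrcas, min_cm=MIN_SINGLE_MATCH_CM):
            """From (person, chromosome, start, end, cM) matches, keep only people
            with a known MRCA and segments of at least min_cm, and label each kept
            segment with that person's MRCA ancestral path."""
            return [(chrom, start, end, mrcas[person], person, cm)
                    for person, chrom, start, end, cm in segments
                    if person in mrcas and cm >= min_cm]

        segments = [("Naa", "1", 10_000_000, 40_000_000, 32.0),
                    ("Daaaaa Paaaaa", "1", 15_000_000, 25_000_000, 11.0)]
        mrcas = {"Naa": "FR", "Daaaaa Paaaaa": "FFR"}
        print(paint_single_matches(segments, mrcas))  # only the 32 cM segment paints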

    DMT tells me it is able to paint 15.3% of my paternal segments to at least the grandparent level using the 9 people. My closest match, my first cousin once removed, doesn't help directly, because some of her segment matches may be on my father's father's side, and some may be on my father's mother's side, and we can't tell which on their own. But her matching segments may overlap with one of the other relatives' matches, and hers can then be used to extend that match for that grandparent. DMT does this work for you.

    With this data, DMT cannot paint any of my maternal segments. I’d need to know MRCAs of some of my maternal relatives at the 2nd cousin level or further to make maternal painting possible.

    DMT produces a _dnapainter file that I can upload to DNA Painter. This is what it looks like in DNA Painter showing the 15.3% painted on my paternal chromosomes:

    image

    My father’s father’s side (in blue), could only be painted to the FF level because I don’t have any MRCAs beyond FFR.

    My father's mother's side (in green) was paintable to the FMFM level because I had a 3C1R with an MRCA of FMFMR.  But there's less painted on the FM side than the FF side because my FM relatives share less DNA with me.



    Double Matching and Triangulating

    Now let’s use the power of DMT to compare my matches to the matches of my 9 known relatives and the 7 unknown relatives, combine the results, and produce triangulation groups and finally produce some input for DNA Painter.

    The great thing about triangulating is that by ensuring 3 segments all match each other (Person A with Person C, Person B with Person C, and Person A with Person B), it considerably reduces the likelihood of a by-chance match, maybe down to segment matches as small as 7 cM, which is the default value for the smallest segment DMT will include.
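
    The triangulation test itself is simple: the three pairwise matching segments must all overlap one another on the same chromosome. A simplified sketch with made-up segments (using a 1.5 Mbp minimum overlap, the value DMT 3.1 uses internally; real code has more to handle):

        def overlap(seg1, seg2, min_overlap=1_500_000):
            """Do two segments on the same chromosome overlap by at least min_overlap bp?"""
            chrom1, start1, end1 = seg1
            chrom2, start2, end2 = seg2
            return chrom1 == chrom2 and min(end1, end2) - max(start1, start2) >= min_overlap

        def triangulates(a_c, b_c, a_b):
            """A triangulation needs all three pairwise matches to overlap each other."""
            return overlap(a_c, b_c) and overlap(a_c, a_b) and overlap(b_c, a_b)

        # Made-up segments on chromosome 5: the A-C, B-C and A-B matches.
        print(triangulates(("5", 20_000_000, 45_000_000),
                           ("5", 25_000_000, 50_000_000),
                           ("5", 22_000_000, 44_000_000)))  # True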

    This allows DMT to paint 46.1% of the paternal DNA, about 3 times what was possible with just single matching. In DNA Painter, this looks like:

    image



    Clusters

    DMT also clusters the people I match to according to the ancestral paths that the majority of their segments were assigned to.

    DMT clustered the 892 people that I match to as follows:

    image

    There were 83 U people who DMT could not figure out if they were on the F or M side. There were 75 F people who DMT could not figure out if they were on the FF or FM side. And there were 19 X people who didn't double match and were only a small (under 15 cM) match in my segment match file, so these were excluded from the analysis.

    Overall, 63% of the people I match to were clustered to my father's side and 24% to my mother's side. Here's how my closest matches (see first diagram above) look after they were clustered:

    image

    So it looks like quite a few of my closest matches whose MRCA I don’t know might be on my FF side. That tells me where I should start my search for them.

    There are also 4 matches that might be on my mother's (M) side that I can try to identify.

    All but two of my top matches have at least one segment that triangulates with me and at least one of the 16 people whose match files I ran against. All the details about every match and triangulation are included in the map files that DMT produces, so there’s plenty of information I can look through if I ever get the time.

    Next post:  GEDmatch.

    DMT 3.1 Released

    Thursday, October 17, 2019, 3:34:21

    I’m working on a series of articles to show how I am using the new 3.0 version of Double Match Triangulator to analyze my own segment match data.

    As much as I’d like you to believe that I’ve developed DMT for the good of genetic genealogists everywhere, I humbly admit that I actually developed it so that I could analyze my own DNA to help me figure out how some of my DNA matches might be related.

    Of course, as I started working on my articles, first looking at my 23andMe matches, I found some problems in my new 3.0.1 version, and a few places I could make enhancements.

    If you downloaded version 3.0 that was released on Oct 1, or 3.0.1 on Oct 3, please upgrade to 3.1 whenever you can.

    image

    If you are a member of the Genetic Genealogy Tips and Tricks group on Facebook, or the DNA Painter User Group on Facebook, my free trial key I gave there is still valid and will let you run the full version of DMT until the end of October. Just look for my post on Oct 2 on either group for the key.

    Some of the enhancements in Version 3.1 include:

    • The grandparent extension algorithm: I found a few extra extensions where they should not have been. So I completely changed the algorithm to one that is clearer and easier for me to verify is working properly. The final results are similar to the old algorithm's, but the differences are significant enough to have a noticeable effect on the grandparent assignments.
    • Small improvements were made to the determination of triangulation boundaries.
    • Internally, I increased the minimum overlap DMT uses from 1.0 Mbp to 1.5 Mbp. This was to prevent some incorrect overlaps between two segments when there was a bit of random matching at the overlapping ends of the segment.

    And there were a few bug fixes:

    • If you run 32-bit Windows, then the DMT installer installs the 32-bit version of DMT rather than the 64-bit version. The 32-bit version had a major bug that crashed it when writing the People file. Nobody complained to me about this, so I guess most of you out there are running 64-bit Windows. Maybe one day in the not-too-distant future, I'll only need to distribute the 64-bit version of DMT.
    • In the Combine All Files run, a few matches were not being assigned an AC Consensus when they should have been. Also, a few assignments of AC No Parents were made when there was a parent.
    • 23andMe FIA match files downloaded using DNAGedcom were not being input correctly.

    Now back to see what DMT can do for me.

    The GEDCOM 5.5.5 Initiative and Making It Work

    Monday, October 7, 2019, 7:47:24

    It’s been 35 years since GEDCOM 1.0 was released to the genealogical software development community in 1984.

    It’s been 20 years since GEDCOM 5.5.1, the last official specification by FamilySearch was released on October 2, 1999.

    Five days ago, October 2, 2019, the gedcom.org website was renewed containing the newly-released GEDCOM 5.5.5 specifications.

    GEDCOM was originally an acronym for GEnealogical Data COMmunication. It has been the standard that genealogical software has been using for the past 35 years. It specifies how genealogical data should be exported to a text file so the data can be preserved in a vendor-neutral form (separate from their proprietary databases) in a format that other programs will be able to import.
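
    For readers who have never looked inside one, GEDCOM is a plain text format where every line has a level number, an optional @xref@ record id, a tag, and a value. Here is a minimal sketch that reads a tiny hand-made (hypothetical) fragment; real GEDCOM has much more to it (header records, CONT/CONC continuation lines, pointers as values, and so on):

        SAMPLE = [
            "0 @I1@ INDI",
            "1 NAME John /Smith/",
            "1 BIRT",
            "2 DATE 12 MAR 1890",
            "2 PLAC Winnipeg, Manitoba, Canada",
        ]

        def parse_line(line):
            """Split one GEDCOM line into its level, optional @xref@ id, tag and value."""
            level, rest = line.split(" ", 1)
            xref = None
            if rest.startswith("@"):
                xref, rest = rest.split(" ", 1)
            tag, _, value = rest.partition(" ")
            return int(level), xref, tag, value

        for line in SAMPLE:
            print(parse_line(line))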

    For 15 years, between 1984 and 1999, the GEDCOM standard was developed and made available by the Church of Jesus Christ of Latter-day Saints (LDS). They had a team in place that had discussions with many genealogical software vendors. They prepared the standard by taking all the ideas and figuring out how computer programs of the day could and should transfer data between programs.

    They were very successful. There have been hundreds of genealogy programs developed since then, and just about every single one of them adopted the standard and can import GEDCOM and export GEDCOM. I don’t know of too many standards that gained nearly 100% acceptance. To me, that is a very successful adoption of a standard in any field.

    Why was GEDCOM successful under the LDS? My take on that is that it was because of the way they operated.

    1. They had a team in place to develop the standard.
    2. They sent drafts to developers and solicited feedback.
    3. The team evaluated all the feedback, suggested possible implementations, and evaluated conflicting ideas.
    4. And most importantly: one person acted as editor and made the decisions. I believe that may have been Bill Harten, who has been called the “Father of GEDCOM”.  


    What Happened in 1999?

    In 1999, the LDS decided it was no longer going to continue the development of GEDCOM, and it disbanded the team. The last GEDCOM version the LDS issued was 5.5.1, which was still labeled beta, but was definitely the de facto standard because the LDS itself used it, and not the 5.5 version, in their very own Personal Ancestral File (PAF) software.

    The standard was very good, but not perfect. Each software developer used GEDCOM basics for important items like names, linkages between children and families, families and parents, birth, marriage, death dates and places.

    GEDCOM had more than that. Way more. It had lots of different events and facts. It had links to multimedia. It had sources, repositories and source references. Technically it had everything you needed to properly source and reference your genealogical material. The funny thing was that it was ahead of its time.

    Genealogical programs back then, and genealogists in particular, did not have the understanding of the need to document our sources. We were all name collectors back then. C'mon, admit it. We were. We only learned our folly, and the importance of sources, years later.

    So in the next 10 years after 1999, software developers started adding source documentation to their programs. And in doing so, most of them ignored or only loosely used the sourcing standards that GEDCOM included. They did so because no one else was exporting sources with GEDCOM. What happened is that each of them developed their own unique sourcing structure in their programs and often didn't export that to GEDCOM, or if they did, they only used some of GEDCOM and developed their own custom tags for the rest, which none of the other programs would be able to understand.

    This extended to other program features as well. For example, adding witness information or making a place a top-level record was not something the last version of GEDCOM had, so developers, and even groups of developers, came out with extensions to GEDCOM (such as Gedcom 5.5EL).

    The result:  Source data and a lot of other data did not transfer between programs, even though almost all of them were exporting to and importing from what should have been the very same GEDCOM.


    Two Attempted Fixes

    About 2010, a BetterGEDCOM grassroots initiative was started by Pat Richley-Erickson (aka Dear Myrtle) and a number of others. For 10 years, too much data, and especially source data, was not transferring between programs.

    At the time I wrote in a blog post about BetterGEDCOM:

    “The discussion is overwhelming. To be honest, I don’t see how it is going to come together. There are a lot of very smart people there and several expert programmers and genealogists who seem to be having a great time just enjoying the act of discussing and dissecting all the parts. … I really hope they come back to earth and accomplish something. I’ve suggested that they concentrate on developing a formal document.”

    There was a very large amount of excellent discussion about where GEDCOM should go and what was wrong with it and what should be fixed. But nobody could agree on anything.

    The reason, in my opinion, why this didn't work out: the discussion took over, similar to a Facebook group where everyone had their own opinion and no one would compromise.

    The one thing missing was an editor: someone who would take all the various ideas and make the decision as to the way to go.

    After a few years, the BetterGEDCOM group realized it wasn’t getting anywhere. Their solution was to create a formal organization with by-laws and a Board of Directors to spearhead the new initiative. It was called FHISO and they created the website at fhiso.org. They obtained the support of many software vendors and genealogical organizations.

    On March 22, 2013, FHISO initiated their standards development process with an Open Call for Papers. Scores of papers were submitted, two by myself.

    Then discussion started and continued and continued and continued. There was little agreement on anything. They had excellent technical people involved, but the nature of the discussion was often too technical for even me to understand. It got involved in externalities far beyond what GEDCOM needed and spent months on items that were more academic in nature than practical.

    FHISO had hoped to come out with an Extended Legacy Format (ELF) which would update the 5.5.1 standard. They were working on a Serialization Format (which GEDCOM already effectively had) and a Data Model. It's the data model that is wanted and is most important. But to date, after over 6 years, all they have is a document defining Date, Age and Time Microformats.

    Horrors, that document is 52 pages long and is a magnificent piece of work, if you wanted to submit a PhD thesis. Extrapolating that out to the rest of GEDCOM, we'd likely be looking at a 20-year development time for a 50,000-page document which no programmer would be able to use as a standard.


    We Need Something Practical – GEDCOM 5.5.5

    Genealogy technology expert Tamura Jones on his Modern Software Experience website has for years been an in-depth critical reviewer of genealogical software and the use of GEDCOM. I have greatly benefited from several of his reviews of my Behold software which sparked me to make improvements.

    I had the pleasure of meeting Tamura when I was invited to speak at the Gaenovium one-day technology conference in his home town of Leiden, Netherlands, in 2014. I then gave the talk "Reading wrong GEDCOM right". Over the years Tamura and I have had many great discussions about GEDCOM, not always agreeing on everything.

    In May of last year (2018), on his own initiative, Tamura created a much-needed annotated version of the GEDCOM 5.5.1 specification. It arose from his many articles about GEDCOM issues, with solutions and best practices. Prior to release, he had a draft of the document reviewed by myself and six other technical reviewers, and incorporated all of our ideas that made sense with respect to the document. Many of the comments included thoughts about turning the annotated edition into a new version of GEDCOM.

    So I was surprised and more than pleased when a few months ago, Tamura emailed me and wanted reviewers for what would be a release of GEDCOM 5.5.5.

    image

    The goal was to fix what was not right in 5.5.1, remove no longer needed or used constructs, ensure there was just one way to do anything, fix examples and clear up everything that was unclear. This version 5.5.5 was not to have anything new. Only obsolete, deprecated, duplicate, unnecessary and failed stuff was to be taken out.

    The 5.5.5 result was published on October 2, exactly 20 years after 5.5.1 was released. You can find GEDCOM 5.5.5 on the gedcom.org site.

    For a description of what was done, see the Press Release.


    Why GEDCOM 5.5.5 Works and What’s Coming

    GEDCOM 5.5.5 is based on GEDCOM 5.5.1. It included the thoughts and ideas of a number of genealogy software developers and experts, just like the original GEDCOM did. It is practical as it is a repair of what has aged in 5.5.1.

    The reason why this document has come out was because it was done the way the LDS originally did it. A draft was sent to experts. Each sent comments, opinions, criticisms and suggestions back to the editor, who then compared the ideas and decided what would work and what wouldn't.

    I know I spent dozens of hours reviewing the several drafts that were sent to me. I sent back what was surely a couple hundred comments on every piece of the document. I didn't agree with the document on all items. In the end, some of my arguments were valid and the item was changed. But some were not, and I appreciate those decisions, because other reviewers may feel differently than I do and some final decision must be made. I'm glad Tamura is the one willing to make the decision, as I very much respect his judgement. I also greatly thank him for the thousands of hours I know he must have put into this.

    What made this happen?

    1. There was a standard that had already been annotated to start with.
    2. Drafts were sent to developers and feedback was received.
    3. Tamura reviewed all the feedback, considered all possible implementations, and evaluated conflicting ideas.
    4. And most importantly: Tamura acted as editor and made the decisions. He is not selling software himself, so is unbiased in that respect.

    I will support and promote this initiative. I will be adding GEDCOM 5.5.5 support into future versions of Behold. I encourage other genealogical software developers to do so as well.

    This is only a first step. There are enhancements to GEDCOM that many vendors want. These ideas need to be compared to each other, and decisions need to be made as to how they would best be implemented in a GEDCOM 5.7 or 6.0.  A similar structure with one respected editor vetting the ideas is likely the best way to make this happen. And as long as the vendors are willing to offer their ideas and compromise on a solution, Tamura will be willing to act as editor to resolve the disputes. I would expect this would give us the best chance of ending up with a new GEDCOM standard that is clear, is easy for genealogy software programmers to implement, enables all our genealogical information to transfer between programs, and is something that genealogy software developers would agree to use.

    It Took a Lot of Effort to Get to DMT 3.0

    Sunday, October 6, 2019, 8:11:49

    I’m cleaning up my directories after over 3 years of development of Double Match Triangulator.

    When I released Version 1 of DMT back in August 2016, I had it write information about the run to a log file, primarily so that I could debug what was going on. But I realized it is useful for the user as it could contain error messages about the input files and statistics about the matches that the user could refer to. Also, if DMT wasn’t working right, I could be sent the log file and it would be a great help in debugging the problem.

    I have not deleted any of my log files since Version 1.0. I want to delete them now because there are a lot of them and they are taking up space and resources on my computer. How much? Well, I’ve got 9,078 log files totaling 421 MB of space.

    There have been 1,159 days since I started accumulating log files. I have at least one log file on 533 of those days, so I worked on DMT on 46% of the days, i.e. over 3 days a week, averaging 17 runs each day I was working on it. These are almost all development runs, where I'm testing code in DMT and making sure everything is working correctly. The maximum in any day was 89 runs. After every run, I'd check to see what worked and what didn't, go back to my program to fix what didn't work, recompile, and run a test again. If everything worked, then I'd go on to the next change I needed to make.

    So I’m going to have a bit of fun and do some statistics about my work over the past three years.

    Here’s a plot of my daily runs:

    image

    Here’s my distribution of working times:

    image

    You can see I don’t like starting too early in the morning, usually not before 9 a.m., but I’m on a roll by noon and over 10% of my runs were from noon to 1 p.m.

    I relax in the afternoon. In the summer if it’s nice, I’m outside for a bike ride or a swim. In the winter I often go for a walk if it’s not too cold (e.g. –30 C).

    Then you can see my second wind starting at 9 p.m., with a good number of late nights to 1 a.m. You can also see the occasional night where I literally dream up something at 4 a.m. and run to my office to write it down and maybe even turn on my computer to try it out.

    The days of the week are pretty spread out. I’m actually surprised that Fridays are so much lower than the other days. Not too sure why.

    So that just represents the time I spent on DMT over the past 3 years. It doesn’t include time I spent working on Behold, updating my website, writing blog posts, answering emails, being on social sites, maintaining GenSoftReviews, working on my family tree, deciphering my DNA results, going to conferences, watching online seminars, vacations, reading the paper each morning, watching TV, following tennis, football and hockey, eating breakfast, lunch and supper each day, cleaning the dishes, doing house errands, buying groceries, and still having time to spend with family.

    Wow. It’s 1:11 a.m. I’m posting this and going to bed!

    Double Match Triangulator 3.0 Released

    Wednesday, October 2, 2019, 6:46:32

    After 10 months of hard work, I have finally released DMT 3.0.

    You can download it from www.doublematchtriangulator.com.
    (It is a Windows program. My apologies to Mac or Unix users)

    All purchasers of DMT get a lifetime license. All updates including this one are free for past purchasers.

    There’s a lot new and improved here.

    • An MRCA (Most Recent Common Ancestor) column has been added to the People file, allowing you to enter the MRCA for each person that you know.
    • The process of segment matching and assignment of ancestral paths to segments and matches is now automated in an expert-system manner that mimics what a person would do to map their chromosomes.
    • You can now select the minimum cM of the matches that you want DMT to include in its analysis.
    • You can now run DMT with only Person A to get the listing of Person A’s matches and people.
    • DMT now handles all the new segment match file formats from the various companies.
    • The Map page, People page and Log file all have extensive revisions.
    • DMT outputs a dnapainter csv file that can be uploaded to www.dnapainter.com.
    • Now uses conditional formatting extensively in the Excel files, so most of the formatting should move when the data is copied and pasted or sorted.
    • Now can filter all matches to a minimum cM.
    • Calculates and uses all inferred matches.
    • Clusters people into their primary ancestral lines.
    • Does parental filtering if one or both parents have DNA tested.


    This is what the DMT program looks like. It is this one window:

    dmt-main-window


    This is what DMT’s Map page looks like, listing every segment match:

    double-match-triangulator-map-file


    This is what DMT’s People page looks like, listing every person matched:

    image


    Here's an example of an upload to DNA Painter from a DMT file:

    image


    There are many great tools available now for genetic genealogists who are interested in using their DNA to help them figure out their connections to relatives.

    With version 3.0 of DMT, I’ve now made available a tool that automates segment matching, triangulation and chromosome mapping. It’s a bit different from all the others and it should give you new and different insight into your DNA matches.

    23andMe’s Family Tree Beta

    Tuesday, September 24, 2019, 4:12:06

    Someone on Facebook reported a new feature at 23andMe and I couldn’t wait to try it. This 23andMe beta “auto-builds”  your family tree from your DNA connections with other customers.

    image

    They aren’t the first to try something like this. Ancestry DNA has ThruLines, which uses your tree and your DNA match’s tree to try to show you how you connect. MyHeritage DNA does the same with their Theory of Family Relativity (TOFR) using your tree and your match’s tree at MyHeritage.

    I had good success at Ancestry, which gave me 6 ThruLines joining me correctly to 3 relatives I previously knew the connection to, and 3 others who were correctly connected but were new to me. I then contacted the latter 3 and we shared information, and I was able to add them and their immediate relatives to my tree.

    I've had no success yet at MyHeritage DNA even though I have my main tree there. I've never had a single TOFR there for either me or my uncle. My closest match (other than my uncle) is 141 cM. My uncle's closest is 177 cM. Those should be close enough to figure out the connection. But none of the names of my matches at MyHeritage give me any clues, and I haven't been able to figure out any connection with any of them, even using some of their extensive trees.

    23andMe does not have trees to work with like Ancestry and MyHeritage. Actually, I shouldn't say that. Not too long ago, in another beta, 23andMe allowed uploading your FamilySearch tree to 23andMe. See Kitty Cooper's blog post about it for details. They never said what they were going to do with that data, but I wanted to be ready if they did do something. All it tells me right now is this:

    image

    If you include your FamilySearch Tree in your profile, then anyone else who has done the same will show a FamilySearch icon next to their name. You can also filter for those who have done so. I don’t think many people know about this beta feature yet, because my filter says I have no matches who have done so.

    image

    But I digress.  Let’s go check out the new Family Tree beta at 23andMe. I’m somewhat excited because I have a dozen relatives who I know my connection to who have tested at 23andMe. I’ve been working with them over the past 2 weeks getting my 23andMe matches to work in my (almost-ready) version 3.0 of my Double Match Triangulator program. And the odd thing about all my known relatives at 23andMe is that they are all on my father’s side!  I’d love to be able to connect to a few people at 23andMe on my mother’s side. Maybe this Family Tree beta will help. Let’s see.

    So I go over to my 23andMe “Your Family Tree Beta” page. It takes it a few minutes to build my tree and update my predicted relationships. Once it does, out pops this wonderful diagram for me.  (Click on the image to enlarge it).

    image

    I'm shown in the middle (my Behold logo sun), with my parents, grandparents and great-grandparents above. And 23andMe has then drawn down the expected paths to 12 of my DNA matches.

    This is sort of like ThruLines and TOFR, but instead of showing just the individual connections with each relative, 23andMe shows all of them on just one diagram. I like it!!

    The 12 DNA matches they show on the diagram include 5 of the 12 people whose connection to me I know (I show arrows to them), and 7 who I don't. Maybe this will help me figure out the other 7.

    My 3 closest 23andMe matches are included, who I have numbered 1, 2 and 3. Number 3 is on my mother’s side, but I don’t know what the connection is. None of the other 9 are on my first page of matches (top 25).

    The number 1 with a green arrow is my 1st cousin once removed. She is the granddaughter of my father’s brother. So that means the entire right side of the tree should be my father’s side and the left should be my mother’s side.

    The ancestors are all shown with question marks. I can now try labelling the people I know because of my connection with my first cousin.  When I click on the question mark that should be my father, I get the following dialog box:

    image

    The “More actions” brings up a box to add a relative, but that action and likely others that are coming are not available yet.

    When you click on “Add Information” you get:

    image

    I click on “I understand” and “Next” and I get:

    image

    I click “Yes” and “Next” and it lets me enter information about this person:

    image

    Now I’m thinking at this point that they have my FamilySearch info. Maybe in a future version, they can allow me to connect this person to that FamilySearch tree, and not only could they transfer the info, but they should be able to automatically include the spouse as well.

    But for now, I simply enter my father’s information and press “Save”. I did not attempt to add a photo.

    When I clicked “Deceased”, it added Place of death and Date of death. But it has a bug because the Place of Death example cannot be edited. But what the heck. This is a beta. Expect a few bugs.

    It gives a nice confirmation box and then on the chart changes the orange circle with the question mark to the green circle with the “TK” (my father’s initials):

    image

    I also go and fill in my mother, and my father’s brother and his daughter who connect to my 1st cousin once removed.

    Next step: Those 4 red arrows on the right point to four cousins on my father’s father’s father’s side. I can fill in two more sets of ancestors:

    image

    Unfortunately, the 4 DNA matches at the right were placed one generation higher than they should have been.

    image

    They should be under AM and RB, not under RB’s parents. This is something you can’t tell from DNA, but maybe 23andMe could use the ages of the DNA testers to estimate the correct generation level the matches should be at.

    This is basically what 23andMe’s Family Tree beta seems to do in this, their first release. It does help visualize and place where DNA relatives might be in the tree. For example, the two unidentified cousins shown above emanate from my great-grandmother’s parents. So like clustering does, it tells me where to look in my family tree for my connection to them.

    Conclusion:

    This new Family Tree at 23andMe has potential. They seem to be picking specific people that would represent various parts of your tree, so it is almost an anti-clustering technique, i.e. finding the people who are most different.

    There is a lot of potential here. I look forward to seeing other people’s comments and what enhancements 23andMe makes to it in the future, like making use of the FamilySearch relatives from their other beta. Being able to click through each DNA relative to their profile would be a useful addition. And using the ages of the testers would help to get the generational level right.

    Our desire as genealogists is that DNA should help us extend our family tree. It’s nice to see these new tools from 23andMe as they show that the company is interested in helping genealogists.

    Now off I go to see if I can figure out how the other 7 people might be connected.

    The Life and Death of a DNA Segment

    2019. augusztus 20., kedd 7:53:34

    There’s a bad rumor going around that segment matches, especially for small segments, can be very old. I’ve heard expectations that the segment might come from a common ancestor 20 generations back or even 30, 40 or more. And that’s said to happen even if you have a fairly large 15 cM segment.

    Part of this is due to the incorrect thinking that a segment of your DNA has been around forever and has been passed down from some ancient ancestor to you and to just about everyone else. Since there is only a 1/2 chance each generation that the segment comes from the right parent, the argument is that this gets offset by having more than 2 children per generation, keeping the segment alive all the way down to two 30th or 40th generation descendants who then happen to share the segment. That also assumes there is no intervening ancestor along some other path who is more recent than that 30th generation one. For endogamy, the argument is that the segment has proliferated through the population and most people happen to have it. Although in that case, I find it hard to believe that there is not a line to a different common ancestor who is fewer than 30 generations back.

    The fallacy here is that all our DNA segments are ancient. They are not. In fact, many of them are quite recent, only a few generations old.

    Let’s take a look at, say a 15 cM segment that you got from your father. You could have:

    1. Got the whole segment from your father’s father’s chromosome,

    2. Got the whole segment from your father’s mother’s chromosome, or

    3. There could have been a recombination that occurred somewhere along the 15 cM segment and you got part of it from your father’s father and part from your father’s mother.

    It is case number 3 that is interesting. In this case, that 15 cM segment is no longer the same as your father’s father’s segment, nor is it the same as your father’s mother’s segment. It is a new segment that has been born in you and you are the first ancestor to have that segment and maybe you’ll pass it down to many of your descendants. And no one else will have that segment that you have, unless some random miracle as rare as a lottery winning happens.

    Also, your father’s father’s segment at this location and your father’s mother’s segment are both not passed down to you. Maybe they’ll be passed to a sibling of yours or maybe they won’t. But both of your grandparents’ segments have died along your line.

    So what actually happens is that any segment of your DNA has its birth in one of your ancestors. That ancestor may pass it down to zero or more descendants, and if it is passed down, each descendant may or may not continue to pass it down. The segment eventually dies. A recombination on the segment can’t be avoided forever.

    Now what is the probability of a new 15 cM segment being “born” in you? Well, that’s what cM represents, and there will be about a 15% chance that any particular 15 cM segment of your DNA was formed from a recombination in your parent, giving you a brand new segment. For most purposes, using the cM as a percentage is close enough. But for more accuracy, I’ll use the actual probability from the equation P(recomb) = (1 – exp(–2 * cM / 100)) / 2, which gives 13.0%.

    Well guess what? The probability that any particular 15 cM segment is born in any of your ancestors is also 13%. The chance that the segment was not born, but was passed down is therefore 87%. We can use that fact to now calculate the probability that this segment was passed down any number of generations to some descendant:

    image

    What this says is that if you have a 15 cM segment, then there is about a 50% chance that it was created in one of the last 5 generations, a 75% chance that it was created in one of the last 10 generations, and a 94% chance that it was created in one of the last 20 generations. The average age of segments that size is 7.7 generations (1 / 13%). This is very simple mathematics/statistics.
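
    If you want to play with the arithmetic yourself, here’s a little Python sketch of it, using Haldane’s formula for the recombination probability (the 15 cM and the generation counts are just the examples from above):

        import math

        def p_segment_born(cM):
            # Haldane: probability that a recombination lands within a segment of
            # this genetic length during one meiosis, creating a brand new segment.
            return (1 - math.exp(-2 * cM / 100)) / 2

        p = p_segment_born(15)                       # about 0.13 for a 15 cM segment
        for g in (5, 10, 20):
            # Probability the segment was created within the last g generations.
            print(g, "generations:", round(1 - (1 - p) ** g, 2))   # ~0.50, ~0.75, ~0.94
        print("average age:", round(1 / p, 1), "generations")      # ~7.7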

    If you match with another person on the same segment, then they have the same probabilities. The chance both of you got this segment from more than 20 generations back would be only 6%.



    Revisiting Speed and Balding Once Again

    I’m still frustrated that Speed and Balding’s simulation results are being used without question to estimate segment age for human DNA segment matches.

    About two years ago, I used two different sets of calculations, one my own in Revisiting Speed and Balding, and one based on work by Bob Jenkins in Another Estimate of Speed and Balding Figure 2B. In both cases, I found segment age estimates that were somewhat less than Speed and Balding.

    Let’s see how my Segment Life estimates compare. Picking a few different segment sizes and calculating their values gives:

    image

    And then let’s plot these in a stacked chart:

    image

    Look at the gray area at the top left. That’s the probability of segments of the given size being 20 or more generations old. The green bar is the divider at 10 generations. You likely have a good chance of identifying how you’re related to segment matches that fall under the green bar, which indicates that most segments over 15 cM should be identifiable and that even very small segments might be.

    Compare this to Speed and Balding:

    Speed and Balding give a much larger chance of older segments than does my segment life methodology, or than do either of the two analyses in my earlier blog posts.



    Conclusion

    Segments aren’t passed down from ancient times. They are created and die all the time due to recombination events, and they may not be as old as you are led to believe. Some of your smaller matching segments, e.g. between 5 and 15 cM, have (by my segment life and other earlier calculations) a 40% to 70% chance of originating less than 10 generations ago. This means you might be able to determine how you’re related to your match.

    By using triangulation techniques (such as Double Match Triangulator), you can determine triangulations of segments in the 5 to 15 cM range which will eliminate most by-chance matches. You can then put your segment matches into Triangulation Groups, to help find the common ancestor of the group and connect your DNA matches to your tree.

    50 Years, Travelling Salesman, Python, 6 Hours

    2019. augusztus 8., csütörtök 7:41:39

    This is my first blog post in over 2 months. The reason is that I have been working very hard trying to finish Version 3 of Double Match Triangulator. Everything I’ve been doing with it is experimental, and there’s no model to follow. So it’s tough to get it just right. I had already started the documentation of the new version when I diverted to get some sample data from people who had done Visual Phasing (VP) with 3 or more siblings, because I was thinking that this version of DMT should be able to use segment matches to get most of the same grandparent assignments that VP does. I’ve made progress on that but haven’t completed it yet.

    But this morning, I was sparked programmatically by an annual event that happens where I live in Winnipeg. Folklorama is a two week festival that celebrates the multiculturalism in our city.

    image

    “Pavilions” are set up in various venues (arenas, churches, community centres) to showcase a particular country/culture. Each pavilion has a stage performance, cultural displays, and serves authentic ethnic food and drink.

    This is the 50th year of Folklorama. So I remember it as a kid. The 40 pavilions were something that I always wanted to do a bike tour of, as they were spread all over our city. Being interested in mathematics, I was curious about a way to optimize my route and find the shortest possible way to bike to all of the pavilions.

    But 50 years ago was well before we had personal computers or the internet. And route traversal problems, especially this one which was known as the Travelling Salesman problem, were computationally difficult to solve back then, even on the mainframe computers at the time.

    This year’s version of Folklorama got me thinking: maybe the problem is easily solvable today. I took a look online and was very surprised by what I found. There is a Google Developers site that I didn’t know about.

    image

    And at that site, they had all sorts of OR-Tools.  OR stands for Operations Research which is the name of the field that deals with analytical methods to make better decisions. The Traveling Salesman problem is in that field and has its own page at Google Developers:

    image

    Not only that, but they explain the algorithms and present the programs in four different programming languages:  Python, C++, Java and C#.

    Now, I’m a Delphi developer, and I use Delphi for development of Behold and Double Match Triangulator. I’ve never used the four programming languages given. But I’ve been looking for a quick and easy-to-program language to use for smaller tasks such as analysis of raw data files from DNA tests, or even analysis of the huge 100 GB BAM files from my Whole Genome Sequencing test.

    Over the last year or so, I had been looking with interest at the language Python (which is not named after the snake, but after Monty Python’s Flying Circus). Python has been moving up in popularity because it is a modern, fast, interpreted, concise, powerful, extensible and free language that can do just about anything, and can even do a Hello World in just one line. It sort of reminds me of APL (but without the Greek letters), which was my favorite programming language when I was in University.

    Well what better time to try Python than now to see if I can run that Travelling Salesman problem.

    So this morning I installed the Windows version of Python on my computer. It normally runs from a command prompt, but there is a development environment called IDLE that comes with it and makes it easier to use.

    It didn’t take me too long to go through the first few topics of the Tutorial and learn the basics of the language.  I threw in the Traveling Salesman code and sample data from the Google Developers site, and I got an error. The Python ortools package was missing. It took me about an hour to figure out how to use the Python PIP (package manager) to add ortools. Once I did, the code ran like a charm.
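
    For reference, the Google Developers routing example essentially boils down to the following minimal sketch. The matrix of bike times here is a made-up placeholder with home at index 0; the real data comes later:

        from ortools.constraint_solver import pywrapcp, routing_enums_pb2

        # Placeholder bike times in minutes between locations; index 0 is home.
        times = [
            [0, 44, 53, 30],
            [44, 0, 21, 40],
            [53, 21, 0, 25],
            [30, 40, 25, 0],
        ]

        manager = pywrapcp.RoutingIndexManager(len(times), 1, 0)   # 1 "vehicle", start/end at 0
        routing = pywrapcp.RoutingModel(manager)

        def time_callback(from_index, to_index):
            return times[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

        transit = routing.RegisterTransitCallback(time_callback)
        routing.SetArcCostEvaluatorOfAllVehicles(transit)

        params = pywrapcp.DefaultRoutingSearchParameters()
        params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC

        solution = routing.SolveWithParameters(params)
        if solution:
            index, route = routing.Start(0), []
            while not routing.IsEnd(index):
                route.append(manager.IndexToNode(index))
                index = solution.Value(routing.NextVar(index))
            route.append(manager.IndexToNode(index))
            print("Total minutes:", solution.ObjectiveValue(), "Route:", route)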

    Fantastic. Now, can I use it for my own purpose? First, I had the map of all the Pavilion locations:

    image

    There were 22 pavilions in week 1, of which 4 were at our Convention Centre downtown, so in effect there were 19 locations, plus my home where I would start and end from, so 20 in total.

    Now how to find the distances between each pavilion?  Well, that’s a fairly simple and fun thing to do. You can do it on Google Maps by selecting the start and end address. Choosing the bicycle icon, it would show me possible routes and the amount of time it would take to bike them.

    For instance, to go from the Celtic Ireland Pavilion to the Egyptian Pavilion, Google Maps suggested 3 possible bike routes taking 44 minutes, 53 minutes or 47 minutes. I would choose the quickest one, so I’d take the 44 minute route.

    image

    Now it was just a matter of using Google Maps to find the time between each of the 20 locations. That’s 20 x 19 / 2 = 190 combinations!  Google Maps does have a Distance Matrix API to do it programmatically, but I figured doing this manually once would take less time than figuring out the API. And besides, I liked seeing the routes that Google Maps was picking for me. Google Maps did remember my last entries, so I only had to enter the street number to change the starting or ending location. It wouldn’t take that long.
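
    (Had I gone the API route instead, a rough sketch with the requests library might have looked something like this. The key and addresses are placeholders; I didn’t actually run this:)

        import requests

        API_KEY = "YOUR_KEY"                 # placeholder
        addresses = ["my home address", "pavilion 1 address", "pavilion 2 address"]

        resp = requests.get(
            "https://maps.googleapis.com/maps/api/distancematrix/json",
            params={
                "origins": "|".join(addresses),
                "destinations": "|".join(addresses),
                "mode": "bicycling",
                "key": API_KEY,
            },
        ).json()

        # Matrix of bike times in minutes, ready to feed to the routing sketch above.
        times = [[e["duration"]["value"] // 60 for e in row["elements"]]
                 for row in resp["rows"]]
        print(times)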

    At 1 p.m. was the Legacy Family Tree webinar that I was registered for: “Case Studies in Gray: Identifying Shared Ancestries Through DNA and Genealogy” by Nicka Smith.

    image

    It was a fantastic webinar. Nicka is a great speaker.

    And while I had the webinar on my right monitor, I was Google mapping my 190 combinations on my left monitor and entering them into my Python data set:

    image

    I finished my data entry just about when the webinar ended at 2:30 pm CST.

    Next, I ran the program with my own data, and literally in the blink of an eye, the program spewed out the optimal bike route:

    image

    After 50 years of wanting to one day do this, it only took 6 hours to install and use a new language for the first time, enter 190 routes into Google Maps, load the data, find my answer, and enjoy a wonderful webinar.

    So tomorrow morning, it will be back to working on version 3 of DMT in the morning, followed by what should be a very pleasant 4 hour (247 minute) afternoon bike ride to all 23 week 1 Folklorama pavilions along the optimal route.

    image

    And maybe next week, I’ll do the same for the week 2 pavilions.

    Finally, Interesting Possibilities to Sync Your Data

    2019. május 18., szombat 7:12:25

    Although I don’t use Family Tree Maker (FTM) per se, I am very interested in its capabilities and syncing abilities. FTM, along with RootsMagic, are the only two programs that Ancestry has allowed to use the API that gives them access to the Ancestry.com online family trees. Therefore they are the only two programs that can directly download data from, upload data to, and sync your family tree files on your computer with your trees up at Ancestry.


    RootsMagic

    RootsMagic currently has its TreeShare function to share the data between what you have in RootsMagic on your computer, and what you have on Ancestry. It will compare for you and show you what’s different. But it will not sync them for you. You’ll have to do that manually in RootsMagic, one person at a time using the differences.

    image

    That is likely because RootsMagic doesn’t know which data is the data you’ve most recently updated and wants you to verify any changes either way. That is a good idea, but if you are only making changes on RootsMagic, you’ll want everything uploaded and synced to Ancestry. If you are only making changes on Ancestry, you’ll want everything downloaded and synced to RootsMagic.

    With regards to FamilySearch, RootsMagic does a very similar thing. So basically, you can match your RootsMagic records to FamilySearch and sync them one at a time, and then do the same with Ancestry. But you can’t do it all at once, or sync Ancestry and FamilySearch with each other.

    With regards to MyHeritage, RootsMagic only incorporates their hints, and not their actual tree data.


    Family Tree Maker

    Family Tree Maker takes the sync with Ancestry a bit further than RootsMagic, offering full sync capabilities up and down.

    image

    For FamilySearch, FTM up to now only incorporates their hints and allows merging of Family Search data into your FTM data, again one person at a time. But Family Tree Maker has just announced their latest upgrade, and they include some new FamilySearch functionality.

    What looks very interesting among their upcoming features that I’ll want to try is their “download a branch from the FamilySearch Family Tree”. This seems to be an ability to bring in new people, many at a time, from FamilySearch into your tree.


    Family Tree Builder

    MyHeritage’s free Family Tree Builder download already has full syncing with MyHeritage’s online family trees.

    image

    They do not have any integration with their own Geni one-world tree, which is too bad.

    But in March, MyHeritage announced a new FamilySearch Tree Sync (beta) which allows FamilySearch users to synchronize their family trees with MyHeritage. Unfortunately, I was not allowed to join the beta and test it out as currently only members of the Church of Jesus Christ of Latter-Day Saints are allowed. Hopefully they’ll remove that restriction in the future, or at least when the beta is completed.


    Slowly … Too Slowly

    So you can see that progress is being made. We have three different software programs and three different online sites that are slowly adding some syncing capabilities. Unfortunately they are not doing it the same way and working with your data on the 6 offline and online platforms is different under each system.

    The very promising Ancestor Sync program was one of the entrants in the RootsTech 2012 Developer Challenge along with Behold. I thought Ancestor Sync should have won the competition. Dovy Paukstys, the mastermind behind the program had great ideas for it. It was going to be the program that would sync all your data with whatever desktop program you used and all your online data at Ancestry, FamilySearch, MyHeritage, Geni and wherever else. And it would do it with very simple functionality. Wow.

    This was the AncestorSync website front page in 2013 retrieved from archive.org.
    image

    They had made quite a bit of progress. Here is what they were supporting by 2013 (checkmarks) and what they were planning to implement (triangles):

    image

    Be sure to read Tamura Jones’ article from 2012 about AncestorSync Connect which detailed a lot of the things that Ancestor Sync was trying to do.

    Then read Tamura’s 2017 article that tells what happened to AncestorSync and describes the short-lived attempt of Heirlooms Origins to create what they called the Universal Genealogy Transfer Tool.


    So What’s Needed?

    I know what I want to see. I want my genealogy software on my computer to be able to download the information from the online sites or other programs, show the information side by side, and allow me to select what I want in my data and what information from the other trees I want to ignore. Then it should be able to upload my data the way I want it back to the online sites, overwriting the data there with my (understood to be) correct data. Then I can periodically re-download the online data to get new information that was added online, remembering which online data I wanted to ignore, and do this “select what I want” again.

    I would think it might look something like this:

    image

    where the items from each source (Ancestry, MyHeritage, FamilySearch and other trees or GEDCOMs that you load in) would be a different color until you accept them into your tree or mark them to ignore in the future.

    By having all your data from all the various trees together, you’ll easily be able to see what is the same, what conflicts, and what new sources are brought in to look at, and you can make decisions based on all the sources you have as to what is correct and what is not.

    Hmm. That above example looks remarkably similar to Behold’s report.

    I think we’ll get there. Not right away, but eventually the genealogical world will realize how fragmented our data has become, and will ultimately decide that they need to see all their data from all sites together.

    Determining VCF Accuracy

    2019. május 14., kedd 7:12:05

    In my last post, I was able to create a raw data file from the Whole Genome Sequencing (WGS) BAM file using the WGS Extract program. It seemed to work quite well.

    But in my previous post to that, WGS – The Raw VCF file and the gVCF file, I was trying to see if I could create a raw data file from the Variant Call Format (VCF) file. I ended that post with a procedure that I thought could generate a raw data file, which was:

    1. Make a list of all the SNPs you want raw data for.
    2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need to set this up initially. Wilhelm HO has a good set of them included with his DNA Kit Studio.
    3. The positions of variants in your gVCF file should be marked as no-calls. Many of these variants are false, but we don’t want them to break a match.
    4. The positions of variants in your filtered VCF should be marked as having that variant. This will overwrite most of the optimistic no-calls marked in step 3 with filtered reliable values.

    When I wrote that, I had thought that the gVCF file contained more variants in it than the Raw VCF file. During my analysis since then, I found out that is not true. The Raw VCF contains all the unfiltered variants. Everything that might be considered to be a variant is in the Raw VCF file. The gVCF includes the same SNP variants that are in the Raw VCF file, but also includes all the insertions/deletions as well as about 10% of the non-variant positions. It’s the non-variant positions that make the gVCF such a large file.

    So right away, in Step 3 of the above proposed procedure, the Raw VCF file can be suggested instead of the gVCF file and will give the same results. That is a good thing since the Raw VCF file is much smaller than the gVCF file so it will be faster to process. Also the Raw VCF file and the filtered VCF file include the same fields. My gVCF included different fields and would need to be processed differently than the other two.

    (An aside: I also found out that the gVCF supplied to me by Dante did not have enough information in it to determine what the variant is. It gives the REF and ALT field values, but does not include the AC field. The AC (allele count) field gives the count of the ALT value, either 1 or 2.

    • If REF=A, ALT=C, AC=1, then the variant value is AC.
    • If REF=A, ALT=C, AC=2, then the variant value is CC
    • If REF=A, ALT=C, AC is not given, then the variant value can be AC or CC.

    For me to make any use of my gVCF file, for not just this purpose but any purpose, I would have to go back and ask Dante to recreate it for me and include the AC field in the variant records.  End aside.)
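
    Just to make that AC rule concrete, here is a purely illustrative snippet (my own interpretation, not anything Dante provides):

        def genotype(ref, alt, ac=None):
            # Derive the genotype from a variant record's REF, ALT and allele count (AC).
            if ac == 1:
                return ref + alt    # heterozygous: one reference allele, one alternate
            if ac == 2:
                return alt + alt    # homozygous alternate
            return None             # AC missing: could be REF+ALT or ALT+ALT -- ambiguous

        print(genotype("A", "C", 1))   # AC
        print(genotype("A", "C", 2))   # CC
        print(genotype("A", "C"))      # None -- can't tell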


    Estimating Type 1 and Type 2 Errors

    We now need to see if the above procedure using the Raw VCF file in step 3 and the filtered VCF file in step 4 will be accurate enough to use.

    We are dealing with two types of errors.

    Type 1: False Positive: The SNP is not a variant, but the VCF file specifies that it is a variant.

    Type 2: False Negative:  The SNP is a variant, but the VCF file specifies that it is not a variant.

    Both are errors that we want to minimize, since either error will give us an incorrect value.

    To determine the Type 1 and Type 2 error rates, I used the 959,368 SNPs that the WGS Extract program produced for me from my BAM file. That program uses a well-developed and respected genomic library of analysis functions called samtools, so the values it extracted from my WGS via my BAM file are as good as they can get. It is essential that I have values as correct as possible for this analysis, so I removed 2,305 values that might be wrong because some of my chip test results disagreed with them. I also removed 477 values that WGS Extract included but that were at insertion or deletion positions.

    From the remaining values, I could only use positions where I could determine the reference value. This included 458,894 variant positions, which always state the reference value, as well as the 10% or so of non-variant reference values that I could determine from my gVCF file. That amounted to 42,552 non-variants.

    Assuming these variant and non-variant positions all have accurate values from the WGS extract, we can now compute the two types of errors for my filtered VCF file and for my Raw VCF file.

    image

    In creating a VCF, the filtering is designed to eliminate as many Type 1 errors as possible, so that the variants you are given are almost surely true variants. The Raw VCF only had 0.13% Type 1 errors, and the filtering reduced this to a very small 0.08%.

    Type 1 and Type 2 errors work against each other. Doing anything to decrease the number of Type 1 errors will increase the number of Type 2 errors and vice versa.

    The Raw VCF file turns out to have only 0.06% Type 2 errors, quite an acceptable percentage. But this gets increased by the filtering to a whopping 0.76%.

    This value of 0.76% represents the number of true variants that are left out of the filtered VCF file. This is what causes the problem with using the filtered VCF file to produce a raw data file. When the SNPs that are not in the filtered VCF file are replaced by reference values, they will be wrong. These extra errors are enough to cause some matching segments to no longer match. And a comparison of a person’s raw DNA with his raw DNA generated from a filtered VCF file will not match well enough.

    If instead, the Raw VCF file is used, the Type 2 errors are considerably reduced. The Type 1 errors are only slightly increased, well under worrisome levels.

    Since there are approximately the same number of variants as non-variants among our SNPs, the two error rates can be averaged to give you an idea of the percentage of SNPs expected to have an erroneous value.  Using the Raw VCF instead of the filtered VCF will reduce the overall error rate down from 0.42% to 0.09%, a 79% reduction in errors.

    This could be reduced a tiny bit more. If the Raw VCF non-variants are all marked as no-calls, and then the Filtered VCF non-variants are replaced by the reference values, then 20 of the 55 Type 1 Errors in my example above, instead of being wrong, will be marked as no-calls. No-calls are not really correct, but they aren’t wrong either. For the sake of reducing the average error rate from 0.09% to 0.07%, it’s likely not worth the extra effort of processing both VCF files.


    Conclusion

    Taking all the above into account, my final suggested procedure to create a raw data file from a VCF file is to use only the Raw VCF file and not the filtered VCF file, as follows:

    1. Make a list of all the SNPs you want raw data for.
    2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need to set this up initially. Wilhelm HO has a good set of them included with his DNA Kit Studio.
    3. Mark the positions of the variants in your Raw VCF with the value of that variant. These will overwrite the reference values assigned in step 2.

    Voila!  So from a Raw VCF file, use this procedure. Do not use a filtered VCF file.
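
    In case it helps anyone, here is a minimal sketch of that procedure in Python. It is only an outline, not DNA Kit Studio’s implementation: the file names are placeholders, the template is assumed to be a tab-separated list of the SNPs you want with their reference alleles, and the Raw VCF is assumed to be uncompressed:

        def load_template(path):
            # Steps 1 and 2: the list of wanted SNPs, pre-filled with
            # homozygous reference genotypes.
            snps = {}
            with open(path) as f:
                for line in f:
                    rsid, chrom, pos, ref = line.rstrip("\n").split("\t")
                    snps[(chrom, pos)] = [rsid, ref + ref]
            return snps

        def apply_raw_vcf(snps, vcf_path):
            # Step 3: overwrite the reference genotypes with the Raw VCF's variant calls.
            with open(vcf_path) as f:
                for line in f:
                    if line.startswith("#"):
                        continue
                    fields = line.split("\t")
                    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
                    if (chrom, pos) in snps:
                        # Without the AC or GT fields we can't tell heterozygous from
                        # homozygous-alternate, so REF+ALT is used as a stand-in here.
                        snps[(chrom, pos)][1] = ref + alt
            return snps

        snps = apply_raw_vcf(load_template("template_snps.txt"), "raw.snp.vcf")
        with open("constructed_raw_data.txt", "w") as out:
            for (chrom, pos), (rsid, genotype) in snps.items():
                out.write(f"{rsid}\t{chrom}\t{pos}\t{genotype}\n")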

    If you have a BAM file, use WGS Extract from yesterday’s post.




    Update: May 14: Ann Turner pointed out to me (in relation to my “Aside” above) that in addition to the AC (allele count) field, the GT (genotype) field could supply the information to correctly identify what the variant is. Unfortunately, the gVCF file Dante supplied me with has missing values for that field.

    I’ve looked at all the other fields in my gVCF file. Entries that leave out the BaseQRankSum and ClippingRankSum fields often indicate a homozygous variant, but I’ve found several thousand SNPs among the variants that constitute too many exceptions to use this as a "rule".

    Wilhelm HO is working on implementing the sort of procedure I suggest into his DNA Kit Studio. It likely will be included when he releases Version 2.4, and his tool will then be able to produce a raw data file from a VCF file and will also extract a mtDNA file for you that you can upload to James Lick’s site for mtDNA Haplogroup analysis.

    Creating a Raw Data File from a WGS BAM file

    2019. május 13., hétfő 6:12:13

    I was wondering in my last post if I could create a raw data file from my Whole Genome Sequencing (WGS) results that could be uploaded to GEDmatch or a DNA testing company. I was trying to use one of the Variant Call Format (VCF) files. Those only include where you vary from the human reference. So logically you would think that all the locations not listed must be human reference values. But that was giving less than adequate results.

    Right while I was exploring that, a beta was announced for a WGS Extract program. It works in Windows and you can get it here.

    image

    This is not a program for the fainthearted. The download is over 2 GB because it includes the reference genome in hg19 (Build 37) and hg38 (Build 38) formats. It also includes a Windows version of samtools, which it runs in the background, as well as the full Python language.

    I was so overwhelmed by what it brought that I had to ask the author how to run the program. I was embarrassed to find out that all I had to do was run the “start.bat” file that was in the main directory of the download, which opens up a command window that automatically starts the program for you, bringing up the screen I show above.

    WGS Extract has a few interesting functions, but let me talk here about that one labeled “Autosomes and X chromosome” with the button: “Generate file in 23andmeV3 format”.  I selected my BAM (Binary Sequence Alignment Map) file, a 110 GB file I received by mail on a 500 GB hard drive (with some other files) from Dante. I pressed the Generate file button, and presto, 1 hour and 4 minutes later, a raw data file in 23andMe v3 format was generated as well as a zipped (compressed) version of the same file.

    This was perfect for me. I had already tested at 5 companies, and had downloads of FTDNA, MyHeritage, Ancestry, Living DNA and 23andMe v5 raw data files. I had previously combined these 5 files into what I call my All 5 file.

    The file WGS Extract produced had 959,368 SNPs in it. That’s a higher number of SNPs than most chips produce, and since it was based on the 23andMe v3 chip, I knew there should be quite a few SNPs in it that hadn’t been tested by my other 5 companies.

    You know me. I did some analysis:

    image

    The overlap (i.e. SNPs in common) varied from a high of 693,729 with my MyHeritage test, to a low of 183,165 with Living DNA. These are excellent overlap numbers – a bit of everything.

    Each test had a number of no-calls, so I compared all the other values with what WGS Extract gave me, and there was 98.1% agreement. That’s about a 2% error that is either in the chip tests or in the WGS test, but from this I cannot tell whether it’s the chips or the WGS that have the incorrect values. But in each case, one of them does.

    When I compare this file to my All 5 file, which has 1,389,750 SNPs in it, I see that there are an extra 211,747 SNPs in my WGS file. That means I’ll be able to create a new file, an All 6 file, that will have 1,601,497 SNPs in it.

    More SNPs don’t mean more matches. In fact they usually mean fewer matches, but better matches. The matches that are more likely to be false are the ones that get excluded.

    In addition to including the new SNPs, I also wanted to reconcile the 747,621 SNPs in the WGS file that are also in my All 5 file. As noted in the above table, I had 2,305 SNPs whose values disagreed, so I changed them to no calls. No calls are the same as an unknown value and, for matching purposes, are always considered to be a match. Having more no calls will make you more “matchy” and, like having less overlap, you’ll have more false matches. The new SNPs added included another 905 no calls. But then, of the 20,329 no calls I had in my All 5 file, the WGS test had values for 9,993 of them.

    So my number of no calls went from:

    20,329 + 2,305 + 905 - 9,993 =  13,546, a reduction of 6,783.

    I started with 20,329 no calls in 1,389,750 SNPs (1.5%),
    and reduced that to 13,546 no calls in 1,601,497 SNPs (0.8%)
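
    The merging rules I used can be sketched roughly like this (a simplified illustration keyed by chromosome and position, not my actual program):

        def merge(all5, wgs):
            merged = dict(all5)
            for key, wgs_value in wgs.items():
                if key not in merged:
                    merged[key] = wgs_value      # new SNP from the WGS file
                elif merged[key] == "--":
                    merged[key] = wgs_value      # WGS fills in an old no-call
                elif wgs_value == "--":
                    pass                         # keep the existing chip value
                elif merged[key] != wgs_value:
                    merged[key] = "--"           # values disagree: change to a no-call
            return merged

        all5 = {("1", 100200): "TT", ("1", 100300): "--"}
        wgs  = {("1", 100200): "CC", ("1", 100300): "AA", ("1", 100400): "AG"}
        print(merge(all5, wgs))
        # {('1', 100200): '--', ('1', 100300): 'AA', ('1', 100400): 'AG'}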

    A few days ago, I was wondering how much work it would take to get raw data for the SNPs needed for genealogical purposes out of my WGS test. A few days later, with this great program, it turns out to be no work at all. (It probably was a lot of work for the author, though.)

    I have uploaded both the 23andMe v3 file and my new All 6 file to GEDmatch to see how both do at matching. I’ve marked both as research. But I expect that once the matching process is completed, I’ll make my All 6 file my main file and relegate my All 5 file back to research mode.

    Here are the stats at GEDmatch for those who know what these are:

    WGS Extract SNPs:  original 959,368; usable 888,234; slimmed 617,355
    All 5 SNPs: original 1,389,750; usable 1,128,146; slimmed 813,196
    All 6 SNPs: original 1,601,497; usable 1,332,260; slimmed 951,871

    WGS – The Raw VCF file and the gVCF file

    2019. május 7., kedd 19:55:15

    As I noted in my last post, Whole Genome: The VCF File, Part 2, the SNP VCF (Variant Call Format) file that Dante Labs gives testers of WGS (Whole Genome Sequencing) does not quite have everything that is needed to generate a raw data file that can be uploaded to various DNA sites for people matching.

    The VCF file contains SNPs that vary from the standard human reference genome. The thinking is then that any SNPs needed for the raw data file (to match with the SNPs that are tested by the chips used by Ancestry, 23andMe, Family Tree DNA, MyHeritage DNA and Living DNA) that are not in the file can simply be replaced by the value from the standard human reference.

    But they can’t.

    The problem is that the VCF contains only those variants that pass a set of filters meeting quality controls to ensure that the read was good enough. Those SNPs that fail the filters are not included in the VCF file. So some actual variants that should be in the VCF file don’t make it. These are known as false negatives. Substituting the standard reference genome values for those will give you incorrect values.

    How many might there be? Well take a look again at the table from my last post.

    image

    When I created my “All 5” raw data file, it ended up containing 1,343,424 SNPs.

    My VCF file contains 3,442,712 variant SNPs. Of those, 462,305 (13.4%) were at positions in my All 5 file. 

    The green highlighted values are those that matched the consensus of my All 5 file. If you exclude the no-calls and deletions in my All 5 file, then that’s 453,534 out of (462,305 – 2 – 4 – 2 – 6,156 =) 456,141 or 99.4% agreement on the values. Only 0.6% disagree which isn’t too bad.

    But the real problem is those yellowed values. Those are SNPs with two different allele values (heterozygous), and they by definition must be variants. Yet the VCF file does not include them. These amount to another 10,705 SNPs that either were wrong in the chip reads or should have been in the VCF file but were not. This means that as many as 2.3% of the values that should be in the VCF may not be included in it.

    There could be others as well. The AA, CC, GG and TT values just above the yellow cells that are not in the VCF may have been variants but not included in the VCF file. e.g. If the reference value was G and the SNP value was AA, then this homozygous read is a variant and should be among the green values in the AA row and AA column. But the VCF may not contain it. We can’t tell how many of these there might be until/unless we get a third test done to compare.

    It’s for these false-negative values that substituting the reference genome value would be incorrect, building you a substandard raw data file with likely thousands of incorrect values in it.


    The Raw VCF File

    Dante gives you access to your filtered VCF file. There is another file available that you can get. If you copy the link to your Dante Labs VCF file and change the end of the filename from “snp.vcf.gz” to “raw.snp.vcf.gz”, you can download what I’m calling the Raw VCF file.

    This file contains the unfiltered variants found by your whole genome test. None of the quality measures were applied to this file. The quality filtering that Dante does is designed to remove most of the false positives, i.e. to prevent SNPs that are not variants from being reported as variants.

    My Raw VCF contains 3,973,659 SNPs. It includes every one of the 3,442,712 SNPs in my filtered SNP VCF file, plus the 712,064 SNPs (17.9%) that it found but that were filtered out because the quality wasn’t high enough to be sure they were true variants.

    So you can use this Raw VCF file to get more of those false negatives back, but by doing so, you will also add false positives.

    It’s a tricky situation. It’s a trade-off like a teeter-totter, with false positives on one side and false negatives on the other. Dante picked a set of filters that presumably take a good compromise position that does its best to minimize the two types of mistakes.

    The one nice thing about this Raw VCF file is that it includes my 46 mtDNA variants. The VCF filters somehow removed these and they are not in the filtered VCF file. Once I got these, I was able to put them into 23andMe raw data format and upload it to James Lick’s mtDNA Haplogroup utility. The format I used is:

    i703360 MT 114 T

    Since the raw file doesn’t give the RSID, I just used i703360 for everything, hoping that James Lick didn’t use it, and it appears he doesn’t. I used MT as the 2nd field because that’s what 23andMe has. 114 is the position number. T is the value. Those are Tabs (Hex 09) between the fields, not spaces.
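
    For anyone wanting to do the same, a rough sketch of the conversion might look like this (the file names are placeholders, and it assumes the mtDNA records are labelled MT or chrM in the CHROM column):

        with open("raw.snp.vcf") as vcf, open("mtdna_23andme.txt", "w") as out:
            for line in vcf:
                if line.startswith("#"):
                    continue
                chrom, pos, _id, ref, alt = line.split("\t")[:5]
                if chrom in ("MT", "chrM"):
                    # dummy RSID, "MT", the position, and the variant value, tab-separated
                    out.write(f"i703360\tMT\t{pos}\t{alt}\n")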

    I know these were my mtDNA variants because Lick’s routine gave me my correct mt haplogroup: K1a1b1a.

    However, for most purposes, this Raw VCF File is likely worse to use than the filtered VCF file since it will include more false positives (variants that are not really variants) than will the filtered VCF file.

    And the Raw VCF file also won’t help in our goal to produce a raw data file that can be uploaded to sites for people matching. Reducing false negatives is good, but increasing false positives is bad.


    The Genome VCF (gVCF) File

    So we still want/need/hope for a way to produce that raw data file from our WGS results. Too many errors are introduced by adding human reference values where there are no values in the VCF file. Using the Raw VCF file will recover some of the false negatives, but it also will increase the false positives, which is not a good trade-off.

    There is an intermediate file. It is called the gVCF or Genome VCF file. It contains all the reads of the Raw VCF file and then is said to fill in the gaps with the human reference genome.

    Well, that really doesn’t help. It is still basically the Raw VCF file. All it does is supposedly make the lookup of the human reference genome a little simpler.

    I requested my gVCF from Dante and they made it available to me. It was 3.6 GB in size and took 30 minutes for me to download. It is a compressed file and took 5 minutes to unzip. The full file is 24.6 GB.

    Here’s a snippet from it:

    image

    The long lines are the variants. The short lines are the positions where I don’t have a variant.

    Position 14464 on Chr 1 is the first variant in my filtered VCF file. This file contains all the variants in my filtered VCF file.

    Positions 14464 and 14653 are my 6th and 7th variants in my Raw VCF file. This file contains all the variants in my Raw VCF file.

    But positions 14523, 14542, 14574, 14673, 14907 and 14930 are variants that are in this file, but not in either my Raw VCF file or my filtered VCF file. So there must be even more filtering done before the variants make it to the Raw VCF file. Maybe some of the values on the line (like the BaseQRankSum, for example) indicate uncertain reads and have something to do with them being left out. None-the-less, I wouldn’t get excited about adding these additional variants, because many will likely be false positives if you do.

    At least the hope is that you’ll get every read in every position from this file.

    But no. It doesn’t even do that.

    The lines given that are not variants often contain an END value. e.g., the first 3 lines I displayed above contain:

    14373 . T END=14379
    14380 . C END=14383
    14384 . T END=14396

    Does the value T represent all the values from positions 14373 to 14379, or does it just represent the first position? My perusal of the VCF specs finds:

    The value in the POS field refers to the position of the first base in the String.

    To verify, I took some homozygous SNPs from my DNA tests that agree between different companies:

    Chr 1: position 1040026 was read as TT by all 5 companies. My gVCF has:

    image

    So it says C at position 1039779, ending at position 1040048. That’s not T.

    Try another: Chr 1: position 1110019 was read as AA by all 5. My gVCF has:

    image

    Here it says C at position 1109397, ending at position 1110155. That’s not A.

    So the value shown refers to the first position. You do not know what the other positions up to the end positions hold.
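
    To make that concrete, here is a small sketch of reading one of those non-variant block lines, assuming the standard gVCF column layout (the sample line is made up and padded out to full columns):

        def parse_reference_block(line):
            # Standard gVCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
            tags = dict(kv.split("=") for kv in info.split(";") if "=" in kv)
            end = int(tags.get("END", pos))
            # Only the first base of the block is actually reported;
            # positions pos+1 .. end are merely covered, not given values.
            return chrom, {pos: ref}, (pos + 1, end)

        print(parse_reference_block("1\t14373\t.\tT\t<NON_REF>\t.\t.\tEND=14379\tGT:DP\t0/0:24"))
        # ('1', {14373: 'T'}, (14374, 14379))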

    And in fact, my gVCF file contains 308,122,966 data lines. That is about 10% of the 3 billion base pairs we have. So only 1 out of 10 positions are reported in the gVCF file with either your variant value or the human genome value.

    None-the-less, it doesn’t matter whether the gVCF file contains all the human reference genome values or not. The variants in it are even more liberal than those in the Raw VCF file and would introduce even more false positives if it were used to generate a raw data file.


    A Possible Solution

    Between the filtered VCF, the Raw VCF and the gVCF files, none of them alone have the data in it to generate a raw data file that can be uploaded to GEDmatch and other DNA testing sites. And they have a sliding range of variants from optimistic (gVCF) to liberal (Raw VCF) to conservative (filtered VCF).

    The problem is that DNA results include both false positives and false negatives. DNA testing companies get around the uncertain values by indicating them to be no-calls. That works because no-calls are always considered a match, and therefore they won’t break a matching segment. As long as there are not too many of them (i.e. 5% or less), the segment matching should work well.

    So I believe we can generate a raw data file by doing this:

    1. Make a list of all the SNPs you want raw data for.
    2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need to set this up initially. Wilhelm HO has a good set of them included with his DNA Kit Studio.
    3. The positions of variants in your gVCF file should be marked as no-calls. Many of these variants are false, but we don’t want them to break a match.
    4. The positions of variants in your filtered VCF should be marked as having that variant. This will overwrite most of the optimistic no-calls marked in step 3 with filtered reliable values.

    I likely will try this myself. When I do, it will be worthy of its own blog post.

    Whole Genome: The VCF File, Part 2

    2019. április 23., kedd 4:36:32

    A couple of months ago, I compared my VCF file to my DNA test results.

    The Variant Call Format (VCF) file is given to you when you do a Whole Genome Sequence (WGS) test. That test finds your DNA values for your whole genome, all 3 billion positions, not just the 700,000 or so positions that a standard DNA test gives you.

    But most of those 3 billion positions are the same for most humans. The ones that differ are called Single-Nucleotide Polymorphisms (SNPs) because they “morph” and can have differing values among humans. The standard DNA companies test a selection of the SNPs that differ the most, and they can use the 700,000 they selected for matching people without having to test all 3 billion positions. It works very well. WGS tests are not needed for finding relatives.


    Converting VCF to a Raw Data File

    But near the end of my last post, I was trying to see if the VCF file could be converted into a raw data file that could be uploaded to GEDmatch or a DNA company that allows raw data uploads.

    My VCF file contains 3,442,712 SNPs whose values differ for me from the standard human reference genome. Of those, I found 471,923 SNPs that were the same SNPs (by chromosome and position) as those in my raw data file that I created by combining the raw data from 5 companies (FTDNA, MyHeritage, Ancestry, 23andMe and LivingDNA). I compared them in my first analysis and found that 2,798 of them differed, which is only 0.6%.

    At the time, I didn’t think that was too bad an error rate. So I thought a good way to make a raw data file from a VCF file would be:

    1. Take a raw data file you already have to use as a template.
    2. Blank out all the values
    3. Add values for the positions that are in the VCF file
    4. Fill in the others with the human reference genome value.

    The basis of that idea is that if it’s not a variant in the variant file, then it must be the reference value.

    Today on Facebook, Ann Turner told me that that’s not necessarily the case. The reason, she believes, is that the VCF file does not contain all the variant SNPs. And the discrepancies were enough to break her comparison of “herself” with “herself” into 161 segments.


    So What’s Really Different Between VCF and Raw Data?

    In my first analysis, I only compared whether the values were the same or not, giving that 0.6% difference. I did not look at the specific values. Let’s do that now:

    image

    For this analysis, let’s not worry about the rows: DD (Deletions), DI (Deletion/Insertions), II (Insertions) or – (no-calls), since they are only in the raw data and not in the VCF file.

    The green values down the diagonal are the agreement between the All-5 raw data file and the VCF file. Any numbers above and below that diagonal are disagreements between the two. Those are the 0.6% where one is wrong for sure, but we don’t know which.

    But let me now point you to those yellowed numbers in the “Not in VCF” column. Those are all heterozygous values, with two different letters. AC, AG, AT, CG, CT or GT. If they have two different letters, then they cannot be human reference values. One of the two letters is a variant and those entries should have been in the VCF file. But they were not.

    This creates an even bigger concern than our earlier 0.6% mismatch. If we total these yellow counts, we find there are 10,705 of them, or 1.2% of the 881,119 SNPs that are not in the VCF file, that should have been in the VCF file.

    Again, we don’t know which is wrong, the raw data or the VCF file. But from Ann’s observations, we’d have to say at least some of those heterozygous values must have been left out, and when reference values were added instead to Ann’s file, they caused the match breaking that resulted in 161 segments.


    Which is Correct: VCF or Raw Data

    When you are comparing two things, and you know one is wrong, you don’t know which of the two is the wrong one. You need others to compare with. I am awaiting the results of my long read WGS test, and when that comes I’ll have a third to compare.

    But until then, can I get an idea of which of the two files might be more often correct? There’s one thing I can do.

    I can provide the same table as I did above, but for the X and Y chromosomes. Since I’m a male, I only have one X and one Y chromosome. The value could be shown as a single value, but it is still read as a double value and therefore shown as a double value. The caveat is that the two letters must be the same or the read is definitely incorrect. (Note that this table excludes 688 SNPs that are in the pseudoautosomal region of the X or Y which can recombine and have two different allele values).
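
    That caveat is simple enough to express as a check (a tiny illustrative sketch that ignores the pseudoautosomal positions):

        def is_bad_male_read(chrom, genotype):
            # For a male, an X or Y genotype (outside the pseudoautosomal regions)
            # with two different letters has to be an incorrect read.
            return chrom in ("X", "Y") and len(set(genotype)) > 1

        print(is_bad_male_read("X", "AA"))   # False -- plausible
        print(is_bad_male_read("Y", "AG"))   # True  -- definitely incorrect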

    So let’s see the table:

    image

    The top left section contains the agreed upon values (in green) between the All 5 raw data file and the VCF file. The counts in that section above and below the green values are disagreements between All 5 and VCF and we don’t know which is correct and which is wrong.

    The numbers in red are incorrect reads. Those on the left side are incorrect reads for the All 5 raw data file. It has 219 incorrect reads versus 40,848 correct reads, a ratio of 1 every 186.

    The right side are incorrect reads for the VCF file. It has 13 incorrect reads versus 9,605 correct reads. That’s a ratio of 1 every 739 reads.

    Now, verging on the realm of hyperbole, the difference in ratios could indicate that an error in a standard DNA test is 4 times (739 / 186) more common than an error in a VCF file.

    And applying that ratio to the 10,705 heterozygous values that should have been in the VCF file, we would say that 8,564 would be because the raw data file is wrong, and 2,141 because the VCF file should have included them but did not.

    And if 2,141 values out of your DNA file created from the VCF file are incorrect, couldn’t that quite easily have caused the 161 segments that Ann observed?

    Yes, this is all conjecture. But the point is that maybe the VCF file is leaving out a significant number of variants. If that is the case, then we can’t just put in a reference value when there is no value in a VCF file. And that would mean a raw data file created from a VCF file and filled in by human reference values may not have enough accuracy to be usable for matching purposes.

    Compare Your Number of DNA Matches Among Companies

    2019. április 21., vasárnap 22:12:21

    I saw a post on Facebook trying to compare the number of relative matches a person had at different DNA testing companies.

    Here’s my results with my endogamy:

    image

    Note that there are a few things to consider when you or anyone else does a comparison of your number of DNA matches.

    A few companies only give you a specific number of matches. GEDmatch gives 2,000, GEDmatch Genesis gives 3,000, and 23andMe limits you to 2,000 but only lets you see those who have opted in to sharing.

    Some companies give you all your matches down to some minimum threshold. The minimum cM is not necessarily the only criteria. Largest segment length and number of segments may be considered. Ultimately, that works out to an effective minimum Total cM for me at Living DNA of 37 cM, at Family Tree DNA of 17 cM, at MyHeritage of 12 cM and at AncestryDNA of 6 cM. Since Ancestry DNA goes right down to matches who have a single segment matching of just 6 cM, I expectedly have a very large total number of matches with them. Even without endogamy, you will likely have your largest number of matches at AncestryDNA as well because of this low matching limit.

    If you look at only larger matches, you get a completely different story. I counted the number of matches I had that were a total of 50 cM or more. You’ll see I have very few, just 56 matches at AncestryDNA. That’s because Ancestry uses their Timber algorithm to eliminate many segments they consider to be false. Whereas Family Tree DNA has a lot of matches of 50 cM or more simply because they include segments right down to 1 cM in their total, and therefore will show a larger Total cM than the same match would have at another company.

    I’ve added a Database size column. These are numbers I have visually read off of Leah Larkin’s www.theDNAgeek.com/dna-tests chart of Autosomal DNA Database Growth as of April 2019.

    When you divide the matches by the database size, in my case, my largest proportion of the database I match to is 1.7% at Family Tree DNA, and then 1.1% at AncestryDNA.

    All those statistics are just statistics. What’s much more important and what we as genealogists want, are people who we can determine our exact connection to and can determine a Most Recent Common Ancestor for. These are the people who through DNA and genealogy, will help us to expand our family tree.

    My endogamy does give me lots of matches, but I have few connections I can determine because Romanian and Ukrainian records rarely go back further than the mid 1800’s limiting me to about 5 generations genealogically. My best success so far in finding DNA testers whose relationship (well at least one, there may be others) I’ve been able to determine are 11 relatives at AncestryDNA and 10 at 23andMe. At Ancestry, that’s 11 out of the 56 people sharing 50 cM or more.

    Ah the thrill of another cousin testing! Two days ago, a 1st cousin on my Mom’s side showed up at AncestryDNA. And just this morning, a 1C1R on my Dad’s side showed up at 23andMe. Go Testers Go!!

    If you’ve already tested everywhere, GEDmatch and DNA.Land won’t help you on this front, because they only accept uploads from other companies. So you’ll already match with them at the company they originally tested at. But these upload sites will help you with the additional tools they provide.

    My upload to DNA Land does not show any matches for me. I think that’s strange considering that they show 50 matches (their limit) for my uncle’s upload (which I’ve included in the above chart). The 50 matches go down to 48 cM (of what they call Total Recent Shared Length). If you haven’t seen their match report, it is interesting and looks like this:

    image

    In conclusion, each company uses different algorithms to determine what they consider a match. So it is not really possible to fairly compare matches between companies.

    The main thing you want to find from your matches are identifiable relatives. So the best bet is still to fish in as many ponds as you can.

    WGS Long Reads Might Not Be Long Enough

    2019. április 18., csütörtök 8:28:07

    Today my Dante Labs kit for my Whole Genome Sequencing (WGS) Long Reads arrived. Dante became the first company to make WGS Long Reads available to the general public. The price they are charging is $999 USD, but past customers of Dante Labs are eligible for a $200 USD discount putting it down to $799. In 2016 the cost of long read sequencing was around $17K, and they hoped to get the price down to $3K by 2018. Here it is, 2019, and it’s available to the general public at $1K.

    image

    I had purchased a Dante Labs WGS, the standard short reads test, last August (2018) when they had it on sale for $399 USD. That was a great price as they had only a few months earlier lowered it from $999 USD, and a year earlier you’d have had to pay several thousand dollars for any whole genome test from anyone. Dante currently offers their standard short read WGS for $599, but if you want it, you can wait for DNA Day or other sales, and I’m sure it will come down.

    In October, when Dante had my sample, I had started reading about long read WGS technology, so I asked Dante if they had that technology available. They said they did. I asked how much that would be. They said $1,750 USD. I asked if they could do a long reads test from my sample and they checked and said, no, the sample had started sequencing already.

    So I wasn’t able to do the long read test back in October. But it worked out anyway. Now I will have both the short read test and the long read test for $550 less than the long read test alone would have cost just 6 months ago. This is actually excellent because I will be able to analyze the short read test, analyze the long read test, and then compare the two. With just one test you can make no estimate of the error rate, but when you have two tests to compare, each difference represents an error in one of the tests, and an average error rate can be calculated.
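    As a rough sketch of the kind of comparison I have in mind, assuming each test has already been reduced to a dictionary of genotype calls keyed by chromosome and position (the function and variable names here are just for illustration):

        # Estimate a discordance rate between two call sets, e.g. short read vs long read.
        # A disagreement means at least one of the two tests is wrong at that position,
        # so the discordance rate gives a feel for the combined error rate.
        def discordance_rate(calls_a, calls_b):
            shared = calls_a.keys() & calls_b.keys()     # positions called in both tests
            if not shared:
                return 0.0, 0
            differing = sum(1 for pos in shared if calls_a[pos] != calls_b[pos])
            return differing / len(shared), len(shared)

        # Hypothetical usage once both sets of calls are loaded:
        # rate, n = discordance_rate(short_read_calls, long_read_calls)
        # print(f"{rate:.4%} of {n:,} positions called by both tests disagree")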



    What Good is WGS for Genealogical Cousin Matching?

    WGS testing, whether long reads or short reads, provides no help for relative matching. Matching is based on the 700,000 or so SNPs that a company tests. Those SNPs are spread out over the 3 billion base pairs of your genome. The standard DNA tests you take do a good job of identifying those SNPs for matching purposes.

    WGS testing is for determining all your 3 billion base pairs and finding all the SNPs where you vary from the human reference. From my short read WGS test, my VCF file had 3,442,712 entries, which are the SNPs where I differ from the human reference. The SNPs other than the 700,000 the company tests are not used for matching, so getting their values does not help matching. Those extra SNPs are very important for medical studies, but not matching. The 700,000 vary enough already that DNA companies would get very little benefit by adding to that number.

    The reason to combine raw data from multiple companies, as you can now do at GEDmatch, is that GEDmatch compares the tested SNPs of kits from different companies. Some companies’ chips have very little overlap, i.e. fewer than 100,000 SNPs may be in common and available to be compared, which is too few for reliable matching. Combining multiple kits will increase that overlap number for you.
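    Here’s a minimal sketch of what that overlap looks like in practice, assuming each company’s raw data file has already been reduced to a set of rsids (the file formats differ, but they all include an rsid column):

        # Compare the SNP overlap between chips, and what a combined (union) kit covers.
        def overlap_report(kits):
            """kits: dict mapping company name -> set of rsids tested by that company."""
            names = list(kits)
            for i, a in enumerate(names):
                for b in names[i + 1:]:
                    print(f"{a} vs {b}: {len(kits[a] & kits[b]):,} SNPs in common")
            combined = set().union(*kits.values())
            print(f"A combined kit would cover {len(combined):,} distinct SNPs")

        # overlap_report({"FTDNA": ftdna_rsids, "Ancestry": ancestry_rsids, "23andMe": me23_rsids})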

    So for genealogical purposes, you’re likely better off spending your money taking a few standard DNA tests from companies who give you your matches. Then you can create a combined kit at GEDmatch Genesis. A WGS test would not help you with this.


    So Why Did I Take a WGS Test?

    Other than insatiable curiosity and the need to know, I was hoping to see what, if anything, WGS tests can do that could help a genetic genealogist. My current conclusion (as I just wrote) is: not that much.

    For analysis of your DNA for health purposes, you will want a WGS test. Most regular DNA companies do not test many SNPs that have known health effects. Even 23andMe only tests a subset of medically-related SNPs. Dante Labs specializes in reports for medical purposes. When you take a test with them, you can request detailed custom reports on specific ailments you may have, like this sample report on epilepsy.

    But for me, I’m not really interested in the medical information.


    So Why Did I Want To Take a Long Read WGS Test?

    A Nanopore Technologies white paper about The Advantages of Long Reads for Genome Assembly gave me the idea that maybe the long reads would overlap enough that they could be used to phase my raw data. Phasing is separating out the pair of allele values of each SNP into their paternal and maternal values. I would thus find the 22 autosomal chromosomes of DNA that I got from my father and the 22 that I got from my mother, plus the X chromosome from my mother. If you phase your DNA and create a raw data file from it, you can use it to find the people who match just one parent.

    Typically, when you are like me and your parents have passed away and they had never DNA tested, phasing would need to be done with the raw data of close relatives such as siblings, children, aunts, uncles or cousins, nieces or nephews who did test. You can use tools like Kevin Borland’s DNA Reconstruction Toolkit. But I only have an uncle who has tested. Just an uncle isn’t quite enough. Maybe, I thought, long reads would overlap enough to span the entire chromosome and voila, you’ve phased it.

    Dante’s long reads use Oxford Nanopore PromethION technology. The specs are 30x with N50>20,000bp. That means that half of the sequenced bases come from reads longer than 20,000 contiguous base pairs, and enough reads are made to give an average coverage of 30 reads for every base pair in the genome. By comparison, short reads average only 150 contiguous base pairs.

    Let’s see: 30 x 3 billion base pairs / 20,000 = 4.5 million long reads are made.


    Unfortunately, Long Reads Might Not Be Long Enough

    Despite my original thought that 4.5 million overlapping reads of 20,000 contiguous base pairs should cover the whole genome, apparently that isn’t the case. The long reads can reconstruct good-sized pieces of a chromosome, which are called Contigs. But when you have a long stretch where there are few SNPs, and the SNPs that are there have both allele values the same (homozygous), then the long reads will not be able to cross the gap. How often does that happen?

    Well, as I mentioned above, my VCF file indicates I have 3,442,712 SNPs that are different from the human reference genome. Of those, 2,000,090 SNPs are heterozygous, with two different allele values, meaning we can use one value to represent one chromosome and the other value to represent the other chromosome of the pair. One long read starts a contig. An overlapping long read must contain one of the heterozygous SNPs already in the contig in order to extend it.

    It sort of works like this:

    image

    Read 1 includes two SNPs. We know the T and C go together on one chromosome, and the C and G go together on the other. So Read 1 is a contig.

    Since Read 2 overlaps with Read 1, we can extend the Read 1 contig.

    But the next read, Read 3, does not reach back far enough to include the SNP with the CG values. So we cannot tell whether the C or the G connects to the A or the G in Read 3. So our first Contig ends with the AA at the end of Read 2, and the second Contig starts at the AA at the beginning of Read 3.
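    A toy sketch of that rule, just to make the logic concrete: two adjacent heterozygous SNPs can only be phased into the same contig if at least one read covers both of them. The positions and read intervals below are made up for illustration, not from my data:

        # het_positions: sorted positions of heterozygous SNPs on one chromosome.
        # reads: (start, end) intervals of the long reads aligned to that chromosome.
        def count_contigs(het_positions, reads):
            contigs = 1
            for prev, nxt in zip(het_positions, het_positions[1:]):
                # Is there any single read covering both adjacent heterozygous SNPs?
                spanned = any(start <= prev and nxt <= end for start, end in reads)
                if not spanned:        # like Read 3 above: the link is broken
                    contigs += 1       # so a new contig starts at the next het SNP
            return contigs

        # Mimicking the three-read picture above: the last gap can't be spanned, so 2 contigs.
        print(count_contigs([100, 9000, 18000, 26000],
                            [(0, 15000), (8000, 24000), (20000, 40000)]))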

    How many contigs will we have? Quite a few are possible. Here are some rough calculations just to get an idea of what the number might be.

    I took all my 2 million heterozygous SNPs and ordered them within each chromosome by base pair address. I then found the difference between each base pair address and the next. This gives the length of each run of base pairs with no heterozygous SNPs.

    I then sorted those and plotted them. Here’s the graph:

    image

    This says that 2% of my heterozygous SNPs are 15,000 or more base pairs away from the next heterozygous SNP. Out of my 2 million heterozygous SNPs, 2% means 40,000.

    And 0.2% are 70,000 or more base pairs away. Out of my 2 million SNPs, that’s 4,000.

    Since my long read test has an N50 > 20,000 bp, only about half of my data comes from reads longer than 20,000 base pairs. I do get 30x coverage, or an average of about 30 reads over any base pair position, so let’s say the average longest of those 30 reads is 70,000 base pairs. Then there would be about 4,000 regions that can’t be spanned. Some may be adjacent to each other, so I may get something like 3,000 contigs.

    This would give me about 3,000 pieces of my genome. Some will be bigger and some will be smaller, but they should average about 1 million base pairs (which is about 1 cM).
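    Here’s roughly how that gap calculation can be expressed in code, assuming the heterozygous SNP positions have already been pulled out of the VCF into sorted lists per chromosome. This is a sketch of the idea, not exactly what I ran:

        # Count gaps between adjacent heterozygous SNPs that a long read likely can't span.
        def gap_stats(het_snps, spannable=70_000):
            """het_snps: dict of chromosome -> sorted list of heterozygous SNP positions."""
            gaps = []
            for positions in het_snps.values():
                gaps.extend(b - a for a, b in zip(positions, positions[1:]))
            too_long = sum(1 for g in gaps if g >= spannable)
            print(f"{len(gaps):,} gaps; {too_long:,} ({too_long / len(gaps):.1%}) "
                  f"are {spannable:,}+ bp, so each one likely breaks a contig")
            return too_long

        # With ~2 million heterozygous SNPs and 0.2% of gaps at 70,000+ bp, that's roughly
        # 4,000 breaks, i.e. a few thousand phased contigs.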

    There are methods called scaffolding that try to assemble these pieces correctly onto the same chromosome. This is all state-of-the-art stuff for handling long read WGS, so I’ve got some reading to do to understand it all.


    Forward Thinking

    I look forward to getting my long read WGS results and then comparing them to my short read WGS and my combined raw data file from my 5 standard DNA tests. I know I will learn something from that.

    I intend to see how many contigs I get out of the long reads. Maybe my estimates above are wrong and I only get 300 contigs instead of 3,000. I might be able to do something with that and figure out how to scaffold to separate out my allele values into each of the pairs of each chromosome.

    And maybe I’ll discover something I hadn’t even thought of. In a few months when I get my long read results, we’ll see.

    Advanced Genetic Genealogy

    2019. április 14., vasárnap 3:49:46

    Living in Canada, I had to wring my hands waiting an extra two weeks over my US neighbors for my copy of Advanced Genetic Genealogy: Techniques and Case Studies to arrive.

    Packaged for me nicely and safely in bubble wrap, the book itself is physically impressive, larger than your average book: full letter size 8.5” x 11” (22 x 28 cm), a full inch (2.5 cm) thick, and despite being soft cover, weighing in at a hefty 3 pounds (1.4 kg). Its 382 pages exclude a 4-page table of contents, a 6-page list of its beautiful full-color figures and tables, a 5-page preface and 2-page acknowledgement by its editor Debbie Parker Wayne, and 7 pages of author biographies.

    image

    The names of the chapter writers are a who’s who of genetic genealogy: Bartlett, Bettinger, Hobbs, Johnson, Johnston, Jones, Kennett, Lacopo, Owston, Powell, Russell, Stanbary, Turner and Wayne. If you know who these people are, then you are likely knowledgeable enough in this field to take in their wisdom. It is advanced. This is no beginners’ course. You’ll need experience and knowledge of working with your DNA to fully grasp what is said.

    Let’s see what can be learned.



    1. Jim Bartlett talks about Segment Triangulation.

    Now you have a choice. You can either spend hundreds of hours, as I did, delving into every detail of his four years of blog posts at segmentology.org, or you can read this chapter. He tells you how he uses Segment Triangulation to create Triangulation Groups to allow him to do Chromosome Mapping.

    My favorite line from Jim’s chapter: “You can be confident that virtually all of the segments in a Triangulation Group are IBD. This statement has been contested because it has not been proved or published. However after five years of Triangulating, I have not found any evidence to the contrary.”

    p.s. I have been working the past few months to implement chromosome mapping techniques similar to what Jim describes in his chapter into the next version 3.0 of Double Match Triangulator. He gives me some new ideas to wake up to think about at 3 a.m.


    2. Blaine Bettinger covers Visual Phasing.

    Visual Phasing is a technique to map the segments shared by three or more siblings to determine which grandparent supplied them. This is generally done manually from GEDmatch one-to-one comparisons of the three siblings. I have not personally used Visual Phasing for myself because I’m not fortunate enough to have any sets of three siblings who have DNA tested.

    This is one of the advanced techniques that has some tools available to help you, but none that yet do it for you. I’m sure the tools to do VP for you will be one of those innovations that appears in the next few years. I’m not going to be the one to build that tool (because I don’t personally need it), but I am implementing some of the ideas of Visual Phasing into DMT.


    3. Kathryn Johnston talks about the X Chromosome.

    You just can’t help enjoying any writing that brings up the Fibonacci sequence. Kathryn’s most interesting comment to me, and something I never knew, is that “Visual phasing began with X comparison and the X is still recommended as a starting point.”

    I haven’t spent a lot of time on the X chromosome for my own DNA. It really is a bit of a different beast, and I love its one main property: the ancestral line an X segment comes from cannot pass through a father and his father. That can immediately eliminate false MRCAs.


    4. Jim Owston on Y DNA.

    Well, I’ve done the Y-111 and Big Y-500 at Family Tree DNA to help with the Jewish Levite DNA studies. I’d feel better about, and work harder with, Y-DNA if my closest match were within my 5-generation genealogical time horizon. Alas, it is not, and I can’t even use the common-surname idea because my ancestors in Romania and Ukraine only adopted their surnames 5 generations ago. So until something breaks through here, I’ll have to remain an autosomal guy. I envy Jim and anyone who can include 8-generation lineage charts that run from 1520 to 1831. Sick!

    Jim has a good writeup on the benefits of going from Big Y-500 to Big Y-700. I see no personal benefit for my own genealogy to upgrade, but if I’m approached because it will help the Levite study, then I’ll likely do it for them. Technically, the study is finding people related to me, albeit along the lines of Jim’s people who are 10 to 20 generations back, but in my case, unlikely to ever be genealogically connected to me. 


    5. Melissa Johnson on Unknown Parentage.

    Many people do not know who their birth parent or parents are. Melissa describes the various ways to analyze your DNA matches to determine who they might be. She covers Blaine Bettinger’s Shared cM Project tables, X-DNA, Y-DNA and haplogroups, lists various background-check websites, and then discusses the issues involved in targeted testing when dealing with a birth family.


    6. Kimberly Powell on Endogamy.

    Ah, endogamy, I know thee well. Kimberly describes all the complications that endogamy brings to the table to make DNA analysis much more challenging. She talks about matches being predicted closer than they are, how “in common with” (ICW) matches can be deceiving, and how clustering systems like the Leeds method do not give clear-cut answers.

    Kimberly says to check for runs of homozygosity using GEDmatch’s “Are my parents related?” tool. Interestingly for me, with my great amount of endogamy, you’d think my parents would turn out to be related at least at the 3rd or 4th cousin level. But they aren’t.

    image

    One segment of 8.8 cM and 9.8 generations apart for an endogamous population is not much at all. Despite the endogamy of the general population of both my parents, somehow my paternal and maternal families must have remained mostly separate. My paternal side is from towns now in Romania that are a few hundred kilometers from my maternal side’s towns that are now in Ukraine.

    When I check my uncle (my father’s brother), he gets no indication that his parents (my paternal grandparents) are related:

    image

    My paternal grandparents are from two towns now in Romania that are about 300 kilometers (200 miles) apart.

    Kimberly also brings up the calculation of the coefficient of relationship, and describes how to use triangulated groups, trees, chromosome mapping and cluster analysis to help identify relationships.


    7. Debbie Parker Wayne on Combining atDNA and Y-DNA.

    Debbie brings up a very detailed case study from her own research to illustrate some of her methodology. Debbie’s two full pages of citations are impressive in themselves, and they show the professionalism of her amazing research and analysis.

    Debbie includes a bit of almost every technique, and her article is the only one in the book to include Ancestry’s DNA Circles.


    8. Ann Turner on Raw Data.

    Ann’s article is about the Raw Data you download from the testing company. She describes the different file structures of each company, explains RSID and SNP selection, why there are no-calls and miscalls, what phasing is and what statistical phasing is. She goes into child phasing, segments, boundaries (“The actual boundaries may be fuzzier”), builds and genetic distance. I’ve always loved the relationship versus cM versus number of segments chart (Figure 8.8) produced originally by 23andMe that Ann describes.

    Then Ann goes into SNPs, overlap between the SNPs tested at the various companies, and why this is important at GEDmatch Genesis. She then talks about other tools for raw data, and finishes by mentioning whole genome sequencing (WGS).


    9. Karen Stanbary on DNA and the Genealogical Proof Standard (GPS).

    You’ll want to read this chapter if you are a professional genealogist who wants to incorporate DNA into the work you do for your clients. The GPS is expected in any professional work done. Karen describes the testing plan, documentation, focus study groups, correlation, the formulation of a hypothesis, testing the hypothesis, and writing the conclusion.


    10. Patricia Hobbs, a Case Study.

    Patricia basically follows the principles that Karen described in her chapter, and describes the path taken to use documents and DNA evidence to identify an unknown ancestor. This is another one of those papers that you usually see in an advanced genealogical journal, and it definitely shows you what you must attempt to achieve if you want your work to be published. Very impressive, and way beyond what I ever hope to achieve.


    11. Thomas Jones on Publishing Your Results.

    Dr. Jones is the author of the classic “Mastering Genealogical Proof” and he applies all his knowledge and techniques in this chapter. His conclusion: “When genealogists, geneticists, and genetic genealogists use DNA test results to help establish genealogical conclusions, they are genealogical researchers. When they write about that research, they become scholarly writers. When their written work helps present-day and future researchers and members of the families that they have studied, they have met their research and writing goals.”


    12. Judy Russell on Ethics in Genetic Genealogy.

    I love Judy. I read every one of her Legal Genealogist blog posts. But arguing about ethics, like politics, is something I prefer to leave to others. Judy is an expert in these matters. If you’re worried about any ethical matter with respect to a DNA test, get this book and read this chapter.


    13. Michael Lacopo on Uncovering Family Secrets.

    I was dreadfully afraid to take a DNA test several years ago, simply because I didn’t want to find out that my father wasn’t my father. If you had looked at pictures of my father and his siblings and me, you would have said I had nothing to worry about. I ended up getting my uncle to take the test, and I took it myself, and I’m happy to report that he is indeed my full uncle.

    Do you have that worry? Well, you should. No matter who you are, you are sure to find a few skeletons in the closet. They may not be immediate family, but they will occur among your DNA relatives. The reasons are varied: sometimes covered-up deeds, sometimes mistakes (a switch at birth), or the result of violent crime (e.g. rape). Michael’s chapter is a wonderful treatise on the psychology behind all this. He talks about identity and self, privacy and outcomes. His chapter and Judy’s chapter work hand in hand.


    14. Debbie Kennett on the Future!

    They couldn’t have picked a better person to write this chapter. This chapter alone is worth the price of the book. Debbie talks about the promise and limitations of: Y-DNA testing, mtDNA testing, autosomal DNA testing, Whole Genome Sequencing, ancestral reconstruction, DNA from our ancestors, and the power of big data.

    Debbie was nice enough, in her section on Whole Genome Sequencing, to mention one of my posts about the VCF file. As a result, I’m proud to have my name listed in the index of the book on page 354, right after Debbie’s.

    Debbie sees the time many years in the future when we will take a DNA test, put our name into a database, and produce an instant, fully sourced family tree, complete with family photographs and composite facial reconstructions. I guess something like this:



    Conclusion

    If you feel you are ready to plunge into some advanced material to take you to the next level, don’t wait. Get the book now. You don’t have much time to learn this, because the field is growing and advancing as we speak. Within a few years, a whole new advanced set of tools and ideas will be developed to help us with our genealogical DNA endeavors. Debbie Parker Wayne’s AGG will be the prerequisite knowledge needed to get to that next level.

    Now don’t think for a second that I’ve been able to read and digest everything in this book over the past two days. No, I’ve skimmed and read some parts just to get a feel of it and to write this blog post. It’s going to take me a few months to read it all in detail and take in everything.

    Final review score:  A++

    WGS Result Files

    2019. április 13., szombat 22:12:00

    I received the rest of my raw data files for my WGS (Whole Genome Sequencing) test today. It was shipped from their lab in Italy and came on a 1 TB hard drive.

    image

    Previously I was able to download my VCF files from Dante’s site. I reported on the files a couple of months ago in my post: My Whole Genome Sequencing. The VCF File. Those files were compressed and totaled 224 MB and expanded to 869 MB. The VCF (Variant Call Format) files only contain the variants, i.e. the readings where I vary from the human genome reference.

    The files supplied this time include all my data, not just the variants. So they are much larger. As a result they were sent to me on the large hard drive. They include the BAM and FASTQ files.

    The files are provided in three folders:

    • clean_data
    • result_alignment
    • result_variation



    The FASTQ files

    The clean_data directory contains 16 files named something like:
         aaaaaaaaaaa_L01_5mm_n.fq.gz

    where aaaaaaaaaaa is some identifier, mm runs from 78 to 85, and n is 1 or 2.

    Each file is about 8 GB in size and is gzip compressed at about 34%. I have a fast Intel i7 computer and it takes 30 minutes to uncompress one of these files to its full size of about 22 GB.

    When unzipped, the .gz drops off the file name and the .fq suffix represents a FASTQ file.

    Here’s what the beginning of one of the unzipped FASTQ files looks like:

    image

    Shown above are the first 4 groups of readings in one of the files, where 4 lines make up a reading. The first line of each group of 4 is an identifier, the 2nd is exactly 100 base pair values, the 3rd is just a plus sign (at least in the records I glanced at), and the 4th is a set of codes that represent the quality of each of the 100 base readings.
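    If you want to peek at these files without waiting 30 minutes for a full decompress, a minimal Python sketch can stream the gzipped FASTQ and print a few records. The file name below is a placeholder, and I’m assuming the quality codes are standard Phred+33 ASCII:

        import gzip
        from itertools import islice

        # Stream a gzipped FASTQ: each record is 4 lines (identifier, bases, "+", qualities).
        def peek_fastq(path, n_records=4):
            with gzip.open(path, "rt") as fq:
                while n_records:
                    record = list(islice(fq, 4))
                    if len(record) < 4:
                        break
                    ident, bases, _plus, quals = (line.rstrip("\n") for line in record)
                    mean_q = sum(ord(c) - 33 for c in quals) / len(quals)   # assumes Phred+33
                    print(f"{ident}  length={len(bases)}  mean quality={mean_q:.1f}")
                    n_records -= 1

        # peek_fastq("aaaaaaaaaaa_L01_578_1.fq.gz")   # placeholder file name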

    Also in the directory is a small Excel file that contains some summary statistics from my WGS test. It contains:

    image

    Sample is my kit number. There were 1.5 billion reads. In order to get an average of 30x coverage on 3 billion base pairs, I figure the average read length would have to be 60 base pairs.
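    The arithmetic behind that estimate is just coverage = reads × read length / genome size, rearranged. A trivial sketch:

        genome_size = 3_000_000_000    # base pairs in the genome
        num_reads = 1_500_000_000      # from the summary spreadsheet
        coverage = 30                  # the promised average depth
        print(coverage * genome_size / num_reads)   # 60.0 bp average read length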

    Each of the eight 5mm files has two pictures associated with it that show some results, e.g.

    aaaaaaaaaaa_L01_578.base.png contains:

    image

    and aaaaaaaaaaa_L01_578.qual.png contains:

    image

    Most of this is all new to me too, so I can’t explain what all this means yet.



    The BAM file

    The result_alignment directory contains the aaaaaaaaaaaaaa.bam file. It is 115 GB in size, compressed at 27%, which expands to 425 GB. Decompression time for this file on my computer is over 17 hours. For most purposes, you don’t want to decompress this file, since most genome analysis programs work with the BAM file itself along with a small bam.bai (BAM index) file of 8 MB that is also in the directory. The BAM index file allows a program to go directly to the section of the genome that it needs.
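    To give an idea of what that index buys you, here is a minimal sketch using the pysam library (an assumption on my part that you have it installed; also, the reference names in the BAM may be “1” rather than “chr1” depending on the build Dante aligned against):

        import pysam

        # Opening the BAM picks up the .bai index automatically if it sits alongside it.
        bam = pysam.AlignmentFile("aaaaaaaaaaaaaa.bam", "rb")

        # Jump straight to a 200 bp region of chromosome 1 instead of reading 115 GB.
        for read in bam.fetch("chr1", 1_000_000, 1_000_200):
            print(read.query_name, read.reference_start, read.query_length)

        bam.close()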

    This directory also contains the same summary xls file that was with the FASTQ files (see above) and three png images:

    aaaaaaaaaaa.Depth.png that shows that I mostly achieved at least 30x coverage

    image

    aaaaaaaaaaa.Cumulative.png that shows the cumulative distribution of the depth

    image

    aaaaaaaaaaa.Insert.png

    image

    Insert size has something to do with the analysis process. I used to know what paired reads are, but I forgot. I’ll have to look that up again if I ever have to use them.



    Variation Files

    There are four subdirectories named sv, snp, indel and cnv:

    image

    They contain a number of files, some gzipped (which I decompressed in the above listing). Basically, these are various files indicating all my gene, exome and genome differences from the human reference, for both my SNPs and my INDELs (insertions/deletions). These files are in a different format from the two VCF files I downloaded earlier for SNPs and INDELs.



    What’s Ahead

    I purchased the Long Read WGS test from Dante a few days ago. I think I’m going to wait until I get the results from my Long Read test. This will likely take a few months. Once I get the long read results, I’ll look at the BAM and FASTQ files from both tests, compare them to each other, and see what I can learn from them.

    With just one test, you can’t tell how good it is. But with two, you can compare their results to each other. It should be interesting.

    Combine Kits into One Superkit on GEDmatch Genesis

    2019. április 7., vasárnap 7:19:19

    Today GEDmatch Genesis added a new Tier 1 application. They state:

    image

    I did that myself manually with 5 kits about 6 months ago, uploaded my combined raw data to GEDmatch Genesis, and reported the results in my post: The Benefits of Combining Your DNA Raw Data.
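    For anyone curious, the core of a manual combine like mine can be sketched in a few lines, assuming each company’s raw data has already been normalized into a dictionary keyed by chromosome and position. (The real work is in handling the differing file formats, builds and no-call conventions; the names below are just for illustration.)

        # Merge several raw data kits into one superkit: the first kit to call a SNP wins.
        NO_CALLS = {"--", "00"}    # examples of no-call codes seen in raw data files

        def combine_kits(kits):
            """kits: list of {(chromosome, position): genotype} dicts, in order of preference."""
            combined = {}
            for kit in kits:
                for pos, genotype in kit.items():
                    if genotype in NO_CALLS:
                        continue
                    combined.setdefault(pos, genotype)   # keep the first real call at each position
            return combined

        # superkit = combine_kits([ftdna, the23andme, ancestry, livingdna, myheritage])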

    I thought I’d try the new GEDmatch Genesis application to see if it produces essentially the same result.

    I selected the Tier 1 “Combine multiple kits into 1 superkit” application and it gave me the option to select up to 4 kits that are already uploaded. I had all 5 of my kits uploaded and I selected FTDNA, 23andMe, Ancestry and LivingDNA. I left out MyHeritage, which includes almost the same SNPs as my FTDNA file does.

    image

    I pressed the “Generate” button and within a second, I got my combined kit:

    image

    Comparing my kits using the GEDmatch Diagnostic utility gives:

    image

    When I manually combined the kits, I got 1,389,750 SNPs, but GEDmatch only combines the 1,123,247 SNPs that it knows it is going to use. Slimmed SNPs are what GEDmatch actually uses for comparisons with other kits. I’m surprised that GEDmatch’s 834,457 slimmed SNPs are over 20,000 more than my manually combined kit’s. I have no explanation for that.

    I’ve also included my Whole Genome kit from Dante, but GEDmatch only loads the SNPs that are in the VCF file. Those are the SNPs where I differ from the human reference genome; the SNPs where I am the same as the reference are not included. The GEDmatch people still have to fix the upload of VCF files so that the human reference genotype is filled in when a SNP is not included in the file.
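    Conceptually, the fix is simple; here’s a sketch of what filling in the reference genotypes might look like (the function and its inputs are hypothetical, just to illustrate the idea):

        # For every SNP position a matching site wants, use the VCF genotype if present;
        # otherwise assume a homozygous reference genotype at that position.
        def fill_reference_calls(wanted_positions, vcf_calls, reference_alleles):
            filled = {}
            for pos in wanted_positions:
                if pos in vcf_calls:
                    filled[pos] = vcf_calls[pos]
                else:
                    ref = reference_alleles[pos]      # reference base at this position
                    filled[pos] = ref + ref           # not in the VCF => same as the reference
            return filled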

    The one-to-one comparison was possible immediately, so I compared the GEDmatch combined kit to each of my individual kits, and to my manually created All-5 kit.

    image

    All of the comparisons indicate that I match myself at least 99.210%. It’s not important that there are some small breaks in the matching segments, which result in more than 22 shared segments. I expect that when the one-to-many comparisons become available, the overlaps will improve just as they did with my manually combined file.



    The Bottom Line

    If you’ve tested with multiple companies and you subscribe to Tier 1, you should combine your kits to get better comparisons at GEDmatch Genesis. Make sure you make this combined kit the one you use for matching, and change all the others to Research so that you show up only once in other people’s match lists.

    The only unfortunate thing is that you don’t get access to the combined kit’s raw data at GEDmatch. So you won’t know exactly what they did, and you won’t have that raw data yourself to look at or use for other purposes.