Louis Kessler's Behold Blog

the Development of my Genealogy Program named Behold

Proof or Hint?

Sunday, July 19, 2020, 18:05:07

Have you heard the big hubbub going on in genetic genealogy circles?  Ancestry will be dropping your 6 and 7 cM matches from your match list.

image

In my case, I have 192,306 DNA matches at Ancestry. Of those, 54,498 matches are below 8 cM, meaning Ancestry will drop over 28% of the people on my match list.


The Proof Corner

Many of the DNA experts understand that a 6 or 7 cM segment is small and is rarely useful for proof of anything. That is totally true. As Blaine Bettinger states, small segments are “poison”. They are often false matches. When they are not, those segments are usually too many generations back to be used as “proof” of the connection.

I am not talking about Y-DNA or mtDNA here. Those have provable qualities in them. I’m talking about autosomal matching, you know, the DNA where the amount of DNA you share with a cousin reduces with each generation and you can be a 3rd cousin with someone and not share anything.

The only reasonable way to use autosomal segment matches as a “proof” is to use the techniques Jim Bartlett developed for Walking an Ancestor Back. This technique uses combinations of MRCAs on the same ancestral line, e.g. a 2C, a 3C, a 5C and a 7C all matching on the same segment who are on the same line. Jim has been able to do this successfully only because he has an extensive family tree and has rigorously mapped all his matches into triangulation groups over his whole genome. This is something that 99.9% of us will never attempt.

Note that Jim only includes matches that triangulate that are at least 7 cM. He is also aware that small segments may be false even when triangulated, so he excludes them.

But too often, people find through a DNA match a new 7th cousin, and find a family tree connection to them, and then claim that the DNA match proves the connection. This is so untrue on so many fronts.

Or people find two relatives who have a segment match that starts and/or ends at the same position as another DNA match. They then use this as proof of their connection to Charlemagne. Now doesn’t that sound ridiculous?


The Hint Corner

So why the worry about eliminating these mostly false, poison matches that can’t prove anything from your Ancestry DNA match list? It’s because they are hints.

As genealogists, we are using our DNA matches to find possible relatives that have common ancestors with us. We do that to extend our tree outwards and up. Any person who may have researched a part of our tree and has information about our relatives and ancestors that we don’t have is a very welcome find. (Hopefully they’ll respond to our email!)

So of my 192,306 matches, the closest 1% are the best candidates for me to research and see if I can connect them.

What about the other 99%? Surely, some of them might turn out to be a closer cousin than expected, or be along a line that I have researched more deeply.

Obviously, none of us can spend the rest of our lives researching 190,000 matches one by one. So what do we do? We filter them down to interesting candidates, via:

1. A match who shares a common ancestor.

2. A match whose name matches a surname in our tree.

3. A surname in a match’s tree that matches one of ours.

4. A birth location in a match’s tree that is a place our ancestors were from, or where our relatives now live.

image

5. Shared matches with some of our DNA matches whom we already have in our tree.

6. ThruLines, which compares the trees of our DNA matches for us and gives us possible family connections that we can investigate.

Finding people through any of these 6 methods (and other similar methods) is a way to take an unmanageable list of 192,000 people and select a subset for us to look at. Our hope (we don’t know this for sure) is that the subset will include more of the people we’ll be able to connect to our family, and exclude the ones who are less likely.
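
To make the idea concrete, here’s a minimal sketch of this kind of filtering (illustrative only; the match records, surnames and places below are made-up examples, and in practice you’d use the testing company’s own filter tools):

    # Illustrative sketch: narrow a big match list down to interesting candidates
    # by surname and place. The match records below are made-up examples.
    my_surnames = {"Kessler", "Focsaner", "Braunstein"}
    my_places = {"Tecuci", "Dorohoi", "Mezhirichi", "Winnipeg"}

    def is_interesting(match):
        # Keep a match if a surname or a place in their tree overlaps with ours.
        return bool(my_surnames & set(match["tree_surnames"])) or \
               bool(my_places & set(match["tree_places"]))

    matches = [
        {"name": "A. Braunstein", "cM": 9,
         "tree_surnames": ["Braunstein", "Segal"], "tree_places": ["Tecuci"]},
        {"name": "B. Smith", "cM": 7,
         "tree_surnames": ["Smith"], "tree_places": ["London"]},
    ]

    candidates = [m for m in matches if is_interesting(m)]
    print([m["name"] for m in candidates])    # ['A. Braunstein']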

So what most people are lamenting is not the loss of 28% of their DNA matches, but a loss of 28% of the hints they might be able to use.


Recommendation

If you want, there are ways to save some of the 6 and 7 cM matches that Ancestry will soon be eliminating. I won’t describe them here since many others already have. See Randy Seaver’s summary.

But please, don’t spend the next few weeks robotically marking the tens of thousands of small matches so that you don’t lose them. Yes, maybe one of them will turn out to be a hint one day. But you’ve got all your other matches to work with as well. You won’t run out of things to do, I guarantee it.





Addendum:  July 29, 2020:

If small DNA matches of 6 or 7 cM at Ancestry DNA cannot be used to prove a connection, because they are either false matches, or are too many generations back to confirm their ancestral path, then why can they be used as hints?

Answer: Simply because if you take a random selection of, say, 20,000 DNA testers at Ancestry, some of them will be relatives of yours. They may not actually share DNA with you, since 3rd cousins and further need not, but they could be people whose family tree connects to yours.

Basically, Ancestry DNA is giving you hints by simply giving you a large random selection of DNA testers. Their filtering tools (surname, place) may narrow those down to possible relatives, who don’t necessarily share any actual DNA with you.    
   
But these hints are better than just random hints. They will likely be people who share more ethnicity with you than a random DNA tester at Ancestry would.

For example, Ancestry has me at 100% European Jewish. If I compare myself with my first 6 cM match at Ancestry, I get this:

image

This 6 cM match of mine also has 100% European Jewish ethnicity.

To see if this was generally the case, I took my closest 20 matches, and my first 20 matches at 40 cM, at 20 cM, 15 cM, 10 cM, 9 cM, 8 cM, 7 cM and 6 cM. I marked down what percentage of European Jewish they had. Then I sorted each group of 20 highest to lowest. I get this:

image

Of the 180 matches I checked, 179 had some European Jewish ancestry. Over half of the matches had 100% European Jewish ethnicity, and many of the rest had 50% or more.

There is a much greater chance that I might find a connection to someone with European Jewish ancestry than someone without any, so these are good hints. Using ancestral surname and place filtering tools, I might find that some of these people are relatives and they can help me extend my family tree.

Does that mean that we share DNA?  Not necessarily. The matches, especially the small ones, may be false matches.

Or we may actually share DNA, but the segments we share may not come from the common ancestor we found. They may be from another, more distant line that we’ll never find, or they may be (especially in my case) general background noise from distant ancestors due to endogamy. We don’t know and cannot tell.

Nonetheless, these matches are hints that might connect you to a relative.

Revisiting 23andMe’s Family Tree

Friday, July 10, 2020, 7:18:28

A very exciting day for me today, as most of you reading this will relate to. A second cousin of mine who I know showed up on my 23andMe match list. She matched me with 3.1% = 234 cM on 19 segments, which is exactly where she should be according to The Shared cM Project tool.
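
As a quick back-of-the-envelope check on those numbers (the 7500 cM total below is just an assumed approximate map length, not a figure published by 23andMe):

    # Rough sanity check: convert shared cM to a percentage of the genome.
    # The 7500 cM total is an assumed approximation; the exact total depends
    # on the company's genetic map.
    shared_cm = 234
    total_cm = 7500
    print(round(shared_cm / total_cm * 100, 1))   # 3.1 -- in line with the 3.1% reported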

I have 9 other cousins who have tested at 23andMe and match me. What makes this newly tested cousin different from the other 9 is that she’s on my mother’s side! All my previous known matches at 23andMe were on my father’s side.

So now I can finally get some maternal information from my 23andMe matches. A second cousin is perfect because we share great-grandparents and she will allow me to cluster my maternal matches into my mother’s father’s side, the side she is on.


23andMe’s Family Tree

I last looked at 23andMe’s Family Tree last September in my article: 23andMe’s Family Tree Beta.

My tree as calculated by 23andMe back then included 13 of my DNA matches. It placed 8 on my father’s side and 5 on my mother’s side.

My automated tree today has two more of my matches included, so there are now 15. The 8 circled matches at the left are on my father’s side. The 7 circled matches at the right are on my mother’s side. The people circled in blue are the 5 relatives in the tree that I know how I’m related to. One is a 1C1R who is the granddaughter of my uncle, so she shares both my paternal grandparents with me and I show her above the “F”. The other 4 are all on my father’s father’s side, and they are in the “FF” section. I do have a few relatives on my father’s mother’s side who tested, but 23andMe decided not to include them in my automated tree. There are 10 matches whose relationship to me I don’t know. But the tree hypothesizes that 1 is on my father’s father’s side, 2 are on my father’s mother’s side, and 7 are on my mother’s side. (Click the image below for a larger version.)

image

23andMe has not yet included my new mother-side match on my tree. They only recalculate the tree from time to time and I’d have to wait until they do it again to see if they add my cousin to it.

Of those 7 people hypothesized to be on my mother’s side, 3 are with one parent and 4 are with the other. So once my cousin is added, presumably the group of 3 or the group of 4 would be with her on my mother’s father’s side and the other group would be on my mother’s mother’s side.

But then I saw that I don’t have to wait for 23andMe’s recalculation.

At the top left of the tree is this symbol:
image

When I click on it, it brings up this box with unplaced relatives:

image

There are 5 people shown at the bottom; you have to scroll to the right to see the last two. The person on the left is my newly tested cousin. The other 4 are people I don’t know how I’m related to.

Clicking on the little info symbol next to the “Unplaced Relatives” text gives:

image

Clicking on the “Learn more” link gives:

image

Well 5 minutes doesn’t sound so bad. Let’s see what happens when I reset my tree.


Recalculating the 23andMe Family Tree

I press the “Yes, delete my edits and recalculate my tree” button, and it gives this:

image

Okay. 5 to 10 minutes isn’t so bad either.  Back at the tree, they actually show progress:

image

Now it’s saying less than 1 minute. Sheesh!  After what turns out to be about 3 minutes, I get this message:

image

I’m doing this on a Thursday evening at 7 p.m. CDT. Is this a busy time?

I wait a couple of minutes, and of course I don’t believe them and don’t want to wait until tomorrow, so I go back up to the 23andMe main menu and, under Family & Friends, select "Family Tree":

image

Sure enough, I didn’t have to wait a day. It displays my new tree:

image

Now it only shows 6 of my DNA matches. Pressing the symbol in the top left, it now shows this:

image

So it moved 9 of my previously placed matches into the Unplaced Relatives list. That list now has those 9 plus the 5 that I had before I had them recalculate the tree, plus the 8 non-tested relatives (e.g. my parents, grandparents, uncle, cousin, etc.) that I had previously manually added to my tree.

The recalculation placed some of my paternal cousins at the wrong generational level. But that’s no problem. Since the beta 10 months ago, 23andMe has added the ability to move people in the tree, and even move a whole branch of the tree:

image

The link they often show that says "View our guide" takes you to 23andMe’s illustrated guide, How to build and edit your Family Tree, which is worth a read. In there, you’ll see that you not only can add people to your tree, but you can include their date and place of birth and death and add a photo. I’m not sure why entering the birth and death information is currently useful, since that information doesn’t show up in the tree. But maybe 23andMe has planned a use for it that they’ve not implemented yet.

Unfortunately, the one person I really wanted automatically added, my new DNA testing relative on my mother’s side, was not placed. That would have separated out my maternal sides. But now it wouldn’t have helped anyway, because the 7 people they previously placed on my maternal side were now all with the Unplaced Relatives. So placing my 2nd cousin without those 7 on the tree no longer will allow me to divide them up into my MF and MM sides. Sad smile


My New 23andMe’s Family Tree

I can easily add my new cousin, because I know where she goes. But I can’t add the people that the recalculation removed from the tree because I don’t know how I’m related to them. It would have been nice if 23andMe could have left them in. The algorithm must have changed somewhat. Maybe those people were previously placed inaccurately.

So be aware. You may lose some of 23andMe’s theories if you recalculate. Make sure you record how everyone is connected before you get it to do the recalculation.

Now my tree has 6 DNA relatives whose relationship I know. There is only one theory remaining. My tree now looks like this, with my father’s side on the right side.

image

I’ve circled in green the 6 relatives I have that I know are placed correctly. Circled in red is the one relative that remains as 23andMe’s theory.

23andMe has left me with 13 people in my Unplaced Relatives that I cannot place.

I also have 5 relatives among my matches whose relationships I know, but 23andMe’s Family Tree chose not to include them. I could add them to the correct place on 23andMe’s Family Tree. But they would not be connected to their DNA match information. It would be nice if 23andMe would allow you to select people from your match list. I think I’ll suggest that to them via their survey at the bottom of the Your Family Tree page.


Updating My Double Match Triangulator 23andMe Results

I last tried DMT on my 23andMe data last October:  Using DMT, Part 1: My 23andMe Data. Since I only had paternal matches back then, DMT couldn’t do much with my maternal side other than classifying which matches it calculated were maternal. What it gave me back then was this:

image

So now I’ll just do this exercise again. I’ll use DNAGedcom Client to download a new set of segment match files from 23andMe (see DMT’s help file for how to do this).

The segment match files I’ll download will be for myself and the 10 relatives I know how I’m related to. Each takes about 10 minutes for DNAGedcom to gather, so I’ll do them while I’m working on something else.


Two Hours Later

I put the 11 segment match files into a folder. I start DMT and select my own segment match file as File A. I have DMT create my People file with all my matches. Now I go through and add the MRCA for my 10 known relatives (9 of which are shown below):

image

Now I set Folder B to the folder containing all the match files and I let ‘er rip.

Double Match Triangulator clusters my matches into these groups. Compare this to the table above:

image

I have 199 more matches than I did last October.  The percentages are about the same as they used to be with the exception that DMT was able to pick out 201 of the maternal matches and associate them with my mother’s father’s cluster, due to their segment matches with my newly tested cousin.

Also, last October, I was only able to paint grandparents or further over 46.1% of my paternal DNA and none of my maternal side.  Now with my new data including my newly tested cousin, I’m able to paint 46.8% of my paternal side and 25.6% of my maternal side as well.

Uploading the DNA Painter file that DMT produces with this latest run into DNA Painter now gives this:

image

This is very similar to what I got 10 months ago, but now a significant amount of my maternal grandfather’s side (MF, in red) also gets painted. That’s a nice chunk of additional painting that DMT was able to add.

The one person whose relationship I don’t know that 23andMe added to my tree (see the last tree above, red circle, far right) was included as a second cousin once removed on my father’s father’s mother’s side. DMT puts that person in my FF (father’s father’s) cluster. DMT cannot work this any further back because I don’t have any tested cousins who I know are on either my FFF or FFM side for it to use. So 23andMe’s estimation of FFM is a good theory and could be correct. Now I’ll just have to trace his family tree and see if we can connect. Smile

VGA Webinar: “Your DNA Raw Data & WYCDWI”

Monday, July 6, 2020, 2:21:15

In just over a week, on Tuesday July 14, 2020 at 8:00 pm EDT, I’ll be giving a live online talk for the Virtual Genealogical Association @VirtualGenAssoc

image

The description of my talk is:

Presenter Louis Kessler explains those mysterious files that we download from DNA testing companies, helps us to understand what’s in them, and shows us the ways we can make use of them. He will also discuss whether Whole Genome Sequencing (WGS) tests are worthwhile for genealogists.

I hope you come and join me for this.

To register for my presentation, you’ll need to be a member of the Virtual Genealogical Association. Annual Dues are only $20 USD, and that gives you free registration for a year to any of their regular webinars as well as handouts and other benefits. Upcoming webinars include:

  • Tuesday, July 14 at 8 pm EDT - Louis Kessler presents
    “Your DNA Raw Data & What You Can Do With It”
  • Sunday, July 26 at 1 pm EDT - Sara Gredler presents
    “Successfully Searching the Old Fulton New York Postcards Website”
  • Saturday, August 1, 2020 EDT - Jessica Trotter presents
    “Occupational Records: Finding Work-Related Paper Trails”
  • Friday, August 7, 2020 at 8:00 pm EDT - Ute Brandenburg presents
    “Research in East and West Prussia”
  • Tuesday, August 18, 2020 at 8:00 pm EDT - Caroline Guntur presents
    “Introduction to Swedish Genealogy”
  • Sunday, August 23, 2020 at 1 pm EDT - Julie Goucher presents
    “Researching Displaced People”
  • Saturday, Sept 5, 2020 at 11:00 am EDT - Sara Campbell presents
    “Using Historic Maps of New England and Beyond”
  • Tuesday, Sept 15, 2020 at 8:00 pm EDT - Tammy Tipler-Priolo presents
    “Simple Steps to Writing Your Ancestors’ Biographies”
  • Sunday, Sept 20, 2020 at 1:00 pm EDT - Tamara Hallo presents
    “How to Get the Most Out of FamilySearch.org”
  • Friday, Sept 25, 2020 at 8:00 pm EDT - Annette Lyttle presents
    “Finding & Using Digitized Manuscript Collections for Genealogical Research”
  • Saturday, Oct 3, 2020 at 11:00 am EDT - Patricia Coleman presents
    “Beginning with DNA Painter: Chromosome Mapping”
  • Sunday, Oct 11, 2020 at 1:00 pm EDT - Kristin Brooks Barcomb presents
    “Understanding & Correlating U.S. World War I Records & Resources”
  • Tuesday, Oct 20, 2020 at 8:00 pm EDT - Christine Johns Cohen presents
    “Lineage & Hereditary Societies: Why, Where, When, What & How?”
  • Sunday, November 22, 2020 at 1:00 pm EST - Judy Nimer Muhn presents
    “Researching French-Canadians in North America”
  • Tuesday, November 24, 2020 at 8:00 pm EST - Marian B. Wood presents
    “Curate Your Genealogy Collection – Before Joining Your Ancestors!”
  • Tuesday, Dec 1, 2020 at 8:00 pm EST - Diane L. Richard presents
    “The Organizational Power of Timelines”
  • Friday, Dec 4, 2020 at 8:00 pm EST - Nancy Loe presents
    “Using Macs and iPads for Genealogy”
  • Sunday, Dec 13, 2020 at 1:00 pm EST - Jean Wilcox Hibben presents
    “Family History Can Heal Family Present”

Notice they vary the day of the week and the time of the day to accommodate people all over the world with different schedules.

If you are unable to attend a talk live that you wanted to, members have access to recordings of the last six months of webinars. Some of the past webinars that you can still access if you join now include:

  • Pam Vestal presented
    “20 Practical Strategies to Find What You Need & Use What You Find”
  • Mary Cubba Hojnacki presented
    “Beginning Italian Research”
  • Alec Ferretti presented
    “Strategies To Analyze Endogamous DNA”
  • Renate Yarborough Sanders presented
    “Researching Formerly Enslaved Ancestors: It Takes a Village”
  • Megan Heyl presented
    “Road Trip Tips: Don’t Forget To…”
  • Lisa A. Alzo presented
    “Finding Your Femme Fatales: Exploring the Dark Side of Female Ancestors”
  • Lisa Lisson presented
    “How To Be A Frugal Genealogist”
  • Michelle Tucker Chubenko presented
    “Using the Resources of the U.S. Holocaust Memorial Museum”
  • Cheri Hudson Passey presented
    “Evidence: Direct, Indirect or Negative? It Depends!”
  • Kate Eakman presented
    “William A. James’ 30 May 1944 Death Certificate”

While you’re at it, clear off your calendars from Nov 13 to 15 for the VGA’s annual Virtual Conference. Many great speakers and topics. There is a $59 fee for members and $79 for non-members. If the Conference interests you, then why not join the VGA right now for $20 and enjoy a year of upcoming webinars and 6 months of past webinars for free!

image

I’ve been a member of the Virtual Genealogical Association since it started in April 2018. They are always on the lookout for interesting speakers with interesting topics. If you would like to propose a talk, they are now accepting submissions for 2021 webinars and the 2021 Virtual Conference. Deadline for submission is August 30, 2020.

So How’s My Genealogy Going?

Thursday, July 2, 2020, 22:47:29

I’ve written over 1100 genealogy-related blog posts since I started blogging in 2002. But very rarely have I written about my own genealogy research.

It’s actually going okay now.

This blog was started to document the development and progress of my software program Behold, that I’m building to assist me with my genealogy. About 8 years ago, I started attending international conferences and became a genealogy speaker myself. Then about 4 years ago, DNA testing started to become a thing, and I jumped fully in, finding everything about it fascinating, and I wrote my program Double Match Triangulator to help decipher matches. About 2 years ago, the Facebook era of genealogy groups began. I joined and started participating in many groups that were of interest to me and relevant to my own family research.

I got interested in my genealogy in my late teens when one of my father’s aunts was in from Los Angeles and she started drawing a tree showing her and her 8 brothers and sisters. Then I started researching. The first program I started entering my data into was Reunion for Windows. When Reunion sold their Windows product to Sierra in 1997, I became a beta tester for their release of the program which they called Generations. I used Generations to record my genealogy until 2002, when Genealogy.com purchased it along with Family Origins and Ultimate Family Tree, and then subsequently dropped all three programs in favour of their own product Family Tree Maker.

What I had was a GEDCOM with my family tree information updated up to 2002. And until about 2 years ago, I had made no updates to that at all, waiting for Behold to become the program I’d enter all my genealogy data into. Working full time, the onset of DNA testing, becoming involved in genealogy conferencing and speaking, plus family and life in general prevented that from happening.

But then a simple step recently rebooted me and my genealogy work.


The MyHeritage Step

In February 2018, I took advantage of a half-price subscription for MyHeritage’s Complete Plan. I loaded my 16 year-old GEDCOM up to MyHeritage. I downloaded their free Family Tree Builder program which syncs with their online system, and I went to it.

The special price enticed me, but I liked what I saw in MyHeritage. They had lots of users. Billions of records. They had plenty of innovation, especially in their Smart Matching. And they were less America-centric than Ancestry. All my ancestors come from Romania and Ukraine ending up here in Canada, so I have eastern European needs. I’ll need to write names in Romanian, Russian, Hebrew and Yiddish, and language handling is one of MyHeritage’s strong points.

The one place MyHeritage was weak was Canada. So I also subscribed to Ancestry as well, but just their Canadian edition. The main database I wanted that Ancestry gave me was the passenger lists for arrival to Canadian ports.

Once I uploaded my 1400 people I had from 2002 via GEDCOM, MyHeritage’s Smart Matches started working for me. Over the course of a year, I added about 500 people to my tree and attached 5000 source records to them.


Filling Out My Tree

The sides of my family I am researching include my 5 grandparents and my wife’s 4 grandparents. My father’s parents are both from Romania. My mother’s parents are both from Ukraine as are all my wife’s grandparents.

My 5th grandparent is my father’s step-father Kessler. He is my mystery side. I know very little about him and his first wife. I don’t even know where he came from other than some unidentifiable place Ogec somewhere in Russia. He has no living blood relatives that I know of, and since no one I know is related to him, I can’t even use DNA to help me on his or his first wife’s side.

In addition to my 9 grandparents, I am also sort of doing a one-place study of Mezhirichi in the Ukraine, where my mother’s father came from. The reason that town is of more interest than my grandparents’ other towns is that in the 1920s, a synagogue called the Mezericher Shul was formed in Winnipeg, made up only of immigrants from that town, including my mother’s father. I am trying to trace all the people in Winnipeg whose parents or grandparents went to that synagogue back to their roots in Mezhirichi. I’m sure many of us are related in ways that we don’t know. So to be more precise, this is not really a one-place study of Mezhirichi, but a study of the families of the people who attended this synagogue in Winnipeg and who likely came from Mezhirichi.

On my wife’s father’s mother’s side is a cousin in the United States who has done an extensive study on that side of the family. He wrote a 255-page book listing about 1000 people who descended from his and my wife’s common ancestors. He graciously allowed me to add the data to my MyHeritage tree as another way to preserve his research. I enjoyed the month and a half I spent manually adding people and their birth and death years to my family tree. That was enough to let MyHeritage’s Smart Matches do the dirty work of finding record matches and easily allowing me to add dates and places from the records to our people.

Shortly after that, I ran into a problem. MyHeritage is supposed to privatize living person information. And when you look at a person in the tree who is living, it looks like they have been privatized. But it isn’t quite:

image

It shows the surname of the person, and the spouse’s maiden name. This wasn’t that bad, but the real problem was the Smart Matches. When someone Smart Matches to you over living people that they may have in their tree, they get all the information you have: names, dates, places, children, etc. I had a cousin email me and tell me he got a Smart Match from my tree, and his birthday was displayed to him. He wasn’t happy and neither was I.

I really was hoping I wouldn’t have to delete all the living people from my online tree, keeping them only in my local files on my computer. Fortunately there was a solution. When editing a person in Family Tree Builder, the "More" tab contains a privatization selection for the person. You check the box to make the person private:

image

They had no automated way to check this selection for all living people, so I manually opened up each of my 1500 living people and marked them private one-by-one, another week-long project.

Once those private people synced up to MyHeritage, the living couples now displayed as:
image

That’s much better. Every person still has a box online, but they are all now marked as "Unknown" rather than "private" with a surname.  Also, no more information about living people is given to anyone through Smart Matches. As a consequence, I also don’t get Smart Matches for any of my privatized people. But this latter aspect might be a blessing in disguise. Now the Smart Matches I get are only for my deceased people, who are the ones I’m most interested in researching and tracing further back. And the number of Smart Matches I now get is manageable. I can clean them out in a few days until I get a few hundred more a few weeks later.


Cousin Bait

I love this term cousin bait. You don’t want to put your data in one place. You want to put it everywhere you can. And you don’t want to put it all up for everyone to see and take. You want to make enough available to get people to contact you, so you can communicate with them and then share what you both have.

For the past 20 years, I have maintained a page of My Family Research and Unsolved Mysteries on my personal website:

image

That page is well indexed on Google. For instance, searching for “Braunstein Tecuci” on Google brings my page up in 3rd place out of 11,500 results:

image

Over those 20 years, I’ve had about 200 people email me inquiring about some of the names and places that I identify. And maybe one third of those have been actual relatives whom I’ve shared data with.

The 2nd best resource I’ve used for a long time to find family has been the JewishGen Family Finder (JGFF). I have just 17 entries, but those have been enough to get maybe 100 people to contact me to see if we have part of our family tree in common. And again, in maybe a third of those cases, we did.

image

Also, 2 decades ago, I uploaded my GEDCOM to JewishGen’s Family Tree of the Jewish People. As of March 2017, the collection had 7,310,620 records from 6,266 family trees. I’ve recently updated my tree there with my MyHeritage tree.

One of the best successes from my family webpage and through JewishGen was my connection to about 10 relatives on my father’s mother’s Focsaner side. We all have been emailing each other for many years and have been sharing information about our common family. I have only met one of these relatives in person, when our family went to New York City for a vacation about 10 years ago. But despite most of us never having met, and being 3rd cousins or further, we feel like we’re close family.

In the past 2 years, I have also added some of my own family tree (not my wife’s) to other sites, usually just my ancestors.

  • Ancestry:  Just ancestors, but I’ve connected them down to any DNA matches who are relatives.  This has given me a number of useful ThruLines that have led me to identify a couple of DNA testers who were relatives that I didn’t have in my tree.
  • Family Search:  I just added my ancestors, but I’m connecting them to anyone else in this one-world tree who I know are relatives.
  • Geni: Same as for Family Search.
  • Wikitree:  I’ve only put myself and my parents in so far. If in the future I notice a relative, I’ll connect to them.
  • Geneanet: About a year ago, I uploaded my tree from MyHeritage, so I have about 4000 in my tree there.
  • GenealogieOnline:  Just ancestors.
  • Family Tree DNA:  Just ancestors but connected down to DNA matches
  • GEDmatch:  Up to yesterday, just ancestors.

Unfortunately, other than the ThruLines results at Ancestry, these trees have not led to people contacting me. So they are not as good at being cousin bait as I hoped they would be.

But yesterday, GEDmatch added their MRCA Search Tool, which compares the GEDCOM file you uploaded to GEDmatch with the GEDCOM files of your DNA matches. So I downloaded my GEDCOM from MyHeritage (which already had all living people privatized) and I uploaded it to GEDmatch and ran their new tool.

The GEDmatch tool compared 766 of my DNA matches’ trees to mine, and 933 of my uncle’s DNA matches’ trees to my uncle in my tree. Mine is a very problematic family for these sorts of comparisons. All my ancestors are Jewish, so I have endogamy to deal with on the DNA side, and they are all from Romania or Ukraine, so on the tree side I have a lack of records and can only go back about 5 generations. Somewhat expectedly, the result was that neither I nor my uncle had any MRCA matches.


Other Findings

Of course, one goal every genealogist has is to expand our ancestral tree as much as we can. With all my ancestors coming from Romania and Ukraine, the records there only start in the early to mid 1800s. I can only hope to go back about 5 generations with the known records available.

Over the past few years, I found some researchers who have been able to acquire records for me and translate them from the Romanian or Russian they are written in.

Researcher Gheorge Mireuta obtained 10 birth and death records from Tecuci, Romania on my father’s father’s side.

Sorin Goldenberg obtained about 70 records from the Dorohoi region of Romania on my father’s mother’s side.

Viktoria Chymshyt has obtained records from the Mezhirichi area of Ukraine, trying to find people for me on my mother’s father’s side, but we haven’t been successful yet.

Boris Malasky has obtained about 70 records on two of my wife’s sides from Kodnya and Zhitomir in the Ukraine.

This record research is really the only possible way to expand my tree into the “old country” and provide the physical evidence to back it up.


Where I Am Now

Currently, I sit at over 5100 people in my family tree at MyHeritage, including all the people I’ve privatized.

I really love MyHeritage’s Fan View. It gives me a good representation of where I am. Here’s the Fan View of my tree today:

image

And a new record I just got a few days ago from Sorin Goldenberg gave me the first names of the parents of my great-great-great-grandfather Manashcu Naftulovici.

image

So Naftuli and Sura are the first two ancestors I’ve identified in my 6th generation! Their son Manashcu was the first in his line to start using a surname, and he selected the patronym: Naftulovici.

My wife’s Fan View is currently this:

image

We have two of her 7th generation ancestors identified in records acquired from Boris Malasky.


Still To Do

In one word, lots!  All genealogists know this is a never ending task. Every new ancestor you find leads to two new questions.

But my three major tasks over the next few years will be:

  1. Going through and organizing the dozens of boxes in my closet and basement and binders in my bookshelf of unorganized genealogical material and pictures from my early years of research and from my parents and my wife’s parents and grandparents.
  2. Digitizing what’s valuable from #1.
  3. Entering data obtained from #1 into my family tree along with source citations.

That should keep me busy for a while.

And in the meantime, I’ll still be developing Behold so that it will continue to assist me as I go.

Writing a Genome Assembler

Monday, June 29, 2020, 5:21:23

I have now taken 5 DNA microarray (chip) tests with Family Tree DNA, 23andMe, Ancestry DNA, MyHeritage DNA and Living DNA. I have also taken two Whole Genome Sequencing (WGS) tests with Dante Labs, one short-reads and one long-reads.

I analyzed the accuracy of these tests by comparing individual SNP values in my article Determining the Accuracy of DNA Tests. The chip tests don’t all test the same SNPs, but there’s enough commonality that they can be compared, and an error rate can be estimated. For my chip tests, that error rate turned out to be less than 0.5%.
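
The comparison itself boils down to counting discordant genotype calls at the SNPs two tests have in common. Here is a minimal sketch of that calculation (the whitespace-separated rsid / chromosome / position / genotype layout and the file names are assumptions for illustration, not any company's exact format):

    # Minimal sketch: estimate the discordance rate between two raw-data files by
    # comparing genotypes at the SNPs they share. Assumes a simple whitespace-separated
    # rsid / chromosome / position / genotype layout, which real files only approximate.

    def load_genotypes(path):
        calls = {}
        with open(path) as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue
                rsid, chrom, pos, genotype = line.split()[:4]
                calls[rsid] = "".join(sorted(genotype))    # treat AG and GA as the same call
        return calls

    def discordance(file_a, file_b):
        a, b = load_genotypes(file_a), load_genotypes(file_b)
        common = [rsid for rsid in a if rsid in b and "-" not in a[rsid] + b[rsid]]
        mismatches = sum(1 for rsid in common if a[rsid] != b[rsid])
        return mismatches / len(common), len(common)

    # Hypothetical file names:
    # rate, n = discordance("ftdna_raw.txt", "23andme_raw.txt")
    # print(f"{rate:.3%} discordant over {n:,} shared SNPs")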

The WGS test results don’t give you individual positions. They give you a set of reads, which are segments that are somewhere along the genome. Short read WGS tests give you segments that may be 100 to 150 bases long. Long read WGS tests can give segments that average 10000 bases long with the longest being into the megabases (millions of bases). But you don’t know where those segments are located on the genome.

To determine where the WGS reads are on the genome, there are two methods available:

    1. Alignment:  Each of the reads is matched to where it is best located in the human reference genome. The WGS testing companies often do the alignment for you and give your results to you in a BAM (Binary sequence Alignment Map) file. The alignment cannot be perfect because:

    • You have variants that are different from the human reference genome as well as INDELs (insertions and deletions),
    • The WGS data has errors in the reads, sometimes changing values, adding extra values or deleting values.
    • The algorithms used for alignment are not perfect and sometimes make assumptions.

    Comparing my BAM file results from my short read WGS test using the BWA alignment tool, the SNPs I could compare were even more accurate than my chip tests with an error rate of less than 0.1%. That sounds very good, but still 1 in 1300 results were wrong, meaning in 700,000 SNPs, there could be 500 errors.

    The WGS_Extract tool that I was using to extract the SNP values from the BAM file didn’t handle INDELs properly, so I couldn’t check the accuracy of those.  Despite their high accuracy for individual SNPs, short read WGS tests are not as good at identifying INDELs correctly; e.g., the YouTube video (see below) states 85% to 95% accuracy, which is a high 5% to 15% error rate.

    For my long reads WGS test, I had two alignments done, one using a program called BWA and one using minimap2 which was supposed to be better for long reads. I was very disappointed to find a quite high error rate on the SNPs I could compare, which was 7.7% and 6.6% for the two programs.

    Thus, alignment techniques and the algorithms that implement them are not bad, but they are far from perfect. They match your reads to a reference genome and have to assume that the best fit is where your read goes.

    2. De Novo Assembly, or just Assembly: This is where you only take the WGS reads themselves, and match them up with each other, piecing them together like a big picture puzzle.

    Actually, it’s tougher than a picture puzzle. The best analogy I’ve seen is it’s like taking 100 copies of today’s issue of the New York Times newspaper, and shredding them into small random pieces where you can only see a few words from a few lines. Just to make it a bit more difficult, half the papers are the morning edition, and half are the afternoon edition, where 90% of the articles are the same, but the other 10% have a different article in the same location in the paper. On top of that, somehow one copy of yesterday’s paper accidentally got in the mix. Now you have to reassemble one complete newspaper from all these pieces. And as a bonus, try to create both the morning edition and the afternoon edition.

    You likely will mix up some morning edition articles with some afternoon edition articles, unless you get some pretty big pieces that include 2 of the same edition’s articles in that piece. (Think about this!)

    So the two editions are like your paternal and maternal chromosomes, and the one copy of the previous day’s paper is like a 1% error rate that your reassembling has to deal with. Add in shredded versions of six different issues of the newspaper for a 6% error rate.

    A genome assembler matches one read to another and tries to put them together. The longest stretches of contiguous values that it can assemble are called contigs. Ideally, we would want to assemble 24 contigs: one for each of the 22 autosomes, plus the X chromosome, plus the mitochondrial (mtDNA) chromosome. A male will have a 25th, that being his Y chromosome.

    When assemblers can’t connect a full chromosome together (which none can do yet for humans), you can run another program to use a technique called scaffolding to connect the contigs together. That is done by mapping the contigs to the human reference genome and using the human reference genome as the scaffolds (or connections).

    Assembly with short read WGS has not been able to give good results. Similar to alignment, the reads are too short to span repeats, and thus give way too many contigs. Long reads are normally used for assembly, and despite their high error rate for individual base pairs, sophisticated error correction techniques and minimum distance algorithms have been developed to do something reasonable. However, chromosome-scale contigs are still not there yet, and many smart researchers are working to solve this, e.g. this article from Nov 2019 describing a method using a connection graph.

    I attempted an assembly of my long reads WGS about 6 months ago using a program called miniasm. I let it run on my computer for 4 days but I had to stop it. So I waited until before a 2 week vacation and started it, but while it was running my computer crashed.

    I realized that this was too long to tie up my computer for an assembly that likely would not give good results. And I was not happy running it in Unix on my Windows machine. I was interested in a Windows solution.


    Algorithms for Genome Assembly

    I’ve always been a programmer who likes the challenge of developing an algorithm to solve a problem. I have a BSc Honours in Statistics and an MSc in Computer Science, and my specialty and interest were probability and optimization.

    I have developed and/or implemented many computer algorithms, including detection of loops in family trees for Behold, matching algorithms in Double Match Triangulator, simulation of sports and stock market performance (winning me over $25,000 in various newspaper contests) and, from my university days, my at-the-time world-class chess program: Brute Force.

    Currently, for the next version of Behold, I am implementing a DNA probability of match, and an expected match length conditional upon matching, for autosomal, X, Y and mtDNA between selected people and everyone else in your family tree. In doing so, I also have to determine all the ways the selected people are related and statistically combine the results. All this data will be available if the user wants it, along with all the ways these people are related. It should be great.
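
    To give a flavour of the kind of calculation involved (a simplified illustration of standard expected-sharing arithmetic, not Behold’s actual algorithm): the expected autosomal sharing halves with each meiosis on the path between two people, doubled when they descend from a shared ancestral couple.

        # Simplified illustration of expected autosomal sharing; not Behold's code.
        # meioses = number of parent-child links on the path between the two people;
        # shared_ancestors = 1 for a half relationship, 2 for a shared ancestral couple.
        TOTAL_AUTOSOMAL_CM = 6800   # rough total; the exact figure depends on the genetic map

        def expected_shared_cm(meioses, shared_ancestors=2):
            return TOTAL_AUTOSOMAL_CM * shared_ancestors * 0.5 ** meioses

        print(round(expected_shared_cm(4)))   # 1st cousins, 4 meioses  -> 850 cM
        print(round(expected_shared_cm(8)))   # 3rd cousins, 8 meioses  -> about 53 cM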

    But back to genome assembly. The problem with assembly algorithms today is that they have to use long reads, and long reads have very high error rates. So they must attempt to do some sort of approximate matching that allows for errors and then use the consensus approach, i.e. take the values that most reads aligning to the same position agree on. It is not clean. It is not simple. There is a lot of error correction, and many assumptions must be made.

    Ahh, but wouldn’t it be simple if we could just take one read, and match the start to another read and the end to a third read? If you have enough coverage, and if the reads are accurate enough, then this would work fine.

    image 

    In fact this is how they put together the first human genomes, painstakingly connecting the segments that they had one by one.

    But alas, the long reads WGS tests are not accurate enough to do this. So something else had to be done.

    A couple of months ago, I discovered a wonderful online book called Bioinformatics Algorithms, designed for teaching. The entire text of the book is available online. You can also purchase the book for yourself or for your class.

    image

    Chapter 3 is: How Do We Assemble Genomes? That is where I got the exploding newspaper analogy which I expanded on above. The chapter is amazing, turning the problem into graph theory, bringing in the famous Königsberg Bridge Problem, solved by mathematician Leonhard Euler, and explaining that a de Bruijn graph is the best solution for error-prone reads.

    This looked like quite a task to implement. There are many assembly algorithms already developed using this technique, and I don’t think there’s anything I can do here that those working on this haven’t already done.


    Accurate Long Reads WGS!!!

    Also a couple of months ago, another innovation caught my attention. The company PacBio developed a DNA test they call PacBio HiFi SMRT (Single Molecule, Real-Time) WGS, which gives reads that are both long (up to 25 kb) and highly accurate (about 99.8%).

    Whoa! The world has just changed.

    No longer was extensive error correction required. The video above talks about the HiCanu assembler and how it was modified to take full advantage of this improved test. Not only that, but the practice of using short reads to “polish” the data is no longer required, and is actually discouraged with HiFi reads, as the polishing can introduce errors.

    What does this mean? Well, to me this indicates that the original idea of simply connecting ends might just work again. I have not seen any write-up about this being attempted anywhere yet. The assembly algorithm designers have been using advanced techniques like de Bruijn graphs for so long, they might never have thought to take a step back and consider that a simpler solution may now work.

    So I thought I’d take that step back and see if I can develop that simpler solution.


    A Simple Single-Pass Assembler for Windows

    For 25 years I’ve developed software using the programming language Delphi on Windows. Most bioinformatics tools are written in Python for Unix. I’m too much of an old horse who is too busy to learn new tricks. So Delphi it will be for me.

    The algorithm with perfect reads seemed fairly simple to me. Make the first read a contig. Check the next read. Does the start or end of the read match anywhere within the contig? If so, extend the contig. If not, make the read a contig. Continue sequentially just one time through the reads and after the last read, you should be done!!!

    Once I got going, I found it only slightly more complicated than that. You also had to check if the start and end of the contigs matched anywhere within the read, and also if the read contained the contig or the contig contained the read. I set a minimum overlap length, thinking that I’d want to ensure that the read and the contig matched at least that much. Then any repeats smaller than that overlap would be bridged.

    First I needed some sample data. In the Bioinformatics Algorithms book, the Chapter 9 Epilogue on Mismatch-Tolerant Read Mapping gives a challenge problem that includes a 798 KB partial dataset of the bacterial genome Mycoplasma pneumoniae, with 816,396 values in it, all either A, C, G or T.

    This is what that dataset looks like in my text viewer. It’s just one long line with 816,396 values in it:

    image

    The challenge problem also included a file of 40,000 short reads from that dataset, all of length 100. That gives 4 million data points for a coverage of 4.9x over the 816,396 in the genome.

    However, not a single one of the 40,000 reads was in the genome. The challenge was to find the number of reads that had at most 1 mismatch.

    Since I wanted a perfect dataset of reads to start with, I saw that I needed to create my own. Also, I wanted them to be like long reads, all with differing lengths.  So after a bit of trial and error, I ended up using a base-10 lognormal distribution with a mean of 3 and a standard deviation of 0.25 to generate 9000 random read lengths. The longest read length was 11,599, the shortest was 124, and the mean was 1174.

    image

    So those 9000 reads average 1174 bases and total 10.6 million data points, giving 13.0x coverage of the genome, which is quite a bit more than the 4.9x coverage in their example short reads. This is good, because there’s more likelihood I’ll have enough reads to cover the entire genome without gaps.

    I then generated random start positions for those 9000 reads, and extracted the actual genome values at that position for that read length, and put those into my own reads file. So now I had a set of reads with no errors to develop with.
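
    Here’s a sketch of how such an error-free read set can be generated (illustrative Python with hypothetical file names, not the code I actually used):

        # Illustrative sketch of generating perfect test reads from a known genome:
        # base-10 lognormal read lengths, then random start positions. The file
        # names are hypothetical.
        import random

        random.seed(1)
        genome = open("mycoplasma_genome.txt").read().strip()

        reads = []
        for _ in range(9000):
            length = int(round(10 ** random.gauss(3.0, 0.25)))        # base-10 lognormal: mean 3, sd 0.25
            start = random.randrange(0, len(genome) - length + 1)     # any start position that fits
            reads.append(genome[start:start + length])

        with open("perfect_reads.txt", "w") as out:
            out.write("\n".join(reads))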

    This is what my set of reads looks like in my text viewer. There are 9000 lines, each of varying length:

    image

    To do the alignment, I didn’t know how much of the start and the end of each read was needed for finding a match in another read. So I wrote a small routine to take the first n positions at the start and end of the first read, and find out how many other reads they are contained in:

    image

    I started at length 5. The first 5 values of the first read matched somewhere in 14,281 other reads. Obviously length 5 is too small. Increasing to the first 11 values, we see the start of the first read only matches 10 other reads and the end only matches 8. This does not decrease any more as we increase the segment size indicating that we likely found all the occurrences of that sequence in all the reads. With 13.0x coverage, you would expect on average 13 matches over any segment. I have 1 + 10 = 11 occurrences of the first part of the first read, and 1 + 8 = 9 occurrences of the last part of the first read. That’s a very possible result with 13.0x coverage.

    So for this genome and the sample data I have, I’ll set my segment length to 12 and select the first 12 values and last 12 values of each read for my comparisons.

    The reason such a small 12-value segment can be used is that there are 4 possible values, A, C, G and T, at each position. And 4 to the power of 12 is 16,777,216, meaning there are that many ways to make a 12-letter string out of those 4 values. Our genome is only 816,396 bases long, so there is very little chance that many segments of length 12 randomly occur more than once. For a human genome of 3 billion bases, a slightly longer segment to compare with will be required; maybe length 17 or 18 will do it.
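
    The arithmetic behind that choice is just a uniqueness argument (it ignores the genome’s real repeat structure, so it’s a lower bound rather than a guarantee):

        # Pick the smallest segment length k whose number of possible values
        # comfortably exceeds the genome size, so a random k-mer is unlikely to
        # occur twice by chance. Ignores real repeat structure.
        def min_seg_length(genome_size, margin=10):
            k = 1
            while 4 ** k < genome_size * margin:
                k += 1
            return k

        print(4 ** 12)                          # 16,777,216 possible 12-value segments
        print(min_seg_length(816_396))          # 12 for the test genome
        print(min_seg_length(3_000_000_000))    # 18 for a human-sized genome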


    Running My Assembler: Sembler

    After about 4 days of development, testing, debugging and enhancement, I got my simple assembler running. I call it:  Sembler. This version ended up with about 200 lines of code, but half of that is for reporting progress.

    So this is its algorithm. Sembler checks the first and last 12 positions of every read against each contig created so far. It also checks the first and last 12 positions of the contig against the read. And it checks if the read is completely in the contig and if the contig is completely in the read. Based on the situation, it will then either expand the contig, combine two contigs, or create a new contig.
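
    Here is a greatly simplified, illustrative version of that single-pass idea in Python (exact matching, perfect reads assumed). It is not the actual Delphi code in Sembler, and it omits the case where a read bridges and merges two contigs and the case where a contig is contained inside a read:

        # A much simplified illustration of the single-pass idea (exact matching,
        # perfect reads). The real Sembler also merges two contigs when a read
        # bridges them, and handles a contig contained inside a read.
        SEG = 12           # probe length taken from each end of a read
        MIN_OVERLAP = 100  # minimum overlap required to accept a join

        def try_extend(contig, read):
            """Return an extended contig if the read overlaps one of its ends, else None."""
            if read in contig:                       # read adds nothing new
                return contig
            head, tail = read[:SEG], read[-SEG:]
            i = contig.find(head)                    # read's start inside the contig:
            if i >= 0:                               # the read may extend the contig to the right
                overlap = len(contig) - i
                if overlap >= MIN_OVERLAP and contig[i:] == read[:overlap]:
                    return contig + read[overlap:]
            j = contig.find(tail)                    # read's end inside the contig:
            if j >= 0:                               # the read may extend the contig to the left
                overlap = j + SEG
                if overlap >= MIN_OVERLAP and contig[:overlap] == read[-overlap:]:
                    return read[:-overlap] + contig
            return None

        def assemble(reads, min_read_length=1200):
            contigs = []
            for read in reads:
                if len(read) < min_read_length:
                    continue
                for k, contig in enumerate(contigs):
                    extended = try_extend(contig, read)
                    if extended is not None:
                        contigs[k] = extended
                        break
                else:
                    contigs.append(read)             # no overlap found: start a new contig
            return contigs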

    Sembler reports its progress as it goes. Here is how it starts on my test data:

    image

    The first line shows the settings used. The next lines show the reads used. Since the minimum read length for this run was 1200, reads 2, 6, 7, 9, … were not used because they were too short.

    Up to read 66 no overlaps were found, so a contig was created from each read. At read 66, and again at read 77, the first 12 values of the read matched somewhere in one of the contigs. The rest of the contig matched the read after those 12 values, but the read was longer and had more values available that Sembler then used to extend that contig to the right.

    If we go down further to reads starting at 1004 we see:

    image

    We have now built up 177 contigs and they grow to a maximum of 179 contigs by read 1018. At this point, the contigs cover much of the genome and it is getting tougher for new reads not to be overlapping with at least one of the contigs.

    The start of read 1024 matches somewhere in contig 78 and the end of read 1024 matches somewhere in contig 90.  So this read has connected the two contigs. Contig 90 is merged into contig 78, and the last contig 179 is moved into contig 90’s spot just so that there aren’t any empty contigs to deal with.

    Below is the end of the output from running reads with length >= 1200:

    image

    We get down to read 8997 which ends up merging contig 3 into contig 1, and contig 4 becomes contig 3. So we are left with just 3 contigs.

    The run took 19.156 seconds.

    Normally, you don’t know the genome. This procedure is designed to create the genome for you. But since I am still developing this and trying to get it to work, I had Sembler look up the final contigs in the genome to ensure it had done this correctly. The three contigs it came up with were:

    Contig 1 from position 82 to 658275
    Contig 3 from position 658471 to 764383, and
    Contig 2 from position 764404 to 816396.

    So positions 1 to 81 were not identified, because there were no reads with length at least 1200 that started before position 82. And there was a small gap of length 195 between 658275 and 658471 which no reads covered and another gap of length 20 between 764383 and 764404 that no reads covered.

    Despite the 7.7x coverage, there were still a couple of small gaps. We need a few more reads to fill in those gaps. One way of doing so is to lower the minimum read length. So I lowered the minimum read length to 1000 and get this:

    image

    Success! We now have a single contig from position 82 to 816396.


    Optimizing the Assembler

    I have so far done nothing to optimize Sembler’s code. The program compares character strings. It uses a Pos function to locate one string within another. There are many ways to improve this to make it run faster, but getting the algorithm working correctly was the first necessity. I have a lot of experience at optimizing code, so if I carry this program further, I will be sure to do so.
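
    One obvious direction (an idea for illustration, not something Sembler does yet) is to index every 12-value segment of each contig in a dictionary, so a read’s two end segments point directly at the few candidate contigs worth checking, instead of scanning every contig with a string search:

        # Illustrative idea: a k-mer index from 12-value segments to the contigs that
        # contain them, so each read only needs a full overlap check against a few
        # candidate contigs rather than all of them.
        from collections import defaultdict

        SEG = 12

        def index_contig(index, contig_id, contig):
            for i in range(len(contig) - SEG + 1):
                index[contig[i:i + SEG]].add(contig_id)

        index = defaultdict(set)
        contigs = {0: "ACGTACGTTGCA" + "A" * 50}     # toy contig
        index_contig(index, 0, contigs[0])

        read = "ACGTACGTTGCAAAAA"
        candidates = index[read[:SEG]] | index[read[-SEG:]]
        print(candidates)    # {0} -- only contig 0 needs the full overlap check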

    But just as important as optimizing the code is optimizing the algorithm. Starting parameters are very important. Let’s look at what tweaks can be made.

    image

    As we increase the minimum length of the reads we include, we reduce the number of reads we are using. This reduces coverage, reduces the number of compares we do and takes less time. The maximum number of contigs we have to deal with decreases, and that maximum happens later in the reads.

    But if our value for the minimum length is too high, we don’t get enough coverage to fill in all the gaps and we end up with more than one contig. The most important thing here is to try to end up with just one contig.

    Based on the one contig requirement, our optimum for this set of reads for this genome is to select a minimum length of 1000.

    Now let’s set the minimum length to 1000 and vary the segment length:

    image

    Varying the segment length we are comparing doesn’t change the result. The segment length is only used to find a potential contig the read matches to. If the length is too short, then the start or end of the read will match to random locations in each contig. Those will be rejected when the rest of the read is compared, which is why the solution doesn’t change. But all these extra checks can dramatically increase the execution time if the segment length is too small.

    These are perfect reads I’m working with right now that have no errors. Once errors are considered, we’ll want to keep the seglength as small as possible to minimize the chance that the start segment or end segment contains an error. If it does, then that read will be rejected when the rest of the read is compared, effectively eliminating the use of that read.

    Now let’s put the segment length back to 12 and vary the minimum overlap which by default I had set to 100:

    image

    These results surprise me somewhat. I was expecting a minimum overlap of 50 and especially of 0 to fail and give lots of contigs. I’ll have to think about this a bit. Maybe it is because I’m using perfect reads with no errors in them.

    Nonetheless, this shows that if the minimum overlap is too high, then some of our matches will be excluded, causing some gaps. We don’t want the minimum overlap too low, or we may match two contigs that are side by side but don’t have enough “proof” to connect them. That isn’t a problem in this “perfect reads” case, but once errors are introduced, some overlap will likely be wanted as a double check.


    Next Steps

    This procedure works.

    Is it fast enough for a full WGS dataset? We’re talking about tens to hundreds of millions of reads rather than just 9000.  And we’re talking about a genome that is 3 billion positions rather than just 800,000.  So instead of 200 max contigs, we’ll likely have 200,000 max contigs. So it could theoretically take a million times longer to solve than the little problem I have here.

    If with optimization I can get the comparisons to be 20 times faster, then we’re talking a million seconds, which is 278 hours, i.e. roughly 12 days. That’s a little bit longer than I was hoping. But this is just a back of the envelope calculation. I’m not multithreading, and there are faster machines this can run on. If a procedure can be made available that will do a full de novo assembly of a human genome in less than 24 hours, that would be an achievement.
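    For transparency, here is the back-of-the-envelope arithmetic as a few lines of Python. The 20-second figure for the small test run is my own assumption, chosen only so that the numbers above work out:

      # Rough scaling check; the 20-second base time is an assumption.
      small_run_seconds = 20
      scale_factor = 1_000_000     # a full genome could be ~a million times more work
      speedup = 20                 # hoped-for gain from optimizing the comparisons

      full_run_seconds = small_run_seconds * scale_factor / speedup
      print(f"{full_run_seconds:,.0f} s = {full_run_seconds / 3600:,.0f} hours "
            f"= {full_run_seconds / 86400:.1f} days")
      # -> 1,000,000 s = 278 hours = 11.6 days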

    I have so far only tested with perfect data. It wouldn’t be too hard to test the addition of imperfections. I could change every 1000th value in my sample reads to something else and use that as a 0.1% error rate like WGS short reads. I could change every 500th for a 0.2% error rate like PacBio HiFi reads. And I can change every 20th for a 5% error rate like WGS long reads. I already have some ideas to change my exact comparison to an approximate comparison that will allow for a specific error rate. The tricky part will be getting it to be fast.
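    As a sketch of how those imperfections could be simulated (my own illustration, not necessarily the code I will end up using), changing every Nth base of a read to a different letter looks something like this:

      import random

      def corrupt_read(read, every_nth):
          """Introduce an error every `every_nth` bases, e.g. 1000 -> 0.1% error
          rate, 500 -> 0.2%, 20 -> 5%. A sketch; real errors fall randomly."""
          bases = list(read)
          for i in range(every_nth - 1, len(bases), every_nth):
              wrong_choices = [b for b in "ACGT" if b != bases[i]]
              bases[i] = random.choice(wrong_choices)   # substitute a different base
          return "".join(bases)

      noisy = corrupt_read("ACGT" * 500, every_nth=1000)   # ~0.1%, like short reads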

    It might be worthwhile running my program as it is against my WGS short reads. I showed above that the minimum overlap may not need to be as high as I originally thought, so maybe the WGS short reads will be able to assemble somewhat. There will likely be many regions where the repeats are longer than the short reads, and this procedure will not be able to span them. The elephant in the room is whether I can process my entire WGS short reads file in a reasonable amount of time (i.e. 1 day, not 12 days). And how many contigs will I get? If it is around 200, that will be pretty good, since that is only an average of about 10 per chromosome. But if there are 2000 contigs, then that’s not quite as good.

    I should try to get a set of PacBio HiFi human reads. That is what this procedure is geared towards. PacBio HiFi reads, I think, with enough coverage might just result in 25 contigs: one for each of the 22 autosomes, plus X, Y and mt. Then it wouldn’t be too hard to add a phasing step to separate those contigs into 46 phased chromosomes + mt for women, or 44 phased chromosomes + X + Y + mt for men.

    Finally, I would love to get a set of PacBio HiFi reads for myself. I don’t have my parents with me any more and they never tested, and I’d love to phase my full genome to them. Also, I could then do some analysis and see how well (or poorly) the WGS alignment techniques I used compare to the (hopefully) accurate genome that I’ll have assembled for myself.

    Maybe this won’t all happen this year. But I’m sure it will eventually, whether based on my Sembler algorithm, or on some other creation by one of the many hyper-smart bioinformatics programmers that are out there.

    If PacBio HiFi reads prove to be the revolution in genetic testing that they are promising to be, then for sure the whole world of WGS testing will change in the next few years.

    Kevin Borland visits Speed and Balding

    2020. június 23., kedd 17:20:52

    Kevin Borland is the author of Borland Genetics, a fantastic site where you can upload your Raw DNA data, match to others, and use tools to reassemble your ancestors’ DNA. I very recently wrote a blog post about Kevin’s site.

    Kevin also has a blog in which he has been posting very interesting articles, usually of an analytic nature, which is the type I really like. Yesterday, Kevin posted an excellent article: Help! My Segments Are So Sticky! in which he clearly explains how he calculated the probabilities of age ranges for 7 cM and 20 cM autosomal segments, using 25 years = 1 generation.

    So Kevin gives another take on the segment age estimates done by Speed and Balding in their 2014 paper made available online by Doug Speed:
    Relatedness in the post-genomic era: is it still useful?

    In the Genetic Genealogy Tips & Techniques group on Facebook, Blaine Bettinger posted about Kevin’s article and said: “I would absolutely love to see Kevin address the differences between his calculations and the calculations in the Speed & Balding paper, how fun that would be!”

    I’ve always felt that Table 2B from the Speed and Balding paper overestimates the age of segments for a given segment size. I wrote two articles on my blog in 2017 with alternative analyses and compared them to Speed and Balding:

    And I further updated that with another calculation in my article:

    Those articles received many comments, including one from Doug Speed, and much discussion on Facebook.

    So I was very interested to see what Kevin’s analysis says. Let’s compare.

    Using Kevin’s easy to follow method of calculation, I can first calculate the probability of no recombinations in x generations:

    image

    And then I simply subtract each column from the previous to give the probability that a segment is x generations old:

    image
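    For anyone who wants to reproduce these numbers, here is a small Python sketch of the calculation as I understand Kevin’s method, using his assumption that a 1 cM segment has a 99% chance of surviving one generation unbroken:

      # Probability an L cM segment survives g generations with no recombination,
      # using the 1 cM = 99% per-generation assumption from Kevin's article.
      def p_no_recombination(length_cm, generations):
          return 0.99 ** (length_cm * generations)

      # Subtracting each column from the previous gives P(segment is exactly g generations old).
      def p_age_exactly(length_cm, g):
          return p_no_recombination(length_cm, g - 1) - p_no_recombination(length_cm, g)

      for length_cm in (1, 2, 5, 10, 20):
          p_within_20 = sum(p_age_exactly(length_cm, g) for g in range(1, 21))
          print(f"{length_cm:>2} cM: {p_within_20:.0%} chance of <= 20 generations")
      # -> 1 cM: 18%, 2 cM: 33%, 5 cM: 63%, 10 cM: 87%, 20 cM: 98%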

    Now let’s plot this in the Speed and Balding chart format:

    image

    Let’s compare this to the Speed and Balding Figure 2B chart that everyone quotes. I’ll cut off the left and right sides, which have smaller and larger segments that we’re not comparing:

    SNAGHTML1de36330

    Speed and Balding use ranges, so for Kevin’s chart above, I used values at the start, middle and end of each range. Speed and Balding use megabases (Mb) and Kevin uses centimorgans (cM), but they are close enough for practical purposes.

    What we see is:

    Speed and Balding, 1 – 2 Mb:  About 18% chance of <= 20 Generations
    Kevin Borland, 1 – 2 cM:  Between 18% and 33% chance of <= 20 Generations

    Speed and Balding, 2 – 5 Mb:  About 28% chance of <= 20 Generations
    Kevin Borland, 2 - 5 cM:  Between 33% and 63% chance of <= 20 Generations

    Speed and Balding, 5 – 10 Mb:  About 50% chance of <= 20 Generations
    Kevin Borland, 5 - 10 cM:  Between 63% and 87% chance of <= 20 Generations

    Speed and Balding, 10 – 20  Mb:  About 68% chance of <= 20 Generations
    Kevin Borland, 10 – 20 cM: Between 87% and 98% chance of <= 20 Generations

    So indeed, Kevin’s figures do corroborate my own and indicate that Speed and Balding’s table likely overestimates the age of segments of a given size.




    Disclaimer: I sort of knew after reading Kevin’s article that his estimates would be similar to mine, since I used the same calculations as Kevin in my Life and Death of a DNA Segment article, except that I used the Poisson distribution for the starting probability rather than the 1 cM = 99% estimate that Kevin used.

    Xcode Life Health Reports

    2020. június 6., szombat 6:38:08

    On Facebook, I was shown a sponsored ad for getting health reports from your DNA raw data, at a 55% discount, from a company called Xcode Life.
    @xcode_ls

    image

    I’ve always been much more interested in DNA for genealogical purposes than for health, but I had never heard of this company and it sounded interesting. Their “Mega Pack” report was said to contain reports for Nutrition, Fitness, Health, Allergy, Skin, Precision Medicine, Methylation, Carrier Status, and Traits & Personality in 600+ categories.

    I looked around on the internet for a coupon and saved an additional $10 and paid $89 for the package. They accept uploads from all the major companies. I uploaded my combined all-6 file with 1.6 million SNPs in it that was in 23andMe format and it was accepted.

    The next morning, 13 hours later, I got an email stating my reports were ready. In the email, they gave me the coupon code REFVTB47WRMU5 worth $10 off any of their packages that I can give away. If you use it, I will also get $10.

    The Reports

    I downloaded my reports as a compressed zip file. After unzipping, there were 9 pdf files for the 9 reports ranging in size (for me) from the Methylation report at 10 pages up to the Carrier report at 84 pages.

    Most of the reports start with a 2 page introduction, the first page on understanding your report and the 2nd page on how to read your report. Each report ends with a 1 page disclaimer.

    The results follow on the next 2 to 4 pages and each trait is presented as one row of a table containing 2 or 3 possible results. They are color coded green for better than average, orange for average and red for not as good as average.

    For example, the Personality Results have two possible results. These are a couple of mine with better than average results:
    image

    And here’s a couple with just average results:

    image

    And then there are those for negative traits:

    image

    The Nutrition, Skin, Health, Allergy and Fitness reports, on the other hand, mostly give 3 possible outcomes per trait, e.g.:

    image

    The remainder of these 6 reports summarize and explain each of the traits, giving a recommendation with the same color as your result. It then tells you which genes were analyzed for the trait, but does not tell you which SNPs were analyzed or what your SNP values were.

    The other 3 reports each have their own format.

    The Carrier report lists 402 different conditions in alphabetical order and tells you if you have potential pathogenic variants.

    image

    They write in bold red letters in the introduction that these are not to be used for medical purposes. And the disclaimer says only your physician is qualified to interpret this report and incorporate this information in treatment and advice. Nonetheless, if anything shows up, it is likely worthwhile following up on it with your doctor.


    Actual SNPs!

    The other two reports had what I was more interested in. I wanted the actual SNPs identified indicating the value that I had for them and what they meant.

    The Pharmacogenetics Report lists the gene variants I have that are associated with my reaction to 185 different drugs:

    image 

    The rsid (Reference SNP cluster ID) is listed, as well as my result: the genotype TT. I can find the rsid rs2395029 in the raw data file that I supplied to Xcode Life, and that will tell me where it is by chromosome and position:

    image

    So this SNP is on Chromosome 6, position 31,431,760 and yes, it does have the value TT.
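    If you would rather not scroll through a two-million-line file by hand, a few lines of Python will do the lookup. This is just a sketch, assuming your raw data is in the usual tab-separated 23andMe format and is saved as genome.txt:

      def lookup_rsid(raw_data_path, rsid):
          """Find one rsid in a 23andMe-format raw data file
          (tab-separated columns: rsid, chromosome, position, genotype)."""
          with open(raw_data_path) as f:
              for line in f:
                  if line.startswith("#"):          # skip the header comments
                      continue
                  fields = line.rstrip("\n").split("\t")
                  if fields[0] == rsid:
                      return {"chromosome": fields[1],
                              "position": int(fields[2]),
                              "genotype": fields[3]}
          return None

      print(lookup_rsid("genome.txt", "rs2395029"))
      # e.g. {'chromosome': '6', 'position': 31431760, 'genotype': 'TT'}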

    I can then look up that rsid on Google and it will bring up lots of other information that can be found about the SNP:

    image

    such as SNPedia, which was recently purchased by MyHeritage but is being left as a free resource available to all, presumably to give MyHeritage information for their health tests.

    image

    The Methylation Report lists about 60 SNPs from various genes that are associated with conditions such as cardiovascular disease, Alzheimer’s, cancer, depression. They list the normal value, the risk value, and then show your genome values (GENO), shading the line if you have one or two risk values.

    image


    Conclusion:

    I wasn’t expecting to find too much of importance, as I am relatively healthy, and my 23andMe health test results didn’t come up with anything important. But the Xcode Life reports did identify 1 potential variant for a condition I have that would have helped me 5 years ago before I found out about it.

    If you have some ailment but don’t know what’s causing it, a DNA-based health report like this one from Xcode Life might be a good screen. If something shows up, you can discuss the report with your doctor.

    For me, I did this mostly out of curiosity. Having the rsids of the SNPs of interest in the Pharmacogenetics and Methylation reports will allow you, as a genetic genealogist, to map those SNPs onto your genome with a tool like DNA Painter, and track them through your ancestors.

    Upload Your Raw DNA Data to Borland Genetics

    2020. május 25., hétfő 20:53:11

    There’s another website I recommend you upload your DNA raw data to called Borland Genetics.

    image

    See this video: Introducing Borland Genetics Web Tools

    In a way, Borland Genetics is similar to GEDmatch in that they accept uploads of raw data and don’t do their own testing. Once uploaded, you can then see who you match to and other information about your match. Borland Genetics has a non-graphic chromosome browser that lists your segment matches in detail.    
       
    But Borland Genetics has a somewhat different focus from all the other match sites. This site is geared to help you reconstruct the DNA of your ancestors and includes many tools to help you do so. And you can search for matches of your reconstructed relatives, and your reconstructed relatives will also show up in the match lists of other people.

    Once you upload your raw data and the raw data from some tests done by a few of your relatives, you’re ready to use the exotically named tools that include:

    • Ultimate Phaser
    • Extract Segments
    • Missing Parent
    • Two-Parent Phase
    • Phoenix (partially reconstructs a parent using raw data of a child and relatives on that parent’s side)
    • Darkside (partially reconstructs a parent using raw data of a child and relatives that are not on that parent’s side)
    • Reverse Phase (partially reconstructs grandparents using a parent, a child, and a “phase map” from DNA Painter) 

    Coming soon is the ominously named Creeper, which will be guided by an Expert System that uses a bodiless computerized voice to tell you what your next steps should be.

    There’s also the Humpty Dumpty merge utility that can combine multiple sets of raw data for the same person, and a few other tools.

    The above tools are all free at Borland Genetics, and there are a few additional premium tools available with a subscription. You can use these tools to create DNA kits for your relatives. You can then download the kits if you want to analyze them yourself, or upload them to other sites that allow uploads of constructed raw data.

    By comparison, GEDmatch has only two tools for ancestor reconstruction. One called Lazarus and one called My Evil Twin. Both tools are part of GEDmatch Tier 1, so you need a subscription to use them. Also, you can only use the results on GEDmatch, because GEDmatch does not allow you to download raw data.


    Kevin Borland

    The mastermind behind this site is Kevin Borland. Kevin started building the tools he needed for himself for his own genetic genealogy research a few years ago and then decided, since there wasn’t one already, to build a site for DNA reconstruction. See this delightful Linda Kvist interview of Kevin from Apr 16, 2020.

    In March 2020, Kevin formally created Borland Genetics Inc. and partnered with two others to ensure that this work will continue forward.

    If you are a fan of the BYU TV show Relative Race (and if you are a genealogist, you should be), then you should know that Kevin was the first relative visited by team Green in Season 2.  See him at the end of Season 2 Episode 1 starting about 32:24.


    Creating Relatives

    I have not been as manic as many genetic genealogists in getting relatives to test. The only raw data I have is my own and that of my uncle (my father’s brother), whom I tested. So with only two sets of raw data, what can I do at Borland Genetics?

    Well, first I uploaded and created profiles for myself and my uncle.

    The database is still very small, currently sitting at about 2500 kits. Not counting my uncle, I have 207 matches with the largest being 54 cM. My uncle has 86 matches with the largest being 51 cM. This is interesting because most sites have more matches for my uncle than for me, since he is 1 generation further back.  I don’t know any of the people either of us match with. None of them are likely to be any closer than 4th cousins.

    My uncle and I share 1805.7 cM. The chromosome browser indicates we have no FIR (fully identical regions) so it’s very likely that despite endogamy, I’m only matching my uncle on my father’s side.

    The chromosome browser suggests three Ultimate Phaser options for me to try:

    image

    To interpret the results of these, you sort of have to know what you’re doing.

    So let me instead try to create some relatives. For that I can first use the Phoenix tool.

    image

    It allows me to select either myself or my uncle as the donor. I select myself as the donor and press Continue.

    image

    Here I enter information for my father and press Continue.

    SNAGHTML3187291c

    I now can select all my matches who I know are related on my father’s side. You’ll notice the fourth entry lists the “Source” as “Borland Genetics” which means it is a kit the person created, likely of a relative who never tested anywhere.

    In my case, my uncle is the only one I know to be on my father’s side, so I select just him. I then scroll all the way down to the bottom of my match list to press Continue.

    image

    And while I’m waiting, I can click play to listen to some of Kevin’s music. After only about 2 minutes (the estimated time shown was a big overestimate) the music stopped and I was presented with:

    image

    I now can go to my father’s kit and see what was created for him. His kit type is listed as “Mono” because only one allele (my paternal chromosome) can be determined. The Coverage is listed as 25% because I used his full brother who shares 50% with him, and thus 25% with me.

    image

    His match list will populate as if he was a person who had tested himself.

    I can download my father’s kit:

    image

    which gives me a text file with the results at every base pair:

    image

    The pairs of values are all the same because this is a mono kit. Also, be sure to use only those SNPs that fall within the reconstructed segments. There must be an option somewhere to download just the reconstructed segments, but I can’t see it. (Kevin??)
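    Until such an option turns up, you could do that filtering yourself. Here is a rough Python sketch; the file names and the segment list are hypothetical placeholders, and it assumes the downloaded kit uses the usual tab-separated rsid / chromosome / position / genotype layout:

      # Keep only the SNPs that fall inside the reconstructed segments.
      # Hypothetical segment list: (chromosome, start, end) as reported by the site.
      segments = [("1", 72017, 24631046), ("2", 103862, 9814263)]

      def in_reconstructed_segment(chrom, pos):
          return any(c == chrom and start <= pos <= end for c, start, end in segments)

      with open("father_mono_kit.txt") as src, open("father_filtered.txt", "w") as out:
          for line in src:
              if line.startswith("#"):
                  out.write(line)                    # keep header lines
                  continue
              fields = line.rstrip("\n").split("\t")
              if len(fields) >= 4 and in_reconstructed_segment(fields[1], int(fields[2])):
                  out.write(line)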

    In a very similar manner (which I won’t show here because it is, well, similar), I can use the Darkside tool to create a kit for my Mother using myself as the child and my Uncle as the family member on the opposite side of the tree.


    Reconstructing Ancestral Bits

    Now I have kits for myself, my uncle, my father and my mother. Can I do anything else?

    Well yes! I can use my analysis from DNA Painter to define my segments by ancestor.

    image

    I happened to already have the DNA Painter analysis done, which I had used Double Match Triangulator for. Using DMT, I created a DNA Painter file from my 23andMe data for just my father’s side:

    image

    I labelled them based on the ancestor I identified, e.g. FMM = my father’s mother’s mother. I downloaded the segments from DNA Painter and clicked “Choose File” in Borland Genetics and it gave me my 5 ancestors with the same labeling to choose from.

      image

    I select “FF”, click on “Extract Selected Segments” and up comes a screen to create a Donor Profile for my paternal grandfather!

    image

    Wowzers! I have now just created a DNA profile for a long-dead ancestor, and I can do the same for 4 more of my ancestors on my father’s side.

    Just a couple of days ago, I think I was asking Kevin for this type of analysis. Only today when writing this post, did I see that he already had it.


    Summary

    I only have my own and my uncle’s raw data to work with, yet I can still do quite a bit. For people who have parents, siblings and dozens of others tested … well I’m enviously drooling at the thought of what you can do at Borland Genetics with all that.

    There is a lot more to the Borland Genetics site than I have discussed here. There are projects you can create or join. Family tree information. Links to WikiTree. You can send messages to other users. There are advanced utilities you can get through subscription.

    The site is still under development and Kevin is regularly adding to it. Kevin started a Borland Genetics channel on YouTube, and over the past 2 years he has made an excellent 20-episode series of YouTube videos on Applied Genetics. And he runs the Borland Genetics Users Group on Facebook, now with 738 members. I don’t know how he finds the time.

    So now, go and upload your raw data kits to Borland Genetics, help build up their database of matches, and try out all the neat analysis it can do for you.

    OneDrive’s Poison Setting

    2020. május 9., szombat 6:31:31

    OneDrive’s default setting of no limit on network upload and download rates has caused years of Internet problems at my house. Unbeknownst to us, it would from time to time consume most or all of our Internet bandwidth, affecting me on my ethernet-connected desktop computer and affecting everyone else in my house connected with their devices to our Wi-fi. It is now obvious to me that this hogging of bandwidth happened after any significant upload of pictures or files from my desktop computer to OneDrive, and the effect sometimes lasted for days!

    Yikes! I’m flabbergasted at how we finally discovered the reason behind our Internet connection problems. A number of times in the past few years, we’ve found the Wi-fi and TV in the house to be spotty. We had got used to unplugging the power on the company-supplied modem and waiting the 3 or 4 minutes for it to reset. Often that seemed to improve things, or maybe the reset just made us feel it had done so – we don’t really know. We’ve called our supplier several times, and they came over, inspected our lines, checked our modem. In all cases, the problem repaired itself, if not immediately, then over the course of a few days.

    It didn’t get really bad too often. But it did about 2 months ago, just after my wife and I got back from a wonderful Caribbean cruise (which we followed up with 2 weeks of just-in-case self-isolation at home). I had to replace my computer, and very shortly after the new one was installed, we had several days of Internet/TV problems.

    I called my service provider (BellMTS) and I told them about the poor service we were having and they tried to help over the phone. We rebooted the modem several times but that wasn’t helping.

    image

    They sent a serviceman to check the wiring from our house to the distribution boxes on our block. We thought that might have helped and it was not long after that it seemed everything was pretty good.

    We had very few problems over the next 6 weeks, but just last night, I was in the middle of an Association of Professional Genealogists Zoom webinar (Mary Kircher Roddy – Bagging a Live One; Reverse Genealogy in Action), when suddenly I lost my Internet in my other windows and my family lost the Internet on their devices. Our TV was even glitching. However the Zoom webinar continued on uninterrupted. I could not at all figure this out.

    After the webinar ended, I called my Internet/TV provider and things seemed to improve. The next morning, the troubles reoccurred. I called my provider again. They sent a serviceman. He came into the house (respecting social distancing) and cut the cable at our box so they could test the wiring leading to our house. He was away for over an hour doing that. When he came back, they had set up some sort of new connectors. He reconnected us. But no, we still had the problem. He then found what he thought was a poorly wired cable at the back of the modem. He fixed that, but the problem remained. Then he replaced our modem and the power supply and the cabling. Still the problem remained.

    We were monitoring the problem using speedtest.net. We’ve got what’s called the Fibe 25 plan**. We should be getting up to 25 Mbps (megabits per second) download and up to 3 Mbps upload. We were getting between 1 and 2 Mbps download and 1 Mbps upload. Not good.

    After several more attempted resets and diagnostic checks, we were now 3 hours into this service call. The serviceman’s next idea was the one that worked. He said to turn off all devices connected to the Internet, then turn them on one by one, and we might find that one of our own devices was causing the problem. We did so, and when we got to my ethernet-connected computer, it was the one slowing everything down. The serviceman said there it is, we’ve found the reason. He couldn’t stay any longer and left.

    I checked and sure enough, when my computer was on, we got almost no Internet, but when it was off, everything was fine. Here was the speed test with my computer off:

    image

    When I went to the network settings to see if it was a problem with my ethernet cable, I could see a large amount of Activity, with the Sent and Received values changing quite quickly:

    image

    My first thought was that maybe my computer was hacked. I opened Task Manager and sorted by the Network column to see what was causing all the Network traffic. There was my answer, in number 1 place consuming the vast majority of my network was: Microsoft OneDrive.

    My older daughter immediately commented that she had long ago stopped using the free 1 TB of OneDrive space we each get by being Microsoft 365 subscribers because she found it hogged all her resources.

    Eureka! 2 months ago, what had I done? I had uploaded all my pictures and videos from our trip to OneDrive. And what was I doing while watching that Zoom webinar last night? I was uploading several folders of pictures and videos to OneDrive. And what hadn’t I done during the 6 weeks in between? Any significant uploads to OneDrive.

    In Task Manager, I ended the OneDrive task. Sure enough my download speed from speed test went back up to good numbers, and our Internet/TV problem had finally been isolated.

    It didn’t take me long to search the Internet to find that OneDrive had network settings. The default was (horrors) a couple of “Don’t limit” settings. The “Limit to” boxes, which were not selected, both had suggested defaults of 125 KB/s (kilobytes per second). I did some calculations and selected them and set the upload value to 100 KB/s and left the download value at 125 KB/s: 

    image

    Note that these are in KB/s whereas Speedtest gives Mbps. The former is thousands of bytes and the latter is millions of bits. There are 8 bits in a byte. So 125 KB/s = 1.0 Mbps, which is about 4% of my 25 Mbps download capacity and 100 KB/s = 0.8 Mbps which is less than 30% of my upload capacity. Now when OneDrive is synching, there should be plenty left for everyone else. Yes, OneDrive will take several times longer to upload now. But I and my family should no longer have it affecting our Internet and TV in a significant way any more.

    Also notice there’s an “Adjust automatically” setting. Maybe that is the one to choose, but unfortunately they don’t also have that setting on the Download rate, which is maybe more important.

    My wife and daughters have complained to me for a number of years claiming my computer was slowing the Internet. Up to now, I did not see how that could be. Yes, as it turns out, it was technically coming from my computer, but the culprit in fact was OneDrive’s poison setting. I am someone who turns off my desktop computer when I am not using it, and also every night unless it is working on something. No wonder our problems were spotty. When my computer was off, OneDrive could not take over. So my family was right all along.

    Well that’s now fixed. I will let my TV/Internet provider know about this so that they can save their time and their customers’ time when someone else has a similar intermittent Internet problem that may be caused by OneDrive. I will also let Microsoft know through their feedback form, and hopefully they will one day decide either to change their default network traffic settings to something that would not swamp the capacity of most home Internet connections, or to change the algorithm so that “unlimited” has a lower priority than all other network activity. Maybe that “Adjust automatically” setting is the magic algorithm. If so, it could be the default, but it should also be added as an option on the Download rate, to eliminate OneDrive’s greediness.

    Are you listening Microsoft?

    And I’d recommend that anyone who uses OneDrive check whether your OneDrive Network settings have no limit set. If they do, change them and you might see the speed and reliability of your Internet improve dramatically.


    —-

    **Note:  The Fibe 25 plan is the maximum now available from BellMTS in our neighborhood. They are currently (and I mean currently since my front lawn is all marked up) installing fiber lines in our neighborhood that will allow much higher capacity. Once installed, I should have access to their faster plans, and will likely subscribe to their Fibe 500 plan for only $20 more per month. That will give up to 500 Mbps download (20x faster) and 500 Mbps upload (167x faster). They have even faster plans, but that should be enough because our wi-fi is 20 MHz which is only capable of 450 Mbps. My ethernet cable (which was hardwired in from the TV downstairs to my upstairs office when we built the house 34 years ago) is capable of 1.0 Gbps which is 1000 Mbps. Once we switch plans, I’ll likely give OneDrive higher limits (maybe 100 Mbps both ways) and it will be a new world for us at home on the Internet. 

    Determining the Accuracy of DNA Tests

    2020. április 10., péntek 19:16:40

    In my last post, New Version of WGS Extract, I used WGS_Extract to create 4 extracts from 3 BAM (Binary Sequence Alignment Map) files from my 2 WGS (Whole Genome Sequencing) tests.

    These extracts each contain about 2 million SNPs that are tested by the five major consumer DNA testing companies: Ancestry DNA, 23andMe, Family Tree DNA, MyHeritage DNA and Living DNA.

    Almost two years ago, I posted: Comparing Raw DNA from 5 DNA Testing Companies to see how different the values were. Last year, in Determining VCF Accuracy, I estimated Type I and Type II error rates from two VCF (Variant Call Format) files that I got from my WGS (Whole Genome Sequencing) test.

    But in those articles, I was not able to estimate how accurate each of the tests was. To do so, you need to know what the correct values are, in order to be able to benchmark the tests. But now, with my 4 WGS extracts and my 5 company results, I have enough information to make an attempt at this.

    For this accuracy estimation, I’m going to look at just the autosomal SNPs, those from chromosome 1 to 22. I’ll exclude the X, Y and mt chromosomes because they each have their own properties that make them quite different from the autosomes.

    Let me first summarize what I’ve got. Here are the counts of my autosomal allele values from each of my standard DNA tests. I’m not including test version numbers, because different places list them differently, so instead I’m including when I tested:

    image

    Comparing the above table to the one from my Comparing Raw DNA article last year, all values are the same except the 23andMe column. Last year’s article totalled 613,899 instead of 613,462, a difference of 437. I’m not sure why there’s this difference, but I do know this new value is correct. Whatever mistake I might have made should not have significantly affected my earlier analysis.

    I find it odd that 23andMe and Living DNA both have half as many AC and AG values as the other companies. I also find it odd that Ancestry DNA has twice as many of the AT and CG values as the other companies, and that Living DNA has no AT or CG values. I have no explanation for this.

    23andMe is the only company that identified and included any insertions and deletions (INDELs), the II, DD and DI values, that it found.

    The double-dash “--” values are called “no calls”. Those are positions tested that the company algorithm could not determine a value for. The percentage of no calls ranges from a low of 0.4% in my Ancestry DNA data to a high of 2.8% in my FTDNA data. Matching algorithms tend to treat no calls as a match to any value.

    Below are the counts from my WGS tests:

    image

    I have done two WGS tests at Dante Labs: a Short Reads test and a Long Reads test.

    For the Short Reads test, Dante used the program BWA (Burrows-Wheeler Aligner) to create a Build 37 BAM file. I then used WGS Extract to extract all the SNPs it could.

    For my Long Reads test, I used the program BWA to create a Build 37 BAM file. (See: Aligning My Genome). But BWA was not supposed to be good for Long Reads WGS, so I had YSeq use the program minimap2 to create a build 37 BAM file.

    The WGS Extract program would not work on my Long Reads file until I added the –B parameter to the mpileup program. The –B parameter is to disable BAQ (Base Alignment Quality) computation to reduce the false SNPs caused by misalignment. Because I had to add –B to get the Long Reads to work, I also did a run with –B added to my Short Reads so that I could see the effect of the –B parameter on the accuracy.

    When I used WGS Extract a year ago (see: Creating a Raw Data File from a WGS BAM file), it produced a file for me with 959,368 SNPs from my Short Reads WGS file and I was able to use it to improve my combined raw data file.

      

    Accuracy Determination

    Now I’ll use the above two sets of data to determine accuracy. By accuracy, I mean: if a test says that a particular position has a specific value, e.g. CT, what is the probability that the CT reading is correct?

    I will ignore all no calls in this analysis. If a test says it doesn’t know, then it isn’t wrong. Having no calls is preferable to having incorrect values.

    I will also ignore the 4518 SNPs where 23andMe says there is an insertion or deletion (II or DD or DI). The reason is that few of the other standard tests have values on those SNPs (which is good) but almost all the WGS test results do have a value there (which is conflicting information and bad!). Somehow WGS Extract needs to find a way to identify the INDELs so that it doesn’t incorrectly report them as seemingly valid SNPs. Of course some of 23andMe’s reported INDELs might be wrong, but I don’t have multiple sources reporting the INDELs to be able to tell for sure. I do have my VCF INDEL file from my Short Reads WGS, but then it’s just one word against another. A quick comparison showed that some 23andMe reported INDELs are in my VCF INDEL file, but some are not.

    So first I’ll determine the accuracy of the standard DNA tests, then of the WGS tests.



    The Accuracy of Standard Microarray DNA Tests

    I have 4 BAM files from 2 WGS tests using different alignment or extraction methods. There are 1,851,128 out of the over 2 million autosomal positions where all 4 WGS readings were the same, were not no calls, and where the 23andMe value was not an insertion or deletion.

    Since all 4 BAM files agree, let’s assume the agreed upon values are correct.
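    Conceptually the comparison is simple. Here is a minimal Python sketch of the idea (my own illustration, not the exact procedure I used), with genotypes held in dictionaries keyed by (chromosome, position), and unordered genotypes like CT and TC treated as equal:

      def error_rate_against_consensus(test_genotypes, wgs_extracts):
          """test_genotypes and each entry of wgs_extracts map
          (chromosome, position) -> a genotype string such as 'CT'."""
          agree = disagree = 0
          for key, value in test_genotypes.items():
              wgs_values = [w.get(key) for w in wgs_extracts]
              if None in wgs_values or "--" in wgs_values or value == "--":
                  continue                          # not covered, or a no call
              if len({frozenset(v) for v in wgs_values}) != 1:
                  continue                          # the WGS extracts disagree; skip
              if frozenset(value) == frozenset(wgs_values[0]):
                  agree += 1
              else:
                  disagree += 1
          return disagree / (agree + disagree)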

    I compared these with the values from each of my 5 standard tests:

    image

    That’s not bad. An error rate of 0.5% or less. Fewer than 1 error in 197 values. FTDNA and MyHeritage’s tests were the best with an error rate of about 1 out of 600 values.

    These tests are all known as microarray tests. They do not test every position, but only test certain positions. They are very different from WGS tests and are expected to have a lower error rate. Of course, they can include up to about 3% no calls in their results, but that’s the tradeoff required to help them minimize their Type I false positive errors.



    The Accuracy of Whole Genome Sequencing Tests

    WGS tests have several factors involved in their accuracy. One is the accuracy of their individual reads which in the case of Long Read WGS is said to be much worse than Short Read WGS, maybe even as bad as 1 in 20. But those inaccurate reads are offset by excellent alignment algorithms that have been tuned to handle high error rates. This is a necessary requirement anyway because the algorithms need to handle insertions and deletions as well.

    Another factor in accuracy is the coverage rate, and 30x is considered to be what will give reasonably accurate results. If you have 30 reads mapped over a SNP, and 13 of them say “A” and 16 of them say “T” and 1 says “C”, then the value is likely “AT”. If 27 are “A” and 3 are “T” then the value is likely “AA”. They’ve been doing this for a long time and know the probabilities and they’ve got this down to a science (pun intended).
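    As a toy illustration of that kind of call (a sketch only; real variant callers use proper likelihood models and base qualities), you can count the bases piled up over a position and keep the alleles that show up often enough:

      from collections import Counter

      def naive_genotype(bases, min_fraction=0.2):
          """Toy genotype call from the bases piled up over one position."""
          counts = Counter(bases)
          depth = sum(counts.values())
          alleles = sorted(b for b, n in counts.items() if n / depth >= min_fraction)
          if len(alleles) == 1:
              return alleles[0] * 2          # homozygous, e.g. 'AA'
          return "".join(alleles[:2])        # heterozygous, e.g. 'AT'

      print(naive_genotype("A" * 13 + "T" * 16 + "C"))   # -> 'AT'
      print(naive_genotype("A" * 27 + "T" * 3))          # -> 'AA'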

    So my question is: what is the accuracy of my WGS Extract SNPs from my four BAM files? To determine this, I’ll do the opposite of what I did before. I’m going to find all the SNPs where at least 3 of my standard DNA tests gave the same value and the others either gave a no call or did not test that SNP. From the above analysis, each of my standard tests should have an error rate of no more than about 1 in 200, so three or more different tests with the same value should not be wrong very often. I’ll compare them with every position in my 4 BAM files that has a value and is not a no call. Here are my results:

    image

    So my Short Reads test gave really good results. Only about 1 value in 1300 disagreed with my standard tests. That’s quite acceptable. The –B option used during the extraction seemed to have little effect on the accuracy.

    But those Long Reads tests – ooohh! I’m very disappointed. 7.7% of the values in my Long Reads BAM file created with BWA were different from my standard tests. Using minimap2 instead of BWA only reduced that to 6.6%. This is not acceptable for SNP analysis purposes. The penalty for getting the wrong health interpretation of a SNP can be disastrous.

    I’m very disappointed in this Long Reads result. Even though Long Reads are known to have higher error rates in their individual readings, I would have thought that the longer reads, along with good alignment algorithms that take possible errors into account, would give good values once you have 30x coverage. If 1 out of 10 values is read wrong, then 27 out of 30 values should still be correct.

    So something else is happening here. This high error rate can come from one of several places. It could be read errors, transcription errors, algorithm errors, problems in any of the programs in the pipelines to create the BAM files, or problems in the programs that WGS Extract uses, such as the mpileup program.

    So can the Short Reads test values still be used? Well, I still have one outstanding problem with them. That’s with regard to the INDELs reported in my 23andMe test. Unfortunately, the results out of WGS Extract give SNP values at almost all of the INDEL positions. In the table below, I compare only the INDEL positions out of all the 23andMe positions that match each test:

    image

    Now I’m still not sure if the 23andMe value is correct or if the long read value is correct, but reporting a SNP value where there is an INDEL could be happening as much as 0.8% of the time, at least in the values reported by WGS Extract. This is something that needs to be looked at by the WGS Extract people to see if they can prevent this.



    Conclusions

    For genealogical purposes and relative matching on the various sites including GEDmatch, the standard microarray-based DNA tests are good enough.

    Don’t ever expect that your DNA raw data is perfect. There are going to be incorrect values in it. Most matching algorithms for genealogists allow for an error every 100 SNPs or so. Some even introduce new errors with imputation. As long as errors are kept to under 1 in 100 or so, differences in analysis for genealogical purposes should be small. But because of these inaccuracies, nothing is exact.

    If you upload to a site, it is worthwhile to improve the quality of your data by using a combined file made up of all the agreeing values from your DNA tests. See my post on The Benefits of Combining Your Raw DNA Data.

    WGS tests are worthwhile for medical purposes, but are probably overkill for genealogy. The WGS files you need to work with are huge requiring a powerful computer with large amounts of free disk space. Downloading your data takes days and uploading your data to an analysis site is impossible on most home internet services. The programs to analyze these files are made for geneticists and are designed for the Unix platform.

    There are not many programs designed for genealogists that analyze WGS data. The program WGS Extract is excellent, but you will need to know what you are doing. Until they find a way to filter out the INDELs, you’ll have to be careful in using the raw data files that the program produces.

    New Version of WGS Extract

    2020. április 7., kedd 3:42:17

    Back in May 2019, I wrote about a program called WGS Extract that produces, from your Whole Genome Sequencing (WGS) test, a file with autosomal SNPs in 23andMe format that you can upload to sites like GEDmatch, Family Tree DNA, MyHeritage DNA or Living DNA.

    The mastermind behind this program, who prefers to remain anonymous, made a new version available last month. You can get it here. The program last year was 2 GB. This one is now 4.5 GB. The download took about 45 minutes. And that is a compressed zip file, which took about 3 minutes to unzip into 8,984 files totaling 4.9 GB. It didn’t expand much because the majority of the space was used by 5 already compressed human genome reference files, each about 850 MB:

    1. hg38.fa.gz
    2. hs37d5.fa.gz
    3. GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
    4. human_g1k_v37.fasta.gz
    5. hg19.fa.gz

    I don’t know the technical aspects about what’s different in each of these references, except that 1 and 3 are Build 38 and 2, 4 and 5 are Build 37. For genealogical purposes, our DNA testing companies use Build 37.

    Also included among the files and very useful are raw file templates from various companies which include the majority of the SNPs from each of the tests:

    1. 23andMe_V3.txt   (959286 SNPs)
    2. 23andMe_V4.txt   (601885 SNPs)
    3. 23andMe_V4_1.txt   (596806 SNPs)
    4. 23andMe_V5.txt   (638466 SNPs)
    5. 23andMe_V5_1.txt   (634165 SNPs)
    6. MyHeritage_V1.csv   (720922 SNPs)
    7. MyHeritage_V2.csv   (610128 SNPs)
    8. FTDNA_V1_Affy.csv   (548011 SNPs)
    9. FTDNA_V2.csv   (720449 SNPs)
    10. FTDNA_V3.csv   (630074 SNPs)
    11. FTDNA_V3_1.csv   (613624 SNPs)
    12. Ancestry_V1.txt   (701478 SNPs)
    13. Ancestry_V1_1.txt   (682549 SNPs)
    14. Ancestry_V2.txt   (668942 SNPs)
    15. Ancestry_V2_1.txt   (637639 SNPs)
    16. LDNA_V1.txt   (618640 SNPs)
    17. LDNA_V2.txt   (698655 SNPs)

    There are 4 summary files:

    1. 23andMe_SNPs_API.txt   (1498050 SNPs) which likely combines the SNPs from all five 23andMe tests.
    2. All_SNPs_combined_RECOMMENDED_hg19_ref.tab.gz   (2081060)
    3. All_SNPs_combined_RECOMMENDED_GRCh37_ref.tab.gz   (2081060)
    4. All_SNPs_combined_RECOMMENDED_hg38_ref.tab.gz   (2080323)

    The last 3 appear to be a combination of all the SNPs from all the raw file templates. The hg19 and GRCh37 files appear to be the same, but differ in how the chromosome is specified, as 1 or as chr1, as MT or as chrM. I’m not sure how the hg38 file was derived, but it may have been a translation of all addresses from Build 37 to Build 38, excluding 737 SNPs that are in Build 37 but not Build 38.
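    The difference between those two is only the naming convention; the positions themselves are the same. A hedged sketch of the kind of translation involved:

      # GRCh37-style names ('1', 'MT') versus hg19/UCSC-style names ('chr1', 'chrM').
      def grch37_to_hg19(chrom):
          return "chrM" if chrom == "MT" else "chr" + chrom

      def hg19_to_grch37(chrom):
          name = chrom[3:] if chrom.startswith("chr") else chrom
          return "MT" if name == "M" else name

      print(grch37_to_hg19("1"), grch37_to_hg19("MT"))        # chr1 chrM
      print(hg19_to_grch37("chr22"), hg19_to_grch37("chrM"))  # 22 MT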



    Running WGS Extract

    The program can now run on any computer. Just run the appropriate script:

    1. Linus_START.sh
    2. MacOS_START.sh
    3. WIndows_START.bat

    Since I have Windows, I ran the third script. I had to tell Microsoft Defender SmartScreen to allow it to run. It starts up a Command window which then starts up the WGS Extract Window:

    image

    There are now three tabs:  “Settings”, “Extract Data” and “Other”.  Above is the Settings Page.

    Here is the Extract Data page:

    image

    The Mitochondrial DNA and Y-DNA functions are both new.

    And this is the Other page:

    image

    All the functionality on this 3rd page is new.



    Two WGS Tests

    When I first checked out WGS Extract last year, I only had my Dante Short Reads WGS test. See: Creating a Raw Data File from a WGS BAM file. Since then, I have taken a Dante Long Reads WGS test.


    Three BAM Files

    The raw reads from a WGS test are provided as FASTQ files. These need to be put into the correct place on my genome. A file containing the mappings of each of my reads to where it is in my genome is called a BAM file (Binary Sequence Alignment Map).  It’s these BAM files that WGS Extract reads.

    I have 3 BAM files I can use:

    1. The BAM file Dante provided with my Short Reads WGS test. They used a program called BWA (the Burrows-Wheeler Aligner) to produce my BAM.
    2. Dante did not provide a BAM file with my Long Reads WGS test. So I did the alignment myself using BWA to produce a BAM from this test. I documented that in my Aligning My Genome post.
    3. I found out that the program minimap2 produced more accurate alignment than BWA for Long Reads. I tried to run that myself but the job was taking too long. Then I heard that YSeq offered the mapping service using minimap2, so I had them create a minimap2-based BAM file from my Long Reads WGS test.

    Let’s now try a few things.



    Show statistics

    On the Settings page, we first load our BAM file and select an output directory. Loading the huge BAM file is surprisingly quick, taking only about 5 seconds.

    We can now go to the “other” page and press “Show statistics on coverage, read length etc.”

    Here’s my statistics from my Short Reads test. (Click image to enlarge)

    image

    My Short Reads test consisted of almost 1.5 billion reads. 86.44% of them were able to be mapped to a Build 37 human reference genome. That gave an average of 41x coverage over every base pair.  The average read length was 100 base pairs.
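    If you are curious, you can approximate these numbers yourself with the pysam library. This is only a sketch under my own assumptions (the BAM file name is a placeholder, and the statistics above come from samtools stats, not from this); scanning a 1.5-billion-read BAM this way will also be slow:

      import pysam

      def rough_bam_stats(bam_path, genome_size=3_100_000_000):
          total = mapped = mapped_bases = 0
          with pysam.AlignmentFile(bam_path, "rb") as bam:
              for read in bam.fetch(until_eof=True):
                  if read.is_secondary or read.is_supplementary:
                      continue                        # count each read only once
                  total += 1
                  if not read.is_unmapped:
                      mapped += 1
                      mapped_bases += read.query_length
          return {"reads": total,
                  "pct_mapped": 100 * mapped / total,
                  "avg_read_length": mapped_bases / mapped,
                  "avg_coverage": mapped_bases / genome_size}

      print(rough_bam_stats("my_short_reads.bam"))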

    Here’s my statistics from my Long Reads test with BWA mapping:

    image

    Long Reads WGS tests are known to have a higher percentage of errors in them than Short Reads WGS tests. But because their reads are longer, they can still be mapped fairly well to the human genome.

    I had over 20 million long reads. 76.17% of the reads were able to be mapped, which is lower than the 86% from my short read test. This resulted in an average coverage of 25x versus the 41x from my short read test. The average read length of the mapped reads was 3627 base pairs, which is 36 times longer than my short read test.

    Here’s the stats from my Long Reads test aligned by YSEQ using minimap2:

    image

    I have no idea why the Samtools stats routine decided to show the chromosomes in alphabetical order just for this run but not the other two above. That is definitely strange. But the stats themselves seem okay. This does show the improvement that minimap2 made over BWA, since the average read depth is now up to 36x and the average read length of mapped reads has increased to 5639. I expect that BWA must have had trouble aligning the longer reads due to the errors in them, whereas minimap2 knows better how to handle these.



    Haplogroups and Microbiome

    First I run Y-DNA from the “Other” page using my Long Reads. WGS Extract now includes a version of the python program Yleaf which is available on GitHub to perform this analysis. It takes about 5 minutes and then gives me this.

    image

    That’s interesting. I know I’m R1a, but my more detailed groups from the various companies then start taking me into M198, Y2630, BY24978 and other such designations. I’ve not seen it strung out as an R1a1a1b2a2b1a3 designation before. At any rate, it doesn’t matter too much. My Y-DNA does not help me much for genealogy purposes.

    For Mitochondrial DNA, WGS Extract gave me this:
    image

    That’s okay. I already know my mt haplogroup is K1a1b1a. It doesn’t help me much for genealogical purposes either.

    There was also an option in WGS Extract to create an oral microbiome that can be uploaded to app.cosmosid.com. This option will extract your unmapped reads which might be bacterial. I’m not interested in this so I didn’t try it.



    Creating A DNA Raw Data File

    Going to the Extract Data page in WGS Extract, I now press “Generate files in several autosomal formats”. It gives me this screen:

    image

    When the screen first popped up, everything was checked. I clicked “Deselect everything” and then checked just the combined file at the top.

    I did this for my Short Reads BAM file. When I pressed the Generate button at the bottom, the following info box popped up:

    image

    I pressed OK and the run started. After 65 minutes it completed and produced a text file with a 23andMe raw data header and over 2 million data lines that look like this:

    image

    It also produced a zipped version of the same file, since some of the upload sites request compressed raw data files.


    A Glitch and a Fix

    I wanted to do the same with my two Long Read BAM files. When I tried, it was taking considerably longer than an hour. So I let it run all night. It was still running the next morning. It was still running in the afternoon. Why would a Long Reads BAM file take over 20 times longer than a Short Reads BAM file to run? They both are about the same size. The Long Reads file of course has longer reads but fewer of them.

    I started wondering what was going on. I contacted the author. I posted about this on Facebook and got some helpful ideas. Finally I found the temporary files area that WGS Extract used. I was able to tell that for my Long Read BAMs, the temporary file with the results was not being created. I isolated the problem to the mpileup program, which was the one failing. I searched the web for “mpileup long reads nanopore” and found this post: mpileup problem with processing nanopore alignment. It suggested using the mpileup –B option.

    The mpileup –B option stands for “no-BAQ”. The Samtools mpileup documentation explains BAQ to be Base Alignment Quality. This is a calculation of the probability that a read base is misaligned. Allowing the BAQ calculation “greatly helps to reduce false SNPs caused by misalignments”.

    I tried adding the –B option, and now WGS Extract worked! It took 75 minutes to run for my BWA Long Reads file and 115 minutes for my YSEQ minimap2 Long Reads file. I then ran my Short Reads file with the –B option and it ran in only 20 minutes. I’ll compare that run with my original Short Reads run without the –B option, and that should give me an estimate as to how many false SNPs might have been introduced.
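    For anyone who hits the same wall, the change boils down to passing –B to the mpileup step. Invoked from Python, the call would look something like this (a sketch only; the file names are placeholders and WGS Extract’s actual command line will differ):

      import subprocess

      cmd = [
          "samtools", "mpileup",
          "-B",                       # disable BAQ; this is what let the long reads work
          "-f", "hs37d5.fa",          # reference genome the BAM was aligned to
          "-l", "snp_positions.bed",  # restrict to the SNP positions being extracted
          "long_reads.bam",
      ]
      with open("long_reads.pileup", "w") as out:
          subprocess.run(cmd, stdout=out, check=True)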


    Next Steps

    I’ll compare these 4 WGS Extract files with each other and with my 5 raw data files from my standard DNA tests in my next blog post. I’ll see if I can determine error rates, and I’ll see how much I can improve the combined raw data file that I’ll upload to GEDmatch.

    When Everything Fails At Once…

    2020. március 22., vasárnap 23:05:05

    Remember the words inscribed in large friendly letters on the cover of the book called The Hitchhiker’s Guide to the Galaxy:

    DON’T PANIC

    I returned 9 days ago from a two week vacation with my wife and some good friends on a cruise to the southern Caribbean. While away, we had a great time, but every day we heard more and more news of what was happening with the coronavirus back home and worldwide.

    On the ship, extra precautions were being taken. Double the amount of cleaning was being done, and purell sanitizer was offered to (and taken by) everyone when entering and leaving all public areas. The sanitizer had been a standard procedure on cruise ships for many years. I joked that this cruise would be one where I gained 20 pounds: 10 from food and 10 from purell. Our cruise completed normally and we had a terrific time. There was no indication that anyone at all had got sick on our cruise.

    We flew home from Fort Lauderdale to Toronto to Winnipeg. Surprisingly to us, the airports were full of people, as were our flights. None of the airport employees asked us anything related to the coronavirus or gave any indication that there was even a problem. I don’t think we saw 2 dozen people with masks on out of the thousands we saw.

    After a cab ride home at midnight, our daughters filled us in on what was happening everywhere. Since we were coming from an international location, my wife and I began our at-least 2 week period of self-isolation to ensure that we are not the ones to pass the virus onto everyone else. We both feel completely fine but that does not matter. Better safe than sorry.


    Failure Number 1 – My Phone

    On the second day of cruise, I just happened to have my smartphone in the pocket of my bathing suit as I stepped into the ship’s pool. I realized after less than two seconds and immediately jumped out. I turned on the phone and it was water stained but worked. I shook it out as best as I could and left it to dry.

    I thought I had got off lucky. I was able to use my phone for the rest of the day. All the data and photos were there. It still took pictures. The screen was water stained but that wasn’t so bad. But then that night, when I plugged it in to recharge, it wouldn’t. The battery had kicked the bucket. Once the battery completely ran out, the phone would work only when plugged in.

    Don’t panic!

    I had been planning to use my phone to take all my vacation pictures. Obviously that wouldn’t be possible now. I went down to the ship’s photo gallery. They had some cameras for sale but I was so lucky that they had one last one left of the inexpensive variety. I bought the display model of a Nikon Coolpix W100 for $140 plus $45 for a 64 GB SD card. I took over 1000 photos of our vacation over the remainder of our cruise, including some terrific underwater photos since the camera is waterproof.

    image

    Before the cruise was over, my phone decided to get into a mode where it wouldn’t start up until I did a data backup to either an SD card (which the phone didn’t support) or a USB drive which I didn’t have with me.

    Somehow, with some fiddling, the phone then decided it needed to download an updated operating system so I wrongly let it do that. Bad move! It was obvious that action failed as then the phone would no longer get past the logo screen. 

    At home, Saturday at 11 pm, I ordered a new phone for $340 from Amazon. It arrived at my house on Monday afternoon and I’m back in action. The only things on my old phone were about a month of pictures, including the first 3 days of our vacation. If it’s not too expensive, I might try to see if a data recovery company can retrieve the pictures for me. If not, oh well.


    Failure Number 2 – My Desktop Computer

    I had left my computer running while I was gone. I was hoping for it to do a de novo assembly of my genome from my long read WGS (Whole Genome Sequencing) test.  I had tried this a few months ago, running on Ubuntu under Windows. When I first tried, it had run for 4 days but when I realized it was going to take several days longer I canned it. Knowing I was going to be away for 14 days was the perfect opportunity to let it run. I started it up the day before I left and it was still running fine the next morning when I headed to the airport.

    When I got back, I was faced with the blue screen of death. Obviously something happened. “Boot Device Not Found”.

    image

    Don’t panic!

    I went into the BIOS and it sees my D drive with all my data, but not my C drive. My C drive is a 256 GB SSD (Solid State Drive) which includes the Windows Operating System as well as all my software. My data was all on my D drive (big sigh of relief!) but I also have an up-to-date backup on my network drive from my use of Windows File History running constantly in the background. So I wasn’t worried at all about my data. Programs can be reinstalled. Data without backups are lost forever.

    I spent the rest of Saturday seeing if I could get that C drive recognized. No luck. My conclusion is that my SSD simply failed, which can happen. I had a great computer but it was about 8 years old. The SSD drive was a separate purchase that I installed when I bought the computer, to speed up startup and all operations and programs. My computer was as dead as a doornail.

    Saturday night, along with the phone I purchased at Amazon, I also purchased a new desktop at Amazon. Might as well get a slight upgrade while I’m at it.  From my current HP Envy 700-209, a 4-core 4th generation i7 with 12 GB RAM, 256 GB SSD and 2 TB hard drive, I decided on a refurbished/renewed HP Z420 Xeon Workstation with 32 GB RAM, 512 GB SSD and a 2 TB hard drive for $990. It comes with 64-bit Windows 10 installed on the SSD drive. I’ve always had excellent luck with refurbished computers. The supplying company makes doubly sure that they are working well before you get them and the price savings are significant.

    On Tuesday, the computer was shipped from Austin, Texas to Nashville, Tennessee. It went through Canada customs Thursday morning, arriving here in Winnipeg at 9 a.m. and at my house just before noon.

    First step: hook it up. Immediately, a problem: my monitors have different cables than its video card needs. I had ordered the less expensive video card with it, an NVIDIA Quadro K600, and it did not come with the cables. I’m not a gamer so I don’t need a high-powered card. I made sure it could handle two monitors, but I didn’t think about the cables. As it turns out, comparing it to my old NVIDIA GeForce GTX 645, the old card is actually the better one. So first step: switch my old card into my new computer.

    image

    Now start it up, update the video driver, and get all the Windows updates. (The latter took about half a dozen checks for updates and 3 hours of time.)

    Next, turn it off, move the 2 TB drive from my old computer into an empty slot in my new computer, and connect it up. That gives me a D drive and an E drive, each with 2 TB, which should last me for a while.

    That was good enough for Thursday. Friday and Saturday, I spent configuring Windows the way I like it and updating all my software, including:

    1. Set myself up as the user with my Microsoft account.
    2. Change my user files to point to where they are on my old D drive.
    3. Set my new E drive to be my OneDrive files and my workplace for analysis of my huge (100 GB plus) genome data files.
    4. Reinstall the Microsoft Office suite from my Office 365 subscription.
    5. Set my system short dates and long dates the way I like them:
      2020-03-22 and Sun Mar 22, 2020
      image
    6. Set up my mail with Outlook. Connect it to my previous .pst file (15 GB) containing all my important sent and received emails back to 2002.
    7. Reinstall and set up MailWasher Pro to pre-scan my mail for spam.
    8. Reinstall Diskeeper. If you don’t use this program, I highly recommend it. It defragments your drives in the background, speeds up your computer and reduces the chance of crashes. Here are my stats for the past two days:
      image
    9. Reindex all my files and email messages with Windows indexer:
      Capture1
    10. Change my screen and sleep settings to “never” turn off.
    11. Get my printer and scanner working and reinstall scanner software.
    12. Reinstall Snagit, the screen capture program I use.
    13. Reinstall UltraEdit, the text editor I use.
    14. Reinstall BeyondCompare, the file comparison utility I use. I also use it for FTPing any changes I make to my websites to my webhost Netfirms.
    15. Reinstall TopStyle 5, the program I use for editing my websites. (Sadly no longer supported, but it still works fine for me)
    16. Reinstall IIS (Internet Information Server) and PHP/MySQL on my computer so that I can test my website changes locally.
    17. Reinstall Chrome and Firefox so that I can test my sites in other browsers.
    18. Delete all games that came with Windows.
    19. File Explorer: Change settings to always show file extensions. For 20 years, Windows has had this default wrong.
      image
    20. Set up Your Phone, so I can easily transfer info to my desktop.
    21. Set up File History to continuously back up my files in the background, so if this ever happens again, I’ll still be able to recover.
      image
      (and occasionally it saves me when I need to get a previous copy of a file)
    22. Reinstall Family Tree Builder so I can continue working on my local copy of my MyHeritage family tree. I hope Behold will one day replace FTB as the program I use once I add editing and if MyHeritage allows me to connect to their database. I also have a host of other genealogy software programs that I’ve purchased so that I can evaluate how they work. I’ll reinstall them when I have a need for them again. These include: RootsMagic, Family Tree Maker, Legacy, PAF and many others.
    23. My final goal for the rest of today and tomorrow is to reinstall my Delphi development environment so that I can get back to work on Behold. This includes installation of three 3rd party packages and is not the easiest procedure in the world. Also Dr. Explain for creating my help files and Inno Setup for creating installation programs. I’ll also have to make sure my Code Signing certificate was not on my C drive. If so, I’ll have to reinstall it.
    24. Any other programs I had purchased, I’ll install as I find I need them, e.g. Xenu which I use as a link checker, or PDF-XChange Editor which I use for editing or creating PDF files, or Power Director for editing videos. I’ll reinstall the Windows Subsystem for Linux and Ubuntu when I get back to analyzing my genome.
    25. One program I’m going to stop using and not reinstall is Windows Photo Gallery. Microsoft stopped supporting it a few years ago, but it was the most fantastic program for identifying and tagging faces in photos. I know the replacement, Microsoft Photos, does not have the face identification, but hopefully it will be good enough for everything else I need. Maybe I’ll eventually have to add that functionality to Behold, if I can get through my myriad of other plans for it first.

    Every computer needs a good enema from time to time. You don’t like it to be forced on you, but like cleaning up your files or your entire office or your whole residence, you’ll be better off for it.

    How would you cope if both your phone and computer failed at the same time?

    Just don’t panic!

    Computers 23 years ago

    2020. február 26., szerda 4:48:02

    #Delphi25 #Delphi25th – I came across an email I sent to a friend of mine on February 6, 1997 (at 1:17 AM). I’ll just give it here without commentary, but it should amuse, and bring back recollections for people who were early PC users.
     image

    You should find this message to be a little different. I am sending it using Microsoft Mail & News through my Concentric Network connection, rather than using my Blue Wave mail reader through my Muddy Waters connection. This gets around my problem of not being able to attach files, as you had tried for me. In future e-mails, I can attach pictures for you. I presume you can read GIFs, or would you prefer JPG or TIF?

    I will still be keeping my MWCS account until the end of 1997, but I am switching over more and more to my Concentric account. I am still not entirely happy with Windows-based Newsreaders yet, and find Blue Wave much more convenient for reading newsgroups. Hopefully, by the end of the year I will have this sorted out.

    I bit the bullet, and switched over to Windows 95 at home. I first had to upgrade my machine. I bought 16 MB more memory (to give me 24 MB) for $99 at Supervalue (of all places!) and bought a 2 GB hard drive for $360 (also at Supervalue!) less a $30 US mail-in rebate on the Hard Drive and a $30 sweatshirt thrown in due to a Supervalue coupon when over $200 is spent. My 260 MB drive that I bought 3 1/2 years ago already had Stacker on it to make it 600 MB, and I only had 80 MB free. I wanted to get rid of Stacker before going to Windows 95.

    It only took me 3 1/2 hours to install the RAM and the Hard Drive myself at home! It wasn’t without problems, but the operation was a success. I had hooked up my old and new Drives as master and slave and everything worked. The next night, I took another 3 1/2 hours to transfer everything from my old drive to my new one, removing the old drive, and getting the system working from the new drive - again not without problems, but completed that evening. I am very proud of myself! The next evening, it took about an hour to get Windows 95 installed, and to customize it to the way I liked.

    This hardware upgrade should be good for another couple of years. I only have the power supply, base, keyboard, mouse, and monitor as original parts. All the rest has been since upgraded.

    Windows 95 - Well I actually like 90% of it better than Windows 3.1, and am only finicky about 10% of it. I know, I know, buy a Mac you will say. Well I hope you are prepared to buy a new operating system every six months like Jobs says you’ll have to. I still agree Macs are a good system, but there is much more software available for PCs, Macs are 40% more expensive, and they still use that horrible character font that they used in the early 80’s - yecch!

    In the meantime, I have kept myself very, very, very, very, very, very, very, very busy. I have been working hard on many different fronts, after work playing hard with the kids until their bedtimes (usually closer to 10 p.m. than to 8), most often working on the Computer from 10 to 11 to 12 to (yikes) 1 or 2 sometimes - Got my web pages up (http://www.concentric.net/~Ikessler); have responded to about 50 e-mail messages and inquiries about it; designed a tender proposal for the photographic work for our Cemetery Photography Project (http://www.concentric.net/~Ikessler/cemphoto.shtml); and I’ve started learning how to use Borland Delphi to develop my BEHOLD program (http://www.concentric.net/Ikessler/behold.shtml)

    Whew! I’m getting tired just thinking about all this!

    Take care.  Louis

    25 Years of Delphi

    2020. február 14., péntek 9:52:01

    The Delphi programming language is celebrating its #Delphi25 #Delphi25th birthday on Friday, Feb 14, 2020. I’ve been using Delphi for about 23 years, since 1997 when I bought Delphi 2.

    Delphi is an amazing language. I use it now for Behold and Double Match Triangulator, and I’ve made use of it for a number of personal projects along the way as well.

    It’s appropriate on this day that I write about Delphi, how I use it, and what I like about it.


    Pre-Delphi

    I should provide a bit of background to give context to my adoption of Delphi as my programming language of choice.

    As I entered high school (grade 10), my super-smart friend and neighbor Orest, who lived two doors over and was two grades ahead of me, recommended I follow his lead and get into programming at school. The high schools in Winnipeg at that time (1971) had access to a Control Data Corporation mainframe, and provided access to it from each school via a card reader and a printer. You would feed your computer cards into the card reader. In the room were one (maybe two) keypunches, likely KP-26 or maybe KP-29.

    The computer language Orest used at the time, and the one the school was teaching, was FORTRAN, a University of Waterloo version: FORTRAN IV with the WATFOR and WATFIV compilers. What an amazing thing. You type up a sequence of instructions on computer cards, feed them through the card reader, and a few minutes later your results are printed on classic fanfold computer output.

    image image image

    For three years of high school, my best friend Carl and I spent a lot of time in that small computer room together. I remember a few of the programs I wrote.

    1. There was the hockey simulation, following the rules of a hockey game we invented using cards from an Uno card game. We simulated a full World Hockey Association season of 12 teams each playing 78 games, giving each team a different strategy. 11 of my friends would each have a team and look for the daily results and standings.
    2. For a special school event, my friend Carl and I wrote a dating program. We got everyone in school (about 300) and all the teachers (about 30) to fill out a multiple choice survey of about 10 questions about themselves, and the same questions for what they wanted in a date. During our school event, people would come to the computer room, and Carl and I would run them against the database and give them their top 5 dates with hilarious results.
    3. I played touch football with a number of friends once or twice a week during the summer. I recorded all the stats on the back of a computer card in between plays, then punched the results onto computer cards and wrote a program that would total all the passing stats, receiving stats, interceptions and fumble recoveries by player, giving the leaders and record holders in each category. Everyone loved seeing the stats and played harder and better because of it.
    4. I wrote a program to play chess. Carl wrote one as well. We had a championship match – chess program vs chess program that got us in our city’s newspaper.

    At university, I took statistics but also included many computer science courses. While there, I continued work on my chess program in my spare time, and the University of Manitoba sponsored me as a contestant in the North American Computer Chess Championships in Seattle, Washington in 1977 and in Washington, D.C. in 1978. Games were played over modems, with dumb terminals connected to the mainframes back at our universities. Read all about my computer chess exploits here: http://www.lkessler.com/brutefor.shtml

    After getting my degree in statistics, I went for my Masters in Computer Science. Now we finally no longer needed computer cards, but had terminals we could enter our data on. There was a Script language for developing text documents, and I used it to build my family tree, with hierarchical numbering, table of contents and an index of names. It printed out to several hundred pages on fanfold paper. I still have that somewhere.

    I started working full time at Manitoba Hydro as a programmer/analyst, rewriting and enhancing programs for Tower Analysis (building electric transmission towers) and Tower Spotting (optimizing the placement of the towers). These were huge FORTRAN programs containing tens of thousands of lines of what we called spaghetti code.

    Then I was part of a 3 year project to develop MOSES (Model for Screening Expansion Scenarios) which we wrote in the language PL/I. That was followed by another 3 year project from 1986 to 1988 where our team wrote HERMES (Hydro Electric Reservoir Management Evaluation System) which we also said stood for Having Empty Reservoirs Makes Engineers Sad. I learned that one of the most important parts of any program is coming up with a good name for it. I also learned how to three-ball juggle.

    The HERMES program was written in Pascal. That was a new language for me, but I learned it quite thoroughly over the course of the project. I believe I purchased my first personal computer, an IBM 386 at 20 MHz, for home sometime around 1993. When I did, FORTRAN was still available but very expensive, so instead I purchased Borland’s Turbo Pascal. I started programming what would one day become my genealogy program Behold.


    My Start with Delphi and Evolution Thereof

    I like to joke that I’m not an early adopter, and that’s why I didn’t buy into Delphi when it came out in 1995, but did buy Delphi 2 in 1997. Delphi was basically still the Pascal language. What Delphi added over Turbo Pascal was primarily two things: Object-Oriented Programming (OOP) and an Integrated Development Environment (IDE). Those were enough that I had to go “back to school” so to speak, and I loaded up on any Delphi books I could get my hands on. They’re still on my shelf now.

    IMG_20200213_233105

    I purchased Delphi 2 on May 14, 1997 for $188.04 plus $15 shipping & handling.

    I didn’t upgrade every year. It was expensive. I upgraded only when I felt there was some important improvement or new feature I needed.

    I upgraded to Delphi 4 in June 1998 for $249.95 plus $15 s/h. At this time, Borland had changed its name to Inprise. By 2001, they abandoned that name and went back to Borland.

    I was able to use Delphi 4 for quite some time. Finally there was a feature I absolutely needed: Unicode, which came in Delphi 2009. I was allowed to upgrade from my version of Delphi 4, and I did so, moving to Delphi 2009 in Sept 2008 for $374.

    Embarcadero purchased Delphi from Borland in 2008. In 2011, I upgraded to Delphi XE2 for $399 which included a free upgrade to Delphi XE3.

    I upgraded to Delphi XE7 in 2015 for $592. And I upgraded to Delphi 10.1 in 2016 for $824.40.

    The upgrades were starting to get expensive so in 2017 I started subscribing to Delphi maintenance for $337 per year.


    Third Party Packages

    Delphi includes a lot of what you want, but not everything. I needed a few packages from third parties who built components for Delphi. For Behold I used two:

    TRichView by Sergey Tkachenko. TRichView is a WYSIWYG (What You See Is What You Get) full-featured document editor that forms the main viewing (soon to be editing/viewing) window of my program Behold, the window I call “The Everything Report”. Behold is listed among the many applications that have been made with TRichView.

    I purchased TRichView in 2000 when it was still version 1.3. Now it’s at 7.2. Back then the cost was $35, and it was a lifetime license that Sergey grandfathered in for his early customers. He has continued to develop the program and has not charged me another nickel for any upgrades. I did, however, pay $264 to Sergey in 2004 for some custom code he developed that I needed. I liked that lifetime license policy so much that it inspired me to do the same for my Behold and Double Match Triangulator customers, who all get free upgrades for life when they purchase a license. Sergey no longer offers lifetime licenses. His current price for TRichView is $330, but he also offers other products that work with it. That’s now 20 years of Delphi development for Sergey.

    image

    LMD Innovative’s ElPack is the other package I use for Behold. This is a package of over 200 components that extend the functionality of the VCL controls that Delphi has. The main purpose I purchased this was for their ElXTree which allows custom creation of TreeViews and grids:

    imageimage

    I first purchased ElPack in 2000 from the company EldoS (Eugene Mayevski) who originally developed it.  The cost was $68. About 6 months after I purchased it, I noticed a free product available called Virtual Treeview written by Mike Lischke, but I was already using and happy with ElPack so I continued to use it. I considered switching to Virtual Treeview several years later, but my use of ElPack was already so deeply embedded into Behold, that it wasn’t worth the effort.

    I did have to pay for upgrades to ElPack, so I upgraded only when there was a reason to. Usually it was because I got a new version of Delphi and the old version wouldn’t install. My third party packages were another reason I didn’t upgrade Delphi so often: I couldn’t really upgrade until both TRichView and ElPack had versions that worked with the new version of Delphi, which could take up to a year after the Delphi release.

    In 2003, LMD Innovative acquired ElPack from EldoS and continued developing it. LMD’s current price for ElPack is $190. They have a partnership with TRichView and give me 20% off for being a TRichView customer. I tend to upgrade ElPack every two years or so.

    TMS Software’s FlexCel Studio was a package I purchased for Double Match Triangulator (DMT) to provide native Excel report and file generation, without requiring Excel automation or even Excel itself on your computer. I use it to produce the Excel files that DMT puts its results into. The capabilities of this component actually amaze me. It can do anything you can think of doing in Excel and more.

    image

    I first purchased FlexCel in August 2017 for $157.


    Additional Tools I Used to Work With Delphi

    Developing programs with Delphi requires additional tools from time to time. Here are some of the tools that were useful in my Delphi development:

    In 2009, I purchased for $129 a program called EurekaLog, which integrated with Delphi and helped find and locate memory leaks in my program Behold. The program helped me determine how my code was causing leaks. After a few years, with all leaks eradicated and better programming habits to avoid future leaks, I really didn’t have a great need to keep using it.

    In 2010, when I was tuning Behold for efficiency, I purchased a line-by-line profiler from AutomatedQA called AQTime that worked by itself or with Delphi. This was a very expensive program at $599, but I was able to speed up Behold 100-fold by finding inefficient algorithms and sections of code that I could improve, so it was worth the price. The program has since been acquired by SmartBear and still sells for $599. I no longer have a version of it that works with the latest version of Delphi. Delphi does provide a lite version of AQTime for free, but that does not include its fantastic line-by-line profiler. I’m no longer in need of super-optimizing my low-level code because that rarely changes. When I need to ensure a section of code is not too slow, I now put timers around the section, and that often tells me what I need to know.

    Dr. Explain is the program I chose for writing the help files for my programs. I first purchased it in 2007 for $182, upgraded in 2014 for $100. The current price of an advanced license is $390.

    image

    And my installation program of choice for Behold and DMT is the free Inno Setup Compiler from jrsoftware. I purchase Comodo Code Signing certificates for about $70 a year.

    image


    Personal Uses of Delphi

    Other than the two programs Behold and DMT that I am developing and selling licenses for, I also have used Delphi over the years to build some programs for my own use. These include:

    • A database search program I built for my local Heritage Centre so they could easily query their Microsoft Access database, which had listings and details of over 60,000 items. Originally written in Turbo Pascal and later converted to Delphi. (1996)
    • A program to build some of my link web pages for me such as my Computer Chess Links page. (1997)
    • A program to screen stocks for me to find stocks that I was interested in purchasing. (1997)
    • A program to run through all possible picks and determine what selections my competitors picked in our local newspaper’s hockey, football and stock market contests (1997). (Aside:  I have won more than $20,000 in such contests using this type of analysis to help me gain an advantage.)
    • A page counter for early versions of my websites. (2001)
    • A program to help win at the puzzle called Destructo, where you’re trying to break through a wall. (2001)
    • A program that produces the RSS feeds for this blog on this website (2004).
    • A program to analyze the log files from my websites, especially to find pages that link into my sites.(2005)
    • A program to help play sudoku. (2005)
    • A program to download stock market data and do technical analysis for me. (2008)
    • A program to analyze 100 GB raw data files from whole genome DNA tests. (2019)

    One thing I have never done is resurrect my chess program. For a while I considered it, but I knew it would be a lot of work and I didn’t want to take time away from my genealogy software. In the past couple of years, deep learning and AlphaZero have made all other programs irrelevant.


    What’s Next with Delphi

    I am very pleased that Embarcadero has continued to support and improve Delphi and that my third party packages continue to roll along. Hopefully that will continue for the foreseeable future.

    The stability I’ve had over the past 23 years of using Delphi has been fantastic. The development environment is great. I love how fast it compiles, how fast the code runs, and how easy it is to debug.

    Here’s 24 Years of Delphi and 25 Years of Excellence and here’s Going Forward.

    On my speaker topics page, I like saying that “Louis is fluent in five languages: English, Delphi/Pascal, HTML, GEDCOM and DNA.”

    Well, now I’d better post this page and get to bed, because I have to be up in 9 hours for the Happy Birthday Delphi 25 celebrations.

    GEDCOM Assessment

    2020. február 9., vasárnap 4:07:12

    I’ve been working hard to get Behold 1.3 completed. It will primarily be a newer iteration of Behold’s Everything Report. Once that is released, I’ll start my effort to add GEDCOM export, followed by editing.

    I’ve designed Behold to be a comprehensive and flexible GEDCOM reader that understands and presents to you all the data contained in GEDCOM of any flavour, from 1.0 to 5.5.1, with developer extensions and user-defined tags. So when John Cardinal came up with his GEDCOM Assessment site, that was an opportunity I couldn’t resist.

    “assess.ged is a special GEDCOM file which you may use to perform a review of the GEDCOM 5.5.1 import capability of any program that reads a GEDCOM file and imports the contents”

    John is a long-time user of The Master Genealogist program written by Bob Velke. John is also a programmer and wrote programs to work with TMG including Second Site for TMG, TMG Utility and On This Day.

    After TMG was retired in 2014, John wanted to help people get all their data out of TMG allowing them to transfer to other programs so he wrote the TMG to GEDCOM program. He also wrote a program that creates an e-book from a GEDCOM file called Gedcom Publisher. And John then wrote a program to create a website from any generic GEDCOM file and called that program GedSite.

    In the process of all this, John gained an expertise in working with GEDCOM and has made tests for GEDCOM compatibility that he invites all genealogy software authors to try.

    So try it I shall. 

    I followed John’s “process” and downloaded version 1.03 of the assess.ged file as well as the image files it references, and placed the latter in a C:\GedcomAssessment folder. Then I loaded assess.ged into Behold 1.2.4 and used his website’s Data Entry page to capture the results. This really is a beautifully set up assessment system. My compliments to John Cardinal.


    A Few Things To Fix for Version 1.3

    There were a number of tests that illustrated some aspects of GEDCOM that Behold does not fully support. I’ve made a list of them here:

    1. Behold by default uses the first person in the file and treats that person (and their spouse(s)) as the family the report is about. (You can of course pick anyone you want instead of or in addition to the first person). The assess.ged file does not link the 185 people in the file to each other, except for two who are connected as spouses. Behold was not using the first person in the file as a singular family but instead had the first section blank and listed all the people, including that first person, in its “Everyone Else” section. This should be a simple fix.
    2. I was surprised to see Behold display:  1 FACT Woodworking 2 TYPE Skills as Woodworking Skills rather than Skills: Woodworking. That’s a bug because I intended it the latter way. Same for 1 IDNO 43-456-1899 2 TYPE Canadian Health Registration which was being displayed as 43-456-1899 Canadian Health Registration rather than Canadian Health Registration: 43-456-1899.
    3. Behold somehow was ignoring and not displaying the TIME value on the change date of a record.
    4. The CONC tag concatenates two lines; GEDCOM specifies that the last word of the first line be split so that its second half begins the second line. Behold does this, but in doing so, Behold trims the lines before concatenating. As a result, if a GEDCOM uses a non-standard method of including a leading space on the second line or a trailing space on the first line, that space is ignored and the word at the end of the first line and the word at the beginning of the second line get joined with no intervening space. I haven’t noticed programs using this non-standard format, but even so, I’ll think about it and maybe I’ll remove Behold’s trimming of concatenated lines in version 1.3. (See the sketch after this list.)
    5. Behold displays: “Birth, date”.  But it should display “Birth: date”. Same for other events such as “Adoption, date” or “Baptism, date”. How did that ever happen?
    6. Behold currently displays the user-defined tag _PRIM as “Primary: Y” after a date, but retains the first-entered date as primary and does not use this tag to make that date primary. I’ll think about honoring the _PRIM tag in version 1.3.
    7. The non-standard shared event tag, e.g. 1 CENS 2 _SHAR @I124@ is not being displayed correctly by Behold. This will be fixed.
    8. Behold does not convert @@ in notes or text values to @, as it should. Technically all line items should be checked for @@ and changed as well so that includes names.
    9. Hyperlinks to objects unfortunately do not open the file because Behold added a period to the end of it. This is a bug that I noticed a few weeks ago and has already been fixed for the upcoming version of Behold under development.
    10. Alias tags (ALIA) whose value is the name of the person rather than a link are valid according to the GEDCOM standard, but this may be something I want to support if I see it being exported into GEDCOMs by some programs.
    11. I’m not displaying the tags under a non-standard top level 0 _PLAC structure correctly. This includes 1 MAP, 2 LATI, 2 LONG and 1 NOTE tags under the 0 _PLAC record.
    12. Non-standard place links such as: _PLAC @P142@ that link to the 0 _PLAC records should have been working in Behold, but the display of these links needs to be improved.
    13. If a person’s primary name has a name type, then it should be repeated with the type on the next line, e.g.
         Birth name:  Forename(s) Surname
      Also additional names should be called “Additional name” rather than just “Name”.
    14. Names with a comma suffix should not be displayed with a space between the surname and the comma. I’ve actually never seen this in the wild.
      e.g. /Smith/, Jr should be displayed as Smith, Jr and not Smith , Jr
    15. Notes on places are repeated and shouldn’t be.  Dates should be shown following any notes or other subordinate info for the place.
    16. Addresses could be formatted better.
    17. EVEN and ROLE tags on a source citation should have their tag text looked up and displayed instead of just displaying the tag name.
    18. The OBJE link was not included in source citation when it should have been.
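    To make the line-value rules in items 4 and 8 concrete, here is a minimal Python sketch of CONC joining without trimming and of the @@ escape. It is only an illustration of the rules, not Behold’s actual Delphi code:

    def join_continuations(lines):
        # lines is a list of (tag, value) pairs from one record,
        # e.g. [("NOTE", "The word conca"), ("CONC", "tenated was split.")]
        out = []
        for tag, value in lines:
            if tag == "CONC" and out:
                out[-1] += value           # join directly; do not trim either side
            elif tag == "CONT" and out:
                out[-1] += "\n" + value    # CONT starts a new line
            else:
                out.append(value)
        return out

    def unescape_at(value):
        # GEDCOM escapes a literal @ in a line value as @@
        return value.replace("@@", "@")

    print(join_continuations([("NOTE", "The word conca"), ("CONC", "tenated was split.")])[0])
    print(unescape_at("Reach me @@ home"))   # Reach me @ home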

    So that was a really good exercise. Most of these are minor, but a lot more issues came up than I expected. Over the next few days, I’ll resolve each of these in the development version of Behold, which will soon become version 1.3.

      
    Results and Comparison

    John presents a Comparison Chart that currently compares the results for 15 programs. There are 192 tests. Here’s my summary of John’s Comparison.

    image

    I’ve added Behold’s result in my chart. I’ve also excluded John’s program GedSite in summarizing the other programs, because his results are for a program that has already been tuned to handle these tests. So GedSite’s numbers are a good example of the results that I and other developers should try to attain with our programs.

    Behold didn’t do too badly with 161 supported constructs out of 192. Best was GedSite’s 185, followed by Genealogie Online’s 179, then My Family Tree’s 169, and then Behold’s 161. Genealogie Online is the baby of Bob Coret, who is another GEDCOM expert, and My Family Tree is by Andrew Hoyle of Chronoplex Software, who also makes GEDCOM Validator, so you would expect both of them to do well with regards to GEDCOM compliance.

    I’ve emailed the JSON text file of Behold’s results to John. Hopefully he’ll add Behold to his comparison chart.


    Comments About John’s Test File and Data Entry Page

    1. The assess.ged file version 1.03 includes a 1 SEX M line in each of the test cases. I’m not sure why. SEX is not a required tag in an INDI record. For a test file, it would be simpler to just leave the SEX lines out.
    2. I disagree with the constructs of two of the Master Place Link by XREF tests. They include, within one event, both a standard PLAC place value and the non-standard place xref link, i.e.:
         1 CENS
         2 PLAC New York
         2 _PLAC @P158@
      The trouble I have with this is that GEDCOM only allows one place reference per event. By using this alternative tag, you’ve effectively got two, which would be illegal if they were both PLAC tags. And what if they are not the same? John should take the 2 PLAC New York line out of his NAME 02-Link by XREF tests where he has the 2 _PLAC tag, so that there is only one place reference. Any programs allowing both PLAC and _PLAC tags on the same event should cease and desist from doing this. The second test, where the 3 _PLAC tag is under the 2 PLAC tag, is an even more horrible construct that no one should support.
    3. The GEDCOM Assessment Data Entry Page does not completely function in all browsers. When using my preferred browser Microsoft Edge, entering “Supported (w/comment)” did not bring up the box to enter the comment. I tried Internet Explorer and the page did not function at all. I had to switch to Google Chrome (or Firefox) to complete the data entry.


    Conclusion

    What this little exercise does show is how hard it is to get all the little nuances of GEDCOM programmed correctly and as intended. This assessment took the better part of a day to do, but I think it was well worth the time and effort.

    And what’s really nice about having a file with test cases is that they provide simple examples that illustrate issues that can be fixed or improved.

    I hope all other genealogy software authors follow my lead and test their programs with GEDCOM Assessment’s assess.ged file. Then it’s a matter of using this analysis to help make their programs more compatible with the standard and thus do their part to help improve genealogical data transfer for everyone.




    Update  Feb 10:  John reviewed the assessment with me. A few results changed status and I’ve updated the table above. John mentioned that his creation of GedSite wasn’t a conversion of Second Site for TMG, but was a completely new program.

    Behold version 1.2.4’s final assessment is now available here:
    https://www.gedcomassessment.com/en/assessment-behold.htm

    Once I complete version 1.3, I’ll likely submit it again for a new assessment.

    So Much Fun!

    2020. január 25., szombat 22:37:09

    The Family History Fanatics’ @FHFanatics online Winter DNA eConference has just finished and it was so much fun. Andy and Devon sure know how to put on a good show.

    This was the first time I’ve presented at an online conference.  I was able to do this comfortably from my office at home and my family was really good and went out of their way not to bother me for 6 hours.

    Devon Noel Lee, Jonny Perl and Paul Woodbury also presented and it was so great getting to interact with them. It was the next best thing to being at a physical conference, without the need to spend time in airports and hotels.

    In addition to all the great genetic genealogy methodologies presented, I also learned that I know nothing about what genes are on what chromosomes, and that audiences love to suggest random numbers.

    We ended with a question/answer period that turned into an entertaining roundtable discussion.

    image

    Now to get back to work and figure out how Jonny and I are connected through the 16 cM segment on chromosome 7 that we share.

    Double Match Triangulator, Version 3.1.2

    2020. január 22., szerda 5:16:01

    I released a small upgrade to Double Match Triangulator today. It includes a few fixes to minor problems and a couple of improvements:

    • Fixed the display of Base A-B segments in a combined run. They were always showing as single matches, but when they triangulate, they should show as Full Triangulation.
    • Improved the handling of the horizon effect by restricting just B-C segment matches to the Min Triang value and allowing smaller A-C and A-B matches.
    • Improved some of the data displayed and the data descriptions in the log file.
    • Allowed Persons A and B to be processed if only one file has matches to the other. Previously, matches both ways were required, which won’t happen when one of the files was downloaded before the other person’s test results were available.
    • MyHeritage Matches Shared Segment files are now filtered only by their English filename to help prevent the more severe problem of people selecting their MyHeritage Matches List file by mistake.
    • If running only File A, then messages say that instead of "using 0 Person B Files"
    • The Min Single label now shows as dark grey instead of red when values of 10 cM or 12 cM are selected, since 85% of single matches 10 cM or more should be valid.
    • The display of the number of inferred segments on the People page is now right justified rather than center justified.
    • If you have the DMT or People file open when DMT is trying to save it, DMT will now prompt you to close the file.

    As always, this new version and all future versions are free for previous purchasers of DMT.

    I’ll be talking about Double Matching and Triangulation Groups at the Family History Fanatics Winter DNA eConference this Saturday. I’ll be presenting my material in a very visual style taking you through some of the challenges I have with my own genealogy and DNA matches. I’ll be introducing some concepts that I don’t think have been discussed before. One of the attendees will win a free lifetime license for DMT. Hope to see you there.

    DNA eConference on Saturday, January 25

    2020. január 18., szombat 3:39:04

    I’m looking forward to the Family History Fanatics eConference coming up in a week’s time. I’ll be one of the speakers on what will be a great day of DNA talks.

    I have given many talks at conferences, but this will be my first virtual talk from the comfort of my office at home. I just submitted my Syllabus today and will be spending the next few days reviewing my presentation and tweaking my slides.

    image

    I’ll be talking about Double Matching and Triangulation Groups, but I’ll be presenting it in a very visual style taking you through some of the challenges I have with my own genealogy and DNA matches. I’ll be introducing some concepts that I don’t think have been discussed before.

    And for all you DNA Painter fans, I’ll be including:
    image 

    Andy and Devon, who run Family History Fanatics, have a great lineup with three other fantastic speakers who will also present.

    Devon herself, in her own wonderfully unique style, will present:
    image

    Then it’s Paul Woodbury, the DNA Team Lead at Legacy Tree Genealogists:
    image

    And finally, it’s none other than Mr. DNA Painter himself, Jonny Perl:
    image

    Andy has an interesting twist planned after all of our talks. It’s called “Genealogy Unscripted” and it will bring all four of us together to first answer questions and then “compete” in some sort of challenge to see who can claim the DNA trivia title. This comes as I am still recovering from watching last week’s Jeopardy Greatest of All Time tournament. Against Devon, Paul and Jonny, I’ll feel like Brad felt against James and Ken.

    If you want to see the talks but can’t make it live to some or all of the eConference, registering means you will be sent a link to the recorded conference. You’ll have a month to watch and/or re-watch any or all of the talks at your convenience.

    The cost is only $25 USD.  Register now, before it fills up. Hope to see you there.

    image

    Aligning My Genome

    2020. január 5., vasárnap 5:18:28

    I purchased a Whole Genome Sequencing (WGS) test from @DanteLabs for $399 USD in August 2018. My Variant Call Format (VCF) files were ready for me to download in January 2019, which I dissected in a blog post in February:  My Whole Genome Sequencing. The VCF File.

    My 1 TB hard disk containing my raw data arrived in April 2019. It included my raw reads (FASTQ files) and my assembled GRCh37 genome (BAM file).

    I was originally intending to then analyze my raw reads and BAM (Binary Sequence Alignment Map) file, but at that time in April, Dante had a deep discount on their Long Reads WGS test, which I purchased for $799 USD. So I figured I’d wait until I got the long read results, then analyze and compare the short read and long read tests together. That would prove interesting, show the strengths and weaknesses of each test, and maybe show how they could work together for improved results.


    The FASTQ file

    I got my long read results from Dante in October. They provided only the FASTQ data, delivered online as a single downloadable file.

    image

    The file was 199 GB (yes, that’s GB). On my internet connection, it took 12 hours to download. It is a compressed file. I unzipped it to look at it. It took 78 minutes to decompress to 243 GB.  It’s a good thing I still had half of my 2 TB internal hard drive free to use.

    This is what the beginning of my decompressed FASTQ file looks like. Four lines are used per sequence, containing: (1) an identifier and description, (2) the raw sequence letters, (3) a “+” character, and (4) the quality values for each letter in the sequence.

    image

    The lines extend further to the right than shown above. The 7 sequences here have 288, 476, 438, 302, 353, 494 and 626 bases. These are some of the shorter sequences in the file. If I go to the 1321st sequence in the file, it contains 6411 bases.

    But even that is likely short compared to what some of the longest reads must be. This file is promised to have an N50 > 20,000 bases.  That is not an average length, but that means that if you total the lengths of all the sequences that are more than 20,000 bases, then they will make up more than 50% of all the bases. In other words, the N50 is like a length-weighted median.
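    To make that concrete, here is a minimal Python sketch that computes the N50 from read lengths. The file path is a made-up placeholder, and the 7 lengths listed above serve as a tiny test:

    def read_lengths(fastq_path):
        # The sequence is line 2 of every 4-line FASTQ record.
        with open(fastq_path) as f:
            for i, line in enumerate(f):
                if i % 4 == 1:
                    yield len(line.rstrip("\n"))

    def n50(lengths):
        # Smallest length L such that reads of length >= L hold at least half of all bases.
        lengths = sorted(lengths, reverse=True)
        total = sum(lengths)
        running = 0
        for length in lengths:
            running += length
            if running * 2 >= total:
                return length

    print(n50([288, 476, 438, 302, 353, 494, 626]))   # 476 for the 7 example reads above
    # n50(read_lengths("mylongreads.fastq")) would do the whole decompressed file.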

    By comparison, taking a look at my short read FASTQ files, I see that every single sequence is exactly 100 bases. That could be what Dante’s short read Illumina equipment is supposed to produce, or it could have been (I hope not) processed in some way already.

    image


    The Alignment Procedure

    The FASTQ file only contains the raw reads. These need to be mapped to a standard human reference genome so that they can be aligned with each other. That should result in an average of 30 reads per base over the genome. That’s what a 30x coverage test means, and both my short read and long read tests were 30x coverage. The aligned results are put into a BAM file, which is a binary version of a SAM (Sequence Alignment Map) file.

    Dante produced and supplied me with my BAM file for my short reads. But I just learned that they only provide the FASTQ file with the long read WGS test. That means, I have to do the alignment myself to produce the BAM file.

    On Facebook, there is a Dante Labs Customers private group that I am a member of. One of the files in its file area is a set of instructions for “FASTQ(s) –> BAM” created by Sotiris Zampras on Oct 22. He gives 5 “easy” steps:

    1. If using Windows, download a Linux terminal.
    2. Open terminal and concatenate the FASTQ files.
    3. Download a reference genome
    4. Make an index file for the reference genome.
    5. Align the FASTQs and make a BAM file.

    Step 1 - Download a Linux Terminal

    Linux is an open-source operating system released by Linus Torvalds in 1991. I have been a Windows-based programmer ever since Windows 3.0 came out in 1990. Prior to that, I did mainframe programming.

    I have never used Linux. Linux is a Unix-like operating system. I did do programming for 2 years on Apollo computers between 1986 and 1988. Apollo had their own Unix-like operating system called Aegis. It was a multi-tasking OS and was wonderful to work with at a time when DOS was still being used on PCs.

    So now, I’m going to plunge in head first and download a Linux Terminal. Sotiris recommended the Ubuntu system and provided a link to the Ubuntu app in the Windows store. It required just one Windows setting change: to turn on the Windows Subsystem for Linux.

    image

    Then I installed Ubuntu. It worked out of the box. No fuss, no muss. I have used virtual machines before so that I could run older Windows OS’s under a newer one for testing my software, but this Ubuntu was so much cleaner and easier. Microsoft has done a lot in the past couple of years to ensure that Windows 10 runs Linux smoothly.

    I had to look up online to see what the basic Ubuntu commands were. I really only needed two:

    • List files in current directory:  ls
      List files with details:  ls –l
    • Change current directory:  cd folder
      The top directory named “mnt” (i.e. /mnt) contains the mounted Windows drives.

    Step 2 – Concatenate the FASTQ files

    My short read test came as a bunch of FASTQ files, but my long read data is just one big file, so nothing was needed for this step.

    Step 3 – Download a Reference Genome.

    Sotiris gave links to the human reference files for GRCh37.p13 and GRCh38.p13.

    For medical purposes, you should use the most recent version, which currently is the March 2019 GRCh38.p13 (aka Build 38, or hg38). But I’m doing this to compare my results to my raw data from other DNA testing companies. Those all use the June 2013 GRCh37.p13 (aka Build 37, or hg19).

    So I’m going to use Build 37.

    The two reference genome files are each about 800 MB in size. They are both compressed, and after unzipping, they are both about 3 GB.

    The files are in FASTA format. They list the bases of each chromosome in order. This is what they look like:

    image

    Both files denote the beginning of each chromosome with a line starting with the “>” character. That is followed simply by the sequence of bases in that chromosome. In Build 37, each chromosome is one long line, but my text editor folds each line after 4096 characters for displaying. In Build 38 they are in 60 character groups with each group on a new line.

    The lines contain only 5 possible values: A, G, C, T or N, where N is the code for an unknown base, meaning it could be any of A, G, C or T. There are usually blocks of N, especially at the beginning and end of each chromosome, as you can see above.

    Each genome also includes references for chromosomes X, Y, and MT.

    Those are followed by a good number of named patches, e.g. GL877870.2 HG1007_PATCH in Build 37 which contains 66021 bases.

    Here are the number of bases for the two builds by chromosome:

    image

    You’ll notice the number of bases in the reference genome is different in the two Builds by as much as 3.6%. A good article explaining the differences between GRCh37 and GRCh38 is Getting to Know the New Reference Genome Assembly, by Aaron Krol 2014.

    Just for fun, I compared the mt chromosome, which has the same number of bases in the two Builds. All bases also have the same value in both Builds. The count of each value is:

    • 5124 A
    • 2169 G
    • 5181 C
    • 4094 T
    • 1 N

    The one N value is at position 3107.
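    Here is a minimal Python sketch of how such counts can be produced from the decompressed reference. The file name and the “chrM” header are assumptions, since the actual FASTA headers vary by build and source:

    from collections import Counter

    def chromosome_base_counts(fasta_path, wanted):
        # Count base values for the chromosome whose ">" header matches `wanted`.
        counts = Counter()
        in_wanted = False
        with open(fasta_path) as f:
            for line in f:
                if line.startswith(">"):
                    in_wanted = (line[1:].split()[0] == wanted)
                elif in_wanted:
                    counts.update(line.strip().upper())
        return counts

    print(chromosome_base_counts("GRCh37.p13.genome.fa", "chrM"))
    # Should reproduce the counts above: 5124 A, 2169 G, 5181 C, 4094 T and 1 N.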

    Step 4 – Make an Index File for the Reference Genome

    Sotiris didn’t mention this in his instructions, but there were two tools I would have to install. The Burrows-Wheeler Aligner (BWA) and SAMtools.

    I was expecting that I’d need to download and do some complex installation into Ubuntu. But no. Ubuntu already knew what BWA and SAMtools were. All I had to do was execute these two commands in Ubuntu to install them:

    sudo apt install bwa

    sudo apt install samtools

    Again. No fuss. No muss. I’m beginning to really like this.

    In fact, Ubuntu has a ginormous searchable package library with numerous genomics programs in it, including samtools, igv, minimap, abyss, ray, sga and canu, all available with that simple single-line install.

    The command to index the Reference Genome was this:

    bwa index GRCh37.p13.genome.fa.gz

    It displayed step-by-step progress as it ran. It completed in 70 minutes after processing 6,460,619,698 characters. The original file and the 5 index files produced were:

    image

    Step 5 – Align the FASTQs and Make a BAM File

    The BWA program has three algorithms. BWA-MEM is the newest and is usually the preferred algorithm. It is said to be faster and more accurate and to have long-read support. BWA-MEM also tolerates more errors given longer alignments. It is expected to work well given 2% error for a 100 bp alignment, 3% for 200 bp, 5% for 500 bp and 10% for a 1000 bp or longer alignment. Long reads are known to have much higher error rates than short reads, so this is important.

    The command for doing the alignment is:

    bwa mem -t 4 GRCh37.p13.genome.fa.gz MyFile.fastq.gz | samtools sort -@4 -o FinalBAM.bam

    So the program takes my FASTQ file and aligns it to GRCh37. It then pipes (that’s the “|” character) the output to samtools, which sorts the alignments and creates my resulting BAM file.

    I have a fairly powerful Intel i7 four-core processor with 12 GB RAM. The -t 4 and -@4 parameters are telling the program to use 4 threads. Still, I knew this was going to take a long time.

    Here’s what I did to start the program:

    image

    First I used the ls –l command to list the files in the folder.

    Then I ran the command. By mistake, I had the reference genome and my FASTQ in the wrong order and it gave me a “fail to locate the index files” error. Once I figured out what I did wrong, I ran it correctly.

    The display showed the progress every 40 million bp processed. Each chunk seemed to average about 6200 sequences, indicating that, at least to start, there was an average of about 6450 bases per sequence. Working out the amount of time per base, I extrapolated the total time needed to be about 102 hours, or a little over 4 days. That’s do-able, so I let it go and was interested to see if it would speed up, slow down, or end up completing when I predicted.
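    For the curious, here is the back-of-the-envelope form of that extrapolation as a few lines of Python. The read count is the rough 20 million I estimate later in this post, and the seconds-per-chunk figure is an assumed illustrative value, not a number taken from the BWA log:

    total_reads = 20_000_000           # rough guess at the number of long reads in the file
    bases_per_read = 6_450             # average observed at the start of the run
    bp_per_progress_line = 40_000_000  # BWA reports progress every 40 million bp
    seconds_per_progress_line = 114    # assumption, for illustration only

    total_bp = total_reads * bases_per_read
    chunks = total_bp / bp_per_progress_line
    hours = chunks * seconds_per_progress_line / 3600
    print(f"{hours:.0f} hours, about {hours / 24:.1f} days")   # ~102 hours, ~4.3 days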

    Slowly temporary files were being created:

    image

    Every 75 minutes or so, 4 new temporary files (likely for the 4 threads) of about 500 MB each were being created. Don’t you like the icon Windows uses for BAM files? It uses that icon for FASTQ and FASTA files as well.

    Now I just had to wait.

    What amazed me while I was waiting was how well Windows 10 on my computer handled all this background processing. I could still do anything I wanted, even watching YouTube videos, with hardly any noticeable delay. So I took a look at the Task Manager to see why.

    image

    Even though 10.5 GB of my 12 GB RAM was being used, only 58% of the CPU was taken. I’m thinking that the 4-thread setting I used for BWA was easily being handled because of the 8 logical processors of my CPU.

    What also impressed me was that my computer was not running hot. Its cooling fan was dispersing air that was at the same temperature I usually experience. My worry before I started this was that days of 100% processing might stress my computer to its limits. Fortunately, it seems that’s not going to be a problem.


    Wouldn’t You Know It

    While I was waiting for the run to complete, I thought I’d look to see what Dante used to align my short read WGS test. For that test they provided me with not just the FASTQ raw read files, but also some of the processed files, including the BAM and VCF files.

    I unzipped my 110 GB BAM file, which took 3 hours to give me a 392 GB text file that I could read. I looked at the first 200 lines and I could see that Dante had used BWA-MEM version 0.7.15 to produce the BAM file.
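    If you want to check the same thing on your own data, the aligner and its version are recorded in the @PG lines of the BAM/SAM header. A minimal Python sketch, assuming the BAM has already been converted to SAM text at a made-up path:

    # Print the @PG (program) header lines, which carry PN (program name) and VN (version) fields.
    with open("shortread.sam") as f:
        for line in f:
            if not line.startswith("@"):
                break                   # header finished; alignment records begin
            if line.startswith("@PG"):
                print(line.rstrip())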

    I thought I’d go to BWA’s github repository to see if that was the most recent version. It’s pretty close. 0.7.17 was released Oct 23, 2017. The 0.7.15 version is from May 31, 2016 and the changes weren’t significant.

    But while there, I was surprised to see this notice:

    image

    Seems that minimap2 is now recommended instead of BWA-MEM. Here’s what they say in the minimap2 Users’ Guide:

    Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

    For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.

    Oh well. It looks like I’ll let the BWA-MEM run finish, and then try running minimap2.


    BWA Finally Done

    The BWA-MEM program finally completed 5 days and 12 hours later. That’s 132 hours, a bit more than the 102 hours I had estimated.  It would have been nice for the progress to be shown with a percentage completed, as I wasn’t entirely sure at any time how much more there was remaining to do. In the end, BWA created 328 temporary BAM files totaling 147 GB.

    Following this, BWA reported it had spent 480,409 seconds of real time (133.4 hours) and 1,638,167 seconds of CPU time, a ratio of 3.4, representing the gain it got from using 4 threads.

    Then BWA passed off to samtools for the assembly of the final BAM file. There was about an hour where nothing visible happened. Then samtools started creating the BAM file. Windows Explorer showed its progress, with the size of the BAM file growing by about 12 MB every second. This took another 3.5 hours, and the result was a single BAM file of 145 GB (152,828,327 KB).


    Minimap2

    After reading that BWA now recommends use of minimap2, and that minimap2 was much faster, more accurate and produces a better alignment, obviously I could not stop with the BAM file I had.

    I went back to Ubuntu and ran the following:

    sudo apt install minimap2

    but I got the message:

    image

    I found out it required a newer version of Ubuntu than Windows had supplied in their store. So I followed the instructions: How to Upgrade Ubuntu 18.04 to 19.10 on Windows 10 Linux subsystem by Sarbasish Basu. Then I was able to install and run minimap2.

    minimap2 -ax map-ont -t 4 GRCh37.p13.genome.fa.gz MyFile.fastq.gz | samtools sort -@4 -o FinalBAM.bam

    where the -x map-ont preset is for Oxford Nanopore long noisy reads.

    I ran this. It gave little feedback. After about 6 hours, it told me it had mapped 500000 sequences. Then another 6 hours and it had mapped another 500000 sequences. It wouldn’t have been so bad if minimap2 were as resource friendly as BWA, but I found it sometimes noticeably impacted my working on my computer. I could still do things, but much slower than normal. What would normally be instantaneous would sometimes take 3 to 30 seconds: opening a browser, my email, etc.

    Nonetheless, I let it go for 3 days (72 hours) and then canned the program because I needed my computer back. Afterwards, I calculated that there are likely about 20 million long read sequences in my file. Extrapolating time-wise, that would have been about 10 days of running to complete.


    YSeq

    A few weeks later, in the Genetic Genealogy Tips & Techniques Facebook group, Blaine Bettinger posted about the Dante WGS test he took in November. He said he had purchased the “FASTQ Mapping to hg38” service from YSeq for $25 and recommended it.

    Since minimap2 didn’t look like it was going to work for me on my computer, using YSeq sounded like a good idea. And since I’m interested in DNA for cousin matching purposes, and all the big consumer DNA companies are using Build 37 (i.e. hg19), I decided to also purchase the “FASTQ Mapping to hg19” from YSeq for an additional $25.

    I gave them access to my long read FASTQ files at Dante by setting up a temporary password for them. After about 4 weeks, my results were ready.

    You can see they had a whole bunch of result files for me:

    image

    They used the minimap2 program. The files are for both hg19 and hg38. The minimap2_hg38_sorted.bam file is 132 GB and the minimap2_hg19_sorted.bam file is the same size, 132 GB.

    There are also a bunch of Y-DNA results and mt (mitochondrial) DNA results, along with various stats. There are also a few files with 23andMe in the name that contain my variants in 23andMe’s raw data format. The pipeline files show me exactly what commands were run, and I appreciate having those.

    YSeq’s email telling me my results were ready included my BAM coverage figures: 38.8x for hg38 and 38.4x for hg19, so I achieved more than the 30x coverage that Dante promised. The average read was 5999 bases. That is much longer than a short read WGS test, which typically averages 100 bases per read. I don’t have stats on what the N50 was (see earlier in the article for a definition of N50), but Dante promises at least 20,000 and I trust that I got that.
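    Those figures can also be checked directly against a BAM file with samtools. As a rough sketch (samtools coverage needs samtools 1.10 or newer):

    samtools stats minimap2_hg19_sorted.bam | grep ^SN
    samtools coverage minimap2_hg19_sorted.bam

    The SN lines from samtools stats include the average read length and the total bases mapped, and samtools coverage reports the mean depth per chromosome, from which an overall coverage figure follows.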

    YSeq’s email also gave me my mt haplogroup: K1a1b1a, which is the same as what my Family Tree DNA mtDNA test gave me. Their Y-haplogroup path also matched my Big Y-500 test at Family Tree DNA, differing only in the final branches: YSeq ended with R-Y2630 -> R-YP4538, compared to FTDNA’s R-Y2630 -> R-BY24982 -> R-BY24978.

    YSeq provides BAM files just for mt and Y, which is convenient for uploading to services such as YFull. Personally, I’m not that interested in Y and mt because, other than my uncle, none of my matches are close enough to connect on my family tree. I have provided my Y-DNA to the Ashkenazi Levite DNA study and I’ve let them do the tough stuff with it.
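    If you only have a whole-genome BAM, those small Y and mt BAMs can be extracted with samtools. A sketch, keeping in mind that the chromosome names depend on the reference build (“Y” and “MT” in some GRCh37 references, “chrY” and “chrM” in others):

    samtools index minimap2_hg19_sorted.bam
    samtools view -b -o chrY.bam minimap2_hg19_sorted.bam chrY
    samtools view -b -o chrM.bam minimap2_hg19_sorted.bam chrM
    samtools index chrY.bam
    samtools index chrM.bam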

    Each of the two 132 GB BAM files took me about 22 hours to download at an average download speed of 1.7 MB/second.


    So What the Heck Do I Plan To Do With These BAMs?

    I’ve now got BAM files that were produced from:

    • My short read WGS produced by Dante using BWA-MEM.
    • My long read WGS produced by me using BWA-MEM.
    • My long read WGS produced by YSeq using minimap2.

    Other than scientific curiosity and an interest in learning, I’m mostly interested in autosomal DNA matching for genealogy purposes. I have two goals for these BAMs:

    1. To compare the BAMs from my short read and long read WGS tests with each other and to the raw data from the 5 SNP-based DNA tests I took. I want to see if from that I can determine error rates in each of the tests and see if I can correct the combined raw data file that I now use for matching at GEDmatch. (A sketch of one way to start this comparison follows this list.)

    2. To see how much of my own DNA I might be able to phase into my two parents. This will be a long-term goal. Reads need to contain at least two heterozygous SNPs (where the values from the two parents differ) in order to connect each end of them to the next read of the same parent’s chromosome. And there are some very long regions of homozygous SNPs (same value from both parents). WGS long reads are generally not long enough to span all of them. But I’d still like to see how many long segments can be phased.
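    As mentioned in goal 1, one way to start that comparison would be to call genotypes from a BAM only at the positions the SNP chips test, and then compare those calls to the chip raw data. This is only a sketch: the positions.txt file (one chromosome and position per line, taken from the chip raw data, with chromosome names matching the BAM) and the output file name are my own placeholders, the reference FASTA needs to be uncompressed and indexed, and the BAM needs to be indexed (as in the earlier samtools sketch):

    samtools faidx GRCh37.p13.genome.fa
    bcftools mpileup -f GRCh37.p13.genome.fa -R positions.txt minimap2_hg19_sorted.bam | bcftools call -m -o calls_at_chip_positions.vcf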

    All this will happen when the time is right, and if I ever get some time.

    Behold Version 1.2.4, 64-bit

    Saturday, January 4, 2020, 5:48:02

    I’ve released an update to Behold that includes a 64-bit executable that will run on 64-bit Windows computers.

    If Behold 1.2.3 is running fine for you, there’s no reason to upgrade. There are no other changes in it.

    The new installation program now contains both a 32-bit and 64-bit executable. The bit-level of your Windows computer will be detected, and the appropriate executable will be installed.

    What does 64-bit give you that 32-bit doesn’t? Well, really it just can handle larger files. If Behold 1.2.3 runs out of memory because your GEDCOM is too big, then 64-bit Behold may not. My tests using GEDCOMs created by Tamura Jones’ GedFan program indicate that on my computer, which has 12 GB of RAM, 32-bit Behold can load up to fan value 19 (half a million individuals) but 64-bit Behold can load up to fan value 22 (four million individuals); each additional fan value doubles the number of individuals in the file.

    The 64-bit version of Behold is actually a bit slower than the 32-bit version. On my computer I find it about 30% slower when loading the same file. But I believe 64-bit is more stable and less likely to crash than a 32-bit program, because it effectively cannot run out of address space.

    And a bit of a teaser:  I’m starting to get back to working on Behold. Stay tuned.