
Louis Kessler's Behold Blog

The Development of my Genealogy Program named Behold

GEDCOM Should NOT Allow Extensions

Monday, January 18, 2021, 23:54:45

The GEDCOM standard for transferring genealogical data has been in use basically unchanged for over 20 years now. Just about every genealogy software program can export (some of) its family data to a GEDCOM file, and can import (some of) the family data in a GEDCOM file into its database.

The issue is the “(some of)” qualifier that I put in.

We want our programs to export all their family data so that a user can transfer that data to another program or website. For the most part, the basic name-birth-marriage-death-date-place information transfers reliably. It's everything else, the facts, events, sources, repositories and even notes, that often doesn't make the crossing.

The blame is usually put solely on GEDCOM, accusing it of being unable to represent the data.

I disagree. I put just 10% of the blame on GEDCOM, and 90% of the blame on the programmers of genealogy software who have, for whatever reason, decided not to use some of the GEDCOM tags and constructs but rather use their own inventions instead.


Why Data Doesn’t Transfer

Several obvious reasons:

  1. The exporting program doesn’t export some of its data. You can’t import what’s not there.
  2. The exporting program sometimes exports its own custom GEDCOM tag or construct rather than use what’s in GEDCOM. An importing program can’t import what it doesn’t understand.
  3. The exporting program exports some of GEDCOM incorrectly. Hard to import anything that isn’t correctly exported.
  4. The importing program doesn’t import everything. Usually it won’t import what it doesn’t export.
  5. The importing program doesn't recognize certain standard GEDCOM tags and constructs because it uses its own custom tags and constructs in their place for its own export. For those tags and constructs, it can only re-import its own data.
  6. The importing program imports some of GEDCOM incorrectly. It may lose some data as a result.
  7. GEDCOM does not have a construct for storing a certain type of data, so it can’t be transferred. Many people think this is a worse problem than it is. There’s not much family data that GEDCOM cannot transfer.
  8. GEDCOM allows developers to use their own custom tags or extensions, so the developers do use their own. Other programs will not understand anything a developer does that’s not in the standard unless they do custom programming specifically to handle that developer’s custom tags and extensions. Allowing this was a mistake.


What is the Problem?

The number one problem is that developers, for whatever reason, are not taking the time to ensure they understand the GEDCOM standard and to export their data the way GEDCOM tells them to.

Too often, they are jumping to the conclusion that there is no way to export their data to GEDCOM, so they take what they think is the easy way out, and they invent their own tags and constructs for their data.

What's the harm in that? they think. After all, their program will export their data, and their program will be able to import it again. Do they really care if another program can? (They should, but I won't get into that in this article.)


An Example

I recently had an online conversation with a very experienced genealogy software developer who was wondering how strict a genealogy program should be with respect to GEDCOM support.

He gave this example of how he wanted to export information extracted from a marriage licence and add it as part of the MARL (marriage license) tag in GEDCOM.  

image

The MARL tag is valid. GROO, BRID and RECR are not. Source information is being included in a MARL fact under the GROO and BRID tags, when it should be in GEDCOM's SOURCE_CITATION structure instead.

Other than the program creating this, no genealogy program will be able to read and load this data as intended into its database.

So how should this case be handled?  This was my answer:

Converting your MARL event to valid GEDCOM (adding illegal indentation for clarity) would give this:

image

The birth places and ages could also be documented, but they shouldn’t be done under the marriage license event. They should be under the individual’s birth event:

image

What GEDCOM is saying regarding Evidence and Conclusions is this: Evidence should be in the DATA portion of the SOURCE_CITATION. Conclusions are the Events/Facts that you enter.

The TEXT information can be included as it is in the document and needn't be pigeonholed into real or imaginary tags like GROO or RECR.
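To make the idea concrete, here is a minimal GEDCOM 5.5.1 sketch of roughly what I mean. The names, dates and citation details are made up for illustration; the point is that the transcribed evidence goes into the DATA/TEXT portion of the SOURCE_CITATION, the conclusions go into the MARL and BIRT events, and no custom tags are needed:

    0 @F1@ FAM
    1 HUSB @I1@
    1 WIFE @I2@
    1 MARL
    2 DATE 12 JUN 1923
    2 PLAC Winnipeg, Manitoba, Canada
    2 SOUR @S1@
    3 PAGE Marriage licence no. 1234
    3 DATA
    4 TEXT Groom: John Smith, age 24, born Toronto, Ontario.
    5 CONT Bride: Mary Jones, age 22, born Winnipeg, Manitoba.
    0 @I1@ INDI
    1 NAME John /Smith/
    1 BIRT
    2 DATE 1899
    2 PLAC Toronto, Ontario, Canada
    2 SOUR @S1@
    3 PAGE Marriage licence no. 1234
    1 FAMS @F1@
    0 @I2@ INDI
    1 NAME Mary /Jones/
    1 FAMS @F1@
    0 @S1@ SOUR
    1 TITL Manitoba marriage licence register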


Conclusion

As I see it, two very bad things happen when developers do not follow GEDCOM as intended:

1. They will export GEDCOM that other programs will not understand.

2. They will not bother to implement some GEDCOM constructs that they are not using, so their program will not be able to import and properly interpret those valid GEDCOM constructs from other programs.

People think GEDCOM is the main reason why data doesn’t completely transfer between programs. False. It is the inconsistent implementation of GEDCOM for both import and export that is the primary cause of data loss.

Future enhancements to GEDCOM should require that only GEDCOM tags and constructs be used. No developer tags or constructs should be allowed.

Requiring compliance with no exceptions is the only hope we will ever have for all our genealogy data to one day be able to transfer correctly from program to program.


Further Reading

From 2015: Complete Genealogy Data Transfer
From 2015: Is GEDCOM Good For Sources?
From 2013: Nine Necessities in a GEDCOM Replacement

Setting Up an IIS Webserver for Local Website Development

Thursday, January 7, 2021, 4:07:42

Windows 10 comes with its own webserver called IIS (Internet Information Services). By default, IIS is not enabled because most people don’t need a webserver on their Windows computer. But if you want a copy of your website on your computer and want to be able to view your local copy and use it to test updates to your site prior to sending them up to your live site, then you’ll need your own webserver.

IIS is not your only choice on Windows. I have looked at WAMP and XAMPP as alternatives, but I am personally most familiar with IIS and have previously used it successfully for my websites, and I’m happy continuing to do so.

In March, the SSD on my computer crashed. That was my C drive with my operating system and all my software. My data was all on my 2 TB internal D hard drive and it was fine. And besides, I had all my data backed up.

I had set up full working local copies of my websites for development on my old computer when I purchased it in 2014. The setup procedure is relatively simple, but is full of gotchas, so I thought I'd document my efforts this time around.

I’ll give you the steps that I needed to get IIS working on my new Windows 10 desktop so that a copy of all my websites would work locally on my own machine.


Enabling IIS on Windows 10

In the Windows search bar, I entered: “Turn Windows Features on or off”. That  opened a window with all the neat stuff Windows 10 has that you never knew about.

I found the line that says “Internet Information Services” and checked the box.

image

I clicked on the plus sign to the left to expand it. That shows the various options available. Most of those needed are enabled already. In my case, I knew I needed server-side includes, so I opened up “Application Development Features” and checked the box beside “Server-Side Includes”. This will allow the .shtml pages I have on my lkessler.com to work.

image
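The same features can also be turned on from an elevated PowerShell prompt instead of the dialog. This is just a sketch of the equivalent commands; the exact feature names and dependencies can vary a little between Windows 10 builds:

    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServerRole -All
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ServerSideIncludes -All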

I clicked OK and IIS was activated. Now I could type “IIS” in the Windows search bar, and the Internet Information Services Manager would open.

image

On the left Connections pane, I clicked on “Default Web Site”. Then on the right pane under “Actions”, I clicked on “Browse *:80 (http)” and the default web site called “localhost” would appear in my default browser which for me is Microsoft Edge.   
   
image

There! I’ve installed a webserver on my computer.


Adding Security, i.e. https

Most websites now use the https protocol, which adds an extra level of security over the http protocol. Browsers now will warn you of potential insecurities that a website might have. Website developers want to minimize these warnings and in so doing, maximize the security for their visitors so that the connection will be private for their personal information and passwords and for doing e-commerce.

The technology for doing this involves obtaining a certificate that confirms the validity of the site. The site uses its private key to prove that it is the site the visitor thinks they are talking to, and not some other site intercepting the visitor's keystrokes.

I should have done it earlier, but finally last May, I converted my live web sites to use the https protocol. My webhost Netfirms made this simpler than I expected. They provide a free SSL (Secure Sockets Layer) Certificate from the company Let’s Encrypt. With the selection of just one setting, they do almost everything required automatically. There were some “mixed content” issues due to images I was linking to in other sites and 3rd party links that I needed to fix, as well as some minor WordPress changes. There was also a redirect I had to add into my .htaccess file so that all http requests would become https. But overall, it went quite smoothly.
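For reference, a redirect of that kind in an Apache .htaccess file typically looks something like the following. This is a sketch of the usual mod_rewrite rule, not necessarily the exact lines in my file:

    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]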

Now I needed to add the same https protocol for my local sites. This took me a number of days to figure out.

IIS gives you an ability to create a self-signed certificate. Browsers do not normally trust self-signed certificates, because they technically are not secure. But the real purpose here is to simulate the security, so that my development environment on my computer will include the https protocol that my live sites have and will act similarly.

To create the self-signed certificate, I opened IIS and double clicked on the "Server Certificates" icon. Then in the Actions panel, I clicked on "Create Self-Signed Certificate…", specified a friendly name for the certificate, e.g. mycert, and clicked OK. Without any delay, the certificate was created and was now listed.

image

It is important to notice who the certificate was issued to. In my case it was issued to Z420. That’s the name I gave to my computer when I booted it for the first time.

Now we’ll create an https version of the Default Web Site using the certificate. In the Connections panel of IIS, I drilled down to the Default Web Site and selected it. In the Actions panel, I clicked on “Bindings…”. In the “Site Bindings” window that opened, I clicked on “Add…”.  In the “Add Site Binding” window, under “Type”, I selected “https”. Then I entered the name of the computer as the Host name and selected “mycert” as the SSL certificate.

image

After clicking OK, I saw that https was added to the Site Bindings window.

image

I closed that window, looked over in the Actions panel and saw that there were now two entries under "Browse Website".

image

I clicked on the second one to bring up the “secure” https version of my Default Web Site:

image

Sure enough, the Microsoft Edge browser shows this as secure with the small lock symbol on the address bar.  Google Chrome also shows https://z420 as secure. Firefox however does not, but says the certificate is not trusted because it is self-signed:

image

Firefox will not relent on this, but if you click "Accept the Risk and Continue", this big box won't come up every time again. Firefox will still show a tiny warning symbol on top of the lock symbol on the address line, but that's really of no consequence if you're just doing local testing.


Installing PHP

PHP is a programming language used on many websites. It is the language WordPress is written in. I don’t use PHP on my lkessler.com website (I use Server Side Includes – see above), but my other 3 sites are all PHP pages.

To install PHP to work with IIS, you can do it manually, downloading the Windows Non Thread Safe version of PHP from www.php.net and then changing the settings as required for IIS. Or you can download Microsoft's Web Platform Installer (Web PI) and let it install PHP for you. I decided to use Web PI.

After installing Web PI, I selected "Products" and "Frameworks" and all the different Frameworks that Web PI has available appeared. I wanted to install version 7.4 of PHP. I have a 64-bit Windows operating system, so I wanted the x64 version. I'm using full IIS, not IIS Express, so I chose PHP 7.4.1 (x64) and then clicked on "Add".

image

After I did so, in the bottom left it said “3 items to be installed”. Clicking on that displayed:

image

I closed that window, clicked on the "Install" button and let it go.

image

Alas, this is one of the complaints about Web PI: it doesn't always go smoothly and may install other packages it didn't tell you about. The PHP Manager and Windows Cache Extension installs failed. And it included CGI and an earlier version of PHP that I didn't ask for.

This turns out to be okay. My desired version of PHP did get installed. And the earlier version will prove to be needed while I'm converting my live websites from PHP 5.6 to PHP 7.4, allowing me to test in both. And CGI is a required IIS component for PHP; I'd have had to manually include it (using "Turn Windows Features on or off", see above) if Web PI didn't do that for me.

PHP Manager is useful to have because it will allow me to easily change PHP settings and switch between versions. Downloading and installing the PHP Manager for IIS from its website www.phpmanager.xyz is a simple process. This adds a new icon to the IIS window that brings up a nice way to check the PHP configuration, change settings, and change the PHP version.

image

PHP Manager made two minor recommendations for the PHP configuration, which I accepted to remove the warning.


Adding My Websites

The Default Web Site directory was set up by IIS to be C:\inetpub\wwwroot.

I've already got local copies of my websites set up in my D:\Documents\www folder. I want them to stay in my D:\Documents folder so that Windows File History will continue to automatically back them up for me.

I originally tried setting my websites up with IIS virtual directories. But that had the problem that internal links referencing the home folder would think that localhost was the home directory, resulting in missing images and incorrect links, e.g. below should have been a graphic, and it linked to localhost/index.php when it should have been to localhost/dmt/index.php.

image

There did not appear to be a simple solution to this. If there was, this would have been my preferred solution because Edge and Chrome both considered all my virtual directory sites to be fully secure.

So instead of using virtual directories, I created full websites in IIS. What I lose by doing this is that the secure versions of the sites are no longer subordinate to https://z420, so Edge and Chrome no longer think they are secure with my self-signed certificate. I looked for a solution to this as well, and could not find anything simple for this either.

So I was in a catch-22. Either virtual directories with full security but links that don’t work, or full websites with working links but security warnings.

Since this was on my local machine for development purposes and only I would be accessing it, I needed the links to work and the security wasn’t as important so I went with full websites.

To create a full website in IIS, in the Connections window I clicked on Sites, and then in the Actions pane I clicked on Add Website.

image

In the dialog, I entered a short site name and host name to clearly differentiate it from my live site (www.doublematchtriangulator.com) and make it easy to bring up my local site just by typing "dmt" into my browser's address bar. Since I only need to develop the secure version of my site, I selected the https binding and picked the self-signed certificate I created earlier in this post that I named mycert.
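For anyone who prefers scripting it, roughly the same site can be created with the WebAdministration module from an elevated PowerShell prompt. This is only a sketch: the site name and host header are the ones from my example, the physical path is an assumed subfolder of my www folder, and the certificate still has to be assigned to the https binding afterwards (in the Bindings dialog or by script):

    Import-Module WebAdministration
    New-Website -Name "dmt" -HostHeader "dmt" -Port 443 -Ssl `
        -PhysicalPath "D:\Documents\www\dmt"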

Browsing my local site now gives this:

image

I clicked on “Advanced”, and then on “Continue to dmt (unsafe)” and despite it looking ugly with the “Not secure” warning in the address bar, it displays my site correctly and the links work.

image

Somewhere/how I need a certificate that claims it is from the site “dmt” and then Edge will display my local page without the warning. I’ll keep looking for a simple solution to this.

I did the same thing for my other 3 sites as well, giving me this in IIS:

image 


Redirecting HTTP to HTTPS in IIS

Typing "dmt" in the browser window by default looks for the http version of the site. I want to be able to simply type my abbreviated site names without needing the https:// before them to get to my secure local sites. The solution to that is redirection.

I used Web PI again and found the URL Rewrite module. I clicked on "Add" and then "Install".

image

That added a “URL Rewrite” button to the Features view in IIS.

image

In IIS I next clicked on URL Rewrite, clicked on "Add Rule(s)…" and followed the instructions given in: Best way to redirect all HTTP to HTTPS in IIS.

I found I did have to access each site once with the http prefix, e.g. "http://lk", which gets redirected to https://lk. But after I do that once, entering just "lk" in my browser will redirect to my https site.
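For reference, the redirect rule that kind of walkthrough produces ends up in the site's web.config and generally looks something like this (a sketch; the rule name and details may differ from mine):

    <system.webServer>
      <rewrite>
        <rules>
          <rule name="HTTP to HTTPS" stopProcessing="true">
            <match url="(.*)" />
            <conditions>
              <add input="{HTTPS}" pattern="off" ignoreCase="true" />
            </conditions>
            <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Permanent" />
          </rule>
        </rules>
      </rewrite>
    </system.webServer>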


Setting Up the MySQL Database and phpMyAdmin

WordPress stores all its posts and comments in a MySQL database. I decided to use Web PI to install MySQL:

image

That was simple and went well. All I had to do was give it the password I wanted.

Web PI surprisingly does not include an option to download the phpMyAdmin tool, which is a browser-based MySQL database tool that most MySQL database Admins use. So I loosely followed the instructions given by Cyril Kardashevsky.

I downloaded the latest version 5.0.4 from www.phpmyadmin.net. It comes as a zip file. I unzipped it and copied the contents into its own folder where my websites are: D:\Documents\www\phpmyadmin. Then I set it up as a website just as I did my other websites.

Next step was to open my browser and go to phpmyadmin/setup in my browser. 

image

I clicked on "New server", which took me to a server settings page. I left all the options as default, and then clicked "Apply". It created a server called localhost and returned me to the above window. I pressed "Download" and that created a "config.inc.php" file that I moved to my phpMyAdmin folder. I edited the config file, entered the password I wanted, and saved the file.

image

Now I went back to my phpMyAdmin site, to see the phpMyAdmin login:

image

I entered the user as “root” and the password I specified and pressed “Go” and phpMyAdmin opened and I could see the new database in the left panel:

image


Revisiting Self-Signed Certificates

Up above, I wrote:

Somewhere/how I need a certificate that claims it is from the site “dmt” and then Edge will display my local page without the warning. I’ll keep looking for a simple solution to this.

Well, before I even finished this blog post, I ran across the solution as outlined here: How to Create Self-Signed SSL Certificates in Windows 10, and it’s pretty simple.

Type “PowerShell” in the Windows search bar, and then click on “Run as Administrator”. That brings up a PowerShell window where I entered (as one line):

New-SelfSignedCertificate -CertStoreLocation Cert:\LocalMachine\My -DnsName "dmt" -FriendlyName "mycertdmt" -NotAfter (Get-Date).AddYears(30)

What this does is produce a self-signed certificate for the domain "dmt", which is the very short domain name I use on my computer for my local version of my doublematchtriangulator.com site. Note that I don't use a suffix like ".com" for my local domain, but if it had a suffix, I'd have to include that in the command shown above.

This is what it looks like in PowerShell and the response after entering the command:

image

I ran this 4 more times, changing the two occurrences of dmt (in "dmt" and mycertdmt) to bho and mycertbho, to gsr and mycertgsr, to lk and mycertlk, and to phpmyadmin and mycertpma. Those were for my other 3 websites and the phpmyadmin site I created.
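The same five certificates could also have been created in one go with a small loop. A sketch, using the short names and friendly names listed above:

    $certs = @{ dmt = "mycertdmt"; bho = "mycertbho"; gsr = "mycertgsr";
                lk = "mycertlk"; phpmyadmin = "mycertpma" }
    foreach ($name in $certs.Keys) {
        New-SelfSignedCertificate -CertStoreLocation Cert:\LocalMachine\My `
            -DnsName $name -FriendlyName $certs[$name] -NotAfter (Get-Date).AddYears(30)
    }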

After running those 5 commands, I typed "Computer Certificates" into the Windows search bar and clicked on "Manage computer certificates". That opened the Microsoft Management Console to its Local Computer Certificates window. Then I opened the Personal folder and then the Certificates folder. It shows the 5 certificates that I just created, as well as the original Z420 self-signed certificate I created from within IIS.

image

Now I selected the 5 certificates I created, right-clicked and chose "Copy". I went to the left panel, expanded "Trusted Root Certification Authorities", right-clicked on the "Certificates" folder under it and chose "Paste". That copied the 5 certificates to the Trusted Root Certification Authorities folder.

Then I opened IIS and in the Connections panel selected my dmt website. In the Actions panel I clicked on “Bindings…”. I selected the “https” binding. I clicked on “Edit”. The “SSL certificate” selection had all the self-signed trusted certificates I just created:

image

I selected the appropriate one for dmt, and clicked OK. But I got the message:

image

To prevent that message and to use individual bindings with each certificate, I had to go back and simply check the “Require Server Name Indication” box that’s under the “Host name”.

And sure enough, I’ve now got the secure lock symbol on my local site and no ugly warning:

image

This works beautifully in both Edge and Chrome.

Firefox still does not like the self-signed certificate and requires you to "Accept the risk" one time as described earlier in the post. After you do, you're left with just a caution sign on the lock symbol, which isn't too intrusive:

image


Reloading my MySQL databases

I lost my local copy of my MySQL databases when my computer crashed in March. Those aren’t really important, because they really are just a backup of my WordPress database that I have online. In effect, I just lost my backup.

But, in order to get WordPress going again locally, I had to copy my data down from my online site. I do this anyway from time to time to back up my online data.

To get the data, I log in to my account at my webhost Netfirms, load their version of phpMyAdmin, and export the database:

image

At the bottom left, you can see it downloaded to a .sql file.

Now I can go to my local phpmyadmin site, login, click on the Import tab, choose the file, and click on “Go”.

image

However, the maximum file size is set at 2 MB, and my file is 7.5 MB.

So I'll open IIS and open PHP Manager. In the PHP Settings section, I'll select "Manage all settings" and I'll find and increase the upload_max_filesize setting. The setting actually resides in the PHP configuration file known as php.ini.

image
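In php.ini the relevant directives look something like the following. The values here are just an assumption, sized comfortably above my 7.5 MB export; post_max_size normally has to be at least as large as upload_max_filesize for the upload to go through:

    upload_max_filesize = 64M
    post_max_size = 64M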

And now the sql statements get executed and the database gets imported:

image

That was my GenSoftReviews database.

Now that I did that, all the WordPress code should just work. Does it?

Not with PHP 7.4, but when I use PHP Manager to downgrade back to PHP 5.3, then yes! My local copy of my GenSoftReviews site does work, with all the latest content that I just copied from my live site:

image

Then I did the same with my Behold blog and forum database. The SQL download for that was 50.5 MB, and I got this error:

image

I tried one suggestion to increase the PHP post_max_size setting. That didn’t work. I tried another one that suggested increasing the IIS Configuration setting  uploadReadAheadSize that is under system.webServer/serverRuntime. That didn’t work either.

The solution that worked for me was changing the IIS Configuration setting maxAllowedContentLength:

image

I changed it from 30000000 (30 million) to 500000000 (500 million). My 52 MB file exceeded the 30 million byte value.
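That setting lives under system.webServer/security/requestFiltering, so in the site's web.config it ends up looking roughly like this (a sketch with the value I used):

    <system.webServer>
      <security>
        <requestFiltering>
          <requestLimits maxAllowedContentLength="500000000" />
        </requestFiltering>
      </security>
    </system.webServer>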

I wasn’t out of the woods yet. Loading the file I got this:

image

Unknown collation: ‘utf8mb4_unicode_520_ci’. But I see on the left that it did create the 2nd MySQL database and some of the tables were successfully created. It failed on my "wp_commentmeta" table.

I opened my SQL download file and saw that other collations in the file were ‘utf8mb4_unicode_ci’, i.e. without the “_520”. So I took out the “_520” from the 4 instances and saved the file. Then from phpMyAdmin, I selected that database, went to the Operations tab, and clicked on the red text: Drop the database to delete the database. Then I tried the Import again.

image

Again I switched back to PHP 5.3. It’s a little awkward having to switch to PHP 7.4 for phpMyAdmin and then back to PHP 5.3 to get my blog going. I may be doing more of this switching between PHP versions until I get everything working in 7.4. It’s a good thing PHP Manager makes this switching easy.

In PHP 5.3 with my blog and forum database now loaded, let's see if it works.

image

Nope. Not yet.

After an hour of debugging (I won't go into the gory details), I determined that there was something wrong locally with one of the WordPress plugins I was using. By changing the name of the plugins folder to "notplugins", WordPress would not find any of the plugins and would hopefully load my site properly without them. That worked. The local copy of my Behold blog now appeared, loaded with the latest live data that I had loaded into my local MySQL database:

image

  And my forum worked as well:

image

Compare with my live site and you'll see there is no login line in either case, because that came from a plugin, but nonetheless: Ta-da!

Adding the plugins back one by one allowed me to find the one that failed. The culprit was a plugin called “maxblogpress-ping-optimizer” which I don’t really need anyway. I copied back all the other plugins and everything worked including my login line.

I should also add that there is just one difference between my WordPress code on my live website and my local site. It’s in my wp-config.php file. I set up a variable $whereami to say which site the file belongs to:  my live site (Production) or my local site (Test). And the only difference in the two files is which whereami statement is first and which is second, the second one being the value used:

image
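As a sketch of that idea (the variable name is the one from my wp-config.php; the ordering shown is for the local file), the two assignments sit one after the other and the second one wins:

    $whereami = 'Production';   // the live site
    $whereami = 'Test';         // the local site; on the live server these two lines are swapped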


Summary

All of the above took several weeks. Some of the steps took me a dozen tries before I got them right, and many required web research to find out how to fix or get around something. I never got to the point of frustration where I had to resort to asking a question on StackOverflow, since I managed to find a solution to all my problems, sometimes from answers already on StackOverflow.

This blog post acts as my reminder to myself of what I did, and will help me remember what to do again when I get my next computer, hopefully no less than 5 years from now. I doubt if anybody will have to do exactly what I have done here, but I hope this post will help a few people with a specific problem when their web search brings them here.

My site seems to work fine with PHP 7.4 except for my blog, forum and GenSoftReviews which use old versions of WordPress and bbPress. My next step will be to get the latest versions of WordPress and bbPress working with my own customized theme. I may have to replace plugins that are no longer available, and look at what custom modifications I made that are still necessary and find a way to implement them without hacking the WordPress code directly as I did before. Then in the future, I should be able to keep PHP, WordPress and bbPress up-to-date and not run into a forced upgrade again.

If what I described in this blog post sounded difficult, I expect my upgrade of WordPress and bbPress won’t be any easier. But maybe I’ll be surprised.

2020 GenSoftReviews Users Choice Awards

Friday, January 1, 2021, 21:49:09

Happy 2021 everyone! This is the 12th year of the awarding of Users Choice Awards to genealogy software that users have rated highly.

image

Since 2008, GenSoftReviews (www.gensoftreviews.com) has had users write 5,874 reviews of the 1,041 different genealogy-based programs listed at the site.

Of these 1,041 programs:

  • 498 run on Windows
  • 133 run on a Mac
  • 114 run on Unix
  • 127 are for handheld devices
  • 408 run online (i.e. from a website)
  • 365 are full-featured for recording your family tree
  • 532 are free
  • 235 are no longer supported by the author, but many are still in use

To receive a Users Choice Award each year, a particular program must:

  1. Have an end-of-year user rating of at least 4.00 out of 5.
  2. Have at least 10 user reviews.
  3. Have at least 1 user review during that year.

GenSoftReviews uses an exponential rating algorithm. Every user rating will have double the weight of a rating from one year earlier. So more recent ratings will have more influence on the overall rating.
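Expressed as a small PHP sketch (simplified, not the exact code the site runs), a rating from n years ago carries a weight of 1/2^n and the program's overall rating is the weighted average:

    $currentYear = 2020;
    $weightedSum = 0;
    $weightTotal = 0;
    foreach ($reviews as $review) {                            // each $review has a 'rating' (1 to 5) and a 'year'
        $weight = pow(0.5, $currentYear - $review['year']);    // half the weight for each year of age
        $weightedSum += $weight * $review['rating'];
        $weightTotal += $weight;
    }
    $overallRating = $weightedSum / $weightTotal;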

A complete list of all the 2020 winners and previous winners can be found on the GenSoftReviews awards page, with their rank, rating, and a link to their descriptions and reviews.


Summary for 2020

27 programs were awarded a Users Choice Award in 2020.

Sixteen programs won last year and won again this year:

  • Brother’s Keeper, winner since 2009
  • Personal Ancestral File (PAF), winner since 2009, unsupported
  • Reunion, winner since 2009
  • The Next Generation (TNG), winner since 2009
  • Ancestral Quest, winner since 2011
  • Family Historian, winner since 2011
  • Family Tree Maker (up to Version 16), winner since 2011, unsupported
  • Ahnenblatt, winner since 2012
  • Famberry, winner since 2013
  • Genealogie Online, winner since 2015
  • webtrees, winner since 2015
  • Family Book Creator, winner since 2016
  • Generations, winner since 2016, unsupported
  • The Master Genealogist (TMG), winner since 2016, unsupported
  • GedSite, first-time winner in 2019
  • Second Site for TMG, first-time winner in 2019

Seven programs worked their way back into the winner’s category this year:

  • Aldfaer, who previously won in 2016,
  • Ancestris, who previously won in 2017-2018,
  • Clooz, who previously won in 2012-2018,
  • Familienbande, who previously won in 2015-2018,
  • Oxy-gen, who previously won in 2018,
  • RelativelyYours (unsupported), who previously won in 2016-2018, and
  • Rootstrust, who previously won in 2018.

Four programs became an award winner for the first time:

  • Centurial, evidence-based software by Acoose.NET (Fouke Boss)
  • MacFamily Tree, a full-feature program for the Mac by Synium Software
  • My Family Tree, a free full-featured Windows program from Chronoplex Software (Andrew Hoyle)
  • ScionPC, a free “Genealogical Management System” by Robbie J Atkins of New Zealand. During the year the program became unsupported.


Programs that Did Not Repeat from 2019

There were four award winners from 2019 who failed to win again this year:

Two programs who were award winners in 2019 slipped below the required 4.00 value this year:

  • MyHeritage, who was an award winner from 2014 to 2019, and
  • Mundia, an unsupported program that won for the first time in 2019.

Two programs had the required 4.00 rating but did not receive at least one review during 2020:

  • iFamily for Mac
  • Ultimate Family Tree (unsupported)


Wishes for the Future

The goal of GenSoftReviews is to encourage developers to build genealogy software that their users like. Congratulations to the award winners. You have a majority of users who are willing to praise you for your software.

Developers winning a GenSoftReviews award should feel free to place their award badge on their site and encourage their users to review their software.

To those developers who did not win an award (and even to those who did), I encourage you to look at your program’s reviews and ratings and to use them as constructive criticism to make changes that can improve your users’ opinions of your software.

Averting Blog Disaster

Saturday, December 12, 2020, 7:58:14

Yesterday, I logged into my account at my webhost Netfirms and I was met with a somewhat alarming message:

image

That was not pleasing to me. I knew what that meant. Likely I’d need to make major revisions to my website to get my Behold blog and my GenSoftReviews site to work under the new version of PHP.

My Behold blog and GenSoftReviews sites are 12 years old. I developed them both myself with WordPress. GenSoftReviews uses a WordPress plugin called WP Review Site that I purchased and then customized to my liking. My Behold Forum uses bbPress version 0.8 that was able to integrate with WordPress.

I spent many months customizing my blogs and forum to my liking, starting with the Behold style that I created to make my blog and the forum completely match the rest of my site. I added a user database for my Behold and DMT trials and purchases and automated the sending out of trial keys and recording of purchases. I created an integrated login system so people could post comments on my blog and messages in my forum. I added sophisticated spam filters to prevent the multitude of spam from getting onto my page. I added my newsletter system into the framework. Almost every single thing is tweaked and customized exactly to my liking.

The programming language for this is PHP and the database is MySQL. I had never used either of them prior to this endeavor, so it was a trial by fire. I’m proud of what I created and it has worked almost without a hitch for the past 12 years. That is of course without upgrading the underlying versions of WordPress and bbPress that I was using. I couldn’t upgrade them, really. The customizations I had done were extensive and some of the plugins that I was using were no longer available and were not being upgraded to work with new versions of WordPress.


Flipping the PHP Switch

I knew what would happen when I selected a PHP version 7 or greater: My blog would stop working. I tested it out and sure enough, only an error message appeared where my blog should be.  I changed it back, and it worked again.

I spent the next couple of hours adding PHP 7.4 to my computer. I went back to my live blog and tried a few things. I flipped the PHP switch on my live site again and got the error again. I flipped it back to 5.6 and … oh oh, I still had the error.

This was no ordinary error. This was the dreaded Error 500 – Internal Server Error, that told you absolutely zip, zero, zilch about what was going on:

image

So how do you figure out what's causing an error when no information is given? Into my WordPress PHP code I went. For the next 3 hours, I was debugging it live online, line by line, putting in "here I am" statements and tracing to find which line was causing the error. I found out it was the line that was trying to initialize the MySQL database.

    $wpdb = new wpdb(DB_USER, DB_PASSWORD, DB_NAME, DB_HOST);

I spent two hours trying to get WordPress to initialize the database and tried everything, including setting up test programs and scanning the web and StackOverflow for this type of problem and solutions. I almost went as far as changing the password on the database. The funny thing that I noticed was that GenSoftReviews was still working, but what that meant didn't yet register with me.

It was now 1 a.m. I used Netfirms Support chat and got help from one of their support people. I was trying to figure out from the support person why the PHP change and then changing back now resulted in my blog not working. We tried a number of things and finally I was given a ticket where a Technical Specialist would contact me in 24 to 48 hours.  It was 2:30 am and I went to bed.

The next morning I was right back at it with some new ideas. I tried various things and continued debugging. Overnight and for much of the day, I had a sad little message posted on my blog and forum:

image

After a few hours working through it all, I checked my email and I had got this message:

image

Umm. What!? This was an automated message from WordPress to me. Sure enough, lots of WordPress files were missing on the server. And there were extra files as well. What I had on my computer, which was supposed to be a working copy, was different from what was online.

So I used BeyondCompare to mirror the tens of thousands of files on my computer in my blog directory back onto my website at Netfirms. When that completed half an hour later, my blog appeared and worked fine!

An earlier email from the morning said this:

image

What had happened earlier that I didn't realize was that WordPress on my website had updated itself to its latest version. I knew that would crash my blog just as the PHP upgrade would. It should have twigged on me, because GenSoftReviews still worked: it couldn't have been the PHP upgrade and downgrade that caused the problem, since that would have affected GenSoftReviews as well.

Phew. Problem solved. But no images were being displayed in my blog. Another whoops. The images were uploaded from my blogging program Open Live Writer. Open Live Writer updates the blog posts into my blog’s MySQL database at Netfirms, but the images are put into the wp-content/upload folder with the WordPress code. I had never thought of syncing those images back to my computer.  So I inadvertently deleted them when I mirrored up my files.

Another support chat with Netfirms and they were able to restore that folder for me from their backup.

By the way, I was very pleased with the Netfirm support chats. There was no waiting and the support person at the other end was very courteous and knowledgeable and helpful!  It was not like this 5 years ago at Netfirms. They have really upped their game impressively.


Upgrade Necessary

I was still being presented with this message:

image

This is a window I was now getting when I tried to go into Admin mode for my blog. Prior to last night, I had never seen this message before. I don't know what triggered this message to start happening, but I did notice it at some point last night and dismissed it as something I couldn't do and not to worry about.

Maybe I accidentally hit that "Upgrade WordPress" link, or maybe WordPress itself detected an error in the plugin when I switched to PHP 7 – I'm not sure which. But something caused WordPress to merrily start upgrading itself in the background. That's why the database wouldn't open. That's why all the files were different. That might have been what initiated those emails.

That "Database Upgrade Required" message prevents me from getting into the Admin mode in WordPress. I tried using the:
      define('WP_AUTO_UPDATE_CORE', false);
directive that is supposed to turn the display of the message off, but it didn't for me. So instead I just hacked the WordPress code and commented out the calls to the upgrade routine in:
      wp-admin/includes/upgrade.php

Netfirms is forcing its users to upgrade to PHP 7. As a result I will also have to upgrade WordPress and bbPress. I guess after 12 years of smooth sailing, it's come to this and I'll finally have to bite the bullet and update everything.

Sigh! That's not what I wanted to have to do now. I've got updates to both DMT and Behold that I'm working on. But neither of those will be of use if I don't have a working website to present them.

I’ve got an adventure ahead of me. It will be a lot of work, and a lot of learning, but it should be interesting and fun as well.

Fiction versus Fact

Tuesday, December 1, 2020, 5:51:03

In my last post, I discussed a methodology by which I could quickly put together an ancestors-only tree for my niece at MyHeritage.

I was able to get back to about 3rd great-grandparents on most of her lines. But it was her mother’s father’s mother’s side that started to get interesting.

My niece's mother's father's mother was Emma Blanche (Smith) Graham (1883-1976). Now you can instantly spot that I'm in for a challenge with a maiden name of Smith. Smith of course is one of the most common surnames there are. So how can I ensure I get the correct John Smith out of two million John Smiths?


Mayflower?

Following one of Emma's ancestral lines I assembled at MyHeritage, it led me back through Smiths of the Niagara Peninsula (Upper Canada) in the 1800's to a Wilcox line in the 1700's that led to Elizabeth Cooke (1641-1715), who was born in Plymouth, Massachusetts. Hmm. Plymouth was where the Mayflower arrived in 1620.

image

Her father was Jean John Cooke. One of the Record Hints that MyHeritage gave me was this one from WikiTree:

clip_image002

Jean John Cooke was born in Leiden in The Netherlands. Instantly, I recognized that as the city where the passengers on the Mayflower lived before their voyage in 1620. This year is the 400th anniversary of the Mayflower's arrival! Might the picture WikiTree has for Jean John Cooke be the Mayflower? Could my niece be one of the 35 million Mayflower descendants?

I visited Leiden in 2014 for the Gaenovium Conference. What a beautiful city! And I had the pleasure of meeting and spending time with Tamura Jones, who just happens to be an expert with regards to Mayflower descendants. 

I sent off an email to Tamura asking him if this Jean John Cooke might have been on the Mayflower. Tamura confirmed for me that Francis Cooke was on the Mayflower along with his eldest son John, who was a boy at the time. His wife Hester and other children came later.

This Jean John Cooke was the son who was on the Mayflower. Eureka! I can say now that it’s a fact that my niece is a Mayflower descendant, right?

Not so fast. Tamura then told me that he could not find the Wilcox line I supplied him in the lists of descendants he had. He said I should check that line.

So I went to our friend Google and came up with this: I2742: Daniel WILCOX (1631 - 2 Jul 1702) (ksu.edu). It’s from an obviously well researched and sourced genealogy of the Needham Family.

It indicates that Daniel Wilcox (1656 – bef 1730) was the son of Daniel Wilcox (1631 – 1702), and NOT of Jean John Cooke's daughter Elizabeth Cooke, but of a previous wife, possibly Susanna Thompson.

So Daniel Wilcox and his full brother Samuel Wilcox are not descendants of Elizabeth Cooke and thus not descendants of John Cooke or Francis Cooke.

The extensive references at the bottom of the page talk about this and indicate that “there is no evidence that Elizabeth was the mother of his sons Daniel and Samuel”. 

I immediately scratched out the fiction of Elizabeth Cooke being an ancestor and replaced her with what the facts support: possibly Susanna Thompson.

So much for my niece being a Mayflower descendant, at least on that line.


Churchill?

We did get a not-so-small consolation prize out of it though. If you take a look at that Daniel Wilcox link I have above, at the bottom of the page in the references it states:

The Churchill Centre, "Mayflower Ancestry: For and Against"
http://www.winstonchurchill.org/i4a/pages/index.cfm?pageid=50
"No genealogies have been more carefully prepared, or reach a higher standard than, the Mayflower Society genealogies. There is solid evidence that Daniel Wilcox married a first wife prior to his marriage to Elizabeth Cooke, granddaughter of Francis Cooke. There is no evidence that Elizabeth was the mother of his sons Daniel (Churchill’s ancestor) and Samuel. There is circumstantial evidence that she was not. In genealogy, absence of evidence means absence of conclusions."

Checking out Sir Winston Churchill's ancestry, he does in fact connect to Emma (Smith) Graham's line.

image

Sir Winston was in fact a 5th cousin of my niece’s great-grandmother, making my niece a 5C3R (5th cousin, 3 times removed) to the British Prime Minister.


Just the Facts

The Needham Family site is a fantastic resource. You can see the numerous references at the bottom of each individual's page. It would take years to redo that work.

So I decided to go through his site and cross-reference the ancestors I had collected and change any information I had to what he had. As I did that, Needham pointed me to another excellent study, of Benjamin Wilcox, by John Blythe Dobson, and I cross-referenced and changed my information for my people from that study as well.

Of the 129 ancestors I had found for Emma (Smith) Graham, Needham had information on 84 of them, and Dobson had 32 of them.

I put the information in my spreadsheet so that I could quickly visualize and access the information at Needham and Dobson’s sites:

image

Notice the people in orange. They were fiction I obtained from other people’s genealogy.

Dobson stated:

"We know of no basis for the recent claim that she was a Sarah Hart or Hort, b 16 Apr 1684 at Dartmouth, daughter of Thoas Hart or Hort and Margaret. Not only is any such person absent from the town’s vital records, but …". Needham states "Some claim she was Sarah Hort, daughter of Thomas Hort. I have seen no definitive proof of this claim."

which negated the Hort name and Sarah’s parents and grandparents.

And Needham only gives Susanna Swift as "Susanna" with a 1612 birth date, not the 1622 that I had. So the Susanna that married Ralph Allen likely wasn't Susanna Swift. So scratch her parents. Needham also didn't give a surname for Rachel Sherman, so I removed that as well.

There could, of course, be later scholarly research that updates what Dobson or Needham have found, but I’d like to see it with extensive references that can be followed before I’ll believe it.


Prime Minister?

Notice the Borden ancestors in the spreadsheet above. Needham pointed me to a site with the Descendants of Richard Borden. That site pointed me to information about Sir Robert Borden (1854-1937), who happened to be the 8th Prime Minister of Canada.

So now I can assuredly add this Prime Minister as well to my niece’s cousin list. He would have been her 7C5R (7th cousin, 5 times removed).


Seaver?

One other connection I managed to make. While searching to verify the fiction or fact of a “Thomas Bloomfield” ancestor, I came across Amanuensis Monday - Post #286: 1684 Will of Thomas Bloomfield (1615-1686) of Woodbridge, N.J. by the incredible genealogist Randy Seaver on his Genea-Musings blog.

Randy’s genealogical work is also of the gold standard that I would 100% trust.

Searching his site for more information, I found his page Genea-Musings: Surname Saturday - BLOOMFIELD (England > colonial Massachusetts > New Jersey) and from that page I was able to tell that Thomas Bloomfield was Randy’s 10th great-grandfather.

He’s also my niece’s 10th great-grandfather. So that makes Randy and my niece 11th cousins.


Others?

I’m sure there will be more connections that will come up for my niece. Once a genealogy gets back this far to Colonial America and England, there’s much more to be found.

These first discoveries are exciting for me. My own genealogy by comparison heads back to Romania and Ukraine in the early 1900’s, so I’ve never really got to experience these sorts of family connections the way so many other genealogists do.

And I feel much better knowing that these connections are not fiction, but fact.

Tracking Just Ancestors at MyHeritage

Saturday, November 28, 2020, 21:37:05

A few months ago, I decided it would be interesting to investigate my niece’s genealogy.

I entered information about her parents and grandparents into a new tree at MyHeritage and let MyHeritage’s Record Hints and Smart Matches go to work. It wasn’t too long before I had a couple of hundred people representing a lot of her relatives out to about 2nd cousins.

There were a few enticing hints on her mother’s father’s side that attracted me. I started following them and they’d each add another 30 relatives and take me back another generation. In very little time, MyHeritage had 80 source suggestions for me with over 1000 matches and every time I analyzed and confirmed a suggestion, the numbers would continue to grow.

I’m the type of person who likes a clean email in-box, as well as a clean MyHeritage hint list. I saw that I’d soon have thousands or even tens of thousands of people in my niece’s tree, and I really didn’t have the time to build that and resolve the multitude of hints that would result.

So I took a step back and thought about what I was doing. My niece had a few lines that were going back to Southern Ontario in the early 1800’s, and they got there from New Jersey where they were in the 1700’s and previously England. Now I’m in the realms of early America, and I realized these are genealogies that many genealogists connect to. A lot of research has already been done on these individuals and the same people are in the trees of many genealogists.

I had never had this problem with my own genealogy, since all my ancestors arrived in North America between 1900 and 1930 and I’ve never had to deal with early America and English roots. That really looks so interesting.

But I concluded it would be a waste of my time to redo all the research others have done. It’s all out there already. I just needed some way to connect to those people.



Pedigree-Only

I decided that what was best, after allowing inclusion of my niece's relatives out to her 2nd cousins, would be to add only her direct ancestors to the tree.

At MyHeritage, that ended up being fairly simple to do. I checked the Smart Matches by person and found those that gave additional spouse or parent information for any of the direct ancestors.

Matthew Borden is one of my niece’s lines that traced back to England. I currently have that his wife’s name is Joan, but I don’t have her maiden name. Getting her maiden name, birth date and death date and place might lead to finding her parents.

MyHeritage has 78 Smart Matches for Matthew, and the Smart Match summary indicates that there is new spouse information.

image

Clicking on the "Review 78 matches” button reveals the 78 matches with other people’s trees at MyHeritage ordered first by the ones that provide the most additional information:

image

I check to make sure that my birth year and death year and places match, that the wife is listed as a partner, and that the son is listed as among the children. Since I am doing just the pedigree, I only have one child included.

Then I take a look at the partners, and see if the partners listed for the 78 matches all agree.  This is what I found:

  • Joan Reeder (51 times)
  • Joan Glover Reeder (9 times)
  • Joan Mary Reeder (1 time)
  • Joan (6 times)
  • no partners listed (11 times), but 5 of these listed Joan Reeder as Matthew’s mother (presumably incorrect)

So they all seem to agree that the maiden name is likely Reeder. The Glover was sometimes in parentheses, so I'm thinking it might be the surname of a spouse from a previous or later marriage. With only 1 entry of a middle name of Mary, I'm not going to believe that until I get further evidence.

If I had had conflicts here, then I would have gone off to our good friend Mr. Google to see if there's something on her. I could look up "Matthew Borden" "Joan Reeder" and see what pops up. I'd be looking, not for a genealogy containing the two people, because I've already got plenty of those from MyHeritage's Smart Matches, but for some scholarly work detailing the ancestry WITH SOURCES!! The sources will show that it was detailed first-person research. Another family tree is NOT considered a source.

Here’s what Google gives:

image

The third link has nice information, specifically that source at the bottom:

image

So what I'll do now is review the first Smart Match of Matthew Borden, confirm it, and mark the data in my tree I want to update.

image

I may want to also update Matthew’s information if it’s better (e.g. birth and death date and place, occupation), and add his parent’s information if I don’t have it.

I was originally going to use Matthew’s parents as the example in this post, showing how conflicting information can be resolved, but this case is more complicated and his mother is still an unknown with several different people possible, so I chose to use Matthew’s wife instead.

The only people I will accept information for are Matthew, his wife, his parents and possibly his son Richard, who is the direct-line ancestor we are interested in. I will not select information about Matthew's other children or siblings. By not selecting that information for inclusion, those other people will not be added to my niece's MyHeritage tree and it will remain pedigree-only at this generational level.

I’ll then go down to the bottom of the Smart Match. I’ll be sure NOT to press the “Extract all info” button, but just click “Save to tree”.

image

Now I’ll go back to Matthew Borden’s Smart Matches. I’ll check the next few to see if they might have useful additional information. Once I’ve got most of the information I want, I’ll go up to the “More actions” dropdown and tell MyHeritage to confirm all the remaining matches.

image

However, I will not save any of the confirmed information to my tree. The confirmation simply makes them available in my profile of Matthew Borden, and removes them from my list of Smart Matches so that I can now concentrate on the next person.

I then simply continue this same procedure for each of the Smart Matches until I’ve exhausted them.

I’ll report on the results of this in an upcoming post.

My DNAweekly Interview by Ditsa Keren

Monday, November 9, 2020, 19:46:39

Two weeks ago, I was interviewed on Zoom by Ditsa Keren for an article that was published on DNAweekly today.

image

DNAweekly publishes an interesting blog with a wide range of articles about consumer-based DNA tests that extend into their use by genealogists. They reach out and look for third-party software that might be of interest to DNA testers and found me and asked me if I was willing to be interviewed.

On their Blog page, they give an example of some of the recent DNAweekly blog posts:

image


The website's primary focus is comparing, reviewing and rating DNA tests, and they include some family-tree-based sites in their reviews. I currently count 58 different services in their review list.

They classify companies into these categories:  Ancestry, Family Tree, Health & Wellness, Diet and Nutrition, STD, Pets.  They give each company a rating from 1 to 10, provide for each a User Score of 1 to 5 stars, and then link each to a complete review of that product.

The product reviews are quite detailed and seem to be done very objectively. The company is obviously making money from affiliate links by you clicking and then purchasing the product, but that does not seem to be biasing their reviews in my opinion. They have some coupons available for some products towards the end of their review. Finally, at the bottom of their review, they allow you the user to write your own review on the product and give your star rating. The author of each review is shown with a brief biography.

All in all, a very nice review site for DNA, family tree and health testing services.

Ancestry’s Timber Algorithm is Better Than You Think

Friday, October 30, 2020, 2:04:24

Ancestry has recently made changes to its display of the amount of DNA you match with someone. The amount is shown in cM (centimorgans). Most DNA testers using their DNA for genealogy purposes know what cM are and what they represent.

image

Your DNA match list shows the Shared DNA you have with each of your matches.

The change Ancestry made that I’d like to talk about is the addition of “Unweighted shared DNA”. When you click on the “Shared DNA” link, you’ll be shown information containing this unweighted segment value:

image

Here you'll see a "Shared DNA" value of 91 cM and an "Unweighted shared DNA" value also of 91 cM. When the shared DNA value is 90 cM or more, the unweighted value is always the same.

But when the shared DNA value is less than 90 cM, then the unweighted value can be more, and usually is.  The unweighted value can be as high as 89 cM.

image

Ancestry uses what they call their Timber algorithm to filter out pieces of DNA that it figures should not be considered when deciding if two people are related.

A lot of people, including myself, have been critical of Timber, believing it removes segments that it shouldn't, and they were very happy with the new information that now shows the pre-Timber amount. You can't easily get this amount for all your matches, though. You have to click through each match one by one to get that match's unweighted value. You cannot see them all on your DNA Matches page like you can the post-Timber values.



Comparing Average Shared Values

The research work I’m currently doing on one branch of my wife’s family with her cousin Terry Lasky includes some lines where we do not know if the ancestors are brothers, half-brothers or first cousins. We have descendants of two ancestors who DNA tested that we can compare. Those who are 3 generations down would be 3rd cousins if the ancestors are brothers, half 3rd cousins if they are half-brothers, and 4th cousins if the ancestors are 1st cousins.

All of our family includes endogamy. Terry and I have been worried about the effect of endogamy on our cM shared values, and on the effect that the Ancestry Timber algorithm would have on our cM values.

Terry has 32 DNA testers from this branch who tested at Ancestry. Among the testers he had 138 pairs of them where he knew for sure how they were related and did not know of a second way they might be related, other than through endogamy.

Parent/child pairs are 1 generation apart. At Ancestry DNA, parent/child pairs match with 3476 cM. Full siblings are two meioses apart (up to the parent, down to the other child). Their average match at Ancestry DNA should be 3/4 of a parent/child match, or 2607 cM. An uncle/aunt/nephew/niece is 3 meioses apart, and in theory should average half of a parent/child match at Ancestry DNA, or 1738 cM. From there on, every extra meiosis halves the cM matching. What we are doing is counting meioses, the number of times the cells recombine between the two people. Meiosis 6, for example, can be 2nd cousins, 1st cousins twice removed, half 1st cousins once removed, or great-great-great-great grandparent/child, and many other relationships. But they all should have the same theoretical average cM at Ancestry DNA, and that should be 217 cM.
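
To make that halving concrete, here is a minimal Python sketch of the theoretical averages. The only input is the 3476 cM parent/child figure quoted above; the sibling and halving rules are the ones just described:

    # Minimal sketch: theoretical average shared cM by meiosis count,
    # derived only from the 3476 cM parent/child figure quoted above.
    PARENT_CHILD_CM = 3476.0

    def theoretical_cm(meiosis: int) -> float:
        """Theoretical average shared cM at Ancestry for a given meiosis count."""
        if meiosis == 1:                       # parent/child
            return PARENT_CHILD_CM
        if meiosis == 2:                       # full siblings: 3/4 of parent/child
            return PARENT_CHILD_CM * 0.75
        # meiosis 3 (uncle/aunt/nephew/niece) is half of parent/child,
        # and each additional meiosis halves the amount again
        return PARENT_CHILD_CM / 2 ** (meiosis - 2)

    for m in range(1, 9):
        print(m, round(theoretical_cm(m), 1))
    # meiosis 6, for example, works out to about 217 cM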

So what I did was average Terry’s known pairs by meiosis and compare them to what the theoretical average cM should be at Ancestry. It resulted in this table:

image

This very much surprised me when I first saw it. I had thought that Terry’s Ancestry numbers would be considerably higher than the theoretical averages due to endogamy. But Terry’s pairs averaged only 5 cM higher than the theoretical values. That is extremely close.

I scratched my head wondering why. These are the post-Timber values, which had some segments removed by Timber. I decided to separate out the Timber-affected numbers from those unaffected and divided the above table into >= 90 cM and < 90 cM.

image

Again I was surprised. Meiosis 7 and 8 have average differences of +29 and +76 for >= 90 cM. They have average differences of -70 and -26 for < 90 cM.

It seems Ancestry optimized their 90 cM cutoff for Timber to get the averages in the meiosis levels to be close to the theoretical. What this seems to show is that it is not a good idea to separate out the two or to try to correct for their Timber algorithm.  Their numbers with Timber seem to be best.

Just to check, I averaged out the Ancestry unweighted values for Terry’s pairs:

image

Meiosis 8 corrected is okay, but meiosis 7 has an average difference of -51. Compare that to an average difference of 7 in the original raw values with Timber. So I wouldn’t want to use these unweighted values. Using Ancestry’s values with Timber seems best.

It seems that the Ancestry genetic scientists knew what they were doing with Timber. They seem to have optimized it so that each meiosis level will average out very close to its theoretical value.



Blaine’s Shared cM Version 4.0

Well that was really good to know. Now I wanted to know how much Blaine Bettinger’s Shared cM Project v4 varied from the Ancestry theoretical averages. Surely Blaine’s would be different. His numbers were based on submissions of people who got cM values not just from Ancestry, but also from 23andMe, Family Tree DNA, GEDmatch, MyHeritage and others. Not all companies report exactly the same way. Family Tree DNA includes small segments down to 1 cM and will usually report higher shared cMs for the same two people. 

So here was a second surprise:

image

Blaine’s values are actually very close to the Ancestry theoretical values for the closer relationships. Even meiosis 6 to 9 isn’t that far away. I attribute the slightly larger differences for the more distant relationships to some reported pairs being related an additional way that is adding to the amount. It isn’t much, just 12 to 21 cM.

Nonetheless, Blaine’s numbers match up well with the Ancestry theoretical values, and that’s good to know.



Conclusion

Ancestry did Timber for a reason. It seems to me that they may have calibrated Timber so that the average cM for a given relationship would be the same as the theoretical average. Even if they didn’t do that calibration on purpose, it sure worked out well.

My recommendation is to use the Timber-based numbers, especially when comparing to Blaine’s shared cM project.

Don’t worry about the new unweighted Shared DNA values, and stop complaining so much about Timber.

Using WATO for Unknown Ancestral Relationships

2020. október 26., hétfő 23:25:53

Big update Oct 27:  Much easier way to do this than in my post below.  Leah Larkin informed me that I can do all 3 scenarios at once like this:

image

So all three hypotheses indeed can be included at once.

And the results with WATO Version 2 come out as:

image

Showing Hypothesis 1 (Brother) is 37 times more likely than Hypothesis 2 (Half-Brother) which is 2481 times more likely than Hypothesis 3 (1st Cousin).

Much simpler! Many thanks to Leah and Andrew Millard on the WATO Facebook group for letting me see the light. 

I’ll leave my post below to show my original thinking.



Original Post:

In yesterday’s post, I wanted to see if the What Are The Odds (WATO) tool at the DNA Painter site would work for endogamy, and I came out satisfied that it does, for either Ancestry DNA numbers or Family Tree DNA numbers, with the < 7 cM matches removed from the latter.

WATO is designed to help when you have a DNA match with someone and you don’t know for sure how that person is related to you. You build your tree in the WATO tool and add positions where you think your match might be. You set those positions to be hypotheses.

Well, I’ve got a slightly different problem. We’ve got a bunch of DNA matches and I know where they fit in the tree. What I don’t know is how the people at the top of the tree are related.

Let me start with the tree that I used as an example yesterday:

image

So these are all the relevant descendants of Moshe. The DNA testers are shown shaded. Hypothesis 1 is a known tester whom we simply used as a hypothesis.

Now there happens to have been a man named Gedalia who has the same last name as Moshe and came from the same town in Ukraine. We know of a few of Gedalia’s descendants who DNA tested and they are matches to the descendants of Moshe. What we don’t know and want to figure out is the relationship between Moshe and Gedalia. Could they be brothers? Half-brothers? First cousins?


Are Moshe and Gedalia Brothers?

So what I’ll do is expand the tree. I’ll add Gedalia to the tree as a brother to Moshe. I’ll add the descendants and mark the one we will use in this example as the Hypothesis. Now I’ll enter the cM shared between this descendant of Gedalia and each of the testers under Moshe. I’ll use filtered Family Tree DNA numbers since those worked best yesterday:

image

This gives us a score of zero, saying this is not possible.

So let’s take a look at the score calculation:

image

It’s saying that Rob is way too high at 263 cM to be a 3rd cousin.

But wait a minute! That is saying that Rob is related more closely than 3rd cousin to our Hypothesis person, whom we’ll call Hyp. We know from the diagram above that through Moshe and Gedalia, he cannot be closer than a 3rd cousin. Since Rob’s cousin Sha and 1C1R Ala don’t have the same problem, they are okay. That must mean that Rob’s mother is related to Hyp, adding extra cMs to Rob and his sibling And. In fact, And is higher than all the rest at 145 cM, but not high enough to make being a 3rd cousin to Hyp an impossibility.

Since Rob and And are related another way to Hyp, what I’ll do is remove their shared DNA amounts from being included in the WATO calculations and run it again:

image

That’s better and now the Hypothesis shows up as possible. Here’s the score calculation:

image

It’s the same as the above for the listed people, except that the Combined odds ratio is now 1.00.


Are Moshe and Gedalia Half-Brothers?

Let’s now do the same thing and just change Moshe and Gedalia to be half-brothers. WATO lets us do this and indicates they are halves with the coloured dotted lines to the left of their boxes:

image

All of the scores have changed, but this scenario is still a possibility:

image


Are Moshe and Gedalia First Cousins?

Well, let’s delete Gedalia’s side and add him back in as a first cousin:

image

Once again, this is said to be possible. Here are the scores:

image


So Which Is More Likely? Brother? Half? Cousin?

WATO has a wonderful mechanism for comparing different hypotheses. When you include more than one hypothesis in a scenario, it tells you which of them is most likely and how many times more likely it is than the next. (See yesterday’s post for an example.)

But here, I have three different trees each with only one Hypothesis. WATO won’t compare them for you.

Well I think I see what WATO is doing.  I may be wrong, but it looks like it is multiplying the probabilities together and comparing the results between the scenarios. So I can easily do that myself in a spreadsheet:

image

I have highlighted the most likely scenario for each match. Half-Brother wins this comparison with 7, versus 1st Cousin with 3 and Brother with just 2.

The line at the bottom contains the product of the 9 values above it. The highest product is Half-Brother’s, which is 9 times larger than 1st Cousin’s, meaning Half-Brother is 9 times more likely a possibility than 1st Cousin. 1st Cousin is 3 times more likely than Brother. And Brother is 25 times less likely than Half-Brother.
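
If my reading of what WATO is doing is right, the spreadsheet step boils down to something like this minimal Python sketch. The per-match probability values below are placeholders, not the real numbers from my table; you would take them from WATO’s score calculations:

    from math import prod

    # Hypothetical per-match probabilities for each scenario, one value per
    # tester under Moshe (placeholder numbers, not my real data).
    scenarios = {
        "Brother":      [0.12, 0.30, 0.05, 0.22, 0.18, 0.09, 0.25, 0.14, 0.07],
        "Half-Brother": [0.20, 0.28, 0.11, 0.19, 0.24, 0.15, 0.22, 0.18, 0.12],
        "1st Cousin":   [0.09, 0.17, 0.08, 0.13, 0.21, 0.10, 0.16, 0.11, 0.06],
    }

    # Multiply each scenario's probabilities together, then compare the products.
    products = {name: prod(p) for name, p in scenarios.items()}
    best = max(products, key=products.get)

    for name, value in sorted(products.items(), key=lambda kv: -kv[1]):
        if name == best:
            print(f"{name}: product = {value:.3e}  (most likely)")
        else:
            print(f"{name}: product = {value:.3e}  "
                  f"({products[best] / value:.0f} times less likely than {best})")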

So there you have it. We haven’t proved anything, but at least we now know that all scenarios are possible and that half-brother is most likely.


Hint, Hint, Leah and Jonny

WATO is a wonderful tool to help you hypothesize where your DNA matches fit into your tree. That was what it was designed for.

But wouldn’t it be nice if WATO could also help you test different ancestral scenarios, as I have just done? Well it can, if you follow the above procedure and do the comparison yourself.

WATO-Ancestors could be set up to make it easier for you by remembering the results of each of your scenarios, and then comparing them for you, so that you won’t have to do it yourself.




Update (80 minutes later): I didn’t realize when I was doing the analysis that I was using Version 1 of WATO. Version 2 includes new probability numbers taken from an update to Ancestry’s paper. See Leah’s article: Improving the Odds. The main improvement is that it now has much more detail for small matches.

You can switch from Version 1 to 2 very easily, so I did and I recalculated. Here’s the revised table:

image

To tell the truth, it really changed the results. Now the conclusion is that Brother is the most likely relationship and that scenario is 37 times more likely than Half-Brother.

So make sure you use Version 2 of WATO to get the best probabilities.




Additional Idea: If you have more than one tester on the other side of the tree, you can calculate all the match values for each scenario for each of them, and then simply multiply the “Product” line values together (or take their geometric mean) for each scenario.

For example, in the above table, if I had a second person that gave Product numbers of 0.0000385 for Brother, 0.0000655 for Half-Brother and 0.0000073 for 1st Cousin, then

GMean(Brother) = (0.0000033505 * 0.0000385) ^ (1/2) = 0.0000114
GMean(Half-Brother) = (0.0000000901 * 0.0000655) ^ (1/2) = 0.0000024
GMean(1st Cousin) = (0.0000000 * 0.0000073) ^ (1/2) = 0.0000000

If you don’t know what a geometric mean is, then just use a simple average which should still tell you which scenario is most likely.
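
Here is a minimal Python sketch of that idea, using the Product values from the worked example above (the second tester’s numbers are the hypothetical ones I gave):

    from math import prod

    # "Product" line for tester 1, and the hypothetical numbers for a
    # second tester, as given in the example above.
    products_by_tester = {
        "Brother":      [0.0000033505, 0.0000385],
        "Half-Brother": [0.0000000901, 0.0000655],
        "1st Cousin":   [0.0,          0.0000073],
    }

    def geometric_mean(values):
        return prod(values) ** (1.0 / len(values))

    for scenario, values in products_by_tester.items():
        print(f"{scenario}: GMean = {geometric_mean(values):.7f}")
    # Brother       0.0000114
    # Half-Brother  0.0000024
    # 1st Cousin    0.0000000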

Does WATO work well with Endogamous populations?

2020. október 26., hétfő 2:05:46

I’ve been quiet lately because I’ve been enjoying doing some research with my wife’s cousin Terry Lasky on one branch of their common families. Terry has got several dozen of his relatives on that side of the family to do DNA tests.

One aspect of what we are doing led to Jennifer Mendelsohn suggesting to me that we try WATO – the What Are the Odds tool built by Leah LaPerle Larkin and Jonny Perl.

I was concerned that the endogamy in our matches might add too much to the shared cM of two people. And I was also worried that the shared cM values that Family Tree DNA gives which are higher than the Ancestry DNA’s numbers would cause additional problems.

If WATO would not work for our known relationships, then we should not use it for our unknown relationships, meaning a test is required first.


Family Tree DNA data for a Known Relationship

So the first step is to test WATO on a relationship which includes endogamy, for a person that has just one known pair of common ancestors with the other people. So there are no other close relationships that we know of, other than the distant endogamy.

I took one of our starting ancestor couples, Moshe and Wife 3, who had three children. We have 14 DNA testers who, between those children’s lines, are 2C, 2C1R and 3C to each other. I took the 14th and made him the hypothesis, and I created this with the WATO tool:

WATO Tree for Endogamy (click on the image above to expand it)

So I created 11 hypotheses. Hypotheses 1, 2 and 3 are descendants of a child of Grace. 4, 5 and 6 are descendants of a child of Grace who is a half-sibling of Grace’s other children. 7, 8 and 9 are descendants of a full sibling of Grace, and 10 and 11 are descendants of a half-sibling of Grace.

Each line of hypotheses is a half generation further away than the previous. And interestingly enough, the possible hypotheses marked in green move up a generation to compensate for this difference.

WATO gives you the calculated probabilities of each hypothesis:

image

So this is saying that Hypothesis 5, that this person is a child of a half-sibling of Grace’s other children, is the most likely and is 52 times more likely than Hypothesis 2. Three others are possible and the rest are not statistically possible.

I love the detailed score calculation that Leah and Jonny put together. It gives you everything you’d ever want to know about each relationship in each hypothesis. And you can see how the probabilities were arrived at:

image

Now can you guess which Hypothesis is the correct one?  (spoiler below)


Family Tree DNA data stripping out small < 7 cM Matches

I had thought that WATO was based on the numbers from Blaine Bettinger’s Shared cM project. As I was calculating and writing the above, Jonny Perl responded to one of my posts on Facebook and said:

“The probabilities are actually separate from the shared cM project. In WATO v1 they’re from Ancestry’s white paper on matching and in v2 they are extrapolated from the probabilities AncestryDNA displays in the popup when you click on the cM amount.”

So I asked Jonny if it might be better to use Ancestry shared cM with WATO than to use Family Tree DNA data with it.  He said yes, and pointed me to his Individual Match Filter tool (IMF) to strip Family Tree DNA  matches back to a certain threshold (default is 7 cM).

Well Terry had done most of this work already for me and had many of the Family Tree DNA shared cM values already stripped back to only include 7 cM or larger values. I’m sure Terry would have liked to have known about Jonny’s tool as it would have saved him a lot of time.

I plotted Terry’s filtered numbers versus the non-filtered and got this relationship:

image

Notice this is a pretty strong relationship, and you can see that the trend line gives a pretty good estimate of what the filtered Family Tree DNA shared cM should be. The equation is basically saying that subtracting 50 cM from your unfiltered value will give you a decent filtered value. It should work okay for values greater than 100 cM, but obviously won’t be as good for smaller values.
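
For what it’s worth, here is a minimal Python sketch of how a trend line like that could be fitted; the cM pairs below are made-up values, not Terry’s actual data:

    import numpy as np

    # Hypothetical (unfiltered, filtered) FTDNA shared cM pairs -- made-up values,
    # just to show how a trend line like the one above could be fitted.
    unfiltered = np.array([120.0, 180.0, 250.0, 310.0, 420.0, 510.0])
    filtered   = np.array([ 72.0, 128.0, 205.0, 255.0, 372.0, 462.0])

    slope, intercept = np.polyfit(unfiltered, filtered, 1)
    print(f"filtered ~= {slope:.2f} * unfiltered {intercept:+.1f}")

    def estimate_filtered(unfiltered_cm: float) -> float:
        """Crude rule of thumb from the plot: knock about 50 cM off.
        Reasonable for unfiltered values above roughly 100 cM."""
        return unfiltered_cm - 50.0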

Now I’ll use the filtered Family Tree DNA values in WATO instead of the unfiltered and we’ll see what happens:

WATO Tree for Endogamy (1)

This gives 5 feasible hypotheses, with Hypothesis 2 coming out on top, being 8 times more likely than Hypothesis 5.

image


Ancestry DNA data for the same Known Relationship

Jonny’s comment also prompted me to try our Ancestry DNA matches. 11 of our 14 people above had originally tested at Ancestry DNA and those tests were later uploaded to FTDNA, so we still have 10 people we can compare with our 11th.

Putting in the Ancestry DNA shared cM values, we get this:

image

The Ancestry cM values we put in were actually not too different from the filtered FTDNA values. In fact, the biggest difference between them was 35 cM. The conclusion is the same, with Hypothesis 2 being ahead of Hypothesis 5, but only being about 2 times more likely.

image


The Answer and Some Observations

The correct hypothesis is Hypothesis 2.

So it does seem that WATO is doing a good job and picked the correct Hypothesis with both the filtered FTDNA data and the Ancestry data.

Even though there are a few possible valid hypotheses, adding the known generational level of the tester and/or their age will help to invalidate some and make one more likely.

I was worried that the endogamy would be a factor, but it seems not to be. Only the unfiltered FTDNA did not pick the correct answer on its number one hypothesis, and that is due to the many extra segments (about 50 cM worth) included in those numbers. As a result, it preferred to pick the hypothesis which was a half generation higher.

So this tells me that you needn’t worry about endogamy when using WATO. Just be sure to use either filtered FTDNA data (eliminating matches less than 7 cM) or use Ancestry DNA shared cM.

DNA Short Snappy Opinions

2020. augusztus 22., szombat 21:06:22

Lots has been happening on the DNA analysis front in the past few months. Lots of very divergent opinions on a whole bunch of issues.

Here are my opinions. You are free to agree or disagree, but these are mine.


AncestryDNA

  • Ancestry has had performance issues. Couldn’t they have been more honest and said that performance is the reason for their cease-and-desist orders to the 3rd party screen scrapers who have been providing useful utilities?
  • I just hate the endless scrolling screens. Bring back paging, please.
  • The 6 and 7 total cM matches that Ancestry will be deleting definitely include people who have a higher probability of being related, but not because of the small DNA match which is likely false and too distant a match to ever track.
  • The 6 and 7 total cM matches are also being deleted because of their performance issues.   
  • I in no way trust Ancestry’s Timber algorithm, especially with the longest segment length being labeled as pre-Timber to explain why it’s longer than the post-Timber total cM. Now none of their numbers make sense.
  • Longest segment length is not as helpful if you have to look at it one by one. Why didn’t they show it in the match list and let us sort by it?
  • Let us download our match list, please.
  • Thinking Ancestry will ever give us a chromosome browser is a pipe dream.


23andMe

  • I love that they show your ethnicity on a chromosome map. This is in my opinion, a very underutilized feature by DNA testers.
  • Their Family Tree generated from just your DNA matches is a fantastic innovation.
  • A month ago, some people were able to add any of their DNA matches to that family tree. They’ve never announced this and it still hasn’t rolled out to me yet. What’s the problem here? Release it, please!
  • If my matches don’t opt in, I don’t want to know that. Please give me 2000 matches rather than 1361 matches that I can see and 639 that I can’t.


FamilyTreeDNA

  • Lots of innovation that they don’t get enough credit for, e.g. their assignment of Paternal / Maternal / Both to your matches based on the Family Tree you build.
  • Keeps your DNA for a looooooong time! Will be useful for future tests that don’t exist now on your relatives who passed away.
  • Best Y-DNA and mtDNA analysis for those who can make use of it.
  • Take advantage of their Projects if you can!
  • Nobody should see segment matches down to 1 cM, or have them included in your match totals. Pick a more reasonable cutoff, please.


MyHeritage DNA

  • I hate, hate, hate, did I say hate, imputation and splicing.
  • As a result of the aforementioned, I believe MyHeritage has the most inaccurate matching and ethnicities of the major services.
  • Showing triangulations on their chromosome browser is their best advanced feature that no one else has. 
  • I love that you are working with 3rd parties, and include features that others won’t such as AutoClusters.
  • How about some features to connect your DNA matches to your tree, like Ancestry and 23andMe and Family Tree DNA have?


Living DNA

  • They’ve missed out on a golden opportunity. They had the whole European market available.
  • Three years ago they launched and promised shared matches and a chromosome browser, which they’ve still not implemented.
  • Your ethnicity in no way works for me unless you add a Jewish category.


GEDmatch

  • I feel so sorry for GEDmatch’s recent troubles. They are trying so hard.
  • Great tools. Love the new Find Common Ancestors from DNA Matches tool that compares your GEDCOM with the GEDCOM files of your matches. Would love it more if I had any results from it.
  • They let you analyze anyone’s DNA, but don’t let you download your own tool-manipulated raw data.  Doesn’t that seem backwards?
  • Over 100 cold cases have been solved using DNA to identify the suspect. I loved CeCe Moore’s Genetic Detective series. I can’t figure why more people won’t opt-in their DNA for police use.


ToTheLetter DNA and KeepSake DNA

  • C’mon guys. We all want the stamps and envelopes our ancestors licked analyzed. This sounded so promising a couple of years ago. What’s taking so long?


Whole Genome Sequencing (WGS)

  • Sorry, but today’s WGS technology will never improve relative matching the way some people think it will. Current chip-based testing already does as good a job as can be done when you’re dealing with unphased data.
  • Today’s WGS short read technology is too short. Today’s WGS long read technology is too inaccurate.
  • The breakthrough will come once accurate long reads can sequence and phase the entire genome with a single de novo assembly (no reference required) for $100.
  • PacBio is leading the way with their unbiased accurate long read SMRT technology that is not subject to repeat errors. It just needs to be about 100 times longer and remain accurate and we’re there. Optimistically: 5 years for the technology and 10 years for the price to come down.

Proof or Hint?

2020. július 19., vasárnap 18:05:07

Have you heard the big hubbub going on in genetic genealogy circles?  Ancestry will be dropping your 6 and 7 cM matches from your match list.

image

In my case, I have 192,306 DNA matches at Ancestry. Of those, 54,498 matches are below 8 cM, meaning Ancestry will drop over 28% of the people on my match list.


The Proof Corner

Many of the DNA experts understand that a 6 or 7 cM segment is small and is rarely useful for proof of anything. That is totally true. As Blaine Bettinger states, small segments are “poison”. They are often false matches. When they are not, those segments are usually too many generations back to be used as “proof” of the connection.

I am not talking about Y-DNA or mtDNA here. Those have provable qualities in them. I’m talking about autosomal matching, you know, the DNA where the amount of DNA you share with a cousin reduces with each generation and you can be a 3rd cousin with someone and not share anything.

The only reasonable way to use autosomal segment matches as a “proof” is to use the techniques Jim Bartlett developed for Walking an Ancestor Back. This technique uses combinations of MRCAs on the same ancestral line, e.g. a 2C, a 3C, a 5C and a 7C all matching on the same segment who are on the same line. Jim has been able to do this successfully only because he has an extensive family tree and has rigorously mapped all his matches into triangulation groups over his whole genome. This is something that 99.9% of us will never attempt.

Note that Jim only includes matches that triangulate that are at least 7 cM. He is also aware that small segments may be false even when triangulated, so he excludes them.

But too often, people find through a DNA match a new 7th cousin, and find a family tree connection to them, and then claim that the DNA match proves the connection. This is so untrue on so many fronts.

Or people find two relatives who have a segment match that starts and/or ends at the same position as another DNA match. They then use this as proof of their connection to Charlemagne. Now doesn’t that sound ridiculous?


The Hint Corner

So why the worry about eliminating from your Ancestry DNA match list these mostly false, poison matches that can’t prove anything? It’s because they are hints.

As genealogists, we are using our DNA matches to find possible relatives that have common ancestors with us. We do that to extend our tree outwards and up. Any person who may have researched a part of our tree and has information about our relatives and ancestors that we don’t have is a very welcome find. (Hopefully they’ll respond to our email!)

So of my 192,306 matches, the closest 1% are the best candidates for me to research and see if I can connect them.

What about the other 99%? Surely, some of them might turn out to be a closer cousin than expected, or be along a line that I have researched deeper.

Obviously, none of us can spend the rest of our lives researching 190,000 matches one by one. So what do we do? We filter them to get interesting candidates, via:

1. A match who shares a common ancestor with us.

2. A match whose name matches a surname in our tree.

3. A surname in a match’s tree that matches one of ours.

4. A birth location in a match’s tree that is a place our ancestors were from, or where our relatives now live.

image

5. Shared matches who match some of our DNA matches whom we already have in our tree.

6. ThruLines, which compares the trees of our DNA matches for us and gives us possible family connections that we can investigate.

Finding people through any of these 6 methods (and other similar methods) is a way to take an unmanageable list of 192,000 people and select a subset for us to look at. Our hope (we don’t know this for sure) is that this will include more people who we’ll be able to connect to our family, and exclude the ones who are less likely.
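
As a purely illustrative sketch of that kind of filtering, here is what it might look like if you had your match list in a spreadsheet. The file, column names, and the separators are all hypothetical; Ancestry doesn’t let you download your match list, so in practice these filters get applied on their website:

    import csv

    # Hypothetical match list with hypothetical columns, purely to illustrate
    # surname/place filtering of DNA matches.
    tree_surnames = {"Braunstein", "Focsaner", "Naftulovici"}
    tree_places   = {"Tecuci", "Mezhirichi", "Winnipeg"}

    candidates = []
    with open("matches.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            surnames = set(row.get("tree_surnames", "").split(";"))
            places   = set(row.get("tree_places", "").split(";"))
            if surnames & tree_surnames or places & tree_places:
                candidates.append(row["match_name"])

    print(len(candidates), "matches worth a closer look")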

So what most people are lamenting is not the loss of 28% of their DNA matches, but a loss of 28% of the hints they might be able to use.


Recommendation

If you want, there are ways to save some of the 6 and 7 cM matches that Ancestry will soon be eliminating. I won’t describe them here since many others already have. See Randy Seaver’s summary.

But please, don’t spend the next few weeks robotically marking the tens of thousands of small matches so that you don’t lose them. Yes, maybe one of them will turn out to be a hint one day. But you’ve got all your other matches to work with as well. You won’t run out of things to do, I guarantee it.





Addendum:  July 29, 2020:

If small DNA matches of 6 or 7 cM at Ancestry DNA cannot be used to prove a connection, because they are either false matches, or are too many generations back to confirm their ancestral path, then why can they be used as hints?

Answer: Simply because if you take a random selection of, say, 20,000 DNA testers at Ancestry, some of them will be relatives of yours. They may not actually share DNA with you, since 3rd cousins and further need not, but they could be people whose family tree connects to yours.

Basically, Ancestry DNA is giving you hints by simply giving you a large random selection of DNA testers. Their filtering tools (surname, place) may narrow those down to possible relatives, who don’t necessarily share any actual DNA with you.    
   
But these hints are better than just random hints. They will likely be people who share more ethnicity with you than a random DNA tester at Ancestry would.

For example, Ancestry has me at 100% European Jewish. If I compare myself with my first 6 cM match at Ancestry, I get this:

image

This 6 cM match of mine also has 100% European Jewish ethnicity.

To see if this was generally the case, I took my closest 20 matches, and my first 20 matches at 40 cM, at 20 cM, 15 cM, 10 cM, 9 cM, 8 cM, 7 cM and 6 cM. I marked down what percentage of European Jewish ethnicity each had. Then I sorted each group of 20 highest to lowest. I get this:

image

Of the 180 matches I checked, 179 had some European Jewish ancestry. Over half of the matches also had 100% European Jewish ethnicity, and many of them had 50% or more.

There is a much greater chance that I might find a connection to someone with European Jewish ancestry than someone without any, so these are good hints. Using ancestral surname and place filtering tools, I might find that some of these people are relatives and they can help me extend my family tree.

Does that mean that we share DNA? Not necessarily. The matches, especially the small ones, may be false matches.

Or we may actually share DNA, but the segments we share may not be coming from the common ancestor we found. They may be from another more distant line that we’ll never find, or they may be (especially in my case) general background noise from distant ancestors due to endogamy. We don’t know and cannot tell.

Nonetheless, these matches are hints that might connect you to a relative.

Revisiting 23andMe’s Family Tree

2020. július 10., péntek 7:18:28

A very exciting day for me today, as most of you reading this will relate to. A second cousin of mine who I know showed up on my 23andMe match list. She matched me with 3.1% = 234 cM on 19 segments, which is exactly where she should be according to The Shared cM Project tool.

I have 9 other cousins who have tested at 23andMe and match me. What makes this newly tested cousin different from the other 9 is that she’s on my mother’s side! All my previous known matches at 23andMe were on my father’s side.

So now I can finally get some maternal information from my 23andMe matches. A second cousin is perfect because we share great-grandparents and she will allow me to cluster my maternal matches into my mother’s father’s side, the side she is on.


23andMe’s Family Tree

I last looked at 23andMe’s Family Tree last September in my article: 23andMe’s Family Tree Beta.

My tree as calculated by 23andMe back then included 13 of my DNA matches. It placed 8 on my father’s side and 5 on my mother’s side.

My automated tree today has two more of my matches included, so there are now 15. The 8 circled matches at the left are on my father’s side. The 7 circled matches at the right are on my mother’s side. The people circled in blue are the 5 relatives in the tree that I know how I’m related to. One is a 1C1R who is the granddaughter of my uncle, so she shares both my paternal grandparents with me and I show her above the “F”. The other 4 are all on my father’s father’s side, and they are in the “FF” section. I do have a few relatives on my father’s mother’s side that tested, but 23andMe decided not to include them in my automated tree. There are 10 matches that I don’t know how they are related to me. But the tree hypothesizes that 1 is on my father’s father’s side, 2 are on my father’s mother’s side, and 7 are on my mother’s side. (Click the image below for a larger version)

image

23andMe has not yet included my new mother-side match on my tree. They only recalculate the tree from time to time and I’d have to wait until they do it again to see if they add my cousin to it.

Of those 7 people hypothesized to be on my mother’s side, 3 are with one parent and 4 are with the other. So once my cousin is added, presumably the group of 3 or the group of 4 would be with her on my mother’s father’s side and the other group would be on my mother’s mother’s side.

But then I saw that I don’t have to wait for 23andMe’s recalculation.

At the top left of the tree is this symbol:
image

When I click on it, it brings up this box with unplaced relatives:

image

Five people are shown at the bottom; you have to scroll to the right to see the last two. The person on the left is my newly tested cousin. The other 4 are people I don’t know how I’m related to.

Clicking on the little info symbol next to the “Unplaced Relatives” text gives:

image

Clicking on the “Learn more” link gives:

image

Well 5 minutes doesn’t sound so bad. Let’s see what happens when I reset my tree.


Recalculating the 23andMe Family Tree

I press the “Yes, delete my edits and recalculate my tree” button, and it gives this:

image

Okay. 5 to 10 minutes isn’t so bad either.  Back at the tree, they actually show progress:

image

Now it’s saying less than 1 minute. Sheesh! After what turns out to be about 3 minutes, I get this message:

image

I’m doing this on a Thursday evening at 7 p.m. CDT. Is this a busy time?

I wait a couple of minutes, and of course I don’t believe them and don’t want to wait until tomorrow, so I go back up to the 23andMe main menu, and under Family & Friends, select “Family Tree”.

image

Sure enough, I didn’t have to wait a day. It displays my new tree:

image

Now it only shows 6 of my DNA matches. Pressing the symbol in the top left, it now shows this:

image

So it moved 9 of my previously placed matches into the Unplaced Relatives list. That list now has those 9 plus the 5 that I had before I had them recalculate the tree, plus the 8 non-tested relatives (e.g. my parents, grandparents, uncle, cousin, etc.) that I had previously manually added to my tree.

The recalculation placed some of my paternal cousins at the wrong generational level. But that’s no problem. Since the beta 10 months ago, 23andMe has added the ability to move people in the tree, and even move a whole branch of the tree:

image

The link they often show that says “View our guide:” takes you to 23andMe’s illustrative guide of How to build and edit your Family Tree, which is worth a read. In there, you’ll see that you not only can add people to your tree, but you can include their date and place of birth and death and add a photo. I’m not sure why entering the birth and death information is currently useful, since that information doesn’t show up in the tree. But maybe 23andMe has planned a use for it that they’ve not implemented yet.

Unfortunately, the one person I really wanted automatically added, my new DNA testing relative on my mother’s side, was not placed. That would have separated out my maternal sides. But now it wouldn’t have helped anyway, because the 7 people they previously placed on my maternal side were now all with the Unplaced Relatives. So placing my 2nd cousin without those 7 on the tree no longer will allow me to divide them up into my MF and MM sides. Sad smile


My New 23andMe’s Family Tree

I can easily add my new cousin, because I know where she goes. But I can’t add the people that the recalculation removed from the tree because I don’t know how I’m related to them. It would have been nice if 23andMe could have left them in. The algorithm must have changed somewhat. Maybe those people were previously placed inaccurately.

So be aware. You may lose some of 23andMe’s theories if you recalculate. Make sure you record how everyone is connected before you get it to do the recalculation.

Now my tree has 6 DNA relatives whose relationship I know. There is only one theory remaining. My tree now looks like this, with my father’s side now being on the right side.

image

I’ve circled in green the 6 relatives I have that I know are placed correctly. Circled in red is the one relative that remains as 23andMe’s theory.

23andMe has left me with 13 people in my Unplaced Relatives that I cannot place.

I also have 5 relatives among my matches whose relationship to me I know, but 23andMe’s Family Tree chose not to include them. I could add them to the correct place on 23andMe’s Family Tree. But they would not be connected to their DNA match information. It would be nice if 23andMe would allow you to select people from your match list. I think I’ll suggest that to them via their survey at the bottom of the Your Family Tree page.


Updating My Double Match Triangulator 23andMe Results

I last tried DMT on my 23andMe data last October:  Using DMT, Part 1: My 23andMe Data. Since I only had paternal matches back then, DMT couldn’t do much with my maternal side other than classifying which matches it calculated were maternal. What it gave me back then was this:

image

So now I’ll just do this exercise again. I’ll use DNAGedcom Client to download a new set of segment match files from 23andMe (see DMT’s help file for how to do this).

The segment match files I’ll download will be for myself and the 10 relatives I know how I’m related to. Each takes about 10 minutes for DNAGedcom to gather, so I’ll do them while I’m working on something else.


Two Hours Later

I put the 11 segment match files into a folder. I start DMT and select my own segment match file as File A. I have DMT create my People file with all my matches. Now I go through and add the MRCA for my 10 known relatives (9 of which are shown below):

image

Now I set Folder B to the folder containing all the match files and I let ‘er rip.

Double Match Triangulator clusters my matches into these groups. Compare this to the table above:

image

I have 199 more matches than I did last October.  The percentages are about the same as they used to be with the exception that DMT was able to pick out 201 of the maternal matches and associate them with my mother’s father’s cluster, due to their segment matches with my newly tested cousin.

Also, last October, I was only able to paint 46.1% of my paternal DNA to the grandparent level or beyond, and none of my maternal side. Now with my new data including my newly tested cousin, I’m able to paint 46.8% of my paternal side and 25.6% of my maternal side as well.

Uploading the DNA Painter file that DMT produces with this latest run into DNA Painter now gives this:

image

This is very similar to what I got 10 months ago, but now a significant amount of my maternal grandfather’s side (MF, in red) also gets painted. That’s a nice chunk of additional painting that DMT was able to add.

That one person whose relationship I don’t know, whom 23andMe added to my tree (see the last tree above, red circle, far right), was included as a second cousin once removed on my father’s father’s mother’s side. DMT puts that person in my FF (father’s father’s) cluster. DMT cannot work this any further back because I don’t have any cousins tested who I know are on either my FFF or FFM side for it to use. So 23andMe’s estimation of FFM is a good theory and could be correct. Now I’ll just have to trace his family tree and see if we can connect. Smile

VGA Webinar: “Your DNA Raw Data & WYCDWI”

2020. július 6., hétfő 2:21:15

In just over a week, on Tuesday July 14, 2020 at 8:00 pm EDT, I’ll be giving a live online talk for the Virtual Genealogical Association @VirtualGenAssoc

2020 07 14 Kessler (002) 

The description of my talk is:

Presenter Louis Kessler explains those mysterious files that we download from DNA testing companies, helps us to understand what’s in them, and shows us the ways we can make use of them. He will also discuss whether Whole Genome Sequencing (WGS) tests are worthwhile for genealogists.

I hope you come and join me for this.

To register for my presentation, you’ll need to be a member of the Virtual Genealogical Association. Annual Dues are only $20 USD, and that gives you free registration for a year to any of their regular webinars as well as handouts and other benefits. Upcoming webinars include:

  • Tuesday, July 14 at 8 pm EDT - Louis Kessler presents
    “Your DNA Raw Data & What You Can Do With It”
  • Sunday, July 26 at 1 pm EDT - Sara Gredler presents
    “Successfully Searching the Old Fulton New York Postcards Website”
  • Saturday, August 1, 2020 EDT - Jessica Trotter presents
    “Occupational Records: Finding Work-Related Paper Trails”
  • Friday, August 7, 2020 at 8:00 pm EDT - Ute Brandenburg presents
    “Research in East and West Prussia”
  • Tuesday, August 18, 2020 at 8:00 pm EDT - Caroline Guntur presents
    “Introduction to Swedish Genealogy”
  • Sunday, August 23, 2020 at 1 pm EDT - Julie Goucher presents
    “Researching Displaced People”
  • Saturday, Sept 5, 2020 at 11:00 am EDT - Sara Campbell presents
    “Using Historic Maps of New England and Beyond”
  • Tuesday, Sept 15, 2020 at 8:00 pm EDT - Tammy Tipler-Priolo presents
    “Simple Steps to Writing Your Ancestors’ Biographies”
  • Sunday, Sept 20, 2020 at 1:00 pm EDT - Tamara Hallo presents
    “How to Get the Most Out of FamilySearch.org”
  • Friday, Sept 25, 2020 at 8:00 pm EDT - Annette Lyttle presents
    “Finding & Using Digitized Manuscript Collections for Genealogical Research”
  • Saturday, Oct 3, 2020 at 11:00 am EDT - Patricia Coleman presents
    “Beginning with DNA Painter: Chromosome Mapping”
  • Sunday, Oct 11, 2020 at 1:00 pm EDT - Kristin Brooks Barcomb presents
    “Understanding & Correlating U.S. World War I Records & Resources”
  • Tuesday, Oct 20, 2020 at 8:00 pm EDT - Christine Johns Cohen presents
    “Lineage & Hereditary Societies: Why, Where, When, What & How?”
  • Sunday, November 22, 2020 at 1:00 pm EST - Judy Nimer Muhn presents
    “Researching French-Canadians in North America”
  • Tuesday, November 24, 2020 at 8:00 pm EST - Marian B. Wood presents
    “Curate Your Genealogy Collection – Before Joining Your Ancestors!”
  • Tuesday, Dec 1, 2020 at 8:00 pm EST - Diane L. Richard presents
    “The Organizational Power of Timelines”
  • Friday, Dec 4, 2020 at 8:00 pm EST - Nancy Loe presents
    “Using Macs and iPads for Genealogy”
  • Sunday, Dec 13, 2020 at 1:00 pm EST - Jean Wilcox Hibben presents
    “Family History Can Heal Family Present”

Notice they vary the day of the week and the time of the day to accommodate people all over the world with different schedules.

If you are unable to attend a talk live that you wanted to, members have access to recordings of the last six months of webinars. Some of the past webinars that you can still access if you join now include:

  • Pam Vestal presented
    “20 Practical Strategies to Find What You Need & Use What You Find”
  • Mary Cubba Hojnacki presented
    ”Beginning Italian Research”
  • Alec Ferretti presented
    ”Strategies To Analyze Endogamous DNA”
  • Renate Yarborough Sanders presented
    ”Researching Formerly Enslaved Ancestors: It Takes a Village”
  • Megan Heyl presented
    ”Road Trip Tips: Don’t Forget To…”
  • Lisa A. Alzo presented
    ”Finding Your Femme Fatales: Exploring the Dark Side of Female Ancestors”
  • Lisa Lisson presented
    ”How To Be A Frugal Genealogist”
  • Michelle Tucker Chubenko presented
    ”Using the Resources of the U.S. Holocaust Memorial Museum”
  • Cheri Hudson Passey presented
    ”Evidence: Direct, Indirect or Negative? It Depends!”
  • Kate Eakman presented
    ”William A. James’ 30 May 1944 Death Certificate”

While you’re at it, clear off your calendars from Nov 13 to 15 for the VGA’s annual Virtual Conference. Many great speakers and topics. There is a $59 fee for members and $79 for non-members. If the Conference interests you, then why not join the VGA right now for $20 and enjoy a year of upcoming webinars and 6 months of past webinars for free!

image

I’ve been a member of the Virtual Genealogical Association since it started in April 2018. They are always on the lookout for interesting speakers with interesting topics. If you would like to propose a talk, they are now accepting submissions for 2021 webinars and the 2021 Virtual Conference. Deadline for submission is August 30, 2020.

So How’s My Genealogy Going?

2020. július 2., csütörtök 22:47:29

I’ve written over 1100 genealogy-related blog posts since I started blogging in 2002. But very rarely have I written about my own genealogy research.

It’s actually going okay now.

This blog was started to document the development and progress of my software program Behold, that I’m building to assist me with my genealogy. About 8 years ago, I started attending international conferences and became a genealogy speaker myself. Then about 4 years ago, DNA testing started to become a thing, and I jumped fully in, finding everything about it fascinating, and I wrote my program Double Match Triangulator to help decipher matches. About 2 years ago, the Facebook era of genealogy groups began. I joined and started participating in many groups that were of interest to me and relevant to my own family research.

I got interested in my genealogy in my late teens when one of my father’s aunts was in from Los Angeles and she started drawing a tree showing her and her 8 brothers and sisters. Then I started researching. The first program I started entering my data into was Reunion for Windows. When Reunion sold their Windows product to Sierra in 1997, I became a beta tester for their release of the program which they called Generations. I used Generations to record my genealogy until 2002, when Genealogy.com purchased it along with Family Origins and Ultimate Family Tree, and then subsequently dropped all three programs in favour of their own product Family Tree Maker.

What I had was a GEDCOM with my family tree information updated up to 2002. And until about 2 years ago, I had made no updates to that at all, waiting for Behold to become the program I’d enter all my genealogy data into. Working full time, the onset of DNA testing, becoming involved in genealogy conferencing and speaking, plus family and life in general prevented that from happening.

But then a simple step recently rebooted me and my genealogy work.


The MyHeritage Step

In February 2018, I took advantage of a half-price subscription for MyHeritage’s Complete Plan. I loaded my 16 year-old GEDCOM up to MyHeritage. I downloaded their free Family Tree Builder program which syncs with their online system, and I went to it.

The special price enticed me, but I liked what I saw in MyHeritage. They had lots of users. Billions of records. They had plenty of innovation, especially in their Smart Matching. And they were less America-centric than Ancestry. All my ancestors come from Romania and Ukraine ending up here in Canada, so I have eastern European needs. I’ll need to write names in Romanian, Russian, Hebrew and Yiddish, and language handling is one of MyHeritage’s strong points.

The one place MyHeritage was weak was Canada. So I also subscribed to Ancestry as well, but just their Canadian edition. The main database I wanted that Ancestry gave me was the passenger lists for arrival to Canadian ports.

Once I uploaded my 1400 people I had from 2002 via GEDCOM, MyHeritage’s Smart Matches started working for me. Over the course of a year, I added about 500 people to my tree and attached 5000 source records to them.


Filling Out My Tree

The sides of my family I am researching include my 5 grandparents and my wife’s 4 grandparents. My father’s parents are both from Romania. My mother’s parents are both from Ukraine as are all my wife’s grandparents.

My 5th grandparent is my father’s step-father Kessler. He is my mystery side. I know very little about him and his first wife. I don’t even know where he came from other than some unidentifiable place Ogec somewhere in Russia. He has no living blood relatives that I know of, and since no one I know is related to him, I can’t even use DNA to help me on his or his first wife’s side.

In addition to those 9 grandparents, I am also sort of doing a one-place study of Mezhirichi in the Ukraine, where my mother’s father came from. That town is of more interest than the other towns of my grandparents because in the 1920’s, a synagogue in Winnipeg was formed called the Mezericher Shul, made up only of immigrants from that town, including my mother’s father. I am trying to trace all the people in Winnipeg whose parents or grandparents went to that synagogue back to their roots in Mezhirichi. I’m sure many of us are related in ways that we don’t know. So to be more precise, this is not really a one-place study of Mezhirichi, but is really a study of the families of the people who attended this synagogue in Winnipeg, who likely came from Mezhirichi.

On my wife’s father’s mother’s side is a cousin in the United States who has done an extensive study on that side of the family. He wrote a 255-page book listing about 1000 people who descended from his and my wife’s common ancestors. He graciously allowed me to add the data to my MyHeritage tree as another way to preserve his research. I enjoyed the month and a half I spent manually adding people and their birth and death years to my family tree. That was enough to let MyHeritage’s Smart Matches do the dirty work of finding record matches and easily allowing me to add dates and places from the records to our people.

Shortly after that, I ran into a problem. MyHeritage is supposed to privatize living person information. And when you look at a person in the tree who is living, it looks like they have been privatized. But it isn’t quite enough:

image

It shows the surname of the person, and the spouse’s maiden name. This wasn’t that bad, but the real problem was the Smart Matches. When someone Smart Matches to you over living people that they may have in their tree, they get all the information you have: names, dates, places, children, etc. I had a cousin email me and tell me he got a Smart Match from my tree, and his birthday was displayed to him. He wasn’t happy and neither was I.

I really was hoping I wouldn’t have to delete all the living people from my online tree, keeping them only in my local files on my computer. Fortunately there was a solution. When editing a person in Family Tree Builder, the “More” tab contains a privatization selection for the person. You check the box to make the person private:

image

They had no automated way to check this selection for all living people, so I manually opened up each of my 1500 living people and marked them private one-by-one, another week-long project.

Once those private people synced up to MyHeritage, the living couples now displayed as:
image

That’s much better. Every person still has a box online, but they are all now marked as “Unknown” rather than “private” with a surname. Also, no more information about living people is given to anyone through Smart Matches. As a consequence, I also don’t get Smart Matches for any of my privatized people. But this latter aspect might be a blessing in disguise. Now the Smart Matches I get are only for my deceased people, who are the ones I’m most interested in researching and tracing further back. And the number of Smart Matches I now get is manageable. I can clean them out in a few days until I get a few hundred more a few weeks later.


Cousin Bait

I love this term cousin bait. You don’t want to put your data in just one place. You want to put it everywhere you can. And you don’t want to put it all up for everyone to see and take. You want to make enough available to get people to contact you, so you can communicate with them and then share what you both have.

For the past 20 years, I have maintained a page of My Family Research and Unsolved Mysteries on my personal website:

image

That page is well indexed on Google. For instance, searching for “Braunstein Tecuci” on Google brings my page up in 3rd place out of 11,500 results:

image

Over those 20 years, I’ve had about 200 people email me inquiring about some of the names and places that I identify. And maybe one third of those have been actual relatives whom I’ve shared data with.

The 2nd best resource I’ve used for a long time to find family has been the JewishGen Family Finder (JGFF). I have just 17 entries, but those have been enough to get maybe 100 people to contact me to see if we have part of our family tree in common. And again, in maybe a third of those cases, we did.

image

Also, 2 decades ago, I uploaded my GEDCOM to JewishGen’s Family Tree of the Jewish People. As of March 2017, the collection had 7,310,620 records from 6,266 family trees. I’ve recently updated my tree there with my MyHeritage tree.

One of the best successes from my family webpage and through JewishGen was my connection to about 10 relatives on my father’s mother’s Focsaner side. We all have been emailing each other for many years and have been sharing information about our common family. I have only met one of these relatives in person, when our family went to New York City for a vacation about 10 years ago. But despite most of us never having met, and being 3rd cousins or further, we feel like we’re close family.

In the past 2 years, I have also added some of my own family tree (not my wife’s) to other sites, usually just my ancestors.

  • Ancestry:  Just ancestors, but I’ve connected them down to any DNA matches who are relatives.  This has given me a number of useful ThruLines that have led me to identify a couple of DNA testers who were relatives that I didn’t have in my tree.
  • Family Search:  I just added my ancestors, but I’m connecting them to anyone else in this one-world tree who I know are relatives.
  • Geni: Same as for Family Search.
  • Wikitree:  I’ve only put myself and my parents in so far. If in the future I notice a relative, I’ll connect to them.
  • Geneanet: About a year ago, I uploaded my tree from MyHeritage, so I have about 4000 in my tree there.
  • GenealogieOnline:  Just ancestors.
  • Family Tree DNA:  Just ancestors but connected down to DNA matches
  • GEDmatch:  Up to yesterday, just ancestors.

Unfortunately, other than the ThruLines results at Ancestry, these trees have not led to people contacting me. So they are not as good at being cousin bait as I hoped they would be.

But yesterday, GEDmatch added their MRCA Search Tool, which compares the GEDCOM file you uploaded to GEDmatch to the GEDCOM files of your DNA matches. So I downloaded my GEDCOM from MyHeritage (which already had all living people privatized) and I uploaded it to GEDmatch and ran their new tool.

The GEDmatch tool compared 766 of my DNA matches’ trees to mine, and 933 of my uncle’s DNA matches’ trees to my uncle in my tree. Mine is a very problematic family for these sorts of comparisons. All my ancestors are Jewish, so I have endogamy to deal with on the DNA side, and they are all from Romania or Ukraine, so I have a lack of records and the ability to go back only 5 generations to deal with on the tree side. Somewhat expectedly, the result was that neither I nor my uncle had any MRCA matches.


Other Findings

Of course, one goal every genealogist has is to expand our ancestral tree as much as we can. With all my ancestors coming from Romania and Ukraine, the records there only start in the early to mid 1800s. I can only hope to go back about 5 generations with the known records available.

Over the past few years, I found some researchers who have been able to acquire records for me and translate them from the Romanian or Russian they are written in.

Researcher Gheorge Mireuta obtained 10 birth and death records from Tecuci, Romania on my father’s father’s side.

Sorin Goldenberg obtained about 70 records from the Dorohoi region of Romania on my father’s mother’s side.

Viktoria Chymshyt has obtained records from the Mezhirichi area of Ukraine, trying to find people for me on my mother’s father’s side, but we haven’t been successful yet.

Boris Malasky has obtained about 70 records on two of my wife’s sides from Kodnya and Zhitomir in the Ukraine.

This record research is really the only possible way to expand my tree into the “old country” and provide the physical evidence to back it up.


Where I Am Now

Currently, I sit at over 5100 people in my family tree at MyHeritage, including all the people I’ve privatized.

I really love MyHeritage’s Fan View. It gives me a good representation as to where I am. Here’s the Fan View of my tree today:

image

And a new record I just got a few days ago from Sorin Goldenberg gave me the first names of the parents of my great-great-great-grandfather Manashcu Naftulovici.

image

So Naftuli and Sura are the first two ancestors I’ve identified in my 6th generation! Their son Manashcu was the first in his line to start using a surname, and he selected the patronym: Naftulovici.

My wife’s Fan View is currently this:

image

We have two of her 7th generation ancestors identified in records acquired from Boris Malasky.


Still To Do

In one word, lots!  All genealogists know this is a never-ending task. Every new ancestor you find leads to two new questions.

But my three major tasks over the next few years will be:

  1. Going through and organizing the dozens of boxes in my closet and basement, and the binders on my bookshelf, of unorganized genealogical material and pictures from my early years of research and from my parents and my wife’s parents and grandparents.
  2. Digitizing what’s valuable from #1.
  3. Entering data obtained from #1 into my family tree along with source citations.

That should keep me busy for a while.

And in the meantime, I’ll still be developing Behold so that it will continue to assist me as I go.

Writing a Genome Assembler

Monday, June 29, 2020, 5:21:23

I have now taken 5 DNA microarray (chip) tests with Family Tree DNA, 23andMe, Ancestry DNA, MyHeritage DNA and Living DNA. I have also taken two Whole Genome Sequencing (WGS) tests with Dante Labs, one short-reads and one long-reads.

I analyzed the accuracy of these tests by comparing individual SNP values in my article Determining the Accuracy of DNA Tests. The chip tests don’t all test the same SNPs, but there’s enough commonality that they can be compared, and an error rate can be estimated. For my chip tests, that error rate turned out to be less than 0.5%.

The WGS test results don’t give you individual positions. They give you a set of reads, which are segments that are somewhere along the genome. Short read WGS tests give you segments that may be 100 to 150 bases long. Long read WGS tests can give segments that average 10000 bases long with the longest being into the megabases (millions of bases). But you don’t know where those segments are located on the genome.

To determine where the WGS reads are on the genome, there are two methods available:

    1. Alignment:  Each read is matched to where it best fits in the human reference genome. The WGS testing companies often do the alignment for you and give your results to you in a BAM (Binary sequence Alignment Map) file. The alignment cannot be perfect because:

    • you have variants that differ from the human reference genome, as well as INDELs (insertions and deletions),
    • the WGS data has errors in the reads, sometimes changing values, adding extra values or deleting values, and
    • the algorithms used for alignment are not perfect and sometimes make assumptions.

    Comparing my BAM file results from my short read WGS test using the BWA alignment tool, the SNPs I could compare were even more accurate than my chip tests with an error rate of less than 0.1%. That sounds very good, but still 1 in 1300 results were wrong, meaning in 700,000 SNPs, there could be 500 errors.

    The WGS_Extract tool that I was using to extract the SNP values from the BAM file didn’t handle INDELs properly, so I couldn’t check the accuracy of those.  Despite their high accuracy for individual SNPs, short read WGS tests are not as good at identifying INDELs correctly, e.g. the YouTube video (see below) states 85% to 95% accuracy, which is a high 5% to 15% error rate.

    For my long reads WGS test, I had two alignments done, one using a program called BWA and one using minimap2 which was supposed to be better for long reads. I was very disappointed to find a quite high error rate on the SNPs I could compare, which was 7.7% and 6.6% for the two programs.

    Thus, alignment techniques and the algorithms that implement them are not bad, but they are far from perfect. They match your reads to a reference genome and have to assume that the best fit is where your read goes.

    2. De Novo Assembly, or just Assembly: This is where you only take the WGS reads themselves, and match them up with each other, piecing them together like a big picture puzzle.

    Actually, it’s tougher than a picture puzzle. The best analogy I’ve seen is it’s like taking 100 copies of today’s issue of the New York Times newspaper, and shredding them into small random pieces where you can only see a few words from a few lines. Just to make it a bit more difficult, half the papers are the morning edition, and half are the afternoon edition, where 90% of the articles are the same, but the other 10% have a different article in the same location in the paper. On top of that, somehow one copy of yesterday’s paper accidentally got in the mix. Now you have to reassemble one complete newspaper from all these pieces. And as a bonus, try to create both the morning edition and the afternoon edition.

    You likely will mix up some morning edition articles with some afternoon edition articles, unless you get some pretty big pieces that include 2 of the same edition’s articles in that piece. (Think about this!)

    So the two editions are like your paternal and maternal chromosomes, and the one copy of the previous day’s paper is like a 1% error rate that your reassembling has to deal with. Add in shredded versions of six different issues of the newspaper for a 6% error rate.

    A genome assembler matches one read to another and tries to put them together. The longest stretches of continuous values that it can assemble are called contigs. Ideally, we would want to assemble 24 contigs: one for each of the 22 autosomes, plus the X chromosome and the mitochondrial (mtDNA) genome. A male will have a 25th, that being his Y chromosome.

    When assemblers can’t connect a full chromosome together (which none can do yet for humans), you can run another program to use a technique called scaffolding to connect the contigs together. That is done by mapping the contigs to the human reference genome and using the human reference genome as the scaffolds (or connections).

    Assembly with short read WGS has not been able to give good results. Similar to alignment, the reads are too short to span repeats, and thus give way too many contigs. Long reads are normally used for assembly, and despite their high error rate for individual base pairs, sophisticated error correction techniques and minimum distance algorithms have been developed to do something reasonable. However, chromosome-scale contigs are still not there yet, and many smart researchers are working to solve this, e.g. this article from Nov 2019 describing a method using a connection graph.

    I attempted an assembly of my long reads WGS about 6 months ago using a program called miniasm. I let it run on my computer for 4 days but I had to stop it. So I waited until before a 2 week vacation and started it, but while it was running my computer crashed.

    I realized that this is too long to occupy my computer to do an assembly that likely will not give good results. And I was not happy running it in Unix on my Windows machine. I was interested in a Windows solution.


    Algorithms for Genome Assembly

    I’ve always been a programmer who likes the challenge of developing an algorithm to solve a problem. I have a BSc Honours in Statistics and an MSc in Computer Science, and my specialty and interest was in probability and optimization.

    I have developed and/or implemented many computer algorithms, including detection of loops in family trees for Behold, matching algorithms in Double Match Triangulator, simulation of sports and stock market performance (winning me over $25,000 in various newspaper contests) and from my university days: my at-the-time world class chess program: Brute Force.

    Currently, for the next version of Behold, I am implementing the DNA probability of a match, and the expected match length conditional upon matching, for autosomal, X, Y and mtDNA between selected people and everyone else in your family tree. In doing so, I also have to determine all the ways the selected people are related and statistically combine the results. All this data will be available to the user if wanted, along with all the ways these people are related. It should be great.

    But back to genome assembly. The problem with assembly algorithms today is that they have to use long reads, and long reads have very high error rates. So they must attempt to do some sort of approximate matching that allows for errors and then use the consensus approach, i.e. take the values that most reads aligning to the same position agree on. It is not clean. It is not simple. There is a lot of error correction and many assumptions must be made.

    Ahh, but wouldn’t it be simple if we could just take one read, and match the start to another read and the end to a third read? If you have enough coverage, and if the reads are accurate enough, then this would work fine.

    image 

    In fact this is how they put together the first human genomes, painstakingly connecting the segments that they had one by one.

    But alas, the long reads WGS tests are not accurate enough to do this. So something else had to be done.

    A couple of months ago, I discovered a wonderful online book called Bioinformatics Algorithms, designed for teaching. The entire text of the book is available online. You can also purchase the book for yourself or for your class.

    image

    Chapter 3 is: How Do We Assemble Genomes? That is where I got the exploding newspaper analogy which I expanded on above. The chapter is amazing, turning the problem into graph theory, bringing in the famous Königsberg Bridge Problem, solved by mathematician Leonhard Euler, and explaining that a de Bruijn graph is the best solution for error-prone reads.

    This looked like quite a task to implement. There are many assembly algorithms already developed using this technique, and I don’t think there’s anything I can do here that those working on this haven’t already done.


    Accurate Long Reads WGS!!!

    Also a couple of months ago, another innovation caught my attention. The company PacBio developed a DNA test they call PacBio HiFi SMRT (Single Molecule, Real-Time) WGS, which gives reads that are both long (up to 25 kb) and highly accurate (about 99.8%).

    Whoa! The world has just changed.

    No longer was extensive error correction required. The video above talks about the HiCanu assembler and how it was modified to take full advantage of this improved test. Not only that, but the practice of using short reads to "polish" the data is no longer required, and is actually discouraged with HiFi reads, as the polishing can introduce errors.

    What does this mean? Well, to me this indicates that the original ideas of simply connecting ends might just work again. I have not seen any write-up about this being attempted anywhere yet. The assembly algorithm designers have been using advanced techniques like de Bruijn graphs for so long, they might never have thought to take a step back and think that a simpler solution may now work.

    So I thought I’d take that step back and see if I can develop that simpler solution.


    A Simple Single-Pass Assembler for Windows

    For 25 years I’ve developed software using the programming language Delphi on Windows. Most bioinformatics tools are written in Python for Unix. I’m too much of an old horse who is too busy to learn new tricks. So Delphi it will be for me.

    The algorithm with perfect reads seemed fairly simple to me. Make the first read a contig. Check the next read. Does the start or end of the read match anywhere within the contig? If so, extend the contig. If not, make the read a contig. Continue sequentially just one time through the reads and after the last read, you should be done!!!

    Once I got going, I only found it slightly more complicated than that. You also had to check if the start and end of the contigs matched anywhere within the read, and also if the read contained the contig or the contig contained the read. I set a minimum overlap length thinking that I’d want to ensure that the read and the contig matched at least that much. Then any repeats smaller than that overlap would be bridged.

    First I needed some sample data. In Chapter 9 of the Bioinformatics Algorithms book, the Epilogue on Mismatch-Tolerant Read Mapping gives a challenge problem that includes a 798 KB partial dataset of the bacterial genome Mycoplasma pneumoniae with 816,396 values in it, all either A, C, G or T.

    This is what that dataset looks like in my text viewer. It’s just one long line with 816,396 values in it:

    image

    The challenge problem also included a file of 40,000 short reads from that dataset, all of length 100. That gives 4 million data points for a coverage of 4.9x over the 816,396 in the genome.

    However, not a single one of the 40,000 reads was in the genome. The challenge was to find the number of reads that had at most 1 mismatch.

    Since I wanted a perfect dataset of reads to start with, I saw that I needed to create my own. Also, I wanted them to be like long reads, all with differing lengths.  So after a bit of trial and error, I ended up using a base-10 lognormal distribution, with a mean of 3 and standard deviation of 0.25 to generate 9000 random read lengths. Longest read length was 11,599. Shortest was 124. Mean was 1174.

    image

    So those 9000 reads average 1174 bases and total 10.6 million data points, giving 13.0x coverage of the genome, which is quite a bit more than the 4.9x coverage in their example short reads. This is good, because there’s more likelihood I’ll have enough reads to cover the entire genome without gaps.

    I then generated random start positions for those 9000 reads, and extracted the actual genome values at that position for that read length, and put those into my own reads file. So now I had a set of reads with no errors to develop with.
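
    For anyone who wants to try something similar, here is a rough Python sketch of that read-generation step (my own code is in Delphi; the file names here are made up):

        # Rough sketch of the read generation described above (not the actual Delphi code).
        import random

        random.seed(42)                                 # make the example repeatable

        with open("mycoplasma_genome.txt") as f:        # hypothetical file: the genome as one long line
            genome = f.read().strip()

        reads = []
        for _ in range(9000):
            # Base-10 lognormal read length: 10 ^ Normal(mean 3, sd 0.25), i.e. median about 1000 bases
            length = int(round(10 ** random.gauss(3, 0.25)))
            start = random.randrange(0, len(genome) - length + 1)
            reads.append(genome[start:start + length])

        with open("perfect_reads.txt", "w") as f:       # one read per line, varying lengths
            f.write("\n".join(reads) + "\n")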

    This is what my set of reads looks like in my text viewer. There are 9000 lines, each of varying length:

    image

    To do the alignment, I didn’t know how much of the start and the end of each read was needed for finding a match in another read. So I wrote a small routine to take the first n positions at the start and end of the first read, and find out how many other reads they are contained in:

    image

    I started at length 5. The first 5 values of the first read matched somewhere in 14,281 other reads. Obviously length 5 is too small. Increasing to the first 11 values, we see the start of the first read only matches 10 other reads and the end only matches 8. This does not decrease any more as we increase the segment size indicating that we likely found all the occurrences of that sequence in all the reads. With 13.0x coverage, you would expect on average 13 matches over any segment. I have 1 + 10 = 11 occurrences of the first part of the first read, and 1 + 8 = 9 occurrences of the last part of the first read. That’s a very possible result with 13.0x coverage.

    So for this genome and the sample data I have, I’ll set my segment length to 12 and select the first 12 values and last 12 values of each read for my comparisons.

    The reason why such a small 12 value segment can be used is because there are 4 possible values, A, C, G and T at each position. And 4 to the power of 12 is 16,777,216, meaning there are that many ways to make a 12 letter string out of those 4 values. Our genome is only 816,396 bases long, so there is very little chance there are very many segments of length 12 that are randomly included more than once. For a human genome of 3 billion bases, a slightly longer segment to compare with will be required; maybe length 17 or 18 will do it.
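
    To illustrate that probing step, a small Python sketch like this (again, not my actual Delphi routine; it assumes the perfect_reads.txt file from the sketch above) counts how many other reads contain the first and last n values of the first read:

        # Count how many OTHER reads contain the first/last n values of read 1,
        # for increasing probe segment lengths n.
        with open("perfect_reads.txt") as f:
            reads = [line.strip() for line in f if line.strip()]

        first = reads[0]
        for n in range(5, 16):
            head, tail = first[:n], first[-n:]
            head_hits = sum(head in r for r in reads[1:])
            tail_hits = sum(tail in r for r in reads[1:])
            print(f"n={n:2d}: start matches {head_hits} other reads, end matches {tail_hits}")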


    Running My Assembler: Sembler

    After about 4 days of development, testing, debugging and enhancement, I got my simple assembler running. I call it:  Sembler. This version ended up with about 200 lines of code, but half of that is for reporting progress.

    So this is its algorithm. Sembler checks the first and last 12 positions of every read against each contig created so far. It also checks the first and last 12 positions of the contig against the read. And it checks if the read is completely in the contig and if the contig is completely in the read. Based on the situation, it will then either expand the contig, combine two contigs, or create a new contig.
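
    For those who like to see things in code, here is a very stripped-down Python sketch of that single pass. It is not my Delphi source, it assumes perfect reads, and it leaves out the minimum-overlap safeguard and the contig-merging case, but it shows the basic contained / extend-right / extend-left / new-contig logic:

        SEG = 12                                  # length of the start/end probe segments

        def try_extend(contig, read):
            """Return an extended contig if the read overlaps it, else None."""
            if read in contig:                    # read lies entirely within the contig
                return contig
            if contig in read:                    # contig lies entirely within the read
                return read
            head, tail = read[:SEG], read[-SEG:]
            i = contig.find(head)
            if i != -1 and read.startswith(contig[i:]):           # read extends the contig to the right
                return contig + read[len(contig) - i:]
            j = contig.find(tail)
            if j != -1 and contig.startswith(read[-(j + SEG):]):  # read extends the contig to the left
                return read[:-(j + SEG)] + contig
            return None

        def assemble(reads):
            contigs = []
            for read in reads:
                for k, contig in enumerate(contigs):
                    extended = try_extend(contig, read)
                    if extended is not None:
                        contigs[k] = extended
                        break
                else:
                    contigs.append(read)          # no overlap found: start a new contig
            return contigs

    The real Sembler also merges two contigs when a single read bridges them, as happens with read 1024 further below, and that merging is what eventually collapses everything down to one contig.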

    Sembler reports its progress as it goes. Here is how it starts on my test data:

    image

    The first line shows the settings used. The next lines show the reads used. Since the minimum read length for this run was 1200, reads 2, 6, 7, 9, … were not used because they were too short.

    Up to read 66 no overlaps were found, so a contig was created from each read. At read 66, and again at read 77, the first 12 values of the read matched somewhere in one of the contigs. The rest of the contig matched the read after those 12 values, but the read was longer and had more values available that Sembler then used to extend that contig to the right.

    If we go down further to reads starting at 1004 we see:

    image

    We have now built up 177 contigs and they grow to a maximum of 179 contigs by read 1018. At this point, the contigs cover much of the genome and it is getting tougher for new reads not to be overlapping with at least one of the contigs.

    The start of read 1024 matches somewhere in contig 78 and the end of read 1024 matches somewhere in contig 90.  So this read has connected the two contigs. Contig 90 is merged into contig 78, and the last contig 179 is moved into contig 90’s spot just so that there aren’t any empty contigs to deal with.

    Below is the end of the output from running reads with length >= 1200:

    image

    We get down to read 8997 which ends up merging contig 3 into contig 1, and contig 4 becomes contig 3. So we are left with just 3 contigs.

    The run took 19.156 seconds.

    Normally, you don’t know the genome. This procedure is designed to create the genome for you. But since I am still developing to get this to work, I had Sembler look up the final contigs in the genome to ensure it has done this correctly. The three contigs it came up with were:

    Contig 1 from position 82 to 658275
    Contig 3 from position 658471 to 764383, and
    Contig 2 from position 764404 to 816396.

    So positions 1 to 81 were not identified, because there were no reads with length at least 1200 that started before position 82. And there was a small gap of length 195 between 658275 and 658471 which no reads covered and another gap of length 20 between 764383 and 764404 that no reads covered.

    Despite the 7.7x coverage, there were still a couple of small gaps. We need a few more reads to fill in those gaps. One way of doing so is to lower the minimum read length. So I lowered the minimum read length to 1000 and got this:

    image

    Success! We now have a single contig from position 82 to 816396.


    Optimizing the Assembler

    I have so far done nothing to optimize Sembler’s code. The program compares character strings. It uses a Pos function to locate one string within another. There are many ways to improve this to make it run faster, but getting the algorithm working correctly was the first necessity. I have a lot of experience at optimizing code, so if I carry this program further, I will be sure to do so.

    But just as important as optimizing the code is optimizing the algorithm. Starting parameters are very important. Let’s look at what tweaks can be made.

    image

    As we increase the minimum length of the reads we include, we reduce the number of reads we are using. This reduces coverage, reduces the number of compares we do, and takes less time. The maximum number of contigs we have to deal with decreases, and that maximum happens later in the run.

    But if our value for the minimum length is too high, we don’t get enough coverage to fill in all the gaps and we end up with more than one contig. The most important thing here is to try to end up with just one contig.

    Based on the one contig requirement, our optimum for this set of reads for this genome is to select a minimum length of 1000.

    Now let’s set the minimum length to 1000 and vary the segment length:

    image

    Varying the segment length we are comparing doesn’t change the result. The segment length is only used to find a potential contig the read might match to. If the length is too short, then the start or end of the read will match to random locations in each contig. They will be rejected when the rest of the read is compared, which is why the solution doesn’t change. But all these extra checks can dramatically increase the execution time if the segment length is too small.

    These are perfect reads I’m working with right now that have no errors. Once errors are considered, we’ll want to keep the seglength as small as possible to minimize the chance that the start segment or end segment contains an error. If it does, then that whole read will be rejected when the rest of the read is compared, effectively eliminating the use of that read.

    Now let’s put the segment length back to 12 and vary the minimum overlap which by default I had set to 100:

    image

    These results surprise me somewhat. I was expecting a minimum overlap of 50 and especially of 0 to fail and give lots of contigs. I’ll have to think about this a bit. Maybe it is because I’m using perfect reads with no errors in them.

    Nonetheless, this shows that if the minimum overlap is too high, then some of our matches will be excluded, causing some gaps. We don’t want the minimum overlap too low, or we may match two contigs that are side by side but don’t have enough "proof" to connect them. That isn’t a problem in this "perfect reads" case, but once errors are introduced, some overlap will likely be wanted as a double check.


    Next Steps

    This procedure works.

    Is it fast enough for a full WGS dataset? We’re talking about tens to hundreds of millions of reads rather than just 9000.  And we’re talking about a genome that is 3 billion positions rather than just 800,000.  So instead of 200 max contigs, we’ll likely have 200,000 max contigs. So it could theoretically take a million times longer to solve than the little problem I have here.

    If with optimization I can get the comparisons to be 20 times faster, then we’re talking a million seconds, which is 278 hours, i.e. about 12 days. That’s a little bit longer than I was hoping. But this is just a back of the envelope calculation. I’m not multithreading, and there are faster machines this can run on. If a procedure can be made available that will do a full de novo assembly of a human genome in less than 24 hours, that would be an achievement.

    I have so far only tested with perfect data. It wouldn’t be too hard to test the addition of imperfections. I could change every 1000th value in my sample reads to something else and use that as a 0.1% error rate like WGS short reads. I could change every 500th for a 0.2% error rate like PacBio HiFi reads. And I can change every 20th for a 5% error rate like WGS long reads. I already have some ideas to change my exact comparison to an approximate comparison that will allow for a specific error rate. The tricky part will be getting it to be fast.
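
    As a sketch of how that could look (randomly substituting values at a chosen rate rather than strictly every Nth value; Python again, not my Delphi code, and reusing the reads list from the earlier sketch):

        import random

        BASES = "ACGT"

        def add_errors(read, error_rate):
            """Return a copy of the read with roughly error_rate of its values substituted."""
            out = list(read)
            for i in range(len(out)):
                if random.random() < error_rate:
                    out[i] = random.choice(BASES.replace(out[i], ""))   # change to a different base
            # insertions and deletions, which real long reads also have, are not simulated here
            return "".join(out)

        # e.g. 0.001 ~ WGS short reads, 0.002 ~ PacBio HiFi, 0.05 ~ noisy WGS long reads
        noisy_reads = [add_errors(r, 0.002) for r in reads]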

    It might be worthwhile running my program as it is against my WGS short reads. I showed above that the minimum overlap may not need to be as high as I originally thought, so maybe the WGS short reads will be able to assemble somewhat. There likely will be many regions where the repeats are longer than the short reads, and this procedure will not be able to span them. The elephant in the room is whether I can process my entire WGS short reads file in a reasonable amount of time (i.e. 1 day, not 12 days). And how many contigs will I get? If it will be 200, that will be pretty good, since that will only be an average of 10 per chromosome. But if there’s 2000 contigs, then that’s not quite as good.

    I should try to get a set of PacBio HiFi human reads. That is what this procedure is geared towards. PacBio HiFi are the reads that I think, with enough coverage, might just result in 25 contigs, one for each chromosome plus Y plus mt. Then it wouldn’t be too hard to add a phasing step to that to separate out those contigs into 46 phased chromosomes + mt for women, or 44 phased chromosomes + X + Y + mt for men. I think the PacBio HiFi reads have a chance of being able to do this.

    Finally, I would love to get a set of PacBio HiFi reads for myself. I don’t have my parents with me any more and they never tested, and I’d love to phase my full genome to them. Also, then I can do some analysis to see how well (or poorly) the WGS alignment techniques I used compare to the (hopefully) accurate genome that I’ll have assembled for myself.

    Maybe this won’t all happen this year. But I’m sure it will eventually, whether based on my Sembler algorithm, or on some other creation by some of the many hyper-smart bioinformatics programmers that are out there.

    If PacBio HiFi reads prove to be the revolution in genetic testing that they are promising to be, then for sure the whole world of WGS testing will change in the next few years.

    Kevin Borland visits Speed and Balding

    Tuesday, June 23, 2020, 17:20:52

    Kevin Borland is the author of Borland Genetics, a fantastic site where you can upload your Raw DNA data, match to others, and use tools to reassemble your ancestors’ DNA. I very recently wrote a blog post about Kevin’s site.

    Kevin also has a blog in which he has been posting very interesting articles, usually of an analytic nature which are the type I really like. Yesterday, Kevin posted an excellent article: Help! My Segments Are So Sticky! in which he clearly explains how he calculated the probabilities of age ranges for 7 cM and 20 cM autosomal segments, where he used 25 years = 1 generation.

    So Kevin gives another take on the segment age estimates done by Speed and Balding in their 2014 paper made available online by Doug Speed:
    Relatedness in the post-genomic era: is it still useful?

    In the Genetic Genealogy Tips & Techniques group on Facebook, Blaine Bettinger posted about Kevin’s article and said: “I would absolutely love to see Kevin address the differences between his calculations and the calculations in the Speed & Balding paper, how fun that would be!”

    I’ve always felt that Table 2B from the Speed and Balding paper overestimates the age of segments for a given segment size. I wrote two articles on my blog in 2017 with alternative analyses and compared them to Speed and Balding:

    And I further updated that with another calculation in my article:

    Those articles received many comments, including one from Doug Speed, and much discussion on Facebook.

    So I was very interested to see what Kevin’s analysis says. Let’s compare.

    Using Kevin’s easy-to-follow method of calculation, I can first calculate the probability of no recombinations in x generations:

    image

    And then I simply subtract each column from the previous to give the probability that a segment is x generations old:

    image
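
    For anyone wanting to reproduce the arithmetic, here is a small Python sketch of the calculation as I understand it, using the 1 cM = 99% per-generation survival assumption (see my disclaimer at the end of this post):

        # Probability a segment of c cM survives g generations with no recombination,
        # assuming each cM independently has a 99% chance of no crossover per generation.
        def p_no_recombination(cM, generations):
            return 0.99 ** (cM * generations)

        for cM in (1, 2, 5, 10, 20):
            p20 = 1 - p_no_recombination(cM, 20)
            print(f"{cM:2d} cM: chance of <= 20 generations = {p20:.0%}")
            # prints roughly 18%, 33%, 63%, 87% and 98%, the figures compared below

        # The age distribution is the difference between consecutive columns:
        # P(exactly g generations) = p_no_recombination(cM, g - 1) - p_no_recombination(cM, g)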

    Now let’s plot this in the Speed and Balding chart format:

    image

    Let’s compare this to the Speed and Balding Figure 2B chart that everyone quotes. I’ll cut off the left and right sides which have smaller and larger segments that we’re not comparing:

    image

    Speed and Balding uses ranges, so for Kevin’s chart above, I used values at the start, middle and end of each range. Speed and Balding uses Megabases (Mb) and Kevin uses centimorgans (cM), but they are close enough for practical purposes.

    What we see is:

    Speed and Balding, 1 – 2 Mb:  About 18% chance of <= 20 Generations
    Kevin Borland, 1 – 2 cM:  Between 18% and 33% chance of <= 20 Generations

    Speed and Balding, 2 – 5 Mb:  About 28% chance of <= 20 Generations
    Kevin Borland, 2 – 5 cM:  Between 33% and 63% chance of <= 20 Generations

    Speed and Balding, 5 – 10 Mb:  About 50% chance of <= 20 Generations
    Kevin Borland, 5 – 10 cM:  Between 63% and 87% chance of <= 20 Generations

    Speed and Balding, 10 – 20 Mb:  About 68% chance of <= 20 Generations
    Kevin Borland, 10 – 20 cM:  Between 87% and 98% chance of <= 20 Generations

    So indeed, Kevin’s figures do corroborate my own and indicate that Speed and Balding’s table is likely an overestimate of the age of segments of a certain size.




    Disclaimer: I sort of knew after reading Kevin’s article that his estimates would be similar to mine, since I used the same calculations as Kevin in my Life and Death of a DNA Segment article, except that I used the Poisson distribution for the starting probability rather than the 1 cM = 99% estimate that Kevin used.

    Xcode Life Health Reports

    Saturday, June 6, 2020, 6:38:08

    On Facebook, I was delivered a sponsored ad for getting health reports from your DNA raw data at a 55% discount from a company called Xcode Life.
    @xcode_ls

    image

    I’ve always been much more interested in DNA for genealogical purposes than for health, but I had never heard of this company and it sounded interesting. Their "Mega Pack" report was said to contain reports for Nutrition, Fitness, Health, Allergy, Skin, Precision Medicine, Methylation, Carrier Status, and Traits & Personality in 600+ categories.

    I looked around on the internet for a coupon and saved an additional $10 and paid $89 for the package. They accept uploads from all the major companies. I uploaded my combined all-6 file with 1.6 million SNPs in it that was in 23andMe format and it was accepted.

    The next morning, 13 hours later, I got an email stating my reports were ready. In the email, they gave me the coupon code REFVTB47WRMU5 worth $10 off any of their packages that I can give away. If you use it, I will also get $10.

    The Reports

    I downloaded my reports as a compressed zip file. After unzipping, there were 9 pdf files for the 9 reports ranging in size (for me) from the Methylation report at 10 pages up to the Carrier report at 84 pages.

    Most of the reports start with a 2 page introduction, the first page on understanding your report and the 2nd page on how to read your report. Each report ends with a 1 page disclaimer.

    The results follow on the next 2 to 4 pages and each trait is presented as one row of a table containing 2 or 3 possible results. They are color coded green for better than average, orange for average and red for not as good as average.

    For example, the Personality Results have two possible results. These are a couple of mine with better than average results:
    image

    And here’s a couple with just average results:

    image

    And then there are those for negative traits:

    image

    Whereas the Nutrition, Skin, Health, Allergy and Fitness reports mostly give 3 possible outcomes per trait, e.g.:

    image

    The remainder of these 6 reports summarize and explain each of the traits, giving a recommendation with the same color as your result. It then tells you which genes were analyzed for the trait, but does not tell you which SNPs were analyzed or what your SNP values were.

    The other 3 reports each have their own format.

    The Carrier report lists 402 different conditions in alphabetical order and tells you if you have potential pathogenic variants.

    image

    They write in bold red letters in the introduction that these are not to be used for medical purposes. And the disclaimer says only your physician is qualified to interpret this report and incorporate this information in treatment and advice. Nonetheless, if anything shows up, it is likely worthwhile following up on it with your doctor.


    Actual SNPs!

    The other two reports had what I was more interested in. I wanted the actual SNPs identified indicating the value that I had for them and what they meant.

    The Pharmacogenetics Report lists the gene variants I have that are associated with my reaction to 185 different drugs:

    image 

    The rsid (Reference SNP cluster ID) is listed, as well as my result: the Genotype TT.  I can find the rsid rs2395029 in my Raw Data File that I supplied to Xcode Life, and that will tell me where it is by chromosome and position:

    image

    So this SNP is on Chromosome 6, position 31,431,760 and yes, it does have the value TT.
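
    If you want to look up several rsids at once, a few lines of Python over the raw data file will do it. 23andMe-format files are tab-separated rsid, chromosome, position and genotype columns, with # comment lines at the top; the file name below is made up:

        wanted = {"rs2395029"}                          # add any other rsids of interest

        with open("my_combined_raw_data.txt") as f:     # hypothetical file name
            for line in f:
                if not line.strip() or line.startswith("#"):   # skip blank and comment lines
                    continue
                rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")[:4]
                if rsid in wanted:
                    print(rsid, "chromosome", chrom, "position", pos, "genotype", genotype)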

    I can then look up that rsid on Google and it will bring up lots of other information that can be found about the SNP:

    image

    such as SNPedia, which was recently purchased by MyHeritage and which they are leaving as a free resource available to all, presumably to give them information for their health tests.

    image

    The Methylation Report lists about 60 SNPs from various genes that are associated with conditions such as cardiovascular disease, Alzheimer’s, cancer and depression. They list the normal value, the risk value, and then show your genome values (GENO), shading the line if you have one or two risk values.

    image


    Conclusion:

    I wasn’t expecting to find too much of importance, as I am relatively healthy, and my 23andMe health test results didn’t come up with anything important. But the Xcode Life reports did identify 1 potential variant for a condition I have that would have helped me 5 years ago before I found out about it.

    If you have some ailment but don’t know what’s causing it, a DNA-based health report like this one from Xcode Life might be a good screen. If something shows up, you can discuss the report with your doctor.

    For me, I did this mostly out of curiosity. Having the rsids of the SNPs of interest in the Pharmacogenetics and Methylation reports will allow you, as a genetic genealogist, to map those SNPs onto your genome with a tool like DNA Painter, and track those SNPs through your ancestors.

    Upload Your Raw DNA Data to Borland Genetics

    Monday, May 25, 2020, 20:53:11

    There’s another website I recommend you upload your DNA raw data to called Borland Genetics.

    image

    See this video: Introducing Borland Genetics Web Tools

    In a way, Borland Genetics is similar to GEDmatch in that they accept uploads of raw data and don’t do their own testing. Once uploaded, you can then see who you match to and other information about your match. Borland Genetics has a non-graphic chromosome browser that lists your segment matches in detail.    
       
    But Borland Genetics has a somewhat different focus from all the other match sites. This site is geared to help you reconstruct the DNA of your ancestors and includes many tools to help you do so. And you can search for matches of your reconstructed relatives, and your reconstructed relatives will also show up in the match lists of other people.

    Once you upload your raw data and the raw data from some tests done by a few of your relatives, you’re ready to use the exotically named tools that include:

    • Ultimate Phaser
    • Extract Segments
    • Missing Parent
    • Two-Parent Phase
    • Phoenix (partially reconstructs a parent using raw data of a child and relatives on that parent’s side)
    • Darkside (partially reconstructs a parent using raw data of a child and relatives that are not on that parent’s side)
    • Reverse Phase (partially reconstructs grandparents using a parent, a child, and a “phase map” from DNA Painter) 

    Coming soon is the ominously named Creeper, which will be guided by an Expert System that uses a bodiless computerized voice to instruct you on what your next steps should be.

    There’s also the Humpty Dumpty merge utility that can combine multiple sets of raw data for the same person, and a few other tools.

    The above tools are all free at Borland Genetics, and there are a few additional premium tools available with a subscription. You can use them to create DNA kits for your relatives. Then you can download the kits if you want to analyze them yourself or upload them to other sites that allow uploads of constructed raw data.

    By comparison, GEDmatch has only two tools for ancestor reconstruction. One called Lazarus and one called My Evil Twin. Both tools are part of GEDmatch Tier 1, so you need a subscription to use them. Also, you can only use the results on GEDmatch, because GEDmatch does not allow you to download raw data.


    Kevin Borland

    The mastermind behind this site is Kevin Borland. Kevin started building the tools he needed for himself for his own genetic genealogy research a few years ago and then decided, since there wasn’t one already, to build a site for DNA reconstruction. See this delightful Linda Kvist interview of Kevin from Apr 16, 2020.

    In March 2020, Kevin formally created Borland Genetics Inc. and partnered with two others to ensure that this work would continue forward.

    If you are a fan of the BYU TV show Relative Race (and if you are a genealogist, you should be), then you should know that Kevin was the first relative visited by team Green in Season 2.  See him at the end of Season 2 Episode 1 starting about 32:24.


    Creating Relatives

    I have not been as manic as many genetic genealogists in getting relatives to test. I only have my own DNA and that of my uncle (my father’s brother), whom I have tested. So with only two sets of raw data, what can I do with that at Borland Genetics?

    Well, first I uploaded and created profiles for myself and my uncle.

    The database is still very small, currently sitting at about 2500 kits. Not counting my uncle, I have 207 matches with the largest being 54 cM. My uncle has 86 matches with the largest being 51 cM. This is interesting because most sites have more matches for my uncle than for me, since he is 1 generation further back.  I don’t know any of the people either of us match with. None of them are likely to be any closer than 4th cousins.

    My uncle and I share 1805.7 cM. The chromosome browser indicates we have no FIR (fully identical regions) so it’s very likely that despite endogamy, I’m only matching my uncle on my father’s side.

    The chromosome browser suggests three Ultimate Phaser options for me to try:

    image

    To interpret the results of these, you sort of have to know what you’re doing.

    So let me instead go and try to create some relatives. For that I can first use the Phoenix tool.

    image

    It allows me to select either myself or my uncle as the donor. I select myself as the donor and press Continue.

    image

    Here I enter information for my father and press Continue.

    image

    I now can select all my matches who I know are related on my father’s side. You’ll notice the fourth entry lists the “Source” as “Borland Genetics” which means it is a kit the person created, likely of a relative who never tested anywhere.

    In my case, my uncle is the only one I know to be on my father’s side, so I select just him. I then scroll all the way down to the bottom of my match list to press Continue.

    image

    And while I’m waiting, I can click play to listen to some of Kevin’s music.  After only about 2 minutes (the time was a big overestimate) the music stopped and I was presented with:

    image

    I now can go to my father’s kit and see what was created for him. His kit type is listed as “Mono” because only one allele (my paternal chromosome) can be determined. The Coverage is listed as 25% because I used his full brother who shares 50% with him, and thus 25% with me.

    image

    His match list will populate as if he was a person who had tested himself.

    I can download my father’s kit:

    image

    which gives me a text file with the results at every base pair:

    image

    The pairs of values are all the same because this is a mono kit. Also be sure to  use only those SNPs within the reconstructed segments list. There must be an option somewhere to just download the reconstructed segments, but I can’t see it. (Kevin??)

    In a very similar manner (which I won’t show here because it is, well, similar), I can use the Darkside tool to create a kit for my Mother using myself as the child and my Uncle as the family member on the opposite side of the tree.


    Reconstructing Ancestral Bits

    Now I have kits for myself, my uncle, my father and my mother. Can I do anything else?

    Well yes! I can use my analysis from DNA Painter to define my segments by ancestor.

    image

    I just happened to have the DNA Painter analysis done already, which I used Double Match Triangulator for. Using DMT, I created a DNA Painter file from my 23andMe data for just my father’s side:

    image

    I labelled them based on the ancestor I identified, e.g. FMM = my father’s mother’s mother. I downloaded the segments from DNA Painter and clicked “Choose File” in Borland Genetics and it gave me my 5 ancestors with the same labeling to choose from.

      image

    I select “FF”, click on “Extract Selected Segments” and up comes a screen to create a Donor Profile for my paternal grandfather!

    image

    Wowzers! I have now just created a DNA profile for a long-dead ancestor, and I can do the same for 4 more of my ancestors on my father’s side.

    Just a couple of days ago, I think I was asking Kevin for this type of analysis. Only today when writing this post, did I see that he already had it.


    Summary

    I only have my own and my uncle’s raw data to work with, yet I can still do quite a bit. For people who have parents, siblings and dozens of others tested … well I’m enviously drooling at the thought of what you can do at Borland Genetics with all that.

    There is a lot more to the Borland Genetics site than I have discussed here. There are projects you can create or join. Family tree information. Links to WikiTree. You can send messages to other users. There are advanced utilities you can get through subscription.

    The site is still under development and Kevin is regularly adding to it. Kevin started a Borland Genetics channel on YouTube, and over the past 2 years he made an excellent 20-episode series of YouTube videos on Applied Genetics. And he runs the Borland Genetics Users Group on Facebook, now with 738 members. I don’t know how he finds the time.

    So now, go and upload your raw data kits to Borland Genetics, help build up their database of matches, and try out all the neat analysis it can do for you.

    OneDrive’s Poison Setting

    Saturday, May 9, 2020, 6:31:31

    OneDrive’s default setting of no limit for network upload and download rates has caused years of Internet problems at my house. Unbeknownst to us, it would from time to time consume most or all of the Internet bandwidth, affecting me on my ethernet-connected desktop computer and affecting everyone else in my house connected with their devices to our Wi-fi. It is now obvious to me that this hogging of bandwidth happened following any significant upload of pictures or files from my desktop computer to OneDrive, and the effect sometimes lasted for days!

    Yikes! I’m flabbergasted at how we finally discovered the reason behind our Internet connection problems. A number of times in the past few years, we’ve found the Wi-fi and TV in the house to be spotty. We had got used to unplugging the power on the company-supplied modem and waiting the 3 or 4 minutes for it to reset. Often that seemed to improve things, or maybe the reset just made us feel it had done so – we don’t really know. We’ve called our supplier several times, and they came over, inspected our lines, checked our modem. In all cases, the problem repaired itself, if not immediately, then over the course of a few days.

    It didn’t get really bad too often. But it did about 2 months ago, just after my wife and I got back from a wonderful Caribbean cruise (which we followed up with 2 weeks of just-in-case self-isolation at home). I had to replace my computer, and very shortly after the new one was installed, we had several days of Internet/TV problems.

    I called my service provider (BellMTS) and I told them about the poor service we were having and they tried to help over the phone. We rebooted the modem several times but that wasn’t helping.

    image

    They sent a serviceman to check the wiring from our house to the distribution boxes on our block. We thought that might have helped and it was not long after that it seemed everything was pretty good.

    We had very few problems over the next 6 weeks, but just last night, I was in the middle of an Association of Professional Genealogists Zoom webinar (Mary Kircher Roddy – Bagging a Live One; Reverse Genealogy in Action), when suddenly I lost my Internet in my other windows and my family lost the Internet on their devices. Our TV was even glitching. However the Zoom webinar continued on uninterrupted. I could not at all figure this out.

    After the webinar ended, I called my Internet/TV provider and things seemed to improve. The next morning, the troubles reoccurred. I called my provider again. They sent a serviceman. He came into the house (respecting social distancing) and cut the cable at our box so they could test the wiring leading to our house. He was away for over an hour doing that. When he came back, they had set up some sort of new connectors. He reconnected us. But no, we still had the problem. He then found what he thought was a poorly wired cable at the back of the modem. He fixed that, but still the problem. Then he replaced our modem and the power supply and the cabling. Still the problem.

    We were monitoring the problem using speedtest.net. We’ve got what’s called the Fibe 25 plan**. We should be getting up to 25 Mbps (mega-bits per second) download and up to 3 Mbps upload. We were getting between 1 and 2 Mbps download and 1 Mbps upload. Not good.

    After several more attempted resets and diagnostic checks, we were now 3 hours into this service call. The serviceman’s next idea was the one that worked. He said turn off all devices connected to the Internet. Then turn them on one-by-one and we might find it is a device we have that’s causing the problem. We did so and when we got to my ethernet connected computer, it was the one slowing everything. The serviceman said there it is, found the reason. He couldn’t stay any more and left.

    I checked and sure enough, when my computer was on, we got almost no Internet, but when it was off, everything was fine. Here was the speed test with my computer off:

    image

    When I went to the network settings to see if it was a problem with my ethernet cable, I could see a large amount of Activity, with the Sent and Received values changing quite quickly:

    image

    My first thought was that maybe my computer was hacked. I opened Task Manager and sorted by the Network column to see what was causing all the Network traffic. There was my answer, in number 1 place consuming the vast majority of my network was: Microsoft OneDrive.

    My older daughter immediately commented that she had long ago stopped using the free 1 TB of OneDrive space we each get by being Microsoft 365 subscribers because she found it hogged all her resources.

    Eureka! 2 months ago what had I done? I had uploaded all my pictures and videos from our trip to OneDrive. And what was I doing while watching that Zoom webinar last night? I was uploading several folders of pictures and videos to OneDrive. And what hadn’t I been doing during the 6 weeks in between? Any significant uploads to OneDrive.

    In Task Manager, I ended the OneDrive task. Sure enough my download speed from speed test went back up to good numbers, and our Internet/TV problem had finally been isolated.

    It didn’t take me long to search the Internet to find that OneDrive had network settings. The default was (horrors) a couple of “Don’t limit” settings. The “Limit to” boxes, which were not selected, both had suggested defaults of 125 KB/s (kilobytes per second). I did some calculations and selected them and set the upload value to 100 KB/s and left the download value at 125 KB/s: 

    image

    Note that these are in KB/s whereas Speedtest gives Mbps. The former is thousands of bytes and the latter is millions of bits. There are 8 bits in a byte. So 125 KB/s = 1.0 Mbps, which is about 4% of my 25 Mbps download capacity and 100 KB/s = 0.8 Mbps which is less than 30% of my upload capacity. Now when OneDrive is synching, there should be plenty left for everyone else. Yes, OneDrive will take several times longer to upload now. But I and my family should no longer have it affecting our Internet and TV in a significant way any more.
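
    If you want to double-check that unit conversion against your own plan, it is just this (a tiny Python sketch of the arithmetic above):

        # KB/s (kilobytes per second) to Mbps (megabits per second): 8 bits per byte, 1000 kb per Mb
        def kBps_to_Mbps(kBps):
            return kBps * 8 / 1000

        print(kBps_to_Mbps(125) / 25)   # download limit: 1.0 Mbps = 4% of a 25 Mbps plan
        print(kBps_to_Mbps(100) / 3)    # upload limit: 0.8 Mbps = about 27% of a 3 Mbps plan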

    Also notice there’s an “Adjust automatically” setting. Maybe that is the one to choose, but unfortunately they don’t also have that setting on the Download rate, which is maybe more important.

    My wife and daughters have complained to me for a number of years claiming my computer was slowing the Internet. Up to now, I did not see how that could be. Yes, as it turns out, it was technically coming from my computer, but the culprit in fact was OneDrive’s poison setting. I am someone who turns off my desktop computer when I am not using it, and also every night when I don’t have it working on anything. No wonder our problems were spotty. When my computer was off, OneDrive could not take over. So my family was right all along.

    Well that’s now fixed. I will let my TV/Internet provider know about this so that they can save their time and their customers’ time when someone else has a similar intermittent internet problem which may be OneDrive. I will also let Microsoft know through their feedback form, and hopefully they will one day decide to either change their default network traffic settings to something that would not affect the capacity of most home Internet connections, or change the algorithm so that "unlimited" has a lower priority than all other network activity. Maybe that "adjust automatically" setting is the magic algorithm. If so, it could be the default, but it should also be added as an option on the Download rate, to eliminate OneDrive’s greediness.

    Are you listening Microsoft?

    And I’d recommend that anyone who uses OneDrive check whether you have no limit set in your OneDrive Network settings. If you do, change them and you might see the speed and reliability of your Internet improve dramatically.


    —-

    **Note:  The Fibe 25 plan is the maximum now available from BellMTS in our neighborhood. They are currently (and I mean currently since my front lawn is all marked up) installing fiber lines in our neighborhood that will allow much higher capacity. Once installed, I should have access to their faster plans, and will likely subscribe to their Fibe 500 plan for only $20 more per month. That will give up to 500 Mbps download (20x faster) and 500 Mbps upload (167x faster). They have even faster plans, but that should be enough because our wi-fi is 20 MHz which is only capable of 450 Mbps. My ethernet cable (which was hardwired in from the TV downstairs to my upstairs office when we built the house 34 years ago) is capable of 1.0 Gbps which is 1000 Mbps. Once we switch plans, I’ll likely give OneDrive higher limits (maybe 100 Mbps both ways) and it will be a new world for us at home on the Internet.