
Louis Kessler's Behold Blog

the Development of my Genealogy Program named Behold

When Everything Fails At Once…

Sunday, March 22, 2020, 23:05:05

Remember the words inscribed in large friendly letters on the cover of the book called The Hitchhiker’s Guide to the Galaxy:

DON’T PANIC

I returned 9 days ago from a two week vacation with my wife and some good friends on a cruise to the southern Caribbean. While away, we had a great time, but every day we heard more and more news of what was happening with the coronavirus back home and worldwide.

On the ship, extra precautions were being taken. Double the amount of cleaning was being done, and Purell sanitizer was offered to (and taken by) everyone when entering and leaving all public areas. The sanitizer had been a standard procedure on cruise ships for many years. I joked that this cruise would be one where I gained 20 pounds: 10 from food and 10 from Purell. Our cruise completed normally and we had a terrific time. There was no indication that anyone at all had got sick on our cruise.

We flew home from Fort Lauderdale to Toronto to Winnipeg. Surprisingly to us, the airports were full of people as were our flights. None of the airport employees asked us anything related to the coronavirus and gave no indication that there was even a problem. I don’t think we saw 2 dozen people with masks on out of the thousands we saw.

After a cab ride home at midnight, our daughters filled us in on what was happening everywhere. Since we were coming from an international location, my wife and I began our at least two-week period of self-isolation to ensure that we are not the ones to pass the virus on to everyone else. We both feel completely fine but that does not matter. Better safe than sorry.


Failure Number 1 – My Phone

On the second day of the cruise, I just happened to have my smartphone in the pocket of my bathing suit as I stepped into the ship’s pool. I realized after less than two seconds and immediately jumped out. I turned on the phone and it was water stained but worked. I shook it out as best as I could and left it to dry.

I thought I had got off lucky. I was able to use my phone for the rest of the day. All the data and photos were there. It still took pictures. The screen was water stained but that wasn’t so bad. But then that night, when I plugged it in to recharge, it wouldn’t. The battery had kicked the bucket. Once the battery completely ran out, the phone would work only when plugged in.

Don’t panic!

I had been planning to use my phone to take all my vacation pictures. Obviously that wouldn’t be possible now. I went down to the ship’s photo gallery. They had some cameras for sale but I was so lucky that they had one last one left of the inexpensive variety. I bought the display model of a Nikon Coolpix W100 for $140 plus $45 for a 64 GB SD card. I took over 1000 photos of our vacation over the remainder of our cruise, including some terrific underwater photos since the camera is waterproof.

Before the cruise was over, my phone decided to get into a mode where it wouldn’t start up until I did a data backup to either an SD card (which the phone didn’t support) or a USB drive which I didn’t have with me.

Somehow, with some fiddling, the phone then decided it needed to download an updated operating system so I wrongly let it do that. Bad move! It was obvious that action failed as then the phone would no longer get past the logo screen. 

At home, Saturday at 11 pm, I ordered a new phone for $340 from Amazon. It arrived at my house on Monday afternoon and I’m back in action. The only thing on my old phone was about a month of pictures, including the first 3 days of our vacation. If it’s not too expensive, I might try to see if a data recovery company can retrieve the pictures for me. If not, oh well.


Failure Number 2 – My Desktop Computer

I had left my computer running while I was gone. I was hoping for it to do a de novo assembly of my genome from my long read WGS (Whole Genome Sequencing) test.  I had tried this a few months ago, running on Ubuntu under Windows. When I first tried, it had run for 4 days but when I realized it was going to take several days longer I canned it. Knowing I was going to be away for 14 days was the perfect opportunity to let it run. I started it up the day before I left and it was still running fine the next morning when I headed to the airport.

When I got back, I was faced with the blue screen of death. Obviously something happened. “Boot Device Not Found”.

Don’t panic!

I went into the BIOS and it sees my D drive with all my data, but not my C drive. My C drive is a 256 GB SSD (Solid State Drive) which includes the Windows Operating System as well as all my software. My data was all on my D drive (big sigh of relief!) but I also have an up-to-date backup on my network drive from my use of Windows File History running constantly in the background. So I wasn’t worried at all about my data. Programs can be reinstalled. Data without backups are lost forever.

I spent the rest of Saturday seeing if I could get that C drive recognized. No luck. My conclusion is that my SSD simply failed, which can happen. I had a great computer but it was about 8 years old. The SSD was a separate purchase that I installed when I bought the computer, to speed up startup and all operations and programs. My computer was as dead as a doorknob.

Saturday night, along with the phone I purchased at Amazon, I also purchased a new desktop at Amazon. Might as well get a slight upgrade while I’m at it.  From my current HP Envy 700-209, a 4-core 4th generation i7 with 12 GB RAM, 256 GB SSD and 2 TB hard drive, I decided on a refurbished/renewed HP Z420 Xeon Workstation with 32 GB RAM, 512 GB SSD and a 2 TB hard drive. It comes with 64-bit Windows 10 installed on the SSD drive. I’ve always had excellent luck with refurbished computers. The supplying company makes doubly sure that they are working well before you get them and the price savings are significant.

On Tuesday, the computer was shipped from Austin Texas to Nashville Tennessee. It went through Canada customs Thursday morning arriving here in Winnipeg at 9 a.m. and at my house just before noon.

First step, hook it up, and a problem: my monitors have different cables than its video card needs. I ordered the less expensive video card with it, an NVIDIA Quadro K600. It did not come with the cables. I’m not a gamer so I don’t need a high-powered card. I made sure it could handle two monitors but I didn’t think about the cables. As it turns out, comparing it with my old NVIDIA GeForce GTX 645 card, I see my old card is the better card. So first step, switch my old card into my new computer.

Now start it up, update the video driver, and get all the Windows updates. (The latter took about half a dozen checks for updates and 3 hours of time.)

Next turn it off and remove my 2 TB drive from my old computer to an empty slot in my new computer and connect it up. That will give me a D drive and an E drive, each with 2 TB which should last me for a while.

That was good enough for Thursday. Friday and Saturday, I spent configuring Windows the way I like it and updating all my software, including:

  1. Set myself up as the user with my Microsoft account.
  2. Change my user files to point to where they are on my old D drive.
  3. Set my new E drive to be my OneDrive files and my workplace for analysis of my huge (100 GB plus) genome data files.
  4. Reinstall the Microsoft Office suite from my Office 365 subscription.
  5. Set my system short dates and long dates the way I like them:
    2020-03-22 and Sun Mar 22, 2020
  6. Set up my mail with Outlook. Connect it to my previous .pst file (15 GB) containing all my important sent and received emails back to 2002.
  7. Reinstall and set up MailWasher Pro to pre-scan my mail for spam.
  8. Reinstall Diskeeper. If you don’t use this program, I highly recommend it. It defragments your drives in the background, speeds up your computer and reduces the chance of crashes. Here’s my stats for the past two days:
  9. Reindex all my files and email messages with Windows indexer:
  10. Change my screen and sleep settings to “never” turn off.
  11. Get my printer and scanner working and reinstall scanner software.
  12. Reinstall Snagit, the screen capture program I use.
  13. Reinstall UltraEdit, the text editor I use.
  14. Reinstall BeyondCompare, the file comparison utility I use. I also use it for FTPing any changes I make to my websites to my webhost Netfirms.
  15. Reinstall TopStyle 5, the program I use for editing my websites. (Sadly no longer supported, but it still works fine for me)
  16. Reinstall IIS (Internet Information Services) and PHP/MySQL on my computer so that I can test my website changes locally.
  17. Reinstall Chrome and Firefox so that I can test my sites in other browsers.
  18. Delete all games that came with Windows.
  19. File Explorer: Change settings to always show file extensions. For 20 years, Windows has had this default wrong.
  20. Set up Your Phone, so I can easily transfer info to my desktop.
  21. Set up File History to continuously back up my files in the background, so if this ever happens again, I’ll still be able to recover.
    (and occasionally it saves me when I need to get a previous copy of a file)
  22. Reinstall Family Tree Builder so I can continue working on my local copy of my MyHeritage family tree. I hope Behold will one day replace FTB as the program I use once I add editing and if MyHeritage allows me to connect to their database. I also have a host of other genealogy software programs that I’ve purchased so that I can evaluate how they work. I’ll reinstall them when I have a need for them again. These include: RootsMagic, Family Tree Maker, Legacy, PAF and many others.
  23. My final goal for the rest of today and tomorrow is to reinstall my Delphi development environment so that I can get back to work on Behold. This includes installation of three 3rd party packages and is not the easiest procedure in the world. Also Dr. Explain for creating my help files and Inno Setup for creating installation programs. I’ll also have to make sure my Code Signing certificate was not on my C drive. If so, I’ll have to reinstall it.
  24. Any other programs I had purchased, I’ll install as I find I need them, e.g. Xenu which I use as a link checker, or PDF-XChange Editor which I use for editing or creating PDF files, or Power Director for editing videos. I’ll reinstall the Windows Subsystem for Linux and Ubuntu when I get back to analyzing my genome.
  25. One program I’m going to stop using and not reinstall is Windows Photo Gallery. Microsoft stopped supporting it a few years ago, but it was the most fantastic program for identifying and tagging faces in photos. I know the replacement, Microsoft Photos, does not have the face identification, but hopefully it will be good enough for all else that I need. Maybe I’ll eventually have to add that functionality to Behold, if I can first get through the myriad of other things I have to do with it.
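
For anyone who wants the same two date styles from item 5 inside their own scripts rather than in the Windows settings, here is a minimal Python sketch. The function names and the name tables are my own illustration, not anything Windows provides, and the English abbreviations are spelled out so the result does not depend on the system locale.

```python
from datetime import date

# English abbreviations written out explicitly (an assumption of this sketch),
# so the output is the same whatever locale the machine is set to.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def short_date(d: date) -> str:
    """Return the year-month-day style, e.g. 2020-03-22."""
    return d.isoformat()


def long_date(d: date) -> str:
    """Return the 'weekday month day, year' style, e.g. Sun Mar 22, 2020."""
    # date.weekday() counts Monday as 0 and Sunday as 6.
    return f"{DAY_NAMES[d.weekday()]} {MONTH_NAMES[d.month - 1]} {d.day}, {d.year}"
```

Calling `short_date(date(2020, 3, 22))` and `long_date(date(2020, 3, 22))` gives the two strings shown in item 5.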

Every computer needs a good enema from time to time. You don’t like it to be forced on you, but like cleaning up your files or your entire office or your whole residence, you’ll be better off for it.

How would you cope if both your phone and computer failed at the same time?

Just don’t panic!

Computers 23 years ago

Wednesday, February 26, 2020, 4:48:02

#Delphi25 #Delphi25th – I came across an email I sent to a friend of mine on February 6, 1997 (at 1:17 AM). I’ll just give it here without commentary, but it should amuse and bring back recollections of people who were early PC users.

You should find this message to be a little different. I am sending it using Microsoft Mail & News through my Concentric Network connection, rather than using my Blue Wave mail reader through my Muddy Waters connection. This gets around my problem of not being able to attach files, as you had tried for me. In future E-mails, I can attach pictures for you. I presume you can read GIFs, or would you prefer JPG or TIF?

I will still be keeping my MWCS account until the end of 1997, but I am switching over more and more to my Concentric account. I am still not entirely happy with Windows-based Newsreaders yet, and find Blue Wave much more convenient for reading newsgroups. Hopefully, by the end of the year I will have this sorted out.

I bit the bullet, and switched over to Windows 95 at home. I first had to upgrade my machine. I bought 16 MB more memory (to give me 24 MB) for $99 at Supervalue (of all places!) and bought a 2 GB hard drive for $360 (also at Supervalue!) less a $30 US mail-in rebate on the Hard Drive and a $30 sweatshirt thrown in due to a Supervalue coupon when over $200 is spent. My 260 MB drive that I bought 3 1/2 years ago already had Stacker on it to make it 600 MB, and I only had 80 MB free. I wanted to get rid of Stacker before going to Windows 95.

It only took me 3 1/2 hours to install the RAM and the Hard Drive myself at home! It wasn’t without problems, but the operation was a success. I had hooked up my old and new Drives as master and slave and everything worked. The next night, I took another 3 1/2 hours to transfer everything from my old drive to my new one, removing the old drive, and getting the system working from the new drive - again not without problems, but completed that evening. I am very proud of myself! The next evening, it took about an hour to get Windows 95 installed, and to customize it to the way I liked.

This hardware upgrade should be good for another couple of years. I only have the power supply, base, keyboard, mouse, and monitor as original parts. All the rest has been since upgraded.

Windows 95 - Well I actually like 90% of it better than Windows 3.1, and am only finicky about 10% of it. I know, I know, buy a Mac you will say. Well I hope you are prepared to buy a new operating system every six months like Jobs says you’ll have to. I still agree Macs are a good system, but there is much more software available for PCs, Macs are 40% more expensive, and they still use that horrible character font that they used in the early 80’s - yecch!

In the meantime, I have kept myself very, very, very, very, very, very, very, very busy. I have been working hard on many different fronts, after work playing hard with the kids until their bedtimes (usually closer to 10 p.m. than to 8), most often working on the Computer from 10 to 11 to 12 to (yikes) 1 or 2 sometimes - Got my web pages up (http://www.concentric.net/~Ikessler); have responded to about 50 e-mail messages and inquiries about it; designed a tender proposal for the photographic work for our Cemetery Photography Project (http://www.concentric.net/~Ikessler/cemphoto.shtml); and I’ve started learning how to use Borland Delphi to develop my BEHOLD program (http://www.concentric.net/Ikessler/behold.shtml)

Whew! I’m getting tired just thinking about all this!

Take care.  Louis

25 Years of Delphi

Friday, February 14, 2020, 9:52:01

The Delphi programming language is having its #Delphi25 #Delphi25th birthday on Friday Feb 14, 2020. I’ve been using Delphi for about 23 years since 1997 when Delphi 2 was released.

Delphi is an amazing language. I use it now for Behold and Double Match Triangulator, and I’ve made use of it for a number of personal projects along the way as well.

It’s appropriate on this day that I write about Delphi and how I use it and what I like about it.


Pre-Delphi

I should provide a bit of background to give context to my adoption of Delphi as my programming language of choice.

As I entered high school (grade 10), my super-smart friend and neighbor Orest who lived two doors over and was two grades ahead of me recommended I follow his lead and get into programming at school. The high schools in Winnipeg at that time (1971) had access to a Control Data Corporation mainframe, and provided access to it from each school via a cardreader and a printer. You would feed your computer cards into the cardreader. In the room was one (maybe two) keypunches, likely KP-26 or maybe KP-29.

The computer language Orest used at the time and the school was teaching was FORTRAN, a Waterloo University version called FORTRAN IV with WATFOR and WATFIV. What an amazing thing. You type up a sequence of instructions on computer cards, feed them through the card reader, and a few minutes later your results are printed on classic fanfold computer output.

For three years of high school, my best friend Carl and I spent a lot of time in that small computer room together. I remember a few of the programs I wrote.

  1. There was the hockey simulation following the rules of a hockey game we invented using cards from a Uno Card Game. We simulated a full World Hockey Association season of the 12 teams each playing 78 games giving each team a different strategy. 11 of my friends would each have a team and look for the daily results and standings.
  2. For a special school event, my friend Carl and I wrote a dating program. We got everyone in school (about 300) and all the teachers (about 30) to fill out a multiple choice survey of about 10 questions about themselves, and the same questions for what they wanted in a date. During our school event, people would come to the computer room, and Carl and I would run them against the database and give them their top 5 dates with hilarious results.
  3. I played touch football with a number of friends once or twice a week during the summer. I recorded all the stats on the back of a computer card in between plays, then punched the results onto computer cards and wrote a program that would total all the passing stats, receiving stats, interceptions and fumble recoveries by player, giving the leaders and record holders in each category. Everyone loved seeing the stats and played harder and better because of it.
  4. I wrote a program to play chess. Carl wrote one as well. We had a championship match – chess program vs chess program that got us in our city’s newspaper.
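
The original FORTRAN for the dating program is long gone, so the following is only a guess at the kind of matching it did, sketched in Python. It assumes each person's survey is stored as two answer strings, one describing themselves and one describing what they want, and it scores a pair by counting agreements in both directions. The data layout, the scoring rule and the function names are all hypothetical.

```python
def match_score(wants, self_answers):
    """Count the questions where someone's own answer equals what the seeker wants."""
    return sum(1 for w, s in zip(wants, self_answers) if w == s)


def top_dates(seeker, people, n=5):
    """Return the n best-matching names for `seeker`.

    `people` maps a name to {"self": answers, "wants": answers}, where each
    set of answers is a string with one multiple-choice letter per question.
    The score is mutual: how well they fit the seeker plus how well the
    seeker fits them. Ties are broken alphabetically.
    """
    me = people[seeker]
    scored = []
    for name, other in people.items():
        if name == seeker:
            continue
        score = (match_score(me["wants"], other["self"])
                 + match_score(other["wants"], me["self"]))
        scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:n]]
```

With about 330 people and 10 questions, comparing one person against everyone else is only a few thousand comparisons, which even a 1970s mainframe would have handled while the person waited.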

At University, I took statistics but also included many computer science courses. While there, I continued work on my chess program in my spare time and the University of Manitoba sponsored me as a contestant in the North American Computer Chess Championships in Seattle, Washington in 1977 and in Washington, D.C. in 1978. Games were played over modems that connected dumb terminals to the mainframes back at our Universities. Read all about my computer chess exploits here: http://www.lkessler.com/brutefor.shtml

After getting my degree in statistics, I went for my Masters in Computer Science. Now we finally no longer needed computer cards, but had terminals we could enter our data on. There was a Script language for developing text documents, and I used it to build my family tree, with hierarchical numbering, table of contents and an index of names. It printed out to several hundred pages on fanfold paper. I still have that somewhere.

I started working full time at Manitoba Hydro as a programmer/analyst, rewriting and making enhancements to programs for Tower Analysis (building electric transmission towers) and Tower Spotting (optimizing the placing of the towers). These were huge FORTRAN programs containing tens of thousands of lines of what we called spaghetti code.

Then I was part of a 3 year project to develop MOSES (Model for Screening Expansion Scenarios) which we wrote in the language PL/I. That was followed by another 3 year project from 1986 to 1988 where our team wrote HERMES (Hydro Electric Reservoir Management Evaluation System) which we also said stood for Having Empty Reservoirs Makes Engineers Sad. I learned that one of the most important parts of any program is coming up with a good name for it. I also learned how to three-ball juggle.

The HERMES program was written in Pascal. That was a new language for me but I learned it quite thoroughly over the course of the project. I believe I purchased my first personal computer, an IBM 386 20 MHz, for home sometime around 1993. When I did, FORTRAN was still available but very expensive. So instead I purchased Borland’s Turbo Pascal. I started programming what would one day become my genealogy program Behold.


My Start with Delphi and Evolution Thereof

I like to joke that I’m not an early adopter and that’s why I didn’t buy into Delphi when it came out in 1995, but did buy Delphi 2 in 1997. Delphi was basically still the Pascal language. But what Delphi added over Turbo Pascal was primarily two things: the addition of Object-Oriented Programming (OOP), and an Integrated Development Environment (IDE). Those were enough that I had to go “back to school” so to speak, and I loaded up on getting my hands on any Delphi books that I could. They’re still on my shelf now.

I purchased Delphi 2 on May 14, 1997 for $188.04 plus $15 shipping & handling.

I didn’t upgrade every year. It was expensive. I only upgraded when I felt there was some important improvement or new feature I needed.

I upgraded to Delphi 4 in June 1998 for $249.95 plus $15 s/h. At this time, Borland had changed its name to Inprise. By 2001, they abandoned that name and went back to Borland.

I was able to use Delphi 4 for quite some time. Finally there was a feature I absolutely needed and that was Unicode which came in Delphi 2009.  I was allowed to upgrade my version of Delphi 4 and I did that and upgraded to Delphi 2009 in Sept 2008 for $374.

Embarcadero purchased Delphi from Borland in 2008. In 2011, I upgraded to Delphi XE2 for $399 which included a free upgrade to Delphi XE3.

I upgraded to Delphi XE7 in 2015 for $592. And I upgraded to Delphi 10.1 in 2016 for $824.40.

The upgrades were starting to get expensive so in 2017 I started subscribing to Delphi maintenance for $337 per year.


Third Party Packages

Delphi includes a lot of what you want, but not everything. I needed a few packages from third parties who built components for Delphi. For Behold I used two:

TRichView by Sergey Tkachenko. TRichView is a WYSIWYG (What You See Is What You Get) full featured document editor that forms the main viewing (soon to be editing/viewing) window of my program Behold that I call “The Everything Report”. Behold is listed among the many Applications that have been made with TRichView.

I purchased TRichView in 2000 when it was still version 1.3. Now it’s 7.2. Back then the cost was $35, and it was a lifetime license that Sergey grandfathered in for his early customers. He has continued to develop the program and has not charged me another nickel for any upgrades. I did, however, pay $264 to Sergey in 2004 for some custom code he developed that I needed. I liked that lifetime license policy so much that it inspired me to do the same for my Behold and Double Match Triangulator customers, who all get free upgrades for life when they purchase a license. Sergey no longer offers lifetime licenses. His current price for TRichView is $330, but he also offers other products that work with it. That’s at least 20 years of Delphi development for Sergey.

LMD Innovative’s ElPack is the other package I use for Behold. This is a package of over 200 components that extend the functionality of the VCL controls that Delphi has. The main purpose I purchased this was for their ElXTree which allows custom creation of TreeViews and grids:

I first purchased ElPack in 2000 from the company EldoS (Eugene Mayevski) who originally developed it.  The cost was $68. About 6 months after I purchased it, I noticed a free product available called Virtual Treeview written by Mike Lischke, but I was already using and happy with ElPack so I continued to use it. I considered switching to Virtual Treeview several years later, but my use of ElPack was already so deeply embedded into Behold, that it wasn’t worth the effort.

I did have to pay for upgrades to ElPack, so I upgraded only when there was a reason to. Usually it was because I got a new version of Delphi and the old version wouldn’t install. Also, my third party packages were also a reason I didn’t upgrade Delphi so often, because I couldn’t really upgrade until both TRichView and ElPack had versions that worked with the new version of Delphi, which could take up to a year after the Delphi version release.

In 2003, LMD Innovative acquired ElPack from EldoS and continued developing it. LMD’s current price for ElPack is $190. They have a partnership with TRichView and give me 20% off for being a TRichView customer. I tend to upgrade ElPack every two years or so.

TMS Software’s FlexCel Studio was a package I purchased for Double Match Triangulator (DMT) to provide native Excel report and file generation, without requiring use of Excel automation and not even requiring Excel on your computer. I use it to produce the Excel files that DMT puts its results into. The capabilities of this component actually amaze me. It can do anything you can think of doing in Excel and more.

I first purchased FlexCel in August 2017 for $157.


Additional Tools I Used to Work With Delphi

Developing programs with Delphi requires additional tools from time to time. Here’s some of the tools that were useful in my Delphi Development:

In 2009, I purchased for $129 a program called EurekaLog, which integrated with Delphi and helped find and locate memory leaks in my program Behold. The program helped me determine how my code was causing leaks. After a few years, with all the leaks eradicated and better programming habits to avoid future ones, I really didn’t have a great need to keep using the program.

In 2010 when I was tuning Behold for efficiency, I purchased a line by line profiler from Automated QA called AQTime that worked by itself or with Delphi. This was a very expensive program at $599, but I was able to speed up Behold 100 times by finding inefficient algorithms and sections of code that I could improve, so it was worth the price. The program has since been acquired by SmartBear and still sells for $599. I no longer have a version of this program that works with the latest version of Delphi. Delphi does provide a lite version of AQTime for free, but that does not include its fantastic line-by-line profiler. I’m no longer in need of super-optimizing my low-level code because that rarely changes. When I need to ensure a section of code is not too slow, I now put timers around the section and that often tells me what I need to know.
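
The "timers around the section" idea is simple enough to show. My own code is in Delphi, so treat this Python version as an illustrative sketch only: `timed` is a made-up helper name, and it just records how many seconds the wrapped block took.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so slow failures still show up.
        results[label] = time.perf_counter() - start


# Usage: wrap only the section you suspect of being slow.
timings = {}
with timed("build index", timings):
    index = sorted(str(n) for n in range(100_000))
```

Afterwards `timings["build index"]` holds the elapsed time. It is far cruder than a line-by-line profiler, but it is usually enough to tell whether a section is worth a closer look.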

Dr. Explain is the program I chose for writing the help files for my programs. I first purchased it in 2007 for $182, upgraded in 2014 for $100. The current price of an advanced license is $390.

And my installation program of choice for Behold and DMT is the free Inno Setup Compiler from jrsoftware. I purchase Comodo Code Signing certificates for about $70 a year.


Personal Uses of Delphi

Other than the two programs Behold and DMT that I am developing and selling licenses for, I also have used Delphi over the years to build some programs for my own use. These include:

  • A database search program I built for my local Heritage Centre so they could easily query their Microsoft Access database which had listings and details of over 60,000 items. Originally written in Turbo Pascal and later converted to Delphi. (1996)
  • A program to build some of my link web pages for me such as my Computer Chess Links page. (1997)
  • A program to screen stocks for me to find stocks that I was interested in purchasing. (1997)
  • A program to run through all possible picks and determine what selections my competitors picked in our local newspaper’s hockey, football and stock market contests (1997). (Aside:  I have won more than $20,000 in such contests using this type of analysis to help me gain an advantage.)
  • A page counter for early versions of my websites. (2001)
  • A program to help win at the puzzle called Destructo, where you’re trying to break through a wall. (2001)
  • A program that produces the RSS feeds for this blog on this website (2004).
  • A program to analyze the log files from my websites, especially to find pages that link into my sites.(2005)
  • A program to help play sudoku. (2005)
  • A program to download stock market data and do technical analysis for me. (2008)
  • A program to analyze 100 GB raw data files from whole genome DNA tests. (2019)
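
As one example from the list above, the log file analyzer that finds pages linking into my sites boils down to reading the referrer field of each request. The sketch below is in Python rather than Delphi and is not my actual program. It assumes the logs are in the common "combined" format, where the quoted request is followed by the status, the size, the quoted referrer and the quoted user agent, and the function name and `own_hosts` parameter are invented for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the tail of a combined-format log line:
# "request" status size "referrer" "user agent"
LOG_TAIL = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def external_referrers(lines, own_hosts):
    """Count referring URLs whose host is not one of our own sites."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        referer = match.group("referer")
        host = urlparse(referer).netloc.lower()
        # A missing referrer is logged as "-", which has no host part.
        if not host or host in own_hosts:
            continue
        counts[referer] += 1
    return counts
```

Calling `external_referrers(open_log_lines, {"www.lkessler.com"}).most_common(20)` would then list the twenty outside pages that send the most visitors.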

One thing I never have done is resurrect my chess program. For a while I considered it, but I knew it would be a lot of work and I didn’t want to take my time away from my genealogy software. In the past couple of years, deep learning and AlphaZero have made all other programs irrelevant.


What’s Next with Delphi

I am very pleased that Embarcadero has continued to support and improve Delphi and that my third-party packages continue to roll. Hopefully that will continue for the foreseeable future.

The stability I’ve had over the past 23 years being able to use Delphi has been fantastic. The development environment is great. I love how fast it compiles, how fast the code runs, and how easy it is to debug.

Here’s 24 Years of Delphi and 25 Years of Excellence and here’s Going Forward.

On my speaker topics page, I like saying that “Louis is fluent in five languages: English, Delphi/Pascal, HTML, GEDCOM and DNA.”

Well now I better post this page and get to bed, because I have to be up in 9 hours for the Happy Birthday Delphi 25 celebrations.

GEDCOM Assessment

Sunday, February 9, 2020, 4:07:12

I’ve been working hard to get Behold 1.3 completed. It will primarily be a newer iteration of Behold’s Everything Report. Once that is released, I’ll start my effort to add GEDCOM export followed by editing.

I’ve designed Behold to be a comprehensive and flexible GEDCOM reader that understands and presents to you all the data contained in GEDCOM of any flavour, from 1.0 to 5.5.1 with developer extensions and user-defined tags. So when John Cardinal came up with his GEDCOM Assessment site, that was an opportunity I couldn’t resist.

“assess.ged is a special GEDCOM file which you may use to perform a review of the GEDCOM 5.5.1 import capability of any program that reads a GEDCOM file and imports the contents”

John is a long-time user of The Master Genealogist program written by Bob Velke. John is also a programmer and wrote programs to work with TMG including Second Site for TMG, TMG Utility and On This Day.

After TMG was retired in 2014, John wanted to help people get all their data out of TMG allowing them to transfer to other programs so he wrote the TMG to GEDCOM program. He also wrote a program that creates an e-book from a GEDCOM file called Gedcom Publisher. And John then wrote a program to create a website from any generic GEDCOM file and called that program GedSite.

In the process of all this, John gained an expertise in working with GEDCOM and has made tests for GEDCOM compatibility that he invites all genealogy software authors to try.

So try it I shall. 

I followed John’s “process” and downloaded version 1.03 of the assess.ged file as well as the image files it references, and placed the latter in a C:\GedcomAssessment folder. Then I loaded assess.ged into Behold 1.2.4 and used his website’s Data Entry page to capture the results. This really is a beautifully set up assessment system. My compliments to John Cardinal.


A Few Things To Fix for Version 1.3

There were a number of tests that illustrated some aspects of GEDCOM that Behold does not fully support. I’ve made a list of them here:

  1. Behold by default uses the first person in the file and treats that person (and their spouse(s)) as the family the report is about. (You can of course pick anyone you want instead of or in addition to the first person). The assess.ged file does not link the 185 people in the file to each other, except for two who are connected as spouses. Behold was not using the first person in the file as a singular family but instead had the first section blank and listed all the people, including that first person, in its “Everyone Else” section. This should be a simple fix.
  2. I was surprised to see Behold display:  1 FACT Woodworking 2 TYPE Skills as Woodworking Skills rather than Skills: Woodworking. That’s a bug because I intended it the latter way. Same for 1 IDNO 43-456-1899 2 TYPE Canadian Health Registration which was being displayed as 43-456-1899 Canadian Health Registration rather than Canadian Health Registration: 43-456-1899.
  3. Behold somehow was ignoring and not displaying the TIME value on the change date of a record.
  4. The CONC tag to concatenate two lines is specified by GEDCOM to require the last word in the first line be split so that its second half begins the second line. Behold does this, but in doing so, Behold trims the lines before concatenating. As a result, if a GEDCOM used a non-standard method of including a leading space on the second line or a trailing space on the first line, then it is ignored and the word at the end of the first line and the beginning of the second line would be joined with no intervening space. I haven’t noticed programs using this non-standard format, but even so, I’ll think about it and maybe I’ll remove Behold’s trimming of concatenated lines in version 1.3.
  5. Behold displays: “Birth, date”.  But it should display “Birth: date”. Same for other events such as “Adoption, date” or “Baptism, date”. How did that ever happen?
  6. Behold currently displays the user-defined tag _PRIM as “Primary: Y” after a date, but retains the first-entered date as primary and does not use this tag to make that date primary. I’ll think about honoring the _PRIM tag in version 1.3.
  7. The non-standard shared event tag, e.g. 1 CENS 2 _SHAR @I124@ is not being displayed correctly by Behold. This will be fixed.
  8. Behold does not convert @@ in notes or text values to @, as it should. Technically all line items should be checked for @@ and changed as well so that includes names.
  9. Hyperlinks to objects unfortunately do not open the file because Behold added a period to the end of it. This is a bug that I noticed a few weeks ago and has already been fixed for the upcoming version of Behold under development.
  10. Alias tags (ALIA) whose value is the name of the person rather than a link are not valid according to the GEDCOM standard, but this may be something I want to support if I see that some programs export it into GEDCOMs.
  11. I’m not displaying the tags under a non-standard top level 0 _PLAC structure correctly. This includes 1 MAP, 2 LATI, 2 LONG and 1 NOTE tags under the 0 _PLAC record.
  12. Non-standard place links such as: _PLAC @P142@ that link to the 0 _PLAC records should have been working in Behold, but the display of these links needs to be improved.
  13. If a person’s primary name has a name type, then it should be repeated with the type on the next line, e.g.
       Birth name:  Forename(s) Surname
    Also additional names should be called “Additional name” rather than just “Name”.
  14. Names with a comma suffix should not be displayed with a space between the surname and the comma. I’ve actually never seen this in the wild.
    e.g. /Smith/, Jr should be displayed as Smith, Jr and not Smith , Jr
  15. Notes on places are repeated and shouldn’t be.  Dates should be shown following any notes or other subordinate info for the place.
  16. Addresses could be formatted better.
  17. EVEN and ROLE tags on a source citation should have their tag text looked up and displayed instead of just displaying the tag name.
  18. The OBJE link was not included in source citation when it should have been.

So that was a really good exercise. Most of these are minor, but a lot more issues came up than I expected. Over the next few days, I’ll resolve each of these in the development version of Behold which soon is to become version 1.3.
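
Items 4 and 8 in the list above can be illustrated with a minimal Python sketch. This is not Behold's actual code; the function names `join_value` and `unescape_at` are hypothetical and the sample values are made up. It shows joining CONC and CONT sub-lines without trimming, so that a non-standard leading or trailing space survives, and converting a doubled @@ back to a single @.

```python
def join_value(first, continuations):
    """Join a GEDCOM line value with its CONC/CONT sub-lines.

    `continuations` is a list of (tag, text) pairs in file order.
    Nothing is trimmed, so leading and trailing spaces are preserved.
    """
    parts = [first]
    for tag, text in continuations:
        if tag == "CONC":
            parts.append(text)           # continue the same line, no separator
        elif tag == "CONT":
            parts.append("\n" + text)    # start a new line
        else:
            raise ValueError(f"unexpected tag: {tag}")
    return "".join(parts)


def unescape_at(value):
    """Convert the doubled @@ used in GEDCOM line values to a single @."""
    return value.replace("@@", "@")


# Standard split in the middle of a word:
note = join_value("This is a lo", [("CONC", "ng note"), ("CONT", "Second line")])
# Non-standard split with a leading space on the second line:
spaced = join_value("first", [("CONC", " second")])
email = unescape_at("someone@@example.com")
```

With trimming removed, the second example keeps its space and yields "first second" rather than "firstsecond".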

  
Results and Comparison

John presents a Comparison Chart that currently compares the results for 15 programs. There are 192 tests. Here’s my summary of John’s Comparison.

image

I’ve added Behold’s result in my chart. I’ve also excluded John’s program GedSite in summarizing the other programs, because his results are for a program that has already been tuned to handle these tests. So GedSite’s numbers are a good example of the results that I and other developers should try to attain with our programs.

Behold didn’t do too badly, with 161 supported constructs out of 192. Best was GedSite’s 185, followed by Genealogie Online’s 179, then My Family Tree’s 169 and then Behold’s 161. Genealogie Online is the baby of Bob Coret, who is another GEDCOM expert, and My Family Tree is by Andrew Hoyle of Chronoplex Software, who also makes GEDCOM Validator, so you would expect both of them to do well with regard to GEDCOM compliance.

I’ve emailed the JSON text file of Behold’s results to John. Hopefully he’ll add Behold to his comparison chart.


Comments About John’s Test File and Data Entry Page

  1. The assess.ged file version 1.03 includes a 1 SEX M line in each of the test cases. I’m not sure why. SEX is not a required tag in an INDI record. For a test file, it would be simpler to just leave the SEX lines out.
  2. I disagree with the constructs of two of the Master Place Link by XREF tests. They include within one event both a standard text place link and the non-standard place xref link, i.e.: 
       1 CENS
       2 PLAC New York
       2 _PLAC @P158@
    The trouble I have with this is that GEDCOM only allows one place reference per event. By using this alternative tag, you’ve effectively got two which is illegal if they were both PLAC tags. And what if they are not the same? John should take out the 2 PLAC New York line from his NAME 02-Link by XREF tests where he has the 2 _PLAC tag so that there is only one place reference. Any programs allowing both PLAC and _PLAC tags on the same event should cease and desist from doing this. The second test where the 3 _PLAC tag is under the 2 PLAC tag is an even more horrible construct that no one should support.
  3. The GEDCOM Assessment Data Entry Page does not completely function in all browsers. When using my preferred browser Microsoft Edge, entering “Supported (w/comment)” did not bring up the box to enter the comment. I tried Internet Explorer and the page did not function at all. I had to switch to Google Chrome (or Firefox) to complete the data entry.


Conclusion

What this little exercise does show is how hard it is to get all the little nuances of GEDCOM programmed correctly and as intended. This assessment took the better part of a day to do, but I think it was well worth the time and effort.

And what’s really nice about having a file with test cases is that they provide simple examples that illustrate issues that can be fixed or improved.

I hope all other genealogy software authors follow my lead and test their programs with GEDCOM Assessment’s assess.ged file. Then it’s a matter of using this analysis to help make their programs more compatible with the standard and thus do their part to help improve genealogical data transfer for everyone.




Update  Feb 10:  John reviewed the assessment with me. A few results changed status and I’ve updated the table above. John mentioned that his creation of GedSite wasn’t a conversion of Second Site for TMG, but was a completely new program.

Behold version 1.2.4’s final assessment is now available here:
https://www.gedcomassessment.com/en/assessment-behold.htm

Once I complete version 1.3, I’ll likely submit it again for a new assessment.

So Much Fun!

Saturday, January 25, 2020, 22:37:09

The Family History Fanatics’ @FHFanatics online Winter DNA eConference has just finished and it was so much fun. Andy and Devon sure know how to put on a good show.

This was the first time I’ve presented at an online conference.  I was able to do this comfortably from my office at home and my family was really good and went out of their way not to bother me for 6 hours.

Devon Noel Lee, Jonny Perl and Paul Woodbury also presented and it was so great getting to interact with them. It was the next best thing to being at a physical conference, without the need to spend time in airports and hotels.

In addition to all the great genetic genealogy methodologies presented, I also learned that I know nothing about what genes are on what chromosomes, and that audiences love to suggest random numbers.

We ended with a question/answer period that turned into an entertaining roundtable discussion.

image

Now to get back to work and figure out how Jonny and I are connected through the 16 cM segment on chromosome 7 that we share.

Double Match Triangulator, Version 3.1.2

Wednesday, January 22, 2020, 5:16:01

I released a small upgrade to Double Match Triangulator today. It includes a few fixes to minor problems and a couple of improvements:

  • Fixed the display of Base A-B segments in a combined run. They were always showing as single matches, but when they triangulate, they should show as Full Triangulation.
  • Improved the handling of the horizon effect by restricting just B-C segment matches to the Min Triang value and allowing smaller A-C and A-B matches.
  • Improved some of the data displayed and data descriptions in the log file.
  • Person A and B can now be processed if only one file has matches to the other. Previously, matches both ways were required, which won’t happen when one of the files was downloaded before the other person’s test results were available.
  • MyHeritage Matches Shared Segment files are now filtered only by their English filename to help prevent the more severe problem of people selecting their MyHeritage Matches List file by mistake.
  • If running only File A, then messages say that instead of "using 0 Person B Files".
  • The Min Single label now shows as dark grey instead of red when values of 10 cM or 12 cM are selected, since 85% of single matches 10 cM or more should be valid.
  • The display of the number of inferred segments on the People page is now right justified rather than center justified.
  • If you have the DMT or People file open when DMT is trying to save it, DMT will now prompt you to close the file.

As always, this new version and all future versions are free for previous purchasers of DMT.

I’ll be talking about Double Matching and Triangulation Groups at the Family History Fanatics Winter DNA eConference this Saturday. I’ll be presenting my material in a very visual style taking you through some of the challenges I have with my own genealogy and DNA matches. I’ll be introducing some concepts that I don’t think have been discussed before. One of the attendees will win a free lifetime license for DMT. Hope to see you there.

DNA eConference on Saturday, January 25

Saturday, January 18, 2020, 3:39:04

I’m looking forward to the Family History Fanatics eConference coming up in a week’s time. I’ll be one of the speakers on what will be a great day of DNA talks.

I have given many talks at conferences, but this will be my first virtual talk from the comfort of my office at home. I just submitted my Syllabus today and will be spending the next few days reviewing my presentation and tweaking my slides.

image

I’ll be talking about Double Matching and Triangulation Groups, but I’ll be presenting it in a very visual style taking you through some of the challenges I have with my own genealogy and DNA matches. I’ll be introducing some concepts that I don’t think have been discussed before.

And for all you DNA Painter fans, I’ll be including:
image 

Andy and Devon who run Family History Fanatics, have a great lineup with three other fantastic speakers who will also present.

Devon herself, in her own wonderfully unique style, will present:

image

Then it’s Paul Woodbury, the DNA Team Lead at Legacy Tree Genealogists:

image

And finally, it’s none other than Mr. DNA Painter himself, Jonny Perl:

image

Andy has an interesting twist planned after all of our talks.  It’s called “Genealogy Unscripted” and it will bring all four of us together to first answer questions and then “compete” in some sort of challenge to see who can claim the DNA trivia title. This comes as I am still recovering from watching last week’s Jeopardy Greatest of All Time tournament. Against Devon, Paul and Jonny, I’ll feel like Brad felt against James and Ken.

If you want to see any of the talks but can’t make it live to some or all of the eConference, by registering, you will be sent a link to the recorded conference. You’ll have a month to watch and/or re-watch any or all of the talks at your convenience.

The cost is only $25 USD.  Register now, before it fills up. Hope to see you there.

image

Aligning My Genome

Sunday, January 5, 2020, 5:18:28

I purchased a Whole Genome Sequencing (WGS) test from @DanteLabs for $399 USD in August 2018. My Variant Call Format (VCF) files were ready for me to download in January 2019, which I dissected in a blog post in February:  My Whole Genome Sequencing. The VCF File.

My 1 TB hard disk containing my raw data arrived in April 2019. It included my raw reads (FASTQ files) and my assembled GRCh37 genome (BAM file).

I was originally intending to then analyze my raw reads and BAM (Binary Sequence Alignment Map) file, but at that time in April, Dante had a deep discount on their Long Reads WGS test that I purchased for $799 USD. So I figured I’d wait until I got the long read results and then analyze and compare both the short read and long read tests together. That would prove interesting and show the strengths and weaknesses of each test and maybe they can work together for improved results.


The FASTQ file

I got my long reads results from Dante in October. They provided only the FASTQ file and provided it online as a single downloadable file.

image

The file was 199 GB (yes, that’s GB). On my internet connection, it took 12 hours to download. It is a compressed file. I unzipped it to look at it. It took 78 minutes to decompress to 243 GB.  It’s a good thing I still had half of my 2 TB internal hard drive free to use.

This is what the beginning of my decompressed FASTQ file looks like. Four lines are used per sequence, containing: (1) an identifier and description, (2) the raw sequence letters, (3) a “+” character, and (4) the quality values for each letter in the sequence.

image

The lines extend further to the right than shown above. The 7 sequences here have 288, 476, 438, 302, 353, 494 and 626 bases. These are some of the shorter sequences in the file. If I go to the 1321st sequence in the file, it contains 6411 bases.
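
The four-line layout described above can be read with a short Python sketch. This is only an illustration under stated assumptions: the function names `read_fastq` and `read_lengths` are hypothetical, the two demo reads are made up, and a real 243 GB file would take a long time to scan this way.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of file
        if not header.strip():
            continue                    # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip("\r\n"), sequence, quality


def read_lengths(path):
    """Yield the number of bases in each read of a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for _, sequence, _ in read_fastq(handle):
            yield len(sequence)


# Tiny in-memory demo with two made-up reads.
sample = io.StringIO("@read1 demo\nACGT\n+\n!!!!\n@read2\nGG\n+\nII\n")
demo_records = list(read_fastq(sample))
```

Reading strictly four lines at a time matters because a quality line may itself begin with the "@" character, so the header cannot be detected by its first character alone.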

But even that is likely short compared to what some of the longest reads must be. This file is promised to have an N50 > 20,000 bases.  That is not an average length, but that means that if you total the lengths of all the sequences that are more than 20,000 bases, then they will make up more than 50% of all the bases. In other words, the N50 is like a length-weighted median.
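
As a small worked example of that definition, here is a hedged Python sketch that computes an N50 from a list of read lengths. The function name `n50` is hypothetical and the sample lengths are made up; they are not taken from my file.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    Reads are taken from longest to shortest until they hold at least
    half of all bases; the length of the last read taken is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no reads given")


print(n50([2, 3, 4, 5, 6]))   # → 5, because 6 + 5 = 11 is at least half of 20
```

An N50 of more than 20,000 therefore says nothing about how many short reads there are, only that the longer reads carry at least half of the bases.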

By comparison, taking a look at my short read FASTQ files, I see that every single sequence is exactly 100 bases. That could be what Dante’s short read Illumina equipment is supposed to produce, or it could have been (I hope not) processed in some way already.

image


The Alignment Procedure

The FASTQ file only contains the raw reads. These need to be mapped to a standard human genome so that they can be aligned with each other. That should result in an average of 30 reads per base over the genome. That’s what a 30x coverage test means, and both my short read and long read tests were 30x coverage. The aligned results are put into a BAM file which is a binary version of a SAM (Sequence Alignment Map) file.

Dante produced and supplied me with my BAM file for my short reads. But I just learned that they only provide the FASTQ file with the long read WGS test. That means, I have to do the alignment myself to produce the BAM file.

On Facebook, there is a Dante Labs Customers private group that I am a member of. One of the files in its file area is a set of instructions for “FASTQ(s) –> BAM” created by Sotiris Zampras on Oct 22. He gives 5 “easy” steps:

  1. If using Windows, download a Linux terminal.
  2. Open terminal and concatenate the FASTQ files.
  3. Download a reference genome
  4. Make an index file for the reference genome.
  5. Align the FASTQs and make a BAM file.

Step 1 - Download a Linux Terminal

Linux is an open-source operating system released by Linus Torvalds in 1991. I have been a Windows-based programmer ever since Windows 3.0 came out in 1990. Prior to that, I did mainframe programming.

I have never used Linux. Linux is a Unix-like operating system. I did do programming for 2 years on Apollo Computers between 1986 and 1988. Apollo had their own Unix-like operating system called Aegis. It was a multi-tasking OS and was wonderful to work with at a time DOS was still being used on PCs.

So now, I’m going to plunge in head first and download a Linux Terminal. Sotiris recommended the Ubuntu system and provided a link to the Ubuntu app in the Windows store. It required just one Windows setting change: to turn on the Windows Subsystem for Linux.

image

Then I installed Ubuntu. It worked out of the box. No fuss, no muss. I have used virtual machines before, so that I could run older Windows OS’s under a newer one for testing my software. But this Ubuntu was so much cleaner and easier. Microsoft has done a lot in the past couple of years to ensure that Windows 10 will run Linux smoothly.

I had to look up online to see what the basic Ubuntu commands were. I really only needed two:

  • List files in current directory:  ls
    List files with details:  ls -l
  • Change current directory:  cd folder
    The top directory named “mnt” contains the mounted drives.

Step 2 – Concatenate the FASTQ files

My short read FASTQ files were a bunch of files, but my long read file is just one big file, so nothing was needed for it.

Step 3 – Download a Reference Genome.

Sotiris gave links to the human reference files for GRCh37.p13 and GRCh38.p13.

For medical purposes, you should use the most recent version, which currently is: March 2019 GRCh38.p13  (aka Build 38, or hg 38). But I’m doing this to compare my results to my raw data from other DNA testing companies. Those are all June 2013 GRCh37.p13 (aka Build 37, or hg 19).

So I’m going to use Build 37.

The two reference genome files are each about 800 MB in size. They are both compressed, and after unzipping, they are both about 3 GB.

The files are in FASTA format. They list the bases of each chromosome in order. This is what they look like:

image

Both files denote the beginning of each chromosome with a line starting with the “>” character. That is followed simply by the sequence of bases in that chromosome. In Build 37, each chromosome is one long line, but my text editor folds each line after 4096 characters for displaying. In Build 38 they are in 60 character groups with each group on a new line.

The lines contain only 5 possible values:  A, G, C, T or N, where N stands for any nucleotide, meaning the base is unknown and could be any of A, G, C or T. There are usually blocks of N, especially at the beginning and end of each chromosome as you can see above.

Each genome also includes references for chromosomes X, Y, and MT.

Those are followed by a good number of named patches, e.g. GL877870.2 HG1007_PATCH in Build 37 which contains 66021 bases.

Here are the number of bases for the two builds by chromosome:

image

You’ll notice the number of bases in the reference genome is different in the two Builds by as much as 3.6%. A good article explaining the differences between GRCh37 and GRCh38 is Getting to Know the New Reference Genome Assembly, by Aaron Krol 2014.

Just for fun, I compared the mt Chromosome which has the same number of bases in the two Builds. All bases also have the same value in both Builds. The count of each value is:

  • 5124 A
  • 2169 G
  • 5181 C
  • 4094 T
  • 1 N

The one N value is at position 3107.
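
A count like this can be made with a few lines of Python. The sketch below is only an assumption about how one might do it, not the method actually used here; the function name `count_bases` is hypothetical and the demo sequence is made up. It upper-cases the bases on the assumption that some reference files use lower case for masked regions.

```python
import io
from collections import Counter


def count_bases(handle):
    """Return {record name: Counter of bases} for a FASTA text stream."""
    counts = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                        # skip blank lines
        if line.startswith(">"):
            current = line[1:].strip()      # the whole header line is the name
            counts[current] = Counter()
        elif current is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            counts[current].update(line.upper())
    return counts


# Tiny in-memory demo with a made-up record split over two lines.
demo = io.StringIO(">chrDemo example\nACGT\nNNa\n>other\nGG\n")
demo_counts = count_bases(demo)
```

Because the counter is updated line by line, it works the same whether a chromosome is stored as one long line, as in Build 37, or in 60-character groups, as in Build 38.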

Step 4 – Make an Index File for the Reference Genome

Sotiris didn’t mention this in his instructions, but there were two tools I would have to install. The Burrows-Wheeler Aligner (BWA) and SAMtools.

I was expecting that I’d need to download and do some complex installation into Ubuntu. But no. Ubuntu already knew what BWA and SAMtools were. All I had to do was execute these two commands in Ubuntu to install them:

sudo apt install bwa

sudo apt install samtools

Again. No fuss. No muss. I’m beginning to really like this.

In fact, Ubuntu has a ginormous searchable package library that has numerous genomics programs in it, including samtools, igv, minimap, abyss, ray, sga, canu, and all are available by that simple single line install.

The command to index the Reference Genome was this:

bwa index GRCh37.p13.genome.fa.gz

It displayed step-by-step progress as it ran. It completed in 70 minutes after processing 6,460,619,698 characters. The original file and the 5 index files produced were:

image

Step 5 – Align the FASTQs and Make a BAM File

The BWA program has three algorithms. BWA-MEM is the newest and is usually the preferred algorithm.  It is said to be faster and more accurate and has long-read support. BWA-MEM also tolerates more errors given longer alignments. It is expected to work well given 2% error for a 100bp alignment, 3% error for a 200bp, 5% for 500bp and 10% for 1000bp or longer alignment. Long reads are known to have much higher error rates than short reads, so this is important.

The command for doing the alignment is:

bwa mem -t 4 GRCh37.p13.genome.fa.gz MyFile.fastq.gz | samtools sort -@4 -o FinalBAM.bam

So the program takes my FASTQ file and aligns it to GRCh37. It then pipes (that’s the “|” character) the output to samtools which then creates my resulting BAM file.

I have a fairly powerful Intel i7 four-core processor with 12 GB RAM. The -t 4 and -@4 parameters are telling the program to use 4 threads. Still, I knew this was going to take a long time.
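
For anyone who would rather launch the same pipeline from a script, here is a minimal Python sketch. It uses only the programs and flags from the command above; the function names `alignment_commands`, `as_shell` and `run_pipeline` are hypothetical, and `run_pipeline` is shown but not called because it needs bwa and samtools installed.

```python
import shlex
import subprocess


def alignment_commands(reference, fastq, out_bam, threads=4):
    """Build the two argument lists; the reference must come before the reads."""
    bwa = ["bwa", "mem", "-t", str(threads), reference, fastq]
    sort = ["samtools", "sort", f"-@{threads}", "-o", out_bam]
    return bwa, sort


def as_shell(bwa, sort):
    """Render the pipeline as the equivalent one-line shell command."""
    return " | ".join(shlex.join(command) for command in (bwa, sort))


def run_pipeline(bwa, sort):
    """Start both programs with bwa's output piped into samtools."""
    aligner = subprocess.Popen(bwa, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort, stdin=aligner.stdout)
    aligner.stdout.close()   # let bwa see a closed pipe if samtools exits early
    sorter.wait()
    return aligner.wait(), sorter.returncode


bwa_cmd, sort_cmd = alignment_commands(
    "GRCh37.p13.genome.fa.gz", "MyFile.fastq.gz", "FinalBAM.bam")
print(as_shell(bwa_cmd, sort_cmd))
```

Naming the arguments `reference` and `fastq` makes it harder to pass the two files in the wrong order.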

Here’s what I did to start the program:

image

First I used the ls -l command to list the files in the folder.

Then I ran the command. By mistake, I had the reference genome and my FASTQ in the wrong order and it gave me a “fail to locate the index files” error. Once I figured out what I did wrong, I ran it correctly.

The display showed the progress after every 40 million bp processed. Each batch seemed to average about 6200 sequences, indicating that, at least to start, there was an average of about 6450 bases per sequence. Working out the amount of time per base, I could extrapolate the total time needed to be about 102 hours, or a little over 4 days.  That’s do-able, so I let it go and was interested to see if it would speed up, slow down, or end up completing when I predicted.

Slowly temporary files were being created:

image

Every 75 minutes or so, 4 new temporary files (likely for the 4 threads) of about 500 MB each were being created. Don’t you like the icon Windows uses for BAM files? It uses that icon for FASTQ and FASTA files as well.

Now I just had to wait.

What amazed me while I was waiting was how well Windows 10 on my computer handled all this background processing. I could still do anything I wanted, even watching YouTube videos, with hardly any noticeable delay. So I took a look at the Task Manager to see why.

image

Even though 10.5 of my 12 GB RAM was being used, only 58% of the CPU was taken. I’m thinking that the 4 thread setting I used for BWA was easily being handled because of the 8 Logical processors of my CPU.

What also impressed me was that my computer was not running hot. Its cooling fan was dispersing air that was at the same temperature I usually experience. My worry before I started this was that days of 100% processing might stress my computer to its limits. Fortunately, it seems that’s not going to be a problem.


Wouldn’t You Know It

While I was waiting for the run to complete, I thought I’d look to see what Dante used to align my short read WGS test. For that test they provided me with not just the FASTQ raw read files, but also some of the processed files, including the BAM and VCF files.

I unzipped my 110 GB BAM file, which took 3 hours to give me a 392 GB text file that I could read. I looked at the first 200 lines and I could see that Dante had used BWA-MEM version 0.7.15 to produce the BAM file.

I thought I’d go to BWA’s github repository to see if that was the most recent version. It’s pretty close. 0.7.17 was released Oct 23, 2017. The 0.7.15 version is from May 31, 2016 and the changes weren’t significant.

But while there, I was surprised to see this notice:

image

Seems that minimap2 is now recommended instead of BWA-MEM. Here’s what they say in the minimap2 Users’ Guide:

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.

Oh well. It looks like I’ll let the BWA-MEM run finish, and then try running minimap2.


BWA Finally Done

The BWA-MEM program finally completed 5 days and 12 hours later. That’s 132 hours, a bit more than the 102 hours I had estimated.  It would have been nice for the progress to be shown with a percentage completed, as I wasn’t entirely sure at any time how much more there was remaining to do. In the end, BWA created 328 temporary BAM files totaling 147 GB.

Following this, BWA reported it had spent 480,409 seconds of real time (133.4 hours) and 1,638,167 of cpu time, a ratio of 3.4 representing the gain it got from using 4 threads.
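
The arithmetic behind those figures can be checked with a few lines of Python. The two timings are the ones BWA reported; treating the CPU-to-real ratio as an average number of busy threads is a simplifying assumption.

```python
real_seconds = 480_409       # wall-clock time reported by BWA
cpu_seconds = 1_638_167      # CPU time reported by BWA
threads = 4                  # the -t 4 setting

real_hours = real_seconds / 3600
speedup = cpu_seconds / real_seconds     # average number of busy threads
efficiency = speedup / threads           # share of the 4 threads kept busy

print(f"{real_hours:.1f} hours, ratio {speedup:.1f}, efficiency {efficiency:.0%}")
# → 133.4 hours, ratio 3.4, efficiency 85%
```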

Then BWA passed off to samtools for the assembly of the final BAM file. There was about an hour of nothing visible happening. Then samtools started creating the BAM file. Windows Explorer showed its progress, with the size of the BAM file being created growing by about 12 MB every second. This took another 3.5 hours and the result was a single BAM file of 145 GB (152,828,327 KB).


Minimap2

After reading that BWA now recommends use of minimap2, and that minimap2 was much faster, more accurate and produces a better alignment, obviously I could not stop with the BAM file I had.

I went back to Ubuntu and ran the following:

sudo apt install minimap2

but I got the message:

image

I found out it required a newer version of Ubuntu than Windows had supplied in their store. So I followed the instructions: How to Upgrade Ubuntu 18.04 to 19.10 on Windows 10 Linux subsystem by Sarbasish Basu. Then I was able to install and run minimap2.

minimap2 -ax map-ont -t 4 GRCh37.p13.genome.fa.gz MyFile.fastq.gz | samtools sort -@4 -o FinalBAM.bam

where -a requests SAM output and the -x preset “map-ont” is for Oxford Nanopore long noisy reads.

I ran this. It gave little feedback. After about 6 hours, it told me it mapped 500000 sequences. Then another 6 hours and it mapped another 500000 sequences. It wouldn’t have been so bad if minimap2 was as resource friendly as BWA, but I found it sometimes noticeably impacted my working on my computer. I could still do things, but much slower than normally. What would normally be instantaneous would sometimes take 3 to 30 seconds – to open a browser, my email, etc.

None-the-less, I let it go for 3 days (72 hours) and then canned the program because I needed my computer back. Afterwards, I calculated that there likely are about 20 million long read sequences in my file. Extrapolating time-wise, that would have been about 10 days of running to complete.
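
That extrapolation is a simple proportion, sketched here in Python. The function name `estimate_hours` is hypothetical; the inputs are the 500000 sequences per 6 hours observed above and the estimate of about 20 million sequences, and it assumes the mapping rate stays constant.

```python
def estimate_hours(total_items, items_per_batch, hours_per_batch):
    """Extrapolate total run time, assuming a constant processing rate."""
    return total_items / items_per_batch * hours_per_batch


hours = estimate_hours(20_000_000, 500_000, 6)
print(hours, "hours, or", hours / 24, "days")   # → 240.0 hours, or 10.0 days
```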


YSeq

A few weeks later, in the Genetic Genealogy Tips & Techniques Facebook group, Blaine Bettinger posted regarding his Dante WGS test that he took in November and said that he purchased the “FASTQ Mapping to hg38” from YSeq for $25 and recommended that service.

Since minimap2 didn’t look like it was going to work for me on my computer, I thought using YSeq sounded like a good idea. I’m interested in DNA for cousin matching purposes, and all the big consumer DNA companies are using Build 37 (i.e. hg19), so I decided to also purchase the “FASTQ Mapping to hg19” from YSeq for an additional $25.

I gave them access to my long read FASTQ files at Dante by setting up a temporary password for them. After about 4 weeks, my results were ready.

You can see they had a whole bunch of result files for me:

image

They used the minimap2 program. The files are for both hg19 and hg38. The minimap2_hg38_sorted.bam file is 132 GB and the minimap2_hg19_sorted.bam file is the same size, 132 GB.

There’s also a bunch of Y-DNA results and M (mt) DNA results along with various stats. There’s also a few files with 23andMe in the name that contain my variants in 23andMe’s raw data format. The pipeline files show me exactly what commands were run and I appreciate having those.

YSeq’s email to me telling me my results were ready included my BAM coverage figures: 38.8x for hg38 and 38.4x for hg19, so I achieved more than the 30x coverage that Dante promised. The average read was 5999 bases. That is much longer than a short read WGS test, which typically averages 100 bases per read. I don’t have stats on what the N50 was (see earlier in the article for a definition of N50), but Dante promises at least 20,000 and I trust that I got that.
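
Coverage, average read length and the number of reads are tied together by a simple relation, which the Python sketch below uses as a rough cross-check. The genome length of about 3.1 billion bases is an assumption, not a figure from YSeq’s report, and the function name `implied_read_count` is hypothetical.

```python
GENOME_LENGTH = 3_100_000_000    # assumed approximate length of the reference


def implied_read_count(coverage, average_read_length, genome_length=GENOME_LENGTH):
    """coverage = reads * average length / genome length, solved for reads."""
    return coverage * genome_length / average_read_length


reads = implied_read_count(38.4, 5999)
print(f"about {reads / 1e6:.1f} million reads")
```

Under that assumption the result is close to the roughly 20 million long read sequences estimated earlier from the minimap2 run.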

YSeq’s email also gave me my mt haplogroup: K1a1b1a, which is the same as what my Family Tree DNA mtDNA test gave me. And their Y-haplogroup path was the same as my Big Y-500 test at Family Tree DNA. YSeq ended with: R-Y2630 –> R-YP4538 compared to FTDNA:  R-Y2630 –> R-BY24982 –> R-BY24978.

YSeq provides BAM files just for mt and Y which is convenient for uploading to services such as YFull. Personally, I’m not that interested in Y and mt because, other than my uncle, none of my matches are close enough to connect on my family tree. I have provided my Y-DNA to the Ashkenazi Levite DNA study and I’ve let them do the tough stuff with it.

Each of the two 132 GB BAM files took me about 22 hours to download at an average download speed of 1.7 MB/second.


So What the Heck Do I Plan To Do With These BAMs?

I’ve now got BAM files that were produced from:

  • My short read WGS produced by Dante using BWA-MEM.
  • My long read WGS produced by me using BWA-MEM.
  • My long read WGS produced by YSeq using minimap2.

Other than scientific curiosity and an interest in learning, I’m mostly interested in autosomal DNA matching for genealogy purposes. I have two goals for these BAMs:

1. To compare the BAMs from my short read and long read WGS test with each other and to the raw data from the 5 SNP-based DNA tests I took. I want to see if from that I can determine error rates in each of the tests and see if I can correct the combined raw data file that I now use for matching at GEDmatch.

2. To see how much of my own DNA I might be able to phase into my two parents. This will be a long term goal. Reads need to contain at least two heterozygous (different value for both parents) SNPs in order to connect each end of them to the next read of the same parent’s chromosome. And there are some very long regions of homozygous (same value for both parents) SNPs. WGS long reads are generally not long enough to span all of them. But I’d still like to see how many long segments can be phased.

All this will happen when the time is right, and if I ever get some time.

Behold Version 1.2.4, 64-bit

Saturday, January 4, 2020, 5:48:02

I’ve released an update to Behold that includes a 64-bit executable that will run on 64-bit Windows computers.

If Behold 1.2.3 is running fine for you, there’s no reason to upgrade. There are no other changes in it.

The new installation program now contains both a 32-bit and 64-bit executable. The bit-level of your Windows computer will be detected, and the appropriate executable will be installed.

What does 64-bit give you that 32-bit doesn’t? Well, really it just can handle larger files. If Behold 1.2.3 runs out of memory because your GEDCOM is too big, then 64-bit Behold may not.  My tests using GEDCOMs created by Tamura Jones’ GedFan program indicate that on my computer which has 12 GB of RAM, 32-bit Behold can load up to fan value 19 (a half a million individuals) but 64-bit Behold can load up to fan value 22 (four million individuals).

The 64-bit version of Behold is actually a bit slower than the 32-bit version. On my computer I find it about 30% slower when comparing the loading speed of the same file. But I believe 64-bit is more stable and is less likely to crash than a 32-bit program because it really can’t run out of address space. 

And a bit of a teaser:  I’m starting to get back to working on Behold. Stay tuned.

GenSoftReviews Users Choice Awards 2019

Thursday, January 2, 2020, 23:58:04

I’m pleased to announce the winners of the 11th annual GenSoftReviews Users Choice Awards.


Users of genealogy software go to GenSoftReviews to rate and review their software and say what they like and don’t like about it. These awards are given to software that has achieved at least a 4.00 out of 5 rating at the end of 2019 from  10 or more reviews which included at least one review during 2019.

GenSoftReviews uses an exponential weighting algorithm so that a review a year old will have 1/2 the weight, 2 years old will have 1/4 weight, etc. This ensures that the most recent reviews will always have the greatest influence.
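A minimal sketch of that kind of weighting follows. The halving per year is as described above. Treating age as a continuous number of years and taking the rating as a weighted average are assumptions made for illustration, and the function names are hypothetical.

```python
def review_weight(age_years, half_life_years=1.0):
    """Weight of a review: 1 when new, 1/2 after one year, 1/4 after two."""
    return 0.5 ** (age_years / half_life_years)


def weighted_rating(reviews):
    """Weighted average rating for (rating, age_years) pairs."""
    reviews = list(reviews)
    total_weight = sum(review_weight(age) for _, age in reviews)
    if total_weight == 0:
        raise ValueError("no reviews given")
    weighted_sum = sum(rating * review_weight(age) for rating, age in reviews)
    return weighted_sum / total_weight


# A brand new 5-star review and a one-year-old 3-star review
example = weighted_rating([(5, 0.0), (3, 1.0)])
```

In this example the newer review counts twice as much as the older one, so the result sits closer to 5 than a plain average of 4 would.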

At the end of 2019, GenSoftReviews had:

  • 1025 programs listed
  • 303 programs with at least 1 user review
  • 60 programs with at least 10 user reviews
  • 42 programs with at least 10 user reviews and 1+ review(s) in 2019.

So 42 programs were eligible for an award this year.

Of those, 20 programs won a Users Choice Award by ending the year with a user rating of at least 4.00 out of 5. The winning programs for 2019 are:

  1. Second Site for TMG - Rated 5.00 out of 5

  2. The Next Generation (TNG) - Rated 4.91 out of 5

  3. Personal Ancestral File (PAF) - Rated 4.87 out of 5

  4. Generations - Rated 4.83 out of 5

  5. Ahnenblatt - Rated 4.80 out of 5

  6. Brother’s Keeper - Rated 4.80 out of 5

  7. webtrees - Rated 4.79 out of 5

  8. Genealogie Online - Rated 4.77 out of 5

  9. GedSite - Rated 4.73 out of 5

  10. Family Book Creator - Rated 4.70 out of 5

  11. Family Historian - Rated 4.62 out of 5

  12. The Master Genealogist (TMG) - Rated 4.51 out of 5

  13. Family Tree Maker - Up To Version 16 - Rated 4.49 out of 5

  14. Famberry - Rated 4.44 out of 5

  15. Ancestral Quest - Rated 4.35 out of 5

  16. Mundia - Rated 4.22 out of 5

  17. Ultimate Family Tree - Rated 4.16 out of 5

  18. MyHeritage - Rated 4.14 out of 5

  19. Reunion - Rated 4.12 out of 5

  20. iFamily for Mac - Rated 4.03 out of 5

Congratulations to all the winners! Please continue to make your users happy!


A Few Observations

The top rated program this year was Second Site for TMG, achieving a perfect 5.00 score on 24 reviews. Second Site was written by John Cardinal and is a Windows program that creates a website directly from the database of the program The Master Genealogist (TMG) by Bob Velke. TMG was discontinued in 2014 but there are still many users of that program. Those users continue to like TMG and have continued to rate it high enough to earn another Users Choice Award this year.

John Cardinal has a second program on the list as well:  GedSite creates a website like Second Site, but uses a GEDCOM file as input instead of a TMG database.

Another website creator program in second place is Darrin Lythgoe’s The Next Generation (TNG). Darrin’s program is one of 4 winning programs that first won  a GenSoftReviews Users Choice award in 2009, the first year of the awards. The other programs that also won in 2009 are Personal Ancestral File (PAF), Brother’s Keeper and Reunion.

Three programs are winners for the first time this year: Second Site for TMG, GedSite and Mundia. Mundia was an online site that Ancestry purchased and then closed down in 2014. But there are still users that relish the memory of it.

In addition to Mundia, the winners include 5 other no longer supported programs that users still love. They are: Personal Ancestral File (PAF), Generations, The Master Genealogist, Family Tree Maker - Up To Version 16 and Ultimate Family Tree (UFT). It is especially eyebrow-raising that UFT is still loved by users since it was discontinued in 2003 which is now 17 years ago.

Supported full-featured Windows programs on the list include Brother’s Keeper, Family Historian, Ahnenblatt and Ancestral Quest.

Supported full-featured Mac programs on the list are: Reunion and iFamily for Mac.

Supported full-featured Online programs on the list are:  webtrees, Genealogie Online, Famberry and MyHeritage.

The winner list includes one winning utility program: Family Book Creator which is a plugin for Family Tree Maker that creates family books from your FTM data.

Winners from 2018 that are absent from the list this year include Roots Magic, Family Tree Builder, Clooz, Oxy-Gen and Relatively Yours, as each received a few reviews that put them below the 4.00 threshold, and Evidentia, Rootstrust, Familienbande and Ancestris that did not have at least one review in 2019.


Thank You

Thank you to all the people who have come to GenSoftReviews to rate and review the genealogy software you use. Since I opened the site on September 24, 2008, you have contributed close to 5,600 reviews!

Please do continue to go to GenSoftReviews each year to update your assessment of the programs you use. Hopefully your reviews at GenSoftReviews will help the software developers to make the programs better for you and for everyone.

     


    Update: Jan 5: Ahnenblatt’s rating was adjusted due to seven negative reviews found to be made by one person and now are weighted as one person instead of seven.

    Using MyHeritage to get Triangulation Groups

    Friday, December 13, 2019, 22:40:31

    MyHeritage has a very nice feature in their chromosome browser. If three or more people all triangulate over a segment, the browser will place a rectangle with rounded corners over the triangulating segment for you. These form a triangulation group. All of the people you select must all match each other on that segment for the rectangle to show. If just one person doesn’t match all the others, then the rectangle will not show.

    image

    I have been having a difficult time with my own matches at MyHeritage. The reason is that of the 12,142 DNA Matches I have there, my uncle and one other person are the only ones where I know how we’re related and who our MRCA (Most Recent Common Ancestor) is.

    At MyHeritage, my Uncle matches me on 52 segments that total 1994.1 cM. My next highest match is someone sharing 9 segments with me totaling 141.4 cM. MyHeritage estimates him as a 1c2r to 2c1r, but I know that he is more distant than that or I would have already been able to place him in my tree.

    I uploaded my uncle’s Family Tree DNA test to MyHeritage. But because our common ancestors are my father’s parents, all his DNA will allow me to do is to help separate segment matches on my father’s side from my mother’s. Segment matches triangulating with my uncle and myself will almost always be on my father’s side. Missing B-C segment matches over those triangulating segments (i.e. I match my uncle and Person C, but my uncle does not match Person C) will almost always be on my mother’s side.


    My Other Known Relative

    I have one other person at MyHeritage that I match to whose relationship I know. It is a 2c2r who is my father’s mother’s father’s mother’s brother’s son’s son and our MRCA would be FMFMR in the notation DMT uses. We share 8 segments totaling 79.4 cM.

    When I use the MyHeritage chromosome browser with my uncle and this cousin, I have a problem:

    image

    You can see how nicely MyHeritage has boxed 4 triangulations between myself and my uncle (red) and my cousin (yellow) on chromosomes 1, 2, 8 and 18.

    There is one small segment on chromosome 4 where I match my cousin but not my uncle. That is fine. I could have my uncle’s mother’s father segment there, but my uncle could have his mother’s mother’s segment.

    The problem occurs on chromosomes 4, 11 and 16, segments where I match both my uncle and my cousin, but MyHeritage does not show a triangulation box. That includes a large match of 15.4 cM with my cousin on chromosome 4. These are Missing B-C matches. I cannot match both my uncle and my cousin on my father’s side without my uncle also matching my cousin. Therefore on these three segments, either I must be matching my cousin on my mother’s side through some unknown relationship, or these are by-chance false matches between me and my cousin where either of his chromosomes is matching either of mine over the segment.

    That throws a wrench in the works as far as assigning an MRCA to my cousin in Double Match Triangulator. If I do, DMT will assume the ancestral path FMFMR applies to all 8 segments. If I had a lot of other people with known MRCAs, this wouldn’t be too much of a problem, because DMT’s “majority rule” would mitigate the few incorrect assignments using the majority of correct assignments. But since I don’t have any other people to use, I know DMT will assign my FMFMR side to the other three segments, and those incorrect assignments will replicate when DMT iterates its clustering of matches and recomputation of the ancestral paths.


    Confirming the Triangulation Group at MyHeritage

    My cousin does not match my uncle on 3 segments. Maybe if I can identify the people in the triangulation groups of those three segments, I can have more information to investigate to determine if those segments are indeed on my mother’s side.

    So I’ll first look at my triangulations on that large 15.4 cM segment on chromosome 4. But how do I do that at MyHeritage? MyHeritage allows you to select up to 7 people to compare. However they give you no way to know what people you need to pick to match on that specific segment.

    This is where Double Match Triangulator comes to the rescue. I don’t have my cousin’s segment match file from MyHeritage. But I do have my own and my uncle’s. Those two are enough.

    I start DMT, setting my segment match file as File A and my uncle’s as File B. I open my People file and set my uncle’s MRCA to FR, and I run DMT.

    DMT shows me all the people who triangulate with myself and my uncle and assigns them all an ancestral path and cluster of F (Father). These are true triangulations, since my file tells me that I match Person C and my uncle, and my uncle’s file tells me that he matches Person C.

    DMT also shows me the people who have Missing B-C matches (that my uncle does not match to on that segment) and puts them in the cluster of M (mother).

    Now I look at the DMT output and go down to chromosome 4 in the M (mother) section and look at the Missing B-C matches between 58 Mbp and 77 Mbp. I’m adding a box around this range to visually denote them.
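The F and M split described above can be sketched roughly as follows. This is an illustration of the idea only and not DMT’s actual code. The Segment record, the function names and the simple overlap test are assumptions, and a real run would also apply minimum cM limits.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    person: str   # the other person in the match
    chrom: int
    start: float  # Mbp
    end: float    # Mbp


def overlaps(a, b):
    """True when two segments share part of the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def cluster_by_uncle(my_matches, uncle_matches, uncle_name):
    """Label each of my matches F (triangulates) or M (Missing B-C).

    Only segments that overlap a segment I share with my uncle can be
    labelled; everything else is left out of the result.
    """
    ab_segments = [s for s in my_matches if s.person == uncle_name]
    clusters = {}
    for ac in my_matches:
        if ac.person == uncle_name:
            continue
        if not any(overlaps(ac, ab) for ab in ab_segments):
            continue  # I do not match my uncle here, so nothing is inferred
        uncle_matches_c = any(
            bc.person == ac.person and overlaps(ac, bc) for bc in uncle_matches
        )
        clusters[(ac.person, ac.chrom, ac.start)] = "F" if uncle_matches_c else "M"
    return clusters
```

Given my file and my uncle’s file as lists of Segment records, the result maps each usable match to F or M, which corresponds to the two sections of the DMT output described above.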

    image

    There are 39 people that I match to on this segment that are Missing B-C matches. Theoretically, all these people should form a triangulation group here, since they all match me over the same region. I know they all match me where I also match my uncle, but what I don’t know is if they match each other. I would expect they would since I have found that triangulations or Missing B-C matches of 7 cM or more at other companies generally are all valid and do match each other.

    So now I’ll use MyHeritage’s nice triangulation display to verify that these matches are in this maternal triangulation group.

    I’ll start by adding my cousin and the first Missing B-C person.

    image

    That’s good. MyHeritage gives the rectangle so it triangulates.

    Now I’ll add the next person. I know exactly who to add because DMT gives me each person’s exact name. Then I can simply copy from the DMT spreadsheet and paste into MyHeritage’s Chromosome Browser search field and presto, that person is found. This is especially useful when the name is in a foreign script like Cyrillic or Hebrew.

    image

    I add that found match and click compare and I get:

    image

    I add the next two and I still have a triangulation group:

    image

    This happens with the next person I add:

    image

    MyHeritage no longer draws the triangulation box. This last person in dark blue only has an 8 cM match with me, and MyHeritage has shown that this person does not match everyone in the group.

    If I had added 7 people at once and there was no triangulation box, I wouldn’t know who is breaking the triangulation group. If that happens, it is best to add people one at a time, and verify the triangulation group for each one. If the triangulation group breaks, you know who it is who is out.

    So now I note that the above person is not in the group, delete them and try the next person.

    image

    Aha! That’s better! Once the 7 slots on MyHeritage fill up, I can just delete the last person and replace them with the next, and thus test all 37 people reasonably quickly. It does take a few seconds for MyHeritage to compare 8 people because there are 8 x 7 / 2 = 28 comparisons for them to make and they are comparing all segments on all chromosomes each time. I’m so glad MyHeritage is correctly checking all 28 comparisons rather than incorrectly just doing the 7 comparisons of 1 vs 2, 2 vs 3, … 7 vs 8. It is possible that 3 might match 4 but not match anyone else.
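The difference between checking every pair and checking only neighbouring pairs can be sketched as below. This is only an illustration and not how MyHeritage implements it. The function names, and the set of matching pairs used as input, are assumptions.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct pairs among n people: n x (n - 1) / 2."""
    return n * (n - 1) // 2


def is_triangulation_group(people, matching_pairs):
    """True only if every pair of people matches over the segment."""
    return all(
        frozenset(pair) in matching_pairs for pair in combinations(people, 2)
    )


def chain_check_only(people, matching_pairs):
    """The weaker check: only 1 vs 2, 2 vs 3, and so on."""
    return all(
        frozenset((a, b)) in matching_pairs for a, b in zip(people, people[1:])
    )


# Four hypothetical people who only match their neighbours in the list
people = ["P1", "P2", "P3", "P4"]
matching_pairs = {
    frozenset(("P1", "P2")),
    frozenset(("P2", "P3")),
    frozenset(("P3", "P4")),
}
```

With this made-up data the chain check passes while the full check fails, which is exactly the kind of false triangulation group that checking all pairs guards against.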

    As I went along, every other person was turning up false. I was quite shocked to end up with only the above 6 of the 41 people being in the triangulation group. These are all segments above 7 cM. Those rejected went as high as 12.2 cM.

    This prompted me to do the same for the largest triangulating match with my cousin on my father’s side.  This is by coincidence also a 15.4 cM match on chromosome 2 between 221 Mbp and 234 Mbp. Here’s the section of the DMT report for the triangulating segments with a box around the 56 matches in that range.

    image

    An “F” in darker blue is the part of the segment that triangulates on my father’s side. An “f” in lighter blue is where I match the person but my uncle does not. A “b” in grey is where my uncle matches the person but I do not.

    I checked many of them one-by-one using MyHeritage’s chromosome browser, and only found 3 out of the 56 matches to form a triangulation group with my uncle and myself.

    image

    That included my cousin 15.4 cM and the two large matches of 23.8 cM and 21.9 cM that you can see near the bottom of the DMT report which are extending to the right with ‘f’s in lighter blue.  All the other 53 must be false matches.


    Explaining False Triangulating Segments

    In all the work that I have done up to now, I have come to agree with the conclusion of Jim Bartlett that almost all segment matches 7 cM or more that triangulate are valid and will form a triangulation group. I have extended that thinking to include Missing B-C matches 7 cM or more, as long as B-C do share segments elsewhere, which tells you that the B-C segment is an explicit non-match.

    But here at MyHeritage, I have many triangulating and Missing B-C matches above 7 cM that don’t match with the triangulation group. One was even as large as 16 cM. The sample of them that I checked on MyHeritage gave me this:

    image

    This is not a good result for MyHeritage, and indicates to me that their segment matching is less reliable than other companies and can produce many false triangulations above 7 cM.

    If I could hazard a guess as to why, it might be because of the imputation and stitching techniques that they use to determine matching segments. Those techniques may be less precise than what the other companies use. Others, including Roberta Estes, have less faith in imputation techniques.


    Conclusion

    The display of triangulation groups in MyHeritage’s chromosome browser is a wonderful feature. It will help you ensure that overlapping segments all match each other forming a triangulation group that likely was passed down from a common ancestor. 

    But it also enabled me to illustrate that MyHeritage’s matching is not as reliable as other companies, since their match data allows the formation of invalid triangulations with larger segments than at other companies.

    I recommend that Double Match Triangulator users, when using MyHeritage data, increase the “Min Triang” setting from 7 cM to at least 12 cM or 15 cM to ensure fewer false positives are included in the MyHeritage triangulations and Missing B-C matches.

    image

    At some point in a future blog post, I’ll do a full run of DMT with my MyHeritage segment match data.

    GEDmatch - Are You In or Out?

    Tuesday, December 10, 2019, 22:39:38

    It’s already day-old news. GEDmatch has been purchased by Verogen.

    For some of the reporting on this, see the following news posts:

    For some of the details and opinions of expert genealogists and genetic genealogists, see the following blog posts:

    Because of the new confirm acceptance requirements, and more stringent re-opt-in requirements for European residents, the number of matches available is going down for police searches.

      But not just for police searches. For everyone’s searches.

      One month ago, on November 5th, my 3000th closest match at GEDmatch shared 30.5 cM with me. Today my 3000th closest match shares 27.9 cM. I have 699 matches at GEDmatch sharing from 27.9 cM to 30.4 cM. That means I have lost 23% of my matches at GEDmatch. You likely have as well.
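The reasoning behind that percentage, spelled out as a small sketch: the 699 matches that now sit inside my top 3000 while sharing less than the old 30.5 cM cutoff must have moved up to fill places that other matches vacated. The function name is made up for illustration.

```python
def fraction_lost(moved_into_top, top_size):
    """Fraction of the former top matches that have disappeared.

    moved_into_top: matches now in the top list that share less than the
    old cutoff, so they are filling vacated places.
    """
    return moved_into_top / top_size


lost = fraction_lost(699, 3000)
print(f"{lost:.1%} of my top 3000 matches are gone")
```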

      I’m definitely staying. What are you going to do?

      DMT - The Horizon Effect

      Wednesday, December 4, 2019, 0:34:52

      In Version 3 of Double Match Triangulator, I added the ability to specify the smallest segment match that DMT would consider to be part of a valid triangulation (default 7 cM) and the smallest segment match that DMT would consider to be a valid single match (default 15 cM).

      A situation that can happen when you get close to the triangulation limit is something I will call the horizon effect.  If two of the three valid overlapping matches in a triangulation are over the triangulation limit (i.e. >= 7 cM), but the other is slightly under it (e.g.  <= 6.9 cM), then you’ve got a problem. DMT will eliminate the small segment and incorrectly classify the triplet, not as a triangulation, but as a Missing A-B or Missing B-C match.
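To make the horizon effect concrete, here is a minimal sketch of a strict cutoff applied to one set of overlapping segments. It is an illustration under stated assumptions and not DMT’s actual code. The function name and the “Unclassified” fallback are hypothetical.

```python
def classify_triplet(ab_cm, ac_cm, bc_cm, limit_cm=7.0):
    """Classify overlapping A-B, A-C and B-C segments with a strict cutoff.

    Each argument is a segment length in cM, or None when no match was
    reported. Any segment below limit_cm is dropped before classifying.
    """
    def kept(cm):
        return cm is not None and cm >= limit_cm

    ab, ac, bc = kept(ab_cm), kept(ac_cm), kept(bc_cm)
    if ab and ac and bc:
        return "Triangulation"
    if ac and bc:
        return "Missing A-B"
    if ab and ac:
        return "Missing B-C"
    return "Unclassified"


# All three segments are real, but one falls just under the 7 cM limit
horizon_case = classify_triplet(ab_cm=7.4, ac_cm=8.1, bc_cm=6.9)
```

In this made-up example the 6.9 cM B-C segment is dropped, so what is really a triangulation is reported as a Missing B-C match.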


      Is this a Major Problem?

      To be honest, I would say no.

      1. Leaving out valid triangulations only gives less data to work with but is not a problem.
      2. The misclassifying of a triangulation as Missing B-C might allow the B-C match to be used incorrectly as an inferred match.
      3. The misclassifying of a triangulation as Missing A-B would get the A-C match to map onto the incorrect parent.

      But cases 2 and 3 shouldn’t be too concerning since DMT uses a consensus approach. If the majority agree it is a triangulation through a particular common ancestor, then the (hopefully) fewer misclassified matches will be outnumbered by the good ones.


      A Possible Improvement

      Even so, I’d like to see if I can address this horizon effect and do something to reduce the number of misclassified matches. I came up with an idea.

      Currently, DMT ignores all matches in the Person A and Person B match files that are below the triangulation limit. I can change that so that Person A segment matches that are less than the limit will still be compared to Person B matches.

      e.g. If we have a B-C match of 7.2 cM that overlaps with an A-C match of 6.8 cM and an A-B match of 6.7 cM, then DMT will now say that is a triangulation.

      What is the extra bit on the B-C match?  Well it could be an extra bit at either end that matches by chance, or it could be that B and C are more closely related than A and C and have a larger match between them.

      I know some A-C and A-B matches below the triangulation limit will then be included, but that limit is no magic number. Segments above the limit are not necessarily valid, and segments below it are not necessarily invalid. We are simply using the limit to pick the point at which we expect that most triangulations will be valid.
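The relaxed rule can be sketched as follows. It is a simplified illustration and not DMT’s actual code. In particular, the a_floor_cm parameter (the smallest Person A segment still considered) is a hypothetical name and value that I am using to stand in for whatever the match data happens to contain.

```python
def is_triangulation_relaxed(ab_cm, ac_cm, bc_cm, limit_cm=7.0, a_floor_cm=5.0):
    """Relaxed triangulation test for overlapping segments.

    Person B's match with C must still meet limit_cm, but Person A's own
    segments (A-B and A-C) only need to reach the lower a_floor_cm.
    """
    if ab_cm is None or ac_cm is None or bc_cm is None:
        return False
    return ab_cm >= a_floor_cm and ac_cm >= a_floor_cm and bc_cm >= limit_cm


# The example from the text: B-C 7.2 cM, A-C 6.8 cM, A-B 6.7 cM
relaxed = is_triangulation_relaxed(ab_cm=6.7, ac_cm=6.8, bc_cm=7.2)
# The old behaviour, where Person A's segments must also meet the limit
strict = is_triangulation_relaxed(6.7, 6.8, 7.2, a_floor_cm=7.0)
```

Setting a_floor_cm equal to limit_cm reproduces the old behaviour, which makes it easy to compare the two on the same data.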


      Can’t Always be Done

      DMT 3’s inclusion of smaller A-C matches for triangulations will only work if the match data contains segments smaller than the limit selected. If the limit you select in DMT is 5 cM, but your match data does not include segments smaller than 5 cM, then DMT will not have any smaller A-C segments to work with.

      In that case, the horizon effect will occur more often and DMT’s consensus approach will have to be relied upon to produce reasonably logical results.

      Lower limits of individual segment matches at each company are:

      • Family Tree DNA:  1 cM
      • 23andMe:  5 cM  (on the X chromosome:  2 cM)
      • MyHeritage DNA:  6.1 cM
      • GEDmatch:  default 7 cM, but you can reduce that down as low as 1 cM

      If you’re using GEDmatch, you could download just Person A’s segment matches to a slightly lower limit. e.g. if your triangulation limit is 7 cM, try downloading A’s segment match file to 5 cM.  I would not go as low as 1 cM at GEDmatch. Doing so is known to introduce too many false matches. See False Small Segment Matches at GEDmatch.

      If your segment match files go down to a certain cM, e.g. 6.1 cM, then you could raise your triangulation limit in DMT a bit, say to 8 cM.
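As a quick check of whether a given limit leaves any smaller segments to work with, the company minimums listed above can be put in a small lookup. This is just a sketch. The dictionary and function names are made up, and the GEDmatch entry uses its default since the actual minimum depends on what you chose when downloading.

```python
# Smallest autosomal segment each company reports, in cM (from the list above)
MIN_SEGMENT_CM = {
    "Family Tree DNA": 1.0,
    "23andMe": 5.0,        # 2 cM on the X chromosome
    "MyHeritage DNA": 6.1,
    "GEDmatch": 7.0,       # default; can be lowered when downloading
}


def has_segments_below(company, triangulation_limit_cm, downloaded_min_cm=None):
    """True if the match data can contain segments under the chosen limit.

    downloaded_min_cm overrides the company default, e.g. for a GEDmatch
    file downloaded with a lower threshold.
    """
    floor = MIN_SEGMENT_CM[company] if downloaded_min_cm is None else downloaded_min_cm
    return floor < triangulation_limit_cm
```

For example, has_segments_below("GEDmatch", 7.0) is False with the default download, but becomes True when called with downloaded_min_cm=5.0.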

      Personally, I don’t think it’s necessary to worry too much about this fine tuning. DMT should give reasonably similar results whichever way you do it. Really, you’d be much better off spending your time trying to identify common ancestors of more of your DNA relatives, as that will improve DMT’s results the most.


      So How Did It Do?

      I made the above changes to my working version of DMT and ran the same data that I did for my 23andMe article.

      This time around, DMT included 175 A-C segment matches between 6 and 6.99 cM and 169 segments between 5 and 5.99 cM. With the 892 people I match, these extra segments increased the number of triangulations I have from 1355 to 1757, an increase of 402 triangulations. 7 cM is at the lower limit of valid triangulation size, so some of those that include segments down to 5 cM might not be valid and may be by-chance matches. Picking a very conservative number out of my head and saying that only 80% of these were valid matches, then this adds about 320 new valid triangulations and about 80 false triangulations. The power of consensus again should work to use that extra data advantageously.
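The rough arithmetic behind those numbers, as a small sketch. The 80% figure is only my guess from above and not a measured rate, and the function name is made up.

```python
def split_new_triangulations(before, after, assumed_valid_fraction):
    """Split the added triangulations into estimated valid and false ones."""
    added = after - before
    valid = added * assumed_valid_fraction
    false = added - valid
    return added, valid, false


added, valid, false = split_new_triangulations(1355, 1757, 0.80)
print(f"{added} added: about {valid:.0f} valid and {false:.0f} false")
```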

      Final results are that 816 (up from 790) of the 892 people I match with are now assigned clusters, and grandparent mappings now cover 52.8% of my paternal side, up from 46.1%. 

      The improved grandparent mapping (from DNA Painter) is:

      image

      Compare this to the 46.1% diagram from before, and you’ll have a hard time finding the differences, which is good:

      image


      Update to DMT Coming

      I think it’s worthwhile including this small improvement in an update to DMT. I’ve got a few more small fixes/improvements to make and one other idea for using the results from one company to initialize the run for another company. So hopefully within a week or two, I’ll have a new release of DMT available.

      Genealogy is Virtually Everywhere

      Wednesday, November 13, 2019, 21:34:23

      Last night, I attended a talk of a prominent genealogy speaker. This is a speaker who keynotes conferences and attracts thousands to her talks.

      Diahan Southard gave her talk “Your Slice of the DNA Pie”, and I watched it on my computer at home. It was a presentation of the Virtual Genealogical Association, an organization formed in April 2018 to provide a forum for genealogists to connect online. Webinars such as Diahan’s are just one of their offerings. Membership is just $20 a year.

      image

      The VGA just completed their first highly successful Virtual Conference. There was one track with 6 well-known genealogical speakers on the Friday, 6 more on Saturday and 5 on Sunday, so the Conference lasted three full days. In addition, three prerecorded talks were included. All talks are available to attendees for re-watching (or watching if they missed the live talk) for the next six months.

      image

      Like any physical conference, handouts by the speakers were made available to attendees for each of their talks individually, or as a syllabus. Attendees were told about a surprise bonus at the end of the conference which was a special offer from the National Institute for Genealogical Studies of a free 7 or 10 week online course worth $89 authored by two of the VGA Conference speakers: Gena Philibert-Ortega and Lisa Alzo.

      The VGA Conference was hosted and directed by their delightful president Katherine Willson. She said they were very happy that over 250 people paid the $59 (members) or $79 (non-members) fee to attend the 3-day online conference, something that was really the first of its kind. 

      The VGA plans to continue these annual conferences. The next is already scheduled for Nov 13-15, 2020, so be sure to block those days off now in your calendar.

        
      What is Virtual Genealogy?

      Most of us are used to attending live genealogy conferences. You know, the ones where you have to physically be there, be semi-awake, have showered, look decent, and be pleasant even if you’re not feeling pleasant.

      They may be offered by your local genealogical society in the city you live in, a regional conference in a city you can drive to, or a national or international conference that you usually have to fly to. Live conferences require many people to organize and run. They are expensive to put on, require booking of a venue, obtaining of sponsors to cover the costs, vendors to fill an exhibition hall, rooms and logistics to enable the speakers to speak, etc., etc.

      By comparison, I would say:

      Virtual Genealogy simply is any genealogical activity you can do on your computer or smartphone in your pajamas.

      This includes everything from:

      • attending online lectures
      • taking online courses or workshops
      • watching conference livestreams
      • communicating with other genealogists via social media
      • researching your family online
      • using genealogy software to record your family tree information

      It’s only in the past couple of years that many of these virtual genealogical activities have become available. I can truly say now that you can be a bedroom genealogist and learn and do almost everything you need to without slipping out of bed (as long as your laptop or smartphone is within arms reach).

      This wasn’t possible just a few years ago, but it is possible now.

        
      Legacy Family Tree Webinars

      The big kid on the block as far as online genealogy lectures go is Legacy Family Tree Webinars. They have been around since 2010 and started off simply as a way for the genealogy software program Legacy Family Tree to make instructional videos available for their software. They offer a webinar membership for a $50 annual fee giving you full access to their webinar library. Many new webinars on any and every topic are made available free for the live presentation.

      In August 2017, Legacy software and the Family Tree Webinars were purchased by MyHeritage. MyHeritage has allowed them to continue running, with the added advantage of making MyHeritage instructional videos and talks available to everyone for free.

      The long-time host of most of the videos is Geoff Rasmussen. He just celebrated Family Tree Webinars’ 1000th webinar in September with this wonderfully amusing behind-the-scenes video.

      image

        
      Family History Fanatics

      Another not-to-be-missed webinar producer is the family of Andy Lee, Devon Lee and their son Caleb Lee, who call themselves Family History Fanatics. They have their own YouTube channel with 16.7K subscribers where they post their numerous instructional videos and live streams.

      image

      They also produce online webinars and workshops which are well worth the modest $30 ($25 early bird) price they charge for them. Their next is a DNA Workshop: Integrated Tools that will be three instructional talks of 2 hours each on Dec 5, 12 and 19.

      They also from time to time host one-day eConferences. I paid the $20 early bird fee to attend their A Summer of DNA eConference last August, which included 4 talks by Daniel Horowitz, Donna Rutherford, Emily Aulicino and Leah Larkin.

      Their next eConference, on January 25, is called “A Winter of DNA Virtual Conference”. It will feature four DNA experts. I know three of them will be Jonny Perl (DNA Painter), Paul Woodbury (LegacyTree Genealogists) and myself.

      I have given many lectures at genealogy conferences around the world, but this will be my first ever live webinar. It will be about double match triangulation and the ideas behind it and what it can be used for. I’m really looking forward to this.

      Andy doesn’t have the details up yet for the January conference but likely will soon and will then accept registrations. I’ll write a dedicated blog post when registration becomes available.

        
      APG

      The Association of Professional Genealogists (APG) has webinars for anyone interested.

      The APG also has a Virtual Chapter (membership $20 annually) with monthly online presentations by a prominent speaker.

        
      Live Streaming of Genealogical Conferences

      Another wonderful trend happening more and more is the live streaming now being offered by genealogy conferences. Many of the livestreams have been recorded and made available following the conference, so you don’t always have to wake up at 3 a.m. to catch the talk you want.

      MyHeritage Live in Amsterdam took place in September. They have made many of the 2019 lectures available. Lectures from their MyHeritage Live 2018 from Oslo are also still available for free.

      The first ever RootsTech in London took place last month. A few of their live stream videos have been made available. You can find quite a few Salt Lake City RootsTech sessions from 2019 and from 2018 still available for free.

      The National Genealogical Society offered 10 live stream sessions for their conference last May for $149.

      With regard to big conferences, nothing compares to being there in person. But when you can’t make it, you can still feel the thrill of the conference while it happens with live streams, and enjoy the recordings of some of their sessions later.



      What’s Next?

      The next webinar I plan to watch is another Virtual Genealogy Association webinar: "Artificial Intelligence & the Coming Revolution of Family History" presented by Ben Baker this Saturday morning, Nov 16.

      Never stop learning.

      What’s next on your agenda?

      Using DMT, Part 2: My GEDmatch data

      October 25, 2019, Friday 4:47:05

      In my last blog post, I analyzed my segment matches at 23andMe with Double Match Triangulator. This time let’s do the same but with my GEDmatch segment matches.



      Getting Segment Match Data from GEDmatch

      At GEDmatch you need their Tier 1 services (currently $10 a month) in order to download your segment matches. But you can download anyone’s segment matches, not just your own. The catch is that they don’t include close matches of 2100 cM or more, meaning they won’t include anyone’s parents, children, siblings, and maybe even some of their aunts, uncles, nephews or nieces. That could in some cases be problematic because people who should triangulate will not if you don’t include their close matches. Even so, that should still be okay in DMT since DMT’s premise is to use the matches and triangulations that exist, with the idea that there will generally be enough of those to be able to determine something.

      GEDmatch not too long ago merged their original GEDmatch system and their Genesis system into one. Now all the testers on GEDmatch who used to be in two separate pools can be compared with each other. While doing so, GEDmatch also changed their Segment Search report, which now provides a download link. The download is in a different format than their screen report is and used to be. With all these changes, if you have old GEDmatch or Genesis match file reports that DMT helped you download, you should recreate each of them again in the new format. Check DMT’s new download instructions for GEDmatch.

      When running GEDmatch’s Segment Search, the default is to give you your closest 1000 kits. I would suggest increasing that to give you your 10,000 closest kits. I’ve determined that it is definitely worth the extra time needed to download the 10,000 kits. For example, in my tests, comparing 1000 kits versus 1000 kits will average about 50 people in common, whereas comparing 1000 kits versus 10,000 kits will average 350 people in common. So 300 of the 1000 people in common with Person A are not in the first 1000 kits in common with Person B, but are in the next 9,000 kits. When using the 1000s, DMT can cluster 18% of the people. When using 10,000s, that number goes up to 56%.
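The "people in common" count above is really just a set intersection between two match lists. Here is a minimal sketch of that idea in Python. It is not DMT's code: the function name and the kit IDs are made up for illustration, and the tiny lists only stand in for the 1000 and 10,000 kit downloads.

```python
def people_in_common(kits_a, kits_b):
    """Return the set of kit IDs that appear in both match lists."""
    return set(kits_a) & set(kits_b)


# Hypothetical kit IDs, for illustration only.
person_a_top = ["K1", "K2", "K3", "K4"]             # Person A's match list
person_b_top = ["K2", "K9"]                         # Person B's shorter list
person_b_deep = person_b_top + ["K3", "K4", "K7"]   # Person B's longer list

shallow = people_in_common(person_a_top, person_b_top)
deep = people_in_common(person_a_top, person_b_deep)

# People who only show up once Person B's longer list is used.
gained = deep - shallow
```

In this toy data the longer list adds two more people in common, which is the same effect as going from about 50 to about 350 in my tests.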

      It can take anywhere from a few minutes to an hour to run the 10,000 kit Segment Search at GEDmatch, so if you have 10 kits you want to get segment matches for, it could take the good part of a day to complete.

      I downloaded the segment match data for myself and 8 people I match to using the 10,000 kit option. They include 3 relatives I know, and 5 other people who I am interested in. 

      My closest match is my uncle who shares 1958 cM with me. GEDmatch says:

      image

      I really don’t know why GEDmatch does this. To find the one-to-one matches of a few close relatives and include them in the segment match list would use only a tiny fraction of the resources required overall by their segment match report. The penalty of leaving out those close matches is huge as matches with all siblings and some uncles/aunts, nephews and nieces are left out. Parents and children are also left out, but they should match everywhere.

      I have added into DMT an ability to download the one-to-one matches from GEDmatch for your matches that the Segment Search does not include. In my case, my uncle is 1958 cM, so I didn’t need to do this for him. You can also use the one-to-one matches to include more distant relatives who didn’t make your top 1000 or 10000 people.



      Entering my Known Relatives’ MRCAs in my People File

      These are the 3 people at GEDmatch that I know my relationship to and who is our Most Recent Common Ancestor (MRCA).

      1. My uncle on my father’s side, MRCA = FR, 1958 cM
      2. A daughter of my first cousin on my mother’s side, MRCA = MR, 459 cM
      3. A third cousin on my father’s mother’s father’s side, MRCA = FMFR, 54 cM

      (The F, M, and R in the MRCA refer to Father, Mother and paiR of paRents. So an MRCA of MR are your mother’s parents. FMFR are your father’s mother’s father’s parents. MRCAs are always from the tester’s point of view.)
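To make the notation concrete, here is a small Python sketch that expands an MRCA code into words. The function name and the wording it produces are my own illustration, not something DMT exposes. I am assuming a code is a string of F and M steps, optionally ending in R, and that a code without the trailing R points at a single ancestor rather than a pair.

```python
STEPS = {"F": "father", "M": "mother"}


def describe_mrca(code):
    """Expand an MRCA code such as 'FMFR' into words, from the tester's view."""
    exact = code.endswith("R")          # trailing R = the paiR of paRents
    path = code[:-1] if exact else code
    if not path or any(ch not in STEPS for ch in path):
        raise ValueError(f"not a valid MRCA code: {code!r}")
    words = "my " + " ".join(STEPS[ch] + "'s" for ch in path)
    if exact:
        return words + " parents"
    return words[:-2]                   # drop the trailing "'s"


examples = [describe_mrca(c) for c in ("FR", "MR", "FMFR", "MF")]
```

The three codes I entered above come out as "my father's parents", "my mother's parents" and "my father's mother's father's parents".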

        Since DNA shared with my uncle can come from either of my paternal grandparents, and since DNA from my 1C1R can come from either of my maternal grandparents, their double matches and triangulations do not help in the determination of the grandparent. However, they should do a good job separating my paternal relatives from my maternal relatives.

        The third cousin will allow me to map people to my FM (father’s mother) grandparent and to my FMF (father’s mother’s father) great-grandparent.

        Let’s see how this goes.



        Painting

        At GEDmatch, I only have the one third cousin to work with to determine grandparents. My cousin only shares 54 cM or about 1.5% of my DNA. The process DMT uses of automating the triangulations and extending them to grandparents, then clustering the matches and repeating, results in DMT being able to map 44% of my paternal side to grandparents or deeper. This is using just this one match along with my uncle and my 1C1R.

        Loading the mappings into DNA Painter gives:

        image

        If you compare the above diagram to the analysis from my 23andMe data I did in my previous post, you’ll see a few disagreements where this diagram is showing FM regions and the 23andMe results show FF regions. These estimates are not perfect. They are the best possible prediction based on the data given. If I was a betting man, I would tend to trust the 23andMe results more than the above GEDmatch results in these conflicting regions because 23andMe had many more MRCAs to work with, including some on both the FF and FM sides, than just the one FMF match I have here at GEDmatch.

        The bottom line is the more MRCAs you know, the better a job DMT can do in determining triangulation groups and the ancestral segments they belong to. Nonetheless, using only one MRCA that specifies a grandparent, this isn’t bad.



        Clustering

        DMT clusters the 9998 people in my GEDmatch segment match file as follows:

        image

        DMT clustered 39% of the people I match to as paternal and 17% as maternal.

        34% were clustered into one big group, my Father’s Mother’s Father (FMF) which is higher than the expected percentage (12.5%). My third cousin whose MRCA is FMFR might be biasing this a little bit. Every additional MRCA you know will add information that DMT can work with to help improve its segment mappings and clustering. I just don’t happen to know any more at GEDmatch, so these are the best estimates I can do from just the GEDmatch data. As more people test, and I figure out how some more of my existing matches are connected, I should be able to add new MRCAs to my GEDmatch runs.



        Grandparents on My Mother’s Side

        I have one relative on my mother’s side included here at GEDmatch. Being the daughter of my first cousin, she is connected through both my maternal grandparents. Her MRCA is MR and DMT can’t use her on her own for anything more than classifying people as maternal.

        I don’t like the idea of being unable to map grandparents on my mother’s side. I don’t have any MRCAs to do that.

        But I have an idea and there is something I can try.  If I take a look at the DMT People File and scroll down to where the M cluster starts, you’ll see my 1C1R named jaaaa Saaaaaa Aaaaaaa with her MRCA of MR. Listed after her are all the other people that DMT assigned the cluster M to and they are shown by highest total cM.

        The next highest M person matches me with 98.2 cM. That is a small enough number that the person will be at least at the 2nd cousin level, but more likely the 3rd or 4th cousin level. Since they are further than a 1st cousin, they should not be sharing both my maternal grandparents with me, but should only share one of them. I don’t know which grandparent that would be, but I’m going to pick one and assign them an MRCA to it. This will allow DMT to distinguish between the two grandparents. I’ll just have to remember that the grandparent I picked might be the wrong one.

        So I assign the person Maaaa Kaaaaa Aaaaaaaaa who shares 98.2 cM with me the MRCA of MF as shown below. Note I do not add the R at the end and make it MFR. The R indicates the paiR of paRents, and indicates you know the exact MRCA and share both of those ancestors. DMT accepts partial MRCAs like this.

        image

        DMT starts by assigning the MRCA you give it. Unlike known MRCAs whose cluster is always based on the MRCA, DMT could determine a different cluster for partial MRCAs.

        I run DMT adding the one partial MRCA to this one person. Without even downloading the segment match file for Maaaa Kaaaaa Aaaaaaaaa, DMT assigned 976 of the 1674 people that were in the M cluster to the MF cluster, leaving the other 698 people in the M cluster. In so doing, it painted 11.5% of the maternal chromosome by calculating triangulations with Maaaa Kaaaaa Aaaaaaaaa and then extending the grandparents based on other AC matches on the maternal side that overlap with the triangulations.

        That’s good. Now what do you think I’ll try? Let’s try doing it again. Likely many of those 698 people still assigned cluster M are on the MM side. So let’s take the person in the M cluster with the highest cM and assign them an MRCA of MM. But we have to be a bit careful. That person with 86.3 cM named Jaaaaaaa Eaaaaa Kaaaa Aaaaaaaa has a status of “In Common With B”. That means they don’t triangulate with any of the B people. If they don’t triangulate, DMT cannot assign the ancestral path to others who also triangulate in the same triangulation group at that spot. So I’ll move down to the next person named Eaaaa Jaaaaaa Maaaaaa whose status is “Has Triangulations” and shares 78.5 cM and assign the MRCA of MM to her, like this:

        image

        After I did that, I ended up with 296 people assigned the MM cluster, 859 assigned MF (so 117 were changed), and 519 still left at M, meaning there were still some people that didn’t have matches in any of the triangulation groups that DMT had formed, or maybe they had the same number of MF matches as they do MM matches so there was no consensus. Or maybe the people I labelled MM and MF are really MMF and MFM and these people left over are MMM or MFF, thus not triangulating with the first two. It could be any of those reasons, or maybe their segment is false; I don’t know which. In any case, I now have 23% of my maternal chromosomes painted to at least the grandparent level.

        If I load this mapping into DNA Painter, it gives me:

        image

        and yahoo!  I even have a bit of my one X chromosome painted to my MM side.



        Confirming the Grandparent Side

        This is nice. But I still have to remember that I’m not sure whether MF and MM are correct as MF and MM or if they are reversed.  As it turns out, I can use clustering information that I did at Ancestry DNA to tell. Over at Ancestry, I have 14 people whose relationships I know that have tested there. And some of them are on my mother’s side.

        I have previously used the Leeds Method and other clustering techniques to try to cluster my Ancestry DNA matches into grandparents. Here’s what my Leeds analysis was, with the blue, green, yellow and pink being my FF, FM, MF and MM sides.

        And lucky me, I did find a couple of people on GEDmatch who tested on Ancestry who were in my MM clusters from DMT. And in fact they were in my pink grouping which is my mother’s mother’s side in my Leeds method. So this does not prove, but gives me good reason to believe that I’ve got the MF and MM clusters designated correctly.



        Extending Beyond Grandparents

        You would think I could extend this procedure. After all, now that I have a number of people who are on the MF side, some of them should be MFM and some should be MFF.

        Well I probably could, but I’d have to be careful. Because now I have to be sure the people I pick are 3rd cousins or further. Otherwise, they match me on both sides. So there is a limit here. This might be something I explore in the future. If you can understand anything of what I’ve been saying up to now, feel free to try it yourself.



        Same for Paternal Grandfather

        I now have people mapped to 3 of my grandparents:  FM, MF and MM.  I can do the same thing to get some mapped to my father’s father FF. In exactly the same way that I did it for my maternal grandparents, I can assume that my highest F match is likely FF because it would have been mapped FM if it was not. So I’ll give an MRCA of FF to Eaaaa Jaaaaaaaa Kaaaaaaaa who matches me 71 cM on 9 segments, where one of those segments triangulates with some of my B people.

        I run everything again and now people are clustered this way:

        image

        The FF assignments stole some people from the other Fxx groups. There are still a lot of people clustered into FMF, but that is possible. Some sides of your family may have more relatives who DNA tested, even a lot more than others do.

        Loading my grandparent mappings into DNA Painter now gives me this:

        image

        You can see all 4 of my grandparents (FF=blue, FM=green, MF=pink, MM=yellow) and a bit more detail on my FM side.



        Filling in the Entire Genome

        If I had enough people who I know the MRCA for who trace back to one of my grandparents or further, and all of my triangulations with them covered the complete genome, then theoretically I’d be able to map ancestral paths completely. Endogamy does make this more difficult, because many of my DNA relatives are related on several sides. But the MRCA, because it is “most recent”, should on average pass down more segments than the other more distant ancestors do. Using a consensus approach, the MRCA segments should on average outnumber the segments of the more distant relatives and should be expected to suggest the ancestor who likely passed down the segment.

        Right now in the above diagram, I have 34% of my genome mapped.

        So if I go radical, and assume that DMT has got most of its cluster assignments correct, then why don’t I try copying those clusters into the MRCA column and run the whole thing again. That will allow DMT to use them in triangulations and those should cover most of my Genome. Let’s see what happens.

        image

        Doing this has increased my coverage from 34% to 60% of my genome. About 25% of the original ancestral path assignments were changed because of the new assumptions and because the “majority rules” changed.  

        With DMT, the more real data you include, the better the results should be. What I’m doing here is not really adding data, but telling DMT to assume that its assumptions are correct. That sort of technique in simulations is called bootstrapping. It works when an algorithm is known to converge to the correct solution. I haven’t worked enough with my data to know yet whether its algorithms converge to the correct solution, so at this point, I’m still hypothesizing. The way I will be able to tell is if with different sets of data, I get very similar solutions. My matches from different testing companies likely are different enough to determine this. But I’m not sure if I have enough known MRCAs to get what would be the correct solution.

        Let’s iterate a second time. I copy the ancestral paths of the 25% that were changed over to the MRCA column. This time, only 3% of the ancestral path assignments got changed. So we are converging on a solution that at least DMT thinks makes sense.

        I’ll do this one more time, copying the ancestral paths of the 3% that were changed over to the MRCA column.  This time, only 1.5% changed.
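The feed-the-output-back-in loop I've been doing by hand can be sketched in a few lines of Python. This is only an illustration of the general idea, not how DMT works internally. The `run_once` callable is a hypothetical stand-in for one full DMT run, `toy_run` is a made-up replacement so the example can execute, and the tolerance and round limit are arbitrary.

```python
def iterate_until_stable(run_once, assignments, tolerance=0.02, max_rounds=10):
    """Feed each round's output back in until few assignments change.

    run_once: hypothetical callable taking {person: path} and returning a new dict.
    Returns (final_assignments, fraction_changed_per_round).
    """
    history = []
    for _ in range(max_rounds):
        updated = run_once(assignments)
        changed = sum(1 for k in updated if updated[k] != assignments.get(k))
        fraction = changed / len(updated) if updated else 0.0
        history.append(fraction)
        assignments = updated
        if fraction <= tolerance:       # close enough to a fixed point
            break
    return assignments, history


def toy_run(current):
    """Toy stand-in for a run: refine the first plain 'M' label to 'MF'."""
    updated = dict(current)
    for person, path in current.items():
        if path == "M":
            updated[person] = "MF"
            break
    return updated


start = {"p1": "M", "p2": "M", "p3": "FF", "p4": "FM"}
final, history = iterate_until_stable(toy_run, start)
```

A shrinking fraction of changes only says the process has settled down. As noted above, it does not by itself say the settled answer is the correct one.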

        Stopping here and loading into DNA Painter gives 63% coverage, remaining very similar to the previous diagram, although I’m surprised the X chromosome segment keeps popping in and out of the various diagrams I have here.

        image



        What Have I Done?

        What I did above was to document and illustrate some of the experimentation I have been doing with DMT so I can see what it can do, figure out how best to use it, and hopefully map my genome in the process. Nobody has ever done this type of automated determination of ancestral paths before, not even me.

        I still don’t know if the above results are mostly correct or mostly incorrect. A future blog post (Part 3 or later) will see if I can determine that.

        By the way, in all this analysis, I found a few small things to fix in DMT, so feel free to download the new version 3.1.1.

        Using DMT, Part 1: My 23andMe Data

        October 18, 2019, Friday 0:22:19

        I am going to show you how I am using Double Match Triangulator, and some of the information it provides, at least for me.

        My own DNA match data is difficult to analyze. I come from a very endogamous population on all my sides that originates in Romania and Ukraine. The endogamy gives me many more matches than most people, but because my origins are from an area that has scarce records prior to 1850, I can only trace my tree about 5 generations. Therefore the vast majority of my matches are with people I may never be able to figure out the connection to. But there should be some that I can, and that is my goal, to find how I’m related to any of my 5th cousins and closer who DNA tested.

        I’ve tested at Family Tree DNA, 23andMe, Ancestry, MyHeritage, and I’ve uploaded my DNA to GEDmatch. I have also tested my uncle (father’s brother) at Family Tree DNA and uploaded his DNA to MyHeritage and GEDmatch.

        Ancestry does not provide segment match data, so the only way to compare segments with an Ancestry tester is if they’ve uploaded to GEDmatch.



        Where to Start

        With DMT, the best place to start is with the company where you have the most DNA relatives whose relationship you know. I know relationships with:

        • 14 people at Ancestry, but they don’t give you segment match data
        • 9 people at 23andMe.
        • 3 people at GEDmatch.
        • 2 people at MyHeritage.
        • 2 people at Family Tree DNA

        So I’ll start first with the 9 people at 23andMe. The somewhat odd thing about those 9 people is that they are all related on my father’s side, meaning I won’t be able to do much on my mother’s side.

        We’ll see what this provides.



        Getting Segment Match Data from 23andMe

        At 23andMe, you can only download the segment match data of people you administer. That means you have to ask your matches if they would download and send you their match data if you want to use it.

        But there is an alternative. If you subscribe to DNAGedcom, you can get the segment match files of any of the people you match to. I have a section of the DMT help file that describes how to get 23andMe match data.

        I used DNAGedcom to download my own segment matches, as well as the segment matches of the 9 relatives I know relationships with at 23andMe, plus 7 other people I’m interested in that I don’t know how I’m related to.



        Entering my Known Relatives’ MRCAs in my People File

        First I load my own 23andMe match file as Person A in DMT. Then I run that file alone to create my People file. DMT tells me I have 8067 single segment matches. DMT excludes those that are less than 7 cM and produces a People file for me with the 892 people with whom I share 4244 segments of at least 7 cM. It is sorted by total matching cM, highest to lowest, so that I can see my closest relatives first.
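The filtering and sorting step just described is simple enough to sketch in Python. This is an illustration only. The `(name, cM)` tuple layout, the function name and the sample people are made up, and a real match file has many more columns than this.

```python
from collections import defaultdict

MIN_CM = 7.0  # the smallest segment kept, as described above


def build_people_table(segments, min_cm=MIN_CM):
    """Group segments of at least min_cm by person, closest relatives first.

    segments: iterable of (person_name, cm) tuples (a made-up layout).
    Returns a list of (person_name, total_cm, segment_count).
    """
    totals = defaultdict(lambda: [0.0, 0])
    for name, cm in segments:
        if cm >= min_cm:                # drop the small segments
            totals[name][0] += cm
            totals[name][1] += 1
    rows = [(name, total, count) for name, (total, count) in totals.items()]
    rows.sort(key=lambda row: row[1], reverse=True)   # highest total cM first
    return rows


sample = [("Ann", 25.0), ("Bob", 6.9), ("Ann", 8.5), ("Cy", 12.0), ("Bob", 40.0)]
table = build_people_table(sample)
```

Bob's 6.9 cM segment is dropped, so he is listed with one segment of 40 cM, ahead of Ann with two segments totalling 33.5 cM.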

        I go down the list and find the 9 relatives I know and enter our Most Recent Common Ancestors (MRCAs). Here’s the first few people in my list whose names I altered to keep them private:

        image

        Naa is the daughter of my first cousin, i.e. my 1C1R. She is my closest match at 23andMe and we share 541 cM on 22 segments that are 7 cM or greater. Our MRCA from my point of view is my Father’s paRents, so I enter FR as our MRCA.

        Daaaaa Paaaaa is my father’s first cousin sharing 261 cM. So he is also my 1C1R but since he is my Father’s Father’s paRent’s daughter’s son, he gets an MRCA of FFR.

        Similarly, Baaaa Raaaaaa and Raaa Raaaaaa are brothers who are both 2nd cousins sharing 153 and 143 cM with me. Their MRCA is FFR.

        The other two people I marked FFR are my 2C1R sharing 152 cM and 94 cM.

        I also have 3 people related on my father’s mother’s side. The two FMFR’s shown above are 3rd cousins on my father’s mother’s father’s side sharing 90 cM and 84 cM. Also on line 96 (not shown above) is a 3rd cousin once removed sharing 58 cM who I’ve given an MRCA of FMFMR.



        Single Matching

        I save the People file with the 9 MRCAs entered, and I’ll run that file alone again. This time, it uses the MRCAs and paints any matching segments of at least 15 cM to the MRCA ancestral path.

        This is exactly what you do when you use DNA Painter. You are painting the segment matches whose ancestral paths you know onto their places on the chromosome.

        The reason why single segment matches must be 15 cM to paint is because shorter single segment matches might be a random match by chance. That can be true even if it is a close relative you are painting. It’s better to be safe than sorry and paint just the segments you are fairly certain are valid. Beware of small segments.

        DMT tells me it is able to paint 15.3% of my paternal segments to at least the grandparent level using the 9 people. My closest match, my first cousin once removed doesn’t help directly, because some of her segment matches may be on my father’s father’s side, and some may be on my father’s mother’s side, and we can’t tell which on their own. But her matching segments may overlap with one of the other relative’s matches, and hers can then be used to extend that match for that grandparent. DMT does this work for you.

        With this data, DMT cannot paint any of my maternal segments. I’d need to know MRCAs of some of my maternal relatives at the 2nd cousin level or further to make maternal painting possible.

        DMT produces a _dnapainter file that I can upload to DNA Painter. This is what it looks like in DNA Painter showing the 15.3% painted on my paternal chromosomes:

        image

        My father’s father’s side (in blue), could only be painted to the FF level because I don’t have any MRCAs beyond FFR.

        My father’s mother’s side (in green), was paintable to the FMFM level because I had a 3C1R with an MRCA of FMFMR.  But there’s less painted on the FM side than the FF side because my FM relatives share less DNA with me.



        Double Matching and Triangulating

        Now let’s use the power of DMT to compare my matches to the matches of my 9 known relatives and the 7 unknown relatives, combine the results, and produce triangulation groups and finally produce some input for DNA Painter.

        The great thing about triangulating is that by ensuring 3 segments all match each other (Person A with Person C, Person B with Person C, and Person A with Person B), it considerably reduces the likelihood of a by chance match, maybe down to segment matches as small as 7 cM, which is the default value for the smallest segment DMT will include.
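Here is a rough Python sketch of the three-way check described above: the A-C, B-C and A-B segments must all cover a common stretch of the same chromosome. It is a simplification under my own assumptions, not DMT's actual algorithm. The `Segment` type and function name are hypothetical, positions are in Mbp, and the 1.5 Mbp default is just a placeholder for whatever minimum overlap you want to require.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    chromosome: str
    start_mbp: float
    end_mbp: float


def common_overlap(ac: Segment, bc: Segment, ab: Segment,
                   min_overlap_mbp: float = 1.5) -> Optional[Segment]:
    """Return the region shared by all three segments, or None if too small."""
    if not (ac.chromosome == bc.chromosome == ab.chromosome):
        return None
    start = max(ac.start_mbp, bc.start_mbp, ab.start_mbp)
    end = min(ac.end_mbp, bc.end_mbp, ab.end_mbp)
    if end - start < min_overlap_mbp:   # overlap too short to trust
        return None
    return Segment(ac.chromosome, start, end)


a_c = Segment("5", 10.0, 30.0)   # Person A with Person C
b_c = Segment("5", 12.0, 25.0)   # Person B with Person C
a_b = Segment("5", 20.0, 40.0)   # Person A with Person B

triangulated = common_overlap(a_c, b_c, a_b)
too_short = common_overlap(a_c, b_c, Segment("5", 24.5, 40.0))
```

The first trio shares the stretch from 20 to 25 Mbp, so it triangulates there. The second trio only shares half an Mbp, so the sketch rejects it.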

        This allows DMT to paint 46.1% of the paternal DNA, about 3 times what was possible with just single matching. In DNA Painter, this looks like:

        image



        Clusters

        DMT also clusters the people I match to according to the ancestral paths that the majority of their segments were assigned to.

        DMT clustered the 892 people that I match to as follows:

        image

        There were 83 U people who DMT could not figure out if they were on the F or M side. There were 75 F people who DMT could not figure out if they were on the FF or FM side. And there were 19 X people who didn’t double match and were only a small (under 15 cM) match in my segment match file, so these were excluded from the analysis.

        Overall 63% of the people I match to were clustered to my father’s side, 24% to my mother’s side. Here’s how my closest matches (see first diagram above) look after they were clustered:
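One way to picture the clustering is as a vote over the ancestral paths of a person's segments. The Python sketch below is my own guess at a simple version of that, not DMT's real rules. The function name is hypothetical, it takes the most common path as the winner, and on a tie it falls back to whatever the tied paths share, which is how a person could end up as just F or as U.

```python
from collections import Counter
from os.path import commonprefix


def cluster_for(paths):
    """Pick a cluster label from the ancestral paths of one person's segments."""
    if not paths:
        return "U"                       # nothing to go on
    counts = Counter(paths)
    best = max(counts.values())
    leaders = [p for p, n in counts.items() if n == best]
    if len(leaders) == 1:
        return leaders[0]                # a clear winner
    shared = commonprefix(leaders)       # e.g. "FF" and "FM" share "F"
    return shared or "U"


labels = [
    cluster_for(["FF", "FF", "FM"]),     # mostly FF
    cluster_for(["FF", "FM"]),           # tied within the father's side
    cluster_for(["FF", "MF"]),           # tied across both sides
]
```

With this toy rule the three people come out as FF, F and U.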

        image

        So it looks like quite a few of my closest matches whose MRCA I don’t know might be on my FF side. That tells me where I should start my search for them.

        There are also 4 matches that might be on my mother’s (M) side that I can try to identify.

        All but two of my top matches have at least one segment that triangulates with me and at least one of the 16 people whose match files I ran against. All the details about every match and triangulation are included in the map files that DMT produces, so there’s plenty of information I can look through if I ever get the time.

        Next post:  GEDmatch.

        DMT 3.1 Released

        October 17, 2019, Thursday 3:34:21

        I’m working on a series of articles to show how I am using the new 3.0 version of Double Match Triangulator to analyze my own segment match data.

        As much as I’d like you to believe that I’ve developed DMT for the good of genetic genealogists everywhere, I humbly admit that I actually developed it so that I could analyze my own DNA to help me figure out how some of my DNA matches might be related.

        Of course, as I started working on my articles, first looking at my 23andMe matches, I found some problems in my new 3.0.1 version, and a few places I could make enhancements.

        If you downloaded version 3.0 that was released on Oct 1, or 3.0.1 on Oct 3, please upgrade to 3.1 whenever you can.

        image

        If you are a member of the Genetic Genealogy Tips and Tricks group on Facebook, or the DNA Painter User Group on Facebook, my free trial key I gave there is still valid and will let you run the full version of DMT until the end of October. Just look for my post on Oct 2 on either group for the key.

        Some of the enhancements in Version 3.1 include:

        • The grandparent extension algorithm:  I found a few extra extensions where they should not be. So I completely changed the algorithm to one that was clearer and easier for me to verify that it is working properly. The final results are similar to those of the old algorithm, but the differences are significant enough to have a noticeable effect on the grandparent assignments.
        • Small improvements were made to the determination of triangulation boundaries.
        • Internally, I increased the minimum overlap DMT uses from 1.0 Mbp to 1.5 Mbp. This was to prevent some incorrect overlaps between two segments when there was a bit of random matching at the overlapping ends of the segment.

        And there were a few bug fixes:

        • If you run 32-bit Windows, then the DMT installer installs the 32-bit version of DMT rather than the 64-bit version. The 32-bit version had a major bug that crashes it when writing the People file. Nobody complained to me about this, so I guess most of you out there are running 64-bit Windows. Maybe one day in the not-too-distant future, I’ll only need to distribute just the 64-bit version of DMT.
        • In the Combine All Files run, a few matches were not being assigned an AC Consensus when they should have been. Also a few assignments of AC No Parents were made when there was a parent.
        • 23andMe FIA match files downloaded using DNAGedcom were not being input correctly.

        Now back to see what DMT can do for me.

        The GEDCOM 5.5.5 Initiative and Making It Work

        October 7, 2019, Monday 7:47:24

        It’s been 35 years since GEDCOM 1.0 was released to the genealogical software development community in 1984.

        It’s been 20 years since GEDCOM 5.5.1, the last official specification by FamilySearch, was released on October 2, 1999.

        Five days ago, October 2, 2019, the gedcom.org website was renewed containing the newly-released GEDCOM 5.5.5 specifications.

        GEDCOM was originally an acronym for GEnealogical Data COMmunication. It has been the standard that genealogical software has been using for the past 35 years. It specifies how genealogical data should be exported to a text file so the data can be preserved in a vendor-neutral form (separate from their proprietary databases) in a format that other programs will be able to import.

        For 15 years, between 1984 and 1999, the GEDCOM standard was developed and made available by The Church of Jesus Christ of Latter-day Saints (LDS). They had a team in place that had discussions with many genealogical software vendors. They prepared the standard by taking all the ideas and figuring out how computer programs of the day could and should transfer data between programs.

        They were very successful. There have been hundreds of genealogy programs developed since then, and just about every single one of them adopted the standard and can import GEDCOM and export GEDCOM. I don’t know of too many standards that gained nearly 100% acceptance. To me, that is a very successful adoption of a standard in any field.

        Why was the LDS so successful with GEDCOM? My take on that is because of the way they operated.

        1. They had a team in place to develop the standard.
        2. They sent drafts to developers and solicited feedback.
        3. The team evaluated all the feedback, suggested possible implementations, and evaluated conflicting ideas.
        4. And most importantly: one person acted as editor and made the decisions. I believe that may have been Bill Harten, who has been called the “Father of GEDCOM”.  


        What Happened in 1999?

        In 1999, the LDS decided it no longer was going to continue the development of GEDCOM and it disbanded the team. The last GEDCOM version the LDS issued was 5.5.1, which was still labeled beta, but was definitely the de facto standard because the LDS itself used it, and not the 5.5 version, in their very own Personal Ancestral File (PAF) software.

        The standard was very good, but not perfect. Each software developer used GEDCOM basics for important items like names, linkages between children and families, families and parents, birth, marriage, death dates and places.

        GEDCOM had more than that. Way more. It had lots of different events and facts. It had links to multimedia. It had sources, repositories and source references. Technically it had everything you needed to properly source and reference your genealogical material. The funny thing was that it was ahead of its time.

        Genealogical programs back then and genealogists in particular did not have the understanding of the need to document our sources. We were all name collectors back then. C’mon admit it. We were. We only learned years later our folly and the importance of sources.

        So in the next 10 years after 1999, software developers started adding source documentation to their programs. And in doing so, most of them ignored or only loosely used the sourcing standards that GEDCOM included. They did so because no one else was exporting sources with GEDCOM. What happened is that each of them developed their own unique sourcing structure in their programs and often didn’t export that to GEDCOM, or if they did, they only used some of GEDCOM and developed their own custom tags for the rest, which none of the other programs would be able to understand.

        This continued to other program features as well. For example, adding witness information or making a place a top level record was not something the last version of GEDCOM had, so developers, and even groups of developers came out with extensions to GEDCOM (such as Gedcom 5.5EL).

        The result:  Source data and a lot of other data did not transfer between programs, even though almost all of them were exporting to and importing from what should have been the very same GEDCOM.


        Two Attempted Fixes

        About 2010, a BetterGEDCOM grassroots initiative was started by Pat Richley-Erickson (aka Dear Myrtle) and a number of others. For 10 years, too much data, and especially source data, was not transferring between programs.

        At the time I wrote in a blog post about BetterGEDCOM:

        “The discussion is overwhelming. To be honest, I don’t see how it is going to come together. There are a lot of very smart people there and several expert programmers and genealogists who seem to be having a great time just enjoying the act of discussing and dissecting all the parts. … I really hope they come back to earth and accomplish something. I’ve suggested that they concentrate on developing a formal document.”

        There was a very large amount of excellent discussion about where GEDCOM should go and what was wrong with it and what should be fixed. But nobody could agree on anything.

        The reason in my opinion why this didn’t work out:  Because the discussion took over, similar to a Facebook group where everyone had their own opinion and no one would compromise.

        The one thing missing was an editor. Someone who would take all the various ideas and make the decision as to the way to go.

        After a few years, the BetterGEDCOM group realized it wasn’t getting anywhere. Their solution was to create a formal organization with by-laws and a Board of Directors to spearhead the new initiative. It was called FHISO and they created the website at fhiso.org. They obtained the support of many software vendors and genealogical organizations.

        On March 22, 2013, FHISO initiated their standards development process with an Open Call for Papers. Scores of papers were submitted, two by myself.

        Then discussion started and continued and continued and continued. There was little agreement on anything. They had excellent technical people involved, but the discussion was often too technical for even me to understand. It got involved in externalities far beyond what GEDCOM needed, and months were spent on items that were more academic in nature than practical.

        FHISO had hoped to come out with an Extended Legacy Format (ELF) which would update the 5.5.1 standard. They were working on a Serialization Format (which GEDCOM already effectively had) and a Data Model. It’s the data model that is what is wanted and is most important. But to date, after over 6 years, all they have is a document defining Date, Age and Time Microformats.

        Horrors, that document is 52 pages long and is a magnificent piece of work, if you wanted to submit a PhD thesis. Extrapolating that out to the rest of GEDCOM, we’d likely be looking at a 20 year development time for a 50,000 page document which no programmer would be able to use as a standard.


        We Need Something Practical – GEDCOM 5.5.5

        Genealogy technology expert Tamura Jones on his Modern Software Experience website has for years been an in-depth critical reviewer of genealogical software and the use of GEDCOM. I have greatly benefited from several of his reviews of my Behold software which sparked me to make improvements.

        I had the pleasure of meeting Tamura when I was invited to speak at the Gaenovium one-day technology conference in his home town of Leiden, Netherlands in 2014. There I gave the talk “Reading wrong GEDCOM right”. Over the years Tamura and I have had many great discussions about GEDCOM, not always agreeing on everything.

        In May of last year (2018), on his own initiative, Tamura created a much-needed annotated version of the GEDCOM 5.5.1 specification. It arose from his many articles about GEDCOM issues, with solutions and best practices. Prior to release, he had a draft of the document reviewed by myself and six other technical reviewers, and incorporated all of our ideas that made sense with respect to the document. Many of the comments included thoughts about turning the annotated edition into a new version of GEDCOM.

        So I was surprised and more than pleased when a few months ago, Tamura emailed me and wanted reviewers for what would be a release of GEDCOM 5.5.5.

        The goal was to fix what was not right in 5.5.1, remove no longer needed or used constructs, ensure there was just one way to do anything, fix examples and clear up everything that was unclear. This version 5.5.5 was not to have anything new. Only obsolete, deprecated, duplicate, unnecessary and failed stuff was to be taken out.

        The 5.5.5 result was published on October 2, exactly 20 years after 5.5.1 was released. You can find GEDCOM 5.5.5 on the gedcom.org site.

        For a description of what was done, see the Press Release.


        Why GEDCOM 5.5.5 Works and What’s Coming

        GEDCOM 5.5.5 is based on GEDCOM 5.5.1. It included the thoughts and ideas of a number of genealogy software developers and experts, just like the original GEDCOM did. It is practical as it is a repair of what has aged in 5.5.1.

        The reason this document has come out is that it was done the way the LDS originally did it. A draft was sent to experts. Each sent comments, opinions, criticisms and suggestions back to the editor, who then compared the ideas and decided what would work and what wouldn’t.

        I know I spent dozens of hours reviewing the several drafts that were sent to me. I sent back what was surely a couple hundred comments on every piece of the document. I didn’t agree with the document on all items. In the end, some of my arguments were valid and the item was changed. But some were not and I appreciate those decisions, because other reviewers may feel different than me and some final decision must be made. I’m glad Tamura is the one willing to make the decision as I very much respect his judgement. I also greatly thank him for the thousands of hours I know he must have put into this.

        What made this happen?

        1. There was a standard that had already been annotated to start with.
        2. Drafts were sent to developers and feedback was received.
        3. Tamura reviewed all the feedback and considered all possible implementations, and evaluated conflicting ideas,
        4. And most importantly: Tamura acted as editor and made the decisions. He is not selling software himself, so is unbiased in that respect.

        I will support and promote this initiative. I will be adding GEDCOM 5.5.5 support into future versions of Behold. I encourage other genealogical software developers to do so as well.

        This is only a first step. There are enhancements to GEDCOM that many vendors want. These ideas need to be compared to each other, and decisions need to be made as to how they would best be implemented in a GEDCOM 5.7 or 6.0. A similar structure with one respected editor vetting the ideas is likely the best way to make this happen. And as long as the vendors are willing to offer their ideas and compromise on a solution, Tamura will be willing to act as editor to resolve the disputes. I would expect this would give us the best chance of ending up with a new GEDCOM standard that is clear, is easy for genealogy software programmers to implement, enables all our genealogical information to transfer between programs, and is something that genealogy software developers would agree to use.

        It Took a Lot of Effort to Get to DMT 3.0

        Sunday, October 6, 2019, 8:11:49

        I’m cleaning up my directories after over 3 years of development of Double Match Triangulator.

        When I released Version 1 of DMT back in August 2016, I had it write information about the run to a log file, primarily so that I could debug what was going on. But I realized it is useful for the user as it could contain error messages about the input files and statistics about the matches that the user could refer to. Also, if DMT wasn’t working right, I could be sent the log file and it would be a great help in debugging the problem.

        I have not deleted any of my log files since Version 1.0. I want to delete them now because there are a lot of them and they are taking up space and resources on my computer. How much? Well, I’ve got 9,078 log files totaling 421 MB of space.

        There were 1,159 days since I started accumulating log files. I have at least one log file on 533 of those days, so I worked on DMT 46% of the days, i.e. over 3 days a week, averaging 17 runs each day I was working on it. These are almost all development runs, where I’m testing code in DMT and making sure everything is working correctly. The maximum in any day was 89 runs. After every run, I’d check to see what worked and what didn’t, go back to my program to fix what didn’t work and recompile and run a test again. If everything worked, then I’d go onto the next change I needed to make.
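
        For anyone who wants to check my arithmetic, here is a small Python sketch that redoes the calculation using only the totals quoted above. The variable names are my own.

```python
# Totals quoted in the post.
total_days = 1159    # days since the first log file
active_days = 533    # days with at least one log file
log_files = 9078     # total number of log files

share = active_days / total_days              # fraction of days worked
days_per_week = share * 7                     # average working days per week
runs_per_active_day = log_files / active_days # average runs on a working day

print(f"{share:.0%} of days, {days_per_week:.1f} days a week, "
      f"{runs_per_active_day:.0f} runs per working day")
```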

        So I’m going to have a bit of fun and do some statistics about my work over the past three years.

        Here’s a plot of my daily runs:

        image

        Here’s my distribution of working times:

        image

        You can see I don’t like starting too early in the morning, usually not before 9 a.m., but I’m on a roll by noon and over 10% of my runs were from noon to 1 p.m.

        I relax in the afternoon. In the summer if it’s nice, I’m outside for a bike ride or a swim. In the winter I often go for a walk if it’s not too cold (e.g. –30 C).

        Then you can see my second wind starting at 9 p.m. with a good number of late nights to 1 a.m. You can also see the occasional night where I literally dream up something at 4 a.m. and run to my office to write it down and maybe even turn on my computer to try it out.

        The days of the week are pretty spread out. I’m actually surprised that Fridays are so much lower than the other days. Not too sure why.
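
        Tallies like these can be produced from nothing more than a list of run timestamps. The sketch below is only an illustration of the idea in Python, not the method I actually used. The timestamps are made up, and how you would read them from your own log files (for example from file dates) is left as an assumption.

```python
from collections import Counter
from datetime import datetime

# Made-up run timestamps for illustration only.
runs = [
    datetime(2019, 9, 30, 12, 5),
    datetime(2019, 9, 30, 12, 40),
    datetime(2019, 9, 30, 21, 15),
    datetime(2019, 10, 1, 9, 30),
    datetime(2019, 10, 2, 4, 10),
]

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

by_day = Counter(t.date() for t in runs)                  # runs per calendar day
by_hour = Counter(t.hour for t in runs)                   # runs per hour of the day
by_weekday = Counter(WEEKDAYS[t.weekday()] for t in runs) # runs per day of the week

print("Days with at least one run:", len(by_day))
print("Most runs in a single day:", max(by_day.values()))
print("Runs from noon to 1 p.m.:", by_hour[12])
print("Runs by weekday:", dict(by_weekday))
```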

        So that just represents the time I spent on DMT over the past 3 years. It doesn’t include time I spent working on Behold, updating my website, writing blog posts, answering emails, being on social sites, maintaining GenSoftReviews, working on my family tree, deciphering my DNA results, going to conferences, watching online seminars, vacations, reading the paper each morning, watching TV, following tennis, football and hockey, eating breakfast, lunch and supper each day, cleaning the dishes, doing house errands, buying groceries, and still having time to spend with family.

        Wow. It’s 1:11 a.m. I’m posting this and going to bed!

        Double Match Triangulator 3.0 Released

        Wednesday, October 2, 2019, 6:46:32

        After 10 months of hard work, I have finally released DMT 3.0.

        You can download it from www.doublematchtriangulator.com.
        (It is a Windows program. My apologies to Mac or Unix users.)

        All purchasers of DMT get a lifetime license. All updates including this one are free for past purchasers.

        There’s a lot new and improved here.

        • An MRCA (Most Recent Common Ancestor) column has been added to the People file, allowing you to enter the MRCA for each person that you know.
        • The process of segment matching and assignment of ancestral paths to segments and matches is now automated in an expert-system manner that mimics what a person would do to map their chromosomes.
        • You can now select the minimum cM of the matches that you want DMT to include in its analysis.
        • You can now run DMT with only Person A to get the listing of Person A’s matches and people.
        • DMT now handles all the new segment match file formats from the various companies.
        • The Map page, People page and Log file all have extensive revisions.
        • DMT outputs a dnapainter csv file that can be uploaded to www.dnapainter.com.
        • Now uses conditional formatting extensively in the Excel files, so most of the formatting should move when the data is copied and pasted or sorted.
        • Now can filter all matches to a minimum cM.
        • Calculates and uses all inferred matches.
        • Clusters people into their primary ancestral lines.
        • Does parental filtering if one or both parents have DNA tested.


        This is what the DMT program looks like. It is this one window:

        dmt-main-window


        This is what DMT’s Map page looks like, listing every segment match:

        double-match-triangulator-map-file


        This is what DMT’s People page looks like, listing every person matched:

        image


        Here’s an example of an upload to DNA Painter from a DMT file:

        image


        There are many great tools available now for genetic genealogists who are interested in using their DNA to help them figure out their connections to relatives.

        With version 3.0 of DMT, I’ve now made available a tool that automates segment matching, triangulation and chromosome mapping. It’s a bit different from all the others and it should give you new and different insight into your DNA matches.