GEDmatch has just launched a new relationship predictor tool, thanks to a partnership with Briton Nicholson. Here, we get to know the mathematician who is helping to lead the charge in genetic genealogy breakthroughs.
What is your background?
I had been doing traditional genealogy and genetic genealogy for about four years before it really hit me how ripe genetic genealogy was for mathematical and scientific innovation. Genetic genealogy is a budding field, so it really shouldn’t be surprising how much there is left to learn. I think it’s fascinating that people managed to develop calculus roughly 350 years ago, despite how complicated it is, while we have these other fields that have barely been studied. Tens of millions of people have had their DNA genotyped in the past decade, and they’re very successfully using the tools available to them to corroborate or fill in their family trees.
The tools are great, but there was a lot of room for improvement. And that’s a dream opportunity for someone who’s just come out of doing research in a few different fields. I had recently gotten a B.S. in applied mathematics. I had a lot of experience in the natural sciences, getting an M.S. in geophysics/geological oceanography.
A lot of data science, programming, and even some modeling and simulation experience came along with that. Then I ran ocean climate models doing Ph.D.-level oceanography work for two years, followed by two more years in the modeling and simulation department where I worked on models for all different fields. All the while I was taking courses in statistics and using that information to solve problems.
So, whenever I had a question about how we could improve genetic genealogy, I was able to figure it out. I was clustering my matches in Gephi until 2018, when AutoClusters became available at MyHeritage. That tool was developed by my now-friend Evert-Jan Blom at Genetic Affairs (who also partnered with GEDmatch on its AutoTree and AutoCluster tools). I found AutoClusters so impressive that I stopped working on clustering.
Tell us how you are applying math and science to genealogy.
I tried modeling genetic inheritance. I started out with very simple models and eventually made them much more realistic. From the start, I made a point of using statistics from peer-reviewed papers to train my models. I also began incorporating known properties of inheritance, such as recombination interference. One of the most interesting is the difference between paternal and maternal recombination rates, which results in wider ranges of shared DNA for paternal relationships and narrower ranges for maternal ones.
Also interesting is the high variability of grandparent/grandchild relationships compared to other relationship types. This results in high probabilities of paternal and/or grandparent/grandchild relationships at certain centiMorgan (cM) values.
Simulations can generate probability curves for each relationship type relative to the others, which can then be used for relationship prediction. Doing the same with empirical data would be much more difficult and less accurate. Developing simulations turned out to be a great idea because they can be used to answer almost any question in genetic genealogy. Nobody will ever have enough error-free empirical data for 3/4 siblings or pedigree-collapse cases to compete with simulations.
Having a math and statistics background is very handy. I’ve been able to come up with equations describing many properties of genetic inheritance, usually because I needed them for simulations. These include questions like:
· How many different types of 1st cousins can a person have depending on the sex of the intermediate relatives? (4)
· What’s the expected split of DNA that I get from a grandparent pair? (22%/28%)
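The first of those counts can be illustrated by simple enumeration. On a first-cousin path, the intermediate relatives are the sibling pair — your parent and your cousin’s parent — and each can be male or female, giving four combinations. A minimal sketch (the labels are mine, purely illustrative):

```python
from itertools import product

# The two "intermediate relatives" on a first-cousin path are the sibling
# pair: your parent and your cousin's parent. Each can be male or female,
# so there are 2 x 2 = 4 distinct types of first cousins.
SEXES = ("male", "female")

cousin_types = list(product(SEXES, repeat=2))

for my_parent, cousins_parent in cousin_types:
    print(f"my {my_parent} parent is a sibling of my cousin's {cousins_parent} parent")

print(len(cousin_types))  # 4
```

The distinction matters for prediction because, as noted above, paternal and maternal recombination rates differ, so each of the four types has a slightly different shared-DNA distribution.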
Recently, I needed a formula for the amount of distinct DNA that can be found across multiple siblings’ kits. I realized that, for the first time, the formula I needed for genetic genealogy was likely already well known. And it was. I searched for the equations I was coming up with along with the key phrase “mathematical set theory” and, sure enough, there was already an equation for that.
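The interview doesn’t give the exact set-theory equation he found, but the flavor of result involved is the union (inclusion-exclusion) coverage formula: if each locus from one parent is transmitted to each child independently with probability 1/2 (a simplification that ignores linkage between nearby loci), the chance that the locus shows up in at least one of n siblings’ kits is 1 − (1/2)^n. A minimal sketch under that assumption, with a Monte Carlo check:

```python
import random

def expected_coverage(n_siblings: int) -> float:
    """Expected fraction of one parent's DNA found in at least one of
    n siblings' kits, treating each locus as an independent coin flip
    (a simplification that ignores linkage between nearby loci)."""
    return 1 - 0.5 ** n_siblings

def simulated_coverage(n_siblings: int, n_loci: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo check of the closed-form result above."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_loci):
        # the locus is "covered" if at least one sibling inherited it
        if any(rng.random() < 0.5 for _ in range(n_siblings)):
            covered += 1
    return covered / n_loci

for n in (1, 2, 3, 4):
    print(n, expected_coverage(n), round(simulated_coverage(n), 3))
```

One sibling captures 50% of a parent’s DNA, two capture 75% in expectation, three capture 87.5%, and so on — each added kit recovers half of what is still missing.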
What’s required to build new tools and make discoveries is a whole lot of time and work. The early years of genetic genealogy — I’d say prior to 2018 — were dedicated mostly to figuring out what the genotyping companies have to offer. During that time, influential bloggers came to the forefront to show people how to use the tools. But now we’re starting to see real scientific development in genetic genealogy. That took a dramatic turn in 2021, as our understanding of what’s possible grew by leaps and bounds.
Despite that, I don’t think we’ve even come close to realizing the potential of genetic genealogy.
Can you elaborate on your relationship predictor tools?
The genotyping companies have been giving relationship predictions for years. People usually recommend going to a third-party predictor to get the full list of possible relationships for any cM value.
I describe the process for generating relationship prediction probabilities here (https://dna-sci.com/2021/04/06/a-new-probability-calculator-for-genetic-genealogy/). I believe that this was the first time anyone had described the process of relationship prediction. All relationship predictors, including mine, are currently built from simulations. This makes it possible to calculate probabilities based on an equal number of simulated matches from each relationship type.
In the case of my relationship predictor, I used 500,000 data points for each relationship type. I believe that using a large number such as that will provide much more accurate and smoother probability curves. This can be contrasted with the curves from the Ancestry white paper, which appear to have only a few data points plotted.
The next step is to place counts for each relationship type into 1 cM bins. Plots of these frequencies at this stage will show very fuzzy curves. The next step, which is to smooth the data, is probably the most time-consuming and arduous process that I’ve undertaken as a data scientist. I’ve now had to do this over a dozen times, for multiple testing sites and multiple different relationship predictor tools. But this is a huge advantage that my predictors have over others. I ensure that the probability data are smoothed to the point of being monotonic over the appropriate intervals, but without flattening the curves. I plot the smoothed probability curves over the unsmoothed curves for each relationship type, ensuring that the fit is reasonable.
These probabilities are calculated by dividing the count of each relationship type in each bin by the total number of individual pairs from all relationship types. Once these probabilities have been appropriately smoothed, a file is generated containing the probabilities to be used in a relationship predictor.
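The binning and probability steps can be sketched as follows. The shared-cM distributions here are made-up normal stand-ins for a handful of relationship types — not Nicholson’s trained simulation output — and the real pipeline would follow this with the careful smoothing step described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for simulation output: total shared cM for many
# simulated pairs of each relationship type. (The article uses 500,000
# pairs per type; fewer here to keep the sketch fast. Real distributions
# come from trained inheritance simulations, not normal draws.)
N_PAIRS = 50_000
simulated_cm = {
    "half_sibling":  rng.normal(1760, 290, N_PAIRS),
    "first_cousin":  rng.normal(870, 200, N_PAIRS),
    "second_cousin": rng.normal(230, 90, N_PAIRS),
}

# Step 1: place counts for each relationship type into 1 cM bins.
MAX_CM = 3600
bins = np.arange(0, MAX_CM + 1)  # edges of 1 cM wide bins
counts = {
    rel: np.histogram(np.clip(cm, 0, MAX_CM), bins=bins)[0]
    for rel, cm in simulated_cm.items()
}

# Step 2: probability of each relationship type in a bin = that type's
# count divided by the total count across all types in the same bin.
total = sum(counts.values())
probs = {
    rel: np.divide(c, total, out=np.zeros_like(c, dtype=float), where=total > 0)
    for rel, c in counts.items()
}

# In any populated bin, the probabilities sum to 1 across types.
print({rel: round(p[900], 3) for rel, p in probs.items()})
```

The unsmoothed curves produced this way are fuzzy, which is why the smoothing step — making each curve monotonic over the appropriate intervals without flattening it — is where most of the work goes.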
Two months after releasing my first relationship predictor, I added population weights to the data. This is really important because we’re likely to have many more distant relatives than close relatives showing up on our match lists. A person likely has about 80,000 more 8th cousins than 1st cousins. So, although each relationship type started out equally represented in my probability data, with 500,000 pairs each, adding population weights greatly increased the number of distant cousins relative to close cousins. This results in much more realistic predictions for values of about 40 cM and below.
People who have put in the time and work in genetic genealogy know that not many of their 30 cM matches and below are 3rd cousins, but that’s the kind of prediction we were used to seeing. Now, with population weights, you can see here (https://dna-sci.com/tools/brit-cim/) that these matches are much more likely to be 7th to 8th cousins. Population weights are best in cases in which you don’t know how you’re related to a match. But sometimes we have a known relative test and we want to make sure that the amount we share with them is reasonable. For cases such as that, I’ve kept a relationship predictor version that doesn’t use population weights (https://dna-sci.com/tools/unweighted-cim/).
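The weighting step can be sketched as follows. The weights and counts below are illustrative stand-ins, not Nicholson’s actual figures: equal-sized simulation counts in a single low-cM bin are multiplied by how common each relationship type is expected to be in a real match list, then renormalized.

```python
# Hypothetical relative abundances: how many matches of each type a
# typical tester might have (illustrative numbers only).
population_weights = {
    "second_cousin": 1.0,
    "fourth_cousin": 40.0,
    "eighth_cousin": 2000.0,
}

# Unweighted counts in a single (say, 35 cM) bin, drawn from equal-sized
# simulations of each relationship type (also illustrative).
bin_counts = {
    "second_cousin": 120,
    "fourth_cousin": 900,
    "eighth_cousin": 450,
}

def weighted_probabilities(counts: dict, weights: dict) -> dict:
    """Re-weight equal-sized simulation counts by how common each
    relationship type is in real match lists, then renormalize."""
    weighted = {rel: counts[rel] * weights[rel] for rel in counts}
    total = sum(weighted.values())
    return {rel: w / total for rel, w in weighted.items()}

print(weighted_probabilities(bin_counts, population_weights))
```

With these made-up numbers, the unweighted counts would make the match look most like a 4th cousin, while the population-weighted probabilities favor the 8th cousin — the same qualitative shift the weighted predictor shows for small matches.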
How did you get involved with GEDmatch?
I’ve been using GEDmatch consistently since 2017. I recall when the Golden State Killer was caught using GEDmatch and I wrote about it at that time. All of the important tools can be found there. Chromosome browsers are an essential tool for genetic genealogists and so I’ve always appreciated that anyone can upload their DNA for free to GEDmatch and see which chromosomes and segments they share with matches. A hidden gem at GEDmatch is the ability to search trees that have been uploaded by users.
You can then check your DNA against the people who uploaded those trees to see if you’re related. It’s also great that Tier 1 members can search their segments, and make superkits and phased kits at GEDmatch. And, of course, the tools being integrated from Genetic Affairs are a huge help.
Evert-Jan Blom recently suggested that my probability data, which he’s using for his AutoKinship tool, could greatly help users at GEDmatch by showing them the possible relationships for a given total cM value with their matches. Everyone else thought it was a great idea, too, so Evert-Jan put me in touch with them and we made it happen. I look forward to seeing the relationship predictions in use throughout the tools at GEDmatch.
Sign up to GEDmatch for free today!