Clustering Kits

Video Transcription

(00:00):
Clustering is a method of looking at multiple DNA matches at one time to see how you might be related to them, and GEDmatch now has a clustering tool.

(00:16):
Howdy, I’m Andy Lee with Family History Fanatics, and this is a segment of DNA. Be sure to subscribe to our channel and click on the bell if you wanna be notified about upcoming episodes. Just in the last year, there have been several clustering tools that have been developed and put on various websites. GEDmatch now has their own clustering tool as well, and I’m gonna show you some of the features of it. The first thing to remember, though, is that their clustering tool is part of their tier one package, so you have to be a member paying the $10 a month in order to access this tool. So here we are on the GEDmatch website, and if I scroll down to the tier one tools, I’m going to find the clustering tool. It’s called Clusters, Single Kit input, Basic Version.

(01:05):
I click on that and I’m going to get to the screen where it’s gonna ask me for some input. Now what you need in order to cluster is a kit number, and then you need to define what the thresholds are, and we’ll talk about the thresholds in just a second. I’m gonna leave them at the basic thresholds initially. Genesis is now gonna go through and it’s going to start gathering all the data to create the cluster table. Now depending on what your thresholds are, depending on how many matches you have, this could take several minutes. So just sit tight and wait. Once GEDmatch has figured out where everything is, it is now going to display the table. And initially it’s gonna show it to you by the kit number, but as you can see, it’s going to reorganize it based on the average shared centimorgans for each cluster.

(01:49):
Now again, this can take a little bit, but it’s nice to actually watch this graphic as you’re seeing kits move back and forth and you’re starting to see these clusters form. Now depending on what you set your thresholds at, and depending on the matches, this table could be humongous. In this case, this is about 500 matches. If I look across the top, and if I look down the side, I’m gonna see the exact same names of these matches that I have. Now, each one of these matches matches me within the thresholds that I’ve set, but they also match other people within those thresholds as well. And that’s what these boxes all indicate. Now, one of the things that you can do with the clustering tool on GEDmatch is you can actually change how this is going to be displayed. So if I look at the dropdown menu up at the top, it’s going to let me change that by the name, by the kit number, by the cluster number.

(02:44):
What I’m gonna do is I’m gonna change it to the cluster size, and once I change that to the cluster size, everything’s gonna reorganize based on whichever is the biggest cluster, and then on down the line. Now you can see here for this giant cluster at the beginning, they all share on average about 15 or 16 centimorgans except for one person. Now this is gonna be a very distant relationship that all of these people have to myself. So normally what I like to do is I like to focus on those clusters that have the most amount of shared DNA with me, and that would be the cluster average cM to the reference kit. So the cluster that has the highest average between all the matches and the main kit. Now on this page, there’s actually a lot of information that you can access. For instance, if you go to any one of the amounts of shared centimorgans, clicking on it is going to show you the one-to-one comparison for autosomal.
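
The two sort orders described above (by cluster size and by average cM to the reference kit) can be pictured with a small sketch. The cluster numbers and cM values below are invented for illustration; this is not GEDmatch's own code or data, only one way the ordering could work.

```python
from statistics import mean

# Hypothetical clusters: cluster number -> shared cM of each member
# with the reference kit. All values are made up for illustration.
clusters = {
    1: [15.3, 16.1, 15.8, 15.0, 16.4],
    2: [48.0, 35.5],
    3: [22.0, 27.5, 24.1],
}

# Order by cluster size, biggest cluster first.
by_size = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)

# Order by average cM to the reference kit, highest average first.
by_avg_cm = sorted(clusters, key=lambda c: mean(clusters[c]), reverse=True)

print("by size:  ", by_size)
print("by avg cM:", by_avg_cm)
```

In this made-up data the biggest cluster is also the most distant one, which is why sorting by average cM brings a different cluster to the top.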

(03:44):
So you can do a one-to-one comparison with any one of those matches. The other thing is by just clicking on the box itself, the colored box of the match, it will also go to that one-to-one. So there’s two different ways that you can do a one-to-one comparison really quickly between a specific match and yourself. The little i icon is going to open up an information box, and that information box is gonna tell you about that match: their name, the kit number, and the email that is associated with that kit. So let’s talk a little bit about thresholds. I’ve done just the basic thresholds, which were 15 centimorgans to 50 centimorgans. If I go back, I can actually change that number, and changing the thresholds can have a dramatic effect on the cluster chart. It can affect how many people are gonna be showing. It can affect how big some of those clusters are, all because you’re looking at different ranges of the amount of shared centimorgans between all these people.

(04:42):
So in this case, I’m actually going to just change this lower threshold from 15 centimorgans up to 20 centimorgans, and then I’m gonna click on my cluster. So that’s all I’ve done is just change that by 5 centimorgans. Now you can see initially right off instead of 500 matches, it’s only found 371 matches that meet that criteria. So I’ve already decreased the total matches by a third just by going up by 5 centimorgans. Now where this could be important is for people that have a lot of endogamy in their family history. By changing the amount of the threshold from 15 to 20, you’re gonna eliminate a lot of people that may be even more distantly related than what you want to be looking at. So now that the chart is finished, it actually looks a little bit different. I’ve blocked out the names, but I can tell you right now that these names at the top are not the same ones that were at the top before, all because of that change in the threshold.
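
As a rough sketch of what the lower and upper thresholds do, the filter below keeps only the matches whose shared cM falls inside the chosen range. The kit numbers and cM values are hypothetical, and treating both thresholds as inclusive is an assumption; GEDmatch's actual rules may differ.

```python
# Hypothetical match list: (kit number, shared cM with the reference kit).
# These values are invented for illustration; they are not GEDmatch output.
matches = [
    ("A100001", 15.2),
    ("A100002", 18.7),
    ("A100003", 22.4),
    ("A100004", 49.9),
    ("A100005", 63.0),
    ("A100006", 12.1),
]

def within_thresholds(matches, lower_cm=15.0, upper_cm=50.0):
    """Keep matches whose shared cM lies between the two thresholds.

    Inclusive bounds are an assumption made for this sketch.
    """
    return [(kit, cm) for kit, cm in matches if lower_cm <= cm <= upper_cm]

basic = within_thresholds(matches)                  # 15 to 50 cM
raised = within_thresholds(matches, lower_cm=20.0)  # 20 to 50 cM

print(len(basic), "matches at 15-50 cM")
print(len(raised), "matches at 20-50 cM")
```

Raising the lower threshold by 5 cM removes every match in the 15 to 20 cM band, which is the same effect seen in the drop from 500 to 371 matches.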

(05:40):
Not only that, there’s actually a little bit larger cluster right at the beginning. If I go and I look at cluster size, if you remember, we had this really giant cluster before that was our big cluster. Now as I can already see, this cluster is much smaller than what it was before. Before, people were sharing about 15 to 16 centimorgans. I’ve changed that threshold up to 20 centimorgans. So it’s going to be looking at people that share more than that already. So let me go back and let me actually change the size again, and I’m gonna actually change it up to 25 this time. But I’m also gonna change the upper threshold. And in this case I want to include some of my cousins in this. And so I’m gonna change this up to 1300 centimorgans. So it’s not gonna be getting into my aunts and uncles and grandparents, but it should be getting into my cousins, second cousins and all of that.

(06:32):
Now, because I’ve changed that from 25 to this 1300, that 25 makes a big difference. You can see instead of that 371, it’s only looking at 55 kits, even though I’ve changed that upper threshold all the way up to 1300. So just another 5 centimorgans dramatically affected how many matches I’m showing. And as I scroll down, I can see some of these. The more centimorgans, the higher the upper threshold for centimorgans, the more of these gray squares you’re going to find. Now, these gray squares indicate that they actually probably belong in multiple clusters. So for instance, these people right here, not only do they belong with this orange cluster, but they also belong down here with this purple cluster. And again, that’s because we’re looking at more and more relatives that are closer related to us. So for instance, I’m gonna share a lot more relatives with my first cousins than I am with my second or my third cousin.
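
One way to picture the gray squares is as cells where two kits match each other but were placed in different clusters. The sketch below assumes that reading; the kit names, cluster colors, and match pairs are hypothetical, and the logic is an illustration, not GEDmatch's actual coloring rule.

```python
from itertools import combinations

# Hypothetical cluster assignment for each kit (invented for illustration).
cluster_of = {
    "K1": "orange", "K2": "orange", "K3": "orange",
    "K4": "purple", "K5": "purple",
}

# Hypothetical pairs of kits that match each other within the thresholds.
shared = {
    frozenset(pair)
    for pair in [("K1", "K2"), ("K1", "K3"), ("K2", "K3"),
                 ("K4", "K5"), ("K3", "K4"), ("K3", "K5")]
}

def cell_color(a, b):
    """Color of the chart cell for kits a and b, or None if they do not match."""
    if frozenset((a, b)) not in shared:
        return None
    if cluster_of[a] == cluster_of[b]:
        return cluster_of[a]
    return "gray"  # a match that crosses two clusters

# Kits that appear in at least one gray cell.
gray_kits = sorted({
    kit
    for a, b in combinations(sorted(cluster_of), 2)
    if cell_color(a, b) == "gray"
    for kit in (a, b)
})
print(gray_kits)
```

Here K3 sits in the orange cluster but also matches both purple kits, so those cross-cluster cells come out gray.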

(07:30):
So it’s really important that you play around with the thresholds to find out the information that you’re trying to look for. But in essence, what you’re doing is you’re categorizing people based on other people as well. So no matter what type of clustering you do, you’re going to be getting some information that may help you in your genealogy. So one of the really tricky things with clustering is families that have endogamy. Because there is so much shared DNA, you may end up with this huge giant cluster that covers most of your page. Let me show you an example here. This is a family that has endogamy, and initially these clusters are looking okay, but then we start to see this orange cluster down here. And as we scroll across, we can see, whoa, this orange cluster is covering everything. This is all still one cluster. And in actuality, when I go through, this cluster is 450 people large. Now, from a genealogy standpoint, that is just way too many people to try to work with at one time. And so from a clustering standpoint, this large of a cluster is not very useful.

(08:47):
So what somebody with endogamy needs to do is they need to go through and they need to adjust their thresholds to try to break up a cluster like this. Now the other thing with this is this may take a long time to load. So for instance, this chart took about seven minutes to actually get created, and that’s just because of all these connections that we see here in this one giant cluster. Now, this is the same family, only what I’ve done is I’ve changed that lower threshold up to 25 centimorgans. Now initially what you can see is that it’s broken out and there’s a lot more little clusters. And so I’ve taken that great big giant one and I’ve broken it out into little ones, although once we get down far enough, there is going to be a really big cluster that still is the majority of it, but it’s not as much.
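
To see why raising the lower threshold can break up a giant cluster, here is a minimal sketch that treats kits as nodes, keeps a link only when two kits share at least the lower threshold, and counts connected groups. Using connected components is a simplification chosen for this illustration and is not GEDmatch's clustering algorithm; the kits and cM values are invented.

```python
from collections import defaultdict

# Hypothetical kit-to-kit shared cM values (invented for illustration).
edges = [
    ("K1", "K2", 30.0),
    ("K2", "K3", 17.0),
    ("K3", "K4", 40.0),
    ("K4", "K5", 16.0),
    ("K5", "K6", 28.0),
]

def cluster_sizes(edges, lower_cm):
    """Sizes of the connected groups when only links >= lower_cm are kept."""
    graph = defaultdict(set)
    for a, b, cm in edges:
        graph[a], graph[b]  # make sure both kits exist as nodes
        if cm >= lower_cm:
            graph[a].add(b)
            graph[b].add(a)

    seen, sizes = set(), []
    for start in list(graph):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this kit.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)

print(cluster_sizes(edges, lower_cm=15.0))
print(cluster_sizes(edges, lower_cm=25.0))
```

At 15 cM the weak 16 and 17 cM links chain all six kits into one group; at 25 cM those links drop out and the group splits into smaller ones.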

(09:38):
Before it was 450 people that were in this giant cluster. Now it’s only 344 people that are in this giant cluster. So I’ve started to break this cluster down. And again, as you change those thresholds, you’re gonna be able to start to separate out how some of these families fit together as in which ones are most closely related together. So what are some of the things that you can do with this information? Well, first off, if you want to save this information, you can actually do that as an HTML file. And all you need to do is anywhere on the page, just do a right click and then do a save as. And for that, you want to save it as an HTML file. So change that to html. When you open it up, you can see that the page looks very similar. And one of the nice things about that is you still have the ability to rearrange things based off of the cluster number or the size or however you want.

(10:32):
And it’s still gonna go through that motion of rearranging your clusters in that way. The one thing that you do lose, as you’ll notice, is you’re gonna lose your links to that one-to-one comparison, and you’re gonna also lose that info link, but it still gives you most of the information that you’re going to need. So if you don’t do the tier one every single month, it’s a good idea to do your clustering for your kits, save those HTML files, and then at least you have the basic information that you can use later when you aren’t a tier one member. Another thing that you can do is use this clustering in conjunction with the multiple kit analysis. Now, the multiple kit analysis is another tier one tool that can be used to look at several kits in lots of different ways.

(11:16):
I’m only gonna show you a little bit about that in this video, but I will have more videos that show how you can use multiple kit analysis with all their different functions. So for the multiple kit analysis, what you need is you need to select some kits that you want to analyze. It automatically selects your base kit, so I’ve been selected. Then I can go through and I can select some other kits. So let’s take a look at this blue cluster right here. I can go through and I can click on each one of these individually if I want to, but if it is a large cluster, that’s gonna take a long time. So what I can do instead is I can go over and I can find the legend. And in the legend, it’s all organized and it has the colors of those kits.

(12:00):
So in this case, I want this blue kit. If I click on that, I can see it has selected all of those kits already for me. The nice thing about this is if I want to select multiple clusters, then I can do that. Now, where would you want to select multiple clusters? Well, if you’re seeing that there’s a lot of overlap. So for instance, right here, I can actually see that, hey, between this red and this blue, there’s actually this one person that overlaps. So let me select the red and the blue cluster just to see how those people combine together. Now, once I have selected the kits I want to analyze, I go up to the multiple kit analysis, I click on the button, and I’m gonna just look at a compact map for my segments. I’m gonna see what segments I share in common with those people.

(12:50):
Now remember, I had two selections. I had the blue kit and I had the red kit. I’m gonna zoom in just a little bit here so we can see this more. Now, these first four here, these first four people, these were all part of the blue kit, and you can see that we all share a segment on chromosome number five. Now, these other people, they were part of the red kit, and you can see that we all share a segment on chromosome number four. And then of course, there was one person that was a link between those two, and that is this Logan Brown. Now, it doesn’t show me how Logan Brown matches with those, but then I could change my primary person to Logan Brown, and I can see how Logan is related between those. Maybe Logan shares DNA with them at another location. And that is a quick introduction to the clustering tool on GEDmatch. Now, clustering involves a lot of different things as far as analysis. And so in the future I’ll have some more videos on clustering and specific cases of how clustering can be used to solve some problems. But if you have any questions about clustering, put them in the comments below. And if you like this video, be sure to give it a thumbs up and share it with all your friends.