(This article was originally published in Genealogical Computing, fall 1998)
Sections: Introduction | Why? | Merging Databases | Merging Individuals | Merging the Rest | Spotting Duplicates | Summary
Sidebars: The Survey | Useful merging pointers
Publication: Genealogical Computing
Editor: Dick Eastman
"That's one small step for a man, one giant leap for mankind." So said the explorers who search the heavens, the Where-We're-Going. As family historians, we search the past, the Where-We're-From. Knowing where we're from and where we're going is a big help in knowing who and what we are. It's the common appeal of genealogy and space travel. What are our limits? How far is too far, how much is too much, how hard is too hard? Stories of the adventures of our ancestors, both inspiring and humbling, help us see how high we can hope, and how low we can fall. Being human, in many ways, means being in the middle. Between the past and the future, between the best and the worst, we look around and try to understand.
Why should we be concerned about merging technology? Almost everyone who has a computer and a family needs to use it. Developers are taking some interesting approaches. Just as only activists participate in bond elections and primaries, leaving the silent majority with two unappetizing choices at election time, beta testers are getting a disproportionate amount of input into how the finished family history software products work. Customers who simply assume that someone will know what they want, and will have it ready when they recognize the need, had parents who spoilt them rotten.
Family history record-keeping is increasingly becoming a digital process. Linking one's information to the information already gathered by other family members and researchers is becoming more and more common. The data in computers is just a model of the lives our families lead, a model with a strong emphasis on the sources of our information and a glaring weakness for dealing with the complexities of living, loving, and dying.
At the end of the day, we have to put our information together somehow. This is a never-ending journey, digitizing the history of the human race. There is no coordinating board, no template for how the dates and places must be entered. Yet, there are literally millions of people on earth entering the history of their families into computers, sharing the information with others, linking the matches, and matching the links.
The following is a survey of the capabilities for merging data that users can muster in the genealogical software marketplace. A number of developers were surveyed (the actual text of the survey is in the sidebar) and the results follow.
A few basics will help us communicate better here. First, computer programs store the data that we enter in FILES, collections of information separated from all the other collected information, programs, and whatnot found on the machine. Each genealogical program stores the information in its own way, called a PROPRIETARY FORMAT. Most other programs don't read that format, but a given program can generally read many such FILES in its own FORMAT. Most programs can also read and write in GEDCOM format, which is an acronym for GEnealogical Data COMmunication, a specification from the Family History Department of The Church of Jesus Christ of Latter-day Saints.
Second, when data is merged, information is copied from a SOURCE to a TARGET. Sometimes the target is called the SURVIVING information.
The software programs surveyed were Ancestral Quest (AQ), Brother's Keeper (BK), Family Tree Maker (FTM), Legacy (LG), Personal Ancestral File (PAF), The Master Genealogist (TMG), and Universal Family Tree (UFT). If you make or use a program that you think should be included in such a survey, please send your email to gceditor@ancestry.com.
The data merging process generally contains three separate parts: merging the databases, merging the duplicated individuals, and merging the duplicated sources, repositories, place names, etc. Almost every program will allow a user to "import" another data set from either of two sources: a file made by that program, or a GEDCOM format file made by another program.
The database merging process is evolving. Not too long ago, a merge meant reading all of the information in the SOURCE data set, and copying it into the TARGET data set. Now, several programs offer the user some help. According to the survey responses, AQ, BK, LG, PAF and UFT support merging of their own data format and GEDCOM. FTM supports FTM, GEDCOM, and PAF. TMG supports many formats. A wider selection of data formats to import from appears to be a trend that we'll see more of in the future. It's a useful capability for two reasons. First, it frees the user to use the program that has the features that he or she wants. Second, users can share information more freely.
TMG's GenBridge is a specification for direct import of data across platforms. A process is available for other developers to use this method, but as of this writing, TMG is the only one of the products surveyed which does. GenBridge allows for direct import among a number of programs, including most of the programs on this survey.
One quick caveat: there isn't a way for a developer to know the meaning of the data structures in your particular data set. Users don't follow instructions in the practice of entering family history information like they do in, say, the Civil War Soldiers Project. The freedom that users have to put their data in as they wish, to enter their place names and nicknames as they wish, prevents them from having the easy ability to break that information up if a new file format allows it. For example, I don't know of any programs that pull out the letters "Jr." from the end of a man's name and put them into a suffix field. Some programs support suffixes, but they don't automatically force the user to use them.
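To make the "Jr." example concrete, here is a minimal sketch in Python of what such a suffix-splitting routine might look like. It is not taken from any of the surveyed programs; the function name `split_suffix` and the short suffix list are assumptions made for illustration only.

```python
import re

# Hypothetical, deliberately short list of suffixes; real data would need more.
SUFFIX_PATTERN = re.compile(
    r"^(?P<name>.*?)[,\s]+(?P<suffix>Jr|Sr|II|III|IV)\.?$",
    re.IGNORECASE,
)

def split_suffix(full_name):
    """Return (name, suffix); suffix is "" when none is recognized."""
    cleaned = full_name.strip()
    match = SUFFIX_PATTERN.match(cleaned)
    if match is None:
        return cleaned, ""
    return match.group("name").strip(), match.group("suffix")

print(split_suffix("John Doe Jr."))   # name and suffix separated
print(split_suffix("Mary Smith"))     # no suffix recognized
```

A pattern like this only guesses. Free-form entry means the same letters can appear for other reasons, which fits the caveat that a developer cannot know what a user meant.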
One interesting capability, according to Ken McGinnis: "Legacy lets you have two family files open and displayed on the screen at the same time. You can navigate your way up or down a family tree until you find a person you want to copy into your main family file which is also open. Now you tag that person with the option of let's say that person and all of his descendants and then from the main family file you tell Legacy to import all of the tagged people from the second file. It's easy and powerful." This program has another useful feature: It will allow you to specify a list of tags that you will import, and to save and re-load that list in the future. Another feature I expect to see in other places someday is the use of a "Top 5" choice.
When data sets are merged, LG supports the automatic creation of a citation at the individual level identifying the source of the information. TMG and FTM support the automatic creation of a field level source citation. This is an area of some concern to researchers, especially when the amount of data is large, and typing a citation for each item would be laborious.
Any time that two data sets are merged, there's a good chance that the user will find some individuals in the combined data set more than once. It's also possible to find that people you thought were different (J Doe and John Doe, for example) might be the same person. In times past, the user could write down all of the combined information, delete one person, type the data into the other, and hope it wasn't a frequent occurrence. That is still the case with BK, although John Steed has a Statement of Direction saying that he intends to write a routine to merge individuals for a future update. Most of the programs surveyed allow the user to see both sets of information and to select which items to keep and which to lose. TMG also offers a look at a combined chronological overlay that is color coded based on the source of the dates, so the user can assess the desirability of merging two records.
And here's where users find a bit of a rub. Some programs store a specific set of items about each individual, such as a birth place and date. A merge in this kind of program can hold one and only one birth date. If the two you are merging don't match, you will have to put the one that isn't your favorite into notes or lose it altogether. This kind of program is very useful for printing charts and reports, but not as useful for research. Other programs link individuals to events or tags, such as births, and associate the date and place with the event. LG, UFT, TMG, and FTM all support multiple or alternate events. This type of program generally supports the storage of conflicting data, which further study may reconcile. In this last group, some allow the user to define his or her own tags, such as "Executrix."
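The difference between the two kinds of program can be sketched in a few lines of Python. This is only an illustration under assumed names (`FixedFieldPerson`, `EventPerson`, `merge_fixed`, `merge_events`); it does not describe the internal storage of any surveyed product.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    tag: str      # "Birth", or a user-defined tag such as "Executrix"
    date: str
    place: str
    source: str

@dataclass
class FixedFieldPerson:
    # One slot per fact: only one birth date can survive a merge.
    name: str
    birth_date: str = ""
    notes: list = field(default_factory=list)

@dataclass
class EventPerson:
    # Any number of events, so conflicting births can sit side by side.
    name: str
    events: list = field(default_factory=list)

def merge_fixed(target, source):
    """Copy source into target; a conflicting birth date is pushed to notes."""
    if source.birth_date and source.birth_date != target.birth_date:
        if target.birth_date:
            target.notes.append(f"Alternate birth date: {source.birth_date}")
        else:
            target.birth_date = source.birth_date
    target.notes.extend(source.notes)
    return target

def merge_events(target, source):
    """Copy source events into target, skipping exact duplicates."""
    for event in source.events:
        if event not in target.events:
            target.events.append(event)
    return target

a = EventPerson("John Doe", [Event("Birth", "1 Jan 1850", "Ohio", "Family Bible")])
b = EventPerson("J Doe", [Event("Birth", "3 Jan 1850", "Ohio", "1900 census")])
print(len(merge_events(a, b).events))  # both birth events are kept
```

In the fixed-field version the second date survives only as a note, while the event-based version keeps both dates with their sources for later study.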
In addition to individuals, some programs have separate tables containing source citations, master sources, repositories, and places. Database merging routines often combine these supporting tables, creating duplicates. LG combines source citations, if they are exact. UFT and FTM merge master sources. PAF and TMG merge master sources and repositories. Management of these tables is seldom as sophisticated as the names, dates, and places. AQ links any multimedia attachments to the new target individual.
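As a rough illustration of combining a supporting table without creating duplicates, the sketch below merges two master-source lists and reuses an existing entry only when the text matches exactly. The function name, the use of integer IDs, and the dictionary layout are all assumptions for the example, not a description of how any surveyed program works.

```python
def merge_source_tables(target_sources, incoming_sources):
    """Merge two {id: text} source tables.

    Returns (merged_table, remap), where remap maps each incoming ID to
    the ID it has in the merged table. Only exact text matches are combined.
    """
    merged = dict(target_sources)
    by_text = {text: source_id for source_id, text in merged.items()}
    next_id = max(merged, default=0) + 1
    remap = {}
    for old_id, text in incoming_sources.items():
        if text in by_text:
            remap[old_id] = by_text[text]      # exact duplicate: reuse it
        else:
            merged[next_id] = text             # new source: append it
            by_text[text] = next_id
            remap[old_id] = next_id
            next_id += 1
    return merged, remap

merged, remap = merge_source_tables(
    {1: "1850 Census", 2: "Family Bible"},
    {1: "Family Bible", 2: "Parish Register"},
)
print(merged)
print(remap)
```

The remap table is what lets citations copied from the SOURCE data set point at the right entry in the TARGET afterward.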
A number of programs not only support the merging of individuals, but will suggest that some individuals are duplicates, through a process referred to as a merge routine. AQ uses soundex code for first and last name, and optionally, middle name. PAF 3.0 allows exact spelling or soundex comparisons. FTM uses exact spelling, with exact match on birth date as well. TMG and UFT allow the user a number of choices from ignore to exact on both names. TMG also allows a "sounds like" choice that Bob Velke says is better than soundex. LG allows a soundex on surname and a number of letters on first name. Most programs also warn users if they enter a duplicate name.
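For readers who have not seen it, a Soundex comparison can be sketched as follows. This is a common form of American Soundex written in Python for illustration; the surveyed programs may implement it differently, and the helper `names_match`, which pairs a surname Soundex with a number-of-letters test on the first name, is a hypothetical name of my own.

```python
# Build the letter-to-digit table used by American Soundex.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name):
    """Return the four-character Soundex code, or "" for an empty name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue              # H and W do not break a run of equal codes
        code = CODES.get(c, "")   # vowels and Y have no code
        if code and code != previous:
            result += code
        previous = code           # a vowel resets the run
    return (result + "000")[:4]

def names_match(given_a, surname_a, given_b, surname_b, given_letters=1):
    """Surname by Soundex, first name by its leading letters."""
    same_surname = soundex(surname_a) == soundex(surname_b)
    same_given = given_a[:given_letters].upper() == given_b[:given_letters].upper()
    return same_surname and same_given

print(soundex("Robert"), soundex("Rupert"))      # R163 R163
print(names_match("J", "Doe", "John", "Doe"))    # True
```

With `given_letters=1` the first-name test behaves like a match on initials, so J Doe and John Doe are offered as a possible pair.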
Some programs will suggest merges based on information other than names. AQ will suggest merges based on an ID, such as Ancestral File Number (AFN). PAF 3.0 compares birth year, sex, AFN, and death year. It does not recommend merges of people who are parent and child. FTM compares the parents, marriages, and children. It will generate a report listing name matches that fail the other tests. TMG will compare birth date, death date, and parent names, and optionally disqualifies siblings, parents, and different sexes. It also allows the user to specify how close dates must be to qualify, and to define a factor for "circa" and "before/after" dates. LG compares birth and death dates at a user-specified accuracy, soundex, or parent soundex, but it doesn't allow the user to automatically disqualify parents and children, and suggests them often if your dates are loose enough. Its developers are planning to allow you to flag a pair of individuals in your data set as "no match" so that it won't recommend them on subsequent routines.
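The non-name tests described above can be pictured with a short Python sketch. It assumes the names have already been matched some other way, and every field and function name here (`Person`, `is_candidate`, `tolerance`) is hypothetical; no surveyed program is claimed to work exactly this way.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    pid: int
    sex: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    father_id: Optional[int] = None
    mother_id: Optional[int] = None

def years_close(a, b, tolerance):
    # Assumption: an unknown year is treated as "not close" (conservative).
    if a is None or b is None:
        return False
    return abs(a - b) <= tolerance

def is_parent_child(a, b):
    return a.pid in (b.father_id, b.mother_id) or b.pid in (a.father_id, a.mother_id)

def is_candidate(a, b, tolerance=2):
    """Suggest a merge only if sex agrees, the dates are within the
    tolerance, and the two records are not parent and child."""
    if a.pid == b.pid or a.sex != b.sex:
        return False
    if is_parent_child(a, b):
        return False
    return (years_close(a.birth_year, b.birth_year, tolerance)
            and years_close(a.death_year, b.death_year, tolerance))

father = Person(1, "M", 1800, 1860)
son = Person(2, "M", 1801, 1861, father_id=1)   # dates entered loosely
print(is_candidate(father, son))                # parent and child are disqualified
```

Without the parent and child test, a father and a namesake son with loosely entered dates would be suggested as duplicates, which is the behavior the survey noted in LG.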
The variety and innovation in this area are signs of a healthy, growing field. We are in the first stages of development of a new technology, comparing records about people and trying to infer individuality. There are many credit-reporting agencies, with more recent records than most genealogists, having the same problems and spending more money to try to solve them. There are no measures to assess the utility of these routines at present, but a measurement of false positives (people who are suggested but really are different) and false negatives (people who aren't suggested but who really are the same) on a known data set might be useful. Please see the "useful merging pointers" sidebar.
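Such a measurement is simple to sketch. Assuming a test data set where the true duplicate pairs are already known, the hypothetical Python function below counts false positives and false negatives for a list of suggested pairs.

```python
def evaluate(suggested_pairs, true_pairs):
    """Compare suggested duplicate pairs against known duplicate pairs.

    Each pair is two record IDs; the order within a pair does not matter.
    """
    suggested = {frozenset(pair) for pair in suggested_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    return {
        "true_positives": len(suggested & truth),
        "false_positives": len(suggested - truth),   # suggested, really different
        "false_negatives": len(truth - suggested),   # missed, really the same
    }

result = evaluate(
    suggested_pairs=[(1, 2), (3, 4), (5, 6)],
    true_pairs=[(2, 1), (7, 8)],
)
print(result)
# {'true_positives': 1, 'false_positives': 2, 'false_negatives': 1}
```

Running the same known data set through different match settings would let a user see which settings trade fewer false positives for more false negatives.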
One would hope that developers would change the capabilities of the programs in the marketplace in response to the changing needs of users. It appears that exactly that is happening. Users can merge from a wider variety of data formats than in the past. Users can merge individuals more easily. Routines to help identify candidates for merging are becoming quite sophisticated. More programs store the resultant conflicting data today. It's also encouraging that they are not all doing the same thing. The resultant diversity and innovation offer us more chances to connect Where-We've-Been to Where-We're-Going than we've ever had before.
IF THE ANSWER TO QUESTION 1 IS YES:
AVAILABLE: The feature is in the product, and the product is available now.
ANNOUNCED: The feature is going to be in a specific update, with a scheduled ship date less than a year in the future.
PLANNED: The feature will be included in a specific update, with no scheduled ship date, or a scheduled ship date more than a year away.
STATEMENT OF DIRECTION: The feature is one that the developer intends to include in an update at some unknown future date.
OTHER: Describe as you see fit.
Gaylon Findlay at Ancestral Quest says that using a match on parent soundex reduces false positives.
If the program you're using doesn't have a choice for initials, but has a choice for a number-of-letters match, it can work the same way.
Beware of people about whom you know almost nothing. They will match a whole lot of others.
Including blank dates is risky. They match other blank dates from different centuries.
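The last two pointers can be illustrated with a small Python sketch. The names `years_match`, `include_blank`, `known_facts`, and `worth_comparing` are hypothetical, and the threshold of three known facts is an arbitrary assumption.

```python
def years_match(year_a, year_b, tolerance=2, include_blank=False):
    """Compare two years; a blank (None) matches only if include_blank is set."""
    if year_a is None or year_b is None:
        return include_blank
    return abs(year_a - year_b) <= tolerance

def known_facts(record):
    """Count the fields of a record that actually hold a value."""
    return sum(1 for value in record.values() if value not in (None, ""))

def worth_comparing(record_a, record_b, minimum_facts=3):
    """Skip pairs where either person is almost a blank."""
    return (known_facts(record_a) >= minimum_facts
            and known_facts(record_b) >= minimum_facts)

# Two John Smiths with no dates entered; they may have lived centuries apart.
colonial = {"name": "John Smith", "birth_year": None, "death_year": None}
modern = {"name": "John Smith", "birth_year": None, "death_year": None}

print(years_match(colonial["birth_year"], modern["birth_year"], include_blank=True))
print(years_match(colonial["birth_year"], modern["birth_year"]))
print(worth_comparing(colonial, modern))
```

With blanks included, the two records match on dates even though nothing is known about either date; requiring a minimum number of known facts keeps such sparse records out of the suggestions.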