(This article was originally published in Genealogical Computing issue 15:2)

Background on the GENTECH Data Exchange Project

by Beau Sharbrough

VP Technology, GENTECH

 

What's the Problem?

Genealogists do two things: research and publish. Computers improve the ease and quality of both tasks so much that they are fast becoming the standard way to do them. In that environment, research will put more emphasis on networking than on original research, and people will want to share their data more than in the past. Examples are already evident in on-line settings like GenServ, GenWeb, the Tafel Matching System, and in the basic communication taking place on CIS, AOL, GENIE, FIDONet, and Internet Usenet newsgroups such as soc.genealogy.computing.

Most genealogy programs store their information in tables whose layout, called a database structure, is unique to that program. Unless two genealogists use the same program, they will use different data structures, and will have difficulty exchanging their information. To exchange data with other programs, most of the time they convert their information to an intermediate file in a "standard" format, a process called "exporting." The other program reads that intermediate file and converts the information into its own database structure, a process called "importing." Since both database structures are unique, the probability is high that some information is lost in the process.
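
As a rough illustration of how an export and import round trip can lose information, here is a minimal sketch in Python. The record layouts, the field names, and the intermediate format are all hypothetical; none of them is taken from a real program or from the GEDCOM specification.

```python
# Hypothetical native record from "Program A". The layout is invented for
# illustration and does not describe any real genealogy program.
PROGRAM_A_RECORD = {
    "name": "John Smith",
    "birth": "12 MAR 1850",
    "nickname": "Jack",        # a field the intermediate format cannot carry
    "source_confidence": 3,    # another field with no place to go
}

# Fields that the hypothetical intermediate format is able to carry.
EXCHANGE_FIELDS = ("name", "birth")


def export_record(record):
    """Convert a native record to the intermediate format, dropping other fields."""
    return {key: record[key] for key in EXCHANGE_FIELDS if key in record}


def import_record(exchange):
    """Convert an intermediate record to the layout of a hypothetical Program B."""
    given, _, surname = exchange.get("name", "").rpartition(" ")
    return {"given": given, "surname": surname, "born": exchange.get("birth")}


def lost_fields(record):
    """List the native fields that do not survive the export step."""
    return sorted(set(record) - set(EXCHANGE_FIELDS))


exported = export_record(PROGRAM_A_RECORD)
imported = import_record(exported)
print(imported)
print("Lost on export:", lost_fields(PROGRAM_A_RECORD))
```

In this sketch the nickname and the confidence rating never reach the second program, even though neither program did anything wrong by its own rules.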

We call the intermediate file format used for most genealogy data "GEDCOM." The LDS Church developed GEDCOM, an acronym for Genealogical Data Communication. The current approved specification is version 4.0. Version 5.3 is the most recent draft specification available, and the Church has announced version 5.4. Some programs write and read according to the draft specification, and some to the official specification. Users caught between these developers learn a lot about a technical problem solving process called "finger pointing," in which everyone claims that it's someone else's fault.
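
One way a reading program could tell which specification a file was written to is to look for a version line in the file header. The sketch below is only an illustration: it assumes a header containing GEDC and VERS lines, as the later GEDCOM 5.x specifications lay it out, and it makes no claim that every file in circulation carries them. The sample file and the function names parse_line and declared_version are hypothetical.

```python
SAMPLE = """\
0 HEAD
1 SOUR EXAMPLE_PROGRAM
1 GEDC
2 VERS 5.3
0 @I1@ INDI
1 NAME John /Smith/
0 TRLR
"""


def parse_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    if len(parts) > 1 and parts[1].startswith("@"):
        # Record lines carry a cross-reference ID such as @I1@ before the tag.
        xref = parts[1]
        rest = parts[2].split(" ", 1) if len(parts) > 2 else [""]
        tag = rest[0]
        value = rest[1] if len(rest) > 1 else ""
    else:
        xref = None
        tag = parts[1] if len(parts) > 1 else ""
        value = parts[2] if len(parts) > 2 else ""
    return level, xref, tag, value


def declared_version(text):
    """Return the version found under HEAD > GEDC > VERS, or None if absent."""
    in_head = in_gedc = False
    for raw in text.splitlines():
        if not raw.strip():
            continue
        level, _, tag, value = parse_line(raw)
        if level == 0:
            if in_head:
                break  # the header record has ended
            in_head = tag == "HEAD"
            in_gedc = False
        elif in_head and level == 1:
            in_gedc = tag == "GEDC"
        elif in_gedc and level == 2 and tag == "VERS":
            return value
    return None


print(declared_version(SAMPLE))
```

A file with no version line returns None here, which leaves the reading program to guess, and that is where the finger pointing starts.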

Data exchange through GEDCOM is resulting in the destruction and the creation of information. GENTECH will attempt to relieve this difficulty in two ways. First, we want to be instrumental in a dialog to adopt guidelines to work around any problems with the specification. Second, we want to produce a standard GEDCOM file for testing.

Once data has been exchanged, merging the data sets is a nightmare for the user. Programs are not built to spot duplicates and merge them. Programs that allow users to merge people still do not merge the tags and sources in a common sense fashion. If the future of genealogy computing is going to involve more exchange of data, merging data sets must be rendered as painless as possible. We intend for the Data Merging project to address this issue. It will not draw much attention until after the first of the year.
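
To make the merging problem concrete, here is a minimal sketch of one naive approach: treat two people as the same when their normalized name and birth year agree, then combine their sources instead of discarding one set. The record layout, the matching rule, and the function names are assumptions made for illustration. Real duplicate detection would need far more care than this.

```python
def normalize(name):
    """Lower-case a name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def match_key(person):
    """Build a crude comparison key: normalized name plus birth year."""
    return (normalize(person["name"]), person.get("birth_year"))


def merge_people(a, b):
    """Merge two records judged to be the same person, keeping every distinct source."""
    merged = dict(a)
    for field, value in b.items():
        if field == "sources":
            continue
        merged.setdefault(field, value)  # keep a's value when both have one
    merged["sources"] = sorted(set(a.get("sources", [])) | set(b.get("sources", [])))
    return merged


def merge_data_sets(mine, theirs):
    """Merge two lists of people, combining records that share a match key."""
    by_key = {match_key(p): p for p in mine}
    for person in theirs:
        key = match_key(person)
        by_key[key] = merge_people(by_key[key], person) if key in by_key else person
    return list(by_key.values())


mine = [{"name": "John  Smith", "birth_year": 1850, "sources": ["1860 census"]}]
theirs = [
    {"name": "john smith", "birth_year": 1850, "sources": ["family bible"], "death_year": 1910},
    {"name": "Mary Jones", "birth_year": 1855, "sources": []},
]
for person in merge_data_sets(mine, theirs):
    print(person)
```

Even this small example shows the judgment calls involved: which spelling of the name wins, and whether two people with the same name and birth year really are one person.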

 

 

What is GENTECH?

GENTECH is a nonprofit corporation devoted to bridging the gap between genealogy and technology. We have held three annual conferences on technology issues in genealogy. We have two committees. The Conference Committee produces the annual conference. The Technical Committee provides technical support. Our board has given the Technical Committee a mandate to provide support in three areas: data exchange, society support, and conference support.

The day before the annual conference, GENTECH is host to a Technical Meeting. We think that those meetings have been the largest assemblies of genealogy developers ever, with over a dozen present. Discussions at those meetings have provided much of the background for this project.

Before GENTECH94, there was an impressive assembly of genealogy developers at the Technical Meeting. I specifically recall seeing Bob Velke (author of The Master Genealogist), Tom Wetmore (author of Lifelines), Cliff Manis (founder of GenServ), and many others. The discussion topic started with how a local society might enter their cemetery information into a computer, and branched into the lack of a source for definitions in genealogy. Dick Eastman, founder of the ROOTS forum on CompuServe, pointed out that pilots and ham radio operators have a single industry magazine, and that genealogy does not. The point being made was that technicians with capabilities were waiting for instructions, but they were not getting them, and that they would not wait forever. They needed genealogists to tell them about the "business" of genealogy, so that they could apply the technology to it. One specific example discussed was the definition of a "half cousin." No organization would provide a definition to the developer concerned. The issue of a need for standards, and the lack of a standards body, was raised.

We remarked that the day was coming when a person could dial up a computer and print out their own pedigree: that it was a question of when, not if. Technologists will not wait forever for genealogists to tell them what they want done and how. Eventually they will do whatever they want to BECAUSE THEY CAN. The power to search through 200 or 2,000 other people's files and look for links to your own family is a dream to many genealogists. A computer programmer can do it already, and in less than a weekend. Technicians have power, and no one is telling them how to responsibly limit its exercise. That pedigree idea was out in the future, but coming for sure.

A year later, at GENTECH95, the GenWeb initiative by Gary Hoffman was a reality. Already, programmers and genealogists around the world were asking how they could use this technique to link their genealogy files to others. That future, in some ways, was already here, but it was only the beginning. We cannot make it work yet because we are having problems exchanging our information. Part of the problem is that technologists are running away with this idea, without the discipline of traditional genealogists. Many GenWeb sites have no sources attached. No single way to display, to link, or to document your files is currently being used. There were many discussions of the need for standards, and again, no genealogy group came forward and volunteered to make one.

 

The Issue of Standards

I think that there are two reasons for the absence of a volunteer. The first reason is that the proposer would need to have the respect of the genealogy program developers, and there is not a large group of people who can make that claim. The second reason is that the LDS is currently the keeper of the only exchange standard in wide use - GEDCOM. Many people are waiting for the LDS to revise the GEDCOM specification, instead of trying to do the work on their own. This makes sense, because GEDCOM is very widely used, and a specification proposed by an individual or local society would not find many genealogists interested.

The only reasons to develop a standard by some route other than the LDS are that some developers feel the revisions coming from the LDS are slow, and that the LDS and secular genealogists have different interests. The interests of the Church and the interests of genealogists are visibly similar, but fundamentally different.

Standards may not even be reasonable goals right now. Tom Wetmore said a wise thing. He said that, in other disciplines, the specifications that have succeeded have done so largely because they are what worked best after several competing specs were tried. I think that is how nature works - we innovate. The variations compete, and the best things about each one are copied and used by everybody. Whatever we do, it seems like it would be shortsighted to choose an approach that stifles innovation.

 

What is the Data Exchange Project?

The Technical Committee at GENTECH volunteered to try another approach to this problem. There is no special knowledge or talent at GENTECH that qualifies us any more than any other group of concerned users. It is just time to try again and we want to do what we can to help. Our approach assumes that we have not solved the problem because we have not correctly identified the problem. We fell back on an old Dale Carnegie technique: identify the problem, suggest solutions, choose one, and start it. Then start over.

When we received the board's permission to try this, we sent out the enclosed press release. The response was surprising. Our mail splits between those who question our involvement and those who report problems. Some questions about our involvement allowed us to refine our approach to the project. Examples are:

GENTECH: We are taking pains not to presuppose that we know what the problem is. We are willing to spend 90 days or so gathering input from as wide a discussion area as we can to try to reformulate the problem.

GENEALOGIST: Hmm. That's interesting. At first I thought it was absurd but on rereading I concluded that I do not know what the problem is, either.

GENEALOGIST: Why is something dated 15 June, posted electronically more than 2 weeks after its release?

GENTECH: We wrote it at that time for forwarding to non-electronic media, specifically for the print media. We sent it to editors for disclosure to columnists, since most other magazine sections are working about 60 days in the future. I went out of town between June 16 and June 29. I was not able to post the release more widely until I got back.

GENEALOGIST: Will this "press release" be made available to the commercial online services? Another week has gone by and no information regarding this project has been posted on CompuServe's Genealogy Forum nor on its sister forum, GENSUP. Lots of computer genealogists have accounts there, but many do not take advantage of the Internet links available to them, due to the extra costs encountered. They may wish to participate in this project, but don't know about its existence. I haven't seen anything posted on Prodigy, either.

GENTECH: It is starting to look to me like we have assembled a committee that has an inventory of skills in solving problems, but not in running a PR campaign. Getting the maximum exposure for this project is very important to the committee, because the ultimate goal is a consensus.

The deadlines are not set in concrete, either. They are not the original set of deadlines, and I doubt that they are final. We want to stimulate the broadest discussion of genealogy data exchange techniques ever. Period. We have no desire to limit the discussion to particular groups. The deadlines are goals for the committee, not barriers to the public. I cannot think of any reason a person cannot bring up a new problem next February.

GENEALOGIST: Before sending any complaints, I'd like some details about this project. What information do you require with the complaint? The message exchange (phone call transcripts, E-mail, letters)? The GEDCOM file? The programs? What format?

GENTECH: It is as amazing to me as it is to you that there is no central place to ask for help with these problems, no standard form for expressing them, and no standard form for reporting them to the public. Are you willing to work with Lee to answer those questions? They are good ones, and should be at the top of committee concerns.

The immediate answer is: I do not know. Maybe when we have three dozen, we can formalize them. Setting the format before we see the data would require us to know what the data will look like. We do not.

GENEALOGIST: So what good is it going to do me, them, or anyone else to _even_ complain to a committee? What penalties will be invoked if your committee says the offender is a developer? What if the offender is a user? What if the offender is the LDS Church?

GENTECH: Well, I do not know the answer to these questions. First, I do not anticipate any punitive action being taken by, or moral authority invested in, this committee. We are very concerned with seeing the specific problems, and dealing with them if possible. It is just plain premature to try to say what anyone should do if, say, the LDS is the offender. Certainly, we cannot and will not punish anyone.

For example, some problems that we have seen come from developers using different versions of the GEDCOM spec. My limited understanding is that one version is "official," and the other is a "draft." I cannot tell you who the offender is in a case like that. First, blame is not our product. Second, it looks like the current confusion in this area is a cooperative effort - there are plenty of places to point fingers. Remember, when you point a finger, three point back.

It is time to deal with the issues, not the personalities. This committee does not have any special moral authority to rule on disputes. We are just another concerned group of genealogy data exchangers who have the same problems as everyone else. It is my suspicion that the problems are not technical in nature, that they are just growing problems in a developing industry. Communication facilitates cooperation, and cooperation solves more problems than confrontation. At least, that is my opinion.

GENEALOGIST: Finally, is this project embraced by the Church *and* the developers? I would ask about the users of GEDCOM, but I do not think many of us even know about this project (and if we did, most of us probably wouldn't care one way or the other about the outcome).

GENTECH: I do not know of anyone other than the committee members who "embrace" this project. The real issue is whether there are data exchange problems out there. If there are, people who care about it ought to do something. No one has given this committee special knowledge or insight into the nature of the problem. We are just volunteering to help, not installing ourselves as governor. I hope that developers and the LDS and users can focus on their common interests, and cooperate for the benefit of the people who, in the future, will be using the information that you and I are digitizing today.

 

What's Next?

In the next issue, we'll summarize the results of phases one and two. We should have lots of suggestions for how to deal with the problems that we are running into when we exchange data. Will the LDS be involved in the solution? Is there really a problem? Will there be a new specification, or just a better set of instructions for the old one? How could there be enforcement of genealogy standards? Will two programs ever exchange data successfully? We may not answer all of these questions next time, but we will definitely discuss them.

Also, efforts are already underway to widen the scope of the discussion. One of the fundamental problems in genealogical computing is a lack of strict definitions of terms. We'll discuss that, along with the other issues raised in this article, next time.

 

QUESTIONS

Questions raised about the project indicate the need for the following clarifications:
The deadlines are not set in concrete. They are not the original set of deadlines, and I doubt that they are final. We want to stimulate the broadest discussion of genealogy data exchange techniques ever. Period. We have no desire to limit the discussion to particular groups. The deadlines are goals for the committee, not barriers to the public. There is no reason a person cannot bring up a new problem next February.
It is time for a central place to ask for help with these problems, and a standard form for expressing them, and a standard form for reporting them to the public. Maybe when we have more reports, we can formalize them. We will need to see 50 or so to see how they are common and how they are different, and then we'll propose a problem report format.
We've been asked what good reporting a problem will do - what penalties we will assess for non-compliance with the spec, and how we will variously punish users, developers, and the LDS church. First, we cannot and will not punish anyone. We have no moral authority to do so, and we hope we never do. We are as concerned as any other genealogists about identifying and dealing with the various occurrences of these kinds of problems. We don't intend to focus on who is to blame for unpleasant results in the past. We intend to try to find cooperative, constructive ways that people who choose to exchange genealogy data may do so.

Communication facilitates cooperation, and cooperation solves more problems than confrontation.

We have been asked if the LDS church embraces this project. Only the committee members "embrace" the project, but the product manager of GEDCOM for the Family History Department of the LDS has expressed an interest in joining the project, and has asked how he might participate. Since phase 1 only consists of gathering complaints, we have not asked him for any special efforts to date. We've indicated that his interest is paramount to us, that we think that many times the church or the GEDCOM spec is singled out as the source of a problem when something else is really the cause, and that we hope to teach people how to use GEDCOM effectively.

 

PRELIMINARY RESULTS

It's neither a final nor a complete list, but a summary of preliminary problem reports includes three categories: Technical, Procedural, and Other.

Technical problems refer to difficulties that developers have.
Programs don't indicate what version of GEDCOM they export.
There are not any "LEVELS" of GEDCOM compliance. For example, some developers would be interested in importing only GEDCOM "BMDB" information: the birth - marriage - death - burial information. They would want to claim a GEDCOM compliance, but not one that reads and uses every possible GEDCOM tag. It's presently perceived as an all or nothing compliance issue.
We can't tell when a culture normally reverses the order of names. In Western cultures the family name is normally last, and the individual's name is first. In the Orient, it is often the opposite. Developers who want to display and list a person's name correctly don't know when to do so, or how it should be structured.
Of course, people have complained about the two GEDCOMs question. Which do we code to? Which do we read in? and related questions.
There is not a set of standard test files that developers can use to test their own compliance.
There is not a clear set of specifications for multimedia information.
When should we use the RESIDES tag, versus the ADDRESS tag?
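
To show what a limited level of compliance might look like in practice, here is a minimal sketch of an importer that reads only birth, marriage, death, and burial information and reports the tags it skipped. BIRT, MARR, DEAT, and BURI are GEDCOM's tags for those four events. Keeping NAME as well, the sample lines, and the function names are assumptions made for illustration, and nothing here is a proposed definition of a compliance level.

```python
BMDB_TAGS = {"BIRT", "MARR", "DEAT", "BURI"}
ALWAYS_KEPT = {"NAME"}  # assumption: names are needed to identify people


def level_and_tag(line):
    """Return the level number and tag of one GEDCOM line."""
    parts = line.split()
    level = int(parts[0])
    # Skip a cross-reference ID such as @I1@ when it precedes the tag.
    tag = parts[2] if len(parts) > 2 and parts[1].startswith("@") else parts[1]
    return level, tag


def filter_bmdb(lines):
    """Keep record lines and BMDB structures; report the level-1 tags skipped."""
    kept, skipped_tags = [], set()
    keeping = True
    for line in lines:
        level, tag = level_and_tag(line)
        if level == 0:
            keeping = True  # always keep the line that starts a record
        elif level == 1:
            keeping = tag in BMDB_TAGS or tag in ALWAYS_KEPT
            if not keeping:
                skipped_tags.add(tag)
        # Lines at level 2 and deeper follow the decision made for their parent.
        if keeping:
            kept.append(line)
    return kept, sorted(skipped_tags)


sample_lines = [
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 BIRT",
    "2 DATE 12 MAR 1850",
    "1 OCCU Farmer",
    "1 RESI",
    "2 PLAC Boston",
    "1 DEAT",
    "2 DATE 1910",
]
kept, skipped = filter_bmdb(sample_lines)
print("\n".join(kept))
print("Skipped tags:", skipped)
```

Reporting the skipped tags is the design choice that matters here: a program that reads only part of a file can at least tell the user what it left behind.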

Procedural problems refer to how users should use genealogy programs to facilitate exchange.
Some programs don't import my notes or my sources.
Are there rules for displaying the pedigree of adoptees? Which family would we normally show? How should we choose?
There is no place or form for reporting problems. We're not even sure when we have a problem.
There is not a way to selectively import or export in my genealogy program.

Other problems didn't fit into the first two categories.
Screens for entering sources don't follow Lackey's Cite Your Sources.
Screens for entering sources don't follow the Chicago Manual of Style.
We can't easily enter events which occur over a range of dates, such as military service or historical events.

Some programs claim compliance and don't really read or write files properly.
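
On the complaint about events which occur over a range of dates, one possible way for a program to hold such an event is to store a start and an optional end. The sketch below is hypothetical: the Event class and its field names are invented for illustration, and it uses whole years only to stay short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """A hypothetical event record that can cover a span of years."""
    kind: str
    start_year: int
    end_year: Optional[int] = None  # None means a single-date event

    def covers(self, year):
        """Return True when the given year falls within the event."""
        end = self.start_year if self.end_year is None else self.end_year
        return self.start_year <= year <= end

    def describe(self):
        """Render the event as text, with a range when there is one."""
        if self.end_year is None:
            return f"{self.kind}: {self.start_year}"
        return f"{self.kind}: from {self.start_year} to {self.end_year}"


service = Event("Military service", 1861, 1865)
birth = Event("Birth", 1850)
print(service.describe())
print(birth.describe())
print(service.covers(1863))
```

A single-date event is simply a range with no end, so one entry screen could serve both cases.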