(This article was originally published in Genealogical Computing issue 15:3)

The GENTECH Data Exchange Project, Part 2

In the last issue, we promised to summarize the results of the Data Exchange Project so far, and to answer the following questions: Is there really a problem? Will there be a new specification, or just a better set of instructions for the old one? How could there be enforcement of genealogy standards? Will two programs ever exchange data successfully? Will the LDS be involved in the solution? We also promised to discuss a lack of strict definitions of terms.

The fundamental issue is effective communication. Communication is one of the seven wonders of the human experience. Here I am, sitting in this totally private, isolated brain, sensing the world around me and piling up experiences and opinions about what is really out there. And there you are, doing something sort of the same, and trying to figure out what I mean. Our words are like messages in a bottle, sent from one desert island to the other. It's a wonder that we communicate at all. I have a thought, produce a symbol such as a word or picture or a grunt, and you sense it, and get an idea. It is almost certain that your idea is different from mine, and that there is no such thing as total communication. What we've got is a chance, under the right conditions, to get close.

Language is an agreement to associate certain symbols with certain things, and those associations can vary. Culture is shared language. Culture is shared experience, the way that we all remember Jackie Gleason saying, "To the moon," or Robin Williams saying, "Shazbot." If you are sitting next to me when I see a program or hear a conversation, we are more likely to get similar ideas about it. After eleven years of sitting together at the dinner table and the Sunday services and watching TV and movies, our inventory of shared experiences is so high that my wife and I only misunderstand each other about three times a day.

We humans not only send messages that we think others should already recognize and understand, we send new messages that have never been sent before. Woe to the artist or composer who tried to send one message and the audience got another. (I'm not saying that all creative works contain a message, but I have a bias that says they should.) This creative use of symbols to express new ideas is another of the seven wonders of the human experience. Back in 1985, George Carlin pointed out that no one had ever said, "Would you saw my legs off?" It was also the case at that time that no one had said, "Drop by my home page and leave me an E-mail." We are constantly reinventing language, because we are curious monkeys who like to find new ways to do things.

Related to this communication issue is my belief that information only exists in human minds. (I'm not saying that there is intelligent life on earth, but I'm not saying that there ain't.) While computers, on the one hand, are full of data, only a human can absorb that data and create information. My hard disk has no pictures of my family - it has files full of 1s and 0s. Data, information, data, information, got it? The role of computers in communication of information is that they are tools we use to transport our symbols.

Returning to genealogical data exchange, it is interesting to look at the process of the communication of family history information using computers. I put my ideas onto symbols. I ship you those symbols through a variety of cool high tech methods. You look at the symbols. You automatically understand everything that I intended for you to understand. Not. Where did the communication break down?

 

 

Can we talk?

Yes, there is really a problem. It is the problem of unity of purpose. It is the problem of the meaning of symbols. Are they to be defined strictly, or flexibly? You cannot have too much of either one. Strict, flexible, strict, flexible, got it? If we define our symbols strictly, we can't innovate with them. If we define them flexibly, we can't always understand what they mean. In genealogical data structures, we are fighting to do both, but we are not united, and we fall.

The TYRANNY of LABELS. I describe my grandfather as a unique individual by using his name. I describe him as a member of the set of grandfathers by using the term "grandfather." Ok, fine, but really, what characteristics do grandfathers have? In general, I am describing the characteristics that all of the members of a set share in common when I use a label for the group. In particular, I am saying that he is the father of one of my parents, that he was born significantly before me, and lived close to the time that my parent was born. I am not saying that he is generally pleasant, or forgetful, or that he always brings candy with him, or that he always has Prince Albert in a can. In genealogical data structures, we are struggling to identify the characteristics of the members of the set, and then to somehow express the other information, too.

There is a tyranny in labels, and it is time that we tightened up our use of symbols. There can be no communication without agreed meanings of symbols. In software development projects, the first item of business is defining the data representations. In genealogy, each developer is defining them for herself the best she can. It's vital to the communication process that a word represent the same idea from one program to another. I might write a family history that will be read by persons who aren't born yet, and I certainly can't ask them or every other potential reader to agree in advance on the meanings of my symbols - so how will I ever communicate? By executive order, by the tyranny of definition. Long live the king.

 

The King is dead, Long live the King

Since the publication of our first article, the current official GEDCOM spec has changed from version 4.0 to version 5.5. To the outside observer, it appears that the Family History Department (FHD) of the LDS is trying to adapt the specification to changing needs. In answer to our question from last time, there will be a new standard, AND there will be revisions of the old standard. There have been some excellent suggestions for revisions to the content and the use of the existing GEDCOM spec. We will submit them to the FHD during January 1996. Some particular suggestions that have been offered are:
Include the version of GEDCOM that the GEDCOM file is coded to in the file header. Since the specs change, give the file reader a chance to read it under the same conditions that were used to create it.
Allow use of the LOCAtion and SITUated tags to define places. The recommendation from Pierre Cloutier is the most clearly presented suggestion of its kind that I have seen. He proposes using place relationships, a little like person relationships, where San Francisco is a LOCAtion, SITUated in California, which is also a LOCAtion, SITUated in the USA, which is also a LOCAtion.
Try to allow more user tags and fewer GEDCOM tags. It strikes many people as confusing to have multiple tags with similar meanings. Reducing the number of GEDCOM tags and allowing a freer exchange of user-defined tags would address many of the problems that we saw in phase one.
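
The place relationships in the second suggestion can be pictured with a small sketch. This is only an illustration in Python, not GEDCOM syntax and not Mr. Cloutier's actual proposal: the class name `Location` and the field `situated_in` are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Location:
    """A place (LOCAtion) that may be SITUated in a larger place.

    The class and field names are hypothetical, chosen for illustration.
    """
    name: str
    situated_in: Optional["Location"] = None

    def chain(self) -> List[str]:
        """Walk outward from this place to the largest enclosing place."""
        names = []
        place = self
        while place is not None:
            names.append(place.name)
            place = place.situated_in
        return names


usa = Location("USA")
california = Location("California", situated_in=usa)
san_francisco = Location("San Francisco", situated_in=california)

print(", ".join(san_francisco.chain()))
```

Because each place is entered once and linked to its parent, a correction to "California" would carry through to every place situated in it.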

In general, it appears that the interests of the FHD and the interests of genealogical developers are not similar enough to justify a continued interdependence for data exchange. Each group appears quite capable of going in a different direction under its own steam, and I expect that to be the outcome. In particular, I expect that GEDCOM will continue to be revised, and I expect developers to rely on it less for exchange purposes.

One of the main issues leading to this conclusion is the management of the specification by the FHD. The GEDCOM 5.x spec was in draft form for four years. In contrast, during that same period web browsers were conceived, developed, and distributed, and HTML reached version 3.0. In fairness to the FHD, they have reportedly offered several times to give the management of GEDCOM to any other qualified organization, but they haven't perceived the candidates as having the technical expertise required.

It isn't clear from this vantage point how much better developers can do without the leadership of the FHD, and it's not certain that data exchange will improve at all. One thing that might help would be the creation of a trade association.

 

People Gotta Live Together

Several of the suggestions that were made point out a problem in the genealogical software industry. There is next to no cooperation among the various providers of products and services. A trade association would offer a forum for addressing a variety of current issues. In particular, the design of tests for file exchange, the processing of inquiries, the distribution of problem reports, and suggestions of standard ways for users to enter names, places, and sources could be handled very well by such a group. For the sake of discussion, let us say that developers created the Family History Software Developers Association, and called it GenDev, or another cool name. If they assembled once a year and discussed industry-wide issues, the user community would probably benefit.

It would be nice if they would publish an inquiry or complaint form that could be used to report problems or to ask questions, if those forms went to the association and were then forwarded to the member involved, and if the association published the statistics. It would be REALLY nice if they would publish a list of standard ways to do things, such as the entry of certain date, place, or source information.

 

What do you mean by NEW STANDARDS?

Besides the communication problems mentioned above, some issues in dealing with the strictness of definitions need more work. That should be approached from two directions: better definitions, and better communication of definitions.

First, genealogists must give technologists better definitions of common terms. This approach is being addressed by a joint project between the Federation of Genealogical Societies and GENTECH called the Lexicon Working Group. Present members of the group are Robert C Anderson (co-editor of the American Genealogist), Curt Witcher (president of the FGS), Marsha Hoffman Rising (past president of the FGS), Mike StClair (a developer for 25 years), and myself. Mr Anderson is giving a progress report at GENTECH96. Briefly, the group is responsible for producing a list of definitions of genealogical terms, and relationships among those objects, which will be suitable for both users and programmers to understand their meanings. It is a project that may never end, but rather splinter into more specific working groups. The group will produce a Request for Comment (a document with some suggestions, requesting comment) during the first quarter of 1996.

This effort will not produce single definitions for terms in every case, but will document the ones that are used if that is reasonable. This could prove most useful if it allows a developer to "map" his database over these definitions. Another program could look at this map, look at its own map, and convert the data directly. Direct conversion might be preferable to conversion to GEDCOM, and then conversion from GEDCOM. It is my opinion that this is the most likely course that data exchange will take in the next couple of years, because of what is happening in other areas, such as SGML and Java.
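
To make the "map" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the shared terms (`family_name`, `given_name`, `birth_date`) are not taken from any Lexicon Working Group document, and the two programs and their field names are invented.

```python
# Each program publishes a map from its own field names to shared terms.
# The shared terms and both sets of field names are hypothetical.
PROGRAM_A_MAP = {"surname": "family_name", "given": "given_name", "born": "birth_date"}
PROGRAM_B_MAP = {"LastName": "family_name", "FirstName": "given_name", "BirthDate": "birth_date"}


def convert(record, source_map, target_map):
    """Convert a record from the source program's fields to the target's.

    Fields that have no shared term on both sides are returned separately
    instead of being silently dropped.
    """
    # Invert the target map: shared term -> target field name.
    term_to_target = {term: field for field, term in target_map.items()}
    converted, unmapped = {}, {}
    for field, value in record.items():
        term = source_map.get(field)
        target_field = term_to_target.get(term)
        if target_field is None:
            unmapped[field] = value
        else:
            converted[target_field] = value
    return converted, unmapped


record_a = {"surname": "Smith", "given": "Anna", "born": "1 JAN 1900", "nickname": "Annie"}
converted, unmapped = convert(record_a, PROGRAM_A_MAP, PROGRAM_B_MAP)
print(converted)
print(unmapped)
```

The design choice worth noticing is the second return value: anything one program records that the other has no term for is reported rather than lost, which is where a conversion through an intermediate format tends to hurt.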

Second, a number of persons suggested that we should find better ways to communicate about our definitions. Web pages are written in the Hypertext Markup Language (HTML) and exchanged through the Hypertext Transfer Protocol (HTTP). HTML is a specialized form of SGML (Standard Generalized Markup Language, an international standard for the definition of device-independent, system-independent methods of representing texts in electronic form). It has been suggested that we create a Genealogical Transfer Protocol, where instead of using GEDCOM, we use an SGML version. In a related idea, Rafal Prinke of Poland suggested that each file might contain, at the beginning, a "schema" that identifies the structure of the data to follow. Java is a programming language that lets you describe the data you are sending in a "header" and send it over the Internet. It's sort of like HTML, except the rules for page layout come with the page. Over the New Year's holiday, Tom Wetmore suggested that Java might be used to transfer a genealogical protocol instead of GEDCOM. Whatever method is most successful, it's obvious that documenting the data as it is sent would allow a developer to transfer many kinds of data. This approach preserves innovation, which is a key part of any workable solution.
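
The schema-at-the-front idea can be sketched in a few lines of Python. The file layout below (a `SCHEMA` line followed by `|`-separated records) is invented for this example; it is not Mr. Prinke's proposal, not SGML, and not any existing genealogy format.

```python
import io

# A hypothetical self-describing file: the first line names each field
# and its type, and every later line is a record in that layout.
SAMPLE = """\
SCHEMA name:str|birth_year:int|birth_place:str
Anna Smith|1900|San Francisco
John Smith|1872|Boston
"""

CASTS = {"str": str, "int": int}


def read_self_describing(stream):
    """Read the schema line, then use it to interpret the records that follow."""
    header = stream.readline().strip()
    if not header.startswith("SCHEMA "):
        raise ValueError("file does not begin with a schema line")
    fields = []
    for spec in header[len("SCHEMA "):].split("|"):
        name, type_name = spec.split(":")
        fields.append((name, CASTS[type_name]))
    records = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        values = line.split("|")
        if len(values) != len(fields):
            raise ValueError(f"record does not match schema: {line!r}")
        records.append({name: cast(value) for (name, cast), value in zip(fields, values)})
    return records


records = read_self_describing(io.StringIO(SAMPLE))
print(records[0])
```

The reader knows nothing about names or birth years in advance. A sender could add a new field to the schema line and the same reader would pick it up, which is the sense in which this approach preserves innovation.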

It seems reasonable to me that both groups might meet in the middle. This represents an effort to improve both the strict definitions and the loose definitions of genealogical data, and that's gonna work.

 

OK, then what?

If these issues were behind us, there would be some very interesting challenges remaining. We could move from data exchange to data interpretation. We could work on merging our datasets with those of our relatives. I have already managed to merge two people's files together. There were about 2500 names in one file, about 1000 in the second file, and about 500 of them were duplicates: the same people. If genealogical researchers are going to communicate with their relatives, this will be a likely nineties phenomenon. I found that the computer did not know which ones were the same people. This was good and bad. It is good that it did not assume that people were the same and start merging like a runaway train. It is bad that it took me about six months to clean up the mess. As one correspondent put it, we need some "robust merging/matching" rules. Those rules will not only allow us to merge files together on our own computer, but they will enable a program to go around the Internet looking for duplicates and E-mailing the descendants. If the rules were easy, and the data was good, we would be close on this. Neither is the case, and we are not close, and we will not be close next year, either. But someday ... well, imagine getting an E-mail telling you where to look for your family.
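
As a rough illustration of what a matching rule might look like, here is a deliberately naive sketch in Python. The rule itself (same normalized name, birth years within two years) and the record layout are assumptions for the example; real rules would have to be far more robust. The sketch only proposes candidate pairs for a person to review and never merges anything on its own.

```python
from itertools import product


def normalize(name):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def candidate_duplicates(file_a, file_b, max_year_gap=2):
    """Return pairs that look like the same person, for a human to review."""
    candidates = []
    for a, b in product(file_a, file_b):
        same_name = normalize(a["name"]) == normalize(b["name"])
        years_known = a.get("birth_year") is not None and b.get("birth_year") is not None
        close_birth = years_known and abs(a["birth_year"] - b["birth_year"]) <= max_year_gap
        if same_name and close_birth:
            candidates.append((a, b))
    return candidates


file_a = [
    {"name": "Anna  Smith", "birth_year": 1900},
    {"name": "John Smith", "birth_year": 1872},
]
file_b = [
    {"name": "anna smith", "birth_year": 1901},
    {"name": "John Smith", "birth_year": 1840},
    {"name": "Mary Jones", "birth_year": None},
]

for a, b in candidate_duplicates(file_a, file_b):
    print(a["name"], "<->", b["name"])
```

The two John Smiths share a name but were born 32 years apart, so the rule leaves them alone. That is the "good" half of the experience described above; the hard part is writing rules that also catch the duplicates this one would miss, such as spelling variants and missing dates.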

The other challenge is the addition of new types of genealogical data. We spoke several years ago about sound and image data - now most programs will let you make some kind of scrapbook and put recorded conversations out there. The digitizing of videotape, smells, DNA chains, and whatever comes up next year will take a while to absorb. The linking of your conclusions to evidence scattered online around the world will take a while to create. I can imagine a large on-line database of place names, containing the names in many languages, and longitude and latitude coordinates. Your genealogy program could link your places to them, and you wouldn't have to re-enter the information - your program could even suggest the links for you.

 

So what does it all mean?

We need a clear division between data that requires strict definition and data that does not. For the strict definitions, we need leadership from traditional genealogists that we haven't had in the past. The Lexicon Working Group will try to address that need. For the flexible definitions, we need better ways to communicate what we mean by them, and the use of schemas, Java, and SGML type approaches could be very helpful in that effort. We need better cooperation among developers to allow users to master the skills of entering and communicating their evidence, and it's not clear that such cooperation is coming soon. And we need more ideas all the time, from people like the ones who submitted problems and suggestions to the Data Exchange Project, to understand what the issues are and how to address them.

As the VP Technology of GENTECH, I'm grateful for every effort made to contribute to the project. I'd like to say thanks to Lee Hoffman, the Project Leader, and Karon Bosze, who both worked with me to gather the information for the project. As an active amateur genealogist, I'm very hopeful that these ideas can be used to do things better. In Phase 3 of this project, we'll try to make that hope a reality.
