Changes: Forum:How we encode our data

Revision as of 05:36, 5 October 2007

This is about Metadata. Feel free to skip the theory to get to the meat: #Using /persondata subpages

Some in the wiki and wider internet community advocate representing information in a way that computers can evaluate. This movement is relevant to genealogy researchers because the inferences that can be drawn from such representations deliver valuable results. Genealogy information is simpler than the general case of information representation, so conflicts can be identified automatically. For example:

1. Joe was married to Mary.
   - The date of this event was X.
2. Child Y's mother was Mary.
   - The source of this idea is A.
3. Child Y's mother was Jane.
   - The source of this idea is B.

Situations with conflicting information, such as item 2's version of the truth and item 3's alternate view, are well known to anyone who has dabbled even briefly in genealogy research. In some future genealogy wikia, such information could adjust the probabilistic confidence in particular views of a family history, using techniques such as Bayesian inference.
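As a rough illustration of how conflicting sources could shift confidence, here is a minimal Python sketch of a Bayesian update over the two competing claims about Child Y's mother. The reliability figures for sources A and B, the uniform prior, and the function names are all invented for illustration; nothing here comes from an existing genealogy tool.

```python
def posterior(prior, likelihoods):
    """Apply Bayes' rule over mutually exclusive hypotheses.

    prior:       dict hypothesis -> P(hypothesis)
    likelihoods: dict hypothesis -> P(evidence | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Assumed numbers: source A is right 90% of the time, source B 60%.
RELIABILITY = {"A": 0.9, "B": 0.6}


def claim_likelihood(source, claimed, hypotheses):
    """P(source names `claimed` | each hypothesis is the truth).

    With only two candidates, a source that is wrong must name the other one.
    """
    r = RELIABILITY[source]
    return {h: (r if h == claimed else 1 - r) for h in hypotheses}


hypotheses = ["Mary", "Jane"]
belief = {h: 0.5 for h in hypotheses}  # uniform prior, also an assumption

# Item 2: source A says the mother was Mary.
belief = posterior(belief, claim_likelihood("A", "Mary", hypotheses))
# Item 3: source B says the mother was Jane.
belief = posterior(belief, claim_likelihood("B", "Jane", hypotheses))

for name, p in belief.items():
    print(f"{name}: {p:.3f}")
```

Under these made-up reliabilities the more trusted source dominates, but the conflicting claim still pulls the confidence in Mary below what source A alone would give.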

Gedcom has some of this information, but LDS's goal[1] for Gedcom was that it be a format for exporting or importing data between various programs or internet sites, nothing more. The Gedcom 6.0 (XML) format continues to confine itself to that goal, as stated in its draft spec.[1]

In the wiki community, many infobox templates record information conforming to the hCard microformat. This sort of encoding can potentially support a superset of the information that Gedcom 6.0 will support. If we wish to follow that direction, the Java program that converts Gedcom 5.5 either to Gedcom 6.0-like XML or to Resource Description Framework (RDF) format might be of interest. Further information on the program, and discussion of the issues in such semantic representation of genealogical information, may be found on Jay Askren's site.

Another method of encoding metadata for a person has been advanced by the Biography project. They use "Persondata" information contained in commented text placed at the end of the article. The advantage is that it is unobtrusive: no one is required to use infoboxes in their articles. The disadvantage is that if people don't see the information in the rendered article, there is no incentive to keep it consistent with the other information in the article. When push comes to shove, I think that sooner or later we will have some kind of standard infobox to normalize the appearance of articles.

What does the hassle of conforming to such templates or supporting these hidden blocks of information buy us? Brushing aside all the gee-whiz applications of semantic databases, our genealogy wikia would eventually benefit from very practical features, such as the simple ability to share information between articles. The Semantic MediaWiki extension, described on Meta, supports encoding data centrally so that it can be accessed anywhere in the wiki. It looks like normal wikitext. For example, a person article for Joseph Hester might have the text:

Joseph's father was [[father is::Elias Hester (c1832)]].

Now, any time this information is updated, everyone who wants the change gets it. For example, I have a family tree, and in one of its cells I can either hardcode Elias Hester or simply put:

[[father of::Joseph Hester (c1858)| ]]
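The idea behind the two snippets above can be sketched in Python: a fact is declared once, and any page can then read it, including by reverse lookup. This models the concept only and says nothing about how Semantic MediaWiki works internally; the `FactStore` class and its method names are hypothetical.

```python
from collections import defaultdict


class FactStore:
    """Hypothetical wiki-wide store of (page, property, value) facts."""

    def __init__(self):
        self._facts = defaultdict(dict)

    def declare(self, page, prop, value):
        # Declaring again overwrites, so an update is seen by every reader.
        self._facts[page][prop] = value

    def value_of(self, page, prop):
        """Return the value declared on `page`, or None if there is none."""
        return self._facts.get(page, {}).get(prop)

    def pages_where(self, prop, value):
        """Reverse lookup: titles of all pages that declare prop == value."""
        return sorted(
            page for page, props in self._facts.items()
            if props.get(prop) == value
        )


store = FactStore()
# Declared once, in the Joseph Hester article.
store.declare("Joseph Hester (c1858)", "father is", "Elias Hester (c1832)")

# A family tree cell on some other page reads the fact instead of hardcoding it.
tree_cell = store.value_of("Joseph Hester (c1858)", "father is")
children = store.pages_where("father is", "Elias Hester (c1832)")
print(tree_cell, children)
```

If the declaration in the Joseph Hester article is later corrected, the tree cell picks up the new value on its next lookup without anyone editing the tree.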

Some of this is working today (see the example for California on the ontology semantic wiki page [2]). When it matures, it is surely something that future contributors to Genealogy wikia will want to use. Note that it is just another wikitext operator, so it doesn't impose any radical demands on authors. The majority of contributors can ignore it, but I expect it will gradually gain converts simply because it saves time. It can be adopted in an evolutionary way, and I expect the transition will be gradual, with a mixture of hardcoding and data re-use.

A gradual transition will suit wikia managers very well, because the server load created by complex templates using such queries is not well understood. Caching may make it a non-issue, but note that data dependencies are multiplied: change the father is:: declaration for William the Conqueror, and you could invalidate the cached copies of hundreds of pages that use this information. It is also impossible to predict what the issues with vandals will be. The same objection arose when Wikipedia first started (that users given great power will abuse it); come to think of it, I think the nobility said the same thing about allowing the rabble to vote. Anyhow, a gradual transition allows everyone to learn and adapt.

Other explorations of interest:

Microformats and genealogy information [3]
Inline queries using the Semantic MediaWiki extension [4]
Meta's article on the extension: Semantic MediaWiki

In the near term, we cannot predict how the data representation formats will evolve, and can only adapt along with them. At some point, Genealogy wikia will inevitably have a data mass sufficient to earn us a seat at the table, so that we may positively influence that evolution.

For the near term, we should encourage folks to encode information using standard templates such as Template:Person. This will help the future upgrade of the data to representations such as the above.

Secondly, an important point was made by Jay Askren on his page. The fundamental issue in data interchange is deciding whether Person A in an input file is the same individual as an existing Person B who has the same name, birthdate, and birth location but a different parent. Askren noted that LDS has used globally unique identifiers for some time, and he considered using AFNs (Ancestral File Numbers) to deal with this issue. The problem he noted is that the mechanism for creating new ones is controlled by the LDS organization, and it is not clear how open that process is to other contributors. Perhaps it is no big deal: if LDS would parcel out authority over ranges of numbers and trust other organizations (e.g. genealogy wikia) to see that they are used properly, that seems like it would be acceptable.

Another proposal is to carry our own globally unique identifiers (GUIDs, pronounced "goo-id"). Data import programs would record the AFNs if they are passed in a gedcom file, but as part of the import we would also assign our own unique identifier to every new record. For example, when we start exporting data from genealogy wikia, we make a pass over all articles and generate GUIDs for them, using something like an online GUID generator. We then periodically update the persondata (or alternatively template:person) UID field of all new pages with these GUIDs. A bot would also periodically restore any inadvertently deleted GUIDs. These GUIDs are not typically displayed, but are used for matching when importing or exporting data and when looking up data.
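The pass described above might look roughly like the following Python sketch. It uses the standard library's `uuid.uuid4()` in place of a GUID web site; the `assign_guids` function, the `registry` of previously issued GUIDs, and the field names `GUID` and `AFN` are assumptions made for illustration, not an existing bot.

```python
import uuid


def assign_guids(persondata, registry):
    """Give every article a GUID, restoring known ones from `registry`.

    persondata: dict article title -> dict of metadata fields (assumed layout)
    registry:   dict article title -> GUID string issued on an earlier pass
    """
    for title, fields in persondata.items():
        if fields.get("GUID"):
            # Already has one: just remember it for future restores.
            registry.setdefault(title, fields["GUID"])
        elif title in registry:
            # The field was deleted inadvertently: put the old GUID back.
            fields["GUID"] = registry[title]
        else:
            # New page: issue a fresh random GUID.
            fields["GUID"] = str(uuid.uuid4())
            registry[title] = fields["GUID"]
    return persondata


registry = {}
pages = {
    "Joseph Hester (c1858)": {"AFN": ""},  # no AFN was passed in on import
}
assign_guids(pages, registry)
issued = pages["Joseph Hester (c1858)"]["GUID"]

# Simulate an editor deleting the field, then the periodic bot run.
del pages["Joseph Hester (c1858)"]["GUID"]
assign_guids(pages, registry)
restored = pages["Joseph Hester (c1858)"]["GUID"]
print(issued == restored)
```

Keying the registry by article title is the weak point of this sketch: a page move would need to update the registry as well, or the restore step would issue a second GUID for the same person.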

Which brings me to why I am thinking about any of this now. I intend to add an AFN and a GUID field to the persondata subpages of articles. These subpages hold a superset of the information in the Template Person. They support Wikipedia's hCard metadata approach as well as the Persondata metadata style that the Wikipedia Biography project is using.

Using /persondata subpages

This metadata approach for genealogy allows authors to re-use data now, without any SQL queries and without waiting for some unknown date when Semantic wiki extensions will arrive.

As of today, it is possible to do queries. For example:

{{get|William I, King of England (1027-1087)|key=birthdate}}

produces: Template:Get

Similarly,

Father is:Template:Get

Image is:[[Image:Template:Get|100px]]

Of course, if the "Get" occurs in the William article itself, the query is more compact:

{{get|key=father}}

gives:

Template:Get

This syntax is not much longer than what the semantic wikitext requires, though the semantic wikitext will be better because it is definitely simpler to specify. But this works now, and it supports microformats as well as the Persondata initiative, so that's what the GEDCOM bot will produce.
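For readers who think better in code, the lookup behaviour of the query can be modelled in a few lines of Python. This is a sketch of the behaviour described in this post, not the template's actual source; the `PERSONDATA` table stands in for the /persondata subpages, and the values in it are illustrative.

```python
# Stand-in for the "/persondata" subpages, keyed by article title.
PERSONDATA = {
    "William I, King of England (1027-1087)": {
        "birthdate": "1027",                     # taken from the article title
        "father": "Robert I, Duke of Normandy",  # illustrative value
    },
}


def get(key, page=None, current_page=None):
    """Model of {{get|<page>|key=<key>}}.

    When `page` is omitted, the lookup falls back to the article the
    query appears in (`current_page`), like the compact form above.
    Unknown pages or keys yield an empty string.
    """
    title = page if page is not None else current_page
    if title is None:
        raise ValueError("no page given and no current page to default to")
    return PERSONDATA.get(title, {}).get(key, "")


WILLIAM = "William I, King of England (1027-1087)"

# Long form, usable from any article.
birthdate = get("birthdate", page=WILLIAM)
# Compact form, as it would behave inside the William article itself.
father = get("father", current_page=WILLIAM)
print(birthdate, father)
```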

Simply create a persondata subpage on the talk page for any article and give it a whirl.

One of the benefits of this encoding is that family trees now update automatically. No more specifying all the levels of the tree, or searching out and updating all the fricking trees that might be affected by a newly found ancestor or, worse, by a mistaken parentage that needs fixing. You simply plop a single Ahnentafel on the page and you are done. You don't even need to specify a name, since it will assume the name of the article. That is, unless you have moved the article 12 times because you keep renaming it to overspecify middle names, death dates, etc. In that case you must now move the metadata too. To each his own.

	William I, King of England (1027-1087)

Naturally, you can specify the root of the tree, so you can display the tree of any ancestor from another article. E.g., the example above was generated with:

{{Ahnentafel2|William I, King of England (1027-1087)}}
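To show why a single call is enough, here is a Python sketch of how an Ahnentafel can be filled in from nothing but father and mother lookups. It relies on the standard Ahnentafel numbering (person n has father 2n and mother 2n+1). The function name, the `lookup` callback, and the toy family are hypothetical, and this is not a description of the template's internals.

```python
def ahnentafel(root, lookup, levels=2):
    """Return {Ahnentafel number: name} for `root` and its known ancestors.

    lookup(name, key) returns the parent's name for key "father" or
    "mother", or "" when unknown (a stand-in for the metadata query).
    `levels` is the number of ancestor generations above the root.
    """
    table = {1: root}
    # Parents are looked up for every slot in generations 0 .. levels-1.
    for n in range(1, 2 ** levels):
        person = table.get(n)
        if not person:
            continue  # unknown ancestor: nothing further up this branch
        for offset, key in ((0, "father"), (1, "mother")):
            parent = lookup(person, key)
            if parent:
                table[2 * n + offset] = parent
    return table


# Toy metadata; the names are placeholders, not real articles.
FAMILY = {
    "Child": {"father": "Dad", "mother": "Mom"},
    "Dad": {"father": "Grandpa"},
    "Mom": {"mother": "Granny"},
}


def lookup(name, key):
    return FAMILY.get(name, {}).get(key, "")


tree = ahnentafel("Child", lookup, levels=2)
for number in sorted(tree):
    print(number, tree[number])
```

Because each slot is derived from its child's metadata at display time, adding a newly found ancestor to one person's data changes every tree that passes through that person.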

Globalization: Reuse removes a lot of the drudgery of keeping the various language versions in sync. E.g., slip the lang parameter in there, and you have:

For the purposes of this example, I have only supported the 2-level tree. I will fix the 6-level one in due course.

Don't worry. Be happy.

~ Phlox 02:48, 5 October 2007 (UTC)

Notes

[1] The Church of Jesus Christ of Latter-day Saints (LDS) authored the Gedcom spec. It has been a great contribution to the community.

@@ Line 12: / Line 12: @@
 #Child Y's mother was Jane.
 #*The source of this idea is B.
-Situations with conflicting information such as item #2's version of the truth 2 and item 3's alternate view are well known to anyone dabbling even briefly with genealogy research.  In some future genealogy wikia, such information can alter the probablistic confidence of particular views of a family history using stuff like [[wikipedia:Bayesian inference]].
+Situations with conflicting information such as item #2's version of the truth 2 and item 3's alternate view are well known to anyone dabbling even briefly with genealogy research.  In some future genealogy wikia, such information can alter the probablistic confidence of particular views of a family history using stuff like [[wikipedia:Bayesian inference|Bayesian inference]].
@@ Line 18: / Line 18: @@
-In the Wikicommunity, persondata templates are recording information conforming to the HCard Microformat.  This sort of encoding contains a superset of what Gedcom 6.0 does.   If we wish to follow that sort of direction, the Gedcom5.5 java program that converts to Gedcom6.0 like XML or alternatively to [[wikipedia:Resource Description Framework|Resource Description Framework]] (RDF) format might be of interest.  Further information on the program and discussion of the issues for such semantic representation of genealogical information may be found on [http://jay.askren.net/Projects/SemWeb/ Jay Askren's site].
+In the Wikicommunity, many infobox templates are recording information conforming to the HCard Microformat.  This sort of encoding can potentially support a superset of the information that Gedcom 6.0 will support.   If we wish to follow that sort of direction, the Gedcom5.5 java program that converts to Gedcom6.0 like XML or alternatively to [[wikipedia:Resource Description Framework|Resource Description Framework]] (RDF) format might be of interest.  Further information on the program and discussion of the issues for such semantic representation of genealogical information may be found on [http://jay.askren.net/Projects/SemWeb/ Jay Askren's site].
-Another method of encoding metadata for a person has been advanced by the Biography project.  They use "Persondata" information contained in commented text placed at the end of the article.  The advantage of this is that it is unobtrusive- no one is required to use infoboxes for their articles.  The disadvantage is that if people don't see the information in the resulting article, there is no incentive to keep the information valid with respect to other information in the article.  Push comes to shove, I think that sooner or later we will have some kind of standard infobox to normalize the appearance of articles.
+Another method of encoding metadata for a person has been advanced by the [[wikipedia:Wikipedia:WikiProject Biography|Biography project]].  They use "[[wikipedia:Wikipedia:Persondata|Persondata]]" information contained in commented text placed at the end of the article.  The advantage of this is that it is unobtrusive- no one is required to use infoboxes for their articles.  The disadvantage is that if people don't see the information in the resulting article, there is no incentive to keep the information valid with respect to other information in the article.  Push comes to shove, I think that sooner or later we will have some kind of standard infobox to normalize the appearance of articles.
-What does all this buy us.  Well, brushing aside all the gee whiz applications of semantic databases, our genealogy wikia could benefit from very practical features, such as the simple idea that it allows information to be shared between articles.  Meta's Semantic Mediawiki [[m:Semantic MediaWiki|extension]] supports encoding data in a central way that can be accessed anywhere in the wiki.  It looks like normal wikitext.  For example, a person article for Joseph Hester might have the text:
+What does the hassle of conforming to such templates or supporting these hidden blocks of information buy us?  Well, brushing aside all the gee whiz applications of semantic databases, our genealogy wikia would eventually benefit from very practical features, such as the simple idea that it allows information to be shared between articles.  Meta's Semantic Mediawiki [[m:Semantic MediaWiki|extension]] supports encoding data in a central way that can be accessed anywhere in the wiki.  It looks like normal wikitext.  For example, a person article for Joseph Hester might have the text:
  <nowiki>Joseph's parents were [[father is::Elias Hester (c1832]]. </nowiki>
@@ Line 52: / Line 52: @@
-Which brings me to why I am thinking about any of this now.  It is my intention to add an AFN and a GUID field to a persondata subpage of articles that is a superset of information in the Template Person, but employs wikipedia's [[wikipedia:Wikipedia:WikiProject Microformats/hcard|Hcard]] metadata approach that the Wikipedia Biography project is using.
+Which brings me to why I am thinking about any of this now.  It is my intention to add an AFN and a GUID field to persondata subpages of articles.  This represents information that is a superset of information in the Template Person. It supports wikipedia's [[wikipedia:Wikipedia:WikiProject Microformats/hcard|Hcard]] metadata approach as well as the Persondata metadata style that the Wikipedia Biography project is using.
 ==Using /persondata subpages==