Tech Issue: Redundant articles

This page concerns multiple articles on the same individual that are created inadvertently. There are other problems that are not treated in depth here:
 * Multiple articles on the same place or other non-person subject. These can be handled in the normal Wikipedia way: manual identification and a suggested merge using the merge template.
 * Deliberately created articles on the same individual, intended to record an alternate theory about that individual. These are normal in genealogical research and should be retained even after research has conclusively shown the alternate version to be incorrect: others may come across data suggesting the same disproved theory and will want to review prior work on it.

The issue: inadvertently created duplicates

Wikipedia has 10 million pages, including redirects. One might think its problem would be much harder than ours: in our case the variants of a person's name are limited, whereas the number of ways of titling an arbitrary topic of human knowledge is virtually limitless. Yet Wikipedia manages to identify differently named articles about the same topic and merge them. This works because everyone is an editor and can spot such problems. For the same reason, we have an inherent advantage over other genealogy sites.

However, our problem differs from Wikipedia's in an important respect. Wikipedia has a smaller number of pages, each of common interest to many people. Genealogy wikia will have a much larger number of pages, mostly about recent individuals of interest only to their descendants. In our first decade of existence, Genealogy wikia will quickly overtake Wikipedia in sheer number of articles. It is not difficult to see why: it is already not uncommon for individuals running their own genealogy sites to host a million individuals. We will be pulling in this data via bot, so we will quickly reach multiple millions. Further, because migrations cross language boundaries, genealogy does too, and our site is therefore multilingual: we will have not just English-language genealogies but those from a broad global community.

When we pass 100 million individuals, we are really going to need serious tools and infrastructure to identify such duplicates automatically.

Mechanisms and infrastructure to deal with redundancy

 * We are tracking two kinds of unique identifiers in info pages: AFNs and Genealogics person IDs. Unfortunately, we cannot mint either of these ourselves, so we will use our own GUIDs to keep things straight. Our GUIDs will be invisible to users and will only be generated when we start exporting information to genealogy programs. Likely we will stuff them in a notation field and read them back on import.
 * Formal data encoding needed: this is the style of encoding used in structured databases. Ultimate solutions to this problem will likely employ probabilistic code, but that is a ways down the road. In the meantime we will use more rudimentary heuristic techniques. What is common to any programmatic solution is that the code should not also be saddled with performing natural language processing on free text. If the contributor placed the name of the father in a field, the program does not have to parse the various cell-table formats to extract that information. It would be worse still if it had to extract other useful data, such as birth county, from free text.
 * Whatever data encoding formalism we use, so long as the crucial data is encoded in a controlled way, we will be positioned for such later tools.
 * Interim solutions using hard-coded heuristics will be used until we can incorporate a probabilistic system. Some rules might be the following:
 * Father surname and mother surname match, with a birth county match, and a birth or death date within an acceptable range.
 * Relative with a UID match, e.g. an identical Genealogics or AFN number. Assume the UID data is correct.
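
The hard-coded rules above could be sketched roughly as follows. This is only an illustration, not a working site tool: the field names are hypothetical, only birth year is compared (the real rule would also consider death dates), and the year tolerance is an arbitrary placeholder.

```python
def plausible_duplicate(a: dict, b: dict, year_tolerance: int = 2) -> bool:
    """First heuristic: father surname, mother surname, and birth
    county all match, and birth years fall within a tolerance.
    Field names here are hypothetical, not a site standard."""
    if a["father_surname"].lower() != b["father_surname"].lower():
        return False
    if a["mother_surname"].lower() != b["mother_surname"].lower():
        return False
    if a["birth_county"].lower() != b["birth_county"].lower():
        return False
    return abs(a["birth_year"] - b["birth_year"]) <= year_tolerance

def uid_match(a: dict, b: dict) -> bool:
    """Second heuristic: any shared unique identifier (e.g. an AFN or
    Genealogics ID) is treated as correct and therefore decisive."""
    return bool(set(a.get("uids", [])) & set(b.get("uids", [])))
```

Note that both rules assume the data has already been captured in structured fields; neither function has to parse free text, which is exactly the point of the formal encoding discussed above.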
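
The GUID scheme described earlier (generate on export, stash in a notation field, read back on import) might look something like this minimal sketch. The `_UID` tag is an assumption: it is a custom GEDCOM extension tag used by some genealogy programs, not something the site has standardized on.

```python
import uuid

def make_guid() -> str:
    """Generate a GUID for an individual at export time."""
    return str(uuid.uuid4())

def export_individual(name: str, guid: str) -> str:
    """Emit a minimal GEDCOM INDI record carrying the GUID in a
    _UID line (an assumed tag, standing in for 'a notation field')."""
    return "\n".join([
        "0 @I1@ INDI",
        f"1 NAME {name}",
        f"1 _UID {guid}",
    ])

def import_guid(record: str):
    """Read the GUID back out of a record on import, if present."""
    for line in record.splitlines():
        if line.startswith("1 _UID "):
            return line[len("1 _UID "):]
    return None
```

Because the GUID round-trips through export and import unchanged, two records carrying the same GUID can be recognized as the same individual regardless of how their names are spelled.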