Forums: Index > Watercooler > Yewenyi article upgrade

I propose the upgrade of Yewenyi-gedcom-generated articles that use an archaic article style no longer in use at familypedia. An example of the converted articles may be found at [1]. If I hear no complaints, I shall proceed with a fully automated bot run to convert them. Up until now, these have been manually guided. A couple thousand articles may be involved. -~ Phlox 07:48, 29 June 2009 (UTC)

Sounds like a plan to me. My only caveat is that some of them have little snippets of extra info on them, with occasional ones with scads of extra info. So we should at least check size changes afterwards and look for the outliers. Thurstan 08:26, 29 June 2009 (UTC)
Sorry, I see from Catherine Stuart (Abt 1851-?) that you've got it covered. Thurstan 08:27, 29 June 2009 (UTC)
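The size check suggested above could be sketched roughly as follows. This is only an illustration: the function name, the `tolerance` value, and the sample byte counts are hypothetical, and it assumes the page sizes before and after conversion have already been collected by some other means.

```python
from statistics import median

def flag_size_outliers(sizes, tolerance=3.0):
    """sizes maps title -> (bytes_before, bytes_after).

    Returns the titles whose shrinkage is far larger than is typical for
    the run, measured in units of the median absolute deviation.
    """
    losses = {title: before - after for title, (before, after) in sizes.items()}
    typical = median(losses.values())
    deviations = [abs(loss - typical) for loss in losses.values()]
    spread = median(deviations) or 1  # avoid dividing by zero when all losses match
    return sorted(
        title for title, loss in losses.items()
        if (loss - typical) / spread > tolerance
    )

# Hypothetical byte counts, for illustration only.
sample = {
    "A": (1000, 800),
    "B": (1100, 900),
    "C": (1050, 860),
    "D": (5000, 900),  # lost far more text than the others
}
print(flag_size_outliers(sample))
```

Pages flagged this way would still need a human to compare the history version with the converted one; the check only narrows down where to look.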

It's a pain in the neck to cover every single variation, but I got a lot of them, including the rarer case of notes being placed between Biography and birth:. However, some I just skip (for example those with births or deaths with "Bet date1, date2") and won't pick them up later unless it turns out there are a lot of them. I am also not confident of the multiple marriage handling because I didn't have enough examples, so I skip those too. I do some rationalization of place names but don't have the data on localities to verify them. I rename from Abt to c<date> in article names if the article does not yet exist. I rename link destination from county name to countyname, subdivision name if that article does exist. -~ Phlox 08:41, 29 June 2009 (UTC)
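Two of the rules described above (renaming "Abt" titles, and skipping "Bet" date ranges) might look something like this in outline. It is a sketch under assumptions, not the AWB script itself: the function names are hypothetical, "c<date>" is assumed to mean a form such as "c1851", and the "Birth:"/"Death:" line layout is a guess at the Yewenyi format.

```python
import re

def proposed_title(title, existing_titles):
    """Rename 'Abt 1851' to 'c1851' in a title, unless that article already exists."""
    new_title = re.sub(r"\bAbt (\d{3,4})", r"c\1", title)
    if new_title != title and new_title not in existing_titles:
        return new_title
    return title

def should_skip(wikitext):
    """Skip pages whose birth or death line holds a 'Bet date1, date2' range."""
    pattern = r"^(Birth|Death):.*\bBet\b"
    return re.search(pattern, wikitext, re.IGNORECASE | re.MULTILINE) is not None

print(proposed_title("Catherine Stuart (Abt 1851-?)", set()))
print(should_skip("Birth: Bet 1759, 1804"))
```

Pages that `should_skip` rejects would simply be left in their original state for a later pass.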

Seems like an excellent project in principle, and well planned so far, but maybe premature.
  1. I don't recall reading a progress report on SMW that said all of our properties, forms, and query systems etc are now fully in place ready for promoting to the world. Until that is done, I want Phlox to concentrate on getting us there. Then I want him to be working on our GEDCOM conversion and merge-on-the-spot process. If this little project is helping him with that, as he seemed to be suggesting to me a few days ago, good; but let's see that explained.
  2. Also, I wonder whether the proposed upgrade may be partly a waste of time if properties etc get changed again in the final polishing, requiring the yewenyi upgrades to be further upgraded.
  3. Small points about the Catherine example: some of the specific census info (e.g. "Location:at Wester Howmill, Turriff, Aberdeenshire, Scotland") seems to have been lost, her spouse should have a space between his "name" and his dates bracket, and the children box overlaps the person box.

Robin Patterson (Talk) 09:31, 29 June 2009 (UTC)

Does this constitute objection to the upgrade? If so I shall not proceed without consensus, but this will stall upgrade to SMW. Properties are stable and these articles (most of which have not been edited ever by a human since their creation in 2005) can be manipulated with the SMW form. It seems to me there is very little downside to this upgrade. -~ Phlox 19:56, 29 June 2009 (UTC)

We don't want stuff lost (consigned to the page history, where nobody is likely to look for it). I noticed you doing something with a page like Forms General Info an hour or three ago. If that or something else will ensure that all non-trivial matter gets kept up front, fine. And with stable properties (apart from any new ones that may arise) I'm happy to see the upgrade proceed as discussed if it takes very little Phlox time (which I expect; you convinced me, a few days ago, that 3,000 would be almost as easy as 12). — Robin Patterson (Talk) 15:05, 30 June 2009 (UTC)
Thurstan looked at my examples and saw that I was picking up the non trivial data. Can I guarantee 100%? Long ago I think I made it clear how skeptical I am of any claim that any converter of text with natural language in it can be 100%. My response is that on the whole the data is in better condition than when it came in. Stuff like planet earth and entries with empty data are removed, redlinks to wrongly named counties are renamed to articles on them if they exist.
When do you jump? When is the converter good enough? First off, when we think of change, it's quite different if the change is reversible. Unpleasant things like death, taxes, divorce, loss of job- generally not reversible. Things like upgrades? Reversible. A bot can be run to revert every single change.
At the end of the day it is still a gut call and realistic assessment of the downsides. As for changed parameters, or discovering data was possibly not converted properly somewhere down the line, do we have recourse? Sure- we do a bot run and rename the parameter or move it to another template. For missing data, we do a bot run to revert and rerun the conversion. Do we want to do that on articles that possibly had been edited after the upgrade? No. That's why you choose articles that have not been touched since 2005.
Phased transitions are tough. This is a precursor to large scale upgrade of info pages, later gedcom. We need to know how close we are. Without doing real conversions whose before/after can be eyeballed, we don't know how close we are.
I didn't see any objections about non trivial stuff not being converted. Unless that is the case, I think we should press on. We have bigger fish to fry than these few thousand archaic articles. -~ Phlox 15:44, 30 June 2009 (UTC)
Well, the converter was NOT good enough when it omitted the presumably non-trivial "Location:at Wester Howmill, Turriff, Aberdeenshire, Scotland" twice in the one article you put up for inspection. "We need to know how close we are." - not close enough yet. — Robin Patterson (Talk) 00:46, 1 July 2009 (UTC)

It is trivial. If there are no others, I shall proceed after fixing that one. If there are any objections, speak up. If you would like to help fix these trivial errors, consider the code at Help:AutoWikiaBrowser/Script01#Code. I await any assistance being offered. Maybe we should just stick with the Planet Earth version of these articles. -~ Phlox 09:23, 1 July 2009 (UTC)

What is trivial? You surely don't mean that someone's precise location on census night is so unimportant that it need not be recorded on Familypedia? Perhaps you mean that the error is trivial to fix, in which case it is disappointing that you had not fixed it 47 hours after it was pointed out. I have "considered" the AWB script you mention, but not read every word of it. Noticed some links that would work on Wikipedia but not here. Which words should I search for if I'm to see how to help you not omit locations? — Robin Patterson (Talk) 09:55, 1 July 2009 (UTC)
I beg your pardon? What would lead you to believe that I thought that census information was not important on Familypedia? As for the case above, Thurstan stated that I had it covered in a subsequent note so I never looked at it. As far as I knew, there was no bug.
I published a paragraph that included a complaint that your procedure "omitted the presumably non-trivial "Location:at Wester Howmill, Turriff, Aberdeenshire, Scotland" twice in the one article you put up for inspection." First sentence of your response said "It is trivial". That's ambiguous, to say the least. That is what made me question what you meant. If I were a less careful reader, I could easily assume that your description of "trivial" was referring to something that had just been described as "presumably non-trivial", especially as you yourself had earlier connected "trivial" with "data" in "...I was picking up the non trivial data" and had said (after my list of errors had reached this page) "I didn't see any objections about non trivial stuff not being converted". Less ambiguity, maybe? — Robin Patterson (Talk) 06:17, 2 July 2009 (UTC)
Thurstan said you had something covered. I think he was wrong, if he meant what I thought he meant. But that was NOT "a subsequent note" to my pointing out the omission. His was "... you've got it covered. Thurstan 08:27, 29 June 2009 (UTC)". My complaint that included three apparent errors in the conversion, written below his note, was "09:31, 29 June 2009 (UTC)". — Robin Patterson (Talk) 06:17, 2 July 2009 (UTC)
Let's step back from the histrionics and take a look at the quality control problem being addressed. Consider the situation exemplified by Daniel Pagett (Aft 1759-Bef 1904). We know from this orphaned article that:
  • He was born to unknown parents between a date "Month Day" (a redlink to whatever this notation is supposed to mean), 1759 and "Month Day" 1804 on a place called "Earth" (including a Wikipedia link thoughtfully provided for people who don't know where that is).
  • Similar quality of information for Death date.
  • For his interment, we are informed that he was interred, again, on this "Month Day" redlink, with "Year" also a redlink. Happily, there is some content to this line, for we learn that he was buried at a place with a blue link called "Location" (wikipedia link to "Location" provided).
  • A child is listed, but it is a redlink. The article is an orphan not connected to any family tree whatever.
  • External links has a broken link to a nonexistent wikipedia article.
  • The biography section redundantly lists the identical birth date information in only a slightly different form.
  • There is a Gallery section, listing "image name", lots of blank lines, then "caption text". No instructions on what this means or how to fix it.
  • Multiple headers with no content: Siblings, Spouses, Parents, Sources.
  • Not even the source gedcom was provided.
This article has been on our system since 2005 with no human touching it. It is not an isolated case. There are well over two hundred articles with this "Month day" junk information alone. Feel free to google it yourself. This article is of a quality not suitable for familypedia, and it reflects poorly on our site. I am working hard to upgrade these weaknesses. This involves highly complex regex expressions and I think most people would agree that Familypedia is better with the upgrade than without it. Lest anyone make the uninformed assertion that this is an exaggeration, consider the following line that only covers one variant of birth information on these pages:
Birth:(:|<br>|\s)*(.*Date:\s*(?<approx>(abt|bef|aft|about|circa|c\. |c )*)?[ \[<]*({{date\|((?<day>\d*)|Day)\|
(Month|(?<month>[a-z]*))\|(?<year>\d{3,4})}}|(?<year>\d{3,4})?[ <>\[\]]*(((?<day>\d{1,2})|Day)?[ <>\[\]]*
(?<month>[a-z]*)[ <>\[\]]*(?<year>\d{3,4})?)))?([^|:[]*[: ][^|:[]*(?#could be Location: probably )
at ((\[\[)?(wikipedia:)?((Location\|Location)|(?<locality>[^\r|]*)[^\]]*)]], )?((\[\[)?(wikipedia:)?
((Location\|Location)|(?<county>[^\r|]*)[^\]]*)]], )?((\[\[)?(wikipedia:)?((Location\|Location)|
(?<state>[^\r|]*)[^\]]*)]], )?((\[\[)?(wikipedia:)?((Location\|Location)|
(?<country>[^*\r|]*)[^\]]*)]]))?((<br>|\s*)*([^|:[]*Notes: *(?<notes>.*))?|
on \[\[wikipedia:Earth\|Earth\]\])[^|[/r/n]
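For readers who do not use AWB, a much reduced version of the same idea can be tried in Python. This is a sketch only and not a translation of the expression above: it assumes a simplified one-line layout ("Birth: Date: Abt 1851 at place"), ignores wiki links, {{date}} templates and notes, and uses Python's `(?P<name>...)` syntax because Python does not accept the repeated group names that the .NET-style expression relies on.

```python
import re

# Simplified, hypothetical birth-line pattern (far narrower than the AWB expression).
BIRTH = re.compile(
    r"Birth:\s*"
    r"(?:Date:\s*)?"
    r"(?P<approx>abt|bef|aft|about|circa|c\.?)?\s*"
    r"(?:(?P<day>\d{1,2})\s+)?"
    r"(?:(?P<month>[a-z]+)\s+)?"
    r"(?P<year>\d{3,4})"
    r"(?:\s+at\s+(?P<place>[^|\r\n]+))?",
    re.IGNORECASE,
)

def parse_birth(line):
    """Return the named parts found in a birth line, or None if nothing matches."""
    match = BIRTH.search(line)
    if match is None:
        return None
    return {name: value for name, value in match.groupdict().items() if value is not None}

print(parse_birth("Birth: Date: Abt 1851 at Turriff, Aberdeenshire, Scotland"))
print(parse_birth("Birth: 12 March 1880"))
```

Even this cut-down pattern shows why every extra variant (templates, links, notes in odd positions) adds another optional branch, and why a missed branch silently drops a field rather than raising an error.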
I agree that hundreds of articles like that are of very poor quality and deserve to have the junk taken out. My objection has been to the throwing out of babies with the bathwater. — Robin Patterson (Talk) 06:17, 2 July 2009 (UTC)

So If there are no further bug reports, I shall proceed with the run after fixing the bug. If further errors are found, I propose the option be to revert the article, or if widespread errors are detected, to run a bot to revert the changes and rerun with a bug fix.
Until other articles are done, we will not be able to see if there are other bugs. I found a couple in the very first example. You are going to fix one "bug". Then how about just running your latest version on the ten largest yewenyi articles? I'm willing to look in detail at ten big articles. You said that you were doing about 12 a day; "It's a testbed. I am learning about regex processing. Upgrading Yewenyi is a byproduct." So there's surely no rush to do the lot. — Robin Patterson (Talk) 06:17, 2 July 2009 (UTC)
It may well be that there are other omissions. If it is something big, I fail to understand why the remedy of reverting and rerunning the bot run is a bad idea. Note also that it is well known that many if not most genealogy programs do NOT handle all information that can be stored in a gedcom. It might be interesting for example to take some of the more complex gedcoms and see how well WR handles them. From the information included on their site and the variety of information I know are in Gedcoms, I would not be surprised if they are dropping huge amounts of data. Then we could ask the same question of WR that you asked of me. But really, these histrionics get us nowhere- Dillan is doing the best he can and so am I. Kindly avoid making unlikely assumptions about my rigor in attempting to recover as much information as possible.
You suggested there was some bug with wikipedia links. Please be specific. What line of the AWB script has an error? I don't think you are asserting that you even vaguely understand regex expressions anyway, nor do you have a working copy of AWB, so I am baffled how you believe you can be of any assistance on that score, but if I am wrong about that, feel free to proffer your technical advice. Folks can be of assistance by identifying all blocking bugs that need to be fixed prior to upgrading these low quality articles.
Besides this, there are many other areas where contributions can be made. For example, the current long form will have a shorter version that we present to the user with just the essential details. What fields on the current person form should appear on the simple form? Nothing is stopping anyone from creating prototype forms, yet I see little or no efforts with these less challenging technical areas, nor in the construction of query pages such as those in the Concept name space. The bulk of the old Info categories are completely obsolete, and work can be done to look to their upgrade.
You said "If you would like to help fix these trivial errors, consider the code at Help:AutoWikiaBrowser/Script01#Code. I await any assistance being offered. " That seemed as if it could be addressed to me, among others, because I would like to help fix these trivial errors; so I did what you asked and reported on my experience. (Thank you, Robin.) But "I am baffled how you believe" I "can be of any assistance on that score", at least until you answer the one question I asked you about where I might help. In case you have forgotten it or did not read it, here it is again: Which words should I search for if I'm to see how to help you not omit locations? — Robin Patterson (Talk) 06:17, 2 July 2009 (UTC)
Links that would work on Wikipedia but not here (a fact that I did not describe as a bug or an error although it may be) include:
<string>re-categorisation per CFD</string>
<string>clean up and re-categorisation per CFD</string>
<string>removing category per CFD</string>
<string>stub sorting</string>
<string>Typo fixing</string>
<string>bad link repair</string>
<string>Fixing links to disambiguation pages</string>   
They will lead to WP pages if they each have "Wikipedia:" prefixed.— Robin Patterson (Talk) 06:17, 2 July 2009 (UTC)
I can spend even more time discussing this rather than working on other improvements to the site, but really, think about the cost benefit and take a look at the before and after quality of the articles. Consider the worst possible downside and consider the proposed remedy. Very little cost, with high gain. Thanks. ~ Phlox 21:46, 1 July 2009 (UTC)

Approximately thirty Yew articles have been converted- there are plenty of examples to look at. I await further bug reports.

Apparently the xml is too confusing. As for the observations about summary edit strings in the xml file: You probably did not notice it, but you were not in the xml section FindAndReplace, but in the general settings section. This particular setting allows you to specify edit summaries when using AWB with different wikis. General AWB settings are as irrelevant to the Yewenyi conversion code as my personal preferences for text font or text edit box size, which are also set there. The code is in the findandreplace section; line 23 of the xml file states "Converts Yewenyi format articles". I would have thought that would seem like a good place to start rather than at the end of the file. What do I think you should search for? Search for regular expression syntax that has to do with whatever you think the presumably "trivial" bug might be associated with. Take a look at a regex reference. There is one linked on my home page.

I have much evidence to make me believe that being of assistance on coding is something that you cannot offer, so let's just drop the observations on wikipedia this or that or what you presume to be trivial or not, ok?

We need greater attention to quality rather than quantity at familypedia. We can begin by cleaning house and correcting past sins. The articles I point to are embarrassments and can be attended to swiftly. The protest seems to be that no edits are better than imperfect edits. My response is that if there is any evidence of a systemic problem, this can be reversed, so I see no reason for stalling this upgrade.

If there are specific articles for conversion please list them. My assumption is that Robin doesn't want any of the Yewenyi articles upgraded until the upgrade is immaculate. An appreciation of the complexity of the task might yield an understanding that immaculate is not an option.

~ Phlox 09:20, 2 July 2009 (UTC)

Think about what has been advocated:
  • Edits that make errors of addition are tolerable: garbage- Month Day, busy work- (wikipedia link to nonexistent articles), Filler- empty headings, gallery sections.
  • Edits that make errors of conversion are tolerated so long as the source material does not yet exist on familypedia. So for example, if a gedcom translator drops some fields, that is ok. If the fields were in a familypedia article, it is cause for deferral of the effort until perfection is achieved, even if the automated pass is regarded as a first step.
  • Edits that perform automated quality control on articles at familypedia are not ok until such time as they are perfect (effectively never).
QED: no quality control at Familypedia. >34,900 and who cares what the articles look like. Because what this says is that we don't. We haven't since 2005. Folks are giddy with excitement about the prospects of backing up the trucks, and dumping in millions of gedcom records and not bother to fix them- ever. When there is a way to improve them, apparently no one wants to run awb to run confirmed edits on them. Robin, the excuse of not being able to run awb does not apply even for you. AWB can be run for you, but you oppose it. You can do a deferred confirmed edit using AWB. What you don't want to do is compare the history version to the current version.
Proposal: Since folks don't want to use AWB for quality control, we will use it in confirmed edit mode as described. Run the bot on 200 at a time, and await manual confirmation of each change (the same thing you can do in AWB when you confirm an edit). When the 200 are confirmed or a month has passed, AWB confirms another 200.
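The release rule in this proposal could be expressed roughly as below. It is a sketch, not anything AWB provides: the function and variable names are hypothetical, "a month" is approximated as 30 days, and it assumes someone keeps a record of which converted titles have been manually confirmed.

```python
from datetime import date, timedelta

BATCH_SIZE = 200
MAX_WAIT = timedelta(days=30)  # "a month", approximated as 30 days (assumption)

def next_batch(pending, current_batch, confirmed, batch_started, today):
    """Return the next titles to convert, or [] while the current batch is still under review."""
    all_confirmed = all(title in confirmed for title in current_batch)
    waited_long_enough = today - batch_started >= MAX_WAIT
    if all_confirmed or waited_long_enough:
        return pending[:BATCH_SIZE]
    return []

# Hypothetical state: two converted pages, only one of them confirmed so far.
pending = [f"Page {i}" for i in range(450)]
batch = next_batch(pending, ["A", "B"], {"A"}, date(2009, 7, 2), date(2009, 7, 10))
print(len(batch))
```

The one-month timeout means an unreviewed batch never blocks the run indefinitely, which is the trade-off the proposal makes between checking and progress.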
It is work. What I don't understand is why I am the only one apparently interested in doing some janitorial work here. It's not a lot of fun, especially when my work is denigrated. Yeah- ok- I missed a spot where I was waxing the floors. Yes, I don't need a lecture on how waxed floors (census reports) are good. I don't need pointers on using the machine from folks that don't know one end the machine from the other. So folks that can't/won't run the waxing machine themselves, get down on your hands and knees and help me scrub where the machine missed. It's not so difficult. It's called a revert, or a copy paste of the missing information into the upgraded article. -~ Phlox 17:00, 2 July 2009 (UTC)

Exaggerations again, my friend. Not good for keeping people fully cooperative. You are "the only one apparently interested in doing some janitorial work here"? Thurstan and rtol have been doing some; so have a few others, I think. I have done some and stressed that I want to do more. Where has it been advocated that "Edits that make errors of addition are tolerable" and "Edits that perform automated quality control on articles at familypedia are not ok until such time as they are perfect"? "What you don't want to do is compare the history version to the current version." - er, I found that error by checking the history; what basis have you for saying I don't want to? — Robin Patterson (Talk) 13:03, 3 July 2009 (UTC)

As you "don't need pointers on using the machine from folks that don't know one end the machine from the other", you shouldn't have wasted your time and mine asking for help from a set of people that could clearly include one such person, in your opinion. I seem to have been the only one to "consider the code" - and as you did not specify which section was the only one worth looking at, I considered it all, but got no thanks and some criticism for reporting on it and asking which bit I should concentrate on. — Robin Patterson (Talk) 13:03, 3 July 2009 (UTC)
The offer was open to anyone, and still is. Your response gave me evidence that you had very little understanding of what you were looking at and would be unable to assist with coding matters. If that is not the case, I apologize. -~ Phlox 16:51, 3 July 2009 (UTC)
You say "if there is any evidence of a systemic problem, this can be reversed" - well, the one and only example that we were given at the start had an omission of data that you and I both agree is genealogically valuable; you agreed that that omission was a bug and said you would fix it. There has been no change to the code in the couple of days since you said you would fix that bug, over two days after I listed the omission; so I can't begin to work out what your proposed change will be and therefore what it will do. Unless that fix covers a great many possible systemic errors, the statistical likelihood is that at least one other article (and maybe hundreds) might throw up another error that one or both of us agree needs fixing. You now seem content to convert thousands of articles then revert them all if another bug appears. Would that be easier than following my suggestion of doing the ten biggest ones and have a keen floor-polisher check them before going on? I doubt it. You now say you have done about thirty. I will have a look at several and report on findings. — Robin Patterson (Talk) 13:03, 3 July 2009 (UTC)

"AWB can be run for you, but you oppose it". Nowhere have I opposed the running of AWB for me. You're not exaggerating there, you're engaging in terminological inexactitude. This is the first time it has been suggested that AWB can be run for me. Please explain how it can be run for someone with Windows Millennium Edition. — Robin Patterson (Talk) 13:03, 3 July 2009 (UTC)
The one and only example? I am confused about how you could be under the impression that there was only one example available to you. You yourself quoted me saying at one point that I was doing about 12 a day. Robin, since June 29, there has been the opportunity to review well over thirty articles. I mentioned their existence, and it is trivial to identify them by the mention of Yewenyi in their summary. I would have thought you understood how to use Special:Contributions to do this. There are not just 30, but 100 upgrades of Yewenyi articles to consider. So far I have seen review of a single article and a single bug report. That indicates to me a lack of interest or rigor in janitorial duties.
What is it about the proposal of running awb for others that you require clarification on? My proposal was: "Run the bot on 200 at a time, and await manual confirmation of each change (the same thing you can do in awb when you confirm an edit). When the 200 are confirmed or a month has passed, aWB confirms another 200." Compare the before and after versions, and if you don't like it, correct it. Not a lot different than what you can do in AWB.
This would be a robotic run. It is my understanding that you oppose a robotic run, so it is accurate to say that "AWB could be run for you, but you oppose it." Do you understand my meaning now?
I await further reports of bugs blocking the full run, or the alternative proposed run of 200 articles.~ Phlox 02:21, 4 July 2009 (UTC)

Rev .30 list of articles

Converted Yewenyi articles using rev .30

~ Phlox 08:49, 7 July 2009 (UTC)

I sampled a number of pages and I could find one fault only: Occasionally, "in the year" pops up. This seems to be in the display only. The properties are fine. I could not discover the glitch in the display. As this is minor, I suggest that the rest of the upgrade is run. rtol 11:14, 19 July 2009 (UTC)
For Catherine Maud Robinson (1880-1966), the death date seems to have been lost. Thurstan 11:25, 19 July 2009 (UTC)


Besides Yewenyi, there is also user:kborland. In this case, though, it may be faster to upgrade {{person}} rtol 10:59, 19 July 2009 (UTC)


Thanks everyone for the bug reports. I will adjust for these errors. My recollection is that my list was Yewenyi initiated articles, so Kevin's stuff probably won't be touched. I'll check to make sure, and if he is affected, I will examine the situation. Right now, I am in the backwoods working upstream of this problem- immersed in location stuff which affects how these upgrades are handled. ~ Phlox 18:33, 19 July 2009 (UTC)

Report on a few more of the early Yewenyi-origin conversions

Some of you may have noticed that I've been a little inactive recently. But I did look for the full list of conversions. "Approximately thirty Yew articles have been converted", said Phlox on about 2 July. I said I'd be willing to look at ten initially. In fact, as it turns out, he and his bot had done about one hundred and thirty. See list. Nobody should have to look at that many before a trivial code-fix is made that could fix at least a couple of constant-looking errors.

I've looked at ten. Most are still in one of my working windows. Two of the ten have no errors attributable to the conversion. Each of the other eight has at least one such error, noted on my last save of that list. As it's nearly 3 in the morning and I'm recovering from a cold I'm stopping now and will look more closely at those eight in my next session.

If any of these are blocking bugs not mentioned earlier, please state the article names. ~ Phlox 17:22, 20 July 2009 (UTC)

Catherine Cramp (Abt 1837-Aft 1871) has some problems:

  • (trivial) "Armidale, New South Wales" has been put into "Wedding1 county"
  • (significant) Second partner UNKNOWN (Bef 1846-Aft 1871) has disappeared

The second of these reveals a fatal flaw in the whole process: Yewenyi's postings elsewhere on the internet show that only the first three offspring are fathered by John Geary (Bef 1836-Aft 1871). This information is not contained in the article here.

There is also an odd problem with the display templates: "St Peter's Parish" (Wedding1 locality) is truncated. Thurstan 23:01, 21 July 2009 (UTC)

Do we know the Gedcom sources for any of the Yewenyi articles? ~ Phlox 04:04, 22 July 2009 (UTC)
I have downloaded a GEDCOM that he submitted to, which lacks a lot of the notes, and removes "living" people. There is another version of the genealogy at (no GEDCOM), and various others in various places around the 'net. I have been using the version to fill in the gaps. Thurstan 05:15, 22 July 2009 (UTC)