Saturday, December 31, 2005

Post Validation

Free Census was originally intended to have three stages: transcription, checking and validation. After validation, the file would be cleared for upload following a final check by the uploader, who was the man running the whole project - it was very much a one-man band in those days.

However, when the first validated files appeared after a couple of years, it was obvious that Plan A would not work. So a Cornish volunteer, Rick Parsons, produced FCTools, a diagnostic programme. It has a variety of uses, one of which is to turn the final zip back into a spreadsheet and check it for errors. This is the sort of thing that you get:

Warning: row 63, page number = 4 not sequential
Warning: row 95, possibly too many lines on page 4
Warning: row 128, consecutive schedule numbers (30) are the same
Warning: row 663, schedule number = 19 not sequential
Warning: row 883, consecutive schedule numbers (78) are the same
Warning: row 1004, consecutive schedule numbers (112) are the same
Warning: row 1012, schedule number = 192 not sequential
Warning: row 1013, schedule number = 116 not sequential
Warning: row 1144, consecutive schedule numbers (146) are the same
Warning: row 1311, birth place = Willesden N 3 contains unusual characters
Warning: row 1312, birth place = Willesden N 3 contains unusual characters
Warning: row 1630, schedule number = 23 not sequential
Warning: row 2012, schedule number = 1 not sequential
Error: row 2987, field too long , truncated
Error: row 2991, field too long , truncated
Error: row 2996, field too long , truncated
Error: row 3004, field too long , truncated
Error: row 3009, field too long , truncated
Warning: row 3009, consecutive schedule numbers (91) are the same
Error: row 3010, field too long , truncated
Error: row 3013, field too long , truncated
Warning: row 3779, birth place = Walworth S1 contains unusual characters
Warning: row 3844, Head of household is not the first entry in the schedule
Warning: row 4410, possibly too many lines on page 13
Warning: row 4443, possibly too many lines on page 14
Warning: row 4619, first page of ED = 20
Warning: row 4619, first schedule of ED = 119
Warning: row 4956, schedule number = 8 not sequential
Warning: row 4957, schedule number = 85 not sequential
Warning: row 4958, schedule number = 140 not sequential
Warning: row 5122, consecutive schedule numbers (20) are the same
Warning: row 5431, consecutive schedule numbers (71) are the same
Warning: row 5617, consecutive schedule numbers (113) are the same
Warning: row 5649, possibly too many lines on page 20
Warning: row 5898, consecutive schedule numbers (169) are the same
Warning: row 5961, possibly too many lines on page 30

The post-validation involves correcting these errors, if in fact they are errors. It also involves eye-balling the data, because some things, although incorrect, are not picked up by the software. In the case of the COCP returns, FCTools is also the means of producing the html. This gives us the opportunity for a final check, when the html is eye-balled for errors. Some elements have to be introduced by hand at this stage, including the credits for the transcribers and checkers.
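For anyone who wants to sort a long report before starting on the corrections, here is a minimal sketch in Python. It is not part of FCTools; it only assumes that every line of the report has the shape shown above ("Warning:" or "Error:", then "row", a number, a comma and a message). The function names are my own invention.

```python
import re
from collections import Counter

# Matches lines such as "Warning: row 63, page number = 4 not sequential".
LINE_RE = re.compile(r"^(Warning|Error): row (\d+), (.*)$")


def parse_report(text):
    """Return a list of (severity, row, message) tuples from report text."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            severity, row, message = match.groups()
            entries.append((severity, int(row), message.strip()))
    return entries


def summarise(entries):
    """Count entries by severity and list the rows that carry an Error."""
    counts = Counter(severity for severity, _, _ in entries)
    error_rows = sorted({row for severity, row, _ in entries if severity == "Error"})
    return counts, error_rows


# A few lines copied from the report above.
sample = """\
Warning: row 63, page number = 4 not sequential
Error: row 3009, field too long , truncated
Warning: row 3009, consecutive schedule numbers (91) are the same
Error: row 3010, field too long , truncated
"""

entries = parse_report(sample)
counts, error_rows = summarise(entries)
print(counts["Warning"], counts["Error"], error_rows)
```

Dealing with the Error rows first and the Warnings afterwards is only a suggestion; many of the warnings turn out not to be errors at all.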

The html file goes off to the COCP web pages; the validation file is uploaded to Free Census. We have now uploaded over a million records to the OLDB and about 1.2 million to our own web pages.

Amazing! And it has only taken five years.....

Friday, December 30, 2005

Validation

In the original plan, Validation was the third and final stage. However, it is in fact the penultimate stage and is followed by post-validation. This note covers both.

When a corrected zip arrives back, I load it into Valdrev, and run it against the images. Unlike checking, I do not have to view every line, but only those on which Valdrev stops.

Valdrev stops for:

1. Alerts, either inserted by the checker, or inserted by the transcriber and not resolved by the checker.

2. Records that have notes left by the transcriber or the checker, but not those contained in the transcriber’s Mynotes file. I do not see those, although Valdrev does stop.

3. County or place of birth names that do not exist as far as the geographical database GENIE is concerned.

From this you can see that if the transcriber leaves lots of notes, I get lots of stops. During validation I edit the notes left by transcribers. Usually, I delete them, but sometimes I retain them, edit them or add to them or insert new ones – as the fancy takes me! If Chapman codes for the Irish or Scottish counties have not been used, I get stops on all those. In the original plan, it was thought that the validation process would be pretty quick, with stops every hundred or so records. Like many things, this didn’t work out and stops are only too frequent.
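The three kinds of stop listed above can be pictured with a short Python sketch. This is not how Valdrev is written, which I have no way of showing; the record fields, the function name and the tiny stand-in for GENIE are all hypothetical and are there only to make the logic visible.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # All field names are hypothetical; they are not Valdrev's own.
    alert: bool = False
    notes: list = field(default_factory=list)
    county: str = ""
    place: str = ""


def stop_reasons(record, known_places):
    """Return the reasons the validator would have to look at this record."""
    reasons = []
    if record.alert:
        reasons.append("alert")
    if record.notes:
        reasons.append("note")
    if (record.county, record.place) not in known_places:
        reasons.append("unknown place")
    return reasons


# Stand-in for the GENIE database: a set of (county code, place) pairs.
known_places = {("CON", "Truro")}

records = [
    Record(county="CON", place="Truro"),
    Record(county="CON", place="Truro", notes=["surname unclear"]),
    Record(county="CON", place="Trewro", alert=True),
]

for number, record in enumerate(records, start=1):
    print(number, stop_reasons(record, known_places))
```

The first record passes without a stop, the second stops for its note, and the third stops twice over. The more notes and unrecognised place names a file contains, the more stops there are.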

The main problem is that the geographical database GENIE is limited in size and does not hold many perfectly good place names. In general, I pass all place names that are “as is”. I do not avail myself of the validator’s option to put in the modern or corrected names.

At the end of this process, I pack for uploading; in theory, this output file could be uploaded. But in practice, we know that there are a lot of errors still in the file, invisible during validation. The file is, therefore, loaded into FCTools. This is a diagnostic tool that identifies errors and gives warnings of possible problems.

FCTools produces a list of errors and a spreadsheet. As well as making the corrections indicated by FCTools, the opportunity is taken to “eyeball” the spreadsheet. It is surprising how many minor errors jump out and hit you in the eye! Once it is as good as it can be, two files are produced. The validation file is uploaded to the Online DataBase (OLDB) and to the Mormons’ Great Granite Cave in Utah. The html file is sent to our web site.

Wednesday, December 21, 2005

What does a checker do?

As you know, I constantly witter on about this being a system and we being a team. So this is all about those strange creatures (to transcribers) who are checkers.

In an ideal world, checkers would be people who had done some transcribing. As it is not an ideal world, many checkers have not done any. Some of the most successful have just done one transcription and have probably forgotten what they learnt.

Starting up a checker is a little more complicated than getting a transcriber underway. However, if the instructions are followed, it can be done. The software is downloaded from the Free Census web site and is known as WINCC. The checker gets a zipped data file by email and loads it into the software. The task is then to go through the data, line by line, and compare it to the returns on the fiche. I cannot make the checkers look at each line and it is possible to just tab and save your way through the whole thing. I have had one or two people who appear to have done just that.

The checker should attempt to identify the transcriber’s mistakes AND correct them. This latter might seem pretty obvious – but isn’t to some people. The software enables the checker to do lots of interesting things, including inserting people, complete households or even whole pages. They can split up households and join people to households. They can alter the header data and leave lots of interesting little notes for the validator.

The checker should attempt to resolve any records flagged by the transcriber. They can leave them for the validator and they can also flag items themselves. Of course, it is just as possible for a checker to be wrong as it is for the transcriber. However, I do not second-guess checkers’ decisions – well, not often. They can see the evidence and what the transcriber thought – I want them to make a decision. Flagging up the query or letting a query through is OK with me – but I would rather they sorted it out before it gets to me.

A well transcribed piece is easy to check; but as we are all human, most transcribed pieces are full of errors. The system will take care of them if we all operate it properly. Sometimes the checkers get nightmare pieces with virtually every record requiring a correction. Usually, these are repetitious mistakes and easy to correct. But each correction involves a number of key strokes or mouse clicks and it can become very boring to do ten mouse clicks for each of 5000 records!

At the end of the process the software produces another zipped file. The name has changed from censXXXX to ZZZZyyyy. This zipped file is emailed to the validator, who stuffs it into a third piece of software. You’ll have to wait for the next installment to find out about that.

Monday, December 19, 2005

What do Transcribers do?

This is a note for those people about to start, or for people who might volunteer and for checkers who have not done any transcribing.

Nowadays, we are all working from discs. Many of our volunteers have bought their own, but we can now supply free discs, courtesy of the LDS. A new transcriber is allocated a piece or a group of parishes and a lot of information. There are help files with the software and FAQ and other things on the main Free Census web site. Everyone gets a lengthy piece written by me and there are also “hints” pages on the web site. In spite of this, questions still arise and transcribers and checkers still come across new problems.

I am not complaining about this. For many volunteers this is a completely new field; they haven’t transcribed census returns before and they haven’t got much experience with computers or the internet. It constantly amazes me how much we are achieving given our wide geographical spread and our inexperience. It is also a fact that the enumerators of the 19th century had many and varied ideas on how they should carry out their task. As the census taking was organised by the government, the instructions to the enumerators were confused and confusing and sometimes downright contradictory.

The original project software is WINCENS. It started off as a DOS programme and was then changed into a Windows programme. It is still DOS of course, under the Windows interface, and transcribers will occasionally see the black DOS screens. However, we are using an alternative to WINCENS - SSCENS.

Because many people did not like the WINCENS programme, especially its inability to allow people to retrace their steps and edit the data, we introduced SSCENS. This is just a spreadsheet with knobs on. But it does allow editing, and people can look at a page in its entirety or at a whole document. If the rules of SSCENS are followed, then this spreadsheet can be converted into a format that will work with the checking software, WINCC. Any spreadsheet will do, although most people seem to be using Excel. It does not matter what platform you are using or which version of Windows. Anybody using an Apple should contact me as I am an Apple user.

A transcriber should aim to combine speed with accuracy. This is a system: transcription, checking and validation. A transcriber should not spend hours trying to decipher one surname. Give it your best shot and move on. If you cannot resolve a problem, flag it up as a query and leave a note if you think it will help.

On completion, a transcriber should email us their completed spreadsheet. You should try and reformat the file as .csv, but if you can’t, then send it as it is. The SSCENS spreadsheet is reformatted for input into the next stage – checking.

Saturday, December 17, 2005

Communications

This project is built round the use of emails. There is, however, a supplement to email: Instant Messaging. This gives you the chance to have instantaneous one-to-one comms with me and to use the COCP chat room to talk to each other and to us. Here are the instructions.

Go to http://www.jabber.org/

Left hand side, under software, click on clients

Next page, under Platforms, click on Exodus (if you are an Apple user, contact me first).

Next page; click on Get Exodus – download “Stable releases”

Next page; click on exodus 0.9.1.0.exe

Click on download now.

Install Exodus. Open an account with jabber. When you have done this, email me and I will send you my jabber account name.

You can then add me as a new contact. Jabber will send me a message asking for permission to display my online status to you. I say yes. You will then be able to see when I am online. Just click on my icon and a panel opens up. The panel is split into two; the bottom one is where you type. You type "hello" and away we go.

We can then have one-to-one contact any time we are both online. The software will tell you when this happy state is available. In addition, you can use the COCP chat room. There is usually someone there, including about 4 regular participants, all of whom have all the discs you can imagine. So you can get a more or less instant second opinion on anything.

You do not have to do this unless you want to - but I recommend you give it a try. You will get instant response, most of the time!

Wednesday, November 02, 2005

Place of birth

One of the areas that gives me most trouble is place of birth. And I suspect that many of you are spending a lot of time trying to work out what the enumerator wrote and what he should have written.

The first thing to do is restate the golden rule – AS IS! We aim to reproduce what the enumerator wrote.

The place of birth column should have two items of information in it – county of birth and place of birth. Transcribers will have two columns to enter data into.

I’ll deal with the county first. Here you enter the county using the Chapman code. This is a 3-letter code that covers all UK counties, including the Irish ones. For some reason, most volunteers have a mental block about the Irish and Scots county codes and don’t use them. You can also enter UNK, OVB & OVF. These are Free Census conventions and are not necessarily “As is”.

UNK is used when no county is given. According to Free Census, as few of these as possible please.

OVB and OVF, for people born overseas, are not used on the census returns. The convention in fact confuses place of birth with nationality. Anyone born in the former British Empire or the United States counts as OVB. The problem with this is that the composition of the British Empire varied a lot between 1841 & 1891. The inclusion of the United States makes the whole thing silly. The COCP has suggested a way out of this, but I am afraid that Free Census is not interested in our suggestions. So, the rule of thumb is – is it a British name? If it is, count it as OVB. The only exception is where the census returns say British Subject or Naturalised British Subject. Here, the “as is” rule comes into play again.

Often, the enumerator only gives the county. In that case, you should enter the Chapman code for the county and a hyphen. If, however, the county is one of those where the county town has the same name as the county, enter the Chapman code and the county town. So Gloucester would be entered as GLS Gloucester. You should take note of how the enumerator generally enters county names. So if he is giving Leicestershire Hinkley and then writes Leicester, we can be fairly sure he means the county town and has just omitted the county as being too obvious to write.

Place of birth. Enter what the enumerator wrote. Never mind if you know it is wrong – just enter it. If the enumerator writes NK, or not known or unknown or this man is an idiot; please just enter it. If no place of birth is given, then enter a hyphen. This is not “as is” of course, but it shows that we have looked and that there is nothing there.
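The rules above for filling in the two columns can be summed up in a small Python sketch. It is only an illustration: it is not taken from WINCENS or SSCENS, the function name is made up, and the table holds just the two county towns mentioned in this note. It also assumes that a bare county-town name means the town, which in practice depends on how your enumerator usually writes his counties.

```python
# Illustrative table only: county towns that share a name with their county,
# mapped to the Chapman code. A real list would be much longer.
COUNTY_TOWNS = {"Gloucester": "GLS", "Leicester": "LEI"}


def birthplace_columns(county_code, place):
    """Return the (county, place) pair to enter for one census line.

    Both arguments are what the transcriber has read from the return;
    either may be an empty string if the enumerator left it out.
    """
    county_code = county_code.strip().upper()
    place = place.strip()
    if not county_code and place in COUNTY_TOWNS:
        # Only e.g. "Gloucester" is written: Chapman code plus county town.
        return COUNTY_TOWNS[place], place
    # No county given -> UNK; no place given -> hyphen; otherwise "as is".
    return county_code or "UNK", place or "-"


print(birthplace_columns("", "Gloucester"))
print(birthplace_columns("GLS", ""))
print(birthplace_columns("", "not known"))
print(birthplace_columns("", ""))
```

Note that the place is passed through untouched, mistakes and all, which is the whole point of “as is”.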

If you can work out what the enumerator should have written, by all means leave a note. It may help me in deciding what to do.

My approach (which differs from Free Census) is that we are replicating the enumerator’s words. There is an option during validation to enter the modern or correct name. I do not use this, believing that I might be misleading the researcher. In any case, they should not expect to get it all on a plate and a bit of lateral thinking will usually get them to the right answer.

My aim in writing this is to try and speed things up. Do NOT spend hours puzzling over a place name if you are sure you have transcribed what is there. If you are in doubt but have some ideas, don’t hesitate to leave a note. Remember that this is a system involving three of us. One of us might know the answer!

Monday, October 10, 2005

What we are about

The Cornwall Online Census Project aims to transcribe ALL the Cornish 19th century census returns and to place them online, 'free-to-view'. In addition, as each census is completed, it will be distributed on disc to various archives and institutions. We aim to achieve a situation where no one has to pay to consult a transcription of the Cornish census returns.

We post all our completed returns as texts to our own web pages and we also upload completed works to the Online Data Base operated by Free Census.

Tuesday, September 06, 2005

First message

Hello folks

This is yet another method of communicating with Supreme HQ

Let me know how you feel!

Rgds

Michael

Tuesday, July 26, 2005