Saturday, December 31, 2005

Post Validation

Free Census was originally intended to have three stages: transcription, checking and validation. After validation, the file would be cleared for upload following a final check by the uploader, who was the man running the whole project. It was very much a one-man band in those days.

However, when the first validated files appeared after a couple of years, it was obvious that Plan A would not work. So a Cornish volunteer, Rick Parsons, produced FCTools, a diagnostic programme. It has a variety of uses, one of which is to turn the final zip back into a spreadsheet and check it for errors. This is the sort of thing that you get:

Warning: row 63, page number = 4 not sequential
Warning: row 95, possibly too many lines on page 4
Warning: row 128, consecutive schedule numbers (30) are the same
Warning: row 663, schedule number = 19 not sequential
Warning: row 883, consecutive schedule numbers (78) are the same
Warning: row 1004, consecutive schedule numbers (112) are the same
Warning: row 1012, schedule number = 192 not sequential
Warning: row 1013, schedule number = 116 not sequential
Warning: row 1144, consecutive schedule numbers (146) are the same
Warning: row 1311, birth place = Willesden N 3 contains unusual characters
Warning: row 1312, birth place = Willesden N 3 contains unusual characters
Warning: row 1630, schedule number = 23 not sequential
Warning: row 2012, schedule number = 1 not sequential
Error: row 2987, field too long , truncated
Error: row 2991, field too long , truncated
Error: row 2996, field too long , truncated
Error: row 3004, field too long , truncated
Error: row 3009, field too long , truncated
Warning: row 3009, consecutive schedule numbers (91) are the same
Error: row 3010, field too long , truncated
Error: row 3013, field too long , truncated
Warning: row 3779, birth place = Walworth S1 contains unusual characters
Warning: row 3844, Head of household is not the first entry in the schedule
Warning: row 4410, possibly too many lines on page 13
Warning: row 4443, possibly too many lines on page 14
Warning: row 4619, first page of ED = 20
Warning: row 4619, first schedule of ED = 119
Warning: row 4956, schedule number = 8 not sequential
Warning: row 4957, schedule number = 85 not sequential
Warning: row 4958, schedule number = 140 not sequential
Warning: row 5122, consecutive schedule numbers (20) are the same
Warning: row 5431, consecutive schedule numbers (71) are the same
Warning: row 5617, consecutive schedule numbers (113) are the same
Warning: row 5649, possibly too many lines on page 20
Warning: row 5898, consecutive schedule numbers (169) are the same
Warning: row 5961, possibly too many lines on page 30

Post-validation involves correcting these errors, if in fact they are errors. It also involves eye-balling the data, because some things, although incorrect, are not picked up by the software. In the case of the COCP returns, FCTools is also the means of producing the HTML. This gives us the opportunity for a final check, when the HTML is eye-balled for errors. Some elements have to be introduced by hand at this stage, including the credits for the transcribers and checkers.
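To give a flavour of what a sequence check like the ones in the log above might involve, here is a minimal sketch in Python. It is not FCTools' actual code, which I have not reproduced here. The function name `check_schedules` and the input layout (one row number and schedule number per household) are assumptions made purely for illustration, and the real programme may well apply different rules.

```python
def check_schedules(rows):
    """Flag repeated or out-of-sequence schedule numbers.

    rows: iterable of (row_number, schedule_number) pairs, assumed here
    to hold one entry per household (a hypothetical layout).
    Returns a list of warning strings worded like the log above.
    """
    warnings = []
    previous = None
    for row_number, schedule in rows:
        if previous is not None:
            if schedule == previous:
                # Two households in a row carry the same schedule number.
                warnings.append(
                    f"Warning: row {row_number}, consecutive schedule "
                    f"numbers ({schedule}) are the same"
                )
            elif schedule != previous + 1:
                # The number jumped forwards or backwards.
                warnings.append(
                    f"Warning: row {row_number}, schedule number = "
                    f"{schedule} not sequential"
                )
        previous = schedule
    return warnings


# Made-up sample data, not taken from any real census piece.
sample = [(1, 29), (5, 30), (9, 30), (12, 31), (15, 19), (20, 20)]
for line in check_schedules(sample):
    print(line)
```

A check like this can only say that something looks odd. As with the real log, it still takes a person to decide whether the enumerator genuinely repeated a number or the transcriber slipped.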

The HTML file goes off to the COCP web pages; the validation file is uploaded to Free Census. We have now uploaded over a million records to the OLDB and about 1.2 million to our own web pages.

Amazing! And it has only taken five years.....
