Case: Develop a Research Library Website from Scanned Input

This ASP solution to a common problem demonstrates the relative ease of developing a working system this way. The problem is to take archived academic publications, scan and OCR them, and edit the scanned data into text files, which become the first electronic form of the data.

ASP then processes these files to create text files, database tables and HTML files for online viewing. There are many ways of dealing with historic data, but most require the same things: indexing that translates the written page numbers into hyperlinks, ideally some keywords for searching, some way to tell whether all the records were processed, validation of the data to the appropriate level, and creation of the HTML.
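As an illustration of the page-number-to-hyperlink step, here is a minimal VBScript sketch. The function names, the "#p" anchor convention and the idea that each page number maps to an anchor inside one generated article file are assumptions made for the example, not details taken from the actual system.

```vbscript
' Hypothetical helper: turn one printed page number into a hyperlink that
' points at an anchor (assumed to be named "p" + page number) in a generated
' HTML file.
Function PageLink(articleFile, pageNumber)
    PageLink = "<a href=""" & articleFile & "#p" & pageNumber & """>" & _
               pageNumber & "</a>"
End Function

' Convert a comma-separated list of page numbers, as it might appear in an
' index entry (for example "12, 47"), into a list of hyperlinks.
Function PageLinks(articleFile, pageList)
    Dim parts, i, result
    parts = Split(pageList, ",")
    result = ""
    For i = 0 To UBound(parts)
        If i > 0 Then result = result & ", "
        result = result & PageLink(articleFile, Trim(parts(i)))
    Next
    PageLinks = result
End Function
```

Called as PageLinks("vol1.htm", "12, 47"), where "vol1.htm" is a made-up file name, this returns two links pointing at vol1.htm#p12 and vol1.htm#p47.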

The first step is hand-editing the scanned data: removing scan errors and preparing the data for automation by inserting separation characters where needed. The data comes from original academic publications that are out of print and set in small serif type, which is the hardest to OCR anyway; in this case the printing quality was also bad. After editing there are several types of files: general index, author index, article, and bibliography. These are the basic input files for the ASP process.
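To show why the separation characters matter, here is a minimal VBScript sketch of reading one hand-edited index line. The "|" separator, the three-field layout and the function name are assumptions for illustration; the actual separator and record layout are not specified here.

```vbscript
' Assumed separator character inserted during hand-editing.
Const FIELD_SEP = "|"

' Hypothetical helper: split one line into trimmed fields.
' Returns an array of fields, or Null when the field count is wrong,
' so the caller can count rejected lines and see whether every record
' was processed.
Function ParseIndexLine(line, expectedFields)
    Dim fields, i
    fields = Split(line, FIELD_SEP)
    If UBound(fields) + 1 <> expectedFields Then
        ParseIndexLine = Null
        Exit Function
    End If
    For i = 0 To UBound(fields)
        fields(i) = Trim(fields(i))
    Next
    ParseIndexLine = fields
End Function

' Usage with a made-up author index line: author, title, page.
Dim rec, rejected
rejected = 0
rec = ParseIndexLine("Smith, J. | On Early Printing | 12", 3)
If IsArray(rec) Then
    ' rec(0) = "Smith, J.", rec(1) = "On Early Printing", rec(2) = "12"
Else
    rejected = rejected + 1
End If
```

Checking the field count on every line is one simple way to catch a separator missed during editing before the record reaches the database tables.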

This begins the server process in VBScript, sets the timeout because the script takes about two minutes to run, includes the standard VBScript file of ADO-related names, and turns buffering on so that the page is processed completely before the client gets a response. Since I don’t output anything to the client, they see a blank screen when it finishes. Instead, a file of totals is written after the content files are created. A number of variables are declared next…
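For orientation, the page setup described above might be sketched as follows. This is an assumption-laden outline rather than the original code: the 300-second timeout, the adovbs.inc path and the totals.txt file name are all hypothetical choices.

```vbscript
<%@ Language=VBScript %>
<%
' Turn buffering on so nothing is sent until the script has finished.
Response.Buffer = True
' Allow well over the roughly two minutes the run takes (value is assumed).
Server.ScriptTimeout = 300
%>
<!-- #include file="adovbs.inc" -->
<%
' ... content files and database tables would be created here ...

' Write a totals file instead of sending output to the browser.
' "totals.txt" is a hypothetical name; the web server account needs
' write permission on the target folder.
Dim fso, totals
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set totals = fso.CreateTextFile(Server.MapPath("totals.txt"), True)
totals.WriteLine "Run finished: " & Now()
totals.Close
Set totals = Nothing
Set fso = Nothing
%>
```

Server.ScriptTimeout is given in seconds, and adovbs.inc is the include file of ADO constants that ships with ADO; it has to be copied to, or referenced from, a path the page can reach.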