Best Indexing Methods for Scanned Records

Indexing is what makes a scanned archive usable. Without indexing, you have thousands of image files with meaningless names. With good indexing, you can find any document in seconds. The challenge is choosing the right level and method of indexing — enough to be useful, without spending more on indexing than the records are worth.

Levels of Indexing

Box-Level Indexing

The simplest and cheapest level. Each archive box is assigned a reference number and a brief description of its contents (for example: “Box 247 — Purchase Ledger 2018-2019”). Individual documents within the box are not catalogued.

  • Cost: Negligible — a few pence per box
  • Retrieval speed: You can find the right box quickly, but then need to search through the box contents manually or scroll through a large multi-page PDF
  • Best for: Rarely accessed archives where the cost of detailed indexing is not justified — old financial records approaching retention expiry, historical files kept for compliance rather than active use

Document-Level Indexing

Each individual document within a box is identified and labelled — typically with a date, document type and brief description. Documents become separately named files rather than pages within a single large PDF.

  • Cost: 5-15p per document (manual identification and naming)
  • Retrieval speed: You can find a specific document by name, date or type without searching through an entire box
  • Best for: Actively used archives where staff need to find specific documents regularly — client files, personnel records, project files, matter files
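
A consistent naming convention is what makes document-level files searchable by name. The sketch below shows one possible convention in Python. The field order, the underscore separator and the function name document_filename are assumptions for illustration, not a standard.

```python
import re
from datetime import date


def document_filename(client_ref: str, doc_date: date, doc_type: str, description: str) -> str:
    """Build a sortable file name from a reference, date, type and description."""

    def slug(text: str) -> str:
        # Lower-case the text and replace any run of other characters with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO dates (YYYY-MM-DD) sort chronologically when files are listed by name.
    parts = [client_ref.upper(), doc_date.isoformat(), slug(doc_type), slug(description)]
    return "_".join(parts) + ".pdf"


print(document_filename("c1042", date(2023, 4, 5), "Tax Return", "SA100 2022/23"))
# C1042_2023-04-05_tax-return_sa100-2022-23.pdf
```

Putting the reference first groups each client's files together, and the ISO date keeps them in chronological order within that group.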

Field-Level Indexing

The most detailed level. Key data fields are extracted from each document and recorded in a database — for example, an invoice might be indexed by invoice number, supplier name, date, amount and PO reference. This creates a structured, searchable database of your records.

  • Cost: 10-30p per document (manual data entry or semi-automated extraction)
  • Retrieval speed: Instant — search by any combination of indexed fields
  • Best for: High-value records accessed frequently by multiple criteria — financial records for audit, regulated records for compliance, customer records for service delivery
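
To make "search by any combination of indexed fields" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample rows are invented for illustration, amounts are stored in pence, and Q3 is assumed to mean the calendar quarter July to September.

```python
import sqlite3

# In-memory database standing in for the index a scanning project would deliver.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE invoice_index (
           file_name TEXT, invoice_no TEXT, supplier TEXT,
           invoice_date TEXT, amount_pence INTEGER, po_ref TEXT)"""
)
rows = [
    ("inv_0001.pdf", "INV-1001", "Acme Ltd", "2023-07-14", 620000, "PO-77"),
    ("inv_0002.pdf", "INV-1002", "Acme Ltd", "2023-08-02", 120000, "PO-78"),
    ("inv_0003.pdf", "INV-1003", "Bolt & Co", "2023-09-20", 900000, "PO-79"),
    ("inv_0004.pdf", "INV-1004", "Acme Ltd", "2023-11-05", 750000, "PO-80"),
]
conn.executemany("INSERT INTO invoice_index VALUES (?, ?, ?, ?, ?, ?)", rows)


def find_invoices(supplier, min_pence, start, end):
    """Return file names for one supplier above an amount within a date range."""
    cur = conn.execute(
        "SELECT file_name FROM invoice_index "
        "WHERE supplier = ? AND amount_pence > ? AND invoice_date BETWEEN ? AND ? "
        "ORDER BY invoice_date",
        (supplier, min_pence, start, end),
    )
    return [row[0] for row in cur.fetchall()]


# Invoices from Acme Ltd over £5,000 dated July to September 2023.
matches = find_invoices("Acme Ltd", 500000, "2023-07-01", "2023-09-30")
print(matches)  # ['inv_0001.pdf']
```

Storing dates as ISO text (YYYY-MM-DD) lets the date range comparison work as a plain string comparison.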

Indexing Approaches

Manual Indexing

A human operator looks at each document, identifies what it is, and enters the relevant information into a database or names the file accordingly. This is the most flexible approach — a person can handle any document type, read poor handwriting, and make judgement calls about classification.

Manual indexing is accurate but slow and expensive. A skilled operator can index 200-400 documents per day at document level, or 100-200 at field level. For a large archive, this translates to weeks or months of labour.
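
The throughput figures above translate into labour as follows. This is a rough sketch: the 50,000-document archive is an assumed example, and the estimate ignores breaks, quality checks and rework.

```python
import math

# Documents indexed per operator per day (slowest, fastest), from the figures above.
RATES_PER_DAY = {"document": (200, 400), "field": (100, 200)}


def operator_days(documents: int, level: str) -> tuple[int, int]:
    """Return (best case, worst case) operator-days for one indexing level."""
    slowest, fastest = RATES_PER_DAY[level]
    return math.ceil(documents / fastest), math.ceil(documents / slowest)


print(operator_days(50_000, "document"))  # (125, 250)
print(operator_days(50_000, "field"))     # (250, 500)
```

At five working days a week, field-level indexing of this example archive would keep one operator busy for roughly one to two years, which is why larger projects use several operators or some automation.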

Barcode Separation

Barcode cover sheets are placed between documents before scanning. The scanner reads the barcodes and uses them to separate the batch into individual files, naming each file according to the barcode value.

  • Fast and reliable — the scanner does the separation automatically
  • Requires preparation time to create and insert barcode sheets
  • Works brilliantly for pre-organised batches (separate files, each with a known reference)
  • Less useful for mixed, unorganised archives where you do not know what each document is before scanning
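
The separation logic can be sketched in a few lines of Python. This sketch assumes the scanning software has already decoded any barcode on each page. The page tuples, the UNSORTED label and the function name split_batch are hypothetical.

```python
from typing import Optional


def split_batch(pages: list[tuple[int, Optional[str]]]) -> dict[str, list[int]]:
    """Group page numbers under the most recent barcode cover sheet."""
    documents: dict[str, list[int]] = {}
    current: Optional[str] = None
    for page_no, barcode in pages:
        if barcode is not None:
            # Cover sheet: start a new document and leave the sheet itself out.
            current = barcode
            documents.setdefault(current, [])
        elif current is None:
            # Pages before the first cover sheet cannot be named automatically.
            documents.setdefault("UNSORTED", []).append(page_no)
        else:
            documents[current].append(page_no)
    return documents


# Each tuple is (page number, decoded barcode value or None).
batch = [(1, "CLIENT-0042"), (2, None), (3, None), (4, "CLIENT-0043"), (5, None)]
print(split_batch(batch))
# {'CLIENT-0042': [2, 3], 'CLIENT-0043': [5]}
```

Each key would become a file name, which is why this method suits batches where the reference for every file is known before scanning.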

OCR Zone Capture

Software reads text from specific areas (zones) on standardised forms — for example, always reading the invoice number from the top-right corner, the date from the header, and the supplier name from the address block.

  • Highly efficient for uniform, standardised document types
  • Requires setup — you must define the zones for each document template
  • Accuracy depends on document consistency — works well for invoices from the same supplier, poorly for mixed correspondence
  • Can be combined with manual verification — software extracts, human confirms
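
A zone template is essentially a set of named rectangles. The sketch below assumes an OCR engine has already returned each word with the position of its centre as a fraction of the page width and height. The zone coordinates, field names and sample words are all hypothetical.

```python
# Zones as (left, top, right, bottom) fractions of the page for one invoice layout.
INVOICE_TEMPLATE = {
    "invoice_no": (0.70, 0.00, 1.00, 0.10),  # top-right corner
    "date": (0.00, 0.00, 0.40, 0.10),        # header
    "supplier": (0.00, 0.10, 0.50, 0.25),    # address block
}


def capture_zones(words, template):
    """words: list of (text, x, y), where x and y locate the word centre."""
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [
            (y, x, text)
            for text, x, y in words
            if left <= x < right and top <= y < bottom
        ]
        # Sort top to bottom, then left to right, to restore reading order.
        fields[name] = " ".join(text for _, _, text in sorted(inside))
    return fields


ocr_words = [
    ("14/07/2023", 0.12, 0.05),
    ("INV-1001", 0.85, 0.05),
    ("Acme", 0.08, 0.15),
    ("Ltd", 0.16, 0.15),
    ("Total", 0.70, 0.90),
]
fields = capture_zones(ocr_words, INVOICE_TEMPLATE)
print(fields)
# {'invoice_no': 'INV-1001', 'date': '14/07/2023', 'supplier': 'Acme Ltd'}
```

Any field that comes back empty is a natural candidate for the manual verification step, since it usually means the document did not match the template.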

AI-Based Classification

Machine learning models can identify document types, extract key data fields, and classify documents without predefined templates. This technology is improving rapidly but is not yet a complete replacement for human indexing on mixed archives.

  • Best for organisations with large volumes of varied but recurring document types
  • Requires training data — the AI needs examples of each document type to learn from
  • Accuracy can improve over time, provided corrections from human review are fed back into the model
  • Current state: good for classification (identifying document types), moderate for data extraction (pulling specific values), poor for unstructured or handwritten content

Choosing the Right Level

The decision comes down to cost versus benefit:

  • How often will records be accessed? Daily access justifies field-level indexing. Records consulted once a year or less rarely need more than box-level
  • How many records are there? Field-level indexing of a million documents is prohibitively expensive. At the other extreme, a collection of only 50 documents is cheap to index in detail, so stopping at box-level saves little and leaves every search manual
  • How specific do searches need to be? If you need to find “all invoices from Acme Ltd over £5,000 in Q3 2023,” you need field-level indexing. If you need to find “the 2023 purchase ledger,” box-level is sufficient
  • What is the remaining retention period? Investing heavily in indexing records that will be destroyed in two years is poor value
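
The arithmetic behind the volume question can be sketched as below, using the per-document price ranges quoted earlier in this article. Box-level indexing is left out because it is priced per box rather than per document. The function name is an assumption for illustration.

```python
# Price per document in pence (low, high), from the ranges quoted in this article.
COST_PENCE = {"document": (5, 15), "field": (10, 30)}


def indexing_cost_pounds(documents: int, level: str) -> tuple[float, float]:
    """Return the (low, high) estimated indexing cost in pounds."""
    low, high = COST_PENCE[level]
    return documents * low / 100, documents * high / 100


print(indexing_cost_pounds(1_000_000, "field"))   # (100000.0, 300000.0)
print(indexing_cost_pounds(50, "document"))       # (2.5, 7.5)
```

Comparing these figures with how often the records are used, and how long they will be kept, gives a quick sense of whether a more detailed level is worth paying for.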

Practical Examples

Accountancy Practice

Client files indexed at document level by client reference, tax year and document type (tax return, correspondence, accounts). Field-level indexing for key financial data is unnecessary — the documents themselves contain the detail. Cost: 10-15p per document.

NHS Trust

Patient records indexed at field level by NHS number, patient name, date of birth and episode date. Fast retrieval by patient identifier is clinically essential. Cost: 15-25p per document, justified by patient safety and access requirements.

Property Management Company

Lease files indexed at document level by property reference, tenant name and document type (lease, correspondence, inspection report). Key lease dates (start, end, break clause, rent review) extracted as field-level metadata for diary management. Cost: 10-20p per document.

Warehouse of Legacy Financial Records

500 boxes of purchase and sales ledgers from 2010-2018, retained for HMRC compliance. Box-level indexing only — each box labelled with year range and ledger type. Rarely accessed and approaching destruction date. Cost: effectively zero beyond the label on each box.

Get a Free Quote

Every project is different, so the best way to understand your options is to get in touch with our team. We provide clear, no-obligation advice — usually within the same day.

Call us on 01691 650355 or use the form below.
