What Is a Scanned PDF?
When you scan a physical document with a scanner or photograph it with your phone, the result is typically an image embedded in a PDF wrapper. The pages look like text, but they're actually just pixels — you can't click to select words, copy sentences, or search the content. This makes scanned PDFs frustrating to work with.
What Is OCR?
OCR (Optical Character Recognition) is the process of analysing the pixel patterns in an image and identifying the characters and words they represent. The output is machine-readable text that can be selected, copied, searched, or re-processed. Modern OCR engines are highly accurate for printed text, typically achieving 95–99% accuracy on clean, well-lit scans.
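To make the accuracy figure concrete, here is a minimal sketch of one common way to score OCR output against a known-correct transcript: character accuracy, computed as one minus the edit distance divided by the reference length. The article does not say which metric its 95–99% figure refers to, so the metric and the function names `editDistance` and `characterAccuracy` are assumptions made for illustration.

```typescript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
// Strings are compared by UTF-16 code unit, which is fine for Latin text.
function editDistance(a: string, b: string): number {
  // prev[j] is the distance between the first i-1 chars of `a` and the first j chars of `b`.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy in the range 0..1 (assumed metric, see the note above).
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  const errors = editDistance(reference, ocrOutput);
  return Math.max(0, 1 - errors / reference.length);
}

// Two typical OCR confusions: "I" read as "l", and "0" read as "O".
const accuracy = characterAccuracy(
  "Invoice total: 1,250.00",
  "lnvoice total: 1,25O.00",
);
console.log(accuracy.toFixed(3)); // prints "0.913" (2 errors in 23 characters)
```

Under this metric, 99% accuracy on a 2,000-character page still leaves about 20 wrong characters, and 95% leaves about 100, so proofreading remains worthwhile for anything important.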
Browser-Based OCR With Tesseract.js
Our OCR tool uses Tesseract.js, a JavaScript library that runs a WebAssembly build of the open-source Tesseract OCR engine (originally developed at HP, developed at Google from 2006 to 2018, and now maintained by the open-source community). Recognition runs entirely in your browser, so your document is never uploaded to any server. Tesseract supports over 100 languages and handles most common fonts and layouts.
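As a rough illustration of what in-browser recognition looks like, the sketch below runs a list of page images through a worker one at a time and collects the text. It is not the tool's actual code. The `OcrWorker` interface and `recognizePages` function are hypothetical names, and the interface only describes the two worker methods the sketch relies on (`recognize` and `terminate`), which Tesseract.js workers provide.

```typescript
// The small subset of a Tesseract.js worker that this sketch depends on.
// TImage stands for whatever image type the worker accepts (canvas, Blob, URL, ...).
interface OcrWorker<TImage> {
  recognize(image: TImage): Promise<{ data: { text: string } }>;
  terminate(): Promise<unknown>;
}

// Recognises each page image in order and returns one text string per page.
// The worker is always shut down afterwards, even if a page fails.
async function recognizePages<TImage>(
  worker: OcrWorker<TImage>,
  pageImages: TImage[],
): Promise<string[]> {
  const texts: string[] = [];
  try {
    for (const image of pageImages) {
      const result = await worker.recognize(image);
      texts.push(result.data.text.trim());
    }
  } finally {
    await worker.terminate();
  }
  return texts;
}

// In a browser, assuming Tesseract.js v5 or later, usage would look roughly like:
//   import { createWorker } from "tesseract.js";
//   const worker = await createWorker("eng");
//   const texts = await recognizePages(worker, pageCanvases);
```

Everything here happens on the user's device: the page images are passed to the worker in memory, and only the recognised text comes back.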
Getting the Best OCR Results
OCR accuracy depends heavily on scan quality. Here's how to improve results:
- Scan at 300 DPI or higher — 150 DPI scans often produce errors on small fonts
- Ensure even lighting — avoid shadows across the page
- Keep the document flat — curved edges or angled pages reduce accuracy
- High contrast is key — dark text on white background gives the best results
- For colour scans, the tool automatically converts to greyscale before OCR
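Two of the tips above can be made concrete with a little arithmetic. The sketch below shows how a DPI target translates into pixel dimensions for a PDF page (PDF sizes are measured in points, 72 per inch), and one common way to convert RGBA pixels to greyscale. The function names are hypothetical, and the Rec. 601 luma weights are an assumption: the article does not say which greyscale formula the tool uses.

```typescript
// PDF page sizes are expressed in points, with 72 points per inch.
const POINTS_PER_INCH = 72;

// Scale factor to apply when rasterising a PDF page at a target DPI.
function renderScale(targetDpi: number): number {
  return targetDpi / POINTS_PER_INCH;
}

// Pixel dimensions of a page (given in points) rendered at a target DPI.
function pixelSize(
  widthPt: number,
  heightPt: number,
  targetDpi: number,
): { width: number; height: number } {
  const scale = renderScale(targetDpi);
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// Converts RGBA pixel data (4 bytes per pixel, the layout used by a canvas
// ImageData buffer) to greyscale in place. Assumed weights: Rec. 601 luma.
function toGreyscale(rgba: Uint8ClampedArray): Uint8ClampedArray {
  for (let i = 0; i < rgba.length; i += 4) {
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    // Uint8ClampedArray rounds and clamps the value to 0..255 on assignment.
    rgba[i] = luma;
    rgba[i + 1] = luma;
    rgba[i + 2] = luma;
    // The alpha channel at rgba[i + 3] is left unchanged.
  }
  return rgba;
}

// A US Letter page is 612 x 792 points.
console.log(pixelSize(612, 792, 300)); // { width: 2550, height: 3300 }
console.log(pixelSize(612, 792, 150)); // { width: 1275, height: 1650 }
```

Halving the DPI halves each dimension, so a 150 DPI scan gives the OCR engine only a quarter of the pixels per character that a 300 DPI scan does, which is why small fonts suffer first.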
Step-by-Step: OCR a PDF in Your Browser
1. Open the OCR PDF tool and drop in your scanned PDF.
2. Select the language of the document (defaults to English).
3. Choose which pages to process: all pages or a specific range.
4. Click 'Extract Text' and wait. OCR takes 2–10 seconds per page depending on complexity.
5. Copy the extracted text, or download it as a plain .txt file.
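The page-selection step and the per-page timing can be sketched in code. The example below parses a selection such as "1-3, 5" into page numbers and estimates the wait from the 2–10 seconds per page quoted in the steps. The input syntax and the names `parsePageSelection` and `estimateSeconds` are hypothetical: the article does not describe how the tool accepts page ranges.

```typescript
// Parses a page selection such as "1-3, 5" into sorted, unique, 1-based page
// numbers. An empty string or "all" selects every page.
// The input format is an assumption made for illustration.
function parsePageSelection(input: string, pageCount: number): number[] {
  const trimmed = input.trim().toLowerCase();
  if (trimmed === "" || trimmed === "all") {
    return Array.from({ length: pageCount }, (_, i) => i + 1);
  }
  const pages = new Set<number>();
  for (const part of trimmed.split(",")) {
    // Accepts a single page ("5") or an inclusive range ("1-3").
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page selection: "${part.trim()}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end > pageCount || start > end) {
      throw new Error(`Page range out of bounds: "${part.trim()}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

// Rough wait-time estimate using the 2–10 seconds per page figure.
function estimateSeconds(pages: number): { min: number; max: number } {
  return { min: pages * 2, max: pages * 10 };
}

const selected = parsePageSelection("1-3, 5", 10);
console.log(selected);                          // [ 1, 2, 3, 5 ]
console.log(estimateSeconds(selected.length));  // { min: 8, max: 40 }
```

By the same arithmetic, a full 20-page document would take roughly 40 to 200 seconds, so processing only the pages you need can save a noticeable amount of time.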
Limitations of Browser-Based OCR
Browser OCR works well for straightforward text documents. Complex layouts (multi-column newspaper style, mixed text and tables) and handwriting may produce less accurate results. For production-grade OCR on complex documents, consider cloud services like Google Document AI or Amazon Textract. For handwritten text, tools built on handwriting-specific recognition models significantly outperform Tesseract, which is designed primarily for printed text.