
How to Extract Text from a Scanned PDF Using OCR

A scanned PDF is just a picture of a document, so you can't select or copy the text. OCR (optical character recognition) converts that picture into real text. Here's how to do it for free in your browser.

Published March 8, 2025

What Is a Scanned PDF?

When you scan a physical document with a scanner or photograph it with your phone, the result is typically an image embedded in a PDF wrapper. The pages look like text, but they're actually just pixels — you can't click to select words, copy sentences, or search the content. This makes scanned PDFs frustrating to work with.

What Is OCR?

OCR (Optical Character Recognition) is the process of analysing the pixel patterns in an image and identifying the characters and words they represent. The output is machine-readable text that can be selected, copied, searched, or re-processed. Modern OCR engines are highly accurate for printed text, typically achieving 95–99% accuracy on clean, well-lit scans.
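To make the 95–99% figure concrete, here is a minimal sketch of one common way to score OCR output: character-level accuracy, computed from the edit distance between a hand-checked reference text and the OCR result. The article does not say how its accuracy figure is measured, so treat this metric and the function names (`editDistance`, `characterAccuracy`, `expectedErrors`) as illustrative assumptions.

```typescript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  // `prev` holds the distances for the previous row of the DP table.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, substitution);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy as a fraction between 0 and 1 (assumed metric:
// 1 minus edit distance divided by the reference length).
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) {
    return ocrOutput.length === 0 ? 1 : 0;
  }
  return Math.max(0, 1 - editDistance(reference, ocrOutput) / reference.length);
}

// Rough number of wrong characters to expect on a page of `charCount`
// characters at a given accuracy.
function expectedErrors(charCount: number, accuracy: number): number {
  return Math.round(charCount * (1 - accuracy));
}
```

On a page of about 2,000 characters, `expectedErrors(2000, 0.99)` gives roughly 20 characters to correct and `expectedErrors(2000, 0.95)` roughly 100, which is why even a few points of accuracy matter in practice.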

Browser-Based OCR With Tesseract.js

Our OCR tool uses Tesseract.js, a WebAssembly port of the open-source Tesseract OCR engine (originally developed at HP, with later development sponsored by Google). It runs entirely in your browser, so your document is never uploaded to any server. Tesseract supports over 100 languages and handles most common fonts and layouts.
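For readers curious what this looks like in code, the following is a rough sketch of an in-browser OCR loop, not the tool's actual implementation. It assumes the PDF pages have already been rendered to images, and it declares a small local interface (`OcrWorkerLike`, a hypothetical name) instead of importing the library, so the sketch stays self-contained. Tesseract.js workers expose `recognize` and `terminate` methods of this general shape.

```typescript
// Minimal shape of the worker this sketch needs. Declared locally as an
// assumption; it mirrors the `recognize` / `terminate` methods of a
// Tesseract.js worker without importing the library.
interface OcrWorkerLike {
  recognize(image: unknown): Promise<{ data: { text: string } }>;
  terminate(): Promise<unknown>;
}

// Run OCR over page images one at a time and return the text of each page.
// The worker is always shut down at the end, even if a page fails.
async function ocrPages(
  worker: OcrWorkerLike,
  pageImages: unknown[]
): Promise<string[]> {
  const texts: string[] = [];
  try {
    for (const image of pageImages) {
      const { data } = await worker.recognize(image);
      texts.push(data.text.trim());
    }
  } finally {
    await worker.terminate();
  }
  return texts;
}
```

With the `tesseract.js` package installed, a worker created via `createWorker('eng')` could be passed as the first argument. That call form matches recent major versions of the library; older versions needed separate language-loading steps, so check the documentation for the version you use.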

Getting the Best OCR Results

OCR accuracy depends heavily on scan quality. Here's how to improve results:

  • Scan at 300 DPI or higher — 150 DPI scans often produce errors on small fonts
  • Ensure even lighting — avoid shadows across the page
  • Keep the document flat — curved edges or angled pages reduce accuracy
  • High contrast is key — dark text on white background gives the best results
  • For colour scans, the tool automatically converts to greyscale before OCR
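Two of the points above are easy to check or reproduce in code. The sketch below estimates a scan's effective DPI from its pixel width and the physical page width, then converts RGBA pixel data to greyscale using the widely used ITU-R BT.601 luminance weights. The article does not say which greyscale formula the tool applies, so the weights and the function names (`effectiveDpi`, `toGreyscale`) are assumptions for illustration.

```typescript
// Effective resolution of a scan: pixels across divided by inches across.
function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  return pixelWidth / pageWidthInches;
}

// Convert RGBA pixel data (4 bytes per pixel, as in a canvas ImageData
// buffer) to one greyscale byte per pixel. Alpha is ignored.
// Weights are the ITU-R BT.601 luminance coefficients (assumed, not
// confirmed as what the tool itself uses).
function toGreyscale(rgba: Uint8ClampedArray): Uint8ClampedArray {
  const pixelCount = Math.floor(rgba.length / 4);
  const grey = new Uint8ClampedArray(pixelCount);
  for (let i = 0; i < pixelCount; i++) {
    const r = rgba[i * 4];
    const g = rgba[i * 4 + 1];
    const b = rgba[i * 4 + 2];
    grey[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return grey;
}
```

For example, a US Letter page (8.5 inches wide) scanned at 2,550 pixels across comes out at 300 DPI, while the same page at 1,275 pixels is only 150 DPI and more likely to produce errors on small fonts.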

Step-by-Step: OCR a PDF in Your Browser

1. Open the OCR PDF tool and drop in your scanned PDF.
2. Select the language of the document (defaults to English).
3. Choose which pages to process: all pages or a specific range.
4. Click 'Extract Text' and wait. OCR takes 2–10 seconds per page, depending on complexity.
5. Copy the extracted text, or download it as a plain .txt file.
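Steps 3 and 4 can be sketched in code. The example below parses a page selection and estimates the total wait from the 2–10 seconds per page quoted above. The range syntax (`"all"` or comma-separated pages and ranges such as `"2-4, 7"`) and the function names are assumptions for illustration; the tool's own input format may differ.

```typescript
// Parse a page selection into a sorted list of 1-based page numbers.
// Accepts "all" (or an empty string) or a list such as "2-4, 7".
function parsePageRange(spec: string, pageCount: number): number[] {
  const trimmed = spec.trim().toLowerCase();
  if (trimmed === "" || trimmed === "all") {
    return Array.from({ length: pageCount }, (_, i) => i + 1);
  }
  const pages = new Set<number>();
  for (const part of trimmed.split(",")) {
    // Matches a single page ("7") or an inclusive range ("2-4").
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page selection: "${part.trim()}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end > pageCount || start > end) {
      throw new Error(`Page selection out of bounds: "${part.trim()}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

// Estimated OCR time in seconds, using the 2-10 seconds per page range.
function estimateOcrSeconds(pages: number): { min: number; max: number } {
  return { min: pages * 2, max: pages * 10 };
}
```

Selecting `"2-4, 7"` from a 10-page document yields four pages, so the estimate is between 8 and 40 seconds.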

Limitations of Browser-Based OCR

Browser OCR works well for straightforward text documents. Complex layouts (multi-column newspaper style, mixed text and tables) and handwriting may produce less accurate results. For production-grade OCR on complex documents, consider cloud services like Google Document AI or Amazon Textract. For handwritten text, tools built specifically for handwriting recognition significantly outperform Tesseract.


Frequently Asked Questions

How accurate is the OCR?

For clean, high-resolution scans of printed text, accuracy is typically 95–99%. Accuracy drops for low-resolution scans, unusual fonts, handwriting, or pages with complex mixed layouts.

Can it OCR handwriting?

Tesseract was designed for printed text. Handwriting recognition is unreliable — expect significant errors. For handwriting, consider specialised AI tools like Google's Document AI.

What languages are supported?

Tesseract supports over 100 languages. The tool loads the English language pack by default; other language packs are loaded on demand.

Does OCR work on already-digital PDFs?

OCR isn't needed for them. Digital PDFs already contain real text, so use the Extract Text tool instead. It is much faster and avoids recognition errors, since it reads the text directly from the PDF structure.
