Developers often run into images that contain valuable text: invoices, receipts, scanned forms, ID cards, dashboards, or screenshots. But how do you extract that text cleanly and reliably?
Hi all,
I need to extract text from images in a Java application and I am looking for a reliable approach and code examples. Specifically, how to perform OCR on images using Java SDK, what preprocessing steps help accuracy, and how to handle multiple languages. I will be processing images from URLs and from user uploads, sometimes low quality. Tips for scaling this in production would be great too. Thanks!
Great question! OCR quality depends on two things: the OCR engine and the quality of the input image. Here’s how you can do it:
- Tesseract via Tess4J: open source, runs locally, good for many Latin scripts and more with trained data.
- Hosted OCR APIs: Google Cloud Vision, AWS Textract, Azure Computer Vision. These usually deliver higher accuracy on difficult documents and tables, but require a paid subscription.
- Convert to a suitable format. For text-heavy images, PNG is often better for sharp lines; high-quality JPG can also work. See format tradeoffs here: JPEG vs PNG.
- Increase contrast, denoise, deskew, and binarize if needed. These steps can dramatically boost OCR precision. See a helpful overview of enhancement ideas: Image Enhancement.
- Use sufficient resolution. 300 DPI for scans is a common baseline. If you are unsure what DPI means or how it works, check out DPI vs pixels.
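As a rough illustration of the grayscale-and-binarize step above, here is a minimal sketch using only the JDK's java.awt classes. The Preprocess class name and the fixed threshold of 128 are placeholders, not part of any library:

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

public class Preprocess {
    // Convert to grayscale and apply a fixed global threshold (binarize).
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                Color c = new Color(src.getRGB(x, y));
                // Luminance-weighted grayscale value
                int gray = (int) (0.299 * c.getRed() + 0.587 * c.getGreen() + 0.114 * c.getBlue());
                out.setRGB(x, y, gray < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic demo: one dark "text" pixel next to one light background pixel
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, new Color(30, 30, 30).getRGB());    // dark  -> black
        img.setRGB(1, 0, new Color(220, 220, 220).getRGB()); // light -> white
        BufferedImage bin = binarize(img, 128);
        System.out.println((bin.getRGB(0, 0) & 0xFFFFFF) == 0);        // true
        System.out.println((bin.getRGB(1, 0) & 0xFFFFFF) == 0xFFFFFF); // true
    }
}
```

A fixed global threshold works well for evenly lit scans; for photos with shadows or uneven lighting, adaptive (local) thresholding generally performs better.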
Add Tess4J to your build and make sure you have Tesseract trained data files available. Then:
<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <version>5.10.0</version>
</dependency>
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.net.URL;

public class OcrExample {
    public static void main(String[] args) throws Exception {
        // 1) Load image from disk
        BufferedImage img = ImageIO.read(new File("invoice.png"));

        // 2) Configure Tesseract
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/path/to/tessdata"); // folder containing .traineddata files
        tesseract.setLanguage("eng"); // e.g., "eng", "spa", "eng+deu"

        // 3) Run OCR
        try {
            String text = tesseract.doOCR(img);
            System.out.println(text);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}
Reading from a URL is similar:
BufferedImage img = ImageIO.read(new URL("https://example.com/receipt.jpg"));
String text = tesseract.doOCR(img);
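For user uploads you typically receive raw bytes (for example from a multipart request body) rather than a file or URL. A small sketch, assuming a hypothetical UploadOcr helper, that decodes the bytes with ImageIO before handing the image to Tesseract:

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class UploadOcr {
    // Decode an uploaded image from raw bytes (e.g. a multipart request body).
    static BufferedImage fromUploadBytes(byte[] bytes) throws IOException {
        BufferedImage img = ImageIO.read(new ByteArrayInputStream(bytes));
        if (img == null) {
            // ImageIO.read returns null for unrecognized formats
            throw new IOException("Unsupported or corrupt image data");
        }
        return img;
    }

    public static void main(String[] args) throws IOException {
        // Demo: round-trip a small image through PNG bytes, as if it were uploaded
        BufferedImage original = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ImageIO.write(original, "png", buf);

        BufferedImage img = fromUploadBytes(buf.toByteArray());
        System.out.println(img.getWidth() + "x" + img.getHeight()); // 4x3
        // Then: String text = tesseract.doOCR(img);
    }
}
```

Checking for the null return from ImageIO.read gives you a clean failure path for corrupt or unsupported uploads instead of a NullPointerException inside the OCR call.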
- Clean input: crop borders, remove stamps or heavy watermarks, and deskew tilted scans.
- Binarize and denoise: thresholding can make text crisper and suppress background patterns.
- Use the right language pack: for multilingual docs, combine languages like "eng+fra".
- Work in batches: normalize files to consistent dimensions and formats before OCR.
- Cache results: if an image does not change, persist extracted text and skip repeated OCR.
- Parallelize: run OCR in worker threads or microservices. Limit concurrency to available CPU cores.
- Preprocess once: keep a normalized copy alongside the original to avoid repeating transforms.
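The caching tip can be sketched by keying extracted text on a content hash of the image bytes. The OcrCache class and the injected OCR function below are illustrative, not part of Tess4J:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class OcrCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<byte[], String> ocr; // e.g. bytes -> tesseract.doOCR(...)

    OcrCache(Function<byte[], String> ocr) {
        this.ocr = ocr;
    }

    // Key the cache by a content hash so identical images skip repeated OCR.
    String extract(byte[] imageBytes) {
        return cache.computeIfAbsent(sha256(imageBytes), k -> ocr.apply(imageBytes));
    }

    static String sha256(byte[] data) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        OcrCache c = new OcrCache(bytes -> { calls[0]++; return "text:" + bytes.length; });
        byte[] img = {1, 2, 3};
        System.out.println(c.extract(img)); // text:3
        System.out.println(c.extract(img)); // text:3 (served from cache)
        System.out.println(calls[0]);       // 1 -- OCR ran only once
    }
}
```

In production you would back the map with a persistent store (database, Redis) rather than process memory. Note also that a Tess4J Tesseract instance is not thread-safe, so when parallelizing, give each worker thread its own instance.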
If you already manage media assets at scale, you can use Cloudinary to normalize images on the fly before feeding them to your OCR code. For example: fetch a remote image, convert to PNG, apply grayscale, sharpen, strong contrast, and threshold to boost text clarity, then pipe the transformed image into your OCR.
<dependency>
  <groupId>com.cloudinary</groupId>
  <artifactId>cloudinary-http44</artifactId>
  <version>1.39.0</version>
</dependency>
import com.cloudinary.Cloudinary;
import com.cloudinary.Transformation;
import com.cloudinary.utils.ObjectUtils;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.net.URL;

Cloudinary cloudinary = new Cloudinary(ObjectUtils.asMap(
    "cloud_name", "YOUR_CLOUD_NAME",
    "api_key", "YOUR_API_KEY",
    "api_secret", "YOUR_API_SECRET"
));

// 1) Build a preprocessing URL for a remote image
String preppedUrl = cloudinary.url()
    .type("fetch")
    .transformation(new Transformation()
        .fetchFormat("png")      // lossless for crisp text
        .quality("auto")         // sensible default optimization
        .effect("grayscale")     // reduce color noise
        .effect("sharpen")       // sharpen edges
        .effect("contrast:30")   // boost contrast
        .effect("threshold:200") // strong binarization
    )
    .generate("https://example.com/receipt.jpg");

// 2) Feed the transformed image into your OCR
BufferedImage img = ImageIO.read(new URL(preppedUrl));
String text = tesseract.doOCR(img);
System.out.println(text);
This pattern centralizes file retrieval, consistent transforms, and delivery. You can also normalize formats upfront based on your needs, drawing on background knowledge like JPEG vs PNG and enhancement techniques from Image Enhancement.
- Use Tess4J for local OCR or a hosted OCR API for tougher documents and higher accuracy.
- Preprocess images: consistent format, higher contrast, grayscale, denoise, binarize, deskew.
- Pipeline tip: generate a preprocessed image URL with Cloudinary and pass that into your OCR code for more consistent results at scale.
- Mind resolution and readability. See DPI vs pixels to avoid undersampling.
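To sanity-check resolution before OCR, you can estimate the pixel dimensions a scan needs for a target DPI (pixels = inches × DPI). The DpiCheck helper below is illustrative:

```java
public class DpiCheck {
    // Pixels needed along one dimension: pixels = inches * dpi, rounded up
    static int requiredPixels(double inches, int dpi) {
        return (int) Math.ceil(inches * dpi);
    }

    public static void main(String[] args) {
        // A US Letter page (8.5 x 11 in) scanned at the 300 DPI baseline:
        System.out.println(requiredPixels(8.5, 300));  // 2550
        System.out.println(requiredPixels(11.0, 300)); // 3300
    }
}
```

If an incoming image is well below these numbers for its physical size, OCR accuracy will likely suffer no matter how much preprocessing you apply.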
Ready to streamline your OCR pipeline with consistent preprocessing, storage, and delivery? Create a free Cloudinary account and start optimizing today.