Chinese Lab’s Latest OCR Model Takes On the Competition

Read Time: 1.5 min.

Chinese labs continue to make significant strides in optical character recognition (OCR). After DeepSeek’s and Qwen’s vision language models, we now have Tencent Hunyuan’s HunyuanOCR, a 1 billion parameter model that has already taken the competition by storm. Its key feature is an end-to-end architecture that keeps each image at its original aspect ratio, so nothing is distorted and no detail is lost.

The installation script on the model card is easy to follow; it installs torch and transformers along with the other required components. Once everything is installed, users must specify the language they want recognized.
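Before running the model card’s script, it can help to confirm that the packages it installs are actually importable. The sketch below is not taken from the model card: the helper name `missing_packages` is hypothetical, and only `torch` and `transformers` are named in the text above, so any other dependency would need to be added to the list by hand.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    # find_spec only locates the package; it does not import it.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Only these two packages are named in the article; others are not assumed.
    missing = missing_packages(["torch", "transformers"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("torch and transformers are available")
```

An empty list means both packages can be found; it says nothing about whether their versions match what the model card expects.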

The model is built around a native-resolution vision transformer called Hunyuan-ViT, which lets images keep their original aspect ratio without distortion or loss of detail. The resulting visual tokens are passed into an adaptive MLP connector that compresses the content while discarding empty background and redundant information. The final stage is a lightweight 0.5 billion parameter language model with XD-RoPE positional encoding, which enables complex text recognition tasks such as reading multi-column pages and video subtitles.
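To make the pipeline above more concrete, here is a minimal sketch of how a native-resolution encoder produces a patch grid that follows the image’s own shape, and how a connector might reduce the token count before the language model sees it. The patch size of 16 and the 2x2 merge are illustrative assumptions, not figures from the model, and plain spatial merging is a stand-in for the adaptive connector, which is described as content-aware.

```python
import math


def native_resolution_grid(width, height, patch=16):
    """Patch grid (columns, rows) when the image is tiled at its own resolution.

    No resize to a fixed square happens, so the grid keeps the image's aspect ratio.
    patch=16 is an assumed value for illustration.
    """
    return math.ceil(width / patch), math.ceil(height / patch)


def token_counts(width, height, patch=16, merge=2):
    """Return (vision_tokens, compressed_tokens) for one image.

    The connector is modeled as merging each merge x merge block of patches
    into a single token. This is a simplification: it does not drop
    background adaptively the way the article describes.
    """
    cols, rows = native_resolution_grid(width, height, patch)
    vision_tokens = cols * rows
    compressed_tokens = math.ceil(cols / merge) * math.ceil(rows / merge)
    return vision_tokens, compressed_tokens


if __name__ == "__main__":
    # A wide 1600x400 banner keeps its 4:1 shape as a 100x25 grid.
    print(native_resolution_grid(1600, 400))
    print(token_counts(1600, 400))
```

Under these assumptions a wide image and a tall image of the same area yield the same number of tokens but differently shaped grids, which is the property that avoids squashing text lines into a square.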

In testing with this script, the model reportedly performs well: the weights download quickly, results render accurately within seconds, and it handles both text prompts and images. Its speed is impressive, as is how much it does with only 1 billion parameters.

For instance, running OCR on a PDF document containing LaTeX formulas and paragraphs went smoothly, with minimal impact on VRAM usage. The model also handled various languages such as Hindi, Arabic, and Polish, as well as handwritten notes, proving versatile across text types. On benchmarks such as OmniDocBench and OCRBench VQA, it reportedly performed consistently well, outperforming models with more parameters.
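To put the VRAM remark in context, the memory needed just to hold the weights of a 1 billion parameter model can be estimated with simple arithmetic. The sketch below assumes the parameter count is exactly 1 billion and considers two common precisions; the article does not say which precision was used, and activations and image tokens add to the total.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory for the weights alone, in GiB (excludes activations and caches)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Precisions are assumptions; the article does not state the dtype used.
    for label, nbytes in [("float32", 4), ("float16/bfloat16", 2)]:
        gib = weight_memory_gib(1_000_000_000, nbytes)
        print(f"{label}: {gib:.2f} GiB")
```

At 16-bit precision the weights come to a little under 2 GiB, which is consistent with a model of this size fitting comfortably on a consumer GPU.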

Overall, Tencent Hunyuan’s HunyuanOCR is a groundbreaking model that demonstrates the power of an end-to-end architecture for OCR tasks. Its simple design and high performance make it a valuable tool for anyone looking to integrate OCR capabilities into their projects without complex setups or proprietary APIs.