SnipNScan 📷

⌛ 7 minute read

📋 Tags: C/C++ OpenCV Linux Project


Image Hell

A picture is worth a thousand words - A common adage

Images are a rich source of information. But what are those thousand words worth if I can’t extract the actual text inside the image? Let me explain my frustrations with images.

Imagine yourself in some online conference. Stare in wonder as the presenter asks you to access some (long-ass) link on their screen share. Extra points if the presenter does not drop the link into chat - good luck typing it into your browser before they move on!

Try typing this in 30s
www.a_very_long_foobar_link.com/123409%~A875&sYm-b0ls?ref=LMAO-GOODLUCK-TYPING-THIS-IN-30-SECONDS?&12#-90

How not to ask for code help

Sometimes I get questions from friends asking for help with their code. Instead of sending me text, they send an image.

Actual image from a friend - truncated for brevity

Images of text (Why?!)

One thing that frustrates me is that certain research papers and lecture slides format text as images. Some of these texts are meant to be copy-pasted, such as boilerplate lab code for a tutorial or some DOI link.

It is simply a waste of time to manually type out references and long-ass code.

Extra: QR tasks I’d rather do on desktop

As an added bonus frustration, some QR codes link to very text-heavy surveys or websites not optimised for mobile (NUS class attendance login page, I’m looking at you 😤). I’d rather do these using my PC, but the issue is that there’s no native way¹ to scan QR codes on desktop!

A Solution

The frustrations I faced above could be solved if I built a desktop tool to scan QR codes and text in images, kind of like the snipping tool but for optical character recognition (OCR).

So I built a GUI-based Linux tool over the 2021 winter break to do just that. It’s FOSS too (licensed under AGPL-3); you are more than welcome to mess with my code.

Dependencies

This mini project depends on OpenCV, Tesseract, and zbar for the backend logic and OCR, and wxWidgets for the UI.

SnipNScan

This section is for the nerds. It is basically me recounting what I learnt and how I approached the preprocessing of images for OCR scanning. If you’re here as a non-technical onlooker, perhaps the results section below will be of more interest to you!

Application Control Flow

For reference, this is the simplified user control flow for this app:

  1. Take a snapshot of your screen

  2. Drag over the area of the snapshot you want to scan

  3. Magic happens with OCR @ the backend

  4. You get the text/QR code link from the scanned area

  5. Profit

Preprocessing Images (Step 3)

After taking a screenshot and cropping it, we will need to preprocess the image for our OCR engine to scan.

I am not a data scientist, nor am I experienced in OpenCV, so this pre-processing is a hobbyist’s implementation and likely to be iffy.

The end-goal here is to perform OCR. So, we will want to remove any redundant colors and noise from the image that we want to scan. This can be done through grayscaling, denoising and thresholding.

Grayscaling

Grayscaled Image

Why is there a need to grayscale? Many of the OpenCV functions we use expect single-channel, grayscaled images. We also reduce the amount of information we need to operate on: instead of three channels (Red, Green, Blue), we only deal with one (Grayscale).
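For illustration, grayscaling with OpenCV is a one-liner. This is a minimal sketch, not my exact code; snapshot is a hypothetical variable name for the cropped screen capture, assumed to be a standard BGR cv::Mat:

```cpp
#include <opencv2/imgproc.hpp>

// Convert the 3-channel BGR snapshot into a single grayscale channel.
// (snapshot is assumed to hold the cropped screen capture.)
cv::Mat gray;
cv::cvtColor(snapshot, gray, cv::COLOR_BGR2GRAY);
```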

Denoising and Adaptive Thresholding

Denoising helps to remove unwanted noisy pixels from the snapshot after grayscaling, which makes the OCR work better.

Adaptive Thresholding is a whole different beast. For each pixel, it looks at the surrounding neighbourhood to decide whether that pixel should be set to 0 or 1 (dark or light). There’s also some fine-tuning that can be done with a constant that offsets the computed threshold.

Compared to simple thresholding that sets every pixel based on a threshold value, adaptive thresholding generalises better to different lighting conditions.

Thresholding the denoised, grayscaled image
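In OpenCV terms, the two steps look roughly like this. It’s a sketch rather than my exact parameters; blockSize and c here are illustrative values for the neighbourhood size and the fine-tuning constant:

```cpp
#include <opencv2/imgproc.hpp>
#include <opencv2/photo.hpp> // fastNlMeansDenoising

// Smooth out speckle noise in the grayscaled snapshot.
cv::Mat denoised;
cv::fastNlMeansDenoising(gray, denoised);

// For each pixel, compare it against a Gaussian-weighted mean of its
// blockSize x blockSize neighbourhood, offset by the constant c.
cv::Mat thresholded;
int blockSize = 11; // neighbourhood size; must be odd
double c = 26;      // the fine-tuning constant discussed below
cv::adaptiveThreshold(denoised, thresholded, 255,
                      cv::ADAPTIVE_THRESH_GAUSSIAN_C,
                      cv::THRESH_BINARY, blockSize, c);
```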

Dark Mode Pain

After testing the pre-processing on an image of code (in lightmode…) that a friend sent me, I was quite satisfied with the results – it was pretty accurate! But then I hopped over to Stack Overflow with dark mode on and realised that everything broke.

After some investigation and testing, I realised that the constant added in the adaptiveThreshold function caused the preprocessed image to come out very, very wonky.

Using dark theme images from my blog (Original)
Thresholding with badly chosen (lightmode) constant

A lazy workaround

The thresholding constant c for lightmode is positive (c = 26). Through some testing, I realised that negative values of c (such as c = -42) work better for dark background/themed images.

Thresholding with appropriate constant (much better...)

The question is: how do we programmatically determine if an image is dark or light themed?

The Lazy Fix: find the most dominant shade of grey and determine from there whether the image is light or dark themed.

The idea is to use OpenCV’s histogram function to determine the spread of intensities in the grayscaled image. Plotting the histograms using OpenCV shows a clear distinction between the spreads of light and dark images.
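Concretely, the histogram computation (the part omitted for brevity in the snippet further below) could look something like this sketch; the names gray, hist, bins and max_val line up with that snippet:

```cpp
#include <opencv2/imgproc.hpp>

// One channel, 256 bins covering the full 0-255 intensity range.
int bins = 256;
int histSize[] = { bins };
float range[] = { 0, 256 };
const float* ranges[] = { range };
int channels[] = { 0 };

cv::Mat hist;
cv::calcHist(&gray, 1, channels, cv::noArray(), hist, 1, histSize, ranges);

// Tallest bin value, used later to scale the bins for plotting.
double max_val;
cv::minMaxLoc(hist, nullptr, &max_val);
```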

Issues seeing the graphs? Try using darkmode on this blog to view the histograms properly. (This is my problem, sorry!)

NOTE: The horizontal axis represents how light the shade is. The further right, the lighter the shade.

Histogram of a light theme
Histogram of a dark theme
Histogram of a particularly noisy, light image (Ling's cars)

It’s quite clear that for light themes, the mode lies between the middle of the scale and the extreme right. For dark themes, it lies between the start and the middle.

So, the solution is simple and lazy². Decide the threshold between dark and light themes to be exactly in the middle of the horizontal axis.

```cpp
bool isDominantDark(...){
    // Histogram logic omitted for brevity
    // Trivial O(n) scan to find the tallest bin
    int intensity = 0;
    int dominantBin = 0;
    for(int i=0; i<bins; ++i){
        // Loop through the histogram hist to find the highest intensity
        float binVal = hist.at<float>(i);
        // max_val was initialised earlier in the histogram logic
        int relFreq = cvRound(binVal*hist_height/max_val);
        if(relFreq > intensity){
            intensity = relFreq;
            dominantBin = i;
        }
    }
    // Greedy solution: take the midpoint of the 256 intensity bins as the
    // binary trigger for dark/light mode
    return (dominantBin < (256/2));
}
```

Now, applying the appropriate constant is simply a matter of calling the isDominantDark function and assigning c accordingly.

```cpp
int c = 26;
if(isDominantDark(grayed)){
    c = -42;
}
```
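The chosen constant then flows back into the adaptiveThreshold call from the preprocessing step, roughly like so (reusing the names from the earlier sketch; the block size of 11 is illustrative):

```cpp
// Threshold with the theme-appropriate constant.
cv::adaptiveThreshold(denoised, thresholded, 255,
                      cv::ADAPTIVE_THRESH_GAUSSIAN_C,
                      cv::THRESH_BINARY, 11, c);
```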

Results

The end result was quite satisfactory to me. The OCR results were quite accurate, and the QR code scanning worked perfectly out of the box with both zbar and OpenCV.
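As a taste of how little code the QR path needs, here’s a minimal sketch using OpenCV’s built-in QRCodeDetector (not my exact code; the zbar path is similar in spirit):

```cpp
#include <opencv2/objdetect.hpp>
#include <string>

// Try to find and decode a QR code in the cropped snapshot.
cv::QRCodeDetector detector;
std::string payload = detector.detectAndDecode(gray);
if (!payload.empty()) {
    // payload now holds the encoded text, e.g. a URL.
}
```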

Lightmode

Output:

```
def get_date_today():
return (2013, 18, 30)

class Artist(object):
def _init_ (self, name, dob):
# Fill in your code here
self.name = name
self.dob = dob
```

Darkmode

Output:

```
Cnake ~ Recreating the classic Snake game with C, ncurses and Linked Lists. ta 6/8/2021 | ¥ 5 min read @ Tags: C/C++ Linux Data Structures
```

Conclusion: Ling’s Cars

Just for fun – I wanted to run my utility on very noisy environments to see how well the application works/generalises in the wild. Why not test it out on one of my favourite websites, LingsCars.com? If you wonder why I chose this website, I implore you to click the link. It’s a work of art 👌

This is the noisy image I want to scan.

Colourful!

This is how the pre-processing turned out:

Still quite noisy...

And what was returned by the OCR:

UNAS AS Gomme 4 Leader of the Pack - The UK's favorite car leasing website! Contract hire cars, LINGSCARS is the "UK's favourite car leasing website" - On 2016 We leaseu over 85 million in cars! (RRP)

The important bits (to me) were scanned correctly. All in all, a win!

This project was quite fun but a bit of a hassle at times because I was coding OpenCV with C++. Perhaps my next OpenCV project (if I ever do another one) will be in Python.

This project was a useful endeavour, and it’s self-made software that I actually use actively, unlike my other projects such as my Cnake game. The only limitations so far are compiling to Windows, and friends sending me videos of their code to debug. But that’s a writeup for another time!


  1. At least in my experience with Ubuntu and Windows. I’m not too sure about macOS and other distros. ↩︎

  2. It is lazy because it follows the heuristic that there are only 2 distinct possibilities for c. It does not account for the possibility that the image is dominantly ‘grey’ – where neither the positive nor negative values of c are the best option for the adaptive threshold. ↩︎