Basic OCR

Jan 04, 2010

This is a very basic optical character recognition (OCR) script written in PHP. It is untested and serves merely as a proof of concept. As noted in the comments, adjusting the sample size can improve results: with a large sample size on a small image, different characters can produce the same output (a collision). To actually recognize characters, the output of this script must be compared against a database of known values.
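
The collision problem can be illustrated with a small sketch. The helper `sample_glyph()` and the two 3x4 glyphs are made up for this example and are not part of the script; the helper only mirrors the modulo-based sampling idea, keeping every n-th column and every n-th row of a glyph.

```php
<?php
// Hypothetical helper (not part of the original script): keep every
// $xSample-th column and every $ySample-th row of a glyph given as an
// array of column strings ('0' = background, '1' = ink).
function sample_glyph(array $columns, int $xSample, int $ySample): string
{
    $signature = "";
    foreach ($columns as $x => $column) {
        if ($x % $xSample !== 0) {
            continue; // column skipped by the horizontal sample size
        }
        for ($y = 0; $y < strlen($column); $y++) {
            if ($y % $ySample === 0) {
                $signature .= $column[$y];
            }
        }
    }
    return $signature;
}

// Two made-up glyphs (3 columns, 4 rows) that differ only in odd rows.
$a = ["1010", "1000", "1010"];
$b = ["1111", "1000", "1110"];

// Distinct at full resolution.
var_dump(sample_glyph($a, 1, 1) === sample_glyph($b, 1, 1)); // bool(false)
// Identical once every other row is dropped: a collision.
var_dump(sample_glyph($a, 1, 2) === sample_glyph($b, 1, 2)); // bool(true)
```

With a vertical sample size of 2, both glyphs reduce to the same string, so a lookup could no longer tell them apart.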

<?php

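/* create a test image (palette-based, so imagecolorat() returns a palette index) */
$im = @imagecreate(100, 20) or die("Cannot Initialize new GD image stream");
/* the first allocated color fills the background and gets index 0 */
$background_color = imagecolorallocate($im, 255, 255, 255);
/* the text color gets index 1, so any column containing text sums to > 0 */
$text_color = imagecolorallocate($im, 0, 0, 0);
imagestring($im, 1, 5, 5, "Hello, World!", $text_color);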


/***
 * Assumptions:
 *   A monochrome image where characters are black
 *   A single character is connected
 *   Characters are disjointed by white space
 */

$width = imagesx($im);
$height = imagesy($im);

/***
 * Notes:
 *   The smaller the sample size, the more accurate the result,
 *   but the longer it takes. Larger images can use a larger
 *   sample size without compromising much accuracy.
 */
$x_sample = 1;
$y_sample = 1;

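$last = 0;    // pixel sum of the previous column (0 = blank column)
$l = 0;       // column index within the current character
$sample = ""; // sampled pixel values of the current character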
for($i = 0; $i < $width; $i++) {
    $col = array();

    for($j = 0; $j < $height; $j++) {
        $col[$j] = imagecolorat($im, $i, $j);
    }

    if(($current = array_sum($col)) > 0) {
        if($last == 0) {
            $l = 0;
        }
        for($k = 0; $k < $height; $k++) {
            if(($l % $x_sample) == 0) {
                if(($k % $y_sample) == 0) {
                    $sample .= $col[$k];
                }
            }
        }

        $l++;
        $last = $current;
    } else {
        $last = 0;
    }

    if(!empty($sample) && $last == 0) {
        echo $sample . "\n";
        $sample = "";
    }
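}

/* emit a character that touches the right edge of the image */
if(!empty($sample)) {
    echo $sample . "\n";
}
?>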

Output (each line represents one character; the space produces no line)

00000011111100000000000000001000000000000000000010000000000000000011111100000000
00000000011000000000000000001011000000000000000011010000000000000000010100000000
000000100001000000000000001111110000000000000000000100000000
000000100001000000000000001111110000000000000000000100000000
00000000011000000000000000001001000000000000000010010000000000000000011000000000
000000000001000000000000000001100000000000000000010000000000
00000011111100000000000000000110000000000000000001100000000000000011111100000000
00000000011000000000000000001001000000000000000010010000000000000000011000000000
00000000111100000000000000000100000000000000000010000000000000000000010000000000
000000100001000000000000001111110000000000000000000100000000
00000000011000000000000000001001000000000000000010100000000000000011111100000000
00000001110100000000
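
The comparison against known values could look roughly like the sketch below. It is only an illustration: a plain PHP array stands in for the database, `recognize()` is a hypothetical helper, and the two signatures are copied from the output above on the assumption that the lines follow the order of the characters in "Hello, World!" (so the repeated lines are 'l' and 'o').

```php
<?php
// Signatures copied from the output above (assumed to be 'l' and 'o').
$l = "000000100001000000000000001111110000000000000000000100000000";
$o = "00000000011000000000000000001001000000000000000010010000000000000000011000000000";

// Hypothetical lookup table standing in for the database: signature => character.
$known = [
    $l => "l",
    $o => "o",
];

// Hypothetical helper: map each signature to its known character,
// or "?" when the signature is not in the table.
function recognize(array $signatures, array $known): string
{
    $text = "";
    foreach ($signatures as $signature) {
        $text .= $known[$signature] ?? "?";
    }
    return $text;
}

echo recognize([$l, $o, "0000"], $known) . "\n"; // lo?
```

An exact-match lookup like this only works when the font, size and sample size are the same as those used to build the table; anything else would need some form of fuzzy comparison.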