Attrasoft ImageFinder

4.   DOCUMENTFINDER
4.1   Why Image Matching?
4.1.1   Expectations
4.2   Run DocumentFinder
4.2.1   Input
4.2.2   Training
4.2.3   Matching
4.3   Restrictions
4.4   Examples
4.5   Simple Matching
4.5.1   Operation
4.5.2   File Input Examples
4.5.3   Dir Input Examples
4.6   Three-Step Matching
4.6.1   Matching Procedure
4.6.2   Example �BioAPI�
4.7   Parameters
4.8   Checking the Results
4.8.1   B1_matchlist.txt
4.8.2   Example �Abm54�
4.8.3   Example �All�
4.9   Query Set Vs Target Set
4.9.1   Query Set against Target Set
4.9.2   Examples

4.   DocumentFinder
The DocumentFinder is a customized example of the ImageFinder. It brings the number of parameters from hundreds to two and makes it easier to use.
To start the DocumentFinder, click menu item "Example/DocumentFinder".
Recent development in scanner technology has made it very easy to convert paper documents into digital documents. A $1000 scanner, for example Fujitsu 4120c, can scan and save 50 pages in a single click. The scanner creates image names via some auto-numbers you have specified. More expensive scanners can scan and save 1,000 pages in a single click.
The central task in any image data management system is to retrieve images that meet some specified constraints. This software attempts to solve a particular problem, retrieve document images similar to a given document image.
Assume you have a million pages of documents already converted into digital form, and you want to retrieve documents that meet some specified constraints. A typical document retrieval system should have several components:
1. Text;
2. Image;
3. 1-D barcode; and
4. 2-D barcode.
Each component addresses a particular area of retrieval and their functions generally do not overlap. A complete solution should use all of the above options. This software deals with the image matching only.
4.1   Why Image Matching?
Image matching deals with several particular problems, which cannot be addressed by text search, 1D barcode search, and 2D barcode search:
(1) Tables: Image Matching is responsible for retrieving documents with similar Tables.

Figure 4.1 Documents with Tables.
(2) Images: Image Matching is responsible for retrieving documents with Similar Images.

Figure 4.2 Documents with Images.
(3) Special Symbols: Image Matching is responsible for retrieving documents with similar Special Symbols.

Figure 4.3 Documents with Special Symbols.
(4) Figures, Drawing, Designs: image matching is responsible for retrieving documents with similar Figures, Drawing, and Designs.

Figure 4.4   Documents with Figures, Drawing, and Designs.

(5) Maps: Image Matching is responsible for retrieving documents with similar Maps.

Figure 4.5 Documents with Maps.
(6) Hand Written Notes: Image Matching is responsible for retrieving documents with Hand Written Notes.

Figure 4.6    Documents with Handwritten Notes.
(7) �
4.1.1   Expectations
With image matching, the document images are converted into templates first. This one-time conversion runs at the speed of several images per second. After that, the template matching runs at a speed of higher than 100,000 matches per second and it will eliminate about 98% - 99% of the unmatched images. This means if you want to match one newly captured image against 1,000,000 existing images, in a matter of seconds, the DocumentFinder will reduce the 1,000,000 possible matches to about 10,000 candidates. At this point, the other types of matching can further narrow down the retrieved set.

Figure 4.7   The DocumentFinder.
The final example in this software has the following results:
Total Images N = 1,023
Possible Matches N*N = 1,046,529
Attrasoft Matches = 4,828
Actual Duplicates = 2,883
Attrasoft Found Duplicates = 2,863
Percent Eliminated = 99.5 %
= 1 � 4,828/1,046,529
Percent of Duplicates Found = 99.3 %
= 2,863/2,883

4.2   Run DocumentFinder
There are two ways to enter data into the DocumentFinder:

Directory

File

In the case of directory input, the DocumentFinder will compare each of the images in the directory to all images in the directory. In the case of file input, the DocumentFinder will compare each of the images in the file to all images in the file.
The DocumentFinder can be operated in two ways:

Simple Matching (2 clicks); or

Three-Step Matching.

Three-Step Matching offers more options.
4.2.1   Input

Clicking the �Search Dir� button and selecting any file in that directory can specify the input directory.

Clicking the �File Input� button and selecting a file can specify the input file.

The input files list one image per line. Each line specifies an absolute path. For example,
C:\xyz1\0001.jpg
C:\xyz1\0002.jpg
C:\xyz2\0003.jpg
C:\xyz2\0004.jpg
�
There are three sample input files in the software CD, �df_input_misc.txt�, �df_input_bioapi.txt�, and �df_input_abm54.txt�.
4.2.2   Training
The DocumentFinder is an example of a customized ImageFinder. There are many parameters in the ImageFinder. The DocumentFinder has eliminated all parameters except two. If you choose the Simple Matching (2 clicks), there will be no parameters. If you choose the Three-Step Matching, there are two parameters.
The first parameter has three choices:

98% Setting (Default)

97% Setting

96% Setting

Clicking �Template/98% Setting�, or �Template/97% Setting�, or �Template/96% Setting� can set the parameter. The 98% Setting will retrieve about 98% of the matches; and the 96% Setting will retrieve about 96% of the matches.
The second parameter is the �Training� button:

Training

Training (High Accuracy)

Training (Low Accuracy)

Training (Lower Accuracy)

The output of �Training (High Accuracy)� is half of that of �Training�; the output of �Training� is half of that of �Training (Low Accuracy)�; the output of �Training (Low Accuracy)� is half of that of �Training (Lower Accuracy)�.
The combination of �96 Setting� and �Training (High Accuracy)� can be used to quickly probe a set of images to find duplicates.
4.2.3   Matching
The image matching will be done in three steps:

Converting Images to Records

Training

Template Matching

Let us look at each phase.
Converting Images to Records:
It is very important that the memory (RAM) used is less than the size of the physical RAM, i.e. avoid using virtual memory, otherwise the procedure will be slower.
This step is slow (several images per second); however:

(1) This step can be done once for all; and

(2) This is linear, i.e. the time is directly proportional to the number of images, provided the memory used is less than the physical RAM (otherwise, the problem should be done in piecemeal).

Therefore, this step does not have much impact on the operating speed.
Training
Training uses the data collected in advance. This step will take about 1 minute. Once the software is trained, it can make multiple matches until you close the software or choose a different setting.
Template Matching
Each record will take roughly 0.13K. For example, 1 million records will take 130 MB.
The matching speed will be between 100,000 � 1,000,000 comparisons per second.

4.3   Restrictions
The data used in this software are obtained as follows:
Document Scanning Dimension: 8.5 inch x 10.66 inches.
Scanning Resolution: 70 pixels per inch.
Image Mode: Black and White.
Image File Type: 24-bit Jpeg.
Resulting Image Dimension: 576 x 746.
Scanner Used: Fujitsu 4120c.
The restrictions are:
Document Scanning Dimension: 8.5 inch x 10.66 inches.
Scanning Resolution: 70 pixels per inch.
Image File Type: 24-bit jpeg image.
Other settings might require a different set of internal parameters.
The reason to choose 70 pixels per inch is that a typical jpeg image will be about 100K, which is neither too large nor too small.
4.4   Examples
There are four examples in the software.
Name          Location                # of Images
Misc             .\misc\                    93 x 2
BioAPI         .\bioapi\                 119 x 3
Abm54         .\abm54\                160 x 3
All                                              1,023 images
The �Misc� example has 93 matching pairs, or 186 images. It has various types of documents such as Tables, Images, Music, Figures, Drawings, Designs, Maps, �. By default, the template file for this example is a2.txt; the result file is b2.txt, and the match-list file is b2_matchlist.txt.
The �BioAPI� example has 119 matching triplets, or 357 images. BioAPI is a document, which attempts to set up a standard for the templates. This document has 119 pages. By default, the template file for this example is a3.txt; the result file is b3.txt, and the match-list file is b3_matchlist.txt.
The �Abm54� example has 160 triplets, or 480 images. By default, the template file for this example is a4.txt; the result file is b4.txt, and the match-list file is b4_matchlist.txt.
The last example is the combination of all three examples. This one has 1,023 images. By default, the template file for this example is a5.txt; the result file is b5.txt, and the match-list file is b5_matchlist.txt.
4.5   Simple Matching

Figure 4.8 Run Menu.
The Simple Match uses �Run� Menu. The image matching will be done in three steps:

Converting Images to Records

Training

Template Matching

Simple Matching combines all three steps into a single click. Note that step 1, Converting Images to Records, can be done once for all, so the Simple Matching should be used only once for each new image set.
After that, Three-Step Matching should be used to increase the computation speed.
4.5.1   Operation
Operation Using File Input:

Select input file by clicking �Run/Select File�;

Run by clicking �Run/File Input Run�.

Operation Using Dir Input:

Select input file by clicking �Run/Select Directory�;

Run by clicking �Run/Dir Input Run�.

The Input File should list one image per line using the absolute path (directory path + image name). For example,
C:\xyz1\0001.jpg
C:\xyz1\0002.jpg
C:\xyz2\0003.jpg
C:\xyz2\0004.jpg
�
The Simple Match will:

Convert images in the Input File or Input Directory to records and save them to a1.txt;

Training from sample.txt and sampletruth.txt;

Complete Template Matching;

Print the results to b1.txt.

4.5.2   File Input Examples
In this section, we will use the �Misc� example, which has 93 matching pairs, or 186 images. It has various types of documents like Tables, Images, Music, Figures, Drawings, Designs, Maps, �.
(1) Select the input file by clicking �Run/Select File�.
To select the input file, click �Run/Select File�, and select �.\df_input_misc.txt�. Click �ShowFile� button to display the content of this file:
.\misc\misc_a_0001.jpg
.\misc\misc_a_0002.jpg
.\misc\misc_a_0003.jpg
�

(2) Run by clicking �Run/File Input Run�.
You should first observe:
(a) The images are translated into templates;
(b) The DocumentFinder is trained;
(c) Image Matching.
The number of comparisons is N*(N-1)/2, or 186 * (186 �1)/2 = 17,205. The 17,205 comparisons take a split second. The results are in a file, b1.txt, which should be opened at this time.
(3) Results
B1.txt will look like this:
.\misc\misc_a_0001.jpg
.\misc\misc_b_0001.jpg
.\misc\misc_a_0002.jpg
.\misc\misc_b_0002.jpg
�
The first block means that misc_a_0001.jpg is compared with the other 185 images specified by the input file and one match is found, which is misc_b_0001.jpg. This is correct.
Because there are 186 images in the input file, you should have 186 blocks in the output file. Note that this is N*(N-1)/2 comparison, not N*N comparison; therefore, misc_a_0001.jpg will match with misc_b_0001.jpg; but misc_b_0001.jpg will not match with misc_a_0001.jpg, because it will never meet with misc_a_0001.jpg.
The DocumentFinder does have the N*N matching command, �Target/a1 + a1 è b1�, but we will introduce this command later.
(4) Analysis
Scroll all the way to the end of this file; the last line is:
Total Number of Matches = 139
i.e. the DocumentFinder found 139 matches. All 93 pairs are identified. The results are:
Total Images N = 186
Possible Matches N*(N-1)/2 = 17,205
Attrasoft Matches = 139
Actual Duplicates = 93
Attrasoft Found Duplicates = 93
Percent Eliminated = 99.2 %
= 1 - 139/17,205
Percent of Duplicates Found = 100 %
= 93/93

4.5.3   Dir Input Examples
The procedure using Dir Input is:

Select input file by clicking �Run/Select Directory� and select any file in the directory, .\misc\;

Run by clicking �Run/Dir Input Run�.

You should get similar results like the last section, except the order of appearance.
4.6   Three-Step Matching
Besides the Simple-Match, the software also supports matching step by step. The Three-Step Matching will be used most often because you do not need to convert the images to templates each time.
In this chapter, we will show how to match in three steps. Please remember, Step 1, Converting Images to Template Records, can be done once for all, so you do not have to do step 1 in each run.
4.6.1   Matching Procedure

Figure 4.9   Template Menu is used to convert images to templates.
Step 1. Converting Images to Records:

Click �Templates/Select File� and select a file;

Click �Templates/Scan File (File Input)�;

Click �Templates/Generate a1.txt�;

Click �a1.txt� to see the records generated (Windows Explorer).

Note that the output is saved into a1.txt. You also have the option to save it to a2.txt, a3.tx, a4.txt and a5.txt.

Figure 4.10 Query Menu for Training and Template Matching.
Step 2. Training:

Click �Query/Training�.

Note that in the next computation, as long as you do not close the program, you do not need to train again.
Step 3. Matching

Click �Query/Generate b1.txt�.

Note that: �b1.txt� is generated from �a1.txt�. You also make the following substitutions in the above commands:
A2.txt ==> b2.txt
A3.txt ==> b3.txt
A4.txt ==> b4.txt
A5.txt ==> b5.txt
The effect of these three steps is the same as the Simple Matching. Once the images are converted into records, you should not do Step 1, converting images to templates, over and over again. The reason for this Three-Step approach is to save time: you will convert images to records only once. The converted records are saved into a1.txt (or a2.txt, a3.txt, a4.txt, or a5.txt, depending on the command used). If the same images are used, you do not have to convert them again the second time.
Also, as long as you do not close the software, you do not need to train each time; it is calculated once for all.
4.6.2   Example �BioAPI�
In this section, we will use the �BioAPI� example, which has 119 matching triplets, or 357 images. BioAPI is a document, which attempts to set up a standard for the templates. This document has 119 pages.
(1) Select input file by clicking �Templates/Select File�
Click �Templates/Select File�, and select �.\df_input_bioapi.txt�;
Click the �ShowFile� button to display the content of this file:
.\bioapi\bioapi_a_0001.jpg
.\bioapi\bioapi_a_0002.jpg
.\bioapi\bioapi_a_0003.jpg
�
(2) Convert images into Templates by clicking �Templates/Scan Images � File Input�.
You should first observe that the images are translated into templates. After that, you must save the templates into 1 of 5 files:

Click �Templates/Generate a1.txt�;

Click �a1.txt� to see the records generated.

(3) Training
To train the DocumentFinder, click �Query/Training�.
(4) Match
Click �Query/Generate b1.txt�.
Note that, �b1.txt� is generated from �a1.txt�. You also make the following substitutions in the above commands:
A2.txt ==> b2.txt
A3.txt ==> b3.txt
A4.txt ==> b4.txt
A5.txt ==> b5.txt
The number of images is N = 357 images. The number of comparisons is N (N-1)/2, or 357 * (357 �1)/2 = 63,546. The 63,546 comparisons take only a split second. The results are in file, b1.txt, which should be opened at this time.
(5) Results
B1.txt will look like this:
.\bioapi\bioapi_a_0001.jpg
.\bioapi\bioapi_b_0001.jpg
.\bioapi\bioapi_c_0001.jpg
.\bioapi\bioapi_a_0002.jpg
.\bioapi\bioapi_a_0091.jpg
.\bioapi\bioapi_b_0002.jpg
.\bioapi\bioapi_b_0091.jpg
.\bioapi\bioapi_c_0002.jpg
.\bioapi\bioapi_c_0091.jpg�
The first block means that bioapi_a_0001.jpg is compared with all of the other images specified by the input file and 2 matches are found; this result is correct.
Because there are 357 images in the input file, you should have 357 blocks in the output file. Note that this is N*(N-1)/2 comparison, not N*N comparison; therefore, bioapi_a_0001 will match with bioapi_b_0001; but bioapi_b_0001 will not match with bioapi_a_0001, because it will never meet with bioapi_a_0001.
(6) Analysis
Scroll all the way to the end of the output file; the last line is:
Total Number of Matches = 874
i.e. the DocumentFinder found 874 matches. The actual matching pairs are as follows, 119 for (a, b) pairs; 119 for (a, c) pairs; 119 for (b, c) pairs; and 357 total. The results are:
Total Images N = 357
Possible Matches N*(N-1)/2 = 63,546
Attrasoft Matches = 874
Actual Duplicates = 357
Attrasoft Found Duplicates = 357
Percent Eliminated = 98.6 %
= 1 - 874/63,546
Percent of Duplicates Found = 100 %
= 357/357

4.7   Parameters
The first of the two parameters has three settings:

96% Setting

97% Setting

98% Setting (Default)

The 96% Setting will find 96% of the duplicates and the 98% Setting will find 98% of the duplicates. The 98% Setting will let more images pass through.
There are 3 menu items:
Templates/96% Setting
Templates/97% Setting
Templates/98% Setting

Click one of the above to make a selection.
The second of the two parameters is to choose one of the four training commands:
Query/Training
Query/Training (High Accuracy)
Query/Training (Low Accuracy)
Query/Training (Lower Accuracy)

4.8   Checking the Results
If this is a test run, i.e. you know the correct answers. You can see the matching results in seconds. You must prepare a file, which indicates the matching pairs. To test the results in b1.txt, you must prepare B1_matchlist.txt file; to test the results in b2.txt, you must prepare B2_matchlist.txt file; �.

Figure 4.11   The Results Menu.
4.8.1   B1_matchlist.txt
To check the results, see b1.txt, you must have a matching file, called b1_matchlist.txt. An example of b1_matchlist.txt is:
5
1    IMAGE00053770    IMAGE01312024
2    IMAGE00053771    IMAGE01312025
3    IMAGE00053772    IMAGE01312026
4    IMAGE00053773    IMAGE01312027
5    IMAGE00053774    IMAGE01312028
Line 1 is the number of matches in this file. Each line has the following format:
Number, tab, filename, tab, filename, tab.
Note:
(1) You must have a tab at the end of each line;
(2) The file names do not contain �.jpg�.
Once you get the two files prepared, click �Results\Compare b1 and b1_matchlist� to check format. There are four other commands for b2.txt, �, b5.txt.
There are two common errors:
(1) The last Tab is missing;
(2) The number of rows is less than the first number in the file.
4.8.2   Example �Abm54�
In this section, we will use the �Abm54� example, which has 160 triplets, or 480 images. ABM54 is the User�s Guide for Attrasoft ImageFinder 5.4, which is the parent software of the DocumentFinder.
Generating a4.txt:

Click �Templates/Select File� and select �Input_abm54.txt�;

Click �Templates/Scan File (File Input)�;

Click �Templates/Generate a4.txt�;

Click �a4.txt� to see the records generated.

Template Matching:

Click �Query/Train�;

Click �Query/Generate b4.txt�.

Correct Match
Now open �b4_matchlist.txt�, you will see the correct matches are:
480
4108 abm54_a_0001 abm54_b_0001
4109 abm54_a_0002 abm54_b_0002
4110 abm54_a_0003 abm54_b_0003
�
Results
Click �Results/Compare b4 and b4_matchlist� and you will get:
N= 480
Check...
Clear...
Total Memory = 12,164,966
Total Matches = 478
Total Memory = 12,188,142
These results show that there are 480 matches in the b4_matchlist file and b4.txt has 478 of the 480 matches.
Analysis
Scroll all the way to the end of the b4.txt file, the last line is:
Total Number of Matches = 900
i.e. the DocumentFinder found 900 matches. The actual number of matching pairs are computed as follows, 160 for (a, b) pairs; 160 for (a, c) pairs; 160 for (b, c) pairs; and 480 total. The results are:
Total Images N = 480
Possible Matches N*(N-1)/2 = 114,960
Attrasoft Matches = 900
Actual Duplicates = 480
Attrasoft Found Duplicates = 478
Percent Eliminated = 99.2 %
= 1 - 900/114,960
Percent of Duplicates Found = 99.6 %
= 478/480

4.8.3   Example �All�
In this section, we will use the �All� example, which combines the three previous examples and has 1,023 images.
If you have not changed the default data, the �a5.txt� file is the template file, which has 1,022 records.
Template Matching:

Click �Query/Train�;

Click �Query/Generate b5.txt�.

Open b5.txt, the last line is:
Total Number of Matches = 1,913.

Correct Match
There are 930 correct matches in �b5_matchlist.txt, which is the summation of all correct matches in the previous examples, 93 + 357 + 480 = 930.
Results
Click �Results/Compare b5 and b5_matchlist� and you will get:
N= 930
Check...
Clear...
Total Memory = 11,318,022
Total Matches = 928
Total Memory = 11,347,118
This results show that there are 930 matches in the b5_matchlist file and b4.txt has 928 of the 930 matches.
Analysis
Scroll all the way to the end of the b5.txt file, the last line is:
Total Number of Matches = 1913.
i.e. the DocumentFinder found 1,913 matches.
Total Images N = 1,023
Possible Matches N*(N-1)/2 = 522,753
Attrasoft Matches = 1,913
Actual Duplicates = 930
Attrasoft Found Duplicates = 928
Percent Eliminated = 99.6 %
= 1 � 1,913/522,753
Percent of Duplicates Found = 99.78%
= 928/930
4.9   Query Set Vs Target Set
4.9.1   Query Set against Target Set
The number of comparisons for a set with N images are N (N �1)/2. If N is large, this number is large. A natural division is, say,
N = A + B + C
And N (N-1)/2 is converted into AA, AB, AC, BB, BC, CC. Now AA, BB, and CC are self-matching, but AB, AC, and BC are not. The data set is divided into:

Figure 4.12   Query Set against Target Set.
Target Set
The Target Set is a dataset being searched against in a given test or subtest: an experiment searches a Query Set against a Target Set. The Target Set is also known as a database.
Query Set
The Query Set is a dataset containing the searches for a given test or subtest: an experiment searches a Query Set against a Target Set. The Query Set is also known as a Search set.
There are 5 commands for this purpose, see Figure 4.12
Note that �Target\ a1 + a1 ==> b1� is different from the earlier command:

�Query/Generate b1.txt� is N (N-1)/2 comparison;

�Target\ a1 + a1 ==> b1� is N*N comparison.

4.9.2   Examples
If you have not changed the default data, the �a5.txt� file is the template file, which has 1,023 records, and �b1_matchlist.txt� is the match-list file. Save a5.txt to a1.txt.
In the N*(N-1)/2 matching, there are 930 matching pairs. In the N*N matching, there are 1,023 self matching and 930 mirror matching pairs. The total number of matches should be 930 + 1,023 + 930 = 2,883.
Click �Target\ a1 + a1 ==> b1� to make an N*N comparison and you will get 4,828. The total number of matches is 1,023 x 1,023 = 1,046,529, which will take about 5 seconds. Click �Results/Compare b1 and b1_matchlist� and the DocumentFinder will indicate 2,863 out of 2,883 duplicates have been identified.
Total Images N = 1,023
Possible Matches N*N = 1,046,529
Attrasoft Matches = 4,828
Actual Duplicates = 2,883
Attrasoft Found Duplicates = 2,863
Percent Eliminated = 99.5 %
= 1 � 4,828/1,046,529
Percent of Duplicates Found = 99.3 %
= 2,863/2,883

Return