4. DOCUMENTFINDER4.1 Why Image Matching?4.1.1 Expectations4.2 Run DocumentFinder4.2.1 Input4.3 Restrictions
4.2.2 Training
4.2.3 Matching
4.4 Examples
4.5 Simple Matching4.5.1 Operation4.6 Three-Step Matching
4.5.2 File Input Examples
4.5.3 Dir Input Examples4.6.1 Matching Procedure4.7 Parameters
4.6.2 Example �BioAPI�
4.8 Checking the Results4.8.1 B1_matchlist.txt4.9 Query Set Vs Target Set
4.8.2 Example �Abm54�
4.8.3 Example �All�4.9.1 Query Set against Target Set
4.9.2 Examples
4. DocumentFinderThe DocumentFinder is a customized example of the ImageFinder. It brings the number of parameters from hundreds to two and makes it easier to use.
To start the DocumentFinder, click menu item "Example/DocumentFinder".
Recent development in scanner technology has made it very easy to convert paper documents into digital documents. A $1000 scanner, for example Fujitsu 4120c, can scan and save 50 pages in a single click. The scanner creates image names via some auto-numbers you have specified. More expensive scanners can scan and save 1,000 pages in a single click.
The central task in any image data management system is to retrieve images that meet some specified constraints. This software attempts to solve a particular problem, retrieve document images similar to a given document image.
Assume you have a million pages of documents already converted into digital form, and you want to retrieve documents that meet some specified constraints. A typical document retrieval system should have several components:
1. Text;Each component addresses a particular area of retrieval and their functions generally do not overlap. A complete solution should use all of the above options. This software deals with the image matching only.
2. Image;
3. 1-D barcode; and
4. 2-D barcode.Image matching deals with several particular problems, which cannot be addressed by text search, 1D barcode search, and 2D barcode search:
(1) Tables: Image Matching is responsible for retrieving documents with similar Tables.
![]()
Figure 4.1 Documents with Tables.
(2) Images: Image Matching is responsible for retrieving documents with Similar Images.
![]()
Figure 4.2 Documents with Images.
(3) Special Symbols: Image Matching is responsible for retrieving documents with similar Special Symbols.
![]()
Figure 4.3 Documents with Special Symbols.
(4) Figures, Drawing, Designs: image matching is responsible for retrieving documents with similar Figures, Drawing, and Designs.
![]()
Figure 4.4 Documents with Figures, Drawing, and Designs.
(5) Maps: Image Matching is responsible for retrieving documents with similar Maps.
![]()
Figure 4.5 Documents with Maps.
(6) Hand Written Notes: Image Matching is responsible for retrieving documents with Hand Written Notes.
![]()
Figure 4.6 Documents with Handwritten Notes.
(7) �
With image matching, the document images are converted into templates first. This one-time conversion runs at the speed of several images per second. After that, the template matching runs at a speed of higher than 100,000 matches per second and it will eliminate about 98% - 99% of the unmatched images. This means if you want to match one newly captured image against 1,000,000 existing images, in a matter of seconds, the DocumentFinder will reduce the 1,000,000 possible matches to about 10,000 candidates. At this point, the other types of matching can further narrow down the retrieved set.
![]()
Figure 4.7 The DocumentFinder.
The final example in this software has the following results:
Total Images N = 1,023
Possible Matches N*N = 1,046,529
Attrasoft Matches = 4,828Actual Duplicates = 2,883
Attrasoft Found Duplicates = 2,863Percent Eliminated = 99.5 %
= 1 � 4,828/1,046,529
Percent of Duplicates Found = 99.3 %
= 2,863/2,883There are two ways to enter data into the DocumentFinder:
In the case of directory input, the DocumentFinder will compare each of the images in the directory to all images in the directory. In the case of file input, the DocumentFinder will compare each of the images in the file to all images in the file.
- Directory
- File
The DocumentFinder can be operated in two ways:
Three-Step Matching offers more options.
- Simple Matching (2 clicks); or
- Three-Step Matching.
- Clicking the �Search Dir� button and selecting any file in that directory can specify the input directory.
- Clicking the �File Input� button and selecting a file can specify the input file.
The input files list one image per line. Each line specifies an absolute path. For example,C:\xyz1\0001.jpgThere are three sample input files in the software CD, �df_input_misc.txt�, �df_input_bioapi.txt�, and �df_input_abm54.txt�.
C:\xyz1\0002.jpg
C:\xyz2\0003.jpg
C:\xyz2\0004.jpg
�The DocumentFinder is an example of a customized ImageFinder. There are many parameters in the ImageFinder. The DocumentFinder has eliminated all parameters except two. If you choose the Simple Matching (2 clicks), there will be no parameters. If you choose the Three-Step Matching, there are two parameters.
The first parameter has three choices:
Clicking �Template/98% Setting�, or �Template/97% Setting�, or �Template/96% Setting� can set the parameter. The 98% Setting will retrieve about 98% of the matches; and the 96% Setting will retrieve about 96% of the matches.
- 98% Setting (Default)
- 97% Setting
- 96% Setting
The second parameter is the �Training� button:
The output of �Training (High Accuracy)� is half of that of �Training�; the output of �Training� is half of that of �Training (Low Accuracy)�; the output of �Training (Low Accuracy)� is half of that of �Training (Lower Accuracy)�.
- Training
- Training (High Accuracy)
- Training (Low Accuracy)
- Training (Lower Accuracy)
The combination of �96 Setting� and �Training (High Accuracy)� can be used to quickly probe a set of images to find duplicates.
The image matching will be done in three steps:
Let us look at each phase.
- Converting Images to Records
- Training
- Template Matching
Converting Images to Records:
It is very important that the memory (RAM) used is less than the size of the physical RAM, i.e. avoid using virtual memory, otherwise the procedure will be slower.TrainingThis step is slow (several images per second); however:
Therefore, this step does not have much impact on the operating speed.
- (1) This step can be done once for all; and
- (2) This is linear, i.e. the time is directly proportional to the number of images, provided the memory used is less than the physical RAM (otherwise, the problem should be done in piecemeal).
Training uses the data collected in advance. This step will take about 1 minute. Once the software is trained, it can make multiple matches until you close the software or choose a different setting.Template MatchingEach record will take roughly 0.13K. For example, 1 million records will take 130 MB.The matching speed will be between 100,000 � 1,000,000 comparisons per second.
The data used in this software are obtained as follows:
Document Scanning Dimension: 8.5 inch x 10.66 inches.The restrictions are:
Scanning Resolution: 70 pixels per inch.
Image Mode: Black and White.
Image File Type: 24-bit Jpeg.
Resulting Image Dimension: 576 x 746.
Scanner Used: Fujitsu 4120c.Document Scanning Dimension: 8.5 inch x 10.66 inches.Other settings might require a different set of internal parameters.
Scanning Resolution: 70 pixels per inch.
Image File Type: 24-bit jpeg image.The reason to choose 70 pixels per inch is that a typical jpeg image will be about 100K, which is neither too large nor too small.
There are four examples in the software.
Name Location # of Images
Misc .\misc\ 93 x 2
BioAPI .\bioapi\ 119 x 3
Abm54 .\abm54\ 160 x 3
All 1,023 imagesThe �Misc� example has 93 matching pairs, or 186 images. It has various types of documents such as Tables, Images, Music, Figures, Drawings, Designs, Maps, �. By default, the template file for this example is a2.txt; the result file is b2.txt, and the match-list file is b2_matchlist.txt.
The �BioAPI� example has 119 matching triplets, or 357 images. BioAPI is a document, which attempts to set up a standard for the templates. This document has 119 pages. By default, the template file for this example is a3.txt; the result file is b3.txt, and the match-list file is b3_matchlist.txt.
The �Abm54� example has 160 triplets, or 480 images. By default, the template file for this example is a4.txt; the result file is b4.txt, and the match-list file is b4_matchlist.txt.
The last example is the combination of all three examples. This one has 1,023 images. By default, the template file for this example is a5.txt; the result file is b5.txt, and the match-list file is b5_matchlist.txt.
![]()
Figure 4.8 Run Menu.
The Simple Match uses �Run� Menu. The image matching will be done in three steps:
Simple Matching combines all three steps into a single click. Note that step 1, Converting Images to Records, can be done once for all, so the Simple Matching should be used only once for each new image set.
- Converting Images to Records
- Training
- Template Matching
After that, Three-Step Matching should be used to increase the computation speed.
Operation Using File Input:
Operation Using Dir Input:
- Select input file by clicking �Run/Select File�;
- Run by clicking �Run/File Input Run�.
The Input File should list one image per line using the absolute path (directory path + image name). For example,
- Select input file by clicking �Run/Select Directory�;
- Run by clicking �Run/Dir Input Run�.
C:\xyz1\0001.jpgThe Simple Match will:
C:\xyz1\0002.jpg
C:\xyz2\0003.jpg
C:\xyz2\0004.jpg
�
- Convert images in the Input File or Input Directory to records and save them to a1.txt;
- Training from sample.txt and sampletruth.txt;
- Complete Template Matching;
- Print the results to b1.txt.
In this section, we will use the �Misc� example, which has 93 matching pairs, or 186 images. It has various types of documents like Tables, Images, Music, Figures, Drawings, Designs, Maps, �.
(1) Select the input file by clicking �Run/Select File�.
To select the input file, click �Run/Select File�, and select �.\df_input_misc.txt�. Click �ShowFile� button to display the content of this file:(2) Run by clicking �Run/File Input Run�..\misc\misc_a_0001.jpg
.\misc\misc_a_0002.jpg
.\misc\misc_a_0003.jpg
�You should first observe:(3) Results(a) The images are translated into templates;The number of comparisons is N*(N-1)/2, or 186 * (186 �1)/2 = 17,205. The 17,205 comparisons take a split second. The results are in a file, b1.txt, which should be opened at this time.
(b) The DocumentFinder is trained;
(c) Image Matching.B1.txt will look like this:(4) Analysis.\misc\misc_a_0001.jpgThe first block means that misc_a_0001.jpg is compared with the other 185 images specified by the input file and one match is found, which is misc_b_0001.jpg. This is correct.
.\misc\misc_b_0001.jpg.\misc\misc_a_0002.jpg
.\misc\misc_b_0002.jpg
�Because there are 186 images in the input file, you should have 186 blocks in the output file. Note that this is N*(N-1)/2 comparison, not N*N comparison; therefore, misc_a_0001.jpg will match with misc_b_0001.jpg; but misc_b_0001.jpg will not match with misc_a_0001.jpg, because it will never meet with misc_a_0001.jpg.
The DocumentFinder does have the N*N matching command, �Target/a1 + a1 è b1�, but we will introduce this command later.Scroll all the way to the end of this file; the last line is:Total Number of Matches = 139
i.e. the DocumentFinder found 139 matches. All 93 pairs are identified. The results are:
Total Images N = 186
Possible Matches N*(N-1)/2 = 17,205
Attrasoft Matches = 139Actual Duplicates = 93
Attrasoft Found Duplicates = 93Percent Eliminated = 99.2 %
= 1 - 139/17,205
Percent of Duplicates Found = 100 %
= 93/93The procedure using Dir Input is:
You should get similar results like the last section, except the order of appearance.
- Select input file by clicking �Run/Select Directory� and select any file in the directory, .\misc\;
- Run by clicking �Run/Dir Input Run�.
Besides the Simple-Match, the software also supports matching step by step. The Three-Step Matching will be used most often because you do not need to convert the images to templates each time.
In this chapter, we will show how to match in three steps. Please remember, Step 1, Converting Images to Template Records, can be done once for all, so you do not have to do step 1 in each run.
![]()
Figure 4.9 Template Menu is used to convert images to templates.
Step 1. Converting Images to Records:
Note that the output is saved into a1.txt. You also have the option to save it to a2.txt, a3.tx, a4.txt and a5.txt.
- Click �Templates/Select File� and select a file;
- Click �Templates/Scan File (File Input)�;
- Click �Templates/Generate a1.txt�;
- Click �a1.txt� to see the records generated (Windows Explorer).
![]()
Figure 4.10 Query Menu for Training and Template Matching.
Step 2. Training:
Note that in the next computation, as long as you do not close the program, you do not need to train again.
- Click �Query/Training�.
Step 3. Matching
- Click �Query/Generate b1.txt�.
Note that: �b1.txt� is generated from �a1.txt�. You also make the following substitutions in the above commands:A2.txt ==> b2.txt
A3.txt ==> b3.txt
A4.txt ==> b4.txt
A5.txt ==> b5.txtThe effect of these three steps is the same as the Simple Matching. Once the images are converted into records, you should not do Step 1, converting images to templates, over and over again. The reason for this Three-Step approach is to save time: you will convert images to records only once. The converted records are saved into a1.txt (or a2.txt, a3.txt, a4.txt, or a5.txt, depending on the command used). If the same images are used, you do not have to convert them again the second time.
Also, as long as you do not close the software, you do not need to train each time; it is calculated once for all.
In this section, we will use the �BioAPI� example, which has 119 matching triplets, or 357 images. BioAPI is a document, which attempts to set up a standard for the templates. This document has 119 pages.
(1) Select input file by clicking �Templates/Select File�
Click �Templates/Select File�, and select �.\df_input_bioapi.txt�;(2) Convert images into Templates by clicking �Templates/Scan Images � File Input�.Click the �ShowFile� button to display the content of this file:
.\bioapi\bioapi_a_0001.jpg
.\bioapi\bioapi_a_0002.jpg
.\bioapi\bioapi_a_0003.jpg
�You should first observe that the images are translated into templates. After that, you must save the templates into 1 of 5 files:(3) Training
- Click �Templates/Generate a1.txt�;
- Click �a1.txt� to see the records generated.
To train the DocumentFinder, click �Query/Training�.(4) MatchClick �Query/Generate b1.txt�.(5) ResultsNote that, �b1.txt� is generated from �a1.txt�. You also make the following substitutions in the above commands:
A2.txt ==> b2.txt
A3.txt ==> b3.txt
A4.txt ==> b4.txt
A5.txt ==> b5.txtThe number of images is N = 357 images. The number of comparisons is N (N-1)/2, or 357 * (357 �1)/2 = 63,546. The 63,546 comparisons take only a split second. The results are in file, b1.txt, which should be opened at this time.
B1.txt will look like this:(6) Analysis.\bioapi\bioapi_a_0001.jpg
.\bioapi\bioapi_b_0001.jpg
.\bioapi\bioapi_c_0001.jpg.\bioapi\bioapi_a_0002.jpg
.\bioapi\bioapi_a_0091.jpg
.\bioapi\bioapi_b_0002.jpg
.\bioapi\bioapi_b_0091.jpg
.\bioapi\bioapi_c_0002.jpg
.\bioapi\bioapi_c_0091.jpg�The first block means that bioapi_a_0001.jpg is compared with all of the other images specified by the input file and 2 matches are found; this result is correct.
Because there are 357 images in the input file, you should have 357 blocks in the output file. Note that this is N*(N-1)/2 comparison, not N*N comparison; therefore, bioapi_a_0001 will match with bioapi_b_0001; but bioapi_b_0001 will not match with bioapi_a_0001, because it will never meet with bioapi_a_0001.Scroll all the way to the end of the output file; the last line is:Total Number of Matches = 874i.e. the DocumentFinder found 874 matches. The actual matching pairs are as follows, 119 for (a, b) pairs; 119 for (a, c) pairs; 119 for (b, c) pairs; and 357 total. The results are:Total Images N = 357
Possible Matches N*(N-1)/2 = 63,546
Attrasoft Matches = 874Actual Duplicates = 357
Attrasoft Found Duplicates = 357Percent Eliminated = 98.6 %
= 1 - 874/63,546
Percent of Duplicates Found = 100 %
= 357/357The first of the two parameters has three settings:
The 96% Setting will find 96% of the duplicates and the 98% Setting will find 98% of the duplicates. The 98% Setting will let more images pass through.
- 96% Setting
- 97% Setting
- 98% Setting (Default)
There are 3 menu items:
Templates/96% Setting
Templates/97% Setting
Templates/98% Setting
Click one of the above to make a selection.The second of the two parameters is to choose one of the four training commands:
Query/Training
Query/Training (High Accuracy)
Query/Training (Low Accuracy)
Query/Training (Lower Accuracy)If this is a test run, i.e. you know the correct answers. You can see the matching results in seconds. You must prepare a file, which indicates the matching pairs. To test the results in b1.txt, you must prepare B1_matchlist.txt file; to test the results in b2.txt, you must prepare B2_matchlist.txt file; �.
![]()
Figure 4.11 The Results Menu.
To check the results, see b1.txt, you must have a matching file, called b1_matchlist.txt. An example of b1_matchlist.txt is:
5Line 1 is the number of matches in this file. Each line has the following format:
1 IMAGE00053770 IMAGE01312024
2 IMAGE00053771 IMAGE01312025
3 IMAGE00053772 IMAGE01312026
4 IMAGE00053773 IMAGE01312027
5 IMAGE00053774 IMAGE01312028Number, tab, filename, tab, filename, tab.
Note:
(1) You must have a tab at the end of each line;
(2) The file names do not contain �.jpg�.Once you get the two files prepared, click �Results\Compare b1 and b1_matchlist� to check format. There are four other commands for b2.txt, �, b5.txt.
There are two common errors:
(1) The last Tab is missing;
(2) The number of rows is less than the first number in the file.In this section, we will use the �Abm54� example, which has 160 triplets, or 480 images. ABM54 is the User�s Guide for Attrasoft ImageFinder 5.4, which is the parent software of the DocumentFinder.
Generating a4.txt:
Template Matching:
- Click �Templates/Select File� and select �Input_abm54.txt�;
- Click �Templates/Scan File (File Input)�;
- Click �Templates/Generate a4.txt�;
- Click �a4.txt� to see the records generated.
Correct Match
- Click �Query/Train�;
- Click �Query/Generate b4.txt�.
Now open �b4_matchlist.txt�, you will see the correct matches are:Results480
4108 abm54_a_0001 abm54_b_0001
4109 abm54_a_0002 abm54_b_0002
4110 abm54_a_0003 abm54_b_0003
�Click �Results/Compare b4 and b4_matchlist� and you will get:AnalysisN= 480These results show that there are 480 matches in the b4_matchlist file and b4.txt has 478 of the 480 matches.Check...
Clear...
Total Memory = 12,164,966
Total Matches = 478
Total Memory = 12,188,142Scroll all the way to the end of the b4.txt file, the last line is:4.8.3 Example �All�Total Number of Matches = 900
i.e. the DocumentFinder found 900 matches. The actual number of matching pairs are computed as follows, 160 for (a, b) pairs; 160 for (a, c) pairs; 160 for (b, c) pairs; and 480 total. The results are:
Total Images N = 480
Possible Matches N*(N-1)/2 = 114,960
Attrasoft Matches = 900Actual Duplicates = 480
Attrasoft Found Duplicates = 478Percent Eliminated = 99.2 %
= 1 - 900/114,960
Percent of Duplicates Found = 99.6 %
= 478/480In this section, we will use the �All� example, which combines the three previous examples and has 1,023 images.
If you have not changed the default data, the �a5.txt� file is the template file, which has 1,022 records.
Template Matching:
- Click �Query/Train�;
- Click �Query/Generate b5.txt�.
Open b5.txt, the last line is:Total Number of Matches = 1,913.
Correct MatchThere are 930 correct matches in �b5_matchlist.txt, which is the summation of all correct matches in the previous examples, 93 + 357 + 480 = 930.ResultsClick �Results/Compare b5 and b5_matchlist� and you will get:AnalysisN= 930This results show that there are 930 matches in the b5_matchlist file and b4.txt has 928 of the 930 matches.Check...
Clear...
Total Memory = 11,318,022
Total Matches = 928
Total Memory = 11,347,118Scroll all the way to the end of the b5.txt file, the last line is:4.9 Query Set Vs Target SetTotal Number of Matches = 1913.i.e. the DocumentFinder found 1,913 matches.Total Images N = 1,023
Possible Matches N*(N-1)/2 = 522,753
Attrasoft Matches = 1,913Actual Duplicates = 930
Attrasoft Found Duplicates = 928Percent Eliminated = 99.6 %
= 1 � 1,913/522,753
Percent of Duplicates Found = 99.78%
= 928/9304.9.1 Query Set against Target Set
The number of comparisons for a set with N images are N (N �1)/2. If N is large, this number is large. A natural division is, say,
N = A + B + C
And N (N-1)/2 is converted into AA, AB, AC, BB, BC, CC. Now AA, BB, and CC are self-matching, but AB, AC, and BC are not. The data set is divided into:
![]()
Figure 4.12 Query Set against Target Set.
Target Set
The Target Set is a dataset being searched against in a given test or subtest: an experiment searches a Query Set against a Target Set. The Target Set is also known as a database.Query SetThe Query Set is a dataset containing the searches for a given test or subtest: an experiment searches a Query Set against a Target Set. The Query Set is also known as a Search set.There are 5 commands for this purpose, see Figure 4.12Note that �Target\ a1 + a1 ==> b1� is different from the earlier command:
4.9.2 Examples
- �Query/Generate b1.txt� is N (N-1)/2 comparison;
- �Target\ a1 + a1 ==> b1� is N*N comparison.
If you have not changed the default data, the �a5.txt� file is the template file, which has 1,023 records, and �b1_matchlist.txt� is the match-list file. Save a5.txt to a1.txt.
In the N*(N-1)/2 matching, there are 930 matching pairs. In the N*N matching, there are 1,023 self matching and 930 mirror matching pairs. The total number of matches should be 930 + 1,023 + 930 = 2,883.
Click �Target\ a1 + a1 ==> b1� to make an N*N comparison and you will get 4,828. The total number of matches is 1,023 x 1,023 = 1,046,529, which will take about 5 seconds. Click �Results/Compare b1 and b1_matchlist� and the DocumentFinder will indicate 2,863 out of 2,883 duplicates have been identified.
Total Images N = 1,023
Possible Matches N*N = 1,046,529
Attrasoft Matches = 4,828Actual Duplicates = 2,883
Attrasoft Found Duplicates = 2,863Percent Eliminated = 99.5 %
= 1 � 4,828/1,046,529
Percent of Duplicates Found = 99.3 %
= 2,863/2,883
Return