Introduction:
Tracking files relies on a similarity algorithm. Our newly launched application provides a simple similarity analysis that compares two files and shows the result immediately. Keeping track of file relationships and history would require a file management application, which is one of our future goals.
This article introduces the Similarity Analysis Basic application, which is supported on Android, iOS, macOS and Windows.
Appreciation:
Thanks to Adobe, Apryse and Foxit for sponsoring the PDF specification (ISO 32000-2:2020).
While developing the similarity analysis, we needed to extract text from PDF files. We read the PDF specification (ISO 32000-2:2020) and built our text extraction on it. Without this document, we could not have extracted text from PDF files.
Background:
Apryse offers iText, which can extract text from PDF files. It is available for Java and .NET and is open source. Our application is written in C++, so we chose to develop our own extractor.
We completed the extractor at the end of December last year. After trying a few different types of PDF files, we found that PDF internals are complicated. We want to provide a stable, reliable application, so we postponed the launch. So far, we have successfully extracted text from PDFs generated by Microsoft Word (PDF 1.7) and from a few other samples.
Limitations:
Our PDF extractor cannot yet accurately extract text from Microsoft PowerPoint files (*.pptx) that have been converted to PDF. The major imperfections are in line termination and spacing; in some cases, line breaks are misaligned. We apologize for this imperfection. We are studying the problem and working on a fix.
We want to support more PDF versions, but unfortunately we lack the resources to do so right now. We will continue to improve our extractor to support more PDF files. If you have a file that is not sensitive or confidential, please contact us and help us evaluate it. Again, we apologize for this imperfection.
Guided illustration:
The following diagrams illustrate the user interface design.
There are only three buttons in this application. The two “Browser” buttons let you browse for (pick) a file on the local device (mobile or PC). The “Process similarity analysis” button compares the source file with the compare file. If either the source file or the compare file is empty, the “Process similarity analysis” button is disabled (grey). Once both files have been selected, the button turns light orange.
Figure 1. The main page of Similarity Analysis Basic.
When both the source file and the compare file have been selected, the “Process similarity analysis” button turns light orange. After pressing it, please wait patiently for the result. The internal timeout is 10 seconds.
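The button behaviour described above can be sketched in a few lines of C++. This is only an illustration: the names SelectionState, isProcessEnabled and processButtonColor are hypothetical and do not come from the application's actual source code.

```cpp
#include <string>

// Hypothetical names for illustration; the application's real UI code is not shown here.
enum class ButtonColor { Grey, LightOrange };

struct SelectionState {
    std::string sourceFile;   // path picked with the first "Browser" button
    std::string compareFile;  // path picked with the second "Browser" button
};

// The process button is enabled only when both files have been selected.
bool isProcessEnabled(const SelectionState& state) {
    return !state.sourceFile.empty() && !state.compareFile.empty();
}

// Grey while disabled, light orange once both inputs are present.
ButtonColor processButtonColor(const SelectionState& state) {
    return isProcessEnabled(state) ? ButtonColor::LightOrange : ButtonColor::Grey;
}
```

Deriving the colour from a single enabled check keeps the two states from drifting apart when a file is cleared or replaced.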
Figure 2. Ready to process the similarity analysis.
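One common way to enforce a time limit such as the 10-second timeout mentioned above is to run the work on a separate thread and wait on a future. The sketch below uses only the C++ standard library. The helper name runWithTimeout and the constant kAnalysisTimeout are hypothetical, and this is not necessarily how the application implements its timeout.

```cpp
#include <chrono>
#include <future>
#include <optional>
#include <thread>
#include <type_traits>
#include <utility>

// Assumed constant mirroring the 10-second limit stated in the article.
constexpr std::chrono::seconds kAnalysisTimeout{10};

// Hypothetical helper: runs `task` on a detached thread and waits at most
// `timeout` for its result. Returns std::nullopt if the deadline passes first.
// The task must return a value and must not reference locals of the caller,
// because it may keep running after this function has returned.
template <typename Fn>
auto runWithTimeout(Fn task, std::chrono::milliseconds timeout)
    -> std::optional<std::invoke_result_t<Fn>> {
    using Result = std::invoke_result_t<Fn>;
    std::packaged_task<Result()> packaged(std::move(task));
    std::future<Result> future = packaged.get_future();
    std::thread(std::move(packaged)).detach();

    if (future.wait_for(timeout) == std::future_status::ready) {
        return future.get();
    }
    return std::nullopt;  // timed out; the caller can show an error message
}
```

A call such as runWithTimeout(analysis, kAnalysisTimeout) would then either return the result or report that the limit was exceeded. A packaged task with a detached thread is used here instead of std::async because the future returned by std::async blocks in its destructor until the task finishes, which would defeat the timeout.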
The similarity analysis result is explained as follows:
- Similarity unique count uses a first-come, first-served rule: only the first match of a piece of content is counted.
- Similarity overlap count counts content that matches repeatedly. An overlap percentage above 100% means that some patterns (words or sentences) occur more than once.
- When the overlap percentage is above 100% and the similarity percentage is above 50%, the content is most likely highly similar, with some content repeated.
Figure 3. The similarity analysis result of source file and compare file.
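To make the difference between the unique count and the overlap count concrete, here is a minimal word-level sketch in C++. It is one possible reading of the explanation above, not the application's actual algorithm. It assumes that matching is done on whitespace-separated words and that percentages are taken relative to the number of distinct words in the source file. The names SimilarityCounts, countSimilarity and percent are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical result type for this sketch.
struct SimilarityCounts {
    std::size_t sourceDistinct = 0;  // distinct words in the source text
    std::size_t unique = 0;          // matched words, first occurrence only
    std::size_t overlap = 0;         // matched words, every occurrence
};

// Assumption: content is compared word by word, split on whitespace.
SimilarityCounts countSimilarity(const std::string& source, const std::string& compare) {
    std::unordered_set<std::string> sourceWords;
    std::istringstream sourceStream(source);
    for (std::string word; sourceStream >> word;) {
        sourceWords.insert(word);
    }

    SimilarityCounts counts;
    counts.sourceDistinct = sourceWords.size();

    std::unordered_set<std::string> alreadyMatched;
    std::istringstream compareStream(compare);
    for (std::string word; compareStream >> word;) {
        if (sourceWords.count(word) == 0) {
            continue;  // word does not appear in the source at all
        }
        ++counts.overlap;  // every repeated match adds to the overlap count
        if (alreadyMatched.insert(word).second) {
            ++counts.unique;  // first come, first served: only the first match counts
        }
    }
    return counts;
}

// Assumption: percentages are relative to the distinct words in the source.
double percent(std::size_t count, std::size_t total) {
    return total == 0 ? 0.0 : 100.0 * static_cast<double>(count) / static_cast<double>(total);
}
```

With the source text "the cat sat" and the compare text "the cat the cat the", this sketch gives a unique count of 2 and an overlap count of 5 against 3 distinct source words, so the overlap percentage is above 100% because "the" and "cat" are matched more than once.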
Summary:
The basic version provides a quick similarity analysis of two input files. The application supports PDF and text files, and the two inputs do not have to be the same file type.
Support and contact:
Please send an email to support@thinkwider.co for any inquiries or support. Thank you.
Demonstration:
YouTube:
Download: