Calcium - Room327

Friday, December 28 2018
14:00 - 15:00

Room 327

Retrieval of Mathematical Expressions

P Pavan Kumar

IIITDM, Kurnool

Retrieval of multi-dimensional structures like mathematical expressions, chemical structures, DNA structures etc., is a challenging problem due to their complexity in representation. For example, mathematical expressions are 2D in nature while DNA structures may have 3D representation. That means, unlike text, which is linear in a left-to-right manner as well as uni-directional, these structures are non-linear in space. As such, retrieval of these structures is a difficult task when compared to text retrieval and very few works have been reported in the literature on this problem.

Current proposal focuses on the problem of retrieval of these multi-dimensional complex structures. Majority of approaches for Text retrieval are based on bag of words model. These words are represented in a vector form by mapping words in the vocabulary to numeric quantities. Matching is performed using dot product of query and database vectors to retrieve top k matches. For multi-dimensional structures like mathematical expressions, DNA structures etc., a pre-processing step is required to convert them into a linear form without losing any information like spatial relationships, orientation etc. For mathematical expressions, we have converted standard LaTEX expression into a canonical form which contains atomic and generic symbols, using a mapping table that maps LaTeX symbols/keywords to unique integer labels. That means, multi-letter keywords like Sigma, Integral are mapped to unique integer labels. All variables (as well as digits) are mapped to a common unique label. Special delimiter integer labels are defined and used to capture spatial relationships like superscript, subscript etc., and also for special structures like matrices. This pre-processing step is similar to compiler’s style of Lexical processing. Hence canonical forms convert expressions into a generic templates or structures. These generic templates can be used in applications that detect plagiarism of structures. After pre-processing, canonical expressions are matched using standard dynamic programming based string matching methods with a time complexity of O(n^2), where n is number of symbols in the canonical expressions.

However, the above algorithms can be improved or extended as follows: As the size of expression database increases, time complexity of O(n^2) may still not be acceptable. To reduce time complexity, database can be indexed in the form of a tree, based on presence or absence of the elements in the structures. Further, database expressions can be clustered and the nearest cluster to a query can be found using Mahalanobis distance. Database can be partitioned and the parallel or multi-threaded version of the algorithm can be proposed by exploiting multi-core or multi-processor architectures.

Download as iCalendar