Data & Tools

Please note: CROHME is now a happy memory. All previous data and tools should now be available through the TC11 datasets repository. Here we provide all datasets and tools in one place, organized to simplify the work of participants. Feel free to contact us if anything needs further explanation.


Datasets

The CROHME 2023 dataset merges data from the previous CROHME competitions and from the OffRaSHME competition, and adds new image samples and, soon, new bimodal samples.

For each online equation, we also provide the corresponding image, either rendered from the on-line ink or scanned from the original physical page. Equations from OffRaSHME are available off-line only.

Please note that the 2019 test set will be used as a reference to compare the systems with the last competition results and should not be used in training or validation of the current systems.

The full package can be downloaded on
DOI

Task 1 : on-line recognition

  • Train set: 10979 inkml files = CROHME2019_train(9993) + CROHME2019_val(986)
    • +150k artificial (generated) inkml files
    • + 1045 new real equations
  • Validation set: 1147 inkml files = CROHME2016_test(1147)
    • + 555 new real equations
  • Reference Test set: 1199 inkml files = CROHME2019_test
  • New Test set 2 is kept hidden

Task 2 : off-line recognition

  • Train set : 20979 images = 9375 rendered inkml + 1604 scanned images + 10,000 OffRaSHME images
    • + 1045 new real equations (scanned)
  • Validation set : 1147 rendered inkml
    • + 555 new real equations
  • Reference Test set: 1199 rendered inkml
  • New Test set 2 is kept hidden with new real scanned samples

Task 3 : bimodal recognition

  • Train set : 10979 images = 9375 inkml & rendered inkml + 1604 inkml & scanned images
    • + 1045 new real equations (inkml & scanned)
  • Validation set : 1147 inkml & rendered inkml
    • + 555 new real equations
  • Reference Test set: 1199 rendered inkml + soon new real bimodal samples
  • New Test set 2 is kept hidden with new real bimodal samples
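The training set sizes listed for the three tasks can be cross-checked with a few lines of Python. This is only a sanity check on the figures given above; the dictionary names are chosen for illustration and do not correspond to folder names in the package.

```python
# Set sizes copied from the task lists above; the variable names are illustrative.
online_train = {"CROHME2019_train": 9993, "CROHME2019_val": 986}
offline_train = {"rendered inkml": 9375, "scanned images": 1604, "OffRaSHME images": 10000}
bimodal_train = {"inkml & rendered inkml": 9375, "inkml & scanned images": 1604}


def total(parts):
    """Sum the sizes of the subsets that make up one training set."""
    return sum(parts.values())


print(total(online_train))   # 10979
print(total(offline_train))  # 20979
print(total(bimodal_train))  # 10979
```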

Data Augmentation and private datasets

We will soon provide a large extension of the dataset thanks to formula generation.

Participants are allowed to use classical deep-learning data augmentation (scaling, shifting, rotation, …) but not to create new original samples. Private datasets are therefore not allowed, so that the submitted systems remain comparable.
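As an illustration of the kind of augmentation that is allowed, here is a minimal sketch that applies scaling, rotation and shifting to on-line stroke coordinates. The function name `augment_strokes` and its parameters are assumptions made for this example; it is not part of the CROHME tools. The ground truth of the sample is left untouched, so no new original sample is created.

```python
import math


def augment_strokes(strokes, scale=1.0, shift=(0.0, 0.0), angle_deg=0.0):
    """Scale, rotate (about the origin) and then shift every point of every stroke.

    `strokes` is a list of strokes, each stroke being a list of (x, y) tuples.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    dx, dy = shift
    augmented = []
    for stroke in strokes:
        new_stroke = []
        for x, y in stroke:
            xs, ys = x * scale, y * scale        # scaling
            xr = xs * cos_t - ys * sin_t         # rotation
            yr = xs * sin_t + ys * cos_t
            new_stroke.append((xr + dx, yr + dy))  # shifting
        augmented.append(new_stroke)
    return augmented


# Hypothetical two-stroke sample, for illustration only.
strokes = [[(0.0, 0.0), (1.0, 0.0)], [(1.0, 1.0)]]
print(augment_strokes(strokes, scale=1.1, shift=(5.0, -3.0), angle_deg=4.0))
```

In practice the parameters would be drawn at random within small ranges for each training sample.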

Handwritten Formulas

Handwritten formula data will be given in two formats (online and offline). Online data will be in the same InkML and Label Graph (.lg) format used during the last CROHME.

Inkml data. Stroke data is defined by lists of (x,y) coordinates, representing sampled points as each stroke is written. Groupings of strokes into symbols (segments), along with the true class of each symbol, are provided in two formats: InkML and Label Graphs (.lg). Both provide a representation of expression symbols and structure. The InkML format represents formula structure in Presentation MathML (an XML tag-based representation), whereas label graph files use a simpler CSV-based representation. Please note that both MathML and LG files can be used as the ground truth, but the LaTeX representations are not the official ground truth as they are not normalized.

Example of inkml file as a sequence of (x,y) coordinates:
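A minimal sketch of how such stroke data can be read with the Python standard library is given here. It assumes that each stroke is stored in a `trace` element of the InkML namespace, with points separated by commas and coordinates separated by spaces; the sample content and the function name `read_strokes` are hypothetical and written for this example only.

```python
import xml.etree.ElementTree as ET

INKML_NS = {"ink": "http://www.w3.org/2003/InkML"}

# Hypothetical two-stroke file content, written for illustration only.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace id="0">10 20, 11 22, 13 25</trace>
  <trace id="1">30 20, 30 40</trace>
</ink>"""


def read_strokes(inkml_text):
    """Return a dict mapping each trace id to its list of (x, y) points."""
    root = ET.fromstring(inkml_text)
    strokes = {}
    for trace in root.findall("ink:trace", INKML_NS):
        points = []
        for point in (trace.text or "").split(","):
            values = point.split()
            if not values:
                continue  # skip empty chunks, e.g. after a trailing comma
            # Keep x and y only; a file may carry extra channels such as time.
            points.append((float(values[0]), float(values[1])))
        strokes[trace.get("id")] = points
    return strokes


for trace_id, points in read_strokes(SAMPLE).items():
    print(trace_id, points)
```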

Image data. Two types of images are provided. Some images come from a rendering of the inkml file (the rendering tool is available in the crohmeLib tool). When the ink has been collected using an e-pen solution (as with Annoto or Wacom technologies), we have scanned and segmented the physical pages.

  • a real scanned image

  • the rendered version (in a 1000×1000 image)
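To draw ink into a fixed square image, the stroke coordinates first have to be mapped into the canvas. The sketch below shows one possible way to do this while preserving the aspect ratio. It is not the crohmeLib rendering tool; the function name `fit_to_canvas` and the `margin` value are assumptions made for this example.

```python
def fit_to_canvas(strokes, size=1000, margin=10):
    """Map stroke coordinates into a size x size canvas, keeping the aspect ratio.

    `strokes` is a list of strokes, each stroke being a list of (x, y) tuples.
    """
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y
    usable = size - 2 * margin
    largest = max(width, height)
    # A single scale for both axes preserves the shape of the expression.
    scale = usable / largest if largest > 0 else 1.0
    # Center the scaled expression inside the canvas.
    off_x = margin + (usable - width * scale) / 2
    off_y = margin + (usable - height * scale) / 2
    return [
        [((x - min_x) * scale + off_x, (y - min_y) * scale + off_y) for x, y in stroke]
        for stroke in strokes
    ]


# Hypothetical stroke, for illustration only.
print(fit_to_canvas([[(0.0, 0.0), (200.0, 100.0)]]))
```

The mapped points can then be drawn as polylines with any imaging library.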

Bimodal data. Each inkml file is also available as an image (rendered or real); this pairing is what we call bimodal data.

Expected outputs

For a given test expression, participating systems must recognize symbol locations and identities, and then produce a file representing the expression structure. Participants may use the CROHME InkML/MathML format as given in the training data, or a simpler label graph (CSV) format representing segmentation, layout and classification for strokes/bounding boxes in the input (this is an adjacency matrix defined by labels over strokes/bounding boxes). Tools to convert one format to the other are publicly available.
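To make the idea of a label graph as an adjacency matrix over strokes more concrete, here is a minimal sketch. It assumes a simplified object-based layout, with `O` lines listing a symbol, its label and its strokes, and `R` lines listing a labeled relation between two symbols. The sample content is hypothetical, and the exact conventions should be checked against the lgEval documentation. Marking same-symbol stroke pairs with the symbol label is a choice made for this sketch.

```python
import csv
import io

# Hypothetical label graph for "2 + 2" written with four strokes (illustration only).
SAMPLE_LG = """# Objects: O, object id, label, weight, stroke ids
O, 2_1, 2, 1.0, 0
O, +_1, +, 1.0, 1, 2
O, 2_2, 2, 1.0, 3
# Relations: R, parent object, child object, label, weight
R, 2_1, +_1, Right, 1.0
R, +_1, 2_2, Right, 1.0
"""


def lg_to_adjacency(lg_text):
    """Build a stroke-level adjacency matrix as a dict of dicts.

    matrix[a][a] holds the symbol label of stroke a;
    matrix[a][b] holds the label of the edge from stroke a to stroke b.
    """
    objects = {}    # object id -> (label, list of stroke ids)
    relations = []  # (parent object id, child object id, label)
    for row in csv.reader(io.StringIO(lg_text), skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        fields = [field.strip() for field in row]
        if fields[0] == "O":
            objects[fields[1]] = (fields[2], fields[4:])
        elif fields[0] == "R":
            relations.append((fields[1], fields[2], fields[3]))

    matrix = {}
    for label, strokes in objects.values():
        for a in strokes:
            matrix.setdefault(a, {})[a] = label   # classification
            for b in strokes:
                if a != b:
                    matrix[a][b] = label          # segmentation (same symbol)
    for parent, child, label in relations:
        for a in objects[parent][1]:
            for b in objects[child][1]:
                matrix[a][b] = label              # layout relation
    return matrix


for stroke, edges in lg_to_adjacency(SAMPLE_LG).items():
    print(stroke, edges)
```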

The official evaluation format is SymLG. This symbol-level label graph is not tied to the strokes of on-line data or to the bounding boxes of off-line data, so exactly the same evaluation metric can be used in all tasks.

Evaluation Tools

The CROHME organizers provide tools for math expression selection, running the test phase, evaluation and visualization:

  • CROHMELib
    • including InkML viewer and rendering tool
  • lgEval
    • including the evaluation tool and LG converter

Both tools are available here on GitLab (forked from the DPRL GitLab sources).