To speed up and refine decision-making in a fast-paced, global marketplace, enterprises may deploy generative artificial intelligence models to help summarize and interpret the charts that often fill market summaries and financial reports.
But even the newest vision-language models sometimes struggle with this task, because it requires a model to integrate visual, numerical, and linguistic understanding. An organization that invests in a state-of-the-art model might still receive inaccurate or incomplete information.
To close this performance gap, researchers from MIT and the MIT-IBM Computing Research Lab developed a multifaceted resource for AI users that is specifically designed to teach vision-language models (VLMs) how to effectively interpret charts.
They used a novel data generation method to build a state-of-the-art dataset that features more than 1 million varied charts. The dataset also encodes many visual, linguistic, and numerical components of each chart image, which enables models to reason robustly about the information in a chart.
The researchers used this dataset, called ChartNet, to train a series of open-source VLMs. Many of these smaller models significantly outperformed commercial models that are orders of magnitude larger on tasks like data extraction and chart summarization.
By enabling open-source models to outperform their commercial counterparts, ChartNet could allow small firms with limited budgets to more readily use AI. The open-source dataset can be used to enhance the capabilities of AI models for tasks like business trend evaluation and scientific figure interpretation.
“We developed ChartNet to be a one-stop shop for chart understanding, covering basically anything that an AI model and a practitioner who’s training that model might need. We hope our work motivates researchers to achieve state-of-the-art performance with smaller models that don’t require infinite amounts of computation,” says Jovana Kondic, an MIT electrical engineering and computer science (EECS) graduate student and lead author of a paper on ChartNet.
She is joined on the paper by many co-authors from MIT, the MIT-IBM Computing Research Lab, and IBM Research, including Pengyuan Li, a research staff member at IBM Research; Dhiraj Joshi, a senior scientist at IBM Research; Isaac Sanchez, a software engineer at IBM Research; Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Computing Research Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Rogerio Feris, a principal scientist and manager at the MIT-IBM Computing Research Lab. The research will be presented at the IEEE Computer Vision and Pattern Recognition Conference.
A dataset bottleneck
Researchers have made great strides developing generative AI models that excel at natural language processing and reasoning about natural images. But less work has focused on interpreting complex multimodal data contained inside charts, Kondic says.
Yet for big and small businesses in nearly every industry, chart understanding is a critical task.
“The finance industry thrives on charts. If vision-language models can extract information out of charts, like descriptions of trends, that facilitates a variety of workflows that occur downstream,” Joshi says.
The shortage of high-quality training data is a serious bottleneck holding back the development of VLMs that can accurately interpret charts. Many datasets contain a limited number of chart images pulled from the web, and they often lack the necessary scale and the additional information that would help a model interpret the underlying data.
“A vision-language model, unlike our brains, might have to see thousands of examples during training to reliably recognize something as a line chart,” Kondic says.
The researchers sought to overcome those shortcomings by generating synthetic data. Synthetic data are artificially generated by algorithms to mimic the statistical properties of real data.
The ChartNet dataset holds more than 1 million high-quality chart images, together with the corresponding code used to generate each chart, a textual description, and a table that contains its numerical information. In addition, each datapoint includes question-and-answer pairs to teach the model how to correctly answer questions about the chart image.
“These additional modes of information guide the model to connect and align the various pieces of information that the chart image encodes,” Kondic says.
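As a rough illustration of how the pieces of one such datapoint fit together, here is a minimal Python sketch. The class and field names (`ChartDatapoint`, `image_path`, `qa_pairs`, and so on) and the sample values are assumptions made for this example; they are not ChartNet's actual schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class ChartDatapoint:
    """One record with hypothetical field names (not ChartNet's real schema)."""
    image_path: str    # rendered chart image
    code: str          # plotting code that generated the image
    description: str   # textual description of the chart
    table: list        # underlying numerical data, one dict per row
    qa_pairs: list = field(default_factory=list)  # QAPair items about the chart


def example_datapoint() -> ChartDatapoint:
    # Made-up values, for illustration only.
    return ChartDatapoint(
        image_path="charts/revenue_bar.png",
        code="plt.bar(['Q1', 'Q2'], [1.2, 1.8])",
        description="Bar chart of quarterly revenue; Q2 is higher than Q1.",
        table=[
            {"quarter": "Q1", "revenue": 1.2},
            {"quarter": "Q2", "revenue": 1.8},
        ],
        qa_pairs=[QAPair("Which quarter had the highest revenue?", "Q2")],
    )
```

Keeping the image, code, description, table, and questions in one record is what lets a training setup pair the same chart with several different targets.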
Data generation
To build ChartNet, the researchers created a two-step synthetic data generation pipeline.
First, their automated system translates any pre-existing set of chart images into code. Then the system iteratively augments that code to vary different aspects of each chart, such as chart type, data values, topic, and colors.
“We can start from a single chart that we use as a seed and come up with many augmentations of it. That is how we were able to build a dataset with more than 1 million diverse images,” Kondic explains.
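The seed-and-augment idea can be sketched in a few lines of Python. This is a toy version under stated assumptions: the article does not describe how the real system rewrites chart code, so the sketch swaps in a plain dictionary as the chart specification and varies it combinatorially. The function name `augment` and the spec keys (`type`, `color`, `labels`, `values`) are hypothetical.

```python
import itertools
import random


def augment(seed, chart_types, colors, n_value_variants=2, rng_seed=0):
    """Return variants of one seed chart spec (a toy stand-in for code augmentation)."""
    rng = random.Random(rng_seed)  # fixed seed keeps the output reproducible
    variants = []
    for chart_type, color in itertools.product(chart_types, colors):
        for _ in range(n_value_variants):
            # Perturb each data value by up to +/-20 percent.
            values = [round(v * rng.uniform(0.8, 1.2), 2) for v in seed["values"]]
            variants.append({
                "type": chart_type,
                "color": color,
                "labels": list(seed["labels"]),  # copy so the seed is never shared
                "values": values,
            })
    return variants


seed_chart = {
    "type": "bar",
    "color": "blue",
    "labels": ["Q1", "Q2"],
    "values": [10.0, 20.0],
}
variants = augment(seed_chart, ["bar", "line"], ["blue", "red", "green"])
```

With two chart types, three colors, and two value variants each, one seed yields 12 specs, which shows how a modest seed set can fan out into a much larger collection.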
They also incorporated an automated quality-check process to ensure the synthetic data are of high quality. This process verifies that the code is executable and that the rendered chart images are accurate and clean.
“We don’t want to simply be generating diverse samples. We also want the data to be presented in a meaningful way,” she says.
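The "code is executable" part of such a check might look like the Python sketch below. It is a minimal illustration, not the researchers' implementation: the function names are hypothetical, the code runs in-process with `exec` (a real pipeline would need a sandbox and a timeout for untrusted code), and it does not cover inspecting the rendered image for accuracy or cleanliness.

```python
def passes_execution_check(code: str) -> bool:
    """Return True if the snippet compiles and runs without raising an exception."""
    try:
        compiled = compile(code, "<chart-code>", "exec")
        # Fresh namespace so snippets cannot see or modify each other's variables.
        exec(compiled, {"__name__": "__chart_check__"})
    except Exception:
        # Covers syntax errors at compile time and any error raised while running.
        return False
    return True


def filter_executable(snippets):
    """Keep only the snippets that pass the execution check."""
    return [s for s in snippets if passes_execution_check(s)]


candidates = [
    "total = sum([1, 2, 3])",   # valid
    "total = sum([1, 2",        # syntax error
    "total = undefined_name",   # fails at run time
]
kept = filter_executable(candidates)
```

Only the first candidate survives the filter; the other two are rejected at compile time and run time, respectively.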
ChartNet also features a collection of chart datapoints annotated by human experts. This provides access to additional types of charts and supporting data that carry validity guarantees.
A practitioner could use the annotated data to fine-tune an existing VLM, further boosting performance for a particular application, Joshi adds.
The researchers tested ChartNet by training IBM’s Granite Vision series of models, as well as several other open-source models of various sizes, and evaluating them on a range of chart interpretation tasks. The dataset improved the accuracy of all models in chart reconstruction, chart data extraction, chart summarization, and chart question answering.
With ChartNet, small open-source models consistently outperformed much larger commercial models.
“A lot of prior training datasets only focused on answering simple questions about a chart. We tried to go beyond that with ChartNet by generating data that support all aspects of robust chart understanding,” Kondic says.
In the future, the researchers plan to continue expanding ChartNet by incorporating data with added levels of complexity. They also want to draw on feedback from the research community.
This research was funded, in part, by the MIT-IBM Computing Research Lab.

