Designing a new way to optimize complex coordinated systems | MIT News

Coordinating complicated interactive systems, whether it’s the different modes of transportation in a city or the various components that must work together to make an effective and efficient robot, is an increasingly important subject for software designers to tackle. Now, researchers at MIT have developed an entirely new way of approaching these complex problems, using simple diagrams as a tool to reveal better approaches to software optimization in deep-learning models.

They say the new method makes addressing these complex tasks so simple that it can be reduced to a drawing that would fit on the back of a napkin.

The new approach is described in the journal Transactions on Machine Learning Research, in a paper by incoming doctoral student Vincent Abbott and Professor Gioele Zardini of MIT’s Laboratory for Information and Decision Systems (LIDS).

“We designed a new language to talk about these new systems,” Zardini says. This new diagram-based “language” is heavily based on something called category theory, he explains.

It all has to do with designing the underlying architecture of computer algorithms: the programs that will actually end up sensing and controlling the various parts of the system being optimized. “The components are different pieces of an algorithm, and they have to talk to each other, exchange information, but also account for energy usage, memory consumption, and so on.” Such optimizations are notoriously difficult because each change in one part of the system can in turn cause changes in other parts, which can further affect other parts, and so on.

The researchers decided to focus on the particular class of deep-learning algorithms, which are currently a hot topic of research. Deep learning is the basis of the large artificial intelligence models, including large language models such as ChatGPT and image-generation models such as Midjourney. These models manipulate data through a “deep” series of matrix multiplications interspersed with other operations. The numbers inside the matrices are parameters, and they are updated during long training runs, allowing complex patterns to be found. Models consist of billions of parameters, making computation expensive, and hence making improved resource usage and optimization invaluable.
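To make that description concrete, here is a minimal sketch, written for this article rather than drawn from the paper, of a tiny “deep” model as nothing more than a stack of matrix multiplications interspersed with a simple elementwise operation; the layer sizes and function names are purely illustrative.

```python
import numpy as np

# Illustrative sketch only: a "deep" model as a stack of matrix
# multiplications, each followed by a simple elementwise operation.
rng = np.random.default_rng(0)

def relu(x):
    # A common "other operation" placed between matrix multiplications.
    return np.maximum(x, 0.0)

# The entries of these matrices are the parameters; real models have
# billions of them, refined over long training runs.
layer_widths = [8, 16, 16, 4]  # toy sizes, chosen for illustration
weights = [rng.normal(size=(m, n))
           for m, n in zip(layer_widths, layer_widths[1:])]

def forward(x, weights):
    """Push an input through the stack: multiply, transform, repeat."""
    for w in weights[:-1]:
        x = relu(x @ w)        # matrix multiplication plus nonlinearity
    return x @ weights[-1]     # final matrix multiplication

x = rng.normal(size=(2, layer_widths[0]))  # a batch of two inputs
print(forward(x, weights).shape)           # -> (2, 4)
```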

Diagrams can represent the details of the parallelized operations that deep-learning models consist of, revealing the relationships between algorithms and the parallelized graphics processing unit (GPU) hardware they run on, supplied by companies such as NVIDIA. “I’m very excited about this,” says Zardini, because “we seem to have found a language that very nicely describes deep-learning algorithms, explicitly representing all the important things, which is the operators you use,” for example the energy consumption, the memory allocation, and any other parameter that you’re trying to optimize for.

Much of the progress within deep learning has stemmed from resource-efficiency optimizations. The latest DeepSeek model showed that a small team can compete with top models from OpenAI and other major labs by focusing on resource efficiency and the relationship between software and hardware. Typically, in deriving these optimizations, he says, “people need a lot of trial and error to discover new architectures.” For example, a widely used optimization program called FlashAttention took more than four years to develop, he says. But with the new framework they developed, “we can really approach this problem in a more formal way.” And all of this is represented visually in a precisely defined graphical language.

But the methods that have been used to find these improvements “are very limited,” he says. “I think this shows that there’s a major gap, in that we don’t have a formal, systematic method of relating an algorithm to either its optimal execution, or even really understanding how many resources it will take to run.” But now, with the new diagram-based method they devised, such a system exists.

Category theory, which underlies this approach, is a way of mathematically describing the different components of a system and how they interact in a generalized, abstract manner. Different perspectives can be related. For example, mathematical formulas can be related to the algorithms that implement them and use resources, or descriptions of systems can be related to robust “monoidal string diagrams.” These visualizations allow you to directly play around and experiment with how the different parts connect and interact. What they developed, he says, amounts to “string diagrams on steroids,” with many more graphical conventions and many more properties.

“Category theory can be thought of as the mathematics of abstraction and composition,” Abbott says. “Any compositional system can be described using category theory, and the relationship between compositional systems can then also be studied.” Algebraic rules that are typically associated with functions can also be represented as diagrams, he says. “Then, a lot of the visual tricks we can do with diagrams, we can relate to algebraic tricks and functions. So, it creates this correspondence between these different systems.”
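As a rough illustration of that correspondence, written for this article and not taken from the paper, the sketch below shows how wiring boxes together in sequence in a diagram lines up with ordinary function composition, and how regrouping the boxes mirrors the algebraic rule that composition is associative; the helper compose and the toy functions are invented for the example.

```python
# Illustrative sketch only: sequential composition of "boxes" in a
# diagram corresponds to ordinary function composition in algebra.

def compose(*fs):
    """Wire functions end-to-end, like boxes joined by a wire."""
    def composed(x):
        for f in fs:
            x = f(x)
        return x
    return composed

double = lambda x: 2 * x
increment = lambda x: x + 1

pipeline = compose(double, increment)   # diagram: x -[double]-[increment]-> y
print(pipeline(3))                      # -> 7

# Regrouping boxes in the diagram changes nothing, which mirrors the
# algebraic fact that function composition is associative.
left = compose(compose(double, increment), double)
right = compose(double, compose(increment, double))
print(left(5) == right(5))              # -> True
```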

As a result, he says, “this solves a very important problem, which is that we have these deep-learning algorithms, but they’re not clearly understood as mathematical models.” By representing them as diagrams, it becomes possible to approach them formally and systematically, he says.

One thing this enables is a clear visual understanding of the way parallel real-world processes can be represented by parallel processing in multicore computer GPUs. “In this way,” Abbott says, “diagrams can both represent a function, and then reveal how to optimally execute it on a GPU.”

The “attention” algorithm is used by deep-learning algorithms that require general, contextual information, and is a key phase of the serialized blocks that constitute large language models such as ChatGPT. FlashAttention is an optimization that took years to develop, but resulted in a sixfold improvement in the speed of attention algorithms.
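For readers who want to see the operation in question, here is a minimal sketch of standard scaled dot-product attention, the textbook computation that FlashAttention reorganizes; it is written for this article, not taken from the paper. Roughly speaking, FlashAttention’s speedup comes from computing the same result in tiles so that the full matrix of attention scores never has to be written out to slow GPU memory.

```python
import numpy as np

# Illustrative sketch only: the textbook attention computation that
# FlashAttention restructures for better use of GPU memory.

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Weight each value by how well its key matches each query."""
    d = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)  # query-key similarities
    return softmax(scores, axis=-1) @ V           # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens with 8-dimensional queries
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)  # -> (4, 8)
```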

Applying their method to the well-established FlashAttention algorithm, Zardini says that “here we are able to derive it, literally, on a napkin.” He then adds, “OK, maybe it’s a large napkin.” But to drive home the point about how much their new approach can simplify dealing with these complex algorithms, they titled their formal research paper on the work “FlashAttention on a Napkin.”

This method, Abbott says, “allows for optimization to be really quickly derived, in contrast to prevailing methods.” While they initially applied this approach to the already existing FlashAttention algorithm, thereby verifying its effectiveness, “we hope to now use this language to automate the detection of improvements,” says Zardini, who in addition to being a principal investigator in LIDS is the Rudge and Nancy Allen Assistant Professor of Civil and Environmental Engineering and an affiliate faculty member of the Institute for Data, Systems, and Society.

The plan is that ultimately, he says, they’ll develop the software to the point that “the researcher uploads their code, and with the new algorithm you automatically detect what can be improved, what can be optimized, and you return an optimized version of the algorithm to the user.”

In addition to automating algorithm optimization, Zardini notes that a robust analysis of how deep-learning algorithms relate to hardware resource usage allows for systematic co-design of hardware and software. This line of work integrates with Zardini’s focus on categorical co-design, which uses the tools of category theory to simultaneously optimize various components of engineered systems.

Abbott says that “this whole field of optimized deep-learning models, I believe, is quite critically unaddressed, and that’s why these diagrams are so exciting. They open the doors to a systematic approach to this problem.”

“I am very impressed by the quality of this research. … The new approach to diagramming deep-learning algorithms used by this paper could be a very significant step,” says Jeremy Howard, founder and CEO of Answers.ai, who was not associated with this work. “This paper is the first time I’ve seen such a notation used to deeply analyze the performance of a deep-learning algorithm on real-world hardware. … The next step will be to see whether real-world performance gains can be achieved.”

“This is a beautifully executed piece of theoretical research, which also aims for high accessibility to uninitiated readers, a trait rarely seen in papers of this kind,” says Petar Velickovic, a senior research scientist at Google DeepMind and a lecturer at Cambridge University, who was not associated with this work. These researchers, he says, “are clearly excellent communicators, and I cannot wait to see what they come up with next!”

The new diagram-based language, having been posted online, has already attracted great attention and interest from software developers. A reviewer of Abbott’s prior paper introducing the diagrams noted that “the proposed neural circuit diagrams look great from an artistic standpoint (as far as I am able to judge this).” “It’s technical research, but it’s also flashy!” Zardini says.