What is a multimodal AI model?

Multimodal AI is a subset of artificial intelligence that integrates information from various modalities, such as text, images, audio, and video, to build more accurate and comprehensive AI models.

Fuyu-8B is an open-source, simplified multimodal architecture with a decoder-only transformer, supporting arbitrary image resolutions, and excelling in diverse applications, including question answering and complex visual understanding.

What is an example of a multimodal model?

For example, a multimodal model might combine the text, image, and audio modalities to recognize emotions in a video clip. By combining different modalities, multimodal models can achieve better performance than models that use only a single modality.

Is ChatGPT 4 multimodal?

GPT-4o is natively multimodal: it handles the audio input, natural language processing, and audio output itself, so ChatGPT is able to reply in an average of 0.32 seconds, and you can really feel the speed.

What is Fuyu good for?

The yellow-orange persimmon fruit is an excellent source of fiber, Vitamin A and C, Vitamin B6, potassium and the mineral manganese. Persimmons are fat-free and are a good source of healthy carbohydrates and natural sugar. The two most common varieties of persimmons are Fuyu and Hachiya.

冬 (ふゆ fuyu) means winter in general; 冬季 (とうき touki) means specifically the season of winter. Be sure to dress warmly this fuyu 冬 in your 冬着 (ふゆぎ fuyugi), aka winter clothing! Meaning: Winter.

How do I choose a Fuyu?

First, Fuyus should be a bright orange color (they are sometimes sold while under-ripe and pale). A few days on a kitchen counter should be sufficient to bring them to bright orange if they aren't.

What is the difference between generative AI and multimodal AI?

Some generative artificial intelligence (AI) systems use only one type of input, such as text, and produce only one type of output, such as text. Other AI systems accept multiple types of inputs, such as text and images, and can produce various forms of output. These are called multimodal AI systems.

What is a multimodal learning model?

Multimodal learning suggests that when a number of our senses (visual, auditory, kinaesthetic) are being engaged during learning, we understand and remember more. By combining these modes, learners experience learning in a variety of ways to create a diverse learning style.

What is multimodal conversational AI?

A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities. This paper motivates, defines, and mathematically formulates the multimodal conversational research objective.

Adept Fuyu-Heavy: A new multimodal model (2024)

We’re excited to introduce Adept Fuyu-Heavy, a new multimodal model designed specifically for digital agents. Fuyu-Heavy is the world’s third-most-capable multimodal model, behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger. We’re excited about this model because:

It excels at multimodal reasoning. To us the killer feature is UI understanding, but it also performs well on more traditional multimodal benchmarks. In particular, Fuyu-Heavy scores higher on the MMMU benchmark than even Gemini Pro.
On standard text-based benchmarks, it matches or exceeds the performance of models in the same compute class despite having to devote some of its capacity to image modeling.
It demonstrates that (with some modifications) we can scale up the Fuyu architecture and reap all of the associated benefits, including handling arbitrary size/shape images and efficiently re-using existing transformer optimizations.

Below we’ve included two sample model interactions. The first is a video of Fuyu-Heavy’s UI understanding capabilities. The second is a sample response on the MMMU benchmark. Read on for further details.

Prompt

According to the table below, which food is the most likely cause of the outbreak of food poisoning:

A. Cold chicken
B. Potato salad
C. Egg sandwiches
D. Fruit pie and cream

Adept

The table shows the number of people who ate different foods and the number of people who ate those foods and got sick. To determine which food is most likely the cause of the outbreak of food poisoning, we’re looking for the food with the highest percentage of people who ate that food and got sick.

Let’s calculate the percentage of people who ate each food and got sick:

Cold chicken: 34 / 86 = 0.395, which is 39.5%
Potato salad: 38 / 54 = 0.704, which is 70.4%
Egg sandwiches: 40 / 76 = 0.526, which is 52.6%
Fruit pie and cream: 12 / 32 = 0.375, which is 37.5%
Cheese: 12 / 48 = 0.25, which is 25%

The food with the highest percentage of people who ate that food and got sick is potato salad, with 70.4%. Therefore, potato salad is the most likely cause of the outbreak of food poisoning.

Answer: B
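The arithmetic in this sample response can be checked with a short script. The sketch below is only an illustration: the counts are transcribed from the response text above (the original table image is not available), and the names `cases` and `attack_rates` are made up for this example.

```python
# Counts transcribed from the sample response above: (got sick, ate the food).
# The underlying table image is missing, so these pairs are taken from the
# response text rather than from the original data.
cases = {
    "Cold chicken": (34, 86),
    "Potato salad": (38, 54),
    "Egg sandwiches": (40, 76),
    "Fruit pie and cream": (12, 32),
    "Cheese": (12, 48),
}


def attack_rates(counts):
    """Return, per food, the fraction of people who ate it and got sick."""
    return {food: sick / ate for food, (sick, ate) in counts.items()}


rates = attack_rates(cases)
most_likely = max(rates, key=rates.get)

# Print foods from highest to lowest attack rate.
for food, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{food}: {rate:.1%}")
print("Most likely cause:", most_likely)
```

Ranking by attack rate rather than by raw case count matters here: egg sandwiches have the most cases (40), but potato salad has the highest proportion of eaters who fell ill.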

Adept’s goal is to build Useful General Intelligence. In order to accomplish this goal, we first had to:

Connect our agents to the digital world (see ACT-1)
Build robust tooling for training, evaluation, inference, and data collection
Establish a foundational model design that would scale to meet our needs (see Fuyu-8B)

This fall, we were satisfied that we’d accomplished these goals and that it was time to scale up the Fuyu architecture. This wasn’t easy — in addition to the standard hiccups with model scaling, we had to deal with the extra problems associated with training a new architecture on both text and image data. In particular:

Image data stresses systems — keeping the model fed with data becomes harder, memory usage blows up, cloud storage ingress/egress becomes limiting; even handling image formats/coordinates consistently between training and inference is tricky
Image models are famously unstable (see this great paper) — we’ve tweaked the Fuyu architecture and training procedure substantially to deal with this
Finally, high-quality image pre-training data is scarce, so we’ve devoted a lot of effort to collecting, curating, and even creating this data. There’s also a delicate balance between text and image tasks, and we had to develop recipes for striking this balance at scale

Over the last 4 months, we’ve tackled all these problems and more. Fuyu-Heavy will shortly be powering our enterprise product. We’ve already applied lessons learned from Fuyu-Heavy to train its successor.

Although our ultimate goal is to build useful digital agents, and we internally benchmark our models with this standard in mind, it’s important for us to sanity-check our progress against commonly used benchmarks as well. According to these benchmarks, Fuyu-Heavy is the strongest multimodal model trained outside of Google or OpenAI.

Despite the trade-off between language modeling performance and multimodal performance, Fuyu-Heavy performs roughly on par with Gemini Pro on standard text-only evaluations, outperforming it on the commonly used MMLU benchmark. Inflection-2 has stronger performance on some of these text evals, but it’s a much larger model. For all of these evals, we used the standard number of shots in the few-shot prompting regime.

	MMLU	GSM8K	MATH	HumanEval
Adept Fuyu-Heavy	72.1	82.9	29.5	58.0
Gemini Pro	71.8	86.5 (Maj1@32)	32.6	67.7
Inflection-2	79.6	81.4	34.8	44.5
Grok-1	73	62.9	23.9	63.2

It’s also become common to benchmark the ability of models to have long-form conversations. To do this, we put Fuyu-Heavy through a supervised finetuning stage followed by a direct preference optimization (DPO) stage on publicly available chat data. We use the two most commonly used chat evaluations: MT-Bench and AlpacaEval 1.0. Fuyu-Heavy Chat performs comparably to Claude 2.0 despite being a smaller model and having to use some of its capacity to model images.

	MT-Bench	AlpacaEval 1.0
Adept Fuyu-Heavy	8.01	92.20%
GPT-4 Turbo	9.32	97.70%
Mistral Medium	8.61	96.83%
Claude 2.0	8.06	91.60%
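
For readers unfamiliar with the DPO stage mentioned above, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not Adept's training code: the post does not describe their implementation or hyperparameters, so the function name `dpo_loss`, the argument names, and the default `beta=0.1` are all assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta (a hypothetical default here) scales the implicit reward.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Favoring the chosen response more than the reference does lowers the loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the reference model and lowers that of the rejected one, which is what lets DPO learn from preference pairs without a separately trained reward model.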

Finally, we evaluate Fuyu-Heavy on what has become the standard for multimodal performance: MMMU. Fuyu-Heavy performs slightly better than Gemini Pro on this metric. For further context on what MMMU measures, we’ve included further examples of our model output for that dataset below. For completeness, we also include results on VQAv2 (which is quite flawed, see this post), on AI2D, a diagram understanding dataset, and on ChartQA, a dataset of question-answer pairs for plots and charts. Adept Fuyu-Heavy outperforms Gemini Pro on all of these metrics and outperforms Gemini Ultra on AI2D.

	MMMU	VQAv2	AI2D	ChartQA
Adept Fuyu-Heavy	48.3	76.2	81.2	75.4
Gemini Pro	47.9	71.2	73.9	74.1
Gemini Ultra	59.4	77.8	79.5	80.8

Prompt

Answer the following MMMU question based on the image.

Janet Ludlow’s firm requires all its analysts to use a two-stage dividend discount model (DDM) and the capital asset pricing model (CAPM) to value stocks. Using the CAPM and DDM, Ludlow has valued QuickBrush Company at $63 per share. She now must value SmileWhite Corporation. Calculate the required rate of return for SmileWhite by using the information in the following table:

A. 14%
B. 15%
C. 16%

Adept

The required rate of return for SmileWhite can be calculated using the capital asset pricing model (CAPM), which is:

Required Return = Risk-free Rate + Beta x (Expected Market Return - Risk-free Rate)

The risk-free rate and expected market return are provided in the table, and the beta for SmileWhite is given as 1.15.

Let’s calculate the required rate of return for SmileWhite:

Required Return = 4.50% + 1.15 x (14.50% - 4.50%)
Required Return = 4.50% + 1.15 x 10%
Required Return = 4.50% + 11.50%
Required Return = 16%

Therefore, the answer is:

C: 16%
Answer: C
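
The CAPM calculation in this response is easy to reproduce. The snippet below is a small sketch, with the inputs (risk-free rate 4.50%, beta 1.15, expected market return 14.50%) taken from the response text; the function name `capm_required_return` is made up for this example.

```python
def capm_required_return(risk_free, beta, market_return):
    """CAPM: risk-free rate plus beta times the market risk premium."""
    return risk_free + beta * (market_return - risk_free)


# Inputs as quoted in the sample response, expressed as fractions.
required = capm_required_return(risk_free=0.045, beta=1.15, market_return=0.145)
print(f"Required return: {required:.2%}")
```

Working in fractions rather than percentages keeps the units unambiguous; the result is formatted as a percentage only when printed.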

Fuyu-Heavy can also perform more complex calculations, such as the following:

Prompt

Table 11.47 provides a recent survey of the youngest online entrepreneurs whose net worth is estimated at one million dollars or more. Their ages range from 17 to 30. Each cell in the table illustrates the number of entrepreneurs who correspond to the specific age group and their net worth. We want to know whether the ages and net worth are independent. $\chi^2$ test statistic = ______

A. 1.56
B. 1.76
C. 1.96
D. 2.06

Adept

To answer this question, we need to perform a chi-square test of independence.

The null hypothesis for this test is that the ages and net worth are independent, while the alternative hypothesis is that they are not independent.

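The test statistic for a chi-square test of independence is:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

where O is the observed frequency, E is the expected frequency, and the sum is taken over all the cells in the contingency table.

The expected frequency for each cell is calculated as:

$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$

The degrees of freedom for a chi-square test of independence are:

$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$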

The critical value for a chi-square distribution with df degrees of freedom at a given level of significance (e.g., 0.05) can be found in a chi-square table or calculated using a statistical software package.

If the calculated chi-square value is greater than the critical value, we reject the null hypothesis and conclude that the ages and net worth are not independent. If the calculated chi-square value is less than or equal to the critical value, we fail to reject the null hypothesis and conclude that the ages and net worth are independent.

Let’s calculate the expected frequencies for each cell in the table:

Now, let’s calculate the chi-square test statistic:

The degrees of freedom for this test are:

The critical value for a chi-square distribution with 2 degrees of freedom at the 0.05 level of significance is approximately 5.99.

Since our calculated chi-square value (1.76) is less than the critical value (5.99), we fail to reject the null hypothesis and conclude that the ages and net worth are independent.

Therefore, the answer is:

B. 1.76

Answer: B
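
The procedure described in this response can be sketched in a few lines of Python. The original Table 11.47 is not reproduced here, so the 2x3 table below is hypothetical and will not give the 1.76 from the response; the function name `chi_square_independence` is also made up for this example. Only the critical value 5.99 (2 degrees of freedom, 0.05 significance level) comes from the response text.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `observed` is a list of rows, each a list of non-negative counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical counts (NOT the data from Table 11.47).
hypothetical_counts = [
    [10, 20, 30],
    [20, 20, 20],
]
statistic, df = chi_square_independence(hypothetical_counts)

# Critical value quoted in the response for df = 2 at the 0.05 level.
CRITICAL_VALUE = 5.99
reject_null = statistic > CRITICAL_VALUE
print(f"chi-square = {statistic:.2f}, df = {df}, reject independence: {reject_null}")
```

With real data, a library routine such as `scipy.stats.chi2_contingency` would also return a p-value, which avoids looking up critical values by hand.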

We’re building on this work in several exciting ways:

Conducting fundamental scaling research to improve our base model capabilities
Converting these base models into useful agents through reward modeling, self play, and various inference-time search techniques
Connecting these agents to the world to build useful, reliable products

If any of this is exciting to you, please reach out through our careers page.

We’d like to thank the Microsoft Azure team for providing and servicing the cluster that trained Fuyu-Heavy and for answering innumerable configuration questions. We’d also like to thank NVIDIA for helpful collaborations on model efficiency, for advice on various particulars of networking, etc., and, of course, for creating the GPUs :)
