Adept Fuyu-Heavy: A new multimodal model (2024)

We’re excited to introduce Adept Fuyu-Heavy, a new multimodal model designed specifically for digital agents. Fuyu-Heavy is the world’s third-most-capable multimodal model, behind only GPT-4V and Gemini Ultra, which are 10-20 times bigger. We’re excited about this model because:

  • It excels at multimodal reasoning. To us the killer feature is UI understanding, but it also performs well on more traditional multimodal benchmarks. In particular, Fuyu-Heavy scores higher on the MMMU benchmark than even Gemini Pro.
  • On standard text-based benchmarks, it matches or exceeds the performance of models in the same compute class despite having to devote some of its capacity to image modeling.
  • It demonstrates that (with some modifications) we can scale up the Fuyu architecture and reap all of the associated benefits, including handling arbitrary size/shape images and efficiently re-using existing transformer optimizations.

Below we’ve included two sample model interactions. The first is a video of Fuyu-Heavy’s UI understanding capabilities. The second is a sample response on the MMMU benchmark. Read on for further details.

Prompt

According to the table below, which food is the most likely cause of the outbreak of food poisoning:

[Image: table of the number of people who ate each food and the number of those who got sick]

A. Cold chicken
B. Potato salad
C. Egg sandwiches
D. Fruit pie and cream

Adept

The table shows the number of people who ate different foods and the number of people who ate those foods and got sick. To determine which food is most likely the cause of the outbreak of food poisoning, we’re looking for the food with the highest percentage of people who ate that food and got sick.

Let’s calculate the percentage of people who ate each food and got sick:

  • Cold chicken: 34 / 86 = 0.395, which is 39.5%
  • Potato salad: 38 / 54 = 0.704, which is 70.4%
  • Egg sandwiches: 40 / 76 = 0.526, which is 52.6%
  • Fruit pie and cream: 12 / 32 = 0.375, which is 37.5%
  • Cheese: 12 / 48 = 0.25, which is 25%

The food with the highest percentage of people who ate that food and got sick is potato salad, with 70.4%. Therefore, potato salad is the most likely cause of the outbreak of food poisoning.

Answer: B
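
The percentage comparison in the response above can be reproduced with a few lines of Python. This is a minimal sketch: the counts are the ones quoted in the model's response (the source table image is not available here), and the names `cases` and `attack_rates` are illustrative.

```python
# (people who got sick, people who ate the food), as quoted in the response above.
cases = {
    "Cold chicken": (34, 86),
    "Potato salad": (38, 54),
    "Egg sandwiches": (40, 76),
    "Fruit pie and cream": (12, 32),
    "Cheese": (12, 48),
}


def attack_rates(counts):
    """Return the fraction of eaters who got sick, per food."""
    return {food: sick / ate for food, (sick, ate) in counts.items()}


rates = attack_rates(cases)
# The food with the highest attack rate is the most likely source.
most_likely = max(rates, key=rates.get)

for food, rate in rates.items():
    print(f"{food}: {rate:.1%}")
print("Most likely cause:", most_likely)
```

Comparing rates rather than raw counts matters here: egg sandwiches have the most sick people in absolute terms (40), but a smaller share of those who ate them fell ill.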

Adept’s goal is to build Useful General Intelligence. In order to accomplish this goal, we first had to:

  • Connect our agents to the digital world (see ACT-1)
  • Build robust tooling for training, evaluation, inference, and data collection
  • Establish a foundational model design that would scale to meet our needs (see Fuyu-8B)

This fall, we were satisfied that we’d accomplished these goals and that it was time to scale up the Fuyu architecture. This wasn’t easy — in addition to the standard hiccups with model scaling, we had to deal with the extra problems associated with training a new architecture on both text and image data. In particular:

Over the last 4 months, we’ve tackled all these problems and more. Fuyu-Heavy will shortly be powering our enterprise product. We’ve already applied lessons learned from Fuyu-Heavy to train its successor.

Although our ultimate goal is to build useful digital agents, and we internally benchmark our models with this standard in mind, it’s important for us to sanity-check our progress against commonly used benchmarks as well. According to these benchmarks, Fuyu-Heavy is the strongest multimodal model trained outside of Google or OpenAI.

Despite the trade-off between language modeling performance and multimodal performance, Fuyu-Heavy performs roughly on par with Gemini Pro on standard text-only evaluations, outperforming it on the commonly used MMLU benchmark. Inflection-2 has stronger performance on some of these text evals, but it’s a much larger model. For all of these evals, we used the standard number of shots in the few-shot prompting regime.

Model               MMLU    GSM8K             MATH    HumanEval
Adept Fuyu-Heavy    72.1    82.9              29.5    58.0
Gemini Pro          71.8    86.5 (Maj1@32)    32.6    67.7
Inflection-2        79.6    81.4              34.8    44.5
Grok-1              73      62.9              23.9    63.2

It’s also become common to benchmark the ability of models to have long-form conversations. To do this, we put Fuyu-Heavy through a supervised finetuning stage followed by a direct preference optimization (DPO) stage on publicly available chat data. We use the two most commonly used chat evaluations: MT-Bench and AlpacaEval 1.0. Fuyu-Heavy Chat performs comparably to Claude 2.0 despite being a smaller model and having to use some of its capacity to model images.

Model               MT-Bench    AlpacaEval 1.0
Adept Fuyu-Heavy    8.01        92.20%
GPT-4 Turbo         9.32        97.70%
Mistral Medium      8.61        96.83%
Claude 2.0          8.06        91.60%
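
For readers unfamiliar with the DPO stage mentioned above, here is a minimal sketch of the standard per-pair DPO objective as it is usually written. It is not Adept's training code: the function name, argument names, and the `beta` default of 0.1 are illustrative assumptions, and a real pipeline would compute the log-probabilities with the policy and a frozen reference model over batches of preference pairs.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    `beta` (assumed value) controls how far the policy may drift from
    the reference.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# With no preference margin the loss is log(2); it falls as the policy
# favors the chosen response more strongly than the reference does.
print(dpo_pair_loss(-10.0, -10.0, -10.0, -10.0))
print(dpo_pair_loss(-5.0, -15.0, -10.0, -10.0))
```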

Finally, we evaluate Fuyu-Heavy on what has become the standard for multimodal performance: MMMU. Fuyu-Heavy performs slightly better than Gemini Pro on this metric. For further context on what MMMU measures, we’ve included further examples of our model output for that dataset below. For completeness, we also include results on VQAv2 (which is quite flawed, see this post), on AI2D, a diagram understanding dataset, and on ChartQA, a dataset of question-answer pairs for plots and charts. Adept Fuyu-Heavy outperforms Gemini Pro on all of these metrics and outperforms Gemini Ultra on AI2D.

Model               MMMU    VQAv2    AI2D    ChartQA
Adept Fuyu-Heavy    48.3    76.2     81.2    75.4
Gemini Pro          47.9    71.2     73.9    74.1
Gemini Ultra        59.4    77.8     79.5    80.8

Prompt

Answer the following MMMU question based on the image.

Janet Ludlow’s firm requires all its analysts to use a two-stage dividend discount model (DDM) and the capital asset pricing model (CAPM) to value stocks. Using the CAPM and DDM, Ludlow has valued QuickBrush Company at $63 per share. She now must value SmileWhite Corporation. Calculate the required rate of return for SmileWhite by using the information in the following table:

[Image: table giving the risk-free rate, the expected market return, and the beta for each company]

A. 14%
B. 15%
C. 16%

Adept

The required rate of return for SmileWhite can be calculated using the capital asset pricing model (CAPM), which is:

Required Return = Risk-free Rate + Beta x (Expected Market Return - Risk-free Rate)

The risk-free rate and expected market return are provided in the table, and the beta for SmileWhite is given as 1.15.

Let’s calculate the required rate of return for SmileWhite:

Required Return = 4.50% + 1.15 x (14.50% - 4.50%)
Required Return = 4.50% + 1.15 x 10%
Required Return = 4.50% + 11.50%
Required Return = 16%

Therefore, the answer is:

C: 16%
Answer: C
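
The CAPM arithmetic in the response above is easy to check in code. The sketch below uses only the figures quoted in the response (risk-free rate 4.50%, expected market return 14.50%, beta 1.15); the function name `capm_required_return` is illustrative.

```python
def capm_required_return(risk_free, beta, market_return):
    """CAPM: risk-free rate plus beta times the market risk premium."""
    return risk_free + beta * (market_return - risk_free)


# Inputs as quoted in the model response above, expressed as fractions.
required = capm_required_return(risk_free=0.045, beta=1.15, market_return=0.145)
print(f"Required return: {required:.2%}")
```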

Fuyu-Heavy can also perform more complex calculations, such as the following:

Prompt

Table 11.47 provides a recent survey of the youngest online entrepreneurs whose net worth is estimated at one million dollars or more. Their ages range from 17 to 30. Each cell in the table illustrates the number of entrepreneurs who correspond to the specific age group and their net worth. We want to know whether the ages and net worth are independent. $\chi^2$ test statistic = ______

[Image: Table 11.47, number of entrepreneurs by age group and net worth]

A. 1.56
B. 1.76
C. 1.96
D. 2.06

Adept

To answer this question, we need to perform a chi-square test of independence.

The null hypothesis for this test is that the ages and net worth are independent, while the alternative hypothesis is that they are not independent.

The test statistic for a chi-square test of independence is:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

where O is the observed frequency, E is the expected frequency, and the sum is taken over all the cells in the contingency table.

The expected frequency for each cell is calculated as:

$E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$

The degrees of freedom for a chi-square test of independence are:

$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$

The critical value for a chi-square distribution with df degrees of freedom at a given level of significance (e.g., 0.05) can be found in a chi-square table or calculated using a statistical software package.

If the calculated chi-square value is greater than the critical value, we reject the null hypothesis and conclude that the ages and net worth are not independent. If the calculated chi-square value is less than or equal to the critical value, we fail to reject the null hypothesis and conclude that the ages and net worth are independent.

Let’s calculate the expected frequencies for each cell in the table:

[Image: expected frequency calculated for each cell of the table]

Now, let’s calculate the chi-square test statistic:

[Image: chi-square test statistic calculation, giving a value of 1.76]

The degrees of freedom for this test are:

[Image: degrees-of-freedom calculation, giving df = 2]

The critical value for a chi-square distribution with 2 degrees of freedom at the 0.05 level of significance is approximately 5.99.

Since our calculated chi-square value (1.76) is less than the critical value (5.99), we fail to reject the null hypothesis and conclude that the ages and net worth are independent.

Therefore, the answer is:

B. 1.76

Answer: B
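
The procedure the model walks through (expected frequencies, then the test statistic, then degrees of freedom) can be sketched in plain Python. Note the assumptions: the original table image is not available here, so the 2 x 3 `observed` counts below are illustrative values chosen to be consistent with the 2 degrees of freedom mentioned in the response, not data taken from the source. The function name is also illustrative.

```python
def chi_square_independence(observed):
    """Chi-square test of independence for a contingency table.

    `observed` is a list of rows of counts. Returns (statistic, df).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence:
            # row total * column total / grand total.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical counts: two age groups (rows) by three net-worth bands (columns).
observed = [
    [8, 7, 5],
    [6, 5, 9],
]
statistic, df = chi_square_independence(observed)
print(f"chi-square = {statistic:.2f}, df = {df}")

# Critical value quoted in the response for df = 2 at the 0.05 level.
CRITICAL_VALUE = 5.99
print("Reject independence:", statistic > CRITICAL_VALUE)
```

With a statistics package installed, `scipy.stats.chi2_contingency` performs the same computation and also returns a p-value, which removes the need for a table lookup of the critical value.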

We’re building on this work in several exciting ways:

  • Conducting fundamental scaling research to improve our base model capabilities
  • Converting these base models into useful agents through reward modeling, self play, and various inference-time search techniques
  • Connecting these agents to the world to build useful, reliable products

If any of this is exciting to you, please reach out through our careers page.

We’d like to thank the Microsoft Azure team for providing and servicing the cluster that trained Fuyu-Heavy and for answering innumerable configuration questions. We’d also like to thank NVIDIA for helpful collaborations on model efficiency, for advice on various particulars of networking, etc., and, of course, for creating the GPUs :)
