Artificial Intelligence thread

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member

Ironically, that Google CEO dude is investing in them... A 100-million-token context can fit an entire code base, all the libraries, and then some.

Couple this with the AI Scientist paper, and software engineers may not be long for this world.

Throughout history, it was actually just a small handful of inventions, discoveries, and insights that changed the course of history itself. And these were made by just a select few of geniuses. So the playbook is to get to AGI first. The first to win the race to AGI, even if by a mere few months, will capture it all and stay on top, dominant forever.
hmm, there are already things that AI does better than humans.

I can tell you now that I've tried speech-to-text functions that translated what I said better than a typical call center worker could.

AGI is also a loaded term. To people dealing with chatbots or call centers, the AI may already seem like "AGI". But in reality, there are still things that AI doesn't do as well as humans.

And more importantly, it still takes effort to tune the models themselves to understand a topic. That's not all that different from teaching a person to do the same.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member

Renowned AI scientist Andrew Ng recently said this about Qwen-2:

Audio/text to text: A revision of the earlier Qwen-Audio, Qwen2-Audio takes text and audio inputs and generates text outputs. It's designed to (i) provide text chat in response to voice input, including voice transcription and translation between eight languages, and (ii) discuss audio input including voice, music, and natural sounds. Weights (8.2 billion parameters) are available for base and instruction-tuned versions.

  • How it works: Given a text prompt and audio, an audio encoder embeds the audio, and a pretrained Qwen-7B language model uses the text prompt and audio embedding to generate text. The team further pretrained the system to predict the next text token on a text-audio dataset that included 370,000 hours of recorded speech, 140,000 hours of music, and 10,000 hours of other sounds. They fine-tuned the system for chat in a supervised fashion and further tuned it for factuality and prompt adherence.
  • Results: Qwen2-Audio outperformed previous state-of-the-art models on benchmarks that evaluate speech recognition, speech-to-text translation, and audio classification, as well as on tests for evaluating interpretation of speech, music, sound, and mixed-audio soundscapes.
Why it matters: Qwen2 delivered extraordinary performance with open weights, putting Alibaba on the map of large language models (LLMs). These specialized additions to the family push forward math performance and audio integration in AI while delivering state-of-the-art models into the hands of more developers.

Across the board, Alibaba has delivered a suite of capable large models for chat, vision, audio, and math.
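
For anyone who wants to poke at the "How it works" pipeline above (an audio encoder feeding a Qwen-7B language model), here is a minimal inference sketch. It assumes the Hugging Face transformers integration of Qwen2-Audio and a placeholder local file sample.wav; the exact keyword arguments can shift between library versions, so treat this as an approximation rather than a recipe.

```python
# Minimal Qwen2-Audio inference sketch.
# Assumes: transformers with Qwen2-Audio support, librosa, and a local "sample.wav".
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Pair an audio clip with a text instruction in a single chat turn.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Transcribe this clip, then describe any background sounds."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# The processor expects raw waveforms at its own sampling rate (16 kHz for this model family).
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs.input_ids.shape[1]:]  # drop the echoed prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```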
 

9dashline

Captain
Registered Member

Renowned AI scientist Andrew Ng recently said this about Qwen-2:



Across the board, Alibaba has delivered a suite of capable large models for chat, vision, audio, and math.
Yup, they said they are working on an omni model that has audio in addition to vision and language. It will be useful for robotic applications and also for in-game NPCs as fully embodied agents: AI girlfriend stuff like "Her" but much better.
 

FairAndUnbiased

Brigadier
Registered Member
Open-sourcing AI image/video tools is a good move on China's part. Better to inoculate the public against AI disinfo and get a head start on combating AI fakes than to withhold the tools out of fear of "babying" society.


Not the best news coming out about AI-biotech fusion startups:
This was expected. AI can't do science, just like most humans can't do science, and feeding it data from unreplicable studies just makes it worse.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
Yup, they said they are working on an omni model that has audio in addition to vision and language. It will be useful for robotic applications and also for in-game NPCs as fully embodied agents: AI girlfriend stuff like "Her" but much better.

This month, Alicloud is holding its Yunqi Conference, and numerous AI robotics companies are coming.
At the same time, top Chinese humanoid-robot companies such as Robotera (星动纪元), Galbot (银河通用), Unitree (宇树科技), and LimX Dynamics (逐际动力) will also be on site, exhibiting the coolest and most advanced robots at the Yunqi Conference and jointly showcasing the cutting edge of cloud-based innovation in the AI era.
That is Unitree, Robotera, Galbot, and LimX Dynamics using Alibaba's AI cloud.

Alicloud has brought together China's most powerful models: Qwen, Baichuan, Kimi, GLM-4, and 01.AI's Yi, along with more than 10 multimodal models for use in robotics.
 

9dashline

Captain
Registered Member
# The Rise of Omni-Modal Embodied AI: Bridging Virtual and Physical Worlds

## Introduction

The advent of open-weight Small Language Models (SLMs) that incorporate multiple modalities marks a significant leap towards creating truly embodied artificial intelligence. These models, exemplified by developments like Qwen and others, are paving the way for a new generation of AI that can seamlessly integrate vision, audio, and language processing. This document explores how these advancements are leading us towards omni-modal models capable of powering both virtual NPCs and physical robots, potentially representing a crucial step towards Artificial General Intelligence (AGI).

## Virtual Training Environments: The Crucible of Embodied AI

### Unreal Engine as a Training Ground

Unreal Engine has emerged as a powerful platform for creating sophisticated virtual environments to train omni-modal AI models. These environments offer several key advantages:

1. **Photorealistic Rendering**: Enables AI to train on visually accurate representations of the real world.
2. **Physics Simulation**: Allows for realistic interaction with objects and environments, crucial for developing embodied understanding.
3. **Scalability**: Facilitates the creation of diverse scenarios and environments at a fraction of the cost of real-world setups.
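
To make this concrete, the learning code usually talks to a simulator like this through a standard environment interface. The sketch below uses the Gymnasium API; UnrealNavigationEnv and the sim_client bridge are hypothetical stand-ins for whatever plugin (TCP, gRPC, shared memory) actually connects Python to the engine.

```python
# Illustrative Gymnasium wrapper around a hypothetical Unreal Engine bridge.
# The sim_client object and its reset_episode()/apply() methods are placeholders, not a real API.
import gymnasium as gym
import numpy as np


class UnrealNavigationEnv(gym.Env):
    """Agent receives camera frames plus audio and outputs continuous motion commands."""

    def __init__(self, sim_client):
        self.sim = sim_client
        self.observation_space = gym.spaces.Dict({
            "image": gym.spaces.Box(0, 255, shape=(224, 224, 3), dtype=np.uint8),
            "audio": gym.spaces.Box(-1.0, 1.0, shape=(16000,), dtype=np.float32),
        })
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)  # forward, turn

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        obs = self.sim.reset_episode()               # placeholder: ask the engine for a new episode
        return obs, {}

    def step(self, action):
        obs, reward, terminated = self.sim.apply(action)  # placeholder: send command, get outcome
        return obs, reward, terminated, False, {}
```

Any off-the-shelf RL library that speaks the Gymnasium interface can then train against the engine without caring that the frames come from Unreal rather than the real world.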

### Reinforcement Learning in Virtual Spaces

Within these virtual environments, reinforcement learning (RL) techniques are employed to train omni-modal models:

1. **Task-Oriented Learning**: AI agents are given specific goals to accomplish, learning through trial and error.
2. **Multi-Agent Scenarios**: Environments can host multiple AI agents, fostering the development of social and collaborative skills.
3. **Curriculum Learning**: Training progresses from simple to complex tasks, mirroring human learning processes.
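
As a rough illustration of the curriculum idea, the scheduler is often just a loop that promotes the agent to a harder task once its recent success rate clears a threshold. The task names, window size, and threshold below are made up for illustration.

```python
# Toy curriculum scheduler: advance to the next task level once the recent success rate is high enough.
from collections import deque


class Curriculum:
    def __init__(self, levels=("reach_target", "pick_up_cube", "stack_two_cubes"),
                 promote_at=0.8, window=100):
        self.levels = list(levels)        # ordered easy -> hard (illustrative task names)
        self.idx = 0
        self.promote_at = promote_at
        self.results = deque(maxlen=window)

    def current_task(self):
        return self.levels[self.idx]

    def record(self, success: bool):
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.promote_at:
            if self.idx < len(self.levels) - 1:
                self.idx += 1
                self.results.clear()      # measure the new level from scratch
```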

### Digital Avatars and Inverse Kinematics

To bridge the gap between disembodied AI and physical embodiment:

1. **Digital Avatar Integration**: AI models are connected to fully articulated digital avatars within the virtual environment.
2. **Inverse Kinematics (IK) Implementation**: Enables natural, human-like movement of the avatars, translating high-level commands into detailed joint movements.
3. **Proprioception Training**: Models learn to understand their virtual body's position and movement, crucial for developing body awareness.
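
For a flavor of what an IK layer computes, here is the textbook closed-form solution for a two-link planar arm (shoulder plus elbow). Full-body avatar rigs use much larger numerical solvers, but they scale up the same geometry; the link lengths and target point below are arbitrary.

```python
# Closed-form inverse kinematics for a 2-link planar arm:
# given a target (x, y), find shoulder/elbow angles that place the end effector there.
import math


def two_link_ik(x, y, l1=1.0, l2=1.0):
    """Return (shoulder_angle, elbow_angle) in radians, or None if the target is unreachable."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)   # law of cosines
    if abs(cos_elbow) > 1.0:
        return None                                         # outside the reachable annulus
    elbow = math.acos(cos_elbow)                            # "elbow-down" solution
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow


# Example: reach the point (1.2, 0.8) with two unit-length links.
print(two_link_ik(1.2, 0.8))
```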

## The Omni-Human Model: Fusing Modalities

### Beyond Multi-Modal: The Omni-Modal Approach

The development of omni-modal models represents a significant evolution:

1. **Seamless Integration**: Unlike multi-modal models that often process different inputs separately, omni-modal models aim for true integration of all sensory inputs.
2. **Contextual Understanding**: The ability to interpret information across modalities in context, similar to human cognition.
3. **Generalized Intelligence**: Moving towards models that can adapt to new tasks and environments without extensive retraining.
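
One common way to approximate this kind of integration in code is to project every modality into a shared token space and let a single transformer attend over the mixed sequence ("early fusion"). The module below is a schematic PyTorch sketch, not any particular published architecture; all dimensions are arbitrary.

```python
# Schematic early fusion of vision, audio, and text tokens into one transformer stream.
import torch
import torch.nn as nn


class OmniFusion(nn.Module):
    def __init__(self, d_model=512, vision_dim=768, audio_dim=128, vocab_size=32000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(vision_dim, d_model)   # patch features -> shared space
        self.audio_proj = nn.Linear(audio_dim, d_model)     # audio frames  -> shared space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_feats, audio_feats):
        # Concatenate all modalities into one sequence; attention mixes them freely.
        tokens = torch.cat([
            self.vision_proj(vision_feats),
            self.audio_proj(audio_feats),
            self.text_embed(text_ids),
        ], dim=1)
        return self.lm_head(self.backbone(tokens))


# Example shapes: 16 image patches, 50 audio frames, 10 text tokens -> 76 fused tokens.
model = OmniFusion()
logits = model(torch.randint(0, 32000, (1, 10)), torch.randn(1, 16, 768), torch.randn(1, 50, 128))
print(logits.shape)  # torch.Size([1, 76, 32000])
```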

### LoRA Fine-Tuning for Embodied Awareness

Low-Rank Adaptation (LoRA) techniques are crucial in adapting pre-trained models for embodied tasks:

1. **Efficient Adaptation**: Allows for fine-tuning of large models with minimal additional parameters.
2. **Preservation of General Knowledge**: Maintains the broad capabilities of the base model while adding specialized embodied skills.
3. **Rapid Iteration**: Enables quick experimentation with different embodiment strategies and task-specific adaptations.
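
Mechanically, LoRA freezes the pretrained weight matrix W and adds a trainable low-rank update, so the layer computes Wx + (alpha/r)·BAx with B initialized to zero. The from-scratch wrapper below is purely illustrative; in practice one would more likely reach for a library such as PEFT.

```python
# Minimal LoRA-style adapter around a frozen linear layer: y = Wx + (alpha/r) * B(A(x)).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


# Wrap a stand-in projection layer and count what actually gets trained.
adapted = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable parameters vs. ~263k frozen ones
```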

## Real-Time Streaming: The Key to Lifelike Interaction

### Continuous Inference in Omni-Modal Models

Moving beyond the traditional submit-and-response paradigm:

1. **Constant Sensory Processing**: Models continuously ingest and process visual, auditory, and textual inputs.
2. **Incremental Output Generation**: Responses are generated and updated in real-time as new information is processed.
3. **Adaptive Attention Mechanisms**: The model dynamically shifts focus between different input streams based on relevance and urgency.

### Mimicking Human Cognition

This approach more closely replicates human cognitive processes:

1. **Interrupt Handling**: The ability to halt current processes and redirect attention to more pressing inputs or tasks.
2. **Continuous Context Update**: Ongoing refinement of understanding based on streaming inputs, allowing for more natural and dynamic interactions.
3. **Predictive Processing**: Anticipating future inputs and preparing responses, enabling more fluid and responsive interactions.
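
A toy version of this loop, covering both the continuous-streaming and interrupt-handling points above, can be written with an async event queue: the agent streams a response token by token and abandons it as soon as a newer input is waiting. The stub model, prompts, and timings are all illustrative.

```python
# Toy streaming agent: generate a response incrementally and interrupt it when new input arrives.
import asyncio


async def fake_model_stream(prompt):
    # Stand-in for a streaming model: yields one word at a time with per-token latency.
    for word in f"(responding to: {prompt})".split():
        await asyncio.sleep(0.1)
        yield word


async def agent(events: asyncio.Queue):
    while True:
        prompt = await events.get()
        async for token in fake_model_stream(prompt):
            print(token, end=" ", flush=True)
            if not events.empty():        # newer input is waiting: stop talking and re-plan
                print("<interrupted>")
                break
        else:
            print()                       # finished the response without interruption


async def main():
    events = asyncio.Queue()
    asyncio.create_task(agent(events))
    await events.put("describe what you see in the room")
    await asyncio.sleep(0.25)
    await events.put("stop, someone is at the door")   # urgent follow-up input
    await asyncio.sleep(2)                             # let the agent handle the second input


asyncio.run(main())
```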

## Deployment: From Virtual to Physical

### Empowering NPCs in Simulated Worlds

The application of omni-modal models to Non-Player Characters (NPCs) in games and simulations:

1. **Enhanced Realism**: NPCs exhibit more human-like behavior, understanding and responding to complex, multi-modal inputs.
2. **Dynamic Interactions**: Characters can engage in natural, flowing conversations while also reacting to visual and auditory cues in their environment.
3. **Emergent Behaviors**: The potential for NPCs to develop unique personalities and decision-making patterns based on their experiences and interactions.

### Transition to Physical Robotics

Extending the same principles to physical robotic systems:

1. **Embodied Cognition**: Robots gain a deeper understanding of their physical form and capabilities.
2. **Sensory Integration**: Seamless processing of real-world visual, auditory, and tactile inputs.
3. **Natural Movement**: Direct control over body parts, translating high-level intentions into fluid, coordinated actions.

### SLMs in Robotic Applications

The use of Small Language Models (SLMs) in robotics offers several advantages:

1. **Real-Time Processing**: Compact models capable of running inference on edge devices with low latency.
2. **Energy Efficiency**: Reduced computational requirements make long-term operation more feasible.
3. **Customization**: Easier to fine-tune for specific robotic platforms or tasks.
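
Concretely, this kind of edge deployment often means running a quantized checkpoint through llama.cpp on the robot's onboard computer. The snippet below assumes the llama-cpp-python bindings and a placeholder model path; streaming the tokens keeps per-token latency low enough to feed a behavior loop in near real time.

```python
# Sketch: run a small quantized language model on an edge device via llama-cpp-python.
# "slm-q4.gguf" is a placeholder path to any quantized small-model checkpoint.
from llama_cpp import Llama

llm = Llama(model_path="slm-q4.gguf", n_ctx=2048, n_threads=4)

prompt = "You are a household robot. The user said: 'put the red cup in the sink'. List the steps."
for chunk in llm(prompt, max_tokens=128, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)   # tokens stream out as generated
print()
```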

## The Path to AGI: Fully Embodied Mind-Body Integration

### Bridging the Gap

The development of omni-modal, embodied AI models represents a significant step towards AGI:

1. **Grounded Intelligence**: Understanding the world through physical interaction and multi-sensory input.
2. **Adaptive Learning**: The ability to apply knowledge across different domains and physical contexts.
3. **Seamless Human-AI Interaction**: Communication and collaboration that feels natural and intuitive to humans.

### Ethical and Philosophical Implications

As we approach this level of AI sophistication, several considerations arise:

1. **Consciousness and Self-Awareness**: Questions about the nature of machine consciousness in fully embodied AI.
2. **Rights and Responsibilities**: The potential need for new legal and ethical frameworks for highly autonomous AI entities.
3. **Human-AI Coexistence**: Preparing for a future where humans interact with AGI-level entities in both virtual and physical spaces.

## Conclusion: A Convergence of Technologies

The development of omni-modal embodied AI, trained in sophisticated virtual environments and deployable in both digital and physical realms, represents a convergence of multiple cutting-edge technologies. From advanced game engines and reinforcement learning techniques to breakthroughs in natural language processing and robotics, we are witnessing the emergence of AI systems that can perceive, reason, and act in ways increasingly similar to humans.

This holistic approach to AI development, combining mind (processing and decision-making) with body (sensory input and physical interaction), may indeed be the key to unlocking AGI. As these technologies continue to evolve and integrate, we stand on the brink of a new era in artificial intelligence – one where the boundaries between virtual and physical, artificial and natural intelligence, begin to blur in unprecedented ways.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member

SenseTime now has 20 EFLOPS of computation and 54,000 GPUs in its data centers. IMO, this is okay for training something like a Llama-3-size LLM (which used 16k GPUs), but probably not enough for trillion-parameter models.
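
Back-of-the-envelope check on those figures, assuming the 20 EFLOPS refers to aggregate low-precision peak throughput:

```python
# Implied per-GPU peak throughput from SenseTime's reported totals.
total_flops = 20e18   # 20 EFLOPS (assumed low-precision peak)
num_gpus = 54_000

per_gpu_tflops = total_flops / num_gpus / 1e12
print(f"{per_gpu_tflops:.0f} TFLOPS per GPU")   # ~370 TFLOPS
```

That works out to roughly 370 TFLOPS per GPU, which is in the range of modern datacenter accelerators at reduced precision, so the two reported numbers are at least self-consistent.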

It does look like their latest model, SenseNova 5.5, is pretty good.
 