Voxel51 Filtered Views Newsletter - July 26, 2024


Jimmy Guerrero

Posted on July 26, 2024


Author: Harpreet Sahota (Hacker in Residence at Voxel51)

Welcome to Voxel51's weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.

👁️ Could the secret to unmasking deepfakes be hiding in plain sight, right in the eyes of the beholder?


Researchers at the University of Hull have developed a technique to identify AI-generated fake images by examining eye reflections. The method compares the consistency of light reflections between the left and right eyeballs: in real photographs these reflections are typically consistent, while in deepfakes they often differ. To analyze the reflections, the researchers borrowed techniques astronomers use to study galaxies, applying the Gini coefficient, a measure of how light is distributed, to compare the similarity between the two eyeballs.

This astronomy-inspired method could provide a new weapon in the ongoing battle against deepfakes.
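
As a rough illustration of the scoring idea, here is a minimal sketch that compares the Gini coefficient of pixel intensities in two eye crops. The crop arrays and the decision threshold are hypothetical placeholders, not the researchers' actual pipeline, which involves eye detection and reflection segmentation.

```python
import numpy as np

def gini(values: np.ndarray) -> float:
    """Gini coefficient of non-negative intensities: 0 = light spread evenly, 1 = concentrated."""
    v = np.sort(values.astype(np.float64).ravel())
    n = v.size
    cumulative = np.cumsum(v)
    return (n + 1 - 2 * np.sum(cumulative / cumulative[-1])) / n

# Hypothetical grayscale crops of the two eyeballs (placeholder data).
left_eye = np.random.rand(32, 32)
right_eye = np.random.rand(32, 32)

difference = abs(gini(left_eye) - gini(right_eye))
THRESHOLD = 0.1  # illustrative only; a real detector would calibrate this on labelled data
print("possible deepfake" if difference > THRESHOLD else "reflections look consistent")
```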

Key takeaways:

  • Inconsistent eyeball reflections may indicate an AI-generated image
  • Astronomers' tools for studying galaxies can be repurposed for deepfake detection
  • While not foolproof, this approach provides a basis for detecting deepfakes in the ongoing "arms race" against AI-generated fake images.

This innovative approach demonstrates how techniques from one scientific field (astronomy) can be creatively applied to solve problems in another area (image authentication), showcasing the potential for interdisciplinary research in addressing modern technological challenges.

As AI-generated images become increasingly sophisticated, how might this astronomical approach to deepfake detection evolve to stay ahead of the curve? The answers may lie in the stars – or in this case, the eyes – but you'll have to dive deeper into the article to uncover the full scope of this intriguing research.

🧠 Could the next big leap in AI come from a model within a model?


The AI landscape, long dominated by transformer architectures, is now witnessing a surge in the search for new model architectures.

Transformers, which power notable models like OpenAI’s Sora and GPT-4, are hitting computational efficiency roadblocks. Researchers are exploring alternatives, with test-time training (TTT) models emerging as a promising contender. These models, developed by a team from Stanford, UC San Diego, UC Berkeley, and Meta, could potentially process vast amounts of data more efficiently than current transformer models.

Key takeaways:

  • TTT models replace the transformer’s growing “hidden state” with a small machine learning model whose size does not expand as more data is processed, making them highly efficient (see the sketch after this list).
  • TTT layers have linear complexity like RNNs, yet, like transformers, they keep reducing perplexity as contexts grow longer, so more data can be processed without a quadratic increase in computational demands.
  • Two instantiations are proposed: TTT-Linear and TTT-MLP, where the hidden state is a linear model and a two-layer MLP respectively. Both match or exceed the baselines in performance, especially for longer contexts.
  • TTT layers can be integrated into any network architecture and optimized end-to-end, similar to RNN layers and self-attention.
  • Experiments show TTT-Linear is faster than Transformer at 8k context and matches Mamba in wall-clock time.
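
To make the mechanism concrete, here is a heavily simplified PyTorch sketch of a TTT-Linear-style layer, assuming the hidden state is an unprojected linear model trained on a per-token reconstruction loss. The paper's actual layers use learned input/target projections and mini-batched inner updates, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import torch

def ttt_linear_sketch(tokens: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """tokens: (seq_len, d). The 'hidden state' is a d x d linear model W that is
    updated by one gradient step per token at test time, so the layer's memory
    stays constant no matter how long the context grows."""
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                 # fixed-size hidden state: weights of a linear model
    outputs = []
    for x in tokens:
        pred = W @ x                      # self-supervised target: reconstruct the token
        grad = torch.outer(pred - x, x)   # gradient of 0.5 * ||W x - x||^2 with respect to W
        W = W - lr * grad                 # inner-loop "training at test time"
        outputs.append(W @ x)             # layer output after the update
    return torch.stack(outputs)

# Toy usage: a 16-token context with 8-dimensional embeddings.
print(ttt_linear_sketch(torch.randn(16, 8)).shape)  # torch.Size([16, 8])
```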

As the AI landscape evolves, will TTT models revolutionize the field by overcoming the limitations of transformers? While it's too early to say for certain, the race for more efficient AI architectures is heating up, and the implications for the future of generative AI are fascinating to consider.

🤖 Could spatial intelligence be the key to unlocking the next level of AI reasoning?


Fei-Fei Li, a renowned computer scientist known as the "godmother of AI," is reportedly building a startup focused on enhancing AI's spatial intelligence, a subfield of visual AI that develops algorithms capable of realistically extrapolating images into three-dimensional reconstructions.

World Labs, which has reached a valuation of over $1 billion in just four months, aims to enhance AI's reasoning capabilities by developing human-like visual data analysis. The company is developing a framework for understanding the three-dimensional physical world, including object dimensions, spatial location, and functionalities.

Key takeaways:

  • The startup is exploring "spatial intelligence" to improve AI's understanding of 3D environments
  • Li's approach could help AI extrapolate and predict outcomes based on visual data
  • This development may bridge the gap between current AI limitations and artificial general intelligence (AGI)

💎 GitHub Gems


LivePortrait is a project for efficient portrait animation.

The main goal of this framework is to synthesize lifelike videos from a single source image, using it as an appearance reference, while deriving motion (facial expressions and head pose) from a driving video, audio, text, or generation.

Key aspects of LivePortrait:

  • Unlike mainstream diffusion-based methods, LivePortrait extends the potential of an implicit-keypoint-based framework, balancing computational efficiency and controllability.
  • The project uses about 69 million high-quality frames and employs a mixed image-video training strategy to enhance generation quality and generalization ability.
  • The framework includes upgraded network architecture, motion transformation, and optimized objectives.
  • LivePortrait introduces stitching and retargeting modules, utilizing a small MLP with minimal computational overhead to enhance control over the generated animations.
  • Experimental results show that LivePortrait is competitive even when compared to diffusion-based methods. It achieves a remarkable generation speed of 12.8ms on an RTX 4090 GPU with PyTorch. The library also supports Apple Silicon, though generation times are slower. One user reported 100-second generation time of a four-second video using an M1 Max.
  • The framework can handle various styles, including realistic, oil painting, sculpture, and 3D rendering. It can also be fine-tuned for animating cats, dogs, and pandas.

The inference code and models for LivePortrait are publicly available on GitHub. The authors have made it super easy to get up and running. The documentation is pretty good, and they seem quite responsive to GitHub issues. Alternatively, you can give this a try on Hugging Face Spaces.
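
For a rough sense of what "implicit-keypoint-based" animation means, here is a conceptual sketch, not LivePortrait's actual API or architecture: the source image contributes identity keypoints, the driving video contributes per-frame motion, and a generator warps the source accordingly.

```python
import numpy as np

def transfer_motion(source_kp: np.ndarray,
                    driving_kp: np.ndarray,
                    driving_kp_first: np.ndarray) -> np.ndarray:
    """Move the source identity's keypoints by the displacement of the driving
    keypoints relative to the driving video's first frame. In a full framework,
    the animated keypoints would then condition a warping/generator network, and
    small stitching/retargeting MLPs would refine the result."""
    motion = driving_kp - driving_kp_first   # how the driving face moved in this frame
    return source_kp + motion                # animated keypoints for the source identity

# Toy example: 10 implicit 3D keypoints over a 4-frame driving clip.
source_kp = np.random.rand(10, 3)
driving = [np.random.rand(10, 3) for _ in range(4)]
animated = [transfer_motion(source_kp, frame, driving[0]) for frame in driving]
print(len(animated), animated[0].shape)  # 4 (10, 3)
```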

📙 Good Reads

How deeply can AI usage alter our cognitive processes and work habits?

Nick Potkalitsky explores the cognitive impact of extended AI interaction in his newsletter "Educating AI."

Drawing from personal experience and theoretical insights, he discusses how immersive engagement with AI tools like ChatGPT and Claude can lead to subtle yet significant changes in our thinking patterns and self-perception. His article explores the cognitive and experiential effects of prolonged AI usage, particularly in the context of writing and content creation. Potkalitsky shares insights from an intensive 6-day period of AI-assisted work, during which he noticed several significant changes in his mental processes and work patterns.

Here's a breakdown of the effects Potkalitsky discusses:

  • Lack of Focus: The constant shifting between tasks (editing, copy-pasting, prompt engineering) in AI-assisted work cycles can lead to difficulty concentrating on specific tasks, especially when attempting to write original content afterwards.
  • Adrenaline Boost: The gruelling nature of AI work cycles can induce an adrenaline rush, pushing users to work for extended periods to match the AI's efficiency. This can create a potentially harmful feedback loop of stress and overwork.
  • Externalization of Writing Process: AI tools can fundamentally alter our relationship with language and writing. Improving text becomes externalized, challenging our traditional understanding of authorship and creativity.
  • Analogical Thinking: Potkalitsky proposes that prolonged AI use may lead to "machine mirroring," where our conscious processes begin to imitate the tools we frequently use. This is similar to how extended spreadsheet use can influence how we organize information mentally.
  • Anthropomorphization: The author suggests that the mirroring process can lead to attributing human-like qualities to AI, including consciousness and empathy.
  • Time and Creativity Investment: Potkalitsky notes that for this cognitive mirroring to occur, users need to spend significant time immersed in the machine process, engage creatively, and choose the tool purposefully.

These effects raise important questions about AI's long-term impact on cognition, creativity, and sense of self. How can we harness AI's benefits while mitigating these potential cognitive side effects? Potkalitsky's article offers valuable insights into this complex issue, encouraging readers to reflect on their AI interactions and their implications for education and cognition.

🎙️ Good Listens: AI Consciousness and the Space of Possible Minds

This week's recommendation is from the ML Street Talk podcast, hosted by Tim Scarfe. 


Tim’s guest this week is Murray Shanahan, a principal research scientist at Google DeepMind and professor of cognitive robotics at Imperial College London.  The episode explores the intersection of artificial intelligence, consciousness, and philosophy, with a healthy dose of Ludwig Wittgenstein's ideas and philosophies mixed in.

This episode offers a unique blend of cutting-edge AI research and classical philosophy. Shanahan's application of Wittgensteinian concepts to modern AI challenges provides fresh insights into both fields. I felt like I came away with a deeper understanding of the philosophical questions surrounding AI consciousness and the limitations of our current language in describing AI phenomena.

Key points from the episode:

  • The episode touches on practical experiments, such as playing 20 questions with AI, to illustrate the fluid nature of AI "thinking."
  • Shanahan shares his experiences with advanced AI models like Claude 3, offering insights into their capabilities and limitations.
  • Shanahan explores the concept of "consciousness-adjacent language" when discussing AI, highlighting the need for new vocabulary to describe emerging AI behaviours.
  • The conversation introduces the intriguing "simulators" theory, which views language models as simulators capable of producing various "simulacra" or role-players.
  • Shanahan discusses the stochastic nature of language model outputs, using the analogy of a "tree of possibilities" to explain how AI responses can vary with each interaction.
  • Shanahan draws heavily on Wittgenstein's philosophy of language to frame the discussion about AI and consciousness. He uses Wittgenstein's "language games" concept to analyze how we interact with and interpret AI behaviour.
  • The conversation gets into Wittgenstein's idea that words' meaning is determined by their use in context. This perspective is applied to understand how we ascribe meaning to AI outputs and the potential pitfalls of anthropomorphizing AI.
  • Shanahan introduces the concept of language models as "simulators" producing various "simulacra," connecting this idea to Wittgenstein's thoughts on rule-following and the nature of understanding.
  • The discussion touches on Wittgenstein's private language argument and its relevance to understanding AI consciousness and internal states.
  • Wittgenstein's notion of "forms of life" is explored in relation to AI, questioning whether AI can truly participate in human forms of life and language games.

Whether you're an AI enthusiast, a philosophy buff, or simply curious about the intersection of technology and human understanding, this episode provides thought-provoking content that will challenge your perspectives on artificial intelligence and consciousness.

👩🏽‍🔬 Interesting Research

The scale of data and computation in machine learning continues to grow exponentially while the pursuit of efficiency becomes increasingly important.

The paper "Data curation via joint example selection further accelerates multimodal learning" presents an approach that could revolutionize how we train large-scale multimodal models. By introducing a method that intelligently selects batches of data rather than individual examples, the authors demonstrate remarkable improvements in training speed and computational efficiency.

This work challenges our current understanding of data curation and opens up new possibilities for scaling machine learning models more effectively. The authors achieve state-of-the-art performance with up to 13 times fewer iterations and 10 times less computation. This method, called JEST (multimodal contrastive learning with joint example selection), reveals new insights into the importance of batch composition in machine learning.

To analyze this groundbreaking research, we'll use the PACES method, which breaks down the paper into its key components: Problem, Approach, Claim, Evaluation, and Substantiation.

  • Problem: Identify the main problem being studied in the paper.
  • Approach: Summarize the main technique proposed to address the problem (in 2-3 sentences).
  • Claim: State the paper's main contribution to the field (in one sentence).
  • Evaluation: Describe how the approach is evaluated, including datasets, baselines, and setup.
  • Substantiation: Assess whether the evaluation supports the paper's claim.

Problem

The paper discusses the inefficiency of current data curation methods in large-scale multimodal pretraining. These methods rely on selecting individual data points and do not consider the importance of batch composition. The authors explore the potential of jointly selecting batches of data as being more effective for learning compared to selecting examples independently in multimodal contrastive learning. The authors aim to speed up multimodal learning through a novel data curation method.

Approach

The researchers developed a method called JEST (multimodal contrastive learning with joint example selection), which:

  • Uses contrastive objectives to measure the joint learnability of a batch (the scoring idea is sketched after this list)
  • Derives a simple and tractable algorithm for selecting high-quality batches
  • Leverages recent advances in model approximation to reduce computational overhead
  • Uses pretrained reference models to steer the data selection process towards well-curated datasets
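
Here is a hypothetical, heavily simplified sketch of the learnability-scoring idea: candidates are scored by how much harder they are for the learner than for the pretrained reference model, and the highest-scoring ones form the training batch. The actual method scores batches jointly via the contrastive loss matrix and samples them in blocks, which this per-example version does not capture.

```python
import torch

def select_super_batch(learner_losses: torch.Tensor,
                       reference_losses: torch.Tensor,
                       keep: int) -> torch.Tensor:
    """Return the indices of the `keep` candidates with the highest 'learnability'
    score, defined as learner loss minus reference-model loss: examples that are
    still hard for the learner but easy for the well-curated reference model."""
    learnability = learner_losses - reference_losses
    return torch.topk(learnability, k=keep).indices

# Toy usage: score 1,024 candidate examples and keep the top 256 for the batch.
batch_idx = select_super_batch(torch.rand(1024), torch.rand(1024), keep=256)
print(batch_idx.shape)  # torch.Size([256])
```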

The main contributions of this paper include:

  • A new algorithm (JEST) for joint example selection in multimodal learning
  • An efficient implementation (Flexi-JEST) that reduces computational overhead
  • Demonstration of significant acceleration in training, surpassing state-of-the-art models with fewer iterations and less computation
  • Exposing data curation as a new dimension for neural scaling laws

Claim 

JEST significantly accelerates multimodal learning, achieving state-of-the-art performance with up to 13 times fewer iterations and 10 times less computation than current methods. The significance of this work lies in its potential to:

  • Dramatically accelerate multimodal learning, reducing both the number of iterations and overall computation required
  • Improve the efficiency of large-scale pretraining across modalities
  • Provide a new approach to data curation that can scale effectively and reduce reliance on manual curation.
  • Offer insights into the importance of batch composition in learning beyond individual example quality.

Evaluation

The authors evaluated their approach through several experiments:

  • Comparing the learnability of batches selected by JEST vs. independent selection
  • Comparing JEST and Flexi-JEST against state-of-the-art models on multiple downstream tasks, including ImageNet and COCO.
  • Analyzing the impact of different filtering ratios and the effectiveness of multi-resolution training
  • Measuring performance in terms of training iterations, computational efficiency (FLOPs), and accuracy on various benchmarks

Substantiation

The evaluation strongly supports the paper's claim. The results demonstrate that JEST and Flexi-JEST consistently outperform baseline methods and achieve comparable or better performance with significantly fewer iterations and less computation. The authors provide extensive ablation studies and analyses that further substantiate their claims about the effectiveness of joint example selection in accelerating multimodal learning.

In summary, this paper presents a novel approach to data curation in multimodal learning that shows promise in significantly accelerating training while maintaining or improving performance on downstream tasks. The method's ability to bootstrap from smaller, well-curated datasets to improve learning on larger datasets could have broad implications for efficient large-scale model training.

🗓️ Upcoming Events

Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.

