City and Coffee

Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

By content@helloomylife.com
October 16, 2024
in Tech


For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results helps support previous research suggesting that LLMs’ use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

Mix It Up

In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” (currently available as a preprint paper), the six Apple researchers start with GSM8K’s standardized set of more than 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs’ advanced reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values, so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.

This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as on GSM8K.
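The name-and-number substitution described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers’ actual tooling: the template wording, the name and relative pools, and the number range are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass

# Hypothetical template: the wording, name pool, and number range are
# illustrative assumptions, not taken from the GSM-Symbolic paper.
TEMPLATE = (
    "{name} buys {total} building blocks and gives {given} of them "
    "to a {relative}. How many blocks does {name} have left?"
)
NAMES = ["Sophie", "Bill", "Amira", "Chen"]
RELATIVES = ["nephew", "brother", "cousin", "niece"]


@dataclass
class Variant:
    question: str
    answer: int


def make_variant(rng: random.Random) -> Variant:
    """Fill the template with fresh names and numbers; the reasoning stays the same."""
    total = rng.randint(10, 40)
    given = rng.randint(1, total - 1)  # keep the answer positive
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        total=total,
        given=given,
        relative=rng.choice(RELATIVES),
    )
    # The ground-truth answer is recomputed from the sampled values.
    return Variant(question=question, answer=total - given)


def make_variants(n: int, seed: int = 0) -> list[Variant]:
    """Generate n variants reproducibly from a single seed."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(n)]


if __name__ == "__main__":
    for v in make_variants(3):
        print(v.question, "->", v.answer)
```

Because every variant shares one template, the required operation (a single subtraction here) never changes. Only the surface details do, which is why accuracy would be expected to stay flat across variants.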

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.

This kind of variance, both within different GSM-Symbolic runs and compared to GSM8K results, is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”
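The run-to-run spread discussed above can be summarized with a few standard statistics. The sketch below assumes per-run accuracies are already available as percentages. The function name and the sample numbers are hypothetical placeholders for a made-up model, not figures from the paper.

```python
from statistics import mean, pstdev


def summarize_runs(run_accuracies: list[float], baseline: float) -> dict[str, float]:
    """Summarize accuracy (in percent) over repeated benchmark runs."""
    avg = mean(run_accuracies)
    return {
        "mean": avg,
        "drop_vs_baseline": baseline - avg,  # in percentage points
        "best_worst_gap": max(run_accuracies) - min(run_accuracies),
        "std_dev": pstdev(run_accuracies),
    }


# Placeholder numbers for a made-up model; NOT results from the paper.
runs = [74.0, 81.0, 69.0, 78.0, 83.0]
summary = summarize_runs(runs, baseline=85.0)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

Reporting the best-worst gap alongside the mean matters here: a model can show a small average drop while still swinging widely between individual runs.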

Don’t Get Distracted

Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”

Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.
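To make the red-herring idea concrete, the sketch below inserts an irrelevant sentence into a word problem and runs a deliberately naive solver over both versions. Everything here is an assumption for illustration: the base question and its numbers are invented, and `toy_pattern_matcher` is a toy heuristic that maps phrases to operations, not a description of how any real LLM works.

```python
import re

# Illustrative base question; the numbers are assumptions for this sketch.
BASE = (
    "Oliver picks 40 kiwis on Friday and 50 kiwis on Saturday. "
    "On Sunday he picks 30 kiwis. How many kiwis does Oliver have?"
)
DISTRACTOR = "5 of them were a bit smaller than average."


def add_noop(question: str, distractor: str) -> str:
    """Insert an irrelevant sentence just before the final question sentence."""
    body, sep, final = question.rpartition(". ")
    if not sep:  # single-sentence question: prepend the distractor
        return f"{distractor} {question}"
    return f"{body}. {distractor} {final}"


def toy_pattern_matcher(question: str) -> int:
    """Toy heuristic: add every number, but subtract numbers followed by 'of them'."""
    total = 0
    for match in re.finditer(r"(\d+)( of them)?", question):
        value = int(match.group(1))
        total += -value if match.group(2) else value
    return total


noop_question = add_noop(BASE, DISTRACTOR)
print(noop_question)
print(toy_pattern_matcher(BASE))           # 120, the correct total
print(toy_pattern_matcher(noop_question))  # 115, misled by the distractor
```

The correct answer is 120 in both versions, since kiwi size has no bearing on the count. The toy solver gets it wrong only because it turns the phrase “of them” into a subtraction, a small-scale analogue of converting statements to operations without understanding their meaning.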


