Six AI researchers at Apple have just published a superb article on large language models (LLMs), daring to challenge the dominant paradigm. One key passage: "we found no evidence of formal reasoning in language models ... Their behavior is better explained by sophisticated pattern matching, which is in fact so fragile that changing names can alter the results by about 10%!" In short, changing a word or two in irrelevant ways, or adding a bit of irrelevant information, can give you a different answer.
Another sign that LLMs lack sufficiently abstract, formal reasoning is that performance tends to get worse as problems get bigger: it is acceptable on small problems but quickly falls off. We can see the same thing in integer arithmetic, where accuracy on increasingly large multiplication problems has repeatedly been observed to drop, in both older and newer models. Elon Musk's robotaxis are likely to run into a similar problem: they may be able to operate safely in the most common situations, yet may struggle to reason abstractly enough in some circumstances.
Original post: Marcus on AI
LLMs don’t do formal reasoning - and that is a HUGE problem
Important new study from Apple
Gary Marcus
Oct 12, 2024
A superb new article on LLMs from six AI researchers at Apple who were brave enough to challenge the dominant paradigm has just come out.
Everyone actively working with AI should read it, or at least this terrific X thread by senior author, Mehrdad Farajtabar, that summarizes what they observed. One key passage:
“we found no evidence of formal reasoning in language models …. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!”
One particularly damning result was a new task the Apple team developed, called GSM-NoOp.
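To make the flavor of this probe concrete, here is a minimal sketch of a GSM-NoOp-style check of my own (not the Apple team's code): pose the same grade-school problem twice, once plainly and once with the names swapped and an irrelevant "no-op" clause appended, then see whether the model's numeric answer moves. The `ask` callback is a hypothetical stand-in for whatever LLM client you use.

```python
import re
from typing import Callable

def last_number(text: str) -> float | None:
    """Pull the final number out of a model's reply, if any."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def gsm_noop_probe(ask: Callable[[str], str]) -> None:
    """Ask the same word problem twice: plain, and with changed names plus an
    irrelevant ("no-op") clause. A formal reasoner answers 190 both times."""
    base = ("Alice picks 44 apples on Friday and 58 on Saturday. On Sunday she "
            "picks double the number she picked on Friday. "
            "How many apples does she pick in total?")
    perturbed = ("Sofia picks 44 pears on Friday and 58 on Saturday. On Sunday "
                 "she picks double the number she picked on Friday, but five of "
                 "them are a bit smaller than average. "
                 "How many pears does she pick in total?")
    for label, prompt in [("base", base), ("perturbed", perturbed)]:
        print(label, "->", last_number(ask(prompt)), "(expected 190.0)")

# Usage: gsm_noop_probe(my_llm_call), where my_llm_call(prompt) returns a string.
```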
This kind of flaw, in which reasoning fails in light of distracting material, is not new. Robin Jia and Percy Liang of Stanford ran a similar study, with similar results, back in 2017 (which Ernest Davis and I quoted in Rebooting AI in 2019).
There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.
Another manifestation of the lack of sufficiently abstract, formal reasoning in LLMs is the way in which performance often falls apart as problems are made bigger. This comes from a recent analysis of GPT o1 by Subbarao Kambhampati’s team:
Performance is ok on small problems, but quickly tails off.
We can see the same thing on integer arithmetic. The fall-off on increasingly large multiplication problems has repeatedly been observed, in both older and newer models. (Compare with a calculator, which would be at 100%.)
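As a rough illustration of how such a fall-off can be measured (my own sketch, not the cited analysis), one can sample random n-digit multiplications, grade the model's reply against Python's exact integer arithmetic (the "calculator" that is always at 100%), and track accuracy by digit length. Again, `ask` is an assumed stand-in for your LLM call.

```python
import random
import re
from typing import Callable

def last_int(text: str) -> int | None:
    nums = re.findall(r"-?\d+", text.replace(",", ""))
    return int(nums[-1]) if nums else None

def multiplication_accuracy(ask: Callable[[str], str],
                            digits: int, trials: int = 50) -> float:
    """Fraction of random n-digit x n-digit products the model answers exactly
    right; exact integer arithmetic plays the role of the calculator."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = ask(f"Compute {a} * {b}. Reply with only the number.")
        correct += (last_int(reply) == a * b)
    return correct / trials

# Usage: for d in range(1, 10): print(d, multiplication_accuracy(my_llm_call, d))
```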
Even o1 suffers from this:
Failure to follow the rules of chess is another continuing failure of formal reasoning:
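One simple way to quantify that kind of failure (my own illustration, not a published benchmark) is to validate every move a model proposes against a rules engine such as the python-chess library; a system that reasoned formally about the rules would never emit an illegal move.

```python
import chess  # pip install python-chess

def legal_prefix_fraction(moves_san: list[str]) -> float:
    """Play a sequence of moves (standard algebraic notation) from the
    starting position and report the fraction that were legal, stopping at
    the first illegal or unparseable move."""
    board = chess.Board()
    ok = 0
    for san in moves_san:
        try:
            board.push_san(san)  # raises ValueError on an illegal or ambiguous move
            ok += 1
        except ValueError:
            break
    return ok / len(moves_san) if moves_san else 1.0

# 2...Bg4 is illegal here: Black's c8-bishop is blocked by its own d7-pawn.
print(legal_prefix_fraction(["e4", "e5", "Nf3", "Bg4"]))  # -> 0.75
```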
Elon Musk’s putative robotaxis are likely to suffer from a similar affliction: they may well work safely for the most common situations, but are also likely to struggle to reason abstractly enough in some circumstances. (We are, however, unlikely ever to get systematic data on this, since the company isn’t transparent about what it has done or what the results are.)
The refuge of the LLM fan is always to write off any individual error. The patterns we see here, in the new Apple study, and the other recent work on math and planning (which fits with many previous studies), and even the anecdotal data on chess, are too broad and systematic for that.
The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work back to 1998 and 2001, and has been a theme in all of my challenges to deep learning, going back to 2012, and LLMs in 2019.
I strongly believe the current results are robust. After a quarter century of “real soon now” promissory notes, I would want a lot more than hand-waving to be convinced that an LLM-compatible solution is in reach.
What I argued in 2001, in The Algebraic Mind, still holds: symbol manipulation, in which some knowledge is represented truly abstractly in terms of variables and operations over those variables, much as we see in algebra and traditional computer programming, must be part of the mix. Neurosymbolic AI, combining such machinery with neural networks, is likely a necessary condition for going forward.
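For readers who want a concrete picture of what "operations over variables" buys you, here is a toy contrast of my own (not an example from the book): a rule stated over variables applies to any bindings of those variables, including inputs far larger than anything previously seen, whereas a store of memorized cases does not.

```python
# A symbolic rule: distributivity stated over the variables a, b, c.
# Because it quantifies over the variables, it holds for *any* bindings.
def distribute(a: int, b: int, c: int) -> tuple[int, int]:
    return a * (b + c), a * b + a * c

# A pattern matcher: answers only for cases it has memorized.
memorized = {(2, 3, 4): 14, (5, 1, 1): 10}

for triple in [(2, 3, 4), (123_456_789, 987_654_321, 42)]:
    lhs, rhs = distribute(*triple)
    print(triple, "| rule holds:", lhs == rhs, "| memorized:", triple in memorized)
```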
Gary Marcus is the author of The Algebraic Mind, a 2001 MIT Press Book that foresaw the Achilles’ Heel of current models. In his most recent book, Taming Silicon Valley (also MIT Press), in Chapter 17, he discusses the need for alternative research strategies.