
  Posted on: 18 June 2024

In recent years, artificial intelligence (AI) has made significant progress, particularly in natural language processing (NLP), with tools like ChatGPT demonstrating impressive abilities in generating text, answering questions, and even performing tasks such as editing and proofreading. These developments have sparked discussions about the potential for AI to replace human editors, proofreaders, and even authors. However, this prospect has also generated unease, particularly within the scientific community, where the written word is held to exceptionally high standards.

As a professional editing service, we were naturally curious about the editing and proofreading capabilities of current AI models. To explore their potential, we conducted extensive tests using texts with varying English proficiency and from various scientific disciplines. This article summarizes our findings, highlighting both the strengths of existing LLMs and their significant limitations that currently prevent them from fully replacing human expertise. These limitations are indeed numerous and, at the time of writing, include:

  • "Hallucinations" (fabrication of information) and an inherent inability to fact-check
  • An inability to truly comprehend language, which is particularly problematic when editing texts written by non-native speakers
  • Difficulty in grasping the complexity of scientific language and the importance of nuance and audience appropriateness
  • Blindness to formatting (cannot distinguish between normal text, headings, table content, etc.)
  • Limited scope of analysis, restricted to page-by-page review without a full overview of the entire document
  • Inability to engage in critical thinking
  • An inability to work within the rules of style guides

Before discussing these issues in more detail, let us first examine how Large Language Models (LLMs) like ChatGPT construct sentences and process language.

How do LLMs Work?

Large Language Models (LLMs) like ChatGPT or Gemini operate by predicting the next word in a sentence based on the context provided by previous words, using what is known as the transformer architecture. Trained on vast datasets containing text from books, articles, websites, and other sources, these models learn patterns in language, including grammar, syntax, and even some aspects of meaning and style. When constructing sentences and paragraphs, LLMs do not "understand" language in the same way humans do; rather, they generate text by selecting words that statistically fit the context, based on the vast amounts of data used to train them.1,2,3
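
To make this concrete, the following Python sketch shows what "predicting the next word" actually looks like. It uses the small, publicly available GPT-2 model via the Hugging Face transformers library purely for illustration; the models behind ChatGPT and Gemini are far larger and tuned for dialogue, but they rest on the same principle: assign a probability to every possible next token and pick among the most likely candidates.

```python
# Minimal illustration of next-token prediction, the core operation of an LLM.
# Uses the small, publicly available GPT-2 model from the Hugging Face
# "transformers" library; larger chat models work on the same principle.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The results of the experiment suggest that the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # a score for every token in the vocabulary

next_token_logits = logits[0, -1]        # scores for the token that follows the prompt
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)

# The model does not "know" what the experiment showed; it simply ranks
# which words are statistically most likely to come next.
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>12}  p = {p:.3f}")
```

The output is nothing more than a ranked list of plausible continuations; generating a full paragraph simply repeats this step word after word.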

Furthermore, AI language models do not "read" documents in the way a human would. They process text in segments (such as paragraphs or pages) because they have a limited-size internal context window and cannot take in a long document in its entirety. LLMs therefore rely on the context provided in the immediately surrounding text and do not retain information from earlier sections of the document. This means that LLMs do not have a global understanding of an entire document at once but rather work with portions of text at a time.1,2,3
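
The sketch below illustrates the practical consequence of a limited context window: a long manuscript has to be cut into chunks that each fit the window, so no single request ever "sees" the whole document. The window size, file name, and chunking scheme are hypothetical choices made purely for illustration.

```python
# Illustration of the context-window limitation: a long manuscript is split
# into chunks that each fit the model's window, so each chunk is processed
# in isolation. The window size below (4,096 tokens) is a hypothetical
# example; real models vary.
from transformers import GPT2Tokenizer

CONTEXT_WINDOW = 4096          # hypothetical window size, in tokens
RESERVED_FOR_REPLY = 1024      # room that must be left for the edited output
CHUNK_SIZE = CONTEXT_WINDOW - RESERVED_FOR_REPLY

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def split_into_chunks(manuscript: str) -> list[str]:
    """Split a document into independent chunks that fit the context window."""
    token_ids = tokenizer.encode(manuscript)
    chunks = [token_ids[i:i + CHUNK_SIZE] for i in range(0, len(token_ids), CHUNK_SIZE)]
    return [tokenizer.decode(chunk) for chunk in chunks]

# Each chunk is handled independently: a term defined in chunk 1 is simply
# unknown to the model by the time it processes chunk 7.
with open("manuscript.txt") as f:          # hypothetical input file
    for n, chunk in enumerate(split_into_chunks(f.read()), start=1):
        print(f"chunk {n}: {len(tokenizer.encode(chunk))} tokens")
```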

Despite these limitations, LLMs still manage to produce output that sounds natural and convincing. However, this output is based purely on learned patterns and correlations rather than true comprehension or reasoning, leading to text that may be grammatically correct and contextually appropriate but not necessarily factually accurate or logically valid. This brings us to the first item on the list above: the lack of reliability.

Take Home Messages

  1. LLMs are statistical models that lack true language comprehension and reasoning abilities.
  2. LLMs operate on isolated text segments without the capability to assess the entire document all at once.

Why LLMs Won't Replace Human Editors for Some Time

Lack of Reliability

Veracity is the cornerstone of scientific writing, where the accuracy and truthfulness of information are paramount. In scientific research, the integrity of arguments, data, and conclusions is critical not only for advancing knowledge but also for maintaining trust within the scientific community and the public. LLMs like ChatGPT, despite their advanced language processing abilities, currently lack the capability to verify the truthfulness of the information they generate or edit. Although they excel at generating contextually appropriate and linguistically sophisticated text, LLMs rely on pattern recognition rather than comprehension, which carries the risk that they will introduce unrelated, factually incorrect, or completely fabricated content – a phenomenon known as "hallucinations".1-11 Because these hallucinations are often obscured by well-crafted sentences, the underlying falsehoods can be difficult to spot, which has led to considerable embarrassment.12,13 Moreover, a comprehensive study assessing the ability of ChatGPT (3.5 and 4) and Google's Bard (now Gemini) to produce references in the context of scientific writing found that all LLMs failed to retrieve any relevant papers, with precision rates ranging from 13.4% for ChatGPT-4 down to 0% for Bard.14

LLMs lack the capacity to assess the logical validity of arguments or to verify the consistency of data and conclusions presented within a manuscript. They are unable to cross-check whether results in one part of a document match those shown elsewhere (e.g., in a different paragraph, figure, or table) or align with established scientific principles. This limitation stems from the fact that LLMs, based on transformer architectures, are not designed for quantitative reasoning or to be Turing complete (unless enhanced with memory augmentation).15 Their primary function is to generate and understand text that mimics human language. For example, posing the same quantitative question to ChatGPT on different occasions (i.e., in different sessions) often yields varying responses. In science, we do not want to approximate the answer to a problem if the solution is known or can be calculated unambiguously. This limitation underscores the indispensable role of human editors, who can critically evaluate the veracity of scientific content and spot logical inconsistencies, both within a sentence or paragraph and between distant sections of a large document. This is essential for preserving the credibility and reliability of scientific literature.
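
As a simple illustration of this variability, the sketch below sends the same arithmetic question to a chat model several times over an API. The model name, prompt, and use of the OpenAI Python client are illustrative assumptions, not a description of any particular study; the point is only that sampled answers to an identical question can differ from one request to the next.

```python
# Illustration of non-deterministic output: the same quantitative question,
# asked in separate requests, may receive different answers because each
# response is sampled from a probability distribution over tokens.
# Model name and prompt are arbitrary examples; requires an OpenAI API key.
from openai import OpenAI

client = OpenAI()
question = "What is 847 * 693? Reply with the number only."

for attempt in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # example model name
        messages=[{"role": "user", "content": question}],
        temperature=1.0,              # sampling enabled; answers may vary
    )
    print(f"attempt {attempt + 1}: {response.choices[0].message.content}")

# A calculator gives 586,971 every time; a sampled language model may not.
```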

Take Home Messages

  1. As statistical models, LLMs are incapable of fact-checking, which can result in the fabrication of false and unsubstantiated content ("hallucinations").
  2. In scientific writing, where accuracy and truthfulness are paramount, hallucinations or approximations of the truth are simply unacceptable.

Difficulties with Texts Written by Non-Native Speakers

Scientists for whom English is a second language (ESL) often encounter substantial challenges when writing academic texts, as they must navigate not only the complexities of science but also the subtleties of English grammar, syntax, and vocabulary. In our professional editing service, we frequently observe issues where unclear phrasing can obscure the intended meaning. Sentences may become difficult to comprehend because of numerous grammatical errors, missing essential words or prepositions, or the incorrect use of specialized jargon. These issues can lead to awkward or ambiguous phrasing, incorrect usage of scientific terminology, and a lack of precision in conveying complex ideas. In this context, the inherent inability of LLMs to truly comprehend language becomes particularly problematic.

While both AI and human readers may occasionally struggle with interpretation, an experienced human editor, especially one with a background in the relevant scientific field, is far better equipped to infer the author's intent. Although AI tools can provide basic grammar and style corrections, they often struggle with more complex issues, particularly when sentences are unclear or when specialized terminology is used incorrectly. In our extensive experimentation with LLMs, we have consistently found that the more unclear a sentence is and the more substantial the required corrections, the more likely AI is to misunderstand the intended meaning and produce corrections that are contextually inappropriate, incorrect (regarding the author's intended meaning), or factually wrong (regarding established scientific truths), or that simply fail to address deeper issues of clarity and accuracy. Consequently, especially for texts written by ESL authors, only a human editor with the pertinent scientific expertise can accurately rephrase and clarify such texts, ensuring that the scientific message is conveyed correctly and effectively, something that AI cannot reliably achieve.

Take Home Messages

  1. Due to their inability to truly comprehend language, LLMs particularly struggle with texts authored by non-native English speakers.
  2. LLMs often misinterpret sentences that are grammatically incorrect or misuse terminology, leading to contextually inappropriate or factually incorrect corrections.

Inability to Understand the Complexity of Scientific Language and the Importance of Nuance and Audience Appropriateness

Scientific writing is characterized by its precision, technical terminology, and the need for clear, unbiased communication of complex ideas. While AI systems like ChatGPT can process and generate text based on patterns in large datasets, they struggle with the nuanced understanding required to accurately edit or proofread scientific manuscripts. Scientific language often involves specialized jargon, acronyms, and terminology, as well as a particular "academic tone", all of which can vary significantly across disciplines. A human editor, particularly one with expertise in the relevant field, possesses the ability to recognize subtle errors in terminology, ensure consistency, and confirm that the language used adheres to the specific conventions of the discipline.

Scientific writing is not merely about conveying information; it is also about engaging in a dialogue with the scientific community. This requires a nuanced approach to communication, where the choice of words, the structure of sentences, and the presentation of data all play a role in how the manuscript is received and interpreted. Human editors are skilled in identifying and addressing these nuances, ensuring that the manuscript not only communicates its findings effectively but also aligns with the expectations of the scientific community. For instance, when dealing with statements of uncertainty or hypotheses, human editors can guide the author in using language that accurately reflects the degree of certainty or speculation involved. They can also help in crafting sections of the manuscript, such as the introduction or conclusion, to ensure that they adequately frame the research question and highlight the significance of the findings. AI, on the other hand, may generate text that is technically correct but lacks the subtlety required to navigate the complexities of scientific discourse.

Scientific manuscripts are carefully constructed arguments designed to communicate new findings to a specific audience, which may include peer reviewers, fellow researchers, or the broader scientific community. The success of a manuscript often hinges on how well it is tailored to its intended audience, and this requires a profound understanding of both the content and the context in which it will be received. Human editors bring a level of contextual awareness that AI currently lacks. They can discern whether the tone, style, and level of detail are appropriate for the target readership, whether it’s a high-impact journal or a specialized conference proceeding. Moreover, scientific writing often requires the integration of various types of information, such as data interpretation, methodological details, and theoretical implications. Human editors can evaluate the logical flow of arguments, identify potential gaps in reasoning, and suggest improvements that enhance the clarity and persuasiveness of the manuscript. AI, while capable of generating coherent text, often lacks the ability to critically assess the broader context in which a scientific argument is presented and may miss opportunities to strengthen the manuscript’s overall impact.

Take Home Messages

  1. Only human editors with subject-matter expertise can navigate specialized terminology, academic tone, and the subtleties of discipline-specific conventions.
  2. AI lacks the understanding of presentation and nuanced language required to effectively communicate scientific findings and meet the expectations of the scientific community.
  3. AI lacks the contextual and audience awareness needed to ensure that tone, style, and level of detail are appropriate for the target readership.

LLMs are Blind to Formatting and Only Process Short Portions of Text at a Time

Most LLMs lack the ability to recognize formatting distinctions, rendering them incapable of differentiating between standard text, headings, table contents, or captions. For instance, if you have a list of bullet points, each introduced by a bolded heading, a human editor can easily identify these elements and edit them appropriately. However, most LLMs fail to recognize such elements as headings and may attempt to convert them into complete sentences by unnecessarily adding verbs and prepositions (elements often omitted in headings), or they might even delete the headings entirely. This limitation in processing formatting also makes LLMs unsuitable for adhering to specific style guides, such as those required by a target journal's author instructions.

Additionally, LLMs do not have internal memory (a prerequisite for true intelligence) and therefore process text in short, isolated segments. When an AI like ChatGPT edits a document, it treats each page, or even each paragraph, independently rather than as part of a cohesive whole. This approach is akin to assigning each page of a manuscript to a different human editor, leading to a disjointed mix of editing styles and making it impossible to detect inconsistencies that span the entire document, such as variations in terminology and spelling or lapses in logical coherence.

Take Home Messages

  1. AI is blind to formatting distinctions, potentially resulting in improper corrections that violate prescribed style guides.
  2. LLMs, with no internal memory, can only process short, isolated text segments, often overlooking or introducing logical contradictions and inconsistencies in style and terminology.

Incapacity for Critical Thinking and Ethical Considerations

One of the most significant limitations of AI in scientific editing and proofreading is its inability to engage in critical thinking. Editing a scientific manuscript often involves evaluating the logical coherence of arguments, identifying potential biases or ethical concerns, and ensuring that the conclusions are supported by the data presented. Human editors are adept at detecting flaws in reasoning, such as overgeneralizations, unwarranted assumptions, or unsupported claims, and can provide feedback that helps authors strengthen their arguments and align their work with the standards of scientific rigour.

Ethical considerations are also paramount in scientific writing. Issues such as plagiarism, data fabrication, and inappropriate authorship practices require careful scrutiny, and human editors play a crucial role in identifying and addressing these concerns. While AI can be trained to recognize certain patterns associated with ethical issues, it lacks the judgment and contextual awareness needed to fully assess the ethical implications of a manuscript. Human editors, with their experience and understanding of the ethical standards in scientific research, are better equipped to guide authors in maintaining the integrity of their work.

Take Home Messages

  1. LLMs cannot reason, rendering them unable to detect logically incoherent arguments, potential biases, ethical concerns, or unwarranted assumptions, claims, or conclusions.
  2. LLMs lack the judgment and contextual awareness necessary to fully assess ethical issues, such as plagiarism, data fabrication, and inappropriate authorship practices.

Conclusions

While AI tools like ChatGPT have demonstrated significant potential in assisting with tasks related to language processing, they are not poised to replace human editors and proofreaders in the scientific domain anytime soon. The absolute need for veracity in scientific writing, the complexity of scientific language, the need for contextual understanding, the nuances of scientific communication, and the importance of critical thinking and ethical considerations are all areas where human expertise remains indispensable. For the foreseeable future, human editors will continue to play a crucial role in ensuring that scientific manuscripts are not only grammatically correct but also clear, logically coherent, and ethically sound. While AI can serve as a valuable tool in the editing process, it is unlikely to fully replicate the depth of understanding and the critical judgment that human editors bring to the task.

References

  1. OpenAI. Introducing ChatGPT. November 2022. Online resource
  2. OpenAI et al. GPT-4 Technical Report. arXiv. 2023; arXiv:2303.08774. DOI: 10.48550/arXiv.2303.08774
  3. Google. Gemini Apps FAQ. November 2023. Online resource
  4. Emsley R. ChatGPT: these are not hallucinations – they’re fabrications and falsifications. Schizophr. 2023; 9:52. DOI: 10.1038/s41537-023-00379-4
  5. Hicks MT, Humphries J, Slater J. ChatGPT is bullshit. Ethics Inf Technol. 2024; 26:38. DOI: 10.1007/s10676-024-09775-5
  6. Zhou L, Schellaert W, Martínez-Plumed F, et al. Larger and more instructable language models become less reliable. Nature. 2024. DOI: 10.1038/s41586-024-07930-y
  7. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023; 388(13):1233-1239. DOI: 10.1056/NEJMsr2214184
  8. Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus. 2023; 15(2):e35179. DOI: 10.7759/cureus.35179
  9. Dolan EW. ChatGPT hallucinates fake but plausible scientific citations at a staggering rate, study finds. Artificial Intelligence. 2024. Online resource
  10. Xu Z, Jain S, Kankanhalli M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv. 2024; arXiv:2401.11817. DOI: 10.48550/arXiv.2401.11817
  11. Huang L, Yu W, Ma W, Zhong W, Feng Z, et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv. November 2023; arXiv:2311.05232. DOI: 10.48550/arXiv.2311.05232
  12. Sato M, Roth E. CNET found errors in more than half of its AI-written stories. The Verge. 2023. Online resource
  13. Weiser B. Here’s What Happens When Your Lawyer Uses ChatGPT. The New York Times. 27 May 2023. Online resource
  14. Chelli M, Descamps J, Lavoué V, Trojani C, Azar M, Deckert M, Raynier JL, Clowez G, Boileau P, Ruetsch-Chelli C. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. J Med Internet Res. 2024; 26:e53164. DOI: 10.2196/53164
  15. Schuurmans D. Memory Augmented Large Language Models are Computationally Universal. arXiv. 2023; arXiv:2301.04589. DOI: 10.48550/arXiv.2301.04589