I love this post for a lot of reasons. While I understand the interest in having an AI write in "your voice," to an extent this is mostly a concern for professional writers who actually have a corpus of content and a voice and style capable of being emulated. My fear is twofold: that students will try to use academic texts from their research as the basis of the style they want for their own work, and that they will simply use the better-written output as a substitute for their own writing. My own sense, based on what I'm seeing, is that most students would not go to the lengths of uploading their own work just to have an AI write in their voice. Any student capable of that level of sophistication with AI is probably not one we need to worry about in terms of academic integrity.

But to me, the more interesting issue with models like Claude 3 and all the rest of them going forward (Marc, what have you heard about GPT-5?) is the increasing sophistication in working with texts: not just producing output, but interacting with a text by querying it, getting ideas, probing conclusions, and simply having a conversation with the AI about it. Prior to Claude 3, I was not having great experiences with AI's ability to comb granularly through lengthy PDFs in a helpful way, but I have found the newest model much, much better. For example, as a big fan of Marc's Substack, it would be an interesting experiment just to upload a bunch of his most provocative columns and have a conversation about them with the AI.

Those are some positive uses I can see, but as with any powerful technology, the downsides may ultimately outweigh the productive applications. I do think (ironically, given the points made in the Claude-generated post!) that there is indeed a reckoning coming, and schools will not be able to ignore the issue much longer.
Thanks, Steve! I think Claude 3 and Google's Gemini use something called long-context understanding, versus the retrieval-augmented generation (RAG) approach that ChatGPT and Microsoft use to talk to long documents. The latter is pretty weak, while the newer long-context approach is supposedly more precise.
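To make that contrast concrete, here is a toy sketch of the two approaches. The chunk size, the word-overlap "retrieval," and the prompt wording are illustrative stand-ins (real RAG pipelines use vector embeddings and a separate index), not any vendor's actual implementation.

```python
# Toy sketch of the two ways a chatbot can "talk to" a long document.
# Word-overlap retrieval stands in for real embedding-based retrieval.

def chunk(document: str, size: int = 200) -> list[str]:
    """Split the document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank chunks by how many question words they share."""
    q = set(question.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]

def rag_prompt(document: str, question: str) -> str:
    """RAG-style: the model only ever sees a few retrieved snippets."""
    snippets = retrieve(chunk(document), question)
    return "Answer using only these excerpts:\n" + "\n---\n".join(snippets) + f"\n\nQ: {question}"

def long_context_prompt(document: str, question: str) -> str:
    """Long-context style: the entire document goes into the prompt."""
    return f"Here is the full document:\n{document}\n\nQ: {question}"

if __name__ == "__main__":
    doc = "..."  # imagine several long columns or a whole PDF pasted here
    q = "What does the author argue about AI detectors?"
    print(rag_prompt(doc, q))
    print(long_context_prompt(doc, q))
```

The practical difference is that the long-context model gets to read everything at once, while the RAG model only sees whatever the retrieval step happened to pull, which is where precision tends to suffer.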
I'm glad you shared your experiment, because I just did something similar with Claude and had markedly different results. I used it for the column I write every week for the Chicago Tribune: short (600-word), single-topic pieces that get in and get out of the subject in a way that (hopefully) offers something intriguing, but which doesn't have the space for serious exploration. I did the same test I'd done previously with GPT-4, which had produced a sort of uncanny-valley version of me that was fairly terrible, and Claude (the most advanced model) was even worse; that uncanny-valley sense was kicked up another notch, to the point of parody. I don't know if this is something about the model or my "voice" or what, but lots of people had told me Claude was better at sounding human, and in my specific case it was markedly worse, at least as I perceive my own voice. Any idea what's happening here?
How many sample texts from your column did you use? In my limited experiments, Claude needed around 5,000+ words of my own writing copied into the prompt to start mimicking my style; that was around five Substack posts. The output isn't me, and I can tell, but there are moments that are eerie, and it's more than enough to fool the detectors.
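For anyone wondering what "copying 5,000 words into the prompt" might look like in code rather than in the chat window, here is a minimal, hypothetical sketch using the Anthropic Python SDK. The file name, word budget, and instructions are my own assumptions for illustration, not a description of how anyone in this thread actually prompted Claude.

```python
# Hypothetical sketch of the "paste ~5,000 words of your own writing into the
# prompt" approach, via the Anthropic Python SDK. File name, word budget, and
# instruction wording are illustrative assumptions.
import anthropic

WORD_BUDGET = 5000  # roughly the amount of sample text that seemed necessary

with open("my_substack_posts.txt") as f:  # hypothetical file of your own posts
    samples = " ".join(f.read().split()[:WORD_BUDGET])

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1200,
    messages=[{
        "role": "user",
        "content": (
            "Here are samples of my writing:\n\n" + samples +
            "\n\nWrite a 600-word column on AI detectors in my voice and style."
        ),
    }],
)
print(message.content[0].text)
```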
I did the same 10 example columns with GPT (which I did by trying to make my own "agent") and with Claude, so that would be right around 6,000 words, give or take. I didn't put either output through a detector to see if it would fool one, though I assume both could pass, given how difficult detection is. I was just struck by how, for lack of a better word, "weird" both sounded as a whole, even though the outputs seemed vaguely like me. It really was that sense of an uncanny valley between me and my doppelgänger.
I think authors know their words and stylistic tics well enough to note pretty quickly when something sounds and feels off. To me, it was like working with an editor who knew my work and style but didn't quite understand how I think. That last part is crucial--this is only mimicry, and we really have to hammer that message home to users.
Mimicry is the exact right word. Close enough to be plausibly human, but not close enough to fool me, or even someone who has read my work (I think).
John, Marc: How far apart did you two do these experiments?
It's possible that there was an update to Claude in between that lowered its quality. It'd be interesting for Marc to run the same experiment again, today, to see if it returns the same results as before. I would bet you'd see lower quality.
In some cases--though I think this is more ChatGPT than Claude--quality can vary depending on load/time of day/etc.
Excellent work here, thank you.
ChatGPT was November and Claude was three weeks ago or so. I signed up for Claude because people were touting its writing abilities as compared to GPT.
My only theory is that my own process involves so much discovery during drafting that, even though I'm prompting with my voice and a reasonably extensive description of the subject I'm exploring, the model doesn't have access to that discovery process and so doesn't have enough to work from; it fills the space with amped-up B.S.
I feel a little crazy after reading that because I don't think the output sounded like you or like a human at all. It was clunky and ponderous and just so obvious in its cause/effect writing. I thought it was terrible. If you had written the following sentence unironically, I'd unsubscribe and block: "It's long past time to bring non-tenure track educators out of the triage unit and into the center of a proactive, collaborative process to develop holistic AI literacy and ethics programs."
Ha! I tried to mark areas that really didn't sound like me, and I think I hit on three words or phrases within that sentence alone that didn't sound like me at all. That's important to note, but the overall effect of reading the piece was still unnerving. There were more than a few moments where I had to look back to see whether the model was generating material or copying it from my posts.
I thought the most important thing was that it still felt like it was just "using a lot of words with little point," while you actually mean something with your words.
That sentence you quote is a great example of that sensation of the uncanny valley. Like, a human might write that sentence, but I don't want to know them if they did. I had the same sensation when I tried this experiment with GPT-4. It was sort of me, but not me in ways that made it violently unlike me. https://biblioracle.substack.com/p/gpt-and-the-writing-uncanny-valley
I've always liked Charlie Stross's line that LLMs don't answer questions, they give you a "lump of text in the shape of an answer." That essay reads to me like a lump of text in the shape of an essay by Marc Watkins. Even if you take out or fix the emboldened words, there are still lumps of botshit plainly visible.
Of course, what we really need is an experiment that asks 100 readers to read two essays side by side, one by Marc and one by Marcbot, and asks them to tell the difference. If we want to get fancy, add some controls where both essays are by Marc or both by Marcbot. I'd be willing to bet that we are safely on the side of most readers (and all the careful ones) being able to pick out the Marc-written essays. The question isn't when LLMs will reach parity with excellent writers; the question is whether such a thing is even possible with transformer technology.
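If anyone ever ran that study, the analysis could be as simple as a binomial test. Here is a minimal sketch, assuming the basic design where each of the 100 readers sees one Marc/Marcbot pair and guesses which is Marc's; the count of correct guesses below is invented purely for illustration.

```python
# Hedged sketch: analyzing the hypothetical Marc-vs-Marcbot reader study.
# Assumes each reader sees one pair and guesses which essay is Marc's;
# the number of correct guesses is made up for illustration.
from scipy.stats import binomtest

n_readers = 100        # hypothetical sample size
correct_guesses = 78   # invented result: readers who picked the real Marc essay

# Under pure guessing we'd expect roughly 50 correct out of 100.
# A one-sided binomial test asks how surprising 78/100 would be by chance.
result = binomtest(correct_guesses, n_readers, p=0.5, alternative="greater")
print(f"{correct_guesses}/{n_readers} correct, p-value = {result.pvalue:.2e}")
# A tiny p-value would support the bet that readers can pick out the real Marc.
```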
Here's something someone might wish to write about.
Trying to manage and adapt to emerging technologies one by one by one seems like a loser's game, because an accelerating knowledge explosion is producing challenging new technologies faster than we can figure out how to deal with them.
My favorite example here is nuclear weapons, because they are so very important, and after 75 years we still don't have a clue what to do about them. And while we've been scratching our heads about nukes, the knowledge explosion has handed us AI and genetic engineering, which we also have no clue how to make safe. And the hits keep right on coming...
Educated people should be shifting some focus from particular challenging technologies to the machinery that is generating all of them: the knowledge explosion. If we don't learn how to take control of the assembly line, there's really little hope that we'll be able to keep up with, and successfully manage, all the products rolling off the end of it.
By the time we figure out what to do about ChatGPT and Claude they will be yesterday's news, and whatever comes next will likely make any solutions we've found irrelevant.
We shouldn't be standing at the end of the assembly line trying to keep up with an endless parade of new inventions, we should be standing at the other end, where the control panel is.
Great piece.
Last October, I was debating a very similar subject with a friend who's a database expert--basically, is it cheating to use ChatGPT to develop one's ideas, get drafts out, and iterate on an argument? His point was that the old adage of "garbage in, garbage out" still held true. As in, you couldn't use ChatGPT to develop an idea if you didn't understand the underlying concepts in the first place.
I believe that to still be true, though less so than it was. And I can see a path where it gets incrementally less true.
There's something in my gut, however, that says the fundamental creativity involved in writing will always be a human domain. I can't explain that, and it is likely a result of the uncanny valley feeling I've gotten when I've asked the machine to try to write like me. But I hope it is true.
I can help you with that: humans create based on experiences, and experiences aren't just words. So insofar as data goes, we have a lot more context to draw on.
AI is just using a very complex "world" of words, which can be terribly similar to what we think reality is and yet isn't--not least because we don't even know what reality is. And at every opportunity, AI will shift toward the "most common phrase," leading to "let's level-set the expectations" and other weirdness.
In this sense, our smallness is almost an advantage.
Yes, if you train even a simple language model on a bit of your text, it will sound like you, because it learned from you. You don't even need an AI to do this; a directed network is enough.
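One way to read "a directed network is enough" is a first-order Markov chain over words: each word points to the words that have followed it in your writing, and you generate text by walking that graph. This is my interpretation for illustration, not the commenter's stated method, and the sample text is a placeholder.

```python
# Minimal sketch of the "directed network" idea: a first-order Markov chain
# over words, built from a small sample of someone's writing. Sample text is
# a placeholder; a real test would use several thousand words of your prose.
import random
from collections import defaultdict

sample = (
    "Writing with an AI feels uncanny because the words are close to mine "
    "but the thinking behind the words is not mine at all."
)

# Build the directed network: each word maps to the words that follow it.
transitions = defaultdict(list)
words = sample.split()
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

def generate(start: str, length: int = 12, seed: int = 0) -> str:
    """Walk the network, picking a random observed successor at each step."""
    random.seed(seed)
    out = [start]
    for _ in range(length):
        followers = transitions.get(out[-1])
        if not followers:  # dead end: no observed successor
            break
        out.append(random.choice(followers))
    return " ".join(out)

print(generate("the"))
```

Even this crude walk echoes the source's vocabulary and phrasing, which is the commenter's point: sounding like you is cheap; thinking like you is the hard part.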