
Bad Horse




Writing: Jorge Luis Borges on character and speech · Jul 2nd, 2024, 12:15am

En mi corta experiencia de narrador he comprobado que saber cómo habla un personaje es saber quién es, que descubrir una entonación, una voz, una sintaxis peculiar, es haber descubierto un destino.

Translation:

In my brief experience as a storyteller, I have found that knowing how a character speaks is knowing who he is; that to discover an intonation, a voice, a peculiar syntax, is to discover a destiny.

Jorge Luis Borges, El ‘Martín Fierro’, Buenos Aires, 1953, p. 11
(found on notatu dignum)

ADDED: I still think there's truth to this. But it only struck me an hour after I posted this that most of Borges' stories have no dialogue at all, and none of his characters have a distinctive way of speaking.

147 views · #writing #quotes #Borges
Comments ( 17 )

While I agree with the sentiment -- a distinct way of speaking gives a lot of insight into a character's, or a person's, inner world -- the addendum makes it feel like a hefty dose of self-irony is involved :twilightsmile:
Which is nice. It's a critical skill in any creator's arsenal, in my humble opinion.

Not just Borges but so many other writers have expressed this sentiment that I think, on the balance of probability, there's something to it as a general principle. I have seen so many variations on "I gave this character a distinct way of speaking, a specific accent or a verbal tic or a set of mannerisms, and suddenly everything else about them blossomed into full flower: their personality, their interior lives, their background, everything." It doesn't work that way for everyone, but it works for enough people that I would class it as sound writerly advice.

I mean... they all have their own voice, but the two Mane Six with the most distinct ways of speaking are Applejack and Rarity, right? I feel like I've read a LOT of stories about those two that anchor themselves in that incredibly distinct way of speaking and then build out from there.

Heh. I was just having a conversation about the ability to show character through nothing but dialogue on a writing server the other day. Totally shared this with them.

He was also famously a translator. This advice is more profound when it's understood in the context of someone who must understand another author's characters well enough to replicate them in a different language and create the same intended feeling.

ADDED: I still think there's truth to this. But it only struck me an hour after I posted this that most of Borges' stories have no dialogue at all, and none of his characters have a distinctive way of speaking.

Borges has a subtle sense of humor, doesn't he? :trollestia:

Which doesn't make what he said WRONG, of course. :twilightsmile:

(Edited to add link to exact source, for fussy people like me:

Jorge Luis Borges

En mi corta experiencia de narrador he comprobado que saber cómo habla un personaje es saber quién es, que descubrir una entonación, una voz, una sintaxis peculiar, es haber descubierto un destino.

Jorge Luis Borges, El ‘Martín Fierro’, Buenos Aires, 1953, p. 11

Also...maybe he is considering the narrator as a character?)

5789708

Also...maybe he is considering the narrator as a character?

Could be. Some narrators are characters. But my recollection is that Borges mostly uses the same dry, detached narrator.

Some random AI commentary on the topic: AI models that generate speech from text are usually split into (at least) three pieces, all in a sequence. The first one processes the text and converts it to some alien number language. The second one converts that alien number language into a spectrogram (a time-frequency picture of the sound, which captures intonation and rhythm). The third one converts the spectrogram into PCM (raw sound data).

To get high-quality results, you need to put orders of magnitude more resources into the first piece, the one for processing the text. That difference can't be explained by any difficulties in converting text to phonemes (like an alphabet, but for spoken sounds) since accurate text-to-phoneme models are significantly smaller. It seems that there's something about how dialogue from a character should be interpreted that's extremely important for modeling how they voice their thoughts. Contrapositively, if you have a good model of how a character voices their thoughts, then you have a good model of how dialogue from that character should be interpreted.
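Here's a minimal sketch of that three-stage split, just to make the data flow concrete. Every class name and number in it is invented for illustration; the "models" are stand-in arithmetic, not any particular real system:

```python
# Minimal sketch of the three-stage TTS split described above.
# All names and sizes are hypothetical stand-ins.
import numpy as np

class TextEncoder:
    """Stage 1: text -> 'alien number language' (a sequence of vectors).
    In real systems this is by far the largest of the three models."""
    def __init__(self, dim=512):
        self.dim = dim
    def encode(self, text: str) -> np.ndarray:
        # Stand-in for tokenization plus a large contextual network:
        # one deterministic pseudo-random vector per input character.
        rng = np.random.default_rng(sum(map(ord, text)))
        return rng.standard_normal((len(text), self.dim))

class AcousticModel:
    """Stage 2: encoder vectors -> mel spectrogram (intonation + rhythm)."""
    def __init__(self, n_mels=80):
        self.n_mels = n_mels
    def to_spectrogram(self, enc: np.ndarray) -> np.ndarray:
        # Stand-in projection; real models also predict duration and pitch.
        proj = np.ones((enc.shape[1], self.n_mels)) / enc.shape[1]
        return enc @ proj

class Vocoder:
    """Stage 3: spectrogram -> PCM (raw sound samples)."""
    def __init__(self, hop=256):
        self.hop = hop
    def to_pcm(self, spec: np.ndarray) -> np.ndarray:
        # Stand-in upsampling; real vocoders synthesize waveforms.
        return np.repeat(spec.mean(axis=1), self.hop)

text = "Knowing how a character speaks is knowing who he is."
enc = TextEncoder().encode(text)            # (chars, 512)
spec = AcousticModel().to_spectrogram(enc)  # (chars, 80)
pcm = Vocoder().to_pcm(spec)                # (chars * 256,)
print(enc.shape, spec.shape, pcm.shape)
```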

Don't worry, I'm sure you can find the answer somewhere in his library.

[I decided this comment from me was boring, and deleted it within three minutes.]

5789899
Some random commentary on that topic: Usually information that's boring in retrospect can be converted into information that's foundational and worth investigating. If it was worth writing, then it's probably linked to something you feel is non-obvious, and if it ended up being boring, then it probably feels very obviously true. To dismiss something that once felt non-obvious based on something that feels obviously true usually means you acquired some new insight into something foundational enough to be clear even to your System 1 brain. That's worth investigating.

5789790

The first one processes the text and converts it to some alien number language.

But isn't this just a standard word embedding? I'd think it wouldn't encode anything about intonation. Does the training data for this network even include audio? Perhaps this step requires the most resources (training samples? nodes? layers? GPU cycles? entropy loss?) simply because there are so many different words.

5789941
They're token embeddings now, not word embeddings, though it's essentially the same concept. Text encoders include per-sequence contextual information, which has been popular since ~2018, so they're not static embeddings. The text encoder(s) include(s) information about emotion through emoji predictions (predict the emojis associated with the input text) and intentions through causal language modeling predictions (predict what comes after the input text). I don't think it's common to train the text encoder on per-character data, but the text encoder is usually trained on a large enough variety of data that it shouldn't matter. Downstream networks should be able to pick out what's relevant per speaker for determining intonation from emotion & intention.
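A toy illustration of the static-vs-contextual distinction (the vocabulary, the sizes, and the single attention layer are all invented for this sketch; real encoders stack many such layers):

```python
# A static token embedding is a pure table lookup; a contextual
# encoder mixes in the surrounding tokens. Toy numbers throughout.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"sugarcube": 0, "darling": 1, ",": 2}
table = rng.standard_normal((len(vocab), 8))   # static embedding table

def static_embed(tokens):
    # Same vector for a token regardless of its neighbors.
    return table[[vocab[t] for t in tokens]]

def contextual_embed(tokens):
    # One round of scaled dot-product self-attention: each token's
    # vector becomes a weighted mix of every token's vector.
    x = static_embed(tokens)
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ x

sa = static_embed(["darling", ","])
sb = static_embed(["sugarcube", ","])
ca = contextual_embed(["darling", ","])
cb = contextual_embed(["sugarcube", ","])
print(np.allclose(sa[1], sb[1]))  # True: the static comma ignores its neighbor
print(np.allclose(ca[1], cb[1]))  # False: the contextual comma does not
```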

5789962 Where do the original text samples, and the emoticons, and the audio, come from? & at what stage is the audio first added to the training set?

Also, re. the original issue of grammar and character, it does seem to me that you can't jump from "we used more nodes in this part of the network" to "this part of the network is therefore critical to intonation". I mean, you can technically, but that wouldn't have any implications for the relevance of your observation to how much of an author's work is in the grammar step vs. the "destiny step". I think we humans perceive the difficulty of a task more as a function of the complexity of the function being repeatedly computed, than of the degree of parallelism (e.g., solving one simple differential equation is mentally "harder" than estimating the average saturation of a picture, even though the latter requires orders of magnitude more computation).
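For what it's worth, here's a toy rendering of that contrast (the saturation estimate is a crude max-minus-min proxy, and the specific numbers are invented): each Euler step of the ODE depends on the previous one, while the pixel average is a single parallel pass over far more numbers.

```python
# Sequential vs. parallel: 1,000 dependent Euler steps on dy/dt = -y
# versus one reduction over three million random "pixel" values.
import numpy as np

# Sequential: solve dy/dt = -y, y(0) = 1; step n+1 needs step n.
y, dt = 1.0, 0.001
for _ in range(1000):
    y += -y * dt

# Parallel: mean saturation of a million pixels in one shot
# (saturation here is the rough max-minus-min-per-pixel proxy).
pixels = np.random.default_rng(1).random((1000, 1000, 3))
saturation = pixels.max(axis=2) - pixels.min(axis=2)

print(y, np.exp(-1))        # Euler estimate vs. exact e^-1
print(saturation.mean())
```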

I'm a fan of Borges's short stories, like those in The Aleph, and especially the one where the narrator embarks on a quest for immortal people, only to discover that they're a small group of bored people who have fallen into depression and depravity.

He has a specific approach to the supernatural and the macabre which differs widely from that of the American gothic movement (Poe, Lovecraft…), although some writings by Robert Bloch ("The Pin") can get close to it.

In any case, he's probably the only South American writer I'm aware of.

5790038 You probably know of Gabriel García Márquez. If not, you'd probably like his story "A Very Old Man with Enormous Wings".

5789991

Where do the original text samples, and the emoticons, and the audio, come from? & at what stage is the audio first added to the training set?

As of a couple years ago, the original text came from some unspecified chunk of the internet (I think mostly Common Crawl), and the emoticons came from Twitter. There are two kinds of audio data: generic audio data from whatever sources are available (I think mostly YouTube), and speaker-specific audio data from whatever sources have the speaker you want to clone the voice of. The generic audio data is used to train the spectrogram-to-PCM model. The speaker-specific audio is used to train the text embeddings-to-spectrogram model. Since these models are usually trained modularly (not end-to-end), the embeddings-to-spectrogram model needs to be trained after the text embeddings model. There's no other cross-model dependency.
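A sketch of that training order, with entirely hypothetical function names; the only real constraint it encodes is the one dependency mentioned above:

```python
# Modular (non end-to-end) training order as described above.
# The vocoder and the text encoder are independent of each other;
# only the acoustic model has to wait for the text encoder.

def train_text_encoder(web_text, tweets_with_emoji):
    """Emoji prediction + causal LM on generic internet text."""
    ...

def train_vocoder(generic_audio):
    """Spectrogram -> PCM, on whatever audio is available."""
    ...

def train_acoustic_model(text_encoder, speaker_audio):
    """Frozen text embeddings -> spectrogram, on speaker-specific clips."""
    ...

encoder = train_text_encoder(web_text=..., tweets_with_emoji=...)
vocoder = train_vocoder(generic_audio=...)
acoustic = train_acoustic_model(encoder, speaker_audio=...)  # trained last
```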

Re: the relationship between the scale of a model and the complexity of the task it solves. I think the main functional difference between small models and big models* isn't the level of parallelism, it's the volume of space it linearizes. Well-trained models are locally linear on the domain they're trained on. By that, I mean if you make very small changes to every single weight to make the model "a little better" on subset X, then do the same for subset Y, you get the same result as if you did it on X disjoint-union Y. With small models, a small change to every single weight means you've moved a small distance from the original weight values. With large models, a small change to every single weight can easily mean you've moved a large distance from the original weight values.
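A toy numerical check of that claim, using an invented linear model with a summed squared loss: gradients of a sum of losses add, so for a tiny step size, stepping on X and then on Y lands approximately where one step on the disjoint union does.

```python
# One small gradient step on subset X, then one on subset Y, vs.
# one step on their disjoint union. Agreement is exact to first
# order in the step size; the residual is O(lr^2).
import numpy as np

rng = np.random.default_rng(2)
w0 = rng.standard_normal(4)                 # tiny "model": 4 weights

def grad(w, data):
    # Gradient of the summed squared loss  sum_i (w.x_i - y_i)^2.
    X, y = data
    return 2 * X.T @ (X @ w - y)

X_part = (rng.standard_normal((5, 4)), rng.standard_normal(5))
Y_part = (rng.standard_normal((5, 4)), rng.standard_normal(5))
union = (np.vstack([X_part[0], Y_part[0]]),
         np.concatenate([X_part[1], Y_part[1]]))

lr = 1e-3
w_xy = w0 - lr * grad(w0, X_part)           # step on X...
w_xy = w_xy - lr * grad(w_xy, Y_part)       # ...then on Y
w_u = w0 - lr * grad(w0, union)             # one step on X disjoint-union Y
print(np.abs(w_xy - w_u).max())             # ~0 for small lr
```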

I'm not sure if that was clear, but the point is: I think you're right. Your version (complexity modulo recursion) seems like a better description of difficulty than mine (reduction of nonlinearity), and the fact that text models don't take great advantage of recursion could lead to them unnecessarily ballooning in size. I think it's probably true that even if text models took good advantage of recursion to reduce their size, they'd still be at least one order of magnitude larger than the audio-specific networks, but that's not something we can determine yet.

* This is specifically for large transformers. I don't know if this has been demonstrated for other model architectures. The fact that this explanation of "what a model does" can carry from the weights of a model to the data it operates on might also be specific to transformers. In both cases, I suspect it applies to any neural network, not just transformers.

5790160
I don't understand what you wrote.

I think the main functional difference between small models and big models* isn't the level of parallelism, it's the volume of space it linearizes.

Here, I'm confused because I would think that a small model of a linear phenomenon would linearize the entire space, no matter how large; while a big model of a nonlinear phenomenon would linearize little of the space. That is, how much linearization is done depends on the input data more than on the model. I don't know if you're saying that the small model linearizes a big space, or the big model does. Put another way, I don't know if when you say "the volume of space" you mean "the absolute volume of the space" or "the fraction of the space".

I'm not even clear on what you mean by "linearizes". Do you mean "do PCA on the input dimensions to create a new space in which linear discrimination works better"? If the data is nonlinear, it can't really be linearized, so it sounds like you're saying that neural networks don't work well on nonlinear data, which is backwards.

Well-trained models are locally linear on the domain they're trained on.

All neural network models that don't use step activation functions are locally linear, so what is distinct about well-trained models?

By that, I mean if you make very small changes to every single weight to make the model "a little better" on subset X, then do the same for subset Y, you get the same result as if you did it on X disjoint-union Y.

Small changes going in usually map onto small changes coming out for any continuous function; and your criterion calls for the model to act similarly on X ^ Y as on X exclusive-or Y. So why isn't that called either continuous or locally independent?

Your version (complexity modulo recursion) seems like a better description of difficulty than mine (reduction of nonlinearity)

Does "complexity modulo recursion" mean "complexity when lacking recursion"?

I think it's probably true that even if text models took good advantage of recursion to reduce their size, they'd still be at least one order of magnitude larger than the audio-specific networks,

I still don't know how you're measuring largeness.

The fact that this explanation of "what a model does" can carry from the weights of a model to the data it operates on

"this explanation of what a model does" = "neural networks linearize their input data": This seems like a very wrong explanation of neural networks! That's what perceptrons do, & why they don't work. Also, I just don't know what you mean by "carry from the weights of a model to the data it operates on".
