
Bad Horse


Beneath the microscope, you contain galaxies.



Ficdom Structure 2: Author heat map · 2:07am Jun 9th, 2017

In "Author clusters question", I wrote about finding clusters of authors on fimfiction using data on who followed them.  What I'm going to show you now is less sophisticated, but visually easier to grasp.  I actually did this first, back in November.  See that earlier post for the derivation of m2(A,B), my measure of distance between authors.

(I never explained the adjustment to the ratio in that post; see Math Stuff at the bottom of this post if you care.)

This was my first attempt to find some structure or clustering among the authors on fimfiction, using scraped data from 2015 on who people follow.  I selected the 54 authors who had at least 4 stories and 1000 followers, and who followed at least 10 other authors meeting those requirements. (The 10-other-authors requirement was there because at first I was studying which authors followed which other authors.)
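For the curious, here is a minimal Python sketch of that selection rule. It is not the original code: the function name and the three dictionaries are hypothetical stand-ins for the scraped data, and it assumes a single pass (qualify on stories and followers first, then count follows within that qualifying set).

```python
def select_authors(story_counts, follower_counts, follows,
                   min_stories=4, min_followers=1000, min_followed=10):
    """Return the set of authors passing all three thresholds.

    story_counts / follower_counts: dict author -> int (hypothetical inputs)
    follows: dict author -> set of authors that author follows
    """
    # Step 1: authors with enough stories and enough followers.
    qualifying = {
        a for a, n in story_counts.items()
        if n >= min_stories and follower_counts.get(a, 0) >= min_followers
    }
    # Step 2: keep those who follow enough *other* qualifying authors.
    # ASSUMPTION: one pass only; the qualifying set is not re-filtered.
    return {
        a for a in qualifying
        if len((follows.get(a, set()) & qualifying) - {a}) >= min_followed
    }


# Toy data with a lowered follow threshold so the example stays small.
story_counts = {"a": 5, "b": 4, "c": 2, "d": 6}
follower_counts = {"a": 1500, "b": 1000, "c": 5000, "d": 999}
follows = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "b"}}
selected = select_authors(story_counts, follower_counts, follows, min_followed=1)
```

Here "c" fails on story count and "d" on follower count, so only "a" and "b" survive.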


Heat map

I made a heat map from the distances between them using the R gplots library function heatmap.2, using default parameters. Redder values indicate authors are closer together.  heatmap.2 uses these values to build a dendrogram, drawn above and to the left, which clusters authors with similar distance vectors. It does this using hierarchical clustering with complete linkage.
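To make "hierarchical clustering with complete linkage" concrete, here is a small pure-Python sketch. It is an illustration of the technique, not what heatmap.2 runs internally; it assumes Euclidean distance between rows (which, as far as I know, is the default of R's dist), and the function names are made up for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(rows):
    """Agglomerative clustering; returns a nested-tuple dendrogram of row indices."""
    # Each cluster is (subtree, list of member row indices).
    clusters = [(i, [i]) for i in range(len(rows))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two *farthest* members.
        return max(euclidean(rows[i], rows[j]) for i in c1[1] for j in c2[1])

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = ((clusters[p][0], clusters[q][0]),
                  clusters[p][1] + clusters[q][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0][0]


# Toy distance matrix: authors 0 and 1 are close, as are 2 and 3.
rows = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
tree = complete_linkage(rows)
print(tree)  # ((0, 1), (2, 3))
```

Note that each row is an author's whole vector of distances to everyone else, so two authors end up together when they are near and far from the same people.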

Fig. 1. Heat map showing “distance” between authors.

Unfortunately, the algorithm never produces the tightest clusterings, because which branch of a binary tree it shows on the left and which on the right is arbitrary.  It could be improved a lot by taking into account the similarity of the leftmost part of a subtree to the rightmost part of whatever subtree is displayed to its left. By looking for sharp color boundaries that match up to deep cuts in the tree, you can see that the very first split in the tree should be flipped left/right, moving the red block at bottom left (called the “gore” block below) to the top right, and that would bring the 4 red areas in the 4 corners together.
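The improvement described above could look something like the following greedy sketch. This is a hypothetical heuristic, not anything heatmap.2 does and not an optimal leaf-ordering algorithm: at each node it tries the four flipped/unflipped orientations of the two subtrees and keeps whichever puts the most similar pair of leaves next to each other at the junction. The dendrogram is assumed to be a nested tuple of leaf indices, and `dist` a square distance matrix.

```python
def order_leaves(tree, dist):
    """Greedy left-to-right leaf order for a nested-tuple dendrogram."""
    if not isinstance(tree, tuple):
        return [tree]  # a single leaf
    left = order_leaves(tree[0], dist)
    right = order_leaves(tree[1], dist)
    # Reversing a subtree's leaf order is the same as flipping every node in it,
    # so all four candidates below are valid drawings of the same tree.
    candidates = [
        (l, r)
        for l in (left, left[::-1])
        for r in (right, right[::-1])
    ]
    # Keep the orientation whose touching leaves are closest together.
    l, r = min(candidates, key=lambda lr: dist[lr[0][-1]][lr[1][0]])
    return l + r


# Toy example: leaves 0 and 3 are close, so they should end up adjacent.
dist = [
    [0, 1, 9, 2],
    [1, 0, 8, 9],
    [9, 8, 0, 1],
    [2, 9, 1, 0],
]
order = order_leaves(((0, 1), (2, 3)), dist)
print(order)  # [1, 0, 3, 2]
```

The clustering itself is untouched; only the drawing order changes, which is exactly the freedom the default layout ignores.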

The red area in the upper-right I’ll call the "popular authors". Their stories usually have the comedy, romance, adventure, or mature+sex tags.  Here also be crackfic territory.  The 4x4 dark red block in that block’s center shows the similarity of the set {Marshal Twilight, The Abyss, Shakespearicles, TittySparkles} to itself.  It's the core of clop in this dataset.

The large red block near the lower-left I’ll call the "cool kids" character authors.  Notice Bad Horse is next to Cold in Gardez and Skywriter.  Says Science! :rainbowdetermined2:

See the small red block in the very lower-left?  That shows that {Vengeful Spirit, Pedro Hander, ed2481, Tatsurou, Distorted Flare, MadMaxtheBlack} all have high similarity to each other.  They're authors who write stories with the gore tag.

See the nearly-white blocks above and to the right of that red block?  That shows that people who read the gore authors are unlikely to read {GhostOfHeraclitus, Friendly Uncle, AbsoluteAnonymous, Cloudy Skies, kits}, who usually write sweet stories.  Hardly surprising.

If you follow horizontally over to the right from the gore x gore red block, you'll see it's mostly white & yellow until you come to Mr101, who marks the left edge of the "popular authors".  A taste for gore and a taste for popular authors go together, while a taste for gore and a taste for character-based fiction do not.

Is Science unjust?  Are there authors in one block you think belong in another?  Write down their names in the comments, and then we'll see what the next graph has to say in another post.


Math stuff

At first, I computed the sample probability that a user who watched user B would watch user A:

P(watches(X,A) | watches(X,B))

That didn't work well--it turned out that the number of watches that users make does not have a Poisson distribution.  Instead, some people are just more likely to follow people in general.  Most follows are made by a small number of people who follow lots of people.

BUT, those people who follow lots of people can follow at most 10 of the 10 most-popular authors.  Do you see the problem?

No?

Well, I didn't either, but the data eventually made it clear: People who follow just a few people follow a few popular authors, and a few other people.  People who follow lots of people follow a few popular authors, and lots of other people.  SO, if user X follows user B, and user B is not at all popular, user X is probably one of those people who follows a lot of people--and that means P(watches(X,A) | watches(X,B)) is a lot higher if B is not popular than if B is popular.

I had hoped that P(watches(X, Bad Horse) | watches(X, Cold in Gardez)) would be high, but it wouldn't, because CiG has so many followers that only a small percentage of them are people who follow lots of people.  Whereas P(watches(X, Bad Horse) | watches(X, shitfic_author_1337)) would be higher just because the 3 people who watched shitfic_author_1337 each watched thousands of people.
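A toy example makes the bias easy to see. The sketch below estimates P(watches(X,A) | watches(X,B)) as a simple sample fraction; the function name, the author names, and the numbers are all made up for illustration and are not from the scraped data.

```python
def p_watch_given_watch(watchers, a, b):
    """Sample estimate of P(watches(X,A) | watches(X,B)).

    `watchers` maps each author to the set of users who watch them.
    """
    watchers_b = watchers[b]
    if not watchers_b:
        return None  # undefined when nobody watches B
    return len(watchers[a] & watchers_b) / len(watchers_b)


# Made-up data: 3 users who watch everybody,
# and 97 users who watch only the popular author.
heavy = {f"heavy{i}" for i in range(3)}
light = {f"light{i}" for i in range(97)}
watchers = {
    "popular_author": heavy | light,
    "obscure_author": set(heavy),
    "target_author": set(heavy),
}

p_given_popular = p_watch_given_watch(watchers, "target_author", "popular_author")
p_given_obscure = p_watch_given_watch(watchers, "target_author", "obscure_author")
print(p_given_popular, p_given_obscure)  # 0.03 1.0
```

The target author looks far "closer" to the obscure author than to the popular one, purely because the obscure author's only watchers are people who watch everybody.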

So I redefined my measure as a likelihood ratio:

ratio(A,B) = P(watches(X,A) | watches(X,B)) / [P(watches(X,A) | numOfWatches(X) = ave(numOfWatches(Y) | watches(Y,B)))]

where numOfWatches(X) is the number of users that user X watches.  This normalizes the earlier probability by how likely a user with the typical follow count of B's watchers is to watch A.  It messes up the math and coding, though, because a probability always falls in [0..1], whereas a ratio of probabilities can be any non-negative number.
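Here is one way the ratio might be computed, as a hedged sketch rather than the actual implementation. The post does not say how the denominator was estimated, so this version makes a loud assumption: "numOfWatches(X) = average" is approximated by the users whose watch count lies within a relative `tolerance` of that average. The function name, the `tolerance` parameter, and the data layout are all hypothetical.

```python
def ratio(follows, a, b, tolerance=0.5):
    """Likelihood ratio for author A given author B.

    `follows` maps each user to the set of authors that user watches.
    Returns None when the ratio is undefined.
    """
    watchers_b = [u for u, s in follows.items() if b in s]
    if not watchers_b:
        return None
    # Numerator: P(watches(X,A) | watches(X,B)) as a sample fraction.
    numerator = sum(a in follows[u] for u in watchers_b) / len(watchers_b)

    # Average number of watches among the users who watch B.
    avg = sum(len(follows[u]) for u in watchers_b) / len(watchers_b)

    # ASSUMPTION: users "with that many watches" are those whose watch
    # count is within tolerance * avg of the average.
    peers = [u for u, s in follows.items()
             if abs(len(s) - avg) <= tolerance * avg]
    if not peers:
        return None
    denominator = sum(a in follows[u] for u in peers) / len(peers)
    if denominator == 0:
        return None
    return numerator / denominator


# Made-up data: u1..u3 each watch four authors, u4 watches one.
follows = {
    "u1": {"A", "B", "C", "D"},
    "u2": {"B", "C", "D", "E"},
    "u3": {"A", "C", "D", "E"},
    "u4": {"A"},
}
r = ratio(follows, "A", "B")
```

In this toy data half of B's watchers watch A, while two thirds of comparably active users do, so the ratio comes out below 1: watching B makes you slightly less likely than your peers to watch A.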

Comments ( 19 )

This is what Vulcan Fantasy Football looks like. :twistnerd:

Cool. I’ve always been fascinated by probability, although my math skills top out at multiplication, and only with a calculator.

I'm not on the chart, probably because I broke his algorithm. :derpytongue2:

Quick nag: the scale seems backwards to me. I’d expect you’re comparing author similarity, in which case 1.0 would be identical. This suggests to the reader that you’re scoring dissimilarity, which is rather strange.

Also, the map shows that Skywriter, Cold In Gardez, and Bad Horse appear to be very highly similar.

How suspiciously flattering... :trixieshiftright:

:trollestia:

4565310 A distance metric is a dissimilarity metric. I want next to draw a map of author-space, where being close together on the map means being similar. Can’t do that with a similarity metric.

4565320
Derp, normalized difference as a dissimilarity metric. I get it. :twilightsheepish:

4565274
I think it’s just assumed your row and column (except for you) are entirely snow-white.

4565274 Time to start that SilverPip's Wasteland Journal you were thinking about.

Neat. The heatmap does a really nice job of summarizing a lot of information about the similarities between authors' audiences. Are the clusters from the hierarchical clustering pretty similar to the ones you found earlier by PCA and k-means clustering? It looks to be the case, but it'd be interesting to see what the differences were.

As someone not featured in your study, I reject science. :flutterrage:

More seriously, very cool. Glad to see a larger version of this posted.

Bad Horse is next to Cold in Gardez and Skywriter

We must not have seen you coming.

Huh. I’m pretty similar to Chuck Finley, Eakin and WandererD

I’m pretty chuffed!

You need better colors. Everything from .3 to .0 looks identically red.

4565653

A grand grouping. Meanwhile, all I can do is bask in a tiny sliver of the reflected glory of my low bacon number to all of you.

Actually now I think of it, personal relationship distance between authors would be an interesting thing to quantify. It'd also be a pain in the arse to quantify.

4565963

Actually now I think of it, personal relationship distance between authors would be an interesting thing to quantify. It'd also be a pain in the arse to quantify.

Not so hard. Number of comments on each others' stories. If you ran the website, you could also count PMs between people.

Like it, easily grokked. I'm with 4565667 in thinking you need a greater color range.

Whom might you add?
Maybe some people who definitely meet the criteria now and were around then, with a good number of stories but below the follower count. Granted, can’t catch full-scale data for them, but could catch what their early adopters have in common with those who already qualified. People I can think of: Horizon, Estee, GaPJaxie, Ponydora Prancypants, Titanium Dragon, Present Perfect.
Get a reviewer/blogger block (though even now many of these would have <1000 followers). Examples: Titanium Dragon, Present Perfect, Chris, JohnPerry, Bradel, RBDash47.

4565328

I think it’s just assumed your row and column (except for you) are entirely snow-white.

As pure as the driven snow. . . .:rainbowlaugh:

4565358

Time to start that SilverPip's Wasteland Journal you were thinking about.

That would be an interesting project.

4566551
4566891
I strongly suspect that I'd end up in the big red block in the middle-lower-left. I may do a lot of reviewing, and I may follow the other reviewers, but the authors in the upper-right hand block, with only a few exceptions (Rainbow Bob, Rated Ponystar, Bad_Seed_72), have almost uniformly not been reviewed by me at all. Conversely, I've reviewed almost everyone in that mid-lower-left block (only Alexstraza and SleeplessBrony have not been reviewed by me - clearly major blind spots for me).

I haven't read a single person in the far lower-left block other than Vengeful Spirit.

I have read a few stories by The Abyss but I have never reviewed any of their stuff.

My guess would be that my taste in what I read and my taste in what I write overlap pretty heavily, so I'd expect to share a lot of followers with most of the lower-left crew.

Though I guess I did get some followers from Rainbow Bob way back in the day.
