International Simulation Football League



*What is a Word? - KoltClassic - 05-22-2020

I’ll preface my findings below by saying that this isn’t a critique of any of the various written grading processes on the site. The graders do an awesome job; this is more related to my own curiosity on the subject.

As you know, writing is a core tenet of this and most other sim leagues. Whether you are completing a point task, earning media money, or completing a wiki entry, if you want to have a successful career in the league you will eventually have to put the hypothetical pen to paper and write out some words. And for all of these tasks, the number of words you write has some impact on the result. For point tasks and wiki entries, reaching a certain word count threshold decides whether or not your task is considered successful. If you are writing a media piece, word count decides how much you are paid and whether or not you receive a bonus for your work.

For many users, word count is more of an afterthought. We take five or ten minutes a week to write our point task and confirm that our writing is far enough over the 200 word threshold that we’re guaranteed to get our TPE. But a few of us like to live on the edge a bit more. You may have seen this or similar messages in graded point tasks before:

Let me start with this: Shooting for 200 even is always dicey. I check my PTs with the word counter built into the site, so if you want to be sure of your count, use that.


I was looking through some media posts the other day and stumbled upon this post:
The Philadelphia Liberty's Post Season History

This is a great article that I thought was very well put together! At the end of the article there is a word count included by the author, 1480 words. This is a pretty common occurrence for media authors as far as I have seen. For some reason, for this particular article I figured I’d look into whether or not that word count was consistent with what was added to the bank sheet when it was graded. As it turns out, it wasn’t. Listed in the bank payment sheet is a word count of 1489 words.

This difference piqued my curiosity. How big is the difference between what a user expects to be paid and what a user is actually paid? How does this equate to dollars that a user earns? How much variance is there in payments?

Using the article above, I used a few different tools to see what this variance could look like:

Length given by user: 1480 words
Length given by Microsoft Word: 1480 words
Length given by “Check post length” in jcink (copy/pasted from original article, we’ll get to this later): 1469 words
Length given by “Check post length” in jcink (copy/pasted from wordcounter.net): 1473 words
Length listed in media log for payout (from wordcounter.net): 1489 words
Length given by Mac Pages app: 1503 words
Length given by Google Docs app: 1490 words

This post was paid at 1489 words, or $2,568,210 with a 1.5x media bonus of $1,284,105 for a total of $3,852,315

The lowest this article could have been paid according to my numbers above is $2,530,410 + $1,265,205 = $3,795,615
The most this article could have been paid according to my numbers above is $2,594,550 + $1,297,275 = $3,891,825
The variance for this post’s payout, highest minus lowest, is $96,210
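
The arithmetic above can be re-checked with a short Python sketch. The three base payouts are the figures quoted above; the rule that the bonus is half of the base payout is an assumption inferred from those figures, not something taken from the league's grading rules, and the names `base_payouts` and `total_with_bonus` are made up for the example.

```python
# Base payouts (before bonus) quoted in the post.
base_payouts = {
    "lowest": 2_530_410,
    "paid": 2_568_210,
    "highest": 2_594_550,
}


def total_with_bonus(base: int) -> int:
    # Assumption: the media bonus is half of the base payout,
    # which is what the three quoted figures imply.
    return base + base // 2


totals = {label: total_with_bonus(base) for label, base in base_payouts.items()}
spread = totals["highest"] - totals["lowest"]

for label, total in totals.items():
    print(f"{label}: ${total:,}")
print(f"highest minus lowest: ${spread:,}")
```

Running it gives the same three totals and the same $96,210 gap as the lines above.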

For this post in particular, the variance definitely isn’t a huge deal. The difference in payment equates to less than the payment of one tweet. I will also note that I don’t have Microsoft Word on my computer to validate any of these tests there, but I would encourage you to run them there yourself if you are interested. But my real curiosity is this: where in the text do these variances come from? Below I’ll look at a few examples.

1. Philadelphia/Baltimore
Counts as one word in jcink
Counts as two words in Mac Pages
Counts as two words in Google Docs
Counts as two words in wordcounter.net

An interesting note on this format: If you add spaces between the words and the slash, this updates how jcink perceives the word. “Philadelphia / Baltimore” is considered to be three words to jcink, but it is still considered to be two words to Pages and Google Docs. There is one instance of an un-spaced forward slash used in the article.

2. Thirty-four
Counts as one word in jcink
Counts as two words in Mac Pages
Counts as one word in Google Docs
Counts as one word in wordcounter.net

There are 21 hyphenated words in this article, but there is a difference of only 13 words between Mac Pages and Google Docs. This means that there is something else that Pages is counting as one word, but Google Docs is counting as two…

3. 2018 – Season Three
Counts as four words in jcink
Counts as three words in Mac Pages
Counts as four words in Google Docs
Counts as three words in wordcounter.net

There are 9 instances of this in the article. 21 - 9 = 12, which accounts for all but one word of the 13-word difference we see between Pages and Docs.
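
To make the first three examples concrete, here is a minimal Python sketch of four counting rules. These are not the real algorithms behind jcink, Google Docs, wordcounter.net, or Pages, which I haven't seen; they are hypothetical rules (with made-up function names) that happen to give the same counts as each tool on the three strings above.

```python
import re


def count_whitespace(text: str) -> int:
    # Split on runs of whitespace only; matches the jcink counts above.
    return len(text.split())


def count_slash_aware(text: str) -> int:
    # Also treat "/" as a separator; matches the Google Docs counts above.
    return len([t for t in re.split(r"[\s/]+", text) if t])


def count_slash_aware_alnum(text: str) -> int:
    # Same, but ignore tokens with no letter or digit (a lone dash);
    # matches the wordcounter.net counts above.
    tokens = re.split(r"[\s/]+", text)
    return len([t for t in tokens if any(c.isalnum() for c in t)])


def count_alnum_runs(text: str) -> int:
    # Count runs of letters/digits, so hyphens and slashes both split;
    # matches the Mac Pages counts above.
    return len(re.findall(r"\w+", text))


examples = ["Philadelphia/Baltimore", "Thirty-four", "2018 \u2013 Season Three"]
rules = [count_whitespace, count_slash_aware, count_slash_aware_alnum, count_alnum_runs]

for text in examples:
    print(text, [rule(text) for rule in rules])
```

The last rule is deliberately crude: it would also split a contraction like "don't" in two, which a real word processor most likely does not do. The only point is that small differences in what counts as a separator are enough to produce the spread seen above.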

4. An empty page
Counts as one word in jcink
Counts as zero words in Mac Pages
Counts as zero words in Google Docs
Counts as zero words in wordcounter.net

This one is pretty ridiculous, but if you consider how characters and text are interpreted by computers it makes slightly more sense. I won’t get too into it, but I would encourage you to read a little bit about Unicode, “an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems”. The gist of Unicode’s impact on what I’m looking into here is how computers store and interpret characters, and later how applications like jcink or Google Docs interpret those characters (even the characters that are “empty”).
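
As a small illustration, Python's standard `unicodedata` module can show the code point and official name of the separator characters that come up in this post. This sketch assumes the dash in “2018 – Season Three” is an en dash (U+2013); the helper name `describe` is made up for the example.

```python
import unicodedata


def describe(ch: str) -> str:
    # Code point, official Unicode name, and whether Python treats it as whitespace.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} (whitespace: {ch.isspace()})"


# Slash, hyphen, en dash (assumed), space, no-break space.
for ch in ["/", "-", "\u2013", " ", "\u00a0"]:
    print(describe(ch))
```

To a program, each of these is just a different number, and every tool has to decide for itself which of them end a word.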


5. An “empty” page with one space character (Unicode U+0020)
Counts as two words in jcink
Counts as zero words in Mac Pages
Counts as zero words in Google Docs
Counts as zero words in wordcounter.net

6. An “empty” page with one no-break space character (Unicode U+00A0)
Counts as two words in jcink
Counts as zero words in Mac Pages
Counts as zero words in Google Docs
Counts as zero words in wordcounter.net

I don’t have a good explanation for this; just jcink being jcink.
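
One rule that happens to reproduce all three jcink results, offered as a guess and not as a confirmed explanation, is splitting at every single whitespace character and counting the pieces without throwing away the empty ones. The Python sketch below compares that hypothetical rule with one that discards empty pieces; neither is taken from jcink's actual code, and both function names are made up.

```python
import re


def count_keep_empty(text: str) -> int:
    # Hypothetical rule: split at each whitespace character and count
    # every piece, including empty strings.
    return len(re.split(r"\s", text))


def count_drop_empty(text: str) -> int:
    # Split on runs of whitespace and ignore empty pieces.
    return len(text.split())


cases = {"empty page": "", "one space": " ", "one no-break space": "\u00a0"}
for label, text in cases.items():
    print(label, count_keep_empty(text), count_drop_empty(text))
```

With the first rule an empty string is still "one piece", and a single separator yields an empty piece on each side, hence two. The second rule gives zero in all three cases, like the other tools.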

Bonus weirdness:
Let’s look at the last two paragraphs of the Liberty article, not including the title (2031 - Season Sixteen):
In both jcink and Google Docs, the first and second paragraphs have the same word count when analyzed SEPARATELY:
Paragraph 1: 224 words
Paragraph 2: 122 words

Giving us a total of 346 words.
The odd thing is, when we select `Check post length` in jcink with these paragraphs combined and copy/pasted from the post, we get 345 words.
What is even more odd is what happens if we delete the whitespace between the two paragraphs and make it one big paragraph. Guess what “Check post length” gives us then? 346 words.
wat.
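
For comparison, under a plain whitespace-splitting rule the count for two paragraphs joined together always equals the sum of their separate counts, whether they are joined by a blank line or a single space. So the 345 can't come from that rule alone. A minimal Python sketch, using made-up placeholder paragraphs in place of the real article text:

```python
def count_words(text: str) -> int:
    # Plain rule: split on runs of whitespace.
    return len(text.split())


# Placeholder paragraphs, not the text of the real article.
paragraph_1 = "The Liberty opened the season with a win at home."
paragraph_2 = "They closed it with a loss on the road."

separate = count_words(paragraph_1) + count_words(paragraph_2)
with_blank_line = count_words(paragraph_1 + "\n\n" + paragraph_2)
as_one_paragraph = count_words(paragraph_1 + " " + paragraph_2)

print(separate, with_blank_line, as_one_paragraph)
```

All three numbers come out the same here, which suggests that whatever jcink does with the paragraph break involves more than simple whitespace splitting.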


I don’t have a great conclusion for this, but I thought it was an interesting dive into how word processing applications define what a word is. For more information I would encourage you to look into topics like text segmentation. Language is an extremely complex and interesting subject, and variances like this just scratch the surface of it. I will also say thank you to @Tesla for answering the questions that I’d presented around the grading process; their insight was very valuable in giving me a better look at what grading involves for a media grader.



*What is a Word? - hotdog - 05-22-2020

In linguistics, defining a "word" is one of the trickiest tasks there is, and it gets even harder when you try to have a definition that works across different languages, too. It's super intuitive until you actually try to define it! Good analysis :D