representing text

We know we can represent numbers using combinations of marbles and holes. Now we’ll look at representing text.

encoding

Previously we built a dictionary that let us interpret strings like 1101010 as numbers. Now we’ll stack a second dictionary on top that translates between numbers and letters. By chaining these two dictionaries together, we’ll be able to translate from strings, into numbers, and then into letters.

technical

A common theme with computers is chaining together multiple steps of translation and interpretation. This is what lets us climb from very simple things like bits to very complicated things like documents, videos, and games.

This dictionary just maps each letter to its position in the alphabet:

Number  Letter
0       A
1       B
2       C
3       D
...     ...
25      Z

This means we can interpret a number as a letter.
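As a rough sketch, the dictionary above could be written in Python like this. The names `NUMBER_TO_LETTER` and `number_to_letter` are made up for illustration, and the sketch assumes the 26 uppercase English letters.

```python
import string

# The dictionary from the table: position in the alphabet (starting at 0) -> letter.
NUMBER_TO_LETTER = dict(enumerate(string.ascii_uppercase))

def number_to_letter(number):
    """Interpret a number from 0 to 25 as a letter."""
    return NUMBER_TO_LETTER[number]

print(number_to_letter(0))   # A
print(number_to_letter(3))   # D
print(number_to_letter(25))  # Z
```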

putting it together


twenty two

  • The disk has the pattern marble, hole, marble, marble, hole
  • … which represents the string 10110
  • … which encodes the base two number 10110
  • … which equals the number known in English as “twenty-two”
  • … which we then interpret as the 23rd letter of the alphabet, W.
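The chain of translations above can be sketched as three small functions, one per step. The function names are hypothetical, and the sketch assumes a marble stands for 1 and a hole stands for 0, as in the pattern above.

```python
import string

def pattern_to_bits(pattern):
    """Step 1: translate marbles and holes into a string of 1s and 0s."""
    symbols = {"marble": "1", "hole": "0"}
    return "".join(symbols[item] for item in pattern)

def bits_to_number(bits):
    """Step 2: read the string as a base two number."""
    return int(bits, 2)

def number_to_letter(number):
    """Step 3: look the number up in the letter dictionary (A is 0)."""
    return string.ascii_uppercase[number]

pattern = ["marble", "hole", "marble", "marble", "hole"]
bits = pattern_to_bits(pattern)
number = bits_to_number(bits)
letter = number_to_letter(number)
print(bits, number, letter)  # 10110 22 W
```

Each function only knows about its own step, which is what lets us stack one dictionary on top of another.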

23rd?

You might be wondering why 22 represents the 23rd letter of the alphabet.

Look at our dictionary again.

Number  Letter
0       A
1       B

We started counting at 0, not 1. So the letter numbered 22 in our dictionary is the 23rd letter of the alphabet: to get a letter’s position, we add 1 to its number.

This is called zero indexing, and is very common in computing. It might seem confusing or annoying to have to add that extra 1, but it’s worth it — in the long run, zero indexing makes a bunch of things easier.

definition

zero indexing is when we label the first element of a list as 0, not 1.
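Python happens to use zero indexing for its own lists and strings, so it makes a handy illustration. This is only a small sketch; the variable names are made up.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

first_letter = alphabet[0]    # the first element is labelled 0
letter_22 = alphabet[22]      # the element labelled 22 ...
position_of_w = alphabet.index("W") + 1  # ... is the 23rd letter

print(first_letter)   # A
print(letter_22)      # W
print(position_of_w)  # 23
```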

text

Most text is longer than a single letter [citation needed]. How can we represent this?

Let’s say we want to represent the word YES. The binary for each letter is:

  • Y 11000
  • E 100
  • S 10010

We could try concatenating them together: 1100010010010. But how do we know where the letters start and end? Remember, we have no way of representing a space — we only have marbles and holes, and they’re already busy representing 1 and 0.
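To see the problem concretely, here is a sketch of encoding without any padding. The helper `encode_without_padding` is hypothetical. Two different pieces of text end up as exactly the same string of bits, so there is no way to tell them apart afterwards.

```python
import string

def encode_without_padding(text):
    """Concatenate each letter's binary number, dropping leading zeroes."""
    return "".join(format(string.ascii_uppercase.index(ch), "b") for ch in text)

print(encode_without_padding("YES"))            # 1100010010010
print(encode_without_padding("BBAAABAABAABA"))  # 1100010010010
```

The second text simply reads every bit as its own letter (B for 1, A for 0), which is a perfectly valid way to split the string.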

The answer is standardised word size. We can declare that all letters must be represented by exactly 5 bits, and you can’t drop leading zeroes. Then the letters become:

  • Y 11000
  • E 00100
  • S 10010

And YES becomes 110000010010010. Because we know a new letter starts every 5 bits, we know to split it as 11000 00100 10010.
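A minimal sketch of this scheme, assuming 5 bits per letter and uppercase letters only. The names `WORD_SIZE`, `encode`, and `decode` are chosen for illustration.

```python
import string

WORD_SIZE = 5  # every letter takes exactly this many bits

def encode(text):
    """Turn each letter into a 5-bit string, keeping leading zeroes."""
    return "".join(
        format(string.ascii_uppercase.index(ch), f"0{WORD_SIZE}b") for ch in text
    )

def decode(bits):
    """Split the bits into 5-bit chunks and look each one up."""
    chunks = [bits[i:i + WORD_SIZE] for i in range(0, len(bits), WORD_SIZE)]
    return "".join(string.ascii_uppercase[int(chunk, 2)] for chunk in chunks)

encoded = encode("YES")
print(encoded)          # 110000010010010
print(decode(encoded))  # YES
```

Because every letter has the same width, `decode` never has to guess where a letter ends.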

The downside to this approach is that it fixes how many characters we can encode. With 5 bits, we can encode 2 to the power of 5 (2 x 2 x 2 x 2 x 2) = 32 distinct characters. If we increase that to 6 bits, we get twice as many characters, but the tradeoff is that each letter takes up more space.
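The doubling is easy to check with a quick sketch. The function name `distinct_characters` is made up for this example.

```python
def distinct_characters(bits):
    """How many different characters fit in a word of the given size."""
    return 2 ** bits

for bits in range(5, 9):
    print(bits, distinct_characters(bits))
# 5 32
# 6 64
# 7 128
# 8 256
```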

a word on words

In computing, a word is a single unit of data, composed of a fixed number of bits.

So in the example above, the word “YES” is actually composed of three words: Y, E, and S. Each letter is a single word in the computing sense.

When we talk about standardised word size, we’re talking about making the representation of each letter a fixed number of bits — it’s nothing to do with making English words a particular length.

Usually it should be clear from context what kind of word we’re talking about, but from now on we’ll write it as word when we mean it in the computing sense, unless I get bored or if I forget.

extending the dictionary

We can add in other characters to our dictionary too:

Number  Character
26      SPACE
27      !
28      ?
29      0
30      1
31      2
32      3
...     ...
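Here is a sketch of the extended dictionary, covering only the entries listed so far (the table carries on, but this example stops at the digit 3). The names `CHARACTERS` and `CODE_OF` are hypothetical.

```python
import string

# Position in this list is the character's number: A is 0, SPACE is 26, and so on.
CHARACTERS = list(string.ascii_uppercase) + [" ", "!", "?", "0", "1", "2", "3"]
CODE_OF = {ch: code for code, ch in enumerate(CHARACTERS)}

print(CODE_OF[" "])    # 26
print(CODE_OF["?"])    # 28
print(CODE_OF["2"])    # 31
print(CHARACTERS[32])  # 3

# Smallest word size that can hold the largest number in the dictionary.
bits_needed = (len(CHARACTERS) - 1).bit_length()
print(bits_needed)     # 6
```

Note that 5 bits only cover the numbers 0 to 31, so as soon as the dictionary reaches 32 we need a word size of at least 6 bits.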

31 = 2??

It might seem confusing that we’re assigning the number 31 to the character 2. Shouldn’t we be using 2?

Remember that “2” and “31” are just squiggles on your screen — there’s nothing that objectively connects those squiggles to a particular number of things.

As long as we’re consistent about using 31 to mean the digit “2”, it’s no less correct than if we used 2. What’s important is making sure that our system consistently treats the number 31 the way we want the digit “2” to be treated.
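One way to picture “treating 31 the way we want the digit 2 to be treated” is a small sketch that recovers the quantity a digit stands for from its number in our dictionary. The helper `digit_value` is hypothetical, and it only handles the four digits listed in the table.

```python
# Numbers taken from the extended dictionary: 29 is "0", 30 is "1", 31 is "2", 32 is "3".
DIGIT_CODES = {29: "0", 30: "1", 31: "2", 32: "3"}
FIRST_DIGIT_CODE = 29

def digit_value(code):
    """Return the quantity that a digit's dictionary number stands for."""
    if code not in DIGIT_CODES:
        raise ValueError(f"{code} is not one of the digits listed in the table")
    return code - FIRST_DIGIT_CODE

print(DIGIT_CODES[31])                    # 2  (the character)
print(digit_value(31))                    # 2  (the quantity)
print(digit_value(31) + digit_value(32))  # 5
```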

Consider a real-life example:

  • The English word “prune” means prune
  • The French word “prune” means plum

Again, “prune” is just a squiggle — it has no connection to any kind of fruit. What matters is that English speakers consistently treat “prune” as meaning prune, and French speakers consistently treat “prune” as meaning plum. This is what gives the representation meaning.

Next, we’ll look at how we can apply our ability to represent text.
