We know we can represent numbers using combinations of marbles and holes. Now we’ll look at representing text.
encoding
Previously we built a dictionary that let us interpret strings like 1101010 as numbers. Now we’ll stack a second dictionary on top that translates between numbers and letters. By chaining these two dictionaries together, we’ll be able to translate from strings, into numbers, and then into letters.
This dictionary just maps each letter to its position in the alphabet:
Number | Letter |
---|---|
0 | A |
1 | B |
2 | C |
3 | D |
… | … |
25 | Z |
This means we can interpret a number as a letter.
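As a minimal sketch, the dictionary can be written as a Python string in which each letter's number is its index. The names `ALPHABET`, `number_to_letter` and `letter_to_number` are made up for this illustration.

```python
import string

# "ABCDEFGHIJKLMNOPQRSTUVWXYZ": a letter's number is its index in this string.
ALPHABET = string.ascii_uppercase

def number_to_letter(number):
    """Look up the letter for a number between 0 and 25."""
    return ALPHABET[number]

def letter_to_number(letter):
    """Look up the number for an uppercase letter."""
    return ALPHABET.index(letter)

print(number_to_letter(0))    # A
print(number_to_letter(25))   # Z
print(letter_to_number("D"))  # 3
```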
putting it together
- The disk has the pattern
- … which represents the string 10110
- … which encodes the base two number 10110
- … which equals the number known in English as “twenty-two”.
- … which we interpret as the 23rd letter of the alphabet, W.
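The chain of steps above can be sketched in a few lines of Python. This assumes the pattern has already been read off the disk as the string `10110`; `decode_letter` is a hypothetical helper name.

```python
import string

ALPHABET = string.ascii_uppercase

def decode_letter(bits):
    """Chain the two dictionaries: bit string -> number -> letter."""
    number = int(bits, 2)    # read the string as a base two number
    return ALPHABET[number]  # look that number up in the letter dictionary

print(int("10110", 2))         # 22
print(decode_letter("10110"))  # W
```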
23rd?
You might be wondering why 22 represents the 23rd letter of the alphabet.
Look at our dictionary again.
Number | Letter |
---|---|
0 | A |
1 | B |
… | … |
We started counting at 0, not 1, so a letter’s number is always one less than its position in the alphabet. To get the position from the number, we have to add 1.
This is called zero indexing, and is very common in computing. It might seem confusing or annoying to have to add that extra 1, but it’s worth it — in the long run, zero indexing makes a bunch of things easier.
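Python strings happen to be zero indexed too, so a small sketch shows the same off-by-one between a letter's number and its position. The variable names are only for illustration.

```python
import string

ALPHABET = string.ascii_uppercase

number = 22                # the number we decoded
letter = ALPHABET[number]  # zero indexed lookup, so 0 is A
position = number + 1      # the position a person would count out loud

print(letter)    # W
print(position)  # 23
```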
text
Most text is longer than a single letter [citation needed]. How can we represent this?
Let’s say we want to represent the word yes. The binary for this is:
- Y 11000
- E 100
- S 10010
We could try concatenating them together: 1100010010010. But how do we know where the letters start and end? Remember, we have no way of representing a space — we only have marbles and holes, and they’re already busy representing 1 and 0.
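To see how bad the ambiguity is, here is a rough sketch that lists every way of reading the concatenated string as a sequence of letters, assuming each letter is written in binary with its leading zeroes dropped. `CODES` and `decodings` are names invented for this example.

```python
import string

ALPHABET = string.ascii_uppercase
# Variable-length codes: each letter's number in binary, leading zeroes dropped.
CODES = {format(number, "b"): letter for number, letter in enumerate(ALPHABET)}

def decodings(bits):
    """Return every way to read `bits` as a run of variable-length codes."""
    if not bits:
        return [""]
    results = []
    for size in range(1, 6):  # the codes for 0 to 25 are 1 to 5 bits long
        head, tail = bits[:size], bits[size:]
        if len(head) == size and head in CODES:
            results.extend(CODES[head] + rest for rest in decodings(tail))
    return results

readings = decodings("1100010010010")
print("YES" in readings)            # True
print("BBAAABAABAABA" in readings)  # True: every bit read as its own letter
```

Both readings are valid, along with many others, so the string on its own can't tell us which one was intended.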
The answer is standardised `word` size. We can declare that all letters must be represented by exactly 5 bits, and you can’t drop leading zeroes. Then the letters become:
- Y 11000
- E 00100
- S 10010
And YES becomes 110000010010010. Because we know a new letter starts every 5 bits, we know to split it as 11000 00100 10010.
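Here is a minimal sketch of encoding and decoding with a fixed size of 5 bits per letter. The helper names `encode` and `decode` and the constant `WORD_SIZE` are assumptions made for this example.

```python
import string

ALPHABET = string.ascii_uppercase
WORD_SIZE = 5  # every letter takes exactly this many bits

def encode(text):
    """Turn uppercase letters into one long bit string, 5 bits per letter."""
    return "".join(format(ALPHABET.index(ch), f"0{WORD_SIZE}b") for ch in text)

def decode(bits):
    """Split the bit string every 5 bits and look each chunk up as a letter."""
    chunks = [bits[i:i + WORD_SIZE] for i in range(0, len(bits), WORD_SIZE)]
    return "".join(ALPHABET[int(chunk, 2)] for chunk in chunks)

print(encode("YES"))              # 110000010010010
print(decode("110000010010010"))  # YES
```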
The downside to this approach is that it fixes how many characters we can encode. With 5 bits, we can encode 2 to the power of 5 (2 x 2 x 2 x 2 x 2) = 32 distinct characters. If we increase that to 6 bits, we get twice as many characters (64), but the tradeoff is that each letter takes up more space.
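The doubling is easy to check with a quick sketch. `capacity` is a hypothetical helper name.

```python
def capacity(bits):
    """Number of distinct characters that fit in a given number of bits."""
    return 2 ** bits

for bits in range(5, 9):
    print(bits, capacity(bits))
# 5 32
# 6 64
# 7 128
# 8 256
```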
a word on words
In computing, a `word` is a single unit of data, composed of a fixed number of bits.
So in the example above, the word “YES” is actually composed of three `words`: Y, E, and S. Each letter is a single `word` in the computing sense.
When we talk about standardised `word` size, we’re talking about making the representation of each letter a fixed number of bits. It’s nothing to do with making English words a particular length.
Usually it should be clear from context what kind of word we’re talking about, but from now on we’ll write it as `word` when we mean it in the computing sense, unless I get bored or forget.
extending the dictionary
We can add in other characters to our dictionary too:
Number | Character |
---|---|
26 | SPACE |
27 | ! |
28 | ? |
29 | 0 |
30 | 1 |
31 | 2 |
32 | 3 |
… | … |
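A sketch of the extended dictionary, using only the entries listed in the table and leaving the elided ones out. It assumes a `word` size of 6 bits, because the number 32 is 100000 in base two and no longer fits in 5. The names `CHARACTERS`, `encode` and `decode` are made up for this example.

```python
import string

# Numbers 0 to 25 are A to Z; 26 onwards follow the table above.
CHARACTERS = list(string.ascii_uppercase) + [" ", "!", "?", "0", "1", "2", "3"]
WORD_SIZE = 6  # assumption: 32 needs six bits (100000)

def encode(text):
    """Encode each character as a fixed-size group of bits."""
    return "".join(format(CHARACTERS.index(ch), f"0{WORD_SIZE}b") for ch in text)

def decode(bits):
    """Split the bits into fixed-size groups and look each one up."""
    chunks = [bits[i:i + WORD_SIZE] for i in range(0, len(bits), WORD_SIZE)]
    return "".join(CHARACTERS[int(chunk, 2)] for chunk in chunks)

print(CHARACTERS.index("2"))        # 31
print(encode("3"))                  # 100000
print(decode(encode("YES 123!")))   # YES 123!
```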
31 = 2??
It might seem confusing that we’re assigning the number `31` to the character `2`. Shouldn’t we be using `2`?
Remember that “2” and “31” are just squiggles on your screen — there’s nothing that objectively connects those squiggles to a particular number of things.
As long as we’re consistent about using `31` to mean the digit “2”, it’s no less correct than if we used `2`. What’s important is making sure that our system consistently treats the number `31` the way we want the digit “2” to be treated.
Consider a real-life example:
- The English word “prune” means prune
- The French word “prune” means plum
Again, “prune” is just a squiggle — it has no connection to any kind of fruit. What matters is that English speakers consistently treat “prune” as meaning prune, and French speakers consistently treat “prune” as meaning plum. This is what gives the representation meaning.
Next, we’ll look at how we can apply our ability to represent text.