4.1 Introduction to Encoding Schemes
Computers operate using binary digits (bits – 0s and 1s). However, humans communicate using characters, symbols, and languages. To bridge this gap, encoding schemes are used to represent these human-readable characters as binary numbers that computers can understand and process. An encoding scheme defines a unique numeric code for each character.
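As a minimal sketch of this character-to-number mapping, the Python snippet below uses the built-in `ord` and `chr` functions to move between a character and its numeric code, and prints the code as a bit pattern. The helper name `to_bits` is made up for this illustration.

```python
def to_bits(ch: str, width: int = 8) -> str:
    """Return the numeric code of a single character as a binary string."""
    return format(ord(ch), f"0{width}b")

print(ord("A"))      # numeric code of 'A': 65
print(chr(65))       # back from the code to the character: A
print(to_bits("A"))  # the same code as bits: 01000001
```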
4.2 ASCII (American Standard Code for Information Interchange)
- Purpose: ASCII was one of the earliest and most widely used encoding standards, primarily designed for the English language.
- Structure:
- Standard ASCII: Uses 7 bits to represent 128 characters (codes 0-127). This includes:
- Uppercase English letters (A-Z)
- Lowercase English letters (a-z)
- Digits (0-9)
- Punctuation symbols (!, @, #, etc.)
- Control characters (like newline, tab, backspace).
- Extended ASCII: Uses 8 bits (1 byte) to represent 256 characters (codes 0-255). The first 128 characters are the same as standard ASCII, while the additional 128 characters were used for graphical symbols and letters with accents (e.g., é, ñ). There is no single "extended ASCII" standard: different systems assigned different characters to codes 128-255.
- Limitations:
- Limited character set: With only 128 codes (256 in the extended variants), ASCII cannot represent the characters and symbols used in most languages other than English.
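The short Python sketch below illustrates both points: plain English text fits in 7-bit ASCII, while an accented letter does not. It also shows, as an illustration only, that a byte above 127 is read differently by two common 8-bit extensions (`latin-1` and `cp437`, both standard Python codec names). The helper `fits_in_ascii` is a hypothetical name.

```python
def fits_in_ascii(s: str) -> bool:
    """True when every character has a code in the 7-bit range 0-127."""
    return all(ord(ch) < 128 for ch in s)

print(list("Hello".encode("ascii")))  # [72, 101, 108, 108, 111]
print(fits_in_ascii("Hello"))         # True
print(fits_in_ascii("caf\u00e9"))     # False: 'é' has no 7-bit ASCII code

# The same byte value above 127 maps to different characters
# depending on which 8-bit extension is assumed.
raw = bytes([0xE9])
print(raw.decode("latin-1"))  # é
print(raw.decode("cp437"))    # Θ
```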
4.3 ISCII (Indian Standard Code for Information Interchange)
- Purpose: Developed in India to address the need for representing characters from various Indian scripts (Devanagari, Bengali, Tamil, Telugu, etc.).
- Structure: ISCII is an 8-bit encoding standard. The first 128 codes are the same as ASCII. The remaining codes are used to represent characters from Indian scripts.
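- Key Feature: ISCII allows for transliteration, the conversion of text from one Indian script to another. Because phonetically equivalent characters share the same code across the supported scripts, the same stored text can be shown in a different script. This is useful for multilingual applications and data exchange.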
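The idea behind transliteration can be sketched in Python. This is not an ISCII implementation. As a stand-in, it relies on the fact that Unicode's blocks for Indian scripts were laid out in parallel, following the ISCII ordering, so a character can be moved to another script by shifting its code by a fixed offset. The names `BLOCK_START` and `transliterate` are hypothetical, and the mapping is only approximate because not every position is assigned in every script.

```python
# Start of each script's 128-code Unicode block (parallel layout assumed).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
}
BLOCK_SIZE = 0x80

def transliterate(text: str, source: str, target: str) -> str:
    """Shift characters of the source script to the same slot in the target script."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))  # same position, different script
        else:
            out.append(ch)                   # leave spaces, digits, etc. untouched
    return "".join(out)

print(transliterate("कमल", "devanagari", "bengali"))  # কমল
```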
4.4 Unicode
- Purpose: Unicode is a modern, universal character encoding standard designed to represent every character from every language in the world. It aims to unify all other encoding schemes into one.
- Concept: Unicode assigns a unique number, called a code point, to every character. For example, the code point for the letter 'A' is U+0041, and for the Devanagari letter 'क' is U+0915.
- Advantages:
- Universal character support: Can represent characters from all languages.
- Solves compatibility issues: Eliminates the need for multiple encoding schemes.
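To make the idea of a code point concrete, the following Python sketch prints the code point of a few characters in the usual `U+XXXX` notation. It uses the built-in `ord` and the standard `unicodedata` module; the helper name `code_point_label` is made up for this example.

```python
import unicodedata

def code_point_label(ch: str) -> str:
    """Format a character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "क", "😂"]:
    # unicodedata.name returns the official Unicode character name
    print(ch, code_point_label(ch), unicodedata.name(ch))

print(chr(0x0915))  # going the other way: code point to character
```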
Unicode Encoding Formats
A code point is only an abstract number; it does not say how that number is stored as bytes in memory or in a file. That job is done by the different Unicode encodings: UTF-8, UTF-16, and UTF-32.
a) UTF-8 (Unicode Transformation Format - 8-bit)
- Mechanism: UTF-8 is a variable-width encoding. This means it uses a variable number of bytes to represent each character.
- For characters that are also in ASCII (like 'A', 'b', '7'), it uses only 1 byte. This makes it fully backward compatible with ASCII.
- For other characters (like 'é', '€', 'क', '😂'), it uses 2, 3, or 4 bytes as needed.
- Advantages:
- Space-efficient: Text that contains only ASCII characters takes exactly as many bytes in UTF-8 as in ASCII, so primarily English text stays compact.
- Compatibility: Its backward compatibility with ASCII is a huge advantage.
- Widely Used: It is the dominant character encoding for the World Wide Web.
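The variable-width mechanism can be sketched by encoding a single character by hand. The function below follows the standard UTF-8 bit layout and is checked against Python's built-in encoder. The name `utf8_encode` is hypothetical, and real programs should simply call `str.encode("utf-8")`.

```python
def utf8_encode(ch: str) -> bytes:
    """Encode one character to UTF-8 by hand."""
    cp = ord(ch)
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if cp < 0x80:       # 1 byte:  0xxxxxxx (same as ASCII)
        return bytes([cp])
    if cp < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

samples = ["A", "\u00e9", "\u20ac", "\u0915", "\U0001F602"]  # A, é, €, क, 😂
for ch in samples:
    encoded = utf8_encode(ch)
    assert encoded == ch.encode("utf-8")  # agrees with the built-in encoder
    print(ch, len(encoded), encoded.hex(" "))
# Byte counts for the five samples: 1, 2, 3, 3, 4
```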
b) UTF-16 (Unicode Transformation Format - 16-bit)
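- Mechanism: UTF-16 uses 2 or 4 bytes to represent each character. Characters in the Basic Multilingual Plane (BMP), which covers most commonly used characters, take 2 bytes. All other characters take 4 bytes, stored as a pair of 16-bit units called a surrogate pair.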
- Advantages:
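- Efficient for languages whose characters fall within the Basic Multilingual Plane (BMP) but outside ASCII. For example, a Devanagari letter takes 2 bytes in UTF-16 but 3 bytes in UTF-8.
- Disadvantages:
- Less space-efficient for English text compared to UTF-8, because every ASCII character takes 2 bytes instead of 1.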
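The 2-or-4-byte rule can be illustrated with a short sketch that splits a character into its 16-bit units. Characters up to U+FFFF fit in one unit, while higher code points are split into a high and a low unit (a surrogate pair). The function name `utf16_code_units` is an assumption made for this example; Python's own `str.encode("utf-16-be")` does the same work.

```python
from typing import List

def utf16_code_units(ch: str) -> List[int]:
    """Return the 16-bit code units UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                      # one unit (2 bytes)
    offset = cp - 0x10000                # 20 bits remain to be stored
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]                   # two units (4 bytes)

print([hex(u) for u in utf16_code_units("A")])           # ['0x41']
print([hex(u) for u in utf16_code_units("\u0915")])      # ['0x915']
print([hex(u) for u in utf16_code_units("\U0001F602")])  # ['0xd83d', '0xde02']
```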
c) UTF-32 (Unicode Transformation Format - 32-bit)
- Mechanism: UTF-32 is a fixed-width encoding. It uses exactly 4 bytes (32 bits) to store every single character, regardless of what that character is.
- Advantages:
- Simplicity: Since every character has the same length, finding the Nth character in a string is very simple and fast.
- Disadvantages:
- Inefficient Storage: It is very wasteful of space. A text file containing only the word "Hello" would take 5 * 4 = 20 bytes in UTF-32, whereas it would take only 5 bytes in UTF-8 or ASCII.
- Usage: UTF-32 is rarely used for storing or transmitting data but may be used internally by some programs for easier processing.
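Both the storage cost and the simple indexing can be checked with a few lines of Python. The sketch uses the `utf-32-le` codec, which writes no byte order mark; the plain `utf-32` codec in Python prepends a 4-byte mark, so its output is 4 bytes longer. The helper `nth_char_utf32` is a hypothetical name.

```python
text = "Hello"
utf32 = text.encode("utf-32-le")  # "-le" variant: no byte order mark is added
utf8 = text.encode("utf-8")
print(len(utf32), len(utf8))      # 20 5

def nth_char_utf32(data: bytes, n: int) -> str:
    """Fetch character n (0-based) by jumping straight to byte offset 4 * n."""
    start = 4 * n
    return data[start:start + 4].decode("utf-32-le")

print(nth_char_utf32(utf32, 1))   # e
```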
Summary Table:
| Encoding Scheme | Size per Character | Compatibility with ASCII | Space Efficiency | Usage |
|---|---|---|---|---|
| ASCII | 7 or 8 bits | Yes | High | Older systems, limited applications |
| ISCII | 8 bits | Yes | Moderate | Indian languages |
| UTF-8 | 1-4 bytes | Yes | Very High | Web, general-purpose |
| UTF-16 | 2 or 4 bytes | No | Moderate | Windows, Java |
| UTF-32 | 4 bytes | No | Low | Internal processing |
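The size and compatibility columns can be checked with a small comparison. The sketch below encodes an English word and a Hindi word (written with escape codes so the exact code points are unambiguous) in the three Unicode encodings, using codec variants that add no byte order mark. The names `ENCODINGS` and `encoded_sizes` are made up for this illustration.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]

def encoded_sizes(text: str) -> dict:
    """Number of bytes the text occupies in each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}

english = "Hello"
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # नमस्ते (6 code points)

print(encoded_sizes(english))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes(hindi))    # {'utf-8': 18, 'utf-16-le': 12, 'utf-32-le': 24}

# ASCII compatibility: only UTF-8 produces the same bytes as ASCII.
print(english.encode("utf-8") == english.encode("ascii"))      # True
print(english.encode("utf-16-le") == english.encode("ascii"))  # False
```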
ASCII Character Table (Standard 7-bit)
| Dec | Hex | Oct | Character | Description |
|---|---|---|---|---|
| 0 | 00 | 00 | NUL | Null character |
| 1 | 01 | 01 | SOH | Start of Heading |
| 2 | 02 | 02 | STX | Start of Text |
| 3 | 03 | 03 | ETX | End of Text |
| 4 | 04 | 04 | EOT | End of Transmission |
| 5 | 05 | 05 | ENQ | Enquiry |
| 6 | 06 | 06 | ACK | Acknowledge |
| 7 | 07 | 07 | BEL | Bell |
| 8 | 08 | 10 | BS | Backspace |
| 9 | 09 | 11 | HT | Horizontal Tab |
| 10 | 0A | 12 | LF | Line Feed |
| 11 | 0B | 13 | VT | Vertical Tab |
| 12 | 0C | 14 | FF | Form Feed |
| 13 | 0D | 15 | CR | Carriage Return |
| 14 | 0E | 16 | SO | Shift Out |
| 15 | 0F | 17 | SI | Shift In |
| 16 | 10 | 20 | DLE | Data Link Escape |
| 17 | 11 | 21 | DC1 | Device Control 1 |
| 18 | 12 | 22 | DC2 | Device Control 2 |
| 19 | 13 | 23 | DC3 | Device Control 3 |
| 20 | 14 | 24 | DC4 | Device Control 4 |
| 21 | 15 | 25 | NAK | Negative Acknowledge |
| 22 | 16 | 26 | SYN | Synchronous Idle |
| 23 | 17 | 27 | ETB | End of Transmission Block |
| 24 | 18 | 30 | CAN | Cancel |
| 25 | 19 | 31 | EM | End of Medium |
| 26 | 1A | 32 | SUB | Substitute |
| 27 | 1B | 33 | ESC | Escape |
| 28 | 1C | 34 | FS | File Separator |
| 29 | 1D | 35 | GS | Group Separator |
| 30 | 1E | 36 | RS | Record Separator |
| 31 | 1F | 37 | US | Unit Separator |
| 32 | 20 | 40 | Space | Space |
| 33 | 21 | 41 | ! | Exclamation Mark |
| 34 | 22 | 42 | " | Double Quote |
| 35 | 23 | 43 | # | Number Sign/Hash |
| 36 | 24 | 44 | $ | Dollar Sign |
| 37 | 25 | 45 | % | Percent Sign |
| 38 | 26 | 46 | & | Ampersand |
| 39 | 27 | 47 | ' | Single Quote |
| 40 | 28 | 50 | ( | Left Parenthesis |
| 41 | 29 | 51 | ) | Right Parenthesis |
| 42 | 2A | 52 | * | Asterisk |
| 43 | 2B | 53 | + | Plus Sign |
| 44 | 2C | 54 | , | Comma |
| 45 | 2D | 55 | - | Hyphen/Minus Sign |
| 46 | 2E | 56 | . | Period/Dot |
| 47 | 2F | 57 | / | Slash |
| 48 | 30 | 60 | 0 | Digit Zero |
| 49 | 31 | 61 | 1 | Digit One |
| 50 | 32 | 62 | 2 | Digit Two |
| 51 | 33 | 63 | 3 | Digit Three |
| 52 | 34 | 64 | 4 | Digit Four |
| 53 | 35 | 65 | 5 | Digit Five |
| 54 | 36 | 66 | 6 | Digit Six |
| 55 | 37 | 67 | 7 | Digit Seven |
| 56 | 38 | 70 | 8 | Digit Eight |
| 57 | 39 | 71 | 9 | Digit Nine |
| 58 | 3A | 72 | : | Colon |
| 59 | 3B | 73 | ; | Semicolon |
| 60 | 3C | 74 | < | Less-than Sign |
| 61 | 3D | 75 | = | Equals Sign |
| 62 | 3E | 76 | > | Greater-than Sign |
| 63 | 3F | 77 | ? | Question Mark |
| 64 | 40 | 100 | @ | At Sign |
| 65 | 41 | 101 | A | Uppercase A |
| 66 | 42 | 102 | B | Uppercase B |
| 67 | 43 | 103 | C | Uppercase C |
| 68 | 44 | 104 | D | Uppercase D |
| 69 | 45 | 105 | E | Uppercase E |
| 70 | 46 | 106 | F | Uppercase F |
| 71 | 47 | 107 | G | Uppercase G |
| 72 | 48 | 110 | H | Uppercase H |
| 73 | 49 | 111 | I | Uppercase I |
| 74 | 4A | 112 | J | Uppercase J |
| 75 | 4B | 113 | K | Uppercase K |
| 76 | 4C | 114 | L | Uppercase L |
| 77 | 4D | 115 | M | Uppercase M |
| 78 | 4E | 116 | N | Uppercase N |
| 79 | 4F | 117 | O | Uppercase O |
| 80 | 50 | 120 | P | Uppercase P |
| 81 | 51 | 121 | Q | Uppercase Q |
| 82 | 52 | 122 | R | Uppercase R |
| 83 | 53 | 123 | S | Uppercase S |
| 84 | 54 | 124 | T | Uppercase T |
| 85 | 55 | 125 | U | Uppercase U |
| 86 | 56 | 126 | V | Uppercase V |
| 87 | 57 | 127 | W | Uppercase W |
| 88 | 58 | 130 | X | Uppercase X |
| 89 | 59 | 131 | Y | Uppercase Y |
| 90 | 5A | 132 | Z | Uppercase Z |
| 91 | 5B | 133 | [ | Left Square Bracket |
| 92 | 5C | 134 | \ | Backslash |
| 93 | 5D | 135 | ] | Right Square Bracket |
| 94 | 5E | 136 | ^ | Circumflex |
| 95 | 5F | 137 | _ | Underscore |
| 96 | 60 | 140 | ` | Grave Accent |
| 97 | 61 | 141 | a | Lowercase a |
| 98 | 62 | 142 | b | Lowercase b |
| 99 | 63 | 143 | c | Lowercase c |
| 100 | 64 | 144 | d | Lowercase d |
| 101 | 65 | 145 | e | Lowercase e |
| 102 | 66 | 146 | f | Lowercase f |
| 103 | 67 | 147 | g | Lowercase g |
| 104 | 68 | 150 | h | Lowercase h |
| 105 | 69 | 151 | i | Lowercase i |
| 106 | 6A | 152 | j | Lowercase j |
| 107 | 6B | 153 | k | Lowercase k |
| 108 | 6C | 154 | l | Lowercase l |
| 109 | 6D | 155 | m | Lowercase m |
| 110 | 6E | 156 | n | Lowercase n |
| 111 | 6F | 157 | o | Lowercase o |
| 112 | 70 | 160 | p | Lowercase p |
| 113 | 71 | 161 | q | Lowercase q |
| 114 | 72 | 162 | r | Lowercase r |
| 115 | 73 | 163 | s | Lowercase s |
| 116 | 74 | 164 | t | Lowercase t |
| 117 | 75 | 165 | u | Lowercase u |
| 118 | 76 | 166 | v | Lowercase v |
| 119 | 77 | 167 | w | Lowercase w |
| 120 | 78 | 170 | x | Lowercase x |
| 121 | 79 | 171 | y | Lowercase y |
| 122 | 7A | 172 | z | Lowercase z |
| 123 | 7B | 173 | { | Left Curly Brace |
| 124 | 7C | 174 | \| | Vertical Bar |
| 125 | 7D | 175 | } | Right Curly Brace |
| 126 | 7E | 176 | ~ | Tilde |
| 127 | 7F | 177 | DEL | Delete |
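The Dec, Hex, and Oct columns are the same number written in three bases, so any row of the table can be regenerated with a few lines of Python. This is a minimal sketch; the function name `ascii_row` is hypothetical.

```python
def ascii_row(code: int):
    """Return (decimal, hex, octal, character) for a standard ASCII code."""
    if not 0 <= code <= 127:
        raise ValueError("standard ASCII codes run from 0 to 127")
    return (code, format(code, "02X"), format(code, "02o"), chr(code))

for code in (65, 97, 124):
    print(ascii_row(code))
# (65, '41', '101', 'A')
# (97, '61', '141', 'a')
# (124, '7C', '174', '|')

# Matching upper- and lowercase letters are always 32 codes apart.
print(ord("a") - ord("A"))  # 32
```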
Arbind Singh

