Character Encoding Converters | UTF-8, Unicode

Convert text, bytes, UTF-8, UTF-16, UTF-32, ASCII, Latin-1, Base64, HEX, binary, HTML entities, URL encoding, and Unicode code points.

Convert text between Unicode, UTF-8, UTF-16, UTF-32, ASCII, Latin-1, Windows-1252-style bytes, hexadecimal, binary, decimal bytes, Base64, percent encoding, HTML entities, Unicode code points, and normalization forms. Inspect every character, byte, code unit, and code point in one browser-based tool.

Text to UTF-8 · Bytes to text · UTF-16 LE/BE · UTF-32 LE/BE · ASCII / Latin-1 · Windows-1252-style · HEX / Binary / Decimal · Base64 · URL percent encoding · HTML entities · Unicode inspector · NFC / NFD / NFKC / NFKD

1. Enter Text or Bytes

Encode mode expects normal text. Decode mode expects bytes in the selected byte input format.

2. Conversion Output

Convert text and bytes using Unicode-aware browser tools and manual byte encoders.

[Figure: Character Encoding Flow. Text → Characters → Code Points (U+XXXX) → Bytes (HEX / binary). UTF-8 keeps Unicode text portable. Inspect code points, code units, and encoded bytes.]

3. Encoding Results

Output Type | Value | Use Case

Unicode Character Inspector

# | Character | Code Point | Decimal | UTF-8 Hex | UTF-16 Units | HTML Entity
\[ \text{UTF-8 byte count depends on the Unicode code point range.} \]

Character Encoding Formulas

Character encoding converts abstract characters into bytes. Unicode assigns a code point such as \(U+0041\) or \(U+1F44B\), and an encoding such as UTF-8 or UTF-16 decides how that code point becomes bytes or code units.

\[ \text{Character}\rightarrow\text{Code Point}\rightarrow\text{Encoded Bytes} \]

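UTF-8 is variable length. It uses one to four bytes depending on the code point range. The surrogate range U+D800 to U+DFFF is excluded, because surrogates are not valid code points to encode in UTF-8: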

\[ U+0000\ldots U+007F \Rightarrow 0xxxxxxx \]
\[ U+0080\ldots U+07FF \Rightarrow 110xxxxx\ 10xxxxxx \]
\[ U+0800\ldots U+FFFF \Rightarrow 1110xxxx\ 10xxxxxx\ 10xxxxxx \]
\[ U+10000\ldots U+10FFFF \Rightarrow 11110xxx\ 10xxxxxx\ 10xxxxxx\ 10xxxxxx \]
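The byte templates above can be sketched in a few lines of Python. This is a minimal illustration, not the code this tool runs; the function name `utf8_encode_code_point` is a hypothetical helper, and the result is checked against Python's built-in encoder.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by filling the UTF-8 byte templates."""
    # Surrogates and values above U+10FFFF cannot be encoded in UTF-8.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:          # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in ("A", "\u00e9", "\u20ac", "\U0001F44B"):
    encoded = utf8_encode_code_point(ord(ch))
    # Cross-check against the built-in encoder.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

Each branch corresponds to one row of the template list: the leading byte carries the length marker, and every continuation byte carries six payload bits behind the prefix 10.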

UTF-16 uses 16-bit code units. Characters in the Basic Multilingual Plane use one code unit. Supplementary characters use a surrogate pair:

\[ U'=U-0x10000 \]
\[ \text{High Surrogate}=0xD800+\left\lfloor\frac{U'}{0x400}\right\rfloor \]
\[ \text{Low Surrogate}=0xDC00+(U'\bmod 0x400) \]
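As a worked instance of the surrogate formulas, the following Python sketch computes the pair for the waving hand emoji, U+1F44B. The helper name `to_surrogate_pair` is an assumption made for illustration.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000                 # U' = U - 0x10000
    high = 0xD800 + (offset // 0x400)     # top 10 bits of U'
    low = 0xDC00 + (offset % 0x400)       # bottom 10 bits of U'
    return high, low


high, low = to_surrogate_pair(0x1F44B)
print(f"{high:04X} {low:04X}")  # D83D DC4B

# Cross-check against Python's big-endian UTF-16 encoder.
assert "\U0001F44B".encode("utf-16-be") == bytes.fromhex(f"{high:04X}{low:04X}")
```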

UTF-32 uses one 32-bit value per code point:

\[ \text{UTF-32 code unit}=U \]

Base64 encodes three bytes into four printable characters:

\[ 3\ \text{bytes}=24\ \text{bits} \]
\[ 24\ \text{bits}=4\times6\ \text{bits} \]
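The 24-bit regrouping can be made concrete with a small Python sketch that handles exactly one three-byte group (no padding logic). `encode_three_bytes` is a hypothetical helper; the standard `base64` module is used only to confirm the result.

```python
import base64

# Standard Base64 alphabet: index 0..63 maps to one printable character.
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_three_bytes(chunk: bytes) -> str:
    """Encode exactly three bytes (24 bits) as four 6-bit Base64 characters."""
    if len(chunk) != 3:
        raise ValueError("this sketch handles one full 3-byte group only")
    bits = int.from_bytes(chunk, "big")  # pack the 24 bits into one integer
    return "".join(B64_ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0))


euro_bytes = "\u20ac".encode("utf-8")      # e2 82 ac
manual = encode_three_bytes(euro_bytes)
print(manual)                              # 4oKs
assert manual == base64.b64encode(euro_bytes).decode("ascii")
```

Inputs whose length is not a multiple of three need `=` padding, which this sketch deliberately leaves to the library.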

Percent encoding writes each byte as a percent sign followed by two hexadecimal digits:

\[ \text{Byte } b \Rightarrow \%HH \]
\[ HH=\operatorname{hex}(b) \]
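A short Python sketch of the same rule follows. `percent_encode_all` is a hypothetical helper that escapes every byte, which is stricter than real URL encoders; `urllib.parse.quote` and `unquote` from the standard library show the usual behaviour, where unreserved characters stay readable.

```python
from urllib.parse import quote, unquote


def percent_encode_all(text: str) -> str:
    """Write every UTF-8 byte of the text as %HH (illustration only)."""
    return "".join(f"%{b:02X}" for b in text.encode("utf-8"))


print(percent_encode_all("\u20ac"))   # %E2%82%AC
print(quote("\u20ac 5"))              # %E2%82%AC%205 (digits stay as-is)
print(unquote("%E2%82%AC"))           # decodes back to the euro sign
```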

Unicode normalization converts equivalent text into a consistent representation:

\[ \text{NFC}=\text{Canonical Decomposition}+\text{Canonical Composition} \]
\[ \text{NFD}=\text{Canonical Decomposition} \]
\[ \text{NFKC}=\text{Compatibility Decomposition}+\text{Canonical Composition} \]
\[ \text{NFKD}=\text{Compatibility Decomposition} \]
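To see the four forms side by side, here is a small Python sketch using the standard `unicodedata` module. The sample strings (é in two spellings and the "fi" ligature U+FB01) are chosen for illustration.

```python
import unicodedata

composed = "\u00e9"        # é as one precomposed code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"        # LATIN SMALL LIGATURE FI

# The two spellings look identical but compare unequal as raw strings.
assert composed != decomposed

# Canonical forms: NFC composes, NFD decomposes.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms also fold characters such as ligatures.
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "fi"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, decomposed + ligature)
    print(form, [f"U+{ord(c):04X}" for c in normalized])
```

Normalizing both sides to the same form before comparing is the usual way to make the two spellings of é match.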

Complete Guide to Character Encoding

Character encoding is the system that turns human-readable text into bytes computers can store, transmit, and interpret. A letter, symbol, emoji, mathematical sign, Arabic word, Hindi phrase, or Chinese character is not stored as a visual drawing. It is represented by a code point and then encoded into bytes. Without a correct encoding, text can become unreadable, corrupted, or misinterpreted.

Unicode is the central standard for modern text. It assigns code points to characters across languages and scripts. For example, the Latin capital letter A is \(U+0041\), the euro sign is \(U+20AC\), and the waving hand emoji is \(U+1F44B\). A code point is an abstract number. It does not by itself say how many bytes are stored. That is the job of an encoding such as UTF-8, UTF-16, or UTF-32.

UTF-8 is the dominant web encoding because it is compact for ASCII text, supports the full Unicode range, and has no byte-order variants. ASCII characters from 0 to 127 encode as one byte in UTF-8. Characters from U+0080 to U+07FF, which covers most other Latin, Greek, Cyrillic, Hebrew, and Arabic letters, use two bytes. The rest of the Basic Multilingual Plane, including most Chinese, Japanese, Korean, and Indic characters, uses three bytes. Supplementary characters, including most emoji, use four bytes. This variable-length design makes UTF-8 efficient and backward-compatible with ASCII for basic English text.

UTF-16 is common inside several programming environments, including JavaScript strings. UTF-16 stores text as 16-bit code units. Characters in the Basic Multilingual Plane use one code unit. Supplementary characters such as many emoji use two code units called a surrogate pair. This is why a JavaScript string length can be larger than the number of visible characters. The emoji 👋 is one Unicode code point but two UTF-16 code units.

UTF-32 is simpler conceptually because it uses one 32-bit unit per Unicode code point. Its simplicity comes with size cost. A basic English letter that needs one byte in UTF-8 takes four bytes in UTF-32. UTF-32 is useful for some internal processing and teaching, but it is not usually the best storage or web interchange format.
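The size trade-off is easy to measure. The following Python sketch compares encoded lengths for a few sample characters; `encoded_sizes` is a hypothetical helper, and the little-endian codecs are used so that no BOM is counted.

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Return the byte length of the text in each Unicode encoding form."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ("A", "\u20ac", "\U0001F44B", "Hello"):
    print(repr(sample), encoded_sizes(sample))
```

For the letter A the sizes are 1, 2, and 4 bytes; for the euro sign 3, 2, and 4; for the waving hand emoji all three encodings need 4 bytes.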

ASCII is the historical 7-bit encoding for basic English letters, digits, punctuation, and control characters. ASCII contains 128 values from 0 to 127. UTF-8 was designed so that ASCII bytes mean the same thing in UTF-8. That compatibility is one reason UTF-8 became practical for web adoption. However, ASCII cannot directly encode characters such as €, é, नमस्ते, مرحبا, or emoji.

Latin-1 (ISO-8859-1) extends the idea of single-byte text to 256 values. It can represent many Western European characters but not the full range of Unicode. Windows-1252 is a closely related legacy encoding from Windows environments. It assigns printable characters such as the euro sign and curly quotes to bytes in the range 0x80 to 0x9F, which Latin-1 does not use for printable characters. Many mojibake problems happen when text encoded in one encoding is decoded as another.

Mojibake is the visible corruption that happens when bytes are decoded with the wrong character encoding. For example, UTF-8 bytes for one character can appear as two or three strange characters if interpreted as a legacy single-byte encoding. The fix is not to visually repair the text by hand; the correct fix is to decode the original bytes using the correct encoding whenever possible.
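A minimal Python sketch can reproduce this effect and then reverse it. The helper names are hypothetical, and the wrong encoding is assumed to be Windows-1252 (Python codec name `cp1252`); the reversal only works when every garbled character still maps back to a byte in that encoding.

```python
def simulate_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair_mojibake(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


for original in ("\u00e9", "\u20ac"):   # é and the euro sign
    garbled = simulate_mojibake(original)
    print(f"{original!r} -> {garbled!r} -> {repair_mojibake(garbled)!r}")
```

The two UTF-8 bytes of é appear as two Windows-1252 characters, and the three bytes of the euro sign appear as three, which matches the pattern described above.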

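A byte order mark, or BOM, is the code point U+FEFF encoded at the start of some text files. It can signal the byte order for UTF-16 or UTF-32: for example, a UTF-16LE file with a BOM starts with the bytes FF FE, and a UTF-16BE file starts with FE FF. UTF-8 has no byte-order ambiguity and does not need a BOM, but some files still begin with the UTF-8 BOM bytes EF BB BF. BOM handling can matter when reading CSV files, source code, configuration files, and data exports, where an unstripped BOM can show up as a stray character at the start of the first field or line. This tool lets you add or strip BOMs for encoding and decoding experiments.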
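For experimenting with BOMs outside the browser, Python's standard `codecs` module exposes the BOM byte sequences, and the `utf-8-sig` codec strips or adds the UTF-8 BOM. The CSV-style header string below is made up for illustration.

```python
import codecs

# BOM byte sequences defined by the standard library.
print(codecs.BOM_UTF8.hex(" "))       # ef bb bf
print(codecs.BOM_UTF16_LE.hex(" "))   # ff fe
print(codecs.BOM_UTF16_BE.hex(" "))   # fe ff

# A file that starts with a UTF-8 BOM (hypothetical CSV header).
data = codecs.BOM_UTF8 + "id,name".encode("utf-8")

kept = data.decode("utf-8")          # BOM survives as U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is removed if present

print(repr(kept))      # '\ufeffid,name'
print(repr(stripped))  # 'id,name'

# Encoding with utf-8-sig adds the BOM.
assert "id".encode("utf-8-sig") == codecs.BOM_UTF8 + b"id"
```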

Endianness describes byte order. In UTF-16LE, the least significant byte appears first. In UTF-16BE, the most significant byte appears first. The same idea applies to UTF-32LE and UTF-32BE. If a file is decoded with the wrong endianness, the result may look completely broken. Understanding endianness is important in binary formats, networking, file parsing, and systems programming.
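The following Python sketch shows the byte order difference for a two-character sample and what happens when little-endian bytes are read as big-endian. The sample string is an arbitrary choice for illustration.

```python
text = "A\u20ac"   # LATIN CAPITAL LETTER A followed by the euro sign

little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

print(little.hex(" "))   # 41 00 ac 20
print(big.hex(" "))      # 00 41 20 ac

# Decoding little-endian bytes as big-endian swaps the bytes of every
# code unit, producing unrelated characters instead of an error.
wrong = little.decode("utf-16-be")
print([f"U+{ord(c):04X}" for c in wrong])   # ['U+4100', 'U+AC20']
```

Because the swapped values are still valid code units here, the decoder raises no error, which is why wrong-endian text often looks like unrelated CJK or Hangul characters rather than failing loudly.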

Base64 is not a character encoding like UTF-8. It is a binary-to-text encoding that represents bytes using printable characters. It is useful when binary data must pass through systems designed for text. Base64 is common in email, data URLs, tokens, and APIs. It is not encryption. Anyone can decode Base64 if they have the encoded text.

Percent encoding is used in URLs. A byte is written as a percent sign followed by two hexadecimal digits, such as %E2%82%AC for the UTF-8 bytes of the euro sign. URL encoding is byte-oriented, so the text must first be encoded into bytes, usually UTF-8, and then each unsafe byte is written using percent notation.

HTML entities are another representation layer. The character © can be written as &copy;, &#169;, or &#xA9;. Entities are useful when text must appear inside HTML without being interpreted as markup, or when a developer wants to show special characters safely. HTML escaping is also a security habit when displaying user text in a web page.
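As a small illustration, Python's standard `html` module can escape markup characters and decode named, decimal, and hexadecimal entities. The helper `numeric_entities` is a hypothetical name for this sketch.

```python
import html
from typing import Tuple


def numeric_entities(ch: str) -> Tuple[str, str]:
    """Return the decimal and hexadecimal HTML entity for one character."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


print(numeric_entities("\u00a9"))            # ('&#169;', '&#xA9;')

# All three spellings decode to the same character.
print(html.unescape("&copy; &#169; &#xA9;"))

# Escaping user text so it is displayed rather than parsed as markup.
print(html.escape('<b title="x">Tom & Jerry</b>'))
```

`html.escape` replaces `&`, `<`, `>`, and by default both quote characters, which covers the common case of placing user text inside element content or attribute values.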

Unicode normalization is essential for reliable text comparison. The character é can be represented as one precomposed character or as the letter e followed by a combining accent. These may look the same but have different code point sequences. NFC, NFD, NFKC, and NFKD normalize text in different ways. Search, sorting, file names, usernames, and duplicate detection can behave incorrectly if normalization is ignored.

Grapheme clusters are what users often think of as characters. A single visible emoji can contain multiple code points joined together. A flag emoji uses a pair of regional indicator symbols. A family emoji may use several person emoji joined with zero-width joiners. A letter with a combining mark can be multiple code points but one visible unit. This is why the tool reports characters, code points, UTF-16 code units, grapheme clusters, and bytes.

JavaScript strings use UTF-16 code units internally. This means string.length reports code units, not Unicode code points or grapheme clusters. This matters when validating usernames, limiting posts, slicing emoji text, or storing multilingual text. Cutting a string at an arbitrary code-unit boundary can split a surrogate pair and leave a lone surrogate, which is invalid text.
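The different counts can be reproduced in Python, with one caveat: Python's `len` counts code points, so the UTF-16 code unit count (what JavaScript's `string.length` would report) is derived here from the encoded length. `text_counts` is a hypothetical helper, and grapheme cluster counting is left out because it needs a segmentation library that is not part of the standard library.

```python
from typing import Dict


def text_counts(text: str) -> Dict[str, int]:
    """Count code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


wave = "\U0001F44B"   # one code point, one visible emoji
# Man + ZWJ + woman + ZWJ + girl: usually rendered as one family emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(text_counts(wave))     # 1 code point, 2 code units, 4 bytes
print(text_counts(family))   # 5 code points, 8 code units, 18 bytes
```

A length limit written against any one of these counts will behave differently from the others, so it helps to decide explicitly which unit a limit is meant to measure.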

Developers should usually store and exchange text as UTF-8 unless there is a specific reason not to. HTML pages, APIs, JSON files, CSV exports, XML, Markdown, source files, logs, and databases are easier to handle when UTF-8 is used consistently. Legacy encodings may still appear in old files, regional datasets, archived systems, or exports from older software.

Character encoding is also a mathematical topic. It uses base conversion, binary representation, hexadecimal notation, bit masks, byte grouping, modular ranges, and variable-length coding. When a code point is encoded as UTF-8, its bits are distributed into byte templates. When bytes are displayed as hexadecimal, each byte is split into two 4-bit nibbles. Encoding is not magic; it is structured arithmetic.
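The arithmetic can be inspected directly. This Python sketch prints each UTF-8 byte of the euro sign in hexadecimal and binary and splits a byte into its two nibbles; both helper names are assumptions made for illustration.

```python
from typing import List, Tuple


def split_nibbles(byte: int) -> Tuple[int, int]:
    """Split one byte into its high and low 4-bit nibbles."""
    return byte >> 4, byte & 0x0F


def to_hex_and_binary(data: bytes) -> List[Tuple[str, str]]:
    """Render each byte as a two-digit hex string and an 8-bit binary string."""
    return [(f"{b:02X}", f"{b:08b}") for b in data]


euro_bytes = "\u20ac".encode("utf-8")

# Code point U+20AC in binary: 0010 0000 1010 1100
print(f"{0x20AC:016b}")

# The same 16 bits spread over the template 1110xxxx 10xxxxxx 10xxxxxx.
for hex_text, bin_text in to_hex_and_binary(euro_bytes):
    print(hex_text, bin_text)

print(split_nibbles(0xE2))   # (14, 2), i.e. hex digits E and 2
```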

This tool is useful for students learning Unicode, developers debugging mojibake, teachers preparing computer science lessons, publishers checking multilingual content, SEO teams working with URLs, and anyone who needs to see the actual bytes behind text. It is intentionally transparent: you can see the text, the bytes, the code points, and the formulas.

This page is not an official exam score calculator. There is no universal score guideline, score table, or next exam timetable for character encoding conversion. It can support computer science, web development, digital publishing, data handling, cybersecurity basics, and applied mathematics, but official exam schedules and grading rules must come from the relevant school, course provider, or exam board.

Accuracy note: this browser tool is designed for learning, debugging, and everyday conversion. Browser support for legacy decoding can vary. For legal, archival, forensic, or production migration work, verify byte-level results with dedicated encoding libraries and known test files.

Reference Links

Useful references: Unicode Standard, WHATWG Encoding Standard, MDN TextEncoder, MDN TextDecoder, and MDN String normalize.

How to Use Character Encoding Converters

  1. Choose a tool mode. Encode text, decode bytes, inspect characters, normalize text, or use escape tools.
  2. Select an encoding. Choose UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, ASCII, Latin-1, or Windows-1252 style.
  3. Enter text or bytes. Encode mode expects text. Decode mode expects hex, binary, decimal, Base64, or percent-encoded bytes.
  4. Set BOM and error options. Add BOM, strip BOM, replace unsupported characters, or use strict mode.
  5. Run conversion. Click Convert Encoding and review all outputs.
  6. Inspect Unicode details. Review each character’s code point, decimal value, UTF-8 bytes, UTF-16 units, and HTML entity.
  7. Export results. Copy the main output, copy all details, download CSV, or print/save as PDF.
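The decode steps above (hex input, an encoding choice, and strict or replace error handling) can be approximated outside the browser with a short Python sketch. `decode_hex_input` and its parameters are hypothetical names, not this tool's API, and the behaviour shown is Python's, which may differ in detail from the tool's.

```python
def decode_hex_input(hex_text: str, encoding: str = "utf-8", strict: bool = True) -> str:
    """Parse space-separated hex bytes and decode them with the given encoding."""
    raw = bytes.fromhex(hex_text)   # spaces between byte pairs are ignored
    return raw.decode(encoding, errors="strict" if strict else "replace")


print(decode_hex_input("e2 82 ac"))                       # euro sign
print(decode_hex_input("ac 20", encoding="utf-16-le"))    # euro sign again

# 0xFF can never appear in valid UTF-8.
print(repr(decode_hex_input("41 ff 42", strict=False)))   # 'A\ufffdB'
try:
    decode_hex_input("41 ff 42")
except UnicodeDecodeError as exc:
    print("strict mode rejected the input:", exc.reason)
```

Replace mode substitutes U+FFFD for the bad byte and keeps going, while strict mode stops at the first invalid sequence, which is usually the safer default when checking data.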

Encoding / Format | What It Does | Best Use
UTF-8 | Variable-length Unicode encoding using 1–4 bytes per code point. | Web pages, APIs, JSON, modern files, database text exchange.
UTF-16 | Uses 16-bit code units; supplementary characters use surrogate pairs. | JavaScript internals, Windows-style text processing, teaching code units.
UTF-32 | Uses one 32-bit value per Unicode code point. | Teaching, debugging, internal processing, direct code point inspection.
ASCII | 7-bit historical English-focused encoding. | Legacy systems, basic protocol examples, computer science basics.
Latin-1 / Windows-1252 | Single-byte legacy Western encodings. | Old files, mojibake debugging, legacy data migration.
Base64 | Represents bytes using printable text characters. | Emails, data URLs, APIs, tokens, binary-to-text transport.
Percent encoding | Represents bytes as %HH sequences. | URLs, query strings, URI debugging.

Score, Course, and Exam Table Note

Requested Item | Status for This Encoding Tool | Correct Guidance
Score guidelines | Not applicable | This is a text encoding and computer-science utility, not an official score calculator.
Score table | Not applicable | There is no universal academic score table for character encoding conversion.
Next exam timetable | Not applicable | Use official school, certification, or exam-board sources for course-specific exam dates.
Course relevance | Useful for computer science and web development | Supports Unicode, binary, hexadecimal, byte order, web text, encoding, decoding, and data representation lessons.

Character Encoding Converter FAQ

What is character encoding?

Character encoding is the method used to convert characters into bytes. Unicode defines code points, while encodings such as UTF-8 and UTF-16 define how those code points are stored.

What is the difference between Unicode and UTF-8?

Unicode is the character set and code point standard. UTF-8 is an encoding that represents Unicode code points as one to four bytes.

Why does an emoji count as more than one JavaScript character?

JavaScript strings use UTF-16 code units. Many emoji are supplementary characters represented by two UTF-16 code units, and some visible emoji contain multiple code points.

What is a byte order mark?

A byte order mark is a special byte sequence at the start of a file that can identify Unicode encoding or byte order, especially for UTF-16 and UTF-32.

Is Base64 encryption?

No. Base64 is a binary-to-text encoding. It is reversible and does not provide secrecy.

What is mojibake?

Mojibake is corrupted-looking text caused by decoding bytes with the wrong character encoding.

Should I use UTF-8 for new web content?

Yes. UTF-8 is the standard practical choice for modern web content, APIs, JSON, and general text interchange.
