Difference between codePointAt and charCodeAt
What is the difference between String.prototype.codePointAt() and String.prototype.charCodeAt() in JavaScript?
'A'.codePointAt(); // 65
'A'.charCodeAt(); // 65
Solution 1:[1]
To add to ToxicTeacakes's answer, here is another example to help show the difference:
"𠮷".charCodeAt(0).toString(16); // d842
"𠮷".charCodeAt(1).toString(16); // dfb7
"𠮷".codePointAt(0).toString(16); // 20bb7
"𠮷".codePointAt(1).toString(16); // dfb7
console.log("\ud842\udfb7"); // 𠮷, written as two escapes of four hexadecimal digits each
console.log("\u20bb7\udfb7"); // ₻7 followed by a lone low surrogate, because \u reads exactly four hex digits
console.log("\u{20bb7}"); // 𠮷, a Unicode code point escape equivalent to "\ud842\udfb7"
The following is the info about javascript string literals:
"\uXXXX"
The Unicode character specified by the four hexadecimal digits XXXX. For example, \u00A9 is the Unicode sequence for the copyright symbol.
"\u{XXXXX}"
Unicode code point escapes. For example, \u{2F804} is the same as the simple Unicode escapes \uD87E\uDC04.
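The equivalence of the two escape forms can be checked directly. This is only a small sketch; the variable names are made up for illustration:

```javascript
// Both literals describe the same string: one code point stored as two UTF-16 code units.
const viaCodePointEscape = "\u{2F804}";
const viaSurrogateEscapes = "\uD87E\uDC04";

console.log(viaCodePointEscape === viaSurrogateEscapes); // true
console.log(viaCodePointEscape.length); // 2
console.log(viaCodePointEscape.codePointAt(0).toString(16)); // 2f804
```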
see also msdn
Solution 2:[2]
Example in JS
In the example below, with strings and emojis, I am going to illustrate how things can go wrong when you do not know that some characters consist of 2 code units. Prefer codePointAt() over charCodeAt(), and use charCodeAt() only if you are sure that your characters lie between 0 and 65535 (2^16 - 1).
// charCodeAt() works on UTF-16 code units
// codePointAt() works on Unicode code points
/* UTF-16 is generally considered a bad idea today */
const strings = ["o", "four", "to"];
const emojis = ["🐎", "👟"]; // U+1F40E (128014) and U+1F45F (128095)
function printItemsLength(arr) {
  for (const item of arr) {
    // length counts UTF-16 code units, not characters
    console.log(item, item.length);
  }
}
printItemsLength(strings);
console.log('================================');
printItemsLength(emojis);
console.log('================================');
console.log("i.charCodeAt(0)", "i".charCodeAt(0)); // 105
console.log("i.charCodeAt(1)", "i".charCodeAt(1)); // NaN, index 1 is out of range
console.log("i.codePointAt(0)", "i".codePointAt(0)); // 105
console.log('=============EMOJIS=============');
// print the decimal (dec) values by which you can look the emojis up
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[0] + '.charCodeAt(0)', emojis[0].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[0] + '.charCodeAt(1)', emojis[0].charCodeAt(1)); // only half of the character (low surrogate) - 56334
console.log('===========codePointAt===========');
console.log(emojis[0] + '.codePointAt(0)', emojis[0].codePointAt(0)); // 128014
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[1] + '.charCodeAt(0)', emojis[1].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[1] + '.charCodeAt(1)', emojis[1].charCodeAt(1)); // only half of the character (low surrogate) - 56415
console.log('===========codePointAt===========');
// full-character
console.log(emojis[1] + '.codePointAt(0)', emojis[1].codePointAt(0)); // 128095
console.log(emojis[1] + '.codePointAt(1)', emojis[1].codePointAt(1)); // returns the lower surrogate (a non-displayable character) - 56415
// to find these emojis, have a look here: https://www.w3schools.com/charsets/ref_emoji.asp
As someone may have noticed, I tried to convert the char codes back to the emojis, and it did not work for one of the symbols. That is because its code point is above 65535 and so does not fit into a single UTF-16 code unit.
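The answer does not show the conversion code, so the following is only a sketch under the assumption that String.fromCharCode() was used for the round trip. It shows why that call loses the emoji and how String.fromCodePoint() restores it:

```javascript
const horseCodePoint = 128014; // U+1F40E, the value reported by codePointAt(0) above

// String.fromCharCode() truncates each argument to 16 bits, so the emoji is lost.
const truncated = String.fromCharCode(horseCodePoint);
console.log(truncated.length, truncated.charCodeAt(0).toString(16)); // 1 f40e

// String.fromCodePoint() accepts the full code point range.
const restored = String.fromCodePoint(horseCodePoint);
console.log(restored.length, restored.codePointAt(0)); // 2 128014

// Passing both surrogate halves to fromCharCode() also rebuilds the character.
const fromHalves = String.fromCharCode(55357, 56334);
console.log(fromHalves === restored); // true
```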
Introduction to Unicode and UTF-16
Please skip this section if you are already familiar with it.
Unicode is a set of characters used around the world.
UTF-16 encodes "$" as 00000000 00100100 (one 16-bit code unit) and "𤭢" as 11011000 01010010 11011111 01100010 (two 16-bit code units).
"Surrogate pair" characters are emoji and some letters that consist of more than one code unit.
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Unicode assigns every character a unique number called a code point.
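The arithmetic behind a surrogate pair can be sketched as follows. The helper names toSurrogatePair and fromSurrogatePair are hypothetical, not built-in functions; they just spell out the standard UTF-16 calculation for code points above 0xFFFF:

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000; // 20 bits remain
  const high = 0xD800 + (offset >> 10); // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: combine a high and a low surrogate back into a code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x20BB7);
console.log(high.toString(16), low.toString(16)); // d842 dfb7
console.log(fromSurrogatePair(0xD83D, 0xDC0E)); // 128014
```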
Differentiating charCodeAt() from codePointAt()
charCodeAt(pos) returns a code unit (not a full character).
If you need a character (that could be either one or two code units), you can use codePointAt(pos) to get its code.
charCodeAt() - returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index
codePointAt() - returns a non-negative integer that is the Unicode code point value at the given position
where pos is the index, counted in UTF-16 code units, of the position you want to check.
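To see the two views of one string side by side, here is a minimal sketch. The helpers codeUnitsOf and codePointsOf are illustrative names, not standard methods; the second relies on for...of iterating a string by code points:

```javascript
// Hypothetical helper: list every UTF-16 code unit, index by index.
function codeUnitsOf(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i));
  }
  return units;
}

// Hypothetical helper: list every code point; for...of never splits a surrogate pair.
function codePointsOf(text) {
  const points = [];
  for (const character of text) {
    points.push(character.codePointAt(0));
  }
  return points;
}

const sample = "a\u{1F40E}b"; // "a", the horse emoji, "b"
console.log(codeUnitsOf(sample).join(" ")); // 97 55357 56334 98
console.log(codePointsOf(sample).join(" ")); // 97 128014 98
```

Stepping through a string with codePointAt(i) and i++ would still land on the low surrogate at index 2, which is why the sketch iterates with for...of instead.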
Quote from the book:
UTF-16 is generally considered a bad idea today. It seems almost intentionally designed to invite mistakes. It's easy to write programs that pretend code units and characters are the same things.
jsfiddle sandbox
Sources:
- What is Unicode, UTF-8, UTF-16?
- Marijn Haverbeke, Eloquent JavaScript, 3rd Edition: A Modern Introduction to Programming. No Starch Press, 2018. 447 p. Chapter 5, p. 91 => Strings and character codes
- What is "surrogate pair"
- To find these emojis, have a look at w3schools.com/charsets/ref_emoji
Solution 3:[3]
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt
The page above describes the differences. The two methods do almost the same thing, but they differ in their return values and in how they handle an invalid argument.
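One concrete difference in return values shows up with an out-of-range index. A short sketch, with variable names chosen only for illustration:

```javascript
const text = "abc";

// An out-of-range index yields NaN from charCodeAt() but undefined from codePointAt().
const missingUnit = text.charCodeAt(10);
const missingPoint = text.codePointAt(10);
console.log(missingUnit); // NaN
console.log(missingPoint); // undefined

// With no argument, both methods default to index 0.
console.log(text.charCodeAt(), text.codePointAt()); // 97 97
```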
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Community |
| Solution 2 | Utmost Creator |
| Solution 3 | Allen |
