Is there a way to check whether Unicode text is in a certain language?

I'll be getting text from a user, and I need to validate that it contains Chinese characters.

Is there any way I can check this?



Solution 1:[1]

You can use a regular expression that matches one of the supported named blocks:

// Requires: using System.Text.RegularExpressions;
// Extension methods must be declared in a static class.
private static readonly Regex cjkCharRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}");
public static bool IsChinese(this char c)
{
    return cjkCharRegex.IsMatch(c.ToString());
}

Then, you can use:

if (sometext.Any(z => z.IsChinese()))
    DoSomething();

Solution 2:[2]

As several people have mentioned, Unicode encodes Chinese, Japanese, and Korean characters together, across several ranges: https://en.wikipedia.org/wiki/CJK_Compatibility

For simplicity, here is a code sample that treats the whole CJK span as a single range. Note that a C# char is a single UTF-16 code unit, so code points above 0xFFFF appear as surrogate pairs and the upper bound is never reached by one char; the check also matches non-Chinese blocks (such as Hangul syllables) that happen to fall inside the span:

public bool IsChinese(string text)
{
    return text.Any(c => (uint)c >= 0x4E00 && (uint)c <= 0x2FA1F);
}

Solution 3:[3]

Just check the characters to see if the code points are in the desired range(s). For example, see this question:

What's the complete range for Chinese characters in Unicode?
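As a minimal sketch of that approach (shown in JavaScript for brevity; the function name and the choice of only the CJK Unified Ideographs block, U+4E00..U+9FFF, are assumptions, not part of the linked answer):

```javascript
// Returns true if any code point in the string falls in the
// CJK Unified Ideographs block (U+4E00..U+9FFF).
// for..of iterates whole code points, so characters outside the
// Basic Multilingual Plane are handled correctly.
function containsChinese(text) {
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code >= 0x4e00 && code <= 0x9fff) {
      return true;
    }
  }
  return false;
}

console.log(containsChinese('hello'));   // false
console.log(containsChinese('hi 中文')); // true
```

The same loop extends naturally to multiple ranges if you need the rarer extension blocks as well.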

Solution 4:[4]

According to Wikipedia (https://en.wikipedia.org/wiki/CJK_Compatibility) there are several ranges of character codes. Here is my approach to detecting Chinese characters, based on the link above (the code is in F#, but it can easily be converted):

let isChinese (text: string) =
    text |> Seq.exists (fun c ->
        // Note: c is a UTF-16 code unit, so the ranges above 0xFFFF
        // are only reached via surrogate pairs, not whole code points.
        let code = int c
        (code >= 0x4E00 && code <= 0x9FFF) ||
        (code >= 0x3400 && code <= 0x4DBF) ||
        (code >= 0x20000 && code <= 0x2CEAF) ||
        (code >= 0x2E80 && code <= 0x31EF) ||
        (code >= 0xF900 && code <= 0xFAFF) ||
        (code >= 0xFE30 && code <= 0xFE4F) ||
        (code >= 0x2F800 && code <= 0x2FA1F))

Solution 5:[5]

I found another way, using UnicodeRanges (more precisely UnicodeRanges.CjkUnifiedIdeographs), if someone is looking for one:

// Requires: using System.Text.Unicode;
public bool IsChinese(char character)
{
    var minValue = UnicodeRanges.CjkUnifiedIdeographs.FirstCodePoint;
    var maxValue = minValue + UnicodeRanges.CjkUnifiedIdeographs.Length;
    return character >= minValue && character < maxValue;
}

Solution 6:[6]

In Unicode, Chinese, Japanese, and Korean characters are encoded together.

See this FAQ: http://www.unicode.org/faq/han_cjk.html

Chinese characters are distributed across several blocks.

See this wiki page: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs

You will find several CJK character charts on the Unicode website.

For simplicity, you can just check against the minimum and maximum Chinese character code points, 0x4E00 and 0x2FA1F. Be aware that this span also contains non-Chinese blocks, so it will over-match.
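That coarse min/max check can be sketched as follows (JavaScript; the function name is an assumption, and the bounds are the ones quoted above):

```javascript
// Coarse check: does any code point fall between the lowest (0x4E00)
// and highest (0x2FA1F) CJK code points? This deliberately over-matches:
// the span also contains Hangul syllables and other non-Chinese blocks.
const hasCjk = (text) =>
  [...text].some((ch) => {
    const code = ch.codePointAt(0);
    return code >= 0x4e00 && code <= 0x2fa1f;
  });
```

Use it only when false positives on Korean or Japanese text are acceptable; otherwise test the individual blocks.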

Solution 7:[7]

This worked for me, though note that UnicodeCategory.OtherLetter covers letters from many scripts (Arabic, Hebrew, kana, and so on), not only Chinese:

// Requires: using System.Globalization;
var isChineseTextPresent = false;

foreach (var character in text)
{
    if (char.GetUnicodeCategory(character) == UnicodeCategory.OtherLetter)
    {
        isChineseTextPresent = true;
        break;
    }
}

Solution 8:[8]

I added this for a project; it is incomplete and could be further optimized (by checking the code blocks in the right order), but it gets the job done well enough.

const CHINESE_UNICODE_BLOCKS = [
  [0x3400, 0x4DB5],
  [0x4E00, 0x62FF],
  [0x6300, 0x77FF],
  [0x7800, 0x8CFF],
  [0x8D00, 0x9FCC],
  [0x2e80, 0x2fd5],
  [0x3190, 0x319f],
  [0x3400, 0x4DBF],
  [0x4E00, 0x9FCC],
  [0xF900, 0xFAAD],
  [0x20000, 0x215FF],
  [0x21600, 0x230FF],
  [0x23100, 0x245FF],
  [0x24600, 0x260FF],
  [0x26100, 0x275FF],
  [0x27600, 0x290FF],
  [0x29100, 0x2A6DF],
  [0x2A700, 0x2B734],
  [0x2B740, 0x2B81D]
]

const JAPANESE_UNICODE_BLOCKS = [
  [0x3041, 0x3096],
  [0x30A0, 0x30FF],
  [0x3400, 0x4DB5],
  [0x4E00, 0x9FCB],
  [0xF900, 0xFA6A],
  [0x2E80, 0x2FD5],
  [0xFF5F, 0xFF9F],
  [0x3000, 0x303F],
  [0x31F0, 0x31FF],
  [0x3220, 0x3243],
  [0x3280, 0x337F],
  [0xFF01, 0xFF5E],
]

const LATIN_UNICODE_BLOCKS = [
  [0x0000, 0x007F],
  [0x0080, 0x00FF],
  [0x0100, 0x017F],
  [0x0180, 0x024F],
  [0x0250, 0x02AF],
  [0x02B0, 0x02FF],
  [0x1D00, 0x1D7F],
  [0x1D80, 0x1DBF],
  [0x1E00, 0x1EFF],
  [0x2070, 0x209F],
  [0x2100, 0x214F],
  [0x2150, 0x218F],
  [0x2C60, 0x2C7F],
  [0xA720, 0xA7FF],
  [0xAB30, 0xAB6F],
  [0xFB00, 0xFB4F],
  [0xFF00, 0xFFEF],
  [0x10780, 0x107BF],
  [0x1DF00, 0x1DFFF],
]

const DEVANAGARI_UNICODE_BLOCKS = [
  [0x0900, 0x097F]
]

const ARABIC_UNICODE_BLOCKS = [
  [0x0600, 0x06FF],
  [0x0750, 0x077F],
  [0x0870, 0x089F],
  [0x08A0, 0x08FF],
  [0xFB50, 0xFDFF],
  [0xFE70, 0xFEFF],
  [0x10E60, 0x10E7F],
  [0x1EC70, 0x1ECBF],
  [0x1ED00, 0x1ED4F],
  [0x1EE00, 0x1EEFF],
]

const TIBETAN_UNICODE_BLOCKS = [
  [0x0F00, 0x0FFF],
]

const GREEK_UNICODE_BLOCKS = [
  [0x0370, 0x03FF],
  [0x1D00, 0x1D7F],
  [0x1D80, 0x1DBF],
  [0x1F00, 0x1FFF],
  [0x2100, 0x214F],
  [0xAB30, 0xAB6F],
  [0x10140, 0x1018F],
  [0x10190, 0x101CF],
  [0x1D200, 0x1D24F],
]

const TAMIL_UNICODE_BLOCKS = [
  [0x0B80, 0x0BFF],
]

const CYRILLIC_UNICODE_BLOCKS = [
  [0x0400, 0x04FF],
  [0x0500, 0x052F],
  [0x2DE0, 0x2DFF],
  [0xA640, 0xA69F],
  [0x1C80, 0x1C8F],
  [0x1D2B, 0x1D78],
  [0xFE2E, 0xFE2F],
]

const HEBREW_UNICODE_BLOCKS = [
  [0x0590, 0x05FF],
]

function detectMostProminentLanguage(characters) {
  const possibilities = detectLanguageProbabilities(characters)

  let maxPair = [null, 0]
  let sum = 0

  Object.keys(possibilities).forEach(system => {
    const value = possibilities[system]
    sum += value

    if (maxPair[1] < value && system !== 'other') {
      maxPair[0] = system
      maxPair[1] = value
    }
  })

  return { system: maxPair[0], accuracy: maxPair[1] / sum }
}

function detectLanguageProbabilities(characters) {
  const possibilities = {}

  for (const character of characters) {
    if (isLatin(character)) {
      add(possibilities, 'latin')
    } else if (isChinese(character)) {
      add(possibilities, 'chinese')
    } else if (isJapanese(character)) {
      add(possibilities, 'japanese')
    } else if (isDevanagari(character)) {
      add(possibilities, 'devanagari')
    } else if (isHebrew(character)) {
      add(possibilities, 'hebrew')
    } else if (isTamil(character)) {
      add(possibilities, 'tamil')
    } else if (isGreek(character)) {
      add(possibilities, 'greek')
    } else if (isTibetan(character)) {
      add(possibilities, 'tibetan')
    } else if (isArabic(character)) {
      add(possibilities, 'arabic')
    } else if (isCyrillic(character)) {
      add(possibilities, 'cyrillic')
    } else {
      add(possibilities, 'other')
    }
  }

  return possibilities
}

function isHebrew(character) {
  return isWithinRange(HEBREW_UNICODE_BLOCKS, character)
}

function isCyrillic(character) {
  return isWithinRange(CYRILLIC_UNICODE_BLOCKS, character)
}

function isArabic(character) {
  return isWithinRange(ARABIC_UNICODE_BLOCKS, character)
}

function isTibetan(character) {
  return isWithinRange(TIBETAN_UNICODE_BLOCKS, character)
}

function isGreek(character) {
  return isWithinRange(GREEK_UNICODE_BLOCKS, character)
}

function isTamil(character) {
  return isWithinRange(TAMIL_UNICODE_BLOCKS, character)
}

function isDevanagari(character) {
  return isWithinRange(DEVANAGARI_UNICODE_BLOCKS, character)
}

function isJapanese(character) {
  return isWithinRange(JAPANESE_UNICODE_BLOCKS, character)
}

function isLatin(character) {
  return isWithinRange(LATIN_UNICODE_BLOCKS, character)
}

function isChinese(character) {
  return isWithinRange(CHINESE_UNICODE_BLOCKS, character)
}

function isWithinRange(blocks, character) {
  return blocks.some(([ start, end ]) => {
    const code = character.codePointAt(0)
    return code >= start && code <= end
  })
}

function add(possibilities, type) {
  possibilities[type] = possibilities[type] ?? 0
  possibilities[type]++
}

// The original sample strings were lost to an encoding error;
// any short text in each script works, for example:
log('abc')
log('你好世界')   // chinese
log('こんにちは') // japanese
log('नमस्ते')      // devanagari
log('שלום')       // hebrew
log('வணக்கம்')    // tamil
log('γειά σου')   // greek
log('བཀྲ་ཤིས།')    // tibetan
log('مرحبا')      // arabic
log('привет')     // cyrillic

function log(text) {
  const { system, accuracy } = detectMostProminentLanguage([...text])
  console.log(`${text} => ${system} (${accuracy})`)
}

Solution 9:[9]

You need to query the Unicode Character Database, which contains information on every Unicode character. There is probably a utility function in C# that can do this for you; otherwise you can download the database from unicode.org.
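Many environments already wrap that database lookup. In JavaScript, for example, Unicode property escapes in regular expressions (with the u flag) expose the UCD Script property directly; the helper name here is an assumption:

```javascript
// \p{Script=Han} matches any character the Unicode Character Database
// assigns to the Han script (the ideographs used for Chinese).
const hanRegex = /\p{Script=Han}/u;

function containsHan(text) {
  return hanRegex.test(text);
}

console.log(containsHan('中文')); // true
console.log(containsHan('abc'));  // false
```

This is script-based rather than block-based, so unlike the raw range checks it excludes kana, Hangul, and other non-Han scripts automatically.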

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 hshib
Solution 2 Milana
Solution 3 Community
Solution 4 eternity
Solution 5 Krazyxx
Solution 6 liyonghelpme
Solution 7 Martin
Solution 8 Lance
Solution 9 Dov Grobgeld