From: | MTD | https://github.com/mtrempoltsev | |
Date: | 11.01.18 15:56 | ||
Rating: | +1 |
From: | Mna | 404 and heavy formation | |
Date: | 11.01.18 15:48 | ||
Rating: |
-e encoding
--encoding=encoding
Select the character encoding of the strings that are to be found.
Possible values for encoding are: s = single-7-bit-byte characters
(ASCII, ISO 8859, etc., default), S = single-8-bit-byte characters,
b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit
bigendian, L = 32-bit littleendian. Useful for finding wide
character strings. (l and b apply to, for example, Unicode
UTF-16/UCS-2 encodings).
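To make the `l` (16-bit little-endian) case concrete, here is a minimal Python sketch of the idea. It is not the binutils implementation: the function name `find_utf16le_strings` and the `min_len` parameter are made up for illustration, and it only recognises printable ASCII characters stored as 16-bit units.

```python
import re


def find_utf16le_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters
    stored as 16-bit little-endian units (ASCII byte, then a zero byte)."""
    # Assumption for this sketch: only code points 0x20..0x7e count as
    # printable, and matches are not required to start at an even offset.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


# A small buffer: binary noise, a wide string, more noise, a narrow string.
blob = b"\x01\x02" + "hello".encode("utf-16-le") + b"\xff\xff" + b"plain"
print(find_utf16le_strings(blob))
```

The narrow string `plain` is not reported, because its bytes are not followed by zero bytes. That is why a separate encoding switch is needed for wide-character strings.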
Besides those, there is:
Do C++11 regular expressions work with UTF-8 strings? https://stackoverflow.com/questions/11254232/do-c11-regular-expressions-work-with-utf-8-strings
// it is claimed there that the magic of searching in UTF-8 works, though how is unclear, with either Boost.Regex or ICU regex
Boost.Regex
// somewhere deep in its documentation it is claimed that this magic should somehow work:
// Transparently search Unicode strings that are encoded as either UTF-8, UTF-16 or UTF-32.
ICU regex
// based on Java 1.4 code, so it should be UCS-2 there
They also refer to http://www.pcre.org/, which claims that it can somehow match all three kinds(*)
(* UTF-8, UTF-16, or UTF-32)
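As a rough illustration of why "how" matters, here is a small Python sketch. Python's `re` module stands in for the engines named above, so this says nothing about what Boost.Regex, ICU or PCRE actually do. It only shows that the same pattern behaves differently on raw UTF-8 bytes and on decoded text. The sample string is an arbitrary choice.

```python
import re

# "naïve café", written with escapes so the source file stays pure ASCII.
text = "na\u00efve caf\u00e9"
raw = text.encode("utf-8")

# On decoded text, '.' consumes one code point at a time.
chars = re.findall(r".", text)
# On raw UTF-8 bytes, '.' consumes one byte, so each accented
# letter (two bytes in UTF-8) is counted twice.
units = re.findall(rb".", raw)

# '\w+' on text treats accented letters as word characters;
# on bytes it only knows ASCII, so words break at multi-byte sequences.
words_text = re.findall(r"\w+", text)
words_raw = re.findall(rb"\w+", raw)

print(len(chars), len(units))
print(words_text)
print(words_raw)
```

An engine that claims to search UTF-8 "transparently" has to decode multi-byte sequences into code points somewhere, either up front or while matching. A byte-oriented engine gives results like the last line.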
From: | kov_serg | ||
Date: | 11.01.18 20:25 | ||
Rating: |
man utf-8 | |
UTF-8 - an ASCII compatible multi-byte Unicode encoding

The UTF-8 encoding has the following nice properties:

* UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
* All UCS characters > 0x7f are encoded as a multi-byte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with e.g. '\0' or '/'.
* The lexicographic sorting order of UCS-4 strings is preserved.
* All possible 2^31 UCS codes can be encoded using UTF-8.
* The bytes 0xfe and 0xff are never used in the UTF-8 encoding.
* The first byte of a multi-byte sequence which represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long this multi-byte sequence is. All further bytes in a multi-byte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
* UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in UTF-8.

ENCODING

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:

Range                     Bits  Byte sequence
0x00000000 - 0x0000007F:   7    0xxxxxxx
0x00000080 - 0x000007FF:  11    110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:  16    1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:  21    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:  26    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:  31    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. Only the shortest possible multi-byte sequence which can represent the code number of the character can be used.

The UCS code values 0xd800-0xdfff (UTF-16 surrogates) as well as 0xfffe and 0xffff (UCS non-characters) should not appear in conforming UTF-8 streams.

EXAMPLES

The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xc2 0xa9

and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as:

11100010 10001001 10100000 = 0xe2 0x89 0xa0
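The byte sequences listed above translate almost directly into code. Here is a minimal Python sketch, assuming the original 31-bit layout from the man page rather than the modern four-byte limit. The function name `utf8_encode` is hypothetical, and the sketch does not reject surrogates or non-characters.

```python
def utf8_encode(code):
    """Encode one UCS code number using the byte layouts from the man page."""
    if code < 0 or code > 0x7FFFFFFF:
        raise ValueError("code number out of range")
    if code <= 0x7F:
        return bytes([code])  # 0xxxxxxx: plain ASCII
    # (upper limit of the range, continuation bytes, lead-byte prefix)
    layouts = [
        (0x7FF, 1, 0xC0),       # 110xxxxx
        (0xFFFF, 2, 0xE0),      # 1110xxxx
        (0x1FFFFF, 3, 0xF0),    # 11110xxx
        (0x3FFFFFF, 4, 0xF8),   # 111110xx
        (0x7FFFFFFF, 5, 0xFC),  # 1111110x
    ]
    # Pick the first (shortest) layout whose range contains the code.
    for limit, extra, prefix in layouts:
        if code <= limit:
            break
    # Lead byte: prefix bits plus the topmost payload bits.
    out = [prefix | (code >> (6 * extra))]
    # Continuation bytes: 10xxxxxx, six payload bits each, high bits first.
    for i in range(extra - 1, -1, -1):
        out.append(0x80 | ((code >> (6 * i)) & 0x3F))
    return bytes(out)


# The two examples from the man page.
print(utf8_encode(0xA9).hex())    # c2a9
print(utf8_encode(0x2260).hex())  # e289a0
```

For code numbers up to 0x10ffff, excluding surrogates, the result should agree with Python's built-in `chr(code).encode("utf-8")`, which is a convenient way to cross-check the sketch.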
From: | Mna | 404 and heavy formation | |
Date: | 16.01.18 14:57 | ||
Rating: |
From: | Mna | 404 and heavy formation | |
Date: | 16.01.18 15:05 | ||
Rating: |
man utf-8 | |
_>content skipped | |