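The Crystal Programming Language and Encodings
I recently struggled with setting file encodings correctly in the Crystal programming language, which is why I am writing this down here.
Crystal reads and writes files as UTF-8 by default.
Sometimes you will come across files encoded in the Windows-1252 code page. (This is the Western European code page, used for example for German umlauts.)
If the wrong encoding is used, the "special characters" in the file end up garbled in your output file, or Crystal raises an error (roughly: invalid byte sequence encountered in UTF-8 stream).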
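To make the problem concrete, here is a minimal sketch that works on in-memory bytes rather than a file. The sample word "Grüße" and the variable names are assumptions chosen for illustration; `String.new` and `String#valid_encoding?` are part of Crystal's standard library.

```crystal
# Raw Windows-1252 bytes for "Grüße" (0xFC = ü, 0xDF = ß).
bytes = Bytes[0x47, 0x72, 0xFC, 0xDF, 0x65]

# Without an encoding argument the bytes are taken as they are and
# treated as UTF-8, which they are not.
as_utf8 = String.new(bytes)
puts as_utf8.valid_encoding? # => false

# Naming the source encoding converts the bytes to a valid UTF-8 string.
decoded = String.new(bytes, "WINDOWS-1252")
puts decoded.valid_encoding? # => true
puts decoded                 # => Grüße
```

A check like `valid_encoding?` can be a quick way to notice that a file was probably not UTF-8 in the first place.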
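Here is how to set the encoding correctly for reading a file:
    require "csv"

    cpf = File.open("filename")
    cpf.set_encoding("WINDOWS-1252", nil)
    # Now do something with the file, for example use it as CSV.
    cpf_csv = CSV.new(cpf, headers: true, separator: ',', quote_char: '"')
The encoding is passed to set_encoding as a string. As the second argument you pass nil. This is the default for the invalid parameter and means that invalid byte sequences raise an error; the only other accepted value is :skip, which ignores them.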
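As an alternative to calling set_encoding after opening, `File.open` and `File.read` also accept an `encoding` argument directly. The sketch below compares the three variants on a temporary file; the file name prefix and the sample content are assumptions made up for this example.

```crystal
# Temporary file for this sketch only.
path = File.tempname("cp1252-in", ".txt")

# Write the raw Windows-1252 bytes for "Grüße" (0xFC = ü, 0xDF = ß).
File.write(path, Bytes[0x47, 0x72, 0xFC, 0xDF, 0x65])

# Variant 1: set the encoding after opening the file.
text_a = File.open(path) do |file|
  file.set_encoding("WINDOWS-1252", nil)
  file.gets_to_end
end

# Variant 2: pass the encoding when opening the file.
text_b = File.open(path, encoding: "WINDOWS-1252") do |file|
  file.gets_to_end
end

# Variant 3: read the whole file in one call.
text_c = File.read(path, encoding: "WINDOWS-1252")

puts text_a # => Grüße
puts text_a == text_b && text_b == text_c # => true

File.delete(path)
```

The block form of `File.open` closes the file automatically, which is why no explicit `close` appears here.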
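Here is how to set the encoding correctly for writing a file:
    output_csv = File.open("filename", "w")
    output_csv.set_encoding("WINDOWS-1252", nil)
    # Now do something with the file.
    output_csv.print "\"test\";\"test\""
    # Close the file so that buffered output is written to disk.
    output_csv.close
It is the same procedure.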
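To check that the output really is Windows-1252, one option is to read the file back as raw bytes. This is only a sketch: the temporary file name and the sample line are assumptions, and the encoding is passed to `File.open` directly instead of via set_encoding.

```crystal
# Temporary output file for this sketch only.
path = File.tempname("cp1252-out", ".csv")

# The block form closes (and flushes) the file automatically.
File.open(path, "w", encoding: "WINDOWS-1252") do |file|
  file.print "\"Grüße\";\"test\"\n"
end

# Read the file back without any decoding to inspect the raw bytes.
raw = File.read(path).to_slice

puts raw.size       # => 15 (the same line would take 17 bytes in UTF-8)
puts raw[3] == 0xFC # => true (ü is the single byte 0xFC)
puts raw[4] == 0xDF # => true (ß is the single byte 0xDF)

File.delete(path)
```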
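Which names can be passed as the encoding?
Crystal uses iconv to convert between encodings. You can therefore list the available encoding names with:
    iconv --list
On my system, for example, this produces the following list: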
437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865, 866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4, 8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110, ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5, BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5, CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278, CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424, CP437, CP500, CP737, CP770, CP771, CP772, CP773, CP774, CP775, CP803, CP813, CP819, CP850, CP851, CP852, CP855, CP856, CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP866NAV, CP868, CP869, CP870, CP871, CP874, CP875, CP880, CP891, CP901, CP902, CP903, CP904, CP905, CP912, CP915, CP916, CP918, CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949, CP950, CP1004, CP1008, CP1025, CP1026, CP1046, CP1047, CP1070, CP1079, CP1081, CP1084, CP1089, CP1097, CP1112, CP1122, CP1123, CP1124, CP1125, CP1129, CP1130, CP1132, CP1133, CP1137, CP1140, CP1141, CP1142, CP1143, CP1144, CP1145, CP1146, CP1147, CP1148, CP1149, CP1153, CP1154, CP1155, CP1156, CP1157, CP1158, CP1160, CP1161, CP1162, CP1163, CP1164, CP1166, CP1167, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258, CP1282, CP1361, CP1364, CP1371, CP1388, CP1390, CP1399, CP4517, CP4899, CP4909, CP4971, CP5347, CP9030, CP9066, CP9448, CP10007, CP12712, CP16804, CPIBM861, CSA7-1, CSA7-2, CSASCII, CSA_T500-1983, CSA_T500, CSA_Z243.4-1985-1, CSA_Z243.4-1985-2, CSA_Z243.419851, CSA_Z243.419852, CSDECMCS, CSEBCDICATDE, CSEBCDICATDEA, CSEBCDICCAFR, CSEBCDICDKNO, CSEBCDICDKNOA, CSEBCDICES, CSEBCDICESA, CSEBCDICESS, CSEBCDICFISE, CSEBCDICFISEA, CSEBCDICFR, CSEBCDICIT, CSEBCDICPT, CSEBCDICUK, CSEBCDICUS, CSEUCKR, CSEUCPKDFMTJAPANESE, CSGB2312, CSHPROMAN8, CSIBM037, CSIBM038, CSIBM273, CSIBM274, CSIBM275,
CSIBM277, CSIBM278, CSIBM280, CSIBM281, CSIBM284, CSIBM285, CSIBM290, CSIBM297, CSIBM420, CSIBM423, CSIBM424, CSIBM500, CSIBM803, CSIBM851, CSIBM855, CSIBM856, CSIBM857, CSIBM860, CSIBM863, CSIBM864, CSIBM865, CSIBM866, CSIBM868, CSIBM869, CSIBM870, CSIBM871, CSIBM880, CSIBM891, CSIBM901, CSIBM902, CSIBM903, CSIBM904, CSIBM905, CSIBM918, CSIBM921, CSIBM922, CSIBM930, CSIBM932, CSIBM933, CSIBM935, CSIBM937, CSIBM939, CSIBM943, CSIBM1008, CSIBM1025, CSIBM1026, CSIBM1097, CSIBM1112, CSIBM1122, CSIBM1123, CSIBM1124, CSIBM1129, CSIBM1130, CSIBM1132, CSIBM1133, CSIBM1137, CSIBM1140, CSIBM1141, CSIBM1142, CSIBM1143, CSIBM1144, CSIBM1145, CSIBM1146, CSIBM1147, CSIBM1148, CSIBM1149, CSIBM1153, CSIBM1154, CSIBM1155, CSIBM1156, CSIBM1157, CSIBM1158, CSIBM1160, CSIBM1161, CSIBM1163, CSIBM1164, CSIBM1166, CSIBM1167, CSIBM1364, CSIBM1371, CSIBM1388, CSIBM1390, CSIBM1399, CSIBM4517, CSIBM4899, CSIBM4909, CSIBM4971, CSIBM5347, CSIBM9030, CSIBM9066, CSIBM9448, CSIBM12712, CSIBM16804, CSIBM11621162, CSISO4UNITEDKINGDOM, CSISO10SWEDISH, CSISO11SWEDISHFORNAMES, CSISO14JISC6220RO, CSISO15ITALIAN, CSISO16PORTUGESE, CSISO17SPANISH, CSISO18GREEK7OLD, CSISO19LATINGREEK, CSISO21GERMAN, CSISO25FRENCH, CSISO27LATINGREEK1, CSISO49INIS, CSISO50INIS8, CSISO51INISCYRILLIC, CSISO58GB1988, CSISO60DANISHNORWEGIAN, CSISO60NORWEGIAN1, CSISO61NORWEGIAN2, CSISO69FRENCH, CSISO84PORTUGUESE2, CSISO85SPANISH2, CSISO86HUNGARIAN, CSISO88GREEK7, CSISO89ASMO449, CSISO90, CSISO92JISC62991984B, CSISO99NAPLPS, CSISO103T618BIT, CSISO111ECMACYRILLIC, CSISO121CANADIAN1, CSISO122CANADIAN2, CSISO139CSN369103, CSISO141JUSIB1002, CSISO143IECP271, CSISO150, CSISO150GREEKCCITT, CSISO151CUBA, CSISO153GOST1976874, CSISO646DANISH, CSISO2022CN, CSISO2022JP, CSISO2022JP2, CSISO2022KR, CSISO2033, CSISO5427CYRILLIC, CSISO5427CYRILLIC1981, CSISO5428GREEK, CSISO10367BOX, CSISOLATIN1, CSISOLATIN2, CSISOLATIN3, CSISOLATIN4, CSISOLATIN5, CSISOLATIN6, CSISOLATINARABIC, CSISOLATINCYRILLIC, CSISOLATINGREEK, CSISOLATINHEBREW, CSKOI8R,
CSKSC5636, CSMACINTOSH, CSNATSDANO, CSNATSSEFI, CSN_369103, CSPC8CODEPAGE437, CSPC775BALTIC, CSPC850MULTILINGUAL, CSPC862LATINHEBREW, CSPCP852, CSSHIFTJIS, CSUCS4, CSUNICODE, CSWINDOWS31J, CUBA, CWI-2, CWI, CYRILLIC, DE, DEC-MCS, DEC, DECMCS, DIN_66003, DK, DS2089, DS_2089, E13B, EBCDIC-AT-DE-A, EBCDIC-AT-DE, EBCDIC-BE, EBCDIC-BR, EBCDIC-CA-FR, EBCDIC-CP-AR1, EBCDIC-CP-AR2, EBCDIC-CP-BE, EBCDIC-CP-CA, EBCDIC-CP-CH, EBCDIC-CP-DK, EBCDIC-CP-ES, EBCDIC-CP-FI, EBCDIC-CP-FR, EBCDIC-CP-GB, EBCDIC-CP-GR, EBCDIC-CP-HE, EBCDIC-CP-IS, EBCDIC-CP-IT, EBCDIC-CP-NL, EBCDIC-CP-NO, EBCDIC-CP-ROECE, EBCDIC-CP-SE, EBCDIC-CP-TR, EBCDIC-CP-US, EBCDIC-CP-WT, EBCDIC-CP-YU, EBCDIC-CYRILLIC, EBCDIC-DK-NO-A, EBCDIC-DK-NO, EBCDIC-ES-A, EBCDIC-ES-S, EBCDIC-ES, EBCDIC-FI-SE-A, EBCDIC-FI-SE, EBCDIC-FR, EBCDIC-GREEK, EBCDIC-INT, EBCDIC-INT1, EBCDIC-IS-FRISS, EBCDIC-IT, EBCDIC-JP-E, EBCDIC-JP-KANA, EBCDIC-PT, EBCDIC-UK, EBCDIC-US, EBCDICATDE, EBCDICATDEA, EBCDICCAFR, EBCDICDKNO, EBCDICDKNOA, EBCDICES, EBCDICESA, EBCDICESS, EBCDICFISE, EBCDICFISEA, EBCDICFR, EBCDICISFRISS, EBCDICIT, EBCDICPT, EBCDICUK, EBCDICUS, ECMA-114, ECMA-118, ECMA-128, ECMA-CYRILLIC, ECMACYRILLIC, ELOT_928, ES, ES2, EUC-CN, EUC-JISX0213, EUC-JP-MS, EUC-JP, EUC-KR, EUC-TW, EUCCN, EUCJP-MS, EUCJP-OPEN, EUCJP-WIN, EUCJP, EUCKR, EUCTW, FI, FR, GB, GB2312, GB13000, GB18030, GBK, GB_1988-80, GB_198880, GEORGIAN-ACADEMY, GEORGIAN-PS, GOST_19768-74, GOST_19768, GOST_1976874, GREEK-CCITT, GREEK, GREEK7-OLD, GREEK7, GREEK7OLD, GREEK8, GREEKCCITT, HEBREW, HP-GREEK8, HP-ROMAN8, HP-ROMAN9, HP-THAI8, HP-TURKISH8, HPGREEK8, HPROMAN8, HPROMAN9, HPTHAI8, HPTURKISH8, HU, IBM-803, IBM-856, IBM-901, IBM-902, IBM-921, IBM-922, IBM-930, IBM-932, IBM-933, IBM-935, IBM-937, IBM-939, IBM-943, IBM-1008, IBM-1025, IBM-1046, IBM-1047, IBM-1097, IBM-1112, IBM-1122, IBM-1123, IBM-1124, IBM-1129, IBM-1130, IBM-1132, IBM-1133, IBM-1137, IBM-1140, IBM-1141, IBM-1142, IBM-1143, IBM-1144, IBM-1145, IBM-1146, IBM-1147, IBM-1148, IBM-1149, IBM-1153, IBM-1154,
IBM-1155, IBM-1156, IBM-1157, IBM-1158, IBM-1160, IBM-1161, IBM-1162, IBM-1163, IBM-1164, IBM-1166, IBM-1167, IBM-1364, IBM-1371, IBM-1388, IBM-1390, IBM-1399, IBM-4517, IBM-4899, IBM-4909, IBM-4971, IBM-5347, IBM-9030, IBM-9066, IBM-9448, IBM-12712, IBM-16804, IBM037, IBM038, IBM256, IBM273, IBM274, IBM275, IBM277, IBM278, IBM280, IBM281, IBM284, IBM285, IBM290, IBM297, IBM367, IBM420, IBM423, IBM424, IBM437, IBM500, IBM775, IBM803, IBM813, IBM819, IBM848, IBM850, IBM851, IBM852, IBM855, IBM856, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM866NAV, IBM868, IBM869, IBM870, IBM871, IBM874, IBM875, IBM880, IBM891, IBM901, IBM902, IBM903, IBM904, IBM905, IBM912, IBM915, IBM916, IBM918, IBM920, IBM921, IBM922, IBM930, IBM932, IBM933, IBM935, IBM937, IBM939, IBM943, IBM1004, IBM1008, IBM1025, IBM1026, IBM1046, IBM1047, IBM1089, IBM1097, IBM1112, IBM1122, IBM1123, IBM1124, IBM1129, IBM1130, IBM1132, IBM1133, IBM1137, IBM1140, IBM1141, IBM1142, IBM1143, IBM1144, IBM1145, IBM1146, IBM1147, IBM1148, IBM1149, IBM1153, IBM1154, IBM1155, IBM1156, IBM1157, IBM1158, IBM1160, IBM1161, IBM1162, IBM1163, IBM1164, IBM1166, IBM1167, IBM1364, IBM1371, IBM1388, IBM1390, IBM1399, IBM4517, IBM4899, IBM4909, IBM4971, IBM5347, IBM9030, IBM9066, IBM9448, IBM12712, IBM16804, IEC_P27-1, IEC_P271, INIS-8, INIS-CYRILLIC, INIS, INIS8, INISCYRILLIC, ISIRI-3342, ISIRI3342, ISO-2022-CN-EXT, ISO-2022-CN, ISO-2022-JP-2, ISO-2022-JP-3, ISO-2022-JP, ISO-2022-KR, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-9E, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4, ISO-10646/UTF-8, ISO-10646/UTF8, ISO-CELTIC, ISO-IR-4, ISO-IR-6, ISO-IR-8-1, ISO-IR-9-1, ISO-IR-10, ISO-IR-11, ISO-IR-14, ISO-IR-15, ISO-IR-16, ISO-IR-17, ISO-IR-18, ISO-IR-19, ISO-IR-21, ISO-IR-25, ISO-IR-27, ISO-IR-37, ISO-IR-49, ISO-IR-50, ISO-IR-51, ISO-IR-54, ISO-IR-55, ISO-IR-57, ISO-IR-60,
ISO-IR-61, ISO-IR-69, ISO-IR-84, ISO-IR-85, ISO-IR-86, ISO-IR-88, ISO-IR-89, ISO-IR-90, ISO-IR-92, ISO-IR-98, ISO-IR-99, ISO-IR-100, ISO-IR-101, ISO-IR-103, ISO-IR-109, ISO-IR-110, ISO-IR-111, ISO-IR-121, ISO-IR-122, ISO-IR-126, ISO-IR-127, ISO-IR-138, ISO-IR-139, ISO-IR-141, ISO-IR-143, ISO-IR-144, ISO-IR-148, ISO-IR-150, ISO-IR-151, ISO-IR-153, ISO-IR-155, ISO-IR-156, ISO-IR-157, ISO-IR-166, ISO-IR-179, ISO-IR-193, ISO-IR-197, ISO-IR-199, ISO-IR-203, ISO-IR-226, ISO/TR_11548-1, ISO646-CA, ISO646-CA2, ISO646-CN, ISO646-CU, ISO646-DE, ISO646-DK, ISO646-ES, ISO646-ES2, ISO646-FI, ISO646-FR, ISO646-FR1, ISO646-GB, ISO646-HU, ISO646-IT, ISO646-JP-OCR-B, ISO646-JP, ISO646-KR, ISO646-NO, ISO646-NO2, ISO646-PT, ISO646-PT2, ISO646-SE, ISO646-SE2, ISO646-US, ISO646-YU, ISO2022CN, ISO2022CNEXT, ISO2022JP, ISO2022JP2, ISO2022KR, ISO6937, ISO8859-1, ISO8859-2, ISO8859-3, ISO8859-4, ISO8859-5, ISO8859-6, ISO8859-7, ISO8859-8, ISO8859-9, ISO8859-9E, ISO8859-10, ISO8859-11, ISO8859-13, ISO8859-14, ISO8859-15, ISO8859-16, ISO11548-1, ISO88591, ISO88592, ISO88593, ISO88594, ISO88595, ISO88596, ISO88597, ISO88598, ISO88599, ISO88599E, ISO885910, ISO885911, ISO885913, ISO885914, ISO885915, ISO885916, ISO_646.IRV:1991, ISO_2033-1983, ISO_2033, ISO_5427-EXT, ISO_5427, ISO_5427:1981, ISO_5427EXT, ISO_5428, ISO_5428:1980, ISO_6937-2, ISO_6937-2:1983, ISO_6937, ISO_6937:1992, ISO_8859-1, ISO_8859-1:1987, ISO_8859-2, ISO_8859-2:1987, ISO_8859-3, ISO_8859-3:1988, ISO_8859-4, ISO_8859-4:1988, ISO_8859-5, ISO_8859-5:1988, ISO_8859-6, ISO_8859-6:1987, ISO_8859-7, ISO_8859-7:1987, ISO_8859-7:2003, ISO_8859-8, ISO_8859-8:1988, ISO_8859-9, ISO_8859-9:1989, ISO_8859-9E, ISO_8859-10, ISO_8859-10:1992, ISO_8859-14, ISO_8859-14:1998, ISO_8859-15, ISO_8859-15:1998, ISO_8859-16, ISO_8859-16:2001, ISO_9036, ISO_10367-BOX, ISO_10367BOX, ISO_11548-1, ISO_69372, IT, JIS_C6220-1969-RO, JIS_C6229-1984-B, JIS_C62201969RO, JIS_C62291984B, JOHAB, JP-OCR-B, JP, JS, JUS_I.B1.002, KOI-7, KOI-8, KOI8-R, KOI8-RU,
KOI8-T, KOI8-U, KOI8, KOI8R, KOI8U, KSC5636, L1, L2, L3, L4, L5, L6, L7, L8, L10, LATIN-9, LATIN-GREEK-1, LATIN-GREEK, LATIN1, LATIN2, LATIN3, LATIN4, LATIN5, LATIN6, LATIN7, LATIN8, LATIN9, LATIN10, LATINGREEK, LATINGREEK1, MAC-CENTRALEUROPE, MAC-CYRILLIC, MAC-IS, MAC-SAMI, MAC-UK, MAC, MACCYRILLIC, MACINTOSH, MACIS, MACUK, MACUKRAINIAN, MIK, MS-ANSI, MS-ARAB, MS-CYRL, MS-EE, MS-GREEK, MS-HEBR, MS-MAC-CYRILLIC, MS-TURK, MS932, MS936, MSCP949, MSCP1361, MSMACCYRILLIC, MSZ_7795.3, MS_KANJI, NAPLPS, NATS-DANO, NATS-SEFI, NATSDANO, NATSSEFI, NC_NC0010, NC_NC00-10, NC_NC00-10:81, NF_Z_62-010, NF_Z_62-010_(1973), NF_Z_62-010_1973, NF_Z_62010, NF_Z_62010_1973, NO, NO2, NS_4551-1, NS_4551-2, NS_45511, NS_45512, OS2LATIN1, OSF00010001, OSF00010002, OSF00010003, OSF00010004, OSF00010005, OSF00010006, OSF00010007, OSF00010008, OSF00010009, OSF0001000A, OSF00010020, OSF00010100, OSF00010101, OSF00010102, OSF00010104, OSF00010105, OSF00010106, OSF00030010, OSF0004000A, OSF0005000A, OSF05010001, OSF100201A4, OSF100201A8, OSF100201B5, OSF100201F4, OSF100203B5, OSF1002011C, OSF1002011D, OSF1002035D, OSF1002035E, OSF1002035F, OSF1002036B, OSF1002037B, OSF10010001, OSF10010004, OSF10010006, OSF10020025, OSF10020111, OSF10020115, OSF10020116, OSF10020118, OSF10020122, OSF10020129, OSF10020352, OSF10020354, OSF10020357, OSF10020359, OSF10020360, OSF10020364, OSF10020365, OSF10020366, OSF10020367, OSF10020370, OSF10020387, OSF10020388, OSF10020396, OSF10020402, OSF10020417, PT, PT2, PT154, R8, R9, RK1048, ROMAN8, ROMAN9, RUSCII, SE, SE2, SEN_850200_B, SEN_850200_C, SHIFT-JIS, SHIFT_JIS, SHIFT_JISX0213, SJIS-OPEN, SJIS-WIN, SJIS, SS636127, STRK1048-2002, ST_SEV_358-88, T.61-8BIT, T.61, T.618BIT, TCVN-5712, TCVN, TCVN5712-1, TCVN5712-1:1993, THAI8, TIS-620, TIS620-0, TIS620.2529-1, TIS620.2533-0, TIS620, TS-5881, TSCII, TURKISH8, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UHC, UJIS, UK, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-7, UTF-8, UTF-16, UTF-16BE,
UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF7, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE, VISCII, WCHAR_T, WIN-SAMI-2, WINBALTRIM, WINDOWS-31J, WINDOWS-874, WINDOWS-936, WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256, WINDOWS-1257, WINDOWS-1258, WINSAMI2, WSS2, YU
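Note that several synonyms may be listed for the same encoding.
As mentioned above, Windows-1252 is an encoding that is used particularly often in "legacy" files on Western European Windows systems.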
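Because the accepted names depend on the iconv installed on your system, it can be handy to probe a name from Crystal itself. The following is a rough sketch: `encoding_supported?` is a hypothetical helper written for this example, and it assumes that Crystal raises ArgumentError for an encoding name that iconv does not know. The three names tried are synonyms taken from the list above.

```crystal
# Hypothetical helper for this sketch: returns true if the local iconv
# accepts the given encoding name.
# Assumption: an unknown name makes String#encode raise ArgumentError.
def encoding_supported?(name : String) : Bool
  "a".encode(name)
  true
rescue ArgumentError
  false
end

# Raw Windows-1252 bytes for "Grüße" (0xFC = ü, 0xDF = ß).
bytes = Bytes[0x47, 0x72, 0xFC, 0xDF, 0x65]

# Try a few synonyms for the same code page.
%w(WINDOWS-1252 CP1252 MS-ANSI).each do |name|
  if encoding_supported?(name)
    puts "#{name}: #{String.new(bytes, name)}"
  else
    puts "#{name}: not supported by this iconv"
  end
end
```

On a system where all three synonyms are available, each line should show the same decoded text.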
Crystal references / related documentation:
- https://crystal-lang.org/api/0.23.1/File.html#open%28filename%2Cmode%3D%26quot%3Br%26quot%3B%2Cperm%3DDEFAULT_CREATE_MODE%2Cencoding%3Dnil%2Cinvalid%3Dnil%2C%26block%29-class-method
- https://crystal-lang.org/api/0.27.0/IO.html#set_encoding%28encoding%3AString%2Cinvalid%3ASymbol%3F%3Dnil%29-instance-method
- https://crystal-lang.org/api/0.21.1/IO/EncodingOptions.html
- https://github.com/crystal-lang/crystal/blob/c9d1eef8fde5c7a03a029d64c8483ed7b4f2fe86/src/io.cr#L1025
- https://github.com/crystal-lang/crystal/blob/3c6c75e68f326bb91be35f71ae30672fd454776e/src/io/encoding.cr#L3
- https://github.com/crystal-lang/crystal/blob/master/src/iconv.cr