The Crystal programming language and encodings

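I recently struggled with setting file encodings correctly in the Crystal programming language, which is why I am writing it down here.

By default, Crystal reads and writes files as UTF-8.

Sometimes you will come across files encoded in the Windows-1252 code page. (This is the Western European code page, used, for example, for German umlauts.)

If the wrong encoding is used, the "special characters" in the file end up garbled in your output file, or Crystal raises an error (something like: invalid byte sequence encountered in UTF-8 stream).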
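To see the mismatch in isolation, here is a minimal sketch that builds a string from raw Windows-1252 bytes, first without and then with the matching encoding. The byte values are example data I made up for illustration, not taken from a real file.

```crystal
# "äöü" encoded as Windows-1252: one byte per character (example data).
bytes = Bytes[0xE4, 0xF6, 0xFC]

# Interpreted as UTF-8 (Crystal's default), these bytes are not a valid sequence.
as_utf8 = String.new(bytes)
puts as_utf8.valid_encoding? # false

# Decoded from Windows-1252, the umlauts come out intact.
decoded = String.new(bytes, "WINDOWS-1252")
puts decoded # äöü
```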

Here is how to set the encoding correctly when reading a file:

require "csv"

cpf = File.open("filename")
cpf.set_encoding("WINDOWS-1252", nil)

# now do something with the file, e.g. use it as CSV

cpf_csv = CSV.new(cpf, headers: true, separator: ',', quote_char: '"')

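The encoding is passed to set_encoding as a string. The second argument controls how invalid byte sequences are handled: pass nil to have Crystal raise an error when it meets one (:skip would silently drop them instead).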
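A small self-contained sketch of the reading side, in case you want to try it without a real Windows-1252 file at hand. The temporary file and its three-byte content are assumptions made up for this demo.

```crystal
# Hypothetical demo file; the name and content are made up for illustration.
path = File.tempname("cp1252-demo", ".txt")

# "äöü" in Windows-1252 is the three bytes 0xE4 0xF6 0xFC.
File.write(path, Bytes[0xE4, 0xF6, 0xFC])

# The block form of File.open closes the file automatically.
text = File.open(path) do |file|
  file.set_encoding("WINDOWS-1252", nil)
  file.gets_to_end
end

puts text          # äöü
puts text.bytesize # 6, because the string is UTF-8 once it is in memory
File.delete(path)
```

Once read this way, the string behaves like any other Crystal string; the source encoding only matters at the file boundary.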

Here is how to set the encoding correctly when writing a file:

output_csv = File.open("filename", "w")
output_csv.set_encoding("WINDOWS-1252", nil)

# now do something with the file
output_csv.print "\"test\";\"test\""

# close the file so that buffered output is flushed to disk
output_csv.close

The procedure is the same as for reading.
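One way to convince yourself that the output really is Windows-1252 is to compare byte counts. The following is only a sketch: the temporary path and the sample line are assumptions for illustration.

```crystal
# Hypothetical output path and sample line, made up for this demo.
path = File.tempname("cp1252-out", ".csv")
line = "\"Grüße\";\"test\""

# The block form of File.open closes (and thereby flushes) the file.
File.open(path, "w") do |file|
  file.set_encoding("WINDOWS-1252", nil)
  file.print line
end

written = File.size(path)
puts line.bytesize # 16: in memory the string is UTF-8, ü and ß take two bytes each
puts written       # 14: Windows-1252 uses one byte per character
File.delete(path)
```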

Which names can be passed as the encoding?

Crystal uses iconv to convert files between encodings. You can therefore find out the available encoding names with:

iconv --list

On my system, for example, this produces the following list:

437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5,
BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5,
CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278,
CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424,
CP437, CP500, CP737, CP770, CP771, CP772, CP773, CP774, CP775, CP803, CP813,
CP819, CP850, CP851, CP852, CP855, CP856, CP857, CP860, CP861, CP862, CP863,
CP864, CP865, CP866, CP866NAV, CP868, CP869, CP870, CP871, CP874, CP875,
CP880, CP891, CP901, CP902, CP903, CP904, CP905, CP912, CP915, CP916, CP918,
CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949,
CP950, CP1004, CP1008, CP1025, CP1026, CP1046, CP1047, CP1070, CP1079,
CP1081, CP1084, CP1089, CP1097, CP1112, CP1122, CP1123, CP1124, CP1125,
CP1129, CP1130, CP1132, CP1133, CP1137, CP1140, CP1141, CP1142, CP1143,
CP1144, CP1145, CP1146, CP1147, CP1148, CP1149, CP1153, CP1154, CP1155,
CP1156, CP1157, CP1158, CP1160, CP1161, CP1162, CP1163, CP1164, CP1166,
CP1167, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257,
CP1258, CP1282, CP1361, CP1364, CP1371, CP1388, CP1390, CP1399, CP4517,
CP4899, CP4909, CP4971, CP5347, CP9030, CP9066, CP9448, CP10007, CP12712,
CP16804, CPIBM861, CSA7-1, CSA7-2, CSASCII, CSA_T500-1983, CSA_T500,
CSA_Z243.4-1985-1, CSA_Z243.4-1985-2, CSA_Z243.419851, CSA_Z243.419852,
CSDECMCS, CSEBCDICATDE, CSEBCDICATDEA, CSEBCDICCAFR, CSEBCDICDKNO,
CSEBCDICDKNOA, CSEBCDICES, CSEBCDICESA, CSEBCDICESS, CSEBCDICFISE,
CSEBCDICFISEA, CSEBCDICFR, CSEBCDICIT, CSEBCDICPT, CSEBCDICUK, CSEBCDICUS,
CSEUCKR, CSEUCPKDFMTJAPANESE, CSGB2312, CSHPROMAN8, CSIBM037, CSIBM038,
CSIBM273, CSIBM274, CSIBM275, CSIBM277, CSIBM278, CSIBM280, CSIBM281,
CSIBM284, CSIBM285, CSIBM290, CSIBM297, CSIBM420, CSIBM423, CSIBM424,
CSIBM500, CSIBM803, CSIBM851, CSIBM855, CSIBM856, CSIBM857, CSIBM860,
CSIBM863, CSIBM864, CSIBM865, CSIBM866, CSIBM868, CSIBM869, CSIBM870,
CSIBM871, CSIBM880, CSIBM891, CSIBM901, CSIBM902, CSIBM903, CSIBM904,
CSIBM905, CSIBM918, CSIBM921, CSIBM922, CSIBM930, CSIBM932, CSIBM933,
CSIBM935, CSIBM937, CSIBM939, CSIBM943, CSIBM1008, CSIBM1025, CSIBM1026,
CSIBM1097, CSIBM1112, CSIBM1122, CSIBM1123, CSIBM1124, CSIBM1129, CSIBM1130,
CSIBM1132, CSIBM1133, CSIBM1137, CSIBM1140, CSIBM1141, CSIBM1142, CSIBM1143,
CSIBM1144, CSIBM1145, CSIBM1146, CSIBM1147, CSIBM1148, CSIBM1149, CSIBM1153,
CSIBM1154, CSIBM1155, CSIBM1156, CSIBM1157, CSIBM1158, CSIBM1160, CSIBM1161,
CSIBM1163, CSIBM1164, CSIBM1166, CSIBM1167, CSIBM1364, CSIBM1371, CSIBM1388,
CSIBM1390, CSIBM1399, CSIBM4517, CSIBM4899, CSIBM4909, CSIBM4971, CSIBM5347,
CSIBM9030, CSIBM9066, CSIBM9448, CSIBM12712, CSIBM16804, CSIBM11621162,
CSISO4UNITEDKINGDOM, CSISO10SWEDISH, CSISO11SWEDISHFORNAMES,
CSISO14JISC6220RO, CSISO15ITALIAN, CSISO16PORTUGESE, CSISO17SPANISH,
CSISO18GREEK7OLD, CSISO19LATINGREEK, CSISO21GERMAN, CSISO25FRENCH,
CSISO27LATINGREEK1, CSISO49INIS, CSISO50INIS8, CSISO51INISCYRILLIC,
CSISO58GB1988, CSISO60DANISHNORWEGIAN, CSISO60NORWEGIAN1, CSISO61NORWEGIAN2,
CSISO69FRENCH, CSISO84PORTUGUESE2, CSISO85SPANISH2, CSISO86HUNGARIAN,
CSISO88GREEK7, CSISO89ASMO449, CSISO90, CSISO92JISC62991984B, CSISO99NAPLPS,
CSISO103T618BIT, CSISO111ECMACYRILLIC, CSISO121CANADIAN1, CSISO122CANADIAN2,
CSISO139CSN369103, CSISO141JUSIB1002, CSISO143IECP271, CSISO150,
CSISO150GREEKCCITT, CSISO151CUBA, CSISO153GOST1976874, CSISO646DANISH,
CSISO2022CN, CSISO2022JP, CSISO2022JP2, CSISO2022KR, CSISO2033,
CSISO5427CYRILLIC, CSISO5427CYRILLIC1981, CSISO5428GREEK, CSISO10367BOX,
CSISOLATIN1, CSISOLATIN2, CSISOLATIN3, CSISOLATIN4, CSISOLATIN5, CSISOLATIN6,
CSISOLATINARABIC, CSISOLATINCYRILLIC, CSISOLATINGREEK, CSISOLATINHEBREW,
CSKOI8R, CSKSC5636, CSMACINTOSH, CSNATSDANO, CSNATSSEFI, CSN_369103,
CSPC8CODEPAGE437, CSPC775BALTIC, CSPC850MULTILINGUAL, CSPC862LATINHEBREW,
CSPCP852, CSSHIFTJIS, CSUCS4, CSUNICODE, CSWINDOWS31J, CUBA, CWI-2, CWI,
CYRILLIC, DE, DEC-MCS, DEC, DECMCS, DIN_66003, DK, DS2089, DS_2089, E13B,
EBCDIC-AT-DE-A, EBCDIC-AT-DE, EBCDIC-BE, EBCDIC-BR, EBCDIC-CA-FR,
EBCDIC-CP-AR1, EBCDIC-CP-AR2, EBCDIC-CP-BE, EBCDIC-CP-CA, EBCDIC-CP-CH,
EBCDIC-CP-DK, EBCDIC-CP-ES, EBCDIC-CP-FI, EBCDIC-CP-FR, EBCDIC-CP-GB,
EBCDIC-CP-GR, EBCDIC-CP-HE, EBCDIC-CP-IS, EBCDIC-CP-IT, EBCDIC-CP-NL,
EBCDIC-CP-NO, EBCDIC-CP-ROECE, EBCDIC-CP-SE, EBCDIC-CP-TR, EBCDIC-CP-US,
EBCDIC-CP-WT, EBCDIC-CP-YU, EBCDIC-CYRILLIC, EBCDIC-DK-NO-A, EBCDIC-DK-NO,
EBCDIC-ES-A, EBCDIC-ES-S, EBCDIC-ES, EBCDIC-FI-SE-A, EBCDIC-FI-SE, EBCDIC-FR,
EBCDIC-GREEK, EBCDIC-INT, EBCDIC-INT1, EBCDIC-IS-FRISS, EBCDIC-IT,
EBCDIC-JP-E, EBCDIC-JP-KANA, EBCDIC-PT, EBCDIC-UK, EBCDIC-US, EBCDICATDE,
EBCDICATDEA, EBCDICCAFR, EBCDICDKNO, EBCDICDKNOA, EBCDICES, EBCDICESA,
EBCDICESS, EBCDICFISE, EBCDICFISEA, EBCDICFR, EBCDICISFRISS, EBCDICIT,
EBCDICPT, EBCDICUK, EBCDICUS, ECMA-114, ECMA-118, ECMA-128, ECMA-CYRILLIC,
ECMACYRILLIC, ELOT_928, ES, ES2, EUC-CN, EUC-JISX0213, EUC-JP-MS, EUC-JP,
EUC-KR, EUC-TW, EUCCN, EUCJP-MS, EUCJP-OPEN, EUCJP-WIN, EUCJP, EUCKR, EUCTW,
FI, FR, GB, GB2312, GB13000, GB18030, GBK, GB_1988-80, GB_198880,
GEORGIAN-ACADEMY, GEORGIAN-PS, GOST_19768-74, GOST_19768, GOST_1976874,
GREEK-CCITT, GREEK, GREEK7-OLD, GREEK7, GREEK7OLD, GREEK8, GREEKCCITT,
HEBREW, HP-GREEK8, HP-ROMAN8, HP-ROMAN9, HP-THAI8, HP-TURKISH8, HPGREEK8,
HPROMAN8, HPROMAN9, HPTHAI8, HPTURKISH8, HU, IBM-803, IBM-856, IBM-901,
IBM-902, IBM-921, IBM-922, IBM-930, IBM-932, IBM-933, IBM-935, IBM-937,
IBM-939, IBM-943, IBM-1008, IBM-1025, IBM-1046, IBM-1047, IBM-1097, IBM-1112,
IBM-1122, IBM-1123, IBM-1124, IBM-1129, IBM-1130, IBM-1132, IBM-1133,
IBM-1137, IBM-1140, IBM-1141, IBM-1142, IBM-1143, IBM-1144, IBM-1145,
IBM-1146, IBM-1147, IBM-1148, IBM-1149, IBM-1153, IBM-1154, IBM-1155,
IBM-1156, IBM-1157, IBM-1158, IBM-1160, IBM-1161, IBM-1162, IBM-1163,
IBM-1164, IBM-1166, IBM-1167, IBM-1364, IBM-1371, IBM-1388, IBM-1390,
IBM-1399, IBM-4517, IBM-4899, IBM-4909, IBM-4971, IBM-5347, IBM-9030,
IBM-9066, IBM-9448, IBM-12712, IBM-16804, IBM037, IBM038, IBM256, IBM273,
IBM274, IBM275, IBM277, IBM278, IBM280, IBM281, IBM284, IBM285, IBM290,
IBM297, IBM367, IBM420, IBM423, IBM424, IBM437, IBM500, IBM775, IBM803,
IBM813, IBM819, IBM848, IBM850, IBM851, IBM852, IBM855, IBM856, IBM857,
IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM866NAV, IBM868,
IBM869, IBM870, IBM871, IBM874, IBM875, IBM880, IBM891, IBM901, IBM902,
IBM903, IBM904, IBM905, IBM912, IBM915, IBM916, IBM918, IBM920, IBM921,
IBM922, IBM930, IBM932, IBM933, IBM935, IBM937, IBM939, IBM943, IBM1004,
IBM1008, IBM1025, IBM1026, IBM1046, IBM1047, IBM1089, IBM1097, IBM1112,
IBM1122, IBM1123, IBM1124, IBM1129, IBM1130, IBM1132, IBM1133, IBM1137,
IBM1140, IBM1141, IBM1142, IBM1143, IBM1144, IBM1145, IBM1146, IBM1147,
IBM1148, IBM1149, IBM1153, IBM1154, IBM1155, IBM1156, IBM1157, IBM1158,
IBM1160, IBM1161, IBM1162, IBM1163, IBM1164, IBM1166, IBM1167, IBM1364,
IBM1371, IBM1388, IBM1390, IBM1399, IBM4517, IBM4899, IBM4909, IBM4971,
IBM5347, IBM9030, IBM9066, IBM9448, IBM12712, IBM16804, IEC_P27-1, IEC_P271,
INIS-8, INIS-CYRILLIC, INIS, INIS8, INISCYRILLIC, ISIRI-3342, ISIRI3342,
ISO-2022-CN-EXT, ISO-2022-CN, ISO-2022-JP-2, ISO-2022-JP-3, ISO-2022-JP,
ISO-2022-KR, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5,
ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-9E, ISO-8859-10,
ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-10646,
ISO-10646/UCS2, ISO-10646/UCS4, ISO-10646/UTF-8, ISO-10646/UTF8, ISO-CELTIC,
ISO-IR-4, ISO-IR-6, ISO-IR-8-1, ISO-IR-9-1, ISO-IR-10, ISO-IR-11, ISO-IR-14,
ISO-IR-15, ISO-IR-16, ISO-IR-17, ISO-IR-18, ISO-IR-19, ISO-IR-21, ISO-IR-25,
ISO-IR-27, ISO-IR-37, ISO-IR-49, ISO-IR-50, ISO-IR-51, ISO-IR-54, ISO-IR-55,
ISO-IR-57, ISO-IR-60, ISO-IR-61, ISO-IR-69, ISO-IR-84, ISO-IR-85, ISO-IR-86,
ISO-IR-88, ISO-IR-89, ISO-IR-90, ISO-IR-92, ISO-IR-98, ISO-IR-99, ISO-IR-100,
ISO-IR-101, ISO-IR-103, ISO-IR-109, ISO-IR-110, ISO-IR-111, ISO-IR-121,
ISO-IR-122, ISO-IR-126, ISO-IR-127, ISO-IR-138, ISO-IR-139, ISO-IR-141,
ISO-IR-143, ISO-IR-144, ISO-IR-148, ISO-IR-150, ISO-IR-151, ISO-IR-153,
ISO-IR-155, ISO-IR-156, ISO-IR-157, ISO-IR-166, ISO-IR-179, ISO-IR-193,
ISO-IR-197, ISO-IR-199, ISO-IR-203, ISO-IR-226, ISO/TR_11548-1,
ISO646-CA, ISO646-CA2, ISO646-CN, ISO646-CU, ISO646-DE, ISO646-DK, ISO646-ES,
ISO646-ES2, ISO646-FI, ISO646-FR, ISO646-FR1, ISO646-GB, ISO646-HU,
ISO646-IT, ISO646-JP-OCR-B, ISO646-JP, ISO646-KR, ISO646-NO, ISO646-NO2,
ISO646-PT, ISO646-PT2, ISO646-SE, ISO646-SE2, ISO646-US, ISO646-YU,
ISO2022CN, ISO2022CNEXT, ISO2022JP, ISO2022JP2, ISO2022KR, ISO6937,
ISO8859-1, ISO8859-2, ISO8859-3, ISO8859-4, ISO8859-5, ISO8859-6, ISO8859-7,
ISO8859-8, ISO8859-9, ISO8859-9E, ISO8859-10, ISO8859-11, ISO8859-13,
ISO8859-14, ISO8859-15, ISO8859-16, ISO11548-1, ISO88591, ISO88592, ISO88593,
ISO88594, ISO88595, ISO88596, ISO88597, ISO88598, ISO88599, ISO88599E,
ISO885910, ISO885911, ISO885913, ISO885914, ISO885915, ISO885916,
ISO_646.IRV:1991, ISO_2033-1983, ISO_2033, ISO_5427-EXT, ISO_5427,
ISO_5427:1981, ISO_5427EXT, ISO_5428, ISO_5428:1980, ISO_6937-2,
ISO_6937-2:1983, ISO_6937, ISO_6937:1992, ISO_8859-1, ISO_8859-1:1987,
ISO_8859-2, ISO_8859-2:1987, ISO_8859-3, ISO_8859-3:1988, ISO_8859-4,
ISO_8859-4:1988, ISO_8859-5, ISO_8859-5:1988, ISO_8859-6, ISO_8859-6:1987,
ISO_8859-7, ISO_8859-7:1987, ISO_8859-7:2003, ISO_8859-8, ISO_8859-8:1988,
ISO_8859-9, ISO_8859-9:1989, ISO_8859-9E, ISO_8859-10, ISO_8859-10:1992,
ISO_8859-14, ISO_8859-14:1998, ISO_8859-15, ISO_8859-15:1998, ISO_8859-16,
ISO_8859-16:2001, ISO_9036, ISO_10367-BOX, ISO_10367BOX, ISO_11548-1,
ISO_69372, IT, JIS_C6220-1969-RO, JIS_C6229-1984-B, JIS_C62201969RO,
JIS_C62291984B, JOHAB, JP-OCR-B, JP, JS, JUS_I.B1.002, KOI-7, KOI-8, KOI8-R,
KOI8-RU, KOI8-T, KOI8-U, KOI8, KOI8R, KOI8U, KSC5636, L1, L2, L3, L4, L5, L6,
L7, L8, L10, LATIN-9, LATIN-GREEK-1, LATIN-GREEK, LATIN1, LATIN2, LATIN3,
LATIN4, LATIN5, LATIN6, LATIN7, LATIN8, LATIN9, LATIN10, LATINGREEK,
LATINGREEK1, MAC-CENTRALEUROPE, MAC-CYRILLIC, MAC-IS, MAC-SAMI, MAC-UK, MAC,
MACCYRILLIC, MACINTOSH, MACIS, MACUK, MACUKRAINIAN, MIK, MS-ANSI, MS-ARAB,
MS-CYRL, MS-EE, MS-GREEK, MS-HEBR, MS-MAC-CYRILLIC, MS-TURK, MS932, MS936,
MSCP949, MSCP1361, MSMACCYRILLIC, MSZ_7795.3, MS_KANJI, NAPLPS, NATS-DANO,
NATS-SEFI, NATSDANO, NATSSEFI, NC_NC0010, NC_NC00-10, NC_NC00-10:81,
NF_Z_62-010, NF_Z_62-010_(1973), NF_Z_62-010_1973, NF_Z_62010,
NF_Z_62010_1973, NO, NO2, NS_4551-1, NS_4551-2, NS_45511, NS_45512,
OS2LATIN1, OSF00010001, OSF00010002, OSF00010003, OSF00010004, OSF00010005,
OSF00010006, OSF00010007, OSF00010008, OSF00010009, OSF0001000A, OSF00010020,
OSF00010100, OSF00010101, OSF00010102, OSF00010104, OSF00010105, OSF00010106,
OSF00030010, OSF0004000A, OSF0005000A, OSF05010001, OSF100201A4, OSF100201A8,
OSF100201B5, OSF100201F4, OSF100203B5, OSF1002011C, OSF1002011D, OSF1002035D,
OSF1002035E, OSF1002035F, OSF1002036B, OSF1002037B, OSF10010001, OSF10010004,
OSF10010006, OSF10020025, OSF10020111, OSF10020115, OSF10020116, OSF10020118,
OSF10020122, OSF10020129, OSF10020352, OSF10020354, OSF10020357, OSF10020359,
OSF10020360, OSF10020364, OSF10020365, OSF10020366, OSF10020367, OSF10020370,
OSF10020387, OSF10020388, OSF10020396, OSF10020402, OSF10020417, PT, PT2,
PT154, R8, R9, RK1048, ROMAN8, ROMAN9, RUSCII, SE, SE2, SEN_850200_B,
SEN_850200_C, SHIFT-JIS, SHIFT_JIS, SHIFT_JISX0213, SJIS-OPEN, SJIS-WIN,
SJIS, SS636127, STRK1048-2002, ST_SEV_358-88, T.61-8BIT, T.61, T.618BIT,
TCVN-5712, TCVN, TCVN5712-1, TCVN5712-1:1993, THAI8, TIS-620, TIS620-0,
TIS620.2529-1, TIS620.2533-0, TIS620, TS-5881, TSCII, TURKISH8, UCS-2,
UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UHC, UJIS, UK,
UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-7, UTF-8, UTF-16,
UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF7, UTF8, UTF16, UTF16BE,
UTF16LE, UTF32, UTF32BE, UTF32LE, VISCII, WCHAR_T, WIN-SAMI-2, WINBALTRIM,
WINDOWS-31J, WINDOWS-874, WINDOWS-936, WINDOWS-1250, WINDOWS-1251,
WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256,
WINDOWS-1257, WINDOWS-1258, WINSAMI2, WSS2, YU

 

Note that the same encoding can be listed under several synonyms.
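If you would rather check a name from within Crystal than scan the list by eye, a rough sketch could look like the following. The helper encoding_supported? is hypothetical (it is not part of Crystal's standard library), and it assumes that an unknown encoding name surfaces as an ArgumentError when a string is encoded, which is what I would expect from current Crystal versions.

```crystal
# Hypothetical helper, not part of Crystal's standard library.
# Assumption: an unknown encoding name raises ArgumentError in String#encode.
def encoding_supported?(name : String) : Bool
  "a".encode(name) # try to convert a trivial ASCII string
  true
rescue ArgumentError
  false
end

# The results depend on the iconv implementation Crystal was built against.
puts encoding_supported?("WINDOWS-1252")
puts encoding_supported?("CP1252")
puts encoding_supported?("NOT-AN-ENCODING")
```

Checking two synonyms such as WINDOWS-1252 and CP1252 side by side is a quick way to confirm that both spellings are accepted on your system.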

As mentioned above, Windows-1252 is an encoding used especially often in "legacy" files on Western European Windows systems.

 

Crystal reference / related documentation.