User talk:Dispenser/HTML entities
Latest comment: 6 years ago by Dispenser in topic Caret encoding
Caret encoding
editHi Dispenser, thank you for that list! I just found a different pattern of bad encoding: Look for "^euro" (without ;) or "^tilde" or "^trade" or "^rdquo" or "^sbquo". There are a few others. Best, --Achim (talk) 19:59, 16 October 2017 (UTC)
... &-195;&-162;^euro,^tilde,
...Fixes for decoding to UTF-8 ‘
Assume HTML named entities ‘
Mojibake, let try encode in Windows-1252 .encode('cp1252')
e2 80 98
Try to decode as UTF-8 sequence .decode('utf-8')
‘
U+2018 (Left curly apostrophe)Matches character from source URL
- @Multichill: How many pipelines did that pass through!?! Matching file names for HTML named entities we
9,7195,17412,043 files to rename. Recent upload with this brokenness are:- User:Panoramio upload bot (Dec 2016, 3,880 files)
User:Fæ (July 2017, 3,945 files)User:BD2412 (Oct 2017, 34 files)User:Alkari (3 days ago, 71 files)
- Dispenser (talk) 23:36, 16 October 2017 (UTC)
- Oh no, I've been fixing these, not making them. BD2412 T 23:50, 16 October 2017 (UTC)
- Looks like your right, I forgot
page_is_redirect=0
and page moves seem to break rev_parent_id=0. I've updated the figures. Dispenser (talk) 23:59, 16 October 2017 (UTC)
- Looks like your right, I forgot
-- ; sql --cluster analytics commonswiki_p SELECT REGEXP_REPLACE(page_title, ".*?(\\^\\^[0-9]+,|\\^\\^[xX]?[0-9a-fA-F]+,|\\^\\w{2,10}[,.]).*", "\\1") AS entity, COUNT(*) FROM page WHERE page_namespace=6 AND page_is_redirect=0 AND page_title REGEXP "\\^(\\^[0-9]+|\\^\\^[xX]?[0-9a-fA-F]+|quot|amp| ...[lot of HTML entity names]... )\\b," GROUP BY 1 ORDER BY 2;
- Oh no, I've been fixing these, not making them. BD2412 T 23:50, 16 October 2017 (UTC)
Top caret entities Entity Files Char Info ^^39, 7116 ' U+0027 (Apostrophe) ^quot, 3584 " U+0022 (Quote Mark) ^amp, 1259 & U+0026 (Ampersand) ^rsquo, 142 ’ U+2019 (Right curly apostrophe) ^gt, 118 > U+003E (Greater-than sign) ^pi, 71 π U+03C0 (pi symbol) ^^093, 47 ] U+005D (Right Bracket) ^^091, 46 [ U+005B (Left Bracket) ^sbquo, 35 ‚ U+201A (Openning quote mark) ^lt, 26 < U+003C (Less-than sign) ^ndash, 26 – U+2013 (En dash) ^ouml, 23 ö U+00F6 (o with diaeresis) ^uuml, 15 ü U+00FC (u with diaeresis) ^eacute, 10 é U+00E9 (e with acute) ^euro, 10 € U+20AC (Euro sign) ^^x0301, 10 ́ U+0301 (Combining acute accent) ^egrave, 9 è U+00E8 (e with grave) ^ldquo, 7 “ U+201C (Left curly quotation) ^rdquo, 6 ” U+201D (Right curly quotation) ^aacute, 5 á U+00E1 (a with acute)
- Still early stages, but I can't spare any more time. Still haven't cracked how exactly "Горный Алтай" became File:Đ^ldquo,ĐžŃ^euro,Đ˝Ń^lsaquo,Đš Đ^ĐťŃ^sbquo,Đ°Đš - panoramio - Tanya Dedyukhina.jpg. —Dispenser (talk) 16:37, 17 October 2017 (UTC)
- That should fix 4,845 for
-- ; sql --cluster analytics commonswiki_p /* Safe replacements for caret encoding of: * ^^39, => ' => ' U+0027 and ^quot, => " U+0022 and ^amp, => & */ SELECT page_title, REPLACE(page_title, "^^39,", "'") AS Replacement FROM page WHERE page_namespace=6 AND page_is_redirect=0 AND page_title LIKE "%^%,%" /* Limited ASCII to avoid messing with Mojibake */ HAVING Replacement REGEXP "^[A-Za-z0-9_().,&'\"\"\\-]+$" AND Replacement != page_title AND page_title NOT LIKE "%&%" AND Replacement NOT REGEXP "&[^_]" AND Replacement NOT LIKE "%''%" AND Replacement NOT LIKE '%""%';
^^39,
and 520 for^quot,
Dispenser (talk) 01:24, 18 October 2017 (UTC)- Added 847 for
^amp,
, but restricted since we have double caret encoded titles like File:Barcelona, entrance of Parc G^amp,uuml,ell by Antoni Gaud^amp,iacute - panoramio.jpg. —Dispenser (talk) 14:15, 18 October 2017 (UTC)
- Added 847 for
- Updated table with total counts. Dispenser (talk) 20:52, 18 October 2017 (UTC)
- Just for the record:
^^39,
and^^39
is fixed now. There have been some cases of bad encoding in conjunction with vowels, wherea^^39,
was meant to beà
orá
, same for other vowels like incafe^^39,
. --Achim (talk) 08:59, 18 November 2017 (UTC)- That's the user not knowing how to enter it. The sequence in Word is Ctrl+' then the letter, on MacOS its ⌥ Opt+e, and on Windows either type the alt code (Alt+0233) or copy & paste from the Character Map. We probably should be making a list of such problems.
I'd also note we have 310 files using two single quotes instead of a double quote:
SELECT img_name FROM image WHERE img_name REGEXP "(^|[^'])''(?!')([^']|'[^'])+?(?<!')''[^']";
—Dispenser (talk) 14:28, 18 November 2017 (UTC)- On that last note, we should consider avoiding © ® ™ too. About 24,000
SELECT img_name FROM image WHERE img_name REGEXP "©|®|™";
—Dispenser (talk) 15:32, 18 November 2017 (UTC)
- On that last note, we should consider avoiding © ® ™ too. About 24,000
- That's the user not knowing how to enter it. The sequence in Word is Ctrl+' then the letter, on MacOS its ⌥ Opt+e, and on Windows either type the alt code (Alt+0233) or copy & paste from the Character Map. We probably should be making a list of such problems.
- Just for the record:
- Still early stages, but I can't spare any more time. Still haven't cracked how exactly "Горный Алтай" became File:Đ^ldquo,ĐžŃ^euro,Đ˝Ń^lsaquo,Đš Đ^ĐťŃ^sbquo,Đ°Đš - panoramio - Tanya Dedyukhina.jpg. —Dispenser (talk) 16:37, 17 October 2017 (UTC)