Commons:Bots/Work requests

Shortcut: COM:BR · COM:BWR


SpBot archives all sections tagged with {{Section resolved|1=~~~~}} after 1 day.

Rename files with wonky Unicode encoding

I placed this request at User_talk:CommonsDelinker/commands#Other_requests first but I was told I should place it here: The following 342 file names have bad Unicode encoding. For example, instead of containing the character "–" (&#8211;), they contain the text "&-8211;". The same for three other characters:

  • &-8217; = ’
  • &-8220; = “
  • &-8221; = ”

For the moment I will list only three files here; the complete list can be found at User:Fructibus/A.

CommonsDelinker: Replace File:Almost complete "sub-cordate" flint handaxe of Lower Palaeolithic date (500000 BC &-8211; 40000 BC). (FindID 112531).jpg with File:Almost complete "sub-cordate" flint handaxe of Lower Palaeolithic date (500000 BC – 40000 BC). (FindID 112531).jpg across all Wikimedia projects. Reason: Wonky Unicode encoding
CommonsDelinker: Replace File:An incomplete and ornate cast copper alloy double looped asymmetrical buckle with an integral bar and a cast copper alloy hinged plate. Medieval or Early Post Medieval (AD 1300 &-8211; AD 1600). (FindID 97064).jpg with File:An incomplete and ornate cast copper alloy double looped asymmetrical buckle with an integral bar and a cast copper alloy hinged plate. Medieval or Early Post Medieval (AD 1300 – AD 1600). (FindID 97064).jpg across all Wikimedia projects. Reason: Wonky Unicode encoding
CommonsDelinker: Replace File:An incomplete and ornate cast copper alloy double looped asymmetrical buckle with an integral bar and a cast copper alloy hinged plate. Medieval or Early Post Medieval (AD 1300 &-8211; AD 1600). Detai (FindID 97064).jpg with File:An incomplete and ornate cast copper alloy double looped asymmetrical buckle with an integral bar and a cast copper alloy hinged plate. Medieval or Early Post Medieval (AD 1300 – AD 1600). Detai (FindID 97064).jpg across all Wikimedia projects. Reason: Wonky Unicode encoding

I hope this is the right place for such requests. Thanks. Fructibus (talk) 12:31, 15 October 2017 (UTC)

-- ; sql --cluster analytics commonswiki_p
SELECT
  COUNT(*), SUM(page_is_redirect=0) AS Files,
  REGEXP_REPLACE(page_title, ".*?((&[^;]{2,16};)+).*", "\\1") AS HTML_Entity,
  page_title
FROM page
WHERE page_namespace=6 /*File:*/
AND page_title REGEXP "&[_-]*(x?[0-9a-fA-F]{2,6}|amp|quot);"
GROUP BY HTML_Entity
ORDER BY Files DESC
LIMIT 80;
Files | HTML_Entity | Replace / Example | Status
314 | &-8211; | U+2013 (En dash) | Done
67 | &-195; | ...SAN DIEGO (March 28, 2007) &-195;&-145; Sailors... | Done
27 | &-34; | U+0022 (Quote Mark) | Done
21 | &-x2F; | U+002F (Forward slash) | Done
15 | &-8220; | U+201C (Left curly quote) | Done
15 | &-8221; | U+201D (Right curly quote) | Done
12 | &_-39; | U+0027 (Apostrophe) | Done
11 | &-226;&-128; | ...an electrician&-226;&-128;&-153;s mate… | Done
9 | &-x27; | U+0027 (Apostrophe) | Done
9 | &-8216; | U+2018 (Left curly apostrophe) | Done
8 | &_amp; | U+0026 (Ampersand) | Done
8 | &-65533; | ...Women&-65533;s Acc... | Done
1 | &-194;&-160; | ...against Vanderbilt.&-194;&-160;.jpg | Done
7 | &-194;&-168; | | Done
7 | &_-8217; | U+2019 (Right curly apostrophe) | Done
5 | &_quot; | U+0022 (Quote Mark) | Done
4 | &_-8216; | U+2018 (Left curly apostrophe) | Done
4 | &-39; | U+0027 (Apostrophe) | Done
4 | &-8217; | U+2019 (Right curly apostrophe) | Done
4 | &_1740; | U+06CC (Arabic Dot-less Ya) | Done (good catch!)
2 | &-201; | ...de l'Arriere - &-201;conomisez le... | Done
2 | &-259; | U+0103 (a with breve) | Done
1 | &-x1F; | Fran&-x1F;çois... | Done
Those aren't wonky; it's just that # is invalid in MediaWiki file names, while it's fine in Windows. Dispenser (talk) 17:42, 15 October 2017 (UTC)
There seem to be three types of errors: a space/underscore inserted after &, the HTML entity sent (both in decimal and hexadecimal), and for some reason &-226;&-128;&-153;, which is the UTF-8 sequence e2 80 99 for U+2019 (Right curly apostrophe). Dispenser (talk) 03:06, 16 October 2017 (UTC)
Possibly fixed phab:T67297 (deploy, kinda). Most recent occurrences: 2x Oct 8 and 2x Oct 10 by User:Fæ (direct entity variant), 1x Sept 25 by User:Lz jawa using UploadWizard (space inserting), several times in August by User:*angys* using Flickr2Commons (direct hexadecimal entity). Also, User:Pharos in July using GWToolset (direct hexadecimal entity). Dispenser (talk) 03:43, 16 October 2017 (UTC)
New Report: User:Dispenser/HTML entities Dispenser (talk) 18:46, 16 October 2017 (UTC)
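
The three error types described above can be illustrated with a short Python sketch. It is only an assumption about how a repair pass might look, not the script actually used for the renames; the names repair, RUN and SINGLE are made up for this example. Runs of adjacent three-digit values are first tried as UTF-8 bytes, and everything else is handed to html.unescape after the "#" is restored.

```python
import html
import re

# Runs of adjacent "&-NNN;" values: candidates for raw UTF-8 bytes (e.g. 226 128 153).
RUN = re.compile(r"(?:&-[0-9]{3};){2,}")
# Single broken entity: "&", optional "_" and/or "-" where "#" belonged, then
# a decimal number, "x" + hex digits, or one of the named entities seen in the report.
SINGLE = re.compile(r"&[_-]*(x[0-9a-fA-F]{1,6}|[0-9]{1,7}|amp|quot);")

def _decode_run(match):
    values = [int(n) for n in re.findall(r"[0-9]+", match.group(0))]
    try:
        return bytes(values).decode("utf-8")
    except ValueError:  # value above 255 or not valid UTF-8: leave for SINGLE
        return match.group(0)

def _decode_single(match):
    body = match.group(1)
    prefix = "&" if body in ("amp", "quot") else "&#"
    return html.unescape(prefix + body + ";")

def repair(name):
    """Return the file name with broken entities replaced by real characters."""
    return SINGLE.sub(_decode_single, RUN.sub(_decode_run, name))

print(repair("flint handaxe (500000 BC &-8211; 40000 BC)"))
print(repair("an electrician&-226;&-128;&-153;s mate"))
```

A lone value such as &-201; is treated as a code point, while a run such as &-195;&-145; is treated as bytes. That split is a heuristic, so any generated rename list would still need a human look, and the result must still be a valid MediaWiki title.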

From what I have read, this is a small number of files. No objections to the obvious fixes, or I can eventually fix by tweaking an existing regex based renamer if nobody gets to it. -- (talk) 04:04, 16 October 2017 (UTC)

I think the multi-part entries require a closer look, as there are file names like File:Alexia Chascsa completes &-195;&-162;&-226;&-130;&-172;a"dunker&-195;&-162;&-226;&-130;&-172;&-194;- training at the Helicopter Overwater Survival Training facility during Aviation Spouses Day 130607-A-SM724-320.jpg or File:U.S. Army Staff Sgt. Matthew Parsons, 2nd Battalion, 309th Regiment, an egress-training instructor, climbs a rope obstacle during day two of the 174th&-195;&-162;&-226;&-130;&-172;&-226;&-132;&-162;s 130313-A-IM587-889.jpg. --Achim (talk) 17:02, 16 October 2017 (UTC)
html.unescape(u'&-195;&-162;&-226;&-130;&-172;&-226;&-132;&-162;'.replace('-', '#')).encode('cp1252').decode('utf-8').encode('cp1252').decode('utf-8') comes out as U+2019 (Right curly apostrophe)

If it didn't work the first time do it again ;-) Can't figure out the first one though. —Dispenser (talk) 02:45, 17 October 2017 (UTC)
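
The "do it again" step can be written as a loop instead of a fixed chain of calls. A minimal sketch, assuming the damage is always UTF-8 bytes misread as cp1252 (as in the one-liner above); undo_mojibake and max_rounds are names invented here.

```python
import html

def undo_mojibake(text, max_rounds=4):
    """Repeat the cp1252 -> UTF-8 round trip until it fails or stops changing anything."""
    for _ in range(max_rounds):
        try:
            fixed = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # text is no longer (or never was) cp1252-misread UTF-8
        if fixed == text:
            break
        text = fixed
    return text

raw = "&-195;&-162;&-226;&-130;&-172;&-226;&-132;&-162;"
garbled = html.unescape(raw.replace("-", "#"))
print(repr(undo_mojibake(garbled)))
```

For this input the loop needs two rounds before the third attempt fails and the loop stops. Legitimate text that happens to look like misread UTF-8 would also be rewritten, so the result is only a candidate name.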

There is some mojibake too: "Горный Алтай" becomes File:Đ^ldquo,ĐžŃ^euro,Đ˝Ń^lsaquo,Đš Đ^ĐťŃ^sbquo,Đ°Đš - panoramio - Tanya Dedyukhina.jpg, which also has caret encoding. —Dispenser (talk) 16:46, 17 October 2017 (UTC)
Read: the UTF-8 bytes have been reinterpreted bytewise as cp1250: Г (U+0413) → 0xD0 0x93 → Đ“ → Đ^ldquo, . --Achim (talk) 10:17, 19 October 2017 (UTC)
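
The chain for the single letter Г can be reproduced in Python. This is only an illustration of the steps named above; the function names are invented, and the caret step is an assumption based on the file name shown (the uploading bot's real code is not known here).

```python
from html.entities import codepoint2name

def misread_as_cp1250(text):
    """Encode as UTF-8, then wrongly decode each byte as cp1250."""
    return text.encode("utf-8").decode("cp1250")

def caret_encode(text):
    """Assumed bot behaviour: characters with an HTML 4 entity name become ^name,"""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        out.append(f"^{name}," if name else ch)
    return "".join(out)

step1 = misread_as_cp1250("\u0413")   # Cyrillic capital Г
step2 = caret_encode(step1)
print(step1, step2)
```

Only the first letter is shown because cp1250 leaves a few byte values undefined, so the decode step can raise an error on other input.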

Caret encoding

Main discussion: User talk:Dispenser/HTML entities#Caret encoding
Files | Entity | Char | Info | Status
7116 | ^^39, | ' | U+0027 (Apostrophe) | Done
3584 | ^quot, | " | U+0022 (Quote Mark) | Done
1259 | ^amp, | & | U+0026 (Ampersand) | Done
142 | ^rsquo, | ’ | U+2019 (Right curly apostrophe) | Done
118 | ^gt, | > | U+003E (Greater-than sign) |
71 | ^pi, | π | U+03C0 (pi symbol) | not found
47 | ^^093, | ] | U+005D (Right Bracket) | Done
46 | ^^091, | [ | U+005B (Left Bracket) | Done
35 | ^sbquo, | ‚ | U+201A (Opening quote mark) | Done
26 | ^lt, | < | U+003C (Less-than sign) |
26 | ^ndash, | – | U+2013 (En dash) | Done

This is where a bot converts certain characters into HTML entities, then swaps & ; for ^ , (sometimes stripping the trailing ,). Most come from User:BotMultichillT (446 caret names, fixed in November 2011?) and User:Panoramio upload bot (10,631 caret names, fixed in December 2016?), with the last occurrence in December 2016. There have apparently been code changes, as newer versions of the Panoramio bot created mojibake while earlier versions handled Thai script just fine. —Dispenser (talk) 21:58, 18 October 2017 (UTC)
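
Reversing the swap is mostly mechanical. The sketch below is not the script from the talk page; it assumes, going by the table above, that "^^" stands for "&#" and a single "^" for "&", and that the trailing comma is present. CARET and decode_caret are invented names.

```python
import html
import re

# "^^" + digits + "," (numeric entity) or "^" + name + "," (named entity).
CARET = re.compile(r"\^(\^[0-9]+|[A-Za-z][A-Za-z0-9]*),")

def decode_caret(name):
    """Turn caret-encoded entities back into characters; unknown names are left alone."""
    def fix(match):
        body = match.group(1)
        if body.startswith("^"):
            entity = "&#" + body[1:] + ";"
        else:
            entity = "&" + body + ";"
        decoded = html.unescape(entity)
        return match.group(0) if decoded == entity else decoded
    return CARET.sub(fix, name)

print(decode_caret("St. Mary^^39,s ^quot,old^quot, church ^amp, yard"))
```

Names where the trailing comma was stripped are not handled, and a decoded character may still be one that MediaWiki does not allow in titles, so the output is a proposal for a rename rather than a finished target name.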

@Steinsplitter: I've written a script (on the talk page) to do about 5,000 without complications. —Dispenser (talk) 19:12, 26 October 2017 (UTC)
A batch is done now; a few with errors (such as the target filename already existing) have been skipped. --Steinsplitter (talk) 08:35, 29 October 2017 (UTC)

Unmarked encoding

User:Achim55 noticed some encoding without starting/ending marks. The files are listed in User:Achim55/HTML entities. Basically, Å (U+00C5), which would be the HTML entity &#197;, shows up as the bare number 197. The uploader also appears to have replaced all symbols with hyphens and converted everything to lowercase, so Århus / Aarhus Århus Sporveje becomes File:197rhus--aarhus-197rhus-sporveje-1007104.jpg.

The only approach I've got for fixing this: take a dump of file names broken up at word boundaries, replace all 3-5 digit numbers with the Unicode character of that decimal code point, match the substitute against a multilingual dictionary, and if it's a real word add it to a good-substitutes list. Then build a report using the good-substitutes list.

Any other ideas? —Dispenser (talk) 19:01, 26 October 2017 (UTC)
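
The dictionary approach described above could be sketched roughly as follows. Everything here is an assumption for illustration: the function name, the one-word stand-in dictionary, and the rule that a token must contain at least one letter next to the number.

```python
import re

# A token is optional letters, a 3-5 digit number, optional letters.
TOKEN = re.compile(r"([a-z]*)([0-9]{3,5})([a-z]*)")

def good_substitutes(filename, dictionary):
    """Map each numeric token to a real word, if the decoded form is in the dictionary."""
    good = {}
    for token in re.split(r"[^0-9a-z]+", filename.lower()):
        match = TOKEN.fullmatch(token)
        if not match or not (match.group(1) or match.group(3)):
            continue  # not number-plus-letters (e.g. a bare ID like 1007104)
        char = chr(int(match.group(2))).lower()
        word = match.group(1) + char + match.group(3)
        if word in dictionary:
            good[token] = word
    return good

words = {"\u00e5rhus"}  # hypothetical stand-in for a multilingual dictionary
print(good_substitutes("197rhus--aarhus-197rhus-sporveje-1007104.jpg", words))
```

Capitalisation and the original punctuation are lost in these names, so a report built this way can suggest the restored word but not the full original title.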

Mojibake

Made a script to do mojibake detection. Doesn't seem to be much of an issue since these were all that I found:

Dispenser (talk) 20:09, 28 October 2017 (UTC)

Dispenser, thanks for the pattern, there are also
Have to go to bed now... --Achim (talk) 21:58, 28 October 2017 (UTC)
Dispenser, seems to be an issue, I found a total of 1973 files. --Achim (talk) 20:33, 29 October 2017 (UTC)

Spelling correction

copied from COM:AN --Achim (talk) 20:11, 19 October 2017 (UTC)

Hi, is anybody hanging around willing to take a look at the spelling here? "Wikiwordenboek" should be "wikiwoordenboek". Though I would prefer the word to be linked directly to the nl:wiktionary as I did here, in the meantime the spelling should at least be correct. Thank you for your time. Lotje (talk) 15:12, 19 October 2017 (UTC)

@Marcel coenders: It seems a good idea... --Kanashimi (talk) 11:39, 21 October 2017 (UTC)
Does anybody know how to extract the word (description) from the filename using AWB? E.g. filename = Nl-windvlaagje.ogg. Ten or twenty I'd do by hand but not +900 :-) --Hedwig in Washington (mail?) 02:30, 22 October 2017 (UTC)
wget "https://commons.wikimedia.org/wiki/Special:Search/wikiwordenboek?limit=5000" -O - | grep -P "/wiki/File:[^\"]*" -o
You could also use the Database Scanner. —Dispenser (talk) 12:58, 22 October 2017 (UTC)
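
Outside AWB, pulling the word out of each file name is a one-line regular expression. A minimal Python sketch, assuming every file follows the pattern of the example (language prefix, hyphen, word, .ogg); word_from_filename is an invented helper.

```python
import re

# "Nl-windvlaagje.ogg" -> prefix "Nl", word "windvlaagje"
PATTERN = re.compile(r"(?:File:)?[A-Za-z]{2,3}-(.+)\.ogg", re.IGNORECASE)

def word_from_filename(filename):
    """Return the word part of a pronunciation file name, or None if it does not fit."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        return None
    return match.group(1).replace("_", " ")

print(word_from_filename("Nl-windvlaagje.ogg"))
```

Names that do not fit the pattern return None and would have to be handled by hand.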

2-digit years

There are a lot of dates in {{Information}} which display as e.g. "1 May 0008" instead of "1 May 2008" because the year has only 2 digits (e.g. |Date=08-05-01). Could a bot fix these, at least for unambiguous dates on files which were uploaded close to the date specified? Jc86035 (talk) 16:40, 24 October 2017 (UTC)

There are too many ways to go wrong for this to be realistic. -- (talk) 17:57, 24 October 2017 (UTC)
I agree: |Date=08-05-01 could be 8 May 2001, 5 August 2001, or 1 May 2008. We could do some fixes if it is not ambiguous; for example, 08-12-28 can only be 28 December 2008. --Jarekt (talk) 20:25, 24 October 2017 (UTC)
You forget 28 December 1908. -- (talk) 21:03, 24 October 2017 (UTC)
  • The format {{other date|or|2008-12-28|1908-12-28}} can be used for 2 variants.
  • 2008-05-01 or 2001-05-08 or 2001-08-05 or 1908-05-01 generally
  • Questionable pages can be categorized by uploader to analyse which format is used by the specific uploader.
  • some conditions can be used in the algorithm (date = EXIF, date ≤ upload-date etc.) --ŠJů (talk) 13:33, 26 October 2017 (UTC)
Can't really see the point. A person would make a better job of it, and there is no benefit to adding templates which are likely to actually make later parsing more difficult. -- (talk) 08:13, 29 October 2017 (UTC)
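
To make the ambiguity in this thread concrete, here is a small Python sketch that lists every reading of a two-digit date and then applies the two conditions suggested above (date ≤ upload date, and upload close to the date). The function name, the choice of the centuries 1900 and 2000, and the window_days parameter are all assumptions made for the example.

```python
from datetime import date

def candidate_dates(raw, upload_date, window_days=None, centuries=(1900, 2000)):
    """List valid readings of 'NN-NN-NN' that are not later than the upload date."""
    a, b, c = (int(part) for part in raw.split("-"))
    found = set()
    # (yy, mm, dd) for the readings YY-MM-DD, DD-MM-YY and MM-DD-YY
    for yy, mm, dd in ((a, b, c), (c, b, a), (c, a, b)):
        for century in centuries:
            try:
                candidate = date(century + yy, mm, dd)
            except ValueError:
                continue  # e.g. month 28 does not exist
            age = (upload_date - candidate).days
            if age < 0:
                continue  # dated after the upload
            if window_days is not None and age > window_days:
                continue  # too long before the upload
            found.add(candidate)
    return sorted(found)

print(candidate_dates("08-05-01", date(2008, 6, 1)))
print(candidate_dates("08-05-01", date(2008, 6, 1), window_days=365))
```

Without a window, 08-05-01 uploaded in June 2008 has six readings; with a one-year window only one is left. The sketch only enumerates candidates; whether a single surviving candidate is trustworthy enough for a bot edit is exactly what is disputed above.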

Check "source" and similar fields for red links

If there's a red link in the source, {{derived from}}, or {{extracted from}} field, it's a good sign that something is wrong. Either there's a typo, or the original file has been deleted. If the original file was deleted, most often it's a copyright problem, and the derivative work is therefore also a copyvio. These would need to be reviewed manually, but a bot-generated list would be a place to start. Guanaco (talk) 04:33, 29 October 2017 (UTC)

Category:Files with broken file links -- (talk) 08:09, 29 October 2017 (UTC)
Thanks. 8,000 files is a lot, but it's doable. For the derived from/extracted from, I think I can categorize them using the templates. Guanaco (talk) 08:29, 29 October 2017 (UTC)
You could just use petscan. -- (talk) 08:35, 29 October 2017 (UTC)
I forgot that existed. Thanks. I ended up adding some automatic maintenance categories to the templates, so it's something we can keep an eye on once the backlog is cleared. Guanaco (talk) 09:31, 29 October 2017 (UTC)

Migrate interwiki links to Wikidata

Now that we have {{Interwiki from Wikidata}}, it would be useful to migrate interwiki links from pages here to instead be generated from Wikidata, via an appropriate sitelink from Wikidata <-> Commons.

This has a number of advantages:

  • The list of links to other wikis on Wikidata is more likely to be maintained, and so may be more comprehensive.
  • Interwiki links will automatically reflect any merges or editing of the Wikidata items.
  • Better identification between Commons categories and Wikidata is useful in preparation for Structured Data, for understanding categories here, and for facilitating templates that can draw on Wikidata.
  • In particular, sitelinks are helpful for getting Wikidata-driven templates on Commons to work automatically.
  • It will no longer be necessary to make separate edits here and at Wikidata to link to a new wiki article in a new language.

In the threads on Commons Village Pump and Wikidata Project Chat last month, with the most recent numbers on links between Commons and Wikidata, there seemed to be support for the proposition that

  • a Commonscat should sitelink to a category item on Wikidata if such a category item exists;
  • but if such a category item does not exist, then it is entirely acceptable for a Commons category to be sitelinked to an article-item there.

That principle should clear up any remaining confusion and let us finally move forward in this area.

So:

  • If there is an existing sitelink to Wikidata, please remove interwiki links if they correspond to the interwiki links at Wikidata.
    Also, if the sitelink goes to a category-item on Wikidata, and that item has a P301 "Category's main topic" statement pointing to a corresponding article-item, and that article-item on Wikidata has interwiki links, please remove interwiki links that match those from the category here.
    In the latter case, please also make sure the category has the template {{Interwiki from Wikidata}}. If there are any remaining interwiki links here that are left over after that, please add a category to it, Category:category with unmatched interwiki links.
    Also, if there are any interwiki links that do not match the interwikis from Wikidata, please add the category Category:category with non-matching interwiki links.
  • If there is a single Wikidata item with a P373 "Commons category" statement to the category here, please add a sitelink if one does not already exist, and proceed as above.
  • Similarly, if there are two Wikidata items with P373 "Commons category" to the category here, one from a category-type item, the other from an article-type item, and the two are connected by a P301 "Category's main topic" statement, please add a sitelink from the category here to the category-type item, dematerialise the local interwiki links as above, and make sure the category here has a {{Interwiki from Wikidata}} template.
  • Otherwise, if a Wikidata item can be identified from the interwiki links here, please add a sitelink and a P373 from it to the category here; unless it has a d:Property:P910 "topic's main category", in which case add a P373 to both items, but the sitelink to here from the corresponding category-type item.

Red flags:

  • If there are multiple Wikidata items with P373s linking here; or if the existing sitelink does not match the incoming P373; or if none of the interwikis match the interwikis on the Wikidata item; or if there is a category-type item and an article-type item to the category here, but no P301 "Category's main topic" statement connecting them, then please log this, and add the category Category:category with local interwiki links to the category here.
  • In some cases it may not be possible to sitelink, if the Wikidata item is already sitelinked to a Gallery page here. For such cases we probably need to have a general discussion about whether it is better to change the sitelink to the category here, relying on property P935 for the link from the item to the gallery; or to create a new category-type item, just for the sitelink.
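
The sitelink part of the rules and red flags above can be written down as a small decision function. This is a simplified sketch, not a bot design: it ignores the comparison of the interwiki links themselves and the gallery case, and the Item type, the QIDs and the function name are hypothetical.

```python
from typing import NamedTuple, Optional

class Item(NamedTuple):
    qid: str
    is_category: bool                 # category-type item on Wikidata
    main_topic: Optional[str] = None  # target of P301, if any

def choose_sitelink(existing, p373_items):
    """Return (qid, None) for the item to sitelink, or (None, reason) to log."""
    if len(p373_items) == 1:
        target = p373_items[0].qid
    elif len(p373_items) == 2:
        cats = [i for i in p373_items if i.is_category]
        arts = [i for i in p373_items if not i.is_category]
        if len(cats) != 1 or len(arts) != 1:
            return None, "red flag: multiple items with P373 to this category"
        if cats[0].main_topic != arts[0].qid:
            return None, "red flag: category and article items not connected by P301"
        target = cats[0].qid          # prefer the category-type item
    elif not p373_items:
        return None, "no P373: try to identify an item from the local interwiki links"
    else:
        return None, "red flag: multiple items with P373 to this category"
    if existing is not None and existing != target:
        return None, "red flag: existing sitelink does not match incoming P373"
    return target, None

article = Item("Q1", False)
category = Item("Q2", True, main_topic="Q1")
print(choose_sitelink(None, [article, category]))
```

Returning a reason string instead of raising keeps the red-flag cases easy to write to a log, which is what the request asks for.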

Additional:

  • User:Jarekt has been going through some of the cases where there are multiple Wikidata items with P373s to a single Commons category here; he recently asked for help at c:Commons:Village_pump#Help_needed_with_mapping_categories_to_Wikipedia_articles, if anyone can assist with going through the backlog.
  • Also, in some cases interwikis here may not match P373s if there has been a change here, and the old P373s are now pointing to a disambiguation page or a redirect. In which case, please ideally update the P373s if there seems to be a clear candidate for where they should now point.

Proposed: Jheald (talk) 12:24, 7 November 2017 (UTC)

Where is the community proposal that supports these changes? The link given to the VP was not a proposal but a discussion about stats. It's not obvious to me that these would be entirely non-controversial, so I would like to be able to read that consensus. -- (talk) 12:40, 7 November 2017 (UTC)
@: Would you be okay to have this be considered to be the community proposal, and have discussion here? We can certainly advertise it; the more people who bring input the better, particularly if there are any cases that need any special caution that they can highlight. Jheald (talk) 15:24, 7 November 2017 (UTC)
No. It's not written as a proposal. It's too long and detailed to be understandable (I'm technical and I skim over it as it would probably take 15 to 20 minutes of real reading to understand properly). It's in the wrong place, we have a specific noticeboard for proposals where it should be created. -- (talk) 15:28, 7 November 2017 (UTC)
@: Well if you'd like to summarise it somewhere that would be useful. But it's precisely the technicalities that we need most input on, in case there are issues and cases that the above doesn't cover. And it's the technicalities that are needed, for people to know exactly what is being suggested. I accept I may not be the best at identifying the key points and making them most prominent; but I think it is relevant to cover the different cases set out above, and show they have been given some consideration. So I do think a text at least somewhat like the above is required. Jheald (talk) 15:38, 7 November 2017 (UTC)
As this fits in with the pattern of controlling Commons content from Wikidata, and given how recently the mass removal of geo-coordinates was done without correct consensus, I'm minded to not invest my limited volunteer time writing a proposal. I do not fully understand the above technical request (it's not a proposal) so I doubt that others really will; certainly the case and ramifications for this project have not been made clear or pithy, and previous discussion is hard to follow and presumes lots of background reading to check the assertions. For example "links to other wikis on Wikidata is more likely to be maintained" is at best questionable as it lacks verifiable evidence. Thanks -- (talk) 15:46, 7 November 2017 (UTC)
Discussion linked at COM:VPP at Commons:Village_pump/Proposals#Proposal_to_migrate_interwiki_links_to_Wikidata_.28wherever_possible.29. Jheald (talk) 18:08, 7 November 2017 (UTC)
  Support, and yes, we should have a broader discussion or announcement about it. In the past I was the one who occasionally ran the interwiki update bot, but I have not done so for the last 5 years, since the Wikidata project started and everybody else migrated to Wikidata-based interwikis. As a result many of the links are likely very stale. --Jarekt (talk) 13:25, 7 November 2017 (UTC)

Centralised discussion at Commons:Village_pump/Proposals#Proposal_to_migrate_interwiki_links_to_Wikidata_.28wherever_possible.29.
General policy-question follow-up to there, please. Jheald (talk) 19:04, 7 November 2017 (UTC)

Create interwiki links to Wikipedia

Could you create interwiki links to Wikipedia for files in these categories, please:

I am talking about articles on villages, municipalities, and protected areas. The names of the categories on Commons are the same as the names of the articles on cs.wp; moreover, the Commons categories are linked from the articles on cs.wp via the commonscat template or from the infobox. Thanks!--Juandev (talk) 12:07, 12 November 2017 (UTC)

License checker for Thingiverse

With the upcoming support of .stl files (according to meta:Tech/News/2017/46), a bot similar to the FlickreviewR 2 bot should probably be created. The Thingiverse licenses that are compatible with commons (as far as I can tell), are the following: {{CC-by-3.0}}, {{CC-by-sa-3.0}}, {{CC0}}, {{GPL-2.0}}, {{LGPLv2.1+}}, and {{BSD}}. The following licenses are not compatible: CC-by-nd-3.0, CC-by-nc-3.0, CC-by-nc-sa-3.0, and CC-by-nc-nd-3.0. I can start building the categories, similar to Category:Flickr review needed and its subcategories, if someone will build the bot. Elisfkc (talk) 19:36, 13 November 2017 (UTC)

https://xkcd.com/1205/ comes to mind. Do we have an estimate of how many such files will be uploaded per day? --Zhuyifei1999 (talk) 20:04, 13 November 2017 (UTC)
@Zhuyifei1999: No idea Elisfkc (talk) 01:05, 14 November 2017 (UTC)
@Elisfkc: How many are there?   — Jeff G. ツ 05:02, 15 November 2017 (UTC)
@Jeff G.: Once again, no idea. However, I was thinking earlier today, it would just be easier if, instead of a Thingiverse checker, all STL files were placed in a designated STL license review category by a bot. Considering the extra FOP implications of 3d work, I feel like it would be best to just throw all of the STL files through a license review process. Elisfkc (talk) 05:18, 15 November 2017 (UTC)
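
If a checker were built, the two license lists from the start of this thread could be encoded as a simple lookup. This is only a sketch of that one step; how a bot would read the license from Thingiverse is not covered, and the function name and the result strings are invented.

```python
# License names as listed at the top of this thread (compared case-insensitively).
COMPATIBLE = {"cc-by-3.0", "cc-by-sa-3.0", "cc0", "gpl-2.0", "lgplv2.1+", "bsd"}
INCOMPATIBLE = {"cc-by-nd-3.0", "cc-by-nc-3.0", "cc-by-nc-sa-3.0", "cc-by-nc-nd-3.0"}

def review_license(license_name):
    """Classify a license name as passed, failed, or needing a human reviewer."""
    key = license_name.strip().casefold()
    if key in COMPATIBLE:
        return "passed"
    if key in INCOMPATIBLE:
        return "failed"
    return "needs human review"

print(review_license("CC-by-sa-3.0"))
print(review_license("CC-by-nc-3.0"))
```

Anything not on either list falls through to human review, which also fits the later suggestion of sending all STL files through a review category.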

Category:Media needing categories (cyrillic names)

A large part (if not the majority) of files in Category:Media needing categories (cyrillic names) are currently categorised one way or another. Could they be removed from the category by a bot? For example, User:TaxonBot... 91.219.24.99 18:25, 18 November 2017 (UTC)