Data talk:Emoji/List.tab

See also: Talk:Emoji.

Method

For reference and future updates, here’s the regular expression and listing formula that turned the Unicode Emoji List into this JSON data table:

name='([a-f0-9_]+)'.*\s.*alt='(.*)' class.*\s.*class='name'>⊛? ?(.*)<\/td>.*\s.*class='name'>(.*)<\/td>
["$2","$1",{"en":"$3"},{"en":"$4"}],\n

I used RegExr to build and run this. It seems to have captured all 2623 emoji characters correctly! If the table is updated, or if we want internationalised names and keywords, we’ll probably need to upgrade it to something more solid. 🙂 ~ nicolas (talk) 11:28, 17 June 2017 (UTC)
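As a rough illustration of how the pattern and the listing formula above fit together, here is a minimal sketch that runs the same regular expression with matchAll and builds the rows as arrays instead of by string substitution. The function name extractRows and the sample HTML are made up for this example; the fragment is shaped to fit the pattern and is not copied from the real chart.

```javascript
// Same pattern as in the comment above, with the global flag so matchAll can be used.
const rowPattern = /name='([a-f0-9_]+)'.*\s.*alt='(.*)' class.*\s.*class='name'>⊛? ?(.*)<\/td>.*\s.*class='name'>(.*)<\/td>/g;

// Hypothetical helper: returns one [character, code, {en: name}, {en: keywords}] row per match,
// mirroring the ["$2","$1",{"en":"$3"},{"en":"$4"}] listing formula.
function extractRows(html) {
  return [...html.matchAll(rowPattern)].map(
    ([, code, char, name, keywords]) => [char, code, { en: name }, { en: keywords }]
  );
}

// Made-up fragment (assumption): four lines per emoji, which is what the pattern expects.
const sampleHtml = [
  "<tr><td class='rchars'>1</td><td class='code'><a href='#1f600' name='1f600'>U+1F600</a></td>",
  "<td class='andr'><img alt='😀' class='imga' src='x.png'></td>",
  "<td class='name'>grinning face</td>",
  "<td class='name'>face | grin</td></tr>",
  "<tr><td class='rchars'>2</td><td class='code'><a href='#1f929' name='1f929'>U+1F929</a></td>",
  "<td class='andr'><img alt='🤩' class='imga' src='y.png'></td>",
  "<td class='name'>⊛ star-struck</td>",
  "<td class='name'>eyes | face | star</td></tr>",
].join('\n');

const extractedRows = extractRows(sampleHtml);
console.log(JSON.stringify(extractedRows));
```

Building the rows as arrays and serialising them with JSON.stringify also escapes any quotes in names or keywords, which the plain substitution does not.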

For Emoji 11.0, I also had to search and delete <span class='keye'> and </span> in the keywords. ~ nicolas (talk) 16:42, 18 February 2018 (UTC)
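The search-and-delete step could be done in the same pass as the extraction. A minimal sketch, assuming the only markup inside the keywords is the span mentioned above; the function name stripKeywordSpans and the sample string are invented for illustration.

```javascript
// Hypothetical helper: removes <span class='keye'> and </span> from a keywords string.
function stripKeywordSpans(keywords) {
  return keywords.replace(/<span class='keye'>|<\/span>/g, '');
}

// Made-up keywords string (assumption) with one highlighted keyword.
const cleanedKeywords = stripKeywordSpans("face | <span class='keye'>grin</span> | grinning");
console.log(cleanedKeywords);
```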

Emoji 12

Unicode currently splits the list in three: Emoji List (1719 symbols), which contains everything but the skin colour variations; Full Emoji Modifier Sequences (1300 symbols), which has all the skin colour variations; and Recommended Emoji ZWJ Sequences (906 symbols), which has other variations (hairstyles and genders).

The regex to scrape the first table is almost unchanged (“title” instead of “class”). Somehow I only get 1712 emojis instead of the 1719 in the official table. There is a mistake somewhere and I will look into it, but for now 7 missing symbols is not a serious issue.

name='([a-f0-9_]+)'.*\s.*alt='(.*)' title.*\s.*class='name'>⊛? ?(.*)<\/td>.*\s.*class='name'>(.*)<\/td>
["$2","$1",{"en":"$3"},{"en":"$4"}],\n

I’m uploading it like that to Data:Emoji.tab. I haven’t figured out the regex for the two other tables yet. I believe we should follow Unicode and make two other tabular data sets, something like Data:Emoji.tab/Modifiers.tab and Data:Emoji.tab/ZWJSequences.tab. Or move it to Data:Emoji/List.tab and have Data:Emoji/Modifiers.tab and Data:Emoji/ZWJSequences.tab. ~ nicolas (talk) 19:23, 4 April 2019 (UTC)

I did the import, and moved Data:Emoji.tab to Data:Emoji/List.tab. Also, I was wrong about “Recommended Emoji ZWJ Sequences”: it repeats things that are already in the two other tables. ~ nicolas (talk) 19:36, 4 April 2019 (UTC)
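One possible way to track down the 7 missing symbols mentioned above is to compare every name='…' anchor in the page with the codes the full pattern actually captured. This is only a sketch: findUnmatchedCodes and the sample HTML are hypothetical, and on the real page some anchors may belong to headings rather than emoji rows, so the result would need checking by hand.

```javascript
// Emoji 12 pattern from the comment above ("title" instead of "class"), with the global flag.
const emoji12Pattern = /name='([a-f0-9_]+)'.*\s.*alt='(.*)' title.*\s.*class='name'>⊛? ?(.*)<\/td>.*\s.*class='name'>(.*)<\/td>/g;
// Looser pattern that only looks for the anchor carrying the code.
const anchorPattern = /name='([a-f0-9_]+)'/g;

// Hypothetical helper: codes that have an anchor but were not captured by the full pattern.
function findUnmatchedCodes(html) {
  const matched = new Set([...html.matchAll(emoji12Pattern)].map(m => m[1]));
  const allCodes = [...html.matchAll(anchorPattern)].map(m => m[1]);
  return allCodes.filter(code => !matched.has(code));
}

// Made-up fragment (assumption): the second row has no title attribute, so the full pattern skips it.
const emoji12Sample = [
  "<tr><td class='rchars'>1</td><td class='code'><a href='#1f600' name='1f600'>U+1F600</a></td>",
  "<td class='andr'><img alt='😀' title='U+1F600 grinning face' src='x.png'></td>",
  "<td class='name'>grinning face</td>",
  "<td class='name'>face | grin</td></tr>",
  "<tr><td class='rchars'>2</td><td class='code'><a href='#1f603' name='1f603'>U+1F603</a></td>",
  "<td class='andr'><img alt='😃' class='imga' src='y.png'></td>",
  "<td class='name'>grinning face with big eyes</td>",
  "<td class='name'>face | mouth | open | smile</td></tr>",
].join('\n');

const unmatchedCodes = findUnmatchedCodes(emoji12Sample);
console.log(unmatchedCodes);
```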

Method 2

This is my method for updating this data; hopefully it will be useful for others:

  1. Open https://www.unicode.org/emoji/charts/emoji-list.html
  2. Run this in the browser's JavaScript console; it will place the result in your clipboard automatically:
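    // copy() is a console utility provided by browser developer tools. Keep only rows whose first cell parses as a number (the emoji rows).
    copy(JSON.stringify([...document.querySelectorAll('tr')].filter(x => !Number.isNaN(+x.firstElementChild.innerText)).map(x => {
      const nodes = [...x.childNodes].filter(y => y.nodeName === 'TD');
      // Row layout: [emoji character, code points as lowercase hex joined by "_", English name, English keywords]
      return [nodes[2].firstElementChild.firstElementChild.alt, nodes[1].innerText.toLowerCase().replace(/ /g, '_').replace(/u\+/g, ''), { en: nodes[3].innerText }, { en: nodes[4].innerText }];
    }), undefined, 4).split('\n').map((x, i) => '    ' + (i ? '' : '"data": ') + x).join('\n'));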
    
  3. Replace the page's data, starting from the line "data": [, with the clipboard's content
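To make the two string transformations in the console script easier to follow, here they are pulled out into standalone functions that run without a browser. The names toCodeKey and formatDataBlock are made up for this sketch; the logic is the same as in step 2.

```javascript
// Turns the chart's code column text, e.g. "U+1F468 U+200D U+1F469", into the key used in the table.
function toCodeKey(codeText) {
  return codeText.toLowerCase().replace(/ /g, '_').replace(/u\+/g, '');
}

// Pretty-prints the rows with 4 spaces, indents every line by 4 more spaces,
// and prefixes the first line with "data": so the result can replace that part of the page.
function formatDataBlock(rows) {
  return JSON.stringify(rows, undefined, 4)
    .split('\n')
    .map((line, i) => '    ' + (i ? '' : '"data": ') + line)
    .join('\n');
}

const exampleBlock = formatDataBlock([
  ['😀', toCodeKey('U+1F600'), { en: 'grinning face' }, { en: 'face | grin' }],
]);
console.log(exampleBlock);
```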

The console script in step 2 was used to update the data to Unicode v15.0. It would probably be better to build on something from https://www.unicode.org/Public/UCD/latest/ucd/emoji/, or maybe not. −ebrahimtalk 18:25, 5 November 2022 (UTC)
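For anyone who wants to try the data files instead, here is a minimal sketch of reading one. It assumes the usual UCD text layout, "code point or range ; property # comment", and the sample lines are illustrative rather than copied from a real file. The function names are hypothetical. Note that this only recovers code points and properties, not the names and keywords that the chart provides, so it would not replace the script above on its own.

```javascript
// Parses one line of a UCD-style data file; returns null for blank or comment-only lines.
function parseEmojiDataLine(line) {
  const content = line.split('#')[0].trim();
  if (!content) return null;
  const [range, property] = content.split(';').map(s => s.trim());
  const [start, end = start] = range.split('..');
  return { start: parseInt(start, 16), end: parseInt(end, 16), property };
}

// Expands a parsed range into lowercase hex keys, zero-padded to at least 4 digits.
function expandCodeKeys(entry) {
  const keys = [];
  for (let cp = entry.start; cp <= entry.end; cp++) {
    keys.push(cp.toString(16).padStart(4, '0'));
  }
  return keys;
}

// Illustrative lines only (assumption), not copied from a real file.
const sampleLines = [
  '# comment-only line',
  '0023          ; Emoji                # hash sign',
  '1F601..1F606  ; Emoji                # six smiley faces',
];

const emojiKeys = sampleLines
  .map(parseEmojiDataLine)
  .filter(entry => entry && entry.property === 'Emoji')
  .flatMap(expandCodeKeys);
console.log(emojiKeys);
```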
