User talk:Curpsbot-unicodify
This page formerly redirected to my talk page. However, I've now removed the redirect so that discussion of the bot's edits can be centralized here. This page is on my watchlist, but to contact me directly you can leave a message at User talk:Curps. -- Curps 03:47, 24 September 2005 (UTC)
Escape sequence bot
From the limited overview of your bot given on Wikipedia:Bots, I am a little concerned. Many older browsers mangle non-ASCII characters, especially in character sets that they cannot display (most often East Asian characters). Converting escaped characters to their literal forms is likely to make this currently occasional problem much more common - I have seen only a few edits be reverted on these grounds. If anything we should be converting the other way! Obviously that has problems of its own, as it makes the page harder to edit, but I really don't think we should be removing escapes that are already there. Hope to hear from you soon, Soo 19:28, 22 August 2005 (UTC)
- See [1]. There should not be a problem with it.
- It's not really a full-blown bot yet, I'm running it manually one page at a time and checking the diffs. -- Curps 19:44, 22 August 2005 (UTC)
Bots
Um... did you request to run your bot already? I didn't see your bot proposal on the bot talk page. --AllyUnion (talk) 06:25, 25 August 2005 (UTC)
- Well, I'm running it manually, in the sense of checking each edit within minutes. So it hasn't been turned loose. But I'll go ahead and put a notice on the talk page. -- Curps 06:29, 25 August 2005 (UTC)
Redirect-eliminating bot
The second run of your redirect-eliminating bot (which is very useful, btw) seems to treat links to disambiguated pages incorrectly. E.g. this edit changed [[Janusz Radziwill (1579-1620)|Janusz Radziwiłł]] to [[Janusz Radziwiłł (1579-1620)]], instead of [[Janusz Radziwiłł (1579-1620)|Janusz Radziwiłł]]. -- Naive cynic 09:51, 30 August 2005 (UTC)
- You're right, it shouldn't have done that, and I noticed it and fixed the code. I thought I found the pages where it did that and fixed them, but I guess I missed a few. Let me know if you find any other examples. -- Curps 10:10, 30 August 2005 (UTC)
a few suggestions for User:Curpsbot-unicodify
- Only do a particular page once.
- Rationale: if someone has a legit reason to revert your bot's changes from entities to inline Unicode, they aren't going to appreciate you doing it again.
- Don't convert any right-to-left characters (Hebrew/Arabic) in a line that contains entities that won't be converted or other HTML markup.
- Rationale: if HTML markup ends up sandwiched between two right-to-left characters, it can in some cases end up rendered in a rather odd order in the edit box, because it is made up of a mixture of letters/numbers that are strong left-to-right and symbols that are directionally neutral.
- Don't convert talk pages.
- Rationale: Comments on talk pages are rarely edited after the fact so there is no point in converting them
- (above comments by User:Plugwash)
-
- 1 — Normally, Unicode conversions would only need to be done once. There have been some exceptions during development when I refined it to handle a previously overlooked range, but this should be reasonably settled now. I have also stopped converting ndash, mdash and minus to literal Unicode.
-
- However, the code is being expanded to optionally modify redirects (see User:Curpsbot-unicodify user page), and this will probably be done in a separate pass most of the time.
-
- 2 — I am putting in special checking for RTL issues. The edit box problem only seems to occur when there is a non-whitespace LTR character embedded between RTL characters on the same line, and this non-whitespace LTR character is not separated by whitespace from any RTL character. If that's found, it prompts on whether to go ahead with the change.
-
- However, the default case (with no RTL-LTR-RTL embedding) seems to have no edit-box issues, and that's nearly all of the cases encountered so far. In any case, I am concentrating on eastern European pages for now, where there is already heavy use of diacritics and new edits already have introduced literal Unicode characters, so I only rarely encounter RTL (only in interwiki or a few Yiddish names), so it's not much of an issue so far.
-
- In any case, this problem can be solved merely by adding a sufficiently long HTML comment at the right place on the line, to ensure that the editor interprets the line as a predominantly LTR line. I may modify the bot to prompt the operator on whether to do this.
-
- 3 — I'm not targeting talk pages. The only way a talk page would get converted is if it somehow was incorrectly added to a category and I ran the bot over that category and its subcategories. If this did happen though, it's no big deal: refactoring talk pages is OK as long as it doesn't alter what was said, and the bot never introduces any reader-visible changes.
-
- -- Curps 18:07, 31 August 2005 (UTC)
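The RTL check described in point 2 can be sketched roughly as follows. This is only one reading of the rule, not the bot's actual code: a line is flagged when a strong left-to-right character sits in the same whitespace-free run as a right-to-left character and has right-to-left characters both before and after it on the line. The function names are made up for illustration; the bidirectional classes come from Python's standard unicodedata module.

```python
import re
import unicodedata


def is_rtl(ch):
    # "R" covers Hebrew letters, "AL" covers Arabic letters.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def is_ltr(ch):
    # "L" is the strong left-to-right class (Latin letters and so on).
    return unicodedata.bidirectional(ch) == "L"


def needs_rtl_prompt(line):
    """Return True if the line should be shown to the operator first.

    Assumed rule: some strong-LTR character lies between the first and
    last RTL characters of the line and shares a whitespace-free run
    with at least one RTL character.
    """
    rtl_positions = [i for i, ch in enumerate(line) if is_rtl(ch)]
    if len(rtl_positions) < 2:
        return False
    first, last = rtl_positions[0], rtl_positions[-1]
    for run in re.finditer(r"\S+", line):
        text = run.group(0)
        if not any(is_rtl(ch) for ch in text):
            continue
        for offset, ch in enumerate(text):
            if is_ltr(ch) and first < run.start() + offset < last:
                return True
    return False
```

An interwiki link such as [[he:...]] is not flagged under this reading, because its Latin letters come before the first RTL character rather than between two of them.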
Cheers, it seems you already have that under control. Out of interest, why did you stop converting dashes? Plugwash 22:14, 31 August 2005 (UTC)
- I noticed one or two people reverted them back to &ndash;. Maybe it's because in a browser edit window with fixed-width fonts, a literal ndash looks exactly like a hyphen, so maybe it was just a mistaken impression on their part. Perhaps I'll ask them on their talk page. -- Curps 01:05, 1 September 2005 (UTC)
-
- I'll give you a damn good reason not to convert dashes. There are several that look quite similar. It is very difficult for an editor to tell which one is being used when the character appears in the edit box − rather than, for example, &ndash; and &minus; and the ordinary - hyphen. Tell me by looking at it which one I used where I should have had an &mdash; in the last sentence. Can you do so? Gene Nygaard 05:24, 2 September 2005 (UTC)
-
- A part of this problem is that the length of these dashes is different when I view an article page from what it is when I view these characters on the edit screen, due to different fonts being used. That is true for most, if not all, editors. Gene Nygaard 05:55, 2 September 2005 (UTC)
-
- I turned off conversion of &mdash;, &ndash; and &minus; a few days ago. -- Curps 19:09, 2 September 2005 (UTC)
-
I also don't think you should be converting the ones specifically telling us about the Unicode symbols, as in the changes you made in ring (diacritic). Note that "U+055A ( ՚ )" is not as useful to editors as the previous (appearing on edit page there as this appears on article page here) "U+055A ( &#1370; )", even though they appear the same on the article page there. The latter tells editors who go to the edit screen how they can make this character in their editing. Yes, some people will know from just the "U+" number that they can also make it with "&#x55A;" (since &#1370; and &#x55A; are the same thing), but the use of hexadecimal numbers rather than decimal numbers for this purpose, and the meaning of that x in &#x55A;, are confusing to many people, and even the fact that what follows "U+" is in fact a number escapes many of them. Gene Nygaard 05:50, 2 September 2005 (UTC)
- I'm not sure I agree with that reasoning. If they want to make that character, they will probably just cut and paste it from the non-edit window (ie, the displayed page itself). -- Curps 19:09, 2 September 2005 (UTC)
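For anyone unsure how the "U+" number, the decimal reference and the hexadecimal reference relate, here is a small sketch using only the Python standard library. The helper name reference_forms is made up for illustration; html.unescape is the standard-library decoder for character references.

```python
import html


def reference_forms(ch):
    """Return the U+ notation, decimal reference and hex reference for ch."""
    cp = ord(ch)
    return ("U+%04X" % cp, "&#%d;" % cp, "&#x%X;" % cp)


# The hexadecimal number after "U+" and the decimal number in the
# reference are the same value written in two bases.
assert int("055A", 16) == 1370

# Both reference styles decode to the same single character.
assert html.unescape("&#1370;") == html.unescape("&#x55A;")
```

Calling reference_forms on the character U+055A gives all three spellings at once, which is the information the article text was trying to convey.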
Second bot - Diacritic pages
It sounds like, and correct me if I'm wrong, you want to disambig diacritic pages. If that is the case, there is something you can alter from the pywikipedia framework to get you started. --AllyUnion (talk) 06:14, 2 September 2005 (UTC)
- It's not actually disambiguation, it's elimination of redirects, especially of the form [[Foo|Fóó]], where the article is now at "Fóó" and "Foo" is now just a redirect to it. The bot simplifies this to just [[Fóó]]. I added this code to the bot and it can be done simultaneously with the original functionality, but I'm usually running it as a separate pass for clarity. The edit summary describes what the bot does in each edit. -- Curps 15:01, 2 September 2005 (UTC)
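A minimal sketch of the redirect elimination described here, including the disambiguated-page case reported earlier on this page. It is an illustration, not the bot's code: the REDIRECTS table is hypothetical and hard-coded, whereas a real run would look redirects up on the wiki.

```python
import re

# Hypothetical lookup table: redirect title -> title the article now lives at.
REDIRECTS = {
    "Foo": "Fóó",
    "Janusz Radziwill (1579-1620)": "Janusz Radziwiłł (1579-1620)",
}

# A piped link [[target|label]] with no nested brackets.
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")


def bypass_redirects(wikitext, redirects=REDIRECTS):
    """Point piped links at the real article title, keeping the label."""

    def fix(match):
        target, label = match.group(1), match.group(2)
        new_target = redirects.get(target, target)
        if new_target == label:
            # [[Foo|Fóó]] becomes [[Fóó]]: the pipe is now redundant.
            return "[[" + label + "]]"
        # The label differs from the new title (for example a
        # disambiguated title), so the pipe must be kept.
        return "[[" + new_target + "|" + label + "]]"

    return PIPED_LINK.sub(fix, wikitext)
```

Dropping the pipe only when the new target equals the label is what keeps the reader-visible text unchanged in both cases.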
Ring (diacritic)
Would you mind running the redirect portion of your bot on ring (diacritic)? Just be careful not to re-run the Unicode portion, as the conflict about changing the display versions for Unicode has been resolved. (Though I disagree with it. I wish I could completely eliminate the decimal version of all those entities. Aside from the fact, why would you go through all the trouble to convert the hex to decimal when the hex version is perfectly valid?) Anyway, there are a lot of non-Unicode redirects to Unicode links on that page. Thanks. —Gordon P. Hemsley→✉ 17:30, September 2, 2005 (UTC)
Curpsbot-unicodify
Hi Curps, you might want to update the code in your bot so that it doesn't change underscores to spaces in references to images (and possibly other media). Filenames usually contain underscores, not spaces (eg. "Image:Hungary_COA.jpg" and not "Image:Hungary COA.jpg" – see this edit by Curpsbot-unicodify).
The above notwithstanding – I couldn't be more grateful for the invaluable work your bot is doing. Thank you ever so much.
KissL 08:30, 6 September 2005 (UTC)
- Yes, the actual filenames do indeed contain underscores; however, links with spaces work just fine. It's the same as with articles, really: the internal name has underscores but you can use spaces just fine. Plugwash 13:49, 6 September 2005 (UTC)
- As far as I can tell, Wikipedia handles Image: links exactly like any other wiki link: the URL contains underscores but the displayed headline within the page doesn't.
- For instance, this talk page is http://en.wikipedia.org/wiki/User_talk:Curpsbot-unicodify but the headline says User talk:Curpsbot-unicodify, and Image: pages seem to work the same way. So I think it's OK if the bot just changes underscores universally without making a special case for image links. -- Curps 17:30, 6 September 2005 (UTC)
Well, as you like. I personally like image links with underscores better, since they work too, and they represent the filename correctly; but it is true that the page headings are rendered with spaces, in the image namespace too. KissL 10:41, 7 September 2005 (UTC)
- I would prefer that the bot leave underscores alone in image names. I upload them with underscores because filenames with spaces can be a pain to deal with at a Unix shell prompt. Not everyone is a windoze zombie. RedWolf 00:54, 17 November 2005 (UTC)
Curpsbot-unicodify and dashes
Keep up the good work. You may be interested to know that some users prefer literal dashes, see Portal talk:Cricket#Editing_difficulties. Personally, I have no strong opinions either way on that particular issue. Susvolans ⇔ 17:13, 8 September 2005 (UTC)
- Thanks. I prefer the literal dashes as well, but there are pros and cons and I don't want the bot's actions to be controversial. We'll see how this evolves in the medium-term future. -- Curps 17:27, 8 September 2005 (UTC)
-
- Yeah, we use a fixed-width edit box font, presumably for the same reason most programmers use one: it's a lot easier to line up code in a fixed-width font (and lining stuff up can make it a lot more readable in the edit box). Unfortunately, one side effect of this is that hyphen-minus (-) and n dash (–) look identical in the edit box on many systems. Plugwash 18:58, 8 September 2005 (UTC)
the bot that changes straight to curly quotes
It just did one of my articles; I want it to do others. Any way of requesting this? Tony 14:02, 11 September 2005 (UTC)
- It doesn't change straight to curly quotes, it only changes &#…; and &…; (and %xx) into literal Unicode characters. Which article are you referring to? If you want a particular article to be processed by the bot in the near future, just leave me a note. -- Curps 18:08, 11 September 2005 (UTC)
The article is Patrick White, in which yesterday your bot changed all single quote marks to &rsquo; (and where applicable, &lsquo;); it also fixed up the superscript in square km.
-
- How, I wonder, was this run-through triggered? Is there some way of triggering it myself?
- The funny thing is that now I've just checked again, I see that in the edit boxes the '&' things have been replaced with curly quotes; this is even better, since other editors won't freak out about the gobbledygook.
- The other funny thing is that many of the curly quotes, in both edit boxes and finished appearance, appear to have reverted to straight quotes.
Very odd.
Tony 00:43, 12 September 2005 (UTC)
- As you can see from the diff, the bot merely changed &rsquo; and &lsquo; into the literal curly-quote characters. It never changes one character to a different character, and in general does not change the way an article appears to readers (it only changes the way it appears to editors, in the edit window, by changing "gobbledygook" to the corresponding actual characters).
- The bot isn't running autonomously yet, I'm just launching it manually. I specify a particular category and let it run over all the articles in the category and recursively through all the subcategories and sub-subcategories. I'm not sure exactly how the article Patrick White was selected, it was obviously in one of the sub-sub-subcategories.
- Right now, the only way to definitely get a particular article processed is to let me know (the name of the article or the name of a category it's in) and I'll manually run it. At some point in the not-too-distant future, I'll probably let it run over all the articles in the English Wikipedia.
- -- Curps 03:44, 12 September 2005 (UTC)
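For readers wondering what "changing entities into the corresponding literal characters" amounts to, here is a rough sketch in Python. It is not the bot's actual code. The KEEP set is an assumption: the three dashes are left alone because of the discussion further up this page, and the markup-significant entities (amp, lt, gt, quot, nbsp) are my own cautious addition. The %xx handling in links is not covered.

```python
import html
import re

# Assumed skip list (see the note above); not taken from the bot itself.
KEEP = {"ndash", "mdash", "minus", "amp", "lt", "gt", "quot", "nbsp"}
KEEP_CHARS = {html.unescape("&%s;" % name) for name in KEEP}

# Named, decimal and hexadecimal references, all with a closing semicolon.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def unicodify(wikitext):
    """Replace character references with the literal characters."""

    def convert(match):
        decoded = html.unescape(match.group(0))
        # Leave skipped characters alone whichever way they were written,
        # so &ndash; and &#8211; are treated alike.
        if decoded in KEEP_CHARS:
            return match.group(0)
        # Unknown names come back unchanged from html.unescape.
        return decoded

    return ENTITY.sub(convert, wikitext)
```

Each reference is replaced by exactly the character it already stood for, so the rendered article stays the same and only the edit-window text changes.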
And a good thing that would be, too. Thanks for your efforts. I'll let you know of any articles I'd like the bot to crawl over—there are tons, actually, and it's a matter of rationing them. Tony 05:47, 12 September 2005 (UTC)
Unicodifier bot again
Heyya, Curps. The bot recently ran over the Georgian language article, which I've worked on a little in the past. I'm probably just not understanding what it did. Here's my concern: when I'm editing, I know how to type the escape sequences, but I don't know how to type the true Unicode characters. So as far as I can see, the bot has removed this article from among those I can easily work on. I must be missing something: is there an easy way to type these characters? ACW 01:43, 14 September 2005 (UTC)
- It varies depending on the OS; there's some information in our Unicode article, and copy and paste should work too. Failing that, there is no one stopping you from still using the HTML entities for new text you add. Curps's bot just serves to make the existing wikitext easier to read; it does not limit what you can use in the article in future. Plugwash 02:02, 14 September 2005 (UTC)
-
- You can cut and paste them from Georgian alphabet or from Template:Unicode chart Georgian. This seems less error-prone than memorizing and typing in the &# codes. Or, as Plugwash said, you can keep on entering the &# entities for future edits, and perhaps the bot will pay a future visit and change those too, eventually.
-
- An ideal solution would be to allow for Javascript click-to-enter Georgian character insertion that would work much like the "Insert:" selection below each edit window. Currently we have European letters in there, but unfortunately it would be impractical to add Greek, Cyrillic, Armenian, Georgian, Thai, etc. It would be great if the Special:Preferences page allowed you to customize which alphabets appeared below your edit window. Maybe that's an idea for a future MediaWiki enhancement. -- Curps 02:30, 14 September 2005 (UTC)
-
- This sounds like a great idea. Perhaps there could be a "Show Special Keyboards" button, which revealed a bunch of other buttons with labels like "Show Georgian Keyboard", "Show IPA Keyboard", and so on. That way we wouldn't take up real estate when not needed. In fact, even better: suppose the presence of Georgian characters in the article automatically triggered the inclusion of the Georgian keyboard? ACW 19:05, 15 September 2005 (UTC)
-
- User:Func seems to have been working on this very sort of thing, see User talk:Func#Adapting your nupatrol.js for a different purpose. Give it a try and see if it works for you for IPA, and then adapting it to Georgian shouldn't be too hard (see also {{Unicode chart Georgian}}). -- Curps 00:03, 16 September 2005 (UTC)
-
- The charinsert box has an id of "editpage-specialchars", so it probably wouldn't be too hard to modify it from your custom monobook.js. Plugwash 02:39, 14 September 2005 (UTC)
-
- Can you provide an example? I'm not familiar at all with CSS syntax. I know the whole thing can be made to disappear by using "none", but how do you alter the set of letters and characters? -- Curps 02:43, 14 September 2005 (UTC)
Even once the bot has converted the HTML entities to Unicode characters, you can still use HTML entities for your new edits. I fail to see the problem, but if it persists, perhaps somebody can clarify. --Pjacobi 17:55, 24 September 2005 (UTC)
about unicode
Are there any other reasons, apart from being too ugly and taking up more bytes, against using &#decimal; for Unicode characters? (I don't disagree, I'm just curious.) Do you need any help with the Unicode draft? Thanks. MATIA 12:26, 14 September 2005 (UTC)
- Sure... readability. There are probably lots of people who can read a particular language and can even type in it, who'd be able to read and fix an interwiki link or snippet of text, but when they open their browser editor they just see the &# characters and they can't read it and don't know what to do (not everyone who could contribute is an experienced editor). -- Curps 13:53, 14 September 2005 (UTC)
- Entities are OK for occasional characters, but try to enter even a whole word as entities and the wikisource becomes virtually impossible to read and therefore to work on. Larger blocks of text are even worse. And that's assuming you know what entities are and where to look stuff up; if you don't, then you have no hope of editing the text at all. Plugwash 02:21, 31 December 2005 (UTC)
Dump-based conversion
So I have a list of 11,610 articles that contain HTML entities (&foo; and &#XXX;) and URL-encoded characters (%NN) in links to other articles. Converting these to native Unicode characters would be convenient for me in the way that I analyze offline database dumps. I think links are also the most confusing place for editors to find weird characters. Would you be able and willing to make use of this list to feed to Curpsbot-unicodify? I also have a list of 3,000 or so articles whose links contain double, leading, or trailing spaces. Does the bot fix these cases as well?
In the long run, it seems like it would be useful to feed the bot a list of articles that needs to be fixed, rather than trolling various categories for candidates. If you would like me to provide such lists (or some scripts to produce them from database dumps), let me know.
Thanks!
Beland 01:52, 18 September 2005 (UTC)
- Sure, it would be useful to have that list, and I could run the bot over it. One problem, though, I don't use e-mail, so perhaps the easiest thing to do would be to dump it into the Wikipedia:Sandbox and then give me a link to the revision in question.
- The bot doesn't currently fix leading, trailing, or double spaces in links (or double underscores), but could easily be modified to do so.
- You are right that it would make more sense to use database dumps as a way of generating targets for the bot rather than trawling categories. I'm pretty sure I looked at a page once that had links to database dumps, perhaps you can point me to it. What are your scripts written in? -- Curps 05:13, 18 September 2005 (UTC)
-
- Well, to avoid non-ASCII character conversion problems, I had Pearle upload the files directly. I'll delete them when they are no longer needed. You can edit User:Pearle/for-curps to get the weird-character list, and User:Pearle/for-curps2 to get the extra-spaces list. Database dumps are found at: http://download.wikimedia.org/wikipedia/en/
- Further information on dumps is at Wikipedia:Database download. My scripts are written in Perl, though it's pretty trivial to write something that will search for a certain string in raw wikitext. As long as you don't mind downloading and storing a gigabyte or two.-- Beland 05:52, 18 September 2005 (UTC)
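As a rough illustration of the kind of search described here (in Python rather than Perl), the sketch below flags pages whose link targets contain an HTML entity or a %NN escape. It assumes the dump has already been parsed into (title, wikitext) pairs; the function names and the patterns are illustrative and are not the scripts that produced the lists above.

```python
import re

# Everything after "[[" up to the first pipe or bracket is the link target.
LINK_TARGET = re.compile(r"\[\[([^\[\]|]*)")

# An HTML entity (&foo; or &#NNN;) or a URL-encoded byte (%NN).
SUSPECT = re.compile(r"&#?[A-Za-z0-9]+;|%[0-9A-Fa-f]{2}")


def suspicious_links(wikitext):
    """Return the link targets that contain entities or %NN escapes."""
    return [t for t in LINK_TARGET.findall(wikitext) if SUSPECT.search(t)]


def pages_to_fix(pages):
    """pages is an iterable of (title, wikitext) pairs from a dump."""
    return [title for title, text in pages if suspicious_links(text)]
```

Entities outside links are deliberately ignored, since the list under discussion is about links only.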
-
- OK, I've grabbed the for-curps files and have started running them. The bot usually pauses about a minute between pages, so it may take a couple of weeks to get everything done.
-
- I'll see about downloading the database. Is it now in the form of a collection of XML files (or one big XML file)? I suppose that shouldn't be too hard to parse. -- Curps 07:41, 18 September 2005 (UTC)
-
- I'm running the for-curps2 file now. Right now, the way it's doing it, it only processes the "A" part of [[A|B]], in other words, not altering the number of spaces in the "B" part or trailing or leading blanks there. Actually, I have a few misgivings about trailing and leading blanks even in the "A" part, since they might have been added simply for editing readability. -- Curps 16:22, 18 September 2005 (UTC)
-
- There's a problem with removing leading and trailing blanks. Here's one real-life example: [2]
-
Billy Martin is the first[[ American League]] manager to…
-
- Removing the leading blank causes the page to display differently (this would be even more pronounced if it was a trailing blank: the following word would be merged into the bluelink). I think it'll take extra work to resolve this. -- Curps 18:40, 18 September 2005 (UTC)
Question about your bot
Hi. I saw that your bot converts to Unicode, and also trims spaces in places (the regex [_ ]). I wonder, did you consider having the bot remove extra empty lines between paragraphs? At least to me, it is an annoyance to see sections separated by a lot of newlines sometimes, and done in inconsistent manner. You can reply here, I will keep this page on my watchlist. Thanks, Oleg Alexandrov 15:13, 23 September 2005 (UTC)
- The bot only removes extra spaces inside a link (and also leading and trailing spaces inside the "A" part of a [[A|B]] link), which is guaranteed not to affect how a page displays. Generally, the bot tries to avoid altering how a page displays to the reader, it usually only alters how it appears to the editor in the browser edit window.
- It would be dangerous to delete multiple newlines, because often they are used to avoid text wrapping around tables, by editors unaware of the <br clear="all" /> or <br style="clear:both;"> HTML tags. So this isn't something that a bot should do. -- Curps 15:24, 23 September 2005 (UTC)
-
- Thanks, that clarifies it. Oleg Alexandrov 18:56, 23 September 2005 (UTC)
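The space handling described in this thread might look something like the sketch below. It is a guess at the behaviour, not the bot's code: in a piped link [[A|B]] the "A" part is fully tidied, while in an unpiped link only interior runs of spaces are collapsed and the edges are left alone, because (as the Billy Martin example in the previous section shows) a leading or trailing blank there can affect the displayed text.

```python
import re

# A link with an optional "|label" part and no nested brackets.
LINK = re.compile(r"\[\[([^\[\]|]+)(\|[^\[\]]*)?\]\]")

# Runs of underscores and/or spaces, as in the regex [_ ] mentioned above.
RUN = re.compile(r"[_ ]+")


def tidy_link_spaces(wikitext):
    """Tidy spacing inside link targets without changing what readers see."""

    def fix(match):
        target, pipe = match.group(1), match.group(2)
        if pipe is None:
            # Unpiped: the target is also the displayed text, so only
            # collapse interior runs of spaces and keep the edges.
            return "[[" + re.sub(r" {2,}", " ", target) + "]]"
        # Piped: the label after the pipe is what gets displayed, so the
        # target can be normalized and trimmed freely.
        return "[[" + RUN.sub(" ", target).strip() + pipe + "]]"

    return LINK.sub(fix, wikitext)
```

The label after the pipe is never touched, which matches the statement that only the "A" part is processed.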
Templates
It did damage to Template:Infobox City and probably others too. Apparently it changed a field name containing an underscore which was inside brackets. Bad field name = bad rendering. PAR 15:34, 23 September 2005 (UTC)
- Hmmm. Well it isn't scheduled to go through any more templates today. I'll think about how to prevent this and go over the list of templates it did today. -- Curps 15:37, 23 September 2005 (UTC)
- Perhaps the solution is to avoid doing underscores within [[ {{{ }}} ]]. I'll give it some thought and avoid templates in the meantime. -- Curps 01:41, 24 September 2005 (UTC)
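One way to read the proposed fix is to skip any link whose target contains a template parameter. The sketch below does that; the function name and the exact test ("{{" anywhere in the target) are assumptions for illustration, not the change that was actually made to the bot.

```python
import re

# The target of a link: everything after "[[" up to a pipe or bracket.
LINK_TARGET = re.compile(r"\[\[([^\[\]|]+)")


def underscores_to_spaces(wikitext):
    """Replace underscores with spaces in link targets, except where the
    target contains template syntax such as {{{field_name}}}."""

    def fix(match):
        target = match.group(1)
        if "{{" in target:
            # A parameter name like {{{image_map}}} must stay exactly as
            # written, or the template stops finding the field.
            return match.group(0)
        return "[[" + target.replace("_", " ")

    return LINK_TARGET.sub(fix, wikitext)
```

Text outside links, including bare {{{...}}} parameters, is never modified by this function.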
Non-user-visible edits
Your bot is also making a large number of edits that are composed only of non-user-visible edits. This costs Wikipedia (in bandwidth, database load, disk space to store revision history, etc.) and costs editors (in time to verify edits), all of which seems to be for no real benefit. Could you please modify the bot so that it does not make changes which are not user-visible (unless they have some non-visual impact on the integrity of Wikipedia) or at the very least only make such edits where it already has some user-visible work to do? Thanks. -Harmil 00:15, 24 September 2005 (UTC)
- That's by design... nearly all of the bot's edits are non-reader-visible. In general the only exceptions are where it fixes an error (like a missing semicolon on &sup2 for instance). However, the changes are very visible to editors, and articles become more readable (and therefore more editable) for editors. Editors are just as important a constituency as readers, in fact there's no clear distinction between the two. I do see a real benefit to this, and so do a number of others (see other comments on this talk page; also, the bot itself was requested at Wikipedia:Bot requests).
- The bot only does about one edit a minute, so the bandwidth and database load is minimal. There has been no guidance by any of the developers that disk space is close to becoming an issue, and if it was it would suffice to store very old revisions as compressed diffs.
- It would be inadvisable to combine multiple different types of modifications (Unicode changes along with some user-visible change) into the same edit because that would actually make verification and checking harder. The human eye can skim over the diffs much faster if it knows it only needs to look out for one type of change.
- Regarding the recent run which is mostly changing only underscores and extra spaces, that was a request from User:Beland (see above in this same page). When it completes tonight, the bot will resume mostly Unicode-oriented fixes to pages. -- Curps 00:57, 24 September 2005 (UTC)
unicode robot on the math pages?
Are you sure you want to be running the unicode robot on the math pages? It just blasted through Wess-Zumino-Witten model converting the math symbol &gamma; into a unicode character that I have no idea how to edit, primarily because I do not actually have a greek keyboard, and the greek alphabet does not appear in the javascript "insert menu" of the edit page. This may be fine for foreign words and names of various famous people, but is vaguely disconcerting when it starts converting mathematical formulas. linas 02:11, 24 September 2005 (UTC)
- There seems to be quite a lot of that sort of thing coming out of this bot. While I can agree that there's no need for some of the &#5555;&#6666;&#7777; type silliness that goes on on many pages, there's just no good reason to go prowling through pages full of math markup in a combination of TeX and HTML and replace just the HTML entities. The result is more—not less—confusing for the editor who knows the topic, and a real pain to go through and revert on a site-wide basis.
- Where, in general are these changes being discussed before the bots are let loose? I'd like to participate in such conversations before future edits take place. -Harmil 02:55, 24 September 2005 (UTC)
-
- I presume this is just a single user; I am guessing this was not debated at all.
-
- I fully support using the bot to convert to Unicode; in my opinion, the resulting wiki markup is much cleaner and easier to understand, and there should be no special knowledge or skill needed to edit such an article. — Knowledge Seeker দ 03:20, 24 September 2005 (UTC)
-
- Yes, but you don't appear to actually contribute to math pages. I'm somewhat disconcerted about the enforcement of a policy coming from an external source; or to put it more plainly, meddling by outsiders in the affairs of others. (I have kids, and we say "I don't swim in your toilet, don't pee in my pool.") linas 14:47, 24 September 2005 (UTC)
-
I'm not sure how the literal gamma can be disconcerting to you. After all it appears as a literal gamma in math formulas and in the Wess-Zumino-Witten model page itself (the page that readers see) it's been there all along. By the way, there's nothing stopping you from typing in &gamma; in your future edits if that's what you wish.
- I could revert too, if I wish. That's not the point.
For what it's worth, it is in fact possible to make Greek characters (or IPA symbols, or any custom characters of your choice) appear in the javascript insert menu. User:Func has a bit of code that you could add to your Monobook.js file, you could contact him if you wish. -- Curps 03:31, 24 September 2005 (UTC)
- Again, this misses the point. I'm not sure I quite like the idea, in fact, my gut impression is that it's a bad idea. There are 10,000 math articles being edited by many dozens of highly active editors; this doesn't affect just me, but a whole community. I'd be a lot more comfortable if this had been properly debated at Wikipedia talk:WikiProject Mathematics (and Wikipedia talk:WikiProject Physics) and a consensus/policy decision had been reached. linas 14:47, 24 September 2005 (UTC)
-
- I agree with Linas. In fact I'd go further: math should not be written in Unicode or even HTML, but in LaTeX. The only reason we don't do it now is that the resulting PNG images make pages slow to load and visually uneven. That's a technical issue that will eventually be resolved; when it is, we'll want to back-convert everything to LaTeX. That will be easier to do from HTML than from Unicode. Please keep this bot off the math pages! --Trovatore 15:45, 24 September 2005 (UTC)
- But the relationship between &gamma; and γ is strictly one to one: Why should it be easier to convert one style to LaTeX? --Pjacobi 16:18, 24 September 2005 (UTC)
- Is it? That I didn't know. Can you point me to a table between the &gamma;-style names and the Unicode numbers?
- A general problem with Unicode is that different Unicodes can look visually almost the same; it's hard to tell by (manually) examining the source. With the HTML names it's much clearer. --Trovatore 16:33, 24 September 2005 (UTC)
- http://www.w3.org/TR/REC-html40/sgml/entities.html --Pjacobi 17:00, 24 September 2005 (UTC)
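The same table is available programmatically. As a small sketch, Python's standard html.entities module exposes the HTML 4.01 entity names as two dictionaries, name2codepoint and codepoint2name; the two helper functions here are made up for illustration.

```python
from html.entities import codepoint2name, name2codepoint


def entity_to_char(name):
    """Look up an HTML 4.01 entity name, e.g. 'gamma', and return the character."""
    return chr(name2codepoint[name])


def char_to_entity(ch):
    """Return the &name; form for a character, or None if it has no HTML 4.01 name."""
    name = codepoint2name.get(ord(ch))
    return "&%s;" % name if name else None
```

So a literal γ can be mapped back to &gamma; mechanically, which bears on the question of whether one style is easier to convert than the other. Characters without a named entity in that table (many Eastern European letters, for example) only have numeric forms.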
I don't think we will ever convert all the formulas to TeX. Rather, they will stay the way they are now, and the server will convert both the TeX and the HTML to MathML when generating the html page from wikicode. Oleg Alexandrov 16:27, 24 September 2005 (UTC)
- It (LaTeX-ifying) would certainly be a painful job, but I think we ought to do it. The source carries semantic information that ought to be correct. This is why at the start I was against using HTML even provisionally; changed my mind about that when the servers started to slow down somewhere around the start of this month. --Trovatore 16:35, 24 September 2005 (UTC)
-
- So what are we waiting for with MathML? For Microsoft to ship IE with the math plugin enabled by default? I notice that MathML demo looks pretty decent, although not perfect, in Firefox. Also, I thought that a web browser can announce that it supports MathML to the web server; thus in principle, the server could spew MathML to those browsers that support it. The second issue is "who's gonna write the code to enable this"? The third issue is "can the WP servers handle the load"? The fourth issue is "what to do with non-MathML browsers" and I propose the current solution "status quo: article author gets to choose TeX or HTML". linas 17:16, 24 September 2005 (UTC)
-
- The biggest obstacle to Wikipedia serving MathML has nothing to do with MathML itself; it's everything else on the page. If the entire page is not valid XML, such as XHTML+MathML+SVG, a browser is entitled (required?) to choke — and Mozilla browsers do just that. Convince the developers to generate pages guaranteed to pass validation; please! My impression is they see no motivation to do so, balanced against a lot of painful work dodging quirks of old and/or stupid browsers. --KSmrqT 19:52, 24 September 2005 (UTC)
-
Speaking as a pesky "outsider" "meddling...in the affairs of others," I think what Curps' bot is doing is splendid. Converting to literal unicode characters does result in "wiki markup [that] is much cleaner and easier to understand," as Seeker says, and results in edit box text that is far more welcoming to average users. Not everyone knows off the top of their heads what the Unicodes stand for, and having to edit text where these are heavily used is discouraging. The change that Curps' bot is making is a small one—Unicode is already in wide use on WP, the bot is merely making it literal—but it's one that benefits new users enormously. I also cannot understand the objections (aside from Trovatore's; he objects to using unicode at all). The bot obviously does not change the way anything appears on the page, so there can't be a problem there. It changes how text appears in the edit box—by simplifying how it looks: if you can understand what is written in the article, you clearly should be able to understand what is written in the edit box. So where's the problem? One danger that I can see is unusual browsers (ie. text browsers like Lynx etc) mangling the text; but this apparently can be gotten around [3]. As long as technical problems like this can be resolved, I think what the bot is doing is good. Just an "external source's" 2c.—encephalonεγκέφαλον 19:46, 24 September 2005 (UTC)
- Well, not to Unicode per se. I object to WYSIWYGization when it removes useful semantic information from the source code. As Knuth (or was it Lamport?) says, WYSIWYG is really "what you see is all you get". --Trovatore 21:17, 24 September 2005 (UTC)
- Ah, ok. I'm not knowledgeable enough in this area to appreciate the difference, Trovatore. What semantic information is lost when γ replaces &gamma; in the edit box? Regards—encephalonεγκέφαλον 21:59, 24 September 2005 (UTC)
- Well, this is a fairly trivial case, but someone might not recognize that that particular squiggle is a gamma (doesn't look much like one, frankly, on my browser). He could look at the source and say "oh, it's gamma". More in general, the names of math symbols are suggestive while the renderings of their Unicode equivalents are sometimes unclear; future-proofing should include preserving the names. --Trovatore 22:21, 24 September 2005 (UTC)
I agree with linas: I'm highly opposed to the &<name>; → Unicode edits in math/physics articles (I can see the purpose in articles about, for instance, historical figures whose names aren't really expected to be edited that much, like here). Optimally, yes, everything should be in LaTeX, but failing that our responsibility is to accommodate maximum general editability. — Laura Scudder | Talk 19:57, 24 September 2005 (UTC)
But why is:
γ is a ''G''-valued
less editable than:
&gamma; is a ''G''-valued
In the de.wikipedia ISO-8859-1 => UTF-8 transition all HTML entities were replaced and nobody misses them. Of course, if you want to enter a new γ, you may also enter &gamma; - whatever works best for you. --Pjacobi 20:23, 24 September 2005 (UTC)
If this were a vote, my vote would be Wait for work-around
If the original
&gamma; is a ''G''-valued
had been replaced by the slightly clunkier, but more accurate
<math>\gamma</math> is a ''G''-valued
then the bot wouldn't touch it. However, we'd have to wait a few months after that was proposed in a math/physics style guide before it would be safe. Or we could manually partially revert all the math articles affected by this bot, and replace the stray &symbol_names by <math>\symbol_name</math>.-- Arthur Rubin 21:26, 24 September 2005 (UTC)
- Ok, I can see value here, because <math>\gamma</math> is different from (and more accurate than) &gamma;. But, like Pjacobi, I cannot see what information is lost when converting &gamma; to γ. If the goal is to get all math articles to express <math>\gamma</math> instead of &gamma;, that's worthy, eminently reasonable, and should be done irrespective of whether the present text uses literal Unicode or not. If one wished to create a bot to replace all instances of &gamma; with <math>\gamma</math>, is this more difficult when the source code contains γ instead of &gamma;? If so, that would be a good argument for keeping Curps' bot off the math pages. If not, I'm not sure what's the harm.—encephalonεγκέφαλον 22:18, 24 September 2005 (UTC)
-
- On Talk:Wess-Zumino-Witten model, I asked "How does using Unicode gammas interfere with editing?...", and User:Linas responded with "It makes the page harder to edit. ...", which I didn't feel was a very helpful answer. To me it would seem that something like
sin θ = 0.5
is far easier to edit than
sin &theta; = 0.5
since it is intuitive; you are using the actual symbols. Perhaps if Linas could explain why he feels that using the correct symbols instead of an escape sequence hampers editing, we could come to a mutually satisfactory solution. I post this here at Linas's request. — Knowledge Seeker দ 04:06, 25 September 2005 (UTC)
-
- Instead of comparing "γ" (Unicode U+003b3, entity &gamma;) with "<math>\gamma</math>" (which presently appears to bracket the exact same character with a span of class "texhtml"), proper use of Unicode would argue for "𝛾" (U+1d6fe), which is "MATHEMATICAL ITALIC SMALL GAMMA", because really we're writing mathematics, not Greek. Of course, that's useless with today's fonts and browsers; and the MathML recommendation itself does not use characters in that upper plane for this purpose (though it does so for, say, fraktur and script characters). Returning to reality, ease of editing is an honest concern, and only those who write a great deal of mathematics can appreciate what a pain it is to go through some point-and-click dance instead of typing "&gamma;". But that is not the end of the story, because we hope one day Real Soon Now to convert every formula from yucky wiki to uniform <math></math>, displaying beautiful HTML/MathML typesetting. Meanwhile, it's bad enough we have to live with (edit) both
''e''<sup> ''i'' &pi;</sup> = &minus;1 → e i π = −1
<math>e^{i\pi} = -1</math> → eiπ = − 1
- but to mix "&pi;", "\pi", and "π" is even more unappealing. --KSmrqT 15:21, 25 September 2005 (UTC)
What really confuses me is the fact that we're letting a bot make controversial edits. I can see a person making a preference edit like this, and while I would strongly argue against it, I would not be as concerned as when a bot essentially enforces one person's aesthetic by making a universal change.
Now, as for why this edit was a bad idea? Simply put, you now have an article that mixes and matches TeX-style \gamma with γ, which makes editing visually difficult. When it was a combination of HTML-style &gamma; and the TeX-style, it was much easier to read as the two are very similar. Ease of reading should be the most important aspect of a page, but since all of these forms are the same to the reader, I think the key thing is to make the page easy to edit, and it is easier to add and read characters that are the same as the math markup. I agree that if the math markup could be used everywhere, that would be ideal, but if it cannot, then mixing HTML-style markup with math markup is the closest that we can come. Or to put it another way: just because Unicode can do a thing, that does not mean that Unicode should do a thing. -Harmil 04:11, 25 September 2005 (UTC)
See m:blahtex. Apparently TeX is more visual than semantic, too? — Omegatron 06:18, 11 November 2005 (UTC)
[edit] Avoiding math pages
I have modified the bot to avoid editing pages that contain <math> for the time being. It may be necessary to revisit these pages later to at least convert any names of mathematicians, etc. -- Curps 16:20, 25 September 2005 (UTC)
- I appreciate that. I don't know if it's the right criterion, though; a lot of math pages don't have <math> because of the issues with PNG images. Would it be difficult to avoid pages that are in a heir category to Category:Mathematics? --Trovatore 18:31, 25 September 2005 (UTC)
-
- Perhaps, but sometimes there's category "spillover". For instance under category "Poland", you can have "Polish wars", under "Polish wars", you can have "World War II", and under World War II, you can have Japanese battleships. So Japanese battleships end up as a sub-sub-sub-sub-category of Poland (to pick an actual example).
-
- Also, I suspect that a lot of biographies of mathematicians would end up under sub-sub-categories of math, with no math formulas on them but lots of accented European letters. Also Computer Science is listed as a subcategory of Mathematics. So subcategories of Math is perhaps too broad. I think most serious mathematics pages will have at least one <math> tag on them, at least one formula or equation.
-
- -- Curps 00:06, 26 September 2005 (UTC)
-
-
- I would appreciate it if the bot simply left Greek letters alone. I have also been known to edit Greek history pages, and having the style of Greek alphabet change uncontrollably is another controversial editing decision. I don't mind myself, but some will. Septentrionalis 01:17, 26 September 2005 (UTC)
-
-
-
-
- For Greek history pages it's entirely appropriate for the bot to change to literal Greek letters. Any users who know Greek and can type on Greek keyboards are already entering literal Greek characters in all their edits now and in the future. Many Greek-related articles are already a mix of literal Greek (edits since June 28 2005 changeover) and & references. -- Curps 01:55, 26 September 2005 (UTC)
-
-
Curps, by the way, thanks for the attention. As you see, we have a pile of unresolved typographical issues.
How about this compromise: if the &greekletter; shows up with non-alpha characters on both sides (e.g. is surrounded by spaces, or by quotes) it is reasonable to assume it's math. If it is non-Greek, surrounded by strings consisting entirely of letters, e.g. Fr&circomf;chet, it's reasonable to assume it's a place name or person name. Mixed roman/Greek is a formula: T&gamma;. I suspect that 99% of the problem is the Greek letters, and not the dotted, tilde'd, circumflexed roman characters. So mixed roman/Greek are math formulas, and anything else is not. I think that would be a pretty safe heuristic. linas 00:55, 26 September 2005 (UTC)
- Isolated Greek characters could also occur in an article about Greek language or grammar, or in astronomy articles (for Bayer designation names of stars), or in chemistry. Or, in a Greek quotation (on a literature or poetry page), there might be a one-letter word (I don't know Greek, are there any?). So this would introduce a number of complications. Perhaps the subcategory idea might work if the entire subcategory tree is printed out in a file and then manually edited to trim subcategories that don't fit. I'll have to give it some thought. -- Curps 01:38, 26 September 2005 (UTC)
-
- There is a list of all math articles at List of mathematical topics; some but not all physics articles are in the list. linas 03:36, 26 September 2005 (UTC)
-
-
- It seems he has already created a List of mathematics categories, and has separated out the mathematician categories. I'll take a closer look tomorrow, but it looks promising. -- Curps 04:00, 26 September 2005 (UTC)
-
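The heuristic linas proposes in this thread could be sketched roughly as follows. This is only an illustration in Python, not the bot's actual code (which is not shown on this page): the function name `classify`, the verdict strings, and the decision to treat a Greek entity next to another Greek entity as part of a Greek word are all assumptions made for the sketch.

```python
import re

# Names of the Greek-letter character entity references (lowercase forms;
# the capitalised variants are matched by lowercasing the name first).
GREEK = {
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
    "sigma", "sigmaf", "tau", "upsilon", "phi", "chi", "psi", "omega",
}

ENTITY = re.compile(r"&([A-Za-z]+);")


def is_greek(match):
    """True if a regex match object holds a Greek-letter entity."""
    return match is not None and match.group(1).lower() in GREEK


def classify(text):
    """Return (entity, verdict) pairs; verdict is 'keep' or 'convert'."""
    verdicts = []
    for m in ENTITY.finditer(text):
        if not is_greek(m):
            # Accented roman letters etc.: assumed to be names, so convert.
            verdicts.append((m.group(0), "convert"))
            continue
        # A Greek entity touching another Greek entity is assumed to be
        # part of a Greek word rather than a formula.
        prev_entity = re.search(r"&([A-Za-z]+);$", text[:m.start()])
        next_entity = ENTITY.match(text, m.end())
        in_greek_word = is_greek(prev_entity) or is_greek(next_entity)
        verdicts.append((m.group(0), "convert" if in_greek_word else "keep"))
    return verdicts
```

As Curps points out in his reply, an isolated Greek letter is not always mathematics (Bayer designations, chemistry, grammar articles), so a rule like this would still misclassify some pages.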
[edit] TeX and HTML doesn't always correspond, for Greek letters and others
Regarding HTML character entity references vs. TeX, there is not always a one-to-one correspondence for Greek letters [4] [5]:
- &epsilon; (ε) → \varepsilon
- U+03F5 (no &name) → \epsilon
- &thetasym; (ϑ) → \vartheta
- &piv; (ϖ) → \varpi
- U+03F1 (no &name) → \varrho
- &sigmaf; (ς) → \varsigma
- &phi; (φ) → \varphi
- U+03D5 (no &name) → \phi
The same applies to a number of other code points.
- U+2200 ∀ → \forall
- U+2202 ∂ → \partial
- U+2203 ∃ → \exists
- U+2205 ∅ → \emptyset
- U+2207 ∇ → \nabla
- U+2208 ∈ → \in
- U+2209 ∉ → \not\in
- U+220B ∋ → \ni, \owns
- U+220F ∏ → \prod
- U+2211 ∑ → \sum
- U+2212 − → −
- U+2217 ∗ → \ast
- U+221A √ → \surd
- U+221D ∝ → \propto
- U+221E ∞ → \infty
- U+2220 ∠ → \angle
- U+2227 ∧ → \wedge, \land
- U+2228 ∨ → \vee, \lor
- U+2229 ∩ → \cap
- U+222A ∪ → \cup
- U+222B ∫ → \int
- U+2234 ∴ → ?
- U+223C ∼ → \sim
- U+2245 ≅ → \cong
- U+2248 ≈ → \approx
- U+224D (no &name) → \asymp
- U+2260 ≠ → \not=, \ne, \neq
- U+2261 ≡ → \equiv
- U+2264 ≤ → \leq, \le
- U+2265 ≥ → \geq, \ge
- U+2282 ⊂ → \subset
- U+2283 ⊃ → \supset
- U+2284 ⊄ → \not\subset
- U+2286 ⊆ → \subseteq
- U+2287 ⊇ → \supseteq
- U+2295 ⊕ → \oplus
- U+2297 ⊗ → \otimes
- U+22A5 ⊥ → \perp
- U+22C5 ⋅ → \cdot
- &larr; (←) → \leftarrow, \gets
- &rarr; (→) → \rightarrow, \to
- &uarr; (↑) → \uparrow
- &darr; (↓) → \downarrow
- &harr; (↔) → \leftrightarrow
- &crarr → \hookleftarrow
- &lArr; (⇐) → \Leftarrow
- &rArr; (⇒) → \Rightarrow
- &uArr; (⇑) → \Uparrow
- &dArr; (⇓) → \Downarrow
- &hArr; (⇔) → \Leftrightarrow
- &not; (¬) → \neg, \lnot
- &sect; (§) → \S
- &para; (¶) → \P
- &dagger; (†) → \dag, \dagger
- &Dagger; (‡) → \ddag, \ddagger
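The mismatch in the Greek-letter list above can be checked mechanically. The following is a minimal Python sketch, assuming the standard library's `html.entities.name2codepoint` table (which covers the HTML 4 entity names); the `TEX_BY_CODEPOINT` dictionary is hand-copied from the list and the `describe` helper is hypothetical.

```python
from html.entities import name2codepoint

# Hand-copied from the Greek-letter list above; not an exhaustive table.
TEX_BY_CODEPOINT = {
    0x03B5: r"\varepsilon",
    0x03F5: r"\epsilon",
    0x03D1: r"\vartheta",
    0x03D6: r"\varpi",
    0x03F1: r"\varrho",
    0x03C2: r"\varsigma",
    0x03C6: r"\varphi",
    0x03D5: r"\phi",
}

# Reverse lookup: code point -> HTML 4 entity name, where one exists.
NAME_BY_CODEPOINT = {cp: name for name, cp in name2codepoint.items()}


def describe(codepoint):
    """Format one row: code point, character, entity name (if any), TeX."""
    name = NAME_BY_CODEPOINT.get(codepoint)
    entity = f"&{name};" if name else "(no &name)"
    tex = TEX_BY_CODEPOINT[codepoint]
    return f"U+{codepoint:04X} {chr(codepoint)} {entity} -> {tex}"


if __name__ == "__main__":
    for cp in sorted(TEX_BY_CODEPOINT):
        print(describe(cp))
```

The point of the sketch is only that a bot converting in either direction needs an explicit table: the entity name and the TeX macro name cannot be derived from each other.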
[edit] Shah Jahan OK??
Please note the changes this bot made to Shah Jahan. The Japanese characters were removed and replaced by ???? Is this expected behavior? Thanks...--Nemonoman 00:34, 28 September 2005 (UTC)
- The Japanese interwiki link was changed to ja:シャー・ジャハーン , which is correct. If this shows up as ???? on your computer, it's a font issue. What browser and operating system are you using? -- Curps 02:34, 28 September 2005 (UTC)
[edit] Ampersands
Although ampersands are commonly found in URLs they still need to be HTML encoded in web pages. Even Google gets this wrong. See also Ampersands (&'s) in URLs
- The bot doesn't change any of the ASCII character entity references (namely, &quot; &amp; &lt; &gt;) or ASCII numeric character references (except for &#32; = &#x20; = SPACE). -- Curps 21:16, 5 October 2005 (UTC)
-
- I think he was trying to point out that links of the form "http://xxxxx.yyy?aaaa=....&bbbb=", mentioned in the main page as causing problems, are actually invalid if found in HTML. The ampersand needs encoding: "http://xxxxx.yyy?aaaa=....&amp;bbbb=". Maybe the bot could help clean those up. -- KJBracey 09:48, 19 October 2005 (UTC)
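For illustration, the encoding KJBracey describes is what Python's standard `html.escape` does when a URL is placed into an attribute value. The helper name `href_attribute` and the example.org URL in the usage note are made up for this sketch; this is not something the bot is known to do.

```python
from html import escape


def href_attribute(url):
    """Build an href attribute with the URL escaped for use in HTML."""
    # escape() turns & into &amp; (and, with quote=True, " into &quot;),
    # so a query string such as ?a=1&b=2 becomes ?a=1&amp;b=2.
    return 'href="' + escape(url, quote=True) + '"'
```

Called as `href_attribute("http://example.org/?a=1&b=2")`, the helper produces an attribute whose ampersand is written as `&amp;`; a browser decodes it back to a plain `&` before following the link.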
[edit] Unicode private use area mapping
Hi, just a note to let you know that I found a page describing a private use area mapping which has a correct mapping for the offending character you found recently on Cædmon: http://www.tabligan.net/Linux%20Downloads/groff-1.19.1/font/devlj4/generate/symbol.map . If this is a common mapping, it might be useful the next time you encounter something like this. – gpvos (talk) 23:26, 5 November 2005 (UTC)
- Thanks, that solves the mystery, it's very useful information. Fortunately, though, private-use characters within Wikipedia are pretty rare. -- Curps 01:13, 8 November 2005 (UTC)
[edit] Short-changing the editors
unless they go to the trouble of memorizing and typing in the &# code for each such character, which is extremely unlikely
This is patently silly. Anyone on Wikipedia with any respectable Internet or Web experience understands HTML entities, and knows many of them either by heart or with only minor trouble, because they have meaningful and mnemonic names, and can be entered with standard keyboard in standard mode.
It is much easier to remember &eacute; rather than remembering whatever key-mashing or Alt+0xxx code will produce the character é. Furthermore, the proper sequence will differ from system to system, e.g. in Windows it is Alt-0233; in Linux (X) it is Alt+i, and on Mac I have no idea. But &eacute; works everywhere, and it is easy to remember. &[letter][diacritic name]; always makes sense, as do entities such as &copy; = ©, &cent; = ¢, or &thorn; = þ. - Keith D. Tyler ¶ 20:39, 8 November 2005 (UTC)
- You don't have to change the way you edit. When you make future edits, there's nothing stopping you from typing out &eacute; instead of é if you prefer. It will work as before and the bot will presumably convert those in due time in some future pass. But why would you insist that it should remain as "&eacute;" for all time? The character entity references or numeric character references are far less convenient for the reader, and in some cases, like "&#321;&oacute;d&#378;" (Łódź), the former version was simply unreadable. Or consider articles which give the lyrics to national anthems in Greek or Russian... in the former version, a user fluent in the language in question who spotted a typo would be confronted with a sea of &# and digits, and in many cases would simply have given up. -- Curps 21:12, 8 November 2005 (UTC)
-
- It is also reasonable to assume that editors who speak languages other than English will have more reasonable keyboard configurations or input methods. -- Beland 03:23, 11 November 2005 (UTC)
[edit] Software changes
It seems like this functionality is already built into the software. Can't we just ask for a button to display the source code in either unicode character format or html unicode entity format as needed? — Omegatron 06:11, 11 November 2005 (UTC)
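The two views Omegatron asks for can be approximated in a few lines of Python. This is a sketch of the idea only, not MediaWiki's implementation; the function names are invented here, and it relies on the standard `xmlcharrefreplace` error handler and `html.unescape`.

```python
import html


def to_entity_view(text):
    """Show every non-ASCII character as a decimal numeric reference."""
    # The 'xmlcharrefreplace' handler substitutes &#NNN; for anything
    # that cannot be encoded as ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_unicode_view(text):
    """Show character references as literal Unicode characters."""
    # Caveat: html.unescape also decodes &amp;, &lt; and friends, which a
    # real wikitext tool would have to leave untouched.
    return html.unescape(text)
```

Because the text is Unicode in both views, switching between them loses nothing as long as the ASCII references are handled separately.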
[edit] Stop the Bot!
Your bot is changing &lsquo to '. We have had discussions about this and ' is not the same character and therefore an incorrect substitute. This particular "glottal stop" consonant is somewhat in dispute because the correct rendering is actually a Unicode character that does not render on common browsers. Please remove this particular change until the correct rendering can be established. Your other changes look ok to me - Marshman 18:58, 11 November 2005 (UTC)
- You're mistaken, the bot changes &lsquo to the literal Unicode character U+2018 LEFT SINGLE QUOTATION MARK, not to ' which is the apostrophe (U+0027). It may be that your browser is rendering that character as an apostrophe. -- Curps 02:25, 9 December 2005 (UTC)
- This is exactly the problem with doing these mass-changes. We are replacing an easily-identifiable, easily-editable source with one where it's hard to identify the characters, for example all these quotes and dashes. We shouldn't be typesetting in source. Please desist making further mass-changes like this. Thanks! Demi T/C 23:18, 13 December 2005 (UTC)
-
-
- It's also doing things like changing &mdash; and &ndash; and ... those are pretty much impossible to distinguish in a text edit box if they're the literal UTF-8 character rather than the entity. Please at least set it to change these back - David Gerard 00:13, 14 December 2005 (UTC)
-
-
-
-
- The default behavior is not to change mdash and ndash, and will not do so unless a flag is set. It did change those in its very earliest edits, but not since then. -- Curps 22:07, 14 December 2005 (UTC)
-
-
[edit] I don't get it
I am none too fluent in technical computer things, so this is rather a basic question: why are these changes necessary? I read the bot's user page (well, some of it), but the answer to this question is still not clear to me. Perhaps a brief lay-person explanation could be added to the bot's user page? Cheers, Qirex 13:30, 9 December 2005 (UTC)
- For readers, the bot makes no visible difference. But for editors, it improves legibility and therefore editability. Consider the Soviet national anthem differences: [14]: it's much easier for editors to work with the version on the right than the version on the left. -- Curps 21:55, 12 December 2005 (UTC)
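As a rough illustration of the conversion being discussed, here is a Python sketch that follows the rules Curps states on this page: ASCII references are left alone except for the space, and &mdash;/&ndash; are not converted by default. The bot's real source is not shown here, so the function name, the regular expression and the `KEEP_AS_REFERENCE` set are assumptions.

```python
import re
from html.entities import name2codepoint

# Decimal reference, hex reference, or named entity.
REFERENCE = re.compile(
    r"&(?:#(\d+)|#[xX]([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));"
)

# Assumption based on the thread: dashes stay as entities by default.
KEEP_AS_REFERENCE = {0x2013, 0x2014}


def unicodify(text):
    """Replace non-ASCII character references with literal characters."""
    def replace(match):
        decimal, hexadecimal, name = match.groups()
        if decimal:
            cp = int(decimal)
        elif hexadecimal:
            cp = int(hexadecimal, 16)
        else:
            cp = name2codepoint.get(name)
            if cp is None:
                return match.group(0)  # unknown name: leave untouched
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)      # not a valid character
        if cp < 0x80 and cp != 0x20:
            return match.group(0)      # &amp; &lt; &gt; &quot; etc. stay
        if cp in KEEP_AS_REFERENCE:
            return match.group(0)
        return chr(cp)
    return REFERENCE.sub(replace, text)
```

The rendered page is identical before and after; only the text in the edit box changes, which is the point Curps makes above.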
[edit] Good bot
Wow, this bot sure has taken a lot of flak :) I'd just like to say that it shows up on my watchlist every now and then and I always appreciate its efforts. Sometimes, for example, I lazily type in &oelig; and it helpfully converts it to the literal unicode value. - Haukur Þorgeirsson 16:41, 10 December 2005 (UTC)
- Yes, I absolutely agree. My watchlist is filled with China-related topics, and this bot has been incredibly helpful. Never again will I need to wrestle with unreadable strings of numbers. Thanks for the great work! =D -- ran (talk) 04:51, 15 December 2005 (UTC)
- Agreed. A change to the software could make this bot no longer necessary, though. Currently the software presents two different versions of the source. If you have a decent browser, it presents the source in UTF-8. If you have an old browser that doesn't handle UTF-8 very well, it converts all the Unicode characters into HTML entities and presents you with a plain text version of the source.
- I propose a blanket conversion of all special characters to UTF-8, everywhere, and then provide an option in the editing interface to see either the Unicode or plain text versions of the source during editing. Then you can see the unicode version if you're editing something that just requires you to read through and change content, and you can see the explicit entities if you're editing something that requires knowledge of the actual Unicode character values or if you want to copy and paste into a non-Unicode text editor, spell checker, or the like.
- See Bug 4012 — Omegatron 16:41, 15 December 2005 (UTC)
-
- Converting all special characters and providing only a blanket on/off is probably a bad idea. Some Unicode characters are going to be a huge pita to work with in the edit box (text direction markers, proper dashes, non-breaking spaces, situations with a complex mix of ltr and rtl characters, etc.), whilst editing the whole page in entities isn't such a nice idea either.
-
- Btw the feature that is there now for conversion is a quick and disgusting hack to stop ie mac users causing MAJOR problems ;). Plugwash 17:06, 15 December 2005 (UTC)
-
-
- Yeeeaaaaaaaaahh, well. Something similar, then. :-)
- There are lots of instances where I would like to see the wikicode in plain unicode characters, like for plain editing of articles. I want to be able to read it as easily as possible, without mucked up numbers and ampersands everywhere, and edit the content of the article, not the technical details.
- There are also plenty of instances where I would like to see exactly which character I'm working with, like if I'm converting "word – word" into "word — word" or "10 μF" into "10 µF", and cases where I want to copy the text into a text editor for making repetitive edits or something. — Omegatron 17:31, 15 December 2005 (UTC)
-
-
-
-
- If you need to know exactly which character you're dealing with and manipulate "pita" characters you should copy-and-paste the wikitext into a real Unicode text editor like BabelPad and then copy-paste back when you're done. There's really no substitute for dedicated Unicode software. DopefishJustin 04:06, 25 January 2006 (UTC)
-
-
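For the case raised in this thread, where look-alike characters need to be told apart, a few lines of Python with the standard `unicodedata` module can list exactly which characters a piece of wikitext contains. The helper name `identify` is made up for this sketch.

```python
import unicodedata


def identify(text):
    """List code point and Unicode name for each non-ASCII character."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
        if ord(ch) > 0x7F
    ]


if __name__ == "__main__":
    # Escapes are used so the sample is unambiguous in any editor.
    for line in identify("word \u2013 word \u2014 10 \u03bcF 10 \u00b5F"):
        print(line)
```

This separates the en dash from the em dash and the Greek letter mu from the micro sign, which look nearly identical in most edit-box fonts.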
Could you please unicodify Serbian Orthodox Church? Nikola 11:03, 15 December 2005 (UTC)
[edit] Can we request unicodification?
If so I'd like the bot to work on Asia-related articles next. Articles like Hanja need unicodification. Thanks. -- Миборовский U|T|C|E|Chugoku Banzai! 00:07, 24 December 2005 (UTC)
- Could you also have a pass over Scrabble letter distributions please? There's some weird BiDi effects going on with the Arabic section, where the brackets appear to be unpaired but clearly are correct in the edit view. You seem to know what you're doing with Unicode so I wondered if you could have a look at this annoying complication. Soo 11:54, 31 March 2006 (UTC)
- Putting a pair of Unicode LRMs around each Arabic letter did the trick. This sort of problem happens when people surround samples of right-to-left characters with characters that have neutral directionality (e.g. brackets). Plugwash 15:13, 3 April 2006 (UTC)
- Is unicodification a word?--205.188.116.139 13:11, 15 June 2006 (UTC)
- I have never seen either unicodify or unicodification outside the context of this bot and clones of its functionality (e.g. awb). They are also misleading terms: the text was Unicode the whole time, it's just being converted from Unicode encoded as numeric HTML entities to Unicode encoded as a UTF-8 bytestream. Plugwash 21:06, 15 June 2006 (UTC)
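The fix Plugwash describes for the Scrabble page can be sketched in Python as follows. This is an assumption about how one might automate it, not a record of how the edit was actually made; the function name is invented, and the test for "right-to-left letter" uses the bidirectional classes R and AL from the standard `unicodedata` module.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def wrap_rtl_letters(text):
    """Put an LRM on each side of every right-to-left character."""
    pieces = []
    for ch in text:
        # 'R' covers Hebrew-style letters, 'AL' covers Arabic letters.
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            pieces.append(LRM + ch + LRM)
        else:
            pieces.append(ch)
    return "".join(pieces)
```

With the marks in place, a neutral character such as a bracket sits between two left-to-right contexts, so the renderer no longer attaches it to the Arabic letter and reverses it.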
[edit] Source code?
Can you upload the source code to Wikipedia (or point to it online)? It would be useful for authors developing similar tools (to avoid rediscovering all the problems you solved) and it would add to the confidence that it was working well. Gdr 16:58, 6 January 2006 (UTC)
[edit] New seed list available
...on Wikipedia:Bad links/encoded1, if you are interested. Thanks! -- Beland 03:47, 3 April 2006 (UTC)