R and foreign characters


Working with Russian characters can be mind-numbingly frustrating. This is true for R, as for other applications, so below I've written out my top five tricks for making Russian inputs work in R; I believe they should be transferable to most other languages.



Having forced any number of programs to accept Russian characters in the past, I have come to appreciate UTF-8 as the only sensible encoding system for non-Latin scripts. R operates with UTF-8 as default, so using Russian or other foreign scripts should be straightforward, right?
Wrong. There is no end to the annoyance experienced when attempting to import data into R by appending
encoding = "utf-8"
to the end of every line. Sometimes it will work, but rarely both for the characters displayed on screen and for those output by R. So, annoyingly, characters that display as Russian in a data.frame will magically appear as gobbledygook when written to an output file, or even a plot. Infuriating. The solution is brutal in its simplicity: don't rely on R's UTF-8 to display characters for you. Instead, start sessions in the appropriate language, using the line
Sys.setlocale("LC_CTYPE", "russian")
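The locale name depends on the operating system: "russian" is what works for me on Windows, while "ru_RU.UTF-8" is reported in the comments below to work on a Mac. Here is a minimal sketch that tries a list of candidate names and keeps the first one the system accepts. The helper name set_ctype is my own invention for illustration, not part of R.

```r
# Try each candidate locale name in turn and keep the first one the OS accepts.
# Sys.setlocale() returns "" (with a warning) when a name is not recognised.
set_ctype <- function(candidates) {
  for (loc in candidates) {
    result <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
    if (nzchar(result)) {
      return(invisible(result))
    }
  }
  warning("none of the candidate locales could be set: ",
          paste(candidates, collapse = ", "))
  invisible("")
}

old_ctype <- Sys.getlocale("LC_CTYPE")    # remember the starting locale
set_ctype(c("russian", "ru_RU.UTF-8"))    # Windows name first, then Unix-style
# ... work with Russian text here ...
Sys.setlocale("LC_CTYPE", old_ctype)      # restore the original when done
```

If neither name is available on your machine, the helper only warns and leaves the locale untouched, so it should be safe to put at the top of a script.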
Now that solves all the problems, right?
Almost. Often when scraping data or when inputting data (e.g. through Shiny apps), strings need to be formatted as UTF-8 as follows:
> Encoding(annoyingMisbehavingString) <- "UTF-8"
Be careful with this one, though. Encoding text that already is utf-8 as utf-8 will not work well.
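One way to be careful is to mark only those strings that R does not yet have an encoding for and that are in fact valid UTF-8. The following is a sketch under that assumption; mark_utf8 is a hypothetical helper name, while Encoding() and validUTF8() are base R functions.

```r
# The UTF-8 bytes of a Cyrillic word, built from raw bytes so that the
# example does not depend on how this script itself was saved.
word <- rawToChar(as.raw(c(0xd0, 0xba, 0xd0, 0xb0, 0xd1, 0x82,
                           0xd1, 0x8b, 0xd0, 0xbd, 0xd1, 0x8c)))
Encoding(word)   # "unknown": R has not been told what these bytes are

# Mark as UTF-8 only the elements that are unmarked AND valid UTF-8;
# anything already marked (or not valid UTF-8) is left alone.
mark_utf8 <- function(x) {
  todo <- which(Encoding(x) == "unknown" & validUTF8(x))
  Encoding(x[todo]) <- "UTF-8"
  x
}

word <- mark_utf8(word)
Encoding(word)   # "UTF-8"
```

Calling mark_utf8() a second time on the same vector changes nothing, which is the point of the guard.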
Finally, if you ever want to save .R scripts with non-Latin characters in them, do so with care. When you reopen the files, the strings will be scrambled, for some reason not quite clear to me. If you use the script as a source file, any command reliant on the non-Latin string (e.g. grep) will return errors or no hits. The solution is to use a different function altogether:
eval(parse("iPolarCalc.R", encoding = "UTF-8"))
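To see the effect without touching your own scripts, here is a small self-contained sketch. It writes a one-line script to a temporary file (a stand-in for a real script such as the one named above), reads it back with the same eval(parse()) call, and checks that a pattern match on the Cyrillic string works. The variable names are made up for the example.

```r
# Write a one-line script containing a Cyrillic string to a temporary file,
# forcing the bytes on disk to be UTF-8 regardless of the current locale.
script_path <- tempfile(fileext = ".R")
script_line <- enc2utf8("word <- '\u043a\u0430\u0442\u044b\u043d\u044c'")
writeLines(script_line, script_path, useBytes = TRUE)

# Read it back as suggested above: parse as UTF-8, then evaluate.
eval(parse(script_path, encoding = "UTF-8"))

# A fixed-string match against the non-Latin string should now find its target.
grepl("\u0442", word, fixed = TRUE)

unlink(script_path)   # clean up the temporary file
```

Writing the Cyrillic letters as \u escapes keeps the example itself pure ASCII, so it survives being saved and reopened in any encoding.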
And that's about it. For Windows systems at least.

======
Update: 06/02/2013
Except encoding issues never really end. Enter the latest problem: displaying Cyrillic characters with knitr.

Knitr is great. It will take R code and combine it with markdown, allowing you to create ready-formatted webpages with calculations and graphics created on the fly from R. But it doesn't work properly with non-ASCII characters. The solution: don't use RStudio's built-in knit to HTML (Ctrl+Shift+H). Instead save the .Rmd file in your working directory, and run these lines:
library(knitr)     # provides knit()
library(markdown)  # provides markdownToHTML()
knit("test.rmd", encoding = "utf-8")
markdownToHTML("test.md", "test.html")
browseURL(paste("file://", file.path(getwd(), "test.html"), sep = ""))

=====
Update 21/11/2013

Here's my latest discovery: you know when you have foreign characters in a URL? Chances are you didn't notice, because most browsers can handle this. Paste this into your browser, and you will get search results for the Katyn massacre:
https://www.google.co.uk/search?q=катынь

However, this is all smoke and mirrors: copy the address back out of the browser and paste it into Notepad, and you will see this:
https://www.google.co.uk/search?q=%D0%BA%D0%B0%D1%82%D1%8B%D0%BD%D1%8C

What does this have to do with R? Well, we need some way to convert the former to the latter if we want to access URLs with foreign characters in them. To do that, use curlEscape() from the RCurl package:

> curlEscape("катынь")
[1] "%D0%BA%D0%B0%D1%82%D1%8B%D0%BD%D1%8C"
Perfect.
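If you would rather not install RCurl, base R's URLencode() (in the utils package) appears to do the same job, provided the string is UTF-8 first. This is a sketch rather than something I have tested widely; the variable names are mine, and the search URL is the one from above.

```r
# The search term, written with \u escapes so the example does not depend
# on the encoding this script is saved in.
query <- enc2utf8("\u043a\u0430\u0442\u044b\u043d\u044c")

# reserved = TRUE also escapes characters such as "&" and "=", which is
# what you want for a single query value (similar in spirit to curlEscape()).
escaped <- URLencode(query, reserved = TRUE)

# Build the full URL from the base address and the escaped term.
search_url <- paste0("https://www.google.co.uk/search?q=", escaped)
print(search_url)
```

The escaped term should match the curlEscape() output shown above, since both percent-encode the UTF-8 bytes of the string.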

30 comments:

  1. I have an SPSS file in Russian encoding, apparently it's 1251, and I can't read it either in R or in SPSS 21.


    Sys.setlocale("LC_CTYPE", "russian") doesn't work on my Mac machine for some reason. Is there any other way of solving this issue? Or, perhaps, there is something that I'm not doing right?

    Replies
    1. Hi Valery, the short answer is I don't know, because I don't know how to use a mac. But this post seems to have something that may be of interest:
      http://stackoverflow.com/questions/17031002/get-weekdays-in-english-in-rstudio
      I would guess you are looking for "ru_RU.UTF-8". Best, R

    2. Sys.setlocale("LC_CTYPE", "ru_RU.UTF-8")

      worked like a charm!

    3. Hi Rolf, I have to deal with Vietnamese data and I would like to set locale in R to be "en_US.UTF-8" but it doesn't work.
      My code is: Sys.setlocale(category="LC_ALL", locale = "en_US.UTF-8")
      However, the warning message in console is:
      In Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8") :
      OS reports request to set locale to "en_US.UTF-8" cannot be honored
      And the locale did not change to "en_US.UTF-8"
      I have tried several ways but nothing worked. Could you help me to set my locale to be "en_US.UTF-8"?

  2. data_heatmap can't handle Russian letters as well, see https://github.com/yihui/knitr/issues/436#issuecomment-32781891
    Oh, it's bad, feel like back in 1999. Windows did not want to change my world after seeing Sys.setlocale("LC_CTYPE", "russian")

  3. Hi Rolf! Thank you very much for your useful advice on dealing with Russian characters in R. You saved me a lot of time on this matter. But one never knows what to expect from text in Cyrillic.

  4. Thanks a lot! Solved all my problems)

  7. Rather old, and spammed thread, but I wonder about "Encoding text that already is utf-8 as utf-8 will not work well."

    I have UTF-8 encoded text which is not recognised as such; Encoding(var) gives me "unknown". Encoding(var) <- "UTF-8" does work, and the text is displayed as intended. (On a 1252 locale, that is.)
