Weird Foreign Character Behavior in R / RStudio

I have a CSV with Czech characters in it that looks like this:
id,address,city
660999,Vršovická 10,Praha
676838,Valentova 50,Praha 4
676858,Husova 6740,Pardubice
677971,Lipová 10,Třebíč
678304,Jana Ziky 10/1955,Ostrava
...
When I import it into RStudio, everything looks fine if I view it using the View() function.
But when I print the values in the console, they come out garbled:
xl = read.csv("some_csv.csv")
head(xl)
id address city
1 660999 Vršovická 10 Praha
2 676838 Valentova 50 Praha 4
3 676858 Husova 6740 Pardubice
4 677971 Lipová 10 TÅ™ebíÄ
5 678304 Jana Ziky 10/1955 Ostrava
When I check the encoding with Encoding(xl[1,2]) for example it says "unknown".
I also have Russian data with the same exact problem.
I've tried switching to Sys.setlocale("LC_CTYPE", "czech") and Sys.setlocale("LC_CTYPE", "russian") and importing under those settings, but the behavior is the same.
I'm using RStudio version 0.98.501 with R version 3.0.2 on Windows 7. A colleague on a separate computer is having the same problem.
Is there anything I can do to make these characters display correctly in the terminal?
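For what it's worth, the garbled output is the classic signature of UTF-8 bytes being rendered through the Windows-1252 code page. A small Python sketch (purely to illustrate the mechanism; this is not the original R code) reproduces the same style of mangling seen above:

```python
# "Třebíč" stored as UTF-8 bytes, then shown by a console that assumes cp1252.
text = "Třebíč"
raw = text.encode("utf-8")                       # the bytes actually in the CSV
mangled = raw.decode("cp1252", errors="ignore")  # a few bytes (e.g. 0x8D) have
                                                 # no cp1252 mapping and are dropped
print(mangled)  # "ř" comes out as "Å™", just like in "TÅ™ebíÄ" above
```

The same two-byte-to-two-character expansion explains the Russian data as well: every non-ASCII character becomes two or three Latin-1 characters.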

Related

Download csv file from Nasdaq site

I have been trying for some time to simply download a CSV file from the web in RStudio (so I am using R). I have done this kind of thing before and never ran into the issues I am having now, and I have tried several solutions suggested online. I am simply trying to download the following file from https://www.nasdaq.com/market-activity/stocks/aapl/historical. This is the direct link to the CSV: https://www.nasdaq.com/api/v1/historical/AAPL/stocks/2020-09-09/2020-10-09. I tried the httr and RCurl package methods; none worked. Hope someone can help me out, thanks in advance.
Edit: What I have tried so far:
-Updated SSL
-I can install new packages, so the internet connection generally works
-Updated git
-Updated R
-Updated all packages
-Tried what is suggested here: Download.file fails in RStudio
Edit 2: A while ago I did some web scraping with the RSelenium package and started a remote session. Is it possible that RSelenium changed some underlying settings?
Edit 3: I completely uninstalled RStudio + R + Rtools and manually deleted everything; after a fresh installation, still the same problem.
Have you tried loading it directly in R with read.csv()?
It worked for me... check it out:
data = read.csv('https://www.nasdaq.com/api/v1/historical/AAPL/stocks/2020-09-09/2020-10-09')
data
Here is the output:
Date Close.Last Volume Open High Low
1 10/07/2020 $115.08 96848990 $114.62 $115.55 $114.13
2 10/06/2020 $113.16 161498200 $115.7 $116.12 $112.25
3 10/05/2020 $116.5 106243800 $113.91 $116.65 $113.55
4 10/02/2020 $113.02 144712000 $112.89 $115.37 $112.22
5 10/01/2020 $116.79 116120400 $117.64 $117.72 $115.83
6 09/30/2020 $115.81 142675200 $113.79 $117.26 $113.62
7 09/29/2020 $114.09 100060500 $114.55 $115.31 $113.57
8 09/28/2020 $114.96 137672400 $115.01 $115.32 $112.78
9 09/25/2020 $112.28 149981400 $108.43 $112.44 $107.67
10 09/24/2020 $108.22 167743300 $105.17 $110.25 $105
11 09/23/2020 $107.12 150718700 $111.62 $112.11 $106.77
12 09/22/2020 $111.81 183055400 $112.68 $112.86 $109.16
13 09/21/2020 $110.08 195713800 $104.54 $110.19 $103.1
14 09/18/2020 $106.84 287104900 $110.4 $110.88 $106.09
15 09/17/2020 $110.34 178011000 $109.72 $112.2 $108.71
16 09/16/2020 $112.13 155026700 $115.23 $116 $112.04
17 09/15/2020 $115.54 184642000 $118.33 $118.829 $113.61
18 09/14/2020 $115.355 140150100 $114.72 $115.93 $112.8
19 09/11/2020 $112 180860300 $114.57 $115.23 $110
20 09/10/2020 $113.49 182274400 $120.36 $120.5 $112.5
21 09/09/2020 $117.32 176940500 $117.26 $119.14 $115.26
If you want to save it afterwards as a CSV file, just run write.csv():
write.csv(data, '~/Downloads/data.csv')
Let me know if it worked for you.
If you are on Windows, you also need to set the mode = "wb" argument so download.file() writes the file in binary mode:
download.file(
"https://www.nasdaq.com/api/v1/historical/AAPL/stocks/2020-09-09/2020-10-09",
"file.csv", mode='wb'
)

Opening JSON files in R

I have downloaded some data from the following site as a zip file and extracted it onto my computer. Now I'm having trouble opening the included JSON data files.
Running the following code:
install.packages("rjson")
library("rjson")
comp <- fromJSON("statsbomb/data/competitions")
gave this error:
Error in fromJSON("statsbomb/data/competitions") : unexpected character 's'
Also, is there a way to load all files at once instead of writing individual statements each time?
Here is what I did (Unix system):
Clone the GitHub repo (note its location):
git clone https://github.com/statsbomb/open-data.git
Set the working directory (the directory to which you cloned the repo or extracted the zip file):
setwd("path to directory where you cloned the repo")
Read the data:
jsonlite::fromJSON("competitions.json")
With rjson: rjson::fromJSON(file = "competitions.json")
To read all the files at once, move all the .json files into a single directory and use lapply/assign to assign the objects to your environment.
Result(single file):
competition_id season_id country_name
1 37 4 England
2 43 3 International
3 49 3 United States of America
4 72 30 International
competition_name season_name match_updated
1 FA Women's Super League 2018/2019 2019-06-05T22:43:14.514
2 FIFA World Cup 2018 2019-05-14T08:23:15.306297
3 NWSL 2018 2019-05-17T00:35:34.979298
4 Women's World Cup 2019 2019-06-21T16:45:45.211614
match_available
1 2019-06-05T22:43:14.514
2 2019-05-14T08:23:15.306297
3 2019-05-14T08:02:00.567719
4 2019-06-21T16:45:45.211614
The function fromJSON takes a JSON string as its first argument unless you specify that you are giving it a file (fromJSON(file = "competitions.json")).
The error you mention comes from the function trying to parse 'statsbomb/data/competitions' as a JSON string rather than as a file name. In JSON, everything is enclosed in brackets and strings are inside quotation marks, so the 's' from "statsbomb" is not a valid first character.
To read all json files you could do:
lapply(dir("open-data-master/", pattern = "*.json", recursive = TRUE), function(x) {
  assign(gsub("/", "_", x), fromJSON(file = paste0("open-data-master/", x)), envir = .GlobalEnv)
})
However, this will take a long time to complete! You should probably elaborate on this a little, e.g. split the list of files obtained with dir() into chunks of 50 before running the lapply call.
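The "chunks of 50" idea is language-agnostic; a minimal sketch of the chunking step (shown here in Python, with a hypothetical file list) would be:

```python
def chunk(xs, size=50):
    """Split a list into consecutive chunks of at most `size` items."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]

files = [f"events/{i}.json" for i in range(120)]  # hypothetical file names
batches = chunk(files)
print([len(b) for b in batches])  # [50, 50, 20]
```

Each batch can then be loaded and processed before moving on to the next, which keeps memory use bounded.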

Printing UTF-8 characters in R, Rmd, knitr, bookdown

UPDATE (April 2018):
The problem still persists, under different settings and on different computers.
I believe it applies to all Unicode/UTF-8 characters.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
PROBLEM:
My Rmd/R file is saved with UTF-8 encoding. Other sessionInfo() details:
Platform: x86_64-w64-mingw32/x64 (64-bit)
LC_CTYPE=English_Canada.1252
other attached packages:
[1] knitr_1.17
Here is a simple data frame that I need to print as a table in a html document, e.g. with kable(dt) or any other way.
dt <- data.frame(
  name = c("Борис Немцов", "Martin Luter King"),
  year = c("2015", "1968")
)
Neither of the following works:
Way 1
If I keep Sys.setlocale() as is (i.e. "English_Canada.1252"), then I get this:
> dt;
name year
1 <U+0411><U+043E><U+0440><U+0438><U+0441> <U+041D><U+0435><U+043C><U+0446><U+043E><U+0432> 2015
2 Martin Luter King 1968
> kable(dt)
|name |year |
|:-----------------------------------------------------------------------------------------|:----|
|<U+0411><U+043E><U+0440><U+0438><U+0441> <U+041D><U+0435><U+043C><U+0446><U+043E><U+0432> |2015 |
|Martin Luter King |1968 |
Note that <U+....> are printed instead of characters.
Using dt$name <- enc2utf8(as.character(dt$name)) did not help.
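As a sanity check (in Python, just to illustrate; not part of the original R session), the <U+....> escapes in the output above are exactly the Unicode code points of the Cyrillic name, which R prints in escaped form when the current locale cannot represent them:

```python
# Compute the code points R falls back to when the locale can't render them.
name = "Борис Немцов"
escapes = [f"<U+{ord(c):04X}>" for c in name if c != " "]
print(escapes)
# ['<U+0411>', '<U+043E>', '<U+0440>', '<U+0438>', '<U+0441>',
#  '<U+041D>', '<U+0435>', '<U+043C>', '<U+0446>', '<U+043E>', '<U+0432>']
```

These match the escapes printed by dt above character for character, so the string itself is stored correctly; only the rendering path is broken.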
Way 2
If I change to Sys.setlocale("LC_CTYPE", "russian") (i.e. "Russian_Russia.1251"),
then I get this:
> dt;
name year
1 Áîðèñ Íåìöîâ 2015
2 Martin Luter King 1968
> kable(dt)
|name |year |
|:-----------------|:----|
|Áîðèñ Íåìöîâ |2015 |
|Martin Luter King |1968 |
Note that characters have become gibberish.
Using print(dt,encoding="windows-1251"); print(dt,encoding="UTF-8") had no effect.
Any advice?
The closest I could find to address this problem are in the following links, but they did not help: http://blog.rolffredheim.com/2013/01/r-and-foreign-characters.html, https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows, https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets
I also tried to save my file with 1251 encoding (instead of current UTF-8 encoding) and some other character conversion/processing packages. Nothing helped yet.
UPDATE:
Opened related question:
How to change Sys.setlocale, when you get Error "request to set locale … cannot be honored"
The only solution that worked was the one suggested by Yihui Xie (the knitr developer): create a file called .Rprofile containing the single line Sys.setlocale("LC_CTYPE", "russian") and place it in your home or working directory.
Note, however, that this works only with kable(), i.e. with the help of the knitr package.
If you try to print with print(dt$name[1]), you still get Áîðèñ Íåìöîâ.
However, if you use kable(dt$name[1]), you'll get what you need - Борис Немцов !

Unexpected jumps/oddities in lubridate and zoo package (reproducible example added)

###### Bug in lubridate & zoo ? ######
dataframex <- as.data.frame(rnorm(420,0,1))
dataframex
names(dataframex) <- c("value")
head(dataframex)
library(lubridate); library(zoo) # To assign months to rows of the dataframe
dataframex
row.names(dataframex) <- as.yearmon(seq(ymd('1980-01-01'), by = '1 month', length.out=(420)))
dataframex
There appear unexpected jumps/oddities at certain time points that I could not figure out:
value
Oca 1980 -1.112455234
Şub 1980 -0.370769140
.....................
Mar 1995 0.219924804
Nis 1995 -1.46725 value # oddity "value" occurred
Oca 1980 1995 -0.158754605 # unexpected jump from Apr1995 to Jan1980
Tem 1995 1.464587312
......................
Eyl 2010 -0.1995 -0.158754605
Tem 1995 1.464587312 # unexpected jump from Sept2010 to July1995
Ağu 1995 -0. # oddity again
Ara 2010 0.277914132
So sometimes "i" is wrongly printed among the year labels, and sometimes among the value labels on the right.
What I did to try to solve the problem:
I suspected it could be a Windows regional-settings problem, so I changed TR-TR to EN-US. The same oddities occurred.
I also changed the regional settings to use "." as the decimal separator, and likewise tried ",".
The error remained the same!
Any help will be greatly appreciated.
I eventually figured out that the error is due to a problematic localization language file in the Revolution R program.
The Step-by-Step solution:
1. Change the R localization language to solve the oddity/jump problem:
"Tools - Options - Environment - Help - International Settings - Language:English"
2. Restart R so that a new environment exists in R
Then run exactly the same code above to check whether the oddity/jump problem is solved. If it is, you are done.
If the oddity/jump problem is not solved, proceed to Steps 3 and 4.
3. Change the Regional settings from TR-TR to EN-US in Control Panel of Windows.
4. Change the International Setting in Revolution R:
"Tools - Options - Environment - Help - International Settings - Language:Same as Microsoft Windows"
5. Restart R so that a fresh environment exists.
Then run exactly the same code above. This time no oddity, no jump, and no stray "i" occurs.
PS: The Revolution R team should correct this language issue in the related Turkish localization files.
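The stray "i" characters are consistent with the well-known Turkish dotted/dotless-i casing problem: in the Turkish alphabet, "i" uppercases to "İ" (U+0130) and "I" lowercases to "ı" (U+0131), so locale-aware case conversions in a Turkish-localized build can produce unexpected characters. A short Python sketch (illustrative only; this is not the Revolution R code) shows how the Turkish letters differ from the English pair:

```python
# Turkish has two distinct i's: dotted İ/i and dotless I/ı.
dotted_cap = "\u0130"   # 'İ' (capital dotted i)
dotless = "\u0131"      # 'ı' (small dotless i)
print(dotted_cap.lower())  # 'i' plus a combining dot above (U+0307)
print(dotless.upper())     # 'I'
```

Because "i"/"I" are not a simple case pair in Turkish, code that uppercases or lowercases month names and labels under a Turkish locale can leak unexpected "i"-like characters into its output, which is one plausible mechanism for the oddities above.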

How to display Chinese characters properly in the SQLite console?

Here is a sample CSV file in UTF-8 format; it can be opened in Windows 7's Notepad and the Chinese characters display properly. Please download it:
http://pan.baidu.com/s/1sj0ia4H
Open cmd and run chcp 65001.
C:\Users\pengsir>sqlite3 e:\\test.db
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
sqlite> create table ipo(name TEXT,method TEXT);
sqlite> .separator ","
sqlite> .import "e:\\tmp.csv" ipo
sqlite> select * from ipo;
000001,公开招募
000002,申请表抽签é™é¢è®¤è´­
000004,定å‘å‘è¡Œ
000005,银行储蓄存å•æ–¹å¼
000006,申请表抽签é™é¢è®¤è´­
000007,自办å‘è¡Œ
000008,自办å‘è¡Œ
000009,定å‘å‘è¡Œ
000010,定å‘å‘è¡Œ
000011,申请表抽签等é¢è®¤è´­
sqlite>
Why do the same SQLite commands display properly in SQLiteManager, and how can I set the SQLite console to display Chinese characters?
In pysqlite3, the output displays correctly in the Python console:
>>> import sqlite3
>>> con=sqlite3.connect("e:\\test.db")
>>> cur=con.cursor()
>>> cur.execute("select * from ipo;")
<sqlite3.Cursor object at 0x01751720>
>>> print(cur.fetchall())
[('000001', '公开招募'), ('000002', '申请表抽签限额认购'), ('000004', '定向发行'
), ('000005', '银行储蓄存单方式'), ('000006', '申请表抽签限额认购'), ('000007',
'自办发行'), ('000008', '自办发行'), ('000009', '定向发行'), ('000010', '定向发
行'), ('000011', '申请表抽签等额认购')]
>>>
This issue concerns how the Command Prompt window shows the characters; it is not about how sqlite3 prints the output.
As a simple demonstration, we can leave sqlite3 out of it entirely and look at the files with the type command.
Let's first see what happens on another operating system, for example OS X:
ISO-8859-1 corresponds to Windows Latin-1 (equivalent Windows code page: chcp 819).
UTF-8 corresponds to Unicode UTF-8 (equivalent Windows code page: chcp 65001).
Much the same behavior happens on Windows: use the chcp command to inspect and/or set your current code page.
NOTICE: this is a screenshot of an Italian Windows XP, and as you can see there is still no luck! :-( In this case the cause is a lack of suitable fonts configurable in the command prompt properties on my Windows XP box.
I hope this is not the case on your Windows 7 box (but if it is, please leave a comment so I can be more specific in this part of the answer).
When the problem comes down to the available fonts, the additional language support needs to be installed, and you still need to force UTF-8 with chcp 65001:
How to get proper fonts
Here is the list of steps I followed to get the result on an Italian WinXP SP2 box, as shown in the screenshot above:
Step 1: Install East Asian language files on your computer (see the linked instructions for installing East Asian language files).
In summary, both of the relevant options were checked, and in the "Advanced" tab I selected Chinese.
Step 2: Switch the terminal/"Command Window" font from raster to a Chinese font.
Extra Step 3 (optional): Check the font in Notepad.
Notepad can be useful for inspecting fonts: for example, open the temp.csv and experiment with fonts, but be aware of the necessary criteria for fonts to be available in a command window.
The obvious underlying problem is that Windows (pretty much in general) has trouble dealing with UTF-8. In particular, the command-line tool is by default set to a country-specific code page rather than Unicode.
Usually you can (temporarily) fix this by setting the code page for the command-line session to UTF-8, for example by typing:
chcp 65001
But the problem is that in your case this does not really fix it, since sqlite3 still seems to run with the default character set, and there does not seem to be any option to switch the current sqlite3 session to Unicode.
Still, the good news above it all is that your data is correct, and you can work with it using SQLiteManager or similar tools, which handle Unicode appropriately.
To further substantiate this: if you open your original CSV in Excel, it will probably also show messed-up characters (Excel usually does not default to Unicode). LibreOffice, by contrast, will typically ask you for the encoding to use: given Unicode it shows the correct text, but given a different encoding (e.g. Western Europe) it gives you the same result as Excel (you can preview it there quite nicely; give it a shot).
Hope this helps!
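The "data is correct, console code page is wrong" diagnosis can be checked mechanically: decoding the stored UTF-8 bytes as cp1252 reproduces the mojibake style from the sqlite3 transcript, and reversing the mistake recovers the original text. A Python sketch, just to demonstrate (not part of the original answer):

```python
# What is really in the database (UTF-8) vs. what a cp1252 console displays.
text = "公开招募"
mangled = text.encode("utf-8").decode("cp1252")
print(mangled)  # å…¬å¼€æ‹›å‹Ÿ -- the same style of garbage as in the transcript

# Undoing the mistake recovers the original: the bytes were never corrupted.
restored = mangled.encode("cp1252").decode("utf-8")
print(restored == text)  # True
```

This is why SQLiteManager and pysqlite3 show the text correctly: they read the raw UTF-8 bytes and render them with a Unicode-aware display, while the Command Prompt reinterprets the same bytes through its legacy code page.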
