read data into R - r

The World Health Organization dataset is available here: http://www.filedropper.com/who
When the data is read using fread (from the data.table package) or read_csv (from the readr package), some values are wrapped in \r characters and are shown as character type, like so:
"\r31.1\r".
I checked the dataset in Notepad and indeed it looks weird, as these values are wrapped within (' '). However, they are numeric, and when the regular read.csv is used there is no such problem.
What's the reason behind this? How can I fix it?

The '\r' is a special character (carriage return) that Windows uses as part of its line ending ('\r\n').
When using read_csv, setting the argument escape_backslash=TRUE might do the trick.
Check this for further reading.
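If the reader options do not help, the stray carriage returns can also be removed after import. The following is a minimal base R sketch, not a confirmed fix for this particular file: the data frame who and the helper strip_cr are hypothetical stand-ins for whatever fread or read_csv returned.

```r
# Hypothetical sample mimicking the problem: numbers wrapped in carriage returns.
who <- data.frame(
  country = c("A", "B"),
  value   = c("\r31.1\r", "\r28.4\r"),
  stringsAsFactors = FALSE
)

# Remove every carriage return from the character columns, then let R
# re-guess each column's type (so "31.1" becomes numeric again).
strip_cr <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) gsub("\r", "", x, fixed = TRUE))
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

who_clean <- strip_cr(who)
str(who_clean)
```

Columns that still contain non-numeric text after stripping stay as character, so the helper should be safe to run over the whole table.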

Related

Standardise strings in cells of a dataframe in R that appear bold

Loading an Excel sheet into R, some strings in the cells of the dataframe appear to be bold and in a different format. For example, like so:
π‡πšπ’ππšπ«πš
And when I copy paste this string into the R console, it appears like this:
Anyone know how to fix this (revert these strings into the standard format) in R?
Want to avoid going back into Excel to fix it.
Thanks!
These are actually UTF-8 encoded letters in the Mathematical Alphanumeric Symbols block in Unicode, and they don't map nicely back on to 'standard' ASCII letters in R unless you have a pre-existing mapping function such as utf8_normalize from the utf8 package:
library(utf8)
utf8_normalize('𝐇𝐚𝐢𝐝𝐚𝐫𝐚', map_compat = TRUE)
#> [1] "Haidara"
However, I would strongly recommend that you fix your Excel file before importing to avoid having to do this; it works with the example you have given us here, but there may be unwelcome surprises in converting some of your other strings.
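To make the idea of a mapping function concrete, here is a rough base R sketch that only handles the Mathematical Bold letters (U+1D400 to U+1D433). The helper name unbold is hypothetical, and the many other styled alphabets in that Unicode block (italic, script, and so on) are deliberately left untouched, so this is an illustration of the mechanism rather than a replacement for utf8_normalize.

```r
# Map Mathematical Bold letters (U+1D400..U+1D433) back to plain ASCII.
# Only this one sub-range is handled; everything else passes through unchanged.
unbold <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    upper <- cp >= 0x1D400 & cp <= 0x1D419   # bold A..Z
    lower <- cp >= 0x1D41A & cp <= 0x1D433   # bold a..z
    cp[upper] <- cp[upper] - 0x1D400 + utf8ToInt("A")
    cp[lower] <- cp[lower] - 0x1D41A + utf8ToInt("a")
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Build the bold example from code points so the script itself stays ASCII.
bold_name <- intToUtf8(c(0x1D407, 0x1D41A, 0x1D422, 0x1D41D,
                         0x1D41A, 0x1D42B, 0x1D41A))
unbold(bold_name)
```

It could be applied to a whole column with something like df$name <- unbold(df$name), assuming the column is a character vector.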

openxlsx package, read.xlsx check.names=false still placing a . in column names

Usually I use the tidyverse to read in Excel files with the read_excel command; however, I encountered the dreaded "Unknown or uninitialised column" bug, which refers to a nonexistent column and then warns about that nonexistent column from then on through the workflow.
So I decided to use openxlsx instead to read in the Excel files. All was going well until I realised that openxlsx treats column names containing whitespace as syntactically invalid and replaces the whitespace with a '.'. So 'Customer Name' becomes 'Customer.Name'.
I tried using the check.names=FALSE argument to leave the headers intact, but the package seems to ignore it.
Many of the headers might have more than a single space between the words and the format has to stay the same. I cannot use an excel package that relies on Java as our company has blocked it.
How can I force openxlsx to leave the header alone?
Example of the code I am using is here: IMACS <- read.xlsx("//zfsstdscun001a.rz.ch.com/UKGI_Pricing/Bus_Insights/R_Scripts/IMACS.xlsx",check.names=FALSE, sheet = "IMACS")
All credit to @Matt on this.
Calling read_excel through the readxl namespace (readxl::read_excel) worked a treat.
IMACS <- readxl::read_excel("//zfsstdscun001a.rz.com/UKGI_Pricing/Bus_Insights/R_Scripts/CAT Risks/IMACSV2.xlsx",
sheet = "IMACS")
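For anyone who needs to stay with openxlsx, the dot appears to come from a separate argument, sep.names, rather than from check.names. The sketch below assumes the installed openxlsx version provides sep.names (older releases may not), and it uses a hypothetical temporary workbook rather than the IMACS file. Whether runs of several spaces in a header survive unchanged is worth checking on your own data.

```r
# Sketch only: assumes the installed openxlsx provides the sep.names argument.
if (requireNamespace("openxlsx", quietly = TRUE)) {
  # Hypothetical one-row workbook with a space in the header.
  path <- tempfile(fileext = ".xlsx")
  src <- data.frame("Customer Name" = "Acme", check.names = FALSE)
  openxlsx::write.xlsx(src, path)

  # With the defaults the space is replaced by "."; sep.names = " " keeps it.
  default_names <- names(openxlsx::read.xlsx(path))
  kept_names    <- names(openxlsx::read.xlsx(path, sep.names = " "))
  print(default_names)
  print(kept_names)
}
```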

read_csv does not separate on commas and does not capture separate rows

I am trying to parse a text log file like the one below. I can use the default read.csv to parse this file.
test <- read.csv("test.txt", header=FALSE)
It separated all the comma-delimited parts; the result is not perfectly arranged in a data frame, but further manipulation can improve it.
However, I cannot seem to do the same using the readr package:
test <- read_csv("test.txt", header=FALSE)
All observations turn into 1 row, no separation between commas.
I am learning this package so any help would be great.
{"dev_id":"f8:f0:05:xx:db:xx","data":[{"dist":[7270,7269,7269,7275,7270,7271,7265,7270,7274,7267,7271,7271,7266,7263,7268,7271,7266,7265,7270,7268,7264,7270,7261,7260]},{"temp":0},{"hum":0},{"vin":448}],"time":4485318,"transmit_time":4495658,"version":"1.0"}
{"dev_id":"f8:xx:05:xx:d9:xx","data":[{"dist":[6869,6868,6867,6871,6866,6867,6863,6865,6868,6869,6868,6860,6865,6866,6870,6861,6865,6868,6866,6864,6866,6866,6865,6872]},{"temp":0},{"hum":0},{"vin":449}],"time":4405316,"transmit_time":4413715,"version":"1.0"}
{"dev_id":"xx:f0:05:e8:da:xx","data":[{"dist":[5775,5775,5777,5772,5777,5770,5779,5773,5776,5777,5772,5768,5782,5772,5765,5770,5770,5767,5767,5777,5766,5763,5773,5776]},{"temp":0},{"hum":0},{"vin":447}],"time":4461316,"transmit_time":4473307,"version":"1.0"}
{"dev_id":"xx:f0:xx:e8:xx:0a","data":[{"dist":[4358,4361,4355,4358,4359,4359,4361,4358,4359,4360,4360,4361,4361,4359,4359,4356,4357,4361,4359,4360,4358,4358,4362,4359]},{"temp":0},{"hum":0},{"vin":424}],"time":5190320,"transmit_time":5198748,"version":"1.0"}
Thanks to @Dave2e for pointing out that this file is in JSON format; I found a way to parse it using ndjson::stream_in.
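Each line of the log is a complete JSON object, so a CSV reader can only split it on commas without understanding the nesting. A minimal sketch of the ndjson approach follows; it assumes the ndjson package is installed, and the two shortened records and the temporary file are made up for illustration.

```r
# Sketch only: runs when the ndjson package is available.
if (requireNamespace("ndjson", quietly = TRUE)) {
  # Hypothetical two-record log, one JSON object per line.
  path <- tempfile(fileext = ".txt")
  writeLines(c(
    '{"dev_id":"aa","data":[{"dist":[1,2]},{"temp":0}],"time":10}',
    '{"dev_id":"bb","data":[{"dist":[3,4]},{"temp":0}],"time":20}'
  ), path)

  # stream_in() reads the file line by line and flattens nested fields
  # into columns, giving one row per JSON object.
  logs <- ndjson::stream_in(path)
  print(dim(logs))
}
```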

Importing data from Excel to vector in R

I am a novice in R and I have been having some trouble trying to get R and Excel to cooperate.
I have written code that compares two vectors with each other and determines the differences between them:
data.x<-read.csv(file.choose(), header=T)
data.y<-read.csv(file.choose(), header=T)
newdata.x<-grep("DAG36|G379",data.x,value=TRUE,invert=TRUE)
newdata.x
newdata.y<-grep("DAG36|G379",data.y,value=TRUE,invert=TRUE)
newdata.y
setdiff(newdata.x,newdata.y)
setdiff(newdata.y,newdata.x)
The data I want to transfer from Excel to R is a long row of numbers placed as so:
“312334-2056”, “457689-0932”, “857384-9857”,….,
There are about 350 of these numbers placed in their own separate cell along a single row.
I used the command: = """" & A1 & """" to put double quotes around every number in order for R to read it properly.
At first I tried to simply copy/paste the data directly into a vector in R, but it's as if R won’t read it as a single row of data and therefore splits it up.
I also tried to save the Excel file as a CSV file but that didn’t work either.
Lastly I tried to open it directly in to R using the command:
data.x<- read.csv(file.choose(), header=T)
But when I type in data.x and press enter, it simply says:
<0 rows> (or 0-length row.names)
I simply can’t figure out what I’m doing wrong. Any help would be greatly appreciated.
It's hard to assess without a reproducible example, but you should be able to transpose the Excel file into a single column. Then import using read_csv from the readr package. Take a look at the tidyverse package, which will contain some great tools to import and work with this type of data.
I use https://github.com/tidyverse/readxl/. It makes it easy to maintain formatting from Excel into type-safe tibbles.
If you can share some sample data a working solution can be generated.
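In the meantime, here is a base R sketch of one way the single-row file could be read into a vector. The likely cause of the "<0 rows>" message is that header=T turns the only row into column names, leaving no data. The file contents, the temporary paths and the helper read_row are hypothetical; with this approach the Excel quoting formula should not be needed.

```r
# Hypothetical single-row CSV files, as Excel would save them (no header row).
path_x <- tempfile(fileext = ".csv")
path_y <- tempfile(fileext = ".csv")
writeLines("312334-2056,457689-0932,DAG36-0001,857384-9857", path_x)
writeLines("312334-2056,G379-0002,111111-1111", path_y)

# header = FALSE keeps the only row as data; unlist() flattens it to a vector.
read_row <- function(path) {
  row <- read.csv(path, header = FALSE, colClasses = "character")
  unlist(row, use.names = FALSE)
}

x <- read_row(path_x)
y <- read_row(path_y)

# Drop the unwanted IDs, then compare in both directions.
keep_x <- grep("DAG36|G379", x, value = TRUE, invert = TRUE)
keep_y <- grep("DAG36|G379", y, value = TRUE, invert = TRUE)
setdiff(keep_x, keep_y)
setdiff(keep_y, keep_x)
```

Passing a character vector to grep, rather than the whole data frame, is what makes the filtering work element by element.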

Load Excel file to R while setting column equal to a factor

I have an Excel file that I am trying to load into R using the odbcConnectExcel and sqlQuery commands from the RODBC package. One of the columns has numerical values with plus or minus signs, such as '5+ or '3-. However, if I do something like,
conn <- odbcConnectExcel("file.xls")
sqlQuery(conn, "SELECT * FROM `Sheet1$`")
then the column with the plus and minus signs will be returned as a numerical column with those symbols stripped. Is there a way to have this column read in as a factor in which the signs are maintained? I would prefer to not have to convert the file to another format first.
Thanks.
Data like this becomes a factor if you use the xlsReadWrite (http://www.swissr.org/software/xlsreadwrite) package to read the file:
library(xlsReadWrite)
x <- read.xls(file="file.xls")
However, note that you need to do something more than just install.packages("xlsReadWrite") to get this package to run. You need another file or so, I forgot.
This doesn't directly address your question, but hopefully it will help:
This is the best summary of options for connecting to Excel that I have seen: Export Data Frames To Multi-worksheet Excel File. While it deals generally with exporting, importing is also possible with most of these approaches.
My favorite is actually the RDCOMClient because it provides total control over Excel as an application.
