I am testing Unicode-heavy imports across various R packages. I'm through everything but JSON because of a persistent error: the file is read in as one long, single-row data set. The file is available here.
I think I am following the instructions in the help. I have tried two approaches:
Read the data into an object, then convert it to a data frame.
library(readr)  # for read_file()
library(rjson)  # for fromJSON()
raw_json_data <- read_file("World Class.json")
test_json <- fromJSON(raw_json_data)
as.data.frame(test_json)
Read the file using fromJSON(), then convert to a data frame. I happen to be using R's new pipe here, but that doesn't seem to matter.
rjson_json <- fromJSON(file = "World Class.json") |>
  as.data.frame()
In every attempt, I get the same result: a data frame of 1 row and 1,400 columns. Is there a step I am missing in this conversion?
EDIT: I am not looking for the answer "use package X instead". The rjson package seems to read in the JSON data, which has a quite simple structure. The problem is that the as.data.frame() call results in a one-row, 1400-column data frame, and I'm asking why that is.
Try the jsonlite package instead.
library(jsonlite)
## next line gives warning: JSON string contains (illegal) UTF8 byte-order-mark!
json_data <- fromJSON("World Class.json") # from file
dim(json_data)
[1] 40 35
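As to the "why" in the question: rjson's fromJSON() returns a nested list (here, 40 records of 35 fields each), and calling as.data.frame() on a nested list flattens every leaf value into its own column, which is exactly where the one-row, 1400-column (40 × 35) shape comes from. A minimal sketch of building a proper data frame from rjson's output instead, assuming every record carries the same fields:
library(rjson)
# fromJSON(file = ...) yields a list of records (a list of named lists)
records <- fromJSON(file = "World Class.json")
# convert each record to a one-row data frame, then row-bind them;
# this assumes all records share the same field names
test_df <- do.call(rbind, lapply(records, as.data.frame))
dim(test_df)  # 40 x 35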
I have the following code
A <- c(1,2,3); B <- c(2,3,3); C <- c(3,4,5);
colMeans(rbind (A,A,B,C,C,C,A,A))
which when executed returns
[1] 1.88 2.88 3.75
I am trying to get this to work for arrays/matrices that I get from an Excel file.
When trying to read from the clipboard (MacOSx) using
read.table(pipe("pbpaste"), sep="\t", header=TRUE)
I end up getting a data frame of character columns (checked with typeof). I am fairly new to R; my issue is that these columns come in as character rather than numeric, and I've tried various ways to convert them so that I can run colMeans(rbind()) on them. Any thoughts? Thanks.
You can call the type conversion explicitly after reading from the clipboard:
library(tidyverse)
read_delim(pipe("pbpaste"), delim="\t") %>% type_convert()
Note that it is recommended to always read data from files stored somewhere; reading from the clipboard makes the code non-reproducible.
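If you prefer to stay in base R, the same explicit conversion can be done with type.convert(), which in recent versions of R accepts a whole data frame. A minimal sketch under the same clipboard setup:
# base R alternative: read from the clipboard, then re-guess each column's type
df <- read.table(pipe("pbpaste"), sep = "\t", header = TRUE)
df <- type.convert(df, as.is = TRUE)  # character -> numeric where possible
colMeans(df)  # works once every column is numeric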
I'm using rjson to read a JSON file into R. However, the fromJSON(file = "file.json") command is only reading the first line of the file. Here's the JSON:
{"id":"a","emailAddress":"a#a.com","name":"abc"}
{"id":"b","emailAddress":"b#b.com","name":"def"}
{"id":"c","emailAddress":"c#c.com","name":"ghi"}
How do I get all 3 rows into an R dataframe? Note that the above content lives in a single file.
I found a hacky way to do this: first I read in the whole file/string with readr, then I split the data on newlines ("\n"), and finally I parse each line with fromJSON and bind the results into one data frame:
library(jsonlite)
library(readr)
# read the whole file, split it into lines, parse each line, then row-bind
json_raw <- readr::read_file("file.json")
json_lines <- unlist(strsplit(json_raw, "\n"))
json_df <- do.call(rbind, lapply(json_lines,
                                 FUN = function(x) as.data.frame(jsonlite::fromJSON(x))))
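For what it's worth, jsonlite also ships a dedicated reader for this newline-delimited JSON (NDJSON) format, which avoids the manual splitting. A minimal sketch:
library(jsonlite)
# stream_in() parses one JSON record per line straight into a data frame
json_df <- jsonlite::stream_in(file("file.json"))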
Normally, when we read a CSV file in R, spaces in column names are automatically converted to '.':
> df <- read.csv("report.csv")
> str(df)
'data.frame': 598 obs. of 61 variables:
$ LR.Number
$ Vehicle.Number
However, when we read the same CSV file in SparkR, the spaces remain intact and are not handled implicitly by Spark:
#To read a csv file
df <- read.df(sqlContext, path = "report.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
printSchema(df)
root
|-- LR Number: string (nullable = true)
|-- Vehicle Number: string (nullable = true)
Because of this, performing any operation on such a column causes a lot of trouble, and it has to be called like this:
head(select(df, df$`LR Number`))
How can I handle this explicitly? How can SparkR handle this implicitly? I am using SparkR version 1.5.0.
As a workaround, you could use the following piece of pseudo code:
colnames_df<-colnames(df)
colnames_df<-gsub(" ","_",colnames_df)
colnames(df)<-colnames_df
Another solution is to save the file somewhere and read it back in using read.df().
The following worked for me:
df = collect(df)
colnames_df<-colnames(df)
colnames_df<-gsub(" ","_",colnames_df)
colnames(df)<-colnames_df
df <- createDataFrame(sqlContext, df)
printSchema(df)
Here we need to collect the data locally first, which converts the Spark data frame to a normal R data frame. I am sceptical whether this is a good solution, as I don't want to call collect(). However, I investigated and found that even to use the ggplot libraries we need to convert a Spark data frame into a local R data frame.
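If you want to avoid collect() altogether, one option is to rename the columns on the Spark side. A sketch, assuming your SparkR version exposes columns() and withColumnRenamed():
# rename each column on the Spark side, without collecting the data locally
for (old_name in columns(df)) {
  new_name <- gsub(" ", "_", old_name)
  if (new_name != old_name) {
    df <- withColumnRenamed(df, old_name, new_name)
  }
}
printSchema(df)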
I'm trying to quickly read into R an ASCII fixed-column-width dataset, based on a SAS import file (the file that declares the column widths, etc.).
I know I can use the SAScii R package for translating the SAS import file (parse.SAScii) and actually importing the data (read.SAScii). It works, but it is too slow because read.SAScii uses read.fwf to do the data import, which is slow. I would like to swap that for a fast import method: laf_open_fwf from the LaF package.
I'm almost there, using parse.SAScii() and laf_open_fwf(), but I'm not able to correctly connect the output of parse.SAScii() to the arguments of laf_open_fwf().
Here is the code; the data comes from PNAD, the Brazilian national household survey, 2013:
# Set working dir.
setwd("C:/User/Desktop/folder")
# installing packages:
install.packages("SAScii")
install.packages("LaF")
library(SAScii)
library(LaF)
# Download and unzip data and documentation files
# Data
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dados.zip"
download.file(file_url,"Dados.zip", mode="wb")
unzip("Dados.zip")
# Documentation files
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dicionarios_e_input_20150814.zip"
download.file(file_url,"Dicionarios_e_input.zip", mode="wb")
unzip("Dicionarios_e_input.zip")
# importing with read.SAScii(), based on read.fwf(): Works fine
dom.pnad2013.teste1 <- read.SAScii("Dados/DOM2013.txt","Dicionarios_e_input/input DOM2013.txt")
# importing with parse.SAScii() and laf_open_fwf() : stuck here
dic_dom2013 <- parse.SAScii("Dicionarios_e_input/input DOM2013.txt")
head(dic_dom2013)
data <- laf_open_fwf("Dados/DOM2013.txt",
column_types=????? ,
column_widths=dic_dom2013[,"width"],
column_names=dic_dom2013[,"Varname"])
I'm stuck on this last command, passing the import arguments to laf_open_fwf().
UPDATE: here are two solutions, using packages LaF and readr.
Solution using readr (8 seconds)
readr is based on LaF but surprisingly faster. More info on readr here.
# Load Packages
library(SAScii)      # for parse.SAScii()
library(readr)
library(data.table)
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
setDT(dic_pes2013) # convert to data.table
# read to data frame
pesdata2 <- read_fwf("Dados/PES2013.txt",
                     fwf_widths(dic_pes2013[, width],
                                col_names = dic_pes2013[, varname]),
                     progress = interactive())
Takeaway: readr seems to be the best option: it's faster, you don't need to worry about column types, the code is shorter, and it shows a progress bar :)
Solution using LaF (20 seconds)
LaF is one of the (maybe THE) fastest ways to read fixed-width files in R, according to this benchmark. It took me 20 sec. to read the person-level file (PES) into a data frame.
Here is the code:
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
# Read .txt file using LaF. This is virtually instantaneous
pesdata <- laf_open_fwf("./Dados/PES2013.txt",
column_types= rep("character", length(dic_pes2013[,"width"])),
column_widths=dic_pes2013[,"width"],
column_names=dic_pes2013[,"varname"])
# convert to data frame. This took me 20 sec.
system.time( pesdata <- pesdata[,] )
Note that I've used character in column_types. I'm not quite sure why the command returns an error if I try integer or numeric. This shouldn't be a problem, since you can convert all columns to numeric like this:
# convert all columns to numeric
varposition <- grep("V", colnames(pesdata))
pesdata[varposition] <- lapply(pesdata[varposition], as.numeric)
sapply(pesdata, class)
You can try read.SAScii.sqlite, also by Anthony Damico. It's 4x faster and leads to no RAM issues (as the author himself describes). But it imports the data into a self-contained SQLite database file (no SQL server needed) -- not into a data.frame. You can then open it in R using a database connection. Here is the GitHub address for the code:
https://github.com/ajdamico/usgsd/blob/master/SQLite/read.SAScii.sqlite.R
In the R console, you can just run:
source("https://raw.githubusercontent.com/ajdamico/usgsd/master/SQLite/read.SAScii.sqlite.R")
Its arguments are almost the same as those of the regular read.SAScii.
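Once the import finishes, reading the result back is a standard DBI/RSQLite round trip. A sketch, where the database file name and table name are hypothetical placeholders for whatever you passed to read.SAScii.sqlite:
library(DBI)
library(RSQLite)
# "pnad.db" and "pes2013" are placeholders for your actual file/table names
con <- dbConnect(RSQLite::SQLite(), "pnad.db")
dbListTables(con)  # check which tables the import created
pes <- dbGetQuery(con, "SELECT * FROM pes2013 LIMIT 10")
dbDisconnect(con)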
I know you are asking for a tip on how to use LaF. But I thought this could also be useful to you.
I think the best choice is to use fwf2csv() from the descr package (C++ code). I will illustrate the procedure with PNAD 2013. Be aware that I'm assuming you already have the dictionary with three variables (beginning of the field, size of the field, variable name) and the data at Dados/.
library(bit64)
library(data.table)
library(descr)
library(reshape)
library(survey)
library(xlsx)
end_dom <- dicdom$beginning + dicdom$size - 1
fwf2csv(fwffile='Dados/DOM2013.txt', csvfile='dadosdom.csv', names=dicdom$variable, begin=dicdom$beginning, end=end_dom)
dadosdom <- fread(input='dadosdom.csv', sep='auto', sep2='auto', integer64='double')
I imported a dataset in the .sav SPSS format, and I'm getting an error that I haven't seen before.
1: In read.spss("C:\\Users\\acer\\Desktop\\X\\X\\PIREDEU\\ees2009_v0.9_20110622.sav", ... :
C:\Users\acer\Desktop\X\X\PIREDEU\ees2009_v0.9_20110622.sav: File contains duplicate label for value 1.1 for variable V200
Error in cat(list(...), file, sep, fill, labels, append) :
argument 2 (type 'list') cannot be handled by 'cat'
This came up after I typed warnings(PIREDEU). I imported the data using the foreign package:
library(foreign)
PIREDEU<-read.spss("C:\\Users\\acer\\Desktop\\X\\X\\PIREDEU\\ees2009_v0.9_20110622.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
I've fiddled with various combinations for the latter three arguments of the read.spss function, and I've gotten nowhere.
Anyone have any suggestions?
I used the following and it worked perfectly; just ignore the warning message and check the data by typing its name:
mydata4<-read.spss("C:\\Work\\data.sav",use.value.labels=F,to.data.frame=T)
mydata4 # check data
Do you have long strings in the file, longer than 8 bytes? SPSS Statistics uses some special arrangements to handle those. It looks like the problem is with the value labels. If you can delete those (using SPSS), you might be able to get the rest of the data.
Try to read data without labels.
library(foreign)
PIREDEU <- read.spss("C:\\Users\\acer\\Desktop\\X\\X\\PIREDEU\\ees2009_v0.9_20110622.sav",
use.value.labels = F,
to.data.frame = T)
Does it work?
Convert the SPSS data file into .por (portable file) format and, in R, install the packages Hmisc, memisc and foreign; load them using library(foreign), library(Hmisc) and library(memisc).
Then type the following:
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors