R - converting a list of characters to a list of variables

I have the following code
A <- c(1,2,3); B <- c(2,3,3); C <- c(3,4,5);
colMeans(rbind(A, A, B, C, C, C, A, A))
which when executed returns
[1] 1.875 2.875 3.750
I am trying to get this to work for arrays/matrices that I get from an Excel file.
When trying to read from the clipboard (macOS) using
read.table(pipe("pbpaste"), sep="\t", header=TRUE)
I end up getting a data frame whose columns are of type character (according to typeof).
I am fairly new to R, so my issue is that these lists are characters and not variable names. I have tried various ways to convert them to a list of variables so that I could run colMeans(rbind()) on them.
Any thoughts? Thanks.

You can call the type conversion explicitly after reading from the clipboard:
library(tidyverse)
read_delim(pipe("pbpaste"), delim="\t") %>% type_convert()
Note that it is generally recommended to read data from files stored somewhere; reading from the clipboard makes the code non-reproducible.
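If you prefer base R, a minimal sketch of the same idea (assuming the clipboard holds a tab-separated table with a header):
# read from the macOS clipboard as character data
df <- read.table(pipe("pbpaste"), sep="\t", header=TRUE, stringsAsFactors=FALSE)
# coerce each column to its natural type (numeric where possible)
df[] <- lapply(df, type.convert, as.is=TRUE)
colMeans(df)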

Related

rjson reads 35-column 40-row file as one long row

I am trying to test unicode-heavy imports of various R packages. I'm through everything but JSON because of a persistent error: the file is read in as one long, single-row file. The file is available here.
I think I am following the instructions in the help. I have tried two approaches:
Read the data into an object, then convert to a data frame:
library(rjson)
library(readr)  # for read_file()
raw_json_data <- read_file("World Class.json")
test_json <- fromJSON(raw_json_data)
as.data.frame(test_json)
Read the file using fromJSON(), then convert to a data frame. I happen to be using R's new pipe here, but that doesn't seem to matter:
rjson_json <- fromJSON(file = "World Class.json") |>
as.data.frame()
In every attempt, I get the same result: a data frame of one observation and 1400 variables. Is there a step I am missing in this conversion?
EDIT: I am not looking for the answer "Use package X instead". The rjson package seems to read in the JSON data, which has a quite simple structure. The problem is that the as.data.frame() call results in a one-row, 1400-column data frame, and I'm asking why that is.
Try the jsonlite package instead.
library(jsonlite)
## next line gives warning: JSON string contains (illegal) UTF8 byte-order-mark!
json_data <- fromJSON("World Class.json") # from file
dim(json_data)
[1] 40 35
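If you'd rather stay with rjson, as the EDIT requests, a possible workaround is to bind the records row-wise yourself rather than calling as.data.frame() on the whole nested list (a sketch, assuming the file parses to a list of 40 row-records):
library(rjson)
json_list <- fromJSON(file = "World Class.json")
# convert each record (a named list) to a one-row data frame, then stack them
df <- do.call(rbind, lapply(json_list, as.data.frame, stringsAsFactors = FALSE))
dim(df)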

read_csv (readr, R) populates entire column with NA if there are NA in the first 1000 + x observations in a simple and clean csv (parsing failure)

I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple and clean csv files.
I'm iterating over multiple large csv files that add up to millions of observations; some columns contain quite a few NAs.
When reading a csv that contains NA in a certain column for the first 1000 + x observations, read_csv populates the entire column with NA, and thus the data is lost for further operations.
The warning "Warning: x parsing failures" is shown, but as I'm reading multiple files I cannot check them file by file. Even then, I wouldn't know an automated fix for the parsing problem that problems(x) also indicates.
Using read.csv instead of read_csv does not cause the problem, but it is slow and I run into encoding issues (using different encodings requires too much memory for large files).
An option to overcome this bug is to add a first observation (first row) to your data that contains something for each column, but then I would still need to read the file first somehow.
See a simplified example below:
## create a data frame
df <- data.frame(id = numeric(), string = character(),
                 stringsAsFactors = FALSE)
## populate columns
df[1:1500, 1] <- 1:1500
df[1500, 2] <- "something"
# variable string contains the first value in obs. 1500
df[1500,]
## check the number of NAs in variable string
sum(is.na(df$string)) # 1499
## write the df (write_csv is from readr)
library(readr)
write_csv(df, "df.csv")
## read the df with read_csv and read.csv
df_readr <- read_csv('df.csv')
df_read_standard <- read.csv('df.csv')
## check the number of NAs in variable string
sum(is.na(df_readr$string)) # 1500
sum(is.na(df_read_standard$string)) # 1499
## the read_csv result is all NA for variable string
problems(df_readr) ## What should that tell me? How do I fix it?
Thanks to MrFlick for answering my question in a comment:
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.
Setting guess_max = Inf also overcomes the problem, but the speed advantage of read_csv is then lost.
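For the toy example above, a minimal sketch of the col_types= fix (the column names id and string come from the example; adapt the specification to your real files):
library(readr)
# declare the types up front so read_csv doesn't have to guess from the first rows
df_readr <- read_csv("df.csv",
                     col_types = cols(id = col_double(),
                                      string = col_character()))
sum(is.na(df_readr$string)) # 1499, matching read.csv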

How to remove empty columns in transaction data read with the arules package?

I have a dataset in basket format. I read it into R using the arules package, which has a built-in function for reading transactions. Following is the code I used:
trans = read.transactions("C:/Users/HARI/Desktop/Graph_mining/transactional_data_v3.csv", format = "basket", sep=",",rm.duplicates=TRUE)
inspect(trans[1:5])
items
1 {,
ANTIVERT,
SOFTCLIX}
2 {,
CEFADROXIL,
ESTROGEN}
3 {,
BENZAMYCIN,
BETAMETH,
KEFLEX,
PERCOCET}
4 {,
ACCUTANE(RXPAK;10X10),
BENZAMYCIN}
5 {,
ALBUTEROL,
BUTISOLSODIUM,
CLARITIN,
NASACORTAQ}
As you can see, when I use inspect(trans), each transaction contains an empty item. My question is: how can I remove those empty items?
For a full dput of the trans object, please see this link.
I think I've found a solution to your problem. I took your csv file, opened it in Excel, and replaced all empty cells with NA. Then I pasted the whole thing into LibreOffice Calc and saved it back to csv, specifying that double quotes should be used for all cells (oddly enough, Excel won't do that without a VBA macro; you could read the file directly in LibreOffice instead of Excel, but replacing the empty cells with NAs there takes forever). Then:
trans <- read.table("d:/downloads/transactional_data_2.csv", sep=",", stringsAsFactors = TRUE, na.strings="NA", header=TRUE)
trans2 <- as(trans, "transactions")
inspect(trans2[1:5])
RESULTS
inspect(trans2[1:5])
items transactionID
1 {X1=SOFTCLIX,
X2=ANTIVERT} 1
2 {X1=ESTROGEN,
X2=CEFADROXIL} 2
3 {X1=KEFLEX,
X2=BETAMETH,
X3=PERCOCET,
X4=BENZAMYCIN} 3
4 {X1=BENZAMYCIN,
X2=ACCUTANE(RXPAK;10X10)} 4
5 {X1=CLARITIN,
X2=ALBUTEROL,
X3=NASACORTAQ,
X4=BUTISOLSODIUM} 5
I think that's the results you're looking for...?
I'm not super familiar with the arules package. My best guess is to read in the data using read.csv and then convert to transaction format instead of using the provided read.transactions:
tran2 <- read.csv("downloads/transactional_data.csv")
tran3 <- as(tran2, "transactions")
EDIT: I believe that the blanks in your data are not being read in correctly; additionally, there are duplicates which should also be filtered out. This should deal with that. You will need the reshape2 package.
library(arules)
library(reshape2)
trans2 <- read.csv("downloads/transactional_data.csv", na.strings="", stringsAsFactors=FALSE)
trans2$id <- seq(nrow(trans2))
## reshape to long format: one row per (transaction id, item)
t2.long <- melt(trans2, id.vars="id")
t2.long$variable <- NULL
## split items by transaction id, drop duplicates, and coerce to transactions
t3 <- as(lapply(split(t2.long$value, t2.long$id), unique), "transactions")
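Alternatively, since transactions objects can be subset by item, it may be possible to drop the empty item directly in arules without re-reading the file (a sketch, assuming the spurious item has an empty label):
library(arules)
## keep only the items whose label is non-empty
trans_clean <- trans[, itemLabels(trans) != ""]
inspect(trans_clean[1:5])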

R: use LaF (reads fixed column width data FAST) with SAScii (parses SAS dictionary for import instructions)

I'm trying to quickly read into R an ASCII fixed-column-width dataset, based on a SAS import file (the file that declares the column widths, etc.).
I know I can use the SAScii R package for translating the SAS import file (parse.SAScii) and actually importing (read.SAScii). It works, but it is too slow, because read.SAScii uses read.fwf to do the data import, which is slow. I would like to replace that with a fast import method, laf_open_fwf from the LaF package.
I'm almost there, using parse.SAScii() and laf_open_fwf(), but I'm not able to correctly connect the output of parse.SAScii() to the arguments of laf_open_fwf().
Here is the code; the data is from PNAD, the (Brazilian) national household survey, 2013:
# Set working dir.
setwd("C:/User/Desktop/folder")
# installing packages:
install.packages("SAScii")
install.packages("LaF")
library(SAScii)
library(LaF)
# Download and unzip data and documentation files
# Data
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dados.zip"
download.file(file_url,"Dados.zip", mode="wb")
unzip("Dados.zip")
# Documentation files
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dicionarios_e_input_20150814.zip"
download.file(file_url,"Dicionarios_e_input.zip", mode="wb")
unzip("Dicionarios_e_input.zip")
# importing with read.SAScii(), based on read.fwf(): Works fine
dom.pnad2013.teste1 <- read.SAScii("Dados/DOM2013.txt","Dicionarios_e_input/input DOM2013.txt")
# importing with parse.SAScii() and laf_open_fwf() : stuck here
dic_dom2013 <- parse.SAScii("Dicionarios_e_input/input DOM2013.txt")
head(dic_dom2013)
data <- laf_open_fwf("Dados/DOM2013.txt",
                     column_types=????? ,
                     column_widths=dic_dom2013[,"width"],
                     column_names=dic_dom2013[,"Varname"])
I'm stuck on the last command, passing the importing arguments to laf_open_fwf().
UPDATE: here are two solutions, using packages LaF and readr.
Solution using readr (8 seconds)
readr is said to be based on LaF but surprisingly faster. More info on readr here.
# Load Packages
library(readr)
library(data.table)
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
setDT(dic_pes2013) # convert to data.table
# read to data frame
pesdata2 <- read_fwf("Dados/PES2013.txt",
                     fwf_widths(dic_pes2013[, width],
                                col_names = dic_pes2013[, varname]),
                     progress = interactive())
Takeaway: readr seems to be the best option: it's faster, you don't need to worry about column types, the code is shorter, and it shows a progress bar :)
Solution using LaF (20 seconds)
LaF is one of the (maybe THE) fastest ways to read fixed-width files in R, according to this benchmark. It took me 20 seconds to read the person-level file (PES) into a data frame.
Here is the code:
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
# Read .txt file using LaF. This is virtually instantaneous
pesdata <- laf_open_fwf("./Dados/PES2013.txt",
                        column_types = rep("character", length(dic_pes2013[,"width"])),
                        column_widths = dic_pes2013[,"width"],
                        column_names = dic_pes2013[,"varname"])
# convert to data frame. This took me 20 sec.
system.time( pesdata <- pesdata[,] )
Note that I've used character in column_types. I'm not quite sure why the command returns an error if I try integer or numeric. This shouldn't be a problem, since you can convert the columns to numeric like this:
# convert the V* columns to numeric
varposition <- grep("V", colnames(pesdata))
pesdata[varposition] <- lapply(pesdata[varposition], as.numeric)
sapply(pesdata, class)
You can try read.SAScii.sqlite, also by Anthony Damico. It's 4x faster and leads to no RAM issues (as the author himself describes), but it imports the data into a self-contained SQLite database file (no SQL server needed) -- not into a data.frame. You can then open it in R through a database connection. Here is the GitHub address for the code:
https://github.com/ajdamico/usgsd/blob/master/SQLite/read.SAScii.sqlite.R
In the R console, you can just run:
source("https://raw.githubusercontent.com/ajdamico/usgsd/master/SQLite/read.SAScii.sqlite.R")
Its arguments are almost the same as those for the regular read.SAScii.
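Once the import finishes, a minimal sketch of opening the result from R (the file and table names below are hypothetical placeholders for whatever read.SAScii.sqlite created):
library(DBI)
## connect to the self-contained SQLite file -- the file name is a placeholder
con <- dbConnect(RSQLite::SQLite(), "pnad2013.sqlite")
dbListTables(con) ## find the imported table's name
dom <- dbReadTable(con, "dom2013") ## table name is a placeholder
dbDisconnect(con)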
I know you are asking for a tip on how to use LaF. But I thought this could also be useful to you.
I think that the best choice is to use fwf2csv() from the descr package (C++ code). I will illustrate the procedure with PNAD 2013. Be aware that I'm assuming you already have the dictionary (dicdom below) with three variables -- beginning of the field, size of the field, and variable name -- and the data at Dados/.
library(bit64)
library(data.table)
library(descr)
library(reshape)
library(survey)
library(xlsx)
end_dom <- dicdom$beggining + dicdom$size - 1
fwf2csv(fwffile='Dados/DOM2013.txt', csvfile='dadosdom.csv', names=dicdom$variable, begin=dicdom$beggining, end=end_dom)
dadosdom <- fread(input='dadosdom.csv', sep='auto', sep2='auto', integer64='double')

How to combine multiple datasets into one in R?

I have 3 text files, each with the same 14 columns. I want to first read these 3 files (data frames) and then combine them into one data frame. Following is what I have tried after finding some help on the R mailing list:
file_name <- list.files(pattern='sEMA*') # CREATING A LIST OF FILE NAMES OF FILES HAVING 'sEMA' IN THEIR NAMES
NGSim <- lapply (file_name, read.csv, sep=' ', header=F, strip.white=T) # READING ALL THE TEXT FILES
This piece of code can read the files altogether but does not combine them into one data frame. I have tried data.frame(NGSim) but R gives an error: cannot allocate vector of size 4.2 Mb. How can I combine the files in one single data frame?
Like this:
do.call(rbind, NGSim)
or, with plyr:
library(plyr)
rbind.fill(NGSim)
or:
ldply(NGSim)
If file size is an issue, you may want to use data.table functions instead of less efficient base functions like read.csv().
library(data.table)
NGSim <- data.frame(rbindlist(lapply(list.files(pattern='sEMA*'),fread)))
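For completeness, a minimal modern alternative (assuming the files were already read into the NGSim list as above) is dplyr's row-binding, which also matches columns by name:
library(dplyr)
## stack the list of data frames row-wise
NGSim_combined <- bind_rows(NGSim)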
