Saving .dta: data frame with very long strings using R - r

I have a df with multiple variables, some are very long strings with up to 4500 characters. I would like to export this database as a .dta file.
I try to save it using haven's write_dta() function, but I get the following error message: Error in write_dta_(data, normalizePath(path, mustWork = FALSE), version = stata_file_format(version), : Writing failure: A provided string value was longer than the available storage size of the specified column.
Here is an example of the issue:
library(haven)
longFun <- function(n) {
do.call(paste0, replicate(5000, sample(LETTERS, n, TRUE), FALSE))
}
longString <- data.frame(VeryveryveryveryveryveryveryveryveryveryVeryveryveryveryveryveryveryveryveryverylongname = longFun(1), stringsAsFactors = F)
write_dta(longString,"tst.dta")
I am aware that write_dta has issues handling long strings (https://github.com/tidyverse/haven/issues/437) and that one possibility is to trim the strings (Error in write_dta : A provided string value was longer than the available storage size of the specified column). But it is essential for me to keep the full strings.
Is there any way to save variables with long strings as .dta files using R?
Edit:
I have tried the readstata13::save.dta13 option suggested by @jay.sf, but it has two issues: 1) It cannot handle variable names longer than 32 characters (it truncates them), whereas write_dta() handles them well. 2) It is significantly slower than write_dta(). Since I have to save a very large dataset, this is a relevant concern.
In sum, is there any other approach that would allow me to
a) save as .dta a df with very long strings
b) retain the original variable names (longer than 32 characters)
c) do this in a relatively fast manner?

Use save.dta13 from the readstata13 package.
R:
readstata13::save.dta13(longString, "tst.dta")
Stata:
. use "V:\tst.dta"
. list
+------------------------------------------------------------------------------------------------------+
| V1 |
|------------------------------------------------------------------------------------------------------|
1. | GZSPZGLLKOQHETKURLPKQDTZWTNHLDJDUSAFAXHFMPKUDIZURKIFLWQSXSFBLTPBGBLJKTDYJCHZOPZCFYKIMLGTQGDKRNBGUI.. |
+------------------------------------------------------------------------------------------------------+

Related

Write stata dataframe in R [duplicate]

I am getting an error while exporting an R data frame to Stata format. Numeric columns are written without problems, but when I include strings I get the following error:
library(foreign)
write.dta(newdata, "X.dta")
Error in write.dta(newdata, "X.dta") :
empty string is not valid in Stata's documented format
I have a few string variables, like location, name etc., which have missing values, and that is probably causing this problem. Is there a way to handle this?
I've had this error many times before, and it's easy to reproduce:
library(foreign)
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write.dta(test, 'example.dta')
One solution is to use factor variables instead of character variables, e.g.,
for (colname in names(test)) {
if (is.character(test[[colname]])) {
test[[colname]] <- as.factor(test[[colname]])
}
}
Another is to change the empty strings to something else and change them back in Stata.
This is purely a problem with write.dta, because Stata is perfectly fine with empty strings. But since foreign is frozen, there's not much you can do about that.
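The placeholder workaround mentioned above could look roughly like the sketch below. The marker "__EMPTY__" is just a made-up value; pick anything that cannot occur in your real data.

```r
# Small example data with an empty string in a character column
test <- data.frame(a = c("", "x"), b = 1:2, stringsAsFactors = FALSE)

# Hypothetical marker value (an assumption, not something foreign defines)
placeholder <- "__EMPTY__"

# Replace "" by the marker in character columns only; other columns are untouched
test[] <- lapply(test, function(x) {
  if (is.character(x)) x[!is.na(x) & x == ""] <- placeholder
  x
})

# After loading the file in Stata, the marker can be reverted with something like:
#   replace a = "" if a == "__EMPTY__"
```

After this step the data frame no longer contains empty strings, so write.dta should not hit the error for that reason.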
Update: (2015-12-04) A better solution is to use write_dta in the haven package:
library(haven)
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write_dta(test, 'example.dta')
This way, Stata reads string variables properly as strings.
You could use the great readstata13 package (which kindly imports only the Rcpp package).
readstata13::save.dta13(mtcars, 'mtcars.dta')
The function can also save in the Stata 15/16 MP file format (experimental), which is the next update after the Stata 13 format.
readstata13::save.dta13(mtcars, 'mtcars15.dta', version="15mp")
Note: Of course, this also works with OP's data:
readstata13::save.dta13(data.frame(a="", b=1), 'my_data.dta')

Error in write_dta : A provided string value was longer than the available storage size of the specified column

I am trying to export my data table from R Studio to the dta format. I use write_dta function from haven library in R and get the following error:
A provided string value was longer than the available storage size of the specified column.
I am quite new to R and Stata and don't understand what it means or what I should do about it.
It sounds like you have a piece of long text in your data.frame. write_dta has known issues handling long strings (https://github.com/tidyverse/haven/issues/437). You can trim the strings in your data.frame like this:
df <- YOUR_DATA
# lapply() keeps each column's type; apply() would first coerce the whole data.frame to a character matrix
df[] <- lapply(df, function(x) if (is.character(x)) substr(x, 1, 128) else x)
And then try write_dta(df). A maximum length of 128 characters should be safe, but newer versions of Stata can handle a lot more.
I noticed that with the data.frame solution, any labels will get lost. A tibble would allow one to keep labels (e.g. an imported *.sav file with labels from a survey collection platform).
Here is a tidyverse solution using haven to read and write that would keep labels. Keep in mind that your initial df also needs to be a tibble.
library(tidyverse)
df <- haven::read_sav("YOUR FILE.sav") # could also be some other file format that you start with as a tibble
df <- df %>%
mutate(across(where(is.character), ~ substr(., 1, 2045)))
haven::write_dta(df, "NAME OF NEW FILE.dta")
For me, the maximum string length that worked with write_dta(df) was 2045, which matches the upper limit of Stata's fixed-length str# storage type (str2045).
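Before trimming anything, it can help to see which columns are actually over the limit. A minimal base R sketch, assuming the 2045 figure reported above is the relevant cutoff for your haven version (the example data and column names are made up):

```r
# Made-up example data: one column holds a 3000-character string
df <- data.frame(
  id    = 1:2,
  short = c("a", "bb"),
  long  = c(strrep("x", 3000), "y"),
  stringsAsFactors = FALSE
)

limit <- 2045  # assumption: the length reported to work above

# Longest string per character column; NA for non-character columns
max_len <- vapply(df, function(x) {
  if (is.character(x)) max(nchar(x), 0L, na.rm = TRUE) else NA_integer_
}, integer(1))

# Names of the columns that would need trimming (or another writer)
too_long <- names(max_len)[!is.na(max_len) & max_len > limit]
too_long
```

This only reports the offending columns; whether to truncate them or switch to a different export function is still your call.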

Read a 20GB file in chunks without exceeding my RAM - R

I'm currently trying to read a 20GB file. I only need 3 columns of that file.
My problem is that I'm limited to 16 GB of RAM. I tried using readr and processing the data in chunks with the function read_csv_chunked, and read_csv with the skip parameter, but both exceeded my RAM limit.
Even the read_csv(file, ..., skip = 10000000, nrow = 1) call that reads one line uses up all my RAM.
My question now is, how can I read this file? Is there a way to read chunks of the file without using that much ram?
The LaF package can read in ASCII data in chunks. It can be used directly or if you are using dplyr the chunked package uses it providing an interface for use with dplyr.
The readr package has read_csv_chunked and related functions.
The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.
It may be that if you remove all but the first three columns that it will be small enough to just read it in and process in one go.
vroom in the vroom package can read in files very quickly and also has the ability to read in just the columns named in the select= argument which may make it small enough to read it in in one go.
fread in the data.table package is a fast reading function that also supports a select= argument which can select only specified columns.
read.csv.sql in the sqldf (also see github page) package can read a file larger than R can handle into a temporary external SQLite database which it creates for you and removes afterwards and reads the result of the SQL statement given into R. If the first three columns are named col1, col2 and col3 then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments which will depend on your file.
library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file",
dbname = tempfile(), ...)
read.table and read.csv in base R have a colClasses= argument which takes a vector of column classes. If the file has nc columns, use colClasses = rep(c(NA, "NULL"), c(3, nc-3)) to read only the first 3 columns.
Another approach is to pre-process the file using cut, sed or awk (available natively in UNIX and in the Rtools bin directory on Windows) or any of a number of free command line utilities such as csvfix outside of R to remove all but the first three columns and then see if that makes it small enough to read in one go.
Also check out the High Performance Computing task view.
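As a small illustration of the colClasses= option from the list above, here is a sketch on a throwaway file. The file and its column names are invented for the example; for the real 20GB file you would point csv_path at your own data.

```r
# Write a small throwaway CSV with five columns
csv_path <- tempfile(fileext = ".csv")
X <- data.frame(id = 1:5, a = letters[1:5], b = (1:5) / 2,
                c = 5:1, d = LETTERS[1:5])
write.csv(X, csv_path, row.names = FALSE)

# Count the columns by reading just one data row
nc <- ncol(read.csv(csv_path, nrows = 1))

# NA means "guess the type"; "NULL" means "skip this column entirely"
first3 <- read.csv(csv_path, colClasses = rep(c(NA, "NULL"), c(3, nc - 3)))
names(first3)
```

Skipped columns are never stored in the result, so the data frame only holds the three columns you asked for.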
We can try something like this, first a small example csv:
X = data.frame(id=1:1e5,matrix(runif(1e6),ncol=10))
write.csv(X,"test.csv",quote=F,row.names=FALSE)
You can use the nrows argument of read.csv. Instead of providing a file name, you provide a connection, and you store the results in a list, for example:
data = vector("list",200)
con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])
id X1 X2 X3
1 1 0.13870273 0.4480100 0.41655108
2 2 0.82249489 0.1227274 0.27173937
3 3 0.78684815 0.9125520 0.08783347
4 4 0.23481987 0.7643155 0.59345660
5 5 0.55759721 0.6009626 0.08112619
6 6 0.04274501 0.7234665 0.60290296
In the above, we read the first chunk, collected the colnames and subsetted. If you carry on reading through the connection, the headers will be missing, and we need to specify that:
for(i in 2:200){
data[[i]] = read.csv(con, nrows=1000,col.names=COLS,header=FALSE)[,1:3]
}
Finally, we combine all of those into a single data.frame:
data = do.call(rbind,data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE
You can see that I specified a much larger list than required. This is to show that if you don't know how long the file is, you can specify something larger and it should still work. This is a bit better than writing a while loop.
So we wrap it into a function, specifying the file, number of rows to read at one go, the number of times, and the column names (or position) to subset:
read_chunkcsv=function(file,rows_to_read,ntimes,col_subset){
  # one list slot per chunk
  data = vector("list",ntimes)
  con = file(file,"r")
  on.exit(close(con)) # release the connection when the function exits
  data[[1]] = read.csv(con, nrows=rows_to_read)
  COLS = colnames(data[[1]])
  data[[1]] = data[[1]][,col_subset]
  # later chunks have no header line, so reuse the names from the first chunk
  for(i in seq_len(ntimes)[-1]){
    data[[i]] = read.csv(con,
      nrows=rows_to_read,col.names=COLS,header=FALSE)[,col_subset]
  }
  return(do.call(rbind,data))
}
all.equal(X[,1:3],
read_chunkcsv("test.csv",rows_to_read=10000,ntimes=10,1:3))
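If guessing ntimes is inconvenient, one alternative is to keep reading until the connection is exhausted. The sketch below is only a variation on the function above; read_csv_until_eof is a made-up name, and the tryCatch is there because I would not rely on how read.csv behaves at end of file (it may return zero rows or raise an error).

```r
# Hypothetical helper: read a CSV chunk by chunk until nothing is left
read_csv_until_eof <- function(path, chunk_rows, col_subset) {
  con <- file(path, "r")
  on.exit(close(con))  # always release the connection

  # First chunk carries the header line
  first <- read.csv(con, nrows = chunk_rows)
  cols <- colnames(first)
  chunks <- list(first[, col_subset, drop = FALSE])

  repeat {
    # Later chunks have no header, so reuse the names from the first chunk
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_rows, header = FALSE, col.names = cols),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    chunks[[length(chunks) + 1]] <- chunk[, col_subset, drop = FALSE]
  }
  do.call(rbind, chunks)
}

# Small demonstration on a throwaway file
demo_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:25, a = 25:1, b = (1:25) / 2, c = 1:25)
write.csv(demo, demo_path, row.names = FALSE)
res <- read_csv_until_eof(demo_path, chunk_rows = 10, col_subset = 1:3)
dim(res)
```

Only the selected columns of each chunk are kept, so peak memory is roughly one full chunk plus the accumulated subset.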

How to bind rows in R such that instead of type conversion error binding defaults to filling value to NA?

I am currently tasked with merging multiple xlsx files into one master R (.rds) data file. Since these files are filled in manually, there are a lot of type conversion errors when using approaches such as dplyr::bind_rows, such as
Column ``XYZ`` can't be converted from numeric to character
I need the binding to be "smart", i.e. to match columns by name across the data frames being merged. But when a conversion issue comes up, I would like the problematic cell contents to be set to NA, with perhaps just a warning instead of an error.
Is there a convenient way/function for doing this in R?
I have used bind_rows from the dplyr package.
My current import procedure
library(readxl)   # read_excel
library(stringr)  # str_squish
library(dplyr)    # bind_rows
files <- list.files("data",pattern = "xlsx", full.names = TRUE)
tmp <- read_excel(files[1], sheet = "data", trim_ws = TRUE)
names(tmp) <- make.names(str_squish(names(tmp)))
for (i in 2:length(files)) {
print(i)
tmp2 <- read_excel(files[i], sheet = "data",trim_ws = TRUE)
names(tmp2) <- make.names(str_squish(names(tmp2)))
tmp<-bind_rows(tmp,tmp2)
}
It has been pointed out that using a loop here is not efficient, but since the files are messy - many manual mistakes - and relatively small in number I focused on being able to sequentially track the binding process.
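One possible way to get this behaviour, sketched in base R: coerce every column to character before binding, so the bind itself can never fail on types, and convert columns back afterwards, letting unparseable cells become NA. bind_rows_as_text is a made-up helper name and the two small data frames stand in for the imported sheets. With readxl, passing col_types = "text" to read_excel should give all-character columns from the start.

```r
# Hypothetical helper: bind data frames by column name after coercing
# every column to character, filling missing columns with NA
bind_rows_as_text <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  aligned <- lapply(dfs, function(d) {
    d[] <- lapply(d, as.character)
    for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA_character_
    d[all_cols]
  })
  do.call(rbind, aligned)
}

# Stand-ins for two manually filled sheets with clashing types
a <- data.frame(id = 1:2, score = c(3.5, 4), stringsAsFactors = FALSE)
b <- data.frame(id = 3:4, score = c("n/a", "5"), note = c("x", "y"),
                stringsAsFactors = FALSE)

merged <- bind_rows_as_text(list(a, b))

# Convert back; cells that cannot be parsed become NA.
# Drop suppressWarnings() to get the "NAs introduced by coercion" warning.
merged$score <- suppressWarnings(as.numeric(merged$score))
merged$id <- as.integer(merged$id)
merged
```

The trade-off is that you decide the target type per column yourself, which is probably what you want with messy, hand-edited files anyway.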

Vectorise an imported variable in R

I have imported a CSV file to R but now I would like to extract a variable into a vector and analyse it separately. Could you please tell me how I could do that?
I know that the summary() function gives a rough idea but I would like to learn more.
I apologise if this is a trivial question but I have watched a number of tutorial videos and have not seen that anywhere.
Read data into data frame using read.csv. Get names of data frame. They should be the names of the CSV columns unless you've done something wrong. Use dollar-notation to get vectors by name. Try reading some tutorials instead of watching videos, then you can try stuff out.
d = read.csv("foo.csv")
names(d)
v = d$whatever # for example
hist(v) # for example
This is totally trivial stuff.
I assume you have used the read.csv() or the read.table() function to import your data into R. (You can get help directly in R with ?, e.g. ?read.csv.)
So normally, you have a data.frame. And if you check the documentation the data.frame is described as a "[...]tightly coupled collections of variables which share many of the properties of matrices and of lists[...]"
So basically you can already handle your data as vectors.
A quick search on SO turned up these two posts, among others:
Converting a dataframe to a vector (by rows) and
Extract Column from data.frame as a Vector
And I am sure there are more relevant ones. Try some good tutorials on R (videos are not so instructive in this case).
There are a ton of good ones on the Internet, e.g.:
* http://www.introductoryr.co.uk/R_Resources_for_Beginners.html (which lists some)
or
* http://tryr.codeschool.com/
Anyways, one way to deal with your csv would be:
#import the data to R as a data.frame
mydata = read.csv(file="SomeFile.csv", header = TRUE, sep = ",",
quote = "\"",dec = ".", fill = TRUE, comment.char = "")
#extract a column to a vector
firstColumn = mydata$col1 # extract the column named "col1" of mydata to a vector
#This previous line is equivalent to:
firstColumn = mydata[,"col1"]
#extract a row
firstline = mydata[1,] #first row of mydata; note this is still a one-row data.frame, not a vector
Edit: In some cases[1], you might need to coerce the data into a vector by applying functions such as as.numeric or as.character:
firstline=as.numeric(mydata[1,])#extract the first row of mydata to a vector
#Note: the entire row *has to be* numeric or compatible with that class
[1] e.g. it happened to me when I wanted to extract a row of a data.frame inside a nested function
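To make the column versus row difference concrete, here is a tiny sketch with an invented two-column data.frame (col1 and col2 are just example names):

```r
# Invented example data; in practice this would come from read.csv()
mydata <- data.frame(col1 = c(1.5, 2.5), col2 = c(10, 20))

# A column comes out as a plain vector
first_column <- mydata$col1

# A row comes out as a one-row data.frame ...
first_row_df <- mydata[1, ]

# ... so flatten it explicitly if a vector is needed
first_row <- unlist(mydata[1, ])

is.data.frame(first_row_df)
is.numeric(first_row)
```

unlist() keeps the column names as element names, which is handy; it only gives a sensible result when the columns of the row share a compatible type.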
