I have a user-defined function:
xml2csv <- function(inputFile, outputFile) {
  library(splitstackshape)  # provides concat.split.multiple
  X <- read.table(inputFile, header = FALSE, fill = TRUE, sep = " ")
  # step 1: separate cells on one or more spaces
  Y <- concat.split.multiple(X, as.vector(colnames(X)), "")
  # step 2: separate cells on ":"
  Z <- concat.split.multiple(Y, as.vector(colnames(Y))[-c(1:2)], ":")
  # delete repeated rows
  U <- Z[Z[, 1] != "__REPEAT__", ]
  # convert factor columns to character
  V <- data.frame(lapply(U, as.character), stringsAsFactors = FALSE)
  W <- V
  W[is.na(W)] <- 0
  write.csv(W, outputFile, quote = FALSE, row.names = FALSE)
}
It works perfectly fine for a small inputFile, but when the input file is big (> 2000 KB), the following error appears:
Error in textConnection(text, encoding = "UTF-8") : all connections are in use
Any suggestions?
It's hard to tell what's going on exactly (your error is not reproducible!), but my money is on a bug/limitation of concat.split.multiple when you have too many columns in X, since
it calls concat.split on each of the columns in lapply,
which uses textConnection internally,
which opens a new R connection each time,
and R caps the number of simultaneously open connections (128 in a stock build), so with enough columns the pool runs dry :).
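If that's the culprit, one workaround is to avoid textConnection altogether by doing both splits with strsplit, which never opens a connection. A minimal sketch, assuming whitespace- and colon-delimited fields as in the function above; the "0" padding and the __REPEAT__ filter mirror the original steps, so adjust them to your data:
xml2csv_noconn <- function(inputFile, outputFile) {
  lines <- readLines(inputFile)
  # steps 1 and 2 in one pass: split on runs of whitespace, then on ":"
  fields <- lapply(strsplit(trimws(lines), "[[:space:]]+"),
                   function(x) unlist(strsplit(x, ":", fixed = TRUE)))
  n <- max(lengths(fields))
  # pad ragged rows with "0" (mirrors the W[is.na(W)] <- 0 step above)
  mat <- t(vapply(fields, function(x) c(x, rep("0", n - length(x))),
                  character(n)))
  mat <- mat[mat[, 1] != "__REPEAT__", , drop = FALSE]  # drop repeat rows
  write.csv(as.data.frame(mat, stringsAsFactors = FALSE), outputFile,
            quote = FALSE, row.names = FALSE)
}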
Applying labels is an important part of making survey data comprehensible when reported,
so the best example I can find uses expss::apply_labels(),
e.g. the famous mtcars example: https://cran.r-project.org/web/packages/expss/vignettes/tables-with-labels.html
As input this requires a data.table and a list of comma-separated assignment pairs, e.g.
apply_labels(dt, col1 = "label1", col2 = "label2", col3 = "label3")
This is fine if you have one data file and a few columns and can be bothered to type them in each time, but it's not very helpful if you have lots of data files. So how could one load a CSV metadata file
in format:
Col1 Col2 Col3
Label1 Label2 Label3
where the Col names match the same names in the data table.
This means effectively translating the metadata CSV file so that it generates
coln = "labeln"
for each column.
So far I have found the biggest problem to be that the apply_labels column names are objects, not strings, and it is very difficult to translate a string into an object in the right scope.
This is where I've got to
library(expss)
library(data.table)
library(glue)
readcsvdata <- function(dfile) {
  rdata <- fread(file = dfile, sep = ",", quote = "\"", header = TRUE,
                 stringsAsFactors = FALSE,
                 na.strings = getOption("datatable.na.strings", "NA"))
  return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
commonnames <- intersect(names(mdt),names(rdt)) # find common
qlabels <- as.character(mdt[1, commonnames, with = FALSE])
comslist <- list()
for (i in 1:length(commonnames)) {  # loop through commonnames and qlabels
  if (i == length(commonnames)) {
    x <- glue('{commonnames[i]} = "{qlabels[i]}"')   # no comma after final item
  } else {
    x <- glue('{commonnames[i]} = "{qlabels[i]}",')  # comma before next item
  }
  comslist[[i]] <- x
}
comstring <- paste(unlist(comslist), collapse = '')
rdt <- apply_labels(rdt, eval(parse(text = comstring)))
which yields
Error in parse(text = comstring) : <text>:1:24: unexpected ','
1: varone = "Label1",
                      ^
oh and print(comstring) produces:
[1] "varone = \"Question one\",vartwo = \"Question two\",varthree =
\"Question three\",varfour = \"Question four\",varfive = \"Question
five\",varsix = \"Question six\",varseven = \"Question
seven\",vareight = \"Question eight\",varnine = \"Question
nine\",varten = \"Question ten\""
apply_labels is not very convenient for assigning labels from an external dictionary. You can use var_lab instead:
library(expss)
library(data.table)
readcsvdata <- function(dfile) {
  rdata <- fread(file = dfile, sep = ",", quote = "\"", header = TRUE,
                 stringsAsFactors = FALSE,
                 na.strings = getOption("datatable.na.strings", "NA"))
  return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
commonnames <- intersect(names(mdt),names(rdt)) # find common
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
for (each_name in commonnames) {  # loop through commonnames and qlabels
  var_lab(rdt[[each_name]]) <- qlabels[[each_name]]
}
There is a similar val_lab function for value labels. Additionally, you may be interested in the apply_dictionary and create_dictionary functions. To get help on them, type ?apply_dictionary in the console.
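For instance, a dictionary-based version of the loop above might look roughly like this. This is only a sketch: the dictionary layout (columns variable, value, label, meta, with meta = "varlab" marking variable labels) is my reading of ?apply_dictionary, so verify it against your expss version:
# sketch: build an expss dictionary from the metadata and apply it in one go
# (column layout is an assumption based on ?apply_dictionary)
dict <- data.frame(variable = commonnames,
                   value    = NA,
                   label    = unlist(qlabels),
                   meta     = "varlab",   # mark each row as a variable label
                   stringsAsFactors = FALSE)
rdt <- apply_dictionary(rdt, dict)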
I don't have expss handy, but I think this is generically about how to programmatically assign function arguments in R.
If you start with a CSV file that contains the three pairings you need,
csvdat <- read.csv(stringsAsFactors=FALSE, text="
col1,col2,col3
label1,label2,label3")
I'll write a fake function (since I don't have expss, and it's not critical) that takes a first argument and zero or more follow-on arguments dynamically.
my_fake_labels <- function(x, ...) {
  dots <- list(...)
  message("x labels : ", paste(sQuote(colnames(x)), collapse = ", "))
  message("other names: ", paste(sQuote(names(dots)), collapse = ", "))
}
library(data.table)  # for data.table()
origDT <- data.table(aa = 1, bb = 2)
my_fake_labels(origDT, col1="label1", col2="label2", col3="label3")
# x labels : 'aa', 'bb'
# other names: 'col1', 'col2', 'col3'
It's that manual argument-setting that you're trying to avoid. (I know I'm not doing any label-setting here, let's ignore that for now.)
The programmatic way of doing this, using origDT as the first argument, and the elements of csvdat as the second and subsequent arguments:
do.call(my_fake_labels, c(list(origDT), csvdat))
# x labels : 'aa', 'bb'
# other names: 'col1', 'col2', 'col3'
The second argument to do.call needs to be a list, optionally named. Since a data.frame (and therefore a data.table) is just a glorified named list, this fits the bill. What this does is take each element of the list and apply it as the corresponding arguments of the function (the first argument of do.call).
The list(origDT) is because normally the c(...) function would concatenate the columns/elements of the two lists. If we did just c(origDT, csvdat), then the function would be called with ncol(origDT) + ncol(csvdat) arguments, instead of the desired 1 + ncol(csvdat). For this, c(list(origDT), ...) makes sure that the whole origDT is the function's first argument.
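You can check the argument counts directly (origDT and csvdat as defined above):
length(c(origDT, csvdat))        # 5 elements: origDT's two columns plus
                                 # csvdat's three columns, all flattened
length(c(list(origDT), csvdat))  # 4 elements: origDT as a single element,
                                 # plus csvdat's three columns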
(It might also be easy to form the csvdat programmatically instead of requiring an external file, but I'm guessing that you have a reason to do it via CSV.)
Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
destfile = temp, mode = "wb")
processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while (TRUE) {
    line = readLines(con, n = 1)
    if (length(line) == 0) {
      break
    }
    append(dat_list, line)
  }
  close(con)
  return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative. (As an aside, your version returns an empty list because append(dat_list, line) returns a new list rather than modifying dat_list in place; you would need dat_list <- append(dat_list, line).)
processFile = function(filepath, header = TRUE, ...) {
  lines <- readLines(filepath)
  comments <- which(grepl("^#", lines))
  header_row <- gsub("^#", "", lines[tail(comments, 1)])
  data <- read.table(text = c(header_row, lines[-comments]), header = header, ...)
  return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#", and ignore them except for the last one, which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
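If grep and wc aren't available (e.g. on Windows), the same count can be done in R itself, at the cost of reading the whole file once; a small equivalent sketch:
# count the leading "#" comment lines without shelling out
n_comments <- sum(startsWith(readLines(temp), "#"))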
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
filter(!stringr::str_detect(text, "^#")) %>%
mutate(text = trimws(text)) %>%
tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
mutate_all(as.numeric)
What I want to do is take every file in the subdirectory I am in and essentially just shift the column header names over one to the left.
I try to accomplish this by using fread in a for loop:
library(data.table)
## I need to write this script to reorder the column headers which are now apparently out of wack
## I just need to shift them over one
filelist <- list.files(pattern = ".*.txt")
for (i in 1:length(filelist)) {
  assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
  names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1",
                            "Direction", "Spearman_rho", "-log10(p)")
}
However, I keep getting the following or a variant of the following error message:
Error in names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1", :
'names' attribute [8] must be the same length as the vector [1]
Which is confusing to me because, as you can clearly see above, RStudio is able to load the files with the correct number of columns. However, the error message seems to imply that there is only one column. I have tried different functions, such as colnames, and I have even tried to define the separator as the quotation mark (my files were previously generated by another R script that quote-separated the entries), with no luck. In fact, if I try to define the separator as such:
for (i in 1:length(filelist)) {
  assign(filelist[[i]], fread(filelist[[i]], sep = "\"", fill = TRUE))
  names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1",
                            "Direction", "Spearman_rho", "-log10(p)")
}
I get the following error:
Error in fread(filelist[[i]], sep = "\"", fill = TRUE) :
sep == quote ('"') is not allowed
Any help would be appreciated.
I think the problem is that, despite the name, list.files returns a character vector, not a list, so using [[ isn't right. Then, with assign, you create objects that have the same names as the files (not good practice; it would be better to use a list). Then you try to modify the names of the object created, but you are only using the character string of the object's name. To use an object whose name is stored in a character string, you need to use get (which is part of why using a list is better than creating a bunch of objects).
To be more explicit, let's say that filelist = c("data1.txt", "data2.txt"). Then, when i = 1, this code: assign(filelist[[i]], fread(filelist[[i]], fill = TRUE)) creates a data table called data1.txt. But your next line, names(filelist[[i]]) <- ..., doesn't modify your data table; it modifies the first element of filelist, which is the string "data1.txt", and that string indeed has length 1.
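A quick illustration of the mismatch (with hypothetical file names):
library(data.table)
filelist <- c("data1.txt", "data2.txt")
assign(filelist[[1]], data.table(a = 1, b = 2))  # creates an object named "data1.txt"
length(filelist[[1]])     # 1 -- filelist[[1]] is just the string "data1.txt"
ncol(get(filelist[[1]]))  # 2 -- get() retrieves the actual data table by name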
I recommend reading your files into a list instead of using assign to create objects.
filelist <- list.files(pattern = ".*.txt")
datalist <- lapply(filelist, fread, fill = TRUE)
names(datalist) <- filelist
For changing the names, you can use data.table::setnames instead; it works in a plain for loop because it modifies each data.table by reference:
for (dt in datalist) setnames(dt, c("RowID", "rsID", "PosID", "Link", "Link.1", "Direction", "Spearman_rho", "-log10(p)"))
However, fread has a col.names argument, so you can just do it in the read step directly:
my_names <- c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)")
datalist <- lapply(filelist, fread, fill = TRUE, col.names = my_names)
I would also suggest not using "-log10(p)" as a column name: nonstandard column names (with parens and -) are usually more trouble than they are worth, since you have to wrap them in backticks every time you refer to them.
Could you run the following code to have a closer look at what you are putting into filelist?
i <- 1
assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
print(filelist[[i]])
I suspect you may need to use the code below instead of the assign statement
filelist[[i]] <- fread(filelist[[i]], fill = TRUE)
I have several hundred .pet files organized by date code (e.g. 19960101, in YYYYMMDD format). I'm trying to add a column, NDate, with the date code:
for (pet.atual in files.pet) {
  data.pet.atual <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  data.pet.atual <- cbind(data.pet.atual, NDate = pet.atual)
}
What I'm trying to achieve, for example, is NDate = 19960101 for 01-01-1996, NDate = 19960102 for 02-01-1996, and so on. But the for loop just replaces the NDate field with the latest pet.atual every time it runs. Ideas? Thanks
Small modification should do the trick:
data.pet.atual <- NULL
for (pet.atual in files.pet) {
  tmp.data <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  tmp.data <- cbind(tmp.data, NDate = pet.atual)
  data.pet.atual <- rbind(data.pet.atual, tmp.data)
}
You can also replace tmp.data <- cbind(...) with tmp.data$NDate <- pet.atual.
You may also try fread() and rbindlist() from the data.table package (untested due to lack of a reproducible example):
library(data.table)
result <- rbindlist(lapply(files.pet, fread), idcol = "NDate")
result[, NDate := anytime::anydate(files.pet[NDate])]
lapply() "loops" over all entries in files.pet executing fread() for each entry and returns a list with the data.tables fread has created from reading each file. rbindlist() is used to combine all pieces into one large data.table. The parameter idcol = NDate generates an index column named NDate to identify the origin of each row in the final output. The ids are integer numbers 1 to the length of the list (if the list is not named).
Finally, the id number is used to lookup the file name in files.pet which is directly converted to class Date using the anytime package.
EDIT: It might be more efficient to convert the file names to Date first, before looking them up:
result[, NDate := anytime::anydate(files.pet)[NDate]]
Although fread() is pretty smart about analysing the files and guessing the right parameters for reading them, it might be necessary (and perhaps faster as well) to supply additional parameters, e.g.:
result <- rbindlist(lapply(files.pet, fread, header = FALSE, sep = ","), idcol = "NDate")
Yes, lapply will help, as Frank suggests. And you want to use rbind to keep the dates different for each file. Something along the lines of:
I'm assuming files.pet is a list of all the files you want to include...
my.fun <- function(file) {
  data <- read.table(file = file,
                     header = FALSE,
                     sep = ",",
                     quote = "\"",
                     comment.char = ";")
  data$NDate <- file
  return(data)
}
data.pet.atual <- do.call(rbind.data.frame, lapply(files.pet, FUN = my.fun))
I can't test this without a reproducible example, so you may need to play with it a bit, but the general approach should work!
Is it possible in R to save a data frame (or data.table) into a textfile that contains different separators for various columns?
For example:
Column1[TAB]Column2[,]Column3
The brackets indicate the separators, here a tab and a comma.
The function write.table accepts only one separator.
MASS::write.matrix can do the trick:
require(MASS)
m <- matrix(1:12, ncol = 3)
write.matrix(m, file = "", sep = c("\\tab", ","), blocksize = 1)
returns
1\tab5,9
2\tab 6,10
3\tab 7,11
4\tab 8,12
but as the documentation of this function does not say that multiple separators are allowed, it may be safer to do it yourself, just in case the above has side effects.
For example,
seps <- c("\\tab", ",", "\n")
# cat() recycles the separators between elements; because one of them
# contains "\n", cat also emits a separator after the last element of
# each row, which is what produces the line breaks
apply(m, 1, function(x, seps)
  cat(x, file = "", sep = seps, append = TRUE), seps = seps)
returns
1\tab5,9
2\tab6,10
3\tab7,11
4\tab8,12
Be aware that append is set to TRUE, so if you write to a file that already exists, the new rows will be appended to it rather than replacing its contents. Note also that the literal "\\tab" above is only there to make the separator visible; for a real tab character, use "\t".