I'm trying to merge a directory full of comma delimited text files using R, while also incorporating the file name of each file as a new variable in the data set.
I've been using the following:
library(plyr)
file_list <- list.files()
dataset <- ldply(file_list, read.table, header=FALSE, sep=",")
Can anyone shed any light on how I'd add the file name for each file read as a new variable within dataset?
Many thanks,
-Jon
You can just make a wrapper around the read.table() function that adds in your filename variable. Something like this should work:
read.data <- function(file, header = FALSE){
  # read one file, then tag every row with the file it came from
  dat <- read.table(file, header = header, sep = ",")
  dat$fname <- file
  return(dat)
}
From there you just need to apply that function across your data files. Since you didn't post any example data I'm not sure what it actually looks like, but for now I'll assume it's as clean as can be and that rbind() is sufficient to join the pieces together. The demo files below are written with write.csv() and therefore carry a header row, so I pass header=TRUE through the wrapper; your headerless files can use the default. This example shows the function in action:
> data(iris)
> write.csv(iris,file="iris1.csv",row.names=F)
> write.csv(iris,file="iris2.csv",row.names=F)
> dataset <- do.call(rbind, lapply(list.files(pattern="csv$"), read.data, header=TRUE))
> head(dataset)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species fname
1 5.1 3.5 1.4 0.2 setosa iris1.csv
2 4.9 3.0 1.4 0.2 setosa iris1.csv
3 4.7 3.2 1.3 0.2 setosa iris1.csv
4 4.6 3.1 1.5 0.2 setosa iris1.csv
5 5.0 3.6 1.4 0.2 setosa iris1.csv
6 5.4 3.9 1.7 0.4 setosa iris1.csv
> table(dataset$fname)
iris1.csv iris2.csv
150 150
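Since you were already using plyr, the same wrapper drops straight into your ldply() call, which row-binds the per-file data frames for you. A minimal sketch, assuming your headerless files sit in the working directory as in your snippet:

library(plyr)

file_list <- list.files()
# ldply() applies read.data() to every file and row-binds the results
dataset <- ldply(file_list, read.data)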
Before my question, here is a little background.
I am creating a general purpose data shaping and charting library for plotting survey data of a particular format.
As part of my scripts, I am using the subset function on my data frame. The way I am working is that I have a parameter file where I can pass this subsetting criteria into my functions (so I don't need to directly edit my main library). The way I do this is as follows:
subset_criteria <- expression(variable1 != "" & variable2 == TRUE)
(where variable1 and variable2 are columns in my data frame, for example).
Then in my function, I call this as follows:
my.subset <- subset(my.data, eval(subset_criteria))
This part works exactly as I want it to work. But now I want to augment that subsetting criteria inside the function, based on some other calculations that can only be performed inside the function. So I am trying to find a way to combine together these subsetting expressions.
Imagine inside my function I create some new column in my data frame automatically, and then I want to add a condition to my subsetting that says that this additional column must be TRUE.
Essentially, I do the following:
my.data$newcolumn <- with(my.data, ifelse(...some condition..., TRUE, FALSE))
Then I want my subsetting to end up being:
my.subset <- subset(my.data, eval(subset_criteria & newcolumn == TRUE))
But simply writing it as above does not seem to be valid; I get the wrong result. So I'm looking for a way of combining these expressions using expression and eval so that I essentially get the conjunction of all the conditions.
Thanks for any pointers. It would be great if I can do this without having to rewrite how I do all my expressions, but I understand that might be what is needed...
Bob
You should probably avoid two things: using subset() in a non-interactive setting (see the warning in its help page) and eval(parse()). That said, here we go.
You can convert the expression into a string and append whatever you want to it. The trick is converting the string back into something that can be evaluated, which is where the aforementioned parse() comes in.
sub1 <- expression(Species == "setosa")
subset(iris, eval(sub1))
sub2 <- paste(sub1, '&', 'Petal.Width > 0.2')
subset(iris, eval(parse(text = sub2))) # your case
> subset(iris, eval(parse(text = sub2)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
22 5.1 3.7 1.5 0.4 setosa
24 5.1 3.3 1.7 0.5 setosa
27 5.0 3.4 1.6 0.4 setosa
32 5.4 3.4 1.5 0.4 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
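Applied to your setup, the same paste()/parse() trick can be evaluated directly against the data frame with eval(), which also sidesteps subset() inside a function (per the warning above). A minimal sketch with placeholder names (filter_data and the stand-in condition for newcolumn are hypothetical; variable1 and variable2 are the columns from your example):

subset_criteria <- expression(variable1 != "" & variable2 == TRUE)

filter_data <- function(my.data, subset_criteria) {
  # stand-in for the calculation that can only happen inside the function
  my.data$newcolumn <- seq_len(nrow(my.data)) %% 2 == 1  # hypothetical condition

  # append the extra condition as text, parse it back into an expression,
  # then evaluate it with the data frame supplying the columns
  full_criteria <- parse(text = paste(subset_criteria, "& newcolumn == TRUE"))
  keep <- eval(full_criteria, my.data)
  my.data[which(keep), ]
}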
I'm creating an R package with several files in /data. The way to locate a file shipped with an R package is system.file():
system.file(..., package = "base", lib.loc = NULL, mustWork = FALSE)
The file in /data I would like to load into an R data.table has the extension *.txt.gz, my_file.txt.gz. How do I load this into a data.table via read.table() or fread()?
Within the R script, I tried:
#' @import data.table
#' @export
my_function = function(){
  my_table = read.table(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header=TRUE)
}
This leads to an error when running devtools::document():
Error in read.table(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header = TRUE) (from script1.R#7) :
no lines available in input
In addition: Warning message:
In file(file, "rt") :
file("") only supports open = "w+" and open = "w+b": using the former
I appear to get the same issue with fread():
#' @import data.table
#' @export
my_function = function(){
  my_table = fread(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header=TRUE)
}
This outputs the error:
Input is either empty or fully whitespace after the skip or autostart. Run again with verbose=TRUE.
So, it appears that system.file() doesn't give an object to the file which I could load into an R data.table. How do I do this?
Do yourself a HUGE favour and study fread() closely: it is one of the very best features in data.table. I have examples (at work) of reading from a pipe of other commands, of reading compressed data, and more.
Here is a simple mock example:
R> write.csv(iris, file="/tmp/demo.csv")
R> system("gzip /tmp/demo.csv") # to be very plain
R> fread("zcat /tmp/demo.csv.gz")
V1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 1 5.1 3.5 1.4 0.2 setosa
2: 2 4.9 3.0 1.4 0.2 setosa
3: 3 4.7 3.2 1.3 0.2 setosa
4: 4 4.6 3.1 1.5 0.2 setosa
5: 5 5.0 3.6 1.4 0.2 setosa
---
146: 146 6.7 3.0 5.2 2.3 virginica
147: 147 6.3 2.5 5.0 1.9 virginica
148: 148 6.5 3.0 5.2 2.0 virginica
149: 149 6.2 3.4 5.4 2.3 virginica
150: 150 5.9 3.0 5.1 1.8 virginica
R>
Seems in the haste I wrote one column too many (the row names), but you get the idea.
Now, you don't even need fread() (though it is still more powerful than the alternatives):
R> head(read.csv(file="/tmp/demo.csv.gz"))
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
R>
R figured out by itself that it needed to decompress the file.
Edit: I was editing this question earlier when it was deleted under me, which is about as de-motivating as it gets. In a nutshell:
system.file() works, e.g. file <- system.file("rawdata", "population.csv", package="gunsales") does contain the complete path because the file exists: "/usr/local/lib/R/site-library/gunsales/rawdata/population.csv". But this is easy to mess up. (Needless to say, I do have the package and the file.)
Also look into the data/ directory and what Writing R Extensions says about it. It is a good mechanism.
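A related note for the package itself: files you want to locate with system.file() are normally shipped under inst/extdata/ (installed as extdata/), whereas data/ is meant for datasets loaded with data(), and with lazy-loading the originals may not even be installed as plain files. A hedged sketch of how the lookup could then look (FusionVizR and my_file.txt.gz are taken from the question; whether fread() handles the .gz directly depends on your data.table version and on having R.utils installed):

library(data.table)

path <- system.file("extdata", "my_file.txt.gz", package = "FusionVizR")
stopifnot(nzchar(path))  # system.file() returns "" when the file is not found

# Newer data.table releases (with R.utils installed) decompress .gz themselves;
# otherwise fall back to a gzfile() connection and read.table()
my_table <- tryCatch(
  fread(path, header = TRUE),
  error = function(e) as.data.table(read.table(gzfile(path), header = TRUE))
)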
I wrote a function in R to return the name of the first column of a data frame:
my_func <- function(table, firstColumnOnly = TRUE){
  if(firstColumnOnly)
    return(colnames(table)[1])
  else
    return(colnames(table))
}
If I call the function like this:
my_func(fertility)<-"foo"
I get the following error:
Error in my_func(fertility, FALSE)[1] <- "foo" :
could not find function "my_func<-"
Why am I getting this error? I can do this without an error:
colnames(fertility)[1]<-"Country"
It seems like you are expecting that this:
my_func(fertility)<-"foo"
will be understood by R as:
colnames(table)[1] <- "foo" # if firstColumnOnly
or
colnames(table) <- "foo" # if !firstColumnOnly
It will not. One reason for this is that colnames() and colnames()<- are two distinct functions. The first one returns the column names, the second one assigns new names. Your function can only return the names, not assign them.
One workaround would be to write your function using colnames()<-:
my_func <- function(table, rep, firstColumnOnly = TRUE){
if(firstColumnOnly) colnames(table)[1] <- rep
else colnames(table) <- rep
return(table)
}
Test
head(my_func(iris,"foo"))
foo Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
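If you specifically want the my_func(fertility) <- "foo" syntax from the question to work, define a replacement function, i.e. a function literally named my_func<- whose final argument is called value; R rewrites my_func(x) <- "foo" into x <- `my_func<-`(x, value = "foo"). A small sketch along the lines of your original:

`my_func<-` <- function(table, firstColumnOnly = TRUE, value){
  if (firstColumnOnly) {
    colnames(table)[1] <- value
  } else {
    colnames(table) <- value
  }
  table
}

dat <- iris
my_func(dat) <- "foo"  # renames the first column of dat
head(dat, 2)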
I am trying to open all the csv files in my working directory and read all the tables into a large list of data frames. I found a similar solution on Stack Overflow, and it works. The code is:
load_data <- function(path)
{
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  tables <- lapply(files, read.csv)
  do.call(rbind, tables)
}
pollutantmean <- load_data("specdata")
However, I am confused by some of the steps. If I delete or omit do.call(rbind, tables), I am not able to access the column variables by calling tables[index]$variable; it returns NULL in the console. And when I print the output of tables[index], I do not see any column variable names appearing in the first row of the table. Can someone explain what causes the column variable names to go missing and the NULL value to be returned?
To see why you are getting NULL let's create a reproducible example:
df1 <- head(mtcars)
df2 <- head(iris)
my_list <- list(df1, df2)
Test the subsetting with one bracket and two:
my_list[2]$Species
NULL
my_list[[2]]$Species
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
Subsetting with two brackets produces the desired output.
Further Explanation
Why doesn't one bracket work?
> my_list[2]
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
> my_list[[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
If someone couldn't tell the difference between the two outputs, I wouldn't blame them: they look alike. But there is one small, important difference between using one bracket and two. The first returns a list; the second returns a data frame. To check, notice the [[1]] in the first line of the output of my_list[2]. That indicates the output is a list. As a list, we cannot work with it the way we would with a data frame; we must use the double brackets to get a data frame back.
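Applied back to your load_data(): if you drop the do.call(rbind, tables) line and return the list itself, index it with double brackets or loop over it with the apply family. A short sketch reusing my_list from above:

my_list[[2]]$Species          # a column of the second data frame
my_list[2]$Species            # NULL: a one-element list has no Species column

sapply(my_list, nrow)         # row count of every table in the list
lapply(my_list, head, n = 2)  # first two rows of each table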
I am trying to create chunks of my dataset to run biglm (with fastLm I would need 350 GB of RAM).
My complete dataset is called res. As an experiment I drastically decreased the size to 10,000 rows. I want to create chunks to use with biglm.
library(biglm)
formula <- iris$Sepal.Length ~ iris$Sepal.Width
test <- iris[1:10,]
biglm(formula, test)
And somehow, I get the following output:
> test <- iris[1:10,]
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Above you can see that the data frame test contains 10 rows. Yet when running biglm it reports a sample size of 150:
> biglm(formula, test)
Large data regression model: biglm(formula, test)
Sample size = 150
It looks like it uses iris instead of test... how is this possible, and how do I get biglm to use chunk1 the way I intend it to?
I suspect the following line is to blame:
formula <- iris$Sepal.Length ~ iris$Sepal.Width
where the formula explicitly references the iris dataset. This causes R to go and find the iris dataset when the model is fitted, and it finds it in the global environment (because of R's scoping rules).
In a formula you normally do not reference vectors directly, but simply use the column names:
formula <- Sepal.Length ~ Sepal.Width
This ensures that the formula contains only the column (or variable) names, which will then be found in the data passed to biglm. So biglm will use test instead of iris.
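Putting it together, here is a minimal sketch of the corrected call plus, hedged since I have not seen your res object, one way to feed biglm chunk by chunk with its update() method:

library(biglm)

# Bare column names, so the formula binds to whatever data you pass in
formula <- Sepal.Length ~ Sepal.Width

test <- iris[1:10, ]
fit <- biglm(formula, test)
fit  # now reports Sample size = 10

# Sketch for the full data: res and the chunk size of 10000 are placeholders
# chunks <- split(res, ceiling(seq_len(nrow(res)) / 10000))
# fit <- biglm(formula, chunks[[1]])
# for (chunk in chunks[-1]) fit <- update(fit, chunk)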