How can I automate data frame naming in R?

Let's say I have the following data frame:
x <- data.frame(let = sample(LETTERS, 100, replace = TRUE),
                num = sample(1:10, 100, replace = TRUE))
I want to create several subsets of x where each new data frame is named after the levels of x$let. So far, I've come up with this simple function:
ss <- function(letra) {
  return(subset(x, let == letra))
}
This is very rudimentary, though, and doesn't do the naming I want. My question is: how can I automate the following procedure?
a <- ss('A')
b <- ss('B')
c <- ss('C')
...
z <- ss('Z')

To elaborate a bit:
xs <- split(x, x$let)
Now we have a list, xs, of each subset of the original data frame. The name of each list component matches the factor level it was selected on:
xs[['D']]
let num
8 D 8
14 D 1
16 D 9
54 D 5
60 D 6
64 D 8
74 D 8
Most people use either xlsx or XLConnect to write Excel files from R. I happen to use XLConnect, but the solutions would be very similar.
Now we can simply do this:
require(XLConnect)
# Build file names from names(xs) so they still line up if a letter is absent
file_name <- paste0("file", names(xs), ".xlsx")
for (i in seq_along(xs)) {
  wb <- loadWorkbook(file_name[i], create = TRUE)
  createSheet(wb, "Sheet1")
  writeWorksheet(wb, data = xs[[i]], sheet = 1)
  saveWorkbook(wb)
}
I've done this in a for loop so that it's easier to read and understand, but obviously this could all be shoved into an lapply or mapply type solution as well.
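For instance, here is a minimal lapply sketch of the same loop (assuming xs and file_name as defined above):
invisible(lapply(seq_along(xs), function(i) {
  # one workbook per list component, exactly as in the for loop
  wb <- loadWorkbook(file_name[i], create = TRUE)
  createSheet(wb, "Sheet1")
  writeWorksheet(wb, data = xs[[i]], sheet = 1)
  saveWorkbook(wb)
}))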

Agreed with Joshua that you may want to do something different, but if you're really hooked on your previous idea, you can use:
x <- data.frame(let = sample(LETTERS, 100, replace = TRUE),
                num = sample(1:10, 100, replace = TRUE))
ss <- function(letra) {
  assign(letra, subset(x, let == letra), envir = .GlobalEnv)
  # Returning the data frame is optional:
  # return(subset(x, let == letra))
}
ss('A')
print(A)
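And if you really want all the objects at once, a quick sketch (assuming x and ss as above; unique() covers only the letters that were actually sampled):
# Create one global object per letter present in x$let
invisible(lapply(unique(as.character(x$let)), ss))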
Update: taking Joran's suggestion, one can write:
x_split <- split(x, x$let)
for (let in x_split) {
  write.csv(let, file = paste0(let$let[1], ".csv"))
}

Related

R: object y not found in function(x, y) [a function to pass through data frames in R]

I am writing a function to build new data frames based on existing data frames. So I essentially have
f1 <- function(x, y) {
  x_adj <- data.frame("DID*" = df.y$`DM`[x], "LDI" = df.y$`DirectorID*`[-(x)],
                      "LDM" = df.y$`DM`[-(x)], "IID*" = y)
}
I have 4,000 data frames named df.1, df.2, and so on, so I really need something like this, but R returns an error saying that df.y is not found. y is meant to run through the 4,000 suffixes of the different data frames. I am very new to R, so any help would be really appreciated.
In case more specifics are needed I essentially have something like
df.1 <- data.frame(x = 1:3, b = 5)
And I need the following as a result using a function
df.11 <- data.frame(x = 1, c = 2:3, b = 5)
df.12 <- data.frame(x = 2, c = c(1,3), b = 5)
df.13 <- data.frame(x = 3, c = 1:2, b = 5)
Thanks in advance!
The OP seems to want to access a data frame by a dynamically constructed name.
One option is to use get:
get(paste("df",y,sep = "."))
With y = 1, the above call returns df.1.
Hence, the function can be modified as:
f1 <- function(x, y) {
  temp_df <- get(paste("df", y, sep = "."))
  x_adj <- data.frame("DID*" = temp_df$`DM`[x], "LDI" = temp_df$`DirectorID*`[-(x)],
                      "LDM" = temp_df$`DM`[-(x)], "IID*" = y)
  x_adj # return the new data frame explicitly
}
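To see the lookup in isolation, here is a toy example (the column is hypothetical, just for the demo):
df.1 <- data.frame(a = 1:3)               # stand-in for one of the 4,000 data frames
temp_df <- get(paste("df", 1, sep = ".")) # builds the name "df.1" and fetches it
identical(temp_df, df.1)                  # TRUE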

Assign variables dependent on indices in for loop

So I have a small problem in R. I have multiple data sets (data0, data1,...) and I want to do the following:
data01 <- data0[1:6,]
data02 <- data0[7:12,]
data11 <- data1[1:6,]
data12 <- data1[7:12,]
data21 <- data2[1:6,]
data22 <- data2[7:12,]
data31 <- data3[1:6,]
data32 <- data3[7:12,]
...etc
I would like to do this in a for loop like so:
for (i in 1:(some high number)) {
  datai1 <- datai[1:6, ]
  datai2 <- datai[7:12, ]
}
I've tried messing around with assign() and get(), but I cannot make it work. I found something that might work in this question; however, the difference is that here the variable d should also change depending on the index. Any idea how I could make this work?
Here is a more R-like approach than using assign:
data1 <- data0 <- data.frame(x = 1:12, y = letters[1:12]) # some data
mylist <- mget(ls(pattern = "data\\d")) # collect free-floating objects into a list
# (it would be better to put the data.frames into a list when you create them)
res <- lapply(mylist, function(d) split(d[1:12, ], rep(1:2, each = 6))) # split each data.frame
The result is a nested list and it's easy to extract its elements:
res[["data1"]][["2"]]
# x y
#7 7 g
#8 8 h
#9 9 i
#10 10 j
#11 11 k
#12 12 l
Assemble the variable names with paste() and then use get() and assign() as you suggest.
for (i in 1:10) {
  datai <- get(paste0('data', i))
  assign(paste0('data', i, '1'), datai[1:6, ])
  assign(paste0('data', i, '2'), datai[7:12, ])
}
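A quick sanity check after the loop (assuming data1 exists):
identical(data11, data1[1:6, ]) # should be TRUE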

R: run function over same dataframe multiple times

I’m looking to apply a function over an initial dataframe multiple times. As a simple example, take this data:
library(dplyr)
thisdata <- data.frame(vara = seq(from = 1, to = 20, by = 1),
                       varb = seq(from = 1, to = 20, by = 1))
And here is a simple function I would like to run over it:
simplefunc <- function(data) {
  datasetfinal2 <- data %>% mutate(varb = varb + 1)
  return(datasetfinal2)
}
thisdata2 <- simplefunc(thisdata)
thisdata3 <- simplefunc(thisdata2)
So, how would I run this function, say, 10 times without having to keep calling it by hand (i.e. thisdata2, thisdata3, ...)? I'm mostly interested in the final dataframe after the replication, but it would be good to have a list of all the dataframes produced so I can run some diagnostics. Appreciate the help!
Dealing with multiple identically-structured data.frames individually is a difficult way to manage things, especially if the number of iterations is more than a few. A popular "best practice" is to deal with a "list of data.frames", something like:
n <- 10 # number of times you need to repeat the process
out <- vector("list", n)
out[[1]] <- thisdata
for (i in 2:n) out[[i]] <- simplefunc(out[[i-1]])
You can look at any interim value with
str(out[[10]])
# 'data.frame': 20 obs. of 2 variables:
# $ vara: num 1 2 3 4 5 6 7 8 9 10 ...
# $ varb: num 10 11 12 13 14 15 16 17 18 19 ...
and, as you might expect, the final result is in out[[n]].
This can be simplified slightly using Reduce, and adding a throw-away second argument to simplefunc:
simplefunc <- function(data, ...) {
  datasetfinal2 <- data %>% mutate(varb = varb + 1)
  return(datasetfinal2)
}
out <- Reduce(simplefunc, 1:10, init = thisdata, accumulate = TRUE)
This effectively does:
tmp <- simplefunc(thisdata, 1)
tmp <- simplefunc(tmp, 2)
tmp <- simplefunc(tmp, 3)
# ...
(In fact, if you look at the source for Reduce, it's effectively doing my first suggestion above.)
Note that if simplefunc has other arguments that cannot be dropped, they can be placed after the dots, perhaps:
simplefunc <- function(data, ..., otherarg, anotherarg) {
  datasetfinal2 <- data %>% mutate(varb = varb + 1)
  return(datasetfinal2)
}
though then you must change all other calls to simplefunc to pass those parameters by name instead of by position (the common default).
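For example (the argument values here are hypothetical):
# otherarg/anotherarg sit after the dots, so they must be named at the call site
tmp <- simplefunc(thisdata, otherarg = 1, anotherarg = 2)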
Edit: if you cannot (or do not want to) edit simplefunc, you can always use an anonymous function to ignore the iterator/counter:
Reduce(function(x, ign) simplefunc(x), 1:10, init = thisdata, accumulate = TRUE)
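With accumulate = TRUE the result is a list holding the initial data frame plus one element per iteration, which gives you both the diagnostics list and the final frame:
out <- Reduce(function(x, ign) simplefunc(x), 1:10, init = thisdata, accumulate = TRUE)
length(out)                 # 11: the initial frame plus 10 iterations
final <- out[[length(out)]] # the data frame after all 10 applications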
We can use a for loop
thisdata1 <- thisdata
for (i in 2:3) {
  assign(paste0('thisdata', i), value = simplefunc(get(paste0('thisdata', i - 1))))
}
NOTE1: It is better not to create individual objects in the global environment when the operations can be done easily within a list.
NOTE2: Forgot to add the disclaimer earlier

R: Row resampling loop speed improvement

I'm subsampling rows from a dataframe with columns c("x", "y", "density") at a variety of combinations of c("s_size", "reps"), where reps = number of replicates and s_size = number of rows subsampled from the whole dataframe.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
# Subsampling ###################
subsample_loop <- function(s_size, reps, int) {
  tm1 <- system.time({ # start timer
    subsample_bound <- data.frame()
    # Perform subsampling of the general
    for (s_size in seq(1, s_size, int)) {
      for (reps in 1:reps) {
        subsample <- sample.df.rows(s_size, data_xyz)
        assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep = ""), subsample)
        subsample_replicate <- subsample[, ] # temporary variable
        subsample_replicate <- cbind(subsample,
                                     rep(s_size, length(subsample_replicate[, 1])),
                                     rep(reps, length(subsample_replicate[, 1])))
        subsample_bound <- rbind(subsample_bound, subsample_replicate)
      }
    }
  }) # end timer
  colnames(subsample_bound) <- c("x", "y", "density", "s_size", "reps")
  subsample_bound
} # end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples N rows from a dataframe; returns a dataframe with the same columns
# df: data frame
# N: number of rows to sample
sample.df.rows <- function(N, df, ...) {
  df[sample(nrow(df), N, replace = FALSE, ...), ]
}
It's way too slow; I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1, s_size, by = int), each = reps)
  id <- rep(seq_len(reps), times = s_size/int)
  out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
This outputs the data, unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
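The warning most likely comes from ss and id ending up with different lengths: s_size/int = 206/5 = 41.2 is not an integer, so rep() truncates it. A sketch of a fix inside resampling_custom is to derive id from the actual size sequence:
sizes <- seq(1, s_size, by = int)
ss <- rep(sizes, each = reps)
id <- rep(seq_len(reps), times = length(sizes)) # same length as ss by construction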
I put very little thought into actually optimizing this, I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and expand that object using rbind, cbind or c, you can be very sure that the resulting code will essentially be the slowest possible way of doing whatever task is being attempted.
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1, s_size, by = int), each = reps)
id <- rep(seq_len(reps), times = s_size/int)
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.

Filtering multiple csv files while importing into data frame

I have a large number of csv files that I want to read into R. All the column headings in the csvs are the same, but I want to import only those rows from each file for which a variable is within a given range (above a minimum threshold and below a maximum threshold), e.g.
v1 v2 v3
1 x q 2
2 c w 4
3 v e 5
4 b r 7
Filtering for v3 (v3 > 2 & v3 < 7) should result in:
v1 v2 v3
1 c w 4
2 v e 5
So far I import all the data from all csvs into one data frame and then do the filtering:
# Read the data files
fileNames <- list.files(path = workDir)
mergedFiles <- do.call("rbind", sapply(fileNames, read.csv, simplify = FALSE))
fileID <- row.names(mergedFiles)
fileID <- gsub(".csv.*", "", fileID)
# Combine data with file IDs
combFiles <- cbind(fileID, mergedFiles)
# Filter the data according to criteria
resultFile <- combFiles[combFiles$v3 > min & combFiles$v3 < max, ]
I would rather apply the filter while importing each csv file into the data frame. I assume a for loop would be the best way of doing it, but I am not sure how. I would appreciate any suggestions.
Edit
After testing the suggestion from mnel, which worked, I ended up with a different solution:
fileNames <- list.files(path = workDir)
mzList <- list()
for (i in seq_along(fileNames)) {
  tempData <- read.csv(fileNames[i])
  mz.idx <- which(tempData[, 1] > minMZ & tempData[, 1] < maxMZ)
  mz1 <- tempData[mz.idx, ]
  mzList[[i]] <- data.frame(mz1, filename = rep(fileNames[i], length(mz.idx)))
}
resultFile <- do.call("rbind", mzList)
Thanks for all the suggestions!
Here is an approach using data.table, which lets you use fread (faster than read.csv) and rbindlist, a superfast implementation of do.call(rbind, list(...)) that is perfect for this situation. data.table also has a between function.
library(data.table)
fileNames <- list.files(path = workDir)
alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
  xx <- fread(x, sep = ',')
  xx[, fileID := gsub(".csv.*", "", x)]
  xx[between(v3, lower = min, upper = max, incbounds = FALSE)]
}, min = 2, max = 3))
If the individual files are large and v3 always holds integer values, it might be worth setting v3 as a key and using a binary search; it may also be quicker to import everything and then run the filtering.
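A sketch of that keyed lookup (assuming a data.table xx with an integer v3 column; the bounds are just examples):
setkey(xx, v3)          # sort and mark v3 as the key
xx[J(3:6), nomatch = 0] # binary search for rows where v3 is 3, 4, 5 or 6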
If you want to do the filtering before importing the data, try read.csv.sql from the sqldf package.
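For example, a sketch ("myfile.csv" is a placeholder; inside the query the csv is exposed as a table named file):
library(sqldf)
dat <- read.csv.sql("myfile.csv", sql = "select * from file where v3 > 2 and v3 < 7")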
If you are really stuck for memory, then the following solution might work. It uses LaF to read only the column needed for filtering, calculates the total number of lines that will be read, initializes the complete data.frame, and then reads the required lines from the files. (It's probably not faster than the other solutions.)
library("LaF")
colnames <- c("v1","v2","v3")
colclasses <- c("character", "character", "numeric")
fileNames <- list.files(pattern = "*.csv")
# First determine which lines to read from each file and the total number of lines
# to be read
lines <- list()
for (fn in fileNames) {
laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
d <- laf$v3[]
lines[[fn]] <- which(d > 2 & d < 7)
}
nlines <- sum(sapply(lines, length))
# Initialize data.frame
df <- as.data.frame(lapply(colclasses, do.call, list(nlines)),
stringsAsFactors=FALSE)
names(df) <- colnames
# Read the lines from the files
i <- 0
for (fn in names(lines)) {
laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
n <- length(lines[[fn]])
df[seq_len(n) + i, ] <- laf[lines[[fn]], ]
i <- i + n
}
