subsetting a data.frame using a for loop - r

I have a data.frame, and I want to subset it every 10 rows, apply a function to each subset, save the resulting object, and remove the previous object. Here is what I have so far:
L3 <- LETTERS[1:20]
df <- data.frame(1:391, "col", sample(L3, 391, replace = TRUE))
names(df) <- c("a", "b", "c")
b <- seq(from=1, to=391, by=10)
nsamp <- 0
for(i in seq_along(b)){
a <- i+1
nsamp <- nsamp+1
df_10 <- df[b[nsamp]:b[a], ]
res <- lapply(seq_along(df_10$b), function(x){...})
saveRDS(res, file="res.rds")
rm(res)
}
My problem is that the for loop crashes when it reaches the last element of my sequence b.

When partitioning data, split is your friend. It will create a list with each data subset as an item which is then easy to iterate over.
# (row index - 1) %/% 10 puts rows 1-10 in group 0, rows 11-20 in group 1, ...
dfs = split(df, (seq_len(nrow(df)) - 1) %/% 10)
Then your for loop can be simplified to something like this (untested... I'm not exactly sure what you're doing because example data seems to switch from df to sc2_10 and I only hope your column named b is different from your vector named b):
for (i in seq_along(dfs)) {
  res <- lapply(seq_along(dfs[[i]]$b), function(x){...})
  saveRDS(res, file = sprintf("res_%s.rds", i))
  rm(res)
}
I also modified your save file name so that you aren't overwriting the same file every time.
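A minimal, self-contained sketch of the split-and-save approach follows. The function summarise_chunk() is a hypothetical stand-in for the elided per-chunk work, and the files are written to tempdir() purely for illustration. The grouping uses (row index - 1) %/% 10 so that every chunk except possibly the last holds exactly 10 rows. For context, the original loop fails on its last pass because b[a] indexes one element past the end of b and yields NA.

```r
set.seed(1)
df <- data.frame(a = 1:391, b = "col",
                 c = sample(LETTERS[1:20], 391, replace = TRUE))

# Group labels: rows 1-10 get 0, rows 11-20 get 1, and so on
chunk_id <- (seq_len(nrow(df)) - 1) %/% 10
dfs <- split(df, chunk_id)

# Hypothetical stand-in for the per-chunk function elided in the question
summarise_chunk <- function(chunk) table(chunk$c)

# Illustration only: write into a temporary directory
out_dir <- tempdir()
for (i in seq_along(dfs)) {
  res <- summarise_chunk(dfs[[i]])
  saveRDS(res, file = file.path(out_dir, sprintf("res_%02d.rds", i)))
}
```

Because each chunk gets its own file name, nothing is overwritten, and there is no need to track start and end indices by hand.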

Related

Dataframe output from a for-loop

I am trying to populate the output of a for loop into a data frame. The loop is repeating across the columns of a dataset called "data". The output is to be put into a new dataset called "data2". I specified an empty data frame with 4 columns (i.e. ncol=4). However, the output generates only the first two columns. I also get a warning message: "In matrix(value, n, p) : data length [2403] is not a sub-multiple or multiple of the number of columns [2]"
Why does the dataframe called "data2" have 2 columns, when I have specified 4 columns? This is my code:
a <- 0
b <- 0
GM <- 0
GSD <- 0
data2 <- data.frame(ncol=4, nrow=33)
for (i in 1:ncol(data))
{
if (i==34) {break}
a[i] <- colnames(data[i])
b <- data$cycle
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
If you look at the ?data.frame help page, you'll see that it does not take nrow and ncol arguments; those are arguments for the matrix() function.
Your call therefore creates a data frame with one row and two columns, one named ncol and the other named nrow:
data2 <- data.frame(ncol=4, nrow=33)
data2
#   ncol nrow
# 1    4   33
Instead you could try data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33)), though if you share a small sample of data and your expected result, there may be more efficient ways than explicit loops to get this job done.
Generally, if you do loop, you want to do as much as possible outside the loop. This is guesswork without sample data, but these changes seem like a start at improving your code:
a <- colnames(data)
b <- data$cycle ## this never changes, no need to redefine every iteration
GM <- numeric(ncol(data)) ## better to initialize vectors to the correct length
GSD <- numeric(ncol(data))
data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33))
for (i in 1:ncol(data))
{
if (i==34) {break}
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
## it's weird to assign a row of data.frame at once...
## maybe you should keep it as a matrix?
## note: b is the whole cycle column, so this vector has more than 4 elements
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
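As a hedged illustration of the "do it without an explicit loop" idea, the sketch below builds the summary table in one step. geoMean() and geoSD() are not base R functions, and the question does not say which package provides them, so geo_mean() and geo_sd() here are hypothetical base R stand-ins that assume the usual definitions (the exponential of the mean, and of the standard deviation, of the logged values). The sample data is invented.

```r
# Hypothetical stand-ins for geoMean()/geoSD(), assuming the usual definitions
geo_mean <- function(x) exp(mean(log(x), na.rm = TRUE))
geo_sd   <- function(x) exp(sd(log(x), na.rm = TRUE))

# Invented sample data: three strictly positive measurement columns
set.seed(42)
data <- data.frame(x1 = rlnorm(20), x2 = rlnorm(20, 1), x3 = rlnorm(20, 2))

# One row per column of `data`; vapply() iterates over the columns
data2 <- data.frame(
  variable = colnames(data),
  GM  = vapply(data, geo_mean, numeric(1)),
  GSD = vapply(data, geo_sd, numeric(1)),
  row.names = NULL
)
data2
```

Building the data frame from whole vectors avoids row-by-row assignment, which mixes character and numeric values in a single c() call and coerces everything to character.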

The loop in my R function appears to be running twice

I need to add rows to a data frame. I have many files with many rows so I have converted the code to a function. When I go through each element of the code it works fine. When I wrap everything in a function each row from my first loop gets added twice.
My code looks for a string (xx or x). If xx is present, it replaces the xx with the numbers 00-99 and 0-9 (one row for each number). If x is present, it replaces it with the numbers 0-9.
Create DF
a <- c("1.x", "2.xx", "3.1")
b <- c("single", "double", "nothing")
df <- data.frame(a, b, stringsAsFactors = FALSE)
names(df) <- c("code", "desc")
My dataframe
code desc
1 1.x single
2 2.xx double
3 3.1 nothing
My function
newdf <- function(df){
# If I run through my code chunk by chunk it works as I want it.
df$expanded <- 0 # a variable to let me know if the loop was run on the row
emp <- function(){ # This function creates empty vectors for my loop
assign("codes", c(), envir = .GlobalEnv)
assign("desc", c(), envir = .GlobalEnv)
assign("expanded", c(), envir = .GlobalEnv)
}
emp()
# I want to expand xx with numbers 00 - 99 and 0 - 9.
#Note: 2.0 is different than 2.00
# Identifies the rows to be expanded
xd <- grep("xx", df$code)
# I used chr vs. numeric so I wouldn't lose the trailing zero
# Create a vector to loop through
tens <- formatC(c(0:99)); tens <- tens[11:100]
ones <- c("00","01","02","03","04","05","06","07","08","09")
single <- as.character(c(0:9))
exp <- c(single, ones, tens)
# This loop appears to run twice when I run the function: newdf(df)
# Each row is there twice: 2.00, 2.00, 2.01 2.01...
# It runs as I want it to if I just highlight the code.
for (i in xd){
for (n in exp) {
codes <- c(codes, gsub("xx", n, df$code[i])) #expanding the number
desc <- c(desc, df$desc[i]) # repeating the description
expanded <- c(expanded, 1) # assigning 1 to indicated the row has been expanded
}
}
# Binds the df with the new expansion
df <- df[-xd, ]
df <- rbind(as.matrix(df),cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Empties the vector to begin another expansion
emp()
xs <- grep("x", df$code) # This is for the single digit expansion
# Expands the single digits. This part of the code works fine inside the function.
for (i in xs){
for (n in 0:9) {
codes <- c(codes, gsub("x", n, df$code[i]))
desc <- c(desc, df$desc[i])
expanded <- c(expanded, 1)
}
}
df <- df[-xs,]
df <- rbind(as.matrix(df), cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
assign("out", df, envir = .GlobalEnv) # This is how I view my dataframe after I run the function.
}
Calling my function
newdf(df)
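One likely explanation, offered as a reading of the code above rather than a confirmed answer: inside newdf(), the first codes <- c(codes, ...) creates a variable local to the function, while emp() only resets the global copies. The local codes, desc, and expanded therefore still hold the xx expansions when the second loop starts, and they get bound a second time. Run line by line at top level, everything is global, so emp() does reset them. The sketch below avoids global state altogether; expand_codes() is a hypothetical helper, not part of the original code.

```r
# Hypothetical helper: replace `pattern` in every matching code with each
# value in `replacements`, producing one new row per replacement
expand_codes <- function(df, pattern, replacements) {
  hit <- grepl(pattern, df$code, fixed = TRUE)
  if (!any(hit)) return(df)
  new_rows <- do.call(rbind, lapply(which(hit), function(i) {
    data.frame(
      code = vapply(replacements,
                    function(r) sub(pattern, r, df$code[i], fixed = TRUE),
                    character(1), USE.NAMES = FALSE),
      desc = df$desc[i],
      expanded = 1,
      stringsAsFactors = FALSE
    )
  }))
  # Drop the unexpanded rows and append the expansions
  rbind(df[!hit, , drop = FALSE], new_rows)
}

df <- data.frame(code = c("1.x", "2.xx", "3.1"),
                 desc = c("single", "double", "nothing"),
                 stringsAsFactors = FALSE)
df$expanded <- 0

# Character vectors keep the leading and trailing zeros ("2.0" vs "2.00")
two_digit <- c(as.character(0:9), sprintf("%02d", 0:99))
out <- expand_codes(df, "xx", two_digit)
out <- expand_codes(out, "x", as.character(0:9))
```

Returning the result from the function, instead of calling assign() into the global environment, also makes it easy to apply the same helper to many files.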

Using R to loop through vector and copy some sequences to data.frame

I want to search through a vector for the sequence of strings "hello" "world". When I find this sequence, I want to copy it, including the 10 elements before and after, as a row in a data.frame to which I'll apply further analysis.
My problem: I get an error "new column would leave holes after existing columns". I'm new to coding, so I'm not sure how to manipulate data.frames. Maybe I need to create rows in the loop?
This is what I have:
df = data.frame()
i <- 1
for(n in 1:length(v))
{
if(v[n] == 'hello' & v[n+1] == 'world')
{
df[i,n-11:n+11] <- v[n-10:n+11]
i <- i+1
}
}
Thanks!
Maybe this helps:
indx <- which(v1[-length(v1)] == 'hello' & v1[-1] == 'world')
lst <- Map(function(x, y) {
  s1 <- seq(x, y)
  v1[s1[s1 > 0 & s1 <= length(v1)]]  # keep only indices that exist in v1
}, indx - 10, indx + 11)
len <- max(sapply(lst, length))
d1 <- as.data.frame(do.call(rbind, lapply(lst, `length<-`, len)))
data
set.seed(496)
v1 <- sample(c(letters[1:3], 'hello', 'world'), 100, replace=TRUE)
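A separate issue in the question's loop is operator precedence: in R, `:` binds more tightly than binary `+` and `-`, so an index such as n-10:n+11 does not describe a window around n. A small sketch, with n = 20 chosen arbitrarily:

```r
n <- 20

# `:` is evaluated first, so this is n - (10:n) + 11
wrong <- n - 10:n + 11

# Parentheses give the intended window: 10 before n through 11 after n
right <- (n - 10):(n + 11)

length(wrong)  # 11
length(right)  # 22
```

Wrapping both ends of a range in parentheses is a cheap habit that avoids this class of bug.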

subsetting a list of data frames using a for loop

My question is: why does the last statement "a <- ..." work to give me a subset of that data frame within the list, but when I try to automate the process with a for loop through all data frames in the list, I get all kinds of warnings and not the answer I am looking for?
time <- c(1:20)
temp <- c(2,3,4,5,6,2,3,4,5,6,2,3,4,5,6,2,3,4,5,6)
data <- data.frame(time,temp)
tmp <- c(1,diff(data[[2]]))
tmp2 <- tmp < 0
tmp3 <- cumsum(tmp2)
data1 <- split(data, tmp3)
#this does not work. I want to automate the successful process below through all data frames in the list "data1"
for(i in 1:length(data1)){
finale[i] <- subset(data1[[i]], data1[[i]][,2] > 3)
}
#this works to give me a part of what I want
a <- subset(data1[[1]], data1[[1]][,2] >3)
You may want to try lapply:
lapply(data1, function(x) subset(x, x[,2]>3))
Same result using a for loop
finale <- vector("list", length(data1))
for(i in 1:length(data1)){
finale[[i]] <- subset(data1[[i]], data1[[i]][,2] > 3)
}
It works because I preallocate a type and a length for finale. It didn't work for you because you did not declare what finale should be.
You're trying to save a data.frame (a 2D object) in a vector (a 1D object). Just define finale as a list and the code will work:
time <- c(1:20)
temp <- c(2,3,4,5,6,2,3,4,5,6,2,3,4,5,6,2,3,4,5,6)
data <- data.frame(time,temp)
tmp <- c(1,diff(data[[2]]))
tmp2 <- tmp < 0
tmp3 <- cumsum(tmp2)
data1 <- split(data, tmp3)
# finale is now a list, so the loop below can store one data frame per element
finale <- vector(mode='list')
for(i in 1:length(data1)){
finale[[i]] <- subset(data1[[i]], data1[[i]][,2] > 3) # Use [[i]] instead of [i]
}
To save all in 1 data.frame:
finale <- do.call(rbind, finale)
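The difference between `[` and `[[` on the left-hand side can be seen in isolation. In this sketch, piece is an invented two-column data frame; the single-bracket assignment is wrapped in suppressWarnings() only so the example runs quietly.

```r
piece <- data.frame(time = 1:3, temp = c(4, 5, 6))

# Single brackets: the data.frame is treated as a list of 2 columns being
# pushed into 1 slot, so R warns and keeps only the first column
bad <- vector("list", 1)
suppressWarnings(bad[1] <- piece)

# Double brackets: the whole data.frame is stored as one list element
ok <- vector("list", 1)
ok[[1]] <- piece

is.data.frame(bad[[1]])
is.data.frame(ok[[1]])
```

This is why the loop needs finale[[i]] rather than finale[i] whenever each element is itself a data frame.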

R create a matrix

I have to read some external files, extract some columns, and fill in the missing values with zeros. So if the first file has a, b, c, d in column$Name, with discrete values in column$Area, and the second file has b, d, e, f in the same column, and so on for the other files, I need to create a data frame like this:
      a     b     c     d     e     f
File1 value value value value 0     0
File2 0     value 0     value value value
This is the dummy code I wrote to try to better explain my problem:
listDFs <- list()
for(i in 1:10){
listDFs[[i]] <-
data.frame(Name=c(
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse=""))),
Area=runif(7))
}
lComposti <- sapply(listDFs, FUN = "[","Name")
dfComposti <- data.frame(matrix(unlist(lComposti),byrow=TRUE))
colnames(dfComposti) <- "Name"
dfComposti <- unique(dfComposti)
#
## The CORE of the code
lArea <- list()
for(i in 1:10){
lArea[[i]] <-
ifelse(dfComposti$Name %in% listDFs[[i]]$Name, listDFs[[i]]$Area, 0)}
#
mtxArea <- (matrix(unlist(lArea),nrow=c(10),ncol=dim(dfComposti)[1],byrow=TRUE))
The problem is keeping the column names "synchronized" with their values.
Do you have any suggestions?
If my code is unclear, I can also upload the files I work with.
Best
The safest approach is never to lose track of the names; otherwise they could be put back in the wrong order.
You can concatenate all your data.frames into a tall data.frame, with do.call(rbind, ...), and then convert it to a wide data.frame with dcast.
# Add a File column to the data.frames
names( listDFs ) <- paste( "File", 1:length(listDFs) )
for(i in seq_along(listDFs)) {
listDFs[[i]] <- data.frame( listDFs[[i]], file = names(listDFs)[i] )
}
# Concatenate them
d <- do.call( rbind, listDFs )
# Convert this tall data.frame to a wide one
# ("sum" is only needed if some names appear several times
# in the same file: since you used "replace=TRUE" for the
# sample data, it is likely to happen)
library(reshape2)
d <- dcast( d, file ~ Name, fun.aggregate = sum, value.var = "Area" )
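If adding a package is not an option, the same tall-to-wide step can be sketched in base R with xtabs(). This is an alternative to dcast, not the method used in the answer above. The two small data frames are invented stand-ins for the input files, laid out like the example table in the question.

```r
# Invented stand-ins for two input files
listDFs <- list(
  File1 = data.frame(Name = c("a", "b", "c", "d"), Area = c(1.5, 2.0, 0.7, 3.1)),
  File2 = data.frame(Name = c("b", "d", "e", "f"), Area = c(0.4, 1.2, 2.2, 0.9))
)

# Tag each row with its file of origin, then stack into one tall data.frame
tall <- do.call(rbind,
                Map(function(d, f) cbind(d, file = f), listDFs, names(listDFs)))

# Cross-tabulate: sums Area per (file, Name); absent combinations become 0
wide <- as.data.frame.matrix(xtabs(Area ~ file + Name, data = tall))
wide
```

As with dcast, the names travel with the values the whole way, so there is no step at which columns can fall out of sync.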
