How to append a growing array to itself efficiently - r

I have a data frame "v" with id and value columns, such as:
set.seed(123)
v <- data.frame(id=sample(1:5),value=sample(1:5))
v
id value
1 2 1
2 4 3
3 5 4
4 3 2
5 1 5
In the loop, I want to find the index of v which v's id matches tmp and then find the subset of v based on this index.
tmp is a sample with "replacement" of v$id
Here is my attempt:
df <- vector(mode='list',length = iter)
iter = 1
for (i in 1:iter)
{
tmp <- sample(v$id, replace=T)
index.position <- NULL
for (j in 1:length(tmp)) {index.position <- c(index.position, which(v$id %in% tmp[j]) )}
df[[i]] <- v[index.position,]
}
tmp
[1] 1 5 3 5 2
df
[[1]]
id value
5 1 5
3 5 4
4 3 2
3.1 5 4
1 2 1
This works as expected. However, the execution is very slow when both "v" and "iter" are large because growing the index.position array is not memory efficient.
I have also tried to create an empty matrix or list as a placeholder and then assign index.position to it as I loop, but did not really speed up the process.
(reference: Growing a data.frame in a memory-efficient manner)
Edit: id "isn't" unique in v

Try to avoid for...for... loop. It is extremely inefficient. It is equal to:
for (i in 1:iter)
{
df[[i]] <- v[sample(nrow(v),replace = T),]
}
a more verbose version of Gregor's solution...

Related

Writing a function in in R

I am doing an exercise to practice writing functions.
I'm trying to figure out the general code before writing the function that reproduces the output from the table function. So far, I have the following:
set.seed(111)
vec <- as.integer(runif(10, 5, 20))
x <- sort(unique(vec))
for (i in x) {
c <- length(x[i] == vec[i])
print(c)
}
But this gives me the following output:
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
I don't think I'm subsetting correctly in my loop. I've been watching videos, but I'm not quite sure where I'm going wrong. Would appreciate any insight!
Thanks!
We can sum the logical vector concatenate it to count
count <- c()
for(number in x) count <- c(count, sum(vec == number))
count
#[1] 3 1 4 1 5 4 3 2 7
In the OP's for loop, it is looping over the 'x' values and not on the sequence of 'x'
If we do
for(number in x) count <- c(count, length(vec[vec == number]))
it should work as well
You can try sapply + setNames to achieve the same result like table, i.e.,
count <- sapply(x, function(k) setNames(sum(k==vec),k))
or
count <- sapply(x, function(k) setNames(length(na.omit(match(vec,k))),k))
such that
> count
1 2 3 4 5 6 7 8 9
3 1 4 1 5 4 3 2 7
Here is a solution without using unique and with one pass through the vector (if only R was fast with for loops!):
count = list()
for (i in vec) {
val = as.character(i)
if (is.null(count[[val]]))
count[[val]] = 1
else
count[[val]] = count[[val]] + 1
}
unlist(count)

Why does the for loop assign a value equal to 6 for the fourth iteration?

This simple for loop takes the maximum value in each row, subtracts the minimum value, and assigns the result to column "e" in the corresponding row. All resulting values in column "e" are correct except the last, which should equal 3. However, the for loop returns 6 as the answer.
#data frame
df <- data.frame(a=c(1,0,3,9),
b=c(2,3,2,8),
c=c(3,6,1,7),
d=c(4,9,4,6))
#Simple for loop to generate new, calculated column "e"
for(i in 1:nrow(df)){
df$e[i] <- max(df[i,])-min(df[i,])
}
Here's a loop where you don't need to add any intermediary objects (other than i):
for(i in 1:nrow(df)) df[i, "e"] <- max(df[i,], na.rm=T) - min(df[i,], na.rm=T)
df
a b c d e
1 1 2 3 4 3
2 0 3 6 9 9
3 3 2 1 4 3
4 9 8 7 6 3
note: na.rm is needed as the value of df[2:nrow(df),"e"] was assigned as NA when i <-1
We can do this with apply
df$e <- apply(df, 1, function(x) diff(range(x)))
df$e
#[1] 3 9 3 3
Another option is rowRanges from matrixStats
library(matrixStats)
df$e <- c(diff(t(rowRanges(as.matrix(df)))))
df$e
#[1] 3 9 3 3
In the for loop, we can create a temporary object to assign the values and then update the 'e' column
tmp <- numeric(nrow(df))
for(i in 1:nrow(df)){
tmp <- (max(df[i,]) - min(df[i,]))
}
df$e <- tmp
df$e
#[1] 3 9 3 3

combine list of data frames in list in specific manner

I got a list which have another list of data frames.
The outside list elements represents years and inside list represent months data.
Now I want to create a final list which will contain data for all months. Each Month columns will be "cbinded" by other years column values.
Alldata <- list()
Alldata[[1]] <- list(data.frame(Jan_2015_A=c(1,2), Jan_2015_B=c(3,4)), data.frame(Feb_2015_C=c(5,6), Feb_2015_D=c(7,8)))
Alldata[[2]] <- list(data.frame(Jan_2016_A=c(1,2), Jan_2016_B=c(3,4)), data.frame(Feb_2016_C=c(5,6), Feb_2016_D=c(7,8)))
Expected output list is as following
I've tried using for loops and its little complex, I want any R function to do this task.
I have done this using for loops using following code. But this is really complex and I myself found this little complicate. Hope I will get any simpler and tidy code for this operation.
I created list with each months and years data as a list item in form of data frames
x2 <- list()
for(l1 in 1: length(Alldata[[1]])){
temp <- list()
for(l2 in 1: length(Alldata)){
temp <- append(temp, list(Alldata[[l2]][[l1]]))
}
x2 <- append(x2, list(temp))
}
# then created final List with succesive years data of each month as list items. This is primarily used for Tracking data for years For Example: how much was count was for Jan_2015 and Jan_2016 for "A"
finalList <- list()
for(l3 in 1: length(x2)){
temp <- x2[[l3]]
td2 <- as.data.frame(matrix("", nrow = nrow(temp[[1]])))
rownames(td2)[rownames(temp[[1]])!=""] <- rownames(temp[[1]])[rownames(temp[[1]])!=""]
for(l4 in 1:ncol(temp[[1]])){
for(l5 in 1: length(temp)){
# lapply(l4, function(x) do.call(cbind,
td2 <- cbind(td2, temp[[l5]][, l4, drop=F])
}
}
finalList <- append(finalList, list(td2))
}
> finalList
[[1]]
V1 Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
1 1 1 3 3
2 2 2 4 4
[[2]]
V1 Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
1 5 5 7 7
2 6 6 8 8
You could do the following below. The lapply will iterate over the outer list and the do.call will cbind the inner list of data frames.
lapply(Alldata, do.call, what = 'cbind')
[[1]]
Jan_2015_A Jan_2015_B Feb_2015_C Feb_2015_D
1 1 3 5 7
2 2 4 6 8
[[2]]
Jan_2016_A Jan_2016_B Feb_2016_C Feb_2016_D
1 1 3 5 7
2 2 4 6 8
You can also use dplyr to get the same results.
library(dplyr)
lapply(Alldata, bind_cols)
Here is a third option proposed by J.R.
lapply(Alldata, Reduce, f = cbind)
EDIT
After clarification from OP, the above solution has been modified (see below) to produce the newly specified output. The solution above has been left there since it is a building block for the solution below.
pattern.vec <- c("Jan", "Feb")
### For a given vector of months/patterns, returns a
### list of elements with only that month.
mon_data <- function(mo) {
return(bind_cols(sapply(Alldata, function(x) { x[grep(pattern = mo, x)]})))
}
### Loop through months/patterns.
finalList <- lapply(pattern.vec, mon_data)
finalList
## [[1]]
## Jan_2015_A Jan_2015_B Jan_2016_A Jan_2016_B
## 1 1 3 1 3
## 2 2 4 2 4
##
## [[2]]
## Feb_2015_C Feb_2015_D Feb_2016_C Feb_2016_D
## 1 5 7 5 7
## 2 6 8 6 8
## Ordering the columns as specified in the original question.
## sorting is by the last character in the column name (A or B)
## and then the year.
lapply(finalList, function(x) x[ order(gsub('[^_]+_([^_]+)_(.*)', '\\2_\\1', colnames(x))) ])
## [[1]]
## Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
## 1 1 1 3 3
## 2 2 2 4 4
##
## [[2]]
## Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
## 1 5 5 7 7
## 2 6 6 8 8

Complete.cases used on list of data frames

I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply as I had been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error "Error in [.default(xj, i) : invalid subscript type 'list' "
What I'd like to know is some alternatives or if I was going about this right. And why didn't the last example not work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
welcome to SO, please provide some working code next time
here is how i would do it with na.omit (since complete.cases only returns a logical)
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical value. This should do the job and returns only the rows with no NA values.

Using a list to store results of a double loop (for-loop) in R

I want to make calculations for elements of individual rows using a for-loop.
I have two data.frames
df: contains data of all trading-days stocks
events: contains data of only event days of stocks
Even though there might be a much easier approach for this specific example, I’d like to know how to do such a task with a loop in a loop (for-loops).
First, my data.frames:
comp1 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date1 <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
ret <- c(1.2,2.2,-0.5,0.98,0.73,-1.3,-0.02,0.3,1.1,2.0,1.9,-0.98,1.45,1.71,0.03)
df <- data.frame(comp1,date1,ret)
comp2 <- c(1,1,2,2,2,3,3)
date2 <- c(2,4,1,2,5,4,5)
q <- paste("")
events <- data.frame(comp2,date2,q)
df
# comp1 date1 ret
# 1 1 1 1.20
# 2 1 2 2.20
# 3 1 3 -0.50
# 4 1 4 0.98
# 5 1 5 0.73
# 6 2 1 -1.30
# 7 2 2 -0.02
# 8 2 3 0.30
# 9 2 4 1.10
# 10 2 5 2.00
# 11 3 1 1.90
# 12 3 2 -0.98
# 13 3 3 1.45
# 14 3 4 1.71
# 15 3 5 0.03
events
# comp2 date2 q
# 1 1 2
# 2 1 4
# 3 2 1
# 4 2 2
# 5 2 5
# 6 3 4
# 7 3 5
I want to make calculations of df$ret. As an example let's just take 2 * df$ret. The results for each event-day should be stored in mylist. The final output should be the data.frame "events" with a column "q" where I want the results of the calculation to be stored.
# important objects:
companies <- as.vector(unique(df$comp1)) # all the companies (here: 1, 2, 3)
days <- as.vector(unique(df$date1)) # all the trading-days (here: 1, 2, 3, 4, 5)
mylist <- vector('list', length(companies)) # a list where the results should be stored for each company
I came up with some piece of code which doesn't work. But I still think it should look something like this:
for(i in 1:nrow(events)) {
events_k <- events[which(comp1==companies[i]),] # data of all event days of company i
df_k <- df[which(comp2==companies[i]),] # data of all trading days of company i
for(j in 1:nrow(df_k)) {
events_k[j, "q"] <- df_k[which(days==events_k[j,"date2"]), "ret"] * 2
}
mylist[i] <- events_k
}
I don't understand how to set up the loop inside the other loop and how to store the results in mylist. Any help appreciated!!
Thank you!
Don't feel bad. All of your problems are common R gotchas. First, try changing
events <- data.frame(comp2,date2,q,stringsAsFactors=FALSE)
earlier instead. Your column q is being converted to a factor implicitly, disallowing the arithmetic * 2 operation later.
Next, let's consider the fixed loop
for(i in 1:nrow(events)) {
events_k <- events[which(comp1==companies[i]),] # data of all event days of company i
df_k <- df[which(comp2==companies[i]),] # data of all trading days of company i
for(j in 1:nrow(df_k)) {
events_k[j, "q"] <-
if (0 == length(tmp <- df_k[which(days==events_k[j,"date2"]), "ret"] * 2)) NA
else tmp
}
mylist[[i]] <- events_k
}
Your first problem was on the last line, where you used [ instead of [[ (in R, the former means always wrapped with a list, whereas the latter actually accessed the value in the list).
Your second problem is that sometimes which(days==events_k[j,"date2"]) is numeric(0) (i.e., there is no matching event date). The code will then work, but you'll still have a lot of dataframes with NAs. To remove those, you could do something like:
mylist <- Filter(function(df) nrow(df) > 0,
lapply(mylist, function(df) df[apply(df, 1, function(row) !all(is.na(row))), ]))
which will filter out list elements with empty dataframes, and rows in dataframes with all NA.

Resources