converting FOR loop to apply in R

I'm trying to learn how to use apply, but am getting a bit stuck. This loop converts every column except the first four to its cumulative sum.
Can anyone help me? Thanks!
for (i in seq_along(newbuilds.byfuel.bydate)) {
  if (i >= 5) {
    newbuilds.byfuel.bydate[i] <- cumsum(newbuilds.byfuel.bydate[i])
  }
}

This is how I would do this if your object is a data.frame:
## dummy dataset
x <- mtcars
Use lapply to loop over the desired columns, calculate the cumsum, and then overwrite the original columns:
x[5:ncol(x)] <- lapply(x[5:ncol(x)], cumsum)
Alternatively, loop over the un-dropped columns:
x[-(1:4)] <- lapply(x[-(1:4)], cumsum)
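To convince yourself the two versions agree, you can run both on the same dummy data and compare (mtcars stands in for the real dataset; cumsum works column-wise on a numeric data.frame because it belongs to the Math group generic):

```r
x1 <- mtcars
for (i in seq_along(x1)) {
  if (i >= 5) {
    x1[i] <- cumsum(x1[i])  # original loop version
  }
}

x2 <- mtcars
x2[5:ncol(x2)] <- lapply(x2[5:ncol(x2)], cumsum)  # lapply version

identical(x1, x2)  # TRUE
```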

Related

rownames on multiple dataframe with for loop in R

I have several dataframes. In each one, I want the first column to become the row names. I can do it for one dataframe this way:
# Rename the row according the value in the 1st column
row.names(df1) <- df1[,1]
# Remove the 1st column
df1 <- df1[,-1]
But I want to do that on several dataframes. I tried several strategies, including with assign and some get, but with no success. Here are the two main ways I've tried:
# Getting a list of all my dataframes
my_df <- list.files(path="data")
# 1st strategy, adapting what works for 1 dataframe
for (i in 1:length(files_names)) {
  rownames(get(my_df[i])) <- get(my_df[[i]])[,1] # The problem seems to be in this line
  my_df[i] <- my_df[i][,-1]
}
# The error is: could not find function 'get<-'
# 2nd strategy using assign()
for (i in 1:length(my_df)) {
  assign(rownames(get(my_df[[i]])), get(my_df[[i]])[,1]) # The problem seems to be in this line
  my_df[i] <- my_df[i][,-1]
}
# The error is : Error in assign(rownames(my_df[i]), get(my_df[[i]])[, 1]) : first argument incorrect
I really don't see what I missed. When I type get(my_df[i]) or get(my_df[[i]])[,1] alone in the console, it works...
Thank you very much to those who can help me :)
You can wrap the code you have in a function, then read each file and pass every dataframe to that function.
change_rownames <- function(df1) {
  row.names(df1) <- df1[, 1]
  df1 <- df1[, -1]
  df1
}
my_df <- list.files(path = "data", full.names = TRUE) # full.names so read.csv can find the files
list_data <- lapply(my_df, function(x) change_rownames(read.csv(x)))
We can use a loop function like lapply or purrr::map to loop through all the data.frames, then use tibble::column_to_rownames, which simplifies the procedure a lot. No need for an explicit for loop.
library(purrr)
library(dplyr)  # for the %>% pipe
library(tibble) # column_to_rownames() lives here
map(my_df, ~ .x %>% read.csv() %>% column_to_rownames(var = names(.)[1]))

convert X-Y data.frame to matrix for every column in R efficiently

I have found a way to do this using reshape2, but it is quite slow and doesn't quite give me exactly what I want. I have a data.frame that looks like this:
df<-data.frame(expand.grid(1:10,1:10))
colnames(df) <- c("x","y")
for (i in 3:10) {
  df[i] <- runif(100, 10, 100)
}
I run:
require(reshape2)
matrices <- lapply(colnames(df)[-c(1:2)], function(x) {
  mat <- acast(df, y ~ x, value.var = x, fill = 0, fun.aggregate = mean)
  return(mat)
})
That gives me a list of matrices, one for each value vector in my data, which I can transform into a three-dimensional array. But I'm looking for a faster way: my datasets can contain many value columns, this process can take a long time, and I can't seem to find a more efficient way of doing it.
Thanks for any help.
If your data.frame is stored regularly as you say, you could accomplish this in a for loop, which may actually be faster than casting:
# preallocate array
myArray <- array(0, dim = c(10, 10, 10))
# loop through, filling one slice per value of y:
for (i in 1:10) {
  myArray[,,i] <- as.matrix(df[df$y == i, ])
}
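If the rows really are in regular expand.grid order (x varying fastest within each y), the loop can be skipped entirely: a single array() call reshapes every value column at once. A sketch assuming the 10 x 10 grid from the question:

```r
df <- data.frame(expand.grid(1:10, 1:10))
colnames(df) <- c("x", "y")
for (i in 3:10) df[i] <- runif(100, 10, 100)

# one 10 x 10 slice per value column (8 of them here);
# each slice fills column-wise, so rows index x and columns index y
myArray <- array(as.matrix(df[-(1:2)]), dim = c(10, 10, ncol(df) - 2))
```

Note the orientation: acast(df, y ~ x, ...) puts y in the rows, so transpose each slice with t() if you need an exact match.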

Count the occurrences of unequal numbers from a data frame

I have a data frame df with four columns. I would like to find the number of unequal values for each pair of rows.
I have tried to do it using a for loop and it works perfectly. However, it takes a very long time to run. Please see my code below:
dist_mat <- matrix(0, nrow(df), nrow(df))
for (i in 1:nrow(df)) {
  for (j in 1:nrow(df)) {
    dist_mat[i, j] <- sum(df[, 1:4][i, ] != df[, 1:4][j, ])
  }
}
I thought there would be other way of doing this fast. Any suggestion is appreciated.
P.S. The data is numeric.
Given that the matrix is symmetric, and the diagonal will be zero, you don't need to loop twice over each row so you can cut the looping down by over half:
for (i in 1:(nrow(df) - 1)) {
  for (j in (i + 1):nrow(df)) {
    dist_mat[i, j] <- sum(df[i, 1:4] != df[j, 1:4])
  }
}
# mirror the upper triangle into the lower one (assigning the
# upper.tri values to lower.tri directly would scramble the order)
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]
This is a job for combn:
DF <- data.frame(x=rep(1,6), y=rep(1:2,3))
combn(seq_len(nrow(DF)), 2, FUN = function(ind, df) {
  c(ind[1], ind[2], sum(df[ind[1], ] != df[ind[2], ]))
}, df = as.matrix(DF))
Note that I convert the data.frame into a matrix, since matrix subsetting is faster than data.frame subsetting. Depending on your data types this could become a problem.
If your distance measure wasn't so unusual, dist would be helpful (and fast).
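For completeness, both loops can be dropped: build one n x n inequality matrix per column with outer() and sum them. A sketch (with a small made-up numeric df); memory use is O(n^2) per column, so this trades RAM for speed:

```r
set.seed(1)
df <- as.data.frame(matrix(sample(1:3, 24, replace = TRUE), ncol = 4))

m <- as.matrix(df[, 1:4])  # matrix subsetting is faster than data.frame subsetting
# outer(col, col, `!=`) is an n x n logical matrix for one column;
# summing over the four columns counts the unequal values per row pair
dist_mat <- Reduce(`+`, lapply(seq_len(ncol(m)),
                               function(j) outer(m[, j], m[, j], `!=`) * 1L))
```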

Create data.frame conditional on another df without for loop

I'm trying to create a data.frame that takes different values depending on the value of a reference data.frame. I only know how to do this with a "for loop", but have been advised to avoid for loops in R... and my actual data have ~500,000 rows x ~200 columns.
a <- as.data.frame(matrix(rbinom(10,1,0.5),5,2,dimnames=list(c(1:5),c("a","b"))))
b <- data.frame(v1=c(2,10,12,5,11,3,4,14,2,13),v2=c("a","b","b","a","b","a","a","b","a","b"))
c <- as.data.frame(matrix(0,5,2))
for (i in 1:5) {
  for (j in 1:2) {
    if (a[i, j] == 1) {
      c[i, j] <- mean(b$v1[b$v2 == colnames(a)[j]])
    } else {
      c[i, j] <- mean(b$v1)
    }
  }
}
c
I create data.frame "c" based on the value in each cell, and the corresponding column name, of data.frame "a".
Is there another way to do this? Indexing? Using data.table? Maybe apply functions?
Any and all help is greatly appreciated!
(a == 0) * mean(b$v1) + t(t(a) * c(tapply(b$v1, b$v2, mean)))
Run in pieces to understand what's happening. Also, note that this assumes ordered names in a (and 0's and 1's as entries in it, as per OP).
An alternative to a bunch of t's as above is using mapply (this assumes a is a data.frame or data.table and not a matrix, while the above doesn't care):
(a == 0) * mean(b$v1) + mapply(`*`, a, tapply(b$v1, b$v2, mean))
# subsetting a matrix is faster
res <- as.matrix(a)
# calculate fill-in values outside the loop
in1 <- mean(b$v1)
in2 <- sapply(colnames(a), function(i) mean(b$v1[b$v2 == i]))
# loop over columns and use a vectorized approach
for (i in seq_len(ncol(res))) {
  res[, i] <- ifelse(res[, i] == 0, in1, in2[i])
}
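The remaining column loop can be removed as well. A sketch of a fully vectorized version of the same idea, relying on the fact that a == 0 on a data.frame returns a logical matrix that can index res directly:

```r
a <- as.data.frame(matrix(rbinom(10, 1, 0.5), 5, 2,
                          dimnames = list(c(1:5), c("a", "b"))))
b <- data.frame(v1 = c(2, 10, 12, 5, 11, 3, 4, 14, 2, 13),
                v2 = c("a", "b", "b", "a", "b", "a", "a", "b", "a", "b"))

col_means <- tapply(b$v1, b$v2, mean)[colnames(a)]     # fill value for the "1" cells
res <- matrix(rep(col_means, each = nrow(a)), nrow(a)) # every cell gets its column mean
res[a == 0] <- mean(b$v1)                              # "0" cells get the overall mean
```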

Correlation between selection of columns in df using a for-loop

I have a dataframe (df) with 8 columns. I'd like to use a for loop to calculate the Pearson correlation for pairs of columns in the following way:
cor1=cor(df[,1], df[,2])
cor2=cor(df[,3], df[,4])
and so on. What is the best way to do this?
Easiest is just to compute the correlation matrix, then you can index it if you want:
df <- data.frame(rnorm(10),rnorm(10),rnorm(10))
corMat <- cor(df)
For example, correlation between variables 1 and 2:
corMat[1,2]
Or do you really need to have specific correlations in separate objects?
Edit
Here is a for loop example of what you want:
df <- data.frame(rnorm(10),rnorm(10),rnorm(10),rnorm(10))
for (i in seq(1, ncol(df), by = 2)) {
  assign(paste("cor", i / 2 + 0.5, sep = ""), cor(df[, i], df[, i + 1]))
}
Though it is quite inefficient.
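An alternative that avoids assign altogether: a sketch that collects the correlations into one named vector instead of scattering cor1, cor2, ... through the workspace:

```r
df <- data.frame(rnorm(10), rnorm(10), rnorm(10), rnorm(10))

idx <- seq(1, ncol(df) - 1, by = 2)  # first column of each pair: 1, 3, ...
cors <- sapply(idx, function(i) cor(df[[i]], df[[i + 1]]))
names(cors) <- paste0("cor", seq_along(idx))  # cor1 = cols 1 & 2, cor2 = cols 3 & 4
cors
```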
You can use apply with a generalized function:
df <- data.frame(a = rnorm(10), b = rnorm(10), c1 = rnorm(10), d = rnorm(10))
f <- function(x) {
  cc <- x[1] # column index
  if (cc < ncol(df)) {
    cor(x[-1], df[, cc + 1]) # ignore the 1st element [-1], which is the index
  }
}
# apply over columns after adding a row of column id numbers at the top
apply(rbind(1:dim(df)[2], df), 2, f)
Maybe there is an R function to get the column/row id inside an apply function? In that case, we wouldn't need to rbind the column ids.
