Aggregating columns - r

I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column becomes the new column 1. Then, I remove the most correlated column from the set, so the size of the data is decreased by one column. I repeat the process until the resulting data frame has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, etc. I am therefore wondering if there is a more efficient or quicker way to get to the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57

I think most of the slowness and the eventual crash come from memory overhead during the loop, not from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.tables and taking advantage of their "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance, and comment on the results. Cheers.
library(data.table)  # for setDT(), set(), copy()

n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA, nrow = r, ncol = n)))
temp <- setDT(copy(df))                 # temporary data.table in which the correlations are calculated
set(result, j = 1L, value = temp[[1]])  # the first column is the same
for (icol in 2:n) {
  mch <- match(max(cor(temp)[-1, 1]), cor(temp)[, 1])               # determine which column is most correlated with column 1
  set(result, i = NULL, j = icol, value = temp[[1]] + temp[[mch]])  # aggregate and place the result in the result table
  set(temp, i = NULL, j = 1L, value = result[[icol]])               # set the aggregate as the new first column
  set(temp, i = NULL, j = mch, value = NULL)                        # remove the most correlated column
}
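To see the difference the by-reference assignment makes, here is a minimal, self-contained benchmark sketch (illustrative only; the data sizes are made up and timings will vary by machine):
library(data.table)

dfr <- data.frame(matrix(0, nrow = 1e5, ncol = 100))
dtb <- as.data.table(dfr)

# data.frame column modification typically triggers copying
system.time(for (j in 1:100) dfr[[j]] <- dfr[[j]] + 1)
# set() modifies the column in place, by reference
system.time(for (j in 1:100) set(dtb, j = j, value = dtb[[j]] + 1))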

Try
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor]
  temp[, 1] <- result[, i]  # set the aggregate as the new first column
  temp[, maxcor] <- NULL    # remove the most correlated column
}
The error occurred because, in the last iteration, subsetting temp yields a single column, and standard R behavior is to drop the class from data frame to vector in that case, which causes sapply to receive only the first element, etc.; the drop = FALSE in the code above prevents this.
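To illustrate the dropping behavior with a toy example (not part of the original code):
df2 <- data.frame(a = 1:3, b = 4:6)
class(df2[, -1])                # "integer": dropped to a bare vector
class(df2[, -1, drop = FALSE])  # "data.frame": dimensions preserved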
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
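If the strongest correlation in absolute value is what you are after, a sketch like this would do it (note this changes the selection rule, so it is an assumption about your intent):
cortemp <- cor(temp)
mch <- which.max(abs(cortemp[-1, 1])) + 1L  # +1 offsets the dropped first row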
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.

Related

How to use the `map` family of commands in the **purrr** package to swap columns across rows in a data frame?

Imagine there are 4 cards on the desk, in several rows (e.g., 5 rows in the demo). The value of each card is listed in the demo data frame, but the true position of each card is indexed by the pos columns; see the demo data I generated below.
To restore the cards' values to their original positions, I swap the cards across the rows with the `[` function. The following code already fulfills this purpose. To avoid explicit use of the loop, I wonder whether I can achieve the same effect with a vectorized approach from the tidyverse family, e.g. pmap or a related function from the purrr package?
# 1. data generation ------------------------------------------------------
rm(list = ls())
vect <- matrix(round(runif(20), 2), nrow = 5)
colnames(vect) <- paste0('card', 1:4)
order <- rbind(c(2, 3, 4, 1), c(3, 4, 1, 2), c(1, 2, 3, 4), c(4, 3, 2, 1), c(3, 4, 2, 1))
colnames(order) <- paste0('pos', 1:4)
dat <- data.frame(vect, order, stringsAsFactors = FALSE)

# 2. data swap ------------------------------------------------------------
for (i in 1:nrow(dat)) {
  orders <- dat[i, paste0('pos', 1:4)]
  card <- dat[i, paste0('card', 1:4)]
  vec <- card[order(unlist(orders))]
  names(vec) <- paste0('deck', 1:4)
  dat[i, paste0('deck', 1:4)] <- vec
}
dat
You could use pmap_dfr:
card_cols <- grep('card', names(dat))
pos_cols <- grep('pos', names(dat))

dat[paste0('deck', seq_along(card_cols))] <- purrr::pmap_dfr(dat, ~ {
  x <- c(...)
  as.data.frame(t(unname(x[card_cols][order(x[pos_cols])])))
})
dat
# card1 card2 card3 card4 pos1 pos2 pos3 pos4 deck1 deck2 deck3 deck4
#1 0.05 0.07 0.16 0.86 2 3 4 1 0.86 0.05 0.07 0.16
#2 0.20 0.98 0.79 0.72 3 4 1 2 0.79 0.72 0.20 0.98
#3 0.50 0.79 0.72 0.10 1 2 3 4 0.50 0.79 0.72 0.10
#4 0.03 0.98 0.48 0.06 4 3 2 1 0.06 0.48 0.98 0.03
#5 0.41 0.72 0.91 0.84 3 4 2 1 0.84 0.91 0.41 0.72
One thing to note here: make sure that the output of the pmap call does not carry the original column names. If it did, the columns would be reshuffled according to those names and the output would not be in the correct order; unname is used above to strip them.
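For what it's worth, the swap can also be fully vectorized in base R with a two-column matrix index. This is a sketch offered as a speed-minded alternative, not part of the original answer (it reuses card_cols and pos_cols from above):
cards <- as.matrix(dat[card_cols])
pos <- as.matrix(dat[pos_cols])

# scatter each card to its true position: deck[i, pos[i, j]] <- cards[i, j]
deck <- matrix(NA_real_, nrow(cards), ncol(cards))
deck[cbind(rep(seq_len(nrow(cards)), ncol(cards)), as.vector(pos))] <- as.vector(cards)

dat[paste0('deck', seq_len(ncol(cards)))] <- deck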

Generate subsequences in R

I have a df which is 67200 obs long, with 5 vars. I would like to create a list of subsequences from one var, each of equal length (600 obs). I would like the sequences to roll forward one observation at a time, i.e. seq1 = 0:600, seq2 = 1:601, seq3 = 2:602, and so on. I will then sum the data in each subsequence to identify the one with the highest total.
I understand how to make a basic sequence using seq; however, after reading around SO and other sites, I can only find info on how to identify specific sequences. Any ideas on ways to create these subsequences would be great.
Sample Data:
Var1 Var2 Var3 Var4 Var5
0.00 0.31 0.32 0.00 0.01
0.10 0.46 0.46 0.13 0.01
0.20 0.46 0.47 0.14 0.02
0.30 0.40 0.21 0.14 0.02
0.40 0.38 0.11 0.20 0.03
0.50 0.38 0.07 0.25 0.04
Expected Output:
A list with one element per subsequence
To answer your question, I think you can achieve your expected output with lapply and seq:
x <- 600
n <- 0:(nrow(df) - x)
lapply(n, function(i) seq(i, i + x))
However, reading the description, it seems you are trying to perform a rolling calculation, and the above is not the best approach for that. Look into the zoo library: it has functions like rollsum, rollmean, and the general rollapply, which give a better way to do this.
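For example, a minimal sketch with zoo that finds the 600-observation window with the highest total (assuming Var1 is the variable of interest; the column name is a guess):
library(zoo)

sums <- rollsum(df$Var1, k = 600, align = "left")      # sum of each 600-obs window
best_start <- which.max(sums)                          # where the highest-total window begins
best_window <- df$Var1[best_start:(best_start + 599)]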

Find correlation between columns whose names are specified as values in another dataframe

I have two dataframes; one is a list of pairs of individuals, similar to below (but with about 150 pairs):
ID_1 ID_2
X14567 X26789
X12637 X34560
X67495 X59023
The other dataframe consists of one column per individual, with numerical values relating to that individual underneath. All told, about 300 columns and 300 rows. For example:
X14567 X12637 X26789 X67495 X34560 X59023
0.41 0.29 0.70 0.83 0.41 0.30
0.59 0.44 0.20 0.94 0.03 0.97
0.48 0.91 0.78 0.92 0.40 0.09
0.07 0.21 0.42 0.14 0.96 0.96
0.33 0.13 0.53 0.04 0.52 0.49
0.94 0.28 0.37 0.26 0.11 0.09
I want to find the correlation of these values between each pair of individuals, to end up with something like:
ID_1 ID_2 Correlation
X14567 X26789 -0.25
X12637 X34560 -0.25
X67495 X59023 -0.11
Is there a way I can pull the values from the first dataframe to specify the names of the two columns between which I need to find the correlation, in a way that can be easily repeated for each row of the first dataframe?
Many thanks for your help
If x and y are your two data.frames and the column names are set appropriately, you can use apply.
apply(x, 1, function(row) cor(y[row[1]], y[row[2]]))
From there just add the values to your x data.frame:
x$cor <- apply(x, 1, function(row) cor(y[row[1]], y[row[2]]))
V1 V2 cor
2 X14567 X26789 -0.2515737
3 X12637 X34560 -0.2563294
4 X67495 X59023 -0.1092830
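An equivalent sketch with mapply, pulling each column by name (the as.character calls guard against the ID columns being factors):
x$cor <- mapply(function(a, b) cor(y[[a]], y[[b]]),
                as.character(x$ID_1), as.character(x$ID_2))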
If you just want the correlations between all columns in your second data frame, you can do:
library(reshape2)
df.corr = melt(cor(df))
To remove the self-correlations (that is, the correlation of each column with itself):
df.corr = subset(df.corr, Var1 != Var2)
Example using built-in mtcars data frame:
mtcars.corr = melt(cor(mtcars))
Var1 Var2 value
1 mpg mpg 1.00000000
2 cyl mpg -0.85216196
3 disp mpg -0.84755138
...
119 am carb 0.05753435
120 gear carb 0.27407284
121 carb carb 1.00000000

Looping arithmetic calculation between tables

I have a table that looks like this:
table1
Not Visible Visible <NA>
All 0.29 0.50 0.20
Bowtie 0.24 0.17 0.59
Cola 0.15 0.83 0.02
Squig 0.49 0.51 0.49
I then have 9 other similar tables. Below is an example:
table2
Not Visible Visible <NA>
All 0.28 0.50 0.23
Bowtie 0.11 0.30 0.59
Cola 0.30 0.67 0.03
Squig 0.42 0.51 0.06
I want the result of table1 - table2, as below, but I also want the same for table1 with each of the other 9 tables.
Not Visible Visible <NA>
All 0.01 0.00 -0.03
Bowtie 0.13 -0.13 0.00
Cola -0.15 0.16 -0.01
Squig 0.07 0.00 0.43
How do I do this without writing table1 - table2, table1 - table3, table1 - table4, etc.?
If I try looping with the code below (as an example), I get a "non-numeric argument to binary operator" error:
Tables <- c("table1", "table2") ## as an example
for (r in Tables) {
yy <- paste(r,"res", sep = "-")
zz <- table1-r
assign(yy,zz)
}
Any ideas?
Consider using a list of the tables themselves (not the string literals of their names; subtracting a string is what raised the non-numeric argument error) and then use lapply(), whose resulting list can be saved as individual tables or bound into a dataframe:
# LIST OF TABLES WITH NAMED ELEMENTS (t1 NOT INCLUDED)
tables <- setNames(list(t2, t3, t4, t5, t6, t7, t8, t9),
                   c("table2", "table3", "table4", "table5",
                     "table6", "table7", "table8", "table9"))
# ITERATIVELY SUBTRACT FROM t1
tableList <- lapply(tables, function(x) t1 - x)
# SAVE EACH TABLE AS SEPARATE OBJECTS
list2env(tableList, envir=.GlobalEnv)
# DATAFRAME BINDING - WIDE FORMAT (INCLUDING t1)
df <- as.data.frame(cbind(t1, do.call(cbind, tableList)))
# DATAFRAME BINDING - LONG FORMAT (INCLUDING t1)
df <- as.data.frame(rbind(t1, do.call(rbind, tableList)))
You could try this without looping:
z <- names(table1)
table3 <- table1[z] - table2[z]

How to color points in R with the same scale

I have a data frame in the following form:
Data <- data.frame(X = sample(1:10), Y = sample(1:10))
I would like to color the dots obtained with
plot(Data$X,Data$Y)
using the values from another data frame:
X1 X2 X3 X4 X5
1 0.57 0.40 0.64 0.07 0.57
2 0.40 0.45 0.49 0.21 0.39
3 0.72 0.65 0.74 0.61 0.71
4 0.73 0.54 0.76 0.39 0.64
5 0.88 0.81 0.89 0.75 0.64
6 0.70 0.65 0.78 0.51 0.66
7 0.84 0.91 0.89 0.86 0.83
8 -0.07 0.39 -0.02 0.12 -0.01
9 0.82 0.83 0.84 0.81 0.79
10 0.82 0.55 0.84 0.51 0.59
So the goal is to have five different graphs, using the five columns from the second data frame to color the dots. I managed to do this by looking here (Colour points in a plot differently depending on a vector of values), but I'm not able to figure out how to set the same color scale for all five plots.
The columns in the second data frame can have different minima and maxima, so if I generate the colors using the cut function on the first column, this will produce factors, and later colors, that are relative to that column only.
Hope this is clear,
Thanks.
You need your color ramp to include all values, so you likely want to get them into a single vector. I would melt the data, then build the color ramp, then use the facet functionality in ggplot to get multiple plots. Alternatively, if you don't want to use ggplot, you could cast the data back to multiple columns, with 5 extra columns for your colors.
require(reshape2)
require(ggplot2)

Data.m <- melt(Data, id.vars = "Y")  # stack the value columns, keeping Y
rbPal <- colorRampPalette(c('red', 'blue'))
Data.m$Col <- rbPal(10)[as.numeric(cut(Data.m$value, breaks = 10))]  # one shared scale across all values

ggplot(Data.m, aes(value, Y, col = Col)) +
  geom_point() +
  scale_color_identity() +  # use the precomputed hex colors as-is
  facet_grid(variable ~ .)
Your Data object has two variables, X and Y, but then you talk about making 5 graphs, so that part is a little unclear; still, the melt function helps build a comprehensive color ramp, and facet_grid makes it easy to produce all 5 graphs at once if that is what you want.
