Correlation between two dataframes by row - r

I have 2 data frames with 5 columns and 100 rows each:
id price1 price2 price3 price4 price5
1 11.22 25.33 66.47 53.76 77.42
2 33.56 33.77 44.77 34.55 57.42
...
I would like to get the correlation of the corresponding rows, basically
for(i in 1:100){
cor(df1[i, 1:5], df2[i, 1:5])
}
but without using a for loop. I'm assuming there's some way to use plyr to do it, but I can't seem to get it right. Any suggestions?

Depending on whether you want a cool or a fast solution, you can use either
diag(cor(t(df1), t(df2)))
which is cool but wasteful (it computes the correlations between all pairs of rows, most of which you don't need and will be thrown away), or
A <- as.matrix(df1)
B <- as.matrix(df2)
sapply(seq.int(dim(A)[1]), function(i) cor(A[i,], B[i,]))
which does only what you want but is a bit more to type.
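For illustration, here is a small self-contained sketch (a mock 4 x 5 pair of data frames standing in for your 100 x 5 ones) showing that the two approaches agree:
set.seed(1)
df1 <- as.data.frame(matrix(rnorm(20), nrow = 4))
df2 <- as.data.frame(matrix(rnorm(20), nrow = 4))
r1 <- diag(cor(t(df1), t(df2)))   # computes the full 4 x 4 matrix, keeps only the diagonal
A <- as.matrix(df1)
B <- as.matrix(df2)
r2 <- sapply(seq.int(dim(A)[1]), function(i) cor(A[i, ], B[i, ]))   # row by row only
all.equal(unname(r1), r2)   # TRUE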

I found that as.matrix is not strictly required.
Correlations of corresponding rows of data frames df1 and df2 (unlist flattens each one-row data frame into the plain numeric vector that cor expects; otherwise cor treats the row as five one-observation variables and returns NA):
sapply(1:nrow(df1), function(i) cor(unlist(df1[i,]), unlist(df2[i,])))
and of corresponding columns:
sapply(1:ncol(df1), function(i) cor(df1[,i], df2[,i]))

Related

How to make a count matrix of common elements across many groups?

I'm trying to identify common elements across multiple vectors, with all combinations possible.
I had previously tried this one here, but it doesn't quite work out because it only retrieves the common elements between 2 groups.
Take this example: I have 10 vectors (varying in number of elements) that may have common elements with one or more other vectors. It is also possible that some elements are exclusive to some groups. As an example, here is the data:
#Creating a mock example: 10 groups, with varying number of elements:
set.seed(753)
for (i in 1:10){
assign(paste0("grp_",i), paste0("target_", sample(1:40, sample(20:34))))
}
Simply put, I want to do something analogous to a Venn diagram, but put into a data frame/matrix with the counts instead. Something like this (note that here I am just adding a snapshot of random parts of how the resulting data frame/matrix should look):
grp1 grp2 grp3 grp4 grp1.grp4.grp5.grp8.grp10
grp1 - 16 12 20 5
grp2 16 - 10 20 4
grp3 12 10 - 16 3
grp4 20 20 16 - 5
grp1.grp4.grp5.grp8.grp10 5 4 3 5 10
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10 0 0 0 0 0
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10
grp1 3
grp2 6
grp3 4
grp4 1
grp1.grp4.grp5.grp8.grp10 5
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10 2
From the table above, please also note that counts whose row and column names are the same are elements exclusive to that particular group (e.g. the count in row 1/column 1 would be the number of elements found only in grp1).
Any help is very much appreciated!
EDIT: the real counts for the expected final matrix has now been added.
OK, if I understood everything correctly, let's give it a try. Note that I put your sample data into a list, so we can index the groups when intersecting them.
set.seed(753)
grps <- list()
for (i in 1:10){
grps[i] <- list(paste0("target_", sample(1:40, sample(20:34))))
}
You want all combinations of the 10 groups, i.e. 2^10 - 1 = 1023 of them, giving a 1023 x 1023 result.
Making it flexible makes testing a bit easier ;)
The key here is that combinations is a list of integer vectors that we can use to index grps.
N <- 10
combinations <- unlist(sapply(1:N, function(n) combn(1:N, n, simplify = F)), recursive = F)
Now we loop twice over the combinations, comparing the intersect of each combination with the intersect of every other one. Nesting two sapply calls gives us the 1023 x 1023 matrix you want.
results <- sapply(seq_along(combinations), function(i) {
sapply(seq_along(combinations), function(j) {
length(intersect(
Reduce(intersect, grps[combinations[[i]]]),
Reduce(intersect, grps[combinations[[j]]])
))
})
})
Now we create the names shown in your example; they are based on the combinations we created and used earlier.
names <- sapply(combinations, function(x) paste("grp", x, sep = "", collapse = "."))
Create the colnames and rownames of the matrix
colnames(results) <- rownames(results) <- names
It seems that in your output you don't want values on the diagonal, so we change those to NA:
diag(results) <- NA
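As a quick sanity check on the objects built above (the names follow the grpX.grpY pattern created by the paste call):
dim(results)   # 1023 x 1023
results["grp1", "grp2"]   # number of elements common to grp1 and grp2
results["grp1.grp4.grp5.grp8.grp10", "grp3"]   # that 5-way intersection, intersected with grp3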

lapply - dividing columns and calculate standard deviation

I have a list of 10 data.frames and I need to divide two columns in each data.frame and then calculate the relative standard deviation of the result.
I would like to use lapply.
Here is an example of one of the data.frames contained in the list:
df <- read.table(text = 'X Y
2 4
5 3
1 2
7 1
4 2
6 1', header = TRUE)
I have to perform the following operations with lapply for all my 10 data.frames:
ratio <- df$X / df$Y
sd <- sd(ratio)
We can do this by looping over the list with lapply, extracting the columns of interest, dividing to get the 'ratio', and then taking the sd of that ratio. (It could be done in a single step too.)
lapply(lst, function(x) {
  ratio <- x$X / x$Y
  sd(ratio)
})
where 'lst' is the list of 'data.frame's.
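For completeness, a minimal sketch reusing the example df from above, with a hypothetical two-element lst standing in for the real list of 10 data.frames:
lst <- list(a = df, b = transform(df, X = X * 2))   # hypothetical stand-in list
lapply(lst, function(x) sd(x$X / x$Y))              # one sd per data.frame, names preserved
# sapply(lst, function(x) sd(x$X / x$Y)) gives the same values as a named numeric vector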

Calculating ratio of consecutive values in dataframe in r

I have a dataframe with 5-second intraday data of a stock. The dataframe consists of a column for the date, one for the time and one for the price at that moment.
I want to make a new column in which it calculates the ratio of two consecutive price values.
I tried it with a for loop, which works but is really slow.
data["ratio"]<- 0
i<-2
for(i in 2:nrow(data))
{
if(is.na(data$price[i])== TRUE){
data$ratio[i] <- 0
} else {
data$ratio[i] <- ((data$price[i] / data$price[i-1]) - 1)
}
}
I was wondering if there is a faster option, since my dataset contains more than 500,000 rows.
I was already trying something with ddply:
data["ratio"]<- 0
fun <- function(x){
data$ratio <- ((data$price/lag(data$price, -1))-1)
}
ddply(data, .(data), fun)
and mutate:
data<- mutate(data, (ratio =((price/lag(price))-1)))
but neither works and I don't know how to solve it...
Hopefully somebody can help me with this!
You can use the lag function (dplyr's, the same one your mutate attempt uses) to shift your data by one row and then take the ratio of the original data to the shifted data. This is vectorized, so you don't need a for loop, and it should be much faster. Also, the number of lag units in the lag function has to be positive, which may be causing an error when you run your code.
# Create some fake data
library(dplyr) # for the vectorized lag() used below
set.seed(5) # For reproducibility
dat = data.frame(x=rnorm(10))
dat$ratio = dat$x/lag(dat$x,1)
dat
x ratio
1 -0.84085548 NA
2 1.38435934 -1.64637013
3 -1.25549186 -0.90691183
4 0.07014277 -0.05586875
5 1.71144087 24.39939227
6 -0.60290798 -0.35228093
7 -0.47216639 0.78314834
8 -0.63537131 1.34565131
9 -0.28577363 0.44977422
10 0.13810822 -0.48327840
A for loop in R can be extremely slow. Try to avoid it if you can.
datalen=length(data$price)
data$ratio[2:datalen]=data$price[1:(datalen-1)]/data$price[2:datalen]
You don't need the is.na check; you will get NA in the result if either the numerator or the denominator is NA.
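For reference, a base-R sketch of exactly the quantity asked for in the question, (price[i] / price[i-1]) - 1, with no extra packages (data/price as in the question):
prev <- c(NA, head(data$price, -1))   # price shifted down by one row; the first row has no predecessor
data$ratio <- data$price / prev - 1   # NAs in price simply propagate into ratio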

Group Data frame by unique values and mean

I'm working with a data frame of 18 columns, the working columns being CPM and SpendRange. SpendRange is broken up into levels from 1 to 3000 in steps of 50.
I'm trying to average the CPM (Cost per Mille) within each spend range, ending up with a data frame of the unique spend ranges and the mean CPM in each.
I tried:
CPMbySpend<-aggregate(Ads$CPM,by=list(Ads$SpendRange),function(x) paste0(sort(unique(x)),collapse=mean(Ads$CPM)))
CPMbySpend<-data.frame(CPMbySpend)
Obviously finding out that I can't use the collapse as a function... any suggestions?
Optimum output would be:
1-50 | mean(allvalues with spendrange 1-50)
51-100 | mean(allvalues with spendrange 51-100)
Using the new dataset
Ads <- read.csv("Test.csv", header=TRUE, stringsAsFactors=FALSE)
Ads$CPM <- as.numeric(Ads$CPM) #the elements that are not numeric ie. `-$` etc. will be coerced to NAs
#Warning message:
#NAs introduced by coercion
res <- aggregate(Ads$CPM,by=list(SpendRange=Ads$SpendRange),FUN=mean, na.rm=TRUE)
If you want to order the SpendRange, i.e. 0-1, 1-50, etc., one way is to use mixedorder from gtools.
library(gtools)
res1 <- res[mixedorder(res$SpendRange),]
row.names(res1) <- NULL
head(res1)
# SpendRange x
#1 0-1 1.621987
#2 1-50 2.519853
#3 51-100 3.924538
#4 101-150 5.010795
#5 151-200 3.840549
#6 201-250 4.286923
Otherwise, you could change the order by specifying the levels of SpendRange when calling factor, i.e.
res1$SpendRange <- factor(res1$SpendRange, levels= c('0-1', '1-50',.....)) #pseudocode
and then use
res1[order(res1$SpendRange),]
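To see the shape of the result without the real file, here is the same aggregate call on a tiny hypothetical Ads data frame:
Ads <- data.frame(SpendRange = c("1-50", "1-50", "51-100", "51-100"),
                  CPM = c(2, 3, 4, 6))
aggregate(Ads$CPM, by = list(SpendRange = Ads$SpendRange), FUN = mean, na.rm = TRUE)
#   SpendRange   x
# 1       1-50 2.5
# 2     51-100 5.0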

Identifying duplicate columns in a dataframe

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous and categorical variables.
My approach has been to generate a frequency table for each column of the frame into a list, then use the duplicated() function to find elements of that list which are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
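A quick check, rebuilding the testframe from the question with its duplicates intact (age, height and gender as defined above):
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)
names(testframe[!duplicated(as.list(testframe))])
# [1] "age"    "height" "gender"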
You can do it with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
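A small illustration of that point, and of the caveat it implies: two different columns that happen to share the same summary (for example permutations of each other) would be flagged as duplicates, so treat this as a fast heuristic:
identical(summary(c(3, 1, 2)), summary(c(1, 2, 3)))   # TRUE: summary ignores the order of values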
Not 100% sure, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
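To actually drop the duplicated columns with this trick (note that t() on a mixed-type data frame coerces everything to a character matrix first):
testframe[, !duplicated(t(testframe))]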
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best to first find the duplicated column names and treat them accordingly (for example summing the two, taking the mean, the first, the last, the mode, etc.). To find the duplicated column names:
names(df)[duplicated(names(df))]
What about just:
unique.matrix(testframe, MARGIN=2)
Actually you would just need to invert the duplicated result in your code and could stick to using subset (which is more readable than bracket notation, imho):
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- gsub('\\.x$', '', names(testframe2)) # escape the dot and anchor at the end so only the .x suffix is removed
Since this Q&A is a popular Google search result but the answers above are a bit slow for a large matrix, I propose a version using exponential search and data.table power.
It is a function I implemented in the dataPreparation package.
The function
dataPreparation::which_are_in_double (the package also provides which_are_bijection):
which_are_in_double(testframe)
which returns 3 and 4, the columns that are duplicated in your example.
Build a data set with wanted dimensions for performance tests
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times.
benchmark(
which_are_in_double(testframe, verbose=FALSE),
duplicated(lapply(testframe, summary)),
duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So the dataPreparation function is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.
NB 2: Please note that, given the way this data set is constructed, we have a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, which_are_bijection would perform much better, while the other methods would perform similarly.
