I need to calculate pairwise, consecutive correlations for each of these date variables (there are 246 in my dataset):
Company 2009/08/21 2009/08/24 2009/08/25
A -0.0019531250 -0.0054602184 -6.274510e-03
AA -0.0063291139 -0.0266457680 -1.750199e-02
AAPL 0.0084023598 -0.0055294118 -1.770643e-04 ...
...
So that I can find cor(col1,col2), cor(col2,col3), but nothing for cor(col1,col3). I realize that if I wanted all combinations I could use the combn function, but I can't figure out how to do it for my circumstances without something inefficient like a for loop.
Approach 1
You could do:
lapply(1:(ncol(dat)-1), function(i) cor(dat[, i], dat[, i+1],
use="pairwise.complete.obs"))
Example
A dataframe with 10 variables will give you 9 consecutive correlations, i.e. columns 1-2, 2-3, 3-4 etc, if that is what you want.
dat <- replicate(10, rnorm(10))
lapply(1:(ncol(dat)-1), function(i)
cor(dat[, i], dat[, i+1], use="pairwise.complete.obs"))
Approach 2 (very concise)
Using the iris dataset as well:
dat <- iris[, 1:4]
diag(cor(dat, use="pairwise.complete.obs")[, -1])
[1] -0.1175698 -0.4284401 0.9628654
As you pointed out, combn is the way to go. Assume your data.frame is called dat; then, for consecutive columns, try this:
ind <- combn(ncol(dat), 2)
consecutive <- ind[ , apply(ind, 2, diff)==1]
lapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]]))
Consider this simple example:
> data(iris)
> dat <- iris[, 1:4]
> # changing colnames to see whether result is for consecutive columns
> colnames(dat) <- 1:ncol(dat)
> head(dat) # this is what the data looks like
1 2 3 4
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
>
> ind <- combn(ncol(dat), 2)
> consecutive <- ind[ , apply(ind, 2, diff)==1]
> lapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]])) # output: cor matrix
[[1]]
1 2
1 1.0000000 -0.1175698
2 -0.1175698 1.0000000
[[2]]
2 3
2 1.0000000 -0.4284401
3 -0.4284401 1.0000000
[[3]]
3 4
3 1.0000000 0.9628654
4 0.9628654 1.0000000
If you want just the correlation values, use sapply:
> sapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]])[2,1])
[1] -0.1175698 -0.4284401 0.9628654
Loops in R should usually be avoided, but I think they sometimes carry an undeserved stigma. In this case, the loop is easier for me to read than "cooler" functions. It's also fairly efficient: a call like cor(mydata) calculates n^2 correlations, while the for loop only calculates n of them.
x = matrix( rnorm(246*20000), nrow=246 )
out = numeric(245)
system.time( { for( i in 1:245 )
out[i] = cor(x[,i],x[,i+1]) } )
# 0.022 Seconds
system.time( diag(cor(x, use="pairwise.complete.obs")[, -1]) )
# Goes for 2 minutes and then crashes my R session
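For completeness, the same consecutive correlations can also be computed fully vectorized, without either the loop or the full correlation matrix. This is only a sketch and assumes complete data (no NAs), unlike use="pairwise.complete.obs":
xs <- scale(x)  # center and scale every column (sample sd, i.e. n - 1 denominator)
out2 <- colSums(xs[, -ncol(xs)] * xs[, -1]) / (nrow(xs) - 1)  # correlation of each column with the next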
First, I'll assume your data is stored in df.
Here's what I'd do: create a function that, for any given column number, calculates the correlation between that column and the next one, like this:
cor.neighbour <- function(i) {
j <- i + 1
cr <- cor(df[, i], df[, j])
# returning a dataframe here will make sense when you see the results from sapply
result <- data.frame(
x = names(df)[i],
y = names(df)[j],
cor = cr,
stringsAsFactors = FALSE
)
return(result)
}
Then, to apply it to your whole data set, I would first create a vector i of all the column indices I want to use (which, by the way, is all but the last column) and use lapply to process them:
i <- 1:(ncol(df) - 1)
cor.pairs <- lapply(i, cor.neighbour)
# change list into a data frame (melt() comes from the reshape2 package)
library(reshape2)
cor.pairs <- melt(cor.pairs, id=names(cor.pairs[[1]]))
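For concreteness, here is a small usage sketch on the first four iris columns; the correlations match the values shown in the other answers, and the do.call(rbind, ...) line is a plain base-R alternative to melt() if you prefer not to load reshape2:
df <- iris[, 1:4]
i <- 1:(ncol(df) - 1)
cor.pairs <- lapply(i, cor.neighbour)
do.call(rbind, cor.pairs)
#              x            y        cor
# 1 Sepal.Length  Sepal.Width -0.1175698
# 2  Sepal.Width Petal.Length -0.4284401
# 3 Petal.Length  Petal.Width  0.9628654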
Related
I have two data frames and I am applying pnorm() and qnorm() to them, but I am getting errors during the calculation.
n <- c(0.3,0.5,0.1,0.2)
m <- c(0.1,0.4,0.5,0.3)
o <- c(0.2,0.2,0.2,0.4)
p <- c(0.3,0.1,0.3,0.3)
df1 = data.frame(n,m,o,p)
df1
n m o p
1 0.3 0.1 0.2 0.3
2 0.5 0.4 0.2 0.1
3 0.1 0.5 0.2 0.3
4 0.2 0.3 0.4 0.3
r <- c(0.2,0.4,0.1,0.3)
df2 = rbind.data.frame(r)
df2
X2 X4 X1 X3
1 0.2 0.4 0.1 0.3
b <- 0.15
result <- pnorm((qnorm(df1)+sqrt(b)*df2)/sqrt(1-b))
Instead of the expected output, I am getting an error:
Error in qnorm(df1) : Non-numeric argument to mathematical function
Expected output:
0.3139178 0.110853 0.1919158 0.3289671
0.5334785 0.4574897 0.1919158 0.1031127
0.0957727 0.5667216 0.1919158 0.3289671
0.2035948 0.3442989 0.4079641 0.3289671
Actually, I have these two data frames, df1 and df2, in Excel, and I have an Excel formula that I need to convert to R.
=NORMSDIST((NORMSINV(A1)+SQRT(0.15)*H1)/SQRT(1-0.15))
Here A1 is the first value of df1 and so on, and H1 is the first value of df2 and so on.
What you're trying to do is apply a function to every row of df1. To do so, we first need to write that function.
getDist <- function(x, b = 0.15) {
pnormInput <- as.numeric((qnorm(as.numeric(x)) + sqrt(b) * df2) / sqrt(1 - b))
pnorm(pnormInput)
}
Next we apply this function to every row in df1 (using apply).
result <- apply(df1, 1, function(x) getDist(x))
Next we have to transpose result (flip the table we got).
result <- t(result)
# [,1] [,2] [,3] [,4]
# [1,] 0.3139178 0.1108530 0.1919158 0.3289671
# [2,] 0.5334785 0.4574897 0.1919158 0.1031127
# [3,] 0.0957727 0.5667216 0.1919158 0.3289671
# [4,] 0.2035948 0.3442989 0.4079641 0.3289671
I think this is a classic case of trying to do many operations in one line and losing track of what each function is doing. My answer is essentially the same as @PoGibas's, but a bit more explicit and less elegant.
I'll calculate the terms separately and then combine them again afterwards:
num1 <- apply(df1, 1, qnorm)      # Apply 'qnorm' row-wise (the result comes back transposed)
num2 <- sqrt(b) * r               # Multiply the vector r by the constant sqrt(b)
num <- sweep(num1, 1, num2, "+")  # Add the vector num2 row-wise to the matrix num1
den <- sqrt(1 - b)                # den is a constant
result <- pnorm(num/den)          # num is a matrix, divided elementwise by the constant den
t(result)
By doing the operations step-by-step, you will often have a much easier time finding the source of an error.
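As a side note, once the shapes are clear, the same result can be obtained in one vectorized step on the underlying matrix; a minimal sketch, assuming df1, df2 and b as defined in the question:
num <- sweep(qnorm(as.matrix(df1)), 2, sqrt(b) * unlist(df2), "+")  # add the df2 term column-wise
pnorm(num / sqrt(1 - b))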
I am looking to create a function that takes the training set and the test set as its arguments, min-max scales/normalizes the training set and returns it, and then uses the same minimum and range values to min-max scale/normalize and return the test set.
So far this is the function I have come up with:
min_max_scaling <- function(train, test){
min_vals <- sapply(train, min)
range1 <- sapply(train, function(x) diff(range(x)))
# scale the training data
train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))
for(i in seq_len(ncol(train))){
column <- (train[,i] - min_vals[i])/range1[i]
train_scaled[i] <- column
}
colnames(train_scaled) <- colnames(train)
# scale the testing data using the min and range of the train data
test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))
for(i in seq_len(ncol(test))){
column <- (test[,i] - min_vals[i])/range1[i]
test_scaled[i] <- column
}
colnames(test_scaled) <- colnames(test)
return(list(train = train_scaled, test = test_scaled))
}
The definition of min-max scaling is similar to this question asked earlier on SO: Normalisation of a two column data using min and max values.
My questions are:
1. Is there a way to vectorize the two for loops in the function? e.g. using sapply()
2. Are there any packages that allow us to do what we are looking to do here?
Here is the code for the min-max normalization. See this Wikipedia page for the formulae, and also other ways of performing feature scaling.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
To normalize every column of a data frame at once, use lapply and wrap the result back into a data frame:
as.data.frame(lapply(df, normalize))
Update to address Holger's suggestion.
If you want to pass additional arguments to min() and max(), e.g., na.rm, then you can use:
normalize <- function(x, ...) {
  return((x - min(x, ...)) / (max(x, ...) - min(x, ...)))
}
x <- c(1, NA, 2, 3)
normalize(x)
# [1] NA NA NA NA
normalize(x, na.rm = TRUE)
# [1] 0.0  NA 0.5 1.0
Just keep in mind that whatever you pass to min() via the ellipsis ... you also implicitly pass to max(). In this case, this shouldn't be a big problem, since min() and max() share the same function signature.
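As for the 1st question in the post (vectorizing the two loops while reusing the training minima and ranges), here is a minimal sketch using Map(); the variable names mirror the function in the question and rescale is just a helper introduced here:
min_max_scaling <- function(train, test) {
  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))
  rescale <- function(x, mn, rg) (x - mn) / rg  # scale one column with the training min and range
  list(train = as.data.frame(Map(rescale, train, min_vals, range1)),
       test = as.data.frame(Map(rescale, test, min_vals, range1)))
}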
Regarding your 2nd question, you can use the caret package:
library(caret)
train = data.frame(a = 1:3, b = 10:12)
test = data.frame(a = 1:6, b = 7:12)
pp = preProcess(train, method = "range")
predict(pp, train)
# a b
# 1 0.0 0.0
# 2 0.5 0.5
# 3 1.0 1.0
predict(pp, test)
# a b
# 1 0.0 -1.5
# 2 0.5 -1.0
# 3 1.0 -0.5
# 4 1.5 0.0
# 5 2.0 0.5
# 6 2.5 1.0
This package also defines other transformation methods; see: http://machinelearningmastery.com/pre-process-your-dataset-in-r/
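For example, centering and scaling instead of range scaling works the same way; a minimal sketch reusing the train and test objects from above (preProcess() accepts a vector of method names):
pp2 <- preProcess(train, method = c("center", "scale"))
predict(pp2, test)  # test scaled with the training means and standard deviations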
set.seed(1984)
### simulating a data set
df <- data.frame(var1 = rnorm(100,5,3),
var2 = rpois(100,15),
var3 = runif(50,90,100))
df_train <- df[1:60,]
df_test <- df[61:100,]
## the function
normalize_data <- function(train_set, test_set)  ## the args are the two sets
{
  ranges <- sapply(train_set, function(x) max(x) - min(x))  ## range calculation
  normalized_train <- train_set/ranges                      # the normalization
  normalized_test <- test_set/ranges
  return(list(ranges = ranges,                              # returning a list
              normalized_train = normalized_train,
              normalized_test = normalized_test))
}
z <- normalize_data(df_train, df_test) ## applying the function
## the results
z$ranges
var1 var2 var3
13.051448 22.000000 9.945934
> head(z$normalized_train)
var1 var2 var3
1 0.47715854 1.1492978 7.289028
2 0.18322387 0.4545455 4.280883
3 0.69451066 1.3070668 9.703761
4 -0.04125108 1.6090169 7.277882
5 0.35731555 0.7272727 4.133561
6 0.86120315 0.6032616 9.246209
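One caveat with train_set/ranges above: dividing a data frame by a plain vector recycles the vector down the rows rather than matching it to the columns, and the minimum is never subtracted, so this is not a column-wise min-max scaling. A minimal corrected sketch using sweep(), keeping the same function shape (scale_df is a helper introduced here):
normalize_data <- function(train_set, test_set) {
  mins <- sapply(train_set, min)
  ranges <- sapply(train_set, function(x) max(x) - min(x))
  scale_df <- function(d) sweep(sweep(d, 2, mins, "-"), 2, ranges, "/")
  list(mins = mins, ranges = ranges,
       normalized_train = scale_df(train_set),
       normalized_test = scale_df(test_set))
}
z <- normalize_data(df_train, df_test)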
I need to combine the vectors (attributes) returned by the remove_outliers function into a data frame. Right now, I am getting a large matrix. I have tried the following approach (cbind in a loop):
# function to calculate IQR and upper and lower limit of given attribute
remove_outliers <- function(attribute, na.rm = TRUE, ...) {
IQR_val <- quantile(attribute, probs=c(.25, .75), na.rm = na.rm, ...)
LF <- 1.5 * IQR(attribute, na.rm = na.rm)
attribute_W_NA <- attribute
attribute_W_NA[attribute < (IQR_val[1] - LF)] <- NA
attribute_W_NA[attribute > (IQR_val[2] + LF)] <- NA
attribute_W_NA
}
cleaned_data <- NULL
for(i in 1:ncol(data_rm_val)){
# cleaned data with NA entries replacing outliers
cleaned_data <- cbind(cleaned_data, remove_outliers(data_rm_val[,i]))
}
This results in a large matrix rather than a data frame (the input data frame and the current output were shown as images in the original question). The desired result is a data frame with the same number of rows and columns as the input.
Any help would be greatly appreciated.
PS: I am a newbie in R and Data Science.
Simply use lapply with your user-defined function, avoiding the need for cbinding or appending. When you use lapply() on a data frame, the operation runs on each column:
cleaned_data <- data.frame(lapply(data_rm_val, remove_outliers))
The above assumes your function, remove_outliers, returns a vector. To guarantee a vector is always returned, consider vapply(), specifying an output template whose length equals the number of input rows, nrow(data_rm_val):
cleaned_data <- data.frame(vapply(data_rm_val, remove_outliers, numeric(nrow(data_rm_val))))
Both options above work on a data set of random numbers (since the OP does not provide example data):
data_rm_val <- data.frame(matrix(rnorm(25),5))
# X1 X2 X3 X4 X5
# 1 0.4303766 1.8152041 0.3355174 -0.4880282 -0.63612820
# 2 0.2876950 -0.7613642 -1.5046115 0.1821653 0.09397964
# 3 -2.3402548 -0.6771749 -2.0122667 -0.9442210 -1.30994853
# 4 1.4224979 -1.7940421 -0.5110736 -0.2837820 -0.24240172
# 5 -0.7484131 -0.8159326 -1.2690513 -1.0422656 1.23811458
cleaned_data <- data.frame(lapply(data_rm_val, remove_outliers))
# X1 X2 X3 X4 X5
# 1 0.4303766 NA 0.3355174 -0.4880282 -0.63612820
# 2 0.2876950 -0.7613642 -1.5046115 0.1821653 0.09397964
# 3 -2.3402548 -0.6771749 -2.0122667 -0.9442210 -1.30994853
# 4 1.4224979 NA -0.5110736 -0.2837820 -0.24240172
# 5 -0.7484131 -0.8159326 -1.2690513 -1.0422656 NA
cleaned_data2 <- data.frame(vapply(data_rm_val,
remove_outliers, numeric(nrow(data_rm_val))))
# X1 X2 X3 X4 X5
# 1 0.4303766 NA 0.3355174 -0.4880282 -0.63612820
# 2 0.2876950 -0.7613642 -1.5046115 0.1821653 0.09397964
# 3 -2.3402548 -0.6771749 -2.0122667 -0.9442210 -1.30994853
# 4 1.4224979 NA -0.5110736 -0.2837820 -0.24240172
# 5 -0.7484131 -0.8159326 -1.2690513 -1.0422656 NA
I have a large list that contains 1000 lists of the same variables and same length.
My goal is to calculate mean, standard deviation, and standard error of all lists within the large list.
I have calculated mean of the variables using Reduce(), but I couldn't figure out how to do the same for standard deviation.
My list looks something like this:
large.list <- vector('list', 1000)
for (i in 1:1000) {
large.list[[i]] <- as.data.frame(matrix(c(1:4), ncol=2))
}
large.list
[[1]]
V1 V2
1 1 3
2 2 4
[[2]]
V1 V2
1 1 3
2 2 4
[[3]]
V1 V2
1 1 3
2 2 4
......
[[1000]]
V1 V2
1 1 3
2 2 4
To calculate mean, I do:
list.mean <- Reduce("+", large.list) / length(large.list)
list.mean
V1 V2
1 1 3
2 2 4
This is an overly simplified version of my large list, but how can I calculate list-wide standard deviation and standard error the way I did for the mean?
Thank you very much in advance!
If you stay with Reduce(), you have to do a little bit of statistics:
var(x) = E(x^2) - (E(x))^2
Note that you already have E(x) as list.mean. Getting E(x^2) is also straightforward:
list.squared.mean <- Reduce("+", lapply(large.list, "^", 2)) / length(large.list)
Then variance is:
list.variance <- list.squared.mean - list.mean^2
The standard deviation is then just:
list.sd <- sqrt(list.variance)
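One small caveat: the formula above gives the population variance (dividing by the number of lists, n). R's sd() uses the sample version with an n - 1 denominator, so if you want this Reduce() result to match the sd()-based solutions below, rescale the variance first:
n <- length(large.list)
list.sd <- sqrt(list.variance * n / (n - 1))  # same n - 1 denominator as sd()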
However, a much more efficient solution is to use tapply()
vec <- unlist(large.list, use.names = FALSE)
DIM <- dim(large.list[[1]])
n <- length(large.list)
list.mean <- tapply(vec, rep(1:prod(DIM),times = n), mean)
attr(list.mean, "dim") <- DIM
list.mean <- as.data.frame(list.mean)
list.sd <- tapply(vec, rep(1:prod(DIM),times = n), sd)
attr(list.sd, "dim") <- DIM
list.sd <- as.data.frame(list.sd)
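The question also asks for the standard error of the mean; with either version of list.sd above it is just the standard deviation divided by the square root of the number of lists:
list.se <- list.sd / sqrt(length(large.list))  # standard error of the mean across the 1000 lists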
If I may suggest an alternative, you could transform the list into a 3-dimensional array, and then use apply() to produce the output.
Here's how to transform the list (assuming dimensional regularity):
m <- do.call(cbind,lapply(large.list,as.matrix));
m <- array(m,c(nrow(m),ncol(m)/length(large.list),length(large.list)));
And here's how to use apply() on the matrix:
apply(m,1:2,mean);
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
apply(m,1:2,sd);
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
Here is a solution based on reshaping the list into a data.table. We basically extract the value at index i from each sub-list to create a single vector:
library(data.table)
ll <- unlist(large.list)
DX <- data.table(V1= ll[c(T,F,F,F)],
                 V2= ll[c(F,T,F,F)],
                 V3= ll[c(F,F,T,F)],
                 V4= ll[c(F,F,F,T)])
Then all calculations are straightforward:
mm <- DX[,lapply(.SD,mean)]
sdd <- DX[,lapply(.SD,sd)]
I would like to know if there is some elegant solution to this problem:
Let's say I have a vector of values
a <- c(1,2,3,3.1,3.2,5,6,7,7.1,7.2,9)
and I want to apply some function (e.g., mean) only to values fulfilling a certain condition, which in this case is that the difference between consecutive values is smaller than 0.5.
So the values that should be averaged are (3,3.1,3.2) and (7,7.1,7.2) and the function should return vector
b <- c(1,2,3.1,5,6,7.1,9)
Edit: One approach I've tried (not sure if right) is to binarize the vector a (1 meaning the difference between values is <0.5; 0 meaning the diff is >0.5), so I got vector
bin <- c(0,0,1,1,0,0,0,1,1,0)
but I don't know how to apply mean to the separate groups of ones. So the main problem for me is to distinguish the groups of needed values and apply the mean to them separately. Any ideas?
I am new here so if anything is unclear, please let me know. Thank you in advance.
This doesn't qualify as elegant, but I think that it works in the case you provide. I use rle (base R) to identify runs where diffs are less than 0.5.
a <- c(1, 2, 3, 3.1, 3.2, 5, 6, 7, 7.1, 7.2, 9)
crit <- diff(a) < 0.5
crit <- c(head(crit, 1), crit) | c(crit, tail(crit, 1))
run <- rle(crit)
aa <- split(a, rep(seq(length(run$lengths)), times=run$lengths))
myFun <- function(crit, val) {
if (crit) {
mean(val)
}
else {
val
}
}
unlist(mapply(FUN=myFun, crit=run$values, val=aa, USE.NAMES=FALSE))
Yields:
> unlist(mapply(FUN=myFun, crit=run$values, val=aa, USE.NAMES=FALSE))
[1] 1.0 2.0 3.1 5.0 6.0 7.1 9.0
Maybe someone can build a cleaner solution from this.
Update: OP points out that this fails on a sequence like {3, 3.1, 3.2, 7, 7.1, 7.2} because the code above lumps this into one run and averages across the whole sequence. Here's a more robust solution.
a <- c(1, 2, 3, 3.1, 3.2, 7, 7.1, 7.2, 10)
run <- unclass(rle(diff(a) < 0.5))
len <- run$lengths
val <- run$values
pos <- seq_along(len)
last <- pos == max(pos)
len <- len + val - c(0, head(val, -1)) + (last * !val)
prevLen <- c(0, head(cumsum(len), -1))
myFun <- function(l, v, pl, x) {
if (l == 0) {
NULL
} else {
seg <- seq(l) + pl
if (v == TRUE) {
mean(x[seg])
} else {
x[seg]
}
}
}
unlist(mapply(FUN=myFun, l=len, v=val, pl=prevLen, MoreArgs=list(x=a)))
Now, whenever it comes across a small-difference run (i.e., val == TRUE), it adds one more element to the length of that run (i.e., len + val); that additional element is taken from the next run, except that it can't steal from the last run when the last run is not itself a small-difference run (i.e., last * !val).
Maybe I overcomplicated the problem:
a <- c(1,2,3,3.1,3.2,5,6,7,7.1,7.2,9)
thr <- 0.5
## create a correct binary vector
d <- diff(a)
d <- c(d[1], d)
rd <- abs(diff(rev(a)))
rd <- c(rd[1], rd)
dc <- d < thr | rd < thr
# [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## use rle to count continuous values
r <- rle(dc)
r
# Run Length Encoding
# lengths: int [1:5] 2 3 2 3 1
# values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
## create grouping vector
groups <- double(length(a))
groups[!dc] <- seq(sum(!dc))
groups[dc] <- sum(!dc)+rep(seq(sum(r$values)), r$lengths[r$values])
groups
# [1] 1 2 6 6 6 3 4 7 7 7 5
## create mean for each group
m <- tapply(a, groups, FUN=mean)
m
# 1 2 3 4 5 6 7
# 1.0 2.0 5.0 6.0 9.0 3.1 7.1
## recreate original order
m[order(unique(groups))] <- m
m
# 1 2 3 4 5 6 7
# 1.0 2.0 3.1 5.0 6.0 7.1 9.0
Another possibility based on ave
# find id on which mean should be calculated
id1 <- which(diff(a) < 0.5)
id2 <- sort(union(id1, id1 + 1))
id2
# [1] 3 4 5 8 9 10
# group the id
grp <- cumsum(c(1, diff(id2)) - 1)
grp
# [1] 0 0 0 2 2 2
# calculate mean per group and insert into original vector
a[id2] <- ave(a[id2], grp)
a
# [1] 1.0 2.0 3.1 3.1 3.1 5.0 6.0 7.1 7.1 7.1 9.0
# remove duplicated means, i.e. remove index of duplicated values of grp
a[-id2[as.logical(ave(grp, grp, FUN = function(x) duplicated(x)))]]
# [1] 1.0 2.0 3.1 5.0 6.0 7.1 9.0