R: running corr across two matrices by column

I'm tracking how much my cats are pooping, and trying to figure out if that's correlated with how much they're eating.
So if I have the following data:
library(abind)  # for abind(); not loaded by default

food <- cbind(fluffy=c(0.9,1.1,1.3,0.7), misterCuddles=c(0.5,1.2,1.4,0.5))
poop <- cbind(fluffy=c(0.9,1.1,1.3,0.7), misterCuddles=c(-0.5,-1.2,-1.4,-0.5))
dates <- c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04")
rownames(food) <- dates
rownames(poop) <- dates
cube <- abind(food, poop, along=3)
Notes for the curious:
amounts are in deci-pennies: 1.1 means the poop weighs about as much as 11 pennies
negative poop amounts demonstrate that mister cuddles is part unicorn
This gives me the following:
> cube
, , food

           fluffy misterCuddles
2013-01-01    0.9           0.5
2013-01-02    1.1           1.2
2013-01-03    1.3           1.4
2013-01-04    0.7           0.5

, , poop

           fluffy misterCuddles
2013-01-01    0.9          -0.5
2013-01-02    1.1          -1.2
2013-01-03    1.3          -1.4
2013-01-04    0.7          -0.5
Now if I want to find the correlation for mister cuddles to demonstrate his magic:
> corr(cube[,"misterCuddles",])
[1] -1
What I'd like is a named vector with the correlation number for each cat:
> c(fluffy=1.0, misterCuddles=-1.0)
       fluffy misterCuddles
            1            -1
Is there a way I can do this in one shot, ideally in parallel? In reality, I have buttloads of cats.
Thanks!
EDIT
Can it be as simple as...
> library(parallel)  # for mclapply
> result <- simplify2array(mclapply(colnames(food), function(x) corr(cube[,x,])))
> names(result) <- colnames(food)
> result
       fluffy misterCuddles
            1            -1

library(boot)  # for corr
sapply(dimnames(cube)[[2]], function(x) corr(cube[ , x, ]))
#        fluffy misterCuddles
#             1            -1
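If you want the naming folded into the parallel one-liner from the EDIT, setNames() does it in one expression. A minimal sketch, assuming the parallel package that ships with R (mclapply relies on forking, so on Windows it needs mc.cores = 1 or a switch to parLapply):

library(parallel)  # for mclapply
library(boot)      # for corr
cats <- dimnames(cube)[[2]]
result <- setNames(
  simplify2array(mclapply(cats, function(x) corr(cube[ , x, ]))),
  cats)
result
#        fluffy misterCuddles
#             1            -1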

Related

Create data frame from values in every two consecutive rows of an existing data frame

I have data frame z1:
z1 <- data.frame(time = as.factor(rep(0.5:9.5, times = rep(c(9,10,8,11,12), 2))),
                 roi  = rep(c(1:9, 1:10, 1:8, 1:11, 1:12), 2),
                 area = runif(100, 5.0, 7.5))
I want to create a new data frame z2 with 10*nrow(z1) rows, built so that, at each time value, every pair of consecutive rows (z1$roi[i:(i+1)] and z1$area[i:(i+1)], for i in 1:(nrow(z1)-1)) is expanded into the roi and area columns of z2, like
z2$roi <- seq(z1$roi[i],z1$roi[i+1], length.out = 10)
z2$area <- seq(z1$area[i],z1$area[i+1], length.out = 10)
If the data frame z1 looks like:
     time roi     area
1     0.5   1 6.181150  #=z1$roi[1]
2     0.5   2 5.469366  #=z1$roi[2]
3     0.5   3 6.742525
.
.
.
98    9.5  10 6.063234
99    9.5  11 6.824393  #=z1$roi[99]
100   9.5  12 7.346298  #=z1$roi[100]
the data frame z2 would be:
      time      roi     area
1      0.5 1.000000 6.181150  #=z1$roi[1]
2      0.5 1.111111 6.102063
.
.
.
9      0.5 1.888889 5.548453
10     0.5 2.000000 5.469366  #=z1$roi[2]
.
.
.
991    9.5 11.00000 6.824393  #=z1$roi[99]
992    9.5 11.11111 6.882383
.
.
.
999    9.5 11.88889 7.288309
1000   9.5 12.00000 7.346298  #=z1$roi[100]
Can anyone help me? Thank you!
With the tidyverse, changing your values a bit to make the output easier to inspect (replace 5 with 10 to match your spec):
z1 <- head(z1, 3)
library(tidyverse)
z1 %>%
  mutate_at(vars(roi, area), ~ map2(., c(.[-1], last(.)), ~ seq(.x, .y, length.out = 5))) %>%
  unnest() %>%
  head(-5)
#    time  roi     area
# 1   0.5 1.00 6.302351
# 2   0.5 1.25 6.151644
# 3   0.5 1.50 6.000938
# 4   0.5 1.75 5.850231
# 5   0.5 2.00 5.699525
# 6   0.5 2.00 5.699525
# 7   0.5 2.25 5.687045
# 8   0.5 2.50 5.674566
# 9   0.5 2.75 5.662087
# 10  0.5 3.00 5.649608
We apply the same transformation to the columns roi and area, so we use mutate_at on those.
We want to turn them into list columns containing vectors, so we can unnest afterwards and get a long data.frame (you may need to get acquainted with tidyr::unnest to follow this step; basically it makes a 'regular' data.frame out of one whose elements are vectors, lists, or nested data.frames; see the toy example below).
The map family returns such a list output, but here each value depends on the current AND the next value, so we use purrr::map2 to get both inputs.
. is the current value; c(.[-1], last(.)) is the next value (the last element has no next value, so we repeat the last value).
We unnest to create a long data.frame.
The repeated last value creates duplicated rows, so we remove them with head(-n).
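For the unnest step in isolation, a toy example may help (the tibble here is hypothetical, not from the question; note that newer tidyr versions want an explicit cols argument):

library(tidyr)
toy <- tibble::tibble(g = c("a", "b"), v = list(1:2, 3:4))
unnest(toy, cols = v)  # each element of the list column becomes its own row
# g     v
# a     1
# a     2
# b     3
# b     4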
You could do this as a linear interpolation problem using approx():
s1 <- seq_len(nrow(z1) - 1)
s2 <- rep(s1, each = 9)
out <- approx(
  x    = seq_along(z1$area),
  y    = z1$area,
  xout = c(s2 + head(seq(0, 1, length.out = 10), -1), nrow(z1))
)$y
z1
#   time roi     area
# 1  0.5   1 6.413124
# 2  0.5   2 6.837422
# 3  0.5   3 6.656612
And then just join the results back together using row indexing:
cbind(z1[c(s2, nrow(z1)), ], out)
#     time roi     area      out
# 1    0.5   1 6.413124 6.413124
# 1.1  0.5   1 6.413124 6.460268
# 1.2  0.5   1 6.413124 6.507413
# 1.3  0.5   1 6.413124 6.554557
# 1.4  0.5   1 6.413124 6.601701
# 1.5  0.5   1 6.413124 6.648845
# 1.6  0.5   1 6.413124 6.695989
# 1.7  0.5   1 6.413124 6.743134
# 1.8  0.5   1 6.413124 6.790278
# 2    0.5   2 6.837422 6.837422
# 2.1  0.5   2 6.837422 6.817332
# 2.2  0.5   2 6.837422 6.797242
# 2.3  0.5   2 6.837422 6.777152
# 2.4  0.5   2 6.837422 6.757062
# 2.5  0.5   2 6.837422 6.736972
# 2.6  0.5   2 6.837422 6.716882
# 2.7  0.5   2 6.837422 6.696792
# 2.8  0.5   2 6.837422 6.676702
# 3    0.5   3 6.656612 6.656612
This sort of logic should scale much better than calculating a sequence for each row: on the order of 10 seconds vs 1 minute for 1 million rows in a quick and dirty test.
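To get the full z2 the question asks for, the same approx() logic can be wrapped in a helper and applied to both roi and area. A minimal sketch (interp is a hypothetical name; like the answer above, it does not duplicate boundary rows, so it yields 9 points per pair plus the final row, and it interpolates straight across time-group boundaries):

interp <- function(v, n = 10) {
  s1 <- seq_len(length(v) - 1)
  s2 <- rep(s1, each = n - 1)
  approx(x    = seq_along(v),
         y    = v,
         xout = c(s2 + head(seq(0, 1, length.out = n), -1), length(v)))$y
}
idx <- c(rep(seq_len(nrow(z1) - 1), each = 9), nrow(z1))
z2  <- data.frame(time = z1$time[idx],
                  roi  = interp(z1$roi),
                  area = interp(z1$area))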

R: Improvement of loop to create distance matrix from data frame

I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot  temperature
A     12
B     12.5
C     15
...   ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
.    A    B    C
A    0    0.5  3
B    0.5  0    2.5
C    3    2.5  0
This is what I have come up with in R:
temp_data  # my data frame with the two columns: location and temperature
temp_dist <- matrix(data=NA, nrow=length(temp_data[,1]), ncol=length(temp_data[,1]))
temp_dist <- as.data.frame(temp_dist)
names(temp_dist)    <- as.factor(temp_data[,1])  # the locations are numbers in my data
rownames(temp_dist) <- as.factor(temp_data[,1])
for (i in 1:2244) {
  for (j in 1:2244) {
    temp_dist[i,j] <- abs(temp_data[i,2] - temp_data[j,2])
  }
}
I have tried the code on a small sample with for (i in 1:10) and it works fine.
My problem is that the computer has now been running for two full days and it hasn't finished.
I was wondering if there is a quicker way of doing this. I am aware that nested loops take a lot of time, and since I am filling a matrix of more than 5 million cells it makes sense that it takes this long, but I am hoping there is a formula that gets the same result faster, as I have to do the same with precipitation and other variables.
I have also read about dist, but I am unsure whether I can use it with the data frame I have.
I would very much appreciate your help.
Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
#     1   2   3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
#     A   B   C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
#     A   B   C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
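Since the same matrix is needed for precipitation and other variables, wrapping the dist() recipe in a function avoids repeating yourself. A minimal sketch (diff_matrix is a hypothetical helper; it assumes each variable lives in its own column of temp_data):

diff_matrix <- function(df, value_col, name_col = "plot") {
  m <- as.matrix(dist(df[[value_col]], upper = TRUE, diag = TRUE))
  dimnames(m) <- list(df[[name_col]], df[[name_col]])
  m
}
temp_m <- diff_matrix(temp_data, "temperature")
# prec_m <- diff_matrix(temp_data, "precipitation")  # hypothetical second column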

subtract pairs of rows for long dataset

I would like to find a way to subtract, in a long dataset, both the time and meters values every two rows (two measures per day), and create a new table that stores those differences
(03:21 - 09:37 and 3.2 - 0.9, and so on). Is there a function that can do this automatically, and how do I set it up? I am completely new to R and need to figure these things out with R alone.
time <- c("03:21","09:37","15:41","21:46","03:54","10:12")
day <- c(1,1,1,1,2,2)
meters <- c(3.2,0.9,3.2,0.9,3.2,0.9)
df <- data.frame(day,time,meters)
  day  time meters
1   1 03:21    3.2
2   1 09:37    0.9
3   1 15:41    3.2
4   1 21:46    0.9
5   2 03:54    3.2
6   2 10:12    0.9
Here are a couple of options that quickly come to mind to consider:
Option 1: Subset with TRUE and FALSE to calculate the difference:
Time <- strptime(df$time, format="%H:%M")
TimeD <- Time[c(TRUE, FALSE)] - Time[c(FALSE, TRUE)]
MetersD <- df$meters[c(TRUE, FALSE)] - df$meters[c(FALSE, TRUE)]
cbind(meters = MetersD, time = TimeD)
#      meters      time
# [1,]    2.3 -6.266667
# [2,]    2.3 -6.083333
# [3,]    2.3 -6.300000
Option 2: Use %/% to create a grouping variable and use aggregate
df$pairs <- c(0, 1:(nrow(df)-1) %/% 2)
df$time2 <- strptime(df$time, format="%H:%M")
aggregate(list(meters = df$meters, time = df$time2),
          by = list(pairs = df$pairs), FUN = function(y) diff(rev(y)))
#   pairs meters      time
# 1     0    2.3 -6.266667
# 2     1    2.3 -6.083333
# 3     2    2.3 -6.300000
Update
It's not too difficult to extend the idea to get your "day" column back too:
with(df, {
  time <- strptime(time, format="%H:%M")
  time <- time[c(TRUE, FALSE)] - time[c(FALSE, TRUE)]
  meters <- meters[c(TRUE, FALSE)] - meters[c(FALSE, TRUE)]
  day <- day[c(TRUE, FALSE)]
  data.frame(day, time, meters)
})
#   day            time meters
# 1   1 -6.266667 hours    2.3
# 2   1 -6.083333 hours    2.3
# 3   2 -6.300000 hours    2.3
Using diff
# Create a proper date
df$date <- strptime(paste(df$day,df$time),format="%d %H:%M")
new_df <- data.frame(
  diff_meters = abs(diff(df$meters)),
  diff_time   = diff(df$date))
new_df
  diff_meters      diff_time
1         2.3 6.266667 hours
2         2.3 6.066667 hours
3         2.3 6.083333 hours
4         2.3 6.133333 hours
5         2.3 6.300000 hours
It's pretty easy to take every other row, if that's what you're actually looking for (it's not really clear from the question or your comment):
new_df[seq(1,nrow(new_df),2),]
  diff_meters      diff_time
1         2.3 6.266667 hours
3         2.3 6.083333 hours
5         2.3 6.300000 hours
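The same pairing logic also translates to a grouped dplyr pipeline, if you prefer that style. A minimal sketch (the pair column is a helper invented here; as.POSIXct is used because POSIXlt columns are awkward inside data frames):

library(dplyr)
df %>%
  mutate(time2 = as.POSIXct(strptime(time, format = "%H:%M")),
         pair  = (row_number() + 1) %/% 2) %>%  # 1,1,2,2,3,3
  group_by(pair) %>%
  summarise(day         = first(day),
            diff_time   = first(time2) - last(time2),
            diff_meters = first(meters) - last(meters))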

Exclude smaller values than a threshold in R

I have data in a tab-delimited text file like this:
FID    HV    HH  VOLUME
1    -2.1  -0.1       0
2    -4.3  -0.2     200
3    -1.4   1.2      20
4    -1.2   0.6      30
5    -3.7   0.8      10
These tables mostly have more than 6000 rows and many more columns.
I need to extract the values of the column VOLUME that are smaller than e.g. 20.
I tried to do it with the following command
x <- -which(names(x)["VOLUME"] > 20)
but it did not work.
Is there a method to do it? Any help is appreciated.
Say your data is sample:
subset(sample, VOLUME<20)
Assuming x is your data, try this:
x <- x[which(x$VOLUME <= 20),]
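Note that the two answers differ on < versus <=; for strictly "smaller than 20", use <. If you use the tidyverse, dplyr's filter() is an equivalent one-liner (a sketch, assuming the file has already been read into x, e.g. with read.delim):

library(dplyr)
filter(x, VOLUME < 20)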

R: What are the best functions to deal with concatenating and averaging values in a data.frame?

I have a data.frame from this code:
my_df = data.frame("read_time" = c("2010-02-15", "2010-02-15",
                                   "2010-02-16", "2010-02-16",
                                   "2010-02-16", "2010-02-17"),
                   "OD" = c(0.1, 0.2, 0.1, 0.2, 0.4, 0.5))
which produces this:
> my_df
   read_time  OD
1 2010-02-15 0.1
2 2010-02-15 0.2
3 2010-02-16 0.1
4 2010-02-16 0.2
5 2010-02-16 0.4
6 2010-02-17 0.5
I want to average the OD column over each distinct read_time (notice some are replicated, others are not), and I would also like to calculate the standard deviation, producing a table like this:
> my_df
   read_time   OD stdev
1 2010-02-15 0.15  0.05
5 2010-02-16 0.30  0.10
6 2010-02-17 0.50  0.00
Which are the best functions to deal with concatenating such values in a data.frame?
The plyr package is popular for this, but the base functions by() and aggregate() will also help.
> library(plyr)
> ddply(my_df, "read_time", function(X) data.frame(OD=mean(X$OD), stdev=sd(X$OD)))
   read_time      OD   stdev
1 2010-02-15 0.15000 0.07071
2 2010-02-16 0.23333 0.15275
3 2010-02-17 0.50000      NA
You can add the missing bit to return 0 instead of NA for the last std.dev.
Also, you don't need the quotes (on the variables) you had in the data.frame construction.
You can try the data.table package. If you know MySQL, it should be very easy for you to pick up all the functions; otherwise the basics are good enough too ;-)
library(data.table)
my_dfdt <- data.table(my_df)
mean_dt <- my_dfdt[, mean(OD), by="read_time"]  # named to avoid masking base mean()
sd_dt   <- my_dfdt[, sd(OD),   by="read_time"]
You can also compute both in one call (sketched below), or cbind the two results at the end; it's a matter of style.
Another advantage: it is extremely fast if you have large samples; see the documentation for why.
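For the one-call version mentioned above, a minimal sketch using data.table's j-list syntax (both aggregates in a single grouped expression):

my_dfdt[, list(OD = mean(OD), stdev = sd(OD)), by = "read_time"]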
This illustrates how you could use aggregate to get the mean and standard deviation by your read_time.
> aggregate(my_df$OD, by=list(my_df$read_time), function(x) mean(x))
     Group.1         x
1 2010-02-15 0.1500000
2 2010-02-16 0.2333333
3 2010-02-17 0.5000000
> aggregate(my_df$OD, by=list(my_df$read_time), function(x) sd(x))
     Group.1          x
1 2010-02-15 0.07071068
2 2010-02-16 0.15275252
3 2010-02-17         NA
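Both statistics can also come from a single aggregate() call via the formula interface; a minimal sketch (the do.call step flattens the matrix column that aggregate returns):

res <- aggregate(OD ~ read_time, data = my_df,
                 FUN = function(x) c(OD = mean(x), stdev = sd(x)))
do.call(data.frame, res)  # columns come out as OD.OD and OD.stdev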
