R: Improvement of loop to create distance matrix from data frame

I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot  temperature
A     12
B     12.5
C     15
...   ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
     A    B    C
A    0    0.5  3
B    0.5  0    2.5
C    3    2.5  0
This is what I have come up with in R:
temp_data  # my data frame with the two columns: location and temperature
temp_dist <- matrix(data = NA, nrow = nrow(temp_data), ncol = nrow(temp_data))
temp_dist <- as.data.frame(temp_dist)
names(temp_dist) <- as.factor(temp_data[, 1])  # the locations are numbers in my data
rownames(temp_dist) <- as.factor(temp_data[, 1])
for (i in 1:2244) {
  for (j in 1:2244) {
    temp_dist[i, j] <- abs(temp_data[i, 2] - temp_data[j, 2])
  }
}
I have tried the code with a small sample with:
for (i in 1:10)
and it works fine.
My problem is that the computer has now been running for two full days and it hasn't finished.
I was wondering if there is a way of doing this more quickly. I am aware that nested loops take a lot of time, and since I am filling in a matrix of more than 5 million cells it makes sense that it takes so long, but I am hoping there is a function that gets the same result faster, as I have to do the same with precipitation and other variables.
I have also read about dist, but I am unsure whether I can use that function with the data frame I have.
I would very much appreciate your help.
Many thanks.

Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
#     1   2   3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
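This works because, for a one-column numeric input, the default method = "euclidean" in dist reduces to the absolute difference |t_i - t_j|, which is exactly what the double loop computes.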
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
#     A   B   C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
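Since you mention having to do the same with precipitation and other variables, it may be worth wrapping these two steps in a small helper. This is just a sketch; the column names ("plot", "temperature", "precipitation") and the precip_data frame are assumptions based on the example:
diff_matrix <- function(df, value_col, label_col = "plot") {
  # absolute pairwise differences of one numeric column, labelled by location
  out_mat <- as.matrix(dist(df[[value_col]], upper = TRUE, diag = TRUE))
  dimnames(out_mat) <- list(df[[label_col]], df[[label_col]])
  out_mat
}
temp_mat <- diff_matrix(temp_data, "temperature")
# precip_mat <- diff_matrix(precip_data, "precipitation")  # hypothetical second data frame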

Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
#     A   B   C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
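Either way the heavy lifting is vectorized. Note that dist stores only the n(n-1)/2 lower-triangle values until you convert it, while outer allocates the full n x n matrix (about 40 MB of doubles for n = 2244); both are fast here, because the original loop is slow mainly from reading and writing a data frame cell by cell, not from the arithmetic itself.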

Related

How to reference multiple dataframe columns to calculate a new column of weighted averages in R

I am currently calculating the weighted average column for my data frame by referencing each column name manually. Is there a way to shorten the code by multiplying sets of columns,
e.g.:
df[, c("A","B","C")] and df[, c("PerA","PerB","PerC")], to obtain the weighted average, like SUMPRODUCT in Excel? Especially if I have many input columns to calculate the weighted average column:
df$WtAvg = df$A*df$PerA + df$B*df$PerB + df$C*df$PerC
Without transforming your data frame, and assuming that the first half of the columns holds the sizes and the second half the weights, you can use the weighted.mean function inside apply:
df$WtAvg = apply(df, 1, function(x) {
  weighted.mean(x[1:(ncol(df)/2)],
                x[(ncol(df)/2 + 1):ncol(df)])
})
And you get the following output:
> df
  A B C PerA PerB PerC WtAvg
1 1 2 3  0.1  0.2  0.7   2.6
2 4 5 6  0.5  0.3  0.2   4.7
3 7 8 9  0.6  0.1  0.3   7.7
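Alternatively, if the column names are known, a fully vectorized rowSums call mirrors Excel's SUMPRODUCT and avoids the row-by-row apply. A sketch, assuming the column names from the example (the division by the weight total matches weighted.mean and is a no-op when the weights already sum to 1):
size_cols   <- c("A", "B", "C")
weight_cols <- c("PerA", "PerB", "PerC")
df$WtAvg <- rowSums(df[, size_cols] * df[, weight_cols]) / rowSums(df[, weight_cols])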

Segmenting a data frame by row based on previous rows values

I have a data frame in R that contains 2 columns named x and y (co-ordinates). The data frame represents a journey with each line representing the position at the next point in time.
      x     y seconds
1   0.0   0.0       0
2  -5.8  -8.5       1
3 -11.6 -18.2       2
4 -16.9 -30.1       3
5 -22.8 -40.8       4
6 -29.0 -51.6       5
I need to break the journey up into segments where each segment starts once the distance from the start of the previous segment crosses a certain threshold (e.g. 200).
I have recently switched from using SAS to R, and this is the first time I've come across anything I can do easily in SAS but can't even think of the way to approach the problem in R.
I've posted the SAS code I would use below to do the same job. It creates a new column called segment.
%let cutoff=200;

data segments;
  set journey;
  retain segment distance x_start y_start;
  if _n_=1 then do;
    x_start=x;
    y_start=y;
    segment=1;
    distance=0;
  end;
  distance + sqrt((x-x_start)**2+(y-y_start)**2);
  if distance>&cutoff then do;
    x_start=x;
    y_start=y;
    segment+1;
    distance=0;
  end;
  keep x y seconds segment;
run;
Edit: Example output
If the cutoff were 200 then an example of required output would look something like...
      x     y seconds segment
1   0.0   0.0       0       1
2  40.0  30.0       1       1
3  80.0  60.0       2       1
4 120.0  90.0       3       1
5 160.0 120.0       4       2
6 120.0 150.0       5       2
7  80.0 180.0       6       2
8  40.0 210.0       7       2
9   0.0 240.0       8       3
If your data set is dd, something like
cutoff <- 200
origin <- dd[1, c("x","y")]
cur.seg <- 1
dd$segment <- NA
for (i in 1:nrow(dd)) {
  dist <- sqrt(sum((dd[i, c("x","y")] - origin)^2))
  if (dist > cutoff) {
    cur.seg <- cur.seg + 1
    origin <- dd[i, c("x","y")]
  }
  dd$segment[i] <- cur.seg
}
should work. There are some refinements (it might be more efficient to compute the distances from the current origin to all rows at once, then use which(dist > cutoff)[1] to jump to the first row beyond the cutoff, as sketched below), and it would be interesting to try to come up with a completely vectorized solution, but this should be OK. How big is your data set?
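That which()-based refinement might look like the following. This is only a sketch, and it keeps the same rule as the loop above: the first row whose straight-line distance from the current origin exceeds the cutoff starts a new segment and becomes the new origin.
cutoff <- 200
dd$segment <- NA
start <- 1
seg <- 1
while (start <= nrow(dd)) {
  # distances from the current origin to every row, computed in one shot
  d <- sqrt((dd$x - dd$x[start])^2 + (dd$y - dd$y[start])^2)
  nxt <- which(d > cutoff & seq_len(nrow(dd)) > start)[1]
  end <- if (is.na(nxt)) nrow(dd) else nxt - 1
  dd$segment[start:end] <- seg
  seg <- seg + 1
  start <- end + 1
}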

R: running corr across two matrices by column

I'm tracking how much my cats are pooping, and trying to figure out if that's correlated with how much they're eating.
So if I have the following data:
library(abind)  # for abind()
food <- cbind(fluffy = c(0.9, 1.1, 1.3, 0.7), misterCuddles = c(0.5, 1.2, 1.4, 0.5))
poop <- cbind(fluffy = c(0.9, 1.1, 1.3, 0.7), misterCuddles = c(-0.5, -1.2, -1.4, -0.5))
dates <- c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04")
rownames(food) <- dates
rownames(poop) <- dates
cube <- abind(food, poop, along = 3)
Notes for the curious:
amounts are in deci-pennies: 1.1 means the poop weighs about as much as 11 pennies
negative poop amounts demonstrate that mister cuddles is part unicorn
This gives me the following:
> cube
, , food

           fluffy misterCuddles
2013-01-01    0.9           0.5
2013-01-02    1.1           1.2
2013-01-03    1.3           1.4
2013-01-04    0.7           0.5

, , poop

           fluffy misterCuddles
2013-01-01    0.9          -0.5
2013-01-02    1.1          -1.2
2013-01-03    1.3          -1.4
2013-01-04    0.7          -0.5
Now if I want to find the correlation for mister cuddles to demonstrate his magic:
> corr(cube[,"misterCuddles",])
[1] -1
What I'd like is a named vector with the correlation number for each cat:
> c(fluffy = 1.0, misterCuddles = -1.0)
       fluffy misterCuddles
            1            -1
Is there a way I can do this in one shot, ideally in parallel? In reality, I have buttloads of cats.
Thanks!
EDIT
Can it be as simple as...
> result <- simplify2array(mclapply(colnames(food), function(x) corr(cube[, x, ])))
> names(result) <- colnames(food)
> result
       fluffy misterCuddles
            1            -1
library(boot)  # for corr
sapply(dimnames(cube)[[2]], function(x) corr(cube[, x, ]))
#        fluffy misterCuddles
#             1            -1
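Since the question asks for a parallel one-shot version: the same call parallelizes directly with mclapply from the parallel package. A sketch (mclapply relies on forking, so it will not run in parallel on Windows):
library(parallel)  # for mclapply
library(boot)      # for corr
cats <- dimnames(cube)[[2]]
result <- setNames(
  simplify2array(mclapply(cats, function(x) corr(cube[, x, ]))),
  cats
)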

Exclude smaller values than a threshold in R

I have data in a tab-delimited text file like this:
FID HV HH VOLUME
1 -2.1 -0.1 0
2 -4.3 -0.2 200
3 -1.4 1.2 20
4 -1.2 0.6 30
5 -3.7 0.8 10
These tables mostly have more than 6000 rows and many more columns.
I need to extract the rows where the value in the column VOLUME is smaller than, e.g., 20.
I tried to do it with the following command:
x <- -which(names(x)["VOLUME"] > 20)
but it did not work.
Is there a method to do it? Any help is appreciated.
Say your data is sample:
subset(sample, VOLUME<20)
Assuming x is your data, try this:
x <- x[which(x$VOLUME < 20), ]
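Both answers select rows by a logical condition on the column's values; the original attempt failed because names(x)["VOLUME"] looks at the column names rather than the values. The which() is optional: x[x$VOLUME < 20, ] gives the same result, except that which() silently drops rows where VOLUME is NA, while plain logical indexing keeps them as rows of NAs.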

R - moving window comparison with datasets of unequal size

I need to compare a large set of values to a small set and find the minimum difference between the two. Maybe this is “moving window” comparison? I’ve looked at several time series packages but can’t find (or recognize) a function that compares data sets of different sizes. Text example below. Any help is greatly appreciated.
----------1st comparison-----------
Time    S1     S2     Diff   Mean Diff
1       1.3    1.2     0.1
2       1.7    1.6     0.1     0.10
3       1.2
4       1.6
----------2nd comparison-----------
1       1.3
2       1.7    1.2     0.5
3       1.2    1.6    -0.4     0.05
4       1.6
----------3rd comparison-----------
1       1.3
2       1.7
3       1.2    1.2     0.0
4       1.6    1.6     0.0     0.00   <- minimum difference
What about something like this:
require(zoo)
S1 <- c(1.3,1.7,1.2,1.6)
S2 <- c(1.2,1.6)
We can use rollapply to apply a function rolling along a vector, with the width set to the size of the smaller comparison vector. We then use an anonymous function that receives the values from our large vector, S1, as the variable x, subtracts the values of the small vector, and takes the mean. Finally, min returns the smallest value:
> min( rollapply( S1 , width = 2 , function(x) mean(x-S2) ) )
[1] 0
It's hard to make it more generalisable without knowing the structure of your data.
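One caveat, as a note of my own rather than part of the original answer: mean(x - S2) lets positive and negative differences cancel, so a window could score near zero while matching poorly. If the goal is the window that is closest in absolute terms, a variant such as
min( rollapply( S1 , width = length(S2) , function(x) mean(abs(x - S2)) ) )
scores each window by its mean absolute difference, and width = length(S2) generalises the call to a comparison vector of any size.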
