R: Moving window comparison with datasets of unequal size

I need to compare a large set of values to a small set and find the minimum difference between the two. Maybe this is “moving window” comparison? I’ve looked at several time series packages but can’t find (or recognize) a function that compares data sets of different sizes. Text example below. Any help is greatly appreciated.
----------1st comparison-----------
Time  S1   S2    Diff   Mean Diff
1     1.3  1.2    0.1
2     1.7  1.6    0.1    0.10
3     1.2
4     1.6
----------2nd comparison-----------
1     1.3
2     1.7  1.2    0.5
3     1.2  1.6   -0.4    0.05
4     1.6
----------3rd comparison-----------
1     1.3
2     1.7
3     1.2  1.2    0.0
4     1.6  1.6    0.0    0.00  <- minimum difference

What about something like this:
require(zoo)
S1 <- c(1.3, 1.7, 1.2, 1.6)
S2 <- c(1.2, 1.6)
We can use rollapply to apply a function along a rolling window of a vector, with the width set to the length of the smaller comparison vector. The anonymous function receives each window of the large vector S1 as x, subtracts the values of the small vector S2, and takes the mean. We can then use min to return the smallest value:
> min(rollapply(S1, width = 2, function(x) mean(x - S2)))
[1] 0
It's hard to make this more general without knowing the structure of your data.
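Note that if the window means can be negative, min() will pick the most negative one rather than the one closest to zero. A slightly more general sketch, continuing with S1 and S2 from above, takes absolute values and also reports where the best window starts:
diffs <- rollapply(S1, width = length(S2), function(x) mean(x - S2))
min(abs(diffs))        # smallest absolute mean difference: 0
which.min(abs(diffs))  # best window starts at S1[3], as in the 3rd comparison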

Related

How to reference multiple dataframe columns to calculate a new column of weighted averages in R

I am currently calculating the weighted average column for my dataframe by manually referencing each column name. Is there a way to shorten the code by multiplying sets of columns,
e.g.:
df[, c("A", "B", "C")] and df[, c("PerA", "PerB", "PerC")], to obtain the weighted average, like SUMPRODUCT in Excel? Especially if I have many input columns going into the weighted average column:
df$WtAvg = df$A*df$PerA + df$B*df$PerB + df$C*df$PerC
Without transforming your dataframe, and assuming the first half of the columns holds the values and the second half the weights, you can use the weighted.mean function inside apply:
df$WtAvg <- apply(df, 1, function(x) {
  # first half of the row = values, second half = weights
  weighted.mean(x[1:(ncol(df) / 2)], x[(ncol(df) / 2 + 1):ncol(df)])
})
And you get the following output:
> df
A B C PerA PerB PerC WtAvg
1 1 2 3 0.1 0.2 0.7 2.6
2 4 5 6 0.5 0.3 0.2 4.7
3 7 8 9 0.6 0.1 0.3 7.7
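If the value and weight columns are known by name, a vectorized sketch (assuming, as in the example above, that the weights in each row sum to 1) gives a direct SUMPRODUCT analogue:
vals <- c("A", "B", "C")
wts  <- c("PerA", "PerB", "PerC")
# Row-wise sum of value * weight, like SUMPRODUCT in Excel;
# divide by rowSums(df[, wts]) if the weights don't sum to 1
df$WtAvg <- rowSums(df[, vals] * df[, wts])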

Random number generation but common within group

Is there a way to generate random numbers from a distribution such that the numbers are the same for all rows within a group? I have an unbalanced panel with a household_id variable, and I want to generate random numbers from a truncated normal distribution using rtruncnorm, constant within each household.
Household_id Random number
1 0.6
1 0.6
1 0.6
2 0.1
3 0.9
3 0.9
4 0.2
5 0.7
6 0.3
6 0.3
So, household_id identifies the household in this unbalanced panel. I want to generate random numbers with rtruncnorm as shown above, so that they are the same for all rows within a household. Thank you.
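One way to do this is to draw a single value per household and then map it back onto the rows. A minimal sketch, assuming the data is in a data frame called panel (a hypothetical name) and using rtruncnorm from the truncnorm package with illustrative bounds and moments:
library(truncnorm)
set.seed(1)
panel <- data.frame(household_id = c(1, 1, 1, 2, 3, 3, 4, 5, 6, 6))
# One draw per unique household from a normal truncated to [0, 1]
ids <- unique(panel$household_id)
draws <- rtruncnorm(length(ids), a = 0, b = 1, mean = 0.5, sd = 0.25)
# Map each household's draw back onto its rows
panel$random_number <- draws[match(panel$household_id, ids)]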

R: Improvement of loop to create distance matrix from data frame

I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot temperature
A 12
B 12.5
C 15
... ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
.  A    B    C
A  0    0.5  3
B  0.5  0    2.5
C  3    2.5  0
This is what I have come up with in R:
temp_data #my data frame with the two columns: location and temperature
temp_dist <- matrix(data = NA, nrow = length(temp_data[, 1]), ncol = length(temp_data[, 1]))
temp_dist <- as.data.frame(temp_dist)
names(temp_dist) <- as.factor(temp_data[, 1])  # the locations are numbers in my data
rownames(temp_dist) <- as.factor(temp_data[, 1])
for (i in 1:2244) {
  for (j in 1:2244) {
    temp_dist[i, j] <- abs(temp_data[i, 2] - temp_data[j, 2])
  }
}
I have tried the code with a small sample with:
for (i in 1:10)
and it works fine.
My problem is that the computer has been running now for two full days and it hasn't finished.
I was wondering if there is a way of doing this quicker. I am aware that nested loops take a long time, and since I am filling a matrix of more than 5 million cells it makes sense that it is slow, but I am hoping there is a formula that gets the same result faster, as I have to do the same for precipitation and other variables.
I have also read about dist, but I am unsure if with the data frame I have I can use that formula.
I would very much appreciate your help.
Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
# 1 2 3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
#     A   B   C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
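Both versions run in compiled code, so even with 2244 locations they should finish in well under a second. They also produce the same matrix; as a quick sanity check:
all.equal(out_mat, m)
# [1] TRUE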

Generate an output from a calculation between 2 columns in R

I have a data set representing movement through a 2d environment with respect to time:
time(s)  start_pos  fwd_dist  rev_dist  end_pos
1        0.0        4.0       -3.0       2.0
2        2.0        5.1        0.5       3.0
3        3.0        4.7       -0.5       3.5
4        3.5        3.6       -1.8       2.1
5        2.1        2.6       -2.1       1.0
6        1.0        1.5       -1.5      -0.2
I want to make another column that checks which is larger between end_pos and start_pos, and subtracts the larger of the two from fwd_dist. I'm trying to loop through the dataset but I'm struggling with the syntax in R:
i <- 0
while (i < length(data[, 1])) {
  if (data[i, 4] > data[i, 1]) {
    print(data[i, 2] - data[i, 4])
  } else {
    print(data[i, 2] - data[i, 1])
  }
  i <- i + 1
}
I keep getting the error:
Error in if (data[i, 4] > data[i, 1]) { :
argument is of length zero
The error itself comes from starting i at 0: R indexing is 1-based, so data[0, 4] selects nothing and the if() condition has length zero. But you don't need a loop at all:
pmax(start_pos, end_pos)
will give you the parallel maximum (i.e., componentwise) of two vectors. So you are probably looking for
fwd_dist - pmax(start_pos, end_pos)
A data frame based approach:
data$difference <- data$fwd_dist - pmax(data$start_pos, data$end_pos)
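As a quick check, rebuilding the sample data from the question by hand (the time column is omitted since it is not used):
data <- data.frame(
  start_pos = c(0.0, 2.0, 3.0, 3.5, 2.1, 1.0),
  fwd_dist  = c(4.0, 5.1, 4.7, 3.6, 2.6, 1.5),
  rev_dist  = c(-3.0, 0.5, -0.5, -1.8, -2.1, -1.5),
  end_pos   = c(2.0, 3.0, 3.5, 2.1, 1.0, -0.2)
)
data$difference <- data$fwd_dist - pmax(data$start_pos, data$end_pos)
data$difference
# [1] 2.0 2.1 1.2 0.1 0.5 0.5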

Exclude smaller values than a threshold in R

I have data in a tab-delimited text file like this:
FID HV HH VOLUME
1 -2.1 -0.1 0
2 -4.3 -0.2 200
3 -1.4 1.2 20
4 -1.2 0.6 30
5 -3.7 0.8 10
These tables mostly have more than 6000 rows and many more columns.
I need to extract the rows where the column VOLUME is smaller than, e.g., 20.
I tried to do it with the following command:
x <- -which(names(x)["VOLUME"] > 20)
but it did not work.
Is there a way to do it? Any help is appreciated.
Say your data is in a data frame called sample:
subset(sample, VOLUME < 20)
Alternatively, assuming x is your data, try this:
x <- x[which(x$VOLUME < 20), ]
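For completeness, a minimal sketch of the whole pipeline, assuming the tab-delimited file is called data.txt (a hypothetical name) and has a header row:
x <- read.delim("data.txt")  # read.delim defaults to tab-separated with a header
below <- x[x$VOLUME < 20, ]  # keep only rows where VOLUME is strictly below 20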
