Exclude smaller values than a threshold in R - r

I have data in a tab-delimited text file like this:
FID HV HH VOLUME
1 -2.1 -0.1 0
2 -4.3 -0.2 200
3 -1.4 1.2 20
4 -1.2 0.6 30
5 -3.7 0.8 10
These tables have mostly more than 6000 rows and much more columns.
I need to extract values of the column VOLUME smaller than e.g. 20.
I tried to do it with following command
x <- -which(names(x)["VOLUME"] > 20)
but it did not work.
Is there any method to do it? Any help is appreciated.

Say your data is sample:
subset(sample, VOLUME<20)

Assuming x is your data, try this:
x <- x[which(x$VOLUME <= 20),]

Related

Is there way to calculate multiple new rows of a data frame based on previous rows' values?

I am creating a data frame (hoops) with three columns (t, x, y) and 700 rows. See code at bottom. In the first row, I have set column t to equal 0. In the second row, I want to have the column t be calculated by taking the previous row's t value and add a constant (hoops_var). I want this to formula to continue to row 700.
hoops<-data.frame(t=double(),x=double(),y=double())
hoops_var<- 1.5
hoops[1,1]<- 0
hoops[1,2]<- (hoops$t+23)
hoops[1,3]<- (hoops$t/2)
# What I want for row 2
hoops[2,1]<- hoops[[1,1]]+hoops_var #this formula for rows 2 to 700
hoops[2,2]<- (hoops$t+23) #same as row 1
hoops[2,2]<- (hoops$t/2) #same as row 1
# What I want for row 3 to 700 (same as row 2)
hoops[3:700,1]<- hoops[[2,2]]+hoops_var #same as row 2
hoops[3:700,2]<- (hoops$t+23) #same as rows 1 & 2
hoops[3:700,3]<- (hoops$t/2) #same as row 1 & 2
The first four rows of the table should look like this
The only applicable solution I found (linked at bottom) did not work for me.
I am fairly new to R, so apologies if this is a dumb question. Thanks in advance for any help.
R: Creating a new row based on previous rows
You should use vectorized operations
# first create all columns as vectors
hoops_t <- hoops_var*(0:699) #0:699 gives a vector of 700 consecutive integers
hoops_x <- hoops_t+23
hoops_y <- hoops_t/2
# now we are ready to put all vectors in a dataframe
hoops <- data.frame(t=hoops_t,x=hoops_x,y=hoops_y)
Now if you want to change the t column you can use lag from dplyr to shift all values for example
library(dplyr)
hoops$t[2:nrow(hoops)] <- lag(hoops$x*hoops$y)[2:nrow(hoops)]
I select only [2:nrow(hoops)] (all rows except the first one) because you don't want the first row to be modified
You could use the following :
n <- 10 #Number of rows in the data.frame
t <- seq(0, by = 1.5, length.out = n)
x <- 23 + t
y <- t/2
hoops <- data.frame(t, x, y)
hoops #Sample for 10 rows.
# t x y
#1 0.0 23.0 0.00
#2 1.5 24.5 0.75
#3 3.0 26.0 1.50
#4 4.5 27.5 2.25
#5 6.0 29.0 3.00
#6 7.5 30.5 3.75
#7 9.0 32.0 4.50
#8 10.5 33.5 5.25
#9 12.0 35.0 6.00
#10 13.5 36.5 6.75

How to reference multiple dataframe columns to calculate a new column of weighted averages in R

I am currently calculating the weighted average column for my dataframe through manual referencing of each column name. Is there a way to shorten the code by multiplying sets of arrays
eg:
df[,c(A,B,C)] and df[,c(PerA,PerB,PerC)] to obtain the weighted average, like the SUMPRODUCT in Excel? Especially if I have multiple input columns to calculate the weighted average column
df$WtAvg = df$A*dfPerA + df$B*df$PerB + df$C*df$PerC
Without transforming your dataframe and assuming that first half of the dataframe is the size and the second half is the weight, you can use weighted.mean function in apply function:
df$WtAvg = apply(df,1,function(x){weighted.mean(x[1:(ncol(df)/2)],
x[(ncol(df)/2+1):ncol(df)])})
And you get the following output:
> df
A B C PerA PerB PerC WtAvg
1 1 2 3 0.1 0.2 0.7 2.6
2 4 5 6 0.5 0.3 0.2 4.7
3 7 8 9 0.6 0.1 0.3 7.7

Segmenting a data frame by row based on previous rows values

I have a data frame in R that contains 2 columns named x and y (co-ordinates). The data frame represents a journey with each line representing the position at the next point in time.
x y seconds
1 0.0 0.0 0
2 -5.8 -8.5 1
3 -11.6 -18.2 2
4 -16.9 -30.1 3
5 -22.8 -40.8 4
6 -29.0 -51.6 5
I need to break the journey up into segments where each segment starts once the distance from the start of the previous segment crosses a certain threshold (e.g. 200).
I have recently switched from using SAS to R, and this is the first time I've come across anything I can do easily in SAS but can't even think of the way to approach the problem in R.
I've posted the SAS code I would use below to do the same job. It creates a new column called segment.
%let cutoff=200;
data segments;
set journey;
retain segment distance x_start y_start;
if _n_=1 then do;
x_start=x;
y_start=y;
segment=1;
distance=0;
end;
distance + sqrt((x-x_start)**2+(y-y_start)**2);
if distance>&cutoff then do;
x_start=x;
y_start=y;
segment+1;
distance=0;
end;
keep x y seconds segment;
run;
Edit: Example output
If the cutoff were 200 then an example of required output would look something like...
x y seconds segment
1 0.0 0.0 0 1
2 40.0 30.0 1 1
3 80.0 60.0 2 1
4 120.0 90.0 3 1
5 160.0 120.0 4 2
6 120.0 150.0 5 2
7 80.0 180.0 6 2
8 40.0 210.0 7 2
9 0.0 240.0 8 3
If your data set is dd, something like
cutoff <- 200
origin <- dd[1,c("x","y")]
cur.seg <- 1
dd$segment <- NA
for (i in 1:nrow(dd)) {
dist <- sqrt(sum((dd[i,c("x","y")]-origin)^2))
if (dist>cutoff) {
cur.seg <- cur.seg+1
origin <- dd[i,c("x","y")]
}
dd$segment[i] <- cur.seg
}
should work. There are some refinements (it might be more efficient to compute distances of the current origin to all rows, then use which(dist>cutoff)[1] to jump to the first row that goes beyond the cutoff), and it would be interesting to try to come up with a completely vectorized solution, but this should be OK. How big is your data set?

R: Improvement of loop to create distance matrix from data frame

I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot temperature
A 12
B 12.5
C 15
... ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
. A B C
A 0 0.5 3
B 0.5 0 0.5
C 3 2.5 0
This is what I have come up with in R:
temp_data #my data frame with the two columns: location and temperature
temp_dist<-matrix(data=NA, nrow=length(temp_data[,1]), ncol=length(temp_data[,1]))
temp_dist<-as.data.frame(temp_dist)
names(temp_dist)<-as.factor(temp_data[,1]) #the locations are numbers in my data
rownames(temp_dist)<-as.factor(temp_data[,1])
for (i in 1:2244)
{
for (j in 1:2244)
{
temp_dist[i,j]<-abs(temp_data[i,2]-temp_data[j,2])
}
}
I have tried the code with a small sample with:
for (i in 1:10)
and it works fine.
My problem is that the computer has been running now for two full days and it hasn't finished.
I was wondering if there is a way of doing this quicker. I am aware that loops in loops take lots of times and I am trying to fill in a matrix of more than 5 million cells and it makes sense it takes so long, but I am hoping there is a formula that gets the same result in a quicker time as I have to do the same with the precipitation and other variables.
I have also read about dist, but I am unsure if with the data frame I have I can use that formula.
I would very much appreciate your collaboration.
Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
# 1 2 3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
# a b c
# a 0.0 0.5 3.0
# b 0.5 0.0 2.5
# c 3.0 2.5 0.0

R - moving window comparison with datasets of unequal size

I need to compare a large set of values to a small set and find the minimum difference between the two. Maybe this is “moving window” comparison? I’ve looked at several time series packages but can’t find (or recognize) a function that compares data sets of different sizes. Text example below. Any help is greatly appreciated.
----------1st comparison-----------
Time S1 S2 Diff Mean Diff
1 1.3 1.2 0.1
2 1.7 1.6 0.1 0.10
3 1.2
4 1.6
----------2nd comparison------------
1 1.3
2 1.7 1.2 0.5
3 1.2 1.6 -0.4 0.05
4 1.6
----------3rd comparison------------
1 1.3
2 1.7
3 1.2 1.2 0.0
4 1.6 1.6 0.0 0.00 <- minimum difference
What about something like this:
require(zoo)
S1 <- c(1.3,1.7,1.2,1.6)
S2 <- c(1.2,1.6)
We can use rollapply to apply a function rolling along a vector. The width is set at the size of the smaller comparison vector. We then use an anonymous function to pass the values from our large vector, S1, as the variable x from which we then subtract the values from the small vector and take the mean. We can then use min to return the smallest value:
> min( rollapply( S1 , width = 2 , function(x) mean(x-S2) ) )
[1] 0
It's hard to make it more generalisable without the structure of your data

Resources