I have a data set representing movement through a 2d environment with respect to time:
time(s) start_pos fwd_dist rev_dist end_pos
1 0.0 4.0 -3.0 2.0
2 2.0 5.1 0.5 3.0
3 3.0 4.7 -0.5 3.5
4 3.5 3.6 -1.8 2.1
5 2.1 2.6 -2.1 1.0
6 1.0 1.5 -1.5 -0.2
I want to add another column: check which is larger of "end_pos" and "start_pos", then subtract that larger value from "fwd_dist". I'm trying to loop through the dataset but am struggling with the syntax in R:
i<-0
while (i < length(data[,1]){if (data[i,4] > data[i,1]){print (data[i,2]-data[i,4])} else {print (data[i,2]-data[i,1])}; i<-i+1}
I keep getting the error:
Error in if (data[i, 4] > data[i, 1]) { :
argument is of length zero
pmax(start_pos,end_pos)
will give you the parallel maximum (i.e., componentwise) of two vectors. So you are probably looking for
fwd_dist-pmax(start_pos,end_pos)
A data frame based approach:
data$difference <- data$fwd_dist - pmax(data$start_pos, data$end_pos)
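As a rough sketch of the above, the sample data can be rebuilt by hand (the column names are taken from the question's header; the data frame construction is an assumption about how the data is stored). It also shows where the original error probably comes from: R indexes from 1, so starting the loop at i <- 0 makes data[0, 4] a zero-length vector, and if() cannot test a zero-length condition.

```r
# Rebuild the sample data from the question (assumed to be a plain data frame)
data <- data.frame(
  time      = 1:6,
  start_pos = c(0.0, 2.0, 3.0, 3.5, 2.1, 1.0),
  fwd_dist  = c(4.0, 5.1, 4.7, 3.6, 2.6, 1.5),
  rev_dist  = c(-3.0, 0.5, -0.5, -1.8, -2.1, -1.5),
  end_pos   = c(2.0, 3.0, 3.5, 2.1, 1.0, -0.2)
)

# Row 0 does not exist, so this selects nothing; a zero-length value
# inside if() triggers "argument is of length zero"
zero_len <- length(data[0, 4])

# Vectorized version: no loop and no index bookkeeping
data$difference <- data$fwd_dist - pmax(data$start_pos, data$end_pos)
data$difference  # approximately 2.0 2.1 1.2 0.1 0.5 0.5
```

If a loop is still preferred, iterating with for (i in seq_len(nrow(data))) would avoid the zero index.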
I have this line of code but I don't know what it means especially the note_ind part.
apply(mydat[,-c(1,2,3,note_ind:ncol(dataset))],c(1,2),as.numeric)
The notation x:y creates a numeric sequence from x towards y in steps of 1 (counting down instead when x > y). When x <= y it is shorthand for seq(x, y, by = 1). It is most commonly used for integer sequences, but it also works on doubles.
1:10
[1] 1 2 3 4 5 6 7 8 9 10
1.1:10.1
[1] 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1
1.5:10.2 # sequence stops after 9.5 because 10.2 < 9.5 + 1 - seq() behaves the same way
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
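Presumably note_ind is an integer value from somewhere else in your code. ncol(dataset) is the number of columns, so note_ind:ncol(dataset) generates a sequence between those two values, incrementing by 1 for each element. The leading minus sign then drops columns 1, 2, 3 and every column from note_ind onwards before as.numeric is applied to each remaining cell.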
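A small sketch may make this concrete. The data frame below and the value of note_ind are invented for illustration, and a single data frame is used for both the subsetting and the ncol() call, which is an assumption about the original code (it mixes mydat and dataset).

```r
# Hypothetical data: three id-like columns, two value columns stored as text,
# and a trailing notes column
mydat <- data.frame(
  id    = 1:2,
  a     = c("x", "y"),
  b     = c("p", "q"),
  v1    = c("1.5", "2.5"),
  v2    = c("3", "4"),
  note  = c("ok", "bad"),
  stringsAsFactors = FALSE
)
note_ind <- 6  # assumed: position of the first column to drop at the end

# Columns to exclude: 1, 2, 3 and note_ind through the last column
drop_cols <- c(1, 2, 3, note_ind:ncol(mydat))
kept <- mydat[, -drop_cols]

# Apply as.numeric to every cell (MARGIN c(1, 2) means rows and columns)
num <- apply(kept, c(1, 2), as.numeric)
```

Here num is a 2 x 2 numeric matrix holding only the v1 and v2 values.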
I am writing a for loop to delete rows in which all of the values in columns 5 through 8 are NA. However, it only deletes SOME of the rows. When I use a while loop, it deletes all of the rows, but I have to manually end it (i.e. it is an infinite loop, and I have no idea why).
The for/if loop:
for(i in 1:nrow(df)){
if(is.na(df[i,5]) && is.na(df[i,6]) &&
is.na(df[i,7]) && is.na(df[i,8])){
df<- df[-i,]
}
}
while loop (but it is infinite):
for(i in 1:nrow(df)){
while(is.na(df[i,5]) && is.na(df[i,6]) &&
is.na(df[i,7]) && is.na(df[i,8])){
df<- df[-i,]
}
}
Can someone help? Thanks!
What's happening here is that when you remove a row in this way, all the rows below it "move up" to fill the space left behind. When there are repeated rows that should be deleted, the second one gets skipped over. Imagine this table:
1 keep
2 delete
3 delete
4 keep
Now, you loop through a sequence from 1 to 4 (the number of rows) deleting rows that say delete:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 3, the 3rd row says keep, so keep it ... The final table is:
1 keep
2 delete
3 keep
In your example with while, however, the deletion step keeps running on row 2 until that row doesn't meet the conditions instead of moving on to i = 3 right away. So the process goes:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 2 (again), delete that row (again). Now, the data frame looks like this:
1 keep
2 keep
i = 2 (again), this row says keep, so keep it and move on to i = 3
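The walkthrough above terminates, so it does not by itself explain the infinite loop. One likely cause, offered as an assumption since the asker's data is not shown: the for sequence is fixed at the original number of rows, so after deletions i eventually points past the end of the data frame. Indexing a row that does not exist returns NA, and removing a row that does not exist changes nothing, so the while condition stays TRUE forever. A tiny illustration with made-up data:

```r
# Hypothetical three-row data frame
df <- data.frame(id = 1:3, v = c(1, NA, 3))

i <- 4  # one past the last row, as happens after earlier deletions

# Reading a non-existent row yields NA, so an is.na() test passes
past_end_is_na <- is.na(df[i, "v"])

# Dropping a non-existent row is silently ignored: still 3 rows
rows_after_delete <- nrow(df[-i, ])
```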
I'd be remiss to answer this question without mentioning that there are much better ways to do this in R such as square bracket notation (enter ?`[` in the R console), the filter function in the dplyr package, or the data.table package.
This question has many options: Filter data.frame rows by a logical condition
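As a minimal sketch of the square bracket approach mentioned above (the data frame is invented; columns 5 to 8 mirror the question), the rows can be filtered in one step. A loop that runs from the last row to the first is shown as well, since deleting from the bottom up does not shift the rows still to be visited.

```r
# Hypothetical data: rows 2 and 3 are NA in all of columns 5 to 8
df <- data.frame(id = 1:4, a = 1, b = 1, c = 1,
                 w = c(1, NA, NA, 4), x = c(1, NA, NA, 4),
                 y = c(1, NA, NA, 4), z = c(1, NA, NA, 4))

# Vectorized: count NAs per row across columns 5:8, keep rows with fewer than 4
all_na  <- rowSums(is.na(df[, 5:8])) == 4
cleaned <- df[!all_na, ]

# Loop alternative: iterate backwards so deletions never affect unvisited rows
looped <- df
for (i in rev(seq_len(nrow(looped)))) {
  if (all(is.na(looped[i, 5:8]))) {
    looped <- looped[-i, ]
  }
}
```

Both versions keep the rows with id 1 and 4.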
Store the row number in a vector and remove outside the loop.
test <- iris
test[1:5,2:4] <- NA
> head(test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA setosa
2 4.9 NA NA NA setosa
3 4.7 NA NA NA setosa
4 4.6 NA NA NA setosa
5 5.0 NA NA NA setosa
6 5.4 3.9 1.7 0.4 setosa
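x <- integer(0)  # row numbers to drop
for(i in seq_len(nrow(test))){
if(is.na(test[i,2]) && is.na(test[i,3]) &&
is.na(test[i,4])){
x <- c(x,i)
}
}
x
if(length(x) > 0) test <- test[-x,]  # guard: with no matches, test[-x,] would drop every row
> head(test)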
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot temperature
A 12
B 12.5
C 15
... ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
. A B C
A 0 0.5 3
B 0.5 0 2.5
C 3 2.5 0
This is what I have come up with in R:
temp_data #my data frame with the two columns: location and temperature
temp_dist<-matrix(data=NA, nrow=length(temp_data[,1]), ncol=length(temp_data[,1]))
temp_dist<-as.data.frame(temp_dist)
names(temp_dist)<-as.factor(temp_data[,1]) #the locations are numbers in my data
rownames(temp_dist)<-as.factor(temp_data[,1])
for (i in 1:2244)
{
for (j in 1:2244)
{
temp_dist[i,j]<-abs(temp_data[i,2]-temp_data[j,2])
}
}
I have tried the code with a small sample with:
for (i in 1:10)
and it works fine.
My problem is that the computer has been running now for two full days and it hasn't finished.
I was wondering if there is a quicker way of doing this. I am aware that nested loops take a lot of time, and since I am trying to fill in a matrix of more than 5 million cells it makes sense that it takes so long. Still, I am hoping there is a function that gets the same result faster, as I have to do the same with the precipitation and other variables.
I have also read about dist, but I am unsure if with the data frame I have I can use that formula.
I would very much appreciate your collaboration.
Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
# 1 2 3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
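To check that the two approaches agree at the size mentioned in the question, here is a rough sketch with 2244 randomly generated temperatures (the values and the seed are made up; only the size comes from the question). The slowness of the original loop is probably due to assigning into a data frame one cell at a time, which is far more expensive than filling a matrix in a single vectorized call.

```r
# Simulated stand-in for the real data: 2244 locations, invented temperatures
set.seed(1)
temp_data <- data.frame(
  plot        = seq_len(2244),
  temperature = round(runif(2244, min = 5, max = 25), 1)
)

# Approach 1: dist() on a single column, expanded to a full square matrix
via_dist <- as.matrix(dist(temp_data$temperature))

# Approach 2: all pairwise differences via outer(), then absolute value
via_outer <- abs(outer(temp_data$temperature, temp_data$temperature, "-"))

# Compare values only; the two results carry different dimnames
same <- isTRUE(all.equal(via_dist, via_outer, check.attributes = FALSE))
```

The same two lines can be reused for precipitation or any other numeric column.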
I need to compare a large set of values to a small set and find the minimum difference between the two. Maybe this is “moving window” comparison? I’ve looked at several time series packages but can’t find (or recognize) a function that compares data sets of different sizes. Text example below. Any help is greatly appreciated.
----------1st comparison-----------
Time S1 S2 Diff Mean Diff
1 1.3 1.2 0.1
2 1.7 1.6 0.1 0.10
3 1.2
4 1.6
----------2nd comparison------------
1 1.3
2 1.7 1.2 0.5
3 1.2 1.6 -0.4 0.05
4 1.6
----------3rd comparison------------
1 1.3
2 1.7
3 1.2 1.2 0.0
4 1.6 1.6 0.0 0.00 <- minimum difference
What about something like this:
require(zoo)
S1 <- c(1.3,1.7,1.2,1.6)
S2 <- c(1.2,1.6)
We can use rollapply to apply a function rolling along a vector. The width is set at the size of the smaller comparison vector. We then use an anonymous function to pass the values from our large vector, S1, as the variable x from which we then subtract the values from the small vector and take the mean. We can then use min to return the smallest value:
> min( rollapply( S1 , width = 2 , function(x) mean(x-S2) ) )
[1] 0
It's hard to make this more generalisable without knowing the structure of your data.
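For comparison, here is a base R sketch of the same sliding comparison that needs no extra package and also reports which window matched best. Taking the absolute value before picking the minimum is an assumption on my part (so that a large negative mean difference is not mistaken for the best match); drop abs() to reproduce the signed minimum used above.

```r
S1 <- c(1.3, 1.7, 1.2, 1.6)
S2 <- c(1.2, 1.6)

# Every start position at which S2 fits entirely inside S1
starts <- seq_len(length(S1) - length(S2) + 1)

# Mean difference between each window of S1 and S2
mean_diff <- sapply(starts, function(s) {
  window <- S1[s:(s + length(S2) - 1)]
  mean(window - S2)
})

# Window whose mean difference is closest to zero
best <- which.min(abs(mean_diff))
```

With the sample vectors, best is 3, matching the third comparison in the question.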
Again I need your help for a maybe easy question that is not clear for a starter R user.
I need to manipulate a dataframe to substitute NA values by "realistic" ones to feed another application.
The data frame contains values of -3.0, which was the flag for invalid values in the original database. What I need is to replace all the -3.0 values with data coming from another data frame, or maybe to interpolate.
The first data frame would be
1.0 2.0 3.0 4.0
2.0 3.0 -3.0 -3.0
1.0 4.0 -3.0 6.0
1.0 5.0 4.0 5.0
the second one would be
1.0 1.0 1.0 1.0
2.0 2.0 9.0 9.0
2.0 2.0 9.0 2.0
1.0 1.0 1.0 1.0
and the expected result
1.0 2.0 3.0 4.0
2.0 3.0 9.0 9.0
1.0 4.0 9.0 6.0
1.0 5.0 4.0 5.0
I suppose this can be done with a for loop but I haven't found the way to do it.
Thanks in advance
It's actually quite simple to do this without a for loop: if your data frames are A and B, then the command would be
A[A == -3] = B[A == -3]
In other words: for all the indices of A that have value -3, assign the values of B at the corresponding indices.
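Applied to the two data frames from the question, this could look like the sketch below. The column names V1 to V4 are placeholders, since the question does not give any.

```r
# First data frame, with -3 as the invalid-value flag
A <- data.frame(V1 = c(1, 2, 1, 1), V2 = c(2, 3, 4, 5),
                V3 = c(3, -3, -3, 4), V4 = c(4, -3, 6, 5))

# Second data frame, supplying the replacement values
B <- data.frame(V1 = c(1, 2, 2, 1), V2 = c(1, 2, 2, 1),
                V3 = c(1, 9, 9, 1), V4 = c(1, 9, 2, 1))

# Logical matrix marking the flagged cells; compute it once and reuse it
flag <- A == -3

# Copy the values of B into A at exactly those cells
A[flag] <- B[flag]
```

If the flags had already been converted to NA, is.na(A) could serve as the mask in the same way.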