I'm relatively new to R.
I'm working on creating a tortuosity index of a line along GPS points, and to work this out I need to calculate the distance between point n and point n+10 in my latitude and longitude data.
I have already used distanceTrack (from the argosfilter package) to calculate the distance between every successive pair of points in the series:
lat<-data$Latitude
long<-data$Longitude
distanceTrack(lat,long)
But I wonder if there is an alternative for doing this over 10 steps?
I also tried another method that uses the sp package and spDistsN1 to get distances between successive points:
coordinates(gpsdat)<-~Longitude+Latitude
proj4string(gpsdat)<-CRS("+proj=longlat +datum=WGS84")
dist<-sapply(seq_along(gpsdat[-1,]), function(i)
spDistsN1(pts=gpsdat[i,],pt=gpsdat[i+1,],longlat=TRUE))
I thought I might be able to simply change the script so it skips 10 points ahead:
dist<-sapply(seq_along(gpsdat[-1,]), function(i)
spDistsN1(pts=gpsdat[i,],pt=gpsdat[i+10,],longlat=TRUE))
But this change gives a "subscript out of bounds" error.
Does anyone have an idea of how to progress with this?
Thank you for any help you can offer!
You could do this without an apply or loop structure, using data.table and geosphere.
For that we shift the lat/lon columns by 5 rows in this example, keep only the rows that have no NA values after the shift, calculate the distance per row, and assign the distances back to the original data.table. You can vary the shift size as needed (10 in your case).
My example also groups by L1, as it assumes different line IDs. You can adapt the grouping column or just omit it.
library(data.table)
library(geosphere)
## Data
points = data.table(
id = 1:20,
lon = seq(11, 30, 1),
lat = seq(51, 70, 1),
L1 = rep(seq(1,2,1), each=10)
)
## Shift Rows
points[, c("next_id", "next_lon", "next_lat") := data.table::shift(.SD, 5, NA, "lead"),
.SDcols=c("id","lon","lat"), by = L1];
## Filter out rows with NA values
points_not_NA <- points[!is.na(next_lon)]
## Get Distance
distRes <- distCosine(as.matrix(points_not_NA[,c("lon","lat")]),
as.matrix(points_not_NA[,c("next_lon","next_lat")]))
## Initialise the distance column
points[, dist := 0]
## Assign the resulting distances to the rows that have a shifted partner
points[!is.na(next_lon), dist := distRes]
points
id lon lat L1 next_id next_lon next_lat dist
1: 1 11 51 1 6 16 56 647237.8260
2: 2 12 52 1 7 17 57 643248.9003
3: 3 13 53 1 8 18 58 639283.2756
4: 4 14 54 1 9 19 59 635346.2618
5: 5 15 55 1 10 20 60 631443.2317
6: 6 16 56 1 NA NA NA 0.0000
7: 7 17 57 1 NA NA NA 0.0000
8: 8 18 58 1 NA NA NA 0.0000
9: 9 19 59 1 NA NA NA 0.0000
10: 10 20 60 1 NA NA NA 0.0000
11: 11 21 61 2 16 26 66 609045.7607
12: 12 22 62 2 17 27 67 605534.8132
13: 13 23 63 2 18 28 68 602102.1704
14: 14 24 64 2 19 29 69 598753.3917
15: 15 25 65 2 20 30 70 595494.0003
16: 16 26 66 2 NA NA NA 0.0000
17: 17 27 67 2 NA NA NA 0.0000
18: 18 28 68 2 NA NA NA 0.0000
19: 19 29 69 2 NA NA NA 0.0000
20: 20 30 70 2 NA NA NA 0.0000
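As an aside, the subscript error in your sapply attempt occurs because seq_along(gpsdat[-1,]) still yields n - 1 indices, so i + 10 runs past the last point. Shortening the index range fixes it; a minimal sketch, assuming gpsdat is the projected SpatialPointsDataFrame from the question:
library(sp)
## Only iterate while i + 10 still points at an existing row;
## spDistsN1 returns kilometres when longlat = TRUE
n <- length(gpsdat)
dist10 <- sapply(seq_len(n - 10), function(i)
  spDistsN1(pts = gpsdat[i, ], pt = gpsdat[i + 10, ], longlat = TRUE))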
Related
I have the following dataframe in R, which I want to arrange and then add a sequence column to:
bay row tier flag
43 11 88 NA
43 11 90 NA
43 11 86 NA
43 9 88 NA
43 9 92 NA
13 4 88 89
13 4 90 NA
13 4 94 NA
11 4 88 89
11 7 90 NA
I want to arrange the tier column in descending order, grouping by bay and row.
The dataframe should then look like this:
bay row tier flag
43 11 90 NA
43 11 88 NA
43 11 86 NA
43 9 92 NA
43 9 88 NA
13 4 94 NA
13 4 90 NA
13 4 88 89
11 7 90 NA
11 4 88 89
Then I want to generate a sequence column for this arranged dataframe:
bay row tier flag seq
43 11 90 NA 1
43 11 88 NA 2
43 11 86 NA 3
43 9 92 NA 4
43 9 88 NA 5
13 4 94 NA 6
13 4 90 NA 7
13 4 88 89 8
11 7 90 NA 9
11 4 88 89 8
The only condition while generating seq: when flag is duplicated, I want both rows to keep the same sequence number; rows with NA flags get their own distinct numbers.
bay, row and tier are integer columns.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), order 'bay', 'row' and 'tier' in descending order, and create a sequence column 'seq'. Then get the index of rows where 'flag' is duplicated and is not NA; specifying that logical index as i and grouping by 'flag', we change 'seq' to the first value of 'seq' in each group.
library(data.table)
df2 <- setDT(df1)[order(-bay, -row, -tier)][, seq := seq_len(.N)]
i1 <- df2[, (duplicated(flag)|duplicated(flag, fromLast = TRUE)) & !is.na(flag)]
df2[i1, seq := seq[1L], .(flag)]
df2
# bay row tier flag seq
# 1: 43 11 90 NA 1
# 2: 43 11 88 NA 2
# 3: 43 11 86 NA 3
# 4: 43 9 92 NA 4
# 5: 43 9 88 NA 5
# 6: 13 4 94 NA 6
# 7: 13 4 90 NA 7
# 8: 13 4 88 89 8
# 9: 11 7 90 NA 9
#10: 11 4 88 89 8
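For comparison, here is a rough dplyr sketch of the same logic (an untested equivalent, not part of the data.table answer): arrange, number the rows, then collapse 'seq' to its first value within each duplicated non-NA 'flag' group.
library(dplyr)
df1 %>%
  arrange(desc(bay), desc(row), desc(tier)) %>%    # same ordering as above
  mutate(seq = row_number()) %>%                   # provisional sequence
  group_by(flag) %>%
  # duplicated non-NA flags share the seq of their first occurrence
  mutate(seq = if_else(!is.na(flag) & n() > 1, first(seq), seq)) %>%
  ungroup()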
I need to rank rows based on two variables and I just can't wrap my head around it.
Test data below:
df <- data.frame(A = c(12,35,55,7,6,NA,NA,NA,NA,NA), B = c(NA,12,25,53,12,2,66,45,69,43))
A B
12 NA
35 12
55 25
7 53
6 12
NA 2
NA 66
NA 45
NA 69
NA 43
I want to calculate a third variable, C, that ranks the rows: by A when A is not NA, and by B when A is NA, BUT a row where A is NA should never outrank a row where A is present.
In the data above, the row with max(A) should take the first C rank, and the largest B among the rows where A is NA can at best hold the sixth C rank, because A has five non-NA values. Whenever a B value in a row with missing A would outrank a row with A present, some transformation should ensure that the row with A present still wins in the final C score.
I would like the result to look something like this:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 69 6
NA 66 7
NA 45 8
NA 43 9
NA 2 10
So far the closest I can get is:
df$C <- ifelse(is.na(df$A), min(df$A, na.rm=T)/df$B, df$A)
But that turns the ranking upside down when A is NA, so B==2 is ranked 6 instead of B==69:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 2 6
NA 43 7
NA 45 8
NA 66 9
NA 69 10
I'm not sure if I could use some kind of weights?
Any suggestions are greatly appreciated! Thanks!
You can try:
## Rank by A descending; rows with NA in A temporarily take the trailing ranks
df$C <- rank(-df$A, na.last = TRUE)
## Re-rank the NA-in-A rows by B descending, offset past the non-NA ranks
df[is.na(df$A), "C"] <- rank(-df[is.na(df$A), "B"]) + sum(!is.na(df$A))
and to view the rows in C order:
df[order(df$C),]
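Assuming numeric columns and no ties, the same ranking also fits in one line: order() sorts by A descending, puts the NA rows last, and (with R's default radix sort) breaks their ties by B descending; wrapping order() in a second order() turns the sort permutation into per-row ranks.
## Rank by A descending; NA-in-A rows go last, ordered by B descending
df$C <- order(order(-df$A, -df$B, na.last = TRUE))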
I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:
myData <- data.frame(
  tree = rep("A", 20),
  doy = seq(75, 94),
  count = c(NA, NA, NA, NA, 0, NA, NA, NA, NA, 1, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA),
  pred = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 6, 9, 12, 20, 44)
)
The count column represents when observations were made and predictions are modelled over a complete set of days, in effect interpolating the data to a day level (from every 5 days).
I would like to conditionally filter this dataset so that I end up truncating the predictions to the same range as the observations, in effect keeping all predictions between when count starts and ends (i.e. removing preceding and trailing rows/values of pred when they correspond to an NA in the count column). For this example, the ideal outcome would be:
tree doy count pred
5 A 79 0 1
6 A 80 NA 1
7 A 81 NA 1
8 A 82 NA 2
9 A 83 NA 2
10 A 84 1 2
11 A 85 NA 2
12 A 86 NA 3
13 A 87 NA 3
14 A 88 NA 3
15 A 89 2 3
I have tried to solve this problem by combining filter with first and last; by using a conditional mutate to create a column that marks with 1 or 0 whether there is an observation in the previous doy (probably using lag) and then filtering on that column; and even by creating a second data.frame that contains the proper doy range and can be joined to this data.
In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:
Select first observed data and utilize mutate
Conditional filtering based on the level of a factor R
My actual dataset is much larger, with multiple trees over multiple years (each tree/year having a different period of observation depending on the elevation of the sites, etc.). I am currently using the dplyr package across my code, so an answer within that framework would be great, but I would be happy with any solution at all.
I think you're just looking to limit the rows to fall between the first and last non-NA count value:
myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
# tree doy count pred
# 5 A 79 0 1
# 6 A 80 NA 1
# 7 A 81 NA 1
# 8 A 82 NA 2
# 9 A 83 NA 2
# 10 A 84 1 2
# 11 A 85 NA 2
# 12 A 86 NA 3
# 13 A 87 NA 3
# 14 A 88 NA 3
# 15 A 89 2 3
In dplyr syntax, grouping by the tree variable:
library(dplyr)
myData %>%
group_by(tree) %>%
filter(seq_along(count) >= min(which(!is.na(count))) &
seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
#
# tree doy count pred
# 1 A 79 0 1
# 2 A 80 NA 1
# 3 A 81 NA 1
# 4 A 82 NA 2
# 5 A 83 NA 2
# 6 A 84 1 2
# 7 A 85 NA 2
# 8 A 86 NA 3
# 9 A 87 NA 3
# 10 A 88 NA 3
# 11 A 89 2 3
Try
indx <- which(!is.na(myData$count))
myData[seq(indx[1], indx[length(indx)]),]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
If this is based on groups
ind <- with(myData, ave(!is.na(count), tree,
FUN=function(x) cumsum(x)>0 & rev(cumsum(rev(x))>0)))
myData[ind,]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
Or using na.trim from zoo
library(zoo)
do.call(rbind,by(myData, myData$tree, FUN=na.trim))
Or using data.table
library(data.table)
setDT(myData)[,.SD[do.call(`:`,as.list(range(.I[!is.na(count)])))] , tree]
# tree doy count pred
#1: A 79 0 1
#2: A 80 NA 1
#3: A 81 NA 1
#4: A 82 NA 2
#5: A 83 NA 2
#6: A 84 1 2
#7: A 85 NA 2
#8: A 86 NA 3
#9: A 87 NA 3
#10: A 88 NA 3
#11: A 89 2 3
Rprof revealed that the following operation I perform is rather slow:
stockHistory[.(p), stock:=stockHistory[.(p), stock] - (backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog])]
I suppose this is because of the subtraction
backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog]
Is there any way to speed up this operation?
.(p) subsets the data.table for a period p, and .(p-1) subsets the previous period (see example data below). Would it maybe be faster to apply some kind of diff() here? I do not know how to do that, though.
Example data:
backorderedDemands<-CJ(period=1:1000, articleID=letters[1:10], backlog=0)[,backlog:=round(runif(10000)*42,0)]
setkey(backorderedDemands,period, articleID)
stockHistory<-CJ(period=1:1000, articleID=letters[1:10], stock=0)[,stock:=round(runif(10000)*42+66,0)]
setkey(stockHistory,period, articleID)
You can first calculate a difference column in backorderedDemands.
backorderedDemands[, diff := c(NA, -diff(backlog)), by=articleID]
Also it is not necessary to use stockHistory[.(p), stock]. It's enough to just use stock.
stockHistory[.(p), stock := stock - backorderedDemands[.(p), diff]]
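If the surrounding loop over p can be dropped entirely, the whole correction fits in a single keyed update join; a sketch, assuming every period after the first needs the adjustment (period 1 has no previous backlog):
## Per-article first differences of backlog, computed once
backorderedDemands[, diff := c(NA, -diff(backlog)), by = articleID]
## Update all periods at once; i.diff refers to the diff column of the joined table
stockHistory[backorderedDemands[!is.na(diff)],
             stock := stock - i.diff,
             on = .(period, articleID)]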
If you want to compute first differences of your data, you can do it as below. It is fast; I have included the step-by-step computation.
library(data.table)
library(dplyr)
Data
set.seed(1)
backorderedDemands <-
CJ(period = 1:1000,
articleID = letters[1:10],
backlog = 0)[,backlog:= round(runif(10000) * 42, 0)]
stockHistory <-
CJ(period = 1:1000,
articleID = letters[1:10],
stock = 0)[, stock:= round(runif(10000) * 42 + 66, 0)]
Solution
merge(stockHistory, backorderedDemands,
by = c("period", "articleID")) %>%
group_by(articleID) %>%
mutate(lag_backlog = lag(backlog, 1),
my_backlog_diff = backlog - lag_backlog,
my_diff = stock + my_backlog_diff) %>%
as.data.frame(.) %>%
head(., 20)
period articleID stock backlog lag_backlog my_backlog_diff my_diff
1 1 a 69 11 NA NA NA
2 1 b 94 16 NA NA NA
3 1 c 97 24 NA NA NA
4 1 d 71 38 NA NA NA
5 1 e 68 8 NA NA NA
6 1 f 71 38 NA NA NA
7 1 g 103 40 NA NA NA
8 1 h 101 28 NA NA NA
9 1 i 102 26 NA NA NA
10 1 j 67 3 NA NA NA
11 2 a 71 9 11 -2 69
12 2 b 89 7 16 -9 80
13 2 c 71 29 24 5 76
14 2 d 96 16 38 -22 74
15 2 e 96 32 8 24 120
16 2 f 99 21 38 -17 82
17 2 g 92 30 40 -10 82
18 2 h 87 42 28 14 101
19 2 i 85 16 26 -10 75
20 2 j 67 33 3 30 97
Extending from this post, which gives the following result:
x y z
1: 1 NA NA
2: 2 NA 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: NA 18 28
9: NA 19 NA
10: NA NA NA
As you can see, if NAs of each column are removed, we can obtain data.table as follows:
x y z
1: 1 13 22
2: 2 14 23
3: 3 15 24
4: 4 16 25
5: 5 17 26
6: 6 18 27
7: 7 19 28
I came up with this code to obtain the above result:
mat.temp <- na.omit(mat[, 1, with = FALSE])
for (i in 2:3) {
  temp <- na.omit(mat[, i, with = FALSE])
  mat.temp <- cbind(mat.temp, temp)
}
However, I am not sure it is efficient.
Could you please give me suggestions?
Thank you
It sounds like you are just trying to do:
DT[, lapply(.SD, function(x) x[!is.na(x)])]
# x y z
# 1: 1 13 22
# 2: 2 14 23
# 3: 3 15 24
# 4: 4 16 25
# 5: 5 17 26
# 6: 6 18 27
# 7: 7 19 28
However, I'm not sure how well this would hold up if you have a different number of NA values in each column.
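If the columns do end up with different numbers of NA values, one option is to pad every shifted-up column with trailing NAs to a common length; a sketch:
## Length of the longest column after dropping NAs
n <- DT[, max(colSums(!is.na(.SD)))]
## Drop NAs per column, then pad back to length n (`length<-` fills with NA)
DT[, lapply(.SD, function(x) {
  out <- x[!is.na(x)]
  length(out) <- n
  out
})]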