I have the data frame below and I would like to subset it so that I find the observation where a name covered the longest distance between two consecutive observations. If a name moves exactly the same number of meters in the same amount of time, the most recent observation should be selected.
So the final result should be 2 rows: the consecutive pair with the longest distance between them, and if there is more than one such pair, only the most recent one should remain. I will then take those 2 points and display them on a map.
Here is my data:
name<-c("bal","bal","bal","bal","bal","bal","bal","bal")
LAT<-c(54.77127,54.76542,54.76007,54.75468,54.74926,54.74385,54.73847,54.73228)
LON<-c(18.99692,18.99361,18.99059,18.98753,18.98447,18.98150,18.97842,18.97505)
dtime<-c("2016-12-16 02:37:02","2016-12-16 02:38:02","2016-12-16 02:38:32","2016-12-16 02:39:08",
"2016-12-16 02:39:52","2016-12-16 02:41:02","2016-12-16 02:42:02","2016-12-16 02:42:32")
df<-data.frame(name,LAT,LON,dtime)
and this is how I think I should calculate the distance:
library(geosphere)
distm(c(as.numeric(df[1,3]),as.numeric(df[1,2])), c(as.numeric(df[2,3]),as.numeric(df[2,2])), fun = distHaversine)
and this for the time difference:
difftime("2016-12-19 11:33:01", "2016-12-19 11:31:01", units = "secs")
but how can I combine them?
I think you can do everything with one dplyr pipeline:
library(dplyr)
library(geosphere)

df %>%
  group_by(name) %>%
  # keep track of the previous position
  mutate(lag_LAT = lag(LAT), lag_LON = lag(LON)) %>%
  # distance and time difference between consecutive observations
  mutate(distance = diag(distm(cbind(lag_LON, lag_LAT), cbind(LON, LAT), fun = distHaversine)),
         timespan = difftime(dtime, lag(dtime), units = "secs")) %>%
  slice_max(distance) %>%  # longest segment
  slice_max(dtime) %>%     # most recent one in case of ties
  ungroup()
#> # A tibble: 1 x 8
#> name LAT LON dtime lag_LAT lag_LON distance timespan
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <drtn>
#> 1 bal 54.7 19.0 2016-12-16 02:42:32 54.7 19.0 722. 30 secs
Given your request in the comment, I added the first mutate to keep track of the previous position, so that you can plot it later.
Having everything in one row is better than having two separate rows.
With the second mutate you calculate the distance between two consecutive points and the time difference.
I did not question whether the distance calculation is correct; I assumed you knew better than I do.
The first slice_max identifies the maximum distance, while the second one is needed only in case of ties (you said you wanted the most recent observation in case of ties).
I grouped by name because I figured you may have more than one name in your dataset.
I did not get why you need the time difference, but I left it in.
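If you then want to display the two positions on a map, here is a minimal sketch, assuming the leaflet package and that the one-row result of the pipeline above is stored in an object called res (hypothetical name):
library(leaflet)
# res holds the row with LON/LAT (latest position) and lag_LON/lag_LAT (previous position)
leaflet(res) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lag_LON, lat = ~lag_LAT, color = "blue", label = "previous position") %>%
  addCircleMarkers(lng = ~LON, lat = ~LAT, color = "red", label = "latest position")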
I have a dataset with 3.9M rows and 5 columns, stored as a tibble. When I try to convert it to a tsibble, I run out of memory even though I have 32 GB, which should be way more than enough. The weird thing is that if I apply a filter() before piping it into as_tsibble(), it works, even though I'm not actually filtering out any rows.
This does not work:
dataset %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))
This works, even though there are no "Phase" values less than 1, so the filter does nothing and no rows are actually removed:
dataset %>% filter(Phase > 0) %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))
Any ideas why the second option works? Here's what the dataset looks like:
Volume Travel_Time TSSU  Phase TimeStamp
 <dbl>       <dbl> <chr> <int> <dttm>
   105        1.23 01017     2 2020-09-28 10:00:00
    20        1.11 01017     2 2020-09-28 10:15:00
Have you tried using the data.table library? It is optimized for performance with large datasets. I have replicated your steps, and depending on where the dataset variable comes from, you may want to use the fread function to load the data, as it is also very fast.
library(data.table)
dataset <- data.table(dataset)
# setkeyv(x = dataset, cols = c("TSSU", "Phase")) # This line may not be needed
dataset[Phase>0, ]
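For example, if the data originally comes from a CSV file (hypothetical file name), a minimal sketch of loading and subsetting it with data.table could look like this:
library(data.table)
dataset <- fread("dataset.csv")  # hypothetical file name; fread is typically much faster than read.csv
dataset <- dataset[Phase > 0, ]  # same subset as above, in data.table syntax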
I have a series of tracking data in the form of coordinates from individuals leaving and returning to the same place (i.e. all of them leaving from the same coordinates), and I would like to find the maximum Euclidean distance reached from the starting point for each individual. My data set looks more or less like this:
LON LAT ID year
-5.473959 51.72998 1 2010
-5.502444 51.58304 1 2010
-5.341897 50.97509 1 2010
-5.401117 50.76360 1 2010
and the coordinates of the starting point are x = -5.479, y = 51.721.
ID represents each individual (only ID 1 is shown, as the data set is extremely long); I have more than 100 individuals per year and 8 years of data.
Any help is appreciated!
Here's an approach with dplyr and distHaversine:
library(dplyr)
library(geosphere)
data %>%
  mutate(distance = distHaversine(c(-5.479, 51.721), cbind(LON, LAT))) %>%
  group_by(ID) %>%
  summarize(max = max(distance, na.rm = TRUE))
# A tibble: 1 x 2
ID max
<int> <dbl>
1 1 106715.
The output should be in meters.
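Since you mention having 8 years of data, you may also want the maximum per individual and year; a minimal sketch under that assumption (same starting coordinates):
data %>%
  mutate(distance = distHaversine(c(-5.479, 51.721), cbind(LON, LAT))) %>%
  group_by(ID, year) %>%
  summarize(max_dist = max(distance, na.rm = TRUE), .groups = "drop")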
I’m trying to figure out how to make a calculation across (or between?) rows. I’ve tried looking this up but clearly my Google-Fu is not strong today, because I’m just not finding the right search terms.
Here is a super simplified example of the type of data I’m trying to deal with:
mydf <- data.frame(pair = rep(1,2),
participant = c("PartX", "PartY"),
goalsAtt = c(6, 3),
goalsScr = c(2, 3))
We have data on how many "goals" a participant attempted and how many they actually scored, and let's say I want to know about their "defensive" ability. Essentially what I want to do is mutate() two new columns called, let's say, saved and missed, where saved would be the goals attempted by the opposite participant minus the goals scored by them, and missed would just be the goals scored by the opposite participant. So participant X would have saved 0 and missed 3, and participant Y would have saved 4 and missed 2.
Now obviously this is a stupidly simple example, and I'll have about 6 different “goal” types to deal with and 2.5k pairs to go through, but I'm just having a mental block about where to start with this.
Amateur coder here, and Tidyverse style solutions are appreciated.
OK, so let's assume that a pair can only contain 2 participants. Here's a tidyverse solution where we first set an index for the position within each group and then subtract to get goals saved; something similar gives goals missed.
library(tidyverse)
mydf %>%
  group_by(pair) %>%
  mutate(id = row_number()) %>%
  mutate(goalsSaved = if_else(id == 1,
                              lead(goalsAtt) - lead(goalsScr),
                              lag(goalsAtt) - lag(goalsScr))) %>%
  mutate(goalsMissed = if_else(id == 1,
                               lead(goalsScr),
                               lag(goalsScr)))
# A tibble: 2 x 7
# Groups: pair [1]
pair participant goalsAtt goalsScr id goalsSaved goalsMissed
<dbl> <fct> <dbl> <dbl> <int> <dbl> <dbl>
1 1 PartX 6 2 1 0 3
2 1 PartY 3 3 2 4 2
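As an alternative sketch that avoids lead()/lag() but still assumes exactly two participants per pair, you could work with the group sums, since the opponent's value is simply the group total minus your own:
mydf %>%
  group_by(pair) %>%
  mutate(goalsSaved = (sum(goalsAtt) - goalsAtt) - (sum(goalsScr) - goalsScr),  # opponent's attempts minus opponent's goals
         goalsMissed = sum(goalsScr) - goalsScr) %>%                            # opponent's goals
  ungroup()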
The data I have represents sales and their distance (Dist) to a given store, One and Two in this example. What I would like to do is define each store's catchment area based on sales density. A catchment area is defined as the radius that contains 50% of sales. Starting with the orders that have the smallest distance (Dist) to a store, I would like to calculate the radius that contains 50% of that store's sales.
I have the following df, which I calculated in a previous model:
df <- data.frame(ID = c(1,2,3,4,5,6,7,8),
Store = c('One','One','One','One','Two','Two','Two','Two'),
Dist = c(1,5,7,23,1,9,9,23),
Sales = c(10,8,4,1,11,9,4,2))
Now I want to find the minimum distance Dist that gives the figure closest to 50% of Sales. So my output looks as follows:
Output <- data.frame(Store = c('One','Two'),
Dist = c(5,9),
Sales = c(18,20))
I have a lot of observations in my actual df and it's unlikely that I will be able to solve for exactly 50%, so I need to round to the nearest observation.
Any suggestions on how to do this?
NOTE: I apologise in advance for the poor title; I tried to think of a better way to formulate the problem, suggestions are welcome...
Here is one approach with data.table:
library(data.table)
setDT(df)
df[order(Store, Dist),
.(Dist, Sales = cumsum(Sales), Pct = cumsum(Sales) / sum(Sales)),
by = "Store"][Pct >= 0.5, .SD[1,], by = "Store"]
# Store Dist Sales Pct
# 1: One 5 18 0.7826087
# 2: Two 9 20 0.7692308
setDT(df) converts df into a data.table
The .(...) expression selects Dist and calculates the cumulative sales and the respective cumulative percentage of sales, by Store
Pct >= 0.5 subsets this to the cases where the cumulative percentage of sales reaches the threshold, and .SD[1,] takes only the first such row (i.e., the smallest value of Dist), by Store
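If you prefer dplyr, a roughly equivalent sketch mirroring the data.table logic above could be:
library(dplyr)
df %>%
  arrange(Store, Dist) %>%
  group_by(Store) %>%
  mutate(cumSales = cumsum(Sales), Pct = cumSales / sum(Sales)) %>%  # cumulative sales and share per store
  filter(Pct >= 0.5) %>%                                             # rows at or past the 50% threshold
  slice(1) %>%                                                       # first such row, i.e. the smallest Dist
  ungroup()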
I think it would be easier if you rearrange your data into a certain format. My logic is to first take the cumulative sum by group, then merge the group totals back into the data, and finally calculate the percentage. You then have the data and can subset it any way you want to get the first observation from each group.
df$cums=unlist(lapply(split(df$Sales, df$Store), cumsum), use.names = F)
zz=aggregate(df$Sales, by = list(df$Store), sum)
names(zz)=c('Store', 'TotSale')
df = merge(df, zz)
df$perc=df$cums/df$TotSale
sub-setting the data:
merge(aggregate(perc ~ Store,data=subset(df,perc>=0.5), min),df)
Store perc ID Dist Sales cums TotSale
1 One 0.7826087 2 5 8 18 23
2 Two 0.7692308 6 9 9 20 26
I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for that month. I'm trying to calculate the mean sales by department, adding them to the vector avgs, but I seem to be having two problems: the total sales per department is not compiling at all (it's evaluating to zero) and avgs is compiling by record instead of by department. Here's what I have:
avgs <- c()
for (dept in data$departmentName) {
  total <- 0
  for (record in data) {
    if (identical(data$departmentName, dept)) {
      total <- total + data$ownerSales[record]
    }
  }
  avgs <- c(avgs, total / 72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
why not use library(dplyr)?:
library(dplyr)
data(iris)
iris %>%
  group_by(Species) %>%                          # or dept
  summarise(total_plength = sum(Petal.Length),   # total owner sales
            weird_divby72 = total_plength / 72)  # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>%
  group_by(deptName) %>%
  summarise(total_sales = sum(ownerSales),
            monthly_sales = total_sales / 72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
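If you prefer to stay in base R, a minimal sketch with aggregate() (assuming your columns really are called departmentName and ownerSales) would be:
# total sales per department divided by 72, mirroring the intent of your loop
aggregate(ownerSales ~ departmentName, data = data, FUN = function(x) sum(x) / 72)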