Calculate and summarize total distance in a table using dplyr in R

I have a table consisting of user, sequence, and geolocation (x and y).
I would like to group it by user and calculate the total distance, following the order given by the sequence.
For example:
> df <- data.frame(user_id=rep(1,3), seq=1:3, x=c(1,5,3), y=c(2,3,9))
> df
user_id seq x y
1 1 1 1 2
2 1 2 5 3
3 1 3 3 9
Here is the function to calculate distance between two points (Euclidean):
> d <- function(n1,n2){
+ d <- sqrt((df$y[n2]-df$y[n1])^2+(df$x[n2]-df$x[n1])^2)
+ return(d)
+ }
I would like to get the total distance like this:
> df <- data.frame(user_id=1, dtot=d(1,2)+d(2,3))
> df
user_id dtot
1 1 10.45
How can I use dplyr "group_by" and get total distance based on the sequence for all users?

One way to accomplish what you want is to define a function for computing the total distance:
library(dplyr)
total.dist <- function(x,y) {
sum(sqrt((x-lag(x))^2+(y-lag(y))^2),na.rm=TRUE)
}
The inputs to this function are the column vectors x and y. We compute the distance between each pair of consecutive rows in vectorized fashion by subtracting the lag of each column from the column itself. The first row has no predecessor, so its distance is NA; the total distance is the sum of all the distances with that NA removed.
Then use this function inside summarise after grouping by user_id:
res <- df %>% group_by(user_id) %>% summarise(dtot=total.dist(x,y))
## # A tibble: 1 x 2
## user_id dtot
## <dbl> <dbl>
##1 1 10.44766
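The lag-based function assumes that each user's rows are already ordered by seq. Below is a minimal base R sketch of the same calculation that sorts by seq first and uses diff() instead of lag(). The two-user data frame df2 and the helper path_length are made up for illustration and are not part of the original answer.

```r
# Hypothetical data: two users; user 2's rows are deliberately out of seq order
df2 <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 2),
  seq     = c(1, 2, 3, 2, 1, 3),
  x       = c(1, 5, 3, 3, 0, 3),
  y       = c(2, 3, 9, 4, 0, 8)
)

# Length of the path through one user's points, visited in seq order
path_length <- function(d) {
  d <- d[order(d$seq), ]                 # make the visiting order explicit
  sum(sqrt(diff(d$x)^2 + diff(d$y)^2))   # diff() gives consecutive differences
}

# One total per user, returned as a named vector
dtot <- sapply(split(df2, df2$user_id), path_length)
dtot
# user 1: sqrt(17) + sqrt(40), about 10.44766; user 2: 5 + 4 = 9
```

A user with a single row gets a total of 0 here, because diff() returns an empty vector and sum() of an empty vector is 0.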

Related

find minimum of 2 columns from a data frame (minimize 2 columns at the same time) in R

I have a data frame like this:
X Y
1 2
3 1
1 1
2 3
1 2
Now I want to find the minimum value of X and among the smallest values for X I want to pick the row that has the smallest value for Y. (My data has several minima.)
So in this example the desired output is "line 3" because minimum value of X is 1 and among the rows with X=1 the minimum value for Y is in line 3 (Y=1).
I know the function min() which seems to pick the first minimal value of the data.frame or of the specified column of the data.frame.
But is there a function in R or an easy way to find the row that minimizes X and Y at the same time?
Right now I would:
1. use the min() function to find the minimum value for X, then
2. remove every row that has a greater value for X than the minimum,
3. use the min() function to find the minimum value for Y (among the remaining rows),
4. work out how to find the corresponding row in the original data.frame.
But there must be an easier way to do it?
If you arrange the data by X and Y, you can select the 1st row of the dataframe.
In dplyr that would be -
library(dplyr)
df %>% arrange(X, Y) %>% slice(1L)
# X Y
#1 1 1
Or in base R -
df[order(df$X, df$Y)[1], ]
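Another base R option (lex.order = TRUE makes X take priority over Y; with the default level ordering, Y would be minimized first)
> df[which.min(as.integer(interaction(df, lex.order = TRUE))), ]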
X Y
3 1 1
or a data.table option
> setorder(setDT(df))[1]
X Y
1: 1 1
Using the data.table package:
library(dplyr)
library(data.table)
dt <- read.table(text = "X Y
1 2
3 1
1 1
2 3
1 2", header = T)
dt <- dt %>% as.data.table() ## convert to data.table
dt[X == min(X), .SD[Y == min(Y)]][1]
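For comparison, the step-by-step procedure described in the question can be written directly in base R and checked against the order() approach. This is only a sketch on the question's example data; the names cand, best and best_order are illustrative.

```r
df <- data.frame(X = c(1, 3, 1, 2, 1), Y = c(2, 1, 1, 3, 2))

# Steps 1-2: indices of the rows where X attains its minimum
cand <- which(df$X == min(df$X))

# Steps 3-4: among those rows, take the first one with the smallest Y;
# indexing cand maps the result back to a row number of the original df
best <- cand[which.min(df$Y[cand])]

# One-step equivalent: sort by X, break ties by Y, take the first index
best_order <- order(df$X, df$Y)[1]

df[best, ]
```

Both best and best_order are row positions in the original data frame, so the original row names are preserved when subsetting.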

Joining 2 datasets where key variable appears in multiple rows in both left and right datasets: R

I have two dataframes. The first has information on individual id, period, and city of the workplace. The second contains individual id and the city of each study degree achieved throughout their lives. An individual can work at different places in the same period and may have multiple degrees. I wish to add a column to the first dataframe indicating whether the individual has a degree from the same city she is working in during the given period.
Consider the very simple example below. Dataframe mydf1 shows that (i) individual A works in cities x and y in both periods 1 and 2, (ii) individual B works in city w in periods 1 and 2 and in city k in period 1, (iii) individual C works in city k in period 1. Dataframe mydf2 shows that (i) individual A has studied in cities x and w, (ii) individual B has studied in cities x and k, and (iii) individual C has studied in cities y and k.
mydf1 <- data.frame(id=c('A','A','A','A','B','B','B','C'),
period=c(1,1,2,2,1,1,2,1),
work_city=c('x','y','x','y','w','k','w','k'))
mydf2 <- data.frame(id=c('A','A','B','B','C','C'),
study_city=c('x','w','x','k','y','k'))
My output should be as below, where the indicator variable same_city is equal to 1 if the value of work_city for the respective row coincides with any of the values of variable study_city in dataset mydf2 for that particular individual. For instance: for individual A, variable same_city should be 1 if work_city is equal to 'x' or 'w', or 0 otherwise.
mydf_final <- data.frame(id=c('A','A','A','A','B','B','B','C'),
period=c(1,1,2,2,1,1,2,1),
work_city=c('x','y','x','y','w','k','w','k'),
same_city=c('1','0','1','0','0','1','0','1'))
A possible solution is to aggregate mydf2 by id and put all study cities in a list. After joining mydf1 and mydf2_aggr, we check whether the work_city for each row appears in the study_cities list:
mydf1 <- data.frame(id=c('A','A','A','A','B','B','B','C'),
period=c(1,1,2,2,1,1,2,1),
work_city=c('x','y','x','y','w','k','w','k'))
mydf2 <- data.frame(id=c('A','A','B','B','C','C'),
study_city=c('x','w','x','k','y','k'))
Aggregate mydf2 by id and put all values for study_cities in a list. Now there is only one row per unique id.
library(dplyr)
mydf2_aggr <- mydf2 %>%
group_by(id) %>%
summarise(study_cities = list(study_city))
Join mydf1 and mydf2_aggr on id and use the rowwise function so that we can apply a simple ifelse to each row's study_cities list. There may be solutions that do not need rowwise. I have added the column study_cities_as_string only to illustrate the answer.
mydf_final <- mydf1 %>%
left_join(mydf2_aggr, by="id") %>%
rowwise() %>%
mutate(study_cities_as_string = paste(study_cities, collapse=","),
same_city = ifelse(work_city %in% study_cities, 1, 0)) %>%
select(-study_cities)
mydf_final is now:
id period work_city study_cities_as_string same_city
<chr> <dbl> <chr> <chr> <dbl>
1 A 1 x x,w 1
2 A 1 y x,w 0
3 A 2 x x,w 1
4 A 2 y x,w 0
5 B 1 w x,k 0
6 B 1 k x,k 1
7 B 2 w x,k 0
8 C 1 k y,k 1
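One way to avoid rowwise altogether is to compare (id, city) pairs directly. The base R sketch below builds a combined key with paste() and tests membership with %in%. It assumes that neither the ids nor the city names contain the separator "|"; that assumption is mine, not the original answer's.

```r
mydf1 <- data.frame(id = c('A','A','A','A','B','B','B','C'),
                    period = c(1,1,2,2,1,1,2,1),
                    work_city = c('x','y','x','y','w','k','w','k'))
mydf2 <- data.frame(id = c('A','A','B','B','C','C'),
                    study_city = c('x','w','x','k','y','k'))

# Build "id|city" keys on both sides; a work row matches when its key
# also appears among the study keys of the same individual
work_key  <- paste(mydf1$id, mydf1$work_city,  sep = "|")
study_key <- paste(mydf2$id, mydf2$study_city, sep = "|")

mydf1$same_city <- as.integer(work_key %in% study_key)
mydf1$same_city
# 1 0 1 0 0 1 0 1
```

Because %in% is vectorized, this needs no grouping or row-by-row evaluation, which may matter on large data.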

cumsum and product based on Unique ID

I am working on a large dataset and need to calculate a single value in R. I believe cumsum and a cumulative product would work, but I don't know how.
county_id <- c(1,1,1,1,2,2,2,3,3)
res <- c(2,3,2,4,2,4,3,3,2)
I need a function that gives me a single value for every county_id, computed as follows; then I need the total over all counties.
Example, for county_id=1 the total for res is calculated manually as
2(3+2+4)+3(2+4)+2(4)
for county_id=2 the total for res is calculated manually as
2(4+3)+4(3)
for county_id=3 the total for res is calculated manually as
3(2)
Then it sums all this into a single variable
44+26+6=76
NB my county_id run from 1:47 and each county_id could have up to 200 res
Thank you
You can use aggregate with cumsum like:
x <- aggregate(res, list(county_id)
, function(x) sum(rev(cumsum(rev(x[-1])))*x[-length(x)]))
#Group.1 x
#1 1 44
#2 2 26
#3 3 6
sum(x[,2])
#[1] 76
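To see why the rev(cumsum(rev(...))) expression works, it may help to trace the intermediate values for county_id = 1. The variable names suffix and terms below are only for illustration.

```r
x <- c(2, 3, 2, 4)            # res values for county_id = 1

# rev(cumsum(rev(v))) turns a running sum into a "sum of everything
# from here to the end": for x[-1] = 3 2 4 it gives 9 6 4
suffix <- rev(cumsum(rev(x[-1])))

# Multiply each value (except the last) by the sum of the values after it
terms <- x[-length(x)] * suffix   # 2*9, 3*6, 2*4

total <- sum(terms)
total
# 44, matching 2(3+2+4) + 3(2+4) + 2(4)
```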
You can sum the product of the pairwise combinations:
library(dplyr)
dat <- data.frame(county_id, res) # dat is built from the vectors in the question
dat %>%
group_by(county_id) %>%
summarise(x = sum(combn(res, 2, FUN = prod)))
# A tibble: 3 x 2
county_id x
<dbl> <dbl>
1 1 44
2 2 26
3 3 6
Base R:
aggregate(res ~ county_id, dat, FUN = function(x) sum(combn(x, 2, FUN = prod)))
Here is one way to do this using tidyverse functions.
For each county_id we multiply the current res value by the sum of the res values after it.
library(dplyr)
library(purrr)
df1 <- df %>%
group_by(county_id) %>%
summarise(result = sum(map_dbl(row_number(),
~res[.x] * sum(res[(.x + 1):n()])), na.rm = TRUE))
df1
# county_id result
# <dbl> <dbl>
#1 1 44
#2 2 26
#3 3 6
To get total sum you can then do :
sum(df1$result)
#[1] 76
data
county_id <- c(1,1,1,1,2,2,2,3,3)
res <- c(2,3,2,4,2,4,3,3,2)
df <- data.frame(county_id, res)
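The quantity being computed is the sum of all pairwise products within a group, which also has a closed form: the sum over i < j of x_i * x_j equals (sum(x)^2 - sum(x^2)) / 2. A minimal sketch using that identity is below; the helper name pair_sum is hypothetical and does not come from any of the answers.

```r
county_id <- c(1,1,1,1,2,2,2,3,3)
res <- c(2,3,2,4,2,4,3,3,2)

# Sum of pairwise products via the identity
# (x1 + ... + xn)^2 = sum of squares + 2 * sum of pairwise products
pair_sum <- function(x) (sum(x)^2 - sum(x^2)) / 2

per_county <- tapply(res, county_id, pair_sum)
per_county
# county 1: 44, county 2: 26, county 3: 6

sum(per_county)
# 76
```

This form needs no loops or combinations, and a county with a single res value yields 0. By contrast, combn(x, 2) interprets a single positive integer x as seq_len(x), so single-row groups need care with the combn-based answers.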
Another option is to use SPSS syntax
// You need to count the number of variables with valid responses
count x1=var1 to var4(1 thr hi).
execute.
// 1st thing is to declare a variable that will hold your cumulative sum
// Declare your variables in terms of a vector
// You then loop twice. The 1st loop runs from the 1st variable to the number of
// variables with data (x1). The 2nd loop runs from the 1st variable to the
// variable in (1st loop-1) for all variables with data.
//Lastly you need to get a cumulative sum based on your formulae
// This syntax can be replicated in other software.
compute index1=0.
vector x=var1 to var4.
loop #i=1 to x1.
loop #j=1 to #i-1 if not missing(x(#i)).
compute index1=index1+(x(#j)*sum(x(#i))).
end loop.
end loop.
execute.

How to sum every nth (200) observation in a data frame using R [duplicate]

I am new to R so any help is greatly appreciated!
I have a data frame of 278800 observations for each of my 10 variables. I am trying to create an 11th variable that sums every 200 observations (or rows) of a specific variable/column (the sum of rows 1:200, 201:400, 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest, with the aim of adding a new variable that sums each consecutive block of 200 rows, but I cannot figure it out. I understand my new "variable" will have 1,394 data points (278,800/200). I have tried the rollapply function, but the output does not sum in blocks of 200; it sums 1:200, 2:201, 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
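If adding a package is not desirable, the same block sums can be sketched in base R. The first variant reshapes the vector into a matrix with n rows and requires the length to be a multiple of n; the second builds a block index and also copes with a shorter final block. The names grp and x2 are illustrative.

```r
x <- 1:12
n <- 3

# Variant 1: each matrix column holds one block of n consecutive values
block_sums <- colSums(matrix(x, nrow = n))
block_sums
# 6 15 24 33

# Variant 2: integer division assigns 0,0,0,1,1,1,... as block labels,
# so a trailing partial block is summed on its own
x2 <- 1:13
grp <- (seq_along(x2) - 1) %/% n
block_sums2 <- tapply(x2, grp, sum)
block_sums2
# 6 15 24 33 13
```

For the data in the question, n would be 200 and x the column of interest.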
First let's generate some data and get a label for each group:
library(tidyverse)
df <-
rnorm(1000) %>%
as_tibble() %>%
mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
value grp
<dbl> <dbl>
1 -1.06 1
2 0.668 1
3 -2.02 1
4 1.21 1
...
1000 0.78 5
This creates 1000 random N(0,1) values, turns them into a data frame, and then adds an incrementing numeric label for each group of 200. Then we just need to do a group-by operation on the second column and sum the values:
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value))
# A tibble: 5 x 2
grp grp_sum
<dbl> <dbl>
1 1 9.63
2 2 -12.8
3 3 -18.8
4 4 -8.93
5 5 -25.9
You can use the pull() operation to get a vector of the results:
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value)) %>%
pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
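I created a vector with 278800 observations (a):
a <- rnorm(278800)
b <- numeric(length(a) / 200) # preallocate the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)) {
  # parentheses are required: i:i+199 parses as (i:i)+199, a single index
  b[j] <- sum(a[i:(i + 199)]) # b is your column of interest
  j <- j + 1
}
View(b)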
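A detail worth checking in index expressions like the one in this loop is operator precedence: in R the sequence operator `:` binds more tightly than binary `+`, so the upper bound of a range needs parentheses. A small sketch with made-up variable names:

```r
i <- 1

# Without parentheses: (i:i) + 199, i.e. the single number 200
wrong <- i:i+199

# With parentheses: the 200 indices 1, 2, ..., 200
right <- i:(i + 199)

length(wrong)   # 1
length(right)   # 200

# Effect on a block sum, using a simple known vector
a <- 1:400
sum(a[wrong])   # 200, just the 200th element
sum(a[right])   # 20100, the sum of 1..200
```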

Select rows in data frame with same values

I have a dataframe with unique values $Number identifying specific points where a polygon is intersecting. Some points (i.e. 56) have 3 polygons that intersect. I want to extract the three rows which start with 56.
df <- cbind(Number = rownames(check), check)
df
[image: df table]
The issue going forward is I will be applying this for 10,000 points and won't know the repeating number such as "56". So is there a way to have a general expression which chooses rows with a general match without knowing that value?
You can achieve the desired output with:
subset2 <- function(n) df[floor(df$Number) == n,]
where df is the name of your dataset and Number is the name of the target column. We can fill in n as needed:
#Example
df <- data.frame(Number=c(1,3,24,56.65,56.99,56.14,66),y=sample(LETTERS,7))
df
# Number y
# 1 1.00 J
# 2 3.00 B
# 3 24.00 D
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
# 7 66.00 V
subset2(56)
# Number y
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
I simply changed the $Number column into a numeric field, then rounded down to integer data.
numeric <- as.numeric(as.character(df$Number))
Id <- floor(numeric)
If we only want the values of $Number that occur at least 3 times, we can use dplyr to group by $Number and keep only the groups with 3 or more rows
library(dplyr)
# Data
df <- data.frame(Number = c(1,1,1,2,2,3,3))
# Filtering
df %>% group_by(Number) %>% filter(n() >= 3)
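The two ideas above can be combined so that the repeated point does not have to be known in advance. The base R sketch below assumes, as the floor() answer does, that the integer part of Number identifies the point; the names id and keep are illustrative.

```r
df <- data.frame(Number = c(1, 3, 24, 56.65, 56.99, 56.14, 66))

# Integer part of Number serves as the point identifier (an assumption)
id <- floor(df$Number)

# ave() returns, for every row, the size of the group that row belongs to
keep <- ave(id, id, FUN = length) >= 3

df[keep, , drop = FALSE]
# rows 4, 5 and 6 (the three 56.xx entries)
```

Changing the threshold in the comparison selects points hit by a different number of polygons.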
