I need to calculate the Euclidean distance between the first row and the current row of each group in a data frame. Each row is keyed by (group, month) and has a list of values. In the toy example below the key is c(month, student) and the values are in c(A, B). I want to create a distance column C that is equal to sqrt((A_i - A_1)^2 + (B_i - B_1)^2).
So far I have managed to spread my data and pull each group's first values into new columns. I could write the formula by hand in the toy example, but my actual data has very many value columns instead of just 2. I believe I could compute the squared differences within mutate_all, then take a row sum and the square root of that, but no luck so far.
library(dplyr)
df <- data.frame(month=rep(1:3,2),
                 student=rep(c("Amy", "Bob"), each=3),
                 A=c(9, 6, 6, 8, 6, 9),
                 B=c(6, 2, 8, 5, 6, 7))
# Pull in each column's first values for each group
df %>%
  group_by(student) %>%
  mutate_all(list(first = first))
# TODO: Calculate the distance, i.e. SQRT(sum_i[(x_i - x_1)^2]).
#Output:
month student A B month_first A_first B_first
1 1 Amy 9 6 1 9 6
2 2 Amy 6 2 1 9 6
...
Desired output:
month student A B month_first A_first B_first dist_from_first
1 1 Amy 9 6 1 9 6 0
2 2 Amy 6 2 1 9 6 5
...
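Here is another way using compact dplyr code. It works for any number of value columns.
library(dplyr)
df %>%
  select(-month) %>%
  group_by(student) %>%
  # squared difference from each group's first row (mutate_each is deprecated; mutate_all skips the grouping column)
  mutate_all(function(x) (first(x) - x)^2) %>%
  ungroup() %>%
  # sum the squared differences across all columns except student, then take the root
  mutate(euc.dist = sqrt(rowSums(select(., -student))))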
# A tibble: 6 x 4
student A B euc.dist
<chr> <dbl> <dbl> <dbl>
1 Amy 0 0 0
2 Amy 9 16 5
3 Amy 9 4 3.61
4 Bob 0 0 0
5 Bob 4 1 2.24
6 Bob 1 4 2.24
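For readers on dplyr 1.0 or later, the same idea can be sketched with across(), which supersedes the scoped verbs. This is only a sketch, not the answerer's code: value_cols, the _sq suffix and res are names chosen for illustration, and it assumes that every column other than month and student should enter the distance.

```r
library(dplyr)

df <- data.frame(month = rep(1:3, 2),
                 student = rep(c("Amy", "Bob"), each = 3),
                 A = c(9, 6, 6, 8, 6, 9),
                 B = c(6, 2, 8, 5, 6, 7))

# Assumption: all non-key columns are value columns.
value_cols <- setdiff(names(df), c("month", "student"))

res <- df %>%
  group_by(student) %>%
  # Squared difference from the group's first row, kept in temporary *_sq columns.
  mutate(across(all_of(value_cols), ~ (.x - first(.x))^2, .names = "{.col}_sq")) %>%
  ungroup() %>%
  # Row-wise sum of the squared differences, then the square root.
  mutate(dist_from_first = sqrt(rowSums(select(., ends_with("_sq"))))) %>%
  select(-ends_with("_sq"))

res
```

Unlike the version above, this keeps the original A and B values and the month column, so the result can be compared directly with the desired output in the question.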
Edit: added alternative formulation using a join. I expect that approach will be much faster for a very wide data frame with many columns to compare.
Approach 1: To get the Euclidean distance for a large number of columns, one way is to rearrange the data so each row shows one month, one student, and one original column (e.g. A or B in the OP), along with two columns holding the current month's value and the first value. Then we can square the difference and sum across all columns within each student-month to get the Euclidean distance. (The code below names this column RMS, but strictly it is the root of the sum of squares, not a root mean square.)
library(tidyverse)
df %>%
group_by(student) %>%
mutate_all(list(first = first)) %>%
ungroup() %>%
# gather into long form; make col show variant, col2 show orig column
gather(col, val, -c(student, month, month_first)) %>%
mutate(col2 = col %>% str_remove("_first")) %>%
mutate(col = if_else(col %>% str_ends("_first"),
"first",
"comparison")) %>%
spread(col, val) %>%
mutate(square_dif = (comparison - first)^2) %>%
group_by(student, month) %>%
summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
Approach 2. Here, a long version of the data is joined to a version that is just the earliest month for each student.
library(tidyverse)
df_long <- gather(df, col, val, -c(month, student))
df_long %>% left_join(df_long %>%
group_by(student) %>%
top_n(-1, wt = month) %>%
rename(first_val = val) %>%
select(-month),
by = c("student", "col")) %>%
mutate(square_dif = (val - first_val)^2) %>%
group_by( student, month) %>%
summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
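The join approach can also be written with pivot_longer() and slice_min(), which tidyr and dplyr now recommend over gather() and top_n(). The following is a sketch under the assumption that dplyr >= 1.0 and tidyr >= 1.0 are installed; first_vals, res and dist_from_first are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(month = rep(1:3, 2),
                 student = rep(c("Amy", "Bob"), each = 3),
                 A = c(9, 6, 6, 8, 6, 9),
                 B = c(6, 2, 8, 5, 6, 7))

# One row per student, month and original column.
df_long <- df %>%
  pivot_longer(-c(month, student), names_to = "col", values_to = "val")

# The earliest month's value for each student and column.
first_vals <- df_long %>%
  group_by(student, col) %>%
  slice_min(month, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(student, col, first_val = val)

res <- df_long %>%
  left_join(first_vals, by = c("student", "col")) %>%
  group_by(student, month) %>%
  summarise(dist_from_first = sqrt(sum((val - first_val)^2)), .groups = "drop")

res
```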
Instead of the mutate_all call, it'd be easier to directly calculate the dist_from_first. The only thing I'm unclear about is whether month should be included in the group_by() statement.
library(tidyverse)
df <- tibble(month=rep(1:3,2),
student=rep(c("Amy", "Bob"), each=3),
A=c(9, 6, 6, 8, 6, 9),
B=c(6, 2, 8, 5, 6, 7))
df%>%
group_by(student)%>%
mutate(dist_from_first = sqrt((A - first(A))^2 + (B - first(B))^2))%>%
ungroup()
# A tibble: 6 x 5
# month student A B dist_from_first
# <int> <chr> <dbl> <dbl> <dbl>
#1 1 Amy 9 6 0
#2 2 Amy 6 2 5
#3 3 Amy 6 8 3.61
#4 1 Bob 8 5 0
#5 2 Bob 6 6 2.24
#6 3 Bob 9 7 2.24
I have a longitudinal dataset, where the same subjects are measured on different occasions in time.
For instance:
dd=data.frame(subject_id=c(1,1,1,2,2,2,3,3,4,5,6,7,8,8,9,9),income=c(rnorm(16,50000,250)))
I need something that tells me how many subjects have been measured only once, twice, three times, and so on. In the example above, the number of subjects measured on only one occasion is 4, the number of subjects measured twice is 3, and so on.
That's my attempt for counting, for instance, how many subjects have been measured only twice:
library(dplyr)
s.two=dd %>% group_by(subject_id) %>% filter(n() == 2) %>% ungroup()
length(s.two$subject_id)/2
But since I have very heterogeneous clusters (ranging from 1 to 24 observations per subject), this means I would have to write plenty of lines. Is there something more efficient I can do?
The objective is a summary of the cluster sizes (the cluster being the subject_id). For instance, let's say I have 1,000 clusters. I want to know how many of them consist of subjects observed just once, twice, and so on: for example, 50 out of 1,000 clusters are subjects observed on just one occasion, 300 out of 1,000 are subjects observed on just two occasions, etc.
With this information I will construct a table for my report.
You should use summarize. After this you can still filter with filter(n == 2).
library(dplyr)
dd <- data.frame(
subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9),
income = c(rnorm(16, 50000, 250))
)
dd |>
group_by(subject_id) |>
summarise(n = n())
#> # A tibble: 9 × 2
#> subject_id n
#> <dbl> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 2
#> 4 4 1
#> 5 5 1
#> 6 6 1
#> 7 7 1
#> 8 8 2
#> 9 9 2
If you use mutate instead of summarise and then filter, you keep every row of the matching subjects:
dd |>
group_by(subject_id) |>
mutate(n = n()) |>
filter(n ==2)
subject_id income n
<dbl> <dbl> <int>
1 3 49675. 2
2 3 50306. 2
3 8 49879. 2
4 8 50202. 2
5 9 49783. 2
6 9 49834. 2
NEW EDIT
Maybe you mean this:
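dd |>
  group_by(subject_id) |>
  summarise(n = n()) |>
  # n_distinct() counts the groups; max(subject_id) only worked because the ids happen to run from 1 to 9
  mutate(info = glue::glue(
    'There are {n} times {subject_id} out of {n_distinct(subject_id)} groups')) |>
  select(info)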
# A tibble: 9 × 1
info
<glue>
1 There are 3 times 1 out of 9 groups
2 There are 3 times 2 out of 9 groups
3 There are 2 times 3 out of 9 groups
4 There are 1 times 4 out of 9 groups
5 There are 1 times 5 out of 9 groups
6 There are 1 times 6 out of 9 groups
7 There are 1 times 7 out of 9 groups
8 There are 2 times 8 out of 9 groups
9 There are 2 times 9 out of 9 groups
Next is @Ritchie Sacramento's solution, which counts how many subjects have each number of observations:
dd |>
group_by(subject_id) |>
summarise(no_of_occurences = n()) |>
count(no_of_occurences)
# A tibble: 3 × 2
no_of_occurences n
<int> <int>
1 1 4
2 2 3
3 3 2
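Since the question asks for a report table of the form "50 out of 1,000 clusters", a share column can be added to the same two-step count. This is a small sketch, not part of the original answers; n_obs, n_subjects and share are column names chosen for illustration.

```r
library(dplyr)

dd <- data.frame(
  subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9)
)

sizes <- dd %>%
  count(subject_id, name = "n_obs") %>%          # observations per subject
  count(n_obs, name = "n_subjects") %>%          # subjects per cluster size
  mutate(share = n_subjects / sum(n_subjects))   # fraction of all subjects

sizes
```

Here 4 of the 9 subjects were observed once, 3 twice and 2 three times, so the shares are 4/9, 3/9 and 2/9.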
I have a data frame like this:
Team  GF
A     3
B     5
A     2
A     3
B     1
B     6
Looking for output like this (just an additional column):
Team  x  avg(x)
A     3  0
B     5  0
A     2  3
A     3  2.5
B     1  5
B     6  3
avg(x) is the average of all previous instances of x where Team is the same. I have the following R code which gets the overall average, however I'm looking for the "step-wise" average.
new_df <- df %>% group_by(Team) %>% summarise(avg_x = mean(x))
Is there a way to vectorize this while only evaluating the previous rows on each "iteration"?
You want the cummean() function from dplyr, combined with lag() (replace_na() comes from tidyr):
df %>% group_by(Team) %>% mutate(avg_x = replace_na(lag(cummean(x)), 0))
Producing the following:
# A tibble: 6 × 3
# Groups: Team [2]
Team x avg_x
<chr> <dbl> <dbl>
1 A 3 0
2 B 5 0
3 A 2 3
4 A 3 2.5
5 B 1 5
6 B 6 3
As required.
Edit 1:
As @Ritchie Sacramento pointed out, the following is cleaner and clearer:
df %>% group_by(Team) %>% mutate(avg_x = lag(cummean(x), default = 0))
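To see why lag(cummean(x)) gives the average of the previous rows only, here is a self-contained sketch that rebuilds the example data (the data frame below is reconstructed from the table in the question, and res is an illustrative name). Within each team, cummean(x) is the running mean up to and including the current row, and lag() shifts it down by one row so that the current row is excluded.

```r
library(dplyr)

df <- data.frame(Team = c("A", "B", "A", "A", "B", "B"),
                 x    = c(3, 5, 2, 3, 1, 6))

res <- df %>%
  group_by(Team) %>%
  mutate(
    running_mean = cummean(x),                  # includes the current row
    avg_x = lag(running_mean, default = 0)      # previous rows only; 0 when there are none
  ) %>%
  ungroup()

res
```

If a team's first row should be marked as "no history" rather than as an average of 0, dropping default = 0 leaves NA there instead.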
I have this data frame:
df=data.frame(id=c(1,1,2,2,2,5,NA),var=c("a","a","b","b","b","e","f"),value=c(1,1,0,1,0,0,1),cs=c(2,2,3,3,3,3,NA))
I want to calculate the sum of value for each group (id, var) and then the cumulative sum, but I would like the cumulative sum to be displayed on each row of the data, i.e., I don't want a summarized view of the data. The cs column above shows what my output should look like. This is what I have tried so far:
df%>%arrange(id,var)%>%group_by(id,var)%>%mutate(cs=cumsum(value))
Any suggestions?
Here is an approach that I think meets your expectations.
I would group by id and calculate the sum of value for each id via summarise.
You can then add your cumulative sum column with mutate. Based on your comments, I included an ifelse so that if id is NA, the row gets NA instead of a cumulative sum.
Finally, to combine the cumulative sums with your original dataset, you need to join the two tables.
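library(tidyverse)
df %>%
  arrange(id) %>%
  group_by(id) %>%
  summarise(sum = sum(value)) %>%
  mutate(cs = ifelse(is.na(id), NA, cumsum(sum))) %>%
  # df already contains the expected cs column, so drop it and join on id only
  left_join(select(df, -cs), by = "id")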
Output
# A tibble: 7 x 5
id sum cs var value
<dbl> <dbl> <dbl> <fct> <dbl>
1 1 2 2 a 1
2 1 2 2 a 1
3 2 1 3 b 0
4 2 1 3 b 1
5 2 1 3 b 0
6 5 0 3 e 0
7 NA 1 NA f 1
Calculate cumulative sum over all values, even if id is NA, then alter final cs to NA if id is NA
df %>%
arrange(id, var) %>%
mutate(cs = cumsum(value)) %>%
group_by(id, var) %>%
mutate(cs = max(ifelse(!is.na(id), cs, NA))) %>%
ungroup()
OR, Exclude rows where id is NA when calculating cumulative sum
df %>%
arrange(id, var) %>%
mutate(cs = cumsum(ifelse(!is.na(id), value, 0))) %>%
group_by(id, var) %>%
mutate(cs = max(ifelse(!is.na(id), cs, NA))) %>%
ungroup()
For your data, both return the same result (the row with a missing id gets NA):
# A tibble: 7 x 4
# id var value cs
# <dbl> <fct> <dbl> <dbl>
# 1 1 a 1 2
# 2 1 a 1 2
# 3 2 b 0 3
# 4 2 b 1 3
# 5 2 b 0 3
# 6 5 e 0 3
# 7 NA f 1 NA
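A variant of the same idea takes the last running total in each group rather than the maximum. This is a sketch, not the answerer's code: it assumes the rows are sorted by (id, var) first, and res is an illustrative name. Using last() instead of max() gives the same result here and would still be correct if value could be negative.

```r
library(dplyr)

df <- data.frame(id    = c(1, 1, 2, 2, 2, 5, NA),
                 var   = c("a", "a", "b", "b", "b", "e", "f"),
                 value = c(1, 1, 0, 1, 0, 0, 1))

res <- df %>%
  arrange(id, var) %>%                 # NA ids sort to the end
  mutate(cs = cumsum(value)) %>%       # running total over all rows
  group_by(id, var) %>%
  # Every row of a group gets the group's final running total; NA ids get NA.
  mutate(cs = if_else(is.na(id), NA_real_, last(cs))) %>%
  ungroup()

res
```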
I have this df
df <- read.table(text="
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header=TRUE)
What I would like to do is calculate the sum of gas and tickets (and another 50+ columns in my real df) for each month. Usually I would do something like
result <-
df %>%
group_by(month) %>%
summarise(
gas = sum(gas),
tickets = sum(tickets)
) %>%
ungroup()
But since I have a lot of columns in my data frame, I don't want to repeat myself by writing a sum for each column. I'm wondering whether there is a more elegant way, a function or something similar, that sums every column except id and month, grouped by month.
You can use summarise_at() to ignore id and sum the rest:
df %>%
  group_by(month) %>%
  # pass the function itself; ~sum would return the function instead of calling it
  summarise_at(vars(-id), list(sum = sum))
# A tibble: 3 x 3
month gas_sum tickets_sum
<int> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
You can use aggregate as markus recommends in the comments. If you want to stick to the tidyverse you could try something like this:
df %>%
select(-id) %>%
group_by(month) %>%
summarise_if(is.numeric, sum)
#### OUTPUT ####
# A tibble: 3 x 3
month gas tickets
<fct> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
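In dplyr 1.0 and later, the scoped verbs summarise_at() and summarise_if() are superseded by across(). A minimal sketch of the same summary, assuming dplyr >= 1.0 (result is an illustrative name):

```r
library(dplyr)

df <- read.table(text = "
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header = TRUE)

result <- df %>%
  group_by(month) %>%
  # across() never touches the grouping column, so only id has to be excluded.
  summarise(across(-id, sum), .groups = "drop")

result
```

Any further numeric columns in the real data are picked up automatically, because the selection is "everything except id" rather than a list of names.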
I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual), but contain different data. I want to use the identifier to move the rows that share an identifier into additional columns of a single row. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R and have tried the reshape (cast) package, but I can't really follow it, and I am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
1 3 1 5 1 6 1 7 1
2 1 2 12 2 5 2
3 3 1
UPDATE 2:
So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast (x, ID ~ time, value.var = "value")
and here is the output using that code:
ID A B C D E F G H (for clarity i have labelled these)
1 3 5 6 7 1 1 1 1
2 1 12 5 2 2 2
3 3 1
So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example column D has the value '7' for person 1 (age4), but '2' for person 2 (Sex). I can see that my code is not instructing where the numerical values should be cast to, but I do not know how to code that part. Any ideas?
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Age2 Age3 Age4 Sex1 Sex2 Sex3 Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 5 6 7 1 1 1 1
#> 2 2 1 12 5 2 2 2
#> 3 3 3 1
If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I recently came across a similar issue in my data and wanted to provide an update using the tidyr 1.0 functions, as gather and spread have been retired. At the time of writing, the new pivot_longer and pivot_wider are much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr, so I hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA
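pivot_wider() can also spread several value columns at once, which avoids the pivot_longer() and unite() steps. This is a sketch assuming tidyr >= 1.0 and dplyr >= 1.0; wide is an illustrative name, and the columns come out grouped by variable (all Age columns, then all Sex columns) rather than interleaved.

```r
library(dplyr)
library(tidyr)

x <- tribble(
  ~ID, ~Age, ~Sex,
  1,   3,    1,
  1,   5,    1,
  1,   6,    1,
  1,   7,    1,
  2,   1,    2,
  2,   12,   2,
  2,   5,    2,
  3,   3,    1
)

wide <- x %>%
  group_by(ID) %>%
  mutate(grp = row_number()) %>%   # 1, 2, 3, ... within each ID
  ungroup() %>%
  # One output column per (value column, grp) pair, e.g. Age1, Sex3.
  pivot_wider(names_from = grp, values_from = c(Age, Sex), names_sep = "")

wide
```

Missing combinations are filled with NA, so the columns stay numeric.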