I want to summarize relocations (between cities) based on a unique ID number. A sample dataframe with two unique IDs:
year ID city adress
1 2013 1 B adress_1
2 2014 1 B adress_1
3 2015 1 A adress_2
4 2016 1 A adress_2
5 2013 2 B adress_3
6 2014 2 B adress_3
7 2015 2 C adress_4
8 2016 2 C adress_4
I have provided sample code below. The summaries are correct except for one thing: if, for example, a relocation is found from city B to city A, I want the output to report a relocation from city B to city A (and the number of times it is seen in the dataframe, here 1). However, because group_by()/summarize() store the output in alphabetical order, I get the following output:
tmp <- df %>% group_by(ID, city, adress) %>% summarize(numberofyears = n())
tmp <- tmp %>%
  group_by(ID) %>%
  #filter(n() > 1) %>%
  mutate(from = city[1], from_adres = adress[1], from_years = numberofyears[1],
         to = city[2], to_adres = adress[2], to_years = numberofyears[2]) %>%
  distinct(ID, .keep_all = TRUE) %>%
  select(-c(2:3))
# A tibble: 2 x 8
# Groups: ID [2]
ID numberofyears from from_adres from_years to to_adres to_years
<dbl> <int> <fct> <fct> <int> <fct> <fct> <int>
1 1 2 A adress_2 2 B adress_1 2
2 2 2 B adress_3 2 C adress_4 2
This is wrong, because we know that adress_1 precedes adress_2. When summarizing a relocation from city B to city C, I get the right results.
It is a very small detail, but an important one as I tried to demonstrate. Any suggestions would be very much appreciated!
Similar to @jyjek's answer, but this allows for more than one move per ID.
library(tidyverse)
df <- data.frame(year = rep(2013:2016, 2),
                 ID = rep(1:2, each = 4),
                 city = c("B", "B", "A", "A", "B", "B", "C", "C"),
                 address = rep(1:4, each = 2),
                 stringsAsFactors = FALSE)
df %>%
  group_by(ID, city, address) %>%
  # note the first and last year at each address
  summarise(startyear = min(year),
            endyear = max(year)) %>%
  # sort by ID and year
  arrange(ID, startyear) %>%
  group_by(ID) %>%
  # grab the next address for each ID
  mutate(to = lead(city),
         to_address = lead(address),
         to_years = lead(endyear) - lead(startyear) + 1,
         from_years = endyear - startyear + 1) %>%
  # exclude the last row of each ID, since there is no next address being moved to
  filter(!is.na(to)) %>%
  select(ID, from = city, from_address = address, from_years, to, to_address, to_years)
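To see the multi-move case in action, here is a minimal sketch with a hypothetical third ID that moves twice (B to A in 2014, then A to C in 2016); the same pipeline then returns one row per move:
# Hypothetical extra ID (3) that moves twice: B -> A -> C
df3 <- data.frame(year = 2013:2016,
                  ID = 3,
                  city = c("B", "A", "A", "C"),
                  address = c(5, 6, 6, 7),
                  stringsAsFactors = FALSE)
bind_rows(df, df3) %>%
  group_by(ID, city, address) %>%
  summarise(startyear = min(year),
            endyear = max(year)) %>%
  arrange(ID, startyear) %>%
  group_by(ID) %>%
  mutate(to = lead(city),
         to_address = lead(address),
         to_years = lead(endyear) - lead(startyear) + 1,
         from_years = endyear - startyear + 1) %>%
  filter(!is.na(to)) %>%
  select(ID, from = city, from_address = address, from_years, to, to_address, to_years)
# ID 3 now contributes two rows: B -> A and A -> C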
Like this?
library(tidyverse)
df<-read.table(text=" year ID city adress
1 2013 1 B adress_1
2 2014 1 B adress_1
3 2015 1 A adress_2
4 2016 1 A adress_2
5 2013 2 B adress_3
6 2014 2 B adress_3
7 2015 2 C adress_4
8 2016 2 C adress_4",header=T)
df %>%
  group_by(ID, city, adress) %>%
  summarize(numberofyears = n()) %>%
  mutate(id = parse_number(adress)) %>%
  group_by(ID, id) %>%
  arrange(id) %>%
  ungroup() %>%
  select(-id) %>%
  group_by(ID) %>%
  mutate(from = first(city), from_adres = first(adress),
         from_years = first(numberofyears), to = last(city),
         to_adres = last(adress), to_years = last(numberofyears)) %>%
  distinct(ID, .keep_all = TRUE) %>%
  select(-c(2:3))
# A tibble: 2 x 8
# Groups: ID [2]
ID numberofyears from from_adres from_years to to_adres to_years
<int> <int> <fct> <fct> <int> <fct> <fct> <int>
1 1 2 B adress_1 2 A adress_2 2
2 2 2 B adress_3 2 C adress_4 2
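This relies on parse_number(adress) encoding the chronological order. If the address labels cannot be trusted for that, a sketch along the same lines carries the first year through summarize() and sorts on that instead:
df %>%
  group_by(ID, city, adress) %>%
  summarize(numberofyears = n(), firstyear = min(year)) %>%
  arrange(ID, firstyear) %>%          # chronological order within each ID
  group_by(ID) %>%
  mutate(from = first(city), from_adres = first(adress),
         from_years = first(numberofyears),
         to = last(city), to_adres = last(adress),
         to_years = last(numberofyears)) %>%
  distinct(ID, .keep_all = TRUE) %>%
  select(-city, -adress, -firstyear)
As in the original, first()/last() only capture the two endpoints, so IDs with more than one move would still need the lead()-based approach above.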
I have a dataframe named episodes where each line is a different episode from different subjects (num_adm). Each episode has a "start" and an "end" time. A subject can have one or multiple episodes (so one to multiple lines).
The format of the table is like this:
num_adm start end
I would like to obtain a table where each subject (num_adm) has only one line, with new columns for the start and end of each episode (start1, end1, start2, end2, start3, end3):
num_adm start1 end1 start2 end2
I read about pivot_wider() but am not sure how it applies here.
Any ideas?
Thank you for your help.
Since you haven't shared any example data, let's create a small one:
df <- read.table(text = "num_adm start end
1 a b
2 c d
2 e f
3 g h
3 i j
3 k l
", header = T)
Now, to get the desired result in the tidyverse, do it like this:
library(tidyverse)
df %>% group_by(num_adm) %>%
mutate(d = row_number()) %>%
pivot_longer(cols = c(start, end)) %>%
mutate(name = paste0(name, "_" ,d)) %>%
select(-d) %>%
pivot_wider(id_cols = num_adm, names_from = name, values_from = "value")
# A tibble: 3 x 7
# Groups: num_adm [3]
num_adm start_1 end_1 start_2 end_2 start_3 end_3
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a b NA NA NA NA
2 2 c d e f NA NA
3 3 g h i j k l
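For reference, an alternative sketch (assuming tidyr >= 1.2, which added the names_vary argument) pivots both value columns in one step, skipping the intermediate long format:
library(tidyverse)
df %>%
  group_by(num_adm) %>%
  mutate(d = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = num_adm,
              names_from = d,
              values_from = c(start, end),
              names_vary = "slowest")   # interleave start_1, end_1, start_2, end_2, ...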
Suppose you have a table of data:
df<-tibble(person = c("Alice", "Bob", "Mary"),
colour = c("Red", "Green", "Blue"),
city = c("London", "Paris", "New York"))
# A tibble: 3 x 3
person colour city
<chr> <chr> <chr>
1 Alice Red London
2 Bob Green Paris
3 Mary Blue New York
And a second table which contains the field names and the maximum string length of each field:
len<-tibble(field_name = c("person", "colour", "city"),
field_length = c(12, 4, 6))
# A tibble: 3 x 2
field_name field_length
<chr> <dbl>
1 person 12
2 colour 4
3 city 6
How can I check, for each field in len, whether a string in df is less than or equal to len$field_length, returning rows which fail the test?
As an example:
Row 1 in df would pass because:
'Alice' is <= 12 characters long,
'Red' is <= 4 characters long, and
'London' is <= 6 characters long.
However,
Row 2 would fail because:
'Green' > 4 characters long and
Row 3 would fail because:
'New York' > 6 characters long.
Thus the returned data frame should contain only rows 2 and 3 of the original df.
A dplyr solution with c_across():
library(dplyr)
df %>%
rowwise() %>%
filter(any(nchar(c_across(everything())) > len$field_length)) %>%
ungroup()
# # A tibble: 2 x 3
# person colour city
# <chr> <chr> <chr>
# 1 Bob Green Paris
# 2 Mary Blue New York
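One caveat worth noting: c_across(everything()) walks the columns of df in their current order, so this assumes len$field_length is ordered the same way. A small sketch that matches the limits to df's column names first removes that assumption:
library(dplyr)
# Reorder the limits so they line up with df's columns by name
limits <- len$field_length[match(names(df), len$field_name)]
df %>%
  rowwise() %>%
  filter(any(nchar(c_across(everything())) > limits)) %>%
  ungroup()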
Using base R with mapply():
df[rowSums(mapply(function(x, y) nchar(x) > y, df, len$field_length)) > 0, ]
# A tibble: 2 x 3
# person colour city
# <chr> <chr> <chr>
#1 Bob Green Paris
#2 Mary Blue New York
If column names in df are not in the same order as len$field_name use df[len$field_name] in mapply.
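For example, a version of the same call with the columns matched explicitly:
df[rowSums(mapply(function(x, y) nchar(x) > y,
                  df[len$field_name], len$field_length)) > 0, ]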
In the tidyverse we can get the data in long format, join it with len by column name, select the rows which fail, and get the data back in wide format again.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
left_join(len, by = c('name' = 'field_name')) %>%
group_by(row) %>%
filter(any(nchar(value) > field_length)) %>%
dplyr::select(-field_length) %>%
pivot_wider()
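Note that the helper row column (and the grouping on it) survives the final pivot_wider(); if you want the result in df's original shape, the same pipeline with two extra steps at the end would be:
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  left_join(len, by = c('name' = 'field_name')) %>%
  group_by(row) %>%
  filter(any(nchar(value) > field_length)) %>%
  dplyr::select(-field_length) %>%
  pivot_wider() %>%
  ungroup() %>%
  select(-row)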
It's easier to solve your problem in terms of two matrices. First, the length of each of your entries:
nchar(as.matrix(df))
person colour city
[1,] 5 3 6
[2,] 3 5 5
[3,] 4 4 8
And a corresponding matrix of allowed length:
allowed = replicate(nrow(df),len$field_length[match(colnames(df),len$field_name)])
allowed
[,1] [,2] [,3]
[1,] 12 12 12
[2,] 4 4 4
[3,] 6 6 6
Then do an element-wise comparison and keep only the rows where at least one entry exceeds its allowed length (here via rowMeans() > 0; rowSums() > 0 works just as well):
df[rowMeans(nchar(as.matrix(df)) > allowed)>0,]
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York
If your two data.frames are already in the same order, as in your example, you can do this (thanks to @zx8754 for pointing it out):
df[rowMeans(nchar(as.matrix(df)) > len$field_length)>0,]
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York
Pivot df into the same format as len and join the two. After this, it is trivial to compare each string to the field_length.
library(tidyverse)
test_result_df <- df %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = 'field_name') %>%
left_join(len, by = 'field_name') %>%
mutate(test_passed = str_length(value) <= field_length) %>%
group_by(id) %>%
summarise(all_passed = all(test_passed))
df[!test_result_df$all_passed,]
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York
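A small follow-up, in case you prefer to stay in dplyr for the final subsetting step as well (a sketch using a semi_join() on the failing row ids):
library(dplyr)
df %>%
  mutate(id = row_number()) %>%
  semi_join(filter(test_result_df, !all_passed), by = "id") %>%
  select(-id)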
I am trying to compare a new algorithm's results versus an old one's. I need to know approximately how many days of difference the new algorithm has in predicting a "D" versus the old one.
I can't seem to figure out how to point to the first row (day) that contains a 'D' (i.e. min(day) where new == 'D') without filtering (I was able to grab the row using a double filter because of the grouping, but could not use it afterwards). I want to use it in summarise with dplyr, which is why I have included pseudocode similar to where I am currently at with my own dataset.
In my data there are groups of varying length (number of days) for each ID, which is why I made groups of different lengths in the example.
library(dplyr)
id = c(123,123,123,123,123,456,456,456,456)
old = c('S','S','S','S','D','S','S','D','D')
new = c('S','S','D','D','D','S','D','D','D')
day = c(1,2,3,4,5,1,2,3,4)
data = data.frame(id,old,new,day)
data
#> id old new day
#> 1 123 S S 1
#> 2 123 S S 2
#> 3 123 S D 3
#> 4 123 S D 4
#> 5 123 D D 5
#> 6 456 S S 1
#> 7 456 S D 2
#> 8 456 D D 3
#> 9 456 D D 4
d <- data %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  add_tally(new == 'S', name = 'S') %>%
  add_tally(new == 'D', name = 'D') %>%
  group_by(id, S, D)
# summarise(diff = (day of 1st old D) - (day of 1st new D) )
#Expected Outcome
ido = c(123,456)
S = c(2,1)
D = c(3,3)
diff = c(2,1)
outcome = data.frame(ido,S,D,diff)
outcome
#> ido S D diff
#> 1 123 2 3 2
#> 2 456 1 3 1
We can group_by id, count the occurrences of 'S' and 'D', and compute the difference between the first occurrence of 'D' in old and in new.
library(dplyr)
data %>%
  group_by(id) %>%
  summarise(S = sum(new == 'S'),
            D = sum(new == 'D'),
            diff = which.max(old == 'D') - which.max(new == 'D'))
# OR, if there could be an id without any 'D', use
# diff = which(old == 'D')[1] - which(new == 'D')[1]
# A tibble: 2 x 4
# id S D diff
# <dbl> <int> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
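The reason for the alternative in the comment: on a group that never reaches 'D', which.max() still returns 1 (silently pointing at the first row), while which(...)[1] returns NA, so a missing 'D' propagates as NA in diff. A quick illustration:
x <- c("S", "S", "S")   # a group with no 'D'
which.max(x == "D")     # 1  -- looks like a valid position, but is misleading
which(x == "D")[1]      # NA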
We can use pivot_wider after summarising to get the frequency count, after creating a column that takes the difference between the 'day' values based on the first occurrence of 'D' in the 'old' and 'new' columns.
library(dplyr)
library(tidyr)
data %>%
group_by(id) %>%
group_by(diff = day[match("D", old)] - day[match("D", new)],
new, add = TRUE) %>%
summarise(n = n()) %>%
ungroup %>%
pivot_wider(names_from = new, values_from = n)
# A tibble: 2 x 4
# id diff D S
# <dbl> <dbl> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
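A side note on the add argument of group_by(): it was deprecated in dplyr 1.0 in favour of .add, so on newer versions the same pipeline would read:
library(dplyr)
library(tidyr)
data %>%
  group_by(id) %>%
  group_by(diff = day[match("D", old)] - day[match("D", new)],
           new, .add = TRUE) %>%   # `.add` replaces the deprecated `add`
  summarise(n = n()) %>%
  ungroup() %>%
  pivot_wider(names_from = new, values_from = n)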
data <-
STUDY ID BASE CYCLE1 DIED PROG
1 1 100 30 No Yes
1 2 NA 20 Yes No
1 3 16 NA Yes Yes
1 4 15 10 Yes Yes
I wanted to make a summary of the following:
how many subjects have both baseline and CYCLE1 value?
Of those in 1, how many had DIED?
Of those in 1, how many had DIED or PROG?
Answers:
2 subjects (50% of subjects) ==> subjects 1 & 4
1 subject (25%) ==> subject 4
2 subjects (50%) ==> subjects 1 & 4
A summary table by STUDY for this would be great (showing the number and percentage).
I am using RStudio.
If the summary is based on the first filter:
library(dplyr)
library(stringr)
data %>%
group_by(STUDY) %>%
filter(!is.na(BASE) & !is.na(CYCLE1)) %>%
summarise(ID = str_c(ID, collapse=", "),
n1 = n(),
n2 = sum(DIED== "Yes"),
n3 = sum(DIED == "Yes"|PROG == "Yes"))
# A tibble: 1 x 5
# STUDY ID n1 n2 n3
# <int> <chr> <int> <int> <int>
#1 1 1, 4 2 1 2
If we need the percentages as well:
out <- data %>%
group_by(STUDY) %>%
mutate(i1 = !is.na(BASE) & !is.na(CYCLE1),
perc1 = 100 * mean(i1),
n1 = sum(i1),
i2 = DIED == "Yes" & i1,
perc2 = 100 * mean(i2),
n2 = sum(i2),
i3 = (DIED == "Yes"|PROG == "Yes") & i1,
perc3 = 100 * mean(i3),
n3 = sum(i3)) %>%
filter(i1) %>%
select(STUDY, ID, matches("perc"), matches("n")) %>%
mutate(ID = toString(ID)) %>%
slice(1)
# A tibble: 1 x 8
# Groups: STUDY [1]
# STUDY ID perc1 perc2 perc3 n1 n2 n3
# <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#1 1 1, 4 50 25 50 2 1 2
It can be further modified to format the output:
library(tidyr) # 0.8.3.9000
out %>%
pivot_longer(cols = perc1:n3, names_to = c( "perc", "n"),
names_sep = "(?<=[a-z])(?=[0-9])") %>%
group_by(STUDY, ID, n) %>%
summarise(value = sprintf("%d (%d%%)", last(value), first(value))) %>%
select(-n)
I want to find the minimum value associated with an object out of a dataframe. The dataframe contains two columns representing all combinations of the objects and a value-column for each combination. It looks like this:
id_A id_B dist
206 208 2385.5096
207 208 467.8890
207 209 576.4631
...
208 209 1081.539
208 210 8214.439
...
I tried the following recommended dplyr functions:
df %>%
group_by(id_A) %>%
slice(which.min(dist))
But it does not create the desired output:
id_A id_B dist
...
207 208 467.8890
208 209 1081.5393
...
Note that for ID 208 the combination with ID 207 has the lowest value, but it is not associated with ID 208 when 208 is in the grouped-by column.
I wrote a function that does this correctly, but since I have many entries it is far too slow. It's a loop that subsets the data to all entries containing a specific ID, finds the minimum within that subset, and associates that value with that ID.
Do you have an idea how to make this fast, e.g. using dplyr?
The issue boils down to needing a long (rather than wide) data format. First, here are some reproducible data (using the pipe from dplyr):
df <-
LETTERS[1:4] %>%
combn(2) %>%
t %>%
data.frame() %>%
mutate(val = 1:n()) %>%
setNames( c("id_A", "id_B", "dist") )
gives:
id_A id_B dist
1 A B 1
2 A C 2
3 A D 3
4 B C 4
5 B D 5
6 C D 6
What we want is a pair of columns matching each category with the distance from its row. For this, I am using gather from tidyr. It creates new columns telling us which column the data came from and what value it held. Here, we are telling it to pull from columns id_A and id_B to give us the category for each ID entry (it then duplicates the dist column as necessary):
df %>%
gather(whichID, Category, id_A, id_B)
Gives
dist whichID Category
1 1 id_A A
2 2 id_A A
3 3 id_A A
4 4 id_A B
5 5 id_A B
6 6 id_A C
7 1 id_B B
8 2 id_B C
9 3 id_B D
10 4 id_B C
11 5 id_B D
12 6 id_B D
We can then pass that data.frame to group_by and then use summarise to give us whatever information we wanted. I know that you didn't ask for the max, but I am including it just to show the general syntax you can use to get whatever type of result you want:
df %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, maxDist = max(dist))
Returns:
Category minDist maxDist
<chr> <int> <int>
1 A 1 3
2 B 1 5
3 C 2 6
4 D 3 6
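Since gather() is now superseded, a pivot_longer() sketch of the same reshape-and-summarise (assuming tidyr >= 1.0) would be:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(c(id_A, id_B), names_to = "whichID", values_to = "Category") %>%
  group_by(Category) %>%
  summarise(minDist = min(dist),
            maxDist = max(dist))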
I just looked at the question and realized that you wanted to also display which comparison had the minimum value. Here is an approach that does that by tracking an index of the match (so that it is replicated when gathering) and then pulls the correct row from the original df and pastes together the two comparison values:
df %>%
mutate(whichComparison = 1:n()) %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, whichMin = whichComparison[which.min(dist)]
, maxDist = max(dist)
, whichMax = whichComparison[which.max(dist)]) %>%
mutate(
minComp = sapply(whichMin, function(x){
paste(df[x, "id_A"], df[x, "id_B"], sep = " vs " )})
, maxComp = sapply(whichMax, function(x){
paste(df[x, "id_A"], df[x, "id_B"], sep = " vs " )})
)
returns
Category minDist whichMin maxDist whichMax minComp maxComp
<chr> <int> <int> <int> <int> <chr> <chr>
1 A 1 1 3 3 A vs B A vs D
2 B 1 1 5 5 A vs B B vs D
3 C 2 2 6 6 A vs C C vs D
4 D 3 3 6 6 A vs D C vs D
If you really want a single column giving which comparison gave the min value (and the max, in my output), you can instead use the index to pull both the id_A and id_B from the original df, knock out the one that matches the Category of interest, then use use_first_valid_of from the package janitor to grab just the one you are interested in. Because this generated a large number of intermediate columns, I am using select to clean things back up:
df %>%
mutate(whichComparison = 1:n()) %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, maxDist = max(dist)
, whichMin = whichComparison[which.min(dist)]
, whichMax = whichComparison[which.max(dist)]) %>%
mutate(
minA = df$id_A[whichMin]
, minB = df$id_B[whichMin]
, maxA = df$id_A[whichMax]
, maxB = df$id_B[whichMax]
) %>%
mutate_each(funs(ifelse(. == Category, NA, as.character(.)) )
, minA:maxB) %>%
mutate(minComp = use_first_valid_of(minA, minB)
, maxComp = use_first_valid_of(maxA, maxB)) %>%
select(-(whichMin:maxB))
returns:
Category minDist maxDist minComp maxComp
<chr> <int> <int> <chr> <chr>
1 A 1 3 B D
2 B 1 5 A D
3 C 2 6 A D
4 D 3 6 A C
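If use_first_valid_of() is not available in your janitor version (it has been dropped from recent releases), dplyr::coalesce() covers that step, since it returns the first non-NA value element-wise:
library(dplyr)
coalesce(c(NA, "A", NA), c("B", "C", NA))
# "B" "A" NA
# so the final mutate() could instead read:
# mutate(minComp = coalesce(minA, minB),
#        maxComp = coalesce(maxA, maxB))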
An alternative approach is to first convert the distance pairs to a matrix. Here, I first duplicate the comparisons in the reverse order to ensure that the matrix is complete (using tidyr to spread):
bind_rows(
df
, rename(df, id_A = id_B, id_B = id_A)
) %>%
spread(id_B, dist)
returns:
id_A A B C D
1 A NA 1 2 3
2 B 1 NA 4 5
3 C 2 4 NA 6
4 D 3 5 6 NA
Then, we just apply across rows much like we would if we were working from a distance matrix (which may be where your data actually started):
bind_rows(
df
, rename(df, id_A = id_B, id_B = id_A)
) %>%
spread(id_B, dist) %>%
mutate(
minDist = apply(as.matrix(.[, -1]), 1, min, na.rm = TRUE)
, minComp = names(.)[apply(as.matrix(.[, -1]), 1, which.min) + 1]
, maxDist = apply(as.matrix(.[, -1]), 1, max, na.rm = TRUE)
, maxComp = names(.)[apply(as.matrix(.[, -1]), 1, which.max) + 1]
) %>%
select(Category = `id_A`
, minDist:maxComp)
returns:
Category minDist minComp maxDist maxComp
1 A 1 B 3 D
2 B 1 A 5 D
3 C 2 A 6 D
4 D 3 A 6 C