I have a dataframe named episodes where each row is an episode from a subject (num_adm). Each episode has a "start" and an "end" time. A subject can have one or multiple episodes (so one to multiple rows).
The format of the table is like this:
num_adm start end
I would like to obtain a table where each subject (num_adm) has only one line, with new columns for the start and end of each episode (start1, end1, start2, end2, start3, end3):
num_adm start1 end1 start2 end2
I read about pivot_wider but am not sure how it applies here.
Any ideas?
Thank you for your help.
Since you haven't shared any example data, let's create a small set:
df <- read.table(text = "num_adm start end
1 a b
2 c d
2 e f
3 g h
3 i j
3 k l
", header = TRUE)
Now, to get the desired result in the tidyverse, do it like this:
library(tidyverse)
df %>%
  group_by(num_adm) %>%
  mutate(d = row_number()) %>%              # episode index within each subject
  pivot_longer(cols = c(start, end)) %>%    # long format: one row per time point
  mutate(name = paste0(name, "_", d)) %>%   # e.g. start_1, end_1, start_2, ...
  select(-d) %>%
  pivot_wider(id_cols = num_adm, names_from = name, values_from = "value")
# A tibble: 3 x 7
# Groups: num_adm [3]
num_adm start_1 end_1 start_2 end_2 start_3 end_3
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a b NA NA NA NA
2 2 c d e f NA NA
3 3 g h i j k l
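A side note: recent tidyr can also do this in a single pivot_wider() call, since values_from accepts several columns. A minimal sketch, assuming tidyr >= 1.2 (names_vary = "slowest" interleaves the columns as start_1, end_1, start_2, ...):

df %>%
  group_by(num_adm) %>%
  mutate(d = row_number()) %>%   # episode index within each subject
  ungroup() %>%
  pivot_wider(id_cols = num_adm, names_from = d,
              values_from = c(start, end), names_vary = "slowest")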
How to use dplyr::across to check unique values in multiple columns by group?
This code will still treat each column independently. I would like to have the number of unique values across variables DX1:DX4 together.
Here id = 1 would have 5 unique values: A, B, C, D, F; id = 2 would also have 5: A, B, C, D, E.
library(dplyr)

x <- df %>%
  group_by(id) %>%
  summarize(across(DX1:DX4, n_distinct, na.rm = TRUE))
df <- read.table(header = TRUE, text = "
id DX1 DX2 DX3 DX4
1 A B A A
1 A A A C
1 D A A A
1 A A A F
1 A A A A
2 A A A A
2 A C A A
2 A A A D
2 A E D B
", stringsAsFactors = FALSE)
After grouping by 'id', use across to select the columns, flatten them with unlist/flatten_chr, and get the number of distinct elements with n_distinct:
library(dplyr)
library(purrr)
df %>%
  group_by(id) %>%
  summarise(n = n_distinct(flatten_chr(across(DX1:DX4)), na.rm = TRUE),
            .groups = 'drop')
-output
# A tibble: 2 × 2
id n
<int> <int>
1 1 5
2 2 5
I don't think across is the "tidyverse" way to go. I suggest cur_data() instead.
df %>%
  group_by(id) %>%
  summarise(n = n_distinct(unlist(select(cur_data(), DX1:DX4))))
# A tibble: 2 × 2
#      id     n
#   <int> <int>
# 1     1     5
# 2     2     5
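A note for current dplyr: cur_data() is superseded as of dplyr 1.1.0, and pick() is the replacement for grabbing columns inside a verb. A minimal sketch of the same idea, assuming dplyr >= 1.1.0:

df %>%
  group_by(id) %>%
  summarise(n = n_distinct(unlist(pick(DX1:DX4))))   # pick() supersedes cur_data()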
Base R:
df <- read.table(header = TRUE, text = "id DX1 DX2 DX3 DX4
1 A B A A
1 A A A C
1 D A A A
1 A A A F
1 A A A A
2 A A A A
2 A C A A
2 A A A D
2 A E D B")

# split the DX columns by id, then count unique values across all of them
sapply(split(df[, -1], df[, 1]), \(x) length(unique(unlist(x))))
# 1 2
# 5 5
Background
Here's a dataset, d:
d <- data.frame(ID = c("a", "a", "b", "b"),
                product_code = c("B78", "X31", "C12", "C12"),
                stringsAsFactors = FALSE)
It looks like this:
  ID product_code
1  a          B78
2  a          X31
3  b          C12
4  b          C12
The Problem and Desired Output
I'm trying to make an indicator column multiple_products that's marked 1 for IDs that have more than one unique product_code and 0 for those that don't. Here's what I'm looking for:
  ID product_code multiple_products
1  a          B78                 1
2  a          X31                 1
3  b          C12                 0
4  b          C12                 0
My attempts haven't worked yet, though.
What I've Tried
Here's my current code:
d <- d %>%
  group_by(ID) %>%
  mutate(multiple_products = if_else(length(unique(d$product_code)) > 1, 1, 0)) %>%
  ungroup()
And this is the result (every row gets 1):
  ID product_code multiple_products
1  a          B78                 1
2  a          X31                 1
3  b          C12                 1
4  b          C12                 1
Any thoughts?
The d$ should be taken out, as d$product_code extracts the whole column and ignores the group attributes. Also, there is n_distinct for counting unique values. In addition, there is no need for ifelse or if_else: logical values (TRUE/FALSE) are stored as 1/0 and can be coerced directly with either as.integer or unary +.
library(dplyr)
d %>%
  group_by(ID) %>%
  mutate(multiple_products = +(n_distinct(product_code) > 1)) %>%
  ungroup()
-output
# A tibble: 4 x 3
ID product_code multiple_products
<chr> <chr> <int>
1 a B78 1
2 a X31 1
3 b C12 0
4 b C12 0
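As a side note, here is the coercion trick in isolation (a tiny sketch):

+(c(TRUE, FALSE))            # 1 0 -- unary + coerces logical to integer
as.integer(c(TRUE, FALSE))   # 1 0 -- equivalent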
A solution with data.table:
library(data.table)
setDT(d)
d[, multiple_products := rleid(product_code), by = ID][
  , multiple_products := ifelse(max(multiple_products) > 1, 1, 0), by = ID]
d
Output:
ID product_code multiple_products
<chr> <chr> <int>
1 a B78 1
2 a X31 1
3 b C12 0
4 b C12 0
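A side note: data.table also has uniqueN, which counts distinct values directly and avoids the two-pass rleid construction. A minimal sketch of the same result:

library(data.table)
setDT(d)
d[, multiple_products := +(uniqueN(product_code) > 1), by = ID]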
A base R option using ave
transform(
  d,
  # var > 0 exactly when the group has more than one distinct code
  multiple_products = +(ave(match(product_code, unique(product_code)), ID, FUN = var) > 0)
)
gives
ID product_code multiple_products
1 a B78 1
2 a X31 1
3 b C12 0
4 b C12 0
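The var trick above works because the variance of the matched indices is positive exactly when a group contains more than one distinct code. If that feels indirect, counting distinct values explicitly is a possible alternative; a minimal sketch that gives the same 1/0 vector:

with(d, +(ave(as.integer(factor(product_code)), ID,
              FUN = function(v) length(unique(v))) > 1))
# [1] 1 1 0 0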
I am trying to compare a new algorithm's results versus an old one's. I need to know approximately how many days of difference the new algorithm has in predicting a "D" versus the old one.
I can't seem to figure out how to point to the first row (day) that contains a 'D' (min(day) where new == 'D') without filtering (I was able to grab the row using a double filter due to the grouping, but not use it). I want to use it in summarise with dplyr, which is why I have included pseudocode similar to where I currently am with my own dataset.
In my data there are groups of varying length (number of days) for each ID, which is why I made groups of different lengths in the example.
library(dplyr)
id = c(123,123,123,123,123,456,456,456,456)
old = c('S','S','S','S','D','S','S','D','D')
new = c('S','S','D','D','D','S','D','D','D')
day = c(1,2,3,4,5,1,2,3,4)
data = data.frame(id,old,new,day)
data
#> id old new day
#> 1 123 S S 1
#> 2 123 S S 2
#> 3 123 S D 3
#> 4 123 S D 4
#> 5 123 D D 5
#> 6 456 S S 1
#> 7 456 S D 2
#> 8 456 D D 3
#> 9 456 D D 4
d <- data %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  add_tally(new == 'S', name = 'S') %>%
  add_tally(new == 'D', name = 'D') %>%
  group_by(id, S, D)
# summarise(diff = (day of 1st old D) - (day of 1st new D) )
#Expected Outcome
ido = c(123,456)
S = c(2,1)
D = c(3,3)
diff = c(2,1)
outcome = data.frame(ido,S,D,diff)
outcome
#> ido S D diff
#> 1 123 2 3 2
#> 2 456 1 3 1
Created on 2019-12-26 by the reprex package (v0.3.0)
We can group_by id, count the occurrences of 'S' and 'D', and take the difference between the first occurrence of 'D' in old and in new.
library(dplyr)
data %>%
  group_by(id) %>%
  summarise(S = sum(new == 'S'),
            D = sum(new == 'D'),
            diff = which.max(old == 'D') - which.max(new == 'D'))

# OR, if there could be an id without any 'D', use
# diff = which(old == 'D')[1] - which(new == 'D')[1]
# A tibble: 2 x 4
# id S D diff
# <dbl> <int> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
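This relies on which.max() applied to a logical vector returning the index of the first TRUE. A tiny sketch, including the edge case that motivates the which(...)[1] alternative above:

which.max(c(FALSE, TRUE, TRUE))   # 2 -- index of the first TRUE
which.max(c(FALSE, FALSE))        # 1 -- no TRUE at all still returns 1,
                                  #      hence which(...)[1], which gives NA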
We can use pivot_wider after summarising to get the frequency count, after creating a column that takes the difference between the 'day' values at the first occurrence of 'D' in both the 'old' and 'new' columns:
library(dplyr)
library(tidyr)
data %>%
  group_by(id) %>%
  group_by(diff = day[match("D", old)] - day[match("D", new)],
           new, add = TRUE) %>%
  summarise(n = n()) %>%
  ungroup %>%
  pivot_wider(names_from = new, values_from = n)
# A tibble: 2 x 4
# id diff D S
# <dbl> <dbl> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
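For comparison, an equivalent data.table sketch of the first answer's logic (assuming the same data frame):

library(data.table)
setDT(data)[, .(S = sum(new == 'S'),
                D = sum(new == 'D'),
                diff = day[match("D", old)] - day[match("D", new)]),
            by = id]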
I want to find the minimum value associated with each object in a dataframe. The dataframe contains two columns representing all combinations of the objects and a value column for each combination. It looks like this:
id_A id_B dist
206 208 2385.5096
207 208 467.8890
207 209 576.4631
...
208 209 1081.539
208 210 8214.439
...
I tried the following recommended dplyr functions:
df %>%
  group_by(id_A) %>%
  slice(which.min(dist))
But it does not create the desired output:
id_A id_B dist
...
207 208 467.8890
208 209 1081.5393
...
Note that for id 208 the combination with id 207 has the lowest value, but it is not associated with id 208 (when 208 is in the grouped-by column).
I wrote a function that does this correctly, but since I have many entries it is way too slow. It's a loop that subsets the data to all entries containing a specific id, finds the minimum within that subset, and associates that value with the id.
Do you have an idea how to make this fast, e.g. using dplyr?
The issue boils down to needing a long (rather than wide) data format. First, here are some reproducible data (using the pipe from dplyr):
df <-
  LETTERS[1:4] %>%
  combn(2) %>%
  t %>%
  data.frame() %>%
  mutate(val = 1:n()) %>%
  setNames(c("id_A", "id_B", "dist"))
gives:
id_A id_B dist
1 A B 1
2 A C 2
3 A D 3
4 B C 4
5 B D 5
6 C D 6
What we want is a pair of columns matching each category with the distance from its row. For this, I am using gather from tidyr. It creates new columns telling us which column the data came from and what value that column held. Here, we are telling it to pull from columns id_A and id_B to give us the category for each ID entry (it then duplicates the dist column as necessary):
df %>%
  gather(whichID, Category, id_A, id_B)
Gives
dist whichID Category
1 1 id_A A
2 2 id_A A
3 3 id_A A
4 4 id_A B
5 5 id_A B
6 6 id_A C
7 1 id_B B
8 2 id_B C
9 3 id_B D
10 4 id_B C
11 5 id_B D
12 6 id_B D
We can then pass that data.frame to group_by and then use summarise to give us whatever information we want. I know that you didn't ask for the max, but I am including it just to show the general syntax you can use to get whatever type of result you want:
df %>%
  gather(whichID, Category, id_A, id_B) %>%
  group_by(Category) %>%
  summarise(minDist = min(dist)
            , maxDist = max(dist))
Returns:
Category minDist maxDist
<chr> <int> <int>
1 A 1 3
2 B 1 5
3 C 2 6
4 D 3 6
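A note for current tidyr: gather() is retired, and pivot_longer() is its modern replacement for the reshaping step above. A minimal sketch of the same summary:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(c(id_A, id_B), names_to = "whichID", values_to = "Category") %>%
  group_by(Category) %>%
  summarise(minDist = min(dist), maxDist = max(dist))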
I just looked at the question and realized that you wanted to also display which comparison had the minimum value. Here is an approach that does that by tracking an index of the match (so that it is replicated when gathering) and then pulls the correct row from the original df and pastes together the two comparison values:
df %>%
  mutate(whichComparison = 1:n()) %>%
  gather(whichID, Category, id_A, id_B) %>%
  group_by(Category) %>%
  summarise(minDist = min(dist)
            , whichMin = whichComparison[which.min(dist)]
            , maxDist = max(dist)
            , whichMax = whichComparison[which.max(dist)]) %>%
  mutate(
    minComp = sapply(whichMin, function(x){
      paste(df[x, "id_A"], df[x, "id_B"], sep = " vs ")})
    , maxComp = sapply(whichMax, function(x){
      paste(df[x, "id_A"], df[x, "id_B"], sep = " vs ")})
  )
returns
Category minDist whichMin maxDist whichMax minComp maxComp
<chr> <int> <int> <int> <int> <chr> <chr>
1 A 1 1 3 3 A vs B A vs D
2 B 1 1 5 5 A vs B B vs D
3 C 2 2 6 6 A vs C C vs D
4 D 3 3 6 6 A vs D C vs D
If you really want a single column giving which comparison gave the min value (and the max, in my output), you can instead use the index to pull both id_A and id_B from the original df, knock out the one that matches the Category of interest, then use use_first_valid_of from the janitor package to grab just the one you are interested in. Because this generates a large number of intermediate columns, I am using select to clean things back up:
df %>%
  mutate(whichComparison = 1:n()) %>%
  gather(whichID, Category, id_A, id_B) %>%
  group_by(Category) %>%
  summarise(minDist = min(dist)
            , maxDist = max(dist)
            , whichMin = whichComparison[which.min(dist)]
            , whichMax = whichComparison[which.max(dist)]) %>%
  mutate(
    minA = df$id_A[whichMin]
    , minB = df$id_B[whichMin]
    , maxA = df$id_A[whichMax]
    , maxB = df$id_B[whichMax]
  ) %>%
  mutate_each(funs(ifelse(. == Category, NA, as.character(.)))
              , minA:maxB) %>%
  mutate(minComp = use_first_valid_of(minA, minB)
         , maxComp = use_first_valid_of(maxA, maxB)) %>%
  select(-(whichMin:maxB))
returns:
Category minDist maxDist minComp maxComp
<chr> <int> <int> <chr> <chr>
1 A 1 3 B D
2 B 1 5 A D
3 C 2 6 A D
4 D 3 6 A C
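A caveat: use_first_valid_of has since been removed from janitor; in current code, dplyr::coalesce does the same job of taking the first non-NA value element-wise. A tiny sketch of the replacement:

library(dplyr)
coalesce(c(NA, "B", NA), c("A", "C", "D"))
# [1] "A" "B" "D"

So the last mutate above could use minComp = coalesce(minA, minB) and maxComp = coalesce(maxA, maxB) instead.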
An alternative approach is to first convert the distance pairs to a matrix. Here, I first duplicate the comparisons in the reverse order to ensure that the matrix is complete (using tidyr to spread):
bind_rows(
  df
  , rename(df, id_A = id_B, id_B = id_A)
) %>%
  spread(id_B, dist)
returns:
id_A A B C D
1 A NA 1 2 3
2 B 1 NA 4 5
3 C 2 4 NA 6
4 D 3 5 6 NA
Then, we just apply across rows, much like we would if we were working from a distance matrix (which may be where your data actually started):
bind_rows(
  df
  , rename(df, id_A = id_B, id_B = id_A)
) %>%
  spread(id_B, dist) %>%
  mutate(
    minDist = apply(as.matrix(.[, -1]), 1, min, na.rm = TRUE)
    , minComp = names(.)[apply(as.matrix(.[, -1]), 1, which.min) + 1]
    , maxDist = apply(as.matrix(.[, -1]), 1, max, na.rm = TRUE)
    , maxComp = names(.)[apply(as.matrix(.[, -1]), 1, which.max) + 1]
  ) %>%
  select(Category = `id_A`
         , minDist:maxComp)
returns:
Category minDist minComp maxDist maxComp
1 A 1 B 3 D
2 B 1 A 5 D
3 C 2 A 6 D
4 D 3 A 6 C
I wish to summarize two variables as strings. Let's say this is my first dataset:
#visit
id source1 source2
1 a t
2 c l
3 c z
1 b x
My second dataset:
#transaction
id transactions
1 1
3 2
1 2
I'd like to join these data together but convert them to string at the same time:
I can do it for one variable (let's say source1):
library(dplyr)

visit %>%
  left_join(transaction, by = "id") %>%
  group_by(id) %>%
  summarise(source = toString(unique(source1)),
            transactions = toString(unique(transactions)))
This gives me the following output:
id source transactions
1 a,b 1,2
2 c NA
3 c 2
But I wish to summarize two variables, so my desired output would be something like this:
id source transactions
1 a,t > b,x 1,2
2 c,l NA
3 c,z 2
You can paste the two variables together, using both sep and collapse to combine:
visit %>%
  left_join(transaction) %>%
  group_by(id) %>%
  summarise(source = paste(unique(source1), unique(source2), sep = ', ', collapse = ' > '),
            transaction = na_if(toString(unique(na.omit(transactions))), ''))
## # A tibble: 3 × 3
## id source transaction
## <int> <chr> <chr>
## 1 1 a, t > b, x 1, 2
## 2 2 c, l <NA>
## 3 3 c, z 2
Beware, though: paste and toString stupidly coerce NAs to the string "NA". You may want to wrap values in na.omit or use na_if.
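A tiny sketch of the pitfall and the fixes (na_if is from dplyr):

toString(c("a", NA))              # "a, NA" -- NA becomes the string "NA"
toString(na.omit(c("a", NA)))     # "a"
na_if(toString(character(0)), "") # NA -- turn an empty result back into NA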