Distribution of Combination of Amounts by Group - r

I have a table that looks like this:
Id  Types
1   A
1   A
1   A
1   B
2   A
2   B
3   A
3   B
4   A
4   B
4   B
What I would like to do is: 1. count, for every Id, how many A's and B's it has; 2. compute the distribution of every combination of those counts of A and B.
So at the end of step 2 I should have this table:
Amount of A  Amount of B  Number of Different IDs
1            1            2
1            2            1
3            1            1
How can this be achieved?
Thank you.

Here's a solution with dplyr and tidyr:
library(dplyr)
library(tidyr)
# ...
# Code to generate your original table: "your_table".
# ...
result <- your_table %>%
  # Count the amount of each type for each Id.
  group_by(Id) %>% count(Types) %>% ungroup() %>%
  # "Pivot" the Types column, such that each type (here "A" and "B") gets its
  # own column (here "Amount of A" and "Amount of B") to hold its amount (as
  # calculated right above). Only Id identifies a row here: Types is already
  # consumed by names_from, so it must not be listed in id_cols as well.
  pivot_wider(id_cols = Id,
              names_from = Types, names_prefix = "Amount of ",
              values_from = n) %>%
  # For each combination of amounts among those pivoted columns (ie. all the
  # columns except "Id"), count how many distinct IDs there are.
  group_by(across(-c(Id))) %>%
  summarize("Number of Different IDs" = n_distinct(Id)) %>% ungroup()
# Print the result.
result
Given the example of your_table that you provided
your_table <- tibble::tribble(
~Id, ~Types,
1, "A",
1, "A",
1, "A",
1, "B",
2, "A",
2, "B",
3, "A",
3, "B",
4, "A",
4, "B",
4, "B"
)
you should get the following result:
# A tibble: 3 x 3
  `Amount of A` `Amount of B` `Number of Different IDs`
          <int>         <int>                     <int>
1             1             1                         2
2             1             2                         1
3             3             1                         1
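A more compact variant of the same idea, offered as a sketch rather than a replacement: count() can do the grouping itself, and values_fill = 0 (an addition that is not part of the answer above) records a zero for any Id that lacks one of the types. It assumes the types are exactly "A" and "B", because the final count() names those two columns explicitly.

```r
library(dplyr)
library(tidyr)

your_table <- tibble::tribble(
  ~Id, ~Types,
  1, "A", 1, "A", 1, "A", 1, "B",
  2, "A", 2, "B",
  3, "A", 3, "B",
  4, "A", 4, "B", 4, "B"
)

result <- your_table %>%
  # One row per (Id, Types) pair with its count n.
  count(Id, Types) %>%
  # One row per Id, one column per type; a missing type becomes 0 (assumption).
  pivot_wider(names_from = Types, names_prefix = "Amount of ",
              values_from = n, values_fill = 0) %>%
  # Each row is now one Id, so counting rows counts distinct Ids.
  count(`Amount of A`, `Amount of B`, name = "Number of Different IDs")

result
```

Without values_fill, an Id that has no B at all would get NA in "Amount of B" and would form its own NA combination.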

Related

filtering based on three columns with dplyr

I have a dataframe that looks like this (but with lots more columns, and no helpful "KEEP" column):
df <- tribble( ~Lots.of.cols, ~analyte, ~meta, ~value, ~KEEP,
1, "A", "analyte", NA, FALSE,
1, "A", "unit", "m", FALSE,
1, "A", "method", NA, FALSE,
1, "B", "analyte", "4", TRUE,
1, "B", "unit", "kg", TRUE,
1, "B", "method", "xxx", TRUE)
What I want to do is filter out all the rows of a particular analyte if, in the row where meta is "analyte", the value column is NA. So in the df above, the first three rows should be filtered out because row one has meta = "analyte" and value = NA. The final three rows (analyte = "B") should be kept because the fourth row (meta = "analyte") has !is.na(value).
So there are two approaches I've tried. The first is to group_by(analyte) and then filter; the second is an anti_join:
df %>%
anti_join(.[is.na(.$value) & .$meta == "analyte", ],
by = c("Lots.of.cols", "analyte", "meta")) -> df
With both approaches I can remove the individual row where meta = "analyte" & is.na(value) but not the other rows in the group.
The issue is that your table is not in tidy format, i.e. 1 observation = 1 row.
To get tidy data, you'd need to pivot wider. This is why I pivoted, filtered, then pivoted back.
Also, it's confusing that you have two things named "analyte" that are not the same thing, which is why I changed the name.
library(tidyverse)  # loads dplyr, tidyr and stringr (for str_replace)
df %>%
  mutate(meta = str_replace(meta, "analyte", "analyte_value")) %>%
  pivot_wider(names_from = meta, values_from = value) %>%
  filter(!is.na(analyte_value)) %>%
  pivot_longer(cols = analyte_value:method)
#> # A tibble: 3 x 4
#> Lots.of.cols analyte name value
#> <dbl> <chr> <chr> <chr>
#> 1 1 B analyte_value 4
#> 2 1 B unit kg
#> 3 1 B method xxx
Your anti_join was almost right; just leave the "meta" variable out of by = c(...), like this:
df %>%
anti_join(.[is.na(.$value) & .$meta == "analyte", ],
by = c("Lots.of.cols", "analyte")) -> df
Result :
# A tibble: 3 x 5
Lots.of.cols analyte meta value KEEP
<dbl> <chr> <chr> <chr> <lgl>
1 1 B analyte 4 TRUE
2 1 B unit kg TRUE
3 1 B method xxx TRUE
I would first fix your KEEP column, and then filter the data by it. First I group the data by analyte using group_by() from dplyr. Then I test, within each group, whether there is a row with meta == "analyte" and value NA, using any() to check whether the test is TRUE for at least one row of the group. Note that KEEP is therefore TRUE for the groups to be dropped, so finally I use filter() to keep the rows where KEEP is FALSE.
library(tidyverse)
df <- df %>%
group_by(analyte) %>%
mutate(KEEP = any(meta == "analyte" & is.na(value))) %>%
filter(KEEP == FALSE)
Here is the result:
# A tibble: 3 x 5
# Groups: analyte [1]
Lots.of.cols analyte meta value KEEP
<dbl> <chr> <chr> <chr> <lgl>
1 1 B analyte 4 FALSE
2 1 B unit kg FALSE
3 1 B method xxx FALSE
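The grouped test can also be put straight into filter(), without a helper column. This is a minimal sketch of that variant; the example data repeats the question's df with the KEEP column left out, since the real data does not have it.

```r
library(dplyr)

df <- tibble::tribble(
  ~Lots.of.cols, ~analyte, ~meta,     ~value,
  1,             "A",      "analyte", NA,
  1,             "A",      "unit",    "m",
  1,             "A",      "method",  NA,
  1,             "B",      "analyte", "4",
  1,             "B",      "unit",    "kg",
  1,             "B",      "method",  "xxx"
)

kept <- df %>%
  group_by(analyte) %>%
  # Drop the whole group if its "analyte" row has a missing value.
  filter(!any(meta == "analyte" & is.na(value))) %>%
  ungroup()

kept
```

Because any() returns a single TRUE or FALSE per group, filter() keeps or drops all rows of the group together.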

Creating a table based on criteria from another table in r

I've a three column table (Table_1) and I would like to create another table based on Table_1. The table has personal ID and work start and end days.
Table_1 <- data.frame(ID = c("A", "B", "C"), Start_Day = c(1, 20, 38), End_Day = c(14, 29, 42))
The new table I would like to create will have two columns, namely ID and Week. The number of rows for each ID level is equal to the number of bins (weeks) of the End_Day and Start_Day. For example, ID A will have 2 week bins 1 (days 1-7) and 2 (days 8-14), ID B will have 3 week bins, 3 (days 15-21), 4 (days 22-28) and 5 (days 29-35).
The expected outcome is:
Table_2 <- data.frame(ID = c("A", "A", "B", "B", "B", "C"), Week = c(1, 2, 3, 4, 5, 6))
One way would be to divide Start_Day and End_Day by 7 (rounding up to get week numbers), create a sequence between them for each row using map2, and then bring the data into long format using unnest.
library(dplyr)
Table_1 %>%
mutate_at(-1, ~ceiling(./7)) %>%
mutate(Week = purrr::map2(Start_Day, End_Day, seq)) %>%
tidyr::unnest(Week) %>%
select(ID, Week)
# A tibble: 6 x 2
# ID Week
# <fct> <int>
#1 A 1
#2 A 2
#3 B 3
#4 B 4
#5 B 5
#6 C 6
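Since mutate_at() has been superseded by across() in dplyr 1.0.0 and later, the same steps can be sketched as follows. The arithmetic is unchanged: for ID B, ceiling(20 / 7) is 3 and ceiling(29 / 7) is 5, so its weeks are 3, 4 and 5.

```r
library(dplyr)
library(tidyr)

Table_1 <- data.frame(ID = c("A", "B", "C"),
                      Start_Day = c(1, 20, 38),
                      End_Day = c(14, 29, 42))

Table_2 <- Table_1 %>%
  # Convert day numbers to week numbers (days 1-7 are week 1, and so on).
  mutate(across(c(Start_Day, End_Day), ~ ceiling(.x / 7))) %>%
  # Build one sequence of weeks per row, stored as a list column.
  mutate(Week = purrr::map2(Start_Day, End_Day, seq)) %>%
  # Expand the list column so that each week gets its own row.
  unnest(Week) %>%
  select(ID, Week)

Table_2
```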

finding matching or non matching values in r

I have a medium sized data frame like the one I made up below, (with several columns though) where I want to find if any "id"s have different "letter"s
I imagine that there is a simple way to do this, maybe with tidyr?
df<-data.frame("id"=c(1, 1, 2, 3, 3, 3, 4, 4, 5, 6, 6), "letter"=c("f", "f",
"r", "r", "k", "k", "k", "k", "r", "f", "r"))
EDIT: I am trying to find the "id"s that have more than one letter. i.e. in this df id 3 and 6. I am less interested in which "letter"s (though it's not bad if they're shown), more in which "id"s
Not sure of your desired output, but you can check to see how many letters correspond to each id :
library(dplyr)
df %>%
group_by(id) %>%
summarise(n_letters = n_distinct(letter))
# A tibble: 6 x 2
     id n_letters
  <dbl>     <int>
1     1         1
2     2         1
3     3         2
4     4         1
5     5         1
6     6         2
If you just want a vector of the ids with only 1 letter:
df %>%
group_by(id) %>%
summarise(n_letters = n_distinct(letter)) %>%
filter(n_letters == 1) %>%
pull(id)
[1] 1 2 4 5
Or if you want a df of the ids with more than 1 letter:
multiple_letter_ids <- df %>%
group_by(id) %>%
summarise(n_letters = n_distinct(letter)) %>%
filter(n_letters > 1) %>%
pull(id)
df %>% filter(id %in% multiple_letter_ids)
id letter
1 3 r
2 3 k
3 3 k
4 6 f
5 6 r
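If only the ids themselves are needed, the count and the filter can be combined in one grouped filter(). A small sketch using the question's data:

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 2, 3, 3, 3, 4, 4, 5, 6, 6),
                 letter = c("f", "f", "r", "r", "k", "k",
                            "k", "k", "r", "f", "r"))

ids_with_several_letters <- df %>%
  group_by(id) %>%
  # Keep only the groups that contain more than one distinct letter.
  filter(n_distinct(letter) > 1) %>%
  ungroup() %>%
  distinct(id) %>%
  pull(id)

ids_with_several_letters
```

For this df the result contains ids 3 and 6, matching the ids named in the question's edit.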

How can we apply tidyr:: spread() to all categorical variables at once creating new columns for each level of each categorical variable? [duplicate]

I have a dataframe with 3 categorical variables (x,y,z) along with an ID column :
df <- frame_data(
~id, ~x, ~y, ~z,
1, "a", "c" ,"v",
1, "b", "d", "f",
2, "a", "d", "v",
2, "b", "d", "v")
I want to apply spread() to each of the categorical variables group by ID .
Output should be like this :
id a b c d v f
1 1 1 1 1 1 1
2 1 1 0 2 2 0
I tried doing it but I was able to do it only for one variable at once not all together .
For e.g: Applying spread only to the y column (similarly , it can be done for x and z separately) but not together in a single line
df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Explaining my code in three steps:
Step 1: count frequency
df %>% count(id,y)
id y n
<dbl> <chr> <int>
1.00 c 1
1.00 d 1
2.00 d 2
Step 2 : applying spread()
df %>% count(id,y) %>% spread(y,n)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1 1.00 1 1
2 2.00 NA 2
Step 3: Adding fill = 0 replaces each NA with 0; the NA meant that c occurred zero times in the y column for id 2 (as you can see in df)
df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Problem: In my actual data set, I have 20 such categorical variables, so I can't do it one by one for all of them. I am looking to do it all at once.
Is it possible to apply spread() in tidyr to all of the categorical variables together? If not, can you please suggest an alternative?
Note: I also tried these answers, but they were not helpful for this particular case:
R spreading multiple columns with tidyr
Is it possible to use spread on multiple columns in tidyr similar to dcast?
Can spread() in tidyr spread across multiple value?
Expanding columns associated with a categorical variable into multiple columns with dplyr/tidyr while retaining id variable
Additional related helpful question :
It is possible that two categorical columns (Eg: Survey dataset) have same values . Like below.
df <- frame_data(
~id, ~Do_you_Watch_TV, ~Do_you_Drive,
1, "yes", "yes",
1, "yes", "no",
2, "yes", "no",
2, "no", "yes")
# A tibble: 4 x 3
id Do_you_Watch_TV Do_you_Drive
<dbl> <chr> <chr>
1 1.00 yes yes
2 1.00 yes no
3 2.00 yes no
4 2.00 no yes
Running the below code would not differentiate counts of yes and no for 'Do_you_Watch_TV', 'Do_you_Drive' :
df %>% gather(Key, value, -id) %>%
group_by(id, value) %>%
summarise(count = n()) %>%
spread(value, count, fill = 0) %>%
as.data.frame()
id no yes
1 1 3
2 2 2
Whereas, expected output should be :
id Do_you_Watch_TV_no Do_you_Watch_TV_yes Do_you_Drive_no Do_you_Drive_yes
1 0 2 1 1
2 1 1 1 1
So we need to treat "no" and "yes" from Do_you_Watch_TV and Do_you_Drive separately by adding a prefix: Do_you_Drive_yes, Do_you_Drive_no, Do_you_Watch_TV_yes, Do_you_Watch_TV_no.
How can we achieve this?
Thanks
First you need to convert your data frame to long format before you can transform it to wide format. Hence, first use tidyr::gather to convert the data frame to long format. Afterwards, you have a couple of options:
Option#1: Using tidyr::spread:
#data
df <- frame_data(
~id, ~x, ~y, ~z,
1, "a", "c" ,"v",
1, "b", "d", "f",
2, "a", "d", "v",
2, "b", "d", "v")
library(tidyverse)
df %>% gather(Key, value, -id) %>%
group_by(id, value) %>%
summarise(count = n()) %>%
spread(value, count, fill = 0) %>%
as.data.frame()
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
Option#2: Another option is to use reshape2::dcast:
library(tidyverse)
library(reshape2)
df %>% gather(Key, value, -id) %>%
dcast(id~value, fun.aggregate = length)
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
Edited: To include solution for 2nd data frame.
#Data
df1 <- frame_data(
~id, ~Do_you_Watch_TV, ~Do_you_Drive,
1, "yes", "yes",
1, "yes", "no",
2, "yes", "no",
2, "no", "yes")
library(tidyverse)
df1 %>% gather(Key, value, -id) %>% unite("value", c(Key, value)) %>%
group_by(id, value) %>%
summarise(count = n()) %>%
spread(value, count, fill = 0) %>%
as.data.frame()
# id Do_you_Drive_no Do_you_Drive_yes Do_you_Watch_TV_no Do_you_Watch_TV_yes
# 1 1 1 1 0 2
# 2 2 1 1 1 1
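gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). As a sketch under that assumption (tidyr 1.0.0 or later), the prefixed counts can be produced as below; the column names "question" and "answer" are illustrative choices, not taken from the original answer.

```r
library(dplyr)
library(tidyr)

df1 <- tibble::tribble(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no",  "yes"
)

wide <- df1 %>%
  # Long format: one row per (id, question, answer).
  pivot_longer(-id, names_to = "question", values_to = "answer") %>%
  # Frequency of each answer to each question per id.
  count(id, question, answer) %>%
  # Wide format: column names are built as question_answer.
  pivot_wider(names_from = c(question, answer), values_from = n,
              values_fill = 0)

wide
```

Passing two columns to names_from joins them with "_" by default, which plays the role of the unite() step above.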

Ignore value conditionally within group_by in dplyr

Please consider the following.
Background
In a data.frame I have patient IDs (id), the day at which patients are admitted to a hospital (day), a code for the diagnostic activity they received that day (code), a price for that activity (price) and a frequency for that activity (freq).
Activities with code b and c are registered at the same time but mean more or less the same thing and should not be double counted.
Problem
What I want is: if code "b" and "c" are registered for the same day, code "b" should be ignored.
The example data.frame looks like this:
x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
day = c(1, 1, 1, 2, 1, 2, 3),
price = c(500, 10, 100, rep(10, 3), 100),
code = c("a", "b", "c", rep("b", 3), "c"),
freq = c(rep(1, 5), rep(2, 2)))
> x
id day price code freq
1 a 1 500 a 1
2 a 1 10 b 1
3 a 1 100 c 1
4 a 2 10 b 1
5 b 1 10 b 1
6 b 2 10 b 2
7 b 3 100 c 2
So the costs for patient "a" for day 1 would be 600 and not 610 as I can compute with the following:
x %>%
group_by(id, day) %>%
summarise(res = sum(price * freq))
# A tibble: 5 x 3
# Groups: id [?]
id day res
<fct> <dbl> <dbl>
1 a 1. 610.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
Possible approaches
Either I delete observation code "b" when "c" is present on that same day or I set freq of code "b" to 0 in case code "c" is present on the same day.
All my attempts with ifelse and mutate failed so far.
Any help is much appreciated. Thank you very much in advance!
You can add a filter line to remove the offending b values like this...
x %>%
group_by(id, day) %>%
filter(!(code=="b" & "c" %in% code)) %>%
summarise(res = sum(price * freq))
id day res
<fct> <dbl> <dbl>
1 a 1. 600.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
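The question's other idea, setting freq of code "b" to 0 when code "c" is present on the same day, can be sketched with mutate() and if_else() inside the same grouping. This keeps every row instead of dropping the "b" rows.

```r
library(dplyr)

x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
                day = c(1, 1, 1, 2, 1, 2, 3),
                price = c(500, 10, 100, rep(10, 3), 100),
                code = c("a", "b", "c", rep("b", 3), "c"),
                freq = c(rep(1, 5), rep(2, 2)))

res <- x %>%
  group_by(id, day) %>%
  # Within each patient-day, zero out "b" whenever a "c" is also registered.
  mutate(freq = if_else(code == "b" & any(code == "c"), 0, freq)) %>%
  summarise(res = sum(price * freq), .groups = "drop")

res
```

For patient "a" on day 1 this gives 500 + 0 + 100 = 600, the same total as the filter() approach.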
You could create a new column like this:
mutate(code_day = paste0(ifelse(code %in% c("b", "c"), "z", code), day))
Then all your Bs and Cs will become Zs (without losing the original code column that helps you tell them apart). You can then arrange by code descending and remove duplicate values in the code_day column, within each id so that different patients are not merged:
arrange(desc(code)) %>% # Bs will come after Cs
distinct(id, code_day, .keep_all = TRUE)
