I have an R data frame with a column that looks like this:
codes
111:222:333
222
111:222
I want to expand the codes column into individual binary columns like this:
111 222 333
1 1 1
0 1 0
1 1 0
I tried converting the codes column to a list of character vectors using strsplit. Then I unnested the codes column and wanted to perform pivot_wider, but it seems I cannot do that when rows have duplicate identifying values.
df <- df %>%
  mutate(codes = strsplit(codes, ":", TRUE)) %>%
  unnest(codes) %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = codes,
              values_from = value,
              values_fill = 0)
We could use dummy_cols from fastDummies
library(fastDummies)
dummy_cols(df1, "codes", split = ":", remove_selected_columns = TRUE)
Output
codes_111 codes_222 codes_333
1 1 1 1
2 0 1 0
3 1 1 0
NOTE: It may be better to have column names that start with letters. If we want to keep only the code values as column names:
library(dplyr)
library(stringr)
dummy_cols(df1, "codes", split = ":", remove_selected_columns = TRUE) %>%
setNames(str_remove(names(.), "codes_"))
111 222 333
1 1 1 1
2 0 1 0
3 1 1 0
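If names that start with a letter are preferred, a minimal sketch (an addition, using the same df1 as in the data block below) would be to swap the codes_ prefix for a single letter rather than stripping it entirely:
library(dplyr)
library(fastDummies)
library(stringr)

# keep a short letter prefix, e.g. x111, x222, x333, instead of bare numbers
dummy_cols(df1, "codes", split = ":", remove_selected_columns = TRUE) %>%
  setNames(str_replace(names(.), "^codes_", "x"))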
data
df1 <- structure(list(codes = c("111:222:333", "222", "111:222")),
                 class = "data.frame", row.names = c(NA, -3L))
Another approach using separate_rows:
library(tidyr)
library(dplyr)
df1 %>%
  mutate(r = 1:n()) %>%
  separate_rows(codes, sep = ":") %>%
  table() %>%
  t()
# codes
# r 111 222 333
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
This gives a table object, though, so if we need a data frame we should use pivot_wider instead of table.
df1 %>%
  mutate(r = 1:n(), val = 1) %>%
  separate_rows(codes, sep = ":") %>%
  pivot_wider(names_from = "codes", values_from = "val", values_fill = 0) %>%
  select(-r)
# # A tibble: 3 x 3
# `111` `222` `333`
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
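As a side note (a small sketch, not from the original answer): the contingency table from the first snippet can also be coerced to a data frame directly with as.data.frame.matrix(), if re-running pivot_wider is not wanted:
library(dplyr)
library(tidyr)

# same contingency table as in the first snippet, kept as an object this time
tab <- df1 %>%
  mutate(r = 1:n()) %>%
  separate_rows(codes, sep = ":") %>%
  table() %>%
  t()

as.data.frame.matrix(tab)  # rows named by r, columns 111/222/333, values 0/1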
Data:
df1 <- data.frame(codes = c("111:222:333","222", "111:222"))
Add an id column to your data.frame before unnest
library(dplyr)
library(tidyr)
df %>%
  mutate(codes = strsplit(codes, ":", TRUE)) %>%
  mutate(id = row_number()) %>%
  unnest(codes) %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = codes,
              values_from = value,
              values_fill = 0)
##> # A tibble: 3 × 4
##> id `111` `222` `333`
##> <int> <dbl> <dbl> <dbl>
##> 1 1 1 1 1
##> 2 2 0 1 0
##> 3 3 1 1 0
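If the helper id column should not appear in the final result, a small follow-up (a sketch using the same example data, not part of the original answer) is to drop it at the end:
library(dplyr)
library(tidyr)

df <- data.frame(codes = c("111:222:333", "222", "111:222"))

df %>%
  mutate(codes = strsplit(codes, ":", fixed = TRUE),
         id = row_number()) %>%
  unnest(codes) %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = codes, values_from = value, values_fill = 0) %>%
  select(-id)  # keep only the 111/222/333 columns from the desired output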
I'm trying to calculate the cumulative time among several grades.
Here's what my original df looks like:
df = data.frame(id = c(1,1,1,1,2,2,2,2),
                group = c(0,0,0,0,1,1,1,1),
                grade = c(0,1,2,3,0,1,3,4),
                time = c(10,7,4,1,20,17,14,11))
Here's what I'm expecting as the result df1:
df1 = df %>%
  pivot_wider(
    names_from = "grade",
    names_prefix = "grade_",
    values_from = "time") %>%
  replace(is.na(.), 0) %>%
  mutate(grade_1 = grade_1 + grade_2 + grade_3 + grade_4,
         grade_2 = grade_2 + grade_3 + grade_4,
         grade_3 = grade_3 + grade_4) %>%
  pivot_longer(
    cols = 3:7,
    names_to = "grade",
    names_prefix = "grade_",
    values_to = "time")
My method works, but I want it to be more flexible. When there are more grades in the df, I don't want to have to manually write out grade_x = grade_1 + grade_2 + grade_3 + ...
Thank you!
One option would be to arrange the grade column in descending order within each id, take the cumsum, and exclude the last row of each group (where grade == 0) so it keeps its original time; for id 1 this gives 1 for grade 3, 1 + 4 = 5 for grade 2, and 1 + 4 + 7 = 12 for grade 1, while grade 0 keeps its original 10. Then we re-arrange back into the desired order and ungroup.
library(tidyverse)
results <- df %>%
  group_by(id) %>%
  arrange(id, desc(grade)) %>%
  mutate(time = ifelse(row_number() != n(), cumsum(time), time)) %>%
  arrange(id, grade) %>%
  ungroup()
Output
id group grade time
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 10
2 1 0 1 12
3 1 0 2 5
4 1 0 3 1
5 2 1 0 20
6 2 1 1 42
7 2 1 3 25
8 2 1 4 11
If you need each group to have the same number of rows as in your desired output, then you can use complete:
df %>%
  tidyr::complete(id, grade) %>%
  group_by(id) %>%
  fill(group, .direction = "downup") %>%
  replace(is.na(.), 0) %>%
  arrange(id, desc(grade)) %>%
  mutate(time = ifelse(row_number() != n(), cumsum(time), time)) %>%
  arrange(id, grade) %>%
  ungroup()
Output
id grade group time
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 10
2 1 1 0 12
3 1 2 0 5
4 1 3 0 1
5 1 4 0 0
6 2 0 1 20
7 2 1 1 42
8 2 2 1 25
9 2 3 1 25
10 2 4 1 11
Or if you want to pivot back and forth then you could do something like this:
output <- df %>%
  pivot_wider(
    names_from = "grade",
    names_prefix = "grade_",
    values_from = "time") %>%
  replace(is.na(.), 0) %>%
  select(id, group, grade_0, last_col():grade_1)

results2 <- output %>%
  select(-c(id, group, grade_0)) %>%
  rowwise() %>%
  do(data.frame(t(cumsum(unlist(.))))) %>%
  bind_cols(select(output, id, group, grade_0), .) %>%
  pivot_longer(
    cols = 3:7,
    names_to = "grade",
    names_prefix = "grade_",
    values_to = "time")
1st Try:
For cumulative sums across a variable, we can group_by and use cumsum():
No need to specify grades, etc. You can do more aggregations if needed.
df %>%
  group_by(grade) %>%
  mutate(Cum_Time = cumsum(time)) %>%
  arrange(grade)
id group grade time Cum_Time
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 10
2 2 1 0 20 30
3 1 0 1 7 7
4 2 1 1 17 24
5 1 0 2 4 4
6 1 0 3 1 1
7 2 1 3 14 15
8 2 1 4 11 11
I want to count the use of Tool A by year and keep zeros.
ID <- c(1,1,2,2,2,3,4,5,5,5)
Tool <- c("A","B","A","B","A","A","B","A","A","A")
Year <- c(2000,2001,2001,2001,2002,2002,2001,2000,2001,2002)
df <- data.frame(ID,Tool,Year)
library(tidyverse)
df %>% group_by(ID) %>% summarise(toolA = sum(Tool == "A")) %>% count(toolA)
# A tibble: 4 x 2
toolA n
<int> <int>
1 0 1
2 1 2
3 2 1
4 3 1
I want to add year columns, so that I can have a table as below
tool A  Count  2000  2001  2002
     0      1     0     0     0
     1      2     1     0     1
     2      1     0     1     1
     3      1     1     1     1
The numbers under the years are the number of uses in that year (not the number of people).
How would you do this?
Here is another tidyverse method. Simply speaking, we pivot the data frame from long to wide and then summarize. The first summarization gets rid of all the non-"A" rows; the second condenses the result into unique bins identified by each toolA value and produces a count.
library(dplyr)
library(tidyr)
df %>%
mutate(value = +(Tool == "A")) %>%
pivot_wider(names_from = Year, values_fill = 0L) %>%
group_by(ID) %>%
summarize(across(-Tool, sum)) %>%
group_by(toolA = rowSums(across(-ID))) %>%
summarize(count = n(), across(-c(ID, count), sum))
Output
# A tibble: 4 x 5
toolA count `2000` `2001` `2002`
<dbl> <int> <int> <int> <int>
1 0 1 0 0 0
2 1 2 1 0 1
3 2 1 0 1 1
4 3 1 1 1 1
Maybe this is too convoluted and a better/easier solution exists.
library(dplyr)
library(tidyr)
dataA <- df %>%
group_by(ID) %>%
summarise(toolA = sum(Tool == "A")) %>%
count(toolA)
df %>%
group_by(ID, Year) %>%
summarise(toolA = sum(Tool == "A"), .groups = 'drop') %>%
pivot_wider(names_from = Year, values_from = toolA, values_fill = 0) %>%
select(-ID) %>%
mutate(toolA = rowSums(.)) %>%
right_join(dataA, by = 'toolA') %>%
select(toolA, n, everything()) %>%
arrange(toolA) %>%
group_by(toolA, n) %>%
summarise(across(.fns = sum), .groups = 'drop')
# toolA n `2000` `2001` `2002`
# <dbl> <int> <int> <int> <int>
#1 0 1 0 0 0
#2 1 2 1 0 1
#3 2 1 0 1 1
#4 3 1 1 1 1
I might try this approach with tidyverse. Create a list column with the Years when grouping by ID. After including the count n as you have done, use unnest_longer to recover the years. I added an extra column called "None" for situations where the count is zero. A final pivot_wider puts the data into wide form again.
library(tidyverse)
df %>%
  group_by(ID) %>%
  summarise(toolA = sum(Tool == "A"),
            Years = list(Year[Tool == "A"])) %>%
  add_count(toolA) %>%
  unnest_longer(Years) %>%
  replace_na(list(Years = "None")) %>%
  mutate(value = 1) %>%
  pivot_wider(id_cols = c(toolA, n), names_from = Years, names_prefix = "Year_",
              values_from = value, values_fill = 0) %>%
  arrange(toolA)
Output
toolA n Year_2000 Year_2001 Year_2002 Year_None
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 0 1 0 0 0 1
2 1 2 1 0 1 0
3 2 1 0 1 1 0
4 3 1 1 1 1 0
Using the following data:
df <- data.frame(id = c("A", "B", "C", "A", "B", "A"),
                 value = c(1, 2, 3, 4, 5, 6))
I want to pivot_wider this data so that the reshaping creates two different sets of columns:
One set of binary columns whose names come from the values of the value column (e.g. bin_1, bin_2 and so on) and that are coded as 0/1.
An additional set with as many columns as necessary to store the values in a "categorical" way. Here, id "A" has three values, so I want to create three columns cat_1, cat_2, cat_3; for ids B and C I want to fill them with NA where there is no value.
Now, I know how to create these two things separately from each other and merge them afterwards via a left_join.
However, my question is: can it be done in one pipeline, where I do two subsequent pivot_widers? I tried, but it doesn't work (obviously because my way of copying the value column and then trying to use one copy for the binary reshape and the other for the categorical reshape is wrong).
Any ideas?
Code so far that works:
df1 <- df %>%
  group_by(id) %>%
  mutate(group_id = 1:n()) %>%
  ungroup() %>%
  pivot_wider(names_from = group_id,
              names_prefix = "cat_",
              values_from = value)

df2 <- df %>%
  mutate(dummy = 1) %>%
  arrange(value) %>%
  pivot_wider(names_from = value,
              names_prefix = "bin_",
              values_from = dummy,
              values_fill = list(dummy = 0),
              values_fn = list(dummy = length))

df <- df1 %>%
  left_join(., df2, by = "id")
Expected output:
# A tibble: 3 x 10
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0
With the addition of purrr, you could do:
map(.x = reduce(range(df$value), `:`),
    ~ df %>%
        group_by(id) %>%
        mutate(!!paste0("bin_", .x) := as.numeric(.x %in% value))) %>%
  reduce(full_join) %>%
  mutate(cats = paste0("cat_", row_number())) %>%
  pivot_wider(names_from = "cats",
              values_from = "value")
id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 1 0 1 1 4 6
2 B 0 1 0 0 1 0 2 5 NA
3 C 0 0 1 0 0 0 3 NA NA
In base R you can try:
tt <- unstack(df[2:1])
x <- cbind(t(sapply(tt, "[", seq_len(max(lengths(tt))))),
           t(+sapply(names(tt), "%in%", x = df$id)))
colnames(x) <- c(paste0("cat_", seq_len(max(lengths(tt)))),
                 paste0("bin_", seq_len(nrow(df))))
x
# cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
#A 1 4 6 1 0 0 1 0 1
#B 2 5 NA 0 1 0 0 1 0
#C 3 NA NA 0 0 1 0 0 0
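If a data frame with an explicit id column is preferred over row names, a small follow-up on the base result above (an addition, not part of the original answer):
# turn the row names A/B/C into an id column
data.frame(id = rownames(x), x, check.names = FALSE, row.names = NULL)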
Slightly modifying your approach by reducing the df2 code and putting it all in one pipe, taking advantage of the list and . trick, which allows you to work on two versions of df in the same call.
It's not much of an improvement on what you have done, but it is now all in one call. I can't think of a way to do it without a merge/join.
library(tidyverse)
df %>%
  list(
    pivot_wider(., id_cols = id,
                names_from = value,
                names_prefix = "bin_") %>%
      mutate_if(is.numeric, ~ +(!is.na(.))),  # convert to binary
    group_by(., id) %>%
      mutate(group_id = 1:n()) %>%
      ungroup() %>%
      pivot_wider(names_from = group_id,
                  names_prefix = "cat_",
                  values_from = value)
  ) %>%
  .[c(2:3)] %>%
  reduce(left_join)
# id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
# <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 1 0 1 1 4 6
# 2 B 0 1 0 0 1 0 2 5 NA
# 3 C 0 0 1 0 0 0 3 NA NA
You can even combine both of your pipelines into one without creating any intermediate object:
df %>%
  group_by(id) %>%
  mutate(group_id = row_number()) %>%
  pivot_wider(names_from = group_id,
              names_prefix = "cat_",
              values_from = value) %>%
  left_join(df %>%
              mutate(dummy = 1) %>%
              arrange(value) %>%
              pivot_wider(names_from = value,
                          names_prefix = "bin_",
                          values_from = dummy,
                          values_fill = list(dummy = 0),
                          values_fn = list(dummy = length)),
            by = "id")
# A tibble: 3 x 10
# Groups: id [3]
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0
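One small aside, not part of the original answer: the result above is still grouped (hence the # Groups: id [3] line), because the group_by(id) used before the first pivot is never dropped. A minimal sketch of the fix, assuming the same df, is to add ungroup() before the join:
library(dplyr)
library(tidyr)

df <- data.frame(id = c("A", "B", "C", "A", "B", "A"),
                 value = c(1, 2, 3, 4, 5, 6))

df %>%
  group_by(id) %>%
  mutate(group_id = row_number()) %>%
  pivot_wider(names_from = group_id, names_prefix = "cat_", values_from = value) %>%
  ungroup()  # drop the grouping so the joined result is an ordinary tibble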
With the following code I assign a quantile rank y (from 1 to 4) for every value of x.
df$y <- ntile(df$x, 4)
Then I would like to have four separate columns with the absolute frequency count of every quantile rank, also grouped by variable z. The following code does the calculation, but I get all of the counts in the same column.
df <-
df %>%
group_by(z, y) %>%
mutate(Freq = n())
example:
z  y (quartile)  n_quartile_4  n_quartile_3  n_quartile_2
1  4             2             1             0
1  3             2             1             0
1  4             2             1             0
2  2             0             0             3
2  2             0             0             3
2  2             0             0             3
We could create the count column with add_count, then pivot to 'wide' format with pivot_wider, fill the NA elements with the non-NA value in the column for each group, and finally replace the remaining NAs with 0.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(z, y) %>%
mutate(new = str_c('n_quartile_', y), rn = row_number()) %>%
pivot_wider(names_from = new, values_from = n) %>%
group_by(z) %>%
fill(starts_with('n_quartile'), .direction = 'downup') %>%
ungroup %>%
select(-rn) %>%
mutate_at(vars(starts_with('n_quartile')), replace_na, 0)
# A tibble: 6 x 5
# z y n_quartile_4 n_quartile_3 n_quartile_2
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 4 2 1 0
#2 1 3 2 1 0
#3 1 4 2 1 0
#4 2 2 0 0 3
#5 2 2 0 0 3
#6 2 2 0 0 3
data
df <- structure(list(z = c(1L, 1L, 1L, 2L, 2L, 2L), y = c(4, 3, 4, 2, 2, 2)),
                class = "data.frame", row.names = c(NA, -6L))
Having a data frame like this:
df <- data.frame(id = c(1,2,3,4,5), keywords = c("google, yahoo, air, cookie", "cookie, air", "air, cookie", "google", "yahoo, google"))
How is it possible to extract a table like
df_binary_exist <- data.frame(id = c(1,2,3,4,5), google = c(1,0,0,1,1), yahoo = c(1,0,0,0,1), air = c(1,1,1,0,0), cookie = c(1,1,1,0,0))
df_binary_exist
id google yahoo air cookie
1 1 1 1 1 1
2 2 0 0 1 1
3 3 0 0 1 1
4 4 1 0 0 0
5 5 1 1 0 0
and from this table find the most frequent couples?
df_frequency <- data.frame(couple = c("yahoo-google", "cookie-air"), freq = c(2,3))
df_frequency
couple freq
1 yahoo-google 2
2 cookie-air 3
The first part can be achieved by using separate_rows, count, and spread:
library(dplyr)
library(tidyr)
df1 <- df %>% separate_rows(keywords)
df1 %>%
dplyr::count(id, keywords) %>%
spread(keywords, n, fill = 0)
# id air cookie google yahoo
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1
#2 2 1 1 0 0
#3 3 1 1 0 0
#4 4 0 0 1 0
#5 5 0 0 1 1
For the second part, I used a base R method where we first split keywords based on id, paste the combinations of every 2 elements, and count their frequency using table.
data.frame(sort(table(unlist(sapply(split(df1$keywords, df1$id), function(x)
combn(sort(x), pmin(2, length(x)), paste, collapse = "-")))), decreasing = TRUE))
# Var1 Freq
#1 air-cookie 3
#2 google-yahoo 2
#3 air-google 1
#4 air-yahoo 1
#5 cookie-google 1
#6 cookie-yahoo 1
#7 google 1
One tidyverse possibility could be:
df %>%
  mutate(keywords = strsplit(keywords, ", ", fixed = TRUE)) %>%
  unnest() %>%
  full_join(df %>%
              mutate(keywords = strsplit(keywords, ", ", fixed = TRUE)) %>%
              unnest(), by = c("id" = "id")) %>%
  filter(keywords.x != keywords.y) %>%
  count(keywords.x, keywords.y) %>%
  transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
            n) %>%
  distinct(keywords, .keep_all = TRUE)
keywords n
<chr> <int>
1 cookie-air 3
2 google-air 1
3 yahoo-air 1
4 google-cookie 1
5 yahoo-cookie 1
6 yahoo-google 2
First, it splits the "keywords" column on ", " and then performs a full join with itself. Second, it filters out the rows where the two values are the same, as the OP is interested in pairs of values. Third, it counts the number of occurrences of the pairs. Finally, it creates an ordered variable of pairs and keeps only the distinct rows based on this variable.
Or the same using separate_rows():
df %>%
  separate_rows(keywords) %>%
  full_join(df %>%
              separate_rows(keywords), by = c("id" = "id")) %>%
  filter(keywords.x != keywords.y) %>%
  count(keywords.x, keywords.y) %>%
  transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
            n) %>%
  distinct(keywords, .keep_all = TRUE)
We can do this easily with mtabulate from qdapTools:
library(qdapTools)
cbind(df[1], mtabulate(strsplit(as.character(df$keywords), ", ")))
# id air cookie google yahoo
#1 1 1 1 1 1
#2 2 1 1 0 0
#3 3 1 1 0 0
#4 4 0 0 1 0
#5 5 0 0 1 1
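For the second part of the question (the most frequent couples), one possible follow-up sketch, not from the original answer: the pairwise co-occurrence counts can be read straight off the cross-product of that 0/1 matrix.
library(qdapTools)

# df as defined in the question
m <- mtabulate(strsplit(as.character(df$keywords), ", "))  # same 0/1 table as above
co <- crossprod(as.matrix(m))                              # keyword x keyword co-occurrences
idx <- which(upper.tri(co), arr.ind = TRUE)                # each unordered couple once

data.frame(couple = paste(rownames(co)[idx[, 1]], colnames(co)[idx[, 2]], sep = "-"),
           freq = co[idx])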