pivot_wider when there's no value column - r

I'm trying to reshape a dataset from long to wide. The following code works, but I'm curious if there's a way not to provide a value column and still use pivot_wider. In the following example, I have to create a temporary column "val" to use pivot_wider, but is there a way I can do it without it?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"))
a
name type
1 sam a
2 rob b
3 tom c
I want to convert it as the following.
name a b c
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
This can be done by the following code, but can I do it without creating "val" column (and still using tidyverse language)?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"),
val = rep(1, 3)) %>%
pivot_wider(names_from = type, values_from = val, values_fill = list(val = 0))

You can use the values_fn argument to assign 1 and values_fill to assign 0:
library(tidyr)
pivot_wider(a, names_from = type, values_from = type, values_fn = ~1, values_fill = 0)
# A tibble: 3 x 4
name a b c
<fct> <dbl> <dbl> <dbl>
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
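To see how values_fn and values_fill split the work, here is a minimal sketch on made-up data (the data frame b is hypothetical, not from the question) in which one name/type pair occurs twice. values_fn decides what goes into cells that have at least one matching row, and values_fill handles cells with none. A plain function(x) 1 is used in place of ~1 on the assumption that it behaves the same way and also runs on tidyr versions without lambda support.

```r
library(tidyr)

# Hypothetical data: the pair (sam, a) appears twice
b <- data.frame(name = c("sam", "sam", "rob"),
                type = c("a", "a", "b"))

# Indicator: any number of matches collapses to 1
ind <- pivot_wider(b, names_from = type, values_from = type,
                   values_fn = function(x) 1, values_fill = 0)

# Count: the number of matching rows is kept
cnt <- pivot_wider(b, names_from = type, values_from = type,
                   values_fn = length, values_fill = 0)

ind$a  # 1 0
cnt$a  # 2 0
```

So if duplicates are possible and you want a 0/1 flag rather than a count, the constant function is the safer choice.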

We can mutate with a column of 1s and use that in pivot_wider
library(dplyr)
library(tidyr)
a %>%
mutate(n = 1) %>%
pivot_wider(names_from = type, values_from = n, values_fill = list(n = 0))
# A tibble: 3 x 4
# name a b c
# <fct> <dbl> <dbl> <dbl>
#1 sam 1 0 0
#2 rob 0 1 0
#3 tom 0 0 1
In base R, this is even easier:
table(a)
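Note that table(a) returns a table object rather than a data frame. If a data frame with a name column is needed, one possible sketch (the object name wide is only illustrative) is:

```r
a <- data.frame(name = c("sam", "rob", "tom"),
                type = c("a", "b", "c"))

# Convert the 2-d contingency table to a data frame, keeping its shape
wide <- as.data.frame.matrix(table(a))

# Move the row names into a regular column
wide <- cbind(name = rownames(wide), wide)
rownames(wide) <- NULL

wide
```

The rows come out in alphabetical order of name (rob, sam, tom), not in the original row order.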

Going older school, reshape2::dcast, or the thriving data.table::dcast, let you do this by specifying an aggregate function:
reshape2::dcast(a, name ~ type, fun.aggregate = length)
# name a b c
# 1 rob 0 1 0
# 2 sam 1 0 0
# 3 tom 0 0 1
data.table::dcast(data.table::setDT(a), name ~ type, fun.aggregate = length)
# name a b c
# 1: rob 0 1 0
# 2: sam 1 0 0
# 3: tom 0 0 1

Related

tidyverse do a pivot_wider with two different reshaping strategies (creating categorical and binary columns)

Using the following data:
df <- data.frame(id = c("A", "B", "C", "A", "B", "A"),
value = c(1, 2, 3, 4, 5, 6))
I want to pivot_wider this data so that the reshaping creates two different sets of columns:
One set where I create a bunch of binary columns that take the column names from the value columns (e.g. bin_1, bin_2 and so on) and that are coded as 0/1.
An additional set where I create as many necessary columns to store the values in a "categorical" way. Here, id "A" has three values, so I want to create three columns cat_1, cat_2, cat_3 and for IDs B and C I want to fill them up with NAs if there's no value.
Now, I know how to create these two things separately from each other and merge them afterwards via a left_join.
However, my question is: can it be done in one pipeline, where I do two subsequent pivot_widers? I tried, but it doesn't work (obviously because my way of copying the value column and then try to use one for the binary reshape and one for the categorial reshape is wrong).
Any ideas?
Code so far that works:
df1 <- df %>%
group_by(id) %>%
mutate(group_id = 1:n()) %>%
ungroup() %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value)
df2 <- df %>%
mutate(dummy = 1) %>%
arrange(value) %>%
pivot_wider(names_from = value,
names_prefix = "bin_",
values_from = dummy,
values_fill = list(dummy = 0),
values_fn = list(dummy = length))
df <- df1 %>%
left_join(., df2, by = "id")
Expected output:
# A tibble: 3 x 10
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0
With the addition of purrr, you could do:
map(.x = reduce(range(df$value), `:`),
~ df %>%
group_by(id) %>%
mutate(!!paste0("bin_", .x) := as.numeric(.x %in% value))) %>%
reduce(full_join) %>%
mutate(cats = paste0("cat_", row_number())) %>%
pivot_wider(names_from = "cats",
values_from = "value")
id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 1 0 1 1 4 6
2 B 0 1 0 0 1 0 2 5 NA
3 C 0 0 1 0 0 0 3 NA NA
In base you can try:
tt <- unstack(df[2:1])
x <- cbind(t(sapply(tt, "[", seq_len(max(lengths(tt))))),
t(+sapply(names(tt), "%in%", x=df$id)))
colnames(x) <- c(paste0("cat_", seq_len(max(lengths(tt)))),
paste0("bin_", seq_len(nrow(df))))
x
# cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
#A 1 4 6 1 0 0 1 0 1
#B 2 5 NA 0 1 0 0 1 0
#C 3 NA NA 0 0 1 0 0 0
This slightly modifies your approach: it shortens the df2 code and puts everything in one pipe, using the list and . trick, which lets you work on two versions of df in the same call.
It's not much of an improvement on what you have done, but it is now all in one call. I can't think of a way to do it without a merge/join.
library(tidyverse)
df %>%
list(
pivot_wider(., id_cols = id,
names_from = value,
names_prefix = "bin_") %>%
mutate_if(is.numeric, ~ +(!is.na(.))), #convert to binary
group_by(., id) %>%
mutate(group_id = 1:n()) %>%
ungroup() %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value)
) %>%
.[c(2:3)] %>%
reduce(left_join)
# id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
# <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 1 0 1 1 4 6
# 2 B 0 1 0 0 1 0 2 5 NA
# 3 C 0 0 1 0 0 0 3 NA NA
You can even combine both of your pipelines into one without creating any intermediate objects:
df %>%
group_by(id) %>%
mutate(group_id = row_number()) %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value) %>%
left_join(df %>%
mutate(dummy = 1) %>%
arrange(value) %>%
pivot_wider(names_from = value,
names_prefix = "bin_",
values_from = dummy,
values_fill = list(dummy = 0),
values_fn = list(dummy = length)), by = "id")
# A tibble: 3 x 10
# Groups: id [3]
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0

r count values in rows after dcast

I want to sum all values in each row of a data frame after performing a dcast operation from the reshape2 package. The problem is that every row gets the same value (10), which is the sum over all rows combined. The values should be 4, 2, 4.
Example data with code:
df <- data.frame(x = as.factor(c("A","A","A","A","B","B","C","C","C","C")),
y = as.factor(c("AA","AB","AA","AC","BB","BA","CC","CC","CC","CD")),
z = c("var1","var1","var2","var1","var2","var1","var1","var2","var2","var1"))
df2 <- df %>%
group_by(x,y) %>%
summarise(num = n()) %>%
ungroup()
df3 <- dcast(df2,x~y, fill = 0 )
df3$total <- sum(df3$AA,df3$AB,df3$AC,df3$BA,df3$BB,df3$CC,df3$CD)
sum gives you one combined value, and that value is recycled into every row.
sum(df3$AA,df3$AB,df3$AC,df3$BA,df3$BB,df3$CC,df3$CD)
#[1] 10
You need rowSums to get sum of each row separately.
df3$total <- rowSums(df3[-1])
Here is a simplified tidyverse approach starting from df :
library(dplyr)
library(tidyr)
df %>%
count(x, y, name = 'num') %>%
pivot_wider(names_from = y, values_from = num, values_fill = 0) %>%
mutate(total = rowSums(select(., AA:CD)))
# x AA AB AC BA BB CC CD total
# <fct> <int> <int> <int> <int> <int> <int> <int> <dbl>
#1 A 2 1 1 0 0 0 0 4
#2 B 0 0 0 1 1 0 0 2
#3 C 0 0 0 0 0 3 1 4
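If the column names are not known in advance, the AA:CD range can be avoided. A minimal sketch, assuming dplyr 1.0 or later and that every numeric column in the widened result is a count to be summed:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  x = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"),
  y = c("AA", "AB", "AA", "AC", "BB", "BA", "CC", "CC", "CC", "CD")
)

res <- df %>%
  count(x, y, name = "num") %>%
  pivot_wider(names_from = y, values_from = num, values_fill = 0) %>%
  # sum across whatever numeric columns pivot_wider produced
  mutate(total = rowSums(across(where(is.numeric))))

res$total  # 4 2 4
```

This breaks if the id columns are themselves numeric, since they would be included in the sum. In that case, select the count columns explicitly.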
We can specify the values_fn in pivot_wider and also use adorn_totals from janitor
library(dplyr)
library(tidyr)
library(janitor)
df %>%
pivot_wider(names_from = y, values_from = z, values_fill = 0,
values_fn = length) %>%
adorn_totals("col")
Output:
# x AA AB AC BB BA CC CD Total
# A 2 1 1 0 0 0 0 4
# B 0 0 0 1 1 0 0 2
# C 0 0 0 0 0 3 1 4
Or using base R with xtabs and addmargins
addmargins(xtabs(z ~ x + y, transform(df, z = 1)), 2)
# y
#x AA AB AC BA BB CC CD Sum
# A 2 1 1 0 0 0 0 4
# B 0 0 0 1 1 0 0 2
# C 0 0 0 0 0 3 1 4

from column with factors to two different column with 0, 1

I have a column containing the values group1 and group2 in a data frame.
group <- c( "group1", "group1", "group2", "group1", "group2" )
value<- c(1:5)
dat <- data.frame(value, group)
I want to make it like this-
group1 <- c(1, 1, 0, 1, 0)
group2 <- c(0, 0, 1, 0, 1)
dat<- data.frame(value, group1, group2)
I tried this but have to remove the group column later
dat<- dat %>%
mutate( group1 = ifelse(data1$group =="group1", 1, 0 ),
group2 = ifelse(data1$group =="group2", 1, 0 ) )
Is there a nicer way to do this?
Thanks in advance for your help.
You could create a dummy column and get data in wide format.
library(dplyr)
library(tidyr)
dat %>%
mutate(n = 1) %>%
pivot_wider(names_from = group, values_from = n, values_fill = 0) -> result
# value group1 group2
# <int> <dbl> <dbl>
#1 1 1 0
#2 2 1 0
#3 3 0 1
#4 4 1 0
#5 5 0 1
Or in base R use table :
table(dat)
# group
#value group1 group2
# 1 1 0
# 2 1 0
# 3 0 1
# 4 1 0
# 5 0 1
A base R option using reshape
replace(
out <- reshape(
cbind(dat, q = 1),
direction = "wide",
idvar = "value",
timevar = "group"
),
is.na(out),
0
)
giving
value q.group1 q.group2
1 1 1 0
2 2 1 0
3 3 0 1
4 4 1 0
5 5 0 1
We can use data.table
library(data.table)
dcast(setDT(dat), value ~ group, length)
# value group1 group2
#1: 1 1 0
#2: 2 1 0
#3: 3 0 1
#4: 4 1 0
#5: 5 0 1
Or this can be done with pivot_wider in a single step by specifying values_fn
library(dplyr)
library(tidyr)
dat %>%
pivot_wider(names_from = group, values_from = group,
values_fn = length, values_fill = 0)
# A tibble: 5 x 3
# value group1 group2
# <int> <int> <int>
#1 1 1 0
#2 2 1 0
#3 3 0 1
#4 4 1 0
#5 5 0 1
Insert %>% select(!"group") at the end of the dplyr pipe. Also remove data1$ from it - you probably meant dat, even that's not needed.
dat %>%
mutate(group1 = ifelse(group =="group1", 1, 0 ),
group2 = ifelse(group =="group2", 1, 0 )) %>%
select(!"group")
value group1 group2
1 1 1 0
2 2 1 0
3 3 0 1
4 4 1 0
5 5 0 1
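Another base R route, not mentioned in the answers above, is model.matrix, which builds one indicator column per factor level. This is a sketch under the assumption that the prefix model.matrix adds to the column names (here the variable name group) should be stripped. The names dummies and result are illustrative.

```r
dat <- data.frame(value = 1:5,
                  group = c("group1", "group1", "group2", "group1", "group2"))

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ group - 1, data = dat)

# Columns are named "groupgroup1", "groupgroup2"; remove the variable-name prefix
colnames(dummies) <- sub("^group", "", colnames(dummies))

result <- cbind(dat["value"], dummies)
result
```

Unlike the pivot_wider and table approaches, this does not require value to identify rows uniquely, because each input row simply maps to one output row.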

distinct cases for two variables by grouping and counting

We can use the following data frame as an example:
Case <- c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul", "Hannah", "Herbert")
Procedure <- c("1", "1", "2", "3", "3", "4", "1", "1")
Location <- c("a", "a", "b", "a", "a", "b", "c", "a")
(df <- data.frame(Case, Procedure, Location))
Case Procedure Location
1 Siddhartha 1 a
2 Siddhartha 1 a
3 Siddhartha 2 b
4 Paul 3 a
5 Paul 3 a
6 Paul 4 b
7 Hannah 1 c
8 Herbert 1 a
Now I do the following:
df %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
which gives me:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 3 0 1
2 3 2 0 0
3 2 0 1 0
4 4 0 1 0
This is not exactly what I want, though. What I want is the following data frame:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 3 1 0 0
3 2 0 1 0
4 4 0 1 0
Notice the difference in Procedure 1 and 3.
So what I would like is a function that counts the number of DISTINCT cases for each Procedure AND each Location. The function should also work on varying data frames, where there are different (unknown) cases and procedures.
For the original data frame
df %>%
distinct() %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
does not work, since it is ignoring the "distinct". What works (also for the original data frame!) is the following:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case))
That gives me the following though:
# A tibble: 5 x 3
# Groups: Procedure [4]
Procedure Location Anzahl
<fct> <fct> <int>
1 1 a 2
2 1 c 1
3 2 a 1
4 3 b 1
5 4 b 1
But how do I add pivot_wider so that it is also sorted by location? If I try to add it, I get the following error:
"Error: This tidyselect interface doesn't support predicates yet.
i Contact the package author and suggest using eval_select()."
Also it is very confusing to me, why the solution of Ronak works for the example data frame but not for the original. I can't spot important differences in these two data frames.
Regards
You can do it with a single call to pivot_wider and take advantage of the argument values_fn, which applies a function to the values
df %>%
pivot_wider(names_from = Location,
values_from = Case,
values_fn = list(Case = n_distinct),
values_fill = list(Case = 0))
which gives,
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 2 0 1 0
3 3 1 0 0
4 4 0 1 0
A simple fix is to add distinct or unique before counting
library(dplyr)
library(tidyr)
df %>%
distinct() %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
# A tibble: 4 x 4
# Procedure a b c
# <chr> <int> <int> <int>
#1 1 2 0 1
#2 3 1 0 0
#3 2 0 1 0
#4 4 0 1 0
For OP's data they need :
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
pivot_wider(names_from = Location, values_from = Anzahl,
values_fill = list(Anzahl = 0))

Frequency between couples of words

Having a data frame like this:
df <- data.frame(id = c(1,2,3,4,5), keywords = c("google, yahoo, air, cookie", "cookie, air", "air, cookie", "google", "yahoo, google"))
How is it possible to extract a table like
df_binary_exist <- data.frame(id = c(1,2,3,4,5), google = c(1,0,0,1,1), yahoo = c(1,0,0,0,1), air = c(1,1,1,0,0), cookie = c(1,1,1,0,0))
df_binary_exist
id google yahoo air cookie
1 1 1 1 1 1
2 2 0 0 1 1
3 3 0 0 1 1
4 4 1 0 0 0
5 5 1 1 0 0
and from this table find the most frequent couples?
df_frequency <- data.frame(couple = c("yahoo-google", "cookie-air"), freq = c(2,3))
df_frequency
couple freq
1 yahoo-google 2
2 cookie-air 3
The first part can be achieved by using separate_rows, count and spread
library(dplyr)
library(tidyr)
df1 <- df %>% separate_rows(keywords)
df1 %>%
dplyr::count(id, keywords) %>%
spread(keywords, n, fill = 0)
# id air cookie google yahoo
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1
#2 2 1 1 0 0
#3 3 1 1 0 0
#4 4 0 0 1 0
#5 5 0 0 1 1
For second part I used a base R method where we first split keywords based on id, paste combination of every 2 elements and count their frequency using table.
data.frame(sort(table(unlist(sapply(split(df1$keywords, df1$id), function(x)
combn(sort(x), pmin(2, length(x)), paste, collapse = "-")))), decreasing = TRUE))
# Var1 Freq
#1 air-cookie 3
#2 google-yahoo 2
#3 air-google 1
#4 air-yahoo 1
#5 cookie-google 1
#6 cookie-yahoo 1
#7 google 1
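Since the question asks for the pairs to be derived from the binary table itself, a cross-product of that table is another option. This is only a sketch: it assumes the table holds strictly 0/1 values, and the names m, co, idx, and pairs are illustrative.

```r
df_binary_exist <- data.frame(id = 1:5,
                              google = c(1, 0, 0, 1, 1),
                              yahoo  = c(1, 0, 0, 0, 1),
                              air    = c(1, 1, 1, 0, 0),
                              cookie = c(1, 1, 1, 0, 0))

m <- as.matrix(df_binary_exist[-1])

# co[i, j] = number of ids that contain both keyword i and keyword j
co <- crossprod(m)

# Keep each unordered pair once (upper triangle, diagonal excluded)
idx <- which(upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(
  couple = paste(rownames(co)[idx[, "row"]], colnames(co)[idx[, "col"]], sep = "-"),
  freq   = co[idx]
)
pairs <- pairs[order(-pairs$freq), ]

head(pairs, 2)
# air-cookie 3, google-yahoo 2
```

The diagonal of co holds the per-keyword frequencies, which is why it is excluded when listing pairs.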
One tidyverse possibility could be:
df %>%
mutate(keywords = strsplit(keywords, ", ", fixed = TRUE)) %>%
unnest(keywords) %>%
full_join(df %>%
mutate(keywords = strsplit(keywords, ", ", fixed = TRUE)) %>%
unnest(keywords), by = c("id" = "id")) %>%
filter(keywords.x != keywords.y) %>%
count(keywords.x, keywords.y) %>%
transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
n) %>%
distinct(keywords, .keep_all = TRUE)
keywords n
<chr> <int>
1 cookie-air 3
2 google-air 1
3 yahoo-air 1
4 google-cookie 1
5 yahoo-cookie 1
6 yahoo-google 2
First, it splits the "keywords" column on ", " and performs a full join of the result with itself on id. Second, it filters out the rows where both keywords are the same, since the OP is interested in pairs of distinct values. Third, it counts the number of occurrences of each ordered pair. Finally, it builds an order-independent label for each pair (using pmax and pmin) and keeps only the distinct rows based on that label.
Or the same using separate_rows():
df %>%
separate_rows(keywords) %>%
full_join(df %>%
separate_rows(keywords), by = c("id" = "id")) %>%
filter(keywords.x != keywords.y) %>%
count(keywords.x, keywords.y) %>%
transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
n) %>%
distinct(keywords, .keep_all = TRUE)
The first part can also be done easily with mtabulate from the qdapTools package:
library(qdapTools)
cbind(df[1], mtabulate(strsplit(as.character(df$keywords), ", ")))
# id air cookie google yahoo
#1 1 1 1 1 1
#2 2 1 1 0 0
#3 3 1 1 0 0
#4 4 0 0 1 0
#5 5 0 0 1 1

Resources