How to summarize values across multiple columns? - r

I have a dataframe that looks like this:
1 2 3 4 5
A B A B A
C B B B B
A C A B B
And I would like to summarize my data in a frequency table like this:
A B C
1 2 0 1
2 0 2 1
3 2 1 0
4 0 3 0
5 1 2 0
How would I be able to do this?

You can use the following solution:
library(tidyr)
library(janitor)
tab %>%
pivot_longer(everything(), names_to = "nm", values_to = "val",
names_prefix = "X") %>%
tabyl(nm, val)
nm A B C
1 2 0 1
2 0 2 1
3 2 1 0
4 0 3 0
5 1 2 0

You could use
library(tidyr)
data %>%
pivot_longer(everything()) %>%
{ table(.$name, .$value) }
which returns
A B C
1 2 0 1
2 0 2 1
3 2 1 0
4 0 3 0
5 1 2 0

Another table option, using stack:
t(table(stack(df)))
values
ind A B C
1 2 0 1
2 0 2 1
3 2 1 0
4 0 3 0
5 1 2 0
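As a minimal sketch of the stack idea, assuming the sample data is stored in a data frame called df1 (the name and the check.names = FALSE construction are assumptions made here for illustration), the long form can be tabulated directly and cross-checked against a manual pairing of column labels and values:

```r
# Sample data from the question; check.names = FALSE keeps the names "1".."5"
df1 <- data.frame(`1` = c("A", "C", "A"), `2` = c("B", "B", "C"),
                  `3` = c("A", "B", "A"), `4` = c("B", "B", "B"),
                  `5` = c("A", "B", "B"), check.names = FALSE)

# stack() returns two columns: values (cell contents) and ind (source column)
long <- stack(df1)
tab <- table(long$ind, long$values)

# Same counts, pairing each value with its column label by hand;
# unlist() walks the data frame column by column, matching rep(..., each =)
by_col <- table(rep(names(df1), each = nrow(df1)),
                unlist(df1, use.names = FALSE))

tab
```

Both tables should agree cell by cell, which is a cheap way to confirm that labels and values were paired in the same order.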

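An option in base R with table. The column index from col() runs column by column, so the values must be unlisted in the same column-major order (unlist(df1), not c(t(df1)), which runs row by row and pairs values with the wrong columns):
table(c(col(df1)), unlist(df1))
A B C
1 2 0 1
2 0 2 1
3 2 1 0
4 0 3 0
5 1 2 0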
Data
df1 <- structure(list(`1` = c("A", "C", "A"), `2` = c("B", "B", "C"),
`3` = c("A", "B", "A"), `4` = c("B", "B", "B"), `5` = c("A",
"B", "B")), class = "data.frame", row.names = c(NA, -3L))

Related

adding 1/0 columns from list all at once

I have a dataframe with identifiers and storm categories. Right now, the categories are in one column, but I want to add columns for each category with a 1 or 0 value. I don't think I want to reshape the data as wide, because in the actual dataset there are a number of long format variables I want to keep. I am using a series of ifelse statements currently, but it feels like there is probably a much better way:
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
cat = c("TS", NA, NA, "TS", "1", "1", NA, NA, "2", NA, NA, NA)
)
df$cat_TS <- ifelse(df$cat == "TS", 1, 0) %>% replace_na(., 0)
df$cat_1 <- ifelse(df$cat == "1", 1, 0) %>% replace_na(., 0)
df$cat_2 <- ifelse(df$cat == "2", 1, 0) %>% replace_na(., 0)
We may use pivot_wider: create a row-sequence column 'rn' so that every row stays distinct, then reshape to wide with values_fn = length and values_fill = 0
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number(), cat1 = cat) %>%
pivot_wider(names_from = cat1, values_from = cat1,
values_fn = length, values_fill = 0, names_prefix = "cat_")%>%
select(-cat_NA, -rn)
-output
# A tibble: 12 × 5
ID cat cat_TS cat_1 cat_2
<chr> <chr> <int> <int> <int>
1 A TS 1 0 0
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 1 0 0
5 A 1 0 1 0
6 B 1 0 1 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 0 1
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
or use fastDummies
library(fastDummies)
df %>%
dummy_cols("cat", remove_selected_columns = FALSE, ignore_na = TRUE) %>%
mutate(across(starts_with('cat_'), ~ replace_na(.x, 0)))
-output
ID cat cat_1 cat_2 cat_TS
1 A TS 0 0 1
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 0 0 1
5 A 1 1 0 0
6 B 1 1 0 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 1 0
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
An idea using base R
First, get all unique category names
cats <- unique(df$cat[!is.na(df$cat)])
cats
[1] "TS" "1" "2"
Then look for matches in column cat for each entry in cats. PS: I left the cat column in to show that the matching is right. To drop it, pass df["ID"] instead of df as the first argument to cbind.
cbind(df, setNames(data.frame(sapply(seq_along(cats), function(x)
df$cat %in% cats[x]) * 1), cats))
ID cat TS 1 2
1 A TS 1 0 0
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 1 0 0
5 A 1 0 1 0
6 B 1 0 1 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 0 1
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
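A minimal sketch of the same base R idea, assuming the cat_ prefix from the question is wanted on the new columns (the intermediate names cats, ind and out are chosen here for illustration). sapply over the category names yields one 0/1 column per category, and %in% treats NA as a non-match, so no separate NA replacement is needed:

```r
df <- data.frame(
  ID  = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
  cat = c("TS", NA, NA, "TS", "1", "1", NA, NA, "2", NA, NA, NA)
)

# Distinct non-missing categories, in order of first appearance
cats <- unique(as.character(df$cat[!is.na(df$cat)]))

# One integer indicator column per category; NA rows become 0
ind <- sapply(cats, function(k) as.integer(df$cat %in% k))
colnames(ind) <- paste0("cat_", cats)

out <- cbind(df, ind)
out
```

Because the original columns are kept as they are, any other long-format variables in the real dataset would pass through unchanged.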

Separate rows to make dummy rows

Consider this dataframe:
dat <- structure(list(col1 = c(1, 2, 0), col2 = c(0, 3, 2), col3 = c(1, 2, 3)), class = "data.frame", row.names = c(NA, -3L))
col1 col2 col3
1 1 0 1
2 2 3 2
3 0 2 3
How can one dummify rows? That is, whenever a row has more than one non-zero value, split it into multiple rows with one non-zero value per row.
In this case, this would be:
col1 col2 col3
1 1 0 0
2 0 0 1
3 2 0 0
4 0 3 0
5 0 0 2
6 0 2 0
7 0 0 3
You can do:
library(tidyverse)
dat |>
pivot_longer(everything()) |>
mutate(id = 1:n()) |>
pivot_wider(values_fill = 0) |>
filter(!if_all(-id, ~ . == 0)) |>
select(-id)
# A tibble: 7 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 1 0 0
2 0 0 1
3 2 0 0
4 0 3 0
5 0 0 2
6 0 2 0
7 0 0 3
Another approach, here I used data.table
library(data.table)
x <- rbindlist(apply(dat, 1, function(x) {
x <- data.table(diag(x, ncol(dat)))
x[colSums(x) > 0]
}))
setnames(x, names(dat))
x
# col1 col2 col3
# 1: 1 0 0
# 2: 0 0 1
# 3: 2 0 0
# 4: 0 3 0
# 5: 0 0 2
# 6: 0 2 0
# 7: 0 0 3
A very ugly way is:
library(tidyverse)
dat %>%
apply(1, diag) %>%
matrix(nrow = 3) %>%
t() %>%
as.data.frame() %>%
rename_with(~ names(dat), everything()) %>%
filter(rowSums(.) != 0)
col1 col2 col3
1 1 0 0
2 0 0 1
3 2 0 0
4 0 3 0
5 0 0 2
6 0 2 0
7 0 0 3
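For comparison, a base R sketch of the same diag() trick without extra packages (the names pieces and res are illustrative): each row becomes a diagonal matrix, the all-zero rows of that matrix are dropped, and the pieces are stacked back together.

```r
dat <- data.frame(col1 = c(1, 2, 0), col2 = c(0, 3, 2), col3 = c(1, 2, 3))

pieces <- lapply(seq_len(nrow(dat)), function(i) {
  # Put the i-th row on the diagonal of an ncol x ncol matrix
  m <- diag(unlist(dat[i, ], use.names = FALSE), nrow = ncol(dat))
  # Keep only the rows that carry a non-zero value
  m[rowSums(m != 0) > 0, , drop = FALSE]
})

res <- as.data.frame(do.call(rbind, pieces))
names(res) <- names(dat)
res
```

Passing nrow explicitly keeps diag() from treating a single number as a matrix size, which matters if the data ever has only one column.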

Transform each column factors in a column containing just `0` or `1`

I'm trying to transform each level of my factor column into its own column containing just 0 or 1. There is probably a function for that, or someone has already asked, but I couldn't find it. Here is a simple example to show what I need:
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
measure1 = c(1:9))
#as result:
# group_A group_B group_C measure1
# 1 1 0 0 1
# 1 1 0 0 2
# 1 1 0 0 3
# 1 0 1 0 4
# 1 0 1 0 5
# 1 0 0 1 6
# 1 0 0 1 7
# 1 0 0 1 8
# 1 0 0 1 9
Any hint on how I can do that?
We may use dummy_cols from fastDummies
library(fastDummies)
library(dplyr)
test %>%
rename(group = 'my_groups') %>%
dummy_cols('group', remove_selected_columns = TRUE) %>%
select(starts_with('group'), measure1)
-output
group_A group_B group_C measure1
1 1 0 0 1
2 1 0 0 2
3 1 0 0 3
4 0 1 0 4
5 0 1 0 5
6 0 0 1 6
7 0 0 1 7
8 0 0 1 8
9 0 0 1 9
Fortunately, there's a one-function Base R solution.
This type of problem happens a lot, and model.matrix() is built exactly for this.
# the "+ 0" is to avoid adding a column for the intercept.
model.matrix(~ my_groups + measure1 + 0, data=test)
Output:
my_groupsA my_groupsB my_groupsC measure1
1 1 0 0 1
2 1 0 0 2
3 1 0 0 3
4 0 1 0 4
5 0 1 0 5
6 0 0 1 6
7 0 0 1 7
8 0 0 1 8
9 0 0 1 9
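If the group_A style names from the question are wanted, a small sketch along these lines could rename the model.matrix() columns afterwards (the sub() renaming step and the names mm and res are additions for illustration, not part of the answer above):

```r
test <- data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                   measure1 = 1:9)

# "+ 0" drops the intercept so every level gets its own indicator column
mm <- model.matrix(~ my_groups + 0, data = test)

# Swap the variable-name prefix for "group_"
colnames(mm) <- sub("^my_groups", "group_", colnames(mm))

res <- cbind(as.data.frame(mm), measure1 = test$measure1)
res
```

Note that model.matrix() returns a numeric matrix, so it needs converting back to a data frame if later steps expect one.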
Here's a base R solution, constructing the matrix using expand.grid, then adding the required names.
res <- data.frame( t( unique( matrix( as.numeric( do.call("==", expand.grid(
test$my_groups, test$my_groups) ) ), dim(test)[1] ) ) ), test$measure1 )
colnames(res) <- c( paste0( "group_", unique(test$my_groups) ), colnames(test)[2] )
res
group_A group_B group_C measure1
1 1 0 0 1
2 1 0 0 2
3 1 0 0 3
4 0 1 0 4
5 0 1 0 5
6 0 0 1 6
7 0 0 1 7
8 0 0 1 8
9 0 0 1 9
We can try this using dplyr or purrr.
library(tidyverse)
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
measure1 = c(1:9))
dummyfy <-
as_mapper(~{
len_row <- vector('numeric', nrow(test))
len_row[.] <- c(1)
len_row}
)
data <- pivot_wider(test, names_from = my_groups, values_from = measure1)
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
map(data, ~reduce(., c)) %>%
map_dfr(dummyfy) %>%
bind_cols(test[-1])
#> # A tibble: 9 × 4
#> A B C measure1
#> <dbl> <dbl> <dbl> <int>
#> 1 1 0 0 1
#> 2 1 0 0 2
#> 3 1 0 0 3
#> 4 0 1 0 4
#> 5 0 1 0 5
#> 6 0 0 1 6
#> 7 0 0 1 7
#> 8 0 0 1 8
#> 9 0 0 1 9
#equivalent using across:
data %>% summarise(across(everything(), ~reduce(., c) %>% dummyfy)) %>% bind_cols(test[-1])
#> # A tibble: 9 × 4
#> A B C measure1
#> <dbl> <dbl> <dbl> <int>
#> 1 1 0 0 1
#> 2 1 0 0 2
#> 3 1 0 0 3
#> 4 0 1 0 4
#> 5 0 1 0 5
#> 6 0 0 1 6
#> 7 0 0 1 7
#> 8 0 0 1 8
#> 9 0 0 1 9
Created on 2021-12-03 by the reprex package (v2.0.1)

Detect a pattern in a column with R

I am trying to calculate how many times a person moved from one job to another. This can be calculated every time the Job column has this pattern 1 -> 0 -> 1.
In this example, one rotation occurred:
Person Job
A 1
A 0
A 1
A 1
In this other example, person B had one rotation as well.
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1
What would be a good approach to measure this pattern in a new column 'Rotation', by person?
Person Job Rotation
A 1 0
A 0 0
A 1 1
A 1 1
B 1 0
B 0 0
B 0 0
B 1 1
You can use a regular expression to count each 1 -> 0 -> 1 run as one rotation. The pattern "(?<=1)0+(?=1)" matches a run of zeros only when it is preceded by a 1 and followed by a 1.
library(tidyverse)
df%>%
group_by(Person)%>%
mutate(Rotation=str_count(accumulate(Job,str_c,collapse=""),"(?<=1)0+(?=1)"))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1
One solution is to use lag with default = 0 and take the cumulative sum of the condition that the value changes from 0 to 1. Subtracting 1 from the cumsum discounts each person's first job, so this assumes every person starts with Job == 1.
A dplyr solution:
library(dplyr)
df %>% group_by(Person) %>%
mutate(Rotation = cumsum(lag(Job, default = 0) == 0 & Job ==1) - 1) %>%
as.data.frame()
# Person Job Rotation
# 1 A 1 0
# 2 A 0 0
# 3 A 1 1
# 4 A 1 1
# 5 B 1 0
# 6 B 0 0
# 7 B 0 0
# 8 B 1 1
Data:
df <- read.table(text ="
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1",
header = TRUE, stringsAsFactors = FALSE)
Here is an option with data.table
library(data.table)
setDT(df)[, Rotation := +(grepl("101", do.call(paste0,
shift(Job, 0:.N, fill = 0)))), Person]
df
# Person Job Rotation
# 1: A 1 0
# 2: A 0 0
# 3: A 1 1
# 4: A 1 1
# 5: B 1 0
# 6: B 0 0
# 7: B 0 0
# 8: B 1 0
# 9: C 0 0
#10: C 1 0
#11: C 0 0
#12: C 1 1
A base R option would be
f1 <- function(x) Reduce(paste0, x, accumulate = TRUE)
df$Rotation <- with(df, +grepl("101", ave(Job, Person, FUN = f1)))
data
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
I'm assuming that if a person starts unemployed,
the first job they get doesn't count as rotation.
In that case:
library(dplyr)
rotation <- function(x) {
# this will have 1 when a person got a new job
dif <- c(0L, diff(x))
dif[dif < 0L] <- 0L
if (x[1L] == 0L) {
# unemployed at the beginning,
# first job doesn't count as change from one to another
dif[which.max(dif)] <- 0L
}
# return
cumsum(dif)
}
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
df %>%
group_by(Person) %>%
mutate(Rotation = rotation(Job))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1
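The same rule (a new job counts as a rotation only if the person held a job earlier) can also be sketched in base R without an explicit branch. The helper name count_rotations is hypothetical, and the sketch assumes Job contains only 0 and 1 with rows already in time order within each person:

```r
count_rotations <- function(job) {
  # TRUE where a job starts: current value 1, previous value 0 (or first row)
  starts_job <- job == 1 & c(0, head(job, -1)) == 0
  # TRUE where at least one earlier row already had a job
  had_job_before <- (cumsum(job) - job) > 0
  # Running count of job starts that follow an earlier job
  cumsum(starts_job & had_job_before)
}

df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
                 Job = as.integer(c(1, 0, 1, 1,
                                    1, 0, 0, 1,
                                    0, 1, 0, 1)))

# ave() applies the helper within each person and keeps the row order
df$Rotation <- ave(df$Job, df$Person, FUN = count_rotations)
df
```

Unlike a fixed "101" pattern, this treats any number of consecutive zeros between two jobs as a single gap.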

How to create dichotomous variables based on some factors in r?

The initial dataframe is:
Factor1 Factor2 Factor3
A B C
B C NA
A NA NA
B C D
E NA NA
I want to create 5 dichotomous variables based on the above factor variables. The rule should be the new variable A will get 1 if either Factor1 or Factor2 or Factor3 contains an A otherwise A should be 0, and so on. The newly created variables should look like:
A B C D E
1 1 1 0 0
0 1 1 0 0
1 0 0 0 0
0 1 1 1 0
0 0 0 0 1
We can use table to do this. We replicate the sequence of rows with the number of columns, unlist the dataset and get the frequency of values.
table(rep(1:nrow(df1), ncol(df1)), unlist(df1))
# A B C D E
# 1 1 1 1 0 0
# 2 0 1 1 0 0
# 3 1 0 0 0 0
# 4 0 1 1 1 0
# 5 0 0 0 0 1
If the same value can occur more than once in a row, the counts can exceed 1. In that case, convert the table to logical and then back to binary.
+(!!table(rep(1:nrow(df1), ncol(df1)), unlist(df1)))
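An alternative sketch that yields 0/1 directly, assuming the factor columns are stored as character (the names vals, lv and res are illustrative). For every distinct value, it checks whether any column in the row equals that value:

```r
df1 <- data.frame(Factor1 = c("A", "B", "A", "B", "E"),
                  Factor2 = c("B", "C", NA, "C", NA),
                  Factor3 = c("C", NA, NA, "D", NA),
                  stringsAsFactors = FALSE)

# Distinct non-missing values across all columns
vals <- unlist(df1, use.names = FALSE)
lv <- sort(unique(vals[!is.na(vals)]))

# One column per value: 1 if the value appears anywhere in the row, else 0
res <- sapply(lv, function(v) as.integer(rowSums(df1 == v, na.rm = TRUE) > 0))
res
```

Since the comparison is reduced to "appears at least once", repeated values within a row cannot push an entry above 1.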
data
df1 <- structure(list(Factor1 = c("A", "B", "A", "B", "E"),
Factor2 = c("B",
"C", NA, "C", NA), Factor3 = c("C", NA, NA, "D", NA)),
.Names = c("Factor1",
"Factor2", "Factor3"), class = "data.frame", row.names = c(NA, -5L))
