R - create dynamic indicator columns from values in character columns

I have data that looks like this:
library(dplyr)
d<-data.frame(ID=c(1,1,2,3,3,4), Quality=c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"), Area=c("East", "North", "North", "South", "East", "North"))
What I'd like to do is create one new column for each unique value in Quality, populate it with whether the row matches that value, and then aggregate by ID. I want to do the same for Area.
This is what I have for when Quality == Good:
d$Quality.Good <- 0
d$Quality.Good[d$Quality=="Good"] <- 1
e <- d %>%
group_by(ID) %>%
summarise(n=n(), MAX.Quality.Good = max(Quality.Good))
e
Output
# A tibble: 4 x 3
     ID     n MAX.Quality.Good
  <dbl> <int>            <dbl>
1     1     2                1
2     2     1                0
3     3     2                1
4     4     1                0
Is it possible to build a function that will loop through each character column and build an indicator column for Good, Bad, Ugly, North, East, South instead of copy pasting the above many more times?
Here's where I'm stuck:
library(stringr)
#vector of each Quality
e <-d %>%
group_by(Quality) %>%
summarise(n=n()) %>%
select(Quality)
e<-as.data.frame(e)
#create new column names
f <- str_c(names(e),".",e[,1])
#initialize list of new columns
d[f] <- 0
#I'm stuck after this...
Thank you!

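We can do this in base R with table: replicate the 'ID' column once per remaining column (the number of columns of the dataset minus 1) and paste the column names onto the unlisted values (excluding the 'ID' column). Note that table returns counts, so an ID with a repeated value gets 2 rather than 1.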
table(rep(d$ID, 2), paste0(names(d)[-1][col(d[-1])], unlist(d[-1])))
# AreaEast AreaNorth AreaSouth QualityBad QualityGood QualityUgly
# 1 1 1 0 1 1 0
# 2 0 1 0 0 0 1
# 3 1 0 1 0 2 0
# 4 0 1 0 0 0 1
or with tidyverse, gather into 'long' format, unite the 'key', 'val' columns to a single column, get the distinct rows, and spread into 'wide' format after creating a column of 1s.
library(tidyverse)
gather(d, key, val, -ID) %>%
unite(kv, key, val) %>%
distinct %>%
mutate(n = 1) %>%
spread(kv, n, fill = 0)
#ID Area_East Area_North Area_South Quality_Bad Quality_Good Quality_Ugly
#1 1 1 1 0 1 1 0
#2 2 0 1 0 0 0 1
#3 3 1 0 1 0 1 0
#4 4 0 1 0 0 0 1
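In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The following is a minimal sketch of the same pipeline with the newer verbs, assuming tidyr 1.1 or later (for names_sort and a scalar values_fill); the intermediate names key, val, kv and wide are only illustrative.

```r
library(dplyr)
library(tidyr)

d <- data.frame(ID = c(1, 1, 2, 3, 3, 4),
                Quality = c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"),
                Area = c("East", "North", "North", "South", "East", "North"))

wide <- d %>%
  # one row per (ID, column name, value)
  pivot_longer(-ID, names_to = "key", values_to = "val") %>%
  # build names such as "Quality_Good"
  unite(kv, key, val) %>%
  # drop repeats so each indicator is 0/1 rather than a count
  distinct() %>%
  mutate(n = 1) %>%
  # one column per indicator, sorted alphabetically, missing combinations set to 0
  pivot_wider(names_from = kv, values_from = n,
              values_fill = 0, names_sort = TRUE)

wide
```

names_sort = TRUE is used only to reproduce the alphabetical column order that spread gives; without it the columns appear in order of first occurrence.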

1) Base R Create the model matrix for each column (using function make_mm) and bind them together as a data frame m. Finally aggregate on ID. No packages are used.
make_mm <- function(nm, data) model.matrix(~ . - 1, data[nm])
m <- do.call("data.frame", lapply(names(d)[-1], make_mm, d))
with(d, aggregate(. ~ ID, m, max))
giving:
ID QualityBad QualityGood QualityUgly AreaEast AreaNorth AreaSouth
1 1 1 1 0 1 1 0
2 2 0 0 1 0 1 0
3 3 0 1 0 1 0 1
4 4 0 0 1 0 1 0
2) dplyr/purrr This could alternately be written as the following which is close to the code in the question but generalizes to all required columns. Note that here we make model data frames using make_md rather than making model matrices with make_mm. Also note that the dot in group_by(m, ID = .$ID) refers to d and not to m.
library(dplyr)
library(purrr)
make_md <- function(nm, data) {
data %>%
select(all_of(nm)) %>%
model.matrix(~ . - 1, .) %>%
as.data.frame
}
d %>% {
m <- map_dfc(names(.)[-1], make_md, .)
group_by(m, ID = .$ID) %>%
summarize_all(max) %>%
ungroup
}
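For readers who want to finish the explicit loop the question got stuck on, here is a minimal base R sketch. It assumes the character columns are stored as character (hence stringsAsFactors = FALSE) and uses the question's "Column.Value" naming; char_cols, ind and res are illustrative names.

```r
d <- data.frame(ID = c(1, 1, 2, 3, 3, 4),
                Quality = c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"),
                Area = c("East", "North", "North", "South", "East", "North"),
                stringsAsFactors = FALSE)

# every character column is a candidate for indicator columns
char_cols <- names(d)[sapply(d, is.character)]

# start from the ID column and add one 0/1 column per (column, value) pair
ind <- d["ID"]
for (col in char_cols) {
  for (val in unique(d[[col]])) {
    ind[[paste(col, val, sep = ".")]] <- as.integer(d[[col]] == val)
  }
}

# collapse to one row per ID: 1 if any row of that ID had the value
res <- aggregate(. ~ ID, data = ind, FUN = max)
res
```

Taking the max per ID mirrors the MAX.Quality.Good step in the question; replacing max with sum would give counts instead of indicators.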

Related

How to visualize the data if one participant has multiple entries in different rows?

I am currently working on a dataset that covers multiple participants. Some participants attended all follow-ups, whereas others skipped some.
For example, in the dataset below, participant 2 attended only the 3rd follow-up, and participant 3 attended only the 2nd and 3rd follow-ups. Some participants also have more than one row because they attended several follow-ups.
The original dataset has only the first two columns. Since I am aiming to create a progress chart, I tried to create an extra column for each visit using the code below:
participant <- c(1,1,1,2,3,3,4,5,5,5 )
visit <- c(1,2,3,3,2,3,1,1,2,3)
df <- data.frame(participant, visit)
df[,3] <- as.integer(df$visit=="1")
df[,4] <- as.integer(df$visit=="2")
df[,5] <- as.integer(df$visit=="3")
colnames(df)[colnames(df) %in% c("V3","V4","V5")] <- c(
"Visit1","Visit2","Visit3")
However, I am still having a hard time combining the rows of the same participant, so I could not proceed to making the chart (which I also have no clue about). I tried the 'reshape' function but it did not work out. The group_by function did not work either and still returned the original dataset:
df1 <- df[,-2]
df1 %>%
group_by(participant)
Which functions should I use in this case to:
combine the rows of the same participant?
produce the progress chart?
Thank you in advance for your help!
Based on your df you could produce the chart with
library(ggplot2)
library(dplyr)
df %>%
ggplot(aes(x = as.factor(visit),
y = as.factor(participant),
fill = as.factor(visit))) +
geom_tile(aes(width = 0.7, height = 0.7), color = "black") +
scale_fill_grey() +
xlab("Visit") +
ylab("Participants") +
guides(fill = "none")
If you need your data.frame in a wide format (similar to the image shown but with only one row per participant), use
library(tidyr)
library(dplyr)
df %>%
mutate(value = 1) %>%
pivot_wider(
names_from = visit,
values_from = value,
names_glue = "Visit{visit}",
values_fill = 0)
to get
# A tibble: 5 x 4
participant Visit1 Visit2 Visit3
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
I think you are looking for a way to dummify a variable.
There are several ways to do that.
I like the fastDummies package. You can use dummy_cols, with remove_selected_columns=TRUE.
df %>% fastDummies::dummy_cols(select_columns = 'visit',
remove_selected_columns = TRUE)
participant visit_1 visit_2 visit_3
1 1 1 0 0
2 1 0 1 0
3 1 0 0 1
4 2 0 0 1
5 3 0 1 0
6 3 0 0 1
7 4 1 0 0
8 5 1 0 0
9 5 0 1 0
10 5 0 0 1
You may want to pipe in a summarise operation to make the table even cleaner, as in:
df %>% fastDummies::dummy_cols(select_columns = 'visit', remove_selected_columns = TRUE)%>%
group_by(participant)%>%
summarise(across(starts_with('visit'), max))
# A tibble: 5 x 4
participant visit_1 visit_2 visit_3
<dbl> <int> <int> <int>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
In a way, this looks a bit like a pivoting operation too.
You may be interested in using tidyr::pivot_wider here as well.
EDIT: @MartinGal had just given a similar answer, so I removed my very similar pivot_wider version.
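If no extra packages are wanted, a cross-tabulation gives the same one-row-per-participant layout. This is a minimal base R sketch, assuming the df from the question; tab and wide are illustrative names.

```r
participant <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5)
visit <- c(1, 2, 3, 3, 2, 3, 1, 1, 2, 3)
df <- data.frame(participant, visit)

# count rows per (participant, visit), then cap at 1 to get indicators
tab <- table(df$participant, df$visit)
wide <- as.data.frame.matrix((tab > 0) * 1L)

# name the columns Visit1, Visit2, ... and restore the participant column
names(wide) <- paste0("Visit", names(wide))
wide <- cbind(participant = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```

The (tab > 0) step only matters if a participant could have the same visit recorded twice; with the sample data the counts are already 0 or 1.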

Selecting all columns that have some specific values

I have a data.frame with more than 50 columns and 10,000 rows. I want to select the columns that have 0 or 1 in them, excluding the other values in those columns.
sample data.frame is as below:
dummy_df <- data.frame(
id=1:4,
gender=c(4,1,0,1),
height=seq(150, 180,by = 10),
smoking=c(3,0,1,0)
)
I want to select all the columns with 0 or 1 values and exclude other values, like 4 in gender and 3 in smoking, as below:
gender smoking
1 0
0 1
1 0
but I have 50 columns in the actual data frame and I don't know which of them contain 0 or 1.
What I'm trying is:
dummy_df %>% select_if(~ all( . %in% 0:1))
Is this useful for you?
dummy_df %>%
select(- c(id, height)) %>%
rowwise() %>%
filter(any(c_across() == 0)|any(c_across() == 1))
# A tibble: 3 x 2
# Rowwise:
gender smoking
<dbl> <dbl>
1 1 0
2 0 1
3 1 0
EDIT:
If you don't know in advance which cols contain 0 and/or 1, you can determine that in base R:
temp <- dummy_df[sapply(dummy_df, function(x) any(x == 0|x == 1))]
Now you can filter for rows with 0 and/or 1:
temp %>%
rowwise() %>%
filter(any(c_across() == 0)|any(c_across() == 1))
I think it's more like a case of filter than select:
library(dplyr)
dummy_df %>%
filter(if_all(c(gender, smoking), ~ .x %in% c(0, 1)))
id gender height smoking
1 2 1 160 0
2 3 0 170 1
3 4 1 180 0
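When the relevant columns are not known in advance, the column selection and the row filter can be combined. The sketch below rests on an assumption that is not stated in the question: a column counts as an indicator column only if it contains both 0 and 1 (this is what keeps id, which contains 1 but never 0, out of the result). It assumes dplyr 1.0.4 or later for where and if_all; res is an illustrative name.

```r
library(dplyr)

dummy_df <- data.frame(
  id = 1:4,
  gender = c(4, 1, 0, 1),
  height = seq(150, 180, by = 10),
  smoking = c(3, 0, 1, 0)
)

res <- dummy_df %>%
  # assumption: keep only columns in which both 0 and 1 occur
  select(where(~ all(c(0, 1) %in% .x))) %>%
  # then keep only rows where every remaining column is 0 or 1
  filter(if_all(everything(), ~ .x %in% c(0, 1)))

res
```

With the sample data this keeps gender and smoking and drops the first row, which holds the 4 and the 3.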

How to calculate transition probabilities in R

I would like to calculate how often changes between values happen by person-year combination (panel data). This mimics Stata's command xttrans. The transition between rows 7 and 8 should not be included, since it crosses from one person to the next rather than occurring within one person.
df = data.frame(id=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year=seq(from=2003, to=2009, by=1),
health=c(3,1,2,2,5,1,1,1,2,3,2,1,1,2))
Here is a base R solution to calculate transition counts by id groups (Reduce is used to sum the per-id tables so that it also works with more than two ids):
with(df, Reduce(`+`, tapply(health, id, function(x){
x <- factor(x, levels = min(health, na.rm = TRUE):max(health, na.rm = TRUE))
table(x[-length(x)], x[-1])
})))
# 1 2 3 4 5
# 1 2 3 0 0 0
# 2 1 1 1 0 1
# 3 1 1 0 0 0
# 4 0 0 0 0 0
# 5 1 0 0 0 0
library(tidyverse)
# Calculate the last health status for each id
df <- df %>%
group_by(id) %>%
mutate(lastHealth=lag(health)) %>%
ungroup()
# Count number of existing transitions
transitions <- df %>%
group_by(health, lastHealth) %>%
summarise(N=n()) %>%
ungroup()
# Fill in the transition grid to include possible transitions that weren't observed
transitions <- transitions %>%
complete(health=1:5, lastHealth=1:5, fill=list(N=0))
# Present the transitions in the required format
transitions %>%
pivot_wider(names_from="health", values_from="N", names_prefix="health") %>%
filter(!is.na(lastHealth))
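Both answers return transition counts. To get transition probabilities, as in the title, each row of the count table can be divided by its row total. This is a minimal base R sketch, assuming only observed health states are used as levels (so no row sums to zero); lev, counts and probs are illustrative names.

```r
df <- data.frame(id = rep(1:2, each = 7),
                 year = rep(2003:2009, times = 2),
                 health = c(3, 1, 2, 2, 5, 1, 1, 1, 2, 3, 2, 1, 1, 2))

# use only the states that actually occur
lev <- sort(unique(df$health))

# one from/to table per id, summed over ids, so no transition crosses persons
counts <- Reduce(`+`, lapply(split(df$health, df$id), function(x) {
  x <- factor(x, levels = lev)
  table(from = x[-length(x)], to = x[-1])
}))

# divide each row by its total: P(to | from)
probs <- prop.table(counts, margin = 1)
round(probs, 2)
```

Each row of probs sums to 1. If unobserved states (such as 4 here) were kept as levels, their rows would be 0/0 and show up as NaN.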

How to filter for a combination of list arguments and multiple character strings in dplyr

Given a dataframe:
v1_attr1 <- c(1,0,0,0,1,0,0,0,1,1) %>% as.integer ()
v1_attr2 <- c(0,1,0,0,1,1,1,1,1,1) %>% as.integer ()
v2_attr1 <- c(0,0,1,0,0,1,1,1,0,0) %>% as.integer ()
v2_attr2 <- c(0,0,0,1,0,1,1,1,0,0) %>% as.integer ()
df <- data.frame (v1_attr1, v1_attr2, v2_attr1, v2_attr2)
How can I set a filter for the attr of each v[[x]]?
I tried the following code to get the number of rows in each data.frame filtered by attr.
library(dplyr)
# create list for vs
list_vs <- list ("v1", "v2")
# set multiple attr filter for each v[[x]] to get the respective number of rows in each filtered data.frame (presented in a list)
filtered <- lapply (list_vs, function (x){
df %>% filter (noquote(paste0(list_vs[[x]], "_attr1")) == 1 | noquote(paste0(list_vs[[x]], "_attr2")) == 1) %>%
nrow ()
})
Although this code doesn't return an error, the result for filtered[[x]] is always 0. How do I need to set the filter arguments correctly to get the desired number of rows in each data.frame? I used noquote because otherwise filtering arguments would be pasted in quotes.
One dplyr and purrr option could be:
map(.x = list_vs,
~ df %>%
filter_at(vars(starts_with(.x)), any_vars(. == 1)))
[[1]]
v1_attr1 v1_attr2 v2_attr1 v2_attr2
1 1 0 0 0
2 0 1 0 0
3 1 1 0 0
4 0 1 1 1
5 0 1 1 1
6 0 1 1 1
7 1 1 0 0
8 1 1 0 0
[[2]]
v1_attr1 v1_attr2 v2_attr1 v2_attr2
1 0 0 1 0
2 0 0 0 1
3 0 1 1 1
4 0 1 1 1
5 0 1 1 1
An option is to convert to 'long' format with pivot_longer by automatically picking up the patterns from the column names, and then do a group_by, filter_at
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything(), names_sep = "_",
names_to = c('group', '.value' )) %>%
group_by(group) %>%
filter_at(vars(-group_cols()), any_vars(. == 1))
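The attempt in the question returns 0 because noquote(...) is still a character string, so filter compares a string with 1 instead of looking up a column; in addition, lapply already passes the element ("v1"), so list_vs[[x]] is not needed. A minimal sketch that keeps the original lapply structure and looks columns up by name with dplyr's .data pronoun (counts is an illustrative name):

```r
library(dplyr)

df <- data.frame(
  v1_attr1 = as.integer(c(1, 0, 0, 0, 1, 0, 0, 0, 1, 1)),
  v1_attr2 = as.integer(c(0, 1, 0, 0, 1, 1, 1, 1, 1, 1)),
  v2_attr1 = as.integer(c(0, 0, 1, 0, 0, 1, 1, 1, 0, 0)),
  v2_attr2 = as.integer(c(0, 0, 0, 1, 0, 1, 1, 1, 0, 0))
)
list_vs <- list("v1", "v2")

counts <- lapply(list_vs, function(v) {
  df %>%
    # .data[[name]] fetches the column whose name is built from the string
    filter(.data[[paste0(v, "_attr1")]] == 1 |
           .data[[paste0(v, "_attr2")]] == 1) %>%
    nrow()
})

counts
```

For the sample data this gives 8 rows for v1 and 5 rows for v2, matching the row counts of the filtered data frames shown above.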

Count frequencies and add a total sum

I have a large data.frame containing these values:
ID_Path Conversion Lead Path Week
32342 A25177 1 JEFD 2015-25
32528 A25177 1 EUFD 2015-25
25485 A3 1 DTFE 2015-25
32528 Null 0 DDFE 2015-25
23452 A25177 1 JDDD 2015-26
54454 A25177 1 FDFF 2015-27
56848 A2323 1 HDG 2015-27
I want to be able to create a frequency table that displays a table like this:
Week Total A25177 A3 A2323
2015-25 3 2 1 0
2015-26 1 1 0 0
2015-27 2 1 0 1
Every unique Conversion should get its own column. Conversion is Null exactly when Lead is 0.
In this example there are 3 unique conversions, but sometimes there is 1 and sometimes there are 5 or more, so the solution should not be limited to 3.
I have created a new DF containing only the rows where Conversion is not Null.
I have tried using data.table with this code:
DF[,list(Week=Week,by=Conversion]
with no luck.
I have tried using plyr with this code:
ddply(DF,~Conversion,summarise,week=week)
with no luck.
I would recommend dropping unused levels so that they don't clutter the output, and then running a simple table and addmargins combination
DF <- droplevels(DF[DF$Conversion != "Null",])
addmargins(table(DF[c("Week", "Conversion")]), 2)
# Conversion
# Week A2323 A25177 A3 Sum
# 2015-25 0 2 1 3
# 2015-26 0 1 0 1
# 2015-27 1 1 0 2
Alternatively, you could do the same with reshape2 while specifying the margins parameter
library(reshape2)
dcast(DF, Week ~ Conversion, value.var = "Conversion", length, margins = "Conversion")
# Week A2323 A25177 A3 (all)
# 1 2015-25 0 2 1 3
# 2 2015-26 0 1 0 1
# 3 2015-27 1 1 0 2
An alternative solution using dplyr and tidyr:
library(tidyr)
library(dplyr)
dt = data.frame(Conversion = c("A1","Null","A1","A3"),
Lead = c(1,0,1,1),
Week = c("2015-25","2015-25","2015-25","2015-26"))
dt %>%
filter(Conversion != "Null") %>%
group_by(Week, Conversion) %>%
summarise(Lead = sum(Lead)) %>%
ungroup() %>%
spread(Conversion,Lead,fill=0) %>%
group_by(Week) %>%
do(data.frame(.,
Total = sum(.[,-1]))) %>%
ungroup()
# Week A1 A3 Total
# 1 2015-25 2 0 2
# 2 2015-26 0 1 1

Resources