I have a large data.frame containing these values:
ID_Path Conversion Lead Path Week
32342 A25177 1 JEFD 2015-25
32528 A25177 1 EUFD 2015-25
25485 A3 1 DTFE 2015-25
32528 Null 0 DDFE 2015-25
23452 A25177 1 JDDD 2015-26
54454 A25177 1 FDFF 2015-27
56848 A2323 1 HDG 2015-27
I want to create a frequency table like this:
Week Total A25177 A3 A2323
2015-25 3 2 1 0
2015-26 1 1 0 0
2015-27 2 1 0 1
Every unique Conversion should get its own column. The rows where Conversion is Null are exactly the rows where Lead is 0, and they should not be counted.
In this example there are 3 unique conversions, but sometimes there is 1 and sometimes there are 5 or more, so the solution should not be limited to 3.
I have created a new DF containing only the rows where Conversion is not Null.
I have tried using data.table with this code:
DF[, list(Week = Week), by = Conversion]
with no luck.
I have tried using plyr with this code:
ddply(DF,~Conversion,summarise,week=week)
with no luck.
I would recommend dropping the unused factor levels so they don't clutter the output, and then running a simple table and addmargins combination:
DF <- droplevels(DF[DF$Conversion != "Null",])
addmargins(table(DF[c("Week", "Conversion")]), 2)
# Conversion
# Week A2323 A25177 A3 Sum
# 2015-25 0 2 1 3
# 2015-26 0 1 0 1
# 2015-27 1 1 0 2
Alternatively, you could do the same with reshape2 while specifying the margins parameter
library(reshape2)
dcast(DF, Week ~ Conversion, value.var = "Conversion", length, margins = "Conversion")
# Week A2323 A25177 A3 (all)
# 1 2015-25 0 2 1 3
# 2 2015-26 0 1 0 1
# 3 2015-27 1 1 0 2
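If the column layout from the question (Week, then Total, then one column per conversion) matters, the same counts can be rearranged in base R. The following is only a sketch under some assumptions: the sample data is retyped from the question with just the relevant columns, and the names kept, counts and freq are made up for illustration.

```r
# Sample data retyped from the question (only the columns that matter here)
DF <- data.frame(
  Conversion = c("A25177", "A25177", "A3", "Null", "A25177", "A25177", "A2323"),
  Lead       = c(1, 1, 1, 0, 1, 1, 1),
  Week       = c("2015-25", "2015-25", "2015-25", "2015-25",
                 "2015-26", "2015-27", "2015-27"),
  stringsAsFactors = FALSE
)

# Drop the "Null" rows, then cross-tabulate Week against Conversion
kept   <- DF[DF$Conversion != "Null", ]
counts <- as.data.frame.matrix(table(kept$Week, kept$Conversion))

# Assemble Week, Total, and one column per conversion (however many there are)
freq <- data.frame(Week  = rownames(counts),
                   Total = rowSums(counts),
                   counts,
                   row.names   = NULL,
                   check.names = FALSE)
freq
```

Because the conversion columns come straight from table(), this should work unchanged whether there is 1 unique conversion or many.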
An alternative solution using dplyr and tidyr:
library(tidyr)
library(dplyr)
dt = data.frame(Conversion = c("A1","Null","A1","A3"),
Lead = c(1,0,1,1),
Week = c("2015-25","2015-25","2015-25","2015-26"))
dt %>%
filter(Conversion != "Null") %>%
group_by(Week, Conversion) %>%
summarise(Lead = sum(Lead)) %>%
ungroup() %>%
spread(Conversion,Lead,fill=0) %>%
group_by(Week) %>%
do(data.frame(.,
Total = sum(.[,-1]))) %>%
ungroup()
# Week A1 A3 Total
# 1 2015-25 2 0 2
# 2 2015-26 0 1 1
I am currently working on a dataset which consists of multiple participants. Some participants have taken part in all follow-ups, whereas others have skipped some.
For example, in the dataset below, participant 2 only took part in the 3rd follow-up, and participant 3 only in the 2nd and 3rd follow-ups. You can also see that some participants have more than one row because they have several follow-ups.
The original dataset only has the 1st and 2nd columns. Since I am aiming to create a progress chart,
I have tried to create extra columns for each visit using the code below:
participant <- c(1,1,1,2,3,3,4,5,5,5 )
visit <- c(1,2,3,3,2,3,1,1,2,3)
df <- data.frame(participant, visit)
df[,3] <- as.integer(df$visit=="1")
df[,4] <- as.integer(df$visit=="2")
df[,5] <- as.integer(df$visit=="3")
colnames(df)[colnames(df) %in% c("V3","V4","V5")] <- c(
"Visit1","Visit2","Visit3")
However, I am still having a hard time combining the rows of the same participant, so I could not proceed to making the chart (which I also have no clue about). I tried the 'reshape' function but it did not work out. The group_by function did not work either and still showed the original dataset:
df1 <- df[,-2]
df1 %>%
group_by(participant)
What function should I use in this case for:
combining the rows of the same participant?
producing the progress chart?
Thank you in advance for your help!
Based on your df you could produce the chart with
library(ggplot2)
library(dplyr)
df %>%
ggplot(aes(x = as.factor(visit),
y = as.factor(participant),
fill = as.factor(visit))) +
geom_tile(aes(width = 0.7, height = 0.7), color = "black") +
scale_fill_grey() +
xlab("Visit") +
ylab("Participants") +
guides(fill = "none")
If you need your data.frame in a wide format (similar to the image shown but with only one row per participant), use
library(tidyr)
library(dplyr)
df %>%
mutate(value = 1) %>%
pivot_wider(
names_from = visit,
values_from = value,
names_glue = "Visit{visit}",
values_fill = 0)
to get
# A tibble: 5 x 4
participant Visit1 Visit2 Visit3
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
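For comparison, a similar one-row-per-participant table can be sketched in base R with table(). This is an illustrative alternative rather than part of the answer above; the name wide is made up, and the data is the df from the question.

```r
# Data from the question
participant <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5)
visit       <- c(1, 2, 3, 3, 2, 3, 1, 1, 2, 3)
df <- data.frame(participant, visit)

# Cross-tabulate participants against visits
wide <- as.data.frame.matrix(table(df$participant, df$visit))

# Turn counts into 0/1 flags (guards against duplicated participant/visit rows)
wide[] <- lapply(wide, function(x) as.integer(x > 0))

# Name the columns Visit1, Visit2, ... and restore the participant id
names(wide) <- paste0("Visit", names(wide))
wide <- cbind(participant = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```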
I think you are looking for a way to dummify a variable.
There are several ways to do that.
I like the fastDummies package. You can use dummy_cols, with remove_selected_columns=TRUE.
df %>% fastDummies::dummy_cols(select_columns = 'visit',
remove_selected_columns = TRUE)
participant visit_1 visit_2 visit_3
1 1 1 0 0
2 1 0 1 0
3 1 0 0 1
4 2 0 0 1
5 3 0 1 0
6 3 0 0 1
7 4 1 0 0
8 5 1 0 0
9 5 0 1 0
10 5 0 0 1
You may want to pipe in a summarise operation to make the table even cleaner, as in:
df %>% fastDummies::dummy_cols(select_columns = 'visit', remove_selected_columns = TRUE)%>%
group_by(participant)%>%
summarise(across(starts_with('visit'), max))
# A tibble: 5 x 4
participant visit_1 visit_2 visit_3
<dbl> <int> <int> <int>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
In a certain way, this looks a bit like a pivoting operation too.
You may be interested in using tidyr::pivot_wider here too.
EDIT: @MartinGal had just given a similar answer, so I removed my very similar pivot_wider version.
I have clinical data that records a patient at three time points with a disease outcome indicated by a binary variable. It looks something like this
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome)
Data
I want to create an onset variable, so for each patient it would code a 1 for the time which the patient first got the disease, but would then be a 0 for any time period before or a time period after (even if that patient still had the disease). For the example data it should now look like this.
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
outcome_onset <- c(0,1,0,0,0,1,1,0,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome,
outcome_onset=outcome_onset)
Data
Therefore I would like some code/ some help automating the creation of the outcome_onset variable.
Here is an option with cumsum to create a logical vector after grouping by the 'patientid'
library(dplyr)
Data %>%
group_by(patientid) %>%
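# outcome == 1 keeps later 0 rows from being flagged while the running total is still 1
mutate(outcome_onset = +(outcome == 1 & cumsum(outcome) == 1))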
Or use match and %in%
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = +(row_number() %in% match(1, outcome)))
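We can use which.max to get the index of the first 1 in the outcome variable, set that row to 1 and the rest to 0. Note that which.max returns 1 when every value is 0, so this assumes each patient has the disease at least once.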
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = as.integer(row_number() %in% which.max(outcome)),
outcome_onset = replace(outcome_onset, is.na(outcome), NA))
# patientid time outcome outcome_onset
# <dbl> <dbl> <dbl> <int>
#1 100 1 0 0
#2 100 2 1 1
#3 100 3 1 0
#4 101 1 0 0
#5 101 2 0 0
#6 101 3 1 1
#7 102 1 1 1
#8 102 2 1 0
#9 102 3 0 0
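The same idea can be sketched in base R with ave(). This version is an assumption-laden illustration: it expects the rows to be sorted by time within each patient, and the helper name first_onset is made up. Unlike the which.max approach, it leaves a patient who never has the disease at all zeros.

```r
# Data from the question
patientid <- c(100, 100, 100, 101, 101, 101, 102, 102, 102)
time      <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
outcome   <- c(0, 1, 1, 0, 0, 1, 1, 1, 0)
Data <- data.frame(patientid, time, outcome)

# Flag a row only if it is a 1 and it is the first 1 seen so far
first_onset <- function(x) as.integer(x == 1 & cumsum(x == 1) == 1)

# Apply the helper within each patient (rows assumed sorted by time)
Data$outcome_onset <- ave(Data$outcome, Data$patientid, FUN = first_onset)
Data
```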
I need to reshape my data, to get it in a proper format for Survival Analysis.
My current Dataset looks like this:
Product_Number Date Status
A 2018-01-01 0
A 2018-01-02 1
A 2018-01-03 0
B 2018-01-01 0
B 2018-01-02 0
B 2018-01-03 0
B 2018-01-04 1
C 2018-01-01 0
C 2018-01-02 0
I need to reshape my data based on the columns Product_Number, Date and Status: I want to count the number of days, per product, until the status shifts to 1. After that, if the status is 0 again, the count should start over.
So the data should look like this:
Product_Number Number_of_Days Status
A 2 1 #Two days til status = 1
A 1 0 #One day, status = 0 (no end date yet)
B 4 1 #Four days til status = 1
C 2 0 #Two days, status is still 0 (no end date yet)
What have I tried so far?
I ordered my data by Product_Number and Date. I love the dplyr way, so I used:
df <- df %>% group_by(Product_Number, Date) # note: my data is now in the form as in the example above.
Then I tried to use the diff() function, to see the differences in dates (count the number of days). But I was unable to "stop" the count, when status switched (from 0 to 1, and vice versa).
I hope that I clearly explained the problem. Please let me know if you need some additional information.
You could do:
library(dplyr)
df %>%
group_by(Product_Number) %>%
mutate(Date = as.Date(Date),
group = cumsum(coalesce(as.numeric(lag(Status) == 1 & Status == 0), 1))) %>%
group_by(Product_Number, group) %>%
mutate(Number_of_Days = (last(Date) - first(Date)) + 1) %>%
slice(n()) %>% ungroup() %>%
select(-group, -Date)
Output:
# A tibble: 4 x 3
Product_Number Status Number_of_Days
<chr> <int> <time>
1 A 1 2
2 A 0 1
3 B 1 4
4 C 0 2
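For readers who prefer to see the episode logic spelled out, here is a base R sketch. It rests on a slightly different assumption than the code above: a new episode starts on the row immediately after every Status == 1. The sample data is retyped from the question, and the names episode, pieces and result are made up for illustration.

```r
# Sample data retyped from the question
df <- data.frame(
  Product_Number = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03",
                   "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04",
                   "2018-01-01", "2018-01-02")),
  Status = c(0, 1, 0, 0, 0, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)
df <- df[order(df$Product_Number, df$Date), ]

# Episode id = number of events (Status == 1) seen on earlier rows of the product
df$episode <- ave(df$Status, df$Product_Number,
                  FUN = function(s) cumsum(c(0, head(s, -1))))

# Summarise each product/episode: span in days and the status on its last row
pieces <- split(df, list(df$Product_Number, df$episode), drop = TRUE)
result <- do.call(rbind, lapply(pieces, function(p) data.frame(
  Product_Number = p$Product_Number[1],
  Number_of_Days = as.integer(max(p$Date) - min(p$Date)) + 1L,
  Status         = p$Status[nrow(p)]
)))

# Put the episodes of each product next to each other
result <- result[order(result$Product_Number), ]
rownames(result) <- NULL
result
```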
This might be what you're looking for, if I got your question right.
library(dplyr)
df %>%
mutate(Number_of_Days=1) %>%
select(-Date) %>%
group_by(Product_Number, Status) %>%
summarise_all(sum, na.rm = TRUE)
Product_Number Status Number_of_Days
1 A 0 2
2 A 1 1
3 B 0 3
4 B 1 1
5 C 0 2
I have data that looks like this:
library(dplyr)
d<-data.frame(ID=c(1,1,2,3,3,4), Quality=c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"), Area=c("East", "North", "North", "South", "East", "North"))
What I'd like to do is create one new column for each unique value in Quality and populate it with whether the ID matches that value and then aggregate the ID's. I want to do the same for Area.
This is what I have for when Quality == Good:
d$Quality.Good <- 0
d$Quality.Good[d$Quality=="Good"] <- 1
e <- d %>%
group_by(ID) %>%
summarise(n=n(), MAX.Quality.Good = max(Quality.Good))
e
Output
# A tibble: 4 x 3
     ID     n MAX.Quality.Good
  <dbl> <int>            <dbl>
1     1     2                1
2     2     1                0
3     3     2                1
4     4     1                0
Is it possible to build a function that will loop through each character column and build an indicator column for Good, Bad, Ugly, North, East, South instead of copy pasting the above many more times?
Here's where I'm stuck:
library(stringr)
#vector of each Quality
e <-d %>%
group_by(Quality) %>%
summarise(n=n()) %>%
select(Quality)
e<-as.data.frame(e)
#create new column names
f <- str_c(names(e),".",e[,1])
#initialize list of new columns
d[f] <- 0
#I'm stuck after this...
Thank you!
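We can do this in base R using table, by replicating the 'ID' column once per remaining column of the dataset (the number of columns minus 1) and pasting the column names onto the unlisted values (excluding the 'ID' column). Note that this returns counts rather than 0/1 flags, so ID 3 gets a 2 under QualityGood.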
table(rep(d$ID, 2), paste0(names(d)[-1][col(d[-1])], unlist(d[-1])))
# AreaEast AreaNorth AreaSouth QualityBad QualityGood QualityUgly
# 1 1 1 0 1 1 0
# 2 0 1 0 0 0 1
# 3 1 0 1 0 2 0
# 4 0 1 0 0 0 1
or with tidyverse, gather into 'long' format, unite the 'key', 'val' columns to a single column, get the distinct rows, and spread into 'wide' format after creating a column of 1s.
library(tidyverse)
gather(d, key, val, -ID) %>%
unite(kv, key, val) %>%
distinct %>%
mutate(n = 1) %>%
spread(kv, n, fill = 0)
#ID Area_East Area_North Area_South Quality_Bad Quality_Good Quality_Ugly
#1 1 1 1 0 1 1 0
#2 2 0 1 0 0 0 1
#3 3 1 0 1 0 1 0
#4 4 0 1 0 0 0 1
1) Base R Create the model matrix for each column (using function make_mm) and bind them together as a data frame m. Finally aggregate on ID. No packages are used.
make_mm <- function(nm, data) model.matrix(~ . - 1, data[nm])
m <- do.call("data.frame", lapply(names(d)[-1], make_mm, d))
with(d, aggregate(. ~ ID, m, max))
giving:
ID QualityBad QualityGood QualityUgly AreaEast AreaNorth AreaSouth
1 1 1 1 0 1 1 0
2 2 0 0 1 0 1 0
3 3 0 1 0 1 0 1
4 4 0 0 1 0 1 0
2) dplyr/purrr This could alternately be written as the following which is close to the code in the question but generalizes to all required columns. Note that here we make model data frames using make_md rather than making model matrices with make_mm. Also note that the dot in group_by(m, ID = .$ID) refers to d and not to m.
library(dplyr)
library(purrr)
make_md <- function(nm, data) {
data %>%
select(all_of(nm)) %>%
model.matrix(~ . - 1, .) %>%
as.data.frame
}
d %>% {
m <- map_dfc(names(.)[-1], make_md, .)
group_by(m, ID = .$ID) %>%
summarize_all(max) %>%
ungroup
}
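Since the question asks for a function that loops through each character column, here is a plain base R sketch of exactly that. The function name add_indicators and its id argument are hypothetical, and the output columns follow the Quality.Good naming used in the question.

```r
# Data from the question
d <- data.frame(
  ID      = c(1, 1, 2, 3, 3, 4),
  Quality = c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"),
  Area    = c("East", "North", "North", "South", "East", "North"),
  stringsAsFactors = FALSE
)

# Build one 0/1 indicator per (column, value) pair, aggregated to one row per ID
add_indicators <- function(data, id = "ID") {
  out <- data.frame(sort(unique(data[[id]])))
  names(out) <- id
  for (col in setdiff(names(data), id)) {
    for (val in sort(unique(data[[col]]))) {
      # IDs that have this value at least once in this column
      hit_ids <- data[[id]][data[[col]] == val]
      out[[paste(col, val, sep = ".")]] <- as.integer(out[[id]] %in% hit_ids)
    }
  }
  out
}

e <- add_indicators(d)
e
```

The nested loop is not the fastest option for wide data, but it keeps the logic close to the copy-paste version in the question.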
I have a df that provides the create_date and delete_date (if any) for a given ID.
Structure:
ID create_date1 create_date2 delete_date1 delete_date2
1 01-01-2014 NA NA NA
2 01-04-2014 01-08-2014 01-05-2014 NA
The create_date and delete_date columns extend up to 10, i.e. create_date10 and delete_date10 are also present.
Rules/Logic:
We charge a user on a monthly basis. Even if a user was created on the 30th of a month, the user is treated as active for that whole month (the cost is very low).
If a user has a delete date in a month (irrespective of the day), the user is still charged for that month but not from the next month onwards.
If a user has only a create_date and no delete_date, every month from the create month onwards is charged.
Output expected:
ID 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08
1 1 1 1 1 1 1 1 1
2 0 0 0 1 1 0 0 1
so on till current date
1 indicates the user is charged/active for that month
Problem:
I have been struggling to do this, but can't even understand how to do this. My earlier method is a bit too slow
Previous Solution:
Make the dataset into tall
Insert sequence of dates for each ID as a new column
Use a for loop to check the status
for each ID, status is equal to 1,
if create_date is equal to sequence, and it's 0 if the lag(delete_date) is equal to sequence
else is same as lag(status)
ID create_date delete_date sequence status?
1 01-01-2014 NA 2014-01 1
1 01-01-2014 NA 2014-02 1
1 01-01-2014 NA 2014-03 1
This may not be that efficient, and it assumes the data covers a single year (though it could be extended easily):
# convert all dates to Date format
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) as.Date(df[[x]], format = "%d-%m-%Y"))
# extract the month
library(lubridate)
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) month(df[[x]]))
# df
# ID create_date1 create_date2 delete_date1 delete_date2
#1 1 1 NA NA NA
#2 2 4 8 5 NA
# get the current month
current.month <- month(Sys.Date())
# assume for now current month is 9
current.month <- 9
flags <- rep(FALSE, current.month)
func <- function(x){
x[is.na(x)] <- current.month # replacing all NA with current month(9)
create.columns.indices <- x[grepl("create_date", colnames(df[-1]))] # extract the create_months
delete.columns.indices <- x[grepl("delete_date", colnames(df[-1]))] # extract the delete_months
flags <- pmin(1,colSums(t(sapply(seq_along(create.columns.indices),
function(x){
flags[create.columns.indices[x]:delete.columns.indices[x]] = TRUE;
flags
}))))
flags
}
df1 = cbind(df$ID , t(apply(df[-1], 1, func)))
colnames(df1) = c("ID", paste0("month",1:current.month))
# df1
# ID month1 month2 month3 month4 month5 month6 month7 month8 month9
#[1,] 1 1 1 1 1 1 1 1 1 1
#[2,] 2 0 0 0 1 1 0 0 1 1
Here's a still-pretty-long tidyverse approach:
library(tidyverse)
df %>% gather(var, date, -ID) %>% # reshape to long form
# separate date type from column set number
separate(var, c('action', 'number'), sep = '_date', convert = TRUE) %>%
mutate(date = as.Date(date, '%d-%m-%Y')) %>% # parse dates
spread(action, date) %>% # spread create and delete to two columns
mutate(min_date = min(create, delete, na.rm = TRUE), # add helper columns; use outside
max_date = max(create, delete, na.rm = TRUE)) %>% # variable to save memory if an issue
group_by(ID, number) %>%
mutate(month = list(seq(min_date, max_date, by = 'month')), # add month sequence list column
# boolean vector of whether range of months in whole range
active = ifelse(is.na(create),
list(rep(FALSE, length(month[[1]]))),
lapply(month, `%in%`,
seq.Date(create,
min(delete, max_date, na.rm = TRUE),
by = 'month')))) %>%
unnest() %>% # unnest list columns to long form
group_by(ID, month = format(month, '%Y-%m')) %>%
summarise(active = any(active) * 1L) %>% # combine multiple rows for one ID
spread(month, active) # reshape to wide form
## Source: local data frame [2 x 9]
## Groups: ID [2]
##
## ID `2014-01` `2014-02` `2014-03` `2014-04` `2014-05` `2014-06` `2014-07` `2014-08`
## * <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 1 1 1 1 1 1 1 1
## 2 2 0 0 0 1 1 0 0 1
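The charging rules can also be sketched directly in base R, without reshaping. This is only an illustration under stated assumptions: dates are day-month-year strings, the reporting window 2014-01 to 2014-08 is hard-coded to match the expected output (the question runs to the current date), and the names to_month, pairs, active and billing are made up.

```r
# Sample data retyped from the question (dates assumed to be dd-mm-yyyy)
df <- data.frame(
  ID           = c(1, 2),
  create_date1 = c("01-01-2014", "01-04-2014"),
  create_date2 = c(NA, "01-08-2014"),
  delete_date1 = c(NA, "01-05-2014"),
  delete_date2 = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Assumed reporting window; replace the end with the current month in practice
months <- seq(as.Date("2014-01-01"), as.Date("2014-08-01"), by = "month")

# Parse a dd-mm-yyyy string and snap it to the first day of its month
to_month <- function(x) {
  as.Date(format(as.Date(x, format = "%d-%m-%Y"), "%Y-%m-01"))
}

# Suffixes of the create/delete column pairs: "1", "2", ...
pairs <- sub("create_date", "", grep("^create_date", names(df), value = TRUE))

# For each ID, a month is charged if it falls inside any create..delete interval
active <- t(sapply(seq_len(nrow(df)), function(i) {
  flag <- rep(FALSE, length(months))
  for (k in pairs) {
    start <- to_month(df[[paste0("create_date", k)]][i])
    end   <- to_month(df[[paste0("delete_date", k)]][i])
    if (is.na(start)) next                 # this pair is unused
    if (is.na(end)) end <- max(months)     # never deleted: charge to the end
    flag <- flag | (months >= start & months <= end)
  }
  as.integer(flag)
}))
colnames(active) <- format(months, "%Y-%m")

billing <- data.frame(ID = df$ID, active, check.names = FALSE)
billing
```

The delete month itself is charged and the following month is not, which is how the rules in the question read.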