I have this data:
drugData <- data.frame(caseID=c(9, 9, 10, 11, 12, 12, 12, 12, 13, 45, 45, 225),
Drug=c("Cocaine", "Cocaine", "DPT", "LSD", "Cocaine", "LSD", "Heroin","Heroin", "LSD", "DPT", "DPT", "Heroin"),
County=c("A", "A", "B", "C", "D", "D", "D","D", "E", "F", "F", "G"),
Date=c(2009, 2009, 2009, 2009, 2011, 2011, 2011, 2011, 2010, 2010, 2010, 2005))
"CaseID" rows make up a single case, which may have observations of all the same drug, or different types of drugs. I want this data to look like the following:
CaseID Drug.1 Drug.2 Drug.3 Drug.4 County Date
9 Cocaine Cocaine NA NA A 2009
10 DPT NA NA NA B 2009
11 LSD NA NA NA C 2009
12 Cocaine LSD Heroin Heroin D 2011
13 LSD NA NA NA E 2010
45 DPT DPT NA NA F 2010
225 Heroin NA NA NA G 2005
I've tried using tidyr's spread function but can't quite get this to work.
We can pivot to wide format after creating a sequence column based on 'caseID' (rowid comes from data.table):
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
drugData %>%
mutate(nm = str_c('Drug', rowid(caseID))) %>%
pivot_wider(names_from = nm, values_from = Drug)
#A tibble: 7 x 7
# caseID County Date Drug1 Drug2 Drug3 Drug4
# <dbl> <fct> <dbl> <fct> <fct> <fct> <fct>
#1 9 A 2009 Cocaine Cocaine <NA> <NA>
#2 10 B 2009 DPT <NA> <NA> <NA>
#3 11 C 2009 LSD <NA> <NA> <NA>
#4 12 D 2011 Cocaine LSD Heroin Heroin
#5 13 E 2010 LSD <NA> <NA> <NA>
#6 45 F 2010 DPT DPT <NA> <NA>
#7 225 G 2005 Heroin <NA> <NA> <NA>
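To see what the sequence column looks like before pivoting, here is a minimal base R sketch. It assumes ave() with seq_along as a stand-in for rowid(); the names seq_in_case and nm are only illustrative.

```r
# caseID values copied from the question
caseID <- c(9, 9, 10, 11, 12, 12, 12, 12, 13, 45, 45, 225)

# Number the rows within each caseID: 1, 2, ... restarting for every new case
seq_in_case <- ave(seq_along(caseID), caseID, FUN = seq_along)

# Build the future column names from that counter
nm <- paste0("Drug", seq_in_case)
nm
```

Each distinct value of nm becomes one column in the wide result, which is why case 12 (four rows) determines that there are four Drug columns.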
Or with spread (spread is superseded by pivot_wider):
drugData %>%
mutate(nm = str_c('Drug', rowid(caseID))) %>%
spread(nm, Drug)
Or using data.table
dcast(setDT(drugData), caseID + County + Date ~
paste0('Drug', rowid(caseID)), value.var = 'Drug')
# caseID County Date Drug1 Drug2 Drug3 Drug4
#1: 9 A 2009 Cocaine Cocaine <NA> <NA>
#2: 10 B 2009 DPT <NA> <NA> <NA>
#3: 11 C 2009 LSD <NA> <NA> <NA>
#4: 12 D 2011 Cocaine LSD Heroin Heroin
#5: 13 E 2010 LSD <NA> <NA> <NA>
#6: 45 F 2010 DPT DPT <NA> <NA>
#7: 225 G 2005 Heroin <NA> <NA> <NA>
I have rows grouped by ID and I want to calculate how much time passes until the next event occurs (if it does occur for that ID).
Here is example code:
year <- c(2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018)
id <- c(rep("A", times = 4), rep("B", times = 4), rep("C", times = 4))
event_date <- c(NA, 2016, NA, 2018, NA, NA, NA, NA, 2015, NA, NA, 2018)
df<- as.data.frame(cbind(id, year, event_date))
df
id year event_date
1 A 2015 <NA>
2 A 2016 2016
3 A 2017 <NA>
4 A 2018 2018
5 B 2015 <NA>
6 B 2016 <NA>
7 B 2017 <NA>
8 B 2018 <NA>
9 C 2015 2015
10 C 2016 <NA>
11 C 2017 <NA>
12 C 2018 2018
Here is what I want the output to look like:
id year event_date years_till_next_event
1 A 2015 <NA> 1
2 A 2016 2016 0
3 A 2017 <NA> 1
4 A 2018 2018 0
5 B 2015 <NA> <NA>
6 B 2016 <NA> <NA>
7 B 2017 <NA> <NA>
8 B 2018 <NA> <NA>
9 C 2015 2015 0
10 C 2016 <NA> 2
11 C 2017 <NA> 1
12 C 2018 2018 0
Person B does not have the event, so it is not calculated. For the others, I want to calculate the difference between the leading event_date (ignoring NAs, if it exists) and the year.
I want to calculate years_till_next_event such that 1) if there is an event_date for a row, event_date - year. 2) If not, then return the first non-NA leading value - year. I'm having difficulty with the 2nd part of the logic, keeping in mind the event could occur not at all or every year, by ID.
Using zoo with dplyr
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(years_till_next_event = na.locf0(event_date, fromLast = TRUE) - year )
Here is a data.table option
library(data.table)
setDT(df)[, years_till_next_event := nafill(event_date, type = "nocb") - year, by = id]
which gives
id year event_date years_till_next_event
1: A 2015 NA 1
2: A 2016 2016 0
3: A 2017 NA 1
4: A 2018 2018 0
5: B 2015 NA NA
6: B 2016 NA NA
7: B 2017 NA NA
8: B 2018 NA NA
9: C 2015 2015 0
10: C 2016 NA 2
11: C 2017 NA 1
12: C 2018 2018 0
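Both na.locf0(..., fromLast = TRUE) and nafill(..., type = "nocb") carry the next observation backward. A rough base R sketch of that idea follows; the helper name nocb is hypothetical, and the data is rebuilt with data.frame() so the columns are numeric.

```r
# Hypothetical helper: fill each NA with the next non-NA value after it
nocb <- function(x) {
  out <- x
  # Walk from the second-to-last element back to the first
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(out[i])) out[i] <- out[i + 1]
  }
  out
}

year <- rep(2015:2018, times = 3)
id <- rep(c("A", "B", "C"), each = 4)
event_date <- c(NA, 2016, NA, 2018, NA, NA, NA, NA, 2015, NA, NA, 2018)
df <- data.frame(id, year, event_date)

# Apply the fill within each id, then subtract the current year
df$years_till_next_event <- ave(df$event_date, df$id, FUN = nocb) - df$year
df
```

Trailing NAs with no later event stay NA, so an id without any event (B here) ends up all NA.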
You can create a new column to assign a row number within each id if the value is not NA, fill the NA values from the next values and subtract the current row number from it.
library(dplyr)
df %>%
group_by(id) %>%
mutate(years_till_next_event = replace(row_number(),is.na(event_date), NA)) %>%
tidyr::fill(years_till_next_event, .direction = 'up') %>%
mutate(years_till_next_event = years_till_next_event - row_number()) %>%
ungroup
# id year event_date years_till_next_event
# <chr> <dbl> <dbl> <int>
# 1 A 2015 NA 1
# 2 A 2016 2016 0
# 3 A 2017 NA 1
# 4 A 2018 2018 0
# 5 B 2015 NA NA
# 6 B 2016 NA NA
# 7 B 2017 NA NA
# 8 B 2018 NA NA
# 9 C 2015 2015 0
#10 C 2016 NA 2
#11 C 2017 NA 1
#12 C 2018 2018 0
data (built with data.frame() rather than cbind(), which would coerce every column to character)
df <- data.frame(id, year, event_date)
I have two data sets, df1 and df2, which have the columns "ID" and "State" in common:
df1 <- data.frame(ID=c(1:20), State=c("NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","CA","IL","SD","NC","SC","WA","CO","AL","AK","HI"))
df2 <- data.frame(ID=c(1,2,3,4,5,"NA","NA","NA","NA","NA"), Year=c("2020","2021","2020","2020","2021","2020","2020","2021","2020","2019"),State=c("NA","NA","NA","NA","NA","CA","SC","NY","NJ","OR"))
How can I add Year from df2 to df1 to the same ID that exists in df1 OR the same State that exists in df1?
In short, I just need to bring the "Year" information from df2 into df1.
Here's a dplyr solution:
library(dplyr)
df1 <- df1 %>%
mutate(join = ifelse(State == 'NA', ID, State))
df2 <- df2 %>%
mutate(join = ifelse(State == 'NA', ID, State))
df_new <- left_join(df1, df2, by = "join") %>%
mutate(State = coalesce(State.x, State.y)) %>%
select(-c(State.x, State.y, join, ID.y)) %>%
rename(ID = ID.x)
This gives us:
ID Year State
1 1 2020 NA
2 2 2021 NA
3 3 2020 NA
4 4 2020 NA
5 5 2021 NA
6 6 <NA> NA
7 7 <NA> NA
8 8 <NA> NA
9 9 <NA> NA
10 10 <NA> NA
11 11 2020 CA
12 12 <NA> IL
13 13 <NA> SD
14 14 <NA> NC
15 15 2020 SC
16 16 <NA> WA
17 17 <NA> CO
18 18 <NA> AL
19 19 <NA> AK
20 20 <NA> HI
You could do:
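library(dplyr)
# "NA" strings become real NA; as.is = TRUE keeps the remaining strings as character
df1 <- type.convert(df1, as.is = TRUE)
df2 <- type.convert(df2, as.is = TRUE)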
df1 %>%
left_join(select(df2, -State), 'ID') %>%
left_join(select(filter(df2, is.na(ID)), -ID), 'State') %>%
mutate(Year = coalesce(Year.x, Year.y), Year.x = NULL, Year.y = NULL)
ID State Year
1 1 <NA> 2020
2 2 <NA> 2021
3 3 <NA> 2020
4 4 <NA> 2020
5 5 <NA> 2021
6 6 <NA> NA
7 7 <NA> NA
8 8 <NA> NA
9 9 <NA> NA
10 10 <NA> NA
11 11 CA 2020
12 12 IL NA
13 13 SD NA
14 14 NC NA
15 15 SC 2020
16 16 WA NA
17 17 CO NA
18 18 AL NA
19 19 AK NA
20 20 HI NA
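The same "match on ID, otherwise on State" logic can also be sketched in base R with match(). This assumes the "NA" strings have already been turned into real NA values and that Year is numeric; by_id and by_state are illustrative names.

```r
df1 <- data.frame(
  ID = 1:20,
  State = c(rep(NA, 10), "CA", "IL", "SD", "NC", "SC", "WA", "CO", "AL", "AK", "HI")
)
df2 <- data.frame(
  ID = c(1:5, rep(NA, 5)),
  Year = c(2020, 2021, 2020, 2020, 2021, 2020, 2020, 2021, 2020, 2019),
  State = c(rep(NA, 5), "CA", "SC", "NY", "NJ", "OR")
)

# Look up Year by ID first
by_id <- df2$Year[match(df1$ID, df2$ID)]

# Then look up Year by State; match() would pair NA with NA,
# so blank out rows of df1 that have no State
by_state <- df2$Year[match(df1$State, df2$State)]
by_state[is.na(df1$State)] <- NA

# Prefer the ID match and fall back to the State match
df1$Year <- ifelse(is.na(by_id), by_state, by_id)
df1
```

Note that match() only returns the first hit, so this sketch assumes each ID and each State appears at most once in df2.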
In a data frame like data below:
library(tidyverse)
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y","Z", "a","b","c","d")
State <- rep(c("FL", "GA", "SC", "NC", "VA", "GA"), each = 5)
Location <- rep(c("alpha", "beta", "gamma"), each = 10)
Var3 <- rep(c("Bravo", "Charlie", "Delta", "Echo"), times = c(7,8,10,5))
Sex <- rep(c("M","F","M"), times = 10)
data <- data.frame(ID, State, Location, Var3, Sex)
I want to return a data frame, or a list of several data frames, that summarizes each way the data can be grouped. I want to see how many individual IDs are in each State, Location, and Var3, how many M and F are in each State, Location, and Var3, how many Locations are in each State, etc. What is the best way to achieve this?
We can use count
library(dplyr)
data %>%
count(State, Location, Var3, Sex)
Also, to get hierarchical counts in the rollup/cube style,
library(data.table)
rollup(as.data.table(data), j = .N, by = c("State","Location","Var3", "Sex"))
# State Location Var3 Sex N
# 1: FL alpha Bravo M 3
# 2: FL alpha Bravo F 2
# 3: GA alpha Bravo M 2
# 4: GA alpha Charlie F 1
# 5: GA alpha Charlie M 2
# 6: SC beta Charlie F 2
# 7: SC beta Charlie M 3
# 8: NC beta Delta M 3
# 9: NC beta Delta F 2
#10: VA gamma Delta M 4
#11: VA gamma Delta F 1
#12: GA gamma Echo F 2
#13: GA gamma Echo M 3
#14: FL alpha Bravo <NA> 5
#15: GA alpha Bravo <NA> 2
#16: GA alpha Charlie <NA> 3
#17: SC beta Charlie <NA> 5
#18: NC beta Delta <NA> 5
#19: VA gamma Delta <NA> 5
#20: GA gamma Echo <NA> 5
#21: FL alpha <NA> <NA> 5
#22: GA alpha <NA> <NA> 5
#23: SC beta <NA> <NA> 5
#24: NC beta <NA> <NA> 5
#25: VA gamma <NA> <NA> 5
#26: GA gamma <NA> <NA> 5
#27: FL <NA> <NA> <NA> 5
#28: GA <NA> <NA> <NA> 10
#29: SC <NA> <NA> <NA> 5
#30: NC <NA> <NA> <NA> 5
#31: VA <NA> <NA> <NA> 5
#32: <NA> <NA> <NA> <NA> 30
# State Location Var3 Sex N
Or use cube
cube(as.data.table(data), j = .N, by = c("State","Location","Var3", "Sex"))
# State Location Var3 Sex N
# 1: FL alpha Bravo M 3
# 2: FL alpha Bravo F 2
# 3: GA alpha Bravo M 2
# 4: GA alpha Charlie F 1
# 5: GA alpha Charlie M 2
# ---
#111: <NA> <NA> Delta <NA> 10
#112: <NA> <NA> Echo <NA> 5
#113: <NA> <NA> <NA> M 20
#114: <NA> <NA> <NA> F 10
#115: <NA> <NA> <NA> <NA> 30
One dplyr and purrr solution to group by all possible combinations of column names could be:
map2(list(colnames(data)),
1:ncol(data),
combn, simplify = FALSE) %>%
flatten() %>%
map(~ data %>%
group_by_at(.x) %>%
tally())
In this case, there are 31 possible combinations of column names, so it returns a list of 31 tibbles. The first three elements:
[[1]]
# A tibble: 30 x 2
ID n
<fct> <int>
1 a 1
2 A 1
3 b 1
4 B 1
5 c 1
6 C 1
7 d 1
8 D 1
9 E 1
10 F 1
# … with 20 more rows
[[2]]
# A tibble: 5 x 2
State n
<fct> <int>
1 FL 5
2 GA 10
3 NC 5
4 SC 5
5 VA 5
[[3]]
# A tibble: 3 x 2
Location n
<fct> <int>
1 alpha 10
2 beta 10
3 gamma 10
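The count of 31 comes from choosing 1, 2, ..., 5 of the 5 column names. A small base R sketch of just that enumeration step (cols and combos are illustrative names, not part of the answer above):

```r
cols <- c("ID", "State", "Location", "Var3", "Sex")

# For every size k, list all k-element subsets of the column names,
# then flatten the list of lists by one level
combos <- unlist(
  lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
  recursive = FALSE
)

length(combos)                        # number of groupings
sum(choose(length(cols), seq_along(cols)))  # same number, computed directly
combos[[6]]                           # first two-column grouping
```

Each element of combos is a character vector of column names that can be handed to the grouping step.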
Suppose you have a data frame df with 5 attributes: x1, x2, x3, x4, Year, as follows:
set.seed(1)
x1 <- 1:30
x2 <- rnorm(10)
x3 <- rchisq(25, 2, ncp = 0)
x4 <- rpois(6, 0.94)
Year <- sample(2011:2014,30,replace=TRUE)
noRow <- max(length(x1), length(x2), length(x3), length(x4), length(Year))
df <- list(x1=x1, x2=x2, x3=x3, x4=x4, Year=Year)
attributes(df) <- list(names = names(df), row.names=1:30, class='data.frame')
and output
x1 x2 x3 x4 Year
1 1 -0.6264538 4.2807226 0 2014
2 2 0.1836433 1.6273105 0 2014
3 3 -0.8356286 0.3144031 0 2012
4 4 1.5952808 0.6216108 0 2012
5 5 0.3295078 0.9374638 1 2014
6 6 -0.8204684 0.1363947 2 2013
7 7 0.4874291 2.4985843 <NA> 2013
8 8 0.7383247 2.0162627 <NA> 2012
9 9 0.5757814 2.7218900 <NA> 2012
10 10 -0.3053884 2.4119764 <NA> 2014
11 11 <NA> 1.1082308 <NA> 2013
12 12 <NA> 2.4140052 <NA> 2011
13 13 <NA> 3.1249573 <NA> 2011
14 14 <NA> 0.2615523 <NA> 2012
15 15 <NA> 0.4381074 <NA> 2014
16 16 <NA> 0.6944394 <NA> 2013
17 17 <NA> 0.8599189 <NA> 2014
18 18 <NA> 0.2924151 <NA> 2013
19 19 <NA> 1.6834339 <NA> 2012
20 20 <NA> 0.4848175 <NA> 2012
21 21 <NA> 3.1606987 <NA> 2011
22 22 <NA> 2.3705121 <NA> 2011
23 23 <NA> 0.7808625 <NA> 2013
24 24 <NA> 0.4621734 <NA> 2011
25 25 <NA> 1.9421776 <NA> 2012
26 26 <NA> <NA> <NA> 2013
27 27 <NA> <NA> <NA> 2014
28 28 <NA> <NA> <NA> 2012
29 29 <NA> <NA> <NA> 2012
30 30 <NA> <NA> <NA> 2011
I would like to group by year and determine if for a given year we have no entries in one or more attributes.
Using
library("dplyr")
df1 <- df %>%
dplyr::group_by(Year) %>%
dplyr::mutate(count = n())
only gives me the number of entries in a given year, but it doesn't tell me which attributes are present/non-missing in a given year.
Thanks for sharing your ideas.
Desired output:
Year x1 x2 x3 x4
2011 1 0 1 0
2012 1 1 1 1
2013 1 1 1 1
2014 1 1 1 1
where 1 means there's at least one entry for the variable in a given year, and 0 otherwise.
This code solves your problem:
df$attrib_ok <- !is.na(rowSums(df[1:4]))
df1 <- df %>%
dplyr::group_by(Year) %>%
dplyr::mutate(count=sum(attrib_ok)) %>%
dplyr::select(-attrib_ok)
but it seems you have created a corrupt data frame (its columns have different lengths), so this solution doesn't work on it.
You first have to create a valid data frame, like this:
set.seed(1)
x1 <- 1:30
x2 <- c(rnorm(10), rep(NA, 20))
x3 <- c(rchisq(25, 2, ncp = 0), rep(NA, 5))
x4 <- c(rpois(6, 0.94), rep(NA, 24))
Year <- sample(2011:2014,30,replace=TRUE)
df <- data.frame(x1,x2,x3,x4,Year)
Code to get your desired output:
df1 <- data.frame(Year=df$Year,!is.na(df[1:4]))
df1 <- aggregate(.~Year, data = df1, FUN = sum)
df1 <- data.frame(Year=df1$Year, apply(apply(df1[,2:5], 2, as.logical), 2, as.numeric))
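As a more compact variant of the same idea, one could ask per year and per column whether any value is non-missing. This is only a sketch on a small made-up data frame (toy and presence are hypothetical names), using the data frame method of aggregate(), which does not drop rows containing NA.

```r
# Made-up data: x2 is entirely missing in 2011
toy <- data.frame(
  Year = c(2011, 2011, 2012, 2012),
  x1 = c(1, 2, 3, 4),
  x2 = c(NA, NA, 0.5, NA)
)

# 1 if the column has at least one non-NA value in that year, else 0
presence <- aggregate(
  toy[c("x1", "x2")],
  by = list(Year = toy$Year),
  FUN = function(x) as.integer(any(!is.na(x)))
)
presence
```

For the real data the column selection would be df[1:4] and the grouping df$Year, under the assumption that df was built as a valid data frame padded with NA.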
I have a large data set which used different coding schemes for the same variables over different time periods. The coding in each time period is represented as a column with values during the year it was active and NA everywhere else.
I was able to "combine" them by using nested ifelse commands together with dplyr's mutate [see edit below], but I am running into a problem using ifelse to do something slightly different. I want to code a new variable based on whether ANY of the previous variables meets a condition. But for some reason, the ifelse construct below does not work.
MWE:
library("dplyr")
library("magrittr")
df <- data.frame(id = 1:12, year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)), varA = c("A","C","A","C","B",rep(NA,7)), varB = c(rep(NA,5),"B","A","C","A","B",rep(NA,2)))
df %>% mutate(varC = ifelse(varA == "C" | varB == "C", "C", "D"))
Output:
> df
id year varA varB varC
1 1 1995 A <NA> <NA>
2 2 1995 C <NA> C
3 3 1995 A <NA> <NA>
4 4 1995 C <NA> C
5 5 1995 B <NA> <NA>
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C C
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
If I don't use the | operator, and test against only varA, it will come out with the results as expected, but it will only apply to those years that varA is not NA.
Output:
> df %<>% mutate(varC = ifelse(varA == "C", "C", "D"))
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C <NA>
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
Desired output:
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B D
7 7 1996 <NA> A D
8 8 1996 <NA> C C
9 9 1996 <NA> A D
10 10 1996 <NA> B D
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
How do I get what I'm looking for?
To make this question more applicable to a wider audience, and to learn from this situation, it would be great have an explanation as to what is happening with the comparison using | that causes it not to work as expected. Thanks in advance!
EDIT: This is what I meant by successfully combining them with nested ifelses
> df %>% mutate(varC = ifelse(year == 1995, as.character(varA),
+ ifelse(year == 1996, as.character(varB), NA)))
id year varA varB varC
1 1 1995 A <NA> A
2 2 1995 C <NA> C
3 3 1995 A <NA> A
4 4 1995 C <NA> C
5 5 1995 B <NA> B
6 6 1996 <NA> B B
7 7 1996 <NA> A A
8 8 1996 <NA> C C
9 9 1996 <NA> A A
10 10 1996 <NA> B B
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
R has this annoying tendency where the logical value of a condition that involves NA is just NA, rather than TRUE or FALSE.
e.g. NA > 0 is NA rather than FALSE.
Combined with TRUE, the result depends on the operator: TRUE | NA is TRUE, but TRUE & NA is NA.
Combined with FALSE, it is the other way around: FALSE & NA is FALSE, but FALSE | NA is NA.
In fact, NA is like a logical value between TRUE and FALSE, where | picks the higher of its operands and & the lower, e.g. NA | TRUE | FALSE is TRUE.
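These rules are easy to check at the console. The short sketch below also reproduces the first row of the question's data, where varA is "A" and varB is NA (the variable names here are only for illustration):

```r
or_true   <- TRUE | NA    # TRUE: one TRUE is enough for |
or_false  <- FALSE | NA   # NA: the result depends on the unknown value
and_true  <- TRUE & NA    # NA: the result depends on the unknown value
and_false <- FALSE & NA   # FALSE: one FALSE is enough for &

# First row of the question's data
varA <- "A"
varB <- NA
row1 <- varA == "C" | varB == "C"   # FALSE | NA, which is NA

# ifelse() passes an NA condition through as NA
ifelse(row1, "C", "D")
```

This is why rows without a "C" came out as NA instead of "D" in the question.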
So here's a way to hack this:
ifelse((varA == 'C' & !is.na(varA)) | (varB == 'C' & !is.na(varB)), 'C', 'D')
How do we interpret this? On the left side of the OR, we have the following: If varA is NA, then we have NA&FALSE. Since NA is one step above FALSE in the hierarchy of logicals, the & is going to force the whole thing to be FALSE. Otherwise, if varA is not NA but it's not 'C', you'll have FALSE&TRUE which gives FALSE as you want. Otherwise, if it's 'C', they're both true. Same goes for the thing on the right of the OR.
When using a condition that involves x, but x can be NA, I like to use
((condition for x)&!is.na(x)) to completely rule out the NA output and force the TRUE or FALSE values in the situations I want.
EDIT: I just remembered that you want an NA output if they're both NA. This doesn't end up doing it, so that's my bad. Unless you're okay with a 'D' output when they're both NA.
EDIT2: This should output the NAs as you want:
ifelse(is.na(varA)&is.na(varB), NA, ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB)), 'C','D'))
Per @Khashaa's comment, this should do the trick and get you to the desired output.
df %>%
mutate(varC = ifelse(is.na(varA) & is.na(varB), NA,
ifelse(varA %in% "C" | varB %in% "C", "C", "D")))
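The reason %in% helps is that it never returns NA: a missing value is simply "not in" the set. A minimal sketch on a few made-up rows, without dplyr:

```r
# Made-up rows covering each case: no C, C in varA, C in varB, both missing
varA <- c("A", "C", NA, NA, NA)
varB <- c(NA, NA, "B", "C", NA)

NA %in% "C"   # FALSE, whereas NA == "C" is NA

varC <- ifelse(is.na(varA) & is.na(varB), NA,
               ifelse(varA %in% "C" | varB %in% "C", "C", "D"))
varC
```

The outer ifelse() keeps NA only where both inputs are missing, matching the desired output in the question.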