Imputing NA with conditional LOCF

I've updated the question with a new, different problem. This time I would like to derive column Oxy2 from Oxy.
ID Oxy Y Oxy2
1 NA 2010 NA
1 0 2011 0
1 NA 2012 NA
1 1 2013 1
1 NA 2014 1
1 NA 2015 1
1 -1 2016 1
2 0 2011 0
2 NA 2012 NA
2 1 2013 1
2 -1 2014 1
3 0 2012 0
3 -1 2013 -1
3 NA 2014 NA
4 -1 2010 -1
4 1 2011 1
4 -1 2012 1
4 -1 2013 1
4 0 2014 1
4 NA 2015 1
Basically, I need to keep the NAs, if there are any, while the previous values of my Oxy variable are 0 or -1, and replace everything that comes after the first 1 appears with 1.
Again, thanks for your suggestions.

library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(Ins1=na.locf(ifelse(is.na(Ins) & lag(Ins)==0, 999, Ins), na.rm = FALSE), Ins2=na_if(Ins1, 999))
#one step version
#mutate(Ins1 = na_if(na.locf(ifelse(is.na(Ins) & lag(Ins)==0, 999, Ins), na.rm = FALSE), 999))
# A tibble: 8 x 5
# Groups: ID [2]
ID Ins Y Ins1 Ins2
<int> <int> <int> <dbl> <dbl>
1 1 0 2010 0 0
2 1 NA 2011 999 NA
3 1 1 2012 1 1
4 1 NA 2013 1 1
5 1 NA 2014 1 1
6 2 0 2011 0 0
7 2 0 2012 0 0
8 2 NA 2013 999 NA
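To make the sentinel idea above easier to follow, here is a minimal base R sketch on a single vector. The locf() helper is a hypothetical stand-in for zoo::na.locf(x, na.rm = FALSE), and prev plays the role of dplyr::lag(); neither name comes from the answer.

```r
# Hypothetical base R stand-in for zoo::na.locf(x, na.rm = FALSE)
locf <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # index of the last non-NA seen so far
  idx[idx == 0] <- NA                        # nothing to carry before the first value
  x[idx]
}

x      <- c(0, NA, 1, NA, NA)
prev   <- c(NA, head(x, -1))                    # previous element, like dplyr::lag(x)
marked <- ifelse(is.na(x) & prev == 0, 999, x)  # NA right after a 0 becomes the sentinel
filled <- locf(marked)                          # the sentinel blocks the 0 from being carried
result <- replace(filled, filled %in% 999, NA)  # turn the sentinel back into NA
result
```

The value 999 only works as a sentinel as long as it can never occur in the real data.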
Update: To solve the -1 issue, I made a small change to what @user12492692 suggested in the Edit, namely replacing the | with %in%:
df %>%
group_by(ID) %>%
mutate(Ins1 = na.locf(ifelse(is.na(Ins) & lag(Ins) %in% c(0,-1), 999, Ins), na.rm = FALSE),
Ins2 = na_if(Ins1, 999))
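For the Oxy/Oxy2 version of the question (everything after the first 1 within an ID becomes 1, everything before it is left untouched, NAs included), a base R sketch could look like the following. This is not the approach used in the answer above; the helper column name seen_one is my own, and the data frame is retyped from the question.

```r
df <- data.frame(
  ID  = c(1, 1, 1, 1, 1, 1, 1,  2, 2, 2, 2,  3, 3, 3,  4, 4, 4, 4, 4, 4),
  Oxy = c(NA, 0, NA, 1, NA, NA, -1,  0, NA, 1, -1,  0, -1, NA,  -1, 1, -1, -1, 0, NA),
  Y   = c(2010:2016, 2011:2014, 2012:2014, 2010:2015)
)

# 1 from the first row where Oxy == 1 onward, 0 before it, computed per ID
seen_one <- ave(as.integer(!is.na(df$Oxy) & df$Oxy == 1), df$ID, FUN = cummax)

# After the first 1: force 1. Before it: keep the original value (including NA).
df$Oxy2 <- ifelse(seen_one == 1, 1, df$Oxy)
df
```

This assumes the rows are already sorted by Y within each ID, as they are in the question.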

Here is another alternative that fills all values using the LOCF and then adds NA's following the zeros:
library(dplyr)
df1 %>%
mutate(Ins_b = Ins[!is.na(Ins)][cumsum(!is.na(Ins))],
Ins_b = replace(Ins_b, is.na(Ins) & Ins_b == 0, NA))
ID Ins Y Ins_b
1 1 0 2010 0
2 1 NA 2011 NA
3 1 1 2012 1
4 1 NA 2013 1
5 1 NA 2014 1
6 2 0 2011 0
7 2 0 2012 0
8 2 NA 2013 NA
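A step-by-step look at the indexing trick used above, on one small vector (the intermediate names vals and slot are mine):

```r
x <- c(0, NA, 1, NA, NA)

vals   <- x[!is.na(x)]       # the non-missing values: 0 1
slot   <- cumsum(!is.na(x))  # which non-missing value each position maps to: 1 1 2 2 2
filled <- vals[slot]         # plain LOCF: 0 0 1 1 1

# Undo the fill wherever the original was NA and the carried value is 0
out <- replace(filled, is.na(x) & filled == 0, NA)
out
```

Two caveats, as far as I can tell: a leading NA gives slot 0, which silently drops an element instead of returning NA, and the snippet above has no group_by(ID), so values can be carried from one ID into the next. Neither matters for the sample data, because every ID starts with a non-missing value.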

Related

Replacing Grouped Data in R

I want to create a new column based on conditions
My data frame looks like:
Id Year Code Y
1 2009 0 0
1 2010 NA NA
2 2009 0 0
2 2010 NA NA
3 2009 1 1
3 2010 NA NA
4 2009 2 1
4 2010 NA NA
I need to replace values in my original Y variable in a way that returns 1 when the first year code for each individual is equal to 0 and the second year/code line is NA. The output that I'm looking for is:
Id Year Code Y
1 2009 0 0
1 2010 NA 1
2 2009 0 0
2 2010 NA 1
3 2009 1 1
3 2010 NA 0
4 2009 2 1
4 2010 NA 0
Thanks in advance!

For each Id, this replaces an NA in the Y column with 1 if the non-missing Y is 0, and with 0 if it is 1.
library(dplyr)
df %>%
arrange(Id, Year) %>%
group_by(Id) %>%
mutate(Y = ifelse(is.na(Y), as.integer(!as.logical(na.omit(Y))), Y))
# Id Year Code Y
# <int> <int> <int> <int>
#1 1 2009 0 0
#2 1 2010 NA 1
#3 2 2009 0 0
#4 2 2010 NA 1
#5 3 2009 1 1
#6 3 2010 NA 0
#7 4 2009 2 1
#8 4 2010 NA 0
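The same idea can be sketched in base R with ave(), in case dplyr is not available. The helper name flip_na is hypothetical, and the sketch assumes each Id has at least one non-missing Y and that rows are already ordered by Year.

```r
df <- data.frame(
  Id   = rep(1:4, each = 2),
  Year = rep(2009:2010, times = 4),
  Code = c(0, NA, 0, NA, 1, NA, 2, NA),
  Y    = c(0, NA, 0, NA, 1, NA, 1, NA)
)

# Fill NA with the logical negation of the first observed Y in the group
flip_na <- function(y) {
  first <- y[!is.na(y)][1]
  y[is.na(y)] <- as.integer(!as.logical(first))
  y
}

df$Y <- ave(df$Y, df$Id, FUN = flip_na)
df
```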

how to obtain an element/column even when it's NA with tapply in R

I have a dataset like this:
df <- data.frame("y"=c(2010,2011,2012,2013,2010,2012,2010,2011,2012),"x"=c(1,2,1,1,2,2,4,4,4),"a"=c(5,3,0,2,3,0,2,3,0))
y x a
1 2010 1 5
2 2011 2 3
3 2012 1 0
4 2013 1 2
5 2010 2 3
6 2012 2 0
7 2010 4 2
8 2011 4 3
9 2012 4 0
And I want to sum 'a' for each 'y' and 'x', using:
sum <- tapply(df$a,list(df$y,df$x),sum)
That is:
1 2 4
2010 5 3 2
2011 NA 3 3
2012 0 0 0
2013 2 NA NA
How can I also obtain the '3' column, even though the value 3 does not occur in column x of df?
Something like this:
1 2 3 4
2010 5 3 NA 2
2011 NA 3 NA 3
2012 0 0 NA 0
2013 2 NA NA NA

Convert the x column to a factor whose levels include every value between the min and max of x.
df$x <- factor(df$x, levels = seq(min(df$x), max(df$x)))
tapply(df$a,list(df$y,df$x),sum)
# 1 2 3 4
#2010 5 3 NA 2
#2011 NA 3 NA 3
#2012 0 0 NA 0
#2013 2 NA NA NA
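The reason this works is that tapply() produces one cell per factor level, whether or not the level is observed. A tiny one-dimensional illustration (the vectors here are made up for the example):

```r
x <- factor(c(1, 2, 4), levels = 1:4)  # level "3" is declared but never observed
a <- c(10, 20, 30)

res <- tapply(a, x, sum)
res  # the unobserved level "3" still gets a cell, filled with NA
```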

R - Bind values from certain columns and selected rows and replacing them inside the group

I would like to create a new variable called Var3 that combines the values of Year and Month from the row in which Var1 == 1. My data is grouped by ID (in long format). In cases without a 1 on Var1 in any row (e.g. ID 3) there should be NA's on Var3.
df <- read.table(text=
"ID Var1 Year Month
1 0 2008 2
1 0 2009 2
1 0 2010 2
1 0 2011 2
1 1 2013 2
1 0 2014 10
2 0 2008 2
2 0 2010 2
2 1 2011 2
2 0 2013 2
2 0 2015 11
3 0 2010 2
3 0 2011 2
3 0 2013 2
3 0 2015 11
3 0 2017 10", header=TRUE)
My expected outcome would look like this:
df <- read.table(text=
"ID Var1 Year Month Var3
1 0 2008 2 20132
1 0 2009 2 20132
1 0 2010 2 20132
1 0 2011 2 20132
1 1 2013 2 20132
1 0 2014 10 20132
2 0 2008 2 20112
2 0 2010 2 20112
2 1 2011 2 20112
2 0 2013 2 20112
2 0 2015 11 20112
3 0 2010 2 NA
3 0 2011 2 NA
3 0 2013 2 NA
3 0 2015 11 NA
3 0 2017 10 NA",header=TRUE)
I am trying to figure out how to solve this issue using dplyr. I am pretty new to tidyverse therefore any suggestions are more than welcome. I already figured out that I have to use group_by(ID) and probably mutate to create the new variable. Can anybody help me out?

One possible solution using dplyr is
df %>%
group_by(ID) %>%
mutate(Var3 = ifelse(Var1 == 1, paste0(Year, Month), NA)) %>%
mutate(Var3 = max(Var3, na.rm = TRUE))
The idea behind it is: first, you paste together Year and Month where Var1 == 1, then inside each group you spread the only value present for Var3 with a function such as max (but it could also be min) removing the NA values.
Output
# A tibble: 16 x 5
# Groups: ID [3]
ID Var1 Year Month Var3
<int> <int> <int> <int> <chr>
1 1 0 2008 2 20132
2 1 0 2009 2 20132
3 1 0 2010 2 20132
4 1 0 2011 2 20132
5 1 1 2013 2 20132
6 1 0 2014 10 20132
7 2 0 2008 2 20112
8 2 0 2010 2 20112
9 2 1 2011 2 20112
10 2 0 2013 2 20112
11 2 0 2015 11 20112
12 3 0 2010 2 NA
13 3 0 2011 2 NA
14 3 0 2013 2 NA
15 3 0 2015 11 NA
16 3 0 2017 10 NA
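An alternative sketch in base R that picks the first non-missing key explicitly rather than relying on max(), which sidesteps the question of what max() does for a group such as ID 3 that has no non-missing values. The helper name spread_key and the shortened data frame are assumptions for illustration only.

```r
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3, 3),
  Var1  = c(0, 1, 0, 0, 1, 0, 0),
  Year  = c(2011, 2013, 2014, 2010, 2011, 2013, 2015),
  Month = c(2, 2, 10, 2, 2, 2, 11)
)

# Year and Month pasted together, but only on the row where Var1 == 1
key <- ifelse(df$Var1 == 1, paste0(df$Year, df$Month), NA_character_)

# Copy the group's single key to every row; groups without a key stay NA
spread_key <- function(k) {
  hit <- k[!is.na(k)]
  if (length(hit) > 0) rep(hit[1], length(k)) else rep(NA_character_, length(k))
}

df$Var3 <- ave(key, df$ID, FUN = spread_key)
df
```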

How to flag first change in a variable value between years, per group?

Given a very large longitudinal dataset with different groups, I need to create a flag that indicates the first change in a certain variable (code) between years (year), per group (id). The type of observation within the same id-year just indicates different group members.
Sample data:
library(tidyverse)
sample <- tibble(id = rep(1:3, each=6),
year = rep(2010:2012, 3, each=2),
type = (rep(1:2, 9)),
code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))
What I need is to flag the first change to code within a group, between years. Second changes do not matter. Missing codes ("") can be treated as NA but in any case should not affect flag. The following is the above tibble with a flag field as it should be:
# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1
I still have a looping mindset and I am trying to use vectorized dplyr to do what I need.
Any input would be greatly appreciated!
EDIT: thanks for pointing out the importance of year. The id's are arranged by year, as the ordering is important here, and all types per id per year need to have the same flag. So, in the edited row 15 the code is "", which would not warrant a change by itself, but since row 16 in the same year has a new code, both observations need to have their flags set to 1.
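To illustrate the requirement in the EDIT, here is a rough base R sketch of one way to read it: flag a row when its code is non-empty and differs from the first non-empty code of its id, then give every row of the same id and year the largest flag found in that year. The names obs, first_code, raw and key are mine, and the data is only id 3 from the sample.

```r
obs <- data.frame(
  id   = rep(3, 6),
  year = rep(2010:2012, each = 2),
  code = c("def", "def", "", "klm", "nop", "nop")
)

# First non-empty code per id, copied to every row of that id
first_code <- ave(obs$code, obs$id, FUN = function(x) x[x != ""][1])

# Row-level flag: a non-empty code that differs from the first one
raw <- as.integer(obs$code != "" & obs$code != first_code)

# Share the flag across all rows of the same id and year
key <- paste(obs$id, obs$year)
obs$flag <- ave(raw, key, FUN = max)
obs
```

Under this reading a later return to the original code would drop the flag back to 0, so it is only adequate if codes never revert.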

We can use data.table
library(data.table)
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code)-1; cummax(rl*(rl < 2)) }, id]
sample
# id year type code flag
# 1: 1 2010 1 abc 0
# 2: 1 2010 2 abc 0
# 3: 1 2011 1 0
# 4: 1 2011 2 0
# 5: 1 2012 1 xyz 1
# 6: 1 2012 2 xyz 1
# 7: 2 2010 1 0
# 8: 2 2010 2 0
# 9: 2 2011 1 lmn 0
#10: 2 2011 2 0
#11: 2 2012 1 efg 1
#12: 2 2012 2 efg 1
#13: 3 2010 1 def 0
#14: 3 2010 2 def 0
#15: 3 2011 1 klm 1
#16: 3 2011 2 klm 1
#17: 3 2012 1 nop 1
#18: 3 2012 2 nop 1
Update
If we need to include the 'year' as well,
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code, year)-1
cummax(rl*(rl < 2)) }, id]
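To see what the cummax(rl*(rl < 2)) expression does, here is a base R walk-through for a single group. run_id() is a hypothetical helper that mimics data.table::rleid() for one vector using rle(); first_change() is likewise just a name for this sketch.

```r
# Consecutive-run counter, similar in spirit to data.table::rleid() on one vector
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

first_change <- function(code) {
  rl <- run_id(code) - 1   # 0 for the first run, 1 for the second, 2 for the third, ...
  cummax(rl * (rl < 2))    # zero out runs past the second, then carry the 1 forward
}

first_change(c("abc", "abc", "xyz", "xyz"))
first_change(c("def", "def", "klm", "klm", "nop", "nop"))
```

Without the (rl < 2) factor, the second call would return 0 0 1 1 2 2 instead of a 0/1 flag.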

A possible solution using dplyr, though I'm not sure it's the cleanest way:
sample %>%
group_by(id) %>%
#find first year per group where code exists
mutate(first_year = min(year[code != ""])) %>%
#gather all codes from first year (does not assume code is constant within year)
mutate(first_codes = list(code[year==first_year])) %>%
#if year is not first year & code not in first year codes & code not blank
mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>%
#drop created columns
select(-first_year, -first_codes) %>%
ungroup()
output
# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 klm 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1

A short solution with the data.table-package:
library(data.table)
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code)-1 > 0), by = id]
Or:
setDT(samp)[, flag := 0][code!="", flag := 1*(code!=code[1] & code!=''), by = id][]
which gives the desired result:
> samp
id year type code flag
1: 1 2010 1 abc 0
2: 1 2010 2 abc 0
3: 1 2011 1 0
4: 1 2011 2 0
5: 1 2012 1 xyz 1
6: 1 2012 2 xyz 1
7: 2 2010 1 0
8: 2 2010 2 0
9: 2 2011 1 lmn 0
10: 2 2011 2 0
11: 2 2012 1 efg 1
12: 2 2012 2 efg 1
13: 3 2010 1 def 0
14: 3 2010 2 def 0
15: 3 2011 1 klm 1
16: 3 2011 2 klm 1
17: 3 2012 1 nop 1
18: 3 2012 2 nop 1
Or when the year is relevant as well:
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code, year)-1 > 0), id]
A possible base R alternative:
f <- function(x) {
x <- rle(x)$lengths
1 * (rep(seq_along(x), times=x) - 1 > 0)
}
samp$flag <- 0
samp$flag[samp$code!=''] <- as.numeric(with(samp[samp$code!='',], ave(as.character(code), id, FUN = f)))
NOTE: it is better not to give your object the same name as functions.
Used data:
samp <- data.frame(id = rep(1:3, each=6),
year = rep(2010:2012, 3, each=2),
type = (rep(1:2, 9)),
code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","klm","klm","nop","nop"))

R: assign previous non NA value 'n' times based on value in previous non NA row

I have a dataframe test_case. I have missing data in a column (income).
test_case <- data.frame(
person=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
year=c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2013, 2014, 2014, 2014),
income=c(4, 10, 13, NA, NA, NA, 13, NA, NA, NA, NA, NA),
cutoff=c(0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0)
)
The variable cutoff specifies the number of times that I would like to 'carry forward' the values in income into subsequent rows (using the na.locf() method in the package zoo). For example, in the dataframe above, the value of 2 in cutoff indicates that income should be carried forward twice.
I have seen examples on SO about specifying how to use na.locf to carry forward n times when n is constant. But in my case, I am having trouble generalizing (R -- Carry last observation forward n times) when n is changing.
Here is my original dataframe:
person year income cutoff
1 1 2010 4 0
2 1 2011 10 0
3 1 2012 13 2
4 2 2010 NA 0
5 2 2011 NA 0
6 2 2012 NA 0
7 3 2010 13 3
8 3 2011 NA 0
9 3 2013 NA 0
10 3 2014 NA 0
11 3 2014 NA 0
12 3 2014 NA 0
And here is the desired output:
person year income cutoff
1 1 2010 4 0
2 1 2011 10 0
3 1 2012 13 2
4 2 2010 13 0
5 2 2011 13 0
6 2 2012 NA 0
7 3 2010 13 3
8 3 2011 13 0
9 3 2013 13 0
10 3 2014 13 0
11 3 2014 NA 0
12 3 2014 NA 0

Here's an attempt using data.table. The grouping method is the one from @jeremycg's answer, though I'm avoiding ifelse and lapply here; instead I combine the first income value, replicated cutoff[1L] + 1L times, with NA values replicated .N - (cutoff[1L] + 1L) times. I'm also operating only on the rows from the first time cutoff > 0L onward.
library(data.table)
setDT(test_case)[which.max(cutoff > 0L):.N, # Or `cutoff > 0L | is.na(income)`
income := c(rep(income[1L], cutoff[1L] + 1L), rep(NA, .N - (cutoff[1L] + 1L))),
by = cumsum(cutoff != 0L)]
test_case
# person year income cutoff
# 1: 1 2010 4 0
# 2: 1 2011 10 0
# 3: 1 2012 13 2
# 4: 2 2010 13 0
# 5: 2 2011 13 0
# 6: 2 2012 NA 0
# 7: 3 2010 13 3
# 8: 3 2011 13 0
# 9: 3 2013 13 0
# 10: 3 2014 13 0
# 11: 3 2014 NA 0
# 12: 3 2014 NA 0
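For comparison, a package-free sketch of the same block logic in base R. The intermediate names (grp, pos, n, first_income, fill) are mine. Unlike the data.table code above, this version leaves rows outside the carry window untouched instead of overwriting them with NA, which makes no difference for the sample data.

```r
test_case <- data.frame(
  income = c(4, 10, 13, NA, NA, NA, 13, NA, NA, NA, NA, NA),
  cutoff = c(0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0)
)

grp <- cumsum(test_case$cutoff != 0)               # a new block starts at each non-zero cutoff
pos <- ave(seq_along(grp), grp, FUN = seq_along)   # row position inside its block
n   <- ave(test_case$cutoff, grp, FUN = function(x) x[1])         # the block's cutoff
first_income <- ave(test_case$income, grp, FUN = function(x) x[1])

# Fill the block's first n + 1 rows (the source row plus n carried rows)
fill <- n > 0 & pos <= n + 1
test_case$income <- ifelse(fill, first_income, test_case$income)
test_case
```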

Here's an answer using dplyr.
It works by grouping by the cumulative sum of different cutoffs.
Then, for each row, it makes a list entry of one FALSE if cutoff is 0, or cutoff + 1 TRUEs otherwise; the list is unlisted and sliced to the size of the group.
Then using ifelse, the income is either unmodified or made to be the first income (ie the cutoff one).
library(dplyr)
test_case %>% group_by(z = cumsum(cutoff != 0)) %>%
mutate(income = ifelse(unlist(lapply(cutoff, function(x) rep(as.logical(x), max(1,x + 1))))[1:n()], income[1], income))
Source: local data frame [12 x 5]
Groups: z [3]
z person year income cutoff
(int) (dbl) (dbl) (dbl) (dbl)
1 0 1 2010 4 0
2 0 1 2011 10 0
3 1 1 2012 13 2
4 1 2 2010 13 0
5 1 2 2011 13 0
6 1 2 2012 NA 0
7 2 3 2010 13 3
8 2 3 2011 13 0
9 2 3 2013 13 0
10 2 3 2014 13 0
11 2 3 2014 NA 0
12 2 3 2014 NA 0
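The mask construction is the least obvious part, so here it is in isolation for one made-up group whose first row has cutoff 2 (the names pieces and mask are mine):

```r
cutoff <- c(2, 0, 0, 0)
income <- c(13, NA, NA, NA)

# cutoff 2 -> TRUE TRUE TRUE; cutoff 0 -> a single FALSE
pieces <- lapply(cutoff, function(x) rep(as.logical(x), max(1, x + 1)))

# Flatten and trim to the group size (what [1:n()] does in the dplyr version)
mask <- unlist(pieces)[seq_along(cutoff)]

# TRUE positions take the first income of the group, the rest are left as they are
filled <- ifelse(mask, income[1], income)
filled
```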

A solution using na.locf could work in a similar way to @jeremycg's solution. We simply need to group by cumsum(cutoff != 0) and another variable, which is the shifted row_number.
My solution isn't as elegant as jeremycg's one, but this is how I approached it:
library(dplyr)
library(zoo)
test_case %>%
mutate(
rownum = row_number(),
cutoff2 = ifelse(cutoff == 0, NA, cutoff + rownum),
cutoff2 = na.locf(cutoff2, na.rm = FALSE),
cutoff2 = ifelse(rownum > cutoff2, NA, cutoff2)
) %>%
group_by(z = cumsum(cutoff != 0), cutoff2) %>%
mutate(income = na.locf(income, na.rm = FALSE))
# Source: local data frame [12 x 7]
# Groups: z, cutoff2 [5]
#
# person year income cutoff rownum cutoff2 z
# (dbl) (dbl) (dbl) (dbl) (int) (dbl) (int)
# 1 1 2010 4 0 1 NA 0
# 2 1 2011 10 0 2 NA 0
# 3 1 2012 13 2 3 5 1
# 4 2 2010 13 0 4 5 1
# 5 2 2011 13 0 5 5 1
# 6 2 2012 NA 0 6 NA 1
# 7 3 2010 13 3 7 10 2
# 8 3 2011 13 0 8 10 2
# 9 3 2013 13 0 9 10 2
# 10 3 2014 13 0 10 10 2
# 11 3 2014 NA 0 11 NA 2
# 12 3 2014 NA 0 12 NA 2