Replacing Grouped Data in R

I want to create a new column based on conditions
My data frame looks like:
Id Year Code Y
1 2009 0 0
1 2010 NA NA
2 2009 0 0
2 2010 NA NA
3 2009 1 1
3 2010 NA NA
4 2009 2 1
4 2010 NA NA
I need to replace values in my original Y variable in a way that returns 1 when the first year code for each individual is equal to 0 and the second year/code line is NA. The output that I'm looking for is:
Id Year Code Y
1 2009 0 0
1 2010 NA 1
2 2009 0 0
2 2010 NA 1
3 2009 1 1
3 2010 NA 0
4 2009 2 1
4 2010 NA 0
Thanks in advance!

This fills each NA in the Y column with the flipped value of that Id's known Y (0 becomes 1, 1 becomes 0). It assumes each Id has one non-NA Y, as in the example.
library(dplyr)
df %>%
arrange(Id, Year) %>%
group_by(Id) %>%
mutate(Y = ifelse(is.na(Y), as.integer(!as.logical(na.omit(Y))), Y))
# Id Year Code Y
# <int> <int> <int> <int>
#1 1 2009 0 0
#2 1 2010 NA 1
#3 2 2009 0 0
#4 2 2010 NA 1
#5 3 2009 1 1
#6 3 2010 NA 0
#7 4 2009 2 1
#8 4 2010 NA 0
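
For comparison, here is a base R sketch of the same flip-and-fill idea, assuming (as in the example) that each Id has a non-missing Y in its first year. The data frame is rebuilt from the question's table, and flip_fill is a hypothetical helper name, not a function from any package.

```r
# Data rebuilt from the question (integer columns are an assumption)
df <- data.frame(
  Id   = rep(1:4, each = 2),
  Year = rep(c(2009L, 2010L), times = 4),
  Code = c(0L, NA, 0L, NA, 1L, NA, 2L, NA),
  Y    = c(0L, NA, 0L, NA, 1L, NA, 1L, NA)
)

# Hypothetical helper: within one Id, replace NA with the flipped
# first non-missing value (0 -> 1, anything else -> 0)
flip_fill <- function(y) {
  first <- y[!is.na(y)][1]
  ifelse(is.na(y), as.integer(first == 0), y)
}

df <- df[order(df$Id, df$Year), ]          # make sure years are in order
df$Y <- ave(df$Y, df$Id, FUN = flip_fill)  # apply the helper per Id
print(df)
```

ave() applies the helper to each Id's block of Y values and puts the results back in the original row positions, so no merge step is needed.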

Related

Create sequence by condition in the case when condition changes

the data looks like:
df <- data.frame("Grp"=c(rep("A",10),rep("B",10)),
"Year"=c(seq(2001,2010,1),seq(2001,2010,1)),
"Treat"=c(as.character(c(0,0,1,1,1,1,0,0,1,1)),
as.character(c(1,1,1,0,0,0,1,1,1,0))))
df
Grp Year Treat
1 A 2001 0
2 A 2002 0
3 A 2003 1
4 A 2004 1
5 A 2005 1
6 A 2006 1
7 A 2007 0
8 A 2008 0
9 A 2009 1
10 A 2010 1
11 B 2001 1
12 B 2002 1
13 B 2003 1
14 B 2004 0
15 B 2005 0
16 B 2006 0
17 B 2007 1
18 B 2008 1
19 B 2009 1
20 B 2010 0
All I want is to generate another col seq to count the sequence of Treat by Grp, maintaining the sequence of Year. I think the hard part is that when Treat turns to 0, seq should be 0 or whatever, and the sequence of Treat should be re-counted when it turns back to non-zero again. An example of the final dataframe looks like below:
Grp Year Treat seq
1 A 2001 0 0
2 A 2002 0 0
3 A 2003 1 1
4 A 2004 1 2
5 A 2005 1 3
6 A 2006 1 4
7 A 2007 0 0
8 A 2008 0 0
9 A 2009 1 1
10 A 2010 1 2
11 B 2001 1 1
12 B 2002 1 2
13 B 2003 1 3
14 B 2004 0 0
15 B 2005 0 0
16 B 2006 0 0
17 B 2007 1 1
18 B 2008 1 2
19 B 2009 1 3
20 B 2010 0 0
Any suggestions would be much appreciated!
With data.table's rleid(), you can do:
library(dplyr)
df %>%
group_by(Grp, grp = data.table::rleid(Treat)) %>%
mutate(seq = row_number() * as.integer(Treat)) %>%
ungroup %>%
select(-grp)
# Grp Year Treat seq
# <chr> <dbl> <chr> <int>
# 1 A 2001 0 0
# 2 A 2002 0 0
# 3 A 2003 1 1
# 4 A 2004 1 2
# 5 A 2005 1 3
# 6 A 2006 1 4
# 7 A 2007 0 0
# 8 A 2008 0 0
# 9 A 2009 1 1
#10 A 2010 1 2
#11 B 2001 1 1
#12 B 2002 1 2
#13 B 2003 1 3
#14 B 2004 0 0
#15 B 2005 0 0
#16 B 2006 0 0
#17 B 2007 1 1
#18 B 2008 1 2
#19 B 2009 1 3
#20 B 2010 0 0
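
To see what the run grouping is doing, here is a base R sketch that swaps data.table::rleid() for rle() and sequence(). It is an alternative illustration, not the answer's method. run_counter is a hypothetical helper name, and the rows are assumed to be sorted by Year within each Grp.

```r
# Data rebuilt from the question
df <- data.frame(
  Grp   = rep(c("A", "B"), each = 10),
  Year  = rep(2001:2010, times = 2),
  Treat = as.character(c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
                         1, 1, 1, 0, 0, 0, 1, 1, 1, 0))
)

# Hypothetical helper: count positions within each run of equal values,
# then zero out the runs where treat is 0
run_counter <- function(treat) {
  runs <- rle(treat)                 # lengths of consecutive equal values
  sequence(runs$lengths) * treat     # 1, 2, 3, ... inside each run
}

# Convert Treat to integer first so the result stays integer
df$seq <- ave(as.integer(df$Treat), df$Grp, FUN = run_counter)
print(df)
```

Multiplying by the treatment value plays the same role as `row_number() * as.integer(Treat)` in the dplyr version: runs of 0 collapse to 0 and runs of 1 keep their counter.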

How to obtain an element/column even when it's NA with tapply in R

I have a dataset like this:
df <- data.frame("y"=c(2010,2011,2012,2013,2010,2012,2010,2011,2012),"x"=c(1,2,1,1,2,2,4,4,4),"a"=c(5,3,0,2,3,0,2,3,0))
y x a
1 2010 1 5
2 2011 2 3
3 2012 1 0
4 2013 1 2
5 2010 2 3
6 2012 2 0
7 2010 4 2
8 2011 4 3
9 2012 4 0
And I want to sum 'a' for each 'y' and 'x', using:
sum <- tapply(df$a,list(df$y,df$x),sum)
That is:
1 2 4
2010 5 3 2
2011 NA 3 3
2012 0 0 0
2013 2 NA NA
How can I also obtain the '3' column, even though the value 3 does not occur in column x of df?
Something like this:
1 2 3 4
2010 5 3 NA 2
2011 NA 3 NA 3
2012 0 0 NA 0
2013 2 NA NA NA
Convert the x column to a factor whose levels include every value between the min and max of x. tapply keeps unused factor levels, so the missing value gets its own (all-NA) column.
df$x <- factor(df$x, levels = seq(min(df$x), max(df$x)))
tapply(df$a,list(df$y,df$x),sum)
# 1 2 3 4
#2010 5 3 NA 2
#2011 NA 3 NA 3
#2012 0 0 NA 0
#2013 2 NA NA NA
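
A small sketch of why this works, assuming x holds whole numbers: tapply() builds one column per factor level, whether or not that level occurs in the data. Here the factor is kept in a separate variable (x_f, a name chosen for illustration) so the original column is not overwritten.

```r
df <- data.frame(
  y = c(2010, 2011, 2012, 2013, 2010, 2012, 2010, 2011, 2012),
  x = c(1, 2, 1, 1, 2, 2, 4, 4, 4),
  a = c(5, 3, 0, 2, 3, 0, 2, 3, 0)
)

# Level "3" never occurs in df$x, but it is declared as a level
x_f <- factor(df$x, levels = seq(min(df$x), max(df$x)))

res <- tapply(df$a, list(df$y, x_f), sum)
print(colnames(res))   # one column per declared level
print(res)
```

Cells with no observations, including the whole column for the unused level, are filled with NA by default.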

R - Bind values from certain columns and selected rows and replacing them inside the group

I would like to create a new variable called Var3 that combines the values of Year and Month from the row in which Var1 == 1. My data is grouped by ID (in long format). In cases without a 1 on Var1 in any row (e.g. ID 3) there should be NA's on Var3.
df <- read.table(text=
"ID Var1 Year Month
1 0 2008 2
1 0 2009 2
1 0 2010 2
1 0 2011 2
1 1 2013 2
1 0 2014 10
2 0 2008 2
2 0 2010 2
2 1 2011 2
2 0 2013 2
2 0 2015 11
3 0 2010 2
3 0 2011 2
3 0 2013 2
3 0 2015 11
3 0 2017 10", header=TRUE)
My expected outcome would look like this:
df <- read.table(text=
"ID Var1 Year Month Var3
1 0 2008 2 20132
1 0 2009 2 20132
1 0 2010 2 20132
1 0 2011 2 20132
1 1 2013 2 20132
1 0 2014 10 20132
2 0 2008 2 20112
2 0 2010 2 20112
2 1 2011 2 20112
2 0 2013 2 20112
2 0 2015 11 20112
3 0 2010 2 NA
3 0 2011 2 NA
3 0 2013 2 NA
3 0 2015 11 NA
3 0 2017 10 NA",header=TRUE)
I am trying to figure out how to solve this using dplyr. I am pretty new to the tidyverse, so any suggestions are more than welcome. I already figured out that I have to use group_by(ID) and probably mutate to create the new variable. Can anybody help me out?
One possible solution using dplyr is:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Var3 = ifelse(Var1 == 1, paste0(Year, Month), NA)) %>%
mutate(Var3 = max(Var3, na.rm = TRUE))
The idea behind it: first, paste Year and Month together where Var1 == 1 (leaving NA elsewhere). Then, within each group, copy that single non-NA value to every row with a function such as max (min would work equally well) and na.rm = TRUE.
Output
# A tibble: 16 x 5
# Groups: ID [3]
ID Var1 Year Month Var3
<int> <int> <int> <int> <chr>
1 1 0 2008 2 20132
2 1 0 2009 2 20132
3 1 0 2010 2 20132
4 1 0 2011 2 20132
5 1 1 2013 2 20132
6 1 0 2014 10 20132
7 2 0 2008 2 20112
8 2 0 2010 2 20112
9 2 1 2011 2 20112
10 2 0 2013 2 20112
11 2 0 2015 11 20112
12 3 0 2010 2 NA
13 3 0 2011 2 NA
14 3 0 2013 2 NA
15 3 0 2015 11 NA
16 3 0 2017 10 NA
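
As an alternative sketch in base R, the same result can be reached by picking the first non-NA key in each group instead of calling max on a character column. This is not the answer's method. It assumes at most one row per ID has Var1 == 1, and key is a variable name introduced here for illustration.

```r
df <- read.table(text =
"ID Var1 Year Month
1 0 2008 2
1 0 2009 2
1 0 2010 2
1 0 2011 2
1 1 2013 2
1 0 2014 10
2 0 2008 2
2 0 2010 2
2 1 2011 2
2 0 2013 2
2 0 2015 11
3 0 2010 2
3 0 2011 2
3 0 2013 2
3 0 2015 11
3 0 2017 10", header = TRUE)

# Year and Month pasted together on the flagged row, NA everywhere else
key <- ifelse(df$Var1 == 1, paste0(df$Year, df$Month), NA_character_)

# Within each ID, take the first non-NA key (NA if the group has none)
df$Var3 <- ave(key, df$ID, FUN = function(v) v[!is.na(v)][1])
print(df)
```

Indexing an empty vector with [1] gives NA, which is what leaves ID 3 as NA without any special case.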

Imputing NA with conditional LOCF

I've updated this with a new, different problem. This time I would like to obtain column Oxy2 from Oxy.
ID Oxy Y Oxy2
1 NA 2010 NA
1 0 2011 0
1 NA 2012 NA
1 1 2013 1
1 NA 2014 1
1 NA 2015 1
1 -1 2016 1
2 0 2011 0
2 NA 2012 NA
2 1 2013 1
2 -1 2014 1
3 0 2012 0
3 -1 2013 -1
3 NA 2014 NA
4 -1 2010 -1
4 1 2011 1
4 -1 2012 1
4 -1 2013 1
4 0 2014 1
4 NA 2015 1
Basically, I need to keep the NAs, if there are any, while the previous values of my Oxy variable are 0 or -1, and replace everything that comes after the first 1 with 1.
Again, thanks for your suggestions.
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
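# mark NAs that directly follow a 0 with the sentinel 999 so LOCF does not fill them with 0
mutate(Ins1 = na.locf(ifelse(is.na(Ins) & lag(Ins) == 0, 999, Ins), na.rm = FALSE),
# turn the sentinel back into NA
Ins2 = na_if(Ins1, 999))
# one-step version
# mutate(Ins1 = na_if(na.locf(ifelse(is.na(Ins) & lag(Ins) == 0, 999, Ins), na.rm = FALSE), 999))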
# A tibble: 8 x 5
# Groups: ID [2]
ID Ins Y Ins1 Ins2
<int> <int> <int> <dbl> <dbl>
1 1 0 2010 0 0
2 1 NA 2011 999 NA
3 1 1 2012 1 1
4 1 NA 2013 1 1
5 1 NA 2014 1 1
6 2 0 2011 0 0
7 2 0 2012 0 0
8 2 NA 2013 999 NA
Update: To solve the -1 issue, I made a small change to what @user12492692 suggested in the Edit, namely replacing the | with %in%:
df %>%
group_by(ID) %>%
mutate(Ins1 = na.locf(ifelse(is.na(Ins) & lag(Ins) %in% c(0,-1), 999, Ins), na.rm = FALSE),
Ins2 = na_if(Ins1, 999))
Here is another alternative that fills all values using the LOCF and then adds NA's following the zeros:
library(dplyr)
df1 %>%
mutate(Ins_b = Ins[!is.na(Ins)][cumsum(!is.na(Ins))],
Ins_b = replace(Ins_b, is.na(Ins) & Ins_b == 0, NA))
ID Ins Y Ins_b
1 1 0 2010 0
2 1 NA 2011 NA
3 1 1 2012 1
4 1 NA 2013 1
5 1 NA 2014 1
6 2 0 2011 0
7 2 0 2012 0
8 2 NA 2013 NA
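
For the Oxy/Oxy2 table in the question, the stated rule (leave values untouched until the first 1 appears, then set everything from there on to 1) can also be sketched in base R with a cumulative flag instead of LOCF. This is a swapped-in technique, not either answer above. after_first_one is a hypothetical helper name, and the rows are assumed to be sorted by Y within each ID.

```r
# Data rebuilt from the question's table
df <- data.frame(
  ID  = c(rep(1, 7), rep(2, 4), rep(3, 3), rep(4, 6)),
  Oxy = c(NA, 0, NA, 1, NA, NA, -1,
          0, NA, 1, -1,
          0, -1, NA,
          -1, 1, -1, -1, 0, NA),
  Y   = c(2010:2016, 2011:2014, 2012:2014, 2010:2015)
)

# Hypothetical helper: TRUE from the first 1 onward, FALSE before it
after_first_one <- function(v) {
  seen_one <- cumsum(!is.na(v) & v == 1) > 0
  ifelse(seen_one, 1, v)   # keep original values (including NA) before the first 1
}

df$Oxy2 <- ave(df$Oxy, df$ID, FUN = after_first_one)
print(df)
```

Because nothing is carried forward before the first 1, the NAs that follow a 0 or -1 stay NA without needing a sentinel value.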

Finding occurrences of a repeat value in a row, R dataframe or Excel?

Currently need some help with the below dataframe (which is also the same format in Excel so this could be done in Excel or R)
Dataframe:
Company_id Year Month Employee_Range Employees Cheese Chips Eggs
1 2014 NA NA NA 1 0 0
1 2014 NA NA NA 1 0 0
1 2014 NA NA NA 1 0 0
2 2014 NA NA NA 0 1 0
3 2014 NA NA NA 0 0 1
3 2014 NA NA NA 0 0 1
The dataframe continues on for about 630,000 rows, here is some further information
1) In the column Company_id, each company is numbered, so 1 = company 1, 2 = company 2, and so on. A company's row is repeated if it received Chips, Eggs or Cheese more than once, which is why company 2 has only one row.
2) The numbers under the columns Cheese, Chips and Eggs mean 1 = "yes, they ordered it" and 0 = "no, they did not order it", so it works like a tally table in which each company is a row.
3) The rest of the information is NA, as it is not needed.
4) If a company chose one of Eggs, Cheese or Chips, then it appears in that column only. There are no occurrences where a company chose more than one item, so all the numbers for a company are confined to a single column.
I would like a way to sum these counts for each company, so as to produce a dataframe or Excel table such as:
Company_id Year Month Employee_Range Employees Cheese Chips Eggs
1 2014 NA NA NA 3 0 0
2 2014 NA NA NA 0 1 0
3 2014 NA NA NA 0 0 2
Any ideas are helpful,
Thank you,
A solution using dplyr. dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
group_by(Company_id, Year, Month, Employee_Range, Employees) %>%
# funs() is deprecated since dplyr 0.8.0; pass the function directly
summarise_at(vars(Cheese, Chips, Eggs), sum) %>%
ungroup()
dat2
# # A tibble: 3 x 8
# Company_id Year Month Employee_Range Employees Cheese Chips Eggs
# <int> <int> <lgl> <lgl> <lgl> <int> <int> <int>
# 1 1 2014 NA NA NA 3 0 0
# 2 2 2014 NA NA NA 0 1 0
# 3 3 2014 NA NA NA 0 0 2
DATA
dat <- read.table(text = "Company_id Year Month Employee_Range Employees Cheese Chips Eggs
1 2014 NA NA NA 1 0 0
1 2014 NA NA NA 1 0 0
1 2014 NA NA NA 1 0 0
2 2014 NA NA NA 0 1 0
3 2014 NA NA NA 0 0 1
3 2014 NA NA NA 0 0 1",
header = TRUE)
Try this:
library(dplyr)
df %>%
group_by(Company_id, Year, Month, Employee_Range) %>%
summarize(Cheese = sum(Cheese),
Chips = sum(Chips),
Eggs = sum(Eggs)) %>%
as.data.frame()
The result, as requested:
Company_id Year Month Employee_Range Cheese Chips Eggs
1 1 2014 NA NA 3 0 0
2 2 2014 NA NA 0 1 0
3 3 2014 NA NA 0 0 2
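
A base R sketch of the same per-company totals using aggregate(), offered as an alternative to the dplyr answers. Only Company_id and Year are used as grouping columns here, because aggregate() omits rows that have NA in any grouping variable, and Month, Employee_Range and Employees are all NA in this data. The data frame below is rebuilt from the question with those NA columns left out.

```r
# Data rebuilt from the question (all-NA columns omitted)
dat <- data.frame(
  Company_id = c(1L, 1L, 1L, 2L, 3L, 3L),
  Year       = 2014L,
  Cheese     = c(1L, 1L, 1L, 0L, 0L, 0L),
  Chips      = c(0L, 0L, 0L, 1L, 0L, 0L),
  Eggs       = c(0L, 0L, 0L, 0L, 1L, 1L)
)

# Sum the three tally columns within each Company_id / Year combination
totals <- aggregate(dat[c("Cheese", "Chips", "Eggs")],
                    by  = dat[c("Company_id", "Year")],
                    FUN = sum)
print(totals)
```

If the NA columns are needed in the final table, they can be added back afterwards, since they carry no information that varies by company.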
