Subsetting data based on multiple stratified fields and criteria - r

My data frame has multiple factors. I would like to subset the data in a way that excludes only data that belongs to a specific factor level within another factor level.
I tried the following two approaches, but only one worked and I am not sure why. I would appreciate it if someone could explain.
This is a simplified example, where f1 and f2 are the factors:
df = data.frame(f1 = c(rep(2019,4),rep(2018,4),rep(2017,4)),
f2 = rep(1:4,3), data = c(0:11))
print (df)
Output:
f1 f2 data
1 2019 1 0
2 2019 2 1
3 2019 3 2
4 2019 4 3
5 2018 1 4
6 2018 2 5
7 2018 3 6
8 2018 4 7
9 2017 1 8
10 2017 2 9
11 2017 3 10
12 2017 4 11
In this case I want to drop only the rows where f2 is 1 and f1 is 2019, and keep everything else.
Method 1:
subs.df = subset (df, f1 != 2019 & f2 != 1)
print (subs.df)
f1 f2 data
6 2018 2 5
7 2018 3 6
8 2018 4 7
10 2017 2 9
11 2017 3 10
12 2017 4 11
Method 2:
subs.df = subset (df, !(f1 %in% 2019 & f2 %in% 1))
print (subs.df)
f1 f2 data
2 2019 2 1
3 2019 3 2
4 2019 4 3
5 2018 1 4
6 2018 2 5
7 2018 3 6
8 2018 4 7
9 2017 1 8
10 2017 2 9
11 2017 3 10
12 2017 4 11
WORKED!
Why doesn't method 1 work but method 2 does?
What are the differences?

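This is a logic issue: by De Morgan's law, the negation of (A and B) is (not A) or (not B).
Method 1 computes (not A) and (not B) instead, so it drops every 2019 row and every row where f2 is 1. You just have to replace & with | (or):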
subs.df = subset (df, f1 != 2019 | f2 != 1)
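As a quick check of the logic above, here is a minimal base R sketch on the question's data. The names a, b, same_as_or, same_as_and, rows_or and rows_and are only illustrative and do not appear in the original post.

```r
# Same data as in the question.
df <- data.frame(f1 = rep(c(2019, 2018, 2017), each = 4),
                 f2 = rep(1:4, 3),
                 data = 0:11)

a <- df$f1 == 2019   # A: row is from 2019
b <- df$f2 == 1      # B: row has f2 equal to 1

# De Morgan: !(A & B) is the same as !A | !B ...
same_as_or <- identical(!(a & b), !a | !b)
# ... but not the same as !A & !B, which is what Method 1 computes.
same_as_and <- identical(!(a & b), !a & !b)

rows_or  <- nrow(subset(df, f1 != 2019 | f2 != 1))  # drops only the 2019 / f2 == 1 row
rows_and <- nrow(subset(df, f1 != 2019 & f2 != 1))  # drops all of 2019 and all f2 == 1
```

With this data, rows_or is 11 (only one row removed) while rows_and is 6, which matches the two outputs shown in the question.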

Related

Delete rows with redundant information in r (not just duplicates)

In this sample data:
id<-c(2,2,2,2,2,3,3,3,3,3,3,4,4,4,4)
time<-c(3,5,7,8,9,2,8,10,12,14,18,4,6,7,9)
status<-c('mar','mar','div','c','mar','mar','div','mar','mar','c','div','mar','mar','c','mar')
myd<-data.frame(id,time,status)
id time status
1 2 3 mar
2 2 5 mar
3 2 7 div
4 2 8 c
5 2 9 mar
6 3 2 mar
7 3 8 div
8 3 10 mar
9 3 12 mar
10 3 14 c
11 3 18 div
12 4 4 mar
13 4 6 mar
14 4 7 c
15 4 9 mar
I need to know when each person married. If there are two consecutive 'mar' rows with no 'div' anywhere in between, the person never divorced, so it is the same marriage and the repeated row is not needed. The same goes for a sequence of mar, c, mar: since no div occurs, the marriage before and after the child is the same one, so the second mar can be deleted. I suspect I need min(time[status=='mar']), but that would be wrong for a sequence like mar, mar, div, mar, div, mar, where only the second mar should be deleted, not every mar after the first.
So the new data should look something like
id time status
2 2 5 mar
3 2 7 div
4 2 8 c
5 2 9 mar
6 3 2 mar
7 3 8 div
8 3 10 mar
10 3 14 c
11 3 18 div
13 4 6 mar
14 4 7 c
This was my approach, which only worked for one row
myd2<-myd %>%
group_by(id) %>%
mutate(dum1=ifelse(status=='mar',min(time[status=='mar']),NA),
dum2=cumsum(status=='div'),
flag=ifelse(time>dum1 & dum2==0,1,0))
If I get rid of dum2==0 then it deleted too many rows.
Using a quick helper function,
func <- function(x, vals = c("mar", "div")) {
out <- rep(TRUE, length(x))
last <- x[1]
for (ind in seq_along(x)[-1]) {
out[ind] <- x[ind] != last || !x[ind] %in% vals
if (out[ind] && x[ind] %in% vals) last <- x[ind]
}
out
}
We can do
library(data.table)
as.data.table(myd)[, .SD[func(status),], by = .(id)]
# id time status
# <num> <num> <char>
# 1: 2 3 mar
# 2: 2 7 div
# 3: 2 8 c
# 4: 2 9 mar
# 5: 3 2 mar
# 6: 3 8 div
# 7: 3 10 mar
# 8: 3 14 c
# 9: 3 18 div
# 10: 4 4 mar
# 11: 4 7 c
If you want this in dplyr, then
library(dplyr)
myd %>%
group_by(id) %>%
filter(func(status))
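To see how the helper behaves on the edge cases mentioned in the question, here is a small sketch that copies func from above and runs it on two hand-made sequences. The vectors seq_status and with_child and the result names are made up for illustration only.

```r
# Helper copied from the answer above: keep a row unless it repeats the
# most recent "mar"/"div" status; other statuses (such as "c") are always kept.
func <- function(x, vals = c("mar", "div")) {
  out <- rep(TRUE, length(x))
  last <- x[1]
  for (ind in seq_along(x)[-1]) {
    out[ind] <- x[ind] != last || !x[ind] %in% vals
    if (out[ind] && x[ind] %in% vals) last <- x[ind]
  }
  out
}

# The sequence from the question: only the second "mar" should go.
seq_status <- c("mar", "mar", "div", "mar", "div", "mar")
keep_seq <- func(seq_status)

# A child between two "mar" rows does not start a new marriage.
with_child <- c("mar", "c", "mar")
keep_child <- func(with_child)
```

Here keep_seq is TRUE, FALSE, TRUE, TRUE, TRUE, TRUE and keep_child is TRUE, TRUE, FALSE, so the "c" row is kept but does not reset the remembered marital status.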
My approach:
library(dplyr)
myd %>%
group_by(id) %>%
arrange(time) %>%
filter(status != lag(status) | is.na(lag(status))) %>%
ungroup() %>%
arrange(id)
Returns:
# A tibble: 12 x 3
id time status
<dbl> <dbl> <chr>
1 2 3 mar
2 2 7 div
3 2 8 c
4 2 9 mar
5 3 2 mar
6 3 8 div
7 3 10 mar
8 3 14 c
9 3 18 div
10 4 4 mar
11 4 7 c
12 4 9 mar
I would delete rows in which the status is unchanged by creating a lag_status variable in grouped data:
> myd %>%
+ arrange(id, time) %>%
+ group_by(id) %>%
+ mutate(lag_status = lag(status)) %>%
+ ungroup() %>%
+ filter(is.na(lag_status) | status != lag_status) %>%
+ select(-lag_status)
# A tibble: 12 x 3
id time status
<dbl> <dbl> <fct>
1 2 3 mar
2 2 7 div
3 2 8 c
4 2 9 mar
5 3 2 mar
6 3 8 div
7 3 10 mar
8 3 14 c
9 3 18 div
10 4 4 mar
11 4 7 c
12 4 9 mar
I read two different questions in your post:
1. When the person first married
2. How to make a list that removes redundant status information
It seems like you have a solution for #1, but you actually want #2.
I read #2 as a desire to filter out rows where the id and status are the same as the previous row. That would look like:
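# lag() is NA on the first row and filter() drops NA conditions, so keep row 1 explicitly
myd %>%
filter(row_number() == 1 | !(id == lag(id) & status == lag(status)))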

R combining multiple vectors created using dplyr's pull

I have monthly data for 2019, 2020 and only 2 months data for 2021 (Jan and Feb). I want to make a vector of these 26 values for use as a time series.
my_dat <- data.frame(X2021 = c(1:2,rep(NA,10)), X2020 = 1:12, X2019 = 1:12)
library(dplyr)
X2021 <- my_dat %>% pull(X2021)
X2021 <- X2021[ -(3:12) ]
x <- my_dat %>% pull(X2019,X2020)
c(x, X2021)
##1 2 3 4 5 6 7 8 9 10 11 12
##1 2 3 4 5 6 7 8 9 10 11 12 1 2
I expected:
c(1:12, 1:12, 1:2)
What went wrong?
Since pull is equivalent to $ in base R, and can only be used for extracting one variable, I think you want select and then unlist. E.g.:
my_dat %>% select(X2019, X2020) %>% unlist(use.names=FALSE)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
Which would be equivalent to using the square brackets [] in base R:
unlist(my_dat[c("X2019","X2020")], use.names=FALSE)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
As to why the original code didn't work, ?pull shows the syntax is:
pull(.data, var, name)
So
my_dat %>% pull(X2019,X2020)
is just pulling / extracting X2019 and naming it with X2020. To give a clearer example:
dat <- data.frame(a=1:3, b=month.abb[1:3])
pull(dat, a, b)
#Jan Feb Mar
# 1 2 3
unname(pull(dat, a, b))
#[1] 1 2 3
names(pull(dat, a, b))
#[1] "Jan" "Feb" "Mar"
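For the original goal of a single 26-value series, a base R sketch along the same lines might look like the following. The names x2021, values and monthly are illustrative, and the start date of January 2019 is an assumption based on the question's description of the data.

```r
my_dat <- data.frame(X2021 = c(1:2, rep(NA, 10)), X2020 = 1:12, X2019 = 1:12)

# Drop the missing months of 2021 instead of hard-coding positions 3:12.
x2021 <- my_dat$X2021[!is.na(my_dat$X2021)]

# Concatenate the years in chronological order.
values <- c(my_dat$X2019, my_dat$X2020, x2021)

# Monthly series, assumed to start in January 2019.
monthly <- ts(values, start = c(2019, 1), frequency = 12)
```

Filtering on is.na() rather than on fixed indices keeps the code working when more months of 2021 are filled in.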

summation for multiple columns dynamically

Hi, I have a data frame with multiple columns: the first 5 columns are metadata, and the remaining
columns (their count is always even) are the value columns that need to be combined with this formula:
formula : (col6*col9) + (col7*col10) + (col8*col11)
country<-c("US","US","US","US")
name <-c("A","B","c","d")
dob<-c(2017,2018,2018,2010)
day<-c(1,4,7,9)
hour<-c(10,11,2,4)
a <-c(1,3,4,5)
d<-c(1,9,4,0)
e<-c(8,1,0,7)
f<-c(10,2,5,6)
j<-c(1,4,2,7)
m<-c(1,5,7,1)
df=data.frame(country,name,dob,day,hour,a,d,e,f,j,m)
How can I compute the final sum when there are more columns?
I have tried the code below:
df$final <-(df$a*df$f)+(df$d*df$j)+(df$e*df$m)
Here is one way to generalize the computation:
x <- ncol(df) - 5
df$final <- rowSums(df[6:(5 + x/2)] * df[(ncol(df) - x/2 + 1):ncol(df)])
# country name dob day hour a d e f j m final
# 1 US A 2017 1 10 1 1 8 10 1 1 19
# 2 US B 2018 4 11 3 9 1 2 4 5 47
# 3 US c 2018 7 2 4 4 0 5 2 7 28
# 4 US d 2010 9 4 5 0 7 6 7 1 37
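The same idea can be wrapped in a small function that also checks the "even number of value columns" assumption. This is only a sketch: paired_sum and its n_meta argument are hypothetical names, not part of the original answer.

```r
df <- data.frame(country = rep("US", 4), name = c("A", "B", "c", "d"),
                 dob = c(2017, 2018, 2018, 2010), day = c(1, 4, 7, 9),
                 hour = c(10, 11, 2, 4),
                 a = c(1, 3, 4, 5), d = c(1, 9, 4, 0), e = c(8, 1, 0, 7),
                 f = c(10, 2, 5, 6), j = c(1, 4, 2, 7), m = c(1, 5, 7, 1))

# Hypothetical helper: multiply the first half of the value columns by the
# second half, column by column, and sum the products within each row.
paired_sum <- function(data, n_meta = 5) {
  values <- data[-seq_len(n_meta)]           # drop the metadata columns
  stopifnot(ncol(values) %% 2 == 0)          # the pairing needs an even count
  half <- ncol(values) / 2
  rowSums(values[seq_len(half)] * values[half + seq_len(half)])
}

final <- paired_sum(df)
```

For the question's data this gives 19, 47, 28 and 37, the same values as the final column shown above.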

How to delete observations in R based criterion that observations have same value?

I have the following data frame, from which I would like to remove observations based on three criteria: x=x, y=y and z>=60.
df <- data.frame(x=c(1,1,2,2,3,3,4,4),
y=c(2011,2012,2011,2011,2013,2014,2011,2012),
z=c(15,15,60,60,15,15,30,15))
> df
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 2 2011 60
5 3 2013 15
6 3 2014 15
7 4 2011 30
8 4 2012 15
The data frame I'm looking for is thus (which one of the x=2 observations is removed doesn't matter):
> df1
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 3 2013 15
5 3 2014 15
6 4 2011 30
7 4 2012 15
My first thoughts included using unique or duplicated, but I cannot work out how to apply them in practice.
This should do the trick. Look for duplicated x and y entries where z is also greater than or equal to 60:
df[!(duplicated(df[,1:2]) & df$z >= 60), ]
# x y z
#1 1 2011 15
#2 1 2012 15
#3 2 2011 60
#5 3 2013 15
#6 3 2014 15
#7 4 2011 30
#8 4 2012 15
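To make the two parts of that condition visible, here is the same filter split into named steps. The names dup_xy and drop are illustrative only.

```r
df <- data.frame(x = c(1, 1, 2, 2, 3, 3, 4, 4),
                 y = c(2011, 2012, 2011, 2011, 2013, 2014, 2011, 2012),
                 z = c(15, 15, 60, 60, 15, 15, 30, 15))

# TRUE for every row whose (x, y) pair already appeared in an earlier row.
dup_xy <- duplicated(df[, c("x", "y")])

# Only drop a repeated (x, y) row when its z is also at least 60.
drop <- dup_xy & df$z >= 60

df1 <- df[!drop, ]
```

duplicated() marks only the later occurrence, so the first x=2 row is kept and the second one (row 4) is removed.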

How to join without losing information?

I have several data frames with the following structure:
january      february     march        april
Id  A  B     Id  A  B     Id  A  B     Id  A  B
 1  4  4      1  2  3      3  9  7      1  4  3
 2  3  5      2  2  7      2  2  4      4  6  2
 3  6  8      4  9  9                   2  3  5
 4  7  8
I would like to combine them into one single data frame that contains NA for the missing IDs and their corresponding attributes. The result might look like:
Id janA janB febA febB marA marB aprA aprB
1 4 4 2 3 NA NA 4 3
2 3 5 2 7 2 4 3 5
3 6 8 NA NA 9 7 NA NA
4 7 8 9 9 NA NA 6 2
Given some data:
ID<-c(1,2,3,4)
A<-c(4,3,6,7)
B<-c(4,5,8,8)
jan<-data.frame(ID,A,B)
ID<-c(1,2,4)
A<-c(2,2,9)
B<-c(3,7,9)
feb<-data.frame(ID,A,B)
ID<-c(3,2)
A<-c(9,2)
B<-c(7,4)
mar<-data.frame(ID,A,B)
ID<-c(1,4,2)
A<-c(4,6,3)
B<-c(6,2,5)
apr<-data.frame(ID,A,B)
What I have tried:
test <- rbind(jan, feb,mar,apr)
test <- rbind.fill(jan, feb, mar,apr)
You can use merge within Reduce.
First, let's prepare a list with the data and change the column names to janA, janB, febA, ...
list_df <- list(
jan = jan,
feb = feb,
mar = mar,
apr = apr
)
list_df <- lapply(names(list_df), function(name_month){
df_month <- list_df[[name_month]]
names(df_month)[-1] <- paste0(name_month, names(df_month)[-1])
df_month
})
Reduce will merge all of them.
Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), list_df)
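A self-contained run of the same idea on the data defined in the question is sketched below. It uses Map instead of lapply as a small variant so the renaming step receives each data frame together with its month name. The names months, renamed and wide are illustrative.

```r
jan <- data.frame(ID = c(1, 2, 3, 4), A = c(4, 3, 6, 7), B = c(4, 5, 8, 8))
feb <- data.frame(ID = c(1, 2, 4),    A = c(2, 2, 9),    B = c(3, 7, 9))
mar <- data.frame(ID = c(3, 2),       A = c(9, 2),       B = c(7, 4))
apr <- data.frame(ID = c(1, 4, 2),    A = c(4, 6, 3),    B = c(6, 2, 5))

months <- list(jan = jan, feb = feb, mar = mar, apr = apr)

# Prefix every column except ID with the month name (janA, janB, ...).
renamed <- Map(function(d, m) {
  names(d)[-1] <- paste0(m, names(d)[-1])
  d
}, months, names(months))

# Full outer join on ID; IDs missing from a month get NA in that month's columns.
wide <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), renamed)
```

The result has one row per ID and nine columns. all = TRUE is what keeps IDs that are absent from some months. Without it, merge would keep only the IDs present in every data frame.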
