The rules of subsetting in R

Having df1 and df2 as follows:
df1 <- read.table(text =" x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2",header=TRUE)
df2 <- read.table(text =" a b c
1 1 1
1 2 8
1 1 2
2 6 2",header=TRUE)
I can ask of the data a bunch of things like:
df2[ df2$b == 6 | df2$c == 8 ,] # any rows where b == 6 or c == 8 in df2
# and additive (AND) conditions:
df2[ df2$b == 6 & df2$c == 8 ,] # zero rows
and between data.frames:
df1[ df1$z %in% df2$c ,] # rows in df1 where values in z are in c (all rows)
This gives me all rows:
df1[ (df1$x %in% df2$a) &
     (df1$y %in% df2$b) &
     (df1$z %in% df2$c) ,]
but shouldn't this give me all rows of df1 too:
df1[ df1$z %in% df2$c | df1$b == 9,]
What I am really hoping to do is to subset df1 and df2 on three column conditions,
so that I only get rows in df1 where a, b and c all equal x, y and z at the same time within a row. In my real data I will have more than 3 columns, but I will still want to subset on 3 additive column conditions.
So subsetting my example data df1 on df2 my result would be:
df1
1 1 1
1 1 2
Playing with the syntax has confused me more, and the SO posts are all variations of what I want that actually led to more confusion for me.
I figured out I can do this:
merge(df1,df2, by.x=c("x","y","z"),by.y=c("a","b","c"))
which gives me what I want, but I would like to understand why my [ attempts are wrong.

In addition to your nice solution using merge (thanks for that, I always forget merge), this can be achieved in base R using ?interaction as follows. There may be other variations of this, but this is the one I am familiar with:
> df1[interaction(df1) %in% interaction(df2), ]
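A quick check on the example data (rebuilt here so the snippet is self-contained): this selects exactly the two rows of df1 whose (x, y, z) triple also occurs as an (a, b, c) triple in df2.

```r
df1 <- read.table(text = "x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2", header = TRUE)
df2 <- read.table(text = "a b c
1 1 1
1 2 8
1 1 2
2 6 2", header = TRUE)
# interaction() pastes each row into a single factor level like "1.1.1",
# so %in% compares whole rows at once
res <- df1[interaction(df1) %in% interaction(df2), ]
res
#   x y z
# 1 1 1 1
# 3 1 1 2
```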
Now to answer your question: First, I think there's a typo (corrected) in:
df1[ df1$z %in% df2$c | df2$b == 9,] # second part should be df2$b == 9
You would get a warning, because the first part evaluates to
[1] TRUE TRUE TRUE TRUE TRUE
and the second evaluates to:
[1] FALSE FALSE FALSE FALSE
You do a | operation on vectors of unequal length; R recycles the shorter one and emits the warning:
longer object length is not a multiple of shorter object length
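The recycling behaviour is easy to reproduce in isolation (a minimal sketch, mirroring the lengths above):

```r
x <- c(TRUE, TRUE, TRUE, TRUE, TRUE)  # length 5, like df1$z %in% df2$c
y <- c(FALSE, FALSE, FALSE, FALSE)    # length 4, like df2$b == 9
# R recycles y to length 5 and warns that the lengths don't divide evenly
res <- x | y
length(res)  # 5
```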
Edit: If you have multiple columns, then you can choose which columns enter the interaction. For example, if you want to get from df1 the rows where the first two columns match those of df2, then you could simply do:
> df1[interaction(df1[, 1:2]) %in% interaction(df2[, 1:2]), ]
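For instance, with the example frames from the question, matching on the first two columns keeps every df1 row whose (x, y) pair occurs among the (a, b) pairs of df2 (a self-contained sketch):

```r
df1 <- read.table(text = "x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2", header = TRUE)
df2 <- read.table(text = "a b c
1 1 1
1 2 8
1 1 2
2 6 2", header = TRUE)
# pair-wise membership on columns 1 and 2 only
res <- df1[interaction(df1[, 1:2]) %in% interaction(df2[, 1:2]), ]
res  # the three rows where (x, y) is (1, 1) or (1, 2)
```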


Calculate value in third column based off values in other columns but different rows

Sorry if this is a trivial question or doesn't make sense; this is my first post. I'm coming from Excel, where I've worked with IF statements and INDEX/MATCH functions, and I am trying to do something similar in R: pull data from two columns, but not necessarily the same row, to get a value in a third column. My example is this:
df<-data.frame(ID=c(1,5,4,2,3),A=c(1,0,1,1,1),B=c(0,0,1,0,0))
desired output: df<-data.frame(ID=c(1,5,4,2,3),A=c(1,0,1,1,1),B=c(0,0,1,0,0),C=c(0,0,0,0,1))
What I want is to create a third column "C" that essentially follows this format:
ifelse(A[ID] == 1 & B[ID+1] == 1, C[ID] = 1, C[ID] = 0) # pseudocode
Essentially, if A = 1 for ID x and B = 1 for ID x + 1, then the new column C for ID x is 1, otherwise 0. I could order everything by ID if that makes things easier, but doing it by the ID column would be ideal.
So far I've tried ifelse statements, but I imagine there is probably a better way of doing this.
Using dplyr, we can use lead to get next element after arranging the data by ID.
library(dplyr)
df %>%
  arrange(ID) %>%
  mutate(C = as.integer(A == 1 & lead(B) == 1))
# ID A B C
#1 1 1 0 0
#2 2 1 0 0
#3 3 1 0 1
#4 4 1 1 0
#5 5 0 0 0
In base R, we can do
df1 <- df[order(df$ID),]
df1$C <- with(df1, c(A[-nrow(df1)] == 1 & tail(B, -1) == 1, 0))
Without arranging the data, we can probably do
transform(df, C = as.integer(A[ID] == 1 & B[match(ID + 1, ID)] == 1))
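The ordered base-R version above can be sanity-checked end to end on the example data (a sketch; note the trailing 0 pads the last row, which has no successor):

```r
df <- data.frame(ID = c(1, 5, 4, 2, 3),
                 A  = c(1, 0, 1, 1, 1),
                 B  = c(0, 0, 1, 0, 0))
df1 <- df[order(df$ID), ]
# compare each row's A with the next row's B; the last row gets 0
df1$C <- with(df1, c(A[-nrow(df1)] == 1 & tail(B, -1) == 1, 0))
df1$C  # 0 0 1 0 0 -- only ID 3 has A == 1 and B == 1 at ID 4
```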
Using the lead function (from dplyr) I got this to work:
library(dplyr)
df <- df[order(df$ID), ]
df$C <- ifelse(df$A == 1 & lead(df$B) == 1, 1, 0)

R: how to subset duplicate rows of data.frame

set.seed(3)
mydata <- data.frame(id = c(1:5),
                     score = c(rnorm(5, 0, 1)))
ids <- c(1, 2, 3, 3)
> subset(mydata, id %in% ids)
id score
1 1 -0.9619334
2 2 -0.2925257
3 3 0.2587882
I have a situation where I would like to subset all rows of mydata such that its id matches my ids. The catch is that my ids has the number 3 repeated twice. But it seems that subset only extracted the unique rows, I'm guessing due to the operator %in%. However, my desired output is
> subset(mydata, id %in% ids)
id score
1 1 -0.9619334
2 2 -0.2925257
3 3 0.2587882
4 3 0.2587882
I've also tried to use the == operator instead. However, that didn't seem to do the trick.
Rather than using %in%, try its sister function match():
mydata[match(ids, mydata$id), ]
This will return one row per entry in ids, duplicated IDs included.
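Worth noting: match() returns the position of the first match for each element of ids, so this assumes the values in mydata$id are unique (as they are here). A self-contained check:

```r
set.seed(3)
mydata <- data.frame(id = 1:5, score = rnorm(5, 0, 1))
ids <- c(1, 2, 3, 3)
# one row position per element of ids, repeats and all
res <- mydata[match(ids, mydata$id), ]
nrow(res)  # 4 -- the row for id 3 appears twice
```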

R - Compare column values in data frames of differing lengths by unique ID

I'm sure I can figure out a straightforward solution to this problem, but I didn't see a comparable question so I thought I'd post a question.
I have a longitudinal dataset with thousands of respondents over several time intervals. Everything from the questions to the data types can differ between the waves, and constructing indicators or dummy variables often requires long series of bools; but each respondent has a unique ID, with no additional respondents added to the surveys after the first wave, so easy enough.
The issue is that while the early waves consist of one (Stata) file each, the later waves contain lots of addendum files, structured differently. So, for example, in constructing indicators for the sex of previous partners, there were columns (for one wave) called partnerNum and sex, with up to 16 rows for each unique ID (respondent). Easy enough to spread (or cast) that data into a single row for each unique ID, with columns partnerNum_1 ... partnerNum_16 taking the value from the sex column, in partnerDF. Then it's easy to construct indicators like:
sexuality$newIndicator[mainDF$bioSex == "Male" & apply(partnerDF[1:16] == "Male", 1, any)] <- 1
For other addendum files in the last two waves the data is structured long like the partner data, with multiple rows for each unique ID, but rather than just one variable like sex there are hundreds that I need to use to test against to construct indicators, all coded with different types, so it's impractical to spread (or cast) the data wide (never mind writing those bools). There are actually several of these files for each wave and the way they are structured some respondents (unique ID) occupy just 1 row, some a few dozen. (I've left_join'ed the addendum files together for each wave.)
What I'd like to be able to do to is test something like:
newDF$indicator[any(waveIIIAdds$var1 == 1) & any(waveIIIAdds$var2 == 1)] <- 1
or
newDF$indicator[mainDF$var1 == 1 & any(waveIIIAdds$var2 == 1)] <- 1
where newDF is the same length as mainDF (one row per unique ID).
So, for example, if I had two dfs.
df1 <- data.frame(ID = c(1:4), A = rep("a"))
df2 <- data.frame(ID = rep(1:4, each=2), B = rep(1:2, 2), stringsAsFactors = FALSE)
df1$A[1] <- "b"
df1$A[3] <- "b"
df2$B[8] <- 3
> df1
ID A
1 b
2 a
3 b
4 a
> df2
ID B
1 1
1 2
2 1
2 2
3 1
3 2
4 1
4 3
I'd like to run a test like this (assuming df3 has one column, just the unique IDs from df1):
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] & df2$ID[df2$B == 2]] <- 1
So df3 would have one unique ID per row. Since there is an "a" in df1$A for all IDs but df1$A[1], and a 2 in at least one row of df2$B for all IDs except the last ID (df2$B[7:8]), the result would be:
> df3
ID new
1 0
2 1
3 1
4 0
and
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] | df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 1
4 0
This does it...
df3 <- data.frame(ID = unique(df1$ID),
                  new = sapply(unique(df1$ID), function(x)
                    as.numeric(x %in% df1$ID[df1$A == "a"] &
                               x %in% df2$ID[df2$B == 2])))
df3
ID new
1 1 1
2 2 1
3 3 1
4 4 0
I came up with a parsimonious solution after returning to the problem for a few minutes (rather than in the wee hours of the morning of the original post).
I wanted something that a graduate student, who will likely construct thousands of indicators or dummy variables this way and may learn R first, or even only ever learn R, could use. The following provides a solution for both the example and the actual data using the same schema.
If the DF is already created with the IDs and the dummy-indicator column initialized to zero, as assumed in the example:
df3 <- data.frame(ID = df1$ID)
df3$new <- 0
My solution was:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] & df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 0
2 1
3 0
4 1
Using | (or) instead:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] | df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 0
4 1
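For the general case described in the question (long addendum files with many variables and several rows per unique ID), the same membership logic extends naturally. A sketch with hypothetical names adds, var1 and var2 (none of these appear in the real data):

```r
# hypothetical long-format addendum data: several rows per ID
adds <- data.frame(ID   = c(1, 1, 2, 3, 3, 3),
                   var1 = c(0, 1, 0, 1, 0, 0),
                   var2 = c(1, 0, 0, 0, 0, 1))
# one row per unique ID; indicator is 1 when some row of that ID has
# var1 == 1 and some (possibly different) row has var2 == 1
newDF <- data.frame(ID = sort(unique(adds$ID)))
newDF$indicator <- as.integer(newDF$ID %in% adds$ID[adds$var1 == 1] &
                              newDF$ID %in% adds$ID[adds$var2 == 1])
newDF$indicator  # 1 0 1
```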

How can I use if else to change values in some rows and columns in my data frame in R?

I have a data frame with 200 rows and 150 columns. Out of those columns, I wish to change the NAs of about 50 rows, and 100 columns.
Below is an example of (a small part) of my data frame:
>df
Bird Mammal Type
1 NA 1 A
2 1 0 B
3 1 0 A
4 0 NA A
5 NA 1 A
6 0 0 B
7 0 0 A
8 NA NA A
9 1 1 B
10 1 1 A
What I want, is to change all the NAs to 0 ONLY for type "A", but not for type "B". For type "B", I want everything to remain the same.
I have tried to do this with various ifelse options, but I think I still don't have the hang of it. Here are some of the things I've tried:
a) Subsetting only the columns as a list:
try <- c(1, 2)
for (i in 1:length(try)) {
  df[, try[i]] <- ifelse(df[, is.na(try[i])], 0, df[, try[i]])
}
b) Subsetting both rows and columns (this gave me a data frame, so of course the ifelse didn't run)
Here is a very simple one liner that gets exactly what you want. No loops or apply needed.
df[is.na(df) & df$Type=='A'] <- 0
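To see it work, rebuild the example frame and apply the one-liner (a quick sketch): is.na(df) is a logical matrix, and the Type comparison is recycled down each column, so only NAs sitting in type-A rows are replaced.

```r
df <- data.frame(Bird   = c(NA, 1, 1, 0, NA, 0, 0, NA, 1, 1),
                 Mammal = c(1, 0, 0, NA, 1, 0, 0, NA, 1, 1),
                 Type   = c("A", "B", "A", "A", "A", "B", "A", "A", "B", "A"),
                 stringsAsFactors = FALSE)
# replace NAs only where the row's Type is "A"
df[is.na(df) & df$Type == "A"] <- 0
sum(is.na(df))  # 0 -- every NA in this example sat in a type-A row
```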
You can use a combination of lapply and ifelse.
Assuming you have a vector of indices or names of the columns with the NAs stored as cols you can do the following:
df[, cols] <- as.data.frame(lapply(cols,
  FUN = function(x) ifelse(df$Type == "A" & is.na(df[, x]), 0, df[, x])))
Here is an option using set from data.table. We are considering all the other columns except the 'Type' column. The set option is fast. Also, this is changing the values in the column without converting to a logical matrix.
library(data.table)
setDT(df)
nm1 <- setdiff(names(df), 'Type')
for (j in nm1) {
  set(df, i = which(is.na(df[[j]]) & df$Type == 'A'), j = j, value = 0)
}

Subset a data frame with multiple match conditions in R

With the sample data
> df1 <- data.frame(x=c(1,1,2,3), y=c("a","b","a","b"))
> df1
x y
1 1 a
2 1 b
3 2 a
4 3 b
> df2 <- data.frame(x=c(1,3), y=c("a","b"))
> df2
x y
1 1 a
2 3 b
I want to remove all the value pairs (x,y) of df2 from df1. I can do it using a for loop over each row in df2 but I'm sure there is a better and simpler way that I just can't think of at the moment. I've been trying to do something starting with the following:
> df1$x %in% df2$x & df1$y %in% df2$y
[1] TRUE TRUE FALSE TRUE
But this isn't what I want as df1[2,] = (1,b) is pulled out for removal. Thank you very much in advance for your help.
Build a set of pairs from df2:
prs <- with(df2, paste(x,y,sep="."))
Test each row of df1, processed the same way, for membership in the pair set, and negate to keep only the rows whose pair is not in df2:
df1[ !(paste(df1$x, df1$y, sep=".") %in% prs) , ]
You could go the other way around: rbind everything and remove duplicates
out <-rbind(df1,df2)
out[!duplicated(out, fromLast=TRUE) & !duplicated(out),]
x y
2 1 b
3 2 a
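A self-contained check of the rbind/duplicated route (note it assumes df1 itself contains no duplicated pairs, since both copies of any duplicate are dropped):

```r
df1 <- data.frame(x = c(1, 1, 2, 3), y = c("a", "b", "a", "b"))
df2 <- data.frame(x = c(1, 3), y = c("a", "b"))
out <- rbind(df1, df2)
# a row survives only if it is unique scanning from both ends of the stack
res <- out[!duplicated(out, fromLast = TRUE) & !duplicated(out), ]
res  # the (1, b) and (2, a) rows, i.e. the pairs not present in df2
```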
