Conditionally Splitting Dataframes Using ifelse in R

I have a large dataset called "inputs". One of the columns in the dataset is a flag called "constrained" with either "Y" or "N". I want to create two datasets where one is the rows where the flag is "Y" and one is the rows where the flag is "N".
I tried:
ifelse(inputs$constrained == "N",unconstrained <- inputs,constrained <- inputs)
but both datasets unconstrained and constrained are identical to inputs.
What am I doing wrong?

One option is split(), which returns a list with one data frame per level of the flag:
first <- split(inputs, inputs$constrained)[1]
second <- split(inputs, inputs$constrained)[2]
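As a small aside, single brackets on the result of split keep the list wrapper, so first and second above are one-element lists; double brackets (or the level names, assuming the flag really only takes "Y" and "N") pull out the data frames themselves:
constrained_df <- split(inputs, inputs$constrained)[["Y"]]
unconstrained_df <- split(inputs, inputs$constrained)[["N"]]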
If you wanted to use "[" you could do this:
unconstrd <- inputs[ inputs$constrained == "N" , ]
constrd <- inputs[ ! inputs$constrained == "N" , ]
Both results from that second option might include entries where 'constrained' is NA, because of the way R handles NA in logical subsetting: those rows come back filled with NAs rather than as faithful copies of the original rows. (I admit I was not sure what the split method does with NAs.) I just tested the split method and it might be superior, since (like subset) it does not return the rows where is.na(inputs$constrained) is TRUE.
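If NAs in the flag are a concern with the bracket approach, wrapping the condition in which() drops those rows instead of returning NA-filled ones; a minimal sketch, assuming the flag only takes the values "Y" and "N":
unconstrd <- inputs[which(inputs$constrained == "N"), ]
constrd <- inputs[which(inputs$constrained == "Y"), ]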

Related

R: Determine if all sets in a list appear in a data frame

I need to figure out if sets of item ID's are found within a data frame.
If I'm only looking for a single set of ID's, the below code works just fine:
set <- c( id1, id2, etc...)
all(subSets %in% df[,rangeOfColumns])
However, if the set is a list of various things I want to check, this code doesn't work as expected and I am unsure how to get this functionality.
Example of what I'm aiming for:
set <- list()
set[[1]] <- c(1, 2)
set[[2]] <- c(2, 3)
df <- as.data.frame(cbind(c(1:4),c(2:5)))
all(set %in% df)
#Returns TRUE
Maybe check each row against each set and return TRUE if any row matches; if there is a match for every set, the whole result is TRUE.
all(sapply(set, function(s)
  any(apply(df, 1, function(x) all(x == s)))))
This might not be easy to understand but it does the job. Data frames are organized by column, so doing things by row isn't always straightforward.
# Your setup had some unnecessary complications. Here it is again
# more simply:
set <- list(1:2, 2:3)
d_f <- data.frame(1:4, 2:5) # df is already a function name, so best not to reuse it.
all(
  sapply(seq_along(set),
         function(i) any(
           sapply(
             lapply(1:nrow(d_f), function(j) set[[i]] == d_f[j, ]),
             all) # Does each element of set[[i]] equal the elements in d_f[j, ]?
         ) # Does this happen in any row of d_f?
  )
) # Is it true for all elements of set?
EDIT: to address the question in the comments, "Well, if it's not straightforward, why not work with a transposed version of the df to make things easier?"
Because a data frame is a list, not a matrix.
Doing matrix things (like transposing with t or using apply) ruins (often without any warning to the user) what a data frame is supposed to be: a list of vectors of the same length.
When you use t or apply on a data frame, the first thing to happen is as.matrix gets applied to it. And if your data frame has a date, character, or factor variable, then the whole thing is coerced to "character", and it doesn't tell you this happens.
An answer for your specific problem can be crafted using apply (as someone did) and/or t, but it's going to be a bit fragile unless one is completely sure of the classes of the variables in the data frame.
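A small illustration of that coercion, using a made-up data frame with one numeric and one character column (not from the question):
mixed <- data.frame(num = 1:3, lab = c("a", "b", "c"))
t(mixed)               # a character matrix; the numeric column is silently coerced
apply(mixed, 1, class) # every row is reported as "character"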

Using if with rowSums when only one column matches

I want to sum over several columns and make a new column based on them. So I use
df$Sum <-rowSums(df[,grep("y", names(df))])
But sometimes df includes just one matching column, and in that case I get an error. Since this call is part of a long programming procedure, I was wondering how I can write an if condition so that if df[,grep("y", names(df))] contains just one column, the sum is simply equal to df[,grep("y", names(df))], and otherwise, if it has at least two columns, the sum is taken row-wise over them.
suppose:
require(stats); require(graphics)
attach(cars)
cars$y1<-seq(20:69)
#cars$y2<-seq(30:79)
df<-cars
df$Sum <-rowSums(df[,grep("y", names(df))])
You can use drop = FALSE when subsetting:
df$Sum <-rowSums(df[,grep("y", names(df)), drop = FALSE])
This keeps df as a data frame even if you are selecting only one column.
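A quick way to see the difference, using the cars example above (only y1 matches, so grep() selects a single column):
ycols <- grep("y", names(df))
class(df[, ycols])               # a plain numeric vector, which is what breaks rowSums()
class(df[, ycols, drop = FALSE]) # still a data.frame, so rowSums() works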

Subsetting data in R with ifelse

I am attempting to use ifelse to subset data that can then be used in a plot. I am coding it this way as I am trying to make the code usable to a layman by only defining one or two objects and then running the whole script to make a plot using the data selected by given criteria.
The problem is that the mydataframe[mydataframe$data ...] operation is not working the way I would like it to inside ifelse. Is there a way to get it to work in ifelse, or is anyone aware of a smarter way to do what I'm trying to do? Thanks!
Also, the second block of code is added explanation but not needed to see the problem.
# generate data
mydata<-c(1:100)
mydata<-as.data.frame(mydata)
mydata$checkthefunction<-rep(c("One","Two","Three","Four","Multiple of 5",
"Six","Seven","Eight","Nine","Multiple of 10"))
# everything looks right
mydata
# create function
myfunction = function(MyCondition="low"){
  # special criteria
  lowRandomNumbers=c(58,61,64,69,73)
  highRandomNumbers=c(78,82,83,87,90)
  # subset the data based on MyCondition
  mydata<-ifelse(MyCondition=="low",mydata[mydata$mydata %in% lowRandomNumbers==TRUE,],mydata)
  mydata<-ifelse(MyCondition=="high",mydata[mydata$mydata %in% highRandomNumbers==TRUE,],mydata)
  # if not "high" or "low" then don't subset the data
  mydata
}
myfunction("low")
# returns just the numbers selected from the dataframe, not the
# subsetted dataframe with the $checkthefunction row
myfunction("high")
# returns: "Error in mydata[mydata$mydata %in% highRandomNumbers == TRUE, ] :
# incorrect number of dimensions"
# additional explanation code if it helps
# define dataframe again
mydata<-c(1:100)
mydata<-as.data.frame(mydata)
mydata$checkthefunction<-rep(c("One","Two","Three","Four","Multiple of 5",
"Six","Seven","Eight","Nine","Multiple of 10"))
# outside of the function and ifelse my subsetting works
lowRandomNumbers=c(58,61,64,69,73)
ItWorks<-mydata[mydata$mydata %in% lowRandomNumbers==TRUE,]
# ifelse seems to be the problem, the dataframe is cut into the string of lowRandomNumbers again
MyCondition="low"
NoLuck<-ifelse(MyCondition=="low",mydata[mydata$mydata %in% lowRandomNumbers==TRUE,],mydata)
NoLuck
# if the 'else' portion is returned the dataframe is converted to a one-dimensional list
MyCondition="high"
NoLuck<-ifelse(MyCondition=="low",mydata[mydata$mydata %in% lowRandomNumbers==TRUE,],mydata)
NoLuck
You don't want ifelse; you want if and else. ifelse is for when you have a vector of conditions, and you only have a single condition value.
myfunction = function(MyCondition="low"){
  # special criteria
  lowRandomNumbers=c(58,61,64,69,73)
  highRandomNumbers=c(78,82,83,87,90)
  # subset the data based on MyCondition
  mydata <- if(MyCondition=="low") mydata[mydata$mydata %in% lowRandomNumbers==TRUE,] else mydata
  mydata <- if(MyCondition=="high") mydata[mydata$mydata %in% highRandomNumbers==TRUE,] else mydata
  # if not "high" or "low" then don't subset the data
  mydata
}
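A quick sanity check of the if/else version (hypothetical calls, with mydata built as in the question):
head(myfunction("low"))  # the rows where mydata is 58, 61, 64, 69, 73, with checkthefunction kept
head(myfunction("none")) # neither branch matches, so the full data frame comes back unchanged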

Efficient method to drop rows with NA values in R

Background
Before running a stepwise model selection, I need to remove missing values for any of my model terms. With quite a few terms in my model, there are therefore quite a few vectors that I need to look in for NA values (and drop any rows that have NA values in any of those vectors). However, there are also vectors that contain NA values that I do not want to use as terms / criteria for dropping rows.
Question
How do I drop rows from a dataframe which contain NA values for any of a list of vectors? I'm currently using the clunky method of a long series of !is.na's
my.df[!is.na(my.df$termA) & !is.na(my.df$termB) & !is.na(my.df$termD), ]
but I'm sure that there is a more elegant method.
Let dat be a data frame and cols a vector of column names or column numbers of interest. Then you can use
dat[!rowSums(is.na(dat[cols])), ]
to exclude all rows with at least one NA.
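A small sketch of how that works, with made-up data; note that NAs in columns outside cols do not cause a row to be dropped:
dat <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA), other = c(NA, NA, NA))
cols <- c("a", "b")
dat[!rowSums(is.na(dat[cols])), ] # keeps only row 1; the NAs in 'other' are ignored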
Edit: I completely glossed over subset, the built-in function that is made for subsetting things:
my.df <- subset(my.df,
                !(is.na(termA) |
                  is.na(termB) |
                  is.na(termC)))
I tend to use with() for things like this. Don't use attach, you're bound to cut yourself.
my.df <- my.df[with(my.df, {
  !(is.na(termA) |
    is.na(termB) |
    is.na(termC))
}), ]
But if you often do this, you might also want a helper function, is_any()
is_any <- function(x){
  !is.na(x)
}
If you end up doing a lot of this sort of thing, using SQL is often going to be a nicer interaction with subsets of data. dplyr may also prove useful.
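For instance, a hedged dplyr sketch of the same filter (assuming dplyr is installed and the columns are named as above):
library(dplyr)
my.df <- my.df %>% filter(!is.na(termA), !is.na(termB), !is.na(termC))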
This is one way:
# create some random data
df <- data.frame(y=rnorm(100),x1=rnorm(100), x2=rnorm(100),x3=rnorm(100))
# introduce random NA's
df[round(runif(10,1,100)),]$x1 <- NA
df[round(runif(10,1,100)),]$x2 <- NA
df[round(runif(10,1,100)),]$x3 <- NA
# this does the actual work...
# assumes data is in columns 2:4, but can be anywhere
for (i in 2:4) {df <- df[!is.na(df[,i]),]}
And here's another, using sapply(...) and Reduce(...):
xx <- data.frame(!sapply(df[2:4],is.na))
yy <- Reduce("&",xx)
zz <- df[yy,]
The first statement "applies" the function is.na(...) to columns 2:4 of df and inverts the result (we want !NA). The second statement applies the logical & operator to the columns of xx in succession. The third statement extracts only the rows where yy is TRUE. Clearly this can be combined into one horrifically complicated statement.
zz <-df[Reduce("&",data.frame(!sapply(df[2:4],is.na))),]
Using sapply(...) and Reduce(...) can be faster if you have very many columns.
Finally, most modeling functions have parameters that can be set to deal with NAs directly (without resorting to all this). See, for example, the na.action parameter in lm(...).
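For example, with the simulated data above, lm() can drop incomplete rows itself rather than requiring them to be removed beforehand:
fit <- lm(y ~ x1 + x2 + x3, data = df, na.action = na.omit)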

How to add a column to a dataframe with values from another based on multiple conditions

I have two data frames of different length, and I want to add a new column to the first data frame with corresponding values of the second data frame.
The corresponding value is defined by the following condition: if DF1[i,1] == DF2[j,1] & DF1[i,2] == DF2[j,2], then the value from that row j of DF2 should be written to DF1$newColumn[i].
The following data frames are used to illustrate the question:
DF1 <- data.frame(X = rep(c("A","B","C"), each = 3),
                  Y = rep(c("a","b","c"), each = 3))
DF2 <- data.frame(X = c("A","B","C"),
                  Y = c("a","b","c"),
                  Z = c(1:3))
I tried to use if() statements as in the text above but the condition returns a vector of TRUE/FALSE and that doesn't seem to work.
The code that works that I use now is
for (i in 1:length(DF1[,1])) {
  DF1$Z[i] <- subset(DF2, DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i])$Z
}
However it is incredibly slow (user system elapsed 115.498 12.341 127.799 for my full dataframe) and there must be a more efficient way to code this. Also, I have read repeatedly that vectorizing is more efficient than loops, but I don't know how to do that here.
I do need to work with conditional statements though so something like
DF1$Zz<-rep(DF2$Z,each=3)
wouldn't work for my real dataset.
DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X==DF1$X[i] & DF2$Y==DF1$Y[i]])
seems to take roughly a quarter of the time of your for loop.
I created DF1 with each value repeated 300 times, and this function took ~2 secs to run; your loop with subset took ~8 secs, and repackaging your loop into an sapply took ~5 secs.
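A quick check with the small example data above (real timings will of course depend on the full dataset):
DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i]])
head(DF1, 4) # rows with X = "A", Y = "a" get Z = 1; the "B","b" rows get 2, and so on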
