subset R data frame using only exact matches of character vector

I would like to subset a data frame (Data) by column names. I have a character vector with column name IDs I want to exclude (IDnames).
What I do normally is something like this:
Data[ ,!colnames(Data) %in% IDnames]
However, I am facing the problem that there is a column named "X-360" and another named "X-360.1". I only want to exclude "X-360" (which is in the character vector), not "X-360.1" (which is not in the character vector, but gets excluded anyway). So I want only exact matches, and it seems this does not work with %in%.
It seems such a simple problem but I just cannot find a solution...
Update:
Indeed, the problem was that I had duplicated names in my data.frame! It took me a while to figure this out, because when I looked at the subsetted columns with
Data[ ,colnames(Data) %in% IDnames]
it showed "X-360" and "X-360.1" among the names, as stated above.
But this only happened when subsetting the data; before that, there were simply two columns with the same name ("X-360"), which happened because the data frame was built from matrices with cbind.
Here is a demonstration of what happened:
D1 <- matrix(rnorm(36), nrow = 6)
colnames(D1) <- c("X-360", "X-400", "X-401", "X-300", "X-302", "X-500")
D2 <- matrix(rnorm(36), nrow = 6)
colnames(D2) <- c("X-360", "X-406", "X-403", "X-300", "X-305", "X-501")
D <- cbind(D1, D2)
Data <- as.data.frame(D)
IDnames <- c("X-360", "X-302", "X-501")
Data[ , colnames(Data) %in% IDnames]
       X-360      X-302    X-360.1      X-501
1 -0.3658194 -1.7046575  2.1009329  0.8167357
2 -2.1987411 -1.3783129  1.5473554 -1.7639961
3  0.5548391  0.4022660 -1.2204003 -1.9454138
4  0.4010191 -2.1751914  0.8479660  0.2800923
5 -0.2790987  0.1859162  0.8349893  0.5285602
6  0.3189967  1.5910424  0.8438429  0.1142751
Learned another thing to be careful about when working with such data in the future...
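For the record, duplicated names are easy to check for directly, and make.unique() reproduces the ".1" suffix seen above; a quick sketch against the Data object from the demo (output comments assume that demo):
anyDuplicated(colnames(Data))               # index of the first duplicated name, 0 if none
colnames(Data)[duplicated(colnames(Data))]  # "X-360" "X-300" with the demo data
make.unique(c("X-360", "X-360"))            # "X-360" "X-360.1"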

One regex-based solution here would be to form an alternation of exact keyword matches (note that the non-capturing group (?:...) requires perl = TRUE):
regex <- paste0("^(?:", paste(IDnames, collapse = "|"), ")$")
Data[ , !grepl(regex, colnames(Data), perl = TRUE)]
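If IDnames could ever contain regex metacharacters, it is safer to quote each name literally; \Q...\E is PCRE's quoting syntax (hence perl = TRUE again). A sketch:
regex <- paste0("^(?:", paste0("\\Q", IDnames, "\\E", collapse = "|"), ")$")
Data[ , !grepl(regex, colnames(Data), perl = TRUE)]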

Related

if function for rowSums: modify the code

I want to get summation over several columns and make a new column based on them. So I use
df$Sum <-rowSums(df[,grep("y", names(df))])
But sometimes df includes just one column, and in this case I get an error. Since this function is part of a long programming procedure, I was wondering how to write an if: if df[,grep("y", names(df))] contains just one column, then Sum should simply equal that column; otherwise, if it has at least two columns, take the row sums over them.
suppose:
require(stats); require(graphics)
attach(cars)
cars$y1 <- seq(20:69)
# cars$y2 <- seq(30:79)
df <- cars
df$Sum <- rowSums(df[ , grep("y", names(df))])
You can use drop = FALSE when subsetting:
df$Sum <- rowSums(df[ , grep("y", names(df)), drop = FALSE])
This keeps df as a data frame even if you are selecting only one column.
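A quick check of the difference, using a throwaway one-column data frame (df1 is just an example name):
df1 <- data.frame(y1 = 1:3)
df1[ , grep("y", names(df1))]                         # dropped to a plain vector
df1[ , grep("y", names(df1)), drop = FALSE]           # still a data frame
rowSums(df1[ , grep("y", names(df1)), drop = FALSE])  # 1 2 3, no error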

How to apply ifelse function by column names?

I know there are many similar questions around, but I'm afraid I couldn't get my head around this particular one, though obviously it is very simple!
I am trying to write a simple ifelse function to be applied over a series of columns in a data frame by using column names (rather than numbers). What I try to do is to create a single u_all variable as shown below without typing column names repeatedly.
dat <- data.frame(id = 1:20, u1 = sample(0:1, 20, replace = TRUE),
                  u2 = sample(0:1, 20, replace = TRUE), u3 = sample(0:1, 20, replace = TRUE))
dat <- within(dat, u_all <- ifelse(u1 == 1 | u2 == 1 | u3 == 1, 1, 0))
dat
I tried many variants of apply but clearly I'm not on the right track as those grouping functions replicate the ifelse function on each column separately.
dat2 <- data.frame(id = 1:20, u1 = sample(0:1, 20, replace = TRUE),
                   u2 = sample(0:1, 20, replace = TRUE), u3 = sample(0:1, 20, replace = TRUE))
dat2 <- cbind(dat2, sapply(dat2[ , grepl("^u\\d{1,}", colnames(dat2))],
                           function(x) { u_all <- ifelse(x == 1 & !is.na(x), 1, 0) }))
dat2
This line from the OP
dat<-within(dat,u_all<-ifelse (u1==1 | u2==1 |u3==1,1,0))
can instead be written as
dat$u_all <- +Reduce("|", dat[, c("u1", "u2", "u3")])
How it works, in terms of intermediate objects:
D = dat[, c("u1", "u2", "u3")] uses the names of the columns to subset the data frame.
r = Reduce("|", D) collapses the data by putting | between each pair of columns. The result is a logical (TRUE/FALSE) vector.
To convert r to a 0/1 integer vector, you could use ifelse(r,1L,0L) or as.integer(r) (since TRUE/FALSE converts to 1/0 by default) or just the unary +, like +r.
If you want to avoid using column names (it's really not clear to me from the post), you can construct D = dat[-1] to exclude the first column instead.
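The same computation with the intermediates written out, using dat as defined in the question:
D <- dat[ , c("u1", "u2", "u3")]     # a data frame is a list of columns
r <- Reduce("|", D)                  # u1 | u2 | u3, combined pairwise
dat$u_all <- +r                      # unary + coerces TRUE/FALSE to 1/0
identical(dat$u_all, as.integer(r))  # TRUE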
You were almost there. Here is a solution using apply over rows, with any and all collapsing each row's tests into a single 0/1 value. Note the parentheses: & binds more loosely than *, so the original any(x==1) & all(!is.na(x))*1 returned a logical rather than a digit.
dat2$u_all <- apply(dat2[ , -1], MARGIN = 1, FUN = function(x) {
  (any(x == 1) & all(!is.na(x))) * 1
})

automatic column prefix with cbind and just one column

I have some trouble with a script which uses cbind to add columns to a data frame. I select these columns by regular expression, and I love that cbind automatically provides a prefix if you add more than one column. But this does not work if you append just one column, even if I cast that column as a data frame...
Is there a way to get around this behaviour?
In my example, it works fine for columns starting with a but not for b1 column.
df <- data.frame(a1=c(1,2,3),a2=c(3,4,5),b1=c(6,7,8))
cbind(df, log=log(df[grep('^a', names(df))]))
cbind(df, log=log(df[grep('^b', names(df))]))
cbind(df, log=as.data.frame(log(df[grep('^b', names(df))])))
A solution would be to create an intermediate data frame with the log values and rename the columns:
logb <- log(df[grep('^b', names(df))])
colnames(logb) <- paste0('log.', names(logb))
cbind(df, logb)
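If this comes up often, the renaming can live in a small helper (add_log_cols is just a made-up name for this sketch):
add_log_cols <- function(df, pattern) {
  logd <- log(df[grep(pattern, names(df))])   # list-style subset: always a data frame
  names(logd) <- paste0("log.", names(logd))  # prefix regardless of column count
  cbind(df, logd)
}
add_log_cols(df, "^b")  # gains a log.b1 column even though only b1 matches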
What about
cbw <- c("a","b") # columns beginning with
cbw_pattern <- paste0("^",cbw, collapse = "|")
cbind(df, log=log(df[grep(cbw_pattern, names(df))]))
This way you select both patterns at once (all three columns). The column names only fail to get the prefix when just one column is selected.

Efficient method to subset (drop) rows with NA values in R

Background
Before running a stepwise model selection, I need to remove missing values for any of my model terms. With quite a few terms in my model, there are therefore quite a few vectors that I need to look in for NA values (and drop any rows that have NA values in any of those vectors). However, there are also vectors that contain NA values that I do not want to use as terms / criteria for dropping rows.
Question
How do I drop rows from a dataframe which contain NA values for any of a list of vectors? I'm currently using the clunky method of a long series of !is.na's
> my.df[!is.na(my.df$termA)&!is.na(my.df$termB)&!is.na(my.df$termD),]
but I'm sure that there is a more elegant method.
Let dat be a data frame and cols a vector of column names or column numbers of interest. Then you can use
dat[!rowSums(is.na(dat[cols])), ]
to exclude all rows with at least one NA.
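Step by step: is.na() returns a logical matrix, rowSums() counts the NAs per row, and ! maps a zero count to TRUE (keep) and any positive count to FALSE (drop). Base R also ships complete.cases(), which does the same job:
rowSums(is.na(dat[cols]))          # number of NAs in each row
dat[!rowSums(is.na(dat[cols])), ]  # keep the rows whose count is zero
dat[complete.cases(dat[cols]), ]   # equivalent, using the built-in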
Edit: I completely glossed over subset, the built-in function that is made for subsetting things:
my.df <- subset(my.df,
                !(is.na(termA) |
                  is.na(termB) |
                  is.na(termC)))
I tend to use with() for things like this. Don't use attach(); you're bound to cut yourself.
my.df <- my.df[with(my.df, {
  !(is.na(termA) |
    is.na(termB) |
    is.na(termC))
}), ]
But if you often do this, you might also want a helper function, is_any(), that reports row-wise whether any of its arguments is NA:
is_any <- function(...) Reduce("|", lapply(list(...), is.na))
my.df <- my.df[!with(my.df, is_any(termA, termB, termC)), ]
If you end up doing a lot of this sort of thing, using SQL is often going to be a nicer interaction with subsets of data. dplyr may also prove useful.
This is one way:
# create some random data
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
# introduce random NAs
df[round(runif(10, 1, 100)), ]$x1 <- NA
df[round(runif(10, 1, 100)), ]$x2 <- NA
df[round(runif(10, 1, 100)), ]$x3 <- NA
# this does the actual work...
# assumes the data is in columns 2:4, but it can be anywhere
for (i in 2:4) { df <- df[!is.na(df[ , i]), ] }
And here's another, using sapply(...) and Reduce(...):
xx <- data.frame(!sapply(df[2:4],is.na))
yy <- Reduce("&",xx)
zz <- df[yy,]
The first statement "applies" the function is.na(...) to columns 2:4 of df and inverts the result (we want !NA). The second statement applies the logical & operator to the columns of xx in succession. The third statement extracts only the rows where yy is TRUE. Clearly this can be combined into one horrifically complicated statement:
zz <- df[Reduce("&", data.frame(!sapply(df[2:4], is.na))), ]
Using sapply(...) and Reduce(...) can be faster if you have very many columns.
Finally, most modeling functions have parameters that can be set to deal with NAs directly, without resorting to all this. See, for example, the na.action parameter in lm(...).
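For instance, a fit can be told to drop incomplete rows itself (a sketch using the df built above; na.omit is the usual default for na.action):
fit <- lm(y ~ x1 + x2 + x3, data = df, na.action = na.omit)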

Removing rows in data frame using get function

Suppose I have following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to end up with the following data frame:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So I decided to store the name of mydataframe and the variable of interest as strings:
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign the result to a new data frame, even though that seems like the only solution right now; I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since column names are not first-class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[varname] ), ] )
You also do not want to use -which(.). It happens to work here, since there were some matches to the condition. It will bite you, however, whenever there are no matching rows: instead of returning everything (with nothing removed), vec[-which(cond)] returns nothing, because -integer(0) is still integer(0) and vec[integer(0)] is empty. Only use which for "positive" choices.
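A quick illustration of the pitfall:
x <- 1:5
x[-which(x > 10)]  # integer(0): nothing matched, yet everything is dropped
x[!(x > 10)]       # 1 2 3 4 5: the logical form keeps all rows, as intended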
As #Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe, mydataframe, mydataframe)
# apply your function to all the data.frames in the list
# (the original function is replaced here, in line with @DWin's and @flodel's
# comments pointing out the issues with -which(...))
lapply(alldfs, function(x) x[!is.na(x$ID), ])
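Note that lapply() returns the cleaned copies rather than modifying the list in place, so keep the result; with a named list, list2env() can push them back as top-level objects if needed (df1/df2 are placeholder names):
alldfs <- list(df1 = mydataframe, df2 = mydataframe)
alldfs <- lapply(alldfs, function(x) x[!is.na(x$ID), ])
list2env(alldfs, envir = globalenv())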
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, e.g. if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname)                            # get the df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname)  # define the column of focus
tmp.df <- tmp.df[!is.na(tmp.df[ , col.focus]), ] # keep the subset where the column of focus is not NA
#Result
  ID score
1  1    11
2  2    12
4  4    14
5  5    15
