Suppose I have the following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to get the following data frame at the end:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So, I decided to store the name of mydataframe, and the name of the variable of interest, in variables.
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign to a new data frame, even though that seems to be the only solution right now. I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since colnames are not first class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[[varname]] ), ] )
You also do not want to use -which(.). It would have worked here since there were some matches to the condition. It will bite you, however, whenever no rows match: instead of dropping nothing and returning every row, as it should, it returns no rows at all, since -integer(0) is still integer(0) and vec[integer(0)] is empty. Only use which for "positive" choices.
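A quick way to see what -which(.) does when nothing matches is to try it on a data frame without any NA. This is only a minimal sketch; the data frame `clean` is made up for the illustration and is not part of the question.

```r
# Hypothetical data frame with no missing IDs: nothing should be removed
clean <- data.frame(ID = c(1, 2, 3), score = 11:13)

# which() finds no matches, so idx is integer(0); negating it changes nothing
idx <- which(is.na(clean$ID))

via_which   <- clean[-idx, ]              # indexes with integer(0): zero rows
via_logical <- clean[!is.na(clean$ID), ]  # logical index: keeps every row

nrow(via_which)
nrow(via_logical)
```

The logical index behaves the same whether or not there are matches, which is why it is the safer habit.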
As @Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe,mydataframe,mydataframe)
# apply your function to all the data.frames in the list
# have replaced original function in line with @DWin and @flodel's comments
# pointing out issues with using -which(...)
lapply(alldfs, function(x) x[!is.na(x$ID),])
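If you still want to refer to each data frame by name afterwards, a named list may be enough. The sketch below assumes a second, made-up data frame called `other`; the result is stored in a new list called `cleaned`, which is also just an illustrative name.

```r
mydataframe <- data.frame(ID = c(1, 2, NA, 4, 5, NA), score = 11:16)
other <- data.frame(ID = c(NA, 7, 8), score = 1:3)  # hypothetical second data frame

# a named list keeps the link between a name (a string) and its data frame
alldfs <- list(mydataframe = mydataframe, other = other)

# lapply() preserves the names, so the cleaned versions are looked up the same way
cleaned <- lapply(alldfs, function(x) x[!is.na(x$ID), ])

dbname <- "mydataframe"
cleaned[[dbname]]
```

Looking things up with cleaned[[dbname]] plays the role of get(dbname) without touching the global environment.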
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, e.g. if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname) #get df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname) #define the column of focus
tmp.df <- tmp.df[which(!is.na(tmp.df[,col.focus])),] #cut out the subset of the df where the column of focus is not NA.
#Result
  ID score
1  1    11
2  2    12
4  4    14
5  5    15
Related
I need to figure out if sets of item IDs are found within a data frame.
If I'm only looking for a single set of IDs, the below code works just fine:
set <- c( id1, id2, etc...)
all(set %in% df[,rangeOfColumns])
However, if the set is a list of various things I want to check, this code doesn't work as expected and I am unsure how to get this functionality.
Example of what I'm aiming for:
set <- list()
set[[1]] <- c(1, 2)
set[[2]] <- c(2, 3)
df <- as.data.frame(cbind(c(1:4),c(2:5)))
all(set %in% df)
# Desired result: TRUE
Maybe check each row against each set and return TRUE if any row matches. Then if there is a match for each set, then the whole result is TRUE.
all(sapply(set, function(s)
  any(apply(df, 1, function(x) all(x == s)))))
This might not be easy to understand but it does the job. Data frames are organized by column, so doing things by row isn't always straightforward.
# Your setup had some unnecessary complications. Here it is again
# more simply:
set <- list(1:2, 2:3)
d_f <- data.frame(1:4, 2:5) # df is already a function name so best not to use it again.
all(
  sapply(seq_along(set),
    function(i) any(
      sapply(
        lapply(1:nrow(d_f), function(j) set[[i]] == d_f[j,]),
        all) # Does each element of set[[i]] equal the elements in d_f[j,]?
    ) # Does it happen in any row of d_f?
  ) # Is it true for all elements of set?
)
EDIT: to address the question in the comment: "Well, if it's not straightforward, why not work with a transposed version of the df to make things easier?"
Because a data frame is a list, not a matrix.
Doing matrix things (like transposing with t or using apply) ruins (often without any warning to the user) what a data frame is supposed to be, which is a list of vectors of the same length.
When you use t or apply on a data frame, the first thing to happen is as.matrix gets applied to it. And if your data frame has a date, character, or factor variable, then the whole thing is coerced to "character", and it doesn't tell you this happens.
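The silent coercion is easy to check on a toy example. This is just a sketch with a made-up two-column data frame `d`; nothing here comes from the question's data.

```r
# Hypothetical mixed-type data frame: one integer column, one character column
d <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)

# as.matrix() is what t() and apply() call first on a data frame
m <- as.matrix(d)
typeof(m)   # the integer column has been turned into character as well

# inside apply(), every row is therefore a character vector
row_classes <- apply(d, 1, class)

# working column by column keeps the original types
col_classes <- sapply(d, class)
```

No warning is raised at any point, which is what makes this easy to miss.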
An answer for your specific problem can be crafted using apply (as someone did) and/or t, but it's going to be a bit fragile unless one is completely sure of the classes of the variables in the data frame.
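For completeness, here is one possible way to do the same check while staying column-wise, so that as.matrix is never involved. It is only a sketch: `row_matches` is a helper name I made up, and it assumes each set has exactly one value per column of the data frame, in column order.

```r
set <- list(1:2, 2:3)
d_f <- data.frame(1:4, 2:5)

# Hypothetical helper: TRUE for each row of d whose values equal s,
# comparing one column at a time (the i-th column against the i-th value of s)
row_matches <- function(d, s) {
  Reduce(`&`, Map(function(col, value) col == value, d, s))
}

# every set must match at least one row
found <- all(sapply(set, function(s) any(row_matches(d_f, s))))
found
```

Because each comparison is done within a single column, columns of different classes are never mixed together.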
Warning: Multi-part question!
I realize parts of this have been answered elsewhere but am struggling to bring them together in a nice parsimonious bit of code....
I have a data frame with a number (24) of numeric columns of interest. For each column, I want to create a new variable in the same data frame (named sensibly) in which the values correspond to the mean of the sex-specific decile for that variable (sex is in a different column, coded 0/1).
New column names from an original column called 'WBC' would be, for example: 'WBC_meandec_women', and 'WBC_meandec_men'.
I've tried various bits of code to first create new variables, then assign values related to the decile but none work well and can't figure out how to put it together. I just know there is a clever way to put all parts into the same code chunk, I'm just not fluent enough in R to get there...
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
Trying to achieve:
goaldata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100),WBC_decmean_women=rep(NA,nrow(dummydata)),WBC_decmean_men=rep(NA,nrow(dummydata)),RBC_decmean_women=rep(NA,nrow(dummydata)),RBC_decmean_men=rep(NA,nrow(dummydata)))
...but obviously with the correct values instead of NAs, and for a list of about 24 original variables.
Any help greatly appreciated!
Depending on if I understood you right, I'll propose this giant ball of duct tape...
# fake data
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
# a function to calculate decile means
decilemean <- function(x) {
xrank <- rank(x)
xdec <- floor((xrank-1)/length(x)*10)+1
decmeans <- as.numeric(tapply(x,xdec,mean))
xdecmeans <- decmeans[xdec]
return(xdecmeans)
}
# looping through your data columns and making new columns
newcol <- 5 # the first new column to create
for(j in c(3,4)) { # all of your columns to decilemean-ify
  dummydata[,newcol] <- NA
  dummydata[dummydata$sex==0,newcol] <- decilemean(dummydata[dummydata$sex==0,j])
  names(dummydata)[newcol] <- paste0(names(dummydata)[j],"_decmean_women")
  dummydata[,newcol+1] <- NA
  dummydata[dummydata$sex==1,newcol+1] <- decilemean(dummydata[dummydata$sex==1,j])
  names(dummydata)[newcol+1] <- paste0(names(dummydata)[j],"_decmean_men")
  newcol <- newcol+2
}
I'd recommend testing it though ;)
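A more compact variant is possible with base R's ave(), which applies a function within groups and returns the result in the original row order. Note that this is a sketch of a different layout than the one asked for: it writes one `_decmean` column per variable (holding each row's own sex-specific decile mean) instead of separate women and men columns. The column suffix and the loop over names are my own choices.

```r
set.seed(1)
dummydata <- data.frame(id = 1:100, sex = rep(c(1, 0), 50),
                        WBC = rnorm(100), RBC = rnorm(100))

# same decile rule as above; ave() with its default FUN returns the group mean
decilemean <- function(x) {
  xdec <- floor((rank(x) - 1) / length(x) * 10) + 1
  ave(x, xdec)
}

# loop over column names rather than positions; ave() splits by sex first
for (v in c("WBC", "RBC")) {
  dummydata[[paste0(v, "_decmean")]] <- ave(dummydata[[v]], dummydata$sex,
                                            FUN = decilemean)
}

head(dummydata)
```

With 24 variables, only the vector of column names in the loop would need to change.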
It might be a trivial question (I am new to R), but I could not find an answer to my question, either here on SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of the df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag from the subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, namely, key and value. The subsets are defined by id = n, according to n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for(i in unique(df$id)){
  indexer = df$id == i
  # here is how I tried to update the data frame:
  df[indexer,]$tag <- aux[match(df[indexer,]$tag, aux$key),]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NAs. I got no errors, but the following warning message:
In `[<-.factor`(`*tmp*`, df$id == i, value = c(NA, :
  invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tag values made match() produce misplaced updates in a number of rows. I also simulated the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (what the final dataset should look like):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
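Here is a small, self-contained sketch of that lookup. It rebuilds df and aux from the question, with one assumption on my part: both are created with stringsAsFactors = FALSE, so tag is a plain character column and the "invalid factor level" warning cannot come up.

```r
# data from the question, stored as character rather than factor (assumption)
df <- data.frame(id = rep(1:4, 3),
                 tag = rep(c("aaa", "bbb", "rrr", "fff"), 3),
                 stringsAsFactors = FALSE)
aux <- data.frame(key = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)

# match() returns, for each tag, the row of aux holding its key;
# duplicated tags in df simply map to the same row of aux
df$tag <- aux$value[match(df$tag, aux$key)]

head(df, 4)
```

If tag is already a factor in your data, converting it first with as.character() should have the same effect.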
It turned out that my data was breaking all the available built-in functions, leaving me with a wrong dataset in the end. My solution (at least a preliminary one) was the following:
to process each subset individually;
add each data frame to a list;
use rbindlist(a.list, use.names = T) to get a complete data frame with the results.
In R, I'm trying to use a for loop, with a nested test, in order to append a column to multiple data frames.
I am having trouble 1) calling a data frame with a variable name and 2) using a logical test to skip.
For example, I created 3 data frames with a number, and I want to add a column that's the square root of the value. I want to skip the data frame if it'll result in an error.
Below is what I've gotten to so far:
df1 <- data.frame(a=c(1))
df2 <- data.frame(a=c(6))
df3 <- data.frame(a=c(-3))
df_lst$b<-
for(df_lst in c("df1","df2","df3"){
ifelse(is.na(df_lst$a) = T, skip,
df_list$b <- sqrt(df1$a)
})
In the above example, I would ideally like to see df1 and df2 with a new column b with the square root of column a, and then nothing happens to df3.
Any help would be GREATLY appreciated, thank you everyone!
It's generally not a good idea to just have a bunch of data.frames lying around with different names if you need to do things to all of them. You're better off storing them in a list. For example
mydfs<-list(df1, df2, df3)
Then you can use lapply and such to work with those data.frames. For example
mydfs<-lapply(mydfs, function(x) {
  if(all(x$a>=0)) {
    x$b<-sqrt(x$a)
  }
  x
})
Otherwise, changing your code to
for(df_lst in c("df1","df2","df3")) {
  df<-get(df_lst)
  if( all(df$a>=0) ) {
    df$b <- sqrt(df$a)
  }
  assign(df_lst, df)
}
should work as well, it's just generally not considered good practice.
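If you would rather skip a data frame whenever the computation complains, instead of testing the sign yourself, tryCatch() is one option. Note that sqrt() of a negative number raises a warning ("NaNs produced"), not an error, so the sketch below catches warnings. The helper name `add_sqrt` is made up for the illustration.

```r
df1 <- data.frame(a = c(1))
df2 <- data.frame(a = c(6))
df3 <- data.frame(a = c(-3))

# Hypothetical helper: add column b, or return x untouched if sqrt() warns
add_sqrt <- function(x) {
  tryCatch({
    x$b <- sqrt(x$a)   # the warning is raised before the assignment completes
    x
  }, warning = function(w) x)
}

mydfs <- lapply(list(df1 = df1, df2 = df2, df3 = df3), add_sqrt)
sapply(mydfs, names)
```

The explicit all(x$a >= 0) test is easier to read; the tryCatch() version is mainly useful when you can't easily predict what will fail.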
Background
Before running a stepwise model selection, I need to remove missing values for any of my model terms. With quite a few terms in my model, there are therefore quite a few vectors that I need to look in for NA values (and drop any rows that have NA values in any of those vectors). However, there are also vectors that contain NA values that I do not want to use as terms / criteria for dropping rows.
Question
How do I drop rows from a dataframe which contain NA values for any of a list of vectors? I'm currently using the clunky method of a long series of !is.na's
> my.df[!is.na(my.df$termA)&!is.na(my.df$termB)&!is.na(my.df$termD),]
but I'm sure that there is a more elegant method.
Let dat be a data frame and cols a vector of column names or column numbers of interest. Then you can use
dat[!rowSums(is.na(dat[cols])), ]
to exclude all rows with at least one NA.
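For what it's worth, base R's complete.cases() expresses the same filter and reads a bit more directly. The sketch below uses a made-up `dat` in which the column `other` contains an NA that should not cause a row to be dropped.

```r
# Hypothetical data: only termA and termB are model terms
dat <- data.frame(termA = c(1, NA, 3, 4),
                  termB = c(1, 2, NA, 4),
                  other = c(NA, 1, 1, 1))
cols <- c("termA", "termB")

# keep rows with no NA in the chosen columns; NAs elsewhere are ignored
kept <- dat[complete.cases(dat[cols]), ]

# same rows as the rowSums() version
same <- dat[!rowSums(is.na(dat[cols])), ]

kept
```

Row 1 is kept even though `other` is NA there, because only the columns listed in cols are checked.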
Edit: I completely glossed over subset, the built in function that is made for sub-setting things:
my.df <- subset(my.df,
!(is.na(termA) |
is.na(termB) |
is.na(termC) )
)
I tend to use with() for things like this. Don't use attach, you're bound to cut yourself.
my.df <- my.df[with(my.df, {
!(is.na(termA) |
is.na(termB) |
is.na(termC) )
}), ]
But if you often do this, you might also want a helper function, is_any()
is_any <- function(x){
!is.na(x)
}
If you end up doing a lot of this sort of thing, using SQL is often going to be a nicer interaction with subsets of data. dplyr may also prove useful.
This is one way:
# create some random data
df <- data.frame(y=rnorm(100),x1=rnorm(100), x2=rnorm(100),x3=rnorm(100))
# introduce random NA's
df[round(runif(10,1,100)),]$x1 <- NA
df[round(runif(10,1,100)),]$x2 <- NA
df[round(runif(10,1,100)),]$x3 <- NA
# this does the actual work...
# assumes data is in columns 2:4, but can be anywhere
for (i in 2:4) {df <- df[!is.na(df[,i]),]}
And here's another, using sapply(...) and Reduce(...):
xx <- data.frame(!sapply(df[2:4],is.na))
yy <- Reduce("&",xx)
zz <- df[yy,]
The first statement "applies" the function is.na(...) to columns 2:4 of df, and inverts the result (we want !NA). The second statement applies the logical & operator to the columns of xx in succession. The third statement extracts only rows with yy=T. Clearly this can be combined into one horrifically complicated statement.
zz <-df[Reduce("&",data.frame(!sapply(df[2:4],is.na))),]
Using sapply(...) and Reduce(...) can be faster if you have very many columns.
Finally, most modeling functions have parameters that can be set to deal with NA's directly (without resorting to all this). See, for example the na.action parameter in lm(...).
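As a rough illustration of the na.action route, here is a sketch with made-up data (`d`, `fit_omit` and `fit_excl` are names chosen for the example). na.omit drops incomplete rows before fitting; na.exclude does the same for the fit but pads residuals back to the original length.

```r
set.seed(42)
d <- data.frame(y = rnorm(20), x1 = rnorm(20))
d$x1[c(3, 7)] <- NA   # two rows with a missing predictor

# rows with NA in any model term are dropped before fitting
fit_omit <- lm(y ~ x1, data = d, na.action = na.omit)
nobs(fit_omit)

# same fit, but residuals() keeps one entry per original row (NA where dropped)
fit_excl <- lm(y ~ x1, data = d, na.action = na.exclude)
length(residuals(fit_excl))
```

Only the variables that appear in the formula are considered, so NAs in unrelated columns do not cost you any rows.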