Surprised this hasn't been asked before (as far as I can see)
I have a data.frame with multiple columns and two rows, such as the below.
df <- as.data.frame(rbind(row1 = c(NA, NA, rep(0, 2), "FOO", NA, "BAR", "FOO", "FOOBAR", "ETC"),
                          row2 = c(300, 23.4, 1, 2, "BAR", "FOO", "BAR", "HELLO", "WORLD", "ETC")))
I want to select the entry in the first row by default, but only if it's not NA. If it is NA, I want the entry in the second row. I've tried the following:
apply(df, 2, function(x) ifelse(is.na(x[1]), x[2], x[1]))
However, x is a mix of numeric and character values, and each column's class needs to be maintained, so apply is causing issues. I also need the result returned as a data frame, not a named vector.
Try this and see if this is what you are after.
df <- as.data.frame(rbind(row1 = c(NA, NA, rep(0, 2), "FOO", NA, "BAR", "FOO", "FOOBAR", "ETC"),
                          row2 = c(300, 23.4, 1, 2, "BAR", "FOO", "BAR", "HELLO", "WORLD", "ETC")))
outDF <- lapply(df, function(x) {
  # fall back to the second row only when the first entry is NA
  if (is.na(x[[1]]) && !is.na(x[[2]])) {
    x[[1]] <- x[[2]]
  }
  x
})
data.frame(outDF, stringsAsFactors = FALSE)
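One caveat worth noting: because rbind() on those mixed vectors coerces everything to character, every column of df starts out as character. If you want numeric columns back afterwards, a minimal sketch using type.convert() (not part of the original answer):
res <- data.frame(outDF, stringsAsFactors = FALSE)
# re-detect column types; as.is = TRUE keeps character columns as
# character rather than turning them into factors
res[] <- lapply(res, type.convert, as.is = TRUE)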
I have a dataframe, which I divide into subframes of 6 rows each, stored in a list.
If any row of a subframe contains the word "#ERROR", I want that whole subframe deleted, so that I receive a list with a smaller number of dataframes. Then I am going to convert the list back into a dataframe. My problem is that I have tried different codes and cannot figure out how to eliminate the subframes containing the specific word from the list.
I tried the following:
a <- dataset
View(a)
my.list <- split(a, rep(1:119, each = 6))
z <- lapply(1:length(my.list), function(i) my.list[[i]] != "#ERROR")
but what I get is 119 elements of TRUE/FALSE values. I want to eliminate the FALSE ones... anyone, please help.
Try using sapply, as it will return a vector instead of a list like lapply does.
new.list <- my.list[sapply(seq_along(my.list), function(i)
  all(my.list[[i]] != "#ERROR"))]
Or a bit simplified with Filter :
new.list <- Filter(function(x) all(x != "#ERROR"), my.list)
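Since the question mentions converting the list back into a dataframe afterwards, a short sketch of that last step:
# stack the surviving 6-row subframes back into a single data frame
result <- do.call(rbind, new.list)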
I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame so that it keeps all numeric columns, as well as any non-numeric columns that I specify.
Using:
for (i in seq_along(WaFramesNumeric)) {
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][, sapply(WaFramesNumeric[[i]], is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
  a <- WaFramesNumeric[[i]]$Device_Name
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][, sapply(WaFramesNumeric[[i]], is.numeric)]
  cbind(WaFramesNumeric[[i]], a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
  f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
  m <- match("Cost_Center", colnames(WaFramesNumeric[[i]]))
  n <- match("Device_Name", colnames(WaFramesNumeric[[i]]))
  combine <- c(f, m, n)
  WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll), so I also tried adding the specific columns from WaFramesAll, but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the last command in a for loop is meaningful. It is not: since you never assign its result anywhere (both the cbind and the final indexing of WaFramesNumeric...), it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it uses i inside the data.frame, even though i is an index into the list of data.frames, not into the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
Third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest that you can replace match with grep: it returns integer(0) when nothing is found, which is what you want in this case.
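A quick sketch of the difference (with hypothetical lookup targets):
match("Cost_Center", c("a", "b"))  # NA          -> df[, NA] would error
grep("Cost_Center", c("a", "b"))   # integer(0)  -> selects nothing, no error
Putting it all together: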
for (i in seq_along(WaFramesNumeric)) {
  f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
  m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
  n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
  combine <- c(f, m, n)
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but in case you want to go through a list of data frames and select some columns based on some function and keep given a column name you can do like this:
df <- data.frame(a = rnorm(20), b = rnorm(20), c = letters[1:20], d = letters[1:20],
                 stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)

Selector <- function(data, select_func, select_names) {
  select_func <- match.fun(select_func)
  idx_names <- match(select_names, colnames(data))
  idx_names <- idx_names[!is.na(idx_names)]
  idx_func <- which(sapply(data, select_func))
  idx <- unique(c(idx_func, idx_names))
  return(data[, idx])
}

res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names = c("c"), select_func = is.numeric)
I'm new to writing functions and am sure this is a simple one. I have a 111-column by ~10,500-row df with all missing values coded as <NA>. Intuitively, I need a function that does the following column-wise over a dataframe:
ifelse(sum(is.na(colx)) > length(colx)/5, NULL, colx)
i.e. I need to drop any variables with more than 1/5 (20%) missing values. Thanks to all for indicating there's a similar answer, i.e. using
colMeans(is.na(mydf)) > .20
to ID the columns, but this doesn't fully answer my question.
The above code returns a logical vector indicating the variables to be dropped. I have more than 100 variables with complex names and picking through them to drop by hand is tedious and bound to introduce errors. How can I modify the above, or use some version of my original proposed ifelse, to only return a new dataframe of columns with < 20% NA, as I asked originally?
Thanks!!
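For reference, a minimal sketch that turns that logical vector into the subsetting asked for (threshold as stated in the question):
# keep only the columns with less than 20% missing values
mydf_clean <- mydf[, colMeans(is.na(mydf)) < 0.20, drop = FALSE]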
One way of doing this (probably not the shortest) is to iterate over the rows of the data.frame with by and then rbind the results back together into one data.frame.
Just change the condition in the if in the code below; here, rows with at least one NA value are removed.
do.call(rbind, by(your.dataset,
                  1:nrow(your.dataset),
                  FUN = function(x) {
                    if (sum(is.na(x)) == 0) {
                      return(x)
                    } else {
                      return(NULL)
                    }
                  }))
When you use lapply on a data.frame, it performs the given function on each column as if each were a list.
So if f is your function for "processing" a column, you should use:
lapply(df, f)
vapply should be used when the result will always be a vector of a known size.
sapply is like an automatic vapply. It tries to simplify the result to a vector. I would advise against using sapply, except for exploratory programming.
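A quick illustration of the difference, using mtcars as a stand-in for your data (not from the original answer):
sapply(mtcars, is.numeric)               # happens to simplify to a logical vector
vapply(mtcars, is.numeric, logical(1))   # guaranteed logical vector, or an error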
(Updated to reflect edit)
Try:
f <- function(x) {
  sum(is.na(x)) < length(x) * 0.2
}
df[, vapply(df, f, logical(1)), drop = FALSE]
Background
Before running a stepwise model selection, I need to remove missing values for any of my model terms. With quite a few terms in my model, there are therefore quite a few vectors that I need to look in for NA values (and drop any rows that have NA values in any of those vectors). However, there are also vectors that contain NA values that I do not want to use as terms / criteria for dropping rows.
Question
How do I drop rows from a dataframe which contain NA values in any of a list of vectors? I'm currently using the clunky method of chaining together a long series of !is.na calls:
my.df[!is.na(my.df$termA) & !is.na(my.df$termB) & !is.na(my.df$termD), ]
but I'm sure that there is a more elegant method.
Let dat be a data frame and cols a vector of column names or column numbers of interest. Then you can use
dat[!rowSums(is.na(dat[cols])), ]
to exclude all rows with at least one NA.
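Applied to the question's data, a sketch:
cols <- c("termA", "termB", "termD")
my.df[!rowSums(is.na(my.df[cols])), ]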
Edit: I completely glossed over subset, the built-in function that is made for sub-setting things:
my.df <- subset(my.df,
                !(is.na(termA) |
                  is.na(termB) |
                  is.na(termC)))
I tend to use with() for things like this. Don't use attach, you're bound to cut yourself.
my.df <- my.df[with(my.df, {
  !(is.na(termA) |
    is.na(termB) |
    is.na(termC))
}), ]
But if you often do this, you might also want a helper function, is_any()
is_any <- function(...) {
  # TRUE wherever any of the supplied vectors is NA
  Reduce(`|`, lapply(list(...), is.na))
}
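With that helper (as sketched above), the filter collapses to:
my.df <- my.df[!with(my.df, is_any(termA, termB, termC)), ]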
If you end up doing a lot of this sort of thing, SQL can often be a nicer way to interact with subsets of data. dplyr may also prove useful.
This is one way:
# create some random data
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
# introduce random NA's
df[round(runif(10, 1, 100)), ]$x1 <- NA
df[round(runif(10, 1, 100)), ]$x2 <- NA
df[round(runif(10, 1, 100)), ]$x3 <- NA
# this does the actual work...
# assumes data is in columns 2:4, but can be anywhere
for (i in 2:4) { df <- df[!is.na(df[, i]), ] }
And here's another, using sapply(...) and Reduce(...):
xx <- data.frame(!sapply(df[2:4], is.na))
yy <- Reduce("&", xx)
zz <- df[yy, ]
The first statement "applies" the function is.na(...) to columns 2:4 of df and inverts the result (we want !NA). The second statement applies the logical & operator to the columns of xx in succession. The third statement extracts only the rows where yy is TRUE. Clearly this can be combined into one horrifically complicated statement:
zz <- df[Reduce("&", data.frame(!sapply(df[2:4], is.na))), ]
Using sapply(...) and Reduce(...) can be faster if you have very many columns.
Finally, most modeling functions have parameters that can be set to deal with NA's directly (without resorting to all this). See, for example, the na.action parameter in lm(...).
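For instance, a minimal sketch with the random data generated above:
# rows with NA in x1..x3 are dropped before fitting (na.omit is the default)
fit <- lm(y ~ x1 + x2 + x3, data = df, na.action = na.omit)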
When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row; they all turn into 'character'. The following is a simple example. I want to add a couple of years to the 3 stooges' ages, but when I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age <- c(20, 30, 50)
who <- c("Larry", "Curly", "Mo")
df <- data.frame(who, age)
colnames(df) <- c('_who_', '_age_')
dfunc <- function(er) {
  print(er['_age_'])
  print(er[2])
  print(is.numeric(er[2]))
  print(class(er[2]))
  return(er[2] + 2)
}
a <- apply(df, 1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop = FALSE], 1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop = FALSE])
a <- apply(m, 1, dfunc)
The drop = FALSE is needed so that selecting a single column still returns a data frame rather than being collapsed to a vector.
-1 means all but the first column; you could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')].
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.