Generate variable with missing values if condition doesn't hold true - r

I want to generate a new variable in a data frame which contains the difference of the current row and a lag-value of another variable. However, I want to assign only values for those rows, where a specific condition holds true for a second variable. In this example the new lag-difference variable should only have values for rows with the fruit "Banana". All other rows shall be empty or rather contain NA.
fruitnumbers <- data.frame(numbers=c(2,4,1,5,3,5,2,5,1,3),
fruits=c("Apple","Banana","Orange","Cherry","Strawberry","Banana","Banana",
"Apple","Cherry","Banana"))
I tried to solve this problem with an if condition:
fruitnumbers$newvar <- if(fruitnumbers$fruits=="Banana"){
fruitnumbers$numbers-lag(fruitnumbers$numbers, 1)
}
However, I've received the following warning message:
Warning message:
In if (fruits == "Banana") { :
the condition has length > 1 and only the first element will be used
From research, I assume that it has something to do with the fact that R wants to check the if condition for the whole data frame at once instead of row by row, but I'm not quite sure. I'd be grateful for any solution here.

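Here fruitnumbers$fruits is a vector, so when you run if (fruitnumbers$fruits == "Banana") only the first element of fruitnumbers$fruits is tested (here "Apple" == "Banana"). Since R 4.2.0, a condition of length greater than one is an error rather than a warning.
If you want a vectorized test, use the case_when function from the dplyr package (the lag used below is also dplyr's):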
library(dplyr)
fruitnumbers$newvar <- case_when(
fruitnumbers$fruits == "Banana" ~ fruitnumbers$numbers-lag(fruitnumbers$numbers, 1),
TRUE ~ NA_real_
)
Which gives
fruitnumbers$newvar
[1] NA 2 NA NA NA 2 -3 NA NA 2
EDIT: as someone mentioned, you could also have used the ifelse function:
fruitnumbers$newvar <- ifelse(fruitnumbers$fruits == "Banana", fruitnumbers$numbers-lag(fruitnumbers$numbers, 1), NA)
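
For readers who would rather not load dplyr, here is a minimal base R sketch of the same idea. It assumes the "lag difference" means the current row minus the previous row, and it swaps in diff() for lag(); the helper name step is made up for illustration.

```r
# Same data as in the question.
fruitnumbers <- data.frame(
  numbers = c(2, 4, 1, 5, 3, 5, 2, 5, 1, 3),
  fruits = c("Apple", "Banana", "Orange", "Cherry", "Strawberry",
             "Banana", "Banana", "Apple", "Cherry", "Banana")
)

# diff() returns x[i] - x[i - 1]; pad with NA so the length matches the rows.
step <- c(NA, diff(fruitnumbers$numbers))

# Keep the difference only for "Banana" rows, NA everywhere else.
fruitnumbers$newvar <- ifelse(fruitnumbers$fruits == "Banana", step, NA)
```

ifelse() works here because the test, the yes value and the result all have one element per row.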

I would do that in two stages:
Create a new column in the data frame:
fruitnumbers$newvar <- NA
Change the values only for bananas:
fruitnumbers$newvar[fruitnumbers$fruits=="Banana"] <-
fruitnumbers$numbers[fruitnumbers$fruits=="Banana"] - lag(fruitnumbers$numbers[fruitnumbers$fruits=="Banana"], 1)
I am not sure about the lag function in this context. It only returns zeros. Another problem might be hiding there.
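
A likely explanation for the zeros: when dplyr is not attached, lag() resolves to stats::lag(), which is meant for time series and shifts only the time base, not the values. The sketch below, on a made-up vector x, contrasts that with a hand-built shift; prev is a hypothetical helper name.

```r
x <- c(2, 4, 1, 5)

# stats::lag() leaves the values untouched and only shifts the time base,
# so the difference is zero everywhere.
base_diff <- as.numeric(x - stats::lag(x, 1))

# A plain-vector lag without extra packages: drop the last value, prepend NA.
prev <- c(NA, head(x, -1))
shifted_diff <- x - prev
```

With dplyr attached, lag(x, 1) shifts the values in the same way as prev does here.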

In base R, you could try this:
fruitnumbers <- data.frame(numbers=c(2,4,1,5,3,5,2,5,1,3),
fruits=c("Apple","Banana","Orange","Cherry","Strawberry","Banana","Banana",
"Apple","Cherry","Banana"))
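indexes = which(fruitnumbers$fruits == "Banana")
# base R's lag() (stats::lag) does not shift a plain vector, so build the previous-row values by hand
previous = c(NA, head(fruitnumbers$numbers, -1))
fruitnumbers[indexes, 'newvar'] = fruitnumbers[indexes, 'numbers'] - previous[indexes]
The remaining rows of column newvar are filled with NA.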

Related

Is there a way to check if dataframe is empty and if so to add a NA row?

For example, I have a data frame that has nothing inside, but I need it to run through the full code because the code usually expects there to be data. I tried this, but it did not work:
ifelse(dim(df_empty)[1]==0,rbind(Shots1B_empty,NA))
Maybe something like this:
df_empty <- data.frame(x=integer(0), y = numeric(0), a = character(0))
if(nrow(df_empty) == 0){
df_empty <- rbind(df_empty, data.frame(x=NA, y=NA, a=NA))
}
df_empty
# x y a
#1 NA NA NA
Simple question, OP, but actually pretty interesting. All the elements of your code should work, but the issue is that when you run as is, it will return a list, not a data frame. Let me show you with an example:
growing_df <- data.frame(
A=rep(1, 3),
B=1:3,
c=LETTERS[4:6])
df_empty <- data.frame()
If we evaluate as you have written you get:
df <- ifelse(dim(df_empty)[1]==0, rbind(growing_df, NA))
with df resulting in a List:
> class(df)
[1] "list"
> df
[[1]]
[1] 1 1 1 NA
The code "worked", but the resulting class of df is wrong. It's odd because this works:
> rbind(growing_df, NA)
A B c
1 1 1 D
2 1 2 E
3 1 3 F
4 NA NA <NA>
The answer is to use if and else, rather than ifelse(), just as @akrun noted in their answer. The reason is found if you dig into the documentation of ifelse():
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
Since the test, dim(df_empty)[1] == 0 (or nrow(df_empty) == 0), is a vector of length one, ifelse() returns a result of length one. A data frame is a list of columns, so that single element is the first column of rbind(growing_df, NA), wrapped in a list. That's why if {} works here but ifelse() does not: rbind() normally returns a data frame, but the shape of what ifelse() stores in df is decided by the test element, not by the yes or no values. Compare that to if {} statements, whose result is whatever the expression inside {} evaluates to.
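
The shape rule is easier to see on a plain vector. This is a small sketch with made-up names (values, from_ifelse, from_if), not code from the question:

```r
values <- c(10, 20, 30)

# ifelse(): the test has length 1, so only the first element of `values` survives.
from_ifelse <- ifelse(TRUE, values, 0)

# if/else: the whole object on the chosen branch is returned unchanged.
from_if <- if (TRUE) values else 0
```

The same truncation happens to a data frame, except that its "first element" is its first column.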
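We may need if/else instead of ifelse. ifelse requires all arguments to be of the same length, which obviously will not be the case when we rbind. Keep the else branch, otherwise the object is overwritten with NULL whenever the condition is FALSE:
Shots1B_empty <- if(nrow(df_empty) == 0) rbind(Shots1B_empty, NA) else Shots1B_empty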

Subsetting a df by 2, 3 or more conditions in R [duplicate]

I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
so essentially I would be counting number of rows matching my where condition.
I have imported a csv file into mydata as a data frame. So far I have tried these to no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
mydata$sCode == "CA" will return a logical vector, with a TRUE value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
1. sum(mydata$sCode == "CA"), as suggested in the comments; because TRUE is interpreted as 1 and FALSE as 0, this should return the number of TRUE values in your vector.
2. length(which(mydata$sCode == "CA")); the which() function returns a vector of the indices where the condition is met, the length of which is the count of "CA".
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which() returns a vector identifying each row where the condition is met (in this case, rows 1 and 2 of the data frame). The length() of this vector is the number of occurrences.
sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
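
One caveat: the three approaches can disagree when sCode contains missing values. The sketch below uses hypothetical data (the pop column and the NA row are made up) to show the difference:

```r
# Hypothetical data: two "CA" rows, one other state, one missing code.
dat <- data.frame(sCode = c("CA", "CA", "AC", NA),
                  pop = c(10, 20, 30, 40))

# The comparison yields NA, not FALSE, for the row with the missing code.
is_ca <- dat$sCode == "CA"

count_sum   <- sum(is_ca, na.rm = TRUE)   # NA dropped before summing
count_which <- nrow(dat[which(is_ca), ])  # which() ignores NA
count_naive <- nrow(dat[is_ca, ])         # an NA index adds an all-NA row
```

If the column may contain NA, sum(..., na.rm = TRUE) or which() is the safer choice.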
mydata$sCode is a vector, which is why nrow returns NULL.
mydata[mydata$sCode == 'CA',] returns a data.frame of the rows where sCode == 'CA'. sCode contains character values, which is why sum gives you the error.
In subset(mydata, sCode='CA', select=c(sCode)), you should use sCode=='CA' instead of sCode='CA'. subset then returns a one-column data frame with the rows where sCode equals CA, so you should count its rows:
nrow(subset(mydata, sCode=='CA', select=c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
With the dplyr package, use
nrow(filter(mydata, sCode == "CA"))
All the solutions provided here gave me same error as multi-sam but that one worked.
To get the number of observations, counting the rows of your dataset would be more valid:
nrow(dat[dat$sCode == "CA",])
grep command can be used
CA = mydata[grep("CA", mydata$sCode), ]
nrow(CA)
Call nrow passing as argument the name of the dataset:
nrow(dataset)
I'm using this short function to make it easier using dplyr:
countc <- function(.data, ..., preserve = FALSE){
return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter. For example:
countc(data, active == TRUE)
[1] 42

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns integer(0), so your inequality doesn't return TRUE or FALSE but logical(0). The error is telling you that the if statement cannot evaluate a condition of length zero.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column: the next loop iteration (when i = 6) will be handling what was the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
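
One way around the shifting-index problem is to compute which rows and columns to keep first, then subset once. This is a minimal base R sketch on made-up data; the names has_derived, numeric_cols and cleaned are hypothetical, and fixed = TRUE is used so that the dot in ".DERIVED" is matched literally.

```r
# Hypothetical data with text and numeric columns.
x <- data.frame(a = c("u", "v.DERIVED", "w"),
                b = c(1.5, 2.5, 3.5),
                c = c("p", "q", "r.DERIVED"),
                stringsAsFactors = FALSE)

# One logical per row: does any cell contain the literal text ".DERIVED"?
has_derived <- apply(x, 1, function(cells) any(grepl(".DERIVED", cells, fixed = TRUE)))

# One logical per column: is the column numeric?
numeric_cols <- vapply(x, is.numeric, logical(1))

# Subset once, so positions never shift mid-loop; drop = FALSE keeps a data frame.
cleaned <- x[!has_derived, numeric_cols, drop = FALSE]
```

Because nothing is removed until both masks are complete, no index ever points at a row or column that has moved.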
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex version of "\\.DERIVED" and not ".DERIVED" which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect a character column that happens to contain only numbers.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8

Function to change blanks to NA

I'm trying to write a function that turns empty strings into NA. A summary of one of my column looks like this:
      a   b
 12 210 468
I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to change empty values to NA, so I borrowed some stuff from here and there to come up with this:
# change nulls to NAs
nullToNA <- function(df){
# split df into numeric & non-numeric functions
a<-df[,sapply(df, is.numeric), drop = FALSE]
b<-df[,sapply(df, Negate(is.numeric)), drop = FALSE]
# Change empty strings to NA
b<-b[lapply(b,function(x) levels(x) <- c(levels(x), NA) ),] # add NA level
b<-b[lapply(b,function(x) x[x=="",]<- NA),] # change Null to NA
# Put the columns back together
d<-cbind(a,b)
d[, names(df)]
}
However, I'm getting this error:
> foo<-nullToNA(bar)
Error in x[x == "", ] <- NA : incorrect number of subscripts on matrix
Called from: FUN(X[[i]], ...)
I have tried the answer found here: Replace all 0 values to NA but it changes all my columns to numeric values.
You can directly index fields that match a logical criterion. So you can just write:
df[is_empty(df)] = NA
Where is_empty is your comparison, e.g. df == "":
df[df == ""] = NA
But note that is.null(df) won’t work, and would be weird anyway [1]. I would advise against merging the logic for columns of different types, though! Instead, handle them separately.
[1] You’ll almost never encounter NULL inside a table, since that only works if the underlying vector is a list. You can create matrices and data.frames with this constraint, but then is.null(df) will never be TRUE, because the NULL values are wrapped inside the list.
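
As a rough illustration of the indexing approach on factor columns, here is a sketch with made-up data (the id and grp columns are assumptions). It adds a droplevels() call, which is not part of the answer above, to remove the now-unused empty level.

```r
# Hypothetical data: an integer column and a factor column with empty strings.
df <- data.frame(id = 1:4,
                 grp = c("a", "", "b", ""),
                 stringsAsFactors = TRUE)

# Logical-matrix indexing: only cells equal to "" are replaced.
df[df == ""] <- NA

# The "" level is unused after the replacement, so drop it.
df <- droplevels(df)
```

Columns without a matching cell are left alone, so id keeps its integer type.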
This worked for me
df[df == 'NULL'] <- NA
How about just:
df[apply(df, 2, function(x) x=="")] = NA
Works fine for me, at least on simple examples.
This is the function I used to solve this issue.
null_na=function(vector){
new_vector=rep(NA,length(vector))
for(i in seq_along(vector)){
# test is.na() first: NA == "" is NA, which if() cannot evaluate
if(is.na(vector[i]) || vector[i]== ""){new_vector[i]=NA}else{new_vector[i]=vector[i]}
}
return(new_vector)
}
Just plug in the column or vector you are having an issue with.

Create new column with binary data based on several columns

I have a dataframe in which I want to create a new column with 0/1 (which would represent absence/presence of a species) based on the records in previous columns. I've been trying this:
update_cat$bobpresent <- NA #creating the new column
x <- c("update_cat$bob1999", "update_cat$bob2000", "update_cat$bob2001","update_cat$bob2002", "update_cat$bob2003", "update_cat$bob2004", "update_cat$bob2005", "update_cat$bob2006","update_cat$bob2007", "update_cat$bob2008", "update_cat$bob2009") #these are the names of the columns I want the new column to base its results in
bobpresent <- function(x){
if(x==NA)
return(0)
else
return(1)
} # if all the previous columns are NA then the new column should be 0, otherwise it should be 1
update_cat$bobpresence <- sapply(update_cat$bobpresent, bobpresent) #apply the function to the new column
Everything is going fine until the last line, where I'm getting this error:
Error in if (x == NA) return(0) else return(1) :
missing value where TRUE/FALSE needed
Can somebody please advise me?
Your help will be much appreciated.
Comparisons involving NA yield NA, therefore x == NA always evaluates to NA. If you want to check if a value is NA, you must use the is.na function, for example:
> NA == NA
[1] NA
> is.na(NA)
[1] TRUE
The if statement inside your function needs TRUE or FALSE as its condition, but it gets NA instead, hence the error message. You can fix that by rewriting your function like this:
bobpresent <- function(x) { ifelse(is.na(x), 0, 1) }
In any case, based on your original post I don't understand what you're trying to do. This change only fixes the error you get with sapply, but fixing the logic of your program is a different matter, and there is not enough information in your post.
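
If the goal is what the comment in the question describes (0 when all the yearly columns are NA, 1 otherwise), one possible sketch is shown below. The data is made up, only two of the yearly columns are included, and year_cols is a hypothetical helper name.

```r
# Hypothetical data: two of the yearly columns, NA meaning "no record".
update_cat <- data.frame(bob1999 = c(NA, 1, NA),
                         bob2000 = c(NA, NA, 3))
year_cols <- c("bob1999", "bob2000")

# Count the non-missing records per row; any count above zero means presence.
update_cat$bobpresent <- as.integer(rowSums(!is.na(update_cat[year_cols])) > 0)
```

This works on the whole data frame at once, so no sapply or if is needed.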

Resources