ifelse in a mutate function in R

I am trying to add a column based on a condition using the mutate function in R, but I keep getting an error. The code is straight from the teacher's lecture, yet the error still occurs. The LineItem column is a factor; I am not sure if that makes a difference.
Please advise on what I am missing.
Thank you,
Avi
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg), 1, 0)
Error in ifelse(sl > avg) : argument "yes" is missing, with no default

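The closing ')' is in the wrong place: it ends the ifelse call right after the condition, so the 1 and 0 are passed to mutate instead of ifelse. The general syntax for ifelse is:
ifelse(cond, value if true, value if false)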
library(dplyr)    # for %>% and mutate
library(stringr)  # for str_length
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg, 1, 0))
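For illustration, here is a minimal self-contained sketch of the same idea. The item names are made up (the contents of ities_short.csv are not shown in the question), and base R's nchar stands in for str_length so that no packages are needed.

```r
# Hypothetical item names standing in for df$LineItem
items <- c("Tea", "Cappuccino", "Salad")

sl  <- nchar(items)   # number of characters in each name
avg <- mean(sl)       # average name length

# ifelse(test, yes, no): all three arguments go inside the same parentheses
long_name <- ifelse(sl > avg, 1, 0)
print(long_name)
```

Only "Cappuccino" is longer than the average, so it is the only item flagged with 1.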

@Nirbhay Singh's answer is correct. However, if you compare two vectors, it's generally better to use dplyr::if_else because it is stricter: both outcomes must have compatible types, and NA conditions can be handled explicitly through its missing argument:
df <- df %>% mutate(LineItem_LongName = if_else(sl > avg, 1, 0))
See the doc
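A small sketch of the difference, assuming dplyr is installed; the vectors are made up for illustration.

```r
library(dplyr)

cond <- c(TRUE, NA, FALSE)

# Base ifelse propagates the NA from the condition
base_res <- ifelse(cond, 1, 0)

# if_else lets you choose what an NA condition becomes
strict_res <- if_else(cond, 1, 0, missing = -1)

# if_else refuses to mix incompatible types, whereas ifelse would coerce silently
mixed <- tryCatch(if_else(cond, 1, "no"), error = function(e) "error")
```

Here base_res keeps the NA, strict_res replaces it with -1, and the mixed numeric/character call ends in an error rather than a silently coerced result.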

Don't create separate objects and then use them in the data frame; keep them in the data frame itself. You can remove the columns you don't need later. Moreover, you can do this without ifelse.
library(dplyr)
library(stringr)
df %>%
  mutate(temp = str_length(LineItem),
         LineItem_LongName = as.integer(temp > mean(temp)))
Or in base R:
# as.character() is needed because nchar() fails on a factor
df$temp <- nchar(as.character(df$LineItem))
transform(df, LineItem_LongName = +(temp > mean(temp)))

Related

I would like to select >= as.Date('2008-01-01') and the NAs

I have tried
subset(df,df$date >= as.Date('2008-01-01'),na.rm = FALSE)
subset(df,df$date >= as.Date('2008-01-01'),na.omit = FALSE)
I'm losing all the people who have NAs too. Please suggest a way to sort it out
I tried subset(df,df$date >= as.Date('2008-01-01'),na.rm = FALSE)
If you look at the ?subset help page, it doesn't have any arguments named na.rm or na.omit. Those aren't magic keywords. They're common arguments that some (but not all) functions take, and you need to look at the function's help page to see if they work with a certain function.
Also, the point of using subset rather than just [ is that you don't have to use data$ after passing the data argument.
subset(df, date >= "2008-01-01" | is.na(date))
This should work to keep rows where the date is >= 2008-01-01 OR where the date is NA.
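The following sketch shows why the NA rows disappear and how is.na brings them back. It uses base R only; the data frame is hypothetical.

```r
# Hypothetical data: two dates and one missing value
df <- data.frame(id = 1:3,
                 date = as.Date(c("2007-06-01", "2009-03-15", NA)))

# A comparison with NA yields NA, and subset() drops rows whose condition is NA
only_recent <- subset(df, date >= as.Date("2008-01-01"))

# Adding is.na(date) turns the condition into TRUE for the missing rows
recent_or_na <- subset(df, date >= as.Date("2008-01-01") | is.na(date))
```

only_recent contains just the 2009 row, while recent_or_na also keeps the row with the missing date.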
Here is an example using filter from the dplyr package instead of subset:
library(dplyr)
# create tibble
dat <- tibble(x = c(rep(as.Date('2008-01-01'),10)))
# add NA to tibble
set.seed(123)
df <- as.data.frame(lapply(dat, \(x) replace(x, sample(length(x), .3*length(x)), NA)))
# keep rows that are 2008-01-01 or NA
df %>%
  filter(x == "2008-01-01" | is.na(x))

Check if single column is equal to any multiple others

My question seems simple, but I just can't do it. I have a dataframe with multiple columns with the name starting with coa and another column p with values like A, D, F, and so on, which changes according to the id.
All I found is how to do this matching with a fixed value, let's say "A", as below:
df <- df %>%
  mutate(ly = any(str_detect(c_across(starts_with("coa")), "A")))
However, in my case, I want to compare to the column p specifically, where p changes, something like this:
df <- df %>%
  mutate(ly = any(str_detect(c_across(starts_with("coa")), p)))
In this case, I get the error:
x no applicable method for 'type' applied to an object of class "factor"
Any thoughts? Thanks!
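To create such a column, use if_any, which is TRUE for a row when the condition holds in any of the selected columns. The error message indicates that p is a factor, and str_detect needs a character pattern, so convert it first:
library(dplyr)
library(stringr)
df <- df %>%
  mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, as.character(p))))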
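A small sketch of if_any on made-up data. It assumes p is a factor, as the error message suggests, and therefore wraps it in as.character() before handing it to str_detect; the column values are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical data: p is a factor and changes from row to row
df <- tibble(
  id   = 1:3,
  p    = factor(c("A", "D", "F")),
  coa1 = c("A", "B", "C"),
  coa2 = c("B", "D", "C")
)

# ly is TRUE when any coa column contains that row's value of p
res <- df %>%
  mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, as.character(p))))
```

Note that str_detect treats the pattern as a regular expression, which is harmless for plain letters like these.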
I think this is a good place to use dplyr::across. You can run vignette('colwise') for a more comprehensive guide, but the key point here is that we can mutate all columns starting with "coa" simultaneously by comparing each one to p with ==. Passing p through the ... argument of across is deprecated since dplyr 1.1.0, so the comparison is written as a lambda instead.
library(dplyr)
df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)
df %>%
  mutate(across(.cols = starts_with('coa'), .fns = ~ .x == p))

Dplyr Non Standard Evaluation -- Help Needed

I am making my first baby steps with non standard evaluation (NSE) in dplyr.
Consider the following snippet: it takes a tibble, sorts it according to the values inside a column and replaces the n-k lower values with "Other".
See for instance:
library(dplyr)
df <- cars %>% as_tibble()
k <- 3
df2 <- df %>%
  arrange(desc(dist)) %>%
  mutate(dist2 = factor(c(dist[1:k],
                          rep("Other", n() - k)),
                        levels = c(dist[1:k], "Other")))
What I would like is a function such that:
df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)
produces the same result, where old_column is "dist" (the column I use to sort the data set), new_column is "dist2" (the column I generate) and levels_to_keep is "k" (the number of values I explicitly retain).
I am getting lost in enquo, quo_name etc...
Any suggestion is appreciated.
You can do:
library(dplyr)
sort_keep <- function(df, old_column, new_column, levels_to_keep) {
  old_column <- enquo(old_column)                      # capture the column expression
  new_column <- as.character(substitute(new_column))   # new column name as a string
  df %>%
    arrange(desc(!!old_column)) %>%
    mutate(use = !!old_column,                         # temporary copy of the sort column
           !!new_column := factor(c(use[1:levels_to_keep],
                                    rep("Other", n() - levels_to_keep)),
                                  levels = c(use[1:levels_to_keep], "Other")),
           use = NULL)                                 # drop the temporary column
}
df %>% sort_keep(dist, dist2, 3)
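With more recent versions of dplyr and rlang, the same function can probably be written with the embrace operator instead of enquo and !!. This is a sketch, assuming dplyr >= 1.0 (for the "{{ }}" name syntax on the left of :=); sort_keep2 is a hypothetical name chosen so it does not clash with the function above.

```r
library(dplyr)

# sort_keep2 is a hypothetical name for the embrace-based variant
sort_keep2 <- function(df, old_column, new_column, levels_to_keep) {
  sorted <- df %>% arrange(desc({{ old_column }}))
  # the top values that keep their own factor level
  top <- head(pull(sorted, {{ old_column }}), levels_to_keep)
  sorted %>%
    mutate("{{ new_column }}" := factor(
      c(top, rep("Other", n() - levels_to_keep)),
      levels = c(top, "Other")
    ))
}

res <- cars %>% as_tibble() %>% sort_keep2(dist, dist2, 3)
```

Pulling the top values out before mutate avoids the temporary use column, at the cost of one extra step.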
Something like this?
old_column = "dist"
new_column = "dist2"
levels_to_keep = 3
command = "df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)"
command = gsub('old_column', old_column, command)
command = gsub('new_column', new_column, command)
command = gsub('levels_to_keep', levels_to_keep, command)
eval(parse(text=command))

Simple mutate with dplyr gives "wrong result size" error

My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
  df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
library(dplyr)
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
  df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
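The issue is that mutate applies operators and functions to whole columns, not row by row. Thus, which is applied to the vector produced by as.character(df$subject) == subj.list, not to each row (as in your loop). That comparison recycles subj.list along the column, and which then returns the positions of the elements that happen to match (12 of them in your example), which is neither one value per row nor a single value.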
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
  rowwise() %>%
  mutate(subject.id = which(as.character(subject) == subj.list))
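To see where the 12 comes from, the comparison can be run outside mutate. This sketch uses only base R and the InsectSprays data from the reproducible example; match() is shown as one vectorized way to get an index per row, not as the only option.

```r
subject   <- InsectSprays$spray
subj.list <- unique(as.character(subject))   # "A" ... "F"

# The 6-element subj.list is recycled along the 72 rows
cmp <- as.character(subject) == subj.list
n_rows    <- length(cmp)          # one comparison per row
n_matches <- length(which(cmp))   # positions where the recycled value happens to match

# A vectorized lookup returns exactly one index per row
ids <- match(as.character(subject), subj.list)
```

n_matches is the length that mutate complains about: which() returns as many positions as there are matches, regardless of the number of rows.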
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
library(tibble)  # for rownames_to_column()
subj.df <- df$subject %>%
  unique() %>%
  as_tibble() %>%
  rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df, by = c("subject" = "value"))

Issues with replacing a subset of a data.frame using the R Package dplyr

I am trying to replace some filtered values of a data set. So far, I wrote these lines of code:
df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA)),
where uniq is just a list containing variable names I want to focus on (and group1 and values are column names). This is actually working. However, it only outputs the altered filtered rows and does not replace anything in the data set df. Does anyone have an idea, where my mistake is? Thank you so much! The following code is to reproduce the example:
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df = data.frame(group1, values)
uniq <- unique(unlist(df$group1))
for (i in 1:length(uniq)){
df <- df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA))
}
What I would like is for all values to stay unchanged except the last one: it forms its own group (group1 == "C") and its sum, 0.9, is below 1. So I'd like to get the exact same data frame, except that 0.9 is replaced with NA. Moreover, would it be possible to just use if instead of ifelse?
dplyr won't create a new object unless you use an assignment operator (<-).
Compare
require(dplyr)
data(mtcars)
mtcars %>% filter(cyl == 4)
with
mtcars4 <- mtcars %>% filter(cyl == 4)
mtcars4
The data are the same, but in the second example the filtered data is stored in a new object, mtcars4.
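To change values for some groups while keeping every row, one option is to skip filter altogether and work per group. The following is a sketch under that assumption, not the asker's code: the data are hypothetical values chosen to be exactly representable in floating point so that the group sums are unambiguous, and group_by with a plain if stands in for the loop, since the condition is a single value per group.

```r
library(dplyr)

# Hypothetical data: groups A and B sum to exactly 1, group C sums to 0.5
df <- data.frame(group1 = c("A", "A", "A", "B", "B", "C"),
                 values = c(0.5, 0.25, 0.25, 0.25, 0.75, 0.5))

res <- df %>%
  group_by(group1) %>%
  # the condition is a single TRUE/FALSE per group, so a plain if works
  mutate(values = if (sum(values) < 1) NA_real_ else values) %>%
  ungroup()
```

All six rows are kept and only the value in group C becomes NA. Assigning the result back (df <- res) is what replaces the original data frame.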
