My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
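The issue is that mutate evaluates its expressions on whole columns, in a vectorized way, not row by row. The comparison as.character(subject) == subj.list therefore compares all 72 values of subject against subj.list (recycling its 6 values), and which returns the positions of the TRUE results. In your example that is a vector of length 12, which is neither 72 nor 1, hence the error. In your loop, by contrast, which is applied to one row at a time.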
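To see where the 12 in the error message comes from, the expression can be evaluated outside of mutate. This is a minimal base R sketch on the question's InsectSprays example; the names cmp and idx are introduced here for illustration, and match() is shown only as a contrast because it returns one index per row.

```r
# Rebuild the example data without dplyr
df <- datasets::InsectSprays
names(df)[names(df) == "spray"] <- "subject"
subj.list <- unique(as.character(df$subject))

# The comparison works on the whole column: 72 values against 6,
# so subj.list is recycled 12 times
cmp <- as.character(df$subject) == subj.list
length(cmp)   # 72

# which() returns the positions of the TRUE values, not one index per row
idx <- which(cmp)
length(idx)   # 12, the "wrong result size" reported by mutate

# A vectorized lookup that does return one index per row
subject.id <- match(as.character(df$subject), subj.list)
length(subject.id)   # 72
```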
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
subj.df <- df$subject %>%
unique() %>%
as_tibble() %>%
rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df,by = c("subject"="value"))
I am trying to add a column based on a condition using the mutate function in R, but I keep getting an error. The code is straight from the teacher's lecture, yet an error occurs. The LineItem column is of class factor; I am not sure if that makes a difference.
Please advise on what I am missing.
Thank you,
Avi
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg), 1, 0)
Error in ifelse(sl > avg) : argument "yes" is missing, with no default
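You have placed the closing ')' in the wrong place: ifelse(sl > avg) is closed before the yes and no values are passed, so they end up as arguments to mutate instead. The general syntax for ifelse is: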
ifelse(cond,value if true, value if false)
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg, 1, 0))
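Since ities_short.csv is not available here, the same call can be tried on a small made-up vector. This is only a sketch: the item names are hypothetical, and nchar stands in for str_length so that no package is needed.

```r
# Hypothetical line items standing in for df$LineItem
items <- c("pen", "notebook", "stapler")

sl  <- nchar(items)   # 3 8 7
avg <- mean(sl)       # 6

# All three arguments (condition, value if true, value if false)
# sit inside the same pair of parentheses
flag <- ifelse(sl > avg, 1, 0)
flag   # 0 1 1
```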
@Nirbhay Singh's answer is correct. However, when you compare two vectors, it is generally better to use dplyr::if_else, because it is stricter about the types of the two branches and lets you handle NA values explicitly:
df <- df %>% mutate(LineItem_LongName = if_else(sl > avg, 1, 0))
See the doc
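A short sketch of the difference, on a made-up logical vector (cond is a hypothetical name). It assumes dplyr is installed; the missing argument is part of if_else.

```r
library(dplyr)

cond <- c(TRUE, NA, FALSE)

# Base ifelse() silently coerces mixed branch types and propagates NA
ifelse(cond, 1, "a")   # "1" NA "a"

# if_else() lets you say what an NA condition should become
filled <- if_else(cond, 1, 0, missing = -1)
filled   # 1 -1 0

# if_else() refuses branches of incompatible types
mixed <- tryCatch(if_else(cond, 1, "a"), error = function(e) "error")
mixed    # "error"
```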
Don't create separate objects and then use them in the data frame; keep them in the data frame itself. You can remove the columns you don't need later. Moreover, you can do this without ifelse.
library(dplyr)
library(stringr)
df %>%
mutate(temp = str_length(LineItem),
LineItem_LongName = as.integer(temp > mean(temp)))
Or in base R :
# LineItem is a factor, so convert it to character before calling nchar()
df$temp <- nchar(as.character(df$LineItem))
transform(df, LineItem_LongName = +(temp > mean(temp)))
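Both versions rely on the fact that a logical vector converts to 0/1 integers. A minimal base R illustration, with made-up lengths in temp:

```r
# Hypothetical string lengths
temp <- c(3, 8, 7)

is_long <- temp > mean(temp)   # FALSE TRUE TRUE

# Explicit conversion and the unary plus give the same integer vector
a <- as.integer(is_long)       # 0 1 1
b <- +is_long                  # 0 1 1
identical(a, b)                # TRUE
```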
I am working with a dataset for which I want to calculate rowSums of columns that start with a certain string and end with another specified string, using dplyr (in my example: starts_with('c_') & ends_with('_f'))
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- t(values)
colnames(df) <- names
> df
c_first_f c_second_o t_third_f c_fourth_f
[1,] 5 3 2 5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R :
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))
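For completeness, the selection helpers can also be combined directly with &, as the question hoped. This is a sketch that assumes a reasonably recent dplyr/tidyselect (boolean operators on selection helpers were added in tidyselect 1.0.0) and a real data frame rather than the matrix built in the question.

```r
library(dplyr)

# The question's example, built as a data frame
df <- data.frame(c_first_f = 5, c_second_o = 3, t_third_f = 2, c_fourth_f = 5)

# One select() call: columns that start with "c_" AND end with "_f"
out <- df %>%
  mutate(row.sum = rowSums(select(., starts_with("c_") & ends_with("_f"))))

out$row.sum   # 10 (c_first_f + c_fourth_f)
```

Unlike the select_at() pipelines, this keeps all original columns and only adds row.sum.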
I'm using dplyr to mutate columns in my dataframe. The calculation consists of creating the ratio of the current row value to the max value so far (basically a lag and cummax combination).
It works great, except when there's an NA value, because all the following calculations become NA.
I tried placing na.omit() here and there, but the call then fails because na.omit() changes the length of the vectors.
Here is my reproducible code:
v1<-c(NA,100,80,40,NA,30,100,40,20,10,NA,NA,1,NA)
v2<-c(100,100,90,50,NA,-40,NA,-10,NA,NA,NA,1,NA,NA)
group<-c(1,1,1,1,1,1,2,2,2,2,2,3,3,4)
x1<-as.data.frame(cbind(v1,v2,group))
library(dplyr)
for ( i in c("v1","v2")){
x1<-x1 %>%
group_by(group) %>%
mutate( !!sym(paste( i,"_max_lag_ratio", sep="")) := get(i)/ lag( as.vector(cummax( get(i))) , default=first(get(i))))
}
If I add na.omit() as follows:
mutate( !!sym(paste( i,"_max_lag_ratio", sep="")) := get(i)/ lag( cummax( na.omit(get(i))) , default=first( get(i) )))
I get the following error:
Error: Column `column_max_lag_ratio` must be length 1 (the group size), not 0
Most likely because of one single group (group 4) having only NAs.
How can I make this failsafe? My real dataset features "imperfect" data.
Help is greatly appreciated since I'm really stuck.
A working solution, based on the answer to "Need to get R cummax but dealing properly with NAs", could be:
library(dplyr)
library(tidyr) # for replace_na()
x1 %>%
replace_na(list(v1=-Inf, v2=-Inf)) %>%
group_by(group) %>%
mutate(max_v1 = cummax(v1),
max_v2 = cummax(v2)
) %>%
group_by(group) %>%
mutate(v1_max_lag_ratio = v1/lag(max_v1)) %>%
mutate(v2_max_lag_ratio = v2/lag(max_v2))
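The reason for substituting -Inf can be seen on a single group in base R. This sketch uses the v1 values of group 1 from the question; x and running_max are names introduced here.

```r
# v1 for group 1 in the question
x <- c(NA, 100, 80, 40, NA, 30)

# Once cummax() meets an NA, every later value is NA as well
cummax(x)   # NA NA NA NA NA NA

# -Inf can never be the maximum, so it is a neutral stand-in for NA
running_max <- cummax(replace(x, is.na(x), -Inf))
running_max   # -Inf 100 100 100 100 100
```

Note that the leading -Inf remains until the first non-missing value appears, so ratios computed against it may need separate handling.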
I made this workaround and it did the trick.
v1<-c(NA,100,80,40,NA,30,100,40,20,10,NA,NA,1,NA)
v2<-c(100,100,90,50,NA,-40,NA,-10,NA,NA,NA,1,NA,NA)
group<-c(1,1,1,1,1,1,2,2,2,2,2,3,3,4)
x1<-as.data.frame(cbind(v1,v2,group))
library(dplyr)
for ( i in c("v1","v2")){
x1<-x1 %>%
group_by(group) %>%
mutate( !!sym(paste( i,"_max_lag_ratio", sep="")) := get(i)/(lag( cummax( ifelse(is.na(get(i)), na.omit(get(i) ) ,get(i))) , default=first(get(i))))
)
}
I am trying to replace some filtered values of a data set. So far, I wrote these lines of code:
df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA)),
where uniq is just a list containing variable names I want to focus on (and group1 and values are column names). This is actually working. However, it only outputs the altered filtered rows and does not replace anything in the data set df. Does anyone have an idea, where my mistake is? Thank you so much! The following code is to reproduce the example:
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df <- data.frame(group1, values)
uniq <- unique(unlist(df$group1))
for (i in 1:length(uniq)){
df <- df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA))
}
What I would like to get is that it leaves all values except the last one since it is one unique group (group1 == C) and 0.9 < 1. So I'd like to get the exact same data frame here except that 0.9 is replaced with NA. Moreover, would it be possible to just use if instead of ifelse?
dplyr verbs do not modify their input; they return a new data frame, and that result is only kept if you store it with the assignment operator (<-).
Compare
require(dplyr)
data(mtcars)
mtcars %>% filter(cyl == 4)
with
mtcars4 <- mtcars %>% filter(cyl == 4)
mtcars4
The data are the same, but in the second example the filtered data are stored in a new object, mtcars4.
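A quick check along the same lines, assuming dplyr is installed, shows that the original object is untouched by the pipeline:

```r
library(dplyr)

# Filtering returns a new data frame; mtcars itself keeps all its rows
mtcars4 <- mtcars %>% filter(cyl == 4)

nrow(mtcars)    # 32
nrow(mtcars4)   # 11
```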
I am having trouble mutating a subset of rows in dplyr. I am using the chaining command: %>% to say:
data <- data %>%
filter(ColA == "ABC") %>%
mutate(ColB = "XXXX")
This works fine, but the problem is that I want to be able to select the entire original table and see the mutate applied to only the subset of data I had specified. My problem is that when I view data after this I only see the subset of data and its updated ColB information.
I would also like to know how to do this using data.table.
Thanks.
Using data.table, we'd do:
setDT(data)[ColA == "ABC", ColB := "XXXX"]
and the values are modified in place, unlike if-else, which would copy the entire column to replace just those rows where the condition is satisfied.
We call this sub-assign by reference. You can read more about it in the new HTML vignettes.
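A small self-contained sketch of this on made-up data (the table dt and its values are hypothetical), assuming the data.table package is installed:

```r
library(data.table)

# Hypothetical table with the question's column names
dt <- data.table(ColA = c("ABC", "DEF", "ABC"),
                 ColB = c("a", "b", "c"))

# Update ColB only where ColA == "ABC"; no reassignment of dt is needed
dt[ColA == "ABC", ColB := "XXXX"]

dt$ColB    # "XXXX" "b" "XXXX"
nrow(dt)   # 3, all rows are still there
```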
When you use filter() you are actually removing the rows that do not match the condition you specified, so they will not show up in the final data set.
Does ColB already exist in your data frame? If so,
data %>%
mutate(ColB = ifelse(ColA == "ABC", "XXXX", ColB))
will change ColB to "XXXX" when ColA == "ABC" and leave it as is otherwise. If ColB does not already exist, then you will have to specify what to do for rows where ColA != "ABC", for example:
data %>%
mutate(ColB = ifelse(ColA == "ABC", "XXXX", NA))
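As a minimal sketch on made-up data (the values are hypothetical), the first variant keeps every row and changes only the matching ones. stringsAsFactors = FALSE is set because ifelse would return integer codes if ColB were a factor.

```r
library(dplyr)

# Hypothetical data with the question's column names
data <- data.frame(ColA = c("ABC", "DEF", "ABC"),
                   ColB = c("a", "b", "c"),
                   stringsAsFactors = FALSE)

result <- data %>%
  mutate(ColB = ifelse(ColA == "ABC", "XXXX", ColB))

result$ColB    # "XXXX" "b" "XXXX"
nrow(result)   # 3, nothing was filtered out
```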
Another option is to perform a subsequent combination of union and anti-join with the same data. This requires a primary key:
data <- data %>%
filter(ColA == "ABC") %>%
mutate(ColB = "XXXX") %>%
rbind_list(., anti_join(data, ., by = ...))
Example:
mtcars_n <- mtcars %>% add_rownames
mtcars_n %>%
filter(cyl > 6) %>%
mutate(mpg = 1) %>%
rbind_list(., anti_join(mtcars_n, ., by = "rowname"))
This is much slower than probably any other approach, but useful to get quick results by extending your existing pipe.
Just updating (as of June 2nd, 2022) @krlmlr's great answer:
add_rownames() is deprecated, use tibble::rownames_to_column() instead.
rbind_list is also deprecated, use bind_rows instead
You might also find a different sequence of rows in your resulting joined dataset, which depending on your aim is quite difficult to correct with dplyr::arrange() afterwards.
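With those two replacements, the mtcars example might look like the sketch below. It assumes dplyr and tibble are installed; the intermediate name updated is introduced here so that no pipe placeholder is needed inside anti_join.

```r
library(dplyr)
library(tibble)

# Row names become a "rowname" column, which serves as the primary key
mtcars_n <- mtcars %>% rownames_to_column()

# Modify only the subset
updated <- mtcars_n %>%
  filter(cyl > 6) %>%
  mutate(mpg = 1)

# Stack the modified rows on top of the untouched rows
result <- bind_rows(updated, anti_join(mtcars_n, updated, by = "rowname"))

nrow(result)           # 32
sum(result$mpg == 1)   # 14, the cars with cyl > 6
```

The modified rows come first in result, which is the change in row order mentioned above.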
An alternative, although slower, is:
mtcars_n <- mtcars %>%
tibble::rownames_to_column() %>%
filter(cyl > 6) %>%
mutate(new_col = 1)
mtcars_m <- left_join(x=mtcars, y=mtcars_n)