I am having trouble mutating a subset of rows in dplyr. I am using the pipe operator %>% to write:
data <- data %>%
filter(ColA == "ABC") %>%
mutate(ColB = "XXXX")
This works, but the problem is that I want to keep the entire original table and have the mutate applied only to the subset of rows I specified. Instead, when I view data after this, I only see the subset of rows and its updated ColB values.
I would also like to know how to do this using data.table.
Thanks.
Using data.table, we'd do:
setDT(data)[ColA == "ABC", ColB := "XXXX"]
and the values are modified in place, unlike ifelse(), which would copy the entire column to replace just those rows where the condition is satisfied.
We call this sub-assign by reference. You can read more about it in the new HTML vignettes.
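As a minimal sketch of sub-assign by reference, assuming a made-up three-row table (the values in ColA and ColB are hypothetical, chosen only for illustration):

```r
library(data.table)

# Hypothetical example data; only the column names come from the question.
data <- data.frame(
  ColA = c("ABC", "DEF", "ABC"),
  ColB = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

setDT(data)                          # convert to data.table by reference
data[ColA == "ABC", ColB := "XXXX"]  # update only the matching rows in place

print(data)                          # all three rows are still present
```

No reassignment with <- is needed, because := changes the table it is called on.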
When you use filter() you are actually removing the rows that do not match the condition you specified, so they will not show up in the final data set.
Does ColB already exist in your data frame? If so,
data %>%
mutate(ColB = ifelse(ColA == "ABC", "XXXX", ColB))
will change ColB to "XXXX" when ColA == "ABC" and leave it as is otherwise. If ColB does not already exist, then you will have to specify what to do for rows where ColA != "ABC", for example:
data %>%
mutate(ColB = ifelse(ColA == "ABC", "XXXX", NA))
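A minimal sketch of the conditional update on a made-up data frame (the values are hypothetical). It uses dplyr::if_else(), a stricter variant of ifelse() that requires both branches to have the same type; ifelse() as shown above behaves the same way here.

```r
library(dplyr)

# Hypothetical example data; only the column names come from the question.
data <- data.frame(
  ColA = c("ABC", "DEF", "ABC"),
  ColB = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# No filter(): every row is kept, and only rows with ColA == "ABC" change.
updated <- data %>%
  mutate(ColB = if_else(ColA == "ABC", "XXXX", ColB))

print(updated)
```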
Another option is to perform a subsequent combination of union and anti-join with the same data. This requires a primary key:
data <- data %>%
filter(ColA == "ABC") %>%
mutate(ColB = "XXXX") %>%
rbind_list(., anti_join(data, ., by = ...))
Example:
mtcars_n <- mtcars %>% add_rownames
mtcars_n %>%
filter(cyl > 6) %>%
mutate(mpg = 1) %>%
rbind_list(., anti_join(mtcars_n, ., by = "rowname"))
This is much slower than probably any other approach, but useful to get quick results by extending your existing pipe.
Just updating (as of June 2nd, 2022) @krlmlr's great answer:
add_rownames() is deprecated; use tibble::rownames_to_column() instead.
rbind_list() is also deprecated; use bind_rows() instead.
You might also find a different sequence of rows in your resulting joined dataset, which depending on your aim is quite difficult to correct with dplyr::arrange() afterwards.
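Put together, the updated pipe might look like the sketch below. The last step is one possible way to restore the original row order when a key column such as rowname is available; it is an illustration, not part of the original answer.

```r
library(dplyr)

# Add the row names as a key column called "rowname".
mtcars_n <- mtcars %>% tibble::rownames_to_column()

updated <- mtcars_n %>%
  filter(cyl > 6) %>%
  mutate(mpg = 1) %>%
  # Append the rows that were filtered out, matched by the key column.
  bind_rows(., anti_join(mtcars_n, ., by = "rowname"))

# Sort each row back to the position its key had in the original table.
restored <- updated %>%
  arrange(match(rowname, mtcars_n$rowname))
```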
An alternative, although slower, is:
mtcars_n <- mtcars %>%
tibble::rownames_to_column() %>%
filter(cyl > 6) %>%
mutate(new_col = 1)
mtcars_m <- left_join(x=mtcars, y=mtcars_n)
I am trying to use a custom function inside a piped mutate statement. I looked at this somewhat similar SO post, but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names to select some columns based on the value in the exclude column and, for example, calculates a sum over the variables not named in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I give a single row (row 1) to FUN, I get the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do the same in a pipe with mutate (adding the result to a variable s)? These two tries do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also does not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works, of course, but it is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
rowwise %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate
df %>%
group_by(1:n()) %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The reason mutate doesn't work in this case is that you are passing the whole tibble to it, so it's like calling FUN(df).
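To see the difference, here is a small sketch using the df and FUN from the question. The per-row variant with vapply is only an illustration of calling FUN once per row; it is not something the answer above proposes.

```r
library(dplyr)

df <- data.frame(exclude = c("B", "B", "D"),
                 B = c(1, 0, 0),
                 C = c(3, 4, 9),
                 D = c(1, 1, 0),
                 blob = c("fd", "fs", "sa"),
                 stringsAsFactors = FALSE)

FUN <- function(df) {
  sum(df[c("B", "C", "D")][!names(df[c("B", "C", "D")]) %in% df["exclude"]])
}

# Here `.` is the whole data frame, so FUN runs once and its single
# result is recycled into every row.
whole <- df %>% mutate(s = FUN(.))

# Calling FUN on one row at a time gives one result per row.
per_row <- df %>%
  mutate(s = vapply(seq_len(nrow(df)), function(i) FUN(df[i, ]), numeric(1)))

print(whole$s)
print(per_row$s)
```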
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise makes each subsequent function operate row by row, which supports arbitrarily complex operations that need to be applied to each row.
Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
unnest returns our tibble into the more standard structure.
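The core idea (build a list of single-row data frames, then map FUN over it) can also be sketched without nest. This variant uses base split() together with map_dbl; it is an alternative illustration under that assumption, not the approach shown above.

```r
library(purrr)

df <- data.frame(exclude = c("B", "B", "D"),
                 B = c(1, 0, 0),
                 C = c(3, 4, 9),
                 D = c(1, 1, 0),
                 blob = c("fd", "fs", "sa"),
                 stringsAsFactors = FALSE)

FUN <- function(df) {
  sum(df[c("B", "C", "D")][!names(df[c("B", "C", "D")]) %in% df["exclude"]])
}

# One single-row data frame per row, in the original row order.
rows <- split(df, seq_len(nrow(df)))

# map_dbl calls FUN on each piece and returns a numeric (double) vector.
df$s <- unname(map_dbl(rows, FUN))

print(df)
```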
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
exclude B C D blob s
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 B 1 3 1 fd 4
2 B 0 4 1 fs 5
3 D 0 9 0 sa 9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
A word of warning: while I like the functionality of by_row, it's not always the best approach for performance! The purrr approach is more intuitive, but it comes at a rather large speed loss. See the following microbenchmark test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than combining do with rowwise or group_by(1:n()) (see @konvas's answer), and roughly on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but is perhaps easier to work with. Different user-defined functions may change the speed order.
I want to add a suffix or prefix to most variable names in a data.frame, typically after they've all been transformed in some way and before performing a join. I don't have a way to do this without breaking up my piping.
For example, with this data:
library(dplyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
I want to get to this result (note variable names):
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
My current approach is:
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.)))
names(means14)[2:length(names(means14))] <- paste0(names(means14)[2:length(names(means14))], "_mean_2014")
Is there an alternative to that clunky last line that breaks up my pipes? I've looked at select() and rename() but don't want to explicitly specify each variable name, as I usually want to rename all except a single variable and might have a much wider data.frame than in this example.
I'm imagining a final piped command that approximates this made-up function:
appendname(cols = 2:n, str = "_mean_2014", placement = "suffix")
Which doesn't exist as far as I know.
You can pass functions to rename_at, so do
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_all(funs(mean(.))) %>%
rename_at(vars(-class), function(x) paste0(x, "_mean_2014"))
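In more recent dplyr releases (1.0.0 or later, which is an assumption about your installed version), the same result can be sketched with across() and rename_with(). Two variants are shown: renaming after summarising, and naming the columns during summarising through the .names argument.

```r
library(dplyr)

set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
                    force = rexp(10), class = rep(c("a", "b"), 5))

# Variant 1: summarise, then rename every column except class.
means14 <- dat14 %>%
  group_by(class) %>%
  select(-ID) %>%
  summarise(across(everything(), mean)) %>%
  rename_with(~ paste0(.x, "_mean_2014"), -class)

# Variant 2: build the new names while summarising.
means14_b <- dat14 %>%
  select(-ID) %>%
  group_by(class) %>%
  summarise(across(everything(), mean, .names = "{.col}_mean_2014"))

names(means14)
```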
After additional experimenting since posting this question, I've found that the setNames function will work with the piping as it returns a data.frame:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
setNames(c(names(.)[1], paste0(names(.)[-1],"_mean_2014")))
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
This is a bit quicker, but not totally what you want:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) -> means14
names(means14)[-1] %<>% paste0("_mean_2014")
If you haven't used the %<>% operator (from the magrittr package) before, definitely check this link out; it's a super-useful tool.
You can also use it for recomputing or rounding some columns, like this: df$meancolumn %<>% round(). It comes up very often and saves you a lot of writing.
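A short sketch of the assignment pipe on a made-up data frame (scores and its values are hypothetical): x %<>% f() is shorthand for x <- x %>% f().

```r
library(magrittr)

# Hypothetical example data.
scores <- data.frame(id = 1:3, meancolumn = c(1.26, 2.74, 3.5))

# Round a column in place.
scores$meancolumn %<>% round(1)

# Append a suffix to every column name except the first.
names(scores)[-1] %<>% paste0("_mean_2014")

print(scores)
```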
As of February 2017 you can do this with the dplyr command rename_(...).
For this example you could do:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
rename_(.dots = setNames(names(.)[-1], paste0(names(.)[-1], "_mean_2014")))
This is rather similar to the answer with setNames but works with tibbles too!
This is more of a step back, but you might think about reshaping your data so that you can apply the function to multiple years at the same time. This will preserve tidiness. If you are going to end up comparing different years, it might make sense to have the year as a separate variable in the data frame, rather than storing the year in the names. You should be able to use summarise_ to get the mean_year behavior. See http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
library(dplyr)
library(tidyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
dat14 %>%
gather(variable, value, -ID, -class) %>%
mutate(year = 2014) %>%
group_by(class, year, variable)%>%
summarise(mean = mean(value))
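With current tidyr (1.0.0 or later, which is an assumption about your installed version), gather() has been superseded by pivot_longer(). A sketch of the same reshaping in that style:

```r
library(dplyr)
library(tidyr)

set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
                    force = rexp(10), class = rep(c("a", "b"), 5))

long_means <- dat14 %>%
  # One row per (ID, class, variable) combination.
  pivot_longer(cols = -c(ID, class),
               names_to = "variable", values_to = "value") %>%
  mutate(year = 2014) %>%
  group_by(class, year, variable) %>%
  summarise(mean = mean(value), .groups = "drop")

print(long_means)
```

The year now lives in its own column, so results for further years can be appended with bind_rows() and compared directly.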
While Sam Firke's solution using setNames() is certainly the only solution that keeps the pipe unbroken, it will not work with the tbl objects from dplyr, since their column names are not accessible through the usual base R naming functions. Here is a function that you can use within a pipe with tbl objects as well, thanks to this solution by hrbrmstr. It adds predefined prefixes and suffixes at the specified column indices. The default is all columns.
tbl.renamer <- function(tbl,prefix="x",suffix=NULL,index=seq_along(tbl_vars(tbl))){
newnames <- tbl_vars(tbl) # Get old variable names
names(newnames) <- newnames
names(newnames)[index] <- paste0(prefix,".",newnames,suffix)[index] # create a named vector for .dots
rename_(tbl,.dots=newnames) # rename the variables
}
Example usage (assuming auth_user is a tbl_sql object):
auth_user %>% tbl_vars
tbl.renamer(auth_user) %>% tbl_vars
auth_user %>% tbl.renamer %>% tbl_vars
auth_user %>% tbl.renamer(index = c(1,5)) %>% tbl_vars
My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
The issue is that operators and functions are applied in a vectorized way by mutate. Thus, which is applied to the vector produced by as.character(df$subject) == subj.list, not to each row (as in your loop).
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
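The vectorized behaviour can be seen in base R alone. The sketch below reproduces the size-12 result from the error message and then shows match(), a vectorized lookup that returns one position per element (match() is my suggestion here, not part of the answer above).

```r
df <- InsectSprays
names(df)[names(df) == "spray"] <- "subject"
subj.list <- unique(as.character(df$subject))

# The 72 subjects are compared against the 6-element subj.list, which is
# silently recycled 12 times, so this is not a per-row lookup.
recycled <- as.character(df$subject) == subj.list
length(which(recycled))  # 12, the "wrong result size" from the error

# match() looks up each element separately and returns 72 positions.
df$subject.id <- match(as.character(df$subject), subj.list)
table(df$subject, df$subject.id)
```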
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
subj.df <- df$subject %>%
unique() %>%
as_tibble() %>%
rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df,by = c("subject"="value"))
I am trying to replace some filtered values of a data set. So far, I have written these lines of code:
df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA)),
where uniq is just a list containing variable names I want to focus on (and group1 and values are column names). This is actually working. However, it only outputs the altered filtered rows and does not replace anything in the data set df. Does anyone have an idea, where my mistake is? Thank you so much! The following code is to reproduce the example:
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df <- data.frame(group1, values)
uniq <- unique(unlist(df$group1))
for (i in 1:length(uniq)){
df <- df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA))
}
What I would like to get is that it leaves all values except the last one since it is one unique group (group1 == C) and 0.9 < 1. So I'd like to get the exact same data frame here except that 0.9 is replaced with NA. Moreover, would it be possible to just use if instead of ifelse?
dplyr verbs return a new object and never modify their input, so the result is lost unless you store it with an assignment operator (<-).
Compare
require(dplyr)
data(mtcars)
mtcars %>% filter(cyl == 4)
with
mtcars4 <- mtcars %>% filter(cyl == 4)
mtcars4
The data are the same, but in the second example the filtered data is stored in a new object mtcars4
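For the data in the question, one way to keep every row while changing only some groups is a grouped mutate instead of filter inside a loop. This is a sketch of my reading of the desired result (0.9 replaced by NA, everything else unchanged). The small tolerance is my own addition, because sums such as 0.6 + 0.3 + 0.1 may not be exactly 1 in floating point.

```r
library(dplyr)

group1 <- c("A", "A", "A", "B", "B", "C")
values <- c(0.6, 0.3, 0.1, 0.2, 0.8, 0.9)
df <- data.frame(group1, values)

result <- df %>%
  group_by(group1) %>%
  # The condition is a single TRUE/FALSE per group, so a plain if/else
  # works here; ifelse() is only needed for element-wise conditions.
  mutate(values = if (sum(values) < 1 - 1e-8) NA_real_ else values) %>%
  ungroup()

print(result)
```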
How can I simplify or perform the following operations using dplyr:
1. Run a function on all data.frame names, like mutate_each(funs()) for values, e.g.
names(iris) <- make.names(names(iris))
2. Delete columns that do NOT exist (i.e. delete nothing), e.g.
iris %>% select(-matches("Width")) # ok
iris %>% select(-matches("X")) # returns empty data.frame, why?
3. Add a new column by name (string), e.g.
iris %>% mutate_("newcol" = 0) # ok
x <- "newcol"
iris %>% mutate_(x = 0) # adds a column with name "x" instead of "newcol"
4. Rename a data.frame colname that does not exist
names(iris)[names(iris)=="X"] <- "Y"
iris %>% rename(sl=Sepal.Length) # ok
iris %>% rename(Y=X) # error, instead of no change
1. I would use setNames for this:
iris %>% setNames(make.names(names(.)))
2. Include everything() as an argument for select:
iris %>% select(-matches("Width"), everything())
iris %>% select(-matches("X"), everything())
3. To my understanding, there's no shortcut other than explicitly naming the string like you already do:
iris %>% mutate_("newcol" = 0)
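In later dplyr versions (0.7 or newer, which is an assumption about your installed version), a column name held in a string can be injected on the left-hand side of :=. The glue-style form additionally assumes a reasonably recent rlang. A base R equivalent is included for comparison.

```r
library(dplyr)

x <- "newcol"

# Unquote the string with !! and use := instead of =.
with_bang <- iris %>% mutate(!!x := 0)

# Glue-style name on the left-hand side.
with_glue <- iris %>% mutate("{x}" := 0)

# Base R equivalent.
with_base <- iris
with_base[[x]] <- 0

names(with_bang)
```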
I came up with the following solution for #4:
iris %>%
rename_at(vars(everything()),
function(nm)
recode(nm,
Sepal.Length="sl",
Sepal.Width = "sw",
X = "Y")) %>%
head()
The last line is just for convenient output, of course.
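With a recent dplyr and tidyselect (an assumption about your installed versions), the same tolerant renaming can be sketched with a named lookup vector and any_of(), which skips names that are not present in the data. The lookup vector below is hypothetical and uses the form new_name = "old_name".

```r
library(dplyr)

# Hypothetical lookup; "X" does not exist in iris.
lookup <- c(sl = "Sepal.Length", sw = "Sepal.Width", Y = "X")

renamed <- iris %>% rename(any_of(lookup))

names(renamed)
```

Using all_of(lookup) instead would raise an error for the missing column X.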
1 through 3 are answered above. I came here because I had the same problem as number 4. Here is my solution:
df <- iris
Set a name key with the columns to be renamed and the new values:
name_key <- c(
sl = "Sepal.Length",
sw = "Sepal.Width",
Y = "X"
)
Set values not in data frame to NA. This works for my purpose better. You could probably just remove it from name_key.
for (var in names(name_key)) {
if (!(name_key[[var]] %in% names(df))) {
name_key[var] <- NA
}
}
Get a vector of column names in the data frame.
cols <- names(name_key[!is.na(name_key)])
Rename columns
for (nm in names(name_key)) {
names(df)[names(df) == name_key[[nm]]] <- nm
}
Select columns
df2 <- df %>%
select(all_of(cols))
I'm almost positive this can be done more elegantly, but this is what I have so far. Hope this helps, if you haven't solved it already!
Answer to question 2:
You can use the function any_of if you want to give explicitly the full names of the columns.
iris %>%
select(-any_of(c("X", "Sepal.Width","Petal.Width")))
This will not remove the non-existing column X and will remove the other two listed.
Otherwise, you are good with the solution with matches or a combination of any_of and matches.
iris %>%
select(-any_of("X")) %>%
select(-matches("Width"))
This will remove X if it exists, along with every column that matches. Multiple matches are also possible.
iris %>%
select(-any_of("X")) %>%
select(-matches(c("Width", "Spec"))) # use c for multiple matches