R Mutate (Dataframe vs Tibble)

Working in R 3.6.1 (64-bit). I used readxl to get a data frame into R (named "RawShift"). I made 6 character vectors that are lists of user names; each vector is named for the team those users are on.
I want to use mutate to create a column that holds the team each user is from.
INTeam <- c("user1", "user2", ...)
OFTeam <- c("user3", "user4", ...)
When I was working with a data frame, this code worked:
RawShift <- RawShift %>% mutate(Team = case_when(
  `username` %in% OFTeam ~ "Office",
  `username` %in% INTeam ~ "Industrial"
))
Now that I've run as_tibble on my RawShift, the code won't error, nor will it work. Is this a case of not understanding the tibble access methods ("", [], ., [[]])? Is it worth worrying about, or should I just do a hack job: convert to a data frame, run the mutate, and convert back to a tibble later? I've looked into the benefits of tibbles over data frames and it seems I'd be better off using tibbles, but I can't seem to get this working. I've tried using "$", ".", etc. before the %in% without luck so far. Thanks for any advice/help.

We may need to load the tibble package
library(dplyr)
library(tibble)
head(iris) %>%
  as_tibble %>%
  mutate(new = case_when(Species == "setosa" ~ "hello"))
# A tibble: 6 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#1 5.1 3.5 1.4 0.2 setosa hello
#2 4.9 3 1.4 0.2 setosa hello
#3 4.7 3.2 1.3 0.2 setosa hello
#4 4.6 3.1 1.5 0.2 setosa hello
#5 5 3.6 1.4 0.2 setosa hello
#6 5.4 3.9 1.7 0.4 setosa hello
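Applying the same pattern to the question's setup should work the same way on a tibble. A minimal sketch, with hypothetical team vectors and usernames standing in for the real data:
library(dplyr)
library(tibble)
# Hypothetical team vectors and usernames (placeholders for the real data)
INTeam <- c("user1", "user2")
OFTeam <- c("user3", "user4")
RawShift <- tibble(username = c("user1", "user3", "user2", "user4"))
RawShift %>%
  mutate(Team = case_when(
    username %in% OFTeam ~ "Office",
    username %in% INTeam ~ "Industrial"
  ))
# Team becomes "Office" or "Industrial" depending on which vector the username is in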

Related

Passing vector of names to verify to assertr's verify in R

I am importing a dataset from a third party and would like to be able to validate that all of the columns in the incoming dataset are named as agreed and expected. To do this, I intended to use the verify statement from the assertr package in R with has_all_names. I can accomplish this with no problem if I manually enter the column names to be verified, but I can't seem to accomplish it by passing in a vector that contains the names of the columns to be verified. For example, using the built-in iris dataset, I can verify the existence of all the column names if I manually enter the names as arguments to the has_all_names function, but if I have the names stored in a vector and attempt to use it for verification, it does not work:
#Create a sample list of column names to be verified
#In my real work, I obtain this list from a database
(names(iris)->expected_variable_names)
Which outputs:
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
But then I run the following:
#This works:
iris %>% verify(has_all_names("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
#But this does not:
iris %>% verify(has_all_names(expected_variable_names))
When I attempt to run the line that does not work, this generates:
verification [has_all_names(expected_variable_names)] failed! (1 failure)
verb redux_fn predicate column index value
1 verify NA has_all_names(expected_variable_names) NA 1 NA
Error: assertr stopped execution
Obviously, the failed attempt indicates that not all of the column names are found in the data frame, but since I'm passing in all the variable names that are indeed in the dataset, it should succeed. How can I pass a vector (or possibly even a list) of column names into verify to validate? I've tried a number of different variations of this last attempt with no success.
Thanks.
We may use invoke
library(purrr)
library(dplyr)
library(assertr)
iris %>%
verify(invoke(has_all_names, expected_variable_names))
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
...
Or with exec from rlang
library(rlang)
iris %>%
verify(exec(has_all_names, !!!expected_variable_names))
Or with do.call from base R
iris %>%
  verify(do.call(has_all_names,
                 as.list(expected_variable_names)))
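All three approaches splice the vector so that has_all_names receives one argument per column name; the plain vector fails because it arrives as a single argument. A toy illustration of that difference, using a hypothetical helper rather than assertr itself:
count_args <- function(...) length(list(...))
count_args(expected_variable_names)                     # 1: the whole vector is one argument
do.call(count_args, as.list(expected_variable_names))   # 5: one argument per column name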

Why does adding by 1 actually add by 2 in Sparklyr when using dynamic variable names?

When I run the following code, I expect the value of the Sepal_Width_2 column to be Sepal_Width + 1, but it is in fact Sepal_Width + 2. What gives?
require(dplyr)
require(sparklyr)
Sys.setenv(SPARK_HOME='/usr/lib/spark')
sc <- spark_connect(master="yarn")
# for this example these variables are hard coded
# but in my actual code these are named dynamically
sw_name <- as.name('Sepal_Width')
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)
ir <- copy_to(sc, iris)
print(head(ir %>% mutate(!!sw2 := sw_name))) # so far so good
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 5.1 3.5 1.4 0.2 setosa 3.5
# 4.9 3 1.4 0.2 setosa 3
# 4.7 3.2 1.3 0.2 setosa 3.2
# 4.6 3.1 1.5 0.2 setosa 3.1
# 5 3.6 1.4 0.2 setosa 3.6
# 5.4 3.9 1.7 0.4 setosa 3.9
print(head(ir %>% mutate(!!sw2 := sw_name) %>% mutate(!!sw2 := sw2_name + 1))) # i guess 2+2 != 4?
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 5.1 3.5 1.4 0.2 setosa 5.5
# 4.9 3 1.4 0.2 setosa 5
# 4.7 3.2 1.3 0.2 setosa 5.2
# 4.6 3.1 1.5 0.2 setosa 5.1
# 5 3.6 1.4 0.2 setosa 5.6
# 5.4 3.9 1.7 0.4 setosa 5.9
My use case requires that I use the dynamic variable naming you see above. In this example it is rather silly (compared to just using the variables directly), but in my use case I'm running the same function across hundreds of different spark tables. They all have the same "schema" in terms of the number of columns and what each column is (outputs from some machine learning models), but the names differ because each table contains the output for a different model. The names are predictable, but since they vary, I construct them dynamically as you see here instead of hardcoding them.
It appears that Spark knows how to add 2 and 2 together when the names are hardcoded, but when the names are dynamic it suddenly freaks out.
You might be misusing as.name which is leading sparklyr to misinterpret your input.
Note that your code errors when just working on a local table:
sw_name <- as.name('Sepal.Width') # swap "_" to "." to match variable names
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)
data(iris)
print(head(iris %>% mutate(!!sw2 := sw_name)))
# Error: Problem with `mutate()` input `Sepal_Width_2`.
# x object 'Sepal.Width' not found
# i Input `Sepal_Width_2` is `sw_name`.
Note that you are combining the !! operator from rlang with as.name from base R, but you are not using them together as demonstrated in this question.
I recommend you use sym and !! from the rlang package instead of as.name, and that you apply both to character strings that are column names. The following works locally and is consistent with the non-standard evaluation guidance, so it should translate to Spark:
library(dplyr)
data(iris)
sw <- 'Sepal.Width'
sw2 <- paste0(sw, "_2")
head(iris %>% mutate(!!sym(sw2) := !!sym(sw)))
head(iris %>% mutate(!!sym(sw2) := !!sym(sw)) %>% mutate(!!sym(sw2) := !!sym(sw2) + 1))
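Since the real use case applies this across many tables whose column names vary predictably, one option is to wrap the pattern in a small helper. A sketch under that assumption (add_one_col and its suffix argument are made up for illustration; it runs locally and should translate to Spark the same way):
library(dplyr)
library(rlang)
add_one_col <- function(df, source_col, suffix = "_2") {
  # Build the new column name from the source column name
  new_col <- paste0(source_col, suffix)
  df %>%
    mutate(!!sym(new_col) := !!sym(source_col)) %>%
    mutate(!!sym(new_col) := !!sym(new_col) + 1)
}
head(add_one_col(iris, "Sepal.Width"))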
I'm not sure which package was the culprit (sparklyr, dplyr, R, who knows), but this was fixed when I upgraded from R 3.6.3/sparklyr 1.5 to R 4.0.2/sparklyr 1.7.0.

Creating variables from list objects in R

I'm trying to create a binary set of variables that uses data across multiple columns.
I have a dataset where I'm trying to create a binary variable where any column with a specific name will be indexed for a certain value. I'll use iris as an example dataset.
Let's say I want to create a variable where any column with the string "Sepal" and any row in those columns with the values of 5.1, 3.0, and 4.7 will become "Class A" while values with 3.1, 5.0, and 5.4 will be "Class B". So let's look at the first few entries of iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The first 3 rows should then be under "Class A" while rows 4-6 will be under "Class B". I tried writing this code to do that:
mutate(iris, Class = if_else(
vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7))), "Class A",
ifelse(vars(contains("Sepal")), any_vars(. %in% c(3.1, 5.0, 5.4))), "Class B",NA)
and received the error
Error: `condition` must be a logical vector, not a `quosures/list` object
So I've realized I need lapply here, but I'm not even sure where to begin writing this, because I'm not sure how to represent the whole operation of selecting columns with "Sepal" in the name, together with the specific values in those rows, as one list object to provide to lapply.
This is clearly the wrong syntax:
lapply(vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7)))
Examples using case_when will also be accepted as answers.
If you want to do this using dplyr, you can use rowwise() with the new c_across():
library(dplyr)
iris %>%
  rowwise() %>%
  mutate(Class = case_when(
    any(c_across(contains("Sepal")) %in% c(5.1, 3.0, 4.7)) ~ 'Class A',
    any(c_across(contains("Sepal")) %in% c(3.1, 5.0, 5.4)) ~ 'Class B')) %>%
  head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Class
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#1 5.1 3.5 1.4 0.2 setosa Class A
#2 4.9 3 1.4 0.2 setosa Class A
#3 4.7 3.2 1.3 0.2 setosa Class A
#4 4.6 3.1 1.5 0.2 setosa Class B
#5 5 3.6 1.4 0.2 setosa Class B
#6 5.4 3.9 1.7 0.4 setosa Class B
However, note that using %in% on numerical values is not accurate. If interested you may read Why are these numbers not equal?
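A quick illustration of that floating point caveat; near() from dplyr is one way to compare numbers with a tolerance instead of matching them exactly:
0.1 + 0.2 == 0.3              # FALSE: binary floating point rounding
(0.1 + 0.2) %in% 0.3          # FALSE: %in% also matches exactly
dplyr::near(0.1 + 0.2, 0.3)   # TRUE: comparison within a small tolerance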

How can I replace various columns in a tibble using select?

I'm trying to replace all columns selected using select with data of the same size.
A reproducible example is
library(tidyverse)
iris = as_data_frame(iris)
temp = cbind( runif(nrow(iris)), runif(nrow(iris)), runif(nrow(iris)), runif(nrow(iris)))
select(iris, -one_of("Petal.Length")) = temp
Then I get the error
Error in select(iris, -one_of("Petal.Length")) = temp : could not find
function "select"
Thanks for any comments.
You want to bind columns of two data frames, so you can simply use bind_cols():
library(tidyverse)
iris <- as_tibble(iris)
temp <- tibble(r1 = runif(nrow(iris)), r2 = runif(nrow(iris)), r3 = runif(nrow(iris)), r4 = runif(nrow(iris)))
select(iris, -Petal.Length) %>% bind_cols(temp)
# or use:
# bind_cols(iris, temp) %>% select(-Petal.Length)
which gives you:
# A tibble: 150 × 8
Sepal.Length Sepal.Width Petal.Width Species r1 r2 r3 r4
<dbl> <dbl> <dbl> <fctr> <dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 0.2 setosa 0.7208566 0.1367070 0.04314771 0.4909396
2 4.9 3.0 0.2 setosa 0.4101884 0.4795735 0.75318182 0.1463689
3 4.7 3.2 0.2 setosa 0.6270065 0.5425814 0.26599432 0.1467248
4 4.6 3.1 0.2 setosa 0.8001282 0.4691908 0.73060637 0.0792256
5 5.0 3.6 0.2 setosa 0.5663895 0.4745482 0.65088630 0.5360953
6 5.4 3.9 0.4 setosa 0.8813042 0.1560600 0.41734507 0.2582568
7 4.6 3.4 0.3 setosa 0.5046977 0.9555570 0.22118401 0.9246906
8 5.0 3.4 0.2 setosa 0.5283764 0.4730212 0.24982471 0.6313071
9 4.4 2.9 0.2 setosa 0.5976045 0.4717439 0.14270551 0.2149888
10 4.9 3.1 0.1 setosa 0.3919660 0.5125420 0.95001067 0.5259598
# ... with 140 more rows
We can use -> to assign the output to 'temp'
select(iris, -one_of("Petal.Length")) -> temp
Using the tidyverse paradigm you could use:
dplyr::mutate_at(iris, vars(-one_of("Petal.Length")), .funs = funs(runif))
Although the above sample produces the behaviour with random numbers, it will probably not suit your needs - I suppose you want to match features and rows to those in temp.
That can be done by transforming iris and temp into long format and then joining and replacing data accordingly with the *_join methods, for example.
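If the goal is literally to overwrite the selected columns in place rather than bind new ones, a minimal base R subassignment sketch could also work, assuming temp has one replacement column per selected column and the rows line up by position:
library(tidyverse)
iris2 <- as_tibble(iris)
temp <- tibble(r1 = runif(nrow(iris2)), r2 = runif(nrow(iris2)),
               r3 = runif(nrow(iris2)), r4 = runif(nrow(iris2)))
# Replace every column except Petal.Length, matching columns by position
cols_to_replace <- setdiff(names(iris2), "Petal.Length")
iris2[cols_to_replace] <- temp
head(iris2)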

save residuals with `dplyr`

I want to use dplyr to group a data.frame, fit linear regressions and save the residuals as a column in the original, ungrouped data.frame.
Here's an example
iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  group_by(Species) %>%
  do(mod = lm(Sepal.Length ~ Sepal.Width, data=.))
Returns:
Species mod
1 setosa <S3:lm>
2 versicolor <S3:lm>
3 virginica <S3:lm>
Instead, I would like the original data.frame with a new column containing residuals.
For example,
Sepal.Length Sepal.Width resid
1 5.1 3.5 0.04428474
2 4.9 3.0 0.18952960
3 4.7 3.2 -0.14856834
4 4.6 3.1 -0.17951937
5 5.0 3.6 -0.12476423
6 5.4 3.9 0.06808885
I adapted an example from http://jimhester.github.io/plyrToDplyr/.
r <- iris %>%
  group_by(Species) %>%
  do(model = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
  do((function(mod) {
    data.frame(resid = residuals(mod$model))
  })(.))
corrected <- cbind(iris, r)
Update: another method is to use the augment function in the broom package:
r <- iris %>%
  group_by(Species) %>%
  do(augment(lm(Sepal.Length ~ Sepal.Width, data=.)))
Which returns:
Source: local data frame [150 x 10]
Groups: Species
Species Sepal.Length Sepal.Width .fitted .se.fit .resid .hat
1 setosa 5.1 3.5 5.055715 0.03435031 0.04428474 0.02073628
2 setosa 4.9 3.0 4.710470 0.05117134 0.18952960 0.04601750
3 setosa 4.7 3.2 4.848568 0.03947370 -0.14856834 0.02738325
4 setosa 4.6 3.1 4.779519 0.04480537 -0.17951937 0.03528008
5 setosa 5.0 3.6 5.124764 0.03710984 -0.12476423 0.02420180
...
A solution that seems to be easier than the ones proposed so far, and closer to the code of the original question, is:
iris %>%
  group_by(Species) %>%
  do(data.frame(., resid = residuals(lm(Sepal.Length ~ Sepal.Width, data=.))))
Result:
# A tibble: 150 x 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species resid
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 0.0443
2 4.9 3 1.4 0.2 setosa 0.190
3 4.7 3.2 1.3 0.2 setosa -0.149
4 4.6 3.1 1.5 0.2 setosa -0.180
5 5 3.6 1.4 0.2 setosa -0.125
6 5.4 3.9 1.7 0.4 setosa 0.0681
7 4.6 3.4 1.4 0.3 setosa -0.387
8 5 3.4 1.5 0.2 setosa 0.0133
9 4.4 2.9 1.4 0.2 setosa -0.241
10 4.9 3.1 1.5 0.1 setosa 0.120
Since you are running the exact same regression for each group, you might find it simpler to just define your regression model as a function beforehand, and then execute it for each group using mutate.
model <- function(y, x) {
  a <- y + x
  if (length(which(!is.na(a))) <= 2) {
    return(rep(NA, length(a)))
  } else {
    m <- lm(y ~ x, na.action = na.exclude)
    return(residuals(m))
  }
}
Note that the first part of this function is there to guard against errors popping up in case your regression is run on a group with too few degrees of freedom. (This might be the case if you have a data frame with several grouping variables with many levels, or numerous independent variables in your regression, like for example lm(y ~ x1 + x2), and you can't afford to inspect each group for sufficient non-NA observations.)
So your example can be rewritten as follows:
iris %>%
  group_by(Species) %>%
  mutate(resid = model(Sepal.Length, Sepal.Width)) %>%
  select(Sepal.Length, Sepal.Width, resid)
Which should yield:
Species Sepal.Length Sepal.Width resid
<fctr> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 0.04428474
2 setosa 4.9 3.0 0.18952960
3 setosa 4.7 3.2 -0.14856834
4 setosa 4.6 3.1 -0.17951937
5 setosa 5.0 3.6 -0.12476423
6 setosa 5.4 3.9 0.06808885
This method should not be computationally much different from the one using augment(). (I've had to use both methods on data sets containing several hundred million observations, and believe there was no significant difference in speed compared to using the do() approach.)
Also, please note that omitting na.action = na.exclude, or using m$residuals instead of residuals(m), will result in rows that have NAs (dropped prior to estimation) being excluded from the output vector of residuals. That vector will then not have the right length to be merged back into the data set, and an error message may appear.
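A small illustration of that point, using a hypothetical toy vector with one NA:
y <- c(1, 2, NA, 4, 5)
x <- c(1, 2, 3, 4, 5)
m_omit    <- lm(y ~ x, na.action = na.omit)
m_exclude <- lm(y ~ x, na.action = na.exclude)
length(residuals(m_omit))     # 4: the NA row is simply dropped
length(residuals(m_exclude))  # 5: an NA is kept in the dropped row's place
length(m_exclude$residuals)   # 4: $residuals always excludes the dropped row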
