Invalid column using dplyr + NSE - r

The following code works fine on my mac, using CRAN R:
delta_scores <- function(df, data_var) {
# Use Hadley's new non-standard evaluation helpers to compute differences in
# the symbol passed through data_var from Session 1 to 2. Assumes an ID column
# in df that groups units of measurement.
# For the RHS:
quo_data_var <- enquo(data_var)
# For the LHS, we need yet another step (basically a string)
name_data_var <- quo_name(quo_data_var)
df %>% select(ID, Session, !!quo_data_var) %>%
# NSE spread stopped working on my windows machine!
spread(Session, !!quo_data_var) %>%
# Note use of := instead of plain = to support NSE
transmute(ID=ID, !!name_data_var := `2`-`1`)
}
test_df <- data_frame(ID=c(1,2,3,1,2,3),
Session=c(1,1,1,2,2,2),
Measure=c(1,2,3,1,1,4))
delta_scores(test_df, Measure)
But when I run it on Windows, Microsoft R Open 3.4.2, dplyr 0.7.3, I get:
Error: Invalid column specification
NOTE: it's easy enough to fix by replacing spread with spread_('Session', name_data_var). Interestingly, the select call works fine (my real data frames have many columns). I'm concerned about the bigger issue of dplyr's NSE not working in a given environment.
Looking at the debugger stacktrace was daunting enough that I decided to ask for help here first. Any ideas about what's going on or ideas on how to debug this are much appreciated!

This is solved in later versions of tidyr. Confirmed working in tidyr 0.7.2. Support for Hadley's new NSE (non-standard evaluation) system was added in the 0.7 release.

Related

Why wont replace_na actually replace the missing values using dplyr and piping?

I have been struggling a lot recently with the replace_na() function when cleaning my data. I have two complementary variables and I want to use one variable (varname2) to supply the missing values for the other (varname1). I've been trying the following:
df %>%
replace_na(varname = varname2)
In response I keep getting the error:
Did you misspecify an argument?
Run `rlang::last_error()` to see where the error occurred.
> df <- df %>%
+ replace_na(varname1= varname2)
Error: 1 components of `...` were not used.
We detected these problematic arguments:
* `varname1`
Suggestions for an efficient way to fix this?
I found a blog response elsewhere in which Hadley himself said they wanted to move away from replace_na() toward a more SQL adjacent command coalesce(). The solution involves both across() and coalesce().
Here's an example of what I just did in my work:
df %>%
mutate(across(varname1, coalesce, varname2))
It seems to have worked like a charm.

Error in df % > % mutate(price_scal = scale(price), hd_scal = scale(hd), : could not find function "% > %" [duplicate]

I'm running an example in R, going through the steps and everything is working so far except for this code produces an error:
words <- dtm %>%
as.matrix %>%
colnames %>%
(function(x) x[nchar(x) < 20])
Error: could not find function "%>%"
I don't understand what the benefit of using this special operator
%>% is, and any feedback would be great.
You need to load a package (like magrittr or dplyr) that defines the function first, then it should work.
install.packages("magrittr") # package installations are only needed the first time you use it
install.packages("dplyr") # alternative installation of the %>%
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr) # alternatively, this also loads %>%
The pipe operator %>% was introduced to "decrease development time and to improve readability and maintainability of code."
But everybody has to decide for himself if it really fits his workflow and makes things easier.
For more information on magrittr, click here.
Not using the pipe %>%, this code would return the same as your code:
words <- colnames(as.matrix(dtm))
words <- words[nchar(words) < 20]
words
EDIT:
(I am extending my answer due to a very useful comment that was made by #Molx)
Despite being from magrittr, the pipe operator is more commonly used
with the package dplyr (which requires and loads magrittr), so
whenever you see someone using %>% make sure you shouldn't load dplyr
instead.
On Windows: if you use %>% inside a %dopar% loop, you have to add a reference to load package dplyr (or magrittr, which dplyr loads).
Example:
plots <- foreach(myInput=iterators::iter(plotCount), .packages=c("RODBC", "dplyr")) %dopar%
{
return(getPlot(myInput))
}
If you omit the .packages command, and use %do% instead to make it all run in a single process, then works fine. The reason is that it all runs in one process, so it doesn't need to specifically load new packages.
One needs to install magrittr as follows
install.packages("magrittr")
Then, in one's script, don't forget to add on top
library(magrittr)
For the meaning of the operator %>% you might want to consider this question: What does %>% function mean in R?
Note that the same operator would also work with the library dplyr, as it imports from magrittr.
dplyr used to have a similar operator (%.%), which is now deprecated. Here we can read about the differences between %.% (deprecated operator from the library dplyr) and %>% (operator from magrittr, that is also available in dplyr)
The pipe operator is not available in base R. You need to load one of the following packages to use it: dplyr, tidyverse or magrittr
Anyone else stumbling upon this for calculating powers of matrices please install this library (dplyr alone not correct)
library(expm)

Fixing a multiple warning "unknown column"

I have a persistent multiple warning of "unknown column" for all types of commands (e.g., str(x) to installing updates on packages), and not sure how to debug this or fix it.
The warning "unknown column" is clearly related to a variable in a tbl_df that I renamed, but the warning comes up in all kinds of commands seemingly unrelated to the tbl_df (e.g., installing updates on a package, str(x) where x is simply a character vector).
This is an issue with the Diagnostics tool in RStudio (the tool that shows warnings and possible mistakes in your code). It was partially fixed at this commit in RStudio v1.1.103 or later by #kevin-ushey. That fix was partial, because the warnings still appeared (albeit with less frequency). This issue was reported with a reproducible example at https://github.com/rstudio/rstudio/issues/7372 and it was fixed on RStudio v1.4 pull request.
Update to the latest RStudio release to fix this issue. Alternatively, there are several workarounds available, choose the solution you prefer:
Disable the code diagnostics for all files in Preferences/Code/Diagnostics
Disable all diagnostics for a specific file:
Add at the beginning of the opened file(s):
# !diagnostics off
Then save the files and the warnings should stop appearing.
Disable the diagnostics for the variables that cause the warning
Add at the beginning of the opened file(s):
# !diagnostics suppress=<comma-separated list of variables>
Then save the files and the warnings should stop appearing.
The warnings appear because the diagnostics tool in RStudio parses the source code to detect errors and when it performs the diagnostic checks it accesses columns in your tibble that are not initialized, giving the Warning we see. The warnings do not appear because you run unrelated things, they appear when the RStudio diagnostics are executed (when a file is saved, then modified, when you run something...).
I have been encountering the same problem, and although I don't know why it occurs, I have been able to pin down when it occurs, and thus prevent it from happening.
The issue seems to be with adding in a new column, derived from indexing, in a base R data frame vs. in a tibble data frame. Take this example, where you add a new column (age) to a base R data frame:
base_df <- data.frame(id = c(1:3), name = c("mary", "jill","steve"))
base_df$age[base_df$name == "mary"] <- 47
That works without returning a warning. But when the same is done with a tibble, it throws a warning (and consequently, I think causing the weird, seemingly unprovoked, multiple warning issue):
library(tibble)
tibble_df <- tibble(id = c(1:3), name = c("mary", "jill","steve"))
tibble_df$age[tibble_df$name == "mary"] <- 47
Warning message:
Unknown column 'age'
There are surely better ways of avoiding this, but I have found that first creating a vector of NAs does the job:
tibble_df$age <- NA
tibble_df$age[tibble_df$name == "mary"] <- 47
I have faced this issue when using the "dplyr" package.
For those facing this problem after using the "group_by" function in the "dplyr" library:
I have found that ungrouping the variables solves the unknown column warning problem. Sometimes I have had to iterate through the ungrouping several times until the problem is resolved.
Converting the class into data.frame solved the problem for me:
library(dplyr)
df <- data.frame(id = c(1,1:3), name = c("mary", "jo", "jill","steve"))
dfTbl <- df %>%
group_by(id) %>%
summarize (n = n())
class(dfTbl) # [1] "tbl_df" "tbl" "data.frame"
dfTbl = as.data.frame(dfTbl)
class(dfTbl) # [1] "data.frame"
Borrowed the partial script from #adts
I had this problem when dealing with tibble and lapply functions together. The tibble seemed to save things as a list inside the dataframe.
I solved it by using unlist before adding the results of an lapply function to the tibble.
I ran into this problem too except through a tibble created using a dyplyr block. Here's slight modification of sabre's code to show how I came to the same error.
library(dplyr)
df <- data.frame(id = c(1,1:3), name = c("mary", "jo", "jill","steve"))
t <- df %>%
group_by(id) %>%
summarize (n = n())
t
str(t)
t$newvar[t$id==1] <- 0
I know this is an old thread, but I just encountered the same problem when loading a spatial vector in geopackage format with the package sf. Using as_tibble=FALSE worked for me. The file was loaded as an sp object but everything still worked fine. As mentioned by #sabre, trying to force an object into a tibble seems to be making the problems while trying to index a column that was not anymore there.
Let's say I wanted to select the following column(s)
best.columns = 'id'
For me the following gave the warning:
df%>% select_(one_of(best.columns))
While this worked as expected, although, as far as I know dplyr, this should be identical.
df%>% select_(.dots = best.columns)
I get these warnings when I rename a column using dplyr::rename after reading it using the readr package.
The old name of the column is not renamed in the spec attribute. So removing the the spec attribute makes the warnings go away. Also removing the "spec_tbl_df" class seems like a good idea.
attr(dat, "spec") <- NULL
class(dat) <- setdiff(class(dat), "spec_tbl_df")
Building on the answer by #stok ( https://stackoverflow.com/a/47848259/7733418 ), who found this problem when using group_by (which also converts your data.frame to a tibble), and solved it in the same way.
For me the problem was ultimately due to the use of "slice()".
Slice() converted my data.frame to a tibble, causing this error.
Checking the class of your data.frame and re-converting it to a data.frame whenever a function converts it to a tibble could solve this issue.

How to use require(googlesheets) properly?

I recently downloaded googlesheets via
devtools::install_github("jennybc/googlesheets")
and experience some difficulties. When running the script as mentioned in
https://github.com/jennybc/googlesheets I get always:
Error: could not find function "%>%"
How can I solve that problem?
Reproducible example:
Download:
devtools::install_github("jennybc/googlesheets")
require(googlesheets)
Data:
gap_key <- "1HT5B8SgkKqHdqHJmn5xiuaC04Ngb7dG9Tv94004vezA"
copy_ss(key = gap_key, to = "Gapminder")
gap <- register_ss("Gapminder")
Error occurs:
oceania_csv <- gap %>% get_via_csv(ws = "Oceania")
Load the dplyr package first, which provides the %>% operator. This is noted here in the README you link to (suppressMessages is optional):
googlesheets is designed for use with the %>% pipe operator and, to a lesser extent, the data-wrangling mentality of dplyr. The examples here use both, but we'll soon develop a vignette that shows usage with plain vanilla R. googlesheets uses dplyr internally but does not require the user to do so.
library("googlesheets")
suppressMessages(library("dplyr"))
You can install dplyr with
install.packages("dplyr")
See here for more about the pipe operator (%>%).

Error: could not find function "%>%"

I'm running an example in R, going through the steps and everything is working so far except for this code produces an error:
words <- dtm %>%
as.matrix %>%
colnames %>%
(function(x) x[nchar(x) < 20])
Error: could not find function "%>%"
I don't understand what the benefit of using this special operator
%>% is, and any feedback would be great.
You need to load a package (like magrittr or dplyr) that defines the function first, then it should work.
install.packages("magrittr") # package installations are only needed the first time you use it
install.packages("dplyr") # alternative installation of the %>%
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr) # alternatively, this also loads %>%
The pipe operator %>% was introduced to "decrease development time and to improve readability and maintainability of code."
But everybody has to decide for himself if it really fits his workflow and makes things easier.
For more information on magrittr, click here.
Not using the pipe %>%, this code would return the same as your code:
words <- colnames(as.matrix(dtm))
words <- words[nchar(words) < 20]
words
EDIT:
(I am extending my answer due to a very useful comment that was made by #Molx)
Despite being from magrittr, the pipe operator is more commonly used
with the package dplyr (which requires and loads magrittr), so
whenever you see someone using %>% make sure you shouldn't load dplyr
instead.
On Windows: if you use %>% inside a %dopar% loop, you have to add a reference to load package dplyr (or magrittr, which dplyr loads).
Example:
plots <- foreach(myInput=iterators::iter(plotCount), .packages=c("RODBC", "dplyr")) %dopar%
{
return(getPlot(myInput))
}
If you omit the .packages command, and use %do% instead to make it all run in a single process, then works fine. The reason is that it all runs in one process, so it doesn't need to specifically load new packages.
One needs to install magrittr as follows
install.packages("magrittr")
Then, in one's script, don't forget to add on top
library(magrittr)
For the meaning of the operator %>% you might want to consider this question: What does %>% function mean in R?
Note that the same operator would also work with the library dplyr, as it imports from magrittr.
dplyr used to have a similar operator (%.%), which is now deprecated. Here we can read about the differences between %.% (deprecated operator from the library dplyr) and %>% (operator from magrittr, that is also available in dplyr)
The pipe operator is not available in base R. You need to load one of the following packages to use it: dplyr, tidyverse or magrittr
Anyone else stumbling upon this for calculating powers of matrices please install this library (dplyr alone not correct)
library(expm)

Resources