dplyr masks GGally and breaks ggparcoord - r

Given a fresh session,
executing a small ggparcoord(.) example provided in the documentation of the function
library(GGally)
data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = diamonds.samp, columns = c(1, 5:10))
results in the expected parallel coordinates plot.
Starting again in a fresh session and executing the same script with dplyr loaded
library(GGally)
library(dplyr)
data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = diamonds.samp, columns = c(1, 5:10))
results in:
Error: (list) object cannot be coerced to type 'double'
Note that the order of the library(.) statements does not matter.
Questions
Is there something wrong with the code samples?
Is there a way to overcome the problem (over some namespace functions)?
Or is this a bug?
I need both dplyr and ggparcoord(.) in a bigger analysis, but this minimal example reflects the problem I am facing.
Versions
R # 3.2.3
dplyr # 0.4.3
GGally # 1.0.1
ggplot2 # 2.0.0
UPDATE
To wrap the excellent answer given by Joran up:
Answers
The code samples are in fact wrong, as ggparcoord(.) expects a plain data.frame, not a tbl_df, which is how the diamonds data set behaves once dplyr is loaded.
The problem is solved by coercing the tbl_df to a data.frame.
No, it is not a bug.
Working code sample:
library(GGally)
library(dplyr)
data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = as.data.frame(diamonds.samp), columns = c(1, 5:10))

Converting my comments to an answer...
The GGally package here is making the reasonable assumption that using [ on a data frame should behave the way it always does and always has. However, this all being in the Hadley-verse, the diamonds data set is a tbl_df as well as a data.frame.
When dplyr is loaded, the behavior of [ is overridden such that drop = FALSE is always the default for a tbl_df. So there's a place in GGally where data[,"cut"] is expected to return a vector, but instead it returns another data frame.
...specifically, the error is thrown in your example while attempting to execute:
data[, fact.var] <- as.numeric(data[, fact.var]).
Since data[,fact.var] remains a data frame, and hence a list, as.numeric won't work.
As for your conclusion that this isn't a bug, I'd say....maybe. Probably. At least there probably isn't anything the GGally package author ought to do to address it. You just have to be aware that using tbl_df's with non-Hadley written packages may break things.
As you noted, removing the extra class attributes fixes the problem, as it returns R to using the normal [ method.
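The difference in [ behaviour described above can be reproduced without GGally. The following is a minimal sketch, assuming the tibble package is installed; the toy data frame and its values are made up for illustration and stand in for diamonds.

```r
library(tibble)

# Toy data standing in for diamonds (hypothetical values)
plain <- data.frame(cut = factor(c("Fair", "Good", "Fair")),
                    carat = c(0.3, 0.5, 0.7))
tbl <- as_tibble(plain)

# A plain data.frame drops a single selected column to a vector ...
plain_col <- plain[, "cut"]
# ... whereas a tibble keeps it as a one-column data frame
tbl_col <- tbl[, "cut"]

# as.numeric() works on the factor vector but fails on the list-like tibble
plain_num <- as.numeric(plain_col)
tbl_num <- tryCatch(as.numeric(tbl_col), error = function(e) e)

# [[ returns a vector for both classes, so it extracts one column safely
safe_num <- as.numeric(tbl[["cut"]])
```

Package code that needs a vector regardless of the input class could use [[ (or drop = TRUE explicitly) instead of relying on the default of [.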

Workaround: coerce your data for ggparcoord to as.data.table(...) or as.data.table(... , keep.rownames=TRUE) unless you want to lose all your rownames.
Cause: as per @joran's investigation, when dplyr is loaded, tbl_df overrides [ so that drop = FALSE.
Solution: file a pull-request on GGally.
edit: fixed in v1.3.0 (https://github.com/ggobi/ggally/commit/bfa930d102289d723de2ce9ec528baf42b3b7b40)

Related

error argument "df1" is missing, with no default

A friend of mine is working with the R language and asked me what she did wrong. I can't seem to find the problem. Does someone know what it is?
The code she sent me:
# 10*. Pipe that to a ggplot command and create a histogram with 4 bins.
# Hint: you will NOT write ggplot(df, aes(...)) because the df is already piped in.
# Instead, just write: ggplot(aes(...)) etc.
# Title the histogram, "Distribution of Sunday tips for bills over $20"
# Feel free to style the plot (not required; this would be a typical exploratory
# analysis where only you will see it, so it doesn't have to be perfect).
df %>%
  filter(total_bill > 20 & day == "Sun") %>%
  ggplot(aes(x=total_bill, fill=size)) +
  geom_histogram(bins=4) +
  ggtitle("Distribution of Sunday tips for bills over $20")
The error:
Error in df(.) : argument "df1" is missing, with no default
Type ?df in your console, and you will see that df is a function with the following argument.
df(x, df1, df2, ncp, log = FALSE)
where df1 is an argument. So the error message is saying that R cannot find a value for the df1 argument of the df function.
It seems like in this code example, your friend is trying to put a data frame called df into the filter function from the dplyr package and the ggplot function from the ggplot2 package to create a plot.
So my guess is your friend needs to define df as a data frame. Otherwise, R will think df is a function and keep throwing this error.
By the way, since df is a defined function in R, it is not a good name for a data frame. However, people use df as a name for a data frame all the time. Try a different name, such as dat, for the name of a data frame next time.
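The points above can be checked in base R. This is a small sketch meant to be run in a fresh session where no object called df has been defined; the tips data frame and its values are hypothetical and only illustrate using a different name.

```r
# In a fresh session, df is the F-distribution density function from stats
df_is_function <- is.function(df)
df_args <- names(formals(stats::df))

# Calling it without df1 reproduces the error message from the question
msg <- tryCatch(stats::df(1), error = function(e) conditionMessage(e))

# A data frame under a name that does not collide with a function
# (hypothetical values, for illustration only)
tips <- data.frame(total_bill = c(25.3, 18.2, 31.0),
                   day = c("Sun", "Sat", "Sun"))
sunday_big <- subset(tips, total_bill > 20 & day == "Sun")
```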

How to hot encode/generate dummy columns using sparklyr

I know there are number of questions similar to this here but 1) most of the solutions rely on deprecated functions like ml_create_dummy_variables and 2) other solutions are incomplete.
Is there a function or an approach to easily hot encode a categorical variable into multiple dummy variables in sparklyr?
This post asks for a solution in SparkR; incidentally, a sparklyr solution is given there that only works when the categories are unique in a given column, which renders it pointless.
This solution results in a single dummy instead of a dummy for each category (it grabs the first category). This is also the solution I stumbled onto (based on this post), which does not cut it:
iris_sdf <- copy_to(sc, iris, overwrite = TRUE)
iris_sdf %>%
ft_string_indexer(input_col = "Species", output_col = "species_num") %>%
mutate(cat_num = species_num + 1) %>%
ft_one_hot_encoder("species_num", "species_dum") %>%
ft_vector_assembler(c("species_dum"))
I'm looking for a solution that will take Species from the iris dataset and generate three columns, one for each category in Species (virginica, setosa, and versicolor). In plain R, the fastDummies package has what I need, but I'm left wondering how to achieve similar functionality in sparklyr.
Again, I'll note that ml_create_dummy_variables (suggested by this post) produced the following error:
Error in ml_create_dummy_variables(., "species_num", "species_dum") : Error in ml_create_dummy_variables(., "species_num", "species_dum") :
could not find function "ml_create_dummy_variables"
Note: I'm using sparklyr_1.3.1

Creating a comparative object in R from two dataframes for comparative phylogenetics

I'm trying to read two dataframes into a comparative object so I can plot them using pgls.
I'm not sure what the error being returned means, and how to go about getting rid of it.
My code:
library(ape)
library(geiger)
library(caper)
taxatree <- read.nexus("taxonomyforzeldospecies.nex")
LWEVIYRcombodata <- read.csv("LWEVIYR.csv")
LWEVIYRcombodataPGLS <- data.frame(LWEVIYRcombodata$Sum.of.percentage, OGT=LWEVIYRcombodata$OGT, Species=LWEVIYRcombodata$Species)
comp.dat <- comparative.data(taxatree, LWEVIYRcombodataPGLS, "Species")
Returns error:
> comp.dat <- comparative.data(taxatree, LWEVIYRcombodataPGLS, 'Species')
Error in if (tabulate(phy$edge[, 1])[ntips + 1] > 2) FALSE else TRUE :
missing value where TRUE/FALSE needed
This might come from your data set and your phylogeny having some discrepancies that comparative.data struggles to handle (by the look of the error message).
You can try cleaning both the data set and the tree using dispRity::clean.data:
library(dispRity)
## Reading the data
taxatree <- read.nexus("taxonomyforzeldospecies.nex")
LWEVIYRcombodata <- read.csv("LWEVIYR.csv")
LWEVIYRcombodataPGLS <- data.frame(LWEVIYRcombodata$Sum.of.percentage,OGT=LWEVIYRcombodata$OGT, Species=LWEVIYRcombodata$Species)
## Cleaning the data
cleaned_data <- clean.data(LWEVIYRcombodataPGLS, taxatree)
## Preparing the comparative data object
comp.dat <- comparative.data(cleaned_data$tree, cleaned_data$data, "Species")
However, as @MrFlick suggests, it's hard to know if that solves the problem without a reproducible example.
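Before cleaning anything, it can help to look at the discrepancies directly. The following is a rough sketch using ape only; the toy tree and data frame are hypothetical stand-ins for the files in the question, which are not available here.

```r
library(ape)

# Hypothetical toy tree and data, standing in for the .nex and .csv files
tree <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")
dat <- data.frame(Species = c("A", "B", "C", "E"),
                  OGT = c(30, 37, 45, 60))

# Species present in only one of the two objects
in_tree_only <- setdiff(tree$tip.label, dat$Species)
in_data_only <- setdiff(dat$Species, tree$tip.label)

# Basic sanity check on the tree itself
rooted <- is.rooted(tree)
```

Any names listed in in_tree_only or in_data_only would need to be reconciled (or dropped) before building the comparative data object.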
The error here was that I was using a nexus file. Although ?comparative.data does not specify which phylo objects it accepts, newick trees seem to work fine, whereas nexus files do not.

ts object not recognised in hybridModel of forecastHybrid package

Data is something like this:
df <- tribble(
~y, ~timestamp,
18.74682, 1500256800,
19.00424, 1500260400,
18.86993, 1500264000,
18.74960, 1500267600,
18.99854, 1500271200,
18.85443, 1500274800,
18.78031, 1500278400,
18.97948, 1500282000,
18.86576, 1500285600,
18.55633, 1500289200,
18.79052, 1500292800,
18.74790, 1500296400,
18.62743, 1500300000,
19.04696, 1500303600,
18.97851, 1500307200,
18.70956, 1500310800,
18.92302, 1500314400,
18.91465, 1500318000,
18.61556, 1500321600,
19.03535, 1500325200 )
I'm trying to apply hybridModel to time series data to build an ensemble. Below is my code:
library(tidyquant)
library(forecast)
library(timetk)
library(sweep)
library(forecastHybrid)
df <- mutate(df, timestamp = as_datetime(timestamp))
tk_ts_df <- tk_ts(df, start = 1, freq = 3600, silent = TRUE)
fit <- hybridModel(tk_ts_df)
Fitting the time series object tk_ts_df (a ts object) with hybridModel gives the error: "The time series must be numeric and may not be a matrix or dataframe object."
But the vignette at https://cran.r-project.org/web/packages/forecastHybrid/vignettes/forecastHybrid.html
clearly states: The workhorse function of the package is hybridModel(), a function that combines several component models from the “forecast” package. At a minimum, the user must supply a ts or numeric vector for y
Please suggest what I'm doing wrong.
The "forecastHybrid" requires that the input timeseries is a numeric vector or ts type. While the "timekit" package does return a ts object, it also adds additional attributes that are not in regular ts objects so input checks failed.
See the discussion here and the fixing commit here.
The latest version from Github incorporating the fix can be downloaded with
devtools::install_github("ellisp/forecastHybrid/pkg")
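To see what "a ts or numeric vector" looks like in base R terms, the sketch below builds a plain univariate ts directly from the numeric values and inspects it. It is only an illustration: the frequency of 24 (hourly data with a daily cycle) is an assumption, and whether a plain ts sidesteps the input check in a given forecastHybrid version is not verified here.

```r
# First few observations from the question
y <- c(18.74682, 19.00424, 18.86993, 18.74960, 18.99854, 18.85443)

# A plain univariate ts built from a numeric vector
# (frequency = 24 is an assumed daily cycle for hourly data)
plain_ts <- ts(y, start = 1, frequency = 24)

# For contrast: a ts built from a one-column matrix is still a matrix
matrix_ts <- ts(matrix(y, ncol = 1), start = 1, frequency = 24)

# Properties an input check of this kind would typically look at
checks <- c(is_ts = is.ts(plain_ts),
            is_numeric = is.numeric(plain_ts),
            is_matrix = is.matrix(plain_ts))
plain_attrs <- names(attributes(plain_ts))
```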

In R cannot use AdjustedSharpeRatio() from 'Performance Analytics'

I have some trouble using the function AdjustedSharpeRatio() from the package PerformanceAnalytics. The following code sample, run in R 3.0.0:
library(PerformanceAnalytics)
logrets = array(dim=c(3,2),c(1,2,3,4,5,6))
weights = c(0.4,0.6)
AdjustedSharpeRatio(rowSums(weights*logrets),0.01)
gives the following error:
Error in checkData(R) :
The data cannot be converted into a time series. If you are trying to pass in
names from a data object with one column, you should use the form 'data[rows,
columns, drop = FALSE]'. Rownames should have standard date formats, such as
'1985-03-15'.
Replacing the last line with zoo gives the same error:
AdjustedSharpeRatio(zoo(rowSums(weights*logrets)),0.01)
Am I missing something obvious ?
Hmm... not too sure what you are trying to achieve with the logrets and weights objects there, but if logrets are already in percentages, then maybe something like this:
AdjustedSharpeRatio(xts(rowSums(weights*logrets)/100,Sys.Date()-(c(3:1)*365)), Rf=0.01)
This might work:
a <- rowSums(weights*logrets)
names(a) <- c('1985-03-15', '1985-03-16', '1985-03-17')
AdjustedSharpeRatio(a,0.01)
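One thing worth checking in the question's code, separately from the date issue: weights*logrets recycles the weights down the columns of the matrix, so it does not apply one weight per column. If the intent was a weighted sum per row with one weight per column (an assumption about what the asker wanted), a matrix product does that. The sketch below uses base R only and stops short of calling AdjustedSharpeRatio(); the dates are the arbitrary ones used in the answer above.

```r
logrets <- array(c(1, 2, 3, 4, 5, 6), dim = c(3, 2))
weights <- c(0.4, 0.6)

# Element-wise product: weights are recycled in column-major order,
# so rows (not columns) end up alternating between 0.4 and 0.6
recycled <- rowSums(weights * logrets)

# Matrix product: column 1 is weighted by 0.4 and column 2 by 0.6
by_column <- as.vector(logrets %*% weights)

# Date names in a standard format, as the error message asks for
names(by_column) <- format(as.Date("1985-03-15") + 0:2)
```

Here recycled is 2.8, 3.2, 4.8 while by_column is 2.8, 3.8, 4.8, so the two approaches disagree on the second row.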
