subsetting data.cube inside custom function - r

I am trying to make a function of my own to subset a data.cube in R, and format the result automatically for some predefined plots I aim to build.
This is my function.
require(data.table)
require(data.cube)
secciona <- function(cubo = NULL,
fecha_valor = list(),
loc_valor = list(),
prod_valor = list(),
drop = FALSE){
cubo[fecha_valor, loc_valor, prod_valor, drop = drop]
## The line above will really be an asignment of type y <- format(cubo[...drop])
## Rest of code which will end up plotting the subset of the function
}
The thing is I keep on getting the error: Error in eval(expr, envir, enclos) : object 'fecha_valor' not found
What is most strange for me, is that on the console everything works fine, but not when defined inside the subsetting function of mine.
In console:
> dc[list(as.Date("2013/01/01"))]
> dc[list(as.Date("2013/01/01")),]
> dc[list(as.Date("2013/01/01")),,]
> dc[list(as.Date("2013/01/01")),list(),list()]
all give as result:
<data.cube>
fact:
5627 rows x 2 dimensions x 1 measures (0.32 MB)
dimensions:
localizacion : 4 entities x 3 levels (0.01 MB)
producto : 153994 entities x 3 levels (21.29 MB)
total size: 21.61 MB
But whenever I try
secciona(dc)
secciona(dc, fecha_valor = list(as.Date("2013/01/01")))
secciona(dc, fecha_valor = list())
I always get the error above mentioned.
Any ideas why this is happening? should I proceed in else way for my approach of editing the subset for plotting?

This is the standard issue that R users will face when dealing with non-standard evaluation. This is a consequence of Computing on the language R language feature.
[.data.cube function expects to be used in interactive way, that extends the flexibility of the arguments passed to it, but gives some restrictions. In that aspect it is similar to [.data.table when passing expressions from wrapper function to [ subset operator. I've added dummy example to make it reproducible.
I see you are already using data.cube-oop branch, so just to clarify for other readers. data.cube-oop branch is 92 commits ahead of master branch, to install use the following.
install.packages("data.cube", repos = paste0("https://", c(
"jangorecki.gitlab.io/data.cube",
"Rdatatable.github.io/data.table",
"cran.rstudio.com"
)))
library(data.cube)
set.seed(1)
ar = array(rnorm(8,10,5), rep(2,3),
dimnames = list(color = c("green","red"),
year = c("2014","2015"),
country = c("IN","UK"))) # sorted
dc = as.data.cube(ar)
f = function(color=list(), year=list(), country=list(), drop=FALSE){
expr = substitute(
dc[color=.color, year=.year, country=.country, drop=.drop],
list(.color=color, .year=year, .country=country, .drop=drop)
)
eval(expr)
}
f(year=list(c("2014","2015")), country="UK")
#<data.cube>
#fact:
# 4 rows x 3 dimensions x 1 measures (0.00 MB)
#dimensions:
# color : 2 entities x 1 levels (0.00 MB)
# year : 2 entities x 1 levels (0.00 MB)
# country : 1 entities x 1 levels (0.00 MB)
#total size: 0.01 MB
You can track the expression just by putting print(expr) before/instead eval(expr).
Read more about non-standard evaluation:
- R Language Definition: Computing on the language
- Advanced R: Non-standard evaluation
- manual of substitute function
And some related SO questions:
- Passing on non-standard evaluation arguments to the subset function
- In R, why is [ better than subset?

Related

Changing Default dots argument in R

This question is related to this question and this question.
I need to assign a default to the ... argument in a function. I have successfully been able to to use the default package to accomplish this for specific arguments. For instance, lets say I want to allow toJSON from the jsonlite package to show more than four digits. The default is 4, but I want to show 10.
library(jsonlite)
library(default)
df <- data.frame(x = 2:5,
y = 2:5 / pi)
df
#> x y
#> 1 2 0.6366198
#> 2 3 0.9549297
#> 3 4 1.2732395
#> 4 5 1.5915494
# show as JSON - defaults to four digits
toJSON(df)
#> [{"x":2,"y":0.6366},{"x":3,"y":0.9549},{"x":4,"y":1.2732},{"x":5,"y":1.5915}]
# use default pacakge to change to 10
default(toJSON) <- list(digits = 10)
toJSON(df)
#> [{"x":2,"y":0.63661977237},{"x":3,"y":0.95492965855},{"x":4,"y":1.2732395447},{"x":5,"y":1.5915494309}]
There is another function called stream_out which uses toJSON but only uses the digits argument in ....
> stream_out(df)
{"x":2,"y":0.63662}
{"x":3,"y":0.95493}
{"x":4,"y":1.27324}
{"x":5,"y":1.59155}
Complete! Processed total of 4 rows.
>
> stream_out(df, digits = 10)
{"x":2,"y":0.63661977237}
{"x":3,"y":0.95492965855}
{"x":4,"y":1.2732395447}
{"x":5,"y":1.5915494309}
Complete! Processed total of 4 rows.
So even though I have changed the digits in toJSON, it isn't passed to the ... in stream_out. I cannot change this in the same manner as with toJSON.
> default(stream_out) <- list(digits = 10)
Error: 'digits' is not an argument of this function
This is not strictly a jsonlite question, but that is my use case here. I need to somehow change the ... argument of the stream_out function so that any time it is used, 10 digits are returned, rather than 4. However, any examples that show how to change defaults of ... arguments could probably be used to get to what I need.
Thanks!

adehabitatHR locoh.k orphan holes

I am trying to optimise the k parameter using AdehabitatHR LoCoH.k.area and it stops running when the topology is such that it can't produce a polygon. Message is:
rgeos_PolyCreateComment: orphaned hole, cannot find containing polygon
for hole at index 12.
I have done a number of successful single runs using LoCoH.k with only a few not running due to orphan holes.
Is it possible to keep LoCoH.k.area looping through the k values specified in the vector even if the one prior produces an orphan hole?
Thanks, Janine
You cant wrap LoCoH.k.area function in tryCatch. E.g. the function with krange = 5:9 argument throws:
Error in rgeos::createPolygonsComment(oobj) :
rgeos_PolyCreateComment: orphaned hole, cannot find containing polygon
for hole at index 6
Please see the code below:
library(adehabitatHR)
data(puechabonsp)
locs <- puechabonsp$relocs
## The call below throws an error
## LoCoH.k.area(locs[, 1], krange = 5:9)
pdf()
y <- sapply(5:9, function(x) tryCatch(
expr = cbind(LoCoH.k.area(locs[, 1], krange = x), k = x),
error = function(e){},
finally = NULL))
dev.off()
do.call(rbind, y)
Output:
Brock Calou Chou Jean k
1 25.21552 38.61693 83.37389 80.97771 8
2 27.37161 39.10789 86.45349 83.44156 9

Selecting features from a feature set using mRMRe package

I am a new user of R and trying to use mRMRe R package (mRMR is one of the good and well known feature selection approaches) to obtain feature subset from a feature set. Please excuse if my question is simple as I really want to know how I can fix an error. Below is the detail.
Suppose, I have a csv file (gene.csv) having feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates positive class and '-1' stands for negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get best feature subset of 2 attributes (out of above 6 attributes) and wrote following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I am getting following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, data in each column of the csv file are real number.So, how can I change the R code to fix this problem? Also, I am not sure what should be the value of target_indices in the statement mRMR.ensemble(data = f_data, target_indices = 7,feature_count = 2, solution_count = 1) as my target class variable name is "[Output]" in the gene.csv file.
I will appreciate much if anyone can help me to obtain the best feature subset based on the gene.csv file using mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column which is probably of class integer. You can check that using class(df[[7]]).
To convert it to numeric as required by the warning, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.

Sparklyr - Decimal precision 8 exceeds max precision 7

I'm trying to copy a big database into Spark using spark_read_csv, but I'm getting the following error as output:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 16.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235):
java.lang.IllegalArgumentException: requirement failed: Decimal
precision 8 exceeds max precision 7
data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE)
It's a big data set, about 5.8 million of records, with my dataset I have data of types Int, num and chr.
I think you have a couple options depending on the spark version that you're using
Spark >=1.6.1
from here: https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html
it seems, you can specifically specify your schema to force it to use doubles
csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
diamondsLoadWithSchema<- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
source = "csv", header="true", schema = csvSchema)
Spark < 1.6.1
consider test.csv
1,a,4.1234567890
2,b,9.0987654321
you can easily make this more efficient, but I think you get the gist
linesplit <- function(x){
tmp <- strsplit(x,",")
return ( tmp)
}
lineconvert <- function(x){
arow <- x[[1]]
converted <- list(as.integer(arow[1]), as.character(arow[2]),as.double(arow[3]))
return (converted)
}
rdd <- SparkR:::textFile(sc,'/path/to/test.csv')
lnspl <- SparkR:::map(rdd, linesplit)
ll2 <- SparkR:::map(lnspl,lineconvert)
ddf <- createDataFrame(sqlContext,ll2)
head(ddf)
_1 _2 _3
1 1 a 4.1234567890
2 2 b 9.0987654321
NOTE: the SparkR::: methods are private for a reason, the docs say 'be careful when you use this'

R-How dos the pos() function work for parts-of-speech tagging

I'm new to R and confused with the way the pos() function works. Here's why:
Example:
library(qdap)
s1<-c("Hello World")
pos(s1)
This produces the correct output saying the word count
wrd.cnt - 2
NN -1(50%)
UH-1(50%)
whereas the following to operations throws errors:
s2<-"Hello"
pos(s2)
Error in apply(pro, 2, paster, digits = digits, symbol = s.ymb, override = override) :
dim(X) must have a positive length
s3<-c("Hello Hello")
pos(s3)
Error in apply(pro, 2, paster, digits = digits, symbol = s.ymb, override = override) :
dim(X) must have a positive length
I'm not able to understand why this is caused.
You have found a bug in this version of qdap cause by not using drop = FALSE while indexing.
The dev version will behave as expected. You can download it easily with this code:
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/qdap")
The following has been added to the NEWS file as well:
pos threw an error if only one word was passed to text.var. Fix:
drop = FALSE has been added to data frame indexing. Caught by
StackOverflow user G_1991 R-How dos the pos() function work for parts-of-speech tagging.
Here's the updated output:
library(qdap)
s1<-c("Hello World")
pos(s1)
## wrd.cnt NN UH
## 1 2 1(50%) 1(50%)
s2<-"Hello"
pos(s2)
## wrd.cnt UH
## 1 1 1(100%)

Resources