I am using the R package arules for rule mining.
First I tried simply to look at the rules:
#Get the rules
rules <- apriori(trans, parameter = list(supp=0.05, conf = 0.05)) #minlen = 2
rules <- sort(rules, by="confidence", decreasing=TRUE)
However, the lhs column is empty:
inspect(rules)
lhs rhs support confidence lift
3 {} => {product=CM,DD,OS} 0.501 0.501 1
2 {} => {product=CM,DD} 0.223 0.223 1
1 {} => {product=CM} 0.068 0.068 1
So I tried to specifically ask for the lhs column:
rules <- apriori(data=trans, parameter=list(supp=0.05, conf = 0.05),
appearance = list(default="rhs", lhs="product=CM,DD,OS"),
control = list(verbose=F))
rules <- sort(rules, by="confidence", decreasing=TRUE)
inspect(rules)
Unfortunately, the output remains the same.
One reason might be that most clients have about 4 products, so there might not be any rules, but I find that unlikely.
So the problem was in the format of the data. If I first dump the data into a .csv file and then use read.transactions, it works correctly.
trans = read.transactions("C:/.../basket_analysis_data.csv", format="single",sep = ";", cols = c(2,1))
Previously I was using a direct ODBC connection, putting the data into a data frame, and then converting it like this:
trans <- data.frame(product = as.factor(qry$product_owned))
trans <- as(trans, "transactions")
However, using a .csv file as an intermediate step is annoying. If anyone can show me how to make it work without the .csv, I would appreciate it.
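One way to skip the .csv step might be to split the item column by the transaction ID and coerce the resulting list to transactions. This is only a sketch: `qry` below is a made-up stand-in for the ODBC result, and `client_id` is a hypothetical column name (the original only names `product_owned`). It assumes one row per client/product pair, which is what format="single" expects.

```r
library(arules)

# Hypothetical stand-in for the ODBC query result:
# one row per (client, product) pair.
qry <- data.frame(
  client_id     = c(1, 1, 1, 2, 2, 3),
  product_owned = c("CM", "DD", "OS", "CM", "DD", "CM"),
  stringsAsFactors = FALSE
)

# Group the products by client, so each list element is one basket.
baskets <- split(qry$product_owned, qry$client_id)

# Drop repeated products within a basket before coercing.
baskets <- lapply(baskets, unique)

# A list of character vectors can be coerced directly to transactions.
trans <- as(baskets, "transactions")
```

The one-column data frame in the question turns every row into its own single-item transaction, which would explain why only rules with an empty lhs come out.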
Is there a way to know if a table was renamed in the process of unnesting? I want to know if there is something where I can intercept any messages that come through with New names: and give more context about solutions
# min reprex
library(tidyverse)
f <- function() {
tibble(
x = 1:2,
y = 2:1,
z = tibble(x = 1)
) |>
unnest_wider(z, names_repair = "unique")
}
f()
New names:
• `x` -> `x...1`
• `x` -> `x...3`
x...1 y x...3
----- ----- -----
1 2 1
2 1 1
More context:
The message stems from
vctrs::vec_as_names(c("x", "x"), repair = "universal")
I see information about withCallingHandlers(), but I am not sure that is the right route. I thought there was a way for errors/messages to have classes that you can intercept, but I can't remember what I read.
Something in testthat::expect_message() may help. I thought there would be a has_message() function out there.
There is a lot of tidy evaluation, and comparing names before and after might be tricky. I could look for the names with the regex "\\.+\\d+$", but I am not sure that is robust enough, since the data could already have fields with that syntax.
Thank you!
Taking inspiration from hadley's answer on the question that @ritchie-sacramento linked, you should check out the evaluate package.
> eval_res <- evaluate::evaluate("f()")
> eval_res[[2]]$message
[1] "New names:\n* x -> x...1\n* x -> x...3\n"
This will require more testing to see what happens to the data structure when there are errors, warnings, or even multiple messages. But this seems like the right track.
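The withCallingHandlers() route mentioned in the question can be sketched in base R as well. The helper name `with_messages` is made up for illustration, and `g()` is a stand-in that emits a similar message with base message(); the sketch assumes the name-repair notice is signalled as an ordinary message condition, which the evaluate output above suggests.

```r
# Hypothetical helper: evaluate an expression, collect any messages,
# and return them alongside the value instead of printing them.
with_messages <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      msgs <<- c(msgs, conditionMessage(m))
      invokeRestart("muffleMessage")  # stop the message from printing
    }
  )
  list(value = value, messages = msgs)
}

# Stand-in for f(): emits a "New names:" message like the one in the question.
g <- function() {
  message("New names:\n* `x` -> `x...1`\n* `x` -> `x...3`")
  data.frame(x...1 = 1:2, y = 2:1, x...3 = 1)
}

res <- with_messages(g())
renamed <- any(grepl("^New names:", res$messages))
if (renamed) {
  message("Some columns were renamed; consider renaming them before unnesting.")
}
```

Because the handler sees every message, several messages in one call end up as separate elements of `res$messages`.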
I'm trying to write an xlsx file from a list of data frames that I created, but I'm getting an error due to missing data (I couldn't download it). I just want to write the xlsx file despite the missing data. Any help is appreciated.
For replication of the problem:
library(quantmod)
name_of_symbols <- c("AKER","YECO","SNOA")
research_dates <- c("2018-11-19","2018-11-19","2018-11-14")
my_symbols_df <- lapply(name_of_symbols, function(x) tryCatch(getSymbols(x, auto.assign = FALSE),error = function(e) { }))
my_stocks_OHLCV <- list()
for (i in 1:3) {
trade_date <- paste(as.Date(research_dates[i]))
OHLCV_data <- my_symbols_df[[i]][trade_date]
my_stocks_OHLCV[[i]] <- data.frame(OHLCV_data)
}
And you can see the missing data down here in my_stocks_OHLCV[[2]] and the write.xlsx error I'm getting:
print(my_stocks_OHLCV)
[[1]]
AKER.Open AKER.High AKER.Low AKER.Close AKER.Volume AKER.Adjusted
2018-11-19 2.67 3.2 1.56 1.75 15385800 1.75
[[2]]
data frame with 0 columns and 0 rows
[[3]]
SNOA.Open SNOA.High SNOA.Low SNOA.Close SNOA.Volume SNOA.Adjusted
2018-11-14 1.1 1.14 1.01 1.1 107900 1.1
write.xlsx(my_stocks_OHLCV, "C:/Users/MICRO/Downloads/Datasets_stocks/dux_OHLCV.xlsx")
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE,:arguments imply differing number of rows: 1, 0
How do I run write.xlsx even though I have this missing data?
The main question you need to ask is, what do you want instead?
As you are working with stock data, the best idea is to remove any stock for which you have no data. Something like this should work:
my_stocks_OHLCV[sapply(my_stocks_OHLCV, nrow) > 0]
If you want a row full of NA or 0 instead, replace each empty element of the list with a one-row data frame of NAs (or of 0s).
Something like this:
condition <- sapply(my_stocks_OHLCV, nrow) == 0
my_stocks_OHLCV[condition] <- list(data.frame(t(rep(NA, 6))))
Here the condition variable marks the elements of the list for which you don't have any data. We then replace those with a one-row, six-column data frame of NA (swap NA for 0 if you prefer). However, I can't think of a reason to do this.
A variation on this, and one you could handle inside your for loop, is to check whether you have data and, if you don't, fill in NAs there. You could also give the row the correct headers, since you know which stock it relates to.
Hope this helps.
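Both options can be sketched in base R without downloading anything. The list below is a made-up stand-in for my_stocks_OHLCV that reuses the values printed in the question, and `na_row` is a hypothetical helper that builds a one-row NA data frame with headers in the quantmod style.

```r
cols <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
name_of_symbols <- c("AKER", "YECO", "SNOA")

# Stand-in for the list built in the question; the second element is empty.
my_stocks_OHLCV <- list(
  setNames(data.frame(2.67, 3.2, 1.56, 1.75, 15385800, 1.75),
           paste0("AKER.", cols)),
  data.frame(),
  setNames(data.frame(1.1, 1.14, 1.01, 1.1, 107900, 1.1),
           paste0("SNOA.", cols))
)

has_rows <- vapply(my_stocks_OHLCV, nrow, integer(1)) > 0

# Option 1: drop the stocks with no data.
non_empty <- my_stocks_OHLCV[has_rows]

# Option 2: keep every stock, filling the gaps with a labelled NA row.
na_row <- function(symbol) {
  as.data.frame(matrix(NA_real_, nrow = 1, ncol = length(cols),
                       dimnames = list(NULL, paste0(symbol, ".", cols))))
}
padded <- my_stocks_OHLCV
padded[!has_rows] <- lapply(name_of_symbols[!has_rows], na_row)
```

Either `non_empty` or `padded` could then be passed to write.xlsx in place of the original list.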
I am a new user of R and am trying to use the mRMRe R package (mRMR is one of the well-known feature selection approaches) to obtain a feature subset from a feature set. Please excuse me if my question is simple; I really want to know how I can fix an error. The details are below.
Suppose, I have a csv file (gene.csv) having feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates positive class and '-1' stands for negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get the best feature subset of 2 attributes (out of the 6 attributes above) and wrote the following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I get the following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, the data in each column of the csv file are real numbers. So how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7, feature_count = 2, solution_count = 1), as my target class variable is named "[Output]" in the gene.csv file.
I would much appreciate it if anyone could help me obtain the best feature subset from the gene.csv file using the mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column which is probably of class integer. You can check that using class(df[[7]]).
To convert it to numeric, as required by the error message, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.
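The class check and conversion can be sketched on a small data frame in base R. The frame below is a made-up stand-in for the result of read.csv (column names simplified for illustration), and looking the target up by name with grep is just one possible way to avoid hard-coding the index.

```r
# Stand-in for read.csv("gene.csv"): -1/1 labels are read as integer.
df <- data.frame(
  G1     = c(11.688312, 12.538226, 10.791367),
  G2     = c(0.974026, 1.223242, 0.719424),
  Output = c(-1L, 1L, -1L)
)

# Inspect the column classes; "integer" is what the error complains about.
sapply(df, class)

# Convert every integer column to numeric (double) in one step.
int_cols <- vapply(df, is.integer, logical(1))
df[int_cols] <- lapply(df[int_cols], as.numeric)

# Locate the target column by name rather than by position.
target_idx <- grep("Output", names(df))
```

`target_idx` could then be passed as target_indices, assuming exactly one column name contains "Output".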
I am using the dataset "adult".
http://archive.ics.uci.edu/ml/datasets/Adult
I have retrieved frequent rules using apriori and sorted them by lift.
library(arules)
trans = read.transactions("adult.data", format = "basket", sep = ",", rm.duplicates = TRUE)
rules <- apriori(trans)
rules.lift <- sort(rules, decreasing = TRUE, by="lift")
When I execute
inspect(head(rules.lift,100))
I obtain the following:
lhs rhs support confidence lift
1 { 13,
Male,
United-States} => { Bachelors} 0.1024507 0.9976077 6.066125
2 { 0,
13,
Male,
United-States} => { Bachelors} 0.1024507 0.9976077 6.066125
ETC
For example, in the rule:
{ 0,
13,
Male,
United-States} => { Bachelors}
How can I know which attribute that 0 and that 13 are? I have looked at the description of the data set and to the data itself so I guess that 13 is the education-num and 0 is the capital-loss but sometimes two or more attributes can have the same ranges so I would not know how to distinguish them.
>class(rules.lift)
[1] "rules"
attr(,"package")
[1] "arules"
I've read in "How could we know the ColumnName/attribute of items generated in Rules" that the problem is that I haven't preprocessed the data. So, how can I do that?
Thank you very much!
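One possible form of that preprocessing is to load the data as a data frame with named columns, turn every column into a factor, and coerce it to transactions, so that each item is labelled attribute=value. The sketch below uses a tiny made-up data frame instead of adult.data, and the column names are chosen for illustration; since adult.data has no header row, the names would have to be supplied when reading it (for example via col.names in read.csv).

```r
library(arules)

# Made-up rows standing in for a few columns of the adult data.
adult_small <- data.frame(
  education_num = c(13, 13, 9, 13),
  capital_loss  = c(0, 0, 0, 0),
  sex           = c("Male", "Male", "Female", "Male"),
  education     = c("Bachelors", "Bachelors", "HS-grad", "Bachelors")
)

# Every column must be a factor (or logical) before coercion.
# Truly continuous columns such as age would normally be binned first,
# for example with cut().
adult_small[] <- lapply(adult_small, factor)

trans <- as(adult_small, "transactions")

# Items now carry the attribute name, e.g. "education_num=13".
itemLabels(trans)
```

Rules mined from `trans` would then show which attribute each 0 or 13 belongs to.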
I would like to mine specific rhs rules. There is an example in the documentation which demonstrates that this is possible, but only for a specific case (as we see below). First, a data set to illustrate my problem:
input <- matrix( c( rep(10001,6) , rep(10002,3) , rep(10003,3), 100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,rep('a',6),rep('b',6)), ncol=3)
colnames(input) <- c(letters[1:3])
input <- as.data.frame(input)
Now I can create rules:
r <- apriori(input)
To see the rules:
inspect(r)
I would like to only mine rules that have b=... on the rhs. For specific values this can be done by adding:
appearance = list(rhs = c("b=100001", "b=100002"),default="lhs")
to the apriori command. I will also have to adjust the confidence if I want to find them, of course. The problem lies in the number of elements in column b. I can manually type all the elements in the "b=....." format in this example, but I can't in my own data.
I tried to get the values of b using unique() and then giving that to the rhs, but it generates an error because I give values like "100001" "100002" instead of "b=100001" "b=100002".
Is there a way to only get rhs rules from a specific column?
If not, is there an easy way to generate 'want' from 'current'?
current <- c("100001", "100002", "100003", "100004", "100005", "100006", "100007", "100008")
want <- c("b=100001", "b=100002", "b=100003", "b=100004", "b=100005", "b=100006", "b=100007", "b=100008")
Somewhat related is this question: Creating specific rules with arules in r
But that has the same problem for me, only in a different way.
You can use subset:
r <- apriori(input, parameter = list(support = 0.1, confidence = 0.1))
inspect( subset( r, subset = rhs %pin% "b=" ) )
# lhs rhs support confidence lift
# 1 {} => {b=100002} 0.2500000 0.2500000 1.000000
# 2 {} => {b=100003} 0.2500000 0.2500000 1.000000
# 3 {c=b} => {b=100002} 0.1666667 0.3333333 1.333333
# 4 {c=b} => {b=100003} 0.1666667 0.3333333 1.333333
For your second question, you can use paste:
paste0( "b=", current )
# [1] "b=100001" "b=100002" "b=100003" "b=100004" "b=100005" "b=100006" "b=100007"
# [8] "b=100008"
The arules documentation now has an example that does exactly what you want:
trans <- as(input, "transactions")
bItems <- grep("^b=", itemLabels(trans), value = TRUE)
rules <- apriori(trans, parameter = list(support = 0.1, confidence = 0.1),
appearance = list(rhs = bItems))
I haven't actually tested this with your example code. The arules documentation example uses a transactions object rather than a data.frame, which is why the data is converted first; grep-ing the item labels should then work.
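Putting the pieces together with the example data from the question, a self-contained sketch could look like the following. It converts the columns to factors explicitly (an assumption, in case they arrive as character) and sets default = "lhs" as in the question, so that every non-b item is restricted to the left-hand side.

```r
library(arules)

# Example data from the question.
input <- matrix(c(rep(10001, 6), rep(10002, 3), rep(10003, 3),
                  100001, 100002, 100003, 100004, 100005, 100006,
                  100002, 100003, 100007, 100002, 100003, 100008,
                  rep("a", 6), rep("b", 6)), ncol = 3)
colnames(input) <- letters[1:3]
input <- as.data.frame(input)

# Make sure every column is a factor before coercing to transactions.
input[] <- lapply(input, factor)
trans <- as(input, "transactions")

# Collect every item label that belongs to column b.
bItems <- grep("^b=", itemLabels(trans), value = TRUE)

# Only b items may appear on the rhs; everything else goes to the lhs.
rules <- apriori(trans,
                 parameter  = list(support = 0.1, confidence = 0.1),
                 appearance = list(rhs = bItems, default = "lhs"),
                 control    = list(verbose = FALSE))
inspect(rules)
```

Unlike subset(), this restricts the search itself, which may matter on larger data.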