arules - introducing a new measure - R

I am new to R. I am trying to run arules on the Titanic data. I am using the following code:
library(arules)
library(plyr)
rules <- apriori(titanic.raw)
inspect(rules)
rules <- apriori(titanic.raw,
                 parameter = list(minlen = 2, supp = 0.005, conf = 0.8),
                 appearance = list(rhs = c("Survived=No", "Survived=Yes"),
                                   default = "lhs"),
                 control = list(verbose = FALSE))
inspect(rules)
quality(rules)$rec <- support(rules, titanic.raw) / count(rhs)
I am able to get the result. However, I need to introduce one more measure apart from support, confidence and lift, i.e. rec = support / (total number of target instances), which is what the last line attempts.
I think I am going wrong in writing it. Could anyone help and guide me through it? I would really appreciate it.
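A minimal sketch of one way to do this with functions from the arules package, assuming rec is meant to be the rule's support divided by the support of its RHS itemset in the data (use type = "absolute" instead if you want the raw count of target instances):

library(arules)
# Hedged sketch: compute the support of each rule's RHS itemset in the
# data, then divide the rule's support by it. titanic.raw is coerced to
# transactions here, as apriori() does internally.
trans <- as(titanic.raw, "transactions")
rhs_supp <- support(rhs(rules), transactions = trans, type = "relative")
quality(rules)$rec <- quality(rules)$support / rhs_supp
inspect(head(rules))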

Related

RNAseq - plotting log2FoldChange vs baseMean gives weird data points

I am new to processing RNA-seq data and am now practicing reproducing a published figure related to RNA-seq. This is the paper, and Fig 2A is what I'm trying to achieve.
In brief, I downloaded the data with recount3 and subset the samples into the groups that I want (control vs condition 1, control vs condition 2, etc.). Then I ran the following code:
dds_4uM_30min <- DESeqDataSetFromMatrix(countData = ha_4uM_30min_data,
                                        colData = ha_4uM_30min_meta,
                                        design = ~ type)
dds2_4uM_30min <- DESeq(dds_4uM_30min)
res_4uM_30min <- results(dds2_4uM_30min, tidy = FALSE)
(type is the column I made to indicate whether a sample is control or condition 1)
This is the figure I get, which confuses me since it is nowhere near the original figure.
I thought they might do additional processing of the data, but I have no idea what the common or reasonable ways are.
Furthermore, there seem to be data points that form lines (as can be seen in the figure above), which is not seen in the original figure. I am wondering what causes this kind of distribution and how to adjust for it.
Thanks in advance for any opinion or suggestion.
I have been trying to use the function lfcShrink, but the figure still has this weird line.
Any suggestions on how to further process RNA-seq data?
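A hedged sketch of two steps that often remove those discrete "lines" (they typically come from genes with very low counts, whose fold changes can only take a few distinct values). This is not necessarily the paper's processing; the coefficient index below is an assumption, so check resultsNames() first:

library(DESeq2)
# Pre-filter genes with very low total counts before running DESeq();
# the 10-read cutoff is a common rule of thumb, not a fixed requirement.
keep <- rowSums(counts(dds_4uM_30min)) >= 10
dds_4uM_30min <- dds_4uM_30min[keep, ]
dds2_4uM_30min <- DESeq(dds_4uM_30min)
# Shrink log2 fold changes (type = "apeglm" needs the apeglm package);
# inspect resultsNames(dds2_4uM_30min) to pick the right coefficient.
resultsNames(dds2_4uM_30min)
res_shrunk <- lfcShrink(dds2_4uM_30min, coef = 2, type = "apeglm")
plotMA(res_shrunk, ylim = c(-4, 4))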

How to fix unbalanced data in the synthetic control method?

I am currently writing a research project on the effects on voting behaviour of the closure of mines in a given area. For this research I have chosen the synthetic control method. Now, I have run into trouble with the synth package; namely, each time I try to dataprep the data to create the synthetic control unit, I get error messages. The message reads as follows:
"Your panel, as described by unit.variable and time.variable, is unbalanced. Balance it and run again."
I have currently modelled my data after Abadie's dataset used in his study on terrorism in the Basque region. And I ought to note that there is no missing data in my dataset, nor are there outliers.
I have tried to make several changes to my code; however, each time I try, I run into trouble. Moreover, I have tried copying code from others who came up with a solution, but this did not work either. I would be very thankful if someone could help me with my problem.
Some other lovely person has helped me with my previous problem, for which I am very grateful. However, being new to coding, I do not really have any idea as to how to solve my problem.
dataprep_outcomes <- dataprep(foo = dataset[dataset$Year %in% 1948:1986, ],
                              predictors = c("Income", "Distance", "Gini",
                                             "Percentage_voted", "Protest"),
                              dependent = "Percentage_voted",
                              unit.variable = "Municipality_No",
                              time.variable = "Year",
                              treatment.identifier = 1,
                              controls.identifier = 2:14,
                              time.predictors.prior = intersect(1948:1965, dataset$Year),
                              time.optimize.ssr = intersect(1948:1986, dataset$Year),
                              unit.names.variable = "Municipality_ID",
                              time.plot = intersect(1948:1986, dataset$Year))
I would like to run my dataprep. If one has suggestions regarding the manner in which I can alter my data, that would be welcome as well!
Thank you in advance.
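A minimal diagnostic sketch, assuming the column names above: the error means that not every Municipality_No appears exactly once in every Year, so counting unit-year combinations usually reveals the gap.

# Count how often each municipality appears in each year; a balanced
# panel has exactly one row per unit-year combination.
tab <- table(dataset$Municipality_No, dataset$Year)
which(tab != 1, arr.ind = TRUE)  # points at missing or duplicated unit-years

# One way to balance: keep only units observed in every year.
complete_units <- rownames(tab)[rowSums(tab == 1) == ncol(tab)]
dataset <- dataset[dataset$Municipality_No %in% complete_units, ]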

How to check that a user-defined function works in R?

This is probably a very silly question, but how can I check whether a function I have written will work or not?
I'm writing a not-so-simple function involving many other functions and loops, and I was wondering whether there are any ways to check for errors/bugs, or simply to check whether the function will work. Do I just create a simple fake data frame and test the function on it?
As suggested by other users in the comments, I have added the part of the function that I have written. Basically, I have a data frame with good and bad data, and the bad data are marked with flags. I want to write a function that produces plots as usual (with the flagged points) when the user sets flag.option to 1, and removes the flagged points from the plot when the user sets flag.option to 0.
AIR.plot <- function(mydata, flag.option) {
  if (flag.option == 1) {
    par(mfrow = c(2, 1))
    conc <- tapply(mydata$CO2, format(mydata$date, "%Y-%m-%d %T"), mean)
    dates <- seq(mydata$date[1], mydata$date[nrow(mydata)], length.out = length(conc))
    tryCatch(plot(dates, conc,
                  type = "p",
                  col = "blue",
                  xlab = "day",
                  ylab = "CO2"),
             error = function(e) plot.new())
    barplot(mydata$lines, horiz = TRUE, col = c("red", "blue")) # small bar plot on the bottom showing which sample-taking line (red or blue) provided the samples
  } else if (flag.option == 0) {
    # I haven't figured out how to write this part yet, but essentially I want
    # to remove all of the rows with flags on before plotting.
  }
}
Thanks in advance, I'm not an experienced R user yet so please help me.
Before we (meaning, at my workplace) release any code to our production environment we run through a series of testing procedures to make sure our code behaves the way we want it to. It usually involves several people with different perspectives on the code.
Ideally, such verification should start before you write any code. Some questions you should be able to answer are:
What should the code do?
What inputs should it accept? (including type, ranges, etc)
What should the output look like?
How will it handle missing values?
How will it handle NULL values?
How will it handle zero-length values?
If you prepare a list of requirements and write your documentation before you begin writing any code, the probability of success goes up pretty quickly. Naturally, as you begin writing your code, you may find that your requirements need to be adjusted, or the function arguments need to be modified. That's okay, but document those changes when they happen.
While you are writing your function, use a package like assertthat or checkmate to write as many argument checks as you need in your code. Some of the best, most reliable code where I work consists of about 100 lines of argument checks and 3-4 lines of what the code actually is intended to do. It may seem like overkill, but you prevent a lot of problems from bad inputs that you never intended for users to provide.
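For instance, a hedged sketch of what such checks might look like for the AIR.plot() function from the question (the column names are assumptions taken from the code above):

library(assertthat)

AIR.plot <- function(mydata, flag.option) {
  # Argument checks fail fast with readable messages before any plotting.
  assert_that(is.data.frame(mydata))
  assert_that(has_name(mydata, "date"), has_name(mydata, "CO2"))
  assert_that(is.numeric(flag.option))
  assert_that(flag.option %in% c(0, 1))
  # ... plotting code as in the question ...
}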
When you've finished writing your function, you should at this point have a list of requirements and clearly documented expectations of your arguments. This is where you make use of the testthat package.
Write tests that verify all of the requirements you wrote are met.
Write tests that verify that unintended inputs fail in the way you expect.
Write tests that verify you get the output you intended on your test data.
Write tests that cover any edge cases you can think of; a short sketch follows this list.
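A hedged sketch of a couple of such tests with testthat, using a small fake data frame; the expected behaviors are assumptions, and the failure tests rely on the argument checks sketched earlier being in place:

library(testthat)

# Small fake data frame standing in for real measurements.
df <- data.frame(date = Sys.time() + 1:10, CO2 = rnorm(10))

test_that("AIR.plot rejects unintended inputs", {
  expect_error(AIR.plot("not a data frame", 1))
  expect_error(AIR.plot(df, 2))  # flag.option must be 0 or 1
})

test_that("AIR.plot accepts valid input", {
  expect_silent(AIR.plot(df, 0))  # the flag-free branch draws nothing yet
})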
It can take a long time to write all of these tests, but once it is done, any further development is easier to check since anything that violates your existing requirements should fail the test.
That being said, I'm really bad at following this process in my own work. I have the tendency to write code, then document what I did. But the best code I've written has been where I've planned it out conceptually, wrote my documentation, coded, and then tested against my documentation.
As @antoine-sac pointed out in the links, some things cannot be checked programmatically; for example, whether your function terminates.
Looking at it pragmatically, have a look at the packages assertthat and testthat. assertthat will help you insert checks of results "in between", testthat is for writing proper tests. Yes, the usual way of writing tests is creating a small test example including test data.

Plot using arulesViz the output of arulesSequences? Or, a way to coerce an object of class sequencerules into rules? (arulesSequences, R)

Is there a way to use arulesViz with ruleInduction output from arulesSequences? Or is there a way to coerce/cast the sequence rules output (of class sequencerules) to class rules, so I can use arulesViz?
Objective: I am interested in playing with some visualization options reviewed in this paper, particularly the "graph" options (https://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf).
Typically you would use arulesViz on "rules" derived from arules, like so (from the vignette):
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))
plot(x, method = NULL, measure = "support", shading = "lift",
     interactive = FALSE, data = NULL, control = NULL, ...)
But I want to use it on the output of cspade + ruleInduction:
s1 <- cspade(trans, parameter = list(support = 0.001, maxlen = 3, maxgap = 10),
             control = list(verbose = TRUE, numpart = 1))
summary(s1)
s1_df <- as(s1, "data.frame")
r1 <- ruleInduction(s1, confidence = 0.05, control = list(verbose = TRUE))
r1.subset.rule <- subset(r1, rhs(r1) %in% c("9990") & lift > 2 &
                             !lhs(r1) %in% c("300", "301", "412", "4033", "4043"))
plot(r1.subset.rule, method = "graph", control = list(alpha = 1))
Error in as.double(y) :
cannot coerce type 'S4' to vector of type 'double'
Is there a way to do this? I currently get the above error. Note, this is similar to this question: Error in as.double(y) : cannot coerce type 'S4' to vector of type 'double' but the solution proposed there (make sure you have arulesViz loaded) doesn't work/is not the problem.
Thank you for the help!
If you feel that this is not an appropriate question, please leave me feedback/comments -- I tried researching this for many hours before posting here, and am a somewhat new user: Would be happy to hear how this can be improved.
Turns out this was a conceptual misunderstanding on my part. I ended up contacting the original author of the package (thank you for responding! Leaving your name out in case you'd prefer not to be mentioned) and that cured my tunnel vision.
sequencerules and rules, even though they look very similar when you run inspect() on them, are very different classes. The plot command in arulesViz can handle rules, but not sequencerules. While I'm sure I don't understand all the differences, here are a couple:
sequencerules allows repeated elements; rules (most likely) does not
sequencerules could have {A,B} -> {C} and {B,A} -> {D}. Since order doesn't matter for "rules" and arulesViz expects rules, arulesViz likely won't know what to do with this type of input.
Anyway -- I did find another poster on the web who had a similar question, so I'm posting my understanding here hoping it'll help someone out there.
As I said in my question, if you feel my answer (and/or question) should be improved, please leave me feedback in the form of comments! Much appreciated.
It may be pointless, but as I am gaining an understanding of arulesSequences, I think you can improve your representation.
(A,B) and (B,A) are the same itemset, and {(A),(B)} and {(B),(A)} are two different sequences. In short: no order within an itemset, but order matters in a sequence. So {(A,B),(A),(C,D)} is the same sequence as {(B,A),(A),(D,C)} but differs from {(A),(A,B),(C,D)}.
I think this is why arulesViz, as you said, wouldn't know what to do. Thank you for your question, which helped me understand those packages.
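For anyone who still wants a graph, a hedged workaround sketch: export the sequence rules to a data.frame and draw the LHS -> RHS graph with igraph directly. The " => " separator in the rule column is an assumption based on how these rules print, so check the data.frame first:

library(igraph)

# Split each rule string into its LHS and RHS and build an edge list.
r1_df <- as(r1.subset.rule, "data.frame")
parts <- strsplit(as.character(r1_df$rule), " => ", fixed = TRUE)
edges <- data.frame(from = sapply(parts, `[`, 1),
                    to   = sapply(parts, `[`, 2))

# Plot a simple directed graph, scaling edge width by lift.
g <- graph_from_data_frame(edges)
plot(g, edge.width = 1 + r1_df$lift, edge.arrow.size = 0.4)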

How to assign new text to the built model (text mining)

Yesterday I found good R code for emotion classification and took some of it:
happy = readLines("./happy.txt")
sad = readLines("./sad.txt")
happy_test = readLines("./happy_test.txt")
sad_test = readLines("./sad_test.txt")

tweet = c(happy, sad)
tweet_test = c(happy_test, sad_test)
tweet_all = c(tweet, tweet_test)

sentiment = c(rep("happy", length(happy)),
              rep("sad", length(sad)))
sentiment_test = c(rep("happy", length(happy_test)),
                   rep("sad", length(sad_test)))
sentiment_all = as.factor(c(sentiment, sentiment_test))

library(RTextTools)
mat = create_matrix(tweet_all, language = "english",
                    removeStopwords = FALSE, removeNumbers = TRUE,
                    stemWords = FALSE, weighting = tm::weightTfIdf)
container = create_container(mat, as.numeric(sentiment_all),
                             trainSize = 1:160, testSize = 161:180, virgin = FALSE)
models = train_models(container, algorithms = c("MAXENT",
                                                "SVM",
                                                # "GLMNET", "BOOSTING",
                                                "SLDA", "BAGGING",
                                                "RF", # "NNET",
                                                "TREE"))
# test the models
results = classify_models(container, models)
table(as.numeric(sentiment_all[161:180]), results[, "FORESTS_LABEL"])
All is good, but I have one question. Here we work with data where we ourselves tell the machine which texts are sad and which are happy. If I have new documents with no indication of what is sad or happy, or what is positive or negative (suppose one of these documents is loaded with n = read.csv("C:/1/ttt.csv")), how can the built model decide which phrases are negative and which are positive?
Well, what was the whole purpose of building a model to detect what is sad and what is happy? What is it you want to achieve? And this does not look like an SO question/answer.
So you are using supervised learning on labeled data (you already know which is sad or happy) to learn what defines those classes, so that later on you can use the models you built to predict new content where you do not have the label.
So, any transformations done to the data for training you have to do for the new data coming in, and you ask your model to predict (evaluate) based on this new input data. Then you use the prediction as the result. This does not change your model; it just evaluates it on new data.
Another scenario is that you come in with new labeled data and want to update your model; retraining on the new data might yield new models with more features.
In your case you should look at classify_model or classify_models functions in that package.
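A hedged sketch of that workflow with RTextTools, continuing from the code above. The 'text' column name is an assumption about your CSV; the key points are reusing the training vocabulary via originalMatrix and setting virgin = TRUE because the new labels are unknown:

library(RTextTools)

# Score new, unlabeled documents with the already-trained models.
new_docs <- as.character(read.csv("C:/1/ttt.csv")$text)  # assumed 'text' column
new_mat <- create_matrix(new_docs, originalMatrix = mat,
                         language = "english",
                         removeStopwords = FALSE, removeNumbers = TRUE,
                         stemWords = FALSE, weighting = tm::weightTfIdf)
# Labels are unknown, so pass placeholder zeros and virgin = TRUE.
new_container <- create_container(new_mat, labels = rep(0, length(new_docs)),
                                  testSize = 1:length(new_docs), virgin = TRUE)
new_results <- classify_models(new_container, models)
head(new_results)  # predicted labels and probabilities per algorithm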
