How to fix unbalanced data in the synthetic control method? - r

I am currently writing a research project on how voting behaviour changes after the closure of mines in a given area. For this research I have chosen the 'synthetic control' method. I have run into trouble with the Synth package: each time I call dataprep to create the synthetic control unit, I get an error message. The message reads:
"Your panel, as described by unit.variable and time.variable, is unbalanced. Balance it and run again."
I have modelled my data after Abadie's dataset from his study on terrorism in the Basque region. I should note that there is no missing data in my dataset, nor are there outliers.
I have tried making several changes to my code, but each attempt runs into the same trouble. I have also tried copying code from others who came up with a solution, but this did not work either. I would be very thankful if someone could help me with my problem.
Some other lovely person has helped me with my previous problem, for which I am very grateful. However, being new to coding, I do not really have any idea as to how to solve my problem.
dataprep_outcomes <- dataprep(foo = dataset[dataset$Year %in% 1948:1986, ],
    predictors = c("Income", "Distance", "Gini", "Percentage_voted", "Protest"),
    dependent = "Percentage_voted",
    unit.variable = "Municipality_No",
    time.variable = "Year",
    treatment.identifier = 1,
    controls.identifier = 2:14,
    time.predictors.prior = intersect(1948:1965, dataset$Year),
    time.optimize.ssr = intersect(1948:1986, dataset$Year),
    unit.names.variable = "Municipality_ID",
    time.plot = intersect(1948:1986, dataset$Year))
I would like to run my dataprep. If one has suggestions regarding the manner in which I can alter my data, that would be welcome as well!
Thank you in advance.
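One way to see what the "unbalanced" message refers to is to count the rows for every unit-year combination. The sketch below uses a small made-up data frame (the values are hypothetical; only the column names Municipality_No and Year come from the code above) and lists the combinations that are missing or duplicated. It does not reproduce the exact check that dataprep performs internally.

```r
# Toy panel with the column names from the question; unit 2 has no row for 1950.
dataset <- data.frame(
  Municipality_No = c(1, 1, 1, 2, 2, 3, 3, 3),
  Year            = c(1948, 1949, 1950, 1948, 1949, 1948, 1949, 1950)
)

# Cross-tabulate units against years. In a balanced panel every
# unit-year cell contains exactly one row.
counts <- table(dataset$Municipality_No, dataset$Year)

# Collect the cells that are missing (0 rows) or duplicated (> 1 row).
problems <- as.data.frame(counts, stringsAsFactors = FALSE)
names(problems) <- c("Municipality_No", "Year", "n_rows")
problems <- problems[problems$n_rows != 1, ]
print(problems)

is_balanced <- all(counts == 1)
```

Running the same cross-tabulation on the real data, after the same Year subset that is passed to foo, should point to the municipalities and years that need attention.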

Related

ezANOVA not providing Greenhouse-Geisser corrected df though sphericity is violated

I've noticed that sometimes when I use ezANOVA from package ez I get columns stating the Greenhouse-Geisser corrected df values, and other times the tables with the sphericity corrections do not include the new df values, even though there are violations. For example, I just ran a 2-way repeated measures ANOVA, and my table output looks like this:
I wish I could give repeatable data, but I genuinely don't know why it does or doesn't do it sometimes. Does anyone else know? I'll show my code below in case there's something I'm missing regarding the actual ezANOVA function. I could do the Df values by hand, but haven't found a good resource online to show me how to correct them using the epsilon value and I unfortunately was never taught that.
ez::ezANOVA(data = IntA2, wid = rat, dv = numReinforcers, within = .(component, minute))
Edit: A very nice person on the internet has explained to me how to calculate the new df values by hand (multiplying the old dfs by the GG epsilon, in case anyone else was wondering!) but I'm still unclear on why the function sometimes does it for you and other times does not.
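The hand calculation described in the edit can be sketched as follows. All numbers are hypothetical and chosen only for illustration; with real output, the epsilon would come from the sphericity corrections table and the uncorrected dfs and F from the ANOVA table.

```r
# Hypothetical values, for illustration only.
gg_epsilon <- 0.6   # Greenhouse-Geisser epsilon
df_effect  <- 4     # uncorrected numerator df
df_error   <- 36    # uncorrected denominator df
f_value    <- 3.2   # F statistic (not changed by the correction)

# Corrected dfs: multiply both the numerator and denominator df by epsilon.
df_effect_gg <- gg_epsilon * df_effect
df_error_gg  <- gg_epsilon * df_error

# Corrected p-value: same F, evaluated against the corrected dfs.
p_gg <- pf(f_value, df_effect_gg, df_error_gg, lower.tail = FALSE)

cat("corrected df:", df_effect_gg, "and", df_error_gg, "\n")
cat("corrected p:", p_gg, "\n")
```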

RNAseq - Plotting log2foldchange-basemean but has weird data points

I am new to processing RNA-seq data and am practicing by reproducing a published RNA-seq figure. This is the paper, and Fig. 2A is what I'm trying to reproduce.
In brief, I downloaded the data with recount3 and subset the samples into the groups that I want (control vs condition 1, control vs condition 2, etc.). Then I ran the following code:
dds_4uM_30min <- DESeqDataSetFromMatrix(countData = ha_4uM_30min_data,
                                        colData = ha_4uM_30min_meta,
                                        design = ~ type)
dds2_4uM_30min <- DESeq(dds_4uM_30min)
res_4uM_30min <- results(dds2_4uM_30min, tidy = FALSE)
(type is the column that I made to contain the information of whether it's control or condition 1)
This is the figure I get, which confuses me since it is nowhere near the original figure.
I thought that they might have done additional processing of the data, but I have no idea what the common or reasonable approaches are.
Furthermore, some data points seem to form lines (as can be seen in the above figure), which do not appear in the original figure. I am wondering what causes this kind of distribution and how to get rid of it.
Thanks in advance for any opinion or suggestion.
I have been trying to use the function lfcShrink but the figure still has this weird line.
Any suggestions on how to further process RNA seq data?
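One possible cause of points lining up in a log2 fold change versus mean plot is genes with very low counts: with only a few small integer counts, only a few distinct ratios are possible. A common pre-processing step is to drop such genes before the analysis. The sketch below shows the idea on a made-up count matrix in base R; the thresholds min_count and min_samples are assumptions to tune, and whether the published figure used such a filter is not known.

```r
# Hypothetical count matrix: 5 genes x 4 samples.
counts <- matrix(
  c(  0,   1,   0,   2,
    150, 200, 180, 170,
      3,   0,   1,   0,
     40,  55,  12,   9,
      0,   0,   0,   0),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5),
                  c("ctrl1", "ctrl2", "cond1", "cond2"))
)

# Keep genes with at least min_count reads in at least min_samples samples.
min_count   <- 10
min_samples <- 2
keep <- rowSums(counts >= min_count) >= min_samples

filtered <- counts[keep, , drop = FALSE]
print(filtered)

# With a DESeqDataSet the same logical vector would typically be built from
# counts(dds) and applied as dds[keep, ] before calling DESeq().
```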

GLMMs for meta-analysis - error using metabin

I'm trying to run a generalised linear mixed effects (binomial-normal) meta-analysis for 7 randomised studies, where each study records the presence of an adverse event within the treatment and placebo populations (exposure and control).
To do this, I'm hoping to use the metabin function (meta package). However, I'm getting an error and I'm not sure why. E.g. running this code:
install.packages('meta')
library(meta)
# Data
data<-data.frame(exposure.events=c(11,34,152,4,60,3,25), exposure.population=c(184,152,9500,77,2012,15,60), control.events=c(3,33,4729,133,1441,1,25), control.population=c(184,375,613978,15865,480485,105,238), Study=c("1","2","3","4","5","6","7"))
# Calling metabin
metabin(event.e=exposure.events, n.e=exposure.population, event.c=control.events, n.c=control.population, studlab=Study, data=data, method="GLMM",model.glmm = "CM.AL",method.tau = "ML")
I get this output:
Error in metafor::rma.glmm(ai = event.e[!exclude], n1i = n.e[!exclude], :
Cannot fit ML model.
I've also tried calling the rma.glmm function directly (instead of doing this via metabin), but get the same error message. I've also tried reading the source code for rma.glmm but I'm not sure I understand what's going on. However, I think the issue is related to the third study (the largest), and in particular the size of the control population, as both of the following run smoothly:
# Modifying 3rd row's control population
data<-data.frame(exposure.events=c(11,34,152,4,60,3,25), exposure.population=c(184,152,9500,77,2012,15,60), control.events=c(3,33,4729,133,1441,1,25), control.population=c(184,375,61378,15865,480485,105,238), Study=c("1","2","3","4","5","6","7"))
metabin(event.e=exposure.events, n.e=exposure.population, event.c=control.events, n.c=control.population, studlab=Study, data=data, method="GLMM",model.glmm = "CM.AL",method.tau = "ML")
# Deleting 3rd row
data<-data.frame(exposure.events=c(11,34,4,60,3,25), exposure.population=c(184,152,77,2012,15,60), control.events=c(3,33,133,1441,1,25), control.population=c(184,375,15865,480485,105,238), Study=c("1","2","3","4","5","6"))
metabin(event.e=exposure.events, n.e=exposure.population, event.c=control.events, n.c=control.population, studlab=Study, data=data, method="GLMM",model.glmm = "CM.AL",method.tau = "ML")
Is this a convergence problem, and does anyone know if there is any way around this? The only other thing I can find about this error message is for a problem (and thus solution) which does not apply to me.
Any help would be really appreciated :)
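To see how the third study compares with the others before fitting anything, the raw event rates and crude odds ratios can be tabulated in base R. This is only a descriptive check on the data given above; it does not diagnose or fix the rma.glmm error, and the added column names are illustrative.

```r
# Data as given in the question.
data <- data.frame(
  exposure.events     = c(11, 34, 152, 4, 60, 3, 25),
  exposure.population = c(184, 152, 9500, 77, 2012, 15, 60),
  control.events      = c(3, 33, 4729, 133, 1441, 1, 25),
  control.population  = c(184, 375, 613978, 15865, 480485, 105, 238),
  Study               = c("1", "2", "3", "4", "5", "6", "7")
)

# Event proportion in each arm.
data$exposure.rate <- data$exposure.events / data$exposure.population
data$control.rate  <- data$control.events / data$control.population

# Crude odds ratio per study: odds in the exposed arm over odds in the control arm.
data$crude.or <- with(data,
  (exposure.events / (exposure.population - exposure.events)) /
  (control.events / (control.population - control.events)))

print(data[, c("Study", "exposure.rate", "control.rate", "crude.or")])
```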

Arules - introducing new measure

I am new to R. I am trying to run arules on the Titanic data. I am using the following code:
library(arules)
library(plyr)
rules<-apriori(titanic.raw)
inspect(rules)
rules <- apriori(titanic.raw,
                 parameter = list(minlen=2, supp=0.005, conf=0.8),
                 appearance = list(rhs=c("Survived=No", "Survived=Yes"),
                                   default="lhs"),
                 control = list(verbose=F))
inspect(rules)
quality(rules)$rec<- support(rules,titanic.raw)/count(rhs)
I am able to get the result. However, I need to introduce one more measure apart from support, confidence and lift, namely rec = support / (total number of target instances), which I attempted in the last line of code.
I think I am going wrong in writing it. Could anyone help and guide me through it? I would really appreciate it.
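A minimal sketch of one way to compute such a measure, assuming "total number of target instances" means the number of rows containing the rule's right-hand side, so that rec equals the rule's support divided by the support of its rhs. The toy data frame below is made up to stand in for titanic.raw; the calls used (apriori, rhs, support, quality) are from the arules package.

```r
library(arules)

# Made-up stand-in for titanic.raw: factor columns only.
toy <- data.frame(
  Class    = c("1st", "1st", "3rd", "3rd", "3rd", "Crew", "Crew", "1st"),
  Survived = c("Yes", "Yes", "No",  "No",  "No",  "No",   "Yes",  "Yes"),
  stringsAsFactors = TRUE
)
trans <- as(toy, "transactions")

rules <- apriori(trans,
                 parameter  = list(minlen = 2, supp = 0.005, conf = 0.8),
                 appearance = list(rhs = c("Survived=No", "Survived=Yes"),
                                   default = "lhs"),
                 control    = list(verbose = FALSE))

# Support of each rule's right-hand side, i.e. the share of rows in which
# the target item occurs.
rhs_support <- support(rhs(rules), trans)

# rec = rule support / rhs support (assumed definition, see above).
quality(rules)$rec <- quality(rules)$support / rhs_support

inspect(rules)
```

Both supports are relative frequencies over the same transactions, so the ratio is the same as dividing the corresponding absolute counts.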

How to apply a built model to new text (text mining)

Yesterday I found some good R code for emotion classification and took part of it:
happy = readLines("./happy.txt")
sad = readLines("./sad.txt")
happy_test = readLines("./happy_test.txt")
sad_test = readLines("./sad_test.txt")
tweet = c(happy, sad)
tweet_test= c(happy_test, sad_test)
tweet_all = c(tweet, tweet_test)
sentiment = c(rep("happy", length(happy)),
              rep("sad", length(sad)))
sentiment_test = c(rep("happy", length(happy_test)),
                   rep("sad", length(sad_test)))
sentiment_all = as.factor(c(sentiment, sentiment_test))
library(RTextTools)
# build the document-term matrix; the weighting function is passed by name
mat = create_matrix(tweet_all, language="english",
                    removeStopwords=FALSE, removeNumbers=TRUE,
                    stemWords=FALSE, weighting=tm::weightTfIdf)
container = create_container(mat, as.numeric(sentiment_all),
                             trainSize=1:160, testSize=161:180, virgin=FALSE)
models = train_models(container, algorithms=c("MAXENT",
                                              "SVM",
                                              #"GLMNET", "BOOSTING",
                                              "SLDA", "BAGGING",
                                              "RF", # "NNET",
                                              "TREE"
                                              ))
# test the model
results = classify_models(container, models)
table(as.numeric(sentiment_all[161:180]), results[,"FORESTS_LABEL"])
All is good, but I have one question. Here we work with data where we ourselves tell the machine which text is sad and which is happy. Suppose I have new documents with no labels saying what is sad or happy, positive or negative (say the document is read with n=read.csv("C:/1/ttt.csv")). How can I make the built model decide which phrases are negative and which are positive?
Well, what was the purpose of building a model to detect what is sad and what is happy? What do you want to achieve? Also, this does not look like an SO question/answer.
You are using supervised learning on labeled data (you already know whether each text is sad or happy) to learn what defines those classes, so that later you can use the trained models to predict labels for new content where you do not have them.
Any transformation applied to the training data must also be applied to the new incoming data; you then ask your model to predict (evaluate) on this new input and use the prediction as the result. This does not change your model; it just evaluates it on new data.
Another scenario is that you obtain new labeled data and want to update your model. You can then retrain on the new data, which may yield models with more features.
In your case you should look at the classify_model or classify_models functions in that package.
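The point about applying the same transformation to new data can be illustrated in base R without any model: the vocabulary is fixed from the training texts, and new texts are mapped onto exactly those columns. The helper names tokenise and term_matrix and the example sentences are made up for this sketch. In RTextTools, the corresponding step would be, as far as I know, building the new matrix with the originalMatrix argument of create_matrix and a container with virgin=TRUE before calling classify_models.

```r
# Lower-case a text and split it into words on anything that is not a letter.
tokenise <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nzchar(words)]
}

# Count how often each vocabulary term occurs in each document.
# Words outside the vocabulary are silently dropped.
term_matrix <- function(docs, vocabulary) {
  m <- t(vapply(docs, function(d) {
    as.numeric(table(factor(tokenise(d), levels = vocabulary)))
  }, numeric(length(vocabulary))))
  dimnames(m) <- list(NULL, vocabulary)
  m
}

# The vocabulary is learned from the training texts only.
train_docs <- c("so happy today", "feeling sad and tired")
vocabulary <- sort(unique(unlist(lapply(train_docs, tokenise))))
train_x    <- term_matrix(train_docs, vocabulary)

# New, unlabeled texts are encoded with the SAME vocabulary, so a model
# trained on train_x sees columns with the same meaning.
new_docs <- c("happy happy weekend", "sad news")
new_x    <- term_matrix(new_docs, vocabulary)
print(new_x)
```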

Resources