Experiment design for count data - r

I'm stuck on a problem at work: I need to design an experiment that compares different treatment means. I plan to use one-way ANOVA with 6 levels: tel_check, applicant_check, and several other checks. I want to find out whether each check works on its own or in combination, and to identify the combination of checks that works best.
However, my dependent variable is count data, which is not continuous and so violates an assumption of ANOVA. I also can't figure out what the alternatives are if one-way ANOVA can't be used here. If one-way ANOVA can work, how can I get a rough estimate of the sample size my plan needs?
I know that's a lot of questions, sorry about that, but I couldn't find suitable answers on the Internet.
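One commonly used alternative for count outcomes is Poisson regression, and a rough sample size can be explored by simulation. The sketch below is only an illustration under stated assumptions: the six group means in assumed_means are invented, the helper names simulate_p and estimate_power are hypothetical, and real count data may be overdispersed, in which case a quasi-Poisson or negative binomial model could fit better.

```r
# Simulate one experiment with a 6-level factor and a Poisson count outcome,
# then return the p-value of a likelihood-ratio test for any group difference.
simulate_p <- function(n_per_group, means) {
  check <- factor(rep(seq_along(means), each = n_per_group))
  count <- rpois(length(check), lambda = means[as.integer(check)])
  fit_null <- glm(count ~ 1, family = poisson)
  fit_full <- glm(count ~ check, family = poisson)
  anova(fit_null, fit_full, test = "Chisq")[["Pr(>Chi)"]][2]
}

# Estimated power: share of simulated experiments with p < alpha.
estimate_power <- function(n_per_group, means, n_rep = 200, alpha = 0.05) {
  mean(replicate(n_rep, simulate_p(n_per_group, means)) < alpha)
}

set.seed(1)
assumed_means <- c(2, 2, 2.5, 3, 3, 4)  # hypothetical mean counts per check
power_30 <- estimate_power(n_per_group = 30, means = assumed_means)
power_30
```

Rerunning estimate_power() over a range of n_per_group values shows roughly where the power crosses a target such as 0.8, given the assumed means.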

Related

How to stack selected columns of a dataframe

I am not quite sure if this is the right place to post this question.
I have to do a logistic regression using R. The programming part should not be an issue, as there are enough tutorials and similar questions on these forums already.
My question is more about how to get the data into a usable form for this model.
To specify: the survey is about a tax on a specific consumer good, and in particular about the change in consumers' purchasing behaviour. People were randomly assigned to one of two categories, one with the tax and one without. Additionally, people were asked about their preferences in two different situations. So to sum up, Group A was taxed on the good in both situations and Group B was not taxed in either situation.
The results are now in a CSV file. The problem is that each of those subgroups got its own column. This means the data can't be evaluated well: the columns should all be merged into one so that I can fit a logistic regression with a 1 if a person chose the taxed good and a 0 if they did not. The aim is then to see whether a tax on said good would reduce the amount bought by x percent, and whether the tax has an impact on purchasing behaviour at all. (This part is only for clarification and may not be relevant to the question; I know logistic regression will not tell me the percentage reduction itself.)
My question now is: is there even a way to make this work with the chosen design? Is it possible to merge all the data into a usable form without losing or distorting any data?
I am not sure if this question is stated clearly enough. Let me know if I should clarify any details so it can be answered properly.
Thank you for your help!
EDIT:
Each column in the CSV file contains a number corresponding to the choice the respondent made in the survey. But since there were different groups, each group got its own column. For a logistic regression they all have to be in the same column (I believe). Can I just stack them using the links posted in the comments and go from there?
Also, does stacking the columns distort the data in any way? I am not sure if this is the right place to ask this, but I think it's worth a try.
What you could try is splitting the CSV into two separate datasets (one for each group) and using rbind to combine them:
# note: the column names need to be identical in order for them to stack
df_final <- rbind(df1, df2)
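A minimal sketch of that stacking step, with made-up data: the column names choice, situation and group, and all values, are assumptions for illustration only. Adding a group column before stacking records which dataset each row came from, so no information is lost when the rows are combined.

```r
# Hypothetical per-group data: 1 = chose the good, 0 = did not.
df1 <- data.frame(choice = c(1, 0, 0, 1, 0, 0), situation = rep(1:2, each = 3))
df2 <- data.frame(choice = c(1, 1, 0, 1, 1, 0), situation = rep(1:2, each = 3))

# Record the origin of each row before stacking.
df1$group <- "taxed"
df2$group <- "untaxed"

# Column names are identical, so rbind stacks the rows.
df_final <- rbind(df1, df2)
df_final$group <- factor(df_final$group, levels = c("untaxed", "taxed"))

# Logistic regression of the choice on the tax group.
fit <- glm(choice ~ group, family = binomial, data = df_final)
summary(fit)$coefficients
```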

Clarifying the aim of linear regression with multiple predictor variables and how to plot using ggplot2 [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 10 months ago.
I'm trying to learn the intricacies of linear regression for prediction, and I'd like to ask two questions:
I've got one dependent variable (call it X) and, let's say, ten independent variables. I can use lm() to generate a model. But my question is this: is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X? I assumed the latter, but after several hours of reading online I am now unsure.
If the aim is to discover the best combination of predictors of X, then (once I've identified that combination) how is a combination plotted properly? Plotting one line is easy, but for a combination would it be proper to (a) plot ten distinct regression lines (one per independent variable) or (b) plot a single line that somehow represents the combination? I've provided the summary() I'm working with in case it facilitates answering this question.
Is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X?
This depends mainly on the situation/context you are in. If you are always going to have access to these predictors, then yes, you'd like to identify the best model that will (likely) use a combination of these predictors. Obviously you want to keep in mind issues like overfitting and make sure the predictors you include are actually contributing something meaningful to your model, but there's no reason not to include multiple predictors if they make your model meaningfully better.
However, in many real world scenarios predictors are not free. It might cost $10,000 to collect each predictor and the organization you are working for only has the budget to collect one predictor. Thus, you might only be interested in the single best predictor because it is not practical to collect more than one going forward. In this case you'd also just be interested in how well that variable predicts in a simple regression, not a multiple regression, since you won't be controlling for other variables in the future anyway (but looking at the multiple regression results could still provide insight).
how is a combination plotted properly?
Again, this depends on context. However, in most cases you probably don't want to plot 10 regression lines because that's too overwhelming to look at and you will probably never have 10 variables that meaningfully contribute to your model. I'm actually kind of surprised your adjusted R^2 is not lower given you have quite a few variables so close to zero, unless they're just on massive scales.
First, who is viewing this graph? Is it you? If so, what information do you need to see that isn't being conveyed by the beta parameters? If it's someone else, who are they? Are they a stakeholder who knows nothing about statistics? If that's the case, you want a pretty simple graph that drives home your main point. Second, what is the purpose of your predictions and how does the process you are predicting unfold in the real world? Let's say I'm predicting how well people perform on the job given their scores on some different selection measures. The first thing you need to consider is, how is that selection happening? Are candidates screened on their answers to some personality questions and only the top scorers get an interview? In that case, it might be useful to create multiple graphs that show that process. However, candidates might be reviewed holistically and assigned a sum score based on all these predictors. In that case one regression line makes sense because you are interested in how these predictors act in concert.
There is no one answer to this question because the answers depend on the reason you're doing a regression in the first place. Once you identify the reason you're trying to predict this thing and the context that the process is happening in you should probably be able to determine what makes most sense. There is no "right" answer you'll find in a textbook because most real life problems are not in textbooks.
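To make the single-predictor versus combination comparison concrete, here is a small sketch on R's built-in mtcars data, which is an assumption and not the asker's dataset. It compares adjusted R^2 for a one-predictor and a two-predictor model, and builds the data for one way of showing a combination as a single line: observed values against the model's fitted values.

```r
# Simple regression with one predictor versus a model with two predictors.
single <- lm(mpg ~ wt, data = mtcars)
multi  <- lm(mpg ~ wt + hp, data = mtcars)

adj_r2 <- c(single = summary(single)$adj.r.squared,
            multi  = summary(multi)$adj.r.squared)
adj_r2

# One line for the whole combination: observed versus fitted values.
plot_data <- data.frame(observed = mtcars$mpg, fitted = fitted(multi))

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(plot_data$fitted, plot_data$observed,
       xlab = "Fitted mpg (wt + hp)", ylab = "Observed mpg")
  abline(0, 1)  # points on this line are predicted exactly
}
```

The same plot_data frame could be passed to ggplot2 with geom_point() and geom_abline() if that package is preferred.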

Is repeated-measures ANOVA what I am looking for?

I'm studying the NDVI (normalized difference vegetation index) behaviour of some soils and cultivars. My database has 33 days of acquisition, 17 kinds of soil and 4 different cultivars. I have built it in two different layouts, which you can see attached. I am running into trouble and errors with both layouts.
The first question is: is repeated-measures ANOVA the correct way of analyzing my data? I want to see whether there are any differences between the behaviours of the different cultivars and the different soils. I've run an ANOVA for each day and there are statistically significant differences on each day, but those results are not very informative overall because I would like to investigate the behaviour over the whole year.
The second question is: how can I perform it? I've tried different tutorials, but I either got unexpected errors or didn't manage to complete the analysis.
Last but not least: I'm coding in RStudio.
Any help is appreciated. I'm still new to statistics but really interested in improving!
horizontal database
vertical database
I believe you can use ANOVA, but as always, you have to know whether that really is what you're looking for. Either way, since this is a platform for programming questions, I'll write some code that should work for the vertical version. However, since I don't have your data, I can't know for sure (for future reference, dput(data) creates easily importable code for those trying to answer you).
summary(aov(suolo ~ CV, data = data))
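For the repeated-measures part of the question, a hedged sketch on simulated long-format ("vertical") data follows. The column names plot_id, day, cultivar and ndvi and all values are assumptions, since the real data were not posted. It treats cultivar as a between-plot factor and day as a within-plot factor, and it relies on the design being balanced.

```r
set.seed(42)

# Hypothetical balanced design: 12 plots, each measured on 5 days.
long_data <- expand.grid(day = factor(1:5), plot_id = factor(1:12))

# Three plots per cultivar (a between-plot factor).
cultivar_of_plot <- rep(c("A", "B", "C", "D"), each = 3)
long_data$cultivar <- factor(cultivar_of_plot[as.integer(long_data$plot_id)])

# Simulated NDVI with a small trend over days plus noise.
long_data$ndvi <- 0.5 + 0.02 * as.integer(long_data$day) +
  rnorm(nrow(long_data), sd = 0.05)

# Repeated-measures ANOVA: the Error() term marks plot_id as the unit
# that is measured repeatedly.
fit <- aov(ndvi ~ cultivar * day + Error(plot_id), data = long_data)
summary(fit)
```

With unbalanced data or missing days, a mixed model is often used instead of aov() with an Error() term.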

VAR model with variable combination and variation

I tried searching for an answer to this question, but I could not find anything.
I want to build a model that predicts barley prices, and I came up with 11 variables that may have an impact on the prices. What I tried to do was build a loop that adds one extra variable from my pool of variables each time and tries the different combinations of them, so that the output is a new VAR model for every additional variable or combination. In a sense, it is a combinatorics exercise. After that, I want to implement in-sample and out-of-sample testing for each of the resulting models to decide which one is the most appropriate. Unfortunately, I am not very familiar with loops, and I have been told not to use them in R. As I am a beginner in R, my attempts probably won't help much, but if you need them I am happy to provide them.
Many thanks in advance!
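The combinatorics part can be sketched in base R with combn() and lapply(), without an explicit for loop. The variable names below are hypothetical and only 4 candidates are used to keep the output small; with 11 candidates the same code yields 2^11 - 1 = 2047 combinations. Fitting is left out: each element of model_columns is just the column set that would be handed to whatever VAR-fitting function is used (for example VAR() from the vars package, which is an assumption about the setup).

```r
# Hypothetical candidate predictors (the question has 11).
candidates <- c("wheat_price", "rainfall", "oil_price", "exchange_rate")

# All non-empty subsets: combinations of size 1, 2, ..., length(candidates).
combos <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)
length(combos)

# Column set for each candidate model: the target plus one combination.
model_columns <- lapply(combos, function(vars_k) c("barley_price", vars_k))
model_columns[[5]]
```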

prediction and imputation of missing values using a panel data model (R)

I have an unbalanced panel dataset. I created a pooled model and now need to predict and impute the missing values in the dataset. How can this be done?
Here is a screenshot of my data: https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
Thank you!
First of all, it looks like your question is too broad. If you're really asking how you should predict values for your spreadsheet (i.e. cells Z6, AA6, ..., AM22, ...), then yes, you have a HUGE question =]. Just a hint: in future questions you should be more specific, for example: I have THIS data related to households in Belarus. I've searched for prediction models for it and tried XPTO1 and XPTO2. How can I decide which one is better?
What I really mean is that prediction is not a function like SUM that you can apply to your data and be done. Prediction is a whole discipline, with a bunch of methods that have to be tested against different cases. For example, to predict the Z6 cell in your data, you should ask yourself what other data can help infer the missing information. In some cases the simple average of the past 5 years will be enough; in others, a lot more has to be considered.
I recommend that you first take a look at some basic material covering simple models, such as linear models, play with them, and try to understand the accuracy of the predictions you get. That will either solve your problem or at least help you ask the community more "answerable" questions.
One last tip: there is a new sister Q&A community of SO that may be a more appropriate place to ask questions about prediction models: https://datascience.stackexchange.com/
Good luck.
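As a starting point with a simple linear model, the sketch below fits a pooled regression on the complete rows of a simulated panel and fills the missing cells with its predictions. Everything here is an assumption for illustration: the columns household, year, income and spending are invented, and base R lm() on the stacked rows is used as a stand-in for the asker's pooled model.

```r
set.seed(7)

# Hypothetical panel: 4 households observed over 5 years.
panel <- data.frame(household = rep(1:4, each = 5),
                    year = rep(2010:2014, times = 4))
panel$income <- 100 + 5 * (panel$year - 2010) + rnorm(20, sd = 3)
panel$spending <- 20 + 0.6 * panel$income + rnorm(20, sd = 2)

# Introduce some missing values in the outcome.
panel$spending[c(3, 9, 17)] <- NA

# Pooled model: lm() drops rows with a missing outcome by default.
pooled <- lm(spending ~ income + year, data = panel)

# Predict only the missing cells and keep observed values untouched.
missing_rows <- is.na(panel$spending)
panel$spending_filled <- panel$spending
panel$spending_filled[missing_rows] <- predict(pooled,
                                               newdata = panel[missing_rows, ])
panel[missing_rows, ]
```

This only works when the predictors themselves are observed in the rows being filled, and the filled values carry the model's uncertainty, which a single prediction does not show.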
