Is repeated ANOVA what I am looking for? - R

I'm studying the NDVI (normalized difference vegetation index) behaviour of some soils and cultivars. My database has 33 days of acquisition, 17 kinds of soil and 4 different cultivars. I have built it in two different ways, which you can see attached. I am having trouble and errors with both shapes.
The first question is: is repeated-measures ANOVA the correct way to analyse my data? I want to see whether there are any differences between the behaviours of the different cultivars and the different soils. I've run an ANOVA for each day and there are statistically significant differences on each day, but the results are not globally interesting, because I would like to investigate the behaviour over the whole year.
The second question is: how can I perform it? I've tried different tutorials, but I got unexpected errors or didn't manage to complete the analysis.
Last but not least: I'm coding in RStudio.
Any help is appreciated. I'm still new to statistics but really interested in improving!
horizontal database
vertical database

I believe you can use ANOVA, but as always, you have to know whether that is really what you're looking for. Either way, since this is a platform for programming questions, I'll write code that should work for the vertical version. However, since I don't have your data, I can't know for sure (for future reference, dput(data) creates easily importable code for those trying to answer you).
# one-way ANOVA of `suolo` across the cultivar groups in `CV`
summary(aov(suolo ~ CV, data = data))
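For the repeated-measures part of the question (the same plots measured on 33 days), a minimal sketch is below. All column names (NDVI, cultivar, soil, day, plot) are assumptions about the long/vertical layout, since the actual data wasn't posted:

# Repeated-measures ANOVA: `plot` is the experimental unit observed on every day.
# Column names are assumed; adjust to the real data.
d$day  <- factor(d$day)
d$plot <- factor(d$plot)
fit <- aov(NDVI ~ cultivar * soil * day + Error(plot/day), data = d)
summary(fit)

For unbalanced data, a mixed model (e.g. lme4::lmer with a random effect per plot) is usually a more robust route than aov.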

Related

VAR model with variable combination and variation

I tried searching for an answer to this question of mine, but I could not find anything.
I want to build a model that predicts barley prices. For that I came up with 11 variables that may have an impact on the prices. What I tried was building a loop that each time chooses one extra variable from my pool of variables and tries different combinations of them, and the output would be, for every combination, a new VAR model; in a sense, it is a combinatorics exercise. After that, I want to implement in/out-of-sample testing for each of the models to decide which one is most appropriate. Unfortunately, I am not very familiar with loops and I have been told not to use them in R... As I am a beginner in R, my attempts won't help you at all, but if you really require them I am happy to provide them.
Many thanks in advance!
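No answer was posted here, but a rough sketch of the combinatorial search described above, using the vars package, could look like this. The data frame prices with a barley column plus the 11 candidate predictor columns is an assumption:

library(vars)  # VAR() for vector autoregressions

# `prices` is a hypothetical data frame of numeric series.
candidates <- setdiff(names(prices), "barley")
results <- list()
for (k in seq_along(candidates)) {
  for (combo in combn(candidates, k, simplify = FALSE)) {
    y   <- prices[, c("barley", combo)]
    fit <- VAR(y, p = 2, type = "const")   # lag order p = 2 is an assumption
    results[[paste(combo, collapse = "+")]] <- AIC(fit)
  }
}
head(sort(unlist(results)))  # lowest-AIC variable sets first

This only ranks models in-sample; the in/out-of-sample testing mentioned above should still be run on the short list before settling on a model.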

Experiment design for count data

I've run into trouble with a work problem: I need to design an experiment that can compare different treatment means. I plan to use a one-way ANOVA with 6 levels, which are tel_check, applicant_check, and several other checks. I want to find out whether these checks work singly or in combination, and to find the best combination of checks.
However, my dependent variable is count data, which is not continuous and violates the assumptions of ANOVA. I also can't figure out alternatives if I can't use one-way ANOVA here. If one-way ANOVA can work, how can I estimate a rough sample size for my plan?
I asked a lot of questions, sorry for that, but I didn't find suitable answers on the Internet.
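No answer was posted, but the standard alternative to one-way ANOVA for counts is a Poisson (or, under overdispersion, negative-binomial) GLM. A minimal sketch, assuming a data frame d with hypothetical columns count and check:

# Poisson GLM as the count-data analogue of one-way ANOVA.
fit <- glm(count ~ check, family = poisson, data = d)
summary(fit)
anova(fit, test = "Chisq")   # overall test of the `check` factor

# If the counts are overdispersed, switch to a negative binomial model:
library(MASS)
fit_nb <- glm.nb(count ~ check, data = d)

For sample size, a common route is simulation: generate counts under the effect sizes you care about and check how often the test rejects.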

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class", which contains 160 different classes. I have 1200 variables, each one an integer, with no individual cell exceeding 1000 (if that helps). About a quarter of the cells are zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar function and the findCorrelation function to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model, planning to use the variable-importance measures to filter out the useless stuff, but gave up after letting it run for days. Then I tried using fscaret, but that ran overnight on an 8-core machine with 64 GB of RAM (same as the previous attempt) and didn't finish. Then I tried:
Feature selection using genetic algorithms. That ran overnight and didn't finish either. I also tried to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be both my problem and my solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine in caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I'm in right now.
I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? It assumes all attributes are independent of each other, but experience shows that Naive Bayes results are often still useful despite that strong assumption.
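A minimal sketch of that suggestion with the e1071 package; the data frame df and the Class column come from the question, everything else is assumed:

library(e1071)  # naiveBayes()

set.seed(1)
test <- sample(nrow(df), 0.2 * nrow(df))            # 20% holdout
nb   <- naiveBayes(Class ~ ., data = df[-test, ])
pred <- predict(nb, df[test, ])
mean(pred == df$Class[test])                        # rough holdout accuracy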
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try processing your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where several algorithms (or algorithm combinations) for removing irrelevant attributes are at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this PDF as an introduction and watch this video for a walk-through and an introduction to the theoretical approach.
The video assumes some familiarity with Weka, but it may still help.
There is an RWeka interface, but it's a bit laborious to install, so working with the Weka GUI might be easier.
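Since a caret PCA example was explicitly requested, here is a minimal sketch, assuming the data frame df with the factor column Class and 1200 numeric predictors:

library(caret)  # preProcess() can run PCA as a preprocessing step

x  <- df[, setdiff(names(df), "Class")]
pp <- preProcess(x, method = c("center", "scale", "pca"), thresh = 0.95)
x_pca      <- predict(pp, x)                  # scores on the retained components
df_reduced <- data.frame(Class = df$Class, x_pca)
ncol(x_pca)  # how many components explain 95% of the variance

The same preprocessing can also be passed straight into train(), e.g. preProcess = c("center", "scale", "pca"), so the PCA is re-estimated inside each resampling fold.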

Prediction and imputation of missing values using a panel data model (R)

I have an unbalanced panel dataset. I created a pooled model and now need to predict and impute the missing values in the dataset. How can this be done?
Here is a screenshot of my data: https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
Thank you!
First of all, it looks like you have too broad a question here. If you're really asking how to predict values for your spreadsheet (i.e. cells Z6, AA6, ..., AM22, ...), then yes, you have a HUGE question =]. Just a hint for your future questions: be more specific, like: "I have THIS data related to households in Belarus. I've searched for prediction models for it and tried XPTO1 and XPTO2. How can I decide which one is better?"
So what I really mean is that prediction is not a function like SUM that you can apply to your data and be done. Prediction is a whole discipline, with a range of methods that have to be tested against each case. For example, to predict the Z6 cell in your data, you should ask yourself: what other data can help infer the missing information? In some cases the simple average of the past 5 years will be enough; in others, a lot more has to be considered.
I recommend first taking a look at some basic material that covers simple models, like linear models; play with them and try to understand the accuracy of the predictions you obtain. That will either solve your problem or at least help you ask the community more "answerable" questions.
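A minimal sketch of that idea with a plain linear model; the data frame panel and the columns income, year and gdp are invented, since only a screenshot of the data was posted:

# Fit on the complete rows, then fill in the rows where `income` is missing.
complete <- !is.na(panel$income)
fit <- lm(income ~ year + gdp, data = panel[complete, ])
panel$income[!complete] <- predict(fit, newdata = panel[!complete, ])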
One last tip: there is a newer sister Q&A community of SO that may be more appropriate for questions about prediction models: https://datascience.stackexchange.com/
Good luck.

Example R source code for multiple linear regression with looping through geographies & products?

Pardon the newbie question, as I just started learning R a couple of weeks ago (but intend to use it actively from now on). However, I could use some help if you already have a working example.
In order to determine own-price elasticity coefficients for each of our products (~100) in each of our states, I want to write a multiple regression that regresses Units on a variety of independent variables. That's straightforward. However, I would like R to cycle through EACH product within a particular state, THEN move on to the next state in the data file and start the regression on the first product again, repeating the cycle.
I have attached an example of what I'm trying to accomplish. At the end, I would also like R to export the regression coefficients (and summaries, p-values, t-statistics) into a separate worksheet.
Does anyone have an example similar to this? I'm comfortable enough to read source code and modify it to fit my needs, but certainly not yet comfortable enough to write it from scratch. And, alas, I am tired of copying/pasting into Minitab/Excel (which is what I've been using up to this point) to run regressions 1,000 times.
Appreciate any help you could offer!
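No answer was posted, but a minimal sketch of the loop might look like this. The column names (State, Product, Units, Price, Promo) are assumptions, since the attached example isn't visible:

# One regression per state/product pair; collect coefficients for export.
rows <- list()
for (st in unique(sales$State)) {
  for (pr in unique(sales$Product[sales$State == st])) {
    sub <- sales[sales$State == st & sales$Product == pr, ]
    fit <- lm(Units ~ Price + Promo, data = sub)
    cf  <- summary(fit)$coefficients     # estimate, std. error, t-stat, p-value
    rows[[paste(st, pr)]] <- data.frame(State = st, Product = pr,
                                        term = rownames(cf), cf, row.names = NULL)
  }
}
out <- do.call(rbind, rows)
write.csv(out, "elasticity_coefficients.csv", row.names = FALSE)  # opens in Excel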
