I have an unbalanced panel dataset. I fitted a pooled model and now need to predict and impute the missing values in the dataset. How can this be done?
Here is a screenshot of my data: https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
Thank you!
First of all, your question looks too broad. If you're really asking how to predict values for your spreadsheet (i.e., cells Z6, AA6, ..., AM22, ...), then yes, you have a HUGE question =]. Just a hint: in future questions, be more specific, for example: "I have THIS data on households in Belarus. I've searched for prediction models for it and tried XPTO1 and XPTO2. How can I decide which one is better?"
What I really mean is that prediction is not a function like SUM that you can apply to your data and be done. Prediction is a whole discipline, with many methods that have to be tested on each case. For example, to predict the Z6 cell in your data, you should ask yourself what other data can help infer the missing information. In some cases the simple average of the past 5 years will be enough; in others, a lot more has to be considered.
I recommend that you first look at some basic material covering simple models, such as linear models, play with them, and try to understand the accuracy of the predictions you get. That will either solve your problem or at least help you ask the community more "answerable" questions.
One last tip: there is a new sister Q&A community of SO that may be more appropriate for questions about prediction models: https://datascience.stackexchange.com/
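To make the "play with a linear model" suggestion concrete, here is a minimal sketch in base R. It assumes a made-up toy panel (the columns `id`, `year`, `x`, `y` and all the numbers are hypothetical, not taken from your spreadsheet): fit a pooled linear model on the observed rows, then fill the gaps with its predictions.

```r
# Hypothetical toy panel: 4 units observed over 5 years (not the asker's data).
set.seed(1)
panel <- expand.grid(id = 1:4, year = 2010:2014)
panel$x <- rnorm(nrow(panel))
panel$y <- 2 + 0.5 * panel$x + 0.1 * (panel$year - 2010) +
  rnorm(nrow(panel), sd = 0.2)
panel$y[c(3, 8, 15)] <- NA  # knock out a few cells to mimic the gaps

# Pooled OLS: lm() drops rows with a missing response by default.
fit <- lm(y ~ x + year, data = panel)

# In-sample root mean squared error, a first rough look at accuracy.
rmse <- sqrt(mean(residuals(fit)^2))

# Predict only the missing rows and store the result in a new column,
# so the observed values stay untouched.
missing <- is.na(panel$y)
panel$y_filled <- panel$y
panel$y_filled[missing] <- predict(fit, newdata = panel[missing, ])
```

Keeping the filled values in a separate column (`y_filled`) makes it easy to tell observed values from imputed ones later. Whether a pooled linear model is adequate for your data is exactly the kind of thing you would have to check.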
Good luck.
Related
I have a dataset containing repeated measures and quite a lot of variables per observation, so I need a smart way to select explanatory variables. Regularized regression methods seem like a good fit for this problem.
Upon looking for a solution, I found out about the glmmLasso package quite recently. However, I have difficulties defining a model. I found a demo file online, but since I'm a beginner with R, I had a hard time understanding it.
(demo: https://rdrr.io/cran/glmmLasso/src/demo/glmmLasso-soccer.r)
Since I cannot share the original data, I would suggest you use the soccer dataset (the same dataset used in glmmLasso demo file). The variable team is repeated in observations and should be taken as a random effect.
# sample data
library(glmmLasso)
data("soccer")
I would appreciate it if you could explain the parameters lambda and family, and how to tune them.
I'm working with a large data set of patients measured repeatedly over multiple months, with ordered outcomes on a severity scale from 1 to 5. I was able to analyze the first set of patients using the polr function to run a basic ordinal logistic regression model, but now I want to analyze the association across all the time points using a longitudinal ordinal logistic model. So far I can't find any clear documentation online or on this site explaining which package to use and how to use it. I am also an R novice, so any simple explanations would be incredibly useful. Based on some initial searching, it seems like the mixor function might be what I need, though I am not sure how it works. I found it here:
https://cran.r-project.org/web/packages/mixor/vignettes/mixor.pdf
Would appreciate a simple explanation of how to use this function if this is the right one, or would happily take any alternate suggestions with an explanation.
Thank you in advance for your help!
I'm studying the NDVI (normalized vegetation index) behaviour of some soils and cultivars. My database has 33 days of acquisition, 17 kinds of soil and 4 different cultivars. I have built it in two different layouts, which you can see attached. I am running into trouble and errors with both layouts.
The first question is: is repeated-measures ANOVA the correct way to analyze my data? I want to see whether there are any differences between the behaviours of the different cultivars and the different soils. I've run an ANOVA for each day and there are statistical differences on each day, but those results are not very interesting overall, because I would like to investigate the behaviour over the whole year.
The second question is: how can I perform it? I've tried different tutorials, but I either got unexpected errors or didn't manage to complete the analysis.
Last but not the least: I'm coding with R Studio.
Any help is appreciated. I'm still new to statistics but really interested in improving!
horizontal database
vertical database
I believe you can use ANOVA, but as always, you have to know whether that really is what you're looking for. Either way, since this is a platform for programming questions, I'll write code that should work for the vertical version. However, since I don't have your data, I can't know for sure (for future reference, dput(data) creates easily importable code for those trying to answer you).
summary(aov(suolo ~ CV, data = data))
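If the repeated measurements over days need to be taken into account, one possible sketch for the vertical (long) layout is below. It runs on simulated data, and the column names `plot`, `cultivar`, `day` and `ndvi` are assumptions that will not match your file. Soil is left out to keep the example short. The `Error(plot)` term tells `aov()` that the observations from the same plot belong together.

```r
# Simulated long-format data: 12 plots (3 per cultivar), each measured on 5 days.
# All names and numbers are hypothetical stand-ins for the real NDVI data.
set.seed(42)
d <- expand.grid(day = factor(1:5), plot = factor(1:12))
cultivar_of_plot <- factor(rep(c("A", "B", "C", "D"), each = 3))
d$cultivar <- cultivar_of_plot[as.integer(d$plot)]

plot_effect <- rnorm(12, sd = 0.05)  # plot-to-plot variation
d$ndvi <- 0.5 + 0.02 * as.integer(d$day) +
  plot_effect[as.integer(d$plot)] + rnorm(nrow(d), sd = 0.03)

# cultivar varies between plots, day varies within plots.
fit <- aov(ndvi ~ cultivar * day + Error(plot), data = d)
summary(fit)
```

The summary is split into a between-plot stratum (where the cultivar effect is tested) and a within-plot stratum (day and the cultivar-by-day interaction). This only works cleanly for balanced data; with missing days a mixed model would be the usual alternative.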
I tried searching for an answer for this question of mine, however I could not find anything.
I want to build a model that predicts barley prices. For that, I came up with 11 variables that may have an impact on the prices. What I tried to do was build a loop that adds one extra variable from my pool of variables each time and tries the different combinations of them, producing a new VAR model for every combination, so in a sense it is a combinatorics exercise. After that, I want to run an in-sample/out-of-sample test on each of the resulting models to decide which one is the most appropriate. Unfortunately, I am not very familiar with loops, and I have been told not to use them in R... As I am a beginner in R, my attempts won't help you much, but if you really need them I am happy to provide them.
Many thanks in advance!
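The enumeration described above can be sketched without an explicit loop, using `combn()` and `lapply()`/`sapply()`. To stay in base R, this sketch swaps in plain `lm()` regressions for the VAR models, and the data and the names `x1` to `x4` and `price` are simulated stand-ins. Only the "try every combination, score it out of sample" pattern is the point.

```r
# Simulated data: 4 hypothetical candidate predictors instead of the real 11.
set.seed(7)
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$price <- 10 + 1.5 * dat$x1 - 0.8 * dat$x3 + rnorm(n)

# Keep the time order: first 90 rows for fitting, last 30 for testing.
train <- dat[1:90, ]
test <- dat[91:n, ]

candidates <- c("x1", "x2", "x3", "x4")

# Every non-empty subset of the candidate variables.
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset and compute its out-of-sample RMSE.
oos_rmse <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "price"), data = train)
  sqrt(mean((test$price - predict(fit, newdata = test))^2))
})
names(oos_rmse) <- sapply(subsets, paste, collapse = " + ")

best <- names(oos_rmse)[which.min(oos_rmse)]
```

With 11 variables there are 2^11 - 1 = 2047 subsets, which is still feasible to enumerate. Picking the winner on a single test split can overfit to that split, so treat the ranking with some caution.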
I have run into trouble with a problem at work: I need to design an experiment that compares different treatment means. I plan to use a one-way ANOVA with 6 levels, which are tel_check, applicant_check, and several other checks. I want to find out whether each of these checks works, alone or in combination, and which combination works best.
However, my dependent variable is count data, which is not continuous and therefore violates the assumptions of ANOVA. I also haven't figured out what the alternatives are if I can't use one-way ANOVA here. If one-way ANOVA can work, how can I estimate a rough sample size for my plan to work?
I asked a lot of questions, sorry for that, but I didn't find suitable answers for my questions on the Internet.
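One common alternative for a count outcome is a Poisson regression with the treatment as a factor. The sketch below uses simulated counts, and the level names `check_1` to `check_6`, the rates, and the variances passed to `power.anova.test()` are all made-up assumptions. The sample-size part relies on the normal-theory ANOVA calculation, so for counts it is only a rough first guess.

```r
# Simulated counts for 6 hypothetical treatment levels, 30 observations each.
set.seed(3)
checks <- factor(rep(paste0("check_", 1:6), each = 30))
rate <- c(2, 2.2, 2.5, 3, 3.2, 4)[as.integer(checks)]  # assumed true means
d <- data.frame(check = checks, count = rpois(length(checks), lambda = rate))

# Poisson GLM: the count analogue of a one-way ANOVA.
fit <- glm(count ~ check, family = poisson(link = "log"), data = d)

# Likelihood-ratio (chi-squared) test of "all treatment means are equal".
lrt <- anova(fit, test = "Chisq")

# Rough per-group sample size from the classical one-way ANOVA power formula.
# between.var and within.var are guesses that must come from pilot data.
pw <- power.anova.test(groups = 6, between.var = 0.5, within.var = 3,
                       power = 0.8, sig.level = 0.05)
n_per_group <- ceiling(pw$n)
```

If the counts turn out to be more variable than a Poisson model allows (overdispersion), a quasi-Poisson or negative binomial model would be the next thing to try.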