I have a large table with multiple columns, each column being one different model.
In the rows are the different results of these models at a specific location on a map.
See picture (I had to change the column names)...
Now I want to check with an ANOVA in R whether there is a significant difference between the models' results at each location. In other words: whether the models are responsible for a difference, if there is any.
I am not 100% sure an ANOVA is the best way to go, but I want to try.
How can I reshape my table to long format (is that the format I need)?
I tried gather() from tidyr, but I am not sure what the inputs should be.
Thank you!
A friend helped me. A simple stack(data) did the trick!
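In case it helps anyone else, here is a minimal sketch of the full workflow, assuming the wide table is called models_wide (a hypothetical name): stack() returns a data frame with a values column and an ind factor (the original column names), which can go straight into aov().
# reshape wide -> long: one row per (location, model) result
long <- stack(models_wide)
names(long) <- c("result", "model")
# one-way ANOVA: do the results differ between the models?
fit <- aov(result ~ model, data = long)
summary(fit)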
I am not quite sure if this is the right place to post this question.
I have to do a logistic regression using R. Now the programming part should not be an issue, as there are enough tutorials and similar questions on these forums already.
My question is more about how to get the data into a usable form for this model.
To specify: the survey is about a tax on a specific consumer good, specifically about the change in consumers' purchasing behaviour. People were randomly assigned to one of two categories, one with the tax and one without. Additionally, there were two different situations in which people were asked about their preferences. To sum up: Group A was taxed on the good in both situations, Group B was not taxed in either situation.
The results are now in a CSV file. The problem, however, is that each of those subgroups got its own column. As it stands this can't be evaluated well; the columns should all be merged into one so that a logistic regression can be built, with a 1 if a person chose the taxed good and a 0 if they did not. This should then be evaluated to see whether a tax on said good would reduce the amount bought by x percent, and whether the tax even has an impact on purchasing behaviour. (This may not apply to this question and is aimed more at clarification; logistic regression alone will not tell me the aforementioned point.)
My question now is: is there even a way to make this work with the design chosen? Is it possible to merge all the data into a usable form without losing or distorting any data?
I am not sure if this question is stated clearly enough. Let me know if I should clarify more details for this question to be properly answered.
Thank you for your help!
EDIT:
Each column in the CSV file now contains a number corresponding to the choice the respondent made in the survey. But since there were different groups, each group got its own column. For a logistic regression they all have to be in the same column (I believe). Can I just stack them using the links posted in the comments and go from there?
Also, does simply stacking the columns not distort any data? I am not sure if this is the right place to ask this, but I think it's worth a try.
What you could try is splitting the CSV into two separate datasets (one for each group) and using rbind to combine them:
# note: the column names need to be identical in order for the data frames to stack
df_final <- rbind(df1, df2)
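As a hedged sketch of the full workflow (the file and column names below are hypothetical, since the actual CSV layout isn't shown): split off one data frame per tax condition, give them identical column names plus a group indicator, rbind them, recode the choice to 0/1, and fit the logistic regression with glm().
survey <- read.csv("survey.csv")
# hypothetical subgroup columns: choice_tax (Group A) and choice_no_tax (Group B)
df1 <- data.frame(choice = survey$choice_tax,    taxed = 1)
df2 <- data.frame(choice = survey$choice_no_tax, taxed = 0)
df_final <- rbind(df1, df2)
# recode to 0/1, assuming survey code 1 means "chose the good"
df_final$bought <- as.integer(df_final$choice == 1)
# logistic regression: does the tax condition affect the purchase decision?
fit <- glm(bought ~ taxed, data = df_final, family = binomial)
summary(fit)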
I am trying to run a PCA on a dataframe which is accompanied by a metadata table. The PCA table is all normalized, scaled etc.; the metadata, however, is not.
I want the PCA to cluster not only based on the dataframe but also to have the option to add one or multiple columns from the metadata table as explanatory variables. Again, these are not scaled and normalized with the main dataset. Also, I am not looking to colour the plots by a certain data column; I want the column to be considered for the actual clustering.
I am aware that this sounds kind of vague, but I am having a hard time finding the exact words. After looking around for a bit I found demixed PCA, which seems to be very close to what I want to achieve. Sadly, there is no package in R to run it.
Any recommendations are welcome and thank you in advance.
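One possible base-R sketch (this is not demixed PCA, just the literal "include the column in the decomposition" reading of the question; dat, meta and soil_ph are hypothetical names): scale the chosen metadata column(s) yourself and bind them to the already-scaled data before running prcomp, so they contribute to the components rather than only to the colouring.
# dat: normalized/scaled main data; meta: unscaled metadata table
extra  <- scale(meta[, "soil_ph", drop = FALSE])   # scale the metadata column(s)
pca_in <- cbind(dat, extra)                        # append to the main data
pca    <- prcomp(pca_in, center = FALSE, scale. = FALSE)  # data already scaled
summary(pca)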
I'm studying the NDVI (normalized difference vegetation index) behaviour of some soils and cultivars. My database has 33 days of acquisition, 17 kinds of soil and 4 different cultivars. I have built it in two different ways, which you can see attached. I am having trouble and errors with both shapes.
The first question is: is a repeated-measures ANOVA the correct way of analysing my data? I want to see if there are any differences between the behaviours of the different cultivars and the different soils. I've run an ANOVA for each day and there are statistically significant differences on each day, but the results are not globally interesting, since I would like to investigate the whole-year behaviour.
The second question then is: how can I perform it? I've tried different tutorials, but I got unexpected errors or didn't manage to complete the analysis.
Last but not least: I'm coding in RStudio.
Any help is appreciated. I'm still new to statistics but really interested in improving!
horizontal database
vertical database
I believe you can use the ANOVA, but as always, you have to know if that really is what you're looking for. Either way, since this is a platform for programming questions, I'll write code that should work for the vertical version. However, since I don't have your data, I can't know for sure (for future reference, dput(data) creates easily importable code for those trying to answer you).
summary(aov(suolo ~ CV, data = data))
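If a repeated-measures structure is what you are after (NDVI measured on the same plots across the 33 days), something along these lines might be closer; every column name used here (NDVI, cultivar, soil, day, plot) is an assumption about the vertical table, so adjust them to your actual names.
# treat the design variables as factors
data$day      <- factor(data$day)
data$cultivar <- factor(data$cultivar)
data$soil     <- factor(data$soil)
data$plot     <- factor(data$plot)
# repeated-measures ANOVA: day is the within-plot (repeated) factor
fit <- aov(NDVI ~ cultivar * soil * day + Error(plot/day), data = data)
summary(fit)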
I am working with the 'indicspecies' package (the multipatt function) and am unable to extract the summary values. Unfortunately I can't print the whole summary and am left with incomplete information for my model. The reason is the huge amount of data that needs to be printed from the summary (300,000 different species, 3 groups, 6 comparable combinations).
This is what happens when the summary is saved (preceding code included):
x <- multipatt(data, ...)
sumx <- summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based around the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems unresolved (and it is not about exactly the same function, rather the underlying indval function).
I was wondering if anyone has experience with the indicspecies package and knows a way to extract the info from the summary.
It is possible to extract significance and other information from the other saved data from the model, but it might be nice to just get a quick complete overview from the data.
ps. I tried
options(max.print=1000000)
but this didn't solve it for me.
I used to capture the summary output for a multipatt object, but don't any more because the p-values reported are not corrected for multiple testing. To answer the OP's question: you can capture the summary output using capture.output, e.g.
dat.multipatt.summary <- capture.output(summary(dat.multipatt, indvalcomp=TRUE))
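If the goal is simply to get the full text out despite its size, the captured character vector can then be written to a file (the file name is arbitrary):
writeLines(dat.multipatt.summary, "multipatt_summary.txt")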
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the summary output actually isn't helpful. To be clear ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer for how to correct the p-values here https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
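For completeness, a minimal sketch of that correction, assuming x is the multipatt object and that its sign table contains the p.value column described in the help text quoted above:
# adjust the permutation p-values for multiple testing (here: FDR)
x$sign$p_adj <- p.adjust(x$sign$p.value, method = "fdr")
head(x$sign[order(x$sign$p_adj), ])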
I don't have any experience with this package and since you haven't provided the data, it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check the object.size or class or something else of x to see if it indeed has any content.
Also, instead of accessing all the contents of summary(x) together, you can use @ to access slots of it (similar to $ in a data frame).
If you need further assistance, it'd be better to provide at least a small subset or some other sample data so that the community can work with it.
I'm trying to create a formula in R, of the form
Output~Var1+Var2+Var3
For use in a model. The way it seems to work is that you give the name of the variable you want to predict, a tilde, then the names of the variables you want to use as predictors, and in a later argument you give the data frame containing observations of those variables. The data frame I'm using, however, has quite a few variables in it, and I don't want to type them all out. These variables also change names relatively frequently, so it would be an effort to keep updating my code. In essence, I want to know how to write
Output~(All the variables that aren't the output)
although I also need to exclude some other variables.
Sorry to make it quite so obvious that I don't know what's going on; ?formula didn't help much, and this isn't like any other programming or R structure I've seen before.
Thanks for any help,
N
Ah, I found a much better solution: the function
reformulate(termlabels = colnames(InputTable), response = 'Prediction')
Will create a formula from the strings you provide. Manipulate colnames as you like to dynamically choose which variables are used in the model.
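A small usage example of that idea, excluding the response plus any other unwanted columns (drop_these is a hypothetical vector of names to leave out):
drop_these <- c("Prediction", "ID", "Notes")             # columns to exclude
predictors <- setdiff(colnames(InputTable), drop_these)  # everything else
fmla <- reformulate(termlabels = predictors, response = "Prediction")
fmla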
Actually, the ?formula documentation provides one possible answer. It is, however, extremely 'hacky', and one of the least pleasant ways I can imagine accomplishing this:
## Create a formula for a model with a large number of variables:
xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
i.e., you just paste together a string and use that as your formula.