Comparing data from two different samples (social science data) - scale

Dear Stackoverflow community,
I would really appreciate your advice in the following matter.
I would like to compare two different data sets from a social science survey with each other: one group received no treatment, the other group received a treatment. The sociodemographics are broadly similar (gender, professional background), except for age: the participants in one group are slightly younger than in the other. I measured several items on Likert scales (ordinal data from 1 to 5).
At this stage, I am using an unpaired Wilcoxon rank sum test to compare the two data sets, as I consider them independent of each other. Do you think this is the right approach here? I looked up a few medical studies that compared different treatment groups and they used, e.g., ANOVAs. Maybe that would be a better approach?
I would very much appreciate your support. Many thanks!
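A minimal sketch of that comparison in R, assuming a hypothetical data frame survey with a group column (control/treatment) and one Likert item item1 (the column names and the simulated responses are assumptions, not from the original post):

# Hypothetical data: two independent groups answering a 1-5 Likert item
set.seed(42)
survey <- data.frame(
  group = rep(c("control", "treatment"), each = 40),
  item1 = c(sample(1:5, 40, replace = TRUE, prob = c(.3, .3, .2, .1, .1)),
            sample(1:5, 40, replace = TRUE, prob = c(.1, .2, .2, .3, .2)))
)

# Unpaired (two-sample) Wilcoxon rank sum test, i.e. Mann-Whitney U
wilcox.test(item1 ~ group, data = survey, exact = FALSE)

Here exact = FALSE simply requests the normal approximation explicitly, which is what R falls back to anyway when a 1-5 scale produces many ties.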

Related

How to ensure correct scale of different categories in R

I have a dataset where I have a simple male/female breakdown, a category (say A, B or C), some kind of location to give me more data points and then a count for each one. E.g.
[Example table "Basic sample" not reproduced here]
Obviously, performing any kind of analysis on this is fairly meaningless at the moment because the number of males is far higher than the number of females: a count of 7 males means something very different from a count of 7 females as it currently stands. The examples I can find online for standardising counts are a bit too simple and apply to the whole dataset wholesale, rather than breaking it down within a particular category. I am looking to do this in R to give me more options when it comes to analysing larger datasets, and I am frustratingly still waiting for my R training!
I have tried this manually and using tutorials online, but they are too basic for my data.
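One simple way to do the within-category standardisation in base R, sketched with a made-up data frame counts whose column names (gender, category, location, count) are assumptions based on the description above:

# Hypothetical data mirroring the described layout
counts <- data.frame(
  gender   = c("M", "M", "M", "F", "F", "F"),
  category = c("A", "B", "C", "A", "B", "C"),
  location = "Site1",
  count    = c(70, 20, 7, 10, 4, 7)
)

# Express each count as a proportion of that gender's total,
# so 7 males and 7 females are no longer treated as equivalent
totals <- ave(counts$count, counts$gender, FUN = sum)
counts$prop_within_gender <- counts$count / totals

counts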

Network model with much more variables than samples

Dear stackoverflow forum,
this is more of a background question and I hope someone finds the time to give me some advice.
Over the last few weeks I have been learning how to create a food web model based on species abundances (obtained by analysing genomic sequences from several sites).
Given that this project was my actual start with this topic (i.e. coding, network modelling), I read a great deal but could only understand a small part of it. Now I finally have the data, and even after filtering it as much as is reasonable there are more than 300 species, but only 27 samples (not all species are present in every sample) and only 1-2 environmental parameters.
My first intention was to produce a food web showing the strength and direction of each interaction, because the goal is to gain knowledge about an uncharted biotope. Do you think it is possible to create a statistically reliable food web (with R), or at least a co-occurrence network, based on so little information? I have my doubts, for example because working with the robust lm function would force me to restrict the number of species to 27 (the number of samples).
If yes, a hint on how to do it, or some literature, would make my day.
If this is completely the wrong place for this type of question, just tell me and I will delete it, but a recommendation for a better forum would be nice, maybe stats.stackexchange?
Many thanks in advance
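As a starting point, the co-occurrence network is the more modest of the two goals. Below is a minimal sketch using a hypothetical species-by-sample abundance matrix and the igraph package; the simulated data and the 0.6 correlation threshold are assumptions, not recommendations:

library(igraph)

# Hypothetical abundance matrix: 27 samples (rows) x 50 species (columns)
set.seed(1)
abund <- matrix(rpois(27 * 50, lambda = 3), nrow = 27,
                dimnames = list(NULL, paste0("sp", 1:50)))

# Pairwise Spearman correlations between species across samples
co <- cor(abund, method = "spearman")

# Keep only strong positive associations; the threshold is arbitrary here
co[co < 0.6] <- 0
diag(co) <- 0

# Undirected, weighted co-occurrence network
g <- graph_from_adjacency_matrix(co, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
plot(g, vertex.size = 5, vertex.label.cex = 0.6)

Note that a co-occurrence network of this kind only shows association, not the direction or mechanism of interaction that a food web would require.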

What's the difference between one-way ANOVA and one-way ANOVA for repeated measures in R

For example,
one-way:
aov.res2 <- aov(mark ~ teacher, data=my_data2)
one-way for repeated measures:
aov(mark ~ teacher + Error(essay/teacher), data=my_data2)
What's the difference between teacher + Error(essay/teacher) and teacher alone?
1. Why add Error() after teacher, and what does it mean?
2. Why, inside Error(), do we use essay/teacher and not essay * teacher?
First, Stack Overflow likely isn't the best place to ask such a theoretical question. There are other sites that lend themselves better to your question. Try Cross Validated.
Having studied statistics extensively, I will give you a high level answer and then direct you to look for more details in textbooks or elsewhere online.
Let's make sure we understand what repeated measures data is. An example of such data would be measuring the blood pressure of a patient every day for a week. Hence we have several "repeated" measurements from one subject. If we did this for many patients/subjects, we then have repeated measures data.
Repeated measures data is inherently different from other data because we expect that the data we observe from the same subject, say over time, will be correlated. (Referring to our previous example, we expect that the blood pressure of a patient tomorrow will be related to the blood pressure of that same patient today.) If you have repeated measures data but don't model it as such, you are leaving out important information about how the data might be related within a subject. Modeling the data properly will then give you a more complete and accurate view, particularly in the variance. Said another way, the data collected from one patient does not vary the same way that the data varies between patients.
Hopefully this helps you understand the nuances of the two methods in question. Certainly I have not explicitly detailed the coding syntax, but I hope that this answer will help you understand why they are different. Once you understand the theory better, your questions will likely change and be more specific. Good luck!
I found the answer:
The experimental unit is measured multiple times, so there is a within-group factor. The within-group factor is marked explicitly with the Error() term in the form shown above,
where "teacher" is the within-group factor and "essay" is the ID of the experimental unit.

Which cluster methodology should I use for a multidimensional dataset?

I am trying to create clusters of countries with a dataset quite heterogeneous (the data I have on countries goes from median age to disposable income, including education levels).
How should I approach this problem?
I read some interesting papers on clustering, using K-means for instance, but it seems those algorithms are mostly used when there are only a handful of variables, not 30 like in my case, and when the variables are comparable (it might be tough to cluster countries with such diversity in the data).
Should I normalise some of the data? Should I just focus on fewer indicators to avoid this multidimensional issue? Use spectral clustering first?
Thanks a lot for the support!
Create a "similarity metric". Probably just a weight to all your measurements, but you might build in some corrections for population size and so on. Then you can only have low hundreds of countries, so most brute force methods will work. Hierarchical clustering would be my first point of call, and that will tell you if the data is inherently clustered.
If all the data is quantitative, you can normalise on 0 - 1 (lowest country is 0, highest is 1), then take eigenvectors. Then plot out the first two axes in eigenspace. That will give another visual fix on clusters.
If it's not clustered, however, it's better to admit that.
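A minimal sketch of that recipe in R, using a made-up country-by-indicator matrix (the data, the 0-1 rescaling and the Ward linkage choice are assumptions):

# Hypothetical matrix: 100 countries x 30 indicators on different scales
set.seed(3)
X <- matrix(rnorm(100 * 30), nrow = 100,
            dimnames = list(paste0("country", 1:100), paste0("ind", 1:30)))

# Rescale every indicator to 0-1 so no single variable dominates the distance
X01 <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# Hierarchical clustering on Euclidean distances
hc <- hclust(dist(X01), method = "ward.D2")
plot(hc, labels = FALSE, main = "Country dendrogram")

# PCA ("eigenvectors"): plot the first two components to eyeball clusters
pc <- prcomp(X01)
plot(pc$x[, 1:2], pch = 19, xlab = "PC1", ylab = "PC2")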

certain levels of categorical variables insignificant

I was working on a multiple regression model that predicts the amount of insurance claims based on certain factors. One such (categorical) factor is the room type the person has access to as part of the insurance package (e.g. a VIP room). The problem is that a few room types have a high variability in claims, which makes them insignificant predictors (p-values as high as 0.6 for those levels). My suggestion is to create two separate models, one with room type as a predictor and one without. If a person belongs to one of the room types with high variability, the model without room type as a predictor would be used; otherwise the better-fitting model (the one with the higher adjusted R^2) would be used.
My question is, is there something incorrect with this procedure?
Thank you.
I don't know how many room types you have there, but it may be that some categories have very low volume compared to the others. If that's the case, I would try combining types with similar characteristics into new categories. That may increase the volume and make them significant.
It's hard to suggest things without seeing the data.
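A sketch of that level-combining idea in base R, with a hypothetical claims data frame (the column names, the 20-observation cutoff and the simulated data are assumptions):

# Hypothetical claims data with a sparse room-type factor
set.seed(5)
claims <- data.frame(
  amount    = rexp(200, rate = 1/1000),
  room_type = factor(sample(c("Standard", "Deluxe", "VIP", "Suite"),
                            200, replace = TRUE, prob = c(.6, .3, .05, .05)))
)

# Merge levels with fewer than, say, 20 observations into "Other"
tab  <- table(claims$room_type)
rare <- names(tab)[tab < 20]
levels(claims$room_type)[levels(claims$room_type) %in% rare] <- "Other"

# Refit the regression with the collapsed factor
fit <- lm(amount ~ room_type, data = claims)
summary(fit)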

Resources