Investigate if scores of a Likert item actually differ from one another? - r

My dataset contains a Likert item recording how energetic the participants felt at that moment, rated from 0 to 6, where 0 = not energetic at all and 6 = very energetic. I have to investigate whether these scores actually differ from one another based on the data. If 0 and 1 do not differ from each other, I have to combine these two levels into one, and so on. So in the end I might have 2 or 4 levels instead of 6.
I have tried applying classification algorithms to the data to see if a model classifying '0' would give an error rate when classifying '1'. Unfortunately, this did not work as I wanted. Is this actually possible?
My question is if someone knows how I can best investigate if there is indeed a difference between those 6 levels or whether I can combine some of them based on differences (or not) in the data of those levels.

Related

Pavlidis Template Matching (PTM, DataVisEasy) R function with 3 levels

I need to perform a correlation analysis on a data frame constructed as follows:
Rows: Features --> gene variants related to different levels of severity of the disease we are studying, in the format of a Boolean matrix
Columns: Observations --> List of patients;
The discriminant of my analysis is, thus, the severity marked as follows:
A: less severe than expected
B: as severe as expected
C: more severe than expected
Suppose I have a lot more features than observations and I want to use the PTM function with a three-level annotation (i.e. A,B,C) as a match template. The function requires you to set the annotation.level.set.high parameter, but it's not clear to me how it works. For example, if I set annotation.level.set.high='A', does that mean I'm making a comparison between A vs B&C? So I can only do a comparison between two groups/classes even if I have multiple levels? Because my goal is to compare all levels with each other (i.e. A vs B vs C), but it is not clear to me how to achieve this comparison, if it is possible.
Thanks

Why exact matching with MatchIt R package finds matched pairs that have 2 different levels of categorical variable?

I'm currently working on tuna tag-recapture data. I want to balance my sampling between two groups of individuals: the ones that were tagged in the reference area (Treated group) and the ones that were tagged outside this area (Control group). To do this, I used the MatchIt package.
I have 3 covariates: length (by 5 cm bins), month of tagging (January to December) and structure on which the tuna was tagged.
So the model is: treatment ~ length + month + structure
This last variable is a categorical variable with 5 levels coded as A to E. Level A is represented almost only in the Treated group (6000 individuals with structure = A, vs. only 300 individuals with structure = A in the Control group).
I first used the nearest neighbour method, but the improvement in balance was not satisfying. So I ran exact and Coarsened Exact Matching methods.
I thought that exact methods should match pairs with the same values for each covariate. But in the output matched data, there are still more than 3000 individuals with structure = A in the treated group.
Does anyone have an explanation? I read a lot but didn't find an answer.
Thanks
Exact and coarsened exact matching do not perform 1:1 matching. They find all members in the control group that exactly match each member in the treated group. Subclasses are formed based on each combination of the predictor values, and any subclass that has both treated and control units is retained, and others dropped. There is no pairing that takes place. Your results indicate that you have many control units that have identical (or near-identical in the case of CEM) values of the covariates as some treated units.
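The subclass logic described above can be imitated in base R to see why no pairing happens. This is only a sketch on made-up data (the `tags` data frame and its values are hypothetical, and it does not call MatchIt itself): every unique covariate combination becomes a stratum, and a stratum is kept whenever it holds at least one treated and one control unit.

```r
# Hypothetical toy data mirroring the covariates in the question.
tags <- data.frame(
  treated   = c(1, 1, 1, 1, 0, 0, 0, 1, 0),
  length    = c(50, 50, 50, 55, 50, 50, 55, 60, 65),
  month     = c(1, 1, 1, 2, 1, 1, 2, 3, 3),
  structure = c("A", "A", "A", "B", "A", "B", "B", "A", "C")
)

# One stratum per observed combination of covariate values.
tags$stratum <- interaction(tags$length, tags$month, tags$structure,
                            drop = TRUE)

# Count treated and control units in each stratum.
n_treated <- tapply(tags$treated == 1, tags$stratum, sum)
n_control <- tapply(tags$treated == 0, tags$stratum, sum)

# Keep every stratum that contains both groups; drop the rest.
keep    <- names(n_treated)[n_treated > 0 & n_control > 0]
matched <- tags[tags$stratum %in% keep, ]

# Three treated units with structure = A survive alongside one control.
print(table(matched$stratum, matched$treated))
```

In this toy example the stratum with structure = A keeps three treated units and a single control unit, which is the same pattern as the 3000 treated individuals in the question: group sizes inside a retained stratum are not equalised.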

R factor and level

Levels make sense: they are the unique values of the vector. But I can't get my head around what a factor is. It just seems to repeat the vector values.
factor(c(1,2,3,3,4,5,1))
[1] 1 2 3 3 4 5 1
Levels: 1 2 3 4 5
Can anyone explain what a factor is supposed to do, or why I would use it?
I'm starting to wonder if factors are like a code table in a database, where the factor name is the code table name and the levels are the unique options of the code table?
A factor is stored as a vector of integer codes plus a lookup table of levels (conceptually like a hash table), rather than as a raw character vector. What does this imply? There are two major benefits.
Much smaller memory footprint. Consider a text file containing the phrase "New Jersey" 100,000 times over, encoded in ASCII. Now imagine if you just had to store the number 16 (in binary) 100,000 times and then another table indicating that 16 means "New Jersey". It's leaner and faster.
Especially for visualization and statistical analysis, frequently we test for values "across all categories" (think ANOVA or what you would color a stacked barplot by). We can either repeatedly encode all of our functions to stack up observed choices in a string vector or we can simply create a new type of vector which will tell you what the valid choices are. That is called a factor, and the valid choices are called levels.
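A small sketch of both points, using made-up state abbreviations. In R the lookup takes the form of integer codes plus a `levels` attribute, and a level that is declared but never observed still shows up when tabulating.

```r
# Made-up example values; "CT" is a valid choice that never occurs.
x <- c("NJ", "NY", "NJ", "PA", "NJ")
f <- factor(x, levels = c("NJ", "NY", "PA", "CT"))

levels(f)       # the lookup table of valid choices
as.integer(f)   # the integer codes actually stored for each element
typeof(f)       # the underlying storage type is integer

# Tabulating a factor reports every level, including the empty one.
counts <- table(f)
print(counts)

# The original strings can be rebuilt from the codes and the levels.
rebuilt <- levels(f)[as.integer(f)]
```

This is also close to the code-table analogy in the question: the levels play the role of the code table and the integer codes are the keys into it.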

How to create contingency table with multiple criteria subpopulation from weighted data using svyby in the survey package?

I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted=svrepdesign(data=els, repweights = ~els$F3F1PNLWT,
combined.weights = TRUE)
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted out into a contingency table that shows frequencies. I would like to sort people based on four variables (none of which are binary, but all of which are numeric). This is what I would like to get:
[image: desired table layout]
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1FCOMP for brevity's sake, but it has 9 outcomes, indexed 1-9.)
My current code: svyby(~F1FCOMP, ~F1RTRCC +BYS33C +F1A10 +byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination. This is a picture of what my current output looks like. There are hundreds more rows, one for each combination, when I keep scrolling down.
You can see in that picture that I only get one number for F1FCOMP per row: the number of cases that fit the specified combination, i.e. a specific subpopulation. I want to know more about that subpopulation. That is, F1FCOMP has nine different outcomes (indexed 1-9), and I want to see how many of each subpopulation fit into each of the 9 outcomes of F1FCOMP.
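One possible direction, sketched on the `api` example data that ships with the survey package rather than on the ELS file: restrict the design with `subset()` first, then cross-tabulate with `svytable()`. The variables `stype`, `awards` and `sch.wide` are stand-ins chosen for illustration (they are not ELS variables), and the design here is built with `svydesign()` instead of replicate weights. As for the parse error quoted above, R does not allow chained `==` comparisons without parentheses, which is what `F1RTRCC==3 +BYS33C==1` amounts to.

```r
library(survey)

# Example data bundled with the survey package (not the ELS data).
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Restrict the design to the subpopulation of interest. Conditions are
# combined with & and %in%, so no chained `==` reaches the parser.
sub <- subset(dclus1, stype %in% c("E", "M") & awards == "Yes")

# Weighted cross-tabulation: rows are school types, columns are the
# outcomes of sch.wide (standing in for the nine outcomes of F1FCOMP).
tab <- svytable(~stype + sch.wide, sub)
print(round(tab))
```

The outcome variable appears in the formula as a categorical variable, so each of its categories gets its own column instead of being summed into a single number per row.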

Run nested logit regression in R

I want to run a nested logistic regression in R, but the examples I found online didn't help much. I read over an example from this website (Step by step procedure on how to run nested logistic regression in R) which is similar to my problem, but I found that it seems not resolved in the end (The questioner reported errors and I didn't see more answers).
So I have 9 predictors (continuous scores), and 1 categorical dependent variable (DV). The DV is called "effect", and it can be divided into 2 general categories: "negative (0)" and "positive (1)". I know how to run a simple binary logit regression (using the general grouping way, i.e., negative (0) and positive (1)), but this is not enough. "positive" can be further grouped into two types: "physical (1)" and "mental (2)". So I want to run a nested model which includes these 3 categories (negative (0), physical (1), and mental (2)), and reflects the nature that "physical" and "mental" are nested in "positive". Maybe R can compare these two models (general vs. detailed) together? So I created two new columns, one is called "effect general", in which the individual scores are "negative (0)" and "positive (1)"; the other is called "effect detailed", which contains 3 values - negative (0), physical (1), and mental (2). I ran a simple binary logit regression only using "effect general", but I don't know how to run a nested logit model for "effect detailed".
From the example I searched and other materials, the R package "mlogit" seems right, but I'm stuck with how to make it work for my data. I don't quite understand the examples in R-help, and this part in the example from this website I mentioned earlier (...shape='long', alt.var='town.list', nests=list(town.list)...) makes me very confused: I can see that my data shape should be 'wide', but I have no idea what "alt.var" and "nests" are...
I also looked at page 19 of the mlogit manual for examples of nested logit model calls. But I still cannot decide what I need in terms of options. (http://cran.r-project.org/web/packages/mlogit/mlogit.pdf)
Could someone provide me with detailed steps and notes on how to do it? I'm sure this example (if well discussed and resolved) is also going to help me and others a lot!
Thanks for your help!!!
I can help you with understanding the mlogit structure. When using the mlogit.data() command, specify choice = yourchoicevariable (and id.var = respondentid if you have a panel dataset, i.e. you have multiple responses from the same individual), along with the shape='wide' argument. The new data.frame created will be in long format, with a line for each choice situation: negative, physical, mental. So you will have 3 rows where you only had one in the wide data format. Whatever your MN choice var is, it will now be a column of logical values, with TRUE for the row that the respondent chose. The row names will now be in the format observation#.level(choice variable). So in your case, if in the first row of your dataset the person had a response of negative, you would see:
row.name | choice
1.negative | TRUE
1.physical | FALSE
1.mental | FALSE
Also note that the actual factor level for each choice is stored in an index called alt of the mlogit.data.frame, which you can see with index(your.data.frame), and the observation number (i.e. the row number from your wide format data.frame) is stored in chid. That is in essence what the row.name is telling you, i.e. chid.alt. Also note you DO NOT have to specify alt.var if your data is in wide format, only in long format. The mlogit.data function does that for you as I have just described. Essentially, it takes unique(choice) when you specify your choice variable and creates the alt.var for you, so it is redundant if your data is in wide format.
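To make the wide-to-long expansion concrete, here is a base R imitation of the layout described above. It is only a sketch of the resulting shape and does not call mlogit; the data frame `wide` and the predictor `score1` are made up for illustration.

```r
# Hypothetical wide data: one row per respondent.
alts <- c("negative", "physical", "mental")
wide <- data.frame(
  effect = factor(c("negative", "mental", "physical"), levels = alts),
  score1 = c(0.2, 1.5, -0.3)
)

# Repeat every respondent once per alternative.
n    <- nrow(wide)
long <- wide[rep(seq_len(n), each = length(alts)), , drop = FALSE]

# chid is the observation number, alt is the alternative for that row.
long$chid <- rep(seq_len(n), each = length(alts))
long$alt  <- factor(rep(alts, times = n), levels = alts)

# Logical choice column: TRUE only on the row the respondent chose.
long$choice <- as.character(long$effect) == as.character(long$alt)

# Row names follow the chid.alt pattern.
rownames(long) <- paste(long$chid, long$alt, sep = ".")
print(long[, c("chid", "alt", "choice", "score1")])
```

Each respondent contributes three rows and exactly one TRUE in `choice`, matching the small table shown above.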
You then specify the nests by adding to the mlogit() command a named list of the nests like this, assuming your factor levels are just '0','1','2':
mlogit(..., nests = list(negative = c('0'), positive = c('1','2')))
or if the factor levels were 'negative','physical','mental' it would be like this:
mlogit(..., nests = list(negative = c('negative'), positive = c('physical','mental')))
Also note that a nest of one still MUST be specified with a c() argument per the package documentation. The resulting model will then have a single iv estimate shared by all nests if you specify un.nest.el = TRUE, or nest-specific estimates if un.nest.el = FALSE.
You may find Kenneth Train's Examples useful
