How to change a class variable into an ordered variable in R? - r

Hey guys I am trying to calculate the p value of individual variables to see if they have an impact when the other variable is set to 0. Here is my code:
quiet_result = aov(overbearing ~ as.factor(Intention)*as.factor(quiet_only), data=df)
summary(quiet_result)
loud_result = aov(overbearing ~ as.factor(Intention)*as.factor(loud_only), data = df)
summary(loud_result)
For context, the intention variable only has the values of -1 and 1. -1 is intentional and 1 is intentional. Quiet_only and loud_only are new columns created from a data set. quiet_only only has the values of 0 and 2 and it is the original column of sound + 1, and loud_only only has the values of -2 and 0 because it is only the original column of sound - 1. Therefore these are all ordered variables and they are not supposed to be assessed by their actual numerical value like a class variable. However, my code keeps reading it as a class variable even though I changed all the variables to factors to make them ordered variables. Therefore, when I run anova on them, they all return the same result. I am wondering how I can change the variables to make them into ordered variables because the anova is only reading the change between the intention and quiet_only/loud_only columns, which would obviously return the same anovas because there is no actual change if you subtract or add 1 to a column. Therefore, I'm trying to find the p value of the intention variable with loud_only and quiet_only and this p value should change depending on whether I use loud_only or quiet_only.
Sorry if this doesn't make any sense lol. This is research work for a graduate professor so it uses concept that I don't fully understand (I'm undergrad) so I don't think I explained it very well. Anyways, if any of you have any ideas that would be great.

Related

"grouping factor must have exactly 2 levels"

Hi y'all I'm fairly new to R and I'm supposed to calculate F statistic for this table
The code I have inputted is as follows:
# F-test
res.ftest <- var.test(TotalLength ~ SwimSpeed , data = my_data)
res.ftest
I know I have more than two levels from the other posts I have read online, but I am not sure what to change to get the outcome I want.
FIRST AND FOREMOST...If you invoke
?var.test()
you will note that the S3 version you called assumes lhs is numeric and rhs is a 2-level factor.
As for the rest, while I don't know the words to your specific work/school assignment here, the words shouldn't be "calculate an F-test", exactly. They should be "analyze these data appropriately". While there are a number of routes you could take, this is normally seen as a regression problem, NOT a problem of trying to compare two variances/complete a 1-way ANOVA which is what var.test() is designed to do. (Reading the documentation at, for example, https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/var.test should make this clear and is something you should always do when invoking R procedures.)
Using a subset of your data (please do this yourself for stack helpers next time rather than make someone here do it for you)...
df <- data.frame(
ID = 1:4,
TL = c(27.1,29.0,33.0,29.3),
SS = c(86.6,62.4,63.8,62.3)
)
cor.test(df$TL,df$SS) # reports t statistic
# or
summary(lm(df$TL ~ df$SS)) # reports F statistic
Note that F is simply t^2 here in the 2 variable case.
Lastly, I should add it is remotely, vaguely possible the assignment is to check if the variances of the 2 distributions are equal even though I can see no reason why anyone would want to know considering they are 2 different measures on two different underlying scales measuring 2 different things. However,
var.test(df$TL, df$SS)
will return a "result" should you take the assignment to mean compare the observed variances.

Get next level from a given level factor

I am currently making my first steps using R with RStudio and right now I am struggling with the following problem:
I got some test data about a marathon with four columns, where the third column is a factor with 15 levels representing different age classes.
One age class randomAgeClass will be randomly selected at the beginning, and an object is created holding the data that matches this age class.
set.seed(12345678)
attach(marathon)
randomAgeClass <- sample(levels(marathon[,3]), 1)
filteredMara <- subset(marathon, AgeClass == randomAgeClass)
My goal is to store a second object that holds the data matching the next higher level, meaning that if age class 'kids' was randomly selected, I now want to access the data relating to 'teenagers', which is the next higher level. Looking something like this:
nextAgeClass <- .... randomAgeClass+1 .... ?
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
Note that I already found this StackOverflow question, which seems to partially match my situation, but the accepted answer is not understandable to me, thus I wasn't able to apply it to my needs.
Thanks a lot for any patient help!
First you have to make sure thar the levels of your factor are ordered by age:
factor(marathon$AgeClass,levels=c("kids","teenagers",etc.))
Then you almost got there in your example:
next_pos<-which(levels(marathon$AgeClass)==randomAgeClass)+1 #here you get the desired position in the level vector
nextAgeClass <- levels(marathon$AgeClass) [next_pos]
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
You might have a problem if the randomAgeClass is the last one, so make sure to avoid that problem

Making a histogram

this sounds pretty basic but every time I try to make a histogram, my code is saying x needs to be numeric. I've been looking everywhere but can't find one relating to my problem. I have data with 240 obs with 5 variables.
Nipper length
Number of Whiskers
Crab Carapace
Sex
Estuary location
There is 3 locations and i'm trying to make a histogram with nipper length
I've tried making new factors and levels, with the 80 obs in each location but its not working
Crabs.data <-read.table(pipe("pbpaste"),header = FALSE)##Mac
names(Crabs.data)<-c("Crab Identification","Estuary Location","Sex","Crab Carapace","Length of Nipper","Number of Whiskers")
Crabs.data<-Crabs.data[,-1]
attach(Crabs.data)
hist(`Length of Nipper`~`Estuary Location`)
Error in hist.default(Length of Nipper ~ Estuary Location) :
'x' must be numeric
Instead of correct result
hist() doesn't seem to like taking more than one variable.
I think you'd have the best luck subsetting the data, that is, making a vector of nipper lengths for all crabs in a given estuary.
crabs.data<-read.table("whatever you're calling it")
names<-(as you have it)
Estuary1<-as.vector(unlist(subset(crabs.data, `Estuary Loc`=="Location", select = `Length of Nipper`)))
hist(Estuary1)
Repeat the last two lines for your other two estuaries. You may not need the unlist() command, depending on your table. I've tended to need it for Excel files, but I don't know what format your table is in (that would've been helpful).

"Factor codes of type double or float detected" when using read.dta13

I am using read.dta13 packages to load data. There are a bunch of categorical variables with Stata values labels in the data set. The data set looks like below in Stataļ¼š
cohort year age gender income health migration
1101 2010 35 F 13034 healthy yes
1102 2010 54 M 34134 unhealthy no
For gender, health and migration, the original values are numeric, for example, gender = 1 for male. In Stata, for the convenience of understanding, I add value labels for categorical variables using label define, so it shows as above. But the original values are kept. Now let's go to R. If I simply type
mydata <- read.dta13("mydata_stata13.dta")
I get a lot of warnings like these
Factor codes of type double or float detected - no labels assigned.
Set option nonint.factors to TRUE to assign labels anyway.
All the value labels I add in Stata will be dropped, which is what I need in R. The problem is that R gives warnings even for some variables that should be taken as numeric, for example income. I don't want to set nonint.factor = TRUE since I need the numeric values of the categorical variables for the calculation.
It's not actually an error, but I would like to know whether it is safe to just ignore the warnings.
As the warning states, there are doubles or floats with labels assigned. This is because I assumed you created a categorical variable without specifying Stata to store it as a byte. readstata13 gives you a warning because it is not sure if floats/doubles with value labels are categorical or continuous variables.
Let's say gender is the wrongly stored variable, I assumed the person who coded the variables in stata created it as:
gen gender = *expr*
instead of
gen byte gender = *expr*
This can be solved either by always prefixing categorical variables with gen byte or by using compress (see Stata's manual) before saving/exporting the whole dataset. You can detect which variables are wrongly coded using describe and checking value label assignment in non-byte-variables. This will in turn will store your data efficiently.
In addition, I assume that for some reason the same person accidentally added a value label to a "true" float variable, like income at some point. Check labelbook command to correct such problems.

Is there anything like numerical variable with labels?

I have a numerical variable with discrete levels, that have a special meaning for me, e.g.
-1 'less than zero'
0 'zero'
1 'more than zero'
I know, that I can convert the variable as factor/ordinal and keep the labels, but then the numerical representation of the variable would be
1 'less than zero'
2 'zero'
3 'more than zero'
which is useless for me. I cannot afford having two copies of the variable, because of memory constraints (it is a very big data.table).
Is there any standard way of adding text labels to certain levels of the numerical (possibly integer) variable, so that I can get a nice looking frequency tables just like if it was a factor, and simultaneously being able to treat it as the source numerical variable with values untouched?
I'm going to say the answer to your questions is "no". There's no standard or built-in way of doing what you want.
Because, as you note, factors have positive non-zero integer codes, and integers can't be denoted by label strings in a vector. Not in a "standard" way anyway.
So you will have to do the labelling yourself, in whatever outputs you want to present, manually.
Any tricks like keeping your data (once) as a factor and subtracting a number to get the negative values you need (presumably for your analysis) will make a copy of that data. Keep the numbers, do the analysis, then do replacement with the results (which I presume are tables and plots and so aren't as big as the data).
R also doesn't have an equivalent to the "enumerated type" of many languages, which is one way this can be done.
You could use a vector. Would that work?
var <- c(-1,0,1)
names(var) <- c("less than zero", "zero", "more than zero")
that would give you
> var
less than zero zero more than zero
-1 0 1
Hope that helps,
Umberto

Resources