"grouping factor must have exactly 2 levels" - r

Hi y'all, I'm fairly new to R and I'm supposed to calculate an F statistic for this table.
The code I have inputted is as follows:
# F-test
res.ftest <- var.test(TotalLength ~ SwimSpeed , data = my_data)
res.ftest
I know I have more than two levels from the other posts I have read online, but I am not sure what to change to get the outcome I want.

FIRST AND FOREMOST...If you invoke
?var.test
you will note that the formula method you called assumes the left-hand side is numeric and the right-hand side is a two-level factor.
As for the rest, while I don't know the exact wording of your assignment, it probably shouldn't be "calculate an F-test" so much as "analyze these data appropriately". There are several routes you could take, but this is normally treated as a regression problem, NOT a comparison of two variances, which is what var.test() is designed to do. (Reading the documentation, e.g. https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/var.test, should make this clear and is something you should always do before invoking an R procedure.)
Using a subset of your data (please provide a reproducible subset yourself next time rather than make a helper here build it for you)...
df <- data.frame(
  ID = 1:4,
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)
cor.test(df$TL, df$SS)           # reports a t statistic
# or
summary(lm(TL ~ SS, data = df))  # reports an F statistic
Note that F is simply t^2 here in the 2 variable case.
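To see that t^2 = F relationship concretely, here is a quick base-R check on the same toy data (nothing here beyond the functions already used above):

```r
df <- data.frame(
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)

t_stat <- cor.test(df$TL, df$SS)$statistic                     # t from the correlation test
f_stat <- summary(lm(TL ~ SS, data = df))$fstatistic["value"]  # F from the regression

all.equal(unname(t_stat^2), unname(f_stat))  # TRUE
```

This identity only holds in simple regression with one predictor; with more predictors the overall F no longer corresponds to a single t.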
Lastly, it is remotely possible the assignment is to check whether the variances of the two distributions are equal, even though I can see no reason anyone would want to know that, since these are two different measures on two different underlying scales. However,
var.test(df$TL, df$SS)
will return a "result" should you take the assignment to mean compare the observed variances.


Translating a for-loop to perhaps an apply through a list

I have an R question that has kept me from completing several tasks over the last year, though I am relatively new to R. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a for loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)

n <- 100
d <- NULL
col <- c(0, .3, .5)
for (j in 1:length(col)) {
  X.corr <- matrix(c(1, col[j], col[j], 1), nrow = 2, ncol = 2)
  x <- rmvnorm(n, mean = c(0, 0), sigma = X.corr)
  x1 <- x[, 1]
  x2 <- x[, 2]
}
d <- rbind(d, c(j))  # note: sits outside the loop, so only the final j is recorded
Let me describe my code so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables with rmvnorm() at 3 different correlation levels per pass, using 100 observations [toy data to get the coding correct]. d starts out empty. The 3 correlation levels are used as follows: pass 1 uses correlation 0 to create the variables, and then other code runs; pass 2 uses correlation .3 to create 2 new variables, and then other code runs; pass 3 uses correlation .5 to create 2 new variables, and then other code runs. Within my larger code, the for loop gets the job done. The last line puts the number of the correlation into the data frame. I realize that, as presented here, it will only put 1 number into the data frame, but when it is incorporated into my larger code it works as desired, putting 3 different numbers in a single column (1 = 0, 2 = .3, and 3 = .5). To reiterate, the for loop gets the job done, but I believe there is a better way, perhaps something in the apply family. I do not know how to construct this and still access which correlation is being used. Would someone help me develop this little piece of code? Thank you.
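A sketch of an apply-family version, assuming mvtnorm is installed: lapply() over the indices keeps track of which correlation level produced each draw, and the per-pass results are combined once at the end rather than grown inside a loop.

```r
library(mvtnorm)

n <- 100
cors <- c(0, .3, .5)

# One list element per correlation level; j identifies the pass.
sims <- lapply(seq_along(cors), function(j) {
  X.corr <- matrix(c(1, cors[j], cors[j], 1), nrow = 2, ncol = 2)
  x <- rmvnorm(n, mean = c(0, 0), sigma = X.corr)
  # record the pass number and correlation alongside the draws
  data.frame(pass = j, rho = cors[j], x1 = x[, 1], x2 = x[, 2])
})

d <- do.call(rbind, sims)  # 300 rows: 100 per correlation level
```

Because the anonymous function receives the index j, any "other code" that runs per pass can live inside its body and still see which correlation is in use.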

How to change a class variable into an ordered variable in R?

Hey guys I am trying to calculate the p value of individual variables to see if they have an impact when the other variable is set to 0. Here is my code:
quiet_result <- aov(overbearing ~ as.factor(Intention) * as.factor(quiet_only), data = df)
summary(quiet_result)
loud_result <- aov(overbearing ~ as.factor(Intention) * as.factor(loud_only), data = df)
summary(loud_result)
For context, the Intention variable only has the values -1 and 1: -1 is intentional and 1 is unintentional. quiet_only and loud_only are new columns created from the data set: quiet_only only has the values 0 and 2 (it is the original sound column + 1), and loud_only only has the values -2 and 0 (it is the original sound column - 1). These are all ordered variables, and they are not supposed to be assessed by their actual numerical values like class variables. However, my code keeps reading them as class variables even though I changed them all to factors to make them ordered variables. So when I run an ANOVA on them, they all return the same result. I am wondering how I can make them into ordered variables, because the ANOVA is only reading the change between the Intention and quiet_only/loud_only columns, which would obviously return the same ANOVAs, since nothing actually changes if you subtract or add 1 to a column. In short, I'm trying to find the p value of the Intention variable with loud_only and with quiet_only, and this p value should change depending on which one I use.
Sorry if this doesn't make any sense lol. This is research work for a graduate professor, so it uses concepts I don't fully understand (I'm an undergrad), and I don't think I explained it very well. Anyway, if any of you have any ideas, that would be great.
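A sketch with made-up data (the column names follow the question; the values are invented): as.factor() creates an unordered factor, whereas factor(..., ordered = TRUE) or ordered() creates an ordinal one, which aov()/lm() code with polynomial contrasts instead of treatment contrasts. Note, though, that with only two levels an ordered factor fits exactly the same ANOVA as an unordered one, so this alone will not make the quiet_only and loud_only results differ.

```r
set.seed(1)
# Made-up stand-in for the real data frame
df <- data.frame(
  overbearing = rnorm(40),
  Intention   = rep(c(-1, 1), 20),
  quiet_only  = rep(c(0, 2), each = 20)
)

# ordered = TRUE marks the factor as ordinal
df$Intention  <- factor(df$Intention,  levels = c(-1, 1), ordered = TRUE)
df$quiet_only <- factor(df$quiet_only, levels = c(0, 2),  ordered = TRUE)

summary(aov(overbearing ~ Intention * quiet_only, data = df))
```

Ordered factors only change the fitted model (not just its labeling) once a factor has three or more levels, where linear, quadratic, etc. trend contrasts become distinct.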

From Stata to R: recoding bysort and xtreg

I'm very new to R and currently working on a replication project for a meta-research course at my university. The paper examines whether having an in-home display to monitor energy consumption reduces energy usage. I have already recoded 300 lines of code, but now I have run into a problem I could not yet solve.
The source code says: bysort id expdays: egen ave15 = mean(power) if hours0105==1
I do understand what this does, but I cannot replicate it in R. id is the identifier for the examined household and expdays denotes the current day of the experiment. So ave15 is the average power consumption from midnight to 6 am for every household on each day. I figured out that (EIPbasedata is the complete dataset containing hourly data)
EIPbasedata$ave15[EIPbasedata$hours0105 == 1] <- ave(EIPbasedata$power, EIPbasedata$ID, EIPbasedata$ExpDays, FUN=mean)
would probably do the job, but this gives me a warning:
number of items to replace is not a multiple of replacement length
and the results are not right either. I do not have any idea what I could do to solve this.
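A base-R sketch of the likely fix (the data frame below is a made-up stand-in with the question's column names): the warning arises because ave() is evaluated on the full columns while the assignment targets only a subset, so the lengths disagree. Subsetting every argument keeps the lengths equal, which is what the Stata if clause does.

```r
# Toy stand-in for EIPbasedata
EIPbasedata <- data.frame(
  ID        = rep(1:2, each = 4),
  ExpDays   = rep(1:2, times = 4),
  hours0105 = rep(c(1, 0), times = 4),
  power     = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Subset power, ID, and ExpDays to the same rows that receive the result
idx <- EIPbasedata$hours0105 == 1
EIPbasedata$ave15 <- NA
EIPbasedata$ave15[idx] <- ave(EIPbasedata$power[idx],
                              EIPbasedata$ID[idx],
                              EIPbasedata$ExpDays[idx],
                              FUN = mean)
```

Rows outside the hours0105 == 1 window stay NA, matching egen's behavior under an if condition.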
The next thing I struggle to recode is:
xtreg ln_power0105 ihd0105 i.days0105 if exptime==4, fe vce(bootstrap, rep(200) seed(12345))
I think the right way would be using plm, but I'm not sure how to implement the if condition. (days0105 is a running variable for the day number in the experiment, and 0 if not between 0 and 6 am; ihd0105 is a dummy for having an in-home display; exptime denotes 4 am in the morning, though I do not understand what exptime does here.)
table4_1 <- plm(ln_power0105 ~ ihd0105, data = EIPbasedata, index = c("days0105"), model = "within")
How do I compute the bootstrapped standard errors in plm?
I hope some expert can help me, since my R and Stata knowledge is not sufficient for this..
My lecturer provided the answer to me: first I specify a subsample, which I call tmp_data here:
tmp_data <- EIPbasedata[which(EIPbasedata$ExpTime == 4), ]
Then I regress on the tmp_data with as.factor(days0105) values, which is the R equivalent of i.days0105:
tmp_results <- plm(ln_power0105 ~ ihd0105 + as.factor(days0105), data = tmp_data, index = "ID", model = "within")
There are probably better and cleaner ways to do this, but I'm fine with it for now.

Run nested logit regression in R

I want to run a nested logistic regression in R, but the examples I found online didn't help much. I read an example from this website (Step by step procedure on how to run nested logistic regression in R) which is similar to my problem, but it seems it was never resolved (the questioner reported errors and I didn't see more answers).
So I have 9 predictors (continuous scores), and 1 categorical dependent variable (DV). The DV is called "effect", and it can be divided into 2 general categories: "negative (0)" and "positive (1)". I know how to run a simple binary logit regression (using the general grouping way, i.e., negative (0) and positive (1)), but this is not enough. "positive" can be further grouped into two types: "physical (1)" and "mental (2)". So I want to run a nested model which includes these 3 categories (negative (0), physical (1), and mental (2)), and reflects the nature that "physical" and "mental" are nested in "positive". Maybe R can compare these two models (general vs. detailed) together? So I created two new columns, one is called "effect general", in which the individual scores are "negative (0)" and "positive (1)"; the other is called "effect detailed", which contains 3 values - negative (0), physical (1), and mental (2). I ran a simple binary logit regression only using "effect general", but I don't know how to run a nested logit model for "effect detailed".
From the example I searched and other materials, the R package "mlogit" seems right, but I'm stuck with how to make it work for my data. I don't quite understand the examples in R-help, and this part in the example from this website I mentioned earlier (...shape='long', alt.var='town.list', nests=list(town.list)...) makes me very confused: I can see that my data shape should be 'wide', but I have no idea what "alt.var" and "nests" are...
I also looked at page 19 of the mlogit manual for examples of nested logit model calls. But I still cannot decide what I need in terms of options. (http://cran.r-project.org/web/packages/mlogit/mlogit.pdf)
Could someone provide me with detailed steps and notes on how to do it? I'm sure this example (if well discussed and resolved) is also going to help me and others a lot!
Thanks for your help!!!
I can help you with understanding the mlogit structure. When using the mlogit.data() command, specify choice = yourchoicevariable (and id.var = respondentid if you have a panel dataset, i.e. multiple responses from the same individual), along with the shape = 'wide' argument. The new data.frame created will be in long format, with a line for each choice alternative: negative, physical, mental. So you will have 3 rows for each one you had in the wide format. Whatever your choice variable is, it will now be a column of logical values, with TRUE for the row the respondent chose. The row names will now be in the format observation#.level(choice variable). So if in the first row of your dataset the person had a response of negative, you would see:
row.name | choice
1.negative | TRUE
1.physical | FALSE
1.mental | FALSE
Also note that the actual factor level for each choice is stored in an index called alt of the mlogit data.frame, which you can see with index(your.data.frame), and the observation number (i.e. the row number from your wide-format data.frame) is stored in chid. That is in essence what the row name is telling you: chid.alt. Also note you DO NOT have to specify alt.var if your data are in wide format, only in long format. The mlogit.data() function does that for you, as just described: it takes unique(choice) when you specify your choice variable and creates alt.var for you, so it is redundant for wide data.
You then specify the nests by adding to the mlogit() command a named list of the nests, like this, assuming your factor levels are just '0', '1', '2':
mlogit(..., nests = list(negative = c('0'), positive = c('1', '2')))
or, if the factor levels were 'negative', 'physical', and 'mental', like this:
mlogit(..., nests = list(negative = c('negative'), positive = c('physical', 'mental')))
Also note that a nest of one still MUST be specified with a c() argument, per the package documentation. The resulting model will then have the IV (inclusive value) estimate between nests if you specify the un.nest.el = TRUE argument, or nest-specific estimates with un.nest.el = FALSE.
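Putting those steps together, a sketch under stated assumptions: the wide data frame, the effect_detailed column, and the two stand-in predictors score1 and score2 are all hypothetical names (the real model would list all nine scores), and the predictors are individual-specific, so they go after the | in the formula.

```r
library(mlogit)

set.seed(7)
# Made-up wide-format data standing in for the real dataset
mydata <- data.frame(
  effect_detailed = factor(sample(c("negative", "physical", "mental"),
                                  200, replace = TRUE)),
  score1 = rnorm(200),
  score2 = rnorm(200)
)

# Wide -> long: one row per alternative per observation
long_data <- mlogit.data(mydata, choice = "effect_detailed", shape = "wide")

# Nested logit: 'physical' and 'mental' nested inside 'positive';
# the singleton nest still needs c()
fit <- mlogit(effect_detailed ~ 0 | score1 + score2,
              data = long_data,
              nests = list(negative = c("negative"),
                           positive = c("physical", "mental")),
              un.nest.el = TRUE)
summary(fit)
```

Comparing this fit against the plain multinomial model (the same call without nests) with a likelihood-ratio test is one way to check whether the nesting structure is supported by the data.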
You may find Kenneth Train's Examples useful

Does R randomForest's rfcv method actually say which features it selected, or not?

I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct me if that's not the way to use this function). For example,
> data(fgl, package="MASS")
> tst <- rfcv(trainx = fgl[,-10], trainy = fgl[,10], scale = "log", step=0.7)
> tst$error.cv
9 6 4 3 2 1
0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458
In this case, if I understand the result correctly, it seems that we can remove three variables without negative side effects. However,
> attributes(tst)
$names
[1] "n.var" "error.cv" "predicted"
None of these slots tells me what those first three variables that can be harmlessly removed from the dataset actually were.
I think the purpose of rfcv is to establish how your accuracy relates to the number of variables you use. This might not seem useful with 10 variables, but when you have thousands it is quite handy to understand how much those variables "add" to the predictive power.
As you probably found out, this code
rf <- randomForest(type ~ ., data = fgl)
importance(rf)
gives you the relative importance of each of the variables.
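A sketch of how the two pieces combine, continuing the fgl example: rfcv() itself never names the variables it dropped, so the ranking has to come from importance() on a fitted forest. The cutoff of 6 variables here is just read off the error.cv output above.

```r
library(randomForest)
data(fgl, package = "MASS")

set.seed(17)
rf <- randomForest(type ~ ., data = fgl)

# Rank predictors by mean decrease in Gini, most important first
imp <- importance(rf)[, "MeanDecreaseGini"]
ranked <- names(sort(imp, decreasing = TRUE))

keep <- ranked[1:6]  # 6 variables had the lowest CV error above
rf_final <- randomForest(fgl[, keep], fgl$type, ntree = 1000)
```

Because variable importance is itself estimated with noise, it is worth setting a seed (or averaging importance over several forests) before committing to a cutoff.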
