Subsetting everything but a given line in 1 column - r

I have a dataset with 4 different treatments, one of which is the control group. I want to split the data into the control group and the other treatments.
I wrote this in RStudio:
ControlQ2<-subset(Q2, Treatment == "No_Suite")
How do I now select all the treatments except "No_Suite"?
Thanks

I'm not sure if I understood you well, but what about this one?
ExceptControlQ2<-subset(Q2, Treatment != "No_Suite")
If this is not what you were looking for, please provide an example with an expected output.
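As a quick check, here is a minimal sketch on made-up data. The treatment names "A", "B" and "C" and the Value column are hypothetical; only Q2, Treatment and "No_Suite" come from the question. It also shows that subset() silently drops rows whose Treatment is NA, so such rows end up in neither group.

```r
# Hypothetical data: one control row, four treated rows, one missing treatment
Q2 <- data.frame(
  Treatment = c("No_Suite", "A", "B", "C", "A", NA),
  Value     = 1:6
)

# Control group only
ControlQ2 <- subset(Q2, Treatment == "No_Suite")

# Everything except the control group
ExceptControlQ2 <- subset(Q2, Treatment != "No_Suite")

# subset() treats an NA condition as FALSE, so the NA row is in neither subset
nrow(ControlQ2)
nrow(ExceptControlQ2)
```

If Treatment is stored as a factor, the unused "No_Suite" level stays attached to ExceptControlQ2; droplevels(ExceptControlQ2) removes it.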

Related

Transforming only some of a variable's NAs into 0

I am working with a data frame in RStudio, trying to understand whether there is a correlation between exercising and a person's general health. There are three main variables:
exerof1: how many times the participant exercised in the last 30 days.
exerany2: whether the participant exercised in the last month; they can answer yes or no, or refuse to answer.
genhlth: a factor variable that splits the observations into 5 levels.
I have already transformed the exerof1 variable, but 30% of its values are NA, and most of them are NA because the participant answered "No" to the exerany2 question.
My objective is to identify the people who answered "No" in exerany2 and have NA in exerof1, and to turn those NAs into 0.
I don't know whether my approach is the best one because I am a beginner. I tried to do this with ifelse, but I am struggling. I also checked for an existing thread with the same question, but I couldn't find one.
I look forward to your feedback.
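Assuming your data frame is called data:
data[is.na(data$exerof1) & data$exerany2 %in% "No", "exerof1"] <- 0
Basically, we select the rows that satisfy your condition, then pick the column exerof1, and assign those cells the value 0. Using %in% instead of == keeps the condition FALSE (rather than NA) for rows where exerany2 is itself missing; an NA in the row index would make the assignment fail.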
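A minimal sketch of this on made-up data, assuming exerany2 holds the strings "Yes" and "No" and is NA for refusals (the values below are invented for illustration). The condition uses %in% so that a missing exerany2 counts as FALSE.

```r
# Hypothetical data: rows 2 and 3 answered "No" and have NA exercise counts
data <- data.frame(
  exerany2 = c("Yes", "No", "No", NA, "Yes"),
  exerof1  = c(12, NA, NA, NA, NA)
)

# TRUE only where exerof1 is missing AND the participant answered "No"
to_zero <- is.na(data$exerof1) & data$exerany2 %in% "No"

# Replace only those NAs with 0; every other NA is left untouched
data[to_zero, "exerof1"] <- 0

data$exerof1
# [1] 12  0  0 NA NA
```

The same result can be written with ifelse, which the question mentions trying: data$exerof1 <- ifelse(to_zero, 0, data$exerof1).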

Removing data frames from a list that contains a certain value under a variable in R

I currently have a list of 27 correlation matrices, each with 7 variables, for social science research.
Some correlations are NA due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, that is, only if it contains at least some value other than NA. Since there are 7 variables, I want to keep anything that does NOT consist of 6 NAs plus the correlation with itself, 1. This is the tricky part: 1 is a value, but it is meaningless to me in a correlation matrix.
I would appreciate it if anyone could enlighten me regarding the code.
I am rather new to R, and my only idea is to use an if statement to set the condition. I have been trying for hours to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I will first convert your matrix into a data frame and then assume that you want to check whether a variable var in your data frame df has at least one value that is neither NA nor 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a data frame (here matrix stands for one of your correlation matrices).
table(df$var) will show you the distribution of values in that variable. From there you can decide whether to keep the variable or not.
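To apply such a check across the whole list instead of inspecting each matrix by hand, one possible sketch follows. It assumes the goal is to drop every matrix in which a chosen variable has only NA correlations apart from the diagonal 1, and that each matrix has identical row and column names. The names cor_list, has_information and v1 to v7 are hypothetical.

```r
vars <- paste0("v", 1:7)

# A hypothetical complete correlation matrix
full <- matrix(0.3, nrow = 7, ncol = 7, dimnames = list(vars, vars))
diag(full) <- 1

# A matrix where v3 correlates with nothing: 6 NAs plus the diagonal 1
empty_v3 <- full
empty_v3["v3", ] <- NA
empty_v3[, "v3"] <- NA
empty_v3["v3", "v3"] <- 1

cor_list <- list(a = full, b = empty_v3, c = full)

# TRUE if `var` has at least one non-NA correlation with another variable;
# the diagonal entry is excluded because it is always 1
has_information <- function(m, var) {
  off_diagonal <- m[var, colnames(m) != var]
  any(!is.na(off_diagonal))
}

# Keep only the matrices in which v3 carries some information
kept <- Filter(function(m) has_information(m, "v3"), cor_list)
names(kept)
# [1] "a" "c"
```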

Missing values for each participant in the study

I am working in R. What I want to do is make a table or a graph that shows the missing values for each participant. I have 4700+ participants, and for each question there are between 20 and 40 missing values. I would like to represent the missing values in such a way that I can see which people did not answer the questions, and possibly check whether there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers that I am not quite sure how to interpret. At first I thought these were the patient numbers, but then I noticed that this is not the case.
I also tried making subsets with only the missing values, but then I literally only see how many missing values there are, not who they belong to.
Could somebody help me? Thanks!
Zas
If there is a column that identifies each row in the data.frame mydata, say a patient number patient_no, then you can easily find the patient numbers of the people with missing values:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to look at the pattern of who skipped each particular question, then this might be useful:
Assumption: every column except column 1 holds the answers to a question.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R numbers them from 1 to 20; these row numbers are not part of your original data. The result of which(!complete.cases(mydata$Variable1)) corresponds to those numbers: it lists the rows of your data set in which Variable1 is missing.
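For a per-participant overview, a sketch along the following lines may help. It assumes an identifier column patient_no in column 1 and one column per question; the data and the column names q1 to q3 are made up.

```r
# Hypothetical data: patient 104 skipped every question
mydata <- data.frame(
  patient_no = 101:105,
  q1 = c(NA, 2, 3, NA, 5),
  q2 = c(1, NA, 3, NA, 5),
  q3 = c(1, 2, 3, NA, 5)
)

questions <- mydata[, -1]  # everything except the identifier column

# Number of unanswered questions per participant
n_missing <- rowSums(is.na(questions))
missing_summary <- data.frame(patient_no = mydata$patient_no,
                              n_missing  = n_missing)

# Participants with at least one missing answer, by patient number
missing_summary[missing_summary$n_missing > 0, ]

# Number of missing answers per question
colSums(is.na(questions))
```

With thousands of participants, sorting missing_summary by n_missing or tabulating it with table(n_missing) gives a compact view of how the missing answers are distributed.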

How to perform a two-way repeated measures ANOVA with missing values

For my data set, I need to perform some sort of two factor repeated measures ANOVA. I have one between-subject factor called "Treatment" and one within-subject factor called "Frequency" with 8 levels. My problem is that most of my subjects don't have responses, called "Threshold", for all 8 of the levels of frequency (missing values). In addition, my two treatments are also unbalanced (about 23 for the first treatment type and about 21 for the other).
What R code do you suggest I try, and what would that code look like? I've been looking at the aov and Anova (car package) functions. I also need to figure out the formula for my model. I was thinking of something like
aov(Threshold ~ (Treatment*Frequency) + Error(Subject/(Treatment*Frequency)))
but I keep getting error messages like "In aov (......) Error() model is singular."
My question here is if you only include within-subject factors in the error term, so Error(Subject/Frequency) or just Error(Subject), or if I had it right in including everything? Also, should I include rows of responses for every frequency per bird, even if I don't have that specific value? Should I put NA's in those missing value cells, or delete the entire row of data for that level?
Any help would be greatly appreciated! I'm new to more advanced statistics and modeling, so keep that in mind! And if I need to clarify or add anything, just let me know. Thanks!
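One common way to write this kind of design, sketched here on simulated data rather than offered as a definitive answer: when Treatment varies between subjects and Frequency within subjects, only the within-subject factor is usually nested in the error term, i.e. Error(Subject/Frequency). Because aov is designed for balanced data, a mixed model is often suggested instead when some thresholds are missing. The second part below swaps in nlme::lme with a random intercept per subject as one such alternative, keeping the missing cells as NA rows. The group sizes, values and number of missing cells are all made up.

```r
set.seed(42)

# Simulated balanced data: 10 subjects, 8 frequencies, 2 treatment groups
d <- expand.grid(Subject = factor(1:10), Frequency = factor(1:8))
d$Treatment <- factor(ifelse(as.integer(d$Subject) <= 5, "A", "B"))
d$Threshold <- rnorm(nrow(d), mean = 50, sd = 5)

# Between-subject Treatment, within-subject Frequency:
# only Frequency is nested within Subject in the error term
fit_aov <- aov(Threshold ~ Treatment * Frequency + Error(Subject / Frequency),
               data = d)
summary(fit_aov)

# Same design with some thresholds missing, kept as NA rows
d_miss <- d
d_miss$Threshold[sample(nrow(d_miss), 15)] <- NA

# Mixed model with a random intercept per subject (nlme is a recommended
# package that ships with most R installations); NA rows are dropped
if (requireNamespace("nlme", quietly = TRUE)) {
  fit_lme <- nlme::lme(Threshold ~ Treatment * Frequency,
                       random = ~ 1 | Subject,
                       data = d_miss, na.action = na.omit)
  print(anova(fit_lme))
}
```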

Subset dataframe by unique id variables with certain number of rows

I have not found a clear answer to this question, so hopefully someone can point me in the right direction!
I have a nested data frame (panel data), with multiple observations within multiple individuals. I want to subset my data frame by those individuals (id) which have at least 20 rows of data.
I have tried the following:
subset1 = subset(df, table(df$id)[df$id] >= 20)
However, I still find individuals with fewer than 20 rows of data.
Can anyone supply a solution?
Thanks in advance
subset1 = subset(df, as.logical(table(df$id)[as.character(df$id)] >= 20))
Now, it should work.
The subset function expects its condition to evaluate to a series of TRUE and FALSE values, one per row, indicating whether that row meets the condition and should be kept.
However, if you type table(df$id)[df$id] >= 20 in the console, you will see that it returns a named array rather than a plain logical vector, so it is safest to convert it with as.logical. In addition, if id is numeric, table(df$id)[df$id] looks the counts up by position rather than by id value, which can return the wrong count or NA; indexing with as.character(df$id) looks them up by name instead.
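An alternative sketch that avoids indexing the table altogether uses ave() to attach each id's row count to every row. The data below is made up, with three hypothetical ids that have 25, 5 and 20 rows.

```r
# Hypothetical panel data: ids 101, 205 and 999 with 25, 5 and 20 rows
df <- data.frame(
  id    = rep(c(101, 205, 999), times = c(25, 5, 20)),
  value = seq_len(50)
)

# For every row, the number of rows that share its id
rows_per_id <- ave(seq_along(df$id), df$id, FUN = length)

# Keep only individuals with at least 20 rows
subset1 <- df[rows_per_id >= 20, ]

unique(subset1$id)
# [1] 101 999
```

Because rows_per_id is computed per group, it works the same way whether id is numeric, character or a factor.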
