I am trying to count the frequencies of correct responses to over 200 questions on an IQ test.
Responses are:
0 = fail
1 = correct
7 = below basal
8 = above ceiling
I used dplyr::group_by to put the questions into groups. This works, and retains the ordering of the variables.
a_long %>%
  dplyr::group_by(key, value) -> a_long_grouped
Then I used dplyr::count to get the frequency of responses to each question.
a_long_count <- dplyr::count(a_long_grouped, value)
The resulting tibble summarizes the frequency of responses, but puts the questions into alphabetical order, which I do not want. I've tried using sort = FALSE, but it comes out the same.
I would appreciate any suggestions -- thanks!
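One possible direction (my own sketch, not from the original thread, and it assumes key is a character column): count() orders its output by the grouping columns, and character columns sort alphabetically; sort = FALSE only controls whether the output is sorted by n. Converting key to a factor whose levels follow the original question order should keep that order in the counted output.
library(dplyr)

a_long %>%
  mutate(key = factor(key, levels = unique(key))) %>%   # keep the original question order
  count(key, value, sort = FALSE) -> a_long_count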
Hi y'all, I'm fairly new to R and I'm supposed to calculate an F statistic for this table.
The code I have inputted is as follows:
# F-test
res.ftest <- var.test(TotalLength ~ SwimSpeed, data = my_data)
res.ftest
I know I have more than two levels from the other posts I have read online, but I am not sure what to change to get the outcome I want.
FIRST AND FOREMOST...If you invoke
?var.test()
you will note that the S3 version you called assumes lhs is numeric and rhs is a 2-level factor.
As for the rest: while I don't know the exact wording of your work/school assignment, the words shouldn't really be "calculate an F-test"; they should be "analyze these data appropriately". While there are a number of routes you could take, this is normally treated as a regression problem, NOT a problem of comparing two variances / running a one-way ANOVA, which is what var.test() is designed to do. (Reading the documentation at, for example, https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/var.test should make this clear, and is something you should always do when invoking R procedures.)
Using a subset of your data (please do this yourself for stack helpers next time rather than make someone here do it for you)...
df <- data.frame(
  ID = 1:4,
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)

cor.test(df$TL, df$SS)       # reports t statistic
# or
summary(lm(df$TL ~ df$SS))   # reports F statistic
Note that F is simply t^2 here in the 2 variable case.
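As a quick check of that relationship (a sketch I've added, not part of the original answer), you can square the t statistic from cor.test() and compare it with the F statistic reported by summary.lm():
ct  <- cor.test(df$TL, df$SS)
fit <- summary(lm(df$TL ~ df$SS))
unname(ct$statistic)^2       # squared t statistic
fit$fstatistic[["value"]]    # overall F statistic -- the two values should agree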
Lastly, I should add that it is remotely, vaguely possible the assignment is to check whether the variances of the two distributions are equal, even though I can see no reason anyone would want to know that, given they are two different measures on two different underlying scales. However,
var.test(df$TL, df$SS)
will return a "result" should you take the assignment to mean compare the observed variances.
I am currently working on an Amazon dataset with many rows, which makes it hard to spot issues in the data.
My goal is to look at the amazon data, and see whether certain products have a higher variance in star ratings than other ones. I have a variable indicating product ID (asin), a variable indicating the star rating (overall), and want to create a variance variable.
I have thus used dplyr's group_by function in combination with the mutate function. Even though none of the input variables have NAs/missing values, my output variable does. I have tried to find a solution, but only found advice on what to do when the input has NAs.
See my code attached:
any(is.na(data$asin))
# [1] FALSE
any(is.na(data$overall))
# [1] FALSE

# create variable that represents variance of rating, grouped by product type
data <- data %>%
  group_by(asin) %>%
  mutate(ProductVariance = var(overall))

any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289
I would much appreciate your help! Even though the number of NAs is small relative to the number of reviews, I would still appreciate getting accurate means (NAs hinder the usage of tapply) and being as precise as possible in follow-up analyses.
Thank you in advance!
var will return NA if the input is length one, so any ASINs that appear only once in your data will have NA variance. Depending on what you're doing with it, you may find it convenient to change those NAs to 0s:
var(1)
# [1] NA
...
mutate(ProductVariance = coalesce(var(overall), 0))
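A self-contained illustration (my own toy data, not from the original post):
library(dplyr)

toy <- tibble(asin = c("A", "A", "B"), overall = c(5, 3, 4))

toy %>%
  group_by(asin) %>%
  mutate(ProductVariance = coalesce(var(overall), 0))
# ASIN "B" has a single review, so var() returns NA and coalesce() replaces it with 0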
Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with .drop.
When .drop = TRUE, empty groups are dropped.
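For example (a small sketch of my own, assuming asin is a factor with unused levels):
library(dplyr)

toy <- tibble(asin = factor("A", levels = c("A", "B")), overall = 5)

toy %>% group_by(asin) %>% summarise(n = n())                 # level "B" is dropped
toy %>% group_by(asin, .drop = FALSE) %>% summarise(n = n())  # level "B" kept, with n = 0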
I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is that I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check whether, out of 4 neighboring observations in a given column of my data frame, at least 3 of them contain the same value). I first tried creating another indicator variable for whether the condition is satisfied (1 or 0 = yes or no). Then I tried a series of ifelse() statements within a loop to assign the proper categorization to observations where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the data frame after running the loop, only the observation where the condition is satisfied (not its neighbors) receives the value, rather than all neighboring observations also receiving it. Here is my code:
# sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NA

# loop over rows 3 to (n - 2) so that i-2 and i+2 never index outside the data
for(i in 3:(nrow(sample_dat) - 2)){
  sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i] == 1 &
                                    ((sample_dat$initial_ind[i - 2] == 1 |
                                        sample_dat$initial_ind[i - 1] == 1) &
                                       (sample_dat$initial_ind[i + 2] == 1 |
                                          sample_dat$initial_ind[i + 1] == 1)),
                                  "trending",
                                  "non-trending")
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 of the 4 observations in that group of 4 have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this - but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function over set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks whether the condition is true for at least 3 of the 5 values made up of the observation plus its four closest neighbors. In that case, just adding the 1s up and checking whether the sum is above 2 works.
library(zoo)

sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))

trend_test <- function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}

sample_dat$violate_new <- rollapply(sample_dat$initial_ind, FUN = trend_test,
                                    width = 5, fill = NA)
Edit: If you want a function that checks whether the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the width and align arguments of rollapply:
trend_test_2 <- function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}

sample_dat$violate_new <- rollapply(sample_dat$initial_ind, FUN = trend_test_2,
                                    width = 4, fill = NA, align = "left")
I have a very simple SSRS report - a census of employees with health insurance. I need a count of how many have single policies and how many have family policies. I created a conditional COUNT expression for each of the two values. However, they both produce the TOTAL number of policies instead of the count for the value they are supposed to be filtering on.
Expressions I used:
=COUNT(IIF(Fields!Deduction.Value = "Single", Fields!Deduction.Value,0))
=Count(IIF(Fields!Deduction.Value = "Family", Fields!Deduction.Value,0))
I have 79 rows. 54 rows say "Single" and 25 rows say "Family". The output shows 79 for BOTH groups.
The "Single" and "Family" designation is based on a case statement that converts multiple kinds of policies into the 2 basic values of Single and Family.
Any ideas on why this is happening?
You are doing a COUNT of results.
0 is a result that counts as 1 the same as any other number.
Use NOTHING instead of 0. COUNTs do not include NULL (NOTHING) values.
=COUNT(IIF(Fields!Deduction.Value = "Single", Fields!Deduction.Value, NOTHING))
The other way would be to assign a 1 for a match, 0 for non-match and then SUM the results.
=SUM(IIF(Fields!Deduction.Value = "Single", 1, 0))
I'm afraid this question has two sub-parts. My project is to determine which insurance carrier has the lowest cost based on CPT codes. Since there are so many CPT codes, I wanted to group them using cut like this:
uCPTCode <- unique(data$CPTCode)
uCPTCode <- cut(uCPTCode,
                breaks = c(-Inf, "01999", "69979", "79999", "89398", "99091", "99499", Inf),
                labels = c("NA", "Anesthesia", "Surgery", "Radiology", "Pathology&Laboratory",
                           "Medicine", "Evaluation&Management", "Temp"),
                right = FALSE)
Not sure unique is required or wise, but it seemed to make sense to me. The issue is that some codes have leading zeros and terminating letters, like this:
2608 Levels: 0014F 0159T 0164T 0191T 0195T 0232T 0319T 0326T 0513F 0517F 0518F
So question 1 is: what is the process to convert these ranges into integers corresponding to the labels I have in the cut function, so I can graph the grouped results on the x axis?
Question 2 is that I expected the ranges to be continuous, but they are not. How do I manage what happens around codes 99000 through 99216, where previous groups (Medicine, Anesthesiology, and Evaluation and Management) get combined? Here is a link to the CPT grouper file https://www.dropbox.com/s/wm55n17pufoacww/CPTGrouper.xlsx?dl=0
Here is a smattering of results to see where I am going with it
https://www.dropbox.com/s/h6sdnvm9yew6jdg/SampleStudyResults.xlsx?dl=0
Thanks very much for your time and attention
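For what it's worth, one possible direction for question 1 (my own sketch, not from the thread, and it simply ignores the trailing Category II/III letters): strip any terminating letter, parse the remaining digits as integers, and then cut with numeric breaks. Note that cut() needs exactly one more break than labels, so the sketch below uses seven labels; the cpt vector here is made up for illustration.
# Sketch only: codes such as "0014F" are reduced to their numeric part (14),
# which may or may not be what you want for Category II/III codes.
cpt <- c("0014F", "0159T", "01999", "69979", "99213")      # made-up examples
cpt_num <- as.integer(sub("[A-Za-z]$", "", cpt))            # drop a trailing letter, then parse as integer
cpt_group <- cut(cpt_num,
                 breaks = c(-Inf, 1999, 69979, 79999, 89398, 99091, 99499, Inf),
                 labels = c("NA", "Anesthesia", "Surgery", "Radiology",
                            "Pathology&Laboratory", "Medicine", "Evaluation&Management"),
                 right = FALSE)
table(cpt_group)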