Calculating a conditional mean and storing it in a dataframe - r

I want to calculate a conditional mean for values that exceed a certain threshold.
For example, looking at the mtcars dataset, I want to calculate the mean hp for cars with more than 100 hp. Furthermore, I want to store this value in a data frame.
I tried the following code (PROPTV is the dataset and `DDM, g=0` is the variable for which I need the mean of values > 0):
library(dplyr)

PROPTV %>%
  group_by(`DDM, g=0` > 0) %>%
  summarise(mean = mean(`DDM, g=0`))
I get the following tibble:
# A tibble: 2 × 2
  `\`DDM, g=0\` > 0`  mean
  <lgl>              <dbl>
1 FALSE              0
2 TRUE               0.709
The 0.709 should be correct, but I have no idea how to store this value without using a helper data frame.
Any ideas?
Thanks in advance!!
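One possible approach (a sketch, assuming {dplyr}; threshold_mean is just an illustrative name, not from the post) is to filter first and extract the single value from the summary with pull(), so no helper data frame is kept around:

library(dplyr)

# Keep only the rows above the threshold, summarise, and pull out the number
threshold_mean <- PROPTV %>%
  filter(`DDM, g=0` > 0) %>%
  summarise(mean = mean(`DDM, g=0`)) %>%
  pull(mean)

threshold_mean  # a plain numeric, 0.709 for the data above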

Related

R arrange function seems not to be working

I have a dataframe with timestamps that also have decimal values. I want to calculate the difference between the first event and all other events from the same group. To do that I use the following code:
library(dplyr)

values <- c("1671535501.862424", "1671535502.060679", "1671535502.257422",
            "1671535502.472993", "1671535502.652619", "1671535502.856569",
            "1671535503.048685", "1671535503.245988")
column_b <- c("a", "a", "a", "a", "a", "a", "a", "a")
values <- as.numeric(values)

#-- Calculate differences
data <- data.frame(values, column_b)  # create data frame
res <- data %>%
  group_by(column_b) %>%
  arrange(values) %>%
  mutate(time = values - lag(values, default = first(values)))
In general, the code does exactly what I expect it to do. It groups them, arranges them, and calculates the difference for each group. The output looks like this:
> res
# A tibble: 8 × 3
# Groups:   column_b [2]
       values column_b  time
        <dbl> <fct>    <dbl>
1 1671535502. a        0
2 1671535502. a        0.198
3 1671535502. a        0.197
4 1671535502. a        0.216
5 1671535503. a        0.180
6 1671535503. a        0.204
7 1671535503. a        0.192
8 1671535503. a        0.197
Nevertheless, I have my doubts about the results. If I am not mistaken, the values in this example are already in order, but even if they were not, arrange() should have done the job. So, if it is arranging the values, how can the 4th difference be larger than the 5th? There are multiple places where the output does not seem to make sense. What am I missing?
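If the intention is the difference between the first event and every later event in the group (which would be monotone), rather than the gap to the immediately preceding row that lag() produces, a sketch along these lines (same data and {dplyr} pipeline as above) may help; printing via as.data.frame() with more digits also shows the underlying ordering is intact:

library(dplyr)

# Difference from the *first* event in each group (monotone non-decreasing),
# as opposed to lag(), which gives the gap to the previous row only
res_from_first <- data %>%
  group_by(column_b) %>%
  arrange(values, .by_group = TRUE) %>%
  mutate(time_from_first = values - first(values)) %>%
  ungroup()

# Show more significant digits than the default tibble display
print(as.data.frame(res_from_first), digits = 16)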

Summarizing by group using dplyr not working as expected

I am trying to summarize demographic information in a dataframe and I am running into some issues. Breaking it down by gender, there are 4 possible options that participants can choose from: 1, 2, 3, 4, with blanks (no response) being treated as NA values by R. I get the correct counts for each gender, but I run into issues when trying to obtain the mean age for each gender.
I'd like to keep the observations with NA values because, while they may not have answered the demographic questions, they have answered other questions, which is why I do not want to simply remove those rows from the dataframe.
Here is what I tried:
# df$Q10: what is your gender
by_gender <- df %>%
  group_by(df$Q10) %>%
  dplyr::summarize(count = n(),
                   AvgAge = mean(df$Q11_1_TEXT, na.rm = TRUE))
by_gender
This returns the same value for all genders as
mean(df$Q11_1_TEXT, na.rm = TRUE)
Both the gender and age columns have NA values and I suspect this is where the issue may be? I tried adding na.rm = T but that does not seem to work. What else can I try?
Edit: Removing $ makes the function work as expected.
When you ask for mean(df$Q11_1_TEXT) it will calculate a mean from the original ungrouped vector, whereas if you use mean(Q11_1_TEXT) it will look for Q11_1_TEXT within the grouped data frame it received from the prior step.
Compare:
mtcars %>%
  group_by(gear) %>%
  summarize(wt_ttl  = sum(wt),
            wt_ttl2 = sum(mtcars$wt))
# A tibble: 3 × 3
   gear wt_ttl wt_ttl2
  <dbl>  <dbl>   <dbl>
1     3   58.4    103.
2     4   31.4    103.
3     5   13.2    103.
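Applied to the question's data (column names Q10 and Q11_1_TEXT as in the original code; a sketch, not tested against the real data), dropping the df$ gives per-group means:

library(dplyr)

by_gender <- df %>%
  group_by(Q10) %>%                                   # bare column name, not df$Q10
  summarize(count  = n(),
            AvgAge = mean(Q11_1_TEXT, na.rm = TRUE))  # per-group mean, NAs dropped
by_gender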

How can I calculate a new dataframe only for one outcome type?

I'm working with some data from participants completing a cognitive task that records their outcome (Correct or Incorrect) and reaction time (RT) (the entire dataset is called practice). For each participant, I want to create a new dataframe with their average RT when they got the answer correct, and one for when they were incorrect. I've tried
practice %>%
  mutate(correctRT = mean(practice$RT[practice$Outcome == "Correct"]))
Using dplyr and tidyverse, as well as
correctRT <- c(mean(practice$RT[practice$Outcome=="Correct"]))
(which I'm sure isn't the correct way to do it), and nothing seems to be working. I'm a complete novice working with this dataset to learn how to do stats in R, and I just can't find any answers.
In R you can "keep" multiple objects (e.g. data frames) in a single list. This saves you from storing every (sub)dataframe in a separate variable (e.g. from subsetting your problem and storing the result per Participant and Outcome). This comes in handy when you have many individuals and manually filtering and storing each (sub)dataframe becomes prohibitive.
Conceptually, your problem is to "subset" your data to the Participant and Outcome you aim for and calculate the mean on this group.
The following is based on {tidyverse}, i.e. {dplyr}.
data
As you have not provided a reproducible example, this is a quick hack of your data:
practice <- data.frame(
  Participant = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect", "Correct", "Correct", "Incorrect", "Incorrect",
              "Correct", "Correct", "Incorrect", "Correct", "Correct")
)
which looks like the following:
practice
   Participant RT   Outcome
1            A 10 Incorrect
2            A 12   Correct
3            A 14   Correct
4            B  9 Incorrect
5            B 12 Incorrect
6            B 13   Correct
7            B 17   Correct
8            C 11 Incorrect
9            C 13   Correct
10           D 17   Correct
splitting groups of a dataframe
The {tidyverse} provides some neat functions for the general data processing.
{dplyr} has a group_split() function that returns such a list.
library(dplyr)
practice %>% group_split(Participant, Outcome)
<list_of<
tbl_df<
Participant: character
RT : double
Outcome : character
>
>[7]>
[[1]]
# A tibble: 2 x 3
Participant RT Outcome
<chr> <dbl> <chr>
1 A 12 Correct
2 A 14 Correct
[[2]]
...
You can address the respective list-elements with the [[]] notation.
Store the list in a variable and try my_list_name[[3]] to extract the 3rd element.
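For example (splits is just an illustrative name for the stored list):

splits <- practice %>% group_split(Participant, Outcome)
splits[[3]]  # the third Participant/Outcome sub-dataframe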
potential summary for your problem
If you do not need a list you could wrap this into a data summary.
If you want to split on Outcome, you may want to filter your data into 2 sub-dataframes, each holding only the respective outcome (e.g. correct <- practice %>% filter(Outcome == "Correct")).
Group your data dependent on the summary you want to construct.
Use summarise() to summarise your groups into a 1-row summary.
Note you can combine multiple operations. For example next to the mean reaction time, the following counts the number of rows (:= attempts).
practice %>%
  group_by(Participant, Outcome) %>%
  ##--------- summarise data into a 1-row summary per group
  summarise(Mean_RT  = mean(RT),  # calculate mean reaction time
            Attempts = n())       # how many attempts
This yields:
# A tibble: 7 x 4
# Groups:   Participant [4]
  Participant Outcome   Mean_RT Attempts
  <chr>       <chr>       <dbl>    <int>
1 A           Correct      13          2
2 A           Incorrect    10          1
3 B           Correct      15          2
4 B           Incorrect    10.5        2
5 C           Correct      13          1
6 C           Incorrect    11          1
7 D           Correct      17          1
Please note that the result is a grouped data frame. If you process it further, you need to remove the grouping; otherwise any follow-up operation in a pipe will act on the group level.
For this you can either use summarise(..., .groups = "drop") or add ... %>% ungroup() to your pipe.
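For example, the same summary as above, but returned as an ungrouped tibble:

practice %>%
  group_by(Participant, Outcome) %>%
  summarise(Mean_RT = mean(RT), Attempts = n(), .groups = "drop")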
If you need to split the result, see group_split() above.

How do I sample specific sizes within groups?

I have a specific use case: I want to sample exact sizes from within groups. What method should I use to construct exact subsets based on group counts?
My use case is a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected, so I am trying to construct a sampling data frame that excludes 60% of the available subjects in each group. This is part of a function where the user specifies the minimum proportion of subjects that must not be used, hence the 1 - construction, where the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.
After this code, I will sample completely at random to get my final sample.
Code example:
library(dplyr)

testing <- data.frame(ID = seq_len(50),
                      Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16)))
testing <- testing %>%
  slice_sample(ID, prop = 1 - .6)
As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. With no set seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
However, the overall sample size of 20 is correct.
How do I brute force the sample size within the groups to be what I need?
There are other variables in the data frame so I need to sample randomly from each age group.
EDIT: I messed up trying to give an example. In my real data I am grouping by age inside the dplyr chain of commands, but neither group_by() on the age variable ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age nor the correct overall sample size.
I was using a semi_join to limit the ages to those with a total remaining after the proportion test. For ages for which no sample could be taken, the semi_join removed them from the population ahead of the proportional sampling. I don't know if the semi_join caused the problem.
That said, the answer provided and accepted shifts me away from relying on the semi_join and I think is an overall large improvement to my real code.
You haven't defined your grouping variable.
Try the following:
set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups: Age [4]
# Age n
# <dbl> <int>
# 1 18 4
# 2 19 3
# 3 20 6
# 4 21 6
Alternatively, try stratified from my "splitstackshape" package:
library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
# Age N
# 1: 18 4
# 2: 19 4
# 3: 20 6
# 4: 21 6

How do I make the list output from the 'by' function in R usable?

I have a set of data with a dependent variable and two factors. I would like to randomly sample the dependent variable (with replacement) within each subset of combinations of my two factors, and the number of random samples retrieved should equal the number that existed originally at each combination of the two factors. I've been able to do this using the 'by' function. The problem is that the output is a list, and I'd like something more accessible, but I haven't had any luck converting it to a data frame. My end goal is to run the simulation described above 1000 times and, for each simulation, calculate the average of the random samples retrieved for each combination of the factors.
This produces the dataset:
value <- runif(100, 5, 25)
cat1 <- factor(rep(1:10, 10))
a <- rep("A", 50)
b <- rep("B", 50)
cat2 <- append(a, b)
data <- as.data.frame(cbind(value, cat1, cat2))
This creates one simulation of random values drawn from the factor levels and
stores that info in a list:
list <- by(data[, "value"], data[, c("cat1", "cat2")],
           function(x) sample(x, length(x), replace = TRUE))
What I'd like to wind up with is a dataframe that has the columns "Simulation", "AverageValue", "cat1", and "cat2", so that I would have 1000 simulation rows for each combination of cat1 and cat2.
Any suggestions on how to make the 'by' output more accessible so I can run a for loop on the output or other suggestions would be great.
Thanks!
As a more general method, you might like to use dplyr rather than by. This way you'll keep your data.frame.
In this case, you would use group_by to group by cat1 and cat2, rather than by, and use mutate to add a new column. You could replace new = with value = if you don't want to keep your old data:
library(dplyr)
data %>%
  group_by(cat1, cat2) %>%
  mutate(new = sample(value, length(value), replace = T))
Source: local data frame [100 x 4]
Groups: cat1, cat2 [20]
value cat1 cat2 new
(fctr) (fctr) (fctr) (fctr)
1 13.9639607304707 1 A 13.2139691384509
2 22.6068278681487 2 A 5.27278678957373
3 24.6930849226192 3 A 22.0293137291446
4 16.842244095169 4 A 9.56347029190511
5 18.467006101273 5 A 23.1605510273948
6 20.6661582039669 6 A 24.3043746100739
7 9.37060782220215 7 A 13.9268753770739
8 6.68592340312898 8 A 20.034239795059
9 6.95704637560993 9 A 12.676755907014
10 17.2769332909957 10 A 24.453850784339
.. ... ... ...
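To get all the way to the end goal described in the question (1000 simulations with an average per cat1/cat2 combination), one possible sketch, assuming {dplyr} and building the data with data.frame() so that value stays numeric (the cbind() construction above coerces everything to factor), is the following; one_sim is just an illustrative helper name:

library(dplyr)

# Build the data so that value stays numeric (replaces the cbind() construction)
data <- data.frame(value = runif(100, 5, 25),
                   cat1  = factor(rep(1:10, 10)),
                   cat2  = rep(c("A", "B"), each = 50))

# One resampling run: mean of a with-replacement sample per cat1/cat2 cell
one_sim <- function(i) {
  data %>%
    group_by(cat1, cat2) %>%
    summarise(AverageValue = mean(sample(value, length(value), replace = TRUE)),
              .groups = "drop") %>%
    mutate(Simulation = i)
}

sims <- bind_rows(lapply(1:1000, one_sim))
head(sims)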
