I keep getting "summarise() has grouped output by 'new_brand'. You can override using
the .groups argument." I'm not sure if I'm getting this error because I created the columns pos_prop and neg_prop:
superbowl %>% group_by(new_brand, superbowl) %>% summarize(mean(superbowl$volume, superbowl$pos_prop, superbowl$neg_prop), sd(superbowl$volume, superbowl$pos_prop, superbowl$neg_prop)) %>% filter(superbowl, superbowl == "0")
When I run rlang::last_error(), the code works, but I'm not sure how to make the code run properly. Any help will be appreciated.
You're using summarize and such incorrectly. Try this:
superbowl %>%
  filter(superbowl == "0") %>%
  group_by(new_brand) %>%
  summarize(across(c(volume, pos_prop, neg_prop),
                   list(mu = ~ mean(.), sigma = ~ sd(.))))
Notes on your code:
once you start a dplyr-pipe with superbowl %>%, almost never use superbowl$ in the dplyr verbs (very rare exceptions); I also removed superbowl from group_by and moved the filter ahead of summarize, since it is not clear if you're trying to refer to the original frame symbol again, and after summarize only the grouping columns and the new summaries remain, so a filter on superbowl placed last would have no column to test ... if you do have a column superbowl$superbowl, then the filter shown is still appropriate, otherwise drop it;
either use across(..) as above or name the calculations, e.g., summarize(volume_mu = mean(volume), pos_mu = mean(pos_prop), ...); and
I'm inferring, but ... mean(volume, pos_prop, neg_prop) (with or without the superbowl$) is a mistake: the call is effectively mean(volume, trim=pos_prop, na.rm=neg_prop), which should be producing errors. One could adapt this to be mean(c(volume, pos_prop, neg_prop)) if you really want to aggregate three columns' data into a single number, but I thought that might be unintended over-aggregation.
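To make the last note concrete, here is a small sketch with made-up numbers (not your data) showing how mean() matches extra positional arguments to trim and na.rm instead of averaging them:

```r
# Hypothetical values, only to show how mean() matches positional arguments.
x <- c(1, 2, 3, 100)

# The second positional argument is `trim`: 25% of the values are cut from
# each end, so only 2 and 3 are averaged.
trimmed <- mean(x, 0.25)

# Here 2 is taken as `trim` and 3 as `na.rm`; only the first value is used.
first_only <- mean(1, 2, 3)

# To pool several vectors into one number, combine them first.
pooled <- mean(c(1, 2, 3))
```

With whole columns in the trim position, mean() stops with an error because trim must have length one.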
Demonstration of this with real data:
mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(disp, mpg),
                   list(mu = ~ mean(.), sigma = ~ sd(.))))
# # A tibble: 3 x 5
# cyl disp_mu disp_sigma mpg_mu mpg_sigma
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 26.9 26.7 4.51
# 2 6 183. 41.6 19.7 1.45
# 3 8 353. 67.8 15.1 2.56
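As for the message quoted in the question: it is informational, not an error. dplyr prints it when you summarize after grouping by more than one column, to say which grouping is left on the result. A minimal sketch of silencing it with the .groups argument, using mtcars and grouping by cyl and am purely for illustration:

```r
library(dplyr)

# Group by two columns; .groups = "drop" returns an ungrouped tibble
# and suppresses the "has grouped output by" message.
by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(across(c(disp, mpg),
                   list(mu = ~ mean(.x), sigma = ~ sd(.x))),
            .groups = "drop")
```

Use .groups = "keep" or "drop_last" instead if later steps rely on the grouping.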
I'd like to get a dataframe with variable names, means, and standard deviations. For this purpose, I created the descriptives function:
library(tidyverse)
library(haven) # read_spss() comes from haven, which library(tidyverse) does not attach
library(labelled)
data <- read_spss("http://staff.bath.ac.uk/pssiw/stats2/SAQ.sav")
descriptives <- function(data, var) {
  data %>%
    select({{var}}) %>%
    drop_na() %>%
    summarize(Label = labelled::var_label({{var}}), Mean = mean({{var}}), `Std Dev` = sd({{var}}))
}
descriptives(data, Q01)
I tried to map this function with all the variables Q01:Q23 but I'm getting this error: Error in is_quosure(x) : argument "x" is missing, with no default
data %>%
  select(Q01:Q23) %>%
  map_dfr(descriptives(data))
We can loop over the names in map after changing the function from {{}} to ensym and evaluating with !!:
library(dplyr)
library(purrr)
library(haven)
library(labelled)
map_dfr(names(data), ~ descriptives(data, !!.x))
-output
# A tibble: 23 x 3
Label Mean `Std Dev`
<chr> <dbl> <dbl>
1 Statiscs makes me cry 2.37 0.828
2 My friends will think I'm stupid for not being able to cope with SPSS 1.62 0.851
3 Standard deviations excite me 2.59 1.08
4 I dream that Pearson is attacking me with correlation coefficients 2.79 0.949
5 I don't understand statistics 2.72 0.965
6 I have little experience of computers 2.23 1.12
7 All computers hate me 2.92 1.10
8 I have never been good at mathematics 2.24 0.873
9 My friends are better at statistics than me 2.85 1.26
10 Computers are useful only for playing games 2.28 0.877
# … with 13 more rows
-function used
descriptives <- function(data, var) {
  var <- rlang::ensym(var)
  data %>%
    select(!!var) %>%
    drop_na() %>%
    summarize(Label = labelled::var_label(!!var),
              Mean = mean(!!var), `Std Dev` = sd(!!var))
}
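If the column names are already strings, as they are when looping over names(data), another option is the .data pronoun. This is only a sketch under assumptions: descriptives_chr is a hypothetical name, and it runs on mtcars (which has no variable labels), so the label column is replaced by the variable name.

```r
library(dplyr)
library(purrr)

# Hypothetical string-based variant: `var` is a column name given as a string.
descriptives_chr <- function(data, var) {
  data %>%
    filter(!is.na(.data[[var]])) %>%          # drop missing values of this column
    summarize(Variable = var,
              Mean = mean(.data[[var]]),
              `Std Dev` = sd(.data[[var]]))
}

# One row of results per column name.
res <- map_dfr(c("mpg", "disp"), ~ descriptives_chr(mtcars, .x))
```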
I have a data frame in which I would like to compute some extra columns as a function of the existing columns, but I want to specify both each new column name and the function dynamically. I have a vector of column names that are already in the dataframe df_daily:
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
The rows of the dataframe have responses to each question from each user each time they answer the questionnaire, as well as a column with the number of days since the user first answered the questionnaire (i.e. Days_From_First_Use = 0 on the very first use, = 1 if it is used the next day etc.). I want to average the responses to these questions by Days_From_First_Use. I start by grouping my dataframe by Days_From_First_Use:
df_test <- df_daily %>%
group_by(Days_From_First_Use)
and then try averaging the responses in a loop as follows:
for(i in 1:5){
df_test <- df_test %>%
mutate(!! paste0('Avg_Score_', DAILY_QUESTIONS[i]) :=
paste0('mean(', DAILY_QUESTIONS[i], ')'))
}
Unfortunately, while my new variable names are correct ("Avg_Score_Q1_Daily", "Avg_Score_Q2_Daily", "Avg_Score_Q3_Daily", "Avg_Score_Q4_Daily", "Avg_Score_Q5_Daily"), my answers are not: every row in my data frame has a string such as "mean(Q1_Daily)" in the relevant column.
So I'm clearly doing something wrong - what do I need to do to fix this and get the average score across all users on each day?
Sincerely and with many thanks in advance
Thomas Philips
I took a somewhat different approach, using summarize(across(...)) after group_by(Days_From_First_Use). I get the dynamic names by using rename_with and a custom function that replaces a leading "Q" with "Avg_Score_Q":
library(dplyr, warn.conflicts = FALSE)
# fake data -- 30 normalized "responses" from 0 to 2 days from first use to 5 questions
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
df_daily <- as.data.frame(do.call('cbind', lapply(1:5, function(i) rnorm(30, i))))
colnames(df_daily) <- DAILY_QUESTIONS
df_daily$Days_From_First_Use <- floor(runif(30, 0, 3))
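df_test <- df_daily %>%
  group_by(Days_From_First_Use) %>%
  # everything() selects all non-grouping columns; calling across() without
  # a column selection is deprecated in recent dplyr versions
  summarize(across(everything(), mean)) %>%
  rename_with(.fn = function(x) gsub("^Q", "Avg_Score_Q", x))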
#> `summarise()` ungrouping output (override with `.groups` argument)
df_test
#> # A tibble: 3 x 6
#> Days_From_First… Avg_Score_Q1_Da… Avg_Score_Q2_Da… Avg_Score_Q3_Da…
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1.26 1.75 3.02
#> 2 1 0.966 2.14 3.48
#> 3 2 1.08 2.45 3.01
#> # … with 2 more variables: Avg_Score_Q4_Daily <dbl>, Avg_Score_Q5_Daily <dbl>
Created on 2020-12-06 by the reprex package (v0.3.0)
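For completeness, here is a sketch of two other routes, run on a tiny made-up df_daily (two questions, four rows) so the numbers are easy to check. The first builds the names inside across() with its .names argument. The second keeps the loop from the question but computes the mean from the column itself (via the .data pronoun) instead of pasting together a string, which is why the original loop stored text such as "mean(Q1_Daily)".

```r
library(dplyr)

# Made-up data for illustration only.
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily")
df_daily <- data.frame(
  Days_From_First_Use = c(0, 0, 1, 1),
  Q1_Daily = c(1, 3, 5, 7),
  Q2_Daily = c(2, 4, 6, 8)
)

# Route 1: one row per day, names built by .names.
df_avg <- df_daily %>%
  group_by(Days_From_First_Use) %>%
  summarize(across(all_of(DAILY_QUESTIONS), mean, .names = "Avg_Score_{.col}"),
            .groups = "drop")

# Route 2: the loop, keeping every row and adding the per-day mean as a column.
df_loop <- df_daily %>% group_by(Days_From_First_Use)
for (q in DAILY_QUESTIONS) {
  df_loop <- df_loop %>%
    mutate(!!paste0("Avg_Score_", q) := mean(.data[[q]]))
}
df_loop <- ungroup(df_loop)
```

Route 1 gives one row per value of Days_From_First_Use; route 2 repeats each day's mean on every row of that day, as mutate always does.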
I'm pretty new to R and couldn't find a clear answer to my question after extensively searching the web. I'm trying to get dplyr functions to do the following task:
I have the following data.frame as a tibble: columns starting with X. indicate different samples and rows indicate how much a specific gene is expressed.
head(immgen_dat)
# A tibble: 6 x 212
ProbeSetID GeneName Description X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
<int> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10344620 " Gm1056~ " predicted gene 1~ 15.6 15.3 17.2 16.1 18.1
2 10344622 " Gm1056~ " predicted gene 1~ 240. 255. 224. 312. 272.
3 10344624 " Lypla1" " lysophospholipas~ 421. 474. 349. 478. 459.
4 10344633 " Tcea1" " transcription el~ 802. 950. 864. 968. 1056.
5 10344637 " Atp6v1~ " ATPase H+ transp~ 199. 262. 167. 267. 255.
6 10344653 " Oprk1" " opioid receptor ~ 14.8 12.8 18.0 13.2 15.3
# ... with 204 more variables: X.proB_FrBC_FL. <dbl>,
I added a mean expression variable at the end for each gene by using the following code (the range of variables are the first and the last sample):
immgen_avg <- immgen_dat %>%
  rowwise() %>%
  mutate(Average = mean(X.proB_CLP_BM.:X.MLP_FL.))
Here, I have a quick question: The returned mean value I get from this code doesn't match the average I calculated elsewhere (in Excel). I don't think there are any missing values.
What I'd like to do is the following: For each gene, I'd like to compare the sample values with the average value and calculate a log2-fold difference (log2 difference of gene expression in a sample compared to the average expression value across all the samples). I'd like to store this dataframe with the name of immgen_log2 and do some subsequent analyses. In this new data frame, I'd like to keep the gene names because I'm thinking to merge this with another data table to compare log2 change between different experiments.
What is the best way of doing this? I appreciate your answers.
I will explain what is happening in a short while, but one way to solve for the row means of your intended variables is:
immgen_dat %>%
  mutate(Average = apply(.[, 4:8], 1, mean)) %>%
  select(Average)
# Average
# 1 16.46
# 2 260.60
# 3 436.20
# 4 928.00
# 5 230.00
# 6 14.82
To see what is happening with your code, we can use the do function as follows:
df2 <- immgen_dat %>%
  rowwise() %>%
  do(Average = .$X.proB_CLP_BM.:.$X.proB_FrBC_BM.)
df2$Average[1]
# [[1]]
# [1] 15.6 16.6 17.6
You will see that : generates a sequence that starts at 15.6 and increases in steps of 1 for as long as it does not exceed 18.1. You can see this explained in more detail by typing help(":"). So in
immgen_dat %>%
  rowwise() %>%
  mutate(Average = mean(X.proB_CLP_BM.:X.proB_FrBC_BM.))
you are computing the mean of the values in each of these sequences, not the mean of the sample columns.
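A quick check with the first row of the excerpt shows the difference, followed by a sketch of a rowwise fix using c_across() (available in dplyr 1.0.0 and later). The small data frame d and its column names a, b, c are made up for illustration.

```r
library(dplyr)

# `:` on two numbers builds a sequence, it does not select columns.
seq_vals <- 15.6:18.1          # 15.6 16.6 17.6
seq_mean <- mean(seq_vals)     # mean of the sequence

# The five sample values of the first row in the excerpt.
row1 <- c(15.6, 15.3, 17.2, 16.1, 18.1)
row_mean <- mean(row1)         # the row mean that was intended

# Inside rowwise(), c_across() collects the selected columns of the current row.
d <- data.frame(id = 1:2,
                a = c(15.6, 240),
                b = c(15.3, 255),
                c = c(17.2, 224))
d_avg <- d %>%
  rowwise() %>%
  mutate(Average = mean(c_across(a:c))) %>%
  ungroup()
```

For many columns, rowMeans() on the selected columns is usually faster than rowwise().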
Edit
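The logarithm of a ratio is of course the difference of the logarithms (provided both values are positive). So if you are trying to find the difference between the log2 of each of the other numerical variables and the log2 of the Average, you can do something like this (column positions refer to the 8-column excerpt above, with Average added as column 9):
immgen_log2 <- immgen_dat
immgen_log2$Average <- rowMeans(immgen_dat[, 4:8]) # becomes column 9 of the excerpt
immgen_log2[, 4:9] <- lapply(immgen_log2[, 4:9], log2)
# log2(sample / Average) = log2(sample) - log2(Average)
immgen_log2[, 4:8] <- lapply(immgen_log2[, 4:8], function(x) x - immgen_log2$Average)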
I'm not entirely sure I understand what you need, but whenever you use dplyr or the tidyverse in general (including ggplot2), a long representation of your data works best. I assume that you want to calculate the mean of all variables starting with X. for each ProbeSetID and then, for each X. column and ProbeSetID, calculate the ratio to that mean and take log2, i.e. log2(X.bla/mean):
df <- read.table(text = 'ProbeSetID X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
10344620 15.6 15.3 17.2 16.1 18.1
10344622 240. 255. 224. 312. 272.
10344624 421. 474. 349. 478. 459.
10344633 802. 950. 864. 968. 1056.
10344637 199. 262. 167. 267. 255.
10344653 14.8 12.8 18.0 13.2 15.3', header = T)
library(dplyr)
library(tidyr)
result <-
  df %>%
  # transform to long:
  gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>%
  # group by IDs, ie make rowwise calculations if it was still wide, but faster:
  group_by(ProbeSetID) %>%
  # calculate group-mean on the fly and calculate log-ratio directly:
  mutate(log2_ratio = log2(value / mean(value)))
# transform back to wide, if needed:
result %>%
  # remove initial values to have only 1 value variable:
  select(-value) %>%
  # go back to wide:
  spread(key = key, value = log2_ratio)
# or, if you want to keep all values:
df %>%
  # transform to long:
  gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>%
  # group by IDs, ie make rowwise calculations if it was still wide, but faster:
  group_by(ProbeSetID) %>%
  # calculate the mean of each observation:
  mutate(mean_value = mean(value)) %>%
  # go back to wide:
  spread(key, value) %>%
  # now do the transformation to each variable that begins with X.
  # (a named list of formulas replaces the deprecated funs()):
  mutate_at(.vars = vars(matches("^X\\.")),
            .funs = list(log2_ratio = ~ log2(. / mean_value)))
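In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the first variant in that style, on a two-row made-up frame (the column names X.a and X.b and the intermediate names sample and value are placeholders):

```r
library(dplyr)
library(tidyr)

# Made-up data for illustration only.
df <- data.frame(ProbeSetID = c(1L, 2L),
                 X.a = c(1, 4),
                 X.b = c(3, 4))

log2_wide <- df %>%
  # long format: one row per ProbeSetID and sample
  pivot_longer(starts_with("X."), names_to = "sample", values_to = "value") %>%
  group_by(ProbeSetID) %>%
  # log2 of each value relative to the mean across that probe's samples
  mutate(log2_ratio = log2(value / mean(value))) %>%
  ungroup() %>%
  select(-value) %>%
  # back to wide: one column per sample
  pivot_wider(names_from = sample, values_from = log2_ratio)
```

Any identifier columns you want to keep for a later merge, such as GeneName, simply stay in the frame because they are not selected by starts_with("X.").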
how are you?
So, I have a dataset that looks like this:
dirtax_trev indtax_trev lag2_majority pub_exp
<dbl> <dbl> <dbl> <dbl>
0.1542 0.5186 0 9754
0.1603 0.4935 0 9260
0.1511 0.5222 1 8926
0.2016 0.5501 0 9682
0.6555 0.2862 1 10447
I'm having the following problem. I want to execute a series of t.tests along a dummy variable (lag2_majority), collect the p-values of these tests, and assign them to a vector, using a pipe.
All variables that I want to run these t-tests on are selected below, then I omit NA values for my t.test variable (lag2_majority), and then I try to summarize it with this code:
test <- g %>%
select(dirtax_trev, indtax_trev, gdpc_ppp, pub_exp,
SOC_tot, balance, fdi, debt, polity2, chga_demo, b_gov, social_dem,
iaep_ufs, gini, pov4, informal, lab, al_ethnic, al_language, al_religion,
lag_left, lag2_left, majority, lag2_majority, left, system, b_system,
execrlc, allhouse, numvote, legelec, exelec, pr) %>%
na.omit(lag2_majority) %>%
summarise_all(funs(t.test(.[lag2_majority], .[lag2_majority == 1])$p.value))
However, once I run this, the response I get is: Error in summarise_impl(.data, dots): Evaluation error: data are essentially constant., which is confusing since there is a clear difference in means along the dummy variable. The same error appears when I replace the last line of the code indicated above with: summarise_all(funs(t.test(.~lag2_majority)$p.value)).
Alternatively, since all I want to do is: t.test(dirtax_trev~lag2_majority, g)$p.value, for instance, I thought I could do a loop, like this:
for (i in vars){
t.test(i~lag2_majority, g)$p.value
},
Where vars is an object that contains all variables selected in the code indicated above. But once again I get an error message. Specifically, this one: Error in model.frame.default(formula = i ~ lag2_majority, data = g): variable lengths differ (found for 'lag2_majority')
What am I doing wrong?
Best Regards!
Your question is not reproducible, please read this for how you could improve its quality.
My answer has been generalised to be reproducible because I don't have your data and cannot therefore adapt your code directly.
Using a tidy approach I'll produce a data frame of p-values for each variable.
library(tidyr)
library(dplyr)
library(purrr)
mtcars %>%
  select_if(is.numeric) %>%
  map(t.test) %>%
  lapply(`[[`, "p.value") %>%
  as_tibble %>%
  gather(key, p.value)
# # A tibble: 11 x 2
# key p.value
# <chr> <dbl>
# 1 mpg 1.526151e-18
# 2 cyl 5.048147e-19
# 3 disp 9.189065e-12
# 4 hp 2.794134e-13
# 5 drat 1.377586e-27
# 6 wt 2.257406e-18
# 7 qsec 7.790282e-33
# 8 vs 2.776961e-05
# 9 am 6.632258e-05
# 10 gear 1.066949e-23
# 11 carb 4.590930e-11
update
Thank you for updating your question. Note that the value you included in your earlier comment is likely from your original dataset and is still not reproducible here. When I run the code, this is the output:
t.test(dirtax_trev ~ lag2_majority, g)$p.value
# [1] 0.5272474
Please frame your questions in a way that anyone can see the problem in the same way that you do.
To build up the formula you are running through the t.test, I have taken a slightly different approach.
library(magrittr)
library(dplyr)
library(purrr)
g <- tribble(
~dirtax_trev, ~indtax_trev, ~lag2_majority, ~pub_exp,
0.1542, 0.5186, 0, 9754,
0.1603, 0.4935, 0, 9260,
0.1511, 0.5222, 1, 8926,
0.2016, 0.5501, 0, 9682,
0.6555, 0.2862, 1, 10447
)
dummy <- "lag2_majority"
colnames(g) %>%
.[. != dummy] %>% # vector of variables to send through t.test
paste(., "~", dummy) %>% # build formula as character
map(as.formula) %>% # convert to formula class
map(t.test, data = g) %$% # run t.test for each, note the special operator
tibble(
data.name = unlist(lapply(., `[[`, "data.name")),
p.value = unlist(lapply(., `[[`, "p.value"))
)
# # A tibble: 3 x 2
# data.name p.value
# <chr> <dbl>
# 1 dirtax_trev by lag2_majority 0.5272474
# 2 indtax_trev by lag2_majority 0.5021217
# 3 pub_exp by lag2_majority 0.8998690
If you prefer to drop the dummy variable name from data.name, you could modify its assignment in the tibble with:
data.name = unlist(strsplit(unlist(lapply(., `[[`, "data.name")), paste(" by", dummy)))
N.B. I used the special %$% from magrittr to expose the names from the list of tests to build a data frame. I'm sure there are other ways that may be more elegant, however, I find this form quite easy to reason about.
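If all you need is the vector of p-values asked for in the question, here is one more sketch, using reformulate() from base R to build each formula and purrr::map_dbl to collect the results. It also shows why the for loop in the question failed: in i ~ lag2_majority the formula sees the one-element string i itself, not the column it names, hence the "variable lengths differ" error. The data are the five rows from the question.

```r
library(purrr)

# The five rows shown in the question.
g <- data.frame(
  dirtax_trev   = c(0.1542, 0.1603, 0.1511, 0.2016, 0.6555),
  indtax_trev   = c(0.5186, 0.4935, 0.5222, 0.5501, 0.2862),
  lag2_majority = c(0, 0, 1, 0, 1),
  pub_exp       = c(9754, 9260, 8926, 9682, 10447)
)

dummy <- "lag2_majority"
vars <- setdiff(names(g), dummy)   # every column except the dummy

# reformulate() turns the strings into a formula such as
# dirtax_trev ~ lag2_majority; the result is a named numeric vector.
p_values <- map_dbl(set_names(vars), function(v) {
  t.test(reformulate(dummy, response = v), data = g)$p.value
})
```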
library(tidyverse)
I'm stuck on something that should be so simple! Using the code below, all I want to do is group and summarise the three "Var" columns. I want counts and sums (so that I can create three percentage columns, so bonus if you can include an easy way to accomplish this in your answer). However, I don't want to include the NA's. Removing the NA's from sum is easy enough by using "na.rm=TRUE", but I can't seem to figure out how to not include the NA's in the counts (using n() ) while using dplyr::summarise_at.
Am I missing something very simple?
Df%>%group_by(Group)%>%summarise_at(vars(Var1:Var3),funs(n(),sum((.),na.rm=TRUE)))
Group<-c("House","Condo","House","House","House","House","House","Condo")
Var1<-c(0,1,1,NA,1,1,1,0)
Var2<-c(1,1,1,1,0,1,1,1)
Var3<-c(1,1,1,NA,NA,1,1,0)
Df<-data.frame(Group,Var1,Var2,Var3)
I think your code was very close to getting the job done. I made some slight changes and have included an example of how you might include the percent calculation in the same step (although I am not sure of your expected output).
library(dplyr)
Df %>%
  group_by(Group) %>%
  # a named list of formulas replaces the deprecated funs()
  summarise_all(list(count = ~ sum(!is.na(.)),
                     sum = ~ sum(., na.rm = TRUE),
                     pct = ~ sum(., na.rm = TRUE) / sum(!is.na(.))))
#> # A tibble: 2 x 10
#> Group Var1_count Var2_count Var3_count Var1_sum Var2_sum Var3_sum
#> <fctr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 Condo 2 2 2 1 2 1
#> 2 House 5 6 4 4 5 4
#> # ... with 3 more variables: Var1_pct <dbl>, Var2_pct <dbl>,
#> # Var3_pct <dbl>
I've also used summarise_all instead of summarise_at as summarise_all works on all the variables which aren't group variables.
I think you just need to move your na.rm argument back inside the parentheses of sum(). See below:
Group<-c("House","Condo","House","House","House","House","House","Condo")
Var1<-c(0,1,1,NA,1,1,1,0)
Var2<-c(1,1,1,1,0,1,1,1)
Var3<-c(1,1,1,NA,NA,1,1,0)
Df<-data.frame(Group,Var1,Var2,Var3)
out <- Df %>%
  group_by(Group) %>%
  # a named list of formulas replaces the deprecated funs()
  mutate_at(vars(Var1:Var3), list(total = ~ sum(!is.na(.)), sum = ~ sum(., na.rm = TRUE))) %>%
  ungroup()
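With dplyr 1.0.0 or later, the same summary can be sketched with across(), which supersedes the _at/_all variants. This uses the Df from the question; pct is computed with mean(na.rm = TRUE), which equals sum/count of the non-missing values for 0/1 data like this.

```r
library(dplyr)

# Data from the question.
Df <- data.frame(
  Group = c("House", "Condo", "House", "House", "House", "House", "House", "Condo"),
  Var1 = c(0, 1, 1, NA, 1, 1, 1, 0),
  Var2 = c(1, 1, 1, 1, 0, 1, 1, 1),
  Var3 = c(1, 1, 1, NA, NA, 1, 1, 0)
)

out <- Df %>%
  group_by(Group) %>%
  summarise(across(Var1:Var3,
                   list(count = ~ sum(!is.na(.x)),       # non-missing values only
                        sum   = ~ sum(.x, na.rm = TRUE),
                        pct   = ~ mean(.x, na.rm = TRUE))),
            .groups = "drop")
```

The result has one row per Group, with columns named Var1_count, Var1_sum, Var1_pct and so on.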