I am a rookie Stata user trying to make the jump to R. I am working through various exercises, but keep getting something wrong with the group_by and subset commands.
I have a simple dataset on which I wish to make group-based calculations. I am trying to use the group_by function from the dplyr package to do this.
My dataset is called itchy and consists of 4 variables:
treat - levels A and B (type of treatment)
type - levels Dark and Fair (skin colour)
y - levels 0 and 1 (failure or success of treatment)
freq - numerical variable indicating how many observations are in this particular group
Using this code you can recreate it:
type <- c(2,2,2,2,1,1,1,1)
treat <-c(1,1,2,2,1,1,2,2)
y <- c(1,0,1,0,1,0,1,0)
freq <- c(9,17,5,20,10,15,3,20)
itchy <- cbind.data.frame(type,treat,y,freq)
itchy$type <- as.factor(type)
itchy$type <- factor(itchy$type,levels = c(1,2), labels = c("Dark", "Fair"))
itchy$treat <- as.factor(treat)
itchy$treat <- factor(itchy$treat,levels = c(1,2), labels = c("A", "B"))
itchy$y <- as.factor(y)
itchy$y <- factor(itchy$y,levels = c(0,1), labels = c("failure", "succes"))
Now I would like to calculate the odds of a success for treatment A and B when applied to skin type Dark or Fair. (odds = number of successes / number of failures)
I have two questions:
1) Can you help me do the odds calculations by group?
2) I have tried various combinations of group_by and subset, without any luck. The code below shows some of my unsuccessful attempts. Can you tell me where my understanding of how the group_by and subset commands work goes wrong?
itchy %>% group_by(treat, type) %>% summarize(ods = (subset(freq, y==1)/subset(freq, y==0)))
itchy %>% group_by(treat, type) %>% ods <- c((subset(freq, y==1)/subset(freq, y==0)))
itchy %>% group_by(treat, type) %>% itchy$ods <- (subset(freq, y==1)/subset(freq, y==0))
If I understand you correctly, I think the following will work. I made use of the spread function from the tidyr package, which, like dplyr, is part of the tidyverse.
library(tidyr)
itchy %>%
spread(y, freq) %>%
mutate(odds = succes / failure)
type treat failure succes odds
1 Dark A 15 10 0.6666667
2 Dark B 20 3 0.1500000
3 Fair A 17 9 0.5294118
4 Fair B 20 5 0.2500000
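A note in case you are on tidyr 1.0.0 or later: pivot_wider() supersedes spread() there, and a sketch of the same calculation would be:
library(dplyr)
library(tidyr)
itchy %>%
pivot_wider(names_from = y, values_from = freq) %>%
mutate(odds = succes / failure)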
junk = itchy %>% group_by(y,treat, type) %>% summarize(Overall = sum(freq))
myfunc = function(arg1,arg2){
filter(junk,treat == arg1,type == arg2)[1,4]/filter(junk,treat == arg1,type == arg2)[2,4]
}
myfunc("A","Dark") # You can try all the various combinations here
Does this give you the desired result?
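For completeness, here is a sketch that stays entirely within group_by() and summarize() (it assumes the factor labels defined in the question, including the "succes" spelling):
library(dplyr)
itchy %>%
group_by(type, treat) %>%
summarize(odds = freq[y == "succes"] / freq[y == "failure"])
Each type/treat group contains exactly one success row and one failure row, so the division returns a single odds value per group.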
I would like to create a for loop to repeat the same functions for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$var1 == 0)["TRUE"]
n <- table(df$var1 == 1)["TRUE"]
PREV95 <- (svyciprop(~ var1 == 1, level = 0.95, design = design, deff = "replace") * 100)
I need to run the same functions for 150 columns. I know that I need to put all my columns in one vector, say x, but I don't know how to write the loop that repeats the command for all of them.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish approach (the various flavours of apply, mapping, walking ...).
Applying a function to the columns of a data frame df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
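c)
with purrr (an added sketch; it assumes the purrr package is installed and reuses my_function from above):
library(purrr)
map_dbl(cars, my_function) # named numeric vector with one value per column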
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for Data Science gives an introduction to iterating "the R way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it. I created a function eval() that does all the work: it first keeps only the factor columns, then converts them to numeric so the values are summed as 0s and 1s (summing the factors directly would use the underlying codes 1 and 2). Inside it, a helper function neg() gives the number of negative cases by subtracting the sum of the 1s from the length of the vector.
The function then builds the data frames "n" (the number of 1s), "N" (the number of 0s), and PREV95, using pivot_longer to get the data into long format so that each statistic you are looking for ends up in its own column when merged. Note that I had to leave PREV95 out because I do not have a 'design' object to pass as a parameter; it is commented out, but you can uncomment it to add it back in. I think the logic for PREV95 should work, but I cannot check it without a 'design' object.
Finally, left_join combines these data frames into "results" (the version that also joins PREV95 is commented out as well). eval() takes your original data frame as input and returns a data frame rather than a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n <- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
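If you do have a survey design object, the PREV95 piece could in principle be added inside the same loop by building the formula from the column name. This is only a sketch and is untested here (there is no 'design' object to run it against); the svyciprop() call simply mirrors the one in the question:
library(survey)
# 'design' is assumed to be an svydesign object built elsewhere that contains Var1..Var3
if (exists("design")) {
  for (var in VarList) {
    f <- as.formula(paste0("~ ", var, " == 1"))
    results[[var]][["PREV95"]] <- svyciprop(f, design = design, level = 0.95, deff = "replace") * 100
  }
}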
I have a data set like the following:
source <- c("Email","Email","University","Google","Wordpress","Government","University","Email")
TLD <- c(".com",".com","net",".com",".edu",".com",".gov",".org")
speed <- c("1MB/s to 10MB/s","1MB/s to 10MB/s","1KB/s to 99KB/s","100KB/s to 1MB/s","1MB/s to 10MB/s","1MB/s to 10MB/s","10MB/s to 100MB/s","1MB/s to 10MB/s")
ping <- c(120,250,32,66,502,222,307,21)
install <- c("Yes","No","No","No","Yes","Yes","No","Yes")
df <- data.frame(source,TLD,speed,ping,install)
I would like to make a proportion table for all of the categorical variables at once, in a single table if possible. Is there any way to do this?
My desired output would look something like this:
Factor Level N (%)
source Email 5
Google 10
Wordpress 2
... .... ...
install Yes 42
No 58
Get the data in long format, count each occurrence of a column and its value, and calculate the percentage.
library(dplyr)
df %>%
mutate(across(.fns = as.character)) %>%
tidyr::pivot_longer(cols = everything()) %>%
count(name, value, name = 'N') %>%
group_by(name) %>%
mutate(N = prop.table(N) * 100)
This is how I ended up solving my problem.
Thanks for the help.
#Categorical data summary
prop.Fun <- function(x){
dfProp <- cbind(table(x, useNA="ifany"),round(prop.table(table(x, useNA="ifany"))*100, 2))
colnames(dfProp) <- c("Count", "Proportion (%)")
dfProp
}
lapply(df[,c(1,2,3,7,8,10,15)], prop.Fun)
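If you'd rather not hard-code the column positions, a small sketch that applies the same function to every factor or character column instead (data frames are lists, so base Filter() works directly):
lapply(Filter(function(x) is.factor(x) || is.character(x), df), prop.Fun)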
Equivalent for deprecated select_() and mutate_()
I am trying to make a function with this data and would really appreciate your help!
Imagine I have a data.frame like this one (control and sites bound together).
I want to select the InitDryW and FinalDryW columns of the Treatment “Control” and then calculate the average.
Inside the function I must write select_() and then mutate_(). However, I understand that these two functions are deprecated.
control <- data.frame(Day=c(0,0,0,0,0,0),
Replica=c(1,1,1,1,1,1),
Initial_Dry_Weight=c(5.010,5.010,5.010,5.010,5.010,5.000),
Final_Dry_Weight=c(4.990,4.940,4.840,4.820,4.960,4.970),
InitiaFraction=c(1.1071,1.1964,1.0647,1.0005,1.0453,1.1212),
FinalFraction=c(0.3858,0.3504,0.4248,0.3333,0.3417,0.3467),
Treatment=c("Control","Control","Control","Control","Control","Control"))
control
sites <-data.frame(Day=c(2,4,8,16,32,44),
Replica=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Initial_Dry_Weight=c(5.000,5.000,5.000,5.000,5.01,5.000,5.000,5.000,
5.000,5.000,5.000,5.01,5.01,5.01,5.000,5.000,5.000,5.000),
Final_Dry_Weight=c(4.65,4.63,4.67,4.64,4.37,4.37,4.17,3.72,4.12,4,3.99,3.64,
4.26,3.3,3.47,3.7,3.75,3.3),
InitiaFraction=c(1.0081,1.0972,1.1307,1.0898,1.075,1.0295,1.0956,1.042,1.0876,
1.006,1.1052,1.0922,1.0472,1.0843,1.0177,1.0143,1.1112,1.0061),
FinalFraction=c(0.3229,0.3605,0.3304,0.3489,0.3181,0.2948,0.4098,0.3762,0.3787,
0.3345,0.3595,0.3511,0.3921,0.3908,0.3385,0.347,0.3366,0.3318),
Treatment=c("CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC",
"CC","CC","CC","CC","CC"))
sites
total <- dplyr::bind_rows(control,sites)
total
My function is:
manipulation <- function(data,
InitDryW,
FinalDryW,
Treatment,
Difference) {control <- data %>%
filter(Treatment == "Control") %>%
select_(InitDryW,FinalDryW) %>%
mutate_(Difference = lazyeval::interp (~a/b, a=as.name(FinalDryW),b=as.name(InitDryW)))
meanControl <- mean(control$Difference, na.rm = TRUE)
return (meanControl)
}
manipulation()
Then, I run the example:
control <- manipulation(data= total,
InitDryW = "Initial_Dry_Weight",
FinalDryW = "Final_Dry_Weight",
Treatment = "Treatment")
control
Now, I'm getting warnings like these (for both select_() and mutate_()):
Warning message:
mutate_() is deprecated.
Please use mutate() instead
The 'programming' vignette or the tidyeval book can help you
to program with mutate() : https://tidyeval.tidyverse.org
The result is correct, but that warning appears the first time the function is run.
My question is: what is the equivalent of select_() and mutate_() in functions in this case?
I think the select_() part is now solved by using plain select().
Thanks in advance!!!
You can pass unquoted column names and use {{ }} to evaluate them.
library(dplyr)
library(rlang)
manipulation <- function(data,InitDryW,FinalDryW,Treatment,Difference) {
control <- data %>%
filter({{Treatment}} == "Control") %>%
select({{InitDryW}},{{FinalDryW}}) %>%
mutate(Difference = {{FinalDryW}}/{{InitDryW}})
meanControl <- mean(control$Difference, na.rm = TRUE)
return (meanControl)
}
manipulation(data= total,
InitDryW = Initial_Dry_Weight,
FinalDryW = Final_Dry_Weight,
Treatment = Treatment)
However, based on @27 ϕ 9's comment, you might want to do:
manipulation <- function(data,InitDryW,FinalDryW,Treatment) {
control <- data %>%
filter(Treatment == !!Treatment) %>%  # !! injects the argument's value ("Control") instead of the column
select({{InitDryW}},{{FinalDryW}}) %>%
mutate(Difference = {{FinalDryW}}/{{InitDryW}})
meanControl <- mean(control$Difference, na.rm = TRUE)
return (meanControl)
}
manipulation(data= total,
InitDryW = Initial_Dry_Weight,
FinalDryW = Final_Dry_Weight,
Treatment = "Control")
I think NSE (non-standard evaluation) could help you. At first it might be a little confusing, but it's quite an elegant way to leave the underscore functions behind :) All (?) the dplyr functions work that way in some form, so you are already familiar with the concept (even if you didn't know it by name):
... here is an example.
# some data
dat <- dplyr::tibble(A=1:5,
B=5:1)
# some function
some_function <- function(dat,
.var){
.var <- rlang::enquo(.var)
dat %>%
dplyr::select(!!.var)
}
# run function
some_function(dat,.var=B)
# output
# A tibble: 5 x 1
B
<int>
1 5
2 4
3 3
4 2
5 1
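A side note: in rlang 0.4.0 and later, the curly-curly operator {{ }} is shorthand for the enquo()/!! pair, so the same function could be written as this sketch:
some_function2 <- function(dat, .var){
  dat %>%
    dplyr::select({{ .var }})
}
some_function2(dat, .var = B)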
In dplyr, group_by() has a parameter add; when it is TRUE, the new variables are added to the existing grouping. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where, if x = TRUE, I want to add variable x_v to the summary.
I see several related Stack Overflow questions, but none that cover exactly this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct())
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct())
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as a parameter, so we can do:
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
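A side note: with dplyr 1.0.0 or later, across() accepts the same kind of named function list inside a plain summarise(), so the dynamic selection carries over directly. A sketch reusing the fcts list and the flags defined above (.names keeps the output columns named after the functions):
data %>%
summarise(across(val, fcts[c(summarize_num_distinct, summarize_num)], .names = "{.fn}"))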
Well, I know that there are already tons of related questions, but none gave an answer to my particular need.
I want to use dplyr "summarize" on a table with 50 columns, and I need to apply different summary functions to these.
"Summarize_all" and "summarize_at" both seem to have the disadvantage that it's not possible to apply different functions to different subgroups of variables.
As an example, let's assume the iris dataset would have 50 columns, so we do not want to address columns by names. I want the sum over the first two columns, the mean over the third and the first value for all remaining columns (after a group_by(Species)). How could I do this?
Fortunately, there is a much simpler way available now.
With the new dplyr 1.0.0 coming out soon, you can leverage the across function for this purpose.
All you need to type is:
iris %>%
group_by(Species) %>%
summarize(
# I want the sum over the first two columns,
across(c(1,2), sum),
# the mean over the third
across(3, mean),
# the first value for all remaining columns (after a group_by(Species))
across(-c(1:3), first)
)
Great, isn't it?
I first thought the across is not necessary as the scoped variants worked just fine, but this use case is exactly why the across function can be very beneficial.
You can get the latest version of dplyr by devtools::install_github("tidyverse/dplyr")
As other people have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if for every group of columns that you want to apply the summarizing function to. As far as I know, you would have to create a custom function that performs the summarization on each subset. You can, for example, set the colnames in such a way that you can use the select helpers (e.g. contains()) to pick just the columns that you want to apply the function to. Otherwise, you can pass the specific column numbers that you want to summarize.
For the example you mentioned, you could try the following:
summarizer <- function(tb, colsone, colstwo, colsthree,
funsone, funstwo, funsthree, group_name) {
return(bind_cols(
summarize_all(select(tb, colsone), .funs = funsone),
summarize_all(select(tb, colstwo), .funs = funstwo) %>%
ungroup() %>% select(-matches(group_name)),
summarize_all(select(tb, colsthree), .funs = funsthree) %>%
ungroup() %>% select(-matches(group_name))
))
}
#With colnames
iris %>% as.tibble() %>%
group_by(Species) %>%
summarizer(colsone = contains("Sepal"),
colstwo = matches("Petal.Length"),
colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
#With indexes
iris %>% as.tibble() %>%
group_by(Species) %>%
summarizer(colsone = 1:2,
colstwo = 3,
colsthree = 4,
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
You could summarise the data with each function separately and then join the data later if needed.
So something like this for the iris example:
sums <- iris %>% group_by(Species) %>% summarise_at(1:2, sum)
means <- iris %>% group_by(Species) %>% summarise_at(3, mean)
firsts <- iris %>% group_by(Species) %>% summarise_at(4, first)
full_join(sums, means) %>% full_join(firsts)
Though I would try to think of something else if there are more than a handful of summarising functions you need to use.
Try this:
library(plyr)
library(dplyr)
dataframe <- data.frame(var = c(1,1,1,2,2,2),var2 = c(10,9,8,7,6,5),var3=c(2,3,4,5,6,7),var4=c(5,5,3,2,4,2))
dataframe
# var var2 var3 var4
#1 1 10 2 5
#2 1 9 3 5
#3 1 8 4 3
#4 2 7 5 2
#5 2 6 6 4
#6 2 5 7 2
funnames<-c(sum,mean,first)
colnums<-c(2,3,4)
ddply(.data = dataframe,.variables = "var",
function(x,funcs,inds){
mapply(function(func,ind){
func(x[,ind])
},funcs,inds)
},funnames,colnums)
# var V1 V2 V3
#1 1 27 3 5
#2 2 18 6 2
See this - feature coming soon