Producing multiple frequency tables at once in R - r

I have a dataframe with different kinds of variables (numeric, character, factor) in the columns, which I would like to summarise at once. I have an ID column to be counted according to the levels of the other columns.
Every column has different levels if they are character or factor and I would like to know the frequency of the IDs for each level. In addition if the column is numeric I would like to have returned summary statistics such as mean, sd, and quantiles.
Ideally I would do this with dplyr with group_by() and summarise() functions but it requires me to group each column at a time and then specify whether I want it counted with n() or whether I want summary statistics because of being numeric.
In SAS there is a command known as PROC FREQ which I am trying to replicate.
df<-
data.frame(
ID = c(1,2,3,4,5,6),
Age = c(20, 30, 45, 60, 70, 18),
Car = c("Zum", "Yat", "Zum", "Zum", "Yat", "Rel"),
Side = c("Left", "Right", "Left", "Left", "Right", "Right")
)
Result:
df %>% group_by(Car) %>% summarise(n = n())
df %>% group_by(Side) %>% summarise(n = n())
df %>% summarise(mean = mean(Age))
I would like to obtain this result in a single output and for many variables. My real df contains tens of columns which should be either grouping variables or not depending on their nature. In addition the ID could be even repeated with the same values for the observations to be summarised.

You could write a function that takes action based on each column's class. Here, we calculate the mean if the column is numeric, or else count the occurrences of each unique value in the column.
library(dplyr)
purrr::map(names(df)[-1], function(x) {
if(is.numeric(df[[x]])) df %>% summarise(mean = mean(.data[[x]]))
else df %>% count(.data[[x]])
})
#[[1]]
# mean
#1 40.5
#[[2]]
# Car n
#1 Rel 1
#2 Yat 2
#3 Zum 3
#[[3]]
# Side n
#1 Left 3
#2 Right 3
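The question also asks for sd and quantiles, and mentions that IDs may be repeated. The following is a minimal sketch of how the same idea could be extended, not a definitive implementation. The helper name summarise_column, its id argument, and the output column names (n_ids, q25, q75) are made up for illustration; counting distinct IDs with n_distinct() is an assumption about how repeated IDs should be treated.

```r
library(dplyr)

df <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  Age = c(20, 30, 45, 60, 70, 18),
  Car = c("Zum", "Yat", "Zum", "Zum", "Yat", "Rel"),
  Side = c("Left", "Right", "Left", "Left", "Right", "Right")
)

# Hypothetical helper: pick the summary according to the column type.
summarise_column <- function(data, col, id = "ID") {
  if (is.numeric(data[[col]])) {
    # Numeric column: mean, sd and quartiles.
    data %>%
      summarise(
        mean = mean(.data[[col]], na.rm = TRUE),
        sd = sd(.data[[col]], na.rm = TRUE),
        q25 = quantile(.data[[col]], 0.25, na.rm = TRUE, names = FALSE),
        median = median(.data[[col]], na.rm = TRUE),
        q75 = quantile(.data[[col]], 0.75, na.rm = TRUE, names = FALSE)
      )
  } else {
    # Character/factor column: number of distinct IDs per level
    # (assumption: a repeated ID should be counted once).
    data %>%
      group_by(across(all_of(col))) %>%
      summarise(n_ids = n_distinct(.data[[id]]), .groups = "drop")
  }
}

# One summary per column, returned as a named list.
cols <- setdiff(names(df), "ID")
summaries <- setNames(lapply(cols, function(x) summarise_column(df, x)), cols)
```

Each element can then be inspected by name, for example summaries$Car or summaries$Age.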

Related

How to merge rows based on conditions with characters values? (Household data)

I have a data frame in which the first column indicates the work (manager, employee or worker), the second indicates whether the person works at night or not and the last is a household code (if two individuals share the same code then it means that they share the same house).
# Here is the reproducible data:
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS" I have new categories based on the combination of answers: manager+worker="I"; manager+employee="II"; employee+employee="VI"; worker+worker="III"; etc.
For the variable "work_night", I would like to apply a score (if both answered Yes then score = 2, if one answered Yes then score = 1, and if both answered No then score = 0).
To be clear, I would like my data frame to look like this :
HHnum PCS work_night
1 "I" 2
2 "VI" 0
3 "III" 1
4 "II" 1
5 "II" 1
How can I do this in R using dplyr? I know that I need group_by() but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you had more categories than 3, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
as.data.frame() %>%
dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with "combos" df.
df_new <- df %>%
dplyr::group_by(HHnum) %>%
dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
dplyr::rename("V1" = 3, "V2" = 4) %>%
dplyr::left_join(combos, by = c("V1", "V2")) %>%
unique() %>%
dplyr::select(HHnum, PCS, work_night)
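As a shorter alternative, here is a minimal sketch that collapses each household in a single summarise() call. It assumes exactly the four pair-to-category mappings stated in the question; the lookup vector pcs_lookup and its "a+b" key format are made up for illustration, and any pair not listed would come out as NA until the lookup is extended.

```r
library(dplyr)

PCS <- c("worker", "manager", "employee", "employee", "worker",
         "worker", "manager", "employee", "manager", "employee")
work_night <- c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes")
HHnum <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(PCS, work_night, HHnum)

# Hypothetical lookup: keys are the two jobs sorted alphabetically and
# joined with "+", values are the categories given in the question.
pcs_lookup <- c(
  "manager+worker"    = "I",
  "employee+manager"  = "II",
  "worker+worker"     = "III",
  "employee+employee" = "VI"
)

households <- df %>%
  group_by(HHnum) %>%
  summarise(
    # Count the "Yes" answers first, while work_night is still per person.
    work_night = sum(work_night == "Yes"),
    # Sort the pair so the order of the two members does not matter.
    PCS = unname(pcs_lookup[paste(sort(as.character(PCS)), collapse = "+")]),
    .groups = "drop"
  ) %>%
  select(HHnum, PCS, work_night)
```

Sorting the pair before the lookup avoids having to store both orderings of each combination.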

How can I group categorical variable that also meets another criteria in R? DPLYR?

I have a task that I can't solve. My goal is to be able to figure out how many "families" have children (under 18). I only need the sum of unique familyids and I've tried doing it in R and Excel and can't figure it out.
In my data I have four families and my data is saved on a client level.
data <- data.frame(
"FamilyID" = c(10,10,10,11,11,11,12,12,13,13),
"ClientID" = c(101,102,103,111,112,113,121,122,131,132),
"Age" = c(26,1,5,35,34,1,54,60,17,21)
)
My goal is to have something like this
Metric Count
Families w/ Children 3
Families w/out Children 1
In my actual dataset I have thousands of families so I really appreciate any help.
How can I do this with dplyr?
library(tidyverse)
counts <- data %>%
group_by(FamilyID) %>%
summarise(number_of_children = sum(Age < 18), number_of_adults = sum(Age >= 18)) %>%
ungroup()
final <- counts %>%
summarise("Families w/ children" = sum(number_of_children > 0), "Families w/o children" = sum(number_of_children < 1)) %>%
gather() %>%
rename("Metric" = key, "Count" = value)
You can try something like this:
data2 <- data %>% group_by(FamilyID) %>%
mutate(children=sum(Age<18)) %>% mutate(children=ifelse(children>=1,1,0))
data2 %>% group_by(children) %>% summarize(n_distinct(FamilyID))
which shows how many distinct Family IDs correspond to 0 children, and how many correspond to at least 1 child.
One option is to use any to distinguish which families have children TRUE/FALSE, followed by dplyr::count
library(dplyr)
data %>%
group_by(FamilyID) %>%
summarize(have_children = any(Age < 18)) %>%
count(have_children)
#------
have_children n
<lgl> <int>
1 FALSE 1
2 TRUE 3
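To get the exact Metric/Count layout from the question, the per-family flag can be reshaped by hand. This is only a sketch under the assumption that "children" means Age < 18; the names flags and result are made up for illustration.

```r
library(dplyr)

data <- data.frame(
  FamilyID = c(10, 10, 10, 11, 11, 11, 12, 12, 13, 13),
  ClientID = c(101, 102, 103, 111, 112, 113, 121, 122, 131, 132),
  Age = c(26, 1, 5, 35, 34, 1, 54, 60, 17, 21)
)

# One row per family: TRUE if at least one member is under 18.
flags <- data %>%
  group_by(FamilyID) %>%
  summarise(has_children = any(Age < 18), .groups = "drop")

# Build the two-row summary table in the layout asked for.
result <- data.frame(
  Metric = c("Families w/ Children", "Families w/out Children"),
  Count = c(sum(flags$has_children), sum(!flags$has_children))
)
```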

Divide whole dataframe by mean of control group for each of several sub-groups

Starting data
I'm working in R and I have a set of data generated from groups (cohorts) of animals treated with different doses of different drugs. A simplified reproducible example of my dataset follows:
# set starting values for simulation of animal cohorts across doses of various drugs with a few numeric endpoints
cohort_size <- 3
animals <- letters[1:cohort_size]
drugs <- factor(c("A", "B", "C"))
doses <- factor(c(0, 10, 100))
total_size <- cohort_size * length(drugs) * length(doses)
# simulate data based on above parameters
df <- cbind(expand.grid(drug = drugs, dose = doses, animal = animals),
data.frame(
other_metadata = sample(LETTERS[24:26], size = total_size, replace = TRUE),
num1 = rnorm(total_size, mean = 10, sd = 3),
num2 = rnorm(total_size, mean = 60, sd = 9),
num3 = runif(total_size, min = 1, max = 5)))
This produces something like:
## drug dose animal other_metadata num1 num2 num3
## 1 A 0 a X 6.448411 54.49473 4.111368
## 2 B 0 a Y 9.439396 67.39118 4.917354
## 3 C 0 a Y 8.519773 67.11086 3.969524
## 4 A 10 a Z 6.286326 69.25982 2.194252
## 5 B 10 a Y 12.428265 70.32093 1.679301
## 6 C 10 a X 13.278707 68.37053 1.746217
My goal
For each drug treatment, I consider the dose == 0 animals as my control group for that drug (let's say each was run at a different time and has its own control group). I wish to calculate the mean for each numeric endpoint (columns 5:7 in this example) of the control group. Next I want to normalize (divide) every numeric endpoint (columns 5:7) for every animal by the mean of its respective control group.
In other words num1 for all animals where drug == "A" should be divided by the mean of num1 for all animals where drug == "A" AND dose == 0 and so on for each endpoint.
The final output should be the same size as the original data.frame with all of the non-numeric metadata columns remaining unchanged on the left side and all the numeric data columns now with the normalized values.
Naturally I'd like to find the simplest solution possible - minimizing creation of new variables and ideally in a single dplyr pipeline if possible.
What I've tried so far
I should say that I have technically solved this but the solution is super ugly with a ton of steps so I'm hoping to get help to find a more elegant solution.
I know I can easily get the averages for the control groups into a new data.frame using:
df %>%
filter(dose == 0) %>%
group_by(drug, dose) %>%
summarise_all(mean)
I've looked into several things but can't figure out how to implement them. In order of what seems most promising to me:
dplyr::group_modify()
dplyr::rowwise()
sweep() in some type of loop
Thanks in advance for any help you can offer!
If the intention is to divide the numeric columns by the mean of the control group values within each 'drug', group by 'drug', then use mutate with across (from dplyr 1.0.0) to divide the column values (.) by the mean of the values where 'dose' is 0.
library(dplyr) # 1.0.0
df %>%
group_by(drug) %>%
mutate(across(where(is.numeric), ~ ./mean(.[dose == 0])))
If the dplyr version is < 1.0.0, use mutate_if
df %>%
group_by(drug) %>%
mutate_if(is.numeric, ~ ./mean(.[dose == 0]))
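One way to convince yourself that the normalization does what is intended is to run it on a small hand-made table where the control means are easy to compute. The data frame toy below is made up for illustration (two drugs, one numeric endpoint); after normalization, the dose == 0 rows of each drug should average to 1.

```r
library(dplyr)

# Made-up data: control means are 3 for drug A and 20 for drug B.
toy <- data.frame(
  drug = rep(c("A", "B"), each = 4),
  dose = factor(rep(c(0, 0, 10, 100), times = 2)),
  num1 = c(2, 4, 6, 9, 10, 30, 40, 5)
)

normalized <- toy %>%
  group_by(drug) %>%
  # Divide every numeric column by its control-group mean within each drug.
  mutate(across(where(is.numeric), ~ .x / mean(.x[dose == 0]))) %>%
  ungroup()

# Sanity check: the control rows of each drug now average to 1.
control_means <- normalized %>%
  filter(dose == 0) %>%
  group_by(drug) %>%
  summarise(num1 = mean(num1), .groups = "drop")
```

Note that dose is a factor here, as in the question, so dose == 0 compares against the level label "0".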

Filtering a Data Frame with Very specific Requirements

First, I am not a developer and have little experience with R, so please forgive me. I have tried to get this done on my own, but have run out of ideas for filtering a data frame using the 'filter' command.
The data frame has about a dozen or so columns, with one being Grp (meaning Group). This is a FIFA soccer dataset, so the Group in this context means the general position the player is in (Defense, Midfield, Goalkeeper, Forward).
I need to filter this data frame to provide me this exact information:
the Top 4 Defense Players
the Top 4 Midfield Players
the Top 2 Forwards
the Top 1 Goalkeeper
What do I mean by "Top"? It's arranged by the Grp column, which is just a numeric number. So, Top 4 would be like 22,21,21,20 (or something similar because that numeric number could in fact be repeated for different players). The Growth column is the difference between the Potential Column and Overall column, so again just a simple subtraction to find the difference between them.
#Create a subset of the data frame
library(dplyr)
fifa2 <- fifa %>% select(Club,Name,Position,Overall,Potential,Contract.Valid.Until2,Wage2,Value2,Release.Clause2,Grp) %>% arrange(Club)
#Add columns for determining potential
fifa2$Growth <- fifa2$Potential - fifa2$Overall
head(fifa2)
#Find Southampton Players
ClubName <- filter(fifa2, Club == "Southampton") %>%
group_by(Grp) %>% arrange(desc(Growth), .by_group=TRUE) %>%
top_n(4)
ClubName
ClubName2 <- ggplot(ClubName, aes(x=forcats::fct_reorder(Name, Grp),
y=Growth, fill = Grp)) +
geom_bar(stat = "identity", colour = "black") +
coord_flip() + xlab("Player Names") + ylab("Unfilled Growth Potential") +
ggtitle("Southampton Players, Grouped by Position")
ClubName2
That chart produces a list of players that ends up having the Top 4 players in each position (top_n(4)), but I need it further filtered per the logic I described above. How can I achieve this? I tried fooling around with dplyr and that is fairly easy to get rows by Grp name, but don't see how to filter it to the 4-4-2-1 that I need. Any help appreciated.
Sample output from fifa2 and ClubName (which shows the data sorted by top_n(4)): [image: fifa2_Dataset]
This might not be the most elegant solution, but hopefully it works :)
# create dummy data
data_test = data.frame(grp = sample(c("def", "mid", "goal", "front"), 30, replace = T), growth = rnorm(30, 100,10), stringsAsFactors = F)
# create referencetable to give the number of players needed per grp
desired_n = data.frame(grp = c("def", "mid", "goal", "front"), top_n_desired = c(4,4,1,2), stringsAsFactors = F)
# > desired_n
# grp top_n_desired
# 1 def 4
# 2 mid 4
# 3 goal 1
# 4 front 2
# group and arrange, then look up the desired number of players in the reference table and select them.
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:desired_n$top_n_desired[which(first(grp) == desired_n$grp)]) %>%
arrange(grp)
# A bit more readable, but you have to create an additional column in your dataframe
# create additional column with desired amount for the position written in grp of each player
data_test = merge(data_test, desired_n, by = "grp", all.x = T
)
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:first(top_n_desired)) %>%
arrange(grp)
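The same idea can also be written with a join followed by a per-group row filter, which avoids indexing into the reference table by hand. This is a sketch on made-up data; the names players, quota, and n_wanted are assumptions for illustration. Ties in growth are broken by row order here, which may or may not be what you want.

```r
library(dplyr)

# Made-up players with distinct growth values per position group.
players <- data.frame(
  name = paste0("p", 1:15),
  grp = rep(c("def", "mid", "front", "goal"), times = c(5, 5, 3, 2)),
  growth = c(5, 9, 1, 7, 3, 8, 2, 6, 4, 10, 3, 1, 2, 0, 4)
)

# Number of players wanted per position (the 4-4-2-1 from the question).
quota <- data.frame(
  grp = c("def", "mid", "front", "goal"),
  n_wanted = c(4, 4, 2, 1)
)

selected <- players %>%
  inner_join(quota, by = "grp") %>%
  group_by(grp) %>%
  arrange(desc(growth), .by_group = TRUE) %>%
  # Keep the first n_wanted rows of each group after sorting.
  filter(row_number() <= n_wanted) %>%
  ungroup()
```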

R dplyr summarise date gaps

I have data on a set of students and the semesters they were enrolled in courses.
ID = c(1,1,1,
2,2,
3,3,3,3,3,
4)
The semester variable "Date" is coded as the year followed by 20 for spring, 30 for summer, and 40 for fall. So the Date value 201430 is the summer semester of 2014.
Date = c(201220,201240,201330,
201340,201420,
201120,201340,201420,201440,201540,
201640)
Enrolled<-data.frame(ID,Date)
I'm using dplyr to group the data by ID and to summarise various aspects about a given student's enrollment history
Enrollment.History<-dplyr::select(Enrolled,ID,Date)%>%group_by(ID)%>%summarise(Total.Semesters = n_distinct(Date),
First.Semester = min(Date))
I'm trying to get a measure for the number of enrollment gaps that each student has, as well as the size of the largest enrollment gap. The data frame should end up looking like this:
Enrollment.History$Gaps<-c(2,0,3,0)
Enrollment.History$Biggest.Gap<-c(1,0,7,0)
print(Enrollment.History)
I'm just trying to figure out what the best way to code those gap variables. Is it better to turn that Date variable into an ordered factor? I hope this is a simple solution
Since you are not dealing with real dates in a standard format, you can instead make use of factors to compute the gaps.
First you need to define a vector of all possible year/semester combinations ("Dates") in the correct order (this is important!).
all_semesters <- c(sapply(2011:2016, paste0, c(20,30,40)))
Then, you can create a new factor variable, arrange the data by ID and Date, and finally compute the maximum difference between two semesters:
Enrolled %>%
mutate(semester = factor(Date, levels = all_semesters)) %>%
group_by(ID) %>%
arrange(Date) %>%
summarise(max_gap = max(c(0, diff(as.integer(semester)) -1), na.rm = TRUE))
## A tibble: 4 × 2
# ID max_gap
# <dbl> <dbl>
#1 1 1
#2 2 0
#3 3 7
#4 4 0
I used max(c(0, ...)) in the summarise, because otherwise you would end up with -Inf for IDs with a single entry.
Similarly, you could also achieve this by using match instead of a factor:
Enrolled %>%
mutate(semester = match(Date, all_semesters)) %>%
group_by(ID) %>%
arrange(Date) %>%
summarise(max_gap = max(c(0, diff(semester) -1), na.rm = TRUE))
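The question also asks for the number of gaps per student. A minimal sketch of how both measures could be computed together follows, assuming a "gap" is any stretch of one or more skipped semesters between two consecutive enrollments (which reproduces the values given in the question). The name gap_summary is made up for illustration.

```r
library(dplyr)

ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4)
Date <- c(201220, 201240, 201330,
          201340, 201420,
          201120, 201340, 201420, 201440, 201540,
          201640)
Enrolled <- data.frame(ID, Date)

# All possible semesters in chronological order.
all_semesters <- c(sapply(2011:2016, paste0, c(20, 30, 40)))

gap_summary <- Enrolled %>%
  # Position of each semester in the chronological sequence.
  mutate(semester = match(Date, all_semesters)) %>%
  arrange(ID, semester) %>%
  group_by(ID) %>%
  summarise(
    # A step larger than 1 means at least one semester was skipped.
    Gaps = sum(diff(semester) > 1),
    # Number of skipped semesters in the longest stretch (0 if none).
    Biggest.Gap = max(c(0, diff(semester) - 1)),
    .groups = "drop"
  )
```

For a student with a single semester, diff() returns an empty vector, so Gaps is 0 and the c(0, ...) guard keeps Biggest.Gap at 0.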
