Table of proportions for multiple factors with various levels - r

I have a data set like the following:
source <- c("Email","Email","University","Google","Wordpress","Government","University","Email")
TLD <- c(".com",".com","net",".com",".edu",".com",".gov",".org")
speed <- c("1MB/s to 10MB/s","1MB/s to 10MB/s","1KB/s to 99KB/s","100KB/s to 1MB/s","1MB/s to 10MB/s","1MB/s to 10MB/s","10MB/s to 100MB/s","1MB/s to 10MB/s")
ping <- c(120,250,32,66,502,222,307,21)
install <- c("Yes","No","No","No","Yes","Yes","No","Yes")
df <- data.frame(source,TLD,speed,ping,install)
I would like to make a single proportion table covering all of the categorical variables at once, if possible. Is there any way to do this?
My desired output would look something like this:
Factor   Level      N (%)
source   Email      5
         Google     10
         Wordpress  2
...      ....       ...
install  Yes        42
         No         58

Get the data in long format, count each occurrence of a column and its value, and calculate the percentage.
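library(dplyr)
df %>%
  # convert every column to character so they can share one value column
  mutate(across(everything(), as.character)) %>%
  tidyr::pivot_longer(cols = everything()) %>%
  count(name, value, name = 'N') %>%
  group_by(name) %>%
  # replace each count with its percentage within the column
  mutate(N = prop.table(N) * 100)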

This is how I ended up solving my problem.
Thanks for the help.
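# Categorical data summary: counts and percentages for one variable
prop.Fun <- function(x) {
  tab <- table(x, useNA = "ifany")
  dfProp <- cbind(tab, round(prop.table(tab) * 100, 2))
  colnames(dfProp) <- c("Count", "Proportion (%)")
  dfProp
}
# Column indices presumably refer to the full data set; the example df has only 5 columns
lapply(df[, c(1, 2, 3, 7, 8, 10, 15)], prop.Fun)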
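The per-column results can also be stacked into the single Factor / Level / N / % table the question asks for. The following is a minimal base R sketch, assuming "categorical" means character or factor columns; `prop_table_long` is a hypothetical helper name, and the `speed` column from the question is left out only to keep the example short.

```r
# Example data from the question (speed column omitted for brevity)
df <- data.frame(
  source  = c("Email", "Email", "University", "Google", "Wordpress",
              "Government", "University", "Email"),
  TLD     = c(".com", ".com", "net", ".com", ".edu", ".com", ".gov", ".org"),
  ping    = c(120, 250, 32, 66, 502, 222, 307, 21),
  install = c("Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per (column, level) with count and percentage
prop_table_long <- function(data) {
  # keep only character/factor columns (assumed definition of "categorical")
  is_cat <- vapply(data, function(col) is.character(col) || is.factor(col),
                   logical(1))
  pieces <- lapply(names(data)[is_cat], function(nm) {
    tab <- table(data[[nm]], useNA = "ifany")
    data.frame(
      Factor  = nm,
      Level   = names(tab),
      N       = as.integer(tab),
      Percent = round(100 * as.numeric(tab) / sum(tab), 2),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

result <- prop_table_long(df)
print(result)
```

The numeric `ping` column is skipped automatically, and percentages are computed within each column, so they sum to 100 per factor.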

Related

How can I group categorical variable that also meets another criteria in R? DPLYR?

I have a task that I can't solve. My goal is to be able to figure out how many "families" have children (under 18). I only need the sum of unique familyids and I've tried doing it in R and Excel and can't figure it out.
In my data I have four families and my data is saved on a client level.
data <- data.frame(
"FamilyID" = c(10,10,10,11,11,11,12,12,13,13),
"ClientID" = c(101,102,103,111,112,113,121,122,131,132),
"Age" = c(26,1,5,35,34,1,54,60,17,21)
)
My goal is to have something like this
Metric Count
Families w/ Children 3
Families w/out Children 1
In my actual dataset I have thousands of families, so I really appreciate any help.
How can I do this with dplyr?
library(tidyverse)
# count children (under 18) and adults per family
counts <- data %>%
  group_by(FamilyID) %>%
  summarise(number_of_children = sum(Age < 18), number_of_adults = sum(Age >= 18)) %>%
  ungroup()
# count families with and without children, then reshape to Metric / Count
final <- counts %>%
  summarise("Families w/ children" = sum(number_of_children > 0), "Families w/o children" = sum(number_of_children < 1)) %>%
  gather() %>%
  rename("Metric" = key, "Count" = value)
You can try something like this:
data2 <- data %>% group_by(FamilyID) %>%
mutate(children=sum(Age<18)) %>% mutate(children=ifelse(children>=1,1,0))
data2 %>% group_by(children) %>% summarize(n_distinct(FamilyID))
which shows how many distinct Family IDs correspond to 0 children, and how many correspond to at least 1 child.
One option is to use any() to flag whether each family has children (TRUE/FALSE), followed by dplyr::count:
library(dplyr)
data %>%
group_by(FamilyID) %>%
summarize(have_children = any(Age < 18)) %>%
count(have_children)
#------
  have_children     n
  <lgl>         <int>
1 FALSE             1
2 TRUE              3
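For comparison, the same count can be produced without dplyr and shaped directly into the Metric / Count layout from the question. This is a minimal base R sketch; the names `has_child` and `summary_tbl` are illustrative only.

```r
# Example data from the question
data <- data.frame(
  FamilyID = c(10, 10, 10, 11, 11, 11, 12, 12, 13, 13),
  ClientID = c(101, 102, 103, 111, 112, 113, 121, 122, 131, 132),
  Age      = c(26, 1, 5, 35, 34, 1, 54, 60, 17, 21)
)

# One TRUE/FALSE per family: does any member have Age < 18?
has_child <- tapply(data$Age < 18, data$FamilyID, any)

# Shape the result like the desired output in the question
summary_tbl <- data.frame(
  Metric = c("Families w/ Children", "Families w/out Children"),
  Count  = c(sum(has_child), sum(!has_child))
)
print(summary_tbl)
```

Because `tapply()` returns one value per distinct `FamilyID`, each family is counted once regardless of how many clients it contains.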

Select top n columns (based on an aggregation)

I have a data set with hundreds of columns, and I want to keep the top 20 columns with the highest average (it could be another aggregation such as sum or SD).
How can I do this efficiently?
One way I can think of is to create a vector of the averages of all columns, sort it in descending order, keep the top n values, and then use it to subset my data set.
I am looking for a more elegant way, and something that can also be part of a dplyr pipe (%>%) flow.
The code below creates a dummy data set; I would also appreciate suggestions for more elegant ways to create one.
# initialize data set
set.seed(101)
df <- as.data.frame(matrix(round(runif(25, 2, 5), 0), nrow = 5, ncol = 5))
# add more columns
for (i in 1:5) {
  set.seed(101)
  df_stage <- as.data.frame(matrix(
    round(runif(25, 5 * i, 10 * i), 0), nrow = 5, ncol = 5
  ))
  colnames(df_stage) <- paste("v", (10 * i):(10 * i + 4))
  df <- cbind(df, df_stage)
}
Another tidyverse approach with a bit of reshaping:
library(tidyverse)
n = 3
df %>%
summarise_all(mean) %>%
gather() %>%
top_n(n, value) %>%
pull(key) %>%
df[.]
We can do this with
library(dplyr)
n <- 3
df %>%
summarise_all(mean) %>%
unlist %>%
order(., decreasing = TRUE) %>%
head(n) %>%
df[.]
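The approach described in the question (compute one score per column, sort, keep the top n) can also be wrapped in a small function so the aggregation is easy to swap. This is a sketch in base R; `top_n_cols` is a hypothetical helper name, and it assumes every column is numeric and that `agg` returns a single number per column.

```r
# Hypothetical helper: keep the n columns with the highest aggregate value
top_n_cols <- function(data, n, agg = mean) {
  # one score per column, e.g. the mean, sum or sd
  scores <- vapply(data, agg, numeric(1))
  # positions of the n highest scores, in descending order
  keep <- order(scores, decreasing = TRUE)[seq_len(min(n, length(scores)))]
  data[, keep, drop = FALSE]
}

# Small deterministic example
small <- data.frame(a = c(1, 2), b = c(10, 20), c = c(5, 6))
top2 <- top_n_cols(small, 2)
print(names(top2))

# A dummy data set can be built in one call instead of a loop
set.seed(101)
dummy <- as.data.frame(matrix(runif(50), nrow = 5, ncol = 10))
top3_sd <- top_n_cols(dummy, 3, agg = sd)
```

Since the data frame is the first argument, the function also fits into a pipe, for example `df %>% top_n_cols(20)`.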

Group calculations using group_by and subset commands

I am a rookie Stata user trying to make the jump to R. I am working through various exercises, but keep getting something wrong with the group_by and subset commands.
I have a simple dataset on which I wish to make group-based calculations. I am trying to use the group_by command from the dplyr package to do this.
My dataset is called itchy and consists of 4 variables:
treat - levels A and B (type of treatment)
type - levels Dark and Fair (skin colour)
y - levels 0 and 1 (failure or success of treatment)
freq - numerical variable indicating how many are in this particular group
Using this code you can recreate it:
type <- c(2,2,2,2,1,1,1,1)
treat <-c(1,1,2,2,1,1,2,2)
y <- c(1,0,1,0,1,0,1,0)
freq <- c(9,17,5,20,10,15,3,20)
itchy <- cbind.data.frame(type,treat,y,freq)
itchy$type <- as.factor(type)
itchy$type <- factor(itchy$type,levels = c(1,2), labels = c("Dark", "Fair"))
itchy$treat <- as.factor(treat)
itchy$treat <- factor(itchy$treat,levels = c(1,2), labels = c("A", "B"))
itchy$y <- as.factor(y)
itchy$y <- factor(itchy$y,levels = c(0,1), labels = c("failure", "succes"))
Now I would like to calculate the odds of success for treatments A and B when applied to skin type Dark or Fair (odds = number of successes / number of failures).
I have two questions:
1) Can you help me do the odds calculations by group?
2) I have tried various combinations of group_by and subset, without any luck. The code below shows some of my unsuccessful attempts. Can you tell whether I have a basic misunderstanding of how the group_by and subset commands work?
itchy %>% group_by(treat, type) %>% summarize(ods = (subset(freq, y==1)/subset(freq, y==0)))
itchy %>% group_by(treat, type) %>% ods <- c((subset(freq, y==1)/subset(freq, y==0)))
itchy %>% group_by(treat, type) %>% itchy$ods <- (subset(freq, y==1)/subset(freq, y==0))
If I understand you correctly, I think the following will work. I made use of the spread function from the tidyr package, which, like dplyr, is part of the tidyverse:
library(dplyr)
library(tidyr)
itchy %>%
  spread(y, freq) %>%
  mutate(odds = succes / failure)
  type treat failure succes      odds
1 Dark     A      15     10 0.6666667
2 Dark     B      20      3 0.1500000
3 Fair     A      17      9 0.5294118
4 Fair     B      20      5 0.2500000
# total frequency per outcome, treatment and skin type
junk = itchy %>% group_by(y, treat, type) %>% summarize(Overall = sum(freq))
# rows come out ordered failure, then succes, so odds = row 2 / row 1
myfunc = function(arg1, arg2){
  rows <- filter(junk, treat == arg1, type == arg2)
  rows[2,4]/rows[1,4]
}
myfunc("A","Dark") # You can try all the various combinations here
Does this give you the desired result?
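On the second question: after the recoding step, `y` is a factor with the labels "failure" and "succes", so a comparison such as `y == 1` never matches and `subset(freq, y == 1)` returns an empty vector. Comparing against the labels avoids this. The sketch below uses base R's `xtabs()` as an alternative way to get the odds for every type/treat combination at once; it is not the method used in the answers above.

```r
# Rebuild the itchy data from the question (label "succes" kept as in the original)
itchy <- data.frame(
  type  = factor(c(2, 2, 2, 2, 1, 1, 1, 1), levels = c(1, 2),
                 labels = c("Dark", "Fair")),
  treat = factor(c(1, 1, 2, 2, 1, 1, 2, 2), levels = c(1, 2),
                 labels = c("A", "B")),
  y     = factor(c(1, 0, 1, 0, 1, 0, 1, 0), levels = c(0, 1),
                 labels = c("failure", "succes")),
  freq  = c(9, 17, 5, 20, 10, 15, 3, 20)
)

# After recoding, comparing y with 1 matches nothing
n_matches <- sum(itchy$y == 1)

# Cross-tabulate the frequencies: type x treat x outcome
counts <- xtabs(freq ~ type + treat + y, data = itchy)

# Odds of success = successes / failures for each type/treat cell
odds <- counts[, , "succes"] / counts[, , "failure"]
print(round(odds, 4))
```

Inside a grouped `summarise()`, the equivalent expression would be `freq[y == "succes"] / freq[y == "failure"]`.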

Can't split dataframe into equal buckets preserving order without introducing Xn. prefix

I am trying to split an ordered data frame into 10 equal buckets. The following works but it introduces an X1., X2., X3. ... prefix to each bucket, which prevents me from iterating over the buckets to sum them.
num_dfs <- 10
buckets<-split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
Produces a df[10] that looks like:
$`10`
       predicted_duration actual_duration
177188         23.7402944               6
466561         23.7402663              12
479556         23.7401721               5
147585         23.7401666              48
Here's the crude code I am using to try to sum the groups.
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(as.data.frame(df[i],row.names=NULL)$X1.actual_duration) # X1., X2.,
print(paste(i,"=",p))
}
How do I remove the Xn. grouping prefix or programmatically reference it using the index i?
Here's a similar reproducible example:
df<-data.frame(actual_duration=sample(100))
num_dfs <- 10
df_grouped<-as.data.frame(split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs))))
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(df[i]$actual_duration) # does not work because postfix .1, .2.. was added by R
print(paste(p))
}
I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group, couldn't you use the following?
library(tidyverse)
df <- data.frame(actual_duration=sample(100))
df %>%
arrange(actual_duration) %>%
mutate(group = rep(1:10, each = 10)) %>%
group_by(group) %>%
summarise(sums = sum(actual_duration))
Alternatively, if you want to keep the list format:
df %>%
arrange(actual_duration) %>%
mutate(group = factor(rep(1:10, each = 10))) %>%
split(., .$group) %>%
map(., function(x) sum(x$actual_duration))
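The X1., X2., ... prefixes appear when the list returned by split() is passed to as.data.frame(), which binds the buckets side by side and has to make the column names unique. Keeping the result as a list avoids the prefixes entirely, and each bucket can then be reached by index. A minimal base R sketch, with `bucket_id` and `bucket_sums` as illustrative names:

```r
set.seed(1)
df <- data.frame(actual_duration = sample(100))
num_dfs <- 10

# Bucket label for each row, in the existing row order
bucket_id <- rep(seq_len(num_dfs),
                 each = ceiling(nrow(df) / num_dfs),
                 length.out = nrow(df))

# split() returns a named list of data frames; column names are unchanged
buckets <- split(df, bucket_id)

# Sum each bucket; buckets[[i]] gives the i-th bucket if a loop is preferred
bucket_sums <- vapply(buckets, function(b) sum(b$actual_duration), numeric(1))
print(bucket_sums)
```

Using `length.out` keeps the label vector the same length as the data even when the number of rows is not an exact multiple of `num_dfs`.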

R: dplyr::group_by failing on a pxR data frame

dplyr::group_by() fails to group the variables of the following data.frame contained in a pc-axis file:
library("pacman")
pacman::p_load(pxR, dplyr, janitor)
px_file <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1502040100_131"
pxR::read.px(base::url(px_file))$DATA$value %>% # the data.frame
janitor::clean_names() %>%
dplyr::select (student_level = studienstufe,
year = jahr,
counts = value) %>% # dplyr::rename() also fails
dplyr::group_by (year, student_level) %>% # not grouping!
dplyr::summarise(totals = sum (counts))
I believe it could be due to an encoding issue, but I cannot find the problem. Any ideas? Thanks.
The only fault I could find was that you use select instead of rename. You wrote that rename didn't work for you. This worked for me:
library("pacman")
library("dplyr")
library("janitor")
# Loading your data
pacman::p_load(pxR, dplyr, janitor)
px_file <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1502040100_131"
px <- pxR::read.px(base::url(px_file))$DATA$value
# Cleaning the column names
px1 <- px %>% janitor::clean_names()
# Rename the columns
px2 <- px1 %>%
dplyr::rename (student_level = studienstufe,
sex = geschlecht,
year = jahr,
counts = value)
# Grouping data
px3 <- px2 %>%
dplyr::group_by (year, student_level) %>%
dplyr::summarise(totals = sum (counts))
I split every step into its own data frame to see the result. This is not necessary.
If this doesn't work, you may upload your session info.
P.S. I also renamed the column geschlecht :)

Resources