I have a dataset that I'd like to summarise. My data looks like this.
The table in Sheet1 refers to the original table.
The table in Sheet2 is the result I'd like to get, using dplyr.
Basically, for each variable (Our Website, Friendliness of Staff, and Food Quality), I'd like the sum of 'Satisfied' + 'Very Satisfied', expressed as a percentage of the total number of respondents for the Parameter. For example, the 80% in the Internet column is 4 (Satisfied + Very Satisfied) / 5 (total number of respondents whose mode of reservation is Internet) * 100 = 80%.
I used this code but I'm not getting the desired result:
test %>%
group_by(Parameter.1..Mode.of.reservation,Our.Website) %>%
select(Our.Website,Friendliness.of.Staff,Food.Quality) %>%
summarise_each(funs(freq = n()))
Any help would be appreciated.
@ira's solution can be streamlined if you gather the data prior to summarizing. This way you skip the multiple assignments.
library(tidyverse)
library(googlesheets)
library(scales)
# Authorize with google.
gs_auth()
# Register the sheet
gs_data <- gs_url("https://docs.google.com/spreadsheets/d/1zljXN7oxUvij2mXHiyuRVG3xp5063chEFW_QERgHegg/")
# Read in the first worksheet
data <- gs_read(gs_data, ws = 1)
# Summarize using tidyr/dplyr
data %>%
gather(item, response, -1:-2) %>%
filter(!is.na(response)) %>%
group_by(`Parameter 1: Mode of reservation`, item) %>%
summarise(percentage = percent(sum(response %in% c("Satisfied","Very Satisfied"))/n())) %>%
spread(`Parameter 1: Mode of reservation`, percentage)
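For anyone on a recent tidyr, the same idea can be sketched with pivot_longer()/pivot_wider(), which supersede gather()/spread(). The toy data and the column names (mode, item, response, pct) below are made up for illustration and are not the asker's sheet:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for Sheet1: five Internet and two Phone respondents.
toy <- tibble(
  mode = c(rep("Internet", 5), rep("Phone", 2)),
  `Our Website` = c("Satisfied", "Very Satisfied", "Satisfied",
                    "Very Satisfied", "Dissatisfied", "Neutral", "Satisfied"),
  `Food Quality` = c("Satisfied", "Neutral", "Neutral",
                     "Very Satisfied", "Dissatisfied", "Satisfied", "Satisfied")
)

top_two <- c("Satisfied", "Very Satisfied")

result <- toy %>%
  # one row per respondent and survey item
  pivot_longer(-mode, names_to = "item", values_to = "response") %>%
  filter(!is.na(response)) %>%
  group_by(mode, item) %>%
  # mean() of a logical vector is the share of TRUE values
  summarise(pct = 100 * mean(response %in% top_two), .groups = "drop") %>%
  # modes of reservation become columns, items stay as rows
  pivot_wider(names_from = mode, values_from = pct)

result
```

Here 4 of the 5 Internet respondents are in the top two boxes for Our Website, so that cell comes out as 80, matching the arithmetic in the question.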
After using dplyr to summarise the data, you can use tidyr to transpose the dataset so that you have the columns and rows just as you asked in the question.
# read in the data
data <- read.csv("C:/RSnips/My Dataset - Sheet1.csv")
# load libraries
library(dplyr)
library(tidyr)
# take the loaded data
data2 <- data %>%
# group it by mode of reservation
group_by(Parameter.1..Mode.of.reservation) %>%
# summarise
summarise(
# count how many times website column takes values sat or very sat and divide by number of observations in each group given by group_by
OurWeb = sum(Our.Website == "Satisfied" |
Our.Website == "Very Satisfied")/n(),
# do the same for Staff and food
Staff = sum(Friendliness.of.Staff == "Satisfied" |
Friendliness.of.Staff == "Very Satisfied")/n(),
Food = sum(Food.Quality == "Satisfied" |
Food.Quality == "Very Satisfied")/n()) %>%
# If you want to have email, internet and phone in columns
# use tidyr package to transpose the dataset
# first turn it into a long format, where mode of the original columns are your key
gather(categories, val, 2:(ncol(data)-1)) %>%
# then turn it back to wide format, but mode of reservation will be in columns
spread(Parameter.1..Mode.of.reservation, val)
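If typing out one sum() per column gets tedious, across() (dplyr 1.0.0 or later) may be worth a look. This is only a sketch on invented data; the names toy, mode, web and staff are placeholders, not the asker's columns:

```r
library(dplyr)

# Hypothetical data: two respondents per mode of reservation.
toy <- tibble(
  mode  = c("Email", "Email", "Phone", "Phone"),
  web   = c("Satisfied", "Neutral", "Very Satisfied", "Satisfied"),
  staff = c("Neutral", "Neutral", "Satisfied", "Dissatisfied")
)

top_two <- c("Satisfied", "Very Satisfied")

shares <- toy %>%
  group_by(mode) %>%
  # apply the same top-two share to every listed column
  summarise(across(c(web, staff), ~ mean(.x %in% top_two)), .groups = "drop")

shares
```

Using %in% instead of == also means an NA response counts as "not satisfied" rather than turning the whole sum into NA.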
How about:
# column names as produced by read.csv() for the question's sheet
data %>%
mutate(OurWebsite2 = ifelse(Our.Website == "Very Satisfied" | Our.Website == "Satisfied", 1, 0),
Friendlinessofstaff2 = ifelse(Friendliness.of.Staff == "Very Satisfied" | Friendliness.of.Staff == "Satisfied", 1, 0),
FoodQuality2 = ifelse(Food.Quality == "Very Satisfied" | Food.Quality == "Satisfied", 1, 0)) %>%
group_by(Parameter.1..Mode.of.reservation) %>%
summarise(OurWebsiteSatisfaction = mean(OurWebsite2),
FriendlinessofstaffSatisfaction = mean(Friendlinessofstaff2),
FoodQualitySatisfaction = mean(FoodQuality2))
Related
I have a data frame in which the first column indicates the work (manager, employee or worker), the second indicates whether the person works at night or not and the last is a household code (if two individuals share the same code then it means that they share the same house).
# Here is the reproducible data:
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS" I have new categories based on the combination of answers: manager+worker = "I"; manager+employee = "II"; employee+employee = "VI"; worker+worker = "III"; etc.
For the variable "work_night", I would like to apply a score (if both answered Yes then score = 2, if one answered Yes then score = 1, and if both answered No then score = 0).
To be clear, I would like my data frame to look like this :
HHnum PCS work_night
1 "I" 2
2 "VI" 0
3 "III" 1
4 "II" 1
5 "II" 1
How can I do this in R using dplyr? I know that I need group_by(), but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you had more categories than 3, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
as.data.frame() %>%
dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with "combos" df.
df_new <- df %>%
dplyr::group_by(HHnum) %>%
dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
dplyr::rename("V1" = 3, "V2" = 4) %>%
dplyr::left_join(combos, by = c("V1", "V2")) %>%
unique() %>%
dplyr::select(HHnum, PCS, work_night)
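A shorter route, if the category codes are fixed in advance, might be to collapse each household with summarise() and look the pair up in a named vector. The lookup pcs_codes below is my own helper and only contains the four combinations spelled out in the question; any other pair would come back as NA until it is added:

```r
library(dplyr)

PCS <- c("worker", "manager", "employee", "employee", "worker",
         "worker", "manager", "employee", "manager", "employee")
work_night <- c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes")
HHnum <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(PCS, work_night, HHnum)

# Hypothetical lookup, keyed by the alphabetically sorted pair of jobs.
pcs_codes <- c(
  "manager+worker"    = "I",
  "employee+manager"  = "II",
  "worker+worker"     = "III",
  "employee+employee" = "VI"
)

households <- df %>%
  group_by(HHnum) %>%
  summarise(
    # sort() makes worker+manager and manager+worker hit the same key
    PCS = unname(pcs_codes[paste(sort(PCS), collapse = "+")]),
    # number of household members who answered "Yes"
    work_night = sum(work_night == "Yes"),
    .groups = "drop"
  )

households
```

Sorting before pasting is what removes the need for a second, reversed copy of the combinations table.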
I am trying to select max number for rows within each group and recode that number as "Last" and keep other as blank (below dataframe: new variable name is "Z"). After that I want to create new variable with multiple conditions corresponding with other variables (below dataframe: new variable name is "X").
Dataframe is:
ID = c(1,1,1,1,2,2,3,3,3,4,4)
Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
Y = c(1,2,3,4,1,2,1,2,3,1,2)
Z = c("", "", "", "Last","","Last","","","Last","","Last")
X = c("","","","Always","","Lost","","","Linked","","Never")
df <- data.frame(ID,Care,Y,Z,X)
df
I am able to create Y using this code:
main <- df %>% group_by(ID) %>% mutate(Y = row_number())
But I want to create the new variables "Z" and "X" in my dataframe. X should be "Always" if Care is Yes in all rows within the group, "Never" if Care is No in all rows within the group, "Lost" if Care is Yes earlier and No at the last row, and "Linked" if Care is Yes or No earlier but Yes at the last row.
Here I am able to create Z variable (still need to create X):
main %>% group_by(ID) %>% mutate(Z=row_number()>=which.max(Y))
I have been struggling with this for awhile now. Any help would be greatly appreciated!
Easy! :)
You can save that step of working with which.max(Y) and instead just compare row_number() against n() in each group.
Creating Z is just an easy ifelse-statement and what I assume caused you a little trouble in creating X can be solved with case_when() to work through the four cases you describe. First, check whether all() observations within the group hold true to your condition of being "Yes" or "No", then check the two "mixed" cases afterwards.
This is what you're looking for:
library(dplyr)
df <- tibble(
ID = c(1,1,1,1,2,2,3,3,3,4,4),
Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
)
df2 <- df %>%
group_by(ID) %>%
mutate(
Z = ifelse(row_number() == n(), "Last", ""),
X = case_when(
Z == "" ~ "",
all(Care == "Yes") ~ "Always",
all(Care == "No") ~ "Never",
Care == "Yes" ~ "Linked",
Care == "No" ~ "Lost"
)
)
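As a sanity check, here is a small variant that tests the last row with last(Care) and compares the result against the X column given in the question. The helper column is_last and the object names out and expected_X are mine; treat it as a sketch rather than the one true way:

```r
library(dplyr)

df <- tibble(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  Care = c("Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No")
)

out <- df %>%
  group_by(ID) %>%
  mutate(
    is_last = row_number() == n(),      # TRUE only on the final row of each ID
    Z = if_else(is_last, "Last", ""),
    X = case_when(
      !is_last            ~ "",         # only the last row gets a label
      all(Care == "Yes")  ~ "Always",
      all(Care == "No")   ~ "Never",
      last(Care) == "Yes" ~ "Linked",   # mixed history, ends on Yes
      TRUE                ~ "Lost"      # mixed history, ends on No
    )
  ) %>%
  ungroup() %>%
  select(-is_last)

# X as written out by hand in the question
expected_X <- c("", "", "", "Always", "", "Lost", "", "", "Linked", "", "Never")
stopifnot(identical(out$X, expected_X))
```

The order of the case_when() branches matters: the two all() checks have to come before the mixed cases, otherwise an all-Yes group would be labelled "Linked".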
I have some survey data, let's say it's about how often respondents tackle various daily routines:
survey <- tribble(
~Q1_toothbrush, ~Q1_bathe, ~Q1_brush_hair, ~Q1_make_bed,
"Always","Sometimes","Often","Never",
"Never","Never","Always","Sometimes",
"Often","Sometimes","Sometimes","Often",
"Sometimes","Always","Often","Never"
)
I want to arrange it into a table that shows how many people selected "Often" or "Always".
I can create a new tibble and update it, taking each question one at a time, eg.
habits <- tribble(
~Habit, ~Description, ~Count,
"Q1_toothbrush", "Brushes teeth for two minutes twice each day.", 0,
"Q1_bathe", "Bathes with soap and water every morning or evening", 0,
"Q1_brush_hair", "Attends to daily hair health", 0,
"Q1_make_bed", "Tidies bed covers daily", 0
)
top_two <- c("Always", "Often")
tmp <- survey %>%
filter(Q1_toothbrush %in% top_two) %>%
count()
habits <- habits %>%
mutate(Count = ifelse(Habit == "Q1_toothbrush", tmp, Count))
kable(habits)
But I'm struggling to consolidate this into a single function.
If we need to do this for each row, an option is c_across after doing rowwise
library(dplyr) # >= 1.0.0
survey %>%
rowwise %>%
mutate(count = sum(c_across(everything()) %in% top_two)) %>%
ungroup
Or we can reshape to 'long' format and then do the count
library(dplyr)
library(tidyr)
pivot_longer(survey, everything()) %>%
filter(value %in% top_two) %>%
dplyr::count(name)
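To get from the long-format counts back to a table with the descriptions, a left_join() onto the lookup table seems like the natural last step. A sketch, assuming the Habit keys match the survey column names exactly (so "Q1_brush_hair" rather than "Q1_hair"); the object names counts and habits_counted are mine:

```r
library(dplyr)
library(tidyr)

survey <- tribble(
  ~Q1_toothbrush, ~Q1_bathe, ~Q1_brush_hair, ~Q1_make_bed,
  "Always",    "Sometimes", "Often",     "Never",
  "Never",     "Never",     "Always",    "Sometimes",
  "Often",     "Sometimes", "Sometimes", "Often",
  "Sometimes", "Always",    "Often",     "Never"
)

habits <- tribble(
  ~Habit,          ~Description,
  "Q1_toothbrush", "Brushes teeth for two minutes twice each day.",
  "Q1_bathe",      "Bathes with soap and water every morning or evening",
  "Q1_brush_hair", "Attends to daily hair health",
  "Q1_make_bed",   "Tidies bed covers daily"
)

top_two <- c("Always", "Often")

counts <- survey %>%
  pivot_longer(everything(), names_to = "Habit", values_to = "answer") %>%
  group_by(Habit) %>%
  # summing the logical keeps habits with zero top-two answers in the result
  summarise(Count = sum(answer %in% top_two), .groups = "drop")

habits_counted <- left_join(habits, counts, by = "Habit")
habits_counted
```

Summarising instead of filtering first means a habit nobody picked "Often" or "Always" for still shows up with a count of 0 instead of disappearing.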
I'm trying to count the number of responses in multiple columns for rows which all belong to one of four factors in the column Paper. I can sum the terms for each factor individually using map_df from purrr like so:
times <- in_all_waves %>%
filter(Paper =='Times') %>%
ungroup() %>% #function refuses to work without this
select(-Paper) %>%
map_df(table) %>% # use map_df from the purrr package to "table" each column
rownames_to_column("response") %>% #convert the rownames to a column named response
mutate(resp = case_when(response == 1 ~ "Remain", #change the resulting numbers to the correct responses
response == 2 ~ "Leave",
response ==3 ~ "Will Not Vote",
response == 4 ~ "Don't Know")) %>%
select(resp, everything(), -response) #reorder the columns with resp at the front, removing response
But when I try to do this without filtering to just one paper, like so:
different_papers <- in_all_waves %>%
map_df(table) %>%
rownames_to_column("response") %>%
mutate(resp = case_when(response == 1 ~ "Remain", #change the resulting 1s to No in resp
response == 2 ~ "Leave",
response ==3 ~ "Will Not Vote",
response == 4 ~ "Don't Know")) %>%
select(resp, everything(), -response) #reorder the columns with resp at the front, removing response
I get the error "Error: Argument 9 must be length 4, not 5", which is a reference to this last column of factors. Is there a way to keep all of the rows in the same tibble, or do they have to be in separate ones for each factor?
No other suggested questions seem quite to match my query I'm afraid.
This is the dataframe I'm using in an rds format!
https://www.dropbox.com/s/nwq913lw13kxyw9/inallwaves.rds?dl=0
I found just adding the column back in worked best !
tally_reader_number <- function(input_dataframe,newspaper_name) {
#function takes the input of in_all_waves, tallies the number of different eu ref responses using map_df for a given newspaper factor (defined above)
# and returns a dataframe of responses for each wave with the newspaper factor as a column
returned_dataframe <- input_dataframe %>%
filter(Paper == newspaper_name) %>%
ungroup() %>% #function refuses to work without this
select(-Paper) %>%
map_df(table) %>% # use map_df from the purrr package to "table" each column
rownames_to_column("response") %>% #convert the rownames to a column named response
mutate(resp = case_when(response == 1 ~ "Remain", #change the resulting numbers to the correct responses
response == 2 ~ "Leave",
response ==3 ~ "Will Not Vote",
response == 4 ~ "Don't Know")) %>%
select(resp, everything(), -response) %>% #reorder the columns with resp at the front, removing response
mutate(Paper = newspaper_name)
returned_dataframe$Paper <- as.factor(returned_dataframe$Paper)
returned_dataframe$resp <- as.factor(returned_dataframe$resp)
returned_dataframe
}
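Another option that might avoid calling the function once per paper is to reshape to long format and count() with Paper as one of the grouping columns. I can't check this against the real in_all_waves file, so the toy data below (toy_waves, with responses coded 1 to 4 and wave columns wave1/wave2) is purely an assumption about its shape:

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of in_all_waves: one row per reader, one column per wave.
toy_waves <- tibble(
  Paper = factor(c("Times", "Times", "Guardian", "Guardian", "Guardian")),
  wave1 = c(1, 2, 1, 1, 4),
  wave2 = c(1, 1, 2, 1, 3)
)

resp_labels <- c("Remain", "Leave", "Will Not Vote", "Don't Know")

tallies <- toy_waves %>%
  pivot_longer(-Paper, names_to = "wave", values_to = "response") %>%
  # translate the numeric codes 1..4 into their labels
  mutate(resp = resp_labels[response]) %>%
  # one count per paper, response and wave
  count(Paper, resp, wave) %>%
  # waves back into columns; combinations that never occur become 0
  pivot_wider(names_from = wave, values_from = n, values_fill = list(n = 0L))

tallies
```

Because Paper stays in the data the whole way through, all papers end up in one tibble and there is nothing to bind back together afterwards.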
I have data that looks like this. I want R_fighter to always hold the winner, as given in the winner column.
For example, this is not satisfied for row 5, where Petr Yan won the fight but he is in B_fighter. Also, I would need R_KD and B_KD to be swapped for row 5, and R_sig_str and B_sig_str. I have many more columns with R_ and B_ column attributes, and would need them all swapped as well.
I need all rows with the winner on B_fighter switched.
Attached is a sample of my data:
R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR.,B_SIG_STR.,win_by,weight,winner
Henry Cejudo,Marlon Moraes,0,0,90 of 171,57 of 119,KO/TKO,UFC Bantamweight Title Bout,Henry Cejudo
Valentina Shevchenko,Jessica Eye,1,0,8 of 11,2 of 12,KO/TKO,UFC Women's Flyweight Title Bout,Valentina Shevchenko
Tony Ferguson,Donald Cerrone,0,0,104 of 200,68 of 185,TKO - Doctor's Stoppage,Lightweight Bout,Tony Ferguson
Jimmie Rivera,Petr Yan,0,2,73 of 192,56 of 189,Decision - Unanimous,Bantamweight Bout,Petr Yan
Tai Tuivasa,Blagoy Ivanov,0,1,64 of 144,73 of 123,Decision - Unanimous,Heavyweight Bout,Blagoy Ivanov
Tatiana Suarez,Nina Ansaroff,0,0,75 of 142,48 of 99,Decision - Unanimous,Women's Strawweight Bout,Tatiana Suarez
Aljamain Sterling,Pedro Munhoz,0,0,174 of 349,105 of 265,Decision - Unanimous,Bantamweight Bout,Aljamain Sterling
Karolina Kowalkiewicz,Alexa Grasso,0,0,90 of 232,148 of 369,Decision - Unanimous,Women's Strawweight Bout,Alexa Grasso
Ricardo Lamas,Calvin Kattar,0,1,12 of 29,22 of 41,KO/TKO,Featherweight Bout,Calvin Kattar
Yan Xiaonan,Angela Hill,0,0,94 of 249,71 of 144,Decision - Unanimous,Women's Strawweight Bout,Yan Xiaonan
Bevon Lewis,Darren Stewart,0,0,31 of 84,30 of 73,Decision - Unanimous,Middleweight Bout,Darren Stewart
Eddie Wineland,Grigorii Popov,2,0,74 of 171,55 of 150,KO/TKO,Bantamweight Bout,Eddie Wineland
Katlyn Chookagian,Joanne Calderwood,0,0,82 of 221,112 of 266,Decision - Unanimous,Women's Flyweight Bout,Katlyn Chookagian
Many thanks :)
You can use the dplyr package in R which has plenty of functions to reshape data.
In your case you could use something like:
library(dplyr)
mydata %>%
mutate(R_fighter_new = winner,
B_fighter_new = if_else(R_fighter == winner, B_fighter, R_fighter),
R_KD_new = if_else(R_fighter == winner, R_KD, B_KD),
B_KD_new = if_else(R_fighter == winner, B_KD, R_KD)) %>%
select(R_fighter = R_fighter_new, B_fighter = B_fighter_new, R_KD = R_KD_new, B_KD = B_KD_new, winner)
In the last select statement you can include all the columns you want in your dataframe.
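With many R_/B_ pairs, one if_else() per column gets long. A base R sketch that swaps every pair at once by matching on the column prefix could look like this; it assumes every R_ column has a B_ counterpart with the same suffix, and uses two rows taken from the sample data in the question:

```r
# Two fights from the question: the second one has the winner in B_fighter.
fights <- data.frame(
  R_fighter = c("Henry Cejudo", "Jimmie Rivera"),
  B_fighter = c("Marlon Moraes", "Petr Yan"),
  R_KD = c(0, 0),
  B_KD = c(0, 2),
  winner = c("Henry Cejudo", "Petr Yan"),
  stringsAsFactors = FALSE
)

# Rows where the winner currently sits in the B_ columns.
swap <- fights$B_fighter == fights$winner

# Pair up the columns by name: R_fighter <-> B_fighter, R_KD <-> B_KD, ...
r_cols <- grep("^R_", names(fights), value = TRUE)
b_cols <- sub("^R_", "B_", r_cols)

swapped <- fights
swapped[swap, r_cols] <- fights[swap, b_cols]  # B values move into R columns
swapped[swap, b_cols] <- fights[swap, r_cols]  # original R values move into B

swapped
```

Reading both sides from the untouched fights data frame is what keeps the second assignment from picking up already-swapped values.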
We can use case_when in dplyr
library(dplyr)
mydata %>%
mutate(R_fighter_new = winner,
B_fighter_new = case_when(R_fighter == winner ~ B_fighter,TRUE~ R_fighter),
R_KD_new = case_when(R_fighter == winner~R_KD, TRUE ~ B_KD),
B_KD_new = case_when(R_fighter == winner ~ B_KD, TRUE ~ R_KD)) %>%
select(R_fighter = R_fighter_new, B_fighter = B_fighter_new, R_KD = R_KD_new, B_KD = B_KD_new, winner)
You could try something like this with multiple columns to swap. First, add a match number column for each row. Then, pivot_longer so you have a single column for R vs. B. This column would be swapped depending on fighter and winner values. Then, to put back into original wider format, you can use pivot_wider. Note that pivot_wider will put the "value" in front of output since it contains multiple values (R and B were moved to the end).
library(tidyverse)
df %>%
mutate(match_no = row_number()) %>%
pivot_longer(cols = R_fighter:B_SIG_STR., names_to = c("R_vs_B", ".value"), names_pattern = "(R|B)_(\\w+)") %>%
mutate(R_vs_B = case_when(R_vs_B == "B" & fighter == winner ~ "R",
R_vs_B == "R" & fighter != winner ~ "B",
TRUE ~ R_vs_B)) %>%
pivot_wider(id_cols = c(match_no, winner, win_by, weight), names_from = R_vs_B, values_from = fighter:SIG_STR)