Using map_df from purrr on a factor column - r

I'm trying to count the number of responses in multiple columns for rows which each belong to one of four factor levels in the column Paper. I can count the responses for each level individually using map_df from purrr, like so:
times <- in_all_waves %>%
filter(Paper =='Times') %>%
ungroup() %>% #function refuses to work without this
select(-Paper) %>%
map_df(table) %>% # use map_df from the purrr package to "table" each column
rownames_to_column("response") %>% #convert the rownames to a column named response
mutate(resp = case_when(response == 1 ~ "Remain", #change the resulting numbers to the correct responses
response == 2 ~ "Leave",
response ==3 ~ "Will Not Vote",
response == 4 ~ "Don't Know")) %>%
select(resp, everything(), -response) #reorder the columns with resp at the front, removing response
But when I try to do this on the whole data frame, without filtering to just one paper, like so:
different_papers <- in_all_waves %>%
map_df(table) %>%
rownames_to_column("response") %>%
mutate(resp = case_when(response == 1 ~ "Remain", #change the resulting numbers to the correct responses
response == 2 ~ "Leave",
response ==3 ~ "Will Not Vote",
response == 4 ~ "Don't Know")) %>%
select(resp, everything(), -response) #reorder the columns with resp at the front, removing response
I get the error "Error: Argument 9 must be length 4, not 5", which refers to the last column, the factor column Paper. Is there a way to keep all of the rows in the same tibble, or do they have to be in separate ones for each factor level?
No other suggested questions seem quite to match my query I'm afraid.
This is the dataframe I'm using in an rds format!
https://www.dropbox.com/s/nwq913lw13kxyw9/inallwaves.rds?dl=0

I found that just adding the column back in worked best:
tally_reader_number <- function(input_dataframe,newspaper_name) {
#function takes in_all_waves as input and tallies the number of different EU ref responses using map_df for a given newspaper factor level
# and returns a dataframe of responses for each wave with the newspaper as a column
returned_dataframe <- input_dataframe %>%
filter(Paper == newspaper_name) %>%
ungroup() %>% #function refuses to work without this
select(-Paper) %>%
map_df(table) %>% # use map_df from the purrr package to "table" each column
rownames_to_column("response") %>% #convert the rownames to a column named response
mutate(resp = case_when(response == 1 ~ "Remain", #change the resulting numbers to the correct responses
response == 2 ~ "Leave",
response ==3 ~ "Will Not Vote",
response == 4 ~ "Don't Know")) %>%
select(resp, everything(), -response) %>% #reorder the columns with resp at the front, removing response
mutate(Paper = newspaper_name)
returned_dataframe$Paper <- as.factor(returned_dataframe$Paper)
returned_dataframe$resp <- as.factor(returned_dataframe$resp)
returned_dataframe
}
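For keeping every paper in one tibble without calling the function once per paper, a possible alternative is to reshape to long format and count. This is only a sketch on made-up data: the toy tibble and its columns wave1 and wave2 are assumptions standing in for in_all_waves, and it uses tidyr's pivot_longer and pivot_wider rather than map_df.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for in_all_waves: one factor column plus response columns
toy <- tibble(
  Paper = factor(c("Times", "Times", "Sun", "Sun", "Sun")),
  wave1 = c(1, 2, 1, 1, 4),
  wave2 = c(1, 1, 2, 3, 4)
)

counts <- toy %>%
  # one row per (Paper, wave, response)
  pivot_longer(-Paper, names_to = "wave", values_to = "response") %>%
  # tally each response within each paper and wave
  count(Paper, wave, response) %>%
  mutate(resp = case_when(response == 1 ~ "Remain",
                          response == 2 ~ "Leave",
                          response == 3 ~ "Will Not Vote",
                          response == 4 ~ "Don't Know")) %>%
  select(-response) %>%
  # back to one column per wave; missing combinations become 0
  pivot_wider(names_from = wave, values_from = n, values_fill = list(n = 0L))

counts
```

Because the counting is done per paper and per wave, a response that never occurs for one paper does not cause a length mismatch; it just shows up as 0 or is absent for that paper.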

Related

r: combine filter with n_distinct in data frame

Simple question. Considering the data frame below, I want to count distinct IDs: once for all records and once after filtering on status. However, the %>% doesn't seem to work here. I just want a single value as output (so for total this should be 10, for closed it should be 5), not a dataframe. Neither of the commented-out lines works.
dat <- data.frame (ID = as.factor(c(1:10)),
status = as.factor(rep(c("open","closed"))))
total <- n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(ID)
n_distinct expects a vector as input, but you are passing it a dataframe. You can do:
library(dplyr)
dat %>%
filter(status == "closed") %>%
summarise(n = n_distinct(ID))
# n
#1 5
Or without using filter :
dat %>% summarise(n = n_distinct(ID[status == "closed"]))
You can add %>% pull(n) to above if you want a vector back and not a dataframe.
An option with data.table
library(data.table)
setDT(dat)[status == "closed"][, .(n = uniqueN(ID))]
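If only the single number is needed, another option is to pull the ID column out after filtering and call n_distinct on the resulting vector. A minimal sketch with the data from the question (rep is written with times = 5 here so the example does not rely on recycling):

```r
library(dplyr)

dat <- data.frame(ID = factor(1:10),
                  status = factor(rep(c("open", "closed"), times = 5)))

total <- n_distinct(dat$ID)

# filter first, then hand n_distinct a vector rather than a data frame
closed <- dat %>%
  filter(status == "closed") %>%
  pull(ID) %>%
  n_distinct()

total
closed
```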

Select only Max number (and recode max) and keep others blank in dataframe and recode with multiple conditions with multiple variables

I am trying to select max number for rows within each group and recode that number as "Last" and keep other as blank (below dataframe: new variable name is "Z"). After that I want to create new variable with multiple conditions corresponding with other variables (below dataframe: new variable name is "X").
Dataframe is:
ID = c(1,1,1,1,2,2,3,3,3,4,4)
Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
Y = c(1,2,3,4,1,2,1,2,3,1,2)
Z = c("", "", "", "Last","","Last","","","Last","","Last")
X = c("","","","Always","","Lost","","","Linked","","Never")
df <- data.frame(ID,Care,Y,Z,X)
df
I am able to create Y using this code:
main <- df %>% group_by(ID) %>% mutate(Y = row_number())
But, I want to create new Variables "Z" and "X" in my dataframe. X would be if care is Yes in all rows within each group = "Always", if care is No in all rows within each group = Never, if care is Yes at earlier and No at the Last = "Lost", if care is Yes or No at earlier but Yes at the Last = "Linked"
Here I am able to create Z variable (still need to create X):
main %>% group_by(ID) %>% mutate(Z=row_number()>=which.max(Y))
I have been struggling with this for a while now. Any help would be greatly appreciated!
Easy! :)
You can save that step of working with which.max(Y) and instead just compare row_number() against n() in each group.
Creating Z is just an easy ifelse-statement and what I assume caused you a little trouble in creating X can be solved with case_when() to work through the four cases you describe. First, check whether all() observations within the group hold true to your condition of being "Yes" or "No", then check the two "mixed" cases afterwards.
This is what you're looking for:
library(dplyr)
df <- tibble(
ID = c(1,1,1,1,2,2,3,3,3,4,4),
Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
)
df2 <- df %>%
group_by(ID) %>%
mutate(
Z = ifelse(row_number() == n(), "Last", ""),
X = case_when(
Z == "" ~ "",
all(Care == "Yes") ~ "Always",
all(Care == "No") ~ "Never",
Care == "Yes" ~ "Linked",
Care == "No" ~ "Lost"
)
)

R: Create new column based list of values from a multiple columns

I want to create a new column (T/F) based on any value from a list being present in multiple columns. For this example, I'm using mtcars for my example, searching for two values in two columns, but my actual challenge is many values in many columns.
I have a successful filter using filter_at() included below, but I've been unable to apply that logic to a mutate:
# there are 7 cars with 6 cyl
mtcars %>%
filter(cyl == 6)
# there are 2 cars with 19.2 mpg, one with 6 cyl, one with 8
mtcars %>%
filter(mpg == 19.2)
# there are 8 rows with either.
# these are the rows I want as TRUE
mtcars %>%
filter(mpg == 19.2 | cyl == 6)
# set the cols to look at
mtcars_cols <- mtcars %>%
select(matches('^(mp|cy)')) %>% names()
# set the values to look at
mtcars_numbs <- c(19.2, 6)
# result is 8 vars with either value in either col.
# this is a successful filter of the data
out1 <- mtcars %>%
filter_at(vars(mtcars_cols), any_vars(
. %in% mtcars_numbs
)
)
# shows set with all 6 cyl, plus one 8cyl 19.2 mpg
out1 %>%
select(mpg, cyl)
# This attempts to apply the filter list to the cols,
# but I only get 6 rows as True
# I tried to change == to %in% but that results in an error
out2 <- mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs) > 0
)
# only 6 rows returned
out2 %>%
filter(myset == T)
I'm not sure why the two rows are skipped. I think it might be the use of rowSums that is aggregating those two rows in some way.
If we want to do the corresponding checks, it may be better to use map2
library(dplyr)
library(purrr)
map2_df(mtcars_cols, mtcars_numbs, ~
mtcars %>%
filter(!! rlang::sym(.x) == .y)) %>%
distinct
NOTE: Comparing floating point numbers with == can get into trouble, as small differences in precision can result in an unexpected FALSE
Also, note that == works as intended only when the lhs and rhs elements have the same length or the rhs vector is of length 1 (here the recycling happens). If the rhs length is greater than 1 and not equal to the length of the lhs, then its values are recycled down the columns (column-major order), so they are not matched up with the columns as intended.
We can replicate the values so that each one lines up with its column, and now it should work
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs[col(select(., mtcars_cols))]) > 0
) %>% pull(myset) %>% sum
#[1] 8
In the above code select is used twice for better understanding. Otherwise, we can also use rep
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == rep(mtcars_numbs, each = n())) > 0
) %>%
pull(myset) %>%
sum
#[1] 8
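The recycling issue described above can be seen on a small matrix. This is only an illustrative sketch: the matrix m and the vector vals are made up, and a matrix is used instead of the selected data frame columns to keep it short.

```r
# Two columns; we want to test column a against 19.2 and column b against 6
m <- cbind(a = c(19.2, 6, 6, 19.2),
           b = c(6, 6, 19.2, 19.2))
vals <- c(19.2, 6)

# vals is recycled down each column as 19.2, 6, 19.2, 6,
# so the comparison depends on the row position, not on the column
naive <- m == vals

# col(m) gives the column index of every cell, so vals[col(m)]
# repeats 19.2 for all of column a and 6 for all of column b
aligned <- m == vals[col(m)]

naive
aligned
rowSums(aligned) > 0
```

With the naive comparison, a 19.2 in an even-numbered row of column a is compared against 6 and missed, which is the kind of mismatch that can make rows drop out of the original rowSums result.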

How to use mutate_at or mutate_if at the same time to do multiple action on data

I would like to apply 3 functions using one code on the same variables in my data.
I have a data set and there are certain columns in my data and i want to apply these function to all of them.
1- make them all factor data
2- replace spaces in the columns with missing(convert space values to missing)
3- give missing value an explicit factor level using fct_explicit_na
i have done this in separate code lines but i want to merge all of them using dplyr mutate function. I tried the following but didnt work
cols <- c("id12", "id13", "id14", "id15")
data_new <- data_old %>%
mutate_if(cols=="", NA) %>% # replace space with NA for cols
mutate_at(cols, factor) %>% # then turn them into factors
mutate_at(cols, fct_explicit_na) # give NAs explicit factor level
)
I get the error:
Error in tbl_if_vars(.tbl, .p, .env, ..., .include_group_vars = .include_group_vars) :
length(.p) == length(tibble_vars) is not TRUE
The mutate_if step is not doing what the OP intends. Instead, we can do this in a single step with
library(dplyr)
library(forcats) # needed for fct_explicit_na
data_old %>%
mutate_at(vars(cols), ~ na_if(., "") %>%
factor %>%
fct_explicit_na)
Why didn't the OP's code work?
Using a reproducible example, below code converts columns that are factor to character class
iris1 <- iris %>%
mutate_if(is.factor, as.character) %>%
mutate(Species = replace(Species, c(1, 3, 5), ""))
Now, if we do
iris1 %>%
mutate_if("Species" == "", NA)
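it is comparing two strings ("Species" and "") instead of checking the column values. Also, the predicate passed to mutate_if should return a logical vector of length 1 for each column, indicating whether that column is selected.
Instead, we can use a predicate function that checks the column values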
iris1 %>%
mutate_if(~ any(. == ""), ~ na_if(., "")) %>%
head
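To see the three steps (blank to NA, convert to factor, explicit missing level) working together, here is a small sketch on made-up data. The tibble data_old and its values are assumptions, and across() is used in place of mutate_at, which needs dplyr 1.0.0 or later. Note that newer forcats versions may emit a deprecation warning for fct_explicit_na.

```r
library(dplyr)
library(forcats)

# Hypothetical data: two character columns with blanks, one untouched column
data_old <- tibble(
  id12  = c("a", "", "b"),
  id13  = c("", "x", "x"),
  other = 1:3
)
cols <- c("id12", "id13")

data_new <- data_old %>%
  mutate(across(all_of(cols),
                # blank -> NA, then factor, then NA -> explicit "(Missing)" level
                ~ fct_explicit_na(factor(na_if(.x, "")))))

levels(data_new$id12)
```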

Creating a dplyr summarised table

I have a dataset that I'd like to summarise. My data looks like this.
The table in Sheet1 refers to the original table.
The table in Sheet2 is the result I'd like to get, using dplyr.
Basically, for each variable (Our Website, Friendliness of Staff, and Food Quality), I'd like a sum of 'Satisfied' + 'Very Satisfied', expressed as a percentage of the total number of respondents for the Parameter. For example, the 80% for the Internet column is 4 (Satisfied + Very Satisfied) / 5 (total number of respondents whose mode of reservation is Internet) * 100 = 80%.
I used this code but I'm not getting the desired result:
test %>%
group_by(Parameter.1..Mode.of.reservation,Our.Website) %>%
select(Our.Website,Friendliness.of.Staff,Food.Quality) %>%
summarise_each(funs(freq = n()))
Any help would be appreciated.
@ira's solution can be streamlined if you gather the data prior to summarizing. This way you skip the multiple assignments.
library(tidyverse)
library(googlesheets)
library(scales)
# Authorize with google.
gs_auth()
# Register the sheet
gs_data <- gs_url("https://docs.google.com/spreadsheets/d/1zljXN7oxUvij2mXHiyuRVG3xp5063chEFW_QERgHegg/")
# Read in the first worksheet
data <- gs_read(gs_data, ws = 1)
# Summarize using tidyr/dplyr
data %>%
gather(item, response, -1:-2) %>%
filter(!is.na(response)) %>%
group_by(`Parameter 1: Mode of reservation`, item) %>%
summarise(percentage = percent(sum(response %in% c("Satisfied","Very Satisfied"))/n())) %>%
spread(`Parameter 1: Mode of reservation`, percentage)
After using dplyr to summarise the data, you can use tidyr to transpose the dataset so that you have the columns and rows just as you asked in the question.
# read in the data
data <- read.csv("C:/RSnips/My Dataset - Sheet1.csv")
# load libraries
library(dplyr)
library(tidyr)
# take the loaded data
data2 <- data %>%
# group it by mode of reservation
group_by(Parameter.1..Mode.of.reservation) %>%
# summarise
summarise(
# count how many times website column takes values sat or very sat and divide by number of observations in each group given by group_by
OurWeb = sum(Our.Website == "Satisfied" |
Our.Website == "Very Satisfied")/n(),
# do the same for Staff and food
Staff = sum(Friendliness.of.Staff == "Satisfied" |
Friendliness.of.Staff == "Very Satisfied")/n(),
Food = sum(Food.Quality == "Satisfied" |
Food.Quality == "Very Satisfied")/n()) %>%
# If you want to have email, internet and phone in columns
# use tidyr package to transpose the dataset
# first turn it into a long format, where mode of the original columns are your key
gather(categories, val, 2:(ncol(data)-1)) %>%
# then turn it back to wide format, but mode of reservation will be in columns
spread(Parameter.1..Mode.of.reservation, val)
How about:
data %>%
mutate(OurWebsite2 = ifelse(Our.Website == "Very Satisfied" | Our.Website == "Satisfied", 1, 0),
Friendlinessofstaff2 = ifelse(Friendlinessofstaff == "Very Satisfied" | Friendlinessofstaff == "Satisfied", 1, 0),
FoodQuality2 = ifelse(FoodQuality== "Very Satisfied" | FoodQuality== "Satisfied", 1, 0)) %>%
group_by(Parameter1) %>%
summarise(OurWebsiteSatisfaction = mean(OurWebsite2),
FriendlinessofstaffSatisfaction = mean(Friendlinessofstaff2),
FoodQualitySatisfaction = mean(FoodQuality2))
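The same idea (the share of "Satisfied" or "Very Satisfied" per group) can be written more compactly by taking the mean of a logical vector for each column. A sketch on made-up data follows: the tibble survey and its column names mode, website and staff are assumptions, not the names in the original sheet, and across() needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical survey data
survey <- tibble(
  mode    = c("Internet", "Internet", "Phone", "Phone", "Phone"),
  website = c("Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Dissatisfied"),
  staff   = c("Satisfied", "Very Satisfied", "Neutral", "Neutral", "Satisfied")
)

satisfied <- c("Satisfied", "Very Satisfied")

shares <- survey %>%
  group_by(mode) %>%
  # mean of TRUE/FALSE gives the proportion of satisfied respondents per group
  summarise(across(c(website, staff), ~ mean(.x %in% satisfied)))

shares
```

Multiplying the result by 100, or formatting it with scales::percent as in the answer above, gives percentages.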
