Ifelse within dplyr taking a longer time to execute - r

I am working on medical claims data and the data file is as showcased below
claim_id status
abc123 P
abc123 R
xyz374 P
xyz386 R
I would like to create a new column, flag, computed per claim_id: if the statuses for a given claim_id include both "P" and "R", the flag should be "Yes"; otherwise it should be "No".
claim_id status flag
abc123 P Yes
abc123 R Yes
xyz374 P No
xyz386 R No
My approach to this solution is using dplyr :-
data <- data1 %>%
  group_by(claim_id) %>%
  mutate(flag = ifelse(any(status == "P" | status == "R"),
                       "Yes",
                       as.character(status)))
This approach takes a long time to run and also marks every row as Yes in the flag column.

Try this:
data1 <- data1 %>% group_by(claim_id) %>% mutate(flag = (n_distinct(status) == 2))
This one assumes that those are the only two possible values for the status field. If that is not true, you will need to do something like this:
data1 <- data1 %>% group_by(claim_id) %>% mutate(flag = (('P' %in% status) & ('R' %in% status)))
You can also do
data1 %>%
  group_by(claim_id) %>%
  mutate(flag = ifelse(all(c("P", "R") %in% status), "Yes", "No"))
However, it might be even better to use a logical flag. It avoids the ifelse altogether (making it faster) and makes subsetting really easy afterwards:
data1 %>%
  group_by(claim_id) %>%
  mutate(flag = all(c("P", "R") %in% status))
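As a minimal, self-contained sketch of the logical-flag approach, the following rebuilds the sample data from the question by hand and then subsets on the flag. The names `flagged` and `both` are illustrative and do not appear in the original post.

```r
library(dplyr)

# Sample data from the question, rebuilt by hand
data1 <- data.frame(
  claim_id = c("abc123", "abc123", "xyz374", "xyz386"),
  status   = c("P", "R", "P", "R"),
  stringsAsFactors = FALSE
)

flagged <- data1 %>%
  group_by(claim_id) %>%
  # TRUE only when the claim has at least one "P" and at least one "R"
  mutate(flag = all(c("P", "R") %in% status)) %>%
  ungroup()

# A logical column can be used directly as a filter condition
both <- filter(flagged, flag)
```

For comparison, the condition in the question, any(status == "P" | status == "R"), is TRUE for any claim that has at least one P or R row, which is why every row ended up marked Yes.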

Related

Equivalent of an "except" command in R when subsetting a dataframe

I have a dataset like that :
ID Amount MemberCard
345890 251000 NO
341862 400238 YES
345791 678921 YES
341750 87023 NO
345716 12987 YES
I need to delete all the observations with an Amount > 250000, but I have to keep the IDs 341862 and 345791. So I was wondering whether some kind of "except" command exists in R when subsetting, instead of creating a data frame with only these 2 observations and using rbind afterwards.
Select a row if ID is one of c(341862, 345791) OR if Amount is less than or equal to 250000.
We can use subset in base R -
res <- subset(df, ID %in% c(341862, 345791) | Amount <= 250000)
res
# ID Amount MemberCard
#2 341862 400238 YES
#3 345791 678921 YES
#4 341750 87023 NO
#5 345716 12987 YES
Or with dplyr::filter -
library(dplyr)
df %>% filter(ID %in% c(341862, 345791) | Amount <= 250000)
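The same "delete, except" logic can also be written out step by step in base R. This is only a sketch: it rebuilds the question's data by hand, uses the question's threshold of 250000, and the names `keep_ids`, `too_large` and `exempt` are illustrative.

```r
# Sample data from the question, rebuilt by hand
df <- data.frame(
  ID         = c(345890, 341862, 345791, 341750, 345716),
  Amount     = c(251000, 400238, 678921, 87023, 12987),
  MemberCard = c("NO", "YES", "YES", "NO", "YES"),
  stringsAsFactors = FALSE
)

keep_ids  <- c(341862, 345791)     # IDs exempt from the rule
too_large <- df$Amount > 250000    # rows the rule would delete
exempt    <- df$ID %in% keep_ids   # rows to keep regardless of Amount

# Drop the rows that are too large, except the exempt ones
res <- df[!(too_large & !exempt), ]
```

Negating "too large and not exempt" is logically the same as "exempt OR Amount <= 250000", so this selects the same rows as the subset() and filter() calls.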
If all you want is to have empty values for observations with Amount > 250000, you can use replace():
library(tidyverse)
df_new <- df %>%
mutate(Amount = replace(Amount, Amount >250000, NA))
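If you want the same rows blanked in both columns, add MemberCard to the same mutate(). MemberCard has to be replaced first, because once Amount has been overwritten with NA the condition Amount > 250000 no longer identifies those rows:
df_new <- df %>%
  mutate(MemberCard = replace(MemberCard, Amount > 250000, NA),
         Amount = replace(Amount, Amount > 250000, NA))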
This will preserve the ID, but removes all other values if the condition is met. Hope this helps. 😉
We may also use
subset(df, ID == 341862 | ID == 345791 | Amount <= 250000)

How to filter in R using multiple OR statements? Dplyr

I tried searching for this but couldn't find what I needed.
This is what my data looks like:
mydata <- data.frame(Chronic = c("Yes", "No", "Yes"),
Mental = c("No", "No", "No"),
SA = c("No", "No", "Yes"))
> mydata
Chronic Mental SA
1 Yes No No
2 No No No
3 Yes No Yes
My goal is to get the count of rows where any of the columns equals Yes.
In this case rows 1 and 3 have at least one Yes, whereas row 2 only has No.
Is there an easy way to do this?
We can use rowSums on a logical matrix and then get the sum of the logical vector to return the count of rows having at least one 'Yes'
sum(rowSums(mydata == 'Yes') > 0)
#[1] 2
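To see what the one-liner does, it can be split into its intermediate steps. This is a sketch on the question's data; the variable names `is_yes`, `yes_per_row` and `n_rows_with_yes` are illustrative.

```r
mydata <- data.frame(Chronic = c("Yes", "No", "Yes"),
                     Mental  = c("No", "No", "No"),
                     SA      = c("No", "No", "Yes"),
                     stringsAsFactors = FALSE)

# Logical matrix: TRUE wherever a cell equals "Yes"
is_yes <- mydata == "Yes"

# Number of "Yes" values in each row
yes_per_row <- rowSums(is_yes)

# Rows with at least one "Yes"; summing a logical vector counts the TRUEs
n_rows_with_yes <- sum(yes_per_row > 0)
```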
Or with tidyverse
library(dplyr)
mydata %>%
rowwise %>%
mutate(Count = + any(c_across(everything()) == 'Yes')) %>%
ungroup %>%
pull(Count) %>%
sum
#[1] 2
If you would rather spell out the conditions (as opposed to using across), you can use case_when:
mydata %>%
mutate(yes_column = case_when(Chronic == 'Yes' | Mental == 'Yes' | SA == 'Yes' ~ 1,
TRUE ~ 0)) %>%
summarise(total = sum(yes_column))
This creates a binary flag that is 1 if Yes appears in any of the columns. Writing the conditions out column by column also makes it easier to check that the code works for each column, and in particular to spot data quality problems such as 'Yes' versus 'yes' or even 'Y'. The | denotes OR, and you can use & for AND.
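One way to look for such spelling variants before counting is to tabulate every distinct value in the data. The sketch below does that on the question's data and then shows one possible normalisation on a made-up vector; `messy` and its values are hypothetical and only there for illustration.

```r
mydata <- data.frame(Chronic = c("Yes", "No", "Yes"),
                     Mental  = c("No", "No", "No"),
                     SA      = c("No", "No", "Yes"),
                     stringsAsFactors = FALSE)

# Count every distinct value across all columns; unexpected spellings
# such as "yes" or "Y" would show up as extra entries here
value_counts <- table(unlist(mydata))

# Hypothetical messy input, normalised before comparison
messy <- c("Yes", "yes", "Y", "No")
messy_is_yes <- tolower(messy) %in% c("yes", "y")
```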

Select only Max number (and recode max) and keep others blank in dataframe and recode with multiple conditions with multiple variables

I am trying to select max number for rows within each group and recode that number as "Last" and keep other as blank (below dataframe: new variable name is "Z"). After that I want to create new variable with multiple conditions corresponding with other variables (below dataframe: new variable name is "X").
Dataframe is:
ID = c(1,1,1,1,2,2,3,3,3,4,4)
Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
Y = c(1,2,3,4,1,2,1,2,3,1,2)
Z = c("", "", "", "Last","","Last","","","Last","","Last")
X = c("","","","Always","","Lost","","","Linked","","Never")
df <- data.frame(ID,Care,Y,Z,X)
df
I am able to create Y using this code:
main <- df %>% group_by(ID) %>% mutate(Y = row_number())
But I want to create the new variables "Z" and "X" in my dataframe. X should be defined per group as follows: "Always" if Care is Yes in all rows; "Never" if Care is No in all rows; "Lost" if Care is Yes earlier and No in the last row; "Linked" if Care is Yes or No earlier but Yes in the last row.
Here I am able to create Z variable (still need to create X):
main %>% group_by(ID) %>% mutate(Z=row_number()>=which.max(Y))
I have been struggling with this for awhile now. Any help would be greatly appreciated!
Easy! :)
You can save that step of working with which.max(Y) and instead just compare row_number() against n() in each group.
Creating Z is just an easy ifelse-statement and what I assume caused you a little trouble in creating X can be solved with case_when() to work through the four cases you describe. First, check whether all() observations within the group hold true to your condition of being "Yes" or "No", then check the two "mixed" cases afterwards.
This is what you're looking for:
library(dplyr)
df <- tibble(
  ID = c(1,1,1,1,2,2,3,3,3,4,4),
  Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
)
df2 <- df %>%
  group_by(ID) %>%
  mutate(
    Z = ifelse(row_number() == n(), "Last", ""),
    X = case_when(
      Z == "" ~ "",
      all(Care == "Yes") ~ "Always",
      all(Care == "No") ~ "Never",
      Care == "Yes" ~ "Linked",
      Care == "No" ~ "Lost"
    )
  )
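As a quick check, the result can be compared against the Z and X columns given in the question. This sketch reruns the same pipeline on the same data; `expected_Z` and `expected_X` are illustrative names holding the values copied from the question.

```r
library(dplyr)

df <- tibble(
  ID   = c(1,1,1,1,2,2,3,3,3,4,4),
  Care = c("Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","No")
)

df2 <- df %>%
  group_by(ID) %>%
  mutate(
    # Only the final row of each group is labelled
    Z = ifelse(row_number() == n(), "Last", ""),
    # Branches are tried in order, so the all() cases win over the mixed ones
    X = case_when(
      Z == "" ~ "",
      all(Care == "Yes") ~ "Always",
      all(Care == "No") ~ "Never",
      Care == "Yes" ~ "Linked",
      Care == "No" ~ "Lost"
    )
  ) %>%
  ungroup()

# Target columns copied from the question
expected_Z <- c("", "", "", "Last", "", "Last", "", "", "Last", "", "Last")
expected_X <- c("", "", "", "Always", "", "Lost", "", "", "Linked", "", "Never")
matches <- identical(df2$Z, expected_Z) && identical(df2$X, expected_X)
```

The order of the case_when() branches matters: if the Care == "Yes" branch came before all(Care == "Yes"), a group with only Yes values would be labelled "Linked" instead of "Always".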

Create new column based on occurrence of at least one variable in other column by group

Consider the following data frame:
ID <- c(1,1,1,2,2,3,3,3,3)
A <- c("No","No","Yes","Yes","Yes","No","No","No","No")
B <- c("Yes","Yes","Yes","Yes","Yes","No","No","No","No")
df <- data.frame(ID,A,B)
I want to create column B, where the occurrence of at least one "Yes" in column A for an ID results in "Yes" for every row of that ID in column B. I have tried the two following approaches (I feel I am almost there):
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(B1=ifelse(A == "Yes", "Yes", "No")) # B1 is the new column for comparison
unfortunately this gives the same column as A
and
df2 <- transform(df, B1= ave(A, ID, FUN=function(x) x[A == "Yes"]))
yields an error message:
1: In x[...] <- m :
number of items to replace is not a multiple of replacement length
Help would be much appreciated.
You almost had it. Here's a small edit to your pipe. Is this what you were after?
df <- df %>%
group_by(ID) %>%
mutate(B1=ifelse("Yes" %in% A, "Yes", "No"))
df
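The ave() attempt from the question can be made to work along the same lines. The sketch below assumes the problem was that FUN indexed with the full column A rather than with the group's own values x; here FUN only looks at x and returns a single value per group.

```r
ID <- c(1,1,1,2,2,3,3,3,3)
A  <- c("No","No","Yes","Yes","Yes","No","No","No","No")
df <- data.frame(ID, A, stringsAsFactors = FALSE)

# FUN receives the values of A for one ID at a time and returns one value,
# which is then assigned to every row of that group
df$B1 <- ave(df$A, df$ID,
             FUN = function(x) if (any(x == "Yes")) "Yes" else "No")
```

With this data, B1 comes out identical to the target column B from the question.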

Creating a dplyr summarised table

I have a dataset that I'd like to summarise. My data looks like this.
The table in Sheet1 refers to the original table.
The table in Sheet2 is the result I'd like to get, using dplyr.
Basically, for each variable (Our Website, Friendliness of Staff, and Food Quality), I'd like the count of 'Satisfied' + 'Very Satisfied', expressed as a percentage of the total number of respondents for the Parameter. For example, the 80% in the Internet column is 4 (Satisfied + Very Satisfied) / 5 (total number of respondents whose mode of reservation is Internet) * 100 = 80%.
I used this code but I'm not getting the desired result:
test %>%
group_by(Parameter.1..Mode.of.reservation,Our.Website) %>%
select(Our.Website,Friendliness.of.Staff,Food.Quality) %>%
summarise_each(funs(freq = n()))
Any help would be appreciated.
@ira's solution can be streamlined if you gather the data prior to summarizing. This way you skip the multiple assignments.
library(tidyverse)
library(googlesheets)
library(scales)
# Authorize with google.
gs_auth()
# Register the sheet
gs_data <- gs_url("https://docs.google.com/spreadsheets/d/1zljXN7oxUvij2mXHiyuRVG3xp5063chEFW_QERgHegg/")
# Read in the first worksheet
data <- gs_read(gs_data, ws = 1)
# Summarize using tidyr/dplyr
data %>%
gather(item, response, -1:-2) %>%
filter(!is.na(response)) %>%
group_by(`Parameter 1: Mode of reservation`, item) %>%
summarise(percentage = percent(sum(response %in% c("Satisfied","Very Satisfied"))/n())) %>%
spread(`Parameter 1: Mode of reservation`, percentage)
After using dplyr to summarise the data, you can use tidyr to transpose the dataset so that you have the columns and rows just as you asked in the question.
# read in the data
data <- read.csv("C:/RSnips/My Dataset - Sheet1.csv")
# load libraries
library(dplyr)
library(tidyr)
# take the loaded data
data2 <- data %>%
# group it by mode of reservation
group_by(Parameter.1..Mode.of.reservation) %>%
# summarise
summarise(
# count how many times website column takes values sat or very sat and divide by number of observations in each group given by group_by
OurWeb = sum(Our.Website == "Satisfied" |
Our.Website == "Very Satisfied")/n(),
# do the same for Staff and food
Staff = sum(Friendliness.of.Staff == "Satisfied" |
Friendliness.of.Staff == "Very Satisfied")/n(),
Food = sum(Food.Quality == "Satisfied" |
Food.Quality == "Very Satisfied")/n()) %>%
# If you want to have email, internet and phone in columns
# use tidyr package to transpose the dataset
# first turn it into a long format, where mode of the original columns are your key
gather(categories, val, 2:(ncol(data)-1)) %>%
# then turn it back to wide format, but mode of reservation will be in columns
spread(Parameter.1..Mode.of.reservation, val)
How about:
# Column names follow the read.csv() names used in the other answer
data %>%
  mutate(OurWebsite2 = ifelse(Our.Website == "Very Satisfied" | Our.Website == "Satisfied", 1, 0),
         FriendlinessOfStaff2 = ifelse(Friendliness.of.Staff == "Very Satisfied" | Friendliness.of.Staff == "Satisfied", 1, 0),
         FoodQuality2 = ifelse(Food.Quality == "Very Satisfied" | Food.Quality == "Satisfied", 1, 0)) %>%
  group_by(Parameter.1..Mode.of.reservation) %>%
  summarise(OurWebsiteSatisfaction = mean(OurWebsite2),
            FriendlinessOfStaffSatisfaction = mean(FriendlinessOfStaff2),
            FoodQualitySatisfaction = mean(FoodQuality2))
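Since the original spreadsheet is not reproduced here, the idea can be sketched on made-up data. Everything in `toy` (column names and values) is hypothetical; the sketch uses across() from dplyr 1.0 or later to apply the same "share of Satisfied or Very Satisfied" calculation to several columns at once, rather than writing one line per column.

```r
library(dplyr)

# Hypothetical toy data; column names and values are made up for illustration
toy <- data.frame(
  mode    = c("Internet", "Internet", "Internet", "Phone", "Phone"),
  website = c("Satisfied", "Very Satisfied", "Dissatisfied", "Neutral", "Satisfied"),
  food    = c("Satisfied", "Neutral", "Neutral", "Very Satisfied", "Satisfied"),
  stringsAsFactors = FALSE
)

positive <- c("Satisfied", "Very Satisfied")

shares <- toy %>%
  group_by(mode) %>%
  # The mean of a logical vector is the proportion of TRUE values
  summarise(across(c(website, food), ~ mean(.x %in% positive)))
```

Multiplying the resulting proportions by 100 gives the percentages asked for in the question.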
