Wrangling dataframes with Dplyr

I'm trying to find a neat dplyr solution to convert this dataframe:
Rule <- c('Rule 1', 'Rule 1', 'Rule 1', 'Rule 1', 'Rule 2', 'Rule 2', 'Rule 2')
Condition <- c('1 of 4', '2 of 4', '3 of 4', '4 of 4', '1 of 3', '2 of 3', '3 of 3')
Clause <- c('Temperature > 60', 'Temperature < 90', 'Rain = 0', 'Wind < 20', 'Temperature > 55', 'Temperature < 85', 'Rain <= 2')
Lift <- c('1.30', '1.30', '1.30', '1.30', '1.60', '1.60', '1.60')
Coverage <- c('20%','20%','20%','20%','35%','35%','35%')
DF <- data.frame(Rule, Condition, Clause, Lift, Coverage)
into this dataframe:
Rule <- c('Rule 1', 'Rule 1', 'Rule 1', 'Rule 1','', 'Rule 2', 'Rule 2', 'Rule 2')
Condition <- c('1 of 4', '2 of 4', '3 of 4', '4 of 4','', '1 of 3', '2 of 3', '3 of 3')
Clause <- c('Temperature > 60', 'Temperature < 90', 'Rain = 0', 'Wind < 20','', 'Temperature > 55', 'Temperature < 85', 'Rain <= 2')
Lift <- c('', '', '', '1.30', '','', '', '1.60')
Coverage <- c('','','','20%','','','','35%')
Result <- data.frame(Rule, Condition, Clause, Lift, Coverage)
Notice that blank rows now separate the rules, and that the repeated Lift and Coverage values have been removed, so only the final row of each rule retains its Lift and Coverage.

You can create a blank row to insert after every Rule:
empty_df <- data.frame(matrix('', nrow = 1, ncol = ncol(DF),
                              dimnames = list(NULL, names(DF))))
Split the data by each unique Rule, replace the repeated values in the Lift and Coverage columns with blanks, append empty_df to each piece, and combine the results.
library(dplyr)
DF %>%
  group_split(Rule) %>%
  purrr::map_df(~ .x %>%
    mutate(across(c(Lift, Coverage),
                  ~ replace(., duplicated(., fromLast = TRUE), ''))) %>%
    bind_rows(empty_df)
  ) %>%
  # Remove the blank row after the last `Rule`.
  slice(-n())
# Rule Condition Clause Lift Coverage
# <chr> <chr> <chr> <chr> <chr>
#1 "Rule 1" "1 of 4" "Temperature > 60" "" ""
#2 "Rule 1" "2 of 4" "Temperature < 90" "" ""
#3 "Rule 1" "3 of 4" "Rain = 0" "" ""
#4 "Rule 1" "4 of 4" "Wind < 20" "1.30" "20%"
#5 "" "" "" "" ""
#6 "Rule 2" "1 of 3" "Temperature > 55" "" ""
#7 "Rule 2" "2 of 3" "Temperature < 85" "" ""
#8 "Rule 2" "3 of 3" "Rain <= 2" "1.60" "35%"
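The replace()/duplicated() step in the answer can be checked in isolation with base R. This is a minimal sketch; the vector `lift` is a hypothetical stand-in for the Lift column of a single rule. It works here because the value is constant within a rule; if the values differed within a rule, only true repeats would be blanked.

```r
# Hypothetical stand-in for the Lift column of one rule
lift <- c("1.30", "1.30", "1.30", "1.30")

# TRUE for every value that occurs again later in the vector
is_repeat <- duplicated(lift, fromLast = TRUE)

# Blank out the repeats so only the last occurrence survives
lift_clean <- replace(lift, is_repeat, "")
print(lift_clean)
```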

Related

table1() Output Labeling all Data as "Missing"

I am trying to make a descriptive statistics table in R and my code functions properly (producing a table) but despite the fact that I have no missing values in my dataset, the table outputs all of my values as missing. I am still a novice in R, so I do not have a broad enough knowledge base to troubleshoot.
My code:
library(readxl)  # read_excel()
library(table1)  # label() and table1()
data <- read_excel("Data.xlsx")
data$stage <-
  factor(data$stage, levels=c(1,2,3,4,5,6,7),
         labels =c("Stage 0", "Stage 1", "Stage 2", "Stage 3", "Unsure", "Unsure (Early Stage)", "Unsure (Late Stage)"))
data$primary_language <-factor(data$primary_language, levels=c(1,2), labels = c("Spanish", "English"))
data$status_zipcode <- factor(data$status_zipcode, levels = (1:3), labels = c("Minority", "Majority", "Diverse"))
data$status_censusblock <- factor(data$status_censusblock, levels = c(0:2), labels = c("Minority", "Majority", "Diverse"))
data$self_identity <- factor(data$self_identity, levels = c(0:1), labels = c("Hispanic/Latina","White/Caucasian"))
data$subjective_identity <- factor(data$subjective_identity, levels = c(0,1,2,4), labels = c("Hispanic/Latina", "White/Caucasian", "Multiracial", "Asian"))
label (data$stage)<- "Stage at Diagnosis"
label(data$age) <- "Age"
label(data$primary_language) <- "Primary language"
label(data$status_zipcode)<- "Demographic Status in Zipcode Area"
label(data$status_censusblock)<- "Demographic Status in Census Block Group"
label(data$self_identity) <- "Self-Identified Racial/Ethnic Group"
label(data$subjective_identity)<- "Racial/Ethnic Group as Identified by Others"
table1(~ stage +age + primary_language + status_zipcode + status_censusblock + self_identity + subjective_identity| primary_language, data=data)
Table output: [image not preserved]
Data set: [image not preserved]

When I run the data set, the values are there. It worked for me when I laid the call out like this:
data$stage <- factor(data$stage,
                     levels = c(1,2,3,4,5,6,7),
                     labels = c("Stage 0", "Stage 1", "Stage 2", "Stage 3", "Unsure", "Unsure (Early Stage)", "Unsure (Late Stage)"))
When I ran it exactly as you typed it, I got NAs too. Try this version first and see if it works for you, then check the other columns. Whitespace alone should not change how R evaluates factor(), though, so the level mismatches described below are the more likely cause.
I do end up with one NA in the stage column because one of your values is 0, and 0 is not defined in your levels.
Edit: I ran the rest, so here are some other points.
You end up with NAs in primary_language because the data contain 0 and 1 but you define the levels as 1, 2, so you need to change the levels to match the values. You end up with NAs in other columns for the same reason: any value that is not listed in levels becomes NA. For example, status_zipcode is defined with levels 1:3, while the code below uses 0, 1, 2.
Change your code to this and you should have the values you need, except for that initial 0 in stage:
data$stage <- factor(data$stage,
                     levels = c(1,2,3,4,5,6,7),
                     labels = c("Stage 0", "Stage 1", "Stage 2", "Stage 3", "Unsure", "Unsure (Early Stage)", "Unsure (Late Stage)"))
data$primary_language <- factor(data$primary_language,
                                levels = c(0,1),
                                labels = c("Spanish", "English"))
data$status_zipcode <- factor(data$status_zipcode,
                              levels = c(0,1,2),
                              labels = c("Minority", "Majority", "Diverse"))
data$status_censusblock <- factor(data$status_censusblock,
                                  levels = c(0,1,2),
                                  labels = c("Minority", "Majority", "Diverse"))
data$self_identity <- factor(data$self_identity,
                             levels = c(0,1),
                             labels = c("Hispanic/Latina","White/Caucasian"))
data$subjective_identity <- factor(data$subjective_identity,
                                   levels = c(0,1,2,4),
                                   labels = c("Hispanic/Latina", "White/Caucasian", "Multiracial", "Asian"))
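The effect of a level mismatch can be reproduced without the original spreadsheet. The sketch below uses a hypothetical vector `codes` in place of the primary_language column, assuming (as the answer describes) that the data are coded 0 and 1.

```r
# Hypothetical codes standing in for the primary_language column
codes <- c(0, 1, 1, 0)

# Levels 1 and 2: the value 0 has no matching level and becomes NA,
# and the value 1 is silently labelled "Spanish"
mismatched <- factor(codes, levels = c(1, 2), labels = c("Spanish", "English"))

# Levels 0 and 1: every value has a matching level
matched <- factor(codes, levels = c(0, 1), labels = c("Spanish", "English"))

print(sum(is.na(mismatched)))
print(table(matched))
```

Running table(data$primary_language, useNA = "ifany") on the raw column before calling factor() is one way to see which codes actually occur.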

Categorize and highlight table sections with flextable

I have the data frame below, which is categorized by the column Category:
> dput(mydata)
structure(list(Category = c("Executive", "Management", "Management",
"Management", "Professional", "Professional", "Professional",
"Para-Professional", "Para-Professional", "Para-Professional"
), Rank = c("Rank 1", "Rank 1", "Rank 2", "Rank 3", "Rank 1",
"Rank 2", "Rank 3", "Rank 1", "Rank 2", "Rank 3"), Jobs = c(" SMP - Sales, Marketing & Product Management",
" SMP - Sales, Marketing & Product Management", " ENS - Engineering & Science",
" FIN - Finance", " SMP - Sales, Marketing & Product Management",
" ENS - Engineering & Science", " FIN - Finance", " PSK - Production & Skilled Trades",
" ENS - Engineering & Science", " EGS - Energy Generation & Supply"
), N = c(3, 10, 3, 2, 54, 25, 5, 7, 2, 1)), row.names = c(NA,
-10L), class = "data.frame")
I wanted to merge the cells that share the same category in the Category column, which I did with the code below using flextable:
mydata %>% flextable() %>% merge_v(j = ~Category)
Now I want to highlight the border around the entire section of data corresponding to each category. For example, the rows belonging to the Executive category should be enclosed by a thicker border, and similarly for the others. So far I could only highlight the Category cells, as below:
cl <- fp_border(color = "#00A8C8", width = 3)
mydata %>% flextable() %>% merge_v(j = ~Category) %>% hline(j = ~Category, border = cl)
I want a similar border around each sub-table corresponding to a category, to differentiate better between the categories in the table. How can I do this with flextable only?

I think this should help.
The key here is to create a logical vector that marks the rows after which a line should appear; this is what the function break_position does.
library(flextable)
library(officer)
library(magrittr)
mydata <- structure(list(Category = c(
  "Executive", "Management", "Management",
  "Management", "Professional", "Professional", "Professional",
  "Para-Professional", "Para-Professional", "Para-Professional"
), Rank = c(
  "Rank 1", "Rank 1", "Rank 2", "Rank 3", "Rank 1",
  "Rank 2", "Rank 3", "Rank 1", "Rank 2", "Rank 3"
), Jobs = c(
  " SMP - Sales, Marketing & Product Management",
  " SMP - Sales, Marketing & Product Management", " ENS - Engineering & Science",
  " FIN - Finance", " SMP - Sales, Marketing & Product Management",
  " ENS - Engineering & Science", " FIN - Finance", " PSK - Production & Skilled Trades",
  " ENS - Engineering & Science", " EGS - Energy Generation & Supply"
), N = c(3, 10, 3, 2, 54, 25, 5, 7, 2, 1)), row.names = c(
  NA,
  -10L
), class = "data.frame")
cl <- fp_border(color = "#00A8C8", width = 3)
# TRUE on the last row of each run of identical values, except the final run
break_position <- function(x) {
  z <- data.table::rleidv(x)  # run ids: 1, 2, 2, 2, 3, ...
  c(z[-length(z)] != z[-1], FALSE)
}
mydata %>%
  flextable() %>%
  merge_v(j = ~Category) %>%
  hline(i = ~ break_position(Category), border = cl) %>%
  fix_border_issues()
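To see what the logical vector looks like for this data, here is a minimal base-R sketch. The name `break_position_base` is hypothetical; it compares neighbouring values directly instead of calling data.table::rleidv, which should give the same result for a character vector without NAs.

```r
# Hypothetical base-R variant: TRUE where a value differs from the next one
break_position_base <- function(x) {
  c(x[-length(x)] != x[-1], FALSE)
}

category <- c("Executive", "Management", "Management", "Management",
              "Professional", "Professional", "Professional",
              "Para-Professional", "Para-Professional", "Para-Professional")

breaks <- break_position_base(category)

# Row numbers after which hline() would draw a border
print(which(breaks))
```

The last element is always FALSE, so no extra line is drawn under the final category.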

How to conditionally bold y axis text in a Gantt chart using ggplot2

I am trying to use a theme to conditionally set the face of the axis text based on the attribute ActivitySort. If the value is 0, I would like the text to be bold; otherwise it should stay plain.
require("tidyverse")
task0 <- c('Strategy 1', 'Strategy 1', '2017-04-01', '2020-04-01',0, "Strategy")
task1 <- c('Strategy 1', 'Collect data', '2017-04-01', '2018-04-01',1, "In Progress")
task2 <- c('Strategy 1', 'Clean data', '2018-04-01', '2018-06-01', 1, "Completed")
task3 <- c('Strategy 1', 'Analyse data', '2018-06-01', '2019-04-01',1, "Discontinued")
task10 <- c('Strategy 2', 'Strategy 2', '2017-04-01', '2020-04-01',0, "Strategy")
task11 <- c('Strategy 2', 'Collect data again', '2017-04-01', '2018-04-01',1, "In Progress")
task12 <- c('Strategy 2', 'Clean data again', '2018-04-01', '2018-06-01', 1, "Completed")
task13 <- c('Strategy 2', 'Analyse data again', '2018-06-01', '2019-04-01',1, "Discontinued")
task14 <- c('Strategy 2', 'Write report again', '2019-04-01', '2020-04-01', 1, "Planned")
dataset <- as.data.frame(rbind(task0, task1, task2, task3,task10, task11, task12, task13, task14))
names(dataset) <- c('StrategyName', 'Activity', 'Start', 'End', 'ActivitySort', "Status")
dataset <- as_tibble(dataset)
dataset <- dataset %>% mutate(StartSort = as.Date(Start, "%Y-%m-%d" ))
dataset <- dataset %>% arrange(desc(StrategyName), desc(ActivitySort), desc(StartSort),Activity, End)
acts <- c("Strategy", "Completed","In Progress", "Discontinued","Planned")
actcols <- c("#000000","#548235", "#2E75B6", "#BF9000", "#7030A0")
els <-unique(dataset$Activity)
g.gantt <- gather(dataset, "state", "date", 3:4) %>% mutate(date = as.Date(date, "%Y-%m-%d" ), Status=factor(Status, acts[length(acts):1]), Activity=factor(Activity, els))
plot <- ggplot(g.gantt, aes(x = date, y = Activity, color = Status, group=Activity)) +
  geom_line(size = 5) +
  scale_color_manual(values=actcols, name="Status", breaks = acts, limits = acts) +
  labs(x="Project year", y=NULL, title="Activity timeline")
plot <- plot + facet_grid(rows = vars(StrategyName), scales="free")
plot <- plot + theme(axis.text.y= element_text(face=ifelse((dataset$ActivitySort == 0),"bold","plain")))
plot
The code currently bolds text using ifelse, but the results are not as expected. I want only the strategies (the items with black lines and a status of Strategy) to be bold. Currently only the second strategy is bold, whereas both strategies should be.

Your problem lies in the fact that you're splitting your data over two facets and setting the y axes to free, which means the two panels do not share the same axis. The ifelse call just passes on 9 values that pertain to the outside data set rather than to each panel's labels: the upper panel gets the 9 values, of which the first 4 are "plain", so nothing is in bold face, while the lower panel gets the same 9 values, and since the 5th one is "bold", Strategy 2 is in bold.
You can check this by removing the scales="free" part from facet_grid.
So you'll have to use a custom function that bolds each label based on its content.
I've modified the code from here.
bold_labels <- function(breaks) {
  # TRUE for each axis label that names a strategy
  strategy <- filter(dataset, Activity %in% breaks) %>%
    mutate(check = str_detect(Activity, "Strategy")) %>%
    pull(check)
  # Wrap each label in bold() or plain() as a plotmath expression
  labels <- purrr::map2(
    breaks, strategy,
    ~ if (.y) bquote(bold(.(.x))) else bquote(plain(.(.x)))
  )
  parse(text = labels)
}
plot <- ggplot(g.gantt, aes(x = date, y = Activity, color = Status, group=Activity)) +
  geom_line(size = 5) +
  scale_color_manual(values=actcols, name="Status", breaks = acts, limits = acts) +
  labs(x="Project year", y=NULL, title="Activity timeline") +
  scale_y_discrete(labels = bold_labels) +
  facet_grid(rows = vars(StrategyName), scales="free_y")
plot
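The explanation of why the original theme() call bolds only one label can be traced with plain vectors. This is a sketch under assumptions: `activity_sort` is a hand-written copy of the ActivitySort column in the row order that the question's arrange() call would produce, and each panel is assumed to read the face vector from position 1, as the answer describes.

```r
# Assumed row order after arrange(): four Strategy 2 tasks, the Strategy 2 row,
# three Strategy 1 tasks, then the Strategy 1 row
activity_sort <- c(1, 1, 1, 1, 0, 1, 1, 1, 0)
face <- ifelse(activity_sort == 0, "bold", "plain")

# Each panel takes as many values as it has labels, starting from the front
upper_panel <- face[1:4]  # Strategy 1 panel has 4 labels
lower_panel <- face[1:5]  # Strategy 2 panel has 5 labels

print(upper_panel)
print(lower_panel)
```

Because the vector is tied to row positions rather than to label text, a label function such as bold_labels, which decides per label, avoids the mismatch.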

Undefined columns selected, how to solve?

When I try to run the following code I get an error:
value <- as.matrix(wsu.wide[, c(4, 3, 2)])
Error in `[.data.frame`(wsu.wide, , c(4, 3, 2)) : undefined columns selected
How do I get this line to work? It's part of dcasting my data.
This is the full code:
library(readxl)
library(reshape2)
Store_and_Regional_Sales_Database <- read_excel("~/Downloads/Data_Files/Store and Regional Sales Database.xlsx", skip = 2)
store <- Store_and_Regional_Sales_Database
freq <- table(store$`Sales Region`)
freq
rel.freq <- freq / nrow(store)
rel.freq
rel.freq.scaled <- rel.freq * 100
rel.freq.scaled
labs <- paste(names(rel.freq.scaled), "\n", "(", rel.freq.scaled, "%", ")", sep = "")
pie(rel.freq.scaled, labels = labs, main = "Pie Chart of Sales Region")
monitor <- store[which(store$`Item Description` == '24" Monitor'),]
wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")])
wsu.wide <- dcast(wsu, "Store No." ~ "Week Ending", value.var = "Units Sold")
value <- as.matrix(wsu.wide[, c(4, 3, 2)])
Thanks.
Edit:
This is my table called "monitor": [image not preserved]
When I then run wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")]), I create another data frame with only the variables "Week Ending", "Store No." and "Units Sold".
However, when I run the wsu.wide line, the output I get is only this: [image not preserved]
Why do I only get this small table when I'm asking to dcast my data?
I don't understand what goes wrong after this step.

The problem is at the line:
wsu.wide <- dcast(wsu, "Store No." ~ "Week Ending", value.var="Units Sold")
Instead of double quotation marks (") you should use backticks (`) in the formula:
wsu.wide <- dcast(wsu, `Store No.` ~ `Week Ending`, value.var = "Units Sold")
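The difference between the two spellings can be inspected with base R's all.vars(), which lists the variable names a formula refers to. This is only an illustration of how R parses the two formulas; it does not call dcast.

```r
# With double quotes, both sides are string constants,
# so the formula refers to no variables at all
quoted <- "Store No." ~ "Week Ending"
print(all.vars(quoted))

# With backticks, both sides are names that can be matched to columns
backticked <- `Store No.` ~ `Week Ending`
print(all.vars(backticked))
```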
To avoid this kind of problem, it is better not to use spaces in R object names; for example, rename the Sales Region variable to sales_region using an underscore. See e.g. Google's R Style Guide.
Please see the code below. I simulated your data because extracting it from the picture is quite cumbersome:
library(readxl)
library(reshape2)
# simulation
n <- 4
Store_and_Regional_Sales_Database <- data.frame(
  a = seq_along(LETTERS[1:n]),
  sr = LETTERS[1:n],
  sr2 = '24" Monitor',
  sr3 = 1:4,
  sr4 = 2:5,
  sr5 = 3:6)
names(Store_and_Regional_Sales_Database)[2:6] <- c(
  "Sales Region", "Item Description",
  "Week Ending", "Store No.", "Units Sold")
# algorithm
store <- Store_and_Regional_Sales_Database
freq <- table(store$`Sales Region`)
freq
rel.freq <- freq/nrow(store)
rel.freq
rel.freq.scaled <- rel.freq * 100
rel.freq.scaled
labs <- paste(names(rel.freq.scaled), "\n", "(", rel.freq.scaled, "%", ")", sep = "")
pie(rel.freq.scaled, labels = labs, main = "Pie Chart of Sales Region")
monitor <- store[which(store$`Item Description` == '24" Monitor'),]
wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")])
wsu.wide <- dcast(wsu, `Store No.` ~ `Week Ending`, value.var = "Units Sold")
value <- as.matrix(wsu.wide[ ,c(4,3,2)])
Output:
3 2 1
[1,] NA NA 3
[2,] NA 4 NA
[3,] 5 NA NA
[4,] NA NA NA
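On the naming advice above, one possible way to convert the spreadsheet's column names to underscore style is sketched below. The regular expressions and the lower-casing are my own choices rather than part of the answer; with names like these, a formula such as store_no ~ week_ending needs no backticks.

```r
old_names <- c("Week Ending", "Store No.", "Units Sold")

# Replace every run of non-alphanumeric characters with a single underscore
new_names <- gsub("[^A-Za-z0-9]+", "_", old_names)

# Drop any trailing underscore (left by the "." in "Store No.") and lower-case
new_names <- tolower(sub("_+$", "", new_names))
print(new_names)
```

The result could then be assigned with names(wsu) <- new_names before reshaping.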

Putting Some Part in Double Quotes in R

I want to put part of an object in double quotes, as in the example given below:
Required Output
"Group 1" = 3, "Group 2" = 3
MWE
Test <- structure("Group 1 = 3, Group 2 = 3", class = "noquote")
Test
[1] Group 1 = 3, Group 2 = 3
as.character(Test)
[1] "Group 1 = 3, Group 2 = 3"
Edited
Actually I have a long character vector (here Labs)
Labs <- c("Group 1", "Group 2")
Test <- noquote(paste(Labs, "= 3", collapse = ", "))
Test
[1] Group 1 = 3, Group 2 = 3
However, I want to have output like this
"Group 1" = 3, "Group 2" = 3

You can use single quotes to delimit the string, which lets you put double quotes inside it:
Test <- c('"Group 1" = 3', '"Group 2" = 3')
If you print it, R shows the escape characters by default. However, you can pass it to cat, or use some fancier options, depending on your needs.
cat(Test)
"Group 1" = 3 "Group 2" = 3
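For the edited version of the question, where the labels come from the vector Labs, the same single-quote idea can be combined with sprintf() and paste(). This is a minimal sketch of one way to do it, not the only one.

```r
Labs <- c("Group 1", "Group 2")

# Wrap each label in double quotes, then join the pieces with ", "
Test <- paste(sprintf('"%s" = 3', Labs), collapse = ", ")

# cat() writes the string without escaping the embedded quotes
cat(Test, "\n")
```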
