Compare Values Across Two Dataframes R - r

I am building a crime report in R and am comparing two separate dataframes, one from the current year and one from the previous year. The data structure is the same in both. Is there a way to color the values in a flextable based on the crimes that were committed the previous year? For example, if January 2020 had more homicides than January 2019, color that value red. If January 2020 had fewer burglaries than January 2019, color that value green, and so on for every month of the year and for every crime. Here is a sample of the data:
df2019 <- data.frame(crime = c("assault", "homicide", "burglary"),
Jan = c(5, 2, 7),
Feb = c(2, 4, 0),
Mar = c(1, 2, 1))
df2020 <- data.frame(crime = c("assault", "homicide", "burglary"),
Jan = c(1, 2, 5),
Feb = c(1, 3, 0),
Mar = c(2, 2, 1))
My desired output is to have the df2020 values colored based on the df2019 values (I have included a picture below). I would then like to include the table in a PowerPoint using the officer package.
Does anyone have any ideas? I have been exploring options in kable, kableExtra, and flextable but can't find any solutions that work across dataframes. Thanks for the help!

Here is a solution:
library(flextable)
library(magrittr)
df2019 <- data.frame(crime = c("assault", "homicide", "burglary"),
Jan = c(5, 2, 7),
Feb = c(2, 4, 0),
Mar = c(1, 2, 1))
df2020 <- data.frame(crime = c("assault", "homicide", "burglary"),
Jan = c(1, 2, 5),
Feb = c(1, 3, 0),
Mar = c(2, 2, 1))
colors <- unlist(df2020[-1] - df2019[-1]) %>%
cut(breaks = c(-Inf, -.1, 0.1, Inf),
labels = c("green", "transparent", "red")) %>%
as.character()
flextable(df2020) %>%
bg(j = ~ . -crime, bg = colors) %>%
theme_vanilla() %>%
autofit() %>% save_as_pptx(path = "test.pptx")
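To see why the color vector lines up with the table cells, it can help to build the same colors as a matrix first. The sketch below uses only base R and the sample data from the question. The names delta, color_levels, and color_matrix are illustrative and not part of the answer above, and sign() is swapped in for the answer's cut() call.

```r
# Sample data from the question
df2019 <- data.frame(crime = c("assault", "homicide", "burglary"),
                     Jan = c(5, 2, 7),
                     Feb = c(2, 4, 0),
                     Mar = c(1, 2, 1))
df2020 <- data.frame(crime = c("assault", "homicide", "burglary"),
                     Jan = c(1, 2, 5),
                     Feb = c(1, 3, 0),
                     Mar = c(2, 2, 1))

# Year-over-year change for every crime/month cell (2020 minus 2019)
delta <- as.matrix(df2020[-1] - df2019[-1])
rownames(delta) <- df2020$crime

# sign() returns -1, 0 or 1; adding 2 turns that into an index 1, 2 or 3
color_levels <- c("green", "transparent", "red")
color_matrix <- matrix(color_levels[as.vector(sign(delta)) + 2],
                       nrow = nrow(delta),
                       dimnames = dimnames(delta))

print(color_matrix)
```

Flattening this matrix with as.vector() goes column by column (all of Jan, then Feb, then Mar), which is the same order that unlist() produces in the answer.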

Related

How to find sum of a column given the date and month is the same

I am wondering how I can find the sum of a column, (in this case it's the AgeGroup_20_to_24 column) for a month and year. Here's the sample data:
https://i.stack.imgur.com/E23Th.png
I essentially want to find the total amount of cases per month/year.
For example: 01/2020 = total sum cases of the AgeGroup
02/2020 = total sum cases of the AgeGroup
I tried doing this, however I get this:
https://i.stack.imgur.com/1eH0O.png
xAge20To24 <- covid%>%
mutate(dates=mdy(Date), year = year(dates), month = month(dates))%>%
mutate(total = sum(AgeGroup_20_to_24))%>%
select(Date, year, month, AgeGroup_20_to_24)%>%
group_by(year)
View(xAge20To24)
Any help will be appreciated.
structure(list(Date = c("3/9/2020", "3/10/2020", "3/11/2020",
"3/12/2020", "3/13/2020", "3/14/2020"), AgeGroup_0_to_19 = c(1,
0, 2, 0, 0, 2), AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1), AgeGroup_25_to_29 = c(1,
0, 1, 2, 2, 2), AgeGroup_30_to_34 = c(0, 0, 2, 3, 4, 3), AgeGroup_35_to_39 = c(3,
1, 2, 1, 2, 1), AgeGroup_40_to_44 = c(1, 2, 1, 3, 3, 1), AgeGroup_45_to_49 = c(1,
0, 0, 2, 0, 1), AgeGroup_50_to_54 = c(2, 1, 1, 1, 0, 1), AgeGroup_55_to_59 = c(1,
0, 1, 1, 1, 2), AgeGroup_60_to_64 = c(0, 2, 2, 1, 1, 3), AgeGroup_70_plus = c(2,
0, 2, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I'm not sure if your question and your data match up. You're asking for by-month summaries of data, but your data only includes March entries. I've provided two examples of summarizing your data below, one that uses the entire date and one that uses by-day summaries since we can't use month. If your full data set has more months included, you can just swap the day for month instead. First, a quick summary of just the dates can be done with this code:
#### Load Library ####
library(tidyverse)
library(lubridate)
#### Pivot and Summarise Data ####
covid %>%
pivot_longer(cols = c(everything(),
-Date),
names_to = "AgeGroup",
values_to = "Cases") %>%
group_by(Date) %>%
summarise(Sum_Cases = sum(Cases))
This pivots your data into long format, groups by the entire date, then summarizes the cases, which gives you this by-date sum of data:
# A tibble: 6 × 2
Date Sum_Cases
<chr> <dbl>
1 3/10/2020 6
2 3/11/2020 16
3 3/12/2020 14
4 3/13/2020 15
5 3/14/2020 17
6 3/9/2020 13
Using the same pivot_longer principle, you can mutate the data to date format like you already did, pivot to longer format, then group by day, thereafter summarizing the cases:
#### Theoretical Example ####
covid %>%
mutate(Date=mdy(Date),
Year = year(Date),
Month = month(Date),
Day = day(Date)) %>%
pivot_longer(cols = c(everything(),
-Date,-Year,-Month,-Day),
names_to = "AgeGroup",
values_to = "Cases") %>%
group_by(Day) %>% # use by day instead of month
summarise(Sum_Cases = sum(Cases))
The result is shown below. Here we can see the 14th had the most cases:
# A tibble: 6 × 2
Day Sum_Cases
<int> <dbl>
1 9 13
2 10 6
3 11 16
4 12 14
5 13 15
6 14 17
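For the by-month total of a single column that the question originally asked about, a minimal sketch could look like the following. It assumes dplyr and lubridate are installed, and it uses only the Date and AgeGroup_20_to_24 columns from the sample data; covid_sample and monthly_totals are illustrative names. The key difference from the attempt in the question is that group_by() comes before the sum and the sum is done with summarise() rather than mutate().

```r
library(dplyr)
library(lubridate)

# Two columns taken from the sample data in the question
covid_sample <- data.frame(
  Date = c("3/9/2020", "3/10/2020", "3/11/2020",
           "3/12/2020", "3/13/2020", "3/14/2020"),
  AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1)
)

monthly_totals <- covid_sample %>%
  mutate(dates = mdy(Date),        # parse month/day/year strings
         year  = year(dates),
         month = month(dates)) %>%
  group_by(year, month) %>%        # one group per month of each year
  summarise(total = sum(AgeGroup_20_to_24), .groups = "drop")

print(monthly_totals)
```

With the sample data this gives a single row for March 2020, because no other months are present.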

How to define score of popularity for list of tags in R?

I have a dataset where each object has a comma-separated list of category tags. I would like an aggregated category score per object based on the popularity of its categories. I can compute the sum, min, and max of the popularities, but it's not clear to me how an aggregated score should be calculated.
library(tidyverse)
library(tibble)
library(stringr)
# 1. Data
df <- tribble(
~object, ~category,
1, "Software, Model, Cloud",
2, "Model",
3, "Cloud, Software",
4, "Train, Test, Model",
5, "Test, Model"
)
# 2. List of categories
list_category <- trimws(unlist(str_split(df$category, ",")))
# 3. Categories popularity
data.frame(category = list_category) %>%
group_by(category) %>%
summarise(n_count = n()) %>%
arrange(-n_count) %>%
ungroup()
# 4. Outcome with undefined 'score_category' feature that I'd like to know how to score
tribble(
~object, ~sum_category, ~min_category, ~max_category, ~score_category,
1, sum(c(2, 4, 2)), min(c(2, 4, 2)), max(c(2, 4, 2)), NA,
2, sum(c(4)), min(c(4)), max(c(4)), NA,
3, sum(c(2, 2)), min(c(2, 2)), max(c(2, 2)), NA,
4, sum(c(1, 2, 4)), min(c(1, 2, 4)), max(c(1, 2, 4)), NA,
5, sum(c(2, 4)), min(c(2, 4)), max(c(2, 4)), NA
)
Any ideas and code are welcome!
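As a possible starting point, the sum, min, and max that are hard-coded in step 4 can be computed from the data directly. The sketch below assumes dplyr and tidyr are installed. The mean_category column is only one hypothetical candidate for an aggregated score, not an established definition, and scores is an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  object = 1:5,
  category = c("Software, Model, Cloud",
               "Model",
               "Cloud, Software",
               "Train, Test, Model",
               "Test, Model")
)

scores <- df %>%
  separate_rows(category, sep = ",\\s*") %>%    # one row per object/tag pair
  add_count(category, name = "popularity") %>%  # how often each tag occurs overall
  group_by(object) %>%
  summarise(sum_category  = sum(popularity),
            min_category  = min(popularity),
            max_category  = max(popularity),
            mean_category = mean(popularity),   # hypothetical aggregate score
            .groups = "drop")

print(scores)
```

Whether the mean, the sum, or some weighted variant is the right score depends on whether objects with many tags should rank higher than objects with a single popular tag.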

Adding rows to make a full long dataset for longitudinal data analysis

I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some rows consist of NAs because the person did not complete a given time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset conditionally on whether the participant has had 1, 2 or 3 visits. Ideally it would make rest of data all NAs while copying Participant_ID and site_code but if not possible I would be satisfied just with creating the right number of rows.
We could use fill after doing a complete (ExpandedDataset is assumed here to be the name of your data frame):
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
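A runnable version of the same idea on the sample data from the question might look like this. The group_by(P_ID) step and the .direction = "downup" argument are additions that are not in the answer above; they are there so that a site code is never carried over from one participant to the next. The names long_df and completed are illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
long_df <- data.frame(Values     = c(23, 24, 45, 12, 34, 23),
                      P_ID       = c(1, 1, 2, 2, 2, 3),
                      Event_code = c(1, 2, 1, 2, 3, 1),
                      Site_code  = c(1, 1, 3, 3, 3, 1))

completed <- long_df %>%
  complete(P_ID, Event_code) %>%   # add every missing P_ID/Event_code pair
  group_by(P_ID) %>%               # fill within each participant only
  fill(Site_code, .direction = "downup") %>%
  ungroup()

print(completed)
```

The new rows keep NA in Values, while P_ID, Event_code, and Site_code are filled in.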
I came up with fairly long code, but you could wrap it in a function to make it easier to use:
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
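How many records do you have per ID? (table() is used here because summary() does not return per-ID counts when ID is a character column.)
values <- table(df$ID)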
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in 1:length(rowcount)){
y <- rep(uncompliant[i], target - rowcount[i])
IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for(i in 1:length(rowcount)){
y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]

Separate a column with uneven/unequal strings and with no delimiters

How would I separate a column like this, where part of the data (the date) has delimiters but the rest does not, and some of the strings have unequal lengths?
Input:
id
142 TM500A2013PISA8/22/17BG
143 TM500CAGE2012QUDO8/22/1720+
Output:
category site garden plot year species date portion
142 TM 500 A 2013 PISA 8/22/17 BG
143 TM 500 CAGE 2012 QUDO 8/22/17 20+
I poked around other questions and tried something that might work if the strings were of equal length, i.e.:
>df <- avgmass %>% separate(id, c("site", "garden", "plot", "year",
"species", "sampledate", "portion"),sep=cumsum(c(2,3,3,4,4,5)))
But as the plot id is either A, B or CAGE; the date has "/" - I am not sure how to approach it.
As I am relatively new to R, I tried searching for more details on how to use the sep argument but to no avail... Thank you for your help.
The code below may work for you, assuming that the "site", "garden" and "species" columns are of a fixed width.
df <- df %>%
mutate(site = substr(id, 1, 2),
garden = substr(id, 3, 5),
plot = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 6, 9), substr(id, 6, 6)),
year = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 10, 13), substr(id, 7, 10)),
species = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 14, 17), substr(id, 11, 14)),
sampledate = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 18, nchar(id)), substr(id, 15, nchar(id)))) %>%
separate(sampledate, into = c("m","d","y"), sep = "/") %>%
mutate(portion = substr(y, 3, nchar(y)),
sampledate = as.Date(paste(m, d, substr(y, 1, 2), sep = "-"), format = "%m-%d-%y"),
m = NULL,
d = NULL,
y = NULL)
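An alternative that avoids the repeated ifelse() calls is a single regular expression with one capture group per field. This is only a sketch in base R, under the same fixed-width assumptions plus two more that come from the two sample ids rather than from the full data: the plot is either "CAGE" or a single capital letter, and the year in the date has two digits. The names ids, pattern, parts, and parsed are illustrative.

```r
ids <- c("TM500A2013PISA8/22/17BG", "TM500CAGE2012QUDO8/22/1720+")

# One capture group per output column, in order:
# site, garden, plot, year, species, sampledate, portion
pattern <- paste0("^([A-Z]{2})([0-9]{3})(CAGE|[A-Z])([0-9]{4})([A-Z]{4})",
                  "([0-9]{1,2}/[0-9]{1,2}/[0-9]{2})(.*)$")

# regexec() returns the full match plus each group; drop the full match
parts <- do.call(rbind, regmatches(ids, regexec(pattern, ids)))
parsed <- as.data.frame(parts[, -1, drop = FALSE], stringsAsFactors = FALSE)
names(parsed) <- c("site", "garden", "plot", "year",
                   "species", "sampledate", "portion")

# Convert the m/d/yy string to a Date
parsed$sampledate <- as.Date(parsed$sampledate, format = "%m/%d/%y")

print(parsed)
```

If the real data contains ids that do not match the pattern, regmatches() returns an empty result for them, so it is worth checking that parsed has as many rows as ids.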

gvisMotionChart: change the default variables

I have a data.frame df that consists of four sites (1 to 4). Each site has values for four parameters (A to D) from 2011 to 2014. I want to create a motion chart for site1.
library(dplyr)
siteID <- c(rep("site1", 16), rep("site2", 16), rep("site3", 16), rep("site4", 16))
YEAR <- as.numeric(rep(c("2011", "2012", "2013", "2014"), 16))
parameter <- c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4),
rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4),
rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4),
rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4))
value <- c(seq(1, 4, by=1), seq(10, 40, by=10), seq(12, 18, by=2), seq(5, 20, by=5),
seq(3, 12, by=3), sample(13:18, 4), sample(15:22, 4), sample(10:18, 4),
seq(7, 1, by=-2), sample(15:22, 4), sample(15:19, 4), sample(10:20, 4),
seq(8, 5, by=-1), seq(50, 20, by=-10), seq(16, 10, by=-2), seq(20, 5, by=-5))
df <- data.frame(siteID, YEAR, parameter, value)
df$YEAR <- as.numeric(df$YEAR)
df1 <- df %>%
dplyr::filter(siteID =="site1")
I created the motion chart for site 1 using the following code
library(googleVis)
site1 = gvisMotionChart(data=df1,
idvar="parameter",
timevar="YEAR",
chartid="site1")
plot(site1)
It worked fine. The result is here
However, the default x axis and y axis were value. I had to change x axis myself from value to YEAR.
I wanted to change the default values so that x-axis will be YEAR, colorvar will be parameter, and sizevar will be value. I did that using this code
site1_1 = gvisMotionChart(data=df1,
idvar="parameter",
timevar="YEAR",
chartid="site1",
xvar="YEAR",
yvar="value",
colorvar="parameter",
sizevar="value")
plot(site1_1)
It kept showing as loading but the plot was not created.
Any suggestions would be appreciated.
I think the below should get you just about there. All that's left is to set the options appropriately to get rid of the commas and such.
df1 <- df %>%
dplyr::filter(siteID =="site1") %>%
mutate(Date = YEAR) %>%
mutate(colorValue = parameter) %>%
mutate(sizeValue = value)
library(googleVis)
site1 = gvisMotionChart(data=df1,
idvar="parameter",
timevar="YEAR",
chartid="site1",
xvar = "Date",
yvar = "value",
colorvar = "colorValue",
sizevar = "sizeValue")
plot(site1)
