How to visualize the frequency of a categorical variable in R

I have two variables in my data frame that I am trying to graph with ggplot. On the x-axis I want the date, which has a daily frequency. On the y-axis I want the count of unique names that show up on that given day.
The variables look something like this in the dataframe.
Date Name
1 2016-03-01 Joe
2 2016-03-01 Joe
3 2016-03-01 Joe
4 2016-03-01 Mark
5 2016-03-01 Sue
6 2016-03-02 Mark
7 2016-03-02 Joe
8 2016-03-03 Joe
9 2016-03-03 Joe
10 2016-03-03 Bill
So the frequency on the y-axis on the first day would show 3, 2 on the second, and 2 on the third.
My question is how do I produce that graph.

Count the number of unique Name values for each Date, then plot the result with geom_col (or geom_bar with stat = "identity").
library(dplyr)
library(ggplot2)
df %>%
group_by(Date) %>%
summarise(n = n_distinct(Name)) %>%
ggplot() + geom_col(aes(Date, n))
#ggplot() + geom_bar(aes(Date, n), stat = "identity")
data
df <- structure(list(Date = c("2016-03-01", "2016-03-01", "2016-03-01",
"2016-03-01", "2016-03-01", "2016-03-02", "2016-03-02", "2016-03-03",
"2016-03-03", "2016-03-03"), Name = c("Joe", "Joe", "Joe", "Mark",
"Sue", "Mark", "Joe", "Joe", "Joe", "Bill")), class = "data.frame",
row.names = c(NA, -10L))
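As a quick check on the counting step, the per-day totals can also be computed in base R before any plotting. This is only a minimal sketch, assuming the same sample data as above; the name `counts` is introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  Date = c(rep("2016-03-01", 5), rep("2016-03-02", 2), rep("2016-03-03", 3)),
  Name = c("Joe", "Joe", "Joe", "Mark", "Sue",
           "Mark", "Joe", "Joe", "Joe", "Bill"),
  stringsAsFactors = FALSE
)

# Number of distinct names per date (3, 2 and 2 for the three days)
counts <- tapply(df$Name, df$Date, function(x) length(unique(x)))
print(counts)
```

If the counts match what you expect, the dplyr/ggplot2 pipeline above should draw bars of the same heights.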

Related

Sort values across multiple columns in R with dplyr

Apologies for the not-particularly-clear title - hoping my example below helps. I am working with some sports data, attempting to compute "lineup statistics" for certain grouping of players in the data. Below is an example of the type of data I'm working with (playerInfo), as well as the type of analysis I am attempting to do (groupedInfo):
playerInfo = data.frame(
lineup = c(1,2,3,4,5,6),
player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
player1id = c("e91", "a27", "a27", "b17", "b17", "3b3"),
player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
player2id = c("b17", "e91", "b17", "3b3", "a27", "a27"),
player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
player3id = c("3b3", "3b3", "3b3", "a27", "3b3", "b17"),
points = c(6, 8, 3, 12, 36, 2),
stringsAsFactors = FALSE
)
groupedInfo <- playerInfo %>%
dplyr::group_by(player1, player2, player3) %>%
dplyr::summarise(
lineup_ct = n(),
total_pts = sum(points)
)
> groupedInfo
# A tibble: 6 x 5
# Groups: player1, player2 [?]
player1 player2 player3 lineup_ct total_pts
<chr> <chr> <chr> <int> <dbl>
1 Bil Nik Joe 1 6
2 Joe Tom Nik 1 2
3 Nik Joe Tom 1 12
4 Nik Tom Joe 1 36
5 Tom Bil Joe 1 8
6 Tom Nik Joe 1 3
The goal here is to group_by the 3 players in each row, and then compute some summary statistics (in this simple example, count and sum-of-points) for the different groups. Unfortunately, dplyr::group_by does not recognize that certain groups of players are the same group: the same 3 players, simply in different columns.
For example, in the dataframe above, rows 3,4,5,6 all have the same 3 players (Nik, Tom, Joe). However, because sometimes Nik is player1, and sometimes Nik is player2, etc., the group_by groups them separately.
For clarity, below is an example of the type of results I am seeking to get:
correctPlayerInfo = data.frame(
lineup = c(1,2,3,4,5,6),
player1 = c("Bil", "Bil", "Joe", "Joe", "Joe", "Joe"),
player1id = c("e91", "e91", "3b3", "3b3", "3b3", "3b3"),
player2 = c("Joe", "Joe", "Nik", "Nik", "Nik", "Nik"),
player2id = c("3b3", "3b3", "b17", "b17", "b17", "b17"),
player3 = c("Nik", "Tom", "Tom", "Tom", "Tom", "Tom"),
player3id = c("b17", "a27", "a27", "a27", "a27", "a27"),
points = c(6, 8, 3, 12, 36, 2),
stringsAsFactors = FALSE
)
correctGroupedInfo <- correctPlayerInfo %>%
dplyr::group_by(player1, player2, player3) %>%
dplyr::summarise(
lineup_ct = n(),
total_pts = sum(points)
)
> correctGroupedInfo
# A tibble: 3 x 5
# Groups: player1, player2 [?]
player1 player2 player3 lineup_ct total_pts
<chr> <chr> <chr> <int> <dbl>
1 Bil Joe Nik 1 6
2 Bil Joe Tom 1 8
3 Joe Nik Tom 4 53
In this second example, I have manually sorted the data alphabetically such that player1 < player2 < player3. As a result, when I do the group_by, it accurately groups rows 3-6 into a single grouping.
How can I achieve this programmatically? I'm not sure whether it is best to (a) restructure playerInfo into the column-sorted correctPlayerInfo (as I've done above), or (b) use some other approach where group_by automatically identifies that these are the same groups.
I am actively working on this, and will post updates if I come up with my own solution. Until then, any help with this is greatly appreciated!
Edit: Thus far I've tried something along these lines:
newPlayerInfo <- playerInfo %>%
dplyr::mutate(newPlayer1 = min(player1, player2, player3)) %>%
dplyr::mutate(newPlayer3 = max(player1, player2, player3))
... to no avail.
You could create group IDs that are sorted composites of the players' names (or IDs). For example:
playerInfo %>%
mutate(
group_id = purrr::pmap_chr(
.l = list(p1 = player1, p2 = player2, p3 = player3),
.f = function(p1, p2, p3) paste(sort(c(p1, p2, p3)), collapse = "_")
)
) %>%
group_by(group_id) %>%
summarise(
lineup_ct = n(),
total_pts = sum(points)
)
# A tibble: 3 x 3
group_id lineup_ct total_pts
<chr> <int> <dbl>
1 Bil_Joe_Nik 1 6
2 Bil_Joe_Tom 1 8
3 Joe_Nik_Tom 4 53
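The same sorted-composite-key idea can be sketched in base R, without purrr. This is an illustrative variant under the assumption that the three player columns are character vectors; `player_cols` and `grouped` are names made up for this example.

```r
playerInfo <- data.frame(
  lineup  = c(1, 2, 3, 4, 5, 6),
  player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
  player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
  player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
  points  = c(6, 8, 3, 12, 36, 2),
  stringsAsFactors = FALSE
)

player_cols <- c("player1", "player2", "player3")

# Sort the three names within each row and glue them into one key,
# so the same trio always yields the same group_id
playerInfo$group_id <- apply(
  playerInfo[, player_cols], 1,
  function(r) paste(sort(r), collapse = "_")
)

# Sum of points per trio, plus the number of lineups per trio
grouped <- aggregate(points ~ group_id, data = playerInfo, FUN = sum)
grouped$lineup_ct <- as.vector(table(playerInfo$group_id)[grouped$group_id])
print(grouped)
```

Building the key from the player ID columns instead of the names would probably be safer if two players can share a name.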

Conditionally remove rows based on date and time

I am trying to implement a way to filter this dataframe df
structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
I want to filter it such that if there are duplicate responses in the Name column like how Jim appears twice I would like to keep the row that has the earliest date and time according to the EndDate column ONLY if the Progress column value is greater than 70. Otherwise I want to take the row that has a later date and time in the EndDate column.
Using dplyr, we first convert EndDate to a date-time object with parse_date_time from lubridate. We then group_by Name and, for names with more than one row, select the row with the minimum EndDate if any Progress is greater than 70, and the row with the maximum EndDate otherwise. If there is only one row for a Name, that row is selected by default.
library(dplyr)
library(lubridate)
df %>%
mutate(EndDate = parse_date_time(EndDate,c("%m-%d-%y %H:%M","%Y-%m-%d %H:%M:%S"))) %>%
group_by(Name) %>%
slice(ifelse(n() > 1,
ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
We convert 'EndDate' to a DateTime class, arrange by 'Name' and 'EndDate', and group by 'Name'. Inside slice, if the first element of 'Progress' is greater than 70 we return index 1, otherwise the index of the last row, to subset the rows.
library(tidyverse)
library(lubridate)
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
# if there are multiple formats
# mutate(EndDate = anytime::anytime(EndDate)) %>%
arrange(Name, EndDate) %>%
group_by(Name) %>%
slice(if(first(Progress) > 70) 1 else n())
# A tibble: 7 x 3
# Groups: Name [7]
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
NOTE: if there are multiple 'DateTime' formats, one option is anytime::anytime instead of mdy_hm
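One detail worth checking in both pipelines: Progress is stored as character in the sample data, so `Progress > 70` is a string comparison. That happens to work for two-digit values but would not for a value such as "100". The base-R sketch below converts the column to numeric first and then applies the sort-then-pick rule described above; `pick_row` and the `tz = "UTC"` choice are assumptions made for this example.

```r
df <- data.frame(
  Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
           "Tom", "Peter", "Jane", "Jim", "Jose"),
  Progress = c("65", "20", "80", "20", "65", "45", "20", "70", "25", "80"),
  EndDate = c("11/25/2018 16:45", "11/25/2018 18:05", "11/25/2018 14:20",
              "12/1/2018 22:52", "11/29/2018 18:15", "12/2/2018 15:27",
              "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
              "11/29/2018 21:12"),
  stringsAsFactors = FALSE
)

# String comparison pitfall: the number is coerced to "70" and compared as text
print("100" > 70)  # FALSE

# Convert to proper types before comparing
df$Progress <- as.numeric(df$Progress)
df$EndDate  <- as.POSIXct(df$EndDate, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Hypothetical helper: keep the earliest row if its Progress exceeds 70,
# otherwise keep the latest row
pick_row <- function(g) {
  g <- g[order(g$EndDate), ]
  if (g$Progress[1] > 70) g[1, ] else g[nrow(g), ]
}

result <- do.call(rbind, lapply(split(df, df$Name), pick_row))
rownames(result) <- NULL
print(result)
```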
And (of course) this can also be done using data.table
sample data
df <- structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
code
library(data.table)
#create the data.table (can also be done using setDT(df) )
dt <- as.data.table( df )
#set the dates to a proper POSIXct-format
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
#order on EndDate (by reference!)
setorder( dt, EndDate )
#summarise by Name, if first Progress >70 then keep it, else keep last Progress
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
benchmarks
microbenchmark::microbenchmark(
data.table = {
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
setorder( dt, EndDate )
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
},
tidyverse1 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
arrange(Name, EndDate) %>%
group_by(Name) %>%
slice(if(first(Progress) > 70) 1 else n())
},
tidyverse2 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
group_by(Name) %>%
slice(ifelse(n() > 1,
ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.654241 2.030820 2.709023 2.556978 2.782023 30.36590 100
# tidyverse1 6.847731 7.218286 8.742247 7.516838 8.034861 72.00902 100
# tidyverse2 6.173201 6.506398 7.286639 6.764582 7.088591 52.10180 100

Combine duplicate rows in dataframe and create new columns

I am trying to aggregate rows in a dataframe where some values are the same and others differ, as below:
dataframe1 <- data.frame(Company_Name = c("KFC", "KFC", "KFC", "McD", "McD"),
Company_ID = c(1, 1, 1, 2, 2),
Company_Phone = c("237389", "-", "-", "237002", "-"),
Employee_Name = c("John", "Mary", "Jane", "Joshua",
"Anne"),
Employee_ID = c(1001, 1002, 1003, 2001, 2002))
I wish to combine the rows for the values that are similar and creating new columns for the values that are different as below:
dataframe2 <- data.frame(Company_Name = c("KFC", "McD"),
Company_ID = c(1, 2),
Company_Phone = c("237389", "237002"),
Employee_Name1 = c("John", "Joshua" ),
Employee_ID1 = c(1001, 2001),
Employee_Name2 = c("Mary", "Anne"),
Employee_ID2 = c(1002, 2002),
Employee_Name3 = c("Jane", "na"),
Employee_ID3 = c(1003, "na"))
I have checked similar questions, such as "Combining duplicated rows in R and adding new column containing IDs of duplicates" and "R: collapse rows and then convert row into a new column", but I do not wish to separate the values by commas; I would rather create new columns.
# Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
#1 KFC 1 237389 John 1001 Mary 1002 Jane 1003
#2 McD 2 237002 Joshua 2001 Anne 2002 na na
Thank you in advance.
A solution using tidyverse. dat is the final output.
library(tidyverse)
dat <- dataframe1 %>%
mutate_if(is.factor, as.character) %>%
mutate(Company_Phone = ifelse(Company_Phone %in% "-", NA, Company_Phone)) %>%
fill(Company_Phone) %>%
group_by(Company_ID) %>%
mutate(ID = 1:n()) %>%
gather(Info, Value, starts_with("Employee_")) %>%
unite(New_Col, Info, ID, sep = "") %>%
spread(New_Col, Value) %>%
select(c("Company_Name", "Company_ID", "Company_Phone",
paste0(rep(c("Employee_ID", "Employee_Name"), 3), rep(1:3, each = 2)))) %>%
ungroup()
# View the result
dat %>% as.data.frame(stringsAsFactors = FALSE)
# Company_Name Company_ID Company_Phone Employee_ID1 Employee_Name1 Employee_ID2 Employee_Name2 Employee_ID3 Employee_Name3
# 1 KFC 1 237389 1001 John 1002 Mary 1003 Jane
# 2 McD 2 237002 2001 Joshua 2002 Anne <NA> <NA>
We could do this with dcast from data.table, which can take multiple value.var columns. Convert the 'data.frame' to a 'data.table' (setDT(dataframe1)) and, grouped by 'Company_Name', replace the '-' elements of 'Company_Phone' with the first value in each group. Then dcast from 'long' to 'wide', specifying 'Employee_Name' and 'Employee_ID' as the value.var columns
library(data.table)
setDT(dataframe1)[, Company_Phone := first(Company_Phone), Company_Name]
res <- dcast(dataframe1, Company_Name + Company_ID + Company_Phone ~
rowid(Company_Name), value.var = c("Employee_Name", "Employee_ID"), sep='')
-output
res
#Company_Name Company_ID Company_Phone Employee_Name1 Employee_Name2 Employee_Name3 Employee_ID1 Employee_ID2 Employee_ID3
#1: KFC 1 237389 John Mary Jane 1001 1002 1003
#2: McD 2 237002 Joshua Anne NA 2001 2002 NA
If we need to order it
res[, c(1:3, order(as.numeric(sub("\\D+", "", names(res)[-(1:3)]))) + 3), with = FALSE]
# Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
#1: KFC 1 237389 John 1001 Mary 1002 Jane 1003
#2: McD 2 237002 Joshua 2001 Anne 2002 NA NA
Here is another approach combining dplyr and cSplit
library(dplyr)
dataframe1 <- dataframe1 %>%
group_by(Company_Name, Company_ID) %>%
summarise_all(funs(paste((.), collapse = ",")))
library(splitstackshape)
dataframe1 <- cSplit(dataframe1, c("Company_Phone", "Employee_Name", "Employee_ID"), ",")
dataframe1
# Company_Name Company_ID Company_Phone_1 Company_Phone_2 Company_Phone_3 Employee_Name_1 Employee_Name_2 Employee_Name_3 Employee_ID_1 Employee_ID_2 Employee_ID_3
#1: KFC 1 237389 - - John Mary Jane 1001 1002 1003
#2: McD 2 237002 - NA Joshua Anne NA 2001 2002 NA
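For comparison, the long-to-wide step can also be sketched with base R's reshape(). This is a minimal illustration rather than a replacement for the answers above; it assumes the first non-"-" phone number in each company is the one to keep, and the helper column `idx` is introduced only for this example.

```r
dataframe1 <- data.frame(
  Company_Name  = c("KFC", "KFC", "KFC", "McD", "McD"),
  Company_ID    = c(1, 1, 1, 2, 2),
  Company_Phone = c("237389", "-", "-", "237002", "-"),
  Employee_Name = c("John", "Mary", "Jane", "Joshua", "Anne"),
  Employee_ID   = c(1001, 1002, 1003, 2001, 2002),
  stringsAsFactors = FALSE
)

# Replace "-" with the first real phone number within each company
dataframe1$Company_Phone <- ave(
  dataframe1$Company_Phone, dataframe1$Company_ID,
  FUN = function(x) x[x != "-"][1]
)

# Running employee index within each company: 1, 2, 3, 1, 2
dataframe1$idx <- ave(dataframe1$Company_ID, dataframe1$Company_ID,
                      FUN = seq_along)

# One row per company; Employee_* columns get the index as a suffix
wide <- reshape(
  dataframe1,
  idvar     = c("Company_Name", "Company_ID", "Company_Phone"),
  timevar   = "idx",
  direction = "wide",
  sep       = ""
)
print(wide)
```

Companies with fewer employees than the maximum get NA in the unused columns, as in the other answers.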

How to divide contents of one column by different values, conditional on contents of a second column?

I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each of the days of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
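Since the question starts from a plain vector like c(3, 3, 2), the lookup can also be sketched with a named vector instead of a second data frame. This assumes you attach the day names yourself; `n_day` is a name chosen for this example.

```r
DF <- data.frame(
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday"),
  Salesperson = c("John", "Sarah", "John", "Sarah", "John", "Sarah"),
  Value = c(40, 50, 60, 30, 50, 40),
  stringsAsFactors = FALSE
)

# Named lookup vector: how many times each day has occurred
n_day <- c(Monday = 3, Tuesday = 3, Wednesday = 2)

# Index the lookup by day name; as.character() guards against Day being a factor
DF$newcol <- DF$Value / unname(n_day[as.character(DF$Day)])
print(DF)
```

This does the same job as the match() shortcut above: both look up the divisor for each row by its day name.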
You can use the library dplyr to merge your data frame with the frequency of each day.
library(dplyr)
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=F)
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2)
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To build this auxiliary table from the number of times each day appears in your original data, instead of typing it by hand, you could use the line below. Each day then gets n = 2, which is what the output shown after it reflects:
aux <- df %>% group_by(Day) %>% summarise(n=n())
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the actual Value column, use mutate(Value=Value/n); to remove the additional column, add a select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)

How to move information in some rows in R to columns?

I am looking to do something in R that seems similar to what I would use the reshape package for, but not quite. I am looking to move some rows of a data frame into columns but not all. For example, my data frame looks something like:
v1, v2, v3
info, time, 12:00
info, day, Monday
info, temperature, 70
data, 1, 2
data, 2, 2
data, 3, 1
data, 4, 1
data, 5, 3
I would like to transform it into something like:
v1, v2, v3, info_time, info_day, info_temperature
data, 1, 2, 12:00, Monday, 70
data, 2, 2, 12:00, Monday, 70
data, 3, 1, 12:00, Monday, 70
data, 4, 1, 12:00, Monday, 70
data, 5, 3, 12:00, Monday, 70
Is there an easy way to do this? Does the reshape package help here?
Thank you in advance for all your help!
Vincent
Try
library(reshape2)
indx <- df$v1=='data'
res <- cbind(df[indx,],dcast(df[!indx,],v1~v2, value.var='v3'))[,-4]
row.names(res) <- NULL
colnames(res)[4:6] <- paste('info', colnames(res)[4:6], sep="_")
res
# v1 v2 v3 info_day info_temperature info_time
#1 data 1 2 Monday 70 12:00
#2 data 2 2 Monday 70 12:00
#3 data 3 1 Monday 70 12:00
#4 data 4 1 Monday 70 12:00
#5 data 5 3 Monday 70 12:00
Or use dplyr/tidyr
library(dplyr)
library(tidyr)
cbind(df[indx,],
unite(df[!indx,], Var, v1, v2) %>%
mutate(id=1) %>%
spread(Var, v3)%>%
select(-id))
Or using base R
cbind(df[indx,],
reshape(transform(df[!indx,], v2= paste(v1, v2, sep="_")),
idvar='v1', timevar='v2', direction='wide')[,-1])
data
df <- structure(list(v1 = c("info", "info", "info", "data", "data",
"data", "data", "data"), v2 = c("time", "day", "temperature",
"1", "2", "3", "4", "5"), v3 = c("12:00", "Monday", "70", "2",
"2", "1", "1", "3")), .Names = c("v1", "v2", "v3"), class = "data.frame",
row.names = c(NA, -8L))
A solution without external packages (using the df structure from Akrun):
df1 <- cbind(df[4:8,1:3],apply(df[1:3,3,drop=FALSE],1,function(x) rep(x,nrow(df)-3)))
colnames(df1)[4:6] <- paste("info",df[1:3,2], sep = "_")
df1
> df1
v1 v2 v3 info_time info_day info_temperature
4 data 1 2 12:00 Monday 70
5 data 2 2 12:00 Monday 70
6 data 3 1 12:00 Monday 70
7 data 4 1 12:00 Monday 70
8 data 5 3 12:00 Monday 70
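The base-R answer above hard-codes the row positions (1:3 and 4:8). A variant that selects rows by the value of v1 instead might look like the sketch below; it assumes every "info" row should become a constant column, and `is_info` and `info` are names introduced for this example.

```r
df <- data.frame(
  v1 = c("info", "info", "info", "data", "data", "data", "data", "data"),
  v2 = c("time", "day", "temperature", "1", "2", "3", "4", "5"),
  v3 = c("12:00", "Monday", "70", "2", "2", "1", "1", "3"),
  stringsAsFactors = FALSE
)

is_info <- df$v1 == "info"

# Named list of the info values: info_time, info_day, info_temperature
info <- setNames(as.list(df$v3[is_info]),
                 paste("info", df$v2[is_info], sep = "_"))

# Keep the data rows and add each info value as a constant column
res <- df[!is_info, ]
res[names(info)] <- info
rownames(res) <- NULL
print(res)
```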

Resources