Aggregating strings using toString and counting them in R

I have the following data frame, obtained after applying this dplyr code:
Final_df<- df %>%
group_by(ClientID,month) %>%
summarise(test=toString(Sector)) %>%
as.data.frame()
Which gives me the following output:
ClientID month test
ASD Sep Auto,Auto,Finance
DFG Oct Finance,Auto,Oil
What I want is to count the sectors as well:
ClientID month test
ASD Sep Auto:2,Finance:1
DFG Oct Finance:1,Auto:1,Oil:1
How can I achieve it with dplyr?

Here's a similar but slightly different solution to the one by @akrun:
count(df, ClientID, month, Sector) %>%
summarise(test = toString(paste(Sector, n, sep=":")))
#Source: local data frame [4 x 3]
#Groups: ClientID [?]
#
# ClientID month test
# <chr> <chr> <chr>
#1 ASD. Oct Finance:2
#2 ASD. Sep Auto:2, Finance:1
#3 DFG. Oct Oil:2
#4 DFG. Sep Auto:1, Finance:2
In this case, count does the same as group_by + tally, and you don't need another group_by, since count drops the last grouping variable (Sector) automagically.
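The output above (note the "Groups: ClientID [?]" line) comes from an older dplyr. As far as I know, in more recent dplyr versions count() on an ungrouped data frame returns an ungrouped result, so the summarise() would collapse everything into a single row. A minimal sketch that does not depend on that behaviour, assuming dplyr 1.0 or later for the .groups argument, regroups explicitly:

```r
library(dplyr)

# Sample data copied from the "data" block in this thread
df <- data.frame(ClientID = rep(c("ASD.", "DFG."), each = 5),
                 month = rep(c("Sep", "Oct"), c(3, 2)),
                 Sector = c("Auto", "Auto", "Finance", "Finance", "Finance",
                            "Auto", "Finance", "Finance", "Oil", "Oil"),
                 stringsAsFactors = FALSE)

# count() yields one row per ClientID/month/Sector with the tally in column n;
# the explicit group_by() makes the second aggregation level unambiguous
res <- df %>%
  count(ClientID, month, Sector) %>%
  group_by(ClientID, month) %>%
  summarise(test = toString(paste(Sector, n, sep = ":")), .groups = "drop")

res
```

The name res is only used for illustration.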

We can try
df %>%
group_by(ClientID, month, Sector) %>%
tally() %>%
group_by(ClientID, month) %>%
summarise(test = toString(paste(Sector, n, sep=":")))
Or using data.table
library(data.table)
setDT(df)[, .N, .(ClientID, month, Sector)
][, .(test = toString(paste(Sector, N, sep=":"))) , .(ClientID, month)]
If we need a base R option
aggregate(newCol~ClientID + month, transform(aggregate(n~.,
transform(df, n = 1), sum), newCol = paste(Sector, n, sep=":")), toString)
data
df <- data.frame(ClientID = rep(c("ASD.", "DFG."), each = 5),
month = rep(c("Sep", "Oct" ) , c(3,2)),
Sector = c("Auto", "Auto", "Finance", "Finance", "Finance",
"Auto", "Finance", "Finance", "Oil", "Oil"),
stringsAsFactors=FALSE)
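The nested aggregate() call in the base R option is dense, so here is a sketch of the same idea unpacked into steps. The intermediate names counts and res are only for illustration; the logic is meant to mirror the one-liner, not replace it.

```r
# Sample data from the "data" block above
df <- data.frame(ClientID = rep(c("ASD.", "DFG."), each = 5),
                 month = rep(c("Sep", "Oct"), c(3, 2)),
                 Sector = c("Auto", "Auto", "Finance", "Finance", "Finance",
                            "Auto", "Finance", "Finance", "Oil", "Oil"),
                 stringsAsFactors = FALSE)

# Step 1: add a column of ones and sum it per ClientID/month/Sector
counts <- aggregate(n ~ ClientID + month + Sector, transform(df, n = 1), sum)

# Step 2: build the "Sector:count" labels
counts$newCol <- paste(counts$Sector, counts$n, sep = ":")

# Step 3: collapse the labels per ClientID/month into one comma-separated string
res <- aggregate(newCol ~ ClientID + month, counts, toString)

res
```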

Related

How do I pivot columns?

I have found this dataframe in an Excel file, very disorganized. This is just a sample of a bigger dataset, with many jobs.
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0)
)
I need it to be in this format, something that I can work in a database:
df <- data.frame(
Job = c("Driver", "Driver", "Operator", "Operator"),
Frequency= c("Daily", "Weekly", "Daily", "Weekly"),
Item= c("Gloves", "Aprons", "Gloves", "Aprons"),
Quantity= c(1,2,2,0)
)
Any thoughts on how to manipulate the data? I have tried without any luck.
We could use tidyverse methods, doing this in three steps:
1. Remove the first row - slice(-1), reshape to 'long' format (pivot_longer)
2. Keep only the first row - slice(1), reshape to 'long' format (pivot_longer)
3. Do a join with both of the reshaped datasets
library(dplyr)
library(tidyr)
df %>%
slice(-1) %>%
pivot_longer(cols = -Job, names_to = 'Item',
values_to = 'Quantity') %>%
left_join(df %>%
slice(1) %>%
pivot_longer(cols= -Job, values_to = 'Frequency',
names_to = 'Item') %>%
select(-Job), by = 'Item')
-output
# A tibble: 4 x 4
Job Item Quantity Frequency
<chr> <chr> <chr> <chr>
1 Driver Gloves 1 Daily
2 Driver Aprons 2 Weekly
3 Operator Gloves 2 Daily
4 Operator Aprons 0 Weekly
data
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0))
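In the output above Quantity is still a character column, because each source column mixes the frequency label with the numbers. If a numeric Quantity is needed for a database, one possible sketch (the helper name freq and the final column order are my own choices, following the layout asked for in the question) is:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Job = c("Frequency", "Driver", "Operator"),
  Gloves = c("Daily", 1, 2),
  Aprons = c("Weekly", 2, 0),
  stringsAsFactors = FALSE)

# Lookup table: one row per Item with its Frequency, taken from the first row
freq <- df %>%
  slice(1) %>%
  pivot_longer(cols = -Job, names_to = "Item", values_to = "Frequency") %>%
  select(-Job)

# Remaining rows in long format, with Quantity converted from text to numeric
res <- df %>%
  slice(-1) %>%
  pivot_longer(cols = -Job, names_to = "Item", values_to = "Quantity") %>%
  mutate(Quantity = as.numeric(Quantity)) %>%
  left_join(freq, by = "Item") %>%
  select(Job, Frequency, Item, Quantity)

res
```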

Is there a more efficient way to handle facts which are duplicating in an R dataframe?

I have a dataframe which looks like this:
ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)
The dataframe's dimensions work like this:
There will always be an ID/key which singularly and uniquely identifies a submitted fact
There will always be a dimension for a given fact defining the Overall_Category to which a submitted fact belongs.
Most of the time - but not always - there will be a dimension for a "Descriptor",
If there is a "Descriptor" dimension for a given fact, there will be another "Members" dimension to show possible members within "Descriptor".
The problem is that a single submitted fact is duplicated for a given ID based on how many dimensions apply to the given fact. What I'd like is a way to show the fact only once, based on its ID, and have the applicable dimensions stored against that single ID.
I've achieved it by doing this:
df1 <- pivot_wider(df,
id_cols = ID,
names_from = c(Overall_Category, Descriptor, Members),
names_prefix = "zzzz",
values_from = Fact,
names_sep = "-",
names_repair = "unique")
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = T))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = T, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")
But it seems like it wouldn't scale well for facts with many dimensions due to the pivot_wider, and in general it doesn't seem like a very efficient approach.
Is there a better way to do this?
You can unite the three dimension columns into one, then for each ID paste the united values together and take the average of the Fact values.
library(dplyr)
library(tidyr)
df %>%
unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
group_by(ID) %>%
summarise(Descriptor = paste0(Descriptor, collapse = '_'),
mean_sel = mean(Fact, na.rm = TRUE))
# ID Descriptor mean_sel
# <dbl> <chr> <dbl>
#1 1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas… 233
#2 2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans… 50
#3 3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi… 15
I think you want a simple paste with the sep and collapse arguments:
library(dplyr, warn.conflicts = F)
df %>% group_by(ID, Fact) %>%
summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'), .groups = 'drop')
# A tibble: 3 x 3
ID Fact Descriptor
<dbl> <dbl> <chr>
1 1 233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown
2 2 50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual
3 3 15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
An option with str_c
library(dplyr)
library(stringr)
df %>%
group_by(ID, Fact) %>%
summarise(Descriptor = str_c(Overall_Category, Descriptor, Members, sep= "-", collapse="_"), .groups = 'drop')
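To see what sep and collapse do in that call, here is a small sketch on hypothetical vectors (category, descriptor and member are made-up names standing in for the three columns). It also shows one difference that may matter here, since the question says a Descriptor is not always present: str_c propagates NA, while paste turns it into the text "NA".

```r
library(stringr)

# Hypothetical stand-ins for Overall_Category, Descriptor and Members
category   <- c("Car", "Car")
descriptor <- c("Color", "Type")
member     <- c("Red", "Sedan")

# sep joins the pieces element-wise, collapse then joins the resulting strings
joined <- str_c(category, descriptor, member, sep = "-", collapse = "_")
same   <- paste(category, descriptor, member, sep = "-", collapse = "_")

# With a missing piece the two functions behave differently
with_na_strc  <- str_c("Car", NA, sep = "-")   # NA
with_na_paste <- paste("Car", NA, sep = "-")   # "Car-NA"

joined
```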

Creating list with the same number of values

I have a data set with a date, ID, and coordinates that I would like to split into seasonal months. For example, for winter I have assigned January to winter1, February to winter2, and March to winter3. I have done the same for the summer months.
I would like to keep only the IDs that have all of these months, so that when I split the data by ID and year, I would have identical list lengths.
I wasn't sure how to simulate uneven values for each ID in the sample code below, but in my actual data some IDs only have summer1 and not winter1, while it could be flipped around for summer2 and winter2.
library(lubridate)
library(tidyverse)
date <- rep_len(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),1000)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$month <- month(df$date)
df$year <- year(df$date)
df1 <- df %>%
mutate(season_categ = case_when(month %in% 6 ~ 'summer1',
month %in% 7 ~ 'summer2',
month %in% 8 ~ 'summer3',
month %in% 1 ~ 'winter1',
month %in% 2 ~ 'winter2',
month %in% 3 ~ 'winter3')) %>%
group_by(year, ID )%>%
filter(any(month %in% 6:8) &
any(month %in% 1:3))
summer_list <- df1 %>%
filter(season_categ == "summer1") %>%
group_split(year, ID)
# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
# Creates a list for each year and by ID
winter_list <- df1 %>%
filter(season_categ == "winter1") %>%
group_split(year, ID)
names(winter_list) <- sapply(winter_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
Not sure if this is what you want, but I understood that you want to drop any ID that has fewer than all six months (January to March and June to August) in any of the years. You could modify the filter or the grouping if that assumption is wrong.
Here is one approach:
library(lubridate)
library(dplyr)
set.seed(12345)
# random sampling of dates with this seed gives no July date for ID 2 in 2010
df <- tibble(
date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),
1000, replace = TRUE),
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID = rep(1:5, 200),
month = month(date),
year =year(date)) %>%
arrange(ID, date)
df %>%
filter(month %in% c(1:3, 6:8)) %>%
group_by(ID, year) %>%
mutate(complete = length(unique(month)) == 6) %>%
group_by(ID) %>%
filter(all(complete)) %>%
group_by(ID, year) %>%
group_split()
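Because the sampled data makes it hard to see which IDs are dropped, here is a sketch of the same filter on a tiny hypothetical data set (toy and kept are made-up names), where ID 2 is missing July and should therefore be removed entirely:

```r
library(dplyr)

# Hypothetical toy data: ID 1 has all six months in 2010, ID 2 lacks July
toy <- bind_rows(
  tibble(ID = 1, year = 2010, month = c(1, 2, 3, 6, 7, 8)),
  tibble(ID = 2, year = 2010, month = c(1, 2, 3, 6, 8))
)

kept <- toy %>%
  filter(month %in% c(1:3, 6:8)) %>%
  group_by(ID, year) %>%
  # TRUE for every row of an ID/year pair that covers all six months
  mutate(complete = n_distinct(month) == 6) %>%
  group_by(ID) %>%
  # keep an ID only if every one of its years is complete
  filter(all(complete)) %>%
  ungroup()

unique(kept$ID)
```

Here n_distinct(month) is used in place of length(unique(month)); the two give the same count.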
It is not really clear to me what you are looking for. Before you split the data into a list, sort the rows by ID and season:
df1 <- df1[order(df1$ID, df1$season_categ), ]
### Determine which IDs have uneven numbers ###
df1 %>%
group_by(ID) %>%
summarize(month_seq = paste(season_categ , collapse = "_"),
number_of_months = n_distinct(season_categ))
#### Remove odd numbers###

How to find frequencies of multiple ID's from one column by year and plot?

I have a df that looks like
ID                                            Year
Nation, Nation - NA, Economy, Economy - Asia  2008
Economy, Economy - EU, State, Nation          2009
I would like to extract the frequencies of the ID's so that it looks like
Nation  Economy  State  Year
2       2        0      2008
1       2        1      2009
For ID's that have hyphens like "Economy - EU", I am only interested in counting this as a frequency of "Economy"
My end goal is to plot this df by year with the frequency counts of different ID's in the same plot. So for example, "State" would be a green dot in 2008, "Nation" would be a red dot in 2008, and "Economy" would be a blue dot in 2008.
If the second df is not a good way to do this, I am also open to suggestions! That was just my first thought on how to start this.
I will post this as a separate question if this is not appropriate, but my next question is how to plot the frequencies of the second df by year, as mentioned above?
Thank you!
You can split the data into separate rows with separate_rows, splitting on a comma (,). Then move the value after the hyphen into its own column, count the occurrences of each ID value per Year, and reshape the result to wide format.
library(dplyr)
library(tidyr)
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
count(Year, ID) %>%
pivot_wider(names_from = ID, values_from = n, values_fill = 0)
# Year Economy Nation State
# <int> <int> <int> <int>
#1 2008 2 2 0
#2 2009 2 1 1
You can also reduce the code by using janitor::tabyl.
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
janitor::tabyl(Year, ID)
data
df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia",
"Economy, Economy - EU, State, Nation"), Year = 2008:2009),
class = "data.frame", row.names = c(NA, -2L))
We could use str_count to count the strings and summarise by Year
Bring the data in long format with pivot_longer for ggplot
Use ggplot for barchart (basic version demonstrated)
library(tidyverse)
# table
df <- df %>%
group_by(Year) %>%
summarise(Nation = str_count(ID, "Nation"),
Economy = str_count(ID, "Economy"),
State = str_count(ID,"State"))
df
# preparation for plotting
df1 <- df %>%
pivot_longer(
cols = -Year,
names_to = "names",
values_to = "values"
)
# plot
ggplot(df1, aes(x = factor(names), y=values, fill=factor(Year), label=values)) +
geom_col(position=position_dodge())+
geom_text(size = 4, position =position_dodge(1),vjust=-.5)
Output:
Year Nation Economy State
* <dbl> <int> <int> <int>
1 2008 2 2 0
2 2009 1 2 1
I think Ronak has nailed it completely, but since you mentioned in the question that your ultimate goal is to plot, I think there is no need for pivot_wider:
library(tidyverse)
df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia",
"Economy, Economy - EU, State, Nation"), Year = 2008:2009),
class = "data.frame", row.names = c(NA, -2L))
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
count(Year, ID) %>%
ggplot(aes(x= as.factor(Year), y = n, color = ID)) +
geom_col(position = 'dodge') +
coord_flip()
OR
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
count(Year, ID) %>%
ggplot(aes(x= as.factor(Year), y = n, color = ID, label = paste(ID, n, sep = '-'))) +
geom_col(position = 'dodge') +
geom_text(size = 2, position =position_dodge(0.9), vjust = -0.5)
Created on 2021-05-27 by the reprex package (v2.0.0)
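The question asks for coloured dots rather than bars. A possible sketch, reusing the long-format counts from the answers above and swapping in geom_point, is shown below. The complete() step is my own addition so that combinations that never occur (State in 2008) appear as a dot at zero; the names counts and p are only for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia",
"Economy, Economy - EU, State, Nation"), Year = 2008:2009),
class = "data.frame", row.names = c(NA, -2L))

counts <- df %>%
  separate_rows(ID, sep = ',\\s*') %>%
  separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
  count(Year, ID) %>%
  # add the missing Year/ID combinations with a count of zero
  complete(Year, ID, fill = list(n = 0L))

# one dot per ID and year, coloured by ID, slightly dodged to avoid overlap
p <- ggplot(counts, aes(x = factor(Year), y = n, colour = ID)) +
  geom_point(size = 3, position = position_dodge(width = 0.3)) +
  labs(x = "Year", y = "Frequency")

p
```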

Summarizing and spreading data

I have data similar to below :
df=data.frame(
company=c("McD","McD","McD","KFC","KFC"),
Title=c("Crew Member","Manager","Trainer","Crew Member","Manager"),
Manhours=c(12,NA,5,13,10)
)
df
I would like to manipulate it and obtain the data frame below:
df=data.frame(
company=c("KFC", "McD"),
Manager=c(1,1),
Surbodinate=c(1,2),
TotalEmp=c(2,3),
TotalHours=c(23,17)
)
I have managed to manipulate and categorise the employees as well as their count as below:
df<- df %>%
mutate(Role = if_else((Title=="Manager" ),
"Manager","Surbodinate"))%>%
count(company, Role) %>%
spread(Role, n, fill=0)%>%
as.data.frame() %>%
mutate(TotalEmp= select(., Manager:Surbodinate) %>%
apply(1, sum, na.rm=TRUE))
Also, I have summarised the man hours as below:
df <- df %>%group_by(company) %>%
summarize(TotalHours = sum(Manhours, na.rm = TRUE))
How would I combine these two steps at once or is there a cleaner/simpler way of getting the desired output?
dplyr solution:
df %>%
mutate(Title = if_else((Title=="Manager" ),
"Manager","Surbodinate")) %>%
group_by(company) %>%
summarise(Manager = sum(Title == "Manager"), Subordinate = sum(Title == "Surbodinate"), TotalEmp = n(), Manhours = sum(Manhours, na.rm = TRUE))
company Manager Subordinate TotalEmp Manhours
<fct> <int> <int> <int> <dbl>
1 KFC 1 1 2 23
2 McD 1 2 3 17
how about something like this:
df %>%
mutate(Role = ifelse(Title=="Manager" ,
"Manager", "Surbodinate"))%>%
group_by(company) %>%
mutate(TotalEmp = n(),
TotalHours = sum(Manhours, na.rm=TRUE)) %>%
reshape2::dcast(company + TotalEmp + TotalHours ~ Role)
This is not tidyverse, nor is it a one-step process. But if you use data.table you could do:
library(data.table)
# convert by reference, key by company, and keep a handle named DT
DT <- setDT(df, key = "company")
totals <- DT[, .(TotalEmp = .N, TotalHours = sum(Manhours, na.rm = TRUE)), by = company]
dcast(DT, company ~ ifelse(Title == "Manager", "Manager", "Surbodinate"))[totals]
# company Manager Surbodinate TotalEmp TotalHours
# 1 KFC 1 1 2 23
# 2 McD 1 2 3 17

Resources