Count words in string, grouped by year - r

I'm trying to find popular words in a string using R, which is probably easiest to explain with an example.
Taking this as the input (with millions of entries, where each date can appear thousands of times)
IncorporationDate CompanyName
3007931 2003-05-12 OUTLANE BUSINESS CONSULTANTS LIMITED
692999 2013-03-28 AGB SERVICES ANGLIA LIMITED
2255234 2008-05-22 CIDA INTERNATIONAL LIMITED
310577 2017-09-19 FA IT SERVICES LIMITED
2020738 2012-09-03 THE SPARES SHOP LIMITED
2776144 2006-02-03 ANGELVIEW PROPERTIES LIMITED
2420435 2017-10-17 SHANE WARD TM LIMITED
2523165 2014-06-04 THE INDEPENDENT GIN COMPANY LTD
2594847 2015-05-05 AIA ENGINEERING LTD
2701395 2015-05-27 LAURA BRIDGES LIMITED
I want to find the top 10 most popular words used in each year, with the result looking something like this:
| Year | Top1 | Top1_Count | Top2 | Top2_Count | ...
| ---- | ------- | ---------- | ---- | ---------- |
| 2017 | LIMITED | 2 | IT | 1 |
| ...
The closest I've got so far is:
words <- data.frame(table(unlist(strsplit(tolower(df$SText), " "))))
but that loses the year data, only giving a full total across the entire data frame.
I've also played around with summarize from dplyr, but haven't found a way to get it to do what I want.
Edit: using the answer from @maurits-evers I've got a bit further, and found the top 10 using this:
top_words_by_year <- words_by_year %>% group_by(year) %>% top_n(n = 10, wt = n)
Now I'm just trying to figure out how to get it into the shape I need.
Thanks

You could do something like this:
library(tidyverse);
df %>%
mutate(year = format(as.Date(IncorporationDate, format = "%Y-%m-%d"), "%Y")) %>%
group_by(year) %>%
mutate(words = strsplit(as.character(CompanyName), " ")) %>%
unnest(words) %>%
count(year, words);
# year words n
#<chr> <chr> <int>
#1 2003 BUSINESS 1
#2 2003 CONSULTANTS 1
#3 2003 LIMITED 1
#4 2003 OUTLANE 1
#5 2006 ANGELVIEW 1
#6 2006 LIMITED 1
#7 2006 PROPERTIES 1
#8 2008 CIDA 1
#9 2008 INTERNATIONAL 1
#10 2008 LIMITED 1
## ... with 26 more rows
Explanation: Extract year from IncorporationDate, group by year, split CompanyName into words, unnest, and count the number of words per year.
Sample data
df <- read.table(text =
"IncorporationDate CompanyName
3007931 2003-05-12 'OUTLANE BUSINESS CONSULTANTS LIMITED'
692999 2013-03-28 'AGB SERVICES ANGLIA LIMITED'
2255234 2008-05-22 'CIDA INTERNATIONAL LIMITED'
310577 2017-09-19 'FA IT SERVICES LIMITED'
2020738 2012-09-03 'THE SPARES SHOP LIMITED'
2776144 2006-02-03 'ANGELVIEW PROPERTIES LIMITED'
2420435 2017-10-17 'SHANE WARD TM LIMITED'
2523165 2014-06-04 'THE INDEPENDENT GIN COMPANY LTD'
2594847 2015-05-05 'AIA ENGINEERING LTD'
2701395 2015-05-27 'LAURA BRIDGES LIMITED'", header = T)
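To get from the long counts to the wide Top1/Top2 layout asked for in the question, one option is to rank the words within each year and then spread the ranks into columns. This is only a sketch: the toy `words_by_year` table below uses made-up counts standing in for the output of `count(year, words)`, `top_n_words` is a hypothetical name, and ties are broken alphabetically, which is my assumption rather than anything the question specifies.

```r
library(dplyr)
library(tidyr)

# Toy counts standing in for the output of count(year, words) (hypothetical values)
words_by_year <- tibble(
  year  = c("2017", "2017", "2017", "2015", "2015"),
  words = c("LIMITED", "IT", "FA", "LTD", "LIMITED"),
  n     = c(2L, 1L, 1L, 1L, 1L)
)

top_n_words <- 2  # hypothetical name; use 10 for the real data

wide <- words_by_year %>%
  group_by(year) %>%
  arrange(desc(n), words, .by_group = TRUE) %>%  # highest count first, ties alphabetical
  mutate(rank = row_number()) %>%                # 1 = most frequent word in that year
  filter(rank <= top_n_words) %>%
  ungroup() %>%
  pivot_wider(
    id_cols     = year,
    names_from  = rank,
    values_from = c(words, n),
    names_glue  = "Top{rank}_{.value}"           # e.g. Top1_words, Top1_n
  )

wide
```

The result has one row per year with columns such as Top1_words and Top1_n; they can be renamed or reordered with select() if the exact header from the question is needed.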

Related

put the resulting values from for loop into a table in r [duplicate]

I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names
teams<-c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
"Royal Challengers Bangalore","Kolkata Knight Riders","Delhi Daredevils",
"Kings XI Punjab", "Deccan Chargers","Rajasthan Royals", "Chennai Super Kings",
"Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
print(j)
ipl_table %>%
filter(season==2019 & (team1==j | team2 ==j)) %>%
summarise(match_count=n())->kl
print(kl)
match_played<-data.frame(Teams=teams,Match_count=kl)
}
The match count for the last team (i.e. Gujarat Lions) is 0, and it's filling in 0 for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
Filter for the particular season, get the data into long format, and then count the number of matches.
library(dplyr)
matches %>%
filter(season == 2019) %>%
tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
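As for why the loop in the question fills every row with 0: `kl` is overwritten on each pass, so the data frame built from it only ever holds the count for the last team, recycled across all rows (and the last entry, " Gujarat Lions", has a leading space, so it matches nothing). If you want to stay close to the loop idea, a rough base R sketch is to collect one count per team into a vector first. The small `ipl_table` here is invented purely for illustration.

```r
# Toy stand-in for ipl_table (hypothetical rows; the real data has more columns)
ipl_table <- data.frame(
  season = c(2019, 2019, 2019, 2018),
  team1  = c("Mumbai Indians", "Delhi Capitals", "Mumbai Indians", "Mumbai Indians"),
  team2  = c("Delhi Capitals", "Chennai Super Kings", "Chennai Super Kings", "Delhi Capitals"),
  stringsAsFactors = FALSE
)
teams <- c("Mumbai Indians", "Delhi Capitals", "Chennai Super Kings", "Gujarat Lions")

season_2019 <- ipl_table[ipl_table$season == 2019, ]

# One count per team, collected into a vector instead of being overwritten each pass
match_count <- vapply(
  teams,
  function(tm) sum(season_2019$team1 == tm | season_2019$team2 == tm),
  integer(1)
)

# Build the table once, after all counts are known
match_played <- data.frame(Teams = teams, Match_count = match_count, row.names = NULL)
match_played
```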

How to do a frequency table where column values are variables?

I have a DF named JOB. In that DF I have 4 columns: PERSON_ID; JOB; FT (1 for full time, 2 for part time) and YEAR. Every person can have only 1 full-time job per year in this DF. This is the full-time job from which they got most of their income during the year.
DF
PERSON_ID JOB FT YEAR
1 Analyst 1 2018
1 Analyst 1 2019
1 Analyst 1 2020
2 Coach 1 2018
2 Coach 1 2019
2 Analyst 1 2020
3 Gardener 1 2020
4 Coach 1 2018
4 Coach 1 2019
4 Analyst 1 2020
4 Coach 2 2019
4 Gardener 2 2019
I want to get frequencies along the lines of the following question:
What full-time job changes occurred between 2019 and 2020?
I only want to look at changes where FT = 1.
I want my end table to look like this
2019 2020 frequency
Analyst Analyst 1
Coach Analyst 2
NA Gardener 1
I want to be able to say that 2 people moved from a coaching job to an analyst job, 1 analyst did not change jobs, and 1 person entered the labour market as a gardener.
I tried to fiddle around with the table function but did not get close to what I wanted. I could not get the YEAR values into separate variables.
10 bonus points if I can do it in base R :)
Thank you for your help
Not pretty but worked:
# split df by year
df_2019 <- df[df$YEAR %in% c(2019) & df$FT == 1, ]
df_2020 <- df[df$YEAR %in% c(2020) & df$FT == 1, ]
# rename Job columns
df_2019$JOB_2019 <- df_2019$JOB
df_2020$JOB_2020 <- df_2020$JOB
# select needed columns
df_2019 <- df_2019[, c("PERSON_ID", "JOB_2019")]
df_2020 <- df_2020[, c("PERSON_ID", "JOB_2020")]
# merge dfs
df2 <- merge(df_2019, df_2020, by = "PERSON_ID", all = TRUE)
df2$frequency <- 1
df2$JOB_2019 <- addNA(df2$JOB_2019)
df2$JOB_2020 <- addNA(df2$JOB_2020)
# aggregate frequency
aggregate(frequency ~ JOB_2019 + JOB_2020, data = df2, FUN = sum, na.action=na.pass)
JOB_2019 JOB_2020 frequency
1 Analyst Analyst 1
2 Coach Analyst 2
3 <NA> Gardener 1
Not R base but worked:
library(dplyr)
library(tidyr)
data %>%
filter(FT==1, YEAR %in% c(2019, 2020)) %>%
group_by(YEAR, JOB, PERSON_ID) %>%
tally() %>%
pivot_wider(names_from = YEAR, values_from = JOB) %>%
select(-PERSON_ID) %>%
group_by(`2019`, `2020`) %>%
summarise(n = n())
`2019` `2020` n
<chr> <chr> <int>
1 Analyst Analyst 1
2 Coach Analyst 2
3 NA Gardener 1
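Since the question mentions trying table(), here is a base R sketch that gets there with table() once the two years sit side by side. It rebuilds the sample data from the question; the intermediate names (`ft`, `j2019`, `j2020`, `both`, `changes`) are just my own placeholders.

```r
# Sample data from the question
df <- data.frame(
  PERSON_ID = c(1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4),
  JOB  = c("Analyst", "Analyst", "Analyst", "Coach", "Coach", "Analyst",
           "Gardener", "Coach", "Coach", "Analyst", "Coach", "Gardener"),
  FT   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
  YEAR = c(2018, 2019, 2020, 2018, 2019, 2020, 2020, 2018, 2019, 2020, 2019, 2019),
  stringsAsFactors = FALSE
)

# Full-time rows only, one small table per year
ft    <- df[df$FT == 1, ]
j2019 <- ft[ft$YEAR == 2019, c("PERSON_ID", "JOB")]
j2020 <- ft[ft$YEAR == 2020, c("PERSON_ID", "JOB")]

# Full outer join so people missing in one year are kept with NA
both <- merge(j2019, j2020, by = "PERSON_ID", all = TRUE,
              suffixes = c("_2019", "_2020"))

# Cross-tabulate the two job columns, counting NA as its own category
changes <- as.data.frame(
  table(JOB_2019 = both$JOB_2019, JOB_2020 = both$JOB_2020, useNA = "ifany"),
  responseName = "frequency",
  stringsAsFactors = FALSE
)

# Drop combinations that never occur
changes <- changes[changes$frequency > 0, ]
changes
```

The useNA = "ifany" argument is what keeps the person who only appears in 2020; without it, table() silently drops the NA row.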

How to find top 5 most occurring names in column grouped by another column

I'm trying to find the top occurring names in a column for each group from another column. I am new to R and am struggling to understand how other solutions achieve this (the solutions I find seem to solve either the first or the second part of the above).
A sample of the dataset is as follows:
Australia City | International City | Port_Region | Airline | Month_num
"Melbourne" | "Kular Lumpar" | "East Asia" | "Air Asia" | 1
"Melbourne" | "Auckland" | "Oceania" | "Air New Zealand" | 1
"Melbourne" | "Auckland" | "Oceania" | "Air New Zealand" | 1
"Melbourne" | "Auckland" | "Oceania" | "Air New Zealand" | 2
I am trying to find the top occurring airlines per month for an Australia city and display in a jitter chart.
Where I am having issues is with grouping the flights by airline and finding the top airlines.
The current code I am trying is:
sort(table(airlineMelb$Airline),decreasing = TRUE)[1:5]
airlineMelbPop <- c("Air New Zealand", "Air Asia")
as.factor(airlineMelbPop) %>%
ggplot(aes(x=Month_num, y=Port_Region, color=Airline)) +
labs(title="Most popular airlines per month for Melbourne") +
geom_jitter()
Any help would be greatly appreciated.
Edit: I can get the below now. This seems to be on the right track, where it is showing, for example, 'Qantas Airways' has 248 occurrences during the 9th month.
> dt = as.data.table(airlineMelb)
> dt[, .(nobs = .N), by = .(Australian_City, Month_num, Airline)][order(-nobs)]
Australian_City Month_num Airline nobs
1: Melbourne 9 Qantas Airways 248
2: Melbourne 12 Qantas Airways 242
3: Melbourne 3 Qantas Airways 224
4: Melbourne 6 Qantas Airways 224
5: Melbourne 1 Qantas Airways 195
---
494: Melbourne 1 SriLankan Airlines 2
495: Melbourne 1 LATAM Airlines 2
496: Melbourne 1 Scoot Tigerair 2
497: Melbourne 1 Japan Airlines 2
498: Melbourne 1 Air Canada 2
How can this be used with ggplot2 to graph the top 5 airlines for each month (the above is only showing 5 months)?
You can use data.table to get the counts and then choose 5 rows from the sorted counts:
library(data.table)
dt <- as.data.table(airlineMelb)
# count rows per city, month and airline, then sort by descending count
dt_counts <- dt[, .(counts = .N), by = c("Australia City", "Month_num", "Airline")][order(-counts)]
# keep the first 5 airlines within each city and month
dt_top_5 <- dt_counts[, .SD[1:5], by = c("Australia City", "Month_num")]
The first grouping gets the count in each group, and the result is sorted in descending order.
The second grouping extracts the first 5 rows from each sorted city and month group.
Note that if a particular group has fewer than 5 rows, rows of NA will be added.
With data.table you could do
library(data.table)
dt = as.data.table(airlineMelb)
dt_res = dt[, .(nobs = .N), by = .(city, month, airline)][order(-nobs)]
.N gives you the number of observations within the groups in by, giving you the number of observations per airline, per city, and per month in decreasing order.
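To go from those counts to a chart of the top airlines per month, something along these lines might work. It is only a sketch on invented toy data: `flights` and `top_k` are hypothetical names, the city column is left out for brevity, and a dodged bar chart is just one possible choice of geom.

```r
library(data.table)
library(ggplot2)

# Toy data (hypothetical); column names follow the question's sample
flights <- data.table(
  Month_num = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Airline   = c("Qantas Airways", "Qantas Airways", "Qantas Airways",
                "Air New Zealand", "Air New Zealand", "Air Asia",
                "Air Asia", "Air Asia", "Qantas Airways")
)

top_k <- 2  # hypothetical name; use 5 for the real data

# Count flights per month and airline
counts <- flights[, .(nobs = .N), by = .(Month_num, Airline)]

# Sort within month by descending count, then keep the first top_k rows per month
setorder(counts, Month_num, -nobs)
top_airlines <- counts[, head(.SD, top_k), by = Month_num]

# One bar per airline within each month
p <- ggplot(top_airlines, aes(x = factor(Month_num), y = nobs, fill = Airline)) +
  geom_col(position = "dodge") +
  labs(title = "Most popular airlines per month", x = "Month", y = "Flights")
```

Using head(.SD, top_k) rather than .SD[1:top_k] means months with fewer than top_k airlines simply return fewer rows instead of being padded with NA.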

Add row with group sum in new column at the end of group category

I have been searching for this information since yesterday, but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
and I want to add a column with the sum of rows by group, for instance, the total amount of concept: Lunch, code: 1 by name in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like to have something I can do in base R.
Thanks!!
In base R you can use ave to add new column. We insert the sum of group only if it is last row in the group.
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr
library(dplyr)
df %>%
group_by(CODE, CONCEPT, PNR, NAME) %>%
mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE) ,NA))
For a base R option, you may try merging the original data frame and aggregate:
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPT PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
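For anyone who wants to run this end to end, here is a self-contained base R sketch on the question's data. It is a variant of the ave idea above: it marks the last row of each group with duplicated(..., fromLast = TRUE) instead of seq_along. The column names PNR and DEPTO are simplified versions of the question's headers, and I am assuming groups are defined by CODE and NAME and that rows are already in the desired order.

```r
# Sample data from the question (column names simplified)
df <- data.frame(
  CODE    = c(1, 1, 1, 1, 2),
  CONCEPT = c("Lunch", "Lunch", "Lunch", "Lunch", "Internet"),
  PNR     = c(11, 11, 11, 13, 13),
  NAME    = c("John", "John", "John", "Frank", "Frank"),
  DEPTO   = c("SALES", "SALES", "SALES", "IT", "IT"),
  PRICE   = c(160, 120, 10, 200, 120),
  stringsAsFactors = FALSE
)

# Group total repeated on every row of its CODE/NAME group
group_total <- ave(df$PRICE, df$CODE, df$NAME, FUN = sum)

# TRUE only on the last row of each CODE/NAME group, in the current row order
is_last <- !duplicated(df[, c("CODE", "NAME")], fromLast = TRUE)

# Show the total on the last row of each group and NA elsewhere
df$TOTAL <- ifelse(is_last, group_total, NA)
df
```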

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the most recent year. In the first row of the output, John has the latest year (2016) but the balance of 2015, which is not what I need: it should output 100 and not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% # create the data frame
as_tibble() %>% # convert it to a tibble
group_by(Name, Year) %>% # group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum balance for each Name and Year
group_by(Name) %>% # group the resulting data frame by Name
top_n(1, Year) # keep only the row with the latest year in each group
Here is another solution without the data.table package.
First sort the data frame by descending Year and Balance,
df <- df[order(-df$Year, -df$Balance),]
then select the first row in each group with the same name:
df[!duplicated(df$Name),]
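On the "where am I going wrong" part: aggregate with cbind(Year, Balance) applies max to each column separately, so the maximum Year and the maximum Balance can come from different rows. If you would rather stay with aggregate, a two-step sketch is to keep only each person's latest-year rows first and then take the maximum balance among those. The names `latest` and `result` are just placeholders.

```r
# Data from the question
df <- data.frame(
  Name    = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
  Year    = c(2016, 2015, 2014, 2016, 2006, 2006),
  Balance = c(100, 150, 65, 75, 150, 10)
)

# Step 1: keep only the rows from each person's most recent year
latest <- df[df$Year == ave(df$Year, df$Name, FUN = max), ]

# Step 2: among those rows, take the maximum balance per person
result <- aggregate(Balance ~ Name + Year, data = latest, FUN = max)
result
```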
