Is there a way to animate a word cloud in R? - r

Currently, I am using library("wordcloud") to make a word cloud of the most frequent terms from some text data that I have. The text data also comes with an associated year, and I want to generate a new word cloud for each year and have the result animated automatically using a library like gganimate. Is there any way to do this? I want to visualize the most frequent keywords over time, but I am struggling. Any tips?

Yes, with the help of the ggwordcloud package. I'll use the babynames dataset as an interesting example to see how the 5 most common baby names have changed over 100 years. First, load the required packages and load the data.
library(babynames) # Data
library(dplyr) # Data management
library(ggplot2) # Graph framework
library(ggwordcloud) # Wordcloud using ggplot
library(gganimate) # Animation
data(babynames)
The next command finds the top 5 names for each sex in 1915 and 2015, grouped by year.
babies <- babynames %>%
  filter(year %in% c(1915, 2015)) %>%
  group_by(name, sex, year) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n)) %>%
  group_by(year, sex) %>%
  top_n(n = 5) %>%
  ungroup() %>%
  select(name, sex)
Stopping the pipeline just after top_n() shows which names were selected, along with their years and frequencies:
# A tibble: 20 x 4
# Groups: sex, year [4]
name sex year n
<chr> <chr> <dbl> <int>
1 Mary F 1915 58187
2 John M 1915 47577
3 William M 1915 38564
4 James M 1915 33776
5 Helen F 1915 30866
6 Robert M 1915 28738
7 Dorothy F 1915 25154
8 Margaret F 1915 23054
9 Joseph M 1915 23052
10 Ruth F 1915 21878
11 Emma F 2015 20435
12 Olivia F 2015 19669
13 Noah M 2015 19613
14 Liam M 2015 18355
15 Sophia F 2015 17402
16 Mason M 2015 16610
17 Ava F 2015 16361
18 Jacob M 2015 15938
19 William M 2015 15889
20 Isabella F 2015 15594
The final ungroup() and select() then drop the year and frequency columns, because I want to merge the selected names back into the original data to get their frequency for every 5th year between 1915 and 2015 (not every year, because plotting every year takes too long).
Here's the join.
babyyears <- babynames %>%
inner_join(babies, by=c("name","sex")) %>%
filter(year>=1915 & year %% 5 == 0) %>% # Keep all years if you like
mutate(year=as.integer(year)) # For animation. Not sure why this is required.
So that's just setting up the data for the plot. If we just wanted a static wordcloud, we'd aggregate on the year. But we keep the years for the animation.
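As a point of comparison (this static version is my own sketch, not part of the original answer), aggregating away the year would look roughly like this:
babyyears %>%
  group_by(name) %>%
  summarise(n = sum(n)) %>%
  ggplot(aes(label = name, size = n)) +
  geom_text_wordcloud() +
  theme_classic()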
For plotting, we use ggplot with the geom_text_wordcloud function.
gg <- babyyears %>%
ggplot(aes(label = name, size=n)) +
geom_text_wordcloud() +
theme_classic()
Then transition through the years.
gg2 <- gg + transition_time(year) +
labs(title = 'Year: {frame_time}')
I like to add a pause at the end, otherwise the animation rolls around to the start immediately after finishing.
animate(gg2, end_pause=30)
anim_save("gg_anim_wc.gif")
It's hard to keep track of all the names (especially the boys) with them all being placed in random locations. Maybe slowing it down will help. But the name that stands out the most from this graphic is "Mary", which was the most common name in 1915 but then slowly started to lose popularity towards the latter half of the century.
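If you do want to slow it down, animate() also takes fps and duration arguments; for example (the values here are just a guess at something watchable, not from the original answer):
animate(gg2, fps = 5, end_pause = 30)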

Related

Adding mean() column with multiple filters throughout dataframe in R

I'm new to R and am using it for some NFL analysis on a dataframe where the relevant columns look like this:
Randy Moss 12.9 2000
Randy Moss 21.6 2000
Randy Moss 4.0 2000
Randy Moss 44.7 2000
Randy Moss 25.8 2000
Randy Moss 12.9 2000
It's not a list; it's a dataframe where the player's name ("fname.1"), their fantasy points for each game ("fp3"), and the year of the game ("year") are the columns in question. The data includes all years from 2000-2019.
I want to add a column that is the mean of all fantasy results for that player in that year. So my desired output for the example data (if Randy Moss had only played 6 games) would add a column with that mean for each entry, like this:
Randy Moss 12.9 2000 16.98333
Randy Moss 21.6 2000 16.98333
Randy Moss 4.0 2000 16.98333
Randy Moss 44.7 2000 16.98333
Randy Moss 25.8 2000 16.98333
Randy Moss 12.9 2000 16.98333
I'm having trouble using a simple group_by() and summarize() formula because I need a different mean per player for each year. I wrote a for loop that creates a list with the information I need, but I'm not sure how to add that back into the original data, or whether there's an easier way to accomplish this...
mean_fantasy <- list()
for(y in 2000:2019) {
mean_fantasy[[y]] <- offense_test %>%
filter(year == y) %>%
group_by(fname.1) %>%
summarize(mean_fp3 = sum(fp3)/n(), games = n(), year = sum(year)/n())
}
I'm very new to R and this forum, so hopefully this question/formatting makes sense.
Just using the ave() function should give the result that you are looking for, giving the mean value per player per year.
fp3 <- rnorm(20,20,5)
player <- rep(c(LETTERS)[1:4], each = 5)
year <- as.factor(rep(seq(2015,2016, by = 1), 10))
df <- data.frame(player,fp3,year)
df$mean.player.year <- ave(df$fp3, df[,c('player', 'year')], FUN = mean)
# And for the desired output view...
df <- df[order(df$player,df$year),]
> df
player fp3 year mean.player.year
1 A 20.658824 2015 14.36088
3 A 19.842985 2015 14.36088
5 A 2.580835 2015 14.36088
2 A 12.571649 2016 14.33038
4 A 16.089108 2016 14.33038
7 B 34.268847 2015 27.21018
9 B 20.151507 2015 27.21018
6 B 9.363759 2016 15.10290
8 B 19.686929 2016 15.10290
10 B 16.257998 2016 15.10290
11 C 25.823640 2015 21.57919
13 C 17.753304 2015 21.57919
15 C 21.160641 2015 21.57919
12 C 20.878661 2016 23.27219
14 C 25.665711 2016 23.27219
17 D 22.621288 2015 22.81370
19 D 23.006116 2015 22.81370
16 D 25.508619 2016 19.37231
18 D 13.923885 2016 19.37231
20 D 18.684435 2016 19.37231
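Translated to the question's column names (an assumption on my part, since no reproducible data was posted), the same call would be roughly:
offense_test$mean_fp3 <- ave(offense_test$fp3, offense_test[, c("fname.1", "year")], FUN = mean)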
We could use transmute with map
library(dplyr)
library(purrr)
library(stringr)
out <- map_dfc(2000:2019, ~ offense_test %>%
filter(year == .x) %>%
group_by(fname.1) %>%
transmute(!! str_c('mean_fp3_', .x) := sum(fp3)/n(),
!! str_c('games_', .x) := n(),
!! str_c('year_', .x) := sum(year)/n())) %>%
bind_cols(offense_test, .)
If we need a single mean column, we don't need a loop: include 'year' in the group_by as well and then create the column with mutate.
offense_test %>%
group_by(fname.1, year) %>%
mutate(mean_fp3 = mean(fp3), games = n())
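Since offense_test isn't posted, here is a minimal self-contained sketch of that idea, using a few made-up rows shaped like the question's columns (fname.1, fp3, year):
library(dplyr)
toy <- data.frame(fname.1 = "Randy Moss",
                  fp3 = c(12.9, 21.6, 4.0, 44.7, 25.8, 12.9),
                  year = 2000)
toy %>%
  group_by(fname.1, year) %>%
  mutate(mean_fp3 = mean(fp3), games = n()) %>%
  ungroup()
# Every row in a given player-year gets that player's mean for that year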
Thanks for the answers, guys. I went with Roasty's since it was simpler; I can verify it worked.

How to restructure very wide dataframes with dplyr using an index? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
I've read a number of posts on gather, but I'm struggling to create a solution that restructures a file with column groups of different widths into a long format.
My data are here:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv")
df2 <- read.csv(text = x)
In the above case, I have groups of 3 columns, each of which needs to be stacked. I tried the following method, but my values get spread into the wrong columns:
longJazz<- df2 %>% gather(key,
value,
X1:X69)
The resulting dataframe should have 782 rows and 3 columns (title, year and artist).
In another case, I have groups of 5 columns, so I'd like a solution that can be easily adapted. For instance, a function that takes a dataframe and the number of columns per group as arguments would be handy.
We can remove the first column ('X'), rename all columns except the last one ('id') using a repeating sequence of 'Details', 'year', 'Description', and then use pivot_longer from tidyr to reshape into 'long' format.
library(stringr)
library(dplyr)
library(readr)
library(tidyr)
df2 <- df2[-1]
i1 <- as.integer(gl(ncol(df2)-1, 3, ncol(df2)-1))
names(df2)[1:69] <- str_c(c("Details", "year", "Description"), i1, sep="_")
df2 %>%
mutate_at(vars(starts_with('year')), ~ as.integer(as.character(.))) %>%
pivot_longer(cols = -id, names_sep="_", names_to = c(".value", "group")) %>%
select(-group)
# A tibble: 1,150 x 4
# id Details year Description
# <int> <fct> <int> <fct>
# 1 1 Sophisticated Lady / Tea For Two 1933 Art Tatum
# 2 1 The Genius Of Art Tatum, No. 21 1955 Art Tatum
# 3 1 The Tatum Group Masterpieces, Vol. 5 1964 Art Tatum / Lionel Hampton / Harry Edison / Buddy Rich / Red Callender / Barney Ke…
# 4 1 Live Sessions 1940 / 1941 1975 Art Tatum
# 5 1 20th Century Piano Genius 1986 Art Tatum
# 6 1 Jazz Masters (100 Ans De Jazz) 1998 Art Tatum
# 7 1 The Art Tatum - Ben Webster Quartet 2015 Art Tatum / Ben Webster
# 8 1 El Gran Tatum NA Art Tatum
# 9 1 Sweet Georgia Brown / Shiek Of Araby / Back O' Town Bl… 1945 Benny Goodman Quintet* / Esquire All Stars Featuring Louis Armstrong
#10 1 The Immortal Live Sessions 1944/1947 1975 Louis Armstrong
# … with 1,140 more rows
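The question also asked for a version that can be adapted to groups of 5 columns. The same rename-then-pivot idea can be wrapped in a small helper; this is a hedged sketch of my own (the function name stack_groups, the generic V1..Vwidth stems, and the assumptions that all non-id columns come in consecutive groups of equal width and that columns in the same position of each group share a compatible type are illustrative, not from the answer above):
library(dplyr)
library(tidyr)
stack_groups <- function(data, width, id = "id") {
  vals  <- setdiff(names(data), id)            # the columns that come in groups
  n_grp <- length(vals) / width                # assumes the groups divide evenly
  stems <- paste0("V", seq_len(width))         # generic stems V1..Vwidth
  names(data)[names(data) != id] <-
    paste(rep(stems, times = n_grp), rep(seq_len(n_grp), each = width), sep = "_")
  data %>%
    pivot_longer(cols = -all_of(id),
                 names_sep = "_",
                 names_to = c(".value", "group")) %>%
    select(-group)
}
# e.g. stack_groups(df2, width = 3) for the jazz file (columns come back as V1, V2, V3),
# or stack_groups(your_wide_df, width = 5) for a file with 5-column groups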

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate differences in values based on my monthly data? For example, I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sep, etc., for each well by year. Note that in some years some months will be missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(nrow(full_sites_wells_months_set)*data_availability)) # generate random groundwater values
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
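Mapped back onto the question's own columns (Well, year, month, value), the same pattern might look like the sketch below; it assumes month is a factor whose levels run January through December, so that rows sort into calendar order and lag(value, 6) pairs January with July, February with August, and so on:
library(dplyr)
library(tidyr)
df %>%
  complete(Well, year, month) %>%                    # make missing months explicit NAs
  arrange(Well, year, month) %>%                     # calendar order within each well-year
  group_by(Well, year) %>%
  mutate(diff_6_months = value - lag(value, 6)) %>%  # e.g. July minus January
  ungroup()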

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest/most recent year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need; it should output 100, not 150. Where am I going wrong here?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I suggest using the dplyr library:
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% #create the dataframe
tbl_df() %>% #convert it to dplyr format
group_by(Name, Year) %>% #group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulted dataframe by Name
top_n(1, Year) # keep only the row for the most recent year of each name
Here is another solution without the data.table package.
First, sort the data frame:
df <- df[order(-df$Year, -df$Balance),]
Then select the first row within each name group:
df[!duplicated(df$Name),]
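With the sample data frame above, those two steps should leave one row per name, taken from the most recent year:
#    Name Year Balance
# 1  John 2016     100
# 4 Stacy 2016      75
# 5   Kat 2006     150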

How to re-order data in R, and creating a new variable for the data?

I have been working with the CDC FluView dataset, retrieved by this code:
library(cdcfluview)
library(ggplot2)
usflu <- get_flu_data("national", "ilinet", years=1998:2015)
What I am trying to do is create a new week variable, call it "week_new", so that the WEEK variable from this dataset is reordered. I want to reorder it so that the first week corresponds to week number 30 of each year. For example, in 1998, instead of week 1 being the first week of that year, I would like week 30 to be the first week, with every subsequent year on the same scale. I am also trying to create another new variable called "season", which simply puts each week into its corresponding flu season, say "1998-1999" for week 30 of 1998 through 1999, and so on.
I believe this involves a for loop and conditional statements, but I am not familiar with how to use these in R. I am new to programming and am learning Java and R at the same time, and have only worked with loops in Java so far.
Here is what I have tried so far, I think it's supposed to be something like this:
wk_num <- 1
for(i in nrow(usflu)){
if(week == 31){
wk_num <- 1
wk_new[i] <- wk_num
wk_num <- wk_num+1
}
if(week < 53){
season[i] <- paste(Yr[i], '-', Yr[i] +1)
}
else{
}
Any help is greatly appreciated and hopefully what I am asking makes sense. I am hoping to understand re-ordering for the future as I believe it will be an important tool for me to have at my disposal for coding in R.
Here's one way to accomplish this with the packages dplyr and tidyr:
library(dplyr)
library(tidyr)
usflu_df <- tbl_df(usflu)
usflu_df %>%
complete(YEAR, WEEK) %>%
filter(!(YEAR == 1998 & WEEK < 30)) %>%
mutate(season = cumsum(WEEK == 30),
season_nm = paste(1997 + season, 1998 + season, sep = "-")) %>%
group_by(season) %>%
mutate(new_wk = seq_along(season)) %>%
select(YEAR, WEEK, new_wk, season, season_nm)
# YEAR WEEK new_wk season season_nm
# (int) (int) (int) (int) (chr)
# 1 1998 30 1 1 1998-1999
# 2 1998 31 2 1 1998-1999
# 3 1998 32 3 1 1998-1999
# 4 1998 33 4 1 1998-1999
# 5 1998 34 5 1 1998-1999
# 6 1998 35 6 1 1998-1999
# 7 1998 36 7 1 1998-1999
# 8 1998 37 8 1 1998-1999
# 9 1998 38 9 1 1998-1999
# 10 1998 39 10 1 1998-1999
Talking through this...
First, use tidyr::complete to turn implicit missing values into explicit missing values -- the original data pulled back did not have all of the weeks for 1998. Next, filter out the irrelevant 1998 records, that is, anything in 1998 before week 30, to make our lives easier. We then create two new variables, season and season_nm, via cumsum and a simple paste. The season counter simply increments every time it sees WEEK == 30 -- this is useful because some years have 53 weeks. We then group_by season so that we can seq_along season to create the new_wk variable.
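One last cosmetic step (my own addition): the question asked for the new columns to be named week_new and season (as a label like "1998-1999"). Assuming the pipeline's result is saved to an object, say usflu_seasons (a made-up name), a final select() maps the helper columns onto those names:
usflu_seasons %>%
  select(YEAR, WEEK, week_new = new_wk, season = season_nm)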
