How to restructure very wide dataframes with dplyr using an index? [closed] - r

I've read a number of posts on gather but I'm struggling to create a solution that would restructure a file with different widths into a long format.
My data are here:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv")
df2 <- read.csv(text = x)
In the above case, I have groups of 3 columns, each of which needs to be stacked. I tried the following method, but my values end up in the wrong columns:
longJazz <- df2 %>% gather(key, value, X1:X69)
The resulting dataframe should have 782 rows and 3 columns (title, year and artist).
In another case, I have groups of 5 columns, so I'd like a solution that can be simply adapted. For instance, a function that takes as arguments a dataframe and the number of columns per group, would be handy.

We can remove the first column 'X', rename every column except the last one ('id') with the repeating sequence 'Details', 'year', 'Description' plus a group number, and then use pivot_longer from tidyr to reshape into 'long' format:
library(stringr)
library(dplyr)
library(readr)
library(tidyr)
df2 <- df2[-1]
i1 <- as.integer(gl(ncol(df2)-1, 3, ncol(df2)-1))
names(df2)[1:69] <- str_c(c("Details", "year", "Description"), i1, sep="_")
df2 %>%
mutate_at(vars(starts_with('year')), ~ as.integer(as.character(.))) %>%
pivot_longer(cols = -id, names_sep="_", names_to = c(".value", "group")) %>%
select(-group)
# A tibble: 1,150 x 4
# id Details year Description
# <int> <fct> <int> <fct>
# 1 1 Sophisticated Lady / Tea For Two 1933 Art Tatum
# 2 1 The Genius Of Art Tatum, No. 21 1955 Art Tatum
# 3 1 The Tatum Group Masterpieces, Vol. 5 1964 Art Tatum / Lionel Hampton / Harry Edison / Buddy Rich / Red Callender / Barney Ke…
# 4 1 Live Sessions 1940 / 1941 1975 Art Tatum
# 5 1 20th Century Piano Genius 1986 Art Tatum
# 6 1 Jazz Masters (100 Ans De Jazz) 1998 Art Tatum
# 7 1 The Art Tatum - Ben Webster Quartet 2015 Art Tatum / Ben Webster
# 8 1 El Gran Tatum NA Art Tatum
# 9 1 Sweet Georgia Brown / Shiek Of Araby / Back O' Town Bl… 1945 Benny Goodman Quintet* / Esquire All Stars Featuring Louis Armstrong
#10 1 The Immortal Live Sessions 1944/1947 1975 Louis Armstrong
# … with 1,140 more rows
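The question also asks for a reusable function that takes a data frame and the number of columns per group. Below is a minimal sketch of one way to generalize the rename-then-pivot_longer approach above. The helper name stack_column_groups and its arguments are hypothetical, and the toy data is made up. It assumes the grouped columns are contiguous, appear in the same order within every group, and have compatible types across groups.

```r
library(tidyr)

# Hypothetical helper (not part of tidyr): stack consecutive groups of `k`
# columns into long format. `new_names` gives the k output column names;
# `id_cols` names any columns that should be repeated for every group.
stack_column_groups <- function(df, k, new_names, id_cols = character()) {
  wide_cols <- setdiff(names(df), id_cols)
  stopifnot(length(new_names) == k, length(wide_cols) %% k == 0)
  n_groups <- length(wide_cols) / k

  # Rename e.g. X1, X2, X3, X4, ... to title_1, year_1, artist_1, title_2, ...
  tagged <- paste(rep(new_names, times = n_groups),
                  rep(seq_len(n_groups), each = k), sep = "_")
  names(df)[match(wide_cols, names(df))] <- tagged

  # ".value" tells pivot_longer to use the name prefix as the output column
  long <- pivot_longer(df, cols = all_of(tagged),
                       names_to = c(".value", "group"),
                       names_pattern = "^(.*)_([0-9]+)$")
  long$group <- NULL
  long
}

# Toy data, invented for illustration: two groups of three columns plus an id
wide <- data.frame(
  id = 1:2,
  X1 = c("a", "b"), X2 = c(1990L, 2000L), X3 = c("p", "q"),
  X4 = c("c", "d"), X5 = c(1991L, 2001L), X6 = c("r", "s"),
  stringsAsFactors = FALSE
)
long <- stack_column_groups(wide, k = 3,
                            new_names = c("title", "year", "artist"),
                            id_cols = "id")
```

For groups of five columns, pass k = 5 and five names. If some groups are empty for a given row, the result will contain all-NA rows that may need to be filtered out afterwards.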

Related

put the resulting values from for loop into a table in r [duplicate]

I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names
teams<-c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
"Royal Challengers Bangalore","Kolkata Knight Riders","Delhi Daredevils",
"Kings XI Punjab", "Deccan Chargers","Rajasthan Royals", "Chennai Super Kings",
"Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
print(j)
ipl_table %>%
filter(season==2019 & (team1==j | team2 ==j)) %>%
summarise(match_count=n())->kl
print(kl)
match_played<-data.frame(Teams=teams,Match_count=kl)
}
The match count for the last team (i.e. Gujarat Lions) is 0, and it is filling in 0s for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
Filter for the particular season, get the data in long format, and then count the number of matches.
library(dplyr)
matches %>%
filter(season == 2019) %>%
tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
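As for why the loop in the question fills every row with 0: match_played is rebuilt on each iteration from the full teams vector and the single most recent count kl, so after the loop it only reflects the last team. The last entry, " Gujarat Lions", also has a leading space, so it would not match a team name stored without one. Here is a minimal base R sketch that computes one count per team and builds the table once. The toy ipl_table values are invented for illustration.

```r
# Toy stand-in for the question's ipl_table (values are made up)
ipl_table <- data.frame(
  season = c(2019, 2019, 2019, 2018),
  team1  = c("Mumbai Indians", "Chennai Super Kings", "Mumbai Indians", "Delhi Capitals"),
  team2  = c("Chennai Super Kings", "Delhi Capitals", "Delhi Capitals", "Mumbai Indians"),
  stringsAsFactors = FALSE
)
teams <- c("Mumbai Indians", "Chennai Super Kings", "Delhi Capitals", "Gujarat Lions")

# One count per team: matches in 2019 where the team is either team1 or team2
match_count <- vapply(teams, function(team) {
  sum(ipl_table$season == 2019 &
        (ipl_table$team1 == team | ipl_table$team2 == team))
}, integer(1))

# Build the result table once, after all counts are known
match_played <- data.frame(Teams = teams, Match_count = unname(match_count),
                           stringsAsFactors = FALSE)
```

The key design point is that the counts are collected into a vector of the same length as teams before the data frame is created, instead of overwriting the data frame inside the loop.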

Is there a way to animate a word cloud in R?

Currently, I am using the library("wordcloud") to make a word cloud of frequent terms in some text data that I have. The text data also comes with an associated year, and I want to be able to generate new word clouds based on the year, and I want it to be automatically animated using a library like gganimate. Is there any way to do this? I want to visualize the most frequent keywords over time, but I am struggling. Any tips?
Yes, with the help of the ggwordcloud package. I'll use the babynames dataset as an interesting example to see how the 5 most common baby names have changed over 100 years. First, load the required packages and load the data.
library(babynames) # Data
library(dplyr) # Data management
library(ggplot2) # Graph framework
library(ggwordcloud) # Wordcloud using ggplot
library(gganimate) # Animation
data(babynames)
The next command finds the top 5 names for each sex in 1915 and 2015, grouped by year.
babies <- babynames %>%
  filter(year %in% c(1915, 2015)) %>%
  group_by(name, sex, year) %>%
  summarise(n=sum(n)) %>%
  arrange(desc(n)) %>%
  group_by(year, sex) %>%
  top_n(n=5) %>%
  # Output at this point, before the final two steps:
  # A tibble: 20 x 4
  # Groups: sex, year [4]
  #    name      sex    year     n
  #    <chr>     <chr> <dbl> <int>
  #  1 Mary      F      1915 58187
  #  2 John      M      1915 47577
  #  3 William   M      1915 38564
  #  4 James     M      1915 33776
  #  5 Helen     F      1915 30866
  #  6 Robert    M      1915 28738
  #  7 Dorothy   F      1915 25154
  #  8 Margaret  F      1915 23054
  #  9 Joseph    M      1915 23052
  # 10 Ruth      F      1915 21878
  # 11 Emma      F      2015 20435
  # 12 Olivia    F      2015 19669
  # 13 Noah      M      2015 19613
  # 14 Liam      M      2015 18355
  # 15 Sophia    F      2015 17402
  # 16 Mason     M      2015 16610
  # 17 Ava       F      2015 16361
  # 18 Jacob     M      2015 15938
  # 19 William   M      2015 15889
  # 20 Isabella  F      2015 15594
  ungroup() %>%
  select(name, sex)
I halted the pipeline before the last two steps just to show which names were returned. The final steps drop the year and frequency, because I want to merge these names with the original data to get the frequency for every fifth year between 1915 and 2015 (not every year, because that takes too long to plot).
Here's the join.
babyyears <- babynames %>%
inner_join(babies, by=c("name","sex")) %>%
filter(year>=1915 & year %% 5 == 0) %>% # Keep all years if you like
mutate(year=as.integer(year)) # For animation. Not sure why this is required.
So that's just setting up the data for the plot. If we just wanted a static wordcloud, we'd aggregate on the year. But we keep the years for the animation.
For plotting, we use ggplot with the geom_text_wordcloud function.
gg <- babyyears %>%
ggplot(aes(label = name, size=n)) +
geom_text_wordcloud() +
theme_classic()
Then transition through the years.
gg2 <- gg + transition_time(year) +
labs(title = 'Year: {frame_time}')
I like to add a pause at the end, otherwise the animation rolls around to the start immediately after finishing.
animate(gg2, end_pause=30)
anim_save("gg_anim_wc.gif")
It's hard to keep track of all the names (especially the boys) with them all being placed in random locations. Maybe slowing it down will help. But the name that stands out the most from this graphic is "Mary", which was the most common name in 1915 but then slowly started to lose popularity towards the latter half of the century.

Grouping within group in R, plyr/dplyr

I'm working on the baseball data set:
data(baseball, package="plyr")
library(dplyr)
baseball[,1:4] %>% head
id year stint team
4 ansonca01 1871 1 RC1
44 forceda01 1871 1 WS3
68 mathebo01 1871 1 FW1
99 startjo01 1871 1 NY2
102 suttoez01 1871 1 CL1
106 whitede01 1871 1 CL1
First I want to group the data set by team in order to find the first year each team appears, and the number of distinct players that have ever played for each team:
baseball[,1:4] %>% group_by(team) %>%
summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))
# A tibble: 132 × 3
team first_year num_distinct_players
<chr> <int> <int>
1 ALT 1884 1
2 ANA 1997 29
3 ARI 1998 43
4 ATL 1966 133
5 BAL 1954 158
Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?
Perhaps this helps
baseball %>%
select(1:4) %>%
group_by(id, team) %>%
dplyr::mutate(nyear = n_distinct(year)) %>%
group_by(team) %>%
dplyr::summarise(first_year = min(year),
num_distinct_players = n_distinct(id),
maxYear = max(nyear))
I tried doing this with base R and came up with this. It's fairly slow.
df = data.frame(t(sapply(split(baseball, baseball$team), function(x)
cbind( min(x$year),
length(unique(x$id)),
max(sapply(split(x,x$id), function(y)
nrow(y))),
names(which.max(sapply(split(x,x$id), function(y)
nrow(y)))) ))))
colnames(df) = c("Year", "Unique Players", "Longest played duration",
"Longest Playing Player")
1. First, split by team into different groups.
2. For each group, obtain the minimum year as the first year when the team appears.
3. Get the length of the unique ids, which is the number of players in that team.
4. Split each group into subgroups by id and obtain the maximum number of rows, which gives the maximum duration played by a player in that team.
5. For each subgroup, get the name of the id with the maximum number of rows, which gives the player who played for the longest time in that team.
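A dplyr variant of these steps, sketched on made-up toy data rather than the baseball set: summarise once per team and player, then once per team. The column names first_seen, max_years and longest_player are illustrative, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy data invented for illustration (same column names as baseball)
toy <- data.frame(
  id   = c("a", "a", "a", "b", "c", "c"),
  year = c(2000, 2001, 2002, 2001, 2001, 2002),
  team = c("X", "X", "Y", "X", "Y", "Y"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  # Stage 1: one row per (team, player) with that player's years on the team
  group_by(team, id) %>%
  summarise(nyear = n_distinct(year),
            first_seen = min(year),
            .groups = "drop") %>%
  # Stage 2: collapse the per-player rows to one row per team
  group_by(team) %>%
  summarise(first_year = min(first_seen),
            num_distinct_players = n(),
            max_years = max(nyear),
            longest_player = id[which.max(nyear)])
```

Note that which.max() returns only the first player when several are tied for the longest tenure.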

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest (most recent) year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need; it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I would suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% # create the dataframe
as_tibble() %>% # convert it to a tibble
group_by(Name, Year) %>% # group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulting dataframe by Name
filter(Year == max(Year)) # keep only the most recent year for each name
Here is another solution without the data.table package.
First, sort the data frame by descending Year and Balance:
df <- df[order(-df$Year, -df$Balance),]
Then select the first row in each group with the same name:
df[!duplicated(df$Name),]
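On the original question of what goes wrong: aggregate() applies FUN to each column on the left-hand side separately, so max(Year) and max(Balance) are computed independently and can come from different rows. A minimal sketch that stays with aggregate, assuming the same df as in the question, first keeps only each name's most recent year and then takes the maximum balance. The names latest and res are illustrative.

```r
df <- data.frame(Name = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
                 Year = c(2016, 2015, 2014, 2016, 2006, 2006),
                 Balance = c(100, 150, 65, 75, 150, 10))

# ave() returns, for every row, the maximum Year within that row's Name
latest <- df[df$Year == ave(df$Year, df$Name, FUN = max), ]

# With only the latest year left per name, max(Balance) is the wanted value
res <- aggregate(Balance ~ Name + Year, data = latest, FUN = max)
```

Filtering before aggregating keeps the two conditions (latest year, then highest balance) in the intended order instead of evaluating them column by column.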

Sorting output of tally / count (dplyr) [duplicate]

This should be easy, but I can't find a straightforward way to achieve it. My dataset looks like the following:
DisplayName Nationality Gender Startyear
1 Alfred H. Barr, Jr. American Male 1929
2 Paul C\216zanne French Male 1929
3 Paul Gauguin French Male 1929
4 Vincent van Gogh Dutch Male 1929
5 Georges-Pierre Seurat French Male 1929
6 Charles Burchfield American Male 1929
7 Charles Demuth American Male 1929
8 Preston Dickinson American Male 1929
9 Lyonel Feininger American Male 1929
10 George Overbury ("Pop") Hart American Male 1929
...
I want to group by DisplayName and Gender, and get the counts for each of the names (they are repeated several times on the list, with different year information).
The following 2 commands give me the same output, but they are not sorted by the count output "n". Any ideas on how to achieve this?
artists <- data %>%
filter(!is.na(Gender) & Gender != "NULL") %>%
group_by(DisplayName, Gender) %>%
tally(sort = T) %>%
arrange(desc(n))
artists <- data %>%
filter(!is.na(Gender) & Gender != "NULL") %>%
count(DisplayName, Gender, sort = T)
DisplayName Gender n
(chr) (chr) (int)
1 A. F. Sherman Male 1
2 A. G. Fronzoni Male 2
3 A. Lawrence Kocher Male 3
4 A. M. Cassandre Male 21
5 A. R. De Ycaza Female 1
6 A.R. Penck (Ralf Winkler) Male 20
7 Aaron Siskind Male 25
8 Abigail Perlmutter Female 1
9 Abraham Rattner Male 5
10 Abraham Walkowitz Male 17
.. ... ... ...
Your data is grouped by two variables, so after tally your dataframe is still grouped by DisplayName. As a result, arrange(desc(n)) sorts within each DisplayName group rather than across the whole dataframe. If you want to sort the whole dataframe by column n, just ungroup before sorting. Try this:
artists <- data %>%
filter(!is.na(Gender) & Gender != "NULL") %>%
group_by(DisplayName, Gender) %>%
tally(sort = T) %>%
ungroup() %>%
arrange(desc(n))
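A small self-contained check of the ungroup-then-arrange pattern, using made-up toy data in place of the artists dataset. As far as I know, more recent dplyr releases make arrange() ignore grouping unless .by_group = TRUE is passed, so the original code may already sort as expected there; the explicit ungroup() is harmless either way.

```r
library(dplyr)

# Toy data invented for illustration
toy <- data.frame(
  DisplayName = c("A", "A", "A", "B", "C", "C"),
  Gender      = c("Male", "Male", "Male", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

artists <- toy %>%
  group_by(DisplayName, Gender) %>%
  tally() %>%            # result is still grouped by DisplayName
  ungroup() %>%          # drop the remaining grouping before sorting
  arrange(desc(n))       # sort the whole data frame by count

# count() does the grouping, counting and sorting in one call
artists2 <- toy %>% count(DisplayName, Gender, sort = TRUE)
```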
