Sorting output of tally / count (dplyr) [duplicate] - r

This should be easy, but I can't find a straightforward way to achieve it. My dataset looks like the following:
DisplayName Nationality Gender Startyear
1 Alfred H. Barr, Jr. American Male 1929
2 Paul Cézanne French Male 1929
3 Paul Gauguin French Male 1929
4 Vincent van Gogh Dutch Male 1929
5 Georges-Pierre Seurat French Male 1929
6 Charles Burchfield American Male 1929
7 Charles Demuth American Male 1929
8 Preston Dickinson American Male 1929
9 Lyonel Feininger American Male 1929
10 George Overbury ("Pop") Hart American Male 1929
...
I want to group by DisplayName and Gender, and get the counts for each of the names (they are repeated several times in the list, with different year information).
The following 2 commands give me the same output, but they are not sorted by the count output "n". Any ideas on how to achieve this?
artists <- data %>%
  filter(!is.na(Gender) & Gender != "NULL") %>%
  group_by(DisplayName, Gender) %>%
  tally(sort = TRUE) %>%
  arrange(desc(n))

artists <- data %>%
  filter(!is.na(Gender) & Gender != "NULL") %>%
  count(DisplayName, Gender, sort = TRUE)
DisplayName Gender n
(chr) (chr) (int)
1 A. F. Sherman Male 1
2 A. G. Fronzoni Male 2
3 A. Lawrence Kocher Male 3
4 A. M. Cassandre Male 21
5 A. R. De Ycaza Female 1
6 A.R. Penck (Ralf Winkler) Male 20
7 Aaron Siskind Male 25
8 Abigail Perlmutter Female 1
9 Abraham Rattner Male 5
10 Abraham Walkowitz Male 17
.. ... ... ...

Your data is grouped by two variables, so after tally() the data frame is still grouped by DisplayName. arrange(desc(n)) is therefore sorting, but only within each DisplayName group rather than across the whole data frame. If you want to sort the whole data frame by column n, just ungroup before sorting. Try this:
artists <- data %>%
  filter(!is.na(Gender) & Gender != "NULL") %>%
  group_by(DisplayName, Gender) %>%
  tally(sort = TRUE) %>%
  ungroup() %>%
  arrange(desc(n))
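A minimal, self-contained sketch of the same idea on made-up data (the `toy` data frame and its values are hypothetical, not the asker's dataset). It counts rows per DisplayName and Gender, drops the remaining grouping, and only then sorts by n:

```r
library(dplyr)

# Hypothetical toy data standing in for the asker's `data`
toy <- data.frame(
  DisplayName = c("A", "B", "B", "C", "C", "C"),
  Gender      = c("Male", "Female", "Female", "Male", "Male", "Male"),
  stringsAsFactors = FALSE
)

counts <- toy %>%
  group_by(DisplayName, Gender) %>%
  tally() %>%          # one row per DisplayName/Gender pair, count in column n
  ungroup() %>%        # drop the grouping that tally() leaves behind
  arrange(desc(n))     # sort the whole data frame by n, largest first

counts
```

In recent dplyr versions arrange() ignores grouping unless `.by_group = TRUE` is set, so the ungroup() step may be redundant there, but it is harmless and makes the intent explicit.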

Related

How to fill in time series data into a data frame?

I am working with the following time series data:
Weeks <- c("1995-01", "1995-02", "1995-03", "1995-04", "1995-06", "1995-08", "1995-10", "1995-15", "1995-16", "1995-24", "1995-32")
Country <- c("United States")
Values <- sample(seq(1,500,1), length(Weeks), replace = T)
df <- data.frame(Weeks,Country, Values)
Weeks Country Values
1 1995-01 United States 193
2 1995-02 United States 183
3 1995-03 United States 402
4 1995-04 United States 75
5 1995-06 United States 402
6 1995-08 United States 436
7 1995-10 United States 97
8 1995-15 United States 445
9 1995-16 United States 336
10 1995-24 United States 31
11 1995-32 United States 413
It is structured according to the year and the week number in that year (column 1). Notice how some weeks are omitted (as a result of the aggregation function). For example, 1995-05 is not included. How can I add the omitted rows to the data, fill in the appropriate country name, and assign them a value of 0?
Thank you for your help!
Separate the year and week values into different columns. For each Country and Years, complete the missing weeks and set Values to 0. Finally, unite the year and week columns to get the data back in the same format as the original.
library(dplyr)
library(tidyr)
df %>%
  separate(Weeks, c('Years', 'Weeks'), sep = '-', convert = TRUE) %>%
  group_by(Country, Years) %>%
  complete(Weeks = min(Weeks):max(Weeks), fill = list(Values = 0)) %>%
  ungroup() %>%
  mutate(Weeks = sprintf('%02d', Weeks)) %>%
  unite(Weeks, Years, Weeks, sep = '-')
# Country Weeks Values
# <chr> <chr> <dbl>
# 1 United States 1995-01 354
# 2 United States 1995-02 395
# 3 United States 1995-03 408
# 4 United States 1995-04 143
# 5 United States 1995-05 0
# 6 United States 1995-06 481
# 7 United States 1995-07 0
# 8 United States 1995-08 49
# 9 United States 1995-09 0
#10 United States 1995-10 229
# … with 22 more rows
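For comparison, here is a base R sketch of the same gap-filling idea without tidyr. It is not from the answer above: it swaps complete() for a merge() against a full grid of weeks, and it assumes a single country and a single year. The `toy` data frame and its values are made up for illustration.

```r
# Hypothetical toy data: weeks 02 and 04 are missing
toy <- data.frame(
  Weeks   = c("1995-01", "1995-03", "1995-05"),
  Country = "United States",
  Values  = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Split "YYYY-WW" into integer year and week columns
parts <- do.call(rbind, strsplit(toy$Weeks, "-", fixed = TRUE))
toy$Year <- as.integer(parts[, 1])
toy$Week <- as.integer(parts[, 2])

# Full grid of weeks between the first and last observed week
# (assumes one country and one year, as in the toy data)
full <- data.frame(
  Year    = toy$Year[1],
  Week    = seq(min(toy$Week), max(toy$Week)),
  Country = toy$Country[1],
  stringsAsFactors = FALSE
)

# Left join: weeks absent from `toy` get NA in Values
filled <- merge(full, toy, by = c("Year", "Week", "Country"), all.x = TRUE)
filled <- filled[order(filled$Year, filled$Week), ]

# Replace the NAs with 0 and rebuild the original "YYYY-WW" label
filled$Values[is.na(filled$Values)] <- 0
filled$Weeks <- sprintf("%d-%02d", filled$Year, filled$Week)
filled <- filled[, c("Weeks", "Country", "Values")]

filled
```

With several countries or years, the grid would need one block of weeks per country and year, which is what group_by() plus complete() handles in the tidyr version.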

put the resulting values from for loop into a table in r [duplicate]

I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names:
teams <- c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
           "Royal Challengers Bangalore", "Kolkata Knight Riders", "Delhi Daredevils",
           "Kings XI Punjab", "Deccan Chargers", "Rajasthan Royals", "Chennai Super Kings",
           "Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
  print(j)
  ipl_table %>%
    filter(season == 2019 & (team1 == j | team2 == j)) %>%
    summarise(match_count = n()) -> kl
  print(kl)
  match_played <- data.frame(Teams = teams, Match_count = kl)
}
The match count for the last team (i.e. Gujarat Lions) is 0, and it is filling in 0s for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
Filter for the particular season, reshape the data into long format, and then count the number of matches per team.
library(dplyr)
matches %>%
  filter(season == 2019) %>%
  tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
  count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
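As for why the loop in the question fills every row with 0: match_played is rebuilt on every iteration from the count for the current team only, so after the loop it holds the last team's count repeated for all teams. A minimal base R sketch that keeps the per-team logic but collects one count per team; the `ipl_toy` data frame and the shortened `teams` vector are made up for illustration and are not the real IPL data:

```r
# Hypothetical toy data standing in for ipl_table
ipl_toy <- data.frame(
  season = c(2019, 2019, 2019, 2018),
  team1  = c("Mumbai Indians", "Mumbai Indians", "Delhi Capitals", "Delhi Capitals"),
  team2  = c("Delhi Capitals", "Chennai Super Kings", "Mumbai Indians", "Chennai Super Kings"),
  stringsAsFactors = FALSE
)
teams <- c("Mumbai Indians", "Delhi Capitals", "Chennai Super Kings")

# Keep only the season of interest
season_2019 <- ipl_toy[ipl_toy$season == 2019, ]

# One count per team: vapply() returns a vector with an entry for each team,
# instead of overwriting a single value on every pass of a loop
counts <- vapply(
  teams,
  function(tm) sum(season_2019$team1 == tm | season_2019$team2 == tm),
  integer(1)
)

# Build the table once, after all counts are known
match_played <- data.frame(Teams = teams, Match_count = unname(counts))
match_played
```

Building the data frame once, after all counts exist, is the key difference from the original loop.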

Combining & totalling rows in R

I have the below dataset, with the variables as follows:
member_id - an id number for each member
year - the year in question
gender - binary variable, 0 is male, 1 is female
party - the party of the member
Leadership - TRUE if the member holds a leadership position in government or opposition, FALSE if they don't
house_start - the date the member became an MP
Year.Entered - the year they became an MP
Years.in.parliament - how many years it has been since they were first elected
Edu - the number of times the MP has participated in debates related to education in the given year.
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat FALSE 09/06/1983 1983 14 3
6 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 9
As you can see with rows 5 and 6 in the dataset, the same member is recorded twice in the one year. This has happened throughout the dataset for some members because of the Leadership variable. For example, this member (id number 15) did not have a leadership position for the first part of 1997 but did get one later in the year. I want to combine these two rows and set the Leadership variable to TRUE in these cases. I also need to sum the Edu values for these rows, so for this member it would become 12 (because I want each member's number of times participated per year for this policy area). So I want it to look like:
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 12
I have been trying to change these manually in Excel, but I need to do this for several different policy areas, so it is taking a lot of time. Any help would be much appreciated!
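We can group by the identifying columns, sum Edu within each group, arrange so that rows with Leadership = TRUE come first, and then slice the first row of each group: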
library(dplyr)
df1 %>%
  group_by(member_id, year, gender, party) %>%
  mutate(Edu = sum(Edu)) %>%
  arrange(party, desc(Leadership)) %>%
  slice(1)
Alternatively, after summing Edu within each group, keep a row if it is the only row in its group or if its Leadership value is TRUE.
library(dplyr)
df %>%
  group_by(member_id, year, gender, party) %>%
  mutate(Edu = sum(Edu)) %>%
  filter(n() == 1 | Leadership)
As I understand it, the minimal repeating group is member_id and year. We can sum Edu defensively (using na.rm = TRUE) and then slice each group with which.max(Leadership): on a logical vector, which.max returns the position of the first TRUE, or 1 if every value is FALSE, so the Leadership row is kept whenever one exists.
library(dplyr)
df %>%
  group_by(member_id, year) %>%
  mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
  slice(which.max(Leadership)) %>%
  ungroup()
Alternatively we can use top_n function (which yields the same result):
df %>%
  group_by(member_id, year) %>%
  mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
  top_n(1, Leadership) %>%
  ungroup()
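Another way to express the same intent, sketched here as an alternative rather than taken from the answers above, is to collapse each group with summarise(): any() makes Leadership TRUE if any row in the group is TRUE, and sum() totals Edu. The `toy` data is a cut-down, hypothetical version of the question's table. Using first() for party assumes the remaining columns are constant within a member and year, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical cut-down version of the question's data
toy <- data.frame(
  member_id  = c(386, 37, 15, 15),
  year       = c(1997, 1997, 1997, 1997),
  party      = c("Conservative", "Labour", "Liberal Democrat", "Liberal Democrat"),
  Leadership = c(FALSE, FALSE, FALSE, TRUE),
  Edu        = c(7, 10, 3, 9),
  stringsAsFactors = FALSE
)

combined <- toy %>%
  group_by(member_id, year) %>%
  summarise(
    party      = first(party),            # assumed constant within the group
    Leadership = any(Leadership),         # TRUE if the member led at any point
    Edu        = sum(Edu, na.rm = TRUE),  # total participation for the year
    .groups    = "drop"                   # return an ungrouped result
  )

combined
```

Unlike the mutate-then-filter approaches, summarise() returns exactly one row per group, so every other column to keep has to be listed explicitly.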

How to restructure very wide dataframes with dplyr using an index? [closed]

I've read a number of posts on gather but I'm struggling to create a solution that would restructure a file with different widths into a long format.
My data are here:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv")
df2 <- read.csv(text = x)
In the above case, I have groups of 3 columns, each of which needs to be stacked up. I tried the following method but my values get spread into the wrong columns:
longJazz <- df2 %>% gather(key, value, X1:X69)
The resulting dataframe should have 782 rows and 3 columns (title, year and artist).
In another case, I have groups of 5 columns, so I'd like a solution that can be simply adapted. For instance, a function that takes as arguments a dataframe and the number of columns per group, would be handy.
We can remove the first column 'X', rename the remaining columns up to the last column 'id' with a repeating sequence of 'Details', 'year', 'Description', and then use pivot_longer from tidyr to reshape into 'long' format:
library(stringr)
library(dplyr)
library(readr)
library(tidyr)
df2 <- df2[-1]
i1 <- as.integer(gl(ncol(df2)-1, 3, ncol(df2)-1))
names(df2)[1:69] <- str_c(c("Details", "year", "Description"), i1, sep="_")
df2 %>%
  mutate_at(vars(starts_with('year')), ~ as.integer(as.character(.))) %>%
  pivot_longer(cols = -id, names_sep = "_", names_to = c(".value", "group")) %>%
  select(-group)
# A tibble: 1,150 x 4
# id Details year Description
# <int> <fct> <int> <fct>
# 1 1 Sophisticated Lady / Tea For Two 1933 Art Tatum
# 2 1 The Genius Of Art Tatum, No. 21 1955 Art Tatum
# 3 1 The Tatum Group Masterpieces, Vol. 5 1964 Art Tatum / Lionel Hampton / Harry Edison / Buddy Rich / Red Callender / Barney Ke…
# 4 1 Live Sessions 1940 / 1941 1975 Art Tatum
# 5 1 20th Century Piano Genius 1986 Art Tatum
# 6 1 Jazz Masters (100 Ans De Jazz) 1998 Art Tatum
# 7 1 The Art Tatum - Ben Webster Quartet 2015 Art Tatum / Ben Webster
# 8 1 El Gran Tatum NA Art Tatum
# 9 1 Sweet Georgia Brown / Shiek Of Araby / Back O' Town Bl… 1945 Benny Goodman Quintet* / Esquire All Stars Featuring Louis Armstrong
#10 1 The Immortal Live Sessions 1944/1947 1975 Louis Armstrong
# … with 1,140 more rows
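The question also asks for a function that takes a data frame and the number of columns per group. A minimal base R sketch of that idea follows. `stack_groups` is a hypothetical helper, not part of any package, and it assumes the data frame contains only the repeating columns, so any id or index column has to be removed first. The `wide` data frame is made up for illustration.

```r
# Hypothetical helper: stack consecutive groups of `k` columns on top of each other
stack_groups <- function(df, k, new_names) {
  # The columns must split evenly into groups of k, one new name per position
  stopifnot(ncol(df) %% k == 0, length(new_names) == k)
  starts <- seq(1, ncol(df), by = k)
  pieces <- lapply(starts, function(s) {
    piece <- df[, s:(s + k - 1), drop = FALSE]
    names(piece) <- new_names   # give every group the same column names
    piece
  })
  do.call(rbind, pieces)        # bind the groups row-wise
}

# Made-up wide data: two groups of three columns (title, year, artist)
wide <- data.frame(
  X1 = c("Title A", "Title B"), X2 = c(1933, 1955), X3 = c("Artist 1", "Artist 2"),
  X4 = c("Title C", "Title D"), X5 = c(1964, 1975), X6 = c("Artist 3", "Artist 4"),
  stringsAsFactors = FALSE
)

long <- stack_groups(wide, k = 3, new_names = c("title", "year", "artist"))
long
```

For groups of 5 columns, the same call would use `k = 5` and five names. Columns in the same position of each group need compatible types for rbind() to combine them cleanly.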

Count elements (rows) in group of a data table [duplicate]

I have a table (dt) which has several columns.
X__1 First Name Last Name Gender Country Age Date Id
1: 1 Dulce Abril Female United States 32 15/10/2017 1562
2: 2 Mara Hashimoto Female Great Britain 25 16/08/2016 1582
3: 3 Philip Gent Male France 36 21/05/2015 2587
4: 4 Kathleen Hanner Female United States 25 15/10/2017 3549
5: 5 Nereida Magwood Female United States 58 16/08/2016 2468
I want to count the number of rows that have Country = "France" and Age > 32.
I used the following command, which gives me the matching rows, but I need to count the number of rows in the result. What is the command to do it?
dt[Country == 'France' & Age > 32]
Use the function nrow():
nrow(dt[Country == 'France' & Age > 32])
nrow() is simplest, but if you want to do it using data.table syntax:
dt[Country == 'France' & Age > 32, (.N)]
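A small self-contained sketch comparing the two approaches on made-up data (`dt_toy` and its values are hypothetical). The third line, summing the logical condition directly, is an additional base R idiom and not one of the answers above:

```r
library(data.table)

# Hypothetical toy table with only the two columns used in the filter
dt_toy <- data.table(
  Country = c("United States", "Great Britain", "France", "France"),
  Age     = c(32, 25, 36, 30)
)

# Count matching rows with data.table's .N
n_dt <- dt_toy[Country == "France" & Age > 32, .N]

# Count matching rows by subsetting first and then calling nrow()
n_nrow <- nrow(dt_toy[Country == "France" & Age > 32])

# Count matching rows by summing the logical vector (TRUE counts as 1)
n_sum <- sum(dt_toy$Country == "France" & dt_toy$Age > 32)

c(n_dt, n_nrow, n_sum)
```

All three give the same count here. The `.N` form avoids materialising the filtered table first.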
