Finding the top n represented entries in a grouped dataframe in R

I am a beginner in R and would be very thankful for a response as I am stuck on this code (this is my attempt at solving the problem but it does not work):
personal_spotify_df <- fromJSON("data/StreamingHistory0.json")
personal_spotify_df = personal_spotify_df %>%
mutate(minutesPlayed = msPlayed/1000/60)
personal_spotify_df_ranked <- personal_spotify_df %>%
group_by(artistName) %>%
filter(top_n(15, max(nrows())))
I have a dataframe (see below for a screenshot of how it's structured) which is my Spotify listening history. I want to group this dataframe by artist and then arrange the new dataframe to show the top 15 artists with the most songs listened to. I am stuck on how to get from grouping by artistName to actually filtering the dataframe down to the top 15 represented artists.
The dataframe

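We may use slice_max, with n specified as 15 and the ordering column created with add_count. Note that order_by takes the bare column name, not a string (a quoted "Count" is a constant, so every row would tie). Because this ranks rows rather than artists, ties mean it returns the rows of the most-played artist(s) until at least 15 rows are covered
library(dplyr)
personal_spotify_df %>%
add_count(artistName, name = "Count") %>%
slice_max(n = 15, order_by = Count) %>%
select(-Count)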
If we want to get only the top 15 distinct 'artistName' values with their counts,
personal_spotify_df %>%
count(artistName, name = "Count") %>%
slice_max(n = 15, order_by = Count)
Or an option with filter after arranging the rows based on the count
personal_spotify_df %>%
add_count(artistName) %>%
arrange(desc(n)) %>%
filter(artistName %in% head(unique(artistName), 15))
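As a quick check of the counting approach, here is a minimal sketch on made-up data. The toy data frame, its artist names, and n = 2 are assumptions for illustration, not the asker's data; semi_join is one possible way to get back the original rows of the top artists.

```r
library(dplyr)

# Hypothetical listening history: artist A has 3 plays, B has 2, C has 1
toy <- data.frame(
  artistName = c("A", "B", "A", "C", "B", "A"),
  msPlayed   = c(1000, 2000, 1500, 3000, 2500, 1200)
)

# One row per artist with its play count, keeping the 2 most frequent
top_artists <- toy %>%
  count(artistName, name = "Count") %>%
  slice_max(order_by = Count, n = 2)

# Keep only the original rows that belong to those artists
top_rows <- toy %>%
  semi_join(top_artists, by = "artistName")
```

Counting first and joining back keeps the ranking at the artist level, so ties between rows of the same artist do not affect how many artists are returned.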

In base R, you can make use of table, sort and head to get top 15 artists with their count
table(personal_spotify_df$artistName) |>
sort(decreasing = TRUE) |>
head(15) |>
stack()
The native pipe operator (|>) requires R 4.1 or later. On an older version, nest the calls instead:
stack(head(sort(table(personal_spotify_df$artistName), decreasing = TRUE), 15))

Related

How to repeat a vector of strings N times in a dataframe using dplyr

I am working with a list of dataframes and want to create a new column with the names of the variables. There are three variables and the length of the dataframe is 684, therefore I need the variable names to repeat 228 times. However, I can't get this to work.
Here is the snippet I am currently using:
empleo = lapply(lista.empleo, function(x){x = x %>%
read_excel(skip=4) %>%
head(23) %>%
drop_na() %>%
clean_names() %>%
pivot_longer(!1,
names_to = 'fecha',
values_to = 'valor') %>%
mutate(variable = rep(c('trabajadores',
'masa',
'salario'),
times = 228))})
So far, I have tried to use mutate, but I get the following error:
Error in `mutate()`:
! Problem while computing `variable = rep(c("trabajadores", "masa",
"salario"), times = 228)`.
x `variable` must be size 0 or 1, not 684.
I will add the structure of a sample df in the comments since it is too big.
Thanks in advance for any help!
The rep call may fail because the datasets in the list can have different numbers of rows, so a fixed times = 228 does not always match. Use length.out = n() so that it always returns exactly as many elements as there are rows
library(readxl)
library(tidyr)
library(dplyr)
library(janitor)
empleo <- lapply(lista.empleo, function(x){x = x %>%
read_excel(skip=4) %>%
head(23) %>%
drop_na() %>%
clean_names() %>%
pivot_longer(!1,
names_to = 'fecha',
values_to = 'valor') %>%
mutate(variable = rep(c('trabajadores',
'masa',
'salario'),
length.out = n()))})
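The difference between times and length.out can be seen on a small vector. This is a minimal sketch; the target lengths 6 and 7 are arbitrary values chosen for illustration.

```r
vars <- c("trabajadores", "masa", "salario")

# times repeats the whole vector a fixed number of times
fixed <- rep(vars, times = 2)        # always 3 * 2 = 6 elements

# length.out recycles the vector until the requested length is reached,
# cutting the last cycle short if the length is not a multiple of 3
fitted <- rep(vars, length.out = 7)  # exactly 7 elements
```

Inside mutate, length.out = n() ties the result to the current number of rows, which is why it works for every data frame in the list.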

How to use count for object of class "Character"

I have a data frame where one column named "City" contains more than 50 different cities, and if I plot a bar graph using City the plot becomes very difficult to read.
Is there any way to first use count() to count the occurrences of each city, then select the top 15 cities based on how many times they appear in the data, and after that plot a bar graph using ggplot()?
We can also do
library(dplyr)
res <- df %>%
group_by(City) %>%
summarise(n = n()) %>%
slice_max(n = 15, n) %>%
left_join(df, by = 'City')
To keep the rows for top 15 Cities you can do -
library(dplyr)
df %>%
count(City) %>%
slice_max(n = 15, n) %>%
left_join(df, by = 'City') -> res
res
Or in base R -
res <- subset(df, City %in% names(tail(sort(table(City)), 15)))
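In the base R version, table() returns counts named by city, so the city names have to be extracted with names() before matching with %in%. A minimal sketch on made-up data (df_toy and the top 2 cutoff are assumptions for illustration):

```r
# Hypothetical data: Paris appears 3 times, Rome 2, Oslo 1
df_toy <- data.frame(City = c("Paris", "Rome", "Paris", "Oslo", "Rome", "Paris"))

# sort() orders the counts ascending, so the most frequent cities are at the tail
city_counts <- sort(table(df_toy$City))

# names() gives the city labels; the table values themselves are the counts
top_cities <- names(tail(city_counts, 2))

# Keep only the rows belonging to the most frequent cities
res_toy <- subset(df_toy, City %in% top_cities)
```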

Finding closest matching number between 2 dataframes using a grouped dplyr identifier

I have 2 datasets, each with a `Patient ID` and a collection date measured from the same starting date ("fromstart"). In order to join these dataframes together, I'd like to match each sample in d1 to its closest neighbor in d2. How can this be done with a function in dplyr?
d1<-data.frame(`Patient ID`=c(rep("001",4),rep("002",5)),`fromstart`=c(-5,30,90,150,-10,15,45,100,250),check.names = F)
d2<-data.frame(`Patient ID`=c(rep("001",7),rep("002",4)),`fromstart`=c(-20,10,30,50,90,110,150,-10,15,45,100),check.names = F)
closest_date<-function(cases,d2) {
return(d2 %>% select(`Patient ID`,fromstart) %>% unique() %>% filter(`Patient ID`==cases$`Patient ID`) %>% rowwise() %>% mutate(date_match=as.numeric(cases$fromstart[which.min(abs(fromstart - cases$fromstart))])))
}
d1 %>% select(`Patient ID`,fromstart) %>% unique() %>% group_by(`Patient ID`) %>% rowwise() %>% mutate(closest=closest_date(.,d2))
If I understood your problem correctly you want to join by patient ID and then select those lines where the difference between fromstart is the smallest? If so this would be a solution
library(dplyr)
d1 %>%
dplyr::full_join(d2, by = c("Patient ID"), suffix = c("_1", "_2")) %>%
dplyr::mutate(DIF = abs(fromstart_1 - fromstart_2)) %>%
dplyr::group_by(`Patient ID`, fromstart_1) %>%
dplyr::filter(DIF == min(DIF))
As you can see, this does not work well if you want unique combinations, because there can be cases where two distances are the same. Then again, maybe I did not get your question right.
Instead of using the absolute value, you could also filter for positive differences if you want fromstart of the second table to be larger than in the first; this would reduce duplicate entries to a certain degree
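If exactly one match per d1 row is wanted, one possible variation is slice_min with with_ties = FALSE, which keeps a single row per group when distances tie. This is a sketch using the question's d1 and d2; which of two equally close dates is kept is arbitrary here, and newer dplyr versions may warn about the many-to-many join.

```r
library(dplyr)

d1 <- data.frame(`Patient ID` = c(rep("001", 4), rep("002", 5)),
                 fromstart = c(-5, 30, 90, 150, -10, 15, 45, 100, 250),
                 check.names = FALSE)
d2 <- data.frame(`Patient ID` = c(rep("001", 7), rep("002", 4)),
                 fromstart = c(-20, 10, 30, 50, 90, 110, 150, -10, 15, 45, 100),
                 check.names = FALSE)

closest <- d1 %>%
  # Pair every d1 sample with every d2 sample of the same patient
  full_join(d2, by = "Patient ID", suffix = c("_1", "_2")) %>%
  mutate(DIF = abs(fromstart_1 - fromstart_2)) %>%
  group_by(`Patient ID`, fromstart_1) %>%
  # Keep one row with the smallest distance per d1 sample, even when tied
  slice_min(DIF, n = 1, with_ties = FALSE) %>%
  ungroup()
```

For example, patient 001 at fromstart -5 is equally far from -20 and 10 in d2, so only one of the two is kept.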

Looking up multiple values in separate table, but only returning one unique row

I have two data frames that look like this:
Table1:
Gender<-c("M","F","M","M","F")
CPTCodes<-c("15777, 19328, 19342, 19366, 19370, 19371, 19380","15777, 19357","19367, 49568","15777, 19357","15777, 19357")
Df<-tibble(Gender,CPTCodes)
Table2:
Code<-c(19328,19342,15777,49568,12345)
Value<-c(0.5,7,9,35,2)
Df2<-tibble(Code,Value)
And had previously asked this question about how to summarize the "values" from table 2 into a column in table 1, depending on how many codes were in the "Code" column of table 1. Turns out it was a duplicate of another question, but either way, the solutions there worked great! It did exactly what I asked.
Problem was that I didn't realize, buried deep down in the thousands of rows of Table 2, were some duplicate codes. I.e. table 2 really looked like this:
Code<-c(19357,19342,15777,49568,12345,15777,19357)
Modifier<-c("","","","","","a","a")
Value<-c(0.5,7,9,35,2,3,45)
Df2<-tibble(Code,Modifier,Value)
So when I use the suggested code:
Df %>% mutate(id = row_number()) %>% separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>% left_join(Df2, by = c("CPTCodes" = "Code")) %>% group_by(id, Gender) %>% summarize(total = sum(Value, na.rm = TRUE))
It summarizes ALL of the codes it finds that match in Table2, and I really just want the rows that don't have anything in the "Modifier" column. Any ideas?
Lastly, the current code returns the summarized total in its own data frame, but it'd be cool if everything was still there from the original Table 1, and it just had an extra column with the new sum.
I'm not entirely sure of your expected output. But you should be able to filter and then join the new column to the original df.
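library(dplyr)
library(tidyr)
# Give each original row an id so the totals can be joined back unambiguously
Df <- Df %>% mutate(id = row_number())
totals <- Df %>%
separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
left_join(Df2, by = c("CPTCodes" = "Code")) %>%
filter(Modifier == "") %>% # keep only codes without a modifier
group_by(id) %>%
summarize(total = sum(Value, na.rm = TRUE))
# Attach the total as a new column; rows with no matching code get NA
Df %>% left_join(totals, by = "id")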

transform() to add rows with dplyr()

I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr() to group my data by site and purchase, and get the counts and percentages for the grouped data. I'd however also like the tibble to feature rows called ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble looking similar to dfgoal.
The problem's that my current code doesn't get me the ALLSITES rows. I've tried adding a base R function into dplyr(), which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),purchase=c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site=c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),purchase=c("a1","a2","a1","a2","a1","a1","a2"),bin=c(1,2,6,2,4,11,4),pin_per=c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66666))
Current code:
library(dplyr)
df %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
df %>%
rbind(df, transform(df, site = "ALLSITES") %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
We can start from the output of the first code block (assumed here to be stored as df1). Group by 'site', overwritten with the constant string 'ALLSITES', and by 'purchase', get the sum of 'bin', then compute 'bin_per', and finally row bind the two datasets with bind_rows
df1 %>%
ungroup() %>%
group_by(site = 'ALLSITES', purchase) %>%
summarise(bin = sum(bin)) %>%
ungroup %>%
mutate(bin_per = 100*(bin/sum(bin))) %>%
bind_rows(df1, .)
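Put together, the whole pipeline could look like the following sketch. It uses the question's df; the names df1, all_sites and result are assumptions for illustration, and count() is swapped in for the sum(purchase==purchase) idiom since it gives the same group sizes.

```r
library(dplyr)

df <- data.frame(
  site = c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),
  purchase = c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1")
)

# Per-site counts and within-site percentages
df1 <- df %>%
  count(site, purchase, name = "bin") %>%
  group_by(site) %>%
  mutate(bin_per = 100 * bin / sum(bin)) %>%
  ungroup()

# Totals across all sites, labelled ALLSITES
all_sites <- df1 %>%
  group_by(site = "ALLSITES", purchase) %>%
  summarise(bin = sum(bin), .groups = "drop") %>%
  mutate(bin_per = 100 * bin / sum(bin))

# Stack the per-site rows and the ALLSITES rows
result <- bind_rows(df1, all_sites)
```

The ALLSITES rows are built from the already summarised counts, so the original df only has to be aggregated once.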
