How to use count for object of class "Character" - r

I have a data frame with a column named "City" that contains more than 50 different cities, so a bar graph of that column is very difficult to read.
Is there a way to first use count() to count how often each city appears, keep the 15 most frequent cities, and then plot a bar graph of them with ggplot()?

We can also do
library(dplyr)
res <- df %>%
group_by(City) %>%
summarise(n = n()) %>%
slice_max(n = 15, n) %>%
left_join(df, by = 'City')

To keep the rows for top 15 Cities you can do -
library(dplyr)
df %>%
count(City) %>%
slice_max(n = 15, n) %>%
left_join(df, by = 'City') -> res
res
Or in base R -
# compare City against the names of the 15 largest counts, not the counts themselves
res <- subset(df, City %in% names(tail(sort(table(City)), 15)))
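The answers above stop at the filtered data. For the plotting step the question asks about, a minimal sketch could look like the following. The toy df is made up for illustration, and with_ties = FALSE is an assumption on my part so that exactly 15 bars are drawn:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: 50 city labels with uneven frequencies
set.seed(1)
df <- data.frame(
  City = sample(paste0("City", 1:50), 2000, replace = TRUE, prob = 50:1)
)

# One row per city with its count, keeping the 15 most frequent
top_cities <- df %>%
  count(City) %>%
  slice_max(order_by = n, n = 15, with_ties = FALSE)

# Bars ordered by count; coord_flip() keeps long city names readable
p <- ggplot(top_cities, aes(x = reorder(City, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "City", y = "Number of rows")
p
```

Plotting the counted table with geom_col() avoids recounting inside ggplot; geom_bar() on the filtered rows would be an alternative.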

Related

count the top ten occurrences in a table in a decreasing order

here is a table
top10_bands <- table(bands1995$origin)
top10_bands
Where origin is the country and bands1995 is the original dataframe. How do I write code to get just the top ten countries with the most occurrences, in decreasing order?
Using dplyr package...
library(dplyr)
top10_bands <- bands1995 %>%
group_by(origin) %>%
summarise(n_bands = n()) %>%
slice_max(order_by = n_bands, n = 10)
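Since the question already starts from table(), a base R sketch may also be useful. The bands1995 data frame below is a made-up stand-in; only its origin column matters:

```r
# Hypothetical stand-in for bands1995
set.seed(42)
bands1995 <- data.frame(
  origin = sample(paste0("Country", 1:25), 500, replace = TRUE)
)

# Count each origin, sort the counts in decreasing order, keep the first ten
top10_bands <- head(sort(table(bands1995$origin), decreasing = TRUE), 10)
top10_bands
```

Unlike slice_max(), head() cuts at exactly ten entries, so countries tied with the tenth one are dropped.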

Finding the first row after which x rows meet some criterion in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
datetime=seq(as.POSIXct("2020-12-26 00:00:00"), as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- ymd(dat$datetime[1]) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
drop_na(land) %>%
mutate(all.land = cumsum(land))
#flag days that have any land points
flag <- land.pts %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
slice(n()) %>%
mutate(flag = if_else(all.land == 0,0,1))
# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x = comb %>%
group_by(id) %>%
arrange(id, datetime) %>%
mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0 | flag==0,
0,
difftime(datetime, lag(datetime), units="days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp, then summarise the data so that there is one row per id and date indicating whether the animal was on land at least once that day.
Use zoo's rollapply function with a left-aligned window of width 3 to mark a day as TRUE when the animal was on land on that day and on the following two days.
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))
dat %>%
group_by(id, date) %>%
summarise(on_land = any(land == 1)) %>%
mutate(consec_three = rollapply(on_land, 3,all, align = 'left', fill = NA)) %>%
ungroup %>%
#If you want all the rows of the data
left_join(dat, by = c('id', 'date'))
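To see what the left-aligned window does, here is a small sketch on a made-up logical vector (one value per day for a single animal; the values are purely illustrative):

```r
library(zoo)

# Hypothetical daily flags for one animal: was it on land at least once?
on_land <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)

# Each position looks at itself and the next two days;
# the last two positions have no full window and are filled with NA
consec_three <- rollapply(on_land, 3, all, align = "left", fill = NA)

# Index of the first day that starts a run of three land days
first_day <- which(consec_three)[1]
first_day
```

Here days 4 to 6 are the only run of three, so the first flagged day is day 4. In the grouped pipeline the same logic runs separately for each id.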

Finding the top n represented entries in a grouped dataframe in R

I am a beginner in R and would be very thankful for a response as I am stuck on this code (this is my attempt at solving the problem but it does not work):
personal_spotify_df <- fromJSON("data/StreamingHistory0.json")
personal_spotify_df = personal_spotify_df %>%
mutate(minutesPlayed = msPlayed/1000/60)
personal_spotify_df_ranked <- personal_spotify_df %>%
group_by(artistName) %>%
filter(top_n(15, max(nrows())))
I have a dataframe containing my Spotify listening history. I want to group this dataframe by artist and then arrange the result to show the top 15 artists with the most songs listened to. I am stuck on how to get from grouping by artistName to actually filtering out the top 15 represented artists from the dataframe.
We may use slice_max, with n specified as 15 and the order column created with add_count
library(dplyr)
personal_spotify_df %>%
  add_count(artistName, name = "Count") %>%
  # order_by takes the bare column name; a quoted string is treated as a constant
  slice_max(n = 15, order_by = Count) %>%
  select(-Count)
If we want to get only the top 15 distinct 'artistName',
personal_spotify_df %>%
  count(artistName, name = "Count") %>%
  slice_max(n = 15, order_by = Count)
Or an option with filter after arranging the rows based on the count
personal_spotify_df %>%
add_count(artistName) %>%
arrange(desc(n)) %>%
filter(artistName %in% head(unique(artistName), 15))
In base R, you can make use of table, sort and head to get top 15 artists with their count
table(personal_spotify_df$artistName) |>
sort(decreasing = TRUE) |>
head(15) |>
stack()
The native pipe operator (|>) requires R 4.1 or later. With an older version, use -
stack(head(sort(table(personal_spotify_df$artistName), decreasing = TRUE), 15))
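If the goal is to keep every row that belongs to the 15 most-played artists, one possible sketch is to compute the top artists first and then filter the original rows with semi_join. The data below is a made-up stand-in for the streaming history, and with_ties = FALSE is an assumption so that exactly 15 artists are kept:

```r
library(dplyr)

# Hypothetical stand-in for the Spotify streaming history
set.seed(7)
personal_spotify_df <- data.frame(
  artistName = sample(paste0("Artist", 1:40), 1000, replace = TRUE),
  msPlayed   = sample(30000:300000, 1000, replace = TRUE)
)

# One row per artist with the number of plays, top 15 only
top_artists <- personal_spotify_df %>%
  count(artistName, name = "Count") %>%
  slice_max(order_by = Count, n = 15, with_ties = FALSE)

# All original rows whose artist is among the top 15
top_rows <- personal_spotify_df %>%
  semi_join(top_artists, by = "artistName")
```

semi_join only filters; it adds no columns, so top_rows has the same structure as the input.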

Add multiple selects in one dataset

I have the dataset below and in it I consolidate the categories Mk_Cap, Exports and Money_Supply, but each of these grids has a different Unit.
df <- data.frame(Mes=c("Jan","Fev","Mar","Abr","Mai",
"Jan","Fev","Mar","Abr","Mai",
"Jan","Fev","Mar","Abr","Mai"),
Ano=c(2005,2006,2007,2008,2009,
2005,2006,2007,2008,2009,
2005,2006,2007,2008,2009),
Mk_Cap=c(11:15,116:120,1111:1115),
Exports=c(21:25,146:150,1351:1355),
Money_Supply=c(31:35,546:550,2111:2115),
Unit=c("USD","USD","USD","USD","USD","200=10",
"200=10","200=10","200=10","200=10",
"CNY","CNY","CNY","CNY","CNY"))
Today I am consolidating as follows:
library(dplyr)
Money_Supply <- df %>% dplyr::select(Ano, Mes,Money_Supply) %>% dplyr::filter(df$Unit == "USD")
Mk_Cap <- df %>% dplyr::select(Mk_Cap) %>% dplyr::filter(df$Unit == "200=10")
Exports <- df %>% dplyr::select(Exports) %>% dplyr::filter(df$Unit == "CNY")
Consolidado <- base::cbind(Money_Supply,Mk_Cap,Exports)
I doubt this is the best way to do it, but it is the only one I have found. This example has only a few variables; in practice I do this for more than 30 variables, which is extremely tedious, so an easier approach would be ideal.
A solution with dplyr:
There is a pattern in the dataframe: each year has three rows.
After arranging by year, each of the three columns of interest (Money_Supply, Mk_Cap, Exports) has its wanted value in the first, second or third row of a year, respectively.
First reorder the columns, then arrange by year, then lead the columns of interest. Finally, group the rows in threes and keep the first row of each group (id == 1).
df1 <- df %>%
select(Ano, Mes, Money_Supply, Mk_Cap, Exports) %>%
arrange(Ano) %>%
mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) %>%
mutate(Exports = lead(Exports, 2, order_by = Ano)) %>%
mutate(group = rep(row_number(), each=3, length.out = n())) %>%
group_by(group) %>%
mutate(id = row_number()) %>%
filter(id ==1) %>%
ungroup() %>%
select(-group, -id)
Edit: To clarify my point and the simplicity of the pattern in the data:
# slightly simplified code
df1 <- df %>%
arrange(Ano) %>%
mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) %>%
mutate(Exports = lead(Exports, 2, order_by = Ano)) %>%
group_by(Ano) %>%
mutate(id = row_number()) %>%
filter(id ==1) %>%
ungroup() %>%
select(Ano, Mes, Money_Supply, Mk_Cap, Exports, -id, -Unit)
If you consider your dataframe like Fig1 with arrange(Ano):
You have 5 Ano (orange): 2005-2009
In each Ano you have 1 Mes(purple): In 2005 = Jan, 2006 = Fev, 2007 = Mar, 2008 = Abr, 2009 = Mai
In each Ano and Mes you have 3 Unit (blue): In 2005 & Jan = USD, 200=10, CNY ; In 2006 & Fev = USD, 200=10, CNY ; etc...
In your desired output you wish to condense the 3 rows of one Ano (with 3 different Unit values) into 1 row with Ano, Mes and the corresponding values of Money_Supply, Mk_Cap, Exports.
This can be achieved by lead function (see Fig.1):
In Money_Supply: no code necessary, the value is already in the first row (color green)
In Mk_Cap: mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) yellow arrow
In Exports: mutate(Exports = lead(Exports, 2, order_by = Ano)) red arrow
group_by(Ano) Group by Ano
mutate(id = row_number()) Assign unique id within each group
filter(id ==1) Filter the 1 row in each group
Finally tweak the order of columns and remove unnecessary columns.
select(Ano, Mes, Money_Supply, Mk_Cap, Exports, -id, -Unit)
I think a simple way would be filtering your dataset by the Unit column before doing any other operations. Store the variations in a list by performing:
unit_variations <- lapply(unique(df$Unit), function(x) {
return(df[df$Unit == x, ])
})
names(unit_variations) <- unique(df$Unit)
Then, to make your Consolidado dataframe, select which variables you want from which unit variations. Say:
vars <- c("Money_Supply", "Mk_Cap", "Exports")
unit <- c("USD", "200=10", "CNY")
Consolidado <- mapply(
FUN = function(var, unit) {
return(unit_variations[[unit]][[var]])
},
vars,
unit
)
I used a list because, from what you described, I cannot assume that the number of rows for each type of Unit will always be the same, so a list allows for more flexibility. I also did not include month and year, for the same reason.
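When Ano and Mes should stay in the result, another option is to cut one small table per variable and merge them on the keys. This is only a sketch: unit_for is a hypothetical lookup built from the variable-to-Unit pairing in the question, and it assumes each Unit has at most one row per Ano and Mes:

```r
df <- data.frame(
  Mes = rep(c("Jan", "Fev", "Mar", "Abr", "Mai"), 3),
  Ano = rep(2005:2009, 3),
  Mk_Cap = c(11:15, 116:120, 1111:1115),
  Exports = c(21:25, 146:150, 1351:1355),
  Money_Supply = c(31:35, 546:550, 2111:2115),
  Unit = rep(c("USD", "200=10", "CNY"), each = 5)
)

# Hypothetical lookup: which Unit each variable should be read from
unit_for <- c(Money_Supply = "USD", Mk_Cap = "200=10", Exports = "CNY")

# One small table per variable: the key columns plus that variable,
# restricted to the rows of its Unit
pieces <- lapply(names(unit_for), function(v) {
  df[df$Unit == unit_for[[v]], c("Ano", "Mes", v)]
})

# Join all pieces on the keys and order by year
Consolidado <- Reduce(function(a, b) merge(a, b, by = c("Ano", "Mes")), pieces)
Consolidado <- Consolidado[order(Consolidado$Ano), ]
Consolidado
```

Adding a variable then only means adding one entry to unit_for, which should scale to the 30 or more variables mentioned in the question.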

sample multiple different sample sizes using crossing and sample_n to create single df

I am attempting to sample a dataframe using sample_n. I know that sample_n usually takes a single size= argument at a time, however, I would like to sample sizes from 2 to the max # of rows in the df. Unfortunately, the code I have compiled below does not do the job. The needed output would be a dataframe with an id= column or a list divided by the id column from crossing().
df <- data.frame(Date = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
data_sampled_by_stratum <- df %>%
group_by(Date) %>%
crossing(id = seq(500)) %>% # repeat dataframes
group_by(id) %>%
sample_n(size=c(2:15)) %>%
group_by(CLUSTER_ID,Date) %>% filter(n() > 2)
If you had a column with different sites you could do this.
data_sampled_by_stratum <- data_grouped_by_stratum %>%
group_by(siteid, Date) %>%
crossing(id = seq(500)) %>% # repeat dataframes
sample_n(rbinom(1,sum(siteid==i),(1-s)^2))
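For the original goal of drawing one sample per size from 2 up to the number of rows, a base R sketch that loops over the sizes instead of passing a vector to sample_n might look like this. Using the sample size itself as the id column is my assumption about the wanted output:

```r
set.seed(123)
df <- data.frame(Date = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

sizes <- 2:nrow(df)

# Draw one sample without replacement per size and tag it with that size
samples <- lapply(sizes, function(k) {
  out <- df[sample(nrow(df), k), ]
  out$id <- k
  out
})

# Either keep `samples` as a list, or stack it into a single data frame
data_sampled <- do.call(rbind, samples)
```

To repeat the whole procedure many times (as with crossing(id = seq(500)) in the question), the lapply call could be wrapped in another loop that adds a replicate column.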
