Subset data by Year from period column - r

I have data with a month-and-year column. How can I subset it by year alone? Here is a sample dataset:
# Libraries
library(ggplot2)
library(reshape2)
# Data
df <- data.frame("Hospital" = c("Buge Hospital", "Buge Hospital", "Greta Hospital", "Greta Hospital",
"Makor Hospital", "Makor Hospital"),
"Period" = c("Jul-18","Aug-18", "Jul-19","Aug-19", "Jul-20","Aug-20"),
"Medical admissions" = c(12,56,0,40,5,56),
"Surgical admissions" = c(10,2,0,50,20,56),
"Inpatient admissions" = c(9,5,6,0,60,96))
I have tried this, but I get an empty dataset:
data_18 <- subset(df, format(as.Date(df$Period, format="%b/%Y"),"%Y")== 2018)
I want to pull out the monthly data for each year so that I can observe the trends for that period.
The expected result is a subset containing only one year's data at a time, for example the monthly data for 2018.

I am not sure if this is what you are looking for:
data <- Filter(nrow,split(df,list(gsub(".*-","",df$Period),df$Hospital)))
data_18 <- data[grepl("^18",names(data))]
which gives
> data
$`18.Buge Hospital`
Hospital Period Medical.admissions Surgical.admissions Inpatient.admissions
1 Buge Hospital Jul-18 12 10 9
2 Buge Hospital Aug-18 56 2 5
$`19.Greta Hospital`
Hospital Period Medical.admissions Surgical.admissions Inpatient.admissions
3 Greta Hospital Jul-19 0 0 6
4 Greta Hospital Aug-19 40 50 0
$`20.Makor Hospital`
Hospital Period Medical.admissions Surgical.admissions Inpatient.admissions
5 Makor Hospital Jul-20 5 20 60
6 Makor Hospital Aug-20 56 56 96
and
> data_18
$`18.Buge Hospital`
Hospital Period Medical.admissions Surgical.admissions Inpatient.admissions
1 Buge Hospital Jul-18 12 10 9
2 Buge Hospital Aug-18 56 2 5
EDIT
If you just want to subset the data for 2018 (thanks to @G. Grothendieck):
data_18 <- subset(df, grepl("18", Period))
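One caveat with a bare "18" pattern is that it matches anywhere in the string. A minimal sketch, assuming the Period values always end in a hyphen followed by a two-digit year as in the sample data, anchors the pattern to the end of the string. The value "18-Jul-19" is hypothetical and is included only to show the difference:

```r
# Periods in the question's "Mon-yy" layout, plus one hypothetical value
# ("18-Jul-19") that a bare "18" pattern would match by accident.
period <- c("Jul-18", "Aug-18", "Jul-19", "18-Jul-19")

loose  <- grepl("18", period)    # matches "18" anywhere in the string
strict <- grepl("-18$", period)  # matches only a trailing "-18"

print(data.frame(period, loose, strict))
```

With the sample data both patterns select the same rows; the anchored form only matters if other digits can appear in the column.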

I think what you were trying for is :
subset(df, format(as.Date(paste('1', Period), '%d %b-%y'), "%Y") == 2018)
# Hospital Period Medical.admissions Surgical.admissions Inpatient.admissions
#1 Buge Hospital Jul-18 12 10 9
#2 Buge Hospital Aug-18 56 2 5
Or using zoo's yearmon.
library(zoo)
subset(df, floor(as.yearmon(Period, "%b-%y")) == 2018)
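To see why the attempt in the question returned an empty data set, here is a small base R sketch. It assumes English month abbreviations, so it sets the LC_TIME locale to "C" explicitly; the vector name `period` is only for illustration:

```r
# %b parsing depends on the locale; "C" gives English month abbreviations.
invisible(Sys.setlocale("LC_TIME", "C"))

period <- c("Jul-18", "Aug-18", "Jul-19")

# The question's format string uses "/" and a four-digit year (%Y).
# Neither matches "Jul-18", so every element comes back as NA.
bad <- as.Date(period, format = "%b/%Y")

# Prepend a day and use "-" with a two-digit year (%y) instead.
good <- as.Date(paste0("01-", period), format = "%d-%b-%y")

print(bad)
print(format(good, "%Y"))
```

Because every parsed value is NA, the comparison with 2018 is never TRUE and subset() keeps no rows.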

There are several possibilities, for example with strsplit or with tidyverse as follows:
library(tidyr)
library(dplyr)
df %>% separate(Period, into=c("Month", "Year"), "-") %>% filter(Year == 18)
If you want to summarize or plot by year, use group_by instead of filter, for example:
df %>%
separate(Period, into=c("Month", "Year"), "-") %>%
group_by(Year) %>%
summarize(sum(Medical.admissions))

And for a more pedestrian approach in response to your desire to subset on both year and month, and to reflect how the approach in your own code could be made to work:
# Libraries
library(ggplot2)
library(reshape2)
library(lubridate)
library(dplyr) # provides %>% and mutate(), used below
# Data
df <- data.frame("Hospital" = c("Buge Hospital", "Buge Hospital", "Greta Hospital", "Greta Hospital",
"Makor Hospital", "Makor Hospital"),
"Period" = c("Jul-18","Aug-18", "Jul-19","Aug-19", "Jul-20","Aug-20"),
"Medical admissions" = c(12,56,0,40,5,56),
"Surgical admissions" = c(10,2,0,50,20,56),
"Inpatient admissions" = c(9,5,6,0,60,96),
stringsAsFactors = FALSE)
# Data wrangling to give you valid date, year and month variables; subsetting on year is then straightforward, e.g. using dplyr::group_by(year, month)
df1 <-
df %>%
mutate(date = as.Date(paste0("01-", Period),format = "%d-%b-%y"),
year = year(date),
month = month(date))
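Once a real date column exists, getting one data set per year is a short step. The following is a base R sketch rather than part of the answer above; it uses a reduced copy of the sample data, and the names `by_year` and `year` are illustrative:

```r
invisible(Sys.setlocale("LC_TIME", "C"))  # English month abbreviations for %b

# Reduced copy of the sample data (one measure column only).
df <- data.frame(
  Hospital = rep(c("Buge Hospital", "Greta Hospital", "Makor Hospital"), each = 2),
  Period = c("Jul-18", "Aug-18", "Jul-19", "Aug-19", "Jul-20", "Aug-20"),
  Medical.admissions = c(12, 56, 0, 40, 5, 56),
  stringsAsFactors = FALSE
)

# Parse Period into a Date, then derive the four-digit year.
df$date <- as.Date(paste0("01-", df$Period), format = "%d-%b-%y")
df$year <- format(df$date, "%Y")

# One data frame per year; by_year[["2018"]] holds the 2018 rows.
by_year <- split(df, df$year)
print(names(by_year))
print(by_year[["2018"]])
```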

Related

R: Filtering rows based on a group criterion

I have a data frame with over 100,000 rows and with about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained. Adams only has 5 rows for 2018 and Jefferson has 0 for 2018.
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools =
data.frame(school = c(rep('Washington', 60),
rep('Adams',70),
rep('Jefferson', 100)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
rep(2019:2023, each = 20)),
stuff = rnorm(230)
)
Schools2 =
data.frame(school = c(rep('Washington', 60)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
stuff = rnorm(60)
)
table(Schools$school, Schools$year)
Schools |> group_by(school, year) |> summarize(counts = n())
Group by school, count the rows per year for 2018 through 2022, and keep a school only if every one of those counts is at least 10 and all of the years in that range are present:
library(dplyr) # version >= 1.1.0
Schools %>%
filter(all(table(year[year %in% 2018:2022]) >= 10) &
all(2018:2022 %in% year), .by = c("school")) %>%
as_tibble()
Output
# A tibble: 60 × 3
school year stuff
<chr> <dbl> <dbl>
1 Washington 2016 0.680
2 Washington 2016 -1.14
3 Washington 2016 0.0420
4 Washington 2016 -0.603
5 Washington 2016 2.05
6 Washington 2018 -0.810
7 Washington 2018 0.692
8 Washington 2018 -0.502
9 Washington 2018 0.464
10 Washington 2018 0.397
# … with 50 more rows
Or using count
library(magrittr)
Schools %>%
filter(tibble(year) %>%
filter(year %in% 2018:2022) %>%
count(year) %>%
pull(n) %>%
is_weakly_greater_than(10) %>%
all, all(2018:2022 %in% year) , .by = "school")
As it turns out, a friend just helped me come up with a base R solution.
# form 2-way table, school against year
sdTable = table(Schools$school, Schools$year)
# say want years 2018-2022 having lots of rows in school data
sdTable = sdTable[,3:7]
# which have >= 10 rows in all years 2018-2022
allGtEq = function(oneRow) all(oneRow >= 10)
whichToKeep = which(apply(sdTable,1,allGtEq))
# now whichToKeep is row numbers from the table; get the school names
whichToKeep = names(whichToKeep)
# back to school data
whichOrigRowsToKeep = which(Schools$school %in% whichToKeep)
newHousing = Schools[whichOrigRowsToKeep,]
newHousing
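The base R solution above picks the year columns by position (3:7), which depends on which years happen to be in the data. A variation, sketched here under the assumption that every required year occurs at least once somewhere in the data, selects the columns by year label instead. The names `required_years`, `keep` and `Schools_kept` are illustrative:

```r
set.seed(1)  # only so that the illustrative 'stuff' column is reproducible

Schools <- data.frame(
  school = c(rep("Washington", 60), rep("Adams", 70), rep("Jefferson", 100)),
  year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
           rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
           rep(2019:2023, each = 20)),
  stuff = rnorm(230)
)

required_years <- as.character(2018:2022)
counts <- table(Schools$school, Schools$year)

# Select columns by year label rather than by position.
counts <- counts[, required_years, drop = FALSE]

# Schools whose count is at least 10 in every required year.
keep <- rownames(counts)[apply(counts >= 10, 1, all)]
Schools_kept <- Schools[Schools$school %in% keep, ]

print(keep)
print(nrow(Schools_kept))
```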

Create several columns from a complex column in R

Imagine dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
I would like to:
Split column City into three columns: City, Country, State (if available, NA otherwise)
Check that Milwaukee has data in both State and Population (the Milwaukee row with NA population should take the value 400000), and then split into City, State and Country.
Could you please suggest the easiest method to do so?
Here's another solution that uses extract to pull out Country, City, and State in a single step, with State captured by an optional capture group (the remainder of the task is done as in @Allen's code):
library(tidyr)
library(dplyr)
df1 %>%
extract(City,
into = c("Country", "City", "State"),
regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?"
) %>%
# as in @Allen Cameron's answer:
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
separate(City, sep = " > ", into = c("Country", "City")) %>%
separate(City, sep = ', ', into = c('City', 'State')) %>%
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000
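To check what the regular expression from the extract answer captures, without tidyr, a base R sketch with regexec() and regmatches() can help. This is only an inspection aid on three of the sample strings, not a replacement for the answers above:

```r
city <- c("United Kingdom > Leeds",
          "Canada > Nanaimo, BC",
          "United States > Milwaukee")

pattern <- "([^>]+) > ([^,]+),? ?([A-Z]+)?"

# regexec() reports the full match followed by one element per capture group.
parts <- regmatches(city, regexec(pattern, city))

# Keep the three capture groups and label them.
out <- do.call(rbind, lapply(parts, function(p) p[2:4]))
colnames(out) <- c("Country", "City", "State")

# Treat an empty optional group as missing.
out[out == ""] <- NA

print(out)
```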

How do I create two new variables for how many days between the most recent game played by each team?

I would like to create two new variables (one for team1, another for team2). Each variable should tell me how many days between the most recent game played by each team.
library(tidyverse)
library(lubridate)
date <- c(mdy("May 7, 2021", "May 7, 2021", "May 6, 2021", "May 5, 2021", "May 5, 2021"))
team1 <- c("Boston Celtics", "Orlando Magic", "Atlanta Hawks", "Boston Celtics", "Phoenix Suns")
team2 <- c("Chicago Bulls", "Charlotte Hornets", "Indiana Pacers", "Orlando Magic", "Atlanta Hawks")
games <- data.frame(date, team1, team2)
Let me know if this provides the output you are interested in.
In this answer, you can first assign each row of data to a unique Game number. Then, put data into long form, and calculate days between games for each team. Finally, if desired, you can put data into wide format again.
library(tidyverse)
library(lubridate)
games %>%
mutate(Game = row_number()) %>%
pivot_longer(cols = starts_with("team"),
values_to = "Team",
names_to = "Number",
names_pattern = "team(\\d+)") %>%
group_by(Team) %>%
arrange(date) %>%
mutate(Days_Between = c(0, diff(date))) %>%
pivot_wider(id_cols = c(Game, date),
names_from = Number,
values_from = c(Team, Days_Between))
Output
Game date Team_1 Team_2 Days_Between_1 Days_Between_2
<int> <date> <chr> <chr> <dbl> <dbl>
1 4 2021-05-05 Boston Celtics Orlando Magic 0 0
2 5 2021-05-05 Phoenix Suns Atlanta Hawks 0 0
3 3 2021-05-06 Atlanta Hawks Indiana Pacers 1 0
4 1 2021-05-07 Boston Celtics Chicago Bulls 2 0
5 2 2021-05-07 Orlando Magic Charlotte Hornets 2 0
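The same long-form idea can be sketched in base R, which makes the per-team calculation easy to inspect. This is an illustration under the assumption that a team's first game should get 0, as in the answer above; the column names `slot` and `days_between` are illustrative:

```r
games <- data.frame(
  date = as.Date(c("2021-05-07", "2021-05-07", "2021-05-06",
                   "2021-05-05", "2021-05-05")),
  team1 = c("Boston Celtics", "Orlando Magic", "Atlanta Hawks",
            "Boston Celtics", "Phoenix Suns"),
  team2 = c("Chicago Bulls", "Charlotte Hornets", "Indiana Pacers",
            "Orlando Magic", "Atlanta Hawks"),
  stringsAsFactors = FALSE
)

# Long form: one row per (game, team).
long <- data.frame(
  game = rep(seq_len(nrow(games)), times = 2),
  date = rep(games$date, times = 2),
  slot = rep(c("team1", "team2"), each = nrow(games)),
  team = c(games$team1, games$team2),
  stringsAsFactors = FALSE
)
long <- long[order(long$date), ]

# Within each team, days since that team's previous game (0 for its first).
long$days_between <- ave(as.numeric(long$date), long$team,
                         FUN = function(d) c(0, diff(d)))

print(long[long$team == "Boston Celtics", c("date", "days_between")])
```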

Organize scale of x axis of time series graph

Here I have data that looks like this:
# Data
df <- data.frame("Hospital" = c("Buge Hospital", "Buge Hospital", "Greta Hospital", "Greta Hospital",
"Makor Hospital", "Makor Hospital"),
"Period" = c("Jul-18","Aug-18", "Jul-19","Aug-19", "Jul-20","Aug-20"),
"Medical admissions" = c(12,56,0,40,5,56),
"Surgical admissions" = c(10,2,0,50,20,56),
"Inpatient admissions" = c(9,5,6,0,60,96))
Now this data has a column called Period, which holds monthly data for different years: 2018, 2019 and 2020.
If I plot this data, here is how it looks:
library(ggplot2)
library(reshape2) # provides melt()
# Melt data into long format
df2 <- melt(data = df,
id.vars = c("Hospital","Period"),
measure.vars = names(df[3:5]))
# Stacked barplot
ggplot( df2, aes(x = Period, y = value, fill = variable, group = variable)) +
geom_bar(stat = "identity") +
theme(legend.position = "none") +
ggtitle(unique(df2$Hospital))+
scale_x_date(date_labels = "%Y")+
labs(x = "Month", y = "Number of People", fill = "Type")
It plots well, but the x axis is not organized in ascending order. I have tried the scale_x_date function, but the plot is still the same. What I want is for the months of 2018 to come first, followed by the months of 2019 and 2020, that is, the x axis ordered by year like this:
Aug-18, Jul-18, Aug-19, Jul-19, Aug-20, Jul-20.
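To solve your issue, you need to convert Period to a date format.
For example, you can use the parse_date function (the call below matches readr::parse_date, so load readr if your lubridate version does not provide parse_date):
library(lubridate)
library(readr) # exports parse_date(x, format = ...)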
library(tidyr)
library(dplyr)
df %>% mutate(Date = parse_date(as.character(Period), format = "%b-%y")) %>%
pivot_longer(cols = Medical.admissions:Inpatient.admissions, names_to = "Var", values_to = "Val")
# A tibble: 18 x 5
Hospital Period Date Var Val
<fct> <fct> <date> <chr> <dbl>
1 Buge Hospital Jul-18 2018-07-01 Medical.admissions 12
2 Buge Hospital Jul-18 2018-07-01 Surgical.admissions 10
3 Buge Hospital Jul-18 2018-07-01 Inpatient.admissions 9
4 Buge Hospital Aug-18 2018-08-01 Medical.admissions 56
5 Buge Hospital Aug-18 2018-08-01 Surgical.admissions 2
6 Buge Hospital Aug-18 2018-08-01 Inpatient.admissions 5
7 Greta Hospital Jul-19 2019-07-01 Medical.admissions 0
8 Greta Hospital Jul-19 2019-07-01 Surgical.admissions 0
9 Greta Hospital Jul-19 2019-07-01 Inpatient.admissions 6
10 Greta Hospital Aug-19 2019-08-01 Medical.admissions 40
11 Greta Hospital Aug-19 2019-08-01 Surgical.admissions 50
12 Greta Hospital Aug-19 2019-08-01 Inpatient.admissions 0
13 Makor Hospital Jul-20 2020-07-01 Medical.admissions 5
14 Makor Hospital Jul-20 2020-07-01 Surgical.admissions 20
15 Makor Hospital Jul-20 2020-07-01 Inpatient.admissions 60
16 Makor Hospital Aug-20 2020-08-01 Medical.admissions 56
17 Makor Hospital Aug-20 2020-08-01 Surgical.admissions 56
18 Makor Hospital Aug-20 2020-08-01 Inpatient.admissions 96
You can then use scale_x_date to set the labeling of your x axis:
library(lubridate)
library(tidyr)
library(dplyr)
library(ggplot2)
df %>% mutate(Date = parse_date(as.character(Period), format = "%b-%y")) %>%
pivot_longer(cols = Medical.admissions:Inpatient.admissions, names_to = "Var", values_to = "Val") %>%
ggplot(aes(x = Date, y = Val, fill= Var, group = Var))+
geom_col()+
scale_x_date(date_breaks = "month", date_labels = "%b %Y")+
labs(x = "Month", y = "Number of People", fill = "Type")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Does it answer your question?
EDIT: Using lubridate v1.7.8
With lubridate version 1.7.8, parse_date is no longer available. You will have to replace it with parse_date_time2 as follows:
library(lubridate)
library(dplyr)
df %>% mutate(Date = ymd(parse_date_time2(as.character(Period), orders = "%b-%y"))) %>% ....
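If a discrete x axis is preferred over a date scale, one alternative (a sketch, not the approach of the answer above) is to keep Period as a label but order its factor levels chronologically. It assumes English month abbreviations, so the LC_TIME locale is set to "C"; `period_ordered` is an illustrative name:

```r
invisible(Sys.setlocale("LC_TIME", "C"))  # English month abbreviations for %b

period <- c("Jul-18", "Aug-18", "Jul-19", "Aug-19", "Jul-20", "Aug-20")

# Default factor levels are alphabetical, which is what scrambles the axis.
default_levels <- levels(factor(period))

# Parse to Date only to obtain the sort order, then reuse the labels.
dates <- as.Date(paste0("01-", period), format = "%d-%b-%y")
period_ordered <- factor(period, levels = unique(period[order(dates)]))

print(default_levels)
print(levels(period_ordered))
```

Mapping `period_ordered` to x in ggplot would then draw the bars in chronological order while keeping the original labels.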

Aggregating and counting elements in the variables of a dataset

I might not have asked the proper question in my research; sorry if that is the case.
I have a multiple columns dataset:
helena <-
Year US$ Euros Country Regions
2001 12 13 US America
2000 13 15 UK Europe
2003 14 19 China Asia
I want to group the dataset so that, for each region, I get the total earnings per year plus a column showing how many countries reported data for that region each year:
helena <-
Year US$ Euros Regions Number of countries per region per Year
2000 150 135 America 2
2001 135 151 Europe 15
2002 142 1900 Asia 18
So far, I have tried
count(helena, c("Regions", "Year"))
but it does not work properly, since the result includes only the columns indicated.
Here is the data.table way, I have added a row for Canada for year 2000 to test the code:
library(data.table)
df <- data.frame(Year = c(2000, 2001, 2003,2000),
US = c(13, 12, 14,13),
Euros = c(15, 13, 19,15),
Country = c('US', 'UK', 'China','Canada'),
Regions = c('America', 'Europe', 'Asia','America'))
df <- data.table(df)
df[,
.(sum_US = sum(US),
sum_Euros = sum(Euros),
number_of_countries = uniqueN(Country)),
.(Regions, Year)]
Regions Year sum_US sum_Euros number_of_countries
1: America 2000 26 30 2
2: Europe 2001 12 13 1
3: Asia 2003 14 19 1
With dplyr:
library(dplyr)
your_data %>%
group_by(Regions, Year) %>%
summarize(
US = sum(US),
Euros = sum(Euros),
N_countries = n_distinct(Country)
)
Using dplyr:
library(dplyr)
df %>% group_by(Regions, Year) %>%
summarise(Earnings_US = sum(`US$`),
Earnings_Euros = sum(Euros),
N_Countries = length(Country))
This aggregates the data set by region and year, summing the earnings columns and taking the length of the Country column (assuming each country appears at most once per region and year).
Using tidyverse and building the example
library(tidyverse)
df <- tibble(Year = c(2000, 2001, 2003,2000),
US = c(13, 12, 14,13),
Euros = c(15, 13, 19,15),
Country = c('US', 'UK', 'China','Canada'),
Regions = c('America', 'Europe', 'Asia','America'))
df %>%
group_by(Regions, Year) %>%
summarise(US = sum(US),
Euros = sum(Euros),
Countries = n_distinct(Country))
updated to reflect the data in the original question
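For comparison, the same grouping can be sketched in base R with aggregate() and merge(). It uses the four-row example from the answers above (with the US$ column renamed to US), and the names `sums`, `n_countries` and `result` are illustrative:

```r
df <- data.frame(
  Year = c(2000, 2001, 2003, 2000),
  US = c(13, 12, 14, 13),
  Euros = c(15, 13, 19, 15),
  Country = c("US", "UK", "China", "Canada"),
  Regions = c("America", "Europe", "Asia", "America"),
  stringsAsFactors = FALSE
)

# Sum both earnings columns per region and year.
sums <- aggregate(cbind(US, Euros) ~ Regions + Year, data = df, FUN = sum)

# Count distinct countries per region and year.
n_countries <- aggregate(Country ~ Regions + Year, data = df,
                         FUN = function(x) length(unique(x)))
names(n_countries)[3] <- "N_countries"

# Combine the sums and the counts into one table.
result <- merge(sums, n_countries, by = c("Regions", "Year"))
print(result)
```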
