Dropping rows by checking whether a department has values for multiple years in R

I have a data frame in this form:
Year Department   Jan     Feb  ...    Dec
2017 TF         15.15  225.51  ... 5562.1
2015 CIF          ...
2013 TTR          ...
2011 COR          ...
...
In summary, I want to build an algorithm, but first I have to apply this filter: if a department does not have values for all of the years 2013, 2014, 2015, and 2016, then I want to exclude that department from my data set. In other words, I want to keep only the departments that have values in the month columns for all four of those years.
I tried exists and is.na, but filtering on multiple conditions keeps failing. Another obstacle is that filter seems to handle only a single condition, while here I need four: values for all four years must exist before I can use them in the next step.
Thank you.

I can't find a clear duplicate to this question. Seems like a quick fix with group_by:
library(dplyr)
df <- tibble(Year = c(2013:2016, 2015, 2016),
             Department = c(rep('TF', 4), 'CIF', 'TTR'))
df
#> # A tibble: 6 x 2
#> Year Department
#> <dbl> <chr>
#> 1 2013 TF
#> 2 2014 TF
#> 3 2015 TF
#> 4 2016 TF
#> 5 2015 CIF
#> 6 2016 TTR
df %>%
  group_by(Department) %>%
  mutate(x = Year %in% 2013:2016,
         y = sum(x)) %>%
  ungroup() %>%
  filter(y == 4)
#> # A tibble: 4 x 4
#> Year Department x y
#> <dbl> <chr> <lgl> <int>
#> 1 2013 TF TRUE 4
#> 2 2014 TF TRUE 4
#> 3 2015 TF TRUE 4
#> 4 2016 TF TRUE 4
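A more compact variant (a sketch, not part of the original answer) filters directly on whether each department's years cover all of 2013-2016, without the helper columns:
library(dplyr)
df %>%
  group_by(Department) %>%
  filter(all(2013:2016 %in% Year)) %>%
  ungroup()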

A solution using base R:
df = read.table(text = "Year, Department
2016,TF
2017,TF
2013,CIF
2014,CIF
2015,CIF
2016,CIF
2013,TTR", header = TRUE, sep = ",", stringsAsFactors = FALSE)
keep <- subset(
  aggregate(Year ~ Department, data = subset(df, Year %in% 2013:2016), FUN = length),
  Year == 4
)$Department
df[df$Department %in% keep, ]
Output:
Year Department
3 2013 CIF
4 2014 CIF
5 2015 CIF
6 2016 CIF
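If a department could appear more than once in the same year, counting distinct years rather than rows is safer; a sketch along the same lines:
keep <- subset(
  aggregate(Year ~ Department, data = subset(df, Year %in% 2013:2016),
            FUN = function(x) length(unique(x))),
  Year == 4
)$Department
df[df$Department %in% keep, ]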

Related

Updating table with custom numbers

Below is my dataset, which contains four columns: id, year, quarter, and price.
df <- data.frame(id = c(1, 2, 1, 2),
                 year = c(2010, 2010, 2011, 2011),
                 quarter = c("2010-q1", "2010-q2", "2011-q1", "2011-q2"),
                 price = c(10, 50, 10, 50))
Now I want to expand this dataset to cover 2012 and 2013. First, I want to copy the 2010 and 2011 rows and append them below; after that, replace the years with 2012 and 2013 and the quarters with 2012-q1, 2012-q2, 2013-q1, and 2013-q2.
Can anybody help me with how to solve this and prepare the table shown below?
Shift the years forward by two, rebuild the quarter strings, and append the result to the original rows:
df %>%
  mutate(year = year + 2, quarter = paste0(year, "-q", id)) %>%
  bind_rows(df, .)
id year quarter price
1 1 2010 2010-q1 10
2 2 2010 2010-q2 50
3 1 2011 2011-q1 10
4 2 2011 2011-q2 50
5 1 2012 2012-q1 10
6 2 2012 2012-q2 50
7 1 2013 2013-q1 10
8 2 2013 2013-q2 50
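Note that paste0(year, "-q", id) only works because id happens to coincide with the quarter number. A variant (a sketch, not from the original answer) that instead reuses the suffix of the existing quarter column:
library(dplyr)
df %>%
  mutate(year = year + 2,
         # keep the "-q1"/"-q2" suffix, swap in the shifted year
         quarter = paste0(year, substr(quarter, 5, 7))) %>%
  bind_rows(df, .)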

How to generate a sequence of numbers increasing at a fixed percentage?

I would like to calculate the predicted value at a 2% growth rate over 10 years.
My data looks like this:
df <- structure(list(fin_year = c(2016, 2017, 2018, 2019, 2020, 2021
), Total = c(136661.9, 142748.25, 146580.77, 155486.07, 171115.58,
69265.01)), class = "data.frame", row.names = c(NA, -6L))
I would like to add a new column (two_percent) whose amounts compound at 2% per year from the 2016 Total value.
I've tried this, but I can't figure out how to write the script properly to do what I want:
df1 <- df %>%
  mutate(two_percent = rep(Total[1:1] * 1.02))
Your help is much appreciated.
The formula is 1.02^n where n is the number of periods. One may need to subtract 1 from n depending on whether the interest is at the beginning or end of the period.
basevalue <- df$Total[1]
df1 <- df %>%
  mutate(two_percent = basevalue * 1.02^(row_number() - 1))
We can use purrr::accumulate to calculate the 2% growth forecast. First, let's calculate it for the existing data.frame. We supply a vector of 1.02s, one shorter than the total row count, as accumulate's .x argument, and the base value of Total as the .init argument (the value we want to base the forecast on). The function .f is then just .x * .y.
library(dplyr)
library(purrr)
# Calculate the growth rate for the existing data.frame
df %>%
  mutate(two_percent = accumulate(rep(1.02, nrow(.) - 1),
                                  ~ .x * .y,
                                  .init = first(Total)))
#> fin_year Total two_percent
#> 1 2016 136661.90 136661.9
#> 2 2017 142748.25 139395.1
#> 3 2018 146580.77 142183.0
#> 4 2019 155486.07 145026.7
#> 5 2020 171115.58 147927.2
#> 6 2021 69265.01 150885.8
While this works for the existing data.frame, we need a new one if we want to forecast values for years that the current df doesn't contain. Basically, we use the same approach as above and combine it with a right_join:
# Calculate the growth rate for a 10 year period, and then join
new_df <- tibble(Year = 1:10,
                 two_percent = df$Total[1]) %>%
  mutate(two_percent = accumulate(rep(1.02, nrow(.) - 1),
                                  ~ .x * .y,
                                  .init = first(two_percent)))
df %>%
  mutate(Year = row_number()) %>%
  right_join(new_df)
#> Joining, by = "Year"
#> fin_year Total Year two_percent
#> 1 2016 136661.90 1 136661.9
#> 2 2017 142748.25 2 139395.1
#> 3 2018 146580.77 3 142183.0
#> 4 2019 155486.07 4 145026.7
#> 5 2020 171115.58 5 147927.2
#> 6 2021 69265.01 6 150885.8
#> 7 NA NA 7 153903.5
#> 8 NA NA 8 156981.6
#> 9 NA NA 9 160121.2
#> 10 NA NA 10 163323.6
Created on 2022-01-13 by the reprex package (v2.0.1)
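For reference, base R's cumprod produces the same running product without purrr (a sketch equivalent to the first forecast above):
# base value times the cumulative product of the growth factors
df$two_percent <- df$Total[1] * cumprod(c(1, rep(1.02, nrow(df) - 1)))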
Here's another simple method that anchors the base of the two_percent calculation to the value of Total in the first fin_year, using which.min(fin_year):
library(tidyverse)
df <- structure(list(fin_year = c(2016, 2017, 2018, 2019, 2020, 2021
), Total = c(136661.9, 142748.25, 146580.77, 155486.07, 171115.58,
69265.01)), class = "data.frame", row.names = c(NA, -6L))
df %>%
  mutate(two_percent = Total[which.min(fin_year)] * 1.02^(seq_along(fin_year)))
#> fin_year Total two_percent
#> 1 2016 136661.90 139395.1
#> 2 2017 142748.25 142183.0
#> 3 2018 146580.77 145026.7
#> 4 2019 155486.07 147927.2
#> 5 2020 171115.58 150885.8
#> 6 2021 69265.01 153903.5
Created on 2022-01-13 by the reprex package (v2.0.1)

Time spent in each calendar year

I followed some individuals A and B from start to end:
df <- data.frame(id = c("A", "B"),
                 start = as.Date(c("2015-01-01", "2013-01-01")),
                 end = as.Date(c("2021-06-12", "2017-10-10")))
df
id start end
1 A 2015-01-01 2021-06-12
2 B 2013-01-01 2017-10-10
I would like to calculate the follow-up time for each calendar year. For example, I have 1 year for 2013 (from B), 1 year for 2014 (from B), 2 years for 2015 (from A and B), and so on.
I tried to treat year as an integer and count how many years each individual contributes, but due to rounding errors the result is not plausible.
I tried
years <- NULL
for (i in 1:length(df$id)) {
  years <- c(years, as.character(seq.Date(from = df$start[i], to = df$end[i], by = "day")))
}
library(lubridate)
table(year(years))/365
2013 2014 2015 2016 2017 2018 2019 2020 2021
1.0000000 1.0000000 2.0000000 2.0054795 1.7753425 1.0000000 1.0000000 1.0027397 0.4465753
which is the answer I am trying to get, but it is computationally inefficient and very slow on large data. Is there any way to do this without the loop, or to do it more efficiently?
I'm now guessing that you actually don't want to round or truncate anything, so here's a solution that works and gives output similar to your method (correcting the 2016 value):
func <- function(st, ed) {
  stopifnot(length(st) == 1, length(ed) == 1)
  stL <- as.POSIXlt(st)
  edL <- as.POSIXlt(ed)
  start_year <- 1900 + stL$year
  end_year <- 1900 + edL$year
  start_eoy <- as.POSIXlt(paste0(start_year, "-12-31"))
  end_eoy <- as.POSIXlt(paste0(end_year, "-12-31"))
  # fractions of the first and last calendar years that fall inside [st, ed]
  firstyear <- (start_eoy$yday - stL$yday) / start_eoy$yday
  lastyear <- edL$yday / end_eoy$yday
  data.frame(
    year = seq(start_year, end_year),
    n = c(firstyear, rep(1, max(0, end_year - start_year - 1)), lastyear)
  )
}
base R
aggregate(n ~ year, data = do.call(rbind, Map(func, df$start, df$end)), FUN = sum)
# year n
# 1 2013 1.0000000
# 2 2014 1.0000000
# 3 2015 2.0000000
# 4 2016 2.0000000
# 5 2017 1.7747253
# 6 2018 1.0000000
# 7 2019 1.0000000
# 8 2020 1.0000000
# 9 2021 0.4450549
dplyr
library(dplyr)
df %>%
  with(Map(func, start, end)) %>%
  bind_rows() %>%
  group_by(year) %>%
  summarize(n = sum(n))
# # A tibble: 9 x 2
# year n
# <int> <dbl>
# 1 2013 1
# 2 2014 1
# 3 2015 2
# 4 2016 2
# 5 2017 1.77
# 6 2018 1
# 7 2019 1
# 8 2020 1
# 9 2021 0.445
Sounds like a job for a great package called lubridate. See the example below.
By the way, I assumed the dates are year-month-day, hence ymd. If not, you can use mdy (month-day-year) for the American date format.
df <- data.frame(id = c("A", "B"),
                 start = as.Date(c("2015-01-01", "2013-01-01")),
                 end = as.Date(c("2021-06-12", "2017-10-10")))
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(tidyverse)
df %>%
  mutate(across(start:end, ymd),
         follow_up_years = interval(start, end) / years(1),
         follow_up_months = interval(start, end) / months(1),
         follow_up_days = interval(start, end) / days(1))
#> id start end follow_up_years follow_up_months follow_up_days
#> 1 A 2015-01-01 2021-06-12 6.443836 77.36667 2354
#> 2 B 2013-01-01 2017-10-10 4.772603 57.29032 1743
Created on 2021-10-28 by the reprex package (v2.0.1)
Edit
I think I understand. I guess we can also just use lubridate intervals:
df %>%
  mutate(follow_up_2015 = interval(start, as_date("2015-01-01")) / years(1)) %>%
  pull(follow_up_2015) %>%
  sum()
#> [1] 2
Created on 2021-10-28 by the reprex package (v2.0.1)
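Extending that idea to every calendar year at once (a sketch, not part of the original answers; it assumes start and end are Dates and, like the question's approach, divides by 365, so leap years are only approximated):
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  crossing(year = 2013:2021) %>%
  mutate(yr_start = make_date(year, 1, 1),
         yr_end   = make_date(year, 12, 31),
         # days of overlap between [start, end] and each calendar year
         overlap  = pmax(0, as.numeric(pmin(end, yr_end) - pmax(start, yr_start)) + 1)) %>%
  group_by(year) %>%
  summarize(n = sum(overlap) / 365)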

How do I go about filtering my data by the upper 50th percentile for a separate dependent variable?

I need to split my data so that when I use facet_wrap I have the top 50th percentile for each year.
Here is a sample of my data:
# A tibble: 10,519 x 3
Species Abundance Year
<chr> <dbl> <chr>
1 Astropecten irregularis 2 2009
2 Asterias rubens 14 2009
3 Echinus esculentus 1 2009
4 Pagurus prideaux 1 2009
5 Raja clavata 1 2009
6 Astropecten irregularis 4 2009
7 Asterias rubens 47 2009
8 Henricia sp. 2 2009
9 Ophiura ophiura 8 2009
10 Solaster endeca 1 2009
# ... with 10,509 more rows
My current strategy is this:
Data <- All_years %>%
  group_by(Species, Year) %>%
  summarise(Abundance = sum(Abundance, na.rm = TRUE)) %>%
  filter(quantile(Abundance, 0.50) < Abundance) %>%
  filter(Abundance > 50)
The issue is that this gives me the top 50th percentile for the whole set, while I would like the top 50th percentile for each year so that I can display it with facet_wrap in ggplot.
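One possible fix (a sketch, assuming the data above): after summarise, the result is still grouped by Species, so the quantile is not computed per year. Regrouping by Year before the quantile filter makes the cutoff a per-year median:
library(dplyr)
Data <- All_years %>%
  group_by(Species, Year) %>%
  summarise(Abundance = sum(Abundance, na.rm = TRUE), .groups = "drop") %>%
  group_by(Year) %>%   # quantile is now computed within each year
  filter(Abundance > quantile(Abundance, 0.50)) %>%
  ungroup()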

merge two data frames based on matching rows of multiple columns

Below are the summary and structure of the two data sets I tried to merge, claimants and unemp; they can be found here: claims.csv and unemp.csv.
> tbl_df(claimants)
# A tibble: 6,960 × 5
X County Month Year Claimants
<int> <fctr> <fctr> <int> <int>
1 1 ALAMEDA Jan 2007 13034
2 2 ALPINE Jan 2007 12
3 3 AMADOR Jan 2007 487
4 4 BUTTE Jan 2007 3496
5 5 CALAVERAS Jan 2007 644
6 6 COLUSA Jan 2007 1244
7 7 CONTRA COSTA Jan 2007 8475
8 8 DEL NORTE Jan 2007 328
9 9 EL DORADO Jan 2007 2120
10 10 FRESNO Jan 2007 19974
# ... with 6,950 more rows
> tbl_df(unemp)
# A tibble: 6,960 × 7
County Year Month laborforce emplab unemp unemprate
* <chr> <int> <chr> <int> <int> <int> <dbl>
1 Alameda 2007 Jan 743100 708300 34800 4.7
2 Alameda 2007 Feb 744800 711000 33800 4.5
3 Alameda 2007 Mar 746600 713200 33300 4.5
4 Alameda 2007 Apr 738200 705800 32400 4.4
5 Alameda 2007 May 739100 707300 31800 4.3
6 Alameda 2007 Jun 744900 709100 35800 4.8
7 Alameda 2007 Jul 749600 710900 38700 5.2
8 Alameda 2007 Aug 746700 709600 37000 5.0
9 Alameda 2007 Sep 748200 712100 36000 4.8
10 Alameda 2007 Oct 749000 713000 36100 4.8
# ... with 6,950 more rows
I thought first I should change all the factor columns to character columns.
unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)
claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)
m <- merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1] 0 10
As the output of dim(m) shows, the resulting data frame has 0 rows, although all 6960 rows should match each other uniquely.
To verify that the two data frames have unique combinations of the 3 columns 'County', 'Month', and 'Year', I reorder and rearrange these columns within the data frames as below:
a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]
b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]
b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE
This last output confirms that the 'County', 'Month', and 'Year' columns match each other in these two data frames.
I have tried looking into the documentation for merge and could not work out where I am going wrong. I have also tried the inner_join function from dplyr:
> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8
I am missing something and don't know what; I would appreciate help understanding this. I know I should not have to rearrange the rows by the three columns to run merge; R should identify the matching rows and merge the non-matching columns.
The claimants df has the counties in all uppercase; the unemp df has them in mixed case.
I used options(stringsAsFactors = FALSE) when reading in your data. A few suggestions: drop the X column in both, as it doesn't seem useful.
library(dplyr)
options(stringsAsFactors = FALSE)
claims <- read.csv("claims.csv", header = TRUE)
claims$X <- NULL
unemp <- read.csv("unemp.csv", header = TRUE)
unemp$X <- NULL
unemp$County <- toupper(unemp$County)  # match the casing used in claims
m <- inner_join(unemp, claims)
dim(m)
# [1] 6960 8
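As a quick diagnostic for this kind of silent mismatch (a sketch, not part of the original answer), anti_join returns the rows of one table that find no partner in the other; run before the toupper() fix, it would have returned every row and surfaced the case difference immediately:
library(dplyr)
# rows of unemp that have no match in claims on the join keys
anti_join(unemp, claims, by = c("County", "Month", "Year")) %>% head()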
