In R, categorize time series data based on regex - r

My data is organised as follows: for each product, there is a tax rate for each year as well as a base year tax (baseyr).
product<-c("01","02","03","04")
baseyr <-c("10","8 GBP/tonne","8GBP/tonne + 8GBP/tonne","8")
yr1<-c("5","5 GBP/tonne","5GBP/tonne + 10GBP/tonne","5 + 5GBP/tonne")
yr2<-c("3","3GBP/tonne + 6GBP/tonne","3 GBP/tonne","3 + 5GBP/tonne")
yr3<-c("2","2","2GBP/tonne + 2GBP/tonne","excluded")
sched<-data.frame(product,baseyr,yr1,yr2,yr3)
For each year, I need to classify each product by tax type in a new column based on the following conditions:
#number -> only numbers in the tax
#nonnumber -> numbers and strings in the tax
#mixed -> either two strings or number and string; the two strings are specified by a plus sign
#baseyr -> if the tax is "excluded" from the list, the tax to be used should be the value in base year, and the classification based on this
So if there are 3 years I need to generate 3 tax type columns. However the number of years changes randomly per dataset so I need to code with this in mind. My code is currently something like this:
yearnum<-3 #set number of years; it is between 1 and around 10 but there is no limit
schedule<-c(paste0("yr",1:yearnum))
tax<-c(paste0(schedule,"_tax"))
for(i in 1:nrow(sched)){
#for each new tax type
for(j in tax){
#columns 3 to five where the yearly tax rates are
for(yr in 3:5){
#if the tax is excluded from the list, the base year tax should be used to determine the tax nature
if(sched[i,yr] =="excluded"){sched[i,yr] <- sched[i,baseyr]}
#if there is a plus sign it is a mixed tax
if(grepl("\\+",sched[i,yr])){sched[i,j] <- "mixed"}
#if it is not mixed but contains strings it is a nonnumber tax
if(grepl("[:alpha:]",sched[i,yr])){sched[i,j] <- "nonnumber"}
#finally if it is neither of the above it must be a number tax
if(is.na(sched[i,j])){sched[i,j] <- "number"}
}}}
NOTE: I do not know at the start how many years there will be in total; this has to be generated in the code. Any advice much appreciated, especially to avoid these for loops that don't seem to work properly for me.
The final output should be:
#so the output should be:
yr1_tax<-c("number","nonnumber","mixed","mixed")
yr2_tax<-c("number","mixed","nonnumber","mixed")
yr3_tax<-c("number","number","mixed","number")
#and the final dataframe:
sched<-data.frame(product,baseyr,yr1,yr2,yr3,yr1_tax,yr2_tax,yr3_tax)

You could use ifelse to replace every "excluded" value with the corresponding baseyr value, then use case_when with regular expressions, as shown below:
library(dplyr)
sched %>%
  mutate(
    across(starts_with('yr'), ~ifelse(.x == 'excluded', baseyr, .x),
           .names = '{.col}_tax'),
    across(ends_with('tax'),
           ~case_when(grepl("^\\d+$", .x) ~ 'number',
                      grepl('^[^+]+$', .x) ~ 'nonnumber',
                      grepl('[+]', .x) ~ 'mixed')))
  product                  baseyr                      yr1                     yr2                     yr3   yr1_tax   yr2_tax   yr3_tax
1      01                      10                        5                       3                       2    number    number    number
2      02             8 GBP/tonne              5 GBP/tonne 3GBP/tonne + 6GBP/tonne                       2 nonnumber     mixed    number
3      03 8GBP/tonne + 8GBP/tonne 5GBP/tonne + 10GBP/tonne             3 GBP/tonne 2GBP/tonne + 2GBP/tonne     mixed nonnumber     mixed
4      04                       8           5 + 5GBP/tonne          3 + 5GBP/tonne                excluded     mixed     mixed    number
The regex is simplified: a value made up of digits only is number; otherwise, a value with no + is nonnumber; and a value containing a + is mixed.
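For comparison, here is a minimal base R sketch of the same classification that avoids the nested for loops and picks up however many yr columns the data happens to contain. The helper name classify_tax and the column pattern "^yr[0-9]+$" are assumptions made for illustration, not part of the original answer. Note that the POSIX class has to be written "[[:alpha:]]"; the "[:alpha:]" used in the question's loop matches the literal characters between the brackets instead.

```r
# Toy data copied from the question
sched <- data.frame(
  product = c("01", "02", "03", "04"),
  baseyr  = c("10", "8 GBP/tonne", "8GBP/tonne + 8GBP/tonne", "8"),
  yr1     = c("5", "5 GBP/tonne", "5GBP/tonne + 10GBP/tonne", "5 + 5GBP/tonne"),
  yr2     = c("3", "3GBP/tonne + 6GBP/tonne", "3 GBP/tonne", "3 + 5GBP/tonne"),
  yr3     = c("2", "2", "2GBP/tonne + 2GBP/tonne", "excluded"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify one year's tax column
classify_tax <- function(tax, base) {
  # fall back to the base-year tax where the product is excluded
  tax <- ifelse(tax == "excluded", base, tax)
  # a plus sign means mixed; otherwise any letter means nonnumber
  ifelse(grepl("+", tax, fixed = TRUE), "mixed",
         ifelse(grepl("[[:alpha:]]", tax), "nonnumber", "number"))
}

# Find the year columns by name, so the number of years is not hard-coded
yr_cols <- grep("^yr[0-9]+$", names(sched), value = TRUE)
sched[paste0(yr_cols, "_tax")] <- lapply(sched[yr_cols], classify_tax,
                                         base = sched$baseyr)
print(sched)
```

Because yr_cols is derived from the column names, the same lines should work unchanged for a dataset with one year or with ten.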

Related

Argument is not numeric

I would like to visualize the number of people infected with COVID-19, but I cannot calculate the mortality rate per 100,000 population for each prefecture because the number of deaths is not treated as a number.
What I want to achieve
I want to compute "covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100" after converting "covid19j_20200613$deaths" to type num.
Error message.
Error in covid19j_20200613$deaths/covid19j_20200613$POP2019:
Argument of binary operator is not numeric
Source code in question.
library(spdep)
library(sf)
library(spatstat)
library(tidyverse)
library(ggplot2)
needs::prioritize(magrittr)
covid19j <- read.csv("https://raw.githubusercontent.com/kaz-ogiwara/covid19/master/data/prefectures.csv",
header=TRUE)
# Below is an example for June 13, 2020.
# Month and date may be changed
covid19j_20200613 <- dplyr::filter(covid19j,
year==2020,
month==6,
date==13)
covid19j_20200613$CODE <- 1:47
covid19j_20200613[is.na(covid19j_20200613)] <- 0
pop19 <- read.csv("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/pop2019.csv", header=TRUE)
covid19j_20200613 <- dplyr::inner_join(covid19j_20200613, pop19,
by = c("CODE" = "CODE"))
# Load Japan prefecture administrative boundary data
jpn_pref <- sf::st_read("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/jpn_pref.shp")
# Data and concatenation
jpn_pref_cov19 <- dplyr::inner_join(jpn_pref, covid19j_20200613, by=c("PREF_CODE"="CODE"))
ggplot2::ggplot(data = jpn_pref_cov19) +
geom_sf(aes(fill=testedPositive)) +
scale_fill_distiller(palette="RdYlGn") +
theme_bw() +
labs(title = "Tested Positiv of Covid19 (2020/06/13)")
# Mortality rate per 100,000 population
# Population number in units of 1000
as.numeric(covid19j_20200613$deaths)
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
Data files in question.
prefectures.csv
https://docs.google.com/spreadsheets/d/11C2vVo-jdRJoFEP4vAGxgy_AEq7pUrlre-i-zQVYDd4/edit?usp=sharing
pop2019.csv
https://docs.google.com/spreadsheets/d/1CbEX7BADutUPUQijM0wuKUZFq2UUt-jlWVQ1ipzs348/edit?usp=sharing
What we tried
I tried putting "as.numeric(covid19j_20200613$deaths)" before the calculation to convert the number of deaths to type num, but I got the same error message during the calculation.
Additional information (FW/tool versions, etc.)
iMac M1 2021, R 4.2.0
as.numeric() does not change the column in place; it only returns a converted copy.
So when you run as.numeric(covid19j_20200613$deaths), R prints the deaths column as numeric, but the column stored in the data frame stays character.
If you want to coerce the data type, you also need to assign the result back:
covid19j_20200613$deaths <- as.numeric(covid19j_20200613$deaths)
covid19j_20200613$POP2019 <- as.numeric(covid19j_20200613$POP2019)
# Now you can do calculations
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
It's easier to read if you use mutate from dplyr:
covid19j_20200613 <- covid19j_20200613 |>
  mutate(
    deaths = as.numeric(deaths),
    POP2019 = as.numeric(POP2019),
    deaths_rate = deaths / POP2019 * 100
  )
Result
deaths POP2019 deaths_rate
1 91 5250 1.73333333
2 1 1246 0.08025682
3 0 1227 0.00000000
4 1 2306 0.04336513
5 0 966 0.00000000
PS: your question is really difficult to follow! There is a lot of stuff that we don't actually need to answer it, so that makes it harder for us to identify where the issue is. For example, all the data import, the join, the ggplot...
When writing a question, please only include the minimal elements that lead to a problem. In your case, we only needed a sample dataset with the deaths and POP2019 columns, and the two lines of code that you tried to fix at the end.
If you look at str(covid19j) you'll see that the deaths column is a character column containing a lot of blanks. You need to figure out the structure of that column to read it properly.
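To make both points concrete, here is a small sketch on a made-up data frame. The name toy and its values are assumptions for illustration; only the column names deaths and POP2019 come from the question. It shows that calling as.numeric() without assignment leaves the column as character, and that blank strings turn into NA once the column is converted. Whether a blank should then count as zero deaths is an assumption that needs checking against the data source.

```r
# Made-up sample: deaths read as character, with one blank entry
toy <- data.frame(deaths  = c("91", "1", "", "1"),
                  POP2019 = c(5250, 1246, 1227, 2306),
                  stringsAsFactors = FALSE)

# Calling as.numeric() without assigning the result changes nothing
invisible(as.numeric(toy$deaths))
class_before <- class(toy$deaths)   # still "character"

# Assign the result back to convert the column for real
toy$deaths <- as.numeric(toy$deaths)
class_after <- class(toy$deaths)    # now "numeric"

# Blank strings became NA; treating them as 0 is an assumption
toy$deaths[is.na(toy$deaths)] <- 0

# Population is in units of 1000, so * 100 gives deaths per 100,000
toy$deaths_rate <- toy$deaths / toy$POP2019 * 100
print(toy)
```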

How to do a row-specific iterative calculation as many times as the value of a specific variable the row has? [closed]

I have monthly financial data and want to do an iterative calculation for each individual row. I have a variable REMCOUPS, and each row has an integer value for REMCOUPS.
Thus, for each row, I want to run the iteration the number of times indicated by that row's value of REMCOUPS.
My code is as follows:
for(i in 1:REMCOUPS){ Price <- 0
if(i!=REMCOUPS)Price <- Price + (Bonds_2016_2020B$COUPON/2)/(1+(as.numeric(Bonds_2016_2020B$YIELD1.y/2))^i)
else Price <- Price + (Bonds_2016_2020B$COUPON/2)/(1+(as.numeric(Bonds_2016_2020B$YIELD1.y/2))^i)
+ 100/(1+(as.numeric(Bonds_2016_2020B$YIELD1.y/2))^i)
}
Bonds_2016_2020B is my dataset name. For each row, I want to do an iterative calculation starting from Price=0 and adding (COUPON/2)/(1+YIELD)^1, (COUPON/2)/(1+YIELD)^2 ... (COUPON/2)/(1+YIELD)^(REMCOUPS-1), and finally, when the iteration reaches REMCOUPS, I add
(COUPON/2)/(1+YIELD)^REMCOUPS and 100/(1+YIELD)^REMCOUPS.
When I ran my code, the first error was:
Error in REMCOUPS : object 'REMCOUPS' not found
How can I achieve this?
Update
Here is further information.
REMCOUPS or REMCOUPS1 is the number of remaining coupons.
CUSIP is the bond identifier.
So, for each month, DATE, I am going to calculate the present value of all the remaining future cash flows. The number of future cash flows is REMCOUPS (or REMCOUPS1).
Thus, looking at the picture, let's suppose we are going to calculate the present value of a Bond(cusip id: 00130HBT1) at 2018-05-31, which has 10 remaining cash flows (REMCOUPS1). Then, the formula is:
The present Value of that bond at 2018-05-31 = (Cash flow)/(1+YIELD)^1 + (Cash flow)/(1+YIELD)^2 + ... + (Cash flow)/(1+YIELD)^REMCOUP1 + 100/(1+YIELD)^REMCOUP1
100/(1+YIELD)^REMCOUP1, this is the final payment of the bond's written price.
Since REMCOUPS1 is 10 in this case, I should calculate and add up 10 items in the above (11 items including the final bond's price).
The main problem for me is that since I am a beginner in R coming from SAS, I am not understanding the logic of R yet.
Here's a tidyverse approach. I'm understanding these to be bonds which pay out semiannually.
First, here's three made-up bonds -- a premium bond, a par bond, and a discount bond.
Bonds_2016_2020B <- data.frame(BOND = 1:3,
COUPON = c(0.04, 0.06, 0.08),
YIELD1 = c(0.03, 0.06, 0.10),
REMCOUPS = 5:7)
Now, I'll make a line for each coupon payment, calculate the discount rate at that time, and apply that discount to any coupon or principal payments due. The price is the sum of those discounted cash flows.
library(tidyverse)
Bonds_2016_2020B %>%
  # uncount, from the tidyr package in tidyverse, will copy each row REMCOUPS times
  uncount(REMCOUPS, .id = "COUPNUM", .remove = FALSE) %>%
  # each payment of coupon or principal will be discounted based on the # of semiannual periods. I'm assuming the first payment is 6 months away.
  mutate(discount = 1 / ((1 + YIELD1/2)^(COUPNUM)),
         COUP_disc = discount * COUPON/2,
         PRINC_disc = discount * if_else(COUPNUM == REMCOUPS, 1, 0)) %>%
  group_by(BOND) %>%
  summarize(price = 100*sum(COUP_disc + PRINC_disc)) %>%
  mutate(price_showing_more_digits = format(price, nsmall = 6))
Result
# A tibble: 3 x 3
BOND price price_showing_more_digits
* <int> <dbl> <chr>
1 1 102. "102.391322"
2 2 100. "100.000000"
3 3 94.2 " 94.213627"
BTW, this exactly matches my check in Excel.
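If you would rather stay in base R, the same per-row calculation can be sketched with a small function and mapply. The function name price_bond and the data frame name bonds are assumptions for illustration. The sketch makes the same assumptions as above: semiannual coupons, the first payment one period away, and a principal of 100.

```r
# The three made-up bonds from the answer above
bonds <- data.frame(BOND = 1:3,
                    COUPON = c(0.04, 0.06, 0.08),
                    YIELD1 = c(0.03, 0.06, 0.10),
                    REMCOUPS = 5:7)

# Hypothetical helper: price of one bond per 100 of principal
price_bond <- function(coupon, yield, n) {
  k <- seq_len(n)                      # coupon periods 1..n
  discount <- 1 / (1 + yield / 2)^k    # discount factor for each period
  # discounted coupons plus the principal discounted at the last period
  100 * (sum(discount * coupon / 2) + discount[n])
}

# mapply calls price_bond once per row, each with its own REMCOUPS
bonds$price <- mapply(price_bond, bonds$COUPON, bonds$YIELD1, bonds$REMCOUPS)
print(bonds)
```

The key difference from the loop in the question is that the number of iterations is taken from each row separately instead of from a single free-standing REMCOUPS object, which is what triggered the "object 'REMCOUPS' not found" error.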

R binning a categorical age group

I am trying to group twitterAge into categorical bins and add them as a new column in my dataframe that shows the Twitter age group, using groupings like the ones below:
['0-1', '1-2', '2-3', '3-4', '4-5', '5+']
'0-1' refers to ages equal to or older than 0 and younger than 1. The same logic applies to the other age groups: the next is equal to or older than 1 and younger than 2, and so on.
'5+' essentially refers to ages older than 5 but younger than 6.
My approach is like this, but I am afraid it is wrong:
breakpoints <- c(0,1,2,3,4,5,6)
name <- c('0-1','1-2','2-3','3-4','4-5','5+')
# my data file name is twitter_data
twitter_data$twitterAgeGroup <- cut(twitter_data$twitterAge, breaks = breakpoints, labels = name)
Would this approach be suitable?
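A minimal sketch of the cut() approach on made-up ages follows; the sample values and the name group_names are assumptions for illustration. By default cut() builds intervals that are open on the left and closed on the right, so an age of exactly 0 would become NA and an age of exactly 1 would land in '0-1'. Passing right = FALSE gives the "equal to or older than the lower bound, younger than the upper bound" behaviour described in the question.

```r
# Made-up ages, including values that sit exactly on a break
twitter_data <- data.frame(twitterAge = c(0, 0.5, 1, 2.7, 4.99, 5, 5.9))

breakpoints <- 0:6
group_names <- c("0-1", "1-2", "2-3", "3-4", "4-5", "5+")

# right = FALSE makes each bin [lower, upper): lower bound included
twitter_data$twitterAgeGroup <- cut(twitter_data$twitterAge,
                                    breaks = breakpoints,
                                    labels = group_names,
                                    right = FALSE)
print(twitter_data)
```

With these breaks an age of 6 or more falls outside every bin and becomes NA; if '5+' is meant to cover everything from 5 upwards, the last break could be replaced with Inf.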

Filter factor variable based on counts

I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)
La Jolla
2
Carlsbad
5
Esconsido
1
..etc
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID Beds Baths City Sub_area sqm... etc
1 4 2 San Diego La Jolla 100....
Then I can do
lm(price ~ beds + baths + city + sub_area)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.
One way:
areas <- names(which(table(df$Sub_area) > 10))
df$Sub_area[! df$Sub_area %in% areas] <- NA
Create a new dataframe with the number of occurrences for each subarea and keep the subareas that occur at least twice.
Then add NAs to the original dataframe if the subarea does not appear in the filtered sub_area_count.
library(dplyr)
sub_area_count <- df %>%
count(sub_area) %>%
filter(n > 1)
boo <- !df$sub_area %in% sub_area_count$sub_area
df[boo, ]$sub_area <- NA
You didn't give a reproducible example, but I think this will work for identifying the places whose count is 1:
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq==1)]
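Since the question has no reproducible example, here is a self-contained sketch on made-up data that puts the pieces together. The data frame contents and the name rare are assumptions for illustration. It sets sub-areas that occur only once to NA and then drops the unused factor levels, so that lm() does not carry them along.

```r
# Made-up data: one sub-area occurs only once
df <- data.frame(
  price    = c(500, 450, 300, 320, 310, 200),
  sub_area = factor(c("La Jolla", "La Jolla", "Carlsbad",
                      "Carlsbad", "Carlsbad", "Escondido"))
)

# Names of the sub-areas that appear exactly once
counts <- table(df$sub_area)
rare <- names(counts)[counts == 1]

# Blank out those entries but keep the rows themselves
df$sub_area[df$sub_area %in% rare] <- NA

# Remove the now-empty levels from the factor
df$sub_area <- droplevels(df$sub_area)
print(df)
```

Keep in mind that lm() drops rows containing NA by default, so if the other columns of those properties should still contribute to the fit, recoding the rare sub-areas to a single catch-all level may be a better choice than NA.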

how to compound the interest monthly in recurring deposit calculation?

How do I calculate a recurring deposit on a monthly basis?
M = ( R * [(1+r)^n - 1] ) / (1 - (1+r)^(-1/3))
M is the maturity value
R is the deposit amount
r is the rate of interest
n is the number of quarters
If I take n as 4 (the number of quarters in 1 year), it shows the yearly maturity value. Can anyone tell me how to do the monthly calculation? Thanks.
I'm not sure what the 1/3 is doing in the denominator, could you explain that? As integer division it will likely evaluate to 0 anyway.
That said, the formula for payment at the end of each payment interval is indeed
M = R * ( (1+r/p)^n-1 )/( (1+r/p) -1) = R * p/r * ( (1+r/p)^n-1 )
resulting in M = 125365.3694 for the given data;
and for payments at the start of each payment interval (month, quarter, ...)
M = R * (1+r/p)*( (1+r/p)^n-1 )/( (1+r/p) -1) = R * (p/r+1) * ( (1+r/p)^n-1 )
resulting in M = 126357.8452 for the given data.
Here p is the number of parts of the year that is used, i.e., p=4 for quarterly and p=12 for monthly, n is the number of payments, i.e., the payment schedule lasts n/p years, and then r is the nominal annual interest rate, used in r/p to give the interest rate over each part of the year.
Note that the effective interest rate (1+r/p)^p-1 depends on p, for p=1 it is r, for very large p it approaches exp(r)-1.
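The two closed-form expressions can be checked with a few lines of R. The input values below (a deposit of 10000 per month, a nominal annual rate of 9.5%, twelve monthly payments) are an assumption inferred from the figures used elsewhere in this answer, and the variable names are made up for illustration.

```r
# Assumed inputs (inferred, not stated in the question)
R_dep <- 10000   # deposit per payment interval
r     <- 0.095   # nominal annual interest rate
p     <- 12      # payment intervals per year (12 = monthly)
n     <- 12      # number of payments

# Common factor (1 + r/p)^n - 1
growth <- (1 + r / p)^n - 1

# Deposits made at the end of each interval
M_end <- R_dep * p / r * growth

# Deposits made at the start of each interval
M_start <- R_dep * (p / r + 1) * growth

print(c(end = M_end, start = M_start))
```

Switching from monthly to quarterly compounding only requires setting p to 4 and n to the number of quarterly payments.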
A more realistic result is obtained by taking the number of days in each month into account
days:=[31,28,31,30,31,30,31,31,30,31,30,31];
for k in [1..12] do
sum:=0;
for j in [1..12] do
sum+:=1;
sum*:=1+days[(j+k-1) mod 12 + 1]*0.095/365;
end for;
k, sum*10000;
end for;
This gives the maturity value if started in month[k], with k=1 corresponding to January:
1 126402.9195
2 126324.3970
3 126343.1642
4 126329.4573
5 126348.2653
6 126334.5983
7 126353.4478
8 126372.4494
9 126358.9711
10 126378.0173
11 126364.5825
12 126383.6740

Resources