I need to change the Month column to date format - r

My data set is monthly from Jan 1997 to Dec 2021. I need the Month column to be in a proper date format, but as.Date doesn't recognise the cell contents as they are. Please help.
Month BrentSpot GDP Agriculture Production Construction Services
1 Jan-1997 23.54 63.8229 53.5614 81.9963 87.2775 59.4453
2 Feb-1997 20.85 64.7182 53.9091 82.1917 87.8350 60.5018
3 Mar-1997 19.13 64.9264 54.2569 81.6142 88.6714 60.8375
4 Apr-1997 17.56 65.2327 55.1264 82.0006 89.5170 61.0981
5 May-1997 19.02 64.7336 55.8220 82.0093 89.8144 60.4470
6 Jun-1997 17.58 65.1322 56.3438 82.3350 89.4891 60.8886

library(lubridate)
Gdp_Brent_Table$Month <- seq(ymd('1997-01-01'), ymd('2021-12-01'), by = 'months')
(this seemed to do the trick)
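The sequence above overwrites the column, so it relies on the table having exactly one row per month, in order. If that is not guaranteed, a minimal base R sketch that parses each "Jan-1997" style label might look like the following. The vector month_label is a stand-in for the Month column, and matching against month.abb assumes English three-letter month names.

```r
# Stand-in for Gdp_Brent_Table$Month (assumed to be character, e.g. "Jan-1997")
month_label <- c("Jan-1997", "Feb-1997", "Dec-2021")

# Look up the month number from the English abbreviation (locale-independent)
mon  <- match(substr(month_label, 1, 3), month.abb)
year <- as.integer(substr(month_label, 5, 8))

# Build a Date for the first day of each month
month_date <- as.Date(sprintf("%04d-%02d-01", year, mon))
month_date
```

Unlike as.Date with the %b format, this does not depend on the session locale.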

Related

Extract year from complicated title in R

I have a data frame in R with several variables; right now I am concerned with two of them, Title and Date. Here is a short sample similar to the real data frame:
Title                                 Date
Veterans, Sacrame                     1997
Action Newsmaker                      2005
New Tri-Cable                         1990 mar
EFEST June 16, 1987                   28494
The Inhuman Perception: what we do    1999 june
New Tri-Cable                         2003 july/august
Interviews Concerning His/her         1991-1992
Festival EFEST June 6, 1997           83443
Intervention of the people            Undated
What I want is to create a new variable, year, that contains only the year (no day, month, or anything else).
I can extract the year from a date format or from text in one consistent format, but here it is different because the title is complicated and differs from row to row. Is there an easy way to create the variable year in R? When the Date variable holds some sort of date, I can extract the year from it. However, in some rows the Date is a value like 83443; there the year appears in the title, but the dataset is too large to extract it by hand.
Use mdy to convert to Date class and then year to extract the year.
library(lubridate)
year(mdy(dat1$Title, quiet = TRUE))
## [1] NA NA NA 1987 NA NA NA 1997 NA
Note
The data in reproducible form:
Lines <- "Title                                 Date
Veterans, Sacrame                     1997
Action Newsmaker                      2005
New Tri-Cable                         1990 mar
EFEST June 16, 1987                   28494
The Inhuman Perception: what we do    1999 june
New Tri-Cable                         2003 july/august
Interviews Concerning His/her         1991-1992
Festival EFEST June 6, 1997           83443
Intervention of the people            Undated"
L <- readLines(textConnection(Lines))
# split on the first run of two or more spaces, which separates Title from Date
dat1 <- read.csv(text = sub(" {2,}", ";", trimws(L)), sep = ";", stringsAsFactors = FALSE)
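The mdy approach only finds a year where the title contains a full month-day-year date. As a rough sketch of how the two columns could be combined, the following takes the first four-digit year found in Date and falls back to Title when Date has none. The helper first_year and the restriction to years starting with 19 or 20 are assumptions for illustration, not part of the original answer.

```r
title <- c("Veterans, Sacrame", "Action Newsmaker", "New Tri-Cable",
           "EFEST June 16, 1987", "The Inhuman Perception: what we do",
           "New Tri-Cable", "Interviews Concerning His/her",
           "Festival EFEST June 6, 1997", "Intervention of the people")
date  <- c("1997", "2005", "1990 mar", "28494", "1999 june",
           "2003 july/august", "1991-1992", "83443", "Undated")

# Hypothetical helper: first standalone 4-digit year (19xx or 20xx), else NA
first_year <- function(x) {
  hit <- regexpr("\\b(19|20)[0-9]{2}\\b", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)
  as.integer(out)
}

year_from_date  <- first_year(date)
year_from_title <- first_year(title)

# Prefer the Date column; use the Title only where Date has no year
year <- ifelse(is.na(year_from_date), year_from_title, year_from_date)
year
```

For a range such as 1991-1992 this keeps the first year, and rows with no year in either column stay NA.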

using difftime in R from one column

Hello, can you help me compute the differences (in hours) between consecutive values of a single column in R?
I use only base R. I would like to create a new column with the hours,
so that the column looks like
hours<-c(0,24,23,21,31,26,28)
time<-c('10. 4. 2018 10:16:11',
'11. 4. 2018 10:16:15',
'12. 4. 2018 10:13:31',
'13. 4. 2018 8:16:31',
'14. 4. 2018 15:16:21',
'15. 4. 2018 17:16:31',
'16. 4. 2018 19:15:31')
I have one column (time) and I would like to create a new column (hours).
Thanks.
Enhancing Sotos' approach,
c(0, round(diff(as.POSIXct(time, format = '%d. %m. %Y %H:%M:%S'), units = "hours")))
comes close to OP's expected result
[1] 0 24 24 22 31 26 26
Data
time <- c(
'10. 4. 2018 10:16:11',
'11. 4. 2018 10:16:15',
'12. 4. 2018 10:13:31',
'13. 4. 2018 8:16:31',
'14. 4. 2018 15:16:21',
'15. 4. 2018 17:16:31',
'16. 4. 2018 19:15:31'
)
Another way is the following.
First coerce to class POSIXct.
time <- as.POSIXct(time, format = "%d. %m. %Y %H:%M:%S")
Now use difftime with units = "hours". Without that argument difftime picks the units automatically, which here happens to be hours but would switch to days if every gap exceeded one day.
c(0, difftime(time[-1], time[-length(time)], units = "hours"))
#[1] 0.00000 24.00111 23.95444 22.05000 30.99722 26.00278 25.98333
The rounded output is simple to obtain.
round(c(0, difftime(time[-1], time[-length(time)], units = "hours")))
#[1] 0 24 24 22 31 26 26
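Since the question asks for a new column, here is a small base R sketch that stores the result in a data frame. The data frame name df and the choice of tz = "UTC" (to keep daylight-saving changes out of the differences) are assumptions.

```r
df <- data.frame(
  time = c("10. 4. 2018 10:16:11", "11. 4. 2018 10:16:15",
           "12. 4. 2018 10:13:31", "13. 4. 2018 8:16:31",
           "14. 4. 2018 15:16:21", "15. 4. 2018 17:16:31",
           "16. 4. 2018 19:15:31"),
  stringsAsFactors = FALSE
)

# Parse the timestamps; a fixed time zone avoids DST surprises (assumption)
stamp <- as.POSIXct(df$time, format = "%d. %m. %Y %H:%M:%S", tz = "UTC")

# Hours since the previous row, with 0 for the first row
df$hours <- c(0, as.numeric(difftime(stamp[-1], stamp[-length(stamp)],
                                     units = "hours")))
round(df$hours)
```

Wrapping difftime in as.numeric gives a plain numeric column rather than a difftime object.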

Rolling changes of values and percentage

I'm helping a friend with some R homework for an apparently badly taught R class (none of the material covered in the class or the supplementary material helps).
We have two datasets. One contains daily discrete returns of a company share in percent and the other contains daily exchange rates from two currencies, let's say USD to Swiss Franc. It looks like this:
Date Mon Day Exchangerate
2000 01 01 1.03405
2000 01 02 1.02987
2000 01 03 1.03021
2000 01 04 1.03456
2000 01 05 1.03200
And the daily discrete returns:
Date Share1
20000104 -0.03778
20000105 0.02154
20000106 0.01345
20000107 -0.01234
20000108 -0.01789
The task is to write a function that uses both matrices and calculates the daily returns from the perspective of a Swiss investor. We assume an initial investment of 1000 US Dollar.
I tried using tidyverse and calculate the changes in total return and percent changes from one day to another using the lag function from dplyr as in the code provided below.
library(tidyverse)
myCHFreturn <- function(matrix1, matrix2) {
  total = dplyr::right_join(matrix1, matrix2, by = "date") %>%
    dplyr::filter(!is.na(Share1)) %>%
    dplyr::select(-c(Date, Mon, Day)) %>%
    dplyr::mutate(rentShare1_usd = (1+Share1)*1000,
                  rentShare1_usd = dplyr::lag(rentShare1_usd) * (1+Share1),
                  rentShare1_chf = rentShare1_usd*Exchangerate,
                  rentShare1_chfperc =(rentShare1_chf - dplyr::lag(rentShare1_chf))/dplyr::lag(rentShare1_chf),
                  rentShare1_chfperc = rentShare1_chfperc*100)
}
The problem is that the rentShare1_usd = dplyr::lag(rentShare1_usd) * (1+Share1) part of the function relies on the values calculated for the initial 1000 US Dollar investment. Thus, my perception is that we need some type of rolling calculation of the changes, based on the initial investment. However, I don't know how to implement this in the function, since I've only worked with rolling means. We want to calculate the daily returns based on the change given in Variable Share1 and the value of the investment of the previous day. Any help is very much appreciated.
At least to point you to part of a solution, the value of a unit share on any one day is the cumulative product from the start date to that date of (1 + daily_discrete_return) over the time period concerned. To take an example using an extended version of your daily discrete returns table:
df = read.table(text = "Date Share1
20000104 -0.03778
20000105 0.02154
20000106 0.01345
20000107 -0.01234
20000108 -0.01789
20000109 0.02154
20000110 0.01345
20000111 0.02154
20000112 0.02154
20000113 0.01345", header = TRUE, stringsAsFactors = FALSE)
library(dplyr)
Shares = 1000
df1 = mutate(df, ShareValue = cumprod(1+Share1) * Shares)
Date Share1 ShareValue
1 20000104 -0.03778 962.2200
2 20000105 0.02154 982.9462
3 20000106 0.01345 996.1668
4 20000107 -0.01234 983.8741
5 20000108 -0.01789 966.2726
6 20000109 0.02154 987.0862
7 20000110 0.01345 1000.3625
8 20000111 0.02154 1021.9103
9 20000112 0.02154 1043.9222
10 20000113 0.01345 1057.9630
Once you've got a table with the share value as at each date, you can join it back to your exchange rate table to calculate the Swiss currency equivalent for that date, and extend it to do percentage changes and so on.
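A rough base R sketch of that join, using the sample rows from the question, could look like this. It assumes Exchangerate is quoted as Swiss francs per US dollar and that the Date, Mon and Day columns of the exchange rate table hold year, month and day; the column names Key, ValueCHF and ReturnCHFpct are made up for illustration.

```r
returns <- data.frame(
  Date   = c(20000104, 20000105, 20000106, 20000107, 20000108),
  Share1 = c(-0.03778, 0.02154, 0.01345, -0.01234, -0.01789)
)
fx <- data.frame(
  Date = 2000, Mon = 1, Day = 1:5,
  Exchangerate = c(1.03405, 1.02987, 1.03021, 1.03456, 1.03200)
)

# Value in USD of an initial 1000 USD investment, compounded daily
returns$ShareValue <- cumprod(1 + returns$Share1) * 1000

# Build a yyyymmdd key so the two tables can be matched
fx$Key <- fx$Date * 10000 + fx$Mon * 100 + fx$Day

# Keep only days present in both tables
chf <- merge(returns, fx[, c("Key", "Exchangerate")],
             by.x = "Date", by.y = "Key")

# Value in CHF (assumes the rate is CHF per 1 USD) and daily % change
chf$ValueCHF     <- chf$ShareValue * chf$Exchangerate
chf$ReturnCHFpct <- c(NA, diff(chf$ValueCHF) / head(chf$ValueCHF, -1) * 100)
chf
```

With the sample data only 4 and 5 January appear in both tables, so the result has two rows and a single percentage change.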

Moving average over 5 years with irregular dates

I have a large number of files (~1200), each of which contains a long time series of groundwater heights. The starting date and length of the series differ between files. There can be large gaps between dates, for example (a small part of such a file):
Date Height (cm)
14-1-1980 7659
28-1-1980 7632
14-2-1980 7661
14-3-1980 7638
28-3-1980 7642
14-4-1980 7652
25-4-1980 7646
14-5-1980 7635
29-5-1980 7622
13-6-1980 7606
27-6-1980 7598
14-7-1980 7654
28-7-1980 7654
14-8-1980 7627
28-8-1980 7600
12-9-1980 7617
14-10-1980 7596
28-10-1980 7601
14-11-1980 7592
28-11-1980 7614
11-12-1980 7650
29-12-1980 7670
14-1-1981 7698
28-1-1981 7700
13-2-1981 7694
17-3-1981 7740
30-3-1981 7683
14-4-1981 7692
14-5-1981 7682
15-6-1981 7696
17-7-1981 7706
28-7-1981 7699
28-8-1981 7686
30-9-1981 7678
17-11-1981 7723
11-12-1981 7803
18-2-1982 7757
16-3-1982 7773
13-5-1982 7753
11-6-1982 7740
14-7-1982 7731
15-8-1982 7739
14-9-1982 7722
14-10-1982 7794
15-11-1982 7764
14-12-1982 7790
14-1-1983 7810
28-3-1983 7836
28-4-1983 7815
31-5-1983 7857
29-6-1983 7801
28-7-1983 7774
24-8-1983 7758
28-9-1983 7748
26-10-1983 7727
29-11-1983 7782
27-1-1984 7801
28-3-1984 7764
27-4-1984 7752
28-5-1984 7795
27-7-1984 7748
27-8-1984 7729
28-9-1984 7752
26-10-1984 7789
28-11-1984 7797
18-12-1984 7781
28-1-1985 7833
21-2-1985 7778
22-4-1985 7794
28-5-1985 7768
28-6-1985 7836
26-8-1985 7765
19-9-1985 7760
31-10-1985 7756
26-11-1985 7760
20-12-1985 7781
17-1-1986 7813
28-1-1986 7852
26-2-1986 7797
25-3-1986 7838
22-4-1986 7807
27-5-1986 7785
24-6-1986 7787
26-8-1986 7744
23-9-1986 7742
22-10-1986 7752
1-12-1986 7749
17-12-1986 7758
I want to calculate the average height over 5 years. So, in the case of the example: 14-1-1980 + 5 years, 14-1-1985 + 5 years, and so on. The number of data points differs for each average. It is very likely that the date 5 years later will not be in the dataset as a data point. Hence, I think I need to tell R somehow to take an average over a certain time span.
I searched the internet but didn't find anything that fitted my needs. I came across a lot of useful packages like uts, zoo and lubridate, and the function aggregate. Instead of getting closer to the solution, I get more and more confused about which approach is best for my problem.
Thanks a lot in advance!
As @vagabond points out, you'll want to combine your 1200 files into a single data frame (the plyr package would allow you to do something simple like: data.all <- adply(dir([DATA FOLDER]), 1, read.csv)).
Once you have the data, the first step would be to transform the Date column into proper Date data. Right now the data appear to be strings, and we want them to have an underlying numerical representation (which the Date class provides):
library(lubridate)
df$date.new <- as.Date(dmy(df$Date))
Date Height date.new
1 14-1-1980 7659 1980-01-14
2 28-1-1980 7632 1980-01-28
3 14-2-1980 7661 1980-02-14
4 14-3-1980 7638 1980-03-14
5 28-3-1980 7642 1980-03-28
6 14-4-1980 7652 1980-04-14
Note that the date.new column looks like a string, but is in fact Date data, and can be handled with numerical operations (addition, comparison, etc.).
Next, we might construct a set of date periods, over which we want to compute averages. Your example mentions 5 years, but with the data you provided, that's not a very illustrative example. So here I'm creating 1-year periods starting at every day between Jan 14 1980 and Jan 14 1985
date.start <- as.Date(as.Date('1980-01-14') : as.Date('1985-01-14'), origin = '1970-01-01')
date.end <- date.start %m+% years(1)  # %m+% maps Feb 29 to Feb 28 of the next year; plain + would give NA
dates <- data.frame(start = date.start, end = date.end)
start end
1 1980-01-14 1981-01-14
2 1980-01-15 1981-01-15
3 1980-01-16 1981-01-16
4 1980-01-17 1981-01-17
5 1980-01-18 1981-01-18
6 1980-01-19 1981-01-19
Then we can use the dplyr package to move through each row of this data frame and compute a summary average of Height:
library(dplyr)
df.mean <- dates %>%
group_by(start, end) %>%
summarize(height.mean = mean(df$Height[df$date.new >= start & df$date.new < end]))
start end height.mean
<date> <date> <dbl>
1 1980-01-14 1981-01-14 7630.273
2 1980-01-15 1981-01-15 7632.045
3 1980-01-16 1981-01-16 7632.045
4 1980-01-17 1981-01-17 7632.045
5 1980-01-18 1981-01-18 7632.045
6 1980-01-19 1981-01-19 7632.045
The foverlaps function is IMHO the perfect candidate for such a situation:
library(data.table)
library(lubridate)
# convert to a data.table with setDT()
# convert the 'Date'-column to date-format
# create a begin & end date for the required period
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
][, `:=` (begindate = Date, enddate = Date + years(1))]
# set the keys (necessary for the foverlaps function)
setkey(dat, begindate, enddate)
res <- foverlaps(dat, dat, by.x = c(1,3))[, .(moving.average = mean(i.Height)), Date]
the result:
> head(res,15)
Date moving.average
1: 1980-01-14 7633.217
2: 1980-01-28 7635.000
3: 1980-02-14 7637.696
4: 1980-03-14 7636.636
5: 1980-03-28 7641.273
6: 1980-04-14 7645.261
7: 1980-04-25 7644.955
8: 1980-05-14 7646.591
9: 1980-05-29 7647.143
10: 1980-06-13 7648.400
11: 1980-06-27 7652.900
12: 1980-07-14 7655.789
13: 1980-07-28 7660.550
14: 1980-08-14 7660.895
15: 1980-08-28 7664.000
Now you have, for each date, an average of all the values that lie between that date and one year ahead of it.
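For comparison, the same idea can be sketched in base R without data.table. The function name rolling_mean_ahead is hypothetical, and the window is taken as inclusive at both ends, which is an assumption; the four rows are taken from the sample in the question.

```r
dates  <- as.Date(c("14-1-1980", "14-7-1980", "14-1-1981", "17-7-1981"),
                  format = "%d-%m-%Y")
height <- c(7659, 7654, 7698, 7706)

# For each observation, average all values from that date up to
# `years` years later (both ends included)
rolling_mean_ahead <- function(dates, values, years = 1) {
  vapply(seq_along(dates), function(i) {
    window_end <- seq(dates[i], by = paste(years, "years"), length.out = 2)[2]
    mean(values[dates >= dates[i] & dates <= window_end])
  }, numeric(1))
}

rolling_mean_ahead(dates, height, years = 1)
```

Setting years = 5 gives the 5-year window from the question. This loops over every row, so on long series it will be slower than foverlaps.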
Hey, I just tried this after seeing your question! I ran it on a sample data frame. Try it on yours once you understand the code, and then let me know!
By the way, instead of an interval of 5 years, I used just 2 months (2*30 days, approximately 2 months) as the interval!
df = data.frame(Date = c("14-1-1980", "28-1-1980", "14-2-1980", "14-3-1980", "28-3-1980",
                         "14-4-1980", "25-4-1980", "14-5-1980", "29-5-1980", "13-6-1980",
                         "27-6-1980", "14-7-1980", "28-7-1980", "14-8-1980"), height = 1:14)
# as.Date(df$Date, "%d-%m-%Y")
df1 = data.frame(orig = NULL, dest = NULL, avg_ht = NULL)
orig = as.Date(df$Date, "%d-%m-%Y")[1]
dest = as.Date(df$Date, "%d-%m-%Y")[1] + 2*30 #approx 2 months
dest_final = as.Date(df$Date, "%d-%m-%Y")[14]
while (dest < dest_final){
  # mean height of all rows with orig <= date < dest
  m = mean(df$height[which(as.Date(df$Date, "%d-%m-%Y")>=orig &
                           as.Date(df$Date, "%d-%m-%Y")<dest )])
  df1 = rbind(df1,data.frame(orig=orig,dest=dest,avg_ht=m))
  # move the window forward by one interval
  orig = dest
  dest = dest + 2*30
  print(paste("orig:",orig, " + ","dest:",dest))
}
> df1
orig dest avg_ht
1 1980-01-14 1980-03-14 2.0
2 1980-03-14 1980-05-13 5.5
3 1980-05-13 1980-07-12 9.5
I hope this works for you as well
This is my best try, but please keep in mind that I am working with the years instead of the full date, i.e. based on the example you provided I am averaging over beginning of 1980- end of 1984.
dat <- read.csv("paixnidi.csv")
# install.packages("stringr")  # run once if stringr is not installed
library(stringr)
# extract the year of each measurement (the last four characters of the date)
years <- as.integer(str_sub(dat[, 1], start = -4))
spread_y <- years[length(years)] - years[1]
ind <- list()
# find how many 5-year intervals there are
groups <- ceiling((spread_y + 1) / 5)
meangroups <- matrix(0, ncol = 2, nrow = groups)
k <- 0
for (i in 1:groups) {
  # extract the indices of the dates vector within the 5-year period
  ind[[i]] <- which(years >= (years[1] + k) & years <= (years[1] + k + 4))
  meangroups[i, 2] <- mean(dat[ind[[i]], 2])
  meangroups[i, 1] <- years[1] + k
  k <- k + 5
}
colnames(meangroups) <- c("Year:Year+4", "Mean Height (cm)")
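If the blocks should start on the exact first date rather than on calendar years (14-1-1980, 14-1-1985, and so on, as in the question), one possible base R sketch uses seq with by = "5 years" and findInterval. The five rows are taken from the sample in the question; the variable names are illustrative.

```r
dates  <- as.Date(c("14-1-1980", "29-12-1980", "18-12-1984",
                    "28-1-1985", "17-12-1986"), format = "%d-%m-%Y")
height <- c(7659, 7670, 7781, 7833, 7758)

# Start dates of consecutive 5-year blocks, beginning at the first observation
starts <- seq(min(dates), max(dates), by = "5 years")

# Index of the block each observation falls into
bin <- findInterval(as.numeric(dates), as.numeric(starts))

# Mean height per block, labelled by the block's start date
avg <- tapply(height, format(starts[bin]), mean)
avg
```

The number of observations per block is free to vary, and no observation needs to fall exactly on a block boundary.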

Differences of values based on mean months and rolling data

I am trying to do something which seems simple but is proving a bit of a challenge so I hope someone can help!
I have a time series of observations of temperature:
Lines <-"1971-01-17 298.9197
1971-01-17 298.9197
1971-02-16 299.0429
1971-03-17 299.0753
1971-04-17 299.3250
1971-05-17 299.5606
1971-06-17 299.2380
2010-07-14 298.7876
2010-08-14 298.5529
2010-09-14 298.3642
2010-10-14 297.8739
2010-11-14 297.7455
2010-12-14 297.4790"
DF <- read.table(textConnection(Lines), col.names = c("Date", "Value"))
DF$Date <- as.Date(DF$Date)
mean.ts <- aggregate(DF["Value"], format(DF["Date"], "%m"), mean)
This produces:
> mean.ts
Date Value
1 01 1.251667
2 02 1.263333
This is just an example; my data covers many years, so I can calculate a full monthly average of the data.
What I then want to do is calculate the difference between each individual January and the mean January I have calculated above.
If I move away from the Date/Time classes I could do this with some loops, but I want to see if there is a "neat" way to do this in R. Any ideas?
You can just add the year as an aggregating variable. This is easier using the formula interface:
> aggregate(Value~format(Date,"%m")+format(Date,"%Y"),data=DF,mean)
format(Date, "%m") format(Date, "%Y") Value
1 01 1971 298.9197
2 02 1971 299.0429
3 03 1971 299.0753
4 04 1971 299.3250
5 05 1971 299.5606
6 06 1971 299.2380
7 07 2010 298.7876
8 08 2010 298.5529
9 09 2010 298.3642
10 10 2010 297.8739
11 11 2010 297.7455
12 12 2010 297.4790
At least as I understand your question, you want the difference of each month from the mean of those months, so you probably want to use ave rather than aggregate:
diff.mean.ts <- ave(DF[["Value"]],
list(format(DF[["Date"]], "%m")), FUN=function(x) x-mean(x) )
If you wanted it in the same dataframe, then just assign it as a column:
DF$diff.mean.ts <- diff.mean.ts
The ave function is well suited to adding columns to existing data frames because it returns a vector of the same length as its first argument, in this case DF[["Value"]]. In the present instance it returns all 0's, which is the correct answer because each month has only one distinct value.
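To see non-zero differences, the same call can be tried on a small made-up data set with two years per month. The values below are hypothetical and chosen only so the arithmetic is easy to check.

```r
# Hypothetical data: January and February observed in two different years
DF <- data.frame(
  Date  = as.Date(c("1971-01-17", "1971-02-16", "1972-01-17", "1972-02-16")),
  Value = c(10, 20, 12, 26)
)

# Deviation of each observation from the mean of its calendar month
DF$diff.mean.ts <- ave(DF$Value, format(DF$Date, "%m"),
                       FUN = function(x) x - mean(x))
DF
```

Here the January mean is 11 and the February mean is 23, so the new column holds -1, -3, 1 and 3.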
