How can I convert a chr column into a date in R? - r

I have a data frame in R (imported from an xlsx file) whose first column contains dates stored as character type:
> dat
# A tibble: 4,372 x 3
date `1` `2`
<chr> <dbl> <dbl>
1 40544 35.5 35.5
2 40545 35.6 35.8
3 40546 37.2 36.4
4 40547 36.7 35.4
5 40548 36.6 35.3
I want to convert the character column to Date type in R.
I tried the code below, but it produces NAs:
> dat%>%
+ mutate(date2 = as.Date(date, format= "%m-%d-%Y"))
# A tibble: 4,372 x 4
date `1` `2` date2
<chr> <dbl> <dbl> <date>
1 40544 35.5 35.5 NA
2 40545 35.6 35.8 NA
3 40546 37.2 36.4 NA
4 40547 36.7 35.4 NA
5 40548 36.6 35.3 NA
How can I fix this?
Any help?

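Is this what you're looking for? The values are Excel serial day numbers, so convert them to numeric and pass the Excel origin to as.Date. For files written by Excel on Windows the origin is "1899-12-30" (this absorbs Excel's fictitious 29 February 1900); an origin of "1900-01-01" would put every date two days late.
dat %>%
  mutate(date2 = as.numeric(date),
         date2 = as.Date(date2, origin = "1899-12-30"))
date date2
1 40544 2011-01-01
2 40545 2011-01-02
3 40546 2011-01-03
4 40547 2011-01-04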
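The original attempt returned NA because a string such as "40544" does not match the format "%m-%d-%Y"; it is a day count, not a formatted date. Here is a minimal base R sketch of the conversion, assuming the workbook uses the Windows Excel 1900 date system (effective origin 1899-12-30). The helper name excel_to_date is made up for illustration.

```r
# Excel serial day numbers stored as text, as in the question
serial_chr <- c("40544", "40545", "40546", "40547")

# Convert text -> numeric -> Date.
# Assumption: Windows Excel 1900 date system, whose effective
# origin for as.Date() is 1899-12-30.
excel_to_date <- function(x, origin = "1899-12-30") {
  as.Date(as.numeric(x), origin = origin)
}

dates <- excel_to_date(serial_chr)
print(dates)
```

If the file came from a workbook that uses the 1904 date system, the origin would differ, so it is worth checking one known date by hand.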

Related

Add new variable with arithmetic conditions

The randomly generated data frame contains ID, Date, and Earning. I reshaped the data frame so that each column represents a date and its values correspond to the earnings.
I want to create a new variable named "Date_over100" that gives the date on which each ID's cumulative earnings first exceed 100. Below is reproducible code that generates the data frame. I assume conditional statements or loops would be needed to achieve this. I would appreciate any help. Thanks in advance!
library(dplyr)
library(tidyr)
ID <- c(1:10)
Date <- sample(seq(as.Date('2021/01/01'), as.Date('2021/01/11'), by = "day"), 10)
Earning <- round(runif(10, 30, 50), digits = 2)
df <- data.frame(ID, Date, Earning, check.names = F)
df1 <- df %>%
  arrange(Date) %>%
  pivot_wider(names_from = Date, values_from = Earning)
df1 <- as.data.frame(df1)
df1[is.na(df1)] <- round(runif(sum(is.na(df1)), min = 30, max = 50), digits = 2)
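I go back to long format for the calculation, then join the result to the wide data. Note that the cumulative sum follows the column order of df1; sort by date within each ID first if you need strict chronological order.
library(dplyr)
library(tidyr)
df1 %>%
  # the column names become Date values in the new "date" column
  pivot_longer(cols = -ID, names_to = "date",
               names_transform = list(date = as.Date)) %>%
  group_by(ID) %>%
  # first date at which the running total exceeds 100
  summarize(Date_over_100 = date[which.max(cumsum(value) > 100)]) %>%
  right_join(df1, by = "ID")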
# # A tibble: 10 × 12
# ID Date_over_100 `2021-01-04` `2021-01-01` `2021-01-08` `2021-01-11` `2021-01-02` `2021-01-09`
# <int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2021-01-08 45.0 46.2 40.1 47.4 47.5 48.8
# 2 2 2021-01-08 36.7 30.3 36.2 47.5 41.4 41.7
# 3 3 2021-01-08 49.5 46.0 45.0 43.9 45.4 37.1
# 4 4 2021-01-08 31.0 48.7 47.3 40.4 40.8 35.5
# 5 5 2021-01-08 48.2 35.2 32.1 44.2 35.4 49.7
# 6 6 2021-01-08 40.8 37.6 31.8 40.3 38.3 42.5
# 7 7 2021-01-08 37.9 42.9 36.8 46.0 39.8 33.6
# 8 8 2021-01-08 47.7 47.8 39.7 46.4 43.8 46.5
# 9 9 2021-01-08 32.9 42.0 41.8 32.8 33.9 35.5
# 10 10 2021-01-08 34.5 40.1 42.7 35.9 44.8 31.8
# # … with 4 more variables: 2021-01-10 <dbl>, 2021-01-03 <dbl>, 2021-01-07 <dbl>, 2021-01-05 <dbl>
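The same idea can be sketched in base R on long-format data: sort by date, take the running total, and pick the first date where it passes the threshold. The toy data and the helper first_date_over below are made up for illustration; the helper returns NA when the threshold is never reached, a case that which.max alone would silently map to the first date.

```r
# Hypothetical toy data: one row per ID and date, already in long format
earnings <- data.frame(
  ID      = rep(1:2, each = 4),
  Date    = rep(as.Date("2021-01-01") + 0:3, times = 2),
  Earning = c(40, 35, 30, 45,   # ID 1
              50, 55, 20, 10)   # ID 2
)

# First date on which the chronological running total exceeds the threshold
first_date_over <- function(dates, values, threshold = 100) {
  ord <- order(dates)
  hit <- which(cumsum(values[ord]) > threshold)
  if (length(hit) == 0) return(as.Date(NA))
  dates[ord][hit[1]]
}

# Apply the helper per ID and bind the per-ID results together
result <- do.call(rbind, lapply(split(earnings, earnings$ID), function(g) {
  data.frame(ID = g$ID[1],
             Date_over_100 = first_date_over(g$Date, g$Earning))
}))
print(result)
```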

Loop to sum weekly rolling average

I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous 6 days as well as the current day (rolling weekly average), and print it to an array for the designated water year. I have already created an aggregate function to separate yearly average daily means into their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  POSIDATE <- as.POSIXlt(dates)
  # Year offset: months from start_month onward belong to the next water year
  offset <- ifelse(POSIDATE$mon >= start_month - 1, 1, 0)
  # Water year
  POSIDATE$year + 1900 + offset
}
adj.year <- wtr_yr(NEW_DATE)
# Aggregating by water year to take the mean
mean.FLOW <- aggregate(data_set$FLOW, list(adj.year), mean)
This can be done much more simply.
First, let me generate a larger sample data set.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's aggregate by year and week number. Note that this gives one mean per calendar week, not a rolling 7-day mean.
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
Note that with the week function, week 1 always starts on January 1st. If you want to number the weeks according to the ISO 8601 standard, use the isoweek function. Alternatively, epiweek follows the US CDC convention, in which weeks start on Sunday.
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
To see how these functions differ, compare their results side by side:
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
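The question asks for a rolling figure (the current day plus the previous 6 days) rather than one value per calendar week. Here is a minimal base R sketch of a trailing 7-day mean using cumulative sums instead of an explicit loop. The helper name rolling_mean is made up for illustration, and the sketch assumes one row per day with no missing values.

```r
# Hypothetical helper: trailing mean over the current value and the
# (window - 1) values before it; the first window - 1 results are NA.
rolling_mean <- function(x, window = 7) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n < window) return(out)
  cs <- c(0, cumsum(x))          # cs[k + 1] holds sum(x[1:k])
  idx <- window:n
  out[idx] <- (cs[idx + 1] - cs[idx + 1 - window]) / window
  out
}

# The first ten FLOW values from the question
flow <- c(88.2, 77.6, 68.4, 61.5, 55.3, 52.5, 49.7, 46.7, 43.3, 41.3)
roll7 <- rolling_mean(flow, window = 7)
print(round(roll7, 2))
```

The result can be stored as a new column next to FLOW and then grouped by water year. If the series has gaps or NA values, the cumulative sum will carry them forward, so those need handling first.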

How to find the mean value using multiple columns of an R data.frame?

I am trying to find the mean of A and Z for each row and save it as a separate column, but it seems the code only averages the first row and fills the rest of the rows with that value. Any suggestions on how to fix this?
library(tidyverse)
library(lubridate)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
A = runif(1095, 1,60),
Z = runif(1095, 5,100)) %>%
mutate(MeanofAandZ= mean(A:Z))
Are you looking for this:
DF %>% rowwise() %>% mutate(MeanofAandZ = mean(c_across(A:Z)))
# A tibble: 1,095 x 4
# Rowwise:
Date A Z MeanofAandZ
<date> <dbl> <dbl> <dbl>
1 2001-01-01 26.5 7.68 17.1
2 2001-01-02 54.9 33.1 44.0
3 2001-01-03 37.1 82.0 59.5
4 2001-01-04 6.91 18.0 12.4
5 2001-01-05 53.0 8.76 30.9
6 2001-01-06 26.1 7.63 16.9
7 2001-01-07 59.3 30.8 45.0
8 2001-01-08 39.9 14.6 27.3
9 2001-01-09 59.2 93.6 76.4
10 2001-01-10 30.7 89.1 59.9
You can do it with base R's rowMeans.
Full base R:
DF$MeanofAandZ <- rowMeans(DF[c("A", "Z")])
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
or inside a mutate:
library(dplyr)
DF <- DF %>% mutate(MeanofAandZ = rowMeans(cbind(A,Z)))
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
We can also do
DF$MeanofAandZ <- Reduce(`+`, DF[c("A", "Z")])/2
Or using apply
DF$MeanofAandZ <- apply(DF[c("A", "Z")], 1, mean)
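The reason the original attempt fills every row with the same number is that mean() reduces whatever it is given to a single value, which mutate() then recycles down the column. A small base R sketch of the contrast, using a 5-row stand-in for the question's data (the variable names below are illustrative):

```r
set.seed(123)
DF <- data.frame(A = runif(5, 1, 60), Z = runif(5, 5, 100))

# mean() reduces its whole input to one number...
single_value <- mean(c(DF$A, DF$Z))

# ...whereas these stay one value per row
by_rowmeans <- rowMeans(DF[c("A", "Z")])
by_arith    <- (DF$A + DF$Z) / 2

DF$MeanofAandZ <- by_rowmeans
print(DF)
```

For exactly two columns the plain arithmetic version is the simplest; rowMeans scales better when more columns are involved.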

Taking the difference between two dates with dplyr

I have the following data:
# A tibble: 7,971 x 10
symbol date open high low close volume adjusted start_date end_date
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <date>
1 AAPL 2009-01-02 12.3 13.0 12.2 13.0 186503800 11.4 2009-07-31 2010-06-30
2 AAPL 2009-01-05 13.3 13.7 13.2 13.5 295402100 11.8 2009-07-31 2010-06-30
3 AAPL 2009-01-06 13.7 13.9 13.2 13.3 322327600 11.6 2009-07-31 2010-06-30
4 AAPL 2009-01-07 13.1 13.2 12.9 13.0 188262200 11.4 2009-07-31 2010-06-30
5 AAPL 2009-01-08 12.9 13.3 12.9 13.2 168375200 11.6 2009-07-31 2010-06-30
6 AAPL 2009-01-09 13.3 13.3 12.9 12.9 136711400 11.3 2009-07-31 2010-06-30
7 AAPL 2009-01-12 12.9 13.0 12.5 12.7 154429100 11.1 2009-07-31 2010-06-30
8 AAPL 2009-01-13 12.6 12.8 12.3 12.5 199599400 11.0 2009-07-31 2010-06-30
9 AAPL 2009-01-14 12.3 12.5 12.1 12.2 255416000 10.7 2009-07-31 2010-06-30
10 AAPL 2009-01-15 11.5 12.0 11.4 11.9 457908500 10.4 2009-07-31 2010-06-30
I am trying to group by symbol, start date and end date, and take the difference between the first observation on the start date and the last observation on the end date. I just can't seem to get it working.
That is, take the difference between the "close" on the start date and the "close" on the end date.
Any help would be great, thanks!
syms <- c("AAPL", "MSFT", "GOOG")
library(tidyquant)
data <- tq_get(syms)
data <- data %>%
  # note: start_date is the date from which we calculate the returns; we will
  # have bought this portfolio on 1 July, but we get returns on the 31st
  mutate(start_date = paste(year(date %m+% months(6)), "07", "31", sep = "-"),
         end_date = paste(year(date %m+% months(18)), "06", "30", sep = "-"),
         start_date = as.Date(start_date),
         end_date = as.Date(end_date))
My attempt...
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = diff(close))
EDIT:
I am trying to group by symbol and then compare the close at start_date with the close at end_date. So first I should group by symbol and filter the date column down to the start_date and end_date values, i.e. I am only interested in the "close" price on the start_date and end_date days (which are fixed). Then I just take the difference between those two close prices. Most of the stock price data is therefore irrelevant here.
I think what you are looking for is the difference between the first and last close value in each group:
library(dplyr)
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = first(close) - last(close))
# symbol start_date end_date diff
# <chr> <date> <date> <dbl>
# 1 AAPL 2009-07-31 2010-06-30 -7.38
# 2 AAPL 2010-07-31 2011-06-30 -15.5
# 3 AAPL 2011-07-31 2012-06-30 -12.5
# 4 AAPL 2012-07-31 2013-06-30 -34.4
# 5 AAPL 2013-07-31 2014-06-30 28.0
# 6 AAPL 2014-07-31 2015-06-30 -34.5
# 7 AAPL 2015-07-31 2016-06-30 -31.9
# 8 AAPL 2016-07-31 2017-06-30 31
# 9 AAPL 2017-07-31 2018-06-30 -48.1
#10 AAPL 2018-07-31 2019-06-30 -41.6
# … with 26 more rows
Another way to write it could be
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = close[1L] - close[n()])
Or it can be also done using base R aggregate
aggregate(close~symbol +start_date + end_date,data,function(x) x[1L] - x[length(x)])
You can take this approach (here df is the data object from the question):
# create df of unique symbol, start, and end date combos
df1 <- df %>% distinct(symbol,start_date,end_date)
# join original data that match the desired start/end dates
df1 <- df %>% select(start_close=close,symbol,start_date=date) %>% left_join(df1,.)
df1 <- df %>% select(end_close=close,symbol,end_date=date) %>% left_join(df1,.)
# find difference in close values
df1 %>% mutate(diff=end_close - start_close)
# A tibble: 36 x 6
symbol start_date end_date start_close end_close diff
<chr> <date> <date> <dbl> <dbl> <dbl>
1 AAPL 2009-07-31 2010-06-30 23.3 35.9 12.6
2 AAPL 2010-07-31 2011-06-30 NA 48.0 NA
3 AAPL 2011-07-31 2012-06-30 NA NA NA
4 AAPL 2012-07-31 2013-06-30 87.3 NA NA
5 AAPL 2013-07-31 2014-06-30 64.6 92.9 28.3
6 AAPL 2014-07-31 2015-06-30 95.6 125. 29.8
7 AAPL 2015-07-31 2016-06-30 121. 95.6 -25.7
8 AAPL 2016-07-31 2017-06-30 NA 144. NA
9 AAPL 2017-07-31 2018-06-30 149. NA NA
10 AAPL 2018-07-31 2019-06-30 190. NA NA
# ... with 26 more rows
There are NAs because not every start/end date appears in the original date column (for example, when it falls on a weekend or holiday).
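One possible way around those NAs, offered as a sketch rather than as part of the answers above, is to fall back to the last available trading day on or before each target date. The price series below is made up, and close_on_or_before is a hypothetical helper built on base R's findInterval.

```r
# Hypothetical price series with gaps (no rows for weekends or holidays);
# the close values are invented for illustration
prices <- data.frame(
  date  = as.Date(c("2009-07-29", "2009-07-30", "2009-07-31",
                    "2010-06-28", "2010-06-29", "2010-06-30",
                    "2010-07-29", "2010-07-30")),
  close = c(10, 11, 12, 20, 21, 22, 30, 31)
)

# Close on the latest date <= target; NA if target precedes all dates.
# Assumes `dates` is sorted in ascending order.
close_on_or_before <- function(target, dates, close) {
  i <- findInterval(as.numeric(target), as.numeric(dates))
  i[i == 0] <- NA
  close[i]
}

targets <- as.Date(c("2009-07-31", "2010-06-30", "2010-07-31", "2009-01-01"))
found <- close_on_or_before(targets, prices$date, prices$close)
print(found)

# Difference for the first period: close at end_date minus close at start_date
diff_first_period <- found[2] - found[1]
print(diff_first_period)
```

A target later than the last row would return the final close in the series, so it is worth checking that each target lies within the data's date range before trusting the result.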

Adding intermediate observations to a data frame (manual interpolating)

I've got a data frame like below with vector coordinates:
df <- structure(list(x0 = c(22.6, 38.5, 73.7), y0 = c(62.9, 56.6, 27.7
), x1 = c(45.8, 49.3, 80.8), y1 = c(69.9, 21.9, 14)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 3 x 4
x0 y0 x1 y1
<dbl> <dbl> <dbl> <dbl>
1 22.6 62.9 45.8 69.9
2 38.5 56.6 49.3 21.9
3 73.7 27.7 80.8 14
For visualisation purposes I need to manually interpolate points, i.e. add an intermediate row between each two rows of df, where the starting coordinates x0, y0 are the ending coordinates of original, previous row, while ending coordinates x1, y1 are the starting coordinates of original, next row. I also need to preserve information if an observation is from original dataset or it is manually added. So the expected output would be:
# A tibble: 5 x 5
x y pass_end_x pass_end_y source
<dbl> <dbl> <dbl> <dbl> <chr>
1 22.6 62.9 45.8 69.9 original
2 45.8 69.9 38.5 56.6 added
3 38.5 56.6 49.3 21.9 original
4 49.3 21.9 73.7 27.7 added
5 73.7 27.7 80.8 14 original
How can I do that in an efficient and elegant way (preferably in tidyverse)?
To do this, all I'm going to do is swap the column names of the start and end points, and then use lead to get the next value of x1 and y1. Then we just add the source tag, and bind_rows
library(tidyverse)
df2 <- df
names(df2) <- names(df2)[c(3,4,1,2)] # swap names
df2 <- df2 %>% mutate(x1 = lead(x1), y1 = lead(y1),source = "added")
df <- df %>% mutate(source = "original") %>% bind_rows(., df2)
Resulting in:
# A tibble: 6 x 5
x0 y0 x1 y1 source
<dbl> <dbl> <dbl> <dbl> <chr>
1 22.6 62.9 45.8 69.9 original
2 38.5 56.6 49.3 21.9 original
3 73.7 27.7 80.8 14 original
4 45.8 69.9 38.5 56.6 added
5 49.3 21.9 73.7 27.7 added
6 80.8 14 NA NA added
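If you need the rows in order, use these two lines in place of the mutate and bind_rows lines above (running both versions would apply lead twice):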
df2 <- df2 %>% mutate(x1 = lead(x1), y1 = lead(y1),source = "added", ID = seq(1,n()*2, by =2)+1)
df <- df %>% mutate(source = "original", ID = seq(1,n()*2, by =2)) %>% bind_rows(., df2) %>% arrange(ID)
# A tibble: 6 x 6
x0 y0 x1 y1 source ID
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 22.6 62.9 45.8 69.9 original 1
2 45.8 69.9 38.5 56.6 added 2
3 38.5 56.6 49.3 21.9 original 3
4 49.3 21.9 73.7 27.7 added 4
5 73.7 27.7 80.8 14 original 5
6 80.8 14 NA NA added 6
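The expected output in the question has 5 rows, without the trailing NA row. Here is a base R sketch that builds only the n - 1 connecting rows and interleaves them with the originals. It assumes at least two rows, and the intermediate names (original, added, pos) are illustrative.

```r
df <- data.frame(x0 = c(22.6, 38.5, 73.7), y0 = c(62.9, 56.6, 27.7),
                 x1 = c(45.8, 49.3, 80.8), y1 = c(69.9, 21.9, 14))
n <- nrow(df)

original <- cbind(df, source = "original")

# Connector i runs from the end of row i to the start of row i + 1,
# so there are only n - 1 of them and no NA row is created.
added <- data.frame(x0 = df$x1[-n], y0 = df$y1[-n],
                    x1 = df$x0[-1], y1 = df$y0[-1],
                    source = "added")

combined <- rbind(original, added)

# Interleave: originals at positions 1, 3, 5, ...; connectors at 2, 4, ...
pos <- c(seq(1, by = 2, length.out = n), seq(2, by = 2, length.out = n - 1))
result <- combined[order(pos), ]
rownames(result) <- NULL
print(result)
```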
