Replace missing values in R dataframe - r

I have data:
Date          Price
"2021-01-01"  1
"2021-01-02"  NA
"2021-01-03"  NA
"2021-01-04"  NA
"2021-01-05"  NA
"2021-01-06"  6
"2021-01-07"  NA
"2021-01-08"  NA
"2021-01-09"  3
And I would like to replace missing values with means, so that the end result would look like this:
Date          Price
"2021-01-01"  1
"2021-01-02"  2
"2021-01-03"  3
"2021-01-04"  4
"2021-01-05"  5
"2021-01-06"  6
"2021-01-07"  5
"2021-01-08"  4
"2021-01-09"  3

You can use zoo::na.approx:
library(zoo)
na.approx(dat$Price)
# [1] 1 2 3 4 5 6 5 4 3

One way would be to use na_interpolation from imputeTS library:
imputeTS::na_interpolation(c(1, NA, NA, 4))
# 1 2 3 4
imputeTS::na_interpolation(c(6, NA, NA, 3))
# 6 5 4 3
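
Both answers above rely on add-on packages. As a minimal sketch, the same linear interpolation can be done in base R with approx(). The data frame name dat and the use of the row position as the x axis are assumptions made for illustration; with irregularly spaced dates the numeric dates would be a more natural x.

```r
# Data from the question; Date is kept as character for brevity.
dat <- data.frame(
  Date  = sprintf("2021-01-%02d", 1:9),
  Price = c(1, NA, NA, NA, NA, 6, NA, NA, 3)
)

# approx() ignores the NA pairs and interpolates linearly at the positions in xout.
# Assumption: rows are equally spaced in time, so the row index serves as x.
idx <- seq_along(dat$Price)
dat$Price <- approx(x = idx, y = dat$Price, xout = idx)$y
dat$Price
# [1] 1 2 3 4 5 6 5 4 3
```

With the default rule = 1, approx() returns NA for positions before the first or after the last observed value, so leading and trailing gaps stay missing.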

I assume you have several price columns and want a new column, Price, that holds their row-wise mean with NA values ignored.
library(dplyr)
Date <- c("2021-01-01","2021-01-02","2021-01-03","2021-01-04","2021-01-05",
"2021-01-06", "2021-01-07", "2021-01-08","2021-01-09", "2021-01-08","2021-01-09")
your.price.col1 <- c(floor(runif(9,0,100)),NA,NA)
your.price.col2 <- c(floor(runif(9,0,100)),33,44)
df <- data.frame(Date, your.price.col1,your.price.col2)
# df[2:3] selects the price columns (col1 and col2) that go into the mean
df %>%
mutate(Price = rowMeans(df[2:3], na.rm = TRUE))
Date your.price.col1 your.price.col2 Price
1 2021-01-01 96 55 75.5
2 2021-01-02 22 43 32.5
3 2021-01-03 68 62 65.0
4 2021-01-04 18 51 34.5
5 2021-01-05 27 6 16.5
6 2021-01-06 26 30 28.0
7 2021-01-07 32 22 27.0
8 2021-01-08 53 95 74.0
9 2021-01-09 74 78 76.0
10 2021-01-08 NA 33 33.0
11 2021-01-09 NA 44 44.0

Related

Using lag function to find the last value for a specific individual

I'm trying to create a column in my spreadsheet that takes the last recorded value (IC) for a specific individual (by the Datetime column) and populates it into a column (LIC) for the current event.
A sub-sample of my data looks like this (actual dataset has 4949 rows and 37 individuals):
> head(ACdatas.scale)
Date Datetime ID.2 IC LIC
1 2019-05-25 2019-05-25 11:57 139 High NA
2 2019-06-09 2019-06-09 19:42 139 Low NA
3 2019-07-05 2019-07-05 20:12 139 Medium NA
4 2019-07-27 2019-07-27 17:27 152 Low NA
5 2019-08-04 2019-08-04 9:13 152 Medium NA
6 2019-08-04 2019-08-04 16:18 139 Medium NA
I would like to be able to populate the last value from the IC column into the current LIC column for the current event (see below)
> head(ACdatas.scale)
Date Datetime ID.2 IC LIC
1 2019-05-25 2019-05-25 11:57 139 High NA
2 2019-06-09 2019-06-09 19:42 139 Low High
3 2019-07-05 2019-07-05 20:12 139 Medium Low
4 2019-07-27 2019-07-27 17:27 152 Low NA
5 2019-08-04 2019-08-04 9:13 152 Medium Low
6 2019-08-04 2019-08-04 16:18 139 Medium Medium
I've tried the following code:
ACdatas.scale <- ACdatas.scale %>%
arrange(ID.2, Datetime) %>%
group_by(ID.2) %>%
mutate(LIC= lag(IC))
This worked some of the time, but when I checked back through the data, it seemed to have a problem when the date switched, so it could accurately populate the field within the same day, but not when the previous event was on the previous day. Just to make it super confusing, it only had issues with some of the day switches, and not all! Help please!!
Sample data,
dat <- data.frame(id=c(rep("A",5),rep("B",5)), IC=c(1:5,11:15))
dplyr
library(dplyr)
dat %>%
group_by(id) %>%
mutate(LIC = lag(IC)) %>%
ungroup()
# # A tibble: 10 x 3
# id IC LIC
# <chr> <int> <int>
# 1 A 1 NA
# 2 A 2 1
# 3 A 3 2
# 4 A 4 3
# 5 A 5 4
# 6 B 11 NA
# 7 B 12 11
# 8 B 13 12
# 9 B 14 13
# 10 B 15 14
data.table
library(data.table)
as.data.table(dat)[, LIC := shift(IC, type = "lag"), by = .(id)][]
# id IC LIC
# <char> <int> <int>
# 1: A 1 NA
# 2: A 2 1
# 3: A 3 2
# 4: A 4 3
# 5: A 5 4
# 6: B 11 NA
# 7: B 12 11
# 8: B 13 12
# 9: B 14 13
# 10: B 15 14
base R
dat$LIC <- ave(dat$IC, dat$id, FUN = function(z) c(NA, z[-length(z)]))
dat
# id IC LIC
# 1 A 1 NA
# 2 A 2 1
# 3 A 3 2
# 4 A 4 3
# 5 A 5 4
# 6 B 11 NA
# 7 B 12 11
# 8 B 13 12
# 9 B 14 13
# 10 B 15 14
By using your data:
mydat <- structure(list(Date = structure(c(18041, 18056, 18082,
18104, 18112, 18112),
class = "Date"),
Datetime = structure(c(1558760220,1560084120,
1562332320, 1564223220,
1564884780, 1564910280),
class = c("POSIXct","POSIXt"),
tzone = ""),
ID.2 = c(139, 139, 139, 152, 152, 139),
IC = c("High", "Low", "Medium", "Low", "Medium", "Medium"),
LIC = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA, -6L),
class = "data.frame")
mydat %>% arrange(Datetime) %>% group_by(ID.2) %>% mutate(LIC = lag(IC))
# A tibble: 6 x 5
# Groups: ID.2 [2]
Date Datetime ID.2 IC LIC
<date> <dttm> <dbl> <chr> <chr>
1 2019-05-25 2019-05-25 11:57:00 139 High NA
2 2019-06-09 2019-06-09 19:42:00 139 Low High
3 2019-07-05 2019-07-05 20:12:00 139 Medium Low
4 2019-07-27 2019-07-27 17:27:00 152 Low NA
5 2019-08-04 2019-08-04 09:13:00 152 Medium Low
6 2019-08-04 2019-08-04 16:18:00 139 Medium Medium
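
The answer above works with a POSIXct Datetime. One plausible cause of the original problem, assuming the asker's Datetime column is stored as character (the question does not say), is that text sorting puts "16:18" before "9:13". The sketch below uses hypothetical events for a single individual and base R only; the format string and the UTC time zone are assumptions.

```r
# Hypothetical events for one individual; Datetime is stored as character.
ev <- data.frame(
  ID.2     = c(139, 139, 139),
  Datetime = c("2019-08-03 20:12", "2019-08-04 9:13", "2019-08-04 16:18"),
  IC       = c("High", "Low", "Medium")
)

# Sorting as text compares characters, so "16:18" lands before "9:13".
text_order <- order(ev$Datetime)

# Parse to POSIXct so that ordering is chronological (format and tz are assumed).
ev$Datetime <- as.POSIXct(ev$Datetime, format = "%Y-%m-%d %H:%M", tz = "UTC")
ev <- ev[order(ev$ID.2, ev$Datetime), ]

# Previous IC value within each individual, as in the base R answer above.
ev$LIC <- ave(ev$IC, ev$ID.2, FUN = function(z) c(NA, z[-length(z)]))
ev
```

If the column is already POSIXct, the conversion step is unnecessary and the cause lies elsewhere.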

How to Make a stock returns data set using R?

I have a dataset as below:
stockCode date Closeprice
A 2022-01-24 100
A 2022-01-25 101
A 2022-01-26 103
A 2022-01-27 104
A 2022-01-28 103
B 2022-01-24 200
B 2022-01-25 180
B 2022-01-26 177
B 2022-01-27 192
B 2022-01-28 202
C 2022-01-24 304
C 2022-01-25 333
C 2022-01-26 324
C 2022-01-27 360
C 2022-01-28 335
I would then like to add some return columns (the desired layout was shown in an image in the original post).
I tried to create a new column and calculate the return, but it always throws an error.
> data$newclose <- data$Closeprice[2:length(data$Closeprice)-2]
Error in `$<-.data.frame`(`*tmp*`, newclose, value = c(8900, 9090, 9200, :
replacement has 126626 rows, data has 126628
The left-hand and right-hand sides of the assignment must have the same length. Perhaps you need lead(), which shifts the values and pads the end with NA:
library(dplyr)
data1 <- data %>%
mutate(newclose = lead(Closeprice, n = 1))
I first create new columns holding the close price 1 to 4 days ahead using lead(). Then I convert each of them to the percentage change relative to the current close price, within each stock. The final select() keeps only the columns shown in the output.
library(tidyverse)
df %>%
group_by(stockCode) %>%
mutate(day1 = lead(Closeprice, n = 1),
day2 = lead(Closeprice, n = 2),
day3 = lead(Closeprice, n = 3),
day4 = lead(Closeprice, n = 4)) %>%
mutate(across(starts_with("day"), ~((. - Closeprice)/Closeprice)*100)) %>%
select(stockCode, starts_with("day"))
Output
# A tibble: 15 × 5
# Groups: stockCode [3]
stockCode day1 day2 day3 day4
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 4 3
2 A 1.98 2.97 1.98 NA
3 A 0.971 0 NA NA
4 A -0.962 NA NA NA
5 A NA NA NA NA
6 B -10 -11.5 -4 1
7 B -1.67 6.67 12.2 NA
8 B 8.47 14.1 NA NA
9 B 5.21 NA NA NA
10 B NA NA NA NA
11 C 9.54 6.58 18.4 10.2
12 C -2.70 8.11 0.601 NA
13 C 11.1 3.40 NA NA
14 C -6.94 NA NA NA
15 C NA NA NA NA
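
The same calculation can be sketched in base R for any number of days ahead. The helper name fwd_return and the data frame name data are assumptions for illustration, and the sketch assumes the rows are already sorted by date within each stock.

```r
# Data from the question.
data <- data.frame(
  stockCode  = rep(c("A", "B", "C"), each = 5),
  date       = rep(as.Date("2022-01-24") + 0:4, times = 3),
  Closeprice = c(100, 101, 103, 104, 103,
                 200, 180, 177, 192, 202,
                 304, 333, 324, 360, 335)
)

# Hypothetical helper: percentage change from each price to the price n rows ahead.
# Indexing past the end of a vector yields NA, which pads the tail of each group.
fwd_return <- function(p, n) {
  ahead <- p[seq_along(p) + n]
  (ahead - p) / p * 100
}

# One column per horizon, computed separately for each stock with ave().
for (n in 1:4) {
  data[[paste0("day", n)]] <- ave(data$Closeprice, data$stockCode,
                                  FUN = function(p) fwd_return(p, n))
}
head(data, 5)
```

Because ave() applies the helper group by group, the look-ahead never crosses from one stock into the next.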

How to create month-end date series using complete function?

Here is my toy dataset:
df <- tibble::tribble(
~date, ~value,
"2007-01-31", 25,
"2007-05-31", 31,
"2007-12-31", 26
)
I am creating month-end date series using the following code.
df %>%
mutate(date = as.Date(date)) %>%
complete(date = seq(as.Date("2007-01-31"), as.Date("2019-12-31"), by="month"))
However, I am not getting the correct month-end dates.
date value
<date> <dbl>
1 2007-01-31 25
2 2007-03-03 NA
3 2007-03-31 NA
4 2007-05-01 NA
5 2007-05-31 31
6 2007-07-01 NA
7 2007-07-31 NA
8 2007-08-31 NA
9 2007-10-01 NA
10 2007-10-31 NA
11 2007-12-01 NA
12 2007-12-31 26
What am I missing here? I am okay using other functions from any other package.
There is no need for complete(); you can do this in base R.
Because the last day of the month differs from month to month, we can create a sequence of month-start dates and subtract 1 day from each.
seq(as.Date("2007-02-01"), as.Date("2008-01-01"), by="month") - 1
#[1] "2007-01-31" "2007-02-28" "2007-03-31" "2007-04-30" "2007-05-31" "2007-06-30"
# "2007-07-31" "2007-08-31" "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31"
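Applying the same logic to the data frame (this works because min(date) and max(date) are themselves month-end dates, so adding 1 day gives a month start):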
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
tidyr::complete(date = seq(min(date) + 1, max(date) + 1, by="month") - 1)
# date value
# <date> <dbl>
# 1 2007-01-31 25
# 2 2007-02-28 NA
# 3 2007-03-31 NA
# 4 2007-04-30 NA
# 5 2007-05-31 31
# 6 2007-06-30 NA
# 7 2007-07-31 NA
# 8 2007-08-31 NA
# 9 2007-09-30 NA
#10 2007-10-31 NA
#11 2007-11-30 NA
#12 2007-12-31 26
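
The approach above depends on the first and last dates already being month ends. As a rough sketch for arbitrary start and end dates, a small helper can first snap both to the first of their month. The function name month_ends is hypothetical; it returns the last day of every month that the range touches, including the month containing the end date.

```r
# Hypothetical helper: last day of every month touched by the range [from, to].
month_ends <- function(from, to) {
  first_from <- as.Date(format(from, "%Y-%m-01"))
  first_to   <- as.Date(format(to, "%Y-%m-01"))
  starts <- seq(first_from, first_to, by = "month")
  # Month starts shifted forward by one month, minus one day, are month ends.
  seq(first_from, by = "month", length.out = length(starts) + 1)[-1] - 1
}

month_ends(as.Date("2007-01-15"), as.Date("2007-05-02"))
# [1] "2007-01-31" "2007-02-28" "2007-03-31" "2007-04-30" "2007-05-31"
```

The result can be passed to tidyr::complete(date = ...) in the same way as the sequence in the answer above.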

R Dataframe Average Group by last months over Users

Suppose I have the following data frame. How can I create a new "avg" column that is the average of the last 2 dates ("date") for each group?
The idea is to apply this to a dataset with hundreds of thousands of rows, so performance is important. The function should accept a variable number of months (for example 2 or 3) and be able to switch between simple and medium average.
Thanks in advance.
table1<-data.frame(group=c(1,1,1,1,2,2,2,2),date=c(201903,201902,201901,201812,201903,201902,201901,201812),price=c(10,30,50,20,2,10,9,20))
group date price
1 1 201903 10
2 1 201902 30
3 1 201901 50
4 1 201812 20
5 2 201903 2
6 2 201902 10
7 2 201901 9
8 2 201812 20
result<-data.frame(group=c(1,1,1,1,2,2,2,2),date=c(201903,201902,201901,201812,201903,201902,201901,201812),price=c(10,30,50,20,2,10,9,20), avg = c(20, 40, 35, NA, 6, 9.5, 14.5, NA))
group date price avg
1 1 201903 10 20.0
2 1 201902 30 40.0
3 1 201901 50 35.0
4 1 201812 20 NA
5 2 201903 2 6.0
6 2 201902 10 9.5
7 2 201901 9 14.5
8 2 201812 20 NA
First, sort the data.frame so that date is ascending within each group:
table1 <- table1[order(table1$group, table1$date), ]
Then create a moving-average function with an argument for the number of months. stats::filter is called explicitly because dplyr::filter masks it when dplyr is loaded.
Other function options are available from: Calculating moving average
mov_avg <- function(y, months = 2){as.numeric(stats::filter(y, rep(1 / months, months), sides = 1))}
Use the classic do.call-lapply-split combo with this mov_avg function:
table1$avg_2months <- do.call(c, lapply(split(x=table1$price, f=table1$group), mov_avg, months=2))
table1$avg_3months <- do.call(c, lapply(split(x=table1$price, f=table1$group), mov_avg, months=3))
table1
group date price avg_2months avg_3months
4 1 201812 20 NA NA
3 1 201901 50 35.0 NA
2 1 201902 30 40.0 33.33333
1 1 201903 10 20.0 30.00000
8 2 201812 20 NA NA
7 2 201901 9 14.5 NA
6 2 201902 10 9.5 13.00000
5 2 201903 2 6.0 7.00000
If your date column is sorted in descending order within each group (as in the question), here's a way to do it using data.table:
library(data.table)
setDT(table1)[, next_price := dplyr::lead(price), by = group][, total_price := price + next_price][, avg := total_price / 2][, c("total_price", "next_price") := NULL]
table1
group date price avg
1: 1 201903 10 20.0
2: 1 201902 30 40.0
3: 1 201901 50 35.0
4: 1 201812 20 NA
5: 2 201903 2 6.0
6: 2 201902 10 9.5
7: 2 201901 9 14.5
8: 2 201812 20 NA
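
The question also asks for a variable window and a switchable averaging function. A minimal base R sketch follows; the helper name roll_apply is hypothetical, and reading "medium" as median is only an assumption. Like the first answer, it expects dates in ascending order within each group.

```r
table1 <- data.frame(
  group = c(1, 1, 1, 1, 2, 2, 2, 2),
  date  = c(201903, 201902, 201901, 201812, 201903, 201902, 201901, 201812),
  price = c(10, 30, 50, 20, 2, 10, 9, 20)
)
table1 <- table1[order(table1$group, table1$date), ]

# Hypothetical helper: apply fun to each trailing window of k values.
# The first k - 1 positions have no complete window and are set to NA.
roll_apply <- function(y, k = 2, fun = mean) {
  if (length(y) < k) return(rep(NA_real_, length(y)))
  windows <- embed(y, k)  # one row per complete window of k values
  c(rep(NA_real_, k - 1), apply(windows, 1, fun))
}

# ave() applies the helper within each group and keeps the row order.
table1$avg_2 <- ave(table1$price, table1$group,
                    FUN = function(y) roll_apply(y, k = 2, fun = mean))
table1$med_3 <- ave(table1$price, table1$group,
                    FUN = function(y) roll_apply(y, k = 3, fun = median))
table1
```

Calling apply() row by row is simple but not the fastest option; for hundreds of thousands of rows a dedicated rolling function from a package is likely to perform better.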

How to split a data set with duplicated information based on date

I have this situation:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
3 2014-07-11 56
3 NA 34
4 2014-10-05 25
4 2014-08-09 14
5 NA NA
5 NA NA
And I would like to split the dataset like this:
1-
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
4 2014-10-05 25
4 2014-08-09 14
2- Lowest Date
ID date Weight
3 2014-07-11 56
3 NA 34
5 NA NA
5 NA NA
I tried this for the second dataset:
dt <- dt[order(dt$ID, dt$date), ]
dt.2=dt[duplicated(dt$ID), ]
but it didn't work.
Get the IDs that have an NA date and then subset based on them:
NA_ids <- unique(df$ID[is.na(df$date)])
subset(df, !ID %in% NA_ids)
# ID date Weight
#1 1 2014-12-02 23
#2 1 2014-10-02 25
#3 2 2014-11-03 27
#4 2 2014-09-03 45
#7 4 2014-10-05 25
#8 4 2014-08-09 14
subset(df, ID %in% NA_ids)
# ID date Weight
#5 3 2014-07-11 56
#6 3 <NA> 34
#9 5 <NA> NA
#10 5 <NA> NA
Using dplyr, we can create a new column that is TRUE/FALSE for each ID depending on whether it has any NA date, and then use group_split to split the data into a list of two.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NA_ID = any(is.na(date))) %>%
ungroup() %>%
group_split(NA_ID, .keep = FALSE)
The above dplyr logic can also be implemented in base R using ave and split:
df$NA_ID <- with(df, ave(is.na(date), ID, FUN = any))
split(df[-4], df$NA_ID)

Resources