I have a data frame in R with several variables; right now I am concerned with two of them, Title and Date. Below is a short sample similar to the real data frame:
Title                               Date
Veterans, Sacrame                   1997
Action Newsmaker                    2005
New Tri-Cable                       1990 mar
EFEST June 16, 1987                 28494
The Inhuman Perception: what we do  1999 june
New Tri-Cable                       2003 july/august
Interviews Concerning His/her       1991-1992
Festival EFEST June 6, 1997         83443
Intervention of the people          Undated
I want to create a new variable, year, that contains only the year (no day, month, or anything else).
I can extract the year from a proper date format or from text with a consistent layout, but here it is different because the title is complicated and differs from row to row. Is there an easy way to create the 'year' variable I want in RStudio? Where the Date column is in some sort of date format, I can extract the year from it. However, in some rows the date looks like 83443; the year appears in the title, but I can't extract it manually because the dataset is huge.
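Use mdy from lubridate to convert Title to Date class and then year to extract the year. Only titles that contain a full month-day-year date are parsed; the rest give NA.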
library(lubridate)
year(mdy(dat1$Title, quiet = TRUE))
## [1] NA NA NA 1987 NA NA NA 1997 NA
Note
The data in reproducible form:
Lines <- "Title                               Date
Veterans, Sacrame                   1997
Action Newsmaker                    2005
New Tri-Cable                       1990 mar
EFEST June 16, 1987                 28494
The Inhuman Perception: what we do  1999 june
New Tri-Cable                       2003 july/august
Interviews Concerning His/her       1991-1992
Festival EFEST June 6, 1997         83443
Intervention of the people          Undated"
L <- readLines(textConnection(Lines))
# Columns are separated by two or more spaces; turn the first such gap into ";"
dat1 <- read.csv(text = sub(" {2,}", ";", trimws(L)), sep = ";", stringsAsFactors = FALSE)
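If the goal is simply the first four-digit year wherever it appears, a regular expression may be enough. The following is a minimal sketch in base R, not part of the answer above: extract_year is a hypothetical helper, and the pattern assumes years between 1900 and 2099. It looks in Date first and falls back to Title.

```r
# Sample data rebuilt by hand from the question.
dat1 <- data.frame(
  Title = c("Veterans, Sacrame", "Action Newsmaker", "New Tri-Cable",
            "EFEST June 16, 1987", "The Inhuman Perception: what we do",
            "New Tri-Cable", "Interviews Concerning His/her",
            "Festival EFEST June 6, 1997", "Intervention of the people"),
  Date = c("1997", "2005", "1990 mar", "28494", "1999 june",
           "2003 july/august", "1991-1992", "83443", "Undated"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first standalone 4-digit year (assumed 1900-2099), else NA.
extract_year <- function(x) {
  x <- as.character(x)
  hit <- regexpr("\\b(19|20)[0-9]{2}\\b", x, perl = TRUE)
  out <- rep(NA_integer_, length(x))
  out[hit > 0] <- as.integer(regmatches(x, hit))
  out
}

# Prefer a year found in Date; otherwise use one found in Title.
year_from_date <- extract_year(dat1$Date)
year_from_title <- extract_year(dat1$Title)
dat1$year <- ifelse(is.na(year_from_date), year_from_title, year_from_date)
```

For a range such as 1991-1992 this keeps the first year, and rows with no year in either column stay NA.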
I'm helping a friend with some R homework for an apparently badly taught R class (because all the stuff covered in the class and the supplementary material doesn't help).
We have two datasets. One contains daily discrete returns of a company share in percent and the other contains daily exchange rates between two currencies, let's say USD to Swiss Franc. It looks like this:
Date Mon Day Exchangerate
2000 01 01 1.03405
2000 01 02 1.02987
2000 01 03 1.03021
2000 01 04 1.03456
2000 01 05 1.03200
And the daily discrete returns:
Date Share1
20000104 -0.03778
20000105 0.02154
20000106 0.01345
20000107 -0.01234
20000108 -0.01789
The task is to write a function that uses both matrices and calculates the daily returns from the perspective of a Swiss investor. We assume an initial investment of 1000 US Dollar.
I tried using the tidyverse to calculate the total return and the percent change from one day to the next, using the lag function from dplyr as in the code below.
library(tidyverse)
myCHFreturn <- function(matrix1, matrix2) {
  total = dplyr::right_join(matrix1, matrix2, by = "date") %>%
    dplyr::filter(!is.na(Share1)) %>%
    dplyr::select(-c(Date, Mon, Day)) %>%
    dplyr::mutate(rentShare1_usd = (1+Share1)*1000,
                  rentShare1_usd = dplyr::lag(rentShare1_usd) * (1+Share1),
                  rentShare1_chf = rentShare1_usd*Exchangerate,
                  rentShare1_chfperc =(rentShare1_chf - dplyr::lag(rentShare1_chf))/dplyr::lag(rentShare1_chf),
                  rentShare1_chfperc = rentShare1_chfperc*100)
}
The problem is that the rentShare1_usd = dplyr::lag(rentShare1_usd) * (1+Share1) part of the function relies on the values calculated for the initial 1000 US Dollar investment. Thus, my perception is that we need some type of rolling calculation of the changes, based on the initial investment. However, I don't know how to implement this in the function, since I've only worked with rolling means. We want to calculate the daily returns based on the change given in Variable Share1 and the value of the investment of the previous day. Any help is very much appreciated.
To point you toward at least part of a solution: the value of a unit investment on any given day is the cumulative product of (1 + daily_discrete_return) from the start date up to that day. For example, using an extended version of your daily discrete returns table:
df = read.table(text = "Date Share1
20000104 -0.03778
20000105 0.02154
20000106 0.01345
20000107 -0.01234
20000108 -0.01789
20000109 0.02154
20000110 0.01345
20000111 0.02154
20000112 0.02154
20000113 0.01345", header = TRUE, stringsAsFactors = FALSE)
library(dplyr)
Shares = 1000
df1 = mutate(df, ShareValue = cumprod(1+Share1) * Shares)
Date Share1 ShareValue
1 20000104 -0.03778 962.2200
2 20000105 0.02154 982.9462
3 20000106 0.01345 996.1668
4 20000107 -0.01234 983.8741
5 20000108 -0.01789 966.2726
6 20000109 0.02154 987.0862
7 20000110 0.01345 1000.3625
8 20000111 0.02154 1021.9103
9 20000112 0.02154 1043.9222
10 20000113 0.01345 1057.9630
Once you have a table with the share value on each date, you can join it back to your exchange rate table to calculate the Swiss franc equivalent for that date, and extend it to do percentage changes and so on.
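The remaining steps could be sketched as follows in base R. This is only an illustration under stated assumptions: chf_returns is a hypothetical function name, Exchangerate is assumed to be quoted as CHF per 1 USD, and the date key is rebuilt from the Date/Mon/Day columns of the exchange rate table. With the sample rows from the question only two dates (20000104 and 20000105) overlap, so only those survive the join.

```r
returns <- read.table(text = "Date Share1
20000104 -0.03778
20000105 0.02154
20000106 0.01345
20000107 -0.01234
20000108 -0.01789", header = TRUE)

rates <- read.table(text = "Date Mon Day Exchangerate
2000 01 01 1.03405
2000 01 02 1.02987
2000 01 03 1.03021
2000 01 04 1.03456
2000 01 05 1.03200", header = TRUE)

# Hypothetical helper; assumes Exchangerate is CHF per USD.
chf_returns <- function(returns, rates, initial_usd = 1000) {
  # Value of the investment in USD: compound the daily returns in date order.
  returns <- returns[order(returns$Date), ]
  returns$value_usd <- initial_usd * cumprod(1 + returns$Share1)

  # Build a yyyymmdd key that matches the returns table.
  rates$Date <- as.integer(sprintf("%04d%02d%02d",
                                   as.integer(rates$Date),
                                   as.integer(rates$Mon),
                                   as.integer(rates$Day)))

  # Keep only dates present in both tables (merge sorts by the key).
  out <- merge(returns, rates[, c("Date", "Exchangerate")], by = "Date")

  # Convert to CHF and compute the day-over-day percentage change.
  out$value_chf <- out$value_usd * out$Exchangerate
  n <- nrow(out)
  out$return_chf_pct <- c(NA, 100 * (out$value_chf[-1] / out$value_chf[-n] - 1))
  out
}

result <- chf_returns(returns, rates)
```

The first row has no previous day to compare against, so its percentage change is NA.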
I need to make a dataset for a set of years, where a function calculates the values for each year. I have the program in Stata but am having difficulty translating it into R. The years run from 2017 to 2025, with a value of 17305004 total hours for the year 2017. From then on the total hours increase by 2.5% (a growth factor of 0.025) per year at a fixed hour price.
Here is the program in STATA:
set obs 9
scalar growth=.025
scalar hourprice=32
egen year = seq(), from(2017) to(2025)
gen totalhours = .
replace totalhours=17305004 if year==2017
replace totalhours=[(totalhours[_n-1] + growth*totalhours[_n-1])] if year!=2017
format %10.0g totalhours
gen cost = .
replace cost=totalhours*hourprice
format %12.0g cost
list year totalhours cost
year <- 2017:2025
totalhours <- c(17305004, rep(NA, 8))
for(i in 1:8){
totalhours[i+1] <- totalhours[i] + totalhours[i]*.025
}
cost <- totalhours*32
mydata <- data.frame(year, totalhours, cost)
Dataset:
> mydata
year totalhours cost
1 2017 17305004 553760128
2 2018 17737629 567604131
3 2019 18181070 581794234
4 2020 18635597 596339090
5 2021 19101486 611247568
6 2022 19579024 626528757
7 2023 20068499 642191976
8 2024 20570212 658246775
9 2025 21084467 674702944
totalhours could probably be generated more easily; I just couldn't come up with a better way right now.
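One loop-free possibility, offered as a sketch rather than the only way: because each year is the previous year times 1.025, the series is the starting value times 1.025 raised to the number of years elapsed. The names growth and hourprice mirror the Stata scalars.

```r
year <- 2017:2025
growth <- 0.025
hourprice <- 32

# Compound growth in closed form: start * (1 + growth)^(years since 2017)
totalhours <- 17305004 * (1 + growth)^(year - 2017)
cost <- totalhours * hourprice

mydata <- data.frame(year, totalhours, cost)
```

This should give the same values as the loop above, up to floating-point rounding.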
Sample Data:
product_id <- c("1000","1000","1000","1000","1000","1000", "1002","1002","1002","1002","1002","1002")
qty_ordered <- c(1,2,1,1,1,1,1,2,1,2,1,1)
price <- c(2.49,2.49,2.49,1.743,2.49,2.49, 2.093,2.093,2.11,2.11,2.11, 2.97)
date <- c("2/23/15","2/23/15", '3/16/15','3/16/15','5/16/15', "6/18/15", "2/19/15","3/19/15","3/19/15","3/19/15","3/19/15","4/19/15")
sampleData <- data.frame(product_id, qty_ordered, price, date)
I would like to identify every time a price change occurred, and to sum() the total qty_ordered between each pair of consecutive price-change dates. For example,
for product_id == "1000", a price change occurred on 3/16/15, from $2.49 to $1.743. The total qty_ordered up to that change is 1+2+1=4;
the time between the first date at the old price and the date of the price change is from 2/23/15 to 3/16/15, which is 21 days.
So the New Data Frame should be:
product_id sum_qty_ordered price date_diff
1000 4 2.490 21
1000 1 1.743 61
1000 2 2.490 33
Here is what I have tried:
NOTE: for this case, a simple "dplyr::group_by" will not work, since it will ignore the date effect.
1) I found this code in the question "Determine when columns of a data.frame change value and return indices of the change".
It identifies every time the price changed, i.e. the first date at each new price for each product:
IndexedChanged <- c(1,which(rowSums(sapply(sampleData[,3],diff))!=0)+1)
sampleData[IndexedChanged,]
However, I am not sure how to calculate sum(qty_ordered) and the date difference for each of those entries if I use that code.
2) I tried to write a WHILE loop that temporarily stores each batch of rows for one product_id and one price (i.e. a subset of the data frame with one product_id, one price, and all entries from the earliest date at that price until the last date before the price changed),
and then summarises that subset to get the summed qty_ordered and the date difference.
However, I always get confused by WHILE and FOR, so my code has some problems. Here is my code:
Create an empty data frame for later data storage:
NewData_Ready <- data.frame(
product_id = character(),
price = double(),
early_date = as.Date(character()),
last_date=as.Date(character()),
total_qty_demanded = double(),
stringsAsFactors=FALSE)
Create a temp table to store the batch price order entries:
temp_dataset <- data.frame(
product_id = character(),
qty_ordered = double(),
price = double(),
date=as.Date(character()),
stringsAsFactors=FALSE)
The loop:
This is messy and probably does not make sense, so I really do need help with this.
for ( i in unique(sampleData$product_id)){
#for each unique product_id in the dataset, we are gonna loop through it based on product_id
#for first product_id which is "1000"
temp_table <- sampleData[sampleData$product_id == "i", ] #subset dataset by ONE single product_id
#this dataset only has product of "1000" entries
#starting a new for loop to loop through the entire entries for this product
for ( p in 1:length(temp_table$product_id)){
current_price <- temp_table$price[p] #assign current_price to the first price value
#assign $2.49 to current price.
min_date <- temp_table$date[p] #assign the first date when the first price change
#assign 2015-2-23 to min_date which is the earliest date when price is $2.49
while (current_price == temp_table$price[p+1]){
#while the next price is the same as the first price
#that is, if the second price is $2.49 is the same as the first price of $2.49, which is TRUE
#then execute the following statement
temp_dataset <- rbind(temp_dataset, temp_table[p,])
#if the WHILE loop is TRUE, means every 2 entries have the same price
#then combine each entry when price is the same in temp_table with the temp_dataset
#if the WHILE loop is FALSE, means one entry's price is different from the next one
#then stop the statement at the above, but do the following
current_price <- temp_table$price[p+1]
#this will reassign the current_price to the next price, and restart the WHILE loop
by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
NewRow <- dplyr::summarise(
early_date = min(date),
last_date = max(date),
total_qty_demanded = sum(qty_ordered))
NewData_Ready <- rbind(NewData_Ready, NewRow)
}
}
}
I have searched a lot through related questions but have not yet found anything that addresses this problem.
If you have any suggestions for solving it, please let me know. I would greatly appreciate your time and help!
Here is my R version:
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 3.1
year 2016
month 06
day 21
svn rev 70800
language R
version.string R version 3.3.1 (2016-06-21)
nickname Bug in Your Hair
Using data.table:
library(data.table)
setDT(sampleData)
Some Preprocessing:
sampleData[, firstdate := as.Date(date, "%m/%d/%y")]
Based on how you calculate date diff, we are better off creating a range of dates for each row:
sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
Then create a new ID for every change in price:
sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
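To see what the cumsum(c(0, diff(price) != 0)) idiom does, here is a small base R sketch on the rows for product 1000 only (values copied from the sample data; qty_per_run is just an illustrative name). diff() flags each position where the price differs from the previous row, and the running sum of those flags gives each run of equal prices its own id, so a price that returns to an earlier value still starts a new run.

```r
# Rows for product_id "1000", in date order.
price <- c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49)
qty_ordered <- c(1, 2, 1, 1, 1, 1)

# TRUE where the price differs from the previous row.
changed <- diff(price) != 0          # FALSE FALSE TRUE TRUE FALSE

# Running count of changes = id of the current price run.
price_id <- cumsum(c(0, changed))    # 0 0 0 1 2 2

# Total quantity ordered within each run.
qty_per_run <- as.vector(tapply(qty_ordered, price_id, sum))  # 4 1 2
```

Grouping by price alone would merge the first and last runs, which is why the run id is needed.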
Then calculate your groupwise functions, by product and price run:
sampleData[,
           .(price = unique(price),
             sum_qty = sum(qty_ordered),
             date_diff = max(lastdate) - min(firstdate)),
           by = .(product_id, price_id)]
product_id price_id price sum_qty date_diff
1: 1000 0 2.490 4 21 days
2: 1000 1 1.743 1 61 days
3: 1000 2 2.490 2 33 days
4: 1002 0 2.093 3 28 days
5: 1002 1 2.110 4 31 days
6: 1002 2 2.970 1 0 days
I think the last price change for 1000 is only 33 days, and the preceding one is 61 (not 60). If you include the first day it is 22, 62 and 34, and the line should read date_diff = max(lastdate) - min(firstdate) + 1.
I'm confused about the way the paste() function is behaving. I have a dplyr table with the following columns:
Year Month DayofMonth
2001 May 21
2001 May 22
2001 June 9
2001 March 4
I'd like to combine these into a single column called "Date". I figured I'd use the command:
df2 = mutate(df, Date = paste(c(Year, Month, DayofMonth), sep = "-",))
Unfortunately, this seems to concatenate every element in Year, then every element in Month, then every element in DayofMonth so the result looks something like this:
2001-2001-2001-2001 ... May-May-June-March ... 21-22-9-4
How should I modify my command so that the paste function iterates over each row individually?
P.S. This is part of a Data Camp course and as such I am running commands through whatever version of R they've got on their server.
Currently you are concatenating all the columns together. Take c() out of your paste() call to paste them together element-by-element.
mutate(df, Date = paste(Year, Month, DayofMonth, sep = "-"))
# Year Month DayofMonth Date
# 1 2001 May 21 2001-May-21
# 2 2001 May 22 2001-May-22
# 3 2001 June 9 2001-June-9
# 4 2001 March 4 2001-March-4
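The difference can be checked outside of mutate() with plain vectors. The sketch below uses base R only and the four rows from the question. The last step, turning the pieces into a real Date, is an optional extra that the question does not ask for; it assumes English month names as found in R's built-in month.name.

```r
Year <- c(2001, 2001, 2001, 2001)
Month <- c("May", "May", "June", "March")
DayofMonth <- c(21, 22, 9, 4)

# c() stacks the three columns into ONE vector, so paste() has nothing to join.
stacked <- paste(c(Year, Month, DayofMonth), sep = "-")
length(stacked)   # 12

# Separate arguments are pasted element by element, one result per row.
rowwise <- paste(Year, Month, DayofMonth, sep = "-")
length(rowwise)   # 4

# Optional: a proper Date, mapping month names to numbers via month.name.
Date <- as.Date(sprintf("%d-%02d-%02d",
                        as.integer(Year),
                        match(Month, month.name),
                        as.integer(DayofMonth)))
```

Using match() against month.name avoids relying on the session locale, which as.Date() with a "%B" format would depend on.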