Oh man I feel dumb. This is beginner central, but I'm totally lost trying to figure out how to subset arguments in lapply. At this point I've just been randomly trying different combinations of [[ and friends, cursing my clumsiness with the debugging in RStudio.
The issue: I have a dataset collected from SQL Server which includes several columns of date data. Some of these are legitimate datetime values, others are strings. Most have (valid) missing data and some have more than one format. Often the date 1900-01-01 is used as a substitute for NULL. I'm trying very, very hard to be idiomatic and concise in solving this instead of brute forcing it with copy/paste invocations.
My ParseDates() function seems to work well if called column-by-column, but I can't get it to work with lapply. I can see that I'm sending the whole list of orders and threshold values when I only want to pass the current observation, but I can't get my head around how lapply iterates or how to align multiple lists so that the right arguments go with the right call.
I need to finish with all values correctly held as dates (or POSIXct in this instance) with anything close to 1900-01-01 set to NA.
library(lubridate)
# build sample data extract
events <-
structure(
list(
ReservationDate = structure(
c(4L, 2L, 3L, NA,
1L), .Label = c(
"18/12/2006", "1/1/1900", "May 11 2004 12:00AM",
"May 17 2004 12:00AM"
), class = "factor"
), OrigEnquiryDate = structure(
c(1094565600,
937404000, 1089295200, NA, NA), class = c("POSIXct", "POSIXt"), tzone = ""
), UnconditionalDate = structure(
c(1092146400, 935676000,
1087740000, NA, 1168952400), class = c("POSIXct", "POSIXt"), tzone = ""
),
ContractsExchangedDate = structure(
c(NA, NA, NA, NA, 1171544400), class = c("POSIXct", "POSIXt"), tzone = ""
)
), .Names = c(
"ReservationDate",
"OrigEnquiryDate", "UnconditionalDate", "ContractsExchangedDate"
), row.names = c(54103L, 54090L, 54057L, 135861L, 73433L), class = "data.frame"
)
ParseDates <- function(x, orders=NULL, threshold=10) {
# converts to POSIXct if required and replaces 1900-01-01 or similar with na
if(!is.null(orders)) {
x <- parse_date_time(x, orders)
}
x[abs(difftime(x, as.POSIXct("1900-01-01"), units="days")) < threshold] <- NA
return(x)
}
# only consider these columns
date.cols <- names(events) %in% c(
"ReservationDate", "UnconditionalDate", "ContractsExchangedDate", "OrigEnquiryDate"
)
# columns other than these should use the default threshold of 10
date.thresholds <- list("UnconditionalDate"=90, "ContractsExchangedDate"=400)
# columns *other* than these should use the default order of NULL,
# they skip parsing and go straight to threshold testing
date.orders <- list(
"SettlementDate"=c("dmY", "bdY I:Mp"),
"ReservationDate"=c("dmY", "bdY I:Mp")
)
events[date.cols] <- lapply(events[date.cols],
ParseDates(events[date.cols],
orders = date.orders,
threshold = date.thresholds))
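A minimal sketch of one way to line up the per-column arguments, offered as an assumption rather than the definitive fix: iterate over the column names and look up each column's settings by name, since `[[` on a list returns NULL for a name that is absent. The small `events` data frame here is a hypothetical stand-in for the real extract; ParseDates() is the function from the question.

```r
library(lubridate)

# ParseDates() as defined in the question
ParseDates <- function(x, orders = NULL, threshold = 10) {
  if (!is.null(orders)) {
    x <- parse_date_time(x, orders)
  }
  x[abs(difftime(x, as.POSIXct("1900-01-01"), units = "days")) < threshold] <- NA
  x
}

# Hypothetical stand-in data (not the question's real extract)
events <- data.frame(
  ReservationDate   = c("17/05/2004", "1/1/1900", NA),
  UnconditionalDate = as.POSIXct(c("2004-08-10", "1900-02-15", NA), tz = "UTC"),
  stringsAsFactors  = FALSE
)
date.thresholds <- list(UnconditionalDate = 90)
date.orders     <- list(ReservationDate = c("dmY", "bdY I:Mp"))

cols <- c("ReservationDate", "UnconditionalDate")

# lapply() hands the function ONE column name at a time; each lookup
# returns NULL when the column has no entry in the settings list.
parsed <- lapply(cols, function(nm) {
  thr <- date.thresholds[[nm]]
  ParseDates(events[[nm]],
             orders    = date.orders[[nm]],
             threshold = if (is.null(thr)) 10 else thr)
})
events[cols] <- parsed
```

The original call fails because `ParseDates(events[date.cols], ...)` is evaluated once, up front, with the whole data frame and both full lists; lapply() expects a function as its second argument, not the result of calling one.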
I'm fairly new to R and am trying to plot some expenditure data. I read the data in from excel and then do some manipulation on the dates
data <- read.csv("Spending2019.csv", header = T)
#converts time so R can use the dates
strdate <- strptime(data$DATE,"%m/%d/%Y")
newdate <- cbind(data,strdate)
finaldata <- newdate[order(strdate),]
This probably isn't the most efficient, but it gets me there :)
Here are the relevant columns of the first four rows of my finaldata data frame:
dput(droplevels(finaldata[1:4,c(5,7)]))
structure(list(AMOUNT = c(25.13, 14.96, 43.22, 18.43), strdate = structure(c(1546578000,
1546750800, 1547010000, 1547010000), class = c("POSIXct", "POSIXt"
), tzone = "")), row.names = c(NA, 4L), class = "data.frame")
The full data set has 146 rows and the dates range from 1/4/2019 to 12/30/2019
I then plot the data
plot(finaldata$strdate,finaldata$AMOUNT, xlab = "Month", ylab = "Amount Spent")
and I get this plot
This is fine for me getting started, EXCEPT why is JAN repeated at the far right end? I have tried various forms of xlim and can't seem to get it to go away.
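One likely explanation, stated as an assumption because the plot itself is not shown: the default date axis puts ticks at month starts, and the plot region extends a little past the last data point, so 1 January 2020 also gets a tick labelled Jan. A minimal sketch, using made-up stand-in data, that suppresses the default axis and draws ticks for the 2019 months only:

```r
# Made-up stand-in for the spending data (146 rows, 4 Jan to 30 Dec 2019)
set.seed(1)
strdate <- seq(as.POSIXct("2019-01-04", tz = "UTC"),
               as.POSIXct("2019-12-30", tz = "UTC"),
               length.out = 146)
amount <- round(runif(146, 5, 60), 2)

# One tick per month start, stopping at December 2019
ticks <- seq(as.POSIXct("2019-01-01", tz = "UTC"),
             as.POSIXct("2019-12-01", tz = "UTC"),
             by = "month")

# xaxt = "n" suppresses the default axis so it can be drawn by hand
plot(strdate, amount, xaxt = "n", xlab = "Month", ylab = "Amount Spent")
axis.POSIXct(1, at = ticks, format = "%b")
```

Using a format such as "%b %Y" instead would keep the extra tick but make the year visible.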
I have a data set like the one below and I am trying to impute the missing values as shown.
ID In Out
4 2019-09-20 21:57:22 NA
4 NA 2019-09-21 5:07:03
When there are NAs in the lead and lag values within an ID, I want to impute times that close off the previous day and start a new period for the next day. I tried the following, but I am getting an error:
df1%>%
group_by(ID) %>%
mutate(In= ifelse(is.na(In) & is.na(lag(Out)),
as.POSIXct(as.character(paste(as.Date(In),"05:00:01"))),
In)) %>%
mutate(Out= ifelse(is.na(Out) & lead(In) == "05:00:01",
as.POSIXct(as.character(paste(as.Date(Out),"05:00:00"))),
Out))
The desired output will be
ID In Out
4 2019-09-20 21:57:22 2019-09-21 05:00:00
4 2019-09-21 5:00:01 2019-09-21 5:07:03
Dput for the data
structure(list(concat = c("176 - 2019-09-20", "176 - 2019-09-20",
"176 - 2019-09-20", "176 - 2019-09-20", "176 - 2019-09-21"),
ENTRY = structure(c(1568989081, 1569008386, 1569016635, 1569016646,
NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), EXIT = structure(c(1569005439,
1569014914, 1569016645, NA, 1569042433), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -5L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000007e21ef0>)
Finally, I got the desired output by separating the date and time and pasting them back together. This is definitely not an efficient way to achieve it. Maybe someone can suggest a more efficient approach that I can learn from.
df%>%
mutate(ENTRY_date = as.Date(ENTRY)) %>%
mutate(EXIT_date = as.Date(EXIT))%>%
mutate(ENTRY_time = format(ENTRY,"%H:%M:%S"))%>%
mutate(EXIT_time = format(EXIT,"%H:%M:%S"))%>%
mutate(Entry_date1 = if_else(is.na(ENTRY_date)&is.na(lag(EXIT_date)),EXIT_date,ENTRY_date))%>%
mutate(Exit_date1 = if_else(is.na(EXIT_date)& is.na(lead(ENTRY_date)),ENTRY_date,EXIT_date))%>%
mutate(Entry_time1 = if_else(is.na(ENTRY_time)&is.na(lag(EXIT_time)),"05:00:01",ENTRY_time))%>%
mutate(Exit_time1 = if_else(is.na(EXIT_time)& is.na(lead(ENTRY_time)),"04:59:59",EXIT_time))%>%
mutate(ENTRY1 = as.POSIXct(paste(Entry_date1, Entry_time1), format = "%Y-%m-%d %H:%M:%S"))%>%
mutate(EXIT1 = as.POSIXct(paste(Exit_date1, Exit_time1), format = "%Y-%m-%d %H:%M:%S"))
First, using your dput() data did not work for me. Anyway, if I understand your question correctly you can do it like this:
# load package
library(lubridate)
# replace missing In values with the corresponding Out values,
# setting 5:00:01 as time.
df$In[is.na(df$In)] <- ymd_hms(paste0(as.Date(df$Out[is.na(df$In)]), " 5:00:01"))
# same idea but first we save it as a vector...
Out <- ymd_hms(paste0(as.Date(df$In[is.na(df$Out)]), " 5:00:00"))
# ... then we add one day
day(Out) <- day(Out) + 1; df$Out[is.na(df$Out)] <- Out
This works for the data that you provided, but if the Out time is, for example, 2019-09-21 04:07:03, then the corresponding imputed In time (2019-09-21 05:00:01) is later than it. I do not know if this is intended. If not, please clarify your question.
I used this data
structure(list(In = structure(c(1569016642, NA), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Out = structure(c(NA, 1569042423), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("In", "Out"), row.names = c(NA, -2L), class = "data.frame")
This should be really simple. I am currently trying to make a list I am building slightly more efficient. Instead of having to write out:
list('1'= value1, '2' =value1, '3' = value1)
how would I condense this so that I can simply list the numbers I want to be equal to value1, e.g. '1:4' = value1 or '1,2,3,4' = value1?
EDIT:
So, for background, I am currently trying to create custom formatting for an excel file using the xlsx package.
wb = createWorkbook()
sheet =createSheet(wb,sheetName = "TestFormatting")
dfcurrency = DataFormat("[$$-409]#,##0_ ;[Red]-[$$-409]#,##0 ")
dfdate = DataFormat("m/d/yyyy")
currency = CellStyle(wb, dataFormat = dfcurrency)
date = CellStyle(wb, dataFormat = dfdate)
datastyle = setNames(as.list(c(currency,date)),rep(c(3,4),c(1)))
data = addDataFrame(table,sheet, colStyle = datastyle)
This is what I am currently running, thanks to akrun's help. It gives the error:
Error in thisColStyle$ref : no field, method or inner class called 'ref'
And just in case it's useful, here is the data structure of table:
structure(list(workingdate = structure(c(1458518400, 1458604800,
1458691200, 1458777600, 1458864000, 1459119600), class = c("POSIXct",
"POSIXt"), tzone = ""), trader = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("a", "b", "c",
"d", "e"), class = "factor"), pnl.1d = c(3,
-573.7978, -107.1941, 1128.3061, -0.709699999999998, 3.55990000000003
), rt.1d.Util = c(0, -3.82531866666667e-05, -7.14627333333333e-06,
7.52204066666667e-05, -4.73133333333332e-08, 2.37326666666669e-07
)), .Names = c("workingdate", "trader", "pnl.1d", "rt.1d.Util"
), row.names = c(NA, 6L), class = "data.frame")
Here's a very general way to do similar things. This solution is likely more convoluted than the best solution, but it will work and can be extended to similar problems. It is based on eval and parse. parse turns a string into an unevaluated expression, eval evaluates it.
So, eval(parse(text="5+5")) will return 10.
If we can create the string "list('1'=value1, '2'=value1, '3'=value1)", we can then use eval(parse(text= to turn it into the list you want.
The following code will create the above string:
value1 <- 'asdf'
paste(
'list(', paste(sapply(seq_len(4),
function(n) { paste("'", n,"'", "=", "value1", sep="")}),
collapse = ","),
')')
So, combining everything, call
eval(parse(text=
paste(
'list(', paste(sapply(seq_len(4),
function(n) { paste("'", n,"'", "=", "value1", sep="")}),
collapse = ","),
')')))
And you get the list you want.
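As a sketch of an alternative that avoids eval and parse (not part of the answer above): repeat the value once per key and attach the keys as names with setNames(). The strings "currency" and "date" below are placeholders standing in for real style objects.

```r
value1 <- "asdf"

# Repeat the value once per key, then attach the keys as names
keys <- 1:4
result <- setNames(rep(list(value1), length(keys)), keys)

# Several groups: build one named list per value and concatenate them.
# "currency" and "date" are placeholder strings, not real CellStyle objects.
styles <- c(
  setNames(rep(list("currency"), 2), c(3, 5)),
  setNames(rep(list("date"), 1), 1)
)

str(result)
names(styles)
```

Wrapping the value in list() before rep() keeps each element intact, which matters when the value is an object rather than a plain string.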
Thanks to Julian's comment I was able to create a solution to this. I will accept Julian's comment as the answer but will give my own (less general) solution as an example. It basically applies his solution so as to create more customisability in an albeit very roundabout way:
#if no columns need a type of format enter 0
a =paste(sapply(list(c(
#enter column numbers formatted as currency eg. 1:5, 8, 10
3
)),
function(n) { paste("'", n,"'", "=", "currency", sep="")}))
b =paste(sapply(list(c(
#columns formatted as date
1
)),
function(n) { paste("'", n,"'", "=", "date", sep="")}))
You can continue in this fashion, using the same general formula, for as many formats as you like (c, d, and so on). You can then combine them into one string ready to be parsed:
text = paste( 'list(',paste(c(a,b,c,d), collapse = ","),')')
datastyle = eval(parse(text = text))
where you simply enter all your formats or styles in a,b,c,d,...
Hopefully this will help someone who finds a similar problem.
So I am trying to operate a function over a few columns of a data frame, using a for loop.
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)
data <- cbind(data[1:2], for(i in seq(3, 9)) {z(data[[i]])})
I keep running into the error mentioned in the subject:
arguments imply differing number of rows
The number of rows is the same in all my columns.
I also tried lapply for this. It works, but it converts the columns I apply the function to into factors. The columns hold numerical values but are read from the file as character strings (that is how they are stored). So when I try to convert to numbers after using lapply, I get the factor level codes as output (1, 2, 3, ...).
Any suggestions, using either the for loop, or lapply are welcome. Thanks in advance.
> dput(head(data,3))
structure(list(MCF.Channel.Grouping = structure(c(6L, 6L, 6L), .Label = c("(Other)",
"Direct", "Display", "Email", "Organic Search", "Paid Search",
"Referral", "Social Network"), class = "factor"), Device.Category = structure(c(2L,
1L, 3L), .Label = c("desktop", "mobile", "tablet"), class = "factor"),
Spend = c("A$503,172.17", "A$375,940.43", "A$92,560.94"),
Clicks = c("1,545,416", "1,037,740", "291,314"), Impressions = c("7,328,657",
"3,787,612", "1,178,508"), Data.Driven.Conversions = c("1,697,814.32",
"1,540,810.43", "430,738.63"), Data.Driven.CPA = c("A$0.30",
"A$0.24", "A$0.21"), Data.Driven.Conversion.Value = c("A$12,815,842.66",
"A$13,883,073.58", "A$3,804,800.15"), Data.Driven.ROAS = c("2547.01%",
"3692.89%", "4110.59%")), .Names = c("MCF.Channel.Grouping",
"Device.Category", "Spend", "Clicks", "Impressions", "Data.Driven.Conversions",
"Data.Driven.CPA", "Data.Driven.Conversion.Value", "Data.Driven.ROAS"
), row.names = c(NA, 3L), class = "data.frame")
We can use
data[-(1:2)] <- lapply(data[-(1:2)], z)
The function is run on columns that are not the first or second. The output is assigned to the same subset in the data.
The original method did not work because the for loop does not result in saved output. Check by trying to save it as a variable:
x <- for(i in seq(3, 9)) {z(data[[i]])}
x
NULL
Even though we assigned the loop to a variable, nothing was captured: a for loop evaluates its body on each iteration but discards the values, and the loop itself returns NULL. To make a loop work, assign the values inside it:
for ( i in 3:9) data[,i] <- z(data[,i])
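The question also mentions converting the cleaned strings to numbers. A minimal sketch of doing both steps in one pass, on a small stand-in data frame with values taken from the question's dput(); the as.character() call is a precaution for columns that arrive as factors, where as.numeric() alone would return level codes.

```r
# Cleaning function from the question: keep only digits and dots
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)

# Small stand-in for the question's data
data <- data.frame(
  Device.Category = factor(c("mobile", "desktop")),
  Spend  = c("A$503,172.17", "A$375,940.43"),
  Clicks = c("1,545,416", "1,037,740"),
  stringsAsFactors = FALSE
)

# Clean and convert every column except the first;
# as.character() guards against factor columns
data[-1] <- lapply(data[-1], function(col) as.numeric(z(as.character(col))))

str(data)
```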
I'm currently building a strategy using quantstrat/blotter. My price data uses numbers as security identifiers, so these numbers are the column names and also what I pass as symbol names to functions such as stock() when defining the financial instruments. However, as the reproducible code below shows (using a very small portion of my dataset), whenever stock() is called on these numerical identifiers, the FinancialInstrument package modifies them in a strange manner: it prepends an "X" and removes the leading digit. Given this, are there any restrictions on symbol names for use with the FinancialInstrument package?
structure(c(9.17000007629395, 9.17000007629395, 9.17000007629395,
9.17000007629395, 9.17000007629395, 9.17000007629395, 41.0999984741211,
40.7599983215332, 40.4599990844727, 40.1500015258789, 40.5299987792969,
40.5299987792969, 41.9900016784668, 41.7449989318848, 42.0299987792969,
41.7200012207031, 42.25, 41.7000007629395, 29.3199996948242,
29.3199996948242, 29.3199996948242, 29.3199996948242, 29.3199996948242,
29.3199996948242), class = c("xts", "zoo"), .indexCLASS = "Date", tclass = "Date", .indexTZ = "UTC", tzone = "UTC", index = structure(c(1403481600,
1403568000, 1403654400, 1403740800, 1403827200, 1404086400), tzone = "UTC", tclass = "Date"), .Dim = c(6L,
4L), .Dimnames = list(NULL, c("10078", "10104", "10107", "10108"
)))
colnames(x)
# "10078" "10104" "10107" "10108"
for(i in colnames(x)){
stock(i,currency="USD",multiplier=1)
}
ls_stocks()
# "X0078" "X0104" "X0107" "X0108"
Instrument names need to begin with a letter or a dot. The instrument function uses make.names to ensure this. If it's important to be able to find your instruments by a number, then you can add it as an identifier.
stock("X1234", currency("USD"), identifiers=list(num=1234))
getInstrument("1234")
#primary_id :"X1234"
#currency :"USD"
#multiplier :1
#tick_size :0.01
#identifiers:List of 1
# ..$ num:1234
#type :"stock"
Another way to add an identifier
add.identifier("X1234", id2=42)
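For reference, a small sketch of what base R's make.names() does on its own with the identifiers from the question. It only prepends "X"; the dropped leading digit seen in the question is not reproduced here, so it presumably comes from additional handling inside the package (an assumption, not something confirmed above).

```r
ids <- c("10078", "10104", "10107", "10108")

# make.names() turns each string into a syntactically valid R name
valid <- make.names(ids)
valid
# "X10078" "X10104" "X10107" "X10108"
```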