I'm using the convert function in the highfrequency package in R. The dataset I'm using is TAQ data downloaded from WRDS. The data looks like this:
The convert function is supposed to convert the .csv files into .RData files of xts objects.
I followed the instructions of the package and used the following code:
library(highfrequency)
from <- "2017-01-05"
to <- "2017-01-05"
format <- "%Y%m%d %H:%M:%S"
datasource <- "C:/Users/feimo/OneDrive/SFU/Thesis-Project/R/IBM"
datadestination <- "C:/Users/feimo/OneDrive/SFU/Thesis-Project/R/IBM"
convert(from = from, to = to, datasource = datasource,
        datadestination = datadestination, trades = TRUE, quotes = FALSE,
        ticker = "IBM", dir = TRUE, extension = "csv",
        header = FALSE, tradecolnames = NULL,
        format = format, onefile = TRUE)
But I got the following error message:
> Error in `$<-.data.frame`(`*tmp*`, "COND", value = numeric(0)) :
> replacement has 0 rows, data has 23855
I believe the default column names expected by the function are c("SYMBOL", "DATE", "EX", "TIME", "PRICE", "SIZE", "COND", "CORR", "G127"), which differs from my dataset, so I manually changed the headers in my .csv to match. Then I got another error:
>Error in xts(tdata, order.by = tdobject) : 'order.by' cannot contain 'NA', 'NaN', or 'Inf'
I tried looking at the original code, but couldn't find a solution.
Any suggestion would be really helpful. Thanks!
When I run your code on the data to which you provide a link, I get the second error you mention:
Error in xts(tdata, order.by = tdobject) :
'order.by' cannot contain 'NA', 'NaN', or 'Inf'
This error can be traced to these lines in the function highfrequency:::makeXtsTrades(), which is called by highfrequency::convert():
tdobject = as.POSIXct(paste(as.vector(tdata$DATE), as.vector(tdata$TIME)),
format = format, tz = "GMT")
tdata = xts(tdata, order.by = tdobject)
The error results from two problems:
The variable "DATE" in your data file is read into R as numeric, whereas it appears that the code creating tdobject expects tdata$DATE to be a character vector. You could fix this by manually converting that variable to a character vector:
# read the raw file, force DATE to character, and write it back out
tdata <- read.csv("IBM_trades.csv")
tdata$DATE <- as.character(tdata$DATE)
write.csv(tdata, file = "IBM_trades_DATE_fixed.csv", row.names = FALSE)
The variable "TIME_M" in your data file is not a time of the format "%H:%M:%S". It looks like it is only the minutes and seconds component of a more complete time variable, because values only contain one colon and the values before and after the colon vary from 0 to 59.9. Fixing this problem would require finding the hour component of the time variable.
These two problems result in tdobject being filled with NA values rather than valid date-times, which causes an error when xts::xts() tries to order the data by tdobject.
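A quick way to confirm the diagnosis is to reproduce this parsing step outside convert(); a minimal sketch, assuming your file is named IBM_trades.csv and the columns have already been renamed as described in the question:
tdata <- read.csv("IBM_trades.csv")
tdata$DATE <- as.character(tdata$DATE)
# same call makeXtsTrades() makes internally
tdobject <- as.POSIXct(paste(tdata$DATE, tdata$TIME),
                       format = "%Y%m%d %H:%M:%S", tz = "GMT")
sum(is.na(tdobject))  # any NA here will trigger the xts() error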
The more general issue seems to be that highfrequency::convert() expects your data to follow something like the format described here on the WRDS website, but your data has slightly different column names and possibly different value formats. I would recommend taking a close look at that WRDS page and the documentation for your data file, and determining which variables in your data correspond to those described on that page (for instance, it's not clear to me that your data contains any variable equivalent to "G127").
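As a starting point, a rough sketch of renaming columns before running convert(); the left-hand names (SYM_ROOT, TIME_M) are assumptions based on daily TAQ files, so check them against your data dictionary:
tdata <- read.csv("IBM_trades.csv")
rename_map <- c(SYM_ROOT = "SYMBOL", TIME_M = "TIME")  # assumed old -> new names
matched <- names(tdata) %in% names(rename_map)
names(tdata)[matched] <- rename_map[names(tdata)[matched]]
write.csv(tdata, "IBM_trades_renamed.csv", row.names = FALSE)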
I have created my own dataset, Kwality.csv, in Excel, and when I execute the code below I am not able to get a histogram of the data; it throws this error:
Error in hist.default(mydata) : 'x' must be numeric
library(data.table)
mydata = fread("Kwality.csv", header = FALSE)
View(mydata)
hist(mydata)
I tried to reproduce your workflow and exported an xlsx file to a csv file (using export to comma-separated file).
First, you should check which characters are used as the field separator and the decimal mark. In my case the field separator is a semicolon (;) and the decimal mark is a comma (,).
Then you should choose the column to plot and extract it with [[ ]]; the data.table itself is not a valid argument for hist(). Taking this into consideration, you can execute your code:
library(data.table)
# load csv generated by NORMSINV(RAND()) in Excel
mydata = fread("check.csv",header = FALSE, sep = ";", dec = ",")
mydata
#hist(mydata)
# Error in hist.default(mydata) : 'x' must be numeric
# does not work
# access by column, e.g. the third column - OK
hist(mydata[[3]])
Output: a histogram of the values in the third column.
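Equivalently, you can select the column by name; fread() auto-names header-less columns V1, V2, ..., so (assuming those default names) this plots the same column:
hist(mydata[["V3"]])  # same column as mydata[[3]]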
I have an Excel file which has date information in some cells, like this:
I read this file into R with the following commands:
library(xlsx)
data.files = list.files(pattern = "*.xlsx")
data <- lapply(data.files, function(x) read.xlsx(x, sheetIndex = 9,header = T))
Everything is correct except the cells with dates! Instead of the date information from the xlsx file, I always get 42948 as the date:
Does anybody know how I could fix this?
As you can see, after importing your files, dates are represented as numeric values (here 42948). They are actually the internal representation of the date information in Excel. Those values are the ones that R presents instead of the “real” dates.
You can get those dates in R with as.Date(42948 - 25569, origin = "1970-01-01") (25569 is the Excel serial number of 1970-01-01, R's date origin).
Notice that you can also use a vector containing the internal representations of the dates, so this should also work:
vect <- c(42948, 42949, 42950)
as.Date(vect - 25569, origin = "1970-01-01")
PS: To convert an Excel datetime column, see this (p. 31).
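For datetime columns specifically, a minimal sketch: Excel stores datetimes as days since 1899-12-30 with the time of day as a fractional day, so you can scale to seconds and parse with as.POSIXct() (the example values are made up):
excel_dt <- c(42948.5, 42948.75)  # noon and 6 pm on 2017-08-01
as.POSIXct(excel_dt * 86400, origin = "1899-12-30", tz = "GMT")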
I am trying to do a sequential pattern analysis using arulesSequences in R.
My dataset has 626,047 rows after removing all kinds of duplicates, and 3 columns. I unfortunately can't put the dataset out here, so I have created sample data in a Google Sheet to give an idea of what the data looks like; it is here. The data is named df_sq.
It has 3 columns:
Numeric_id - of class numeric. This is a user_id.
Product - of class factor.
Time - of class integer.
I have been able to convert the data to 'transaction' format according to the package. But on running cspade, I get the following error:
Error in makebin(data, file) : 'eid' invalid (strict order)
Now, I know from reading other questions on Stack Overflow that this means I have to sort my data.
So I went back and sorted my original data by numeric_id and time (and vice versa as well), re-converted the data to 'transaction' format, and re-ran cspade.
I am still getting the same error.
Has anyone worked with this package before?
Here is the code I used:
library(arules)
library(arulesViz)
library(arulesSequences)
library(sqldf)
df_sq = read.csv("service_data.csv", stringsAsFactors = FALSE)
#Changing class of timestamp column and coercing product name to factor
df_sq$time1 = as.integer(as.numeric(df_sq$time1))
df_sq$service_name = as.factor(df_sq$service_name)
#Clearing duplicates
df_sq = sqldf("select distinct numeric_id, service_name, time1
from df_sq")
#Ordering the dataset on numeric_id and time
df_sq = df_sq[order(df_sq$numeric_id, df_sq$time1),]
# also tried ordering by time1 only, and by sequenceID:
# df_sq = df_sq[order(df_sq$time1),]
# df_sq = df_sq[order(df_sq$sequenceID),]
#Converting to transactional format per the package
sq_data = data.frame(item = df_sq$service_name)
sq_tran = as(sq_data, "transactions")
transactionInfo(sq_tran)$sequenceID = df_sq$numeric_id
transactionInfo(sq_tran)$eventID = df_sq$time1
summary(sq_tran)
#Running cSpade
s1 = cspade(sq_tran, parameter = list(support = 0.1),
            control = list(verbose = TRUE), tmpdir = tempdir())
summary(s1)
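For what it's worth, here is a small self-contained sketch of the ordering cspade() seems to require: rows grouped by sequenceID with strictly increasing eventID within each sequence (the toy values below are made up):
library(arulesSequences)
# toy stand-in for df_sq
df <- data.frame(numeric_id = c(2L, 1L, 1L, 2L),
                 service_name = factor(c("a", "b", "a", "c")),
                 time1 = c(5L, 1L, 2L, 3L))
# sort by sequence id, then event time
df <- df[order(df$numeric_id, df$time1), ]
sq_tran <- as(data.frame(item = df$service_name), "transactions")
transactionInfo(sq_tran)$sequenceID <- df$numeric_id
transactionInfo(sq_tran)$eventID <- df$time1
s1 <- cspade(sq_tran, parameter = list(support = 0.5))
summary(s1)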
I am still learning R, and get very confused when using various data types, classes, etc. I have run into this issue of "Dates" not being in the right format for xts countless times now, and each time I only find a solution after searching long and hard for (what I consider) complicated solutions.
I am looking for a way to load a CSV into R and convert the dates as they are loaded, every time I load a csv into R. 99% of my files contain Date as the first column, in the format 01-31-1900 (xts wants YYYY-mm-dd).
Right now I have the following:
FedYieldCurve <- read.csv("Yield Curve.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
FedYieldCurve$Date <- format(as.Date(FedYieldCurve$Date), "%Y/%m/%d")
and I am getting:
Error in charToDate(x) :
character string is not in a standard unambiguous format
The format argument must be inside as.Date. Try this (if the dates in the files are stored in the 01-31-1900 format):
as.Date(FedYieldCurve$Date,format="%m-%d-%Y")
When you try to coerce a string to a Date object, you have to specify the format of the string via the format argument of the as.Date call. You get the error you reported when you try to coerce a string whose format is anything other than the standard YYYY-mm-dd.
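For example, the first call below parses correctly, while the second reproduces the error:
as.Date("01-31-1900", format = "%m-%d-%Y")
# [1] "1900-01-31"
as.Date("01-31-1900")
# Error in charToDate(x) : character string is not in a standard unambiguous format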
Provide a few lines of the file when asking questions like this. In the absence of this we have supplied some data below in a self contained example.
Use read.zoo from the zoo package (which xts loads) specifying the format. (Replace the read.zoo line with the commented line to read from a file.)
Lines <- "Date,Value
01-31-1900,3"
library(xts)
# z <- read.zoo("myfile.csv", header = TRUE, sep = ",", format = "%m-%d-%Y")
z <- read.zoo(text = Lines, header = TRUE, sep = ",", format = "%m-%d-%Y")
x <- as.xts(z)
See ?read.zoo and Reading Data in zoo.
I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function to import my csv file, then pasted the line from the console window into the source window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to, but factor, so when I try to plot the variables I only get histograms, not the usual plot. When checking the data it seems to be in order; I'm just unable to use it since it's in factor form.
Both the data import function (here: read.csv()) and a global option let you specify stringsAsFactors = FALSE, which should fix this.
By default, read.csv checks each column of your data to see whether it can be treated as numeric. If it finds any non-numeric values, it reads the column as character data, and character variables are converted to factors.
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
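(For reference, the method from R FAQ 7.10 indexes the numeric conversion of the levels, which avoids converting every element to character:)
f <- factor(c("10", "20", "30"))
as.numeric(levels(f))[f]
# [1] 10 20 30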
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors = FALSE)
Then read the file as follows:
my.tab <- read.table("filename.csv", sep = ",", header = TRUE, as.is = TRUE)
When importing csv data files, the import command should reflect both the field separator between columns (here ";") and the decimal separator for numeric values (for a value written as 2,5 this would be ","). The import command therefore has to be a bit more explicit:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integer or numeric. (Note that read.csv2 already defaults to sep = ";" and dec = ",", so the explicit arguments are mainly for clarity.)
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric") # specific columns to numeric
Note that if a variable can't be converted to numeric, it is read in as factor by default, which makes it more difficult to convert to numbers afterwards. It can therefore be advisable to read all variables in as character (colClasses = "character") and then convert the specific columns to numeric once the csv is read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that it was caused by my csv file using a comma as the thousands separator in all numeric columns (e.g. 1,233,444.56 instead of 1233444.56).
I removed the thousands separators in my csv file and then reloaded it into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
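One way to handle it after reading is to load the affected columns as character and strip the separators with gsub(); a sketch, assuming PTS is one of the affected columns:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
stuckey$PTS <- as.numeric(gsub(",", "", stuckey$PTS))  # drop thousands separators, then convert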
This only worked right for me when including strip.white = TRUE in the read.csv command.
(I found the solution here.)
For me the solution was to include the skip argument
(the number of rows to skip at the top of the file; it defaults to 0 and can be set higher):
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 22)