I am trying to do a sequential pattern analysis using arulesSequences in R.
My data set has 626,047 rows after removing all kinds of duplicates. I unfortunately can't put the dataset out here, but I have created sample data in a Google Sheet to give an idea of what the data looks like; it is here. The data frame is named df_sq.
It has 3 columns:
Numeric_id - of class numeric. This is a user_id.
Product - of class factor.
Time - of class integer.
I have been able to convert the data to 'transactions' format according to the package. But on running cspade, I get the following error:
Error in makebin(data, file) : 'eid' invalid (strict order)
Now, I know from reading other questions on Stack Overflow that this means I have to sort my data.
So I went back and sorted my original data by numeric_id and time (and vice versa), re-converted the data to 'transactions' format, and re-ran cspade.
I am still getting the same error.
Has anyone worked with this package before?
Here is the code I had used:
library(arules)
library(arulesViz)
library(arulesSequences)
library(sqldf)
df_sq = read.csv("service_data.csv", stringsAsFactors = FALSE)
#Changing class of timestamp column and coercing product name to factor
df_sq$time1 = as.integer(as.numeric(df_sq$time1))
df_sq$service_name = as.factor(df_sq$service_name)
#Clearing duplicates
df_sq = sqldf("select distinct numeric_id, service_name, time1
from df_sq")
#Ordering the dataset on numeric_id and time (I also tried the other
#orderings mentioned above)
df_sq = df_sq[order(df_sq$numeric_id, df_sq$time1),]
#Converting to transactions format per the package
sq_data = data.frame(item=df_sq$service_name)
sq_tran = as(sq_data, "transactions")
transactionInfo(sq_tran)$sequenceID = df_sq$numeric_id
transactionInfo(sq_tran)$eventID = df_sq$time1
summary(sq_tran)
#Running cSpade
s1 = cspade(sq_tran, parameter = list(support = 0.1),
            control = list(verbose = TRUE), tmpdir = tempdir())
summary(s1)
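One check worth adding here (a suggestion on my part, based on the package requirement that eventID be strictly increasing within each sequenceID): duplicated (sequenceID, eventID) pairs trigger the "strict order" error even after sorting.
info = transactionInfo(sq_tran)
#TRUE means some numeric_id has two events with the same timestamp; these
#must be merged into one transaction or de-duplicated before cspade will run
any(duplicated(info[, c("sequenceID", "eventID")]))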
We are interested in analyzing our pupil data (only size, not position) recorded with an SR Research EyeLink system at 1000 Hz.
We exported the files using the SR data viewer as sample reports.
After running ppl_prep_data, the TIMESTAMP variable is converted from character to numeric; however, it returns all NA and the real timestamp values are lost. The rest of the pipeline is therefore not working.
Does anyone have an idea why this happens, and if so, how we might work around it?
Below you can find the code that we are using:
#Step 1: load library
library(PupilPre)
#Step 2: load data
# change the folder where the data is in the line below
Pupildat <- read.table("DATAXX.txt", header = T, sep = "\t", na.strings = c(".", "NA"))
# after reading in, the first column name is garbled (something like '?..'), so we rename it in the next line of code
names(Pupildat)[1] <- 'RECORDING_SESSION_LABEL'
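# (Assumption on our part, not verified with our file: a garbled first
# column name is often a UTF-8 byte-order mark; reading with
# fileEncoding = "UTF-8-BOM" should avoid the manual rename.)
# Pupildat <- read.table("DATAXX.txt", header = T, sep = "\t",
#                        na.strings = c(".", "NA"), fileEncoding = "UTF-8-BOM")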
### Step 3: PupilPre pipeline ###
# Check classes of columns and reassigns => creates event variable
data_pre <- ppl_prep_data(data = Pupildat, Subject = "RECORDING_SESSION_LABEL", EventColumns = c("Subject", "TRIAL_INDEX"))
align_msg(data_pre, Msg = "Hashtag_1")
#Using the function check_msg_time you can see that the TIMESTAMP values associated with the message are not the same for each event.
#This indicates that alignment is required. Note that a regular expression (regex) can be used here as the message string.
#example below, though think we want different timings for the events
check_msg_time(data = data_pre, Msg = "Hashtag_1")
### returns NA
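One diagnostic that may help (our own suggestion, not part of the PupilPre pipeline): inspect the raw TIMESTAMP strings before ppl_prep_data converts them, since stray characters or locale issues such as decimal commas make as.numeric() return NA.
#look at the raw strings and count how many fail to parse as numeric
head(unique(as.character(Pupildat$TIMESTAMP)))
sum(is.na(suppressWarnings(as.numeric(as.character(Pupildat$TIMESTAMP)))))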
I am trying to read a csv (~18,000,000 rows, ~1,000 columns) into arrow (in R) with open_dataset, pre-specifying a schema. In some instances the csv was generated incorrectly and some values don't match the intended schema (say, some cells that were supposed to hold an individual's age (int) instead hold the individual's name (string)). My intention is to set ages that can't be parsed as integers to NA.
The default behaviour of open_dataset is to throw the following error:
CSV conversion error to int8: invalid value
Is there a way in which instead of getting an error when the schema is unable to parse I can get a missing value NA?
Here is an example of code that generates the error:
library(tidyverse)
library(arrow)
#Write csv
tibble(age = c(1, 2, "StackOverflow", 5)) %>%
  write_csv("example.csv")
#Read the csv
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1) %>%
collect()
I know that I can specify the null_values inside the CsvConvertOptions if I know them previously as follows:
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1,
convert_options = CsvConvertOptions$create(null_values = "StackOverflow")) %>%
collect()
However, this feels pretty inefficient: since I don't know the mistakes a priori, it seems I need to go through the data twice (once to find the bad values and once to read with the corrected options).
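One workaround sketch (my own, under the assumption that a plain R pass over the collected data is acceptable): declare the column as string in the schema so arrow never fails the parse, then convert in R, where as.integer() turns unparseable strings into NA with a warning.
arrow::open_dataset("example.csv", format = "csv",
                    schema = schema(age = string()), skip = 1) %>%
  collect() %>%
  #unparseable strings become NA; suppressWarnings() silences the coercion note
  mutate(age = suppressWarnings(as.integer(age)))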
I'm using the convert function in the highfrequency package in R. The dataset I'm using is TAQ data downloaded from WRDS. The data looks like this.
The convert function is supposed to convert the .csv into .RData files of xts objects.
I followed the package instructions and used the following code:
library(highfrequency)
from <- "2017-01-05"
to <- "2017-01-05"
format <- "%Y%m%d %H:%M:%S"
datasource <- "C:/Users/feimo/OneDrive/SFU/Thesis-Project/R/IBM"
datadestination <- "C:/Users/feimo/OneDrive/SFU/Thesis-Project/R/IBM"
convert(from = from, to = to, datasource = datasource,
        datadestination = datadestination, trades = T, quotes = F,
        ticker = "IBM", dir = T, extension = "csv",
        header = F, tradecolnames = NULL,
        format = format, onefile = T)
But I got the following error message:
> Error in `$<-.data.frame`(`*tmp*`, "COND", value = numeric(0)) :
> replacement has 0 rows, data has 23855
I believe the default column names in the function are c("SYMBOL", "DATE", "EX", "TIME", "PRICE", "SIZE", "COND", "CORR", "G127"), which is different from my dataset, so I manually changed them in my .csv to match. Then I got another error:
> Error in xts(tdata, order.by = tdobject) : 'order.by' cannot contain 'NA', 'NaN', or 'Inf'
Tried to look at the original code, but couldn't find a solution.
Any suggestion would be really helpful. Thanks!
When I run your code on the data to which you provide a link, I get the second error you mention:
Error in xts(tdata, order.by = tdobject) :
'order.by' cannot contain 'NA', 'NaN', or 'Inf'
This error can be traced to these lines in the function highfrequency:::makeXtsTrades(), which is called by highfrequency::convert():
tdobject = as.POSIXct(paste(as.vector(tdata$DATE), as.vector(tdata$TIME)),
format = format, tz = "GMT")
tdata = xts(tdata, order.by = tdobject)
The error results from two problems:
The variable "DATE" in your data file is read into R as numeric, whereas it appears that the code creating tdobject expects tdata$DATE to be a character vector. You could fix this by manually converting that variable to a character vector:
tdata <- read.csv("IBM_trades.csv")
tdata$DATE <- as.character(tdata$DATE)
write.csv(tdata, file = "IBM_trades_DATE_fixed.csv", row.names = FALSE)
The variable "TIME_M" in your data file is not a time of the format "%H:%M:%S". It looks like it is only the minutes and seconds component of a more complete time variable, because values only contain one colon and the values before and after the colon vary from 0 to 59.9. Fixing this problem would require finding the hour component of the time variable.
These two problems result in tdobject being filled with NA values rather than valid date-times, which causes an error when xts::xts() tries to order the data by tdobject.
The more general issue seems to be that the function highfrequency::convert() expects your data to follow something like the format described here on the WRDS website, but your data has slightly different column names and possibly different value formats. I would recommend taking a close look at that WRDS page and the documentation for your data file and determining which variables in your data correspond to those described on that page (for instance, it's not clear to me that your data contains any variable that is equivalent to "G127").
I am working on data mining in R programming and I'm using RStudio. My dataset looks like this:
I've used 'yes'/'no' instead of actual disease names in some places just to check whether it works for 'yes' or 'no'.
Here you can see that a patient has different diseases/diagnoses. I am trying to use association rules to display the diseases that a person is suffering from along with HTN. I've written the following code:
mytestdata <- read.csv("D:/Senior Thesis/Program/test.csv", header=T,
colClasses = "factor", sep = ",")
library(arules)
myrules <- apriori(mytestdata,
parameter = list(supp = 0.1, conf = 0.1, maxlen=10, minlen=2),
appearance = list(rhs=c("Disease.1=HTN")))
summary(myrules)
inspect(myrules)
But I'm not getting any disease name in the column lhs; you can see that in the following image:
Please help me get the lhs to show the names of the diseases associated with the rhs, which is Disease.1=HTN.
Your code takes missing values (e.g. cell E4 in the Excel sheet) as a factor level. You can prevent this behaviour by specifying the NA string in read.csv:
mytestdata <- read.csv("D:/Senior Thesis/Program/test.csv", header=T,
colClasses = "factor", sep = ",", na.strings = "")
It would, if you had more data. There are just 3 rows that satisfy your rhs!
Note that you do get Disease.2=yes.
But I assume you want to ignore order on the diseases...
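A sketch of what ignoring the column order could look like (my own, assuming mytestdata is the data frame from the question, read with na.strings = "" as above, and that HTN appears as an item): melt the disease columns into long format and build one transaction per patient.
#one row per (patient, disease), dropping empty cells
long <- na.omit(data.frame(
  id = rep(seq_len(nrow(mytestdata)), times = ncol(mytestdata)),
  disease = unlist(lapply(mytestdata, as.character)),
  stringsAsFactors = FALSE
))
#unique() guards against the same disease appearing twice for one patient
trans <- as(lapply(split(long$disease, long$id), unique), "transactions")
myrules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.1, minlen = 2),
                   appearance = list(rhs = "HTN"))
inspect(myrules)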
I want to create a transaction object in basket format which I can call anytime for my analyses. The data contains comma separated items with 1001 transactions. The first 10 transactions look like this:
hering,corned_b,olives,ham,turkey,bourbon,ice_crea
baguette,soda,hering,cracker,heineken,olives,corned_b
avocado,cracker,artichok,heineken,ham,turkey,sardines
olives,bourbon,coke,turkey,ice_crea,ham,peppers
hering,corned_b,apples,olives,steak,avocado,turkey
sardines,heineken,chicken,coke,ice_crea,peppers,ham
olives,bourbon,coke,turkey,ice_crea,heineken,apples
corned_b,peppers,bourbon,cracker,chicken,ice_crea,baguette
soda,olives,bourbon,cracker,heineken,peppers,baguette
corned_b,peppers,bourbon,cracker,chicken,bordeaux,hering
...
I observed that there are duplicated transactions in the data and removed them, but each time I try to read the transactions, I get:
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
Here is my code:
data <- read.csv("AssociationsItemList.txt",header=F)
data <- data[!duplicated(data),]
pop <- NULL
for(i in 1:length(data)){
  pop <- paste(pop, data[i], sep="\n")
}
write(pop, file = "Trans", sep = ",")
transdata <- read.transactions("Trans", format = "basket", sep=",")
I'm sure there's something small yet important I've missed. Kindly offer your assistance.
The problem is not with duplicated transactions (the same row appearing twice)
but duplicated items (the same item appearing twice, in the same transaction --
e.g., "olives" on line 4).
read.transactions has an rm.duplicates argument to remove those duplicates.
read.transactions("Trans", format = "basket", sep=",", rm.duplicates=TRUE)
Vincent Zoonekynd is right: the problem is caused by duplicated items within a transaction. Here I can explain why arules requires transactions without duplicated items.
The transaction data is stored internally as an ngCMatrix object. Relevant source code:
setClass("itemMatrix",
representation(
data = "ngCMatrix",
...
setClass("transactions",
contains = "itemMatrix",
...
ngCMatrix is a sparse matrix class defined in the Matrix package. Its description from the official documentation:
The nsparseMatrix class is a virtual class of sparse “pattern” matrices, i.e., binary matrices conceptually with TRUE/FALSE entries. Only the positions of the elements that are TRUE are stored
It seems ngCMatrix stores the status of each element as a binary indicator, which means a transactions object in arules can only record whether an item is present or absent in a transaction and cannot record quantity. So duplicated items within one transaction simply cannot be represented.
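A tiny demonstration of the consequence (my own example, not from the original data): the same item twice in one basket cannot be encoded, which is exactly the coercion error above.
library(arules)
baskets <- list(c("olives", "olives", "ham"), c("ham", "turkey"))
#as(baskets, "transactions")  #fails: "can not coerce list with transactions
#                              with duplicated items"
trans <- as(lapply(baskets, unique), "transactions")
inspect(trans)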
I just used the unique function to remove duplicates. My data was a little different since I had a data frame (the data was too large for a CSV) with 2 columns: product_id and trans_id. I know it's not your specific question, but I had to do this to create the transaction dataset and apply association rules.
data # > 1 Million Transactions
data <- unique(data[, 1:2])
trans <- as(split(data[,"product_id"], data[,"trans_id"]),"transactions")
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.2))
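If helpful, a small follow-up (my addition, not part of the original answer) to look at the strongest rules:
inspect(head(sort(rules, by = "lift"), 10))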