I have what I thought was a well-prepared dataset. I wanted to use the Apriori Algorithm in R to look for associations and come up with some rules. I have about 16,000 rows (unique customers) and 179 columns that represent various items/categories. The data looks like this:
Cat1 Cat2 Cat3 Cat4 Cat5 ... Cat179
1, 0, 0, 0, 1, ... 0
0, 0, 0, 0, 0, ... 1
0, 1, 1, 0, 0, ... 0
...
I thought having a comma separated file with binary values (1/0) for each customer and category would do the trick, but after I read in the data using:
data5 = read.csv("Z:/CUST_DM/data_test.txt",header = TRUE,sep=",")
and then run this command:
rules = apriori(data5, parameter = list(supp = .001,conf = 0.8))
I get the following error:
Error in asMethod(object):
column(s) 1, 2, 3, ...178 not logical or a factor. Discretize the columns first.
I understand discretization, but not in this context, I guess; everything is a 1 or a 0. I've even changed the data from INT to CHAR and received the same error. I also had the customer ID (unique) as column 1, but I understand that isn't necessary when the data is in this form (flat file). I'm sure there is something obvious I'm missing; I'm new to R.
What am I missing? Thanks for your input.
I solved the problem this way: after reading the data into R, I used lapply() to convert the columns to factors (I think that's what it does). I then built a data frame from that result and was able to apply apriori() successfully.
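For completeness, a minimal sketch of that sequence, as I understand it (assuming data5 is the 0/1 data.frame read in above):

library(arules)
# Convert every column to a factor; lapply() returns a list,
# so rebuild the data.frame before calling apriori()
data5_factors <- as.data.frame(lapply(data5, factor))
rules <- apriori(data5_factors, parameter = list(supp = 0.001, conf = 0.8))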
Your data is actually already in (dense) matrix format, but read.csv always reads data in as a data.frame. Just coerce the data to a matrix first:
dat <- as.matrix(data5)
rules <- apriori(dat, parameter = list(supp = .001,conf = 0.8))
1s in the data will be interpreted as the presence of the item and 0s as the absence. More information about how to create transactions can be found in the manual page ?transactions.
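If you want to look at the transactions object explicitly before mining, a minimal sketch (assuming data5 is the 0/1 data read in above; the == 1 comparison turns the numeric matrix into the logical matrix that the coercion expects):

library(arules)
trans <- as(as.matrix(data5) == 1, "transactions")  # logical matrix -> transactions
summary(trans)  # item frequencies, transaction lengths, etc.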
I have checked the documentation of the package and found an example of how they fitted a DTMC on data.frame objects using the following code:
library(markovchain)
data(holson)
singleMc<-markovchainFit(data=holson[,2:12],name="holson")
The data I apply the code to is structured essentially the same way as the holson data, except that there are 10 states. Additionally, the numbers in my Excel file are integers, not character values; the states are the numbers 1 to 10. When I run the code on my data, it gives me a transition matrix where the states are ordered (1, 10, 2, 3, 4, 5, 6, 7, 8, 9), so in the matrix the state following 1 is 10.
It appears that R thinks the character "10" falls between "1" and "2" (lexicographic sorting?). How can I fix this issue and have the package recognize 10 as the state following 9?
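A quick illustration of what I mean (lexicographic sorting of the labels as characters):

sort(as.character(1:10))
# [1] "1"  "10" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"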
EDIT:
Here is an example
library(markovchain)
set.seed(12)
Test <- data.frame(entity = LETTERS[1:100],
Time1 = round(runif(n = 100, min = 1, max = 10)),
Time2 = round(runif(n = 100, min = 1, max = 10)),
Time3 = round(runif(n = 100, min = 1, max = 10)))
Test_Fit <- markovchainFit(data = Test[, 2:4], name = "Test_FIT")
Est_Test_Fit <- Test_Fit$estimate
Est_Test_Fit@transitionMatrix
I am not familiar with the markovchain package; I tend to use R2jags.
Reading the markovchain manual, it seems that a call to markovchainFit should be preceded by a call to createSequenceMatrix. (See the example code on page 11 of the manual.) The first parameter of createSequenceMatrix is "... a n x n matrix or a character vector or a list". Thus, contrary to my comment above, it seems that markovchain expects the state labels to be character rather than numeric. Given your question, your states are ordered rather than merely categorical, so the ordering "1", "10", "2" is a problem for you.
The solution would be to convert your numeric state labels to character before calling markovchainFit/createSequenceMatrix. Here are two possible ways of doing this:
charState <- LETTERS[state]
which will give you state labels of "A" to "J", or
charState <- sprintf("%02i", state)
which produces "01", "02", ... , "10".
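Putting this together with the Test data from your question, a hedged sketch using the sprintf() variant (as.integer() is added because sprintf's integer format expects whole numbers):

library(markovchain)
Test_chr <- as.data.frame(lapply(Test[, 2:4],
                                 function(x) sprintf("%02i", as.integer(x))),
                          stringsAsFactors = FALSE)
Test_Fit <- markovchainFit(data = Test_chr, name = "Test_FIT")
Test_Fit$estimate@transitionMatrix  # states now sort as "01", "02", ..., "10"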
By the way, did you run your test code before adding it to your question? Rows 27 to 100 of your Test dataframe have entity equal to NA, which I suspect is not what you intended. In addition, I suspect your columns Time1 to Time3 are misnamed because I believe they contain states of the process rather than times at which the process(es) was/were in a given state.
I'm using data in the format shown below. The actual data set is much longer; the column labels are: Date | Variable 1 | Variable 2 | Failed?
I'm sorting the data into date order. Some dates may be missing, but an ordering function should sort this out. From there, I'm trying to split the data into sets, where a new set starts whenever the far-right column registers a 1. I'm then trying to plot these sets on a single graph, with the number of days passed on the x-axis. I've looked into using ggplot, but it seems to require data frames where the length of each vector is already known. I tried creating a matrix whose length was based on the maximum number of days passed across all sets, filling the spare cells with NaN values to be plotted, but this took ages because my data set is quite large. I was wondering whether there is a more elegant way of plotting the values against days passed for all sets on a single graph, and then iterating the process for additional variables.
Any help would be much appreciated.
Code for a reproducible example is included here:
test <-matrix(c(
"01/03/1997", 0.521583294, 0.315170092, 0,
"02/03/1997", 0.63946859, 0.270870821, 0,
"03/03/1997", 0.698687101, 0.253495021, 0,
"04/03/1997", 0.828754157, 0.233024574, 0,
"05/03/1997", 0.87078867, 0.214507537, 0,
"06/03/1997", 0.883279874, 0.212268627, 0,
"07/03/1997", 0.952083969, 0.062663598, 0,
"08/03/1997", 0.991100195, 0.054875256, 0,
"09/03/1997", 0.992490126, 0.026610776, 1,
"10/03/1997", 0.020707391, 0.866874513, 0,
"11/03/1997", 0.32405139, 0.778696984, 0,
"12/03/1997", 0.32665243, 0.703234151, 0,
"13/03/1997", 0.603941956, 0.362869647, 0,
"14/03/1997", 0.944046386, 0.026992527, 1,
"15/03/1997", 0.108246142, 0.939363715, 0,
"16/03/1997", 0.152195386, 0.907458966, 0,
"17/03/1997", 0.285748169, 0.765212667, 0), ncol = 4, byrow=TRUE)
colnames(test) <- c("Date", "Variable 1", "Variable 2", "Failed")
test <-as.table(test)
test
I've managed to hash together a solution, but it looks very messy. I'm convinced that there is a far more elegant way of solving this.
z = as.data.frame.matrix(test)
attach(z)
x = as.numeric(as.character(Failed))
x = cumsum(x) #Variable names recycled
A corrected cumulative failure sum puts the data into sets according to the number of preceding failures:
z <- within(z, acc_sum <- x)
attach(z)
z$acc_sum <- as.numeric(as.character(z$acc_sum))-as.numeric(as.character(z$Failed))
attach(z)
z = data.frame(z, Group_Index = ave(acc_sum == acc_sum, acc_sum, FUN = cumsum))
An extra column is created that holds the number of days passed since the start of the set. Keeping named variables makes the code easier to read than indexing directly.
attach(z)
x = (max(acc_sum)+1) #This is the number of sets of variable results
Current columns read: Date|Variable.1|Variable.2|Failed|acc_sum|Group_Index
library(ggplot2)
n = data.frame(acc_sum, Group_Index)
This initialises the frame, which should make things faster because Group_Index and acc_sum aren't read in on each iteration.
for(j in 1:(ncol(z)-4)){ #This iterates through all the variables to generate a new set of lists. -4 is from removing date, failed, Group_index and acc_sum
n$Variable <- z[,(j+1)] #This reads in the new variable data, but requires the variables to all be next to each other
n[] <- lapply(n,function(x)as.numeric(as.character(x))) #This ensures all the values are numeric for plotting
plot <- ggplot(n, aes(x = Group_Index, y = Variable, colour = acc_sum)) +
theme_bw() +
geom_line(aes(group=acc_sum)) #linetype = "dotted"
print(plot) #This ensures that the graph is presented in every iteration
cat ("Press [enter] to continue") #This waits for a user input before moving to the next variable
line <- readline()
}
The graph could be improved by having the y-axis label change with the variable being plotted; this could be done by setting ylab() inside the for loop.
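For reference, a more compact sketch of the same grouping idea (it assumes the Failed and variable columns of z have already been converted to numeric, as above; column names follow the question's data):

library(ggplot2)
z$set  <- cumsum(z$Failed) - z$Failed          # number of failures before each row
z$days <- ave(z$set, z$set, FUN = seq_along)   # days elapsed within each set
ggplot(z, aes(x = days, y = Variable.1, colour = factor(set), group = set)) +
  geom_line() +
  theme_bw()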
I tried to import data from mongodb to r using:
mongo.find.all(mongo, namespace, query=query,
fields= list('_id'= 0, 'entityEventName'= 1, context= 1, 'startTime'=1 ), data.frame= T)
The command works fine for small data sets, but I want to import 1,000,000 documents.
Using system.time and adding limit= X to the command, I measure the time as a function of the data to import:
system.time(mongo.find.all(mongo, namespace, query=query ,
fields= list('_id'= 0, 'entityEventName'= 1, context= 1, 'startTime'=1 ),
limit= 10000, data.frame= T))
The results:
Data Size    Time (seconds)
1            0.02
100          0.29
1000         2.51
5000         16.47
10000        20.41
50000        193.36
100000       743.74
200000       2828.33
After plotting the data I believe that:
Import Time = f(Data^2)
Time = -138.3643 + 0.0067807*Data Size + 6.773e-8*(Data Size-45762.6)^2
R^2 = 0.999997
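For reference, a quadratic fit of this shape can be reproduced from the timings above; a sketch using those measured values:

timings <- data.frame(size = c(1, 100, 1000, 5000, 10000, 50000, 100000, 200000),
                      time = c(0.02, 0.29, 2.51, 16.47, 20.41, 193.36, 743.74, 2828.33))
fit <- lm(time ~ poly(size, 2, raw = TRUE), data = timings)
summary(fit)$r.squared  # close to 1, consistent with the quadratic model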
Am I correct?
Is there a faster command?
Thanks!
lm is cool, but if you try adding power-3, 4, 5, ... terms, you'll also get a great R^2 =) You're overfitting =)
One of R's known drawbacks is that you can't efficiently append elements to a vector (or list): appending an element triggers a copy of the entire object. Here you can see a derivative of this effect.
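A tiny illustration of that cost (sizes are arbitrary, and the gap is most dramatic on older versions of R):

n <- 1e5
system.time({v <- c();        for (i in 1:n) v[i] <- i})  # grows, copying as it goes
system.time({v <- numeric(n); for (i in 1:n) v[i] <- i})  # preallocated once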
In general, when you fetch data from MongoDB, you don't know the size of the result in advance: you iterate through a cursor and grow the resulting list. In older versions this procedure was incredibly slow because of the R behaviour described above; after this pull request, performance became much better.
The trick with environments helps a lot, but it is still not as fast as a preallocated list.
But can we potentially do better? Yes.
1) Simply allow the user to specify the size of the result and preallocate the list, and do this automatically when limit= is passed into mongo.find.all. I filed an issue for this enhancement.
2) Construct result in C code.
If you know the size of your data in advance, you can do:
cursor <- mongo.find(mongo, namespace, query = query,
                     fields = list('_id' = 0, 'entityEventName' = 1, context = 1, 'startTime' = 1))
result_lst <- vector('list', NUMBER_OF_RECORDS)  # preallocate to the final length
i <- 1
while (mongo.cursor.next(cursor)) {
  result_lst[[i]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
  i <- i + 1
}
result_dt <- data.table::rbindlist(result_lst)  # bind all records into one table
I have this huge data vector "a"; here is a screenshot:
Now I am supposed to find inconsistencies in the gene data. Basically, a positive number is coded 1, zero is coded 0, and a negative number is coded -1. I am trying to write a command that does something like this:
if a > 0, print 1; if a < 0, print -1; if a == 0, print 0.
I would also like the result to be a vector, like the one in the image above, of 1s, 0s and -1s.
I tried a for loop with an if statement, but it doesn't seem to work.
Maybe this helps:
b <- sign(a)
print(b)
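A quick sanity check on a tiny made-up vector:

a <- c(2.5, 0, -3.1)
sign(a)
# [1]  1  0 -1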
# create some data for illustration
a <- rnorm(100, 0, 56)
print_a <- ifelse(a > 0, 1, ifelse(a == 0, 0, -1))
print_a
+1 for @Rhertel's answer as a quick way to accomplish the task.
However, if you wish to fix your for loop, please post your code. Here is a general example of a for loop implementing what you wanted, using the iris dataset.
library(datasets)
for (i in diff(iris[,1])){
if (i < 0){print(-1)}
else if (i > 0){print(1)}
else {print(0)}
}
I want to test the sensitivity of a calculation to the values of 4 parameters. To do this, I want to vary one parameter at a time: change variable 1 while holding variables 2-4 at a "default" value (e.g., 1). I thought an easy way to organize these values would be a data.frame(), where each column corresponds to a different variable and each row to a set of parameter values for which the calculation should be made. I would then loop through the rows of the data frame, evaluating a function given the parameter values in that row.
This seems like it should be a simple thing to do, but I can't find a quick way to do it.
The problem might be my overall approach to programming the sensitivity analysis, but I can't think of a good, simple way to program the aforementioned data.frame.
My code for generating the data.frame:
Adj_vals <- c(seq(0, 1, by=0.1), seq(1.1, 2, by=0.1)) #a series of values for 3 of the parameters to use
A_Adj_vals <- 10^(seq(1,14,0.5)) #a series of values for another one of the parameters to use
n1 <- length(Adj_vals)
n2 <- length(A_Adj_vals)
data.frame(
"Dg_Adj"=c(Adj_vals, rep(1, n1*2+n2)), #this parameter's default is 1
"Df_Adj"=c(rep(1, n1), Adj_vals, rep(1, n1+n2)), #this parameter's default is 1
"sd_Adj"=c(rep(1, n1*2), 0.01, Adj_vals[-1], rep(1, n2)), #This parameter has default of 1, but unlike the others using Adj_vals, it can only take on values >0
"A"=c(rep(1E7, n1*3), A_Adj_vals) #this parameter's default is 10 million
)
This code produces the desired data.frame. Is there a simpler way to achieve the same result? I would accept an answer where sd_Adj takes on 0 instead of 0.01.
It's pretty debatable whether this is better, but another way to do it would be to follow this pattern:
defaults <- data.frame(a = 1, b = 1, c = 1, d = 10000000)
merge(defaults[c("b", "c", "d")],
      data.frame(a = c(seq(0, 1, by = 0.1), seq(1.1, 2, by = 0.1))))
This should be pretty easy to cook up into a function that automatically removes the correct column from defaults, based on the column name in the data frame you are merging with; a sketch follows.
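For instance, a hedged sketch of such a function; the name vary_one and the default values below are illustrative, not from the original post:

vary_one <- function(varname, values, defaults) {
  # cross the untouched defaults with the supplied values for one parameter
  merge(defaults[setdiff(names(defaults), varname)],
        setNames(data.frame(values), varname))
}

defaults <- data.frame(Dg_Adj = 1, Df_Adj = 1, sd_Adj = 1, A = 1e7)
Adj_vals <- c(seq(0, 1, by = 0.1), seq(1.1, 2, by = 0.1))
rbind(vary_one("Dg_Adj", Adj_vals, defaults),  # rbind matches columns by name
      vary_one("Df_Adj", Adj_vals, defaults))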