R package arules: read.transactions file format

I have a .csv file with the following type of data:
Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47,
But I can't get read.transactions to read it properly.
The data set consists of several item selections per day (more than one per day, if necessary). For instance, the third selection on day 1 returned items 16, 28, 32, and 45.
Shouldn't this be enough?
library(arules)
dataset <- read.transactions("file.csv", format = 'basket')

I created sample data from the data you provided:
data <- read.table(text="Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47",header = T)
library(arules)
## drop the Day column and split each Item string into its individual items
data <- as(strsplit(as.character(data$Item), ","), "transactions")
inspect(data)
## apply the apriori algorithm
rules <- apriori(data, parameter = list(supp = 0.001, conf = 0.80))
## show the top 10 rules sorted by lift
inspect(head(sort(rules, by = "lift"), 10))
Please try this method; I hope it helps.
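The file in the question mixes two separators: whitespace between the day and the item list, and commas between items. A basket-format reader that splits on a single separator will likely not see the individual items. The following is a minimal base R sketch, assuming the file looks exactly like the sample (one header line, a day, then a comma-separated list with an optional trailing comma). The temporary file and the shortened sample are illustrative only; the final conversion runs only if arules is installed.

```r
# Write a shortened copy of the sample to a temporary file (illustration only).
sample_lines <- c("Day Item", "1 12,19,24,31,48,", "1 1,19,", "1 46", "2 17,29,47,")
path <- tempfile(fileext = ".csv")
writeLines(sample_lines, path)

raw <- readLines(path)[-1]                  # drop the "Day Item" header line
day <- as.integer(sub("\\s.*$", "", raw))   # text before the first whitespace
item_text <- sub("^\\S+\\s+", "", raw)      # everything after the day
# strsplit() drops the empty string produced by a trailing comma
items <- strsplit(item_text, ",", fixed = TRUE)

# Optional: convert the list to an arules transactions object
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  trans <- as(items, "transactions")
  inspect(trans)
}
```

Each element of `items` is one selection, so the day is kept separately in `day` rather than being treated as an item.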

Related

R aggregate by a variable then find out proportion of a each column

Sorry, I've tried my best but couldn't find the answer. As a beginner, I'm not sure I can state the question clearly. Thanks in advance.
So I have a dataframe with data about consumption with 24000 rows.
In this dataframe, there is a series of variables giving the number of objects bought within the last two months:
NumberOfCoat, NumberOfShirt, NumberOfPants, NumberOfShoes...
And there is a variable "profession", coded as a number.
So now the data looks like this:
profession NumberOfCoat NumberOfShirt NumberOfShoes
individu1 1 1 1 1
individu2 3 2 4 1
individu3 2 2 0 0
individu4 6 0 3 2
individu5 5 0 2 3
individu6 7 1 0 5
individu7 4 3 1 2
I would like to know the structure of consumption by profession and get something like this :
ProportionOfCoat ProportionOfShirt ProportionOfShoe...
profession1 0.3 0.5 0.1
profession2 0.1 0.2 0.4
profession3 0.2 0.6 0.1
profession4 0.1 0.1 0.2
I don't know if this is clear, but in the end I want to be able to say:
10% of the clothing products that doctors bought are T-shirts, whereas 20% of what teachers bought are T-shirts.
And finally, I'd like to draw a stacked barplot where each stack is scaled to sum to 100%.
I suppose that we can use dplyr?
Thank you very much!
temp <- aggregate( . ~ profession, data=zzz, FUN=sum)
cbind(temp[1],temp[-1]/rowSums(temp[-1]))
Or, alternatively, use prop.table.
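A minimal sketch of the prop.table variant, assuming a data frame shaped like the one in the question. The data frame `zzz` below is a hypothetical stand-in with made-up values, and the renaming of the columns is an optional extra.

```r
# Hypothetical stand-in for the asker's data frame `zzz` (values are made up).
zzz <- data.frame(
  profession    = c(1, 3, 2, 1, 2, 3),
  NumberOfCoat  = c(1, 2, 0, 3, 1, 0),
  NumberOfShirt = c(1, 0, 3, 2, 1, 4),
  NumberOfShoes = c(1, 0, 2, 0, 3, 1)
)

# Sum every count column within each profession
temp <- aggregate(. ~ profession, data = zzz, FUN = sum)

# margin = 1 divides each row by its own total, so every row sums to 1
props <- prop.table(as.matrix(temp[-1]), margin = 1)
colnames(props) <- sub("^NumberOf", "ProportionOf", colnames(props))

result <- cbind(temp[1], props)
print(result)
```

Because prop.table works on a matrix, the profession column is set aside first and bound back on afterwards.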
As other people noted, it is always better to post a reproducible example. I'll post one with my solution, which is longer than the ones already posted but, for the same reason, maybe clearer.
First you should create an example dataframe:
set.seed(10) # set a seed because sample() is used below
n <- 1:100 # vector from 1 to 100 to obtain the number of products bought
p <- 1:8 # vector for obtaining id of professions
profession <- sample(p,50, replace = TRUE)
NumberOfCoat <- sample(n,50, replace = TRUE)
NumberOfShirt <- sample(n,50, replace = TRUE)
NumberOfShoes <- sample(n,50, replace = TRUE)
df <- as.data.frame(cbind(profession, NumberOfCoat,
NumberOfShirt, NumberOfShoes))
Once you have the dataframe, you can explain what you have tried so far, or a possible solution. Here I used dplyr.
library(dplyr)
# total counts per profession, then each product's share of that total
df <- df %>% group_by(profession) %>%
  summarize(coats = sum(NumberOfCoat),
            shirts = sum(NumberOfShirt),
            shoes = sum(NumberOfShoes)) %>%
  mutate(tot_prod = coats + shirts + shoes,
         ProportionOfCoat = coats/tot_prod,
         ProportionOfShirt = shirts/tot_prod,
         ProportionofShoes = shoes/tot_prod) %>%
  select(profession, ProportionOfCoat, ProportionOfShirt,
         ProportionofShoes)
df corresponds to the second dataframe you show, where you have the proportion of each product bought by each profession. In my example it looks like this:
profession ProportionOfCoat ProportionOfShirt ProportionofShoes
<int> <dbl> <dbl> <dbl>
1 1 0.3910483 0.2343934 0.3745583
2 2 0.4069641 0.3525571 0.2404788
3 3 0.3330804 0.3968134 0.2701062
4 4 0.2740657 0.3952435 0.3306908
5 5 0.2573991 0.3784753 0.3641256
6 6 0.2293814 0.3543814 0.4162371
7 7 0.2245841 0.3955638 0.3798521
8 8 0.2861635 0.3490566 0.3647799
If you want to produce a stacked barplot, you have to reshape your data to long format in order to use ggplot2. As @alistaire noted, you can do it with the gather function from the tidyr package.
library(tidyr)
df <- df %>% gather(product, proportion, -profession)
And finally you can plot it with ggplot2.
library(ggplot2)
ggplot(df, aes(x=profession, y=proportion, fill=product)) +
  geom_bar(stat="identity")

names of a dataset returns NULL- R version 3.2.4 Revised-Ubuntu 14.04 LTS

I have a small issue regarding a dataset I am using. Suppose I have a dataset called mergedData2 defined using those command lines from a subset of mergedData:
mergedData=rbind(test_set,training_set)
lookformean<-grep("mean()",names(mergedData),fixed=TRUE)
lookforstd<-grep("std()",names(mergedData),fixed=TRUE)
varsofinterests<-sort(c(lookformean,lookforstd))
mergedData2<-mergedData[,c(1:2,varsofinterests)]
If I do names(mergedData2), I get:
[1] "volunteer_identifier" "type_of_experiment"
[3] "body_acceleration_mean()-X" "body_acceleration_mean()-Y"
[5] "body_acceleration_mean()-Z" "body_acceleration_std()-X"
(I take these first 6 names as an MWE, but I have a vector of 68 names.)
Now, suppose I want to take the average of each of the measurements per volunteer_identifier and type_of_experiment. For this, I used a combination of split and lapply:
mylist<-split(mergedData2,list(mergedData2$volunteer_identifier,mergedData2$type_of_experiment))
average_activities<-lapply(mylist,function(x) colMeans(x))
average_dataset<-t(as.data.frame(average_activities))
As average_activities is a list, I converted it into a data frame and transposed this data frame to keep the same format as mergedData and mergedData2. The problem now is the following: when I call names(average_dataset), it returns NULL! But, more strangely, when I do head(average_dataset), it returns:
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 1 0.2773308 -0.01738382
2 1 0.2764266 -0.01859492
3 1 0.2755675 -0.01717678
4 1 0.2785820 -0.01483995
5 1 0.2778423 -0.01728503
6 1 0.2836589 -0.01689542
This is just a small sample of the output, to show that the names of the variables are there. So why does names(average_dataset) return NULL?
Thanks in advance for your reply, best
EDIT: Here is an MWE for mergedData2:
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 2 5 0.2571778 -0.02328523
2 2 5 0.2860267 -0.01316336
3 2 5 0.2754848 -0.02605042
4 2 5 0.2702982 -0.03261387
5 2 5 0.2748330 -0.02784779
6 2 5 0.2792199 -0.01862040
body_acceleration_mean()-Z body_acceleration_std()-X body_acceleration_std()-Y body_acceleration_std()-Z
1 -0.01465376 -0.9384040 -0.9200908 -0.6676833
2 -0.11908252 -0.9754147 -0.9674579 -0.9449582
3 -0.11815167 -0.9938190 -0.9699255 -0.9627480
4 -0.11752018 -0.9947428 -0.9732676 -0.9670907
5 -0.12952716 -0.9938525 -0.9674455 -0.9782950
6 -0.11390197 -0.9944552 -0.9704169 -0.9653163
gravity_acceleration_mean()-X gravity_acceleration_mean()-Y gravity_acceleration_mean()-Z
1 0.9364893 -0.2827192 0.1152882
2 0.9274036 -0.2892151 0.1525683
3 0.9299150 -0.2875128 0.1460856
4 0.9288814 -0.2933958 0.1429259
5 0.9265997 -0.3029609 0.1383067
6 0.9256632 -0.3089397 0.1305608
gravity_acceleration_std()-X gravity_acceleration_std()-Y gravity_acceleration_std()-Z
1 -0.9254273 -0.9370141 -0.5642884
2 -0.9890571 -0.9838872 -0.9647811
3 -0.9959365 -0.9882505 -0.9815796
4 -0.9931392 -0.9704192 -0.9915917
5 -0.9955746 -0.9709604 -0.9680853
6 -0.9988423 -0.9907387 -0.9712319
My task is to produce this average_dataset, which contains the average value of each physical quantity (column 3 onwards) for each volunteer and type of experiment, e.g.
1 1 mean1 mean2 mean3...mean68
2 1 mean1 mean2 mean3...mean68, etc.
After this I will have to export it as a txt file (so I think using write.table with row.names=F and col.names=T). Note that for now, if I do this and import the generated dataset with read.table, I don't recover the column names, even when specifying col.names=T.
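A likely explanation, sketched under assumptions: t() always returns a matrix, and a matrix carries its labels in dimnames, so names() is NULL while colnames() still works (which is why head() shows the labels). The miniature `mergedData2` below is hypothetical, with made-up values; only the column names follow the question. The sketch also shows a round trip through write.table and read.table, where the column names come back when header = TRUE is passed on reading.

```r
# Hypothetical miniature of mergedData2 (values made up for illustration).
mergedData2 <- data.frame(
  volunteer_identifier = c(1, 1, 2, 2),
  type_of_experiment   = c(1, 1, 1, 1),
  `body_acceleration_mean()-X` = c(0.27, 0.28, 0.25, 0.29),
  check.names = FALSE
)

mylist <- split(mergedData2, list(mergedData2$volunteer_identifier,
                                  mergedData2$type_of_experiment))
average_activities <- lapply(mylist, colMeans)

# t() returns a matrix: names() is NULL, colnames() holds the labels
average_matrix <- t(as.data.frame(average_activities))
print(names(average_matrix))
print(colnames(average_matrix))

# Converting back to a data frame makes names() work again
average_dataset <- as.data.frame(average_matrix)
print(names(average_dataset))

# Round trip: header = TRUE reads the first line as column names,
# check.names = FALSE keeps names such as "body_acceleration_mean()-X" intact
path <- tempfile(fileext = ".txt")
write.table(average_dataset, path, row.names = FALSE, col.names = TRUE)
back <- read.table(path, header = TRUE, check.names = FALSE)
```

In read.table, col.names expects a character vector of names rather than TRUE, so header = TRUE is the argument that recovers the names written by write.table.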

R-convert transaction format dataset to basket format for sequence mining

ORIGINAL TABLE
CELL NUMBER   ACTIVITY   TIME
001           call a     12.23
002           call b     01.00
002           call d     01.09
001           call b     12.25
003           call a     12.23
002           call a     02.07
003           call b     12.25
REQUIRED-
To mine the highest occurring sequence of ACTIVITY from a data-set of size 400,000
ABOVE EXAMPLE SHOULD SHOW
[call a-12.23,call b-12.25] frequency 2
[call b-01.00,call d-01.09,call a-02.07] frequency 1
I'm aware that this can be achieved using arulesSequences. What transformations of the dataset do I need to carry out, and how, in order to use the arulesSequences package?
Current db format: transactions with 3 columns, like the sample above.
df<-read.table(header=T,sep="|",text="CELL NUMBER|ACTIVITY|TIME
001|call a|12.23
002|call b|01.00
002|call d|01.09
001|call b|12.25
003|call a|12.23
002|call a|02.07
003|call b|12.25")
require(plyr) # for count() function
freqs<-count(df[,-1]) # [,-1] to exclude the CELL NUMBER column from the group
freqs[order(-freqs$freq),]
ACTIVITY TIME freq
2 call a 12.23 2
4 call b 12.25 2
1 call a 2.07 1
3 call b 1.00 1
5 call d 1.09 1
EDIT - Updated like this:
unique(ddply(freqs,.(-freq),summarise,calls=paste0("[",paste0(paste0(ACTIVITY,"-",TIME),collapse=","),"]","frequency",freq)))
# -freq calls
#1 -2 [call a-12.23,call b-12.25]frequency2
#3 -1 [call a-2.07,call b-1,call d-1.09]frequency1
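The counts above group rows by activity and time across all cells. If the goal is one ordered sequence per cell, a base R sketch could look like the following. It assumes times are zero-padded strings that sort correctly as text, reads every column as character so that "01.00" is not turned into 1, and uses the illustrative column names `cell`, `activity` and `time`. It does not use arulesSequences.

```r
df <- read.table(header = TRUE, sep = "|", colClasses = "character",
                 col.names = c("cell", "activity", "time"),
                 text = "CELL NUMBER|ACTIVITY|TIME
001|call a|12.23
002|call b|01.00
002|call d|01.09
001|call b|12.25
003|call a|12.23
002|call a|02.07
003|call b|12.25")

# Order events within each cell by time
df <- df[order(df$cell, df$time), ]

# One string per cell: "activity-time" events joined in time order
events <- paste(df$activity, df$time, sep = "-")
seqs <- tapply(events, df$cell, paste, collapse = ",")

# Count identical sequences, most frequent first
freq <- sort(table(seqs), decreasing = TRUE)
print(freq)
```

For the sample data this groups cells 001 and 003 into the same sequence, since both have the same events at the same times.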

Multiple plots in R with different settings for each axis with less lines of code

Is it possible to create the graph below with fewer lines of code? Since each of Figs. A-D has different label settings, I have to write the settings for each figure separately, which makes the code longer.
The graph is produced from the data in a pdf device.
Any help with these issues is highly appreciated (newbie to R!). Since the full code is too long to post here, I have posted only the part relevant to the problem, for Fig. C:
#FigC
label1=c(0,100,200,300)
plot(data$TimeVariable2C,data$Variable2C,axes=FALSE,ylab="",xlab="",xlim=c(0,24),
ylim=c(0,2.4),xaxs="i",yaxs="i",pch=19)
lines(data$TimeVariable3C,data$Variable3C)
axis(2,tick=T,at=seq(0.0,2.4,by=0.6),label= seq(0.0,2.4,by=0.6))
axis(1,tick=T,at=seq(0,24,by=6),label=seq(0,24,by=6))
mtext("(C)",side=1,outer=F,line=-10,adj=0.8)
minor.tick(nx=5,ny=5)
par(new=TRUE)
plot(data$TimeVariable1C,data$Variable1C,axes=FALSE,xlab="",ylab="",type="l",
ylim=c(800,0),xaxs="i",yaxs="i")
axis(3,xlim=c(0,24),tick=TRUE,at= seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
axis(4,tick=TRUE,at= label1,label=label1,col.axis="violetred4",col="violetred4")
polygon(data$TimeVariable1C,data$Variable1C,col='violetred4',border=NA)
You ask many questions in one post. I will try to answer just one: how to simplify your code, or rather how to call it once for each letter. I think it is better to put your data in long format. For example, this creates a list of 4 elements:
ll <- lapply(LETTERS[1:4],function(let){
dat.let <- dat[,grepl(let,colnames(dat))]
dd <- reshape(dat.let,direction ='long',
v.names=c('TimeVariable','Variable'),
varying=1:6)
dd$time <- factor(dd$time)
dd$Type <- let
dd
}
)
ll is a list of 4 data frames, each of which looks like this:
head(ll[[1]])
time TimeVariable Variable id Type
1.1 1 0 0 1 A
2.1 1 0 5 2 A
3.1 1 8 110 3 A
4.1 1 16 0 4 A
5.1 1 NA NA 5 A
6.1 1 NA NA 6 A
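To see what the reshape call does on its own, here is a small self-contained sketch. The wide data frame is hypothetical (made-up values, column names following the TimeVariable1A / Variable1A pattern from the question). With varying given as the column indices 1:6 and two v.names, the columns are taken as alternating pairs, one pair per series.

```r
# Hypothetical wide data for panel A: three (time, value) column pairs.
dat.let <- data.frame(
  TimeVariable1A = c(0, 8, 16), Variable1A = c(0, 110, 0),
  TimeVariable2A = c(0, 6, 12), Variable2A = c(0.5, 1.2, 1.8),
  TimeVariable3A = c(0, 6, 12), Variable3A = c(0.4, 1.0, 1.6)
)

# Columns 1, 3, 5 are stacked into TimeVariable and 2, 4, 6 into Variable;
# the new `time` column records which pair (1, 2 or 3) a row came from
dd <- reshape(dat.let, direction = "long",
              v.names = c("TimeVariable", "Variable"),
              varying = 1:6)
dd$Type <- "A"
print(dd)
```

The `time` column is what the plotting code later filters on to pick out each series.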
Then you can use it like this, for example:
library(Hmisc)
layout(matrix(1:4, 2, 2, byrow = TRUE))
invisible(lapply(ll,function(data){
  label1 <- c(0,100,200,300)
  Type <- unique(data$Type)   # panel letter, taken from the long-format data
  ## series 2: points
  dat <- subset(data,time==2)
  x.mm <- max(dat$TimeVariable,na.rm=TRUE)   # upper limit of the x axis
  plot(dat$TimeVariable,dat$Variable,axes=FALSE,ylab="",xlab="",xlim=c(0,x.mm),
       ylim=c(0,2.4),xaxs="i",yaxs="i",pch=19)
  ## series 3: line
  dat <- subset(data,time==3)
  lines(dat$TimeVariable,dat$Variable)
  axis(2,tick=TRUE,at=seq(0.0,2.4,by=0.6),label=seq(0.0,2.4,by=0.6))
  axis(1,tick=TRUE,at=seq(0,x.mm,by=6),label=seq(0,x.mm,by=6))
  mtext(paste0("(",Type,")"),side=1,outer=FALSE,line=-10,adj=0.8)
  minor.tick(nx=5,ny=5)
  par(new=TRUE)
  ## series 1: filled polygon on the reversed secondary axis
  dat <- subset(data,time==1)
  plot(dat$TimeVariable,dat$Variable,axes=FALSE,xlab="",ylab="",type="l",
       ylim=c(800,0),xaxs="i",yaxs="i")
  axis(3,xlim=c(0,24),tick=TRUE,at=seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
  axis(4,tick=TRUE,at=label1,label=label1,col.axis="violetred4",col="violetred4")
  polygon(dat$TimeVariable,dat$Variable,col='violetred4',border=NA)
}))
Another advantage of the long data format is that you can use `ggplot2` and `facet_wrap`, for example.
## transform your data to a data.frame
dat.l <- do.call(rbind,ll)
library(ggplot2)
ggplot(subset(dat.l,time !=1)) +
geom_line(aes(x=TimeVariable,y=Variable,group=time,color=time))+
geom_polygon(data=subset(dat.l,time ==1),
aes(x=TimeVariable,y=60-Variable/10,fill=Type))+
geom_line(data=subset(dat.l,time ==1),
aes(x=TimeVariable,y=Variable,fill=Type))+
facet_wrap(~Type,scales='free')

reshaping a data frame into long format in R

I'm struggling with a reshape in R. I have 2 types of error (err and rel_err) that have been calculated for 3 different models. This gives me a total of 6 error variables (i.e. err_1, err_2, err_3, rel_err_1, rel_err_2, and rel_err_3). For each of these types of error I have 3 different types of predictive validity test (i.e. random holdouts, backcast, forecast). I would like to make my data set long, keeping the test types long while also making the two error measurements long. In the end I will have one variable called err and one called rel_err, as well as an id variable for the model the error corresponds to (1, 2, or 3).
Here is my data right now:
iter err_1 rel_err_1 err_2 rel_err_2 err_3 rel_err_3 test_type
1 -0.09385732 -0.2235443 -0.1216982 -0.2898543 -0.1058366 -0.2520759 random
1 0.16141630 0.8575728 0.1418732 0.7537442 0.1584816 0.8419816 back
1 0.16376930 0.8700738 0.1431505 0.7605302 0.1596502 0.8481901 front
1 0.14345986 0.6765194 0.1213689 0.5723444 0.1374676 0.6482615 random
1 0.15890059 0.7435382 0.1589823 0.7439204 0.1608709 0.7527580 back
1 0.14412360 0.6743928 0.1442039 0.6747684 0.1463520 0.6848202 front
and here is what I would like it to look like:
iter model err rel_err test_type
1 1 -0.09385732 (#'s) random
1 2 -0.1216982 (#'s) random
1 3 -0.1058366 (#'s) random
and on...
I've tried playing around with the syntax but can't quite figure out what to put for the time.varying argument
Thanks very much for any help you can offer.
You could do it the "hard" way. For transparency you can use names.
with( dat, data.frame(iter = rep(iter, 3),
model = rep(1:3, each = nrow(dat)),
err = c(err_1, err_2, err_3),
rel_err = c(rel_err_1, rel_err_2, rel_err_3),
test_type = rep(test_type, 3)) )
Or, for conciseness, indexes.
data.frame(iter = rep(dat[,1], 3), model = rep(1:3, each = nrow(dat)),
           err = unlist(dat[,c(2, 4, 6)], use.names = FALSE),
           rel_err = unlist(dat[,c(3, 5, 7)], use.names = FALSE),
           test_type = rep(dat[,8], 3))
If you had a LOT of columns the hard way might involve grepping the column names.
This "hard" way was about as concise as reshape and required less thinking about how to use the commands. Sometimes I just skip thinking about reshape.
The base function reshape will let you do this
reshape(DT, direction = 'long', varying = list(paste('err',1:3,sep ='_'), paste('rel_err',1:3,sep ='_')), v.names = c('err','rel_err'), timevar = 'model')
iter test_type model err rel_err id
1.1 1 random 1 -0.09385732 -0.2235443 1
2.1 1 back 1 0.16141630 0.8575728 2
3.1 1 front 1 0.16376930 0.8700738 3
4.1 1 random 1 0.14345986 0.6765194 4
5.1 1 back 1 0.15890059 0.7435382 5
6.1 1 front 1 0.14412360 0.6743928 6
1.2 1 random 2 -0.12169820 -0.2898543 1
2.2 1 back 2 0.14187320 0.7537442 2
3.2 1 front 2 0.14315050 0.7605302 3
4.2 1 random 2 0.12136890 0.5723444 4
5.2 1 back 2 0.15898230 0.7439204 5
6.2 1 front 2 0.14420390 0.6747684 6
1.3 1 random 3 -0.10583660 -0.2520759 1
2.3 1 back 3 0.15848160 0.8419816 2
3.3 1 front 3 0.15965020 0.8481901 3
4.3 1 random 3 0.13746760 0.6482615 4
5.3 1 back 3 0.16087090 0.7527580 5
6.3 1 front 3 0.14635200 0.6848202 6
I agree that the syntax for reshape is hard to get your head around sometimes. I will spell out how this call works:
direction = 'long' -- reshape to long format.
varying = list(paste('err',1:3,sep ='_'), paste('rel_err',1:3,sep ='_')) -- we pass a list of length 2 because we are stacking into two different variables. The columns paste('err',1:3,sep ='_') become the first new variable in long format and paste('rel_err',1:3,sep ='_') become the second.
v.names = c('err','rel_err') -- sets the names of the two new variables in long format.
timevar = 'model' -- sets the name of the time identifier (here the _1, _2, _3 suffixes of the columns in wide format).
I hope this is somewhat clearer.
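As a check on the explanation, here is a runnable sketch of the same call on the first two rows of the data from the question. The name `DT` follows the call above and `long` is just an illustrative name for the result.

```r
# First two rows of the data shown in the question
DT <- data.frame(
  iter = c(1, 1),
  err_1 = c(-0.09385732, 0.16141630), rel_err_1 = c(-0.2235443, 0.8575728),
  err_2 = c(-0.1216982, 0.1418732),   rel_err_2 = c(-0.2898543, 0.7537442),
  err_3 = c(-0.1058366, 0.1584816),   rel_err_3 = c(-0.2520759, 0.8419816),
  test_type = c("random", "back")
)

long <- reshape(DT, direction = "long",
                varying = list(paste("err", 1:3, sep = "_"),
                               paste("rel_err", 1:3, sep = "_")),
                v.names = c("err", "rel_err"),
                timevar = "model")

# 2 rows x 3 models = 6 rows; `id` identifies the original wide row
print(long)
```

The `id` column is added by reshape so that the long data can be mapped back to the original rows; it can be dropped if it is not needed.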
