R - apriori() not recognising lhs from numerical transaction - r

I am having real trouble getting my data to produce any rules using the arules package. I have managed to get 100000 rows of transaction data and in SAS the rules are shown. I cannot get it to work in R.
[5] {19,29,40,119,134}
[6] {24,40,45,67,141}
[7] {17,18,57,74,412}
[8] {16,79,90,150,498}
[9] {18,57,111,161,267}
[10] {11,75,131,427,429}
[11] {57,99,111,143,236}
The transactions data looks like this and originally came from a table where all the numbers were separate.
arules <- read.transactions('tid.csv', format = c("basket", "single"),
sep=",")
rules <- apriori(arules,parameter = list(supp = 0.1, conf = 0.1, target =
"rules"))
summary(rules)
For reference, the support and confidence settings make no difference. Sometimes I get this when I inspect the rules:
lhs rhs support confidence lift count
[1] {} => {8,11,96,112,432} 9.710623e-06 9.710623e-06 1 1
[2] {} => {62,134,222,254,412} 9.710623e-06 9.710623e-06 1 1
Any idea why apriori can't separate the items in the transaction? Does this need to be recast into long format, and if so, how would I do that from this data frame?
V2 V3 V4 V5 V6
8 11 96 112 432
10 35 39 76 119
18 38 68 141 267
29 36 57 61 63
19 29 40 119 134
24 40 45 67 141
17 18 57 74 412

If I understood you correctly then you should try this and let us know if it helped.
library(arules)
library(arulesViz)
#sample data
df <- read.table(text="V2 V3 V4 V5 V6
8 11 96 112 432
10 35 39 76 119
18 38 68 141 267
29 36 57 61 63
19 29 40 119 134
24 40 45 67 141
17 18 57 74 412", header = TRUE)
write.csv(df, "apriori_demo.csv", row.names = FALSE)
#convert sample data into transactions format for apriori algorithm
trx <- read.transactions("apriori_demo.csv", format="basket", sep=",", skip=1)
#apriori rules
apriori_rule <- apriori(trx, parameter = list(supp = 0.1, conf = 0.1))
#obviously you need to have better parameters compared to the one you have used in your post!
inspect(apriori_rule)
plot(apriori_rule, method="graph")
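On the long-format part of the question, here is a minimal base R sketch, assuming the wide data frame holds one transaction per row. The column names `tid` and `item` are my own choice for illustration, and only the first three sample rows are used.

```r
# Wide sample data: one transaction per row, one item per column.
df <- read.table(text = "V2 V3 V4 V5 V6
8 11 96 112 432
10 35 39 76 119
18 38 68 141 267", header = TRUE)

# unlist() walks the data frame column by column, so the row index
# has to be repeated once per column to line up with the items.
long <- data.frame(
  tid  = rep(seq_len(nrow(df)), times = ncol(df)),
  item = unlist(df, use.names = FALSE)
)

# Sort by transaction id so each transaction's items sit together.
long <- long[order(long$tid), ]
rownames(long) <- NULL

print(head(long, 7))
```

A two-column table like `long` is the layout that format = "single" in read.transactions expects, with the `cols` argument identifying the transaction-id and item columns.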

Related

2 lines of headers in R from csv

I have a lot of csv files with double headers as below. (This is only part of it, and both headers contain important info.) How could I combine the first two rows of the csv file to obtain a single header line (e.g. Life.expectancy.at.birth..years..1Female)?
Life.expectancy.at.birth..years..1 Life.expectancy.at.birth..years..2
1 Female Male
2 62 61
3 61 58
4 56 54
5 50 49
6 76 73
Read it twice and paste the headers together. For the second read, limit the number of rows read, since we really only need the header.
# in next 2 lines replace text=Lines with something like "myfile"
DF <- read.table(text = Lines, header = TRUE, skip = 1)
hdr1 <- read.table(text = Lines, header = TRUE, nrows = 1)
names(DF) <- paste0(names(hdr1), names(DF))
giving:
> DF
Life.expectancy.at.birth..years..1Female Life.expectancy.at.birth..years..2Male
1 62 61
2 61 58
3 56 54
4 50 49
5 76 73
Note: We used this for the input Lines:
Lines <- " Life.expectancy.at.birth..years..1 Life.expectancy.at.birth..years..2
Female Male
62 61
61 58
56 54
50 49
76 73"
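A variant of the same idea, sketched for an actual file: read the two header rows as plain data, read the body separately, then paste the header rows together. This assumes a whitespace-separated file; the temporary file only stands in for a real path like "myfile", and the input is shortened to two data rows.

```r
# Shortened copy of the input, written to a temp file that stands in
# for a real file on disk.
Lines <- "Life.expectancy.at.birth..years..1 Life.expectancy.at.birth..years..2
Female Male
62 61
61 58"
path <- tempfile(fileext = ".txt")
writeLines(Lines, path)

# Read both header rows as character data (no header parsing).
hdr <- read.table(path, nrows = 2, header = FALSE, stringsAsFactors = FALSE)

# Read the body, skipping the two header rows.
DF <- read.table(path, skip = 2, header = FALSE)

# Combine the header rows column by column.
names(DF) <- paste0(unlist(hdr[1, ]), unlist(hdr[2, ]))

print(DF)
```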

assign objects to dynamic lists in r

I have nested loops that produce outputs that I want to store in list objects with dynamic names. A toy example of this would look as follows:
set.seed(8020)
names<-sample(LETTERS,5,replace = F)
for(n in names)
{
#Create the list
assign(paste0("examples_",n),list())
#Populate the list
get(paste0("examples_",n))[[1]]<-sample(100,10)
get(paste0("examples_",n))[[2]]<-sample(100,10)
get(paste0("examples_",n))[[3]]<-sample(100,10)
}
Unfortunately I keep getting the error:
Error in get(paste0("examples_", n))[[1]] <- sample(100, 10) :
target of assignment expands to non-language object
I have tried all kinds of assign, eval, and get type functions to parse the object, but haven't had any luck.
Expanding on my comment with a worked example:
examples <- vector(mode="list", length=length(names) )
names(examples) <- names # please change that to mynames
# or almost anything other than `names`
examples <- lapply( examples, function(L) {L[[1]] <- sample(100,10)
L[[2]] <- sample(100,10)
L[[3]] <- sample(100,10); L} )
# Top of the output:
> examples
$P
$P[[1]]
[1] 34 49 6 55 19 28 72 42 14 92
$P[[2]]
[1] 97 71 63 59 66 50 27 45 76 58
$P[[3]]
[1] 94 39 77 44 73 15 51 78 97 53
$F
$F[[1]]
[1] 12 21 89 26 16 93 4 13 62 45
$F[[2]]
[1] 83 21 68 74 32 86 52 49 16 13
$F[[3]]
[1] 14 45 40 46 64 85 88 28 53 42
This mode of programming does become more natural over time. It gets you out of writing clunky for-loops all the time. Develop your algorithms for a single list-node at a time and then use sapply or lapply to iterate the processing.
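A compact sketch of the same pattern, assuming the goal is one named list holding three samples per letter. `mynames` and `make_examples` are illustrative names of my own. The last lines show how to create the separate `examples_<letter>` objects with assign() if they are really needed.

```r
set.seed(8020)
mynames <- sample(LETTERS, 5, replace = FALSE)

# Build one list node: three independent samples of size 10.
make_examples <- function() {
  replicate(3, sample(100, 10), simplify = FALSE)
}

# One named list instead of five dynamically named objects.
examples <- setNames(lapply(mynames, function(n) make_examples()), mynames)

# Access by name, then by position.
second_sample <- examples[[mynames[1]]][[2]]
print(second_sample)

# Build each list first, then assign() the finished object.
# get(...) cannot appear on the left-hand side of <-.
for (n in mynames) {
  assign(paste0("examples_", n), examples[[n]])
}
```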

Loop Linear Regression

As a beginner in R I have a, probably, simple question.
I have a linear regression with this specification:
X1_t = X1_{t-h} + X2_{t-h}
where h is equal to 1, 2, 3, 4, 5.
For example, when h=1 I run this code:
Modelo11 <- dynlm(X1 ~ L(X1,1) + L(X2, 1)-1, data = GDP)
It's a simple regression.
I want to implement a function that gives me the five linear regressions (h = 1, 2, 3, 4 and 5) with and without HAC (heteroscedasticity and autocorrelation consistent) estimation.
I tried this, and it didn't work:
for(h in 1:5){
Modelo1[h] <- dynlm(GDPTrimestralemT ~ L(SpreademT,h) + L(GDPTrimestralemT, h)-1, data = MatrizDadosUS)
coeftest(Modelo1[h], df = Inf, vcov = parzenHAC)
return(list(summary(Modelo1[h])))
}
One of the error messages is:
number of items to replace is not a multiple of replacement length
This is my data.frame:
GDP <- data.frame(data )
GDP
X1 X2
1 0.542952690 0.226341364
2 0.102328393 0.743360185
3 0.166345969 0.186533485
4 1.406733422 1.392420181
5 -0.469811005 -0.114609464
6 -0.509268267 0.687555461
7 1.470439930 0.298655018
8 1.046456428 -1.056387597
9 -0.492462197 -0.530284962
10 -0.516065519 0.645957530
11 0.624638996 1.044731264
12 0.213616470 -1.652979785
13 0.669747432 1.398602289
14 0.552089131 -0.821013792
15 0.452715216 1.420094663
16 -0.892063248 -1.436600779
17 1.429284965 0.559738610
18 0.853740565 -0.898976767
19 0.741864168 1.352012831
20 0.171494650 1.704764705
21 0.422326351 -0.267064235
22 -1.261643503 -2.090694608
23 -1.321086283 -0.273954212
24 0.365226000 1.965167113
25 -0.080888690 -0.594498893
26 -0.183293801 -0.483053404
27 -1.033792032 0.586491772
28 0.718322432 1.776210145
29 -2.822693790 -0.731509917
30 -1.251740437 -1.918124078
31 1.184256949 -0.016548037
32 2.255202675 0.303438286
33 -0.930446147 0.803126180
34 -1.691383225 -0.157839283
35 -1.081643279 -0.006652717
36 1.034162006 -1.970063305
37 -0.716827488 0.306792930
38 0.098471514 0.338333164
39 0.343536547 0.389775011
40 1.442117465 -0.668885360
41 0.095131066 -0.298356861
42 0.222524607 0.291485267
43 -0.499969717 1.308312472
44 0.588162304 0.026539575
45 0.581215173 0.167710855
46 0.629343124 -0.052835206
47 0.811618963 0.716913172
48 1.463610069 -0.356369304
49 -2.000576321 1.226446201
50 1.278233553 0.313606888
51 -0.700373666 0.770273988
52 -1.206455648 0.344628878
53 0.024602262 1.001621886
54 0.858933385 -0.865771777
55 -1.592291995 -0.384908852
56 -0.833758365 -1.184682199
57 -0.281305858 2.070391729
58 -0.122848757 -0.308397782
59 -0.661013984 1.590741535
60 1.887869805 -1.240283364
61 -0.313677463 -1.393252994
62 1.142864110 -1.150916732
63 -0.633380499 -0.223923970
64 -0.158729527 -1.245647224
65 0.928619010 -1.050636078
66 0.424317087 0.593892028
67 1.108704956 -1.792833100
68 -1.338231248 1.138684394
69 -0.647492569 0.181495183
70 0.295906675 -0.101823172
71 -0.079827607 0.825158278
72 0.050353111 -0.448453121
73 0.129068772 0.205619797
74 -0.221450137 0.051349511
75 -1.300967949 1.639063824
76 -0.861963677 1.273104220
77 -1.691001610 0.746514122
78 0.365888734 -0.055308006
79 1.297349754 1.146102001
80 -0.652382297 -1.095031447
81 0.165682952 -0.012926971
82 0.127996446 0.510673745
83 0.338743162 -3.141650682
84 -0.266916587 -2.483389321
85 0.148135154 -1.239997153
86 1.256591385 0.051984536
87 -0.646281986 0.468210275
88 0.180472423 0.393014848
89 0.231892902 -0.545305005
90 -0.709986273 0.104969765
91 1.231712844 -1.703489840
92 0.435378714 0.876505107
93 -1.880394798 -0.885893722
94 1.083580732 0.117560662
95 -0.499072654 -1.039222894
96 1.850756855 -1.308752222
97 1.653952857 0.440405804
98 -1.057618294 -1.611779530
99 -0.021821282 -0.807071503
100 0.682923562 -2.358596342
101 -1.132293845 -1.488806929
102 0.319237353 0.706203968
103 -2.393105781 -1.562111727
104 0.188653972 -0.637073832
105 0.667003685 0.047694037
106 -0.534018861 1.366826933
107 -2.240330371 -0.071797320
108 -0.220633546 1.612879694
109 -0.022442941 1.172582601
110 -1.542418139 0.635161458
111 -0.684128812 -0.334973482
112 0.688849615 0.056557966
113 0.848602803 0.785297518
114 -0.874157558 -0.434518305
115 -0.404999060 -0.078893114
116 0.735896917 1.637873669
117 -0.174398836 0.542952690
118 0.222418628 0.102328393
119 0.419461884 0.166345969
120 -0.042602368 1.406733422
121 2.135670836 -0.469811005
122 1.197644287 -0.509268267
123 0.395951293 1.470439930
124 0.141327444 1.046456428
125 0.691575897 -0.492462197
126 -0.490708151 -0.516065519
127 -0.358903359 0.624638996
128 -0.227550909 0.213616470
129 -0.766692832 0.669747432
130 -0.001690915 0.552089131
131 -1.786701123 0.452715216
132 -1.251495762 -0.892063248
133 1.123462446 1.429284965
134 0.237862653 0.853740565
Thanks.
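Your variable Modelo1 is a vector, which cannot store lm objects. Assigning with Modelo1[h] tries to squeeze a whole model object (itself a list with many components) into a single element, hence the message about the replacement length. When Modelo1 is a list and you assign with [[h]], it should work.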
library(dynlm)
df<-data.frame(rnorm(50),rnorm(50))
names(df)<-c("a","b")
c<-list()
for(h in 1:5){
c[[h]] <- dynlm(a ~ L(a,h) + L(b, h)-1, data = df)
}
To get the summary you have to access the single list elements. For example:
summary(c[[1]])
*edit in response to Richard Scriven's comment
The most efficient way to get all summaries would be:
lapply(c, summary)
This applies the summary function to each element of the list and returns a list with the results.
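For a version that runs without extra packages, here is a rough sketch of the same list-based pattern. It swaps dynlm's L() for hand-built lags and plain lm(), so it is a stand-in for the approach above, not a replacement for it. The data are made up, `fit_lag` and `lagged` are names I chose, and the HAC standard errors are not covered.

```r
set.seed(42)
GDP <- data.frame(X1 = rnorm(60), X2 = rnorm(60))  # made-up data

# Regress X1_t on X1_{t-h} and X2_{t-h} without an intercept.
fit_lag <- function(h, data) {
  n <- nrow(data)
  lagged <- data.frame(
    y      = data$X1[(h + 1):n],   # X1 at time t
    x1_lag = data$X1[1:(n - h)],   # X1 at time t - h
    x2_lag = data$X2[1:(n - h)]    # X2 at time t - h
  )
  lm(y ~ x1_lag + x2_lag - 1, data = lagged)
}

# One model per lag, stored in a named list.
models <- lapply(1:5, fit_lag, data = GDP)
names(models) <- paste0("h", 1:5)

# Coefficients side by side: one column per lag.
coef_table <- sapply(models, coef)
print(coef_table)

# Summaries for all five models at once.
summaries <- lapply(models, summary)
```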

R 2.15.0 lm() on Windows 6.1.7601 - Subsetting data frame receive error when labeling columns

Purpose: Subset a dataframe into 2 columns that are labeled with new names
for example:
Age Height
1 65 183
2 73 178
[data1[dataset1$Age>50 | dataset1$Height>140,], c("Age","Cm")]
# Error: unexpected ',' in "data1[data1$Age>50 | data1$Height>140,],"
What I've tried:
data1[dataset1$Age>50 | dataset1$Height>140,] #This doesn't organize results in columns
data1[dataset1$Age>50 | dataset1$Height>140,], c("Age","Cm") #Returns same error
I can't get the columns to be organized side-by-side with the labels in c("label1", "label2"). Thanks for your help! New to R and learning it alongside biostats.
If I understood the question correctly, the subset function may be of help:
dataset1 <- data.frame(
age=c(44,77,21,55,66,90,23,54,31),
height=c(144,177,121,155,166,190,123,154,131)
)
data1 <- subset(dataset1, age > 50 | height > 140)
colnames(data1) <- c("Age", "Height")
I may have missed what you were trying to do; I think it needs a bit more reproducible data.
Nevertheless, I had a go:
dataset1 = data.frame(cbind((35:75),(135:175)))
colnames(dataset1) = c("Age","Height")
Age Height
35 135
36 136
37 137
38 138
39 139
40 140
41 141
42 142
43 143
44 144
and subset
data1 = dataset1[dataset1$Age>50 | dataset1$Height>140,]
colnames(data1) = c("Age","Cm")
Age Cm
41 141
42 142
43 143
44 144
45 145
46 146
47 147
48 148
49 149
50 150
My apologies if I missed what you wanted but to me it wasn't very clear.
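As far as I can tell, the error in the question comes from the bracket placement: the row condition and the column selection both belong inside one pair of square brackets, as df[rows, cols]. A small sketch with made-up values (`keep` is just an illustrative name):

```r
dataset1 <- data.frame(
  Age    = c(44, 77, 21, 55),
  Height = c(144, 177, 121, 155)
)

# Logical row filter: older than 50 OR taller than 140.
keep <- dataset1$Age > 50 | dataset1$Height > 140

# Rows and columns go inside the same pair of brackets.
data1 <- dataset1[keep, c("Age", "Height")]

# Relabel the columns afterwards.
names(data1) <- c("Age", "Cm")

print(data1)
```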

How to grep a word exactly

I'd like to grep for "nitrogen" in the following character vector and get back only the entry that is exactly "nitrogen", and none of the others (e.g. not "nitrogen fixation"):
varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")
I tried something like this:
grepl(pattern= "![[:space:]]nitrogen![[:space:]]", varnames)
But this doesn't work.
Although Dason's answer is easier, you could do an exact match using grep via:
varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")
grep("^nitrogen$",varnames,value=TRUE)
[1] "nitrogen"
grep("^nitrogen$",varnames)
[1] 1
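To make the difference between anchoring and word matching concrete, here is a short sketch on the same vector. The variable names `exact` and `whole_word` are mine. The anchors ^ and $ force the whole string to match, while \\b only marks word boundaries, so it still matches entries that merely contain the word.

```r
varnames <- c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation",
              "total dissolved nitrogen", "total nitrogen")

# Anchored pattern: the entire string must be "nitrogen".
exact <- grepl("^nitrogen$", varnames)

# Word-boundary pattern: "nitrogen" as a whole word anywhere in the string.
whole_word <- grepl("\\bnitrogen\\b", varnames)

print(exact)
print(whole_word)

# The anchored match agrees with a plain equality test.
print(identical(exact, varnames == "nitrogen"))
```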
To get the indices that are exactly equal to "nitrogen" you could use
which(varnames == "nitrogen")
Depending on what you want to do, you might not even need the 'which', as varnames == "nitrogen" gives a logical vector of TRUE/FALSE. If you just want to do something like replace all of the occurrences of "nitrogen" with "oxygen", this should suffice:
varnames[varnames == "nitrogen"] <- "oxygen"
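Or use fixed = TRUE if you want to match the literal string without any regex interpretation. Note that this is still a substring match, so on the original varnames it would return every entry containing "nitrogen". It gives an exact match below only because no other value in v contains that string: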
v <- sample(c("nitrogen", "potassium", "hidrogen"), size = 100, replace = TRUE, prob = c(.8, .1, .1))
grep("nitrogen", v, fixed = TRUE)
# [1] 3 4 5 6 7 8 9 11 12 13 14 16 19 20 21 22 23 24 25
# [20] 26 27 29 31 32 35 36 38 39 40 41 43 44 46 47 48 49 50 51
# [39] 52 53 54 56 57 60 61 62 65 66 67 69 70 71 72 73 74 75 76
# [58] 78 79 80 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96 97
# [77] 98 99 100
Dunno about the speed issues. I like to test stuff before claiming that approach A is faster than approach B, but in theory, at least from my experience, indexing/binary operators should be the fastest, so I vote for @Dason's approach. Also note that regexes are generally slower than fixed = TRUE grepping.
A little proof is attached below. Note that this is a lame test: system.time should be put inside replicate to get (more) accurate differences, you should take outliers into account, etc. Still, in this run which comes out fastest. =)
(a0 <- system.time(replicate(1e5, grep("^nitrogen$", v))))
# user system elapsed
# 5.700 0.023 5.724
(a1 <- system.time(replicate(1e5, grep("nitrogen", v, fixed = TRUE))))
# user system elapsed
# 1.147 0.020 1.168
(a2 <- system.time(replicate(1e5, which(v == "nitrogen"))))
# user system elapsed
# 1.013 0.020 1.033
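A rough sketch of the "system.time inside replicate" idea mentioned above: each approach is timed several times and the median is kept, which damps outliers. The helper `time_call`, the list `candidates`, and the repetition counts are assumptions of mine. The absolute numbers depend on the machine, so no timings are shown.

```r
set.seed(1)
v <- sample(c("nitrogen", "potassium", "hidrogen"), size = 100,
            replace = TRUE, prob = c(.8, .1, .1))

# Elapsed time for `reps` calls of a zero-argument function.
time_call <- function(f, reps = 1000) {
  system.time(for (i in seq_len(reps)) f())[["elapsed"]]
}

# The three approaches from the answers, wrapped as functions.
candidates <- list(
  regex = function() grep("^nitrogen$", v),
  fixed = function() grep("nitrogen", v, fixed = TRUE),
  which = function() which(v == "nitrogen")
)

# Repeat each measurement 5 times and keep the median.
timings <- sapply(candidates, function(f) median(replicate(5, time_call(f))))
print(timings)

# Sanity check: on this vector all three return the same indices.
print(identical(candidates$regex(), candidates$which()))
```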
