R - Apriori function error - r

I'm currently having problems with the apriori function. The thing is I have a csv with data like the following:
Desc,Cantidad,Valor,Fecha,Lugar,UUID
DESCUENTO,1,-3405,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
DESCUENTO,1,-3405,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
DESCUENTO,1,-170,2014-09-05T15:10:24,83000,7F0C7F0B-BCFC-4FCA-8740-B36AE9932869
Descuento de TYK Dia,1,-156,2014-06-19T16:52:27,86280,1E08E51E-213A-4EE0-8FE9-492E677FF0C9
Descuento de TYK Dia,1,-139,2014-04-25T10:52:44,86280,AB802E63-2D0D-4B47-AB70-DDE007929F9F
DESCUENTO,1,-63,2014-07-04T13:53:10,83000,5B1F12BB-71DE-4734-A774-8D377757A880
REDONDEO,1,-1,2014-03-29T10:50:59,0,5B241EFA-6654-46EA-B47A-3CB76C5EA923
DESCUENTO,1,-1,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
DESCUENTO,1,-1,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
LAVADO,1,0,2014-05-27T18:18:11,44500,e5d540d6-0f98-4993-ec09-56887cd4a27d
TUA,1,0,2014-09-29T10:20:31,6500,1d8ada06-a8a1-4bd8-9356-851b5da28108
Transportación Aerea,1,0,2014-10-03T10:41:09,6500,5fc3925a-d08a-4cdc-be7e-ca02bd488d5b
OBSEQUIO LAVADO DE CARROCERIA,1,0,2014-04-07T13:45:55,91800,8148ab07-5804-4b2b-b37c-5323b394907a
Arroz Al Azafran Combos A,1,0,2014-08-19T11:50:34,11520,f09c23e6-dc60-4aaf-a1b8-1506d38f3585
Frijoles Charros A,1,0,2014-08-19T11:50:34,11520,f09c23e6-dc60-4aaf-a1b8-1506d38f3585
Pepsi Ch A,1,0,2014-08-19T11:50:34,11520,f09c23e6-dc60-4aaf-a1b8-1506d38f3585
FECHA DE CONSUMO 18/07/2014,1,0,2014-07-19T18:01:45,6060,0f0465aa-a75b-4f95-8e3b-43c13452cafb
CAMBIO DE ACEITE DE MOTOR,1,0,2014-02-01T11:18:53,39890,5BDF0742-CDF5-4F6B-9937-DF1CB00274ED
CAMBIO DE FILTRO DE ACEITE,1,0,2014-02-01T11:18:53,39890,5BDF0742-CDF5-4F6B-9937-DF1CB00274ED
Whole CSV (https://github.com/antonio1695/BaseX/blob/master/facturas1.csv)
To download the file just click on find file and then you will see the file.
So what I did was:
> df1 <- read.csv("facturas1.csv")
> rules <- apriori(df1,parameter=list(support=0.01,confidence=0.5))
Error in asMethod(object) :
column(s) 3 not logical or a factor. Discretize the columns first.
Nevertheless, the problem is that the columns are already discrete. Also, if I swap columns 2 and 3 in the data, the error still complains that column 3 is not logical or a factor, when it should now complain about column 2 instead. Thanks!

library(arules)
df1 <- read.csv("https://raw.githubusercontent.com/antonio1695/BaseX/master/facturas1.csv")
trans <- as(df1, "transactions")
Error in asMethod(object) :
column(s) 3 not logical or a factor. Discretize the columns first.
Let's look at the data frame:
str(df1)
'data.frame': 10510 obs. of 6 variables:
$ Desc : Factor w/ 3927 levels "0","00000215R0 - LIQUIDO DE FRENOS",..: 1490 1490 1490 1491 1491 1490 3209 1490 1490 2238 ...
$ Cantidad: Factor w/ 85 levels "","1","-1","10",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Valor : int -3405 -3405 -170 -156 -139 -63 -1 -1 -1 0 ...
$ Fecha : Factor w/ 4054 levels "1294","2014-01-06T11:10:21",..: 4041 4041 3443 1794 596 2125 241 4041 4041 1215 ...
$ Lugar : Factor w/ 982 levels "","0","1000",..: 487 487 802 848 848 802 2 487 487 373 ...
$ UUID : Factor w/ 4056 levels "0019A60D-78F8-E341-8D3E-9786201FE017",..: 1988 1988 1979 456 2711 1423 1424 1988 1988 3658 ...
Valor is a number (int) and needs to be discretized! For example with discretize():
df1$Valor <- discretize(df1$Valor)
head(df1$Valor)
[1] [-3405, 2400) [-3405, 2400) [-3405, 2400) [-3405, 2400) [-3405, 2400)
[6] [-3405, 2400)
Levels: [-3405, 2400) [ 2400, 8204) [ 8204,14009]
Now you can create transactions and apply APRIORI:
trans <- as(df1, "transactions")
rules <- apriori(trans,parameter=list(support=0.01,confidence=0.5))
rules
set of 84 rules

After some research I found that the apriori function cannot work with a raw numeric column: the values must be turned into intervals (a factor) first. When you use discretize you can add the parameter "categories" to select how many intervals you want (newer arules releases appear to call this argument breaks). I'll post the code here:
I decided to take 20 intervals, built with the frequency method so that each interval holds roughly the same number of values.
df1$Valor <- discretize(df1$Valor, method="frequency", categories = 20)
Hope it helps somebody.
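To make the idea of equal-frequency intervals concrete, here is a minimal base R sketch that uses cut() with quantile breaks as a stand-in for discretize(). It is not the arules implementation, and the valor vector is a small hypothetical sample, not the real file.

```r
# Hypothetical sample of a numeric column like Valor
valor <- c(-3405, -170, -156, -139, -63, -1, 0, 0, 120, 450, 900, 2400)

# Quantile cut points give intervals with roughly equal counts;
# unique() guards against duplicated breaks when many values are tied
breaks <- unique(quantile(valor, probs = seq(0, 1, length.out = 5)))

# Turn the numbers into a factor of intervals, closed on the left
valor_f <- cut(valor, breaks = breaks, include.lowest = TRUE, right = FALSE)

is.factor(valor_f)   # a factor column is what the transactions coercion expects
table(valor_f)       # number of values that fall in each interval
```

The resulting factor plays the same role as the discretized Valor column above: once every column is a factor or logical, the coercion to transactions no longer complains.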

Related

TM - Clustering data with special date variable

I've got the following data from tripadvisor:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : Factor w/ 674 levels "id","rn106322397",..: 672 671 670 669 668 667 666 665 664 663 ...
$ quote : Factor w/ 606 levels "\"Picturesque Lake Konigssee\"",..: 389 139 113 149 384 39 176 598 199 603 ...
$ rating : Factor w/ 6 levels "1","2","3","4",..: 3 5 5 5 4 5 5 5 4 5 ...
$ date : Factor w/ 505 levels "date","Reviewed 1 August 2014\n",..: 200 200 427 427 427 443 434 351 313 494 ...
$ reviewnospace: Factor w/ 674 levels "- Good car parking facilities- Organized boat trips- Ensure that you have enough time at hand for the boat trip",..: 624 573 144 211 507 26 351 672 451 249 ...
I am trying to cluster the data by date to get two groups: winter and summer vacationers. I want to analyse the reviews within each group afterwards. I am using the tm package and tried the following code:
> x <- read.csv ("seeganz.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
> corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
> meta(corp,tag = "date") <- x$date
> idx <- meta(corp, "date") == 'December'
But it is not working: the resulting corpus contains 0 documents:
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 0
As the date has the structure "Reviewed 1 August 2014", how do I have to adapt this code to get, for example just the reviews from Nov - Feb?
Do you have any idea how I can solve this problem?
Thank you.
Generic Approach:
Use substr(date, 10, nchar(date)) to get "1 August 2014"; call this new vector dateNew.
Use a normal date function, e.g. as.Date(dateNew, ...), to change dateNew into a vector of type Date, on which you can do subsetting, subtraction and other operations.
References from http://www.statmethods.net/input/dates.html
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
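Applied to the review dates, the approach above could look like the following sketch. The date vector is a hypothetical sample in the same shape as x$date (including the trailing newline), and matching against month.name is an assumption made here to avoid depending on the locale.

```r
# Hypothetical sample shaped like x$date
date <- c("Reviewed 1 August 2014\n",
          "Reviewed 15 December 2013\n",
          "Reviewed 3 February 2014\n")

# Drop the 9-character prefix "Reviewed " and the trailing newline
dateNew <- trimws(substr(date, 10, nchar(date)))

# The month name is the second word; look it up in the built-in month.name
monthName <- sapply(strsplit(dateNew, " "), `[`, 2)
monthNum  <- match(monthName, month.name)

# TRUE for reviews written from November to February
winter <- monthNum %in% c(11, 12, 1, 2)
winter

# In an English locale the full date can also be parsed directly
# (gives NA in locales where %B does not match English month names)
as.Date(dateNew, format = "%d %B %Y")
```

A logical vector like winter can then be used for indexing in the same way as idx in the question, e.g. corp[winter].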

How to plot graph with certain requirement of choosing of x-axis in data table in R?

I have a data frame as following:
>str(df)
'data.frame': 22673 obs. of 6 variables:
$ V1 : Factor w/ 39 levels "2015-02-09","2015-02-09 ",..: 1 1 1 1 1 1 1 1 1 1 ...
$ V2 : Factor w/ 10465 levels "00:48:26","01:49:26",..: 3949 3956 3964 3985 4196 4254 4262 4268 4275 4309 ...
$ V3 : Factor w/ 3 levels "Admin","AmbassadorSchoolPlayer",..: 3 3 3 3 3 3 3 3 1 3 ...
$ V4 : Factor w/ 104 levels "1builder1","22mAsgarfus",..: 77 77 57 77 48 48 48 48 6 77 ...
$ V5 : Factor w/ 8580 levels ""," - -?"," - 14 1",..: 2306 874 7433 3650 2306 2306 3364 6501 3257 2306 ...
df$V4 is the user_name, and I'd like to plot a graph with df$V1 on the x-axis and df$V4 on the y-axis. But since the number of users is too big, I'd like to keep only the user names that appear more than a threshold number of times, let's say 10, in the data frame. How can I do it? I am quite new to R, and I have read several articles introducing ggplot2, but did not find the answer. Thank you in advance.
use the table function
count <- table(df$V4)
select the usernames with more than 10 entries
some_usernames <- names(count[count>10])
then subset your dataframe
df_subset <- df[df$V4 %in% some_usernames, ]
then use ggplot2 or base graphics to do what you want. Hope this helps.
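The three steps above can be tried on a small made-up data frame. The user names and dates below are hypothetical; only the column names V1 and V4 follow the question.

```r
# Hypothetical data: "alice" appears 12 times, "bob" 3 times, "carol" 11 times
df <- data.frame(
  V1 = rep("2015-02-09", 26),
  V4 = c(rep("alice", 12), rep("bob", 3), rep("carol", 11)),
  stringsAsFactors = TRUE
)

# Count how often each user name occurs
count <- table(df$V4)

# Keep the names that occur more than 10 times
some_usernames <- names(count[count > 10])

# Keep only the rows of those users and drop the now-unused factor levels
df_subset <- droplevels(df[df$V4 %in% some_usernames, ])

some_usernames
nrow(df_subset)
```

The droplevels() call is optional; it just keeps levels such as "bob" from lingering in the factor after the rows are gone.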

Subsetting two corresponding variables if both of them are true

I'm trying to create a vector with two columns that contain the following strings given that the data in BOTH columns are true. I tried, unsuccessfully with:
CrimesAndLocation <- table(c(Crimes_Data$Primary.Type=="ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY",Crimes_Data$Location.Description=="RESIDENCE")))
I'm trying to get an output where:
Primary.Type is one of the 8 specific felonies listed above. Thus, it should not show all 32 possible felonies, just the 8 listed above
Location.Description, is RESIDENCE
This is the goal of what I'm trying to do:
COLUMN 1 COLUMN 2
"ARSON" "RESIDENCE"
"KIDNAPPING" "RESIDENCE"
"BATTERY" "RESIDENCE"
"HOMICIDE" "RESIDENCE"
"ASSAULT" "RESIDENCE"
...
UPDATE: > str(Crimes_Data) :
'data.frame': 293036 obs. of 22 variables:
$ ID : int 10248194 10251162 10248198 10248242 10248228 10248223 10248192 10248157 10249529 10252453 ...
$ Case.Number : Factor w/ 293015 levels "F218264","HA168845",..: 292354 292350 292363 292359 292368 292366 292351 292348 292364 292816 ...
$ Date : Factor w/ 124573 levels "01/01/2015 01:00:00 AM",..: 94544 94542 94539 94536 94535 94535 94535 94535 94529 94528 ...
$ Block : Factor w/ 27983 levels "0000X E 100TH PL",..: 13541 7650 22635 1317 13262 9623 12854 8232 24201 14279 ...
$ IUCR : Factor w/ 334 levels "0110","0130",..: 49 139 321 33 251 82 38 282 97 38 ...
$ Primary.Type : Factor w/ 32 levels "ARSON","ASSAULT",..: 3 7 24 3 18 31 3 13 17 3 ...
$ Description : Factor w/ 313 levels "$500 AND UNDER",..: 111 281 119 35 131 1 260 193 274 260 ...
$ Location.Description: Factor w/ 121 levels "","ABANDONED BUILDING",..: 95 19 110 48 97 110 106 110 110 99 ...
$ Arrest : Factor w/ 2 levels "false","true": 1 1 2 1 2 2 1 2 2 1 ...
$ Domestic : Factor w/ 2 levels "false","true": 2 1 1 1 1 1 1 1 1 1 ...
$ Beat : int 835 333 733 634 1121 1432 1024 735 414 2535 ...
$ District : int 8 3 7 6 11 14 10 7 4 25 ...
$ Ward : int 18 5 6 21 27 1 22 17 7 26 ...
$ Community.Area : int 70 43 68 49 23 22 30 67 46 23 ...
$ FBI.Code : Factor w/ 26 levels "01A","01B","02",..: 11 17 26 6 21 8 11 25 9 11 ...
$ X.Coordinate : int 1154209 1190610 1172166 1176493 1153156 1159961 1154332 1163770 1193570 NA ...
$ Y.Coordinate : int 1852321 1856955 1858813 1841948 1904451 1915955 1887190 1857568 1852889 NA ...
$ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ Updated.On : Factor w/ 442 levels "01/01/2015 12:39:07 PM",..: 288 288 288 288 288 288 288 288 288 288 ...
$ Latitude : num 41.8 41.8 41.8 41.7 41.9 ...
$ Longitude : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
$ Location : Factor w/ 173646 levels "","(41.644604096, -87.610728247)",..: 31318 40835 45858 15601 116871 140063 84837 42961 32176 1 ...
This is a good job for the dplyr package. The filter function will filter a data frame according to any number of logical expressions that you feed it. The following should work for you:
library(dplyr)
filter(
Crimes_Data,
Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE", "HUMAN TRAFFICKING",
"KIDNAPPING", "ROBBERY"),
Location.Description == "RESIDENCE"
)
If you'd rather not use dplyr, you can do it the old fashioned way with base R, like this:
type.bool <- Crimes_Data$Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE",
"HUMAN TRAFFICKING", "KIDNAPPING",
"ROBBERY")
location.bool <- Crimes_Data$Location.Description == "RESIDENCE"
Crimes_Data[type.bool & location.bool, ]
Instead of an integer vector of indices, the [ subsetting operator can take a boolean vector. In that case, it will only return the rows of the data frame for which the corresponding elements of the boolean vector are TRUE.
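As a small illustration of indexing with a boolean vector, here is a sketch on a made-up miniature of Crimes_Data. The five rows are hypothetical; the column names follow the question.

```r
# Hypothetical miniature of Crimes_Data
Crimes_Data <- data.frame(
  Primary.Type         = c("ARSON", "THEFT", "BATTERY", "ROBBERY", "ARSON"),
  Location.Description = c("RESIDENCE", "RESIDENCE", "STREET", "RESIDENCE", "STREET"),
  stringsAsFactors = TRUE
)

felonies <- c("ARSON", "ASSAULT", "BATTERY", "BURGLARY", "HOMICIDE",
              "HUMAN TRAFFICKING", "KIDNAPPING", "ROBBERY")

# One TRUE/FALSE per row; & requires both conditions to hold
keep <- Crimes_Data$Primary.Type %in% felonies &
  Crimes_Data$Location.Description == "RESIDENCE"
keep

# Rows where keep is TRUE, restricted to the two columns of interest
result <- droplevels(Crimes_Data[keep, c("Primary.Type", "Location.Description")])
result
```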
Thanks for adding the str() (aka "structure") output; it makes it much easier to help you.
To obtain a list of observations where
these eight felonies : "ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY"
occurred at RESIDENCE
Try breaking up the task into slightly smaller parts:
Step 1:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" | Primary.Type == "ASSAULT" | Primary.Type == "BATTERY" | Primary.Type == "BURGLARY" | Primary.Type == "HOMICIDE" | Primary.Type == "HUMAN TRAFFICKING" | Primary.Type == "KIDNAPPING" | Primary.Type == "ROBBERY")
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
Result:
ViolentCrimesResidence holds two columns: column 1 is Primary.Type and column 2 is Location.Description, where column 1 only has values from the eight felonies of interest and column 2 only has "RESIDENCE".
Explanation
Step 1:
From R website's examples about subset and OR condition:
PineTreeGrade3Data<-subset(StudentData, SchoolName=="Pine Tree Elementary" | Grade==3)
Whereas we have:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" |
we use the subset() function
Crimes_Data is the existing data frame as input
next are the conditions, which simply take the pattern VectorName == "Some string", in this case Primary.Type == "ARSON"
But we want observations for the other types too, so use the "or" condition to include them
in R, "or" is written with | symbol. So we use this repeatedly to include each of the other felonies of interest
at the top level the equal sign = works like <-: it assigns (saves) this subset result to a new data frame we call ViolentCrimes.
note I prefer using = because it takes fewer keystrokes to type than <-; either is correct here
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
the input is the ViolentCrimes data frame we made previously, which contains only the eight felonies of interest ("ARSON", "ASSAULT", ...)
now we are interested in, out of all these violent crimes, which ones occurred at home, so use condition Location.Description == "RESIDENCE"
but a further option of subset() we didn't use before, is the select = ... option
we do a select = c(Variable1, Variable2) to choose just the Primary.Type and Location.Description vectors
note that if you don't want to limit the columns (aka variables), you simply omit the select = ... option
thus it saves this new subset into ViolentCrimesResidence
So, now in R when you:
ViolentCrimesResidence
You will see a two-column output you wanted of the eight felonies of interest, that happened in RESIDENCE.
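For comparison, the two subset() steps can also be collapsed into a single call by using %in% instead of a chain of | conditions. This is only a sketch on hypothetical rows; it also shows a table() of the result, since the original attempt in the question used table().

```r
# Hypothetical miniature of Crimes_Data
Crimes_Data <- data.frame(
  Primary.Type         = c("ARSON", "BATTERY", "BATTERY", "THEFT", "HOMICIDE", "BATTERY"),
  Location.Description = c("RESIDENCE", "RESIDENCE", "RESIDENCE", "RESIDENCE", "STREET", "STREET"),
  stringsAsFactors = TRUE
)

felonies <- c("ARSON", "ASSAULT", "BATTERY", "BURGLARY", "HOMICIDE",
              "HUMAN TRAFFICKING", "KIDNAPPING", "ROBBERY")

# Both conditions and the column selection in one subset() call
ViolentCrimesResidence <- subset(
  Crimes_Data,
  Primary.Type %in% felonies & Location.Description == "RESIDENCE",
  select = c(Primary.Type, Location.Description)
)

# Count each felony type, after dropping levels that no longer occur
counts <- table(droplevels(ViolentCrimesResidence)$Primary.Type)
counts
```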

can't draw the grouped value above stacked bar plot in ggplot2

I have a ggplot2 question. When I run the code below, it correctly shows the stacked bar plot, without any value above each bar:
p=ggplot(data=essnn)
p+geom_bar(binwidth=0.5,stat="identity")+ #
aes(x=reorder(classname,-amount,sum), y=amount, label=amount, fill = sort(year))+
theme()
I want to add the total amount for each class (summed over the years) above each bar, and here is my code:
+geom_text(aes(x=classes,y=total,label=total), data=essnnta, fill=NULL, size=3)
But an error message appears:
Error in fill = year, can not find object "year"
That's my problem: why can the object "year" be found when I draw the stacked bar plot without the totals, but not once I add the layer with the totals?
> str(essnn)
'data.frame': 48619 obs. of 15 variables:
$ id : int 2006051337 2006051337 2006051337 2006051337 2006051337 2006051337 2004070648 2006031360 2006031360 2004070062 ...
$ gender : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ age : num 30 30 30 30 30 30 38 43 43 37 ...
$ class : Factor w/ 92 levels "100ab","100aa",..: 18 18 18 18 18 18 18 18 18 18 ...
$ classname: Factor w/ 1136 levels "cad"," Office2010",..: 111 111 111 111 111 111 116 107 107 107 ...
$ grade : num 7 5 6 8 3 4 1 4 3 2 ...
$ year : Factor w/ 6 levels "98","99","100",..: 3 3 3 3 2 2 4 5 5 3 ...
$ ses : num 212 210 211 213 207 208 217 221 220 210 ...
$ date : int 1010421 1001115 1010214 1010701 1000411 1000627 1020424 1030304 1021121 1001108 ...
$ money : num 5800 5800 5800 5800 5200 5200 3000 0 5500 5500 ...
$ discount : num 1160 1160 1160 1160 1040 1040 600 0 275 550 ...
$ amount : num 4640 4640 4640 4640 4160 ...
$ idc : Factor w/ 7 levels "在校生","校友",..: 2 2 2 2 2 2 2 7 7 7 ...
$ mdy : Date, format: "2012-04-21" "2011-11-15" "2012-02-14" "2012-07-01" ...
$ day : num 1123 1281 1190 1052 1499 ...
> str(essnnta)
'data.frame': 10 obs. of 2 variables:
$ classes: Factor w/ 10 levels "JD","JF",..: 1 7 8 4 6 10 3 5 2 9
$ total : num 55603526 43708950 43555010 35649129 33214372 ...
Your problem might be that your x-axes are not the same in the two data frames, so ggplot does not know which value corresponds to which stack. I am not sure about this, as I don't understand the way you define your x-axis in the original barplot. I also find it a bit strange to define the aes outside of the ggplot function or the geom_bar, but that might just be me being used to a different kind of syntax.
All in all I find it difficult to help you as you do not provide any reproducible example.
Here is a small bit of data, and a plot that sort of works. If you supplement your question with your data (or a subset of it), see if this works. You may also want to position the label at the top of each bar.
library(ggplot2)
essnn <- data.frame(year = c(98,99,100,101,102),
                    classname = c("a", "b", "c", "d", "e"),
                    amount = c(1e6, 2e6,3e6,4e6,5e6))
essnnta <- data.frame(total = c(10, 20, 30, 40, 50))
# binwidth is dropped: it has no effect with stat = "identity"
ggplot(data=essnn, aes(x=reorder(classname,-amount, sum), y=amount, fill = year)) +
  geom_bar(stat="identity", position = "stack") +
  geom_text(aes(x=essnn$classname, y=essnnta$total, label=essnnta$total), size=3) # not "classes"
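The per-class totals needed for the labels can be computed in base R. The sketch below uses hypothetical toy rows with the column names from the question. The closing comment reflects an assumption about the likely cause of the error, namely that the label layer inherits fill = year from the plot-level aes while its own data frame has no year column.

```r
# Hypothetical toy data with the same column names as essnn
essnn <- data.frame(
  classname = c("a", "a", "b", "b", "c"),
  year      = factor(c("98", "99", "98", "100", "99")),
  amount    = c(100, 200, 50, 25, 300)
)

# One row per class, holding the sum of amount over all years
essnnta <- aggregate(amount ~ classname, data = essnn, FUN = sum)
names(essnnta) <- c("classes", "total")
essnnta

# With ggplot2, a geom_text() layer drawn from essnnta would typically need
# inherit.aes = FALSE (or fill = NULL inside its aes()), because essnnta
# has no "year" column for an inherited fill mapping to look up.
```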

R. Handling dates and wide format from an imported Stata file

I have been given a Stata data file (counts.dta) that contains daily counts for the years 1975 to 2006 stored in wide-format. The columns are labelled month (full name of the month as a character string), day (numeric with values 1-31), and then the years from 1975 to 2006 with labels '_1975', '_1976' ... '_2006'. I assume that the underline is a consequence of something in Stata. There are dummy counts of zero (0) inserted for the date 29 February when the year-column is not a leap year.
I want to do several things. First, convert to long form with a sensible representation for year. Second, change the tri-partite representation of the date to something more sensible.
My approach has been to change the character string month to a factor and then to get it into the correct order:
require("foreign")
counts <- read.dta(file='counts.dta')
counts[['month']] <- as.factor( counts[['month']] )
counts[['month']] <-
factor(counts[['month']], levels( counts[['month']] )[c(5,4,8,1,9,7,6,2,12,11,10,3)])
I then have
str( counts )
'data.frame': 366 obs. of 34 variables:
$ month: Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ _1975: int 515 649 745 599 445 667 725 749 646 740 ...
$ _1976: int 485 685 529 467 630 723 712 685 715 504 ...
$ _1977: int 505 437 489 588 634 734 682 537 453 673 ...
and so forth. Converting to long format
lcounts <- reshape(counts,
direction="long",
varying=list(names( counts )[3:34]),
v.names="n.counts",
idvar=c("month","day"),
timevar="Year",
times=1975:2006)
str( lcounts )
gives
'data.frame': 11712 obs. of 4 variables:
$ month : Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 1975 1975 1975 1975 1975 1975 1975 1975 1975 1975 ...
$ n.counts: int 515 649 745 599 445 667 725 749 646 740 ...
plus some further lines relating to the original Stata file.
My questions are: (1) what is a good way to convert the factor month, numeric year and numeric day to a useful date format, so that I can determine, for example, the day of the week, the interval between two dates and so on? (2) Was there a better way to tackle the problem from the start?
This should be pretty easy because all you have to do is paste together the columns of your data.frame and use as.Date to create a Date class vector.
Let's start with some data similar to yours:
dat <- data.frame(month = factor(c(rep("January",31), rep("February",29)),
                                 levels = month.name), # calendar order, so as.numeric() gives 1-12
                  day = c(1:31, 1:29),
                  Year = 1975,
                  n.counts = 515)
Then the creation of the date variable is simple:
# 29 February 1975 does not exist, so that row gets NA
dat$Date <- as.Date(with(dat, paste(as.numeric(month), day, Year)), "%m %d %Y")
str(dat)
# 'data.frame': 60 obs. of 5 variables:
# $ month : Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ day : int 1 2 3 4 5 6 7 8 9 10 ...
# $ Year : num 1975 1975 1975 1975 1975 ...
# $ n.counts: num 515 515 515 515 515 515 515 515 515 515 ...
# $ Date : Date, format: "1975-01-01" "1975-01-02" "1975-01-03" "1975-01-04" ...
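Once a Date column exists, the operations asked about in the question are short. The sketch below is one possible way, using three hypothetical rows; it looks the month up in month.name rather than relying on factor order, and it uses the numeric weekday format "%u" so the result does not depend on the locale.

```r
# Hypothetical rows in the long format from the question
dat <- data.frame(month = c("January", "February", "February"),
                  day   = c(15, 28, 29),
                  Year  = 1975,
                  stringsAsFactors = FALSE)

# Build ISO date strings; match() turns the month name into 1-12
iso <- sprintf("%d-%02d-%02d", dat$Year, match(dat$month, month.name), dat$day)
dat$Date <- as.Date(iso, format = "%Y-%m-%d")

# Impossible dates such as 29 February 1975 become NA,
# which makes the dummy zero-count rows easy to drop
is.na(dat$Date)

# Interval between two dates, in days
as.numeric(dat$Date[2] - dat$Date[1])

# Day of the week as a number, 1 = Monday ... 7 = Sunday
format(dat$Date, "%u")
```

Filtering with dat[!is.na(dat$Date), ] would then remove the placeholder rows for 29 February in non-leap years.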
The main focus in this thread is naturally what to do in R after data import, but here I bundle together various details on the Stata side of this.
It is longstanding advice that data of this kind are much more easily handled in Stata in a long shape and reshape long is a standard command to do that conversion for data arriving with each year's data in a separate variable (R users: please read "column" as a translation). So, if possible, you should ask a provider of such Stata files to do that before export.
What the OP calls labels such as _1975 are legal variable names in Stata, and as the OP guesses the underscore is needed because variable names in Stata may not start with numeric characters.
On the information given, it would have been possible to export the data without loss from Stata in file formats other than .dta, notably as the usual kinds of text files (.csv, etc.).
Stata's preferred way of holding daily dates is as integers with origin 0 = 1 January 1960 (so 26 March 2015 would be 20173), which presumably is trivially easy to convert to any date representation in R.
In short, the particular and indeed peculiar form of the data as presented to the OP is in no sense either required by any Stata syntax or even recommended as part of good Stata practice.
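On the R side, the conversion of Stata daily dates mentioned above is indeed short. A minimal sketch, assuming the values arrive as plain integers counted from 1 January 1960 (the stata_days vector is made up for illustration):

```r
# Hypothetical Stata daily date values: days since 1 January 1960
stata_days <- c(0, 20173)

# as.Date() accepts a numeric vector together with an origin
r_dates <- as.Date(stata_days, origin = "1960-01-01")
format(r_dates)
```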
