How to prepare transaction data for arules


I have been searching existing questions for three days, so I have finally worked up the courage to ask here.
I have a dataset of 379,584 entries that I want to feed to "arules" in R.
It looks like this:
A. If I try format = "basket", I do the following:
sales <- read.csv("sales.csv", sep=";")
s1 <- split(sales$product_id, sales$order_id)
s1 <- unique(s1)
tr <- as(s1, "transactions")
This gives me the error "can not coerce list with transactions with duplicated items".
B. If I go with format = "single":
tr <- read.transactions("sales.csv",
sep=";", format = "single", cols = c(4,2))
I get the same error: "can not coerce list with transactions with duplicated items".
I have already checked the file for duplicates, and Excel can't find any. I believe the problem is trivial, but I'm stuck.

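The unique(s1) call appears to be causing a problem in your code. Is it required? Applied to a list, unique() removes duplicated list elements (whole orders) and drops the order_id names; it does not remove duplicated items inside an order.
I managed to create the transactions just by commenting out that line: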
sales <- structure(list(sku = c(207426L, 207422L, 207424L, 9793L, 33186L,
72406L), product_id = c(15729L, 15725L, 15727L, 15999L, 15983L,
15992L), item_id = 1:6, order_id = c(1L, 1L, 1L, 2L, 2L, 2L)),
.Names = c("sku", "product_id", "item_id", "order_id"),
class = "data.frame", row.names = c(NA, -6L))
s1 <- split(sales$product_id, sales$order_id)
#s1 <- unique(s1)
tr <- as(s1, "transactions")
tr
transactions in sparse format with
2 transactions (rows) and
6 items (columns)
If you do need to remove duplicated items within each transaction, run this instead:
s1 <- lapply(s1, unique)
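To make the difference concrete, here is a minimal sketch with hypothetical toy data (not the asker's file) in which one order contains the same product twice. It uses base R only to find the affected orders and deduplicate within each one; the final coercion is left as a comment because it assumes arules is installed.

```r
# Hypothetical toy data: order 1 contains product 15729 twice.
sales <- data.frame(
  order_id   = c(1L, 1L, 1L, 2L, 2L),
  product_id = c(15729L, 15725L, 15729L, 15999L, 15983L)
)

# One vector of product ids per order.
s1 <- split(sales$product_id, sales$order_id)

# Flag orders that contain a repeated item (anyDuplicated() returns 0 if none).
has_dups <- vapply(s1, anyDuplicated, integer(1)) > 0

# Remove repeated items inside each order, keeping the order names.
s1_clean <- lapply(s1, unique)

# With arules loaded, the cleaned list can then be coerced:
# tr <- as(s1_clean, "transactions")
```

Checking has_dups on the real data shows which orders trigger the error, something a spreadsheet search for fully duplicated rows may miss when the rows differ in another column such as item_id.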

Related

How to write a large list with S4 objects as a CSV file?

I have code that runs and outputs a large list. I am stuck on writing the output to a file as I keep getting different errors so I haven't been able to write a file in any way that I normally would for a dataframe.
The code and data I'm using is this:
library(GeneOverlap)
library(dplyr)
library(stringr)
dataset1 <- structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.",
"Gene3"), Gene_count = c(5L, 5L, 3L, 16L, 16L, 16L), Phenotype = c("Phenotype1",
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))
dataset2 <- structure(list(Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6",
"Gene7"), Gene_count = c(10L, 10L, 4L, 17L, 3L, 2L), Phenotype = c("Phenotype1",
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))
d1_split <- split(dataset1, dataset1$Phenotype)
d2_split <- split(dataset2, dataset2$Phenotype)
# this should be TRUE in order for Map to work correctly
all(names(d1_split) == names(d2_split))
tests <- Map(function(d1, d2) {
go.obj <- newGeneOverlap(d1$Gene, d2$Gene, genome.size = 1871)
return(testGeneOverlap(go.obj))
}, d1_split, d2_split)
I then want to write the tests list out to a file, ideally with the p-value for each Phenotype from the code above as a column. But each approach I try fails with a different error, for example:
library(Matrix)
library(data.table)
lstData <- Map(as.data.frame, tests)
Error in as.data.frame.default(dots[[1L]][[1L]]) :
cannot coerce class ‘structure("GeneOverlap", package = "GeneOverlap")’ to a data.frame
dfrData <- rbindlist(lstData)
Error in rbindlist(lstData) : object 'lstData' not found
Error in fwrite(tests, "list.csv") :
Column 1's type is 'S4' - not yet implemented in fwrite.
library(data.table)
outputfile <- "test.csv" #output file name
sep <- "," #define the separator (related to format of the output file)
for(nam in names(tests)){
fwrite(list(nam), file=outputfile, sep=sep, append=T) #write names of the list elements
ele <- tests[[nam]]
if(is.list(ele)) fwrite(ele, file=outputfile, sep=sep, append=T, col.names=T) else fwrite(data.frame(matrix(ele, nrow=1)), file=outputfile, append=T) #write elements of the list
fwrite(list(NA), file=outputfile, append=T) #add an empty row to separate elements
}
Error in as.vector(data) :
no method for coercing this S4 class to a vector
I've been trying to understand S4 objects, but I'm a beginner R user. What functions or packages could I use to write out my tests object? Example data is included above to run all the code.

The GeneOverlap package has several get* functions for accessing test result statistics. You can combine this with the tidyverse to create a tidy table of results:
results <- tibble(pheno = names(tests), tests = tests) %>%
rowwise() %>%
mutate(
across(tests,
.fns = list(tested = getTested, pval = getPval, OR = getOddsRatio, jaccard = getJaccard),
.names = '{.fn}')
) %>%
select(-tests) # drop test object column
pheno tested pval OR jaccard
<chr> <lgl> <dbl> <dbl> <dbl>
1 Phenotype1 TRUE 0.00481 410. 0.2
2 Phenotype2 TRUE 0.00214 1302. 0.333
3 Phenotype6 TRUE 1 0 0
You can then save this data frame with write_csv or a similar method.
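The same idea can be sketched in base R without GeneOverlap: pull one value per object out of the list, assemble a data frame, and write that. The S4 class ToyTest and its slots below are hypothetical stand-ins for the real test objects; with GeneOverlap you would call its accessors such as getPval instead of reading slots directly.

```r
library(methods)

# Hypothetical S4 class standing in for a test result object.
setClass("ToyTest", representation(pval = "numeric", odds = "numeric"))

tests <- list(
  Phenotype1 = new("ToyTest", pval = 0.01, odds = 4),
  Phenotype2 = new("ToyTest", pval = 0.5,  odds = 1)
)

# Extract one plain value per list element; vapply() enforces the result type.
results <- data.frame(
  pheno = names(tests),
  pval  = vapply(tests, function(x) x@pval, numeric(1)),
  odds  = vapply(tests, function(x) x@odds, numeric(1)),
  row.names = NULL
)

# A data frame of plain columns can be written as CSV.
out <- tempfile(fileext = ".csv")
write.csv(results, out, row.names = FALSE)
back <- read.csv(out)
```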

The CSV format is very simple: it is a text file storing "comma-separated values", where the values are all strings. Some of the strings will be converted to numbers if they are in the right format.
S4 objects are complicated structures that are not easy to store as strings.
So to put an S4 object into a CSV file, you need to convert it to one or more strings. You could use paste(deparse(x), collapse="") to convert x to a string that could be restored as an S4 object later, but that won't give access to the things stored in x. You'll need to use something like @jdobres's approach to extract values before storing them in a CSV file, and then you probably won't be able to restore the object from the file.
If you do need to restore the S4 objects, use saveRDS() on the list to store the complete list in an .rds file. It will be readable by R, but not by other software.
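A minimal sketch of that round trip, using a hypothetical S4 class (ToyResult) in place of the real test objects and a temporary file path:

```r
library(methods)

# Hypothetical S4 class standing in for the real objects.
setClass("ToyResult", representation(pval = "numeric"))
tests <- list(Phenotype1 = new("ToyResult", pval = 0.05))

# Save the whole list, S4 objects included, to a single .rds file.
path <- tempfile(fileext = ".rds")
saveRDS(tests, path)

# Read it back; the elements are S4 objects again.
restored <- readRDS(path)
```

Note that reading the file in a fresh session assumes the package defining the class is available there, so that methods for the restored objects can be found.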

Strings do not change but masquerade as changed

I wrote a function for wrangling strings. It converts non-English characters to English characters and performs a few other operations.
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
library(qdapRegex)
wrangle_string <- function(s) {
# 1 character substitutions
old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüýşğçıöüŞĞÇİÖÜ"
new1 <- "szyaaaaaaceeeeiiiidnooooouuuuysgciouSGCIOU"
s1 <- chartr(old1, new1, s)
# 2 character substitutions
old2 <- c("œ", "ß", "æ", "ø")
new2 <- c("oe", "ss", "ae", "oe")
s2 <- s1
for(i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)
s2
# other transformations
s2= gsub('[[:punct:] ]+',' ',s2)
s2=tolower(s2)
s2=trim(s2)
s2=rm_white(s2)
return(s2)
}
Here is my minimal data for reproduction:
outgoing=structure(list(source = structure(c(1L, 1L, 1L), .Label = "YÖNETIM KURULU BASKANLIGI", class = "factor"),
target = structure(c(2L, 1L, 3L), .Label = c("x Yayincilik Reklam ve Organizasyon Hizmetleri",
"Suat", "Yavuz"), class = "factor")), .Names = c("source",
"target"), row.names = c(NA, 3L), class = "data.frame")
When I call the function directly, it works:
wrangle_string("YÖNETİM KURULU BAŞKANLIĞI")
The result is:
"yonetim kurulu baskanligi"
When I apply it to a data frame column, it appears to work: checking the result with View(outgoing) shows no problem.
outgoing$source=as.vector(sapply(outgoing$source,wrangle_string))
However, when I check the cell with outgoing[1,1] I get this:
"yonetİm kurulu başkanliği"
How can I fix this problem?

With the help and guidance of MrFlick I found the answer. The problem stems from the locale settings: R was running with an English locale, but my data includes Turkish characters. To solve the problem I ran this command:
Sys.setlocale("LC_CTYPE", "turkish")
I also added the proper encoding parameter to the call that imports the CSV file:
outgoing <- read_delim("ebys_gidenevrak_rapor.csv", ";", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE,locale = locale(encoding = "utf-8"))
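Before changing settings, it can help to inspect what the session is actually doing. The following is only a diagnostic sketch in base R; the sample string is made up for illustration. Locale names are platform-specific: "turkish" is a Windows-style name, and other systems typically need a different name, so treat the exact value as an assumption to check on your machine.

```r
# Current character-type locale; the exact string depends on the platform.
Sys.getlocale("LC_CTYPE")

# TRUE if the session is running in a UTF-8 locale.
l10n_info()[["UTF-8"]]

# Strings written with \u escapes are marked as UTF-8 regardless of the
# source file encoding or the session locale.
x <- "Y\u00d6NET\u0130M"   # "YÖNETİM" with a dotted capital I

# Declared encoding of the string.
Encoding(x)

# Code points actually stored, useful for spotting mis-decoded characters.
utf8ToInt(x)
```

If Encoding() reports "unknown" or "latin1" for data read from a file, or the code points are not what you expect, the file was probably read with the wrong encoding, which is what the locale argument of read_delim addresses.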

Setting several values at once in a list r

This should be really simple. I am trying to build a list slightly more efficiently. Instead of having to write out:
list('1'= value1, '2' =value1, '3' = value1)
how can I condense this so that I simply list the numbers I want to be equal to value1, e.g. '1:4' =value1 or '1,2,3,4' =value1?
EDIT:
So, for background, I am currently trying to create custom formatting for an excel file using the xlsx package.
wb = createWorkbook()
sheet =createSheet(wb,sheetName = "TestFormatting")
dfcurrency = DataFormat("[$$-409]#,##0_ ;[Red]-[$$-409]#,##0 ")
dfdate = DataFormat("m/d/yyyy")
currency = CellStyle(wb, dataFormat = dfcurrency)
date = CellStyle(wb, dataFormat = dfdate)
datastyle = setNames(as.list(c(currency,date)),rep(c(3,4),c(1)))
data = addDataFrame(table,sheet, colStyle = datastyle)
This is what I am currently running, thanks to akrun's help. It gives the error:
Error in thisColStyle$ref : no field, method or inner class called 'ref'
And just in case it's useful, here is the data structure of table:
structure(list(workingdate = structure(c(1458518400, 1458604800,
1458691200, 1458777600, 1458864000, 1459119600), class = c("POSIXct",
"POSIXt"), tzone = ""), trader = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("a", "b", "c",
"d", "e"), class = "factor"), pnl.1d = c(3,
-573.7978, -107.1941, 1128.3061, -0.709699999999998, 3.55990000000003
), rt.1d.Util = c(0, -3.82531866666667e-05, -7.14627333333333e-06,
7.52204066666667e-05, -4.73133333333332e-08, 2.37326666666669e-07
)), .Names = c("workingdate", "trader", "pnl.1d", "rt.1d.Util"
), row.names = c(NA, 6L), class = "data.frame")

Here's a very general way to do similar things. This solution is likely more convoluted than the best solution, but it will work and can be extended to similar problems. It is based on eval and parse. parse turns a string into an unevaluated expression, eval evaluates it.
So, eval(parse(text="5+5")) will return 10.
If we can create the string "list('1'=value1, '2'=value1, '3'=value1)", we can then use eval(parse(text= to turn it into the list you want.
The following code will create the above string:
value1 <- 'asdf'
paste(
'list(', paste(sapply(seq_len(4),
function(n) { paste("'", n,"'", "=", "value1", sep="")}),
collapse = ","),
')')
So, combining everything, call
eval(parse(text=
paste(
'list(', paste(sapply(seq_len(4),
function(n) { paste("'", n,"'", "=", "value1", sep="")}),
collapse = ","),
')')))
And you get the list you want.
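For comparison, the same list can be built without eval and parse. This is an alternative sketch, not the approach described above: rep() repeats the value inside a list and setNames() attaches the numbers as names. The names value2, cols_a and cols_b are hypothetical.

```r
value1 <- "asdf"
value2 <- "qwer"   # hypothetical second value

# Names 1 to 4 all map to value1.
cols_a <- 1:4
same_value <- setNames(rep(list(value1), length(cols_a)), cols_a)

# Several groups can be concatenated into one named list.
cols_b <- c(6, 8)
combined <- c(
  same_value,
  setNames(rep(list(value2), length(cols_b)), cols_b)
)
```

Because the value is wrapped in list() before being repeated, this also works when the value is an arbitrary object rather than a string.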

Thanks to Julian's comment I was able to create a solution to this. I will accept Julian's comment as the answer but will give my own (less general) solution as an example. It basically applies his solution so as to create more customisability in an albeit very roundabout way:
#if no columns need a type of format enter 0
a =paste(sapply(list(c(
#enter column numbers formatted as currency eg. 1:5, 8, 10
3
)),
function(n) { paste("'", n,"'", "=", "currency", sep="")}))
b =paste(sapply(list(c(
#columns formatted as date
1
)),
function(n) { paste("'", n,"'", "=", "date", sep="")}))
You can continue in this fashion, using the same pattern for as many variables as you like. You can then combine them into one string ready to be parsed:
text = paste( 'list(',paste(c(a,b,c,d), collapse = ","),')')
datastyle = eval(parse(text = text))
where you simply enter all your formats or styles in a,b,c,d,...
Hopefully this will help someone who finds a similar problem.

How can I use purrr to limit rows where column element is not a list

I have a data.frame, df, in which one of the columns has entries that are either a character or a list.
I would like to use the purrr package, or other means, to eliminate the second row.
df <- structure(list(member_id = c("1715", "2186", "2187"), date_of_birth = list(
"1953-12-15T00:00:00", structure(list(`#xsi:nil` = "true",
`#xmlns:xsi` = "http://www.w3.org/2001/XMLSchema-instance"), .Names = c("#xsi:nil",
"#xmlns:xsi")), "1941-02-16T00:00:00")), .Names = c("member_id",
"date_of_birth"), row.names = c(1L, 8L, 9L), class = "data.frame")
TIA

If you are looking to drop any row whose date_of_birth field is of type list, the following should be a decent solution:
df[sapply(df$date_of_birth, function(x) typeof(x)!="list"),]
Edit:
Imo's comment should shorten the above solution as follows:
df[!sapply(df$date_of_birth, is.list),]
I hope this helps.

Here is a base R method using lengths and subsetting. Any entry in the date_of_birth column that has more than one element is dropped.
dfNew <- df[lengths(df$date_of_birth) < 2,]
which returns
dfNew
member_id date_of_birth
1 1715 1953-12-15T00:00:00
9 2187 1941-02-16T00:00:00
Note that dfNew$date_of_birth will still be of type list, which might cause problems down the line. You can fix this with unlist.
dfNew$date_of_birth <- unlist(dfNew$date_of_birth)
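The two steps can be combined in a short sketch. It rebuilds a simplified version of the question's data (the nested list is shortened for illustration) and uses vapply, which is like sapply but fails loudly if a result is not a single logical. With purrr loaded, purrr::map_lgl(df$date_of_birth, is.list) would give the same logical vector.

```r
# Simplified version of the question's data: row 2 holds a list, not a string.
df <- data.frame(member_id = c("1715", "2186", "2187"),
                 stringsAsFactors = FALSE)
df$date_of_birth <- list(
  "1953-12-15T00:00:00",
  list(nil = "true"),
  "1941-02-16T00:00:00"
)

# TRUE for rows whose date_of_birth entry is itself a list.
is_list_entry <- vapply(df$date_of_birth, is.list, logical(1))

# Keep the other rows, then flatten the column to a character vector.
dfNew <- df[!is_list_entry, ]
dfNew$date_of_birth <- unlist(dfNew$date_of_birth)
```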

R error : arguments imply differing number of rows

So I am trying to operate a function over a few columns of a data frame, using a for loop.
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)
data <- cbind(data[1:2], for(i in seq(3, 9)) {z(data[[i]])})
I keep running into the error as mentioned in the subject
arguments imply differing number of rows
The number of rows in all my columns are same.
I tried lapply for this. It works, but it converts the columns I apply the function to into factors. The columns hold numerical values but are read from the file as characters (they are stored as such). So when I convert them to numbers after using lapply, I get the level codes as output (1, 2, 3, ...).
Any suggestions, using either the for loop, or lapply are welcome. Thanks in advance.
> dput(head(data,3))
structure(list(MCF.Channel.Grouping = structure(c(6L, 6L, 6L), .Label = c("(Other)",
"Direct", "Display", "Email", "Organic Search", "Paid Search",
"Referral", "Social Network"), class = "factor"), Device.Category = structure(c(2L,
1L, 3L), .Label = c("desktop", "mobile", "tablet"), class = "factor"),
Spend = c("A$503,172.17", "A$375,940.43", "A$92,560.94"),
Clicks = c("1,545,416", "1,037,740", "291,314"), Impressions = c("7,328,657",
"3,787,612", "1,178,508"), Data.Driven.Conversions = c("1,697,814.32",
"1,540,810.43", "430,738.63"), Data.Driven.CPA = c("A$0.30",
"A$0.24", "A$0.21"), Data.Driven.Conversion.Value = c("A$12,815,842.66",
"A$13,883,073.58", "A$3,804,800.15"), Data.Driven.ROAS = c("2547.01%",
"3692.89%", "4110.59%")), .Names = c("MCF.Channel.Grouping",
"Device.Category", "Spend", "Clicks", "Impressions", "Data.Driven.Conversions",
"Data.Driven.CPA", "Data.Driven.Conversion.Value", "Data.Driven.ROAS"
), row.names = c(NA, 3L), class = "data.frame")

We can use
data[-(1:2)] <- lapply(data[-(1:2)], z)
The function is run on columns that are not the first or second. The output is assigned to the same subset in the data.
The original method did not work because the for loop does not result in saved output. Check by trying to save it as a variable:
x <- for(i in seq(3, 9)) {z(data[[i]])}
x
NULL
Even though we assigned the loop to a variable, nothing was captured: a for loop evaluates its body on each iteration but always returns NULL. To make a loop work, assign the values inside it:
for ( i in 3:9) data[,i] <- z(data[,i])
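To get actual numbers rather than cleaned strings or factor codes, the cleaned text still has to be converted. The sketch below uses a cut-down copy of the question's data (three rows, two value columns) and wraps z in as.numeric; the as.character call guards against the column being a factor, since as.numeric on a factor returns the level codes.

```r
# Strip everything except digits and the decimal point.
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)

# Cut-down copy of the question's data.
data <- data.frame(
  Device.Category = c("mobile", "desktop", "tablet"),
  Spend  = c("A$503,172.17", "A$375,940.43", "A$92,560.94"),
  Clicks = c("1,545,416", "1,037,740", "291,314"),
  stringsAsFactors = FALSE
)

# Clean and convert only the value columns, assigning back in place.
num_cols <- c("Spend", "Clicks")
data[num_cols] <- lapply(
  data[num_cols],
  function(col) as.numeric(z(as.character(col)))
)
```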
