So I want to find patterns and "clusters" based on which items are bought together. According to the wiki for Eclat:
The Eclat algorithm is used to perform itemset mining. Itemset mining lets us find frequent patterns in data, such as: if a consumer buys milk, he also buys bread. This type of pattern is called an association rule and is used in many application domains.
However, when I use eclat in R, I get "zero frequent items" and NULL when retrieving the results through tidLists. Can anyone see what I am doing wrong?
The full dataset: https://pastebin.com/8GbjnHK2
Each row is a transaction, containing different items in the columns. Quick snapshot of the data:
3060615;;;;;;;;;;;;;;;
3060612;3060616;;;;;;;;;;;;;;
3020703;;;;;;;;;;;;;;;
3002469;;;;;;;;;;;;;;;
3062800;;;;;;;;;;;;;;;
3061943;3061965;;;;;;;;;;;;;;
The code
library(arules)   # provides read.transactions(), eclat(), tidLists()
trans = read.transactions("Transactions.csv", format = "basket", sep = ";")
f <- eclat(trans, parameter = list(supp = 0.1, maxlen = 17, tidLists = TRUE))
dim(tidLists(f))
as(tidLists(f), "list")
Could it be due to the data structure? In that case, how should I change it? Furthermore, what do I do to get the suggested itemsets? I couldn't figure that out from the wiki.
EDIT: I used 0.004 for supp, as suggested by @hpesoj626. But it seems like the function is grouping the orders/users and not the items. I don't know how to export the data, so here is a picture of the tidLists:
The problem is that you have set your support too high. Try lowering supp, say to supp = .001, for which we get
dim(tidLists(f))
# [1] 928 15840
For your data set, the highest support is 0.08239 which is below 0.1. That is why you are getting no results with supp = 0.1.
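You can confirm this ceiling directly: arules reports per-item supports via itemFrequency(), and no itemset can be more frequent than its most frequent single item. A minimal check, assuming trans is the transactions object read above:
item_supp <- itemFrequency(trans)   # relative support of each single item
max(item_supp)
# [1] 0.08239
The top itemsets by support confirm this: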
inspect(head(sort(f, by = "support"), 10))
# items support count
# [1] {3060620} 0.08239 1305
# [2] {3060619} 0.07260 1150
# [3] {3061124} 0.05688 901
# [4] {3060618} 0.05663 897
# [5] {4027039} 0.04975 788
# [6] {3060617} 0.04564 723
# [7] {3061697} 0.04306 682
# [8] {3060619,3060620} 0.03087 489
# [9] {3039715} 0.02727 432
# [10] {3045117} 0.02708 429
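To go from the frequent itemsets to the association rules the question asks about, one option is arules' ruleInduction(); a minimal sketch, assuming f and trans from above (the 0.5 confidence threshold is just an illustrative choice):
rules <- ruleInduction(f, trans, confidence = 0.5)
inspect(head(sort(rules, by = "confidence"), 10))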
Related
Struggling with data loss when using read.table in R
I downloaded the entire World Checklist of Vascular Plants database, version 9:
http://sftp.kew.org/pub/data-repositories/WCVP/
I unzipped the file to get wcvp_v9_jun_2022.txt. If you use Ctrl+F to search the text file for "Corymbia", you will find many rows of data where genus=="Corymbia"; the same is true for genus=="Eucalyptus" and genus=="Angophora".
I imported it into RStudio with the following line:
WCVP <- read.table("wcvp_v9_jun_2022.txt", sep = "|", fill = TRUE, header = TRUE)
and checked for the data:
WCVP[WCVP$genus=="Corymbia",]
WCVP[WCVP$genus=="Eucalyptus",]
WCVP[WCVP$genus=="Angophora",]
I got the response
WCVP[WCVP$genus=="Corymbia",]
[1] kew_id family genus species
[5] infraspecies taxon_name authors rank
[9] taxonomic_status accepted_kew_id accepted_name accepted_authors
[13] parent_kew_id parent_name parent_authors reviewed
[17] publication original_name_id
<0 rows> (or 0-length row.names)
Meanwhile, the data for the other two genera are intact, and R returns rows of data.
Why is the data for the genus Corymbia missing after the .txt is imported into RStudio? Is this a bug, and how do I troubleshoot it?
Many thanks
How to troubleshoot:
1. Count the number of lines in the database file and compare to the number of rows in WCVP; a quick check is sketched after this list. If they are the same (or off by one, because of the header row), then you have the data, but it is messed up somehow. If you have a lot fewer rows than lines, see step 3.
2. What line number is "Corymbia" on in the text file? What is on that line in WCVP?
3. If lines are missing, find the first missing line by comparing line n of the text file to row n-1 of the data frame. Start with a small n and increase it until you find something wrong, then zero in on the first bad line. What is special about that line? A likely cause is that the formatting isn't what you expect, e.g. missing or extra delimiters.
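A minimal sketch of step 1 in base R, assuming the file fits in memory (count.fields() additionally reveals rows whose delimiter count is off):
raw <- readLines("wcvp_v9_jun_2022.txt")
length(raw) - 1   # data lines, excluding the header
nrow(WCVP)        # rows that actually made it into R

# distribution of fields per line; more than one value means ragged rows
table(count.fields("wcvp_v9_jun_2022.txt", sep = "|", quote = ""))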
There are embedded single-quotes (singles, not always paired) in the data that are throwing off reading it in. Set quote="" and you should see all the data.
WCVP <- read.table("wcvp_v9_jun_2022.txt",
sep = "|", fill = TRUE, header = TRUE)
nrow(WCVP)
# [1] 605649
WCVP[WCVP$genus=="Corymbia",]
# [1] kew_id family genus species infraspecies taxon_name authors rank taxonomic_status accepted_kew_id accepted_name accepted_authors parent_kew_id
# [14] parent_name parent_authors reviewed publication original_name_id
# <0 rows> (or 0-length row.names)
WCVP <- read.table("wcvp_v9_jun_2022.txt",
sep = "|", fill = TRUE, header = TRUE, quote = "")
nrow(WCVP)
# [1] 1232931 ## DIFFERENT!
head(WCVP[WCVP$genus=="Corymbia",], 3)
# kew_id family genus species infraspecies taxon_name authors rank taxonomic_status accepted_kew_id accepted_name accepted_authors parent_kew_id parent_name
# 758307 986238-1 Myrtaceae Corymbia Corymbia K.D.Hill & L.A.S.Johnson GENUS Accepted
# 758308 986307-1 Myrtaceae Corymbia abbreviata Corymbia abbreviata (Blakely & Jacobs) K.D.Hill & L.A.S.Johnson SPECIES Accepted 986238-1 Corymbia
# 758309 986248-1 Myrtaceae Corymbia abergiana Corymbia abergiana (F.Muell.) K.D.Hill & L.A.S.Johnson SPECIES Accepted 986238-1 Corymbia
# parent_authors reviewed publication original_name_id
# 758307 Reviewed Telopea 6: 214 (1995)
# 758308 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 344 (1995) 592646-1
# 758309 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 244 (1995) 592647-1
I want to create a list of column names that contain the word "arrest" AND their associated index numbers. I do not want all the columns, so I DO NOT want to subset the arrest columns into a new data frame. I merely want to see the list of names and their index numbers so I can delete the ones I don't want from the original data frame.
I tried to get the column names and their associated index numbers with the code below, but each attempt gave me only one or the other.
This gives me their names only
colnames(x2009_2014)[grepl("arrest",colnames(x2009_2014))]
[1] "poss_cannabis_tot_arrests" "poss_drug_total_tot_arrests"
[3] "poss_heroin_coke_tot_arrests" "poss_other_drug_tot_arrests"
[5] "poss_synth_narc_tot_arrests" "sale_cannabis_tot_arrests"
[7] "sale_drug_total_tot_arrests" "sale_heroin_coke_tot_arrests"
[9] "sale_other_drug_tot_arrests" "sale_synth_narc_tot_arrests"
[11] "total_drug_tot_arrests"
This gives me their index numbers only
grep("county", colnames(x2009_2014))
[1] 93 168 243 318 393 468 543 618 693 768 843
But I want their name AND index number so that it looks something like this
[93] "poss_cannabis_tot_arrests"
[168] "poss_drug_total_tot_arrests"
[243] "poss_heroin_coke_tot_arrests"
[318] "poss_other_drug_tot_arrests"
[393] "poss_synth_narc_tot_arrests"
[468] "sale_cannabis_tot_arrests"
[543] "sale_drug_total_tot_arrests"
[618] "sale_heroin_coke_tot_arrests"
[693] "sale_other_drug_tot_arrests"
[768] "sale_synth_narc_tot_arrests"
[843] "total_drug_tot_arrests"
Lastly, following the advice here, I used the code below, but it did not work:
K=sapply(x2009_2014,function(x)any(grepl("arrest",x)))
which(K)
named integer(0)
The person who provided the advice in the above link used
K=sapply(df,function(x)any(grepl("\\D+",x)))
names(df)[K]
Zo.A Zo.B
which(K)
Zo.A Zo.B
2 4
I'd prefer the list I showed in the third block of code, but the code this person used provides a structure I can work with. It just did not work for me when I tried using it.
(The sapply attempt returns named integer(0) because it searches the column values for "arrest", not the column names.) Hacky as a one-liner, because I really dislike using <- inside a function call, but this should work:
setNames(
nm = matches <- grep("arrest", colnames(x2009_2014)),
colnames(x2009_2014)[matches]
)
Reproducible example:
setNames(nm = x <- grep("b|c", letters), letters[x])
# 2 3
# "b" "c"
Or write your own function that does it. Here I return the result as a data frame, which seems nicer than a named vector:
grep_ind_value = function(pattern, x, ...) {
  index = grep(pattern, x, ...)   # grep(pattern, x), not grep(x, pattern)
  value = x[index]
  data.frame(index, value)
}
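For example, on the built-in letters vector (the original data frame isn't available here):
grep_ind_value("b|c", letters)
#   index value
# 1     2     b
# 2     3     c
For the question's data, that would be grep_ind_value("arrest", colnames(x2009_2014)).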
I would like to split an object in R according to the suffixes of the barcodes it contains. These end in '-n', where n is a number from 1 to 6, e.g. AAACCGTGCCCTCA-1, GAACCGTGCCCTCA-2, CATGCGTGCCCTCA-5, etc. I would like all the corresponding information about each barcode to be split accordingly as well. Here is some example code of such an object, cds.
class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
split(cds, cds$barcode)
#not by individual barcodes, but by groups of those ending '-1', '-2',...,'-6'. So 6 new objects in total
Many thanks!
Abigail
split() does not work here because you need to subset based on the columns, and I am not sure a split method is defined for this class. You can try the following.
First, to get something like your example:
library(monocle)
library(HSMMSingleCell)
library(Biostrings)
cds = load_HSMM()
class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
dim(cds)
Features Samples
47192 271
And to create a barcode for every sample:
# fake 5-mer barcodes with a random "-1" ... "-6" suffix per sample
bar = paste(names(oligonucleotideFrequency(DNAString("NNNNN"), 5))[1:ncol(cds)],
            sample(1:6, ncol(cds), replace = TRUE), sep = "-")
head(bar)
[1] "AAAAA-3" "AAAAC-6" "AAAAG-5" "AAAAT-1" "AAACA-5" "AAACC-5"
Now we get the group, which is the suffix 1-6:
cds$barcodes = bar
grp = sub("[A-Z]*[-]", "", cds$barcodes)   # strip the barcode, keep the suffix
To get one subset, for example those with "-1", you can just do:
group1 = cds[,grp==1]
dim(group1)
Features Samples
47192 46
head(group1$barcodes)
[1] "AAAAT-1" "AACGA-1" "AAGCG-1" "AAGGG-1" "AAGTA-1" "AATAG-1"
To get your 6 groups, you can do the following, but check whether your machine has the memory to accommodate this!
subset_obj = lapply(unique(grp),function(i){
cds[,grp==i]
})
names(subset_obj) = unique(grp)
We can use sub to extract the trailing -\\d+ suffix and split the 'cds' based on that (removing the suffix instead would split by the barcode itself):
split(cds, sub(".*-(\\d+)$", "\\1", cds$barcode))
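As a quick sanity check of the suffix regex on a plain character vector (standalone base R; barcodes made up from the question's examples):
barcodes <- c("AAACCGTGCCCTCA-1", "GAACCGTGCCCTCA-2", "CATGCGTGCCCTCA-5")
split(barcodes, sub(".*-(\\d+)$", "\\1", barcodes))
# $`1`
# [1] "AAACCGTGCCCTCA-1"
#
# $`2`
# [1] "GAACCGTGCCCTCA-2"
#
# $`5`
# [1] "CATGCGTGCCCTCA-5"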
I have a dataset with a "Notes" column, which I'm trying to clean up with R. The notes look something like this:
Collected for 2 man-hours total. Cloudy, imminent storms.
Collected for 2 man-hours total. Rainy.
Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.
...and so on.
I want to remove all sentences that start with "Collected", but not any of the sentences that follow. The number of sentences that follow varies, e.g. from 0 to 4. I was trying to remove all combinations of "Collected" + (last word of the sentence), but there are too many combinations. Removing "Collected" + [.] removes all the subsequent sentences as well. Does anyone have any suggestions? Thank you in advance.
One option using gsub:
gsub("^Collected[^.]*\\. ","",df$Notes)
# [1] "Cloudy, imminent storms."
# [2] "Rainy."
# [3] "Sunny."
Regex explanation:
- `^Collected` : Starts with `Collected`
- `[^.]*` : Followed by anything other than `.`
- `\\. ` : Ends with `.` and `space`.
Replace such matches with "".
Data:
df<-read.table(text=
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)
a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))
> [1] "Sunny."
Or if you know that there will be a space after the period:
sub("Collected.*?\\. ","",a)
Twitter recently expanded the character limit of a tweet to 280 characters.
Since then, the twitteR package only retrieves (or only shows, I'm not sure which) the initial 140 characters of an extended tweet.
# load package
library(twitteR)
# set oauth
setup_twitter_oauth(Consumer_Key,Consumer_Secret,Access_Token,Access_Token_Secret)
# get user timeline
k<-userTimeline("SenateFloor", n = 50, includeRts = T)
# to data frame
k<-twListToDF(k)
# print tweet text
print(k$text[1:5])
Console output
[1] "#Senate in at 4:00 PM. Following Leader remarks, will proceed to Executive Session & resume consideration of Cal. #… https:// t.co/BpcPa15Twp"
[2] "RT #GovTop: Weekly Digest of the #CongressionalRecord https:// t.co/vuH71y8FpH"
[3] "#HJRes123 ( Making further continuing appropriations for fiscal year 2018). The Joint Resolution was agreed to by a… https:// t.co/bquyMPPhhm"
[4] "#HJRes123 ( Making further continuing appropriations for fiscal year 2018). https:// t.co/SOmYJ3Dv4t"
[5] "Cal. #167, Susan Bodine to be Assistant Administrator of the Environmental Protection Agency. The nomination was co… https:// t.co/pW7qphwloh"
As you can see, an ellipsis (…) cuts off the tweets that exceed the 140-character limit.
nchar(k$text[1:5])
[1] 144 77 140 99 140
Is there any way to get the whole text of these extended tweets?
As noted in the comment, just use rtweet:
library(rtweet)
library(tidyverse)
sen_df <- get_timeline("SenateFloor", 300)
mutate(sen_df, `Tweet Length`=map_dbl(text, nchar)) %>%
ggplot(aes(`Tweet Length`)) +
ggalt::geom_bkde(color="steelblue", fill="steelblue", alpha=2/3) +
scale_y_continuous(expand=c(0,0)) +
labs(title="#SenateFloor Tweet Length Distribution") +
hrbrthemes::theme_ipsum_rc(grid="XY")
If you would like to continue using twitteR, you could try this:
# get user timeline
k<-userTimeline("SenateFloor", n = 50, includeRts = T, tweet_mode = "extended")