I am trying to mine sequences in R via the cspade() implementation in the arulesSequences package.
My data frame looks like this:
items sequenceId eventId size
    A          1       1    1
    B          2       1    1
    C          2       2    1
    A          3       1    1
This data frame was created from an existing data set via the following code (removing unnecessary columns and creating the sequences):
library(dplyr)
library(arulesSequences)

data %>%
  select(seqId, sequence, items) %>%
  group_by(seqId) %>%
  mutate(basketSize = 1, sequence = rank(sequence)) %>%
  ungroup() %>%
  mutate(seqId = ordered(seqId), sequence = ordered(sequence)) %>%
  write.table("data.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)

data <- read_baskets("data.txt", info = c("sequenceID", "eventID", "size"))
as(data, "data.frame") # shows the data frame above!
So far so good!
However, when I try:
s1 <- cspade(data, parameter = list(support = 0.4), control = list(verbose = TRUE))
I get the following error:
Error in makebin(data, file) : 'eid' invalid (strict order)
I have read elsewhere that this is because cspade needs the event and sequence IDs to be ordered. But how do I specify this? Clearly making the factors ordered before exporting them to ".txt" does not work.
Edit:
Some further details
Just to explain the code that creates the data input for cspade a bit more: originally the sequence variable had some missing steps (e.g. 1, 3, 4 for some sequences) because I had filtered out some events, so I ran a rank function on it to reindex the events per sequence. The size column is totally unnecessary (it is constant) but was included in the sample code in the arulesSequences documentation, which is why I included it too.
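For what it's worth, a minimal sketch of one way to satisfy the ordering requirement (assuming the column names from the pipeline above): leave the IDs as plain integers rather than ordered factors, make the event index strictly increasing within each sequence, and sort the rows before writing. Note that rank() with its default ties.method = "average" produces tied (and fractional) event ids whenever two events share a sequence value, which is exactly what the "strict order" check rejects; row_number() avoids that.

library(dplyr)
library(arulesSequences)

data %>%
  select(seqId, sequence, items) %>%
  group_by(seqId) %>%
  # row_number() yields 1, 2, 3, ... within each sequence: strictly
  # increasing integers with no ties
  mutate(basketSize = 1, sequence = row_number(sequence)) %>%
  ungroup() %>%
  arrange(seqId, sequence) %>% # cspade expects the file sorted this way
  write.table("data.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)

data <- read_baskets("data.txt", info = c("sequenceID", "eventID", "size"))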
Related
I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10), and I am interested in comparing the two. The issue seems to be that my StringTie files are separate: one for the parental cell line and one for B10. I've included the code I have written to date for context, along with the error messages I received and the troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I can get. This isn't something anyone in my lab has done before, so I'm struggling to do it without any guidance.
Thank you all in advance!
# My code to go from StringTie to count data:
(I copy-pasted this, so all my notes are included. I'm new to R, so the comments are really just for me; I'm not trying to condescendingly explain every bit of the code. You all likely know much more than I do.)
# Open Data
library(readr)
library(dplyr)
library(tximport)
library(DESeq2)
# List StringTie output files for all samples
# All files should be in the same directory
files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)
tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)
tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)
# Create a filter (vector) showing which rows have more than 5 counts in at least two columns
txi_B10.filter <- apply(txi_B10$counts, 1, function(x) length(x[x > 5]) >= 2)
txi_parental.filter <- apply(txi_parental$counts, 1, function(x) length(x[x > 5]) >= 2)
head(txi_parental.filter)
sum(txi_B10.filter)
# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the filter value is TRUE
txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]
txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]
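# (Added sanity check, not part of the original script: all three components
# should keep the same, reduced number of rows after filtering)
stopifnot(nrow(txi_B10$counts) == nrow(txi_B10$abundance),
          nrow(txi_B10$counts) == nrow(txi_B10$length))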
# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")
write.csv(txi_parental$counts, "txi_parental.counts.csv")
# Open count data
# Do this in the order the files are organized in the file manager
txi_B10_counts <- read_csv("txi_B10.counts.csv")
txi_parental_counts <- read_csv("txi_parental.counts.csv")
# Set column names
colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
View(txi_B10_counts)
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
View(txi_parental_counts)
## R is case sensitive, so you want to ensure that everything is in the same case
## convert Gene names which is column [[1]] into lowercase
txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
View(txi_parental_counts)
txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
View(txi_B10_counts)
## Capitalize the first letter of each gene name
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
View(txi_parental_counts)
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
View(txi_B10_counts)
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html
mergedCounts <- full_join(x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
View(mergedCounts)
# Replace NA with value = 0
mergedCounts[is.na(mergedCounts)] = 0
View(mergedCounts)
# Save file for merged counts
write.csv(mergedCounts, "MergedCounts.csv")
## --------------------------------------------------------------------------------
# My code to go from count data to DEseq2
# Import data
# I added my metadata in case the issue is how I set up the columns
# metaData is a file with your sample names and Comparison
# Your second column in metaData must be called Comparison, otherwise you'll get an error in the dds line
metaData <- read.csv('metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")
# Assign "Gene Names" as row names
# Notice how there's suddenly an extra row (x)?
# R automatically created and assigned column x as row names
# If you don't fix this the # of columns won't add up
rownames(countData) <- countData[,1]
countData <- countData[,-1]
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!!
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = metaData,
                              design = ~ Comparison, tidy = TRUE)
# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers
## --------------------------------------------------------------------------------
# How I tried to fix this:
# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code
mergedCounts[is.na(mergedCounts)] = 0
View(mergedCounts)
# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead
dds <- DESeqDataSetFromTximport(countData,
                                colData = metaData,
                                design = ~ Comparison, tidy = TRUE)
# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison, : is(txi, "list") is not TRUE
# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I were interested in comparing n1 versus n2 for each cell line, but that is not of interest to me
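For what it's worth, a minimal sketch of the approach your last comments point toward (this is an assumption, not your original code): run tximport once over all four StringTie files so both cell lines end up in a single txi list, then pass that list, not a merged CSV, to DESeqDataSetFromTximport, which also handles StringTie's non-integer estimates for you. The sample names and Comparison levels below are hypothetical; adjust them to your files.

library(tximport)
library(DESeq2)

# one vector with all samples; the names become the count matrix column names
files <- c(files_parental, files_B10)
names(files) <- c("parental_n1", "parental_n2", "B10_n1", "B10_n2")

# reusing the parental tx2gene table, assuming both lines share the same annotation
txi <- tximport(files, type = "stringtie", tx2gene = tx2gene_parental)

# colData rows must line up with the columns of txi$counts
metaData <- data.frame(row.names = names(files),
                       Comparison = factor(c("parental", "parental", "B10", "B10")))

dds <- DESeqDataSetFromTximport(txi, colData = metaData, design = ~ Comparison)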
Short version: when executing the following command qtm(countries, "freq") I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "SHAPE_AREAS", value = c(652270.070308042, :
  replacement has 177 rows, data has 210
Disclaimer: I have already checked other answers like this one or this one, as well as this explanation stating that this error usually comes from misspelling objects, but I could not find an answer to my problem.
Reproducible code:
library(rgdal)
library(dplyr)
library(tmap)
# Load JSON file with countries.
countries = readOGR(dsn = "https://gist.githubusercontent.com/ccamara/fc26d8bb7e777488b446fbaad1e6ea63/raw/a6f69b6c3b4a75b02858e966b9d36c85982cbd32/countries.geojson")
# Load dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/fc26d8bb7e777488b446fbaad1e6ea63/raw/754ea37e4aba1b7ed88eaebd2c75fd4afcc54c51/sample-dataframe.csv")
countries@data = left_join(countries@data, df, by = c("iso_a2" = "country_code"))
qtm(countries, "freq")
Your error is in the data: the code works fine.
What is happening is:
1) you are attempting a 1:1 match,
2) but your .csv data contains several rows per id to match,
3) so the left join multiplies each left-hand row by all of its matches on the right-hand side.
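You can confirm the duplication before joining:

# TRUE means df has several rows per country_code, so the join will grow the data
any(duplicated(df$country_code))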
To avoid this issue you have to aggregate your data one more time, like:
library(dplyr)
df_unique = df %>%
  group_by(country_code, country_name) %>%
  summarize(total = sum(total), freq = sum(freq))

# after that you should be fine, as long as just adding up the data is okay
countries@data = left_join(countries@data, df_unique,
                           by = c("iso_a2" = "country_code"))
qtm(countries, "freq")
I'm dealing with a couple of txt files containing climatological data, where each chunk of data is identified by 3 parameters (the parameter measured, the station of measurement, and the year). Each file has more than a million lines. In the past I manually selected each parameter, one at a time, for a given station and year, and read it into R using read.fwf; with files of this size that is absurd and inefficient. Is there any way to automate this process? Note that the file has "FF" as an indicator every time a new parameter for a station and a given year starts, and I want to generate separate files or datasets named after the station, year and parameter, so I can use them afterwards.
File format to read (described from a screenshot):
Circled in red is the "FF", which I guess marks the start of a new set of records.
Circled in black is the name of the parameter measured (there are 8 different parameter classes in total).
Circled in blue is the year of measurement.
Circled in green is the identifier of the station of measurement.
In the past I read just what I needed with read.fwf, given the fixed width of the data, but that separation does not apply to the header of each table.
PRUEBA3 <- read.fwf("SanIgnacio_Pmax24h.txt", header = FALSE,
                    widths = c(5,4,4,6,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,10,2),
                    skip = 1)
Thanks; any help will be appreciated.
You will need to write a function that loops through the txt files. (The output that you linked to was produced by a database; I assume you don't have access to it.)
Here is what the function could look like, using the fast fread from data.table and a foreach loop (you can make the loop parallel by registering a parallel backend and changing %do% into %dopar%):
library(data.table)
library(foreach)
myfiles = dir(pattern = ".txt$")
res = foreach(i = seq_along(myfiles)) %do% {
  x = fread(myfiles[i], na.strings = c("", " "))
  # get row indices for start and end dates
  # the "V" variables are column indices, I assume these don't change per file
  start.dia = x[, grep("DIA", V2)] + 2
  end.dia = x[, grep("MEDIA", V2)] - 2
  # get name of station
  estacion.detect = x[, grep("ESTACION", V9)]
  estacion.name = x[estacion.detect, V10]
  mydf = x[start.dia : end.dia]
  mydf[, estacion := estacion.name]
  # remove empty rows and columns
  junkcol = which(colSums(is.na(mydf)) == nrow(mydf))
  junkrow = which(rowSums(is.na(mydf)) == ncol(mydf))
  if (length(junkcol) > 0) {
    mydf = mydf[, !junkcol, with = FALSE]
  }
  if (length(junkrow) > 0) {
    mydf = mydf[!junkrow, ]
  }
  # further data cleaning, then return the cleaned table
  mydf
}
# bind all files
all = rbindlist(res)
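Since you also want the output named after station, year and parameter, here is a hypothetical sketch; it assumes you extend the cleaning step to also store ano (year) and parametro (parameter) columns, extracted the same way as estacion above:

# write one CSV per station/year/parameter combination
# ('ano' and 'parametro' are assumed columns added during cleaning)
for (d in res) {
  fwrite(d, sprintf("%s_%s_%s.csv", d$estacion[1], d$ano[1], d$parametro[1]))
}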
I'm converting a local R script to make use of the RevoScaleR functions in the Revolution R (aka Microsoft R Client/Server) package, in order to scale better with large amounts of data.
The goal is to create a new column that numbers the rows per group. Using data.table this would be achieved using the following code:
library(data.table)
eventlog[,ActivityNumber := seq(from=1, to=.N, by=1), by=Case.ID]
For illustration purposes, the output is something like this:
  Case.ID ActivityNumber
1       A              1
2       A              2
3       B              1
4       C              1
5       C              2
6       C              3
After some research into doing this with the rx functions I found the package dplyrXdf, which is basically a wrapper for using dplyr functions on Xdf-stored data while still benefiting from the optimized functions of RevoScaleR (see http://blog.revolutionanalytics.com/2015/10/using-the-dplyrxdf-package.html).
In my case, this would lead to the following:
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_len(n()))
However, this leads to the following error:
ERROR: Attempting to add a variable without a name to an analysis.
Caught exception in file: CxAnalysis.cpp, line: 3756. ThreadID: 1248 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5249. ThreadID: 1248 Rethrowing.
Error in doTryCatch(return(expr), name, parentenv, handler) :
Error in executing R code: ERROR: Attempting to add a variable without a name to an analysis.
Any ideas how to solve this error? Or other (better?) approaches to get the requested result?
Thanks to @Matt-parker for pointing me to this question.
Note that n() is not a regular R function, although it looks like one. It needs to be implemented specially for each data source, and maybe also separately for each of mutate, summarise and filter.
Right now, the only usage of n that is supported for xdf files is within summarise, to count the number of rows. Implementing it for the other verbs is actually nontrivial.
In particular, there is a problem with Matt's use of seq_along to implement n's functionality. Remember that xdf files are block-structured: each chunk of rows is read in and processed independently of other chunks. This means that the sequence generated is for that chunk of rows only, and not for all the rows in a group. If a group spans more than one chunk, the sequence numbers will restart in the middle.
The way to get correct sequence numbers is to keep a running count of how many rows you've read in for that group, and update it each time a chunk is processed. You can do this with a transformFunc, which you pass to transmute via the .rxArgs argument:
ev <- eventlog %>% group_by(Case.ID) %>% transmute(.rxArgs = list(
    transformFunc = function(varList) {
        n <- .n + seq_along(varList[[1]])
        if(!.rxIsTestChunk)  # need this b/c rxDataStep does a test run on the 1st 10 rows
            .n <<- n[length(n)]
        list(n = n)
    },
    transformObjects = list(.n = 0)))
This should work with the local, localpar and foreach compute contexts. It may not work (or at least won't give a reproducible result) with any context where you can't guarantee that rxDataStep will process the rows in a deterministic order -- so Mapreduce, Spark, Teradata or similar.
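As a quick check, you can pull the result back into memory (rxDataStep with no outFile returns a data frame):

head(rxDataStep(ev))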
I'm not sure why this works, but try using seq_along(Case.ID) instead of seq_len(n()):
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_along(Case.ID))
It seems to be some problem with n(). Here's my exploratory code, in case anyone else wants to experiment:
options(stringsAsFactors = FALSE)
library(dplyrXdf)
# Set up some test data
eventlog_df <- data.frame(Case.ID = c("A", "A", "A", "A", "A", "B", "C", "C", "C"))
# Add a variable for artificially splitting the XDF into small chunks
eventlog_df$Chunk.ID <- factor((seq_len(nrow(eventlog_df)) + 2) %/% 3)
# Check the results
eventlog_df
# Now read it into an XDF file. I'm going to read just three rows in at a time
# so that the XDF file has several chunks, so we can be confident this works
# across chunks
eventlog <- tempfile(fileext = ".xdf")
for(i in 1:3) {
  rxImport(inData = eventlog_df[eventlog_df$Chunk.ID %in% i, ],
           outFile = eventlog,
           colInfo = list(Case.ID = list(type = "factor",
                                         levels = c("A", "B", "C"))),
           append = file.exists(eventlog))
}
# Convert to a proper data source
eventlog <- RxXdfData(eventlog)
rxGetInfo(eventlog, getVarInfo = TRUE, numRows = 10)
# Now to dplyr. First, let's make sure it can count up the records
# in each group without any trouble.
result <- eventlog %>%
group_by(Case.ID) %>%
summarise(ActivityNumber = n())
# It can:
rxDataStep(result)
# Now if we switch to mutate, does n() still work?
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = n())
# No - and it seems to be complaining about missing variables. So what if
# we try to refer to a variable we *know* exists?
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_along(Case.ID))
# It works
rxDataStep(result)
dplyr and dplyrXdf have a tally method that counts items per group:
result <- eventlog %>%
group_by(Case.ID) %>%
tally()
If you want to do more than just tabulate the records per group, you can use summarize (since you didn't show your data, I'm using a hypothetical column called delay, which I'm assuming is numeric for illustrative purposes):
result <- eventlog %>%
group_by(Case.ID) %>%
summarize(counts = n(),
ave_delay = mean(delay))
You could do the above with regular RevoScaleR functions,
rxCrossTabs(~ Case.ID, data = eventlog)
and for the second example:
rxCube(delay ~ Case.ID, data = eventlog)
I've successfully imported my data into R as transactions, but when I try targeting a specific website, I get this error:
Error in asMethod(object) : FACEBOOK.COM is an unknown item label
Is there any reason why this could be happening? Here is a snippet of code:
target.conf80 = apriori(trans,
                        parameter = list(supp = .002, conf = .8),
                        appearance = list(default = "lhs", rhs = "FACEBOOK.COM"),
                        control = list(verbose = FALSE))
target.conf80 = sort(target.conf80, decreasing = TRUE, by = "confidence")
inspect(target.conf80[1:10])
Thanks!
Here is what the transactions look like:
1 {V1=Google,
V2=Google Web Search,
V3=FACEBOOK.COM} 1
2 {V1=FACEBOOK.COM,
V2=MCAFEE.COM,
V3=7EER.NET,
V4=Google} 2
3 {V1=MCAFEE.COM,
The problem is the way you read/convert the data to transactions. The transactions should look like:
1 {Google,
Google Web Search,
FACEBOOK.COM} 1
2 {FACEBOOK.COM,
MCAFEE.COM,
7EER.NET,
Google} 2
3 {MCAFEE.COM,
...
Without the V1, V2, etc. In your transactions V1=Google and V4=Google are different items.
Error as(data, 'transactions') From Data Frames
I'm assuming that the dataset was transformed as follows: data <- as(data, 'transactions'). If you run that code without first performing some manipulations on your data, you will get those V1, V2, ....
Cleaning Data Before Transactions
I want to show how to manipulate the data so it is ready for read.transactions(). After importing your data into R, convert your data frame to a matrix, like so: d.matrix <- as.matrix(df). Then eliminate the headers, if it happens that you have them: colnames(d.matrix) <- NULL. Now you don't have headers. After that you want to...
write.table(x = d.matrix,
file = 'clean_data.csv',
sep = ',',
col.names = FALSE,
row.names = FALSE)
Finally, you want to import the data as transactions, like so...
data <- read.transactions('clean_data.csv',
format = 'basket',
sep = ',',
rm.duplicates = TRUE)
Now you have a dataset with no V1, V2, V3, ... and no row IDs.
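Once re-imported, a quick way to confirm the labels are what apriori() expects is to inspect them directly; the rhs item must match one of these strings exactly:

"FACEBOOK.COM" %in% itemLabels(data) # should be TRUE before targeting it
head(itemLabels(data))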