Preparing data for arulesSequences in R

I'm struggling with the input format for arulesSequences in R.
My data (let's call the data frame df) looks like this:
  sequenceID eventID SIZE    event
1          1       1    1   E_351-
2          1       2    1       1-
3          2       1    1   30006+
4          2       2    1   20198+
5          2       3    1     111+
6          2       4    1     610-
7          2       5    1      26+
8          2       6    1   30006-
9          2       7    2  11+, 11
The next step, as(df, "transactions"), gives the following error:
error in asMethod(object) :
can not coerce list with transactions with duplicated items
Calls: as ... .nextMethod -> callNextMethod -> .nextMethod -> as -> asMethod
I just spent two days trying to get my data into cspade, without success!

After much trial and error I managed to convert the file to a transactions object.
Tips for anyone who struggles with the same problem:
I had to remove the commas (use paste rather than toString).
I wrote the table to a CSV file. Be aware: no header and no row names, or the import with read_baskets will fail. Hope this helps future users.
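For example, here is a quick sketch of the toString vs paste difference for a multi-item event like the last row above:
# toString() joins items with ", ", leaving a comma inside the field;
# paste() with collapse = " " does not
items <- c("11+", "11")
toString(items)               # "11+, 11"
paste(items, collapse = " ")  # "11+ 11"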

I did it similarly to how you did it. I also included a size column; I saw it in another example, though I'm not sure what it does.
My data is built like this, but with > 200,000 unique IDs.
mytxt <- data.frame(ID = c(1, 1, 1, 2, 2),
                    Time = c(1, 2, 3, 1, 2),
                    Size = 1,
                    Event = c("A", "B", "E", "B", "A"))
I simply save it as a txt file with no column or row names.
write.table(mytxt, "C:\\mytxt.txt", sep=" ", row.names = FALSE, col.names = FALSE, quote = FALSE)
And then I read it with the following line:
data <- read_baskets(con = "C:\\mytxt.txt", info = c("sequenceID","eventID","SIZE"))
So it is similar to what you describe in your comment.
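In case it helps, a hedged sketch of the follow-up steps: read_baskets() should already give you a transactions object, so it can be passed straight to cspade. The support threshold below is only an illustration, not a recommendation:
library(arulesSequences)

# read the baskets file written above; info attaches the sequence/event metadata
data <- read_baskets(con = "C:\\mytxt.txt", info = c("sequenceID", "eventID", "SIZE"))

# mine frequent sequences with cspade (support value chosen arbitrarily here)
seqs <- cspade(data, parameter = list(support = 0.2), control = list(verbose = TRUE))
summary(seqs)
as(seqs, "data.frame")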

How to compare two variable columns with each other in R?

I'm new to R and need help! I have many variables including Response and RightResponse.
I need to compare those two columns and create a new column that shows whether each pair of values is a match or a miss.
Thanks.
Perhaps something like this?
library(magrittr)
library(dplyr)
> res <- data.frame(Response=c(1,4,4,3,3,6,3),RightResponse=c(1,2,4,3,3,6,5))
> res <- res %>% mutate("CorrectOrNot" = ifelse(Response == RightResponse, "Correct","Incorrect"))
> res
  Response RightResponse CorrectOrNot
1        1             1      Correct
2        4             2    Incorrect
3        4             4      Correct
4        3             3      Correct
5        3             3      Correct
6        6             6      Correct
7        3             5    Incorrect
Basically the mutate function has created a new column containing the results of a comparison between Response and RightResponse.
Hope this helps!
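If you would rather stay in base R, here is a sketch of the same idea without dplyr (just an alternative to the answer above):
res$CorrectOrNot <- ifelse(res$Response == res$RightResponse, "Correct", "Incorrect")
# or keep it as a logical column
res$Match <- res$Response == res$RightResponse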

Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of them was particularly helpful, or at least not applicable to my issue.
My data look like this:
MyData <- data.frame("id" = c("a", "a", "b"),
                     "value1_1990" = c(5, NA, 1),
                     "value2_1990" = c(5, NA, 2),
                     "value1_2000" = c(2, 1, 1),
                     "value2_2000" = c(2, 1, 2),
                     "value1_2010" = c(NA, 9, 1),
                     "value2_2010" = c(NA, 9, 2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5           2           2          NA          NA
2  a          NA          NA           1           1           9           9
3  b           1           2           1           2           1           2
What I need:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5         1.5         1.5           9           9
2  b           1           2           1           2           1           2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except for the fact that...
...it turns all other existing id's into numeric values (and I don't want to let the second line of code -- the ifelse-statement -- touch any of the existing id's, which is why I wrote else==MyData$id).
...this is not particularly fancy code. Is there a one-line solution that does the trick? I saw other approaches using aggregate(), but they didn't work for me.
You can try using dplyr:
library(dplyr)
Possible solution:
MyData %>% group_by(id) %>% summarise_all(funs(mean(., na.rm = TRUE)))
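Note that funs() has since been deprecated in dplyr; a sketch of the equivalent call with across(), assuming dplyr >= 1.0:
library(dplyr)

# group by id and average every remaining column, ignoring NAs
MyData %>%
  group_by(id) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))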

R creating new column based on split column name

I ran into a problem while trying to rearrange my data frame into long format.
My table looks like this:
x <- data.frame("Accession" = c("AGI1", "AGI2", "AGI3", "AGI4", "AGI5", "AGI6"),
                "wt_rep_1" = c(1, 2, 3, 4, 4, 5),
                "wt_rep_2" = c(1, 2, 3, 4, 8, 9),
                "mutant1_rep_1" = c(1, 1, 0, 0, 5, 3),
                "mutant2_rep_1" = c(1, 7, 0, 0, 1, 5),
                "mutant2_rep_2" = c(1, 1, 4, 0, 1, 8))
> x
  Accession wt_rep_1 wt_rep_2 mutant1_rep_1 mutant2_rep_1 mutant2_rep_2
1      AGI1        1        1             1             1             1
2      AGI2        2        2             1             7             1
3      AGI3        3        3             0             0             4
4      AGI4        4        4             0             0             0
5      AGI5        4        8             5             1             1
6      AGI6        5        9             3             5             8
I need to create a column that I would name "genotype", and it would contain the first part of each column name before "_".
How can I use
strsplit(names(x), "_")
for that?
And preferably in a loop...
Please, anyone, help.
I'll extract the part of the column names of x before the first _ in two instructions. Note that it can be done in just one line, but I'm posting it like this for clarity.
sp <- strsplit(names(x), "_")
sapply(sp[-1], `[`, 1)
Now, how can this be a new column in data.frame x? There are only five elements in the resulting vector and x has six rows.
I agree with Rui Barradas: I don't get how this vector could be a part of your original dataframe. Could you please clarify?
William Doane's response to this question suggests that using regular expressions might do the trick. I like this approach because I find it elegant and fast:
> gsub("(_.*)$", "", names(x))[-1]
[1] "wt" "wt" "mutant1" "mutant2" "mutant2"

handling 'wrong' entries and NAs in a data.table substituting them with entries from other table

I am using data.table in the context of a wider application using shiny and handsontable.js. This is the flow of this part of the app:
I publish a data.table on the browser with numeric columns using handsontable & shiny. This is rendered on the screen.
The user changes values and each time this happens a new data.table is returned with the data.
The problem is with error management, specifically when a user accidentally keys in a character.
My objective is to correct the user's error by replacing the single cell where the character was entered with the value from the original copy (only this cell, as the others may contain valid changes to be saved at a later stage in the app).
Sadly, I am not able to find an efficient solution to this problem. Here is my code with a reproducible example:
library(data.table)

# Generate a sample data.table
originTable <- data.table(Cat = LETTERS[1:5],
                          Jan = 1:5,
                          Feb = sample(1:5),
                          Mar = sample(1:5),
                          Apr = sample(1:5),
                          May = sample(1:5))
# I take a full copy; to simulate the effect of a character keyed in by mistake,
# I convert the entire column to character
dt_ <- copy(originTable)
dt_[,Jan := as.character(Jan)]
# "q" entered by mistake by the user -
dt_[[5,2]] <- "q"
# This is what I get back:
   Cat Jan Feb Mar Apr May
1:   A   1   1   2   4   4
2:   B   2   5   4   2   2
3:   C   3   4   3   1   5
4:   D   4   3   5   5   1
5:   E   q   2   1   3   3
Now to my code to try to fix this:
valCols <- month.abb[1:5]
for (j in valCols)
  set(dt_,
      i = NULL,
      j = j,
      value = as.numeric(as.character(dt_[[j]])))
This gives me a data.table with an NA value somewhere (in place of the character entered by mistake, in a position I do not know in advance).
To substitute the value I've used the following code
for (j in valCols)
  set(dt_,
      i = which(is.na(dt_[[j]])),
      j = j,
      value = as.numeric(originTable[[j]]))
But it does not work: it finds the correct column, but ignores the i value and copies the value contained in originTable[1, j] rather than originTable[i, j]. In the example, dt_[5, 2] gets 1 (the value at originTable[1, 2]) instead of 5.
In other words, I would have expected as.numeric(originTable[[j]]) to be subset by i (implicitly) and by j (explicitly).
To be fair the Warning is telling me what is happening:
Warning message:
In set(dt_, i = which(is.na(dt_[[j]])), j = j, value = as.numeric(originTable[[j]])) :
Supplied 5 items to be assigned to 1 items of column 'Jan' (4 unused)
But my problem remains unsolved.
I have read countless apparently similar SO posts, but sadly to no avail (possibly because NA handling has evolved in recent releases and older answers no longer fully reflect best practice). A non-NA-based solution would be equally acceptable. Thanks.
Try the following:
# use your criteria to determine what the incorrect values are in each column
wrongs = lapply(dt_[, !"Cat"], function(x) which(is.na(as.numeric(x))))
# now substitute
for (n in names(wrongs)) dt_[wrongs[[n]], (n) := originTable[[n]][wrongs[[n]]]]
dt_
#    Cat Jan Feb Mar Apr May
#1:    A   1   2   5   2   4
#2:    B   2   4   3   4   5
#3:    C   3   3   2   5   2
#4:    D   4   1   1   1   1
#5:    E   5   5   4   3   3
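For completeness, a hedged sketch of the same fix written with set(), following the flow of the question; the key point is that the replacement values have to be subset by the same i rows (sketch only, assuming valCols as defined above):
valCols <- month.abb[1:5]
for (j in valCols) {
  # coerce the column to numeric; invalid entries such as "q" become NA
  set(dt_, i = NULL, j = j, value = as.numeric(as.character(dt_[[j]])))
  # locate the NAs and patch only those rows from the original copy
  bad <- which(is.na(dt_[[j]]))
  if (length(bad)) set(dt_, i = bad, j = j, value = originTable[[j]][bad])
}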

How can I order a dataframe by the second column in R? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to sort a dataframe by column(s) in R
I was just wondering if someone could help me out; I have what I thought should be an easy problem to solve.
I have the table below:
SampleID Cluster
R0132F041p 1
R0132F127 1
R0132F064 1
R0132F068p 1
R0132F015 2
R0132F094 3
R0132F105 1
R0132F013 2
R0132F114 1
R0132F014 2
R0132F039p 3
R0132F137 1
R0132F059 1
R0132F138p 2
R0132F038p 2
and I would like to sort/order it by Cluster to get the results as below:
SampleID Cluster
R0132F041p 1
R0132F127 1
R0132F064 1
R0132F068p 1
R0132F105 1
R0132F114 1
R0132F137 1
R0132F059 1
R0132F015 2
R0132F013 2
R0132F014 2
R0132F138p 2
R0132F038p 2
R0132F094 3
R0132F039p 3
I have tried the following R code:
data<-read.table('Table.txt', header=TRUE,row.names=1,sep='\t')
data <- data.frame(data)
data <- data[order(data$Cluster),]
write.table(data, file = 'OrderedTable.txt', append = TRUE,quote=FALSE, sep = '\t', na ='NA', dec = '.', row.names = TRUE, col.names = FALSE)
and get the following output:
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
11 2
12 2
13 2
14 3
15 3
Why have the SampleIDs been replaced by the numbers 1-15, and what do these numbers represent? I have read the ?order page, but it seems to explain sort.list better than order(). If anyone could help me out with this, I would be very grateful.
The short answer is that you did it perfectly; you are just having some difficulty with reading and writing files. Going through your code:
data<-read.table('Table.txt', header=TRUE,row.names=1,sep='\t')
The above line is reading in your data fine, but the row.names=1 told it to use the first column as names for rows. So now your SampleIDs are row names instead of being their own column. If you type data or head(data) or str(data) immediately after running this line, this should be clear. Just omit that row.names argument and it will read properly.
data <- data.frame(data)
You don't need this above line because read.table() produces a dataframe. You can see that with str(data) as well.
data <- data[order(data$Cluster),]
The above line is perfect.
write.table(data, file = 'OrderedTable.txt', append = TRUE,
quote=FALSE, sep = '\t', na ='NA', dec = '.', row.names = TRUE,
col.names = FALSE)
Here you included the argument col.names = FALSE which is why your file doesn't have column names. You also don't need/want append=TRUE. If you look at help(write.table), you see it is "only relevant if file is a character string". Here it seems to make the file write without ending the last line, which would likely cause any later read.table() to complain.
The numbers 1-15 in your result look like row numbers. You don't explain how you look at the resulting file, so I cannot be sure. You likely read your file in a way that doesn't parse the row.names and is showing row numbers instead. If you make certain your SampleIDs column does not get assigned to be names of rows, you'll probably be fine.
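Putting the advice above together, a minimal sketch of the corrected read/order/write flow (file names as in the question):
# read without row.names so SampleID stays an ordinary column
data <- read.table("Table.txt", header = TRUE, sep = "\t")
data <- data[order(data$Cluster), ]
# write with column names and without row numbers
write.table(data, file = "OrderedTable.txt", quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE)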
Have a look at the arrange function of the plyr package.
library(plyr)
data <- arrange(data, Cluster)  # assign the result back, otherwise the unsorted data gets written
write.table(data, "ordered_data.txt")
