I have a list in R in which every single row is a half verse of Bible text. The columns are: B(ook), C(hapter), V(erse), H(alf verse), and a1–a31. The a columns hold codes that represent Hebrew cantillation marks.
What I need is a way to find patterns in the sequences of numbers that tells me which combinations of integers occur and how many times.
E.g.: how many times is 74 followed by 63; how many times is 63 preceded by 05.
Ideally it would also handle combinations of more than two. E.g.: how many times is 74 preceded by 05, which is in turn preceded by 35.
Finally, I need to chart this in some way.
Below are the header and the first 3 rows of the list.
B,C,V,H,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,a21,a22,a23,a24,a25,a26,a27,a28,a29,a30,a31
Genesis,1,1,A,73,74,92
Genesis,1,1,B,71,73,71,00
Genesis,1,2,A,81,71,3303,80,73,74,92
I have a rather complicated solution, and I am not sure it is the best one.
It makes use of data.table for the reshaping and for the data manipulation.
The data:
library(data.table)
df <- setDT(read.table(text="
B,C,V,H,a1,a2,a3,a4,a5,a6,a7
Genesis,1,1,A,73,74,92,NA,NA,NA,NaN
Genesis,1,1,B,71,73,71,00,NA,NA,NA
Genesis,1,2,A,81,71,3303,80,73,74,92",h=T,sep = ","))
I transform the data this way:
test <- melt(df,measure.vars = patterns("^a"))
plouf <- dcast(test[!is.na(value)],variable ~B+C+V+H,value.var = "value")
variable Genesis_1_1_A Genesis_1_1_B Genesis_1_2_A
1: a1 73 71 81
2: a2 74 73 71
3: a3 92 71 3303
4: a4 NA 0 80
5: a5 NA NA 73
6: a6 NA NA 74
7: a7 NA NA 92
I then create a vector of all combinations of 3 successive numbers:
allcomb <- unlist(lapply(1:(nrow(plouf)-2),function(i){
plouf[i:(i+2),lapply(.SD,function(col){paste(col,collapse = ",")}),.SDcols = grep("Genesis",names(plouf),value = T)]
}))
It is a bit tricky:
plouf[1:3,lapply(.SD,function(col){paste(col,collapse = ",")}),.SDcols = grep("Genesis",names(plouf),value = T)]
concatenates the first 3 lines of every column of plouf selected by .SDcols = grep("Genesis",names(plouf),value = T), that is,
[1] "Genesis_1_1_A" "Genesis_1_1_B" "Genesis_1_2_A"
the columns beginning with Genesis. Doing that for every run of 3 successive lines and turning the output into a vector gives the vector allcomb. It contains combinations with NA, which you can remove:
allcomb <- allcomb[!grepl("NA",allcomb)]
Genesis_1_1_A Genesis_1_1_B Genesis_1_2_A Genesis_1_1_B Genesis_1_2_A Genesis_1_2_A Genesis_1_2_A Genesis_1_2_A
"73,74,92" "71,73,71" "81,71,3303" "73,71,0" "71,3303,80" "3303,80,73" "80,73,74" "73,74,92"
Having all the combinations as text, you can use table to count the occurrences of each combination, which gives the result you wanted:
> table(allcomb)
3303,80,73 71,3303,80 71,73,71 73,71,0 73,74,92 80,73,74 81,71,3303
1 1 1 1 2 1 1
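The vector allcomb is named: each element carries the name of the column (half verse) in which the combination occurs. You can thus trace every repetition back to its source (an exact comparison is used rather than grep, so that one combination cannot match inside a longer one):
sapply(unique(allcomb),function(comb){
names(allcomb)[allcomb == comb]}
)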
$`73,74,92`
[1] "Genesis_1_1_A" "Genesis_1_2_A"
$`71,73,71`
[1] "Genesis_1_1_B"
$`81,71,3303`
[1] "Genesis_1_2_A"
$`73,71,0`
[1] "Genesis_1_1_B"
$`71,3303,80`
[1] "Genesis_1_2_A"
$`3303,80,73`
[1] "Genesis_1_2_A"
$`80,73,74`
[1] "Genesis_1_2_A"
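The same counts can be obtained for runs of any length without reshaping. The following is a minimal base R sketch, assuming each half verse is available as a character vector of codes; the helper name count_ngrams and the seqs list are hypothetical and not part of the answer above. Keeping the codes as character preserves leading zeros such as 00 or 05, which read.table by default reads as the numbers 0 and 5.

```r
# Hypothetical helper: count every run of n successive codes within each sequence.
count_ngrams <- function(seqs, n) {
  grams <- unlist(lapply(seqs, function(s) {
    if (length(s) < n) return(character(0))   # sequence too short for a run of n
    starts <- seq_len(length(s) - n + 1)
    vapply(starts,
           function(i) paste(s[i:(i + n - 1)], collapse = ","),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# The three half verses from the question, kept as character codes.
seqs <- list(
  c("73", "74", "92"),
  c("71", "73", "71", "00"),
  c("81", "71", "3303", "80", "73", "74", "92")
)

counts2 <- count_ngrams(seqs, 2)  # pairs, e.g. how often 73 is followed by 74
counts3 <- count_ngrams(seqs, 3)  # triples
```

Because the result is an ordinary table, it can be passed to barplot(), for example barplot(head(counts2, 20), las = 2), as a first chart of the most frequent combinations.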
I have two data sets that are supposed to be the same size but aren't. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a graph that's going into a report. (Don't worry, this data isn't being permanently deleted!)
I have read the following:
Selecting columns in R data frame based on those *not* in a vector
http://www.ats.ucla.edu/stat/r/faq/subset_R.htm
How to combine multiple conditions to subset a data-frame using "OR"?
But I'm still not able to get this to work right. Here's my code:
bg2011missingFromBeg <- setdiff(x=eg2011$ID, y=bg2011$ID)
#attempt 1
eg2011cleaned <- subset(eg2011, ID != bg2011missingFromBeg)
#attempt 2
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg]
The first try just eliminates the first value in the resulting setdiff vector. The second try yields an unwieldy error:
Error in `[.data.frame`(eg2012, !eg2012$ID %in% bg2012missingFromBeg)
: undefined columns selected
This will give you what you want:
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]
The error in your second attempt occurs because you forgot the comma.
For convenience, object[index] with a single index selects columns of a data frame. If you want to subset rows and keep all columns, you have to use the form
object[index_rows, index_columns], where index_columns can be left blank, which selects all columns by default.
However, you still need to include the comma to indicate that you want a subset of rows instead of a subset of columns.
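A small sketch of the difference, using a made-up data frame (df, drop_ids and the other names are illustrative, not from the question). It also shows why the != attempt drops too little: the shorter vector is recycled and compared element by element.

```r
# Toy data frame with four rows and two columns.
df <- data.frame(ID = 1:4, value = c(10, 20, 30, 40))
drop_ids <- c(2L, 4L)

keep <- !df$ID %in% drop_ids   # TRUE FALSE TRUE FALSE, one entry per row

# With the comma, the logical vector selects rows.
rows_kept <- df[keep, ]

# Without the comma, the same vector is used to select columns. It has four
# entries but df has only two columns, so this fails; the error is captured.
err <- tryCatch(df[keep], error = function(e) conditionMessage(e))

# != recycles drop_ids as 2, 4, 2, 4 and compares position by position,
# so only the row with ID 4 is removed here.
recycled <- subset(df, ID != drop_ids)
```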
If you really just want to subset each data frame by an index that exists in both data frames, you can do this with the 'match' function, like so:
data_A[match(data_B$index, data_A$index, nomatch=0),]
data_B[match(data_A$index, data_B$index, nomatch=0),]
For unique index values, this selects the same rows as the following (only the row order can differ):
data_A[data_A$index %in% data_B$index,]
data_B[data_B$index %in% data_A$index,]
Here is a demo:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data sets.
data_A <- data.frame(index=sample(1:200, 90, rep=FALSE), value=runif(90))
data_B <- data.frame(index=sample(1:200, 120, rep=FALSE), value=runif(120))
# Subset data of each data frame by the index in the other.
t_A <- data_A[match(data_B$index, data_A$index, nomatch=0),]
t_B <- data_B[match(data_A$index, data_B$index, nomatch=0),]
# Make sure they match.
data.frame(t_A[order(t_A$index),], t_B[order(t_B$index),])[1:20,]
# index value index.1 value.1
# 27 3 0.7155661 3 0.65887761
# 10 12 0.6049333 12 0.14362694
# 88 14 0.7410786 14 0.42021589
# 56 15 0.4525708 15 0.78101754
# 38 18 0.2075451 18 0.70277874
# 24 23 0.4314737 23 0.78218212
# 34 32 0.1734423 32 0.85508236
# 22 38 0.7317925 38 0.56426384
# 84 39 0.3913593 39 0.09485786
# 5 40 0.7789147 40 0.31248966
# 74 43 0.7799849 43 0.10910096
# 71 45 0.2847905 45 0.26787813
# 57 46 0.1751268 46 0.17719454
# 25 48 0.1482116 48 0.99607737
# 81 53 0.6304141 53 0.26721208
# 60 58 0.8645449 58 0.96920881
# 30 59 0.6401010 59 0.67371223
# 75 61 0.8806190 61 0.69882454
# 63 64 0.3287773 64 0.36918946
# 19 70 0.9240745 70 0.11350771
A more human-readable example (this is the first time I am using %in%) of how to compare two data frames and keep only the rows that have equal values in a specific column:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data frames.
data_A <- data.frame(id=c(1,2,3), value=c(1,2,3))
data_B <- data.frame(id=c(1,2,3,4), value=c(5,6,7,8))
# compare data frames by specific columns and keep only
# the rows with equal values
data_A[data_A$id %in% data_B$id,] # keeps the rows of data_A whose id is in data_B
data_B[data_B$id %in% data_A$id,] # keeps the rows of data_B whose id is in data_A
Results:
> data_A[data_A$id %in% data_B$id,]
id value
1 1 1
2 2 2
3 3 3
> data_B[data_B$id %in% data_A$id,]
id value
1 1 5
2 2 6
3 3 7
Per the comments to the original post, merges / joins are well-suited for this problem. In particular, an inner join will return only values that are present in both data frames, making the setdiff statement unnecessary.
Using the data from Dinre's example:
In base R:
cleanedA <- merge(data_A, data_B[, "index"], by = 1, sort = FALSE)
cleanedB <- merge(data_B, data_A[, "index"], by = 1, sort = FALSE)
Using the dplyr package:
library(dplyr)
cleanedA <- inner_join(data_A, data_B %>% select(index))
cleanedB <- inner_join(data_B, data_A %>% select(index))
To keep the data as two separate tables, each containing only its own variables, this subsets the unwanted table to only its index variable before joining. Then no new variables are added to the resulting table.
I have this vector:
a = c(4,5,6,81,82,83)
My desired result is the following:
b = c(1,2,3,4,5,6,78,79,80,81,82,83)
My logic is: there are two different sequences in a (this can be checked with length(which(diff(a)>1))+1). Each of them has to be extended backwards to reach the length of 1:end_of_first_seq (end_of_first_seq = a[which(diff(a)>1)[1]]). Thus, in this case each sequence should have length 6. Each sequence must therefore grow three steps backwards, so 4,5,6 becomes 1,2,3,4,5,6 and 81,82,83 becomes 78,79,80,81,82,83, all in the same vector.
Is there any fast way to do this? (this is a simple example, the number of sequences can be higher). It is worth mentioning all "previous" sequences are the same length (in this case, 3) and they are separated by at least two values (a case like 6,7,8,9,10,11 cannot happen). I know I can do this with loops but speed is a factor.
If all sequences have same length:
vec <- c(4,5,6,81,82,83)
LEN <- 3 # sequence length
want <- matrix(vec, ncol = LEN, byrow = TRUE)
want <- cbind(want - LEN, want)
want <- as.vector(t(want))
want
# [1] 1 2 3 4 5 6 78 79 80 81 82 83
We calculate the target length of each sequence, and since all the sequences have the same length, we can extract every nth value (the end of each sequence) and create a sequence between the start and end point of every sequence.
length_of_each_seq <- a[which.max(diff(a)>1)]
n <- 3
vals <- a[seq(n, length(a), by = n)]
c(mapply(`:`, vals - (length_of_each_seq - 1), vals))
#[1] 1 2 3 4 5 6 78 79 80 81 82 83
where vals is the end of sequence
vals
#[1] 6 83
and vals - (length_of_each_seq - 1) is from where we need to start
vals - (length_of_each_seq - 1)
#[1] 1 78
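If the length of the existing runs is not known in advance, the run boundaries can be detected from the gaps instead. This is only a sketch under the question's assumptions (runs separated by a gap larger than 1, every run extended backwards to the length given by the end of the first run); run_id, ends and target_len are illustrative names.

```r
a <- c(4, 5, 6, 81, 82, 83)

# Label each run: a new run starts wherever the gap to the previous value exceeds 1.
run_id <- cumsum(c(TRUE, diff(a) > 1))

# Last value of every run.
ends <- tapply(a, run_id, max)

# The first run must reach back to 1, so its end value is the target length.
target_len <- ends[[1]]

# Rebuild every run so that it ends where it did but has target_len elements.
b <- unlist(lapply(ends, function(e) seq(e - target_len + 1, e)),
            use.names = FALSE)
```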
I have a dataframe
names2 <- c('AdagioBarber','AdagioBarber', 'Beethovan','Beethovan')
Value <- c(33,55,21,54)
song.data <- data.frame(names2,Value)
I would like to arrange it according to this character vector
names <- c('Beethovan','Beethovan','AdagioBarber','AdagioBarber')
I am using match() to achieve this
data.frame(song.data[match((names), (song.data$names2)),])
The problem is that match returns only the first occurrences:
names2 Value
3 Beethovan 21
3.1 Beethovan 21
1 AdagioBarber 33
1.1 AdagioBarber 33
You can use order, as @zx8754 and @Evan Friedland have pointed out.
> name.order <- c('Beethovan','AdagioBarber')
> song.data$names2 <- factor(song.data$names2, levels= name.order)
> song.data[order(song.data$names2), ]
names2 Value
3 Beethovan 21
4 Beethovan 54
1 AdagioBarber 33
2 AdagioBarber 55
Basically, factor turns the strings into integers and creates a lookup table of which integers correspond to which strings. The levels argument specifies what you want that lookup table to be. Without that argument, the levels are the sorted unique values (alphabetical order for strings).
So for example:
> as.numeric(factor(letters[1:5]))
[1] 1 2 3 4 5
> as.numeric(factor(letters[1:5], levels=c("d","b","e","a","c")))
[1] 4 2 5 1 3
Note: You'll need to be absolutely sure you get all your (correctly spelled) levels into that name.order vector; otherwise the unmatched values become NA in the factor and end up last in the ordering.
(sort only works on vectors, factors included; to reorder the rows of a data frame you need order.)
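If you would rather not convert the column to a factor, the same ordering can be sketched with match inside order. The variable sorted is an illustrative name; the data are the ones from the question.

```r
names2 <- c("AdagioBarber", "AdagioBarber", "Beethovan", "Beethovan")
Value <- c(33, 55, 21, 54)
song.data <- data.frame(names2, Value)

# Desired order of the distinct names.
name.order <- c("Beethovan", "AdagioBarber")

# match() gives each row the position of its name in name.order;
# order() then sorts the rows by that position and keeps ties in their original order.
sorted <- song.data[order(match(song.data$names2, name.order)), ]
```

Here match is applied to the data column rather than to the target vector, so every row receives a rank and repeated names are all kept.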
I am trying to reorganize my data, basically a list of data.frames.
Its elements represent subjects of interest (A and B), with observations on x and y, collected on two occasions (1 and 2).
I am trying to turn this into a list of data.frames, one per subject, with the occasion on which x and y were collected stored in the respective data.frame as a new variable rather than in the element name:
library('rlist')
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
list <- list(A1=A1,A2=A2,B1=B1,B2=B2)
A <- do.call(rbind,list.match(list,"A"))
B <- do.call(rbind,list.match(list,"B"))
list <- list(A=A,B=B)
list <- lapply(list,function(x) {
y <- data.frame(x)
y$class <- c(rep.int(1,2),rep.int(2,2))
return(y)
})
> list
$A
x y class
A1.1 66 96 1
A1.2 76 58 1
A2.1 50 93 2
A2.2 57 12 2
$B
x y class
B1.1 58 56 1
B1.2 69 15 1
B2.1 77 77 2
B2.2 9 9 2
In my real world problem there are about 500 subjects, not always two occasions, differing numbers of observations.
So my example above just illustrates where I want to get. I am stuck on how to tell do.call/rbind to bind the subject-specific elements together, based on the element names, into new list elements while assigning a new variable.
To me, this is a somewhat fuzzy task, and the closest I got was the rlist package. This question is related but uses unique to identify elements, whereas in my case it seems to be more a regex problem.
I'd be happy even for instructions on how to use google, any keywords for further research etc.
From the data you provided:
subj <- sub("[A-Z]*", "", names(lst))
newlst <- Map(function(x, y) {x[,"class"] <- y;x}, lst, subj)
First we make the regular expression call to isolate the number that will go in the class column. In this case, I matched on capital letters and erased them, leaving the number. Therefore, "A1" becomes "1". Note that your real names may require a different regex pattern.
Then we use Map to create a new column for each data frame and save to a new list called newlst. Map takes the first element of each argument and carries out the function then continues on with each object element. So the first data frame in lst and the first number in subj are used first. The anonymous function I used is function(x,y) {x[, "class"] <- y; x}. It takes two arguments. The first is the data frame, the second is the column value.
Now it's much easier to move forward. We can create a vector called uniq.nmes holding the names of the data frames that we will combine, where "A1" becomes "A". Then we can rbind on that match:
uniq.nmes <- unique(sub("\\d", "", names(lst)))
lapply(uniq.nmes, function(x) {
do.call(rbind, newlst[grep(x, names(newlst))])
})
# [[1]]
# x y class
# A1.1 1 79 1
# A1.2 30 13 1
# A2.1 90 39 2
# A2.2 43 22 2
#
# [[2]]
# x y class
# B1.1 54 59 1
# B1.2 83 90 1
# B2.1 85 36 2
# B2.2 91 28 2
Data
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
lst <- list(A1=A1,A2=A2,B1=B1,B2=B2)
It sounds like you're doing a lot of gymnastics because you have a specific form in mind. What I would suggest is first trying to make the data tidy. Without reading the link, the quick summary is to put your data into a single data frame, where it can be easily processed.
The quick version of the answer (here I've used lst instead of list for the name to avoid confusion with the built-in list) is to do this:
do.call(rbind,
lapply(seq(lst), function(i) {
lst[[i]]$type <- names(lst)[i]; lst[[i]]
})
)
What this will do is create a single data frame, with a column, "type", that contains the name of the list item in which that row appeared.
Using a slightly simplified version of your initial data:
lst <- list(A1=data.frame(x=rnorm(5)), A2=data.frame(x=rnorm(3)), B=data.frame(x=rnorm(5)))
lst
$A1
x
1 1.3386071
2 1.9875317
3 0.4942179
4 -0.1803087
5 0.3094100
$A2
x
1 -0.3388195
2 1.1993115
3 1.9524970
$B
x
1 -0.1317882
2 -0.3383545
3 0.8864144
4 0.9241305
5 -0.8481927
And then applying the magic function
df <- do.call(rbind,
lapply(seq(lst), function(i) {
lst[[i]]$type <- names(lst)[i]; lst[[i]]
})
)
df
x type
1 1.3386071 A1
2 1.9875317 A1
3 0.4942179 A1
4 -0.1803087 A1
5 0.3094100 A1
6 -0.3388195 A2
7 1.1993115 A2
8 1.9524970 A2
9 -0.1317882 B
10 -0.3383545 B
11 0.8864144 B
12 0.9241305 B
13 -0.8481927 B
From here we can process to our heart's content, with operations like df$subject <- gsub("[0-9]*", "", df$type) to extract the non-numeric portion of type. Tools like split can then be used to generate the sub-lists that you mention in your question.
In addition, once it is in this form, you can use functions like by and aggregate or libraries like dplyr or data.table to do more advanced split-apply-combine operations for data analysis.
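To make the last step concrete, here is a minimal sketch that goes from the named list to one data frame per subject with split. It assumes element names of the form letters followed by an occasion number (A1, A2, B1); the toy data and the column names subject and class are illustrative.

```r
set.seed(42)
# Toy list: subject A observed on two occasions, subject B on one,
# with differing numbers of observations.
lst <- list(A1 = data.frame(x = rnorm(2)),
            A2 = data.frame(x = rnorm(3)),
            B1 = data.frame(x = rnorm(2)))

# Stack everything into one data frame, recording the element name in "type".
df <- do.call(rbind, lapply(names(lst), function(nm) cbind(lst[[nm]], type = nm)))

# Split the element name into subject (letters) and occasion (trailing digits).
df$subject <- sub("[0-9]+$", "", df$type)
df$class <- as.integer(sub("^[^0-9]*", "", df$type))

# One data frame per subject, as asked for in the question.
by_subject <- split(df, df$subject)
```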
The file input.txt contains 8000000 rows and 4 columns. The first 2 columns are text. The last 2 columns are numbers. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. The values in columns 3 and 4 are the numbers of unique symbols in columns 1 and 2, respectively, after splitting by "]".
Each row of input.txt looks like this:
c33]c21]c5]c7]c8]c9 TPS2]MIC17]ERG3]NNF1]CIS3]CWP2 6 6
The desired result:
row[ , ] represents a set of symbols such as "c33 c21 c5 c7 c8 c9" or "TPS2 MIC17 ERG3 NNF1 CIS3 CWP2", and |.| denotes the number of symbols in a set, e.g. |c33 c21 c5 c7 c8 c9| = 6.
If the overlap of two rows is >= 0.6, the numbers of these two rows are written to a file.
My code is as follows, but it runs too slowly.
The code:
library(compiler)
enableJIT(3)
data<-read.table("input.txt",header=FALSE)
row<-8000000
for (i in 1:(row-1)){
row11<-unlist(strsplit(as.character(data[i,1]),"]"))
row12<-unlist(strsplit(as.character(data[i,2]),"]"))
s1<-data[i,3]*data[i,4]
zz<-file(paste("output",i,".txt",sep=""),"w")
for (j in (i+1):row)
{ row21<-unlist(strsplit(as.character(data[j,1]),"]"))
row22<-unlist(strsplit(as.character(data[j,2]),"]"))
up<-length(intersect(row11,row21))*length(intersect(row12,row22))
s2<-data[j,3]*data[j,4]
down<-min(s1,s2)
if ((up/down)>=0.6) cat(i,"\t",j,"\n",file=zz,append=TRUE)
}
close(zz)
}
The running result:
Each row produces one file, which looks like this:
1 23
1 67
1 562
1 78
...
To make it run faster, I rewrote the code as follows.
The file input.txt now contains 16000000 rows. The number of columns is not fixed. The number of unique symbols (e.g., "c33") in each row is not fixed. Each pair of rows in input.txt looks like this:
The 1st row (odd row1): c33 c21 c5 c7 c8
The 2nd row (even row1): TPS2 MIC17 ERG3 NNF1 CIS3 CWP2 MCM6
The 3rd row (odd row2): c33 c21 c5 c21 c18 c4 c58
The 4th row (even row2): TPS12 MIC3 ERG2 NNF1 CIS4
The desired result:
If the overlap of one pair of rows with another pair of rows is >= 0.6, the numbers of these two pairs are written to a file.
The code:
library(compiler)
enableJIT(3)
con <- file("input.txt", "r")
zz<-file("output.txt","w")
oddrow1<-readLines(con,n=1)
j<-0
i<-0
while( length(oddrow1) != 0 ){
oddrow1<-strsplit(oddrow1," ")
evenrow1<-readLines(con,n=1)
evenrow1<-strsplit(evenrow1," ")
j<-j+1
con2 <- file("input.txt", "r")
readLines(con2,n=(j*2))
oddrow2<-readLines(con2,n=1)
i<-j
while( length(oddrow2) != 0 ){
i<-i+1
oddrow2<-strsplit(oddrow2," ")
evenrow2<-readLines(con2,n=1)
evenrow2<-strsplit(evenrow2," ")
oddrow1<-unlist(oddrow1)
oddrow2<-unlist(oddrow2)
evenrow1<-unlist(evenrow1)
evenrow2<-unlist(evenrow2)
up<-length(intersect(oddrow1,oddrow2))*length(intersect(evenrow1,evenrow2))
down<-min(length(oddrow1)*length(evenrow1),length(oddrow2)*length(evenrow2))
if ((up/down)>=0.6) {cat(j,"\t",i,"\n",file=zz,append=TRUE) }
oddrow2<-readLines(con2,n=1)
}
close(con2)
oddrow1<-readLines(con,n=1)
}
close(con)
close(zz)
The running result:
It produces a single file, which looks like this:
1 23
1 67
1 562
1 78
2 25
2 89
3 56
3 79
...
Both of the above methods are too slow. How can I rewrite this code so that it runs faster? Thank you!
Well, I suspect this uses too much memory for your size of data, but perhaps it will provoke some ideas.
Make up some data, with 20 total unique values and 5 to 10 in each cell.
set.seed(5)
n <- 1000L
ng <- 20
g1 <- paste(sample(10000:99999, ng))
g2 <- paste(sample(10000:99999, ng))
n1 <- sample(5:10, n, replace=TRUE)
n2 <- sample(5:10, n, replace=TRUE)
x1 <- sapply(n1, function(i) paste(g1[sample(ng, i)], collapse="|"))
x2 <- sapply(n2, function(i) paste(g2[sample(ng, i)], collapse="|"))
Load Matrix library and a helper function that takes a list of string vectors and converts them to a matrix with number of columns equal to the number of unique strings and 1's where it was present.
library(Matrix)
str2mat <- function(s) {
n <- length(s)
ni <- sapply(s, length)
s <- unlist(s)
u <- unique(s)
spMatrix(nrow=n, ncol=length(u), i=rep(1L:n, ni), j=match(s, u), x=rep(1, length(s)))
}
OK, now we can actually do something. First create the matrices and get the total number present in each row.
m1 <- str2mat(strsplit(x1, "|", fixed=TRUE))
m2 <- str2mat(strsplit(x2, "|", fixed=TRUE))
n1 <- rowSums(m1)
n2 <- rowSums(m2)
Now we can use crossproducts of these matrices to get the numerator, and outer with pmin to get the denominator. We can then compute the overlap and test whether it is > 0.6. Since we have the whole matrix, we're not interested in the diagonal or the lower half. (There are ways of storing this kind of matrix more efficiently with the Matrix library, but I'm not sure how.) We then get the rows that have enough overlap with which.
num <- tcrossprod(m1)*tcrossprod(m2)
n12 <- n1*n2
den <- outer(n12, n12, pmin)
use <- num/den > 0.6
diag(use) <- FALSE
use[lower.tri(use)] <- FALSE
out <- which(use, arr.ind=TRUE)
> head(out)
[,1] [,2]
[1,] 64 65
[2,] 27 69
[3,] 34 81
[4,] 26 82
[5,] 5 85
[6,] 21 115
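Why the crossproduct gives the numerator: in a 0/1 indicator matrix with one row per set and one column per symbol, entry [i, j] of tcrossprod is the number of columns in which both row i and row j hold a 1, that is, the size of the intersection of sets i and j. A tiny dense sketch in base R, with made-up sets and illustrative names:

```r
# Three toy symbol sets.
sets <- list(c("c33", "c21", "c5"),
             c("c33", "c5", "c7"),
             c("c9"))

# Indicator matrix: one row per set, one column per distinct symbol.
u <- unique(unlist(sets))
m <- t(vapply(sets, function(s) as.numeric(u %in% s), numeric(length(u))))

# inter[i, j] is the number of symbols shared by set i and set j;
# the diagonal holds the size of each set.
inter <- tcrossprod(m)
```

Sets 1 and 2 share c33 and c5, so inter[1, 2] is 2, which matches length(intersect(sets[[1]], sets[[2]])). The answer above does the same with sparse matrices so that the indicator matrices themselves stay small.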