Replace every word by an index in 15 million strings - r

I have a list of 15 million strings and a dictionary of 8 million words. I want to replace every word in the database by its index in the dictionary.
I tried using the hash package for faster indexing, but it is still taking hours to process all 15 million strings.
What is an efficient way of implementing this?
Example (edited):
# Database
[[1]]
[1] "a admit been c case"
[[2]]
[1] "co confirm d ebola ha hospit howard http lik"
# dictionary
"t" 1
"ker" 2
"be" 3
.
.
.
.
# Output:
[[1]]
[1] 123 3453 3453 567
[[2]]
[1] 6786 3423 234123 1234 23423 6767 3423 124431 787889 111
Here the index of "admit" in the dictionary is 3453.
Any kind of help is appreciated.
Updated Example with Code:
This is what I am currently doing.
Example: data =
[1] "a co crimea divid doe east hasten http polit secess split t threaten ukrain via w west xtcnwl youtub"
[2] "billion by cia fund group nazy spent the tweethead ukrain"
[3] "all back energy grandpar home miss my posit radiat the"
[4] "ao bv chega co de ebola http kkmnxv pacy rio suspeito t"
[5] "android androidgam co coin collect gameinsight gold http i jzdydkylwd t ve"
library(hash)

words.list = strsplit(data, "\\W+", perl=TRUE)
words.vector = unlist(words.list)
sorted.words = sort(table(words.vector), decreasing=TRUE)
# dictionary: each word mapped to its frequency rank
h = hash(names(sorted.words), 1:length(names(sorted.words)))

index = lapply(data, function(row)
{
    temp = trim.leading(row)   # helper (not shown in the post) that strips leading whitespace
    word_list = unlist(strsplit(temp, "\\W+", perl=TRUE))
    index_list = lapply(word_list, function(x)
    {
        return(h[[x]])
    })
    # print(index_list)
    return(unlist(index_list))
})
Output:
index_list
[[1]]
[1] 6 1 19 21 22 23 31 2 40 44 46 3 48 5 51 52 53 54 55
[[2]]
[1] 12 14 16 26 30 38 45 4 49 5
[[3]]
[1] 7 11 25 29 32 36 37 41 42 4
[[4]]
[1] 10 13 15 1 20 24 2 35 39 43 47 3
[[5]]
[1] 8 9 1 17 18 27 28 2 33 34 3 50
The output is the index list. This runs fast when data is short, but execution is really slow when the length is 15 million.
My task is nearest-neighbor search: I want to search for 1000 queries which have the same format as the database.
I have tried many things, such as parallel computation, but ran into memory issues.
[EDIT] How can I implement this using Rcpp?

I think you'd like to avoid the lapply() by splitting the data, unlisting, and then processing the vector of words:
data.list = strsplit(data, "\\W+", perl=TRUE)
words = unlist(data.list)
## ... additional processing, e.g., strip white space, on the vector 'words'
Perform the match against the dictionary, then re-list to the original structure:
relist(match(words, word.vector), data.list)
For downstream applications it might actually pay to retain the vector plus the 'partitioning' information, partition = sapply(data.list, length), rather than re-listing, since it will continue to be efficient to operate on the unlisted vector. The Bioconductor S4Vectors package provides a CharacterList class that takes this approach: one mostly works on something that is list-like, but the data are stored, and most operations are performed, on an underlying character vector.
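Putting the pieces together, a minimal sketch of this vectorized approach might look like the following (variable names are illustrative; the dictionary here is simply the frequency-ranked word table from the question's own preprocessing):

data.list = strsplit(data, "\\W+", perl=TRUE)
words = unlist(data.list)
# build the dictionary as in the question: words ranked by frequency
dictionary = names(sort(table(words), decreasing=TRUE))
# one vectorized lookup for all words, then restore the original list shape
index = relist(match(words, dictionary), data.list)

A single match() over the unlisted vector replaces millions of per-word hash lookups, which is where the lapply() version spends most of its time.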

Sounds like you're doing NLP.
A fast non-R solution (which you could wrap in R) is word2vec:
The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

Related

Finding value of a series in R without for-loop

I am a newbie in R and I found this problem:
Calculate the following sum using R:
1+(2/3)+(2/3)(4/5)+...+(2/3)(4/5)...(38/39)
I was enthusiastic to know how to solve this without using a for loop, and using only vector operations.
My thoughts and what I've tried till now:
Suppose I create two vectors such as
x<-2*(1:19)
y<-2*(1:19)+1
Then, x consists of all the numerators in the question and y has all the denominators. Now
z<-x/y
will create a vector of length 19 in which will be stored the values of 2/3, 4/5, ..., 38/39
I was thinking of using the prod function in R to find the required products. So, I created a vector such that
i<-1:19
In the hope of traversing z from the first element to the last, I wrote:
prod(z[1:i])
But it failed miserably, giving me the result:
[1] 0.6666667
Warning message:
In 1:i : numerical expression has 19 elements: only the first used
What I wanted to do:
I expected to store the values of (2/3), (2/3)(4/5), ..., (2/3)(4/5)...(38/39) individually in another vector (say p) which will thus have 19 elements in it. I then intend to use the sum function to finally find out the sum of all those...
Where am I stuck:
As described in the R documentation, the prod function returns the product of all the values present in its arguments. So,
prod(z[1:1])
prod(z[1:2])
prod(z[1:3])
will return the values of (2/3), (2/3)(4/5), (2/3)(4/5)(6/7) respectively which it does:
> prod(z[1:1])
[1] 0.6666667
> prod(z[1:2])
[1] 0.5333333
> prod(z[1:3])
[1] 0.4571429
But it's not possible to go on like this and do it for all 19 elements of the vector z. I am stuck here, thinking about what could be done. I wanted to iterate over the elements of z one by one, which is why I created the vector i described above, but it didn't go as I had thought. Any help, suggestions, and hints on how this can be done would be really great; I seem to have run out of ideas here.
More Information:
Here, I am providing with all the outputs in a systematic manner for others to understand my problem better:
> x
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
> y
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
> z
[1] 0.6666667 0.8000000 0.8571429 0.8888889 0.9090909 0.9230769 0.9333333
[8] 0.9411765 0.9473684 0.9523810 0.9565217 0.9600000 0.9629630 0.9655172
[15] 0.9677419 0.9696970 0.9714286 0.9729730 0.9743590
> i
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Short note (controversial statement ahead): this post would really have benefited from the use of LaTeX but, due to the extremely heavy dependencies mentioned in several posts about including LaTeX on Stack Overflow (like this one), that is not available.
You can use cumprod to get the cumulative product of a vector, which is what you are after:
p <- cumprod(z)
p
# [1] 0.6666667 0.5333333 0.4571429 0.4063492 0.3694084 0.3409923 0.3182595
# [8] 0.2995384 0.2837732 0.2702602 0.2585097 0.2481694 0.2389779 0.2307373
# [15] 0.2232941 0.2165276 0.2103411 0.2046562 0.1994087
A less efficient but more general alternative to cumprod would be:
p <- sapply(i, function(x) prod(z[1:x]))
Here sapply takes the place of the loop and passes a different ending index for each product.
Then you can do
1 + sum(p)
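Putting it all together, a minimal end-to-end version of the computation (using the vectors defined in the question) would be:

x <- 2 * (1:19)        # numerators 2, 4, ..., 38
y <- 2 * (1:19) + 1    # denominators 3, 5, ..., 39
z <- x / y
p <- cumprod(z)        # partial products (2/3), (2/3)(4/5), ...
1 + sum(p)             # the required sum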

How to normalize rather long decimal number in R?

I have a list of data.frames in which I need to transform the .score column. I implemented a helper function for this transformation, but after calling .helperFunc on my input list of data.frames I get a weird p-value format in the first and third data.frames. How can I normalize rather long decimals to simple scientific numbers? Can anyone tell me how to make this happen easily?
Toy data:
savedDF <- list(
  bar = data.frame(.start = c(12, 21, 37), .stop = c(14, 29, 45), .score = c(5, 69, 14)),
  cat = data.frame(.start = c(18, 42, 18, 42, 81), .stop = c(27, 46, 27, 46, 114), .score = c(15, 5, 15, 5, 134)),
  foo = data.frame(.start = c(3, 3, 33, 3, 33, 91), .stop = c(26, 26, 42, 26, 42, 107), .score = c(22, 22, 6, 22, 6, 7))
)
I got this weird output:
> .savedDF
$bar
.start .stop .score p.value
1 12 14 5 0.000010000000000000000817488438054070343241619411855936050415039062500
2 21 29 69 0.000000000000000000000000000000000000000000000000000000000000000000001
3 37 45 14 0.000000000000009999999999999999990459020882127560980734415352344512939
$cat
.start .stop .score p.value
1 18 27 15 1e-15
2 42 46 5 1e-05
3 18 27 15 1e-15
4 42 46 5 1e-05
5 81 114 134 1e-134
$foo
.start .stop .score p.value
1 3 26 22 0.0000000000000000000001
2 3 26 22 0.0000000000000000000001
3 33 42 6 0.0000010000000000000000
4 3 26 22 0.0000000000000000000001
5 33 42 6 0.0000010000000000000000
6 91 107 7 0.0000001000000000000000
I don't know why this happens; only the second data.frame's format is as desired. How can I normalize the p.value column as simply as possible?
The last column of cat is in the desired format; a more precise but still simple scientific number would also suit me.
How can I apply this normalization to these unexpectedly long decimal numbers and achieve my desired output? Any ideas? Thanks a lot.
0 is the default scipen option (see ?options for more details). You have apparently changed the option to 100, which tells R to use fixed (decimal) notation unless it would be more than 100 characters wider than scientific notation. To get back to the default, run the line
options(scipen = 0)
As to "So in my function, I could add this option as well?" - you shouldn't do that. Doing it in your script is fine, but not in a function. Functions really shouldn't set user options. That's likely how you got in to this mess - some function you used probably rudely ran options(scipen = 100) and changed your options without you being aware.
Related: the opposite question How to disable scientific notation in R?

Speedup search of Elements

I have two data.frames: m (23 columns, 135,973 rows), with the two important columns
head(m[,2])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(m[,7])
# [1] 3661216 3661217 3661223 3661224 3661564 3661567
and search (4 columns, 1,019,423 rows), with three important columns
head(search[,1])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(search[,3])
# [1] 3000009 3003160 3003187 3007262 3028947 3050944
head(search[,4])
# [1] 3000031 3003182 3003209 3007287 3028970 3050995
For each row in m I would like to know whether the position m[XX,7] lies between any pair of positions search[,3] and search[,4]. So search[,3] can be considered the "start" and search[,4] the "end". In addition, search[,1] and m[,2] have to be identical.
Example:
m at row 215:
"chr1" 10,984,038
hits in search at line 2898:
"chr1" 10,984,024 10,984,046
In general I'm not interested in which line, or how many lines, of search are matched. I just want to know, for each line of m, whether there is a matching line in search: yes or no.
I ended up with this function:
f_4 <- function(x, y, z) {
  for (out in 1:length(x[, 1])) {
    z[out] <- length(which((y[, 1] == x[out, 2]) & (x[out, 7] >= y[, 3]) & (x[out, 7] <= y[, 4])))
  }
  return(z)
}
found4<-vector(length=length(m[,1]), mode="numeric")
found4<-f_4(m,search,found4)
It took 3 hours to run this code.
I have already tried some speed-up approaches, but I didn't manage to get any of them running properly or faster.
I even tried some lapply/apply approaches, which worked but weren't faster, and they failed when I tried to speed them up using parLapply/parRapply.
Does anybody have a substantially faster approach and perhaps some advice?
EDIT 2015/09/18
I found another way to speed this up, using foreach with %dopar%:
f5 <- function(x, y, z) {
  foreach(out = 1:length(x[, 1]), .combine = "c") %dopar% {
    takt <- 1000
    z = length(which((y[, 1] == x[out, 2]) & (x[out, 7] >= y[, 3]) & (x[out, 7] <= y[, 4])))
  }
  return(z)
}
found5<-vector(length=length(m[,1]), mode="numeric")
found5<-f5(m,search,found5)
This only needs 45 min. However, I'm always getting only 0s; I think I need to read some more foreach %dopar% tutorials.
You can try merging with subsequent logical subsetting. First let's create some mock data:
set.seed(123) # used for reproducibility
m <-as.data.frame(matrix(sample(50,7000, replace=T), ncol=7, nrow=1000))
search <- as.data.frame(matrix(sample(50,1200, replace=T), ncol=4, nrow=300))
Since we want to compare different rows of the two sets, we can use the criterion that m[,2] should be equal to search[,1]. For convenience we can name these columns "ID" in both sets:
m <- cbind(m,seq_along(1:nrow(m)))
search <- cbind(search,seq_along(1:nrow(search)))
colnames(m) <- c("a","ID","c","d","e","f","val","rownum.m")
colnames(search) <- c("ID","nothing","start","end", "rownum.s")
We have added a column named 'rownum.m' to m and a similar column to search; these will eventually help identify the resulting entries in the initial data sets.
Now we can merge the data sets, such that the ID is the same:
m2 <- merge(m,search)
In a final step, we can perform a logical subset of the merged data set and assign the output to a new data frame m3:
m3 <- m2[(m2[,"val"] >= m2[,"start"]) & (m2[,"val"] <= m2[,"end"]),]
#> head(m3)
# ID a c d e f val rownum.m nothing start end rownum.s
#5 1 14 36 36 31 30 25 846 10 20 36 291
#13 1 34 49 24 8 44 21 526 10 20 36 291
#17 1 19 32 29 44 24 35 522 6 33 48 265
#20 1 19 32 29 44 24 35 522 32 31 50 51
#21 1 19 32 29 44 24 35 522 10 20 36 291
#29 1 6 50 10 13 43 22 15 10 20 36 291
If we are only interested in a TRUE/FALSE statement of whether a specific row of m matches the criteria, we can define a vector match_s:
match_s <- m$rownum.m %in% m3$rownum.m
which can be stored as an additional column in the original data set m:
m <- cbind(m,match_s)
Finally, we can remove the auxiliary column 'rownum.m' from the data set m which is no longer needed, with m <- m[,-8].
The result is:
> head(m)
# a ID c d e f val match_s
#1 15 14 8 11 16 13 23 FALSE
#2 40 30 8 48 42 50 20 FALSE
#3 21 9 8 19 30 36 19 TRUE
#4 45 43 26 32 41 33 27 FALSE
#5 48 43 25 10 15 13 4 FALSE
#6 3 24 31 33 8 5 36 FALSE
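For reference, the same idea can be condensed into a couple of lines once the columns are named as above (a sketch, assuming m and search as constructed earlier with the ID and rownum.m columns, before the clean-up step):

m2 <- merge(m, search)                               # join on the common "ID" column
hit <- m2$val >= m2$start & m2$val <= m2$end         # interval test on the merged rows
m$match_s <- m$rownum.m %in% m2$rownum.m[hit]        # TRUE if any matching search row exists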
If you're trying to find SNPs (say) inside a set of genomic regions, don't use R. Use BEDOPS.
Convert your SNP or single-base positions to a three-column BED file. In R, make a three-column data table with m[,2], m[,7] and m[,7] + 1, which represent the chromosome, start and stop position of the SNP. Use write.table() to write out this data table to a tab-delimited text file.
Do the same with your genomic regions: Write search[,1], search[,3], and search[,4] to a three-column data table representing the chromosome, start and stop position of the region. Use write.table() to write this out to a tab-delimited text file.
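A rough sketch of that export step in R (file names are illustrative; the write.table() arguments are chosen to produce plain tab-delimited text without headers or quotes):

snps <- data.frame(m[, 2], m[, 7], m[, 7] + 1)                 # chrom, start, stop
write.table(snps, "snps.bed", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
regions <- data.frame(search[, 1], search[, 3], search[, 4])   # chrom, start, stop
write.table(regions, "regions.bed", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)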
Use sort-bed to sort both BED files. This step might be optional, but it doesn't take long to do and it guarantees that the files are prepped for use with BEDOPS tools.
Finally, use bedmap on the two BED files to map SNPs to genomic regions. Mapping associates SNPs with regions. The bedmap tool can report which SNPs map to a region, or report the number of SNPs, or one or more of many other operations. The documentation for bedmap goes into more detail on the list of operations, but the provided example should get you started quickly.
If your data are in BED format, or can be quickly coerced into BED format, don't use R for genomic operations, as it is slow and memory-intensive. The BEDOPS toolkit introduced the use of sorting to make genomic operations fast, with low memory overhead.

Need help pairing data

I'm looking for what I'm sure is a quick answer. I'm working with a data set that looks like this:
Week Game.ID VTm VPts HTm HPts Differential HomeWin
1 NFL_20050908_OAK#NE OAK 20 NE 30 10 TRUE
1 NFL_20050911_ARI#NYG ARI 19 NYG 42 23 TRUE
1 NFL_20050911_CHI#WAS CHI 7 WAS 9 2 TRUE
1 NFL_20050911_CIN#CLE CIN 27 CLE 13 -14 FALSE
1 NFL_20050911_DAL#SD DAL 28 SD 24 -4 FALSE
1 NFL_20050911_DEN#MIA DEN 10 MIA 34 24 TRUE
NFL data. I want to come up with a way to pair each HTm with its Differential and store these values in another table. I know it's easy to do, but all the methods I'm coming up with involve handling each team individually via a for loop that searches for [i,5]=="NE", [i,5]=="NYG", and so on. I'm wondering if there's a way to do this systematically for all 32 teams. I would then use the same method to pair the VTm of the same team code ("NYG" or "NE") with VPts and a VDifferential.
Thanks for the help.
I'm not sure if I understood your question correctly (do you need something like a SELECT in a database?), but:
cbind(matr[,x], matr[,y])
selects columns x and y and creates a new matrix.
It sounds like you'd like to perform operations on your data frame based on a grouping variable. For that, there are many functions, among which is tapply(). For example, if your data is in a data.frame object named nflDF, you could get the maximum Differential for each home team HTm by
tapply(nflDF$Differential, nflDF$HTm, FUN = max)
Which would return (with your sample data)
CLE MIA NE NYG SD WAS
-14 24 10 23 -4 2
Alternatively, you could use by:
by(nflDF, nflDF$HTm, FUN = function(x) max(x$Differential))
HTm: CLE
[1] -14
------------------------------------------------------------
HTm: MIA
[1] 24
------------------------------------------------------------
HTm: NE
[1] 10
------------------------------------------------------------
HTm: NYG
[1] 23
------------------------------------------------------------
HTm: SD
[1] -4
------------------------------------------------------------
HTm: WAS
[1] 2
To perform more complicated operations, change the values supplied to the FUN arguments in the appropriate function.
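For instance, to pair each home team with all of its differentials rather than a single summary statistic (a sketch using the same hypothetical nflDF data frame), split() gives you one vector per team:

home.pairs <- split(nflDF$Differential, nflDF$HTm)
home.pairs$NE    # all differentials for games in which NE was the home team
# [1] 10

The same pattern works for the visiting teams, e.g. split(nflDF$VPts, nflDF$VTm).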

Combining vectors of unequal length into a data frame

I have a list of vectors which are time series of unequal length. My ultimate goal is to plot the time series in a ggplot2 graph. I guess I am better off first merging the vectors into a data frame (where the shorter vectors will be padded with NAs), also because I want to export the data in a tabular format such as .csv to be perused by other people.
I have a list that contains the names of all the vectors. It is fine that the column titles be set by the first vector, which is the longest. E.g.:
> mylist
[[1]]
[1] "vector1"
[[2]]
[1] "vector2"
[[3]]
[1] "vector3"
etc.
I know the way to go is to use Hadley's plyr package but I guess the problem is that my list contains the names of the vectors, not the vectors themselves, so if I type:
do.call(rbind, mylist)
I get a one-column data frame containing the names of the vectors I wanted to merge.
> do.call(rbind, actives)
[,1]
[1,] "vector1"
[2,] "vector2"
[3,] "vector3"
[4,] "vector4"
[5,] "vector5"
[6,] "vector6"
[7,] "vector7"
[8,] "vector8"
[9,] "vector9"
[10,] "vector10"
etc.
Even if I create a list with the objects themselves, I get an empty data frame:
mylist <- list(vector1, vector2)
mylist
[[1]]
1 2 3 4 5 6 7 8 9 10 11 12
0.1875000 0.2954545 0.3295455 0.2840909 0.3011364 0.3863636 0.3863636 0.3295455 0.2954545 0.3295455 0.3238636 0.2443182
13 14 15 16 17 18 19 20 21 22 23 24
0.2386364 0.2386364 0.3238636 0.2784091 0.3181818 0.3238636 0.3693182 0.3579545 0.2954545 0.3125000 0.3068182 0.3125000
25 26 27 28 29 30 31 32 33 34 35 36
0.2727273 0.2897727 0.2897727 0.2727273 0.2840909 0.3352273 0.3181818 0.3181818 0.3409091 0.3465909 0.3238636 0.3125000
37 38 39 40 41 42 43 44 45 46 47 48
0.3125000 0.3068182 0.2897727 0.2727273 0.2840909 0.3011364 0.3181818 0.2329545 0.3068182 0.2386364 0.2556818 0.2215909
49 50 51 52 53 54 55 56 57 58 59 60
0.2784091 0.2784091 0.2613636 0.2329545 0.2443182 0.2727273 0.2784091 0.2727273 0.2556818 0.2500000 0.2159091 0.2329545
61
0.2556818
[[2]]
1 2 3 4 5 6 7 8 9 10 11 12
0.2824427 0.3664122 0.3053435 0.3091603 0.3435115 0.3244275 0.3320611 0.3129771 0.3091603 0.3129771 0.2519084 0.2557252
13 14 15 16 17 18 19 20 21 22 23 24
0.2595420 0.2671756 0.2748092 0.2633588 0.2862595 0.3549618 0.2786260 0.2633588 0.2938931 0.2900763 0.2480916 0.2748092
25 26 27 28 29 30 31 32 33 34 35 36
0.2786260 0.2862595 0.2862595 0.2709924 0.2748092 0.3396947 0.2977099 0.2977099 0.2824427 0.3053435 0.3129771 0.2977099
37 38 39 40 41 42 43 44 45 46 47 48
0.3320611 0.3053435 0.2709924 0.2671756 0.2786260 0.3015267 0.2824427 0.2786260 0.2595420 0.2595420 0.2442748 0.2099237
49 50 51 52 53 54 55 56 57 58 59 60
0.2022901 0.2251908 0.2099237 0.2213740 0.2213740 0.2480916 0.2366412 0.2251908 0.2442748 0.2022901 0.1793893 0.2022901
but
do.call(rbind.fill, mylist)
data frame with 0 columns and 0 rows
I have tried converting the vectors to data frames, but there is no cbind.fill function, so plyr complains that the data frames are of different lengths.
So my questions are:
Is this the best approach? Keep in mind that the goals are a) a ggplot2 graph and b) a table with the time series, to be viewed outside of R
What is the best way to get a list of objects starting with a list of the names of those objects?
What is the best type of graph to highlight the patterns of 60 time series? The scale is the same, but I predict there will be a lot of overplotting. Since this is a cohort analysis, it might be useful to use color to highlight the different cohorts in terms of recency (as a continuous variable). But how do I avoid overplotting? The differences will be minimal, so faceting might leave the viewer unable to grasp the difference.
I think that you may be approaching this the wrong way:
If you have time series of unequal length then the absolute best thing to do is to keep them as time series and merge them. Most time series packages allow this. So you will end up with a multi-variate time series and each value will be properly associated with the same date.
So put your time series into zoo objects, merge them, then use my qplot.zoo function to plot them. That will deal with switching from zoo into a long data frame.
Here's an example:
> z1 <- zoo(1:8, 1:8)
> z2 <- zoo(2:8, 2:8)
> z3 <- zoo(4:8, 4:8)
> nm <- list("z1", "z2", "z3")
> z <- zoo()
> for(i in 1:length(nm)) z <- merge(z, get(nm[[i]]))
> names(z) <- unlist(nm)
> z
z1 z2 z3
1 1 NA NA
2 2 2 NA
3 3 3 NA
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
>
> x.df <- data.frame(dates=index(z), coredata(z))
> x.df <- melt(x.df, id="dates", variable="val")
> ggplot(na.omit(x.df), aes(x=dates, y=value, group=val, colour=val)) + geom_line() + theme(legend.position = "none")
If you're doing it just because ggplot2 (as well as many other things) likes data frames, then what you're missing is that you need the data in long-format data frames. Yes, you just put all of your response variables in one column, concatenated together. Then you have one or more other columns that identify what makes those responses different. That's the best way to set the data up for things like ggplot.
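As a minimal illustration of the long format (a sketch assuming two of the question's vectors, vector1 and vector2, exist in the workspace):

long <- data.frame(
  series = rep(c("vector1", "vector2"),
               times = c(length(vector1), length(vector2))),
  index  = c(seq_along(vector1), seq_along(vector2)),
  value  = c(vector1, vector2)
)
# ggplot(long, aes(x = index, y = value, colour = series)) + geom_line()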
You can't. A data.frame() has to be rectangular; recycling rules ensure that shorter vectors get expanded, but only when the lengths are compatible.
So you may have a different error here -- the data that you want to rbind may not be suitable -- but it is hard to tell, as you did not supply a reproducible example.
Edit: Given your update, you get precisely what you asked for: a list of names gets combined by rbind. If you want the underlying data to appear, you need to involve get() or another data accessor.
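For example, since mylist holds names rather than the objects themselves, something like this retrieves the vectors (a sketch; it assumes the vectors live in the global environment):

vectors <- mget(unlist(mylist), envir = globalenv())
# or, equivalently:
vectors <- lapply(mylist, get)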
