How to compare two different columns(both contains string) efficiently in R? - r

Suppose A is a data frame and structure of A is as follows
Row no C1 C2
1 I am fine 1234
2 He is fine 1234
3 am better 1234
4 better butter 1234
5 fine good 1234
6 good to be better 1234
and B is another data frame such that
Row no C1
1 fine
2 good
I want to compare A$C1 with B$C1 and the string in B$C1 should be contains in A$C1.
So when I will compare A$C1 with B$C1 the result will be the row number in A which contains the B's strings. For the above scenario the output will be 1, 2, 5,6 as 1,2,5 contains the word "fine" and 6 as it contains the word "good". I don't want to compare "good" with row 5 of A as I have already selected row 5. I want an efficient solution for this as the number of rows for my real data (A) set is around 400000 and B is around 10000

This function
phrasesWithWords <- function(x, table)
{
words <- strsplit(x, "\\W")
found <- relist(unlist(words) %in% table, words)
which(sapply(found, any))
}
works on your phrases and a table of acceptable words:
phrase <- c("I am fine", "He is fine", "am better", "better butter",
"fine good", "good to be better")
table <- c("fine", "good")
phrasesWithWords(phrase, table)
The function works by spliting the phrases into words, then looking up each word (without looping through the long list of phrases) in the table, re-listing the logical vector, and asking which list elements contain at least one TRUE.
This turns out not to be so efficient compared to a simple grep solution
f1 <- function(x, table)
grep(paste(table, collapse="|"), x)
with
library(microbenchmark)
x1000 <- rep(x, 1000)
giving
> microbenchmark(phrasesWithWords(x1000, table), f1(x1000, table),
+ times=5)
Unit: milliseconds
expr min lq median uq
phrasesWithWords(x1000, table) 130.167172 132.815303 133.011161 133.112888
f1(x1000, table) 2.959576 2.973416 2.990412 3.060494
max neval
134.504282 5
3.439293 5
The pretty neat package "lineprof" shows that for a modified function
f0 <- function(x, table)
{
words <- strsplit(x, "\\W")
idx <- unlist(words) %in% table
found <- relist(idx, words)
which(sapply(found, any))
}
the main bottleneck is in relist
> lineprof(f0(x1000, table))
Reducing depth to 2 (from 7)
Common path: words.R!30719TCY
time alloc release dups ref src
1 0.003 0.668 0 0 words.R!30719TCY#3 f0/strsplit
2 0.024 28.240 0 17393 words.R!30719TCY#5 f0/relist
3 0.003 3.959 0 6617 words.R!30719TCY#6 f0/which
leading to a more elaborate approach
f2 <- function(x, table)
{
words <- strsplit(x, "\\W")
len <- cumsum(sapply(words, length))
idx <- cumsum(unlist(words) %in% table)
which(idx[len] != c(0, idx[head(len, -1)]))
}
which is somewhat better-performing
> identical(f2(x1000, table), f1(x1000, table))
[1] TRUE
> microbenchmark(f2(x1000, table), f1(x1000, table), times=5)
Unit: milliseconds
expr min lq median uq max neval
f2(x1000, table) 25.426832 25.815504 25.844033 26.075279 26.387559 5
f1(x1000, table) 2.963365 2.968197 2.984395 2.984423 3.129873 5
I think both f2 and f1 would scale well enough to the problem in the original question, provided there is sufficient memory (if the table of acceptable words is small compared to the phrases, then I think the grep approach will actually be more memory efficient; in the end I think I might up-vote the simple grep solution!). Maybe the major limitation with the grep approach is that the size of the regular expression is limited, on my computer at about 2560 terms
> grep(paste(as.character(1:2559), collapse="|"), "1")
[1] 1
> grep(paste(as.character(1:2560), collapse="|"), "1")
Error in grep(paste(as.character(1:2560), collapse = "|"), "1") :
invalid regular expression '1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|4

grep can do the job for you:
grep(paste(B$C1, collapse="|"), A$C1)
1 2 5 6
The code above gets you all the lines in A$C1 that contains at least one word of B$C1, that is, lines 1, 2, 5 and 6. The first argument is a regular expression, that's why we collapse the words with "|" (that means "or").
And it seems scalable. Benchmarking with 100.000 sample phrases (from your phrases) and the two words, grep takes only 0.076 seconds.

Related

Count Distinct values from 2 columns

I need to get only distinct values that are spread over two columns and return the distinct values into one column.
Example:
colA colB
---- --------
darcy elizabeth
elizabeth darcy
jon doe
doe joe
It should return:
resultCol
darcy
elizabeth
jon
doe
Is there any builtin function or library that can do that more efficiently?
I tried a workaround to get the results but it is extremely slow for more than 100 thousands observations.
#First i create a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Then i create an empty dataframe to store all values in a single column
emptyDataframe<-data.frame(resultColumn=character())
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,1])),stringsAsFactors=FALSE)
}
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,2])),stringsAsFactors=FALSE)
}
emptyDataframe
#Finally i get the distinct values using dplyr
var_distinct_values<-distinct(emptyDataframe)
I use union to get unique values across specific columns:
with(dfSample, union(col1, col2))
PS: The answer from d.b in the comments is also another way.
You can improvise his answer if you have extra columns but want to run it only over specific columns:
unique(unlist(dfSample[1:2]))
This gets the unique values from first two columns.
Here is a general purpose solution.
It's based on this answer but can be extended to any number of columns as long as the object is a data.frame or list.
Reduce(union, dfSample)
[1] "darcy" "elizabeth" "john" "doe"
Now with 100K observations in each of 10 columns.
set.seed(1234)
n <- 1e5
bigger <- replicate(n, sample(c(col1, col2), 10, TRUE))
bigger <- as.data.frame(bigger)
system.time(Reduce(union, bigger))
# user system ellapsed
# 3.769 0.000 3.772
Edit.
After a second thought, I realized that the test above is run with a dataframe with a very small number of different values. A test with a larger number does not necessarily give the same results.
set.seed(1234)
s <- sprintf("%05d", 1:5000)
big2 <- replicate(n, sample(s, 10, TRUE))
big2 <- as.data.frame(big2)
rm(s)
microbenchmark::microbenchmark(
red = Reduce(union, big2),
uniq = unique(unlist(big2)),
times = 10
)
#Unit: seconds
# expr min lq mean median uq max neval cld
# red 26.021855 26.42693 27.470746 27.198807 28.56720 29.022047 10 b
# uniq 1.405091 1.42978 1.632265 1.548753 1.56691 2.693431 10 a
The unique/unlist solution is now clearly better.

How to make factor names appear in ifelse statement in R?

I have to following dataset. I want to create a column so that if there is a number in the unid column then in dat$identification I want it to say "unidentified" otherwise I want it to print whatever is there in the species column. So the final output should look like dat$identificaiton x,y,unidentified,unidentified. With this code it shows 1,2,unidentified,unidentified.
Please note, for other purposes I want to use only the unid column for the !(is.na) part of the ifelse statement and not the species.
unid <- c(NA,NA,1,4)
species <- c("x","y",NA,NA)
df <- data.frame(unid, species)
df$identification <- ifelse(!is.na(unid), "unidentified", df$species)
#Current Output of df$identification:
1,2,unidentified,unidentified
#Needed Output
x,y,unidentified,unidentified
You can coerce the column of class 'factorto classcharacterin theifelse`.
df$identification <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df
# unid species identification
#1 NA x x
#2 NA y y
#3 1 <NA> unidentified
#4 4 <NA> unidentified
Edit.
After the OP accepted the answer, I reminded myself that ifelse is slow and indexing fast, so I tested both using a larger dataset.
First of all, see if both solutions produce the same results:
df$id1 <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df$id2 <- "unidentified"
df$id2[is.na(unid)] <- species[is.na(unid)]
identical(df$id1, df$id2)
#[1] TRUE
The results are the same.
Now time them both using package microbenchmark.
n <- 1e4
df1 <- data.frame(unid = rep(unid, n), species = rep(species, n))
microbenchmark::microbenchmark(
ifelse = {df1$id1 <- ifelse(!is.na(df1$unid), "unidentified", as.character(df1$species))},
index = {df1$id2 <- "unidentified"
df1$id2[is.na(df1$unid)] <- species[is.na(df1$unid)]
},
relative = TRUE
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# ifelse 12502465 12749881 16080160.39 14365841 14507468.5 85836870 100 c
# index 3243697 3299628 4575818.33 3326692 4983170.0 74526390 100 b
#relative 67 68 208.89 228 316.5 540 100 a
On average, indexing is 200 times faster. More than worth the trouble to write two lines of code instead of just one for ifelse.

Replace values in a dataframe based on lookup table

I am having some trouble replacing values in a dataframe. I would like to replace values based on a separate table. Below is an example of what I am trying to do.
I have a table where every row is a customer and every column is an animal they purchased. Lets call this dataframe table.
> table
# P1 P2 P3
# 1 cat lizard parrot
# 2 lizard parrot cat
# 3 parrot cat lizard
I also have a table that I will reference called lookUp.
> lookUp
# pet class
# 1 cat mammal
# 2 lizard reptile
# 3 parrot bird
What I want to do is create a new table called new with a function replaces all values in table with the class column in lookUp. I tried this myself using an lapply function, but I got the following warnings.
new <- as.data.frame(lapply(table, function(x) {
gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)
Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
Any ideas on how to make this work?
You posted an approach in your question which was not bad. Here's a smiliar approach:
new <- df # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])
An alternative approach which will be faster is:
new <- df
new[] <- look$class[match(unlist(df), look$pet)]
Note that I use empty brackets ([]) in both cases to keep the structure of new as it was (a data.frame).
(I'm using df instead of table and look instead of lookup in my answer)
Another options is a combination of tidyr and dplyr
library(dplyr)
library(tidyr)
table %>%
gather(key = "pet") %>%
left_join(lookup, by = "pet") %>%
spread(key = pet, value = class)
Anytime you have two separate data.frames and are trying to bring info from one to the other, the answer is to merge.
Everyone has their own favorite merge method in R. Mine is data.table.
Also, since you want to do this to many columns, it'll be faster to melt and dcast -- rather than loop over columns, apply it once to a reshaped table, then reshape again.
library(data.table)
#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE)
setDT(lookUp)
#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')
# merging
][lookup, new_value := i.class, on = c(value = 'pet')
#reform back to original shape
][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
# rn P1 P2 P3
# 1: 1 mammal reptile bird
# 2: 2 reptile bird mammal
# 3: 3 bird mammal reptile
In case you find the dcast/melt bit a bit intimidating, here's an approach that just loops over columns; dcast/melt is simply sidestepping the loop for this problem.
setDT(table) #don't need row names this time
setDT(lookUp)
sapply(names(table), #(or to whichever are the relevant columns)
function(cc) table[lookUp, (cc) := #merge, replace
#need to pass a _named_ vector to 'on', so use setNames
i.class, on = setNames("pet", cc)])
Make a named vector, and loop through every column and match, see:
# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1
# cat lizard parrot
# "mammal" "reptile" "bird"
# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL
# res
# P1 P2 P3
# 1 mammal reptile bird
# 2 reptile bird mammal
# 3 bird mammal reptile
data
df1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
lookUp <- read.table(text = "
pet class
1 cat mammal
2 lizard reptile
3 parrot bird", header = TRUE)
I did it using the factor built-in.
table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)
The answer above showing how to do this in dplyr doesn't answer the question, the table is filled with NAs. This worked, I would appreciate any comments showing a better way:
# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>%
# put in long format, naming column filled with P1, P2, P3 "petCount"
gather(key="petCount", value="pet", -customer) %>%
# add a new column based on the pet's class in data frame "lookup"
left_join(lookup, by="pet") %>%
# since you wanted to replace the values in "table" with their
# "class", remove the pet column
select(-pet) %>%
# put data back into wide format
spread(key="petCount", value="class")
Note that it would likely be useful to keep the long table that contains the customer, the pet, the pet's species(?) and their class. This example simply adds an intermediary save to a variable:
table$customer = seq(nrow(table))
petClasses <- table %>%
gather(key="petCount", value="pet", -customer) %>%
left_join(lookup, by="pet")
custPetClasses <- petClasses %>%
select(-pet) %>%
spread(key="petCount", value="class")
I tried other approaches and they took a really long time with my very large dataset. I used the following instead:
# make table "new" using ifelse. See data below to avoid re-typing it
new <- ifelse(table1 =="cat", "mammal",
ifelse(table1 == "lizard", "reptile",
ifelse(table1 =="parrot", "bird", NA)))
This method requires you to write more text for your code, but the vectorization of ifelse makes it run faster. You have to decide, based on your data, if you want to spend more time writing code or waiting for your computer to run. If you want to make sure it worked (you didn't have any typos in your iflese commands), you can use apply(new, 2, function(x) mean(is.na(x))).
data
# create the data table
table1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
Benchmark
Out of burning curiosity, I just ran a benchmark with some of the approaches that I want to share with you. I couldn't quite believe some of the statements about performance in the answers and am trying to clarify this herewith. In order not to be misled by different rows/columns ratios, I consider three scenarios:
ncol == nrow
ncol << nrow
ncol >> nrow.
It might be beneficial to coerce as.matrix beforehand, so I included this as an additional solution (unlist_mat).
microbenchmark::microbenchmark(
lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)], ## added
ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
ifelse(Dat3[col_set] == "lizard", "reptile",
ifelse(Dat3[col_set] == "parrot", "bird", NA))),
look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
times=3L
)
## 1e3 x 1e3
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lapply 40.42905 63.47053 78.03831 86.51201 96.84294 107.17387 3 a
# unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297 3 b
# unlist_mat 45.91743 56.51087 68.50595 67.10432 79.80021 92.49611 3 a
# ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581 3 ab
# look_vec 58.54449 88.40293 112.91165 118.26137 140.09522 161.92908 3 a
## 1e4 x 1e4
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.427077 3.558234 3.992481 4.689390 4.775183 4.860977 3 a
# unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084 3 b
# unlist_mat 4.940254 5.011684 5.576553 5.083114 5.894703 6.706291 3 a
# ifelse 9.714553 14.444899 36.176777 19.175244 49.407889 79.640535 3 a
# look_vec 8.460969 8.558600 8.784463 8.656230 8.946209 9.236188 3 a
## 1e5 x 1e3
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.314427 2.403001 3.270708 2.491575 3.748848 5.006120 3 a
# unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586 3 b
# unlist_mat 5.018869 5.060865 5.638499 5.102861 5.948314 6.793767 3 a
# ifelse 6.244744 16.488266 39.208119 26.731788 55.689807 84.647825 3 ab
# look_vec 4.512672 6.434651 7.496267 8.356630 8.988064 9.619498 3 a
## 1e3 x 1e5
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 52.833019 55.373432 71.308981 57.913845 80.546963 103.180080 3 ab
# unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819 3 c
# unlist_mat 3.872551 4.422904 4.695393 4.973257 5.106814 5.240372 3 a
# ifelse 72.592437 76.473418 103.930063 80.354399 119.598876 158.843354 3 b
# look_vec 56.444824 58.904604 62.677267 61.364383 65.793488 70.222593 3 ab
Note: Performed on an Intel(R) Xeon(R) CPU E5-2690 v4 # 2.60GHz using R --vanilla.
all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1)) ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!
Conclusion
Using lapply with a lookup matrix appears to be a good choice if the number of columns is rather low/lower than the number of rows. If we have many columns, especially compared to rows, we might benefit from coercing the respective columns of the data frame into a matrix first, which should only take a blink of an eye.
set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
Mat <- as.matrix(Dat)
)
# user system elapsed
# 0.844 0.318 1.161
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat

Millions of tiny matches in R : need performance

I have a one million length vector of words called WORDS. I got a 9 millions objects list called SENTENCES. Each object of my list is a sentence which is represented by a 10-50 length vector of words. Here is an example :
head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"
SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"
I want to convert every sentence of my list into a numeric vector whose elements correspond to the position of the sentence's word in the WORDS big vector.
Actually, I know how to do it with that command :
convert <- function(sentence){
return(which(WORDS %in% sentence))
}
SENTENCES_NUM <- lapply(SENTENCES, convert)
The problem is that it takes way too long time. I mean my RStudio blows up although i got a 16Go RAM computer. So the question is do you have any ideas to speed up the computation?
fastmatch, a small package by an R core person, hashes the lookups so the initial and especially subsequent searches are faster.
What you are really doing is making a factor with predefined levels common to each sentence. The slow step in his C code is sorting the factor levels, which you can avoid by providing the (unique) list of factor levels to his fast version of the factor function.
If you just want the integer positions, you can easily convert from factor to integer: many do this inadvertently.
You don't actually need a factor at all for what you want, just match. Your code also generates a logical vector, then recalculates positions from it: match just goes straight to the positions.
library(fastmatch)
library(microbenchmark)
WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]
words_factor <- as.factor(WORDS)
# generate 100 sentences of between 5 and 15 words:
SENTENCES <- lapply(c(1:100), sample, x = WORDS, size = sample(c(5:15), size = 1))
bench_fun <- function(fun)
lapply(SENTENCES, fun)
# poster's slow solution:
hg_convert <- function(sentence)
return(which(WORDS %in% sentence))
jw_convert_match <- function(sentence)
match(sentence, WORDS)
jw_convert_match_factor <- function(sentence)
match(sentence, words_factor)
jw_convert_fastmatch <- function(sentence)
fmatch(sentence, WORDS)
jw_convert_fastmatch_factor <- function(sentence)
fmatch(sentence, words_factor)
message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
bench_fun(jw_convert_match),
bench_fun(jw_convert_match_factor),
bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(c(1:1e6), sample, x = WORDS, size = sample(c(5:15), size = 1))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
I put this on https://gist.github.com/jackwasey/59848d84728c0f55ef11
The results don't format very well, suffice to say, fastmatch with or without factor input is dramatically faster.
# starting benchmark one
Unit: microseconds
expr min lq mean median uq max neval
bench_fun(hg_convert) 665167.953 678451.008 704030.2427 691859.576 738071.699 777176.143 10
bench_fun(jw_convert_match) 878269.025 950580.480 962171.6683 956413.486 990592.691 1014922.639 10
bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764 10
bench_fun(jw_convert_fastmatch) 203.031 220.134 462.1246 289.647 305.070 2196.906 10
bench_fun(jw_convert_fastmatch_factor) 251.474 300.729 1351.6974 317.439 362.127 10604.506 10
# starting benchmark two, compare with factor vs vector of words
Unit: seconds
expr min lq mean median uq max neval
bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648 10
bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907 10
And therefore I wouldn't go to the trouble of a parallel implementation just yet.
Won't be faster, but it is the tidy way of going about things.
library(dplyr)
library(tidyr)
sentence =
data_frame(word.name = SENTENCES,
sentence.ID = 1:length(SENTENCES) %>%
unnest(word.name)
word = data_frame(
word.name = WORDS,
word.ID = 1:length(WORDS)
sentence__word =
sentence %>%
left_join(word)

Use Factor Vector to Lookup Value in Data Frame

I have a vector
> head(gbmPred)
[1] COMPLETED DEAD COMPLETED COMPLETED COMPLETED LOW
I also have a data frame
> head(gbmPredProb)
COLLECTION COMPLETED DEAD LOW
1 0.04535981 0.8639282 0.07698963 0.01372232
2 0.19031127 0.6680874 0.11708416 0.02451713
3 0.25004446 0.6789679 0.04827067 0.02271702
4 0.09625138 0.7877128 0.09906595 0.01696983
5 0.15696875 0.7617585 0.04441733 0.03685539
6 0.14157307 0.7690410 0.06057754 0.02880836
I want to be create a vector by using the levels in gbmPred to lookup the values in gbmPredProb:
0.8639282
0.1170841
0.6789679
0.7877128
0.7617585
0.02880836
Does anyone know how to do this in R? Appreciate the help.
EDIT *** Sorry copy and paste error. Fixed above
The first value .86 matches COMPLETED
the second value .11 matches DEAD
WHat I am looking for is to loop through the vector gbmPred to get the value (COMPLETED,etc), then search gbmPredProb data frame for the value matching the column with the same name as well as the index of the vector.
So, the first value is COMPLETED. Look at gbmPredProb and get .863
The second value of gbmPred is DEAD. Look at gbmPredProb and get .117
the thrid value of gbmPred is COMPLETED. Look at gbmPredProb and get .678
If you have a bunch of (row, col) pairs that you want to grab out of a matrix, a good way to get them is to index by a 2-column matrix where the first column is all the row numbers of the elements you want and the second column is all the column numbers of the elements you want:
gbmPredProb[cbind(1:length(gbmPred), match(gbmPred, names(gbmPredProb)))]
# [1] 0.86392820 0.11708416 0.67896790 0.78771280 0.76175850
# [6] 0.02880836
One advantage of this sort of an approach is that it will be a good deal quicker than a row-by-row approach on larger data frames:
gbmPredProb <- gbmPredProb[rep(1:6, each=1000),] # 6000x4
gbmPred <- rep(gbmPred, each=1000) # Length 6000
josilber <- function(mat, vec) mat[cbind(1:length(vec), match(vec, names(mat)))]
rscriven <- function(mat, vec) sapply(seq_along(vec), function(i) mat[i, as.character(vec[i])])
all.equal(josilber(gbmPredProb, gbmPred), rscriven(gbmPredProb, gbmPred))
# [1] TRUE
library(microbenchmark)
microbenchmark(josilber(gbmPredProb, gbmPred), rscriven(gbmPredProb, gbmPred))
# Unit: microseconds
# expr min lq median uq max neval
# josilber(gbmPredProb, gbmPred) 328.524 398.8545 442.065 512.949 766.082 100
# rscriven(gbmPredProb, gbmPred) 97843.015 111478.4360 117294.079 123901.368 254645.966 100

Resources