Use Factor Vector to Lookup Value in Data Frame - r

I have a vector
> head(gbmPred)
[1] COMPLETED DEAD COMPLETED COMPLETED COMPLETED LOW
I also have a data frame
> head(gbmPredProb)
COLLECTION COMPLETED DEAD LOW
1 0.04535981 0.8639282 0.07698963 0.01372232
2 0.19031127 0.6680874 0.11708416 0.02451713
3 0.25004446 0.6789679 0.04827067 0.02271702
4 0.09625138 0.7877128 0.09906595 0.01696983
5 0.15696875 0.7617585 0.04441733 0.03685539
6 0.14157307 0.7690410 0.06057754 0.02880836
I want to be create a vector by using the levels in gbmPred to lookup the values in gbmPredProb:
0.8639282
0.1170841
0.6789679
0.7877128
0.7617585
0.02880836
Does anyone know how to do this in R? Appreciate the help.
EDIT *** Sorry copy and paste error. Fixed above
The first value .86 matches COMPLETED
the second value .11 matches DEAD
WHat I am looking for is to loop through the vector gbmPred to get the value (COMPLETED,etc), then search gbmPredProb data frame for the value matching the column with the same name as well as the index of the vector.
So, the first value is COMPLETED. Look at gbmPredProb and get .863
The second value of gbmPred is DEAD. Look at gbmPredProb and get .117
the thrid value of gbmPred is COMPLETED. Look at gbmPredProb and get .678

If you have a bunch of (row, col) pairs that you want to grab out of a matrix, a good way to get them is to index by a 2-column matrix where the first column is all the row numbers of the elements you want and the second column is all the column numbers of the elements you want:
gbmPredProb[cbind(1:length(gbmPred), match(gbmPred, names(gbmPredProb)))]
# [1] 0.86392820 0.11708416 0.67896790 0.78771280 0.76175850
# [6] 0.02880836
One advantage of this sort of an approach is that it will be a good deal quicker than a row-by-row approach on larger data frames:
gbmPredProb <- gbmPredProb[rep(1:6, each=1000),] # 6000x4
gbmPred <- rep(gbmPred, each=1000) # Length 6000
josilber <- function(mat, vec) mat[cbind(1:length(vec), match(vec, names(mat)))]
rscriven <- function(mat, vec) sapply(seq_along(vec), function(i) mat[i, as.character(vec[i])])
all.equal(josilber(gbmPredProb, gbmPred), rscriven(gbmPredProb, gbmPred))
# [1] TRUE
library(microbenchmark)
microbenchmark(josilber(gbmPredProb, gbmPred), rscriven(gbmPredProb, gbmPred))
# Unit: microseconds
# expr min lq median uq max neval
# josilber(gbmPredProb, gbmPred) 328.524 398.8545 442.065 512.949 766.082 100
# rscriven(gbmPredProb, gbmPred) 97843.015 111478.4360 117294.079 123901.368 254645.966 100

Related

Count Distinct values from 2 columns

I need to get only distinct values that are spread over two columns and return the distinct values into one column.
Example:
colA colB
---- --------
darcy elizabeth
elizabeth darcy
jon doe
doe joe
It should return:
resultCol
darcy
elizabeth
jon
doe
Is there any builtin function or library that can do that more efficiently?
I tried a workaround to get the results but it is extremely slow for more than 100 thousands observations.
#First i create a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Then i create an empty dataframe to store all values in a single column
emptyDataframe<-data.frame(resultColumn=character())
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,1])),stringsAsFactors=FALSE)
}
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,2])),stringsAsFactors=FALSE)
}
emptyDataframe
#Finally i get the distinct values using dplyr
var_distinct_values<-distinct(emptyDataframe)
I use union to get unique values across specific columns:
with(dfSample, union(col1, col2))
PS: The answer from d.b in the comments is also another way.
You can improvise his answer if you have extra columns but want to run it only over specific columns:
unique(unlist(dfSample[1:2]))
This gets the unique values from first two columns.
Here is a general purpose solution.
It's based on this answer but can be extended to any number of columns as long as the object is a data.frame or list.
Reduce(union, dfSample)
[1] "darcy" "elizabeth" "john" "doe"
Now with 100K observations in each of 10 columns.
set.seed(1234)
n <- 1e5
bigger <- replicate(n, sample(c(col1, col2), 10, TRUE))
bigger <- as.data.frame(bigger)
system.time(Reduce(union, bigger))
# user system ellapsed
# 3.769 0.000 3.772
Edit.
After a second thought, I realized that the test above is run with a dataframe with a very small number of different values. A test with a larger number does not necessarily give the same results.
set.seed(1234)
s <- sprintf("%05d", 1:5000)
big2 <- replicate(n, sample(s, 10, TRUE))
big2 <- as.data.frame(big2)
rm(s)
microbenchmark::microbenchmark(
red = Reduce(union, big2),
uniq = unique(unlist(big2)),
times = 10
)
#Unit: seconds
# expr min lq mean median uq max neval cld
# red 26.021855 26.42693 27.470746 27.198807 28.56720 29.022047 10 b
# uniq 1.405091 1.42978 1.632265 1.548753 1.56691 2.693431 10 a
The unique/unlist solution is now clearly better.

Difference between 'select' and '$' in R

I want to understand the speed difference between select and $ to subset columns in R (whilst appreciating that they do not return exactly the same things, rather both perform the conceptual get-me-a-column operation). I would like to understand when either is most appropriate.
Specifically, under what conditions would the following select statement be faster than the corresponding $ statement?
Syntax is:
select(df, colName1, colName2, ...)
df$colName
In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.
Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
dplyr returns a different (more complex) object.
Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour); may be harder to read and/to maintain; return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.
This might also help tease out (if one is wont to avoid looking at source code of packages) that dplyr is doing alot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the ops are "give me this column" ops, so read it with that context:
library(dplyr)
microbenchmark::microbenchmark(
base1 = mtcars$cyl, # returns a vector
base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
base3 = mtcars[,"cyl"], # returns a vector
base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.682 6.3860 9.23727 7.7125 10.6050 25.397 100
## base2 4.224 5.9905 9.53136 7.7590 11.1095 27.329 100
## base2a 3.710 5.5380 7.92479 7.0845 10.1045 16.026 100
## base3 6.312 10.9935 13.99914 13.1740 16.2715 37.765 100
## base4 51.084 70.3740 92.03134 76.7350 95.9365 662.395 100
## dplyr1 698.954 742.9615 978.71306 784.8050 1154.6750 3568.188 100
## dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388 100
## dplyr3 64.299 78.3745 126.97205 85.3110 112.1000 2383.731 100
## dplyr4 63.235 73.0450 99.28021 85.1080 114.8465 263.219 100
But, what if we have alot of columns:
# Make a wider version of mtcars
do.call(
cbind.data.frame,
lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
) -> mtcars_manycols
# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
base1 = mtcars_manycols$cyl_4, # returns a vector
base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
base3 = mtcars_manycols[,"cyl_4"], # returns a vector
base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.534 6.8535 12.15802 8.7865 13.1775 75.095 100
## base2 4.150 6.5390 11.59937 9.3005 13.2220 73.332 100
## base2a 3.904 5.9755 10.73095 7.5820 11.2715 61.687 100
## base3 6.255 11.5270 16.42439 13.6385 18.6910 70.106 100
## base4 66.175 89.8560 118.37694 99.6480 122.9650 340.653 100
## dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705 9354.698 100
## dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716 100
## dplyr3 124.295 142.9535 216.89692 166.7115 209.1550 1138.368 100
## dplyr4 127.280 150.0575 195.21398 169.5285 209.0480 488.199 100
For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not an attribute of the "tidyverse" but the speed of development and expressiveness usually outweigh the speed difference.
NOTE: dplyr verbs are likely better candidates than subset() and — while I lazily use $ it's also a tad dangerous due to default partial matching behaviour as is [[]] without exact=TRUE. A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
It is not the same. If you're looking for the same functionality you could consider pull() from the same dplyr package.
Dollarsign returns a vector 'build' from the dataframe, pull does the same.
select is in the dplyr package, part of the tidyverse. https://dplyr.tidyverse.org/
you might do something like
df %>%
select(colName1, colName2)
Which would select those columns from df. These statements are written like verbs (e.g. select, arrange, group_by, etc.) and makes it much easier to work with data.
$ is from base r. It would show you only that column from df.

Millions of tiny matches in R : need performance

I have a one million length vector of words called WORDS. I got a 9 millions objects list called SENTENCES. Each object of my list is a sentence which is represented by a 10-50 length vector of words. Here is an example :
head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"
SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"
I want to convert every sentence of my list into a numeric vector whose elements correspond to the position of the sentence's word in the WORDS big vector.
Actually, I know how to do it with that command :
convert <- function(sentence){
return(which(WORDS %in% sentence))
}
SENTENCES_NUM <- lapply(SENTENCES, convert)
The problem is that it takes way too long time. I mean my RStudio blows up although i got a 16Go RAM computer. So the question is do you have any ideas to speed up the computation?
fastmatch, a small package by an R core person, hashes the lookups so the initial and especially subsequent searches are faster.
What you are really doing is making a factor with predefined levels common to each sentence. The slow step in his C code is sorting the factor levels, which you can avoid by providing the (unique) list of factor levels to his fast version of the factor function.
If you just want the integer positions, you can easily convert from factor to integer: many do this inadvertently.
You don't actually need a factor at all for what you want, just match. Your code also generates a logical vector, then recalculates positions from it: match just goes straight to the positions.
library(fastmatch)
library(microbenchmark)
WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]
words_factor <- as.factor(WORDS)
# generate 100 sentences of between 5 and 15 words:
SENTENCES <- lapply(c(1:100), sample, x = WORDS, size = sample(c(5:15), size = 1))
bench_fun <- function(fun)
lapply(SENTENCES, fun)
# poster's slow solution:
hg_convert <- function(sentence)
return(which(WORDS %in% sentence))
jw_convert_match <- function(sentence)
match(sentence, WORDS)
jw_convert_match_factor <- function(sentence)
match(sentence, words_factor)
jw_convert_fastmatch <- function(sentence)
fmatch(sentence, WORDS)
jw_convert_fastmatch_factor <- function(sentence)
fmatch(sentence, words_factor)
message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
bench_fun(jw_convert_match),
bench_fun(jw_convert_match_factor),
bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(c(1:1e6), sample, x = WORDS, size = sample(c(5:15), size = 1))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
I put this on https://gist.github.com/jackwasey/59848d84728c0f55ef11
The results don't format very well, suffice to say, fastmatch with or without factor input is dramatically faster.
# starting benchmark one
Unit: microseconds
expr min lq mean median uq max neval
bench_fun(hg_convert) 665167.953 678451.008 704030.2427 691859.576 738071.699 777176.143 10
bench_fun(jw_convert_match) 878269.025 950580.480 962171.6683 956413.486 990592.691 1014922.639 10
bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764 10
bench_fun(jw_convert_fastmatch) 203.031 220.134 462.1246 289.647 305.070 2196.906 10
bench_fun(jw_convert_fastmatch_factor) 251.474 300.729 1351.6974 317.439 362.127 10604.506 10
# starting benchmark two, compare with factor vs vector of words
Unit: seconds
expr min lq mean median uq max neval
bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648 10
bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907 10
And therefore I wouldn't go to the trouble of a parallel implementation just yet.
Won't be faster, but it is the tidy way of going about things.
library(dplyr)
library(tidyr)
sentence =
data_frame(word.name = SENTENCES,
sentence.ID = 1:length(SENTENCES) %>%
unnest(word.name)
word = data_frame(
word.name = WORDS,
word.ID = 1:length(WORDS)
sentence__word =
sentence %>%
left_join(word)

'which' command in R with case insensitive

I am trying to find indexes within a data frame which holds a certain string. But I would like my string to be case insensitive.
Say, I want to search for column number in my data frame called COLUMN73 and I expect it to return 73 because it is the seventy third column. I have,
which(names(mydata) == "COLUMN73")
Is it possible to make my search string case insensitive so as to get 73 even if I search for say, CoLumN73 ?
You can index it with grepl by using the ignore.case argument
x <- c("col7", "COL73", "Col17", "CoL73", "cOl73")
grepl("col73", x, ignore.case=TRUE)
# [1] FALSE TRUE FALSE TRUE TRUE
Similarly, grep returns the numeric index
grep("col73", x, ignore.case=TRUE)
# [1] 2 4 5
For data frame column subsets
df[grepl("col73", names(df), ignore.case=TRUE)]
You can convert your names to upper cases
which(toupper(names(mydata)) == "COLUMN73")
Completely edited, with a correction for Will's code. Thanks to David Arenburg for pointing this out.
x <- rep(c("col7", "COL73", "Col17","COLUMN73", "CoL73", "cOl73"),1e4)
scriven<- function(x) grepl("COLUMN73", x, ignore.case=TRUE)
will<-function(x) which(toupper((x)) == "COLUMN73")
microbenchmark(scriven(x),will(x))
Unit: milliseconds
expr min lq median uq max neval
scriven(x) 30.55911 33.04852 34.91243 37.01039 39.59833 100
will(x) 26.10728 26.47967 27.21592 28.76291 30.46163 100

How to compare two different columns(both contains string) efficiently in R?

Suppose A is a data frame and structure of A is as follows
Row no C1 C2
1 I am fine 1234
2 He is fine 1234
3 am better 1234
4 better butter 1234
5 fine good 1234
6 good to be better 1234
and B is another data frame such that
Row no C1
1 fine
2 good
I want to compare A$C1 with B$C1 and the string in B$C1 should be contains in A$C1.
So when I will compare A$C1 with B$C1 the result will be the row number in A which contains the B's strings. For the above scenario the output will be 1, 2, 5,6 as 1,2,5 contains the word "fine" and 6 as it contains the word "good". I don't want to compare "good" with row 5 of A as I have already selected row 5. I want an efficient solution for this as the number of rows for my real data (A) set is around 400000 and B is around 10000
This function
phrasesWithWords <- function(x, table)
{
words <- strsplit(x, "\\W")
found <- relist(unlist(words) %in% table, words)
which(sapply(found, any))
}
works on your phrases and a table of acceptable words:
phrase <- c("I am fine", "He is fine", "am better", "better butter",
"fine good", "good to be better")
table <- c("fine", "good")
phrasesWithWords(phrase, table)
The function works by spliting the phrases into words, then looking up each word (without looping through the long list of phrases) in the table, re-listing the logical vector, and asking which list elements contain at least one TRUE.
This turns out not to be so efficient compared to a simple grep solution
f1 <- function(x, table)
grep(paste(table, collapse="|"), x)
with
library(microbenchmark)
x1000 <- rep(x, 1000)
giving
> microbenchmark(phrasesWithWords(x1000, table), f1(x1000, table),
+ times=5)
Unit: milliseconds
expr min lq median uq
phrasesWithWords(x1000, table) 130.167172 132.815303 133.011161 133.112888
f1(x1000, table) 2.959576 2.973416 2.990412 3.060494
max neval
134.504282 5
3.439293 5
The pretty neat package "lineprof" shows that for a modified function
f0 <- function(x, table)
{
words <- strsplit(x, "\\W")
idx <- unlist(words) %in% table
found <- relist(idx, words)
which(sapply(found, any))
}
the main bottleneck is in relist
> lineprof(f0(x1000, table))
Reducing depth to 2 (from 7)
Common path: words.R!30719TCY
time alloc release dups ref src
1 0.003 0.668 0 0 words.R!30719TCY#3 f0/strsplit
2 0.024 28.240 0 17393 words.R!30719TCY#5 f0/relist
3 0.003 3.959 0 6617 words.R!30719TCY#6 f0/which
leading to a more elaborate approach
f2 <- function(x, table)
{
words <- strsplit(x, "\\W")
len <- cumsum(sapply(words, length))
idx <- cumsum(unlist(words) %in% table)
which(idx[len] != c(0, idx[head(len, -1)]))
}
which is somewhat better-performing
> identical(f2(x1000, table), f1(x1000, table))
[1] TRUE
> microbenchmark(f2(x1000, table), f1(x1000, table), times=5)
Unit: milliseconds
expr min lq median uq max neval
f2(x1000, table) 25.426832 25.815504 25.844033 26.075279 26.387559 5
f1(x1000, table) 2.963365 2.968197 2.984395 2.984423 3.129873 5
I think both f2 and f1 would scale well enough to the problem in the original question, provided there is sufficient memory (if the table of acceptable words is small compared to the phrases, then I think the grep approach will actually be more memory efficient; in the end I think I might up-vote the simple grep solution!). Maybe the major limitation with the grep approach is that the size of the regular expression is limited, on my computer at about 2560 terms
> grep(paste(as.character(1:2559), collapse="|"), "1")
[1] 1
> grep(paste(as.character(1:2560), collapse="|"), "1")
Error in grep(paste(as.character(1:2560), collapse = "|"), "1") :
invalid regular expression '1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|4
grep can do the job for you:
grep(paste(B$C1, collapse="|"), A$C1)
1 2 5 6
The code above gets you all the lines in A$C1 that contains at least one word of B$C1, that is, lines 1, 2, 5 and 6. The first argument is a regular expression, that's why we collapse the words with "|" (that means "or").
And it seems scalable. Benchmarking with 100.000 sample phrases (from your phrases) and the two words, grep takes only 0.076 seconds.

Resources