Replace values in a dataframe based on lookup table - r

I am having some trouble replacing values in a dataframe. I would like to replace values based on a separate table. Below is an example of what I am trying to do.
I have a table where every row is a customer and every column is an animal they purchased. Let's call this dataframe table.
> table
# P1 P2 P3
# 1 cat lizard parrot
# 2 lizard parrot cat
# 3 parrot cat lizard
I also have a table that I will reference called lookUp.
> lookUp
# pet class
# 1 cat mammal
# 2 lizard reptile
# 3 parrot bird
What I want to do is create a new table called new with a function that replaces all values in table with the corresponding class from lookUp. I tried this myself using an lapply function, but I got the following warnings.
new <- as.data.frame(lapply(table, function(x) {
gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)
Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
Any ideas on how to make this work?

You posted an approach in your question which was not bad. Here's a similar approach:
new <- df # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])
An alternative approach which will be faster is:
new <- df
new[] <- look$class[match(unlist(df), look$pet)]
Note that I use empty brackets ([]) in both cases to keep the structure of new as it was (a data.frame).
(I'm using df instead of table and look instead of lookUp in my answer.)
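To see the difference the empty brackets make, here is a small illustration reusing the df and look objects from this answer (tmp_list and tmp_df are just illustrative names):
tmp_list <- lapply(df, function(x) look$class[match(x, look$pet)])
class(tmp_list) # "list": plain assignment would have replaced the data.frame with a list
tmp_df <- df
tmp_df[] <- lapply(df, function(x) look$class[match(x, look$pet)])
class(tmp_df) # "data.frame": the [] keeps the original structure and only replaces the contents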

Another option is a combination of tidyr and dplyr:
library(dplyr)
library(tidyr)
table %>%
gather(key = "pet") %>%
left_join(lookUp, by = "pet") %>%
spread(key = pet, value = class)

Anytime you have two separate data.frames and are trying to bring info from one to the other, the answer is to merge.
Everyone has their own favorite merge method in R. Mine is data.table.
Also, since you want to do this to many columns, it'll be faster to melt and dcast -- rather than loop over columns, apply it once to a reshaped table, then reshape again.
library(data.table)
#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE)
setDT(lookUp)
#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')
# merging
][lookUp, new_value := i.class, on = c(value = 'pet')
#reform back to original shape
][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
# rn P1 P2 P3
# 1: 1 mammal reptile bird
# 2: 2 reptile bird mammal
# 3: 3 bird mammal reptile
In case you find the dcast/melt bit a bit intimidating, here's an approach that just loops over columns; dcast/melt is simply sidestepping the loop for this problem.
setDT(table) #don't need row names this time
setDT(lookUp)
sapply(names(table), # (or whichever columns are relevant)
function(cc) table[lookUp, (cc) := #merge, replace
#need to pass a _named_ vector to 'on', so use setNames
i.class, on = setNames("pet", cc)])

Make a named lookup vector, then loop through every column and index into it by name:
# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1
# cat lizard parrot
# "mammal" "reptile" "bird"
# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL
# res
# P1 P2 P3
# 1 mammal reptile bird
# 2 reptile bird mammal
# 3 bird mammal reptile
data
df1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
lookUp <- read.table(text = "
pet class
1 cat mammal
2 lizard reptile
3 parrot bird", header = TRUE)

I did it using the factor built-in.
table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)
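If there are many columns, the same idea can be applied to all of them in one call (a sketch, assuming the table and lookUp objects from the question; note the resulting columns are factors, so wrap in as.character if you need character columns):
new <- table
new[] <- lapply(table, factor, levels = lookUp$pet, labels = lookUp$class)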

The answer above showing how to do this in dplyr doesn't answer the question: the resulting table is filled with NAs. This worked; I would appreciate any comments showing a better way:
# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>%
# put in long format, naming column filled with P1, P2, P3 "petCount"
gather(key="petCount", value="pet", -customer) %>%
# add a new column based on the pet's class in data frame "lookup"
left_join(lookUp, by="pet") %>%
# since you wanted to replace the values in "table" with their
# "class", remove the pet column
select(-pet) %>%
# put data back into wide format
spread(key="petCount", value="class")
Note that it would likely be useful to keep the long table that contains the customer, the pet, the pet's species(?) and their class. This example simply adds an intermediary save to a variable:
table$customer = seq(nrow(table))
petClasses <- table %>%
gather(key="petCount", value="pet", -customer) %>%
left_join(lookUp, by="pet")
custPetClasses <- petClasses %>%
select(-pet) %>%
spread(key="petCount", value="class")

I tried other approaches and they took a really long time with my very large dataset. I used the following instead:
# make table "new" using ifelse. See data below to avoid re-typing it
new <- ifelse(table1 =="cat", "mammal",
ifelse(table1 == "lizard", "reptile",
ifelse(table1 =="parrot", "bird", NA)))
This method requires you to write more text for your code, but the vectorization of ifelse makes it run faster. You have to decide, based on your data, if you want to spend more time writing code or waiting for your computer to run. If you want to make sure it worked (you didn't have any typos in your ifelse commands), you can use apply(new, 2, function(x) mean(is.na(x))).
data
# create the data table
table1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)

Benchmark
Out of burning curiosity, I just ran a benchmark with some of the approaches that I want to share with you. I couldn't quite believe some of the statements about performance in the answers and am trying to clarify this herewith. In order not to be misled by different rows/columns ratios, I consider three scenarios:
ncol == nrow
ncol << nrow
ncol >> nrow.
It might be beneficial to coerce the data to a matrix with as.matrix beforehand, so I included this as an additional solution (unlist_mat).
microbenchmark::microbenchmark(
lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)], ## added
ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
ifelse(Dat3[col_set] == "lizard", "reptile",
ifelse(Dat3[col_set] == "parrot", "bird", NA))),
look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
times=3L
)
## 1e3 x 1e3
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lapply 40.42905 63.47053 78.03831 86.51201 96.84294 107.17387 3 a
# unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297 3 b
# unlist_mat 45.91743 56.51087 68.50595 67.10432 79.80021 92.49611 3 a
# ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581 3 ab
# look_vec 58.54449 88.40293 112.91165 118.26137 140.09522 161.92908 3 a
## 1e4 x 1e4
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.427077 3.558234 3.992481 4.689390 4.775183 4.860977 3 a
# unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084 3 b
# unlist_mat 4.940254 5.011684 5.576553 5.083114 5.894703 6.706291 3 a
# ifelse 9.714553 14.444899 36.176777 19.175244 49.407889 79.640535 3 a
# look_vec 8.460969 8.558600 8.784463 8.656230 8.946209 9.236188 3 a
## 1e5 x 1e3
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.314427 2.403001 3.270708 2.491575 3.748848 5.006120 3 a
# unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586 3 b
# unlist_mat 5.018869 5.060865 5.638499 5.102861 5.948314 6.793767 3 a
# ifelse 6.244744 16.488266 39.208119 26.731788 55.689807 84.647825 3 ab
# look_vec 4.512672 6.434651 7.496267 8.356630 8.988064 9.619498 3 a
## 1e3 x 1e5
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 52.833019 55.373432 71.308981 57.913845 80.546963 103.180080 3 ab
# unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819 3 c
# unlist_mat 3.872551 4.422904 4.695393 4.973257 5.106814 5.240372 3 a
# ifelse 72.592437 76.473418 103.930063 80.354399 119.598876 158.843354 3 b
# look_vec 56.444824 58.904604 62.677267 61.364383 65.793488 70.222593 3 ab
Note: Performed on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz using R --vanilla.
all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1)) ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!
Conclusion
Using lapply with a lookup table appears to be a good choice if the number of columns is rather low, or at least lower than the number of rows. If we have many columns, especially compared to rows, we might benefit from coercing the respective columns of the data frame into a matrix first, which should only take a blink of an eye.
set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
Mat <- as.matrix(Dat)
)
# user system elapsed
# 0.844 0.318 1.161
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat
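To make the conclusion above concrete, here is a minimal sketch of the matrix route using the Dat and Look objects just defined (outside the benchmark; Mat2 and New are just illustrative names):
Mat2 <- as.matrix(Dat)                                # same one-off coercion as timed above
Mat2[] <- Look$class[match(Mat2, Look$pet)]           # one vectorised lookup over all cells
New <- as.data.frame(Mat2, stringsAsFactors = FALSE)  # back to a data.frame if needed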

Related

equivalent of melt+reshape that splits on column names

Point: if you are going to vote to close, it is poor form not to give a reason why. If it can be improved without requiring a close, take the 10 seconds it takes to write a brief comment.
Question:
How do I do the following "partial melt" in a way that memory can support?
Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.
Normally I would melt to a data frame (or table) comprised of a pair of columns, then I would split on the variable name to create two new columns, then I would cast using one of the new splits for new column names, and one for row names.
This isn't working. My billion or so values of data are making the additional columns overwhelm my memory.
Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?
Thoughts:
this is a little like melt-colsplit-cast
libraries common for this seem to be "dplyr", "tidyr", "reshape2", and "data.table".
tidyr's gather+separate+spread looks good, but doesn't like not having a unique row identifier
reshape2's dcast (I'm looking for 2d output) wants to aggregate
brute force loses the labels. By brute force I mean df <- rbind(df[,block1],...) where block1 is the first 200 column indices, block2 is the second, etcetera.
Update (dummy code):
#libraries
library(stringr)
#reproducibility
set.seed(56873504)
#geometry
Ncol <- 2e3
Nrow <- 1e6
#column names
namelist <- character(length=Ncol)
for(i in 1:(Ncol/200)){
col_idx <- 1:200+200*(i-1)
if(i<26){
namelist[col_idx] <- paste0(intToUtf8(64+i),str_pad(string=1:200,width=3,pad="0"))
} else {
namelist[col_idx] <- paste0(intToUtf8(96+i),str_pad(string=1:200,width=3,pad="0"))
}
}
#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist
The output that I would be looking for would have a column with the first character of the current name (single alphabet character) and colnames would be 1 to 200. It would be much less wide than "df" but not fully melted. It would also not kill my cpu or memory.
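For reference, the conventional melt / colsplit / cast route described above might look like the sketch below (data.table syntax; shown on a small slice of df, since the full-size long intermediate is exactly what exhausts memory):
library(data.table)
small <- as.data.table(df[1:100, ])                 # demo slice only
small[, row_id := .I]
long <- melt(small, id.vars = "row_id")
long[, grp := substr(as.character(variable), 1, 1)] # first letter, e.g. "A"
long[, idx := substr(as.character(variable), 2, 4)] # trailing digits, "001".."200"
wide <- dcast(long, row_id + grp ~ idx, value.var = "value")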
(Ugly/Manual) Brute force version:
(working on it... )
Here are two options both using data.table.
If you know that each column string always has 200 (or n) fields associated with it (i.e., A001 - A200), you can use melt() and make a list of measurement variables.
melt(dt
, measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
, value.name = as.character(seq_len(Ncol_p_grp))
)[, variable := rep(namelist_letters, each = Nrow)][]
#this data set used Ncol_p_grp <- 5 to help condense the data.
variable 1 2 3 4 5
1: A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
2: A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
3: A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
4: A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
5: A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
---
259996: Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997: Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998: Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999: Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000: Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571
Alternatively, we can use rbindlist(lapply(...)) to go through the data set and subset it based on the letter within the columns.
rbindlist(
lapply(namelist_letters,
function(x) setnames(
dt[, grep(x, names(dt), value = T), with = F]
, as.character(seq_len(Ncol_p_grp)))
)
, idcol = 'ID'
, use.names = F)[, ID := rep(namelist_letters, each = Nrow)][]
With 78 million elements in this dataset, it takes around a quarter of a second. I tried to up it to 780 million, but I just don't really have the RAM to generate the data that quickly in the first place.
#78 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
expr min lq mean median uq max neval
melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521 3
rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147 3
Data: Run this before everything above:
#packages ----
library(data.table)
library(stringr)
#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26
#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))
names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
, str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))
#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))

Iterate a data frame containing lists of column numbers, of different lengths, with a function in R

I have a data frame (df) of survey responses about human values with 57 columns/variables of numerical/scale responses. Each column belongs to one of ten categories, and they're not in contiguous groups.
I have a second dataframe (scoretable) that associates the categories with the column numbers for the variables; the lists of column numbers are all different lengths:
scoretable <- data.frame(
valuename =
c("Conformity","Tradition","Benevolence","Universalism","Self-Direction",
"Stimulation","Hedonism","Achievement","Power","Security"),
valuevars = I(list(c(11,20,40,47), # Conformity
c(18,32,36,44,51), # Tradition
c(33,45,49,52,54), # Benevolence
c(1,17,24,26,29,30,35,38), # Universalism
c(5,16,31,41,53), # Self-Direction
c(9,25,37), # Stimulation
c(4,50,57), # Hedonism
c(34,39,43,55), # Achievement
c(3,12,27,46), # Power
c(8,13,15,22,56))), # Security
stringsAsFactors=FALSE)
I would like to iterate through scoretable with a function, valuescore, that calculates the mean and sd of all responses in that group of columns in data frame df and write the result to a third table of results:
valuescore = function(df,scoretable,valueresults){
valuename = scoretable[,1]
set <- df[,scoretable[,2]]
setmeans <- colMeans(set,na.rm=TRUE)
valuemean <- mean(setmeans)
setvars <- apply(set, 2, var)
valuesd <-sqrt(mean(setvars))
rbind(valueresults,c(valuename, valuemean, valuesd))
}
a <- nrow(scoretable)
for(i in 1:a){
valuescore(df,scoretable[i,],valueresults)
}
I am very new to R and programming in general (this is my first question here), and I'm struggling to determine how to pass list variables to functions and/or as address ranges for data frames.
Let's create an example data.frame:
library(magrittr)  # for the %>% pipe
df <- replicate(57, rnorm(10, 50, 20)) %>% as.data.frame()
Let's prepare the table result format:
valueresults <- data.frame(
name = scoretable$valuename,
mean = 0
)
Now, a loop over the rows of scoretable: a mean calculation by column, and then the mean of those means. It is brute force (the answer using Map is more elegant), but maybe it is easier to understand for an R beginner.
for(v in 1:nrow(scoretable)){
# let's suppose v = 1 "Conformity"
columns_id <- scoretable$valuevars[[v]]
# isolate columns that correspond to 'Conformity'
temp_df <- df[, columns_id]
# mean of the values of these columns
temp_means <- apply(temp_df, 2, mean)
mean <- mean(temp_means)
# save result in the prepared table
valueresults$mean[v] <- mean
}
> (valueresults)
name mean
1 Conformity 45.75407
2 Tradition 52.76935
3 Benevolence 50.81724
4 Universalism 51.04970
5 Self-Direction 55.43723
6 Stimulation 52.15962
7 Hedonism 53.17395
8 Achievement 47.77570
9 Power 52.61731
10 Security 54.07066
Here is a way using Map to apply a function to the list scoretable[, 2].
First I will create a test df.
set.seed(1234)
m <- 100
n <- 57
df <- matrix(sample(10, m*n, TRUE), nrow = m, ncol = n)
df <- as.data.frame(df)
And now the function valuescore.
valuescore <- function(DF, scores){
f <- function(inx) mean(as.matrix(DF[, inx]), na.rm = TRUE)
res <- Map(f, scores[, 2])
names(res) <- scores[[1]]
res
}
valuescore(df, scoretable)
#$Conformity
#[1] 5.5225
#
#$Tradition
#[1] 5.626
#
#$Benevolence
#[1] 5.548
#
#$Universalism
#[1] 5.36125
#
#$`Self-Direction`
#[1] 5.494
#
#$Stimulation
#[1] 5.643333
#
#$Hedonism
#[1] 5.546667
#
#$Achievement
#[1] 5.3175
#
#$Power
#[1] 5.41
#
#$Security
#[1] 5.54
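Since the question also asks for the sd, here is a sketch that extends f to return both statistics, using the same pooled-variance definition as the question's valuescore (valuescore2 is just an illustrative name; it assumes the df and scoretable objects from above):
valuescore2 <- function(DF, scores){
  f <- function(inx) {
    set <- DF[, inx]
    c(mean = mean(as.matrix(set), na.rm = TRUE),
      sd   = sqrt(mean(apply(set, 2, var, na.rm = TRUE))))  # pooled sd, as in valuescore()
  }
  res <- Map(f, scores[, 2])
  names(res) <- scores[[1]]
  res
}
valuescore2(df, scoretable)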

How to make factor names appear in ifelse statement in R?

I have the following dataset. I want to create a column so that if there is a number in the unid column, then df$identification says "unidentified"; otherwise it should contain whatever is in the species column. So the final output of df$identification should be x, y, unidentified, unidentified. With the code below it shows 1, 2, unidentified, unidentified.
Please note, for other purposes I want to use only the unid column for the !(is.na) part of the ifelse statement and not the species.
unid <- c(NA,NA,1,4)
species <- c("x","y",NA,NA)
df <- data.frame(unid, species)
df$identification <- ifelse(!is.na(unid), "unidentified", df$species)
#Current Output of df$identification:
1,2,unidentified,unidentified
#Needed Output
x,y,unidentified,unidentified
You can coerce the column of class factor to class character inside the ifelse.
df$identification <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df
# unid species identification
#1 NA x x
#2 NA y y
#3 1 <NA> unidentified
#4 4 <NA> unidentified
Edit.
After the OP accepted the answer, I reminded myself that ifelse is slow and indexing fast, so I tested both using a larger dataset.
First of all, see if both solutions produce the same results:
df$id1 <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df$id2 <- "unidentified"
df$id2[is.na(unid)] <- species[is.na(unid)]
identical(df$id1, df$id2)
#[1] TRUE
The results are the same.
Now time them both using package microbenchmark.
n <- 1e4
df1 <- data.frame(unid = rep(unid, n), species = rep(species, n))
microbenchmark::microbenchmark(
ifelse = {df1$id1 <- ifelse(!is.na(df1$unid), "unidentified", as.character(df1$species))},
index = {df1$id2 <- "unidentified"
df1$id2[is.na(df1$unid)] <- species[is.na(df1$unid)]
},
relative = TRUE
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# ifelse 12502465 12749881 16080160.39 14365841 14507468.5 85836870 100 c
# index 3243697 3299628 4575818.33 3326692 4983170.0 74526390 100 b
#relative 67 68 208.89 228 316.5 540 100 a
Based on the mean timings above (about 16 ms vs. 4.6 ms), indexing is roughly 3 to 4 times faster; the "relative" row is just the stray relative = TRUE argument being timed as an expression, not a speed-up factor. Still more than worth the trouble of writing two lines of code instead of just one for ifelse.

How can I expand regexp matches from a list into binary fields without a for loop in R? [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
I have a dataframe that contains fields with strings, such as "fish, birds, animals", etc. I have collapsed them into a list, and iterate over them in order to create logical fields within same dataframe. Update: The question is now updated with a more elaborate example.
However, this is slow and does not feel optimal. It's not an operation I have to do more than once, so I wasn't that bothered, but I think there might be a better way, perhaps using dplyr.
This code does create new fields for every match of each element of item_list within the field items.
no <- seq(1:3)
items <- c('fish,cat,dog', 'horse,elephant,dog', 'hamster,pig')
df <- data.frame(no, items)
df$items <- as.character(df$items)
df
Will create the following data frame:
no items
1 1 fish,cat,dog
2 2 horse,elephant,dog
3 3 hamster,pig
Running this code will harvest the field items and expand it into logical fields
library(dplyr)  # the loop below uses %>% and mutate()
items <- paste(df$items, collapse = ",")
item_list <- unlist(unique(strsplit(items, ",")))
for (i in 1:length(item_list)) {
lt <- item_list[i]
df <- df %>% rowwise() %>% mutate(!!lt := grepl(lt, items))
}
data.frame(df)
Resulting in this data frame:
no items fish cat dog horse elephant hamster pig
1 1 fish,cat,dog TRUE TRUE TRUE FALSE FALSE FALSE FALSE
2 2 horse,elephant,dog FALSE FALSE TRUE TRUE TRUE FALSE FALSE
3 3 hamster,pig FALSE FALSE FALSE FALSE FALSE TRUE TRUE
This will be fairly fast
f1 = function(df, column_name) {
## pre-process words
words = strsplit(df[[column_name]], ",")
uwords = unlist(words)
colnames = unique(uwords)
## pre-allocate result matrix of 'FALSE' values
m = matrix(FALSE, nrow(df), length(colnames), dimnames = list(NULL, colnames))
## update rows and columns of result matrix containing matches to TRUE
row = rep(seq_len(nrow(df)), lengths(words))
col = match(uwords, colnames)
m[cbind(row, col)] = TRUE
## return the final result
cbind(df, m)
}
The trickiest part is that a matrix subset by a two-column matrix treats the first column of the two-column matrix as a row index, and the second column as the column index. So the rows and columns that you want to set to TRUE are
row = rep(seq_len(nrow(df)), lengths(words))
col = match(uwords, colnames)
and the matrix is updated with
m[ cbind(row, col) ] = TRUE
There is no iteration (e.g., sapply()), so the match() function is called once rather than nrow(df) times.
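Here is a tiny standalone illustration of that two-column matrix indexing, with made-up positions:
m <- matrix(FALSE, nrow = 3, ncol = 3)
idx <- cbind(c(1, 2, 3), c(2, 3, 1)) # first column = row index, second column = column index
m[idx] <- TRUE
m
#       [,1]  [,2]  [,3]
# [1,] FALSE  TRUE FALSE
# [2,] FALSE FALSE  TRUE
# [3,]  TRUE FALSE FALSE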
For 3M rows, I have
> idx = rep(1:3, 1000000)
> df1 = df[idx,]
> system.time(f1(df1, "items"))
user system elapsed
13.304 0.112 13.421
For the other solution (at time of writing) by Christoph:
f0 = function(df, column_name) {
categories_per_row <- strsplit(df[[column_name]], split=",")
categories <- unique(unlist(categories_per_row))
categoryM <- t(sapply(categories_per_row, function(y) categories %in% y))
colnames(categoryM) <- categories
cbind(df, categoryM)
}
and the data.table solution by Uwe (watch out, reference semantics change the value of dt! Also, I don't know how to pass the column name as a function argument):
library(data.table)
dt = df1
dt$no = seq_len(nrow(dt))
f2 = function(dt) {
setDT(dt)
dt1 = dt[, strsplit(items, ","), by = .(no, items)]
dt1[, dcast(.SD, no + items ~ V1, function(x) length(x) > 0)]
}
with times
> system.time(res0 <- f0(df1, "items"))
user system elapsed
23.776 0.000 23.786
> system.time(res2 <- f2(dt))
Using 'V1' as value column. Use 'value.var' to override
user system elapsed
45.668 0.072 45.593
About 1/2 the time of f1() is used by strsplit(); stringr::str_split() is about two times faster, but since the pattern used to split is fixed (not a regular expression) it makes sense to use strsplit(fixed=TRUE), which is about 3x faster. Probably some data.table pro will come up with a very fast solution (but then you need to be a data.table pro...).
It's tempting to do things like 'collapse them [words shared by an item] into a list [actually a vector!]', but it will often be sensible to leave words in a list
> df1$items = strsplit(df1$items, ",", fixed=TRUE)
> head(df1)
no items
1 1 fish, cat, dog
2 2 horse, elephant, dog
3 3 hamster, pig
4 4 fish, cat, dog
5 5 horse, elephant, dog
6 6 hamster, pig
and save yourself the time / trouble required to re-split. The tidyverse way would be to create an extended version of the table
tidyr::unnest(df1)
(or the other approaches in the so-called 'duplicate' question). This would probably cause one to re-think what the role of the columns of logicals is in subsequent manipulations.
Here's a step by step solution; Uwe's is probably much faster but I hope this is easier to understand:
categories_per_row <- strsplit(df$items, split=",")
categories <- unique(unlist(strsplit(df$items, ",")))
categoryM <- t(sapply(categories_per_row, function(y) categories %in% y))
colnames(categoryM) <- categories
cbind(df, categoryM)

Using 'fastmatch' package in R

I have to find indices for 1MM numeric values within a vector of roughly 10MM values. I found the package fastmatch, but when I use the function fmatch(), I am only returning the index of the first match.
Can someone help me use this function to find all values, not just the first? I realize this is a basic question but online documentation is pretty sparse and fmatch has cut down the computing time considerably.
Thanks so much!
Here is some sample data - for the purposes of this exercise, let's call this data frame A:
DateTime Address Type ID
1 2014-03-04 20:21:03 982076970 1 2752394
2 2014-03-04 20:21:07 98174238211 1 2752394
3 2014-03-04 20:21:08 76126162197 1 2752394
4 2014-03-04 20:21:16 6718053253 1 2752394
5 2014-03-04 20:21:17 98210219176 1 2752510
6 2014-03-04 20:21:20 7622877100 1 2752510
7 2014-03-04 20:21:23 2425126157 1 2752510
8 2014-03-04 20:21:23 2425126157 1 2752510
9 2014-03-04 20:21:25 701838650 1 2752394
10 2014-03-04 20:21:27 98210219176 1 2752394
What I wish to do is to find the number of unique Type values for each Address. There are several million rows of data with roughly 1MM unique Address values... on average, each Address appears about 6 times in the data set. And, though the Type values listed above are all 1, they can take any value from 0:5. I also realize the Address values are quite long, which adds to the time required for the matching.
I have tried the following:
uvals <- unique(A$Address)
utypes <- matrix(0,length(uvals),2)
utypes[,1] <- uvals
for (i in 1:length(uvals)) {
b <- which(uvals[i] %in% A$Address)
c <- length(unique(A$Type[b]))
utypes[i,2] <- c
}
However, the code above is not very efficient - if I am looping over 1MM values, I estimate this will take 10-15 hours.
I have tried this, as well, within the loop... but it is not considerably faster.
b <- which(A$Address == uvals[i])
I know there is a more elegant/faster way, I am fairly new to R and would appreciate any help.
This can be done using the unique function in data.table, followed by an aggregation. I'll illustrate it using more or less the sample data generated by @Chinmay:
Create sample data:
set.seed(100L)
dat = data.frame(
address = sample(1e6L, 1e7L, TRUE),
value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
)
data.table solution:
require(data.table) ## >= 1.9.2
dat.u = unique(setDT(dat), by=c("address", "value"))
ans = dat.u[, .N, by=address]
Explanation:
The setDT function converts a data.frame to data.table by reference (which is very fast).
The unique function applied to a data.table invokes the unique.data.table method, which is incredibly fast compared to base::unique. Now we have only the unique values of Type for every address.
All that's left to do is to aggregate or group-by address and get the number of observations that are there in each group. The by=address part groups by address and .N is an in-built data.table variable that provides the number of observations for that group.
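Applied to the 10-row sample A shown in the question (assuming it has been read into a data.frame named A), the whole pipeline is:
library(data.table)
setDT(A) # convert by reference
unique(A, by = c("Address", "Type"))[, .N, by = Address] # N = number of distinct Type values per Address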
Benchmarks:
I'll create functions to generate the data as a data.table and as a data.frame, to benchmark the data.table answer against the dplyr solution (a) proposed by @beginneR, although I don't see the need for arrange(.) there and will therefore skip that part.
## function to create data
foo <- function(type = "df") {
set.seed(100L)
dat = data.frame(
address = sample(1e6L, 1e7L, TRUE),
value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
)
if (type == "dt") setDT(dat)
dat
}
## DT function
dt_sol <- function(x) {
unique(x, by=c("address", "value"))[, .N, by=address]
}
## dplyr function
dplyr_sol <- function(x) {
distinct(x) %>% group_by(address) %>% summarise(N = n_distinct(value))
}
The timings reported here are three consecutive runs of system.time(.) on each function.
## benchmark timings in seconds
## pkg run-01 run-02 run-03 command
## data.table 2.4 2.3 2.4 system.time(ans1 <- dt_sol(foo("dt")))
## dplyr 15.3 16.3 15.7 system.time(ans2 <- dplyr_sol(foo()))
For some reason, dplyr automatically orders the result by the grouping variable. So in order to compare the results, I'll also order them in the result from data.table:
system.time(setkey(ans1, address)) ## 0.102 seconds
identical(as.data.frame(ans1), as.data.frame(ans2)) ## TRUE
So, data.table is ~6x faster here.
Note that bit64::integer64 is also supported in data.table - since you mention the address values are quite long, you can also store them as integer64.
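For example, a minimal sketch of storing long Address values as integer64 (requires the bit64 package; using two Address values from the sample data above):
library(bit64)
library(data.table)
dt_addr <- data.table(Address = as.integer64(c("98174238211", "76126162197")))
class(dt_addr$Address)
# [1] "integer64"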
You can try creating an index of your 10MM values and sort that. Then looking for your 1MM values in that indexed vector should be faster.
For example, using data.table package you can do that by using setkey function which indexes given column of data.table.
require(data.table)
set.seed(100)
dat <- sample(1:1e+07, size = 1e+07, replace = T)
searchval <- sample(dat, size = 1e+06)
DT <- data.table(dat, index = seq_along(dat))
setkey(DT, dat)
DT
## dat index
## 1: 1 169458
## 2: 1 4604823
## 3: 1 7793446
## 4: 2 5372388
## 5: 3 2036622
## ---
## 9999996: 9999996 1271426
## 9999997: 9999998 530029
## 9999998: 10000000 556672
## 9999999: 10000000 6776063
## 10000000: 10000000 6949665
lookup <- data.table(val = searchval)
setkey(lookup, val)
lookup
## val
## 1: 2
## 2: 16
## 3: 24
## 4: 33
## 5: 36
## ---
## 999996: 9999970
## 999997: 9999973
## 999998: 9999988
## 999999: 9999996
## 1000000: 9999998
Now you can lookup all the values from lookup in DT by simply using
DT[lookup]
## dat index
## 1: 2 5372388
## 2: 16 537927
## 3: 16 1721233
## 4: 24 7286522
## 5: 33 7448516
## ---
## 2000298: 9999973 8008610
## 2000299: 9999988 3099060
## 2000300: 9999988 7996302
## 2000301: 9999996 1271426
## 2000302: 9999998 530029
The fmatch documentation clearly states that it only finds the first match. And given that it uses an underlying hashing strategy, I imagine it's unlikely that it stores multiple items per key, which is one of the ways it stays so fast (and it's the same way match works).
Do you have many duplicate values? Perhaps you could store those in a separate place/table and create a fast index to a list of possible matches. It would be more helpful if you provided sample data representative of what you're trying to do and the code you tried to see if it would be easy to extend.
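For instance, one simple way to pre-build such an index of all positions per key is split(), which scans the vector only once (a small sketch with toy values, not your data):
vals <- c("a", "b", "a", "c", "b", "a")
pos_index <- split(seq_along(vals), vals) # named list: all positions for each distinct value
pos_index[["a"]]
# [1] 1 3 6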
If I understand your question correctly, you can also do this with dplyr:
I will include two different ways, since I am not entirely sure which is your desired output.
First create some sample data:
Address <- rep(letters, 5)
Type <- sample(1:5, size=5*26, replace=T)
A <- data.frame(Address, Type)
Then install and load dplyr
require(dplyr)
a) To find the number of different Type values for each Address value:
A %.% arrange(Address, Type) %.% group_by(Address) %.% summarize(NoOfTypes = length(unique(Type)))
b) To find all unique combinations of Address and Type:
A %.% arrange(Address, Type) %.% group_by(Address, Type) %.% filter( 1:n() == 1)
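The %.% operator used above comes from an early version of dplyr and has since been replaced by %>%; with current dplyr, variant (a) could be written as follows (a sketch, and arrange() is not needed for the count itself):
library(dplyr)
A %>% group_by(Address) %>% summarise(NoOfTypes = n_distinct(Type))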
