It's a fairly simple task, but I'm trying to wrap my head around how to match values using a data frame of keys and values. I've tried merge, but since the number of rows differs, I'm not sure that's appropriate.
Is there a for loop I can write that loops through each key in my input data frame and changes Product's value if the key is one of those in the lookup table?
Essentially, my data looks like this:
input_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
input_product <- c("Water", "Bread", NA, "Chips", "Chicken", NA, "Chocolate", "Donuts", "Juice")
input <- as.data.frame(cbind(input_key, input_product))
I'd like to replace the NAs with the Product values in the corresponding lookup table:
lookup_key <- c(1245,1546, 7764, 9909)
lookup_product <- c("Ice Cream","Soda", "Bacon","Cheese")
lookup_data <- as.data.frame(cbind(lookup_key, lookup_product))
Finally, I'm hoping to get the final dataframe looking like this:
output_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
output_product <- c("Water", "Bread", "Soda", "Chips", "Chicken", "Cheese", "Chocolate", "Donuts", "Juice")
output_data <- as.data.frame(cbind(output_key, output_product))
OPTION 1: Using base R functions:
Vectorized solution:
input$input_product[input$input_key %in% lookup_data$lookup_key == TRUE] <-
  lookup_data$lookup_product[lookup_data$lookup_key %in% input$input_key == TRUE]
Note: The == TRUE is redundant; it is added only for clarity.
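Note also that the two %in% expressions line up only because both vectors happen to be ordered consistently. As a more defensive base-R sketch (a suggested alternative, not part of the original answer), match() makes the key alignment explicit:
# match() returns, for each input key, its position in the lookup table (or NA),
# so replacement values are aligned by key rather than by coincidence of ordering
m <- match(input$input_key, lookup_data$lookup_key)
input$input_product[!is.na(m)] <- lookup_data$lookup_product[m[!is.na(m)]]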
Using lapply function:
idx <- input$input_key %in% lookup_data$lookup_key
lapply(1:nrow(input),
  function(i) {
    if (idx[i] == TRUE) {
      jdx <- lookup_data$lookup_key %in% input$input_key[i]
      input$input_product[i] <<- lookup_data$lookup_product[jdx == TRUE]
    }
  }
)
Note: Pay attention to the global assignment operator (<<-), which is needed because the anonymous function modifies input outside its own scope.
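A minimal illustration of why a plain <- would not work there (an illustrative sketch, not part of the original answer):
v <- c(1, 2, 3)
invisible(lapply(1:3, function(i) v[i] <- 0))   # <- modifies a local copy inside the function
v                                               # still 1 2 3
invisible(lapply(1:3, function(i) v[i] <<- 0))  # <<- reaches the enclosing environment
v                                               # now 0 0 0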
Using for loop:
idx <- input$input_key %in% lookup_data$lookup_key
for (i in 1:nrow(input)) {
  if (idx[i] == TRUE) {
    jdx <- lookup_data$lookup_key %in% input$input_key[i]
    input$input_product[i] <- lookup_data$lookup_product[jdx == TRUE]
  }
}
Note: Here we just need a simple assignment.
In the above cases you need to create the data frames with the argument stringsAsFactors = FALSE, for example:
input <- as.data.frame(cbind(input_key, input_product), stringsAsFactors = FALSE)
lookup_data <- as.data.frame(cbind(lookup_key, lookup_product), stringsAsFactors = FALSE)
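As an aside (a suggestion, not from the original answer), calling data.frame() directly is cleaner: cbind() first builds a character matrix, silently coercing the numeric keys to character.
input <- data.frame(input_key, input_product, stringsAsFactors = FALSE)
lookup_data <- data.frame(lookup_key, lookup_product, stringsAsFactors = FALSE)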
Then you get the output:
> input
input_key input_product
1 9061 Water
2 8680 Bread
3 1546 Soda
4 5376 Chips
5 9550 Chicken
6 9909 Cheese
7 3853 Chocolate
8 3732 Donuts
9 9209 Juice
>
OPTION 2: Using the data.table package
I found this elegant solution using an inner join:
require(data.table)
setDT(input); setDT(lookup_data)  # convert to data.table by reference first
setkey(input, input_key)
setkey(lookup_data, lookup_key)
> setDT(input)[setDT(lookup_data), input_product := i.lookup_product, nomatch=0][]
input_key input_product
1: 1546 Soda
2: 3732 Donuts
3: 3853 Chocolate
4: 5376 Chips
5: 8680 Bread
6: 9061 Water
7: 9209 Juice
8: 9550 Chicken
9: 9909 Cheese
>
data.table is actually very powerful for data set manipulation. Let's break down the syntax:
setDT: converts a data frame into a data.table by reference (no copy occurs). Because the original data sets are not of class data.table, this is the way to convert them on the fly. Notice that the stringsAsFactors argument is no longer needed, because data.table defaults it to FALSE.
input[lookup_data, nomatch=0]: this is how the data.table package expresses an inner join (see this link), i.e. the intersection of both tables. The nomatch option with value 0 means that rows of i (in our case: lookup_data) with no match return no rows.
This would be the output:
> setDT(input)[setDT(lookup_data), nomatch=0][]
input_key input_product lookup_product
1: 1546 NA Soda
2: 9909 NA Cheese
>
input_product := i.lookup_product: assigns to the column of the outer data set the value of the corresponding column of the inner data set (the i. prefix refers to columns of lookup_data).
[]: prints the result (to verify the solution).
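To see the X[Y] join form in isolation, here is a tiny self-contained sketch (illustrative only; the tables X and Y are made up):
library(data.table)
X <- data.table(id = 1:3, v = c("a", "b", "c"))
Y <- data.table(id = 2:4, w = c("B", "C", "D"))
X[Y, on = "id", nomatch = 0]  # inner join: only ids 2 and 3 appear in both
#    id v w
# 1:  2 b B
# 2:  3 c C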
For more information about data.table I recommend reading the package documentation; it comes with many examples. It is also a good idea to run the following command in R (after loading the data.table package):
example(data.table)
It provides more than 50 examples (the same ones as in the package documentation) with their corresponding results, covering the different uses of this package.
PERFORMANCE
Let's compare all the alternatives in terms of performance. To do so, we need to enlarge the input data set:
rep.num <- 1000
input_key <- rep(c(9061,8680,1546,5376,9550,9909,3853,3732,9209),rep.num)
input_product <- rep(c("Water", "Bread", NA, "Chips", "Chicken", NA, "Chocolate",
"Donuts", "Juice"),rep.num)
input <- as.data.frame(cbind(input_key, input_product), stringsAsFactors=F)
Wrap each alternative in a function of its own. I have included the dplyr solution proposed by @count:
vectSol <- function(input, lookup_data) {
  input$input_product[input$input_key %in% lookup_data$lookup_key == TRUE] <-
    lookup_data$lookup_product[lookup_data$lookup_key %in% input$input_key == TRUE]
  return(input)
}
lapplySol <- function(input, lookup_data) {
  idx <- input$input_key %in% lookup_data$lookup_key
  lapply(1:nrow(input),
    function(i) {
      if (idx[i] == TRUE) {
        jdx <- lookup_data$lookup_key %in% input$input_key[i]
        input$input_product[i] <<- lookup_data$lookup_product[jdx == TRUE]
      }
    }
  )
  return(input)
}
forSol <- function(input, lookup_data) {
  idx <- input$input_key %in% lookup_data$lookup_key
  for (i in 1:nrow(input)) {
    if (idx[i] == TRUE) {
      jdx <- lookup_data$lookup_key %in% input$input_key[i]
      input$input_product[i] <- lookup_data$lookup_product[jdx == TRUE]
    }
  }
  return(input)
}
dataTableSol <- function(input, lookup_data) {
  setkey(input, input_key)
  setkey(lookup_data, lookup_key)
  input[lookup_data, input_product := i.lookup_product, nomatch = 0]
  return(input)
}
dplyrSol <- function(input, lookup_data) {
  out <- rbind(input[!is.na(input$input_product),],
               inner_join(lookup_data, input, by = c("lookup_key" = "input_key")) %>%
                 select(lookup_key, lookup_product) %>%
                 rename(input_product = lookup_product, input_key = lookup_key))
  return(out)
}
Now test each solution (double-check).
We make copies of the input data sets because data.table operates by reference, so we need fresh copies created from scratch:
input.copy <- setDT(as.data.frame(cbind(input_key, input_product), stringsAsFactors=F))
lookup_data.copy<- setDT(as.data.frame(cbind(lookup_key, lookup_product),
stringsAsFactors=F))
input1.out <- vectSol(input, lookup_data)
input2.out <- lapplySol(input, lookup_data)
input3.out <- forSol(input, lookup_data)
input4.out <- dplyrSol(input, lookup_data)
input5.out <- dataTableSol(copy(input.copy), lookup_data.copy)
We use the compare package because all.equal fails when comparing a data frame with a data.table object (their attribute values differ), so we need a comparison that checks only the values.
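A quick illustration of the problem (a sketch):
df <- data.frame(x = 1:3)
dt <- data.table::as.data.table(df)
isTRUE(all.equal(df, dt))  # FALSE: the class attributes of the two objects differ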
library(compare)
OK <- all(
  all.equal(input1.out, input2.out) && all.equal(input1.out, input3.out)
  && compare(input1.out[order(input1.out$input_key),],
             input4.out[order(input4.out$input_key),], ignoreAttrs = TRUE)$result
  && compare(input1.out[order(input1.out$input_key),],
             input5.out, ignoreAttrs = TRUE)$result
)
try(if (!OK) stop("Results are not the same for all methods"))
Now let's use the microbenchmark package to compare the time performance of all the solutions:
library(microbenchmark)
op <- microbenchmark(
VECT = {vectSol(input, lookup_data)},
FOR = {forSol(input, lookup_data)},
LAPPLY = {lapplySol(input, lookup_data)},
DPLYR = {dplyrSol(input, lookup_data)},
DATATABLE = {dataTableSol(input.copy, lookup_data.copy)},
times=100L)
print(op)
Here is the result:
Unit: milliseconds
expr min lq mean median uq max neval cld
VECT 1.005890 1.078983 1.384964 1.108162 1.282269 6.562040 100 a
FOR 416.268583 438.545475 476.551526 449.679426 476.032938 740.027018 100 b
LAPPLY 428.456092 454.664204 492.918478 464.204607 501.168572 751.786224 100 b
DPLYR 13.371847 14.919726 16.482236 16.105815 17.086174 23.537866 100 a
DATATABLE 1.699995 2.059205 2.427629 2.279371 2.489406 8.542219 100 a
Additionally, we can plot the results via:
library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()
From best to worst performance, the order is: vectorized, data.table, dplyr, for loop, lapply.
Pretty tired so this is clumsy, but it should work for the data provided (your output sample is probably wrong, though):
require(dplyr)
rbind(input[!is.na(input$input_product),],
inner_join(lookup_data,input,by=c("lookup_key"="input_key")) %>%
select(lookup_key,lookup_product) %>%
rename(input_product = lookup_product, input_key = lookup_key))
This is easily done using the data.table package as follows:
# load sample data
input_data <- structure(list(
input_key =
structure(c(6L, 5L, 1L, 4L, 8L, 9L,
3L, 2L, 7L),
.Label = c("1546", "3732", "3853", "5376", "8680",
"9061", "9209", "9550", "9909"), class = "factor"),
input_product = structure(c(7L, 1L, NA, 3L, 2L, NA, 4L, 5L, 6L),
.Label = c("Bread", "Chicken", "Chips", "Chocolate",
"Donuts", "Juice", "Water"), class = "factor")),
.Names = c("input_key",
"input_product"),
row.names = c(NA, -9L), class = "data.frame")
lookup_data <- structure(list(
lookup_key = structure(1:4,
.Label = c("1245", "1546", "7764", "9909"), class = "factor"),
lookup_product = structure(c(3L,
4L, 1L, 2L), .Label = c("Bacon", "Cheese", "Ice Cream", "Soda"
), class = "factor")), .Names = c("lookup_key", "lookup_product"
), row.names = c(NA, -4L), class = "data.frame")
# convert to data.table and add keys for merging
library(data.table)
input <- data.table(input_data, key = 'input_key')
lookup <- data.table(lookup_data, key = 'lookup_key')
# merge the data (can use merge method as well)
DT <- lookup[input]
# where the input_product is NA, replace with lookup
DT[is.na(input_product), input_product := lookup_product]
print(DT)
# you can now get rid of lookup_product column, if you like
DT[, lookup_product:= NULL]
print(DT)
The final output of the above is:
> print(DT)
lookup_key input_product
1: 1546 Soda
2: 3732 Donuts
3: 3853 Chocolate
4: 5376 Chips
5: 8680 Bread
6: 9061 Water
7: 9209 Juice
8: 9550 Chicken
9: 9909 Cheese
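If you prefer the original column name, data.table's setnames() renames by reference (a small optional step, not part of the original answer):
setnames(DT, "lookup_key", "input_key")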
Related
I would like to loop a function over a vector of characters. The function will create a vector or a list, and the name of each vector (list) will be taken from the vector of characters. For example:
# The data would look like:
first second third
1 2 3
1 NA 3
1 2 3
1 2 NA
NA 2 3
# I want to create three lists/vectors such as
first <- c("1.pdf", "1.pdf", "1.pdf", "1.pdf")
second <- c("2.pdf", "2.pdf", "2.pdf", "2.pdf")
third <- c("3.pdf", "3.pdf", "3.pdf", "3.pdf")
# where first, second, third are now the names of the vectors. I tried the following way.
vector_names <- c("first", "second", "third")
cleanNA <- function(x){
x <- as.character(as.data.frame(t(data[paste0(x)])))
x <- na.omit(x) # remove all NA observations.
x <- paste0(x, ".pdf")
return(x)
}
# I can do this with a vector of length 1.
name <- c("first")
namef <- cleanNA(name)
assign(name, namef)
# But once I do an lapply, it won't create the three vectors as I wanted. The lapply does run and returns what I want, but it does not create the three vectors.
lapply(vector_names, cleanNA)
I've been searching for this type of question many times and feel R doesn't really provide a good way to generate a new vector within a loop. Am I right? Thanks.
Here's a simplified version:
cleanNA <- function(data, x){
x <- data[[x]]
x <- na.omit(x)
x <- paste0(x, ".pdf")
return(x)
#Or a one-liner
#paste0(na.omit(data[[x]]), '.pdf')
}
list_vec <- lapply(vector_names, cleanNA, data = data)
list_vec
#[[1]]
#[1] "1.pdf" "1.pdf" "1.pdf" "1.pdf"
#[[2]]
#[1] "2.pdf" "2.pdf" "2.pdf" "2.pdf"
#[[3]]
#[1] "3.pdf" "3.pdf" "3.pdf" "3.pdf"
It is better to keep data in a list: it is easier to manage and avoids creating a lot of objects in the global environment. However, if you want them as separate vectors you can use list2env:
list_vec <- setNames(list_vec, vector_names)
list2env(list_vec, .GlobalEnv)
data
data <- structure(list(first = c(1L, 1L, 1L, 1L, NA), second = c(2L,
NA, 2L, 2L, 2L), third = c(3L, 3L, 3L, NA, 3L)), class = "data.frame",
row.names = c(NA, -5L))
This is a follow-up question to my previous question. I'm running into a problem finding a memory-efficient solution to find a common third for my large data set (3.5 million groups and 6.2 million persons).
The proposed solution using the igraph package works fast for normal-sized data sets but unfortunately runs into memory issues by creating a large matrix for bigger data sets. A similar issue comes up with my own solution using concatenated inner joins, where the third inner join inflates the data frame so that my PC runs out of memory (16 GB).
df.output <- inner_join(df,df, by='group' ) %>%
inner_join(.,df, by=c('person.y'='person')) %>%
inner_join(.,df, by=c('group.y'='group')) %>%
rename(person_in_common='person.y', pers1='person.x',pers2='person') %>%
select(pers1,pers2,person_in_common) %>%
filter(pers1!=pers2) %>%
distinct() %>%
filter(person_in_common!=pers1 & person_in_common!=pers2)
df.output[-3] <- t(apply(df.output[-3], 1,
FUN=function(x) sort(x, decreasing=FALSE)))
df.output <- unique(df.output)
Small data set example and expected output
df <- data.frame(group= c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"), stringsAsFactors = FALSE)
df
group person
1 a Tom
2 a Jerry
3 b Tom
4 b Anna
5 b Sam
6 c Nic
and expected result
df.output
pers1 pers2 person_in_common
1 Anna Jerry Tom
2 Jerry Sam Tom
3 Sam Tom Anna
4 Anna Tom Sam
6 Anna Sam Tom
I unfortunately don't have access to a machine with more RAM and am also not really experienced with cloud computing, so I hope to make it work on my local PC. I would appreciate input on how to optimize any of the solutions, or advice on how to tackle the problem otherwise.
Edit 1
A data frame that reflects my actual data size:
set.seed(33)
Data <- data.frame(group = sample(1:3700000, 14000000, replace=TRUE),
person = sample(1:6800000, 14000000,replace = TRUE))
Edit 2
My real data is a bit more complex, with larger groups and more persons per group than the example data. Consequently it is more memory-intensive. I could not figure out how to simulate this kind of structure, so here is the real data for download:
Full person-group data
So, I managed to run this on your test data (I have 16 GB of RAM), but if you run it on your small example you will see that it does not give the same results. I did not get why, but maybe you could help me with that. So I will try to explain every step:
myFun <- function(dt) {
require(data.table)
# change the data do data.table:
setDT(dt)
# set key/order the data by group and person:
setkey(dt, group, person)
# copy the initial data and rename the soon-to-be-merged column to "p2",
# which represents person2
dta <- copy(dt)
setnames(dta, "person", "p2")
# the first merge using data.table:
dt1 <- dt[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
# now we remove rows where persons are the same:
dt1 <- dt1[person != p2] # remove equal persons
# and also we need to remove rows where person1 and person2 are the same,
# just in different order , example:
# 2: a Tom Jerry
# 3: a Jerry Tom
# is the same. If I get it right, you did this using apply at the end of your code,
# but it is much better to reduce the data now;
# this approach is also much faster (we write the pairwise min to the 2nd column
# and the pairwise max to the 3rd):
l1 <- pmin(dt1[[2]], dt1[[3]])
l2 <- pmax(dt1[[2]], dt1[[3]])
set(dt1, j = 2L, value = l1)
set(dt1, j = 3L, value = l2)
# now lets clear memory and take unique rows of dt1:
rm(l1, l2, dt)
dt1 <- unique(dt1)
gc()
# change name for group column:
setnames(dta, "group", "g2")
# second merge:
dt2 <- dt1[dta, on = "p2", allow.cartesian = TRUE, nomatch = 0]
rm(dt1)
gc()
setnames(dta, "p2", "p3")
dt3 <- dt2[dta, on = "g2", allow.cartesian = TRUE, nomatch = 0] # third merge
rm(dt2)
gc()
dt3 <- dt3[p3 != p2 & p3 != person] # removing equal persons
gc()
dt3 <- dt3[, .(person, p2, p3)]
gc()
return(dt3[])
}
On the small data set example:
df <- data.frame(group = c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
stringsAsFactors = FALSE)
df
myFun(df)
# person p2 p3
# 1: Anna Tom Jerry
# 2: Sam Tom Jerry
# 3: Jerry Tom Anna
# 4: Sam Tom Anna
# 5: Jerry Tom Sam
# 6: Anna Tom Sam
# 7: Anna Sam Tom
Something similar to your result, but not quite the same.
Now with larger data:
set.seed(33)
N <- 10e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # 13.22 sec
rm(results)
gc()
And:
set.seed(33)
N <- 14e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # around 40 sec, but RAM usage hits its maximum
Update:
Maybe you can try this splitting approach, let's say with nparts of 6-10:
myFunNew3 <- function(dt, nparts = 2) {
require(data.table)
setDT(dt)
setkey(dt, group, person)
dta <- copy(dt)
# split into N parts
splits <- rep(1:nparts, each = ceiling(dt[, .N]/nparts), length.out = dt[, .N])
set(dt, j = "splits", value = splits)
dtl <- split(dt, by = "splits", keep.by = F)
set(dt, j = "splits", value = NULL)
rm(splits)
gc()
for (i in seq_along(dtl)) {
X <- copy(dtl[[i]])
setnames(dta, c("group", "person"))
X <- X[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
X <- X[person != i.person]
gc()
X <- X[dta, on = "person", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(dta, "group", "i.group")
X <- X[dta, on = "i.group", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(X, "i.person.1", "pers2")
setnames(X, "i.person", "pers1" )
setnames(X, "person", "person_in_common" )
X <- X[, .(pers1, pers2, person_in_common)]
gc()
X <- X[pers1 != pers2 & person_in_common != pers1 & person_in_common != pers2]
gc()
name1 <- "pers1"
name2 <- "pers2"
l1 <- pmin(X[[name1]], X[[name2]])
l2 <- pmax(X[[name1]], X[[name2]])
set(X, j = name1, value = l1)
set(X, j = name2, value = l2)
rm(l1, l2)
gc()
X <- unique(X)
gc()
if (i > 1) {
X1 <- rbindlist(list(X1, X), use.names = T, fill = T)
X1 <- unique(X1)
rm(X)
gc()
} else {
X1 <- copy(X)
}
dtl[[i]] <- 0L
gc()
}
rm(dta, dtl)
gc()
setkey(X1, pers1, pers2, person_in_common)
X1[]
}
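A hedged usage sketch, following the suggested nparts of 6-10 (the value 8 here is illustrative; peak RAM usage on the full data is the thing to watch):
set.seed(33)
dt <- data.frame(group = sample(3.7e6, 14e6, replace = TRUE),
                 person = sample(6.8e6, 14e6, replace = TRUE))
system.time(results <- myFunNew3(dt, nparts = 8))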
I have the following dataset, which includes 2 variables:
dt4<-structure(list(a1 = c(4L, 4L, 3L, 4L, 4L), a2 = c(1L,
3L, 4L, 5L, 4L)), .Names = c("a1", "a2"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
I have the following function that adds labels and levels to an existing dataset:
Add_Labels_Level_To_Dataset <- function(df, df_name,levels_list,labels_list) {
df[] <- lapply( df, ordered)
for (i in 1:length(colnames(df))) {
arg0<-paste0(df_name,"[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]],"))" )
eval(parse(text=arg0))
}
df
}
which is run with the following R command:
Add_Labels_Level_To_Dataset(dt4, "dt4", level_list, label_list)
The lists supplied in the command are the following; they represent the ordered labels and levels of each variable in the dataset, respectively:
label_list=list("'S','SA','SB','SC,'SD'", "'S','SA','SB','SC,'SD'")
level_list=list("5,4,3,2,1", "5,4,3,2,1")
Why is my function not working properly?
I don't know what is wrong with it!
When I run the commands outside the function, they tie the levels/labels to the given dataset. However, when I run my function, this does not happen:
df_name="dt4"
df=dt4
levels_list=level_list
labels_list=label_list
i=3
df[] <- lapply( df, ordered)
arg0<-paste0(df_name,"[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]],"))" )
eval(parse(text=arg0))
Can you help?
This is an XY problem. I agree with @MrFlick that parse should be avoided.
In the original post the main issue is that the function should return dt4, not df. There are also some missing single quotes (') in the definition of label_list.
We could use mapply and avoid the single quotes:
label_list=list(c('S','SA','SB','SC','SD'), c('S','SA','SB','SC','SD'))
level_list=list(c(5,4,3,2,1), c(5,4,3,2,1))
as.data.frame(mapply(function(x, labels, levels) {ordered(x, labels, levels)},
                     dt4, level_list, label_list, SIMPLIFY = FALSE))
# a1 a2
#1 SA SD
#2 SA SB
#3 SB SA
#4 SA S
#5 SA SA
Using eval/parse should be avoided; there are typically much easier ways to do what you want in R. For example, we can just write:
Add_Labels_Level_To_Dataset <- function(df, levels_list, labels_list) {
df[] <- Map(function(data, levels, labels) {
ordered(data, levels=strsplit(levels,",")[[1]], labels=strsplit(labels, ",")[[1]])
}, df, levels_list, labels_list)
df
}
And we can call it like this:
dt4 <- Add_Labels_Level_To_Dataset(dt4, level_list, label_list)
Note that it returns a new data.frame, which you can reassign to dt4 or some other variable. Functions in R should never modify objects outside their own scope, which is one of the other reasons you were running into problems with your function.
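A minimal sketch of that point: modifying an argument inside a function changes only the local copy, so the caller must capture the return value.
f <- function(df) {
  df$x <- 0  # modifies the function's local copy only
  df
}
d <- data.frame(x = 1:3)
f(d)  # returns the modified copy
d$x   # still 1 2 3: d is unchanged unless you reassign d <- f(d)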
I have a dataframe that contains more than 2 million records. I am only sharing a few records for data security reasons; I hope you can understand.
data <- data[order(data$email_address_hash),]
skip_row <- c()
data$hash_time <- rep('NA',NROW(data)) #adding new column to our data
rownames(data) <- as.character(1:NROW(data))
dput(droplevels(data))
structure(list(email_address_hash = structure(c(2L, 1L, 1L, 2L
), .Label = c("0004eca7b8bed22aaf4b320ad602505fe9fa9d26", "35c0ef2c2a804b44564fd4278a01ed25afd887f8"
), class = "factor"), open_time = structure(c(2L, 1L, 3L, 4L), .Label = c(" 04:39:24",
" 09:57:20", " 10:39:43", " 19:00:09"), class = "factor")), .Names = c("email_address_hash",
"open_time"), row.names = c(41107L, 47808L, 3973L, 8307L), class = "data.frame")
str(data)
'data.frame': 4 obs. of 2 variables:
$ email_address_hash: Factor w/ 36231 levels "00012aec4ca3fa6f2f96cf97fc2a3440eacad30e",..: 7632 2 2 7632
$ open_time : Factor w/ 34495 levels " 00:00:03"," 00:00:07",..: 15918 5096 16971 24707
skip_row <- c()
data$hash_time <- rep('NA',NROW(data)) #adding new column to our data
rownames(data) <- as.character(1:NROW(data))
for(i in 1:NROW(data)){
#Skipping the email_address_hash that was already used for grouping
if(i %in% skip_row) next
hash_row_no <- c()
#trimming data so that we don't need to look into whole dataframe
trimmed_data <- data[i:NROW(data),]
# Whenever we search for email_address_hash the previous one was ignored or removed from the check
#extracting rownames so that we can used that as rownumber inside the skip_row
hash_row_no <- rownames(trimmed_data[trimmed_data$email_address_hash==trimmed_data$email_address_hash[1],])
#note :- we know the difference b/w rownames and rownumber
#converting rownames into numeric so that we can use them as rowno
hash_row_no <- as.numeric(hash_row_no)
first_no <- hash_row_no[1]
last_no <- hash_row_no[NROW(hash_row_no)]
skip_row <- append(skip_row,hash_row_no)
data$hash_time[first_no] <- paste(data$open_time[first_no:last_no], collapse = "")
}
Please note that I also tried the approaches below to speed up the process, but they seem to be ineffective:
hash_row_no <- rownames(trimmed_data[trimmed_data$email_address_hash==trimmed_data$email_address_hash[1],])
converted the data frame to a data.table:
setDT(data)
Performing either of these operations takes a similar amount of time:
system.time(rownames(trimmed_data[trimmed_data$email_address_hash==trimmed_data$email_address_hash[1],]))
system.time(rownames(trimmed_data)[trimmed_data[["email_address_hash"]] == trimmed_data$email_address_hash[1]])
Can you guys help me speed up my code? My data contains more than 2 million records, and this is taking more than 30 minutes.
Apparently you want to do this:
library(data.table)
setDT(data)
data[, .(open_times = paste(open_time, collapse = "")), by = email_address_hash]
# email_address_hash open_times
#1: 35c0ef2c2a804b44564fd4278a01ed25afd887f8 09:57:20 19:00:09
#2: 0004eca7b8bed22aaf4b320ad602505fe9fa9d26 04:39:24 10:39:43
Or possibly this:
data[email_address_hash == "0004eca7b8bed22aaf4b320ad602505fe9fa9d26",
paste(open_time, collapse = "")]
#[1] " 04:39:24 10:39:43"
Given the following data of start and end points, how can we get the routes between 2 points?
> ddf
start end
1 a b
2 a c
3 b e
4 b f
5 b c
6 a d
7 e f
8 f g
> dput(ddf)
structure(list(start = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 3L,
4L), .Label = c("a", "b", "e", "f"), class = "factor"), end = structure(c(1L,
2L, 4L, 5L, 2L, 3L, 5L, 6L), .Label = c("b", "c", "d", "e", "f",
"g"), class = "factor")), .Names = c("start", "end"), class = "data.frame", row.names = c(NA,
-8L))
>
This page (http://www.anselm.edu/homepage/mmalita/culpro/graf1.html) shows a solution of only 2 lines in Prolog! The following code works but does not give the correct output list. It can be started with mainpath(ddf, 'a', 'f') to find a path between 'a' and 'f'.
mainpath = function(ddf, startpt, endpt){
ddf$start = as.character(ddf$start)
ddf$end = as.character(ddf$end)
nodenames = sort(unique(c(ddf$start, ddf$end)))
routev=c(startpt)
if(is_arc(ddf, startpt, endpt, routev)) {return("Direct route found."); }
else{
nodenames = nodenames[which(nodenames!=startpt)]
nodenames = nodenames[which(nodenames!=endpt )]
if(is_path(ddf, nodenames, startpt, endpt, routev))
return ("Completed successfully")
else return ("Could not find a path.")
}
}
is_arc = function(ddf, frompt, topt, routevector){
len = nrow(ddf)
for(i in 1:len)
if(frompt == ddf$start[i] && topt == ddf$end[i]) {
routevector = append(routevector, frompt)
routevector = append(routevector, topt)
print(routevector)
return (TRUE);
}
return (FALSE)
}
is_path = function(ddf, othernodes, frompt, topt, routevector){
if(is_arc(ddf, frompt, topt, routevector)){
return (TRUE)
}
if(length(othernodes)==0){
print(routevector)
return (FALSE)
}
for(i in 1:length(othernodes)){
intermediate = othernodes[i]
if(is_arc(ddf, frompt, intermediate, routevector) && is_path(ddf, othernodes, intermediate, topt, routevector)){
return (TRUE)
}
}
print(routevector)
return (FALSE)
}
I am sure it can be much improved; in particular, all these for loops could be replaced with apply-family functions. I know that packages with such functions are available, but how can it be done in base R?
Your answers/comments will be appreciated.
While I'm sure there are wonderful ways to do this with linear algebra, here's a relatively intuitive method (using dplyr here, but translate as you like):
library(dplyr)
# convert factors to characters, filter down to possible starting points
df %>% mutate_each(funs(as.character)) %>% filter(start == 'a') %>%
# join to add possible next steps, indexing endpoints to startpoints
left_join(df, by = c('end' = 'start')) %>%
# iterate for successive steps
left_join(df, by = c('end.y' = 'start')) %>%
left_join(df, by = c('end.y.y' = 'start')) %>%
# chop out rows that didn't end at 'g' (omit if you're curious)
filter(apply(., 1, function(x){x[length(na.omit(x))]}) == 'g')
# start end.x end.y end.y.y end
# 1 a b e f g
# 2 a b f g <NA>
If df's columns are factors, you'll get warnings about coercion, though it'll run fine (coerce to character at the start, or add %>% mutate_each(funs(as.character)) to each df call, and they'll go away). The column names are a bit ugly; set them with left_join's suffix parameter, or select or rename, if you like.
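For instance, a sketch of the suffix idea (the suffix values here are illustrative):
df %>%
  left_join(df, by = c('end' = 'start'), suffix = c('.step1', '.step2'))
# the duplicate 'end' columns from x and y become end.step1 / end.step2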
Obviously the iteration of joins invites a loop, which might look as such:
df2 <- df %>% mutate_each(funs(as.character)) %>% filter(start == 'a')
for(i in 0:2){
endcol <- paste0('end', paste(rep('.y', i), collapse = ''))
df2 <- df2 %>% left_join(df, by = setNames('start', endcol))
}
df2 %>% filter(apply(., 1, function(x){x[length(na.omit(x))]}) == 'g')
# start end.x end.y end.y.y end
# 1 a b e f g
# 2 a b f g <NA>
If you set the number of iterations too high, it will error out because there are no rows to join to, but the error is actually quite convenient, as the loop has already saved the df2 you want, so the error just stops extra work from being done. Add tryCatch if you like (a sketch follows the output below), or go the other direction and refactor it into a scary-looking while loop which will actually iterate the perfect number of times:
df2 <- df %>% mutate_each(funs(as.character)) %>% filter(start == 'a')
endcol <- 'end' # initialize iterating variable
while(TRUE){
df2 <- df2 %>% left_join(df, by = setNames('start', endcol))
endcol <- paste0(endcol, '.y')
}
df2 %>% filter(apply(., 1, function(x){x[length(na.omit(x))]}) == 'g')
# start end.x end.y end.y.y end
# 1 a b e f g
# 2 a b f g <NA>
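For completeness, here is a sketch of the tryCatch variant mentioned above; it swallows the terminating join error instead of surfacing it, keeping the last good df2:
df2 <- df %>% mutate_each(funs(as.character)) %>% filter(start == 'a')
endcol <- 'end'  # initialize iterating variable
while(TRUE){
  res <- tryCatch(df2 %>% left_join(df, by = setNames('start', endcol)),
                  error = function(e) NULL)
  if(is.null(res)) break  # nothing left to join; df2 already holds the result
  df2 <- res
  endcol <- paste0(endcol, '.y')
}
df2 %>% filter(apply(., 1, function(x){x[length(na.omit(x))]}) == 'g')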
The following is a much shorter and more easily understandable recursive function using base R. (The first 2 lines are not needed if the start and end columns of the data frame being passed are already character and not factor.)
mainpath2 = function(ddf, startpt, endpt, route=c()){
ddf$start = as.character(ddf$start)
ddf$end = as.character(ddf$end)
if(startpt == endpt) return("Error: Same Start and End points.\n")
for(i in 1:nrow(ddf)){
if(ddf$start[i] == startpt){
route = append(route, startpt)
if(ddf$end[i] == endpt){
# PATH FOUND:
route = append(route, endpt)
print(route)
}
else mainpath2(ddf[-i, ], ddf$end[i], endpt, route)
route = route[-length(route)]
}
}
}
> mainpath2(ddf, 'a', 'g')
[1] "a" "b" "e" "f" "g"
[1] "a" "b" "f" "g"