I'm doing some product association work where I have two large data.tables. One is a rules table (2.4m rows) and one is a customer product table (3m rows). Effectively what I want to do is merge the two together and select the top 10 products per customer, but doing this as one big merge isn't viable due to the size. To get around this, I want to iteratively merge the two tables at the customer level, select the top 10 products and return the result.
The below example probably explains it better:
require(data.table)
customer <- data.table(customer=rep(seq(1:5),3),product_bought=rep(c("A","B","C"),5), number=runif(15))[order(customer)]
rules <- data.table(product_bought=c("A","B","C"),recommended_product=c("D","E","F"),number2=runif(3,min=100,max=200))
customer[,lapply(.SD, function(z){
a <- merge(z,rules,by="product_bought")
a[,new:=number*number2]
a[new==max(new)]
return(a)
}),by=customer]
But I get the following error:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
What I want it to do for all customers is this:
z <- customer[customer==1]
a <- merge(z,rules,by="product_bought")
a[,new:=number*number2]
a[new==max(new)]
Which gives:
> a[new==max(new)]
   product_bought customer   number recommended_product  number2     new
1:              C        1 0.613043                   F 168.4335 103.257
I did try using lists, but having a list of 30k data.tables caused issues when trying to rbindlist it back up again.
Any ideas why the merge within a .SD doesn't work?
Cheers,
Scott
I guess you were trying to do this:
customer[, {
  a <- merge(.SD, rules, by = "product_bought")
  a[, new := number * number2]
  a[new == max(new)]
}, by = customer]
But it's much better to do a single merge:
customer[rules, on = 'product_bought', new := number * number2]
customer[, .SD[new == max(new)], by = customer]
Or do the .I trick if the last line is too slow.
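For reference, a minimal sketch of the .I variant (same result; it avoids building .SD per group and keeps all rows tied at the group maximum):

idx <- customer[, .I[new == max(new)], by = customer]$V1   # row numbers of the per-customer maxima
customer[idx]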
Related
The dataset I am working on is not very big, but quite wide. It currently has 10,854 columns and I would like to add approximately another 10-11k columns. It has only 760 rows.
When I try to add them (applying functions to a subset of the existing columns), I get the following
Warning message:
In `[.data.table`(setDT(Final), , `:=`(c(paste0(vars, ".xy_diff"), :
truelength (30854) is greater than 10,000 items over-allocated (length = 10854). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo().
I have tried to play with setalloccol, but I get something similar. For example:
setalloccol(Final, 40960)
Error in `[.data.table`(x, i, , ) :
getOption('datatable.alloccol') should be a number, by default 1024. But its type is 'language'.
In addition: Warning message:
In setalloccol(Final, 40960) :
tl (51894) is greater than 10,000 items over-allocated (l = 21174). If you didn't set the datatable.alloccol option to be very large, please report to data.table issue tracker including the result of sessionInfo().
Is there a way to bypass this problem?
Thanks a lot
Edit:
To answer Roland's comment, here is what I am doing:
vars <- c(colnames(FinalTable_0)[271:290], colnames(FinalTable_0)[292:dim(FinalTable_0)[2]]) # <- variables I want to operate on
# FinalTable_0 is a previous table I use to collect the roots of the variables I want to work with
difference <- function(root) lapply(root, function(z) paste0("get('", z, ".x') - get('", z, ".y')"))
ratio <- function(root) lapply(root, function(z) paste0("get('", z, ".x') / get('", z, ".y')"))
# proceed to the computation
setDT(Final)[ , c(paste0(vars,".xy_diff"), paste0(vars,".xy_ratio")) := lapply(c(difference(vars), ratio(vars)), function(x) eval(parse(text = x)))]
I tried the solution proposed by Roland, but was not fully satisfied. It works, but I do not like the idea of transposing my data.
In the end, I just split the original data.table into multiple ones, proceeded to the computations on each individually and joined them back at the end. Fast and simple: no need to play with variables, tell which ones are ids and which are measures, no need to shape and reshape. I just prefer it that way.
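A rough sketch of that split/compute/bind idea, assuming the paired columns follow the <root>.x / <root>.y naming from the code above (the chunk size of 2000 is arbitrary):

library(data.table)
# split the roots into chunks, build the new columns chunk by chunk, then bind everything back
chunks <- split(vars, ceiling(seq_along(vars) / 2000))
pieces <- lapply(chunks, function(v) {
  xs <- Final[, paste0(v, ".x"), with = FALSE]
  ys <- Final[, paste0(v, ".y"), with = FALSE]
  cbind(setNames(xs - ys, paste0(v, ".xy_diff")),
        setNames(xs / ys, paste0(v, ".xy_ratio")))
})
Final <- do.call(cbind, c(list(Final), pieces))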
I'm using fb_insights from the fbRads package like this (I use more metrics in my real problem):
fb_campaigns <- rbindlist(lapply(l, function(l) cbind(Campaign = l$campaign_name, rbindlist(l$actions))))
Oh, and I get some warnings (I know I'm doing something wrong, but can't solve it):
Warning messages:
1: In data.table::data.table(...) :
Item 1 is of size 11 but maximum size is 104 (recycled leaving remainder of 5 items)
The result is a data frame with all the data I need (Campaign, action_type, value), but the action_type columns and their values came out of order: the action data don't seem to belong to the campaigns in their rows.
How can I merge the action types with the campaigns?
Once I have the data in the correct rows, I will use reshape to turn the action_types into columns holding the values.
The data I get from fbRads and want to transform look like this:
The data I get using my code look like this (the format is OK, but the order of the values is not: they are not the values for those campaigns):
daroczig gave me the solution below, and it seems to work fine!
## list of action types to extract
actiontypes <- c('link_click', 'comment', 'like', 'post')
## extract actions from the data returned by the Insights API
lactions <- unlist(lapply(l, function(x) x$actions), recursive = FALSE)
## extract fields from the actions
library(data.table)
lactions <- rbindlist(lapply(lactions, function(actions) {
    setnames(as.data.table(
        do.call(cbind,
                lapply(actiontypes,
                       function(action) {
                           if (is.null(actions)) return(0)
                           value <- subset(actions, action_type == action, value)
                           if (nrow(value) == 0) return(0) else return(value[[1]])
                       }))),
        actiontypes)
}))
## Merging the dataframe with the original data and the dataframe with the actions
fb_campaigns <- cbind(l[, c(1, 4:11)], lactions)
I have two dataframes in R, one big but incomplete (import), and I want to create a smaller, complete subset of it (export). Every ID in the $unique_name column is unique and does not appear twice. Other columns might be, for example, body mass, but also other categories that correspond to the unique ID. I've written this code, a double loop with an if-statement; it works, but it is slow:
for (j in 1:length(export$unique_name)) {
  for (i in 1:length(import$unique_name)) {
    if (toString(export$unique_name[j]) == toString(import$unique_name[i])) {
      export$body_mass[j] <- import$body_mass[i]
    }
  }
}
I'm not very good with R but I know this is a bad way to do it. Any tips on how I can do it with functions like apply() or perhaps the plyr package?
Bjørn
There are many functions to do this. Check out:
library(compare)
compare(DF1,DF2,allowAll=TRUE)
Or, as mentioned by @A.Webb, merge is a pretty handy function:
merge(x = DF1, y = DF2, by.x = "Unique_ID",by.y = "Unique_ID", all.x = T, sort = F)
If you prefer SQL-style statements, then:
library(sqldf)
sqldf('SELECT * FROM DF1 INTERSECT SELECT * FROM DF2')
Easy to implement, and it avoids the for loop and if conditions.
As A.Webb suggested, you need a join:
# join data on unique_name
joined=merge(export, import[c("unique_name", "body_mass")], c('unique_name'))
joined$body_mass=joined$body_mass.y # update body_mass from import to export
joined$body_mass.x=NULL # remove not needed column
joined$body_mass.y=NULL # remove not needed column
export=joined;
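If you are open to data.table, an update join does the same thing in one step; a minimal sketch, assuming both tables share the unique_name column:

library(data.table)
setDT(export)
setDT(import)
# update join: copy body_mass from import into export by unique_name, by reference
export[import, body_mass := i.body_mass, on = "unique_name"]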
Note: as shown below, you can use the which function. This reduces the number of loop iterations:
for (j in 1:nrow(export)) {
  index <- which(import$unique_name %in% export$unique_name[j])
  if (length(index) == 1) {
    export$body_mass[j] <- import[index[1], "body_mass"]
  }
}
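For completeness, the remaining loop can be dropped entirely with a vectorised match(); a sketch, assuming every unique_name in export also appears in import (otherwise NAs are introduced):

# match() gives, for each export ID, the position of that ID in import
idx <- match(export$unique_name, import$unique_name)
export$body_mass <- import$body_mass[idx]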
I have a data.table that holds ids and locations. For example, here it is with a couple of rows in it (it has column and row names, I don't know if that matters):
locations<-data.table(c(11,12),c(-159.58,0.2),c(21.901,22.221))
colnames(locations)<-c("id","location_lon","location_lat")
rownames(locations)<-c("1","2")
I then want to iterate over the rows and compare them to another point (with lat,lon).
In a for loop it works:
for (i in 1:nrow(locations)) {
  loc <- locations[i, ]
  dist <- gdist(-159.5801, 21.901, loc$location_lon, loc$location_lat, units = "m")
  if (dist <= 50) {
    return(loc)
  }
  return(NULL)
}
and returns:
id location_lon location_lat
1: 11 -159.58 21.901
but I want to use apply.
The following code fails to run:
dists <- apply(locations,1,function(x) if (50 - gdist(-159.5801, 21.901, x$location_lon, x$location_lat, units="m")>=0) x else NULL)
with a "$ operator is invalid for atomic vectors" error. Changing to positional indexing (x[2], x[3]) isn't enough to fix this; I get:
Error in if (radius - gdist(lon, lat, x[2], x[3], units = "m") >= 0) x else NULL :
missing value where TRUE/FALSE needed
This is because the data.table is converted to matrix, and the coordinates are treated as text instead of numbers.
Is there a way to overcome this? The solution needs to be efficient (I want to run this check for >1,000,000 different coordinates). Changing the data structure of the locations table is possible if needed.
No loops are required, just use data.table as intended. If all you want to see are the rows that are within 50 meters of the desired location, all you have to do is:
locations[, if (gdist(-159.58, 21.901, location_lon, location_lat, units="m") <= 50) .SD, id]
## id location_lon location_lat
## 1: 11 -159.58 21.901
Here we are iterating over the id column within the locations data set itself and checking whether each id is within 50 meters of (-159.58, 21.901). If so, we return .SD, which is basically the subset of the data set for that specific id.
As a side note, data.table doesn't have row.names, so there is no need to specify them; see here, for example.
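If the per-group call turns out to be slow for >1,000,000 points, a fully vectorised filter is another option. A minimal sketch, assuming the geosphere package (not used above) is acceptable and that distHaversine's metres are close enough to gdist's for your purposes:

library(data.table)
library(geosphere)

target <- c(-159.5801, 21.901)  # lon, lat of the reference point

# distHaversine() takes a matrix of lon/lat points and returns distances in metres,
# so the whole table can be filtered in one vectorised step
near <- locations[distHaversine(cbind(location_lon, location_lat), target) <= 50]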
I have a large (~4.5 million records) data frame, and several of the columns have been anonymised by hashing, and I don't have the key, but I do wish to renumber them to something more legible to aid analysis.
To this end, for example, I've deduced that 'campaignID' has 161 unique elements over the 4.5 million records, and have created a vector to hold these. I've then tried writing a for/if loop to search through the full dataset using the unique element vector: each value of 'campaignID' is checked against the unique element vector, and when a match is found, the index value of the unique element vector is returned as the new campaign ID.
campaigns_length <- length(unique_campaign)
dataset_length <- length(dataset$campaignId)

for (i in 1:dataset_length) {
  for (j in 1:campaigns_length) {
    if (dataset$campaignId[[i]] == unique_campaign[[j]]) {
      dataset$campaignId[[i]] <- j
    }
  }
}
The problem, of course, is that while it works, it takes an enormously long time; I had to stop it after 12 hours! Can anyone think of a better approach that's much, much quicker and computationally less expensive?
You could use match.
dataset$campaignId <- match(dataset$campaignId, unique_campaign)
See Is there an R function for finding the index of an element in a vector?
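A toy illustration with made-up hash values, just to show what match() returns:

unique_campaign <- c("a1f3", "9bc0", "77de")   # hypothetical hashed IDs
campaignId <- c("77de", "a1f3", "77de", "9bc0")
match(campaignId, unique_campaign)
# [1] 3 1 3 2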
You might benefit from using the data.table package in this case:
library(data.table)
n = 10000000
unique_campaign = sample(1:10000, 169)
dataset = data.table(
campaignId = sample(unique_campaign, n, TRUE),
profit = round(runif(n, 100, 1000))
)
dataset[, campaignId := match(campaignId, unique_campaign)]
This example with 10 million rows will only take you a few seconds to run.
You could avoid the inner loop with a dictionary-like structure:
id_dict <- list()
for (id in seq_along(unique_campaign)) {
  id_dict[[ unique_campaign[[id]] ]] <- id
}

for (i in 1:dataset_length) {
  dataset$campaignId[[i]] <- id_dict[[ dataset$campaignId[[i]] ]]
}
As pointed out in this post, lists do not have O(1) access, so this will not divide the time required by 161 but by a smaller factor, depending on the distribution of ids in your list.
Also, the main reason your code is so slow is that you are using those inefficient element-by-element accesses (dataset$campaignId[[i]] alone can take a lot of time when i is big). Take a look at the hash package, which provides O(1) access to elements (see also this thread on hashed structures in R).
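For reference, a minimal sketch of that kind of O(1) lookup using a base-R hashed environment (the mechanism the hash package wraps), assuming the campaign IDs are character hashes:

env <- new.env(hash = TRUE)
for (j in seq_along(unique_campaign)) {
  assign(unique_campaign[[j]], j, envir = env)   # hash -> new integer ID
}
# look up every row's hash in the environment and replace it with the integer ID
dataset$campaignId <- vapply(as.character(dataset$campaignId),
                             function(id) get(id, envir = env),
                             integer(1), USE.NAMES = FALSE)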