Merge with allow.cartesian=TRUE results in too many observations - r

I want to merge 2 data frames (data1 and data2). Both initially contain around 35 million observations (around 2GB each).
I removed the duplicates from data2. I would need to keep the duplicates in data1, as I wish to use them for further calculations per observation in data1.
I initially get the well-documented error:
Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
As a solution (I looked at several topics, such as here, here, and here), I included allow.cartesian=TRUE, but now I run into memory issues. Also, for a subset it works, but it gives me more observations than I wish (data1 now has 50 million observations, although I specify all.x=TRUE).
My code is:
#Remove duplicates before merge
data2 <- unique(data2)
#Merge
require(data.table)
data1 <- merge(data1, data2, by="ID", all.x=TRUE, allow.cartesian=TRUE)
Any advice on how to merge this is very welcome.
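Worth checking before reaching for allow.cartesian: unique(data2) only removes rows that are duplicated across all columns, so data2 may still contain several rows per ID, and every such ID multiplies the matching rows of data1 in the left join. A quick count (a hedged sketch; it assumes data2 is already a data.table, otherwise run setDT(data2) first) shows whether that is what inflates the result:
require(data.table)
# IDs that still occur more than once in data2 after unique(data2)
dup_ids <- data2[, .N, by = ID][N > 1L]
nrow(dup_ids)
# If data2 should contribute at most one row per ID, deduplicate by the key instead:
data2 <- unique(data2, by = "ID")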

In order to do a left join, the merge statement needs to understand which column you are using as the "key" or "index" for the join. If you have duplicate column names that are used as the key/index, it doesn't know what to do and gives that error. Further, it needs to know what to do if columns are being joined that have the same name as existing columns.
The solution is to temporarily rename the key/index column in your left (data1) dataset. As a general rule, having duplicate column names is "bad" in R because it will confuse a lot of functions. Many functions silently call make.unique() to de-duplicate column names to avoid confusion.
If you have duplicate ID columns in data1, change them with colnames(data1) <- make.unique(colnames(data1)), which will set them to ID.1, ID.2, etc. Then do your merge (make sure to specify by.x="ID.1", by.y="ID" because of the rename). By default, overlapping non-key columns get the suffixes .x and .y, although you can change these with the suffixes= option (see the merge help file for details).
Lastly, it's worth noting that the merge() function in the data.table package tends to be a lot faster than the base merge() function with similar syntax. See page 47 of the data.table manual.
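A minimal sketch of that renaming approach on toy data (the duplicated "ID" column and all values are made up for illustration; here the second ID column happens to be the real key):
data1 <- data.frame(val = c("a", "b", "c"), key = 1:3)
names(data1) <- c("ID", "ID")                    # force duplicate column names
data2 <- data.frame(ID = 1:3, extra = c("x", "y", "z"))
colnames(data1) <- make.unique(colnames(data1))  # now "ID" and "ID.1"
# merge on the renamed key; any overlapping non-key names would get the .x/.y suffixes
merged <- merge(data1, data2, by.x = "ID.1", by.y = "ID", all.x = TRUE)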

Related

Duplicated Rows No Longer Recognized As Duplicates After read_csv() Is Used

I have a historical file for a project. Call this data frame A; say it is 5,000 rows by 20 columns. I need to constantly append new records to A as they come in.
If I want to test that I'm removing duplicates, my test is:
A <- rbind(A, A) #These are the exact same file. This is now 10,000 x 20
A <- dplyr::filter(A, !duplicated(A)) #using the dplyr package, this is now 5,000 x 20
Keep in mind, the above works as it should. It removes all duplicates. However, when I want to test that this works I save the file, then read it in again and rbind once more:
readr::write_csv(A, "path/A_saved") #Saving the historical file A
A_import <- readr::read_csv("path/A_saved") #Loading in the historical file I just saved to my computer
A <- rbind(A, A_import) #Again, these are still the exact same file, same dimension with the same records, each with a duplicate row. This is now 10,000 by 20
A <- dplyr::filter(A, !duplicated(A)) #Same as above, BUT this is now 6,000 x 20
It is removing the majority of duplicates. However, it does not remove all the duplicates. Upon inspection, the 1,000 rows that should have been removed are still exact duplicates of other rows.
What is happening when I use read_csv() and the duplicated() function in this instance? I searched far and wide for a similar issue and could not find a solution.
I have used the unique() function as opposed to duplicated() and the problem persists. When I read in the exact same data frame, rbind() the data frame object with its read_csv() version, then try to filter() duplicates, not all duplicates are filtered.
duplicated() is not recognizing all rows as proper duplicates.
Any ideas?
Thank you in advance.
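One line of investigation (a hedged sketch, not a confirmed diagnosis): after a round trip through write_csv()/read_csv(), columns can come back with slightly different types or floating-point values than their in-memory originals, so rows that print identically are no longer exact matches for duplicated(). Comparing the two copies makes this visible, and rounding numeric columns before the duplicate check is one way to test the theory:
library(readr)
library(dplyr)
A_import <- read_csv("path/A_saved")
# compare column types and values between the in-memory and re-imported copies
str(A)
str(A_import)
all.equal(as.data.frame(A), as.data.frame(A_import))
# if doubles are the culprit, rounding before filtering should restore exact matches
A_test <- rbind(A, A_import)
A_test <- mutate(A_test, across(where(is.numeric), ~ round(.x, 8)))
A_test <- dplyr::filter(A_test, !duplicated(A_test))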

R: Warning when creating a (long) list of dummies

A dummy column for a column c and a given value x equals 1 if c==x and 0 otherwise. Usually, when creating dummies for a column c, one excludes one value x of one's choice, as the last dummy column doesn't add any information with respect to the already existing dummy columns.
Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:
values <- unique(myDataTable$firm)
cols <- paste('d', as.character(values[-1]), sep='_') # gives us nice d_value names for columns
# the [-1]: I arbitrarily do not create a dummy for the first unique value
myDataTable[, (cols):=lapply(values[-1],function(x)firm==x)]
This code reliably worked for previous columns, which had fewer unique values. firm, however, has more:
str(values)
num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...
I get a warning when trying to add the columns:
Warning message:
truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().
As far as I can tell, all the columns I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this, or of the relevance of truelength.
Taking Arun's comment as an answer.
You should use the alloc.col function to pre-allocate columns in your data.table, passing a number larger than the expected ncol.
alloc.col(myDataTable, 3200)
Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table into a long table (see EAV). Then you only need one column per data type.
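A hedged sketch of that long-format idea (the id column and the small set of firm codes below are made up): keep firm as a single column and build the 0/1 dummies only when they are actually needed, via dcast:
library(data.table)
myDataTable <- data.table(id = 1:5,
                          firm = c(51560090, 51570615, 51560090, 51603870, 51604677))
dummies <- dcast(myDataTable, id ~ firm, fun.aggregate = length, value.var = "firm")
dummies  # one 0/1 column per firm code, produced on demand instead of stored wide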

Any viable alternatives to merge function? R and R Studio shut down

I'm trying to merge two data frames based on a common field called "lookup" that I created. I created the data frames after subsetting the original data frame.
Each of the two newly created data frames has fewer than 10,000 rows. When trying to execute merge, after much thinking, both R and RStudio shut down, with R sometimes producing an error message stating:
Error in make.unique(as.character(rows)) :
promise already under evaluation: recursive default argument reference or earlier problems?
Below is my code...is there any other way to pull down the data from the other data frame based on the common field besides using the merge function? Any help is appreciated.
Also, do you have any thoughts as to why it may be shutting down, using up all the memory, when, in fact, the data size is so small?
wmtdata <- datastep2[datastep2$Market.Type=="WMT", c("Item", "Brand.Family", "Brand", "Unit.Size", "Pack.Size",
"Container", "UPC..int.", "X..Vol", "Unit.Vol", "Distribution", "Market.Type",
"Week.Ending", "MLK.Day","Easter", "Independence.Day", "Labor.Day", "Veterans.Day", "Thanksgiving",
"Christmas", "New.Years","Year","Month","Week.Sequence","Price")]
compdata <- datastep2[datastep2$Market.Type=="Rem Mkt", c("Week.Ending", "UPC..int.","X..Vol", "Unit.Vol", "Price","lookup")]
colnames(compdata)[colnames(compdata)=="X..Vol"]<-"Comp Vol"
colnames(compdata)[colnames(compdata)=="Unit.Vol"]<-"Comp Unit Vol"
colnames(compdata)[colnames(compdata)=="Price"]<-"Comp Price"
combineddata <- merge(wmtdata, compdata, by="lookup")
Try join from the plyr package:
library(plyr)
combineddata <- join(wmtdata, compdata, by="lookup")
With only 10,000 rows, the problem is unlikely to be the use of merge(...) instead of something else. Are the elements of the lookup column unique? Otherwise you get a cross-join.
Consider this trivial case:
df.1 <- data.frame(id=rep(1,10), x=rnorm(10))
df.2 <- data.frame(id=rep(1,10), y=rnorm(10))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 100
So two df with 10 rows each produce a merge with 100 rows because the id is not unique.
Now consider:
df.1 <- data.frame(id=rep(1:10, each=40), x=rnorm(400))
df.2 <- data.frame(id=rep(1:10, each=50), y=rnorm(500))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 20000
In this example, df.1 has each id replicated 40 times, and in df.2 each id is replicated 50 times. merge() will produce one row for every pairing of matching ids, so 50 × 40 = 2,000 rows per id. Since there are 10 ids in this example, you get 20,000 rows. So your merge results can get very big very quickly if the id field (lookup in your case) is not unique.
Instead of using data frames, use the data.table package (see here for an intro). A data.table is like an indexed data frame. It has its own merge method that would probably work in this case.
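A hedged sketch of the data.table route for this case (it assumes wmtdata also carries the lookup column; if lookup is not unique in compdata, the join will still multiply rows exactly as described above):
library(data.table)
setDT(wmtdata)
setDT(compdata)
setkey(wmtdata, lookup)
setkey(compdata, lookup)
combineddata <- merge(wmtdata, compdata, by = "lookup", all.x = TRUE)
# or, with data.table join syntax: combineddata <- compdata[wmtdata]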
Thank you all for all the great help. Data tables are the way to go for me, as I think this was a memory issue ("lookup" values were shared between data frames). While 8 GB of memory (~6 GB free) should be plenty, it was all used up during this process. Nevertheless, data tables worked just fine. Learning a lot from these boards.

Conflicting/duplicate column names in J()?

I have two data.tables (dat and results) that share column names. On a side note, results holds summary statistics computed earlier on subgroups of dat. In other words, nrow(results) != nrow(dat) (but I don't think this is relevant for the question).
Now I want to incorporate these results back into dat (i.e. the original data.table) by adding a new column (i.e. NewColZ) to dat.
This doesn't work as I expect:
dat[,list(colA,colB,NewColZ=results1[colX==colX & colY==colY,colZ])
,by=list(colX, colY)]
Why? Because "colX" and "colY" are column names in both data.tables (i.e. dat and results). What I want to express is results1[take_from_self(colX)==take_from_parent(colX)].
Therefore the following works (observe I have only RENAMED the columns)
dat[,list(colA,colB,NewCol=results1[cx==colX & cy==colY,colZ,])
,by=list(colX, colY)]
I have a feeling, though, that this could simply and easily be done with a join. But dat has many more columns than results.
What you are trying to do is a join on colX and colY. You can use := to assign by reference. Joining is most straightforward when you have unique key combinations (which I am assuming you do).
keys <- c('colX', 'colY')
setkeyv(dat, keys)
setkeyv(results, keys)
dat[results, newcolZ := colZ]
# perhaps use `i.` if there is a colZ in dat
# dat[results, newcolZ := i.colZ]
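For concreteness, a self-contained toy version of the same pattern (the values are made up; the column names mirror the question):
library(data.table)
dat <- data.table(colX = c(1, 1, 2), colY = c("a", "b", "a"), colA = 1:3, colB = 4:6)
results <- data.table(colX = c(1, 1, 2), colY = c("a", "b", "a"), colZ = c(10, 20, 30))
setkeyv(dat, c("colX", "colY"))
setkeyv(results, c("colX", "colY"))
dat[results, newcolZ := i.colZ]  # i.colZ refers to colZ in results
dat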
I do concur with the comments that suggest reading the FAQ and introduction vignettes as well as going through the many examples in ?data.table.
Your immediate problem was a scoping issue, but the primary issue was not being fully aware of the data.table idioms. The join approach is the idiomatic data.table approach.

Endless function/loop in R: Data Management

I am trying to restructure an enormous data frame (about 12,000 cases): in the old data frame, one person is one row and has about 250 columns (e.g. Person 1, testA1, testA2, testB, ...), and I want all the results of test A (1-10 A's overall and 24 items (A-Y)) for that person in one column, so one person ends up with 24 columns and 10 rows. There is also a fixed data frame part before the items A-Y start (personal information like age, gender, etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still calculating, for nearly 24 hours now. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
  out <- data.frame(t(rep(0, (firstcol-1) + numcol)))  # dummy first row, dropped at the end
  names(out) <- names(data[1:(firstcol+numcol-1)])
  for (i in 1:nrow(data)){
    fixdata <- data[i, 1:(firstcol-1)]                 # person-level (fixed) columns
    for (j in seq(firstcol, ((firstcol-1) + numcol*numsets), by = numcol)){
      flexdata <- data[i, j:(j+numcol-1)]              # one block of numcol test columns
      tmp <- cbind(fixdata, flexdata)
      names(tmp) <- names(data[1:(firstcol+numcol-1)])
      out <- rbind(out, tmp)
    }
  }
  out <- out[2:nrow(out), ]
  return(out)
}
Thanks in advance!
As to why: you rbind to out in each iteration. This takes longer with every iteration as out grows, so you have to expect worse-than-linear growth in run time with increasing data size.
So, as Andrie says, you can look at melt.
Or you can do it with base R: stack.
Then you need to cbind the fixed part to the result yourself (repeating the fixed columns with each = n.var.cols).
A third alternative would be array2df from package arrayhelpers.
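Not the stack() route, but a related base-R sketch of the same wide-to-long step using reshape(), on a small mock-up (2 persons, 2 sets, 3 items A-C; the column names are assumptions about the layout):
daten <- data.frame(id = 1:2, age = c(30, 40),
                    A1 = 1:2, B1 = 3:4, C1 = 5:6,
                    A2 = 7:8, B2 = 9:10, C2 = 11:12)
long <- reshape(daten, direction = "long", idvar = "id",
                varying = list(c("A1", "A2"), c("B1", "B2"), c("C1", "C2")),
                v.names = c("A", "B", "C"), timevar = "set")
long[order(long$id, long$set), ]  # the fixed columns (id, age) are carried along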
I agree with the others: look into reshape2 and the plyr package; I just want to add a little in another direction. In particular, melt, cast, and dcast might help you. Plus, it might help to make use of smart column names, e.g.:
As<-grep("^testA",names(yourdf))
# returns a vector with the column positions of testA1 through testA10
Besides, if you 'spend' the two dimensions of a data.frame on test number and test type, there's obviously none left for the person. Sure, you identify them by an ID that you could map to an aesthetic when plotting, but depending on what you want to do, you might want to store them in a list. So you end up with a list of persons, with a data.frame for every person. I am not sure what you are trying to do, but I hope this helps.
Maybe the plyr or other reshaping functions aren't working out for you. How about something more direct and low-level? If you currently have one row per person that goes A1, A2, A3, ..., A10, B1-B10, etc., then extract that lump of stuff from your data frame (I'm guessing columns 11-250), reshape that section the way you want, and put the pieces back together.
yDat <- data[, 11:250]
yDF <- lapply( 1:nrow(data), function(i) matrix(yDat[i,], ncol = 24) )
yDF <- do.call(rbind, yDF) #combine the list of matrices returned above into one
yDF <- data.frame(yDF) #get it back into a data.frame
names(yDF) <- LETTERS[1:24] #might as well name the columns
That's the fastest way to get the bulk of your data into the shape you want. All the lapply function did was add dimension attributes to each row so that it was in the shape you wanted, and then return them as a list, which was massaged by the subsequent lines. But now it doesn't have any of your ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times. Or you can use the convenience function merge to help with that: make a column that is already among your first 10 columns a column of the new data.frame as well, and then just merge them.
yInfo <- data[, 1:10]
ID <- yInfo$ID
yDF$ID <- rep( yInfo$ID, each = 10 )
newDat <- merge(yInfo, yDF)
And now you're done... mostly. You might want to make an extra column that labels the new rows:
newDat$condNum <- rep(1:10, nrow(newDat)/10)
This will be very fast running code. Your data.frame really isn't that big at all and much of the above will execute in a couple of seconds.
This is how you should be thinking of data in R. Not that there aren't convenience functions to handle the bulk of this, but you should be doing things in ways that avoid looping as much as possible. Technically, what happened above only had one loop, the lapply used right at the start. It had very little in that loop as well (loops should be compact when you use them). You're writing scalar code, and that is very, very slow in R... even if you weren't also abusing memory by growing data while doing it. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which is one of your biggest problems here.
(read this to better understand your problems in this code... you've made most of the big errors in there)
