I'm trying to merge two data frames on a common field called "lookup" that I created. I built both data frames by subsetting the original data frame.
Each of the two new data frames has fewer than 10,000 rows. When I try to execute merge, after much thinking, both R and RStudio shut down, with R sometimes producing an error message stating:
Error in make.unique(as.character(rows)) :
promise already under evaluation: recursive default argument reference or earlier problems?
Below is my code. Is there any other way to pull the data from the other data frame based on the common field besides using the merge function? Any help is appreciated.
Also, do you have any thoughts as to why it might be shutting down and using up all the memory when the data size is so small?
wmtdata <- datastep2[datastep2$Market.Type == "WMT",
                     c("Item", "Brand.Family", "Brand", "Unit.Size", "Pack.Size",
                       "Container", "UPC..int.", "X..Vol", "Unit.Vol", "Distribution", "Market.Type",
                       "Week.Ending", "MLK.Day", "Easter", "Independence.Day", "Labor.Day", "Veterans.Day",
                       "Thanksgiving", "Christmas", "New.Years", "Year", "Month", "Week.Sequence", "Price")]
compdata <- datastep2[datastep2$Market.Type == "Rem Mkt",
                      c("Week.Ending", "UPC..int.", "X..Vol", "Unit.Vol", "Price", "lookup")]
colnames(compdata)[colnames(compdata) == "X..Vol"]   <- "Comp Vol"
colnames(compdata)[colnames(compdata) == "Unit.Vol"] <- "Comp Unit Vol"
colnames(compdata)[colnames(compdata) == "Price"]    <- "Comp Price"
combineddata <- merge(wmtdata, compdata, by = "lookup")
Try join from the plyr package:
library(plyr)
combineddata <- join(wmtdata, compdata, by = "lookup")
With only 10,000 rows, the problem is unlikely to be the use of merge(...) instead of something else. Are the elements of the lookup column unique? Otherwise you get a cross-join.
Consider this trivial case:
df.1 <- data.frame(id=rep(1,10), x=rnorm(10))
df.2 <- data.frame(id=rep(1,10), y=rnorm(10))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 100
So two data frames with 10 rows each produce a merged result with 100 rows, because the id is not unique.
Now consider:
df.1 <- data.frame(id=rep(1:10, each=40), x=rnorm(400))
df.2 <- data.frame(id=rep(1:10, each=50), y=rnorm(500))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 20000
In this example, df.1 has each id replicated 40 times, and in df.2 each id is replicated 50 times. merge(...) will produce one row for every pairing of an id's instances across the two data frames, so 50 × 40 = 2,000 rows per id. Since there are 10 ids in this example, you get 20,000 rows. So your merge results can get very big very quickly if the id field (lookup in your case) is not unique.
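A quick diagnostic worth running before the merge (assuming both data frames carry the lookup column from the question):
sum(duplicated(wmtdata$lookup))    # anything > 0 means the join will multiply rows
sum(duplicated(compdata$lookup))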
Instead of using data frames, use the data.table package (see here for an intro). A data.table is like an indexed data frame. It has its own merge method that would probably work in this case.
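A minimal sketch of that route, reusing the question's object names; the keyed join below is the standard data.table idiom, but note it will still blow up if lookup is non-unique on both sides:
library(data.table)
wmtDT  <- as.data.table(wmtdata)
compDT <- as.data.table(compdata)
setkey(wmtDT, lookup)            # index the join column
setkey(compDT, lookup)
combineddata <- compDT[wmtDT]    # keyed join, keeping every row of wmtDT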
Thank you all for all the great help. Data tables are the way to go for me, as I think this was a memory issue ("lookup" values were common between data frames). While 8 GB of memory (~6 GB free) should be plenty, it was all used up during this process. Nevertheless, data tables worked just fine. I'm learning a lot from these boards.
The problem
I am trying to reshape a survey dataset loaded into a dataframe with about 11k variables and 2k rows into a long(er) format, in order to do some analysis on variables that resulted from looped questions. I have not been able to figure out a way to get around memory allocation errors.
Am I hitting the practical size limit for using melt on dataframes (with about 28MB in CSV-format)? Is there a different way to use melt, or would you use a different function/library for this purpose?
What I've tried so far
I've tried using reshape2's melt function, which should be straightforward but gives a memory error immediately ("cannot allocate vector of size...").
Then I tried breaking up the looped variables into chunks, in order to get many smaller dataframes to melt and then re-constitute. That gives me similar errors (with smaller sizes that cannot be allocated).
For reference, my data has an identifier field ("SbjNum"), a number of variables that occur only once (about 1,900), and 99 variables that occur 100 times each (with a prefix of "I_X_I_Y", where X and Y identify loops), which should be melted into rows corresponding to unique X and Y.
Just using melt naively looked like this:
library(reshape2)
molten <- melt(data, id.vars = "SbjNum")
The chunking I've tried so far looks like this:
# all variable names produced by the loops
loops <- names(data)[grep("I_\\d{1,2}_I_\\d{1,2}", names(data))]

# set the number of desired chunks
nloopvars <- length(loops)
nchunks <- 100

# make nchunks indexers to subset the data
chunks <- lapply(
  split(1:nloopvars, sort(1:nloopvars %% nchunks)),  # loop indices split into nchunks groups
  function(v) loops[v]
)

# melt small subsets of the data
molten <- lapply(chunks, function(x) {
  # take only the identifier and a subset of loop vars
  df <- data[c("SbjNum", x)]
  # melt the loop vars
  melt(df, id.vars = "SbjNum")
})
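To re-constitute the chunks afterwards (as intended above), the pieces can be bound back together in one step; rbindlist() from data.table would be a faster drop-in if that package is available:
molten_all <- do.call(rbind, molten)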
EDIT: After terminating and restarting R, as well as clearing my workspace several different ways, approach #2 now works.
After terminating and restarting R and clearing the workspace several times, my own "chunking" approach (see question) is now working; I recommend trying this in case anyone else has similar issues.
[There is still a question of up to what size melting makes sense, but I can live without knowing that answer for now.]
I cannot for the life of me figure out the simple error in my for loop. I want to perform the same analysis over multiple data frames and, on each iteration, output a new data frame named after the input data frame plus an identifying string.
Here is my code:
john and jane are two data frames among many that I am hoping to loop over and compare to bcm to find duplicated rows.
x <- list(john, jane)
for (i in x) {
  test <- rbind(bcm, i)
  test$dups <- duplicated(test$Full.Name, fromLast = TRUE)
  test$dups2 <- duplicated(test$Full.Name)
  test <- test[which(test$dups == TRUE | test$dups2 == TRUE), ]
  newname <- paste("dupl", i, sep = ".")
  assign(newname, test)
}
Thus far, I can either get the naming to work correctly without including the x data or the loop to complete correctly without naming the new data frames correctly.
Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.
I understand that lapply() might be better to use and am very open to that form of solution. I could not figure out how to use it to solve my problem, so I turned to the more familiar for loop.
EDIT:
Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.
If I understand correctly based on your intended result, maybe using match_df from the plyr package could be an option.
library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)
dupl.john and dupl.jane will both be data frames, each containing the rows that appear in both the respective data frame and bcm. Is this what you are trying to achieve?
EDITED after the first comment
library(plyr)
l <- list(john, jane)
res <- lapply(l, function(x) match_df(x, bcm, on = "Full.Name"))
dupl.john <- res[[1]]   # each element of res is already a data frame
dupl.jane <- res[[2]]
Now res is a list of data frames with the matches, based on the column "Full.Name".
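For completeness: the original loop misbehaved because i is the data frame itself, so paste("dupl", i, sep = ".") pastes the data rather than a name. A sketch that keeps the question's duplicated() logic but iterates over a named list:
x <- list(john = john, jane = jane)
for (nm in names(x)) {
  test <- rbind(bcm, x[[nm]])
  dups <- duplicated(test$Full.Name) | duplicated(test$Full.Name, fromLast = TRUE)
  assign(paste("dupl", nm, sep = "."), test[dups, ])
}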
I want to merge 2 data frames (data1 and data2). Both initially contain around 35 million observations (around 2GB each).
I removed the duplicates from data2. I need to keep the duplicates in data1, as I wish to use them for further calculations per observation in data1.
I initially get the well documented error:
Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
As a solution (I looked at several topics, such as here, here, and here), I included allow.cartesian=TRUE, but now I run into memory issues. Also, for a subset it works, but it gives me more observations than I wish (data1 now has 50 million observations, although I specify all.x=TRUE).
My code is:
# Remove duplicates before merge
data2 <- unique(data2)

# Merge
require(data.table)
data1 <- merge(data1, data2, by = "ID", all.x = TRUE, allow.cartesian = TRUE)
Any advice on how to merge this, is very welcome.
In order to do a left join, the merge statement needs to understand which column you are using as the "key" or "index" for the join. If you have duplicate column names that are used as the key/index, it doesn't know what to do and gives that error. Further, it needs to know what to do if columns are being joined that have the same name as existing columns.
The solution is to temporarily rename the key/index column in your left (data1) dataset. As a general rule, having duplicate column names is "bad" in R because it confuses a lot of functions; many functions silently call make.unique() to de-duplicate column names and avoid that confusion.
If you have duplicate ID columns in data1, change them with colnames(data1) <- make.unique(colnames(data1)), which leaves the first occurrence as ID and renames the later duplicates to ID.1, ID.2, etc. Then do your merge (if your key column was one of the renamed ones, specify by.x="ID.1", by.y="ID" accordingly). By default, clashing non-key columns from the two inputs are suffixed with .x and .y, although you can choose the suffixes with the suffixes= argument (see the merge help file for details).
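A minimal sketch of that fix, assuming data1 carries two columns named "ID" and the first of them is the join key:
colnames(data1) <- make.unique(colnames(data1))   # the second "ID" becomes "ID.1"
merged <- merge(data1, data2, by = "ID", all.x = TRUE,
                suffixes = c(".d1", ".d2"))       # tags for clashing non-key names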
Lastly, it's worth noting that the merge() function in the data.table package tends to be a lot faster than the base merge() function, with similar syntax. See page 47 of the data.table manual.
I need a function that treats every x columns as a separate site. So in df1 below there are 8 columns, making 4 sites of 2 variables each. Previously, I have used a procedure like the one answered here: Selecting column sequences and creating variables.
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8*10, replace=TRUE), ncol=8))
I then need to calculate column sums so that a total for each variable is obtained.
colsums <- as.data.frame(t(colSums(df1)))
I subsequently split the dataframe using this technique...
lst1 <- setNames(lapply(
  split(1:ncol(colsums), as.numeric(gl(ncol(colsums), 2, ncol(colsums)))),
  function(i) colsums[, i]
), paste0('site', 1:4))
list2env(lst1, envir = .GlobalEnv)
And organise into one dataframe...
Combined <- as.matrix(mapply(c,site1,site2,site3,site4))
rownames(Combined) <- c("Site.1","Site.2","Site.3","Site.4")
Whilst this technique has been great on smaller data frames, where there is a substantial number of sites (>500), typing out each site in the mapply call takes up a lot of code, and some sites could get missed if I'm typing them all in manually. Is there an easy way to overcome this after the colsums stage?
A matrix is a vector with dimensions. Matrices are stored in column-major order in R.
The call matrix(colsums, nrow=2) should help you a lot.
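Concretely, a minimal sketch using the question's own df1; byrow = TRUE lays the sums out as one row per site with two variables per site, and it scales to any number of sites without naming them individually:
totals <- colSums(df1)                        # length-8 numeric vector
Combined <- matrix(totals, ncol = 2, byrow = TRUE)
rownames(Combined) <- paste0("Site.", seq_len(nrow(Combined)))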
NB.: Polluting the "global" environment is generally a bad idea.
I am trying to restructure an enormous data frame (about 12,000 cases): in the old data frame one person is one row and has about 250 columns (e.g. Person 1, test A1, test A2, test B, ...), and I want all the results of the tests (sets 1-10 overall, each with 24 items, A-Y) for that person in one column per item, so one person ends up with 24 columns and 10 rows. There is also a fixed part before the items A-Y start (personal information like age, gender, etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still calculating, for nearly 24 hours now. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
  out <- data.frame(t(rep(0, (firstcol - 1) + numcol)))
  names(out) <- names(data)[1:(firstcol + numcol - 1)]
  for (i in 1:nrow(data)) {
    fixdata <- data[i, 1:(firstcol - 1)]
    for (j in seq(firstcol, (firstcol - 1) + numcol * numsets, by = numcol)) {
      flexdata <- data[i, j:(j + numcol - 1)]
      tmp <- cbind(fixdata, flexdata)
      names(tmp) <- names(data)[1:(firstcol + numcol - 1)]
      out <- rbind(out, tmp)
    }
  }
  out <- out[2:nrow(out), ]
  return(out)
}
Thanks in advance!
An idea why: you rbind to out in each iteration. This takes longer with each iteration as out grows, so you have to expect worse than linear growth in run time with increasing data set size.
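The usual fix is to collect the pieces in a list and bind them once at the end; a minimal sketch, where make_piece() is a hypothetical stand-in for whatever builds one chunk:
pieces <- vector("list", n_pieces)
for (k in seq_len(n_pieces)) {
  pieces[[k]] <- make_piece(k)   # hypothetical helper returning one data frame
}
out <- do.call(rbind, pieces)    # a single rbind instead of one per iteration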
So, as Andrie says, you can look at melt.
Or you can do it with core R: stack.
Then you need to cbind the fixed part to the result yourself, repeating the fixed rows so they line up with the stacked values, as sketched below.
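A minimal sketch of the stack() route, assuming the flexible test columns start at column firstcol; because stack() unrolls the columns in column-major order, the fixed rows are repeated with times =, not each =:
flex  <- data[, firstcol:ncol(data)]
long  <- stack(flex)             # two columns: values and ind (the source column)
fixed <- data[rep(seq_len(nrow(data)), times = ncol(flex)), 1:(firstcol - 1)]
out   <- cbind(fixed, long)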
A third alternative would be array2df from package arrayhelpers.
I agree with the others: look into the reshape2 and plyr packages. In particular, melt, cast, and dcast might help you. Plus, it might help to make use of smart column names, e.g.:
As <- grep("^testA", names(yourdf))
# returns a vector with the column positions of all of testA1 through testA10
Besides, if you 'spend' the two dimensions of a data.frame on test number and test type, there's obviously none left for the person. Sure, you identify them by an ID that you could map to an aesthetic when plotting, but depending on what you want to do you might want to store them in a list instead, so you end up with a list of persons with a data.frame for every person. I am not sure what you are trying to do, but I hope this helps nonetheless.
Maybe you're not getting on with plyr or the other reshaping functions. How about something more direct and low level? If you currently have one row per person that goes A1, A2, A3, ..., A10, B1-B10, etc., then extract that lump of columns from your data frame (I'm guessing columns 11-250), reshape that section into the form you want, and put the pieces back together.
yDat <- data[, 11:250]
# reshape each person's 240 values into a 10 x 24 matrix (unlist drops the data.frame structure)
yDF <- lapply(1:nrow(data), function(i) matrix(unlist(yDat[i, ]), ncol = 24))
yDF <- do.call(rbind, yDF)   # combine the list of matrices returned above into one
yDF <- data.frame(yDF)       # get it back into a data.frame
names(yDF) <- LETTERS[1:24]  # might as well name the columns
That's the fastest way to get the bulk of your data into the shape you want. All the lapply did was add dimension attributes to each row so that it was in the shape you wanted, returning the rows as a list that the subsequent lines massage back into a data.frame. But now it doesn't have any of your ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times, and the convenience function merge can help with that: make a column that already exists in your first 10 columns a column of the new data.frame, then just merge them.
yInfo <- data[, 1:10]
yDF$ID <- rep(yInfo$ID, each = 10)   # each person's ID repeated for their 10 rows
newDat <- merge(yInfo, yDF)
And now you're done... mostly. You might want to add an extra column that numbers the new rows:
newDat$condNum <- rep(1:10, nrow(newDat)/10)
This will be very fast-running code. Your data.frame really isn't that big at all, and much of the above will execute in a couple of seconds.
This is how you should be thinking of data in R. Not that there aren't convenience functions to handle the bulk of this, but you should work in ways that avoid looping as much as possible. Technically, what happened above had only one loop, the lapply used right at the start, and it had very little in it (lapply calls should be compact). You were writing scalar code, which is very, very slow in R, even before abusing memory by growing data as you go. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which is one of your biggest problems here.
(read this to better understand your problems in this code... you've made most of the big errors in there)