Memory limit error when merging data frames using reduce


I have a list of 67 data frames. Each data frame has two columns, the first is named "Material" and the second is named a month and year. Every data frame has the same name for the first column, but no two data frames have the same name for the second column.
> head(str(fy16_list))
List of 67
$ April_FY11 :'data.frame': 1559 obs. of 2 variables:
..$ Material : chr [1:1559] "622-5129-105" "622-5129-109" "622-5129-203" "622-5129-223" ...
..$ April_FY11: chr [1:1559] "1 " NA "(3)" NA ...
$ April_FY12 :'data.frame': 1721 obs. of 2 variables:
..$ Material : chr [1:1721] "622-5129-021" "622-5129-105" "622-5129-109" "622-5129-203" ...
..$ April_FY12: chr [1:1721] NA NA NA NA ...
$ April_FY13 :'data.frame': 1189 obs. of 2 variables:
..$ Material : chr [1:1189] "122000-F15SA_1" "122000-F15SA_2" "987-9705-001" "822-1867-001" ...
..$ April_FY13: chr [1:1189] NA NA "-15" "15" ...
The list is 5.2Mb in size so it really isn't very big at all, but for some reason when I do:
mydf <- Reduce(function(...) merge(..., all=T), mylist)
I wait for like 5-10min and then get an error message that says I've reached my memory limit!
Error: cannot allocate vector of size 88.7 Mb
Warning messages:
1: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, ... :
Reached total allocation of 8078Mb: see help(memory.size)
2: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, ... :
Reached total allocation of 8078Mb: see help(memory.size)
# The warning message repeats 12 times...
The data frame I've created is 8Gb in size! I have no idea why this is happening. I tried
mydf <- reshape::merge_all(mylist)
But the same thing happens.
Everything works smoothly when I do
mylist <- mylist[1:10]
mydf <- Reduce(function(...) merge(..., all=T), mylist)
So I'm thinking that this code just doesn't scale, but honestly 5.2Mb seems pretty small, as I've worked with lists over 300Mb in R before.
Any suggestions for getting this to work?
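The post does not say whether "Material" is unique within each data frame. If it is not (an assumption, not something the question confirms), every merge multiplies the matching rows, which could explain how a 5.2Mb list grows to several Gb. A minimal sketch with made-up toy data, where df1 and df2 are hypothetical stand-ins for two elements of the list:

```r
# Hypothetical toy data: each frame repeats the key "A" three times.
df1 <- data.frame(Material = c("A", "A", "A", "B"), Jan = 1:4)
df2 <- data.frame(Material = c("A", "A", "A", "C"), Feb = 5:8)

# Check each frame for repeated keys before merging.
has_dups <- sapply(list(df1, df2), function(d) anyDuplicated(d$Material) > 0)

# A full outer merge pairs every "A" row in df1 with every "A" row in df2.
merged <- merge(df1, df2, all = TRUE)
n_rows <- nrow(merged)  # 3 * 3 rows for "A", plus one each for "B" and "C"
```

If the check does find duplicates in the real list, aggregating or de-duplicating each data frame by "Material" before the Reduce call may keep the result small.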

Related

Keep specific columns in a data.frame while unlisting a column which is a data.frame type

I have a data.frame called StockWeights. The structure of the data.frame is as follows:
'data.frame': 3 obs. of 6 variables:
$ Id : chr "159347" "161863" "22646"
$ ISIN : chr "DK0061156759" "DK0061533726" "DK0060681468"
$ $id : chr "21" "22" "23"
$ Name : chr "159347" "161863" "22646"
$ SumPeriod:'data.frame': 3 obs. of 27 variables:
..$ AccPeriodBasTwrAtMarketPrice : num 0.0969 0.538 -0.1071
..$ AccPeriodLocTwrAtMarketPrice : num 0.0969 0.538 -0.1071
..$ BopDate : chr "2022-02-28T00:00:00" "2022-02-28T00:00:00" "2022-02-28T00:00:00"
..$ BopBasHoldingValueAtMarketPrice: num 7592267 5135961 7166816
My question is then: How can I "unlist" this SumPeriod data.frame column and display the BopBasHoldingValueAtMarketPrice column together with the Id and ISIN columns? What I have done so far is to use the pluck function in the purrr package as such:
StockWeights %>%
pluck('SumPeriod') %>%
select("EopBasHoldingValueAtMarketPrice")
Which only gives me the "EopBasHoldingValueAtMarketPrice":
'data.frame': 3 obs. of 1 variable:
$ EopBasHoldingValueAtMarketPrice: num 7599626 5163591 7159142
But I can't find a way to get these three values together with the corresponding "Id" and "ISIN" from the original data.frame. Does anyone have an idea how to achieve this? Sorry for not providing a reproducible example. The data I am looking at comes from an API call and I am having some trouble recreating it manually. But the end goal is to get a data.frame that looks like:
df = data.frame(
Id = c("159347", "161863", "22646"),
ISIN = c("DK0061156759", "DK0061533726", "DK0060681468"),
BopBasHoldingValueAtMarketPrice = c(7592267,5135961,7166816)
)
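A minimal base R sketch of one way to do this, assuming SumPeriod really is a data.frame stored as a column, as the str output suggests. The toy StockWeights below is rebuilt by hand from the values in the post and keeps only the columns needed here:

```r
# Hand-built stand-in for the API result (only the relevant columns).
StockWeights <- data.frame(
  Id   = c("159347", "161863", "22646"),
  ISIN = c("DK0061156759", "DK0061533726", "DK0060681468"),
  stringsAsFactors = FALSE
)
# Assigning a data.frame with matching rows creates a nested data.frame column.
StockWeights$SumPeriod <- data.frame(
  BopDate = rep("2022-02-28T00:00:00", 3),
  BopBasHoldingValueAtMarketPrice = c(7592267, 5135961, 7166816),
  stringsAsFactors = FALSE
)

# Reach into the nested column with a second `$` and bind it to the outer columns.
result <- data.frame(
  Id   = StockWeights$Id,
  ISIN = StockWeights$ISIN,
  BopBasHoldingValueAtMarketPrice =
    StockWeights$SumPeriod$BopBasHoldingValueAtMarketPrice,
  stringsAsFactors = FALSE
)
```

Because the nested data.frame has one row per outer row, the values line up with "Id" and "ISIN" by position.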

Concatenate a series of lists with incrementing numeric suffixes in R

In R I have a series of lists with incrementing numeric suffixes eg mylist1 , mylist2 , mylist3.
I want to concatenate these , like c(mylist1, mylist2, mylist3)
Is there a shorthand way to manage this?

I think you are trying to create a list of lists.
You can do it simply by calling:
list(mylist1, mylist2, mylist3)
If you have many lists with a similar name pattern, you can use mget to get all objects whose names match that pattern, with the names supplied by ls(pattern = x).
data
mylist8 <- list(1, 2)
mylist9 <- list(3, 4)
mylist10 <- list(5, 6)
# Lists with indexes 8:10 are used to show why ordering by `parse_number()` matters: without that step the objects would be sorted by name (mylist10 before mylist8).
Answer
library(readr)  # provides parse_number()
nms <- ls(pattern = 'mylist\\d+')
list_of_lists <- mget(nms[order(parse_number(nms))])
> str(list_of_lists)
List of 3
$ mylist8:List of 2
..$ : num 1
..$ : num 2
$ mylist9:List of 2
..$ : num 3
..$ : num 4
$ mylist10:List of 2
..$ : num 5
..$ : num 6
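The answer above builds a list of lists. If the goal is the flat result of c(mylist1, mylist2, mylist3) as asked in the question, a sketch along these lines may help. It assumes the suffixes are known in advance, so the names are built with paste0 instead of being looked up with ls:

```r
# Same toy data as in the answer above.
mylist8  <- list(1, 2)
mylist9  <- list(3, 4)
mylist10 <- list(5, 6)

# Build the names in the order wanted, so no sorting step is needed.
nms <- paste0("mylist", 8:10)

# mget() fetches the objects; unname() drops the object names so that
# c() does not prefix every element with them.
combined <- do.call(c, unname(mget(nms)))
```

Here combined is a single list of six elements, the same as c(mylist8, mylist9, mylist10).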

R get row mean of every nth element in a list

I have a nested list and I want to get the mean of one particular variable inside the list.
When my list was not nested, it was simple to do, but I am not sure how to change my code now that there are multiple elements of different sizes
str of list:
> str(means2)
List of 1
$ :List of 2
..$ :'data.frame': 12 obs. of 2 variables:
.. ..$ means : num [1:12] 465063 355968 76570 542873 854570 ...
.. ..$ variablenames: chr [1:12] "NumberOfPassengers" "FareClass" "TripType" "JourneyTravelTime" ...
..$ :'data.frame': 12 obs. of 2 variables:
.. ..$ means : num [1:12] 449490 359997 67899 602895 967327 ...
.. ..$ variablenames: chr [1:12] "NumberOfPassengers" "FareClass" "TripType" "JourneyTravelTime" ...
I was using this code
testdf=as.data.frame(rowMeans(simplify2array(sapply(means2,"[[",1))))
I am just not sure how to change this code to match the fact the means I am obtaining are from the 2nd element and not the first(only) element.
Thanks for any help
example:
edited: example had an error

Based on the str of the nested list for 'means2', this should work
unlist(lapply(means2, function(x) rowMeans(do.call(cbind, sapply(x, "[", 1)))))
As there is only a single outer list, we can extract it using [[, loop over the list elements to get the first column of each, take the elementwise sum with Reduce, and divide by the length of the list (2 in the example).
Reduce(`+`, lapply(means2[[1]], `[`, 1))/2
Or after extracting the list elements, cbind it and do a rowMeans
rowMeans(do.call(cbind,lapply(means2[[1]], `[`, 1)))
data
means2 <- list(list(data.frame(means = 1:5, variablenames = letters[1:5]),
data.frame(means = 11:15, variablenames = letters[6:10])) )
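A small variation on the Reduce approach, as a sketch only: it divides by length() instead of a hard-coded 2 and extracts the "means" column by name with `[[`, so the result is a plain numeric vector. It assumes every inner data frame has a "means" column with the same number of rows:

```r
# Same toy data as above.
means2 <- list(list(
  data.frame(means = 1:5,   variablenames = letters[1:5]),
  data.frame(means = 11:15, variablenames = letters[6:10])
))

inner <- means2[[1]]  # the single outer element holds the data frames

# Sum the "means" vectors elementwise, then divide by how many there are.
means_by_row <- Reduce(`+`, lapply(inner, `[[`, "means")) / length(inner)
```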

Write a data frame containing a list to csv file

I would like to save my data train.user (213451 obs. of 20 variables. 2 of the variables are lists) as a csv file.
I use:
write.csv(train.user, "train_user.csv", row.names = FALSE)
but an error occurs
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type 'list' in 'EncodeElement'
This is what my data train.user looks like (using str, showing only part of it):
'data.frame': 213451 obs. of 20 variables:
$ id : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..: 100523 48039 26485 68504 48956 147281 129610 2144 59779 40826 ...
$ gender : Factor w/ 4 levels "-unknown-","FEMALE",..: 1 3 2 2 1 1 2 2 2 1 ...
$ age :List of 213451
..$ : num NA
..$ : num 38
..$ : num 56
..$ : num 42
..$ : num 41
.. [list output truncated]
It seems like the column age is stored as a list, and write.csv doesn't accept this format. Following my naive intuition, I tried to re-store the column as a data frame with the following code, but it failed.
train.user$age <- as.data.frame(train.user$age)
Error message:
Error in `$<-.data.frame`(`*tmp*`, "age", value = list(NA_real_. = NA_real_, :
replacement has 1 row, data has 213451
I also tried train.user$age <- data.frame(lapply(train.user$age, unlist)) as suggested in another post, but the same error occurs.
I appreciate any help!

train.user$age <- unlist(train.user$age)
Technically, a data.frame is a list of equal-length vectors, but most functions will assume that all of the columns are atomic vectors and will fail when you try to use a list.
NB: Don't edit an answer into your question.

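pacman::p_load(tidyverse)
# map() would return a list again; map_dbl() returns a numeric vector
# (this assumes each element of age holds exactly one number or NA)
train.user <- train.user %>% as_tibble() %>%
mutate(age = map_dbl(age, ~ unlist(.x)))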
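A base R sketch of the whole round trip, using a tiny made-up stand-in for train.user. It assumes each element of the list column holds exactly one value, as in the str output above; the output path is a temporary file chosen only for illustration:

```r
# Tiny hypothetical stand-in with one list column.
train.user <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
train.user$age <- list(NA_real_, 38, 56)

# Find which columns are lists (these are the ones write.csv rejects).
is_list_col <- sapply(train.user, is.list)

# Flatten each list column to an atomic vector.
for (col in names(train.user)[is_list_col]) {
  train.user[[col]] <- unlist(train.user[[col]])
}

# Writing now succeeds because every column is atomic.
out <- tempfile(fileext = ".csv")
write.csv(train.user, out, row.names = FALSE)
```

If some elements were NULL or held more than one value, unlist would change the length and the assignment would fail, so it is worth checking lengths(train.user$age) first.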

Sorting several dates by one observation

I am at a loss! I am trying to sort my data by business_id. Each id has several dates associated with it. I am trying to create a new variable that shows the time in days between the first and last date associated with a business_id. Such that
row.names business_id Days
1 x8453 DxUn-ukNL27GOuwjnFGFKA 876
The data currently is structured as:
row.names date business_id
1 X27038 2012-04-21 FV0BkoGOd3Yu_eJnXY15ZA
2 X60951 2012-05-14 Trar_9cFAj6wXiXfKfEqZA
3 X60462 2011-10-05 DxUn-ukNL27GOuwjnFGFKA
4 X2078 2010-12-19 PlcCjELzSI3SqX7mPF5cCw
5 X166883 2011-09-29 pF7uRzygyZsltbmVpjIyvw
6 X177828 2010-09-19 XkNQVTkCEzBrq7OlRHI11Q
7 X128628 2012-05-05 6TWRuHn24DL6vnW8Uyu4Vw
8 X202882 2011-12-10 Xo9Im4LmIhQrzJcO4R3ZbA
9 X64569 2012-02-07 Z67obTep38V9HMtA10yu5A
10 X14667 2009-07-18 xsSnuGCCJD4OgWnOZ0zB4A
11 X17432 2012-08-11 XkNQVTkCEzBrq7OlRHI11Q
Thanks in advance!
Update:
str(data)
'data.frame': 2299 obs. of 2 variables:
$ date :List of 2299
..$ X2736 : chr "2012-05-29"
..$ X160403: chr "2011-08-29"
..$ X19897 : chr "2010-09-27"
..$ X44519 : chr "2012-05-22"
..$ X75910 : chr "2012-10-22"
..$ X13052 : chr "2010-07-14"
$ business_id:List of 2299
..$ X2736 : chr "EFJAVVBQQqftuqY5Wb3WtQ"
..$ X160403: chr "YDlk9buwF8JQE3JgQgraOw"
..$ X19897 : chr "sc1UacpE3cVNJueMdXiCyA"
..$ X44519 : chr "VY_tvNUCCXGXQeSvJl757Q"
..$ X75910 : chr "fowXs9zAM0TQhSfSkPeVuw"
..$ X13052 : chr "xM5F0cLAlKWoB8rOgt5ZOw"
..$ X87807 : chr "nLL0sjLdZ13YdvhXKyss7A"

Edit now that the OP has provided the structure:
Your data is structured quite oddly. A usual structure in R is a data.frame, which is technically a list of vectors where the vectors are all the same length. In your case, you have a list of two (named) lists.
Store the names somewhere else for the time being:
old.names <- names(x[[1]])
Then turn the data into an ordinary data.frame, using the handy unlist() function:
x$date <- unlist(x$date)
x$business_id <- unlist(x$business_id)
Use str(x) to see the difference. The names can go back in now, and it's also a good time to turn your "date" column from a character into a proper date, and sort by date order.
x$old.names <- old.names
x$date <- as.POSIXct(x$date)
x <- x[order(x$date), ]
My original answer should now work.
Original answer:
Like agstudy I'd use the plyr package, but if you have the "date" column in a date format and want to keep it that way, you could try:
require(plyr)
ddply(x, "business_id", summarise
, duration = difftime(max(date), min(date), units = "days")
, old.names = old.names[1])
This also gives you flexibility on the units.
If your example data is sorted by date ascending with dat <- dat[order(dat$date), ], then old.names[1] gives you the name of the earliest row and old.names[length(old.names)] gives you the name of the most recent row, but I don't know whether that is reliable given the magic inside ddply.
Further edit:
I only showed how to handle the names because they're in your example. They look as though they were originally column headers from imported data, and R has prepended "X" to them because names aren't allowed to begin with numerals.

Using plyr package:
ddply(dat,.(business_id),function(x)
if(length(x$date)>1)
diff(range(as.POSIXct(x$date)))
else 0)
business_id V1
1 6TWRuHn24DL6vnW8Uyu4Vw 0
2 DxUn-ukNL27GOuwjnFGFKA 0
3 FV0BkoGOd3Yu_eJnXY15ZA 0
4 pF7uRzygyZsltbmVpjIyvw 0
5 PlcCjELzSI3SqX7mPF5cCw 0
6 Trar_9cFAj6wXiXfKfEqZA 0
7 XkNQVTkCEzBrq7OlRHI11Q 692
8 Xo9Im4LmIhQrzJcO4R3ZbA 0
9 xsSnuGCCJD4OgWnOZ0zB4A 0
10 Z67obTep38V9HMtA10yu5A 0
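For comparison, a base R sketch of the same calculation without plyr. It uses three rows copied from the example data and assumes the dates have already been converted to Date:

```r
# Three rows taken from the example data in the question.
dat <- data.frame(
  date = as.Date(c("2010-09-19", "2012-08-11", "2012-04-21")),
  business_id = c("XkNQVTkCEzBrq7OlRHI11Q",
                  "XkNQVTkCEzBrq7OlRHI11Q",
                  "FV0BkoGOd3Yu_eJnXY15ZA"),
  stringsAsFactors = FALSE
)

# Days between the first and last date for each business_id.
days <- tapply(dat$date, dat$business_id, function(d) {
  as.numeric(difftime(max(d), min(d), units = "days"))
})

# Put the result in the shape the question asks for.
result <- data.frame(business_id = names(days), Days = as.vector(days))
```

An id with a single date gets 0 here, which matches the plyr output above.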
