Write a data frame containing a list to csv file - r

I would like to save my data frame train.user (213451 obs. of 20 variables, 2 of which are lists) as a csv file.
I use:
write.csv(train.user, "train_user.csv", row.names = FALSE)
but an error occurs
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type 'list' in 'EncodeElement'
This is what my data train.user looks like (output of str, showing only part of it):
'data.frame': 213451 obs. of 20 variables:
$ id : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..: 100523 48039 26485 68504 48956 147281 129610 2144 59779 40826 ...
$ gender : Factor w/ 4 levels "-unknown-","FEMALE",..: 1 3 2 2 1 1 2 2 2 1 ...
$ age :List of 213451
..$ : num NA
..$ : num 38
..$ : num 56
..$ : num 42
..$ : num 41
.. [list output truncated]
It seems that the column age is stored as a list, and write.csv doesn't accept this format. Following my naive intuition, I tried to re-store the column as a data frame with the following code, but it failed.
train.user$age <- as.data.frame(train.user$age)
Error message:
Error in `$<-.data.frame`(`*tmp*`, "age", value = list(NA_real_. = NA_real_, :
replacement has 1 row, data has 213451
I also tried train.user$age <- data.frame(lapply(train.user$age, unlist)) as suggested in another post, but the same error occurs.
I appreciate any help!

train.user$age <- unlist(train.user$age)
Technically, a data.frame is a list of equal-length vectors, but most functions will assume that all of the columns are atomic vectors and will fail when you try to use a list.
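As a rough illustration of the point above, here is a minimal sketch on a toy data frame. The three-row train.user and its id values are made up for the example; only the first few age values are taken from the str output in the question. It also shows why the as.data.frame attempt failed: a list of n elements becomes a data frame with 1 row and n columns.

```r
# Toy stand-in for train.user (hypothetical ids, ages from the question's str output)
train.user <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
train.user$age <- list(NA_real_, 38, 56)   # list column, one number per element

# as.data.frame() on the list gives 1 row and 3 columns, hence "replacement has 1 row"
dim(as.data.frame(train.user$age))

# write.csv() fails while age is still a list; capture the message instead of stopping
csv_error <- tryCatch({
  write.csv(train.user, tempfile(fileext = ".csv"), row.names = FALSE)
  NULL
}, error = function(e) conditionMessage(e))

# unlist() flattens the list into a plain numeric vector.
# Caveat: NULL elements would be dropped and change the length; NA elements are kept.
train.user$age <- unlist(train.user$age)

out_file <- tempfile(fileext = ".csv")
write.csv(train.user, out_file, row.names = FALSE)
round_trip <- read.csv(out_file)
```

This only works this simply when every list element holds exactly one value, as in the question.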
NB: Don't edit an answer into your question.

pacman::p_load(tidyverse)
train.user %>% as_tibble() %>%
mutate(age = map_dbl(age, ~ unlist(.))) # map_dbl() returns a numeric vector; map() would leave age as a list

Related

Keep specific columns in a data.frame while unlisting a column which is a data.frame type

I have a data.frame called StockWeights. The structure of the data.frame is as follows:
'data.frame': 3 obs. of 6 variables:
$ Id : chr "159347" "161863" "22646"
$ ISIN : chr "DK0061156759" "DK0061533726" "DK0060681468"
$ $id : chr "21" "22" "23"
$ Name : chr "159347" "161863" "22646"
$ SumPeriod:'data.frame': 3 obs. of 27 variables:
..$ AccPeriodBasTwrAtMarketPrice : num 0.0969 0.538 -0.1071
..$ AccPeriodLocTwrAtMarketPrice : num 0.0969 0.538 -0.1071
..$ BopDate : chr "2022-02-28T00:00:00" "2022-02-28T00:00:00" "2022-02-28T00:00:00"
..$ BopBasHoldingValueAtMarketPrice: num 7592267 5135961 7166816
My question is then: How can I "unlist" this SumPeriod data.frame column and display the BopBasHoldingValueAtMarketPrice column together with the Id and ISIN columns? What I have done so far is to use the pluck function in the purrr package as such:
StockWeights %>%
pluck('SumPeriod') %>%
select("EopBasHoldingValueAtMarketPrice")
Which only gives me the "EopBasHoldingValueAtMarketPrice":
'data.frame': 3 obs. of 1 variable:
$ EopBasHoldingValueAtMarketPrice: num 7599626 5163591 7159142
But I can't find a way to get these three values together with the corresponding "Id" and "ISIN" in the original data.frame. Does anyone have an idea how to achieve this? Sorry for not providing a reproducible example. The data I am looking at comes from an API call and I am having some trouble recreating it manually. But the end goal is to get a data.frame that looks like:
df = data.frame(
Id = c("159347", "161863", "22646"),
ISIN = c("DK0061156759", "DK0061533726", "DK0060681468"),
BopBasHoldingValueAtMarketPrice = c(7592267,5135961,7166816)
)
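One possible approach, sketched here in base R, is to reach into the nested data.frame column with $ and build a new data.frame from the pieces. The StockWeights object below is a hand-built stand-in, not the real API result: it keeps only two of the 27 SumPeriod columns, with values copied from the str output above.

```r
# Hand-built stand-in for the API result (assumption: only a subset of columns)
StockWeights <- data.frame(
  Id   = c("159347", "161863", "22646"),
  ISIN = c("DK0061156759", "DK0061533726", "DK0060681468"),
  stringsAsFactors = FALSE
)
# Assigning a data.frame with the same number of rows creates a nested data.frame column
StockWeights$SumPeriod <- data.frame(
  BopDate = rep("2022-02-28T00:00:00", 3),
  BopBasHoldingValueAtMarketPrice = c(7592267, 5135961, 7166816),
  stringsAsFactors = FALSE
)

# Rows of the nested data.frame line up with the outer rows, so the columns can be combined directly
result <- data.frame(
  Id   = StockWeights$Id,
  ISIN = StockWeights$ISIN,
  BopBasHoldingValueAtMarketPrice = StockWeights$SumPeriod$BopBasHoldingValueAtMarketPrice,
  stringsAsFactors = FALSE
)
```

The same pattern should work for any other column inside SumPeriod by changing the name after the second $.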

Concatenate a series of lists with incrementing numeric suffixes in R

In R I have a series of lists with incrementing numeric suffixes eg mylist1 , mylist2 , mylist3.
I want to concatenate these , like c(mylist1, mylist2, mylist3)
Is there a shorthand way to manage this?
I think you are trying to create a list of lists.
You can do it simply by calling:
list(mylist1, mylist2, mylist3)
If you have many lists with a similar name pattern, you can use mget to get all objects whose names match that pattern (found with ls(pattern = x)).
data
mylist8 <- list(1, 2)
mylist9 <- list(3, 4)
mylist10 <- list(5, 6)
# The lists with indexes 8:10 are included to show why ordering by `parse_number(ls(...))` matters. Without the `parse_number` step, the names would be sorted alphabetically ("mylist10" before "mylist8"), giving a different order.
Answer
library(readr) # provides parse_number()
list_of_lists <- mget(ls(pattern = "^mylist\\d+$")[order(parse_number(ls(pattern = "^mylist\\d+$")))])
> str(list_of_lists)
List of 3
$ mylist8:List of 2
..$ : num 1
..$ : num 2
$ mylist9:List of 2
..$ : num 3
..$ : num 4
$ mylist10:List of 2
..$ : num 5
..$ : num 6
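The answer above builds a list of lists. If the goal is the flat result of c(mylist1, mylist2, mylist3) from the question, a possible base R variant is to pass the collected lists to do.call(c, ...). This is only a sketch: the env variable and the suffix extraction with sub() are choices made for this example, and it assumes every object matching the pattern really is a list.

```r
mylist8  <- list(1, 2)
mylist9  <- list(3, 4)
mylist10 <- list(5, 6)

env <- environment()  # the environment where the lists live (here: wherever this code runs)

# Find the names, then order them by their numeric suffix rather than alphabetically
list_names <- ls(envir = env, pattern = "^mylist[0-9]+$")
suffix     <- as.integer(sub("^mylist", "", list_names))
list_names <- list_names[order(suffix)]

# unname() keeps the outer names from being pasted onto the elements of the result
combined <- do.call(c, unname(mget(list_names, envir = env)))
str(combined)
```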

casting to a data.frame in order to sort columns fails with unimplemented type list

Why does the final cast to a data.frame appear not to work? When I try to sort it I get: Error in order(temp[, 1], decreasing = T) : unimplemented type 'list' in 'orderVector1'
data<-lapply(1:5,function(i){
lapply(1:5,function(j){
list(i=i,j=j)
})
})
temp<-as.data.frame(data)
temp<-matrix(temp,ncol=2,byrow=T)
head(temp,20)
temp<-data.frame(temp)
class(temp) #####IS A DATA.FRAME
temp<-temp[order(temp[,1],decreasing=T),]
Each column in the OP's dataset is a list of length 25. We can convert it to a normal data.frame with atomic vector columns.
temp1 <- data.frame(lapply(temp, unlist))
and then do the order
temp1[order(temp1[,1], decreasing = TRUE),]
It is easier to check the structure of the dataset with str
str(temp, list.len = 3)
#'data.frame': 25 obs. of 2 variables:
# $ X1:List of 25
# ..$ : int 1
# ..$ : int 1
# ..$ : int 1
# .. [list output truncated]
# $ X2:List of 25
# ..$ : int 1
# ..$ : int 2
# ..$ : int 3
# .. [list output truncated]
Also, we can directly get a data.frame with expand.grid
expand.grid(rep(list(1:5), 2))
Or using CJ from data.table
library(data.table)
CJ(1:5, 1:5)
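For data frames where it is not known in advance which columns are lists, the unlist step can be wrapped in a small helper. flatten_list_columns is a hypothetical name introduced for this sketch, and it assumes every list column holds exactly one value per row (nested data.frame columns or longer elements would need different handling).

```r
# Hypothetical helper: replace each list column with its unlisted (atomic) version
flatten_list_columns <- function(df) {
  is_list_col <- vapply(df, is.list, logical(1))
  for (nm in names(df)[is_list_col]) {
    df[[nm]] <- unlist(df[[nm]])
  }
  df
}

# Small example with two list columns, similar in shape to temp above
temp <- data.frame(row = 1:3)
temp$X1 <- list(2L, 3L, 1L)
temp$X2 <- list(10L, 20L, 30L)

flat   <- flatten_list_columns(temp)
sorted <- flat[order(flat$X1, decreasing = TRUE), ]
```

After flattening, order() receives an integer vector instead of a list, so the sort no longer hits the unimplemented type error.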

Memory limit error when merging data frames using reduce

I have a list of 67 data frames. Each data frame has two columns, the first is named "Material" and the second is named a month and year. Every data frame has the same name for the first column, but no two data frames have the same name for the second column.
> head(str(fy16_list))
List of 67
$ April_FY11 :'data.frame': 1559 obs. of 2 variables:
..$ Material : chr [1:1559] "622-5129-105" "622-5129-109" "622-5129-203" "622-5129-223" ...
..$ April_FY11: chr [1:1559] "1 " NA "(3)" NA ...
$ April_FY12 :'data.frame': 1721 obs. of 2 variables:
..$ Material : chr [1:1721] "622-5129-021" "622-5129-105" "622-5129-109" "622-5129-203" ...
..$ April_FY12: chr [1:1721] NA NA NA NA ...
$ April_FY13 :'data.frame': 1189 obs. of 2 variables:
..$ Material : chr [1:1189] "122000-F15SA_1" "122000-F15SA_2" "987-9705-001" "822-1867-001" ...
..$ April_FY13: chr [1:1189] NA NA "-15" "15" ...
The list is 5.2Mb in size so it really isn't very big at all, but for some reason when I do:
mydf <- Reduce(function(...) merge(..., all=T), mylist)
I wait for like 5-10min and then get an error message that says I've reached my memory limit!
Error: cannot allocate vector of size 88.7 Mb
Warning messages:
1: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, ... :
Reached total allocation of 8078Mb: see help(memory.size)
2: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, ... :
Reached total allocation of 8078Mb: see help(memory.size)
# The warning message repeats 12 times...
The data frame I've created is 8Gb in size! I have no idea why this is happening. I tried
mydf <- reshape::merge_all(mylist)
But the same thing happens.
Everything works smoothly when I do
mylist <- mylist[1:10]
mydf <- Reduce(function(...) merge(..., all=T), mylist)
So I'm thinking that this code just is not scalable, but honestly 5.2Mb seems pretty small as I've worked with lists over 300Mb in r before.
Any suggestions for getting this to work?
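One thing that may be worth checking, purely as an assumption since the full data are not shown, is whether Material contains duplicated values within the individual data frames. merge() pairs every matching row on the left with every matching row on the right, so duplicated keys multiply across successive merges. The two small data frames below are invented to show the effect.

```r
# Invented data: "x" appears twice in each data frame
a <- data.frame(Material = c("x", "x", "y"), April_FY11 = c("1", "2", "3"),
                stringsAsFactors = FALSE)
b <- data.frame(Material = c("x", "x", "y"), April_FY12 = c("4", "5", "6"),
                stringsAsFactors = FALSE)

# "x" yields 2 * 2 = 4 rows and "y" yields 1, so 3-row inputs produce 5 rows
merged <- merge(a, b, all = TRUE)
nrow(merged)

# Quick check for duplicated keys across a list of data frames
has_dups <- vapply(list(a, b), function(d) anyDuplicated(d$Material) > 0, logical(1))
```

If a check like this returns TRUE for many of the 67 data frames, the row count could grow very quickly over the chain of merges, which would be consistent with a 5.2Mb list turning into gigabytes.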

Sorting several dates by one observation

I am at a loss! I am trying to sort my data by business_id. Each id has several dates associated with it. I am trying to create a new variable that shows the time in days between the first and last date associated with a business_id. Such that
row.names business_id Days
1 x8453 DxUn-ukNL27GOuwjnFGFKA 876
The data currently is structured as:
row.names date business_id
1 X27038 2012-04-21 FV0BkoGOd3Yu_eJnXY15ZA
2 X60951 2012-05-14 Trar_9cFAj6wXiXfKfEqZA
3 X60462 2011-10-05 DxUn-ukNL27GOuwjnFGFKA
4 X2078 2010-12-19 PlcCjELzSI3SqX7mPF5cCw
5 X166883 2011-09-29 pF7uRzygyZsltbmVpjIyvw
6 X177828 2010-09-19 XkNQVTkCEzBrq7OlRHI11Q
7 X128628 2012-05-05 6TWRuHn24DL6vnW8Uyu4Vw
8 X202882 2011-12-10 Xo9Im4LmIhQrzJcO4R3ZbA
9 X64569 2012-02-07 Z67obTep38V9HMtA10yu5A
10 X14667 2009-07-18 xsSnuGCCJD4OgWnOZ0zB4A
11 X17432 2012-08-11 XkNQVTkCEzBrq7OlRHI11Q
Thanks in advance!
Update:
str(data)
'data.frame': 2299 obs. of 2 variables:
$ date :List of 2299
..$ X2736 : chr "2012-05-29"
..$ X160403: chr "2011-08-29"
..$ X19897 : chr "2010-09-27"
..$ X44519 : chr "2012-05-22"
..$ X75910 : chr "2012-10-22"
..$ X13052 : chr "2010-07-14"
$ business_id:List of 2299
..$ X2736 : chr "EFJAVVBQQqftuqY5Wb3WtQ"
..$ X160403: chr "YDlk9buwF8JQE3JgQgraOw"
..$ X19897 : chr "sc1UacpE3cVNJueMdXiCyA"
..$ X44519 : chr "VY_tvNUCCXGXQeSvJl757Q"
..$ X75910 : chr "fowXs9zAM0TQhSfSkPeVuw"
..$ X13052 : chr "xM5F0cLAlKWoB8rOgt5ZOw"
..$ X87807 : chr "nLL0sjLdZ13YdvhXKyss7A"
Edit now that the OP has provided the structure:
Your data is structured quite oddly. A usual structure in R is a data.frame, which is technically a list of vectors where the vectors are all the same length. In your case, you have a list of two (named) lists.
Store the names somewhere else for the time being:
old.names <- names(x[[1]])
Then turn the data into an ordinary data.frame, using the handy unlist() function:
x$date <- unlist(x$date)
x$business_id <- unlist(x$business_id)
Use str(x) to see the difference. The names can go back in now, and it's also a good time to turn your "date" column from a character into a proper date, and sort by date order.
x$old.names <- old.names
x$date <- as.POSIXct(x$date)
x <- x[order(x$date), ]
My original answer should now work.
Original answer:
Like agstudy I'd use the plyr package, but if you have the "date" column in a date format and want to keep it that way, you could try:
library(plyr)
ddply(x, "business_id", summarise,
  duration = difftime(max(date), min(date), units = "days"),
  old.names = old.names[1])
This also gives you flexibility on the units.
With your example data, sorting by date ascending with dat <- dat[order(dat$date), ] means that old.names[1] gives you the name of the earliest row and old.names[length(old.names)] gives you the name of the most recent row, but I don't know whether that is reliable given the magic inside ddply.
Further edit:
I only showed how to handle the names because they're in your example. They look as though they were originally column headers from imported data, and R has prepended "X" to them because names aren't allowed to begin with numerals.
Using the plyr package:
library(plyr)
ddply(dat, .(business_id), function(x)
  if (length(x$date) > 1)
    diff(range(as.POSIXct(x$date)))
  else 0)
business_id V1
1 6TWRuHn24DL6vnW8Uyu4Vw 0
2 DxUn-ukNL27GOuwjnFGFKA 0
3 FV0BkoGOd3Yu_eJnXY15ZA 0
4 pF7uRzygyZsltbmVpjIyvw 0
5 PlcCjELzSI3SqX7mPF5cCw 0
6 Trar_9cFAj6wXiXfKfEqZA 0
7 XkNQVTkCEzBrq7OlRHI11Q 692
8 Xo9Im4LmIhQrzJcO4R3ZbA 0
9 xsSnuGCCJD4OgWnOZ0zB4A 0
10 Z67obTep38V9HMtA10yu5A 0
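For comparison, the same per-business span can be sketched in base R without plyr. The three rows below are copied from the example data in the question; the day_number, span and result names are introduced for this sketch only, and it assumes the date column is already a plain character vector (that is, after the unlist step described above).

```r
dat <- data.frame(
  date        = c("2010-09-19", "2012-08-11", "2011-10-05"),
  business_id = c("XkNQVTkCEzBrq7OlRHI11Q", "XkNQVTkCEzBrq7OlRHI11Q",
                  "DxUn-ukNL27GOuwjnFGFKA"),
  stringsAsFactors = FALSE
)

# Convert dates to day counts so the difference is directly in days
day_number <- as.numeric(as.Date(dat$date))

# Days between the first and last date within each business_id (0 when there is one date)
span <- tapply(day_number, dat$business_id, function(d) max(d) - min(d))

result <- data.frame(business_id = names(span), Days = as.vector(span),
                     stringsAsFactors = FALSE)
```

Working with Date rather than POSIXct avoids time zone and daylight saving effects when only whole days matter.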
