Gather list after partitioning with split in R - r

Simplified example is:
data<-data.frame(rnorm(10),rbinom(10,1,prob=.7))
sdata<-split(data[,1],data[,2])
ret <- lapply(sdata,mean)
I create a data column and a factor column. I come from SQL and tend to solve any task with a grouping pattern.
The result of lapply(sdata,mean) is a list:
class(ret)
[1] "list"
str(ret)
List of 2
$ 0: num -0.146
$ 1: num -0.0572
How can I make a data frame again from the list?
Are there limitations on the factor type? The factor values become the "names" of the list elements, and I am afraid of losing data/precision when converting from names back to actual data in the data frame.
Is there a better way to process partitioned/grouped data than split/lapply?
PS Feel free to correct the question wording. I have too little experience with R to write professionally.
@MLavoie ret <- data.frame(lapply(sdata,mean)) gives me:
> dim(ret)
[1] 1 2
I expected 2x1.
@David Arenburg The function that lapply applies to the result from sapply receives not just a single column but all of them, and not just a single row but every row within the group. This approach may degrade performance, but it allows any processing logic.
aggregate and data.table work on individual columns within each group, if I understand properly.
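One possible way to get back to a data frame is sketched below. It uses fixed dummy data instead of the random draws above so the result is reproducible; the column names x and g are made up for illustration.

```r
# Deterministic stand-in for the question's random data
# (the names x and g are illustrative, not from the original code).
data <- data.frame(x = as.numeric(1:10), g = rep(c(0, 1), 5))

sdata <- split(data$x, data$g)
ret <- lapply(sdata, mean)

# The list names become a character column; unlist() flattens the values.
res <- data.frame(g = names(ret), mean = unlist(ret), row.names = NULL)
dim(res)  # 2 rows, 2 columns

# aggregate() groups and summarises in one step and returns a data frame,
# so the group values never have to pass through list names.
res2 <- aggregate(x ~ g, data = data, FUN = mean)
```

With the first form the group labels come back as character strings, so they would need as.numeric() (or similar) if the original type matters; the aggregate() form avoids that round trip.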

Related

Does disk.frame allow to work with large lists in R?

I am producing very big datasets (>120 GB), which are actually lists of named (100x100x3) matrices. These are very large lists (millions of records). They are then fed to a CNN and classified into one of 4 categories. Processing this amount of data at once is tedious and often exhausts my RAM, so I would like to split my dataset into chunks and process the chunks in parallel.
I found a few packages: bigmemory and disk.frame look most suitable. But do they accept lists? Or maybe there are better solutions for lists?
I had to adjust my data to data.table format, so I did something like this:
I need it to be named, so I extracted names to the vec:
nameslist <- names(list1)
I converted my list to data.table ("chunk" are my original data from the list1 used as the dummy; this is nested list of matrices; 3 matrices per name to be specific)
dummy_dframe <- data.frame(name= nameslist, chunk = I(list1))
I tried to convert it into the disk.frame:
dummy_diskframe <- as.disk.frame(dummy_dframe)
Then I encountered the following error:
Error in `[.data.table`(df, , { :
The data frame contains these list-columns: 'chunk'. List-columns are not yet supported by disk.frame. Remove these columns to create a disk.frame
So no way to use this for nested list of matrices.
After that I changed approach and decided to process the dummy data.table with column containing name and column containing matrix - I created this in a two-step fashion, based on this thread (used Jonathan Gellar's example):
data.frame with a column containing a matrix in R
Under this scenario, the disk.frame threw another type of error:
Error in `[.data.table`(df, , { :
Column 2 ['mat'] is length 4 but column 1 is length 2; malformed data.table.
So, nope, unfortunately this is not a solution I could use with my datasets. I'm sharing this so other people can save their time.
{disk.frame} only works with tabular data
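Since the data is a named list rather than a table, one alternative that stays in base R is to split the list into chunks and store each chunk as its own .rds file. This is only a rough sketch under assumptions: the tiny dummy arrays, the chunk size and the file naming are all made up for illustration, and it has not been tried at the 120 GB scale described above.

```r
# Small dummy stand-in for the real data: a named list of 3D arrays
# (the real elements are 100x100x3; 4x4x3 is used here to keep it light).
list1 <- setNames(
  lapply(1:10, function(i) array(i, dim = c(4, 4, 3))),
  paste0("img", 1:10)
)

# Assign every element to a chunk of at most chunk_size elements.
chunk_size <- 4
chunk_id <- ceiling(seq_along(list1) / chunk_size)
chunks <- split(list1, chunk_id)

# Write each chunk to its own file so only one chunk needs to be in RAM later.
out_dir <- tempfile("chunks_")
dir.create(out_dir)
paths <- file.path(out_dir, sprintf("chunk_%03d.rds", seq_along(chunks)))
for (i in seq_along(chunks)) saveRDS(chunks[[i]], paths[i])

# Later: load and process one chunk at a time; the element names are preserved.
first <- readRDS(paths[1])
names(first)
```

Each file can then be handed to a separate worker, which matches the "process the chunks in parallel" goal without requiring a tabular layout.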

Is there a reason why common R data types/container are indexed so differently?

common container types used in R are data.tables, data.frames, matrices and lists (probably more?!).
All these storage types have slightly different rules for indexing.
Let's say we have a simple dataset with named columns:
name1 name2
1 11
2 12
... ...
10 20
We now put this data in every container accordingly. If I want to index the number 5 which is in the name1 column it goes as follows:
lists: dataset[['name1']][5]
-> why the double brackets?!?!
data frames: dataset$name1[5] or dataset[5,'name1']
-> here are two options possible, why the ambiguity?!?
data table: dataset$name1[5]
-> why is it here only one possibility
I often stumble upon this problem, and coming from Python this seems very odd. It also leads to extremely tedious debugging. In Python this is solved in a very uniform way, where indexing is pretty much standard across lists, numpy arrays, pandas data frames, etc.
A data.frame is a list whose elements all have equal length. We use $ or [[ to extract a list element; with [ the result would still be a list with one element.
You reference the data.frame example in R and then go on to say you are used to pandas, except these have direct, standard equivalents in pandas for the exact same purpose, so unsure where the confusion comes from.
dataset$name1[5] -> dataset['name1'][5] or dataset.name1[5]
dataset[5, 'name1'] -> dataset.loc[5, 'name1']
Using the definitions in the Note at the end these all work and give the same answer.
L[["name1"]][5]
DF[["name1"]][5]
DT[["name1"]][5]
L$name1[5]
DF$name1[5]
DT$name1[5]
It seems not unreasonable that a data frame which is conceptually a 2d object can take two subscripts whereas a list which is one dimensional takes one.
[[ and [ have different meanings so I am not sure consistency plays a role here.
Note
L <- list(name1 = 1:10, name2 = 11:20)
DF <- as.data.frame(L)
library(data.table)
DT <- as.data.table(DF)
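To make the difference between [ and [[ concrete, here is a small base-R sketch that reuses L and DF from the Note (data.table is left out so it runs without extra packages):

```r
L <- list(name1 = 1:10, name2 = 11:20)
DF <- as.data.frame(L)

sub_list <- L["name1"]     # single bracket: still a list, of length 1
column   <- L[["name1"]]   # double bracket: the integer vector itself

is.list(sub_list)   # TRUE
is.list(column)     # FALSE
column[5]           # 5

# A data frame is a list of columns, so the same rules apply,
# plus the two-subscript, matrix-style form.
is.data.frame(DF["name1"])   # TRUE
DF[["name1"]][5]             # 5
DF[5, "name1"]               # 5
```

So [ keeps the container and [[ pulls out its contents, which is why a second [5] only makes sense after [[ or $.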

Iterate over Factors in a dataframe in R

I am rather new to R and struggling at the moment with a specific issue. I need to iterate over a dataframe with 1 variable returned from a SQL database so that I can ultimately issue additional SQL queries using the information in the 1 variable. I need help understanding how to do this.
Here is what I have
> dt
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
When trying to access it using dt[1], it prints the entire list just as above
> dt[1]
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
when trying to access by dt[1,] it brings additional unwanted information.
> a<-dt[1,]
> a
[1] 5D2D3F03-286E-4643-8F5B-10565608E5F8
5 Levels: 4EB4D553-C680-4576-A854-54ED817226B0 ... 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
I need to isolate just the '5D2D3F03-286E-4643-8F5B-10565608E5F8' information and not the '5 levels......'.
I am sure this is simple, I just can't find it. any help is appreciated!
thanks!
There are two issues you need to address. One is that you want character data, not a factor variable (a factor is essentially a category variable). The other is that you want a simple vector of the values, not a data.frame.
1) To get the first column as a vector, use double-brackets or the $ notation:
a <- dt[[1]]
a <- dt[['Col']]
a <- dt$Col
Your notation dt[1,] does actually return a vector too (here, the value in the first row), but it relies on the somewhat obscure fact that the [ method for data.frame objects will silently "drop" its result to a vector when using the two-index form dt[i,j], but not when using the one-index form dt[i]:
When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list. In this usage a drop argument is ignored, with a warning.
Think of "dropping" like unboxing the data - instead of getting a data.frame with a single column, you're just getting the column data itself.
2) To convert to character data, use one of the suggestions in the comments from @akrun or @Vlo:
a <- as.character(dt[[1]])
a <- as.character(dt[['Col']])
a <- as.character(dt$Col)
or use the API of whatever you're using to make the SQL query (or to read in the results of the query) so that it does not convert the strings to factors in the first place.
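Putting both steps together, iterating over the values might look like the sketch below. The data frame is rebuilt by hand with two of the IDs from the question, and the table name some_table and column name id in the query string are hypothetical placeholders, not anything from the original post.

```r
# Rebuild a small version of the question's data, with the column as a factor.
dt <- data.frame(
  Col = c("5D2D3F03-286E-4643-8F5B-10565608E5F8",
          "582771BE-811E-4E45-B770-42A98EB5D7FB"),
  stringsAsFactors = TRUE
)

# Pull the column out as a plain character vector (no factor levels attached).
ids <- as.character(dt$Col)

# Loop over the values and build one query string per ID.
# "some_table" and "id" are made-up names for illustration only.
queries <- character(length(ids))
for (i in seq_along(ids)) {
  queries[i] <- sprintf("SELECT * FROM some_table WHERE id = '%s'", ids[i])
}
queries[1]
```

In real code, a parameterised query from the database package in use is usually safer than pasting values into the SQL string.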

How can I convert the format of columns from multiple data frames?

I have several data frames with some columns sharing the same names. I'm trying to come up with a way to systematically change the format of the columns that share a name across the different data frames. Here is what I have come up with:
data1=data.frame(a=1:10,b=c("a","b"))
data2=data.frame(a=11:20,b=c("c","d"))
temp = c("data1$a","data2$a")
for (i in 1:length(temp)) {
eval(parse(text=(temp)[i])) = as.character(eval(parse(text=(temp)[i])))
}
After running the code, I have got the following message:
Error in file(filename, "r") : cannot open the connection
In addition: Warning message:
In file(filename, "r") :
cannot open file 'data1$a': No such file or directory
However, if I run the following code, it works:
as.character(eval(parse(text=(temp)[1])))
Can someone please help to correct my code and explain why it doesn't work?
It looks to me like you're mixing "separateness" and "systematicness" of data handling. In other words, you're trying to store multiple data objects separately in the global environment, but also trying to work with them systematically. I would suggest that this is a mistake. You should choose one approach to data handling, and stick to it.
1: Separateness
This one is easy. Just store the data.frames separately (which is exactly what you're doing), and modify them separately:
data1 <- data.frame(a=1:10,b=c('a','b'))
data2 <- data.frame(a=11:20,b=c('c','d'))
data1$a <- as.character(data1$a)
data2$a <- as.character(data2$a)
2: Systematicness
This one requires storing the data in a list from the beginning. That may slightly increase the verbosity of some code, since you have to dereference the list to access the individual data.frames, but it facilitates the systematic data handling that you're looking for, which can eliminate a lot of duplicate code:
data <- list(
  data.frame(a=1:10,b=c('a','b')),
  data.frame(a=11:20,b=c('c','d'))
)
for (i in seq_along(data)) data[[i]]$a <- as.character(data[[i]]$a)
As you can see, each of these approaches alleviates the need to use messy parse/eval solutions. Usually that kind of dynamic code generation, parsing, and evaluation should not be necessary.
We can place the datasets in a list (mget(ls(pattern = "data\\d+"))), loop over the list and convert the column of interest ("col_of_interest") to character class. To reflect the change in the original objects, we use list2env (but I would recommend working with the list instead of individual objects).
col_of_interest <- "a"
list2env(lapply(mget(ls(pattern = "data\\d+")),
function(x) {x[[col_of_interest]] <- as.character(x[[col_of_interest]])
x}), envir = .GlobalEnv)
str(data1)
#'data.frame': 10 obs. of 2 variables:
#$ a: chr "1" "2" "3" "4" ...
#$ b: Factor w/ 2 levels "a","b": 1 2 1 2 1 2 1 2 1 2
NOTE: The idea of placing the datasets in a list and converting to character for selected columns is already described in this post.

In R using data.table, how does one exclude rows and how does one include NA values in an integer column

I am using data.table quite a lot. It works well but I am finding it is taking me a long time to transition my syntax so that it takes advantage of the binary searching.
In the following data table, how would one select all the rows, including those where the CPT value is NA, but exclude rows where the CPT value is 23456 or 10000?
cpt <- c(23456,23456,10000,44555,44555,NA)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy","miscellaneous procedure")
cpt.desc <- data.table(cpt,description)
setkey(cpt.desc,cpt)
The following line works, but I think it uses the vector scan method instead of a binary search (or binary exclusion). Is there a way to drop rows by binary methods?
cpt.desc[!cpt %in% c(23456,10000),]
Only a partial answer, because I am new to data.table. A self-join works for numbers, but the same fails for strings. I am sure one of the professional data tablers knows what to do.
library(data.table)
n <- 1000000
cpt.desc <- data.table(
cpt=rep(c(23456,23456,10000,44555,44555,NA),n),
description=rep(c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy","miscellaneous procedure"),n))
# Added on revision. Not very elegant, though. Faster by factor of 3
# but probably better scaling
setkey(cpt.desc,cpt)
system.time(a<-cpt.desc[-cpt.desc[J(23456,45555),which=TRUE]])
system.time(b<-cpt.desc[!(cpt %in% c(23456,45555))] )
str(a)
str(b)
identical(as.data.frame(a),as.data.frame(b))
# A self-join works Ok with numbers
setkey(cpt.desc,cpt)
system.time(a<-cpt.desc[cpt %in% c(23456,45555),])
system.time(b<-cpt.desc[J(23456,45555)])
str(a)
str(b)
identical(as.data.frame(a),as.data.frame(b)[,-3])
# But the same fails with characters
setkey(cpt.desc,description)
system.time(a<-cpt.desc[description %in% c("castration","orchidectomy"),])
system.time(b<-cpt.desc[J("castration","orchidectomy"),])
identical(as.data.frame(a),as.data.frame(b)[,-3])
str(a)
str(b)
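For the exclusion asked about in the question, data.table also has a "not-join" form, where a ! prefix on the i argument returns the rows of the keyed table that do not match. A minimal sketch on the question's original six rows, assuming a reasonably recent data.table version (timings are not measured here):

```r
library(data.table)

cpt <- c(23456, 23456, 10000, 44555, 44555, NA)
description <- c("tonsillectomy", "tonsillectomy in >12 year old",
                 "brain transplant", "castration", "orchidectomy",
                 "miscellaneous procedure")
cpt.desc <- data.table(cpt, description)
setkey(cpt.desc, cpt)

# Not-join on the key: keep every row whose cpt is NOT 23456 or 10000.
# The NA row matches neither value, so it is kept.
kept <- cpt.desc[!J(c(23456, 10000))]
kept
```

Note that the values to exclude go into a single vector inside J(), so they are matched against the one key column rather than being treated as two separate join columns.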

Resources