rbind doing different things for single versus multiple arguments - r

I am new to R and came across code that uses do.call("rbind", df.list) to combine a list of data frames.
The data frames have arrays as columns and rbind does remove the arrays, but only if there are at least two elements in the list to combine.
Quick example:
> class(rbind(data.frame(a=array(1,2)), data.frame(a=array(3,4)))$a)
[1] "numeric"
> class(rbind(data.frame(a=array(1,2)))$a)
[1] "array"
Is this a bug in rbind? It appears if it is called with one argument, it does just return that argument, while if called with multiple, it does remove arrays.
How can I "unarray" such a data frame if length(df.list) == 1?
Example of what I need:
> df.list1 <- list(data.frame(a=array(1,2), b=array("a")), data.frame(a=array(3,4), b=array("b")))
> df.list2 <- list(data.frame(a=array(1,2), b=array("a")))
> df.combined1 <- do.call("rbind", df.list1)
> df.combined2 <- do.call("rbind", df.list2)
> class(df.combined1$a)
[1] "numeric"
> class(df.combined2$a)
[1] "array"
The goal is to have a data frame df.combined without array columns, regardless of whether df.list had one or multiple elements. The type and number of the data frame columns are unknown in advance.

Let's start with:
class(rbind(data.frame(a=array(1,2)), data.frame(a=array(3,4))))
class(rbind(data.frame(a=array(1,2))))
Both of these have class data.frame.
Now, as you noticed:
> class(rbind(data.frame(a=array(1,2)), data.frame(a=array(3,4)))$a)
[1] "numeric"
> class(rbind(data.frame(a=array(1,2)))$a)
[1] "array"
The first one is expected; the second one is unexpected, and it comes from the way the rbind method for data.frame works. As per the documentation for rbind:
... It [then] takes the classes of the columns from the first data frame,...
If you want to coerce the class in case of a single array to numeric, then you can use something like this:
ifelse(length(df.list) == 1,
class(rbind(data.frame(a=as.vector(array(1,2))))$a),
...)
Coercing it with as.vector gets rid of the array class.
(EDIT: Depending on what you want, you might also benefit from the discussion in the comments below!)
Lastly, note that this is an issue only for a one-dimensional array. For higher dimensions, you get the appropriate class:
class(rbind(data.frame(a=array(as.numeric(1:10),c(2,5))))$a.1)
EDIT: Based on your update, I think here is what you want:
df.list1 <- list(data.frame(a=array(1,2), b=array("a")), data.frame(a=array(3,4), b=array("b")))
df.list2 <- list(data.frame(a=array(1,2), b=array("a")))
combineDFList <- function(df.list) {
  temp <- do.call(rbind, df.list)
  # Drop the array class from column a if rbind left it in place
  if (is.array(temp$a)) temp$a <- as.numeric(temp$a)
  temp
}
df.combined1 <- combineDFList(df.list1)
df.combined2 <- combineDFList(df.list2)
class(df.combined1$a)
class(df.combined2$a)
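Since the question says the column names and types are not known in advance, here is a minimal sketch that checks every column rather than only a. The names drop_array_columns and combine_df_list are made up for illustration, and I am assuming that only one-dimensional array columns need flattening:

```r
# Hypothetical helper: strip the dim attribute from every 1-d array column.
drop_array_columns <- function(df) {
  df[] <- lapply(df, function(col) {
    # as.vector() removes the dim attribute; other columns are left untouched
    if (is.array(col) && length(dim(col)) == 1L) as.vector(col) else col
  })
  df
}

# Hypothetical wrapper: combine first, then flatten, so lists of length 1
# and lists of length > 1 go through the same path.
combine_df_list <- function(df.list) {
  drop_array_columns(do.call(rbind, df.list))
}

one <- combine_df_list(list(data.frame(a = array(1, 2), b = array("a", 2))))
two <- combine_df_list(list(data.frame(a = array(1, 2), b = array("a", 2)),
                            data.frame(a = array(3, 4), b = array("b", 4))))
```

Because the flattening happens after rbind, it does not matter whether rbind itself already dropped the array class.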
Hope this helps!

Related

How to use save() function in R when variable names are stored in a vector?

I have a vector say varNames which has "name" of certain variables as "character". Now I want to save those particular variables as rdata using save(). How should I go about that?
I was trying to do the following:
> varSet
[1] "blah1" [2] "blah2"
> str(varSet)
chr [1:44] "blah1" "blah2" ...
> foo <- lapply(varSet, function(x) as.name(x))
As expected foo is a list of symbols. I was thinking of doing something like
eval(unlist(foo), file="fileName")
I guess unlist(foo) is not working. How should I solve this issue? Can you also explain why unlist(foo) is not unlisting the list of symbols?
Edit: Adding artificial example
> x <- c(1,2,3)
> y <- data.frame(m=c(1,2), n=c(3,4))
I can do this to save x and y.
> save(x, y, file="filename.rda")
But suppose I have
> varSet <- c("x", "y")
In my example varSet is a very big set. So I need to use varSet to save corresponding variables whose names are stored.
You can save any data object as:
save(varSet, file="varSet.RData")
But your inquiry sounds a bit confused. Do you want just to save it, or save it in a particular way, like data.frame?
Assuming your list of lists is called varSet, you can use a plyr solution:
library (plyr)
df <- ldply(varSet, data.frame)
Or use a more manual strategy, assuming your list has 100 elements:
df <- data.frame(matrix(unlist(varSet), nrow=100, byrow=T))
The above will convert all character columns to factors. To avoid this, add a parameter to the data.frame() call:
df <- data.frame(matrix(unlist(varSet), nrow=100, byrow=T),stringsAsFactors=FALSE)
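For the case described in the question, where varSet holds the names of the objects rather than the objects themselves, save() has a list argument that takes a character vector of object names. As for unlist(foo): symbols cannot be stored in an atomic vector, so unlist() leaves a list of symbols as a list. A minimal sketch using the question's artificial example (the temporary file and the restored environment are only there for illustration):

```r
x <- c(1, 2, 3)
y <- data.frame(m = c(1, 2), n = c(3, 4))
varSet <- c("x", "y")

# Illustrative file location; any path works here
out_file <- tempfile(fileext = ".rda")

# 'list' takes the object names as a character vector
save(list = varSet, file = out_file)

# Load into a separate environment to check what was written
restored <- new.env()
loaded <- load(out_file, envir = restored)
```

load() returns the names of the objects it restored, which makes it easy to confirm that everything named in varSet was saved.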

How to subset a list based on the length of its elements in R

In R I have a function (ip2coordinates from the package RDSTK) which looks up 11 fields of data for each IP address you supply.
I have a list of IP's called ip.addresses:
> head(ip.addresses)
[1] "128.177.90.11" "71.179.12.143" "66.31.55.111" "98.204.243.187" "67.231.207.9" "67.61.248.12"
Note: Those or any other IP's can be used to reproduce this problem.
So I apply the function to that object with sapply:
ips.info <- sapply(ip.addresses, ip2coordinates)
and get a list called ips.info as my result. This is all good and fine, but I can't do much more with a list, so I need to convert it to a dataframe. The problem is that not all IP addresses are in the databases thus some list elements only have 1 field and I get this error:
> ips.df <- as.data.frame(ips.info)
Error in data.frame(`128.177.90.10` = list(ip.address = "128.177.90.10", :
arguments imply differing number of rows: 1, 0
My question is -- "How do I remove the elements with missing/incomplete data or otherwise convert this list into a data frame with 11 columns and 1 row per IP address?"
I have tried several things.
First, I tried to write a loop that removes elements with a length of less than 11
for (i in 1:length(ips.info)){
if (length(ips.info[i]) < 11){
ips.info[i] <- NULL}}
This leaves some records with no data and makes others say "NULL", but even those with "NULL" are not detected by is.null
Next, I tried the same thing with double square brackets and get
Error in ips.info[[i]] : subscript out of bounds
I also tried complete.cases() to see if it could potentially be useful
Error in complete.cases(ips.info) : not all arguments have the same length
Finally, I tried a variation of my for loop which was conditioned on length(ips.info[[i]] == 11 and wrote complete records to another object, but somehow it results in an exact copy of ips.info
Here's one way you can accomplish this using the built-in Filter function
#input data
library(RDSTK)
ip.addresses<-c("128.177.90.10","71.179.13.143","66.31.55.111","98.204.243.188",
"67.231.207.8","67.61.248.15")
ips.info <- sapply(ip.addresses, ip2coordinates)
#data.frame creation
lengthIs <- function(n) function(x) length(x)==n
do.call(rbind, Filter(lengthIs(11), ips.info))
or if you prefer not to use a helper function
do.call(rbind, Filter(function(x) length(x)==11, ips.info))
Alternative solution based on base package.
# find non-complete elements
ids.to.remove <- sapply(ips.info, function(i) length(i) < 11)
# remove found elements
ips.info <- ips.info[!ids.to.remove]
# create data.frame
df <- do.call(rbind, ips.info)
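As a side note on why the loop in the question misbehaves: ips.info[i] with single brackets returns a sub-list of length 1, so the test length(ips.info[i]) < 11 is always true, and deleting elements inside the loop shortens the list while the index keeps running to the original length. Here is a minimal sketch with made-up stand-in data (3 fields instead of 11, no RDSTK call), using lengths() from base R:

```r
# Hypothetical stand-in for sapply(ip.addresses, ip2coordinates)
ips.info <- list(
  "10.0.0.1" = list(ip.address = "10.0.0.1", latitude = 1, longitude = 2),
  "10.0.0.2" = list(ip.address = "10.0.0.2"),
  "10.0.0.3" = list(ip.address = "10.0.0.3", latitude = 3, longitude = 4)
)

single <- length(ips.info[1])    # length of a one-element sub-list
double <- length(ips.info[[1]])  # length of the element itself

# Keep only complete elements, without modifying the list while looping
complete <- ips.info[lengths(ips.info) == 3]

# One row per remaining IP address
df <- do.call(rbind, lapply(complete, as.data.frame, stringsAsFactors = FALSE))
```

With the real data the comparison would be against 11 rather than 3.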

Process and update simultaneously bunches of data.frames/matrix in R

I have a bunch of data.frames in my R workspace, and I need to apply exactly the same processing to each of them. Since I am too "lazy" to run the command for each data.frame one by one, I want to treat them as a group and process them in a loop, which saves time.
As a simple stand-in for my real data processing, say I want to apply as.data.frame to these matrices.
# dummy data
set.seed(1026)
a<-matrix(rnorm(100),50,2)
b<-matrix(rnorm(100),50,2)
c<-matrix(rnorm(100),50,2)
# process data one-by-one which is not good
a<-as.data.frame(a)
b<-as.data.frame(b)
c<-as.data.frame(c)
I could do that, but it is time-consuming. I turned to a lazier but quicker approach, similar to the *apply functions that operate on rows or columns inside a data.frame.
sapply(c(a,b,c),as.data.frame) or sapply(list(a,b,c),as.data.frame), or even:
> for (dt in c(a,b,c)){
+ dt<-as.data.frame(dt)
+ }
But none of them changes the original three matrices.
> class(a)
[1] "matrix"
> class(b)
[1] "matrix"
> class(c)
[1] "matrix"
I want all of them to be converted to data.frame.
Your problem is that you're using sapply, which simplifies results to vectors or matrices.
You want lapply instead:
lapply(list(a,b,c), as.data.frame)
Edit for the (generally frowned upon) practice of changing the objects systematically but keeping the object names the same:
for(i in c("a", "b", "c"))
  assign(i, as.data.frame(get(i)))
This should get you a list of 3 data.frames:
set.seed(1026)
lapply(1:3,function(x){as.data.frame(matrix(rnorm(100),50,2))})
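Note that lapply() returns new objects and never modifies a, b and c in place, so the result has to be assigned to something. A minimal sketch of the usual pattern, keeping the objects together in a named list (the names mats and dfs are only illustrative):

```r
set.seed(1026)
mats <- list(
  a = matrix(rnorm(100), 50, 2),
  b = matrix(rnorm(100), 50, 2),
  c = matrix(rnorm(100), 50, 2)
)

# Apply the same processing step to every element and keep the result
dfs <- lapply(mats, as.data.frame)

# Elements stay accessible by name
first_df <- dfs$a
```

If separate objects are really needed afterwards, list2env(dfs, envir = environment()) copies the list elements into the current environment under their names.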

Why cbind for ts objects behaves different from cbind for matrices and data.frames in R?

Does anyone know why the result of the following code is different?
a <- cbind(1:10,1:10)
b <- a
colnames(a) <- c("a","b")
colnames(b) <- c("c","d")
colnames(cbind(a,b))
> [1] "a" "b" "c" "d"
colnames(cbind(ts(a),ts(b)))
> [1] "ts(a).a" "ts(a).b" "ts(b).c" "ts(b).d"
Is this for compatibility reasons? cbind for xts and zoo does not have this feature.
I always accepted this as given, but now my code is littered with the following:
ca<-colnames(a)
cb<-colnames(b)
out <- cbind(a,b)
colnames(out) <- c(ca,cb)
This is just what the cbind.ts method does. You can see the relevant code via stats:::cbind.ts, stats:::.cbind.ts, and stats:::.makeNamesTs.
I can't explain why it was made to be different, since I didn't write it, but here's a work-around.
cbts <- function(...) {
dots <- list(...)
ists <- sapply(dots,is.ts)
if(!all(ists)) stop("argument ", which(!ists), " is not a ts object")
do.call(cbind,unlist(lapply(dots,as.list),recursive=FALSE))
}
I take it that you're interested in why this happens.
Taking a look at the body of stats:::.cbind.ts, which is the function that does column binding for time series, shows that naming is performed by .makeNamesTs. Taking a look at stats:::.makeNamesTs reveals that the names are derived directly from the arguments you pass to cbind, and there is no obvious way to influence this. As an example, try:
cbind(ts(a),ts(b, start = 2))
You will find that the start specification of the second time series appears in the name of the respective columns.
As to why that's the way things are ... I can't help you there!
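If the goal is just to stop repeating the renaming boilerplate from the question, it can be wrapped in a small function. This is only a sketch: cbind_keep_colnames is a made-up name, it handles two arguments, and it assumes both inputs already have column names:

```r
# Hypothetical wrapper around the workaround shown in the question
cbind_keep_colnames <- function(x, y) {
  out <- cbind(x, y)
  # Overwrite whatever names cbind generated with the original column names
  colnames(out) <- c(colnames(x), colnames(y))
  out
}

a <- cbind(1:10, 1:10)
b <- a
colnames(a) <- c("a", "b")
colnames(b) <- c("c", "d")

ts_names  <- colnames(cbind_keep_colnames(ts(a), ts(b)))
mat_names <- colnames(cbind_keep_colnames(a, b))
```

The same wrapper works for plain matrices, where it simply reassigns the names cbind already produced.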

basic R question on manipulating dataframes

I have a data frame with several columns. The rows have names.
I want to calculate some value for each row (col1/col2) and create a new data frame with the original row names. If I just do something like data$col1/data$col2 I get a vector with the results but lose the row names.
I know it's very basic but I'm quite new to R.
It would help to read ?"[.data.frame" to understand what's going on. Specifically:
Note that there is no ‘data.frame’
method for ‘$’, so ‘x$name’ uses the
default method which treats ‘x’ as a
list.
You will see that the object's names are lost if you convert a data.frame to a list (using Joris' example data):
> as.list(Data)
$col1
[1] -0.2179939 -2.6050843 1.6980104 -0.9712305 1.6953474 0.4422874
[7] -0.5012775 0.2073210 1.0453705 -0.2883248
$col2
[1] -1.3623349 0.4535634 0.3502413 -0.1521901 -0.1032828 -0.9296857
[7] 1.4608866 1.1377755 0.2424622 -0.7814709
My suggestion would be to avoid using $ if you want to keep row names. Use this instead:
> Data["col1"]/Data["col2"]
col1
a 0.1600149
b -5.7435947
c 4.8481157
d 6.3816918
e -16.4146120
f -0.4757387
g -0.3431324
h 0.1822161
i 4.3114785
j 0.3689514
Use the function names() to add the names:
Data <- data.frame(col1=rnorm(10),col2=rnorm(10),row.names=letters[1:10])
x <- Data$col1/Data$col2
names(x) <- row.names(Data)
This solution gives a vector with the names. To get a data-frame (solution from Marek) :
NewFrame <- data.frame(x=Data$col1/Data$col2,row.names=row.names(Data))
A very simple and neat way is to use row.names() on the data frame to store the row names as a column, and then manipulate the data further.
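A minimal sketch of that idea, with small made-up numbers so the result is easy to check (the column names id and ratio are only illustrative):

```r
Data <- data.frame(col1 = c(2, 9, 8), col2 = c(1, 3, 4),
                   row.names = c("a", "b", "c"))

# Keep the row names as an ordinary column next to the computed value
with_ids <- data.frame(id = row.names(Data),
                       ratio = Data$col1 / Data$col2,
                       stringsAsFactors = FALSE)

# If real row names are wanted again later, they can be restored from the column
NewFrame <- data.frame(ratio = with_ids$ratio, row.names = with_ids$id)
```

Storing the names as a column means they survive operations such as merge() or sorting, which do not always keep row names intact.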
