Difference between as.data.frame(x) and data.frame(x) - r

What is the difference between as.data.frame(x) and data.frame(x)?
In the following example, the result is the same except for the column names.
x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> x
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
> as.data.frame(x)
V1 V2 V3
1 1 1 1
2 1 1 1
3 1 1 1

As mentioned by Jaap, data.frame() calls as.data.frame() but there's a reason for it:
as.data.frame() is a method to coerce other objects to class data.frame. If you're writing your own package, you would store your method to convert an object of your_class under as.data.frame.your_class(). Here are just a few examples.
methods(as.data.frame)
[1] as.data.frame.AsIs as.data.frame.Date
[3] as.data.frame.POSIXct as.data.frame.POSIXlt
[5] as.data.frame.aovproj* as.data.frame.array
[7] as.data.frame.character as.data.frame.complex
[9] as.data.frame.data.frame as.data.frame.default
[11] as.data.frame.difftime as.data.frame.factor
[13] as.data.frame.ftable* as.data.frame.integer
[15] as.data.frame.list as.data.frame.logLik*
[17] as.data.frame.logical as.data.frame.matrix
[19] as.data.frame.model.matrix as.data.frame.numeric
[21] as.data.frame.numeric_version as.data.frame.ordered
[23] as.data.frame.raw as.data.frame.table
[25] as.data.frame.ts as.data.frame.vector
Non-visible functions are asterisked
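As a minimal sketch of what such a method could look like, the following defines a coercion method for a made-up class. The class name temperature_log, its constructor and its fields are hypothetical and exist only for illustration.

```r
# Hypothetical S3 class used only for this illustration
new_temperature_log <- function(times, values) {
  structure(list(times = times, values = values), class = "temperature_log")
}

# Method picked up by the as.data.frame() generic for objects of that class
as.data.frame.temperature_log <- function(x, row.names = NULL, optional = FALSE, ...) {
  data.frame(time = x$times, value = x$values, row.names = row.names, ...)
}

log1 <- new_temperature_log(1:3, c(20.5, 21.0, 19.8))
log_df <- as.data.frame(log1)  # dispatches to as.data.frame.temperature_log()
log_df
```

With a method like this in place, users of the class can call the generic as.data.frame() without knowing how the object is stored internally.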

data.frame() can be used to build a data frame, while as.data.frame() can only be used to coerce another object to a data frame.
For example:
# data.frame()
df1 <- data.frame(matrix(1:12,3,4),1:3)
# as.data.frame()
df2 <- as.data.frame(matrix(1:12,3,4),1:3)
df1
# X1 X2 X3 X4 X1.3
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
df2
# V1 V2 V3 V4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
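The second argument given to as.data.frame() above does not become a column because the second positional argument of as.data.frame() is row.names. A small sketch, using letters instead of 1:3 (an arbitrary choice made here only so the effect is visible):

```r
m <- matrix(1:12, 3, 4)

# data.frame() treats every argument as a component of the new data frame
built <- data.frame(m, c("a", "b", "c"))

# as.data.frame() coerces the first argument; the second is taken as row.names
coerced <- as.data.frame(m, c("a", "b", "c"))

ncol(built)        # 5 columns: the 4 matrix columns plus the extra vector
ncol(coerced)      # 4 columns: only the matrix columns
rownames(coerced)  # the letters ended up as row names
```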

As you noted, the result does differ slightly, and this means that they are not exactly equal:
identical(data.frame(x),as.data.frame(x))
[1] FALSE
So you might need to take care to be consistent in which one you use.
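To see where the two results differ, one can compare the column names and the values separately. This is only a quick check on the 3x3 matrix from the question, not a general proof:

```r
x <- matrix(data = rep(1, 9), nrow = 3, ncol = 3)
a <- data.frame(x)
b <- as.data.frame(x)

names(a)  # "X1" "X2" "X3"
names(b)  # "V1" "V2" "V3"

# The cell values themselves agree; only the default names differ
same_values <- all(as.matrix(a) == as.matrix(b))
same_values
```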
But it is also worth noting that as.data.frame is faster:
library(microbenchmark)
microbenchmark(data.frame(x),as.data.frame(x))
Unit: microseconds
expr min lq median uq max neval
data.frame(x) 71.446 73.616 74.80 78.9445 146.442 100
as.data.frame(x) 25.657 27.631 28.42 29.2100 93.155 100
y <- matrix(1:1e6,1000,1000)
microbenchmark(data.frame(y),as.data.frame(y))
Unit: milliseconds
expr min lq median uq max neval
data.frame(y) 17.23943 19.63163 23.60193 41.07898 130.66005 100
as.data.frame(y) 10.83469 12.56357 14.04929 34.68608 38.37435 100

The difference becomes clearer when you look at their main arguments:
as.data.frame(x, ...): check if object is a data frame, or coerce if possible. Here, "x" can be any R object.
data.frame(...): build a data frame. Here, "..." allows specifying all the components (i.e. the variables of the data frame).
So, the results by Ophelia are similar since both functions received a single matrix as argument: however, when these functions receive 2 (or more) vectors, the distinction becomes clearer:
> # Set seed for reproducibility
> set.seed(3)
> # Create one int vector
> IDs <- seq(1:10)
> IDs
[1] 1 2 3 4 5 6 7 8 9 10
> # Create one char vector
> types <- sample(c("A", "B"), 10, replace = TRUE)
> types
[1] "A" "B" "A" "A" "B" "B" "A" "A" "B" "B"
> # Try to use "as.data.frame" to coerce components into a dataframe
> dataframe_1 <- as.data.frame(IDs, types)
> # Look at the result
> dataframe_1
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
duplicate row.names: A, B
> # Inspect result with head
> head(dataframe_1, n = 10)
IDs
A 1
B 2
A.1 3
A.2 4
B.1 5
B.2 6
A.3 7
A.4 8
B.3 9
B.4 10
> # Check the structure
> str(dataframe_1)
'data.frame': 10 obs. of 1 variable:
$ IDs: int 1 2 3 4 5 6 7 8 9 10
> # Use instead "data.frame" to build a data frame starting from two components
> dataframe_2 <- data.frame(IDs, types)
> # Look at the result
> dataframe_2
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Inspect result with head
> head(dataframe_2, n = 10)
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Check the structure
> str(dataframe_2)
'data.frame': 10 obs. of 2 variables:
$ IDs : int 1 2 3 4 5 6 7 8 9 10
$ types: Factor w/ 2 levels "A","B": 1 2 1 1 2 2 1 1 2 2
As you can see, "data.frame()" works fine, while "as.data.frame()" produces an error: it treats only the first argument as the object to be checked and coerced, and the second positional argument ("types") is taken as row.names, which is why the duplicated values "A" and "B" are reported.
To sum up, "as.data.frame()" should be used to convert/coerce one single R object into a data frame (as you correctly did using a matrix), while "data.frame()" should be used to build a data frame from scratch.
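If you do want to go through "as.data.frame()" with several vectors, one option is to wrap them in a single object first, for example a named list. A minimal sketch (the fixed "types" vector below replaces the random sample only to keep the example deterministic):

```r
IDs <- 1:10
types <- rep(c("A", "B"), times = 5)

# One single object (a named list) is coerced; each element becomes a column
dataframe_3 <- as.data.frame(list(IDs = IDs, types = types))
str(dataframe_3)
```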

Try
colnames(x) <- c("C1","C2","C3")
and then both will give the same result
identical(data.frame(x), as.data.frame(x))
What is more startling are things like the following:
list(x)
provides a one-element list, the element being the matrix x, whereas
as.list(x)
gives a list with 9 elements, one for each matrix entry
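A quick way to check this for the 3x3 matrix from the question:

```r
x <- matrix(data = rep(1, 9), nrow = 3, ncol = 3)

wrapped <- list(x)     # the whole matrix becomes one list element
split_up <- as.list(x) # every matrix entry becomes its own list element

length(wrapped)          # 1
is.matrix(wrapped[[1]])  # TRUE
length(split_up)         # 9
```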

Looking at the code, as.data.frame fails faster. data.frame will issue warnings, and do things like remove rownames if there are duplicates:
> x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> rownames(x) <- c("a", "b", "b")
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
Warning message:
In data.row.names(row.names, rowsi, i) :
some row.names duplicated: 3 --> row.names NOT used
> as.data.frame(x)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
duplicate row.names: b

Related

how to make lists of lists into one consecutive list r

Here is my list of lists
$cor
[1] 0.10194656 0.09795653 0.18832356 0.12265421 0.09669338 0.13781369 0.16763787 0.15137726 0.10203826 0.12649443 0.16451622 0.18429656 0.21234920
[14] 0.18254895 0.10761731 0.15354220 0.13458206
$cor
[1] 0.3332299 0.3909873 0.3631544
$cor
[1] 0.11601617 0.10834637 0.10138418 0.13864724 0.17582967 0.15005935 0.05481153 0.15443826 0.08987235 0.19109966 0.13404778 0.15816381
I want all values to be in one list in the order they appear.
I'm not sure what you are trying to do. But base R has an unlist function which outputs what you may be looking for:
c1 <- 1:5
c2 <- 6:10
cor <- list(c1, c2)
cor
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 6 7 8 9 10
unlist(cor)
[1] 1 2 3 4 5 6 7 8 9 10
Using the following example data:
lst <- list(cor = 1:3, cor = 4:6)
lst
$cor
[1] 1 2 3
$cor
[1] 4 5 6
You could use the following, which places all items in one vector, in their original order:
do.call(c, lst)
cor1 cor2 cor3 cor1 cor2 cor3
1 2 3 4 5 6
Reduce() also works similarly:
Reduce(c, lst)
[1] 1 2 3 4 5 6
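If the element names are not wanted, unlist() can also drop them through its use.names argument. A small sketch on the same example data:

```r
lst <- list(cor = 1:3, cor = 4:6)

# Flatten into one vector, in the original order, without generated names
flat <- unlist(lst, use.names = FALSE)
flat

# Wrap each value in its own list element if a list is really needed
flat_list <- as.list(flat)
length(flat_list)
```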

R: change data frame structure using values from one variable as new variable

df1 <- data.frame(
name = c("a", "b", "b", "c"),
score = c(1, 1, 2, 1)
)
How can I get a new data frame with variables/columns taken from df1$name, each holding the 'corresponding' df1$score values? I figure that it's actually a two-step problem:
First I would need to make a list of (in this example) unequal length vectors like this:
$a
[1] 1
$b
[1] 1 2
$c
[1] 1
Second, the vectors need to be padded with NAs so that they have equal length before making the desired data frame,
which would look like:
a b c
1 1 1 1
2 NA 2 NA
I cannot find any simple means to do this, though I'm sure there must be one!
If the solution can be delivered using dplyr it would be fantastic! Thanks!
To split the data:
(s <- split(df1$score, df1$name))
# $a
# [1] 1
#
# $b
# [1] 1 2
#
# $c
# [1] 1
To create the new data frame:
as.data.frame(sapply(s, `length<-`, max(vapply(s, length, 1L))))
# a b c
# 1 1 1 1
# 2 NA 2 NA
Slightly more efficient would be to use vapply in place of sapply
len <- max(vapply(s, length, 1L))
as.data.frame(vapply(s, `length<-`, double(len), len))
# a b c
# 1 1 1 1
# 2 NA 2 NA
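If the one-liners above are hard to read, the padding step can also be written out explicitly. This is just a more verbose sketch of the same idea (the names len, padded and wide are arbitrary):

```r
df1 <- data.frame(
  name = c("a", "b", "b", "c"),
  score = c(1, 1, 2, 1)
)

s <- split(df1$score, df1$name)  # one vector of scores per name
len <- max(lengths(s))           # length of the longest vector

# Extending a vector's length pads it with NA
padded <- lapply(s, function(v) {
  length(v) <- len
  v
})

wide <- as.data.frame(padded)
wide
```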

Difference between `names(df[1]) <- ` and `names(df)[1] <- `

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but why, then, does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
Also, it turns out this is "documented" in R Inferno Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2

R: Aggregate character strings with c

I have a data frame with two columns: one is strings, the other one is integers.
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> df
x rnames
1 5 item.1
2 3 item.2
3 5 item.3
4 3 item.4
5 1 item.5
6 3 item.6
7 4 item.7
8 5 item.8
9 4 item.9
10 5 item.10
11 5 item.11
12 2 item.12
13 2 item.13
14 1 item.14
15 3 item.15
16 4 item.16
17 5 item.17
18 4 item.18
19 1 item.19
20 1 item.20
I'm trying to aggregate the strings into list or vectors of strings (characters) with the 'c' or the 'list' function, but getting weird results:
> aggregate(rnames ~ x, df, c)
x rnames
1 1 16, 6, 11, 13
2 2 4, 5
3 3 12, 15, 17, 7
4 4 18, 20, 8, 10
5 5 1, 14, 19, 2, 3, 9
When I use 'paste' instead of 'c', I can see that the aggregate is working correctly - but the result is not what I'm looking for.
> aggregate(rnames ~ x, df, paste)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
What I'm looking for is that every aggregated group would be presented as a vector or a list (hence the use of c) as opposed to the single string I'm getting with 'paste'. Something along the lines of the following (which in reality doesn't work):
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
Any help would be appreciated.
You fell in the usual trap of data.frame: your character column is not a character column, it is a factor column! Hence the numbers instead of the characters in your result:
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 2 5 5 5 5 4 3 3 2 4 ...
$ rnames: Factor w/ 20 levels "item.1","item.10",..: 1 12 14 15 16 17 18 19 20 2 ...
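The numbers come from the way a factor is stored: as integer codes pointing into a set of levels. A tiny sketch with a made-up factor shows the two views:

```r
f <- factor(c("b", "a", "b"))

levels(f)        # "a" "b"  (sorted labels)
codes <- as.integer(f)
codes            # 2 1 2   (positions in levels(f), not the labels)
labels <- as.character(f)
labels           # "b" "a" "b"
```

Any function that works on the underlying codes rather than the labels will therefore show integers instead of the original strings.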
To prevent the conversion to factors, use the argument stringsAsFactors = FALSE in your call to data.frame:
> df <- data.frame(x, rnames,stringsAsFactors=FALSE)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 5 5 3 5 5 3 2 5 1 5 ...
$ rnames: chr "item.1" "item.2" "item.3" "item.4" ...
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.9, item.13, item.17
2 2 item.7
3 3 item.3, item.6, item.19
4 4 item.12, item.15, item.16
5 5 item.1, item.2, item.4, item.5, item.8, item.10, item.11, item.14, item.18, item.20
Another solution to avoid the conversion to factor is function I:
> df <- data.frame(x, I(rnames))
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 3 5 4 5 4 5 3 3 1 1 ...
$ rnames:Class 'AsIs' chr [1:20] "item.1" "item.2" "item.3" "item.4" ...
Excerpt from ?I:
In function data.frame. Protecting an object by enclosing it in I() in
a call to data.frame inhibits the conversion of character vectors to
factors and the dropping of names, and ensures that matrices are
inserted as single columns. I can also be used to protect objects
which are to be added to a data frame, or converted to a data frame
via as.data.frame.
It achieves this by prepending the class "AsIs" to the object's
classes. Class "AsIs" has a few of its own methods, including for [,
as.data.frame, print and format.
I'm not sure exactly what you are looking for, so perhaps some reference output would help give us an idea of what we are aiming at.
But, since your last bit of code seems to be close to what you are after, maybe a solution like the following would work:
> library(plyr)
> ddply(df, .(x), summarize, rnames = paste(rnames, collapse = "|"))
x rnames
1 1 item.9|item.11|item.20
2 2 item.1|item.2|item.15|item.16
3 3 item.7|item.8
4 4 item.4|item.5|item.6|item.12|item.13
5 5 item.3|item.10|item.14|item.17|item.18|item.19
You can vary how the individual elements are stuck together by changing the collapse argument to paste().
Alternatively, if you want to just have each of the groups as a vector, then you could use this:
> df$rnames = as.character(df$rnames)
> L = dlply(df, .(x), function(df) {df$rnames})
> L
$`1`
[1] "item.9" "item.11" "item.20"
$`2`
[1] "item.1" "item.2" "item.15" "item.16"
$`3`
[1] "item.7" "item.8"
$`4`
[1] "item.4" "item.5" "item.6" "item.12" "item.13"
$`5`
[1] "item.3" "item.10" "item.14" "item.17" "item.18" "item.19"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
x
1 1
2 2
3 3
4 4
5 5
This gives you a list of vectors, which is what you were after. And each group can be indexed out of the resulting list:
> L[[1]]
[1] "item.9" "item.11" "item.20"
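If you would rather not load plyr, base R's split() gives a similar list of character vectors. This is a sketch on a small made-up data frame rather than the random one above:

```r
df <- data.frame(
  x = c(1, 1, 2, 3, 3),
  rnames = c("item.1", "item.2", "item.3", "item.4", "item.5"),
  stringsAsFactors = FALSE
)

# One character vector per value of x, named by the group
groups <- split(df$rnames, df$x)
groups

groups[["1"]]  # index a single group by its name
```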

Putting output in R into excel

I have code that generates 2 columns of data (e.g. Number, Median) for a particular person, but I have taken samples from 7 people,
so basically I get this output:
[[1]]
Number Median
1 5
2 3
.....
[[2]]
Number Median
1 6
2 4
....
[[3]]
Number Median
1 3
2 5
and so on, up to [[7]].
I tried transferring this output to Excel using this code:
write.csv(cbind(data),"data1.csv")
and I get this type of output:
list(c(Median =.......It lists all the median on the rows
But I want the data for 'Median' and 'Number' to be saved in columns, NOT ROWS.
If I just type
write.csv(data,"data1.csv")
I get an error
arguments imply differing number of rows: 157, 179, 178, 180
As Marius said, you have a list of data.frames which can't be written to a .csv file. You need to do:
NewDataFrame <- do.call("rbind", YourList)
write.csv(NewDataFrame, "Data.csv")
do.call takes each of the elements from a list and applies whatever function you tell it (in this case rbind) to all of them.
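After stacking, the rows no longer show which list element they came from, so it can help to add an identifier column. The sketch below uses made-up data in place of your list; the column name Person and the temporary output file are assumptions made for the example:

```r
# Made-up stand-in for the real list of data frames
YourList <- list(
  data.frame(Number = 1:2, Median = c(5, 3)),
  data.frame(Number = 1:3, Median = c(6, 4, 2))
)

NewDataFrame <- do.call("rbind", YourList)

# Record which list element each row came from
NewDataFrame$Person <- rep(seq_along(YourList), vapply(YourList, nrow, integer(1)))

out_file <- tempfile(fileext = ".csv")  # replace with e.g. "Data.csv"
write.csv(NewDataFrame, out_file, row.names = FALSE)
NewDataFrame
```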
Here are two options. Both use the following sample data:
myList <- list(data.frame(matrix(1:4, ncol = 2)),
data.frame(matrix(3:10, ncol = 2)),
data.frame(matrix(11:14, ncol =2)))
myList
# [[1]]
# X1 X2
# 1 1 3
# 2 2 4
#
# [[2]]
# X1 X2
# 1 3 7
# 2 4 8
# 3 5 9
# 4 6 10
#
# [[3]]
# X1 X2
# 1 11 13
# 2 12 14
Option 1: Write a csv file where the data.frames are presented as they are in the list
sink("list_of_dataframes.csv", type="output")
invisible(lapply(myList, function(x) dput(write.csv(x))))
sink()
If you open the resulting "list_of_dataframes.csv" file in a text editor, you will get something that looks like this. When you read this into a spreadsheet program, the first column will include the rownames and NULL separating each data.frame:
"","X1","X2"
"1",1,3
"2",2,4
NULL
"","X1","X2"
"1",3,7
"2",4,8
"3",5,9
"4",6,10
NULL
"","X1","X2"
"1",11,13
"2",12,14
NULL
Option 2: Write or search around for a version of cbind that accommodates binding data.frames with differing number of rows.
Here is one such function that I've written.
cbind2 <- function(datalist) {
nrows <- max(sapply(datalist, nrow))
expandmyrows <- function(mydata, rowsneeded) {
temp1 = names(mydata)
rowsneeded = rowsneeded - nrow(mydata)
temp2 = setNames(data.frame(
matrix(rep(NA, length(temp1) * rowsneeded),
ncol = length(temp1))), temp1)
rbind(mydata, temp2)
}
do.call(cbind, lapply(datalist, expandmyrows, rowsneeded = nrows))
}
And here is that function applied to your list:
cbind2(myList)
# X1 X2 X1 X2 X1 X2
# 1 1 3 3 7 11 13
# 2 2 4 4 8 12 14
# 3 NA NA 5 9 NA NA
# 4 NA NA 6 10 NA NA
That output should be easy for you to use with write.csv and related functions.
