Creating a data frame from two vectors using cbind - r

Consider the following R code.
> x = cbind(c(10, 20), c("[]", "[]"), c("[[1,2]]","[[1,3]]"))
> x
[,1] [,2] [,3]
[1,] "10" "[]" "[[1,2]]"
[2,] "20" "[]" "[[1,3]]"
Similarly
> x = rbind(c(10, "[]", "[[1,2]]"), c(20, "[]", "[[1,3]]"))
> x
[,1] [,2] [,3]
[1,] "10" "[]" "[[1,2]]"
[2,] "20" "[]" "[[1,3]]"
Now, I don't want the integers 10 and 20 to be converted to strings.
How can I perform this operation without any such conversion? I would of
course also like to know why this conversion happens. I looked at
the cbind help and also tried Googling, but had no luck finding a
solution. I also believe that in some cases R converts strings to
factors, and I don't want that to happen either, though it doesn't seem
to be happening here.

Vectors and matrices can only be of a single type and cbind and rbind on vectors will give matrices. In these cases, the numeric values will be promoted to character values since that type will hold all the values.
Note that in your rbind example, the promotion happens within the c call:
> c(10, "[]", "[[1,2]]")
[1] "10" "[]" "[[1,2]]"
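To see the promotion for yourself, here is a small sketch (the variable names v, m and l are only for illustration). It checks the storage type of a mixed vector and of a cbind result, and contrasts them with a list, which keeps each element's own type:

```r
# Mixing numeric and character values in one atomic vector
# promotes everything to character.
v <- c(10, "[]")
typeof(v)

# cbind() on plain vectors builds a matrix, which is also a
# single-type structure, so the same promotion happens.
m <- cbind(c(10, 20), c("[]", "[]"))
typeof(m)

# A list keeps each element's own type.
l <- list(10, "[]")
sapply(l, typeof)
```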
If you want a rectangular structure where the columns can be different types, you want a data.frame. Any of the following should get you what you want:
> x = data.frame(v1=c(10, 20), v2=c("[]", "[]"), v3=c("[[1,2]]","[[1,3]]"))
> x
v1 v2 v3
1 10 [] [[1,2]]
2 20 [] [[1,3]]
> str(x)
'data.frame': 2 obs. of 3 variables:
$ v1: num 10 20
$ v2: Factor w/ 1 level "[]": 1 1
$ v3: Factor w/ 2 levels "[[1,2]]","[[1,3]]": 1 2
or (using specifically the data.frame version of cbind)
> x = cbind.data.frame(c(10, 20), c("[]", "[]"), c("[[1,2]]","[[1,3]]"))
> x
c(10, 20) c("[]", "[]") c("[[1,2]]", "[[1,3]]")
1 10 [] [[1,2]]
2 20 [] [[1,3]]
> str(x)
'data.frame': 2 obs. of 3 variables:
$ c(10, 20) : num 10 20
$ c("[]", "[]") : Factor w/ 1 level "[]": 1 1
$ c("[[1,2]]", "[[1,3]]"): Factor w/ 2 levels "[[1,2]]","[[1,3]]": 1 2
or (using cbind, but making the first a data.frame so that it combines as data.frames do):
> x = cbind(data.frame(c(10, 20)), c("[]", "[]"), c("[[1,2]]","[[1,3]]"))
> x
c.10..20. c("[]", "[]") c("[[1,2]]", "[[1,3]]")
1 10 [] [[1,2]]
2 20 [] [[1,3]]
> str(x)
'data.frame': 2 obs. of 3 variables:
$ c.10..20. : num 10 20
$ c("[]", "[]") : Factor w/ 1 level "[]": 1 1
$ c("[[1,2]]", "[[1,3]]"): Factor w/ 2 levels "[[1,2]]","[[1,3]]": 1 2

Using data.frame instead of cbind should be helpful
x <- data.frame(col1=c(10, 20), col2=c("[]", "[]"), col3=c("[[1,2]]","[[1,3]]"))
x
col1 col2 col3
1 10 [] [[1,2]]
2 20 [] [[1,3]]
sapply(x, class) # looking into x to see the class of each element
col1 col2 col3
"numeric" "factor" "factor"
As you can see, the elements of col1 are numeric, as you wish.
A data.frame can have variables of different classes (numeric, factor, character), but a matrix cannot: once you put a character element into a matrix, all the other elements are coerced to character, no matter what class they were before.
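Since the question also asks to avoid factors, a minimal sketch of the same call with stringsAsFactors = FALSE may help. This argument is part of base R's data.frame(); the column names v1, v2, v3 are just illustrative:

```r
# Build the data frame directly and ask for strings to stay character.
x <- data.frame(
  v1 = c(10, 20),
  v2 = c("[]", "[]"),
  v3 = c("[[1,2]]", "[[1,3]]"),
  stringsAsFactors = FALSE
)

# Class of each column: the first is numeric, the others character.
cls <- sapply(x, class)
cls
```

Passing the argument explicitly should give the same column classes regardless of which R version's default is in effect.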

Related

Why does R convert numbers and characters to factors when coercing to data frame?

Recently I have come across a problem where my data has been converted to factors.
This is a large nuisance, as it's not (always) easily picked up on.
I am aware that I can convert them back with solutions such as as.numeric(paste(x)) or as.character(paste(x)), but that seems really unnecessary.
Example code:
nums <- c(1,2,3,4,5)
chars <- c("A","B","C,","D","E")
str(nums)
#> num [1:5] 1 2 3 4 5
str(chars)
#> chr [1:5] "A" "B" "C," "D" "E"
df <- as.data.frame(cbind(a = nums, b = chars))
str(df)
#> 'data.frame': 5 obs. of 2 variables:
#> $ a: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
#> $ b: Factor w/ 5 levels "A","B","C,","D",..: 1 2 3 4 5
1) Don't cbind, as it converts the data to a matrix, and a matrix can hold data of only one type, so it converts the numbers to characters.
2) Use data.frame, because as.data.frame(a = nums, b = chars) returns an error.
3) Use stringsAsFactors = FALSE, because in data.frame the default value of stringsAsFactors is TRUE, which converts characters to factors. The numbers also change to factors because in 1) they have been changed to characters.
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
str(df)
#'data.frame': 5 obs. of 2 variables:
# $ a: num 1 2 3 4 5
# $ b: chr "A" "B" "C," "D" ...
EDIT: As of R 4.0.0, the default value of stringsAsFactors has changed to FALSE.
This should no longer happen if you have updated R: data frames no longer automatically turn chr into fct. In a way, data frames are now more similar to tibbles.
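If you do end up with a factor that should have been numeric, be careful how you convert it back. The sketch below (with a made-up factor f) shows that as.numeric() applied directly to a factor returns the internal level codes, so going through as.character() first is the safer route:

```r
# A factor whose labels happen to look like numbers.
f <- factor(c("10", "20", "5"))

# Direct conversion yields the integer level codes, not the labels.
codes <- as.numeric(f)

# Converting to character first recovers the numbers the labels spell out.
values <- as.numeric(as.character(f))

codes
values
```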

Issue with user defined function in R

I am trying to change the data type of my variables in data frame to 'factor' if they are 'character'. I have tried to replicate the problem using sample data as below
a <- c("AB","BC","AB","BC","AB","BC")
b <- c(12,23,34,45,54,65)
df <- data.frame(a,b)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
I wrote the below function to achieve that
abc <- function(x) {
for(i in names(x)){
if(is.character(x[[i]])) {
x[[i]] <- as.factor(x[[i]])
}
}
}
The function executes without error if I pass the data frame (df), but it still doesn't change 'character' to 'factor'.
abc(df)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
NOTE: It works perfectly with a bare for loop and if condition. When I tried to generalize it by writing a function around it, there's a problem.
Please help. What am I missing?
Besides the comment from @Roland, you should make use of R's nice indexing possibilities and learn about the *apply family. With that you can rewrite your code to
change_to_factor <- function(df_in) {
chr_ind <- vapply(df_in, is.character, logical(1))
df_in[, chr_ind] <- lapply(df_in[, chr_ind, drop = FALSE], as.factor)
df_in
}
Explanation
vapply loops over all elements of a list, applies a function to each element and returns a value of the given type (here a boolean logical(1)). Since in R data frames are in fact lists where each (list) element is required to be of the same length, you can conveniently loop over all the columns of the data frame and apply the function is.character to each column. vapply then returns a boolean (logical) vector with TRUE/FALSE values depending on whether the column was a character column or not.
You can then use this boolean vector to subset your data frame to look only at columns which are character columns.
lapply is yet another member of the *apply family; it loops through list elements and returns a list. We now loop over the character columns, apply as.factor to them, and get back a list, which we conveniently store in the original positions in the data frame.
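As for why the function in the question appears to do nothing: R functions work on a copy of their argument, and a for loop as the last expression returns NULL invisibly. A minimal sketch of a corrected version, assuming you want to keep the loop (the name abc_fixed is made up here), returns the modified copy and assigns the result:

```r
# Same loop as in the question, but the modified copy is returned.
abc_fixed <- function(x) {
  for (i in names(x)) {
    if (is.character(x[[i]])) {
      x[[i]] <- as.factor(x[[i]])
    }
  }
  x  # without this line the function returns NULL
}

df <- data.frame(
  a = c("AB", "BC", "AB", "BC", "AB", "BC"),
  b = c(12, 23, 34, 45, 54, 65),
  stringsAsFactors = FALSE
)

# The caller's df is untouched; the result has to be assigned.
df_new <- abc_fixed(df)
str(df_new)
```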
By the way, if you look at str(df) you will see that column a is already a factor. This is because data.frame (in R versions before 4.0.0) automatically converts character columns to factors. To avoid that you need to pass stringsAsFactors = FALSE to data.frame:
a <- c("AB", "BC", "AB", "BC", "AB", "BC")
b <- c(12, 23, 34, 45, 54, 65)
df <- data.frame(a, b)
str(df) # column a is factor
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
str(df2 <- data.frame(a, b, stringsAsFactors = FALSE))
# 'data.frame': 6 obs. of 2 variables:
# $ a: chr "AB" "BC" "AB" "BC" ...
# $ b: num 12 23 34 45 54 65
str(change_to_factor(df2))
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
It may also be worth learning the tidyverse syntax, with which you can simply do
library(tidyverse)
df2 %>%
mutate_if(is.character, as.factor) %>%
str()

How do I stop merge from converting characters into factors?

E.g.
chr <- c("a", "b", "c")
intgr <- c(1, 2, 3)
str(chr)
str(base::merge(chr,intgr, stringsAsFactors = FALSE))
gives:
> str(base::merge(chr,intgr, stringsAsFactors = FALSE))
'data.frame': 9 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
$ y: num 1 1 1 2 2 2 3 3 3
I originally thought it has something to do with how merge coerces arguments into data frames. However, I thought that adding the argument stringsAsFactors = FALSE would override the default coercion behaviour of char -> factor, yet this is not working.
EDIT: Doing the following gives me expected behaviour:
options(stringsAsFactors = FALSE)
str(base::merge(chr,intgr))
that is:
> str(base::merge(chr,intgr))
'data.frame': 9 obs. of 2 variables:
$ x: chr "a" "b" "c" "a" ...
$ y: num 1 1 1 2 2 2 3 3 3
but this is not ideal as it changes the global stringsAsFactors setting.
You can accomplish this particular "merge" using expand.grid(), since you're really just taking the cartesian product. This allows you to pass the stringsAsFactors argument:
sapply(expand.grid(x = chr, y = intgr, stringsAsFactors = FALSE), class)
## x y
## "character" "numeric"
Here's a way of working around this limitation of merge():
sapply(merge(data.frame(x = chr, stringsAsFactors = FALSE), intgr), class)
## x y
## "character" "numeric"
I would argue that it never makes sense to pass an atomic vector to merge(), since it is only really designed for merging data.frames.
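In that spirit, here is a sketch that passes two data frames to merge() and requests the Cartesian product explicitly with by = NULL, which the merge documentation describes for a zero-length by. The names left, right and res are illustrative only:

```r
chr <- c("a", "b", "c")
intgr <- c(1, 2, 3)

# Wrap each vector in its own data frame, keeping strings as character.
left <- data.frame(x = chr, stringsAsFactors = FALSE)
right <- data.frame(y = intgr)

# by = NULL asks merge() for the Cartesian product of the two inputs.
res <- merge(left, right, by = NULL)

sapply(res, class)
nrow(res)
```

This avoids touching the global stringsAsFactors option, because the coercion to a data frame is done by you rather than inside merge().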
We can use CJ from data.table as well
library(data.table)
str(CJ(chr, intgr))
#Classes ‘data.table’ and 'data.frame': 9 obs. of 2 variables:
#$ V1: chr "a" "a" "a" "b" ...
#$ V2: num 1 2 3 1 2 3 1 2 3

Difference between data frame and matrix indexing

I have a text file of integers which I've been reading into R and storing as a data frame for the time being. However, coercing it to a matrix (say y, using as.matrix()) doesn't seem to give the same thing as the matrix I created (x). Namely, if I look at a single entry I get different output
> y[1,1]
V1
0
as opposed to
> x[1,1]
[1] 0
Can anyone explain the difference?
I am interpreting your question as asking what the difference is between a matrix and a data frame, and not just why the output of y[1,1] looks different if y is a data frame vs. a matrix. If all you want to know is why they look different, then the answer is that data frames and matrices are different classes with different internal representations. Although many operations have been designed and implemented to paper over the differences, in the end matrix indexing and data frame indexing are separately implemented and do not necessarily have to behave the same, although hopefully they are implemented reasonably consistently. At this point it would likely be unwise to modify R to reduce any inconsistencies, given how much code it might break.
matrix
A matrix is a vector with dimensions.
m1 <- 1:12
dim(m1) <- c(4, 3)
m2 <- matrix(1:12, 4, 3)
identical(m1, m2)
## [1] TRUE
length(m1) # 12 elements in the underlying vector
## [1] 12
data frame
A data.frame is a named list (the names are the column names) of columns with row names -- the default row names of 1, 2, ... are internally represented as c(NA, -4L) for a 4 row data frame in order to avoid having to store a possibly large vector of row names.
DF1 <- as.data.frame(m1)
DF2 <- list(V1 = 1:4, V2 = 5:8, V3 = 9:12)
attr(DF2, "row.names") <- c(NA, -4L)
class(DF2) <- "data.frame"
identical(DF1, DF2)
## [1] TRUE
length(DF1) # 3 columns
## [1] 3
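The two representations can also be inspected directly. The following sketch (using fresh objects m and DF so it runs on its own) checks the storage type and attributes of each:

```r
m <- matrix(1:12, 4, 3)
DF <- as.data.frame(m)

# A matrix is stored as an atomic vector; its only attribute here is dim.
typeof(m)
names(attributes(m))

# A data frame is stored as a list with names, class and row.names.
typeof(DF)
is.list(DF)

# length() therefore counts elements for a matrix, columns for a data frame.
length(m)
length(DF)
```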
names
Matrices do not have to have row or column names whereas data frames always do. If a matrix has row and column names then they are represented as a list of two vectors called dimnames (as opposed to a named list with a row.names attribute which is how data frames represent their row names).
m3 <- m1
rownames(m3) <- c("a", "b", "c", "d")
colnames(m3) <- c("A", "B", "C")
str(m3)
## int [1:4, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:4] "a" "b" "c" "d"
## ..$ : chr [1:3] "A" "B" "C"
m4 <- m1
dimnames(m4) <- list(c("a", "b", "c", "d"), c("A", "B", "C"))
identical(m3, m4)
## [1] TRUE
lapply
Suppose we lapply over matrix m1. Since it is really a vector with dimensions we are lapplying over each of the 12 elements:
> str(lapply(m1, length))
List of 12
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
whereas if we do this over DF1 we are lapplying over 3 elements each of which has length 4
> str(lapply(DF1, length))
List of 3
$ V1: int 4
$ V2: int 4
$ V3: int 4
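If what you actually want is to loop over the columns of a matrix, apply() with MARGIN = 2 is the usual tool. A small sketch, with m and DF recreated so it is self-contained:

```r
m <- matrix(1:12, 4, 3)
DF <- as.data.frame(m)

# apply(..., 2, f) calls f once per column of the matrix.
col_lengths <- apply(m, 2, length)

# sapply over a data frame already iterates column by column.
df_lengths <- sapply(DF, length)

col_lengths
df_lengths
```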
double indexing
Indexing is such that DF1[1,1] and m1[1,1] give the same result if the matrix does not have names.
DF1[1,1]
## [1] 1
m1[1,1]
## [1] 1
If it does then there is the observed difference:
as.matrix(DF1)[1,1] # as.matrix(DF1) has col names V1, V2, V3 from DF1
V1
1
DF1[1,1]
[1] 1
One has to be careful when converting a data frame to a matrix, because if there are character and numeric columns in the data frame then the conversion will force them all to the same type, i.e. all to character.
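Both points, the type coercion and the name attached to a single element, can be checked with a short sketch. The data frame DF and its column names n and s are invented for the example:

```r
DF <- data.frame(n = c(1, 2), s = c("a", "b"), stringsAsFactors = FALSE)

# Mixed columns: the whole matrix becomes character.
m <- as.matrix(DF)
typeof(m)

# Numeric columns only: the matrix stays numeric.
num_only <- as.matrix(DF["n"])
typeof(num_only)

# The extracted element carries the column name; unname() drops it.
el <- num_only[1, 1]
names(el)
unname(el)
```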
single indexing
However, if we use a single index, then since a data frame is a list of columns we get a data frame made of the first column
> DF1[1]
V1
1 1
2 2
3 3
4 4
but for a matrix since it is a vector with dimensions we get the first element of that vector
> m1[1]
[1] 1
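For reference, here is a sketch of the common ways to pull one column out of a data frame, next to single-index access on a matrix. The objects DF and m are defined locally for the example:

```r
DF <- data.frame(V1 = 1:4, V2 = 5:8)

class(DF[1])                  # list-style subsetting keeps a data frame
class(DF[[1]])                # list-style extraction gives the column vector
class(DF[, 1])                # matrix-style indexing drops to a vector
class(DF[, 1, drop = FALSE])  # drop = FALSE keeps a data frame

# A matrix is stored column by column, so a single index walks down
# the first column and then continues into the second.
m <- as.matrix(DF)
fifth <- m[5]
fifth
```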
other
In the usual case all elements of a matrix are numeric, or all are character but for a data frame each column might be different. One column might be numeric whereas another might be character or logical.
Typically operations on matrices are faster than operations on data frames.
The attributes assigned to data structures also depend on the methods used to import or read data, and on whether they are explicitly defined or coerced using other functions.
Here is a data frame called integers created by importing data from a .txt file.
> integers
V1 V2 V3
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
Here is a matrix called m.integers created by passing integers to as.matrix()
> m.integers <- as.matrix(integers)
> m.integers
V1 V2 V3
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
Here is a matrix called m2 created as indicated above by using matrix()
> m2
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
Now selecting the first element of each structure gives the following.
Also looking at the attributes of each reveals the default values (or assigned values if you assigned any) for each attribute.
# The element is returned without a name.
> integers[1,1]
[1] 1
# Notice attributes$row.names
> attributes(integers)
$names
[1] "V1" "V2" "V3"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4
#################################################
# The element is given a column name.
> m.integers[1,1]
V1
1
# Notice there is no row name attribute
> attributes(m.integers)
$dim
[1] 4 3
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "V1" "V2" "V3"
###############################################
# The element is returned without a name.
> m2[1,1]
[1] 1
# Notice no row name attribute.
> attributes(m2)
$dim
[1] 4 3
According to the documentation for data.frame(), the default is row.names = NULL, and the row names are then set to the integer sequence starting at 1. These automatic row names are not preserved by as.matrix(), whereas the column names are. A matrix created with matrix() has no row names at all unless dimnames are supplied; the [1,], [2,], ... labels shown when printing are just index positions.
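A quick way to confirm how row names behave is to query them with rownames(). This is only a sketch with locally defined m and DF; it also uses the rownames<- replacement function, which is the more conventional way to set row names:

```r
m <- matrix(1:12, 4, 3)
DF <- as.data.frame(m)

# A plain matrix has no row names.
is.null(rownames(m))

# A data frame always reports row names, here the automatic "1".."4".
rownames(DF)

# Automatic row names are dropped by as.matrix() ...
is.null(rownames(as.matrix(DF)))

# ... but explicitly assigned row names are carried over.
rownames(DF) <- c("one", "two", "three", "four")
rownames(as.matrix(DF))
```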
If necessary, the row names can be changed.
> attributes(integers)$row.names <- c("one", "two", "three", "four")
> integers
V1 V2 V3
one 1 5 9
two 2 6 10
three 3 7 11
four 4 8 12
> attributes(m.integers)$dimnames[[2]] <- NULL
> m.integers
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> attributes(m.integers)$dimnames[[1]] <- c("one", "two", "three", "four")
> m.integers
[,1] [,2] [,3]
one 1 5 9
two 2 6 10
three 3 7 11
four 4 8 12

Append a data frame to a list

I'm trying to figure out how to add a data.frame or data.table to the first position in a list.
Ideally, I want a list structured as follows:
List of 4
$ :'data.frame': 1 obs. of 3 variables:
..$ a: num 2
..$ b: num 1
..$ c: num 3
$ d: num 4
$ e: num 5
$ f: num 6
Note the data.frame is an object within the structure of the list.
The problem is that I need to add the data frame to the list after the list has been created, and the data frame has to be the first element in the list. I'd like to do this using something simple like append, but when I try:
append(list(1,2,3),data.frame(a=2,b=1,c=3),after=0)
I get a list structured:
str(append(list(1,2,3),data.frame(a=2,b=1,c=3),after=0))
List of 6
$ a: num 2
$ b: num 1
$ c: num 3
$ : num 1
$ : num 2
$ : num 3
It appears that R is coercing data.frame into a list when I'm trying to append. How do I prevent it from doing so? Or what alternative method might there be for constructing this list, inserting the data.frame into the list in position 1, after the list's initial creation.
The issue you are having is that to put a data frame anywhere into a list as a single list element, it must be wrapped with list(). Let's have a look.
df <- data.frame(1, 2, 3)
x <- as.list(1:3)
If we just wrap with c(), which is what append() is doing under the hood, we get
c(df)
# $X1
# [1] 1
#
# $X2
# [1] 2
#
# $X3
# [1] 3
But if we wrap it in list() we get the desired list element containing the data frame.
list(df)
# [[1]]
# X1 X2 X3
# 1 1 2 3
Therefore, since x is already a list, we will need to use the following construct.
c(list(df), x) ## or append(x, list(df), 0)
# [[1]]
# X1 X2 X3
# 1 1 2 3
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 2
#
# [[4]]
# [1] 3
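Applied to the named list from the question, a minimal sketch looks like this (the object names x, df and res are illustrative). It inserts the data frame at position 1 and checks that it arrived as a single element:

```r
df <- data.frame(a = 2, b = 1, c = 3)
x <- list(d = 4, e = 5, f = 6)

# Wrapping df in list() makes append() treat it as one element.
res <- append(x, list(df), after = 0)

length(res)              # one data frame plus the three original elements
is.data.frame(res[[1]])  # the first element is the intact data frame
names(res)               # the data frame slot is unnamed
str(res)
```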
