Difference between data frame and matrix indexing - r

I have a text file of integers which I've been reading into R and storing as a data frame for the time being. However, coercing it to a matrix (say y, using as.matrix()) doesn't seem to give the same thing as the matrix I created (x). Namely, if I look at a single entry I get different output
> y[1,1]
V1
0
as opposed to
> x[1,1]
[1] 0
Can anyone explain the difference?

I am interpreting your question as asking what the difference is between a matrix and a data frame, not just why the output of y[1,1] looks different when y is a data frame rather than a matrix. If all you want to know is why they look different, the answer is that data frames and matrices are different classes with different internal representations. Many operations have been designed to paper over the differences, but matrix indexing and data frame indexing are implemented separately and do not have to behave identically, although hopefully they are reasonably consistent. At this point it would likely be unwise to modify R to reduce any inconsistencies, given how much code that might break.
matrix
A matrix is a vector with dimensions.
m1 <- 1:12
dim(m1) <- c(4, 3)
m2 <- matrix(1:12, 4, 3)
identical(m1, m2)
## [1] TRUE
length(m1) # 12 elements in the underlying vector
## [1] 12
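As a small illustration of "a vector with dimensions": elements are stored column by column, so a [row, column] pair maps to a single position in the underlying vector. The variable names i, j and pos below are only for this sketch.

```r
m1 <- matrix(1:12, 4, 3)

# position of element [i, j] in the underlying column-major vector
i <- 2
j <- 3
pos <- (j - 1) * nrow(m1) + i

# both forms of indexing reach the same element
same_element <- m1[i, j] == m1[pos]

# dropping the dim attribute leaves the plain vector
plain <- as.vector(m1)
```

Here pos works out to 10, so m1[2, 3] and m1[10] refer to the same stored value.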
data frame
A data.frame is a named list (the names are the column names) of columns with row names -- the default row names of 1, 2, ... are internally represented as c(NA, -4L) for a 4 row data frame in order to avoid having to store a possibly large vector of row names.
DF1 <- as.data.frame(m1)
DF2 <- list(V1 = 1:4, V2 = 5:8, V3 = 9:12)
attr(DF2, "row.names") <- c(NA, -4L)
class(DF2) <- "data.frame"
identical(DF1, DF2)
## [1] TRUE
length(DF1) # 3 columns
## [1] 3
names
Matrices do not have to have row or column names whereas data frames always do. If a matrix has row and column names then they are represented as a list of two vectors called dimnames (as opposed to a named list with a row.names attribute which is how data frames represent their row names).
m3 <- m1
rownames(m3) <- c("a", "b", "c", "d")
colnames(m3) <- c("A", "B", "C")
str(m3)
## int [1:4, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:4] "a" "b" "c" "d"
## ..$ : chr [1:3] "A" "B" "C"
m4 <- m1
dimnames(m4) <- list(c("a", "b", "c", "d"), c("A", "B", "C"))
identical(m3, m4)
## [1] TRUE
lapply
Suppose we lapply over matrix m1. Since it is really a vector with dimensions we are lapplying over each of the 12 elements:
> str(lapply(m1, length))
List of 12
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
$ : int 1
whereas if we do this over DF1 we are lapplying over 3 elements each of which has length 4
> str(lapply(DF1, length))
List of 3
$ V1: int 4
$ V2: int 4
$ V3: int 4
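If the goal is to walk over a matrix column by column, the way lapply does for a data frame, apply() with MARGIN = 2 is the usual tool. A minimal sketch, rebuilding m1 and DF1 so that it runs on its own:

```r
m1 <- matrix(1:12, 4, 3)
DF1 <- as.data.frame(m1)

# apply over the columns (MARGIN = 2) of the matrix
col_lengths_m <- apply(m1, 2, length)

# sapply over the data frame visits its columns directly
col_lengths_df <- sapply(DF1, length)
```

Both results report 4 for each of the 3 columns; the data frame version also carries the column names.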
double indexing
Indexing is such that DF1[1,1] and m1[1,1] give the same result if the matrix does not have names.
DF1[1,1]
## [1] 1
m1[1,1]
## [1] 1
If it does then there is the observed difference:
as.matrix(DF1)[1,1] # as.matrix(DF1) has col names V1, V2, V3 from DF1
V1
1
DF1[1,1]
[1] 1
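If the name on the extracted element is unwanted, it can be dropped with unname(), either on the single element or on the whole matrix. This is only a sketch of one way to get matching output; the object names y, named, plain and y2 are made up here.

```r
DF1 <- as.data.frame(matrix(1:12, 4, 3))
y <- as.matrix(DF1)

named <- y[1, 1]          # carries the column name "V1"
plain <- unname(y[1, 1])  # same value, name removed

# unname() on the matrix itself removes the dimnames altogether
y2 <- unname(y)
```

After unname(y), y2[1, 1] prints like the element of a matrix built with matrix().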
One has to be careful when converting a data frame to a matrix, because if the data frame has both character and numeric columns then the conversion will force them all to the same type, i.e. all to character.
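A small sketch of that coercion, using a made-up two-column data frame called mixed. Selecting only the numeric columns before calling as.matrix() keeps the numeric type.

```r
mixed <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)

# the character column forces the whole matrix to character
cm <- as.matrix(mixed)
type_all <- typeof(cm)

# converting only the numeric column keeps it integer
nm <- as.matrix(mixed["n"])
type_num <- typeof(nm)
```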
single indexing
However, if we index with a single subscript then, since a data frame is a list of columns, we get a data frame made of the first column
> DF1[1]
V1
1 1
2 2
3 3
4 4
but for a matrix since it is a vector with dimensions we get the first element of that vector
> m1[1]
[1] 1
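For completeness, here is a sketch of the common ways to pull out the first column of a data frame, showing which ones return a data frame and which return a plain vector. The result names are made up for the example.

```r
DF1 <- as.data.frame(matrix(1:12, 4, 3))

col_df   <- DF1[1]                  # list-style: one-column data frame
col_vec  <- DF1[[1]]                # list-style: the column itself
col_vec2 <- DF1[, 1]                # matrix-style: dropped to a vector
col_df2  <- DF1[, 1, drop = FALSE]  # matrix-style, kept as a data frame
```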
other
In the usual case all elements of a matrix are numeric, or all are character but for a data frame each column might be different. One column might be numeric whereas another might be character or logical.
Typically operations on matrices are faster than operations on data frames.

The attributes assigned to data structures also depend on the methods used to import or read data, and on whether they are explicitly defined or coerced using other functions.
Here is a data frame called integers created by importing data from a .txt file.
> integers
V1 V2 V3
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
Here is a matrix called m.integers created by passing integers to as.matrix()
> m.integers <- as.matrix(integers)
> m.integers
V1 V2 V3
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
Here is a matrix called m2 created as indicated above by using matrix()
> m2
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
Now selecting the first element of each structure gives the following.
Also looking at the attributes of each reveals the default values (or assigned values if you assigned any) for each attribute.
# The element is returned without a name.
> integers[1,1]
[1] 1
# Notice attributes$row.names
> attributes(integers)
$names
[1] "V1" "V2" "V3"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4
#################################################
# The element is given a column name.
> m.integers[1,1]
V1
1
# Notice there is no row name attribute
> attributes(m.integers)
$dim
[1] 4 3
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "V1" "V2" "V3"
###############################################
# The element is returned without a name.
> m2[1,1]
[1] 1
# Notice no row name attribute.
> attributes(m2)
$dim
[1] 4 3
According to the documentation for data.frame(), row.names defaults to NULL, in which case the row names are set to the integer sequence starting at 1. These automatic row names are not carried over by as.matrix(), whereas the column names are (row names that were set explicitly are kept). matrix() does not assign row names at all unless dimnames are supplied; the [1,], [2,], ... labels in the printout are only positions.
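The row-name behaviour can be checked directly with rownames(). The sketch below rebuilds a data frame like integers from a matrix instead of reading a file, which is an assumption made only so the example is self-contained.

```r
m2 <- matrix(1:12, 4, 3)
integers <- as.data.frame(m2)

rn_matrix <- rownames(m2)                    # NULL: matrix() sets no row names
rn_auto   <- rownames(as.matrix(integers))   # NULL: automatic row names dropped
rn_forced <- rownames(as.matrix(integers, rownames.force = TRUE))

# explicitly assigned row names do survive as.matrix()
rownames(integers) <- c("one", "two", "three", "four")
rn_kept <- rownames(as.matrix(integers))
```

The rownames.force argument of as.matrix() for data frames controls whether automatic row names are turned into character row names.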
If necessary, the row names can be changed.
> attributes(integers)$row.names <- c("one", "two", "three", "four")
> integers
V1 V2 V3
one 1 5 9
two 2 6 10
three 3 7 11
four 4 8 12
> attributes(m.integers)$dimnames[[2]] <- NULL
> m.integers
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> attributes(m.integers)$dimnames[[1]] <- c("one", "two", "three", "four")
> m.integers
[,1] [,2] [,3]
one 1 5 9
two 2 6 10
three 3 7 11
four 4 8 12

Related

R: Creating a Column For Each Element in a List

I am working with the R programming language. I have a list ("my_list") that looks something like this - each element in the list (e.g. [[i]]) has a different number of subelements (e.g. [[i]][j]) :
> my_list
[[1]]
[1] "subelement1" "subelement2" "subelement3"
[[2]]
[1] "subelement1" "subelement2" "subelement3" "subelement4" "subelement5"
[[3]]
[1] "subelement1" "subelement2" "subelement3" "subelement4" "subelement5"
[[4]]
[1] "subelement1" "subelement2" "subelement3" "subelement4" "subelement5"
> summary(my_list)
Length Class Mode
[1,] 3 -none- character
[2,] 5 -none- character
[3,] 5 -none- character
[4,] 5 -none- character
[5,] 5 -none- character
[6,] 5 -none- character
[7,] 5 -none- character
[8,] 5 -none- character
[9,] 5 -none- character
[10,] 5 -none- character
[11,] 5 -none- character
[12,] 6 -none- character
For each element in this list, I want to extract each of these subelements and put them all together in a data frame (each row in this data frame will not necessarily have the same number of columns). Since I don't know the maximum number of subelements, I tried to find it out, but some parsing is still involved (many entries in the "Length" column are not numbers for some reason?):
summary = summary(my_list)
> summary
Var1 Var2 Freq
1 A Length 3
2 B Length 5
3 C Length 5
4 D Length 5
5 E Length 5
6 F Length 5
7 G Length 5
8 H Length 5
####
96 R3 Length 5
97 S3 Length 5
98 T3 Length 5
99 U3 Length 5
100 V3 Length 5
####
101 A Class -none-
102 B Class -none-
103 C Class -none-
104 D Class -none-
######
296 R3 Mode character
297 S3 Mode character
298 T3 Mode character
299 U3 Mode character
300 V3 Mode character
Next:
summary = data.frame(summary)
freq = as.numeric(gsub("([0-9]+).*$", "\\1", summary$Freq))
freq = freq[!is.na(freq)]
> max(freq)
[1] 6
With this very "roundabout way", I now know there are at most 6 subelements, and I can create 6 corresponding columns:
col1 = sapply(my_list,function(x) x[1])
col2 = sapply(my_list,function(x) x[2])
col3 = sapply(my_list,function(x) x[3])
col4 = sapply(my_list,function(x) x[4])
col5 = sapply(my_list,function(x) x[5])
col6 = sapply(my_list,function(x) x[6])
#final answer : desired output
final_data = data.frame(col1, col2, col3, col4, col5, col6)
My Question: Would there have been an easier way to find out the maximum number of subelements in this list and then create a data frame with the correct number of columns? I.e. Is there an "automatic" way to create a data frame with the same number of columns as subelements in the list and name these columns accordingly (e.g. col1, col2, col3, etc.)?
Thanks!
Try this
mx <- max(sapply(my_list , length))
df <- do.call(rbind , lapply(my_list , \(x) if(length(x) == mx) x
else c(x , rep(NA , mx - length(x)))))
df <- data.frame(df)
colnames(df) <- paste0("col" , 1:mx)
output
col1 col2 col3 col4 col5
1 subelement1 subelement2 subelement3 <NA> <NA>
2 subelement1 subelement2 subelement3 subelement4 subelement5
3 subelement1 subelement2 subelement3 subelement4 subelement5
4 subelement1 subelement2 subelement3 subelement4 subelement5
Your solution is functional, so obviously take this with a grain of salt, but it's possible to find the maximum length of a sublist with one loop.
max_length <- 0
# use a for loop: an assignment inside an lapply() function would only change a local copy
for (x in my_list) if (length(x) > max_length) max_length <- length(x)
> max_length
[1] 6
To make a dataframe with the corresponding columns a similar approach can be used:
#create an empty dataframe to add rows to
df <- data.frame(matrix(ncol = max_length, nrow = 0))
colnames(df) <- sprintf("col%d", seq_len(max_length))
#add rows, padding shorter elements with NA
for (x in my_list) df[nrow(df) + 1, ] <- x[seq_len(max_length)]
See this post regarding sprintf. Since you need to know the maximum row length going in, two loops are necessary, one to find the max length, and one to fill the data frame.
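The same two steps can also be written more compactly with lengths() and a padded rbind. This is only a sketch on a made-up two-element list; indexing past the end of a vector returns NA, which is what pads the shorter elements.

```r
# toy stand-in for the real list
my_list <- list(c("a", "b", "c"),
                c("a", "b", "c", "d", "e"))

# longest element
mx <- max(lengths(my_list))

# x[1:mx] pads short vectors with NA
padded <- lapply(my_list, `[`, seq_len(mx))

# one row per list element, one column per position
final_data <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(final_data) <- paste0("col", seq_len(mx))
```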

How to combine columns that have the same name and remove NA's?

Relatively new to R, but I have an issue combining columns that have the same name. I have a very large dataframe (~70 cols and 30k rows). Some of the columns have the same name. I wish to merge these columns and remove the NA's.
An example of what I would like is below (although on a much larger scale).
df <- data.frame(x = c(2,1,3,5,NA,12,"blah"),
x = c(NA,NA,NA,NA,9,NA,NA),
y = c(NA,5,12,"hop",NA,2,NA),
y = c(2,NA,NA,NA,8,NA,4),
z = c(9,5,NA,3,2,6,NA))
desired.result <- data.frame(x = c(2,1,3,5,9,12,"blah"),
y = c(2,5,12,"hop",8,2,4),
z = c(9,5,NA,3,2,6,NA))
I have tried a number of things including suggestions such as:
R: merging columns and the values if they have the same column name
Combine column to remove NA's
However, these solutions either require a numeric dataset (I need to keep the character information) or they require you to manually input the columns that are the same (which is too time consuming for the size of my dataset).
I have managed to solve the issue manually by creating new columns that are combinations:
df$x <- apply(df[,1:2], 1, function(x) x[!is.na(x)][1])
However I don't know how to get R to auto-identify where the columns have the same names and then apply something like the above such that I don't need to specify the index each time.
Thanks
here is a base R approach
#split into a named list, based on colnames before the .-character
L <- split.default(df, f = gsub("(.*)\\..*", "\\1", names(df)))
#get the first non-na value for each row in each chunk
L2 <- lapply(L, function(x) apply(x, 1, function(y) na.omit(y)[1]))
# result in a data.frame
as.data.frame(L2)
# x y z
# 1 2 2 9
# 2 1 5 5
# 3 3 12 NA
# 4 5 hop 3
# 5 9 8 2
# 6 12 2 6
# 7 blah 4 NA
# since you are using mixed formats, the columns are not of the same class!!
str(as.data.frame(L2))
# 'data.frame': 7 obs. of 3 variables:
# $ x: chr "2" "1" "3" "5" ...
# $ y: chr " 2" "5" "12" "hop" ...
# $ z: num 9 5 NA 3 2 6 NA
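A variant that avoids apply(), and with it the conversion of each chunk to a character matrix (the source of the padded " 2" above), is to merge the columns of each group pairwise. This is a sketch on a made-up data frame; coalesce_cols is a hypothetical helper, and it assumes duplicated names were made unique by data.frame() with suffixes such as ".1".

```r
df <- data.frame(x = c(2, NA, 3),
                 x = c(NA, 9, NA),
                 z = c("a", NA, "c"))   # names become x, x.1, z

# group columns by their name with any numeric suffix removed
groups <- split.default(df, sub("\\.[0-9]+$", "", names(df)))

# take the first non-NA value across the columns of one group
coalesce_cols <- function(cols) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), cols)
}

res <- as.data.frame(lapply(groups, coalesce_cols))
```

Note that a column whose real name happens to end in a dot and digits would be grouped too, so the pattern may need adjusting for other data.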

Storing frequencies returned from table function in R

I have a vector of size 5 which stores random digits 0-9 so that there can be multiple occurrences of the same digit. Here is an example vector:
nums <- c(5,2,5,9,2)
If I print the results of running the table function on this vector, I get the following output:
nums
2 5 9
2 2 1
I would like to know what the highest and second highest frequencies are that are returned from table(nums). How can I store all of the frequencies that are returned from an iteration of the table function?
table returns an array that can be saved to a variable. If you convert it to a data.frame using as.data.frame you get an easier to work with object:
nums <- c(5,2,5,9,2)
tab <- as.data.frame(table(nums))
tab
nums Freq
1 2 2
2 5 2
3 9 1
You can use plyr, it's lightning fast.
library(plyr)
nums <- c(5,2,5,9,2)
count(nums)
Result
x freq
2 2
5 2
9 1
To shrink the table only to the two most frequent options you would want
sort(table(nums), dec = TRUE)[1:2]
# nums
# 2 5
# 2 2
Just to get their names you could do
names(sort(table(nums), dec = TRUE))[1:2]
# [1] "2" "5"
If it may happen that there are not that many unique values, you could use na.omit, as in
names(sort(table(nums), dec = TRUE))[1:4]
# [1] "2" "5" "9" NA
na.omit(names(sort(table(nums), dec = TRUE))[1:4])
# [1] "2" "5" "9"
# attr(,"na.action")
# [1] 4
# attr(,"class")
# [1] "omit"
As for storing the results, using a list should be pretty convenient:
tabs <- list()
tabs[[1]] <- sort(table(nums), dec = TRUE)[1:2]
tabs[[2]] <- sort(table(c(1, 1, 2, 3, 3)), dec = TRUE)[1:2]
tabs
# [[1]]
# nums
# 2 5
# 2 2
#
# [[2]]
#
# 1 3
# 2 2
In particular, a list still works when the number of distinct values differs from one table to the next.
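If only the two highest frequencies are needed as plain numbers, the sorted table can be converted to a named integer vector first. A minimal sketch; counts and top2 are names chosen for this example, and head() is used so that fewer than two distinct values would not produce NA.

```r
nums <- c(5, 2, 5, 9, 2)

# frequencies in decreasing order
freqs <- sort(table(nums), decreasing = TRUE)

# plain named integer vector instead of a table object
counts <- setNames(as.vector(freqs), names(freqs))

# highest and second highest frequency
top2 <- head(counts, 2)
```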

Why does class change from integer to character when indexing a data frame with a numeric matrix?

If I index a data.frame of all integers with a matrix, I get the expected result.
df1 <- data.frame(c1=1:4, c2=5:8)
df1
# c1 c2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
df1[matrix(c(1:4,1,2,1,2), nrow=4)]
# [1] 1 6 3 8
If the data.frame has a column of characters, the result is all characters, even though I'm only indexing the integer columns.
df2 <- data.frame(c0=letters[1:4], c1=1:4, c2=5:8)
df2
# c0 c1 c2
#1 a 1 5
#2 b 2 6
#3 c 3 7
#4 d 4 8
df2[matrix(c(1:4,2,3,2,3), nrow=4)]
# [1] "1" "6" "3" "8"
class(df2[matrix(c(1:4,2,3,2,3), nrow=4)])
# [1] "character"
df2[1,2]
# [1] 1
My best guess is that R is too busy to go through the answer to check if they all originated from a certain class. Can anyone please explain why this is happening?
In ?Extract it is described that indexing via a numeric matrix is intended for matrices and arrays. So it might be surprising that such indexing worked for a data frame in the first place.
However, if we look at the code for [.data.frame (getAnywhere(`[.data.frame`)), we see that when extracting elements from a data.frame using a matrix in i, the data.frame is first coerced to a matrix with as.matrix:
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) ==
1)
{
# snip
if (Narg < 3L) {
# snip
if (is.matrix(i))
return(as.matrix(x)[i])
Then look at ?as.matrix:
"The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column".
Thus, because the first column in "df2" is of class character, as.matrix will coerce the entire data frame to a character matrix before the extraction takes place.
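One way around the coercion, sketched under the assumption that only numeric columns are being indexed, is to drop the non-numeric columns before converting and to translate the column indices accordingly. The names idx, num and idx_num are made up for this example.

```r
df2 <- data.frame(c0 = letters[1:4], c1 = 1:4, c2 = 5:8)

# (row, column) pairs referring to columns of df2
idx <- cbind(row = 1:4, col = c(2, 3, 2, 3))

# keep only the numeric columns
num <- df2[vapply(df2, is.numeric, logical(1))]

# remap the column indices from df2 to the reduced data frame
idx_num <- cbind(idx[, "row"],
                 match(names(df2)[idx[, "col"]], names(num)))

# the matrix is now integer, so the extracted values stay integer
res <- as.matrix(num)[idx_num]
```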

How to turn variable names into factors in a data frame in R

Say I have a data frame containing time-series data, where the first column is the index, and the remaining columns all contain different data streams, and are named descriptively, as in the following example:
temps = data.frame(matrix(1:20,nrow=2,ncol=10))
names(temps) <- c("flr1_dirN_areaA","flr1_dirS_areaA","flr1_dirN_areaB","flr1_dirS_areaB","flr2_dirN_areaA","flr2_dirS_areaA","flr2_dirN_areaB","flr2_dirS_areaB","flr3_dirN_areaA","flr3_dirS_areaA")
temps$Index <- as.Date(2013,7,1:2)
temps
flr1_dirN_areaA flr1_dirS_areaA ... Index
1 1 3 ... 1975-07-15
2 2 4 ... 1975-07-16
Now I want to prep the data frame for plotting with ggplot2, and I want to include the three factors: flr, dir, and area.
I can achieve this for this simple example as follows:
library(reshape2)  # provides melt()
temps.m <- melt(temps,"Index")
temps.m$flr <- factor(rep(1:3,c(8,8,4)))
temps.m$dir <- factor(rep(c("N","S"),each=2,len=20))
temps.m$area <- factor(rep(c("A","B"),each=4,len=20))
temps.m
Index variable value flr dir area
1 1975-07-15 flr1_dirN_areaA 1 1 N A
2 1975-07-16 flr1_dirN_areaA 2 1 N A
3 1975-07-15 flr1_dirS_areaA 3 1 S A
4 1975-07-16 flr1_dirS_areaA 4 1 S A
5 1975-07-15 flr1_dirN_areaB 5 1 N B
6 1975-07-16 flr1_dirN_areaB 6 1 N B
7 1975-07-15 flr1_dirS_areaB 7 1 S B
8 1975-07-16 flr1_dirS_areaB 8 1 S B
9 1975-07-15 flr2_dirN_areaA 9 2 N A
10 1975-07-16 flr2_dirN_areaA 10 2 N A
11 1975-07-15 flr2_dirS_areaA 11 2 S A
12 1975-07-16 flr2_dirS_areaA 12 2 S A
13 1975-07-15 flr2_dirN_areaB 13 2 N B
14 1975-07-16 flr2_dirN_areaB 14 2 N B
15 1975-07-15 flr2_dirS_areaB 15 2 S B
16 1975-07-16 flr2_dirS_areaB 16 2 S B
17 1975-07-15 flr3_dirN_areaA 17 3 N A
18 1975-07-16 flr3_dirN_areaA 18 3 N A
19 1975-07-15 flr3_dirS_areaA 19 3 S A
20 1975-07-16 flr3_dirS_areaA 20 3 S A
In reality, I have data streams (columns) of varying lengths - each of which comes from its own file, has missing data, more than 3 factors encoded in the column (file) names, so this simple method of applying factors won't work. I need something more robust, and I'm inclined to parse the variable names into the different factors, and populate the factor-columns of the melted data frame.
My end goal is to plot something like this:
ggplot(temps.m,aes(x=Index,y=value,color=area,linetype=dir))+geom_line()+facet_grid(flr~.)
I imagine that the reshape, reshape2, plyr, or some other package can do this in one or two statements - but I struggle with melt/cast/ddply and the rest of them. Any suggestions?
Also, if you can suggest an entirely different [better] approach to structuring my data, I'm all ears.
Thanks in advance
You can use some regular expressions to create your factors:
res <- do.call(rbind,strsplit(gsub('flr([0-9]+).*dir([A-Z]).*area([A-Z])',
'\\1,\\2,\\3',
temps.m$variable),
','))
[,1] [,2] [,3]
[1,] "1" "N" "A"
[2,] "1" "N" "A"
[3,] "1" "S" "A"
[4,] "1" "S" "A"
[5,] "1" "N" "B"
[6,] "1" "N" "B"
[7,] "1" "S" "B"
[8,] "1" "S" "B"
........
You may need a further step to convert the columns to factors (colwise() is from plyr).
res <- colwise(as.factor)(data.frame(res))
X1 X2 X3
1 1 N A
2 1 N A
3 1 S A
4 1 S A
........
To combine the result with your melted data you can use cbind
temps.m <- cbind(temps.m,res)
Here's a way to turn a bunch of appropriately-formatted strings into a data frame of factor variables. This assumes the factors are split by _, and the last character in each substring is the desired level.
require(plyr)
v <- do.call(rbind, strsplit(as.character(temps.m$variable), "_"))
v <- alply(v, 2, function(x) {
n <- nchar(x)
name <- substr(x, 1, n - 1)[1]
lev <- substr(x, n, n)
structure(factor(lev), name=name)
})
names(v) <- sapply(v, attr, "name")
temps.m <- cbind(temps.m, as.data.frame(v))
Adding more generality is left as an exercise for the reader.
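As one possible step toward that generality, here is a base R sketch that extracts each factor by its prefix, so levels longer than one character (for example flr12) are handled. parse_names is a hypothetical helper, and it assumes the prefixes are known in advance and that levels contain no underscore.

```r
# hypothetical helper: one factor column per known prefix
parse_names <- function(v, prefixes = c("flr", "dir", "area")) {
  out <- lapply(prefixes, function(p) {
    # capture everything after the prefix up to the next underscore
    factor(sub(paste0(".*", p, "([^_]+).*"), "\\1", v))
  })
  names(out) <- prefixes
  as.data.frame(out)
}

vars <- c("flr1_dirN_areaA", "flr12_dirS_areaB")
res <- parse_names(vars)
```

The result could then be attached to the melted data with cbind(temps.m, parse_names(as.character(temps.m$variable))).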

Resources