Assuming I'm understanding the documentation of [[ correctly, a matrix can be used to subset a data.frame:
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector. Negative indices are not allowed in the index matrix. NA and zero values are allowed: rows of an index matrix containing a zero are ignored, whereas rows containing an NA produce an NA in the result.
While this works for [, I'm struggling to understand how to do this with [[.
mtcars[1:6, 1:6]
#> mpg cyl disp hp drat wt
#> Mazda RX4 21.0 6 160 110 3.90 2.620
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
#> Datsun 710 22.8 4 108 93 3.85 2.320
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440
#> Valiant 18.1 6 225 105 2.76 3.460
(ind <- matrix(1:6, ncol = 2))
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 6
mtcars[ind]
#> [1] 110.00 3.90 2.32
mtcars[[ind]]
#> Error in as.matrix(x)[[i]]: attempt to select more than one element in vectorIndex
Is this a bug? Or am I misinterpreting the documentation?
Here is the source of [[.data.frame (v3.6.1)
function (x, ..., exact = TRUE)
{
na <- nargs() - !missing(exact)
if (!all(names(sys.call()) %in% c("", "exact")))
warning("named arguments other than 'exact' are discouraged")
if (na < 3L)
(function(x, i, exact) if (is.matrix(i))
as.matrix(x)[[i]]
else .subset2(x, i, exact = exact))(x, ..., exact = exact)
else {
col <- .subset2(x, ..2, exact = exact)
i <- if (is.character(..1))
pmatch(..1, row.names(x), duplicates.ok = TRUE)
else ..1
col[[i, exact = exact]]
}
}
The doc page (?Extract) you reference says that arrays can be indexed by matrices. Implicitly, I take that to mean non-arrays cannot be indexed by matrices. Data frames are not arrays, so they cannot be indexed by matrices. (Matrices are arrays, of course.)
I do think you're misinterpreting the documentation. You're looking at a documentation page that documents [, [[, and $ jointly. In the argument description, it says
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x...
The section you quote at the top of your question comes later on, under the heading Matrices and Arrays, which I take to be a section about subsetting matrices and arrays, not about using matrices as indices. (Look at the rest of the section, and the sections before and after, and I think you'll agree with me.)
Nowhere on that documentation page does it talk about using matrices as indices for [[.
I'm surprised it's handled specially in the [[ code you show - but as near as I can tell, a matrix given to [[.data.frame will error out unless it's a 1x1 matrix, in which case the data frame is treated as a matrix and the single element is returned, for some arcane reason (probably "compatibility with S", though I've no good guess as to why S would allow it).
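A quick sketch of that corner case, using the built-in mtcars:

```r
# A 1x1 matrix slips through [[.data.frame: the data frame is
# coerced to a matrix and the single element is returned
mtcars[[matrix(1)]]
#> [1] 21

# Anything larger hits the recursive-indexing error from the question:
# mtcars[[matrix(1:2)]]
# Error: attempt to select more than one element in vectorIndex
```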
Here is my session in R:
bash$ R
R version 4.1.3 (2022-03-10) -- "One Push-Up"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> require(dplyr)
Loading required package: dplyr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> x <- select(starwars, name)
> y <- select(starwars, 'name')
> assertthat::are_equal(x, y)
[1] TRUE
>
Could you please explain why it's possible to refer to dplyr and name as if they were variable names, i.e. without quotes, in the require and select calls?
This is known as non-standard evaluation (NSE) and is a large part of programming (and meta-programming) with R. There are many Q&A's on Stack Overflow about specific aspects of NSE, but I can't find any that cover the concept broadly enough to answer your question, so will venture a brief explanation here.
I should preface this by saying that the subject is treated more authoritatively in other places, such as the Programming with dplyr vignette and the Non-Standard Evaluation chapter of "Advanced R" by Hadley Wickham.
The concept of NSE relies on the fact that expressions in R undergo lazy evaluation. That is, the actual object in memory that a name refers to is only retrieved when R needs it. For example, let's define a function that takes an argument which it doesn't actually use:
print_hello <- function(unused_variable) {
print('hello')
}
Now, if we do:
print_hello(bananas)
#> [1] "hello"
The function runs without a problem. The object bananas doesn't exist, but R never had to use it, so it didn't bother to check whether it existed. The symbolic name bananas only ever existed within a promise passed to the code within the function. A promise in this context is the unevaluated name bananas plus the calling environment, in which the object with that name can be retrieved if needed.
Of course, if the name doesn't exist in the search path when R comes to use it, we will get an error at that point:
print_hello2 <- function(unused_variable) {
print(unused_variable)
print('hello')
}
print_hello2(bananas)
#> Error in print(unused_variable) : object 'bananas' not found
Within the body of this function, R needs to print the object bananas, but when it looks it up, the object doesn't exist, so R throws an error.
The idea of non-standard evaluation is that we can essentially hijack the promise object before it is evaluated and perform useful operations on it. For example, suppose we want to take a variable name and put it in a string, whether the variable exists or not. We can capture the name without evaluating it using substitute, and convert it into a string using deparse:
compare_to_apples <- function(fruit) {
paste('I prefer', deparse(substitute(fruit)), 'to apples')
}
compare_to_apples(bananas)
#> [1] "I prefer bananas to apples"
Although this example isn't very useful, we can write functions that make the end-user's life a bit easier by removing the need for them to quote column names within a function. This makes for easier-to-write and easier-to-read code. For example, we could write a function like this:
select_one <- function(data, column) {
data[deparse(substitute(column))]
}
mtcars[1:10,] |> select_one(am)
#> am
#> Mazda RX4 1
#> Mazda RX4 Wag 1
#> Datsun 710 1
#> Hornet 4 Drive 0
#> Hornet Sportabout 0
#> Valiant 0
#> Duster 360 0
#> Merc 240D 0
#> Merc 230 0
#> Merc 280 0
This type of syntax is mainly used in R in functions that take unquoted data frame column names. It is especially familiar to users of the tidyverse, but it is also used a lot in base R (for example in $, subset, with and within). Base R also uses it in calls to library and require for package names.
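For instance, a few base R functions that accept unquoted column names (a sketch using the built-in mtcars):

```r
# '$', with() and subset() all take unquoted column names via NSE
mtcars$mpg[1]                                  # first mpg value
#> [1] 21
with(mtcars, mean(mpg))                        # mpg looked up inside mtcars
#> [1] 20.09062
subset(mtcars, cyl == 4, select = c(mpg, wt))  # both arguments unquoted
```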
The main disadvantage to NSE is ambiguity. We don't want R to confuse variables in our global environment with columns in our data frame, and we sometimes want to store column names in a character vector and pass that vector as a way of selecting columns. For example, as an end-user, we might expect that if we did:
am <- c('gear', 'hp', 'mpg')
select_one(mtcars[1:10, ], am)
Then we would get three columns selected, but instead we get the same result as before.
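That's because substitute() captures the literal symbol am, not the character vector it points to:

```r
# a minimal illustration: the promise holds the name 'am', so deparse()
# gives back the string "am" regardless of what 'am' contains
capture_name <- function(column) deparse(substitute(column))
am <- c('gear', 'hp', 'mpg')
capture_name(am)
#> [1] "am"
```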
Some of the complex underlying machinery of the tidyverse exists to prevent and reduce these ambiguities by ensuring that names are evaluated in the most appropriate context.
Base R's library and require functions take a different approach, providing an explicit on/off switch for non-standard evaluation via a parameter called character.only. We could add a similar mechanism to our own function:
select_one <- function(data, column, NSE = TRUE) {
if(NSE) data[deparse(substitute(column))] else data[column]
}
am <- c('gear', 'hp', 'mpg')
select_one(mtcars[1:10,], am, NSE = TRUE)
#> am
#> Mazda RX4 1
#> Mazda RX4 Wag 1
#> Datsun 710 1
#> Hornet 4 Drive 0
#> Hornet Sportabout 0
#> Valiant 0
#> Duster 360 0
#> Merc 240D 0
#> Merc 230 0
#> Merc 280 0
select_one(mtcars[1:10,], am, NSE = FALSE)
#> gear hp mpg
#> Mazda RX4 4 110 21.0
#> Mazda RX4 Wag 4 110 21.0
#> Datsun 710 4 93 22.8
#> Hornet 4 Drive 3 110 21.4
#> Hornet Sportabout 3 175 18.7
#> Valiant 3 105 18.1
#> Duster 360 3 245 14.3
#> Merc 240D 4 62 24.4
#> Merc 230 4 95 22.8
#> Merc 280 4 123 19.2
Another disadvantage of functions that employ NSE is that it makes them more difficult to work with inside other functions. For example, we might expect the following function to return two columns of our data frame:
select_two <- function(data, column1, column2) {
cbind(select_one(data, column1), select_one(data, column2))
}
But it doesn't:
select_two(mtcars[1:10,], am, cyl)
#> Error in `[.data.frame`(data, deparse(substitute(column))) :
#> undefined columns selected
This is because the NSE employed in select_one causes the code to look for columns inside mtcars called column1 and column2, which don't exist. To use select_one inside another function we need to take account of its NSE, for example by carefully building and evaluating any calls to it:
select_two <- function(data, column1, column2) {
call1 <- as.call(list(select_one,
data = quote(data),
column = substitute(column1)))
call2 <- as.call(list(select_one,
data = quote(data),
column = substitute(column2)))
cbind(eval(call1), eval(call2))
}
select_two(mtcars[1:10,], am, cyl)
#> am cyl
#> Mazda RX4 1 6
#> Mazda RX4 Wag 1 6
#> Datsun 710 1 4
#> Hornet 4 Drive 0 6
#> Hornet Sportabout 0 8
#> Valiant 0 6
#> Duster 360 0 8
#> Merc 240D 0 4
#> Merc 230 0 4
#> Merc 280 0 6
So although NSE makes the end-user experience a bit nicer, it makes programming with such functions more difficult.
Created on 2023-01-21 with reprex v2.0.2
I have a set of Fisher's discriminant linear functions that I need to multiply against some test data. Both data files are in the form of two matrices (variables lined up to match variable order), so I need to multiply them together.
Here is some example test data, to which I've added a constant = 1 column (you'll see why when we get to the coefficients):
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
> testdata
constant mpg disp hp
Mazda RX4 1 21.0 160 110
Mazda RX4 Wag 1 21.0 160 110
Datsun 710 1 22.8 108 93
Hornet 4 Drive 1 21.4 258 110
Hornet Sportabout 1 18.7 360 175
Valiant 1 18.1 225 105
Here is my coefficient matrix (the Fisher's discriminant linear functions):
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
> coefs
constant mpg disp hp
Function1 -67.67 4.01 0.14 0.13
Function2 -59.46 3.49 0.15 0.15
Function3 -89.70 3.69 0.22 0.20
I need to multiply the values in test data against the respective coefficients to get 3 functions scores per row. Here is how the values would be calculated
for the first row, Function1 = 1*(-67.67)+21*(4.01)+160*(0.14)+110*(0.13)
for the first row, Function2 = 1*(-59.46)+21*(3.49)+160*(0.15)+110*(0.15)
for the first row, Function3 = 1*(-89.70)+21*(3.69)+160*(0.22)+110*(0.20)
It's kind of like a SUMPRODUCT of the coefficients against each row, done three times (once per function).
So the resulting df/matrix should look like this when multiplied: the same number of rows, with three function-score columns.
> df_result
Function1 Function2 Function3
row1 53.24 54.33 44.99
row2
Not ideal, but at the moment I'm taking the data out and doing it in Excel. If this is possible to do in R, any help is greatly appreciated. Many thanks
Are you just looking for the matrix product?
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
as.matrix(testdata) %*% t(as.matrix(coefs))
# Function1 Function2 Function3
# Mazda RX4 53.240 54.330 44.990
# Mazda RX4 Wag 53.240 54.330 44.990
# Datsun 710 50.968 50.262 36.792
# Hornet 4 Drive 68.564 70.426 68.026
# Hornet Sportabout 80.467 86.053 93.503
# Valiant 50.061 53.209 47.589
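Equivalently, base R's tcrossprod() computes A %*% t(B) in one call, which avoids the explicit transpose (same testdata and coefs as above):

```r
testdata <- cbind(constant = 1, mtcars[1:6, c("mpg", "disp", "hp")])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
                    mpg  = c(4.01, 3.49, 3.69),
                    disp = c(0.14, 0.15, 0.22),
                    hp   = c(0.13, 0.15, 0.20))
rownames(coefs) <- c("Function1", "Function2", "Function3")
# tcrossprod(A, B) is A %*% t(B)
tcrossprod(as.matrix(testdata), as.matrix(coefs))
```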
I'm pretty new to R and this is my first post on the website. I am trying to omit NA rows from my data frame. I am using the na.omit function, which runs but doesn't omit rows based on the desired column.
My data frame looks as below, I want to remove "na" values from the Gene.Symbol Column only without affecting the other two columns.
I've tried
na.omit(data.frame, cols= Gene.Symbol(data.frame))
which runs but doesn't remove any rows. I know from looking at the data frame that there are about 19 rows with "na", so the command isn't working at all.
thanks for the help!
Gene.Symbol Diag.A Rel.A
A2ML 173 17
na 02 95
ABCA10 18 97
ABCA4 14 na
ADCY2 81 98
If you're looking to exclude all missing values from N columns, complete.cases is another option. It returns a logical vector in which TRUE marks the rows that don't have any NA in the selected columns. I find it has better documentation than na.omit and that its behaviour is much clearer:
tst <- mtcars[1:5, ]
tst$some_na <- c(NA, NA, 2, 2, 3)
tst$another_na <- c(NA, NA, 2, 2, NA)
# These are the columns you want to exclude `NA` from:
non_na <- complete.cases(tst[, c("some_na", "another_na")])
# No NA's
tst[non_na, ]
#> mpg cyl disp hp drat wt qsec vs am gear carb some_na
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
#> another_na
#> Datsun 710 2
#> Hornet 4 Drive 2
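One caveat, looking at the data in the question: the missing values there appear to be the literal string "na", which R does not treat as missing, so complete.cases won't see them. Assuming that's the case, convert them to real NA first (a sketch with made-up rows modelled on the question's data):

```r
df <- data.frame(Gene.Symbol = c("A2ML", "na", "ABCA10", "ABCA4"),
                 Diag.A = c(173, 2, 18, 14),
                 Rel.A = c("17", "95", "97", "na"))
# turn the "na" strings into real NA values
df[df == "na"] <- NA
# drop rows with NA in Gene.Symbol only, leaving other columns untouched
df[complete.cases(df["Gene.Symbol"]), ]
#>   Gene.Symbol Diag.A Rel.A
#> 1        A2ML    173    17
#> 3      ABCA10     18    97
#> 4       ABCA4     14  <NA>
```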
I'm writing a function that takes a data.table as an argument. The column names of data.table are partially specified as arguments, but not all columns names are specified and all original columns need to be maintained. Inside the function, some columns need to be added to the data.table. Even if the data.table is copied inside the function, I want to add these columns in a way that is guaranteed not to overwrite existing columns. What's the best way to ensure I'm not overwriting columns given that column names are not known?
Here's one approach:
#x is a data.table and knownvar is a column name of that data.table
f <- function(x,knownvar){
x <- copy(x)
tempcol <- "z"
while(tempcol %in% names(x))
tempcol <- paste0("i.",tempcol)
tempcol2 <- "q"
while(tempcol2 %in% names(x))
tempcol2 <- paste0("i.",tempcol2)
x[, (tempcol):=3]
eval(parse(text=paste0("x[,(tempcol2):=",tempcol,"+4]")))
x
}
Note that even though I'm copying x here, I still need this process to be memory efficient. Is there an easier way of doing this? Possibly without using eval(parse(text=?
Obviously I could just create a local variable (e.g. a vector) in the function environment (rather than adding it explicitly as column of the data.table), but this wouldn't work if I then need to sort/join the data.table. Plus I may want to explicitly return a data.table that contains both the original variables and the new column.
Here is one way to write the function using set and non-standard evaluation with substitute() + eval().
Note 1: if new columns are created based on the column names in newcols (instead of the column name in knownvar), the character names in newcols are converted to symbols with as.name() (or equivalently as.symbol()).
Note 2: new columns in newvals can only be added in a sensible order, i.e. if column q requires column z, column z should be added before column q.
library(data.table)
f <- function(x, knownvar) {
## remove if x should be modified in-place
x <- copy(x)
## new column names
newcols <- setdiff(make.unique(c(names(x), c("z", "q"))), names(x))
## new column values based on knownvar or new column names
zcol <- as.name(newcols[1])
newvals <- list(substitute(3 * knownvar), substitute(zcol + 4))
for(i in seq_along(newvals)) {
set(x, j = newcols[i], value = eval(newvals[[i]], envir = x))
}
return(x)
}
## example data
x <- as.data.table(mtcars)
x[, c("q", "q.1") := .(mpg, 2 * mpg)]
head(f(x, mpg))
#> mpg cyl disp hp drat wt qsec vs am gear carb q q.1 z q.2
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0 42.0 63.0 67.0
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.0 42.0 63.0 67.0
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8 45.6 68.4 72.4
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4 42.8 64.2 68.2
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 18.7 37.4 56.1 60.1
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1 36.2 54.3 58.3
When running a regression analysis in R (using glm) cases are removed due to 'missingness' of the data. Is there any way to flag which cases have been removed? I would ideally like to remove these from my original dataframe.
Many thanks
The model fit object returned by glm() records the row numbers of the data that it excludes for their incompleteness. They are a bit buried but you can retrieve them like this:
## Example data.frame with some missing data
df <- mtcars[1:6, 1:5]
df[cbind(1:5,1:5)] <- NA
df
# mpg cyl disp hp drat
# Mazda RX4 NA 6 160 110 3.90
# Mazda RX4 Wag 21.0 NA 160 110 3.90
# Datsun 710 22.8 4 NA 93 3.85
# Hornet 4 Drive 21.4 6 258 NA 3.08
# Hornet Sportabout 18.7 8 360 175 NA
# Valiant 18.1 6 225 105 2.76
## Fit an example model, and learn which rows it excluded
f <- glm(mpg~drat,weight=disp, data=df)
as.numeric(na.action(f))
# [1] 1 3 5
Alternatively, to get the row indices without having to fit the model, use the same strategy with the output of model.frame():
as.numeric(na.action(model.frame(mpg~drat,weight=disp, data=df)))
# [1] 1 3 5
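Either way, once you have the excluded row numbers you can drop them from the original data frame, which is what the question ultimately asks for (same example data as above):

```r
## same example data.frame with some missing values
df <- mtcars[1:6, 1:5]
df[cbind(1:5, 1:5)] <- NA
## row numbers that glm()/model.frame() would exclude
dropped <- as.numeric(na.action(model.frame(mpg ~ drat, weight = disp, data = df)))
## keep only the complete rows
df[-dropped, ]
#                 mpg cyl disp  hp drat
# Mazda RX4 Wag  21.0  NA  160 110 3.90
# Hornet 4 Drive 21.4   6  258  NA 3.08
# Valiant        18.1   6  225 105 2.76
```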
Without a reproducible example I can't provide code tailored to your problem, but here's a generic method that should work. Assume your data frame is called df and your variables are called y, x1, x2, etc. And assume you want y, x1, x3, and x6 in your model.
# Make a vector of the variables that you want to include in your glm model
# (Be sure to include any weighting or subsetting variables as well, per Josh's comment)
glm.vars = c("y","x1","x3","x6")
# Create a new data frame that includes only those rows with no missing values
# for the variables that are in your model
df.glm = df[complete.cases(df[ , glm.vars]), ]
Also, if you want to see just the rows that have at least one missing value, do the following (note the addition of ! (the "not" operator)):
df[!complete.cases(df[ , glm.vars]), ]