Why does a dataframe allow 2-dimensional selection without a comma? - r

My understanding is that you can select from a dataframe in two ways. If you use [] and do not include a comma, then it is a list-style selection. It works that way because a dataframe is built on a list and you are really just pulling from its top-level components.
And, if you include a comma, then you are doing matrix style selection and you get this syntax [rows, columns].
If that's true, then why can I select from a dataframe with an array?
df <- as.data.frame(state.x77)
df2 <- cbind(df, rep(NA, nrow(df)))
df2[is.na(df2)]
is.na(df2) is an array with dim attributes for 50 rows and 9 columns.
How does it know to select against every value instead of doing the typical selection amongst columns?

is.na(df2) produces a logical matrix, with the same dimensions as the data.frame, df2.
Subsetting a data.frame by a matrix of the same dimensions is a standard operation. See ?'[.data.frame' for more information.
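As a minimal sketch of the behaviour described above, the following reuses the data from the question. The column name extra and the object names m and na_values are introduced here purely for illustration.

```r
# state.x77 ships with base R (datasets package): 50 rows, 8 columns
df <- as.data.frame(state.x77)
df2 <- cbind(df, extra = rep(NA, nrow(df)))  # 'extra' is an illustrative name

# is.na() on a data.frame returns a logical matrix of the same shape
m <- is.na(df2)
is.matrix(m)
dim(m)

# Indexing by that matrix picks individual cells and returns a plain vector
na_values <- df2[m]
length(na_values)

# The same kind of index works on the left-hand side of an assignment
df2[m] <- 0
anyNA(df2)
```

Because the index is a matrix rather than a vector, the result is the selected cells, not a set of columns.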

Related

Intersecting tables in R based on rownames string

I have 2 datasets: one set has my actual data, and the other one is a list of my KOs of interest, and I'm trying to intersect the data to select only the KOs of interest.
As you can see, the row names also have the associated taxa. I've intersected these tables previously without the taxa data:
foi <- read.csv("krakened/biogeochemical.csv")
new <- intersect(rownames(kegg.f),foi$genefamily)
kegg.df.select <- kegg.f[new,]
but I'd really like to have the taxa in the row names. Is it possible to intersect the tables by only comparing the "KOxxxx" part of my rownames?
We may use trimws to extract the substring, use %in% to find the matches in the genefamily column from 'foi' and subset using the logical vector
kegg.f[trimws(rownames(kegg.f), whitespace = "\\|.*") %in% foi$genefamily,]
It can also be done with sub
kegg.f[sub("^(K\\d+)\\|.*", "\\1", rownames(kegg.f)) %in% foi$genefamily, ]
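Here is a small self-contained sketch of the same idea on made-up data. The row-name format ("KOxxxx|taxon"), the taxa, and the counts are assumptions for illustration only; the real tables are not shown in the question.

```r
# Toy stand-ins for the real tables (values are invented)
kegg.f <- data.frame(
  count = c(10, 20, 30),
  row.names = c("K00001|g__Bacillus", "K00002|g__Vibrio", "K00003|g__Nostoc")
)
foi <- data.frame(genefamily = c("K00001", "K00003"))

# Drop everything from the first "|" onward to recover the bare KO identifier
ko_ids <- sub("\\|.*$", "", rownames(kegg.f))

# Keep rows whose KO is of interest; the full row names (with taxa) are preserved
kegg.df.select <- kegg.f[ko_ids %in% foi$genefamily, , drop = FALSE]
rownames(kegg.df.select)
```

drop = FALSE is used so that the result stays a data frame even when there is only one data column.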

How to create a subset from a set of values within a column in R

I have a dataframe with 62 columns and 110 rows. In the column "date_observed" I have 57 dates with some of them having multiple records for the same date.
I am trying to extract only 12 dates out of this. They are not in any given order.
I tried this:
datesubset <- original %>% select (original$date_observed == c("13-Jun-21","21-Jun-21", "28-Jun-21", "13-Jul-21", "20-Jul-21", "8-Aug-21", "9-Aug-21", "25-Aug-21", "31-Aug-21", "8-Sep-21", "27-Sep-21"))
But, I got the following error:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type logical.
i It must be numeric or character.
I did try searching here and on google but I could find results only for how to subset a set of columns but not for specific values within columns. I am still new to R so please pardon me if this was a very simple question to ask.
In {dplyr}, the select() function is for selecting particular columns, but if you want to subset particular rows you want to use filter().
The == operator does not compare each row against EVERYTHING on the right. It compares the two vectors element by element, recycling the shorter one, so each row is tested against just one of your dates (whichever lines up with it) rather than against the whole set.
What I think you are after is the %in% operator, which checks whether each value on the left appears anywhere on the right and returns one TRUE or FALSE per row.
As was mentioned, inside of tidyverse functions you don't need the $, you can just input the column name as in the example below.
I don't have your original data to double check, but the example below should work with your original data frame.
specific_dates <- c(
"13-Jun-21",
"21-Jun-21",
"28-Jun-21",
"13-Jul-21",
"20-Jul-21",
"8-Aug-21",
"9-Aug-21",
"25-Aug-21",
"31-Aug-21",
"8-Sep-21",
"27-Sep-21"
)
datesubset <- original %>%
filter(date_observed %in% specific_dates)
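To see the difference between == and %in% without any packages, here is a small base R sketch. The vectors dates and wanted are invented for illustration and are not the asker's data.

```r
dates  <- c("13-Jun-21", "14-Jun-21", "21-Jun-21", "13-Jun-21")
wanted <- c("21-Jun-21", "13-Jun-21")

# == lines the vectors up position by position, recycling 'wanted',
# so the first "13-Jun-21" is compared with "21-Jun-21" and is missed
eq_result <- dates == wanted
eq_result

# %in% asks, for each element of 'dates', whether it occurs anywhere in 'wanted'
in_result <- dates %in% wanted
in_result

# Base R row filter equivalent to filter(): keep rows where the test is TRUE
original <- data.frame(date_observed = dates, n = 1:4)
datesubset <- original[original$date_observed %in% wanted, ]
nrow(datesubset)
```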

Is there a R methodology to select the columns from a dataframe that are listed in a separate array

I have a dataframe with over 100 columns. Post implementation of certain conditions, I need a subset of the dataframe with the columns that are listed in a separate array.
The array has 50 entries with 2 columns. The first column has the selected variable names and the second column has some associated values.
I wish to build a new data frame with just the variables mentioned in the first column of the separate array. Could you please point me as to how to proceed?
Try this:
library(dplyr)
iris <- iris %>% select(all_of(dataframe_with_names$names))
In R you can use square brackets [rows, columns] to select specific rows or specific columns. (Leaving either blank selects all).
If you had a vector of column names you wanted to keep called important_columns you could select only those columns with:
myData[,important_columns]
In your case the vector of column names is actually a column in your array. So you select that column and use it as your vector:
myData[, array$names]
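A minimal base R sketch of this approach follows. The object selected and its columns names and value are hypothetical stand-ins for the two-column "array" in the question, assumed here to be a data frame.

```r
# Toy data: four columns, of which we want two
myData <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Hypothetical lookup table: variable names plus some associated values
selected <- data.frame(
  names = c("a", "c"),
  value = c(0.5, 0.9),
  stringsAsFactors = FALSE
)

# Use the names column as a character index on the columns
subset_df <- myData[, selected$names, drop = FALSE]
names(subset_df)

# If the lookup were a character matrix instead, $ would not work;
# its first column would be taken with selected_matrix[, 1]
```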

What is happening during assignment to a dataframe by lapply

Given a dataframe df and a function f which is applied to df:
df[] <- lapply(df, f)
What is the magic R is performing to replace columns in df with collection of vectors in the list from lapply? I see that the result from lapply is a list of vectors having the same names as the dataframe df. I assume some magic mapping is being done to map the vectors to df[], which is the collection of columns in df (methinks). Just works? Trying to better understand so that I remember what to use the next time.
A data.frame is essentially a list of vectors that all have the same length. You can check this with is.list(a_data_frame), which returns TRUE.
[] can mean different things depending on the object it is applied to. It can even be redefined, because it is in fact a function.
On a data.frame, [] lets you subset columns or assign into them:
df[1] returns the first column (as a one-column data.frame)
df[1] <- 2 replaces the first column with 2 (recycled to the same length as the other columns)
df[] returns the whole data.frame
df[] <- list(c1,c2,c3) replaces the contents of the data.frame while keeping its data.frame structure
There are also many other ways to access or set data in a data.frame (by column name, by a subset of rows or columns, ...)
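The difference between df <- lapply(df, f) and df[] <- lapply(df, f) can be seen in a short sketch. The data and the doubling function are made up for illustration.

```r
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
f <- function(col) col * 2  # illustrative function

# lapply walks the columns (list elements) and returns a plain list
res <- lapply(df, f)
class(res)

# Plain assignment would replace the data frame with that list
as_list <- res
is.data.frame(as_list)

# Assigning into df[] fills the existing columns and keeps the data.frame
df[] <- res
is.data.frame(df)
df
```

So nothing special happens inside lapply; it is the df[] on the left-hand side that keeps the data frame's class, names, and row names.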

How to put columns names to a matrix from values from a vector?

I am trying to name the columns of matrix from data from a vector.
suppose I have the following matrix:
A <- matrix(1:110, ncol=11)
and also a vector with 11 values from read.table:
code <- data1$code
I would like to do something like:
colnames(A)=data.frame(code)
to put the names of the columns using the values from the vector code
It will be far simpler just to pass code (or perhaps as.character(code), if it is a factor variable):
colnames(A) <- as.character(code)
Passing a data.frame with one column will not work, as it has length 1 (the one column).
If you pass a list with two elements of the correct lengths to dimnames, you can set both rownames and colnames at the same time.
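Both options can be sketched as follows. The factor code below is a made-up stand-in for data1$code, and the row labels are likewise invented.

```r
A <- matrix(1:110, ncol = 11)

# Hypothetical stand-in for data1$code: a factor with 11 values
code <- factor(paste0("c", 1:11))

# A one-column data.frame has length 1, which is why it cannot supply 11 names
length(data.frame(code))

# Set only the column names from the vector
colnames(A) <- as.character(code)

# Or set row and column names together: list(row names, column names)
dimnames(A) <- list(paste0("r", 1:10), as.character(code))
A[1:2, 1:3]
```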
