Select specific columns by string label in R frame - r

I want to exclude the column labeled "fldname" from a data frame frm in R. If we know the index i of the column, we can use frm[-i] to exclude the ith column. Is there a simple way to do the same by specifying the label string, or a list of label strings, of the columns I want to exclude?
I worked out a solution. My first attempt was wrong, because != compares the names position by position instead of testing membership:
frm[names(frm)[names(frm) != c("fldname1","fldname2")]]
The corrected version (thanks to Fhnuzoag) is:
frm[names(frm)[!names(frm) %in% c("fldname1","fldname2")]]
It builds the list of wanted labels and uses them as the index. Above, "fldname1" and "fldname2" are the unwanted fields.
Is there a simpler solution built into the language syntax?

Yes, use a combination of negation ! and %in%. For example, using iris:
x <- iris[, !names(iris) %in% c("Sepal.Width", "Sepal.Length")]
str(x)
'data.frame': 150 obs. of 3 variables:
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

I think, no. Usually I do frm[, setdiff(names(frm), excludelist)].
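To make the difference between the approaches concrete, here is a minimal sketch on a made-up data frame (the frame and its column names are assumptions for illustration only). It shows how != goes wrong through recycling, and that the %in% and setdiff versions agree:

```r
# Hypothetical data frame; the column names are invented for this example
frm <- data.frame(a = 1:3, fldname1 = 4:6, b = 7:9, fldname2 = 10:12)
drop_cols <- c("fldname1", "fldname2")

# != recycles drop_cols along names(frm) and compares position by position,
# so "fldname1" (compared against "fldname2") is wrongly kept
wrong <- names(frm)[names(frm) != drop_cols]

# %in% tests membership, which is what we actually want
keep_in <- frm[, !names(frm) %in% drop_cols, drop = FALSE]

# setdiff works on the names directly and gives the same columns
keep_sd <- frm[, setdiff(names(frm), drop_cols), drop = FALSE]

print(wrong)
print(names(keep_in))
```

The drop = FALSE argument keeps the result a data frame even when only one column survives.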

Related

dplyr::filter() to grab the rows

With the code below, I am trying to grab the rows with Species == "setosa".
data("iris")
filter_data<-dplyr::filter(iris,"Species" == "setosa")
print(filter_data)
or
data("iris")
filter_data<-dplyr::filter(iris,"Species" == 1)
# since str(iris) shows that Species is a factor
print(filter_data)
However, the results both show:
Description:df [0 × 5]
0 rows
How can that be?
However, if I try to use
data("iris")
filter_data<-dplyr::filter(iris, as.factor(Species) == "setosa")
print(filter_data)
or
data("iris")
filter_data<-dplyr::filter(iris,as.numeric(Species) == 1)
print(filter_data)
then it works.
What is the main difference? Can R not identify factors or dummy variables by itself?
I am also a little confused by the terminology: factor versus dummy variables.
For lm or glm, the factor has to be converted to numeric values; is that what we call dummy variables?
"Species" == "setosa" is like matching two unequal strings, which are evidently not equal.
So you are filtering the data frame with a single FALSE, which is recycled to every row, and no rows are returned. To filter a data frame, dplyr needs a logical vector with one value per row (or a single value that gets recycled). Every row where that vector is TRUE is returned.
you are actually doing something like this
filter(iris, 'something' == 'something else')
[1] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<0 rows> (or 0-length row.names)
If instead you'll do something like this, all rows will be returned.
filter(iris, 'a' == 'a')
#check
str(filter(iris, 'a' == 'a'))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
To match the contents of the column Species you have to remove its quotation marks; R will then treat it as an object and not as a string.
Moreover, dplyr verbs use data masking: column names are looked up in the data frame first, so we do not have to use $.
Every language and package follows certain rules. When using dplyr, you are expected to pass bare column names, i.e. column names without quotes, to the function.
dplyr::filter(iris, Species == "setosa")
Notice how we are using Species and not "Species". There are some other set of rules which are applied when you pass column names as string.
For example, you can use .data and then use "Species".
dplyr::filter(iris, .data[["Species"]] == "setosa")
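The same point can be checked in base R without dplyr. The sketch below (variable names are my own) compares what each condition evaluates to, and also shows why the as.numeric(Species) == 1 version from the question works: a factor is stored as integer codes, with levels sorted alphabetically by default.

```r
# Two string literals: a single FALSE that has nothing to do with the data
lit <- "Species" == "setosa"

# Column against a string: one logical value per row
by_col <- iris$Species == "setosa"

# A factor is stored as integer codes pointing into its levels;
# "setosa" is the first level, so its code is 1
codes <- as.integer(iris$Species)
by_code <- codes == 1

# Base R equivalent of the filter call
setosa <- iris[by_col, ]

print(lit)
print(length(by_col))
print(nrow(setosa))
```

Relying on the integer codes is fragile, because they change if the levels are reordered; comparing against the label is usually the safer choice.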

Remove dataframes from list of dataframes using loop

I want to remove parts from a list to reduce the list to the elements of it that have a certain number of columns.
This is a dummy example of what I'm trying to do:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe and not matrix
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables <- tables[-k]
}
}
another option I tried:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables[[-k]] <- NULL
}
}
I'm getting
Error in tables[[k]] : subscript out of bounds.
Is there an alternative and correct approach?
We can use Filter
Filter(function(x) ncol(x)==5, tables)
Or with sapply to create a logical index and subset the list
tables[sapply(tables, ncol)==5]
Or as @Sotos commented
tables[lengths(tables)==5]
lengths returns the length of each list element; comparing it to 5 gives a logical vector that is used to subset the list. The length of a data.frame is the number of columns it has.
For a tidyverse option you can use purrr::keep for this. You define a predicate function: if it returns true the list element is kept, if false it is removed. Here I've done that with the formula option.
library(purrr)
tables <- list(mtcars, iris)
result <- purrr::keep(tables, ~ ncol(.x) == 5)
str(result)
#> List of 1
#> $ :'data.frame': 150 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
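As for why the loops in the question fail: 1:length(tables) is evaluated once, before the first iteration, so after mtcars is removed at k = 1 the loop still tries tables[[2]] on a list that now has one element. If a loop is really wanted, one possible fix (a sketch, not the only way) is to walk the indices backwards, so that a removal never shifts elements that have not been visited yet:

```r
tables <- list(mtcars, iris)

# Vectorised version: build a logical index first, then subset once
keep <- vapply(tables, function(x) ncol(as.data.frame(x)) == 5, logical(1))
result <- tables[keep]

# Loop version: iterate in reverse so deletions do not disturb
# the positions that are still to be checked
tables2 <- list(mtcars, iris)
for (k in rev(seq_along(tables2))) {
  if (ncol(tables2[[k]]) != 5) {
    tables2[[k]] <- NULL  # assigning NULL removes the k-th element
  }
}

print(length(result))
print(length(tables2))
```

Note that the removal uses tables2[[k]] <- NULL with a positive index; the tables[[-k]] <- NULL form from the question is not the way to delete element k.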

using eval, parse and as.character in data.table in R

I am having some trouble running a combination of eval, parse and as.character for a data.table. I basically want to convert a given column of the data table to as.character output of the same column.
library(data.table)
options(datatable.WhenJisSymbolThenCallingScope=TRUE)
# an option that I heard may solve the problem
iris2 <- data.table(iris)
VARS <- colnames(iris)
j <- 1
iris2[,eval(parse(text = paste0(VARS[j])))] # this works fine
iris2[,eval(parse(text = paste0(VARS[j]))) := as.character(eval(parse(text = paste0(VARS[j]))))]
#but this fails
From the looks of it, it appears the eval and parse functions work fine but when it comes to updating the column with := it seems to break. Could someone tell me what the issue is?
We can use the data.table methods to transform the variables. Specify 'VARS' or a subset of it, i.e. 'VARS[j]', in .SDcols, loop through the columns with lapply (in case we want to handle multiple columns) and assign (:=) the result to the columns named in 'VARS[j]':
iris2[, VARS[j] := lapply(.SD, as.character) , .SDcols = VARS[j]]
str(iris2)
#Classes ‘data.table’ and 'data.frame': 150 obs. of 5 variables:
#$ Sepal.Length: chr "5.1" "4.9" "4.7" "4.6" ...
#$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1
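For a single column, two other forms may be worth knowing. This is a sketch assuming the data.table package is installed; the variable name target is my own. As far as I can tell, the original attempt breaks because the left-hand side of := has to yield column names, whereas eval(parse(...)) there tries to yield the column's values.

```r
library(data.table)

iris2 <- data.table(iris)
VARS <- colnames(iris)
target <- VARS[1]  # "Sepal.Length"

# Parentheses on the left-hand side tell data.table to use the value of
# `target` as the column name; get() looks the column up by name
iris2[, (target) := as.character(get(target))]

# set() takes the column name as a string directly
set(iris2, j = VARS[2], value = as.character(iris2[[VARS[2]]]))

str(iris2)
```

Both forms update iris2 by reference, so no reassignment is needed.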

Test whether data is numeric or Factor/Ordinal

I'm working with a large dataset and want to get some basic information about my variables, first of all whether they are numeric or factor/ordinal.
I'm writing a function and want to check, one variable at a time, whether it is numeric or a factor.
To make the for loop work I'm using dataset[i] to get to the variable I want.
object<-function(dataset){
n=ncol(dataset)
for(i in 1:n){
variable_name<-names(dataset[i])
factor<-is.factor(dataset[i])
ordered<-is.ordered(dataset[i])
numeric<-is.numeric(dataset[i])
print(list(variable_name,factor,ordered,numeric))
}
}
My problem is that is.numeric() does not seem to work with dataset[i]: all the results come out as FALSE. It only works with dataset$.
Do you have any idea how to solve this?
Try str(dataset) to get summary information on an object. To solve your problem, though, you need to completely extract your data with double square brackets. Single square bracket subsetting keeps the output as a sub-list (or data.frame) rather than extracting the contents:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.numeric(iris[1])
[1] FALSE
class(iris[1])
[1] "data.frame"
is.numeric(iris[[1]])
[1] TRUE
Assuming that dataset is something like a data.frame, you can do the following (and avoid the loop):
names = colnames(dataset)
types = sapply(dataset, class)
Then types gives you either numeric or factor. You can then simply do something like this:
is_factor = types == 'factor'
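Putting the two answers together, the function from the question could be rewritten roughly as follows. This is only a sketch; the function name describe_columns is hypothetical. vapply passes each column itself (the equivalent of dataset[[i]]) to the test function, which is why is.numeric now gives the expected answers.

```r
# Hypothetical helper: one row of type information per column
describe_columns <- function(dataset) {
  data.frame(
    variable = names(dataset),
    factor   = vapply(dataset, is.factor,  logical(1), USE.NAMES = FALSE),
    ordered  = vapply(dataset, is.ordered, logical(1), USE.NAMES = FALSE),
    numeric  = vapply(dataset, is.numeric, logical(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

info <- describe_columns(iris)
print(info)
```

Returning a data frame instead of printing inside a loop makes the result easy to filter, for example info$variable[info$numeric].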

Elegant way to get the colclasses of a data.frame

I currently use the following function to list the classes of a data.frame:
sapply(names(iris),function(x) class(iris[,x]))
There must be a more elegant way to do this...
Since data.frames are already lists, sapply(iris, class) will just work. sapply won't be able to simplify to a vector for classes that extend other classes, so you could do something to take the first class, paste the classes together, etc.
EDIT If you just want to LOOK at the classes, consider using str:
str(iris) # Show "summary" of data.frame or any other object
#'data.frame':   150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
But to expand on @JoshuaUlrish's excellent answer, a data.frame with time or ordered factor columns would cause pain with the sapply solution:
d <- data.frame(ID=1, time=Sys.time(), factor=ordered(42))
# This doesn't return a character vector anymore
sapply(d, class)
#$ID
#[1] "numeric"
#
#$time
#[1] "POSIXct" "POSIXt"
#
#$factor
#[1] "ordered" "factor"
# Alternative 1: Get the first class
sapply(d, function(x) class(x)[[1]])
# ID time factor
#"numeric" "POSIXct" "ordered"
# Alternative 2: Paste classes together
sapply(d, function(x) paste(class(x), collapse='/'))
# ID time factor
# "numeric" "POSIXct/POSIXt" "ordered/factor"
Note that none of these solutions are perfect. Getting only the first (or last) class can return something quite meaningless. Pasting makes using the compound class harder. Sometimes you might just want to detect when this happens, so an error would be preferable (and I love vapply ;-):
# Alternative 3: Fail if there are multiple-class columns
vapply(d, class, character(1))
#Error in vapply(d, class, character(1)) : values must be length 1,
# but FUN(X[[2]]) result is length 2
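If the goal is a guaranteed character vector, vapply can also be combined with the pasting idea, and inherits() can answer "is this column a factor?" without caring where "factor" sits in the class vector. A small sketch along those lines, reusing the same example data frame:

```r
d <- data.frame(ID = 1, time = Sys.time(), factor = ordered(42))

# Always one string per column, however many classes a column has
classes <- vapply(d, function(x) paste(class(x), collapse = "/"), character(1))

# inherits() checks the whole class vector, so ordered factors count as factors
is_fac <- vapply(d, inherits, logical(1), what = "factor")

print(classes)
print(is_fac)
```

Unlike sapply, vapply raises an error if any result does not match the declared type and length, so a surprise never silently turns the result into a list.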

Resources