dplyr::filter() to grab the rows - r

With the code below, I am trying to grab the rows where Species == "setosa".
data("iris")
filter_data<-dplyr::filter(iris,"Species" == "setosa")
print(filter_data)
or
data("iris")
filter_data<-dplyr::filter(iris,"Species" == 1)
# since str(iris) shows that Species is a factor
print(filter_data)
However, the results both show:
Description:df [0 × 5]
0 rows
How can that be?
However, if I try to use
data("iris")
filter_data<-dplyr::filter(iris, as.factor(Species) == "setosa")
print(filter_data)
data("iris")
filter_data<-dplyr::filter(iris,as.numeric(Species) == 1)
print(filter_data)
then it works.
What is the main difference? Can R not identify the factor or dummy variables by itself?
I am also a little confused by the terminology: factor versus dummy variables.
For lm or glm, the factor has to be converted to numeric values; is that what we call dummy variables?

"Species" == "setosa" is like matching two unequal strings, which are evidently not equal.
So you are filtering the data frame with a condition that is a single FALSE, which is recycled to every row. Thus, no rows are returned. To filter a data frame, dplyr needs a logical vector with one element per row (or a single value, which is recycled). Every row where the value is TRUE is returned.
You are actually doing something like this:
filter(iris, 'something' == 'something else')
[1] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<0 rows> (or 0-length row.names)
If you instead do something like this, all rows are returned:
filter(iris, 'a' == 'a')
#check
str(filter(iris, 'a' == 'a'))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
To match the contents of the Species column, remove the quotation marks; R then treats Species as an object and not as a string.
Moreover, dplyr/tidyverse verbs evaluate their arguments inside the data frame, so columns can be referred to by name and we do not have to use $.
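A minimal base R sketch of the difference, assuming nothing beyond the built-in iris data (the variable names quoted and by_column are only illustrative). It also shows why comparing the factor with 1 finds nothing while as.numeric(Species) == 1 does: a factor is compared by its labels, not by its underlying integer codes.

```r
data("iris")

# Two string literals: a single comparison with a single result
quoted <- "Species" == "setosa"
length(quoted)   # 1
quoted           # FALSE

# The column itself: one comparison per row
by_column <- iris$Species == "setosa"
length(by_column)  # 150
sum(by_column)     # 50

# Factor compared with a number: the labels are compared, so nothing matches
sum(iris$Species == 1)              # 0
# The integer codes behind the factor do match
sum(as.integer(iris$Species) == 1)  # 50
```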

Every language and package has its own rules. dplyr verbs expect bare column names, i.e. column names without quotes.
dplyr::filter(iris, Species == "setosa")
Notice how we are using Species and not "Species". A different set of rules applies when you pass column names as strings.
For example, you can use the .data pronoun together with "Species":
dplyr::filter(iris, .data[["Species"]] == "setosa")
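The .data form is mostly useful when the column name is only known at run time, for instance inside a function. A small sketch, assuming dplyr is installed; filter_equal is a hypothetical helper, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: `col` is a column name supplied as a string
filter_equal <- function(df, col, value) {
  filter(df, .data[[col]] == value)
}

setosa <- filter_equal(iris, "Species", "setosa")
nrow(setosa)  # 50
```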

Related

Writing to the global environment from a function in R

I'm new to R and have some trouble understanding how to handle local and global environments. I checked the post on local and global variables, but couldn't figure it out.
If, for example, I would like to make several plots using a function and save them like this:
PlottingFunction <- function(type) {
type <<- mydata %>%
filter(typeVariable==type) %>%
qplot(a,b)
}
lapply(ListOfTypes, PlottingFunction)
This didn't yield the desired result. I tried using the assign() function, but couldn't get it to work either.
I want to save the graphs in the global environment so I can combine them using gridExtra. This might not be the best way to do that, but I think it might be useful to understand this issue nevertheless.
You don't need to assign the plot to a global variable. All plots can be saved in one list.
For this example, I use the iris data set.
library(gridExtra)
library(ggplot2)
library(dplyr)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The modified function without assignment:
PlottingFunction <- function(type) {
iris %>%
filter(Species == type) %>%
qplot(Sepal.Length, Sepal.Width, data = .)
}
One figure per Species is created
species <- unique(iris$Species)
# [1] setosa versicolor virginica
# Levels: setosa versicolor virginica
l <- lapply(species, PlottingFunction)
Now, the function do.call can be used to call grid.arrange with the plot objects in the list l.
do.call(grid.arrange, l)
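As for why the original attempt did not work: type <<- ... always assigns to one global variable literally named type, so each call overwrites the previous plot. Below is a sketch of the list approach plus an assign() variant, assuming ggplot2 and gridExtra are installed. plot_one and the plot_ prefix are hypothetical names, and ggplot() is used in place of qplot(), which is deprecated in recent ggplot2 releases.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical helper: scatter plot for a single species
plot_one <- function(type) {
  ggplot(iris[iris$Species == type, ], aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    ggtitle(type)
}

# One plot per species, collected in a list
species <- levels(iris$Species)
plots <- lapply(species, plot_one)

# Layout arguments such as ncol can be passed alongside the list
grid.arrange(grobs = plots, ncol = 3)

# If separate global variables are really wanted, assign() takes the
# variable name as a string, so it can differ for every plot
for (i in seq_along(species)) {
  assign(paste0("plot_", species[i]), plots[[i]], envir = globalenv())
}
```

Keeping the plots in a list is usually easier to work with than creating many global variables, since the list can be handed to grid.arrange in one go.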

Data frame changes from numeric to character

I open my csv file and check the class of each column:
mydataP<-read.csv("Energy_protein2.csv", stringsAsFactors=F)
apply(mydataP, 2, function(i) class(i))
#[1] "numeric"
I add a column and check the class of the data:
mydataP[ ,"ID"] <-rep(c("KOH1", "KOH2", "KOH3", "KON1", "KON2", "KON3", "WTH1", "WTH2", "WTH3","WTN1", "WTN2", "WTN3"), each=2)
apply(mydataP, 2, function(i) class(i))
Here everything changes to "character".
as.numeric(as.factor(mydataP))
#Error in sort.list(y) : 'x' must be atomic for 'sort.list'
#Have you called 'sort' on a list?
as.numeric(as.character(mydataP))
I get a vector with 117 NAs.
I have no idea what to do now: as soon as I touch the data frame, it changes to character. Can somebody help me? Thanks
That happens because apply converts your data.frame to a matrix, and a matrix can hold only one type of data.
Try this instead:
sapply(mydataP, class)
This is the reason you should normally try to avoid using apply on data.frames.
This behavior is documented in the help file (?apply):
If X is not an array but an object of a class with a non-null dim
value (such as a data frame), apply attempts to coerce it to an array
via as.matrix if it is two-dimensional (e.g., a data frame) or via
as.array.
Here's a reproducible example with the built-in iris dataset:
> apply(iris, 2, function(i) class(i))
#Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# "character" "character" "character" "character" "character"
> sapply(iris, class)
#Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# "numeric" "numeric" "numeric" "numeric" "factor"
> str(iris)
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
As you can see, apply converts all columns to the same class.
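A small sketch of the same effect on a made-up data frame (the columns energy, protein and ID are assumptions standing in for the CSV in the question). It suggests that the data frame itself never changes; only the matrix copy that apply builds internally is character.

```r
# Made-up stand-in for the CSV from the question
df <- data.frame(energy = c(1.5, 2.5), protein = c(10, 20))
apply(df, 2, class)    # both "numeric": an all-numeric matrix

df$ID <- c("KOH1", "KOH2")
apply(df, 2, class)    # all "character": as.matrix() had to pick one type
sapply(df, class)      # "numeric" "numeric" "character"

# The numeric columns are still numeric and can be used directly
is.numeric(df$energy)  # TRUE
mean(df$energy)        # 2
```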

Test whether data is numeric or Factor/Ordinal

I'm working with a large dataset and want to get some basic information about my variables, first of all whether they are numeric or factor/ordinal.
I'm writing a function and want to check, one variable at a time, whether it is numeric or a factor.
To make the for loop work I'm using dataset[i] to get to the variable I want.
object<-function(dataset){
n=ncol(dataset)
for(i in 1:n){
variable_name<-names(dataset[i])
factor<-is.factor(dataset[i])
ordered<-is.ordered(dataset[i])
numeric<-is.numeric(dataset[i])
print(list(variable_name,factor,ordered,numeric))
}
}
My problem is that is.numeric() does not seem to work with dataset[i]: all the results are FALSE. It only works with dataset$.
Do you have any idea how to solve this?
Try str(dataset) to get summary information on an object, but to solve your problem you need to completely extract your data with double square brackets. Single square bracket subsetting keeps the output as a sub-list (or data.frame) rather than extracting the contents:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.numeric(iris[1])
[1] FALSE
class(iris[1])
[1] "data.frame"
is.numeric(iris[[1]])
[1] TRUE
Assuming that dataset is something like a data.frame, you can do the following (and avoid the loop):
names = colnames(dataset) # column names; names(dataset) gives the same result
types = sapply(dataset, class)
Then types tells you, for each column, whether it is numeric or factor. You can then simply do something like this:
is_factor = types == 'factor'
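Putting both answers together, the loop from the question can be replaced by a few vapply calls, which hand each column to the test function as a vector (the same thing dataset[[i]] would give). This is only a sketch; describe_columns is a hypothetical name. Using is.factor and friends rather than comparing class strings also copes with ordered factors, which carry two classes.

```r
# Hypothetical helper: one row of type information per column
describe_columns <- function(dataset) {
  data.frame(
    variable   = names(dataset),
    is_factor  = vapply(dataset, is.factor,  logical(1), USE.NAMES = FALSE),
    is_ordered = vapply(dataset, is.ordered, logical(1), USE.NAMES = FALSE),
    is_numeric = vapply(dataset, is.numeric, logical(1), USE.NAMES = FALSE)
  )
}

info <- describe_columns(iris)
info
```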

Select specific columns by string label in R frame

I want to exclude the column labeled "fldname" from a data frame frm in R. If we know the index of the column, say i, then we can use frm[-i] to exclude the ith column. Is there any simple way to do the same by specifying the column label string, or a list of label strings, that I want to exclude?
I worked out a solution (corrected by Fhnuzoag):
frm[names (frm)[names (frm) != c("fldname1","fldname2")]]
frm[names (frm)[!names (frm) %in% c("fldname1","fldname2")]]
That is, get the list of wanted strings and use them as the index. Above, "fldname1" and "fldname2" are the unwanted fields.
Is there a simple solution built into the language syntax?
Yes, use a combination of negation ! and %in%. For example, using iris:
x <- iris[, !names(iris) %in% c("Sepal.Width", "Sepal.Length")]
str(x)
'data.frame': 150 obs. of 3 variables:
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
I think, no. Usually I do frm[, setdiff(names(frm), excludelist)].
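The reason != needed correcting to %in% is recycling: != compares the names position by position against the recycled exclusion vector, while %in% checks each name against the whole vector. A sketch on a made-up data frame (frm, excluded, keep and only_b are illustrative names, and the column order is chosen so the problem shows up):

```r
# Made-up data frame
frm <- data.frame(a = 1, fldname1 = 2, fldname2 = 3, b = 4)
excluded <- c("fldname1", "fldname2")

# != recycles `excluded` and compares element by element
names(frm) != excluded        # TRUE TRUE TRUE TRUE: nothing would be dropped

# %in% asks, for each name, whether it appears anywhere in `excluded`
keep <- !names(frm) %in% excluded
keep                          # TRUE FALSE FALSE TRUE

names(frm[keep])                           # "a" "b"
names(frm[setdiff(names(frm), excluded)])  # "a" "b"

# With the [ , ] form, drop = FALSE keeps a data frame
# even when a single column is left
only_b <- frm[, !names(frm) %in% c(excluded, "a"), drop = FALSE]
class(only_b)                 # "data.frame"
```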

Elegant way to get the colclasses of a data.frame

I currently use the following function to list the classes of a data.frame:
sapply(names(iris),function(x) class(iris[,x]))
There must be a more elegant way to do this...
Since data.frames are already lists, sapply(iris, class) will just work. sapply won't be able to simplify to a vector for classes that extend other classes, so you could do something to take the first class, paste the classes together, etc.
EDIT If you just want to LOOK at the classes, consider using str:
str(iris) # Show "summary" of data.frame or any other object
#'data.frame':   150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
But to expand on @JoshuaUlrish's excellent answer, a data.frame with time or ordered factor columns would cause pain with the sapply solution:
d <- data.frame(ID=1, time=Sys.time(), factor=ordered(42))
# This doesn't return a character vector anymore
sapply(d, class)
#$ID
#[1] "numeric"
#
#$time
#[1] "POSIXct" "POSIXt"
#
#$factor
#[1] "ordered" "factor"
# Alternative 1: Get the first class
sapply(d, function(x) class(x)[[1]])
# ID time factor
#"numeric" "POSIXct" "ordered"
# Alternative 2: Paste classes together
sapply(d, function(x) paste(class(x), collapse='/'))
# ID time factor
# "numeric" "POSIXct/POSIXt" "ordered/factor"
Note that none of these solutions are perfect. Getting only the first (or last) class can return something quite meaningless. Pasting makes using the compound class harder. Sometimes you might just want to detect when this happens, so an error would be preferable (and I love vapply ;-):
# Alternative 3: Fail if there are multiple-class columns
vapply(d, class, character(1))
#Error in vapply(d, class, character(1)) : values must be length 1,
# but FUN(X[[2]]) result is length 2
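If the goal is to test for a particular class rather than to list the classes, inherits() sidesteps the multiple-class problem, since it checks whether a class appears anywhere in the class vector. A short sketch on the same kind of data frame; has_class is a hypothetical helper name.

```r
d <- data.frame(ID = 1, time = Sys.time(), factor = ordered(42))

# Hypothetical helper: TRUE for every column that inherits from `what`
has_class <- function(df, what) {
  vapply(df, inherits, logical(1), what = what)
}

has_class(d, "factor")   # ID FALSE, time FALSE, factor TRUE
has_class(d, "POSIXct")  # ID FALSE, time TRUE, factor FALSE
```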

Resources