I need to remove a participant from a data set in R, but I'm struggling to find an easy way to do so. I identified the participant in the data set via a category. I need to take out the participant's data from the entire environment. How do I do it?
I tried googling it and couldn't find a simple answer.
In base R there is a subset() function. Here's an example using the built-in iris data frame:
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
iris2 <- subset(iris, Species != "setosa")
head(iris2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
The dplyr package of the tidyverse has a filter function for more complex operations.
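For the participant case in the question, a minimal sketch might look like the following. The data frame `participants` and its columns `id`, `group` and `score` are made up for illustration; substitute your own data frame, column and category value.

```r
# Hypothetical data: one row per participant, identified by a category
participants <- data.frame(
  id    = c("P01", "P02", "P03", "P04"),
  group = c("A", "B", "A", "C"),
  score = c(10, 12, 9, 14)
)

# Keep every row except the participant whose group is "B"
cleaned <- participants[participants$group != "B", ]

# Same rows with subset(); columns can be named directly
cleaned2 <- subset(participants, group != "B")

# dplyr equivalent, if the package is installed:
# cleaned3 <- dplyr::filter(participants, group != "B")

print(cleaned)
```

If the category column contains NA values, subset() drops those rows, whereas plain logical indexing returns them as rows of NA, so the two forms only agree on complete data.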
I have a data frame with 81 objects and 12 variables, including an ID for each object.
Further, I have a sorted(!) list of IDs.
Now I want to sort my data frame according to this specific list.
Can anyone make a simple example for that case?
I am a newbie, trying to learn.
Thanks in advance!
Quick example of my case:
ID City NR1 NR2
Dataframe1 = "11000", Berlin, (123,2), (532,1)
"02401", Hamburg, (435,2), (352,1)
"83329", München, (124,3), (125,2)
ID = list("02401", "83329", "11000")
Now I want Dataframe1 to be sorted in the order of the IDs in the list.
You can arrange your data frame using arrange() from the dplyr package.
An example:
The iris dataset, as is:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Creating an external vector:
index<-sample(1:150)
Then you can sort your dataframe with that external vector:
head(arrange(iris, index))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.4 2.7 5.3 1.9 virginica
2 5.5 3.5 1.3 0.2 setosa
3 6.3 3.3 6.0 2.5 virginica
4 6.3 3.3 4.7 1.6 versicolor
5 4.9 2.5 4.5 1.7 virginica
6 5.7 2.8 4.5 1.3 versicolor
To arrange by a specific external vector that matches one of the variables, you can use match():
iris2<-head(iris)%>%mutate(ID=sample(1:150, 6))
> iris2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
3 4.7 3.2 1.3 0.2 setosa 69
4 4.6 3.1 1.5 0.2 setosa 89
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
external_vector<-c(69,59,84,29,61,89)
Arrange with match():
iris2[match(external_vector, iris2$ID),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
3 4.7 3.2 1.3 0.2 setosa 69
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
4 4.6 3.1 1.5 0.2 setosa 89
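Applied to the example from the question, the same match() idea could be sketched as below. Only the ID and City columns are reproduced, and the names `dataframe1` and `id_order` are chosen here for illustration. The IDs are kept as character strings so the leading zero in "02401" survives.

```r
# Data frame from the question, reduced to two columns
dataframe1 <- data.frame(
  ID   = c("11000", "02401", "83329"),
  City = c("Berlin", "Hamburg", "München"),
  stringsAsFactors = FALSE
)

# Desired order; unlist() turns the question's list into a character vector
id_order <- unlist(list("02401", "83329", "11000"))

# match() returns, for each ID in id_order, its row position in dataframe1
row_positions <- match(id_order, dataframe1$ID)

# Indexing by those positions reorders the rows
sorted_df <- dataframe1[row_positions, ]
print(sorted_df)
```

An ID that appears in `id_order` but not in the data frame yields NA in `row_positions`, which shows up as a row of NA values, so it can be worth checking `anyNA(row_positions)` first.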
Given a dataset with multiple unique elements in a column, I'd like to split those unique elements into new dataframes, but have the dataframe nested one level down. Essentially adding an extra level to the split() command.
For instance (using the built-in iris table as an example):
iris
mylist <- split(iris, iris$Species)
produces a list, mylist, that contains 3 sublists, setosa, versicolor, virginica.
mylist[["setosa"]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
But I would actually like to nest that data table in a sublist called results BUT keep the upper level list name as setosa. Such that:
mylist$setosa["results"]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
I could do this with manual manipulation, but I'd like this to run automatically. I've tried unsuccessfully with mapply:
mapply(function(names, df)
names <- split(df, df[["Species"]]),
unique(iris$Species), iris)
Any advice? Also happy to use a tidyr package if that makes things easier...
Consider by() (an object-oriented wrapper to tapply), which is very similar to split() but lets you run a function on each subset. Many useRs run split + lapply, unaware that both can be replaced with by:
mylist <- by(iris, iris$Species, function(sub) list(results=sub), simplify = FALSE)
head(mylist$setosa$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
head(mylist$versicolor$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 54 5.5 2.3 4.0 1.3 versicolor
# 55 6.5 2.8 4.6 1.5 versicolor
# 56 5.7 2.8 4.5 1.3 versicolor
head(mylist$virginica$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 101 6.3 3.3 6.0 2.5 virginica
# 102 5.8 2.7 5.1 1.9 virginica
# 103 7.1 3.0 5.9 2.1 virginica
# 104 6.3 2.9 5.6 1.8 virginica
# 105 6.5 3.0 5.8 2.2 virginica
# 106 7.6 3.0 6.6 2.1 virginica
Using setNames() inside lapply() will keep the names of the list you're iterating through:
iris
mylist <- split(iris, iris$Species)
mylist2 <- lapply(setNames(names(mylist), names(mylist)), function(x){
list(results = mylist[[x]])
})
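As a possibly simpler variant of the same idea, lapply() already preserves the names of a named list when it iterates over the list itself rather than over its names. The sketch below assumes that behaviour is all that is needed; `nested` is a name chosen here for illustration.

```r
# Split iris by species, then wrap each piece in a one-element list
# called "results"; lapply() keeps the names setosa/versicolor/virginica
nested <- lapply(split(iris, iris$Species), function(df) list(results = df))

names(nested)
head(nested$setosa$results)
```

Note that `nested$setosa$results` (or `nested$setosa[["results"]]`) returns the data frame itself, while `nested$setosa["results"]` returns a one-element list containing it.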
I understand that data.table allows you to do computations based on groups within a column. For example, with iris converted to a data.table via as.data.table(iris):
Reproducible example
iris[,.SD[which.min(Petal.Width)], by=Species]
generating
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1: setosa 4.9 3.1 1.5 0.1
2: versicolor 4.9 2.4 3.3 1.0
3: virginica 6.1 2.6 5.6 1.4
I want every row where the minimum is met, not just the first, which is easily achieved with a data.frame.
For example, this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
10 4.9 3.1 1.5 0.1 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
33 5.2 4.1 1.5 0.1 setosa
38 4.9 3.6 1.4 0.1 setosa
58 4.9 2.4 3.3 1.0 versicolor
61 5.0 2.0 3.5 1.0 versicolor
63 6.0 2.2 4.0 1.0 versicolor
68 5.8 2.7 4.1 1.0 versicolor
80 5.7 2.6 3.5 1.0 versicolor
82 5.5 2.4 3.7 1.0 versicolor
94 5.0 2.3 3.3 1.0 versicolor
135 6.1 2.6 5.6 1.4 virginica
What I don't want is just the first instance where the minimum is met.
The desired result is equivalent to doing something like this using a data.frame:
iris
iris <- as.data.frame(iris) #in case reader does not start new R session
f.min <- function(spec) {
spec.sub <- iris[iris$Species==spec,]
min.rows <- spec.sub[spec.sub$Petal.Width == min(spec.sub$Petal.Width),]
}
do.call(rbind, lapply(levels(iris$Species), f.min ))
There are some powerful features in data.table which are worth learning, so I would like to know the equivalent in data.table.
Try:
iris[,.SD[which.min(Petal.Width)], by=Species]
This will give you the minimum for each group, but it does not show ties.
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1: setosa 4.9 3.1 1.5 0.1
2: versicolor 4.9 2.4 3.3 1.0
3: virginica 6.1 2.6 5.6 1.4
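To keep the ties within data.table itself, one option is to compare each value with the group minimum instead of using which.min(), which only returns the first position. This is a sketch rather than the only way to do it; `iris_dt` and `ties` are names chosen here so the built-in iris data frame is not overwritten.

```r
library(data.table)

# Work on a data.table copy of the built-in data frame
iris_dt <- as.data.table(iris)

# Within each Species group, keep every row of .SD whose Petal.Width
# equals the group minimum, so tied rows are all returned
ties <- iris_dt[, .SD[Petal.Width == min(Petal.Width)], by = Species]

print(ties)
```

For iris this returns 13 rows, matching the data.frame result shown in the question.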
A dplyr solution showing the ties as well would be:
library(dplyr)  # also makes the %>% pipe available
iris %>%
group_by(Species) %>%
filter(rank(Petal.Width, ties.method= "min") == 1)
Source: local data table [13 x 5]
Groups: Species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.1 1.5 0.1 setosa
2 4.8 3.0 1.4 0.1 setosa
3 4.3 3.0 1.1 0.1 setosa
4 5.2 4.1 1.5 0.1 setosa
5 4.9 3.6 1.4 0.1 setosa
6 4.9 2.4 3.3 1.0 versicolor
7 5.0 2.0 3.5 1.0 versicolor
8 6.0 2.2 4.0 1.0 versicolor
9 5.8 2.7 4.1 1.0 versicolor
10 5.7 2.6 3.5 1.0 versicolor
11 5.5 2.4 3.7 1.0 versicolor
12 5.0 2.3 3.3 1.0 versicolor
13 6.1 2.6 5.6 1.4 virginica
The ties.method argument of rank() controls how tied values are ranked. With "min", every row tied for the smallest value gets rank 1, so all of them pass the filter.
Hope this helps.
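To see why ties.method = "min" keeps all tied rows, it may help to compare it with "first" on a small vector. The vector `widths` below is made up for illustration.

```r
# Two values are tied for the minimum
widths <- c(0.2, 0.1, 0.1, 0.3)

# "min": tied values share the lowest rank they would occupy
rank_min <- rank(widths, ties.method = "min")
print(rank_min)    # 3 1 1 4

# "first": ties are broken by position, so only one value gets rank 1
rank_first <- rank(widths, ties.method = "first")
print(rank_first)  # 3 1 2 4

# Filtering on rank == 1 therefore keeps both tied minima with "min"
print(widths[rank_min == 1])
```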
This question builds from the SO post found here
I am trying to extract a random sample of rows in a data frame using a nesting condition.
Using the following dummy dataset (modified from iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
5 5.2 3.7 1.3 0.2 virginica
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
8 4.7 3.2 1.3 0.2 virginica
9 4.0 3.1 1.5 0.2 versicolor
10 5.0 3.6 1.4 0.2 versicolor
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
The code below works fine to take a simple sample of 2 rows:
iris[sample(nrow(iris), 2), ]
However, what I would like to do is take a sample of 2 rows for each level of a specific variable. For example, create a random sample of 2 rows for each level of the variable 'Species', like this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
Thanks for your help!
Very easy with dplyr:
library(dplyr)
iris %>%
group_by(Species) %>%
sample_n(size = 2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.6 3.4 1.4 0.3 setosa
# 2 5.2 3.5 1.5 0.2 setosa
# 3 6.5 2.8 4.6 1.5 versicolor
# 4 5.7 2.8 4.5 1.3 versicolor
# 5 5.8 2.8 5.1 2.4 virginica
# 6 7.7 2.6 6.9 2.3 virginica
You can group by as many columns as you'd like:
CO2 %>% group_by(Type, Treatment) %>% sample_n(size = 2)
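If adding a package is not an option, the same per-group sampling can be sketched in base R by sampling row numbers within each group. The names `rows_by_species`, `picked` and `sampled` are chosen here for illustration, and set.seed() is only there to make the draw repeatable.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Row numbers of iris, split into one integer vector per species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Draw 2 row numbers from each group and flatten into one vector
picked <- unlist(lapply(rows_by_species, function(rows) sample(rows, 2)))

# Index the data frame with the sampled row numbers
sampled <- iris[picked, ]
print(sampled)
```

This relies on every group having at least 2 rows. sample() also treats a single number n as 1:n, so groups with exactly one row need separate handling.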
I want to add a column to an existing data frame which identifies if the element in that row contains a specific pattern.
I thought about using the transform() function to do it. Using the iris dataset,
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> tail(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
I would like to add a column which identifies whether the Species ends with the string sa. In regex I can use the expression .*(sa) to flag the right strings.
How can I write a function which populates the column with 1 if the Species ends with sa and 0 if it doesn't?
How about
iris$check <- as.numeric(grepl(".*(sa)", iris$Species))
grepl returns a logical vector (TRUE/FALSE) which can easily be converted to 1/0 by using as.numeric.
Also possible:
iris$check <- grepl(".*(sa)", iris$Species) + 0L
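One caveat: the pattern ".*(sa)" matches "sa" anywhere in the string, which happens to give the right answer for the three iris species but is not strictly an "ends with" test. A sketch of an anchored alternative follows; the extra value "salvia" is made up to show the difference, and `iris_checked` is a copy so the built-in data is left untouched.

```r
# "salvia" is a hypothetical value that contains "sa" but does not end with it
species <- c("setosa", "versicolor", "virginica", "salvia")

unanchored <- grepl(".*(sa)", species)  # TRUE FALSE FALSE TRUE
anchored   <- grepl("sa$", species)     # TRUE FALSE FALSE FALSE
print(rbind(unanchored, anchored))

# endsWith() from base R does the same test without a regex
print(endsWith(species, "sa"))

# Applied to iris: 1 if Species ends with "sa", 0 otherwise
iris_checked <- iris
iris_checked$check <- as.integer(grepl("sa$", iris_checked$Species))
print(table(iris_checked$Species, iris_checked$check))
```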