R range referencing with different dimensions not causing an error - r

In using range referencing I normally expect to see an error or at least a warning message when the operations in '[' ']' do not match the dimensions of the parent object, however I have just discovered that I am not seeing said warnings and errors. Is there a setting for this or a way to force an error? Example:
x = 1:5
y = 10:12
x[y>10]
y[x>2]
likewise this applies to data frames and other R objects:
dat = data.frame(x=runif(100),y=1:100)
dat[sample(c(TRUE,FALSE),23),c(TRUE,FALSE)]
The silent repetition and truncation of the references to match the dimensions of the parent object is unexpected, having used R for years, I've somehow never noticed this before.
I'm using R Console (64-bit) 3.0.1 for Windows (could be updated yes, but I hope this isn't the cause).
Edit: Fixed data.frame example as data.frame's don't allow more column references than columns. Thanks zero323.

You could modify the `[.data.frame` function to throw a warning when indexing with a logical vector that doesn't evenly divide the number of rows:
`[.data.frame` <- function(x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1) {
if (!missing(i) && is.logical(i) && nrow(x) %% length(i) != 0) {
warning("Indexing data frame with logical vector that doesn't evenly divide row count")
}
base::`[.data.frame`(x, i, j, drop)
}
Here's a demonstration with the 150-row iris dataset, passing logical indexing vectors of length 11 (should cause warning) and 15 (should not cause warning):
iris[c(rep(FALSE, 10), TRUE),]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 11 5.4 3.7 1.5 0.2 setosa
# 22 5.1 3.7 1.5 0.4 setosa
# 33 5.2 4.1 1.5 0.1 setosa
# 44 5.0 3.5 1.6 0.6 setosa
# 55 6.5 2.8 4.6 1.5 versicolor
# 66 6.7 3.1 4.4 1.4 versicolor
# 77 6.8 2.8 4.8 1.4 versicolor
# 88 6.3 2.3 4.4 1.3 versicolor
# 99 5.1 2.5 3.0 1.1 versicolor
# 110 7.2 3.6 6.1 2.5 virginica
# 121 6.9 3.2 5.7 2.3 virginica
# 132 7.9 3.8 6.4 2.0 virginica
# 143 5.8 2.7 5.1 1.9 virginica
# Warning message:
# In `[.data.frame`(iris, c(rep(FALSE, 10), TRUE), ) :
# Indexing data frame with logical vector that doesn't evenly divide number of rows
iris[c(rep(FALSE, 14), TRUE),]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 15 5.8 4.0 1.2 0.2 setosa
# 30 4.7 3.2 1.6 0.2 setosa
# 45 5.1 3.8 1.9 0.4 setosa
# 60 5.2 2.7 3.9 1.4 versicolor
# 75 6.4 2.9 4.3 1.3 versicolor
# 90 5.5 2.5 4.0 1.3 versicolor
# 105 6.5 3.0 5.8 2.2 virginica
# 120 6.0 2.2 5.0 1.5 virginica
# 135 6.1 2.6 5.6 1.4 virginica
# 150 5.9 3.0 5.1 1.8 virginica

Expanding on #josilber I've written the following for atomic vector and matrix subsetting in case anyone else wants it:
`[` <- function(x, i) {
if(!missin
g(i) && is.logical(i) && (length(x) %% length(i) != 0 || length(i) > length(x))) {
warning("Indexing atomic vector with logical vector that doesn't evenly divide row count")
}
base::`[`(x,i)
}
`[` <- function(x,i,j,...,drop=TRUE) {
if (!missing(i) && is.logical(i) && nrow(x) %% length(i) != 0) {
warning("Indexing matrix with logical vector that doesn't evenly divide row count")
}
if (!missing(j) && is.logical(j) && nrow(x) %% length(j) != 0) {
warning("Indexing matrix with logical vector that doesn't evenly divide column count")
}
base::`[`(x,i,j,...,drop)
}
Testing my original example afterwards with this modification now produces the warning and other operations behave as per normal:
> x =
1:5
> y = 10:12
> x[y>10]
[1] 2 3 5
Warning message:
In x[y > 10] :
Indexing atomic vector with logical vector that doesn't evenly divide row count
> y[x>2]
[1] 12 NA NA
Warning message:
In y[x > 2] :
Indexing atomic vector with logical vector that doesn't evenly divide row count
> x[x>2]
[1] 3 4 5
> x[1:2]
[1] 1 2

Related

A sensible function to sort dataframes

I am creating a function to sort data.frames (Why? Because of reasons). Some of the criteria:
Works on data.frames
Aimed at non-interactive use
Uses base R only
No dependency on any non-base package
The function looks like this now:
#' #title Sort a data.frame
#' #description Sort a data.frame based on one or more columns
#' #param x A data.frame class object
#' #param by A column in the data.frame. Defaults to NULL, which sorts by all columns.
#' #param decreasing A logical indicating the direction of sorting.
#' #return A data.frame.
#'
sortdf <- function(x,by=NULL,decreasing=FALSE) {
if(!is.data.frame(x)) stop("Input is not a data.frame.")
if(is.null(by)) {
ord <- do.call(order,x)
} else {
if(any(!by %in% colnames(x))) stop("One or more items in 'by' was not found.")
if(length(by) == 1) ord <- order(x[ , by])
if(length(by) > 1) ord <- do.call(order, x[ , by])
}
if(decreasing) ord <- rev(ord)
return(x[ord, , drop=FALSE])
}
Examples
sortdf(iris)
sortdf(iris,"Petal.Length")
sortdf(iris,"Petal.Length",decreasing=TRUE)
sortdf(iris,c("Petal.Length","Sepal.Length"))
sortdf(iris,"Petal.Length",decreasing=TRUE)
What works so far
Sort data.frame by one or more columns
Adjust overall direction of sort
But, I need one more feature: The ability to set sorting direction for each column separately by passing a vector of directions for each column specified in by. For example;
sortdf(iris,by=c("Sepal.Width","Petal.Width"),dir=c("up","down"))
Any ideas/suggestions on how to implement this?
Update
Benchmark of answers below:
library(microbenchmark)
library(ggplot2)
m <- microbenchmark::microbenchmark(
"base 1u"=iris[order(iris$Petal.Length),],
"Maël 1u"=sortdf(iris,"Petal.Length"),
"Mikko 1u"=sortdf1(iris,"Petal.Length"),
"arrange 1u"=dplyr::arrange(iris,Petal.Length),
"base 1d"=iris[order(iris$Petal.Length,decreasing=TRUE),],
"Maël 1d"=sortdf(iris,"Petal.Length",dir="down"),
"Mikko 1d"=sortdf1(iris,"Petal.Length",decreasing=T),
"arrange 1d"=dplyr::arrange(iris,-Petal.Length),
"base 2d"=iris[order(iris$Petal.Length,iris$Sepal.Length,decreasing=TRUE),],
"Maël 2d"=sortdf(iris,c("Petal.Length","Sepal.Length"),dir=c("down","down")),
"Mikko 2d"=sortdf1(iris,c("Petal.Length","Sepal.Length"),decreasing=T),
"arrange 2d"=dplyr::arrange(iris,-Petal.Length,-Sepal.Length),
"base 1u1d"=iris[order(iris$Petal.Length,rev(iris$Sepal.Length)),],
"Maël 1u1d"=sortdf(iris,c("Petal.Length","Sepal.Length"),dir=c("up","down")),
"Mikko 1u1d"=sortdf1(iris,c("Petal.Length","Sepal.Length"),decreasing=c(T,F)),
"arrange 1u1d"=dplyr::arrange(iris,Petal.Length,-Sepal.Length),
times=1000
)
autoplot(m)+theme_bw()
R 4.1.0
dplyr 1.0.7
Here's my attempt, using a function taken from this answer, and assuming up is ascending, and down is descending. dir is set to "up" by default.
sortdf <- function(x, by=NULL, dir=NULL) {
if(!is.data.frame(x)) stop("Input is not a data.frame.")
if(is.null(by) & is.null(dir)) {
dir <- rep("up", ncol(x))
} else if (is.null(dir)) {
dir <- rep("up", length(by))
}
sort_asc = by[which(dir == "up")]
sort_desc = by[which(dir == "down")]
if(is.null(by)) {
ord <- do.call(order,x)
} else {
if(any(!by %in% colnames(x))) stop("One or more items in 'by' was not found.")
if(length(by) == 1) ord <- order(x[ , by])
if(length(by) > 1) ord <- do.call(order, c(as.list(iris[sort_asc]), lapply(iris[sort_desc],
function(x) -xtfrm(x))))
}
if(length(dir) == 1 & all(dir == "down")) ord <- rev(ord)
x[ord, , drop=FALSE]
}
You can then have multiple different directions to sort:
sortdf(iris, by=c("Sepal.Width","Petal.Width"), dir=c("up","down")) |>
head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
61 5.0 2.0 3.5 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
120 6.0 2.2 5.0 1.5 virginica
63 6.0 2.2 4.0 1.0 versicolor
54 5.5 2.3 4.0 1.3 versicolor
88 6.3 2.3 4.4 1.3 versicolor
And other examples work as intended as well:
sortdf(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14 4.3 3.0 1.1 0.1 setosa
9 4.4 2.9 1.4 0.2 setosa
39 4.4 3.0 1.3 0.2 setosa
43 4.4 3.2 1.3 0.2 setosa
42 4.5 2.3 1.3 0.3 setosa
4 4.6 3.1 1.5 0.2 setosa
sortdf(iris, c("Petal.Length","Sepal.Length"))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
23 4.6 3.6 1.0 0.2 setosa
14 4.3 3.0 1.1 0.1 setosa
36 5.0 3.2 1.2 0.2 setosa
15 5.8 4.0 1.2 0.2 setosa
39 4.4 3.0 1.3 0.2 setosa
43 4.4 3.2 1.3 0.2 setosa
sortdf(iris, "Petal.Length", "down")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
119 7.7 2.6 6.9 2.3 virginica
123 7.7 2.8 6.7 2.0 virginica
118 7.7 3.8 6.7 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
132 7.9 3.8 6.4 2.0 virginica
108 7.3 2.9 6.3 1.8 virginica
Here’s another alternative that gets rid of all the branching logic by
ensuring you always find a proxy to sort by for each by column with
xtfrm(). For consistency with base, instead of using a “new” dir
argument, it might also be preferable to keep the decreasing argument, but
just allow it to be a vector that’s recycled to match the by length.
sortdf <- function(x, by = colnames(x), decreasing = FALSE, ...) {
if (!is.data.frame(x)) {
stop("Input is not a data.frame.")
}
if (!all(by %in% colnames(x))) {
stop("One or more items in 'by' was not found.")
}
# Recycle `decreasing` to ensure it matches `by`
decreasing <- rep_len(as.logical(decreasing), length(by))
# Find a sorting proxy for each `by` column, according to `decreasing`
pxy <- Map(function(x, decr) (-1)^decr * xtfrm(x), x[by], decreasing)
ord <- do.call(order, c(pxy, list(...)))
x[ord, , drop = FALSE]
}
Thinking about this a bit more, I might even simplify this further and:
Let Map() handle the recycling for by and decreasing.
Let [ handle throwing errors for incorrect indexing (and consequently also accept
numeric indices for columns rather than just names).
Not pass ... (following the YAGNI principle).
This could then come down to two one-liner functions:
sortdf <- function(x, by = colnames(x), decreasing = FALSE) {
x[do.call(order, Map(sortproxy, x[by], decreasing)), , drop = FALSE]
}
sortproxy <- function(x, decreasing = FALSE) {
as.integer((-1)^as.logical(decreasing)) * xtfrm(x)
}
Examples:
sortdf(iris) |> head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14 4.3 3.0 1.1 0.1 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 39 4.4 3.0 1.3 0.2 setosa
#> 43 4.4 3.2 1.3 0.2 setosa
#> 42 4.5 2.3 1.3 0.3 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
sortdf(iris, by = c("Sepal.Length", "Sepal.Width")) |> head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14 4.3 3.0 1.1 0.1 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 39 4.4 3.0 1.3 0.2 setosa
#> 43 4.4 3.2 1.3 0.2 setosa
#> 42 4.5 2.3 1.3 0.3 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
sortdf(iris, by = c("Sepal.Length", "Sepal.Width"), decreasing = TRUE) |> head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 132 7.9 3.8 6.4 2.0 virginica
#> 118 7.7 3.8 6.7 2.2 virginica
#> 136 7.7 3.0 6.1 2.3 virginica
#> 123 7.7 2.8 6.7 2.0 virginica
#> 119 7.7 2.6 6.9 2.3 virginica
#> 106 7.6 3.0 6.6 2.1 virginica
sortdf(iris, by = c("Sepal.Length", "Sepal.Width"), decreasing = c(TRUE, FALSE)) |> head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 132 7.9 3.8 6.4 2.0 virginica
#> 119 7.7 2.6 6.9 2.3 virginica
#> 123 7.7 2.8 6.7 2.0 virginica
#> 136 7.7 3.0 6.1 2.3 virginica
#> 118 7.7 3.8 6.7 2.2 virginica
#> 106 7.6 3.0 6.6 2.1 virginica

Function to filter data equal to or greater than a certain value

I have a dataframe containing thousands of rows and columns. The rows contain the names of genes and the columns the names of samples.
I only want to keep the rows that contain a value equal to or greater than 5 in more than 3 samples.
I tried this so far but I can't figure out how to set multiple conditions:
data.frame1 %>% filter_all(all_vars(.>= 5))
I hope I have stated this question correctly.
The way I do it in my gene expression filtering pre-differential gene expression pipeline is as follows:
data.frame1[rowSums(data.frame1 >= 5) > 3, ] -> filtered.counts
And if your first column is your gene identifier, with all the other columns being numeric, you can have the evaluation skip the first column as follows:
data.frame1[rowSums(data.frame1[-1] >= 5) > 3, ] -> filtered.counts
The way to do this in dplyr 1.0.0 is
iris %>%
filter(rowSums(across(where(is.numeric)) > 6) > 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.6 3.0 6.6 2.1 virginica
2 7.3 2.9 6.3 1.8 virginica
3 7.2 3.6 6.1 2.5 virginica
4 7.7 3.8 6.7 2.2 virginica
5 7.7 2.6 6.9 2.3 virginica
6 7.7 2.8 6.7 2.0 virginica
7 7.4 2.8 6.1 1.9 virginica
etc
For your case
data.frame1 %>%
filter(rowSums(across(where(is.numeric)) >= 5) > 3)

How can subset a dataframe by nrow and groups in r?

I have a dataframe that contains 240,000 obs. of 7 variables. In the dataframe there are 100 groups of 2400 records each, by Symbol. Example:
Complete DataFrame
I want to split this dataframe in new dataframe that contains every first observation and each 240 observation. The new dataframe will be 1000 obs of 7 variables:
New DataFrame
I tried df[seq(1, nrow(df), 240), ] but the new dataframe has each 240 observation and not distinguished by group (Symbol). I mean, I want a new dataframe that contains the rows 240, 480, 720, 960, and so on, for each symbol. In the original data frame every symbol has 2400 obs thus the new dataframe will have 10 obs by group.
Since we don't have your data, we can use an R database: iris. In this example we split iris by Species and select first n rows using head, in this example I set n=5 to extract first 5 rows by Species
> split_data <- lapply(split(iris, iris$Species), head, n=5)
> do.call(rbind, split_data)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
setosa.1 5.1 3.5 1.4 0.2 setosa
setosa.2 4.9 3.0 1.4 0.2 setosa
setosa.3 4.7 3.2 1.3 0.2 setosa
setosa.4 4.6 3.1 1.5 0.2 setosa
setosa.5 5.0 3.6 1.4 0.2 setosa
versicolor.51 7.0 3.2 4.7 1.4 versicolor
versicolor.52 6.4 3.2 4.5 1.5 versicolor
versicolor.53 6.9 3.1 4.9 1.5 versicolor
versicolor.54 5.5 2.3 4.0 1.3 versicolor
versicolor.55 6.5 2.8 4.6 1.5 versicolor
virginica.101 6.3 3.3 6.0 2.5 virginica
virginica.102 5.8 2.7 5.1 1.9 virginica
virginica.103 7.1 3.0 5.9 2.1 virginica
virginica.104 6.3 2.9 5.6 1.8 virginica
virginica.105 6.5 3.0 5.8 2.2 virginica
>
Update
Given your comment, try this using your data.frame:
ind <- seq(from=240, to=240000, by=240) # a row index of length = 1000
split_data <- lapply(split(yourData, yourData$Symbol), function(x) x[ind,] )
do.call(rbind, split_data)
Here is one way using base R.
just like in the answer by user #Jilber Urbina I will give an example use with the built-in dataset iris.
fun <- function(DF, n = 240, start = n){
DF[seq(start, NROW(DF), by = n), ]
}
res <- lapply(split(iris, iris$Species), fun, n = 24)
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.3 1.7 0.5 setosa
#2 4.6 3.2 1.4 0.2 setosa
#3 6.1 2.8 4.7 1.2 versicolor
#4 6.2 2.9 4.3 1.3 versicolor
#5 6.3 2.7 4.9 1.8 virginica
#6 6.5 3.0 5.2 2.0 virginica
This can be made into a function, I named selectStepN.
#
# x - dataset to subset
# f - a factor, split criterion
# n - the step
#
selectStepN <- function(x, f, n = 240, start = n){
fun <- function(DF, n){
DF[seq(start, NROW(DF), by = n), ]
}
res <- lapply(split(x, f), fun, n = n)
res <- do.call(rbind, res)
row.names(res) <- NULL
res
}
selectStepN(iris, iris$Species, 24)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.3 1.7 0.5 setosa
#2 4.6 3.2 1.4 0.2 setosa
#3 6.1 2.8 4.7 1.2 versicolor
#4 6.2 2.9 4.3 1.3 versicolor
#5 6.3 2.7 4.9 1.8 virginica
#6 6.5 3.0 5.2 2.0 virginica

Write a tidyeval function to rename a factor level in a dplyr

I'm trying to write a tidyeval function that takes a numeric column, replaces values above a certain limit with the value for limit, turns that column into a factor and then replaces the factor level equal to limit with a level called "limit+".
For example, I'm trying to replace any value above 3 in sepal.width with 3 and then rename that factor level to 3+.
As an example, here's how I'm trying to make it work with the iris dataset. The fct_recode() function is not renaming the factor level properly, though.
plot_hist <- function(x, col, limit) {
col_enq <- enquo(col)
x %>%
mutate(var = factor(ifelse(!!col_enq > limit, limit,!!col_enq)),
var = fct_recode(var, assign(paste(limit,"+", sep = ""), paste(limit))))
}
plot_hist(iris, Sepal.Width, 3)
To fix the last line, we can use the special symbol :=, since we need to set the value at the left hand side of the expression. For the RHS we need to coerce to character, since fct_recode expects a character vector on the right.
library(tidyverse)
plot_hist <- function(x, col, limit) {
col_enq <- enquo(col)
x %>%
mutate(var = factor(ifelse(!!col_enq > limit, limit, !!col_enq)),
var = fct_recode(var, !!paste0(limit, "+") := as.character(limit)))
}
plot_hist(iris, Sepal.Width, 3) %>%
sample_n(10)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species var
#> 40 5.1 3.4 1.5 0.2 setosa 3+
#> 98 6.2 2.9 4.3 1.3 versicolor 2.9
#> 7 4.6 3.4 1.4 0.3 setosa 3+
#> 99 5.1 2.5 3.0 1.1 versicolor 2.5
#> 76 6.6 3.0 4.4 1.4 versicolor 3+
#> 77 6.8 2.8 4.8 1.4 versicolor 2.8
#> 85 5.4 3.0 4.5 1.5 versicolor 3+
#> 119 7.7 2.6 6.9 2.3 virginica 2.6
#> 110 7.2 3.6 6.1 2.5 virginica 3+
#> 103 7.1 3.0 5.9 2.1 virginica 3+

Smart spreadsheet parsing (managing group sub-header and sum rows, etc)

Say you have a set of spreadsheets formatted like so:
Is there an established method/library to parse this into R without having to individually edit the source spreadsheets? The aim is to parse header rows and dispense with sum rows so the output is the raw data, like so:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.3 3.3 6.0 2.5 virginica
9 5.8 2.7 5.1 1.9 virginica
10 7.1 3.0 5.9 2.1 virginica
I can certainly hack a tailored solution to this, but wondering there is something a bit more developed/elegant than read.csv and a load of logic.
Here's a reproducible demo csv dataset (can't assume an equal number of lines per group..), although I'm hoping the solution can transpose to *.xlsx:
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17
There is a variety of ways to present spreadsheets so it would be hard to have a consistent methodology for all presentations. However, it is possible to transform the data once it is loaded in R. Here's an example with your data. It uses the function na.locf from package zoo.
x <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
library(zoo)
x <- x[x$X!="Mean",] #remove Mean line
x$Species <- x$X #create species column
x$Species[grepl("[0-9]",x$Species)] <- NA #put NA if Species contains numbers
x$Species <- na.locf(x$Species) #carry last observation if NA
x <- x[!rowSums(is.na(x))>0,] #remove lines with NA
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 1 5.1 3.5 1.4 0.2 Setosa
4 2 4.9 3.0 1.4 0.2 Setosa
5 3 4.7 3.2 1.3 0.2 Setosa
9 1 7.0 3.2 4.7 1.4 Versicolor
10 2 6.4 3.2 4.5 1.5 Versicolor
11 3 6.9 3.1 4.9 1.5 Versicolor
15 1 6.3 3.3 6.0 2.5 Virginica
16 2 5.8 2.7 5.1 1.9 Virginica
17 3 7.1 3.0 5.9 2.1 Virginica
I just recently did something similar. Here was my solution:
iris <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
First I used a which splits at an index.
split_at <- function(x, index) {
N <- NROW(x)
s <- cumsum(seq_len(N) %in% index)
unname(split(x, s))
}
Then you define that index using:
iris[,1] <- stringr::str_trim(iris[,1])
index <- which(iris[,1] %in% c("Virginica", "Versicolor", "Setosa"))
The rest is just using purrr::map_df to perform actions on each data.frame in the list that's returned. You can add some additional flexibility for removing unwanted rows if needed.
split_at(iris, index) %>%
.[2:length(.)] %>%
purrr::map_df(function(x) {
Species <- x[1,1]
x <- x[-c(1,NROW(x) - 1, NROW(x)),]
data.frame(x, Species = Species)
})

Resources