I can't find the answer anywhere.
I would like to calculate a new variable in a data frame, based on row means.
For example:
data <- data.frame(id=c(101,102,103), a=c(1,2,3), b=c(2,2,2), c=c(3,3,3))
I want to use mutate to make a variable d that is the mean of a, b and c. I would like to be able to do that by naming the columns, as in d = mean(a, b, c), and also by selecting a range of variables (dplyr-style), as in d = mean(a:c).
And of course
mutate(data, c=mean(a,b))
or
mutate(data, c=rowMeans(a,b))
doesn't work.
Can you give me a tip?
Regards
You're looking for
data %>%
rowwise() %>%
mutate(c=mean(c(a,b)))
# id a b c
# (dbl) (dbl) (dbl) (dbl)
# 1 101 1 2 1.5
# 2 102 2 2 2.0
# 3 103 3 2 2.5
or
library(purrr)
data %>%
rowwise() %>%
mutate(c=lift_vd(mean)(a,b))
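As a side note on the d = mean(a:c) part of the question: dplyr versions 1.0.0 and later (newer than this answer) provide c_across() for exactly this range-selection case. A minimal sketch:
library(dplyr)  # c_across() requires dplyr >= 1.0.0
data %>%
  rowwise() %>%
  mutate(d = mean(c_across(a:c))) %>%
  ungroup()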
dplyr is badly suited to operate on this kind of data because it assumes tidy data format and — for the problem in question — your data is untidy.
You can of course tidy it first:
tidy_data = tidyr::gather(data, name, value, -id)
Which looks like this:
id name value
1 101 a 1
2 102 a 2
3 103 a 3
4 101 b 2
5 102 b 2
6 103 b 2
…
And then:
tidy_data %>% group_by(id) %>% summarize(mean = mean(value))
     id     mean
  (dbl)    (dbl)
1   101 2.000000
2   102 2.333333
3   103 2.666667
Of course this discards the original data. You could use mutate instead of summarize to avoid this. Finally, you can then un-tidy your data again:
tidy_data %>%
group_by(id) %>%
mutate(mean = mean(value)) %>%
tidyr::spread(name, value)
id mean a b c
(dbl) (dbl) (dbl) (dbl) (dbl)
1 101 2.000000 1 2 3
2 102 2.333333 2 2 3
3 103 2.666667 3 2 3
Alternatively, you could summarise and then merge the result with the original table:
tidy_data %>%
group_by(id) %>%
summarize(mean = mean(value)) %>%
inner_join(data, by = 'id')
The result is the same in either case. I conceptually prefer the second variant.
I think the answer suggesting the use of data.frame or slicing on . is the best, but it can be made simpler and more dplyr-ish, like so:
data %>% mutate(c = rowMeans(select(., a,b)))
Or if you want to avoid ., with the penalty of having two inputs to your pipeline:
data %>% mutate(c = rowMeans(select(data, a,b)))
And here are yet another couple of ways, useful if you have the numeric positions or a character vector of the names of the columns to be summarised:
data %>% mutate(d = rowMeans(.[, 2:4]))
or
data %>% mutate(d = rowMeans(.[, c("a","b","c")]))
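If your dplyr is recent enough (>= 1.0.0, newer than this answer), across() gives a tidyselect route to the same thing, including the a:c range asked about in the question. A minimal sketch:
data %>% mutate(d = rowMeans(across(a:c)))  # dplyr >= 1.0.0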
I think this is the dplyr-ish way. First, I'd create a function:
my_rowmeans = function(...) Reduce(`+`, list(...))/length(list(...))
Then, it can be used inside mutate:
data %>% mutate(rms = my_rowmeans(a, b))
# id a b c rms
# 1 101 1 2 3 1.5
# 2 102 2 2 3 2.0
# 3 103 3 2 3 2.5
# or
data %>% mutate(rms = my_rowmeans(a, b, c))
# id a b c rms
# 1 101 1 2 3 2.000000
# 2 102 2 2 3 2.333333
# 3 103 3 2 3 2.666667
To deal with the possibility of NAs, the function must be uglified:
my_rowmeans = function(..., na.rm=TRUE){
  # if na.rm, replace NAs with a zero of the matching class so they drop out of the sum
  x =
    if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
    else list(...)
  # per-row count of non-NA values, used as the denominator
  d = Reduce(function(x, y) x + !is.na(y), list(...), init = 0)
  Reduce(`+`, x)/d
}
# alternately...
my_rowmeans2 = function(..., na.rm=TRUE) rowMeans(cbind(...), na.rm=na.rm)
# new example
data$b[2] <- NA
data %>% mutate(rms = my_rowmeans(a,b,na.rm=FALSE))
id a b c rms
1 101 1 2 3 1.5
2 102 2 NA 3 NA
3 103 3 2 3 2.5
data %>% mutate(rms = my_rowmeans(a,b))
id a b c rms
1 101 1 2 3 1.5
2 102 2 NA 3 2.0
3 103 3 2 3 2.5
The downside to my_rowmeans2 is that it coerces its arguments to a matrix. I'm not certain that this will always be slower than the Reduce approach, though.
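If you want to check this on your own data, here is a quick sketch, assuming the bench package is installed (microbenchmark would work the same way):
library(bench)
# large dummy vectors for timing purposes
x1 <- rnorm(1e6); x2 <- rnorm(1e6); x3 <- rnorm(1e6)
bench::mark(
  reduce = my_rowmeans(x1, x2, x3),
  matrix = my_rowmeans2(x1, x2, x3)
)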
Another simple possibility, with little code, is:
data %>%
mutate(c= rowMeans(data.frame(a,b)))
# id a b c
# 1 101 1 2 1.5
# 2 102 2 2 2.0
# 3 103 3 2 2.5
As rowMeans needs something like a matrix or a data.frame, you can use data.frame(var1, var2, ...) instead of c(var1, var2, ...). If you have NAs in your data, you'll need to tell R what to do with them, for example remove them: rowMeans(data.frame(a,b), na.rm=TRUE)
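For instance, a quick sketch with a missing value introduced into a copy of the data:
data_na <- data        # work on a copy so data stays intact
data_na$b[2] <- NA
data_na %>% mutate(c = rowMeans(data.frame(a, b), na.rm = TRUE))
#    id a  b   c
# 1 101 1  2 1.5
# 2 102 2 NA 2.0
# 3 103 3  2 2.5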
If you'd like to use a pivot_longer()-style solution:
data %>%
  pivot_longer(cols = -id) %>%
  group_by(id) %>%
  mutate(mean = mean(value)) %>%
  pivot_wider(names_from = name, values_from = value)
Note that this requires the tidyr package.
This is my preference because I only need to type the name of my ID column and don't otherwise have to worry about column indices or names. It's good for a quick copy-and-point-this-at-different-data solution, though the same can be said of other answers here. It's also good for cases where you might have more than one column of categorical information and haven't created a single unique identifier column.
For what it's worth, I found that this solution is very easily modified to ignore NA values with simple addition of na.rm=TRUE in the mean calculation.
For example:
data <- data.frame(id=c(101,102,103), a=c(NA,2,3), b=c(2,2,2), c=c(3,3,3))
data %>%
  pivot_longer(cols = -id) %>%
  group_by(id) %>%
  mutate(mean = mean(value, na.rm = TRUE)) %>%
  pivot_wider(names_from = name, values_from = value)
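For reference, with the NA in column a, this should produce something like:
# A tibble: 3 x 5
# Groups:   id [3]
#      id  mean     a     b     c
#   <dbl> <dbl> <dbl> <dbl> <dbl>
# 1   101  2.5     NA     2     3
# 2   102  2.33     2     2     3
# 3   103  2.67     3     2     3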
You can use a wrapper function around rowMeans() to make it easier to work with. The one below lets you specify na.rm, and you can use tidyselect to choose your columns if you want.
# This is the wrapper function
means <- function(..., na.rm = FALSE) {
rowMeans(data.frame(...), na.rm = na.rm)
}
library(dplyr)
# Example data
iris2 <- iris %>%
head() %>%
transmute(Sepal.Length = replace(Sepal.Length,
sample(c(TRUE, FALSE), nrow(.),
replace = TRUE),
NA),
Sepal.Width,
Petal.Length,
Petal.Width) %>%
print()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 NA 3.5 1.4 0.2
#> 2 NA 3.0 1.4 0.2
#> 3 NA 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 NA 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
# Basic usage
iris2 %>%
mutate(mean_sepal = means(Sepal.Length, Sepal.Width))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 NA
#> 2 NA 3.0 1.4 0.2 NA
#> 3 NA 3.2 1.3 0.2 NA
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 NA
#> 6 5.4 3.9 1.7 0.4 4.65
# If you want to exclude NAs
iris2 %>%
mutate(mean_sepal = means(Sepal.Length, Sepal.Width, na.rm = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 3.50
#> 2 NA 3.0 1.4 0.2 3.00
#> 3 NA 3.2 1.3 0.2 3.20
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 3.60
#> 6 5.4 3.9 1.7 0.4 4.65
# You can also use select() and choose columns using tidyselect
iris2 %>%
mutate(mean_sepal = means(select(., contains("Sepal")), na.rm = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 3.50
#> 2 NA 3.0 1.4 0.2 3.00
#> 3 NA 3.2 1.3 0.2 3.20
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 3.60
#> 6 5.4 3.9 1.7 0.4 4.65
Created on 2022-01-13 by the reprex package (v2.0.1)
I have a data set that contains several columns beginning with "sum of" (e.g., sum of Whites). I wonder how I can rename those columns by removing the "sum of" (e.g., sum of Whites --> Whites). Note that some of the columns in my data frame have single-word names (e.g., Blacks), so the renaming is only needed for a few of the columns!
In base R you can use gsub with names:
df <- data.frame(col1 = 1:5,
sum_of1a = 1:5,
sum_of2a = 1:5,
another_column = 1:5)
names(df) <- gsub("sum_of", "", names(df))
Or with dplyr:
df <- dplyr::rename_with(df, ~gsub("sum_of", "", .x))
Output (for both approaches):
# col1 1a 2a another_column
# 1 1 1 1 1
# 2 2 2 2 2
# 3 3 3 3 3
# 4 4 4 4 4
# 5 5 5 5 5
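Note that the column names in the question contain spaces rather than underscores; the same approach works once the pattern is adjusted. A small sketch with hypothetical columns matching the question:
df2 <- data.frame(`sum of Whites` = 1:5, Blacks = 6:10, check.names = FALSE)
names(df2) <- gsub("sum of ", "", names(df2))  # note the trailing space in the pattern
names(df2)
# [1] "Whites" "Blacks"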
You can use rename_with as follows:
library(dplyr)
library(stringr)
dat %>% rename_with(~str_remove(., 'sum of '), starts_with('sum of'))
# Whites Browns Blacks
#1 1 5 6
#2 2 4 7
#3 3 3 8
#4 4 2 9
#5 5 1 10
data
dat <- data.frame(`sum of Whites` = 1:5, `sum of Browns` = 5:1,
`Blacks` = 6:10, check.names = FALSE)
Please check the code below, where I used the iris dataframe, renamed the column Species to sum of Whites, and then changed that name back to Whites.
code
library(dplyr)
library(stringr)
df <- iris %>% head %>% rename(`sum of Whites` = Species)
# get the names from the df and then change the name
nam <- trimws(str_replace(names(df), 'sum of', ''))
# apply the names to the df
names(df) <- nam
output
Sepal.Length Sepal.Width Petal.Length Petal.Width Whites
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I have this list of dataframes, as follows:
library(carData)
library(datasets)
l = list(Salaries,iris)
I only want to select the numeric columns in this list of datasets. I already tried lapply with the function select_if(is.numeric), but it did not work for me.
We can use select with where in newer versions of dplyr: loop over the list with map and select the numeric columns of each data.frame.
library(purrr)
library(dplyr)
map(l, ~ .x %>%
select(where(is.numeric)))
Or using base R, where Filter() keeps only those columns for which is.numeric() returns TRUE:
lapply(l, Filter, f = is.numeric)
A base R option using lapply twice, like this:
library(carData)
library(datasets)
l = list(Salaries,iris)
lapply(l, \(x) x[, unlist(lapply(x, is.numeric), use.names = FALSE)])
#> [[1]]
#> yrs.since.phd yrs.service salary
#> 1 19 18 139750
#> 2 20 16 173200
#> 3 4 3 79750
#> 4 45 39 115000
#> 5 40 41 141500
#>
#> [[2]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
Created on 2022-09-25 with reprex v2.0.2
I'd like to get the cumulative sum of the corresponding records in the smaller column for each name under Species_a and Species_b, as two new columns, in the same row, without including the value for that row itself. The smaller column lists which species column has the smaller width.
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
1 versicolor virginica 2.5 3.0 2022-05-05 a
2 versicolor virginica 2.6 2.8 2022-04-04 a
3 versicolor setosa 2.2 4.4 2021-03-03 a
4 setosa virginica 4.2 2.5 2021-02-02 b
5 virginica setosa 3.0 3.4 2020-01-01 a
Ideally the format of the data would be in the same format as it is now, and the summation would be based off of the smaller, Date, Species_a, and Species_b columns alone. I tried to create a count column but I get stuck on properly accumulating based on Date being less than the current value for that column.
My desired output would be as follows:
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller smaller_sum_a smaller_sum_b
1 versicolor virginica 2.5 3.0 2022-05-05 a 2 2
2 versicolor virginica 2.6 2.8 2022-04-04 a 1 2
3 versicolor setosa 2.2 4.4 2021-03-03 a 0 0
4 setosa virginica 4.2 2.5 2021-02-02 b 0 1
5 virginica setosa 3.0 3.4 2020-01-01 a 0 0
Code:
library(tidyverse)
set.seed(12)
data_a <- iris[sample(1:nrow(iris)), ] %>%
head()
colnames(data_a) <- paste0(colnames(data_a), "_a")
data_b <- iris[sample(1:nrow(iris)), ] %>%
tail()
colnames(data_b) <- paste0(colnames(data_b), "_b")
data <- bind_cols(data_a, data_b) %>%
filter(Species_a != Species_b) %>%
select(Species_a,
Species_b,
Sepal.Width_a,
Sepal.Width_b) %>%
mutate(Date = c('2022-05-05', '2022-04-04', '2021-03-03', '2021-02-02', '2020-01-01'),
smaller = ifelse(Sepal.Width_a > Sepal.Width_b, 'b',
ifelse(Sepal.Width_a < Sepal.Width_b, 'a', NA)))
I don't know if this is a solution, but it might be a start.
How exactly are the new columns calculated? It looks like smaller_sum_a is the number of consecutive rows where species a has the smaller value, minus one, but I don't think the same works for smaller_sum_b. Or is it just the cumulative number of days where each species has the smaller value, minus one, but zero if the species isn't smaller in that row? (Again, this doesn't check out for smaller_sum_b though...)
As for determining if Date is less than the current value, firstly you'll want to tell R that your Date column is actually a date, rather than just a character.
The easiest way to see what format it is in is to make your data (not a good name for your data, by the way; preferably call it something that R or the computer wouldn't use, like my_data) a tibble rather than a data.frame. Tibbles tell you what format each column is in, which is handy.
data %>%
tibble
# # A tibble: 5 x 6
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
# <fct> <fct> <dbl> <dbl> <chr> <chr>
# 1 versicolor virginica 2.5 3 2022-05-05 a
# 2 versicolor virginica 2.6 2.8 2022-04-04 a
# 3 versicolor setosa 2.2 4.4 2021-03-03 a
# 4 setosa virginica 4.2 2.5 2021-02-02 b
# 5 virginica setosa 3 3.4 2020-01-01 a
The bits inside the < > under the column names tell you the formats, <fct> is factor, <dbl> is numeric (see here for explanation) and <chr> is character.
So we want to make Date into a date format, which we can do with the ymd() (year-month-day) function from lubridate. Also, I rearranged the data so the rows are in chronological order (earliest at the top), because that's how things are normally arranged and it makes more sense to me, especially if you're interested in cumulative sums.
data %>%
tibble %>%
mutate(
Date = ymd(Date)
) %>%
arrange(Date) %>%
{. ->> my_data}
my_data
# # A tibble: 5 x 6
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
# <fct> <fct> <dbl> <dbl> <date> <chr>
# 1 virginica setosa 3 3.4 2020-01-01 a
# 2 setosa virginica 4.2 2.5 2021-02-02 b
# 3 versicolor setosa 2.2 4.4 2021-03-03 a
# 4 versicolor virginica 2.6 2.8 2022-04-04 a
# 5 versicolor virginica 2.5 3 2022-05-05 a
We can see that R now recognises that the Date column is a date, and is now in the R-recognised <date> format.
Now this is where I'm not 100% sure on exactly how you want to calculate your new columns, but for example you can use ifelse() to determine if species a is smaller, and then calculate the cumulative sum of the days where it was smaller.
my_data %>%
mutate(
s_a = ifelse(smaller == 'a', 1, 0),
smaller_sum_a = cumsum(s_a),
)
# # A tibble: 5 x 8
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller s_a smaller_sum_a
# <fct> <fct> <dbl> <dbl> <date> <chr> <dbl> <dbl>
# 1 virginica setosa 3 3.4 2020-01-01 a 1 1
# 2 setosa virginica 4.2 2.5 2021-02-02 b 0 1
# 3 versicolor setosa 2.2 4.4 2021-03-03 a 1 2
# 4 versicolor virginica 2.6 2.8 2022-04-04 a 1 3
# 5 versicolor virginica 2.5 3 2022-05-05 a 1 4
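One more note: since your desired output excludes the current row, you can shift the cumulative sum down one row with lag(). A sketch of the same idea:
my_data %>%
  mutate(
    s_a = ifelse(smaller == 'a', 1, 0),
    smaller_sum_a = lag(cumsum(s_a), default = 0)  # counts earlier rows only
  )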
As long as either a) the Date column is in an R-recognised <date> format, or b) it is arranged chronologically, you can use the less than or greater than operators < & > to determine if dates are before/after a given row.
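For instance, once Date is a proper date, comparisons work as you'd expect:
library(lubridate)
ymd("2021-02-02") < ymd("2021-03-03")
# [1] TRUE
my_data$Date < ymd("2022-01-01")  # which rows are before 2022?
# [1]  TRUE  TRUE  TRUE FALSE FALSE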
This is a good resource for understanding how R treats dates and times, and is well worth a read https://r4ds.had.co.nz/dates-and-times.html
Here is my current solution. I'd like to avoid plyr if I can, since I've heard it masks some of dplyr's functions when both are loaded. I feel like there is definitely a more efficient and modern way of solving this, but I can't seem to find it.
library(plyr)
library(lubridate)
# creating counts for smaller sums for a
data$Date <- lubridate::parse_date_time(x = data$Date, # standardizing date (outside of the reproducible example there are two date types)
orders = c("%m/%d/%Y", "%Y-%m-%d"))
A_rn <- mutate(filter(select(data,
Species_a,
Date,
smaller),
smaller == 'a'),
smaller_ct_a = 1)
# creating counts for smaller sums for b
BtoA_rn <- mutate(filter(select(data,
Species_b,
Date,
smaller),
smaller == 'b'), # calling Species_b Species_a for easier joining
Species_a = Species_b,
smaller_ct_a = 1) %>%
select(Species_a, Date, smaller, smaller_ct_a)
# cumsum for both a and b
A <- ddply(bind_rows(A_rn, BtoA_rn) %>%
arrange(Date),
.(Species_a), transform,
smaller_sum_a = lag(cumsum(replace_na(smaller_ct_a, 0)))) %>%
select(-smaller_ct_a)
# naming adjustment
B <- A %>% filter(smaller == "b") %>%
select(-smaller)
names(B) <- gsub(x = names(B), pattern = "_a", replacement = "_b")
A <- A %>% filter(smaller == "a") %>%
select(-smaller)
data <- left_join(data, A, by = c("Species_a", "Date")) %>%
left_join(B, by = c("Species_b", "Date"))
data[is.na(data)] <- 0
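For what it's worth, here is a plyr-free sketch of the same counting logic, assuming Date has already been parsed and the dates are unique: record which species was smaller in each row, then, for each row and each side, count how many strictly earlier rows had that species as the smaller one.
library(dplyr)
my_data <- data %>% arrange(Date)  # earliest first
# which species was the smaller one in each row
smaller_sp <- ifelse(my_data$smaller == "a",
                     as.character(my_data$Species_a),
                     as.character(my_data$Species_b))
# for row i, count earlier rows whose smaller species matches sp
count_before <- function(i, sp) sum(smaller_sp[seq_len(i - 1)] == sp)
my_data %>%
  mutate(
    smaller_sum_a = sapply(seq_len(n()), \(i) count_before(i, as.character(Species_a[i]))),
    smaller_sum_b = sapply(seq_len(n()), \(i) count_before(i, as.character(Species_b[i])))
  )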
I am working with two datasets that I would like to join based not on exact matches between them, but rather on approximate matches. My question is similar to this OP.
Here are examples of what my two dataframes look like.
df1 is this one:
x
4.8
12
4
3.5
12.5
18
df2 is this one:
x y
4.8 6.6
12 1
4.5 1
3.5 0.5
13 1.8
15 2
I am currently using inner_join(df1, df2, by = c("x")) to join the two together.
This gives me:
x y
4.8 6.6
12 1
3.5 0.5
However, what I really want to do is join the two dfs based on these conditions:
any exact matches are joined first (exactly like how inner_join() currently works)
BUT, if there are no exact matches, then join to any match ± 0.5
The kind of output I am trying to get would look like this:
x y
4.8 6.6
12 1
4 1 #the y value is from x=4.5 in df2
4 0.5 #the y value is from x=3.5 in df2
3.5 0.5
12.5 1 #the y value is from x=12 in df2
12.5 1.8 #the y value is from x=13 in df2
I typically work in dplyr, so a dplyr solution would be appreciated. But, I am also open to other suggestions because I don't know if dplyr will be flexible enough to do a "fuzzy" join.
(I am aware of the fuzzyjoin package, but it doesn't seem to get at exactly what I am trying to do here)
A possible solution, with no join: cross every value of df1 with all of df2, keep only the pairs within ±0.5, and then, per value of df1, keep just the exact match when one exists (that is what the aux flag tracks):
library(tidyverse)
df1 %>%
rename(x1 = x) %>%
crossing(df2) %>%
mutate(diff = abs(x1-x)) %>%
filter(diff <= 0.5) %>%
group_by(x1) %>%
mutate(aux = any(diff == 0)) %>%
filter(aux*(diff == 0) | !aux) %>%
select(-diff, -aux) %>%
ungroup
#> # A tibble: 7 × 3
#> x1 x y
#> <dbl> <dbl> <dbl>
#> 1 3.5 3.5 0.5
#> 2 4 3.5 0.5
#> 3 4 4.5 1
#> 4 4.8 4.8 6.6
#> 5 12 12 1
#> 6 12.5 12 1
#> 7 12.5 13 1.8
You could use {powerjoin}
library(powerjoin)
power_left_join(
df1, df2,
by = ~ .x$x == .y$x | ! .x$x %in% .y$x & .x$x <= .y$x +.5 & .x$x >= .y$x -.5,
keep = "left")
#> x y
#> 1 4.8 6.6
#> 2 12.0 1.0
#> 3 4.0 1.0
#> 4 4.0 0.5
#> 5 3.5 0.5
#> 6 12.5 1.0
#> 7 12.5 1.8
Created on 2022-04-14 by the reprex package (v2.0.1)
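Another option, if you are on dplyr 1.1.0 or later: the built-in non-equi joins can express this without extra packages. A sketch that precomputes the ±0.5 bounds and then keeps only the exact match when one exists:
library(dplyr)  # join_by() needs dplyr >= 1.1.0
df2b <- df2 %>% mutate(lo = x - 0.5, hi = x + 0.5)
df1 %>%
  inner_join(df2b, by = join_by(between(x, lo, hi)), suffix = c("", "_df2")) %>%
  group_by(x) %>%
  filter(if (any(x == x_df2)) x == x_df2 else TRUE) %>%  # prefer exact matches
  ungroup() %>%
  select(x, y)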
I've got a dataframe that is split into a list by its id, as shown below. Now I'd like to create a list of dataframes of all possible combinations, always using only one row from each dataframe in the list. I've already experimented with expand.grid and combn in an lapply call using names(data) to index the dataframes, but I can't figure out how to do it.
Using the iris dataset here's a short example:
library(dplyr)
# data
iris %>%
select(Sepal.Length,Sepal.Width,Species) %>%
mutate_if(is.numeric,round,0) %>%
distinct() %>%
split(.,.$Species)
# This is what you get
$`setosa`
Sepal.Length Sepal.Width Species
1 5 4 setosa
2 5 3 setosa
3 4 3 setosa
4 6 4 setosa
5 4 2 setosa
$versicolor
Sepal.Length Sepal.Width Species
6 7 3 versicolor
7 6 3 versicolor
8 6 2 versicolor
9 5 2 versicolor
10 5 3 versicolor
$virginica
Sepal.Length Sepal.Width Species
11 6 3 virginica
12 7 3 virginica
13 8 3 virginica
14 5 2 virginica
15 7 2 virginica
16 7 4 virginica
17 6 2 virginica
18 8 4 virginica
And now I want to get all possible dataframes, always using one line of each dataframe in the list above like:
[[1]]
Sepal.Length Sepal.Width Species
1 5 4 setosa
6 7 3 versicolor
11 6 3 virginica
[[2]] ...
Thank you for your suggestions!
There is probably a better way to do this, but here is one using base R that should work for any number of groups:
#Find all possible combinations of row indices for each list
row_combns <- do.call(expand.grid, lapply(lst, function(x) seq(nrow(x))))
#Make one big dataframe combining all possible combination subsetting
#it from corresponding list element
df1 <- do.call(rbind, lapply(seq_along(lst),
function(x) lst[[x]][row_combns[[x]], ]))
#Create a grouping index
df1$index <- seq_len(nrow(row_combns))
#Use the index to split
split(df1, df1$index)
#.....
#$`199`
# Sepal.Length Sepal.Width Species index
#4.39 6 4 setosa 199
#10.38 5 3 versicolor 199
#18.23 8 4 virginica 199
#$`200`
# Sepal.Length Sepal.Width Species index
#5.39 4 2 setosa 200
#10.39 5 3 versicolor 200
#18.24 8 4 virginica 200
where lst is
lst <- iris %>%
select(Sepal.Length,Sepal.Width,Species) %>%
mutate_if(is.numeric,round,0) %>%
distinct() %>%
split(., .$Species)
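A purrr variation on the same index-combination idea (a sketch, assuming lst as defined above) builds the list directly instead of splitting one big dataframe:
library(purrr)
# all combinations of one row index per list element
idx <- do.call(expand.grid, lapply(lst, function(x) seq_len(nrow(x))))
combos <- pmap(idx, function(...) {
  i <- c(...)  # one row index per species
  do.call(rbind, Map(function(d, j) d[j, ], lst, i))
})
length(combos)  # 5 * 5 * 8 = 200 combinations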
Here's a tidyverse approach:
library(tidyverse)
# update data
iris %>%
select(Sepal.Length,Sepal.Width,Species) %>%
mutate_if(is.numeric,round,0) %>%
distinct() %>%
mutate(Species = as.character(Species)) -> iris_upd
iris_upd %>%
split(.,.$Species) %>% # split by species column
reduce(crossing) %>% # create all row combinations
group_nest(id = row_number()) %>% # group by row id
mutate(d = map(data, ~{d = data.frame(t(matrix(., nrow=3, ncol=ncol(iris_upd)))) # reshape data
names(d) = names(iris_upd) # set column names
d})) -> iris_comb
Now the dataset iris_comb has a column d that contains all combinations that you want:
iris_comb$d
# .....
#
# [[199]]
# Sepal.Length Sepal.Width Species
# 1 4 2 setosa
# 2 5 3 versicolor
# 3 6 2 virginica
#
# [[200]]
# Sepal.Length Sepal.Width Species
# 1 4 2 setosa
# 2 5 3 versicolor
# 3 8 4 virginica