R: Making a More "Efficient" Distance Function

I am working with the R programming language.
I generated the following random data set that contains x and y points:
set.seed(123)
x_cor = rnorm(10,100,100)
y_cor = rnorm(10,100,100)
my_data = data.frame(x_cor,y_cor)
x_cor y_cor
1 43.95244 222.40818
2 76.98225 135.98138
3 255.87083 140.07715
4 107.05084 111.06827
5 112.92877 44.41589
6 271.50650 278.69131
7 146.09162 149.78505
8 -26.50612 -96.66172
9 31.31471 170.13559
10 55.43380 52.72086
I am trying to write a "greedy search" algorithm that shows which point is located the "shortest distance" from some starting point.
For example, suppose we start at -26.50612, -96.66172
distance <- function(x1, x2, y1, y2) {
  dist <- sqrt((x1 - x2)^2 + (y1 - y2)^2)
  return(dist)
}
Then I calculated the distance between -26.50612, -96.66172 and each point:
results <- list()
for (i in 1:10) {
  distance_i <- distance(-26.50612, my_data[i, 1], -96.66172, my_data[i, 2])
  index = i
  my_data_i = data.frame(distance_i, index)
  results[[i]] <- my_data_i
}
results_df <- data.frame(do.call(rbind.data.frame, results))
However, I don't think this is working because the distance between the starting point -26.50612, -96.66172 and itself is not 0 (see 8th row):
distance_i index
1 264.6443 1
2 238.7042 2
3 191.3048 3
4 185.0577 4
5 151.7506 5
6 306.4785 6
7 331.2483 7
8 223.3056 8
9 213.3817 9
10 331.6455 10
My Question:
Can someone please show me how to write a function that correctly finds the nearest point to an initial point,
(Step 1) then removes that nearest point and the initial point from "my_data",
(Step 2) then re-calculates the nearest point in "my_data" (i.e. with those points removed), starting from the nearest point identified in Step 1,
and in the end shows the path that was taken (e.g. 5, 7, 1, 9, 3, etc.)?
Thanks!

This could be helpful; I think you can solve the remaining steps yourself:
start <- c(x= -26.50612, y= -96.66172)
library(dplyr)
my_data <- data.frame(x_cor, y_cor) %>%
  rowwise() %>%
  mutate(dist = distance(start["x"], x_cor, start["y"], y_cor))
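For example, a minimal sketch of the next step under that setup (my own addition, assuming the pipeline above has run): take the row with the smallest nonzero distance as the nearest point.
nearest <- my_data %>%
  ungroup() %>%          # rowwise() leaves the data grouped by row
  filter(dist > 0) %>%   # the start is row 8 of my_data, so its own distance is 0
  slice_min(dist, n = 1)
nearest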

The solution is implemented as a recursive function distmin, which finds the point in a dataframe Y closest to an input point x, and then calls itself with that closest point and the dataframe without it as arguments.
EDIT: I reimplemented distmin to use dataframes.
library(dplyr)
my_data = data.frame(x_cor, y_cor) |>
  mutate(idx = row_number())
distmin <- function(x, Y) {
  if (nrow(Y) == 0) {
    NULL
  } else {
    dst <- sqrt((x$x_cor - Y$x_cor)^2 + (x$y_cor - Y$y_cor)^2)  # distance from x to every row of Y
    m <- which.min(dst)                                         # position of the nearest point
    res <- data.frame(x, dist = dst[m], nearest = Y[m, "idx"])
    rbind(res, distmin(Y[m, ], Y[-m, ]))                        # recurse from the nearest point
  }
}
N <- 5
distmin(my_data[N,], my_data[-N,])
##> x_cor y_cor idx dist nearest
##> 5 112.92877 44.41589 5 58.09169 10
##> 10 55.43380 52.72086 10 77.90211 4
##> 4 107.05084 111.06827 4 39.04847 2
##> 2 76.98225 135.98138 2 57.02661 9
##> 9 31.31471 170.13559 9 53.77858 1
##> 1 43.95244 222.40818 1 125.32571 7
##> 7 146.09162 149.78505 7 110.20762 3
##> 3 255.87083 140.07715 3 139.49323 6
##> 6 271.50650 278.69131 6 479.27176 8
The following shows the order in which the points are visited:
library(ggplot2)
distmin(my_data[N,], my_data[-N,]) |>
  mutate(ord = row_number()) |>
  ggplot(aes(x = x_cor, y = y_cor)) +
  geom_text(aes(label = ord))
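If recursion feels awkward, the same greedy walk can be written iteratively; a sketch (my own variant of distmin above, returning just the visiting order, i.e. the nearest column):
distmin_iter <- function(x, Y) {
  path <- integer(0)
  while (nrow(Y) > 0) {
    dst <- sqrt((x$x_cor - Y$x_cor)^2 + (x$y_cor - Y$y_cor)^2)
    m <- which.min(dst)        # nearest remaining point
    path <- c(path, Y$idx[m])  # record its original row index
    x <- Y[m, ]                # move there
    Y <- Y[-m, ]               # and drop it from the pool
  }
  path
}
distmin_iter(my_data[N, ], my_data[-N, ])
##> 10 4 2 9 1 7 3 6 8 (matches the nearest column above)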


How can I join elements (columns from dataframes) from two lists by row names using R?

I need help please. I have two lists: the first contains ndvi time series for distinct points, the second contains precipitation time series for the same plots (the plots are in the same order in both lists).
I need to combine the two lists. I want to add the column called precipitation from one list to the corresponding ndvi column from the other list, matching the dates (represented here by letters in the row names), for a later analysis of correlation between columns. However, the ndvi and precipitation time series have distinct lengths and distinct dates.
I created the two lists to be used as an example of my dataset. However, in my actual dataset the row names are monthly dates in the format "%Y-%m-%d".
library(tidyverse)
set.seed(100)
# First variable is ndvi.mon1 (monthly ndvi)
ndvi.mon1 <- vector("list", length = 3)
for (i in seq_along(ndvi.mon1)) {
  aux <- data.frame(ndvi = sample(randu$x,
                                  sample(c(seq(1, 20, 1)), 1),
                                  replace = T))
  ndvi.mon1[i] <- aux
  ndvi.mon1 <- ndvi.mon1 %>% map(data.frame)
  rownames(ndvi.mon1[[i]]) <- sample(letters, size = seq(letters[1:as.numeric(aux %>% map(length))]) %>% length)
}
# Second variable is precipitation
precipitation <- vector("list", length = 3)
for (i in seq_along(ndvi.mon1)) {
  prec_aux <- data.frame(precipitation = sample(randu$x * 500,
                                                26,
                                                replace = T))
  row.names(prec_aux) <- seq(letters[1:as.numeric(prec_aux %>% map(length))])
  precipitation[i] <- prec_aux
  precipitation <- precipitation %>% map(data.frame)
  rownames(precipitation[[i]]) <- letters[1:(as.numeric(precipitation[i] %>% map(dim) %>% map(first)))]
}
Can someone help me please?
Thank you!!!
Marcio.
Maybe like this?
library(dplyr)
library(purrr)
library(tibble)  # for rownames_to_column
precipitation2 <- precipitation %>%
  map(rownames_to_column) %>%
  map(rename, precipitation = 2)
ndvi.mon2 <- ndvi.mon1 %>%
  map(rownames_to_column) %>%
  map(rename, ndvi = 2)
purrr::map2(ndvi.mon2, precipitation2, left_join, by = "rowname")
[[1]]
rowname ndvi precipitation
1 k 0.354886 209.7415
2 x 0.596309 103.3700
3 r 0.978769 403.8775
4 l 0.322291 354.2630
5 c 0.831722 348.9390
6 s 0.973205 273.6030
7 h 0.949827 218.6430
8 y 0.443353 61.9310
9 b 0.826368 8.3290
10 d 0.337308 291.2110
The following will return a list of data frames merged by row names:
lapply(seq_along(ndvi.mon1), function(i) {
  merge(
    x = data.frame(date = rownames(ndvi.mon1[[i]]), ndvi = ndvi.mon1[[i]][, 1]),
    y = data.frame(date = rownames(precipitation[[i]]), precip = precipitation[[i]][, 1]),
    by = "date"
  )
})
Output:
[[1]]
date ndvi precip
1 b 0.826368 8.3290
2 c 0.831722 348.9390
3 d 0.337308 291.2110
4 h 0.949827 218.6430
5 k 0.354886 209.7415
6 l 0.322291 354.2630
7 r 0.978769 403.8775
8 s 0.973205 273.6030
9 x 0.596309 103.3700
10 y 0.443353 61.9310
[[2]]
date ndvi precip
1 g 0.415824 283.9335
2 k 0.573737 311.8785
3 p 0.582422 354.2630
4 y 0.952495 495.4340
[[3]]
date ndvi precip
1 b 0.656463 332.5700
2 c 0.347482 94.7870
3 d 0.215425 431.3770
4 e 0.063100 499.2245
5 f 0.419460 304.5190
6 g 0.712057 226.7125
7 h 0.666700 284.9645
8 i 0.778547 182.0295
9 k 0.902520 82.5515
10 l 0.593219 430.6630
11 m 0.788715 443.5345
12 n 0.347482 132.3950
13 q 0.719538 79.1835
14 r 0.911370 100.7025
15 s 0.258743 309.3575
16 t 0.940644 142.3725
17 u 0.626980 335.4360
18 v 0.167640 390.4915
19 w 0.826368 63.3760
20 x 0.937211 439.8685
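Note that merge with its defaults performs an inner join and sorts by the key, which is why the rows above come out in alphabetical order, while left_join keeps every ndvi row and fills unmatched dates with NA. Since the actual row names are monthly dates in "%Y-%m-%d" format, the same approach should work once they are parsed as dates; a minimal sketch (my own addition, assuming the row names parse cleanly with as.Date):
merged <- lapply(seq_along(ndvi.mon1), function(i) {
  x <- data.frame(date = as.Date(rownames(ndvi.mon1[[i]])), ndvi = ndvi.mon1[[i]][, 1])
  y <- data.frame(date = as.Date(rownames(precipitation[[i]])), precip = precipitation[[i]][, 1])
  merge(x, y, by = "date")  # dates compare and sort correctly as Date objects
})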

R: Recursively add rows

The concentration of germs on hands following j surface contacts can be described by the following recursive relationship:
H[j+1] = H[j] + T[j] * (S[j] - H[j])
where S is the surface concentration the hand touches (assumed random for simplicity) and T is the transfer efficiency for each contact. I would like to calculate the eventual hand concentration (with zero starting concentration).
I have a data frame with a vector of surface contacts and transfer efficiencies for each surface. There are two groups, a and b, and within each group assume each surface is touched sequentially, 1:nrow(df):
df <- data.frame(S = runif(10)*100, T = runif(10),g=rep(c("a","b"),each=5))
I would like to compute the cumulative sum of H by group using dplyr where possible.
As a special case:
If g == "a", the starting value of H is 0.
If g == "b", the starting value of H is the last value of H from when g == "a".
Here is a similar approach to the one shown by @AnilGoyal, for a generalized case:
library(dplyr)
library(purrr)
df %>%
  mutate(H = accumulate2(S, T * !lead(!duplicated(g), default = FALSE),
                         .init = 0, ~ ..1 + ..3 * (..2 - ..1))[-(1 + n())])
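For intuition: accumulate2 threads three arguments through the formula, where ..1 is the running value, ..2 is the current element of the first vector (S), and ..3 is the current element of the second (the boundary-adjusted T). A tiny standalone illustration of just that mechanism (assuming purrr is loaded):
accumulate2(c(10, 20), c(0.5, 0.5), .init = 0, ~ ..1 + ..3 * (..2 - ..1))
# list(0, 5, 12.5): 0, then 0 + 0.5 * (10 - 0) = 5, then 5 + 0.5 * (20 - 5) = 12.5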
For the sake of completeness, and taking clues from Arun and Onyambu (on a separate question), I am adding a base R answer here too.
transform(df, H = Reduce(function(.x, .y) .x + df$T[.y] * (df$S[.y] - .x) * !c(!duplicated(df$g)[-1], 0)[.y],
                         seq(nrow(df)),
                         init = 0,
                         accumulate = TRUE)[-(1 + nrow(df))])
S T g H
1 37.698250 0.8550377 a 0.00000
2 3.843585 0.4722659 a 32.23342
3 33.150788 0.3684791 a 18.82587
4 8.948116 0.8893603 a 24.10430
5 57.061844 0.5452377 a 10.62499
6 49.648827 0.7719067 b 10.62499
7 95.403697 0.5835950 b 40.74775
8 10.598677 0.1220491 b 72.64469
9 91.913365 0.2166443 b 65.07203
10 69.644200 0.2603413 b 70.88705
Earlier Answer
A slight variation on my friend's answer above, which I hope serves your purpose. The only assumption I am making is that your data is already sorted by group and that a precedes b (exactly as shown in the sample). Since you have not given a random seed, I am using the same data as my friend.
Strategy/hack: I use a 0 value of T at the last row of group a inside the accumulate2 argument, so that the last value of H in group a is carried over as the first value of group b.
library(tidyverse)
df <- read.table(header = TRUE, text = ' S T g
1 37.698250 0.8550377 a
2 3.843585 0.4722659 a
3 33.150788 0.3684791 a
4 8.948116 0.8893603 a
5 57.061844 0.5452377 a
6 49.648827 0.7719067 b
7 95.403697 0.5835950 b
8 10.598677 0.1220491 b
9 91.913365 0.2166443 b
10 69.644200 0.2603413 b')
df %>%
  mutate(H = accumulate2(S, replace(T, length(g[g == 'a']), 0), .init = 0,
                         ~ ..1 + ..3 * (..2 - ..1))[-(1 + n())])
S T g H
1 37.698250 0.8550377 a 0.00000
2 3.843585 0.4722659 a 32.23342
3 33.150788 0.3684791 a 18.82587
4 8.948116 0.8893603 a 24.10430
5 57.061844 0.5452377 a 10.62499
6 49.648827 0.7719067 b 10.62499
7 95.403697 0.5835950 b 40.74775
8 10.598677 0.1220491 b 72.64469
9 91.913365 0.2166443 b 65.07203
10 69.644200 0.2603413 b 70.88705
#check - formula
#H[j+1]=H[j]+T[j]*(S[j]-H[j])
# for j =2
# H[2] = H[1] + T[1] * (S[1] -H[1])
0 + 0.8550377 * (37.698250 - 0)
#> [1] 32.23342
#for j=7 (second row group b)
#H[6] + T[6] * (S[6] - H[6])
10.62499 + 0.7719067 * (49.648827 - 10.62499)
#> [1] 40.74775
Created on 2021-07-10 by the reprex package (v2.0.0)
Here is another generalized version I would use for this question:
df$H <- Reduce(function(x, y) {
  x + df$T[y] * (df$g[y] == df$g[y + 1]) * (df$S[y] - x)
}, init = 0,
seq_len(nrow(df))[-nrow(df)], accumulate = TRUE)
df
S T g H
1 37.698250 0.8550377 a 0.00000
2 3.843585 0.4722659 a 32.23342
3 33.150788 0.3684791 a 18.82587
4 8.948116 0.8893603 a 24.10430
5 57.061844 0.5452377 a 10.62499
6 49.648827 0.7719067 b 10.62499
7 95.403697 0.5835950 b 40.74775
8 10.598677 0.1220491 b 72.64469
9 91.913365 0.2166443 b 65.07203
10 69.644200 0.2603413 b 70.88705
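For readers who prefer the recurrence spelled out step by step, here is a plain for-loop sketch of the same carry-over idea (my own illustration, not one of the answers above):
H <- numeric(nrow(df))  # H[1] starts at 0
for (j in seq_len(nrow(df) - 1)) {
  t_j <- if (df$g[j] == df$g[j + 1]) df$T[j] else 0  # zero the transfer at the group boundary
  H[j + 1] <- H[j] + t_j * (df$S[j] - H[j])
}
df$H <- H  # matches the outputs above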

Calculating the distance between coordinates in R

We have a set of 50 csv files from participants, currently being read into a list as
file_paths <- fs::dir_ls("data")
file_paths
file_contents <- list()
for (i in seq_along(file_paths)) {
  file_contents[[i]] <- read_csv(file = file_paths[[i]])
}
dt <- set_names(file_contents, file_paths)
My data looks like this:
level time X Y Type
1 1 355. -10.6 22.36 P
1 1 371. -33 24.85 O
1 2 389. -10.58 17.23 P
1 2 402. -16.7 30.46 O
1 3 419. -29.41 17.32 P
1 4 429. -10.28 26.36 O
2 5 438. -26.86 32.98 P
2 6 451. -21 17.06 O
2 7 463. -21 32.98 P
2 8 474. -19.9 17.06 O
We have 70 sets of coordinates per csv.
Time does not matter for this, but I would like to split up by the level column at some stage.
For every 'P' I want to compare it to 'O' and get the distance between the coordinates. The first P will always match with the first O, and so on.
For now, I have them split into two different lists, though this may be completely the wrong way to do it! I'm having trouble figuring out how to take all of these csv files and get the distances for all of them; the list seems to cause issues with most functions (like dist).
Here is how I've pulled the right information so far
for (i in seq_along(dt)) {
  pLoc[[i]] <- dplyr::filter(dt[[i]], grepl("P", type))
  oLoc[[i]] <- dplyr::filter(dt[[i]], grepl("o", type))
  pX[[i]] <- pLoc[[i]] %>% pull(as.numeric(headX))
  pY[[i]] <- pLoc[[i]] %>% pull(as.numeric(headY))
  pCoordinates[[i]] <- cbind(pX[[i]], pY[[i]])
}
[EDITED] Following comments, here is how you can do it with the raster library:
library(raster)
library(dplyr)
df = data.frame(
  x = c(10, 20, 15, 9),
  y = c(45, 34, 54, 24),
  type = c("P", "O", "P", "O")
)
df = cbind(df[df$type == "P", ] %>%
             dplyr::select(-type) %>%
             dplyr::rename(xP = x, yP = y),
           df[df$type == "O", ] %>%
             dplyr::select(-type) %>%
             dplyr::rename(xO = x, yO = y))
The following could probably be achieved more efficiently with some form of the apply() function:
v = c()
for (i in 1:nrow(df)) {
  dist = raster::pointDistance(lonlat = F,
                               p1 = c(df$xP[i], df$yP[i]),
                               p2 = c(df$xO[i], df$yO[i]))
  v = c(v, dist)
}
df$dist = v
print(df)
xP yP xO yO dist
1 10 45 20 34 14.86607
3 15 54 9 24 30.59412
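Note that with the P and O coordinates arranged side by side like this, the loop is not strictly necessary: pointDistance with lonlat = F is just planar Euclidean distance, so the whole column can be computed in one vectorized step (a simple base R alternative):
df$dist <- sqrt((df$xP - df$xO)^2 + (df$yP - df$yO)^2)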

Sampling Nested For Loop

My loop knowledge is very minimal, but I currently have a loop written which takes values from three vectors (small.dens, med.dens, and large.dens); each vector has 17 values. I have the loop set up to randomly select 2 values, then 3, then 4... all the way up to 17. Using these values, it calculates the mean and standard error (using the plotrix package). It then places these calculated means and standard errors into new vectors (small.density, small.stanerr, medium.density, medium.stanerr, large.density, and large.stanerr). Then, separately from the loop, I combine these vectors into a dataframe.
library(plotrix)
small.density = rep(NA,16)
small.stanerr = rep(NA,16)
medium.density = rep(NA,16)
medium.stanerr = rep(NA,16)
large.density = rep(NA,16)
large.stanerr = rep(NA,16)
for (i in 2:17) {
  xx = sample(small.dens, i, replace = TRUE)
  small.density[[i]] = mean(xx)
  small.stanerr[[i]] = std.error(xx)
  yy = sample(med.dens, i, replace = TRUE)
  medium.density[[i]] = mean(yy)
  medium.stanerr[[i]] = std.error(yy)
  zz = sample(large.dens, i, replace = TRUE)
  large.density[[i]] = mean(zz)
  large.stanerr[[i]] = std.error(zz)
}
I then want to run this loop 100 times, ultimately taking the mean, if that makes sense. For example, I would like it to select 2,3,4...17 values 100 times, taking the mean and standard error each time, and then taking the mean of all 100 times. Does this make sense? Would I make another for loop, turning this into a nested loop?
How would I go about doing this?
Thanks!
There are other ways to achieve what you want, but if you do not want to change your code, then just wrap it in something like this
res <- do.call(rbind, lapply(1:100, function(x) {
  within(data.frame(
    n = x,
    size = 2:17,
    small.density = rep(NA, 16),
    small.stanerr = rep(NA, 16),
    medium.density = rep(NA, 16),
    medium.stanerr = rep(NA, 16),
    large.density = rep(NA, 16),
    large.stanerr = rep(NA, 16)
  ), {
    for (i in 2:17) {
      xx = sample(small.dens, i, replace = TRUE)
      small.density[[i - 1L]] = mean(xx)
      small.stanerr[[i - 1L]] = std.error(xx)
      yy = sample(med.dens, i, replace = TRUE)
      medium.density[[i - 1L]] = mean(yy)
      medium.stanerr[[i - 1L]] = std.error(yy)
      zz = sample(large.dens, i, replace = TRUE)
      large.density[[i - 1L]] = mean(zz)
      large.stanerr[[i - 1L]] = std.error(zz)
    }
    rm(xx, yy, zz, i)
  })
}))
res looks like this
> head(res, 20)
n size small.density small.stanerr medium.density medium.stanerr large.density large.stanerr
1 1 2 -0.04716195 0.35754422 13.1014925 4.374055 -42.089591 30.87786
2 1 3 -0.15893367 0.34557922 -0.2680632 6.206081 52.984076 36.85058
3 1 4 0.10013995 0.62374467 -0.1944930 5.784211 -112.684774 30.50707
4 1 5 0.40654132 0.40815013 1.6096970 5.026714 45.810098 46.58469
5 1 6 0.13310242 0.32104512 -6.9989844 4.232091 -22.312165 48.14705
6 1 7 0.21283027 0.53633472 -5.0702365 3.829677 -43.266482 41.74286
7 1 8 0.13870439 0.27161346 4.1629469 3.214053 -9.045643 48.49930
8 1 9 0.06495734 0.36738163 3.9742069 3.540913 -43.954345 38.23816
9 1 10 -0.01882762 0.37570468 -3.1764203 3.740403 -43.156792 38.47531
10 1 11 -0.02115580 0.26239465 -2.2026077 2.702412 7.343837 30.58314
11 1 12 0.09967753 0.27360125 3.9603382 3.214921 -13.461632 29.39910
12 1 13 0.53121414 0.27561862 4.3593802 1.872685 -38.572491 25.37029
13 1 14 0.21547909 0.36345292 -0.3377787 2.732968 17.305232 26.08317
14 1 15 0.33957964 0.23029520 0.4832063 2.886160 8.145410 18.23901
15 1 16 0.26871985 0.26846012 -6.7634873 3.436742 -4.011269 20.33814
16 1 17 0.24927792 0.20534048 -0.7481315 1.899348 9.993280 24.49623
17 2 2 -1.10840346 0.07123407 -3.4317644 6.966096 -30.384945 121.00972
18 2 3 1.73947551 0.35986535 -2.1415966 5.628115 -57.857871 10.47413
19 2 4 0.40033834 0.41963615 -4.2156733 1.206414 27.891021 13.84453
20 2 5 -0.08704736 0.52872770 0.3137693 2.974888 -3.100414 57.89126
If you want to calculate the mean of the 100 simulated values for each size, then just
aggregate(. ~ size, res[-1L], mean)
which gives you
size small.density small.stanerr medium.density medium.stanerr large.density large.stanerr
1 2 0.02872578 0.6341294 1.0938287 5.518797 3.141204 53.20675
2 3 0.16985732 0.5388110 -0.1627867 5.185643 -6.660756 49.83607
3 4 0.20543404 0.4815581 0.1385016 4.519419 -8.093673 46.64984
4 5 0.13019280 0.4546794 0.1299331 4.166335 -10.300542 41.40444
5 6 0.10675158 0.4307113 0.2191516 4.033863 -12.068151 38.95312
6 7 0.19326831 0.3834507 0.8784275 3.513812 -6.920378 36.17856
7 8 0.09020638 0.3580780 0.4388388 3.443349 -5.335405 30.49615
8 9 0.13956838 0.3558005 0.3740251 3.313501 -15.290834 31.64833
9 10 0.18368962 0.3397191 0.4600761 3.051425 -5.505220 29.46165
10 11 0.20653866 0.3116104 0.9913534 2.804659 -8.809398 28.79097
11 12 0.14653661 0.2988422 0.3337274 2.624418 -5.128882 26.78074
12 13 0.12255652 0.2864998 0.2085829 2.719396 -11.548064 27.08497
13 14 0.13102809 0.2830709 0.6448798 2.586491 -4.676053 25.21800
14 15 0.14536840 0.2749606 0.3415879 2.522826 -11.968496 24.44427
15 16 0.14871831 0.2571571 0.2218365 2.463486 -10.335511 23.64304
16 17 0.13664397 0.2461108 0.3387764 2.348594 -9.969407 22.84736
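If you prefer dplyr for the summary step, the same per-size means as the aggregate() call can be computed with (a sketch, assuming res from above and dplyr >= 1.0):
library(dplyr)
res %>%
  group_by(size) %>%
  summarise(across(-n, mean))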

Plot In R with Multiple Lines Based On A Particular Variable?

I have an accelerometer dataset; let's say I have some number n of observations for each subject (30 subjects total) for body-acceleration x time.
I want to make a plot where these body-acceleration x time points are plotted for each subject in a different color on the y axis, and the x axis is just an index. I tried this:
ggplot(data = filtered_data_walk,
       aes(x = seq_along(filtered_data_walk$`body-acceleration-mean-y-time`),
           y = filtered_data_walk$`body-acceleration-mean-y-time`)) +
  geom_line(aes(color = filtered_data_walk$subject))
But, the problem is that it doesn't superimpose the 30 lines, instead, they run along side each other. In other words, I end up with n1 + n2 + n3 + ... + n30 x index points, instead of max{n1, n2, ..., n30}. This is my first time posting, so I hope this makes sense (I know my formatting is bad).
One solution I thought of was to create a new variable which gives a value of 1 to n for all the observations of each subject. So, for example, if I had 6 observations for subject1, 4 observations for subject2, and 9 observations for subject3, this new variable would be sequenced like:
1 2 3 4 5 6 1 2 3 4 1 2 3 4 5 6 7 8 9
Is there an easy way to do this? Please help, ty.
Assuming your data is formatted as a data.frame or matrix, for a toy dataset like
x <- data.frame(replicate(5, rnorm(10)))
x
# X1 X2 X3 X4 X5
# 1 -1.36452272 -1.46446475 2.0444381 0.001585876 -1.1085990
# 2 -1.41303046 -0.14690269 1.6179084 -0.310162018 -1.5528733
# 3 -0.15319554 -0.18779791 -0.3005058 0.351619212 1.6282955
# 4 -0.38712167 -0.14867239 -1.0776359 0.106694311 -0.7065382
# 5 -0.50711166 -0.95992916 1.3522922 1.437085757 -0.7921355
# 6 -0.82377208 0.50423328 -0.5366513 -1.315263679 1.0604499
# 7 -0.01462037 -1.15213287 0.9910678 0.372623508 1.9002438
# 8 1.49721113 -0.84914197 0.2422053 0.337141898 1.2405208
# 9 1.95914245 -1.43041783 0.2190829 -1.797396822 0.4970690
# 10 -1.75726827 -0.04123615 -0.1660454 -1.071688768 -0.3331887
...you might be able to get there with something like
plot(x[,1], type='l', xlim=c(1, nrow(x)), ylim=c(min(x), max(x)))
for(i in 2:ncol(x)) lines(x[,i], col=i)
You could play with formatting some more, of course, do things with lty= and lwd= and maybe a color ramp of your own choosing, etc.
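For instance, one possible styling of the same plot (my own illustration; hcl.colors() is in base R's grDevices from R 3.6 on):
cols <- hcl.colors(ncol(x), "viridis")  # one color per column
plot(x[, 1], type = 'l', col = cols[1], lwd = 2,
     xlim = c(1, nrow(x)), ylim = range(x))
for (i in 2:ncol(x)) lines(x[, i], col = cols[i], lwd = 2)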
If your data is in the format below...
x <- data.frame(id=c("A","A","A","B","B","B","B","C","C"), acc=rnorm(9))
x
# id acc
# 1 A 0.1796964
# 2 A 0.8770237
# 3 A -2.4413527
# 4 B 0.9379746
# 5 B -0.3416141
# 6 B -0.2921062
# 7 B 0.1440221
# 8 C -0.3248310
# 9 C -0.1058267
...you could get there with
maxn <- max(with(x, tapply(acc, id, length)))
ids <- sort(unique(x$id))
plot(x$acc[x$id==ids[1]], type='l', xlim=c(1,maxn), ylim=c(min(x$acc),max(x$acc)))
for(i in 2:length(ids)) lines(x$acc[x$id==ids[i]], col=i)
Hope this helps, and that I interpreted your problem right.
That's pretty quick to do if you are OK with using dplyr: group_by enforces a separate counter for each subject, mutate adds the actual counter, and then your ggplot should work. Example with the iris dataset:
library(dplyr)
library(ggplot2)
group_by(iris, Species) %>%
  mutate(index = seq_along(Petal.Length)) %>%
  ggplot() +
  geom_line(aes(x = index, y = Petal.Length, color = Species))
