I would like to know if there is a good way to get the IDs of the points in a scatter plot by drawing a freehand polygon in R.
I found scatterD3 and it looks nice, but I can't manage to output the lab to a variable in R.
Thank you.
Roc
Here's one way
library(iplots)
with(iris, iplot(Sepal.Width,Petal.Width))
Use SHIFT (xor) or SHIFT+ALT (and) to select points (red):
Then:
iris[iset.selected(), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 119 7.7 2.6 6.9 2.3 virginica
# 115 5.8 2.8 5.1 2.4 virginica
# 133 6.4 2.8 5.6 2.2 virginica
# 136 7.7 3.0 6.1 2.3 virginica
# 146 6.7 3.0 5.2 2.3 virginica
# 142 6.9 3.1 5.1 2.3 virginica
gives you the selected rows.
The package "gatepoints" available on CRAN will allow you to draw a gate returning your points of interest.
The package can be used as follows.
First plot your points
x <- data.frame(x=1:10, y=1:10)
plot(x, col = "red", pch = 16)
Then select your points after running the following commands:
library(gatepoints)
selectedPoints <- fhs(x)
This will return:
selectedPoints
#> [1] "4" "5" "7"
#> attr(,"gate")
#> x y
#> 1 6.099191 8.274120
#> 2 8.129107 7.048649
#> 3 8.526881 5.859404
#> 4 5.700760 6.716428
#> 5 5.605314 5.953430
#> 6 6.866882 3.764390
#> 7 3.313575 3.344069
#> 8 2.417270 5.217868
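To see what such a gate selection amounts to once the polygon is drawn, here is a minimal base R sketch of a point-in-polygon test (ray casting). The helper name points_in_gate is hypothetical, and this is not claimed to be how gatepoints works internally. The gate coordinates are copied from the output above.

```r
# Hypothetical helper (not part of gatepoints): ray-casting point-in-polygon test.
# px, py: point coordinates; vx, vy: polygon vertices in drawing order.
points_in_gate <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- rep(FALSE, length(px))
  j <- n  # index of the previous vertex; the last vertex closes the polygon
  for (i in seq_len(n)) {
    # Does the edge between vertices j and i straddle the horizontal line
    # through each point ...
    straddles <- (vy[i] > py) != (vy[j] > py)
    # ... and does it cross that line to the right of the point?
    x_cross <- (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]
    # An odd number of crossings means the point is inside.
    inside <- xor(inside, straddles & (px < x_cross))
    j <- i
  }
  inside
}

x <- data.frame(x = 1:10, y = 1:10)
gate <- data.frame(
  x = c(6.099191, 8.129107, 8.526881, 5.700760,
        5.605314, 6.866882, 3.313575, 2.417270),
  y = c(8.274120, 7.048649, 5.859404, 6.716428,
        5.953430, 3.764390, 3.344069, 5.217868)
)

# Row names of the points that fall inside the gate
selected <- rownames(x)[points_in_gate(x$x, x$y, gate$x, gate$y)]
selected
```

With the gate above this should give the same three row names that fhs returned.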
I'm trying to pass a filtered dataframe onto a subsequent function.
Consider the iris data frame. I keep only the versicolor species and then want to pass the Sepal.Length and Sepal.Width columns to a function that takes two vectors. I'm currently trying to apply DouglasPeuckerNbPoints, so I will use it as the example.
iris %>%
filter(
(Species == "versicolor"))
I have tried:
library(kmlShape)
iris %>%
filter(
(Species == "versicolor")) %>%
DouglasPeuckerNbPoints(.$Sepal.Length,.$Sepal.Width,20)
But this is giving me the error "Error in xy.coords(x, y, setLab = FALSE) : 'x' and 'y' lengths differ".
Any help here?
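The following works. We can wrap the call in {}. Without the braces, %>% still inserts the filtered data frame as the first argument, because the dots appear only inside nested calls (.$Sepal.Length); that extra argument is the likely cause of the length error. Inside {} the left-hand side is used only where a dot appears. See https://magrittr.tidyverse.org/reference/pipe.html for more information.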
library(tidyverse)
library(kmlShape)
iris %>%
filter(Species == "versicolor") %>%
{DouglasPeuckerNbPoints(trajx = .$Sepal.Length,
trajy = .$Sepal.Width, 20)}
# x y
# 1 7.0 3.2
# 2 4.9 2.4
# 3 6.6 2.9
# 4 5.2 2.7
# 5 5.0 2.0
# 6 5.9 3.0
# 7 6.0 2.2
# 8 5.6 2.9
# 9 6.7 3.1
# 10 5.6 3.0
# 11 6.2 2.2
# 12 5.9 3.2
# 13 6.7 3.0
# 14 5.5 2.4
# 15 5.4 3.0
# 16 6.7 3.1
# 17 6.3 2.3
# 18 5.6 3.0
# 19 5.0 2.3
# 20 5.7 2.8
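The effect of the braces can be checked without kmlShape. The sketch below assumes only that magrittr is installed; count_args is a hypothetical helper that reports how many arguments it received.

```r
library(magrittr)

# Hypothetical helper: reports how many arguments it received.
count_args <- function(...) length(list(...))

versicolor <- iris[iris$Species == "versicolor", ]

# Dots only inside nested calls: the data frame is still inserted first,
# so this is really count_args(versicolor, .$Sepal.Length, .$Sepal.Width).
n_plain <- versicolor %>% count_args(.$Sepal.Length, .$Sepal.Width)

# Braces: nothing is inserted; the dots are the only use of the data.
n_braces <- versicolor %>% { count_args(.$Sepal.Length, .$Sepal.Width) }

c(plain = n_plain, braces = n_braces)
```

The plain call receives three arguments and the braced call two, which is why a function expecting two vectors of equal length only works in the braced form.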
I am trying to show the top 100 sales on a scatterplot by year. I used the below code to take top 100 games according to sales and then set it as a data frame.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
as.data.frame(top100)
I then tried to plot this with the below code:
ggplot(top100)+
aes(x=Year, y = Global_Sales) +
geom_point()
I get the error below when using the subset top100:
Error: data must be a data frame, or other object coercible by fortify(), not a numeric vector
If I use the actual games dataset, I get the plot attached.
Any ideas?
As pointed out in the comments by @CMichael, there are several issues in your code.
In the absence of a reproducible example, I used the iris dataset to explain what is wrong with your code.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
By doing that you are only extracting a single column.
The same command with the iris dataset:
> head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20)
[1] 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 7.2 7.1 7.0 6.9 6.9 6.9 6.9 6.8 6.8 6.8
So, first, you no longer have two dimensions to plot with ggplot2. Second, column names are not kept during the extraction, so you cannot ask ggplot2 to plot Year and Global_Sales afterwards.
So, to solve your issue, you can do (here the example with the iris dataset):
top100 = as.data.frame(head(iris[order(iris$Sepal.Length, decreasing = TRUE), 1:2], n = 100))
And you get a data.frame of this type:
> str(top100)
'data.frame': 100 obs. of 2 variables:
$ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
$ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
> head(top100)
Sepal.Length Sepal.Width
132 7.9 3.8
118 7.7 3.8
119 7.7 2.6
123 7.7 2.8
136 7.7 3.0
106 7.6 3.0
And then if you are plotting:
library(ggplot2)
ggplot(top100, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Warning: based on what you provided in your example, I suggest you do:
top100 <- as.data.frame(head(games[order(games$NA_Sales,decreasing=TRUE),c("Year","Global_Sales")], 100))
However, if this does not solve your problem, you should consider providing a reproducible example of your dataset (see "How to make a great R reproducible example").
I'm trying to write a tidyeval function that takes a numeric column, replaces values above a certain limit with the value for limit, turns that column into a factor and then replaces the factor level equal to limit with a level called "limit+".
For example, I'm trying to replace any value above 3 in Sepal.Width with 3 and then rename that factor level to 3+.
As an example, here's how I'm trying to make it work with the iris dataset. The fct_recode() function is not renaming the factor level properly, though.
plot_hist <- function(x, col, limit) {
col_enq <- enquo(col)
x %>%
mutate(var = factor(ifelse(!!col_enq > limit, limit,!!col_enq)),
var = fct_recode(var, assign(paste(limit,"+", sep = ""), paste(limit))))
}
plot_hist(iris, Sepal.Width, 3)
To fix the last line, we can use the special symbol :=, since we need to set the value at the left hand side of the expression. For the RHS we need to coerce to character, since fct_recode expects a character vector on the right.
library(tidyverse)
plot_hist <- function(x, col, limit) {
col_enq <- enquo(col)
x %>%
mutate(var = factor(ifelse(!!col_enq > limit, limit, !!col_enq)),
var = fct_recode(var, !!paste0(limit, "+") := as.character(limit)))
}
plot_hist(iris, Sepal.Width, 3) %>%
sample_n(10)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species var
#> 40 5.1 3.4 1.5 0.2 setosa 3+
#> 98 6.2 2.9 4.3 1.3 versicolor 2.9
#> 7 4.6 3.4 1.4 0.3 setosa 3+
#> 99 5.1 2.5 3.0 1.1 versicolor 2.5
#> 76 6.6 3.0 4.4 1.4 versicolor 3+
#> 77 6.8 2.8 4.8 1.4 versicolor 2.8
#> 85 5.4 3.0 4.5 1.5 versicolor 3+
#> 119 7.7 2.6 6.9 2.3 virginica 2.6
#> 110 7.2 3.6 6.1 2.5 virginica 3+
#> 103 7.1 3.0 5.9 2.1 virginica 3+
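For comparison, the cap-and-relabel logic itself can be sketched in base R without tidy evaluation. The function name cap_to_factor is hypothetical, and this is an illustration of the same idea, not a replacement for the answer above.

```r
# Hypothetical helper: cap values at 'limit', convert to a factor,
# and rename the level equal to the limit to "<limit>+".
cap_to_factor <- function(values, limit) {
  capped <- pmin(values, limit)   # same effect as ifelse(values > limit, limit, values)
  f <- factor(capped)
  levels(f)[levels(f) == as.character(limit)] <- paste0(limit, "+")
  f
}

var <- cap_to_factor(iris$Sepal.Width, 3)
tail(levels(var), 3)   # the highest levels; the last one is the renamed limit
table(var == "3+")     # how many rows ended up in the capped level
```

As in the answer above, values exactly equal to the limit also end up in the "3+" level.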
Ok, this is a weird one. I suspect this is a bug inside data.table, but it would be useful if anyone can explain why this is happening - what is update doing exactly?
I'm using the list(list()) trick inside data.table to store fitted models. When you create a sequence of lm objects each for different groupings, and then update those models, the model data for all models becomes that of the last grouping. This seems like a reference is hanging around somewhere where a copy should have been made, but I can't find where and I can't reproduce this outside of lm and update.
Concrete example:
Starting with the iris data, first make the three species different sample sizes, then fit an lm model to each species, then update those models:
library(data.table)
set.seed(3)
DT = data.table(iris)
DT = DT[rnorm(150) < 0.9]
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
by = Species]
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]
The original data table has different numbers of each species
DT[,.N, by = Species]
# Species N
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
And the first fit confirms this:
fit[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
But the updated second fit is showing 42 for all models
fit2[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 42
# 2: versicolor 42
# 3: virginica 42
We can also look at the model component, which contains the data used for fitting, and see that all the models are indeed using the final group's data. The question is: how has this happened?
head(fit$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
head(fit$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
This is not an answer, but is too long for a comment
The .Environment for the terms component is identical for each resulting model
e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE
It appears that data.table is using the same bit of memory (my non-technical term) for each evaluation of j by group (which is efficient). However, when update is called, it uses this environment to refit the model, and by then the environment contains the values from the last group.
So, if you fudge this, it will work
fit = DT[, { xx <-list2env(copy(.SD))
mymodel <-lm(Sepal.Length ~ Sepal.Width + Petal.Length)
attr(mymodel$terms, '.Environment') <- xx
list(list(mymodel))}, by= 'Species']
lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[,lapply(V1,nobs)]
V1 V2 V3
1: 41 39 42
# using your exact diagnostic coding.
lfit2[,nobs(V1[[1]]),by = Species]
Species V1
1: setosa 41
2: versicolor 39
3: virginica 42
Not a long-term solution, but at least a workaround.
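The mechanism described above can be reproduced in base R, without data.table. In this sketch the environment shared is a stand-in (an assumption for illustration) for whatever environment data.table reuses across groups; the point is only that update re-evaluates the model call and finds its variables through the formula's environment as it is at that moment.

```r
set.seed(1)

# Stand-in for an environment whose contents are overwritten group by group
# (an assumption for illustration; not data.table's actual internals).
shared <- new.env()
shared$x <- 1:10
shared$y <- 2 * shared$x + rnorm(10)

f <- y ~ x
environment(f) <- shared      # the formula looks its variables up in 'shared'
fit <- lm(f)                  # fitted on 10 observations

# Overwrite the variables in place, as when the next group is processed.
shared$x <- 1:5
shared$y <- 2 * shared$x + rnorm(5)

# update() re-evaluates the call; the data now comes from 'shared' as it is now.
refit <- update(fit, . ~ .)

c(original = nobs(fit), refit = nobs(refit))
```

The original fit keeps its 10 observations, while the refit silently uses the 5 values currently in the environment, which mirrors every updated model reporting the size of the last group.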
I would like to split my data frame using a couple of columns and call let's say fivenum on each group.
aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
The returned value is a data.frame with only 2 columns and the second being a matrix. How can I turn it into normal columns of a data.frame?
Update
I want something like the following with less code using fivenum
ddply(iris, .(Species), summarise,
Min = min(Petal.Width),
Q1 = quantile(Petal.Width, .25),
Med = median(Petal.Width),
Q3 = quantile(Petal.Width, .75),
Max = max(Petal.Width)
)
Here is a solution using data.table (while not specifically requested, it is an obvious complement to, or replacement for, aggregate or ddply). As well as being slightly long to code, the ddply approach calls quantile repeatedly, which is inefficient because each call sorts the data again.
library(data.table)
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
IRIS <- data.table(iris)
# this will create the wide data.table
lengthBySpecies <- IRIS[,as.list(fivenum(Sepal.Length)), by = Species]
# and you can rename the columns from V1, ..., V5 to something nicer
setnames(lengthBySpecies, paste0('V',1:5), Tukeys_five)
lengthBySpecies
Species Min Q1 Med Q3 Max
1: setosa 4.3 4.8 5.0 5.2 5.8
2: versicolor 4.9 5.6 5.9 6.3 7.0
3: virginica 4.9 6.2 6.5 6.9 7.9
Or, using a single call to quantile using the appropriate prob argument.
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25))), by = Species]
Species 0% 25% 50% 75% 100%
1: setosa 4.3 4.800 5.0 5.2 5.8
2: versicolor 4.9 5.600 5.9 6.3 7.0
3: virginica 4.9 6.225 6.5 6.9 7.9
Note that the names of the created columns are not syntactically valid, although you could go through a similar renaming using setnames
EDIT
Interestingly, quantile sets the names of the resulting vector when names = TRUE (the default), and this involves a copy, which slows down the number crunching and consumes memory (it even warns you in the help, fancy that!).
Thus, you should probably use
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE)), by = Species]
Or, if you wanted to return the named list, without R copying internally
IRIS[,{quant <- as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE))
setattr(quant, 'names', Tukeys_five)
quant}, by = Species]
You can use do.call to call data.frame on the columns of the aggregate result (called dfr here), which expands the matrix column into ordinary vector columns:
dim(do.call("data.frame",dfr))
[1] 3 7
str(do.call("data.frame",dfr))
'data.frame': 3 obs. of 7 variables:
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ Petal.Width.Min. : num 0.1 1 1.4
$ Petal.Width.1st.Qu.: num 0.2 1.2 1.8
$ Petal.Width.Median : num 0.2 1.3 2
$ Petal.Width.Mean : num 0.28 1.36 2
$ Petal.Width.3rd.Qu.: num 0.3 1.5 2.3
$ Petal.Width.Max. : num 0.6 1.8 2.5
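The same idea can be sketched with fivenum applied directly, using base R only. The variable names agg and five are illustrative; the column labels follow the Tukeys_five naming used elsewhere in this thread.

```r
Tukeys_five <- c("Min", "Q1", "Med", "Q3", "Max")

# aggregate() stores the five numbers per group in one matrix-valued column.
agg <- aggregate(Petal.Width ~ Species, iris, fivenum)
dim(agg)                     # 3 rows, 2 columns
is.matrix(agg$Petal.Width)   # TRUE: the second column is a 3 x 5 matrix

# Turn the matrix column into five ordinary, named columns.
five <- data.frame(Species = agg$Species,
                   setNames(as.data.frame(agg$Petal.Width), Tukeys_five))
five
```

This avoids the summary() wrapper, so there is no extra Mean column and the columns get readable names in one step.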
As far as I know, there isn't an exact way to do what you're asking, because the function you're using (fivenum) doesn't return data in a way that can be easily bound to columns from within the 'ddply' function. This is easy to clean up, though, in a programmatic way.
Step 1: Perform the fivenum function on each 'Species' value using the 'ddply' function.
data <- ddply(iris, .(Species), summarize, value=fivenum(Petal.Width))
# Species value
# 1 setosa 0.1
# 2 setosa 0.2
# 3 setosa 0.2
# 4 setosa 0.3
# 5 setosa 0.6
# 6 versicolor 1.0
# 7 versicolor 1.2
# 8 versicolor 1.3
# 9 versicolor 1.5
# 10 versicolor 1.8
# 11 virginica 1.4
# 12 virginica 1.8
# 13 virginica 2.0
# 14 virginica 2.3
# 15 virginica 2.5
Now, the 'fivenum' function returns a numeric vector of length five, so we end up with 5 rows for each species. That's the part where the 'fivenum' function is fighting us.
Step 2: Add a label column. We know what Tukey's five numbers are, so we just call them out in the order that the 'fivenum' function returns them. The list will repeat until it hits the end of the data.
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
data$label <- Tukeys_five
# Species value label
# 1 setosa 0.1 Min
# 2 setosa 0.2 Q1
# 3 setosa 0.2 Med
# 4 setosa 0.3 Q3
# 5 setosa 0.6 Max
# 6 versicolor 1.0 Min
# 7 versicolor 1.2 Q1
# 8 versicolor 1.3 Med
# 9 versicolor 1.5 Q3
# 10 versicolor 1.8 Max
# 11 virginica 1.4 Min
# 12 virginica 1.8 Q1
# 13 virginica 2.0 Med
# 14 virginica 2.3 Q3
# 15 virginica 2.5 Max
Step 3: With the labels in place, we can quickly cast this data into a new shape using the 'dcast' function from the 'reshape2' package.
library(reshape2)
dcast(data, Species ~ label)[,c("Species",Tukeys_five)]
# Species Min Q1 Med Q3 Max
# 1 setosa 0.1 0.2 0.2 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.3 2.5
All that junk at the end just specifies the column order, since the 'dcast' function automatically puts the columns in alphabetical order.
Hope this helps.
Update: I decided to return, because I realized there is one other option available to you. You can always bind a matrix as part of a data frame definition, so you could resolve your 'aggregate' function like so:
data <- aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
result <- data.frame(Species=data[,1],data[,2])
# Species Min. X1st.Qu. Median Mean X3rd.Qu. Max.
# 1 setosa 0.1 0.2 0.2 0.28 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.36 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.00 2.3 2.5
This is my solution:
ddply(iris, .(Species), summarize, value=t(fivenum(Petal.Width)))