Bear with me here as I am new to R.
I have a data frame with many columns, some of them have similar names:
> df
  x1 x2  y1  z1  z2   z3
1  1  2 1.2 1.1 1.4  4.4
2  2  3 2.4 2.2 2.8  8.8
3  3  4 3.6 3.3 4.2 13.2
4  4  5 4.8 4.4 5.6 17.6
5  5  6 6.0 5.5 7.0 22.0
6  6  7 7.2 6.6 8.4 26.4
7  7  8 8.4 7.7 9.8 30.8
I want to plot all of the columns in the same figure, but with columns that share a name prefix drawn in the same "facet" using ggplot. So here there should be three facets, "x", "y" and "z", and each facet should have one line per column.
Is there some type of ggplot solution using facet wrap?
Using some data wrangling to tidy your data you could do the following (I assumed the x axis value should be the row number, as you asked for a line for each column):
library(tidyr)
library(dplyr)
library(ggplot2)
# df is the data frame shown in the question
dat_tidy <- df |>
  mutate(row = row_number()) |>
  pivot_longer(-row) |>
  # split e.g. "z2" into facet = "z" and col = "2"
  extract(name, into = c("facet", "col"), "(.)(.)")
ggplot(dat_tidy, aes(row, value, color = col)) +
  geom_line() +
  facet_wrap(~facet)
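For a self-contained run, here is a minimal sketch that rebuilds the data frame from the question and does the name splitting directly in pivot_longer() via names_pattern. The regular expression and the scales = "free_y" choice are my own assumptions, not part of the answer above; the pattern also copes with prefixes or suffixes longer than one character.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Data frame from the question, rebuilt by hand
df <- data.frame(
  x1 = 1:7,
  x2 = 2:8,
  y1 = 1.2 * 1:7,
  z1 = 1.1 * 1:7,
  z2 = 1.4 * 1:7,
  z3 = 4.4 * 1:7
)

# One row per (row, column) pair; the column name is split into
# a letter prefix (facet) and a numeric suffix (col)
df_long <- df |>
  mutate(row = row_number()) |>
  pivot_longer(
    -row,
    names_to = c("facet", "col"),
    names_pattern = "([a-z]+)([0-9]+)"
  )

# One panel per prefix, one line per original column
p <- ggplot(df_long, aes(row, value, colour = col)) +
  geom_line() +
  facet_wrap(~facet, scales = "free_y")
```

Printing p draws the figure; drop scales = "free_y" if all panels should share one y axis.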
I'm trying to pass a filtered dataframe onto a subsequent function.
Consider the iris data frame. I filter for the versicolor species only and then want to pass the Sepal.Length and Sepal.Width columns to a function that takes two vectors. I'm currently trying to use DouglasPeuckerNbPoints, so I will use this as an example:
iris %>%
  filter(Species == "versicolor")
I have tried:
library(kmlShape)
iris %>%
  filter(Species == "versicolor") %>%
  DouglasPeuckerNbPoints(.$Sepal.Length, .$Sepal.Width, 20)
But this is giving me the error "Error in xy.coords(x, y, setLab = FALSE) : 'x' and 'y' lengths differ".
Any help here?
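The following works. Wrapping the call in {} stops %>% from also inserting the data frame as the first argument, so . can be used more than once inside the call. magrittr treats such a braced block as a lambda expression. See https://magrittr.tidyverse.org/reference/pipe.html for more information.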
library(tidyverse)
library(kmlShape)
iris %>%
  filter(Species == "versicolor") %>%
  {DouglasPeuckerNbPoints(trajx = .$Sepal.Length,
                          trajy = .$Sepal.Width, 20)}
# x y
# 1 7.0 3.2
# 2 4.9 2.4
# 3 6.6 2.9
# 4 5.2 2.7
# 5 5.0 2.0
# 6 5.9 3.0
# 7 6.0 2.2
# 8 5.6 2.9
# 9 6.7 3.1
# 10 5.6 3.0
# 11 6.2 2.2
# 12 5.9 3.2
# 13 6.7 3.0
# 14 5.5 2.4
# 15 5.4 3.0
# 16 6.7 3.1
# 17 6.3 2.3
# 18 5.6 3.0
# 19 5.0 2.3
# 20 5.7 2.8
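To see why the unbraced version fails, here is a small sketch that does not need kmlShape. count_args is a hypothetical helper of mine that only reports how many arguments it received; it stands in for any function that expects two vectors.

```r
library(magrittr)

# Hypothetical helper (not from any package): counts the arguments it gets
count_args <- function(...) length(list(...))

versicolor <- subset(iris, Species == "versicolor")

# The dot only appears inside nested calls (.$...), so %>% still inserts
# the data frame as the first argument: three arguments arrive
n_plain <- versicolor %>% count_args(.$Sepal.Length, .$Sepal.Width)

# Inside braces nothing is inserted: only the two vectors arrive
n_braced <- versicolor %>% {count_args(.$Sepal.Length, .$Sepal.Width)}

# with() is a pipe-free alternative that evaluates the call inside the data frame
n_with <- with(versicolor, count_args(Sepal.Length, Sepal.Width))
```

In the failing call from the question, the function therefore received the whole data frame as its first vector and Sepal.Length as its second, which is consistent with the "lengths differ" error.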
I have a question related to ordering specific values of a bar chart created with ggplot.
My data "df" is the following:
city X2020 X2021
1 Stuttgart 2.9 3.1
2 Munich 2.3 2.4
3 Berlin 2.2 2.3
4 Hamburg 3.8 4.0
5 Dresden 3.3 3.0
6 Dortmund 2.5 2.6
7 Paderborn 1.7 1.8
8 Essen 2.6 2.6
9 Heidelberg 3.0 3.2
10 Karlsruhe 2.5 2.4
11 Kiel 2.6 2.7
12 Ravensburg 3.3 2.7
I want exactly this kind of barchart below, but cities should be only ordered by the value of 2021! I tried "reorder" in the ggplot as recommended, but this does not fit. There are some cities where the ordering is pretty weird and I do not understand what R is doing here. My code is the following:
df_melt <- melt(df, id = "city")
ggplot(df_melt, aes(value, reorder(city, -value), fill = variable)) +
geom_bar(stat="identity", position = "dodge")
str(df_melt)
'data.frame': 24 obs. of 3 variables:
$ city : chr "Stuttgart" "Munich" "Berlin" "Hamburg" ...
$ variable: Factor w/ 2 levels "X2020","X2021": 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 2.9 2.3 2.2 3.8 3.3 2.5 1.7 2.6 3 2.5 ...
https://i.stack.imgur.com/rJQMV.png
I think this gets messy because the variable "value" contains the values of both 2020 and 2021, and R possibly takes the mean of both (I don't know!). But I have no idea how to deal with this further. I hope somebody can help me with my concern.
Thanks!
You could try sorting your df with arrange and then use fct_inorder to ensure that the city levels are in the order that you want.
library(tidyverse)
df <- read_table(" city X2020 X2021
1 Stuttgart 2.9 3.1
2 Munich 2.3 2.4
3 Berlin 2.2 2.3
4 Hamburg 3.8 4.0
5 Dresden 3.3 3.0
6 Dortmund 2.5 2.6
7 Paderborn 1.7 1.8
8 Essen 2.6 2.6
9 Heidelberg 3.0 3.2
10 Karlsruhe 2.5 2.4
11 Kiel 2.6 2.7
12 Ravensburg 3.3 2.7 ")
#> Warning: Missing column names filled in: 'X1' [1]
df %>%
  select(-X1) %>%
  pivot_longer(-city) %>%
  arrange(desc(name), -value) %>%
  mutate(
    city = fct_inorder(city)
  ) %>%
  ggplot(aes(city, value, fill = name)) +
  geom_col(position = "dodge")
Created on 2021-07-13 by the reprex package (v1.0.0)
I just want to add to the previous answer that you can also take this plot and use coord_flip() to achieve the final result you were looking for. 😉
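The asker's guess about the mean can be checked with a small sketch. It uses four of the cities from the question; the variable names levels_by_mean and levels_by_2021 are mine. reorder() summarises all values per city with mean() by default, whereas taking the levels from the X2021 column alone gives the ordering asked for.

```r
library(tidyr)
library(ggplot2)

# Four rows taken from the question's data
df <- data.frame(
  city  = c("Hamburg", "Dresden", "Stuttgart", "Kiel"),
  X2020 = c(3.8, 3.3, 2.9, 2.6),
  X2021 = c(4.0, 3.0, 3.1, 2.7)
)
long <- pivot_longer(df, -city)

# Default reorder(): levels sorted by the mean of both years per city
levels_by_mean <- levels(reorder(long$city, long$value))

# Levels sorted by the 2021 value only
levels_by_2021 <- df$city[order(df$X2021)]
long$city <- factor(long$city, levels = levels_by_2021)

# Horizontal dodged bars, with the largest 2021 value at the top
p <- ggplot(long, aes(value, city, fill = name)) +
  geom_col(position = "dodge")
```

Here Dresden and Stuttgart swap places between the two orderings, which is the kind of "weird" ordering described in the question.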
I am trying to show the top 100 sales on a scatterplot by year. I used the below code to take top 100 games according to sales and then set it as a data frame.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
as.data.frame(top100)
I then tried to plot this with the below code:
ggplot(top100)+
aes(x=Year, y = Global_Sales) +
geom_point()
I get the error below when using the subset top100:
Error: data must be a data frame, or other object coercible by fortify(), not a numeric vector
If I use the actual games dataset, I get the plot attached.
Any ideas?
As pointed out in the comments by @CMichael, there are several issues in your code.
In the absence of a reproducible example, I used the iris dataset to explain what is wrong with your code.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
By doing that you are only extracting a single column.
The same command with the iris dataset:
> head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20)
[1] 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 7.2 7.1 7.0 6.9 6.9 6.9 6.9 6.8 6.8 6.8
So, first, you no longer have two dimensions to plot with ggplot2. Second, the column names are not kept during the extraction, so you cannot afterwards ask ggplot2 to plot Year and Global_Sales.
So, to solve your issue, you can do (here the example with the iris dataset):
top100 = as.data.frame(head(iris[order(iris$Sepal.Length, decreasing = TRUE), 1:2], n = 100))
And you get a data.frame of this type:
> str(top100)
'data.frame': 100 obs. of 2 variables:
$ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
$ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
> head(top100)
Sepal.Length Sepal.Width
132 7.9 3.8
118 7.7 3.8
119 7.7 2.6
123 7.7 2.8
136 7.7 3.0
106 7.6 3.0
And then if you are plotting:
library(ggplot2)
ggplot(top100, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Warning: based on what you provided in your example, I suggest you do:
top100 <- as.data.frame(head(games[order(games$NA_Sales,decreasing=TRUE),c("Year","Global_Sales")], 100))
However, if this does not solve your problem, you should consider providing a reproducible example of your dataset (see "How to make a great R reproducible example").
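The difference between sort() and order() can be sketched on a toy data frame. The values below are made up by me; only the column names come from the question.

```r
library(ggplot2)

# Made-up stand-in for the games data (values are invented for illustration)
games <- data.frame(
  Year         = c(2001, 2002, 2003, 2004, 2005),
  NA_Sales     = c(5, 40, 10, 30, 20),
  Global_Sales = c(9, 80, 25, 55, 35)
)

# sort() returns the sorted values only: a plain numeric vector
sorted_values <- sort(games$NA_Sales, decreasing = TRUE)

# order() returns row positions, so whole rows (and column names) are kept
top3 <- head(games[order(games$NA_Sales, decreasing = TRUE), ], n = 3)

# top3 is a data frame, so ggplot() accepts it
p <- ggplot(top3, aes(x = Year, y = Global_Sales)) +
  geom_point()
```

With the real data, n = 100 takes the place of n = 3.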
I'm trying to create min, max and mean columns for a sparklyr data frame. I want to use only 5 columns from that large data frame, row-wise, in the calculation. There are many NaN values in the columns, which may be interfering with the calculations. In standard R the code used would be:
df_train$MinEncoding <- apply(df_train[,EncodingFeatures], 1, FUN=min,na.rm=TRUE)
df_train$MaxEncoding <- apply(df_train[,EncodingFeatures], 1, FUN=max,na.rm=TRUE)
df_train$MeanEncoding <- apply(df_train[,EncodingFeatures], 1, FUN=mean,na.rm=TRUE)
I have tried
df_train %>% spark_apply(function(df) {dplyr::mutate(df, MeanLicenceEncoding = mean(LicenceEncodingFeatures))})
However spark aborts the job. Can someone help please?
For a variable set of columns, you can use Hive's greatest() and least() with dplyr and sparklyr as follows:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris <- copy_to(sc, iris)
columns <- c("Sepal_Length", "Sepal_Width")
transmute(iris,
  max = greatest(!!! rlang::parse_exprs(columns)),
  min = least(!!! rlang::parse_exprs(columns)),
  # the sum is wrapped in parentheses so that the division applies to the whole sum, not only to its last term
  avg = sql(!! paste0("(", paste("if(isnull(", columns, "), 0, ", columns, ")", collapse = " + "), ")")) / !!length(columns))
# Source: spark<?> [?? x 3]
     max   min   avg
   <dbl> <dbl> <dbl>
 1   5.1   3.5  4.3
 2   4.9   3    3.95
 3   4.7   3.2  3.95
 4   4.6   3.1  3.85
 5   5     3.6  4.3
 6   5.4   3.9  4.65
 7   4.6   3.4  4
 8   5     3.4  4.2
 9   4.4   2.9  3.65
10   4.9   3.1  4
# … with more rows
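As a local cross-check that needs no Spark connection, the same row-wise summaries can be sketched on the ordinary iris data frame with base R. This is my own addition, not part of the answer above. Unlike the SQL above, na.rm = TRUE drops missing values instead of counting them as zero, so the means would differ on rows that contain NA or NaN.

```r
# Column names as in local iris (Spark replaces the dots with underscores)
cols <- c("Sepal.Length", "Sepal.Width")

local_summary <- data.frame(
  # pmax()/pmin() work element-wise across the selected columns
  max = do.call(pmax, c(iris[cols], na.rm = TRUE)),
  min = do.call(pmin, c(iris[cols], na.rm = TRUE)),
  # row-wise mean over the selected columns, ignoring missing values
  avg = unname(rowMeans(iris[cols], na.rm = TRUE))
)

head(local_summary)
```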
I've got a data frame, all, that looks like this:
http://pastebin.com/Xc1HEYyH
Now I want to create a scatter plot with the column headings in the x-axis and the respective values as the data points. For example:
7| x
6| x x
5| x x x x
4| x x x
3| x x
2| x x
1|
---------------------------------------
STM STM STM PIC PIC PIC
cold normal hot cold normal hot
This should be easy, but I can not figure out how.
Regards
The basic idea, if you want to plot using Hadley's ggplot2, is to get your data into the form:
x y
col_names values
This can be done with the melt function from Hadley's reshape2. Do ?melt to see the possible arguments. However, since here we want to melt the whole data.frame, we just need:
melt(all)
# this gives the data in format:
# variable value
# 1 STM_cold 6.0
# 2 STM_cold 6.0
# 3 STM_cold 5.9
# 4 STM_cold 6.1
# 5 STM_cold 5.5
# 6 STM_cold 5.6
Here, x will be the variable column and y the corresponding value column.
require(ggplot2)
require(reshape2)
ggplot(data = melt(all), aes(x=variable, y=value)) +
geom_point(aes(colour=variable))
If you don't want the colours, then just remove aes(colour=variable) inside geom_point so that it becomes geom_point().
Edit: I should probably mention here that you could also replace geom_point with geom_jitter, which will give you, well, jittered points.
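A minimal sketch of the jittered variant, assuming a tiny stand-in for all with two columns in the shape of the question's data. The width and height settings are my own choice: height = 0 keeps the y values exact and only spreads the points sideways.

```r
library(ggplot2)
library(reshape2)

# Small stand-in for the question's data frame `all` (two columns only)
all <- data.frame(
  STM_cold = c(6.0, 6.0, 5.9, NA),
  PIC_cold = c(0.9, 1.0, 0.3, 0.2)
)

# Naming measure.vars explicitly melts every column without the
# "No id variables" message
long <- melt(all, measure.vars = names(all))

# Jitter horizontally only; rows with missing values are dropped silently
p <- ggplot(long, aes(x = variable, y = value, colour = variable)) +
  geom_jitter(width = 0.1, height = 0, na.rm = TRUE)
```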
Here are two options to consider. The first uses dotplot from the "lattice" package:
library(lattice)
dotplot(values ~ ind, data = stack(all))
The second uses dotchart from base R's "graphics" options. To use the dotchart function, you need to wrap your data.frame in as.matrix:
dotchart(as.matrix(all), labels = "")
Note that the points in this graphic are not "jittered", but rather, presented in the order they were recorded. That is to say, the lowest point is the first record, and the highest point is the last record. If you zoomed into the plot for this example, you would see that you have 16 very faint horizontal lines. Each line represents one row from each column. Thus, if you look at the dots for "STM_cold" or any of the other variables that have NA values, you'll see a few blank lines at the top where there was no data available.
This has its advantages since it might show a trend over time if the values are recorded chronologically, but might also be a disadvantage if there are too many rows in your source data frame.
A bit of a manual version using base R graphics just for fun.
Get the data:
test <- read.table(text="STM_cold STM_normal STM_hot PIC_cold PIC_normal PIC_hot
6.0 6.6 6.3 0.9 1.9 3.2
6.0 6.6 6.5 1.0 2.0 3.2
5.9 6.7 6.5 0.3 1.8 3.2
6.1 6.8 6.6 0.2 1.8 3.8
5.5 6.7 6.2 0.5 1.9 3.3
5.6 6.5 6.5 0.2 1.9 3.5
5.4 6.8 6.5 0.2 1.8 3.7
5.3 6.5 6.2 0.2 2.0 3.5
5.3 6.7 6.5 0.1 1.7 3.6
5.7 6.7 6.5 0.3 1.7 3.6
NA NA NA 0.1 1.8 3.8
NA NA NA 0.2 2.1 4.1
NA NA NA 0.2 1.8 3.3
NA NA NA 0.8 1.7 3.5
NA NA NA 1.7 1.6 4.0
NA NA NA 0.1 1.7 3.7",header=TRUE)
Set up the basic plot:
plot(
  NA,
  ylim=c(0,max(test,na.rm=TRUE)+0.3),
  xlim=c(1-0.1,ncol(test)+0.1),
  xaxt="n",
  ann=FALSE,
  panel.first=grid()
)
axis(1,at=seq_along(test),labels=names(test),lwd=0,lwd.ticks=1)
Plot some points, with some x-axis jittering so they are not printed on top of one another.
invisible(
  mapply(
    points,
    jitter(rep(seq_along(test),each=nrow(test))),
    unlist(test),
    col=rep(seq_along(test),each=nrow(test)),
    pch=19
  )
)
Edit: here's an example using alpha transparency on the points and getting rid of the jitter, as discussed in the comments below with Ananda.
invisible(
  mapply(
    points,
    rep(seq_along(test),each=nrow(test)),
    unlist(test),
    col=rgb(0,0,0,0.1),
    pch=15,
    cex=3
  )
)
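For comparison, base R's stripchart() gives a similar picture in a single call. This is a different technique from the manual plot()/points() approach above, sketched here on a two-column stand-in for the data; the jitter amount and colour are my own choices.

```r
# Small stand-in for the data (two columns, one missing value)
test <- data.frame(
  STM_cold = c(6.0, 6.0, 5.9, NA),
  PIC_cold = c(0.9, 1.0, 0.3, 0.2)
)

# One strip per column; vertical = TRUE puts the column names on the x axis
stripchart(
  test,
  vertical = TRUE,
  method = "jitter",          # spread overlapping points sideways
  jitter = 0.1,
  pch = 19,
  col = rgb(0, 0, 0, 0.4)     # semi-transparent black
)
```

The manual version remains more flexible, for example for per-column colours or a custom grid.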