This question already has answers here:
How to add texture to fill colors in ggplot2
(8 answers)
Closed 9 months ago.
My tibble looks like this:
# A tibble: 5 × 6
clusters neuroticism introverty empathic open unconscious
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.242 1.02 0.511 0.327 -0.569
2 2 -0.285 -0.257 -1.36 0.723 -0.994
3 3 0.904 -0.973 0.317 0.0622 -0.0249
4 4 -0.836 0.366 0.519 0.269 1.00
5 5 0.0602 -0.493 -1.03 -1.53 -0.168
I was wondering how I can plot this with ggplot2, so that it looks like the big five personality profiles shown in this picture:
My goal is to plot a personality profile for each cluster.
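To plot it, you'd typically need the data in long format. A tidyverse solution using pivot_longer followed by ggplot could look like this:
library(tidyverse)

# df is defined under "Data:" below
df |>
  # long format: one row per cluster/trait pair (columns: clusters, name, value)
  pivot_longer(-clusters) |>
  ggplot(aes(x = name,
             y = value,
             fill = as.factor(clusters))) +
  # one bar per cluster, placed side by side within each trait
  geom_col(position = "dodge")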
Plot:
Data:
df <-
tribble(
~clusters,~neuroticism,~introverty,~empathic,~open,~unconscious,
1,0.242,1.02,0.511,0.327,-0.569,
2,-0.285,-0.257,-1.36,0.723,-0.994,
3,0.904,-0.973,0.317,0.0622,-0.0249,
4,-0.836,0.366,0.519,0.269,1.00,
5,0.0602,-0.493,-1.03,-1.53,-0.168
)
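Since the goal is one profile per cluster, a faceted line plot may come closer to the picture than dodged bars. The following is only a sketch under that assumption: the column names trait and score and the object names long and p are my own choices, not something from the question.

```r
library(tidyverse)

df <- tribble(
  ~clusters, ~neuroticism, ~introverty, ~empathic, ~open, ~unconscious,
  1,  0.242,   1.02,   0.511,  0.327,  -0.569,
  2, -0.285,  -0.257, -1.36,   0.723,  -0.994,
  3,  0.904,  -0.973,  0.317,  0.0622, -0.0249,
  4, -0.836,   0.366,  0.519,  0.269,   1.00,
  5,  0.0602, -0.493, -1.03,  -1.53,   -0.168
)

# Long format: one row per cluster/trait pair
long <- df |>
  pivot_longer(-clusters, names_to = "trait", values_to = "score")

# One panel per cluster; group is needed because the x axis is discrete
p <- ggplot(long, aes(x = trait, y = score, group = clusters)) +
  geom_line() +
  geom_point() +
  facet_wrap(~clusters, labeller = label_both)
p
```

Whether lines or bars read better depends on the picture you are trying to match; replacing geom_line() and geom_point() with geom_col() keeps the one-panel-per-cluster layout.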
I have three columns, one per group, with numeric values. I want to analyze them with an ANOVA test, but the examples I found all have the groups in one column and the corresponding values in a second column. Do I have to rearrange the data like that, or is there a method that works with the columns as I currently have them? I have attached a screenshot:
Thanks!
You can convert a wide table having many columns into another table having only two columns for key (group) and value (response) by pivoting the data:
library(tidyverse)
# create example data
set.seed(1337)
data <- tibble(
VIH = runif(100),
VIH2 = runif(100),
VIH3 = runif(100)
)
data
#> # A tibble: 100 × 3
#> VIH VIH2 VIH3
#> <dbl> <dbl> <dbl>
#> 1 0.576 0.485 0.583
#> 2 0.565 0.495 0.108
#> 3 0.0740 0.868 0.350
#> 4 0.454 0.833 0.324
#> 5 0.373 0.242 0.915
#> 6 0.331 0.0694 0.0790
#> 7 0.948 0.130 0.563
#> 8 0.281 0.122 0.287
#> 9 0.245 0.270 0.419
#> 10 0.146 0.488 0.838
#> # … with 90 more rows
data %>%
pivot_longer(everything()) %>%
aov(value ~ name, data = .)
#> Call:
#> aov(formula = value ~ name, data = .)
#>
#> Terms:
#> name Residuals
#> Sum of Squares 0.124558 25.171730
#> Deg. of Freedom 2 297
#>
#> Residual standard error: 0.2911242
#> Estimated effects may be unbalanced
Created on 2022-05-10 by the reprex package (v2.0.0)
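Printing the aov object only shows the sums of squares. To get the F statistic and p-value, wrap the fit in summary(). As a sketch of the same idea without the tidyverse, base R's stack() also produces the two-column layout. The object names long and fit are mine; stack() names its columns values and ind.

```r
# Same kind of example data as above, as a plain data frame
set.seed(1337)
data <- data.frame(
  VIH  = runif(100),
  VIH2 = runif(100),
  VIH3 = runif(100)
)

# stack() returns two columns: values (numeric) and ind (factor of column names)
long <- stack(data)

# One-way ANOVA of the response on the group factor
fit <- aov(values ~ ind, data = long)

# ANOVA table with F value and Pr(>F)
summary(fit)
```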
This question already has answers here:
Calculate the Area under a Curve
(7 answers)
Closed 1 year ago.
I have a dataframe (gdata) with x (as "r") and y (as "km") coordinates of a function.
When I plot it like this:
plot(x = gdata$r, y = gdata$km, type = "l")
I get the graph of the function:
Now I want to calculate the area under the curve from x = 0 to x = 0.6. When I look for appropriate packages I only find something like calculation AUC of a ROC curve. But is there a way just to calculate the AUC of a normal function?
The area under the curve (AUC) of a given set of data points can be computed by numerical integration:
Let data be your data frame containing x and y values. You can get the area under the curve from the lower bound x0 = 0 to the upper bound x1 = 0.6 by integrating a function that linearly interpolates your data.
This is a numerical approximation and not exact, because we do not have an infinite number of data points: for y = sqrt(x) with the 20 rows below we get 0.3033 instead of the true value of 0.3098. With 200 rows in data the result improves to auc = 0.3096.
library(tidyverse)
data <-
tibble(
x = seq(0, 2, length.out = 20)
) %>%
mutate(y = sqrt(x))
data
#> # A tibble: 20 × 2
#> x y
#> <dbl> <dbl>
#> 1 0 0
#> 2 0.105 0.324
#> 3 0.211 0.459
#> 4 0.316 0.562
#> 5 0.421 0.649
#> 6 0.526 0.725
#> 7 0.632 0.795
#> 8 0.737 0.858
#> 9 0.842 0.918
#> 10 0.947 0.973
#> 11 1.05 1.03
#> 12 1.16 1.08
#> 13 1.26 1.12
#> 14 1.37 1.17
#> 15 1.47 1.21
#> 16 1.58 1.26
#> 17 1.68 1.30
#> 18 1.79 1.34
#> 19 1.89 1.38
#> 20 2 1.41
qplot(x, y, data = data)
integrate(approxfun(data$x, data$y), 0, 0.6)
#> 0.3033307 with absolute error < 8.8e-05
Created on 2021-10-03 by the reprex package (v2.0.1)
The absolute error returned by integrate is correct only if the true function between every two data points really is the linear interpolation we assumed.
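Integrating a linear interpolant amounts to the trapezoidal rule, so the same number can be reproduced by hand. A minimal sketch, assuming the lower bound coincides with the first data point (as it does here); the names upper, xs, ys and auc_trapezoid are mine:

```r
x <- seq(0, 2, length.out = 20)
y <- sqrt(x)
upper <- 0.6

# Keep the points below the upper bound and add the interpolated end point
keep <- x < upper
xs <- c(x[keep], upper)
ys <- c(y[keep], approx(x, y, xout = upper)$y)

# Trapezoidal rule: interval width times the mean of the two heights
auc_trapezoid <- sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
auc_trapezoid
```

This should agree with the integrate(approxfun(...)) result to within the reported absolute error, since both integrate the same piecewise linear function.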
I used the package MESS to solve the problem:
# Toy example
library(MESS)
x <- seq(0,3, by=0.1)
y <- x^2
auc(x,y, from = 0.1, to = 2, type = "spline")
The analytical result is:
7999/3000
Which is approximately 2.666333333333333
The R script offered gives: 2.66632 using the spline approximation and 2.6695 using the linear approximation.
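If installing MESS is not an option, a similar spline-based estimate can be sketched with base R only, by integrating a spline interpolant. This is an assumption-laden alternative and not necessarily what MESS::auc does internally; the names f and res are mine.

```r
x <- seq(0, 3, by = 0.1)
y <- x^2

# Cubic spline through the data points (default method "fmm")
f <- splinefun(x, y)

# Numerical integration of the interpolant over [0.1, 2]
res <- integrate(f, lower = 0.1, upper = 2)
res$value
```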
This question already has answers here:
tidyverse pivot_longer several sets of columns, but avoid intermediate mutate_wider steps [duplicate]
(3 answers)
Closed 1 year ago.
Suppose I have a list of dataframes, mylist, and I want to apply the same operation to each dataframe.
Say my dataframes look like this:
set.seed(1)
test.tbl <- tibble(
case1_diff = rnorm(10,0),
case1_avg = rnorm(10,0),
case2_diff = rnorm(10,0),
case2_avg = rnorm(10,0),
case3_diff = rnorm(10,0),
case3_avg = rnorm(10,0),
case4_diff = rnorm(10,0),
case4_avg = rnorm(10,0),
)
> head(test.tbl)
# A tibble: 6 x 8
case1_diff case1_avg case2_diff case2_avg case3_diff case3_avg case4_diff case4_avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.626 1.51 0.919 1.36 -0.165 0.398 2.40 0.476
2 0.184 0.390 0.782 -0.103 -0.253 -0.612 -0.0392 -0.710
3 -0.836 -0.621 0.0746 0.388 0.697 0.341 0.690 0.611
4 1.60 -2.21 -1.99 -0.0538 0.557 -1.13 0.0280 -0.934
5 0.330 1.12 0.620 -1.38 -0.689 1.43 -0.743 -1.25
6 -0.820 -0.0449 -0.0561 -0.415 -0.707 1.98 0.189 0.291
and I wish to stack them into two columns, diff and avg, giving a 40 x 2 dataframe.
Normally, I would just separate it into two objects through select(ends_with("diff")) and select(ends_with("avg")), pivot them, then bind_rows.
However, since my original object is a list, I want to do it using map, like:
mylist %>%
map(*insertfunction1*) %>%
map(*insertfunction2*)
meaning I would need to do this without separating. I would also need to make sure that diff and avg are correctly paired.
What I have tried so far is
test.tbl %>%
pivot_longer(cols=everything(),
names_to = "metric") %>%
mutate(metric = str_remove(metric,"[0-9]+")) %>%
pivot_wider(id_cols=metric,
values_from=value)
We don't need both pivot_longer and pivot_wider. It can be done within pivot_longer itself by specifying the names_to and names_sep arguments, with the special name .value in names_to:
library(dplyr)
library(tidyr)
test.tbl %>%
pivot_longer(cols = everything(), names_to = c('grp', '.value'),
names_sep = "_") %>%
select(-grp)
Output:
# A tibble: 40 x 2
# diff avg
# <dbl> <dbl>
# 1 -0.626 1.51
# 2 0.919 1.36
# 3 -0.165 0.398
# 4 2.40 0.476
# 5 0.184 0.390
# 6 0.782 -0.103
# 7 -0.253 -0.612
# 8 -0.0392 -0.710
# 9 -0.836 -0.621
#10 0.0746 0.388
# … with 30 more rows
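To apply this to every element of the list, as the question asks, the pipeline can be wrapped in a function and passed to map. A minimal sketch: mylist here is a hypothetical stand-in with two small tibbles, and stack_cases is a helper name I made up.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

# Hypothetical stand-in for the real list of data frames
mylist <- list(
  a = tibble(case1_diff = 1:3, case1_avg = 4:6,
             case2_diff = 7:9, case2_avg = 10:12),
  b = tibble(case1_diff = 3:1, case1_avg = 6:4,
             case2_diff = 9:7, case2_avg = 12:10)
)

# Stack all caseN_diff / caseN_avg pairs into two columns, diff and avg
stack_cases <- function(tbl) {
  tbl %>%
    pivot_longer(cols = everything(),
                 names_to = c("grp", ".value"),
                 names_sep = "_") %>%
    select(-grp)
}

# One stacked tibble per list element
stacked <- map(mylist, stack_cases)
stacked$a
```

Because .value pairs the columns by their shared caseN prefix, diff and avg from the same case and row end up on the same output row.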
This question already has an answer here:
How to use Pivot_longer to reshape from wide-type data to long-type data with multiple variables
(1 answer)
Closed 2 years ago.
I have the results of a model prediction, including estimates and upper/lower CIs for each estimate -- all in a single row. How can I pivot longer (using tidyr) so that I get each var name in one column, and the respective estimate and lower CI and upper CI in their own columns?
Data
library(tidyverse)
prediction <- structure(list(prob.no_vacation = 0.117514519600163, prob.camping = 0.143492608263017,
prob.day_trip = 0.111421926419948, prob.hotel = 0.317703454494376,
prob.other = 0.046127755158774, prob.zimmmer = 0.263739736063722,
L.prob.no_vacation = 0.0862080033692849, L.prob.camping = 0.108591033069218,
L.prob.day_trip = 0.0824426383991041, L.prob.hotel = 0.269819723528852,
L.prob.other = 0.0280805399319794, L.prob.zimmmer = 0.21869871196767,
U.prob.no_vacation = 0.158221505149101, U.prob.camping = 0.187255261510882,
U.prob.day_trip = 0.148934253891266, U.prob.hotel = 0.369781447354612,
U.prob.other = 0.0748802031049477, U.prob.zimmmer = 0.314325057616515), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
> prediction
## # A tibble: 1 x 18
##   prob.no_vacation prob.camping prob.day_trip prob.hotel prob.other prob.zimmmer L.prob.no_vacat~
##              <dbl>        <dbl>         <dbl>      <dbl>      <dbl>        <dbl>            <dbl>
## 1            0.118        0.143         0.111      0.318     0.0461        0.264           0.0862
##   L.prob.camping L.prob.day_trip L.prob.hotel L.prob.other L.prob.zimmmer
##            <dbl>           <dbl>        <dbl>        <dbl>          <dbl>
##            0.109          0.0824        0.270       0.0281          0.219
## # ... with 6 more variables: U.prob.no_vacation <dbl>, U.prob.camping <dbl>, U.prob.day_trip <dbl>,
## #   U.prob.hotel <dbl>, U.prob.other <dbl>, U.prob.zimmmer <dbl>
Desired reshaped output
There are 6 different vacation types: no_vacation, camping, day_trip, hotel, zimmmer, other. In the original column names, the name of each vacation type is preceded by the kind of column I want it to go to.
If the prefix is just prob., I want the column to hold the numeric value of each of the 6 vacation types in an "estimate" column.
If the prefix is L.prob., I want the numeric value to go in a column for "lower_ci", in the row of that vacation type.
If the prefix is U.prob., I want the numeric value to go in a column for "upper_ci", in the row of that vacation type.
Ultimately, I want the output to look like:
I know that this type of reshaping question comes up too often, but I truly can't wrap my head around how to do it, even though I read through the pivot_longer documentation. I managed to simply pivot longer with
pivot_longer(cols = prob.no_vacation:U.prob.zimmmer) and got:
## name value
## <chr> <dbl>
## 1 prob.no_vacation 0.118
## 2 prob.camping 0.143
## 3 prob.day_trip 0.111
## 4 prob.hotel 0.318
## 5 prob.other 0.0461
## 6 prob.zimmmer 0.264
## 7 L.prob.no_vacation 0.0862
## 8 L.prob.camping 0.109
## 9 L.prob.day_trip 0.0824
## 10 L.prob.hotel 0.270
## 11 L.prob.other 0.0281
## 12 L.prob.zimmmer 0.219
## 13 U.prob.no_vacation 0.158
## 14 U.prob.camping 0.187
## 15 U.prob.day_trip 0.149
## 16 U.prob.hotel 0.370
## 17 U.prob.other 0.0749
## 18 U.prob.zimmmer 0.314
But this isn't the desired output, and I'm stuck.
Use the right regular expression to split the column names, then use the special .value sentinel in names_to:
tidyr::pivot_longer(prediction, cols=everything(),
names_to = c(".value", "vacation_type"),
names_pattern = "(.*)\\.(.*$)")
# A tibble: 6 x 4
vacation_type prob L.prob U.prob
<chr> <dbl> <dbl> <dbl>
1 no_vacation 0.118 0.0862 0.158
2 camping 0.143 0.109 0.187
3 day_trip 0.111 0.0824 0.149
4 hotel 0.318 0.270 0.370
5 other 0.0461 0.0281 0.0749
6 zimmmer 0.264 0.219 0.314
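The question asked for columns called estimate, lower_ci and upper_ci. One way to get there, sketched on a two-type subset of the data from the question, is a rename step after the pivot. The object names prediction_small and res are mine.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Subset of the question's data (rounded values for camping and hotel)
prediction_small <- tibble(
  prob.camping   = 0.143, prob.hotel   = 0.318,
  L.prob.camping = 0.109, L.prob.hotel = 0.270,
  U.prob.camping = 0.187, U.prob.hotel = 0.370
)

res <- prediction_small %>%
  pivot_longer(cols = everything(),
               names_to = c(".value", "vacation_type"),
               names_pattern = "(.*)\\.(.*$)") %>%
  # Map the prefixes to the column names requested in the question
  rename(estimate = prob, lower_ci = L.prob, upper_ci = U.prob)
res
```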
Try this option for reshaping your data. I have also reformatted the names so that they are easier to handle in the pivot functions:
library(tidyverse)
#Format names
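# fixed = TRUE matches the dot literally instead of as a regex wildcard
names(prediction) <- gsub('L.prob','LowerCI',names(prediction), fixed = TRUE)
names(prediction) <- gsub('U.prob','UpperCI',names(prediction), fixed = TRUE)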
#Reshape
prediction %>% pivot_longer(cols = names(prediction)) %>%
separate(col = name,into = c('var1','var2'),sep = '\\.') %>%
pivot_wider(names_from = var1,values_from = value)
Output:
# A tibble: 6 x 4
var2 prob LowerCI UpperCI
<chr> <dbl> <dbl> <dbl>
1 no_vacation 0.118 0.0862 0.158
2 camping 0.143 0.109 0.187
3 day_trip 0.111 0.0824 0.149
4 hotel 0.318 0.270 0.370
5 other 0.0461 0.0281 0.0749
6 zimmmer 0.264 0.219 0.314
You want the unlisted data put into a 6x3 matrix.
res <- as.data.frame(
matrix(unlist(prediction), 6,
dimnames=list(substring(names(prediction)[1:6], 6),
c("estimate", paste0(c("lower", "upper"), ".CI")))))
res
# estimate lower.CI upper.CI
# no_vacation 0.11751452 0.08620800 0.1582215
# camping 0.14349261 0.10859103 0.1872553
# day_trip 0.11142193 0.08244264 0.1489343
# hotel 0.31770345 0.26981972 0.3697814
# other 0.04612776 0.02808054 0.0748802
# zimmmer 0.26373974 0.21869871 0.3143251
I want to produce an x-y plot, with ggplot or whatever works, from the columns in the table below. The data are grouped by Day, Soil_Number and Sample. Mean is my y value, SD gives the error bars, and the column Day should serve as the x value (a timeline). How do I manage this?
Results_CMT
# A tibble: 22 x 5
# Groups: Day, Soil_Number [10]
Day Soil_Number Sample Mean SD
<int> <int> <chr> <dbl> <dbl>
1 3.84 0.230
2 0 65872 R 4.82 0.679
3 1 65871 R 3.80 1.10
4 1 65872 R 3.24 1.61
5 3 65871 fLF NA NA
6 3 65871 HF 1.73 0.795
7 3 65871 oLF 0.360 0.129
8 3 65871 R 3.13 1.36
9 3 65872 fLF NA NA
10 3 65872 HF 1.86 0.374
# ... with 12 more rows
In the end there should be 8 lines (if data is found):
65871 R
65871 HF
65871 fLF
65871 oLF
65872 R
65872 HF
65872 fLF
65872 oLF
Do I have to create another column that combines Day, Soil_Number and Sample into one character value?
Thanks for any help.
Try this:
library(ggplot2)
ggplot(Results_CMT, aes(x = Day, y = Mean, colour = interaction(Sample, Soil_Number))) +
geom_line() +
geom_errorbar(aes(ymin = Mean-SD, ymax = Mean+SD), width = .2)
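An explicit combined column is not required, but it can make the legend easier to read. A minimal sketch of that variant, using only the complete rows visible in the printout above (so it is not the full data set); the names results_subset and Series are mine. Day is left out of the label because it is already the x axis.

```r
library(ggplot2)

# Complete rows copied from the printed tibble in the question
results_subset <- data.frame(
  Day         = c(0, 1, 1, 3, 3, 3, 3),
  Soil_Number = c(65872, 65871, 65872, 65871, 65871, 65871, 65872),
  Sample      = c("R", "R", "R", "HF", "oLF", "R", "HF"),
  Mean        = c(4.82, 3.80, 3.24, 1.73, 0.360, 3.13, 1.86),
  SD          = c(0.679, 1.10, 1.61, 0.795, 0.129, 1.36, 0.374)
)

# One label per line: soil number plus sample type
results_subset$Series <- paste(results_subset$Soil_Number, results_subset$Sample)

p <- ggplot(results_subset,
            aes(x = Day, y = Mean, colour = Series, group = Series)) +
  geom_line() +
  geom_point() +  # keeps series with a single time point visible
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2)
p
```

Rows where Mean is NA are dropped with a warning when the plot is drawn, so a series such as fLF only appears once it has data.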