How can I find the variance in groups over a dataset [R]

I am trying to find the group-wise standard deviation (from one AE to the next AE) for my dataset, which looks somewhat like this:
ID Pay_ee Pay_em Post
1 100 102 AE
1 105 112 RE
1 103 112 RE
1 106 123 RE
1 101 121 RE
1 109 143 AE
1 110 113 ME
1 115 132 RE
1 123 120 AE
1 100 120 AE
1 100 120 RE
I used ggplot to plot Pay_ee and Pay_em. Now I am having difficulty representing the standard deviation in my ggplot from one AE to the next AE, which means I first have to calculate the standard deviation from one AE to the next AE and then plot it.
I tried to follow the answer at this link, but there it is done for the whole dataset.
Do you have any idea how I can do it?

Using dplyr, tidyr and ggplot2 will get you what you want.
library(dplyr)
library(tidyr)
library(ggplot2)
df <- read.table(header = TRUE,
text =
"ID Pay_ee Pay_em Post
1 100 102 AE
1 105 112 RE
1 103 112 RE
1 106 123 RE
1 101 121 RE
1 109 143 AE
1 110 113 ME
1 115 132 RE
1 123 120 AE
1 100 120 AE
1 100 120 RE")
df %>%
  gather(key, value, starts_with("Pay_")) %>%
  group_by(Post, key) %>%
  summarize(m = mean(value),
            sd = sd(value)) %>%
  print %>%
  ggplot(.) +
  theme_bw() +
  aes(x = Post, y = m, ymin = m - sd, ymax = m + sd, color = key) +
  geom_point(position = position_dodge(width = 0.5)) +
  geom_errorbar(position = position_dodge(width = 0.5)) +
  ylab("Pay")
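The code above groups by the value of Post. If "from AE to AE" instead means that each AE row opens a new block running up to the row before the next AE, a running counter can label those blocks first. The following is only a sketch under that assumption; the column name segment and the object name seg_sd are made up for illustration.

```r
library(dplyr)

df <- read.table(header = TRUE, text = "ID Pay_ee Pay_em Post
1 100 102 AE
1 105 112 RE
1 103 112 RE
1 106 123 RE
1 101 121 RE
1 109 143 AE
1 110 113 ME
1 115 132 RE
1 123 120 AE
1 100 120 AE
1 100 120 RE")

# Assumption: every AE row starts a new segment that lasts until the row
# before the next AE. cumsum() over the logical test numbers the segments.
seg_sd <- df %>%
  mutate(segment = cumsum(Post == "AE")) %>%
  group_by(segment) %>%
  summarise(n = n(),
            sd_ee = sd(Pay_ee),
            sd_em = sd(Pay_em),
            .groups = "drop")

seg_sd
```

A segment that contains a single row (here the third one) gets NA, because sd() needs at least two values. The resulting table can be passed to ggplot in the same way as the summary above, with segment on the x axis.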

Related

How to apply a filter ( dplyr) in ggplot2?

I'm trying to fill some specific areas of my geographic map with the purple color and I have no problem in doing that. This is the script I'm using:
right_join(prov2022, database, by = "COD_PROV") %>%
ggplot(aes(fill = `wage` > 500 & `wage` <=1000))+
geom_sf() +
theme_void() +
theme(legend.title=element_blank())+
scale_fill_manual(values = c('white', 'purple'))
But now I want to apply a filter in my ggplot2 picture.
I need to fill the areas of the map, but only those that have the value 13 in the column(variable) COD_REG.
I have added filter( COD_REG == 13) but it doesn't work
right_join(prov2022, database, by = "COD_PROV") %>%
filter( COD_REG == 13)
ggplot(aes(fill = `wage` > 500 & `wage` <=1000))+
geom_sf() +
theme_void() +
theme(legend.title=element_blank())+
scale_fill_manual(values = c('white', 'purple'))
R responds:
> right_join(prov2022, database, by = "COD_PROV") %>%
+ filter( COD_REG == 13)
Error in `stopifnot()`:
! Problem while computing `..1 = COD_REG == 13`.
✖ Input `..1` must be of size 106 or 1, not size 107.
Run `rlang::last_error()` to see where the error occurred.
My database has 106 observations and 13 variables, and it looks like this:
COD_REG COD_PROV wage
1 91 530
1 92 520
1 93 510
2 97 500
2 98 505
2 99 501
13 102 700
13 103 800
13 109 900
Where is the mistake?
Why does R report << ✖ Input ..1 must be of size 106 or 1, not size 107. >>?
How can I solve this?
I think you might have another filter function that masks the dplyr one. Also, you forgot to add a %>% after your filter. Could you try this:
right_join(prov2022, database, by = "COD_PROV") %>%
dplyr::filter(COD_REG == 13) %>%
ggplot(aes(fill = `wage` > 500 & `wage` <=1000))+
geom_sf() +
theme_void() +
theme(legend.title=element_blank())+
scale_fill_manual(values = c('white', 'purple'))
The code runs as expected with the sample data.
library(tidyverse)
data <- "COD_REG COD_PROV wage
1 91 530
1 92 520
1 93 510
2 97 500
2 98 505
2 99 501
13 102 700
13 103 800
13 109 900"
read_table(data) |>
filter(COD_REG == 13)
#> # A tibble: 3 × 3
#> COD_REG COD_PROV wage
#> <dbl> <dbl> <dbl>
#> 1 13 102 700
#> 2 13 103 800
#> 3 13 109 900
Created on 2023-02-03 with reprex v2.0.2
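One more thing that may be worth checking, offered as an assumption because the real prov2022 and database tables are not shown: if both tables contain a COD_REG column, the join keeps both copies and renames them with .x and .y suffixes, so a plain COD_REG no longer exists in the joined data. The sketch below uses two small made-up data frames (prov_toy and db_toy are hypothetical stand-ins, without the sf geometry) to show how to inspect the column names after the join.

```r
library(dplyr)

# Hypothetical stand-ins for prov2022 and database; both carry COD_REG
prov_toy <- data.frame(COD_PROV = c(91, 102, 103),
                       COD_REG  = c(1, 13, 13))
db_toy   <- data.frame(COD_PROV = c(91, 102, 103),
                       COD_REG  = c(1, 13, 13),
                       wage     = c(530, 700, 800))

joined <- right_join(prov_toy, db_toy, by = "COD_PROV")

# The shared non-key column is split into COD_REG.x and COD_REG.y
names(joined)

# Filtering then has to use one of the suffixed names
region13 <- joined %>%
  dplyr::filter(COD_REG.y == 13)
region13
```

If names() on your own joined table shows COD_REG.x and COD_REG.y, filtering on one of those (or joining by both COD_PROV and COD_REG) would be the thing to try.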

Circular line graph with groups

I have four dataframes that look like below:
X score.i score.ii score.iii mm
1: 1 -0.3958555 -0.3750726 -0.3378881 10
2: 2 -0.3954955 -0.3799290 -0.3400876 15
3: 3 -0.3962514 -0.3776692 -0.3401180 20
4: 4 -0.4033265 -0.3764099 -0.3436115 25
5: 5 -0.4035860 -0.3753792 -0.3426287 30
---
186: 186 -0.4041035 -0.3767158 -0.3419871 80
187: 187 -0.4040643 -0.3767881 -0.3417620 85
188: 188 -0.4052228 -0.3766468 -0.3436883 90
189: 189 -0.4047009 -0.3767359 -0.3431591 95
190: 190 -0.4061497 -0.3766785 -0.3433624 100
How can I plot a circular line graph with aes(x=mm, y=score.i) for these four such that there is a gap between the lines for each dataframe?
library(ggplot2)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(-c(X, mm), names_to = "Variable", values_to = "Score") %>%
ggplot(., aes(x = mm, y = Score, color = Variable)) +
geom_line() +
coord_polar()
Data:
read.table(text =
"X score.i score.ii score.iii mm
1 -0.3958555 -0.3750726 -0.3378881 10
2 -0.3954955 -0.3799290 -0.3400876 15
3 -0.3962514 -0.3776692 -0.3401180 20
4 -0.4033265 -0.3764099 -0.3436115 25
5 -0.4035860 -0.3753792 -0.3426287 30
186 -0.4041035 -0.3767158 -0.3419871 80
187 -0.4040643 -0.3767881 -0.3417620 85
188 -0.4052228 -0.3766468 -0.3436883 90
189 -0.4047009 -0.3767359 -0.3431591 95
190 -0.4061497 -0.3766785 -0.3433624 100",
header = TRUE, stringsAsFactors = FALSE) -> df1
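The code above draws a single data frame. For four data frames, one possible reading of "a gap between the lines" is a fixed radial offset per data frame. The sketch below stacks the frames with an identifier column and shifts score.i by an arbitrary amount per frame. The names dfs, source, ring and score_shifted, the offset of 0.05, and the three copies of df1 standing in for the other data frames are all assumptions for illustration only.

```r
library(dplyr)
library(ggplot2)

df1 <- read.table(header = TRUE, text = "X score.i score.ii score.iii mm
1 -0.3958555 -0.3750726 -0.3378881 10
2 -0.3954955 -0.3799290 -0.3400876 15
3 -0.3962514 -0.3776692 -0.3401180 20
4 -0.4033265 -0.3764099 -0.3436115 25
5 -0.4035860 -0.3753792 -0.3426287 30")

# The other three data frames are not shown in the question, so copies of
# df1 stand in for them here (hypothetical placeholders)
dfs <- list(a = df1, b = df1, c = df1, d = df1)

combined <- bind_rows(dfs, .id = "source") %>%
  mutate(ring = as.integer(factor(source)),
         # shift each data frame outward; 0.05 is an arbitrary gap size
         score_shifted = score.i + (ring - 1) * 0.05)

ggplot(combined, aes(x = mm, y = score_shifted, color = source)) +
  geom_line() +
  coord_polar()
```

Because the offset changes the plotted values, the radial axis no longer shows the raw scores; faceting by source would be an alternative if the original scale matters.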

Creating Boxplot in R

I have a table with data on the sales volumes of some products. I want to build several boxplots, one for each product, with sales volume on the vertical axis and days on the horizontal axis. When I build the plot, boxplots are missing for certain values. What is the reason for this?
Here is the table:
Day Cottage cheese..pcs. Kefir..pcs. Sour cream..pcs.
1 1 99 103 111
2 2 86 101 114
3 3 92 100 116
4 4 87 112 120
5 5 86 104 111
6 6 88 105 122
7 7 88 106 118
Here is my code:
head(out1) # out1 is the table above
boxplot(Day~Cottage cheese..pcs., data = out1)
Here is the result: (image not available)
Try below:
# example data
out1 <- read.table(text = " Day Cottage.cheese Kefir Sour.cream
1 1 99 103 111
2 2 86 101 114
3 3 92 100 116
4 4 87 112 120
5 5 86 104 111
6 6 88 105 122
7 7 88 106 118", header = TRUE)
# reshape wide-to-long
outlong <- stats::reshape(out1, idvar = "Day", v.names = "value",
time = "product", times = colnames(out1)[2:4],
varying = colnames(out1)[2:4], direction = "long")
# then plot
boxplot(value~product, outlong)
In addition to the provided answer, here is an option if you want sales volume on the vertical axis and days on the horizontal axis (using the out1 data provided by zx8754):
library(tidyr)
library(data.table)
library(ggplot2)
#data from wide to long
dt <- pivot_longer(out1, cols = c("Kefir", "Sour.cream", "Cottage.cheese"), names_to = "Product", values_to = "Value")
#set dt to data.table object
setDT(dt)
#convert day from integer to a factor
dt[, Day := as.factor(Day)]
#ggplot
ggplot(dt, aes(x = Day, y = Value)) + geom_bar(stat = "identity") + facet_wrap(~Product)
facet_wrap provides separate graphs for the three products.
I created a bar chart here since boxplots would be useless in this case (every product has only one value per day).

ggplot showing a trend with more than 1 variables across y axis

I have a data frame df in which I need to compare the trend across the days of the week:
df
Col Mon Tue Wed
1 47 164 163
2 110 168 5
3 31 146 109
4 72 140 170
5 129 185 37
6 41 77 96
7 85 26 41
8 123 15 188
9 14 23 163
10 152 116 82
11 118 101 5
Right now I can only plot two variables (Col and Mon), as shown below, but I need to see Tuesday and Wednesday as well:
ggplot(data=df,aes(x=Col,y=Mon))+geom_line()
You can either add a
geom_line(aes(x = Col, y = Mon), col = 1)
for each day, or restructure your data frame using a function like gather so that your new columns are Col, Day and Val. Without reformatting the data, your result would be
ggplot(data=df)+geom_line(aes(x=Col,y=Mon), col = 1) + geom_line(aes(x=Col,y=Tue), col = 2) + geom_line(aes(x=Col,y=Wed), col = 3)
With a restructured data frame (assumed here to have the columns Col, Day and Val), it would be
ggplot(data=df)+geom_line(aes(x=Col,y=Val, col = Day))
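The restructuring step itself could look like the sketch below. It uses pivot_longer(), the current tidyr replacement for gather(), and the name df_long is made up here; Day and Val are chosen to match the columns the plotting call above expects.

```r
library(tidyr)
library(ggplot2)

df <- read.table(header = TRUE, text = "Col Mon Tue Wed
1 47 164 163
2 110 168 5
3 31 146 109
4 72 140 170
5 129 185 37
6 41 77 96
7 85 26 41
8 123 15 188
9 14 23 163
10 152 116 82
11 118 101 5")

# One row per Col/day combination: the day name goes to Day, the number to Val
df_long <- pivot_longer(df, cols = -Col, names_to = "Day", values_to = "Val")

ggplot(data = df_long) +
  geom_line(aes(x = Col, y = Val, col = Day))
```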
The standard way would be to get the data in long format and then plot
library(tidyverse)
df %>%
gather(key, value, -Col) %>%
ggplot() + aes(factor(Col), value, col = key, group = key) + geom_line()

Smoothing Lines in ggplot between all data point

I have a data.frame similar to this example
SqMt <- "Sex Sq..Meters PDXTotalFreqStpy
1 M 129 22
2 M 129 0
3 M 129 1
4 F 129 35
5 F 129 42
6 F 129 5
7 M 557 20
8 M 557 0
9 M 557 15
10 F 557 39
11 F 557 0
12 F 557 0
13 M 1208 33
14 M 1208 26
15 M 1208 3
16 F 1208 7
17 F 1208 0
18 F 1208 8
19 M 604 68
20 M 604 0
21 M 604 0
22 F 604 0
23 F 604 0
24 F 604 0"
Data <- read.table(text=SqMt, header = TRUE)
I want to show the average PDXTotalFreqStpy for each Sq..Meters organized by Sex. This is what I use:
library(ggplot2)
ggplot(Data, aes(x=Sq..Meters, y=PDXTotalFreqStpy)) + stat_summary(fun.y="mean", geom="line", aes(group=Sex,color=Sex))
How do I get these lines smoothed out so that they are not jagged but nice and curvy, and go through all the data points? I have seen things on splines, but I have not gotten those to work.
See if this works for you:
library(dplyr)
# increase n if the result is not smooth enough
# (for this example, n = 50 looks sufficient to me)
n = 50
# manipulate data to calculate the mean for each sex at each x-value
# before passing the result to ggplot()
Data %>%
  group_by(Sex, x = Sq..Meters) %>%
  summarise(y = mean(PDXTotalFreqStpy)) %>%
  ungroup() %>%
  ggplot(aes(x, y, color = Sex)) +
  # optional: show point locations for reference
  geom_point() +
  # optional: show original lines for reference
  geom_line(linetype = "dashed", alpha = 0.5) +
  # further data manipulation to calculate values for smoothed spline
  geom_line(data = . %>%
              group_by(Sex) %>%
              summarise(x1 = list(spline(x, y, n)[["x"]]),
                        y1 = list(spline(x, y, n)[["y"]])) %>%
              # name the list columns explicitly; newer tidyr warns otherwise
              tidyr::unnest(cols = c(x1, y1)),
            aes(x = x1, y = y1))
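To see what spline() contributes in the code above, here is a small standalone sketch using the four group means for Sex == "M" from the example data. spline() returns a list of n interpolated x and y values, and splinefun() gives the same kind of interpolant as a function, which makes it easy to check that the curve passes through the original points. The object names s and f are made up for illustration.

```r
# Group means for Sex == "M" from the example data, in increasing x order
x <- c(129, 557, 604, 1208)
y <- c(23, 35, 68, 62) / 3

s <- spline(x, y, n = 50)  # list with 50 x values and 50 y values
f <- splinefun(x, y)       # the interpolant as a function (same default method)

length(s$x)
all.equal(f(x), y)         # the curve goes through the original points
```

Note that an interpolating spline can overshoot between points that lie close together on the x axis (such as 557 and 604 here), so the smoothed line may dip or rise beyond the observed means.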
