Creating a boxplot in R

I have a table with data on the sales volumes of some products. I want to build a boxplot for each product, with sales volume on the vertical axis and days on the horizontal axis. When I build the plot, boxplots are missing for certain values. What is the reason for this?
Here is the table:
Day Cottage cheese..pcs. Kefir..pcs. Sour cream..pcs.
1 1 99 103 111
2 2 86 101 114
3 3 92 100 116
4 4 87 112 120
5 5 86 104 111
6 6 88 105 122
7 7 88 106 118
Here is my code:
head(out1)  # out1 is the table above
boxplot(Day~Cottage cheese..pcs., data = out1)
Here is the result:

Try below:
# example data
out1 <- read.table(text = " Day Cottage.cheese Kefir Sour.cream
1 1 99 103 111
2 2 86 101 114
3 3 92 100 116
4 4 87 112 120
5 5 86 104 111
6 6 88 105 122
7 7 88 106 118", header = TRUE)
# reshape wide-to-long
outlong <- stats::reshape(out1, idvar = "Day", v.names = "value",
                          time = "product", times = colnames(out1)[2:4],
                          varying = colnames(out1)[2:4], direction = "long")
# then plot
boxplot(value~product, outlong)
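One likely reason the original call misbehaves: in boxplot(y ~ x) the left-hand side is the measured value and the right-hand side is the grouping, so a formula of the form Day ~ product draws one box of days for each distinct sales value. As a minimal base-R sketch, assuming the cleaned column names from the example data above, boxplot() also accepts a data frame directly and draws one box per column, without a reshape:

```r
# Example data with the cleaned column names used in the answer above
out1 <- data.frame(
  Day            = 1:7,
  Cottage.cheese = c(99, 86, 92, 87, 86, 88, 88),
  Kefir          = c(103, 101, 100, 112, 104, 105, 106),
  Sour.cream     = c(111, 114, 116, 120, 111, 122, 118)
)

# Drop the Day column; each remaining column becomes one box
bp <- boxplot(out1[-1], ylab = "Sales volume (pcs.)")

# boxplot() invisibly returns the statistics it drew;
# row 3 of bp$stats holds the median of each product
bp$names
bp$stats[3, ]
```

This gives the same picture as the reshaped version; the long format is still the better choice when more grouping variables are needed later.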

In addition to the provided answer, if you want sales volume on the vertical axis and days on the horizontal axis (using the out1 data provided by zx8754):
library(tidyr)
library(data.table)
library(ggplot2)
#data from wide to long
dt <- pivot_longer(out1, cols = c("Kefir", "Sour.cream", "Cottage.cheese"), names_to = "Product", values_to = "Value")
#set dt to data.table object
setDT(dt)
#convert day from integer to a factor
dt[, Day := as.factor(Day)]
#ggplot
ggplot(dt, aes(x = Day, y = Value)) + geom_bar(stat = "identity") + facet_wrap(~Product)
facet_wrap provides separate graphs for the three products.
I created a bar chart here because boxplots would be useless in this case: every product has only one value per day.
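The claim that each product has only one value per day can be checked directly. The following is a small base-R sketch, assuming the same out1 example data; the names long and counts are illustrative only:

```r
out1 <- data.frame(
  Day            = 1:7,
  Cottage.cheese = c(99, 86, 92, 87, 86, 88, 88),
  Kefir          = c(103, 101, 100, 112, 104, 105, 106),
  Sour.cream     = c(111, 114, 116, 120, 111, 122, 118)
)

# Wide to long with base R: stack() returns the columns 'values' and 'ind'
long <- data.frame(Day = rep(out1$Day, times = 3), stack(out1[-1]))

# Number of observations for every Day/product combination
counts <- table(long$Day, long$ind)
counts

# TRUE means a per-day box would collapse to a single line
all(counts == 1)
```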

Related

Problem with creating kinship matrix in R (kinship2) without sex

I need to create a kinship matrix. I wanted to use the kinship2 library in R for this, but it requires the sex variable, which I don't have. From the documentation I read that the value "3" can be used for unknown sex, but it doesn't work. Below is the code where I try to do this; the result is wrong, a 1x1 matrix.
My data:
nr.os nr.oj nr.ma ferma
1 169 152 84 3
2 170 152 84 3
3 171 152 84 3
4 172 152 84 3
5 173 152 84 3
6 174 152 84 3
My code:
library(kinship2)
my_data <- read.table("Zeszyt_s1.csv",header = TRUE, sep = ";")
my_data$sex <- as.integer(3)
df_fixed <- fixParents(id = my_data$nr.os, dadid=my_data$nr.oj, momid=my_data$nr.ma, sex=my_data$sex)
pedAll <- with(df_fixed,pedigree(
id = id,
dadid = dadid,
momid = momid,
sex = sex))
kinship(pedAll["1"])
Output:
1
1 0.5

ggplot showing a trend with more than 1 variables across y axis

I have a dataframe df where I need to see the comparison of the trend between weeks
df
Col Mon Tue Wed
1 47 164 163
2 110 168 5
3 31 146 109
4 72 140 170
5 129 185 37
6 41 77 96
7 85 26 41
8 123 15 188
9 14 23 163
10 152 116 82
11 118 101 5
Right now I can only plot two variables, as below, but I need to see Tuesday and Wednesday as well.
ggplot(data=df,aes(x=Col,y=Mon))+geom_line()
You can either add a
geom_line(aes(x = Col, y = Mon), col = 1)
for each day, or restructure your data frame with a function like gather so that the new columns are Col, Day and Val. Without reformatting the data, the plot would be
ggplot(data=df)+geom_line(aes(x=Col,y=Mon), col = 1) + geom_line(aes(x=Col,y=Tue), col = 2) + geom_line(aes(x=Col,y=Wed), col = 3)
With the restructured data frame it would be
ggplot(data=df)+geom_line(aes(x=Col,y=Val, col = Day))
The standard way would be to get the data in long format and then plot
library(tidyverse)
df %>%
gather(key, value, -Col) %>%
ggplot() + aes(factor(Col), value, col = key, group = key) + geom_line()
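For reference, the Col, Day, Val layout mentioned above can also be built without extra packages. This is a minimal sketch using base R's stack(), assuming the df shown in the question; the name long is illustrative, and the plotting step only runs if ggplot2 is installed:

```r
df <- read.table(header = TRUE, text = "
Col Mon Tue Wed
1 47 164 163
2 110 168 5
3 31 146 109
4 72 140 170
5 129 185 37
6 41 77 96
7 85 26 41
8 123 15 188
9 14 23 163
10 152 116 82
11 118 101 5")

# stack() turns the day columns into 'values' (numbers) and 'ind' (day name)
long <- data.frame(Col = rep(df$Col, times = 3),
                   stack(df[c("Mon", "Tue", "Wed")]))
names(long) <- c("Col", "Val", "Day")

# One line per day, coloured by Day
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(long) + geom_line(aes(x = Col, y = Val, col = Day)))
}
```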

cluster analysis with weight

I have a data frame 'heat' demonstrating people's performance across time.
'Var1' represents the code of persons.
'Var2' represents a time line (measured by number of days from the starting point).
'value' is the score they get at a given time point.
Var1 Var2 value
1 1 36 -0.6941826
2 2 36 -0.5585414
3 3 36 0.8032384
4 4 36 0.7973031
5 5 36 0.7536959
6 6 36 -0.5942059
....
54 10 73 0.7063218
55 11 73 -0.6949616
56 12 73 -0.6641516
57 13 73 0.6890433
58 14 73 0.6310124
59 15 73 -0.6305091
60 16 73 0.6809655
61 17 73 0.8957870
....
101 13 110 0.6495796
102 14 110 0.5990869
103 15 110 -0.6210600
104 16 110 0.6441960
105 17 110 0.7838654
....
Now I want to cluster their performance and show it on a heatmap. So I used the functions dist() and hclust() to cluster the data frame and plotted it with ggplot2:
ggplot(data = heat) + geom_tile(aes(x = Var2, y = Var1 %>% as.character(),
fill = value)) +
scale_fill_gradient(low = "yellow",high = "red") +
geom_vline(xintercept = c(746, 2142, 2917))
It looks like this:
However, I am more interested in what happened around day 746, day 2142 and day 2917 (the black lines). I would like the scores around these days bearing more weight in the clustering. I want people demonstrating similar performance around these days to have more priority to be clustered together. Is there a way of doing this?
As long as your weights are integers, you can presumably just replicate those days artificially.
If you want more control, just compute the distance matrix yourself, with whatever weighted distance you want to use.
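Both suggestions can be sketched in base R. The example below assumes the scores have been reshaped into a wide matrix with one row per person and one column per day; the matrix scores, the day labels and the weights w are hypothetical. For Euclidean distance, multiplying each column by the square root of its weight gives the weighted distance, which matches column replication when the weights are integers:

```r
set.seed(1)

# Hypothetical wide matrix: one row per person, one column per day
scores <- matrix(rnorm(5 * 4), nrow = 5,
                 dimnames = list(paste0("p", 1:5), c("36", "73", "110", "147")))

# Hypothetical weights: the second day counts three times as much as the others
w <- c(1, 3, 1, 1)

# Option 1: replicate the weighted columns (integer weights only)
d_rep <- dist(scores[, rep(seq_along(w), times = w)])

# Option 2: scale each column by sqrt(weight); works for any non-negative weights
d_wt <- dist(sweep(scores, 2, sqrt(w), "*"))

# The two distance matrices agree for integer weights
all.equal(as.numeric(d_rep), as.numeric(d_wt))

# Cluster on the weighted distances
hc <- hclust(d_wt)
hc$order
```

The scaling approach is the more flexible of the two, since the weights can fall off smoothly with the distance from the days of interest.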

How can i find the variance in groups over a dataset [R]

I am trying to find the standard deviation for my dataset groupwise (from AE to AE) which looks somewhat like this:
ID Pay_ee Pay_em Post
1 100 102 AE
1 105 112 RE
1 103 112 RE
1 106 123 RE
1 101 121 RE
1 109 143 AE
1 110 113 ME
1 115 132 RE
1 123 120 AE
1 100 120 AE
1 100 120 RE
I used ggplot to plot pay_ee and pay_em. Now I am having difficulty representing the standard deviation from one AE to the next AE in my ggplot. This means I first have to calculate the standard deviation from one AE to the next AE and then plot it.
I tried following the linked answer, but there it is done for the whole dataset.
Do you have any idea how I can do it?
Using dplyr, tidyr and ggplot2 will get you what you want.
library(dplyr)
library(tidyr)
library(ggplot2)
df <- read.table(header = TRUE,
text =
"ID Pay_ee Pay_em Post
1 100 102 AE
1 105 112 RE
1 103 112 RE
1 106 123 RE
1 101 121 RE
1 109 143 AE
1 110 113 ME
1 115 132 RE
1 123 120 AE
1 100 120 AE
1 100 120 RE")
df %>%
  gather(key, value, starts_with("Pay_")) %>%
  group_by(Post, key) %>%
  summarize(m = mean(value),
            sd = sd(value)) %>%
  print %>%
  ggplot(.) +
  theme_bw() +
  aes(x = Post, y = m, ymin = m - sd, ymax = m + sd, color = key) +
  geom_point(position = position_dodge(width = 0.5)) +
  geom_errorbar(position = position_dodge(width = 0.5)) +
  ylab("Pay")
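The pipeline above groups by the Post label, so all AE rows are pooled together. If "from one AE to the next" instead means consecutive runs of rows that each start at an AE row (an assumption about the question's intent), a run identifier can be built with cumsum(). A minimal base-R sketch, with segment and seg_sd as illustrative names:

```r
df <- read.table(header = TRUE, text = "
ID Pay_ee Pay_em Post
1 100 102 AE
1 105 112 RE
1 103 112 RE
1 106 123 RE
1 101 121 RE
1 109 143 AE
1 110 113 ME
1 115 132 RE
1 123 120 AE
1 100 120 AE
1 100 120 RE")

# A new segment starts at every AE row
df$segment <- cumsum(df$Post == "AE")

# Standard deviation of both pay columns within each segment;
# a segment with a single row has no defined standard deviation (NA)
seg_sd <- aggregate(cbind(Pay_ee, Pay_em) ~ segment, data = df, FUN = sd)
seg_sd
```

The resulting seg_sd can be joined back to df by segment and drawn with geom_errorbar in the same way as in the grouped version.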

Select a value for based on a highest value in another column

I don't understand why I can't find a solution for this, since I feel that this is a pretty basic question. Need to ask for help, then. I want to rearrange airquality dataset by month with maximum temp value for each month. In addition I want to find the corresponding day for each monthly maximum temperature. What is the laziest (code-wise) way to do this?
I have tried the following without success:
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"))
dcast(mm, month + day ~ variable, max)
aggregate(formula = temp ~ month + day, data = airquality, FUN = max)
I am after something like this:
month day temp
5 7 89
...
There was quite a discussion a while back about whether being lazy is good or not. Anyway, this is short and natural to write and read (and is fast for large data, so you don't need to change or optimize it later):
require(data.table)
DT=as.data.table(airquality)
DT[,.SD[which.max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 8 76 203 9.7 97 28
[5,] 9 73 183 2.8 93 3
.SD is the subset of the data for each group and, if I understand correctly, you just want the row from it with the largest Temp. If you need the row number, that can be added.
Or to get all the rows where the max is tied :
DT[,.SD[Temp==max(Temp)],by=Month]
Month Ozone Solar.R Wind Temp Day
[1,] 5 45 252 14.9 81 29
[2,] 6 NA 259 10.9 93 11
[3,] 7 97 267 6.3 92 8
[4,] 7 97 272 5.7 92 9
[5,] 8 76 203 9.7 97 28
[6,] 9 73 183 2.8 93 3
[7,] 9 91 189 4.6 93 4
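For comparison, the same two selections can be written in base R without any packages. This is only a sketch of an equivalent, not part of the data.table answer; aq, first_max and ties are illustrative names:

```r
aq <- datasets::airquality

# One row per month: which.max() returns the first row that reaches the maximum
first_max <- do.call(rbind, lapply(split(aq, aq$Month), function(d) {
  d[which.max(d$Temp), c("Month", "Day", "Temp")]
}))
first_max

# All tied rows: ave() copies each month's maximum back onto every row of that month
ties <- aq[aq$Temp == ave(aq$Temp, aq$Month, FUN = max),
           c("Month", "Day", "Temp")]
ties
```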
Another approach with plyr
require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"), value.name = 'temp')
library(plyr)
ddply(mm, .(month), subset, subset = temp == max(temp), select = -variable)
Gives
month day temp
1 5 29 81
2 6 11 93
3 7 8 92
4 7 9 92
5 8 28 97
6 9 3 93
7 9 4 93
Or, even simpler
require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, .(month), subset,
subset = temp == max(temp), select = c(month, day, temp) )
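How about with plyr?
library(plyr)
max.func <- function(df) {
  # largest temperature within this month's rows
  max.temp <- max(df$Temp)
  # return every day that reaches the maximum (ties give several rows)
  data.frame(day = df$Day[df$Temp == max.temp],
             temp = max.temp)
}
ddply(airquality, .(Month), max.func)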
As you can see, the max temperature for the month happens on more than one day. If you want different behavior, the function is easy enough to adjust.
Or if you want to use the data.table package (for instance, if speed is an issue and the data set is large or if you prefer the syntax):
library(data.table)
DT <- data.table(airquality)
DT[, list(maxTemp=max(Temp), dayMaxTemp=.SD[max(Temp)==Temp, Day]), by="Month"]
If you want to know what the .SD stands for, have a look here: SO
