Plot time series with years in different columns - r

I have the following data frame dt (shown with head(dt, 6)):
I need to create a graph with the years (2015, 2016, 2017, 2018, 2019) on the x-axis and the values of the columns W15, W16, W17, W18, W19 (each one relates to one year) on the y-axis. They should all be grouped by the column TEAM.
I tried using ggplot2 to no avail.

You need to convert your data from wide to long format and then use ggplot. See below:
library(tidyverse)
dt %>%
  # wide to long: one row per Team/Year pair
  pivot_longer(-Team, values_to = "W", names_to = "Year") %>%
  # turn "W15" into the integer 2015
  mutate(Year = as.integer(gsub("W", "20", Year))) %>%
  ggplot(aes(x = Year, y = W, group = Team)) +
  geom_line(aes(color = Team))
Data:
dt <- structure(list(Team = c("AC", "AF", "AK", "AL", "AA&M", "Alst", "Alb"),
W15 = c(7L, 12L, 20L, 18L, 8L, 17L, 24L),
W16 = c(9L, 12L, 25L, 18L, 10L, 12L, 23L),
W17 = c(13L, 12L, 27L, 19L, 2L, 8L, 21L),
W18 = c(16L, 12L, 14L, 20L, 3L, 8L, 22L),
W19 = c(27L, 14L, 17L, 18L, 5L, 12L, 12L)),
class = "data.frame", row.names = c(NA, -7L))
# Team W15 W16 W17 W18 W19
# 1 AC 7 9 13 16 27
# 2 AF 12 12 12 12 14
# 3 AK 20 25 27 14 17
# 4 AL 18 18 19 20 18
# 5 AA&M 8 10 2 3 5
# 6 Alst 17 12 8 8 12
# 7 Alb 24 23 21 22 12
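For readers without the tidyverse, the same wide-to-long step can be sketched in base R. This is only a minimal sketch on a cut-down copy of the data above; the object names stacked and long are made up for illustration.

```r
# Cut-down copy of the example data (three teams, three years)
dt <- data.frame(
  Team = c("AC", "AF", "AK"),
  W15 = c(7L, 12L, 20L),
  W16 = c(9L, 12L, 25L),
  W17 = c(13L, 12L, 27L)
)

# stack() puts all W columns below each other:
# `values` holds the numbers, `ind` the name of the source column
stacked <- stack(dt[-1])

long <- data.frame(
  # the team labels repeat once per stacked column
  Team = rep(dt$Team, times = ncol(dt) - 1),
  # "W15" -> 2015
  Year = as.integer(sub("W", "20", as.character(stacked$ind))),
  W = stacked$values
)
long
```

The resulting long data frame has one row per Team/Year pair, which is the shape ggplot expects for aes(x = Year, y = W, group = Team).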

Create a zoo object z from t(dt[-1]), taking the times from the numeric part of the column names. Use dt$TEAM as its column names. Finally, use autoplot.zoo to plot it with ggplot2. Remove facet = NULL if you prefer a separate panel for each series.
library(ggplot2)
library(zoo)
z <- zoo(t(dt[-1]), as.numeric(sub("W", "", names(dt)[-1])))
names(z) <- dt$TEAM
autoplot(z, facet = NULL) + scale_x_continuous(breaks = time(z))
Note
Suppose this input data:
set.seed(123)
dt <- data.frame(TEAM = letters[1:5], W15 = rnorm(5), W16 = rnorm(5), W17 = rnorm(5))
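To see what is passed to zoo() here, the two pieces can be built on their own. A minimal sketch using the Note data; m and years are illustrative names, and adding 2000 to get four-digit years is an assumption about what the W suffixes mean.

```r
set.seed(123)
dt <- data.frame(TEAM = letters[1:5], W15 = rnorm(5), W16 = rnorm(5), W17 = rnorm(5))

# Transpose so that each row is a year and each column is a team
m <- t(dt[-1])
colnames(m) <- dt$TEAM

# Numeric part of the column names; + 2000 is an assumption (W15 -> 2015)
years <- 2000 + as.numeric(sub("W", "", rownames(m)))

dim(m)   # one row per year, one column per team
years
```

Passing years instead of the two-digit values as the second argument of zoo() should put full years on the x-axis.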

Related

removing columns based on segment of column names

I have a dataframe with many columns (close to 100) that I don't need; they all have "CNT" in the middle of their names. Below is a short example:
id drink drink_CNT_v2 sage_CNT_v5
1 12 23 12
2 14 32 13
3 15 12 12
4 16 12 43
5 20 50 23
I want to remove all variables that have CNT in the middle. Does anyone know how I could do that? I tried using mutate from the tidyverse, but that didn't work.
We could use contains in select
library(dplyr)
df2 <- df1 %>%
select(-contains("_CNT_"))
-output
df2
id drink
1 1 12
2 2 14
3 3 15
4 4 16
5 5 20
data
df1 <- structure(list(id = 1:5, drink = c(12L, 14L, 15L, 16L, 20L),
drink_CNT_v2 = c(23L, 32L, 12L, 12L, 50L), sage_CNT_v5 = c(12L,
13L, 12L, 43L, 23L)), class = "data.frame", row.names = c(NA,
-5L))
In base R, with grepl:
df[!grepl("CNT", colnames(df))]
Also works with select (use grep):
df %>%
select(-grep("CNT", names(.)))
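One thing to watch with the shorter pattern "CNT" is that it also matches names where CNT is not a middle segment. A minimal base R sketch of the stricter match, assuming a hypothetical extra column CNTRY that should be kept:

```r
# Small copy of the example data plus a hypothetical CNTRY column
df1 <- data.frame(
  id = 1:3,
  drink = c(12L, 14L, 15L),
  drink_CNT_v2 = c(23L, 32L, 12L),
  sage_CNT_v5 = c(12L, 13L, 12L),
  CNTRY = c("a", "b", "c")
)

# fixed = TRUE treats "_CNT_" as a literal string, not a regular expression
keep <- !grepl("_CNT_", names(df1), fixed = TRUE)

# drop = FALSE keeps a data frame even if only one column survives
df2 <- df1[, keep, drop = FALSE]
names(df2)
```

With the pattern "CNT" alone, the CNTRY column would be dropped as well.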

R adding columns and data

I have a table with two columns, A and B. I want to create a new table with two added columns, X and Y. Both take their values from column A, but only from every second row: X starts from the first value in column A, and Y starts from the second value in column A.
So far I have been doing this in Excel, but now I need it in R, ideally as a function, so that I can easily reuse the code. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample result:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c(2L,
NA, 5L, NA, 54L, NA, 34L, NA, 10L, NA), Y = c(NA, 7L, NA, 11L,
NA, 12L, NA, 14L, NA, 6L)), class = "data.frame", row.names = c(NA,
-10L))
It is not a super elegant solution, but it works:
exampleDF <- structure(list(A = c(2L, 7L, 5L, 11L, 54L,
12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L,
32L, 19L, 24L, 44L, 37L)),
class = "data.frame", row.names = c(NA, -10L))
index <- seq(from = 1, to = nrow(exampleDF), by = 2)
exampleDF$X <- NA
exampleDF$X[index] <- exampleDF$A[index]
exampleDF$Y <- exampleDF$A
exampleDF$Y[index] <- NA
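Since the question asks for something reusable, the index-based steps above can be wrapped in a function. This is a minimal sketch; the function name add_alternating and its argument names are made up for illustration, and it assumes the data frame has at least one row.

```r
# Hypothetical helper: copy every second value of `source` into two new columns
add_alternating <- function(data, source = "A", first = "X", second = "Y") {
  index <- seq(from = 1, to = nrow(data), by = 2)  # rows 1, 3, 5, ...
  data[[first]] <- NA
  data[[first]][index] <- data[[source]][index]    # odd rows go to `first`
  data[[second]] <- data[[source]]
  data[[second]][index] <- NA                      # even rows stay in `second`
  data
}

small <- data.frame(A = c(2L, 7L, 5L, 11L), B = c(3L, 5L, 1L, 21L))
res <- add_alternating(small)
res
```

Column names are passed as strings here so that the function needs no packages.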
You could also make use of the row numbers and the modulo operator:
A simple ifelse way:
library(dplyr)
df |>
mutate(X = ifelse(row_number() %% 2 == 1, A, NA),
Y = ifelse(row_number() %% 2 == 0, A, NA))
Or using pivoting:
library(dplyr)
library(tidyr)
df |>
mutate(name = ifelse(row_number() %% 2 == 1, "X", "Y"),
value = A) |>
pivot_wider()
A function using the first approach could look like:
xy_fun <- function(data, A = A, X = X, Y = Y) {
data |>
mutate({{X}} := ifelse(row_number() %% 2 == 1, {{A}}, NA),
{{Y}} := ifelse(row_number() %% 2 == 0, {{A}}, NA))
}
xy_fun(df, # Your data
A, # The col to take values from
X, # The column name of the first new column
Y # The column name of the second new column
)
Output:
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
Data stored as df:
df <- structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)
),
class = "data.frame",
row.names = c(NA, -10L)
)
I like @harre's approach.
Another option, in base R, is to use R's recycling of a shorter vector along a longer one:
df$X <- df$A
df$Y <- df$A                  # both new columns take their values from A
df$X[c(FALSE, TRUE)] <- NA    # blank out every even row
df$Y[c(TRUE, FALSE)] <- NA    # blank out every odd row
df
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
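The recycling that this answer relies on can be seen on a plain vector. A small sketch with made-up object names: a logical index shorter than the vector is repeated until it covers every position.

```r
x <- c(2L, 7L, 5L, 11L, 54L)

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, TRUE
odd_values <- x[c(TRUE, FALSE)]

# c(FALSE, TRUE) is recycled to FALSE, TRUE, FALSE, TRUE, FALSE
even_values <- x[c(FALSE, TRUE)]

# The same index works on the left-hand side of an assignment
masked <- x
masked[c(FALSE, TRUE)] <- NA

odd_values
even_values
masked
```

The index also works when the vector length is odd, as here, because recycling simply stops at the last element.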

How to create a barplot in R for multiple variables and multiple groups?

I want to compare the means of the variables in a barplot.
This is a portion of my dataframe.
Group Gender Age Anxiety_score Depression_score IUS OBSC
1 Anxiety 0 25 32 29 12
2 Anxiety 1 48 34 28 11
3 Anxiety 0 32 48 32 12
4 Anxiety 1 24 43 26 12
5 Anxiety 1 18 44 26 15
6 Control 0 45 12 11 3
7 Control 0 44 11 11 5
8 Control 1 26 21 10 5
9 Control 1 38 12 NA 2
10 Control 0 18 13 10 1
I'd like to create a barplot where each variable (Gender, Age, Anxiety_score, depression_score, IUS, ...) represents a bar and I'd like to have this for each group (anxiety vs control next to each other, not stacked) on the same graph. The height of the bar would represent the mean. For gender, I'd like to have the gender ratio. I also want to map the variables on the y axis. How do I do this in R?
This type of problem generally comes down to reshaping the data: it is in wide format and should be in long format. See this post on how to reshape the data from wide to long format.
Then, group by Group and name, compute the means and plot.
library(dplyr)
library(tidyr)
library(ggplot2)
df1 %>%
pivot_longer(-Group) %>%
group_by(Group, name) %>%
# na.rm = TRUE because Depression_score contains an NA
summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(name, value, fill = Group)) +
geom_col(position = position_dodge()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Data
df1 <-
structure(list(Group = c("Anxiety", "Anxiety", "Anxiety", "Anxiety",
"Anxiety", "Control", "Control", "Control", "Control", "Control"
), Gender = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L), Age = c(25L,
48L, 32L, 24L, 18L, 45L, 44L, 26L, 38L, 18L), Anxiety_score = c(32L,
34L, 48L, 43L, 44L, 12L, 11L, 21L, 12L, 13L), Depression_score = c(29L,
28L, 32L, 26L, 26L, 11L, 11L, 10L, NA, 10L), IUS = c(12L, 11L,
12L, 12L, 15L, 3L, 5L, 5L, 2L, 1L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
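To check the numbers behind the bars, the group means can also be computed in base R. This is a minimal sketch on a subset of the columns above; the object name means is made up. Because Gender is coded 0/1, its mean is the share of rows coded 1, which may serve as the gender ratio the question asks about (an assumption about the coding).

```r
df1 <- data.frame(
  Group = rep(c("Anxiety", "Control"), each = 5),
  Gender = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L),
  Age = c(25L, 48L, 32L, 24L, 18L, 45L, 44L, 26L, 38L, 18L),
  Depression_score = c(29L, 28L, 32L, 26L, 26L, 11L, 11L, 10L, NA, 10L)
)

# One mean per Group and column; na.rm = TRUE is passed on to mean()
# so the single NA in Depression_score is ignored rather than propagated
means <- aggregate(df1[-1], by = list(Group = df1$Group), FUN = mean, na.rm = TRUE)
means
```

The non-formula interface of aggregate() is used on purpose: the formula interface would by default drop the whole row containing the NA, which would also change the Age and Gender means.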
Are you looking for something like this?
library(tidyverse)
df %>%
pivot_longer(
-Group
) %>%
group_by(Group, name) %>%
summarise(Mean=mean(value, na.rm=TRUE)) %>%
ggplot(aes(x=factor(Group), y=Mean, fill=name))+
geom_col(aes(group=name), position = "dodge") +
geom_text(
aes(label = Mean, y = Mean + 0.05),
position = position_dodge(0.9),
vjust = 0
)

Finding columns that contain values based on another column

I have the following data frame:
Step 1 2 3
1 5 10 6
2 5 11 5
3 5 13 9
4 5 15 10
5 13 18 10
6 15 20 10
7 17 23 10
8 19 25 10
9 21 27 13
10 23 30 7
I would like to retrieve the columns that satisfy one of the following conditions: if step 1 = step 4 or step 4 = step 8. In this case, column 1 and 3 should be retrieved. Column 1 because the value at Step 1 = value at step 4 (i.e., 5), and for column 3, the value at step 4 = value at step 8 (i.e., 10).
I don't know how to do that in R. Can someone help me please?
The following code returns a logical matrix that marks the columns satisfying either condition:
df[1, -1] == df[4, -1] | df[4, -1] == df[8, -1]
# X1 X2 X3
# 1 TRUE FALSE TRUE
# data
df <- structure(list(Step = 1:10, X1 = c(5L, 5L, 5L, 5L, 13L, 15L,
17L, 19L, 21L, 23L), X2 = c(10L, 11L, 13L, 15L, 18L, 20L, 23L,
25L, 27L, 30L), X3 = c(6L, 5L, 9L, 10L, 10L, 10L, 10L, 10L, 13L,
7L)), class = "data.frame", row.names = c(NA, -10L))
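To go from the TRUE/FALSE result to the columns themselves, one option is to compare named vectors and subset by name. A minimal sketch; s1, s4, s8, keep and cols are illustrative names, and the rows are looked up by the value of Step rather than by row position.

```r
df <- data.frame(
  Step = 1:10,
  X1 = c(5L, 5L, 5L, 5L, 13L, 15L, 17L, 19L, 21L, 23L),
  X2 = c(10L, 11L, 13L, 15L, 18L, 20L, 23L, 25L, 27L, 30L),
  X3 = c(6L, 5L, 9L, 10L, 10L, 10L, 10L, 10L, 13L, 7L)
)

# One named vector per step of interest (names are the column names)
s1 <- unlist(df[df$Step == 1, -1])
s4 <- unlist(df[df$Step == 4, -1])
s8 <- unlist(df[df$Step == 8, -1])

# TRUE where step 1 equals step 4, or step 4 equals step 8
keep <- s1 == s4 | s4 == s8
cols <- names(keep)[keep]

cols
df[, cols, drop = FALSE]
```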

Calculating top 4 of column 1 by column 2 - R

I'm new to R and, to be honest, don't know what to call what I'm looking for :)
I have a data set "ds" with 2 columns:
D | res
==========
Ds 20
Dx 23
Dp 1
Ds 12
Ds 23
Ds 54
Dn 65
Ds 122
Dx 11
Dx 154
Dx 18
Do 4
Df 17
Dp 5
Dp 107
Dp 8
Df 3
Dp 33
Dd 223
Dc 7
Dv 22
Du 34
Dh 22
Ds 12
Dy 78
Dd 128
I need to calculate the top 4 values of column "D" by the sum of "Res", so the desired result would look like:
D | Res
========
Dd 351
Dp 154
Ds 243
Dx 206
and as a percentage of the total:
D | % Of Total
==========
Dd 29.10%
Dp 12.77%
Ds 20.15%
Dx 17.08%
Thanks
We can use aggregate() to obtain the sum of each type of "D", and we can introduce a new column to account for the edit of the OP and include also the percentage.
In order to display the result in the desired form, we can apply the order() function to rearrange the rows according to the value of Res. The function rev() in this case ensures that the highest value is put on top, and head() with the parameter 4 displays the first four rows.
summarized <- aggregate(Res ~. , df1, sum)
summarized$Perc <- with(summarized, paste0(round(Res/sum(Res)*100,2),"%"))
head(summarized[rev(order(summarized$Res)),],4)
D Res Perc
2 Dd 351 29.1%
8 Ds 243 20.15%
11 Dx 206 17.08%
7 Dp 154 12.77%
data
df1 <- structure(list(D = structure(c(8L, 11L, 7L, 8L, 8L, 8L, 5L,
8L, 11L, 11L, 11L, 6L, 3L, 7L, 7L, 7L, 3L, 7L, 2L, 1L, 10L, 9L,
4L, 8L, 12L, 2L), .Label = c("Dc", "Dd", "Df", "Dh", "Dn", "Do",
"Dp", "Ds", "Du", "Dv", "Dx", "Dy"), class = "factor"), Res = c(20L,
23L, 1L, 12L, 23L, 54L, 65L, 122L, 11L, 154L, 18L, 4L, 17L, 5L,
107L, 8L, 3L, 33L, 223L, 7L, 22L, 34L, 22L, 12L, 78L, 128L)),
.Names = c("D", "Res"), class = "data.frame", row.names = c(NA, -26L))
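The same sum, sort and percentage steps can be sketched with plain vectors. The values below are copied from the question; the object names totals, top4 and pct are made up for illustration.

```r
D <- c("Ds", "Dx", "Dp", "Ds", "Ds", "Ds", "Dn", "Ds", "Dx", "Dx", "Dx", "Do", "Df",
       "Dp", "Dp", "Dp", "Df", "Dp", "Dd", "Dc", "Dv", "Du", "Dh", "Ds", "Dy", "Dd")
Res <- c(20, 23, 1, 12, 23, 54, 65, 122, 11, 154, 18, 4, 17, 5,
         107, 8, 3, 33, 223, 7, 22, 34, 22, 12, 78, 128)

# Sum Res within each value of D (named numeric vector)
totals <- sapply(split(Res, D), sum)

# Largest four sums, highest first
top4 <- head(sort(totals, decreasing = TRUE), 4)

# Share of the grand total, in percent
pct <- round(100 * top4 / sum(totals), 2)

top4
pct
```

Note that the percentages are taken relative to the total over all groups, not only the top four.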
If you mean to sum Res per D and then select the top 4 sums (assuming you made mistakes calculating the sums for ds and dp) you could try:
library(dplyr)
df1 %>% mutate(per = Res/sum(Res)) %>% group_by(D) %>% summarise(Res = sum(Res), perc = sum(per)) %>% top_n(4, Res)
Source: local data frame [4 x 3]
D Res perc
(fctr) (int) (dbl)
1 Dd 351 0.2910448
2 Dp 154 0.1276949
3 Ds 243 0.2014925
4 Dx 206 0.1708126
Option using data.table (with the df1 data from above):
library(data.table)
# setDT() converts df1 to a data.table by reference
out <- setorder(setDT(df1)[, .(tmp = sum(Res)), by = D]
[, .(D, ptg = (tmp / sum(tmp)) * 100)], -ptg)[1:4, ]
#> out
# D ptg
#1: Dd 29.10448
#2: Ds 20.14925
#3: Dx 17.08126
#4: Dp 12.76949
