Does anyone know how I can loop pipe-operator code over multiple data frames?
I have quite a few data frames named by year (df_1990, df_1991 ... df_2020). However, not all years are included (i.e. df_1993, df_2012 and 3 more years are not available). To account for this, I manually created a list to store all the data frames for the looping (do enlighten me if there's a faster way to do this).
df_list = list(df_1990, df_1991, ..., df_2020)
for (i in df_list) {
...
}
The data frames are pretty simple, with just 2 columns: Item (character field) and Cost (numeric field).
Item    Cost
Book_A  3.00
Book_B  5.00
...     ...
Sample code for the data frame:
df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L,
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A",
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12,
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5,
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))
Does anyone know how I can add the following code into the ... portion of the for loop above? Thank you!
df %>%
group_by(Item) %>%
summarise(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset) %>%
left_join(df, by = 'Item') %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
The code basically detects outliers (e.g. if the cost is more than 60% higher or lower than the average for the particular item) and outputs a column of "Y" and "N" for each row. (Credit for the code goes to Ronak Shah.)
Ideally, the new column should appear in the data frames in the list, to allow exporting to Excel format.
Thank you!
Personally, I would move the data wrangling code into a function and then use lapply to loop over your list of data frames.
library(dplyr)
df_list <- list(df, df, df)
prep_data <- function(x) {
x %>%
group_by(Item) %>%
summarise(
outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset
) %>%
left_join(x, by = "Item") %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, "Y", "N"))
}
df_prep <- lapply(df_list, prep_data)
lapply(df_prep, head, 2)
#> [[1]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[2]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[3]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
If you want to do it via a for loop then you could achieve the same result like so:
df_prep <- list()
for (i in seq_along(df_list)) {
df_prep[[i]] <- prep_data(df_list[[i]])
}
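On the side question of building the list faster: one option, sketched below in base R, is to collect every object whose name matches the df_<year> pattern with ls() and mget(), so missing years are simply skipped. The toy data frames and the CSV export (instead of a real Excel file) are assumptions for illustration only.

```r
# Toy stand-ins for the yearly data frames (hypothetical values)
df_1990 <- data.frame(Item = c("Book_A", "Book_B"), Cost = c(3, 5))
df_1991 <- data.frame(Item = c("Book_A", "Book_B"), Cost = c(4, 6))
df_1994 <- data.frame(Item = c("Book_A", "Book_B"), Cost = c(2, 7))

# Find every object named df_<4 digits> and collect them in a named list;
# years without a data frame are skipped because no such object exists
df_names <- ls(pattern = "^df_[0-9]{4}$")
df_list <- mget(df_names)

# Write one CSV per data frame, reusing the list names as file names
out_dir <- tempdir()
csv_paths <- file.path(out_dir, paste0(names(df_list), ".csv"))
for (i in seq_along(df_list)) {
  write.csv(df_list[[i]], csv_paths[i], row.names = FALSE)
}
```

Because the list is named, each element keeps its year, and the same names can be reused as file names. A package such as writexl could write an actual .xlsx file instead of CSVs.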
Why don't you put all your data into one dataframe:
df_list = list(df_1990 = df_1990, df_1991 = df_1991, ..., df_2020 = df_2020)
df2 = dplyr::bind_rows(df_list, .id = 'Year')
then you only have to add the variable Year into the group_by statement:
group_by(Year, Item)
If you need to, you can always convert it back to a list of dataframes:
df2 %>%
tidyr::nest(data = Item:Cost) %>%
pull(data, name = Year)
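For comparison, the same round trip (stack the list with a Year column, then split it back into a list) can be sketched in base R. The two small data frames are made-up stand-ins, and Map()/split() are used here in place of bind_rows and nest.

```r
# Hypothetical named list of yearly data frames
df_list <- list(
  df_1990 = data.frame(Item = c("Book_A", "Book_B"), Cost = c(3, 5)),
  df_1991 = data.frame(Item = c("Book_A", "Book_B"), Cost = c(4, 6))
)

# Add a Year column taken from the list names, then stack everything
with_year <- Map(function(d, nm) cbind(Year = nm, d), df_list, names(df_list))
df2 <- do.call(rbind, with_year)
rownames(df2) <- NULL

# Split back into a named list of data frames, one per Year
df_split <- split(df2[c("Item", "Cost")], df2$Year)
```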
Btw, you can also improve the code for the outlier detection, by omitting the join:
df2 %>%
group_by(Year, Item) %>%
mutate(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset) %>%
transmute(Item, Cost, Outlier = if_else(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
Using mutate instead of summarise copies the result of mean(Cost) to every row of the group, so no join is needed.
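The "group mean repeated on every row" idea can also be illustrated in base R with ave(). This is only a sketch on five made-up rows, not a replacement for the dplyr code above; the 0.6 factor is the one from the question.

```r
# Small made-up sample with the same columns as in the question
df <- data.frame(
  Item = c("Book A", "Book A", "Book A", "Book B", "Book B"),
  Cost = c(5, 3.5, 12, 6, 8)
)

# ave() returns the group mean repeated for every row of its group
grp_mean <- ave(df$Cost, df$Item, FUN = mean)
offset <- grp_mean * 0.6

# Flag rows that fall outside mean +/- 60% of the mean
df$Outlier <- ifelse(
  df$Cost < grp_mean - offset | df$Cost > grp_mean + offset,
  "Y", "N"
)
df
```

Here only the 12 for Book A is flagged, since the Book A mean is about 6.83 and the upper bound is about 10.93.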
I have a data frame with one row per ID and several parameters for each ID as columns. Among those parameters are the columns "weight at age 1", "weight at age 2", "weight at age 3" and "population".
For each population, I would like to create its own scatter plot with age on the x axis and weight on the y axis. Ideally, all populations would be layered on the same final graph.
How do I do that?
Thanks!
An example of my data:
ID  POPULATION  weight at age 1  weight at age 2  weight at age 3
1   A           13.37            14.15            17.36
2   A           5.19             15.34            NA
3   B           7.68             6.92             19.42
4   B           6.96             15.12            36.39
5   C           10.35            8.86             26.33
I attempted to interpret your question.
library(tidyverse)
#pivot data into long format
df <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L, 2L, 3L, 4L, 5L),
POPULATION = c("A", "A", "B", "B", "C"),
weight.at.age.1 = c(13.37, 5.19, 7.68, 6.96, 10.35),
weight.at.age.2 = c(14.15, 15.34, 6.92, 15.12, 8.86),
weight.at.age.3 = c(17.36, NA, 19.42, 36.39, 26.33)
) %>%
pivot_longer(cols = weight.at.age.1:weight.at.age.3,
names_to = 'age',
values_to = 'weight') %>%
mutate(age = str_remove(age, 'weight.at.age.'))
#plot data
ggplot(data = df,
mapping = aes(x = age,
y = weight))+
geom_point()+
facet_wrap(~POPULATION)
You can reshape your dataframe into a long-format, and then use facet_wrap to create one plot for each population:
library(tidyverse)
df <- expand_grid(population = LETTERS[1:3], age = 1:10, id = 1:3) %>% mutate(weight = rgamma(n(), 1) * 10) %>%
pivot_wider(names_from = age, names_prefix = "weight at ", values_from = weight) %>%
mutate(id = row_number())
df_long <- df %>% pivot_longer(starts_with("weight at "), names_to = "age", values_to = "weight") %>%
mutate(age = as.integer(str_extract(age, "\\d+")))
ggplot(df_long, aes(age, weight)) + geom_point() + facet_wrap(~ population)
Created on 2022-06-09 by the reprex package (v2.0.1)
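Since the question also asks for all populations layered on one graph, a possible variation is to map POPULATION to colour instead of faceting. The sketch below assumes the tidyr and ggplot2 packages are installed and reuses the example data from the question with dots in the column names.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  ID = 1:5,
  POPULATION = c("A", "A", "B", "B", "C"),
  weight.at.age.1 = c(13.37, 5.19, 7.68, 6.96, 10.35),
  weight.at.age.2 = c(14.15, 15.34, 6.92, 15.12, 8.86),
  weight.at.age.3 = c(17.36, NA, 19.42, 36.39, 26.33)
)

# Long format: one row per ID and age; the column-name prefix is stripped
df_long <- pivot_longer(
  df,
  cols = starts_with("weight.at.age."),
  names_to = "age",
  names_prefix = "weight.at.age.",
  values_to = "weight"
)
df_long$age <- as.integer(df_long$age)

# One panel, populations distinguished by colour
p <- ggplot(df_long, aes(x = age, y = weight, color = POPULATION)) +
  geom_point(na.rm = TRUE) +
  scale_x_continuous(breaks = 1:3)
```

Printing p draws the plot. Adding facet_wrap(~POPULATION) to it gives one panel per population again.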
I have a df that looks like
ID                                            Year
Nation, Nation - NA, Economy, Economy - Asia  2008
Economy, Economy - EU, State, Nation          2009
I would like to extract the frequencies of the ID's so that it looks like
Nation  Economy  State  Year
2       2        0      2008
1       2        1      2009
For ID's that have hyphens like "Economy - EU", I am only interested in counting this as a frequency of "Economy"
My end goal is to plot this df by year with the frequency counts of different ID's in the same plot. So for example, "State" would be a green dot in 2008, "Nation" would be a red dot in 2008, and "Economy" would be a blue dot in 2008.
If the second df is not a good way to do this, I am also open to suggestions! That was just my first thought on how to start this.
I will post this as a separate question if this is not appropriate here, but my next question is how to plot the frequencies in the second df by year, as mentioned above.
Thank you!
You can split the data into separate rows using separate_rows, splitting on a comma (,). Then separate the value after the - into a different column, count the occurrences of each ID value in each Year, and reshape the data to wide format.
library(dplyr)
library(tidyr)
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
count(Year, ID) %>%
pivot_wider(names_from = ID, values_from = n, values_fill = 0)
# Year Economy Nation State
# <int> <int> <int> <int>
#1 2008 2 2 0
#2 2009 2 1 1
You can also reduce the code by using janitor::tabyl.
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
janitor::tabyl(Year, ID)
data
df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia",
"Economy, Economy - EU, State, Nation"), Year = 2008:2009),
class = "data.frame", row.names = c(NA, -2L))
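If you would rather avoid extra packages, the same split, trim and count steps can be sketched in base R with strsplit(), sub() and table(). This is an alternative illustration under that assumption, not part of the dplyr/tidyr approach above.

```r
df <- data.frame(
  ID = c("Nation, Nation - NA, Economy, Economy - Asia",
         "Economy, Economy - EU, State, Nation"),
  Year = 2008:2009,
  stringsAsFactors = FALSE
)

# One element per comma-separated entry, keeping track of the Year
parts <- strsplit(df$ID, ",\\s*")
long <- data.frame(
  Year = rep(df$Year, lengths(parts)),
  ID = unlist(parts),
  stringsAsFactors = FALSE
)

# Drop everything from the hyphen on, so "Economy - EU" becomes "Economy"
long$ID <- sub("\\s*-.*$", "", long$ID)

# Cross-tabulate Year against ID; absent combinations are counted as 0
counts <- table(long$Year, long$ID)
counts
```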
We could use str_count to count the strings and summarise by Year
Bring the data in long format with pivot_longer for ggplot
Use ggplot for barchart (basic version demonstrated)
library(tidyverse)
# table
df <- df %>%
group_by(Year) %>%
summarise(Nation = str_count(ID, "Nation"),
Economy = str_count(ID, "Economy"),
State = str_count(ID,"State"))
df
# preparation for plotting
df1 <- df %>%
pivot_longer(
cols = -Year,
names_to = "names",
values_to = "values"
)
# plot
ggplot(df1, aes(x = factor(names), y=values, fill=factor(Year), label=values)) +
geom_col(position=position_dodge())+
geom_text(size = 4, position =position_dodge(1),vjust=-.5)
Output:
Year Nation Economy State
* <dbl> <int> <int> <int>
1 2008 2 2 0
2 2009 1 2 1
I think Ronak has nailed it completely, but as you mentioned in the question that your ultimate goal is to plot, I think there is no need for pivot_wider.
library(tidyverse)
df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia",
"Economy, Economy - EU, State, Nation"), Year = 2008:2009),
class = "data.frame", row.names = c(NA, -2L))
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
count(Year, ID) %>%
ggplot(aes(x= as.factor(Year), y = n, color = ID)) +
geom_col(position = 'dodge') +
coord_flip()
OR
df %>%
separate_rows(ID, sep = ',\\s*') %>%
separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
count(Year, ID) %>%
ggplot(aes(x= as.factor(Year), y = n, color = ID, label = paste(ID, n, sep = '-'))) +
geom_col(position = 'dodge') +
geom_text(size = 2, position =position_dodge(0.9), vjust = -0.5)
Created on 2021-05-27 by the reprex package (v2.0.0)
I have some patient data, where the individual patients change treatment groups over time. My goal is to visualize the sequence of group changes and aggregate this data into a "sequence profile" for each treatment group.
For each treatment group, I would like to show when it generally occurs in the treatment cycle (say, rather at the beginning or at the end). To account for the differing sequence lengths, I would like to standardize these profiles between 0 (very beginning) and 1 (end).
I would like to find an efficient way to prepare and visualize the data.
Minimal Example
Structure of Data
library(dplyr)
library(purrr)
library(ggplot2)
# minimal data
cj_df_raw <- tibble::tribble(
~id, ~group,
1, "A",
1, "B",
2, "A",
2, "B",
2, "A"
)
# compute "intervals" for each person [start, end]
cj_df_raw %>%
group_by(id) %>%
mutate(pos = row_number(),
len = length(id),
start = (pos - 1) / len,
end = pos / len) %>%
filter(group == "A")
#> # A tibble: 3 x 6
#> # Groups: id [2]
#> id group pos len start end
#> <dbl> <chr> <int> <int> <dbl> <dbl>
#> 1 1 A 1 2 0 0.5
#> 2 2 A 1 3 0 0.333
#> 3 2 A 3 3 0.667 1
(So Id 1 was in group A in the first 50% of their sequence, and Id 2 was in group A in the first 33% and the last 33% of their sequence. This means that 2 Ids were in group A between 0-33% of the sequence, 1 between 33-50%, 0 between 50-66% and 1 above 66%.)
This is the result I would like to achieve, but I have not found a way to transform my data effectively.
Desired outcome
profile_treatmen_a <- tibble::tribble(
~x, ~y,
0, 0L,
0.33, 2L,
0.5, 1L,
0.66, 0L,
1, 1L,
1, 0L
)
profile_treatmen_a %>%
ggplot(aes(x, y)) +
geom_step(direction = "vh") +
expand_limits(x = c(0, 1), y = 0)
(Ideally the area under the curve would be shaded)
Ideal solution: via ggridges
The goal of the visualization would be to compare the "sequence-profile" of many treatment-groups at the same time. If I could prepare the data accordingly, I would like to use the ggridges-package for a striking visual comparison of the treatment groups.
library(ggridges)
data.frame(group = rep(letters[1:2], each=20),
mean = rep(2, each=20)) %>%
mutate(count = runif(nrow(.))) %>%
ggplot(aes(x=count, y=group, fill=group)) +
geom_ridgeline(stat="binline", binwidth=0.5, scale=0.9)
You could build helper intervals and then just plot a histogram. Since each patient is in either group A or B, both groups sum to 100%. With these helper intervals you could also easily switch to other geoms.
library(tidyverse, warn.conflicts = FALSE)
library(ggplot2)
# create sample data
set.seed(42)
id <- 1:10 %>% map(~ rep(x = .x, times = runif(n = 1, min = 1, max = 6))) %>%
unlist()
group <- sample(x = c("A", "B"), size = length(id), replace = TRUE) %>%
as_factor()
df <- tibble(id, group)
glimpse(df)
#> Observations: 37
#> Variables: 2
#> $ id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
#> $ group <fct> A, B, B, A, A, B, B, A, A, B, B, A, B, B, A, B, A, B, A,...
# tidy data
df <- df %>%
group_by(id) %>%
mutate(from = (row_number() - 1) / n(),
to = row_number() / n()) %>%
ungroup() %>%
rowwise() %>%
mutate(list = seq(from + 1/60, to, 1/60) %>% list()) %>%
unnest(cols = list) # name the list-column explicitly; tidyr >= 1.0.0 warns if cols is omitted
# plot
df %>%
ggplot(aes(x = list, fill = group)) +
geom_histogram(binwidth = 1/60) +
ggthemes::theme_hc()
Created on 2018-09-16 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
Here is my attempt at an answer. Although it is probably not the nicest, fastest or most efficient way, I think it might help you in your efforts.
library(data.table)
# compute "intervals" for each person [start, end]
df <- cj_df_raw %>%
group_by(id) %>%
mutate(pos = row_number(),
len = length(id),
from = (pos - 1) / len,
to = pos / len,
value = 1)
dt <- as.data.table(df)
setkey(dt, from, to)
#create intervals
dt.interval <- data.table(from = seq( from = 0, by = 0.01, length.out = 100),
to = seq( from = 0.01, by = 0.01, length.out = 100))
#perform overlap join on intervals
dt2 <- foverlaps( dt.interval, dt, type = "within", nomatch = NA)[, sum(value), by = c("i.from", "group")]
#some melting and casting to fill in '0' on empty intervals
dt3 <- melt( dcast(dt2, ... ~ group, fill = 0), id.vars = 1 )
#plot
ggplot( dt3 ) +
geom_step( aes( x = i.from, y = value, color = variable ) ) +
facet_grid( .~variable )
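For the small example in the question, the exact step profile for group A can also be computed directly from the interval endpoints, without a fixed grid. This is a minimal base R sketch; the start and end values are copied from the question's output, and the variable names are my own.

```r
# [start, end) intervals of group A from the question's minimal example
start <- c(0, 0, 2 / 3)
end   <- c(1 / 2, 1 / 3, 1)

# Every endpoint is a possible place where the count changes
breaks <- sort(unique(c(start, end)))

# Count how many intervals cover the midpoint of each segment
mid <- (head(breaks, -1) + tail(breaks, -1)) / 2
n_active <- sapply(mid, function(m) sum(start <= m & m < end))

profile <- data.frame(
  from = head(breaks, -1),
  to   = tail(breaks, -1),
  n    = n_active
)
profile
```

The n column reproduces the counts described in the question (2, 1, 0, 1). The from/to columns could feed geom_rect() or geom_step() if a shaded profile is wanted.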
There is a database covering a whole year:
Month Day Time X Y
...
3 1 0 2 4
3 1 1 4 2
3 1 2 7 3
3 1 3 8 8
3 1 4 4 6
3 1 5 1 4
3 1 6 6 6
3 1 7 7 9
...
3 2 0 5 7
3 2 1 7 2
3 2 2 9 3
...
4 1 0 2 8
...
I want to find the maximum value of X for each day and create a plot for each day, starting from the beginning of the day (Time 0) up to this maximum. I tried to use a data frame but I got a bit lost, and the database is quite big, so I'm not sure if this is the best idea.
Any ideas how to do it?
If I understood you correctly, this should work:
Sample dataset:
set.seed(123)
df <- data.frame(Month = sample(c(1:12), 30, replace = TRUE),
Day = sample(c(1:31), 30, replace = TRUE),
Time = sample(c(1:24), 30, replace = TRUE),
x = rnorm(30, mean = 10, sd = 5),
y = rnorm(30, mean = 10, sd = 5))
Using tidyverse (ggplot and dplyr):
library(tidyverse)
df %>%
#Grouping by month and day
group_by(Month, Day) %>%
#Creating new variables for x and y - the max value, and removing values bigger than the max value.
mutate(maxX = max(x, na.rm = TRUE),
maxY = max(y, na.rm = TRUE),
plotX = ifelse(x > maxX, NA, x),
plotY = ifelse(y > maxY, NA, y)) %>%
ungroup() %>%
#Select and gather only the needed variables for the plot
select(Time, plotX, plotY) %>%
gather(plot, value, -Time) %>%
#Plot
ggplot(aes(Time, value, color = plot)) +
geom_point()
You can try a tidyverse approach. Duplicated Times per Day and Month are removed without any ranking (the first occurrence is kept).
library(tidyverse)
set.seed(123)
df <- data.frame(Month = sample(c(1:2), 30, replace = TRUE),
Day = sample(c(1:2), 30, replace = TRUE),
Time = sample(c(1:10), 30, replace = TRUE),
x = rnorm(30, mean = 10, sd = 5),
y = rnorm(30, mean = 10, sd = 5))
df %>%
group_by(Month, Day) %>%
filter(!duplicated(Time)) %>% # remove duplicated "Time"s
filter(x<=max(x) & Time <= Time[x == max(x)]) %>%
ggplot(aes(Time, x)) +
geom_line() +
geom_point(data=. %>% filter(x == max(x)))+
facet_grid(Month~Day, labeller = label_both)
Or try to put all in one plot using different colors
df %>%
group_by(Month, Day) %>%
filter(!duplicated(Time)) %>%
filter(x<=max(x) & Time <= Time[x == max(x)]) %>%
ggplot(aes(Time, x, color = interaction(Month, Day))) +
geom_line() +
geom_point(data=. %>% filter(x == max(x)))
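The truncation step itself (keep each day's rows from Time 0 up to the row where X peaks) can be sketched in base R on the few rows shown in the question. Using which.max() means the first maximum wins if there are ties, which is an assumption on my part.

```r
# Rows for Month 3, Day 1 and Day 2 as shown in the question
df <- data.frame(
  Month = 3,
  Day   = rep(c(1, 2), times = c(8, 3)),
  Time  = c(0:7, 0:2),
  X     = c(2, 4, 7, 8, 4, 1, 6, 7, 5, 7, 9)
)

# One data frame per (Month, Day) combination that actually occurs
by_day <- split(df, list(df$Month, df$Day), drop = TRUE)

# Within each day, sort by Time and keep rows up to the first maximum of X
up_to_max <- lapply(by_day, function(d) {
  d <- d[order(d$Time), ]
  d[seq_len(which.max(d$X)), ]
})

result <- do.call(rbind, up_to_max)
result
```

For Day 1 this keeps Times 0 to 3 (X peaks at 8), and for Day 2 all three rows. The result can then be plotted per day, for example with facet_grid as above.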