I have a long-format dataset in which each row is a separate measurement (the timepoint is given by my "timeline.compressed" variable, which has 8 possible values; see the dput output below).
I now want to check the descriptive statistics of some of my variables (i.e., x1-x3), but for each timepoint separately. I've tried using if, but that gives me the warning that the condition has length > 1.
Does anyone know what code I should use to get summary statistics for each timepoint separately?
dput for table with possible timeline values:
structure(c(7518L, 6178L, 6393L, 5886L, 6121L, 5977L, 7440L,
5886L), .Dim = 8L, .Dimnames = structure(list(c("5", "16", "28",
"40", "52", "64", "79", "95")), .Names = ""), class = "table")
dput for example dataset:
structure(list(nomem_encr = c(800009L, 800009L, 800012L, 800015L,
800015L, 800015L), timeline.compressed = c(79, 95, 79, 28, 40,
52), sel = c(4.9, NA, NA, 6.9, 6.7, NA), close_num = c(1, 0.2,
1, 0.8, 1, 0.8), gener_sat = c(7, 7, 8, 7, 7, 5)), .Names = c("ID",
"timeline.compressed", "x1", "x2", "x3"), row.names = c(NA,
6L), class = "data.frame")
Using dplyr, with timeline_values being your frequency table and df your data, you can do, e.g.:
library(dplyr)
data.frame(timeline.compressed = as.numeric(names(timeline_values))) %>%
  left_join(df, by = "timeline.compressed") %>%
  group_by(timeline.compressed) %>%
  # na.rm = TRUE is passed on to mean()
  summarize_all(mean, na.rm = TRUE)
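As a hedged sketch of per-timepoint descriptives: if() only accepts a single TRUE/FALSE, which is why it complains when given a whole column, whereas group_by() splits the rows by timepoint. The example data frame below is rebuilt from the dput in the question; the choice of mean, SD and non-missing count, and the name stats, are assumptions for illustration.

```r
library(dplyr)

# Example data rebuilt from the question's dput (columns named as in .Names)
df <- data.frame(
  ID = c(800009L, 800009L, 800012L, 800015L, 800015L, 800015L),
  timeline.compressed = c(79, 95, 79, 28, 40, 52),
  x1 = c(4.9, NA, NA, 6.9, 6.7, NA),
  x2 = c(1, 0.2, 1, 0.8, 1, 0.8),
  x3 = c(7, 7, 8, 7, 7, 5)
)

# One row per timepoint; one column per variable/statistic pair (e.g. x1_mean)
stats <- df %>%
  group_by(timeline.compressed) %>%
  summarise(
    across(
      x1:x3,
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        sd   = ~ sd(.x, na.rm = TRUE),
        n    = ~ sum(!is.na(.x))  # number of non-missing values
      )
    ),
    .groups = "drop"
  )

print(stats)
```

Timepoints that have no rows in df do not appear in stats; the left_join() against the frequency table in the answer above is what adds them back.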
I would like to plot the evolution of the number of workers per category ("A", "D", "F", "I") from 2017 to 2021 as a stacked bar chart, with one bar per year and a label in the middle of each bar segment for each category. My dataset isn't in the right shape for this; from what I have seen here, I think I need to use pivot_wider() or pivot_longer(), but I don't really know how to use these functions. Could anyone help?
Here is the structure of my dataset, for reproducibility:
structure(list(A = c("10", "7", "8", "8", "9", "Total"), D = c(23,
14, 29, 35, 16, 117), F = c(8, 7, 11, 6, 6, 38), I = c(449, 498,
415, 470, 531, 2363), annee = c("2017", "2018", "2019", "2020",
"2021", NA)), core = structure(list(A = c("10", "7", "8", "8",
"9"), D = c(23, 14, 29, 35, 16), F = c(8, 7, 11, 6, 6), I = c(449,
498, 415, 470, 531)), class = "data.frame", row.names = c(NA,
-5L)), tabyl_type = "two_way", totals = "row", row.names = c(NA,
6L), class = c("tabyl", "data.frame"))
library(tidyverse)
library(ggrepel)
df <- structure(list(A = c("10", "7", "8", "8", "9", "Total"), D = c(
23,
14, 29, 35, 16, 117
), F = c(8, 7, 11, 6, 6, 38), I = c(
449, 498,
415, 470, 531, 2363
), annee = c(
"2017", "2018", "2019", "2020",
"2021", NA
)), core = structure(list(A = c(
"10", "7", "8", "8",
"9"
), D = c(23, 14, 29, 35, 16), F = c(8, 7, 11, 6, 6), I = c(
449,
498, 415, 470, 531
)), class = "data.frame", row.names = c(
NA,
-5L
)), tabyl_type = "two_way", totals = "row", row.names = c(
NA,
6L
), class = c("tabyl", "data.frame"))
df |>
  filter(!is.na(annee)) |>
  mutate(A = as.double(A)) |>
  pivot_longer(-annee, names_to = "category") |>
  ggplot(aes(annee, value, fill = category, label = value)) +
  geom_col() +
  geom_label_repel(position = position_stack(), max.overlaps = 20)
Created on 2022-08-08 by the reprex package (v2.0.1)
Once you remove the total row and ensure that A through I are numeric, you can use pivot_longer() and pass the result to ggplot() like this:
df %>%
  filter(A != "Total") %>%
  mutate(across(A:I, as.numeric)) %>%
  pivot_longer(cols = -annee, names_to = "group", values_to = "ct") %>%
  ggplot(aes(annee, ct, fill = group)) +
  geom_col()
I did not add the category labels, since group I dominates each year; you might want to reconsider that visualization.
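If labels centred in each segment are still wanted, one possible sketch uses geom_text() with position_stack(vjust = 0.5). The data frame below copies the yearly counts from the question without the "Total" row; the names long, workers and p are assumptions for illustration, and the labels for the small categories will be crowded because category I dominates.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Yearly counts copied from the question, without the "Total" row
df <- data.frame(
  A = c(10, 7, 8, 8, 9),
  D = c(23, 14, 29, 35, 16),
  F = c(8, 7, 11, 6, 6),
  I = c(449, 498, 415, 470, 531),
  annee = c("2017", "2018", "2019", "2020", "2021")
)

# Long format: one row per year/category combination
long <- df %>%
  pivot_longer(cols = -annee, names_to = "category", values_to = "workers")

p <- ggplot(long, aes(annee, workers, fill = category)) +
  geom_col() +
  # vjust = 0.5 centres each label within its own bar segment
  geom_text(aes(label = workers), position = position_stack(vjust = 0.5), size = 3)

print(p)
```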
I have a data frame that looks like this:
df1<-structure(list(person = c("a", "a", "a", "a", "b", "b", "b",
"c"), visitID = c(123, 123, 256, 816, 237, 828, 828, 911), v1 = c(10,
5, 15, 8, 95, 41, 31, 16), v2 = c(8, 72, 29, 12, 70, 23, 28,
66), v3 = c(0, 1, 0, 0, 1, 1, 0, 1)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Here, person is the name/ID of the person and visitID is a number generated for each visit. Each visit may have one or more values for each variable (v1, v2, v3). I'm trying to transform this structure so that each person is aggregated into a single row, with visits and variables spread wide, to look like:
df2<-structure(list(person = c("a", "b", "c"), visit1 = c(123, 237,
911), visit2 = c(256, 828, NA), visit3 = c(816, NA, NA), v1.visit1 = c("10,5",
"95", "16"), v1.visit2 = c("15", "41,31", NA), v1.visit3 = c("8",
NA, NA), v2.visit1 = c("8,72", "70", "66"), v2.visit2 = c("29",
"23,28", NA), v2.visit3 = c("12", NA, NA), v3.visit1 = c("0,1",
"1", "1"), v3.visit2 = c("0", "1,0", NA), v3.visit3 = c("0",
NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
Methods I have tried so far:
Method 1:
1- Aggregate by "person", with all other variables separated by commas.
2- Split the variables into multiple columns.
The problem with this method is that I would then not know which value corresponds to which visit, especially since some visits have multiple entries and some do not.
Method 2:
1- Count the occurrences of each visitID and take the maximum number of visits per unique person (3 in the case above).
2- Create 3 columns for each variable.
I didn't know how to proceed from here.
I found a great answer in the thread "Reshape three column data frame to matrix ("long" to "wide" format)",
so I tried working with reshape and pivot_wider but couldn't get it to work.
Any ideas are appreciated, even if they do not lead to the same output.
Thank you
You can try something like this:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  group_by(person, visitID) %>%
  # collect each variable's values per visit into a list-column
  summarise(across(matches("v[0-9]+"), list)) %>%
  group_by(person) %>%
  # number the visits within each person: visit.1, visit.2, ...
  mutate(visit = seq_len(n()) %>% str_c("visit.", .)) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = person,
    names_from = visit,
    values_from = c("visitID", matches("v[0-9]+"))
  )
Replace list with ~str_c(.x, collapse = ",") if you want the values as comma-separated character strings.
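A minimal sketch of the character variant, using the example data from the question. The object name wide and the use of row_number() are assumptions for illustration; the resulting columns are named like v1_visit1 rather than v1.visit1, and visits are numbered in ascending visitID order because summarise() sorts by the grouping columns.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Example data from the question
df1 <- tibble(
  person  = c("a", "a", "a", "a", "b", "b", "b", "c"),
  visitID = c(123, 123, 256, 816, 237, 828, 828, 911),
  v1 = c(10, 5, 15, 8, 95, 41, 31, 16),
  v2 = c(8, 72, 29, 12, 70, 23, 28, 66),
  v3 = c(0, 1, 0, 0, 1, 1, 0, 1)
)

wide <- df1 %>%
  group_by(person, visitID) %>%
  # paste all values recorded during one visit into a single string
  summarise(across(v1:v3, ~ str_c(.x, collapse = ",")), .groups = "drop") %>%
  group_by(person) %>%
  # number the visits within each person
  mutate(visit = str_c("visit", row_number())) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = person,
    names_from = visit,
    values_from = c(visitID, v1, v2, v3)
  )

print(wide)
```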
I have a data frame; here's the code to create it:
structure(list(ethnicity = structure(c(1L, 2L, 3L, 5L), .Label = c("AS",
"BL", "HI", "Others", "WH", "Total"), class = "factor"), `Strongly agree` = c(30.7,
26.2, 37.4, 31.6), Agree = c(43.9, 34.5, 41, 45.4), `Neither agree nor disagree` = c(9.4,
14.3, 8.6, 8.7), Disagree = c(10, 15.5, 9.9, 9.7), `Strongly disagree` = c(6,
9.5, 3.2, 4.6)), row.names = c(NA, -4L), class = "data.frame")
I want to add data bars and display these numbers as percentages. I tried using the formattable library to do that (see my code below).
formattable(df,align=c("l","l","l","l","l","l"),
list(`ethnicity` = formatter("span", style = ~ style(color = "grey", font.weight = "bold"))
,area(col = 2:6) ~ function(x) percent(x / 100, digits = 0)
,area(col = 2:6) ~ color_bar("#DeF7E9")))
I'm facing 2 problems:
The numbers don't appear as a percentage in the table output.
The alignment seems off in the last column i.e
I would really appreciate it if someone could help me understand what I am missing here.
Here is a solution, but it requires a lot of typing. I guess it is possible to use mutate_at(), but I couldn't find out how to pass the column names in the percent() part; using the . produces an error.
This works with a lot of typing:
library(dplyr)
library(formattable)
df %>%
  mutate(`Strongly agree` = color_bar("#DeF7E9")(formattable::percent(`Strongly agree`/100))) %>%
  mutate(`Agree` = color_bar("#DeF7E9")(formattable::percent(`Agree`/100))) %>%
  mutate(`Disagree` = color_bar("#DeF7E9")(formattable::percent(`Disagree`/100))) %>%
  mutate(`Neither agree nor disagree` = color_bar("#DeF7E9")(formattable::percent(`Neither agree nor disagree`/100))) %>%
  mutate(`Strongly disagree` = color_bar("#DeF7E9")(formattable::percent(`Strongly disagree`/100))) %>%
  formattable(.,
              align=c("l","l","l","l","l","l"),
              `ethnicity` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")))
This does not work but might be improved:
df %>%
mutate_at(.vars = 2:6, .funs = color_bar("#DeF7E9")(formattable::percent(./100))) %>%
formattable(...)
Some more information about this "strange" construct, var = color_bar(...)(var): color_bar() returns a formatter function, which is then called on the column.
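One way the shorter version might be written is with across() and a lambda, so that each column is passed in explicitly as .x rather than as a bare dot. This is a sketch under that assumption, not a tested replacement for the long version; the names bars and tbl are hypothetical, and the data frame is rebuilt from the question with check.names = FALSE to keep the spaces in the column names.

```r
library(dplyr)
library(formattable)

# Data rebuilt from the question; check.names = FALSE keeps names with spaces
df <- data.frame(
  ethnicity = factor(c("AS", "BL", "HI", "WH")),
  `Strongly agree` = c(30.7, 26.2, 37.4, 31.6),
  Agree = c(43.9, 34.5, 41, 45.4),
  `Neither agree nor disagree` = c(9.4, 14.3, 8.6, 8.7),
  Disagree = c(10, 15.5, 9.9, 9.7),
  `Strongly disagree` = c(6, 9.5, 3.2, 4.6),
  check.names = FALSE
)

# Apply percent formatting and the colour bar to columns 2 to 6 in one step
bars <- df %>%
  mutate(across(
    2:6,
    ~ color_bar("#DeF7E9")(formattable::percent(.x / 100, digits = 0))
  ))

# Build the table object; print tbl in an interactive session to render it
tbl <- formattable(bars, align = rep("l", ncol(bars)))
```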
I have the following data in messy format:
structure(list(com_level = c("B", "B", "B", "B", "A", "A"),
hf_com = c(1, 1, 1, 1, 1, 1),
sal_level = c("2", "3", "1", "2", "1", "4"),
exp_sal = c(NA, 1, 1, NA, 1, NA)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -6L))
Column com_level is the factor with 2 levels and column hf_com gives the frequency count for that level.
Column sal_level is the factor with 4 levels and column exp_sal gives the frequency count for that level.
I want to create a contingency table similar to this:
structure(list(`1` = c(1L, 2L),
`2` = c(0L, 1L),
`3` = c(0L, 2L),
`4` = c(1L, 0L)),
row.names = c("A", "B"), class = "data.frame")
I have code that works when I want to compare two columns with the same factor:
# step 1 - create table with frequency counts for exp_sal and curr_sal per category of level
cs_es_table <- df_not_na_num %>%
  dplyr::count(sal_level, exp_sal, curr_sal) %>%
  tidyr::spread(key = sal_level, value = n) %>% # this code spreads on just one key
  select(curr_sal, exp_sal, 1, 2, 3, 4, 5, 6, 7, -8) %>% # reorder columns and omit column 8 (no answer)
  as.data.frame()
# step 2 - convert cs_es_table to long format and summarise exp_sal and curr_sal frequencies
cs_es_table <- cs_es_table %>%
  gather(key, value, -curr_sal, -exp_sal) %>% # crucial step to make data long
  mutate(curr_val = ifelse(curr_sal == 1, value, NA),
         exp_val = ifelse(exp_sal == 1, value, NA)) %>% # mutate cleans up the data and assigns a value to each new column for 'exp' and 'curr'
  group_by(key) %>% # for the summary: sum up the previous rows, which are now assigned a key in a new column
  summarise_at(.vars = vars(curr_val, exp_val), .funs = sum, na.rm = TRUE)
This code produces this table but just spreads on one key in step 1:
structure(list(curr_val = c(533L, 448L, 237L, 101L, 56L), exp_val = c(179L,
577L, 725L, 401L, 216L)), row.names = c("< 1000 EUR", "1001-1500 EUR",
"2001-3000 EUR", "3001-4000 EUR", "4001-5000 EUR"), class = "data.frame")
Will I need to use pivot_wider as in this example?
Is it possible to use spread on multiple columns in tidyr similar to dcast?
or
tidyr::spread() with multiple keys and values
Any help would be appreciated to compare the two columns with different factors.
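One possible base R approach, sketched under the assumption that each row should contribute its hf_com count to the cell for its com_level and sal_level, is xtabs(). The names tab and wide are hypothetical, and the counts follow from the six example rows, so they need not match the illustrative table above.

```r
# Example data from the question
df <- data.frame(
  com_level = c("B", "B", "B", "B", "A", "A"),
  hf_com    = c(1, 1, 1, 1, 1, 1),
  sal_level = c("2", "3", "1", "2", "1", "4"),
  exp_sal   = c(NA, 1, 1, NA, 1, NA)
)

# Sum hf_com within every com_level x sal_level cell
tab <- xtabs(hf_com ~ com_level + sal_level, data = df)

# Convert the table to a plain wide data frame (rows A/B, columns 1 to 4)
wide <- as.data.frame.matrix(tab)

print(wide)
```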
I read this SO question and that one, but still could not solve my problem. I have the following data.table, which shows only a few of the columns and rows of my full data.table.
library(data.table)
structure(list(Patient = c("MB108", "MB108", "MB108", "MB108",
"MB108", "MB108", "MB108", "MB108", "MB108", "MB108"), Visit = c(1,
1, 1, 1, 9, 9, 9, 9, 12, 12), Stimulation = c("NC", "SEB", "PPD",
"E6C10", "NC", "SEB", "PPD", "E6C10", "NC", "SEB"), `CD38 ` = c(83.3,
63.4, 83.2, 91.5, 90.9, 70.9, 71, 88.4, 41.7, 47.9)), .Names = c("Patient",
"Visit", "Stimulation", "CD38 "), class = c("data.table", "data.frame"
), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x102806578>)
I would like to do a t.test on column 4, comparing the values where Visit is 1 with those where Visit is 9.
I checked for NAs as well as the length of both columns.
Thanks for any help!
#na.omit(boolean_dt3)
#print(length(unlist(boolean_dt3[Visit== 1,4, with = FALSE])))
#print(length(unlist(boolean_dt3[Visit== 9,4, with = FALSE])))
wilcox.test( unlist(boolean_dt3[Visit== 1,4, with = FALSE])~ unlist(boolean_dt3[Visit== 9,4, with = FALSE]) , paired = T, correct=FALSE)
I just figured out that using a comma (,) instead of ~ works for my problem.
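Here's how to perform a Wilcoxon test on column 4, grouping by Visit:
library(dplyr)
# The column name has a trailing space ("CD38 "), so index it by its exact name
wilcox.test(filter(df, Visit == 1)[["CD38 "]], filter(df, Visit == 9)[["CD38 "]], paired = TRUE)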
Try this formulation:
wilcox.test(numeric_var ~ two_level_group_var)
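To make the difference between the two interfaces concrete, here is a small sketch using the Visit 1 and Visit 9 values from the question. The column is renamed CD38 (without the trailing space) as an assumption for readability, and pairing by position assumes the stimulations are listed in the same order within each visit. In the formula interface the right-hand side must be a grouping variable with exactly two levels, which is why putting a second numeric sample after ~ fails.

```r
# Visit 1 and Visit 9 rows from the question; column renamed to CD38
df <- data.frame(
  Visit = c(1, 1, 1, 1, 9, 9, 9, 9),
  CD38  = c(83.3, 63.4, 83.2, 91.5, 90.9, 70.9, 71, 88.4)
)

x <- df$CD38[df$Visit == 1]
y <- df$CD38[df$Visit == 9]

# Two-vector interface: samples separated by a comma; rows are paired by position
paired_res <- wilcox.test(x, y, paired = TRUE)

# Formula interface: numeric response ~ two-level grouping variable (unpaired here)
unpaired_res <- wilcox.test(CD38 ~ Visit, data = df)

print(paired_res)
print(unpaired_res)
```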