I have a data frame like this:
mydf <- data.frame(ID, age, gender, diagnosis, A1, A2, A3)
mydf
ID age gender diagnosis A1 A2 A3
a 22 female 1 4 12 23
b 34 male 2 5 15 33
c 55 female 2 12 46 45
d 55 female 1 45 34 66
e 45 male 1 33 56 32
A1, A2, and A3 are the questions in my test, and the numbers below them are the scores each participant (ID) got on that question. The values 1 and 2 under diagnosis indicate whether the participant has the diagnosis or not.
What I want is to get mean scores for each question and make a bar plot showing the difference across diagnosis groups.
I calculated the mean for question columns like this:
mydf <- rbind(mydf, "mean" = round(colMeans(mydf[,5:7], na.rm = TRUE), 2))
mydf
ID age gender diagnosis A1 A2 A3
a 22.0 female 1 4.0 12.0 23.0
b 34.0 male 2 5.0 15.0 33.0
c 55.0 female 2 12.0 46.0 45.0
d 55.0 female 1 45.0 34.0 66.0
e 45.0 male 1 33.0 56.0 32.0
19.8 32.6 39.8 19.8 32.6 39.8 19.8
So it added a new row, but although I selected only the question columns, it also filled in values for the ID, age, gender, and diagnosis columns, and I don't know why.
And I am not sure which steps I should take after this point to make a bar chart of the mean score for each question across diagnoses, something like in the picture.
This should work:
library( tidyverse )
### Create tibble
mydf <- tibble( ID = c( "a", "b", "c", "d", "e" ),
age = c( 22, 34, 55, 55, 45 ),
gender = c( "female", "male", "female", "female", "male" ),
diagnosis = c( 1, 2, 2, 1, 1 ),
A1 = c( 4, 5, 12, 45, 33 ),
A2 = c( 12, 15, 46, 34, 56 ),
A3 = c( 23, 33, 45, 66, 32 ) )
### Calculate the mean for each item within diagnosis group
mydf <- mydf %>% group_by( diagnosis ) %>% mutate( A1_mean = mean( A1, na.rm = T ) ) %>%
mutate( A2_mean = mean( A2, na.rm = T ) ) %>% mutate( A3_mean = mean( A3, na.rm = T ) )
### Reshape from wide to long format
mydf2 <- pivot_longer( mydf, cols = c( A1_mean, A2_mean, A3_mean ), names_to = "question" )
mydf2
### Plot data
ggplot( mydf2, aes( x = factor( question ), y = value, fill = factor( diagnosis ) ) ) +
geom_bar( position = "dodge", stat = "identity" ) + scale_fill_brewer( palette = "Set2" ) +
scale_y_continuous( expand = expansion( mult = c( 0,.05 ) ) ) +
scale_x_discrete( labels = c( "A1_mean" = "A1", "A2_mean" = "A2", "A3_mean" = "A3" ) ) +
theme( legend.position = "bottom" ) + ggtitle( "Mean for each Question by Diagnosis Group" ) +
xlab( "Question" ) + ylab( "Mean" ) + labs( fill = "Diagnosis" )
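On the side question of why `rbind()` filled every column: `colMeans(mydf[, 5:7])` returns only three values, and `rbind()` on a data frame recycles a shorter vector across all seven columns, which is why 19.8, 32.6, 39.8 repeat in the new row. A minimal base R sketch, assuming the same data as above, that keeps the means in their own objects instead of appending them to `mydf` (the names `question_cols`, `overall_means`, and `group_means` are just illustrative):

```r
# Same example data as in the question
mydf <- data.frame(
  ID        = c("a", "b", "c", "d", "e"),
  age       = c(22, 34, 55, 55, 45),
  gender    = c("female", "male", "female", "female", "male"),
  diagnosis = c(1, 2, 2, 1, 1),
  A1        = c(4, 5, 12, 45, 33),
  A2        = c(12, 15, 46, 34, 56),
  A3        = c(23, 33, 45, 66, 32)
)

# Select the question columns by name rather than by position
question_cols <- c("A1", "A2", "A3")

# Overall mean per question, stored separately from the data
overall_means <- round(colMeans(mydf[, question_cols], na.rm = TRUE), 2)

# Mean per question within each diagnosis group
group_means <- aggregate(cbind(A1, A2, A3) ~ diagnosis, data = mydf, FUN = mean)

overall_means
group_means
```

Keeping the summary separate also avoids mixing a "mean" row with the per-participant rows, which would otherwise have to be filtered out again before plotting.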
Thanks a lot to @David!
Inspired by him, this code solved my question.
#choose diagnosis and question columns and calculate the mean score for columns
mydf2 <- mydf %>% select(diagnosis, A1, A2, A3) %>% group_by(diagnosis)%>% summarise_if(is.numeric, mean, na.rm=TRUE)
and this is what I get:
# A tibble: 2 x 4
diagnosis A1 A2 A3
<chr> <dbl> <dbl> <dbl>
1 27.3 34 40.3
2 8.5 30.5 39
and then reshaped the tibble similarly to David's code with tidyverse:
mydf2 <- mydf2 %>%
pivot_longer(!diagnosis, names_to = "Questions", values_to = "Values")
mydf2
# A tibble: 6 x 3
diagnosis Questions Values
<chr> <chr> <dbl>
1 A1 27.3
1 A2 34
1 A3 40.3
2 A1 8.5
2 A2 30.5
2 A3 39
Then made the graph as David showed:
ggplot( mydf2, aes( x = Questions, y = Values, fill = diagnosis ) ) +
geom_bar( position = "dodge", stat = "identity" ) + scale_fill_brewer( palette = "Set2" ) +
scale_y_continuous( expand = expansion( mult = c( 0,.05 ) ) ) +
theme( legend.position = "bottom" ) + ggtitle( "Mean for each Question by Diagnosis Group" ) +
xlab( "Question" ) + ylab( "Mean" ) + labs( fill = "Diagnosis" )
and got the graph I wanted:
I am working with the R programming language.
I have a dataset that looks something like this:
x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)
my_data = data.frame(x, date_1, date_1_1, date_2, date_2_1)
x date_1 date_1_1 date_2 date_2_1
1 GROUP CLASS 1 CLASS 2 CLASS 1 CLASS 2
2 A 20 37 15 84
3 B 60 22 100 18
4 C 82 8 76 88
I am trying to restructure the data so it looks like this:
Note: in the real Excel data, date_1 is the same date as date_1_1 and date_2 is the same as date_2_1. R won't accept duplicate column names, so I named them differently.
Currently, I am doing this manually in Excel using different "transpose" functions, but I am wondering if there is a way to do this in R (possibly using the dplyr library).
I have been trying to read different tutorial websites online (Pivoting), but so far nothing seems to match the problem I am trying to work on.
Can someone please show me how to do this?
Thanks!
I made some assumptions about your data because of the duplicate column names. For example, suppose the column header pattern is CLASS_ClassNum_Date:
df<-data.frame(GROUP = c("A", "B", "C"),
CLASS_1_1 = c(20, 60, 82),
CLASS_2_1 = c(37, 22, 8),
CLASS_1_2 = c(15,100,76),
CLASS_2_2 = c(84, 18,88))
library(tidyr)
pivot_longer(df, -GROUP,
names_pattern = "(CLASS_.*)_(.*)",
names_to = c(".value", "Date"))
GROUP Date CLASS_1 CLASS_2
<chr> <chr> <dbl> <dbl>
1 A 1 20 37
2 A 2 15 84
3 B 1 60 22
4 B 2 100 18
5 C 1 82 8
6 C 2 76 88
Edit: Substantially improved pivot_longer by using names_pattern= correctly
There are lots of ways to achieve your desired outcome, but I don't believe there is an 'easy'/'simple' way. Here is one potential solution:
library(tidyverse)
library(vctrs)
x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)
my_data = data.frame(x, date_1, date_1_1, date_2, date_2_1)
# Combine column names with the names in the first row
colnames(my_data) <- paste(my_data[1,], colnames(my_data), sep = "-")
my_data %>%
filter(`GROUP-x` != "GROUP") %>% # remove first row (info now in column names)
pivot_longer(everything(), # pivot the data
names_to = c(".value", "Date"),
names_sep = "-") %>%
mutate(GROUP = vec_fill_missing(GROUP, # fill NAs in GROUP introduced by pivoting
direction = "downup")) %>%
filter(Date != "x") %>% # remove "unneeded" rows
mutate(`CLASS 2` = vec_fill_missing(`CLASS 2`, # fill NAs again
direction = "downup")) %>%
na.omit() %>% # remove any remaining NAs
mutate(across(starts_with("CLASS"), ~as.numeric(.x)),
Date = str_extract(Date, "\\d+")) %>%
rename("date" = "Date", # rename the columns
"group" = "GROUP",
"count_class_1" = `CLASS 1`,
"count_class_2" = `CLASS 2`) %>%
arrange(date) # arrange by "date" to get your desired output
#> # A tibble: 6 × 4
#> date group count_class_1 count_class_2
#> <chr> <chr> <dbl> <dbl>
#> 1 1 A 20 37
#> 2 1 B 60 84
#> 3 1 C 82 18
#> 4 2 A 15 37
#> 5 2 B 100 22
#> 6 2 C 76 8
Created on 2022-12-09 with reprex v2.0.2
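Note that in the printed result above some class 2 counts do not line up with the input (for example, group B on date 1 shows 84, while the input has 22 for `date_1_1`). As an alternative sketch, assuming the date number can be read from the digits right after `date_` in each column name, the header row can be merged into the column names first so that `pivot_longer()` keeps each pair of class columns together. The `__` separator and the names `classes`, `dates`, `body`, and `result` are my own choices for illustration:

```r
library(dplyr)
library(tidyr)

x        <- c("GROUP", "A", "B", "C")
date_1   <- c("CLASS 1", 20, 60, 82)
date_1_1 <- c("CLASS 2", 37, 22, 8)
date_2   <- c("CLASS 1", 15, 100, 76)
date_2_1 <- c("CLASS 2", 84, 18, 88)
my_data  <- data.frame(x, date_1, date_1_1, date_2, date_2_1,
                       stringsAsFactors = FALSE)

# Class labels come from the first row, date numbers from the column names
classes <- vapply(my_data[1, -1], as.character, character(1))
dates   <- sub("^date_([0-9]+).*$", "\\1", names(my_data)[-1])

# Drop the header row and build names such as "class_1__1" (class, then date)
body <- my_data[-1, ]
names(body) <- c("group",
                 paste(gsub(" ", "_", tolower(classes)), dates, sep = "__"))

result <- body %>%
  pivot_longer(-group,
               names_to = c(".value", "date"),
               names_sep = "__") %>%
  mutate(across(starts_with("class"), as.numeric)) %>%
  arrange(date, group)

result
```

Because the class label is mapped to `.value`, each output row carries the class 1 and class 2 counts for one group on one date, so no filling of missing values is needed afterwards.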
I have a set of sample data such as the following:
tableData <- tibble(Fruits = sample(c('Apple', 'Banana', 'Orange'), 30, T),
Ripeness = sample(c('yes', 'no'), 30, T),
Mean = ifelse(Ripeness == 'yes', 1.4 + runif(30), 1.6 + runif(30))) %>%
add_row(Fruits = "Peach", Ripeness = "yes", Mean = 5)
I have a function that summarizes for p-value calculation and a mean difference calculation.
tableData %>%
group_by(Fruits) %>%
summarise(Meandiff = mean(Mean[Ripeness == 'yes'])-
mean(Mean[Ripeness == 'no']),
t_test_pval = get_t_test_pval(Mean ~ Ripeness))
Using the summarise function, is it also possible to add another column that counts the number of observations for each fruit if the fruit has a ripeness of "yes" (ie count apple observations with yes ripeness)?
How about this:
set.seed(2)
tableData <- tibble(Fruits = sample(c('Apple', 'Banana', 'Orange'), 30, T),
Ripeness = sample(c('yes', 'no'), 30, T),
Mean = ifelse(Ripeness == 'yes', 1.4 + runif(30), 1.6 + runif(30))) %>%
add_row(Fruits = "Peach", Ripeness = "yes", Mean = 5)
tableData %>%
group_by(Fruits) %>%
summarise(Meandiff = mean(Mean[Ripeness == 'yes']) - mean(Mean[Ripeness == 'no']),
t_test_p_val = if(length(unique(Ripeness))!=2) NaN else t.test(Mean ~ Ripeness)$p.value,
N.yes = sum(Ripeness=="yes"))
Fruits Meandiff t_test_p_val N.yes
<chr> <dbl> <dbl> <int>
1 Apple -0.260 0.241 5
2 Banana -0.223 0.305 4
3 Orange -0.692 0.000290 7
4 Peach NaN NaN 1
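The `N.yes` column works because comparing `Ripeness == "yes"` gives a logical vector, and `sum()` counts its `TRUE` values within each group. A small deterministic sketch of just that step, using a made-up table (`fruit_obs` and the `N.total` column are hypothetical, added only for illustration):

```r
library(dplyr)

# Small hand-written example so the counts are easy to check by eye
fruit_obs <- tibble(
  Fruits   = c("Apple", "Apple", "Apple", "Banana", "Banana", "Peach"),
  Ripeness = c("yes", "no", "yes", "no", "no", "yes")
)

counts <- fruit_obs %>%
  group_by(Fruits) %>%
  summarise(
    N.yes   = sum(Ripeness == "yes"),  # TRUE counts as 1, FALSE as 0
    N.total = n()                      # all observations for the fruit
  )

counts
```

If `Ripeness` can contain missing values, `sum(Ripeness == "yes", na.rm = TRUE)` keeps the count from turning into `NA`.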
I have a .csv file with demographic data for my participants. The data are coded and downloaded from my study database (REDCap) in a way that each race has its own separate column. That is, each participant has a value in each of these columns (1 if endorsed, 0 if unendorsed).
It looks something like this:
SubjID Sex Age White AA Asian Other
001 F 62 0 1 0 0
002 M 66 1 0 0 0
I have to use a roundabout way to get my demographic summary stats. There's gotta be a simpler way to do this. My question is, how can I combine these columns into one column so that there is only one value for race for each participant? (i.e. recoding so 1 = white, 2 = AA, etc, and only the endorsed category is being pulled for each participant and added to this column?)
This is what I would like for it to look:
SubjID Sex Age Race
001 F 62 2
002 M 66 1
This is more or less similar to our approach with similar data from REDCap. We use pivot_longer for dummy variables. The final Race variable could also be made a factor. Please let me know if this is what you had in mind.
Edit: Added names_ptypes to pivot_longer to indicate that Race variable is a factor (instead of mutate).
library(tidyverse)
df <- data.frame(
SubjID = c("001", "002"),
Sex = c("F", "M"),
Age = c(62, 66),
White = c(0, 1),
AA = c(1, 0),
Asian = c(0, 0),
Other = c(0, 0)
)
df %>%
pivot_longer(cols = c("White", "AA", "Asian", "Other"), names_to = "Race", names_ptypes = list(Race = factor()), values_to = "Value") %>%
filter(Value == 1) %>%
select(-Value)
Result:
# A tibble: 2 x 4
SubjID Sex Age Race
<fct> <fct> <dbl> <fct>
1 001 F 62 AA
2 002 M 66 White
Here is another approach using reshape2
df[df == 0] <- NA
df <- reshape2::melt(df, measure.vars = c("White", "AA", "Asian", "Other"), variable.name = "Race", na.rm = TRUE)
df <- subset(df, select = -value)
# SubjID Sex Age Race
# 002 M 66 White
# 001 F 62 AA
Here's a base approach:
race_cols <- 4:7
ind <- max.col(df[, race_cols])
df$Race_number <- ind
df$Race <- names(df[, race_cols])[ind]
df[, -race_cols]
SubjID Sex Age Race_number Race
1 001 F 62 2 AA
2 002 M 66 1 White
Data from @Ben:
df <- data.frame(
SubjID = c("001", "002"),
Sex = c("F", "M"),
Age = c(62, 66),
White = c(0, 1),
AA = c(1, 0),
Asian = c(0, 0),
Other = c(0, 0)
)
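One caveat with `max.col()`: its default is `ties.method = "random"`, so a participant who endorsed more than one race, or none at all, would get an arbitrary category. A hedged sketch of one way to guard against that, assuming such rows should be set to `NA` rather than forced into a single category (participant "003" is a hypothetical addition to show the case):

```r
df <- data.frame(
  SubjID = c("001", "002", "003"),   # "003" is hypothetical: endorses two races
  Sex    = c("F", "M", "F"),
  Age    = c(62, 66, 58),
  White  = c(0, 1, 1),
  AA     = c(1, 0, 0),
  Asian  = c(0, 0, 1),
  Other  = c(0, 0, 0)
)

race_cols <- c("White", "AA", "Asian", "Other")

# Number of endorsed categories per participant
n_endorsed <- rowSums(df[, race_cols])

# Deterministic pick of the first maximum instead of a random one
df$Race <- race_cols[max.col(df[, race_cols], ties.method = "first")]

# Assumption: rows without exactly one endorsed race are left as NA
df$Race[n_endorsed != 1] <- NA

df[, c("SubjID", "Sex", "Age", "Race")]
```

Whether multi-race participants should become `NA`, a separate "Multiple" level, or something else is a study-design decision, so treat the last assignment as a placeholder.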
Not sure if tidyr::gather can be used to take multiple columns and merge them in multiple key columns.
Similar questions have been asked but they all refer to gathering multiple columns in one key column.
I'm trying to gather 4 columns into 2 key and 2 value columns like in the following example:
Sample data:
df <- data.frame(
subject = c("a", "b"),
age1 = c(33, 35),
age2 = c(43, 45),
weight1 = c(90, 67),
weight2 = c(70, 87)
)
subject age1 age2 weight1 weight2
1 a 33 43 90 70
2 b 35 45 67 87
Desired result:
dfe <- data.frame(
subject = c("a", "a", "b", "b"),
age = c("age1", "age2", "age1", "age2"),
age_values = c(33, 43, 35, 45),
weight = c("weight1", "weight2", "weight1", "weight2"),
weight_values = c(90, 70, 67, 87)
)
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 a age2 43 weight2 70
3 b age1 35 weight1 67
4 b age2 45 weight2 87
Here's one way to do it -
df %>%
gather(key = "age", value = "age_values", age1, age2) %>%
gather(key = "weight", value = "weight_values", weight1, weight2) %>%
filter(substring(age, 4) == substring(weight, 7))
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 b age1 35 weight1 67
3 a age2 43 weight2 70
4 b age2 45 weight2 87
Here's one approach. The idea is to use gather, then split the resulting data frame by variable (age and weight), do the mutate operations separately for each of the two data frames, then merge them back together using subject and the variable number (1 or 2).
library(dplyr)
library(tidyr)
library(purrr)
df %>%
gather(age1:weight2, key = key, value = value) %>%
separate(key, sep = -1, into = c("var", "num")) %>%
split(.$var) %>%
map(~mutate(., !!.$var[1] := paste0(var, num), !!paste0(.$var[1], "_values") := value)) %>%
map(~select(., -var, -value)) %>%
Reduce(f = merge, x = .) %>%
select(-num)
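Since `gather()` is superseded in current tidyr, here is a sketch of the same reshaping with `pivot_longer()`, assuming every column name is a lowercase variable name followed by a number. The `.value` sentinel spreads `age` and `weight` into their own columns in one step; the key columns are then rebuilt from the number (the helper column `num` and the object name `result` are illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  subject = c("a", "b"),
  age1    = c(33, 35),
  age2    = c(43, 45),
  weight1 = c(90, 67),
  weight2 = c(70, 87)
)

result <- df %>%
  # "age1" -> .value = "age", num = "1"; "weight2" -> .value = "weight", num = "2"
  pivot_longer(-subject,
               names_to = c(".value", "num"),
               names_pattern = "^([a-z]+)([0-9]+)$") %>%
  rename(age_values = age, weight_values = weight) %>%
  # Rebuild the key columns requested in the question
  mutate(age = paste0("age", num),
         weight = paste0("weight", num)) %>%
  select(subject, age, age_values, weight, weight_values)

result
```

If the key columns are not actually needed, stopping after `pivot_longer()` already gives one row per subject and measurement number.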
I am trying to create a panel of grouped stacked bar graphs, but the legend is not appearing automatically for the "age" below. How can I add this legend explicitly?
library(ggplot2)
# create the dataset
species=c(rep("A" , 2) , rep("B" , 2))
strain=rep(c("i" , "ii" ),2)
age=rep(c(1,2,3,4),4)
count=abs(rnorm(16 , 0 , 15))
data=data.frame(species,strain,age,count)
ggplot(data,aes(x=strain,y=count,fill=age))+
geom_bar(stat = "identity",size=0.5,col="black",fill=rep(c("black","saddlebrown","darkgreen","goldenrod"),times=4))+
facet_wrap(~species)+
labs(x="Species",y="Count")
The legend is missing because the fixed fill=rep(c("black","saddlebrown","darkgreen","goldenrod"),times=4) inside geom_bar() overrides the aes(fill = age) mapping, so ggplot2 has no fill scale to draw a legend for. Keep the mapping and use scale_fill_xxx to specify the desired colors manually.
library(dplyr)
library(ggplot2)
# create the dataset
set.seed(123)
species <- c(rep("A", 2), rep("B", 2))
strain <- rep(c("i", "ii"), 2)
age <- c(rep(c(1, 3, 4, 2), 1), rep(c(2, 4, 2, 3), 2), rep(c(3, 1, 3, 1), 1))
count <- abs(rnorm(16, 0, 15))
data <- data.frame(species, strain, age, count)
### convert age to factor
data <- data %>%
as_tibble() %>%
mutate(age = factor(age)) %>%
arrange(species, strain)
data
#> # A tibble: 16 x 4
#> species strain age count
#> <fct> <fct> <fct> <dbl>
#> 1 A i 1 8.41
#> 2 A i 2 1.94
#> 3 A i 2 10.3
#> 4 A i 3 6.01
#> 5 A ii 3 3.45
#> 6 A ii 4 25.7
#> 7 A ii 4 6.68
#> 8 A ii 1 1.66
#> 9 B i 4 23.4
#> 10 B i 2 6.91
#> 11 B i 2 18.4
#> 12 B i 3 8.34
#> 13 B ii 2 1.06
#> 14 B ii 3 19.0
#> 15 B ii 3 5.40
#> 16 B ii 1 26.8
ggplot(data, aes(x = strain, y = count, fill = age)) +
geom_col(color = 'black') +
facet_grid(~ species) +
scale_fill_brewer(palette = 'Dark2') +
labs(x = "Species", y = "Count") +
theme_minimal(base_size = 14)
### user-defined color scheme
myColor <- c('#a6cee3','#1f78b4','#b2df8a','#33a02c')
ggplot(data, aes(x = strain, y = count, fill = age)) +
geom_col(color = 'black') +
facet_grid(~ species) +
scale_fill_manual(values = myColor) +
labs(x = "Species", y = "Count") +
theme_minimal(base_size = 14)
Created on 2018-10-09 by the reprex package (v0.2.1.9000)
OK, I think I've got it with your help @Tung:
ggplot(data,aes(x=strain,y=count,fill=age))+
geom_col(size=0.5,col="black")+ # one bar layer; colors come from scale_fill_manual() below (age must be a factor)
facet_wrap(~species)+
labs(x="Species",y="Count")+
theme(legend.position = "right") +
theme(axis.title.y = element_text(margin = margin(r = 20)))+
scale_fill_manual(values = c("black","saddlebrown","darkgreen","goldenrod"))