I am using this dataset: http://www.openintro.org/stat/data/cdc.R
to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:
                 Mean   Standard Deviation
Age:             44.27  16.715
Height:          70.25  3.009219
Weight:          189.3  36.55036
Desired Weight:  178.6  26.25121
I created a subset for males and females with this code:
mdata <- subset(cdc, cdc$gender == ("m"))
fdata <- subset(cdc, cdc$gender == ("f"))
How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?
The data frame you provided sucked up all the memory on my laptop, and it's not needed to provide that much data to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:
library(dplyr)
library(tidyr)
starwars |>
group_by(sex) |>
summarise(across(
where(is.numeric),
.fns = list(Mean = \(x) mean(x, na.rm = TRUE), SD = \(x) sd(x, na.rm = TRUE)),
.names = "{col}__{fn}"
)) |>
pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")
# A tibble: 15 × 4
sex var Mean SD
<chr> <chr> <dbl> <dbl>
1 female height 169. 15.3
2 female mass 54.7 8.59
3 female birth_year 47.2 15.0
4 hermaphroditic height 175 NA
5 hermaphroditic mass 1358 NA
6 hermaphroditic birth_year 600 NA
7 male height 179. 36.0
8 male mass 81.0 28.2
9 male birth_year 85.5 157.
10 none height 131. 49.1
11 none mass 69.8 51.0
12 none birth_year 53.3 51.6
13 NA height 181. 2.89
14 NA mass 48 NA
15 NA birth_year 62 NA
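Adapted to the question's columns, the same pattern might look like the sketch below. The toy data frame is made up (it only borrows two of cdc's column names), so the numbers are illustrative and not the real cdc results.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for cdc; only the column names follow the question
toy <- data.frame(
  gender = c("m", "f", "m", "f"),
  age    = c(30, 40, 50, 60),
  weight = c(180, 150, 200, 130)
)

male_summary <- toy |>
  filter(gender == "m") |>
  summarise(across(
    c(age, weight),
    list(Mean = \(x) mean(x, na.rm = TRUE), SD = \(x) sd(x, na.rm = TRUE)),
    .names = "{.col}__{.fn}"
  )) |>
  # ".value" turns the part after "__" into the Mean and SD columns
  pivot_longer(everything(), names_to = c("var", ".value"), names_sep = "__")

male_summary
```

Filtering first gives one table for males only; swapping filter() for group_by(gender) and using pivot_longer(-gender, ...) would give both groups at once, as in the starwars example.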
Just make a data frame of colMeans and column sd. Note, that you may also select columns.
fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)
res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
# mean sd
# age 44.27307 16.719940
# height 70.25165 3.009219
# weight 189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------
# cdc$gender: f
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
To extract only one of the data frames in the list-like object use e.g. res$m.
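The remark that by is basically split plus lapply can be sketched as follows. The toy data frame is hypothetical and only mimics the shape of cdc.

```r
# Hypothetical toy data standing in for cdc (values are made up)
toy <- data.frame(
  gender = factor(c("m", "f", "m", "f"), levels = c("m", "f")),
  age    = c(30, 40, 50, 60),
  height = c(70, 64, 72, 66)
)

# split() cuts the numeric columns into one data frame per gender;
# lapply() then builds the mean/sd table for each piece
pieces <- split(toy[c("age", "height")], toy$gender)
res2 <- lapply(pieces, function(x) {
  data.frame(mean = colMeans(x), sd = sapply(x, sd))
})

res2$m   # same kind of extraction as res$m above
```

Unlike by, this returns a plain named list, which can be easier to pass on to other functions.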
Usually we use aggregate for this, which you also might consider:
aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
do.call(what=data.frame)
# gender age.mean age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1 m 44.27307 16.71994 70.251646 3.009219 189.32271 36.55036 178.61657 26.25121
# 2 f 45.79772 17.58442 64.367750 2.787304 151.66619 34.29752 133.51500 18.96301
The trailing |> do.call(what=data.frame) is only needed to get rid of matrix columns, which is useful if you want to process the result further.
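To see what "matrix columns" means here, a small sketch on made-up data (toy is not part of cdc) shows the shape of the aggregate result before and after flattening.

```r
# Made-up data; only the structure matters
toy <- data.frame(
  gender = c("m", "f", "m", "f"),
  age    = c(30, 40, 50, 60)
)

agg <- aggregate(age ~ gender, toy, function(x) c(mean = mean(x), sd = sd(x)))
ncol(agg)            # 2 columns: gender plus a single matrix column "age"
is.matrix(agg$age)   # the mean and sd live inside that matrix

# do.call(data.frame, ...) spreads the matrix into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
```

After flattening, age.mean and age.sd can be addressed like any other column, for example flat$age.mean.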
Note: R >= 4.1 used.
Data:
source('https://www.openintro.org/stat/data/cdc.R')
or
cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent",
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0,
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1,
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L,
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L,
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L,
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995",
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")
Related
I want to re-work the data that I have in a dataframe based on adding certain values up, and to do that to all of the (numerical) columns in the same way.
In code, I have created a dataframe that is structured a bit like this:
library(tibble)
df_in <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Three star", 1L, 1, 2L, 1,
"Two star", 5L, 5, 12L, 6,
"One star", 6L, 6, 100L, 50,
"No star", 88L, 88, 86L, 43,
"Empty", 0L, 0, 0L, 0,
"Also empty", 0L, 0, 0L, 0)
In my output I want to have one row that contains the sums of three rows in the input dataframe, another row that contains the sum of two of them, and one that contains the contents of a row from the original (but renamed).
I also want to keep other rows if they have numbers but to drop them if they are empty. I would prefer to do that programmatically, but can do it manually with indexing if need be, so that's a bit less important.
My desired output would be a bit like this:
df_out <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Any stars", 12L, 12, 114L, 57,
"... of which at least 2 stars", 6L, 6, 14L, 7,
"... of which 3 stars", 1L, 1, 2L, 1,
"No star", 88L, 88, 86L, 43)
For example, that 12L in the top left (meaning column a, "Any stars") is the sum of the 1L, 5L and 6L entries in the a column of the input.
I want to do this merging of rows at this stage in my processing because it's important to do it after I've already calculated the percentage columns (..._pc in the example). You'll see that in the output the percentage columns add to more than 100, which is correct because there is deliberately some 'double counting' - things can correctly show up in multiple rows if they meet the conditions.
Edit to add: Note that the labels I am using in the $names column of the test dataset df_in are not the real labels I have in my real situation. I imagine that a workable solution to this will be able to somehow take a collection of vectors that specify sets of rows and another collection of the same number of strings to label the new rows, and process through them. I might be able to define the sets of rows and the associated names like this:
set_1 <- c("Three star", "Two star", "One star")
set_2 <- c("Three star", "Two star")
set_3 <- "Three star"
set_4 <- "No star"
new_name_1 <- "Any stars"
new_name_2 <- "... of which at least 2 stars"
new_name_3 <- "... of which 3 stars"
new_name_4 <- "No star"
We may use imap to loop over patterns (as some cases are overlapping) and do the group by sum across those columns (after filtering the rows)
library(purrr)
library(stringr)
imap_dfr(setNames(c('(?<!No) star', 'Two|Three', 'Three', 'Empty|No'),
c("Any stars", "... of which at least 2 stars",
"... of which 3 stars", "No star" )), ~ df_in %>%
filter(str_detect(names, regex(.x, ignore_case = TRUE))) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
-output
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
OP's expected
> df_out
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
Update
If the OP is passing a custom set of names
map2_dfr(mget(ls(pattern = '^set_\\d+')),
mget(ls(pattern = '^new_name_\\d+')),
~ df_in %>%
filter(names %in% .x) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
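As an alternative to collecting set_1, set_2, ... with mget(ls(...)), the sets and their labels could be kept together in one named list. This is only a base R sketch under that assumption; the names sets and df_sum are made up for illustration.

```r
df_in <- data.frame(
  names = c("Three star", "Two star", "One star", "No star", "Empty", "Also empty"),
  a     = c(1L, 5L, 6L, 88L, 0L, 0L),
  a_pc  = c(1, 5, 6, 88, 0, 0),
  b     = c(2L, 12L, 100L, 86L, 0L, 0L),
  b_pc  = c(1, 6, 50, 43, 0, 0)
)

# One named list: each name is the new label, each element the rows to add up
sets <- list(
  "Any stars"                     = c("Three star", "Two star", "One star"),
  "... of which at least 2 stars" = c("Three star", "Two star"),
  "... of which 3 stars"          = "Three star",
  "No star"                       = "No star"
)

rows <- lapply(names(sets), function(label) {
  keep <- df_in$names %in% sets[[label]]
  # colSums() adds up every numeric column of the selected rows
  data.frame(names = label, as.list(colSums(df_in[keep, -1, drop = FALSE])))
})
df_sum <- do.call(rbind, rows)
df_sum
```

Rows that appear in no set (here the two empty ones) are simply never selected, so they drop out without extra filtering. Note that colSums returns doubles, so the integer columns come back as numeric.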
In R markdown through R Studio (R v. 4.0.3), I'm looking for a better solution to combining similarly structured dataframes while keeping all rows and matching entries on a key. Piping full_join() into a filter into a bind_rows() directly wasn't working, possibly because of the error message:
Error: Can't combine `..1$term_code` <character> and `..2$term_code` <integer>.
I have 23 dataframes (let's call these "semester data") of data I'm looking to combine into a single dataframe (intended to be a single dataset of individuals outcomes from semester-to-semester).
Each semester dataframe is roughly 3000-4000 observations (individuals) with 45-47 variables of relevant data. A simplified example of a semester (or term) dataframe is shown below.
Simplified example of a "semester" dataframe:
id    ACT_math  course_code  section_code  term_code  grade  term_GPA
0001  23        101          001           FA12       3.45   3.8
0002  28        201          003           FA12       3.2    3.4
Individuals will show up in multiple semester dataframes as they progress through the program (taking course 101 in the fall and course 102 in the spring).
I want to use the dplyr full_join() to match these individuals on an ID key.
Using the suffix argument, I hope to keep track of which semester and course a set of data (grade, term_GPA, etc) for an individual comes from.
There's some data (ACT score, gender, state residency, etc) that is stable for an individual across semester dataframes. Ideally I could take the first input and drop the rest, but if I had to clean this afterwards, that's fine.
I started by defining an object programmatic_database using the first semester of data, SP11. To cut down on the duplication of stable data for an individual, I selected the relevant columns that I wanted to join.
programmatic_database <- programmatic_database %>%
full_join(select(fa12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".sp11", ".fa12"), keep = FALSE, name = "id")
However, every semester new students join the program. I would like to add these entries to the bottom of the growing programmatic_database.
I'm also looking to use rbind() or bind_rows() to add these individuals to the bottom of the programmatic_database, along with their relevant data.
After full_join(), I'm filtering out the entries that have already been added horizontally to the dataframe, then piping the remaining entries into bind_rows()
programmatic_database <- fa12[!which(fa12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, fa12)
Concatenated example of what my code is producing after several iterations:
id    ACT_math  course_code  section_code  section_code.db  section_code.db.db  term_code  grade.sp11  grade.fa12  grade.sp13  grade.sp15  term_GPA.sp11  term_GPA.fa12  term_GPA.sp15
0001  23        102          001           001              001                 FA12       3.45        3.8         3.0         -           3.8            3.7            -
0002  28        201          003           003              003                 FA12       3.2         3.4         3.0         -           3.8            3.7            -
1020  28        201          003           003              003                 FA12       3.2         3.4         -           -           3.8            3.7            -
6783  30        101          -             -                -                   SP15       -           -           -           3.8         -              -              4.0
where I have successfully added horizontally for students 0001 and 0002 for outcomes in subsequent courses in subsequent semesters. I have also managed to add vertically, like with student 6783, leaving blanks for previous semesters before they enrolled but still adding the relevant columns.
Questions:
Is there a way to pipe full_join() into a filter() into a bind_rows() without running into these errors?
rbind number of columns do not match
OR
Error: Can't combine `..1$term_code` <character> and `..2$term_code` <integer>.
Is there an easy way to keep certain columns and only add the suffix ".fa12" to certain columns? As you can see, the .db is piling up.
Is there any way to automate this? Loops aren't my strong suit, but I'm sure there's better-looking code than doing each of the 23 joins/binds by hand.
Thank you for assistance!
Current code for simplicity:
#reproducible example
fa11 <- structure(list(id = c("1001", "1002", "1003",
"1013"), act6_05_composite = c(33L, 26L, 27L, 25L), course_code = c("101",
"101", "101", "101"), term_code = c("FA11", "FA11", "FA11", "FA11"
), section_code = c(1L, 1L, 1L, 1L), grade = c(4, 0, 0, 2.5
), repeat_status_flag = c(NA, "PR", NA, NA), class_code = c(1L,
1L, 1L, 1L), cum_atmpt_credits_prior = c(16, 0, 0, 0), cum_completed_credits_prior = c(0L,
0L, 0L, 0L), cum_passed_credits_prior = c(16, 0, 0, 0), cum_gpa_prior = c(0,
0, 0, 0), cum_atmpt_credits_during = c(29, 15, 18, 15), cum_completed_credits_during = c(13L,
1L, 10L, 15L), cum_passed_credits_during = c(29, 1, 14, 15),
term_gpa = c(3.9615, 0.2333, 2.3214, 2.9666)), row.names = c(NA, 4L
), class = "data.frame")
sp12 <- structure(list(id = c("1007", "1013", "1355",
"2779", "2302"), act6_05_composite = c(24L, 26L, 25L, 24L,
24L), course_code = c(101L, 102L, 101L, 101L, 101L
), term_code = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_), section_code = c(1L, 1L, 1L, 1L, 1L), grade = c(2,
2.5, 2, 1.5, 3.5), repeat_status_flag = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), class_code = c(2L, 2L, 1L, 2L, 2L), cum_atmpt_credits_prior = c(44,
43, 12, 43, 30), cum_completed_credits_prior = c(41L, 43L,
12L, 43L, 12L), cum_passed_credits_prior = c(41, 43, 12,
43, 30), cum_gpa_prior = c(3.3125, 3.186, 3.5416, 3.1785,
3.8636), cum_atmpt_credits_during = c(56, 59, 25, 64, 43),
cum_completed_credits_during = c(53L, 56L, 25L, 56L, 25L),
cum_passed_credits_during = c(53, 59, 25, 64, 43), term_gpa = c(2.8333,
3.423, 3.1153, 2.1923, 3.6153)), row.names = c(NA,
5L), class = "data.frame")
# make object from fall 2011 semester dataframe
programmatic_database <- fa11
# join the spring 2012 semester dataframe by id using select variables and attaching relevant suffix
programmatic_database <- programmatic_database %>%
full_join(select(sp12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".fa11", ".sp12"), keep = FALSE, name = "id")
#view results of join, force integer type on certain variables as needed (see error above)
#filter the joined entries from fall 2012 database, then bind the remaining entries to the bottom of the growing dataset
programmatic_database <- sp12[!which(sp12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, sp12)
It would be possible to use bind_rows here if you make the column types consistent between tables. For instance, you could make a function to re-type any particular columns that aren't consistent in your original data. (That might also be something you could fix upstream as you read it in.)
library(dplyr)
set_column_types <- function(df) {
df %>%
mutate(term_code = as.character(term_code),
course_code = as.character(course_code))
}
bind_rows(
fa11 %>% set_column_types(),
sp12 %>% set_column_types() %>% mutate(term_code = "SP12")
)
This will stack your data into a relatively "long" format, like below. You may want to then reshape it depending on what kind of subsequent calculations you want to do.
id act6_05_composite course_code term_code section_code grade repeat_status_flag class_code cum_atmpt_credits_prior cum_completed_credits_prior cum_passed_credits_prior cum_gpa_prior cum_atmpt_credits_during cum_completed_credits_during cum_passed_credits_during term_gpa
1 1001 33 101 FA11 1 4.0 <NA> 1 16 0 16 0.0000 29 13 29 3.9615
2 1002 26 101 FA11 1 0.0 PR 1 0 0 0 0.0000 15 1 1 0.2333
3 1003 27 101 FA11 1 0.0 <NA> 1 0 0 0 0.0000 18 10 14 2.3214
4 1013 25 101 FA11 1 2.5 <NA> 1 0 0 0 0.0000 15 15 15 2.9666
5 1007 24 101 SP12 1 2.0 <NA> 2 44 41 41 3.3125 56 53 53 2.8333
6 1013 26 102 SP12 1 2.5 <NA> 2 43 43 43 3.1860 59 56 59 3.4230
7 1355 25 101 SP12 1 2.0 <NA> 1 12 12 12 3.5416 25 25 25 3.1153
8 2779 24 101 SP12 1 1.5 <NA> 2 43 43 43 3.1785 64 56 64 2.1923
9 2302 24 101 SP12 1 3.5 <NA> 2 30 12 30 3.8636 43 25 43 3.6153
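For the "automate this for 23 semesters" part of the question, one possible sketch is to keep the semester data frames in a named list and let bind_rows fill in term_code from the list names. The two miniature tables below are made up, and the list name semesters is an assumption, not something from the question.

```r
library(dplyr)
library(tidyr)

# Made-up miniature semester tables; note course_code differs in type
semesters <- list(
  FA11 = data.frame(id = c("1001", "1013"), course_code = c("101", "101"),
                    grade = c(4, 2.5)),
  SP12 = data.frame(id = c("1013", "1355"), course_code = c(102L, 101L),
                    grade = c(2.5, 2))
)

stacked <- semesters |>
  # make the inconsistent column the same type everywhere before stacking
  lapply(\(df) mutate(df, course_code = as.character(course_code))) |>
  bind_rows(.id = "term_code")   # list names become the term_code column

# Optional: one row per student, one grade column per semester
wide <- pivot_wider(stacked, id_cols = id, names_from = term_code,
                    values_from = grade, names_prefix = "grade.")
wide
```

Students who first appear in a later semester get NA in the earlier grade columns, which is similar to the blanks in the table the question describes.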
I have a column of species abundance
Pct_Cov Species Site Plot
1 2.25 AMLA AC 1
2 4.75 BECA4 AC 1
3 9.50 BEPA AC 1
4 7.00 BEPO AC 1
5 9.25 PIRU AC 1
6 2.25 PIRI AC 1
tail
tail(st.ov)
Pct_Cov Species Site Plot
612207 8.0 QUGA ZI 527
612208 1.0 RHAR4 ZI 527
612209 0.5 ARTR2 ZI 527
612210 1.0 POFE ZI 527
612211 3.0 VICIA ZI 527
612212 0.5 ARLU ZI 527
There are a LOT of plots here, 12438 to be exact. Each plot has a variety of different species, etc. I'm trying to write a function that creates a new column to calculate the ratio of the abundance of the dominant species / abundance of the subordinate species.
"Dominant" would be the sum of the top 1/4 of the species per each plot. So if a plot had 20 species, it would be the sum of the abundance of the 4 most abundant species.
I'm having a hard time going about this and was wondering if anyone had any tips. It would also be helpful to know what those species are, but that seems to be tricky.
Thanks!
Here's another tidyverse option. Since your data only has 6 rows for each of two Plots, I'll go with the "top 2" and "all but top 2", instead of your "4". It's easily modified.
dat %>%
group_by(Plot) %>%
# desc() makes rank 1 the most abundant species; without it rank 1 is the least abundant
mutate(R = dense_rank(desc(Pct_Cov))) %>%
summarize(Ratio = sum(Pct_Cov[R %in% 1:2]) / sum(Pct_Cov[! R %in% 1:2]))
# # A tibble: 2 x 2
# Plot Ratio
# <int> <dbl>
# 1 1 1.15
# 2 527 3.67
This does not protect against plots with few unique species. For that, one might add some row-counting logic:
dat %>%
group_by(Plot) %>%
mutate(R = dense_rank(desc(Pct_Cov))) %>%
summarize(Ratio = if (n() > (2+2)) sum(Pct_Cov[R %in% 1:2]) / sum(Pct_Cov[! R %in% 1:2]) else NA_real_)
If you get an NA, that means that that Plot had too few unique species.
Also, it doesn't acknowledge the possibility of 3 (my "2" plus one) having the same Pct_Cov, which sounds unlikely but would be a corner-case that will skew the math.
Data
dat <- structure(list(Pct_Cov = c(2.25, 4.75, 9.5, 7, 9.25, 2.25, 8, 1, 0.5, 1, 3, 0.5), Species = c("AMLA", "BECA4", "BEPA", "BEPO", "PIRU", "PIRI", "QUGA", "RHAR4", "ARTR2", "POFE", "VICIA", "ARLU"), Site = c("AC", "AC", "AC", "AC", "AC", "AC", "ZI", "ZI", "ZI", "ZI", "ZI", "ZI"), Plot = c(1L, 1L, 1L, 1L, 1L, 1L, 527L, 527L, 527L, 527L, 527L, 527L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "612207", "612208", "612209", "612210", "612211", "612212"))
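For the "top quarter of species per plot" rule from the question, a sketch along the following lines may help. It assumes the quarter is rounded up with ceiling() and that ties are broken by row order via row_number(); both choices are assumptions, not something the question specifies. It also lists which species were counted as dominant.

```r
library(dplyr)

dat <- data.frame(
  Pct_Cov = c(2.25, 4.75, 9.5, 7, 9.25, 2.25, 8, 1, 0.5, 1, 3, 0.5),
  Species = c("AMLA", "BECA4", "BEPA", "BEPO", "PIRU", "PIRI",
              "QUGA", "RHAR4", "ARTR2", "POFE", "VICIA", "ARLU"),
  Plot    = rep(c(1L, 527L), each = 6)
)

out <- dat |>
  group_by(Plot) |>
  arrange(desc(Pct_Cov), .by_group = TRUE) |>
  # the first ceiling(n / 4) rows of each plot are flagged as dominant
  mutate(dominant = row_number() <= ceiling(n() / 4)) |>
  summarise(
    Dominant_species = paste(Species[dominant], collapse = ", "),
    Ratio = sum(Pct_Cov[dominant]) / sum(Pct_Cov[!dominant])
  )
out
```

Because row_number() never assigns the same rank twice, exactly ceiling(n / 4) species are picked per plot even when abundances are tied.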
We could use count to get the frequency count of 'plot', 'Species', arrange by 'plot' and descending order of 'n', then grouped by 'plot', create the ratio by taking the sum of first 3 'n' values divided by the sum of the rest and join with the original data
library(dplyr)
out <- df1 %>%
count(plot, Species) %>%
arrange(plot, desc(n)) %>%
group_by(plot) %>%
mutate(ratio = sum(n[1:3])/sum(n[-(1:3)])) %>%
right_join(df1)
id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
library(dplyr)
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You want to do something for each 'id' group, so that's the group_by. Then you need to create a new column, so there's a mutate. That new variable is the old dv, scaled by the value that dv takes at the baseline, hence the inner part of the mutate. Finally, ungroup() removes the grouping you applied.
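Applied to the numbers from the question rather than random data, the same steps might look like this (column names follow the question's desired output; the object name out is made up).

```r
library(dplyr)

df <- data.frame(
  id        = rep(1:3, each = 3),
  timepoint = rep(c("baseline", "1min", "2min"), times = 3),
  dv        = c(100, 105, 90, 70, 100, 80, 80, 80, 90)
)

out <- df |>
  group_by(id) |>
  # within each id, divide by that subject's baseline value
  mutate(pct.chg = dv * 100 / dv[timepoint == "baseline"]) |>
  ungroup()
out
```

This relies on every id having exactly one baseline row; with none or several, the division would fail or recycle.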
Try creating a helper column, group and arrange on that. Then use the window function first in your mutate function:
library(dplyr)
library(stringr)
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)
In base R you can do this:
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution: (assuming "baseline" always appears as first record per group)
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv/x$dv[1] * 100; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)
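If "baseline" is not guaranteed to be the first record per group, a base R sketch that looks the baseline up by id instead of by position could be used. The rows for id 1 are deliberately shuffled here to show that the order does not matter; the helper name baseline is made up.

```r
df <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  timepoint = c("1min", "baseline", "2min",
                "baseline", "1min", "2min",
                "baseline", "1min", "2min"),
  dv        = c(105, 100, 90, 70, 100, 80, 80, 80, 90)
)

# One row per id holding that subject's baseline value
baseline <- df[df$timepoint == "baseline", c("id", "dv")]

# match() finds each row's baseline by id, wherever it sits in the data
df$pct.chg <- df$dv / baseline$dv[match(df$id, baseline$id)] * 100
df
```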
My data is organized as such:
Distance r^2
0 1
0 0.9
0 0
0 0.8
0 1
1 0.5
1 0.45
1 0.56
1 1
2 0
2 0.9
3 0
3 0.1
3 0.2
3 0.3
...
300 1
300 0.8
I want to plot r^2 decay with distance, meaning I want to plot a mean value + st-dev for every unique distance value. So I should have 1 point at x=0, 1 point at x=1... but I have multiple x=0 values.
What is the best way to achieve this, given how the data is organized? I would like to do it in R if possible.
Thank you,
Adrian
Edit:
I have tried:
> dd <-structure(list(Distance = dist18, r.2 = a18[,13]), Names = c("Distance", "r^2"), class = "data.frame", row.names = c(NA, -15L))
> ggplot(dd, aes(x=Distance, y=r.2)) + stat_summary(fun.data="mean_sdl")
Error in data.frame(x = c(42L, 209L, 105L, 168L, 63L, 212L, 148L, 175L, : arguments imply differing number of rows: 126877, 15
> head(dist18)
[1] 42 209 105 168 63 212
> head(dd)
Distance r.2
1 42 0.89
2 209 0.92
3 105 0.91
4 168 0.81
5 63 0.88
6 212 0.88
Is this because my data is not sorted?
You can also plot your SD as an area around the mean similar to CI plotting (assuming temp is your data set)
library(data.table)
library(ggplot2)
temp <- setDT(temp)[, list(Mean = mean(r.2), SD = sd(r.2)), by = Distance]
ggplot(temp) + geom_point(aes(Distance, Mean)) + geom_ribbon(aes(x = Distance, y = Mean, ymin = (Mean - SD), ymax = (Mean + SD)), fill = "skyblue", alpha = 0.4)
Using dplyr it will be something like this:
df = data.frame(distance = rep(1:300, each = 10), r2 = runif(3000))
library(dplyr)
df_group = group_by(df, distance)
summarise(df_group, mn = mean(r2), s = sd(r2))
Source: local data frame [300 x 3]
distance mn s
1 300 0.4977758 0.3565554
2 299 0.4295891 0.3281598
3 297 0.5346428 0.3424429
4 296 0.4623368 0.3163320
5 291 0.3224376 0.2103655
6 290 0.3916658 0.2115264
7 288 0.6147680 0.2953960
8 287 0.3405524 0.2032616
9 286 0.5690844 0.2458538
10 283 0.2901744 0.2835524
.. ... ... ...
Where df is the data.frame with your data, and distance and r2 the two column names.
This should work.
# Create a data frame like yours
df=data.frame(sample(50,size=300,replace=TRUE),runif(300))
colnames(df)=c('Distance','r^2')
#initialize empty data frame with columns x, mean and stdev
results=data.frame(x=numeric(0),mean=numeric(0),stdev=numeric(0))
count=1
for (i in sort(unique(df$Distance))){ # only distances that occur, so no empty groups
results[count,'x']=i
temp_mean=mean(df[which(df$Distance==i),'r^2'])
results[count,'mean']=temp_mean
temp_sd=sd(df[which(df$Distance==i),'r^2'])
results[count,'stdev']=temp_sd
count=count+1
}
# Plot your results
plot(results$x,results$mean,xlab='distance',ylab='r^2')
epsilon=0.02 #to add the little horizontal bar to the error bars
for (i in 1:nrow(results)){
up = results$mean[i] + results$stdev[i]
low = results$mean[i] - results$stdev[i]
segments(results$x[i],low , results$x[i], up)
segments(results$x[i]-epsilon, up , results$x[i]+epsilon, up)
segments(results$x[i]-epsilon, low , results$x[i]+epsilon, low)
}
Here's the result http://imgur.com/ED7PwD8
If you want to plot mean and +/- 1 sd for each point, the ggplot function makes this easy. With the test data
dd<-structure(list(Distance = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
2L, 2L, 3L, 3L, 3L, 3L), r.2 = c(1, 0.9, 0, 0.8, 1, 0.5, 0.45,
0.56, 1, 0, 0.9, 0, 0.1, 0.2, 0.3)), .Names = c("Distance", "r.2"
), class = "data.frame", row.names = c(NA, -15L))
you can just run
library(ggplot2)
library(Hmisc)
ggplot(dd, aes(x=Distance, y=r.2)) +
stat_summary(fun.data="mean_sdl", fun.args=list(mult=1))
which produces a point at the mean of each distance with a vertical bar spanning ±1 SD.
I tried with your real data and got
real <- read.table("http://pelinfamily.ca/bio/GDR-18_conc.ld", header=F)
dd <- data.frame(Distance=real[,2]-real[,1], r.2=real[,13])
ggplot(dd, aes(x=Distance, y=r.2)) +
stat_summary(fun.data="mean_sdl", fun.args=list(mult=1), geom="ribbon", alpha=.4) +
stat_summary(fun.data="mean_sdl", fun.args=list(mult=1), geom="line")
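Since mean_sdl relies on the Hmisc package, a variant that avoids that dependency may be handy. The sketch below assumes a ggplot2 version that supports the fun, fun.min and fun.max arguments of stat_summary (3.3.0 or later) and uses the small test data from above.

```r
library(ggplot2)

dd <- data.frame(
  Distance = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  r.2      = c(1, 0.9, 0, 0.8, 1, 0.5, 0.45, 0.56, 1, 0, 0.9, 0, 0.1, 0.2, 0.3)
)

p <- ggplot(dd, aes(x = Distance, y = r.2)) +
  stat_summary(
    fun     = mean,                           # the point
    fun.min = function(y) mean(y) - sd(y),    # lower end of the bar
    fun.max = function(y) mean(y) + sd(y),    # upper end of the bar
    geom    = "pointrange"
  )
p
```

Each unique Distance value is collapsed to one point, so repeated x values in the raw data are not a problem and the data does not need to be sorted.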