I'm trying to loop through a large data frame (5413 columns) and run an ANOVA on each column, but I'm getting an error when I do so.
I'd like the p value from each ANOVA written to a new row of a data frame alongside the column title, but limited by my current knowledge, I'm writing the p-value outputs to files I can parse through in bash.
Here's an example layout of the data (in a data frame called data):
Name, Group, aaaA, aaaE, bbbR, cccD
Apple, Fruit, 1.23, 0.45, 0.3, 1.1
Banana, Fruit, 0.54, 0.12, 2.0, 1.32
Carrot, Vegetable, 0.01, 0.05, 0.45, 0.9
Pear, Fruit, 0.1, 0.2, 0.1, 0.3
Fox, Animal, 1.0, 0.9, 1.2, 0.8
Dog, Animal, 1.2, 1.1, 0.8, 0.7
And here is the output from dput:
structure(list(Name = structure(c(1L, 2L, 3L, 6L, 5L, 4L), .Label = c("Apple",
"Banana", "Carrot", "Dog", "Fox", "Pear"), class = "factor"),
Group = structure(c(2L, 2L, 3L, 2L, 1L, 1L), .Label = c(" Animal",
" Fruit", " Vegetable"), class = "factor"), aaaA = c(1.23,
0.54, 0.01, 0.1, 1, 1.2), aaaE = c(0.45, 0.12, 0.05, 0.2,
0.9, 1.1), bbbR = c(0.3, 2, 0.45, 0.1, 1.2, 0.8), cccD = c(1.1,
1.32, 0.9, 0.3, 0.8, 0.7)), class = "data.frame", row.names = c(NA,
-6L))
To get a successful output for a single column I do:
summary(aov(aaaA ~ Group, data=data))[[1]][["Pr(>F)"]]
I then try to implement that in a loop:
for(i in names(data[3:6])){
  out <- summary(aov(i ~ Group, data = data))[[1]][["Pr(>F)"]]
  write.csv(out, i)
}
Which returns the error:
Error in model.frame.default(formula = i ~ Group, data = data, drop.unused.levels = TRUE) :
  variable lengths differ (found for 'Group')
Can anyone help with getting around the error or implementing a per-column ANOVA?
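The error happens because i is a character string, so i ~ Group models the length-one string against the Group column, hence "variable lengths differ". Building the formula from the string fixes it. A minimal sketch (assuming the dput above is assigned to data) that also collects the p values into a data frame, as asked:

cols <- names(data)[3:6]  # widen the range for the real 5413-column data
pvals <- sapply(cols, function(col) {
  fit <- aov(as.formula(paste(col, "~ Group")), data = data)
  summary(fit)[[1]][["Pr(>F)"]][1]  # first entry is the Group p value
})
result <- data.frame(column = cols, p_value = pvals, row.names = NULL)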
We can do the following and later get the p values:
to_use <- setdiff(names(df), "aaaA")
lapply(to_use, function(x)
  summary(do.call(aov, list(as.formula(paste("aaaA", "~", x)),
                            data = df))))
This gives you:
[[1]]
Df Sum Sq Mean Sq
Name 5 1.48 0.296
[[2]]
Df Sum Sq Mean Sq F value Pr(>F)
Group 2 0.8113 0.4057 1.819 0.304
Residuals 3 0.6689 0.2230
[[3]]
Df Sum Sq Mean Sq F value Pr(>F)
aaaE 1 0.9286 0.9286 6.733 0.0604 .
Residuals 4 0.5516 0.1379
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[[4]]
Df Sum Sq Mean Sq F value Pr(>F)
bbbR 1 0.043 0.0430 0.12 0.747
Residuals 4 1.437 0.3593
[[5]]
Df Sum Sq Mean Sq F value Pr(>F)
cccD 1 0.1129 0.1129 0.33 0.596
Residuals 4 1.3673 0.3418
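If that list is saved, say as res, the p values can be pulled out afterwards; a sketch (note the first fit, aaaA ~ Name, is saturated, so its table has no F test and yields NULL):

res <- lapply(to_use, function(x)
  summary(do.call(aov, list(as.formula(paste("aaaA", "~", x)), data = df))))
# first term's p value from each ANOVA table
pvals <- lapply(res, function(s) s[[1]][["Pr(>F)"]][1])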
Sample data:
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.21 0.02 0.61 10 5 3 0.01
2 0.01 0.02 0.37 0.4 0.01 0.8 0.5
3 0.02 0.03 0.55 0.01 0.01 0.3 0.99
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55
5 0.11 0.1 -0.08 0.22 0.015 0.01 0.01
6 0.22 0.21 -0.08 0.02 0.03 0.01 0.01
I have a dataset with columns for several variables of interest, say alpha, beta, and so on. I also have these variable names saved as a character vector. I want to mutate new columns based on these variable names, suffixed with an identifier, using the existing columns in the dataset as part of some transformation, like this:
df %>% mutate(
alpha_new = ((alpha_5-alpha_1) / (X_5-X_1) * Y),
beta_new = ((beta_5-beta_1) / (X_5-X_1) * Y)
)
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1 alpha_new beta_new
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.21 0.02 0.61 10 5 3 0.01 16.1 9.60
2 0.01 0.02 0.37 0.4 0.01 0.8 0.5 -14.4 -11.1
3 0.02 0.03 0.55 0.01 0.01 0.3 0.99 0 38.0
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55 -0.145 15.1
5 0.11 0.1 -0.08 0.22 0.015 0.01 0.01 -1.64 0
6 0.22 0.21 -0.08 0.02 0.03 0.01 0.01 0.0800 0
In my real data I have many more columns like this, and I'm struggling to implement this in a "tidy" way that isn't hardcoded. What's the best practice for my situation?
Sample code:
structure(
list(
X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
X_1 = c(0.02,
0.02, 0.03, 0.05, 0.10, 0.21),
Y = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
beta_5 = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)
),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")
) -> df
variable_of_interest <- c("alpha", "beta")
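One compact option, sketched here under the assumption of dplyr >= 1.0, is across() with cur_column(): derive each paired _1 column from the _5 column currently being processed, and let .names build the _new names:

library(dplyr)

# a sketch, assuming dplyr >= 1.0; cur_column() gives the _5 column
# currently being processed, and sub() derives its _1 partner
df %>%
  mutate(across(
    all_of(paste0(variable_of_interest, "_5")),
    ~ (.x - .data[[sub("_5$", "_1", cur_column())]]) / (X_5 - X_1) * Y,
    .names = "{sub('_5$', '_new', .col)}"  # alpha_5 -> alpha_new, etc.
  ))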
Here's another way to approach this with dynamic creation of columns. With map_dfc from purrr you can column-bind the new results, creating the new column names with bang-bang (!!) on the left-hand side of the := operator and using the .data pronoun to access column values on the right-hand side.
library(tidyverse)
bind_cols(
df,
map_dfc(
variable_of_interest,
~ transmute(df, !!paste0(.x, '_new') :=
(.data[[paste0(.x, '_5')]] - .data[[paste0(.x, '_1')]]) /
(X_5 - X_1) * Y)
)
)
Output
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1 alpha_new beta_new
1 0.21 0.02 0.61 10.00 5.000 3.00 0.01 16.05263 9.599474
2 0.01 0.02 0.37 0.40 0.010 0.80 0.50 -14.43000 -11.100000
3 0.02 0.03 0.55 0.01 0.010 0.30 0.99 0.00000 37.950000
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55 -0.14500 15.080000
5 0.11 0.10 -0.08 0.22 0.015 0.01 0.01 -1.64000 0.000000
6 0.22 0.21 -0.08 0.02 0.030 0.01 0.01 0.08000 0.000000
Better to pivot the data first
library(dplyr)
library(tidyr)
# your data
df <- structure(list(X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22), X_1 = c(0.02,
0.02, 0.03, 0.05, 0.1, 0.21), Y = c(0.61, 0.37, 0.55, 0.29, -0.08,
-0.08), alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02), alpha_1 = c(5,
0.01, 0.01, 0.005, 0.015, 0.03), beta_5 = c(3, 0.8, 0.3, 0.03,
0.01, 0.01), beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)), class = "data.frame", row.names = c(NA,
-6L))
df <- df |> mutate(id = 1:n()) |>
pivot_longer(cols = -c(id, Y, X_5, X_1),
names_to = c("name", ".value"), names_sep="_") |>
mutate(new= (`5` - `1`) / (X_5 - X_1) * Y) |>
pivot_wider(id_cols = id, names_from = "name", values_from = c(`5`,`1`, `new`),
names_glue = "{name}_{.value}", values_fn = sum)
df
#> # A tibble: 6 × 7
#> id alpha_5 beta_5 alpha_1 beta_1 alpha_new beta_new
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 3 5 0.01 16.1 9.60
#> 2 2 0.4 0.8 0.01 0.5 -14.4 -11.1
#> 3 3 0.01 0.3 0.01 0.99 0 38.0
#> 4 4 0.01 0.03 0.005 0.55 -0.145 15.1
#> 5 5 0.22 0.01 0.015 0.01 -1.64 0
#> 6 6 0.02 0.01 0.03 0.01 0.0800 0
Created on 2023-02-16 with reprex v2.0.2
Note: if you want to add X_5 and X_1 in the output use id_cols = c(id, X_5, X_1) instead.
I modified your data to create a slightly more complicated situation; my hope is that this is close to your real one. The assumption in this idea is that the two columns you want to pair up stay next to each other. The first job is to collect the column names that begin with lowercase letters. The next job is to create a data frame: here I keep the column names in odd positions of target in the first column, and the ones in even positions in the second column. I was thinking along the same lines as Ben; I used map2_dfc to create the output data frame. In this function, I replaced all lowercase letters with X so that I could specify the two matching columns in the original data (i.e., the ones starting with X). Then I did the calculation as you specified. Finally, I created a column name for the outcome in the loop. If you want to add the result to the original data, you can run the final line with cbind.
library(purrr)   # map2_dfc
library(tibble)

grep(x = names(df), pattern = "[[:lower:]]+_[0-9]+", value = TRUE) -> target
tibble(first_element = target[c(TRUE, FALSE)],
second_element = target[c(FALSE, TRUE)]) -> mydf
map2_dfc(.x = mydf$first_element,
.y = mydf$second_element,
.f = function(x, y) {
sub(x = x, pattern = "[[:lower:]]+", replacement = "X") -> foo1
sub(x = y, pattern = "[[:lower:]]+", replacement = "X") -> foo2
outcome <- ((df[x] - df[y]) / (df[foo1] - df[foo2]) * df["Y"])
names(outcome) <- paste(x,
sub(x = y, pattern = "[[:lower:]]+", replacement = ""),
sep = "")
return(outcome)
}) -> result
cbind(df, result)
# (only the new columns shown)
#  alpha_5_1 alpha_2_6  beta_5_1 beta_3_4
#1 16.05263 0.10736 9.599474 0.27145
#2 -14.43000 0.10730 -11.100000 0.28564
#3 0.00000 0.28710 37.950000 0.50820
#4 -0.14500 0.21576 15.080000 0.64206
#5 -1.64000 -0.06416 0.000000 -0.61352
#6 0.08000 -0.08480 0.000000 -0.25400
DATA
structure(list(
X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
X_1 = c(0.02,0.02, 0.03, 0.05, 0.10, 0.21),
X_2 = 1:6,
X_6 = 6:11,
X_3 = 21:26,
X_4 = 31:36,
Y = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
alpha_2 = c(0.12, 0.55, 0.39, 0.28, 0.99, 0.7),
alpha_6 = 1:6,
beta_5 = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01),
beta_3 = c(0.55, 0.28, 0.76, 0.86, 0.31, 0.25),
beta_4 = c(5, 8, 10, 23, 77, 32)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")) -> df
I am trying to use pivot_longer. However, I am not sure how to use names_sep or names_pattern to solve this.
dat <- tribble(
~group, ~BP, ~HS, ~BB, ~lowerBP, ~upperBP, ~lowerHS, ~upperHS, ~lowerBB, ~upperBB,
"1", 0.51, 0.15, 0.05, 0.16, 0.18, 0.5, 0.52, 0.14, 0.16,
"2.1", 0.67, 0.09, 0.06, 0.09, 0.11, 0.66, 0.68, 0.08, 0.1,
"2.2", 0.36, 0.13, 0.07, 0.12, 0.15, 0.34, 0.38, 0.12, 0.14,
"2.3", 0.09, 0.17, 0.09, 0.13, 0.16, 0.08, 0.11, 0.15, 0.18,
"2.4", 0.68, 0.12, 0.07, 0.12, 0.14, 0.66, 0.69, 0.11, 0.13,
"3", 0.53, 0.15, 0.06, 0.14, 0.16, 0.52, 0.53, 0.15, 0.16)
Desired output (First row from wide data)
group names values lower upper
1 BP 0.51 0.16 0.18
1 HS 0.15 0.5 0.52
1 BB 0.05 0.14 0.16
Here is a solution following a similar method to the one @Fnguyen used, but using the newer pivot_longer and pivot_wider constructs:
library(dplyr)
library(tidyr)
longer <- pivot_longer(dat, cols = -1, names_pattern = "(.*)(..)$",
                       names_to = c("limit", "name")) %>%
  mutate(limit = ifelse(limit == "", "value", limit))

answer <- pivot_wider(longer, id_cols = c(group, name), names_from = limit,
                      values_from = value, names_repair = "check_unique")
Most of the selecting, separating, mutating and renaming is taking place within the pivot function calls.
Update:
The regular expression "(.*)(..)$" means:
( )( ) look for two parts,
(.*) the first part should have zero or more characters,
(..) the second part should have exactly two characters, anchored to the end of the string by "$".
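To see what the two groups capture on a few of the column names, here's a quick base-R check (regmatches returns the full match followed by the two groups):

nm <- c("BP", "lowerBP", "upperHS")
regmatches(nm, regexec("(.*)(..)$", nm))
# [[1]] "BP"      ""      "BP"   (empty first group -> recoded to "value")
# [[2]] "lowerBP" "lower" "BP"
# [[3]] "upperHS" "upper" "HS"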
A data.table version (not sure yet how to retain the original names so that you don't need to substitute them back afterwards: https://github.com/Rdatatable/data.table/issues/2551):
library(data.table)
df <- data.table(dat)
v <- c("BP","HS","BB")
setnames(df, v, paste0("x",v) )
g <- melt(df, id.vars = "group",
measure.vars = patterns(values = "x" ,
lower = "lower",
upper = "upper"),
variable.name = "names")
g[names==1, names := "BP" ]
g[names==2, names := "HS" ]
g[names==3, names := "BB" ]
group names values lower upper
1: 1 BP 0.51 0.16 0.18
2: 2.1 BP 0.67 0.09 0.11
3: 2.2 BP 0.36 0.12 0.15
4: 2.3 BP 0.09 0.13 0.16
5: 2.4 BP 0.68 0.12 0.14
6: 3 BP 0.53 0.14 0.16
7: 1 HS 0.15 0.50 0.52
8: 2.1 HS 0.09 0.66 0.68
9: 2.2 HS 0.13 0.34 0.38
10: 2.3 HS 0.17 0.08 0.11
11: 2.4 HS 0.12 0.66 0.69
12: 3 HS 0.15 0.52 0.53
13: 1 BB 0.05 0.14 0.16
14: 2.1 BB 0.06 0.08 0.10
15: 2.2 BB 0.07 0.12 0.14
16: 2.3 BB 0.09 0.15 0.18
17: 2.4 BB 0.07 0.11 0.13
18: 3 BB 0.06 0.15 0.16
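Since names comes back as a factor with levels 1, 2, 3 in the same order as v, the three relabelling steps can likely be collapsed into one (a sketch):

# relabel the factor levels 1/2/3 to BP/HS/BB in one step
g[, names := factor(names, labels = v)]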
Based on your example data this solution using dplyr works for me:
library(dplyr)
library(tidyr)   # gather, separate, spread
library(tibble)  # rowid_to_column
dat %>%
gather(key, values,-group) %>%
mutate(names = gsub("lower","",gsub("upper","",key))) %>%
separate(key, into = c("key1","key2") ,"[[:upper:]]", perl=T) %>%
mutate(key1 = case_when(key1 == "" ~ "values", TRUE ~ key1)) %>%
select(group,names,key1,values) %>%
rowid_to_column() %>%
spread(key1,values) %>%
select(-rowid) %>%
group_by(group,names) %>%
summarise_all(mean,na.rm = TRUE)
I'd like to add an alternative tidyverse solution drawing on the answer provided by @Dave2e.
Like Dave2e's solution it's a two-step procedure (first rename, then reshape). But instead of reshaping the data twice, I add the prefix "values" to the columns named "BP", "HS", and "BB" using rename_with. This is necessary to get the column names right when using the .value sentinel in the names_to argument of pivot_longer.
library(dplyr)
library(tidyr)
dat %>%
  rename_with(~ sub("^(BP|HS|BB)$", "values\\1", .)) %>% # add prefix "values"
  pivot_longer(cols = -1,
               names_pattern = "(.*)(BP|HS|BB)$",
               names_to = c(".value", "names"))
I've got a dataset that has info about bunch of cities in it. Variables include % of residents that are several different race categories, % of residents in several employment sectors, etc. I'm trying to determine, for each category, how close each city is to an even split among the options.
So for race, there's 4 race categories, so a city that's 25% of each would be (for example) 1, while a city that was 100% white would be a 0. However, with 7 employment sectors, each would have to be 14.29% for a perfect score (the point being that I'm doing this on multiple categories with different numbers of groups in each category). My output would be a column that has some kind of numeric score for how evenly the group I'm looking at (for example, race) is spread out.
I'm programming in R, so a solution there would be great, but I'm up for whatever kind of answer might be useful.
Here's a sample data frame if that's useful
testdata <- structure(list(
  city = c("City1", "City2", "City3", "City4"),
  black = c(0.4, 0.1, 0.3, 0.2),
  white = c(0.3, 0.7, 0.1, 0.2),
  hisp = c(0.2, 0.1, 0.2, 0.2),
  asian = c(0.1, 0.1, 0.4, 0.4),
  service = c(0.10, 0.14, 0.4, 0.0),
  tech = c(0.00, 0.14, 0.6, 0.2),
  govt = c(0.15, 0.14, 0.0, 0.2),
  nonprofit = c(0.20, 0.14, 0.0, 0.3),
  agriculture = c(0.05, 0.14, 0.0, 0.1),
  manufacturing = c(0.40, 0.14, 0.0, 0.1),
  marketing = c(0.10, 0.16, 0.0, 0.1)),
  row.names = c(NA, -4L), class = "data.frame")
Here's one way to proceed:
Differentiate the data based on categories. In the example you have shared, you have information about two broad categories, race and employment sectors. Once you have the categories, you can get the even-split target by dividing 1 by the number of rows in each group, and then take the absolute difference between that target and the value present.
library(dplyr)
testdata %>%
tidyr::pivot_longer(cols = -city) %>%
mutate(category=case_when(name %in% c('black', 'white', 'hisp', 'asian') ~ 'race',
TRUE ~ 'sectors')) %>%
group_by(city, category) %>%
mutate(close_ratio = abs(1/n() - value))
# city name value category close_ratio
# <chr> <chr> <dbl> <chr> <dbl>
# 1 City1 black 0.4 race 0.15
# 2 City1 white 0.3 race 0.0500
# 3 City1 hisp 0.2 race 0.0500
# 4 City1 asian 0.1 race 0.15
# 5 City1 service 0.1 sectors 0.0429
# 6 City1 tech 0 sectors 0.143
# 7 City1 govt 0.15 sectors 0.00714
# 8 City1 nonprofit 0.2 sectors 0.0571
# 9 City1 agriculture 0.05 sectors 0.0929
#10 City1 manufacturing 0.4 sectors 0.257
# … with 34 more rows
close_ratio = 0 is ideal, meaning the value matches an even split exactly. The further it is from 0, the more uneven the split.
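If you want the single 0-to-1 score per city and category described in the question, one possible extension (my own scaling choice, not the only sensible one) is to sum these deviations and rescale by the worst case; for proportions summing to 1 the worst case (everything in one group) totals 2 * (1 - 1/n):

testdata %>%
  tidyr::pivot_longer(cols = -city) %>%
  mutate(category = case_when(name %in% c('black', 'white', 'hisp', 'asian') ~ 'race',
                              TRUE ~ 'sectors')) %>%
  group_by(city, category) %>%
  # 1 = perfectly even split, 0 = everything in a single group
  summarise(evenness = 1 - sum(abs(value - 1 / n())) / (2 * (1 - 1 / n())),
            .groups = "drop")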
I'd like to divide each row by its sum whenever rowSums() is greater than one, and I'm looking for a solution without a for() loop.
Sample Data
dat <- structure(list(x1 = c(0.18, 0, 0.11, 0.24, 0.33), x2 = c(0.34,
0.14, 0.36, 0.35, 0.21), x3 = c(0.1, 0.36, 0.12, 0.07, 0.18),
x4 = c(0.08, 0.35, 0.06, 0.1, 0.09), x5 = c(0.26, 0.13, 0.22,
0.31, 0.22)), .Names = c("x1", "x2", "x3", "x4", "x5"), row.names = c(NA,
5L), class = "data.frame")
> rowSums(dat)
1 2 3 4 5
0.96 0.98 0.87 1.07 1.03
What I've tried
This works, but I wonder if there is a better way to do it:
a <- which(rowSums(dat) > 1)
dat[a, ] <- dat[a, ] / rowSums(dat[a, ])
> rowSums(dat)
1 2 3 4 5
0.96 0.98 0.87 1.00 1.00
This gives the same result as the code near the end of the question: pmax() floors the divisor at 1, so rows whose sum is at most 1 are divided by 1 (left unchanged), while rows whose sum exceeds 1 are divided by that sum.
dat / pmax(rowSums(dat), 1)
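A quick sanity check reproduces the row sums the question expects:

rowSums(dat / pmax(rowSums(dat), 1))
#    1    2    3    4    5
# 0.96 0.98 0.87 1.00 1.00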
This is inferior to G. Grothendieck's answer, but you can also use ifelse (note the condition: rows summing to more than 1 get divided by their sum, the rest by 1).
rs <- rowSums(dat)
dat / ifelse(rs > 1, rs, 1)
         x1        x2         x3         x4        x5
1 0.1800000 0.3400000 0.10000000 0.08000000 0.2600000
2 0.0000000 0.1400000 0.36000000 0.35000000 0.1300000
3 0.1100000 0.3600000 0.12000000 0.06000000 0.2200000
4 0.2242991 0.3271028 0.06542056 0.09345794 0.2897196
5 0.3203883 0.2038835 0.17475728 0.08737864 0.2135922
My data is organized as such:
Distance r^2
0 1
0 0.9
0 0
0 0.8
0 1
1 0.5
1 0.45
1 0.56
1 1
2 0
2 0.9
3 0
3 0.1
3 0.2
3 0.3
...
300 1
300 0.8
I want to plot r^2 decay with distance, meaning I want to plot a mean value + st-dev for every unique distance value. So I should have 1 point at x=0, 1 point at x=1... but I have multiple x=0 values.
What is the best way to achieve this, given how the data is organized? I would like to do it in R if possible.
Thank you,
Adrian
Edit:
I have tried:
> dd <- structure(list(Distance = dist18, r.2 = a18[,13]), .Names = c("Distance", "r^2"), class = "data.frame", row.names = c(NA, -15L))
> ggplot(dd, aes(x=Distance, y=r.2)) + stat_summary(fun.data="mean_sdl")
Error in data.frame(x = c(42L, 209L, 105L, 168L, 63L, 212L, 148L, 175L, : arguments imply differing number of rows: 126877, 15
> head(dist18)
[1] 42 209 105 168 63 212
> head(dd)
Distance r.2
1 42 0.89
2 209 0.92
3 105 0.91
4 168 0.81
5 63 0.88
6 212 0.88
Is this because my data is not sorted?
You can also plot your SD as an area around the mean similar to CI plotting (assuming temp is your data set)
library(data.table)
library(ggplot2)
temp <- setDT(temp)[, list(Mean = mean(r.2), SD = sd(r.2)), by = Distance]
ggplot(temp) +
  geom_point(aes(Distance, Mean)) +
  geom_ribbon(aes(x = Distance, ymin = Mean - SD, ymax = Mean + SD),
              fill = "skyblue", alpha = 0.4)
Using dplyr it will be something like this:
df = data.frame(distance = rep(1:300, each = 10), r2 = runif(3000))
library(dplyr)
df_group = group_by(df, distance)
summarise(df_group, mn = mean(r2), s = sd(r2))

Source: local data frame [300 x 3]
distance mn s
1 300 0.4977758 0.3565554
2 299 0.4295891 0.3281598
3 297 0.5346428 0.3424429
4 296 0.4623368 0.3163320
5 291 0.3224376 0.2103655
6 290 0.3916658 0.2115264
7 288 0.6147680 0.2953960
8 287 0.3405524 0.2032616
9 286 0.5690844 0.2458538
10 283 0.2901744 0.2835524
.. ... ... ...
Where df is the data.frame with your data, and distance and r2 the two column names.
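To get from that summary to the plot itself, a sketch (geom_pointrange draws the mean with ±1 sd bars; df_sum is just a name for the saved summary):

library(ggplot2)
df_sum <- summarise(df_group, mn = mean(r2), s = sd(r2))
ggplot(df_sum, aes(x = distance, y = mn)) +
  geom_pointrange(aes(ymin = mn - s, ymax = mn + s))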
This should work:
# Create a data frame like yours
df=data.frame(sample(50,size=300,replace=TRUE),runif(300))
colnames(df)=c('Distance','r^2')
#initialize empty data frame with columns x, mean and stdev
results=data.frame(x=numeric(0),mean=numeric(0),stdev=numeric(0))
count=1
for (i in 0:max(df$Distance)){
results[count,'x']=i
temp_mean=mean(df[which(df$Distance==i),'r^2'])
results[count,'mean']=temp_mean
temp_sd=sd(df[which(df$Distance==i),'r^2'])
results[count,'stdev']=temp_sd
count=count+1
}
# Plot your results
plot(results$x,results$mean,xlab='distance',ylab='r^2')
epsilon=0.02 #to add the little horizontal bar to the error bars
for (i in 1:nrow(results)){
up = results$mean[i] + results$stdev[i]
low = results$mean[i] - results$stdev[i]
segments(results$x[i],low , results$x[i], up)
segments(results$x[i]-epsilon, up , results$x[i]+epsilon, up)
segments(results$x[i]-epsilon, low , results$x[i]+epsilon, low)
}
Here's the result http://imgur.com/ED7PwD8
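As an aside, the per-distance mean and sd inside the loop can also be computed without explicit looping, for example with tapply (a sketch against the same simulated df):

means <- tapply(df$`r^2`, df$Distance, mean)
sds <- tapply(df$`r^2`, df$Distance, sd)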
If you want to plot mean and +/- 1 sd for each point, the ggplot function makes this easy. With the test data
dd<-structure(list(Distance = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
2L, 2L, 3L, 3L, 3L, 3L), r.2 = c(1, 0.9, 0, 0.8, 1, 0.5, 0.45,
0.56, 1, 0, 0.9, 0, 0.1, 0.2, 0.3)), .Names = c("Distance", "r.2"
), class = "data.frame", row.names = c(NA, -15L))
you can just run
library(ggplot2)
library(Hmisc)  # mean_sdl uses Hmisc::smean.sdl under the hood
ggplot(dd, aes(x = Distance, y = r.2)) +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1))
which produces a plot of the mean with error bars of ±1 sd at each distance.
I tried with your real data and got
real <- read.table("http://pelinfamily.ca/bio/GDR-18_conc.ld", header=F)
dd <- data.frame(Distance=real[,2]-real[,1], r.2=real[,13])
ggplot(dd, aes(x = Distance, y = r.2)) +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1), geom = "ribbon", alpha = .4) +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1), geom = "line")