calc mean/std/ci from column - r

Is there a package to easily calculate the mean/std/CI for each distinct value of n?
For example, starting with the data:
> n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
> s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
> df = data.frame(n, s)
> df
   n  s
1  0 43
2  0 23
3  0 65
4  0 43
5  0 12
6  0 54
7  0 43
8  2 12
9  2  2
10 2 43
11 2 62
12 5 25
13 5 55
14 5 75
15 5 95
16 8 28
17 8 48
18 8 68
19 8 18
resulting in:
data
  n mean std  ci
  0   40  ..  ..
  2   30  ..  ..
  5   63  ..  ..
  8   41  ..  ..

dplyr is nice, but not necessary. In base R:
## df() is built-in in R, avoid ...
dd <- data.frame(n = rep(c(0,2,5,8), c(7,4,4,4)),
                 s = c(43,23,65,43,12,54,43,12,2,43,
                       62,25,55,75,95,28,48,68,18))
sumfun <- function(x) {
  m <- mean(x)
  s <- sd(x)
  se <- s/sqrt(length(x))
  c(mean=m, sd=s, lwr=m-1.96*se, upr=m+1.96*se)
}
(or see smean.cl.normal(), smean.cl.boot(), etc. from the Hmisc package ...)
res <- do.call(rbind,tapply(dd$s,dd$n,sumfun))
res <- cbind(n=unique(dd$n),as.data.frame(res))
Or, as pointed out by @thelatemail:
res <- do.call(data.frame, aggregate(s ~ n, data=dd, FUN=sumfun))
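For example, the Hmisc helpers mentioned above could be dropped into the same pattern (a sketch, assuming the Hmisc package is installed; smean.cl.normal() returns the mean with a t-based 95% interval):
library(Hmisc)
## same tapply/rbind pattern as above, but with Hmisc's summary helper
do.call(rbind, tapply(dd$s, dd$n, smean.cl.normal))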
You can easily package this into a function if you're going to use it on a regular basis.
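For example, packaging the steps above into a reusable helper might look like this (a sketch; the name group_ci and its arguments are illustrative, and qnorm() is used so the confidence level can vary):
group_ci <- function(data, value, group, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  ## same summary as sumfun() above, with a configurable level
  sumfun <- function(x) {
    m <- mean(x); s <- sd(x); se <- s / sqrt(length(x))
    c(mean = m, sd = s, lwr = m - z * se, upr = m + z * se)
  }
  do.call(data.frame, aggregate(reformulate(group, value), data = data, FUN = sumfun))
}
group_ci(dd, "s", "n")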
For larger data sets/more complex transformations you can search SO for answers comparing solutions from the dplyr, plyr, data.table, doBy packages as well as base-R solutions using combinations of tapply(), ave(), aggregate(), by() ...

You can use the dplyr package.
Here's a code snippet. Note that I'm assuming you want to build the confidence interval using the standard normal approximation at the 95% level, but you can make whatever choice you like.
n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
df = data.frame(n, s)
library(dplyr)

df %>%
  group_by(n) %>%
  summarise(mean = mean(s),
            std = sqrt(var(s)),
            lower = mean(s) - qnorm(.975)*std/sqrt(n()),
            upper = mean(s) + qnorm(.975)*std/sqrt(n()))
Source: local data frame [4 x 5]

  n     mean      std     lower    upper
1 0 40.42857 17.88721 27.177782 53.67936
2 2 29.75000 27.69326  2.611104 56.88890
3 5 62.50000 29.86079 33.236965 91.76303
4 8 40.50000 22.17356 18.770313 62.22969
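If you want a level other than 95%, only the quantile changes; here is a sketch for a 90% interval (the level variable is an assumption for illustration, not part of the answer above):
library(dplyr)
level <- 0.90   # assumed: a 90% interval instead of 95%
df %>%
  group_by(n) %>%
  summarise(mean  = mean(s),
            std   = sqrt(var(s)),
            lower = mean(s) - qnorm(1 - (1 - level)/2)*std/sqrt(n()),
            upper = mean(s) + qnorm(1 - (1 - level)/2)*std/sqrt(n()))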

Thanks for the advice, everyone. I took a look at plyr and solved it:
n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
dd = data.frame(n, s)
library(plyr)
data <- ddply(dd, .(n), function(dd) c(mean  = mean(dd$s),
                                       std   = sd(dd$s),
                                       se    = sd(dd$s)/sqrt(length(dd$s)),
                                       lower = mean(dd$s) - qnorm(.975)*sd(dd$s)/sqrt(length(dd$s)),
                                       upper = mean(dd$s) + qnorm(.975)*sd(dd$s)/sqrt(length(dd$s))))
resulting in:
data
  n     mean      std        se     lower    upper
1 0 40.42857 17.88721  6.760731 27.177782 53.67936
2 2 29.75000 27.69326 13.846630  2.611104 56.88890
3 5 62.50000 29.86079 14.930394 33.236965 91.76303
4 8 40.50000 22.17356 11.086779 18.770313 62.22969
Will avoid df() in the future, thanks

Update tidyr 1.0.0
Even though the solution by @user1357015 is totally fine, if you are a tidyverse fan like me, there is an elegant alternative:
tidyr 1.0.0 introduced a function that didn't get much attention but is very useful: unnest_wider().
With that, you can simplify the code to the following:
library(dplyr)
library(tidyr)
library(purrr)
library(DescTools)  # for MeanCI()

df %>%
  group_by(n) %>%
  nest(data = -"n") %>%
  mutate(ci = map(data, ~ MeanCI(.x$s))) %>%
  unnest_wider(ci)
which gives
# A tibble: 4 x 5
# Groups:   n [4]
      n data              mean lwr.ci upr.ci
  <dbl> <list>           <dbl>  <dbl>  <dbl>
1     0 <tibble [7 × 1]>  40.4   23.9   57.0
2     2 <tibble [4 × 1]>  29.8  -14.3   73.8
3     5 <tibble [4 × 1]>  62.5   15.0  110.
4     8 <tibble [4 × 1]>  40.5    5.22  75.8
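If you do not need the nested data afterwards, here is a sketch that drops the list-column (same packages as above; DescTools supplies MeanCI()):
df %>%
  nest(data = -"n") %>%
  mutate(ci = map(data, ~ MeanCI(.x$s))) %>%
  unnest_wider(ci) %>%
  select(-data)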

Related

Use values of one vector as index range for another vector

I provide two vectors here:
vec1 <- c(5, 2, 2, 2, 2, 3, 2, 3, 9, 6, 2, 2, 2, 3)
vec2 <- c(1.96845698, 1.11342534, 0.82580110, 0.35762122, 0.07210485, 0.06046759, 0.93615974, 0.85691566, 0.39439991,
0.26110080, 1.22082336, 0.71940824, 0.32571803, 0.46358160, 0.16009616, 0.13348428, 1.16801097, 0.30184661,
0.51190796, 1.69680701, 0.54418158, 0.74969466, 0.17246107, 0.66953561, 1.02689205, 1.67408220, 1.20311478,
0.74049935, 0.55211334, 0.31037724, 0.23620425, 0.34532764, 1.64696898, 0.23094382, 0.67733098, 0.32226374,
0.25774802, 0.35768477, 0.27219803, 0.02042260, 0.53784081, 1.27521977, 0.07043151, 0.11879638, 0.13358880)
Now I would like to calculate the mean values of different parts of vec2. The length of these parts is determined by the values of vec1.
So the output is supposed to be a vector of the same length as vec1.
The first value of this output vector should be the mean of vec2[1:5], since vec1[1] = 5. The second value should then be the mean of vec2[6:7], since vec1[2] = 2, and so forth, until the last value of the output vector, which should correspond to the mean of vec2[43:45], since the last value of vec1 is 3.
I hope it is clear what I mean.
Here I calculated manually the expected output vector:
vec3 <- c(0.8674819, 0.4983137, 0.6256578, 0.7409621, 0.5225631, 0.2523873, 0.7349288,
0.9176322, 0.7887523, 0.5765066, 0.3077164, 0.1463103, 0.9065303, 0.1076056)
Does anybody have an idea how to do that?
You can try:
tapply(vec2, rep(seq_along(vec1), vec1), mean)
#tapply(vec2, unlist(Map(rep, seq_along(vec1), each=vec1)), mean) #Alternative
#tapply(vec2, inverse.rle(list(lengths=vec1, values=seq_along(vec1))), mean) #Alternative
# 1 2 3 4 5 6 7 8
#0.8674819 0.4983137 0.6256578 0.7409621 0.5225631 0.2523873 0.7349288 0.9176322
# 9 10 11 12 13 14
#0.7887523 0.5765066 0.3077164 0.1463103 0.9065303 0.1076056
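To see why this works: rep(seq_along(vec1), vec1) expands position i of vec1 into a run of length vec1[i], giving one group label per element of vec2 (first ten labels shown):
head(rep(seq_along(vec1), vec1), 10)
#> [1] 1 1 1 1 1 2 2 3 3 4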
The aggregate function can be used to do this if you rearrange vec1 slightly:
vec1 <- rep(seq_along(vec1), vec1)
aggregate(vec2, list(vec1), mean)$x
# [1] 0.8674819 0.4983137 0.6256578 0.7409621 0.5225631 0.2523873 0.7349288 0.9176322 0.7887523 0.5765066 0.3077164 0.1463103 0.9065303 0.1076056
Another solution using purrr
# first construct the ranges which is used as input in the purrr-call
range2 <- cumsum(vec1)
range1 <- c(1,cumsum(vec1[1:(length(vec1)-1)])+1)
purrr::map2_dbl(range1, range2, function(x,y) mean(vec2[x:y]))
[1] 0.8674819 0.4983137 0.6256578 0.7409621 0.5225631 0.2523873 0.7349288 0.9176322 0.7887523 0.5765066 0.3077164 0.1463103 0.9065303 0.1076056
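The start indices can also be built from the end indices directly; an equivalent, slightly shorter sketch:
range2 <- cumsum(vec1)
range1 <- c(1, head(range2, -1) + 1)
purrr::map2_dbl(range1, range2, ~ mean(vec2[.x:.y]))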
Yet another option could be:
tapply(vec2, cumsum(sequence(vec1) == 1), mean)
1 2 3 4 5 6 7 8 9
0.8674819 0.4983137 0.6256578 0.7409621 0.5225631 0.2523873 0.7349288 0.9176322 0.7887523
10 11 12 13 14
0.5765066 0.3077164 0.1463103 0.9065303 0.1076056
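This works because sequence(vec1) restarts at 1 at every block, so cumulatively counting those restarts labels the blocks (first ten values shown):
head(sequence(vec1), 10)
#> [1] 1 2 3 4 5 1 2 1 2 1
head(cumsum(sequence(vec1) == 1), 10)
#> [1] 1 1 1 1 1 2 2 3 3 4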
Using tidyverse
library(dplyr)
library(tidyr)
tibble(vec1) %>%
  mutate(grp = row_number()) %>%
  uncount(vec1) %>%
  mutate(vec2 = vec2) %>%
  group_by(grp) %>%
  summarise(vec2 = mean(vec2))
# A tibble: 14 × 2
grp vec2
<int> <dbl>
1 1 0.867
2 2 0.498
3 3 0.626
4 4 0.741
5 5 0.523
6 6 0.252
7 7 0.735
8 8 0.918
9 9 0.789
10 10 0.577
11 11 0.308
12 12 0.146
13 13 0.907
14 14 0.108

R: How to simply compare values of columns in 2 data frames

I am comparing two data frames: FU and FO
Here are short samples of what they look like
"Model_ID" "FU_Lin_Period" "FU_Growth_rate"
2 0.72127 0.0093333
3 0.69281 0.015857
4 0.66735 0.021103
5 0.64414 0.024205
6 0.62288 0.026568
7 0.60318 0.027749
8 0.58472 0.028161
9 0.56734 0.028008
10 0.55085 0.027309
11 0.53522 0.026068
12 0.52029 0.024684
13 0.50603 0.022866
14 0.49237 0.020991
15 0.47928 0.018773
"Model_ID" "FO_Lin_Period" "FO_Growth_rate"
7 0.44398 0.008868
8 0.43114 0.01674
9 0.41896 0.023248
10 0.40728 0.028641
11 0.39615 0.032192
12 0.38543 0.03543
13 0.37517 0.03692
14 0.36525 0.038427
15 0.35573 0.038195
As you can tell, they do not have all the same Model_ID
Basically, what I want to do is go through every Model_ID in the two tables, compare whether FU or FO's growth rate is larger for a given model ID, and...
if FU's is larger (or FU exists for the model number and FO does not), place the model number in a vector called selected_FU
if FO's is larger (or FO exists for the model number and FU does not), place the model number in a vector called selected_FO
Is there a way to do this without using loops?
data.table alternative using similar logic to the tidyverse answer.
Replace NAs with -Infinity, do the comparison of the two FU/FO_Growth_rate variables, flag which group had the larger value, and select the Model_ID into the variables requested.
library(data.table)
setDT(FU)
setDT(FO)
out <- merge(FU, FO, by="Model_ID", all=TRUE)[,
  "gr_sel" := c("FO","FU")[(nafill(FU_Growth_rate, fill=-Inf) >
                            nafill(FO_Growth_rate, fill=-Inf)) + 1],
]
selected_FU <- out[gr_sel == "FU", Model_ID]
selected_FO <- out[gr_sel == "FO", Model_ID]
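With the sample data given below under "Data used:", the selections should come out as (a quick check; the values depend on your data):
selected_FU
#> [1] 2 3 4 5 6 7 8 9
selected_FO
#> [1] 10 11 12 13 14 15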
Data used:
FU <- read.table(text="Model_ID FU_Lin_Period FU_Growth_rate\n2 0.72127 0.0093333\n3 0.69281 0.015857\n4 0.66735 0.021103\n5 0.64414 0.024205\n6 0.62288 0.026568\n7 0.60318 0.027749\n8 0.58472 0.028161\n9 0.56734 0.028008\n10 0.55085 0.027309\n11 0.53522 0.026068\n12 0.52029 0.024684\n13 0.50603 0.022866\n14 0.49237 0.020991\n15 0.47928 0.018773", header=TRUE)
FO <- read.table(text="Model_ID FO_Lin_Period FO_Growth_rate\n7 0.44398 0.008868\n8 0.43114 0.01674\n9 0.41896 0.023248\n10 0.40728 0.028641\n11 0.39615 0.032192\n12 0.38543 0.03543\n13 0.37517 0.03692\n14 0.36525 0.038427\n15 0.35573 0.038195", header=TRUE)
With dplyr, tidyr, and readr.
library(dplyr)
library(tidyr)
library(readr)
FU <- read_table2("test.FU.LINA.table")
FO <- read_table2("test.FO.LINA.table")
df_compared <-
full_join(FU, FO, by = "model_id") %>%
replace_na(list(fo_growth_rate = -1, fu_growth_rate = -1)) %>%
mutate(select_fufo = if_else(fu_growth_rate >= fo_growth_rate, true = "fu", false = "fo"))
df_compared
# A tibble: 6,166 x 6
model_id fu_lin_period fu_growth_rate fo_lin_period fo_growth_rate select_fufo
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2 0.721 0.00933 NA -1 fu
2 3 0.693 0.0159 NA -1 fu
3 4 0.667 0.0211 NA -1 fu
4 5 0.644 0.0242 NA -1 fu
5 6 0.623 0.0266 NA -1 fu
6 7 0.603 0.0277 0.444 0.00887 fu
7 8 0.585 0.0282 0.431 0.0167 fu
8 9 0.567 0.0280 0.419 0.0232 fu
9 10 0.551 0.0273 0.407 0.0286 fo
10 11 0.535 0.0261 0.396 0.0322 fo
# ... with 6,156 more rows
selected_fu <- df_compared %>% filter(select_fufo == "fu") %>% .$model_id
selected_fo <- df_compared %>% filter(select_fufo == "fo") %>% .$model_id

How to prevent R from rounding in frequency function?

I used the freq function of the frequency package to get frequency percentages on my dataset$MoriskyAdherence, but R rounds the percent values and I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3, 23.8. Their sum is 100.1.
The exact amounts should be 35.00, 41.25, 23.75.
What should I do?
I used sprintf, as.data.frame, formatC, and some other functions to deal with it, but ...
The function freq returns a character data frame and has no option to adjust the number of decimal places. However, it is easy to recreate the table in whatever form you want. For example, I have written this function, which will give you the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
  df <- frequency::freq(data_frame)
  lapply(df, function(x)
  {
    n <- suppressWarnings(as.numeric(x$Freq))
    sum_all <- as.numeric(x$Freq[nrow(x)])
    raw_percent <- suppressWarnings(100 * n / sum_all)
    t_row <- grep("Total", x[,2])[1]
    valid_percent <- suppressWarnings(100 * n / as.numeric(x$Freq[t_row]))
    x$Percent <- format(round(raw_percent, 2), nsmall = 2)
    x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
    x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
    x$'Cumulative Percent'[t_row:nrow(x)] <- ""
    x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
    return(x)
  })
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.
Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without round brackets), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim){
makefreqs(x1, y, maxrow1, trim1)
})
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a dataframe with two variables, one numeric, one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(Var2)*100,3))
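which, given the values shown above, should print approximately:
  Var1   Var2
1    A 15.937
2    B 46.235
3    C 12.434
4    D 25.394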

A compact way to perform multiple pairwise tests (e.g. t-test) with a single variable split in multiple categories in long-format

I am interested in performing multiple tests for a single variable with an associated factor that splits the values into multiple groups. It is related to this question, and I would like a solution of that kind, but it is not exactly the same.
In my case, I have a single variable and multiple groups (eventually many). Expanding on this example:
library(reshape)
# Create a dataset
mu=34
stdv=5
Location=rep(c("Area_A","Area_B","Area_C"),5)
distro=rnorm(length(Location),mu,stdv)
id=seq(1:length(Location))
sample_long=data.frame(id,Location,distro)
sample_long
id Location distro
1 1 Area_A 34.95737
2 2 Area_B 31.30298
3 3 Area_C 35.86569
4 4 Area_A 40.45378
5 5 Area_B 36.12060
6 6 Area_C 28.29649
7 7 Area_A 30.64495
8 8 Area_B 29.70668
9 9 Area_C 33.22874
10 10 Area_A 25.29148
11 11 Area_B 32.35511
12 12 Area_C 34.69159
13 13 Area_A 26.89791
14 14 Area_B 35.30717
15 15 Area_C 40.64628
I would like to perform all-against-all tests among Areas, i.e. test(Area_A,Area_B), test(Area_A,Area_C) and test(Area_B,Area_C) (in a more general case, all the i<j possible tests).
A simple way to go is to transform the data into wide format:
# Reshape to wide format
sample_wide=reshape(sample_long,direction="wide",idvar="id",timevar="Location")
sample_wide
id distro.Area_A distro.Area_B distro.Area_C
1 1 34.95737 NA NA
2 2 NA 31.30298 NA
3 3 NA NA 35.86569
4 4 40.45378 NA NA
5 5 NA 36.12060 NA
6 6 NA NA 28.29649
7 7 30.64495 NA NA
8 8 NA 29.70668 NA
9 9 NA NA 33.22874
10 10 25.29148 NA NA
11 11 NA 32.35511 NA
12 12 NA NA 34.69159
13 13 26.89791 NA NA
14 14 NA 35.30717 NA
15 15 NA NA 40.64628
and then loop across all the all-against-all column pairs, for which I've seen several approaches more R-like than the following one, which uses for loops:
# Now compute the test
test.out = list()
k = 0
for(i in 2:(dim(sample_wide)[2]-1)){     # all-against-all variable groups
  for(j in (i+1):dim(sample_wide)[2]){
    k = k+1
    test.out[[k]] = t.test(sample_wide[,i],
                           sample_wide[,j])  # store results in a list
  }
}
But my question is not about which is the best solution given the wide format, but whether it is possible to find a solution for the problem working from the original long format, in line with the solutions found for the links I provided above that use dplyr, broom, etc.
This is a little trickier and less straightforward than I hoped. You can first figure out the combinations of locations and, to make it a little simpler, save that in a lookup table. I turned that into a long shape with an ID for each pair, which I'll use as a grouping variable on the data.
library(dplyr)
library(tidyr)
library(purrr)
set.seed(111)
# same data creation code
grps <- as.data.frame(t(combn(levels(sample_long$Location), 2))) %>%
mutate(pair = row_number()) %>%
gather(key, value = loc, -pair) %>%
select(-key)
grps
#> pair loc
#> 1 1 Area_A
#> 2 2 Area_A
#> 3 3 Area_B
#> 4 1 Area_B
#> 5 2 Area_C
#> 6 3 Area_C
Joining the lookup to the data frame doubles the rows—that will differ depending on how many levels you're combining. Note also I dropped your ID column since it didn't seem necessary right now. Nest, do the t-test, and tidy the results.
sample_long %>%
  select(-id) %>%
  inner_join(grps, by = c("Location" = "loc")) %>%
  group_by(pair) %>%
  nest() %>%
  mutate(t_test = map(data, ~ t.test(distro ~ Location, data = .)),
         tidied = map(t_test, broom::tidy)) %>%
  unnest(tidied)
#> # A tibble: 3 x 13
#> pair data t_test estimate estimate1 estimate2 statistic p.value
#> <int> <lis> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 <tib… <htes… -0.921 31.8 32.7 -0.245 0.816
#> 2 2 <tib… <htes… -1.48 31.8 33.3 -0.383 0.716
#> 3 3 <tib… <htes… -0.563 32.7 33.3 -0.305 0.769
#> # … with 5 more variables: parameter <dbl>, conf.low <dbl>,
#> # conf.high <dbl>, method <chr>, alternative <chr>
If you needed to, you could do something to show which locations are in each pair—joining with the lookup table would be one way to do this.
I'm realizing also that you mentioned wanting to use broom functions afterwards, but didn't specify that you need a broom::tidy call. In that case, just drop the last 2 lines.
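A sketch of that join (assuming the pipeline result above is stored in a variable, say results; both results and the comparison column name are illustrative):
pair_labels <- grps %>%
  group_by(pair) %>%
  summarise(comparison = paste(loc, collapse = " vs "))
results %>%
  left_join(pair_labels, by = "pair")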
A little bit of base R will do the trick:
combn(x = unique(sample_long$Location), m = 2, simplify = FALSE,
      FUN = function(l) {
        t.test(distro ~ Location, data = subset(sample_long, Location %in% l))
      })
combn will generate all combinations of elements of x taken m at a time (sic). Combined with subset, you will apply your test to subsets of your data.frame.
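As a follow-up sketch (not part of the original answer), the call returns a list of htest objects, so you could store it and pull out the p-values; the name tests is illustrative:
tests <- combn(x = unique(sample_long$Location), m = 2, simplify = FALSE,
               FUN = function(l) {
                 t.test(distro ~ Location, data = subset(sample_long, Location %in% l))
               })
sapply(tests, function(tt) tt$p.value)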

less clunky reshaping of anscombe data

I was trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four different small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly were all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or a data.table solution would be fine with me, but the ideal solution would be
short/no repeated code
comprehensible (somewhat conflicting with criterion #1)
involve as little hard-coding of column numbers, etc. as possible
The original format:
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## ...
## 11 5 5 5 8 5.68 4.74 5.73 6.89
Desired format:
## s x y
## 1 1 10 8.04
## 2 1 8 6.95
## ...
## 44 4 8 6.89
Here's my attempt:
library("reshape2")
ff <- function(x, v)
  setNames(transform(
    melt(as.matrix(x)),
    v1 = substr(Var2, 1, 1),
    v2 = substr(Var2, 2, 2))[, c(3, 5)],
    c(v, "s"))
f1 <- ff(anscombe[, 1:4], "x")
f2 <- ff(anscombe[, 5:8], "y")
f12 <- cbind(f1, f2)[, c("s", "x", "y")]
Now plot:
library("ggplot2"); theme_set(theme_classic())
th_clean <-
  theme(panel.margin = grid::unit(0, "lines"),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank())
ggplot(f12, aes(x, y)) + geom_point() +
  facet_wrap(~s) + labs(x = "", y = "") +
  th_clean
If you are really dealing with the "anscombe" dataset, then I would say @Thela's reshape solution is very direct.
However, here are a few other options to consider:
Option 1: Base R
You can write your own "reshape" function, perhaps something like this:
myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
  temp <- sapply(stubs, function(x) {
    unlist(indf[grep(x, names(indf))], use.names = FALSE)
  })
  s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
  data.frame(s, temp)
}
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing
This approach will not work if the data are "unbalanced" (for example, more "x" columns than "y" columns.)
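Usage would look something like this (first rows shown, derived from the anscombe values printed above):
head(myReshape(anscombe), 3)
#>   s  x    y
#> 1 1 10 8.04
#> 2 1  8 6.95
#> 3 1 13 7.58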
Option 2: "dplyr" + "tidyr"
Since pipes are the rage these days, you can also try:
library(dplyr)
library(tidyr)
anscombe %>%
  gather(var, val, everything()) %>%
  extract(var, into = c("variable", "s"), "(.)(.)") %>%
  group_by(variable, s) %>%
  mutate(ind = sequence(n())) %>%
  spread(variable, val)
Notes:
I'm not sure that this is necessarily less clunky than what you're already doing, but some people like the pipe approach.
This approach should be able to handle unbalanced data.
Option 3: "splitstackshape"
Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With that, the approach would be:
library(splitstackshape)
setnames(
  merged.stack(
    data.table(anscombe, keep.rownames = TRUE),
    var.stubs = c("x", "y"), sep = "var.stubs"),
  ".time_1", "s")[]
A few notes:
merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" with the row numbers
The sep = "var.stubs" basically means that we don't really have a separator variable, so we'll just strip out the stub and use whatever remains for the "time" variable
merged.stack will work if the data are unbalanced. For instance, try using it with anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
The same package also has a function called Reshape that builds upon reshape to let it reshape unbalanced data. But it's slower and less flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = "") and then rename the "time" variable using setnames.
Option 4: melt.data.table
This was mentioned in the comments above, but hasn't been shared as an answer. Outside of base R's reshape, this is a very direct approach that handles column renaming from within the function itself:
library(data.table)
melt(as.data.table(anscombe),
     measure.vars = patterns(c("x", "y")),
     value.name = c("x", "y"),
     variable.name = "s")
Notes:
Will be insanely fast.
Much better supported than "splitstackshape" or reshape ;-)
Handles unbalanced data just fine.
I think this meets the criteria of being 1) short, 2) comprehensible, and 3) free of hard-coded column numbers. And it doesn't require any other packages.
reshape(anscombe, varying=TRUE, sep="", direction="long", timevar="s")
# s x y id
#1.1 1 10 8.04 1
#...
#11.1 1 5 5.68 11
#1.2 2 10 9.14 1
#...
#11.2 2 5 4.74 11
#1.3 3 10 7.46 1
#...
#11.3 3 5 5.73 11
#1.4 4 8 6.58 1
#...
#11.4 4 8 6.89 11
I don't know if a non-reshape solution would be acceptable, but here you go:
library(data.table)
#create the pattern that will have the Xs
#this will make it easy to create the Ys
pattern <- 1:4
#use Map to create a list of data.frames with the needed columns
#and also use rbindlist to rbind the list produced by Map
lists <- rbindlist(Map(data.frame,
                       pattern,
                       anscombe[pattern],
                       anscombe[pattern + length(pattern)]))
#set the correct names
setnames(lists, names(lists), c('s','x','y'))
Output:
> lists
s x y
1: 1 10 8.04
2: 1 8 6.95
3: 1 13 7.58
4: 1 9 8.81
5: 1 11 8.33
6: 1 14 9.96
7: 1 6 7.24
8: 1 4 4.26
9: 1 12 10.84
10: 1 7 4.82
....
A newer tidyverse option is suggested in the tidyverse vignette:
library(tidyr)   # pivot_longer()
library(dplyr)   # %>% and arrange()

anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)") %>%
  arrange(set)
#> # A tibble: 44 x 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 1 8 6.95
#> 3 1 13 7.58
#> 4 1 9 8.81
#> 5 1 11 8.33
#> 6 1 14 9.96
#> 7 1 6 7.24
#> 8 1 4 4.26
#> 9 1 12 10.8
#> 10 1 7 4.82
#> # … with 34 more rows
