Transpose from long to wide with pair groups in R

I have descriptive statistics for four groups. My sample dataset is:
df <- data.frame(
  Grade = c(3, 3, 3, 3, 4, 4, 4, 4),
  group = c("none", "G1", "G2", "both", "none", "G1", "G2", "both"),
  mean = c(10, 12, 13, 12, 11, 18, 19, 20),
  sd = c(22, 12, 22, 12, 11, 13, 14, 15),
  N = c(35, 33, 34, 32, 43, 45, 46, 47))
> df
  Grade group mean sd  N
1     3  none   10 22 35
2     3    G1   12 12 33
3     3    G2   13 22 34
4     3  both   12 12 32
5     4  none   11 11 43
6     4    G1   18 13 45
7     4    G2   19 14 46
8     4  both   20 15 47
I would like to compare the groups in pairs and need the descriptive statistics side by side for each pair. What I would like to have is one row per pair, with each group's mean, sd, and N next to each other; so, each grade has 6 pairs of groups.
Does anyone have any idea how to do this?
Thanks!

1) sqldf We can join df to itself on Grade, keeping only one orientation of each pair via the condition a.[group] > b.[group]. Note that group is escaped in brackets since group is an SQL keyword.
library(sqldf)
sqldf('select
         a.Grade,
         a.[group] Group1, b.[group] Group2,
         a.mean mean1, b.mean mean2,
         a.sd sd1, b.sd sd2,
         a.N n1, b.N n2
       from df a
       join df b on a.Grade = b.Grade and a.[group] > b.[group]')
giving:
   Grade Group1 Group2 mean1 mean2 sd1 sd2 n1 n2
1      3   none     G1    10    12  22  12 35 33
2      3   none     G2    10    13  22  22 35 34
3      3   none   both    10    12  22  12 35 32
4      3     G2     G1    13    12  22  12 34 33
5      3   both     G1    12    12  12  12 32 33
6      3   both     G2    12    13  12  22 32 34
7      4   none     G1    11    18  11  13 43 45
8      4   none     G2    11    19  11  14 43 46
9      4   none   both    11    20  11  15 43 47
10     4     G2     G1    19    18  14  13 46 45
11     4   both     G1    20    18  15  13 47 45
12     4   both     G2    20    19  15  14 47 46
2) base R We can perform a merge on part of the condition (the Grade equality) and then subset the result for the remainder. The resulting names are slightly different, so change them if that is important.
subset(merge(df, df, by = "Grade"), group.x > group.y)
giving:
   Grade group.x mean.x sd.x N.x group.y mean.y sd.y N.y
2      3    none     10   22  35      G1     12   12  33
3      3    none     10   22  35      G2     13   22  34
4      3    none     10   22  35    both     12   12  32
8      3      G1     12   12  33    both     12   12  32
10     3      G2     13   22  34      G1     12   12  33
12     3      G2     13   22  34    both     12   12  32
18     4    none     11   11  43      G1     18   13  45
19     4    none     11   11  43      G2     19   14  46
20     4    none     11   11  43    both     20   15  47
24     4      G1     18   13  45    both     20   15  47
26     4      G2     19   14  46      G1     18   13  45
28     4      G2     19   14  46    both     20   15  47
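For completeness, the same pairing can be sketched with a dplyr self-join; this is an equivalent rewrite of the merge/subset approach, not part of the original answers (the suffix argument controls the 1/2 naming):
library(dplyr)
df %>%
  inner_join(df, by = "Grade", suffix = c("1", "2")) %>%
  filter(group1 > group2)   # keep one orientation of each pair, as in the sqldf condition
The string comparison group1 > group2 drops self-pairs and keeps exactly one of the two orientations of every pair, which is why each grade yields 6 rows.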


tidyverse: binding list elements efficiently

I want to bind data.frames with the same number of rows from a list, as given below.
df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)
df5 <- 5
ls <- list(df1, df2, df3, df4, df5)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
bind_cols(ls[1], ls[2], .id = NULL)
#> New names:
#> * A1 -> A1...1
#> * A1 -> A1...3
#> A1...1 B1 A1...3 C1
#> 1 1 11 1 21
#> 2 2 12 2 22
#> 3 3 13 3 23
#> 4 4 14 4 24
#> 5 5 15 5 25
#> 6 6 16 6 26
#> 7 7 17 7 27
#> 8 8 18 8 28
#> 9 9 19 9 29
#> 10 10 20 10 30
bind_cols(ls[3], ls[4], .id = NULL)
#> New names:
#> * A2 -> A2...1
#> * A2 -> A2...4
#> A2...1 B2 C2 A2...4 D2 E2
#> 1 1 11 31 1 11 51
#> 2 2 12 32 2 12 52
#> 3 3 13 33 3 13 53
#> 4 4 14 34 4 14 54
#> 5 5 15 35 5 15 55
#> 6 6 16 36 6 16 56
#> 7 7 17 37 7 17 57
#> 8 8 18 38 8 18 58
#> 9 9 19 39 9 19 59
#> 10 10 20 40 10 20 60
#> 11 11 21 41 11 21 61
#> 12 12 22 42 12 22 62
#> 13 13 23 43 13 23 63
#> 14 14 24 44 14 24 64
#> 15 15 25 45 15 25 65
In my actual list, I have about twenty data.frames with different numbers of rows. I wonder if there is a more efficient way of binding the data.frames that have the same number of rows, without spelling out the names and indices of the list elements.
It is easier to do this by splitting. Create a grouping index with gl (this pairs consecutive list elements, which works here because elements with the same number of rows are adjacent):
grp <- as.integer(gl(length(ls), 2, length(ls)))
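For this five-element list the index pairs consecutive elements and puts the odd fifth element in its own group (a quick sanity check):
grp
# [1] 1 1 2 2 3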
and then use split
library(dplyr)
library(purrr)
library(stringr)
split(ls, grp) %>%                           # // split by the grouping index
  map(bind_cols) %>%                         # // loop over the `list` and use `bind_cols`
  set_names(str_c('df', seq_along(.))) %>%   # // name the `list`
  list2env(.GlobalEnv)                       # // create objects in the global env
Output:
head(df1)
# A1...1 B1 A1...3 C1
#1 1 11 1 21
#2 2 12 2 22
#3 3 13 3 23
#4 4 14 4 24
#5 5 15 5 25
#6 6 16 6 26
head(df2)
# A2...1 B2 C2 A2...4 D2 E2
#1 1 11 31 1 11 51
#2 2 12 32 2 12 52
#3 3 13 33 3 13 53
#4 4 14 34 4 14 54
#5 5 15 35 5 15 55
#6 6 16 36 6 16 56
head(df3)
# A tibble: 1 x 1
# ...1
# <dbl>
#1 5
NOTE:
1. It is better to keep the elements in the list instead of creating objects in the global environment with list2env.
2. ls is the name of a base R function, and naming an object after a function is not a good idea, as it can lead to bugs.
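Following that note, a minimal sketch that keeps the bound results in a named list instead (the names lst and out are illustrative, not from the original answer):
lst <- list(df1, df2, df3, df4, df5)   # avoids masking base::ls
grp <- as.integer(gl(length(lst), 2, length(lst)))
out <- split(lst, grp) %>%
  map(bind_cols) %>%
  set_names(str_c('df', seq_along(.)))
out$df1   # access each bound data.frame by name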
Maybe not the optimal approach, but you can use a loop and bind the data.frames with the same number of rows into new data.frames. The idea of this code is to check the dimensions of each data.frame and build a vector of the unique row counts. Then, in the loop, you can subset the data.frames in ls that match each row count and bind their columns with do.call(bind_cols, ...). Here is the code (updated to handle the little df5, which works if you manage it as a data.frame):
library(dplyr)
#Data
df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)
df5 <- 5
#List
ls <- list(df1, df2, df3, df4,df5)
#Index: row count of each element (df5 is coerced to a data.frame so dim() works)
index <- sapply(ls, function(x) dim(as.data.frame(x))[1])
m <- unique(index)
#Loop: bind the columns of all list elements sharing each row count
for(i in seq_along(m))
{
  assign(paste0('df', i), do.call(bind_cols, ls[index == m[i]]))
}
Output:
df1
A1...1 B1 A1...3 C1
1 1 11 1 21
2 2 12 2 22
3 3 13 3 23
4 4 14 4 24
5 5 15 5 25
6 6 16 6 26
7 7 17 7 27
8 8 18 8 28
9 9 19 9 29
10 10 20 10 30
df2
A2...1 B2 C2 A2...4 D2 E2
1 1 11 31 1 11 51
2 2 12 32 2 12 52
3 3 13 33 3 13 53
4 4 14 34 4 14 54
5 5 15 35 5 15 55
6 6 16 36 6 16 56
7 7 17 37 7 17 57
8 8 18 38 8 18 58
9 9 19 39 9 19 59
10 10 20 40 10 20 60
11 11 21 41 11 21 61
12 12 22 42 12 22 62
13 13 23 43 13 23 63
14 14 24 44 14 24 64
15 15 25 45 15 25 65
df3
...1
1 5

How to delete values from a column only from a threshold on? [duplicate]

This question already has answers here:
Subset data frame based on multiple conditions [duplicate]
(3 answers)
Closed 3 years ago.
I have a dataframe (k by 4). I have ordered one of the four columns in descending order (from 19 to -9, let's say). I would like to throw away the values that are smaller than 1.5.
I have unsuccessfully tried various combinations of the following code:
subset(w, select = -c(columnofinterest, <=1.50))
Can anyone help me?
Thanks a lot!
You can use arrange and filter from the dplyr package:
library(dplyr)
w <- data.frame(use_this = round(runif(100, min = -9, max = 19)),
                second = runif(100),
                third = runif(100),
                fourth = runif(100)) %>%
  arrange(desc(use_this)) %>%
  filter(use_this >= 1.5)
Output:
> w
use_this second third fourth
1 19 0.264306555 0.11234097 0.30149863
2 19 0.574675520 0.50406805 0.71502833
3 19 0.376586752 0.21530618 0.35323250
4 18 0.949974135 0.46726122 0.36008741
5 17 0.339737597 0.11358402 0.04035303
6 16 0.180291264 0.81855913 0.16109650
7 16 0.958398058 0.94827266 0.54693974
8 16 0.297317238 0.28726682 0.63560208
9 16 0.653006870 0.15175848 0.69305851
10 16 0.685338886 0.30493976 0.89360112
11 16 0.493931093 0.52830391 0.68391458
12 16 0.945083084 0.19880501 0.66769341
13 16 0.910927578 0.86032225 0.73062990
14 15 0.662130980 0.19207451 0.44240610
15 15 0.730482762 0.92418574 0.46387086
16 15 0.547101759 0.87847767 0.27973739
17 15 0.487773258 0.05870471 0.40147753
18 15 0.695824922 0.91289504 0.94897518
19 14 0.576095914 0.42914670 0.27707368
20 14 0.156691824 0.02187951 0.31940887
21 13 0.079037019 0.16993999 0.53232350
22 13 0.944372064 0.63485350 0.23548337
23 13 0.016378244 0.42772076 0.76618218
24 13 0.606340182 0.33611591 0.36017352
25 13 0.170346203 0.43325314 0.16285515
26 13 0.605379012 0.95574187 0.23941377
27 12 0.157352454 0.90963650 0.01611328
28 12 0.353934785 0.80058806 0.13782414
29 12 0.464950823 0.81835421 0.12771521
30 12 0.624139506 0.69472154 0.02833191
31 11 0.362033514 0.98849181 0.37684822
32 11 0.067974815 0.24154922 0.49300890
33 11 0.522271380 0.03502680 0.50665790
34 10 0.810183210 0.56598130 0.41279787
35 10 0.609560713 0.46745813 0.34939724
36 10 0.087748839 0.56531646 0.02249387
37 10 0.008262635 0.68432285 0.35648525
38 10 0.757824842 0.57826099 0.89973902
39 10 0.428174539 0.12538288 0.69233083
40 10 0.785175550 0.21516237 0.36578714
41 10 0.631388832 0.63700087 0.40933640
42 10 0.171396873 0.37925970 0.27935731
43 10 0.773437320 0.24710107 0.23902388
44 8 0.443778088 0.77238651 0.08517639
45 8 0.954302451 0.87102748 0.52031446
46 8 0.347608835 0.79912385 0.36169856
47 8 0.839238717 0.54200177 0.52221408
48 8 0.235710838 0.85575923 0.78092366
49 7 0.610772265 0.16833538 0.94704562
50 7 0.242917834 0.02852729 0.87131760
51 7 0.875879507 0.04537683 0.81000861
52 7 0.577880660 0.54259171 0.43301336
53 6 0.541772984 0.06164861 0.62867700
54 6 0.071746509 0.51758874 0.70365933
55 5 0.103953563 0.99147043 0.33944620
56 5 0.504618656 0.95827073 0.65527417
57 5 0.726648637 0.37460291 0.47072657
58 5 0.796268586 0.09644167 0.93960812
59 5 0.796498528 0.68346948 0.23290885
60 5 0.490859592 0.76727730 0.39888256
61 5 0.949232913 0.02954981 0.56672834
62 4 0.360401806 0.62879833 0.31107107
63 4 0.926329930 0.87624801 0.91260914
64 4 0.922783983 0.11524112 0.06240194
65 3 0.518727534 0.23927630 0.37114683
66 3 0.951288192 0.58672287 0.45337659
67 3 0.767943126 0.76102957 0.24347122
68 2 0.786254279 0.39824869 0.58548193
69 2 0.321557042 0.75393236 0.43273743
70 2 0.872124621 0.89918160 0.55623725
71 2 0.242389529 0.85453423 0.78540085
72 2 0.013294874 0.61593974 0.70549476
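If you prefer base R over dplyr, the same ordering and filtering can be sketched with order() and subsetting (here w stands for the unfiltered data frame):
# sort descending, then keep only rows at or above the threshold
w2 <- w[order(-w$use_this), ]
w2 <- w2[w2$use_this >= 1.5, ]
# or equivalently in one step with subset()
w2 <- subset(w[order(-w$use_this), ], use_this >= 1.5)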

Subtract and find the difference of a value or volume

I have volume measurements of brain parts (optic lobe, olfactory lobe, auditory cortex, etc.); all the parts add up to the total brain volume, as shown in the example dataframe here:
a b  c  d  e total
1 2  3  4  5    15
2 3  4  5  6    20
4 6  7  8  9    34
7 8 10 10 15    50
I would like to find the difference in brain volume when I subtract one component from the total volume.
I was wondering how to go about this in R, without having to create a new column for every brain part.
For example: total - a = 14, total - b = 13, and so on for the other components.
total-a total-b total-c total-d total-e
     14      13      12      11      10
     18      17      16      15      14
     30      28      27      26      25
     43      42      40      40      35
You can do
dat[, "total"] - dat[1:5]
#   a  b  c  d  e
#1 14 13 12 11 10
#2 18 17 16 15 14
#3 30 28 27 26 25
#4 43 42 40 40 35
If you also want the new column names, then one tidyverse possibility could be (gather and spread come from tidyr):
library(dplyr)
library(tidyr)
df %>%
  gather(var, val, -total) %>%
  mutate(var = paste0("total-", var),
         val = total - val) %>%
  spread(var, val)
  total total-a total-b total-c total-d total-e
1    15      14      13      12      11      10
2    20      18      17      16      15      14
3    34      30      28      27      26      25
4    50      43      42      40      40      35
If you do not care about the column names, then with just dplyr you can do:
df %>%
  mutate_at(vars(-matches("total")), list(~ total - .))
   a  b  c  d  e total
1 14 13 12 11 10    15
2 18 17 16 15 14    20
3 30 28 27 26 25    34
4 43 42 40 40 35    50
Or without column names with just base R:
df[, grepl("total", names(df))] - df[, !grepl("total", names(df))]
   a  b  c  d  e
1 14 13 12 11 10
2 18 17 16 15 14
3 30 28 27 26 25
4 43 42 40 40 35
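If you want the total-x names in base R as well, a small sketch (parts and out are illustrative names):
parts <- setdiff(names(df), "total")
out <- df$total - df[parts]   # vector minus data.frame, recycled column-wise
names(out) <- paste0("total-", parts)
out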

Create partition based in two variables

I have a data set with two outcome variables, case1 and case2. case1 has 4 levels, while case2 has 50 (the number of levels in case2 could increase later). I would like to create a train/test data partition that keeps the class ratios for both variables. The real data is imbalanced in both case1 and case2. As an example,
library(caret)
set.seed(123)
matris <- matrix(rnorm(10), 1000, 20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2
split1 <- createDataPartition(df$case1, p = 0.2)[[1]]
train1 <- df[-split1, ]
test1 <- df[split1, ]
length(split1)
# [1] 201

split2 <- createDataPartition(df$case2, p = 0.2)[[1]]
train2 <- df[-split2, ]
test2 <- df[split2, ]
length(split2)
# [1] 220
If I do the splits separately, I get test sets of different lengths. If I do one split based on case2 (the variable with more classes), I lose the class ratio for case1.
I will be predicting the two cases separately, but in the end my accuracy will be determined by exact matches on both cases (e.g., ix <- which(pred1 == case1 & pred2 == case2)), so I need the arrays to be the same size.
Is there a smart way to do this?
Thank you!
If I understand correctly (which I do not guarantee), I can offer the following approach:
Group by case1 and case2 and get the group indices:
library(tidyverse)
df %>%
  select(case1, case2) %>%
  group_by(case1, case2) %>%
  group_indices() -> indices
Use these indices as the outcome variable in createDataPartition:
split1 <- createDataPartition(as.factor(indices), p = 0.2)[[1]]
Check whether the split is satisfactory:
table(df[split1, 22])   # case2 (column 22) in the 20% partition
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 6 5 8 5 5 6 6 4 6 6 6 6 6 5 5 5 4 4 7 5 6 5 6 7 5 5 8 6 7 6 6 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
4 5 6 6 6 5 5 6 5 6 6 5 4 5 6 4 6
table(df[-split1, 22])   # case2 in the remaining 80%
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15 8 13 13 15 21 14 11 13 12 9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
9 6 12 19 14 10 16 19 17 17 16 14 4 15 14 9 19
table(df[split1, 21])   # case1 (column 21) in the 20% partition
#output
1 2 3 4
71 70 71 67
table(df[-split1, 21])   # case1 in the remaining 80%
1 2 3 4
176 193 174 178
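An alternative sketch with the same idea but no tidyverse dependency: stratify directly on the interaction of the two factors (drop = TRUE removes empty level combinations; note that strata with very few rows may not split at exactly 20%):
strata <- interaction(df$case1, df$case2, drop = TRUE)
split1 <- createDataPartition(strata, p = 0.2)[[1]]
train <- df[-split1, ]
test <- df[split1, ]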

All possible unique pair combinations of gamete positions

I have some gamete data in the following format:
Ind Letter Place Position
1 A 19 23
2 B 19 23
3 B 19 23
4 B 19 23
1 B 19 34
2 A 19 34
3 B 19 34
4 B 19 34
1 C 19 52
2 T 19 52
3 C 19 52
4 T 19 52
1 T 33 15
2 T 33 15
3 T 33 15
4 C 33 15
1 C 33 26
2 T 33 26
3 T 33 26
4 C 33 26
dput of data:
structure(list(Ind = c(1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L),
Letter = structure(c(1L,2L,2L,2L,2L,1L,2L,2L,3L,4L,3L,4L,4L,4L,4L,3L,3L,4L,4L,3L),
.Label = c("A","B","C","T"), class="factor"),
Place = c(19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,33L,33L,33L,33L,33L,33L,33L,33L),
Position = c(23L,23L,23L,23L,34L,34L,34L,34L,52L,52L,52L,52L,15L,15L,15L,15L,26L,26L,26L,26L)),
.Names = c("Ind","Letter","Place","Position"),
class="data.frame", row.names = c(NA,-20L))
I need to pair and combine them so that I get all possible unique combinations with reference to Position within a pair. I have another data file that contains information on the pairs, which are paired with reference to Place. So in this file I may see that Place 19 + Place 33 is a pair, and I want the following result:
Ind Letter Place Position Ind Letter Place Position
1 A 19 23 1 T 33 15
2 B 19 23 2 T 33 15
3 B 19 23 3 T 33 15
4 B 19 23 4 C 33 15
1 A 19 23 1 C 33 26
2 B 19 23 2 T 33 26
3 B 19 23 3 T 33 26
4 B 19 23 4 C 33 26
1 B 19 34 1 T 33 15
2 A 19 34 2 T 33 15
3 B 19 34 3 T 33 15
4 B 19 34 4 C 33 15
1 B 19 34 1 C 33 26
2 A 19 34 2 T 33 26
3 B 19 34 3 T 33 26
4 B 19 34 4 C 33 26
1 C 19 52 1 T 33 15
2 T 19 52 2 T 33 15
3 C 19 52 3 T 33 15
4 T 19 52 4 C 33 15
1 C 19 52 1 C 33 26
2 T 19 52 2 T 33 26
3 C 19 52 3 T 33 26
4 T 19 52 4 C 33 26
In this case unique means that A1:A2 is equal to A2:A1.
The reason I want to do this is that I want to run a Four-Gamete Test (FGT) on the pairs, to see if all possible combinations of Letter are present. So e.g. for the last combined pair above, we have the letter pairs CC, TT, CT, TC, so this combined pair will pass the FGT.
I have tried to do the combining with expand.grid, as it seems quite close to what I want. However, when I request all combinations of data$Position, I lose the information for Ind, Letter, and Place. Also, the output includes non-unique pairs.
Can anyone point me to a tool that is closer to what I want, or give me some guidelines on how to modify expand.grid to get what I need?
Should you be aware of a tool that actually performs the Four-Gamete Test, or something similar, that would of course also be interesting for me to look at.
You can use expand.grid but not directly on the Position column. The idea is to find all combinations of the "quartets" (unique Positions):
pair <- c(19, 33)
df1 <- df1[df1$Place %in% pair, ]
split1 <- split(df1, df1$Position)
vec1 <- unique(df1$Position[df1$Place == pair[1]])
vec2 <- unique(df1$Position[df1$Place == pair[2]])
combin_num <- expand.grid(vec2, vec1)[, 2:1]
do.call(
  rbind,
  lapply(seq_len(nrow(combin_num)), function(i){
    cbind(split1[[as.character(combin_num[i, 1])]],
          split1[[as.character(combin_num[i, 2])]])
  })
)
Result:
# Ind Letter Place Position Ind.1 Letter.1 Place.1 Position.1
# 1 1 A 19 23 1 T 33 15
# 2 2 B 19 23 2 T 33 15
# 3 3 B 19 23 3 T 33 15
# 4 4 B 19 23 4 C 33 15
# 5 1 A 19 23 1 C 33 26
# 6 2 B 19 23 2 T 33 26
# 7 3 B 19 23 3 T 33 26
# 8 4 B 19 23 4 C 33 26
# 51 1 B 19 34 1 T 33 15
# 61 2 A 19 34 2 T 33 15
# 71 3 B 19 34 3 T 33 15
# 81 4 B 19 34 4 C 33 15
# 52 1 B 19 34 1 C 33 26
# 62 2 A 19 34 2 T 33 26
# 72 3 B 19 34 3 T 33 26
# 82 4 B 19 34 4 C 33 26
# 9 1 C 19 52 1 T 33 15
# 10 2 T 19 52 2 T 33 15
# 11 3 C 19 52 3 T 33 15
# 12 4 T 19 52 4 C 33 15
# 91 1 C 19 52 1 C 33 26
# 101 2 T 19 52 2 T 33 26
# 111 3 C 19 52 3 T 33 26
# 121 4 T 19 52 4 C 33 26
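To then run the four-gamete test mentioned in the question, here is a sketch assuming the combined result above is saved as res (the combo helper column is illustrative): a Position pair passes the FGT when all four letter combinations occur among the individuals.
res <- do.call(
  rbind,
  lapply(seq_len(nrow(combin_num)), function(i){
    cbind(split1[[as.character(combin_num[i, 1])]],
          split1[[as.character(combin_num[i, 2])]])
  })
)
res$combo <- paste(res$Letter, res$Letter.1)
# TRUE where a pair of Positions shows all 4 gamete combinations
tapply(res$combo,
       list(res$Position, res$Position.1),
       function(x) length(unique(x)) == 4)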
