Grouping data by name R - r

id value
1 expsubs 29
2 expsubs 32
3 expsubs 27
4 expsubs 36
5 expsubs 29
6 expsubs 24
New to R.
I have data that I've sorted in Excel and tried to import into R.
I want to sort or group my data by the names in my "id" column so that I can run an ANOVA on it. I can't figure out how to get R to recognize my id column as the names for each value. Thanks!

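In this situation you can use the dplyr package:
library(dplyr)
tab <- data.frame(x = c("A", "B", "C", "C"), y = 1:4)
by_x <- group_by(tab, x)
by_x
This code groups your data by the x column. It does not reorder the rows; subsequent dplyr operations such as summarise() are then applied per group.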

Use order:
df <- data.frame(id = c("B", "A", "D", "C"), y = c(6, 8, 1, 5))
df
id y
1 B 6
2 A 8
3 D 1
4 C 5
df2 <- df[order(df$id), ]
df2
id y
2 A 8
1 B 6
4 C 5
3 D 1
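Since the goal is an ANOVA, sorting is not strictly required: aov() only needs the grouping column to be a factor. A minimal sketch, assuming the data frame is called dat and holds more than one group. Only the "expsubs" values come from the question; the "control" group and its values are made up for illustration:

```r
# Hypothetical data: only the "expsubs" values come from the question;
# the "control" group and its values are invented for this sketch.
dat <- data.frame(
  id    = rep(c("expsubs", "control"), each = 6),
  value = c(29, 32, 27, 36, 29, 24,
            22, 25, 21, 30, 26, 23)
)

# Tell R that id is a grouping variable rather than free text
dat$id <- factor(dat$id)

# One-way ANOVA of value by group
fit <- aov(value ~ id, data = dat)
summary(fit)
```

levels(dat$id) shows which group names R picked up from the id column.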

Related

Efficient recursive random sampling with groups of unequal size

This question is a follow-up to my previous question, "Efficient recursive random sampling". The solutions in that thread work fine when the groups are of identical size or when a fixed number of samples per group is required. However, let's imagine a dataset as follows:
ID1 ID2
1 A 1
2 A 6
3 B 1
4 B 2
5 B 3
6 C 4
7 C 5
8 C 6
9 D 6
10 D 7
11 D 8
12 D 9
where we want to randomly sample up to n ID2 for each ID1, and to do so recursively. Recursively here means that we move from the first ID1 to the last ID1, and if an ID2 was already sampled for an ID1, it should not be used for a subsequent ID1. Let's say n = 2; then the expected results would be as follows:
ID1 ID2
1 A 1
2 A 6
4 B 2
5 B 3
6 C 4
7 C 5
11 D 8
12 D 9
For ID1 = "A", there are exactly two potential ID2, so both are selected.
For ID1 = "B", there are two potential ID2 left to select, so both are selected.
For ID1 = "C", there are two potential ID2 left to select, so both are selected.
For ID1 = "D", there are three potential ID2 left to sample from, so two are randomly selected from those.
What can happen beyond the situation shown in the example:
Every ID1 always has a non-zero number of ID2 available; however, it is possible that all of those ID2 were already used. In that case, ID1 should simply be left out.
It is possible that none of the ID1 will have the specified n of ID2. In that case, the number closest to the specified n should be retrieved.
ID doesn't have to be seq(ID1).
ID2 could also be a character vector similar to ID1.
Sample df:
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
"D", "D", "D", "D"), ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8,
9)), class = "data.frame", row.names = c(NA, -12L))
The following function seems to give what you are after. Basically, it loops through each group of ID1 and selects the rows where the corresponding ID2 has not been sampled yet. Then it selects the distinct rows (in case some group of ID1 has duplicate ID2 values). The sample size will be the minimum of n and the number of rows available for that group.
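library(dplyr)
# Note: this name masks base::sample() for the rest of the session
sample <- function(df, n) {
  `%notin%` <- Negate(`%in%`)
  groups <- unique(df$ID1)
  out <- data.frame(ID1 = character(), ID2 = character())
  for (group in groups) {
    # Rows of this group whose ID2 is still unused, de-duplicated before
    # sampling so that a repeated ID2 cannot be drawn twice
    options <- df %>%
      filter(ID1 == group,
             ID2 %notin% out$ID2) %>%
      distinct()
    chosen <- sample_n(options,
                       size = min(n, nrow(options)))
    out <- rbind(out, chosen)
  }
  out
}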
set.seed(123)
sample(df, 2)
ID1 ID2
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 8
8 D 9
Case where a group of ID1 has ID2s that were already used up:
Input:
# A tibble: 10 × 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 3
3 B 1
4 B 3
5 C 5
6 C 6
7 C 7
8 C 7
9 D 10
10 D 20
Output:
sample(df2, 2)
# A tibble: 6 × 2
ID1 ID2
<chr> <dbl>
1 A 3
2 A 1
3 C 6
4 C 7
5 D 20
6 D 10
I don't know whether I am oversimplifying the problem. Take a look at the following and see whether it works in your case:
library(tidyverse)
df %>%
  group_split(ID1) %>%
  reduce(~ bind_rows(.x, .y) %>%
           filter(!duplicated(ID2)) %>%
           group_by(ID1) %>%
           slice_sample(n = 2) %>%
           ungroup(),
         .init = slice_sample(.[[1]], n = 2))
# A tibble: 8 x 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 9
8 D 8
Disclaimer: not vectorized, thus inefficient.
Here is a base R option using dynamic programming (DP)
d <- table(df)
nms <- dimnames(d)
res <- list()
for (i in nms$ID1) {
  # ID2 values still available for this ID1
  idx <- which(d[i, ] > 0)
  if (length(idx) >= 2) {
    j <- sample(idx, 2)
    res[[i]] <- nms$ID2[j]
    # mark the sampled ID2 as used for all later groups
    d[, j] <- 0
  }
}
dfout <- type.convert(
  setNames(rev(stack(res)), names(df)),
  as.is = TRUE
)
which gives
ID1 ID2
1 A 6
2 A 1
3 B 2
4 B 3
5 C 4
6 C 5
7 D 7
8 D 8
For the case with used ID2 already, e.g.,
> (df <- structure(list(ID1 = c(
+ "A", "A", "B", "B", "B", "C", "C", "C",
+ "D", "D", "D", "D"
+ ), ID2 = c(
+ 1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, .... [TRUNCATED]
ID1 ID2
1 A 1
2 A 3
3 B 1
4 B 2
5 B 3
6 C 3
7 C 4
8 C 5
9 D 4
10 D 5
11 D 6
12 D 1
we will obtain
ID1 ID2
1 A 1
2 A 3
3 C 5
4 C 4
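Note that the loop above skips any group with fewer than two unused ID2 values, whereas the question asks for up to n. A minimal base R sketch of the same sequential idea that takes whatever is left for a group; sample_up_to_n is a hypothetical helper name, not something from the answers above, and the sketch is not tuned for speed:

```r
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
                             "D", "D", "D", "D"),
                     ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)),
                class = "data.frame", row.names = c(NA, -12L))

# sample_up_to_n is a hypothetical helper name used only for this sketch
sample_up_to_n <- function(df, n) {
  first_seen <- !duplicated(df[c("ID1", "ID2")])  # ignore repeated pairs
  used <- df$ID2[0]                               # ID2 values taken so far
  keep <- integer(0)                              # row indices to return
  for (g in unique(df$ID1)) {
    rows <- which(df$ID1 == g & first_seen & !(df$ID2 %in% used))
    if (length(rows) == 0) next                   # nothing left: skip group
    # sample.int avoids the sample(x) surprise when length(x) == 1
    pick <- rows[sample.int(length(rows), min(n, length(rows)))]
    keep <- c(keep, sort(pick))
    used <- c(used, df$ID2[pick])
  }
  df[keep, , drop = FALSE]
}

set.seed(1)
res <- sample_up_to_n(df, 2)
res
```

With the sample df from the question, groups A, B and C have exactly two unused ID2 values each when their turn comes, so only the rows picked for D depend on the seed.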

Finding only unique value in each column in a d

I have the data frame df1 below. (Edited to have different numbers of repeated values in the data frame.)
> dput(df1)
structure(list(...1 = c("a", "b", "c", "d", "e"), x = c(5, 10,
20, 20, 25), y = c(2, 6, 6, 6, 10), z = c(6, 2, 1, 8, 1)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
>df1
x y z
a 5 2 6
b 10 6 2
c 20 6 1
d 20 6 8
e 25 10 1
I would like to get a df2 which only has the unique values from each column 'x','y' and 'z'.
I tried:
df2<-apply(df1,2, unique)
df2 <- do.call(cbind, df2)
df2 <- as.data.frame(df2)
Desired output:
>df2
 x  y z
 5  2 6
10  6 2
20 10 1
25    8
Tibbles don't support row names, so they were stored as a new column (...1) in your data. You can delete that first column and then call unique on every column. Note that this only works when all columns have the same number of unique values.
library(dplyr)
df1$...1 <- NULL
df1 %>% summarise(across(everything(), unique))
# x y z
# <dbl> <dbl> <dbl>
#1 5 2 6
#2 10 6 2
#3 20 8 1
#4 25 10 8
Or in base R:
df2 <- data.frame(sapply(df1, unique))
If the columns have different numbers of unique values, you could pad the shorter ones with NA:
tmp <- lapply(df1, unique)
data.frame(sapply(tmp, `[`, 1:max(lengths(tmp))))
# x y z
#1 5 2 6
#2 10 6 2
#3 20 10 1
#4 25 NA 8

order a row by column name of other data frame and match in length

For example you have this data frame :
dd <- data.frame(b = c("cpg1", "cpg2", "cpg3", "cpg4"),
x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
z = c(1, 1, 1, 2))
dd
b x y z
1 cpg1 A 8 1
2 cpg2 D 3 1
3 cpg3 A 9 1
4 cpg4 C 9 2
I want to order the column names (b,x,y,z) by a row in another data frame which is:
d <- data.frame(pos = c("x", "z", "b"),
g = c("A", "D", "A"), h = c(8, 3, 9))
d
pos g h
1 x A 8
2 z D 3
3 b A 9
So I want to order the columns of dd by the values in d$pos, and dd should also keep only as many columns as there are entries in d$pos.
I tried order and match, but they did not give me the result I need. My dataset is quite large, so something automatic would be ideal.
Thanks a lot for your help!
We can do a match and then order the columns
i1 <- match(d$pos, names(dd), nomatch = 0)
dd[i1]
# x z b
#1 A 1 cpg1
#2 D 1 cpg2
#3 A 1 cpg3
#4 C 2 cpg4
Or if we want only the columns based on the 'd$pos'
dd[as.character(d$pos)]
# x z b
#1 A 1 cpg1
#2 D 1 cpg2
#3 A 1 cpg3
#4 C 2 cpg4
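The two approaches differ when d$pos contains a name that is not a column of dd. A small sketch, assuming a made-up pos vector in which "q" is a hypothetical name that does not exist in dd:

```r
dd <- data.frame(b = c("cpg1", "cpg2"), x = c("A", "D"),
                 y = c(8, 3), z = c(1, 1))
pos <- c("x", "q", "b")   # "q" is hypothetical and not a column of dd

# nomatch = 0 turns unmatched names into index 0, which `[` silently drops
i1 <- match(pos, names(dd), nomatch = 0)
i1
picked <- dd[i1]
names(picked)

# Indexing by name is stricter: an unknown name is an error
by_name <- try(dd[pos], silent = TRUE)
inherits(by_name, "try-error")
```

So the match() version is the more forgiving choice when the two data frames may not line up exactly, while indexing by name fails loudly on a mismatch.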

R - Merge two data frames with one differing column

Suppose I have two data frames
df1 = data.frame(id = c(1,1,1), stat = c("B", "A", "C"), value = c(10,11,12))
df2 = data.frame(id = c(2,2,2), stat = c("B", "A", "C"), value = c(20,21, 22))
Basically, the first column identifies the data frame, the second column is some statistic I want to keep track of, and the last column is the value of that statistic. Can I easily merge the data frames so that I get
stat id value
B 1 10
B 2 20
A 1 11
A 2 21
C 1 12
C 2 22
I'd like to preserve the order of the stat column even though it's not alphabetical.
You could do
(r <- rbind(df1, df2))[c(2,1,3)][order(r$stat, decreasing = TRUE),]
# stat id value
# 1 B 1 10
# 3 B 2 20
# 2 A 1 11
# 4 A 2 21
In response to the edited question, you could use
f <- function(i) rbind(df1[i,], df2[i,])
do.call(rbind, lapply(1:nrow(df1), f))[c(2,1,3)]
# stat id value
# 1 B 1 10
# 2 B 2 20
# 22 A 1 11
# 21 A 2 21
# 3 C 1 12
# 31 C 2 22
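Another way to keep the original order of stat, sketched here as an alternative rather than as part of the answer above, is to sort on a factor whose levels follow the order of first appearance. The names r, stat_order and out are only illustrative:

```r
df1 <- data.frame(id = c(1, 1, 1), stat = c("B", "A", "C"), value = c(10, 11, 12))
df2 <- data.frame(id = c(2, 2, 2), stat = c("B", "A", "C"), value = c(20, 21, 22))

r <- rbind(df1, df2)

# Levels in order of first appearance, so B sorts before A, and A before C
stat_order <- factor(r$stat, levels = unique(as.character(r$stat)))

# Sort by stat (in that custom order), then by id, and reorder the columns
out <- r[order(stat_order, r$id), c("stat", "id", "value")]
out
```

This does not rely on both data frames listing the statistics in the same row order.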

Finding the number of unique variables per factor in R

I have a dataframe which looks like this:
id <- c(1,2,3,4,5,6,7,8,9,10)
val <- c("a", "b", "c", "a", "b", "a", "c", "a", "a", "c")
df <- data.frame(id,val)
I am trying to create a vector of length 10 which, for every id, gives the number of rows in df with the same value val. The output should be
out <- c(5, 2, 3, 5, 2, 5, 3, 5, 5, 3)
It's basically the opposite of
with(df, tapply(val, id, function(x) length(unique(x))))
If that makes sense? Maybe I could merge with(df, tapply(id, val, function(x) length(unique(x)))) with df somehow, but that seems like a very ugly solution.
You could do this:
table(df$val)[df$val]
The ave function is meant for tasks such as this:
cc <- with(df, ave(id, val, FUN = length))
cbind(df, cc)
will result in
id val cc
1 1 a 5
2 2 b 2
3 3 c 3
4 4 a 5
5 5 b 2
6 6 a 5
7 7 c 3
8 8 a 5
9 9 a 5
10 10 c 3
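Both suggestions can be checked against the expected vector from the question. A short sketch; n_table and n_ave are illustrative names, and as.vector() is only there to drop the names that the table lookup carries:

```r
id  <- 1:10
val <- c("a", "b", "c", "a", "b", "a", "c", "a", "a", "c")
df  <- data.frame(id, val)

# Look up each row's group size in the frequency table, then drop the names
n_table <- as.vector(table(df$val)[as.character(df$val)])

# Same counts via ave(); seq_along() is just a placeholder vector to count over
n_ave <- ave(seq_along(df$val), df$val, FUN = length)

n_table
n_ave
```

Both vectors should equal c(5, 2, 3, 5, 2, 5, 3, 5, 5, 3), the out vector given in the question.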

Resources