Finding the number of unique variables per factor in R

I have a dataframe which looks like this:
id <- c(1,2,3,4,5,6,7,8,9,10)
val <- c("a", "b", "c", "a", "b", "a", "c", "a", "a", "c")
df <- data.frame(id,val)
I am trying to create a vector of length 10 which, for every id, gives the number of rows in df with the same value val. The output should be
out <- c(5, 2, 3, 5, 2, 5, 3, 5, 5, 3)
It's basically the opposite of
with(df, tapply(val, id, function(x) length(unique(x))))
If that makes sense? Maybe I could merge with(df, tapply(id, val, function(x) length(unique(x)))) with df somehow, but that seems like a very ugly solution.

You could do this:
table(df$val)[df$val]
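Note that this lookup returns a named table object rather than a plain vector. A minimal sketch of reducing it to the requested vector, assuming the df from the question (the name counts is introduced here for illustration, and as.character is added only so the lookup is by name even if val is stored as a factor):

```r
id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val <- c("a", "b", "c", "a", "b", "a", "c", "a", "a", "c")
df <- data.frame(id, val)

# Count rows per value, then look up each row's value in that table by name
counts <- table(df$val)

# as.vector() drops the table class and the names
out <- as.vector(counts[as.character(df$val)])
out
```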

The ave function is meant for tasks like this:
cc <- with(df, ave(id, val, FUN = length))
cbind(df, cc)
This results in:
id val cc
1 1 a 5
2 2 b 2
3 3 c 3
4 4 a 5
5 5 b 2
6 6 a 5
7 7 c 3
8 8 a 5
9 9 a 5
10 10 c 3
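The same per-row count can probably also be written with dplyr. This is a sketch of an alternative, assuming dplyr is installed; it is not part of the original answers, and the names df_counts and cc are chosen here for illustration:

```r
library(dplyr)

id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val <- c("a", "b", "c", "a", "b", "a", "c", "a", "a", "c")
df <- data.frame(id, val)

# n() is the size of the current group, so every row receives
# the number of rows that share its val
df_counts <- df %>%
  group_by(val) %>%
  mutate(cc = n()) %>%
  ungroup()
df_counts
```

Grouped mutate keeps the original row order, so cc lines up with id.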

Related

R: Filter one column based on another with many to many mapping

I have a dataset with an ID column and an item column. An ID is mapped to one or more items. The dataset has a row for each item mapped to an ID. I want to return IDs that contain my_items. The order of the items does not matter. I have a toy example below.
ID <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5)
item <- c("a", "b", "c", "a", "b", "c", "d", "a", "b", "d", "b", "a", "c")
df <- data.frame(cbind(ID, item))
df
my_items <- c("a", "b", "c")
My expected output would include only IDs 1 and 5.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(setequal(item, my_items))
Output
ID item
<chr> <chr>
1 1 a
2 1 b
3 1 c
4 5 b
5 5 a
6 5 c
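We can group by 'ID' and keep a group only if every element of my_items appears in its item column (a logical vector from %in%, collapsed with all) and the group has exactly as many distinct items as my_items (checked with n_distinct):
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(all(my_items %in% item), n_distinct(item) == n_distinct(my_items)) %>%
  ungroup()
Output: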
# A tibble: 6 × 2
ID item
<dbl> <chr>
1 1 a
2 1 b
3 1 c
4 5 b
5 5 a
6 5 c
If we sort each group with arrange, we can also use identical. This works here because my_items is itself sorted and no ID repeats an item:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(item, .by_group = TRUE) %>%
filter(identical(item,my_items))
ID item
<chr> <chr>
1 1 a
2 1 b
3 1 c
4 5 a
5 5 b
6 5 c
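For comparison, the same group test can be sketched in base R without dplyr. This is only an illustration under stated assumptions: the data frame is built with data.frame(ID, item) instead of cbind so that ID stays numeric, and the names keep, ids and result are introduced here:

```r
ID <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5)
item <- c("a", "b", "c", "a", "b", "c", "d", "a", "b", "d", "b", "a", "c")
df <- data.frame(ID, item)
my_items <- c("a", "b", "c")

# One logical per ID: does the group's item set equal my_items?
# setequal() ignores both order and duplicates.
keep <- sapply(split(df$item, df$ID), function(x) setequal(x, my_items))

# split() names the groups by ID, so the names of the TRUE entries are the IDs
ids <- names(keep)[keep]
result <- df[df$ID %in% ids, ]
result
```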

Efficient recursive random sampling with groups of unequal size

This question is a follow-up to my previous question, "Efficient recursive random sampling". The solutions in that thread work fine when the groups are of identical size or when a fixed number of samples per group is required. However, imagine a dataset as follows:
ID1 ID2
1 A 1
2 A 6
3 B 1
4 B 2
5 B 3
6 C 4
7 C 5
8 C 6
9 D 6
10 D 7
11 D 8
12 D 9
where we want to randomly sample up to n ID2 values for each ID1, and to do so recursively. Recursively here means that we move from the first ID1 to the last ID1, and an ID2 that was already sampled for one ID1 must not be used for a subsequent ID1. With n = 2, the expected result would be as follows:
ID1 ID2
1 A 1
2 A 6
4 B 2
5 B 3
6 C 4
7 C 5
11 D 8
12 D 9
For ID1 = "A", there are exactly two potential ID2, so both are selected.
For ID1 = "B", there are two potential ID2 left to select, so both are selected.
For ID1 = "C", there are two potential ID2 left to select, so both are selected.
For ID1 = "D", there are three potential ID2 left to sample from, so two are randomly selected from those.
What can happen beyond the situation shown in the example:
Every ID1 always has a non-zero number of ID2 available; however, it is possible that all of those ID2 were already used. In that case, the ID1 should simply be left out.
It is possible that no ID1 will have the specified n of ID2. In that case, the number closest to the specified n should be retrieved.
ID doesn't have to be seq(ID1).
ID2 could also be a character vector similar to ID1.
Sample df:
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
"D", "D", "D", "D"), ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8,
9)), class = "data.frame", row.names = c(NA, -12L))
The following function seems to give what you are after. It loops through each group of ID1 and keeps the rows whose ID2 has not been sampled yet. It then takes the distinct rows (in case some group of ID1 has duplicate ID2 values) and samples from them. The sample size is the smaller of n and the number of remaining rows for that group.
library(dplyr)
# Note: this definition masks base::sample() in the global environment.
sample <- function(df, n) {
  `%notin%` <- Negate(`%in%`)
  groups <- unique(df$ID1)
  out <- data.frame(ID1 = character(), ID2 = character())
  for (group in groups) {
    # Unused candidates for this group, de-duplicated before sampling
    options <- df %>%
      filter(ID1 == group,
             ID2 %notin% out$ID2) %>%
      distinct()
    chosen <- sample_n(options,
                       size = min(n, nrow(options)))
    out <- rbind(out, chosen)
  }
  out
}
set.seed(123)
sample(df, 2)
ID1 ID2
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 8
8 D 9
Case where a group of ID1 has ID2s that were already used up:
Input:
# A tibble: 10 × 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 3
3 B 1
4 B 3
5 C 5
6 C 6
7 C 7
8 C 7
9 D 10
10 D 20
Output:
sample(df2, 2)
# A tibble: 6 × 2
ID1 ID2
<chr> <dbl>
1 A 3
2 A 1
3 C 6
4 C 7
5 D 20
6 D 10
I don't know whether I am oversimplifying the problem. Take a look at the following and see whether it works in your case:
library(tidyverse)
df %>%
group_split(ID1)%>%
reduce(~ bind_rows(.x, .y) %>%
filter(!duplicated(ID2))%>%
group_by(ID1)%>%
slice_sample(n=2) %>%
ungroup,
.init = slice_sample(.[[1]], n=2))
# A tibble: 8 x 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 9
8 D 8
Disclaimer: not vectorized, thus inefficient.
Here is a base R option using dynamic programming (DP):
d <- table(df)
nms <- dimnames(d)
res <- list()
for (i in nms$ID1) {
idx <- which(d[i, ] > 0)
if (length(idx) >= 2) {
j <- sample(idx, 2)
res[[i]] <- nms$ID2[j]
d[, j] <- 0
}
}
dfout <- type.convert(
setNames(rev(stack(res)), names(df)),
as.is = TRUE
)
which gives
ID1 ID2
1 A 6
2 A 1
3 B 2
4 B 3
5 C 4
6 C 5
7 D 7
8 D 8
For the case where some ID2 values are already used up, e.g.,
> (df <- structure(list(ID1 = c(
+ "A", "A", "B", "B", "B", "C", "C", "C",
+ "D", "D", "D", "D"
+ ), ID2 = c(
+ 1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, .... [TRUNCATED]
ID1 ID2
1 A 1
2 A 3
3 B 1
4 B 2
5 B 3
6 C 3
7 C 4
8 C 5
9 D 4
10 D 5
11 D 6
12 D 1
we will obtain
ID1 ID2
1 A 1
2 A 3
3 C 5
4 C 4
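The loop above skips a group whenever fewer than two unused ID2 values remain, whereas the question asks for up to n. A minimal base R sketch of that variant follows. It is not the answerer's code: the function name sample_up_to and its internals are hypothetical, and it samples by position so that a single remaining numeric ID2 is not interpreted by sample() as a range 1:x.

```r
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
                             "D", "D", "D", "D"),
                     ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)),
                class = "data.frame", row.names = c(NA, -12L))

# Hypothetical helper: walk the ID1 groups in order of appearance and
# draw up to n ID2 values that no earlier group has taken.
sample_up_to <- function(df, n) {
  used <- c()
  picked <- list()
  for (g in unique(df$ID1)) {
    avail <- setdiff(unique(df$ID2[df$ID1 == g]), used)
    if (length(avail) == 0) next            # nothing left: leave this ID1 out
    k <- min(n, length(avail))              # take fewer than n if necessary
    chosen <- avail[sample.int(length(avail), k)]
    used <- c(used, chosen)
    picked[[g]] <- data.frame(ID1 = g, ID2 = chosen)
  }
  do.call(rbind, picked)
}

set.seed(123)
res <- sample_up_to(df, 2)
res
```

Like the other loops in this thread it is not vectorized, so it trades speed for readability.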

Subset dataframe in R, dplyr filter row values of column A not NA in row of column B

I have a dataset from a time series study. Since some participants didn't show up on certain days, they have NA values for the rest of the data frame, but certain study days were crucial, so I am trying to subset my data to participants not missing these crucial days. My dataset is actually very large, but here's the general structure:
fakedat <- data.frame(ID = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C",
"D", "D", "D", "D", "E", "E", "E", "E", "F", "F", "F", "F"),
StudyDay = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
1, 2, 3, 4),
Ab = c(10, NA, 15, 10, 10, 20, 10, NA, 10, 10, NA, 30, NA, NA, 15, NA, 10, 20,
10, 30, NA, 10, NA, 20))
Now let's say it was crucial that they show up on days 2 and 4. I tried subsetting with dplyr filtering like this:
fakedat2 <- fakedat %>%
  dplyr::group_by(ID) %>%
  dplyr::filter(StudyDay %in% c(2, 4) & !is.na(Ab)) %>%
  dplyr::ungroup()
EDIT: But the output of this is only the list of IDs that have a day 2 or 4 that's not an NA value. I need to find (in my real data) subjects who have NA Ab values at 4 specific study days.
The answer I accepted below works, but I am still curious about performing conditional filtering. In SAS, for example, you could code "IF Ab!=NA at (StudyDay=2 AND StudyDay=4) THEN ID ..." or something like that.
Maybe this will achieve your goal. If all participants have all StudyDay timepoints, and you just want to check that Ab is not missing on day 2 or day 4, you can check the Ab values at those time points in your filter. In this case, an ID is omitted only if Ab is NA on both days 2 and 4 (in this example, "D").
Alternatively, if you want to require that values are available for both days 2 and 4, you can use & (AND) instead of | (OR).
library(dplyr)
fakedat %>%
group_by(ID) %>%
filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))
If you have multiple days that must not be missing, you can use all and check for NA where the StudyDay is %in% a vector of required days, as follows:
required_vals <- c(2, 4)
fakedat %>%
group_by(ID) %>%
filter(all(!is.na(Ab[StudyDay %in% required_vals])))
Output of the first filter (the | version); the all() version keeps only IDs C, E and F:
ID StudyDay Ab
<chr> <dbl> <dbl>
1 A 1 10
2 A 2 NA
3 A 3 15
4 A 4 10
5 B 1 10
6 B 2 20
7 B 3 10
8 B 4 NA
9 C 1 10
10 C 2 10
11 C 3 NA
12 C 4 30
13 E 1 10
14 E 2 20
15 E 3 10
16 E 4 30
17 F 1 NA
18 F 2 10
19 F 3 NA
20 F 4 20
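The difference between requiring any and requiring all of the crucial days can be checked directly. A minimal sketch, assuming the fakedat from the question (rebuilt here with rep() for brevity) and dplyr; the names required_days, any_present and all_present are introduced for illustration:

```r
library(dplyr)

fakedat <- data.frame(
  ID = rep(c("A", "B", "C", "D", "E", "F"), each = 4),
  StudyDay = rep(1:4, times = 6),
  Ab = c(10, NA, 15, 10, 10, 20, 10, NA, 10, 10, NA, 30,
         NA, NA, 15, NA, 10, 20, 10, 30, NA, 10, NA, 20)
)
required_days <- c(2, 4)

# Keep an ID if Ab is present on at least one required day
any_present <- fakedat %>%
  group_by(ID) %>%
  filter(any(!is.na(Ab[StudyDay %in% required_days]))) %>%
  ungroup()

# Keep an ID only if Ab is present on every required day
all_present <- fakedat %>%
  group_by(ID) %>%
  filter(all(!is.na(Ab[StudyDay %in% required_days]))) %>%
  ungroup()

unique(any_present$ID)  # A, B, C, E, F
unique(all_present$ID)  # C, E, F
```

A and B drop out of the stricter version because each is missing Ab on one of the two required days.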
In base R, we can do
subset(fakedat, ID %in% ID[StudyDay %in% c(2, 4) & !is.na(Ab)])
Output
# ID StudyDay Ab
#1 A 1 10
#2 A 2 NA
#3 A 3 15
#4 A 4 10
#5 B 1 10
#6 B 2 20
#7 B 3 10
#8 B 4 NA
#9 C 1 10
#10 C 2 10
#11 C 3 NA
#12 C 4 30
#17 E 1 10
#18 E 2 20
#19 E 3 10
#20 E 4 30
#21 F 1 NA
#22 F 2 10
#23 F 3 NA
#24 F 4 20
Or a similar option in dplyr
library(dplyr)
fakedat %>%
filter(ID %in% ID[StudyDay %in% c(2, 4) & !is.na(Ab)])

Filter by values that have the exact names given in a list (dplyr)

I have the following data.
> dat
# A tibble: 12 x 2
id name
<chr> <chr>
1 1 a
2 1 b
3 1 a
4 2 a
5 2 b
6 2 c
7 2 b
8 3 a
9 3 b
10 3 c
11 3 d
12 3 d
I would like to filter only by the following list
set <- NULL
set$names <- c("a","b","c")
The ids selected are those that contain exactly the names in the list.
So the result would be only the 2s selected as follows:
> dat
# A tibble: 4 x 2
id name
<chr> <chr>
4 2 a
5 2 b
6 2 c
7 2 b
Here is the data for easy replication:
dat <- tribble(
~id, ~name,
1, "a",
1, "b",
1, "a",
2, "a",
2, "b",
2, "c",
2, "b",
3, "a",
3, "b",
3, "c",
3, "d",
3, "d"
)
I would like to have the following result.
How about:
group_by(dat, id) %>% filter(setequal(name, set$names))
This filters out all groups where the name column and set$names do not contain the same elements, but allows duplicates.
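To see what "allows duplicates" means, the comparison can be tried on plain vectors. A small sketch, using the names of id 2 from the question; the variable names and the stricter variant are illustrative assumptions, not part of the answer:

```r
target <- c("a", "b", "c")
group_names <- c("a", "b", "c", "b")  # names recorded for id 2

# setequal() compares the two vectors as sets, so the repeated "b" is ignored
same_set <- setequal(group_names, target)

# The two-way %in% test expresses the same condition
two_way <- all(target %in% group_names) && all(group_names %in% target)

# Hypothetical stricter variant: same elements and no repeats in the group
strict <- same_set && anyDuplicated(group_names) == 0

c(same_set = same_set, two_way = two_way, strict = strict)
```

With the stricter variant id 2 would be rejected, so it only fits if repeated names should disqualify a group.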
I am not sure this is what you want:
dat %>%
  group_by(id) %>%
  filter(all(set$names %in% name) & all(name %in% set$names))
# A tibble: 4 x 2
id name
<dbl> <chr>
1 2 a
2 2 b
3 2 c
4 2 b

Grouping data by name R

id value
1 expsubs 29
2 expsubs 32
3 expsubs 27
4 expsubs 36
5 expsubs 29
6 expsubs 24
I'm new to R.
I have data that I've sorted in Excel and tried to import into R.
I want to sort or group my data by the names in my "id" column so that I can run an ANOVA on it. I can't figure out how to get R to recognize my id column as the names for each value. Thanks!
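In this situation you can use the dplyr package:
library(dplyr)
tab <- data.frame(x = c("A", "B", "C", "C"), y = 1:4)
by_x <- group_by(tab, x)
by_x
This code groups your data by the x column. Note that group_by does not reorder the rows; use arrange(tab, x) if you want them sorted.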
Use order:
df <- data.frame(id = c("B", "A", "D", "C"), y = c(6, 8, 1, 5))
df
id y
1 B 6
2 A 8
3 D 1
4 C 5
df2 <- df[order(df$id), ]
df2
id y
2 A 8
1 B 6
4 C 5
3 D 1
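Since the stated goal is an ANOVA, sorting may not be needed at all: aov() mainly needs the grouping column to be treated as a factor. A minimal sketch with made-up data follows. The second group label "ctrlsubs" and all values are hypothetical, because the question shows only a single id level:

```r
# Hypothetical data: two groups in the id column, one numeric response
scores <- data.frame(
  id = rep(c("expsubs", "ctrlsubs"), each = 3),
  value = c(29, 32, 27, 36, 29, 24)
)

# Mark id as a grouping factor; row order does not matter for aov()
scores$id <- factor(scores$id)

fit <- aov(value ~ id, data = scores)
summary(fit)
```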
