dplyr solution to split dataset, but keep IDs in same splits - r

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "id" in the data frame "test" below is the ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.

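You can use mutate to add a column of chunk numbers (no group_by is needed). Note that running ntile directly on id still splits id 4 across chunks 1 and 2, as the output below shows:
library(dplyr)
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),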
value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or, following Frank's comment, you can run ntile on the distinct values of id and then join the original table back on id, which keeps every id in a single chunk:
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
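The same idea can be sketched without a join: give each distinct id a chunk number, then look that number up for every row. The helper name chunk_by_id and the column name chunk are illustrative assumptions, not part of the original answer.

```r
library(dplyr)

test <- tibble(
  id = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
  value = 1:16
)

# Hypothetical helper: every row of a given id gets the same chunk number.
chunk_by_id <- function(data, n) {
  ids <- unique(data$id)                   # one entry per distinct id
  chunk_of_id <- ntile(seq_along(ids), n)  # split the distinct ids into n groups
  data$chunk <- chunk_of_id[match(data$id, ids)]
  data
}

out <- chunk_by_id(test, 3)

# Work with one chunk at a time, e.g. the first one
first_chunk <- filter(out, chunk == 1)
```

Because the chunks are balanced by number of distinct ids rather than by rows, chunk sizes in rows can differ when some ids have many more rows than others.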

Related

Adding lines in data frame for each observation

I have data in long format, meaning that each individual has more than one observation (one row per observation), and the number of observations differs between individuals. I would like to restructure the data so that every individual has the same number of observations. To do that, I want to find the individual with the most observations and pad the others with extra rows filled by LOCF (last observation carried forward).
For example:
# simulate data structure
d <- data.frame(
id = c(1,1,1,2,2,3,3,3,3,3),
value = c(10,11,12,5,9,55,14,12,20,7) )
Now individual 3 has the most observations (count = 5). I would like to add two lines for individual 1 (with 12 for value) and three lines for individual 2 (with 9 for value)
Any ideas?
Best wishes and thank you.
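In case you wish to carry forward the last value for each individual, you could do the following in base R:
# number the observations within each id
d$seq <- ave(d$id, d$id, FUN = seq_along)
d <- merge(
  d,
  # cross join (by = NULL): last value per id x every position up to the largest group size
  merge(
    aggregate(value ~ id, data = d, FUN = tail, 1),
    data.frame(seq = 1:max(table(d$id))),
    by = NULL
  ),
  by = c("id", "seq"),
  all.y = TRUE
)
# keep the observed value where present, otherwise the carried-forward one
d$value <- ifelse(is.na(d$value.x), d$value.y, d$value.x)
d <- d[, !grepl("^value\\.", colnames(d))]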
id seq value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 12
5 1 5 12
6 2 1 5
7 2 2 9
8 2 3 9
9 2 4 9
10 2 5 9
11 3 1 55
12 3 2 14
13 3 3 12
14 3 4 20
15 3 5 7
Here's a tidyverse solution. If we create a variable to hold the within-ID count using seq_along, then we can use complete and fill to expand the table and fill in the missing values.
library(dplyr)
library(tidyr)
d |> group_by(id) |>
  mutate(n = seq_along(value)) |>
  ungroup() |>
  complete(id, n) |>
  fill(value) |>
  select(-n)
# A tibble: 15 × 2
id value
<dbl> <dbl>
1 1 10
2 1 11
3 1 12
4 1 12
5 1 12
6 2 5
7 2 9
8 2 9
9 2 9
10 2 9
11 3 55
12 3 14
13 3 12
14 3 20
15 3 7
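A slightly more defensive variant, offered as a sketch: grouping by id again before fill ensures a value is only ever carried forward within the same individual, regardless of row order. The name padded is illustrative.

```r
library(dplyr)
library(tidyr)

d <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  value = c(10, 11, 12, 5, 9, 55, 14, 12, 20, 7)
)

padded <- d %>%
  group_by(id) %>%
  mutate(n = row_number()) %>%        # position of each observation within its id
  ungroup() %>%
  complete(id, n) %>%                 # add the missing id/position combinations
  group_by(id) %>%
  fill(value, .direction = "down") %>% # LOCF, restricted to each id
  ungroup() %>%
  select(-n)
```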

Randomize two sets of number in R, not repeating values between groups

I have this file:
ID
1
1
1
3
3
3
7
7
7
And I need to assign two sets randomly, (1,2,3) and (5,15,25).
To do this I used this:
set.seed(1109201)
df %>%
group_by(ID) %>%
dplyr::mutate(set1=sample(c(1,2,3), size=n(), replace=F),set2=sample(c(5,15,25), size=n(), replace=F))
and I obtained this:
ID set1 set2
1 1 15
3 1 25
7 1 25
1 2 5
3 2 15
7 2 5
1 3 25
3 3 5
7 3 15
but I need the set2 values to be distinct within each set1 value and within each ID, like this:
ID set1 set2
1 1 15
3 1 25
7 1 5
1 2 5
3 2 15
7 2 25
1 3 25
3 3 5
7 3 15
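set2 values must not repeat within an ID or within a set1 value.
Any suggestions for how to control these two sets?
Change your dplyr code to the following. Adding a second group_by() step makes the second sampling occur within each set1 group, so set2 values do not repeat within a set1 value. Note that this step alone does not guarantee distinct set2 values within an ID.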
set.seed(1109201)
df %>%
group_by(ID) %>%
dplyr::mutate(set1=sample(c(1,2,3), size=n(), replace=F)) %>%
group_by(set1) %>%
mutate(set2=sample(c(5,15,25), size=n(), replace=F)) %>%
ungroup()
# A tibble: 8 x 3
ID set1 set2
<dbl> <dbl> <dbl>
1 1 2 15
2 1 3 5
3 1 1 25
4 3 3 15
5 3 2 5
6 3 1 5
7 7 2 25
8 7 3 25
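If set2 must be distinct both within each ID and within each set1 value, one option is a randomized cyclic Latin square. This is a swapped-in technique rather than the answer's method, sketched under the assumption that there are at most as many IDs as set2 values and at most that many rows per ID. The names id_pos, set1_pos, and k are illustrative.

```r
set.seed(1109201)

df <- data.frame(ID = rep(c(1, 3, 7), each = 3))
set1_values <- c(1, 2, 3)
set2_values <- c(5, 15, 25)
k <- length(set2_values)

# set1: a random permutation of set1_values within each ID
df$set1 <- ave(df$ID, df$ID, FUN = function(x) sample(set1_values, length(x)))

# Random positions (1..k) for each ID and for each set1 value
id_pos   <- match(df$ID, sample(unique(df$ID)))
set1_pos <- match(df$set1, sample(set1_values))

# Cyclic Latin square: cell (i, j) takes shuffled set2 value number (i + j) mod k + 1,
# so no value repeats along a row (ID) or a column (set1)
df$set2 <- sample(set2_values)[(id_pos + set1_pos) %% k + 1]
```

The three sample() calls randomize which ID, which set1 value, and which set2 value land in each position of the square; this does not draw uniformly from all possible Latin squares.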

Nested Subsetting

I have the following data frame
library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Tag <- c(1,2,6,1,3,4,6,4,3)
Value <- c(5,9,3,3,5,6,4,8,9)
DF <- data.frame(ID,Tag,Value)
ID Tag Value
1 1 1 5
2 1 2 9
3 1 6 3
4 2 1 3
5 2 3 5
6 2 4 6
7 2 6 4
8 3 4 8
9 3 3 9
I would like to do the following: 1) group the rows by ID, and 2) assign the Value corresponding to a specific Tag to a new column. In the following example, I assign the Value of Tag 6 to a new column within each ID:
ID Tag Value New_Value
1 1 1 5 3
2 1 2 9 3
3 1 6 3 3
4 2 1 3 4
5 2 3 5 4
6 2 4 6 4
7 2 6 4 4
8 3 4 8 NA
9 3 3 9 NA
To the best of my knowledge, I need to subset the data in each group to get the Value for Tag 6. Here is my code and the error msg
DF %>% group_by(ID) %>% mutate(New_Value = select(filter(.,Tag==6),Value))
Adding missing grouping variables: `ID`
Error: Column `New_Value` is of unsupported class data.frame
Another possible solution is to create a new dataframe with IDs and Values for Tag 6 and join it with DF. However, I believe there is a better generic solution by only using dplyr.
I would appreciate it if you can help me understand how to perform a nested subset in this situation
Thank you
On the assumption that Tag is unique within groups, you could do:
library(dplyr)
DF %>%
group_by(ID) %>%
mutate(New_Value = ifelse(any(Tag == 6), Value[Tag == 6], NA))
# A tibble: 9 x 4
# Groups: ID [3]
ID Tag Value New_Value
<dbl> <dbl> <dbl> <dbl>
1 1 1 5 3
2 1 2 9 3
3 1 6 3 3
4 2 1 3 4
5 2 3 5 4
6 2 4 6 4
7 2 6 4 4
8 3 4 8 NA
9 3 3 9 NA
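A more compact variant under the same assumption (Tag unique within each ID) can be sketched with match, which returns NA when Tag 6 is absent, so indexing Value with it yields NA for that group. The name result is illustrative.

```r
library(dplyr)

DF <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Tag   = c(1, 2, 6, 1, 3, 4, 6, 4, 3),
  Value = c(5, 9, 3, 3, 5, 6, 4, 8, 9)
)

result <- DF %>%
  group_by(ID) %>%
  # match(6, Tag) is the position of Tag 6 in the group, or NA if it is missing
  mutate(New_Value = Value[match(6, Tag)]) %>%
  ungroup()
```

If Tag 6 could appear more than once in a group, match picks the first occurrence.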

R Selecting highest count cells conditional on two columns

Apologies, if this is a duplicate please let me know, I'll gladly delete.
I am attempting to select the four highest values for different values of another column.
Dataset:
A B COUNT
1 1 2 2
2 1 3 6
3 1 4 3
4 1 5 9
5 1 6 2
6 1 7 7
7 1 8 0
8 1 9 5
9 1 10 2
10 1 11 7
11 2 1 5
12 2 3 1
13 2 4 8
14 2 5 9
15 2 6 5
16 2 7 2
17 2 8 2
18 2 9 4
19 3 1 7
20 3 2 5
21 3 4 2
22 3 5 8
23 3 6 6
24 3 7 1
25 3 8 9
26 3 9 5
27 4 1 8
28 4 2 1
29 4 3 1
30 4 5 3
31 4 6 9
For example, I would like to select four highest counts when A=1 (9,7,7,6) then when A=2 (9,8,5,5) and so on...
I would also like the corresponding B column value to be beside each count, so for when A=1 my desired output would be something like:
B A Count
5 1 9
7 1 7
11 1 7
3 1 6
I have looked at various answers on 'selecting highest values' but was struggling to find an example conditioning on other columns.
Many thanks
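We can do
library(dplyr)
df1 %>%
  group_by(A) %>%
  arrange(desc(COUNT)) %>%
  filter(row_number() < 5)
Alternatively, with the data frame named data, sort by COUNT within each A and keep the first four rows of each group:
library(dplyr)
data %>% group_by(A) %>%
  arrange(A, desc(COUNT)) %>%
  slice(1:4)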
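With dplyr 1.0.0 or later, the same selection can be sketched with slice_max. The data frame below is rebuilt from the question; the names df1 and top4 are illustrative, and with_ties = FALSE is an assumption that exactly four rows per group are wanted even when counts tie.

```r
library(dplyr)

df1 <- data.frame(
  A = rep(1:4, times = c(10, 8, 8, 5)),
  B = c(2:11,
        1, 3, 4, 5, 6, 7, 8, 9,
        1, 2, 4, 5, 6, 7, 8, 9,
        1, 2, 3, 5, 6),
  COUNT = c(2, 6, 3, 9, 2, 7, 0, 5, 2, 7,
            5, 1, 8, 9, 5, 2, 2, 4,
            7, 5, 2, 8, 6, 1, 9, 5,
            8, 1, 1, 3, 9)
)

top4 <- df1 %>%
  group_by(A) %>%
  slice_max(COUNT, n = 4, with_ties = FALSE) %>%  # four largest COUNT values per A
  ungroup()
```

Leaving with_ties at its default of TRUE would return more than four rows for a group whenever the fourth-largest count is tied.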

How to generate an uneven sequence of numbers in R

Here's an example data frame:
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
I want to generate a sequence of numbers according to the number of observations of y per x group (e.g. there are 2 observations of y for x=1). I want the sequence to be continuously increasing and jumps by 2 after each x group.
The desired output for this example would be:
1,2,5,6,7,10,11,14,17,20,21,22,25,26
How can I do this simply in R?
To expand on my comment: the groupings can be arbitrary; you simply need to recode them as consecutive integers in order of appearance. There are a few ways to do this. @akrun has shown that it can be done with the match function, or you can use as.numeric on a factor if that is easier for you to follow.
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
# these are equivalent
df$newx <- as.numeric(factor(df$x, levels=unique(df$x)))
df$newx <- match(df$x, unique(df$x))
Since you now have a "new" releveling which is sequential, we can use the logic that was discussed in the comments.
df$newNumber <- 1:nrow(df) + (df$newx-1)*2
For this example, this will result in the following dataframe:
x y newx newNumber
1 1 1 1
1 2 1 2
2 3 2 5
2 4 2 6
2 6 2 7
3 3 3 10
3 7 3 11
4 8 4 14
5 6 5 17
6 4 6 20
6 3 6 21
6 7 6 22
9 3 7 25
9 2 7 26
where df$newNumber is the output you wanted.
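The two steps can also be collapsed into a single vector expression, sketched here without the intermediate columns. The names group_index and new_number are illustrative.

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)

# Consecutive group number in order of first appearance (1, 1, 2, 2, 2, 3, ...)
group_index <- match(x, unique(x))

# Row position plus an extra offset of 2 for every group already completed
new_number <- seq_along(x) + 2 * (group_index - 1)
```

Each new group shifts the running row number by 2, which together with the normal step of 1 gives the gap between the last number of one group and the first of the next.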
To create the sequence 0,0,4,4,4,9,..., take the minimum of newNumber within each group and subtract 1. The easiest way to do this is with dplyr.
library(dplyr)
df %>%
group_by(x) %>%
mutate(newNumber2 = min(newNumber) -1)
Which will have the output:
Source: local data frame [14 x 5]
Groups: x
x y newx newNumber newNumber2
1 1 1 1 1 0
2 1 2 1 2 0
3 2 3 2 5 4
4 2 4 2 6 4
5 2 6 2 7 4
6 3 3 3 10 9
7 3 7 3 11 9
8 4 8 4 14 13
9 5 6 5 17 16
10 6 4 6 20 19
11 6 3 6 21 19
12 6 7 6 22 19
13 9 3 7 25 24
14 9 2 7 26 24
