Sample by group with a condition (R)

I need to randomly select one diary for each individual (id), but only for those who filled in more than one.
Suppose my data look like this:
dta = rbind(c(1, 1, 'a'),
c(1, 2, 'a'),
c(1, 3, 'b'),
c(2, 1, 'a'),
c(3, 1, 'b'),
c(3, 2, 'a'),
c(3, 3, 'c'))
colnames(dta) <- c('id', 'DiaryNumber', 'type')
dta = as.data.frame(dta)
dta
id DiaryNumber type
1 1 a
1 2 a
1 3 b
2 1 a
3 1 b
3 2 a
3 3 c
For example, id 1 filled in 3 diaries, so I need to randomly select one of the 3. Id 2 filled in only one diary, so nothing needs to be done for it.
I have no idea how to do that.
Any ideas?

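You can use sample_n. For an id with a single diary, sampling one row simply returns that row, so no separate condition is needed: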
library(dplyr)
dta %>% group_by(id) %>% sample_n(1)
## Source: local data frame [3 x 3]
## Groups: id
##
## id DiaryNumber type
## 1 1 2 a
## 2 2 1 a
## 3 3 1 b

Base package:
set.seed(123)
df <- lapply(split(dta, dta$id), function(x) x[sample(nrow(x), 1), ])
do.call("rbind", df)
Output:
id DiaryNumber type
1 1 1 a
2 2 1 a
3 3 2 a
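If you would rather have the "more than one diary" condition spelled out in the code, a minimal base R sketch could look like this. The helper name pick_one is made up for illustration, and the data frame is rebuilt with data.frame() only so that the block runs on its own.

```r
# Rebuild the example data (same values as in the question)
dta <- data.frame(
  id          = c(1, 1, 1, 2, 3, 3, 3),
  DiaryNumber = c(1, 2, 3, 1, 1, 2, 3),
  type        = c("a", "a", "b", "a", "b", "a", "c")
)

# pick_one() is a hypothetical helper: sample a row only when the id
# has more than one diary, otherwise return the single row unchanged
pick_one <- function(x) {
  if (nrow(x) > 1) x[sample(nrow(x), 1), , drop = FALSE] else x
}

set.seed(123)  # only for reproducibility
picked <- do.call(rbind, lapply(split(dta, dta$id), pick_one))
picked
```

Sampling row indices with sample(nrow(x), 1) is safe for one-row groups too, so the if branch is there mainly for readability.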

Related

How to group the data by id and get unique values of all columns in R?

I have a table with an ID column and other columns. I want to group the data by ID and get the unique values across all columns.
From the above table, group by ID and get unique(Alt1, Alt2, Alt3).
The result should be in vector form:
A -> 1,2,3,5
B -> 1,3,4,5,7

We can reshape the data to long format and, for each ID, collect the unique values in a list.
library(dplyr)
library(tidyr)
df1 <- df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>%
summarise(value = list(unique(value))) %>%
unnest(value)
df1
# ID value
# <fct> <dbl>
# 1 A 1
# 2 A 3
# 3 A 2
# 4 A 5
# 5 B 1
# 6 B 4
# 7 B 5
# 8 B 3
# 9 B 6
#10 B 7
We can store it as a list if needed using split.
split(df1$value, df1$ID)
#$A
#[1] 1 3 2 5
#$B
#[1] 1 4 5 3 6 7
The data.table equivalent of the above would be:
library(data.table)
setDT(df)
df2 <- melt(df, id.vars = 'ID')[, .(value = list(unique(value))), ID]
The unique values are stored in df2$value as a list column, with one vector per ID.
Data:
df <- data.frame(ID = c('A', 'A', 'B', 'B'),
Alt1 = c(1, 2, 1, 3),
Alt2 = c(3, 5, 4, 6),
Alt3 = c(1, 3, 5, 7))
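For completeness, here is a rough base R sketch of the same idea without any packages. It assumes every column other than ID is a value column, and the names value_cols and uniq are just illustrative.

```r
df <- data.frame(ID = c('A', 'A', 'B', 'B'),
                 Alt1 = c(1, 2, 1, 3),
                 Alt2 = c(3, 5, 4, 6),
                 Alt3 = c(1, 3, 5, 7))

# Assumption: everything except ID holds values to be pooled
value_cols <- setdiff(names(df), "ID")

# Split the value columns by ID, flatten each piece, keep sorted unique values
uniq <- lapply(split(df[value_cols], df$ID),
               function(d) sort(unique(unlist(d, use.names = FALSE))))
uniq
```

The result is a named list with one vector per ID, which is the "vector form" asked for in the question. The sort() is optional and only there to make the output easier to read.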

Filtering in tidyverse based on a vector/list of possible values

I would like to select rows of a data frame based on conditions on two columns that together should identify a unique row. In the concrete example below, I would like to select
id = 1, 2, 3, ... with a specific mtry value for each, given in a vector, i.e. for id = 1 I just want the line with mtry = 3, and for id = 2 I would like mtry = 5.
I tried using group_by and filter, e.g.
filter(df, (mtry,id) %in% c([3,1],[5,2],[3,3]))
but this gives an error
Error: unexpected ',' in .
What is the tidyverse way of doing this?

You can do this kind of filter with an inner join
library(dplyr)
df %>%
inner_join(tibble(mtry = c(3, 5, 3), id = c(1, 2, 3)))
Example:
set.seed(100)
df <- data.frame(mtry = sample(1:3, 100, T), id = sample(1:5, 100, T))
df %>%
inner_join(tibble(mtry = c(3, 5, 3), id = c(1, 2, 3)))
# Joining, by = c("mtry", "id")
# mtry id
# 1 3 1
# 2 3 3
# 3 3 3
# 4 3 3
# 5 3 1
# 6 3 3
# 7 3 1
# 8 3 1
# 9 3 1
# 10 3 3
# 11 3 1
# 12 3 3
# 13 3 1

You need to create a separate condition for each combination:
subset(df, (mtry == 3 & id == 1) | (mtry == 5 & id == 2) | (mtry == 3 & id == 3))
Or, if you want the tidyverse way, put the conditions in filter:
library(dplyr)
df %>% filter((mtry == 3 & id == 1) | (mtry == 5 & id == 2) | (mtry == 3 & id == 3))
You can combine conditions 1 and 3:
df %>% filter((mtry == 3 & id %in% c(1, 3)) | (mtry == 5 & id == 2))
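When the list of wanted (mtry, id) pairs gets long, writing out each condition becomes tedious. A possible alternative is to keep the pairs in a small lookup table and use dplyr's semi_join, which keeps the rows of df that have a match without adding columns. This is only a sketch: the score column and the object name wanted are invented for illustration.

```r
library(dplyr)

# Toy data; `score` is a made-up column standing in for the real results
df <- data.frame(mtry  = c(3, 1, 5, 2, 3, 5),
                 id    = c(1, 1, 2, 2, 3, 3),
                 score = c(0.61, 0.58, 0.70, 0.66, 0.64, 0.69))

# One row per (mtry, id) combination we want to keep
wanted <- data.frame(mtry = c(3, 5, 3), id = c(1, 2, 3))

# Keep only the rows of df whose (mtry, id) pair appears in `wanted`
res <- semi_join(df, wanted, by = c("mtry", "id"))
res
```

Passing by explicitly avoids the "Joining, by = ..." message and makes it clear which columns define a match.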

Combining 3 versions of same table together in R

I scraped some data from a website but it was really janky and for some reason had little mistakes in it. So, I scraped the same data 3 times, and produced 3 tables that look like:
library(data.table)
df1 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(2, 1, 3, 4),
otherthing = c(2,1, 3, 4)
)
df2 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(1, 1, 1, 4),
otherthing = c(2,2, 3, 4)
)
df3 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(1, 1, 3, 4),
otherthing = c(2,1, 3, 3)
)
Except I have many more columns. I want to combine the 3 tables together, and when the values for "thing" and "other thing" etc. conflict, I want it to pick the value that has at least 2/3 and perhaps return an N/A if there is no 2/3 value. I'm confident the "name" and "id" field are good and they're what I want to sort of merge on.
I was considering setting the names for the tables to be, "thing1" "thing2" and "thing3" in the 3 tables respectively, merging together, and then writing some loops through the names. Is there a more elegant solution? It needs to work for 300+ value columns although I'm not super worried about speed.
In this example, the solution I think should be:
final_result <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(1, 1, 3, 4),
otherthing = c(2,1, 3, 4)
)

To generalize the approach from @IceCreamToucan, we can use:
library(dplyr)
n_mode <- function(...) {
x <- table(c(...))
if(any(x > 1)) as.numeric(names(x)[which.max(x)])
else NA
}
bind_rows(df1, df2, df3) %>%
group_by(name, id) %>%
summarise_all(n_mode)
N.B. Be careful with your namespace and how you name the function: prefer something like n_mode() to avoid conflicts with base::mode. Finally, if you extend this to more data frames, you probably want to put them in a list. If that's not possible or practical, you could replace bind_rows(df1, df2, df3) with purrr::map_df(ls(pattern = "^df[[:digit:]]+"), get)

A data.table version of Jason's solution (you should leave his as accepted):
library(data.table)
n_mode <- function(x) {
x <- table(x)
if(any(x > 1)) as.numeric(names(x)[which.max(x)])
else NA
}
my_list <- list(df1, df2, df3)
rbindlist(my_list)[, lapply(.SD, n_mode), .(name, id)]
# name id thing otherthing
# 1: adam 1 1 2
# 2: bob 2 1 1
# 3: carl 3 3 3
# 4: dan 4 4 4
Here's the output of rbindlist. Hopefully this makes it more clear why just taking n_mode of all the columns, grouped by name and id, gives the output you want.
rbindlist(my_list)[order(name, id)]
# name id thing otherthing
# 1: adam 1 2 2
# 2: adam 1 1 2
# 3: adam 1 1 2
# 4: bob 2 1 1
# 5: bob 2 1 2
# 6: bob 2 1 1
# 7: carl 3 3 3
# 8: carl 3 1 3
# 9: carl 3 3 3
# 10: dan 4 4 4
# 11: dan 4 4 4
# 12: dan 4 4 3
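To see what the helper does on its own, here is a small sketch that applies n_mode to three-element vectors like the ones each (name, id) group produces. The local variable is renamed to counts for readability; the logic is otherwise the same as above.

```r
# Same majority-vote helper as in the answers above
n_mode <- function(x) {
  counts <- table(x)
  if (any(counts > 1)) as.numeric(names(counts)[which.max(counts)]) else NA
}

n_mode(c(2, 1, 1))  # two of the three scrapes agree on 1
n_mode(c(4, 4, 4))  # all three agree on 4
n_mode(c(1, 2, 3))  # no value appears twice, so the result is NA
```

Note that as.numeric(names(...)) assumes numeric columns, and that with more than three tables any(counts > 1) no longer means "a majority", so the threshold would need adjusting.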

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
with the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the $dg variable as T if rank differs within a given id, and F if rank is the same within that id.
I am not getting any errors, but I only get F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for a given id. I'm looking to say, for a given ID which has anywhere between 1 and 13 instances, mark dg TRUE if rank differs across instances.

Update: How to identify groups (ids) that only have one rank?
After the clarification the OP provided, this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
For another data set (presented below) that also contains an id with both duplicated and non-duplicated ranks, this would be the output:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
#:OUTPUT:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data-no-2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use the dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, F,T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
A data.table solution would be:
library(data.table)
dt <- data.table(df)
# .N is the number of rows in each (id, rank) group; TRUE means the row is not duplicated
dt[, dg := .N == 1, by = .(id, rank)]
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want identifiers other than TRUE/FALSE, using ifelse() is redundant and computationally costly. @DavidArenburg
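As for why the original loop always leaves dg as FALSE: df$dg == T is a comparison rather than an assignment, the value returned by ifelse() is thrown away, and the loop variable i is never used, so nothing is ever written back to df. A minimal sketch of the grouped version without ifelse(), assuming dplyr is available, might look like this:

```r
library(dplyr)

df <- data.frame(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3, 3, 4))

res <- df %>%
  group_by(id) %>%
  # the comparison already yields TRUE/FALSE, so no ifelse() is needed
  mutate(dg = n_distinct(rank) > 1 | n() == 1) %>%
  ungroup()

res$dg
```

The condition is the same one used in the update above (more than one distinct rank, or an id that occurs only once); only the ifelse() wrapper is dropped.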

Counting the result of a left join using dplyr

What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.

With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
.EACHI refers to each row of i=a.
j=.N uses a special variable for the number of rows.

There are already some good answers, but since the question asks not to use extra packages, here is a base R one. We perform a left join on a and b and append a refs column which is TRUE if ref_id is not NA. Then we use aggregate to sum over the refs column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
giving:
id refs
1 1 2
2 2 0
3 3 3
4 4 1

It does require another package, but I'd be remiss not to mention tidylog, which provides reports for a wide range of tidyverse verbs. In your case, it would produce a report like:
library(dplyr)
library(tidylog)
a <- data.frame(id = c(1, 2, 3, 4 ))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4), ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))
a %>% left_join(b, by='id')
left_join: added one column (ref_id)
> rows only in x 1
> rows only in y (0)
> matched rows 6 (includes duplicates)
> ===
> rows total 7
id ref_id
1 1 a
2 1 b
3 2 <NA>
4 3 c
5 3 d
6 3 e
7 4 f
See here and here for more examples/info

I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1
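If the join itself isn't needed, another option is to count b$id directly in base R. The sketch below assumes the ids in a are unique; the object names counts and refs are just illustrative.

```r
a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c("a", "b", "c", "d", "e", "f"))

# Using a$id as the factor levels keeps ids that never occur in b,
# so they are counted as 0 instead of being dropped
counts <- table(factor(b$id, levels = a$id))
refs <- data.frame(id = a$id, refs = as.vector(counts))
refs
```

This gives 0 for id 2 because the count is taken over b alone, so there is no NA-padded row to be miscounted.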
