Using R merge() to collect non matching IDs [duplicate] - r

So I have these two dataframes:
id <- c(1, 2, 3, 4, 5, 6, 7, 8)
drug <- c("A", "B", "C", "D", "E", "F", "G", "H")
value <- c(100, 200, 300, 400, 500, 600, 700, 800)
df1 <- data.frame(id, drug, value)
id <- c(1, 2, 3, 4, 6, 8)
treatment <- c("C", "IC", "C", "IC", "C", "C")
value <- c(700, 800, 900, 100, 200, 900)
df2 <- data.frame(id, treatment, value)
I used merge() to combine the two datasets like this:
key = "id"
merge(df1,df2[key],by=key)
This worked, but I end up dropping some rows (because their ids have no match).
Is there a way I can see or collect the ids that were dropped as well?
My real dataset consists of hundreds of entries, so a way to find the dropped ids in R would be very useful.

You can use anti_join() from dplyr, which returns the rows of df1 whose id has no match in df2:
library(dplyr)
> anti_join(df1, df2, by = "id")
  id drug value
1  5    E   500
2  7    G   700
Or, if you just want the IDs:
> anti_join(df1, df2, by = "id")$id
[1] 5 7
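If you would rather not load dplyr, the same result can be sketched in base R with setdiff() and %in%. This is only an alternative illustration built on the data from the question; the names dropped_ids and dropped_rows are made up for the example.

```r
# Example data copied from the question
df1 <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6, 7, 8),
  drug  = c("A", "B", "C", "D", "E", "F", "G", "H"),
  value = c(100, 200, 300, 400, 500, 600, 700, 800)
)
df2 <- data.frame(
  id        = c(1, 2, 3, 4, 6, 8),
  treatment = c("C", "IC", "C", "IC", "C", "C"),
  value     = c(700, 800, 900, 100, 200, 900)
)

# IDs present in df1 but absent from df2 (hypothetical variable name)
dropped_ids <- setdiff(df1$id, df2$id)

# Full rows of df1 whose id has no match in df2 (hypothetical variable name)
dropped_rows <- df1[!df1$id %in% df2$id, ]
```

Here dropped_ids holds the ids that merge() leaves out, and dropped_rows holds the corresponding rows of df1.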


Transform a df into individual observations [duplicate]

I want to transform a df from a "counting" approach (number of cases per row) to an "individual observations" approach (one row per case).
Example:
df <- dplyr::tibble(
  city = c("a", "a", "b", "b", "c", "c"),
  sex = c(1, 0, 1, 0, 1, 0),
  age = c(1, 2, 1, 2, 1, 2),
  cases = c(2, 3, 1, 1, 1, 1))
Expected result:
df <- dplyr::tibble(
  city = c("a", "a", "a", "a", "a", "b", "b", "c", "c"),
  sex = c(1, 1, 0, 0, 0, 1, 0, 1, 0),
  age = c(1, 1, 2, 2, 2, 1, 2, 1, 2))
uncount() from tidyr can do that for you.
df |> tidyr::uncount(cases)
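For comparison, here is a rough base R sketch of the same idea: repeat each row index as many times as its cases value, then drop the count column. It uses a plain data.frame instead of a tibble, and the name expanded is only for illustration.

```r
# Same example data as in the question, as a base data.frame
df <- data.frame(
  city  = c("a", "a", "b", "b", "c", "c"),
  sex   = c(1, 0, 1, 0, 1, 0),
  age   = c(1, 2, 1, 2, 1, 2),
  cases = c(2, 3, 1, 1, 1, 1)
)

# Repeat each row index `cases` times and keep only the descriptive columns
expanded <- df[rep(seq_len(nrow(df)), times = df$cases), c("city", "sex", "age")]

# Reset the row names, which otherwise carry suffixes such as "1.1"
rownames(expanded) <- NULL
```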

How to apply a function to a data.table subset by multiple columns in R?

I have a data table with counts for changes for multiple groups. For example:
input <- data.table(from = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
                    to = c(letters[1:6], letters[1:6]),
                    from_N = c(100, 100, 100, 50, 50, 50, 60, 60, 60, 80, 80, 80),
                    to_N = c(10, 20, 40, 5, 5, 15, 10, 5, 10, 20, 5, 10),
                    group = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2))
How can I calculate the total for each change across groups? I can do this using a for loop, for example:
out <- list()
for (i in 1:length(unique(input$from))){
  sub <- input[from == unique(input$from)[i]]
  out2 <- list()
  for (j in 1:length(unique(sub$to))){
    sub2 <- sub[to == unique(sub$to)[j]]
    out2[[j]] <- data.table(from = sub2$from[1],
                            to = sub2$to[1],
                            from_N = sum(sub2$from_N),
                            to_N = sum(sub2$to_N))
    print(unique(sub$to)[j])
  }
  out[[i]] <- do.call("rbind", out2)
  print(unique(input$from)[i])
}
output <- do.call("rbind", out)
However, the data table I need to apply this to is very large, and I therefore need to maximise performance. Is there a data.table method? Any help will be greatly appreciated!
Perhaps I've overlooked something, but it seems you're just after:
library(data.table)
setDT(input)[, .(from_N = sum(from_N), to_N = sum(to_N)), by = .(from, to)]
Output:
   from to from_N to_N
1:    A  a    160   20
2:    A  b    160   25
3:    A  c    160   50
4:    B  d    130   25
5:    B  e    130   10
6:    B  f    130   25
An option with dplyr:
library(dplyr)
input %>%
  group_by(from, to) %>%
  summarise(across(ends_with("_N"), sum), .groups = "drop")
Or in data.table
library(data.table)
setDT(input)[, lapply(.SD, sum), by = .(from, to), .SDcols = patterns('_N$')]
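For readers without data.table or dplyr, a base R sketch of the same grouped sum with aggregate() is below. It is likely to be slower on very large tables, so treat it as an illustration of the grouping logic and not as a performance recommendation; the name totals is made up for the example.

```r
# Example data from the question, as a plain data.frame
input <- data.frame(
  from   = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
  to     = c(letters[1:6], letters[1:6]),
  from_N = c(100, 100, 100, 50, 50, 50, 60, 60, 60, 80, 80, 80),
  to_N   = c(10, 20, 40, 5, 5, 15, 10, 5, 10, 20, 5, 10),
  group  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
)

# Sum both count columns for every (from, to) combination
totals <- aggregate(cbind(from_N, to_N) ~ from + to, data = input, FUN = sum)
```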

Allocate ordinal values to numerical vector in R [duplicate]

I have a set of data from children, recorded across a number of sessions. The number of sessions and age of each child in each session is different for each participant, so it looks something like this:
library(tibble)
mydf <- tribble(~subj, ~age,
                "A", 16,
                "A", 17,
                "A", 19,
                "B", 10,
                "B", 11,
                "B", 12,
                "B", 13)
What I don't currently have in the data is a variable for Session number, and I'd like to add this to my dataframe. Basically I want to create a numeric variable that is ordinal from 1-n for each child, something like this:
mydf2 <- tribble(~subj, ~age, ~session,
                 "A", 16, 1,
                 "A", 17, 2,
                 "A", 19, 3,
                 "B", 10, 1,
                 "B", 11, 2,
                 "B", 12, 3,
                 "B", 13, 4)
Ideally I'd like to do this with dplyr.
You simply need to group by subj and use row_number():
library(dplyr)
mydf %>%
  group_by(subj) %>%
  mutate(session = row_number())
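As a point of comparison, the same per-group numbering can be sketched in base R with ave(). This assumes the rows are already in session order within each child, as they are in the example data.

```r
# Example data from the question, as a plain data.frame
mydf <- data.frame(
  subj = c("A", "A", "A", "B", "B", "B", "B"),
  age  = c(16, 17, 19, 10, 11, 12, 13)
)

# Number the rows within each subject, following their existing order
mydf$session <- ave(seq_along(mydf$subj), mydf$subj, FUN = seq_along)
```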

dplyr filter with exception

So I'm trying to filter out certain things in my dataset.
Here's a really pared-down example of my dataset:
fish <- data.frame("order" = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
                   "family" = c("r", "s", "t", "r", "y", "y", "y", "u", "y", "u", "y"),
                   "species" = c(7, 8, 9, 6, 5, 4, 3, 10, 1, 11, 2))
So I have
fish <- fish %>%
  filter(
    !(order %in% c("a", "b", "c")) &
    !(family %in% c("r", "s", "t", "u"))
  )
which should remove all orders in a, b, c and all families in r, s, t, u, leaving me with
order family species
d y 10
e y 11
But the issue is, there are two species that are in families that I am filtering out. So say species 1 is in family "r". I want species 1 to stay in the dataset, while filtering all the rest of family r. So I want the output to look like:
order family species
d y 10
e y 11
d r 1
e r 2
How can I make sure that when I'm filtering out the groups of family, it keeps these two species?
Thanks!
You could rbind the results of three separate filters:
# Use %in% here: != against a vector would recycle it element by element
temp1 <- filter(fish, !order %in% c("a", "b", "c") & !family %in% c("r", "s", "t", "u"))
temp2 <- filter(fish, family == "r" & species == 1)
temp3 <- filter(fish, family == "s" & species == 2)
fish <- rbind(temp1, temp2, temp3)
rm(temp1, temp2, temp3)
It would be most natural to have the filtering process mirror your logic:
Filter #1: filter out the undesirable orders and families
Filter #2: keep the desirable (family, species) pairs
Note: I had to change your (family, species) pair criteria to get matches.
library(dplyr)
library(purrr)
# your example data
fish <- tibble("order" = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
               "family" = c("r", "s", "t", "r", "y", "y", "y", "u", "y", "u", "y"),
               "species" = c(7, 8, 9, 6, 5, 4, 3, 10, 1, 11, 2))
# put filter criteria in variables
order_filter <- c('a', 'b', 'c')
family_filter <- c('r', 's', 't', 'u')
# Filter 1
df1 <- fish %>%
  filter(!order %in% order_filter,
         !family %in% family_filter)
# Filter 2
df2 <- map_df(.x = list(c('r', 7), c('s', 8)),
              .f = function(x) {
                fish %>%
                  filter(family == x[1], species == x[2])
              })
# Combine two data frames created by Filter 1 and Filter 2
df_final <- bind_rows(df1, df2)
print(df_final)
# A tibble: 4 x 3
#   order family species
#   <chr> <chr>    <dbl>
# 1 d     y            1
# 2 e     y            2
# 3 a     r            7
# 4 a     s            8
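The two-step approach above can also be written as one filter() call by joining the general rule and the exceptions with |. This is just a sketch using the same (family, species) pairs as the answer above, (r, 7) and (s, 8), which were chosen so that they match rows in the example data; the name kept is made up for the example.

```r
library(dplyr)

# Example data from the question
fish <- tibble(
  order   = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  family  = c("r", "s", "t", "r", "y", "y", "y", "u", "y", "u", "y"),
  species = c(7, 8, 9, 6, 5, 4, 3, 10, 1, 11, 2)
)

order_filter  <- c("a", "b", "c")
family_filter <- c("r", "s", "t", "u")

# Keep a row if it passes the general rule OR is one of the listed exceptions
kept <- fish %>%
  filter(
    (!order %in% order_filter & !family %in% family_filter) |
      (family == "r" & species == 7) |
      (family == "s" & species == 8)
  )
```

Unlike bind_rows(), this keeps the rows in their original order, so the exceptions are not moved to the bottom of the result.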

Random stratified sampling with different proportions

I am trying to split a dataset 80/20 into training and testing sets. I want to split by location, which is a factor with 4 levels, but the levels have not been sampled equally. Out of 1892 samples:
Location1: 172
Location2: 615
Location3: 603
Location4: 502
I am trying to split the whole dataset 80/20, as mentioned above, but I also want each location to be split 80/20 so that I get an even proportion from each location in the training and testing sets. I've seen one post about this using the stratified() function from the splitstackshape package, but it doesn't seem to want to split my factors up.
Here is a simplified reproducible example:
library(splitstackshape)
x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)
validIndex <- stratified(df, "xx", size=16/nrow(df))
valid <- df[-validIndex,]
train <- df[validIndex,]
where A, B, C, D correspond to the factor levels, in approximately the same proportions as in the actual dataset (~10, 32, 32, and 26%, respectively)
Setting bothSets=TRUE should return a list containing the split of the original data frame into the sampled rows (here the 80% training set) and the remaining rows (the validation set), whose union should be the original data frame:
splt <- stratified(df, "xx", size=16/nrow(df), replace=FALSE, bothSets=TRUE)
train <- splt[[1]]
valid <- splt[[2]]
## check
df2 <- as.data.frame(do.call("rbind",splt))
all.equal(df[with(df, order(xx, x)), ],
df2[with(df2, order(xx, x)), ],
check.names=FALSE)
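If splitstackshape is not available, a per-level 80/20 split can be sketched in base R by sampling row indices within each level of xx. The seed value and the names train_idx, train and valid are assumptions made for the example. Because the per-level sample size is rounded, very small levels (such as A with only 2 rows here) can end up with no validation rows.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C",
        "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)

# For each level of xx, draw about 80% of that level's row indices
train_idx <- unlist(
  lapply(split(seq_len(nrow(df)), df$xx), function(rows) {
    n_train <- round(0.8 * length(rows))
    rows[sample.int(length(rows), size = n_train)]
  }),
  use.names = FALSE
)

train <- df[train_idx, ]   # sampled rows
valid <- df[-train_idx, ]  # everything that was not sampled
```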
