Subsetting a Data Table using %in% - r

A stylized version of my data.table is
outmat <- data.table(merge(merge(1:5, 1:5, all=TRUE), 1:5, all=TRUE))
What I would like to do is select a subset of rows from this data.table based on whether the value in the 1st column is found in any of the other columns of the same row (it will be handling matrices of unknown dimension, so I can't just use some sort of "row1 == row2 | row1 == row3").
I wanted to do this using
output[row1 %in% names(output)[-1], ]
but this ends up returning TRUE if the value in row1 is found in any of the rows of row2 or row3, which is not the intended behavior. Is there some sort of vectorized version of %in% that will achieve my desired result?
To elaborate, what I want to get is the enumeration of 3-tuples from the set 1:5, drawn with replacement, such that the first value is the same as either the second or third value, something like:
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
...
2 1 2
2 2 1
...
5 5 5
What my code instead gives me is every enumeration of 3-tuples, as it is checking whether the first digit (say, 5) ever appears anywhere in the 2nd or 3rd columns, not simply within the same row.

One option is to construct the expression and evaluate it:
dt = data.table(a = 1:5, b = c(1,2,4,3,1), c = c(4,2,3,2,2), d = 5:1)
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
#4: 4 3 2 2
#5: 5 1 2 1
expr = paste(paste(names(dt)[-1], collapse = paste0(" == ", names(dt)[1], " | ")),
"==", names(dt)[1])
#[1] "b == a | c == a | d == a"
dt[eval(parse(text = expr))]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
Another option is to just loop through and compare the columns:
dt[rowSums(sapply(dt, '==', dt[[1]])) > 1]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
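For the grid described in the question, the same row-wise comparison can be sketched in base R without building an expression string. This is only an illustration: it assumes a plain data.frame built with expand.grid as a stand-in for the question's data.table, and the names grid, first and keep are made up for the example.

```r
# Assumption: a plain data.frame stand-in for the question's table
grid <- expand.grid(x = 1:5, y = 1:5, z = 1:5)

# Values of the first column, compared row by row against every other column
first <- grid[[1]]

# One logical vector per remaining column, combined with an element-wise OR
keep <- Reduce(`|`, lapply(grid[-1], function(col) col == first))

result <- grid[keep, ]
nrow(result)
# 45
```

Because == compares element by element, each row is tested only against its own values, which is the behaviour that %in% does not provide. The same code works for any number of columns.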

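library(dplyr)
library(tidyr)
# first_column stands for the name of the first column (a in the dt above)
dt_id <- dt %>%
mutate(ID = 1:n())
dt_id %>%
gather(variable, value, -first_column, -ID) %>%
filter(first_column == value) %>%
select(ID) %>%
distinct %>%
left_join(dt_id, by = "ID")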

Related

R: Missing data on table, complete it by referencing partial matches to a "Reference" table

I have two tables; "Reference" and "TableA".
I am looking through TableA, which is an incomplete table, and would like to turn it into a "complete" table by referencing the "Reference" table, filling in missing values, and/or adding rows where multiple matches are found.
Reproducible example of "Reference" and "TableA" are below:
A <- c(1,1,1,2,4,4,5,5,7,6,2,1)
B <- c(1,2,2,2,4,4,9,5,8,6,2,9)
C <- c(1,1,3,3,4,5,5,5,7,6,3,3)
D <- c(1,2,1,1,2,1,2,1,2,2,2,1)
Reference <- data.frame(A,B,C,D)
A <- c(NA,1,5,2,4,1)
B <- c(NA,2,NA,2,NA,1)
C <- c(3,NA,5,NA,NA,1)
D <- c(1,1,2,2,1,1)
TableA <- data.frame(A,B,C,D)
I have attempted to resolve this by doing the following:
for (i in 1:dim(TableA)[1])
{
tmp<-TableA[i,]
repet<-ifelse(is.na(TableA$D[i]), Reference, 1 )
for (j in 1:repet) {
tmp$D<-ifelse(repet>1, Reference$D[j,], tmp$D)
collector<-rbind(collector, tmp)
}
}
collector
However, this solution will return the entirety of Reference$D, but I would only like to return those records from Reference$D whose columns A,B,C match (or partially match) what is on TableA.
For example, in Row 1 of TableA, I would like to replace Row 1 with the Reference table's rows 3,4, and 12.
Expected output below.
Note that the Reference table combination 1,2,3,1 appears twice on the expected output as it is a match for both rows 1 & 2 of TableA.
A B C D
1 2 3 1
2 2 3 1
1 9 3 1
1 2 3 1
5 9 5 2
2 2 3 2
4 4 5 1
1 1 1 1
I'll first create an extra column "string" in both TableA and Reference, with NA replaced with a dot . in TableA, which would be used in regex matching.
Then find out which string in TableA appeared in Reference, and store them in a matrix.
Finally, replicate the lgl_matrix row number by the number of matches, and use those row numbers as index in Reference.
library(tidyverse)
TableA <- TableA %>%
mutate(across(A:D, ~ replace_na(as.character(.x), "."))) %>%
rowwise() %>%
mutate(string = paste0(c_across(A:D), collapse = ""))
Reference <- Reference %>%
rowwise() %>%
mutate(string = paste0(c_across(A:D), collapse = ""))
lgl_matrix <- sapply(TableA$string, grepl, x = Reference$string)
Reference[rep(1:nrow(lgl_matrix), rowSums(lgl_matrix)), -5]
# A tibble: 8 x 4
# Rowwise:
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 3 1
3 1 2 3 1
4 2 2 3 1
5 4 4 5 1
6 5 9 5 2
7 2 2 3 2
8 1 9 3 1
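As an alternative to the string matching above, the lookup can be sketched in base R by comparing only the columns that are not NA in each TableA row. This is a hedged sketch rather than a definitive solution: match_rows, ref_mat, idx and completed are names invented for the example, and it assumes all columns are numeric as in the question. It avoids pasting values into strings, which could become ambiguous if values had more than one digit.

```r
Reference <- data.frame(
  A = c(1, 1, 1, 2, 4, 4, 5, 5, 7, 6, 2, 1),
  B = c(1, 2, 2, 2, 4, 4, 9, 5, 8, 6, 2, 9),
  C = c(1, 1, 3, 3, 4, 5, 5, 5, 7, 6, 3, 3),
  D = c(1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1)
)
TableA <- data.frame(
  A = c(NA, 1, 5, 2, 4, 1),
  B = c(NA, 2, NA, 2, NA, 1),
  C = c(3, NA, 5, NA, NA, 1),
  D = c(1, 1, 2, 2, 1, 1)
)

ref_mat <- as.matrix(Reference)

# Hypothetical helper: indices of Reference rows that agree with `row`
# on every column where `row` is not NA
match_rows <- function(row) {
  hits <- rep(TRUE, nrow(ref_mat))
  for (j in which(!is.na(row))) {
    hits <- hits & ref_mat[, j] == row[j]
  }
  which(hits)
}

# Collect the matches for each TableA row, in TableA order
idx <- unlist(lapply(seq_len(nrow(TableA)),
                     function(i) match_rows(unlist(TableA[i, ]))))
completed <- Reference[idx, ]
idx
# 3 4 12 3 7 11 6 1
```

The rows come back in TableA order, so Reference row 3 (1, 2, 3, 1) appears twice, once for each of the first two TableA rows, as in the expected output.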

Adding an index column representing a repetition of a dataframe in R

I have a dataframe in R that I'd like to repeat several times, and I want to add in a new variable to index those repetitions. The best I've come up with is using mutate + rbind over and over, and I feel like there has to be an efficient dataframe method I could be using here.
Here's an example: df <- data.frame(x = 1:3, y = letters[1:3]) gives us the dataframe
x y
1 a
2 b
3 c
I'd like to repeat that say 3 times, with an index that looks like this:
x y index
1 a 1
2 b 1
3 c 1
1 a 2
2 b 2
3 c 2
1 a 3
2 b 3
3 c 3
Using the rep function, I can get the first two columns, but not the index column. The best I've come up with so far (using dplyr) is:
df2 <-
df %>%
mutate(index = 1) %>%
rbind(df %>% mutate(index = 2)) %>%
rbind(df %>% mutate(index = 3))
This obviously doesn't work if I need to repeat my dataframe more than a handful of times. It feels like the kind of thing that should be easy to do using dataframe methods, but I haven't been able to find anything.
Grateful for any tips!
You can use this code for as many copies of the data frame as you would like. You just have to set the n argument.
The replicate function takes two main arguments: n, the number of times we would like to reproduce our data set, and expr, the data set itself. The result is a list whose elements are copies of our data set.
After that we pass the list to the imap_dfr function from the purrr package to define a unique id for each copy. .x represents each element of the list (here a data frame) and .y is the position of that element in the list. So, for example, the id column of the first copy is set to 1 because .y is equal to 1 for it, and so on.
library(dplyr)
library(purrr)
replicate(3, df, simplify = FALSE) %>%
imap_dfr(~ .x %>%
mutate(id = .y))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
In base R you can use the following code:
do.call(rbind,
mapply(function(x, z) {
x$id <- z
x
}, replicate(3, df, simplify = FALSE), 1:3, SIMPLIFY = FALSE))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
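You can use rerun to repeat the dataframe n times and add an index column using bind_rows. Note that rerun() is deprecated as of purrr 1.0.0; replicate(n, df, simplify = FALSE) builds the same list: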
library(dplyr)
library(purrr)
n <- 3
df <- data.frame(x = 1:3, y = letters[1:3])
bind_rows(rerun(n, df), .id = 'index')
# index x y
#1 1 1 a
#2 1 2 b
#3 1 3 c
#4 2 1 a
#5 2 2 b
#6 2 3 c
#7 3 1 a
#8 3 2 b
#9 3 3 c
In base R, we can repeat the row index 3 times.
transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
One more way
n <- 3
map_dfr(seq_len(n), ~ df %>% mutate(index = .x))
x y index
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
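Another way to look at the problem is as a cross join between the data frame and a one-column table of index values. The sketch below uses base R's merge with by = NULL, which returns the Cartesian product of its two inputs; idx and out are names made up for the example.

```r
df <- data.frame(x = 1:3, y = letters[1:3])
n <- 3

# One row per repetition; merging with by = NULL pairs every row of df
# with every index value (a Cartesian product)
idx <- data.frame(index = seq_len(n))
out <- merge(df, idx, by = NULL)
out
```

Putting df first makes its rows cycle fastest, so the result is ordered as repeated blocks of df, each block carrying a single index value.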

How to sort a dataframe in decreasing order using lapply and sort in R

Not sure if this is a duplicate but I couldn't find anything that either solves my original problem or the issue I'm running into with the partial I did find.
The goal is to sort a dataframe independently by column.
Reproducible example
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a
name date1 date2 date3
1 a 2 0 0
2 a 3 2 2
3 a 1 3 0
4 b 3 1 3
5 b 1 2 2
6 b 2 0 1
library(plyr)
b <- ddply(a, "name", function(x) { as.data.frame(lapply(x, sort)) })
b
name date1 date2 date3
1 a 1 0 0
2 a 2 2 0
3 a 3 3 2
4 b 1 0 1
5 b 2 1 2
6 b 3 2 3
Now this works as expected, but is the opposite of what I'm looking to do.
Desired output
b
name date1 date2 date3
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
I've tried to add in the decreasing=T parameter but haven't had any luck with the variations I've tried and usually end up with an error about missing arguments or undefined columns being selected. How does one correctly implement a decreasing sort with this syntax and/or otherwise achieve the end result without relying on explicitly naming the columns (the names are dates, so they change often)?
Bonus
How could this code be adapted to account for NA's with na.last
Thank you!
I think your code scrambles the data.frame rows, because every column is sorted independently of the others, which is not very good practice. In standard dplyr you would use the arrange() function like this:
library(tidyverse)
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a %>%
arrange(name,-date1)
If you want to live a dangerous life here is the code for it
a %>%
group_by(name) %>%
mutate_all(sort,decreasing = TRUE)
name date1 date2 date3
<fct> <dbl> <dbl> <dbl>
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
A solution with the data.table package is the following
library(data.table)
a <- data.table(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# alternatively:
# a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# setDT(a)
b <- a[, lapply(.SD, sort, decreasing = TRUE), by = name]
.SD returns the subset of data, in this case created with the by = name. It splits the original data.table by the values in the given column.
This also covers your bonus requirement, since na.last can be passed through to sort:
aa <- data.table(name = c("a","a","a","b","b","b"),date1 = c(NA,3,1,3,1,NA),date2 = c(0,2,NA,1,2,0),date3 = c(0,2,0,3,2,NA))
bb <- aa[, lapply(.SD, sort, decreasing = TRUE, na.last = TRUE), by = name]
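For readers without data.table, the same group-wise, column-wise sort can be sketched in base R with split and lapply. This is an illustrative sketch only; sort_cols_desc is a hypothetical helper name, and na.last = TRUE is passed so that any NA values would be kept and placed at the end of each column.

```r
a <- data.frame(name  = c("a", "a", "a", "b", "b", "b"),
                date1 = c(2, 3, 1, 3, 1, 2),
                date2 = c(0, 2, 3, 1, 2, 0),
                date3 = c(0, 2, 0, 3, 2, 1))

# Hypothetical helper: sort every column of one group independently,
# largest value first, keeping NAs at the end
sort_cols_desc <- function(d) {
  d[] <- lapply(d, sort, decreasing = TRUE, na.last = TRUE)
  d
}

# Split by group, sort each piece, and stack the pieces again
b <- do.call(rbind, lapply(split(a, a$name), sort_cols_desc))
rownames(b) <- NULL
b
```

No column names other than the grouping column are spelled out, so the code keeps working when the date columns change.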

R: Collapse duplicated values in a column while keeping the order

I'm sure this is super simple but just can't find the answer. I have a data frame like so
Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A
And I'd like to group by Id and collapse the distinct event values while keeping the event order like so
Id event
1 1 A
2 1 B
3 1 A
4 2 C
5 2 A
Most of my searches end up with using the distinct() or unique() functions, but that leads to losing the A event in row 3 for Id 1.
Thanks in advance!
We can use lead to compare each row with the next one and keep the rows that differ from the row after them (that is, the last row of each run). The is.na(lead(Id)) condition also keeps the last row, for which lead returns NA.
library(dplyr)
dat2 <- dat %>%
filter(!(Id == lead(Id) & event == lead(event)) | is.na(lead(Id)))
dat2
# Id event
# 1 1 A
# 2 1 B
# 3 1 A
# 4 2 C
# 5 2 A
DATA
dat <- read.table(text = " Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header = TRUE, stringsAsFactors = FALSE)
You can just compare every row with the one after it.
df = read.table(text=" Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header=TRUE)
# append TRUE so that the last row is always kept explicitly
df[c(rowSums(df[-1,] == head(df, -1)) != 2, TRUE), ]
Id event
1 1 A
2 1 B
4 1 A
6 2 C
7 2 A
Here is a solution with data.table:
library("data.table")
dt <- fread(
" Id event
1 A
1 B
1 A
1 A
2 C
2 C
2 A")
unique(dt[, r:=rleidv(event), Id])[, -3]
# Id event
# 1: 1 A
# 2: 1 B
# 3: 1 A
# 4: 2 C
# 5: 2 A
or
# keep the first row of each run within every Id
dt[, .SD[!duplicated(rleidv(event))], by = Id]
(thx to @mt1022 for the comment)
A base R solution using tapply and rle:
x <- tapply(dat$event,dat$Id,function(x) rle(x)$values)
do.call(rbind,Map(data.frame,Id=names(x),event=x))
# Id event
# 1.1 1 A
# 1.2 1 B
# 1.3 1 A
# 2.1 2 C
# 2.2 2 A
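The run-length idea can also be written as a direct comparison of each row with the one before it. The following base R sketch keeps a row whenever Id or event changes relative to the previous row; the variable name changed is made up for the example.

```r
dat <- data.frame(Id = c(1, 1, 1, 1, 2, 2, 2),
                  event = c("A", "B", "A", "A", "C", "C", "A"),
                  stringsAsFactors = FALSE)
n <- nrow(dat)

# The first row always starts a run; every later row starts a new run
# when its Id or its event differs from the previous row
changed <- c(TRUE,
             dat$Id[-1] != dat$Id[-n] | dat$event[-1] != dat$event[-n])

dat[changed, ]
```

This keeps the first row of each run (original rows 1, 2, 3, 5 and 7), so the second A for Id 1 survives while the immediate repeats are dropped.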
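The distinct function removes duplicate rows, but note that it drops every repeated Id/event pair, not only consecutive repeats, so it loses the second A for Id 1 (row 3) that the question wants to keep: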
dat %>%
distinct(Id, event)

Replacing the values from another data frame based on the information in the first column in R

I'm trying to merge information from two different data frames, but the problem begins with their uneven dimensions and with wanting to match on the information in a column rather than on the column index. The merge function in R and the joins in dplyr don't work with my data.
I have two data frames (one is a subset of the other, with updated info in the last column):
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"))
Name val Case
1 A 1 NA
2 B 2 1
3 C 3 NA
4 D 1 NA
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 NA
9 I 3 NA
Some rows in the Case column in df1 have to be changed with the info in the df2 below:
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1")
Name val Case
1 A 1 1
2 D 2 1
3 H 3 1
So there's nothing important in the val column, however I added it into the examples since I want to indicate that I have more columns than two and also my real data is way bigger than the examples.
Basically, I want to change specific rows by checking the information in the first columns (in this case, they're unique letters) and in the end I still want to have df1 as a final data frame.
for a better explanation, I want to see something like this:
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Note changed information for A,D and H.
Thanks.
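%in% from base R comes to the rescue. Note that the ifelse line below relies on every replacement value in df2$Case being the same ("1"), because the shorter vector is recycled rather than matched by name.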
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"), stringsAsFactors = F)
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1", stringsAsFactors = F)
df1$Case <- ifelse(df1$Name %in% df2$Name, df2$Case[df2$Name %in% df1$Name], df1$Case)
df1
Output:
> df1
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
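If the replacement values in df2 can differ from row to row, a lookup by name is safer than relying on recycling. Here is a minimal base R sketch using match, which returns the position of each df1 name in df2 (or NA when the name is absent); idx and found are names introduced for the example.

```r
df1 <- data.frame(Name = LETTERS[1:9], val = rep(1:3, 3),
                  Case = c("NA", "1", "NA", "NA", "1", "NA", "1", "NA", "NA"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("A", "D", "H"), val = 1:3, Case = "1",
                  stringsAsFactors = FALSE)

# Position of each df1 name within df2; NA where df2 has no such name
idx <- match(df1$Name, df2$Name)
found <- !is.na(idx)

# Overwrite Case only for the names present in df2, each with its own value
df1$Case[found] <- df2$Case[idx[found]]
df1
```

Only the Case column is touched, so the val column of df1 is left as it was.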
Here is what I would do using dplyr:
library(dplyr)
df1 %>%
left_join(select(df2, Name, Case), by = "Name") %>%
# keep df1's val; take Case from df2 wherever a match exists
mutate(Case = if_else(is.na(Case.y), Case.x, Case.y)) %>%
select(Name, val, Case)
