R: Missing data in a table, complete it by referencing partial matches to a "Reference" table

I have two tables; "Reference" and "TableA".
TableA is an incomplete table. I would like to turn it into a "complete" table by referencing the "Reference" table, filling in missing values, and adding rows where multiple matches are found.
A reproducible example of "Reference" and "TableA" is below:
A <- c(1,1,1,2,4,4,5,5,7,6,2,1)
B <- c(1,2,2,2,4,4,9,5,8,6,2,9)
C <- c(1,1,3,3,4,5,5,5,7,6,3,3)
D <- c(1,2,1,1,2,1,2,1,2,2,2,1)
Reference <- data.frame(A,B,C,D)
A <- c(NA,1,5,2,4,1)
B <- c(NA,2,NA,2,NA,1)
C <- c(3,NA,5,NA,NA,1)
D <- c(1,1,2,2,1,1)
TableA <- data.frame(A,B,C,D)
I have attempted to resolve this by doing the following:
for (i in 1:dim(TableA)[1])
{
tmp<-TableA[i,]
repet<-ifelse(is.na(TableA$D[i]), Reference, 1 )
for (j in 1:repet) {
tmp$D<-ifelse(repet>1, Reference$D[j,], tmp$D)
collector<-rbind(collector, tmp)
}
}
collector
However, this solution will return the entirety of Reference$D, but I would only like to return those records from Reference$D whose columns A,B,C match (or partially match) what is on TableA.
For example, in Row 1 of TableA, I would like to replace Row 1 with the Reference table's rows 3,4, and 12.
Expected output below.
Note that the Reference table combination 1,2,3,1 appears twice on the expected output as it is a match for both rows 1 & 2 of TableA.
A B C D
1 2 3 1
2 2 3 1
1 9 3 1
1 2 3 1
5 9 5 2
2 2 3 2
4 4 5 1
1 1 1 1

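I'll first create an extra column "string" in both TableA and Reference. In TableA, each NA is replaced with a dot (.), which acts as a single-character wildcard in regex matching.
Then find out which strings in TableA match which strings in Reference, and store the result in a logical matrix (one row per Reference row, one column per TableA row).
Finally, repeat each row number of lgl_matrix by its number of matches, and use those row numbers as an index into Reference. Note that the result comes out in Reference order rather than TableA order.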
library(tidyverse)
TableA <- TableA %>%
mutate(across(A:D, ~ replace_na(as.character(.x), "."))) %>%
rowwise() %>%
mutate(string = paste0(c_across(A:D), collapse = ""))
Reference <- Reference %>%
rowwise() %>%
mutate(string = paste0(c_across(A:D), collapse = ""))
lgl_matrix <- sapply(TableA$string, grepl, x = Reference$string)
Reference[rep(1:nrow(lgl_matrix), rowSums(lgl_matrix)), -5]
# A tibble: 8 x 4
# Rowwise:
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 3 1
3 1 2 3 1
4 2 2 3 1
5 4 4 5 1
6 5 9 5 2
7 2 2 3 2
8 1 9 3 1
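The partial-matching idea can also be sketched in base R without building strings. This is only a sketch under the assumption that an NA in TableA means "match anything" in that column; the helper name match_rows and the result name completed are made up for illustration. Comparing column by column avoids relying on every value being a single digit, and the rows come out in TableA order, as in the expected output.

```r
Reference <- data.frame(
  A = c(1,1,1,2,4,4,5,5,7,6,2,1),
  B = c(1,2,2,2,4,4,9,5,8,6,2,9),
  C = c(1,1,3,3,4,5,5,5,7,6,3,3),
  D = c(1,2,1,1,2,1,2,1,2,2,2,1)
)
TableA <- data.frame(
  A = c(NA,1,5,2,4,1),
  B = c(NA,2,NA,2,NA,1),
  C = c(3,NA,5,NA,NA,1),
  D = c(1,1,2,2,1,1)
)

# Hypothetical helper: for row i of TableA, return the Reference rows
# that agree with it on every column that is not NA
match_rows <- function(i) {
  row <- TableA[i, ]
  known <- names(row)[!is.na(unlist(row))]
  hit <- rep(TRUE, nrow(Reference))
  for (col in known) {
    hit <- hit & Reference[[col]] == row[[col]]
  }
  Reference[hit, ]
}

# Stack the matches for each TableA row, in TableA order
completed <- do.call(rbind, lapply(seq_len(nrow(TableA)), match_rows))
rownames(completed) <- NULL
completed
```

A TableA row with no match simply contributes zero rows here; whether that is the desired behaviour is not stated in the question.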

Related

Filtering of dataframe columns displaying a counter intuitive behavior (R)

Take as an example the dataframe below. I need to change the dataframe by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched dataframe, since filter contains all of test's columns. Why does this happen, and how can I write more reliable code that does not return empty dataframes in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping the condition in which() and negating it with - only works when which() returns at least one index:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Negating integer(0) still gives integer(0), so instead of "drop nothing" the index means "select nothing":
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
I think you can simplify the column selection by subsetting the dataframe with a character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1
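To make the difference concrete, here is a small sketch that wraps the logical-index approach in a function. The name keep_cols is made up for illustration. A logical index always has one entry per column, so the "nothing to drop" case needs no special handling, and drop = FALSE keeps the result a data frame even when a single column is selected. Unlike test[filter], names in the filter that do not exist in the data frame are silently ignored rather than raising an error; which of the two behaviours is preferable depends on the use case.

```r
test <- data.frame(A = c(1,6,1,2,3), B = c(1,2,1,1,2),
                   C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter  <- c("A", "B", "C", "D")
filter2 <- c("A", "B", "D")

# Hypothetical helper: keep only the columns whose names appear in cols
keep_cols <- function(df, cols) {
  df[, names(df) %in% cols, drop = FALSE]
}

all_kept  <- keep_cols(test, filter)   # all four columns survive
some_kept <- keep_cols(test, filter2)  # column C is dropped
```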

Calculate rowwise maximum from columns that have changing names

I have the following objects:
s1 = "1_1_1_1_1"
s2 = "2_1_1_1_1"
s3 = "3_1_1_1_1"
Please note that the value of s1, s2, s3 can change in another example.
I then have the following data frame:
set.seed(666)
df = data.frame(draw = c(1,2,3,4,1,2,3,4,1,2,3,4),
                resp = c(1,1,1,1,2,2,2,2,3,3,3,3),
                "1_1_1_1_1" = runif(12),
                "2_1_1_1_1" = runif(12),
                "3_1_1_1_1" = runif(12))
Please note that the column names of my data frame will change based on the values of s1, s2, s3.
I now want to achieve the following:
1. I want to find out which of the last three columns in df has the highest value and store it as a value in a new column (values are supposed to be 1, 2 or 3, depending on whether the highest value is in the first, second or third of these variables).
2. Now that I know which value is the highest per row, I want to group/summarize the result by the column resp and count how often my max value is 1, 2 or 3.
So the outcome from 1. should be:
draw resp 1_1_1_1_1 2_1_1_1_1 3_1_1_1_1 max
1 1 0.774 0.095 0.806 3
2 1 0.197 0.142 0.266 3
...
And the outcome from 2. is supposed to be:
resp first_max second_max third_max
1 1 1 2
2 2 1 1
3 1 2 1
My problem is that tidyverse's rowwise function is deprecated and that I don't know how I can dynamically address columns in a tidyverse pipe by column names which are stored externally (here in s1, s2, s3). One last note: I might be overcomplicating things by trying to go by the column names when, in fact, the columns that I'm interested in are always at column positions 3:5.
Here is one way to get what you want. For a slightly different format, you can use count rather than table, but this matches your expected output. Hope this helps!!
library(dplyr)
df %>%
mutate(max_val = max.col(select(., starts_with("X")))) %>%
select(resp, max_val) %>%
table()
max_val
resp 1 2 3
1 1 1 2
2 2 1 1
3 1 2 1
Or, you could do this (spread() comes from tidyr):
library(tidyr)
df %>%
mutate(max_val = max.col(.[3:5])) %>%
count(resp, max_val) %>%
mutate(max_val = paste0("max_", max_val)) %>%
spread(value = n, key = max_val)
resp max_1 max_2 max_3
<dbl> <int> <int> <int>
1 1 1 1 2
2 2 2 1 1
3 3 1 2 1
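Calculate the max using pmap (row-wise iteration):
library(purrr)   # pmap_dbl()
library(tibble)  # add_column()
max_cols <- pmap_dbl(unname(df), function(x, y, ...) {
  # x and y absorb draw and resp; the remaining columns arrive via ...
  vals <- unlist(list(...))
  which.max(vals)  # index of the first maximum, so a tie cannot return several values
})
result <- df %>% add_column(max = max_cols)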
> result
draw resp X1_1_1_1_1 X2_1_1_1_1 X3_1_1_1_1 max
1 1 1 0.4551478 0.70061232 0.618439890 2
2 2 1 0.3667764 0.26670969 0.024742605 1
3 3 1 0.6806912 0.03233215 0.004014758 1
4 4 1 0.9117449 0.42926492 0.885247456 1
5 1 2 0.1886954 0.34189707 0.985054492 3
6 2 2 0.5569398 0.78043504 0.100714130 2
7 3 2 0.9791164 0.92823982 0.676584495 1
8 4 2 0.9174654 0.74627116 0.485582287 1
9 1 3 0.3681890 0.69622331 0.672346875 2
10 2 3 0.5510356 0.99651637 0.482430518 2
11 3 3 0.4283281 0.12832611 0.018095649 1
12 4 3 0.6168436 0.64381995 0.655178701 3
Reshape the data frame.
reshape2::dcast(result,resp~max,fun.aggregate = length,value.var = "max")
resp 1 2 3
1 1 1 1 2
2 2 2 1 1
3 3 1 2 1
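The answers above select the columns by position or by the "X" prefix that data.frame() adds to non-syntactic names. As a sketch of addressing the columns through the externally stored names instead, the example below builds the data frame so that the names in s1, s2, s3 are kept as they are, then indexes with a character vector. The variable names cols and counts are made up for illustration, and ties.method = "first" is chosen here only to make the result deterministic.

```r
s1 <- "1_1_1_1_1"
s2 <- "2_1_1_1_1"
s3 <- "3_1_1_1_1"

set.seed(666)
df <- data.frame(draw = rep(1:4, times = 3), resp = rep(1:3, each = 4))
# Assigning with [[ keeps the names exactly as stored in s1, s2, s3
df[[s1]] <- runif(12)
df[[s2]] <- runif(12)
df[[s3]] <- runif(12)

# Address the columns through the stored names
cols <- c(s1, s2, s3)
df$max <- max.col(as.matrix(df[cols]), ties.method = "first")

# Count how often each column holds the row maximum, per resp
counts <- table(resp = df$resp, max = df$max)
counts
```

If the names change in another run, only s1, s2 and s3 need to be updated.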

create id variable from table of duplicates

I have a dataframe where each row has a unique identifier, but some rows are actually duplicates.
fdf <- data.frame(name = c("fred", "ferd", "frad", 'eric', "eirc", "george"),
id = 1:6)
fdf
#> name id
#> 1 fred 1
#> 2 ferd 2
#> 3 frad 3
#> 4 eric 4
#> 5 eirc 5
#> 6 george 6
I have determined which rows are duplicated and this information is stored in a second dataframe as pairs of the unique id's. So the key tells me row 1 is the same individual as rows 2 and 3, etc.
key <- data.frame(id1 = c(1,1,2,4), id2 = c(2,3,3,5))
key
#> id1 id2
#> 1 1 2
#> 2 1 3
#> 3 2 3
#> 4 4 5
I'm struggling to think up a straightforward way to use the key to create an id variable in my original dataframe. Desired output would be:
fdf$realid <- c(1,1,1,2,2,3)
fdf
#> name id realid
#> 1 fred 1 1
#> 2 ferd 2 1
#> 3 frad 3 1
#> 4 eric 4 2
#> 5 eirc 5 2
#> 6 george 6 3
Edit for clarity
Keys here are the set of true connections between rows in the data.frame fdf. Thus you can imagine starting with the set of all feasible connections:
# id1 id2
# 1 2
# 1 3
# 1 4
# ...
# 6 4
# 6 5
determining which are true connections (based on the other variables in each observation).
# id1 id2 match
# 1 2 match
# 1 3 no match
# 1 4 match
# ...
# 6 4 no match
# 6 5 no match
and sub-setting to the cases that are matches.
The easiest way would be to recreate the key data frame in the following format (i.e. which id belongs to which realid):
key_table <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                        realid = c(1, 1, 1, 2, 2, 3))
Then it is just a matter of merging fdf and key_table together with merge:
fdf <- merge(fdf, key_table, by = "id")
fdf
id name realid
1 1 fred 1
2 2 ferd 1
3 3 frad 1
4 4 eric 2
5 5 eirc 2
6 6 george 3
I didn't find a 'straight forward way', but it seems to work well.
First you check which IDs are together in a group, by checking whether there's 'overlap', i.e. whether the intersection between two rows in key is non-empty:
check_overlap <- function(pair1, pair2){
newset <- intersect(pair1, pair2)
length(newset) != 0
}
Then we can apply this function to the rows in key against the other rows. If a row has been matched already, it is automatically removed from key, like this:
check_overlaps <- function(key){
cont <- data.frame()
i <- 1
while(nrow(key) > 0){
ids <- apply(key, 1, check_overlap, key[1, ])
vals <- unique(unlist(key[ids, ]))
key <- key[!ids, ]
cont <- rbind(cont, cbind(vals, rep(i, length(vals))))
i <- i+1
}
return(cont)
}
new_ids <- check_overlaps(key)
# vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
The problem with merging fdf and new_ids, however, is that some old IDs may not occur in key, but they should be mapped to a new ID according to the new order. You can manipulate key a bit a priori and do:
for(val in unique(fdf$id)){
if(!(val %in% unlist(key))){
key <- rbind(key, c(val, val))
}
}
new_ids2 <- check_overlaps(key)
vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 3
Which is easy to merge with fdf like:
merge(fdf, new_ids2, by.x = "id", by.y = "vals")
id name V2
# 1 1 fred 1
# 2 2 ferd 1
# 3 3 frad 1
# 4 4 eric 2
# 5 5 eirc 2
# 6 6 george 3
If I understand your question correctly it can be solved by creating groups of matching ids and creating a new (real) id out of these groups:
# determine the groups of ids
id_groups <- list()
i = 1
for (id in unique(key$id1)) {
if (!(id %in% unlist(id_groups))) {
id_groups[[i]] <- c(id, key$id2[key$id1 == id])
i = i + 1
}
}
# add ids without match
id_groups <- c(id_groups, setdiff(fdf$id, unlist(id_groups)))
# for every id in fdf, set real_id to index in id_groups to which id belongs
fdf$real_id <- sapply(fdf$id, function(id) {
which(sapply(id_groups, function(group) id %in% group))
})
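Both loop-based answers rely on how the pairs happen to be listed in key. As a sketch that does not, the grouping can be treated as finding connected components: every id starts in its own group, and each pair in key pulls its two ids into the same group until nothing changes. The variable name label is made up for illustration, and the example assumes that every id in key also appears in fdf$id.

```r
fdf <- data.frame(name = c("fred", "ferd", "frad", "eric", "eirc", "george"),
                  id = 1:6)
key <- data.frame(id1 = c(1, 1, 2, 4), id2 = c(2, 3, 3, 5))

# Start with every id labelled by itself (names are the ids as strings)
label <- setNames(fdf$id, fdf$id)

# Propagate the smallest label along each pair until no label changes
repeat {
  changed <- FALSE
  for (k in seq_len(nrow(key))) {
    a <- as.character(key$id1[k])
    b <- as.character(key$id2[k])
    lo <- min(label[a], label[b])
    if (label[a] != lo || label[b] != lo) {
      label[c(a, b)] <- lo
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Renumber the surviving labels as 1, 2, 3, ... in order of appearance
fdf$realid <- match(label[as.character(fdf$id)], unique(label))
fdf
```

For the example data this gives realid 1, 1, 1, 2, 2, 3. Ids that never appear in key, such as george, keep their own label and therefore get their own group.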

Replacing the values from another data frame based on the information in the first column in R

I'm trying to merge information from two different data frames, but the problem is that they have different dimensions and that I want to match on the information in a column rather than on the column index. The merge function in R and the joins in dplyr don't work with my data.
I have two dataframes (one is a subset of the other, with updated info in the last column):
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"))
Name val Case
1 A 1 NA
2 B 2 1
3 C 3 NA
4 D 1 NA
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 NA
9 I 3 NA
Some rows in the Case column in df1 have to be changed with the info in the df2 below:
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1")
Name val Case
1 A 1 1
2 D 2 1
3 H 3 1
So there's nothing important in the val column, however I added it into the examples since I want to indicate that I have more columns than two and also my real data is way bigger than the examples.
Basically, I want to change specific rows by checking the information in the first columns (in this case, they're unique letters) and in the end I still want to have df1 as a final data frame.
for a better explanation, I want to see something like this:
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Note changed information for A,D and H.
Thanks.
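%in% and match() from base R come to the rescue.
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"), stringsAsFactors = FALSE)
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1", stringsAsFactors = FALSE)
# match() gives, for each df1 row, the position of its Name in df2 (NA if absent),
# so each row picks up the Case of its own match instead of relying on recycling
idx <- match(df1$Name, df2$Name)
df1$Case <- ifelse(df1$Name %in% df2$Name, df2$Case[idx], df1$Case)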
df1
Output:
> df1
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Here is what I would do using dplyr (val is taken from df1, as in the expected output):
library(dplyr)
df1 %>%
  left_join(df2, by = c("Name")) %>%
  mutate(val = val.x,
         Case = if_else(is.na(Case.y), Case.x, Case.y)) %>%
  select(Name, val, Case)

Subsetting a Data Table using %in%

A stylized version of my data.table is
outmat <- data.table(merge(merge(1:5, 1:5, all=TRUE), 1:5, all=TRUE))
What I would like to do is select a subset of rows from this data.table based on whether the value in the 1st column is found in any of the other columns. It will be handling matrices of unknown dimension, so I can't just use some sort of "row1 == row2 | row1 == row3".
I wanted to do this using
output[row1 %in% names(output)[-1], ]
but this ends up returning TRUE if the value in row1 is found in any of the rows of row2 or row3, which is not the intended behavior. Is there some sort of vectorized version of %in% that will achieve my desired result?
To elaborate, what I want to get is the enumeration of 3-tuples from the set 1:5, drawn with replacement, such that the first value is the same as either the second or third value, something like:
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
...
2 1 2
2 2 1
...
5 5 5
What my code instead gives me is every enumeration of 3-tuples, as it is checking whether the first digit (say, 5) ever appears anywhere in the 2nd or 3rd columns, not simply within the same row.
One option is to construct the expression and evaluate it:
dt = data.table(a = 1:5, b = c(1,2,4,3,1), c = c(4,2,3,2,2), d = 5:1)
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
#4: 4 3 2 2
#5: 5 1 2 1
expr = paste(paste(names(dt)[-1], collapse = paste0(" == ", names(dt)[1], " | ")),
"==", names(dt)[1])
#[1] "b == a | c == a | d == a"
dt[eval(parse(text = expr))]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
Another option is to just loop through and compare the columns:
dt[rowSums(sapply(dt, '==', dt[[1]])) > 1]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
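A dplyr/tidyr option is to reshape to long format and compare within each row (a is the first column of dt here):
library(dplyr)
library(tidyr)
dt_id <- dt %>%
  mutate(ID = 1:n())                  # row identifier to join back on
dt_id %>%
  gather(variable, value, -a, -ID) %>%
  filter(a == value) %>%              # keep rows where any other column equals a
  select(ID) %>%
  distinct() %>%
  left_join(dt_id, by = "ID")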
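The same within-row comparison can be sketched without building an expression string, by comparing each remaining column with the first and combining the results with Reduce. To keep the sketch self-contained it uses a plain data.frame with the same values as dt; that substitution, and the variable names keep and result, are assumptions made for illustration.

```r
dt <- data.frame(a = 1:5, b = c(1, 2, 4, 3, 1), c = c(4, 2, 3, 2, 2), d = 5:1)

# One logical vector per remaining column: does it equal the first column?
same_as_first <- lapply(dt[-1], `==`, dt[[1]])

# Combine them with a row-wise OR, however many columns there are
keep <- Reduce(`|`, same_as_first)

result <- dt[keep, ]
result
# keeps rows 1, 2 and 3, as in the answers above
```

Because the comparison is done position by position, a value that appears only in a different row of another column does not count as a match.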
