Replace NAs with Row Minimum for Selected Columns - r

Suppose I have a dataframe with several types of columns (character, numeric, ID, time, etc.). I'll provide a simple example as follows:
m <- data.frame(LETTERS[1:10], LETTERS[15:24],runif(10),runif(10),runif(10),runif(10),runif(10))
x<-c("Col1","Col2","Col3","Col4","Col5","Col6","Col7")
colnames(m)<-x
m<-as.data.frame(lapply(m, function(x) x[ sample(c(TRUE, NA), prob = c(0.75, 0.25), size = length(x), replace = TRUE) ]))
> m
Col1 Col2 Col3 Col4 Col5 Col6 Col7
1 A O 0.09929126 0.40435352 0.15360830 0.03830400 0.80157985
2 B P 0.50314123 0.81725456 NA 0.07054851 0.65521042
3 C <NA> 0.75798665 NA 0.04483692 0.54671014 NA
4 D R 0.96825047 0.01875140 0.07383107 NA 0.04498563
5 <NA> S 0.47079716 0.04181401 0.21423046 NA 0.55493444
6 F <NA> NA NA NA 0.33702657 0.54989260
7 G U 0.71947656 NA NA 0.99142181 0.69548691
8 <NA> <NA> 0.90518907 0.20661633 0.65788523 0.05534330 0.78420756
9 I W 0.79208514 0.63233902 NA 0.72085080 NA
10 J X 0.39093317 0.97107464 NA 0.86417719 0.39890170
For Col3-Col7, if a row has fewer than 3 NAs, I want to replace them with that row's minimum over Col3-Col7; otherwise keep the NAs. So, I'd want the dataset to look as follows:
> m
Col1 Col2 Col3 Col4 Col5 Col6 Col7
1 A O 0.09929126 0.40435352 0.15360830 0.03830400 0.80157985
2 B P 0.50314123 0.81725456 0.07054851 0.07054851 0.65521042
3 C <NA> 0.75798665 0.04483692 0.04483692 0.54671014 0.04483692
4 D R 0.96825047 0.01875140 0.07383107 0.01875140 0.04498563
5 <NA> S 0.47079716 0.04181401 0.21423046 0.04181401 0.55493444
6 F <NA> NA NA NA 0.33702657 0.54989260
7 G U 0.71947656 0.69548691 0.69548691 0.99142181 0.69548691
8 <NA> <NA> 0.90518907 0.20661633 0.65788523 0.05534330 0.78420756
9 I W 0.79208514 0.63233902 0.63233902 0.72085080 0.63233902
10 J X 0.39093317 0.97107464 0.39093317 0.86417719 0.39890170
So every row except row 6 had its NAs imputed with the row minimum over columns 3-7.
In my actual dataset, for every row, if columns 18:27 contain fewer than 4 NAs, replace them with the row minimum over columns 18:27; otherwise keep all the NAs.
I've tried using the dplyr pipes/mutate/replace method, but I'm not sure how to do it for a subset of columns (I'm under the impression you can only specify one column with mutate/replace). The condition I've tried in the if statement is:
rowSums(is.na(.[18:27]))<4 & rowSums(is.na(.[18:27]))>0
I've seen the rowMins function in the matrixStats package, but I'm just wondering if I can do this with dplyr/dataframe and not matrices.

I would suggest a tidyverse approach: reshape the data to long format, group by Col1 and Col2, and then rebuild the wide data. Inside the pipeline, mutate() creates MinVal (the row minimum) and Flag (the number of NAs in the row), and the condition is then evaluated on those two variables. Note that grouping by Col1 and Col2 assumes that this pair identifies each row uniquely. The data and code follow:
library(tidyverse)
#Data
m <- structure(list(Col1 = c("A", "B", "C", "D", "<NA>", "F", "G",
"<NA>", "I", "J"), Col2 = c("O", "P", "<NA>", "R", "S", "<NA>",
"U", "<NA>", "W", "X"), Col3 = c(0.09929126, 0.50314123, 0.75798665,
0.96825047, 0.47079716, NA, 0.71947656, 0.90518907, 0.79208514,
0.39093317), Col4 = c(0.40435352, 0.81725456, NA, 0.0187514,
0.04181401, NA, NA, 0.20661633, 0.63233902, 0.97107464), Col5 = c(0.1536083,
NA, 0.04483692, 0.07383107, 0.21423046, NA, NA, 0.65788523, NA,
NA), Col6 = c(0.038304, 0.07054851, 0.54671014, NA, NA, 0.33702657,
0.99142181, 0.0553433, 0.7208508, 0.86417719), Col7 = c(0.80157985,
0.65521042, NA, 0.04498563, 0.55493444, 0.5498926, 0.69548691,
0.78420756, NA, 0.3989017)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
The code:
#Reshape
m %>% pivot_longer(cols = -c(Col1,Col2)) %>%
group_by(Col1,Col2) %>% mutate(MinVal=min(value,na.rm=T),
Flag=sum(is.na(value))) %>% ungroup() %>%
mutate(value=ifelse(is.na(value) & Flag<3,MinVal,value)) %>%
select(-c(MinVal,Flag)) %>%
pivot_wider(names_from = name,values_from=value)
Output:
# A tibble: 10 x 7
Col1 Col2 Col3 Col4 Col5 Col6 Col7
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A O 0.0993 0.404 0.154 0.0383 0.802
2 B P 0.503 0.817 0.0705 0.0705 0.655
3 C <NA> 0.758 0.0448 0.0448 0.547 0.0448
4 D R 0.968 0.0188 0.0738 0.0188 0.0450
5 <NA> S 0.471 0.0418 0.214 0.0418 0.555
6 F <NA> NA NA NA 0.337 0.550
7 G U 0.719 0.695 0.695 0.991 0.695
8 <NA> <NA> 0.905 0.207 0.658 0.0553 0.784
9 I W 0.792 0.632 0.632 0.721 0.632
10 J X 0.391 0.971 0.391 0.864 0.399
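For comparison, the same rule can be sketched in base R without reshaping, which avoids relying on Col1 and Col2 to identify rows. This is a minimal sketch rather than the approach used above: the helper name fill_na_with_row_min and its arguments cols and max_na are hypothetical, and the small data frame is made up for illustration.

```r
# Hypothetical helper: fill NAs in `cols` with the row minimum over `cols`,
# but only in rows that have fewer than `max_na` NAs.
fill_na_with_row_min <- function(data, cols, max_na) {
  block <- as.matrix(data[cols])   # numeric block of the selected columns
  n_na <- rowSums(is.na(block))    # number of NAs in each row
  # Row minimum ignoring NAs; rows that are entirely NA get NA instead of Inf
  row_min <- apply(block, 1, function(r) {
    if (all(is.na(r))) NA_real_ else min(r, na.rm = TRUE)
  })
  # n_na is recycled down the columns, so the test is applied row by row
  fill <- is.na(block) & n_na < max_na
  block[fill] <- row_min[row(block)[fill]]
  data[cols] <- as.data.frame(block)
  data
}

# Made-up example data
toy <- data.frame(id = c("a", "b", "c"),
                  x = c(1, NA, NA),
                  y = c(NA, 5, NA),
                  z = c(3, 2, NA),
                  w = c(4, 7, 9))
filled <- fill_na_with_row_min(toy, cols = c("x", "y", "z", "w"), max_na = 3)
filled
```

Under these assumptions, the question's real data would be handled with cols = 18:27 and max_na = 4.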

Related

replace value if a dataframe column exists in specific vector and under condition in R

I would like to replace values in df1 based on two vectors (g1.vec, g2.vec) and the column group.
If a column of df1 is listed in g1.vec and group == 1, replace the value with Y.
If a column of df1 is listed in g2.vec and group == 2, replace the value with X.
Here is part of an example:
> df1
ID group col1 col2 ...... col154
1 AMM115 2 C A ...... A+
2 ADM107 1 NA NA ...... B
3 AGM041 2 B C ...... C+
4 AGM132 1 A NA ...... A+
5 AQM007 1 NA A ...... B+
6 ARM028 2 NA B+ ...... A-
7 ASM019 1 A A+ ...... NA
8 AHM172 NA A A+ ...... NA
> vec
g1.vec <- c("col1", "col3", "col18", "col20", "col28", "col75", "col77", "col86", "col111")
g2.vec <- c("col2", "col13", "col37", "co38", "co44", "co87", "col123", "col41", "col154")
the output would look like:
> df2
ID group col1 col2 ...... col154
1 AMM115 2 C X ...... X
2 ADM107 1 Y NA ...... B
3 AGM041 2 B X ...... X
4 AGM132 1 Y NA ...... A+
5 AQM007 1 Y A ...... B+
6 ARM028 2 NA X ...... X
7 ASM019 1 Y A+ ...... NA
8 AHM172 NA A A+ ...... NA
I have tried mutate with ifelse, but I do not know how to mutate multiple columns at once to accomplish this.
Data
df1 <- structure(list(ID = c("AMM115", "ADM107", "AGM041", "AGM132",
"AQM007", "ARM028", "ASM019", "AHM172"), group = c(2L, 1L, 2L,
1L, 1L, 2L, 1L, NA), col1 = c("C", NA, "B", "A", NA, NA, "A",
"A"), col2 = c("A", NA, "C", NA, "A", "B+", "A+", "A+"), col154 = c("A+",
"B", "C+", "A+", "B+", "A-", NA, NA)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
Base R solution -
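# intersect() keeps only the listed columns that actually exist in df1
df1[df1$group %in% 1, intersect(g1.vec, names(df1))] <- 'Y'
df1[df1$group %in% 2, intersect(g2.vec, names(df1))] <- 'X'
df1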
# ID group col1 col2 col154
#1 AMM115 2 C X X
#2 ADM107 1 Y <NA> B
#3 AGM041 2 B X X
#4 AGM132 1 Y <NA> A+
#5 AQM007 1 Y A B+
#6 ARM028 2 <NA> X X
#7 ASM019 1 Y A+ <NA>
#8 AHM172 NA A A+ <NA>
I have used %in% instead of == here to handle the NA values.
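The difference matters because group contains an NA, and R refuses an NA row index in a data frame assignment. A minimal sketch on a made-up vector (the variable names are illustrative only):

```r
group <- c(2L, 1L, NA)

# == propagates NA, so the result cannot be used safely as a row index
eq_result <- group == 1      # FALSE TRUE NA

# %in% never returns NA: a missing group is simply "not a match"
in_result <- group %in% 1    # FALSE TRUE FALSE
```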
You may try
library(dplyr)
df1 %>%
mutate_at(g1.vec, list(~ifelse(group == 1 & !is.na(group), 'Y', .))) %>%
mutate_at(g2.vec, list(~ifelse(group == 2 & !is.na(group), 'X', .)))
Using across and case_match from dplyr
library(dplyr)# version 1.1.0
df1 %>%
mutate(across(any_of(g1.vec),~ case_match(group, 1 ~ 'Y', .default = .x)),
across(any_of(g2.vec), ~ case_match(group, 2 ~ 'X', .default = .x)))
-output
ID group col1 col2 col154
1 AMM115 2 C X X
2 ADM107 1 Y <NA> B
3 AGM041 2 B X X
4 AGM132 1 Y <NA> A+
5 AQM007 1 Y A B+
6 ARM028 2 <NA> X X
7 ASM019 1 Y A+ <NA>
8 AHM172 NA A A+ <NA>

How to recode multiple columns in R efficiently?

I need to recode some data. Firstly, imagine that the original data looks something like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
s1 414234 244575 539645 436236
s2 NA 512342 644252 835325
s3 NA NA 816747 475295
s4 NA NA NA 125429
s5 NA NA NA NA
s6 617465 844526 NA 194262
which, secondly, is transformed into
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 4 2 5 4
s2 NA 5 6 8
s3 NA NA 8 4
s4 NA NA NA 1
s5 NA NA NA NA
s6 6 8 NA 1
because I am going to recode everything according to the first digit. When, thirdly, recoded (see recoding pattern in MWE below) it should look like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s5 NA NA NA NA
s6 4 5 NA 1
and, fourthly, entire rows should be removed if all columns except the first one are empty, that is
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s6 4 5 NA 1
which is the ultimate data.
The first and second step were easily implemented but I struggle with the third and fourth step since I am new to R (see MWE below). For the third step, I tried to use mutate over multiple columns but Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')" appeared. The fourth step is easily implemented in Python with thresh but I am not sure if there is an equivalent in R.
How is this possible? Also, I work with huge data, so time-efficient solutions would also be highly appreciated.
library(dplyr)
df <- data.frame(
col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
col2 = c("414234", NA, NA, NA, NA, "617465"),
col3 = c("244575", "512342", NA, NA, NA, "844526"),
col4 = c("539645", "644252", "816747", NA, NA, NA),
col5 = c("436236", "835325", "475295", "125429", NA, "194262")
)
n = ncol(df)
for (i in colnames(df[2:n])) {
df[, i] = strtoi(substr(df[, i], 1, 1))
}
for (i in colnames(df[2:n])) {
df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}
Base R way:
# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2+1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]
df
# V1 V2 V3 V4 V5
# 1 s1 3 1 3 3
# 2 s2 NA 3 4 5
# 3 s3 NA NA 5 3
# 4 s4 NA NA NA 1
# 6 s6 4 5 NA 1
As you can see, it is not necessary to do the operations column-wise as they can be performed en bloc, which will be more efficient.
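The recode step works by using the first digits as positions in a lookup vector. A small sketch on a made-up vector of digits (lookup and first_digits are illustrative names):

```r
# Position k + 1 of the lookup vector holds the new code for first digit k
# (digits 0 to 8, following the recoding pattern in the question)
lookup <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)

first_digits <- c(4, 0, 8, NA)
recoded <- lookup[first_digits + 1]
recoded
# 3 1 5 NA
```

Because the lookup vector has nine entries, a first digit of 9 falls outside it and comes back as NA, just as in the question's recoding pattern, which does not list 9.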
You can do this with a combination of tidyverse packages. We generally avoid for loops in R, unless we really need them. It's almost always preferable to vectorise.
library(dplyr)
library(stringr) # for str_sub
library(purrr) # for negate
mat = matrix(c( "s1", "s2", "s3", "s4", "s5", "s6",
"414234", NA, NA, NA, NA, "617465",
"244575", "512342", NA, NA, NA, "844526",
"539645", "644252", "816747", NA, NA, NA,
"436236", "835325", "475295", "125429", NA, "194262"),
nrow=6,
ncol=5
)
df <- as.data.frame(mat)
## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, str_sub, 1, 1))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 4 2 5 4
#> 2 s2 <NA> 5 6 8
#> 3 s3 <NA> <NA> 8 4
#> 4 s4 <NA> <NA> <NA> 1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6 6 8 <NA> 1
## Step 3: Recode
df <- mutate(df,
across(V2:V5,
recode,
`0` = "1", `1` = "1", `2` = "1", `3` = "2",
`4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
))
## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s5 NA NA NA NA
#> 6 s6 4 5 NA 1
## Step 4: drop all rows where every value is missing
## By purrr::negate()-ing is.na, we keep only rows where
## at least one value is not missing
df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s6 4 5 NA 1
Created on 2022-12-13 with reprex v2.0.2
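This one uses a bit of arithmetic to extract the leading digit:
library(dplyr)
library(tidyr) # for pivot_longer, pivot_wider
library(purrr) # for map_dbl
df |>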
pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
mutate(val = map_dbl(as.integer(val),
~c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x)) +1])) |>
filter(!is.na(val)) |>
pivot_wider(values_from = val, names_from = col )
##> # A tibble: 5 × 5
##> col1 col2 col3 col4 col5
##> <chr> <dbl> <dbl> <dbl> <dbl>
##> 1 s1 3 1 3 3
##> 2 s2 NA 3 4 5
##> 3 s3 NA NA 5 3
##> 4 s4 NA NA NA 1
##> 5 s6 4 5 NA 1
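The arithmetic in that pipeline can be looked at in isolation. The sketch below wraps it in a hypothetical helper, leading_digit, and applies it to a few of the question's values; it assumes positive numbers, since log10() of zero or a negative number does not give a usable digit count.

```r
# Hypothetical helper: first digit of a positive number
leading_digit <- function(x) {
  # trunc(log10(x)) is the number of digits minus one
  x %/% 10^trunc(log10(x))
}

values <- c(414234, 244575, 835325, NA)
digits <- leading_digit(values)
digits
# 4 2 8 NA
```

The expression is already vectorised, so it can be applied to a whole column at once.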

Compute frequency list of final words in utterances of variable length

I have a large dataframe with utterances of variable sizes:
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),
w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"),
w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)),
row.names = c(NA, -8L), class = "data.frame")
I'd like to compare the utterance-initial words in w1 with the utterance-final words in the other w columns, using frequency lists with counts and proportions. I can compute a frequency list of the utterance-initial words:
library(dplyr)
df %>%
group_by(w1) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(desc(prop))
# A tibble: 5 x 3
w1 n prop
<chr> <int> <dbl>
1 well 3 0.375
2 er 2 0.25
3 come 1 0.125
4 she 1 0.125
5 why 1 0.125
But how to compute the list of the utterance-final words when these are in different w columns?
Expected:
# A tibble: 5 x 3
w_last n prop
<chr> <int> <dbl>
1 can 3 0.375
2 on 2 0.25
3 cool 1 0.125
4 that 1 0.125
5 today 1 0.125
Here's another solution, which takes the last non-NA value in each row:
df %>%
mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
group_by(w_last) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(desc(prop))
Three methods in tidyverse style of syntax
1 You may extract final_word into a separate column and tabulate it (dplyr for the extraction, janitor::tabyl() for the table).
df %>% rowwise() %>%
mutate(final_word = get(paste0('w', size))) %>%
janitor::tabyl(final_word)
final_word n percent
can 3 0.375
cool 1 0.125
on 2 0.250
that 1 0.125
today 1 0.125
2 Restructuring the data a bit:
pivot the data to long format,
keep only those rows where size matches the word number,
use janitor::tabyl() to generate the proportion table (which can be formatted further in janitor).
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),
w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"),
w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)),
row.names = c(NA, -8L), class = "data.frame")
df
#> size w1 w2 w3 w4
#> 1 2 come on <NA> <NA>
#> 2 2 why that <NA> <NA>
#> 3 3 er i can <NA>
#> 4 3 well not today <NA>
#> 5 4 she 's going on
#> 6 4 well thanks they can
#> 7 3 er super cool <NA>
#> 8 3 well she can <NA>
library(tidyverse)
library(janitor)
df %>% pivot_longer(!size, values_drop_na = T) %>%
filter(as.numeric(substr(name, 2, nchar(name))) == size) %>%
janitor::tabyl(value)
#> value n percent
#> can 3 0.375
#> cool 1 0.125
#> on 2 0.250
#> that 1 0.125
#> today 1 0.125
Created on 2021-05-06 by the reprex package (v2.0.0)
3 Alternatively, you can right-align the words so that the final word always lands in the last column, using unite and separate from tidyr, and then count the words in that last column:
df %>% unite('W', starts_with('w'), sep = '=', na.rm = T, remove = T) %>%
separate(W, into = paste0('w', seq_len(1 + max(str_count(.$W, '=')))), fill = 'left', sep = '=')
size w1 w2 w3 w4
1 2 <NA> <NA> come on
2 2 <NA> <NA> why that
3 3 <NA> er i can
4 3 <NA> well not today
5 4 she 's going on
6 4 well thanks they can
7 3 <NA> er super cool
8 3 <NA> well she can
You can subset df with a two-column index matrix built from the row numbers (seq_len(nrow(df))) and the column numbers in df$size, then make a table and calculate the proportions.
tt <- table(df[-1][cbind(seq_len(nrow(df)), df$size)])
cbind(tt, proportions(tt))
# tt
#can 3 0.375
#cool 1 0.125
#on 2 0.250
#that 1 0.125
#today 1 0.125
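The key step is indexing with a two-column matrix, where each row of the matrix is a (row, column) pair. A minimal sketch on a made-up two-row data frame (words, idx and last_word are illustrative names):

```r
words <- data.frame(size = c(2, 3),
                    w1 = c("come", "er"),
                    w2 = c("on", "i"),
                    w3 = c(NA, "can"))

# One (row, column) pair per utterance: row i, column size[i]
idx <- cbind(seq_len(nrow(words)), words$size)

# Drop the size column so that column k is word k, then pick one cell per row
last_word <- as.matrix(words[-1])[idx]
last_word
# "on" "can"
```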
A base R option
out <- rev(
stack(
prop.table(
table(apply(df, 1, function(x) tail(na.omit(x), 1)))
)
)
)
gives
ind values
1 can 0.375
2 cool 0.125
3 on 0.250
4 that 0.125
5 today 0.125
If you want to order the rows in a descending manner, you can do
> out[order(-out$values), ]
ind values
1 can 0.375
3 on 0.250
2 cool 0.125
4 that 0.125
5 today 0.125

Comparing two dataframes in R and extract the values from one dataframe

I have two dataframes with different numbers of rows and columns. One dataframe has two columns and the other has multiple columns.
The first dataframe (df) holds a key in col1 and a value in col2; the second dataframe (df2) holds keys such as A, B, C in several columns. Both are given as dput output below.
I need to replace the keys in the second dataframe with the corresponding values from the 2nd column of the first dataframe.
Please help me solve this problem.
dput:
df
structure(list(col1 = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L"), col2 = c(10, 1, 2, 3, 4, 3, 1, 8, 19, 200,
12, 112)), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
df2
structure(list(col1 = c("A", "F", "W", "E", "F", "G"), col2 = c(NA,
NA, "J", "K", "L", NA), col3 = c(NA, "H", "I", NA, "A", "B")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
A one-liner:
as_tibble(`colnames<-`(matrix(df$col2[match(as.matrix(df2),df$col1)], ncol=3), names(df2)))
#> # A tibble: 6 x 3
#> col1 col2 col3
#> <dbl> <dbl> <dbl>
#> 1 10 NA NA
#> 2 3 NA 8
#> 3 NA 200 19
#> 4 4 12 NA
#> 5 3 112 10
#> 6 1 NA 1
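The one-liner rests on match(), which returns the position of each key in the lookup column, or NA when the key is missing or not found. A minimal sketch on made-up data (lookup, codes, pos and translated are illustrative names):

```r
lookup <- data.frame(key = c("A", "B", "C"), val = c(10, 1, 2))
codes <- c("B", NA, "Z", "A")

pos <- match(codes, lookup$key)   # 2 NA NA 1
translated <- lookup$val[pos]     # 1 NA NA 10
```

In the one-liner, as.matrix(df2) is flattened column by column before the lookup, and matrix(..., ncol=3) restores the original shape in the same order.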
You can accomplish this with a little data manipulation. Make the data in df2 long, then join to df, then make the data wide again.
The rowid_to_column is necessary to make the transition from long to wide work. You can easily remove that column by adding select(-rowid) at the end of the chain.
library(tidyverse)
df2 %>%
rowid_to_column() %>%
pivot_longer(cols = -rowid) %>%
left_join(df, by = c("value" = "col1")) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = col2)
# rowid col1 col2 col3
# <int> <dbl> <dbl> <dbl>
# 1 1 10 NA NA
# 2 2 3 NA 8
# 3 3 NA 200 19
# 4 4 4 12 NA
# 5 5 3 112 10
# 6 6 1 NA 1
one-liner in base R:
df2 <- as.data.frame(lapply(df2, function(x) ifelse(!is.na(x), setNames(df$col2, df$col1)[x], NA)))
Output
> df2
col1 col2 col3
1 10 NA NA
2 3 NA 8
3 NA 200 19
4 4 12 NA
5 3 112 10
6 1 NA 1
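Another short one-liner in base R. You can use match and assign the result to df2[]. Assigning a plain vector to df2[] relies on ordinary data frame behaviour, so convert a tibble first:
df2 <- as.data.frame(df2)
df2[] <- df$col2[match(unlist(df2), df$col1)]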
df2
# col1 col2 col3
#1 10 NA NA
#2 3 NA 8
#3 NA 200 19
#4 4 12 NA
#5 3 112 10
#6 1 NA 1

Seeing if all values in one dataframe row exist in another dataframe

I have a dataframe as follows:
df1
ColA ColB ColC ColD
10 A B L
11 N Q NA
12 P J L
43 M T NA
89 O J T
df2
ATTR Att R1 R2 R3 R4
1 45 A B NA NA
2 40 C D NA NA
3 33 T J O NA
4 65 L NA NA NA
5 20 P L J NA
6 23 Q NA NA NA
7 38 Q L NA NA
How do I match up df2 with df1 so that a df2 row is joined to a df1 row only if ALL of its non-NA values (disregarding order) appear in that df1 row? It should check that ALL values, not just one, from each df2 row are present in the df1 row. The final result in this case should be this:
ColA ColB ColC ColD ATTR Att R1 R2 R3 R4
10 A B L 1 45 A B NA NA
10 A B L 4 65 L NA NA NA
11 N Q NA 6 23 Q NA NA NA
12 P J L 4 65 L NA NA NA
12 P J L 5 20 P L J NA
89 O J T 3 33 T J O NA
Thanks
Here is a possible solution using base R.
In the code below, df is the question's df1 and df1 is the question's df2 (see DATA at the end). Make sure everything is a character before continuing, i.e.
df[-1] <- lapply(df[-1], as.character)
df1[-c(1:2)] <- lapply(df1[-c(1:2)], as.character)
First we create two lists which contain vectors of the row-wise elements of each data frame. We then build a matrix that counts, for every pair of rows, how many elements of the l2 vector are not found in the l1 vector (via setdiff). If the count is 0, every element of that df1 row appears in that df row, i.e. they match:
l1 <- lapply(split(df[-1], seq(nrow(df))), function(i) i[!is.na(i)])
l2 <- lapply(split(df1[-c(1:2)], seq(nrow(df1))), function(i) i[!is.na(i)])
m1 <- sapply(l1, function(i) sapply(l2, function(j) length(setdiff(j, i))))
m1
# 1 2 3 4 5
#1 0 2 2 2 2
#2 2 2 2 2 2
#3 3 3 2 2 0
#4 0 1 0 1 1
#5 2 3 0 3 2
#6 1 0 1 1 1
#7 1 1 1 2 2
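Each cell of m1 is a single setdiff() count, and setdiff() ignores order. A minimal sketch on vectors taken from the data (df_row, candidate_a and candidate_b are illustrative names):

```r
df_row <- c("O", "J", "T")        # values of one row of df
candidate_a <- c("T", "J", "O")   # same values in a different order
candidate_b <- c("P", "L", "J")   # only "J" is present in df_row

n_missing_a <- length(setdiff(candidate_a, df_row))   # 0, so a match
n_missing_b <- length(setdiff(candidate_b, df_row))   # 2, so no match
```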
We then use that matrix to create a couple of columns in our original df. The first column, rpt, counts the zeros in each column of m1, i.e. how many df1 rows match each df row, and is used as the number of repeats for that row. We also use it to drop the rows where rpt is 0 (i.e. the rows that have no match in df1). After expanding the data frame we create another variable, ATTR (the same name as ATTR in df1), in order to use it for a merge, i.e.
df$rpt <- colSums(m1 == 0)
df <- df[df$rpt != 0,]
df <- df[rep(row.names(df), df$rpt),]
df$ATTR <- which(m1 == 0, arr.ind = TRUE)[,1]
df
# ColA ColB ColC ColD rpt ATTR
#1 10 A B L 2 1
#1.1 10 A B L 2 4
#2 11 N Q <NA> 1 6
#3 12 P J L 2 4
#3.1 12 P J L 2 5
#5 89 O J T 1 3
We then merge and order the two data frames,
final_df <- merge(df, df1, by = 'ATTR')
final_df[order(final_df$ColA),]
# ATTR ColA ColB ColC ColD rpt Att R1 R2 R3 R4
#1 1 10 A B L 2 45 A B <NA> <NA>
#3 4 10 A B L 2 65 L <NA> <NA> <NA>
#6 6 11 N Q <NA> 1 23 Q <NA> <NA> <NA>
#4 4 12 P J L 2 65 L <NA> <NA> <NA>
#5 5 12 P J L 2 20 P L J <NA>
#2 3 89 O J T 1 33 T J O <NA>
DATA
dput(df)
structure(list(ColA = c(10L, 11L, 12L, 43L, 89L), ColB = c("A",
"N", "P", "M", "O"), ColC = c("B", "Q", "J", "T", "J"), ColD = c("L",
NA, "L", NA, "T")), .Names = c("ColA", "ColB", "ColC", "ColD"
), row.names = c(NA, -5L), class = "data.frame")
dput(df1)
structure(list(ATTR = 1:7, Att = c(45L, 40L, 33L, 65L, 20L, 23L,
38L), R1 = c("A", "C", "T", "L", "P", "Q", "Q"), R2 = c("B",
"D", "J", NA, "L", NA, "L"), R3 = c(NA, NA, "O", NA, "J", NA,
NA), R4 = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_)), .Names = c("ATTR",
"Att", "R1", "R2", "R3", "R4"), row.names = c(NA, -7L), class = "data.frame")
