Finding equal rows between dataframes in R

I have the following data set as example:
df1 <- data.frame(V1 = 1:10, V2 = 1:10, V3 = 1:10)
df2 <- data.frame(V1 = 5:1, V2 = 5:1, v3 = c(1, 4, 5, 2, 3))
If a row in df1 is present in df2, I would like to create a column in df1 that gives the index of the corresponding row in df2, and shows FALSE, NULL, NA, 0, or similar for all other rows.
Expected output:
V1 V2 V3 rows_matched
1 1 1 1 FALSE
2 2 2 2 4
3 3 3 3 FALSE
4 4 4 4 2
5 5 5 5 FALSE
6 6 6 6 FALSE
7 7 7 7 FALSE
8 8 8 8 FALSE
9 9 9 9 FALSE
10 10 10 10 FALSE

In base R:
cbind(df1, matched = match(interaction(df1), interaction(df2)))
V1 V2 V3 matched
1 1 1 1 NA
2 2 2 2 4
3 3 3 3 NA
4 4 4 4 2
5 5 5 5 NA
6 6 6 6 NA
7 7 7 7 NA
8 8 8 8 NA
9 9 9 9 NA
10 10 10 10 NA
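
The match() idea above can also be sketched with an explicit per-row key. This is a minimal sketch, assuming that pasting the column values with a separator that never occurs in the data is acceptable; key1 and key2 are illustrative names, not part of the original answers.

```r
df1 <- data.frame(V1 = 1:10, V2 = 1:10, V3 = 1:10)
df2 <- data.frame(V1 = 5:1, V2 = 5:1, v3 = c(1, 4, 5, 2, 3))

# Build one string key per row; unname() drops the column names,
# so the V3/v3 mismatch does not matter here.
key1 <- do.call(paste, c(unname(df1), sep = "\r"))
key2 <- do.call(paste, c(unname(df2), sep = "\r"))

# match() returns the position of the first matching key in df2, or NA.
df1$rows_matched <- match(key1, key2)
df1
```

Because the comparison happens on strings, values are matched by their printed form, which may not be what you want for non-integer doubles.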

You can do a simple left join. Note: I renamed the column v3 in df2 to V3 to match the names in df1.
library(dplyr)
names(df2)[3] <- "V3"  # rename v3 so the join keys match df1
left_join(
df1,
df2 %>% mutate(rows_matched = row_number())
)
Output:
V1 V2 V3 rows_matched
1 1 1 1 NA
2 2 2 2 4
3 3 3 3 NA
4 4 4 4 2
5 5 5 5 NA
6 6 6 6 NA
7 7 7 7 NA
8 8 8 8 NA
9 9 9 9 NA
10 10 10 10 NA

Here is another way of solving your problem using data.table (this assumes the column v3 in df2 has been renamed to V3):
library(data.table)
setDT(df1)
setDT(df2)
df1[, rows_matched := df2[df1, on=.(V1,V2,V3), which=TRUE]]
#
# V1 V2 V3 rows_matched
# 1: 1 1 1 NA
# 2: 2 2 2 4
# 3: 3 3 3 NA
# 4: 4 4 4 2
# 5: 5 5 5 NA
# 6: 6 6 6 NA
# 7: 7 7 7 NA
# 8: 8 8 8 NA
# 9: 9 9 9 NA
# 10: 10 10 10 NA

Another possible solution, based on dplyr::left_join (we first have to rename v3 in df2 to V3):
library(dplyr)
df1 %>%
left_join(df2 %>% mutate(new = row_number()))
#> Joining, by = c("V1", "V2", "V3")
#> V1 V2 V3 new
#> 1 1 1 1 NA
#> 2 2 2 2 4
#> 3 3 3 3 NA
#> 4 4 4 4 2
#> 5 5 5 5 NA
#> 6 6 6 6 NA
#> 7 7 7 7 NA
#> 8 8 8 8 NA
#> 9 9 9 9 NA
#> 10 10 10 10 NA
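
The join-based answers can be written with the rename and the join keys made explicit. This is a sketch only, assuming dplyr is installed; df2_indexed and result are illustrative names.

```r
library(dplyr)

df1 <- data.frame(V1 = 1:10, V2 = 1:10, V3 = 1:10)
df2 <- data.frame(V1 = 5:1, V2 = 5:1, v3 = c(1, 4, 5, 2, 3))

# Rename v3 so the key columns agree, and record each row's position in df2.
df2_indexed <- df2 %>%
  rename(V3 = v3) %>%
  mutate(rows_matched = row_number())

# Naming the keys in `by` avoids the "Joining, by = ..." message.
result <- left_join(df1, df2_indexed, by = c("V1", "V2", "V3"))
result
```

Note that if df2 contained duplicate rows, a left join would return one output row per match, so the result could have more rows than df1.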

Related

Create run-length ID for subset of values

In this type of dataframe:
df <- data.frame(
x = c(3,3,1,12,2,2,10,10,10,1,5,5,2,2,17,17)
)
How can I create a new column recording the run-length ID of only a subset of x values, say, 3-20?
My own attempt only succeeds at inserting NA where the run-length count should be interrupted; internally, the count still runs over the excluded values:
library(dplyr)
library(data.table)
df %>%
mutate(rle = ifelse(x %in% 3:20, rleid(x), NA))
x rle
1 3 1
2 3 1
3 1 NA
4 12 3
5 2 NA
6 2 NA
7 10 5
8 10 5
9 10 5
10 1 NA
11 5 7
12 5 7
13 2 NA
14 2 NA
15 17 9
16 17 9
The expected result:
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
With base R subsetting (rleid() itself comes from data.table):
df[df$x %in% 3:20, "rle"] <- data.table::rleid(df$x[df$x %in% 3:20])
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
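With left_join (this numbers distinct values, so it matches the run-length ID only when no in-range value occurs in more than one run, as is the case here):
library(dplyr)
left_join(df, df %>%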
filter(x %in% 3:20) %>%
distinct() %>%
mutate(rle = row_number()))
Joining, by = "x"
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
With data.table:
library(data.table)
setDT(df)
df[x %between% c(3,20),rle:=rleid(x)][]
x rle
<num> <int>
1: 3 1
2: 3 1
3: 1 NA
4: 12 2
5: 2 NA
6: 2 NA
7: 10 3
8: 10 3
9: 10 3
10: 1 NA
11: 5 4
12: 5 4
13: 2 NA
14: 2 NA
15: 17 5
16: 17 5
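
For comparison, the same IDs can be computed without any package. This is a minimal sketch, assuming the subset is the closed range 3 to 20; in_range and starts are illustrative names.

```r
df <- data.frame(
  x = c(3, 3, 1, 12, 2, 2, 10, 10, 10, 1, 5, 5, 2, 2, 17, 17)
)

# Flag the values that should receive a run-length ID.
in_range <- df$x >= 3 & df$x <= 20

# A run starts at an in-range value that differs from the element before it
# (the first element always counts as a change).
starts <- in_range & c(TRUE, df$x[-1] != df$x[-nrow(df)])

# Number the runs with a cumulative count of starts; blank out the rest.
df$rle <- ifelse(in_range, cumsum(starts), NA_integer_)
df
```

One design difference: here two runs of the same value separated only by out-of-range values get different IDs, whereas applying rleid() to the filtered values would merge them into one run.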

Conditionally adding columns to a list of dataframes

I have a list of dataframes with either 2 or 4 columns.
a <- data.frame(a=1:10,
b=1:10,
c=1:10,
d=1:10)
b <- data.frame(a=1:10,
b=1:10)
list_of_df <- list(a,b)
I want to add 2 empty columns to each dataframe with only 2 columns.
I've tried this lapply approach:
lapply(list_of_df, function(x) ifelse(ncol(x) < 4,x%>%add_column(empty=NA),x <- x))
Which does not work unfortunately. How can I fix this?
I came up with something similar:
add_col <- function(x){
# number of columns missing to reach 4
col_to_add <- 4 - ncol(x)
if(col_to_add <= 0) return(x)
z <- rep(NA, nrow(x))
# append one NA column per missing column
for (i in seq_len(col_to_add)){
x <- cbind(x, z)
}
x
}
lapply(list_of_df, add_col)
I would use a for loop to avoid copying the whole list:
for (i in seq_along(list_of_df)) {
n_columns = ncol(list_of_df[[i]])
if (n_columns == 2L) {
list_of_df[[i]][c('empty1', 'empty2')] <- NA
}
}
Result:
> list_of_df
[[1]]
a b c d
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
[[2]]
a b empty1 empty2
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 7 7 NA NA
8 8 8 NA NA
9 9 9 NA NA
10 10 10 NA NA
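
The original lapply attempt fails because ifelse() is vectorized: it returns a value shaped like its test, so it is not suited to returning a whole data frame. A plain if inside the function is enough. This is a sketch; pad_to_four and the column names empty1 and empty2 are illustrative choices.

```r
a <- data.frame(a = 1:10, b = 1:10, c = 1:10, d = 1:10)
b <- data.frame(a = 1:10, b = 1:10)
list_of_df <- list(a, b)

# Add two NA columns to data frames that have exactly two columns;
# return everything else unchanged.
pad_to_four <- function(x) {
  if (ncol(x) == 2L) {
    x[c("empty1", "empty2")] <- NA
  }
  x
}

padded <- lapply(list_of_df, pad_to_four)
str(padded)
```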
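We could use bind_rows, which fills the missing columns with NA, then group_split to split the rows back into the original data frames, and map from purrr to remove the helper id_Group column. This relies on column a restarting at 1 in each data frame: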
library(dplyr)
library(purrr)
bind_rows(list_of_df) %>%
group_split(id_Group =cumsum(a==1)) %>%
map(., ~ (.x %>% ungroup() %>%
select(-id_Group)))
[[1]]
# A tibble: 10 x 4
a b c d
<int> <int> <int> <int>
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
[[2]]
# A tibble: 10 x 4
a b c d
<int> <int> <int> <int>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 7 7 NA NA
8 8 8 NA NA
9 9 9 NA NA
10 10 10 NA NA

How to select consecutive columns with across function dplyr [duplicate]

I wanted to use the new across function from dplyr to select consecutive columns and change the NAs to zeros. However, it does not work. It seems like a very simple thing, so I may be missing something.
A working example:
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 NA 3 7 6 6 10 6 5
2 9 8 9 5 10 NA 2 1 7 2
3 1 1 6 3 6 NA 1 4 1 6
4 NA 4 NA 7 10 2 NA 4 1 8
5 1 2 4 NA 2 6 2 6 7 4
6 NA 3 NA NA 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 NA
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 NA 9 7 2 5 5
This works fine:
d %>% mutate_at(vars(V1:V4), ~replace(., is.na(.), 0))
But if I try these options I get an error:
d %>% mutate(across(vars(V1:V4)), ~replace(., is.na(.), 0))
d %>% mutate(across(V1:V4)), ~replace(., is.na(.), 0))
d %>% mutate(across("V1":"V4")), ~replace(., is.na(.), 0))
I am not sure why this doesn't work.
across() takes two basic arguments. The first is the set of columns to modify, and the second is the function to apply to those columns. In addition, vars() is no longer needed to select the variables. Thus, the correct form is:
d %>%
mutate(across(V1:V4, ~ replace(., is.na(.), 0)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 2 6 0 6 5 6 10 5 3 1
2 2 9 2 4 10 6 9 4 NA NA
3 5 5 3 0 3 7 1 5 9 5
4 7 1 1 6 2 1 8 NA 8 4
5 3 5 3 0 2 3 4 2 3 NA
6 0 10 0 2 5 10 1 10 4 3
7 4 3 10 6 NA 5 9 3 3 9
8 9 9 8 5 8 1 3 1 NA 10
9 6 3 0 1 1 9 3 5 8 4
10 3 2 9 1 5 2 4 NA 6 1
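
To make the two arguments described above explicit, the call can be written with the argument names .cols and .fns spelled out. This is a sketch assuming dplyr 1.0.0 or later; the seed and the name res are arbitrary illustrative choices.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the example is reproducible
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)

# .cols selects the consecutive columns V1 to V4;
# .fns is applied to each selected column in turn.
res <- d %>%
  mutate(across(.cols = V1:V4, .fns = ~ replace(.x, is.na(.x), 0)))

# Only V1 to V4 are modified; V5 to V10 keep their NAs.
colSums(is.na(res))
```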

Need to find highest 5 columns and their associated values for each row in a dataframe in R

I have a list of employees, each with 10 integer variables describing various attributes of that employee, and I need to know the five highest variables for each person (row) in this dataframe. In addition to the names of the 5 highest variables, I also need their values for each row (each employee).
A simple example below (column names = employee-related integer variables, row names = employee IDs).
set.seed(1)
DF <- matrix(sample(1:9,9),ncol=10,nrow=9)
DF <- as.data.frame.matrix(DF)
>DF
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 3 2 5 6 5 2 6 8 1 3
# 2 1 4 7 8 7 7 3 4 2 9
# 3 2 3 4 7 5 8 9 1 3 5
# 4 3 8 3 4 5 6 7 4 6 5
# 5 6 2 3 7 2 1 8 3 2 4
# 6 8 2 4 8 3 2 9 7 6 5
# 7 1 5 3 6 8 3 8 9 1 3
# 8 9 3 5 8 4 9 7 8 1 2
# 9 1 2 4 8 3 2 1 2 5 6
Thanks in advance!
If you want to keep the shape, you could just make all but the top 5 for each row NA
out <- t(apply(DF, 1, function(x) ifelse(x %in% tail(sort(x), 5), x, NA)))
colnames(out) <- colnames(DF)
rownames(out) <- rownames(DF)
out
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 NA 9 5 6 NA NA 8 7 NA
# 2 NA NA 8 5 7 NA 6 NA 9
# 3 NA 7 8 NA 9 NA 5 NA 6
# 4 NA 7 NA 8 6 NA NA 9 5
# 5 8 NA 6 NA 5 7 NA NA 9
# 6 8 NA NA 5 7 NA NA 9 6
# 7 NA 9 NA NA 6 NA 7 8 5
# 8 NA 6 NA 9 NA NA 8 5 7
# 9 NA NA 9 6 5 NA 8 7 NA
# 10 7 NA NA 5 NA 9 NA 8 6
You can also print without showing all the NAs
print(out, na.print = '')
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 9 5 6 8 7
# 2 8 5 7 6 9
# 3 7 8 9 5 6
# 4 7 8 6 9 5
# 5 8 6 5 7 9
# 6 8 5 7 9 6
# 7 9 6 7 8 5
# 8 6 9 8 5 7
# 9 9 6 5 8 7
# 10 7 5 9 8 6
Another option:
out <- t(apply(DF, 1, function(x){
o <- head(order(-x), 5)
paste0(names(x[o]), ':', x[o])
}))
as.data.frame(out)
# V1 V2 V3 V4 V5
# 1 V2:9 V7:8 V8:7 V4:6 V3:5
# 2 V9:9 V3:8 V5:7 V7:6 V4:5
# 3 V5:9 V3:8 V2:7 V9:6 V7:5
# 4 V8:9 V4:8 V2:7 V5:6 V9:5
# 5 V9:9 V1:8 V6:7 V3:6 V5:5
# 6 V8:9 V1:8 V5:7 V9:6 V4:5
# 7 V2:9 V8:8 V7:7 V5:6 V9:5
# 8 V4:9 V7:8 V9:7 V2:6 V8:5
# 9 V3:9 V7:8 V8:7 V4:6 V5:5
# 10 V6:9 V8:8 V1:7 V9:6 V4:5
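
If the names and the values are needed separately rather than pasted together, they can be kept in two parallel matrices. This is a minimal sketch using the same data; top_n, idx, top_names and top_values are illustrative names.

```r
set.seed(1)
DF <- as.data.frame(t(replicate(10, sample(1:9, 9))))

top_n <- 5

# For each row, the column positions of the top_n largest values (one row per employee).
idx <- t(apply(DF, 1, function(x) order(x, decreasing = TRUE)[seq_len(top_n)]))

# Column names of the top values, same shape as idx.
top_names <- matrix(names(DF)[idx], nrow = nrow(DF))

# Matching values, looked up with (row, column) index pairs.
rows <- rep(seq_len(nrow(DF)), times = top_n)
top_values <- matrix(as.matrix(DF)[cbind(rows, as.vector(idx))], nrow = nrow(DF))

top_names
top_values
```

Ties are broken by column order here, since order() is stable; that is a property of this sketch rather than a requirement of the question.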
Data used (from emsinko's answer)
set.seed(1)
DF <- t(replicate(10,sample(1:9,9))) # random values
DF <- as.data.frame.matrix(DF)
Fast solution (output in list):
set.seed(1)
DF <- t(replicate(10,sample(1:9,9))) # random values
DF <- as.data.frame.matrix(DF)
output <- list() # init empty list
for(i in 1:10) output[[i]] <- sort(DF[i,], decreasing = TRUE)[1:5]
print(output)
> output
[[1]]
V2 V7 V8 V4 V3
1 9 8 7 6 5
[[2]]
V9 V3 V5 V7 V4
2 9 8 7 6 5
.... and so on
I can provide a different output format; just specify what the output should look like.

extracting the nth column of each row of a DT in R where n is a vector of the number of rows in the DT

Seems silly but a simple extraction from a DT is giving me problems.
Consider a toy example:
Create a test data.table with 5 columns:
library(data.table)
dt <- fread("
V1 V2 V3 V4 V5
1 10 7 4 3
2 11 8 5 2
3 12 9 6 1
4 1 10 7 4
5 2 11 8 4
6 3 12 9 3
7 4 1 10 3
8 5 2 11 1
9 6 3 12 2")
Now I want to add a 6th column V6 that contains the value of the column with column number in V5, for each row. So the final output I need is a data.table that transforms dt to below:
V1 V2 V3 V4 V5 V6
1: 1 10 7 4 3 7
2: 2 11 8 5 2 11
3: 3 12 9 6 1 3
4: 4 1 10 7 4 7
5: 5 2 11 8 4 8
6: 6 3 12 9 3 12
7: 7 4 1 10 3 1
8: 8 5 2 11 1 8
9: 9 6 3 12 2 6
With data.table, we can loop through the rows, subset .SD based on the column index in 'V5', and assign (:=) the result to create 'V6' (dt2 is a copy of dt):
dt2 <- copy(dt)
dt2[, V6 := .SD[[V5]], by = 1:nrow(dt2)]
dt2
# V1 V2 V3 V4 V5 V6
#1: 1 10 7 4 3 7
#2: 2 11 8 5 2 11
#3: 3 12 9 6 1 3
#4: 4 1 10 7 4 7
#5: 5 2 11 8 4 8
#6: 6 3 12 9 3 12
#7: 7 4 1 10 3 1
#8: 8 5 2 11 1 8
#9: 9 6 3 12 2 6
In base R, we can index with a two-column matrix of row and column numbers:
setDF(dt2)
dt2$V6 <- dt2[cbind(seq_len(nrow(dt2)), dt2$V5)]
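
The matrix-indexing idea can be tried without data.table at all. This is a sketch that rebuilds the toy data as a plain data frame; dt_df and idx are illustrative names.

```r
dt_df <- data.frame(
  V1 = 1:9,
  V2 = c(10, 11, 12, 1, 2, 3, 4, 5, 6),
  V3 = c(7, 8, 9, 10, 11, 12, 1, 2, 3),
  V4 = c(4, 5, 6, 7, 8, 9, 10, 11, 12),
  V5 = c(3, 2, 1, 4, 4, 3, 3, 1, 2)
)

# Each row of idx is a (row number, column number) pair.
idx <- cbind(seq_len(nrow(dt_df)), dt_df$V5)

# Indexing a matrix with a two-column matrix picks one cell per pair.
dt_df$V6 <- as.matrix(dt_df)[idx]
dt_df
```

This avoids the per-row grouping, at the cost of converting the data to a matrix, which only makes sense when all columns share a compatible type.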

Resources