How to find duplicated values in two columns between two dataframes and remove non-duplicates in R? - r

So let's say I have two dataframes that look like this
df1 <- data.frame(ID = c("A","B","F","G","B","B","A","G","G","F","A","A","A","B","F"),
code = c(1,2,2,3,3,1,2,2,1,1,3,2,2,1,1),
class = c(2,4,5,5,2,3,2,5,1,2,4,5,3,2,1))
df2 <- data.frame(ID = c("G","F","C","F","B","A","F","C","A","B","A","B","C","A","G"),
code = c(1,2,2,3,3,1,2,2,1,1,3,2,2,1,1),
class = c(2,4,5,5,2,3,2,5,1,2,4,5,3,2,1))
I want to check the duplicates in df1$ID and df2$ID and remove all the rows from df2 if the IDs are not present in df1 so the new dataframe would look like this:
df3 <- data.frame(ID = c("G","F","F","B","A","F","A","B","A","B","A","G"),
code = c(1,2,3,3,1,2,1,1,3,2,1,1),
class = c(2,4,5,2,3,2,1,2,4,5,2,1))

With %in%:
df2[df2$ID %in% df1$ID, ]
ID code class
1 G 1 2
2 F 2 4
4 F 3 5
5 B 3 2
6 A 1 3
7 F 2 2
9 A 1 1
10 B 1 2
11 A 3 4
12 B 2 5
14 A 1 2
15 G 1 1

You can use the 'intersect' function to tackle the issue.
common_ids <- intersect(df1$ID, df2$ID)
df3 <- df2[df2$ID %in% common_ids, ]
ID code class
1 G 1 2
2 F 2 4
4 F 3 5
5 B 3 2
6 A 1 3
7 F 2 2
9 A 1 1
10 B 1 2
11 A 3 4
12 B 2 5
14 A 1 2
15 G 1 1

I want to throw semi_join in.
library(tidyverse)
df_test <- df2 |> semi_join(df1, by = "ID")
all.equal(df3, df_test)
#> [1] TRUE

Related

I have a list of data frames and a character vector. I want to rename the second column of each data frame by iterating through the vector. How do I?

I have a list of dataframes. Each of these dataframes has the same number of columns and rows, and has a similar data structure:
df.list <- list(data.frame1, data.frame2, data.frame3)
I have a vector of characters:
charvec <- c("a","b","c")
I want to replace the column name of the second column in each data frame by iterating through the above character vector. For example, the first data frame's second column should be "a". The second data frame's second column should be "b".
[[1]]
col1 a
1 1 2
2 2 3
[[2]]
col1 b
1 1 2
2 2 3
A reproducible example:
charvec <- c("a","b","c")
df_list <- list(df1 = data.frame(x = seq_len(3), y = seq_len(3)), df2 = data.frame(x = seq_len(4), y = seq_len(4)), df3 = data.frame(x = seq_len(5), y = seq_len(5)))
for(i in seq_along(df_list)){
names(df_list[[i]])[2] <- charvec[i]
}
> df_list
$df1
x a
1 1 1
2 2 2
3 3 3
$df2
x b
1 1 1
2 2 2
3 3 3
4 4 4
$df3
x c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
Also can use map2 from purrr. Thanks to #ismirsehregal for example data.
library(purrr)
map2(
df_list,
charvec,
\(x, y) {
names(x)[2] <- y
x
}
)
Output
$df1
x a
1 1 1
2 2 2
3 3 3
$df2
x b
1 1 1
2 2 2
3 3 3
4 4 4
$df3
x c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

From axis values to coodinates pairs [duplicate]

I have two vectors of integers, say v1=c(1,2) and v2=c(3,4), I want to combine and obtain this as a result (as a data.frame, or matrix):
> combine(v1,v2) <--- doesn't exist
1 3
1 4
2 3
2 4
This is a basic case. What about a little bit more complicated - combine every row with every other row? E.g. imagine that we have two data.frames or matrices d1, and d2, and we want to combine them to obtain the following result:
d1
1 13
2 11
d2
3 12
4 10
> combine(d1,d2) <--- doesn't exist
1 13 3 12
1 13 4 10
2 11 3 12
2 11 4 10
How could I achieve this?
For the simple case of vectors there is expand.grid
v1 <- 1:2
v2 <- 3:4
expand.grid(v1, v2)
# Var1 Var2
#1 1 3
#2 2 3
#3 1 4
#4 2 4
I don't know of a function that will automatically do what you want to do for dataframes(See edit)
We could relatively easily accomplish this using expand.grid and cbind.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
expand.grid(df1, df2) # doesn't work so let's try something else
id <- expand.grid(seq(nrow(df1)), seq(nrow(df2)))
out <-cbind(df1[id[,1],], df2[id[,2],])
out
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#1.1 1 3 6 b
#2.1 2 4 6 b
Edit: As Joran points out in the comments merge does this for us for data frames.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
merge(df1, df2)
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#3 1 3 6 b
#4 2 4 6 b

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which is a cumulative product of the columns a b c, however I need a reverse cumulative product i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
the column names do not matter this is just an example. Does anyone have any idea how to do this?
as per my comment, is it possible to do this on a subset of columns? e.g. only for columns b and c to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows and apply cumprod over the reverse of elements and then do the reverse
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4

Sort Data in the Table

For example, now I get the table
A B C
A 0 4 1
B 2 1 3
C 5 9 6
I like to order the columns and rows by my own defined order, to achieve
B A C
B 1 2 3
A 4 0 1
C 9 5 6
This can be accomplished in base R. First we make the example data:
# make example data
df.text <- 'A B C
0 4 1
2 1 3
5 9 6'
df <- read.table(text = df.text, header = T)
rownames(df) <- LETTERS[1:3]
A B C
A 0 4 1
B 2 1 3
C 5 9 6
Then we simply re-order the columns and rows using a vector of named indices:
# re-order data
defined.order <- c('B', 'A', 'C')
df <- df[, defined.order]
df <- df[defined.order, ]
B A C
B 1 2 3
A 4 0 1
C 9 5 6
If the defined order is given as
defined_order <- c("B", "A", "C")
and the initial table is created by
library(data.table)
# create data first
dt <- fread("
id A B C
A 0 4 1
B 2 1 3
C 5 9 6")
# note that row names are added as own id column
then you could achieve the desired result using data.table as follows:
# change column order
setcolorder(dt, c("id", defined_order))
# change row order
dt[order(defined_order)]
# id B A C
# 1: B 1 2 3
# 2: A 4 0 1
# 3: C 9 5 6

R Sum columns by index

I need to find a way to sum columns by their index,I'm working on a bigread.csv file, I'll show here a sample of the problem; I'd like for example to sum from the 2nd to the 5th and from the 6th to the 7h the following matrix:
a 1 3 3 4 5 6
b 2 1 4 3 4 1
c 1 3 2 1 1 5
d 2 2 4 3 1 3
The result has to be like this:
a 11 11
b 10 5
c 7 6
d 8 4
The columns have all different names
We can use rowSums on the subset of columns i.e 2:5 and 6:7 separately and then create a new data.frame with the output.
data.frame(df1[1], Sum1=rowSums(df1[2:5]), Sum2=rowSums(df1[6:7]))
# id Sum1 Sum2
#1 a 11 11
#2 b 10 5
#3 c 7 6
#4 d 11 4
The package dplyr has a function exactly made for that purpose:
require(dplyr)
df1 = data.frame(a=c(1,2,3,4,3,3),b=c(1,2,3,2,1,2),c=c(1,2,3,21,2,3))
df2 = df1 %>% transmute(sum1 = a+b , sum2 = b+c)
df2 = df1 %>% transmute(sum1 = .[[1]]+.[[2]], sum2 = .[[2]]+.[[3]])

Resources