Comparing two files and outputting common elements - R

I have two files, each with 3 columns and hundreds of rows. I want to compare the first two columns of the two files and list the rows whose values in those two columns are common to both files. To that list I then need to add the third column of the second file, i.e. the values that the second file holds for each common pair of numbers.
For example, consider two files of 6 rows and 3 columns
First file -
1 2 3
2 3 4
4 6 7
3 8 9
11 10 5
19 6 14
Second file -
1 4 1
2 1 4
4 6 10
3 7 2
11 10 3
19 6 5
As I said, I have to compare the first two columns and then add the third column of the second file to that list. Therefore, the output must be:
4 6 10
11 10 3
19 6 5
I have the following code; however, it throws an "object not found" error, and I am also not able to add the third column. Please help :)
Here df2 holds the first file and df3 holds the second file. The code is in R.
s1 = 1
for(i in 1:nrow(df2)){
for(j in 1:nrow(df3)){
if(df2[i,1] == df3[j,1]){
if(df2[i,2] == df3[j,2]){
common.rows1[s1,1] <- df2[i,1]
common.rows1[s1,2] <- df2[i,2]
s1 = s1 + 1
}
}
}

You can use the %in% operator twice to subset your second data.frame (I call it df2, and the first one df1):
df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2,]
# V1 V2 V3
#3 4 6 10
#5 11 10 3
#6 19 6 5
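V1 and V2 in my example are the column names of df1 and df2. Note that the two %in% tests are evaluated independently, so a row is also kept when its V1 and its V2 each occur in df1 but in different rows; on the example data this happens to give the right result.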
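To illustrate the difference between testing each column separately and testing the (V1, V2) pair, here is a minimal sketch. The data below is made up for the illustration and is not taken from the question; pasting the two columns into a single key is just one possible way to do a pair-wise test.

```r
# Hypothetical data: no (V1, V2) pair of df2 occurs in df1,
# yet every single value of df2 occurs somewhere in df1.
df1 <- data.frame(V1 = c(1, 4), V2 = c(6, 2))
df2 <- data.frame(V1 = c(4, 1), V2 = c(6, 2), V3 = c(10, 20))

# Column-wise test: keeps both rows although neither pair is shared.
loose <- df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2, ]

# Pair-wise test: build one key per row from the two columns.
key1 <- paste(df1$V1, df1$V2, sep = "_")
key2 <- paste(df2$V1, df2$V2, sep = "_")
strict <- df2[key2 %in% key1, ]

nrow(loose)   # 2
nrow(strict)  # 0
```

If the pairs matter, as they do in the question, the pair-wise variant is the safer choice.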

It seems that this is the perfect use-case for merge, e.g.
merge(d1[c('V1','V2')],d2)
results in:
V1 V2 V3
1 11 10 3
2 19 6 5
3 4 6 10
In which 'V1' and 'V2' are the column names of interest.
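A self-contained sketch of the merge approach on the question's example data follows. Naming the key columns through the by argument and calling unique() on the keys are additions for illustration: the first makes the join columns explicit, the second avoids duplicated output rows should the first file contain the same pair more than once.

```r
# Example data from the question (column names V1..V3 are assumed).
d1 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                 V2 = c(2, 3, 6, 8, 10, 6),
                 V3 = c(3, 4, 7, 9, 5, 14))
d2 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                 V2 = c(4, 1, 6, 7, 10, 6),
                 V3 = c(1, 4, 10, 2, 3, 5))

# Keep only the key columns of the first file, drop repeated pairs,
# then inner-join with the second file so that V3 comes from d2.
keys <- unique(d1[c("V1", "V2")])
common <- merge(keys, d2, by = c("V1", "V2"))
common
```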

A data.table approach:
library(data.table)
setDT(df1)
setDT(df2)
setkey(df1, V1, V2)
setkey(df2, V1, V2)
df2[df1[, -3, with = FALSE], nomatch = 0]
## V1 V2 V3
## 1: 4 6 10
## 2: 11 10 3
## 3: 19 6 5

If your two tables are d1 and d2,
d1<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(2, 3, 6, 8, 10, 6),
V3 = c(3, 4, 7, 9, 5, 14)
)
d2<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(4, 1, 6, 7, 10, 6),
V3 = c(1, 4, 10, 2, 3, 5)
)
then you can subset d2 (in order to keep the third column) with
d2[interaction(d2$V1, d2$V2) %in% interaction(d1$V1, d1$V2),]
interaction() treats the first two columns as a combined key, so rows are matched on the (V1, V2) pair rather than on each column separately.
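As for the "object not found" error in the question's loop: a likely cause is that common.rows1 is never created before the loop assigns into it. The sketch below is one way to make the loop run, assuming the two files were read into df2 and df3 with columns named V1, V2 and V3 (the column names are an assumption). Instead of filling a data frame cell by cell, it collects the matching row numbers of the second file and subsets once at the end, which also brings the third column along.

```r
# First and second file from the question (column names assumed).
df2 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                  V2 = c(2, 3, 6, 8, 10, 6),
                  V3 = c(3, 4, 7, 9, 5, 14))
df3 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                  V2 = c(4, 1, 6, 7, 10, 6),
                  V3 = c(1, 4, 10, 2, 3, 5))

keep <- integer(0)  # row numbers of df3 whose first two columns match
for (i in seq_len(nrow(df2))) {
  for (j in seq_len(nrow(df3))) {
    if (df2[i, 1] == df3[j, 1] && df2[i, 2] == df3[j, 2]) {
      keep <- c(keep, j)
    }
  }
}

# Subset the second file once; unique() guards against repeated matches.
common.rows1 <- df3[unique(keep), ]
common.rows1
```

The nested loop compares every row with every other row, so for large files the vectorised answers above are preferable.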

Related

How to list R data frame variables in alphabetical order? [duplicate]

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I would like to order the columns alphabetically by column name, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 1 4 0
# 2 3 2 2
# 3 8 4 4
# 4 3 7 7
# 5 2 8 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.
You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. How you do this is up to you, but swapping whatever function does this in place of order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.
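Matching the column names against a target vector could look like the following sketch. The target vector and the guard against unknown names are illustrative assumptions, not part of the original answer.

```r
test <- data.frame(C = c(0, 2, 4, 7, 8),
                   A = c(4, 2, 4, 7, 8),
                   B = c(1, 3, 8, 3, 2))

# Desired column order (hypothetical; could be read from a file for
# data sets with thousands of variables).
target <- c("B", "A", "C")

# Position of each target name among the existing column names.
idx <- match(target, names(test))
stopifnot(!anyNA(idx))  # every target name must exist in the data frame

reordered <- test[, idx]
names(reordered)  # "B" "A" "C"
```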
Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
library(dplyr)
test %>%
select(sort(names(.)))
test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
The columns can be reordered by simply listing them in the desired order (practical only if the data frame does not have many columns):
test <- test[, c("A", "B", "C")]
and for a custom order:
test <- test[, c("B", "A", "C")]
An alternative option is to use str_sort() from the stringr package with the argument numeric = TRUE. This orders column names that contain numbers by their numeric value rather than purely alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] "V1"  "V3"  "V10"
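If adding stringr is not an option, a similar ordering can be approximated in base R. This is only a sketch of a different technique, and it assumes every column name is a non-numeric prefix followed by an integer (as in V1, V3, V10); names that do not fit this pattern would need other handling. The data frame df is made up for the example.

```r
# Hypothetical data frame whose names end in a number.
df <- data.frame(V3 = 1:2, V1 = 3:4, V10 = 5:6)

# Strip the leading non-digit characters and read the rest as an integer.
suffix <- as.integer(sub("^[^0-9]*", "", names(df)))

# Order the columns by that numeric suffix.
df_sorted <- df[, order(suffix)]
names(df_sorted)  # "V1" "V3" "V10"
```

A plain sort(names(df)) would instead put "V10" before "V3".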
test[, sort(names(test))]
Calling sort() on the column names works just as well.
If you only want one or more columns in the front and don't care about the order of the rest:
library(dplyr)
test %>%
select(B, everything())
So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]
Here is what I found to solve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") to the beginning of the data frame, with the alphabetically ordered columns following them.
select(test, D, G, K, everything())
Similar to the syntax above, but for learning purposes: you can also sort the column names themselves. Note that this returns only the sorted names, not a reordered data frame:
sort(colnames(test))
Another option is:
mtcars %>% dplyr::select(order(names(mtcars)))
In data.table you can use the function setcolorder. From its documentation:
setcolorder reorders the columns of data.table, by reference, to the new order provided.
Here is a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)

Create then populate columns in a dataframe

Hello, I'm trying to find a way to create new columns in a dataframe and then populate them.
For example:
id = c(2, 3, 5)
v1 = c(2, 1, 7)
v2 = c(1, 9, 5)
duration=c(v1+v2)
df = data.frame(id,v1,v2,duration,stringsAsFactors=FALSE)
id v1 v2 duration
1 2 2 1 3
2 3 1 9 10
3 5 7 5 12
Now I want to create new columns by dividing each value of a row by the 'duration' of that row. I know how to do it manually, but it is prone to errors and not really elegant...
df$I_v1=v1/duration
df$I_v2=v2/duration
Or is df <- df %>% mutate(I_v1 = v1/duration) quicker/better?
id v1 v2 duration I_v1 I_v2
1 2 2 1 3 0.6666667 0.3333333
2 3 1 9 10 0.1000000 0.9000000
3 5 7 5 12 0.5833333 0.4166667
It works, but I would like to know if it's possible to create and name the columns and populate them automatically.
Say that you have a cols vector containing the names of the columns you want to manipulate. In your example:
cols<-c("v1","v2")
Then you can try:
df[paste0("I_",cols)]<-df[cols]/df$duration
# id v1 v2 duration I_v1 I_v2
#1 2 2 1 3 0.6666667 0.3333333
#2 3 1 9 10 0.1000000 0.9000000
#3 5 7 5 12 0.5833333 0.4166667
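When there are many such columns, the cols vector itself could be built from the column names rather than typed by hand. The sketch below assumes the value columns are named v followed by digits (v1, v2, ...); that naming pattern is an assumption based on the example, as is computing duration with rowSums().

```r
df <- data.frame(id = c(2, 3, 5), v1 = c(2, 1, 7), v2 = c(1, 9, 5))

# Pick every column whose name is "v" followed by digits (assumed pattern).
cols <- grep("^v[0-9]+$", names(df), value = TRUE)

# Row-wise total of the selected columns.
df$duration <- rowSums(df[cols])

# Create one I_<name> column per selected column in a single assignment.
df[paste0("I_", cols)] <- df[cols] / df$duration
df
```

The new columns are named after the originals, so adding a v3 column to the input requires no change to the code.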
You can use transform():
df <- data.frame(id=c(2, 3, 5), v1=c(2, 1, 7), v2=c(1, 9, 5))
df$duration <- df$v1 + df$v2 # or ... <- with(df, v1 + v2)
df_new <- transform(df, I_v1=v1/duration, I_v2=v2/duration)
... or (if you have many columns v1, v2, ...):
as.matrix(df[, 2:3])/df$duration # or with cbind():
cbind(df, as.matrix(df[, 2:3])/df$duration)
(similar to the answer from nicola)
All data frames have a row names attribute, a character vector of length the number of rows with no duplicates nor missing values. You can name the rows as:
row.names(x) <- value
Arguments:
x
object of class "data.frame", or any other class for which a method has been defined.
value
an object to be coerced to character unless an integer vector.
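As a small illustration of row.names<- on the question's data, the sketch below names each row after its id. The "id_" prefix is an arbitrary choice made for the example.

```r
df <- data.frame(id = c(2, 3, 5), v1 = c(2, 1, 7), v2 = c(1, 9, 5))

# Build unique, non-missing row names from the id column.
row.names(df) <- paste0("id_", df$id)

# Rows can now be addressed by name.
df["id_3", "v2"]  # 9
```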

Matching dataframe columns: one int and another is list

Trying to create a column in dataframe df1 based on match in another dataframe df2, where df1 is much bigger than df2:
df1$val2 <- df2$val2[match(df1$id, df2$IDs)]
This doesn't quite work because df2$IDs column is a list:
> df2
IDs val2
1 0 1
2 1, 2 2
3 3, 4 3
4 5, 6 4
5 7, 8 5
6 9, 10 6
7 11, 12, 13, 14 7
It only works for the part where the list has 1 element (row 1: ..$ : int 0 above). For all other rows the 'match(df1$id, df2$IDs)' returns NA.
Test of matching some individual numbers works just fine with double brackets:
2 %in% df2[[2,'IDs']]
So I either need to modify the column df2$IDs or perform the match operation differently. Both df1 and df2 have many other columns, but df2 has far fewer rows.
The case can be reproduced with the following:
IDs <- c("[0]", "[1, 2]", "[3, 4]", "[5, 6]", "[7, 8]", "[9, 10]", "[11, 12, 13, 14]")
val2 <- c(1,2,3,4,5,6,7)
df2 <- data.frame(IDs, val2)
df2$IDs <- lapply(strsplit(as.character(df2$IDs), ','), function (x) as.integer(gsub("\\s|\\[|\\]", "", x)))
id <- floor(runif(100, min=0, max=15))
df1 <- data.frame(id)
str(df1)
str(df2)
df1$val2 <- df2$val2[match(df1$id, df2$IDs)]
List columns are clumsy to work with. If you convert df2 to a more vanilla format, it works:
DF2 = with(df2, data.frame(ID = unlist(IDs), val2 = rep(val2, lengths(IDs))))
df1$m = DF2$val2[ match(df1$id, DF2$ID) ]
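The flattening step can be tried on a smaller, deterministic example. The data below is a shortened, made-up version of the question's df2, and the name lookup is hypothetical.

```r
# Small df2 with a list column of integer IDs.
df2 <- data.frame(val2 = 1:3)
df2$IDs <- list(0L, c(1L, 2L), c(3L, 4L))

# df1 holds single ids; 9 has no partner in df2.
df1 <- data.frame(id = c(2L, 0L, 4L, 9L))

# One row per individual ID, repeating val2 once for each ID in the list.
lookup <- data.frame(ID = unlist(df2$IDs),
                     val2 = rep(df2$val2, lengths(df2$IDs)))

# Ordinary match() now works because ID is a plain integer vector.
df1$val2 <- lookup$val2[match(df1$id, lookup$ID)]
df1
```

Ids without a match, such as 9 here, end up as NA in the new column.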
If you want list columns just for browsing, it is quick to do...
aggregate(ID ~ ., DF2, list)
val2 ID
1 1 0
2 2 1, 2
3 3 3, 4
4 4 5, 6
5 5 7, 8
6 6 9, 10
7 7 11, 12, 13, 14
FYI, the match approach will not extend naturally to joining on more columns, so you might eventually want to learn data.table and its "update join" syntax for this case:
library(data.table)
setDT(df1); setDT(df2)
DT2 = df2[, .(ID = unlist(IDs)), by=setdiff(names(df2), "IDs")]
df1[DT2, on=.(id = ID), v := i.val2 ]



Resources