Select columns matching names in a list - r

I have a data.frame
DF1
a.x.c b.y.l c.z.n d.a.pl f.e.cl
which consists of numeric columns
I also have a list
DF2
a.x.c c.z.n f.e.cl
which contains the names of some of the columns in DF1.
I need to create DF3 that would store only those columns of DF1 which have matching names in DF2.
I have tried which to find the indexes of the columns I need, but my list of column names is long, so which becomes impractical.
Could you please help? Thank you in advance.

We can use intersect to get the names that are common in both the datasets and use that to subset the columns of 'DF1' to create 'DF3'.
DF3 <- DF1[intersect(names(DF1),names(DF2))]
DF3
# a.x.c c.z.n
#1 1 7
#2 2 8
#3 3 9
data
DF1 <- data.frame(a.x.c = 1:3, b.y.l= 4:6, c.z.n=7:9)
DF2 <- list(a.x.c= 1:5, c.z.n=8:15, z.l.y=22:29)
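If the wanted names are held in a plain character vector rather than a named list, the same idea can be written with %in%. This is only a sketch: the vector wanted is a hypothetical stand-in for the names of DF2, and drop = FALSE is there so the result stays a data.frame even when a single column matches.

```r
DF1 <- data.frame(a.x.c = 1:3, b.y.l = 4:6, c.z.n = 7:9)

# Hypothetical vector of wanted column names (stands in for names(DF2))
wanted <- c("a.x.c", "c.z.n", "z.l.y")

# Logical index of the DF1 columns whose names appear in 'wanted'
keep <- names(DF1) %in% wanted
DF3 <- DF1[, keep, drop = FALSE]

# Names that were requested but do not exist in DF1
missing_names <- setdiff(wanted, names(DF1))
```

Unlike intersect, this keeps the columns in the order they have in DF1, and missing_names makes it easy to spot requested names that were silently ignored.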

Related

Merge data frames in R with partial match in common column, entries separated by semicolon

I'm trying to merge several data frames by an identifier column which contains strings (ids) separated by a semi-colon, and fill the values of the entries not found in the other data frame with NAs, and have a final common id based on the longest string (with more elements separated by a semicolon) present in the row.
The id is a string with one or more elements separated by a semicolon. In the other data frames the elements might appear in a different order, and an id might contain the same elements or additional ones. I have looked around for similar questions and tried regex_inner_join, as well as posts that use a fuzzy join or an agrep type of join. In my case, however, each element is always spelled identically (it is a protein ID, so there are no misspellings), and what I have tried has either not worked or taken a very long time. Each data frame contains around 5000 rows.
I have also tried adding a row id, separating the identifier column into rows, doing the same for the other data frame, merging, and then removing duplicates of the id within each group, followed by merging back to the original data frames, rbind-ing both and making them "wide". This is too cumbersome, and I have not managed to reach a common merged data frame, let alone the one I am looking for.
library(tidyr)  # for separate_rows()
df1$id <- seq_len(nrow(df1))
df1_long <- separate_rows(df1, name, sep = ";")
the same for df2 (giving df2_long), and then
df <- merge(df1_long, df2_long, by = "name", all = TRUE)
As for the common id I have come up with
df %>%
rowwise() %>%
mutate(common = c_across(starts_with("name"))[which.max(nchar(c_across(starts_with("name"))))])
An example of the type of data frames:
df1 <- data.frame(name = c("a;b;c", "d", "k;p", "e;f", "g", "h;i"), value = c(1, 2, 3, NA, 2, 7))
df2 <- data.frame(name = c("b;c", "d;o", "e;f;z", "g;l", "x;y"), value = c(NA, 1, 5, 4, 5))
df3 <- ...
Expected result
name.df1  name.df2  value.df1  value.df2  common
a;b;c     b;c       1          NA         a;b;c
d         d;o       2          1          d;o
k;p       NA        3          NA         k;p
e;f       e;f;z     NA         5          e;f;z
g         g;l       2          4          g;l
h;i       NA        7          NA         h;i
NA        x;y       NA         5          x;y
Ideally I would need a solution that can be expanded to more than 2 data frames.
Any help will be appreciated.
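One possible base R sketch of the partial match described above, under stated assumptions: two rows are treated as the same entry when their ids share at least one element, only the first matching df2 row is used for each df1 row, and "longest" is taken to mean the id with the most elements. The helper names (tok1, tok2, hit, n_tokens) are illustrative only, and ids that match several rows are not handled.

```r
df1 <- data.frame(name = c("a;b;c", "d", "k;p", "e;f", "g", "h;i"),
                  value = c(1, 2, 3, NA, 2, 7), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("b;c", "d;o", "e;f;z", "g;l", "x;y"),
                  value = c(NA, 1, 5, 4, 5), stringsAsFactors = FALSE)

# Split each id into its elements
tok1 <- strsplit(df1$name, ";", fixed = TRUE)
tok2 <- strsplit(df2$name, ";", fixed = TRUE)

# For each df1 row: index of the first df2 row sharing an element, else NA
hit <- vapply(tok1, function(t1) {
  m <- which(vapply(tok2, function(t2) any(t1 %in% t2), logical(1)))
  if (length(m)) m[1] else NA_integer_
}, integer(1))

# All df1 rows, with the matching df2 row alongside (NA where none)
matched <- data.frame(name.df1 = df1$name, name.df2 = df2$name[hit],
                      value.df1 = df1$value, value.df2 = df2$value[hit],
                      stringsAsFactors = FALSE)

# df2 rows that were never matched
rest <- setdiff(seq_len(nrow(df2)), hit)
only2 <- data.frame(name.df1 = rep(NA_character_, length(rest)),
                    name.df2 = df2$name[rest],
                    value.df1 = rep(NA_real_, length(rest)),
                    value.df2 = df2$value[rest],
                    stringsAsFactors = FALSE)

res <- rbind(matched, only2)

# Common id: whichever of the two ids has more elements
n_tokens <- function(x) {
  ifelse(is.na(x), 0L, lengths(strsplit(x, ";", fixed = TRUE)))
}
res$common <- ifelse(n_tokens(res$name.df1) >= n_tokens(res$name.df2),
                     res$name.df1, res$name.df2)
res
```

To cover more than two data frames, one option would be to rename common to name and repeat the same steps against the next data frame, although the value columns would then need distinct suffixes.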

How to replace field names in a dataframe with a vector in r?

I have two dataframes and I am trying to use one character vector from one dataframe for the field names of the other dataframe. Here is an example of what I'm trying to do.
df1 <- data.frame(a1 = c(1,2,3,4), a2 = c(5,6,7,8))
df2 <- data.frame(ID = c("a1","a2"), name = c("one","two"))
I want to replace the field names of df1 with the character vector df2$name to get:
one two
1 5
2 6
3 7
4 8
Any solutions?
We can use match to get the index of matching elements between the column names of 'df1' and the 'ID' column of 'df2', use that index to look up the corresponding 'name' in 'df2', and assign the result to the column names of 'df1'.
names(df1) <- df2$name[match(names(df1), df2$ID)]
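Note that match returns NA for any column name that has no entry in df2$ID, which would turn that column's name into NA. A hedged variant that renames only the matched columns is sketched below; the extra column a3 is added purely for illustration.

```r
df1 <- data.frame(a1 = c(1, 2, 3, 4), a2 = c(5, 6, 7, 8), a3 = 9:12)
df2 <- data.frame(ID = c("a1", "a2"), name = c("one", "two"),
                  stringsAsFactors = FALSE)

# Position of each df1 column name in df2$ID (NA when there is no entry)
idx <- match(names(df1), df2$ID)

# Rename only the columns that have a match; leave the others untouched
found <- !is.na(idx)
names(df1)[found] <- df2$name[idx[found]]

names(df1)
```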

Problems when trying to join multiple dataframes in R

I have three data frames: df1, df2, df3 with the same number of columns and rows, in the same order. Their column names are exactly the same except for the last three columns (41:43), which are specific to each df (e.g.: col41df1, col42df1, col43df1...col41df2, col42df2, col43df2...col41df3, col42df3, col43df3...).
I wanted to join the three data frames so that the columns that are specific to each would be appended at the end and I would end up with a data frame with 49 columns, rather than 43. I managed that with:
df_merged <- df1 %>%
left_join(df2)%>%
left_join(df3)
However, something goes wrong during the join because df_merged appears to have 6 NA values while none of the original data frames I joined had any.
Help please?
Thanks!
Since the rows are in the same order across all 3 dataframes, there's no need to use a join. Instead, simply grab the 3 columns you want from the second and third dataframes and attach them to the first, as such:
df_merged <- cbind(df1, df2[, 41:43], df3[, 41:43])
Here is an example:
df1 <- data.frame(id=c(1,2,3), value=c(5,10,25))
df2 <- data.frame(id=c(1,2,3), value=c(3,6,9), morevalues=c(4,5,9))
# Append columns 2 and 3 of df2 to df1 (base R only, no packages needed)
merged_df <- data.frame(df1, df2[, 2:3])
merged_df
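Binding by position is only safe if the rows really do line up. A minimal sketch of checking that before binding is shown below; the toy columns (id, grp, col_df1, col_df2) are made up for illustration, and identical is a strict comparison that may be too strict for floating-point columns.

```r
df1 <- data.frame(id = 1:3, grp = c("x", "y", "z"), col_df1 = c(5, 10, 25))
df2 <- data.frame(id = 1:3, grp = c("x", "y", "z"), col_df2 = c(3, 6, 9))

# Columns present in both data frames
shared <- intersect(names(df1), names(df2))

# Stop early if the shared columns differ, i.e. the rows are not aligned
stopifnot(identical(df1[shared], df2[shared]))

# Append only the columns that are specific to df2
merged <- cbind(df1, df2[setdiff(names(df2), shared)])
merged
```

If the check fails, the shared columns differ somewhere, which is also the kind of mismatch that makes a join on all common columns leave NA values behind.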

How to substitute some values of one dataframe with other data frame in R?

I have two large data frames with the same columns and rows, and I need to replace the NA values in the first with the corresponding values from the second. For example, assume data frame "DF1" is
DF1 <- data.frame(a=c(1,NA,3), b=c(4,NA,6))
and "DF2" is
DF2 <- data.frame(a=c(NA,2,NA), b=c(3,5,6))
Wherever there is an NA in "DF1", I want to take the value from "DF2" instead and store the result in a new "DF3", i.e.
a b
1 4
2 5
3 6
Could you help me with this please?
This should do the trick:
DF3 <- DF1
replace.bool.matrix <- is.na(DF1)
DF3[replace.bool.matrix] <- DF2[replace.bool.matrix]
Explanation:
We create DF3, which is a copy of DF1. Then we make a logical matrix replace.bool.matrix which we use to select the values in DF3 to replace, as well as the values in DF2 to replace them with.
This relies on indexing a data frame with a logical matrix, which is covered in many tutorials on subsetting data frames.
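The same replacement can also be written column by column, which some readers find easier to follow. This is only an alternative sketch in base R; it assumes DF1 and DF2 have the same columns in the same order.

```r
DF1 <- data.frame(a = c(1, NA, 3), b = c(4, NA, 6))
DF2 <- data.frame(a = c(NA, 2, NA), b = c(3, 5, 6))

DF3 <- DF1
# Walk over matching columns; take the DF2 value wherever DF1 is NA
DF3[] <- Map(function(x, y) ifelse(is.na(x), y, x), DF1, DF2)
DF3
```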
This is much easier with the match() function:
df1 <- data.frame(steps=c(NA,NA,NA,NA,NA,NA,NA,NA),date=c('2012-10-01','2012-10-01','2012-10-01','2012-10-01','2012-10-01','2012-10-01','2012-10-02','2012-10-02'), interval=c(0,5,10,15,20,25,0,5))
df2 <- data.frame(Interval=c(0,5,10,15,20,25),x=c(1.716,0.339,0.132,0.151,0.075,2.094))
# if() cannot test a whole vector, so index the NA rows directly
na_rows <- is.na(df1$steps)
df1$steps[na_rows] <- df2$x[match(df1$interval[na_rows], df2$Interval)]

creating new dataframe with matching ids in two different table that do not match

I am trying to merge two dataframes by id. I want to first merge all the ids that match and then find the ones that don't match. I found the merge function, which can merge the common ids, for example:
m1 = merge(df1, df2, by=c("id"))
Now I am trying to create a new dataframe with the ids of dataframe 2 that do not match dataframe 1.
Could you please advise which command I should look for?
For example:
I have the following two datasets:
df1
df2
I am trying to create a new dataframe with the ids from df2 that are not in df1, for example id = "a3" and "c3" in df2.
My sample data:
df1 =data.frame(id= c("a1","a2","b1","b2","c1","c2"), value= 1:6)
df2 =data.frame(id= c("a1","a2","a3","b1","c1","c3"), value= 7:12)
Many thanks, Ayan
If you want to use merge, here is one way to do it:
df_merged <- merge(df2, df1, by.x="id", by.y="id", all.x=TRUE)
df_merged[is.na(df_merged$value.y),]
id value.x value.y
3 a3 9 NA
6 c3 12 NA
Since the column names are identical in both data.frames and merge joins on all common column names by default, you have to tell the function explicitly which column to use, here id.
But you should ask yourself whether you really want to merge here. If you just want those rows in df2 that are not in df1, why not use something like this?
df2[!(df2$id %in% df1$id), ]
id value
3 a3 9
6 c3 12
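The same comparison can be made on the id vectors alone with the base R set functions, which is handy when all three groups are needed (only in df2, only in df1, in both). A small sketch using the sample data from the question; the variable names only_df2, only_df1 and in_both are illustrative.

```r
df1 <- data.frame(id = c("a1", "a2", "b1", "b2", "c1", "c2"), value = 1:6,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("a1", "a2", "a3", "b1", "c1", "c3"), value = 7:12,
                  stringsAsFactors = FALSE)

only_df2 <- setdiff(df2$id, df1$id)    # ids present only in df2
only_df1 <- setdiff(df1$id, df2$id)    # ids present only in df1
in_both  <- intersect(df1$id, df2$id)  # ids present in both

# Rows of df2 whose id does not occur in df1
df2_unmatched <- df2[df2$id %in% only_df2, ]
df2_unmatched
```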
