How to find the matched names in different datasets' columns? - r

I have 2 datasets with columns having the same names.
a:
A B C
1 2 3
5 6 7
b:
B E A
2 3 4
9 1 2
How can I find the column indices with the matched names?
I tried converting each dataset from wide to long format with gather() and then matching them with match(a, b), but it didn't work.

#Find common column names in the two dataframes
intersect(names(a), names(b))
#[1] "A" "B"
#Find the column number in a which is present in b
which(names(a) %in% names(b))
#[1] 1 2
#find the column number in b which is present in a
which(names(b) %in% names(a))
#[1] 1 3

I personally like to use grep for this
grep(pattern = paste(names(a), collapse = "|") , x = names(b))
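One caveat with the grep approach is that the pattern is a regular expression, so it matches substrings: a column called "A" would also hit a column called "AB". The sketch below uses hypothetical column names (not the ones from the question) to show the difference between the unanchored pattern, an anchored one, and plain match(). It assumes the names contain no regex metacharacters.

```r
# Hypothetical column names chosen to show the partial-match pitfall
names_a <- c("A", "B", "C")
names_b <- c("B", "E", "A", "AB")

# Unanchored alternation: "AB" is matched because it contains "A"
loose <- grep(paste(names_a, collapse = "|"), names_b)

# Anchored alternation: only whole-name matches are kept
strict <- grep(paste0("^(", paste(names_a, collapse = "|"), ")$"), names_b)

# match() gives, for each name in a, its position in b (NA if absent)
pos_in_b <- match(names_a, names_b)

print(loose)
print(strict)
print(pos_in_b)
```

If only exact names matter, the %in% and match() versions avoid the issue without any regex.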

Related

changing column names of a data frame by changing values - R

Suppose I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to rename the columns whose names include "open" to "a" and the columns whose names include "close" to "b".
That is, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefixes (here "df.") change, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
names(dat)[grep('open$', names(dat))] <- 'a'
names(dat)[grep('close$', names(dat))] <- 'b'
dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to @akrun's insightful suggestion, both renames can be done in one call. We pass character vectors as the pattern and replacement arguments of str_replace. Each of them may have length one or more; if both are longer than one, their lengths must correspond. Because str_replace is also vectorised over the input strings, each pattern here is paired with the column name in the same position. The documentation adds:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
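Since the patterns in the str_replace call are paired with the column names by position, the result depends on the columns appearing in the order open, close. A minimal base R sketch that does not depend on column order is shown here; the data frame df_rev, the extra id column, and the patterns vector are made up for illustration.

```r
# Same kind of columns as in the question, but in reverse order,
# plus a hypothetical "id" column that should be left alone
df_rev <- data.frame(df.close = c(2, 8, 3), df.open = c(1, 4, 5), id = 1:3)

# Named vector: regex pattern -> replacement name
patterns <- c(".*\\.open$" = "a", ".*\\.close$" = "b")

# Apply every pattern to every column name in turn
new_names <- names(df_rev)
for (p in names(patterns)) {
  new_names <- sub(p, patterns[[p]], new_names)
}
names(df_rev) <- new_names

print(names(df_rev))
```

Each pattern is tried against every name, so names that match neither pattern pass through unchanged.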
Another base R option using gsub + match + setNames
setNames(
df,
c("a", "b")[match(
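# capture the trailing "open"/"close"; a character class like [^open|close] only works by accident
gsub(".*(open|close)$", "\\1", names(df)),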
c("open", "close")
)]
)
gives
a b
1 1 2
2 4 8
3 5 3

Duplicating R dataframe vector values using another vector as a guide

I have the following R dataframe: df = data.frame(value=c(5,4,3,2,1), a=c(2,0,1,6,9), b=c(7,0,0,3,4)). I would like to duplicate the values of a and b by the number of times of the corresponding position values in value. For example, Expanding b would look like b_ex = c(7,7,7,7,7,2,2,2,4). No values of three or four would be in b_ex because values of zero are in b[2] and b[3]. The expanded vectors would be assigned names and be stand-alone.
Thanks!
Maybe you are looking for :
result <- lapply(df[-1], function(x) rep(x[x != 0], df$value[x != 0]))
#$a
#[1] 2 2 2 2 2 1 1 1 6 6 9
#$b
#[1] 7 7 7 7 7 3 3 4
To have them as separate vectors in global environment use list2env :
list2env(result, .GlobalEnv)
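To make the mechanics of the rep() call explicit, here is the same computation for column b alone, using the values from the question. The variable names keep and b_ex are only for illustration.

```r
value <- c(5, 4, 3, 2, 1)
b     <- c(7, 0, 0, 3, 4)

# Positions where b is non-zero
keep <- b != 0

# rep() with a vector 'times' repeats each element its own number of times:
# 7 five times, 3 twice, 4 once
b_ex <- rep(b[keep], times = value[keep])

print(b_ex)
```

The length of the result equals sum(value[keep]). Note that list2env(result, .GlobalEnv) assigns by name, so it would overwrite any existing objects called a or b.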

splitting and renaming repeated columns in data frame in R

I'm very new to R and working on tidying a data set. I have a large number of columns, where some columns (in .CSV file) contain several comma separated names. For example, I need to split and duplicate the column and give the comma-separated-names individually to each column:
However, I may have a more complicated situation in which several columns (with different numerical values) share the same repeated names. These columns should be split (one column per name), and suffixes should be added to the repeated names ('.1', or even '.2' if they repeat more often), see here:
I am actively exploring how to do it, but still no luck. Any help would be highly appreciated.
Here's one way:
First let's create some dummy example data using data.table::fread
library(data.table)
dt = fread(
"a b c,d e f,g,h
1 2 3 4 5
1 2 3 4 5", sep=' ')
# a b c,d e f,g,h
#1: 1 2 3 4 5
#2: 1 2 3 4 5
cols = names(dt)
Now we use stringr to count occurrences of commas in the names, and add columns accordingly. We use recycling in the matrix statement to fill the new adjacent columns with the same values.
library(stringr)
dt.new = dt[, lapply(cols, function(x) matrix(get(x), NROW(dt), str_count(x, ',')+1L))]
names(dt.new) <- unlist(strsplit(cols, ','))
dt.new
# a b c d e f g h
# 1: 1 2 3 3 4 5 5 5
# 2: 1 2 3 3 4 5 5 5
Similarly, in case you prefer to use a base data.frame rather than data.table we can instead do
dt.new = data.frame(lapply(cols, function(x) matrix(dt[[x]], NROW(dt), str_count(x,',')+1L)))
names(dt.new) <- unlist(strsplit(cols, ','))
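The question also asks for '.1', '.2' suffixes when a name is repeated across columns, which the code above does not add. One possible sketch in base R uses make.unique() on the split names. The headers and values below are hypothetical, chosen so that some names repeat.

```r
# Hypothetical headers in which "a", "b" and "c" each occur twice
cols <- c("a", "b,c", "b", "c,a")
d <- setNames(data.frame(1:2, 3:4, 5:6, 7:8), cols)

# Number of names packed into each header
n_split <- lengths(strsplit(cols, ","))

# Repeat each column once per name it carries
expanded <- d[rep(seq_along(cols), n_split)]

# Split the headers and append .1, .2, ... to later duplicates
new_names <- make.unique(unlist(strsplit(cols, ",")))
names(expanded) <- new_names

print(expanded)
```

make.unique() leaves the first occurrence of a name unchanged and numbers the later ones, so whether that matches the desired naming for the first column is a judgment call.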

how to name data frame columns to column index

This is a very basic question. How can you set the column names of a data frame to the column index? So if you have 4 columns, the column names will be 1 2 3 4. The data frame I am using can have up to 100 columns.
It is not a good idea to give columns names that start with a number. Suppose we set the names to seq_along(D). Extracting a column then becomes unnecessarily complicated. For example,
names(D) <- seq_along(D)
D$1
#Error: unexpected numeric constant in "D$1"
In that case, we may need backticks or ""
D$"1"
#[1] 1 2 3
D$`1`
#[1] 1 2 3
However, [[ with a character name works
D[["1"]]
#[1] 1 2 3
I would use
names(D) <- paste0("Col", seq_along(D))
D$Col1
#[1] 1 2 3
Or
D[["Col1"]]
#[1] 1 2 3
data
D <- data.frame(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9),d=c(10,11,12))
Just use names:
D <- data.frame(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9),d=c(10,11,12))
names(D) <- 1:ncol(D) # sequence from 1 through the number of columns
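A small sketch of what this assignment actually stores may help: the integers are coerced to character, so the names are "1", "2", ... rather than numbers, and indexing by position and by name remain two different things. The two-column D here is a shortened stand-in for the data above.

```r
D <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
names(D) <- seq_along(D)

# The names are stored as character strings
print(names(D))

# D[[2]] selects by position, D[["2"]] selects by name;
# here they happen to point at the same column
by_position <- D[[2]]
by_name     <- D[["2"]]
```

The two lookups only agree as long as the columns are never reordered or dropped, which is one more reason to prefer names such as Col1, Col2.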

match values in dataframes with values in a column

I have two data.frames that looks like these ones:
>df1
V1
a
b
c
d
e
>df2
V1 V2
1 a,k,l
2 c,m,n
3 z,b,s
4 l,m,e
5 t,r,d
I would like to match the values in df1$V1 with those in df2$V2 and add a new column to df1 that holds the value of df2$V1 for the matching row. The desired output would be:
>df1
V1 V2
a 1
b 3
c 2
d 5
e 4
I've tried this approach but only works if df2$V2 contains just one element:
match(as.character(df1[,1]), strsplit(as.character(df2[,2]), ",")) -> idx
df1$V2 <- df2[idx,1]
Many thanks
You can just use grep, which will return the position of the string found:
sapply(df1$V1, grep, x = df2$V2)
# a b c d e
# 1 3 2 5 4
If you expect repeats, you can use paste.
Let's modify your data so that there is a repeat:
df2$V2[3] <- "z,b,s,a"
And modify the solution accordingly:
sapply(df1$V1, function(z) paste(grep(z, x = df2$V2), collapse = ";"))
# a b c d e
# "1;3" "3" "2" "5" "4"
Similar to Tyler's answer, but in base using stack:
df.stack <- stack(setNames(strsplit(as.character(df2$V2), ","), df2$V1))
transform(df1, V2=df.stack$ind[match(V1, df.stack$values)])
produces:
V1 V2
1 a 1
2 b 3
3 c 2
4 d 5
5 e 4
One advantage of splitting over grep is that with grep you run the risk of searching for a and matching things like alabama, etc. (though you can mitigate this by being careful with the patterns, e.g. by including word boundaries).
Note this will only find the first matching value.
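If every matching row is wanted rather than just the first, the split values can be compared token by token instead of going through match(). This is a sketch in base R using the modified data with the repeated "a"; the helper names tokens, long and all_ids are made up here.

```r
df1 <- data.frame(V1 = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:5,
                  V2 = c("a,k,l", "c,m,n", "z,b,s,a", "l,m,e", "t,r,d"),
                  stringsAsFactors = FALSE)

# One row per (id, token) pair
tokens <- strsplit(df2$V2, ",", fixed = TRUE)
long <- data.frame(id = rep(df2$V1, lengths(tokens)),
                   value = unlist(tokens),
                   stringsAsFactors = FALSE)

# For each value in df1, collect every id whose token equals it exactly
all_ids <- vapply(df1$V1,
                  function(v) paste(long$id[long$value == v], collapse = ";"),
                  character(1))

print(all_ids)
```

Because the comparison is on whole tokens, a would not match a longer token such as alabama.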
Here's an approach:
library(qdap)
key <- setNames(strsplit(as.character(df2$V2), ","), df2$V1)
df1$V2 <- as.numeric(df1$V1 %l% key)
df1
## V1 V2
## 1 a 1
## 2 b 3
## 3 c 2
## 4 d 5
## 5 e 4
First we used strsplit to create a named list. Then we used qdap's lookup operator %l% to match values and create a new column (I converted to numeric though this may not be necessary).
