How to find common variables in different data frames?

I have several data frames with similar (but not identical) series of variables (columns). I want to find a way for R to tell me what are the common variables across different data frames.
Example:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(7, 8, 9)
df1 <- data.frame(a, b, c)
b <- c(1, 3, 5)
c <- c(2, 4, 6)
df2 <- data.frame(b, c)
With df1 and df2, I would want some way for R to tell me that the common variables are b and c.

1) For 2 data frames:
intersect(names(df1), names(df2))
## [1] "b" "c"
To get the names that are in df1 but not in df2:
setdiff(names(df1), names(df2))
1a) and for any number of data frames (i.e. get the names common to all of them):
L <- list(df1, df2)
Reduce(intersect, lapply(L, names))
## [1] "b" "c"
2) An alternative is to use duplicated since the common names will be the ones that are duplicated if we concatenate the names of the two data frames.
nms <- c(names(df1), names(df2))
nms[duplicated(nms)]
## [1] "b" "c"
2a) To generalize that to n data frames use table and look for the names that occur the same number of times as data frames:
L <- list(df1, df2)
tab <- table(unlist(lapply(L, names)))
names(tab[tab == length(L)])
## [1] "b" "c"
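One caveat with approaches 2) and 2a): they count occurrences, so they implicitly assume that no single data frame repeats a column name. The sketch below illustrates this with two hypothetical data frames, `dfa` and `dfb` (not from the question), where `dfa` is built with `check.names = FALSE` so that a repeated name survives.

```r
# Hypothetical data: dfa repeats the name "x"; dfb has no "x" at all
dfa <- data.frame(x = 1:2, x = 3:4, y = 5:6, check.names = FALSE)
dfb <- data.frame(y = 1:2, z = 3:4)

# Counting-based approach: "x" is duplicated within dfa, so it is reported too
nms <- c(names(dfa), names(dfb))
by_duplicated <- nms[duplicated(nms)]

# Set-based approach: only names present in both data frames are returned
by_intersect <- intersect(names(dfa), names(dfb))

print(by_duplicated)  # "x" "y"
print(by_intersect)   # "y"
```

When repeated names are possible, `intersect` (or `Reduce(intersect, ...)`) is the safer choice.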

Use intersect:
intersect(colnames(df1),colnames(df2))
Alternatively, check the column names with %in%:
colnames(df1)[colnames(df1) %in% colnames(df2)]
Output:
[1] "b" "c"
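Once the common names are known, they can be used directly as a column index. A small sketch, assuming the goal is to keep only the shared columns and stack the two data frames (the name `stacked` is just illustrative):

```r
# Data from the question
df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
df2 <- data.frame(b = c(1, 3, 5), c = c(2, 4, 6))

# Names shared by both data frames
common <- intersect(names(df1), names(df2))

# Keep only the shared columns, then bind the rows together
stacked <- rbind(df1[common], df2[common])
print(stacked)
```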

Related

Assigning complex values to character elements of data frame in R

There are three character columns in my data frame, containing "A", "B", and "C" (the order can vary between data frames). I want to assign values to them: A = 1+0i, B = 2+3i and C = 3+2i. I used as.complex(factor(col1)), and likewise for columns two and three, but that makes all three columns equal to 1+0i!
col1 <- c("A","A", "A")
col2 <- c("B", "B","B")
col3 <- c("C","C","C")
df <- data.frame(col1,col2,col3)
print(df)
A= 1+0i
B=2+3i
C=3+2i
df2<- transform(df, col1=as.complex(as.factor(col1)),col2=as.complex(as.factor(col2)),col3=as.complex(as.factor(col3)))
sapply(df2,class)
View(df2)
So this is a weird thing you're doing. You have a column of strings, letters like "A" and "B". Then you have objects with the same names, A = 1 + 0i, etc. Normally we don't treat object names as "data", but you're sort of mixing the two here. The solution I'd propose is to make everything data: combine your A, B, and C values into a vector, and give the vector names accordingly. Then we can replace the values in the data frame with the corresponding values from our named vector:
vec = c(A, B, C)
names(vec) = c("A", "B", "C")
df[] = lapply(df, \(x) vec[x])
df
# col1 col2 col3
# 1 1+0i 2+3i 3+2i
# 2 1+0i 2+3i 3+2i
# 3 1+0i 2+3i 3+2i
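For reference, here is a self-contained variant of the lookup that should also work on older R versions. It rests on a few assumptions: the values A, B and C are those given in the question, `function(x)` replaces the `\(x)` shorthand (which needs R 4.1 or later), and `as.character()` is added in case the columns are factors, since indexing by a factor uses its integer level codes rather than its labels. That is also the likely reason the original attempt returned 1+0i everywhere: each column has a single level, so every code is 1.

```r
# Lookup values from the question, stored as a named complex vector
A <- 1+0i
B <- 2+3i
C <- 3+2i
vec <- c(A = A, B = B, C = C)

df <- data.frame(col1 = rep("A", 3), col2 = rep("B", 3), col3 = rep("C", 3),
                 stringsAsFactors = FALSE)

# Index the lookup vector by name; as.character() guards against factor columns,
# and unname() drops the element names that the lookup carries over
df[] <- lapply(df, function(x) unname(vec[as.character(x)]))

print(sapply(df, class))  # every column is now "complex"
print(df)
```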

How to identify unique columns in a data frame with respect to other data frames?

If I have several data frames, how can I identify the columns which are unique to a certain data frame?
df1 <- data.frame(A=rnorm(5), B=rnorm(5), C=rnorm(5))
df2 <- data.frame(B=rnorm(5), C=rnorm(5), D=rnorm(5))
df3 <- data.frame(B=rnorm(5), C=rnorm(5), D=rnorm(5))
What I want to achieve is something like the unique() function, that gives me the unique columns in a data frame with respect to other data frames.
unique.columns(df1, c(df2, df3))
[1] "A"
but
unique.columns(df2, c(df1, df3))
[1] NA
since there are no unique columns in df2.
You could use Reduce along with setdiff to handle any number of comparison data frames. The first data frame in the list is compared against all the others.
Reduce(setdiff, lapply(list(df1,df2,df3), names))
#[1] "A"
Reduce(setdiff, lapply(list(df2,df1,df3), names))
#character(0)
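To see why the first element plays a special role, it may help to write out what the fold does: `Reduce(setdiff, list(n1, n2, n3))` evaluates `setdiff(setdiff(n1, n2), n3)`. A minimal sketch with fixed, illustrative data (the values themselves do not matter, only the column names):

```r
# Small deterministic data frames; only the column names matter here
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(B = 1, C = 2, D = 3)
df3 <- data.frame(B = 1, C = 2, D = 3)

nms <- lapply(list(df1, df2, df3), names)

# The fold performed by Reduce ...
folded <- Reduce(setdiff, nms)

# ... is the same as nesting the calls by hand, left to right
manual <- setdiff(setdiff(nms[[1]], nms[[2]]), nms[[3]])

print(folded)  # "A"
print(identical(folded, manual))
```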
We can use setdiff and union
unique.columns <- function(df1, df2, df3) {
setdiff(names(df1), union(names(df2), names(df3)))
}
unique.columns(df1, df2, df3)
#[1] "A"
unique.columns(df2, df1, df3)
#character(0)
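To accept a variable number of data frames, change the function to collect them with ...:
unique.columns <- function(df1, ...) {
temp <- list(...)
# unlist(lapply(...)) always gives a flat character vector; sapply can return
# a list when the data frames have different numbers of columns
setdiff(names(df1), unlist(lapply(temp, names)))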
}
unique.columns(df1, df3)
#[1] "A"
You could also negate %in% (a "not in" test) on the column names to get the names that are unique to one data frame compared with the others.
colnames(df1)[!(colnames(df1) %in% c(colnames(df2),colnames(df3)))]
#[1] "A"
colnames(df2)[!(colnames(df2) %in% c(colnames(df1),colnames(df3)))]
#character(0)
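If the check is needed for every data frame at once, one option is to keep them in a named list and compare each element's names against the names of all the others. This is only a sketch; the list name `dfs` and the result name `unique_to_each` are illustrative, and the data is fixed rather than random so the result is reproducible.

```r
# Hypothetical named list of data frames; only the column names matter
dfs <- list(
  df1 = data.frame(A = 1, B = 2, C = 3),
  df2 = data.frame(B = 1, C = 2, D = 3),
  df3 = data.frame(B = 1, C = 2, D = 3)
)

all_names <- lapply(dfs, names)

# For element i, drop every name that appears in any of the other elements
unique_to_each <- lapply(seq_along(all_names), function(i) {
  setdiff(all_names[[i]], unlist(all_names[-i]))
})
names(unique_to_each) <- names(dfs)

print(unique_to_each)
```

Data frames without unique columns yield `character(0)`, which can be tested with `length(x) == 0`.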

Matching across datasets and columns

I have a vector with words, e.g., like this:
w <- LETTERS[1:5]
and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:
set.seed(21)
df <- data.frame(
w1 = c(sample(LETTERS, 10)),
w2 = c(sample(LETTERS, 10)),
w3 = c(sample(LETTERS, 10)),
w4 = c(sample(LETTERS, 10))
)
df
w1 w2 w3 w4
1 U R A Y
2 G X P M
3 Q B S R
4 E O V T
5 V D G W
6 T A Q E
7 C K L U
8 D F O Z
9 R I M G
10 O T T I
# convert factor to character:
df[] <- lapply(df[], as.character)
I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but it doesn't look nice, is highly repetitive, and is error-prone if the data frame is larger:
extract <- c(df$w1[df$w1 %in% w],
df$w2[df$w2 %in% w],
df$w3[df$w3 %in% w],
df$w4[df$w4 %in% w])
I tried this, using paste0 to avoid addressing each column separately but that doesn't work:
extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows
What's wrong with this code? Or which other code would work?
To answer your question, "What's wrong with this code?": The code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is the equivalent of df[df %in% w] because df[paste0("w", 1:4)], which you use twice, simply returns the entirety of df. That means df %in% w will return FALSE FALSE FALSE FALSE because none of the variables in df are in w (w contains strings but not vectors of strings), and df[c(F, F, F, F)] returns an empty data frame.
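The difference between testing whole columns and testing individual cells can be seen on a small example. The data frame below is a made-up, deterministic stand-in for the sampled one in the question.

```r
w <- LETTERS[1:5]

# Small stand-in for the question's data (fixed values instead of sample())
df <- data.frame(w1 = c("A", "X", "B"),
                 w2 = c("Y", "C", "Z"),
                 stringsAsFactors = FALSE)

# On a data frame, %in% treats each column as one unit: one result per column
col_level <- df %in% w

# On a matrix, %in% tests every cell (in column-major order)
cell_level <- as.matrix(df) %in% w

print(col_level)   # FALSE FALSE
print(cell_level)  # TRUE FALSE TRUE FALSE TRUE FALSE
```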
If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:
mat <- as.matrix(df)
mat[mat %in% w]
#[1] "B" "D" "E" "E" "A" "B" "E" "B"
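This produces the same output as your attempt above with extract <- …. (The letters differ from the data frame printed in the question, most likely because sample() changed its default algorithm in R 3.6.0, so the same seed gives different draws across R versions.)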
If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):
lapply(df, function(x) x[x %in% w])
#### OUTPUT ####
$w1
[1] "B" "D" "E"
$w2
[1] "E" "A"
$w3
[1] "B"
$w4
[1] "E" "B"
Just call unlist on the returned list if you want a vector (add use.names = FALSE, or wrap the result in unname, to drop the generated names).
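As a quick check that flattening the per-column list gives the same tokens as the matrix approach, here is a sketch on a small, made-up data frame (fixed values rather than the sampled data from the question):

```r
w <- LETTERS[1:5]

# Deterministic stand-in for the question's data
df <- data.frame(w1 = c("A", "X", "B"),
                 w2 = c("Y", "C", "Z"),
                 stringsAsFactors = FALSE)

# Per-column matches, kept as a list because lengths can differ
per_column <- lapply(df, function(x) x[x %in% w])

# Flatten to a plain character vector without the generated names
flat <- unlist(per_column, use.names = FALSE)

# Same tokens via the matrix route
mat <- as.matrix(df)
from_matrix <- mat[mat %in% w]

print(flat)  # "A" "B" "C"
print(identical(flat, from_matrix))
```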

R - two data frame columns to list of key-value pairs

Say I have a data frame
DF1 <- data.frame("a" = c("a", "b", "c"), "b" = 1:3)
What is the easiest way to turn this into a list?
DF2 <- list("a" = 1, "b" = 2, "c" = 3)
It must be really simple but I can't find out the answer.
You can use setNames and as.list
DF2 <- setNames(as.list(DF1$b), DF1$a)
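A self-contained version of the same idea, with a lookup to show how the result behaves. Two assumptions are worth stating: the keys in column a are unique (duplicate keys would produce duplicate list names), and the values stay integers because 1:3 is an integer vector.

```r
DF1 <- data.frame(a = c("a", "b", "c"), b = 1:3, stringsAsFactors = FALSE)

# Values become list elements; the key column supplies the names
DF2 <- setNames(as.list(DF1$b), DF1$a)

# Look up a single value by key
value_b <- DF2[["b"]]

str(DF2)
print(value_b)  # 2
```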

Create a nested list out of list names

list1 <- 1:3
list2 <- letters[1:3]
I'd like to combine them in a list, not by simply listing them in list(list1, list2), but in a more generalized fashion.
For example, by using ls(pattern = "^list*"). However, that only combines the names and not the actual lists. How do you access, substitute, or refer to the actual lists?
It sounds like you're looking for mget:
list1 <- 1:3
list2 <- letters[1:3]
mget(ls(pattern = "list\\d"))
# $list1
# [1] 1 2 3
#
# $list2
# [1] "a" "b" "c"
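The pattern "list\\d" matches any object whose name contains "list" followed by a digit. If other similarly named objects might exist, anchoring the pattern is a little safer. The sketch below uses a separate environment (`env`, a name chosen for illustration) so that it does not depend on what is in the workspace; in an interactive session the `envir` arguments can simply be left out.

```r
# Use a dedicated environment so the example is independent of the workspace
env <- new.env()
assign("list1", 1:3, envir = env)
assign("list2", letters[1:3], envir = env)
assign("listing", "not wanted", envir = env)  # should not be picked up

# Anchored pattern: names that are exactly "list" followed by digits
wanted <- ls(envir = env, pattern = "^list\\d+$")

# mget() turns the character vector of names into a named list of the objects
combined <- mget(wanted, envir = env)

str(combined)
```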
