rbind data based on matching values in a column - r

I have several data frames I would like to combine, but I need to get rid of rows that don't have matching values in a column in the other data frames. For example, I want to merge a, b, and c data frames, based on the values in column x.
a <- data.frame(1:5, 5:9)
colnames(a) <- c("x", "y")
b <- data.frame(1:4, 7:10)
colnames(b) <- c("x", "y")
c <- data.frame(1:3, 6:8)
colnames(c) <- c("x", "y")
and have the result be
1 5
2 6
3 7
1 7
2 8
3 9
1 6
2 7
3 8
where the first three rows are from data frame a, the second three rows are from data frame b, and the third three rows are from data frame c, and the rows that didn't have matching values in column x were not included.

We create an index based on intersecting elements of 'x'
v1 <- Reduce(intersect, list(a$x, b$x, c$x))
rbind(a[a$x %in% v1,], b[b$x %in% v1,], c[c$x %in% v1, ])
# x y
#1 1 5
#2 2 6
#3 3 7
#4 1 7
#5 2 8
#6 3 9
#7 1 6
#8 2 7
#9 3 8
If there are many dataset objects, it is better to keep them in a list. The example here uses unrelated object names, but if the names follow a pattern, e.g. df1, df2, ..., df100, it becomes easy to collect them into a list
lst1 <- mget(ls(pattern = "^df\\d+$"))
If the object names are all different (xyz, abc, fq12, etc.) but these are the only data.frame objects loaded in the global environment, fetch every object and keep only the data frames
lst1 <- Filter(is.data.frame, mget(ls()))
Then, get the intersecting elements of the column 'x'
v1 <- Reduce(intersect, lapply(lst1, `[[`, "x"))
Use the intersecting vector to subset the rows of the list elements
do.call(rbind, lapply(lst1, function(dat) dat[dat$x %in% v1,]))
Here, we assume the column names are the same across all the datasets
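The list-based steps above can be wrapped into one small helper. This is only a minimal sketch, assuming every data frame contains the key column and that all column names match; `rbind_common` and its `key` argument are hypothetical names introduced for illustration, not part of the original answer.

```r
# Hypothetical helper: keep only rows whose key value occurs in every
# data frame, then stack the filtered data frames in list order.
rbind_common <- function(dfs, key = "x") {
  # values of the key column that are shared by all data frames
  common <- Reduce(intersect, lapply(dfs, `[[`, key))
  # subset each data frame to the shared values, then bind the pieces
  out <- do.call(rbind, lapply(dfs, function(dat) dat[dat[[key]] %in% common, ]))
  rownames(out) <- NULL  # reset row names to 1..n
  out
}

a <- data.frame(x = 1:5, y = 5:9)
b <- data.frame(x = 1:4, y = 7:10)
c <- data.frame(x = 1:3, y = 6:8)
res <- rbind_common(list(a, b, c))
res
```

Passing the data frames as a list keeps the helper independent of how many objects there are or what they are called.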
Another option is to do a merge and then unlist
out <- Reduce(function(...) merge(..., by = 'x'), list(a, b, c))
data.frame(x = out$x, y = unlist(out[-1], use.names = FALSE))

Related

How to use a vector for creating a logical expression for subsetting a data frame?

I am trying to use a vector of logical expressions to subset a data frame. I have a data frame I want to subset based on several columns where I want to exclude "B" each time. First I want to define a vector of logical expressions based on the data frame column names.
set.seed(42)
n <- 24
dataframe <- data.frame(column1=as.character(factor(paste("obs",1:n))),
rand1=rep(LETTERS[1:4], n/4),
rand2=rep(LETTERS[1:6], n/6),
rand3=rep(LETTERS[1:3], n/3),
x=rnorm(n))
columns <- colnames(dataframe)[2:4]
criteria <- quote(rep(paste0(columns[1:3], " != ", quote("B")), length(columns)))
What I want to achieve is a vector criteria containing
rand1 != "B" rand2 != "B" rand3 != "B" so I can use it to subset data frame based on columns like
dfs1 <- subset(dataframe, criteria[1])
dfs2 <- subset(dataframe, criteria[2])
dfs3 <- subset(dataframe, criteria[3])
I might be misunderstanding your question, but it seems like you want a collection of data.frames where each one excludes rows where a given column = 'B'.
Assuming this is what you want:
cols <- c('rand1', 'rand2', 'rand3')
result <- lapply(dataframe[, cols], function(x) dataframe[x!='B',])
will create a list of data.frames, each of which has the result of excluding rows where the indicated column == 'B'.
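If the criteria really need to exist as a vector of strings, as in the question, one possible sketch is to build them with `paste0` and evaluate each one inside the data frame. This assumes the strings come from trusted code, since `parse` will evaluate whatever it is given; `subset_by` and `dfs` are hypothetical names used only for this illustration.

```r
set.seed(42)
n <- 24
dataframe <- data.frame(column1 = paste("obs", 1:n),
                        rand1 = rep(LETTERS[1:4], n / 4),
                        rand2 = rep(LETTERS[1:6], n / 6),
                        rand3 = rep(LETTERS[1:3], n / 3),
                        x = rnorm(n),
                        stringsAsFactors = FALSE)
columns <- colnames(dataframe)[2:4]

# character vector: 'rand1 != "B"', 'rand2 != "B"', 'rand3 != "B"'
criteria <- paste0(columns, ' != "B"')

# hypothetical helper: parse one criterion and evaluate it with the
# data frame columns in scope, then keep the rows where it is TRUE
subset_by <- function(df, cond) {
  keep <- eval(parse(text = cond)[[1]], envir = df, enclos = parent.frame())
  df[keep, ]
}

# one filtered data frame per criterion, named after the tested column
dfs <- lapply(criteria, subset_by, df = dataframe)
names(dfs) <- columns
```

Here `dfs$rand1` plays the role of `dfs1` from the question, and so on for the other columns.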
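Based on the answer to "Using tidy eval for multiple, arbitrary filter conditions":
library(dplyr)
library(purrr)
filter_fun <- function(df, cols, conds){
  # build one quosure per column/value pair: <column> != <value>
  fp <- map2(cols, conds, function(x, y) quo((!!(as.name(x))) != !!y))
  # splice all conditions into filter(); a row must satisfy every one
  filter(df, !!!fp)
}
filter_col <- columns[1:3] %>% as.list()
cond_list <- rep(list("B"), length(columns[1:3]))
filter_fun(dataframe, cols = filter_col,
           conds = cond_list)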
column1 rand1 rand2 rand3 x
1 obs 1 A A A 1.3709584
2 obs 3 C C C 0.3631284
3 obs 4 D D A 0.6328626
4 obs 7 C A A 1.5115220
5 obs 9 A C C 2.0184237
6 obs 12 D F C 2.2866454
7 obs 13 A A A -1.3888607
8 obs 15 C C C -0.1333213
9 obs 16 D D A 0.6359504
10 obs 19 C A A -2.4404669
11 obs 21 A C C -0.3066386
12 obs 24 D F C 1.2146747
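The same combined filter (rows where none of the three columns equals "B") can also be sketched in base R, without tidy evaluation. This is an illustrative alternative rather than the method from the linked answer; `tests`, `keep` and `filtered` are names made up for this sketch.

```r
set.seed(42)
n <- 24
dataframe <- data.frame(column1 = paste("obs", 1:n),
                        rand1 = rep(LETTERS[1:4], n / 4),
                        rand2 = rep(LETTERS[1:6], n / 6),
                        rand3 = rep(LETTERS[1:3], n / 3),
                        x = rnorm(n),
                        stringsAsFactors = FALSE)
cols <- c("rand1", "rand2", "rand3")

# one logical vector per column, TRUE where the value is not "B"
tests <- lapply(cols, function(cl) dataframe[[cl]] != "B")

# a row is kept only if every per-column test is TRUE
keep <- Reduce(`&`, tests)
filtered <- dataframe[keep, ]
```

`Reduce` with `&` scales to any number of columns, so `cols` can be extended without touching the rest of the code.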

How to make sure 3 separate dfs only contain the same columns?

I have 3 data frames. They share some column names, but some columns are missing from at least one data frame. I am trying to 1) select only the columns that are present in all 3 dfs and 2) make sure all of the columns are in the same order (not looking for alphabetical per se).
df1
A B C D
4 5 2 9
df2
A D C F
13 23 94 1
df3
E C A D
3 83 12 7
**Ideal Output**
df1
A C D
4 2 9
df2
A C D
13 94 23
df3
A C D
12 83 7
I am honestly not too sure where to start. Intuitively I think something like this:
df1 <- apply(df1, 2, function(x) ifelse(colnames(x) %in% colnames(df2) & colnames(df1) %in% colnames(df3), x, subset(df1, select = -c(x))
Then repeat for the other 2 dfs. Once all three dfs have the same columns, then I would just order it using one of the dfs as a template.
col_order <- colnames(df1)
df2 <- df2[, col_order]
Where am I going wrong?
We can get the datasets in a list, and get the intersecting names and use that to subset
lst1 <- mget(ls(pattern = '^df\\d+$'))
nm1 <- Reduce(intersect, lapply(lst1, names))
lapply(lst1, subset, select = nm1)
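As a self-contained check, here is the same idea applied to the three data frames from the question. It is a sketch that builds the list by hand instead of using `mget`, and uses `[` indexing instead of `subset`; `res` is a name introduced only for this example.

```r
df1 <- data.frame(A = 4, B = 5, C = 2, D = 9)
df2 <- data.frame(A = 13, D = 23, C = 94, F = 1)
df3 <- data.frame(E = 3, C = 83, A = 12, D = 7)

# named list, so the results stay labelled by their source data frame
lst1 <- list(df1 = df1, df2 = df2, df3 = df3)

# column names present in every data frame, in the order used by df1
nm1 <- Reduce(intersect, lapply(lst1, names))

# select those columns from each data frame; this also fixes the order
res <- lapply(lst1, function(d) d[, nm1, drop = FALSE])
res
```

Because `intersect` keeps the order of its first argument, every result follows the column order of `df1`, which covers the "same order" requirement in one step.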

Intersection of row names in dataframe (subset the data)?

Since intersect doesn't work with dataframes, I'm trying to use subset to create a subset of dfA with only data for which dfA's row names match dfB's row names. I should end up with 3000 rows because dfA has 5000 rows and dfB has 3000, and all of dfB’s row names exist in dfA’s row names.
The following just returns dfA's column names without any data.
mysubset = subset(dfA, dfA[,0] %in% dfB[,0])
The rownames function will give you access to the rownames, then the set comparison condition will do what you expected.
Example, using small data frames with some shared rownames
dfA <- data.frame(x = 1:5,
y = 6:10,
row.names = letters[1:5])
# Show dfA
dfA
x y
a 1 6
b 2 7
c 3 8
d 4 9
e 5 10
dfB <- data.frame(x = 1:5,
y = 6:10,
row.names = letters[3:7])
# Show dfB
dfB
x y
c 1 6
d 2 7
e 3 8
f 4 9
g 5 10
Solution
# Subset rows with matching rownames
dfA[ rownames(dfA) %in% rownames(dfB), ]
x y
c 3 8
d 4 9
e 5 10
You should get a subset based on rownames for both data.frames.
dfA[which(rownames(dfA) %in% rownames(dfB)),]
This checks which row names from dfA are in row names of dfB (which) and returns the indices to get the data in dfA (dfA[...]).
If you want to stick to your solution (which costs a bit more, computationally):
subset(dfA, rownames(dfA) %in% rownames(dfB))
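Although `intersect` does not operate on whole data frames, it does work on the row name vectors themselves. A minimal sketch, reusing the small example data from the answer above (`shared`, `subA` and `subB` are illustrative names):

```r
dfA <- data.frame(x = 1:5, y = 6:10, row.names = letters[1:5])
dfB <- data.frame(x = 1:5, y = 6:10, row.names = letters[3:7])

# row names that occur in both data frames
shared <- intersect(rownames(dfA), rownames(dfB))

# index by name, so both subsets come back in the same row order
subA <- dfA[shared, ]
subB <- dfB[shared, ]
```

Indexing both data frames by the same character vector is convenient when the two subsets need to be compared row by row afterwards.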

returning from list to data.frame after lapply

I have a very simple question about lapply. I am transitioning from STATA to R and I think there is some very basic concept that I am not getting about looping in R. But I have been reading about it all afternoon and can't figure out a reasonable way to do this very simple thing.
I have three data frames df1, df2, and df3 that all have the same column names, in the same order, etc.
I want to rename their columns all at once.
I put the data frames in a list:
dflist <- list(df1, df2, df3)
What I want the new names to be:
varlist <- c("newname1", "newname2", "newname3")
Write a function that replaces names with those in varlist, and lapply it over the data frames
ChangeNames <- function(x) {
names(x) <- varlist
return(x)
}
dflist <- lapply(dflist, ChangeNames)
So, as far as I understand, R has changed the names of the copies of the data frames that I put in the list, but not the original data frames themselves. I want the data frames themselves to be renamed, not the elements of the list (which are trapped in a list).
Now, I can go
df1 <- as.data.frame(dflist[1])
df2 <- as.data.frame(dflist[2])
df3 <- as.data.frame(dflist[3])
But that seems weird. You need a loop to get back the elements of a loop?
Basically: once you've put some data frames in a list and run your function on them via lapply, how do you get them back out of the list, without starting back at square one?
If you just want to change the names, that isn't too hard in R. Bear in mind that the assignment operator, <-, can be applied in sequence. Hence:
names(df1) <- names(df2) <- names(df3) <- c("newname1", "newname2", "newname3")
I am not sure I understand correctly, do you want to rename the columns of the data frames or the components of the list that contain the data frames?
If it is the first, please always search before asking, the question has been asked here.
So what you can easily do in case you have even more data frames in the list is:
# Creating some sample data first
> dflist <- list(df1 = data.frame(a = 1:3, b = 2:4, c = 3:5),
+ df2 = data.frame(a = 4:6, b = 5:7, c = 6:8),
+ df3 = data.frame(a = 7:9, b = 8:10, c = 9:11))
# See what it looks like
> dflist
$df1
a b c
1 1 2 3
2 2 3 4
3 3 4 5
$df2
a b c
1 4 5 6
2 5 6 7
3 6 7 8
$df3
a b c
1 7 8 9
2 8 9 10
3 9 10 11
# And do the trick
> dflist <- lapply(dflist, setNames, nm = c("newname1", "newname2", "newname3"))
# See how it looks now
> dflist
$df1
newname1 newname2 newname3
1 1 2 3
2 2 3 4
3 3 4 5
$df2
newname1 newname2 newname3
1 4 5 6
2 5 6 7
3 6 7 8
$df3
newname1 newname2 newname3
1 7 8 9
2 8 9 10
3 9 10 11
So the names were changed from a, b and c to newname1, newname2 and newname3 for each data frame in the list.
If it is the second, you can do this:
> names(dflist) <- c("newname1", "newname2", "newname3")
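To get the renamed data frames back out of the list as separate objects, one option is `list2env`, which copies each named list element into an environment under that name. The sketch below assumes the list elements are named `df1`, `df2`, `df3` (as in the sample data above) and writes into the current environment; at the top level of a session that is the global environment.

```r
dflist <- list(df1 = data.frame(a = 1:3, b = 2:4, c = 3:5),
               df2 = data.frame(a = 4:6, b = 5:7, c = 6:8),
               df3 = data.frame(a = 7:9, b = 8:10, c = 9:11))

# rename the columns of every data frame in the list
dflist <- lapply(dflist, setNames, nm = c("newname1", "newname2", "newname3"))

# create (or overwrite) objects df1, df2, df3 from the list elements
invisible(list2env(dflist, envir = environment()))

names(df1)
```

Note that this overwrites any existing objects with the same names, so it is worth keeping the list as the working copy and exporting only at the end.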

R grouping by name and perform stats (t-test)

I have two data.frames:
word1=c("a","a","a","a","b","b","b")
word2=c("a","a","a","a","c","c","c")
values1 = c(1,2,3,4,5,6,7)
values2 = c(3,3,0,1,2,3,4)
df1 = data.frame(word1,values1)
df2 = data.frame(word2,values2)
df1:
word1 values1
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
df2:
word2 values2
1 a 3
2 a 3
3 a 0
4 a 1
5 c 2
6 c 3
7 c 4
I would like to split these dataframes by word*, and perform two sample t.tests in R.
For example, the word "a" is in both data.frames. What's the t.test between the data.frames for the word "a"? And do this for all the words that are in both data.frames.
The result is a data.frame(result):
word tvalues
1 a 0.4778035
Thanks
Find the words common to both dataframes, then loop over these words, subsetting both dataframes and performing the t.test on the subsets.
E.g.:
df1 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
x=rnorm(30))
df2 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
x=rnorm(30))
common_words <- sort(intersect(df1$word, df2$word))
setNames(lapply(common_words, function(w) {
t.test(subset(df1, word==w, x), subset(df2, word==w, x))
}), common_words)
This returns a list, where each element is the output of the t.test for one of the common words. setNames just names the list elements so you can see which words they correspond to.
Note I've created new example data here since your example data only have one word in common (a) and so don't really resemble your true problem.
If you just want a matrix of statistics, you can do something like:
t(sapply(common_words, function(w) {
test <- t.test(subset(df1, word==w, x), subset(df2, word==w, x))
c(test$statistic, test$parameter, p=test$p.value,
`2.5%`=test$conf.int[1], `97.5%`=test$conf.int[2])
}))
## t df p 2.5% 97.5%
## a 0.9141839 8.912307 0.38468553 -0.4808054 1.1313220
## b -0.2182582 7.589109 0.83298193 -1.1536056 0.9558315
## c -0.2927253 8.947689 0.77640684 -1.5340097 1.1827691
## d -2.7244728 12.389709 0.01800568 -2.5016301 -0.2826952
## e -0.3683153 7.872407 0.72234501 -1.9404345 1.4072499
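To return a data.frame with one row per common word, as the question asks, the per-word results can be bound together. This is a sketch on the question's original data, assuming the columns are renamed to `word` and `value` in both data frames (a change made here for convenience) and using the default Welch two-sample test; `rows` and `result` are illustrative names.

```r
df1 <- data.frame(word = c("a", "a", "a", "a", "b", "b", "b"),
                  value = c(1, 2, 3, 4, 5, 6, 7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(word = c("a", "a", "a", "a", "c", "c", "c"),
                  value = c(3, 3, 0, 1, 2, 3, 4),
                  stringsAsFactors = FALSE)

# words that appear in both data frames
common_words <- sort(intersect(df1$word, df2$word))

# one small data frame per word: t statistic and p-value
rows <- lapply(common_words, function(w) {
  test <- t.test(df1$value[df1$word == w], df2$value[df2$word == w])
  data.frame(word = w,
             t = unname(test$statistic),
             p = test$p.value,
             stringsAsFactors = FALSE)
})

# stack the per-word rows into a single result table
result <- do.call(rbind, rows)
result
```

Keeping both the statistic and the p-value in separate columns avoids ambiguity about which number a column such as `tvalues` holds.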
