Intersection of row names in dataframe (subset the data)? - r

Since intersect doesn't work with dataframes, I'm trying to use subset to create a subset of dfA with only data for which dfA's row names match dfB's row names. I should end up with 3000 rows because dfA has 5000 rows and dfB has 3000, and all of dfB’s row names exist in dfA’s row names.
The following just returns dfA's column names without any data.
mysubset = subset(dfA, dfA[,0] %in% dfB[,0])

The rownames function will give you access to the rownames, then the set comparison condition will do what you expected.
Example, using small data frames with some shared rownames
dfA <- data.frame(x = 1:5,
y = 6:10,
row.names = letters[1:5])
# Show dfA
dfA
x y
a 1 6
b 2 7
c 3 8
d 4 9
e 5 10
dfB <- data.frame(x = 1:5,
y = 6:10,
row.names = letters[3:7])
# Show dfB
dfB
x y
c 1 6
d 2 7
e 3 8
f 4 9
g 5 10
Solution
# Subset rows with matching rownames
dfA[ rownames(dfA) %in% rownames(dfB), ]
x y
c 3 8
d 4 9
e 5 10

You should get a subset based on rownames for both data.frames.
dfA[which(rownames(dfA) %in% rownames(dfB)),]
This checks which row names from dfA are in row names of dfB (which) and returns the indices to get the data in dfA (dfA[...]).
If you want to stick to your solution (which costs a bit more, computationally):
subset(dfA, rownames(dfA) %in% rownames(dfB))
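Since the question mentions intersect: it does work on the row-name vectors themselves, and a data frame can be indexed by a character vector of row names. A minimal sketch, reusing the small dfA and dfB from the example above (the names common, by_name and by_mask are only illustrative):

```r
dfA <- data.frame(x = 1:5, y = 6:10, row.names = letters[1:5])
dfB <- data.frame(x = 1:5, y = 6:10, row.names = letters[3:7])

# Row names present in both data frames
common <- intersect(rownames(dfA), rownames(dfB))

# Character indexing selects rows by name
by_name <- dfA[common, ]

# The logical %in% approach, for comparison
by_mask <- dfA[rownames(dfA) %in% rownames(dfB), ]
```

Both forms should select the same rows here; the %in% form is probably the safer default when dfB might contain names that dfA lacks.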

Related

rbind data based on matching values in a column

I have several data frames I would like to combine, but I need to get rid of rows that don't have matching values in a column in the other data frames. For example, I want to merge a, b, and c data frames, based on the values in column x.
a <- data.frame(1:5, 5:9)
colnames(a) <- c("x", "y")
b <- data.frame(1:4, 7:10)
colnames(b) <- c("x", "y")
c <- data.frame(1:3, 6:8)
colnames(c) <- c("x", "y")
and have the result be
1 5
2 6
3 7
1 7
2 8
3 9
1 6
2 7
3 8
where the first three rows are from data frame a, the second three rows are from data frame b, and the third three rows are from data frame c, and the rows that didn't have matching values in column x were not included.
We create an index based on intersecting elements of 'x'
v1 <- Reduce(intersect, list(a$x, b$x, c$x))
rbind(a[a$x %in% v1,], b[b$x %in% v1,], c[c$x %in% v1, ])
# x y
#1 1 5
#2 2 6
#3 3 7
#4 1 7
#5 2 8
#6 3 9
#7 1 6
#8 2 7
#9 3 8
If there are many dataset objects, it is better to keep it in a list. Here, the example showed the object identifiers as completely different, but if the identifiers have a pattern e.g. df1, df2, ..df100 etc, it becomes easier to get it to a list
lst1 <- mget(ls(pattern = "^df\\d+$"))
If the object identifiers are all different xyz, abc, fq12 etc, but these are the only data.frame objects loaded in the global environment
lst1 <- mget(names(which(unlist(eapply(.GlobalEnv, is.data.frame)))))
Then, get the intersecting elements of the column 'x'
v1 <- Reduce(intersect, lapply(lst1, `[[`, "x"))
Use the intersecting vector to subset the rows of the list elements
do.call(rbind, lapply(lst1, function(dat) dat[dat$x %in% v1,]))
Here, we assume the column names are the same across all the datasets
Another option is to do a merge and then unlist
out <- Reduce(function(...) merge(..., by = 'x'), list(a, b, c))
data.frame(x = out$x, y = unlist(out[-1], use.names = FALSE))
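The list-based steps above can be put together end to end. This is a sketch under the assumption that every data frame has the same column names; the list is built by hand here rather than with mget, and res is an illustrative name:

```r
# Keep the data frames in a named list instead of separate objects
lst1 <- list(
  a = data.frame(x = 1:5, y = 5:9),
  b = data.frame(x = 1:4, y = 7:10),
  c = data.frame(x = 1:3, y = 6:8)
)

# Values of x shared by every data frame
v1 <- Reduce(intersect, lapply(lst1, `[[`, "x"))

# Filter each data frame, then stack the pieces
res <- do.call(rbind, lapply(lst1, function(dat) dat[dat$x %in% v1, ]))

# rbind on a named list builds compound row names; reset them if unwanted
rownames(res) <- NULL
res
```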

Select columns of dataframe, filling with NAs if a selected column doesn't exist

Take this example data frame
temp <- data.frame('a' = 1:3, 'b' = 4:6, 'd' = 7:9)
I want to subset this data frame using a vector of column names, but if the vector contains any columns that don't exist in the data frame I want them still to be returned but as NA.
So if my vector was
colVec <- c('a', 'b', 'c')
I would want to run something along the lines of
subset(temp, select = colVec)
to get
a b c
1 4 NA
2 5 NA
3 6 NA
You can do this in two steps -- limiting to the requested columns that are in your data frame and then adding the requested columns that are not in your data frame. You can use intersect and setdiff to get these two sets of column names:
temp2 <- temp[intersect(colVec, names(temp))]
temp2[setdiff(colVec, names(temp))] <- NA
temp2
# a b c
# 1 1 4 NA
# 2 2 5 NA
# 3 3 6 NA
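One detail of the two-step approach is that the missing columns are always appended at the end, so the result may not follow the order of colVec. A small wrapper can reindex by the requested names as a last step. The function name select_or_na is hypothetical, not part of base R:

```r
# Hypothetical helper: select cols from df, add all-NA columns for names
# that df does not have, and return the columns in the requested order
select_or_na <- function(df, cols) {
  out <- df[intersect(cols, names(df))]
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    out[missing_cols] <- NA
  }
  out[cols]
}

temp <- data.frame(a = 1:3, b = 4:6, d = 7:9)
res <- select_or_na(temp, c("c", "a"))
res
```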

Select row numbers of a data frame conditioning on another data frame

I have a data frame that I want to find the row numbers where these rows are in common with another data frame.
To make the question clear, say I have data frame A and data frame B:
dfA <- data.frame(NAME = rep(c("a", "b"), each = 3),
TRIAL = rep(1:3, 2),
DATA = runif(6))
dfB <- data.frame(NAME = c("a", "b"),
TRIAL = c(2, 3))
dfA
# NAME TRIAL DATA
# 1 a 1 0.62948592
# 2 a 2 0.88041819
# 3 a 3 0.02479411
# 4 b 1 0.48031827
# 5 b 2 0.86591315
# 6 b 3 0.93448264
dfB
# NAME TRIAL
# 1 a 2
# 2 b 3
I want to get dfA's row number where dfA and dfB have the same NAME and TRIAL, in this case, row numbers are 2 and 6.
I tried the following code, which gives me rows 2, 3, 5 and 6. It matches NAME and TRIAL separately, so it doesn't work.
which(dfA$NAME %in% dfB$NAME & dfA$TRIAL %in% dfB$TRIAL)
# 2 3 5 6
Then I tried to create a dummy column and match this col. Works, but the code would be verbose if dfB has many columns...
dfA$dummy <- paste0(dfA$NAME, dfA$TRIAL)
dfB$dummy <- paste0(dfB$NAME, dfB$TRIAL)
which(dfA$dummy %in% dfB$dummy)
# 2 6
I'm wondering if there are better ways to solve the problem, thanks for your help!
You can do:
merge(transform(dfA, row.num = 1:nrow(dfA)), dfB)$row.num
# [1] 2 6
And if the whole goal of finding the indices is so that you can subset dfA, then you can just do merge(dfA, dfB).
Or use duplicated:
apply(dfB, 1, function(x)
which(duplicated(rbind(x, dfA[1:2])))-1)
# [1] 2 6
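The dummy-column idea from the question can also be generalised so that it does not get more verbose as dfB gains columns. The sketch below builds one key per row from whatever columns dfB has. The helper row_key is hypothetical, and it assumes the separator character never occurs in the data:

```r
dfA <- data.frame(NAME = rep(c("a", "b"), each = 3),
                  TRIAL = rep(1:3, 2),
                  DATA = runif(6))
dfB <- data.frame(NAME = c("a", "b"),
                  TRIAL = c(2, 3))

# Hypothetical helper: paste the chosen columns into one key per row
# (assumes the separator "\r" does not appear in any value)
row_key <- function(df, cols) {
  do.call(paste, c(unname(as.list(df[cols])), sep = "\r"))
}

# Use every column of dfB as part of the key
key_cols <- names(dfB)
rows <- which(row_key(dfA, key_cols) %in% row_key(dfB, key_cols))
rows
```

Because the keys are compared as text, this only works when matching values print identically in both data frames, which is the case for the integers and short strings here.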

Changing the values of a column for the values from another column

I have two datasets that look like this:
What I want is to change the values from the second column in the first dataset to the values from the second column from the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do that but I am very new at it. I was looking at the intersect command but I am not sure if it's going to work. I don't put any codes because I'm real lost here.
I also need that the order of the first columns (which are names) in the first dataset stays the same, but with the new values from the second column of the second dataset.
Agree with @agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by="name", all.x=TRUE)
# drop the old X.x column, keeping name and the new X.y values
df.merge <- df.merge[, -2]
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
Note that merge sorts the result on the by column by default (sort = TRUE), so the order of the first frame is not guaranteed to survive. You can keep the order strictly by adding a column with the original order, df1$order <- 1:nrow(df1), and later sorting on that column.
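The order-column idea can be sketched as follows. The data is made up so that df1 is deliberately not in alphabetical order, and the names m and order are illustrative:

```r
df1 <- data.frame(name = c("ab47490", "ab23242", "ab59614", "ab35366"),
                  X = c(1, 2, 3, 4))
df2 <- data.frame(name = c("ab35366", "ab47490", "ab59614", "ab23242"),
                  X = c(12345, 23456, 34567, 456789))

# Remember the original row order of df1
df1$order <- seq_len(nrow(df1))

# Merge on name; the clashing X columns become X.x (df1) and X.y (df2)
m <- merge(df1, df2, by = "name", all.x = TRUE)

# Restore the order of df1 and keep only the name and the new values
m <- m[order(m$order), c("name", "X.y")]
rownames(m) <- NULL
m
```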
df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
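The NA values in the last test appear because the right-hand side indexes df1 with positions that refer to df2 (here 6 to 10, while df1 has only 5 rows). A minimal sketch of one way around it, assuming the goal is to copy valuecol1 into the matching rows of df2: compute the match once, use it only to index df2, and take the values from df1 in their own order. The names idx and found are illustrative:

```r
df1 <- data.frame(name1 = letters[6:10], valuecol1 = seq(2, 10, by = 2))
df2 <- data.frame(name2 = letters[1:10], valuecol2 = 10:1)

# Position in df2 of each name in df1 (NA where a name is absent from df2)
idx <- match(df1$name1, df2$name2)
found <- !is.na(idx)

# Left side: rows of df2; right side: the corresponding rows of df1
df2$valuecol2[idx[found]] <- df1$valuecol1[found]
df2
```

Dropping the unmatched entries with found keeps the two sides the same length even when some names in df1 are missing from df2.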
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value=rnorm(1000), key="id")
dt.2 <- data.table(id = 2*(500:1), value=as.numeric(1:500), key="id")
# objective is to replace value in dt.1 with value from dt.2 where ids match.
# data table joins - very efficient
# update join: rows of dt.1 whose id occurs in dt.2 get dt.2's value, by reference;
# i.value refers to the value column of dt.2 (the table in the i position)
dt.1[dt.2, on = "id", value := i.value]
NB: The update join modifies dt.1 in place and leaves its row order unchanged; ids without a match in dt.2 keep their original value.
Also: In future, please include data that can be imported into R. Images are not useful!

Convert the values in a column into row names in an existing data frame

I would like to convert the values in a column of an existing data frame into row names. Is it possible to do this without exporting the data frame and then reimporting it with a row.names = call?
For example I would like to convert:
> samp
names Var.1 Var.2 Var.3
1 A 1 5 0
2 B 2 4 1
3 C 3 3 2
4 D 4 2 3
5 E 5 1 4
Into:
> samp.with.rownames
Var.1 Var.2 Var.3
A 1 5 0
B 2 4 1
C 3 3 2
D 4 2 3
E 5 1 4
This should do:
samp2 <- samp[,-1]
rownames(samp2) <- samp[,1]
So in short, no there is no alternative to reassigning.
Edit: Correcting myself, one can also do it in place: assign rowname attributes, then remove column:
R> df<-data.frame(a=letters[1:10], b=1:10, c=LETTERS[1:10])
R> rownames(df) <- df[,1]
R> df[,1] <- NULL
R> df
b c
a 1 A
b 2 B
c 3 C
d 4 D
e 5 E
f 6 F
g 7 G
h 8 H
i 9 I
j 10 J
R>
As of 2016 you can also use the tidyverse.
library(tidyverse)
samp %>% remove_rownames %>% column_to_rownames(var="names")
in one line
> samp.with.rownames <- data.frame(samp[,-1], row.names=samp[,1])
It looks like the one-liner got even simpler along the line (currently using R 3.5.3):
# generate original data.frame
df <- data.frame(a = letters[1:10], b = 1:10, c = LETTERS[1:10])
# use first column for row names
df <- data.frame(df, row.names = 1)
The column used for row names is removed automatically.
With a one-row dataframe
Beware that if the dataframe has a single row, the behaviour might be confusing. As the documentation mentions:
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
This means that, if you use the same command as above, it might look like it did nothing (when it actually named the first row "1", which won't look any different in the viewer).
In that case, you will have to stick to the more verbose:
df <- data.frame(a = "a", b = 1)
df <- data.frame(df, row.names = df[,1])
... but the column won't be removed. Also remember that, if you remove a column to end up with a single-column dataframe, R will simplify it to an atomic vector. In that case, you will want to use the extra drop argument:
df <- data.frame(df[,-1, drop = FALSE], row.names = df[,1])
You can execute this in 2 simple statements:
row.names(samp) <- samp$names
samp[1] <- NULL
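Row names of a data frame must be unique, so any of the approaches above will fail if the chosen column contains duplicates. A minimal sketch that checks this first, using a reconstruction of the samp data from the question:

```r
samp <- data.frame(names = c("A", "B", "C", "D", "E"),
                   Var.1 = 1:5,
                   Var.2 = 5:1,
                   Var.3 = 0:4)

# Row names must be unique; stop early if the column has duplicates
stopifnot(anyDuplicated(samp$names) == 0)

# Drop the column (drop = FALSE keeps a data frame even if one column remains)
samp.with.rownames <- samp[, -1, drop = FALSE]

# Use the removed column as row names
rownames(samp.with.rownames) <- as.character(samp$names)
samp.with.rownames
```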
