Subset a data frame based on another - r

I have two data frames, x and y.
x<-data.frame(id=c(1,2,3,4,5), g=c(21,52,43,94,35))
y<-data.frame(id=c(3,4,7), u=c(55, 77, 99))
I want to subset x to include only the observations with "IDs" that are also in y.
What is the best way of doing this?
Thanks!

Use setdiff to exclude observations appearing in both df
> x[setdiff(x$id, y$id),]
id g
1 1 21
2 2 52
5 5 35
Use merge to include observations present in both df
> merge(x, y)
id g u
1 3 43 55
2 4 94 77
or looking for this subset?
> x[intersect(x$id, y$id),]
id g
3 3 43
4 4 94

The accepted answer only works because the values 3 and 4 in x$id happen to be located in rows 3 and 4. The wrong answer will be obtained, for example, if:
x<-data.frame(id=c(1,3,2,4,5), g=c(21,52,43,94,35))
x[intersect(x$id, y$id),]
id g
3 2 43
4 4 94
The following will work properly, regardless of the position of the common elements:
x[is.element(x$id,intersect(x$id,y$id)),]

Related

How can I arrange data.frame according to the factor levels?

there is data.frame df, i want to arrange (sort) it by index column according the factor levels.
the result as "the wished data.frame". anyone can help ? thanks!
#create data frame
df<-data.frame(index=c("b","a","e"),amount=c(10,76,60))
df$index<-factor(df$index,levels=c("a","b","e"))
# current df
index amount
1 b 10
2 a 76
3 e 60
# the wished data.frame
index amount
1 a 76
2 b 10
3 e 60
Like this?
arrange(df, match(df$index, levels(df$index)))
index amount
1 a 76
2 b 10
3 e 60
Data
df<-data.frame(index=c("b","a","e"),amount=c(10,76,60))
df$index<-factor(df$index,levels=c("b","e","a"))
You can use order :
df[order(df$index), ]
# index amount
#2 a 76
#1 b 10
#3 e 60

perform operations on a data frame based on a factors

I'm having a hard time to describe this so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr I have the result of a group_by and summarize I have a data frame that I want to do some further manipulation on by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Where I manually inserted the fraction in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
first you want to merge in the total values into your df:
df2 <- merge(df, total, by = "run")
then you can call mutate:
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

How to merge tables and fill the empty cells in the mean time in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How to merge the two table in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45
You could try
library(data.table)
setkey(setDT(a), ID)[b, AGE:= i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
Assuming you have NA on every position in the first table where you want to use the second table's age numbers you can use rbind and na.omit.
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
Results in what you're after (although unordered and I assume you just forgot ID 5)
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two different data.frames's and keep the columns its a different thing. You can use merge to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=T)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[agrep('AGE\\.x', names(res))] <- 'AGE' # rename merged column AGE.x to AGE
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.

Removing NAs when multiplying columns

This is a really simple question, but I am hoping someone will be able to help me avoid extra lines of unnecessary code. I have a simple dataframe:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
What I want to do is produce an extra column which is the multiplication of A, B and C, which I will then cbind to the original dataframe.
So, I would normally use:
attach(Df.1)
D<-A*B*C
But obviously where the NAs are in column C, I get an NA in variable D. I don't want to exclude all the NA rows, rather just ignore the NA values in this column (and then the value in D would simply be the multiplication of A and B, or where C was available, A*B*C.
I know I could simply replace the NAs with 1s, so the calculation remains unchanged, or use if statements, but I was wodnering what the simplist way of doing this is?
Any ideas?
You can use prod which has an na.rm argument. To do it by row use apply:
apply(Df.1,1,prod,na.rm=TRUE)
[1] 10 60 14 120 72 36
As #James said, prod and apply will work, but you don't need to waste memory storing it in a separate variable, or even cbinding it
Df.1$D = apply(Df.1, 1, prod, na.rm=T)
Assigning the new variable in the data frame directly will work.
> Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
> Df.1
A B C
1 5 1 2
2 4 5 3
3 7 2 NA
4 6 4 5
5 8 9 NA
6 4 1 9
> Df.1$D = apply(Df.1, 1, prod, na.rm=T)
> Df.1$D
[1] 10 60 14 120 72 36
> Df.1
A B C D
1 5 1 2 10
2 4 5 3 60
3 7 2 NA 14
4 6 4 5 120
5 8 9 NA 72
6 4 1 9 36

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have 2 data.frames "rec" and "book"
Each row in "book" needs to be pasted onto the end of several of the rows of "rec" according to two variables in the row: "MRN" and "COURSE" which match.
I have tried the following and variations thereon to no avail:
i=1
newlist=list()
colnames(newlist)=colnames(book)
for ( i in 1:dim(rec)[1]) {
mrn=as.numeric(as.vector(rec$MRN[i]));
course=as.character(rec$COURSE[i]);
get.vector<-as.vector(((as.numeric(as.vector(book$MRN))==mrn) & (as.character(book$COURSE)==course)))
newlist[i]<-book[get.vector,]
i=i+1;
}
If anyone has any suggestions on
1)getting this to work
2) making it more elegant (or perhaps just less clumsy)
If I have been unclear in any way I beg your pardons.
I do understand I haven't combined any data above, I think if I can generate a long-format data.frame I can combine them all on my own
Sounds like you need to merge the two data-frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
I've created a simple example that may help you. In my case i wanted to paste the 'value' column from df1 in each row of df2, according to variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
test x1 x2 value
1 1 1 2 12
2 2 1 3 56
3 3 2 1 35
4 4 2 2 68
5 5 1 2 12
6 6 1 3 56
7 7 2 1 35

Resources