Subsetting a dataframe based on another dataframe column value in R [duplicate]

This question already has answers here:
Subset rows in a data frame based on a vector of values
(4 answers)
Subsetting a data frame based on contents of another data frame
(1 answer)
Closed last year.
I have the following dataframe, df:
studID Name
023 John
283 Mary
842 Jacob
211 Amy
and another dataframe, df_2:
studID
023
999
100
211
575
I want to subset the first dataframe, df, so that it only contains the rows whose studID exists in df_2.
So I would get:
studID Name
023 John
211 Amy
This dataframe would only contain John's and Amy's records, since their studID values are found in df_2.
I tried the following:
df_3 <- df[intersect(df$studID, df_2$studID),]
But I'm getting NA values.
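The NA rows are most likely because intersect() returns the matching ID values themselves, which are then treated as row positions or row names rather than as a filter. A minimal sketch of one common fix, using a logical test with %in%, follows. It assumes studID is stored as character so the leading zero in "023" is preserved; the data frames are rebuilt here only for illustration.

```r
# Rebuild the example data (studID kept as character: an assumption)
df <- data.frame(
  studID = c("023", "283", "842", "211"),
  Name   = c("John", "Mary", "Jacob", "Amy"),
  stringsAsFactors = FALSE
)
df_2 <- data.frame(
  studID = c("023", "999", "100", "211", "575"),
  stringsAsFactors = FALSE
)

# %in% yields one TRUE/FALSE per row of df, so it can index rows directly
df_3 <- df[df$studID %in% df_2$studID, ]
df_3
```

Because the index is a logical vector the same length as df, no out-of-range positions are requested and no NA rows appear.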

Related

How can I combine two data frames and, where there is a match on a common field, merge the data into a single record? [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN? [duplicate]
(3 answers)
Closed 1 year ago.
I have two data frames (df1 and df2) which I wish to combine and, where there is a match on a common field, combine those rows' data, otherwise simply add them to the resulting data frame.
They share a common field (specimenid) and where there are matches between df1 and df2, I wish to combine those individual rows' data to produce a single row, featuring the fields of both data frames.
For example, my data frames might be something like this:
df1:
specimenid firstname age
001 bob 45
004 stuart 65
005 eunice 43
006 robert 20
007 james 40
df2:
specimenid surname salary department
001 hastings £45,000 sales
002 smith £28,500 accounting
007 bond £150,000 international relations
008 jennings £50,000 contracts
And the resulting data frame should look like this:
specimenid firstname age surname salary department
001 bob 45 hastings £45,000 sales
002 NA NA smith £28,500 accounting
004 stuart 65 NA NA NA
005 eunice 43 NA NA NA
006 robert 20 NA NA NA
007 james 40 bond £150,000 international relations
008 NA NA jennings £50,000 contracts
The resulting data frame will be a combination of the data frames' rows where there is a match on specimenid in both df1 and df2 and for those records where there is no match, they will be added as new rows with NAs populating the unmatched fields.
I want to include the data from both data frames regardless of whether there is a match or not. If there is a match, then I wish to combine the data frames' rows into a single entry.
I feel that this is a clear example and explanation, but if further clarification is needed then I will update the question.
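The result described above matches a full outer join. A minimal sketch with base R's merge() and all = TRUE follows. It assumes specimenid is stored as character (to keep the leading zeros) and that salary is kept as text; the data frames are rebuilt from the example only for illustration.

```r
# Example data from the question (column types are assumptions)
df1 <- data.frame(
  specimenid = c("001", "004", "005", "006", "007"),
  firstname  = c("bob", "stuart", "eunice", "robert", "james"),
  age        = c(45, 65, 43, 20, 40),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  specimenid = c("001", "002", "007", "008"),
  surname    = c("hastings", "smith", "bond", "jennings"),
  salary     = c("£45,000", "£28,500", "£150,000", "£50,000"),
  department = c("sales", "accounting", "international relations", "contracts"),
  stringsAsFactors = FALSE
)

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA
combined <- merge(df1, df2, by = "specimenid", all = TRUE)
combined
```

By default merge() sorts the result on the key column, which is why the rows come back ordered by specimenid.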

Trying to find a specific element based on a condition [duplicate]

This question already has answers here:
Find value corresponding to maximum in other column [duplicate]
(2 answers)
Closed 2 years ago.
This is my data frame in RStudio. I'm trying to find code that will produce the name of the student with the highest age.
students.df #Name of dataframe
name DAD BDA gender nationality age
1 Amy 80 70 F IRL 20
2 Bill 65 50 M UK 21
3 Carl 50 80 M IRL 22
as.character(subset(students.df,students.df$age==max(students.df$age))$name)
library(dplyr)
students.df %>% filter(age==max(age)) %>% select(name)
You can try this:
students.df[which.max(students.df$age), ]
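The approaches above differ when several students share the highest age: which.max() returns only the first maximum, while comparing against max() keeps every tied row. A small base R sketch on the example data follows; the variable names first_oldest and all_oldest are made up for illustration.

```r
# Example data from the question
students.df <- data.frame(
  name        = c("Amy", "Bill", "Carl"),
  DAD         = c(80, 65, 50),
  BDA         = c(70, 50, 80),
  gender      = c("F", "M", "M"),
  nationality = c("IRL", "UK", "IRL"),
  age         = c(20, 21, 22),
  stringsAsFactors = FALSE
)

# First student with the highest age (a single value even if ages tie)
first_oldest <- students.df$name[which.max(students.df$age)]

# Every student whose age equals the maximum (may be more than one)
all_oldest <- students.df$name[students.df$age == max(students.df$age)]
```

On this data both give "Carl", since there is a single oldest student.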

Combining rows from the same data frame [duplicate]

This question already has answers here:
Merge multiple variables in R
(6 answers)
How to implement coalesce efficiently in R
(9 answers)
Closed 3 years ago.
I am trying to write code that creates a new column combining two existing columns. The idea is to take the value from the other column when one is NA.
The new column will be "EventDate".
Here is a sample data frame:
Id SDate CDate EventDate
101 2013-03-27 NA 2013-03-27
101 2013-05-09 NA 2013-05-09
101 NA 2013-05-30 2013-05-30
101 NA 2013-07-26 2013-07-26
We can use coalesce:
library(tidyverse)
df1 %>%
  mutate(EventDate = coalesce(SDate, CDate))
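If adding a package is not an option, the same idea can be sketched in base R by starting from SDate and filling its NA entries from CDate. This is a replacement technique, not dplyr's coalesce() itself, and it assumes both columns are of class Date; df1 is rebuilt from the sample data only for illustration.

```r
# Sample data from the question, with both date columns stored as Date
df1 <- data.frame(
  Id    = 101,
  SDate = as.Date(c("2013-03-27", "2013-05-09", NA, NA)),
  CDate = as.Date(c(NA, NA, "2013-05-30", "2013-07-26"))
)

# Start from SDate, then fill the missing entries from CDate
df1$EventDate <- df1$SDate
missing <- is.na(df1$EventDate)
df1$EventDate[missing] <- df1$CDate[missing]
df1
```

Assigning into the existing Date vector keeps the Date class, which ifelse() would drop.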

How to transpose data frame , keeping row and column names using R

I have a gene expression dataset that currently has columns of patient samples and rows of genes. I need to transpose the dataset so that the genes are now columns and rows are now patient samples using R. I have found a few ways yet none have been successful. I appreciate your help in advance! :)
Make a data frame as follows:
df <- data.frame(Joe = c(45,123,1007), Mary=c(1,456,103))
rownames(df) <- c("Wnt1", "Bmp4", "BRCA2")
df
       Joe Mary
Wnt1    45    1
Bmp4   123  456
BRCA2 1007  103
And to transpose it, simply:
t(df)
     Wnt1 Bmp4 BRCA2
Joe    45  123  1007
Mary    1  456   103
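Note that t() returns a matrix, not a data frame. If a data frame is needed afterwards, one option is to wrap the call in as.data.frame(), as in this short sketch on the same example data (the name transposed is made up for illustration):

```r
# Same example data as above: genes in rows, patients in columns
df <- data.frame(Joe = c(45, 123, 1007), Mary = c(1, 456, 103))
rownames(df) <- c("Wnt1", "Bmp4", "BRCA2")

# t() gives a matrix; as.data.frame() converts it back,
# keeping the row and column names
transposed <- as.data.frame(t(df))
transposed
```

This works cleanly when all columns are numeric. With mixed column types, t() coerces everything to character first.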

Locate and merge duplicate rows in a data.frame but ignore column order

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (assumes the original data frame is dd). We create a lookup column (take a look; it should be self-explanatory), get the sums of the total column for each combination, and then filter down to the unique combinations:
# Build an order-independent key, e.g. "Bob~Fred~Sam" for rows 1 and 5
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
  function(x){paste(sort(x),collapse="~")})
# Sum the totals within each key
tab1=tapply(dd$total,dd$lookup,sum)
# Keep the first row for each key, then attach the summed total
ee=dd[match(unique(dd$lookup),dd$lookup),]
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee,data.frame(name1,name2,name3,
total=newtotal,stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35
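For readers who prefer to avoid plyr, the same sort-then-aggregate idea can be sketched with base R's aggregate(). This is an alternative to the ddply call above, not part of either original answer; the names dd, name_cols and out are chosen for illustration.

```r
# Example data from the question
dd <- read.table(text = "name1 name2 name3 total
Bob Fred Sam 30
Bob Joe Frank 20
Frank Sam Tom 25
Sam Tom Frank 10
Fred Bob Sam 15", header = TRUE, stringsAsFactors = FALSE)

name_cols <- c("name1", "name2", "name3")

# Sort the three names within each row so that column order no longer matters
dd[, name_cols] <- t(apply(dd[, name_cols], 1, sort))

# Sum the totals for each distinct combination of names
out <- aggregate(total ~ name1 + name2 + name3, data = dd, FUN = sum)
out
```

The row order of out may differ from the ddply output, but the three combinations and their totals should be the same.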
