R: Searching a column in a dataframe for matches to a reference list in another dataframe

I am trying to categorize genes with multiple GO descriptors into bins based on what those GO descriptors are related to. I have dataframe A which contains the raw data associated with a list of geneIDs (>500,000) and their associated GO descriptors and dataframe B which classifies these GO descriptors into larger groups.
Example of dataframe A
dfA
Example of dataframe B
dfB
Ideally, the final output would reference the entire list and generate a new column in dataframe A classifying each GeneID into the GO_Categories associated with its specific GO_IDs. Bonus points if it removes duplicate hits on the GO_Categories.
Looking something like this...
Example of Ideal Solution
However, I know that the ideal solution might be difficult to obtain, and I already have dataframe B listed out based on the unique GO_Categories so a solution like this might be easier to obtain.
Example of Acceptable Solution
So far I have struggled with getting any command to search for partial strings using a list from another dataframe with the goal of returning all matches.
I have had partial success with the acceptable solution approach and using:
dfA <- dfA %>%
mutate(GO_Cat_1 = c('No', 'Yes')[1+str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])
This seems to work, but it returns an error along the lines of:
problem with mutate() column GO_Cat_1.
i GO_Cat_1 = ...[].
i longer object length is not a multiple of shorter object length
I have also tried to look into applying grepl/grep - but struggled to feed it a list of terms to look for partial string matches in dfA.
Any assistance is greatly appreciated!
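One possible way to approach both outputs, as a minimal base R sketch rather than a definitive answer. The length error most likely arises because str_detect() is vectorised over both the string and the pattern: it pairs them element by element (recycling the shorter one) instead of testing every row of dfA against every ID in dfB. The toy data below is hypothetical, and it assumes that GO_IDs in dfA holds several IDs in one string separated by ";" and that dfB has one row per GO ID with columns named GO_IDs and GO_Category.

```r
# Hypothetical toy data: values, column names and the ";" separator are assumptions.
dfA <- data.frame(
  GeneID = c("gene1", "gene2", "gene3", "gene4"),
  GO_IDs = c("GO:0000001;GO:0000002", "GO:0000003",
             "GO:0000002;GO:0000003", "GO:0000009"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  GO_IDs      = c("GO:0000001", "GO:0000002", "GO:0000003"),
  GO_Category = c("Metabolism", "Metabolism", "Transport"),
  stringsAsFactors = FALSE
)

# "Ideal" output: split each gene's GO string into single IDs, look each ID up
# in dfB, and keep the unique categories that were found.
id_list <- strsplit(dfA$GO_IDs, ";", fixed = TRUE)
dfA$GO_Category <- vapply(id_list, function(ids) {
  cats <- dfB$GO_Category[match(trimws(ids), dfB$GO_IDs)]
  cats <- unique(cats[!is.na(cats)])
  if (length(cats) == 0) NA_character_ else paste(cats, collapse = "; ")
}, character(1))

# "Acceptable" output: one Yes/No column for a single category.
# Collapsing the IDs with "|" gives ONE regular expression, so every row of
# dfA is tested against all IDs of that category at once.
metab_ids <- dfB$GO_IDs[dfB$GO_Category == "Metabolism"]
pattern   <- paste(metab_ids, collapse = "|")
dfA$GO_Cat_1 <- ifelse(grepl(pattern, dfA$GO_IDs), "Yes", "No")
```

The same collapsed pattern can also be passed to str_detect() in place of as.character(dfB$GO_IDs), since it is then a single pattern rather than a vector of them.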

Related

R function for identifying values from one column in another?

I have two different data frames, each of them consisting of a list of "genes" and a list of "interactors" (other genes). Is it possible with R to check if there are any "genes" from one list that are also present in any of the columns of "interactors" from the other data frame, and vice versa?
I am quite new in R, so perhaps there is an easy way to perform this, but I don't even know how to look for it.
Thanks in advance!
Guillermo.
please can you show a sample of your data?
In any case, I guess the following is what you need:
df_common<-data.frame(df[which(df$genes %in% df$interactors),])
It checks which elements in the column "genes" of the data frame df are also present (%in%) in the column "interactors" of the same data frame, and keeps those rows.
Is this what you are looking for? If not, please paste your input and the desired output.
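Since the question mentions two data frames and possibly several "interactors" columns, the same %in% idea can be extended as in the sketch below. The data frames df1 and df2, their column names and their values are all hypothetical, because no sample data was posted.

```r
# Hypothetical toy data; names and values are assumptions for illustration.
df1 <- data.frame(genes = c("A", "B", "C"),
                  interactors = c("X", "Y", "Z"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(genes = c("X", "D", "E"),
                  interactor1 = c("A", "Q", "R"),
                  interactor2 = c("S", "C", "T"),
                  stringsAsFactors = FALSE)

# Pool every interactor column of df2 into one vector, then keep the rows of
# df1 whose gene appears anywhere in that pool.
interactor_cols <- c("interactor1", "interactor2")
df1_in_df2 <- df1[df1$genes %in% unlist(df2[interactor_cols]), ]

# And vice versa: genes of df2 that appear among the interactors of df1.
df2_in_df1 <- df2[df2$genes %in% df1$interactors, ]
```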

copying data from one data frame to other using variable in R

I am trying to transfer data from one data frame to another. I want to copy 8 columns from a huge data frame to a smaller one and name the columns n1, n2, etc.
First I find the number of the first column I need to copy, using this:
x=as.numeric(which(colnames(old_df)=='N1_data'))
Then I am pasting it in new data frame this way
new_df[paste('N',1:8,'new',sep='')]=old_df[x:x+7]
However, when I run this, all 8 new columns contain exactly the same data. If I use the value of x directly instead, I get what I want:
new_df[paste('N',1:8,'new',sep='')]=old_df[10:17]
So my questions are
Why am I not able to use the variable x? I added as.numeric just to make sure it is a number and not a list, but that does not seem to help.
Is there any better or more efficient way to achieve this?
If I'm understanding your question correctly, you may be overthinking the problem.
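library(dplyr)
# keep only the eight columns of interest
new_df <- select(old_df, N1_data, N2_data, N3_data, N4_data,
N5_data, N6_data, N7_data, N8_data)
# rename N1_data ... N8_data to n1 ... n8 ("\\1" is the captured digit)
colnames(new_df) <- sub("N(\\d)_data", "n\\1", colnames(new_df))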
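As for why the variable x did not work: in R the `:` operator binds more tightly than `+`, so x:x+7 is evaluated as (x:x)+7, which is the single number x+7 rather than a range. That selects only one column, which is consistent with the eight identical columns described in the question. A minimal sketch with a hypothetical old_df (its size and contents are made up for illustration):

```r
# Hypothetical 20-column data frame with N1_data..N8_data in columns 10 to 17.
old_df <- as.data.frame(matrix(1:60, nrow = 3))
names(old_df)[10:17] <- paste0("N", 1:8, "_data")

x <- which(colnames(old_df) == "N1_data")  # which() already returns a number

wrong_index <- x:x + 7    # parsed as (x:x) + 7, a single column index
right_index <- x:(x + 7)  # the intended range of eight columns

# Copy the eight columns and rename them as in the question.
new_df <- old_df[right_index]
names(new_df) <- paste("N", 1:8, "new", sep = "")
```

With the parentheses in place, old_df[x:(x + 7)] behaves the same as the hard-coded old_df[10:17].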

List rows with specific columns with a specific value in R from dataset

I have a dataset called "flights" and I am attempting to list all the rows that have the value of "Escanaba, Michigan" in the column Destination. I would like to show 5 columns and then all the rows that apply to Escanaba.
Currently I have...
flights[,c("FlightDate","Carrier","Destination","DestCityName","AirTime")]
That works perfectly for what I want, except it shows all rows.
How do I call out a specific value from a column in a dataset?
This is a pretty basic indexing question (see e.g here, which was the first hit when I googled "R indexing"); you need to construct a logical vector that is TRUE for the relevant rows.
flights[flights$Destination=="Escanaba, Michigan",
c("FlightDate","Carrier","Destination","DestCityName","AirTime")]
A prettier alternative for interactive work (not entirely safe for programmatic use):
subset(flights,Destination=="Escanaba, Michigan",
select=c(FlightDate,Carrier,
Destination,DestCityName,AirTime))
If you want to allow for more than one possible value of Destination, try %in%
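A minimal sketch of the %in% variant, using a small hypothetical flights data frame (all values are made up; only the column names come from the question):

```r
# Hypothetical toy data; the values are assumptions for illustration.
flights <- data.frame(
  FlightDate   = c("2015-01-01", "2015-01-02", "2015-01-03"),
  Carrier      = c("DL", "OO", "DL"),
  Destination  = c("Escanaba, Michigan", "Detroit, Michigan",
                   "Marquette, Michigan"),
  DestCityName = c("Escanaba, MI", "Detroit, MI", "Marquette, MI"),
  AirTime      = c(45, 60, 50),
  stringsAsFactors = FALSE
)

cols   <- c("FlightDate", "Carrier", "Destination", "DestCityName", "AirTime")
wanted <- c("Escanaba, Michigan", "Marquette, Michigan")

# %in% is TRUE for any row whose Destination is one of the wanted values.
selected <- flights[flights$Destination %in% wanted, cols]
```

One difference from == worth knowing: %in% returns FALSE rather than NA for missing values, so rows with a missing Destination are simply dropped.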

Changing hundreds of column names simultaneously in R

I have a data frame with hundreds of columns whose names I want to change. I'm very new to R, so it's rather easy to think through the logic of this, but I simply can't find a relevant example online.
The closest I could sort of get was this:
projectFileAllCombinedNames <- for (i in 1:200){names(projectFileAllCombined)[i+1] <-variableNames[i]}
Basically, starting at the second column of projectFileAllCombined, I want to loop through the columns in the dataframe and assign them the data values in the second data frame. I was able to change one column name manually with this code:
colnames(projectFileAllCombined)[2]<-"newColumnName"
but I can't possibly do that for hundreds of columns. I've spent multiple hours on this and can't crack it with any number of Google searches on "change multiple columns in r" or "change column names in r". The best I can find online is examples where people change a few columns with a c() function and I get how that works, but that still seems to require typing out all the column names as parameters to the function, unless there is a way to just pass the "variableNames" file into that c() function, but I don't know of one.
Will
colnames(projectFileAllCombined)[-1] <- variableNames
not suffice?
This assumes the ordering of columns in projectFileAllCombined is the same as the ordering of the new variable names in variableNames, and that
length(variableNames) == (ncol(projectFileAllCombined) - 1)
The key point here is that the replacement function 'colnames<-'() is vectorised and can replace any number of column names in a single call if passed a vector of replacement values.
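A small sketch of that replacement on hypothetical data. The toy columns and names are made up; and if variableNames is actually a data frame read from a file, as the question suggests, the names would first need to be pulled out as a character vector, for example with variableNames[[1]] (an assumption about how that file is laid out).

```r
# Hypothetical toy data; the real data frame has hundreds of columns.
projectFileAllCombined <- data.frame(id = 1:2, V1 = 3:4, V2 = 5:6)
variableNames <- c("height", "weight")

# Guard against a length mismatch before replacing anything.
stopifnot(length(variableNames) == ncol(projectFileAllCombined) - 1)

# [-1] means "every column except the first"; all of them are renamed at once.
colnames(projectFileAllCombined)[-1] <- variableNames
```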

How do I match single ID's in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded=Awards$TotalAwarded[match(Proposals$IDs,Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is returned.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge( Proposals, Awards, by="IDs", all.y=TRUE )
But I cannot believe this hasn't been asked on SO before.
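To illustrate the difference, here is a minimal sketch on hypothetical data (the IDs and amounts are made up). match() returns only the position of the first hit for each ID, whereas merge() produces one output row per matching pair. The sketch uses all.x=TRUE, an assumption about the desired output, so that proposals without any award are kept with NA.

```r
# Hypothetical toy data; values are assumptions for illustration.
Proposals <- data.frame(IDs = c("P1", "P2", "P3"), stringsAsFactors = FALSE)
Awards <- data.frame(IDs = c("P1", "P1", "P3"),
                     TotalAwarded = c(100, 250, 75),
                     stringsAsFactors = FALSE)

# match(): one position per proposal, so the second P1 award is lost.
first_only <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]

# merge(): one row per proposal/award pair; P1 appears twice, P2 gets NA.
all_matches <- merge(Proposals, Awards, by = "IDs", all.x = TRUE)
```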
