How can I exclude partial duplications of a dataframe? - r

I am a beginner in R and I have a question about data frames.
In my case, I have this data.frame as an example:
I want just one ENSEMBL_ID per SYMBOL, so I'd like to get rid of one of these. Can someone tell me how I could do this? In this case unique(data.frame) does not work because the rows differ in the third column.
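One possible approach, sketched under assumptions: the example table from the question was not preserved, so the values below and the third column name (GENENAME) are made up. Only ENSEMBL_ID and SYMBOL come from the question. The idea is to test for duplicates on SYMBOL alone with base R's duplicated(), rather than on whole rows as unique() does.

```r
# Hypothetical data: the values and the GENENAME column are invented
# for illustration; only ENSEMBL_ID and SYMBOL appear in the question.
df <- data.frame(
  ENSEMBL_ID = c("ENSG01", "ENSG02", "ENSG03", "ENSG04"),
  SYMBOL     = c("A", "A", "B", "C"),
  GENENAME   = c("x", "y", "z", "w"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of a SYMBOL after its first occurrence,
# so negating it keeps the first row seen for each SYMBOL.
dedup <- df[!duplicated(df$SYMBOL), ]
```

This keeps whichever row comes first for each SYMBOL. If a particular ENSEMBL_ID should win, sorting the data frame before this step would be one way to control that.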

Related

For and if loop in R

I am trying to get the following done: I have two columns (let's say codeA and codeB) in a dataframe A and want to compare their values to a column (codeC) of another dataframe B. codeA and codeB are the same in most cases; if they are not the same, the code (A or B) that matches codeC should be written to a new column.
So far I have not managed to achieve this result by combining if and for loops in R. Can someone help me?
Greatly appreciated!
I tried to code it using an if statement inside a for loop but did not get the result I needed.
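A loop-free sketch of the comparison described above, using nested ifelse() and %in%. It assumes that "matches codeC" means the code appears anywhere in B$codeC (not row by row), and that codeA is preferred when both codes match. The data and the new column name (matched) are hypothetical.

```r
# Hypothetical data; codeA, codeB and codeC are the names from the question.
A <- data.frame(
  codeA = c("x1", "x2", "x3"),
  codeB = c("x1", "y2", "z3"),
  stringsAsFactors = FALSE
)
B <- data.frame(codeC = c("x1", "y2", "x3"), stringsAsFactors = FALSE)

# If the two codes agree, keep that code. Otherwise take whichever one
# occurs in B$codeC, and use NA when neither does.
A$matched <- ifelse(
  A$codeA == A$codeB, A$codeA,
  ifelse(A$codeA %in% B$codeC, A$codeA,
         ifelse(A$codeB %in% B$codeC, A$codeB, NA_character_))
)
```

Because ifelse() works on whole columns at once, no explicit for loop is needed.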

R: Searching a column in a dataframe for matches to a reference list in another dataframe

I am trying to categorize genes with multiple GO descriptors into bins based on what those GO descriptors are related to. I have dataframe A which contains the raw data associated with a list of geneIDs (>500,000) and their associated GO descriptors and dataframe B which classifies these GO descriptors into larger groups.
Example of dataframe A
dfA
Example of dataframe B
dfB
Ideally, the final output would reference the entire list and generate a new column in dataframe A classifying the GeneIDs into the GO_Categories associated with their specific GO_IDs, with bonus points if it removes duplicate hits on the GO_Categories.
Looking something like this...
Example of Ideal Solution
However, I know that the ideal solution might be difficult to obtain, and I already have dataframe B listed out based on the unique GO_Categories so a solution like this might be easier to obtain.
Example of Acceptable Solution
So far I have struggled with getting any command to search for partial strings using a list from another dataframe with the goal of returning all matches.
I have had partial success with the acceptable solution approach and using:
dfA <- dfA %>%
  mutate(GO_Cat_1 = c('No', 'Yes')[1+str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])
The solution seems okay; however, it returns an error along the lines of:
problem with mutate() column GO_Cat_1.
i GO_Cat_1 = ...[].
i longer object length is not a multiple of shorter object length
I have also looked into applying grepl/grep, but struggled to feed it a list of terms for partial string matching against dfA.
Any assistance is greatly appreciated!
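The length message most likely arises because str_detect() pairs the strings and patterns element by element instead of testing every pattern against every string. A minimal base R sketch of the all-against-all version follows. It assumes dfA has GeneID and GO_IDs columns and dfB has GO_IDs and GO_Category columns. The example tables were not preserved, so the values and the "; " separator are made up.

```r
# Hypothetical data; the real tables were not included in the question.
dfA <- data.frame(
  GeneID = c("g1", "g2", "g3"),
  GO_IDs = c("GO:0000001; GO:0000002", "GO:0000003", "GO:0000004"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  GO_IDs      = c("GO:0000001", "GO:0000002", "GO:0000003"),
  GO_Category = c("cat1", "cat1", "cat2"),
  stringsAsFactors = FALSE
)

# For each gene, test every GO ID in dfB against its GO_IDs string
# (fixed = TRUE means literal matching, not regex), then keep the
# unique categories of the IDs that were found.
dfA$GO_Category <- vapply(dfA$GO_IDs, function(ids) {
  hit  <- vapply(dfB$GO_IDs, grepl, logical(1), x = ids, fixed = TRUE)
  cats <- unique(dfB$GO_Category[hit])
  if (length(cats) == 0) NA_character_ else paste(cats, collapse = "; ")
}, character(1), USE.NAMES = FALSE)
```

Here g1 matches two GO IDs in the same category, and unique() reduces them to a single entry. With more than 500,000 rows this nested loop may be slow, so splitting GO_IDs into one row per ID and joining on dfB could be worth trying instead.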

R - Can I have a matrix with different number of columns for rows?

This might be a stupid question. I have some NA values in a matrix that I need to pass to a JAGS model, and I want to remove those NAs. Can I remove only the NAs but keep the rest of the data?
My data looked like the picture below. Can I have rows with different column numbers?
You cannot.
You need to impute these missing values or remove either the column or the row entirely.
Imputing missing values is as complicated as you want it to be. You'd be best off reading the first few search results on the topic or just using the mean value of the column.
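A minimal sketch of the two options mentioned above, column-mean imputation and dropping incomplete rows, in base R. The matrix is a made-up stand-in for the one in the picture, which was not preserved.

```r
# Hypothetical 3 x 2 matrix with one NA in each column.
m <- matrix(c(1, NA, 3,
              4, 5, NA), nrow = 3)

# Option 1: replace each NA with the mean of its column.
col_means <- colMeans(m, na.rm = TRUE)
na_idx    <- which(is.na(m), arr.ind = TRUE)  # (row, col) position of each NA
m_imputed <- m
m_imputed[na_idx] <- col_means[na_idx[, "col"]]

# Option 2: drop every row that contains at least one NA.
complete_rows <- m[complete.cases(m), , drop = FALSE]
```

Mean imputation keeps the matrix dimensions but shrinks the variance of the imputed columns. Dropping rows avoids that at the cost of losing data.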

Extracting different vectors from a single column of data (in R)

I have a small problem, which I don't think is too hard, but I couldn't find any answer here (maybe I phrased my search wrong, so please excuse me if the question has already been asked!)
I am importing data from an Excel sheet which is split into two columns as in the following picture:
Now, I am trying to import all the data in the second column into my R script, but split into different vectors: one vector for category A, one for category B, etc., keeping the data points in the order they are in the file (because, as it happens, they are in chronological order).
The categories each have a different number of elements; however, they are ordered alphabetically (i.e. you'll never find an A among the B's, for example). So I guess that makes it easier, but I'm still a novice with R and I don't really know how to proceed without the code getting really messy, and I know there's probably a simple way of doing it.
Does anyone have an idea on how to treat this nicely please? :)
We can use split in base R to return a list of vectors of 'Data' based on the unique values in 'Category':
lst1 <- split(df1$Data, df1$Category)
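To show what the split() call returns, here is a small self-contained example. The values are made up, since the Excel sheet was not preserved. The names df1, Data and Category are the ones used in the answer.

```r
# Hypothetical data standing in for the imported Excel sheet.
df1 <- data.frame(
  Category = c("A", "A", "B", "C", "C", "C"),
  Data     = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# One vector per category, with the values in their original order.
lst1 <- split(df1$Data, df1$Category)

# Individual vectors can be pulled out of the list by name.
vec_A <- lst1$A
vec_C <- lst1[["C"]]
```

Keeping the vectors in one list is usually easier to work with than creating a separate variable per category, since the list can be passed to lapply() or sapply().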

Count number of 0's on each column

My problem is quite simple, but I am not able to solve it. I have a tibble and want to know how many 0s each column has. I tried using sum(dataframe$column == 0) on each column, but I think this is rather inefficient since I want to apply it to a number of different data frames. Is there a more automatic way to do it?
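One way this might be done in base R is colSums() on the logical comparison, which counts the zeros in every column at once. The sketch uses a plain data.frame with made-up values, and the same call is expected to work on a tibble. The helper name count_zeros is hypothetical.

```r
# Hypothetical data.
df <- data.frame(a = c(0, 1, 0), b = c(2, 0, 3), c = c(0, 0, 0))

# df == 0 gives a logical matrix; colSums() counts the TRUEs per column.
# na.rm = TRUE ignores missing values instead of returning NA.
count_zeros <- function(d) colSums(d == 0, na.rm = TRUE)

zero_counts <- count_zeros(df)

# To handle several data frames, put them in a list and use lapply().
all_counts <- lapply(list(first = df, second = df[, c("a", "b")]), count_zeros)
```

This assumes all columns are numeric. Columns of other types are still compared to 0, so it may be safer to select the numeric columns first.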
