How do I hash in these 2 dataframes in R?

So I have these two columns (genome$V9 and Impact4$INFO) from two different dataframes, as shown below.
Basically, there is a value inside each Impact4$INFO row (with a structure like OE6AXXXXXXX, where X is an integer) that I want to filter on in each row of genome$V9. I understand it is complicated, since there are a lot of values inside both columns...
Thank you

You can extract numbers from strings quite easily when the structure is consistent. Given that yours is, you can try:
library(stringr)
test <- c("ID=OE6A002689", "ID=OE6A044524", "ID=OE6A057168TI")
str_extract(test, "[0-9]{6}")
Output is:
[1] "002689" "044524" "057168"
Given you want to filter your genome data based on this, you can try:
library(dplyr)
library(stringr)
ids <- str_extract(Impact4$INFO, "[0-9]{6}")
genome %>%
mutate(ind = str_extract(V9, "[0-9]{6}")) %>%
filter(ind %in% ids)
Hope that helps? Otherwise you have to provide a reproducible example (post example data here).
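For readers who prefer base R, the same extraction can be sketched without stringr (assuming the six-digit structure from the answer above holds):

```r
test <- c("ID=OE6A002689", "ID=OE6A044524", "ID=OE6A057168TI")

# regexpr() locates the first six-digit run; regmatches() extracts it
ids <- regmatches(test, regexpr("[0-9]{6}", test))
ids
# [1] "002689" "044524" "057168"
```

The extracted ids can then be matched with %in% exactly as in the dplyr version.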

Related

Drop rows if combination of multiple rows match regex value(s) using R dplyr (or other)

First off, I'm not very well versed in R; I code mainly in Bash and sometimes in Python. That being said, I have a dataframe with the following (variable) structure. Between the columns 'info' and 'gene' there can be up to 5 columns, and they may have different names. The data content in each column will start with either r, Ref, n, N, No_GT, Hom or Het. If a column starts with Ref, No_GT, Hom or Het, it will have additional data delimited by :.
For example, the demo table below:
info  s1               gene
a     r                GG
b     Hom:10,10:20:99  TG
c     Het:5,6:11:20    TGGB
To identify the column names of my interest, I'm using this snippet-
my_file %>% select('info':'gene') %>% colnames() -> samples
samples <- samples[! samples %in% c("info","gene")]
In case there is a single column, I need to remove rows containing r, n, N, Ref or No_GT. This can be achieved using grepl and a regex match. For example, using
df[!grepl("^r|^n|^N|^Ref|^No_GT", df$s1),]
And the first row is removed.
However, there may be more columns between info and gene, for example:
info  s1               s2  gene
a     r                n   GG
b     Hom:10,10:20:99  n   TG
c     Het:5,6:11:20    r   TGGB
My problem arises when there are multiple samples. In this case, I have to drop rows where all sample columns have a combination of r, n, N, Ref and No_GT, i.e. if any of the samples has Hom or Het at the beginning, the row has to be preserved. I have the names of the columns, but I'm not entirely sure of the optimum way to solve this problem! I could cycle through each column, but then how do I break out if I encounter a Hom or Het?
Any help appreciated!
I tried using filter and select, however I'm not able to specify multiple columns even when using across.
I tried this:
my_file_sorted %>% filter(across(c(s1,s2), ~ "^Het|^Hom")) -> trimmed
However I'm getting this error
Error in `filter()`:
! Problem while computing `..1 = across(.cols = c(s1, s2))`.
✖ Input `..1$s1` must be a logical vector, not a character.
You could use if_any:
library(dplyr)
library(stringr)
df %>%
  filter(if_any(matches("^s[0-9]+$"), str_detect, pattern = "^(Hom|Het)"))
or purrr::pmap_lgl:
df %>%
filter(
pmap_lgl(
select(., matches("^s[0-9]+$")),
~any(str_detect(c(...), "^(Hom|Het)"))
)
)
If you know which columns are required for this test and have them in a vector my_cols, you can just replace matches("^s[0-9]+$") with all_of(my_cols).
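To make the if_any answer concrete, here is a small reproducible sketch built from the demo table in the question (the column names s1/s2 are assumed):

```r
library(dplyr)
library(stringr)

df <- data.frame(
  info = c("a", "b", "c"),
  s1   = c("r", "Hom:10,10:20:99", "Het:5,6:11:20"),
  s2   = c("n", "n", "r"),
  gene = c("GG", "TG", "TGGB")
)

# keep rows where at least one sample column starts with Hom or Het
df %>%
  filter(if_any(matches("^s[0-9]+$"), str_detect, pattern = "^(Hom|Het)"))
# rows b and c are kept; row a (r/n in both samples) is dropped
```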

Transform a csv or excel table (with rows in one order and eg 20 columns with head) into another table with same rows in other pre-established order

I want to transform a CSV or Excel table (with rows in one order and e.g. 20 columns with a header) into another table with the same rows in another pre-established order.
Thank you very much
Suppose your table looks a bit like this, once you've loaded it into R:
# Packages
library(tidyverse)
# Sample Table
df <- tibble(names = c("Jane","Boris","Michael","Tina"),
height = c(167,175,182,171),
age = c(26,45,32,51),
occupation = c("Teacher","Construction Worker","Salesman","Banker"))
If all you want to do is reorder the columns, you can do the following:
df <- df %>%
select(occupation,height,age,names)
There are other ways to do this, especially if you only want to move one or two columns out of your 20. But suppose you want to rearrange all of them, this will do it.
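For the "move just one or two columns" case mentioned above, dplyr's relocate() (available in dplyr >= 1.0.0) is a lighter-weight sketch than spelling out every column in select():

```r
library(dplyr)

df <- tibble(names = c("Jane", "Boris"),
             height = c(167, 175),
             age = c(26, 45),
             occupation = c("Teacher", "Banker"))

# move occupation to the front, leaving the other columns in place
df %>% relocate(occupation)

# or move age to sit right after names
df %>% relocate(age, .after = names)
```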

How do I show which variables are not shared by two datasets in R?

I have two data sets (A and B), one with 1600 observations/rows and 1002 variables/columns and one with 860 observations/rows and 1040 variables/columns. I want to quickly check which variables are not contained in dataset A but are in dataset B, and vice versa. I am only interested in the column names, not in the observations contained within these columns.
I found this great function here: https://cran.r-project.org/web/packages/arsenal/vignettes/comparedf.html and essentially I would want to get an output similar to this:
The code I am trying is summary(comparedf(dataA, dataB)). However, the table is not printed because R does a row-by-row comparison of both data sets and then runs out of space when printing the results in the console. Is there a quick way of achieving what I need here?
I think you can use the anti_join() function from the dplyr package to find the unmatched records. It will give you an output of the rows that data sets A and B do not share in common. Here is an example:
table1<-data.frame(id=c(1:5), animal=c("cat", "dog", "parakeet",
"lion", "duck"))
table2<-table1[c(1,3,5),]
library(dplyr)
anti_join(table1, table2, by="id")
id animal
1 2 dog
2 4 lion
This will return the unshared rows by ID.
Edit
If you want to find which column names/variables appear in one data frame but not the other, you could use this solution:
df1 <- data.frame(a=rnorm(100), b=rnorm(100), not=rnorm(100))
df2 <- data.frame(a=rnorm(100), b=rnorm(100))
df1[, !names(df1) %in% names(df2)] #returns column/variable that appears in df1 but not in df2
I hope this answers your question. It will return the actual values beneath each unshared column/variable, but you could save the output to an object and run colnames() on it, which should print your unshared column/variable names.
It may be a bit clunky, but combining setdiff() with colnames() may work.
Doing both setdiff(colnames(DataA),colnames(DataB)) and setdiff(colnames(DataB),colnames(DataA)) will give you 2 vectors, each with the names of the columns present in one of the datasets but not in the other one.
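A minimal sketch of the setdiff()/colnames() approach, using toy data frames in place of DataA and DataB:

```r
dataA <- data.frame(a = 1, b = 2, onlyA = 3)
dataB <- data.frame(a = 1, b = 2, onlyB = 4)

# columns present in A but missing from B, and vice versa
setdiff(colnames(dataA), colnames(dataB))
# [1] "onlyA"
setdiff(colnames(dataB), colnames(dataA))
# [1] "onlyB"
```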

select text from multiple combinations of text within a dataframe R

I want to subset data based on a text code that is used in numerous combinations throughout one column of a df. I first checked all the variations by creating a table:
list <- as.data.frame(table(EQP$col1))
I want to search within the dataframe for the text "EFC" (even when combined with other letters) and subset these rows so that I have a resultant dataframe that looks like this.
I have looked through this question here, but this does not answer the question. I have reviewed the tidytext package, but this does not seem to be the solution either.
How to Extract keywords from a Data Frame in R
You can simply use grepl.
Considering your data.frame is called df and the column to subset on is col1
df <- data.frame(
col1 = c("eraEFC", "dfs", "asdj, aslkj", "dlja,EFC,:LJ)"),
stringsAsFactors = F
)
df[grepl("EFC", df$col1), , drop = F]
Another option besides the mentioned solution by Gallarus would be:
library(stringr)
library(dplyr)
df %>% filter(str_detect(Var1, "EFC"))
As described by Sam Firke in this post:
Selecting rows where a column has a string like 'hsa..' (partial string match)
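Combining the two answers, here is a self-contained sketch running str_detect() on the same toy data frame (the column is col1 here; in the question's table() output it would be Var1):

```r
library(dplyr)
library(stringr)

df <- data.frame(
  col1 = c("eraEFC", "dfs", "asdj, aslkj", "dlja,EFC,:LJ)"),
  stringsAsFactors = FALSE
)

# keep rows whose col1 contains "EFC" anywhere in the string
df %>% filter(str_detect(col1, "EFC"))
# rows 1 and 4 are returned
```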

Counting number of rows where a value occurs at least once within many columns

I updated the question with pseudocode to better explain what I would like to do.
I have a data.frame named df_sel, with 5064 rows and 215 columns.
Some of the columns (~80) contains integers with a unique identifier for a specific trait (medications). These columns are named "meds_0_1", "meds_0_2", "meds_0_3" etc. as well as "meds_1_1", "meds_1_2", "meds_1_3". Each column may or may not contain any of the integer values I am looking for.
For the specific integer values to look for, some could be grouped under different types of medication, but coded for specific brand names.
metformin = 1140884600 # not grouped
sulfonylurea = c(1140874718, 1140874724, 1140874726) # grouped
If it would be possible to look-up a group of medications, like in a vector format as above, that would be helpful.
I would like to do this:
IF [a specific row]
CONTAINS [the single integer value of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_METFORMIN = 1 ELSE A_NEW_VARIABLE_METFORMIN = 0
and correspondingly
IF [a specific row]
CONTAINS [any of multiple integer values of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_SULFONYLUREA = 1 ELSE A_NEW_VARIABLE_SULFONYLUREA = 0
I have managed to create a vector based on column names:
column_names <- names(df_sel) %>% str_subset('^meds_0')
But I haven't gotten any further despite some suggestions below.
I hope you understand better what I am trying to do.
As for the selection of the columns, you could do this by first extracting the names in the way you are doing with a regex, and then using select:
library(stringr)
column_names <- names(df_sel) %>%
str_subset('^meds_0')
relevant_df <- df_sel %>%
  select(all_of(column_names))
I didn't quite get the structure of your variables (if they are integers, logicals, etc.), so I'm not sure how to continue, but it would probably involve something like summing across all the columns and dropping those that are not 0, like:
meds_taken <- rowSums(relevant_df)
df_sel_med_count <- df_sel %>%
add_column(meds_taken)
At this point you should have your initial df with the relevant data in one column, and you can summarize by subject, medication or whatever in any way you want.
If this is not enough, please edit your question providing a relevant sample of your data (you can do this with the dput function) and I'll edit this answer to add more detail.
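The question's pseudocode can also be sketched directly in base R; the toy values below reuse the medication codes from the question, and the column names are assumptions standing in for the real data:

```r
# toy data frame standing in for df_sel
df_sel <- data.frame(
  meds_0_1 = c(1140884600, 1140874718, NA),
  meds_0_2 = c(NA, 1140874724, 123),
  meds_1_1 = c(1140884600, NA, NA)  # meds_1_* columns are ignored here
)

metformin    <- 1140884600                                # not grouped
sulfonylurea <- c(1140874718, 1140874724, 1140874726)     # grouped

# restrict to the columns starting with "meds_0"
meds0 <- df_sel[, grepl("^meds_0", names(df_sel)), drop = FALSE]

# 1 if any meds_0 column in the row contains a matching code, else 0
df_sel$metformin    <- as.integer(apply(meds0, 1, function(r) any(r %in% metformin)))
df_sel$sulfonylurea <- as.integer(apply(meds0, 1, function(r) any(r %in% sulfonylurea)))

df_sel$metformin     # 1 0 0
df_sel$sulfonylurea  # 0 1 0
```

Note that %in% treats NA as a non-match, so missing medication slots are handled without extra work.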
First, I would like to start off by recommending Bioconductor for R libraries, as it sounds like you may be studying biological data. Now to your question.
Although tidyverse is the most widely accepted and 'easy' method, I would recommend in this instance using lapply, as it is extremely fast. Your code from a programming standpoint becomes a simple boolean, as you stated, but I think we can go a little further. Using the built-in data from mtcars:
data(mtcars)
head(mtcars, 6)
target <- 6
# logical vector per column: TRUE where the entry matches the target
matches <- lapply(mtcars, function(x) x %in% target)
# number of TRUEs for each column, and which columns have more than 0
column_sums <- unlist(lapply(matches, function(x) sum(x, na.rm = TRUE)))
which(column_sums > 0)
This will work with other data types with a few tweaks here and there.
