Replace elements in an array using a conversion table (python) - data-conversion

I have a table (ppi_interactions) containing pairs of gene names in the Ensembl naming format:
Gene 1 Gene 2 Weight
0 9606.ENSP00000000233 9606.ENSP00000263431 0.26
1 9606.ENSP00000000233 9606.ENSP00000353863 0.164
2 9606.ENSP00000000233 9606.ENSP00000342026 0.159
...
I have to convert them to gene names in the 'Approved symbol' format, and I have a txt file (conversion_ppi) with the conversion table:
Approved symbol Ensembl gene ID
0 A1BG 9606.ENSG00000121410
1 A1BG-AS1 9606.ENSG00000268895
...
Is there a loop I can use to apply the conversion to my whole original table?
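One possible approach is to build a lookup dictionary from the conversion table and apply it to both gene columns. The sketch below uses plain lists of tuples as stand-ins for the two tables; the function name `convert_ids`, the symbols `GENE_A` and `GENE_B`, and the sample rows are hypothetical. The excerpts above show ENSP (protein) identifiers in the interaction table but ENSG (gene) identifiers in the conversion table, so the sketch assumes a conversion table keyed on the same kind of ID as the interaction table. IDs without a match are left unchanged.

```python
def convert_ids(interactions, conversion):
    """Replace Ensembl IDs in both gene columns with approved symbols."""
    # Map each Ensembl ID to its approved symbol for constant-time lookups.
    lookup = {ensembl_id: symbol for symbol, ensembl_id in conversion}
    converted = []
    for gene1, gene2, weight in interactions:
        # dict.get falls back to the original ID when no symbol is known.
        converted.append((lookup.get(gene1, gene1), lookup.get(gene2, gene2), weight))
    return converted


# Hypothetical conversion rows: (approved symbol, Ensembl ID).
conversion_ppi = [
    ("GENE_A", "9606.ENSP00000000233"),
    ("GENE_B", "9606.ENSP00000263431"),
]
ppi_interactions = [
    ("9606.ENSP00000000233", "9606.ENSP00000263431", 0.26),
    ("9606.ENSP00000000233", "9606.ENSP00000353863", 0.164),
]

converted = convert_ids(ppi_interactions, conversion_ppi)
for row in converted:
    print(row)
```

If the tables are pandas DataFrames, the same dictionary can be passed to `Series.map` for each gene column; unmatched IDs then become NaN instead of being kept.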

Related

How to subset the rows of my data frame based on a list of names?

I have a gene expression count data set with 2 columns and ~60000 rows. Each column is a sample and each row is a gene represented by an Ensembl ID. I need to subset my data to include only genes that are protein coding. Here is a small-scale example of what I would like to achieve:
Here is my data set called BDC containing the ensembl ID's of various types of genes along with the count data for each sample:
ENSEMBL Sample A Sample B
ENSG00000198888 10 2
ENSG00000210082 3 13
ENSG00000198763 6 18
ENSG00000198886 12 11
I also have a list of ensembl ID's called ProtCod that contains genes that I know are protein coding:
ENSEMBL_Protein_Coding
ENSG00000198888
ENSG00000198763
So I want to subset my data set to only include rows that have a protein coding ensembl ID and exclude all other rows:
ENSEMBL Sample A Sample B
ENSG00000198888 10 2
ENSG00000198763 6 18
But I need to achieve this on the large scale reducing my data set from ~60000 to ~20000 rows or genes.
This is what I've tried so far:
BDCProtCod <- BDC[!row.names(BDC) %in% ProtCod, ]
BDCProtCod
dim(BDCProtCod)
[1] 60675 2
The dimensions are the same as my original BDC data set, why isn't this code excluding the rows that don't contain names from ProtCod?
I've also tried:
BDCProtCod <- BDC[unlist(ProtCod), ]
BDCProtCod
dim(BDCProtCod)
[1] 19603 2
This actually excludes the rows I want to be excluded but it sets everything to "NA".
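You deleted your previous post while I was halfway through writing the answer.
It seems like the ID is stored as ENSEMBL in BDC and ENSEMBL_Protein_Coding in ProtCod, so to get them as vectors you should call them as BDC$ENSEMBL and ProtCod$ENSEMBL_Protein_Coding respectively. If so, row.names(BDC) holds only the default row numbers, which never match an ID, and the leading ! then keeps every row, which would explain the unchanged dimensions. Subsetting on the columns instead: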
BDC[BDC$ENSEMBL %in% ProtCod$ENSEMBL_Protein_Coding, ]
ENSEMBL SampleA SampleB
<chr> <dbl> <dbl>
1 ENSG00000198888 10 2
2 ENSG00000198763 6 18
Data:
require(readr)
BDC = readr::read_table("ENSEMBL SampleA SampleB
ENSG00000198888 10 2
ENSG00000210082 3 13
ENSG00000198763 6 18
ENSG00000198886 12 11")
ProtCod = readr::read_table('ENSEMBL_Protein_Coding
ENSG00000198888
ENSG00000198763')
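For readers working in Python, as in the main question, the same membership filter can be sketched with a set. This is only an illustration on the four example rows; the names `bdc`, `prot_cod` and `bdc_prot_cod` are hypothetical stand-ins for the R objects above.

```python
# Each row: (Ensembl ID, count in sample A, count in sample B).
bdc = [
    ("ENSG00000198888", 10, 2),
    ("ENSG00000210082", 3, 13),
    ("ENSG00000198763", 6, 18),
    ("ENSG00000198886", 12, 11),
]

# A set gives fast membership tests, which matters at ~60000 rows.
prot_cod = {"ENSG00000198888", "ENSG00000198763"}

# Keep only the rows whose ID is in the protein-coding set.
bdc_prot_cod = [row for row in bdc if row[0] in prot_cod]

for row in bdc_prot_cod:
    print(row)
```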

Reformat wrapped data coerced into a dataframe? (R)

I have some data I need to extract from a .txt file in a very weird, wrapped format. It looks like this:
eid nint hisv large NA
1 1.00 1.00000e+00 0 1.0 NA
2 -152552.00 -6.90613e+04 -884198 -48775.7 1151.70
3 -5190.13 4.17751e-05 NA NA NA
4 2.00 1.00000e+00 0 1.0 NA
5 -172188.00 -8.16684e+04 -809131 -56956.1 -1364.07
6 -5480.54 4.01573e-05 NA NA NA
Luckily, I do not need all of this data. I just want to match each eid with the value written in scientific notation, so:
eid sigma
1 1 4.17751e-005
2 2 4.01573e-005
3 3 3.72098e-005
This data goes on for hundreds of thousands of eids. I need to discard the last three values of each first row and all of the values in row 2, and keep the last (second) value in the third row, then place it next to the first value of row 1, and repeat. The column names other than 'eid' are totally disposable, too. I've never had to deal with wrapped data before, so I don't know where to begin.
**edited to show df after read-in.
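The steps described above can be sketched as a walk over the rows in groups of three. This assumes every record spans exactly three rows, which is what the excerpt suggests but is not confirmed for the full file. The function name `unwrap` and the use of `None` for NA are illustrative choices, and the rows are the six shown above with the row-number column dropped.

```python
def unwrap(rows, lines_per_record=3):
    """Pair the first value of each record's first row with the
    second value of its last row."""
    pairs = []
    # Step through complete records only; a trailing partial record is skipped.
    for start in range(0, len(rows) - lines_per_record + 1, lines_per_record):
        first = rows[start]
        last = rows[start + lines_per_record - 1]
        pairs.append((int(first[0]), last[1]))
    return pairs


rows = [
    [1.00, 1.0, 0, 1.0, None],
    [-152552.00, -6.90613e+04, -884198, -48775.7, 1151.70],
    [-5190.13, 4.17751e-05, None, None, None],
    [2.00, 1.0, 0, 1.0, None],
    [-172188.00, -8.16684e+04, -809131, -56956.1, -1364.07],
    [-5480.54, 4.01573e-05, None, None, None],
]

eid_sigma = unwrap(rows)
for eid, sigma in eid_sigma:
    print(eid, sigma)
```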

Replacing a string with a matched number in a column in R

I have a data frame in R with 10,000 columns and roughly 4,000 rows. The data are IDs. For example the IDs look like (rs100987, rs1803920, etc). Each rsID# has a corresponding iHS score between 0-3. I have a separate data frame where all the possible rs#'s in existence are in one column and their corresponding iHS scores are in the next column. I want to replace my 10,000 by 4,000 data frame with rsIDs to a 10,000 by 4,000 data frame with the corresponding iHS scores. How do I do this?
This is what my file looks like now:
input ID match 1 match 2 match 3 ......
rs6708 rs10089 rs100098 rs10567
rs8902 rs18079 rs234058 rs123098
rs9076 rs77890 rs445067 rs105023
This is what my iHS score file looks like (it has matching scores for every ID in the above file):
snpID iHS
rs6708 1.23
rs105023 0.92
rs234058 2.31
rs77890 0.31
I would like my output to look like
match 1 match 2 match 3
0.89 0.34 2.45
1.18 2.31 0.67
0.31 1.54 0.92
Let's consider a small example:
(dat <- data.frame(id1 = c("rs100987", "rs1803920"), id2=c("rs123", "rs456"), stringsAsFactors=FALSE))
# id1 id2
# 1 rs100987 rs123
# 2 rs1803920 rs456
(dat2 <- data.frame(id=c("rs123", "rs456", "rs100987", "rs1803920", "rs123456"),
score=5:1, stringsAsFactors=FALSE))
# id score
# 1 rs123 5
# 2 rs456 4
# 3 rs100987 3
# 4 rs1803920 2
# 5 rs123456 1
Then you can do this operation with:
apply(dat, 2, function(x) dat2$score[match(x, dat2$id)])
# id1 id2
# [1,] 3 5
# [2,] 2 4
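The call to match returns, for each id in a column, the position of its row in dat2; indexing dat2$score with those positions yields the scores. apply repeats this for every column and returns the result as a matrix.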
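The same lookup idea can be sketched in Python with a dictionary in place of match. The data mirror the small example above; storing `dat2` as a dict and returning `None` for IDs that have no score are assumptions of this sketch.

```python
# Rows of the ID table, one list per row (columns id1, id2).
dat = [
    ["rs100987", "rs123"],
    ["rs1803920", "rs456"],
]

# Score table as a dict: SNP ID -> score.
dat2 = {"rs123": 5, "rs456": 4, "rs100987": 3, "rs1803920": 2, "rs123456": 1}

# Replace every ID by its score; dict.get returns None for unknown IDs.
scores = [[dat2.get(snp) for snp in row] for row in dat]

for row in scores:
    print(row)
```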

How to do subset in a loop using a factor and assign names to subsets of data. Both factor and names have to take the index value

I have a DF which I need to break into subsets based on some columns of the DF that contain the factors. Something like this:
value (numeric) factor 1 (factor) factor 2 (factor)
23 1 FR
45 1 FR
33 1 ER
1345 2 TY
2325 2 TY
221 3 TY
11 4 FR
2234 4 FR
I can do :
subset1 <- subset(DF, `factor 1` == "1" & `factor 2` == "FR")
but I want to do it in a loop. The name of each subset has to concatenate the loop index (e.g., i) with a base name, and the factor values have to be structured so that they can be selected by that index. Basically, I want to do the subsetting in a loop.
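One way to avoid creating a separately named variable per subset is to collect the subsets in a dictionary keyed by a generated name. The sketch below does this in Python on the example rows; the key pattern `subset_<factor 1>_<factor 2>` is a hypothetical naming scheme, not something given in the question.

```python
from collections import defaultdict

# Each row: (value, factor 1, factor 2).
rows = [
    (23, 1, "FR"),
    (45, 1, "FR"),
    (33, 1, "ER"),
    (1345, 2, "TY"),
    (2325, 2, "TY"),
    (221, 3, "TY"),
    (11, 4, "FR"),
    (2234, 4, "FR"),
]

grouped = defaultdict(list)
for row in rows:
    _, factor1, factor2 = row
    # Build the subset name from the factor values of this row.
    grouped[f"subset_{factor1}_{factor2}"].append(row)

# Convert to a plain dict so that looking up a missing name raises KeyError.
subsets = dict(grouped)

for name, members in subsets.items():
    print(name, members)
```

Each subset is then available by name, for example `subsets["subset_1_FR"]`.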

How to remove characters with a specific pattern from a data frame in R

I'm trying to remove all the strings starting with the pattern "Gm" from the last column of my data.frame.
My data.frame looks like this:
level logp chr start end CNA Genes
3 1.4 3 100 110 gain Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm911
4 18.10 3 962 966 gain Fcgr1,Terc,Gm5703
The result should look something like this
level logp chr start end CNA Genes
3 1.4 3 100 110 gain Tdpoz4,Tdpoz3
4 18.10 3 962 966 gain Fcgr1,Terc
This uses a single gsub to remove the unwanted portions:
Genes <- c("Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm911", "Fcgr1,Terc,Gm5703") # test data
gsub(",?Gm[^,]*,?", "", Genes)
giving:
[1] "Tdpoz4,Tdpoz3" "Fcgr1,Terc"
Given a data frame d:
d$Genes_new <- sapply(strsplit(as.character(d$Genes),split=','),function(s) paste(s[!grepl('^Gm',s)],collapse=','))
# level logp chr start end CNA Genes Genes_new
#1 3 1.4 3 100 110 gain Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm911 Tdpoz4,Tdpoz3
#2 4 18.1 3 962 966 gain Fcgr1,Terc,Gm5703 Fcgr1,Terc
Here, strsplit(as.character(d$Genes),split=',') creates a list of comma-separated gene names for each row, and sapply applies to each element of this list a function that excludes all gene names starting with Gm (s[!grepl('^Gm',s)]) and concatenates the remaining genes (paste(.,collapse=',')).
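The split, filter and join steps can be sketched in Python as follows; the function name `drop_gm` is hypothetical. Filtering token by token also handles a Gm entry sitting between two genes that are kept, a case in which, as far as I can tell, a single substitution with ,?Gm[^,]*,? would remove both surrounding commas and join the neighbours.

```python
def drop_gm(genes):
    """Remove comma-separated gene names that start with 'Gm'."""
    kept = [gene for gene in genes.split(",") if not gene.startswith("Gm")]
    return ",".join(kept)


genes = ["Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm911", "Fcgr1,Terc,Gm5703"]
cleaned = [drop_gm(entry) for entry in genes]

for entry in cleaned:
    print(entry)
```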
