I have a column that lists the race/ethnicity of individuals. I am trying to make it so that if the cell contains an 'H' then I only want H. Similarly, if the cell contains an 'N' then I want an N. Finally, if the cell has multiple races, not including H or N, then I want it to be M. Below is how it is listed currently and the desired output.
Current output
People | Race/Ethnicity
PersonA| HAB
PersonB| NHB
PersonC| AB
PersonD| ABW
PersonE| A
Desired output
PersonA| H
PersonB| N
PersonC| M
PersonD| M
PersonE| A
You can try the following dplyr approach, which combines grepl with dplyr::case_when. It first searches for N values; among rows without an N, it searches for H values; among rows with neither, it assigns M to those with more than one race and keeps the original letter for those with only one race (assuming each race is represented by a single character).
A base R approach is below as well: no dependencies needed, but less elegant.
Data
df <- read.table(text = "person ethnicity
PersonA HAB
PersonB NHB
PersonC AB
PersonD ABW
PersonE A", header = TRUE)
dplyr (note order matters given your priority)
library(dplyr)

df %>% mutate(eth2 = case_when(
grepl("N", ethnicity) ~ "N",
grepl("H", ethnicity) ~ "H",
!grepl("H|N", ethnicity) & nchar(ethnicity) > 1 ~ "M",
TRUE ~ ethnicity
))
You could also do it "manually" in base R by indexing (note order matters given your priority):
df[grepl("H", df$ethnicity), "eth2"] <- "H"
df[grepl("N", df$ethnicity), "eth2"] <- "N"
df[!grepl("H|N", df$ethnicity) & nchar(df$ethnicity) > 1, "eth2"] <- "M"
df[nchar(df$ethnicity) %in% 1, "eth2"] <- df$ethnicity[nchar(df$ethnicity) %in% 1]
In both cases the output is:
# person ethnicity eth2
# 1 PersonA HAB H
# 2 PersonB NHB N
# 3 PersonC AB M
# 4 PersonD ABW M
# 5 PersonE A A
Note this is based on your comment about assigning superiority (that N anywhere supersedes those with both N and H, etc)
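As a rough sketch, the same priority rule can be wrapped in a small base R helper and checked on a string that contains both letters. The function name collapse_race is made up for illustration, and the sketch assumes one character per race, as stated above.

```r
# Hypothetical helper (the name is illustrative, not from the question):
# N takes priority over H, any other multi-letter code becomes "M",
# and single letters are returned unchanged.
collapse_race <- function(x) {
  x <- as.character(x)
  ifelse(grepl("N", x, fixed = TRUE), "N",
  ifelse(grepl("H", x, fixed = TRUE), "H",
  ifelse(nchar(x) > 1, "M", x)))
}

# "HN" is an extra test case to show that N wins regardless of position.
collapse_race(c("HAB", "NHB", "AB", "ABW", "A", "HN"))
# "H" "N" "M" "M" "A" "N"
```

Nested ifelse() evaluates every branch for every element, which is fine for a column of short strings like this one.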
We could use str_extract. When the number of characters in the column is greater than 1, extract the 'N' and the 'H' separately and coalesce the extracted elements along with 'M' (thus, if there is no match, we get 'M'; otherwise the result follows the order in which we placed the inputs in coalesce). For the other case, i.e. when the number of characters is 1, return the column values. Thus, 'N' supersedes 'H' no matter the position in the string.
library(dplyr)
library(stringr)
df1 %>%
mutate(output = case_when(nchar(`Race/Ethnicity`) > 1
~ coalesce(str_extract(`Race/Ethnicity`, 'N'),
str_extract(`Race/Ethnicity`, 'H'), "M"),
TRUE ~ `Race/Ethnicity`))
-output
People Race/Ethnicity output
1 PersonA HAB H
2 PersonB NHB N
3 PersonC AB M
4 PersonD ABW M
5 PersonE A A
data
df1 <- structure(list(People = c("PersonA", "PersonB", "PersonC", "PersonD",
"PersonE"), `Race/Ethnicity` = c("HAB", "NHB", "AB", "ABW", "A"
)), class = "data.frame", row.names = c(NA, -5L))
I have a data frame called "ref" that contains information that allows mapping of gene entrez ID to the gene's start and end positions. I have another data frame "ori_data" where each row contains unique mutations from samples, which gives a genomic position. I am trying to assign each position given in "ori_data" to map to information on "ref" in order to assign entrez ID to each mutation. I have tried a for loop to match for the same chromosome, and then select for positions in "ori_data" that fall between the coordinates in "ref" though I have not been successful. The "ori_data" dataset is over 1 million rows, so I'm not sure a for loop is an efficient solution. Note that many positions will be mapped to the same entrez ID in my real dataset. "Final" is what I want to happen- which would just add a column for entrezID according to chromosome/position. TYIA!
ref = data.frame("EntrezID" = c(1, 10, 100, 1000), "Chromosome" = c("19", "8", "20", "18"), "txStarts" = c("58345182", "18391281", "44619518", "27950965"), "txEnds" = c("58353492", "18401215", "44651758", "28177130"))
ori_data = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
"Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"))
final = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
"Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"), "EntrezID" = c(1,10,100,1000))
I have tried this line of code and I'm unsure as to why it does not work.
for (i in 1:dim(ori_data)[1])
{
for (j in 1:dim(ref)[1])
{
ID = which(ori_data[i, "Chromosome"] == ref[j,
"Chromosome"])
if (length(ID) > 0)
{
Pos = ori_data[ID, "POS"]
IDj = which(Pos >= ref[j, "txStarts"] & Pos <=
ref[j, "txEnds"])
print(IDj)
if (length(IDj) > 0)
{
ori_data = cbind("Entrez" = ref[IDj,
"EntrezID"], ori_data)
}
}
}
}
In base R, apply can be used to find, for each row, the entries of ref with the same Chromosome and test whether Pos lies between txStarts and txEnds.
ori_data$EntrezID <- apply(ori_data[c("Chromosome", "Pos")], 1, \(x)
ref$EntrezID[ref$Chromosome == x["Chromosome"] &
x["Pos"] >= ref$txStarts & x["Pos"] <= ref$txEnds][1])
ori_data
# Chromosome Pos Sample EntrezID
#1 19 58345186 HCC1 1
#2 8 18401213 HCC2 10
#3 20 44619519 HCC1 100
#4 18 27950966 HCC3 1000
A version which could be faster:
lup <- list2env(split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome))
# unlist() turns the list returned by Map() into a plain vector column
ori_data$EntrezID <- unlist(Map(\(x, y) {
. <- get(x, envir=lup)
.$EntrezID[y >= .$txStarts & y <= .$txEnds][1]
}, ori_data$Chromosome, ori_data$Pos), use.names = FALSE)
Or another way but not keeping the original order. (If original order is important, have a look at unsplit.)
#Assuming you have many rows with same Chromosome
x <- split(ori_data, ori_data$Chromosome)
#Assuming you have also here many rows with same Chromosome
lup <- split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome)
#Now I am sorting this by the names of x - try which method is faster
#Method 1:
lup <- lup[names(x)]
#Method 2:
lup <- mget(names(x), list2env(lup))
res <- do.call(rbind, Map(\(a, b) {
cbind(a, b[1][a$Pos >= b[[2]] & a$Pos <= b[[3]]][1])
}, x, lup))
One option would be to use sqldf, which should also be efficient for a large data frame (dna below stands for the question's ori_data).
library(tibble)
library(sqldf)
as_tibble(sqldf("select dna.*, ref.EntrezID from dna
join ref on dna.Pos > ref.'txStarts' and
dna.Pos < ref.'txEnds'"))
Another option using fuzzy_join:
library(dplyr)
library(fuzzyjoin)
dna %>%
fuzzy_join(ref %>% select(-Chromosome), by = c("Pos" = "txStarts", "Pos" = "txEnds"),
match_fun = list(`>`, `<`)) %>%
select(names(dna), EntrezID)
Output
Chromosome Pos Sample EntrezID
1 19 58345186 HCC1 1
2 8 18401213 HCC2 10
3 20 44619519 HCC1 100
4 18 27950966 HCC3 1000
If 'Pos', 'txStarts' and 'txEnds' are numeric, then we can use a non-equi join:
library(data.table)
setDT(dna)[ref, EntrezID := i.EntrezID,
on = .(Chromosome, Pos > txStarts, Pos <txEnds)]
-output
> dna
Chromosome Pos Sample EntrezID
<char> <num> <char> <num>
1: 19 58345186 HCC1 1
2: 8 18401213 HCC2 10
3: 20 44619519 HCC1 100
4: 18 27950966 HCC3 1000
data
# dna is the question's ori_data with its columns converted to proper types
dna <- type.convert(ori_data, as.is = TRUE)
ref <- type.convert(ref, as.is = TRUE)
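The type conversion matters because the question stores positions as character strings, and character comparison is lexicographic rather than numeric. A minimal sketch of the pitfall, using two rows of the question's ref:

```r
# Character comparison goes character by character, so it can disagree
# with numeric order when the strings differ in length.
"9" < "10"                           # FALSE: "9" sorts after "1"
as.numeric("9") < as.numeric("10")   # TRUE

# Two rows from the question's ref, with coordinates stored as text.
ref <- data.frame(EntrezID = c(1, 10),
                  Chromosome = c("19", "8"),
                  txStarts = c("58345182", "18391281"),
                  txEnds = c("58353492", "18401215"))

# type.convert() turns columns that look numeric into numbers.
# Chromosome is converted here too; it would stay character if it
# contained names such as "X".
ref <- type.convert(ref, as.is = TRUE)
str(ref)
```

The example positions in the question all have eight digits, so string comparison happens to give the right answer there, but that would not hold for coordinates of different lengths.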
I have preprocessed Affymetrix microarray gene expression data (32830 probesets in rows, 735 RNA samples in columns). Here is what my expression matrix looks like:
> exprs_mat[1:6, 1:4]
Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04 Tarca_005_P1A05
1_at 6.062215 6.125023 5.875502 6.126131
10_at 3.796484 3.805305 3.450245 3.628411
100_at 5.849338 6.191562 6.550525 6.421877
1000_at 3.567779 3.452524 3.316134 3.432451
10000_at 6.166815 5.678373 6.185059 5.633757
100009613_at 4.443027 4.773199 4.393488 4.623783
I also have phenodata for this Affymetrix expression data (RNA sample identifiers in the rows, sample descriptions in the columns):
> pheno[1:6, 1:4]
SampleID GA Batch Set
Tarca_001_P1A01 Tarca_001_P1A01 11.0 1 PRB_HTA
Tarca_013_P1B01 Tarca_013_P1B01 15.3 1 PRB_HTA
Tarca_025_P1C01 Tarca_025_P1C01 21.7 1 PRB_HTA
Tarca_037_P1D01 Tarca_037_P1D01 26.7 1 PRB_HTA
Tarca_049_P1E01 Tarca_049_P1E01 31.3 1 PRB_HTA
Tarca_061_P1F01 Tarca_061_P1F01 32.1 1 PRB_HTA
Since the sample identifiers are in the rows of phenodata, I need a way to match SampleID in phenodata with the sample IDs (column names) of the expression matrix exprs_mat.
OBJECTIVE:
I want to filter the genes in the expression matrix by measuring the correlation between each gene and the target profile data in phenodata. Here is my initial attempt, but I am not sure it is accurate:
update: my implementation in R:
I intend to see how the genes in each sample are correlated with GA value of corresponding samples in the annotation data. Here is my simple function to find this correlation in R:
getPCC <- function(expr_mat, anno_mat, verbose=FALSE){
stopifnot(class(expr_mat)=="matrix")
stopifnot(class(anno_mat)=="matrix")
stopifnot(ncol(expr_mat)==nrow(anno_mat))
final_df <- as.data.frame()
lapply(colnames(expr_mat), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(anno_mat)){
cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
final_df <- cbind(res, cor=cor_mat[out])
}
})
})
return(final_df)
}
But the above script did not return the output I am expecting. Any idea how to make this work correctly?
Does something like this help:
library(tidyverse)
x <- data.frame(stringsAsFactors=FALSE,
Levels = c("1_at", "10_at", "100_at", "1000_at", "10000_at", "100009613_at"),
Tarca_001_P1A01 = c(6.062215, 3.796484, 5.849338, 3.567779, 6.166815,
4.443027),
Tarca_003_P1A03 = c(6.125023, 3.805305, 6.191562, 3.452524, 5.678373,
4.773199),
Tarca_004_P1A04 = c(5.875502, 3.450245, 6.550525, 3.316134, 6.185059,
4.393488),
Tarca_005_P1A05 = c(6.126131, 3.628411, 6.421877, 3.432451, 5.633757,
4.623783)
)
y <- data.frame(stringsAsFactors=FALSE,
gene = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
"Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
"Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1),
Batch = c(1, 1, 1, 1, 1, 1),
Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA")
)
x %>% gather(SampleID, value, -Levels) %>%
left_join(., y, by = "SampleID") %>%
group_by(SampleID) %>%
filter(value == max(value)) %>%
spread(SampleID, value)
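For the stated objective (correlating each gene with GA across matching samples), a minimal base R sketch might look like the following. The toy values, the sample and gene names, and the 0.5 cutoff are all made up for illustration; only the layout (genes in rows of the matrix, samples in rows of the phenodata) follows the question.

```r
# Toy data: values are random and for illustration only.
set.seed(1)
samples <- paste0("S", 1:6)
exprs_mat <- matrix(rnorm(18), nrow = 3,
                    dimnames = list(c("g1", "g2", "g3"), samples))
# Phenodata deliberately in a different sample order than the matrix.
pheno <- data.frame(SampleID = rev(samples),
                    GA = c(32.1, 31.3, 26.7, 21.7, 15.3, 11.0),
                    row.names = rev(samples))

# Align phenodata rows to the column order of the expression matrix.
common <- intersect(colnames(exprs_mat), rownames(pheno))
expr_sub <- exprs_mat[, common, drop = FALSE]
ga <- pheno[common, "GA"]

# cor() works column-wise, so transpose to get one column per gene.
gene_cor <- cor(t(expr_sub), ga, method = "pearson")[, 1]

# Keep genes whose absolute correlation exceeds an arbitrary cutoff.
keep <- names(gene_cor)[abs(gene_cor) > 0.5]
filtered <- expr_sub[keep, , drop = FALSE]
```

Indexing both objects by the shared sample names is what guarantees that each expression column is paired with the GA value of the same sample.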
I have the following data frame in R:
Names X_1 X_2 X_3 X_4
Name Sagar II Booster
Location India No Discharge Open
Depth 19.5 start End
DOC 3.2 FPL 64
Qunatity 234 SPL 50
Now I want to extract certain cells and the corresponding values in the cell to their right.
My desired dataframe would be
Names Values
Name Sagar II
Location India
Discharge Open
Depth 19.5
DOC 3.2
FPL 64
SPL 50
How can I do it in r?
A solution from base R.
# Create example data frame
dt <- read.table(text = "Names X_1 X_2 X_3 X_4
Name Sagar II Booster
Location India No Discharge Open
Depth 19.5 start End
DOC 3.2 FPL 64
Qunatity 234 SPL 50",
stringsAsFactors = FALSE, header = TRUE, fill = TRUE)
# A list of target keys
target_key <- c("Name", "Location", "Discharge", "Depth", "DOC", "FPL", "SPL")
# A function to extract value based on key and create a new data frame
extract_fun <- function(key, df = dt){
Row <- which(apply(df, 1, function(x) key %in% x))
Col <- which(apply(df, 2, function(x) key %in% x))
df2 <- data.frame(Names = key, Values = df[Row, Col + 1],
stringsAsFactors = FALSE)
df2$Values <- as.character(df2$Values)
return(df2)
}
# Apply the extract_fun
ext_list <- lapply(target_key, extract_fun)
# Combine all data frame
dt_final <- do.call(rbind, ext_list)
dt_final
Names Values
1 Name Sagar
2 Location India
3 Discharge Open
4 Depth 19.5
5 DOC 3.2
6 FPL 64
7 SPL 50
Might not be the most efficient, but works for your example:
library(dplyr)
key_value = function(extraction){
temp = matrix(NA, nrow = length(extraction), ncol = 2)
temp[,1] = extraction
for(ii in 1:nrow(temp)){
index = df %>%
as.matrix %>%
{which(. == extraction[ii], arr.ind = TRUE)}
temp[ii, 2] = index %>% {df[.[1], .[2]+1]}
}
return(data.frame(Names = temp[,1], Values = temp[,2]))
}
Result:
> vec = c("Name", "Location", "Discharge", "Depth", "DOC", "FPL", "SPL")
> key_value(vec)
Names Values
1 Name SagarII
2 Location India
3 Discharge Open
4 Depth 19.5
5 DOC 3.2
6 FPL 64
7 SPL 50
Data:
df = read.table(text = "Names X_1 X_2 X_3 X_4
Name SagarII Booster NA NA
Location India No Discharge Open
Depth 19.5 start End NA
DOC 3.2 FPL 64 NA
Qunatity 234 SPL 50 NA", header = TRUE, stringsAsFactors = FALSE)
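A loop-free variant of the same idea is sketched below, assuming each key occurs at least once and never in the last column. It relies on the fact that a matrix is stored column by column, so the cell to the right of linear index i is at i + nrow. The object names keys, m, idx and result are illustrative.

```r
df <- read.table(text = "Names X_1 X_2 X_3 X_4
Name SagarII Booster NA NA
Location India No Discharge Open
Depth 19.5 start End NA
DOC 3.2 FPL 64 NA
Qunatity 234 SPL 50 NA", header = TRUE, stringsAsFactors = FALSE)

keys <- c("Name", "Location", "Discharge", "Depth", "DOC", "FPL", "SPL")

# trimws() guards against padding that as.matrix() can add when a
# column is numeric; it changes nothing for this all-character data.
m <- trimws(as.matrix(df))

# Linear (column-major) index of the first occurrence of each key.
idx <- match(keys, m)

# The cell one column to the right is nrow(df) positions further on.
result <- data.frame(Names = keys, Values = m[idx + nrow(df)],
                     stringsAsFactors = FALSE)
result
```

match() only returns the first occurrence, so a key that appears more than once would need which(m == key) instead.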
I want to print the corresponding column that has been matched to the row name.
List_Sample
S.No Name
2 Ba
1 Ar
5 Ca
3 Bl
4 Bu
Volume
Ar Ba Bl Bu Ca
-5.1275 1.3465 -1.544 -0.0877 3.2955
-2.2385 1.5065 0.193 1.082 3.074
-5.3705 1.1285 1.966 1.183 -1.9305
-6.4925 1.5735 1.36 -0.0761 2.0875
-5.068 0.9455 0.947 -0.7775 3.832
Total <- as.data.frame(matrix(0, ncol = 1, nrow = 5))
for (i in 1:5)
{
match(List_Sample$Name[i], names(Volume))
print(List_Sample$S.No[i]*100)
print(names(Volume[i]))
Total = Total + Volume[i]
print(Total)
}
View(Total)
When I use print(names(Volume[i])), it prints the columns of Volume in their existing order, because i is just a number from 1 to 5. What I want is to print the matching column that has been found, or rather extract the matching column from the other data frame and do some calculation with it.
I think you'll find you'll get more help from folks if you post code that is easy to just copy and paste into R. For example,
List_Sample <- data.frame(S.No = c(2, 1, 5, 3, 4),
Name = c("Ba", "Ar", "Ca", "Bl", "Bu"))
Volume <- data.frame(Ar = c(-5.1275, -2.2385, -5.3705, -6.4925, -5.068),
Ba = c(1.3465, 1.5065, 1.1285, 1.5735, 0.9455),
Bl = c(-1.544, 0.193, 1.966, 1.36, 0.947),
Bu = c(-0.0877, 1.082, 1.183, -0.0761, -0.7775),
Ca = c(3.2955, 3.074, -1.9305, 2.0875, 3.832))
I think you can get the code to do what you want if you save the information returned by the call to the match() function, I called it j, and then use that as your column index into Volume for the rest of the for() loop. S.No is still indexed by i, since it lives in List_Sample.
Total <- 0
for (i in 1:5) {
j <- match(List_Sample$Name[i], names(Volume))
print(List_Sample$S.No[i]*100)
print(names(Volume[j]))
Total = Total + Volume[j]
print(Total)
}
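Since match() is vectorized, the loop can also be avoided. The following is only a sketch of that alternative, with matched as an illustrative name; it reuses the data frames defined above.

```r
List_Sample <- data.frame(S.No = c(2, 1, 5, 3, 4),
                          Name = c("Ba", "Ar", "Ca", "Bl", "Bu"))
Volume <- data.frame(Ar = c(-5.1275, -2.2385, -5.3705, -6.4925, -5.068),
                     Ba = c(1.3465, 1.5065, 1.1285, 1.5735, 0.9455),
                     Bl = c(-1.544, 0.193, 1.966, 1.36, 0.947),
                     Bu = c(-0.0877, 1.082, 1.183, -0.0761, -0.7775),
                     Ca = c(3.2955, 3.074, -1.9305, 2.0875, 3.832))

# Column positions in Volume, in the order the names appear in List_Sample.
j <- match(List_Sample$Name, names(Volume))
names(Volume)[j]   # "Ba" "Ar" "Ca" "Bl" "Bu"

# Reorder the columns to follow List_Sample, then sum across them per row.
matched <- Volume[, j, drop = FALSE]
Total <- rowSums(matched)
Total
```

Here every name is found, so Total equals the row sums of Volume; a name missing from Volume would give NA in j and should be filtered out before indexing.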