R: replace column values with string

I have a set of BAM files within the chr16_bam directory and a sgseq_sam.txt file.
I want to replace the file_bam column values with the full path where the BAM files are stored.
My code hasn't been able to achieve that.
bamPath = "C:/Users/User/Downloads/chr16_bam/"
samFile <- read.delim("C:/Users/User/Downloads/sgseq_sam.txt", header=T)
for (i in samFile[,2]) {
p <- gsub(i, bamPath, samFile)
}
> dput(samFile)
structure(list(sample_name = c("N60", "N11", "T132", "T114"),
file_bam = c("60.bam", "11.bam", "132.bam", "114.bam"), paired_end = c(TRUE,
TRUE, TRUE, TRUE), read_length = c(75L, 75L, 75L, 75L), frag_length = c(1075L,
1466L, 946L, 1154L), lib_size = c(2589976L, 5153522L, 4429912L,
3131400L)), class = "data.frame", row.names = c(NA, -4L))

library(tidyverse)
# Prepend the directory to every file name (paste0 is vectorised)
samFile <- samFile %>%
  mutate(file_bam = paste0(bamPath, file_bam))
Or, alternatively, in base R:
samFile$file_bam <- paste0(bamPath, samFile$file_bam)
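A small variation, sketched with a cut-down copy of the question's data: file.path() joins a directory and file names with "/", so the directory is written without a trailing slash. The name bamDir and the two-row data frame are illustrative only.

```r
# Directory holding the BAM files, written without a trailing slash
# (bamDir is an illustrative name; the question uses bamPath with a slash)
bamDir <- "C:/Users/User/Downloads/chr16_bam"

# Cut-down copy of the question's samFile
samFile <- data.frame(
  sample_name = c("N60", "N11"),
  file_bam = c("60.bam", "11.bam"),
  stringsAsFactors = FALSE
)

# file.path() is vectorised and inserts "/" between its arguments
samFile$file_bam <- file.path(bamDir, samFile$file_bam)
print(samFile$file_bam)
```

Running file.exists(samFile$file_bam) afterwards is a quick way to check that every path points at a real file.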

Related

How to compute distance.matrix for the spatialRF::rf_spatial function

I am using the R package spatialRF to perform a regression task. In the example provided with the package, the authors precompute the distance.matrix and pass it to the function spatialRF::rf. Here is an example:
library(spatialRF)
#loading training data
data(block.data)
#names of the response variable and the predictors
dependent.variable.name <- "ntl"
predictor.variable.names <- colnames(block.data)[2:4]
#coordinates of the cases
xy <- block.data[, c("x", "y")]
#distance matrix
distance.matrix <- dist(subset(block.data, select = -c(x, y)))
#random seed for reproducibility
random.seed <- 1
model.non.spatial <- spatialRF::rf(
data = block.data,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance.matrix,
distance.thresholds = 0,
xy = xy,
seed = random.seed,
verbose = FALSE)
When running the spatialRF::rf function I am getting this error: Error in diag<-(tmp, value = NA): only matrix diagonals can be replaced
My dataset:
block.data = structure(list(ntl = c(11.4058170318604, 13.7000455856323, 16.0420398712158,
17.4475727081299, 26.263370513916, 30.658130645752, 19.8927211761475,
20.917688369751, 23.7149887084961, 25.2641334533691), pop = c(111.031448364258,
145.096557617188, 166.351989746094, 193.804962158203, 331.787200927734,
382.979248046875, 237.971466064453, 276.575958251953, 334.015289306641,
345.376617431641), tirs = c(35.392936706543, 34.4172630310059,
33.7765464782715, 35.3224639892578, 40.4262886047363, 39.6619148254395,
38.6306076049805, 36.752326965332, 37.2010040283203, 36.1100578308105
), agbh = c(1.15364360809326, 0.177780777215958, 0.580717206001282,
0.647109687328339, 3.84336423873901, 5.6310133934021, 2.10894227027893,
3.9533429145813, 2.7016019821167, 4.36041164398193), lc = c(40L,
40L, 40L, 126L, 50L, 50L, 50L, 50L, 40L, 50L)), class = "data.frame", row.names = c(NA,
-10L))
For reference, in the example in the link I provided, the distance matrix and the dataset the authors use are the same.
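One likely cause, offered as an assumption rather than a confirmed diagnosis: dist() returns an object of class "dist", not a matrix, and the diagonal of a non-matrix cannot be replaced. The sketch below uses made-up coordinates (the dput output above has no x or y columns) to show the conversion with as.matrix().

```r
# Hypothetical coordinates for four cases (not taken from the question's data)
xy <- data.frame(x = c(0, 3, 0, 3), y = c(0, 0, 4, 4))

# dist() returns a compact "dist" object, not a matrix
d <- dist(xy)
print(class(d))
print(is.matrix(d))

# as.matrix() expands it to a full square matrix with a zero diagonal
distance.matrix <- as.matrix(d)
print(dim(distance.matrix))
```

Whether spatialRF::rf then accepts the result is not verified here. The package example appears to use distances between the locations of the cases, so computing the matrix from the coordinates rather than from the predictor columns is probably closer to what it expects.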

Export Cyrillic characters from R?

I have a dataset where one of the columns includes Russian words:
raw_data2 = structure(list(word = c("абрикос",
"автомобиль",
"аист",
"ананас",
"апрель",
"атака",
"баклажан"),
subject_nr = c(3L, 21L, 12L, 17L, 8L, 1L, 17L),
acc = c(98.976109215, 91.8803418803, 94.8979591837, 94.5273631841, 94.4444444444, 94.5355191257, 94.3661971831)),
row.names = c(1L, 100L, 200L, 300L, 400L, 500L, 600L),
class = "data.frame")
When I look at the file in RStudio there's no problem:
However, when I export the data into a table to work with them further in Excel I get this UTF-mess which Excel cannot convert back into Russian words (even when UTF-8 is chosen during data importing):
"word";"subject_nr";"acc"
"<U+0430><U+0431><U+0440><U+0438><U+043A><U+043E><U+0441>";3;98,976109215
"<U+0430><U+0432><U+0442><U+043E><U+043C><U+043E><U+0431><U+0438><U+043B><U+044C>";21;91,8803418803
"<U+0430><U+0438><U+0441><U+0442>";12;94,8979591837
"<U+0430><U+043D><U+0430><U+043D><U+0430><U+0441>";17;94,5273631841
"<U+0430><U+043F><U+0440><U+0435><U+043B><U+044C>";8;94,4444444444
"<U+0430><U+0442><U+0430><U+043A><U+0430>";1;94,5355191257
"<U+0431><U+0430><U+043A><U+043B><U+0430><U+0436><U+0430><U+043D>";17;94,3661971831
Is there any way to force R to replace those strings with corresponding Cyrillic letters when saving the table? It certainly "knows" what these letters are, since it shows them in preview. I use the following code (which does not work):
write.table(raw_data2,
file = "raw_data2.csv",
append = FALSE,
quote = TRUE,
sep = ";",
eol = "\n",
na = "NA",
dec = ",",
row.names = FALSE,
col.names = TRUE,
qmethod = c("escape", "double"),
fileEncoding = "UTF-8")
It works fine for me if you write it to an xlsx file.
openxlsx::write.xlsx(raw_data2, 'temp.xlsx')
For me, Sys.setlocale("LC_CTYPE", "russian") works well
(code source: https://www.r-bloggers.com/2013/01/r-and-foreign-characters/)
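Another workaround that is sometimes suggested, sketched here in base R under the assumption that the problem comes from R re-encoding the strings to the native encoding on write: build the lines yourself and write the UTF-8 bytes unchanged with writeLines(..., useBytes = TRUE). The output path out_file and the two-row data frame are illustrative.

```r
# Two rows of the question's data; \u escapes keep this script ASCII-only
raw_data2 <- data.frame(
  word = c("\u0430\u0438\u0441\u0442", "\u0430\u0442\u0430\u043a\u0430"),
  subject_nr = c(12L, 1L),
  acc = c(94.8979591837, 94.5355191257),
  stringsAsFactors = FALSE
)

# Build the semicolon-separated lines by hand, with a decimal comma
acc_txt <- sub(".", ",", as.character(raw_data2$acc), fixed = TRUE)
csv_lines <- c(
  paste(names(raw_data2), collapse = ";"),
  paste(raw_data2$word, raw_data2$subject_nr, acc_txt, sep = ";")
)

# out_file is an illustrative path; useBytes = TRUE writes the UTF-8 bytes
# as they are instead of re-encoding them to the native encoding
out_file <- file.path(tempdir(), "raw_data2.csv")
writeLines(enc2utf8(csv_lines), out_file, useBytes = TRUE)
```

This sketch skips quoting, so it only suits fields that contain no semicolons or quotes. Whether Excel then shows the words correctly still depends on choosing UTF-8 during import.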

Grouped barplots in R using csv

I have a 3 column csv file like this
x,y1,y2
100,50,10
200,10,20
300,15,5
I want to draw a barplot in R with the first column's values on the x axis and the second and third columns' values as grouped bars for the corresponding x. I hope I made it clear. Can someone please help me with this? My data is huge, so I have to import the csv file and can't enter all the data by hand. I found relevant posts, but none addressed exactly this.
Thank you
Use the following code
library(tidyverse)
df %>% pivot_longer(names_to = "y", values_to = "value", -x) %>%
ggplot(aes(x,value, fill=y))+geom_col(position = "dodge")
Data
df = structure(list(x = c(100L, 200L, 300L), y1 = c(50L, 10L, 15L),
y2 = c(10L, 20L, 5L)), class = "data.frame", row.names = c(NA,
-3L))
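Since the question asks about importing the csv, here is a minimal base R sketch of the same grouped barplot that starts from CSV text. The inline text stands in for the real file, which would be read with read.csv() and its actual path.

```r
# Inline copy of the CSV; for a real file use read.csv("path/to/file.csv")
df <- read.csv(text = "x,y1,y2
100,50,10
200,10,20
300,15,5")

# barplot() wants one row per series and one column per x value
heights <- t(as.matrix(df[, c("y1", "y2")]))

# beside = TRUE draws the series side by side instead of stacked
mids <- barplot(heights, beside = TRUE, names.arg = df$x,
                legend.text = rownames(heights), xlab = "x", ylab = "value")
```

The same df can be passed unchanged to the pivot_longer() and ggplot() code above.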

Automatically split function output (list) into component data.frames

I have a function that yields 2 dataframes. As functions can only return one object, I combined these dataframes into a list. However, I need to work with both dataframes separately. Is there a way to automatically split the list into the component dataframes, or to write the function in a way that both objects are returned separately?
The function:
install.packages("plyr")
require(plyr)
fun.docmerge <- function(x, y, z, crit, typ, doc = checkmerge) {
mergedat <- paste(deparse(substitute(x)), "+",
deparse(substitute(y)), "=", z)
countdat <- nrow(x)
check_t1 <- data.frame(mergedat, countdat)
z1 <- join(x, y, by = crit, type = typ)
countdat <- nrow(z1)
check_t2 <- data.frame(mergedat, countdat)
doc <- rbind(doc, check_t1, check_t2)
t1<-list()
t1[["checkmerge"]]<-doc
t1[[z]]<-z1
return(t1)
}
This is the call to the function, saving the result list to the new object results.
results <- fun.docmerge(x = df1, y = df2, z = "df3", crit = c("id"), typ = "left")
The following sample data replicates the problem:
df1 <- structure(list(id = c("XXX1", "XXX2", "XXX3",
"XXX4"), tr.isincode = c("ISIN1", "ISIN2",
"ISIN3", "ISIN4")), .Names = c("id", "isin"
), row.names = c(NA, 4L), class = "data.frame")
df2 <- structure(list(id= c("XXX1", "XXX5"), wrong= c(1L,
1L)), .Names = c("id", "wrong"), row.names = 1:2, class = "data.frame")
checkmerge <- structure(list(mergedat = structure(integer(0), .Label = character(0), class = "factor"),
countdat = numeric(0)), .Names = c("mergedat", "countdat"
), row.names = integer(0), class = "data.frame")
In the example, a list with the dataframes df3 and checkmerge is returned. I need both dataframes separately. I know that I could do it via manual assignment (e.g., checkmerge <- results$checkmerge), but I want to eliminate manual changes as much as possible and am therefore looking for an automated way.
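One base R option, shown as a minimal sketch: list2env() turns each named element of a list into a variable of the same name. The function make_two() below is a hypothetical stand-in for fun.docmerge(), so the example runs without plyr.

```r
# Hypothetical stand-in for fun.docmerge(): two data frames in a named list
make_two <- function() {
  list(
    checkmerge = data.frame(mergedat = "df1 + df2 = df3", countdat = 4),
    df3 = data.frame(id = c("XXX1", "XXX2"))
  )
}

results <- make_two()

# list2env() creates one variable per named list element in the target
# environment; invisible() suppresses printing of the returned environment
invisible(list2env(results, envir = environment()))

print(names(results))
print(nrow(df3))
```

Note that list2env() silently overwrites existing variables with the same names, so keeping the list and indexing it (results$df3) is often the safer choice.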

Clusters on separate pages of pdf. Each row may belong to different clusters

As the title says, I would like to save each cluster on a separate page of a PDF file.
Example data:
structure(list(P1 = c("ATCG00490", "AT5G17710", "AT2G42910",
"AT4G23600", "AT3G61540", "AT2G05990"), P2 = c("AT5G38420", "AT5G20070",
"AT5G04230", "AT1G08200", "AT4G30910", "AT5G52100"), clique = structure(list(
`930` = integer(0), `2090` = integer(0), `3120` = c(2L, 3L,
231L), `3663` = integer(0), `3704` = integer(0), `4156` = c(19L,
27L)), .Names = c("930", "2090", "3120", "3663", "3704",
"4156"), class = "AsIs")), .Names = c("P1", "P2", "clique"), row.names = c(930L,
2090L, 3120L, 3663L, 3704L, 4156L), class = "data.frame")
Some of the rows belong to many clusters and some of them to just a single one. Of course, all possible variants have to be considered.
If possible, I would like to keep only clusters that have at least two members.
This is the code I use when each row belongs to a single cluster:
pdf("clusters.pdf", width=12, height=18)
lapply(split(data_cluster, data_cluster$cluster), function(d) {
grid::grid.newpage()
gridExtra::grid.table(d)
}
)
dev.off()
Maybe it will help someone to find an answer for me.
EDIT:
I made a mistake while preparing the example data... Please take a look at my original data, and then you will see that it is not that simple (at least in my opinion).
structure(list(P1 = c("ATCG00490", "AT5G17710", "AT2G42910",
"AT4G23600", "AT3G61540", "AT2G05990"), P2 = c("AT5G38420", "AT5G20070",
"AT5G04230", "AT1G08200", "AT4G30910", "AT5G52100"), clique = structure(list(
`930` = integer(0), `2090` = integer(0), `3120` = c(2L, 3L,
231L), `3663` = integer(0), `3704` = integer(0), `4156` = c(19L,
27L)), .Names = c("930", "2090", "3120", "3663", "3704",
"4156"), class = "AsIs")), .Names = c("P1", "P2", "clique"), row.names = c(930L,
2090L, 3120L, 3663L, 3704L, 4156L), class = "data.frame")
It seems that this is only a question of splitting a variable into a long-format data.frame. library(splitstackshape) does just that. Here is a solution using @Ananda's suggestion of listCol_l rather than cSplit.
library(splitstackshape)
data_cluster <- listCol_l(data_cluster, "clique")
data_cluster <- data_cluster[,n := .N >= 2,by=clique_ul][!is.na(clique_ul) & n,][,n :=NULL]
pdf("clusters.pdf", width=12, height=18)
lapply(unique(data_cluster$clique_ul), function(i) {
grid::grid.newpage()
gridExtra::grid.table(data_cluster[clique_ul == i,])
})
dev.off()
This will produce an empty pdf document with your dataset, since no cluster is repeated.
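The same reshaping can be sketched in base R, without splitstackshape. The three-row data frame is made up so that one cluster (3) has two members; the column name cluster is also an assumption.

```r
# Made-up data: each row lists the clusters it belongs to in a list column
data_cluster <- data.frame(
  P1 = c("ATCG00490", "AT5G17710", "AT2G42910"),
  P2 = c("AT5G38420", "AT5G20070", "AT5G04230"),
  clique = I(list(integer(0), c(2L, 3L), c(3L, 19L))),
  stringsAsFactors = FALSE
)

# Long format: repeat each row once per cluster it belongs to
cl <- unclass(data_cluster$clique)
n_per_row <- lengths(cl)
long <- data.frame(
  P1 = rep(data_cluster$P1, n_per_row),
  P2 = rep(data_cluster$P2, n_per_row),
  cluster = unlist(cl),
  stringsAsFactors = FALSE
)

# Keep only clusters with at least two member rows
counts <- table(long$cluster)
keep <- as.integer(names(counts)[counts >= 2])
kept <- long[long$cluster %in% keep, ]

# One data frame per remaining cluster, ready for a page-per-cluster loop
by_cluster <- split(kept, kept$cluster)
print(names(by_cluster))
```

Each element of by_cluster can then be drawn on its own page with the grid.newpage() and grid.table() loop from the question.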
