I have a set of BAM files within the chr16_bam directory and a sgseq_sam.txt file.
I want to replace the file_bam column values with the full path where the BAM files are stored.
My code hasn't been able to achieve that.
bamPath = "C:/Users/User/Downloads/chr16_bam/"
samFile <- read.delim("C:/Users/User/Downloads/sgseq_sam.txt", header=T)
for (i in samFile[,2]) {
p <- gsub(i, bamPath, samFile)
}
> dput(samFile)
structure(list(sample_name = c("N60", "N11", "T132", "T114"),
file_bam = c("60.bam", "11.bam", "132.bam", "114.bam"), paired_end = c(TRUE,
TRUE, TRUE, TRUE), read_length = c(75L, 75L, 75L, 75L), frag_length = c(1075L,
1466L, 946L, 1154L), lib_size = c(2589976L, 5153522L, 4429912L,
3131400L)), class = "data.frame", row.names = c(NA, -4L))
library(tidyverse)
samFile <- samFile %>%
  mutate(file_bam = paste0(bamPath, file_bam))
Or, alternatively, in base R:
samFile$file_bam <- paste0(bamPath, samFile$file_bam)
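The loop in the question cannot work as written: gsub() is applied to the whole data frame, it replaces each file name with the path instead of prepending the path, its result p is overwritten on every pass, and nothing is ever assigned back to samFile. A minimal base R sketch of the vectorised alternative follows. It uses a two-row stand-in for samFile and file.path() in place of paste0(); the name bamDir is introduced here for illustration only.

```r
# Stand-in for the data read from sgseq_sam.txt (two rows, for illustration)
samFile <- data.frame(
  sample_name = c("N60", "N11"),
  file_bam    = c("60.bam", "11.bam"),
  stringsAsFactors = FALSE
)

# Directory holding the BAM files (hypothetical name `bamDir`);
# no trailing slash is needed because file.path() inserts the separator
bamDir <- "C:/Users/User/Downloads/chr16_bam"

# Vectorised: every file name is prefixed in a single call
samFile$file_bam <- file.path(bamDir, samFile$file_bam)

print(samFile$file_bam)
```

Running file.exists(samFile$file_bam) afterwards is a quick way to check that the resulting paths point at real files.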
I am using the package spatialRF in R to perform a regression task. In the example provided by the package, they have precomputed the distance.matrix and they use it in the function spatialRF::rf. Here is an example:
library(spatialRF)
#loading training data
data(block.data)
#names of the response variable and the predictors
dependent.variable.name <- "ntl"
predictor.variable.names <- colnames(block.data)[2:4]
#coordinates of the cases
xy <- block.data[, c("x", "y")]
#distance matrix
distance.matrix <- dist(subset(block.data, select = -c(x, y)))
#random seed for reproducibility
random.seed <- 1
model.non.spatial <- spatialRF::rf(
data = block.data,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance.matrix,
distance.thresholds = 0,
xy = xy,
seed = random.seed,
verbose = FALSE)
When running the spatialRF::rf function I am getting this error: Error in diag<-(tmp, value = NA): only matrix diagonals can be replaced
My dataset:
block.data = structure(list(ntl = c(11.4058170318604, 13.7000455856323, 16.0420398712158,
17.4475727081299, 26.263370513916, 30.658130645752, 19.8927211761475,
20.917688369751, 23.7149887084961, 25.2641334533691), pop = c(111.031448364258,
145.096557617188, 166.351989746094, 193.804962158203, 331.787200927734,
382.979248046875, 237.971466064453, 276.575958251953, 334.015289306641,
345.376617431641), tirs = c(35.392936706543, 34.4172630310059,
33.7765464782715, 35.3224639892578, 40.4262886047363, 39.6619148254395,
38.6306076049805, 36.752326965332, 37.2010040283203, 36.1100578308105
), agbh = c(1.15364360809326, 0.177780777215958, 0.580717206001282,
0.647109687328339, 3.84336423873901, 5.6310133934021, 2.10894227027893,
3.9533429145813, 2.7016019821167, 4.36041164398193), lc = c(40L,
40L, 40L, 126L, 50L, 50L, 50L, 50L, 40L, 50L)), class = "data.frame", row.names = c(NA,
-10L))
For reference, in the example in the link I provided, the distance matrix and the dataset the authors use are the same.
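The error message is consistent with distance.matrix being a "dist" object rather than a square matrix: dist() returns a compact object without dimensions, and the diagonal-replacement function only accepts matrices. A minimal base R sketch of the difference is below. It uses three made-up coordinate pairs, since the block.data shown above has no x and y columns, so the values are purely illustrative.

```r
# Made-up coordinates, for illustration only
xy <- data.frame(x = c(0, 3, 0), y = c(0, 4, 8))

d <- dist(xy)            # object of class "dist", not a matrix
print(is.matrix(d))

# Replacing the diagonal of a "dist" object fails; capture the message
err <- tryCatch({
  diag(d) <- NA
  NULL
}, error = function(e) conditionMessage(e))
print(err)

# Converting to a full square matrix makes the diagonal replaceable
distance.matrix <- as.matrix(d)
diag(distance.matrix) <- NA
print(dim(distance.matrix))
```

In the question's code this would mean passing as.matrix(dist(...)) instead of the bare dist() result. Which columns the distances should be computed from (coordinates or something else) depends on the analysis and is not settled by this sketch.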
I have a dataset where one of the columns includes Russian words:
raw_data2 = structure(list(word = c("абрикос",
"автомобиль",
"аист",
"ананас",
"апрель",
"атака",
"баклажан"),
subject_nr = c(3L, 21L, 12L, 17L, 8L, 1L, 17L),
acc = c(98.976109215, 91.8803418803, 94.8979591837, 94.5273631841, 94.4444444444, 94.5355191257, 94.3661971831)),
row.names = c(1L, 100L, 200L, 300L, 400L, 500L, 600L),
class = "data.frame")
When I look at the file in RStudio there's no problem:
However, when I export the data into a table to work with them further in Excel I get this UTF-mess which Excel cannot convert back into Russian words (even when UTF-8 is chosen during data importing):
"word";"subject_nr";"acc"
"<U+0430><U+0431><U+0440><U+0438><U+043A><U+043E><U+0441>";3;98,976109215
"<U+0430><U+0432><U+0442><U+043E><U+043C><U+043E><U+0431><U+0438><U+043B><U+044C>";21;91,8803418803
"<U+0430><U+0438><U+0441><U+0442>";12;94,8979591837
"<U+0430><U+043D><U+0430><U+043D><U+0430><U+0441>";17;94,5273631841
"<U+0430><U+043F><U+0440><U+0435><U+043B><U+044C>";8;94,4444444444
"<U+0430><U+0442><U+0430><U+043A><U+0430>";1;94,5355191257
"<U+0431><U+0430><U+043A><U+043B><U+0430><U+0436><U+0430><U+043D>";17;94,3661971831
Is there any way to force R to replace those strings with corresponding Cyrillic letters when saving the table? It certainly "knows" what these letters are, since it shows them in preview. I use the following code (which does not work):
write.table(raw_data2,
file = "raw_data2.csv",
append = FALSE,
quote = TRUE,
sep = ";",
eol = "\n",
na = "NA",
dec = ",",
row.names = FALSE,
col.names = TRUE,
qmethod = c("escape", "double"),
fileEncoding = "UTF-8")
It works fine for me if you write it to an xlsx file.
openxlsx::write.xlsx(raw_data2, 'temp.xlsx')
For me, Sys.setlocale("LC_CTYPE", "russian") works well
(code source: https://www.r-bloggers.com/2013/01/r-and-foreign-characters/)
I have a 3 column csv file like this
x,y1,y2
100,50,10
200,10,20
300,15,5
I want to make a barplot in R, with the first column's values on the x axis and the second and third columns' values as grouped bars for the corresponding x. I hope I made it clear. Can someone please help me with this? My data is huge, so I have to import the csv file and can't enter all the data by hand. I found relevant posts, but none addressed exactly this.
Thank you
Use the following code
library(tidyverse)
df %>% pivot_longer(names_to = "y", values_to = "value", -x) %>%
ggplot(aes(x,value, fill=y))+geom_col(position = "dodge")
Data
df = structure(list(x = c(100L, 200L, 300L), y1 = c(50L, 10L, 15L),
y2 = c(10L, 20L, 5L)), class = "data.frame", row.names = c(NA,
-3L))
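Since the question mentions importing the CSV, here is a minimal base R sketch of that step followed by a grouped bar plot with barplot(). The CSV content is inlined through the text argument so the example is self-contained; with a real file, read.csv("your_file.csv") would take its place, where the file name is a placeholder.

```r
# Inline copy of the CSV from the question (stand-in for a real file)
csv_text <- "x,y1,y2
100,50,10
200,10,20
300,15,5"

df <- read.csv(text = csv_text)

# barplot() expects one row per series and one column per group
heights <- t(as.matrix(df[, c("y1", "y2")]))

# beside = TRUE draws y1 and y2 side by side for each x value
barplot(
  heights,
  beside      = TRUE,
  names.arg   = df$x,
  legend.text = rownames(heights),
  xlab        = "x",
  ylab        = "value"
)
```

This avoids reshaping to long format, at the cost of the finer control over appearance that ggplot2 offers.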
I have a function that yields 2 dataframes. As functions can only return one object, I combined these dataframes into a list. However, I need to work with both dataframes separately. Is there a way to automatically split the list into its component dataframes, or to write the function in a way that both objects are returned separately?
The function:
install.packages("plyr")
require(plyr)
fun.docmerge <- function(x, y, z, crit, typ, doc = checkmerge) {
mergedat <- paste(deparse(substitute(x)), "+",
deparse(substitute(y)), "=", z)
countdat <- nrow(x)
check_t1 <- data.frame(mergedat, countdat)
z1 <- join(x, y, by = crit, type = typ)
countdat <- nrow(z1)
check_t2 <- data.frame(mergedat, countdat)
doc <- rbind(doc, check_t1, check_t2)
t1<-list()
t1[["checkmerge"]]<-doc
t1[[z]]<-z1
return(t1)
}
This is the call to the function, saving the result list to the new object results.
results <- fun.docmerge(x = df1, y = df2, z = "df3", crit = c("id"), typ = "left")
The following sample data replicates the problem:
df1 <- structure(list(id = c("XXX1", "XXX2", "XXX3",
"XXX4"), tr.isincode = c("ISIN1", "ISIN2",
"ISIN3", "ISIN4")), .Names = c("id", "isin"
), row.names = c(NA, 4L), class = "data.frame")
df2 <- structure(list(id= c("XXX1", "XXX5"), wrong= c(1L,
1L)), .Names = c("id", "wrong"), row.names = 1:2, class = "data.frame")
checkmerge <- structure(list(mergedat = structure(integer(0), .Label = character(0), class = "factor"),
countdat = numeric(0)), .Names = c("mergedat", "countdat"
), row.names = integer(0), class = "data.frame")
In the example, a list containing the dataframes df3 and checkmerge is returned. I need both dataframes separately. I know that I could do it via manual assignment (e.g., checkmerge <- results$checkmerge), but I want to eliminate manual changes as much as possible and am therefore looking for an automated way.
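One possible approach, not taken from the original post, is base R's list2env(), which creates one object per named list element in a chosen environment. The sketch below uses a small hand-built list in place of the output of fun.docmerge(), so the values are illustrative only.

```r
# Hand-built stand-in for the list returned by fun.docmerge()
results <- list(
  checkmerge = data.frame(mergedat = "df1 + df2 = df3", countdat = c(4L, 4L)),
  df3        = data.frame(id = c("XXX1", "XXX2"), wrong = c(1L, NA))
)

# Create one object per list element, named after the element
invisible(list2env(results, envir = globalenv()))

# Both data frames now exist as separate objects
print(checkmerge)
print(df3)
```

Note that list2env() silently overwrites existing objects with the same names, so checking names(results) beforehand may be worthwhile.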
As the title says, I would like to save each cluster on a separate page of a PDF file.
Example data:
structure(list(P1 = c("ATCG00490", "AT5G17710", "AT2G42910",
"AT4G23600", "AT3G61540", "AT2G05990"), P2 = c("AT5G38420", "AT5G20070",
"AT5G04230", "AT1G08200", "AT4G30910", "AT5G52100"), clique = structure(list(
`930` = integer(0), `2090` = integer(0), `3120` = c(2L, 3L,
231L), `3663` = integer(0), `3704` = integer(0), `4156` = c(19L,
27L)), .Names = c("930", "2090", "3120", "3663", "3704",
"4156"), class = "AsIs")), .Names = c("P1", "P2", "clique"), row.names = c(930L,
2090L, 3120L, 3663L, 3704L, 4156L), class = "data.frame")
Some of the rows belong to many clusters and some of them to just a single one. Of course, all possible variants have to be considered.
If possible, I would like to keep only clusters which have at least two members.
This is the function I use if each row belongs to a single cluster:
pdf("clusters.pdf", width=12, height=18)
lapply(split(data_cluster, data_cluster$cluster), function(d) {
grid::grid.newpage()
gridExtra::grid.table(d)
}
)
dev.off()
Maybe it will help someone to find an answer for me.
EDIT:
I made a mistake while preparing the example data... Please take a look at my original data and then you will see that it's not that simple (at least in my opinion).
structure(list(P1 = c("ATCG00490", "AT5G17710", "AT2G42910",
"AT4G23600", "AT3G61540", "AT2G05990"), P2 = c("AT5G38420", "AT5G20070",
"AT5G04230", "AT1G08200", "AT4G30910", "AT5G52100"), clique = structure(list(
`930` = integer(0), `2090` = integer(0), `3120` = c(2L, 3L,
231L), `3663` = integer(0), `3704` = integer(0), `4156` = c(19L,
27L)), .Names = c("930", "2090", "3120", "3663", "3704",
"4156"), class = "AsIs")), .Names = c("P1", "P2", "clique"), row.names = c(930L,
2090L, 3120L, 3663L, 3704L, 4156L), class = "data.frame")
It seems that this is only a question of splitting a variable into a long-format data.frame. library(splitstackshape) does just that. Here is a solution using @Ananda's suggestion of listCol_l rather than cSplit.
library(splitstackshape)
data_cluster <- listCol_l(data_cluster, "clique")
data_cluster <- data_cluster[,n := .N >= 2,by=clique_ul][!is.na(clique_ul) & n,][,n :=NULL]
pdf("clusters.pdf", width=12, height=18)
lapply(unique(data_cluster$clique_ul), function(i) {
grid::grid.newpage()
gridExtra::grid.table(data_cluster[clique_ul == i,])
})
dev.off()
This will produce an empty pdf document with your dataset, since no cluster is repeated.
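To see the filter actually keep something, here is a base R sketch of the same idea (expand the list column to one row per cluster membership, then keep clusters with at least two rows) on made-up data in which cluster 3 is shared by two rows. The data and the name long are illustrative only, and this sketch does not use splitstackshape.

```r
# Made-up data: row 1 has no cluster, rows 2 and 3 share cluster 3
data_cluster <- data.frame(
  P1 = c("A", "B", "C"),
  P2 = c("D", "E", "F"),
  stringsAsFactors = FALSE
)
data_cluster$clique <- I(list(integer(0), c(2L, 3L), c(3L, 19L)))

# Number of clusters each row belongs to
n <- lengths(data_cluster$clique)

# Long format: repeat each row once per cluster it belongs to
long <- data.frame(
  data_cluster[rep(seq_len(nrow(data_cluster)), n), c("P1", "P2")],
  clique    = unlist(data_cluster$clique),
  row.names = NULL
)

# Keep only clusters with at least two members
keep <- as.integer(names(which(table(long$clique) >= 2)))
long <- long[long$clique %in% keep, ]

print(long)
```

The resulting long data frame can then be split by clique and written one page per cluster, in the same way as the pdf() loop above.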