Appending to a text file in a loop - R

I have a data frame called MetricsInput which looks like this:
ID  ExtractName   Dimensions  Metrics      First_Ind
124 extract1.txt  ga:date     ga:sessions  1
128 extract1.txt  ga:date     ga:sessions  0
134 extract1.txt  ga:date     ga:sessions  0
124 extract2.txt  ga:browser  ga:users     1
128 extract2.txt  ga:browser  ga:users     0
134 extract2.txt  ga:browser  ga:users     0
I'm trying to use this data frame in a loop to run a series of queries that will ultimately create two text files, extract1.txt and extract2.txt. The reason I have the First_Ind field is that I only want to write the column headings on the first pass through each unique file.
Here's my loop. The issue I'm having is that the data for each ID is not appending -- I seem to be overwriting my results instead. Where did I go wrong?
for (i in seq(from = 1, to = nrow(MetricsInput), by = 1)) {
  id <- MetricsInput[i, 1]
  myresults <- ga$getData(id, batch = TRUE, start.date = "2013-12-01",
                          end.date = "2014-01-01",
                          metrics = MetricsInput[i, 4],
                          dimensions = MetricsInput[i, 3])
  appendcolheads <- ifelse(MetricsInput[i, 5] == 1, TRUE, FALSE)
  write.table(myresults, file = MetricsInput$ExtractName[i], append = TRUE,
              row.names = FALSE, col.names = appendcolheads, sep = "\t")
}

Although you can get this code to work, it doesn't look like the right approach at all. As @MrFlick said in the comments, it's very hard to help without being able to reproduce your problem, but I would do something along the following lines:
GetData <- function(id, metric, dim) {
  d <- ga$getData(id, batch = TRUE, start.date = "2013-12-01",
                  end.date = "2014-01-01", metrics = metric, dimensions = dim)
  d$id <- id
  d
}
myresults <- Map(GetData,
                 id = MetricsInput$ID,
                 metric = MetricsInput$Metrics,
                 dim = MetricsInput$Dimensions)
This will give you a list whose i-th component is the output of the i-th iteration of your for loop. Now you have to split it in two to write the files you wanted:
myresultslist <- split(myresults, MetricsInput$ExtractName)
myresultslist <- lapply(myresultslist, do.call, what = rbind)
Map(write.table, x = myresultslist, file = names(myresultslist),
    row.names = FALSE, sep = "\t")
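If it helps to see the whole pipeline end to end, here is a minimal sketch with ga$getData replaced by a made-up FakeGetData() stub so it runs without a Google Analytics connection (the data and the stub are illustrative only):
MetricsInput <- data.frame(
  ID          = c(124, 128, 124, 128),
  ExtractName = c("extract1.txt", "extract1.txt", "extract2.txt", "extract2.txt"),
  Dimensions  = c("ga:date", "ga:date", "ga:browser", "ga:browser"),
  Metrics     = c("ga:sessions", "ga:sessions", "ga:users", "ga:users"),
  stringsAsFactors = FALSE
)
FakeGetData <- function(id, metric, dim) {
  # stands in for ga$getData(); returns two rows per query
  d <- data.frame(c("a", "b"), c(id, id + 1))
  names(d) <- c(dim, metric)
  d$id <- id
  d
}
myresults <- Map(FakeGetData,
                 id = MetricsInput$ID,
                 metric = MetricsInput$Metrics,
                 dim = MetricsInput$Dimensions)
myresultslist <- split(myresults, MetricsInput$ExtractName)
myresultslist <- lapply(myresultslist, do.call, what = rbind)
Map(write.table, x = myresultslist, file = names(myresultslist),
    row.names = FALSE, sep = "\t")
Each output file ends up with a single header row followed by all of its rows, so the First_Ind column is no longer needed.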

Why don't you create a data frame in the loop and then write it to the text file?
myresults <- data.frame()
for (i in yourloop) {
  # your code here
  id <- MetricsInput[i, 1]
  temp <- ga$getData(id, batch = TRUE, start.date = "2013-12-01",
                     end.date = "2014-01-01", metrics = MetricsInput[i, 4],
                     dimensions = MetricsInput[i, 3])
  myresults <- rbind(myresults, temp)
}
write.csv(myresults, ...)
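This writes everything into a single file, though. If you still want one file per ExtractName as in the question, a small variation on the same idea (a sketch, assuming ga$getData works as in your loop) accumulates into a list keyed by file name and writes each element once at the end:
myresults <- list()
for (i in seq_len(nrow(MetricsInput))) {
  fname <- MetricsInput$ExtractName[i]
  temp <- ga$getData(MetricsInput[i, 1], batch = TRUE,
                     start.date = "2013-12-01", end.date = "2014-01-01",
                     metrics = MetricsInput[i, 4], dimensions = MetricsInput[i, 3])
  myresults[[fname]] <- rbind(myresults[[fname]], temp)  # NULL on first hit, so rbind just keeps temp
}
for (fname in names(myresults)) {
  write.table(myresults[[fname]], file = fname, row.names = FALSE, sep = "\t")
}
This also sidesteps the First_Ind bookkeeping, since each file is written in one go with a single header row.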

Related

How can I process my StringTie data so that I can run DESeq2 using R?

I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10). I am interested in comparing the parental and B10 cell lines. The issue seems to be that my StringTie files are separate, meaning I have one for the parental cell line and one for B10. I've included the code I have written to date for context along with the error messages I received and troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I could get. This isn't something that anyone in my lab has done before so I'm struggling to do this without any guidance.
Thank you all in advance!
# My code to go from StringTie to count data:
(I copy-pasted this, so all my notes are included. I'm new to R, so the comments are really just for me; I'm not trying to condescendingly explain every bit of the code. You all likely know much more than I do.)
# Open Data
# Packages used below
library(readr)     # read_tsv(), read_csv()
library(tximport)  # tximport()
library(dplyr)     # full_join()
library(DESeq2)    # DESeqDataSetFromMatrix(), DESeqDataSetFromTximport()
# List StringTie output files for all samples
# All files should be in the same directory
files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)
tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)
tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)
# Create a filter (vector) showing which rows have at least two columns with 5 or more counts
txi_B10.filter<-apply(txi_B10$counts,1,function(x) length(x[x>5])>=2)
txi_parental.filter<-apply(txi_parental$counts,1,function(x) length(x[x>5])>=2)
head(txi_parental.filter)
sum(txi_B10.filter)
# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the filter value is TRUE
txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]
txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]
# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")
write.csv(txi_parental$counts, "txi_parental.counts.csv")
# Open count data
# Do this in order that the files are organized in file manager
txi_B10_counts <- read_csv("txi_B10.counts.csv")
txi_parental_counts <- read_csv("txi_parental.counts.csv")
# Set column names
colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
View(txi_B10_counts)
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
View(txi_parental_counts)
## R is case sensitive, so you just want to ensure that everything is in the same case
## convert Gene names which is column [[1]] into lowercase
txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
View(txi_parental_counts)
txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
View(txi_B10_counts)
## Capitalize the first letter of each gene name
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
View(txi_parental_counts)
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
View(txi_B10_counts)
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html
mergedCounts <- full_join (x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
View(mergedCounts)
# Replace NA with value = 0
mergedCounts[is.na(mergedCounts)] = 0
View(mergedCounts)
# Save file for merged counts
write.csv(mergedCounts, "MergedCounts.csv")
## --------------------------------------------------------------------------------
# My code to go from count data to DEseq2
# Import data
# I added my metadata in case the issue is how I set up the columns
# metaData is a file with your sample names and Comparison
# Your second column in metaData must be called Comparison, otherwise you'll get an error in the dds line
metaData <- read.csv('metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")
# Assign "Gene Names" as row names
# Notice how there's suddenly an extra column (X)?
# R automatically created and assigned column x as row names
# If you don't fix this the # of columns won't add up
rownames(countData) <- countData[,1]
countData <- countData[,-1]
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!!
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = metaData,
                              design = ~ Comparison, tidy = TRUE)
# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers
## --------------------------------------------------------------------------------
# How I tried to fix this:
# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code
mergedCounts[is.na(mergedCounts)] = 0
View(mergedCounts)
# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead
dds <- DESeqDataSetFromTximport(countData,
                                colData = metaData,
                                design = ~ Comparison, tidy = TRUE)
# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison, : is(txi, "list") is not TRUE
# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I was interested in comparing n1 verses n2 for each cell line but that is not of interest to me
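The last few comments already point at the most promising direction: rather than merging two exported count tables, list the StringTie t_data.ctab files for both cell lines in one vector, import them together with tximport, and pass the resulting txi object straight to DESeqDataSetFromTximport. A sketch of that idea is below; the sample names and Comparison labels are assumptions based on the column names used above:
files <- c(files_parental, files_B10)   # all four samples in one import
names(files) <- c("parental_n1", "parental_n2", "B10_n1", "B10_n2")
tmp <- read_tsv(files[1])
tx2gene <- tmp[, c("t_name", "gene_name")]
txi <- tximport(files, type = "stringtie", tx2gene = tx2gene)
# colData: one row per sample, in the same order as the columns of txi$counts
metaData <- data.frame(row.names = names(files),
                       Comparison = c("parental", "parental", "B10", "B10"))
dds <- DESeqDataSetFromTximport(txi, colData = metaData, design = ~ Comparison)
This keeps parental and B10 in the same object, so the design formula can compare them directly, and it should avoid the "some values in assay are not integers" error because DESeqDataSetFromTximport is designed to accept tximport's estimated (non-integer) counts.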

Problems extracting metadata from NCBI in R

I am trying to extract some information (metadata) from GenBank using the R package rentrez and the example I found here: https://ajrominger.github.io/2018/05/21/gettingDNA.html. Specifically, for a particular group of organisms I search for all records that have geographical coordinates, and I then want to extract the accession number, taxon, sequenced locus, country, lat_lon, and collection date for each record. As output, I want a csv file with the data for each record in a separate row. The code below seems to do the job, but at some point rows get muddled, with data from different records overlapping neighbouring rows. For example, of the 157 records that rentrez retrieves from NCBI, 109 rows in the file look like what I want to achieve, but the rest are a total mess. I would greatly appreciate any advice on how to fix the issue, because I am a total newbie with R and figuring out each step takes a lot of time.
setwd("C:/R-Works")
library('XML')
library('rentrez')
argasid <- entrez_search(db = "nuccore", term = "Argasidae[Organism] AND [lat]",
                         use_history = TRUE, retmax = 15000)
x <- entrez_fetch(db = "nuccore", id = argasid$ids, rettype = "native",
                  retmode = "xml", parsed = TRUE)
x <- xmlToList(x)
cleanEntrez <- function(x) {
  basePath <- 'Seq-entry_seq.Bioseq'
  c(
    genbank = as.character(x[paste(basePath,
                                   'Bioseq_id', 'Seq-id', 'Seq-id_genbank',
                                   'Textseq-id', 'Textseq-id_accession',
                                   sep = '.')]),
    taxon = as.character(x[paste(basePath,
                                 'Bioseq_descr', 'Seq-descr', 'Seqdesc',
                                 'Seqdesc_source', 'BioSource', 'BioSource_org',
                                 'Org-ref', 'Org-ref_taxname',
                                 sep = '.')]),
    bseqdesc_title = as.character(x[paste(basePath,
                                          'Bioseq_descr', 'Seq-descr', 'Seqdesc',
                                          'Seqdesc_title',
                                          sep = '.')]),
    lat_lon = as.character(x[grep('lat-lon', x) + 1]),
    geo_description = as.character(x[grep('country', x) + 1]),
    coll_date = as.character(x[grep('collection-date', x) + 1])
  )
}
getGenbankMeta <- function(ids) {
  allRec <- entrez_fetch(db = 'nuccore', id = ids,
                         rettype = 'native', retmode = 'xml',
                         parsed = TRUE)
  allRec <- xmlToList(allRec)[[1]]
  o <- lapply(allRec, function(x) {
    cleanEntrez(unlist(x))
  })
  temp <- array(unlist(o), dim = c(length(o[[1]]), length(ids)))
  seqVec <- temp[nrow(temp), ]
  seqDF <- as.data.frame(t(temp[-nrow(temp), ]))
  names(seqDF) <- names(o[[1]])[-nrow(temp)]
  return(list(seq = seqVec, data = seqDF))
}
write.csv(getGenbankMeta(argasid$ids), 'argasid_georef.csv')
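Not a full answer, but the muddled rows are very likely coming from records that lack one of the optional fields (lat-lon, country, or collection-date): cleanEntrez() then returns a shorter vector for that record, and array(unlist(o), dim = ...) silently recycles values to fill the matrix, shifting later records out of alignment. A tiny illustration with made-up vectors standing in for cleanEntrez() output:
o <- list(
  c(genbank = "AB1", taxon = "Ixodes", lat_lon = "10 N 20 E"),
  c(genbank = "AB2", taxon = "Argas")   # this record has no lat-lon qualifier
)
array(unlist(o), dim = c(3, 2))
#      [,1]        [,2]
# [1,] "AB1"       "AB2"
# [2,] "Ixodes"    "Argas"
# [3,] "10 N 20 E" "AB1"     <- values wrap around; rows stop lining up by record
A more robust approach is to have cleanEntrez() always return the same set of named fields, filling any the record lacks with NA, and then build the table with do.call(rbind, o) instead of array(), so each record stays in its own column or row.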

Problems with binding columns from two data frames using a for loop in R

I have 7 each of two different asc files loaded into R, asc[i] and wasc[i], where [i] denotes that there are 1:7 ascs and wascs loaded into R. I need to combine wasc[i] with asc[i][[1]] (just the first column of asc[i] with the whole wasc[i] file).
This should be repeated for every pair of asc and wasc files.
The code keeps giving me blank data frames, so I don't know why this doesn't work. The naming is correct, yet the code is not recognizing that asc[i] and wasc[i] correspond to the previously loaded files.
Any help will be greatly appreciated.
# These data frames will reproduce my issue
asc1  <- data.frame(x = c(rep("A.tif", 20)), y = 1:20)
wasc1 <- data.frame(x = c(rep("B.tif", 20)), y = c(rep("Imager", 20)))
asc2  <- data.frame(x = c(rep("A.tif", 20)), y = 1:20)
wasc2 <- data.frame(x = c(rep("B.tif", 20)), y = c(rep("Imager", 20)))
asc3  <- data.frame(x = c(rep("A.tif", 20)), y = 1:20)
wasc3 <- data.frame(x = c(rep("B.tif", 20)), y = c(rep("Imager", 20)))
for (i in 1:3) {
  d <- paste("asc", i, sep = "")
  f <- paste("wasc", i, sep = "")
  full_wing <- as.character(paste("full_wing", i, sep = ""))
  assign(full_wing, cbind(d[[1]], f))
}
# Output of full_wing1 data frame
dput(full_wing1)
structure(c("asc1", "wasc1"), .Dim = 1:2, .Dimnames = list(NULL,
c("", "f")))
Additional Information:
asc files are 19 columns long
wasc files are 13 columns long
I only want to combine column 1 from the asc file with the entire wasc file, thus cutting out the remaining 18 columns of the asc file.
# put data in a list
asc = mget(ls(pattern = "^asc"))
wasc = mget(ls(pattern = "^wasc"))
full_wing = Map(f = function(w, a) cbind(w, a[[1]]), w = wasc, a = asc)
Map is a nice shortcut for iterating in parallel over multiple arguments. It returns a nice list. You can access the individual elements with, e.g., full_wing[[1]], full_wing[[3]], etc. Map is just a shortcut, the above code is basically equivalent to the for loop below:
results = list()
for (i in seq_along(asc)) {
  results[[i]] = cbind(wasc[[i]], asc[[i]][[1]])
}
I use mget to put the data in a list because in your example you already have objects like asc1, asc2, etc. A much better way to go is to never create those variables in the first place; instead, read the files directly into a list, something like this:
asc_paths = list.files(pattern = "^asc")
asc = lapply(asc_paths, read.table)
You can see a lot more explanation of this at How to make a list of data frames?
If you only ever need one column of the asc files, another way to simplify this would be to only read in the needed column, see Only read limited number of columns for some recommendations there.
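For example, if the asc files are plain delimited text that read.table can handle (an assumption; adjust for the real format), colClasses lets you skip the 18 columns you don't need while reading, given that each asc file has 19 columns:
asc_paths <- list.files(pattern = "^asc")
asc <- lapply(asc_paths, read.table,
              colClasses = c(NA, rep("NULL", 18)))  # keep column 1, skip the other 18
Here NA means "let read.table guess the type" and "NULL" means "drop this column entirely".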

Need help writing data from a table in R for unique values using a loop

Trying to figure out why, when I run this code, all the information from the columns is being written to the first file only. What I want is for only the data unique to each MO number to be written out. I believe the problem is in the third line, but I am not sure how to divide the data by each unique number.
Thanks for the help,
for (i in 1:nrow(MOs_InterestDF1)) {
  MO = MOs_InterestDF1[i, 1]
  df = MOs_Interest[MOs_Interest$MO_NUMBER == MO,
                    c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS",
                      "ACTRUNHRS", "Difference", "Sum")]
  submit.df <- data.frame(df)
  filename = paste("Variance", "Report", MO, ".csv", sep = "")
  write.csv(submit.df, file = filename, row.names = FALSE)
}
If you are trying to write out a separate csv for each unique MO number, then something like this may work to accomplish that.
unique.mos <- unique(MOs_Interest$MO_NUMBER)
for (mo in unique.mos) {
  submit.df <- MOs_Interest[MOs_Interest$MO_NUMBER == mo,
                            c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS",
                              "ACTRUNHRS", "Difference", "Sum")]
  filename <- paste("Variance", "Report", mo, ".csv", sep = "")
  write.csv(submit.df, file = filename, row.names = FALSE)
}
It's hard to answer fully without example data (what are the columns of MOs_InterestDF1?) but I think your issue is in the df line. Are you trying to subset the dataframe to only the data matching the MO? If so, try which as in df = MOs_Interest[which(MOs_Interest$MO_NUMBER == MO),].
I wasn't sure if you actually had two separate dfs (MOs_Interest and MOs_InterestDF1); if not, make sure the df line points to the correct data frame.
I tried to create some simplified sample data:
MOs_InterestDF1 <- data.frame("MO_NUMBER" = c(1, 2, 3),
                              "Item_No" = c(142, 423, 214),
                              "Desc" = c("Plate", "Book", "Table"))
for (i in 1:nrow(MOs_InterestDF1)) {
  MO = MOs_InterestDF1[i, 1]
  mydf = data.frame(MOs_InterestDF1[which(MOs_InterestDF1$MO_NUMBER == MO), ])
  filename = paste("This is number ", MO, ".csv", sep = "")
  write.csv(mydf, file = filename, row.names = FALSE)
}
This output three different csv files, each with exactly one row of data. For example, "This is number 1.csv" had the following data:
MOs Item_No Desc
1 142 Plate

Returned objects within a list while keeping the original data structure in R

In R, I need to return two objects from a function:
myfunction <- function(input.file)
{
  a.data.frame <- read.csv(file = input.file, header = TRUE, sep = ",", dec = ".")
  index.hash <- get_indices_function(colnames(a.data.frame))
  alist <- list("a.data.frame" = a.data.frame, "index.hash" = index.hash)
  return(alist)
}
But the returned objects from myfunction both come back inside a list, not as a data.frame and a hash.
Any help would be appreciated.
You can only return one object from an R function; this is consistent with pretty much every other language I've used. However, you'll note that the objects retain their original structure within the list, so alist[[1]] and alist[[2]] should be the data frame and the hash respectively, still structured as a data frame and a hash. Once you've returned them from the function, you can split them out into unique objects if you want :).
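For example (a quick sketch; the file name passed to myfunction is hypothetical, and get_indices_function comes from your own code):
result <- myfunction("my_data.csv")   # hypothetical input file
class(result)                 # "list"        -- the wrapper
class(result$a.data.frame)    # "data.frame"  -- unchanged inside the list
class(result$index.hash)      # whatever get_indices_function() returns
a.data.frame <- result$a.data.frame   # split back out into separate objects
index.hash <- result$index.hash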
You can use a structure.
return(structure(class = "myclass",
                 list(data = a.data.frame,
                      type = anytype,
                      page.content = page.content.as.string.vector,
                      knitr = knitr)))
Then you can access your data with
values <- myfunction(...)
values$data
values$type
values$page.content
values$knitr
and so on.
A working example from my package:
sju.table.values <- function(tab, digits = 2) {
  if (class(tab) != "ftable") tab <- ftable(tab)
  tab.cell <- round(100 * prop.table(tab), digits)
  tab.row <- round(100 * prop.table(tab, 1), digits)
  tab.col <- round(100 * prop.table(tab, 2), digits)
  tab.expected <- as.table(round(as.array(margin.table(tab, 1)) %*%
                                   t(as.array(margin.table(tab, 2))) /
                                   margin.table(tab)))
  # -------------------------------------
  # return results
  # -------------------------------------
  invisible(structure(class = "sjutablevalues",
                      list(cell = tab.cell,
                           row = tab.row,
                           col = tab.col,
                           expected = tab.expected)))
}
tab <- table(sample(1:2, 30, TRUE), sample(1:3, 30, TRUE))
# show expected values
sju.table.values(tab)$expected
# show cell percentages
sju.table.values(tab)$cell
