I do most of my projects with lists, downloading information from the web. Sometimes when I download 100 different sets of data, the website does not give me data for a few of them.
You can tell a set has no data because it says:
A data.frame with 0 rows and 7 columns.
A good data frame says something like this:
A data.frame with 245345 rows and 7 columns.
My script does not like having no data in the list; it stops my loop at that spot.
Thank you in advance.
final = list()
TOTALERRORS = list()
#Pulls all the active USGS gages for the URL's
GageList <- CDEC
gage <- c(as.character(GageList$GAGE_ID))
duration <- c(as.character(GageList$DURATION_CODE))
number <- c(as.character(GageList$SENSOR_CODE))
View(GageList)
#CDEC URL
urls <- sprintf("http://cdec.water.ca.gov/cgi-progs/querySHEF?
station_id=%s&dur_code=%s&sensor_num=%s&start_date=10/25/2019",
gage,duration,number)
View(urls)
library(data.table)  # fread() used below comes from data.table
data <- suppressWarnings(lapply(urls, fread, header = TRUE))
It is difficult to answer without seeing the list. But from your description, here is an example:
# first, here is a list of data.frames
l <- list(as.data.frame(matrix(0, nrow = 2, ncol = 5)),
          as.data.frame(matrix(0, nrow = 1, ncol = 5)))
# here I remove the only row of the second data.frame
l[[2]] <- l[[2]][-1, ]
l
# I set up a data.frame identifying the dims of each data.frame
d2rm <- data.frame(n = rep(1:length(l), each = 2),
                   empty = unlist(lapply(l, dim)) == 0)
# I remove from the list the data.frames that have dim of 0 (in col or row)
l[[d2rm[which(d2rm$empty), 1]]] <- NULL
l
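A shorter route to the same result, and one that copes with any number of empty data.frames (a sketch, not part of the original answer):
# keep only the data.frames that actually have rows
l <- Filter(function(x) nrow(x) > 0, l)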
Fairly new to R, so any guidance is appreciated.
GOAL: I'm trying to create hundreds of dataframes in a short script. They follow a pattern, so I thought a for loop would suffice, but the data.frame assignment seems to ignore the variable nature of `i`, reading it literally. Here's an example:
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
for (i in dfTitles){
  i <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))
}
# Trying an alternative method
for (i in 1:length(dfTitles)) {
  dfTitles[i] <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))
}
This results in the creation of one data frame named 'i' in the former case, or a list of 4 in the latter. Any ideas? Thank you!
PROBABLY UNNECESSARY BACKGROUND INFORMATION: We're using fMRI data to run an analysis which will run correlations across stimuli, brain voxels, brain regions, and participants. We're correlating whole matrices, so separating the values (aka COPEs) into separate dataframes by both Participant ID and Brain Region is going to make the next step much much easier. I already had tried the next step after having loaded and sorted the data into one large dataframe and it was a big pain in the butt.
rm(list = ls())
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:3)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
nr <- length(Voxels)
nc <- length(Copes)
N <- length(dfTitles) # Number of data frames, same as length of dfTitles
DF <- vector(mode = "list", length = N)
for (i in 1:N){
  DF[[i]] <- data.frame(matrix(rnorm(nr*nc), nrow = nr))
  dimnames(DF[[i]]) <- list(Voxels, Copes)
}
names(DF) <- dfTitles
DF[1:2]
$C2000.AMY
Cope1 Cope2 Cope3 Cope4
1 -0.8293164 -1.813807 -0.3290645 -0.7730110
2 -1.1965588 1.022871 -0.7764960 -0.3056280
3 0.2536782 -0.365232 2.3949076 0.5672671
$C2000.ACC
Cope1 Cope2 Cope3 Cope4
1 -0.7505513 1.023325 -0.3110537 -1.4298174
2 1.2807725 1.216997 1.0644983 1.6374749
3 1.0047408 1.385460 0.1527678 0.1576037
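Each data frame can then be pulled out of the list by name, for example:
DF[["C2000.AMY"]]  # equivalently, DF$C2000.AMY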
When creating objects in a for loop, they need to be saved somewhere before the next iteration of the loop, or they get overwritten.
One way to handle that is to create an empty list or vector with c() before the beginning of your loop, and append the output of each run of the loop.
Another way to handle it is to assign the object to your environment before moving on to the next iteration of the loop.
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# initialize a list to store the data.frame output
df_list <- list()
for (d in dfTitles) {
  # create data.frame with the dfTitle, and 1 row per Copes observation
  df <- data.frame(dfTitle = d,
                   Copes = Copes)
  # append columns for Voxels
  # setting to NA, can be reassigned later as needed
  for (v in Voxels) {
    df[[paste0("Voxel", v)]] <- NA
  }
  # store df in the list as the 'd'th element
  df_list[[d]] <- df
  # or, assign the object to your environment
  # assign(d, df)
}
# data.frames can be referenced by name
names(df_list)
head(df_list$C2000.AMY)
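If the goal really is hundreds of separate objects in the workspace rather than one list, base R's list2env can push the whole list into the global environment in one call (a sketch following on from df_list above):
# creates objects named C2000.AMY, C2000.ACC, ... from the list elements
list2env(df_list, envir = .GlobalEnv)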
I currently have code to extract certain details within a PDF document. However, as I have thousands of other PDF documents to extract information from, I would like to automate this process. I am using the pdf_text option to read PDFs into R. My code looks something like this:
library(pdftools)
library(stringr)  # str_split() below comes from stringr
x <- pdf_text("Test.pdf")
y1 <- str_split(x, "\r")
#pdf output contains a total of 7 lists
a <- y1 [[4]]
b <- c(a[4],a[11:13]) #Obtain only rows 4, 11 to 13 from list 4
n2 <- y1[[3]]
n3 <- c(n2[3]) #Obtain only rows 3 from list 3
n <- y1[[5]]
n1 <- c(n[3]) #Obtain only rows 3 from list 5
c <- y1[[6]]
d <- c(c[4:18]) #Obtain only rows 4 to 18 from list 6
e <- c(n3,b,d,n1) #Combining all necessary information into one list
z <- substr(e[1:21], start = 15, stop = 200) #to remove white spaces between quotes
Name <- z[1]
InterestedParty <- z[2]
TotalOwnBefore <- substr(z[11], start = 97, stop = 120)
Ownership <- list(NM = Name, Party = InterestedParty, OwnBefore = TotalOwnBefore)
write.csv(Ownership, file="MyData.csv")
The above code allows me to output a file for a single company. However, I have thousands of other PDFs ("Test_1.pdf" to "Test_1000.pdf") to read. Is there a way to automate the reading of the PDF files into R with pdf_text? It would also be great if there were a way to store all results in a single file instead of one per firm.
I have since managed to automate the process using a for loop as follows:
for (i in 1:1000){
  x <- paste("Test_", i, ".pdf", sep = "")
  y <- pdf_text(x)
  total <- strsplit(y, "\r")
  print(total)
}
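To also collect everything into a single file instead of one per firm, one option (a sketch; extract_ownership() is a hypothetical wrapper around the per-file steps shown earlier, not an existing function) is to build one row per PDF and bind them together:
# hypothetical wrapper: runs the splitting/substr steps from above
# on one file and returns the Ownership fields as a one-row data frame
extract_ownership <- function(path) {
  x <- pdf_text(path)
  # ... same str_split/substr extraction as in the question ...
  data.frame(NM = NA, Party = NA, OwnBefore = NA)  # placeholder row
}

files <- sprintf("Test_%d.pdf", 1:1000)
all_firms <- do.call(rbind, lapply(files, extract_ownership))
write.csv(all_firms, file = "MyData.csv", row.names = FALSE)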
I need to count mutations in the genome that occur at certain spots, or rather ranges. The mutations have a genomic position (chromosome and basepair, e.g. Chr1, 10658324). The range, or spot respectively, is defined as 10000 basepairs up- and downstream (±) of a given position in the genome. Both the positions of the mutations and the positions of the "spots" are stored in data frames.
Example:
set.seed(1)
Chr <- 1
Pos <- as.integer(runif(5000 , 0, 1e8))
mutations <- data.frame(Pos, Chr)
Chr <- 1
Pos <- as.integer(runif(50 , 0, 1e8))
spots <- data.frame(Pos, Chr)
So the question I am asking is: how many mutations are present within ±10k basepairs of the positions given in "spots"? (E.g. if the spot is at 100k, the range would be 90k-110k.)
The real data would of course contain all 24 chromosomes, but for the sake of simplicity we can focus on one chromosome for now.
The final data should contain the "spot" and the number of mutations in its vicinity, ideally in a data frame or matrix.
Many thanks in advance for any suggestions or help!
Here's a first attempt, but I am pretty sure there is a far more elegant way of doing it.
library(dplyr) # filter() below comes from dplyr

w <- 10000 # setting range to 10k basepairs
loop <- spots$Pos # creating vector of positions to loop through
out <- data.frame(0, 0)
colnames(out) <- c("Pos", "Count")
for (l in loop) {
  temp <- nrow(filter(mutations, Pos >= l - w, Pos <= l + w))
  temp2 <- cbind(l, temp)
  colnames(temp2) <- c("Pos", "Count")
  out <- rbind(out, temp2)
}
out <- out[-1, ]
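For a single chromosome the same counts can be had without the loop bookkeeping; a base-R sketch of the same idea:
# for each spot, count mutations within +-w basepairs
out <- data.frame(Pos = spots$Pos,
                  Count = sapply(spots$Pos,
                                 function(p) sum(abs(mutations$Pos - p) <= w)))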
Using data.table foverlaps, then aggregate:
library(data.table)
#set the flank
myFlank <- 100000
#convert to ranges with flank
spotsRange <- data.table(
  chr = spots$Chr,
  start = spots$Pos - myFlank,
  end = spots$Pos + myFlank,
  posSpot = spots$Pos,
  key = c("chr", "start", "end"))
#convert to ranges, start and end same as pos
mutationsRange <- data.table(
  chr = mutations$Chr,
  start = mutations$Pos,
  end = mutations$Pos,
  key = c("chr", "start", "end"))
#merge by overlap
res <- foverlaps(mutationsRange, spotsRange, nomatch = 0)
#count mutations
resCnt <- data.frame(table(res$posSpot))
colnames(resCnt) <- c("Pos", "MutationCount")
merge(spots, resCnt, by = "Pos")
# Pos Chr MutationCount
# 1 3439618 1 10
# 2 3549952 1 15
# 3 4375314 1 11
# 4 7337370 1 13
# ...
I'm not familiar with bed manipulations in R, so I'm going to propose an answer with bedtools, and someone here can try to convert it to GRanges (a rough sketch follows below) or another R bioinformatics library.
Essentially, you have two bed files, one with your spots and the other with your mutations (I'm assuming a 1bp coordinate for each in the latter). In this case, you'd use closestBed to get the closest spot and its distance in bp for each mutation, and then filter those that are within 10KB of the spots. The code in a UNIX environment would look something like this:
# Assuming 4-column file structure (chr start end name)
closestBed -d -a mutations.bed -b spots.bed | awk '$9 <= 10000 {print}'
Where column 9 ($9) will be the distance in bp from the closest spot. Depending on how specific you want to be, you can check the manual page at http://bedtools.readthedocs.io/en/latest/content/tools/closest.html. I'm pretty sure there's at least one bedtools-like package in R. If the functionality is similar, you can apply this exact same solution.
Hope that helps!
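For those who prefer staying in R, here is a rough GRanges equivalent of the same overlap count (an untested sketch, assuming the Bioconductor GenomicRanges package and the 10kb flank from the question):
library(GenomicRanges)

# spots expanded by +-10kb; mutations as 1bp ranges
spotsGR <- GRanges(seqnames = as.character(spots$Chr),
                   ranges = IRanges(start = spots$Pos - 10000,
                                    end   = spots$Pos + 10000))
mutGR <- GRanges(seqnames = as.character(mutations$Chr),
                 ranges = IRanges(start = mutations$Pos, width = 1))

# number of mutations falling in each spot's window
spots$MutationCount <- countOverlaps(spotsGR, mutGR)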
I have a very simple assignment for a project that requires processing a large amount of information; my professor's first words were "this will take a while to run", so I figured it'd be a good opportunity to spend the time the program would be running making a super efficient one :P
Basically, I have a input file where each line is either a node or details. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I want the node name, the value, and the third column of the value line. I have implemented this using an ultra-slow for loop that calls strsplit a whole bunch of times, and obviously this is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#") # find lines with nodes
vallines <- which(substring(data, 1, 3) == "val")
out <- vector(mode = "character", length = length(vallines))
for (i in vallines) {
  line_ra <- strsplit(data[i], "\\s+")[[1]]
  # ... and so on, using a bunch of strsplits and pastes to reformat
  out[i] <- paste(node, val, value, sep = "\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vector-wise splitting for everything, and so far I've found that the main thing I can't split correctly is the name of each node. I'm trying to do something like,
names <- data[max(nodes[nodelines < vallines])]
where nodes are the names of the lines containing a node, nodelines their line numbers, and vallines the line numbers of the lines containing a val. The return vector should have the same number of elements as vallines. The goal is to find, for each element of vallines, the largest nodeline that is less than that line number. Any thoughts?
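That "largest node line below each value line" lookup is what base R's findInterval computes directly; a sketch using the line-number vectors from the edit:
# for each value line, the index of the most recent node line before it
idx <- findInterval(vallines, nodelines)
# the node header text belonging to each value line
names <- data[nodelines[idx]]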
I suggest using the data.table package - it has a very fast string-split function, tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
dt[, c('IsNode', 'NodeId') := list(IsNode <- substr(data, 1, 1) == '#', cumsum(IsNode))]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]
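To turn res into the tab-separated lines the question asks for, a possible last step (a sketch; the pattern that trims '#NODE1_length_17_2309482...' down to 'NODE1_length_17' is an assumption based on the example data):
# assumed trimming rule: keep the first three '_'-separated fields
res[, node := sub("^#([^_]+_[^_]+_[^_]+).*", "\\1", data)]
writeLines(res[, paste(node, val, value, sep = "\t")])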
I need to read a csv file in R. But the file contains some text information in some rows instead of comma-separated values. So I cannot read that file using the read.csv(fileName) method.
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values of each name/date pair as a data frame. How can I read the file to do that?
Actually, my required output is:
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
You can read the data with scan and use the grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings staring with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
                            dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
  lapply(x, function(y)
    if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
                                                       ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz
I would read the entire file first as a vector of strings, one per line in the file; this can be done using readLines. Next you have to find the places where the data for a new name/date pair starts, i.e. look for `,,` (see grep for that). Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
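Putting those four steps together, a minimal sketch (using the `text` object from the answer above; the lookbehind pattern handed to str_extract is my own choice):
library(stringr)

lines <- readLines(textConnection(text))
# each ',,' line closes the block for one name/date pair
breaks <- grep("^,,$", lines)
starts <- c(1, breaks + 1)                # first line of every block
block <- findInterval(seq_along(lines), starts)
dat <- lapply(split(lines, block), function(x) {
  rows <- x[-1]                           # drop the name/date header
  rows <- rows[rows != ",,"]              # drop the ',,' separator
  as.data.frame(do.call(rbind, strsplit(rows, ",")))
})
# name each data frame after its 'name:' value
names(dat) <- str_extract(lines[starts], "(?<=name:)\\S+")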