Parsing XML from a URL list - R

I have a list of URLs that point to different XML files, and I want to extract some info from them using R and the XML package.
I am trying to do this with a for loop.
I have this code, but it only gives me the last XML file (numtotal). How can I read all of them?
for (i in seq(from = 1, to = numtotal, by = 1)){
urli <- xmlParse(urls[[i]], useInternalNodes = TRUE)
top_numberi <- xmlRoot(urli)
GS = data.frame(GS = xpathSApply(top_numberi,"//a//b",xmlValue))
}
where:
urls is a list of 7 or more URLs
numtotal is the length of another list (numeric value)

Every iteration of your for loop is overwriting your GS data frame. Instead of using a data frame, create a list outside the loop
l = list()
Then fill in the elements inside the loop
l[[i]] <- xpathSApply(top_numberi, "//a//b", xmlValue)
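Putting it together, a minimal sketch of the whole loop might look like this (assuming urls and numtotal are defined as in the question):
library(XML)

l <- list()
for (i in seq_len(numtotal)) {
  urli <- xmlParse(urls[[i]], useInternalNodes = TRUE)
  top_numberi <- xmlRoot(urli)
  # store each result as its own list element instead of overwriting a single data frame
  l[[i]] <- xpathSApply(top_numberi, "//a//b", xmlValue)
}
# optionally combine everything into one data frame at the end
GS <- data.frame(GS = unlist(l))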
As an aside, this is a very basic question. You should read some standard R textbooks before proceeding much further.

Related

Fill the array using matrix in each loop in R

I have multiple files containing data in tabulated form. I want to generate a 3-dimensional array in which the data from each file is stored along the third dimension. For example, if I have 10 files, then the data from the first file will be stored in the first layer of the 3D array, the data from the second file in the second layer, and so on.
Here is the dummy code I am using, but it does not work correctly.
# reading data from the file (I have a list of file names as fname)
dataDum <- read.table(fname[i],header = F, sep =';', skip=121, stringsAsFactors = FALSE)
# Assigning data to the array. I have already generated an empty array with the desired dimension
finaldata[, , i]=dataDum
It is not clear "why" your code is not working properly, as there is not reproducible example. As it is it should work correctly given the inputs are as expected. For example:
arr <- array(data = 0, dim = c(10,10,3));
for(i in 1:3){
mat <- matrix(rnorm(10^2), ncol = 10);
arr[,,i] <- mat
}
arr
If an error occurs it is likely due to dataDum being a data.frame. Explicitly using as.matrix(dataDum) would fix such an issue.
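Applied to the loop in the question, the fix might look like this sketch (assuming fname holds the file names and finaldata has already been pre-allocated with matching dimensions):
for (i in seq_along(fname)) {
  dataDum <- read.table(fname[i], header = FALSE, sep = ";", skip = 121,
                        stringsAsFactors = FALSE)
  # coerce the data.frame to a matrix before assigning it to the i-th slice
  finaldata[, , i] <- as.matrix(dataDum)
}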

R: use single file while running a for loop on list of files

I am trying to create a loop where I select one file name from a list of file names, use that one file to run read.capthist, and subsequently discretize, fit, derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and used Stack Overflow a lot to solve my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running it separately (this code works for non-simulation runs of a couple of data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
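One possible restructuring, sketched here without access to the data or the secr objects: index the character vector of file names directly so that read.capthist receives a single file name, and use paste0 to give each saved output a unique name per iteration.
library(secr)

files <- list.files(pattern = "female*")
for (i in seq_along(files)) {
  capt <- files[i]  # a single file name, not a one-element list
  femsimCH <- read.capthist(capt, simtraps, fmt = "XY", detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = "proximity")
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = "HEX",
                  method = "BFGS", trace = FALSE, CL = TRUE)
  # unique file names per iteration so earlier results are not overwritten
  save(fit, file = paste0("C:/temp/fit_", i, ".Rdata"))
  D.fit <- derived(fit)
  save(D.fit, file = paste0("C:/temp/D.fit_", i, ".Rdata"))
}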

ReadLines using multiple sources in R

I'm trying to use readLines() to scrape .txt files hosted by the Census and compile them into one .txt/.csv file. I am able to use it to read individual pages, but I'd like to be able to just run a function that will go out and readLines() based on a CSV of URLs.
My knowledge of looping and function properties isn't great, but here are the pieces of my code that I'm trying to incorporate:
Here is how I build my matrix of URLs, which I can add to and/or turn into a CSV and have a function read it that way.
MasterList <- matrix( data = c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt"), ncol = 1)
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
Here's the function (riddled with problems) I started writing:
Scrape <- function(x){
for (i in x){
URLS <- i
headers <- readLines(URLS, n=2)
bod <- readLines(URLS)
bodclipped <- bod[-c(1,2,3)]
Totes <- c(headers, bodclipped)
write(Totes, file = "[Directory]/ScrapeTest.txt")
return(head(Totes))
}
}
The idea is that I would run Scrape(urls), which would combine the 3 URLs I have in my "urls" matrix/csv, with the Census's built-in headers removed from all files except the first one (headers vs. bodclipped).
I've tried using lapply() on "urls" with readLines(), but that only generates text from the last URL and not all three, and each text file still has its headers, which I could just remove and then reattach at the end.
Any help would be appreciated!
As all of these documents are CSV files with 38 columns, you can combine them very easily using:
MasterList <- c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt")
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
raw_dat <- lapply(urls, read.csv, skip = 3, header = FALSE)
dat <- do.call(rbind, raw_dat)
What happens here and how is this looping?
The lapply function basically creates a list with 3 (= length(urls)) entries and populates them with read.csv(urls[i], skip = 3, header = FALSE). So raw_dat is a list with 3 data.frames containing your data, and do.call(rbind, raw_dat) binds them together.
The header row seems to be somehow broken; that's why I use skip = 3, header = FALSE, which is equivalent to your bod[-c(1,2,3)].
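In other words, do.call(rbind, raw_dat) is just a compact way of writing rbind(raw_dat[[1]], raw_dat[[2]], raw_dat[[3]]), and it works for any number of data frames in the list. A tiny self-contained illustration:
df1 <- data.frame(a = 1:2, b = c("x", "y"))
df2 <- data.frame(a = 3:4, b = c("z", "w"))
identical(do.call(rbind, list(df1, df2)), rbind(df1, df2))  # TRUE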
If all the scraped data fits into memory you can combine it this way and in the end write it into a file using:
write.csv(dat, "[Directory]/ScrapeTest.txt")

Searching for target in Excel spreadsheet using R

As an R noob, I'm currently rather stumped by what is probably a trivial problem. I have data that looks like the second image below: essentially a long sheet of rows with values in three columns. What I need is a way to scan the sheet looking for particular combinations of values in the first and second columns - combinations that are specified in a second spreadsheet of targets (see picture 1). When such a combination is found, I need the script to extract the whole row in question from the data file.
So far, I've managed to read the files without problem:
library(xlsx)
folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'
target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")
data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")
targets <- vector(mode = "list", length = 3)
for(i in 1:nrow(targetsDb)){
targets[[i]] <- c(targetsDb[i,1],targetsDb[i,2])
}
And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,
SID_IA <- vector(mode = "list", length = 1)
and to fill it with the values of columns 1 and 2 iteratively for each row of the data file:
for(n in 1:nrow(data)){
SID_IA[[n]] <- c(data[n,1],data[n,2])
I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.
for(i in targets){
if(SID_IA[[n]] %in% targets){
print(SID_IA[[n]], "in sentence" , data[n,1], "is ", data[n,3])
}else{
print(FALSE)
}
SID_IA[[n]] <- NULL
}
}
However, if I try to run that last loop, it returns the following output and error:
[1] FALSE
Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used
So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?
EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.
OK... I'm attempting an answer that should require minimal use of fancy tricks.
data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<- xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)
head(data)
target
These values are already in data.frame format. If all you want to know is which rows appear exactly the same in data and target, then it is as simple as a merge:
merge(target,data,all = F)
If, on the other hand, you want to keep the data table with the target rows marked, then the easiest way is to make an index column:
data$indx<- 1:nrow(data)
data
mrg<- merge(target,data,all = F)
data$test<- rep("test", nrow(data))
data$test[mrg$indx]<- "target"
data
This is like the original image you'd posted.
BTW, if you are on a graphical interface you can also use a file dialog to open data files; check out file.choose().
(Posted on behalf of the OP).
Following #R.S.'s suggestion, which didn't involve vectors and loops, and after some playing around, I have figured out how to extract the target lines and then remove them from the original data, outputting both results. I'm leaving it here for future reference and considering this solved.
extracted <- merge(targets,data,all = F)
write.xlsx(extracted,output_file1)
combined <-rbind(data,extracted)
minus.target <- combined[!duplicated(combined,fromLast = FALSE)&!duplicated(combined,fromLast = TRUE),]
write.xlsx(minus.target, output_file2)
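For reference, the same "rows of data without a target match" result can usually be obtained in one step with dplyr's anti_join (an alternative sketch, assuming the target columns share their names with the corresponding columns in data):
library(dplyr)
minus.target <- anti_join(data, targets)  # joins on all shared column names by default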

Reading from a very large XML file in R

I have a very large XML file (>70 GB) from which I only need to read some segments. However, I also don't know the structure of the file, and I have failed to extract it due to the file's size.
I don't need to read the full file or convert it to a data frame - only to extract specific parts, but I don't know the specific format for those sequences since I don't have the structure.
I tried using xmlParse, and also using xmlEventParse based on what is suggested here:
How to read large (~20 GB) xml file in R?
The code suggested there returns an empty data frame:
xmlDoc <- "Final.xml"
result <- NULL
#function to use with xmlEventParse
row.sax = function() {
ROW = function(node){
children <- xmlChildren(node)
children[which(names(children) == "text")] <- NULL
result <<- rbind(result, sapply(children,xmlValue))
}
branches <- list(ROW = ROW)
return(branches)
}
#call the xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
saxVersion = 2, trim = FALSE)
#and here is your data.frame
result <- as.data.frame(result, stringsAsFactors = F)
I have little experience working with XML, and so I don't fully understand the solution I tried to use.
Thanks for your help!
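Since the structure of the file is unknown, a possible first step (a sketch only, not tested on a 70 GB file) is to stream over the document with xmlEventParse and simply record which element names occur, without building a tree in memory. Note that the branches handler above only fires for elements literally named ROW, which may be why the result stays empty.
library(XML)

xmlDoc <- "Final.xml"
tagCounts <- new.env()

# SAX-style handler: count how often each element name occurs
countStart <- function(name, attrs) {
  n <- if (exists(name, envir = tagCounts, inherits = FALSE))
         get(name, envir = tagCounts) else 0L
  assign(name, n + 1L, envir = tagCounts)
}

xmlEventParse(xmlDoc, handlers = list(startElement = countStart))

# element names (and how often each appears) seen in the file
unlist(as.list(tagCounts))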
