Is there an R function for parsing INI-like configuration files?
While searching I only found this discussion.
Here is an answer that was given to exactly the same question on r-help in 2007 (thanks to @Spacedman for pointing this out):
Parse.INI <- function(INI.filename)
{
  connection <- file(INI.filename)
  Lines <- readLines(connection)
  close(connection)

  Lines <- chartr("[]", "==", Lines)  # change section headers

  connection <- textConnection(Lines)
  d <- read.table(connection, as.is = TRUE, sep = "=", fill = TRUE)
  close(connection)

  L <- d$V1 == ""  # location of section breaks
  d <- subset(transform(d, V3 = V2[which(L)[cumsum(L)]])[1:3],
              V1 != "")

  ToParse <- paste("INI.list$", d$V3, "$", d$V1, " <- '",
                   d$V2, "'", sep = "")

  INI.list <- list()
  eval(parse(text = ToParse))

  return(INI.list)
}
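To illustrate, here is a minimal usage sketch (the sample INI content and the temp file are made up for the demo, not part of the original r-help answer):

# Hypothetical sample INI content, written to a temp file for the demo
ini_text <- c("[database]",
              "host=localhost",
              "port=5432",
              "[logging]",
              "level=debug")
tmp <- tempfile(fileext = ".ini")
writeLines(ini_text, tmp)

cfg <- Parse.INI(tmp)
cfg$database$host  # "localhost"
cfg$logging$level  # "debug"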
Actually, I wrote a short and presumably buggy function (i.e., it does not cover all corner cases) which works for me for now:
read.ini <- function(x) {
  # x is either a file name or the INI content itself (a string containing newlines)
  if (length(x) == 1 && !any(grepl("\n", x, fixed = TRUE))) lines <- readLines(x) else lines <- x
  lines <- unlist(strsplit(lines, "\n", fixed = TRUE))
  lines <- gsub("\r$", "", lines)                          # drop Windows line endings
  lines <- lines[!grepl("^;", lines) & nchar(lines) >= 2]  # strip comments & blank lines
  idx <- which(grepl("^\\[.+\\]$", lines))                 # section headers
  if (length(idx) == 0 || idx[[1]] != 1) stop("invalid INI file. Must start with a section.")
  res <- list()
  fun <- function(from, to) {
    tups <- strsplit(lines[(from + 1):(to - 1)], "[ ]*=[ ]*")
    for (i in seq_along(tups))  # re-join values that themselves contain "="
      if (length(tups[[i]]) > 2)
        tups[[i]] <- c(tups[[i]][[1]], paste(tail(tups[[i]], -1), collapse = "="))
    tups <- unlist(tups)
    keys <- strcap(tups[seq(from = 1, by = 2, length.out = length(tups) / 2)])
    vals <- tups[seq(from = 2, by = 2, length.out = length(tups) / 2)]
    sec <- strcap(substring(lines[[from]], 2, nchar(lines[[from]]) - 1))
    res[[sec]] <<- setNames(vals, keys)
  }
  mapply(fun, idx, c(tail(idx, -1), length(lines) + 1))
  return(res)
}
where strcap is a helper function that capitalizes a string:
strcap <- function(s) paste(toupper(substr(s, 1, 1)), tolower(substring(s, 2)), sep = "")
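A quick usage sketch of read.ini (the example input is my own; note that because of strcap, section and key names come back capitalized):

ini <- paste("[owner]",
             "name = John",
             "org = Example Co",
             "[database]",
             "server = 192.0.2.62",
             sep = "\n")

read.ini(ini)
# $Owner
#         Name          Org
#       "John" "Example Co"
#
# $Database
#       Server
# "192.0.2.62"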
There are also some C solutions for this, like inih or libini that might be useful. I did not try them out, though.
Hope you don't mind if this is too easy for you.
In R, I am using fromJSON() to read from 3 URLs (tier 1 URLs); in each JSON file there is a "link" field which gives me another URL (tier 2 URL), and I use that with read.table() to get my final data. My code now is like this:
# note, this code does not run
urlJohn <- "www.foo1.com"
urlJane <- "www.foo2.com"
urlJoe  <- "www.foo3.com"

tempJohn <- fromJSON(urlJohn)
tempJohn[["data"]][["rows"]]$link %<>%
  {clean up this data}
dataJohn <- read.table(tempJohn[["data"]][["rows"]]$link,
                       header = TRUE,
                       sep = ",")

tempJane <- fromJSON(urlJane)
tempJane[["data"]][["rows"]]$link %<>%
  {clean up this data}
dataJane <- read.table(tempJane[["data"]][["rows"]]$link,
                       header = TRUE,
                       sep = ",")

tempJoe <- fromJSON(urlJoe)
tempJoe[["data"]][["rows"]]$link %<>%
  {clean up this data}
dataJoe <- read.table(tempJoe[["data"]][["rows"]]$link,
                      header = TRUE,
                      sep = ",")
As you can see, I am just copy-and-pasting code blocks. What I wish for is this:
# note, this code also does not run
urlJohn <- "www.foo1.com"
urlJane <- "www.foo2.com"
urlJoe  <- "www.foo3.com"

source <- c("John", "Jane", "Joe")
for (i in source) {
  temp <- paste(temp, i, sep = "")
  url  <- paste(url, i, sep = "")
  data <- paste(data, i, sep = "")
  temp <- fromJSON(url)
  temp[["data"]][["rows"]]$link %<>%
    {clean up this data}
  data <- read.table(temp[["data"]][["rows"]]$link,
                     header = TRUE,
                     sep = ",")
}
What do I need to do to make the for loop work? If my question is not clear, please ask me to clarify it.
I usually find lapply more convenient than a for loop, although you can easily convert this to a for loop if needed.
URLs <- c('www.foo1.com', 'www.foo2.com', 'www.foo3.com')

list_data <- lapply(URLs, function(x) {
  temp <- jsonlite::fromJSON(x)
  temp[["data"]][["rows"]]$link %<>% {clean up this data}
  read.table(temp[["data"]][["rows"]]$link, header = TRUE, sep = ",")
})

list_data
Thanks to @Ronak Shah. The R community strongly favors non-for-loop solutions, and the way to get my desired result is lapply.
Below is non-running code in mnemonic form:
URLs <- c('www.foo1.com', 'www.foo2.com', 'www.foo3.com')

lapply(URLs, function(x) {
  temp <- jsonlite::fromJSON(x)
  x <- temp[["data"]][["rows"]]$link %<>% {clean up this data}
  y <- read.table(temp[["data"]][["rows"]]$link, header = TRUE, sep = ",")
  return(list(x, y))
})
And this is a running example.
x <- list(alpha = 1:10,
          beta = exp(-3:3),
          logic = c(TRUE, FALSE, FALSE, TRUE))

lapply(x, function(x) {
  temp <- sum(x) / 2
  temp2 <- list(x, temp)
  return(temp2)
})
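One small addition: lapply preserves the names of its input, so if you name the URL vector you get a list keyed by source. This is a sketch only, since it uses the same placeholder URLs and {clean up this data} step as above:

URLs <- c(John = "www.foo1.com", Jane = "www.foo2.com", Joe = "www.foo3.com")
list_data <- lapply(URLs, function(x) {
  temp <- jsonlite::fromJSON(x)
  temp[["data"]][["rows"]]$link %<>% {clean up this data}
  read.table(temp[["data"]][["rows"]]$link, header = TRUE, sep = ",")
})
list_data$John  # the cleaned table for John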
I have a loop which downloads data from URLs. Now I would like the information collected so far to be written to disk every x iterations.
As such I have the following code:
baseurl <- "http://zoeken.kvk.nl/Address.ashx?site=handelsregister&partialfields=&q=010"
pages3 <- list()

for (i in 1:99999) {
  if (i < 10) {
    mydata <- RJSONIO::fromJSON(paste0(baseurl, "00000", i), flatten = TRUE)
  }
  if (i < 100 & i >= 10) {
    mydata <- RJSONIO::fromJSON(paste0(baseurl, "0000", i), flatten = TRUE)
  }
  if (i < 1000 & i >= 100) {
    mydata <- RJSONIO::fromJSON(paste0(baseurl, "000", i), flatten = TRUE)
  }
  if (i < 10000 & i >= 1000) {
    mydata <- RJSONIO::fromJSON(paste0(baseurl, "00", i), flatten = TRUE)
  }
  if (i < 100000 & i >= 10000) {
    mydata <- RJSONIO::fromJSON(paste0(baseurl, "0", i), flatten = TRUE)
  }
  if (i < 1000000 & i >= 100000) {
    mydata <- RJSONIO::fromJSON(paste0(baseurl, i), flatten = TRUE)
  }
  mydata <- RJSONIO::fromJSON(paste0(baseurl, i), flatten = TRUE)
  pages3[[i]] <- mydata$resultatenHR
  options(timeout = 4000000)
  if (i %% 100 == 0) { Sys.sleep(5) }
  if (i %% 1000 == 0) {
    final_df <- do.call(rbind, pages3)
    final <- Reduce(function(x, y) merge(x, y, all = TRUE), final_df)
    mytime <- format(Sys.time(), "%b_%d_%H_%M_%S_%Y")
    myfile <- file.path(R(), paste0(mytime, "_", i, ".csv"))
    write.csv2(final, file = myfile, sep = "", row.names = FALSE, col.names = FALSE,
               quote = FALSE, append = FALSE)
  }
}
However, nothing gets saved in the meantime. Where did I go wrong in the code? Thank you for your insights.
I think that your problem is probably in:
myfile <- file.path(R(), paste0(mytime, "_", i, ".csv"))
because R parses R() as a call to a function that does not exist:
Error in R() : could not find function "R"
You can change it to getwd() if you want (don't forget to set the working directory first with setwd()), or specify a different path.
In addition, here: write.csv2(final, file = myfile, sep = "", row.names = FALSE, col.names = FALSE, quote = FALSE, append = FALSE), you forgot to wrap the file name in paste(), and for convenience you can delete the default arguments you used:
write.csv2(final, file = paste(myfile, sep = ""))
Edit:
This is probably not the most efficient way, but it should do the trick.
The main issue is that you index the pages3 list object / csv file by the url index. If you create a separate index for pages3, you can reset it each time you write out a file.
setwd("Your working directory path")

baseurl <- "http://zoeken.kvk.nl/Address.ashx?site=handelsregister&partialfields=&q=010"
pages3 <- list()

# Counter for the url loop
i <- 1
# Counter for the appended csv file / list object pages3
k <- 1

for (i in 1:99999) {
  # Read JSON file by i index
  mydata <- RJSONIO::fromJSON(paste0(baseurl, i), flatten = TRUE)
  # Append to the pages3 list object by k index
  pages3[[k]] <- mydata$resultatenHR
  # Increase the k counter
  k <- k + 1
  options(timeout = 4000000)
  if (i %% 100 == 0) { Sys.sleep(5) }
  if (i %% 1000 == 0) {
    final_df <- do.call(rbind, pages3)
    final <- Reduce(function(x, y) merge(x, y, all = TRUE), final_df)
    mytime <- format(Sys.time(), "%b_%d_%H_%M_%S_%Y")
    myfile <- file.path(getwd(), paste0(mytime, "_", i, ".csv"))
    write.csv2(final, file = paste(myfile, sep = ""))
    # Reset the pages3 list object
    pages3 <- NULL
    # Reset the k index counter
    k <- 1
  }
}
However, depending on your computer and the size of the files you are importing, it might be more efficient to save everything once and split it into different csv files after you have finished importing all the urls.
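As an aside, the cascade of if blocks in the original loop that zero-pads the url index can be collapsed with sprintf(); a small sketch, keeping the original six-digit padding:

baseurl <- "http://zoeken.kvk.nl/Address.ashx?site=handelsregister&partialfields=&q=010"

# sprintf("%06d", i) left-pads the index with zeros to six digits,
# replacing all six if-branches at once
i <- 123
url <- paste0(baseurl, sprintf("%06d", i))
url  # "...&q=010000123"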
All,
I'm working from Matthew Jockers's code in his book Text Analysis with R for Students of Literature.
In it he provides code to pull all <p> tags from XML documents, chop that content into 1000-word chunks, and apply a bunch of data-massaging tricks. Once that's done, he inserts that chunking function in a loop that produces a data matrix that is ready to be used in mallet. Please see the code below.
My question is: how do I do the same thing with .txt files? Obviously, text files do not have attributes like <p> to work from. I'm not an experienced programmer, so go easy on me please!
chunk.size <- 1000 # number of words per chunk

makeFlexTextChunks <- function(doc.object, chunk.size = 1000, percentage = TRUE) {
  paras <- getNodeSet(doc.object,
                      "/d:TEI/d:text/d:body//d:p",
                      c(d = "http://www.tei-c.org/ns/1.0"))
  words <- paste(sapply(paras, xmlValue), collapse = " ")
  words.lower <- tolower(words)
  words.lower <- gsub("[^[:alnum:][:space:]']", " ", words.lower)
  words.l <- strsplit(words.lower, "\\s+")
  word.v <- unlist(words.l)
  x <- seq_along(word.v)
  if (percentage) {
    max.length <- length(word.v) / chunk.size
    chunks.l <- split(word.v, ceiling(x / max.length))
  } else {
    chunks.l <- split(word.v, ceiling(x / chunk.size))
    # deal with a small chunk at the end: if the last chunk is at most
    # half a chunk long, merge it into the previous one
    if (length(chunks.l[[length(chunks.l)]]) <= chunk.size / 2) {
      chunks.l[[length(chunks.l) - 1]] <-
        c(chunks.l[[length(chunks.l) - 1]],
          chunks.l[[length(chunks.l)]])
      chunks.l[[length(chunks.l)]] <- NULL
    }
  }
  chunks.l <- lapply(chunks.l, paste, collapse = " ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}

topic.m <- NULL
for (i in 1:length(files.v)) {
  doc.object <- xmlTreeParse(file.path(input.dir, files.v[i]),
                             useInternalNodes = TRUE)
  chunk.m <- makeFlexTextChunks(doc.object, chunk.size,
                                percentage = FALSE)
  textname <- gsub("\\..*", "", files.v[i])
  segments.m <- cbind(paste(textname,
                            segment = 1:nrow(chunk.m), sep = "_"), chunk.m)
  topic.m <- rbind(topic.m, segments.m)
}
Thank you everybody for your help. I think I found my answer after much trial and error! The key was to read the txt files with scan(paste(input.dir, files.v[i], sep="/")) in the loop rather than in the function. Please see my code here:
input.dir <- "data/plainText"
files.v <- dir(input.dir, ".*txt")
chunk.size <- 100 # number of words per chunk

makeFlexTextChunks <- function(doc.object, chunk.size = 100, percentage = TRUE) {
  words.lower <- tolower(paste(doc.object, collapse = " "))
  words.lower <- gsub("[^[:alnum:][:space:]']", " ", words.lower)
  words.l <- strsplit(words.lower, "\\s+")
  word.v <- unlist(words.l)
  x <- seq_along(word.v)
  if (percentage) {
    max.length <- length(word.v) / chunk.size
    chunks.l <- split(word.v, ceiling(x / max.length))
  } else {
    chunks.l <- split(word.v, ceiling(x / chunk.size))
    # deal with a small chunk at the end: if the last chunk is at most
    # half a chunk long, merge it into the previous one
    if (length(chunks.l[[length(chunks.l)]]) <= chunk.size / 2) {
      chunks.l[[length(chunks.l) - 1]] <-
        c(chunks.l[[length(chunks.l) - 1]],
          chunks.l[[length(chunks.l)]])
      chunks.l[[length(chunks.l)]] <- NULL
    }
  }
  chunks.l <- lapply(chunks.l, paste, collapse = " ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}

topic.m <- NULL
for (i in 1:length(files.v)) {
  doc.object <- scan(paste(input.dir, files.v[i], sep = "/"), what = "character", sep = "\n")
  chunk.m <- makeFlexTextChunks(doc.object, chunk.size, percentage = FALSE)
  textname <- gsub("\\..*", "", files.v[i])
  segments.m <- cbind(paste(textname, segment = 1:nrow(chunk.m), sep = "_"), chunk.m)
  topic.m <- rbind(topic.m, segments.m)
}
Maybe this can point you in the right direction. The following code reads in a txt file and splits the words up into elements of a vector.
library(readr)
library(stringr)
url <- "http://www.gutenberg.org/files/98/98-0.txt"
mystring <- read_file(url)
res <- str_split(mystring, "\\s+")
Then you can split it into chunks of 1000 words and do your magic?
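To sketch that last step, the split(..., ceiling(...)) idiom from the book code above can cut the vector into 1000-word chunks:

library(readr)
library(stringr)

url <- "http://www.gutenberg.org/files/98/98-0.txt"
word.v <- str_split(read_file(url), "\\s+")[[1]]

# assign each word to a 1000-word bucket, then collapse each bucket to a string
chunk.size <- 1000
chunks.l <- split(word.v, ceiling(seq_along(word.v) / chunk.size))
chunks <- vapply(chunks.l, paste, character(1), collapse = " ")
length(chunks)  # number of 1000-word chunks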
I asked a very similar question before, but the answers I got don't seem to apply in this case. The aim of my code is primarily to take a file, manipulate it, and then save the manipulated file over the old one. Unfortunately there are a lot of files, so I have incorporated a for loop, but it is stopping after just one run through the loop. I think my return function is in the right place, and my for statement worked in a previous, slightly different version of the script.
Here is my code:
AddLatLon <- function(num, condition) {
  # Set working directory
  # num is the numbers of the files needing extraction, e.g. (3:5), c(2,7)
  connect <- file("./EOBS DATA/sources.txt", "r")
  locdata <- readLines(connect)
  close(connect)
  info <- locdata[24:length(locdata)] # removes non-data before the data begins
  Infodata <- read.table(text = info, sep = ',', fill = TRUE,
                         colClasses = 'character', header = TRUE)
  InfoTable <- read.csv("./EOBS DATA/sources.csv")
  InfoTable$STAID <- as.numeric(InfoTable$STAID)
  for (i in c(num)) {
    filename <- paste("./EOBS DATA/", condition, "_csv_data/", condition, i, ".csv", sep = "")
    #if(i < 10){
    #  filename <- paste("./EOBS DATA/ECA_blend_", condition, "/", CONDITION, "_STAID00000", i, ".txt", sep = "")
    #}
    #if(i >= 10 & i < 100){
    #  filename <- paste("./EOBS DATA/ECA_blend_", condition, "/", CONDITION, "_STAID0000", i, ".txt", sep = "")
    #}
    #if(i >= 100 & i < 1000){
    #  filename <- paste("./EOBS DATA/ECA_blend_", condition, "/", CONDITION, "_STAID000", i, ".txt", sep = "")
    #}
    #if(i >= 1000){
    #  filename <- paste("./EOBS DATA/ECA_blend_", condition, "/", CONDITION, "_STAID00", i, ".txt", sep = "")
    #}
    if (file.exists(filename) == FALSE) {
      next
    }
    #con <- file(filename, "r")
    #data <- readLines(con)
    #close(con)
    #q <- data[21:length(data)] # removes non-data before the data begins
    #Impactdata <- read.table(text = q, sep = ',', fill = TRUE, colClasses = 'character', header = TRUE)
    x <- read.csv(filename)
    point <- match(i, InfoTable$STAID)
    Lat <- InfoTable[point, 5]
    Lon <- InfoTable[point, 6]
    Lat <- as.character(Lat)
    Lon <- as.character(Lon)
    x$Lat <- Lat
    x$Lon <- Lon
    x$X <- NULL
    x$DATE <- as.Date(as.character(x$DATE), format = '%Y%m%d')
    Savename <- paste("./EOBS DATA/", condition, "_csv_data/", condition, i, ".csv", sep = "")
    if (condition == "rr") {
      condition <- "Precipitation"
    }
    if (condition == "tn") {
      condition <- "Minimum Temperature"
    }
    if (condition == "tx") {
      condition <- "Maximum Temperature"
    }
    names(x) <- c("Station_ID", "Source_ID", "Date(yyyy-mm-dd)", condition,
                  "Quality_code(0='valid'; 1='suspect')", "Latitude", "Longitude")
    write.csv(x, Savename)
  }
  return(head(x))
}
num is not defined, but from the name I'm pretty sure you want to be looping over 1:num, not c(num). So just replace:
for(i in c(num)){
with
for(i in 1:num){
or
for(i in seq_len(num)){
Why seq_len? It will do the right thing if num is zero (no looping) or negative (throw an error).
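A quick illustration of why that matters:

n <- 0
1:n         # [1] 1 0  -- counts down, so the loop body runs twice
seq_len(n)  # integer(0) -- the loop body runs zero times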
I am currently taking in multiple command line parameters within my R script, such as:
args <- commandArgs(TRUE)
arg1 <- as.numeric(args[1])
arg2 <- as.numeric(args[2])
I want to use these args within my paste string, as shown below. My problem is that I can only figure out how to use one of the arguments, not both (arg1, arg2). Instead of the "xxx" that I show below in my WHERE clause (i.e. "columnname1 in (xxx)"), how do I use the "arg1" command line parameter in place of "xxx"? I've tried a number of different ways and for some reason I can't figure it out. Should I concatenate two different strings to accomplish this, or is there an easier way?
SQL <- paste(
  "SELECT
  *
  FROM
  table
  WHERE
  columnname1 in (xxx)
  and
  columnname2 in ('", arg2, "')", sep = "")
Thanks for your help!
Try:
SQL <- paste(
  "SELECT
  *
  FROM
  table
  WHERE
  columnname1 in ('", arg1, "')
  and
  columnname2 in ('", arg2, "')", sep = "", collapse = "")
You could also use the following helper function that allows named substitutions:
SQL <- strsubst(
  "SELECT * FROM table WHERE
  columnname1 in ('$(arg1)') and
  columnname2 in ('$(arg2)')",
  list(arg1 = arg1, arg2 = arg2)
)
where strsubst is defined as follows:
strsubst <- function(template, map, verbose = getOption("verbose")) {
  pat <- "\\$\\([^\\)]+\\)"  # matches $(name) placeholders
  res <- template
  map <- unlist(map)
  m <- gregexpr(pat, template)
  idx <- which(sapply(m, function(x) x[[1]] != -1))
  for (i in idx) {
    line <- template[[i]]
    if (verbose)
      cat("input: |", template[[i]], "|\n")
    starts <- m[[i]]
    ml <- attr(m[[i]], "match.length")
    sym <- substring(line, starts + 2, starts + ml - 2)  # placeholder names
    if (verbose)
      cat("sym: |", sym, "|\n")
    repl <- map[sym]
    idx1 <- is.na(repl)
    if (sum(idx1) > 0) {
      warning("Don't know how to replace '",
              paste(sym[idx1], collapse = "', '"), "'.")
      repl[idx1] <- paste("$(", sym[idx1], ")", sep = "")
    }
    # interleave the unmatched pieces of the line with the replacements
    norepl <- substring(line, c(1, starts + ml), c(starts - 1, nchar(line)))
    res[[i]] <- paste(norepl, c(repl, ""), sep = "", collapse = "")
    if (verbose)
      cat("output: |", res[[i]], "|\n")
  }
  return(res)
}
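A quick check of strsubst with the mapping above (the placeholder values are mine):

arg1 <- 42
arg2 <- "abc"
cat(strsubst(
  "SELECT * FROM table WHERE columnname1 in ('$(arg1)') and columnname2 in ('$(arg2)')",
  list(arg1 = arg1, arg2 = arg2)))
# SELECT * FROM table WHERE columnname1 in ('42') and columnname2 in ('abc')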