How to check if subset is empty in R

I have a data set of weight against time (t). I need to identify outliers in weight for every time (t), after which I need to send a notification email.
I'm using boxplot()$out to identify the outliers. It seems to work, but I'm not sure about the following:
1. Is this the correct way to use the boxplot?
2. I can't tell whether the boxplot has no outliers or whether it is empty (or maybe I'm using the wrong technique).
3. Possibly the subset itself is empty (which could be the root cause).
For now, I just need to trap the empty subset and check whether the out variable is empty or not.
Below is my R script:
# i am a comment, and the compiler doesn't care about me
# load our libraries
library(ggplot2)
library(mailR)
# some variables to be used later
from <- ""
to <- ""
getwd()
setwd("C:\\Temp\\rwork")
# read the data file into a data (d) variable
d <- read.csv("testdata.csv", header = TRUE) # file
# get the current time (t)
t <- format(Sys.time(), "%H")
# create a subset of d based on t
sbset <- subset(d, Time == t)
# identify if an outlier exists, then send an email report
out <- boxplot(sbset$weight)$out
if (length(out) != 0) {
  # create a boxplot of the subset
  boxplot(sbset$weight)
  subject <- paste("Attention: An Outlier is detected for Scheduled Job Run on Hour ", t)
  message <- toString(out) # sort(out)
} else {
  subject <- paste("No Outlier Identified")
  message <- ""
}
email <- send.mail(from = from,
                   to = to,
                   subject = subject,
                   body = message,
                   html = TRUE,
                   smtp = list(host.name = "smtp.gmail.com",
                               port = 465,
                               user.name = from,
                               passwd = "", # password of sender email
                               ssl = TRUE),
                   authenticate = TRUE,
                   send = TRUE)
DATA
weight,Time,Chick,x
42,0,1,1
51,2,1,1
59,4,1,1
64,6,1,1
76,8,1,1
93,10,1,1
106,12,1,1
125,14,1,1
149,16,1,1
171,18,1,1
199,20,1,1
205,21,1,1
40,0,2,1
49,2,2,1
58,4,2,1
72,6,2,1
84,8,2,1
103,10,2,1
122,12,2,1
138,14,2,1
162,16,2,1
187,18,2,1
209,20,2,1
215,21,2,1
43,0,3,1
39,2,3,1
55,4,3,1
67,6,3,1
84,8,3,1
99,10,3,1
115,12,3,1
138,14,3,1
163,16,3,1
187,18,3,1
198,20,3,1
202,21,3,1
42,0,4,1
49,2,4,1
56,4,4,1
67,6,4,1
74,8,4,1
87,10,4,1
102,12,4,1
108,14,4,1
136,16,4,1
154,18,4,1
160,20,4,1
157,21,4,1
41,0,5,1
42,2,5,1
48,4,5,1
60,6,5,1
79,8,5,1
106,10,5,1
141,12,5,1
164,14,5,1
197,16,5,1
199,18,5,1
220,20,5,1
223,21,5,1
41,0,6,1
49,2,6,1
59,4,6,1
74,6,6,1
97,8,6,1
124,10,6,1
141,12,6,1
148,14,6,1
155,16,6,1
160,18,6,1
160,20,6,1
157,21,6,1
41,0,7,1
49,2,7,1
57,4,7,1
71,6,7,1
89,8,7,1
112,10,7,1
146,12,7,1
174,14,7,1
218,16,7,1
250,18,7,1
288,20,7,1
305,21,7,1
42,0,8,1
50,2,8,1
61,4,8,1
71,6,8,1
84,8,8,1
93,10,8,1
110,12,8,1
116,14,8,1
126,16,8,1
134,18,8,1
125,20,8,1
42,0,9,1
51,2,9,1
59,4,9,1
68,6,9,1
85,8,9,1
96,10,9,1
90,12,9,1
92,14,9,1
93,16,9,1
100,18,9,1
100,20,9,1
98,21,9,1
41,0,10,1
44,2,10,1
52,4,10,1
63,6,10,1
74,8,10,1
81,10,10,1
89,12,10,1
96,14,10,1
101,16,10,1
112,18,10,1
120,20,10,1
124,21,10,1
43,0,11,1
51,2,11,1
63,4,11,1
84,6,11,1
112,8,11,1
139,10,11,1
168,12,11,1
177,14,11,1
182,16,11,1
184,18,11,1
181,20,11,1
175,21,11,1
41,0,12,1
49,2,12,1
56,4,12,1
62,6,12,1
72,8,12,1
88,10,12,1
119,12,12,1
135,14,12,1
162,16,12,1
185,18,12,1
195,20,12,1
205,21,12,1
41,0,13,1
48,2,13,1
53,4,13,1
60,6,13,1
65,8,13,1
67,10,13,1
71,12,13,1
70,14,13,1
71,16,13,1
81,18,13,1
91,20,13,1
96,21,13,1
41,0,14,1
49,2,14,1
62,4,14,1
79,6,14,1
101,8,14,1
128,10,14,1
164,12,14,1
192,14,14,1
227,16,14,1
248,18,14,1
259,20,14,1
266,21,14,1
41,0,15,1
49,2,15,1
56,4,15,1
64,6,15,1
68,8,15,1
68,10,15,1
67,12,15,1
68,14,15,1
41,0,16,1
45,2,16,1
49,4,16,1
51,6,16,1
57,8,16,1
51,10,16,1
54,12,16,1
42,0,17,1
51,2,17,1
61,4,17,1
72,6,17,1
83,8,17,1
89,10,17,1
98,12,17,1
103,14,17,1
113,16,17,1
123,18,17,1
133,20,17,1
142,21,17,1
39,0,18,1
35,2,18,1
43,0,19,1
48,2,19,1
55,4,19,1
62,6,19,1
65,8,19,1
71,10,19,1
82,12,19,1
88,14,19,1
106,16,19,1
120,18,19,1
144,20,19,1
157,21,19,1
41,0,20,1
47,2,20,1
54,4,20,1
58,6,20,1
65,8,20,1
73,10,20,1
77,12,20,1
89,14,20,1
98,16,20,1
107,18,20,1
115,20,20,1
117,21,20,1
40,0,21,2
50,2,21,2
62,4,21,2
86,6,21,2
125,8,21,2
163,10,21,2
217,12,21,2
240,14,21,2
275,16,21,2
307,18,21,2
318,20,21,2
331,21,21,2
41,0,22,2
55,2,22,2
64,4,22,2
77,6,22,2
90,8,22,2
95,10,22,2
108,12,22,2
111,14,22,2
131,16,22,2
148,18,22,2
164,20,22,2
167,21,22,2
43,0,23,2
52,2,23,2
61,4,23,2
73,6,23,2
90,8,23,2

Your first use of boxplot unnecessarily creates a plot; you can use
out <- boxplot.stats(sbset$weight)$out
for a little efficiency.
You are interested in the presence of rows, but length(sbset) returns the number of columns of a data frame. I suggest nrow or NROW instead.
if (NROW(out) > 0) {
  boxplot(sbset$weight)
  # ...
} else {
  # ...
}

Related

MturkR, R, automatically posting microbatches

I am new to MTurk but have some fluency with R. I am using the MTurkR package for the first time, and I am trying to create "micro-batches" that post on MTurk over time. The code I am using can be found below (the XXXX parts are obviously filled with the correct values). I don't get any error messages, the code runs, and it posts the HIT correctly both in the Sandbox and on the live site. However, the posted HITs do not show up in the Sandbox requester account or in the real requester account, which means, as far as I understand, that I can't evaluate the workers who submit a completion code before they are paid automatically.
Could anyone point out where the error is and how I could review the HITs?
Thanks.
K
##### Notes:
# 1) Change sandbox from TRUE to FALSE to run live (make sure to test in sandbox first!!)
##### Step 1: Load library, set parameters
#### Load MTurkR library
library(MTurkR)
#### HIT Layout ID
# Layout ID for the choice task
# my_hitlayoutid = "XXXX"
# Layout ID for the choice task in the Sandbox
my_hitlayoutid = "XXXX"
#### Set MTurk credentials
Sys.setenv(
AWS_ACCESS_KEY_ID = "XXXX",
AWS_SECRET_ACCESS_KEY = "XXXX"
)
#### HIT parameters
## Run in sandbox?
sandbox_val <- "FALSE"
## Set the name of your project here (used to retrieve HITs later)
myannotation <- "myannotation"
## Enter other HIT aspects
newhittype <- RegisterHITType(
title = "hope trial",
description = "Description",
reward = "2.5",
duration = seconds(hours = 1),
keywords = "survey, demographics, neighborhoods, employment",
sandbox = sandbox_val
)
##### Step 2: Define functions
## Define a function that will create a HIT using information above
createhit <- function() {
  CreateHIT(
    hit.type = newhittype$HITTypeId,
    assignments = 2,
    expiration = seconds(days = 30),
    annotation = myannotation,
    verbose = TRUE,
    sandbox = sandbox_val,
    hitlayoutid = my_hitlayoutid
  )
}
## Define a function that will expire all running HITs
## This keeps HITs from "piling up" on a slow day
## It ensures that A) HIT appears at the top of the list, B) workers won't accidentally accept HIT twice
# expirehits <- function() {
# ExpireHIT(
# annotation = myannotation,
# sandbox = sandbox_val
# )
#}
##### Step 3: Execute a loop that runs createhit/expirehit functions every hour, and it will log the output to a file
## Define number of times to post the HIT (totalruns)
totalruns <- 2
counter <- 0
## Define log file (change the location as appropriate)
logfile <- file("/Users/kinga.makovi/Dropbox/Bias_Experiment/MTurk/logfile.txt", open="a")
sink(logfile, append=TRUE, type="message")
## Run loop (note: interval is hourly, but can be changed in Sys.sleep)
repeat {
  message(Sys.time())
  createhit()
  Sys.sleep(10)
  #expirehits()
  counter <- counter + 1
  if (counter == totalruns) {
    break
  }
}
## To stop the loop before it finishes, click the "STOP" button
## To stop logging, run sink()
You can't see HITs that are created via the API (through MTurkR or otherwise) in the requester website. It's a "feature". You'll have to access the HITs through MTurkR (e.g., SearchHITs() and GetHIT()).
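As a rough sketch (argument names and the structure of the returned object can vary between MTurkR versions, so treat this as an outline rather than a recipe), retrieving the HITs created above might look like:
library(MTurkR)
# list the HITs belonging to this requester account
hits <- SearchHITs(sandbox = sandbox_val)
str(hits)   # inspect the returned HIT listing
# inspect a single HIT by its HITId (placeholder ID below)
one_hit <- GetHIT(hit = "YOUR_HIT_ID", sandbox = sandbox_val)
From there you can review submissions through MTurkR rather than the requester website.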

Better way to optimize my code for getting NOAA climate data

So I've been working on grabbing climate data (specifically temperature and precipitation) from NOAA's network of GHCN weather stations. I've managed to get a list of the stations pertinent to my area (~200) and have built a loop that grabs a given climate variable for every station on that list for every day between a specified min and max date. Ultimately I need about 10 years' worth of data. However, my simple loop is taking forever to get this data, and I was wondering if there's a better way to optimize it. Also, I really want to access monthly rather than daily data, but rnoaa doesn't seem to have an option for GHCN monthly data, as the only available function is ghcnd_search(). If anyone also knows how to mine monthly rather than daily data, that would be appreciated.
Station list:
df<-c("US1FLAL0048", "US1FLBK0003", "US1FLBV0002", "US1FLBV0006",
"US1FLBV0023", "US1FLBV0040", "US1FLBW0099", "US1FLCT0012", "US1FLDV0051",
"US1FLFR0006", "US1FLHL0003", "US1FLHN0009", "US1FLLB0001", "US1FLLE0005",
"US1FLLK0012", "US1FLLN0004", "US1FLLN0018", "US1FLMN0013", "US1FLMR0012",
"US1FLMR0033", "US1FLOK0017", "US1FLOR0028", "US1FLPS0002", "US1FLPS0018",
"US1FLPT0007", "US1FLSJ0012", "US1FLSM0008", "US1FLSS0044", "US1FLST0014",
"US1FLSW0008", "US1FLVL0035", "US1FLWK0001", "USC00080228", "USC00080236",
"USC00080369", "USC00080414", "USC00080478", "USC00080598", "USC00080737",
"USC00080945", "USC00080992", "USC00081163", "USC00081276", "USC00081306",
"USC00081544", "USC00081641", "USC00081651", "USC00081978", "USC00082008",
"USC00082046", "USC00082150", "USC00082229", "USC00082288", "USC00082298",
"USC00082391", "USC00082418", "USC00082441", "USC00082850", "USC00082915",
"USC00082944", "USC00083020", "USC00083153", "USC00083163", "USC00083168",
"USC00083207", "USC00083209", "USC00083470", "USC00083874", "USC00083909",
"USC00083956", "USC00083986", "USC00084050", "USC00084095", "USC00084210",
"USC00084289", "USC00084320", "USC00084366", "USC00084394", "USC00084412",
"USC00084461", "USC00084625", "USC00084662", "USC00084731", "USC00084802",
"USC00085076", "USC00085099", "USC00085184", "USC00085275", "USC00085359",
"USC00085377", "USC00085539", "USC00085612", "USC00085667", "USC00085879",
"USC00085895", "USC00085973", "USC00086065", "USC00086078", "USC00086129",
"USC00086240", "USC00086315", "USC00086406", "USC00086414", "USC00086618",
"USC00086657", "USC00086764", "USC00086767", "USC00086828", "USC00086842",
"USC00086999", "USC00087020", "USC00087025", "USC00087205", "USC00087228",
"USC00087261", "USC00087304", "USC00087397", "USC00087429", "USC00087760",
"USC00087826", "USC00087851", "USC00087869", "USC00087886", "USC00087982",
"USC00088368", "USC00088529", "USC00088620", "USC00088756", "USC00088782",
"USC00088824", "USC00088942", "USC00089120", "USC00089176", "USC00089219",
"USC00089401", "USC00089430", "USC00089566", "USC00089640", "USC00089795",
"USR0000FBLO", "USR0000FCAC", "USR0000FCEN", "USR0000FCHE", "USR0000FLSU",
"USR0000FMER", "USR0000FMIL", "USR0000FNAV", "USR0000FOAS", "USR0000FOCH",
"USR0000FOLU", "USR0000FRAC", "USR0000FSAN", "USR0000FSTM", "USR0000FSUM",
"USR0000FWIL", "USW00003818", "USW00003853", "USW00012812", "USW00012815",
"USW00012816", "USW00012818", "USW00012819", "USW00012832", "USW00012833",
"USW00012834", "USW00012835", "USW00012836", "USW00012838", "USW00012839",
"USW00012841", "USW00012842", "USW00012843", "USW00012844", "USW00012849",
"USW00012850", "USW00012854", "USW00012871", "USW00012873", "USW00012876",
"USW00012882", "USW00012885", "USW00012888", "USW00012894", "USW00012895",
"USW00012896", "USW00012897", "USW00013884", "USW00013889", "USW00013899",
"USW00053847", "USW00053853", "USW00053860", "USW00092805", "USW00092806",
"USW00092809", "USW00092811", "USW00092821", "USW00093805", "USW00093837",
"USW00093841")
Code:
library(rnoaa)
options(noaakey = "your api key")
data <- matrix(, nrow = 0, ncol = 0) # create an empty matrix
for (i in 1:length(df)) {
  a <- ghcnd_search(stationid = df[i], var = 'TMAX',
                    date_min = '2010-1-30', date_max = '2015-12-31')
  data <- rbind(data, a$tmax)
}
Assuming the station IDs are stored in a vector called dat (the same IDs as in the df vector above), we can use functions from the purrr package to download the data and create a single data frame.
# Load packages
library(rnoaa)
library(purrr)
# Download the data and create a data frame.
dat_df <- map(dat, ghcnd_search,
var='TMAX', date_min = '2010-1-30', date_max = '2015-12-31') %>%
map_dfr("tmax")

R Language: getCommentReplies() error:

Despite reading the existing answers, I still don't know how to fix this problem.
In the first phase I am trying to extract the comments for each post, which works, and then in the second phase, for each comment, extract the corresponding replies to that comment (i.e. in my program when i = 1 [1st post] and j = 1 [1st comment]).
However, as soon as getCommentReplies() tries to extract the very first reply for the very first comment of the first post, it throws the following error:
Error in data.frame(from_id = json$from$id, from_name = json$from$name, :
arguments imply differing number of rows: 0, 1
my program:
load ("fb_oauth")
fb_page_no_nullz<-getPage(page="gtbank", token=fb_oauth,n=130, since= '2018/3/10', until= '2018/3/12',feed=TRUE,api = 'v2.11') #Extract THE LATEST n=7 FCMB posts excluding Null rows from FCMB page# into variable/vector fb_page .
no_of_rows=na.omit(nrow(fb_page_no_nullz)) #Count the number of rows without NULLS and store in var no_of_rows
i=1
all_comments<-NULL
while (i<=no_of_rows)
{
postt <- getPost(post=fb_page_no_nullz$id[i], n=200, token=fb_oauth, comments = TRUE, likes=FALSE, api= "v2.11" ) #Extract N comments for each post
no_of_row_c=na.omit(nrow(postt$comments))
if(no_of_row_c!=0) #If their are no comments for each post then pick the next post.
{
comment_details<-postt$comments[,1:7]
comment_details$from_id<-comment_details$from_name<-NULL # This line removes the columns from_id AND from_name from the v data Frame
j =1
while (j<=no_of_row_c)
{
repl<-NULL
repl<-getCommentReplies(comment_details$id[i],token=fb_oauth,n=200,replies=TRUE,likes=FALSE,n.replies=100)
j=j+1
}
}
#all_comments$from_id<-all_comments$from_name<-NULL # This line removes the columns from_id AND from_name from the v data Frame
all_comments<-rbind(all_comments,comment_details) # Cummutatively append all comments for all posts into the data frame all_comments
i=i+1
}
#allPC<-merge(all_comments,fb_page_no_nullz, by.x= substr(c("id"),1,14), by.y=substr(c("id"),14,30),all.x = TRUE)
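Not a diagnosis of the underlying error, but one way to keep the loop running while you investigate is to trap the failing call with tryCatch, so one problematic comment does not abort the whole extraction. A sketch using the same variables as above (note that inside the j loop you probably want comment_details$id[j] rather than comment_details$id[i]):
repl <- tryCatch(
  getCommentReplies(comment_details$id[j], token = fb_oauth, n = 200,
                    replies = TRUE, likes = FALSE, n.replies = 100),
  error = function(e) {
    # log which comment failed and carry on with the next one
    warning("getCommentReplies failed for comment ", comment_details$id[j], ": ", conditionMessage(e))
    NULL
  }
)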

How can I cut large csv files using any R packages like ff or data.table?

I want to cut large CSV files (file size greater than RAM size) and use them, or save each piece to disk for later use. Which R package is best for doing this with large files?
I haven't tried it, but using the skip and nrows parameters of read.table or read.csv is worth a try. From ?read.table:
skip integer: the number of lines of the data file to skip before
beginning to read data.
nrows integer: the maximum number of rows to read in. Negative and
other invalid values are ignored.
To avoid some troublesome issues at the end of the file you need to do some error handling; in other words, I don't know what happens when the skip value is greater than the number of rows in your big CSV.
P.S. I also don't know whether header=TRUE affects skip or not; you will have to check that as well.
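For illustration only, a minimal sketch of reading a large CSV in chunks this way (the file name big.csv and the chunk size are placeholders; read.csv errors once no lines are left, hence the tryCatch):
chunk_size <- 100000
header_names <- names(read.csv("big.csv", nrows = 1))  # column names from the header row
skip_rows <- 1                                         # skip the header line itself
repeat {
  chunk <- tryCatch(
    read.csv("big.csv", skip = skip_rows, nrows = chunk_size,
             header = FALSE, col.names = header_names),
    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  # process or save the chunk here, e.g. write.csv(chunk, ...)
  skip_rows <- skip_rows + nrow(chunk)
  if (nrow(chunk) < chunk_size) break
}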
The answer given by @berkorbay is OK, and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent read after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short Perl script which fragments the original file into chunks that you can read one after the other. It is much faster. I include the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
# Fragment a .csv file into chunks, repeating the header in each chunk.
system("cls");
print("Fragment .csv file keeping header in each chunk\n");
print("\nEnter input file name = ");
$entrada = <STDIN>;          # input file name
print("\nEnter maximum number of lines in each fragment = ");
$nlineas = <STDIN>;          # maximum lines per fragment
print("\nEnter output file name stem = ");
$salida = <STDIN>;           # output file name stem
chop($salida);
open(IN, $entrada) || die "Cannot open input file: $!\n";
$cabecera = <IN>;            # header line
$leidas = 0;                 # lines written to the current fragment
$fragmento = 1;              # fragment counter
$fichero = $salida . $fragmento;
open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
print OUT $cabecera;
while (<IN>) {
    if ($leidas > $nlineas) {
        close(OUT);
        $fragmento++;
        $fichero = $salida . $fragmento;
        open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
        print OUT $cabecera;
        $leidas = 0;
    }
    $leidas++;
    print OUT $_;
}
close(OUT);
Just save it with whatever name and execute it. The first line might have to be changed if you have Perl in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
You can use read.csv.ffdf from the ff package with specific parameters like these to read a big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once the big file has been read into an ff object, subsetting the ff object into a data frame can be done with:
a[1000:1000000,]
The rest of the code subsets the ff object and saves the resulting data frames:
totalrows <- dim(a)[1]
row.size <- as.integer(object.size(a[1:10000, ])) / 10000  # bytes per row
block.size <- 200000000                                    # block size in bytes (200 MB)
# rows.block is the number of rows per block
rows.block <- ceiling(block.size / row.size)
# nmaps is the number of chunks/maps of the big ff data frame, minus one
nmaps <- floor(totalrows / rows.block)
for (i in 0:nmaps) {
  if (i == nmaps) {
    df <- a[(i * rows.block + 1):totalrows, ]
  } else {
    df <- a[(i * rows.block + 1):((i + 1) * rows.block), ]
  }
  # process df or save it
  write.csv(df, paste0("M", i + 1, ".csv"))
  # remove df
  rm(df)
}
Alternatively, you can first read the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:
# requires the RMySQL and ETLUtils packages
library(RMySQL)
library(ETLUtils)
read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE,
                              drv = MySQL(), dbname = "new", username = "root",
                              host = "localhost", password = "1234") {
  conn <- dbConnect(drv, user = username, password = password, host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command <- paste0("select * from ", name)
  ret <- read.dbi.ffdf(command, dbConnect.args = list(drv = drv, dbname = dbname,
                                                      username = username, password = password))
  return(ret)
}
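A hypothetical call, using the placeholder connection defaults baked into the function above and an example file name:
big_ffdf <- read.csv.sql.ffdf(file = "big.csv", name = "bigtable")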

R: Plot ARC/INFO Generate File

I have an ARC/INFO generate file whose contents look like:
3594 -124.049541 44.429077
-123.381222 44.530192
-123.479913 44.625517
-123.578917 44.720704
-123.678234 44.815755
-123.777866 44.910669
-123.946044 44.885032
-124.114074 44.858987
-124.281949 44.832529
-124.449663 44.805654
-124.516511 44.684660
-124.583091 44.563597
-124.649404 44.442465
-124.715451 44.321261
-124.615376 44.227772
-124.515601 44.134147
-124.416125 44.040385
-124.316948 43.946486
-124.151513 43.973082
-123.985926 43.999247
-123.820193 44.024987
-123.654322 44.050307
-123.586447 44.170362
-123.518307 44.290360
-123.449899 44.410303
-123.381222 44.530192
END
3595 -123.103772 45.009223
-122.427717 45.101578
-122.525757 45.198252
-122.624122 45.294789
-122.722814 45.391191
-122.821833 45.487459
-122.992014 45.464007
-123.162072 45.440175
-123.332002 45.415959
-123.501798 45.391355
-123.571234 45.271264
-123.640389 45.151121
-123.709266 45.030923
-123.777866 44.910669
-123.678234 44.815755
-123.578917 44.720704
-123.479913 44.625517
-123.381222 44.530192
-123.213811 44.554460
-123.046278 44.578334
-122.878629 44.601816
-122.710869 44.624913
-122.640504 44.744148
-122.569859 44.863337
-122.498931 44.982480
-122.427717 45.101578
END
3676 -122.989567 44.147495
-122.323040 44.238368
-122.419523 44.335217
-122.516322 44.431923
-122.613437 44.528488
-122.710869 44.624913
-122.878629 44.601816
-123.046278 44.578334
-123.213811 44.554460
-123.381222 44.530192
-123.449899 44.410303
-123.518307 44.290360
-123.586447 44.170362
-123.654322 44.050307
-123.556277 43.955264
-123.458534 43.860080
-123.361093 43.764751
-123.263953 43.669279
-123.098838 43.693189
-122.933613 43.716694
-122.768285 43.739802
-122.602857 43.762515
-122.533309 43.881546
-122.463492 44.000532
-122.393403 44.119472
-122.323040 44.238368
END
END
My strategy is to read in the file, generating a list of latitude-longitude points and beginning a new unique group id every time I encounter an END. I'll then plot using ggplot and geom_polygon.
Alas, I'm not sure how to efficiently accomplish the reading of the file.
Any thoughts?
Read the Spatial task view on CRAN and then use readOGR from the rgdal package to read the file into an sp class object. You'll need a GDAL/OGR install with ARCGEN format support, which, despite being listed as 'compiled by default', I don't have on my system.
Failing that, open the file as a connection, read each line, build a Polygon, then Polygons and SpatialPolygons.
Here's a fairly sub-optimal but working function:
readUng <- function(f) {
  require(sp)
  stream <- file(f, "r")
  first <- readLines(stream, 1)
  bits <- strsplit(first, " ")[[1]]
  polys <- list(); ids <- NULL
  while (TRUE) {
    id <- bits[1]  # label point is bits[2], bits[3]
    ids <- c(ids, id)
    coords <- NULL
    while (TRUE) {
      xy <- readLines(stream, 1)
      if (xy == "END") {
        break
      }
      coords <- rbind(coords, strsplit(xy, " ")[[1]])
    }
    polys[[length(polys) + 1]] <- Polygons(list(Polygon(matrix(as.numeric(coords[, 2:3]), ncol = 2))), ID = id)
    lines <- readLines(stream, 1)
    if (lines == "END") {
      break
    }
    bits <- strsplit(lines, " ")[[1]]
  }
  return(SpatialPolygons(polys))
}
Now that it's a proper spatial data object, you can also give it a coordinate system (it looks like lat-long to me, so EPSG:4326, but only you know). You could modify all this to produce whatever ggplot wants, but if it's spatial data then you should keep it as a spatial data class, and ggplot should be made capable of dealing with such.
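For example, a sketch of the ggplot route the question mentions (the file name polygons.gen is a placeholder, and the CRS is only the lat-long guess from above):
library(sp)
library(ggplot2)
sp_obj <- readUng("polygons.gen")                        # hypothetical file name
proj4string(sp_obj) <- CRS("+proj=longlat +datum=WGS84") # the lat-long guess
df_poly <- fortify(sp_obj)                               # long/lat/group data frame for geom_polygon
ggplot(df_poly, aes(long, lat, group = group)) +
  geom_polygon(fill = NA, colour = "black") +
  coord_quickmap()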
