Automate large data scraping - R

I am very new to using R and to coding in general, and I would like to ask for any help I could get to solve my issue.
I am trying to download a large dataset from https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day.
I use the GSODR package to download all the data; however, no matter what date range I request (1 year, 3 months, 5 years), it only downloads around 5,000 rows (observations of around 49 variables). :(
Another problem is that I actually have to automate the download. My friend told me that I would need to set up a server and scrape the website directly (not via the GSODR package) to do that. I would appreciate any advice, since I'm very new to all of this.
Thank you very much. I really appreciate your time.
# install the development version of GSODR first, then load packages
if (!require("remotes")) {
  install.packages("remotes", repos = "http://cran.rstudio.com/")
  library("remotes")
}
install_github("ropensci/GSODR")

library(GSODR)
library(dplyr)

# station metadata shipped with GSODR
load(system.file("extdata", "isd_history.rda", package = "GSODR"))

# create a data.frame of stations for Laos only
laos_stations <- subset(isd_history, COUNTRY_NAME == "LAOS")
laos_stations

# download climate data for Laos for whole years;
# get_GSOD() returns a data.frame directly, so data() is not needed afterwards
laos_2017_2021 <- get_GSOD(years = 2017:2021, country = "Laos")
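For the automation part, you do not necessarily need a server: a plain R script run on a schedule (e.g. via cron or Windows Task Scheduler) can fetch and save the data. Below is a minimal sketch using only get_GSOD(); the output directory, file names, and year-at-a-time strategy are my own assumptions, not part of the package:

library(GSODR)

out_dir <- "gsod_laos"  # hypothetical output directory
dir.create(out_dir, showWarnings = FALSE)

# download one year at a time and save each year to its own file,
# so a failed request only loses one year, not the whole range
for (yr in 2017:2021) {
  gsod_yr <- get_GSOD(years = yr, country = "Laos")
  saveRDS(gsod_yr, file.path(out_dir, sprintf("gsod_laos_%d.rds", yr)))
}

# later, combine all saved years into one data.frame
files <- list.files(out_dir, pattern = "\\.rds$", full.names = TRUE)
gsod_all <- do.call(rbind, lapply(files, readRDS))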

Related

R - Making a ggplot while using survey package

I am stuck with a real problem.
My dataset comes from a survey, and to make it usable for statistics about the whole French population, I must weight it.
For this purpose I used the survey package, but the syntax is not really easy to work with in R.
Is there a way to use ggplot while keeping the weights?
To explain it a bit better, here is my dataset:
head(df)
Id Weight Var1
1 30 0
2 12.4 0
3 68.2 1
So my individual 1 accounts for 30 people in the French population.
I create a df_weighted dataset using the survey package.
How can I use ggplot now? df_weighted is a list!
I did something like this to try to escape the list problem, but it did not work at all...
df_weighted_ggplot$var1 <- svytable(~var1, df_weighted)
df_weighted_ggplot$var_fill <- svytable(~var_fill, df_weighted)
ggplot(df_weighted_ggplot, aes(fill = var_fill, x = var1)) + geom_bar(position = "fill")
I received this predictable error:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not a list
Do you know of any other package that could help? I have read many forums, and survey seems to be the most suitable one...
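A minimal sketch of one way around this, assuming a design with no clustering (ids = ~1) and the columns shown in head(df) above; as.data.frame() turns the svytable into a data frame with a Freq column that ggplot can plot directly:

library(survey)
library(ggplot2)

# declare the survey design with the sampling weights
df_weighted <- svydesign(ids = ~1, weights = ~Weight, data = df)

# svytable() returns a table of weighted counts;
# as.data.frame() converts it into something ggplot accepts
tab <- as.data.frame(svytable(~Var1, df_weighted))

# plot the weighted counts
ggplot(tab, aes(x = Var1, y = Freq)) + geom_col()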

How can I create a balanced subset based on one variable in r?

In my data I want to explain what influences an offline or an online conversion. However, my data is unbalanced: I have 12344 online conversions and 435 offline conversions. Because of this, I get warnings when I run my logistic regression.
To solve this I want to take a more balanced subset to use in my logistic regressions, but I have no idea how I can manage this.
My data looks like:
Client id Conversion_type
1 Online
2 Offline
3 Online
4 Online
5 Online
6 Online
7 Online
So, based on the conversion type, I want to take a more balanced subset.
We can split the data frame by Conversion_type and then randomly draw as many rows from the online conversions as there are offline conversions:
library(dplyr)

# split by conversion type
df_online <- df %>%
  filter(Conversion_type == "Online")
df_offline <- df %>%
  filter(Conversion_type == "Offline")

# downsample the online rows to the number of offline rows
set.seed(1)  # for reproducible sampling
df_online_sampled <- df_online[sample(nrow(df_online), nrow(df_offline)), ]
balanced_df <- bind_rows(df_online_sampled, df_offline)
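Equivalently, a more compact version of the same downsampling (a sketch assuming dplyr >= 1.0, which provides slice_sample()):

library(dplyr)
balanced_df <- df %>%
  group_by(Conversion_type) %>%
  slice_sample(n = min(table(df$Conversion_type))) %>%  # size of the smallest group
  ungroup()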
However, I think the problem in your case might be that the classes are perfectly separable; see e.g. here for how to address this.

Historical GTFS data

I'm trying to obtain a couple of weeks' worth of historical train performance (delay) data for trains arriving at Central Station (id=600016...600024), Brisbane, Australia, for research purposes. I found a package called gtfsway developed by SymbolixAU, and some brief code, but I don't know how to specify a date and the station id.
I'm new to GTFS, and any help is appreciated.
library(gtfsway)
url <- "https://gtfsrt.api.translink.com.au/Feed/SEQ"
response <- httr::GET(url)
FeedMessage <- gtfs_realtime(response)
## the function gtfs_tripUpdates() extracts the 'trip_update' feed
lst <- gtfs_tripUpdates(FeedMessage)
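One thing worth noting: the SEQ endpoint above is a GTFS-realtime feed, which only reports the current state of the network, so a history of delays has to be accumulated by polling it over time. A minimal sketch of such a collector (the polling interval, file names, and number of polls are my own assumptions):

library(gtfsway)
library(httr)

url <- "https://gtfsrt.api.translink.com.au/Feed/SEQ"
n_polls <- 10  # hypothetical number of snapshots to collect

for (i in seq_len(n_polls)) {
  response <- httr::GET(url)
  lst <- gtfs_tripUpdates(gtfs_realtime(response))
  # save each snapshot under a timestamped file name
  saveRDS(lst, sprintf("tripupdates_%s.rds", format(Sys.time(), "%Y%m%d_%H%M%S")))
  Sys.sleep(60)  # wait one minute between polls
}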

Incorrect Plot when plotting SpatialPointsDataFrame

I am new to R and am having trouble plotting a SpatialPointsDataFrame, with the eventual goal of creating minimum convex polygons. I've been trying for a few days but can't find anything that fixes this problem.
I load my Excel data as a TXT file. The file has 3 columns (Latitude, Longitude, ID) and 549 rows of observations. Then I enter the following code:
# Create coordinates variable
coords <- cbind(LAT = as.numeric(as.character(multi$LAT)), LONG = as.numeric(as.character(multi$LONG)))
# Create the SpatialPointsDataFrame
multi_SPDF <- SpatialPointsDataFrame(coords, data = data.frame(multi$ID), proj4string = CRS("+init=epsg:4326"))
#Plot the points
plot(multi_SPDF)
When I enter this, it produces a plot that does not look right.
I based this code on a similar example found at this link: http://www.alex-singleton.com/R-Tutorial-Materials/7-converting-coordinates.pdf
If anyone is able to help me to make this work I would really appreciate it. Hopefully I provided all of the necessary information.
EDIT
In an attempt to provide a reproducible example, I extracted the head of the data to copy into the comment, as follows:
LAT LONG ID
1 -41.30853 174.7342 7
2 -41.30481 174.7353 6
3 -41.30681 174.7363 7
4 -41.30660 174.7360 10
5 -41.31400 174.7329 10
6 -41.31059 174.7350 6
When I ran the above code on these 6 rows alone, it produced a plot with points distributed vertically and horizontally (exactly what I wanted!)
However, the same code still does not work on my entire data set. So I think the problem may be in my excel file and not my code.
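A couple of things worth checking (a sketch, not a confirmed diagnosis): sp expects coordinates as (x, y), i.e. (longitude, latitude), while the code above binds latitude first; and as.numeric(as.character(...)) silently turns any malformed value into NA, which could differ between the 6-row head and the full 549-row file:

library(sp)

# coerce to numeric; anything non-numeric in the file becomes NA here
lat <- as.numeric(as.character(multi$LAT))
long <- as.numeric(as.character(multi$LONG))

# rows that failed to parse -- a likely difference between head() and the full file
which(is.na(lat) | is.na(long))

# sp wants coordinates as (x, y) = (longitude, latitude)
coords <- cbind(LONG = long, LAT = lat)
multi_SPDF <- SpatialPointsDataFrame(coords,
                                     data = data.frame(ID = multi$ID),
                                     proj4string = CRS("+init=epsg:4326"))
plot(multi_SPDF)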

Good ways to code complex tabulations in R?

Does anyone have any good thoughts on how to code complex tabulations in R?
I am afraid I might be a little vague on this, but I want to set up a script to create a bunch of tables of a complexity analogous to the Statistical Abstract of the United States.
e.g.: http://www.census.gov/compendia/statab/tables/09s0015.pdf
And I would like to avoid a whole bunch of rbind and cbind statements.
In SAS, I have heard, there is a table-creation specification language; I was wondering if there is something of similar power for R?
Thanks!
It looks like you want to apply a number of different calculations to some data, grouping it by one field (in the example, by state)?
There are many ways to do this. See this related question.
You could use Hadley Wickham's reshape package (see the reshape homepage). For instance, if you wanted the mean, sum, and count functions applied to some data grouped by a value (the choice of statistics here is arbitrary; the example uses R's built-in airquality data):
> library(reshape)
> names(airquality) <- tolower(names(airquality))
> # melt the data to just include month and temp
> aqm <- melt(airquality, id="month", measure="temp", na.rm=TRUE)
> # cast by month with the various relevant functions
> cast(aqm, month ~ ., function(x) c(mean(x),sum(x),length(x)))
month X1 X2 X3
1 5 66 2032 31
2 6 79 2373 30
3 7 84 2601 31
4 8 84 2603 31
5 9 77 2307 30
Or you can use the by() function, where the index represents the states. In your case, rather than applying one function (e.g. mean), you can apply your own function that does multiple tasks (depending upon your needs), for instance function(x) { c(mean(x), length(x)) }. Then run do.call("rbind", ...) on the output, as in the sketch below.
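A minimal sketch of that approach, again on the airquality data:

# per-month mean and count of Temp via by(), then bind the results into a matrix
res <- by(airquality$Temp, airquality$Month, function(x) c(mean = mean(x), n = length(x)))
do.call("rbind", res)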
Also, you might give some consideration to using a reporting package such as Sweave (with xtable) or Jeffrey Horner's brew package. There is a great post on the learnr blog about creating repetitive reports that shows how to use it.
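For a taste of the xtable piece of that workflow (a sketch; the caption text is my own):

library(xtable)
# render a data frame as a LaTeX table, ready to drop into a Sweave document
print(xtable(head(airquality), caption = "First rows of the airquality data"))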
Another option is the plyr package.
library(plyr)
names(airquality) <- tolower(names(airquality))
# per-month summaries; na.rm = TRUE added because solar.r contains missing values
ddply(airquality, "month", function(x) {
  with(x, c(meantemp = mean(temp), maxtemp = max(temp),
            nonsense = max(temp) - min(solar.r, na.rm = TRUE)))
})
Here is an interesting blog post on this topic. The author tries to create a report analogous to the United Nations' World Population Prospects: The 2008 Revision report.
Hope that helps,
Charlie
