Two failed methods to subset a dataset in R, requesting assistance

I am attempting to subset some data in R (the open-source statistics scripting language). I have tried two methods, but neither succeeds: one returns a table with no data, the other returns a table of all "NA" cells, though with the apparently correct dimensions.
I have laid out the code below with fairly clear comments.
First, I create the list of zip codes I'll use to subset the data. The list of zip codes is from a dataset I'll be using.
The list of zip codes is called "zipCodesOfData"
Next, I download the Crime Data I'll be subsetting. I basically just subset it into the data set that I need.
The last part, section three, shows that I try both %in% and the filter method to filter the Crime Data against the zip code data.
Unfortunately, neither method works. I was hoping someone might be able to point out my mistakes or recommend a different subsetting method for the third section.
(As an aside, in section two, I attempt to turn the list into a dataframe, but it does not work. I'm curious as to why, if anyone can shed light on this for me.)
Thanks for your time & assistance!
####
#### Section zero: references and dependencies
####
# r's "choroplethr" library creator's blog for reference:
# http://www.arilamstein.com/blog/2015/06/25/learn-to-map-census-data-in-r/
# http://stackoverflow.com/questions/30787877/making-a-zip-code-choropleth-in-r-using-ggplot2-and-ggmap
#
# library(choroplethr)
# library(choroplethrMaps)
# library(ggplot2)
# # use the devtools package from CRAN to install choroplethrZip from github
# # install.packages("devtools")
# library(devtools)
# install_github('arilamstein/choroplethrZip')
# library(choroplethrZip)
library(data.table) # needed below for fread()
#
####
#### Section one: the data set providing the zipcode we'll use to subset the crime set
####
austin2014_data_raw <- fread('https://data.austintexas.gov/resource/hcnj-rei3.csv')
names(austin2014_data_raw)
nrow(austin2014_data_raw)
## clean up: make any blank cells in column ZipCode say "NA" instead -> source: http://stackoverflow.com/questions/12763890/exclude-blank-and-na-in-r
austin2014_data_raw[austin2014_data_raw$ZipCode==""] <- NA
# keep only rows that do not have "NA"
austin2014_data <- na.omit(austin2014_data_raw)
nrow(austin2014_data) # now there's one less row.
# selecting the first column, which is ZipCode
zipCodesOfData <- austin2014_data[,1]
View(zipCodesOfData)
# Now we have the zipcodes we need: zipCodesOfData
####
#### Section two: Crime data
####
# Crime by zipcode: https://data.austintexas.gov/dataset/Annual-Crime-2014/7g8v-xxja
# (visualized: https://data.austintexas.gov/dataset/Annual-Crime-2014/8mst-ed5t )
# https://data.austintexas.gov/resource/<insertResourceNameHere>.csv w/ resource "7g8v-xxja"
austinCrime2014_data_raw <- fread('https://data.austintexas.gov/resource/7g8v-xxja.csv')
View(austinCrime2014_data_raw)
nrow(austinCrime2014_data_raw)
# First, let's remove the data we don't need
names(austinCrime2014_data_raw)
columnSelection_Crime <- c("GO Location Zip", "GO Highest Offense Desc", "Highest NIBRS/UCR Offense Description")
austinCrime2014_data_selected_columns <- subset(austinCrime2014_data_raw, select=columnSelection_Crime)
names(austinCrime2014_data_selected_columns)
nrow(austinCrime2014_data_selected_columns)
####
#### Section Three: The problem: I am unable to make subsets with the two following methods.
####
# Neither of these methods work:
# Attempt 1:
austinCrime2014_data_selected_columns <- austinCrime2014_data_selected_columns[austinCrime2014_data_selected_columns$`GO Location Zip` %in% zipCodesOfData , ]
View(austinCrime2014_data_selected_columns) # No data in the table
# Attempt 2:
# This initially gave me an error; then I installed dplyr and the error went away.
library(dplyr)
# However, it still doesn't produce anything -- just an empty set with headers
austinCrime2014_data_selected_zips <- filter(austinCrime2014_data_selected_columns, `GO Location Zip` %in% zipCodesOfData)
View(austinCrime2014_data_selected_zips)
I edited out this next section after realizing it was unnecessary:
####
#### Bad section
####
nrow(austinCrime2014_data_selected_columns)
# Then, let's keep only the zipcodes we need
# doesnt work: austinCrime2014_data_selected_columns_df <- data.frame(austinCrime2014_data_selected_columns)
# typeof(austinCrime2014_data_selected_columns_df)
austinCrime<-do.call("rbind", austinCrime2014_data_selected_columns)
austinCrime_needsTranspose <-as.data.frame(austinCrime)
austinCrime <- t(austinCrime_needsTranspose)
typeof(austinCrime)
View(austinCrime)
names(austinCrime)
####
#### Bad section
####

I think readr and dplyr can solve your problem. It's simple:
library(readr)
library(dplyr)
### SECTION 1
# Import data
austin2014_data_raw <- read_csv('https://data.austintexas.gov/resource/hcnj-rei3.csv', na = '')
glimpse(austin2014_data_raw)
nrow(austin2014_data_raw)
# Remove NAs
austin2014_data <- na.omit(austin2014_data_raw)
nrow(austin2014_data) # now there's one less row.
# Get zip codes
zipCodesOfData <- austin2014_data$`Zip Code`
### SECTION 2
# Import data
austinCrime2014_data_raw <- read_csv('https://data.austintexas.gov/resource/7g8v-xxja.csv', na = '')
glimpse(austinCrime2014_data_raw)
nrow(austinCrime2014_data_raw)
# Select and rename required columns
columnSelection_Crime <- c("GO Location Zip", "GO Highest Offense Desc", "Highest NIBRS/UCR Offense Description")
austinCrime_df <- select(austinCrime2014_data_raw, one_of(columnSelection_Crime))
names(austinCrime_df) <- c("zipcode", "highestOffenseDesc", "NIBRS_OffenseDesc")
glimpse(austinCrime_df)
nrow(austinCrime_df)
### SECTION 3
# Filter by zipcode
austinCrime2014_data_selected_zips <- filter(austinCrime_df, zipcode %in% zipCodesOfData)
glimpse(austinCrime2014_data_selected_zips)
nrow(austinCrime2014_data_selected_zips)
Here I used read_csv() from the readr package to import data, and the subset methods select() and filter() from the dplyr package to get the required columns and rows.
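For what it's worth, the likely reason both of your original attempts came back empty is that austin2014_data[,1] does not give you a plain vector: depending on your data.table version it returns either a one-column table or even just the literal value 1, and %in% treats a data frame or data.table as a list of columns, so no zip code ever matches. A minimal sketch of the difference (toy zip codes, not the Austin data):
# %in% against a one-column data frame deparses the column, so values never match
"78701" %in% data.frame(ZipCode = c("78701", "78702")) # FALSE
# %in% against a plain character vector matches values as expected
"78701" %in% c("78701", "78702")                       # TRUE
# extracting the column as a vector, as done above with $, fixes both attempts:
# zipCodesOfData <- austin2014_data$ZipCode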

I'm not sure why you're do.calling and transposing your data. You can just use something like dplyr's semi_join to get only the zipcodes you want:
library(data.table)
library(dplyr)
#> -------------------------------------------------------------------------
#> data.table + dplyr code now lives in dtplyr.
#> Please library(dtplyr)!
#> -------------------------------------------------------------------------
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
zipCodesOfData <- fread('https://data.austintexas.gov/resource/hcnj-rei3.csv') %>%
  mutate(`Zip Code` = ifelse(`Zip Code` == "", NA, `Zip Code`)) %>%
  na.omit() %>%
  select(`Zip Code`)
austinCrime2014_data_raw <- fread('https://data.austintexas.gov/resource/7g8v-xxja.csv') %>%
  select(`GO Location Zip`, `GO Highest Offense Desc`, `Highest NIBRS/UCR Offense Description`) %>%
  semi_join(zipCodesOfData, by = c("GO Location Zip" = "Zip Code")) %>%
  rename(zipcode = `GO Location Zip`,
         highestOffenseDesc = `GO Highest Offense Desc`,
         NIBRS_OffenseDesc = `Highest NIBRS/UCR Offense Description`)
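If semi_join is unfamiliar: semi_join(x, y) keeps the rows of x that have a match in y without adding any columns from y, which is exactly the "filter by another table" operation you want here. A tiny self-contained illustration (toy data, not the Austin sets):
library(dplyr)
crimes <- data.frame(zip = c("78701", "78702", "99999"),
                     offense = c("Theft", "Burglary", "Fraud"),
                     stringsAsFactors = FALSE)
zips <- data.frame(`Zip Code` = c("78701", "78702"),
                   check.names = FALSE, stringsAsFactors = FALSE)
semi_join(crimes, zips, by = c("zip" = "Zip Code"))
#     zip  offense
# 1 78701    Theft
# 2 78702 Burglary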

Related

download.file with wildcard matching in R

I'm trying to download all the files that match a pattern from a url directory in R using download.file, but I can't get it working for even a single file. The url is:
https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/
and the pattern match is all files like: AIS_2019_*_18.zip
Here is what I've tried for a single file case:
download.file('https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_04_18.zip',
              destfile = "AIS_2019_04_18.zip",
              method = "wget", extra = c("-r", "-np", "-L", "--max-redirect=0"))
but I always get "'wget' call had nonzero exit status".
I've also tried setting method = "internal" and mode = "w", but get "scheme not supported in url".
Here's a way to generate all the links; you can then loop through them with a for loop.
library(glue)
library(stringr)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
# Setup
month_dates <- glue("2019-{str_pad(1:12, width = 2, pad = '0')}-01")
days_in_months <- days_in_month(as.Date(month_dates))
# Get appropriate number of days and months combinations
months <- str_pad(rep(1:12, days_in_months), width = 2, pad = "0") # zero-padded to match names like AIS_2019_04_18.zip
days <- unlist(mapply(function(x) str_pad(1:x, width = 2, pad = "0"),
days_in_months))
base_url <- "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019"
# Put everything together
all_files <- glue("{base_url}/AIS_2019_{months}_{days}.zip")
# See results
head(all_files)
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_01.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_02.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_03.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_04.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_05.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_06.zip
# Check number of the days in a year is correct
length(all_files)
#> [1] 365
Created on 2021-08-04 by the reprex package (v2.0.0)
Once you have those created, you can do something like:
# Untested
for (file in all_files) {
  download.file(file,
                destfile = basename(file),
                method = "wget", # "extra" flags are only honored by an external method
                extra = c("-r", "-np", "-L", "--max-redirect=0"))
}
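If wget isn't available on your machine, a hedged alternative is base R's default download method; mode = "wb" matters for zip files on Windows, and tryCatch() keeps the loop going past any missing files. A sketch, assuming all_files from above:
for (file in all_files) {
  dest <- basename(file)
  if (!file.exists(dest)) { # skip files already downloaded
    tryCatch(download.file(file, destfile = dest, mode = "wb"),
             error = function(e) message("failed: ", file))
  }
}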

"Duplicated rows" error in AnnotationForge function makeOrgPackage

I'm creating an organism package using the AnnotationForge package, specifically the function makeOrgPackage. I've been following this vignette: https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/MakingNewOrganismPackages.html
When I call the function:
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
               version="0.1",
               maintainer="myname <email@university.edu>",
               author="myname <email@university.edu>",
               outputDir = ".",
               tax_id="208963",
               genus="Pseudomonas",
               species="aeruginosa",
               goTable="go")
I receive this error:
Error in FUN(X[[i]], ...) : data.frames in '...' cannot contain duplicated rows
The "..." refers to the set of dataframes containing the annotation data. I've ensured that these dataframes are in the exact same structure as the example in the vignette. In the "gene_info" and "chromosome" dfs, I deleted all duplicated rows.
The "go" df has repeated values in the "GID" (gene ID) column, but all GO values are unique, and I've checked that no duplicate rows exist. For example:
GID GO EVIDENCE
1 PA14_00010 GO:0005524 ISM
2 PA14_00010 GO:0006270 ISM
3 PA14_00010 GO:0006275 ISM
4 PA14_00010 GO:0043565 ISM
5 PA14_00010 GO:0003677 ISM
6 PA14_00010 GO:0003688 ISM
7 PA14_00020 GO:0003677 ISM
8 PA14_00020 GO:0006260 ISM
The same goes for the sample finch data provided by the vignette; repeated GIDs, but unique GO numbers. Frustratingly, when I run the makeOrgPackage function for the sample data in the vignette, there are no errors. What am I missing here?
Full script:
# Load in GO annotated PA14 file, downloaded from Pseudomonas.com
PA14file <- read.csv("../data/GO_annotations/GO_PA14.csv")
colnames(PA14file)
> colnames(PA14file)
[1] "LocusTag" "GeneName" "ProductDescription"
[4] "StrainName" "Accession" "GOTerm"
[7] "Namespace" "GOEvidenceCode" "EvidenceOntologyECOCode"
[10] "EvidenceOntologyTerm" "SimilarToBindsTo" "PMID"
[13] "chrom"
# PA14 only has 1 chromosome, so create a new column and populate it with 1s.
PA14file$chrom <- '1'
# Create gene_info df, remove duplicate rows
PA14Sym <- PA14file[,c("LocusTag", "GeneName", "ProductDescription")]
PA14Sym <- PA14Sym[PA14Sym[,"GeneName"]!="-",]
PA14Sym <- PA14Sym[PA14Sym[,"ProductDescription"]!="-",]
colnames(PA14Sym) <- c("GID","SYMBOL","GENENAME")
PA14Sym <- PA14Sym[!duplicated(PA14Sym), ]
# Create chromosome df, remove duplicate rows
PA14Chr <- PA14file[,c("LocusTag", "chrom")]
PA14Chr <- PA14Chr[PA14Chr[,"chrom"]!="-",]
colnames(PA14Chr) <- c("GID","CHROMOSOME")
PA14Chr %>% distinct(GID, .keep_all = TRUE)
PA14Chr <- PA14Chr[!duplicated(PA14Chr), ]
# Create go df
PA14Go <- PA14file[,c("LocusTag", "Accession", "GOEvidenceCode")]
PA14Go <- PA14Go[PA14Go[,"GOEvidenceCode"]!="",]
colnames(PA14Go) <- c("GID","GO","EVIDENCE")
# Call the function
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
               version="0.1",
               maintainer="myname <email@university.edu>",
               author="myname <email@university.edu>",
               outputDir = ".",
               tax_id="208963",
               genus="Pseudomonas",
               species="aeruginosa",
               goTable="go")
I ran into this problem today as well, and the function worked correctly once I switched to dplyr's distinct(). (My call is the same as yours.)
Try appending %>% dplyr::distinct() to the end of each data-frame-creation step, or call dplyr::distinct() on each variable after all the operations, to remove the duplicates.
In your case:
library(dplyr)
PA14Sym <- dplyr::distinct(PA14Sym)
PA14Chr <- dplyr::distinct(PA14Chr)
PA14Go <- dplyr::distinct(PA14Go)
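If you want to confirm which input still contains duplicated rows before calling makeOrgPackage, a quick base-R check using your object names:
# count fully duplicated rows in each input data frame
sapply(list(gene_info = PA14Sym, chromosome = PA14Chr, go = PA14Go),
       function(df) sum(duplicated(df)))
# any nonzero count is what trips the error; repeated GIDs alone are fine
# as long as each whole (GID, GO, EVIDENCE) row is unique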
Hope this helps.

Plot the longest transcript in GenomicRanges with ggbio

I am trying to plot a specific region using ggbio. The code below produces my desired output, except that it contains several transcripts. Is it possible to plot only the longest transcript? I've not been able to access the GenomicRanges object within Homo.sapiens that I assume contains this information.
library(ggbio)
library(Homo.sapiens)
range <- GRanges("chr10" , IRanges(start = 78000000 , end = 79000000))
p.txdb <- autoplot(Homo.sapiens, which = range)
p.txdb
Here is a solution that involves filtering TxDb.Hsapiens.UCSC.hg19.knownGene on the longest transcript by gene_id (which does remove genes without gene_id):
suppressPackageStartupMessages({
  invisible(lapply(c("ggbio", "biovizBase", "data.table",
                     "TxDb.Hsapiens.UCSC.hg19.knownGene",
                     "org.Hs.eg.db"),
                   require, character.only = TRUE))})
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# retrieve transcript lengths
txlen <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
setDT(txlen)
txlen$len <- rowSums(as.matrix(txlen[, .(tx_len, utr5_len, utr3_len)]))
setkey(txlen, gene_id, len, tx_id)
# filter longest transcript by gene_id
ltx <- txlen[!is.na(gene_id)][, tail(.SD,1), by=gene_id]$tx_id
# filter txdb object
txb <- as.list(txdb)
txb$transcripts <- txb$transcripts[txb$transcripts$tx_id %in% ltx, ]
txb$splicings <- txb$splicings[txb$splicings$tx_id %in% ltx,]
txb$genes <- txb$genes[txb$genes$tx_id %in% ltx,]
txb <- do.call(makeTxDb, txb)
# plot according to vignette, chapter 2.2.5
range <- GRanges("chr10", IRanges(start = 78000000 , end = 79000000))
gr.txdb <- crunch(txb, which = range)
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
colnames(values(gr.txdb))[4] <- "model"
grl <- split(gr.txdb, gr.txdb$gene_id)
symbols <- select(org.Hs.eg.db, keys=names(grl), columns="SYMBOL", keytype="ENTREZID")
#> 'select()' returned 1:1 mapping between keys and columns
names(grl) <- symbols[match(symbols$ENTREZID, names(grl), nomatch=0),"SYMBOL"]
autoplot(grl, aes(type = "model"), gap.geom="chevron")
#> Constructing graphics...
Created on 2020-05-29 by the reprex package (v0.3.0)
Edit:
To get gene symbols instead of gene (or transcript) ids, just replace the names of grl with the associated gene symbols, e.g. via org.Hs.eg.db, or any other resource that matches them up.
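The same select() lookup works for any vector of Entrez IDs; a small illustration (the IDs here are arbitrary examples, and AnnotationDbi::select is written with its namespace to avoid masking by other attached packages):
ids <- c("1", "10", "100") # example Entrez gene IDs
AnnotationDbi::select(org.Hs.eg.db, keys = ids,
                      columns = "SYMBOL", keytype = "ENTREZID")
#   ENTREZID SYMBOL
# 1        1   A1BG
# 2       10   NAT2
# 3      100    ADA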

Nested Loop not Working to gather data from NOAA

I'm using the R package rnoaa (along with its required packages) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a second.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
  for(bid in lid){
    bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
    bid.year.data <- data.frame(bid.year$data)
    write.csv(bid.year.data, file='cwind/bid_range.csv')
  }
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but it does not apply the loop variables to the file name; it just names it bid_range.csv.
This error message shows that there is no data for the given station id in 1990. Because you were using a for loop, it stops as soon as it hits an error.
Here I introduce the use of the tidyverse to download the NOAA buoy data. Many of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "grid" containing all combinations of id and year
The expand function from tidyr creates the combinations of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
  select(id) %>%
  expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data.
Also make this function suitable for the map2 function
We will use map2 to loop through all the combinations of id and year via its .x and .y arguments, so we reorder the arguments of buoy to create buoy_modify. We also use the safely function to create a safe version of buoy_modify: when it meets an error, it stores the error message and moves on to the next combination rather than breaking.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
  buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
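To see what safely() gives you before kicking off the full download, here is a tiny illustration with an ordinary function (log, not buoy):
library(purrr)
safe_log <- safely(log)
safe_log(10)  # $result is 2.302585, $error is NULL
safe_log("a") # $result is NULL, $error holds the error object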
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the downloaded data are in wlevel_data. Each element of wlevel_data has two parts: $result holds the data if the download succeeded, otherwise NULL; $error is NULL if the download succeeded, otherwise it holds the error message.
Step 4: Access the data
transpose can turn a list "inside out", so wlevel_data2 has two elements, result and error, which we can use to access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
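From here, because each successful result is a buoy object whose $data slot is a data frame (the same slot your bid.year$data was reaching for), you could stack everything into one table; a hedged sketch:
# bind the per-station/year data frames, tagging rows with the "id_year" names
wlevel_df <- map_dfr(wlevel_result2, ~ as.data.frame(.x$data), .id = "id_year")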

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
     NYT  WSJ
2011   2    3
2012  10    7
I found this tutorial for the NYT, but it is not working for me :_(. When I get to line 30 I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (an NYT API key is needed: http://developer.nytimes.com/apps/register)
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
# concatenate URL for each offset
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F")
res <- fromJSON(raw.data)
dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is an RNYTimes package created by Duncan Temple-Lang, https://github.com/omegahat/RNYTimes -- but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, but that's not relevant for you.
I'm rewriting RNYTimes right now. Install it from GitHub; you need to install devtools first to get install_github:
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g.:
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you the number of articles found:
moocs$response$meta$hits
[1] 121
You could get word counts for each article with:
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
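To get back to your counts-per-year goal, each element of moocs$response$docs should also carry a pub_date field (as in the v2 Article Search API), so a hedged sketch of the aggregation would be:
# count matching articles per year; assumes pub_date strings like "2013-04-29T00:00:00Z"
dates <- sapply(moocs$response$docs, "[[", "pub_date")
table(substr(dates, 1, 4))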
