How to read a ".TAB" file - r

I am trying to find a way to retrieve data from Harvard Dataverse website through R. I am using "dataverse" and "dvn" packages, among others. Many of the data files end with ".tab", although they are not formatted as normal tab-delimited text.
I have done this:
library(dataverse)
## 01. Using the dataverse server and making a search
Sys.setenv("DATAVERSE_SERVER" ="dataverse.harvard.edu")
## 02. Loading the dataset that I chose, by url
doi_url <- "https://doi.org/10.7910/DVN/ZTCWYQ"
my_dataset <- get_dataset(doi_url)
## 03. Grabbing the first file of the dataset
## which is named "001_AppendixC.tab"
my_files <- my_dataset$files$label
my_file <- get_file(my_files[1], doi_url)
AppendixC <- tempfile()
writeBin(my_file, AppendixC)
read.table(AppendixC)
> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> line 1 did not have 2 elements
> In addition: Warning message:
> In read.table(AppendixC) :
> line 1 appears to contain embedded nulls
Any hint?

The problem is that dataverse::get_file() returns the file in a raw binary format. The easiest way to load it into memory is to write it to a tempfile with writeBin() and then read that file with the appropriate import/read function.
Here is a function that should automagically read it into memory:
# Uses rio, which automatically chooses the appropriate import/read
# function based on file type.
library(rio)
install_formats() # only needs to run once after
# pkg installation
load_raw_file <- function(raw, type) {
  match.arg(
    arg = type,
    choices = c(
      "csv", "tab", "psc", "tsv", "sas7bdat",
      "sav", "dta", "xpt", "por", "xls", "xlsx",
      "R", "RData", "rda", "rds", "rec", "mtb",
      "feather", "csv.gz", "fwf"
    )
  )
  tmp <- tempfile(fileext = paste0(".", type))
  writeBin(as.vector(raw), tmp)
  out <- import(tmp)
  unlink(tmp)
  out
}
Let's try it out with your file, which is actually an Excel file despite the ".tab" extension.
library(dataverse)
raw <- get_file(
"001_AppendixC.tab",
"https://doi.org/10.7910/DVN/ZTCWYQ"
)
data <- load_raw_file(raw, "xlsx")
And look at the data:
str(data)
> 'data.frame': 132 obs. of 17 variables:
> $ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
> $ UN_9193 : chr "37.4" "7.7" "9.1" "65.400000000000006" ...
> $ UN_9901 : chr "46.1" "7.2" "10.7" "50" ...
> $ UN_0709 : chr "24.6" "9.6999999999999993" "7.5" "23.7" ...
> $ UN_1416 : chr "23" "4.9000000000000004" "4.5999999999999996" "14" ...
> $ stu90_94 : chr "51.3" "37.200000000000003" "22.9" "52.9" ...
> $ stu98_02 : chr "54.7" "39.200000000000003" "23.6" "47.1" ...
> $ stu06_10 : chr "51.3" "23.1" "13.2" "29.2" ...
> $ stu12_16 : chr "40.9" "17.899999999999999" "11.7" "37.6" ...
> $ wast90_94: chr "11.5" "9.4" "7.1" "7.9" ...
> $ wast98_02: chr "13.4" "12.2" "3.1" "8.6999999999999993" ...
> $ wast06_10: chr "8.9" "9.4" "4.0999999999999996" "8.1999999999999993" ...
> $ wast12_16: chr "9.5" "6.2" "4.0999999999999996" "4.9000000000000004" ...
> $ UM1992 : chr "16.8" "3.7" "4.5" "22.6" ...
> $ UM2000 : chr "13.7" "2.6" "4" "21.7" ...
> $ UM2008 : chr "11" "1.8" "2.9" "19.2" ...
> $ UM2015 : chr "9.1" "1.4" "2.6" "15.7" ...
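The ".tab" extension gave no hint that this file was really an Excel workbook; the "embedded nulls" warning was the only clue. One way to check before choosing a type is to look at the first bytes of the raw vector. The sketch below uses base R only. guess_container() is a hypothetical helper, not part of dataverse or rio, and it recognises just two signatures: zip archives (which is what .xlsx files are) and the older OLE2 container used by .xls.

```r
# Hypothetical helper (not part of dataverse or rio): guess the container
# format of a downloaded file from its leading "magic" bytes.
guess_container <- function(raw_bytes) {
  stopifnot(is.raw(raw_bytes))
  zip_magic  <- as.raw(c(0x50, 0x4b, 0x03, 0x04))  # "PK" header: zip, hence xlsx
  ole2_magic <- as.raw(c(0xd0, 0xcf, 0x11, 0xe0,
                         0xa1, 0xb1, 0x1a, 0xe1))  # legacy xls container
  starts_with <- function(magic) {
    length(raw_bytes) >= length(magic) &&
      identical(raw_bytes[seq_along(magic)], magic)
  }
  if (starts_with(zip_magic)) return("zip")
  if (starts_with(ole2_magic)) return("ole2")
  "unknown"
}

# Stand-in inputs; with a real download you would pass the result of get_file().
guess_container(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)))  # "zip"
guess_container(charToRaw("a\tb\n1\t2\n"))                       # "unknown"
```

A "zip" result only says the file is a zip archive, so "xlsx" is a likely guess rather than a certainty; a genuine tab-delimited text file would come back as "unknown".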

Related

How to specify end of header line with read.table

I have ASCII files with data separated by $ signs.
There are 23 columns in the data and the first row holds the column names, but the line endings are inconsistent, which causes R to import the data improperly, shifting the values to the left relative to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistency between the line endings causes R to think that the records have one column more than the header, thus making the first column as a row.names column, which is not correct. Adding the specification row.names=NULL does not fix the issue.
If I manually add a $ sign in the file the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), file=paste0("fixed_", file_path), "\n")
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
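Since the issue occurs in hundreds of files, it may be more convenient to skip the intermediate "fixed_" files and clean the lines in memory. The following is a minimal sketch of that variant, not the answer's exact method: read_dollar_file() is a hypothetical helper name, and the three-column demo file is made up for illustration. It relies on read.table() accepting the cleaned lines through its text argument.

```r
# Hypothetical helper: read a "$"-delimited file whose data rows end with an
# extra "$" while the header row does not.
read_dollar_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Remove one trailing "$" (plus any whitespace after it) from each line.
  # Lines that do not end in "$", such as the header, are left unchanged.
  lines <- sub("\\$[[:space:]]*$", "", lines)
  read.table(text = lines, header = TRUE, sep = "$",
             comment.char = "", quote = "", stringsAsFactors = FALSE)
}

# Made-up miniature file with the same header/row mismatch.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("ISR$CASE$REPORTER_COUNTRY",
             "7215577$8135839$UNITED STATES$"), demo_path)
df <- read_dollar_file(demo_path)
str(df)
unlink(demo_path)
```

To process a whole folder, something like lapply(list.files(dir, full.names = TRUE), read_dollar_file) should work, because each file's header is read from the file itself and no column names are hard-coded.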

troubleshooting a connection to the ebd file path for awk in r in mac and windows

I'm opening eBird data with auk and am having trouble setting the path to the file. I set the path to a folder, but when I try to change it to a file I get an error saying the path check is not TRUE.
With Sys.getenv() I can see the path is set to a folder, and auk_get_ebd_path() shows the same thing. When I try to change the path to a file inside that folder with auk_set_ebd_path(), I receive an error message.
library(auk)
auk_get_ebd_path()
[1] "/Users/lucypullen/Documents/bird/data"
auk_set_ebd_path("/Users/lucypullen/Documents/bird/data/ebd_CA_relApr-2019.txt", overwrite = TRUE)
[1] Error: dir.exists(paths = path) is not TRUE
Other attempts yielded the message: Error in file(con, "r") : cannot open the connection
Warning messages: 1: In file(con, "r") :
'raw = FALSE' but '/Users.....data/CA' is not a regular file
2: In file(con, "r") :
cannot open file '/Users/lucypullen/Documents/bird/data/CA': it is a directory
It seems like the functions want the path to point to a file. I thought system.file() would complete the path, and I've tried a bunch of variations:
input_file <- system.file("/Users/lucypullen/Documents/bird/data/CA/ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("~/ebd_CA_relApr-2019.txt", package = "auk")
I suspect you should be doing this, since there appears to have been some sort of setup operation that preceded this question:
my_ebd_path = auk_get_ebd_path() # since you appear to already set it.
my_full_file_loc <- file.path(my_ebd_path, "ebd_CA_relApr-2019.txt")
my_ebd_data <- read_ebd(my_full_file_loc)
str(my_ebd_data)
# ------what I get with the sample file in the package--------------
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 494 obs. of 45 variables:
$ checklist_id : chr "S6852862" "S14432467" "S39033556" "S38303088" ...
$ global_unique_identifier : chr "URN:CornellLabOfOrnithology:EBIRD:OBS97935965" "URN:CornellLabOfOrnithology:EBIRD:OBS201605886" "URN:CornellLabOfOrnithology:EBIRD:OBS530638734" "URN:CornellLabOfOrnithology:EBIRD:OBS520887169" ...
$ last_edited_date : chr "2016-02-22 14:59:49" "2013-06-16 17:34:19" "2017-09-06 13:13:34" "2017-07-24 15:17:16" ...
$ taxonomic_order : num 20145 20145 20145 20145 20145 ...
$ category : chr "species" "species" "species" "species" ...
$ common_name : chr "Green Jay" "Green Jay" "Green Jay" "Green Jay" ...
$ scientific_name : chr "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" ...
$ observation_count : chr "4" "2" "1" "1" ...
$ breeding_bird_atlas_code : chr NA NA NA NA ...
#----snipped a bunch of output---------
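Two points in the answer can be checked without auk or the real data. First, system.file() only looks inside installed packages and returns an empty string when nothing is found, which is why the variations in the question could not locate a file in a personal data folder. Second, the stored EBD path is a directory, so the file name has to be joined onto it. The sketch below uses base R only; resolve_in_dir() is a hypothetical helper, and a temporary folder stands in for the directory returned by auk_get_ebd_path().

```r
# system.file() searches installed packages only; for a file that is not
# shipped with the package it returns "".
system.file("ebd_CA_relApr-2019.txt", package = "base")

# Hypothetical helper: join a directory and a file name, failing early with a
# readable message if either one is missing.
resolve_in_dir <- function(dir, filename) {
  if (!dir.exists(dir)) stop("Not a directory: ", dir)
  path <- file.path(dir, filename)
  if (!file.exists(path)) stop("File not found: ", path)
  path
}

# Stand-in setup: a temporary folder plays the role of the EBD directory.
demo_dir <- file.path(tempdir(), "ebd_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("placeholder", file.path(demo_dir, "ebd_CA_relApr-2019.txt"))

resolve_in_dir(demo_dir, "ebd_CA_relApr-2019.txt")
```

With the real setup, the first argument would be auk_get_ebd_path() and the returned path would be passed on to read_ebd().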

POST to API using httr in R results in error

I'm trying to pull data directly from an API into R using the httr package. The API doesn't require any authentication, and accepts JSON strings of lat, long, elevation, variable sets, and time period to estimate climate variables for any location. This is my first time using an API, but the code below is what I've cobbled together from various Stack Overflow posts.
library(jsonlite)
library(httr)
url = "http://apibc.climatewna.com/api/clmApi"
body <- data.frame(lat = c(48.98,50.2), ##two example locations
lon = c(-115.02, -120),
el = c(1000,100),
prd = c("Normal_1961_1990.nrm","Normal_1961_1990.nrm"),
varYSM = c("Y","SST"))
requestBody <- toJSON(list("output" = body),auto_unbox = TRUE) ##convert to JSON string
result <- POST("http://apibc.climatewna.com/api/clmApi", ##post to API
body = requestBody,
add_headers(`Content-Type`="application/json"))
content(result)
I've tried various different versions of this (e.g. writing the JSON string manually, putting the body as a list in POST with encode = "json"), and it always runs, but the content always contains the below error message:
$Message
[1] "An error has occurred."
$ExceptionMessage
[1] "Object reference not set to an instance of an object."
$ExceptionType
[1] "System.NullReferenceException"
If I use GET and specify the variables directly in the URL
url = "http://apibc.climatewna.com/api/clmApi/LatLonEl?lat=48.98&lon=-115.02&el=1000&prd=Normal_1961_1990&varYSM=Y"
result <- GET(url)
content(result)
it produces the correct output, but then I can only obtain information for one location at a time. There isn't currently any public documentation about this API as it's very new, but I've attached a draft of the section explaining it using JS below. I would very much appreciate any help/suggestions on what I'm doing wrong!
Thank you!
The main problem is that jQuery.ajax encodes the data using jQuery.param before sending it to the API, so what it's sending looks something like [0][lat]=48.98&[0][lon]=-115.02.... I don't know of a package in R that does a similar encoding as jQuery.param, so we'll have to hack something together.
Modifying your example slightly:
library(httr)
body <- data.frame(lat = c(48.98,50.2), ##two example locations
lon = c(-115.02, -120),
el = c(1000,100),
prd = c("Normal_1961_1990","Normal_1961_1990"),
varYSM = c("Y","Y"))
Now, we do the encoding, like so:
out <- sapply(1:nrow(body), function(i) {
paste(c(
paste0(sprintf("[%d][lat]", i - 1), "=", body$lat[i]),
paste0(sprintf("[%d][lon]", i - 1), "=", body$lon[i]),
paste0(sprintf("[%d][el]", i - 1), "=", body$el[i]),
paste0(sprintf("[%d][prd]", i - 1), "=", body$prd[i]),
paste0(sprintf("[%d][varYSM]", i - 1), "=", body$varYSM[i])
), collapse = "&")
})
out <- paste(out, collapse = "&")
so now out is in a form that the API likes. Finally
result <- POST(url = "http://apibc.climatewna.com/api/clmApi", ##post to API
body = out,
add_headers(`Content-Type`="application/x-www-form-urlencoded"))
noting the Content-Type. We get
df <- do.call(rbind, lapply(content(result), as.data.frame, stringsAsFactors = FALSE))
str(df)
# 'data.frame': 2 obs. of 29 variables:
# $ lat : chr "48.98" "50.2"
# $ lon : chr "-115.02" "-120"
# $ elev : chr "1000" "100"
# $ prd : chr "Normal_1961_1990" "Normal_1961_1990"
# $ varYSM : chr "Y" "Y"
# $ MAT : chr "5.2" "8"
# $ MWMT : chr "16.9" "20.2"
# $ MCMT : chr "-6.7" "-5.6"
# $ TD : chr "23.6" "25.7"
# $ MAP : chr "617" "228"
# $ MSP : chr "269" "155"
# $ AHM : chr "24.7" "79.1"
# $ SHM : chr "62.9" "130.3"
# $ DD_0 : chr "690" "519"
# $ DD5 : chr "1505" "2131"
# $ DD_18 : chr "4684" "3818"
# $ DD18 : chr "60" "209"
# $ NFFD : chr "165" "204"
# $ bFFP : chr "150" "134"
# $ eFFP : chr "252" "254"
# $ FFP : chr "101" "120"
# $ PAS : chr "194" "34"
# $ EMT : chr "-36.3" "-32.7"
# $ EXT : chr "37.1" "41.2"
# $ Eref : chr "14.7" "13.6"
# $ CMD : chr "721" "862"
# $ MAR : chr "347" "679"
# $ RH : chr "57" "57"
# $ Version: chr "ClimateBC_API_v5.51" "ClimateBC_API_v5.51"
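The encoding step above spells out each column by hand. If the set of fields changes, a small generalisation that loops over whatever columns the data frame has may be easier to maintain. This is a sketch under the same assumption as the answer, namely that the API accepts unescaped "[row][name]=value" pairs; encode_rows() is a hypothetical helper, not an httr function, and no request is sent here.

```r
# Hypothetical helper: encode each row of a data frame as
# "[row][column]=value" pairs (zero-based row index), joined by "&".
encode_rows <- function(df) {
  pairs <- unlist(lapply(seq_len(nrow(df)), function(i) {
    # One-row slice; convert every cell to its character form.
    values <- vapply(df[i, , drop = FALSE], as.character, character(1))
    sprintf("[%d][%s]=%s", i - 1, names(df), values)
  }))
  paste(pairs, collapse = "&")
}

body <- data.frame(lat = c(48.98, 50.2),
                   lon = c(-115.02, -120),
                   prd = c("Normal_1961_1990", "Normal_1961_1990"))
encode_rows(body)
```

Values are converted with as.character(), so very large or very small numbers could come out in scientific notation; formatting them explicitly beforehand would avoid that.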

Scraping table from myneta using R

I am trying to scrape a table from http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary into RStudio.
Here's the code
library(rvest)
url<-'http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary'
webpage<-read_html(url)
candidate_info<- html_nodes(webpage,xpath='//*[@id="main"]/div/div[2]/div[2]/table')
candidate_info<- html_table(candidate_info)
head(candidate_info)
But I am getting no output. What am I doing wrong?
That site has some very broken HTML. But, it's workable.
I find it better to target nodes in a slightly less fragile way. The XPath below finds it by content of the table.
html_table() croaks (or took forever and I didn't want to wait) so I ended up building the table "manually".
library(rvest)
# helper to clean column names
mcga <- function(x) { make.unique(gsub("(^_|_$)", "", gsub("_+", "_", gsub("[[:punct:][:space:]]+", "_", tolower(x)))), sep = "_") }
pg <- read_html("http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary")
# target the table
tab <- html_node(pg, xpath=".//table[contains(thead, 'Liabilities')]")
# get the rows so we can target columns
rows <- html_nodes(tab, xpath=".//tr[td[not(@colspan)]]")
# make a data frame
do.call(
cbind.data.frame,
c(lapply(1:8, function(i) {
html_text(html_nodes(rows, xpath=sprintf(".//td[%s]", i)), trim=TRUE)
}), list(stringsAsFactors=FALSE))
) -> xdf
# make nicer names
xdf <- setNames(xdf, mcga(html_text(html_nodes(tab, "th")))) # get the header to get column names
str(xdf)
## 'data.frame': 4823 obs. of 8 variables:
## $ sno : chr "1" "2" "3" "4" ...
## $ candidate : chr "A Hasiv" "A Wahid" "Aan Shikhar Shrivastava" "Aaptab Urf Aftab" ...
## $ constituency : chr "ARYA NAGAR" "GAINSARI" "GOSHAINGANJ" "MUBARAKPUR" ...
## $ party : chr "BSP" "IND" "Satya Shikhar Party" "Islam Party Hind" ...
## $ criminal_case: chr "0" "0" "0" "0" ...
## $ education : chr "12th Pass" "10th Pass" "Graduate" "Illiterate" ...
## $ total_assets : chr "Rs 3,94,24,827 ~ 3 Crore+" "Rs 75,106 ~ 75 Thou+" "Rs 41,000 ~ 41 Thou+" "Rs 20,000 ~ 20 Thou+" ...
## $ liabilities : chr "Rs 58,46,335 ~ 58 Lacs+" "Rs 0 ~" "Rs 0 ~" "Rs 0 ~" ...
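The one-line column-name helper is dense, so here it is unrolled into its individual steps. The logic is meant to be the same as mcga(); clean_names() is just a hypothetical name for the unrolled version, and the input strings are made-up headers for illustration rather than text taken from the site.

```r
# Unrolled equivalent of the mcga() helper (hypothetical name clean_names).
clean_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)  # punctuation/whitespace runs -> "_"
  x <- gsub("_+", "_", x)                      # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                  # trim a leading or trailing "_"
  make.unique(x, sep = "_")                    # disambiguate duplicate names
}

# Made-up headers, including a duplicate.
clean_names(c("SNo", "Candidate", "Criminal Case", "Total Assets", "Candidate"))
# "sno" "candidate" "criminal_case" "total_assets" "candidate_1"
```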

Read csv file as a vector

Hi, I am trying to read a csv file and would like to get a vector.
The contents of the file look like this:
head filetoread.csv
0610010K14Rik,0610011F06Rik,1110032F04Rik,1110034G24Rik,1500011B03Rik,1700019L03Rik,1700021K19Rik, blah,blah,...
in R session:
c <- read.csv("filetoread.csv")
> c
[1] X0610010K14Rik X0610011F06Rik X1110032F04Rik
...
> str(c)
'data.frame': 0 obs. of 2840 variables:
$ X0610010K14Rik : logi
$ X0610011F06Rik : logi
$ X1110032F04Rik : logi
$ X1110034G24Rik : logi
...
but I want something like:
> c
[1] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik" ...
str(c)
chr [1:2840] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik"...
We can use scan
scan("filetoread.csv", sep=',', what = "", quiet = TRUE)
#[1] "0610010K14Rik" "0610011F06Rik" "1110032F04Rik" "1110034G24Rik"
#[5] "1500011B03Rik" "1700019L03Rik" "1700021K19Rik" " blah" "blah"
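Note the leading space in " blah" in the output above: scan() keeps whitespace around fields unless told otherwise. A small self-contained sketch, using a made-up temporary file in place of filetoread.csv, shows the strip.white argument, and make.names() in case the X-prefixed form from the question is really wanted.

```r
# Made-up stand-in for filetoread.csv, with a stray space before "blah".
demo_csv <- tempfile(fileext = ".csv")
writeLines("0610010K14Rik,0610011F06Rik, blah,blah", demo_csv)

# what = "" forces character output; strip.white trims each field
# (it only applies when sep is given).
genes <- scan(demo_csv, what = "", sep = ",", strip.white = TRUE, quiet = TRUE)
genes
# "0610010K14Rik" "0610011F06Rik" "blah" "blah"

# read.csv() added the "X" prefix via make.names(); apply it only if needed.
make.names(genes[1:2])
# "X0610010K14Rik" "X0610011F06Rik"

unlink(demo_csv)
```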
