Opening JSON files in R

I have downloaded some data from the following site as a zip file and extracted it onto my computer. Now, I'm having trouble trying to open the included json data files.
Running the following code:
install.packages("rjson")
library("rjson")
comp <- fromJSON("statsbomb/data/competitions")
gave this error:
Error in fromJSON("statsbomb/data/competitions") : unexpected character 's'
Also, is there a way to load all files at once instead of writing individual statements each time?

Here is what I did (Unix system).
Clone the GitHub repo (note where you put it):
git clone https://github.com/statsbomb/open-data.git
Set the working directory (the directory to which you cloned the repo or extracted the zip file):
setwd("path to directory where you cloned the repo")
Read the data.
jsonlite::fromJSON("competitions.json")
With rjson: rjson::fromJSON(file="competitions.json")
To load all the files at once, move all the .json files to a single directory and use lapply/assign to assign the resulting objects to your environment (see the sketch after the output below).
Result (single file):
competition_id season_id country_name
1 37 4 England
2 43 3 International
3 49 3 United States of America
4 72 30 International
competition_name season_name match_updated
1 FA Women's Super League 2018/2019 2019-06-05T22:43:14.514
2 FIFA World Cup 2018 2019-05-14T08:23:15.306297
3 NWSL 2018 2019-05-17T00:35:34.979298
4 Women's World Cup 2019 2019-06-21T16:45:45.211614
match_available
1 2019-06-05T22:43:14.514
2 2019-05-14T08:23:15.306297
3 2019-05-14T08:02:00.567719
4 2019-06-21T16:45:45.211614
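A minimal sketch of the lapply/assign idea mentioned above (untested; it assumes all the .json files have been copied into a single directory called data/):
json_files <- list.files("data", pattern = "\\.json$", full.names = TRUE)
lapply(json_files, function(f) {
  # object name derived from the file name, e.g. competitions.json -> competitions
  assign(tools::file_path_sans_ext(basename(f)),
         jsonlite::fromJSON(f),
         envir = .GlobalEnv)
})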

The function fromJSON takes a JSON string as its first argument unless you specify that you are giving it a file (fromJSON(file = "competitions.json")).
The error you mention comes from the function trying to parse 'statsbomb/data/competitions' as JSON text rather than as a file name. In JSON, objects and arrays are enclosed in brackets and strings are inside quotation marks, so the bare s from "statsbomb" is not a valid first character.
To read all json files you could do:
lapply(dir("open-data-master/", pattern = "\\.json$", recursive = TRUE), function(x) {
  assign(gsub("/", "_", x), fromJSON(file = paste0("open-data-master/", x)), envir = .GlobalEnv)
})
However, this will take a long time to complete! You should probably refine this approach a little, e.g. split the list of files obtained with dir into chunks of 50 before running the lapply call.
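A rough sketch of that chunking idea (untested; the chunk size of 50 is just the one suggested above), so you can run and inspect one batch at a time:
files  <- dir("open-data-master/", pattern = "\\.json$", recursive = TRUE)
chunks <- split(files, ceiling(seq_along(files) / 50))   # groups of 50 file names
for (chunk in chunks) {
  lapply(chunk, function(x) {
    assign(gsub("/", "_", x),
           fromJSON(file = paste0("open-data-master/", x)),
           envir = .GlobalEnv)
  })
}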

Related

How Can I Download and Use a Matrix from Matrix Market?

I am trying to write code to store a matrix to a variable directly from Matrix Market's website. Below is a sample URL that I'd use:
https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz
The example URL will download a bcsstk01.mtx.gz file. I need to extract the bcsstk01.mtx file. Then I need to use MatrixMarket.mmread() so I can save to a variable.
I first tried saving the downloaded file (or URL location) to a variable with A = HTTP.get(), but a lack of online resources and of knowledge led to no results. Then I used HTTP.download() and got the .mtx.gz file, but I can't unzip it. And finally, MatrixMarket.mmread() cannot read .gz files. So I'm stuck with a downloaded file I can't do anything with unless I unzip it manually.
Using the info from the link in the comments and some fiddling, I managed to get the following:
using TranscodingStreams, CodecZlib
using Downloads

# Download the gzipped file into an in-memory buffer and wrap it in a
# decompressing stream, so the archive never has to be unpacked on disk.
stream = PipeBuffer()
openstream = TranscodingStream(GzipDecompressor(), stream)
Downloads.download("https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz", stream)
for line in eachline(openstream)
    println(line)
end
This prints:
%%MatrixMarket matrix coordinate real symmetric
48 48 224
1 1 2.8322685185200e+06
5 1 1.0000000000000e+06
6 1 2.0833333333300e+06
7 1 -3.3333333333300e+03
...
which I suppose is the desired data.

Moving files between folders when folder names match partially (in R or VBA)

I'm trying to solve the following problem.
I have 9 folders titled PROS_2010 to PROS_2019. Each of them has about 500 subfolders with names structured as follows, e.g. PROS_201001211_FIRM NAME_number. Each subfolder has a variety of PDF files with different names.
I have created in VBA another folder called sample with about 400 subfolders, each of which is named after a specific FIRM NAME. For this I used the following code:
Sub MakeFolders()
    Dim Rng As Range
    Dim maxRows, maxCols, r, c As Integer
    Set Rng = Selection
    maxRows = Rng.Rows.Count
    maxCols = Rng.Columns.Count
    For c = 1 To maxCols
        r = 1
        Do While r <= maxRows
            If Len(Dir(ActiveWorkbook.Path & "\" & Rng(r, c), vbDirectory)) = 0 Then
                MkDir (ActiveWorkbook.Path & "\" & Rng(r, c))
                On Error Resume Next
            End If
            r = r + 1
        Loop
    Next c
End Sub
I now want to move all the PDF files that are in the original subfolders PROS_201001211_FIRM NAME_number to the folders titled FIRM NAME only.
Basically, each original subfolder contains a report about a firm for a specific year (2010 to 2019), and I want to get all of a firm's reports for all years into a single folder titled FIRM NAME.
To make it easier, I already have an Excel file with the complete list of subfolders that looks like this:
Data structure: Company name is the name of the folder to which I want to move the files that are currently in "attachment folder". attachment1 is the PDF file name (which always changes, so ideally the code would pluck all the files in the attachment folder and move them to the folder with the company name).
Thanks in advance,
Simon
OK, so thanks to the help of a mate I found it is super easy to solve this problem using a Windows command (.cmd) file.
Basically, create a text file (in Notepad) that has the following structure:
move "original pdf file directory" "new pdf file location\"
...
Repeat the structure for each file (which requires some basic Excel string manipulation).
Then save the .txt file as a .cmd file and open it.
Done.
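Since the question also mentions R, here is a minimal R sketch of the same idea (untested; the root path is hypothetical, and it assumes the firm name is the third underscore-separated field of each subfolder name and contains no underscores itself):
root       <- "C:/PROS"                   # hypothetical: contains PROS_2010 ... PROS_2019 and sample
sample_dir <- file.path(root, "sample")   # one subfolder per FIRM NAME

pdfs <- list.files(root, pattern = "\\.pdf$", recursive = TRUE, full.names = TRUE)
for (f in pdfs) {
  # subfolder names look like PROS_201001211_FIRM NAME_number
  firm <- strsplit(basename(dirname(f)), "_")[[1]][3]
  dest <- file.path(sample_dir, firm)
  # copy rather than move, to be safe; delete the originals once you have checked the result
  if (!is.na(firm) && dir.exists(dest)) file.copy(f, dest)
}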

Apply common header names to all files in the folder

I have 4 files in a folder, named import_xxx.xlsx.
I need to do the tasks below:
1. First, apply common header names to all files in the folder.
2. Write the corrected files as separate files in another folder.
I have tried the code below.
1. Read the list of files:
filenames_list <- list.files(pattern= ".xls", full.names=TRUE)
My question is how to change the header names and apply that change to all files.
My column names are as follows, with sample data.
Sr No Invoice Date Invoice No Payer Name IGMNo Container No Size Type Act. gate in Date Container Agent Container Agent Name Importer Name CHA Code CHA Name Activity Description Amount Service Tax Total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 8-1-2018 12:12:29 AM MII180800001 SAME DEUTZ FAHR INDIA PRIVATE LIMITED 2200750 ECMU9674562 40 GB 7-26-2018 4:50:35 AM CLC007 CMA CGM SAME DEUTZ FAHR INDIA PRIVATE LIMITED CHS020 SEAKING CARGO SERVICES (I) PVT LTD Handling & PNR Movement Charges-FCL 10400 1872 12272
2 8-1-2018 12:12:29 AM MII180800001 SAME DEUTZ FAHR INDIA PRIVATE LIMITED 2200750 ECMU9674562 40 GB 7-26-2018 4:50:35 AM CLC007 CMA CGM SAME DEUTZ FAHR INDIA PRIVATE LIMITED CHS020 SEAKING CARGO SERVICES (I) PVT LTD Value Added Charges 2000 360 2360
I need to perform the data transformation below, which converts lower-case characters to upper case, i.e. in the Activity Description column.
df <- data.frame(lapply(df, function(v) {
  if (is.character(v)) return(toupper(v))
  else return(v)
}))
Do I need to loop through the files to write them? The following code will write a single file, but I need to loop through and write the applied changes for all files.
write.xlsx2(filename,"path")
Can anyone help me loop through the files, perform the header transformation, and write the files within the loop?
Thanks.
I would do something along the lines of the following. You will have to define the vector with the new common header, common_header.
(Untested.)
library(xlsx)
filenames_list <- list.files(pattern = "\\.xls", full.names = TRUE)
target_dir <- "path/to/target/directory/"
lapply(filenames_list, function(fl) {
  DF <- read.xlsx2(fl, sheetIndex = 1)
  names(DF) <- common_header
  target_fl <- paste0(target_dir, basename(fl))
  write.xlsx2(DF, target_fl)
})
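If you also want the upper-case transformation from the question, one option (again untested, and assuming common_header keeps a column called Activity Description) is to add a line inside the loop before writing:
DF[["Activity Description"]] <- toupper(DF[["Activity Description"]])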

R - read html files within a folder, count frequency, and export output

I'm planning to use R to do some simple text mining tasks. Specifically, I would like to do the following:
Automatically read each html file within a folder, then
For each file, do a frequency count of some particular phrases (e.g., "financial constraint", "oil export", etc.), then
Automatically write the output to a .csv file using the following data structure (e.g., file 1 has "financial constraint" showing 3 times and "oil export" 4 times, etc.):
file_name count_financial_constraint count_oil_export
1 3 4
2 0 3
3 4 0
4 1 2
Can anyone please let me know where I should start? So far I think I've figured out how to clean HTML files and do the count, but I'm still not sure how to automate the process (I really need this, as I have around 5 folders each containing about 1000 HTML files). Thanks!
Try this:
gethtml <- function(path = ".") {
  files <- list.files(path)
  setwd(path)
  html <- grepl("\\.html$", files)
  files <- files[html]
  htmlcount <- vector()
  for (i in files) {
    htmlcount[i] <- ##### add function that reads the html file and counts it
  }
  return(sum(htmlcount))
}
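For the parts the placeholder leaves open, here is a rough end-to-end sketch in base R (untested; the folder path and the phrases are placeholders you would replace):
folder  <- "path/to/html/folder"                      # placeholder
phrases <- c("financial constraint", "oil export")    # placeholder phrases

files <- list.files(folder, pattern = "\\.html$", full.names = TRUE)

counts <- t(sapply(files, function(f) {
  txt <- paste(readLines(f, warn = FALSE), collapse = " ")
  txt <- gsub("<[^>]+>", " ", txt)                    # crude tag stripping
  sapply(phrases, function(p) {
    m <- gregexpr(p, txt, ignore.case = TRUE)[[1]]
    if (m[1] == -1) 0L else length(m)
  })
}))

out <- data.frame(file_name = basename(files), counts, check.names = FALSE)
write.csv(out, "phrase_counts.csv", row.names = FALSE)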
R is not primarily intended for rigorous text parsing, and consequently the tools for such tasks are limited. If you insist on doing it with R then you had better get familiar with regular expressions and have a look at this.
However, I highly recommend using Python with the BeautifulSoup library, which is specifically designed for this task.

How can I use R (Rcurl/XML packages ?!) to scrape this webpage?

I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:
I would like to go through all the "species pages" present in this link:
http://gtrnadb.ucsc.edu/
So for each of them I will go to:
The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
Inside that link I wish to scrape the data in the page so that I will have a long list containing this data (for example):
chr.trna3 (1-77) Length: 77 bp
Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....
Where each line will have its own list (inside the list for each "trna", inside the list for each animal).
I remember coming across the packages RCurl and XML (in R) that could allow for such a task. But I don't know how to use them. So what I would love to have is:
1. Some suggestions on how to build such code.
2. A recommendation on how to learn the knowledge needed to perform such a task.
Thanks for any help,
Tal
Tal,
You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:
Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
Then take that list of URLs, scrape the data you are looking for, and stick it into a data.frame.
So, here goes...
library(RCurl)
### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]
get_data_url <- function(s) {
  u_split1 <- strsplit(s, "/")[[1]][1]
  u_split2 <- strsplit(u_split1, '\\"')[[1]][2]
  ifelse(grep("[[:upper:]]", u_split2) == 1 & length(strsplit(u_split2, "#")[[1]]) < 2, return(u_split2), return(NA))
}
# Extract only those elements that are relevant
genomes <- unlist(lapply(links, get_data_url))
genomes <- genomes[which(is.na(genomes) == FALSE)]
### 2) Now, scrape the genome data from all of those URLS ###
# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data into a multi-dimensional list
parse_genomes <- function(g) {
  g_split1 <- strsplit(g, "\n")[[1]]
  g_split1 <- g_split1[2:5]
  # Pull all of the data and stick it in a list
  g_split2 <- strsplit(g_split1[1], "\t")[[1]]
  ID <- g_split2[1]                            # Sequence ID
  LEN <- strsplit(g_split2[2], ": ")[[1]][2]   # Length
  g_split3 <- strsplit(g_split1[2], "\t")[[1]]
  TYPE <- strsplit(g_split3[1], ": ")[[1]][2]  # Type
  AC <- strsplit(g_split3[2], ": ")[[1]][2]    # Anticodon
  SEQ <- strsplit(g_split1[3], ": ")[[1]][2]   # Sequence
  STR <- strsplit(g_split1[4], ": ")[[1]][2]   # Structure string
  return(c(ID, LEN, TYPE, AC, SEQ, STR))
}
# This will be a high-dimensional list with all of the data, which you can then manipulate as you like
get_structs <- function(u) {
  struct_url <- paste(base_url, u, "/", u, "-structs.html", sep = "")
  raw_data <- getURLContent(struct_url)
  s_split1 <- strsplit(raw_data, "<PRE>")[[1]]
  all_data <- s_split1[seq(3, length(s_split1))]
  data_list <- lapply(all_data, parse_genomes)
  for (d in 1:length(data_list)) { data_list[[d]] <- append(data_list[[d]], u) }
  return(data_list)
}
# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list <- lapply(genomes[1:2], get_structs)  # Limit to the first two genomes (Bdist & Spurp); a full scrape will take a LONG time
genomes_rows <- unlist(genomes_list, recursive = FALSE)  # The recursive=FALSE saves a lot of work; now we can just do a straightforward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>") # Some malformed web pages produce bad rows, but we can remove them
head(genome_data)
The resulting data frame contains seven columns for each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess at how to organize the data. Here is what it looks like:
head(genome_data)
ID LEN TYPE AC SEQ
1 Scaffold17302.trna1 (1426-1498) 73 bp Ala AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2 Scaffold20851.trna5 (43038-43110) 73 bp Ala AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3 Scaffold20851.trna8 (45975-46047) 73 bp Ala AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4 Scaffold17302.trna2 (2514-2586) 73 bp Ala AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6 Scaffold17302.trna4 (6027-6099) 73 bp Ala AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
STR NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
I hope this helps, and thanks for the fun little Sunday afternoon R challenge!
Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job Drew.
Interesting problem, and I agree that R is cool, but somehow I find R to be a bit cumbersome in this respect. I prefer to get the data in an intermediate plain-text form first, in order to be able to verify that the data is correct at every step... If the data is ready in its final form, or for uploading your data somewhere, RCurl is very useful.
Simplest in my opinion would be to (on Linux/Unix/Mac, or in Cygwin) just mirror the entire http://gtrnadb.ucsc.edu/ site (using wget), take the files named *-structs.html, sed or awk the data you would like, and format it for reading into R.
I'm sure there would be lots of other ways also.
