I have a dataset of around 4k newspapers with some information about them. Relevant variables:
dataset
name year_first_published year_last_published
Herald 1902 1903
Jumo 1933 1990
I want to create a folder for each newspaper, named after its 'name' value, and within that folder I want to create annual subfolders going from the year first published to the year last published. So in the example above, folder "Herald" would have 2 subfolders, "1902" and "1903", whereas "Jumo" would have 58 subfolders named in similar fashion from 1933 to 1990.
subfolder_names <- dataset$name
for (i in 1:length(subfolder_names)){
setwd("XXX/outputs")
folder<-dir.create(subfolder_names[i])
for (i in year_first_published:year_last_published){ #It breaks here or line below
folders<-dir.create(i)
}
}
Essentially I am not sure how to change the working directory after creating the folder for a newspaper. I would appreciate any advice that would allow me to do it within the same function call. I am aware that I could iterate over 'newspaper name' folders once they have been created but I was wondering if there is potentially a more direct approach.
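One way to avoid changing the working directory at all is to build each full path with file.path() and let dir.create(recursive = TRUE) create the newspaper folder and the year subfolder in one call. A minimal sketch, assuming the column names from the question; the root folder below is a temporary directory standing in for "XXX/outputs", and the two-row data frame is just the example data.

```r
# Example data from the question (the real dataset has ~4k rows)
dataset <- data.frame(
  name = c("Herald", "Jumo"),
  year_first_published = c(1902, 1933),
  year_last_published = c(1903, 1990),
  stringsAsFactors = FALSE
)

# Assumption: a throwaway root folder; replace with your own outputs folder
root <- tempfile("outputs")

for (i in seq_len(nrow(dataset))) {
  years <- seq(dataset$year_first_published[i], dataset$year_last_published[i])
  # Full paths such as <root>/Herald/1902, so no setwd() is needed
  paths <- file.path(root, dataset$name[i], as.character(years))
  for (p in paths) {
    # recursive = TRUE also creates the parent newspaper folder if it is missing
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
}
```

Because every path is absolute, the inner loop never depends on which folder was created last, and re-running the code on existing folders is harmless.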
I have downloaded some data from the following site as a zip file and extracted it onto my computer. Now, I'm having trouble trying to open the included json data files.
Running following code:
install.packages("rjson")
library("rjson")
comp <- fromJSON("statsbomb/data/competitions")
gave this error:
Error in fromJSON("statsbomb/data/competitions") : unexpected character 's'
Also, is there a way to load all files at once instead of writing individual statements each time?
Here is what I did (on a Unix system).
Clone the GitHub repo (note the location):
git clone https://github.com/statsbomb/open-data.git
Set the working directory (the directory into which you cloned the repo or extracted the zip file):
setwd("path to directory where you cloned the repo")
Read the data:
jsonlite::fromJSON("competitions.json")
With rjson: rjson::fromJSON(file="competitions.json")
To read all the files at once, move all the .json files to a single directory and use lapply/assign to assign the objects in your environment.
Result(single file):
competition_id season_id country_name
1 37 4 England
2 43 3 International
3 49 3 United States of America
4 72 30 International
competition_name season_name match_updated
1 FA Women's Super League 2018/2019 2019-06-05T22:43:14.514
2 FIFA World Cup 2018 2019-05-14T08:23:15.306297
3 NWSL 2018 2019-05-17T00:35:34.979298
4 Women's World Cup 2019 2019-06-21T16:45:45.211614
match_available
1 2019-06-05T22:43:14.514
2 2019-05-14T08:23:15.306297
3 2019-05-14T08:02:00.567719
4 2019-06-21T16:45:45.211614
The function fromJSON takes a JSON string as its first argument unless you specify that you are giving it a file (fromJSON(file = "competitions.json")).
The error you mention comes from the function trying to parse 'statsbomb/data/competitions' as JSON text rather than as a file name. A JSON value has to start with a brace or bracket, a quotation mark, a digit or minus sign, or one of the literals true, false and null, so the s from "statsbomb" is not a valid first character.
To read all json files you could do:
lapply(dir("open-data-master/", pattern = "\\.json$", recursive = TRUE), function(x) {
  # pattern is a regular expression, so match a literal ".json" at the end of the name
  assign(gsub("/", "_", x), fromJSON(file = paste0("open-data-master/", x)), envir = .GlobalEnv)
})
However, this will take a long time to complete! You should probably refine this a little, e.g. split the list of files obtained with dir into chunks of 50 before running the lapply call.
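A sketch of the chunked variant, collecting the results in a named list instead of assigning into the global environment. read_json_tree and its arguments are hypothetical names; the reader argument is whatever parser you use, e.g. function(f) rjson::fromJSON(file = f). The demo passes readLines as a stand-in so that it runs without extra packages. Chunking does not make the parsing itself faster, but it lets you report progress and stop part-way through.

```r
# Hypothetical helper: read every .json file under `root` in chunks
read_json_tree <- function(root, reader, chunk_size = 50) {
  # Relative paths, e.g. "events/1.json"
  rel <- list.files(root, pattern = "\\.json$", recursive = TRUE)
  chunks <- split(rel, ceiling(seq_along(rel) / chunk_size))
  out <- list()
  for (chunk in chunks) {
    message("Reading ", length(chunk), " file(s), starting with ", chunk[1])
    parsed <- lapply(file.path(root, chunk), reader)
    # Same naming scheme as above: "events/1.json" becomes "events_1.json"
    names(parsed) <- gsub("/", "_", chunk, fixed = TRUE)
    out <- c(out, parsed)
  }
  out
}

# Demo on a throwaway directory tree (assumption: stands in for open-data-master/)
root <- tempfile("json_demo")
dir.create(file.path(root, "events"), recursive = TRUE)
dir.create(file.path(root, "lineups"), recursive = TRUE)
writeLines('{"id": 1}', file.path(root, "events", "1.json"))
writeLines('{"id": 2}', file.path(root, "events", "2.json"))
writeLines('{"id": 1}', file.path(root, "lineups", "1.json"))

# Stand-in reader; in practice use e.g. function(f) rjson::fromJSON(file = f)
all_json <- read_json_tree(root, reader = function(f) readLines(f, warn = FALSE),
                           chunk_size = 2)
```

Individual files are then available as all_json[["events_1.json"]] and so on.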
I have 4 files in the folder named import_xxx.xlsx.
I need to do the following tasks:
1. First, apply common header names to all files in the folder.
2. Write the corrected data as separate files in another folder.
I have tried the code below.
1. Read the list of files
filenames_list <- list.files(pattern= ".xls", full.names=TRUE)
My question is how to find the header names and apply the new names to all files.
My column names are as follows, with sample data.
Sr No | Invoice Date | Invoice No | Payer Name | IGMNo | Container No | Size | Type | Act. gate in Date | Container Agent | Container Agent Name | Importer Name | CHA Code | CHA Name | Activity Description | Amount | Service Tax | Total
1 | 8-1-2018 12:12:29 AM | MII180800001 | SAME DEUTZ FAHR INDIA PRIVATE LIMITED | 2200750 | ECMU9674562 | 40 | GB | 7-26-2018 4:50:35 AM | CLC007 | CMA CGM | SAME DEUTZ FAHR INDIA PRIVATE LIMITED | CHS020 | SEAKING CARGO SERVICES (I) PVT LTD | Handling & PNR Movement Charges-FCL | 10400 | 1872 | 12272
2 | 8-1-2018 12:12:29 AM | MII180800001 | SAME DEUTZ FAHR INDIA PRIVATE LIMITED | 2200750 | ECMU9674562 | 40 | GB | 7-26-2018 4:50:35 AM | CLC007 | CMA CGM | SAME DEUTZ FAHR INDIA PRIVATE LIMITED | CHS020 | SEAKING CARGO SERVICES (I) PVT LTD | Value Added Charges | 2000 | 360 | 2360
I also need to perform the data transformation below, which converts lower-case characters to upper case in the Activity Description column.
# Column names containing spaces must be accessed with [[ ]] or backticks
df[["Activity Description"]] <- toupper(df[["Activity Description"]])
Do I need to loop through the files to write them? The following code writes one file, but I need to loop through and write the applied changes for all files.
write.xlsx2(filename,"path")
Can anyone help me loop through the files, perform the header transformation, and write the files within the loop?
Thanks.
I would do something along the lines of the following. You will have to define the vector with the new common header, common_header.
(Untested.)
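library(xlsx)
# The pattern is a regular expression; this matches names ending in .xls or .xlsx
filenames_list <- list.files(pattern = "\\.xlsx?$", full.names = TRUE)
target_dir <- "path/to/target/directory"
lapply(filenames_list, function(fl) {
  # read.xlsx2 needs to be told which sheet to read; this assumes the first one
  DF <- read.xlsx2(fl, sheetIndex = 1)
  names(DF) <- common_header
  target_fl <- file.path(target_dir, basename(fl))
  # row.names = FALSE avoids writing an extra column of row numbers
  write.xlsx2(DF, target_fl, row.names = FALSE)
})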
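The per-file step can be tried on a plain data frame before wiring it into the xlsx loop. A sketch, assuming a hypothetical helper standardise_sheet and, for brevity, a three-column common_header (the real one would list all the column names from the question). It also folds in the upper-casing of the Activity Description column that the question asks for.

```r
# Assumption: shortened header for illustration; the real vector has one name per column
common_header <- c("Sr No", "Activity Description", "Amount")

# Hypothetical helper: rename the columns and upper-case selected character columns
standardise_sheet <- function(df, header, upper_cols = "Activity Description") {
  if (ncol(df) != length(header)) {
    stop("Expected ", length(header), " columns but found ", ncol(df))
  }
  names(df) <- header
  for (col in intersect(upper_cols, header)) {
    if (is.character(df[[col]])) df[[col]] <- toupper(df[[col]])
  }
  df
}

# Demo with inconsistent input names, as might come from one of the files
raw <- data.frame(
  sr.no = 1:2,
  activity = c("Handling & PNR Movement Charges-FCL", "Value Added Charges"),
  amt = c(10400, 2000),
  stringsAsFactors = FALSE
)
clean <- standardise_sheet(raw, common_header)
```

Inside the lapply call, the same helper would sit between the read and the write; the column-count check stops early if one file has a different layout instead of silently mislabelling columns.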
I have filenames in a directory like:
ACCT_GA12345_2015-01-10.xml
ACCT_GA12345_2015-01-09.xml
ACCT_GDC789g_2015-01-09.xml
ACCT_GDC567g_2015-01-09.xml
ACCT_GDC567g_2015-01-08.xml
ACCT_GCC7894_2015-01-01.xml
ACCT_GCC7894_2015-01-02.xml
ACCT_GAC7884_2015-02-01.xml
ACCT_GAC7884_2015-01-01.xml
I want to keep only the latest file for each account in the folder. The latest file has to be determined from the file name alone (NOT the file's timestamp). For example, ACCT 12345 has files from 1/10 and 1/09: I need to delete the 1/09 file and keep only the 1/10 file. For ACCT 789g there is only one file, so I have to keep it. For ACCT 567g the latest file is 1/09, so I have to remove 1/08 and keep 1/09. In other words, the file to keep is identified by the ACCT plus the maximum date for that ACCT.
I would need the final list of files as:
ACCT_GA12345_2015-01-10.xml
ACCT_GDC789g_2015-01-09.xml
ACCT_GDC567g_2015-01-09.xml
ACCT_GCC7894_2015-01-02.xml
ACCT_GAC7884_2015-02-01.xml
Can someone help me with this command in unix? Any help is appreciated
I'd do something like this. To test, start with the ls command; once it lists exactly what you want to delete, switch to rm.
ls ACCT_{GDC,GA1}*-{09,10}.xml
This lists any GDC or GA1 files whose names end in -09.xml or -10.xml. You can play with combinations and different values until the set of files showing is exactly the set you want deleted. Once you do, just change ls to rm and you should be golden.
With some more info I could help you out further. To test this out I did:
touch ACCT_{GDC,GA1}_{01..10}_{05..10}.xml
This makes 120 dummy files (2 prefixes x 10 x 6 values) with different combinations. Make a directory, run this command, and get your hands dirty. That is the best way to learn the Linux CLI. Also, 65% of the commands you need you will learn, understand, use, and then never use again, so learn how to teach yourself from the man pages and set up a spot to play around in.
I'm planning to use R to do some simple text mining tasks. Specifically, I would like to do the following:
Automatically read each html file within a folder, then
For each file, do frequency count of some particular words (e.g., "financial constraint" "oil export" etc.), then
Automatically write output to a csv. file using the following data structure (e.g., file 1 has "financial constraint" showing 3 times and "oil export" 4 times, etc.):
file_name count_financial_constraint count_oil_export
1 3 4
2 0 3
3 4 0
4 1 2
Can anyone please let me know where I should start? So far I think I've figured out how to clean the html files and do the count, but I'm still not sure how to automate the process (I really need this, as I have around 5 folders containing about 1000 html files each). Thanks!
Try this:
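gethtml <- function(path = ".", count_fun) {
  # Full paths avoid changing the working directory with setwd();
  # the pattern is a regular expression, not a shell glob
  files <- list.files(path, pattern = "\\.html?$", full.names = TRUE)
  htmlcount <- numeric(0)
  for (f in files) {
    # count_fun is a function you supply: it reads one html file and returns its count
    htmlcount[basename(f)] <- count_fun(f)
  }
  return(sum(htmlcount))
}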
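To get the per-file layout from the question rather than a single total, something along these lines could work. It is only a sketch in base R: count_terms is a hypothetical name, tags are stripped with a crude regular expression (not a real HTML parser), and matching is case-insensitive on literal phrases. The demo writes two small html files to a temporary folder so that it is self-contained.

```r
# Hypothetical helper: one row per html file, one count column per search term
count_terms <- function(path, terms) {
  files <- list.files(path, pattern = "\\.html?$", full.names = TRUE)
  rows <- lapply(files, function(f) {
    text <- tolower(paste(readLines(f, warn = FALSE), collapse = " "))
    text <- gsub("<[^>]+>", " ", text)      # crude tag removal (assumption: simple markup)
    text <- gsub("[[:space:]]+", " ", text) # collapse runs of whitespace
    vapply(terms, function(term) {
      hits <- gregexpr(tolower(term), text, fixed = TRUE)[[1]]
      sum(hits > 0)                         # gregexpr returns -1 when there is no match
    }, numeric(1))
  })
  counts <- do.call(rbind, rows)
  colnames(counts) <- paste0("count_", gsub(" ", "_", terms))
  data.frame(file_name = basename(files), counts, row.names = NULL,
             stringsAsFactors = FALSE)
}

# Demo files in a throwaway folder
html_dir <- tempfile("html")
dir.create(html_dir)
writeLines(c("<html><body><p>Financial constraint and oil export.</p>",
             "<p>Another financial constraint here.</p></body></html>"),
           file.path(html_dir, "a.html"))
writeLines("<p>Oil export, oil export.</p>", file.path(html_dir, "b.html"))

result <- count_terms(html_dir, c("financial constraint", "oil export"))
out_file <- file.path(html_dir, "counts.csv")
write.csv(result, out_file, row.names = FALSE)
```

For several folders, the same function can be called once per folder and the resulting data frames combined with rbind before writing the csv.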
R is not intended for doing rigorous text parsing. Consequently, the tools for such tasks are limited. If you insist on doing it with R, then you had better get familiar with regular expressions and have a look at this.
However, I highly recommend using Python with the Beautiful Soup library, which is specifically designed for this task.
I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:
I would like to go through all the "species pages" present in this link:
http://gtrnadb.ucsc.edu/
So for each of them I will go to:
The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
Inside that link I wish to scrape the data in the page so that I will have a long list containing this data (for example):
chr.trna3 (1-77) Length: 77 bp
Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....
Where each line will have its own list (inside the list for each "trna", inside the list for each animal).
I remember coming across the packages RCurl and XML (in R) that can allow for such a task. But I don't know how to use them. So what I would love to have is:
1. Some suggestion on how to build such a code.
2. And recommendation for how to learn the knowledge needed for performing such a task.
Thanks for any help,
Tal
Tal,
You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:
Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
Then take that list of URLs and scrape the data you are looking for, and then stick it into a data.frame.
So, here goes...
library(RCurl)
### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]
get_data_url<-function(s) {
u_split1<-strsplit(s,"/")[[1]][1]
u_split2<-strsplit(u_split1,'\\"')[[1]][2]
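  # Keep only link targets that contain an upper-case letter and no "#" fragment;
  # a plain if/else avoids calling return() inside ifelse() and handles NA explicitly
  if (!is.na(u_split2) && grepl("[[:upper:]]", u_split2) && length(strsplit(u_split2,"#")[[1]])<2) u_split2 else NA
}
# Extract only those elements that are relevant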
genomes<-unlist(lapply(links,get_data_url))
genomes<-genomes[which(is.na(genomes)==FALSE)]
### 2) Now, scrape the genome data from all of those URLS ###
# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data in to a multi-dimensional list
parse_genomes<-function(g) {
g_split1<-strsplit(g,"\n")[[1]]
g_split1<-g_split1[2:5]
# Pull all of the data and stick it in a list
g_split2<-strsplit(g_split1[1],"\t")[[1]]
ID<-g_split2[1] # Sequence ID
LEN<-strsplit(g_split2[2],": ")[[1]][2] # Length
g_split3<-strsplit(g_split1[2],"\t")[[1]]
TYPE<-strsplit(g_split3[1],": ")[[1]][2] # Type
AC<-strsplit(g_split3[2],": ")[[1]][2] # Anticodon
SEQ<-strsplit(g_split1[3],": ")[[1]][2] # Sequence
STR<-strsplit(g_split1[4],": ")[[1]][2] # Secondary structure string
return(c(ID,LEN,TYPE,AC,SEQ,STR))
}
# This will be a high dimensional list with all of the data, you can then manipulate as you like
get_structs<-function(u) {
struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
raw_data<-getURLContent(struct_url)
s_split1<-strsplit(raw_data,"<PRE>")[[1]]
all_data<-s_split1[seq(3,length(s_split1))]
data_list<-lapply(all_data,parse_genomes)
for (d in seq_along(data_list)) {data_list[[d]]<-append(data_list[[d]],u)} # seq_along is safe when the list is empty
return(data_list)
}
# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp), a full scrape will take a LONG time
genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work, now we can just do a straightforward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>") # Some malformed web pages produce bad rows, but we can remove them
head(genome_data)
The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess for data organization. Here is what it looks like:
head(genome_data)
ID LEN TYPE AC SEQ
1 Scaffold17302.trna1 (1426-1498) 73 bp Ala AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2 Scaffold20851.trna5 (43038-43110) 73 bp Ala AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3 Scaffold20851.trna8 (45975-46047) 73 bp Ala AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4 Scaffold17302.trna2 (2514-2586) 73 bp Ala AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6 Scaffold17302.trna4 (6027-6099) 73 bp Ala AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
STR NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
I hope this helps, and thanks for the fun little Sunday afternoon R challenge!
Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job Drew.
Interesting problem, and I agree that R is cool, but somehow I find R a bit cumbersome in this respect. I prefer to get the data in an intermediate plain-text form first, so that I can verify that the data is correct at every step. If the data is already in its final form, or for uploading your data somewhere, RCurl is very useful.
The simplest approach, in my opinion, would be to (on Linux/Unix/Mac or in Cygwin) mirror the entire http://gtrnadb.ucsc.edu/ site using wget, take the files named */*-structs.html, extract the data you want with sed or awk, and format it for reading into R.
I'm sure there would be lots of other ways also.
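If you go the plain-text route, the four-line records can also be picked apart in R with base regular expressions once the text is on disk. A rough sketch, assuming each record has exactly the layout of the example in the question; parse_trna_record and grab are hypothetical names, and real pages may contain malformed records that need extra checks.

```r
# Hypothetical helper: turn the four lines of one tRNA record into a named vector
parse_trna_record <- function(lines) {
  # Return the first capture group of `pattern` in `x`, or NA if there is no match
  grab <- function(pattern, x) {
    m <- regmatches(x, regexec(pattern, x))[[1]]
    if (length(m) >= 2) m[2] else NA_character_
  }
  c(
    ID    = grab("^([^[:space:]]+)", lines[1]),
    LEN   = grab("Length:[[:space:]]*([0-9]+)", lines[1]),
    TYPE  = grab("Type:[[:space:]]*([^[:space:]]+)", lines[2]),
    AC    = grab("Anticodon:[[:space:]]*([^[:space:]]+)", lines[2]),
    SCORE = grab("Score:[[:space:]]*([0-9.]+)", lines[2]),
    SEQ   = grab("^Seq:[[:space:]]*([^[:space:]]+)", lines[3]),
    STR   = grab("^Str:[[:space:]]*([^[:space:]]+)", lines[4])
  )
}

# The example record from the question
record <- c(
  "chr.trna3 (1-77) Length: 77 bp",
  "Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45",
  "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
  "Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
)
parsed <- parse_trna_record(record)
```

Applying this with lapply over all records of a species and then do.call(rbind, ...) gives one row per tRNA, which can be checked against the plain-text files at every step.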