I am working on a project where I have to map firms that have an SIC industry classification to the corresponding Fama-French industry classification. I have found that Ian Gow has graciously created a script to do this. The script is available at the following URL: https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/
However, there is a glitch in the script or in the data set: for some reason it does not work with “Siccodes30.txt”. More specifically, it does not produce the correct mapping for the lines related to “6726-6726 Unit inv trusts, closed-end” in “Siccodes30.txt”. I have been trying to figure out the source of the problem, but I have not been successful.
In the post below, I have included the original script (there is some room to make it more efficient), and I have added a few lines at the end to make it work with an online example.
Original script (I have removed the comments to make the post shorter). Again, this is not my script; the original is at https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/
url4FF <- paste("http://mba.tuck.dartmouth.edu",
                "pages/faculty/ken.french/ftp",
                "Industry_Definitions.zip", sep="/")
f <- tempfile()
download.file(url4FF, f)
fileList <- unzip(f, list=TRUE)
trim <- function(string) {
  ifelse(grepl("^\\s*$", string, perl=TRUE), "",
         gsub("^\\s*(.*?)\\s*$", "\\1", string, perl=TRUE))
}
extract_ff_ind_data <- function(file) {
  ff_ind <- as.vector(read.delim(unzip(f, files=file), header=FALSE,
                                 stringsAsFactors=FALSE))
  # The first 10 characters hold the industry number and abbreviation
  ind_num <- trim(substr(ff_ind[,1], 1, 10))
  # Carry each industry header down to the SIC-range lines below it
  for (i in 2:length(ind_num)) {
    if (ind_num[i] == "") ind_num[i] <- ind_num[i-1]
  }
  sic_detail <- trim(substr(ff_ind[,1], 11, 100))
  # Description lines start with a non-digit; SIC-range lines start with digits
  is.desc <- grepl("^\\D", sic_detail, perl=TRUE)
  regex.ind <- "^(\\d+)\\s+(\\w+).*$"
  # Extract the abbreviation before overwriting ind_num with the bare number;
  # in the other order ind_abbrev would end up holding the number instead
  ind_abbrev <- gsub(regex.ind, "\\2", ind_num[is.desc], perl=TRUE)
  ind_num <- gsub(regex.ind, "\\1", ind_num, perl=TRUE)
  ind_list <- data.frame(ind_num=ind_num[is.desc], ind_abbrev,
                         ind_desc=sic_detail[is.desc])
  regex.sic <- "^(\\d+)-(\\d+)\\s*(.*)$"
  ind_num <- ind_num[!is.desc]
  sic_detail <- sic_detail[!is.desc]
  sic_low  <- as.integer(gsub(regex.sic, "\\1", sic_detail, perl=TRUE))
  sic_high <- as.integer(gsub(regex.sic, "\\2", sic_detail, perl=TRUE))
  sic_desc <- gsub(regex.sic, "\\3", sic_detail, perl=TRUE)
  sic_list <- data.frame(ind_num, sic_low, sic_high, sic_desc)
  return(merge(ind_list, sic_list, by="ind_num", all=TRUE))
}
FFID_30 <- extract_ff_ind_data("Siccodes30.txt")
I have added the following lines to allow testing the script:
library(gsheet)
url <-"https://docs.google.com/spreadsheets/d/1QRv8YmJv0pdhIVmkXMQC7GQuvXV21Kyjl9pVZsSPEAk/gid=1758600626"
companiesSIC <- read.csv(text=gsheet2text(url, format='csv'), stringsAsFactors=FALSE)
names(companiesSIC)
library(sqldf)
companiesFFID_30 <- sqldf("SELECT a.gvkey, a.SIC, b.ind_desc AS FF30,
                                  b.ind_num AS FFIndNum30
                           FROM companiesSIC AS a
                           LEFT JOIN FFID_30 AS b
                           ON a.SIC BETWEEN b.sic_low AND b.sic_high")
companiesFFID_30
The results in rows 141 and 142 are wrong: instead of an industry number, they contain a string.
Thanks
PS: As I said, there is room to make the script shorter (e.g., you don't need to create a separate function to remove whitespace; you can use trimws), but to give credit to the original author, I kept the script in its original form. However, whoever solves the problem should feel free to update the rest of the script too.
There is nothing wrong with the script. The problem is in the formatting of two lines (141 and 142) of the txt file.
I opened the text file with a text editor, deleted and re-typed the content of these two lines. When I re-ran the R script, the problem was gone.
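If you would rather not edit the file by hand, here is a minimal diagnostic sketch, assuming the culprit is a hidden non-ASCII character (e.g., a non-breaking space) on those two lines; that is only a guess at what re-typing them fixed:
# Extract the file once and inspect the raw bytes of the suspect lines;
# anything outside plain ASCII would explain why the regexes fail to match.
path <- unzip(f, files = "Siccodes30.txt")
txt  <- readLines(path, warn = FALSE)
lapply(txt[141:142], charToRaw)
# Assuming stray non-ASCII characters are the problem, replace them with
# spaces and rewrite the extracted copy before parsing it.
writeLines(gsub("[^\x20-\x7e]", " ", txt, perl = TRUE), path)
Note that extract_ff_ind_data() calls unzip() again and would overwrite the cleaned copy, so you would need to point read.delim() at the cleaned file rather than re-extracting it.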
I want to import multiple PDF files into R, but each page has four columns, a header/footer line, and a table of contents.
For the purpose of text mining I want to remove them from my file or character vector.
Right now I am using two functions to read in the files. The first is pdf_text, because it keeps the pages but can't deal with the four columns. The second is extract_text; on its own it doesn't keep the pages but can deal with the column structure (and copes decently with any tables that occur).
But neither of them is able to remove the table of contents (as far as I have tried).
My data set is not exactly minimal, but otherwise I had some problems with the data structures. Here is working code:
################ relevant code ##############
library(pdftools)
library(tidyverse)
library(tabulizer)
library(tidytext)   # for unnest_tokens() used below
files_name <- "Nachhaltigkeit 2021.pdf"
file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf",
              "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
reports_list <- lapply(file_url, pdf_text)

createTibble <- function(){
  tibble_together <- NULL
  # for all files
  for(i in seq_along(files_name)){
    page_nr <- length(reports_list[[i]])
    tib <- tibble(report = rep(files_name[i], page_nr),
                  page = 1:page_nr,
                  text = gsub("\r\n", " ",
                              extract_text(files_name[[i]], pages = 1:page_nr)))
    tibble_together <- rbind(tibble_together, tib)
  }
  return(tibble_together)
}
reports_df <- createTibble()
reports_df <- createTibble()
############ code for problem visualization ###############
reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
# e.g. this part contains the table of contents, which is not intended
(reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]
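One direction I've considered, though the regex is only a guess at how the table-of-contents lines are laid out, is to drop ToC-looking lines from the pdf_text output (reports_list keeps the "\n" line breaks, unlike the gsub-cleaned tibble):
# Guess: ToC entries end in dot leaders plus a page number; this pattern
# is an assumption about the layout, not a rule verified on these reports.
drop_toc_lines <- function(page_text) {
  lines <- unlist(strsplit(page_text, "\n"))
  keep  <- !grepl("\\.{2,}\\s*\\d+\\s*$", lines)   # assumed ToC shape
  paste(lines[keep], collapse = "\n")
}
cleaned_pages <- lapply(reports_list[[1]], drop_toc_lines)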
Thanks for your help in advance.
PS: It's my first question, so if you need something, let me know.
And I know that the function createTibble probably isn't optimal, but that's not my primary concern.
I'm using R to download data from an API that uses a key. I've downloaded the data for AK into a df called officials, and I would like to download the data for the remaining states, using rbind to add each state to the df. But the format of the API call requires the state abbreviation without quotation marks; that is, stateId=AK, not stateId="AK". Is there a way to do this? I tried the code below and then realized my error in the GET command specifying stateId: the URL is sent with the literal text states[i] rather than each state's abbreviation.
library(httr)      # for GET() and content()
library(jsonlite)  # for fromJSON()

states <- c("AL","AR","AZ","CA","CO","CT")
for(i in 1:length(states)) {
  # Bug: states[i] inside the quoted URL is sent as literal text, not substituted
  temp_raw <- GET("http://api.votesmart.org/Officials.getByOfficeTypeState?key=xxx&officeTypeId=L&stateId=states[i]&o=JSON")
  my_content <- content(temp_raw, as = 'text')
  my_content2 <- fromJSON(my_content)
  temp_officials <- my_content2$candidate$candidate
  officials2022 <- rbind(officials2022, temp_officials)
}
Try this variation, using paste0 to combine the strings into the URL.
Also, notice the simplified way to loop over states, where i is directly available as each abbreviation.
Edit: forgot the GET call.
states <- c("AL","AR","AZ","CA","CO","CT")
for(i in states) {
  temp_raw <- GET(paste0("http://api.votesmart.org/Officials.getByOfficeTypeState?key=xxx&officeTypeId=L&stateId=", i, "&o=JSON"))
  ...
}
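For completeness, a sketch of the full loop under the same assumptions as the question (httr and jsonlite loaded, a real key in place of xxx, and the JSON shape my_content2$candidate$candidate from the original code):
library(httr)
library(jsonlite)

states <- c("AL","AR","AZ","CA","CO","CT")
officials2022 <- NULL   # the question starts this from the AK data instead
for (i in states) {
  url <- paste0("http://api.votesmart.org/Officials.getByOfficeTypeState?",
                "key=xxx&officeTypeId=L&stateId=", i, "&o=JSON")
  temp_raw <- GET(url)
  my_content2 <- fromJSON(content(temp_raw, as = "text"))
  officials2022 <- rbind(officials2022, my_content2$candidate$candidate)
}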
I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, I need to automate it: I need to perform the "ProcessingSection" for up to thirty individual text files. How can I do this? Can I have a table or data frame that contains thirty occurrences of "text.v", one for each scan("*.txt")?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help, as your example is not reproducible.
Assuming you're happy with the result of mars.word.v, you can turn this portion of code into a function that accepts a single argument, the result of scan:
processing_section <- function(x){
  unlist(strsplit(tolower(x), "\\W"))
}
Then, if all .txt files are in the current working directory, you should be able to list them and apply the function with:
lf <- list.files(pattern="\\.txt$")  # escape the dot so it matches a literal "."
lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
Is this what you want?
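If you also want to keep track of which file each result came from, one possible extension (the exact shape you need is an assumption on my part) is to name the list by file:
lf <- list.files(pattern = "\\.txt$")
words_by_file <- lapply(lf, function(path) {
  processing_section(scan(path, what = "character", sep = "\n"))
})
names(words_by_file) <- lf   # e.g. words_by_file[["pupil-14.txt"]]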
I have a large character vector, file, and I need to draw a random sample from it. This works fine. But I need to draw sample after sample, and for that I want to shorten file by every element that has already been drawn, so that I can draw a new sample without picking the same element more than once.
I've got a solution of sorts, but I'm interested in anything else that might work faster and, more importantly, more correctly.
Here are my tries:
Approach 1
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
a <- data.frame()
for (i in 1:length(rand_no)){
  a <- rbind(a, which.names(rand_no[i], file))
  file <- file[-a[1,1]]
}
Problem:
Warning message:
In which.names(rand_no[i], file) : 297 not matched
Approach 2
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
deleter <- function(i) {
  a <- which.names(rand_no[i], file)
  file <- file[-a]
}
lapply(1:length(rand_no), deleter)
Problem:
This doesn't work at all. Maybe I should split the question, because the second problem clearly lies with me not fully understanding lapply.
Thanks for any suggestions.
Edit
I hoped that it would work with numbers, but of course file really looks like this:
file <- c("Post-19960101T000000Z-1.tsv", "Post-19960101T000000Z-2.tsv",
          "Post-19960101T000000Z-3.tsv", "Post-19960101T000000Z-4.tsv",
          "Post-19960101T000000Z-5.tsv", "Post-19960101T000000Z-6.tsv",
          "Post-19960101T000000Z-7.tsv", "Post-19960101T000000Z-9.tsv")
Of course rand_no can't hold over 100 files with such a small sample. Therefore:
rand_no <- sample(file, 2)
Use list instead of c. Then you can set the values to NULL and they will be removed.
file[file %in% rand_no] <- NULL
This finds all instances of rand_no in file and removes them.
file <- list("Post-19960101T000000Z-1.tsv",
             "Post-19960101T000000Z-2.tsv",
             "Post-19960101T000000Z-3.tsv",
             "Post-19960101T000000Z-4.tsv",
             "Post-19960101T000000Z-5.tsv",
             "Post-19960101T000000Z-6.tsv",
             "Post-19960101T000000Z-7.tsv",
             "Post-19960101T000000Z-9.tsv")
rand_no <- sample(file, 2)
library(car) #From poster's code.
file[file %in% rand_no] <- NULL
If you are working with a large list of files, using %in% to compare strings may bog you down. In that case I would use indexes.
file <- list("Post-19960101T000000Z-1.tsv",
             "Post-19960101T000000Z-2.tsv",
             "Post-19960101T000000Z-3.tsv",
             "Post-19960101T000000Z-4.tsv",
             "Post-19960101T000000Z-5.tsv",
             "Post-19960101T000000Z-6.tsv",
             "Post-19960101T000000Z-7.tsv",
             "Post-19960101T000000Z-9.tsv")
rand_no <- sample(1:length(file), 2)
library(car) #From poster's code.
file[rand_no] <- NULL
sample() already returns values in a permuted order without replacement (unless you set replace=TRUE), so it will never pick the same value twice.
So if you want three sets of 100 samples that don't share any elements, you can use
file <- 1:10000
rand_no <- sample(seq_along(file), 300)
s1 <- file[rand_no[1:100]]
s2 <- file[rand_no[101:200]]
s3 <- file[rand_no[201:300]]
Or if you wanted to decrease the total size by 100 each time, you could do
s1 <- file[-rand_no[1:100]]
s2 <- file[-rand_no[1:200]]
s3 <- file[-rand_no[1:300]]
A simple approach would be to select random indices and then remove those indices:
file <- 1:10000 # Build sample data
ind <- sample(seq(length(file)), 100) # Select random indices
rand_no <- file[ind] # Compute the actual values selected
file <- file[-ind] # Remove selected indices
I think using sample and split could be a nice way of doing this, without having to alter your files variable. I'm not a big fan of mutation, unless you really need to, and this would let you know exactly which files you used for each chunk of the analysis going forward.
files <- paste("file", 1:100, sep="_")
randfiles <- sample(files, 50)
randfiles_chunks <- split(randfiles, seq(1, length(randfiles), by=10))
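A quick look at what this produces; note that split() recycles the shorter grouping vector, so the five batches of ten are interleaved rather than contiguous, which is fine when batch membership only needs to be random:
length(randfiles_chunks)   # 5 batches of 10 file names each
randfiles_chunks[[1]]      # the file names assigned to the first batch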
I'm facing a challenge in R. I'm writing code that incorporates another program, written in C++, called MHX.
MHX is used for chemical data analysis: you feed it some concentrations, etc. The integration between R and MHX works fine, so I'm able to write my MHX definitions in the form of cat(CODE HERE) and then call a bash command to run MHX from the terminal.
The results from MHX come back as tab-delimited data tables that I am able to read without a problem in R. The problem is that I use R to simulate a large number of MHX calculations using loops.
Hence the need to write dynamic variables, and here is where I'm stuck. Let me give you more information, with examples from my R code:
for (i in 1:100) {
  fin <- file.create("input/ex1")        # MHX input file
  fout <- file.create("output/ex1.out")  # MHX output file
  FNM <- paste0("table_data/pH", i, ".txt")  # filename used inside the MHX definition
  file.create(FNM)  # this is used to create the FNM table in R
  fXY <- file.create(paste0("table_data/ECOMXY", i, ".txt"))
  ifelse(HERE SOME MATHEMATICAL DEFINITIONS OF SOME VARIABLES)
  ksource(MHXCode)  # this calls my MHX code, which is inside another R script called
                    # MHXCode, using a custom function ksource. No problem here.
Up to here I don't have major problems. Now I need to set up the dynamic variables.
First, I am creating variables PHL1 to PHL100:
assign(paste("PHL", i, sep=""), read.table(paste0("table_data/pH", i, ".txt"), skip=0, sep="\t", header=TRUE, na.strings = "-Inf"))
Each PHL table contains two rows and about 20 columns. Now I am interested in creating data frames from the second row of each column. Take, for example, the column called EMF; ideally I need to do the following for all tables from PHL1 to PHL100, which is very tedious:
EMFT <- cbind(PHL1$EMF[2], PHL2$EMF[2], PHL3$EMF[2], PHL4$EMF[2], PHL5$EMF[2], PHL6$EMF[2], PHL7$EMF[2], PHL8$EMF[2], PHL9$EMF[2], PHL10$EMF[2], ....... etc up to PHL100! )
I tried many things to achieve the above, but I was not successful, including:
XX <- assign(paste0("PHL", i, "$EMF[2]"), cat(paste0("PHL", i, "$EMF[2]")))
I will need to do the same for other variables in order to be able to create some complicated plots. I hope someone will be able to help.
I must mention that the main problem with assign is that I get quoted names of variables and hence cannot return their values. Also, cat cannot be used to return a value; you get NULL in the example above. Simply put, I am stuck!
Please help.
Thanks to Justin, who gave me a clue to answer my question. Here is what I have done:
files <- list.files(path="table_data", pattern=".dat", full.names=TRUE); files
FRM <- NULL
for (f in files){
  # The [2, ] index skips all lines except line number 2 while keeping the
  # header, which is exactly what I was looking for.
  dat <- read.table(f, skip=0, header=TRUE, sep="\t", na.strings="", quote="", colClasses="character")[2, ]
  # Now I can bind it all into one table for my plots.
  FRM <- rbind(FRM, dat)
}
This is a short answer, and I think it is neat. Sorted!
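For completeness, a list-based sketch that avoids assign() entirely, under the same assumptions as the question (files table_data/pH1.txt through table_data/pH100.txt, readable with the original read.table arguments, each with an EMF column):
# Read all 100 tables into one list instead of 100 numbered PHL variables.
PHL <- lapply(1:100, function(i) {
  read.table(paste0("table_data/pH", i, ".txt"),
             skip=0, sep="\t", header=TRUE, na.strings="-Inf")
})
# Row 2 of the EMF column across all tables, as one vector.
EMFT <- sapply(PHL, function(tb) tb$EMF[2])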