read.px C stack usage issue in R

I am following the tutorial here to practice rayshader. However, when I use the code below I get this error:
Error: C stack usage 17812428 is too close to the limit
The error appears at the tutorial step corresponding to the code posted below; sample data is provided in the tutorial.
Code:
tbl_census_2018 <- read.px("data/census_2018.px") %>% # Load & format
  as_tibble()
Package pxR's GitHub page has more info about the read.px function, which I am pasting below in case it helps.
How can I fix this?
#################################################################
#
# File: read.px.R
# Purpose: reads a PC-Axis file into R
#
# Created: 20110618
# Authors: fvf, cjgb, opl
#
# Modifications:
# 20111210, cjgb: in the data string, "-" may represent the value 0
# 20111210, cjgb: fixing the strsplit when the split character is contained in the data part
# 20120329, cjgb: number strings in the DATA part can contain ";" as separators.
# Although deprecated, cases still lurk.
# 20130228, cjgb: There can be ; inside quoted strings that create chaos
# 20130608 fvf: Ability to read files with keys in data area.
# ":" added to defaut na.string (EuroStat files)
# 20130624: use str_split (line 91) to read DATA area
# 20130917, cjgb: changes to prevent errors with EOL characters
# 20131115, cjgb: some files do not have heading (or stub): only one of
# them is really required
# 20131118, cjgb: fixed a bug happening when missing (i.e. "..") was the last value in DATA
# fixing it required that the last quote was not eliminated (same for first quote)
# 20141222, fvf: fixing some bugs related to reading files with KEYS (sparse array)
# 20150211, fvf: The parameter "encoding" is NULL by default. "encoding" is determined by
# the file itself: if CHARSET="ANSI" then "latin1" else "CP437".
# 20150212. fvf: I had to delete the 20130917 change: tmp[2] <- gsub(";.*", "", tmp[2]);
# many px-files have a semicolon at the end of each line in the DATA area,
# e.g.: read.px('http://www.ine.es/pcaxisdl//t20/e245/p05/a2002/l0/00004001.px')
# 20150216. fvf minor correction of a bug in the modification: 20150211,fvf
# 20150219. fvf Solving a bug: a missing "DROP=FALSE" was producing a read error on files with a single key
#################################################################
read.px <- function(filename, encoding = NULL,
                    na.strings = c('"."', '".."', '"..."', '"...."', '"....."', '"......"', '":"')) {
  # (uses ldply from plyr and the str_* helpers from stringr, which pxR imports)

  ## auxiliary functions ##
  clean.spaces <- function(x){
    gsub("^[[:space:]]+|[[:space:]]+$", "", x)  # discards heading|trailing whitespace
  }

  get.attributes <- function(x){
    x <- gsub("([A-Z-]*)\\((.*)\\).*", "\\1;\\2", x)  ## separates label-attribute with ";"
    x <- ldply(strsplit(x, ";"),
               function(y) c(y, "value")[1:2])
  }

  break.clean <- function(x) {
    x <- clean.spaces(strsplit(x, split = '\\"')[[1]])  ## breaks by '"'
    x[! x %in% c(",", "")]                              ## and drops spurious seps
  }
  ## end: auxiliary functions ##

  # modification by fvf (150211): determine the character encoding used in the file
  if (is.null(encoding)) {
    charset <- readLines(filename, 5)  # read the first five lines
    encoding <- ifelse(any(grepl('CHARSET.*ANSI', charset, ignore.case = T)),
                       "latin1", "CP437")  # checked on Debian and OS X
  }

  a <- scan(filename, what = "character", sep = "\n", quiet = TRUE, fileEncoding = encoding)

  # modification by fvf: 130608
  a <- paste(a, collapse = "\n")  # keep the CR/LF for now; they are removed selectively later

  tmp <- strsplit(a, "DATA=")[[1]]
  tmp[1] <- gsub("\n", " ", tmp[1])  # fvf[130608]: removes CR from the header
  tmp[2] <- gsub(";", "", tmp[2])    # fvf[150212]: the rev 92-94 change caused many problems with INEBase,
                                     # e.g.: read.px('http://www.ine.es/pcaxisdl//t20/e245/p05/a2002/l0/00004001.px')
                                     # in many files every line of the DATA area has ";" before the EOL,
                                     # which made only the first data line be read
  a <- paste(tmp[1], "DATA=", tmp[2], sep = "")

  ## modification by cjgb, 20130228 concerning line separators within quoted strings
  ## ; is the logical line end in px files
  ## so we should do:
  ##   a <- unlist(strsplit(a, ";"))
  ## but there might be ; inside quoted strings
  ## so we need the following workaround:
  punto.coma <- str_locate_all(a, ";")[[1]][, 1]  # where the ";" are
  comillas   <- str_locate_all(a, '"')[[1]][, 1]  # where the '"' are

  ## ";" not after an odd number of '"'
  ## these are the proper "cuts"
  cortes <- Filter(function(x) sum(comillas < x) %% 2 == 0, punto.coma)

  a <- str_sub(a, c(1, cortes + 1), c(cortes - 1, str_length(a)))
  a <- a[!is.na(a)]
  a <- a[a != ""]
  ## end of modification by cjgb, 20130228 concerning line separators within quoted strings

  # change strsplit to str_split; in big px-files strsplit gave
  # "Error: C stack usage is too close to the limit"
  a <- do.call(rbind, str_split(a, "=", n = 2))

  ## fvf.20141222: do not change to factor: ++ stringsAsFactors = F
  a <- data.frame(cbind(get.attributes(a[, 1]), a[, 2], stringsAsFactors = F))
  colnames(a) <- c("label", "attribute", "value")

  ## build a px object: list with px class attribute ##
  a$label     <- make.names(clean.spaces(a$label))
  a$attribute <- make.names(clean.spaces(gsub('\\"', "", a$attribute)))

  # need to avoid that quotes are removed in the DATA part because of a bug:
  # a case was reported where the data part ended in ".." and the last quote was erased,
  # and this affected the scan function below
  a.data  <- as.character(a[a$label == "DATA", "value"])
  a.value <- gsub('^\\"|\\"$', "", a$value)  # removes " at beginning / end
  a.value[a$label == "DATA"] <- a.data

  names(a.value) <- a$attribute
  px <- tapply(a.value, a$label, as.list)

  ## these metadata keys contain vectors (comma separated)
  ## we need to split them (and clean the mess: extra spaces, etc.)
  px$STUB$value    <- if (!is.null(px$STUB))    make.names(break.clean(px$STUB$value))
  px$HEADING$value <- if (!is.null(px$HEADING)) make.names(break.clean(px$HEADING$value))
  px$VALUES <- lapply(px$VALUES, break.clean)

  # fvf.20141222: if there are no CODES, do not create CODES
  if (!is.null(px$CODES))
    px$CODES <- lapply(px$CODES, break.clean)

  # fvf.20141222: replaces ["~~~~" "~~~~~"] with ["~~~~~"\n"~~~~"] in
  # multi-line fields with lost line breaks (easier for humans to read)
  px <- lapply(px, function(e){
    if (!is.null(e$value))
      e$value <- gsub('"[[:space:]]+"', '"\n"', e$value)
    e
  })

  #### read the data part into a 'melted' dataframe ###
  ## there are two cases: files with/without the KEYS keyword,
  ## which need to be processed independently

  # fvf[130608]: added to read files with keys in the data area
  if ("KEYS" %in% a$label) {

    ## read the whole block
    tc <- textConnection(px$DATA$value); on.exit(close(tc))
    raw <- read.table(tc, sep = ",", colClasses = "factor")

    ## extract and process the data part (the numbers)
    data.part <- as.character(raw[, ncol(raw)])  # numbers (last column of the data.frame)
    data.part <- gsub('"-"', 0, data.part)       # 0's might be encoded as "-"
    data.part <- scan(text = data.part, na.strings = na.strings, quiet = T)

    ## extract and process the keys part (it needs to be stacked a number of times:
    ## as many as there are entries in the data vector in each row in the block)
    keys.part <- raw[, -ncol(raw), drop = FALSE]
    keys.part <- keys.part[rep(1:nrow(keys.part), each = length(data.part) / nrow(keys.part)), , drop = FALSE]
    colnames(keys.part) <- names(px$KEYS)

    ## change CODES (if any) in keys part to VALUES (consistency issue)
    # for (col.name in colnames(keys.part)[unlist(px$KEYS) == "CODES"])
    #   keys.part[[col.name]] <- mapvalues(keys.part[[col.name]],
    #                                      from = px$CODES[[col.name]],
    #                                      to   = px$VALUES[[col.name]])
    # fvf.20141222:
    for (col.name in colnames(keys.part)) {
      if (px$KEYS[[col.name]] == 'CODES') {
        keys.part[[col.name]] <- factor(keys.part[[col.name]], levels = px$CODES[[col.name]])
        levels(keys.part[[col.name]]) <- px$VALUES[[col.name]]  ## all levels as VALUES
      } else keys.part[[col.name]] <- factor(keys.part[[col.name]], levels = px$VALUES[[col.name]])
    }

    ## extract and process the variables that are not keys
    no.keys.part <- px$VALUES[setdiff(names(px$VALUES), names(px$KEYS))]
    no.keys.part <- expand.grid(rev(no.keys.part))

    ## put everything together & clean up
    px$DATA$value <- data.frame(keys.part,
                                no.keys.part,
                                value = data.part,
                                row.names = NULL)
  }
  else {
    tmp <- gsub('"-"', 0, px$DATA$value)  # 0 can be encoded as "-"
    tmp <- gsub("\n", " ", tmp)           # delete CR/LF of the DATA area fvf[130608]

    tc <- textConnection(tmp); on.exit(close(tc))
    raw <- scan(tc, na.strings = na.strings, quote = NULL, quiet = TRUE)

    names.vals  <- c(rev(px$HEADING$value), rev(px$STUB$value))
    output.grid <- data.frame(do.call(expand.grid, px$VALUES[names.vals]))

    # sanity check: avoids the problem of "recycling" of values if
    # the ratio of the lengths of variables and values is an exact integer
    if (nrow(output.grid) != length(raw))
      stop("The input file is malformed: data and varnames length differ")

    px$DATA$value <- data.frame(output.grid, raw)
    colnames(px$DATA$value) <- c(names.vals, "value")
  }

  class(px) <- "px"
  px
}

This should have been fixed in the new pxR release (version 0.42.6); updating the package should resolve the error.
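In case it helps, here is a minimal sketch of the upgrade-and-retry path (package, file, and object names taken from the question):
# update pxR to the fixed release (0.42.6 or later), then retry the read
install.packages("pxR")

library(pxR)
library(dplyr) # for %>% and as_tibble()

tbl_census_2018 <- read.px("data/census_2018.px") %>% # Load & format
  as_tibble()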

Related

How to process many txt files with my code in R

I'm quite a novice, but I've successfully managed to make some code do what I want.
Right now my code does what I want for one file at a time; I want to automate this process for 600 files.
I have a rough idea that I need to put the list of files in a vector and then use lapply with a function, but I'm not sure how to do this; the syntax and code are beyond me at the moment.
Here's my code...
# Packages are called
library(tm)        # text mining
library(SnowballC) # stemming - reducing words to their root
library(stringr)   # for str_trim
library(plyr)
library(dplyr)
library(readtext)

# This is my attempt to run the code on a bunch of text files. Obviously it's
# unfinished, and I'm not sure if this is the right approach. Where do I put
# this? Will it even work?
data_files <- list.files(path = "data/", pattern = '*.txt', full.names = T, recursive = T)
lapply(
#
# where do I put this chunk of code?
# do I need to make all the code below a function?
##this bit cleans the document
company <- "CompanyXReport2015"
txt_raw = readLines("data/CompanyXReport2015.txt")
# remove all extra white space, also splits on lines
txt_format1 <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", txt_raw)
txt_format1.5 <- gsub("^ +| +$|( ) +", "\\1", txt_format1)
# recombine now that all white space is stripped
txt_format2 <- str_c(txt_format1.5, collapse=" ")
#split strings on space now to get a list of all words
txt_format3 <- str_split(txt_format2," ")
txt_format3
# convert to vector
txt_format4 <- unlist(txt_format3)
# remove empty strings and those with words shorter than 3 length
txt_format5 <- txt_format4[str_length(txt_format4) > 3]
# combine document back to single string
cleaned <- str_c(txt_format5, collapse=" ")
head(cleaned, 2)
##import key words and run analysis on frequency for the document
s1_raw = readLines("data/stage1r.txt")
str(s1_raw)
s2_raw = readLines("data/stage2r.txt")
str(s2_raw)
s3_raw = readLines("data/stage3r.txt")
str(s3_raw)
s4_raw = readLines("data/stage4r.txt")
str(s4_raw)
s5_raw = readLines("data/stage5r.txt")
str(s5_raw)
# str_count(cleaned, "legal")
# apply str_count function using each stage vector
level1 <- sapply(s1_raw, str_count, string=cleaned)
level2 <- sapply(s2_raw, str_count, string=cleaned)
level3 <- sapply(s3_raw, str_count, string=cleaned)
level4 <- sapply(s4_raw, str_count, string=cleaned)
level5 <- sapply(s5_raw, str_count, string=cleaned)
#make a vector from this for the report later
wordcountresult <- c(level1,level2,level3,level4,level5)
# convert to dataframes
s1 <- as.data.frame(level1)
s2 <- as.data.frame(level2)
s3 <- as.data.frame(level3)
s4 <- as.data.frame(level4)
s5 <- as.data.frame(level5)
# add a count column that each df shares
s1$count <- s1$level1
s2$count <- s2$level2
s3$count <- s3$level3
s4$count <- s4$level4
s5$count <- s5$level5
# add a stage column to identify what stage the word is in
s1$stage <- "Stage 1"
s2$stage <- "Stage 2"
s3$stage <- "Stage 3"
s4$stage <- "Stage 4"
s5$stage <- "Stage 5"
# drop the unique column
s1 <- s1[c("count","stage")]
s2 <- s2[c("count","stage")]
s3 <- s3[c("count","stage")]
s4 <- s4[c("count","stage")]
s5 <- s5[c("count","stage")]
# s1
df <- rbind(s1, s2,s3, s4, s5)
df
#write the summary for each company to a csv
#Making the report
#Make a vector to put in the report
#get stage counts and make a vector
s1c <- sum(s1$count)
s2c <- sum(s2$count)
s3c <- sum(s3$count)
s4c <- sum(s4$count)
s5c <- sum(s5$count)
stagesvec <- c(s1c,s2c,s3c,s4c,s5c)
names(stagesvec) <- c("Stage1","Stage2","Stage3","Stage4","Stage5")
#get the company report name for a vector
companyvec <- c(company)
names(companyvec) <- c("company")
# combine the vectors for the vector row to be inserted into the report
reportresult <- c(companyvec, wordcountresult, stagesvec)
rrdf <- data.frame(t(reportresult))
newdf <- data.frame(t(reportresult))
# if a working file exists - use it
if (file.exists("data/WordCount12.csv")) {
  write.csv(rrdf, "data/WordCountTemp12.csv", row.names = FALSE)
  rrdf2 <- read.csv("data/WordCountTemp12.csv")
  df2   <- read.csv("data/WordCount12.csv")
  df2   <- rbind(df2, rrdf2)
  write.csv(df2, "data/WordCount12.csv", row.names = FALSE)
} else { # if NO working file exists - make it
  write.csv(newdf, "data/WordCount12.csv", row.names = FALSE)
}
Hello :) Here is an example of a workflow; you might find better ones, but I started with it when learning.
listoftextfiles <- list.files(...)

analysis1 <- function(file) { # takes one element of listoftextfiles
  # your 1st analysis
}
res1 <- lapply(listoftextfiles, analysis1) # results of the 1st analysis

analysis2 <- function(result) { # takes one element of res1
  # your 2nd analysis
}
res2 <- lapply(res1, analysis2) # results of the 2nd analysis
# etc.
You will find many tutorials about custom functions on the internet.
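To make that workflow concrete against the code in the question, here is a minimal sketch (hedged: process_report is a hypothetical name, and it only covers the cleaning steps; the keyword counting and report writing would move inside it the same way):
library(stringr)

# one function that runs the per-file cleaning steps from the question
process_report <- function(path) {
  company <- tools::file_path_sans_ext(basename(path)) # e.g. "CompanyXReport2015"
  txt_raw <- readLines(path)
  txt <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", txt_raw) # drop 1-2 letter words
  txt <- gsub("^ +| +$|( ) +", "\\1", txt)                # squeeze whitespace
  words <- unlist(str_split(str_c(txt, collapse = " "), " "))
  cleaned <- str_c(words[str_length(words) > 3], collapse = " ")
  data.frame(company = company, cleaned = cleaned, stringsAsFactors = FALSE)
}

data_files <- list.files("data/", pattern = "\\.txt$", full.names = TRUE, recursive = TRUE)
results <- lapply(data_files, process_report) # one result per file
report  <- do.call(rbind, results)            # stack into a single data frame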

Breaking words into the letter pattern: code not working as expected

I have a list of words and a list of common character patterns found within words. The script runs through the word list, checks which character patterns are found within each word, and finally displays the results in a table.
The finished table should look like:
word   CharPatLen charpat01 charpat02 charpat03 charpat04
father 4          f         a         th        er
there  3          th        er        e
after  4          a         f         r         er
Instead, I get the table below; things start to go wrong at the field charpat03 for the word "there". The 'f' there should instead be an 'e', and the following row is blank.
word   CharPatLen charpat01 charpat02 charpat03 charpat04
father 4          f         a         th        er
there  3          th        er        f
after  4
I am also getting the following warning message, which I tried to fix through Google searches with no luck:
Warning message:
In as.numeric(paste(as.numeric(charIndexStart), charIndexEnd, sep = "")) :
  NAs introduced by coercion
Help! I am not sure what's going wrong with the script.
Thanks for taking the time to view my post.
##################################################
# This script loops through a word list, then breaks each word into the
# character (char) patterns found in the character pattern list
#
# e.g.
# using the word list (father, there, after)
# using the char pattern list (th, er, f, a, e, t)
#
# it should return the following
#
# word   CharPatLen charpat01 charpat02 charpat03 charpat04
# father 4          f         a         th        er
# there  3          th        er        e
# after  4          a         f         r         er
#####################################################
word <- c("father", "there", "after")
CharPatLen <- c(0, 0, 0)
charpat01 <- c("", "", "" )
charpat02 <- c("", "", "" )
charpat03 <- c("", "", "" )
charpat04 <- c("", "", "" )
charpat05 <- c("", "", "" )
wordList <- data.frame(word, CharPatLen, charpat01,charpat02,charpat03,charpat04,charpat05,stringsAsFactors = F)
textPat <- c("th", "er", "f","a","e","t")
frequency <- c(0,0,0,0,0,0)
textPattern <- data.frame(textPat,frequency, stringsAsFactors = F)
#######################################
# 01 loop through word list
#######################################
for (text in wordList$word) {#4loop01
# track what parts of the word a found char pattern
charSelectionTracker <- rep(1, times=nchar(text))
#found char patterns from word, order/range and the char pattern
FoundcharPatternholder <- data.frame(order= integer(),charPattern = character())
#########################################
# 02 loop through character patterns list
#########################################
for (pattern in textPattern$textPat) { #4loop02
if(sum(charSelectionTracker)== 0)
{#charSelect
#reorder patterns
rank <- order(FoundcharPatternholder$order)
FoundcharPatternholder<- FoundcharPatternholder[rank,]
wordList[which(wordList$word == text),"CharPatLen"] = nrow(FoundcharPatternholder)
for (patPao in 1:nrow(FoundcharPatternholder))
{
wordList[which(wordList$word == text),patPao+2] = as.character(FoundcharPatternholder[patPao,2])
}
break
}#charSelect
#find all char pattern in word
patFoundAt <- unlist(gregexpr (pattern,text)[[1]])
#########################################
# 03 check that each pattern within a word is valid and not used in an other char pattern
#########################################
for (charIndexStart in patFoundAt)
{#4loop03
charIndexEnd = charIndexStart + nchar(pattern)-1
if( sum(charSelectionTracker[charIndexStart:charIndexEnd]) == nchar(pattern) & sum(charSelectionTracker)> 0)
{#PatExtract
#track what letters have been used by character pattern
charSelectionTracker[charIndexStart:charIndexEnd]=0
#order/index in pattern
patIndex <- as.numeric(paste(as.numeric(charIndexStart),charIndexEnd, sep = ''))
innerPatternholder <- data.frame(order= patIndex,charPattern = pattern)
FoundcharPatternholder <- rbind(FoundcharPatternholder, innerPatternholder)
}#PatExtract
}#4loop03
} #4loop02
}#4loop01
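For what it's worth, a likely source of the warning (an educated guess from reading the code, not a verified trace): when a pattern does not occur in a word, gregexpr returns -1, so for one-character patterns both charIndexStart and charIndexEnd become -1, paste() builds the string "-1-1", and as.numeric() turns it into NA with exactly that warning; the stray -1 also corrupts charSelectionTracker through negative indexing, which would explain the wrong 'f' and the blank row. A minimal guard for the inner loop could look like:
for (charIndexStart in patFoundAt) { #4loop03
  if (charIndexStart < 0) next # gregexpr signals "no match" with -1; skip it
  charIndexEnd = charIndexStart + nchar(pattern) - 1
  if (sum(charSelectionTracker[charIndexStart:charIndexEnd]) == nchar(pattern) & sum(charSelectionTracker) > 0) { #PatExtract
    charSelectionTracker[charIndexStart:charIndexEnd] = 0
    # the start position alone is enough to restore word order later,
    # and it avoids the "-1-1" string that as.numeric() turned into NA
    innerPatternholder <- data.frame(order = charIndexStart, charPattern = pattern)
    FoundcharPatternholder <- rbind(FoundcharPatternholder, innerPatternholder)
  } #PatExtract
} #4loop03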
You are probably better off losing the nested for-loops. str_extract_all and str_count from the stringr package may be useful to simplify the code:
library(stringr)
## data
words <- c("father", "there", "after")
textPat <- paste(c("th", "er", "f","a","e","t"), collapse = "|")
## extract matching patterns
charPat <- str_extract_all(words, textPat, simplify = TRUE)
colnames(charPat) <- sprintf("charpat%02d", seq_len(ncol(charPat)))
## count matched patterns per word
charPatLen <- str_count(words, textPat)
## combine into data.frame
cbind(data.frame(word = words, CharPatLen = charPatLen), charPat)
#>     word CharPatLen charpat01 charpat02 charpat03 charpat04
#> 1 father          4         f         a        th        er
#> 2  there          3        th        er         e
#> 3  after          4         a         f         t        er
Created on 2019-07-05 by the reprex package (v0.3.0)

Dealing with commas in a CSV file in sqldf

I am following up on my question here: sqldf returns zero observations (with a reproducible example).
I found that the problem probably comes from the comma in one of the cells ("1,500+"), and I think I have to use a filter as suggested here: sqldf, csv, and fields containing commas. But I am not sure how to define my filter. Below is the code:
library(sqldf)

df <- data.frame("a" = c("8600000US01770", "8600000US01937"),
                 "b" = c("1,500+", "-"),
                 "c" = c("***", "**"),
                 "d" = c("(x)", "(x)"),
                 "e" = c("(x)", "(x)"),
                 "f" = c(992, "-"))
write.csv(df, 'df_to_read.csv')

# 'df_to_read.csv' looks like this:
# "","a","b","c","d","e","f"
# 1,8600000US01770,1,500+,***,(x),(x),992
# 2,8600000US01937,-,**,(x),(x),-

Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing', file.format = list(eol = "\n"))
When I run this code, I get the following error:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
  RS_sqlite_import: df_to_read.csv line 2 expected 7 columns of data but found 8
The problem comes from reading the column created by df$b: the first value in that column contains a comma, so the sqldf() function treats it as a separator.
One way to deal with this is to either remove the comma or use some other symbol (like a space). You can also use the read.csv2.sql function:
library(sqldf)

df <- data.frame("a" = c("8600000US01770", "8600000US01937"),
                 "b" = c("1,500+", "-"),
                 "c" = c("***", "**"),
                 "d" = c("(x)", "(x)"),
                 "e" = c("(x)", "(x)"),
                 "f" = c("992", "-"))
write.csv(df, 'df_to_read.csv', row.names = FALSE)

Housing_filtered <- read.csv2.sql("df_to_read.csv", sql = "select * from file", header = TRUE)
The best way would be to clean your file once, so that you don't need to worry about the same issue again later in your analysis. This should get you going:
Housing <- readLines("df_to_read.csv") # read the file
n <- 6 # number of separators expected = number of columns expected - 1
library(stringr)
ln_idx <- ifelse(str_count(Housing, pattern = ",") == n, 0 , 1)
which(ln_idx == 1) # line indices with issue, includes the header row
#[1] 2
Check for the specific issues and write back to your file at the same indices, e.g. line 2:
Housing[2]
#[1] "1,8600000US01770,1,500+,***,(x),(x),992" # hmm.. extra comma
Housing[2] = "1,8600000US01770,1500+,***,(x),(x),992" # removed the extra comma
writeLines(Housing, "df_to_read.csv")
Now business is as usual, good to go:
Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing')
# Housing_filtered
#                a     b   c   d   e   f
# 1 8600000US01770 1500+ *** (x) (x) 992
# 2 8600000US01937     -  ** (x) (x)   -
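If many rows were affected, the same repair could be automated with a regex instead of editing each index by hand (a sketch; the pattern assumes the only surplus comma is a thousands separator followed by a '+', as in "1,500+"):
library(stringr)

Housing <- readLines("df_to_read.csv")
n <- 6 # commas expected per well-formed line

bad <- which(str_count(Housing, ",") != n) # flag malformed lines
Housing[bad] <- sub("([0-9]),([0-9]{3}\\+)", "\\1\\2", Housing[bad]) # "1,500+" -> 1500+

writeLines(Housing, "df_to_read.csv")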

Shiny FileInput error

I'm new to Shiny, and I'm trying to convert existing code that works as an .R script into a Shiny app.
Original code - link
Sample data - link
The point is to have a fileInput where a person selects a PDF file. Then the PDF is processed.
This is my code:
# packages this snippet relies on
library(shiny)
library(pdftools) # pdf_text()
library(dplyr)    # %>%, data_frame(), count(), filter()
library(tidytext) # unnest_tokens()

server <- function(input, output) {

  output$contents <- renderTable({

    # input$file1 will be NULL initially. After the user selects
    # and uploads a file, head of that data file by default,
    # or all rows if selected, will be shown.
    req(input$file1)

    test <- pdf_text(input$file1$datapath)

    ### Two Sided
    test <- gsub("[\r\n]", " ", test)
    list <- strsplit(test, " {2,}") # split anywhere there are 2 or more consecutive spaces -
                                    # hopefully only between two paragraphs (if not, the output won't make much sense)

    resi <- lapply(list, function(x) {
      unl <- unlist(x)
      len <- length(x)
      uneven <- seq(from = 1, to = len, by = 2)
      even   <- seq(from = 2, to = len, by = 2)
      uneven <- unl[uneven]
      even   <- unl[even]
      uneven <- paste(uneven, collapse = " ")
      even   <- paste(even, collapse = " ") # intentionally leave a space between them; one could even use
                                            # something not expected to occur in the document, like "frafrafra",
                                            # and use that in the gsub call later as gsub("(\\d)-frafrafra(\\d)", "\\1\\2", resi)
      return(cbind(uneven, even))
    }) # separate even from uneven rows

    resi <- unlist(resi)
    resi <- gsub("(\\d)- (\\d)", "\\1\\2", resi) # clean numbers
    resi <- gsub("(\\b)- (\\b)", "\\1\\2", resi) # clean words
    resi <- data_frame(line = 1:length(resi), text = resi) # change class/type - vector to dataframe

    count <- resi %>%
      unnest_tokens(word, text) %>% # split columns into word-like elements (tokens)
      count(word, sort = TRUE)      # count frequency and sort in desc order

    count$word <- gsub("[^0-9]", NA, count$word)
    count$num_char <- nchar(count$word)

    two_cols <- count %>%
      filter(!is.na(word)) %>%
      filter(n == 1) %>%
      filter(num_char == 7 | num_char == 13 | num_char == 15)

    if (input$disp == "head") {
      return(head(test))
    } else {
      return(two_cols)
    }
  })
}
The code seems to work until test <- pdf_text(input$file.... However, I'm getting an error: wrong sign in 'by' argument
Any clue what is wrong?
Edit:
I am testing with the second part of the original code; it seems to be an issue with the scoping rules, that is, with defining a function within the server function.
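A hedged guess at the 'wrong sign' error, since pdf_text() itself never calls seq(): seq(from = 2, to = len, by = 2) fails with exactly wrong sign in 'by' argument whenever len < 2, i.e. whenever a page splits into fewer than two chunks. Selecting by parity instead of seq() avoids that edge case; a sketch of a safer version of the resi step:
resi <- lapply(list, function(x) {
  unl <- unlist(x)
  idx <- seq_along(unl)
  # parity-based selection never builds seq(2, len, 2), so a page that
  # splits into a single chunk no longer errors
  uneven <- paste(unl[idx %% 2 == 1], collapse = " ")
  even   <- paste(unl[idx %% 2 == 0], collapse = " ")
  cbind(uneven, even)
})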

Removing rows based on character conditions in a column

Good morning, I have created the following R code:
setwd("xxx")
library(reshape)
##Insert needed year
url <- "./Quarterly/1990_qtrly.csv"
##Writes data in R with applicable columns
qtrly_data <- read.csv(url, header = TRUE, sep = ",", quote="\"", dec=".", na.strings=" ", skip=0)
relevant_cols <- c("area_fips", "industry_code", "own_code", "agglvl_code", "year", "qtr")
overall <- c(relevant_cols, colnames(qtrly_data)[8:16])
lq <- c(relevant_cols, colnames(qtrly_data)[17:25])
oty <- c(relevant_cols, colnames(qtrly_data)[18:42])
types <- c("overall", "lq", "oty")
overallx <- colnames(qtrly_data)[9:16]
lqx <- colnames(qtrly_data)[18:25]
otyx <- colnames(qtrly_data)[seq(27,42,2)]
###Adding in the disclosure codes from each section
disc_codes <- c("disclosure_code", "lq_disclosure_code", "oty_disclosure_code")
cols_list = list(overall, lq, oty)
denom_list = list(overallx, lqx, otyx)
##Uses a two-loop peice of code to go through data denominations and categories, while melting it into the correct format
for (j in 1:length(types))
{
cat("Working on type: " , types[j], "\n")
these_denominations <- denom_list[[j]]
type_data <- qtrly_data[ , cols_list[[j]] ]
QCEW_County <- melt(type_data, id=c(relevant_cols, disc_codes[j]))
colnames(QCEW_County) <- c(relevant_cols, "disclosure_code", "text_denomination", "value")
Data_Cat <- j
for (k in 1:length(these_denominations))
{
cat("Working on type: " , types[j], "and denomination: ", these_denominations[k], "\n")
QCEW_County_Denominated <- QCEW_County[QCEW_County[, "text_denomination"] == these_denominations[k], ]
QCEW_County_Denominated$disclosure_code <- ifelse(QCEW_County_Denominated$disclosure_code == "", 0, 1)
Data_Denom <- k
QCEW_County_Denominated <- cbind(QCEW_County_Denominated, Data_Cat, Data_Denom)
QCEW_County_Denominated$Source_ID <- 1
QCEW_County_Denominated$text_denomination <- NULL
colnames(QCEW_County_Denominated) <- NULL
###Actually writes the txt file to the QCEW folder
write.table(QCEW_County_Denominated, file="C:\\Users\\jjackson\\Downloads\\QCEW\\1990_test.txt", append=TRUE, quote=FALSE, sep=',', row.names=FALSE)
}
}
Now, there are some things I need to get rid of: all the rows in my QCEW_County_Denominated dataframe where the area_fips column begins with the character "C". In that same column, there are also codes that start with "US" that I would like to replace with a 0. Finally, the industry_code column in my final dataframe has 3 values that need to be replaced: 31-33 with 31, 44-45 with 44, and 48-49 with 48.
I understand that this is a difficult task. I'm slowly figuring it out on my own, but if anyone could give me a helpful nudge in the right direction while I work on it, it would be much appreciated. Conditional statements in R are looking like my Achilles heel; it's always where I start to get confused by how the syntax differs from other statistical packages.
Thank you, and have a nice day.
You can remove and recode your data using regex and subsetting.
Using grepl, you can select the rows in the column area_fips that DON'T start with C.
QCEW_County_Denominated <- QCEW_County_Denominated[!grepl("^C", QCEW_County_Denominated$area_fips), ]
Using gsub, you can replace the "US" prefix with 0 in the area_fips values.
QCEW_County_Denominated$area_fips <- as.numeric(gsub("^US", 0, QCEW_County_Denominated$area_fips))
Finally, using subsetting you can replace the values in industry_code.
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "31-33"] <- 31
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "44-45"] <- 44
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "48-49"] <- 48
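If it helps, the three fixes can be wrapped into one cleanup function applied just before write.table (a sketch; clean_qcew is a hypothetical name, and the codes come from the question):
clean_qcew <- function(df) {
  df <- df[!grepl("^C", df$area_fips), ]         # drop rows whose area_fips starts with "C"
  df$area_fips <- gsub("^US", "0", df$area_fips) # replace the "US" prefix with 0
  recode <- c("31-33" = "31", "44-45" = "44", "48-49" = "48")
  hit <- df$industry_code %in% names(recode)
  df$industry_code[hit] <- recode[df$industry_code[hit]] # collapse the ranged codes
  df
}

QCEW_County_Denominated <- clean_qcew(QCEW_County_Denominated)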
