Related
I'm quite a novice, but I've successfully managed to make some code do what I want.
Right now my code does what I want for one file at a time.
I want to make my code automate this process for 600 files.
I kind of have an idea, that I need to put the list of files in a vector, then maybe use lapply and a function, but I'm not sure how to do this. The syntax and code are beyond me at the moment.
Here's my code...
#Packages are callled
library(tm) #text mining
library(SnowballC) #stemming - reducing words to their root
library(stringr) #for str_trim
library(plyr)
library(dplyr)
library(readtext)
#this is my code to run the code on a bunch of text files. Obviously it's unfinished, and I'm not sure if this is the right approach. Where do I put this? Will it even work?
data_files <- list.files(path = "data/", pattern = '*.txt', full.names = T, recursive = T)
lapply(
#
# where do I put this chunk of code?
# do I need to make all the code below a function?
##this bit cleans the document
company <- "CompanyXReport2015"
txt_raw = readLines("data/CompanyXReport2015.txt")
# remove all extra white space, also splits on lines
txt_format1 <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", txt_raw)
txt_format1.5 <- gsub("^ +| +$|( ) +", "\\1", txt_format1)
# recombine now that all white space is stripped
txt_format2 <- str_c(txt_format1.5, collapse=" ")
#split strings on space now to get a list of all words
txt_format3 <- str_split(txt_format2," ")
txt_format3
# convert to vector
txt_format4 <- unlist(txt_format3)
# remove empty strings and those with words shorter than 3 length
txt_format5 <- txt_format4[str_length(txt_format4) > 3]
# combine document back to single string
cleaned <- str_c(txt_format5, collapse=" ")
head(cleaned, 2)
##import key words and run analysis on frequency for the document
s1_raw = readLines("data/stage1r.txt")
str(s1_raw)
s2_raw = readLines("data/stage2r.txt")
str(s2_raw)
s3_raw = readLines("data/stage3r.txt")
str(s3_raw)
s4_raw = readLines("data/stage4r.txt")
str(s4_raw)
s5_raw = readLines("data/stage5r.txt")
str(s5_raw)
# str_count(cleaned, "legal")
# apply str_count function using each stage vector
level1 <- sapply(s1_raw, str_count, string=cleaned)
level2 <- sapply(s2_raw, str_count, string=cleaned)
level3 <- sapply(s3_raw, str_count, string=cleaned)
level4 <- sapply(s4_raw, str_count, string=cleaned)
level5 <- sapply(s5_raw, str_count, string=cleaned)
#make a vector from this for the report later
wordcountresult <- c(level1,level2,level3,level4,level5)
# convert to dataframes
s1 <- as.data.frame(level1)
s2 <- as.data.frame(level2)
s3 <- as.data.frame(level3)
s4 <- as.data.frame(level4)
s5 <- as.data.frame(level5)
# add a count column that each df shares
s1$count <- s1$level1
s2$count <- s2$level2
s3$count <- s3$level3
s4$count <- s4$level4
s5$count <- s5$level5
# add a stage column to identify what stage the word is in
s1$stage <- "Stage 1"
s2$stage <- "Stage 2"
s3$stage <- "Stage 3"
s4$stage <- "Stage 4"
s5$stage <- "Stage 5"
# drop the unique column
s1 <- s1[c("count","stage")]
s2 <- s2[c("count","stage")]
s3 <- s3[c("count","stage")]
s4 <- s4[c("count","stage")]
s5 <- s5[c("count","stage")]
# s1
df <- rbind(s1, s2,s3, s4, s5)
df
#write the summary for each company to a csv
#Making the report
#Make a vector to put in the report
#get stage counts and make a vector
s1c <- sum(s1$count)
s2c <- sum(s2$count)
s3c <- sum(s3$count)
s4c <- sum(s4$count)
s5c <- sum(s5$count)
stagesvec <- c(s1c,s2c,s3c,s4c,s5c)
names(stagesvec) <- c("Stage1","Stage2","Stage3","Stage4","Stage5")
#get the company report name for a vector
companyvec <- c(company)
names(companyvec) <- c("company")
# combine the vectors for the vector row to be inserted into the report
reportresult <- c(companyvec, wordcountresult, stagesvec)
rrdf <- data.frame(t(reportresult))
newdf <- data.frame(t(reportresult))
#if working file exists-use it
if (file.exists("data/WordCount12.csv")){
write.csv(
rrdf,
"data/WordCountTemp12.csv", row.names=FALSE
)
rrdf2 <-
read.csv("data/WordCountTemp12.csv")
df2 <-
read.csv("data/WordCount12.csv")
df2 <- rbind(df2, rrdf2)
write.csv(df2,
"data/WordCount12.csv", row.names=FALSE)
}else{ #if NO working file exists-make it
write.csv(newdf,
"data/WordCount12.csv", row.names=FALSE)
}
Hello :) Here is an example of workflow, you might find better ones but I started with it when learning.
listoftextfiles = list.files(...)
analysis1 = function(an element of listoftextfiles){
# your 1st analysis
}
res1 = lapply(listoftextfile, analysis1) # results of the 1st analysis
analysis2 = function(an element of res1){
# your 2nd analysis
}
res2 = lapply(res1, analysis2) # results of the 2nd analysis
# ect.
You will find many tutorials about custom functions on internet.
The function strip() below tries to produce a brief report on the result of its operation via the tee pipe (%T>%). Because this function is in turn being handed to a wrapper function and then to purrr::pwalk, which will supply it with a bunch of dataframes one by one, I want to get a report of its operation on each dataframe along with the dataframe name; which is to say, the name of the actual dataframe that is supplied to correspond to the formal argument tib in the function below. In the example supplied, this would be "tst_df". I don't know the names in advance of running the function, as they are constructed from the filenames read from disk and various other inputs.
Somewhat to my surprise, I actually have almost all of this working, except for getting the name of the supplied dataframe. In the example below, the code that is supposed to do this is enexpr(XX), but I have also tried expr(XX), and both of these expressions applied to tib or the dot (.), with or without a preceding !!. Also deparse(substitute()) on XX, tib, and ., but without the bang bangs.
I see that the names is stripped initially by pass-by-value, and then again, maybe, by each stage of the pipe, including the T, and again, maybe, by (XX = .) in the anonymous function after the T. But I know R + tidyverse will have a way. I just hope it does not involve providing an integer to count backwards up the call stack
tst_df <- tibble(A = 1:10, B = 11:20, C=21:30, D = 31:40)
tst_df
################################################################################
# The strip function expects a non-anonymous dataframe, from which it removes
# the rows specified in remove_rows and the columns specified in remove_cols. It
# also prints a brief report; just the df name, length and width.
strip <- function(tib, remove_rows = FALSE, remove_cols = NULL){
remove_rows <- enquo(remove_rows)
remove_cols <- enquo(remove_cols)
out <- tib %>%
filter(! (!! remove_rows)) %>%
select(- !! remove_cols) %T>% (function(XX = .){
function(XX = .)print(
paste0("length of ", enxpr(XX), " = ", nrow(XX), " Width = ", ncol(XX)))
cat("\n")
})
out
}
out_tb <- strip(tib = tst_df, remove_rows = (A < 3 | D > 38), remove_cols = c(C, D))
out_tb
Just save the name of tib at the beginning of your function,
it will be found by your reporter function:
strip <- function(tib, remove_rows = FALSE, remove_cols = NULL) {
remove_rows <- enquo(remove_rows)
remove_cols <- enquo(remove_cols)
tib_name <- as.character(substitute(tib))
report <- function(out) {
cat("output length of", tib_name, "=", nrow(out), ", width =", ncol(out), "\n")
}
tib %>%
filter(! (!! remove_rows)) %>%
select(- !! remove_cols) %T>%
report
}
out_tb <- strip(tib = tst_df, remove_rows = (A < 3 | D > 38), remove_cols = c(C, D))
output length of tst_df = 6 , width = 2
I think this is a fairly basic question, as I am a new R user, but I want to make it so that I can activate the entire code below with a single entry/word (I presumed it would be a function). If this has already been asked, I apologize, and please refer me to the link where it is answered. Thank you in advance for all help.
My code:
head(yelp, 10)
str(yelp)
yelp_flat<- flatten(yelp)
str(yelp_flat)
library(tibble)
yelp_tbl <- as_data_frame(yelp_flat)
yelp_tbl
yelp_tbl$newcolumn <- NULL
yelp_tbl$newcolumn1 <- NULL
yelp_tbl$shotClock <- NULL
yelp_tbl$period <- NULL
yelp_tbl$wallClock <- NULL
yelp_tbl$gameClock <- NULL
yelp_tbl$gameClockStopped <- NULL
yelp_tbl$ball <- NULL
head(yelp_tbl)
good <- unnest(yelp_tbl) #extracts xyz from original dataframe
library(tidyr)
player <- good %>% separate(xyz, c("player_x", "player_y", "player_z"), sep = ",")
finish <- player %>% separate(xyz1, c("player_x", "player_y", "player_z"), sep = ",")
k <- finish %>% separate(player_x, c("trash", "player_x"), sep = "c")
k$trash <- NULL
r <- k %>% separate(player_z, c("player_z", "tra"), sep = "\\)")
u <- r %>% separate(player_x, c("kol", "player_x"), sep = "\\(")
Away_Team <- u
Away_Team$garbage <- NULL
Away_Team$playerId1<- NULL
Away_Team$aplayer_x <- NULL
Away_Team$aplayer_y <- NULL
Away_Team$aplayer_z <- NULL
Away_Team$dispose <- NULL
Away_Team$brack <- NULL
Away_Team$kol <- NULL
Away_Team$tra <- NULL
View(Away_Team)
yelp_tbl
yelp_tbl$newcolumn <- NULL
yelp_tbl$newcolumn1 <- NULL
yelp_tbl$shotClock <- NULL
yelp_tbl$period <- NULL
yelp_tbl$wallClock <- NULL
yelp_tbl$gameClock <- NULL
yelp_tbl$gameClockStopped <- NULL
yelp_tbl$ball <- NULL
head(yelp_tbl)
good <- unnest(yelp_tbl) #extracts xyz from original dataframe
library(tidyr)
player <- good %>% separate(xyz, c("player_x", "player_y", "player_z"), sep = ",")
finish <- player %>% separate(xyz1, c("player_x", "player_y", "player_z"), sep = ",")
k <- finish %>% separate(player_x, c("trash", "player_x"), sep = "c")
k$trash <- NULL
r <- k %>% separate(player_z, c("player_z", "tra"), sep = "\\)")
u <- r %>% separate(player_x, c("kol", "player_x"), sep = "\\(")
Home_Team <- u
Home_Team$garbage <- NULL
Home_Team$playerId1<- NULL
Home_Team$hplayer_x <- NULL
Home_Team$hplayer_y <- NULL
Home_Team$hplayer_z <- NULL
Home_Team$dispose <- NULL
Home_Team$brack <- NULL
Home_Team$kol <- NULL
Home_Team$tra <- NULL
View(Home_Team)
View (Away_Team)
Table <- rbind(Home_Team, Away_Team)
View(Table) #order frameIdx to see correct order
So, indeed you should make a function. Here are some steps to follow:
1. Put all your code in your function
my_function <- function(){
# Your code
}
2. Identify what you have as an input (aka, what your are not building in your code), they will become the argument of your function
my_function <- function(arg1, arg2, ...){
# Your code
}
In your example, I identified yelp
3. Identify what you want to output (ideally only one object), they will be in the return of your function
my_function <- function(arg1, arg2, ...){
# Your code
return(output)
}
In your example I identified Table
4. Take all the import/library and put them outside your function
library(lib1)
my_function <- function(arg1, arg2, ...){
# Your code
return(output)
}
EDIT using #r2evans suggestion: Using libraryis generally used instead of require, here and here is some literature on it.
In your code I identified tidyr and tibble
5. Identify what you want to print/View and what was just for debugging. Add a print to print, suppres what you don't want
6. Add some comments/slice your code
For example I would add something like # Creating XXX table
7. Improve code quality
You should try to minimize the number of line of code (for example using loops and avoiding code to be in double). Make variables names explicit (instead of k, u, r...)
Regarding loop, in your code you drop some columns on at a time, you could do a loop to drop them in order. (It's what I have done bellow). It helps to make your code easier to read/debug. In this particular case, as Gregor said it is heaven faster to drop them all at once with using a list of column names (if you are interested check his comment).
Here you go:
There are still some improvement to do especially regarding point number 7 and 5.
library(tibble)
library(tidyr)
yelp_function <- function(yelp){
# Printing the input
print(head(yelp, 10))
print(str(yelp))
# Flatten table
yelp_flat<- flatten(yelp)
print(str(yelp_flat))
# Create yelp_tbl and drop some columns
yelp_tbl <- as_data_frame(yelp_flat)
# Drop some columns
for (col in c("newcolumn", "newcolumn1", "shotClock", "period", "wallClock", "gameClock", "gameClockStopped", "ball")){
yelp_tbl[, col] <- NULL
}
print(head(yelp_tbl))
# Build some table
good <- unnest(yelp_tbl) #extracts xyz from original dataframe
player <- good %>% separate(xyz, c("player_x", "player_y", "player_z"), sep = ",")
finish <- player %>% separate(xyz1, c("player_x", "player_y", "player_z"), sep = ",")
k <- finish %>% separate(player_x, c("trash", "player_x"), sep = "c")
k$trash <- NULL
r <- k %>% separate(player_z, c("player_z", "tra"), sep = "\\)")
u <- r %>% separate(player_x, c("kol", "player_x"), sep = "\\(")
# Build away team
Away_Team <- u
# Build yelp table: I'm not quite sure why you are rebdoing that... Is this code necessary?
yelp_tbl
# Drop some columns
for (col in c("newcolumn", "newcolumn1", "shotClock", "period", "wallClock", "gameClock", "gameClockStopped", "ball")){
yelp_tbl[, col] <- NULL
}
print(head(yelp_tbl))
good <- unnest(yelp_tbl) #extracts xyz from original dataframe
# Build some table
player <- good %>% separate(xyz, c("player_x", "player_y", "player_z"), sep = ",")
finish <- player %>% separate(xyz1, c("player_x", "player_y", "player_z"), sep = ",")
k <- finish %>% separate(player_x, c("trash", "player_x"), sep = "c")
k$trash <- NULL
r <- k %>% separate(player_z, c("player_z", "tra"), sep = "\\)")
u <- r %>% separate(player_x, c("kol", "player_x"), sep = "\\(")
## Build home_team
Home_Team <- u
# Drop some columns
for (col in c("garbage", "playerId1", "aplayer_x", "aplayer_y", "aplayer_z", "dispose", "brack", "kol", "tra")){
Away_Team[, col] <- NULL
Home_Team[, col] <- NULL
}
# Merge
Table <- rbind(Home_Team, Away_Team)
# Return
return(Table)
}
View(Table) #order frameIdx to see correct order
Run it:
To run your code you now just have to execute the function with the needed argument:
yelp_function(yelp)
NB 1: please note that I didn't tested the code since you didn't provide data to run it. To improve your question you should give some data using dputfunction.
NB 2: There is always room for improvement in the code so you might want to go further and llok into refactoring to avoid having code in double. Control your inputs with some sanity check...
It's rather simple.
You do this:
foo <- function{
#all your code goes here
}
Then you call your function by typing (in console for instance):
foo()
This question already has answers here:
Error in if/while (condition) {: missing Value where TRUE/FALSE needed
(4 answers)
Closed 5 years ago.
I have dataframe with 7 rows and 1 column,which contains links of a website, I'm trying to extract data from those various link and store them in a data frame but not able to append that.Also I'm checking that if for a link if there is no records(this I'm checking through html attribute of that link) skip that link and proceed to next link.I'm also trying to fetch data for multiple pages of a link.
This is reproducible data
text1="http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom="
text3="&proptype="
text4="Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment"
text5="&cityName=Thane&BudgetMin="
text6="&BudgetMax="
bhk=c("1","2","3","4","5",">5")
budg_min=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
budg_max=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
eg <- expand.grid(bhk = bhk, budg_min = budg_min, budg_max = budg_max)
eg <- eg[as.integer(eg$budg_min) <= as.integer(eg$budg_max),]
uuu <- sprintf("%s%s%s%s%s%s%s%s", text1,eg[,1],text3,text4,text5,eg[,2],text6,eg[,3])
uuu_df1=data.frame(x=uuu[1:7,])
dput(uuu_df1)
I have 3 solution for this but none seems to be working fine.
SOlution#1
urlList <- llply(uuu_df1[,1], function(url){
this_pg <- read_html(url)
results_count <- this_pg %>%
xml_find_first(".//span[#id='resultCount']") %>%
xml_text() %>%
as.integer()
if(results_count > 0){
cards <- this_pg %>%
xml_find_all('//div[#class="SRCard"]')
df <- ldply(cards, .fun=function(x){
y <- data.frame(wine = x %>% xml_find_first('.//span[#class="agentNameh"]') %>% xml_text(),
excerpt = x %>% xml_find_first('.//div[#class="postedOn"]') %>% xml_text(),
locality = x %>% xml_find_first('.//span[#class="localityFirst"]') %>% xml_text(),
society = x %>% xml_find_first('.//div[#class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
return(y)
})
} else {
df <- NULL
}
return(df)
}, .progress = 'text')
names(urlList) <- uuu_df1[,1]
a=bind_rows(urlList)
Above code gives me error Error in if (results_count > 0) { : missing value where TRUE/FALSE needed
Solution#2
urlList <- lapply(uuu_df1[,1], function(url){
UrlPage <- html(as.character(url))
ImgNode <- UrlPage %>% html_node("div.noResultHead")
u <- paste("No", word(string = as(ImgNode, "character"), start=4, end=5), sep=" ")
cat(".")
pg <- read_html(url)
if(u!="No Results Found!") {
df <- data.frame(wine=html_text(html_nodes(pg, ".agentNameh")),
excerpt=html_text(html_nodes(pg, ".postedOn")),
locality=html_text(html_nodes(pg,".localityFirst")),
society=html_text(html_nodes(pg,'.labValu .stop-propagation:nth-child(1)')),
stringsAsFactors=FALSE)
} else {
# ASSIGN EMPTY DATAFRAME (FOR CONSISTENT STRUCTURE)
df <- data.frame(wine=character(), excerpt=character(), locality=character(), society=character())
}
# RETURN NAMED LIST
return(list(UrlPage=UrlPage, ImgNode=ImgNode, u=u, df=df))
})
# ROW BIND ONLY DATAFRAME ELEMENT FROM LIST
wines <- map_df(urlList, function(u) u$df)
Above code gives empty dataframe
Solution#3
uuu_df1=data.frame(x=uuu_df[1:7,])
wines=data.frame()
url_test=c()
UrlPage_test=c()
u=c()
ImgNode=c()
pg=c()
for(i in 1:dim(uuu_df1)[1]) {
url_test[i]=as.character(uuu_df1[i,])
UrlPage_test[i] <- html(url_test[i])
ImgNode[i] <- UrlPage_test[i] %>% html_node("div.noResultHead")
u[i]=ImgNode[i]
u[i]=as(u[i],"character")
u[i]=paste("No",word(string = u, start = 4, end = 5),sep = " ")
if(u[i]=="No Results Found!") next
{
map_df(1:5, function(i) # here 1:5 is number of webpages of a website
{
# simple but effective progress indicator
cat(".")
pg[i] <- read_html(sprintf(url_test[i], i))
data.frame(wine=html_text(html_nodes(pg[i], ".agentNameh")),
excerpt=html_text(html_nodes(pg[i], ".postedOn")),
locality=html_text(html_nodes(pg[i],".localityFirst")),
society=html_text(html_nodes(pg[i],'.labValu .stop-propagation:nth-child(1)')),
stringsAsFactors=FALSE)
}) -> wines
}}
Above code also gives an error
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "list"
In addition: Warning messages:
1: 'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
2: In UrlPage_test[i] <- html(url_test[i]) :
number of items to replace is not a multiple of replacement length
Any suggestions on which code can be corrected so that my requirement is met. Thanks in advance
Solution #1
That missing value where TRUE/FALSE needed is printed when you do something like this:
if (NA > 0) {
do something
}
So replace your if condition
if(results_count > 0)
with
(!is.na(results_count) & (results_count > 0))
Good morning, I have created the following R code:
setwd("xxx")
library(reshape)
##Insert needed year
url <- "./Quarterly/1990_qtrly.csv"
##Writes data in R with applicable columns
qtrly_data <- read.csv(url, header = TRUE, sep = ",", quote="\"", dec=".", na.strings=" ", skip=0)
relevant_cols <- c("area_fips", "industry_code", "own_code", "agglvl_code", "year", "qtr")
overall <- c(relevant_cols, colnames(qtrly_data)[8:16])
lq <- c(relevant_cols, colnames(qtrly_data)[17:25])
oty <- c(relevant_cols, colnames(qtrly_data)[18:42])
types <- c("overall", "lq", "oty")
overallx <- colnames(qtrly_data)[9:16]
lqx <- colnames(qtrly_data)[18:25]
otyx <- colnames(qtrly_data)[seq(27,42,2)]
###Adding in the disclosure codes from each section
disc_codes <- c("disclosure_code", "lq_disclosure_code", "oty_disclosure_code")
cols_list = list(overall, lq, oty)
denom_list = list(overallx, lqx, otyx)
##Uses a two-loop peice of code to go through data denominations and categories, while melting it into the correct format
for (j in 1:length(types))
{
cat("Working on type: " , types[j], "\n")
these_denominations <- denom_list[[j]]
type_data <- qtrly_data[ , cols_list[[j]] ]
QCEW_County <- melt(type_data, id=c(relevant_cols, disc_codes[j]))
colnames(QCEW_County) <- c(relevant_cols, "disclosure_code", "text_denomination", "value")
Data_Cat <- j
for (k in 1:length(these_denominations))
{
cat("Working on type: " , types[j], "and denomination: ", these_denominations[k], "\n")
QCEW_County_Denominated <- QCEW_County[QCEW_County[, "text_denomination"] == these_denominations[k], ]
QCEW_County_Denominated$disclosure_code <- ifelse(QCEW_County_Denominated$disclosure_code == "", 0, 1)
Data_Denom <- k
QCEW_County_Denominated <- cbind(QCEW_County_Denominated, Data_Cat, Data_Denom)
QCEW_County_Denominated$Source_ID <- 1
QCEW_County_Denominated$text_denomination <- NULL
colnames(QCEW_County_Denominated) <- NULL
###Actually writes the txt file to the QCEW folder
write.table(QCEW_County_Denominated, file="C:\\Users\\jjackson\\Downloads\\QCEW\\1990_test.txt", append=TRUE, quote=FALSE, sep=',', row.names=FALSE)
}
}
Now, there are some things I need to get rid of, namely, all the rows in my QCEW_County_Denominated dataframe where the "area_fips" column begins with the character "C", in that same column, there are also codes that start with US that I would like to replace with a 0. Finally, I also have the "industry_code" column that in my final dataframe has 3 values that need to be replaced. 31-33 with 31, 44-45 with 44, and 48-49 with 48. I understand that this is a difficult task. I'm slowly figuring it out on my own, but if anyone could give me a helpful nudge in the right direction while I'm figuring this out on my own, it would be much appreciated. Conditional statements in R is looking like it's my Achilles heel, as it's always where I begin to get confused with how its syntax differs from other statistical packages.
Thank you, and have a nice day.
You can remove and recode your data using regex and subsetting.
Using grepl, you can select the rows in the column area_fips that DON'T start with C.
QCEW_County_Denominated <- QCEW_County_Denominated[!grepl("^C", QCEW_County_Denominated$area_fips), ]
Using gsub, you can replace with 0 the values in the area_fips columns that start with 0.
QCEW_County_Denominated$area_fips <- as.numeric(gsub("^US", 0, QCEW_County_Denominated$area_fips))
Finally, using subsetting you can replace the values in the industry_code.
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "31-33"] <- 31
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "44-45"] <- 44
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "48-49"] <- 48