Removing rows based on character conditions in a column - r

Good morning, I have created the following R code:
setwd("xxx")
library(reshape)
##Insert needed year
url <- "./Quarterly/1990_qtrly.csv"
##Writes data in R with applicable columns
qtrly_data <- read.csv(url, header = TRUE, sep = ",", quote="\"", dec=".", na.strings=" ", skip=0)
relevant_cols <- c("area_fips", "industry_code", "own_code", "agglvl_code", "year", "qtr")
overall <- c(relevant_cols, colnames(qtrly_data)[8:16])
lq <- c(relevant_cols, colnames(qtrly_data)[17:25])
oty <- c(relevant_cols, colnames(qtrly_data)[18:42])
types <- c("overall", "lq", "oty")
overallx <- colnames(qtrly_data)[9:16]
lqx <- colnames(qtrly_data)[18:25]
otyx <- colnames(qtrly_data)[seq(27,42,2)]
###Adding in the disclosure codes from each section
disc_codes <- c("disclosure_code", "lq_disclosure_code", "oty_disclosure_code")
cols_list = list(overall, lq, oty)
denom_list = list(overallx, lqx, otyx)
##Uses a two-loop piece of code to go through data denominations and categories, while melting it into the correct format
for (j in 1:length(types)) {
  cat("Working on type: ", types[j], "\n")
  these_denominations <- denom_list[[j]]
  type_data <- qtrly_data[, cols_list[[j]]]
  QCEW_County <- melt(type_data, id = c(relevant_cols, disc_codes[j]))
  colnames(QCEW_County) <- c(relevant_cols, "disclosure_code", "text_denomination", "value")
  Data_Cat <- j
  for (k in 1:length(these_denominations)) {
    cat("Working on type: ", types[j], "and denomination: ", these_denominations[k], "\n")
    QCEW_County_Denominated <- QCEW_County[QCEW_County[, "text_denomination"] == these_denominations[k], ]
    QCEW_County_Denominated$disclosure_code <- ifelse(QCEW_County_Denominated$disclosure_code == "", 0, 1)
    Data_Denom <- k
    QCEW_County_Denominated <- cbind(QCEW_County_Denominated, Data_Cat, Data_Denom)
    QCEW_County_Denominated$Source_ID <- 1
    QCEW_County_Denominated$text_denomination <- NULL
    colnames(QCEW_County_Denominated) <- NULL
    ###Actually writes the txt file to the QCEW folder
    write.table(QCEW_County_Denominated, file="C:\\Users\\jjackson\\Downloads\\QCEW\\1990_test.txt", append=TRUE, quote=FALSE, sep=',', row.names=FALSE)
  }
}
Now there are a few things I need to get rid of. First, I need to drop all the rows in my QCEW_County_Denominated data frame where the "area_fips" column begins with the character "C". In that same column, there are also codes that start with "US" that I would like to replace with a 0. Finally, the "industry_code" column in my final data frame has three values that need to be replaced: "31-33" with 31, "44-45" with 44, and "48-49" with 48. I understand this is a difficult task, and I'm slowly figuring it out on my own, but if anyone could give me a helpful nudge in the right direction it would be much appreciated. Conditional statements in R are looking like my Achilles heel, as that is always where I get confused by how its syntax differs from other statistical packages.
Thank you, and have a nice day.

You can remove and recode your data using regex and subsetting.
Using grepl, you can select the rows in the column area_fips that DON'T start with C.
QCEW_County_Denominated <- QCEW_County_Denominated[!grepl("^C", QCEW_County_Denominated$area_fips), ]
Using gsub, you can replace the values in the area_fips column that start with US with 0.
QCEW_County_Denominated$area_fips <- as.numeric(gsub("^US", 0, QCEW_County_Denominated$area_fips))
Finally, using subsetting you can replace the values in the industry_code.
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "31-33"] <- 31
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "44-45"] <- 44
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "48-49"] <- 48
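In case it helps with placement, here is a sketch (my addition, not part of the original answer) of how the three fixes could sit inside the inner loop, right before the write.table() call. The named lookup vector is just a compact alternative to the three subsetting lines above.
# inside the inner loop, just before write.table()
# drop rows whose area_fips starts with "C"
QCEW_County_Denominated <- QCEW_County_Denominated[!grepl("^C", QCEW_County_Denominated$area_fips), ]
# replace any code starting with "US" with "0" (sub() also converts factors to character)
QCEW_County_Denominated$area_fips <- sub("^US.*", "0", QCEW_County_Denominated$area_fips)
# recode the three industry_code values via a named lookup vector
QCEW_County_Denominated$industry_code <- as.character(QCEW_County_Denominated$industry_code)
recode <- c("31-33" = "31", "44-45" = "44", "48-49" = "48")
hits <- QCEW_County_Denominated$industry_code %in% names(recode)
QCEW_County_Denominated$industry_code[hits] <- recode[QCEW_County_Denominated$industry_code[hits]]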

Related

How to process many txt files with my code in R

I'm quite a novice, but I've successfully managed to make some code do what I want.
Right now my code does what I want for one file at a time.
I want to make my code automate this process for 600 files.
I kind of have an idea that I need to put the list of files in a vector and then maybe use lapply with a function, but I'm not sure how to do this. The syntax and code are beyond me at the moment.
Here's my code...
#Packages are called
library(tm) #text mining
library(SnowballC) #stemming - reducing words to their root
library(stringr) #for str_trim
library(plyr)
library(dplyr)
library(readtext)
#this is my code to run the code on a bunch of text files. Obviously it's unfinished, and I'm not sure if this is the right approach. Where do I put this? Will it even work?
data_files <- list.files(path = "data/", pattern = '*.txt', full.names = T, recursive = T)
lapply(
#
# where do I put this chunk of code?
# do I need to make all the code below a function?
##this bit cleans the document
company <- "CompanyXReport2015"
txt_raw = readLines("data/CompanyXReport2015.txt")
# remove one- and two-letter words, then strip extra white space
txt_format1 <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", txt_raw)
txt_format1.5 <- gsub("^ +| +$|( ) +", "\\1", txt_format1)
# recombine now that all white space is stripped
txt_format2 <- str_c(txt_format1.5, collapse=" ")
#split strings on space now to get a list of all words
txt_format3 <- str_split(txt_format2," ")
txt_format3
# convert to vector
txt_format4 <- unlist(txt_format3)
# remove empty strings and words of three characters or fewer
txt_format5 <- txt_format4[str_length(txt_format4) > 3]
# combine document back to single string
cleaned <- str_c(txt_format5, collapse=" ")
head(cleaned, 2)
##import key words and run analysis on frequency for the document
s1_raw = readLines("data/stage1r.txt")
str(s1_raw)
s2_raw = readLines("data/stage2r.txt")
str(s2_raw)
s3_raw = readLines("data/stage3r.txt")
str(s3_raw)
s4_raw = readLines("data/stage4r.txt")
str(s4_raw)
s5_raw = readLines("data/stage5r.txt")
str(s5_raw)
# str_count(cleaned, "legal")
# apply str_count function using each stage vector
level1 <- sapply(s1_raw, str_count, string=cleaned)
level2 <- sapply(s2_raw, str_count, string=cleaned)
level3 <- sapply(s3_raw, str_count, string=cleaned)
level4 <- sapply(s4_raw, str_count, string=cleaned)
level5 <- sapply(s5_raw, str_count, string=cleaned)
#make a vector from this for the report later
wordcountresult <- c(level1,level2,level3,level4,level5)
# convert to dataframes
s1 <- as.data.frame(level1)
s2 <- as.data.frame(level2)
s3 <- as.data.frame(level3)
s4 <- as.data.frame(level4)
s5 <- as.data.frame(level5)
# add a count column that each df shares
s1$count <- s1$level1
s2$count <- s2$level2
s3$count <- s3$level3
s4$count <- s4$level4
s5$count <- s5$level5
# add a stage column to identify what stage the word is in
s1$stage <- "Stage 1"
s2$stage <- "Stage 2"
s3$stage <- "Stage 3"
s4$stage <- "Stage 4"
s5$stage <- "Stage 5"
# drop the unique column
s1 <- s1[c("count","stage")]
s2 <- s2[c("count","stage")]
s3 <- s3[c("count","stage")]
s4 <- s4[c("count","stage")]
s5 <- s5[c("count","stage")]
# s1
df <- rbind(s1, s2,s3, s4, s5)
df
#write the summary for each company to a csv
#Making the report
#Make a vector to put in the report
#get stage counts and make a vector
s1c <- sum(s1$count)
s2c <- sum(s2$count)
s3c <- sum(s3$count)
s4c <- sum(s4$count)
s5c <- sum(s5$count)
stagesvec <- c(s1c,s2c,s3c,s4c,s5c)
names(stagesvec) <- c("Stage1","Stage2","Stage3","Stage4","Stage5")
#get the company report name for a vector
companyvec <- c(company)
names(companyvec) <- c("company")
# combine the vectors for the vector row to be inserted into the report
reportresult <- c(companyvec, wordcountresult, stagesvec)
rrdf <- data.frame(t(reportresult))
newdf <- data.frame(t(reportresult))
#if working file exists-use it
if (file.exists("data/WordCount12.csv")){
write.csv(
rrdf,
"data/WordCountTemp12.csv", row.names=FALSE
)
rrdf2 <-
read.csv("data/WordCountTemp12.csv")
df2 <-
read.csv("data/WordCount12.csv")
df2 <- rbind(df2, rrdf2)
write.csv(df2,
"data/WordCount12.csv", row.names=FALSE)
}else{ #if NO working file exists-make it
write.csv(newdf,
"data/WordCount12.csv", row.names=FALSE)
}
Hello :) Here is an example of a workflow; you might find better ones, but this is what I started with when learning.
listoftextfiles <- list.files(...)              # your 600 .txt files
analysis1 <- function(textfile) {               # textfile = an element of listoftextfiles
  # your 1st analysis
}
res1 <- lapply(listoftextfiles, analysis1)      # results of the 1st analysis
analysis2 <- function(res) {                    # res = an element of res1
  # your 2nd analysis
}
res2 <- lapply(res1, analysis2)                 # results of the 2nd analysis
# etc.
You will find many tutorials about writing custom functions on the internet.
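To make that concrete with the code from the question, one possible shape (a sketch only: the cleaning steps are abbreviated, and the paths and file-name pattern are assumptions based on the question) is to wrap the per-file work in a function and lapply over the file list:
library(stringr)
# read the five stage keyword lists once, outside the per-file function
stages <- lapply(paste0("data/stage", 1:5, "r.txt"), readLines)
process_file <- function(path) {
  company <- tools::file_path_sans_ext(basename(path))
  txt_raw <- readLines(path)
  # ... the cleaning steps from the question go here, producing `cleaned` ...
  cleaned <- str_c(txt_raw, collapse = " ")  # placeholder for the real cleaning
  counts <- sapply(stages, function(words) sum(str_count(cleaned, words)))
  names(counts) <- paste0("Stage", 1:5)
  data.frame(company = company, t(counts), stringsAsFactors = FALSE)
}
# pattern assumes the report files contain "Report" so the stage keyword files are skipped; adjust to your naming
data_files <- list.files(path = "data/", pattern = "Report.*\\.txt$", full.names = TRUE, recursive = TRUE)
results <- do.call(rbind, lapply(data_files, process_file))
write.csv(results, "data/WordCount12.csv", row.names = FALSE)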

Filtering process not fetching full data? Using dplyr filter and grep

I have this log file that has about 1200 characters (max) on a line. What I want to do is read this first and then extract certain portions of the file into new columns. I want to extract rows that contain the text “[DF_API: input string]”.
When I read it and then filter based on the rows that I am interested in, it almost seems like I am losing data. I tried this using the dplyr filter and using standard grep, with the same result.
Not sure why this is the case. I appreciate your help with this. The code and the data are at the following link.
Satish
Code is given below
library(dplyr)
library(stringr)  # needed for str_detect() below
setwd("C:/Users/satis/Documents/VF/df_issue_dec01")
sec1 <- read.delim(file="secondary1_aa_small.log")
head(sec1)
names(sec1) <- c("V1")
sec1_test <- filter(sec1,str_detect(V1,"DF_API: input string")==TRUE)
head(sec1_test)
sec1_test2 = sec1[grep("DF_API: input string",sec1$V1, perl = TRUE),]
head(sec1_test2)
write.csv(sec1_test, file = "test_out.txt", row.names = F, quote = F)
write.csv(sec1_test2, file = "test2_out.txt", row.names = F, quote = F)
Data (and code) is given at the link below. Sorry, I should have used dput.
https://spaces.hightail.com/space/arJlYkgIev
Try the code below, which gives you a data frame of the lines from your file that match the condition.
#to read your file
sec1 <- readLines("secondary1_aa_small.log")
#framing a dataframe by extracting required lines from above file
new_sec1 <- data.frame(grep("DF_API: input string", sec1, value = T))
names(new_sec1) <- c("V1")
Edit: Simple way to split the above column into multiple columns
#extracting substring in between < & >
new_sec1$V1 <- gsub(".*[<\t]([^>]+)[>].*", "\\1", new_sec1$V1)
#replacing comma(,) with a white space
new_sec1$V1 <- gsub("[,]+", " ", new_sec1$V1)
#splitting into separate columns
new_sec1 <- strsplit(new_sec1$V1, " ")
new_sec1 <- lapply(new_sec1, function(x) x[x != ""] )
new_sec1 <- do.call(rbind, new_sec1)
new_sec1 <- data.frame(new_sec1)
Change the column names as needed for your analysis.
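As a side note (my own guess about the cause, not from the original answer): read.delim keeps quoting enabled by default (quote = "\""), so a log line containing an unmatched quote character can swallow the following lines into one field, which looks exactly like lost rows; the header = TRUE default also turns the first log line into column names. If you would rather keep read.delim, a sketch of a more literal read would be:
# read every line as-is: no header, no quote or comment interpretation
sec1 <- read.delim("secondary1_aa_small.log", header = FALSE, quote = "",
                   comment.char = "", stringsAsFactors = FALSE)
names(sec1) <- "V1"
sec1_test <- sec1[grep("DF_API: input string", sec1$V1, fixed = TRUE), , drop = FALSE]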

Efficient way to split strings in separate rows (creating an edgelist)

I currently have the following problem. I work with Web-of-Science scientific publication and citation data, which has the following structure: the variable "SR" is a string with the name of a publication, and "CR" is a string containing all references cited in that publication, separated by ";".
My task now is to create an edgelist between all publications with the corresponding citations, where every publication and citation combination is in a single row. I do it currently with the following code:
# Some minimal data for example
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4","cit1;cit4;cit5","cit5;cit1")
M <- cbind(pub,cit)
colnames(M) <- c("SR","CR")
# Create an edgelist
cit_el <- data.frame() #
for (i in seq(1, nrow(M), 1)) { # i=3
cit <- data.frame(strsplit(as.character(M[i,"CR"]), ";", fixed=T), stringsAsFactors=F)
colnames(cit)[1] <- c("SR")
cit$SR_source <- M[i,"SR"]
cit <- unique(cit)
cit_el <- rbind(cit_el, cit)
}
However, for large datasets of 10k+ publications (which tend to have 50+ citations each), the script runs for 15+ minutes. I know that loops are usually an inefficient way of coding in R, yet I haven't found an alternative that produces what I want.
Does anyone know a trick to make this faster?
This is my attempt. I haven't compared the speeds of different approaches yet.
First, the artificial data: 10k pubs, 100k possible citations, and at most 80 citations per pub.
library(data.table)
library(stringr)
pubCount = 10000
citCount = 100000
maxCitPerPub = 80
pubList <- paste0("pub", seq(pubCount))
citList <- paste0("cit", seq(citCount))
cit <- sapply(sample(seq(maxCitPerPub), pubCount, replace = TRUE),
function(x) str_c(sample(citList, x), collapse = ";"))
data <- data.table(pub = pubList,
cit = cit)
For processing, I use stringr::str_split_fixed to split the citations into columns and use data.table::melt to collapse the columns.
temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
result <- melt(temp, id.vars = "pub")[, variable:= NULL][value!='']
Not sure if this is any quicker, but if I'm understanding correctly, this should give the desired result:
rbindlist(lapply(1:nrow(M), function(i){
  data.frame(SR_source = M[i, 'SR'], SR = unlist(strsplit(M[i, 'CR'], ';')))
}))
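For comparison (my own sketch, not from either answer above), the same split-and-stack can be done directly in data.table by unlisting strsplit within a by-group step, which avoids building the wide intermediate table:
library(data.table)
dt <- as.data.table(M)   # M is the two-column pub/cit matrix from the question
cit_el <- dt[, .(SR = unique(unlist(strsplit(CR, ";", fixed = TRUE)))), by = .(SR_source = SR)]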

R Loop optimisation/ Loop is way too time consuming

The following loop takes ages. Is there any way to do this in a more time-efficient way? The data.table consists of 27 variables and more than 600k observations.
data <- read.table("file.txt", header = T, sep= "|")
colnames(data)[c(1)] <- c("X")
data <- as.data.table(data)
n <- 1
vector <- vector()
for (i in 2:nrow(data)) {
  if (data[["X"]][i] != data[["X"]][i - 1]) {
    n <- 1
    vector[i] <- 1
  } else {
    n <- n + 1
    vector[i] <- n
  }
}
Basically, I need to index every appearance of a unique entry in X, i.e. the first time it appears, the second time it appears, etc., and then merge this into the existing data as an additional column. However, I got stuck compiling the vector.
Thank you.
First off, use fread:
DT <- fread("file.txt", sep = "|")
Next, use setnames:
setnames(DT, 1, "X")
Finally, use rowid:
DT[ , vector := rowid(X)]
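A tiny toy example (mine, not part of the answer) to show what rowid() produces:
library(data.table)
DT <- data.table(X = c("a", "a", "b", "a", "b"))
DT[, vector := rowid(X)]
DT
#    X vector
# 1: a      1
# 2: a      2
# 3: b      1
# 4: a      3
# 5: b      2
Each row gets the running count of how many times its X value has appeared so far, which is exactly the index the question describes.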

Conditional Insert of Rows

I have a unique dataset, a portion of which can be reproduced using:
data <- textConnection("SNP_Pres,Chr_N,BP_A1F,A1_Beta,A2_SE,ForSortSNP,SortOrder
rs122,13,100461219,C,T,rs122,6
1,16362,0.8701,-0.0048,0.0056,rs122,7
1,19509,0.546015137607046,-0.0033,0.0035,rs122,8
1,17218,0.1539,-0.004,0.013,rs122,9
rs142,13,61952115,G,T,rs142,6
1,16387,0.1295,0.0044,0.0057,rs142,7
1,17218,0.8454,0.006,0.013,rs142,9
rs160,13,100950452,C,T,rs160,6
1,16387,0.549,-0.0021,0.0035,rs160,7
1,19509,0.519102731537216,0.003,0.0027,rs160,8
rs298,13,66664221,C,G,rs298,6
1,19509,0.308290808358246,-0.0032,0.0033,rs298,8
1,17218,0.7227,0.022,0.01,rs298,9")
mydata <- read.csv(data, header = T, sep = ",", stringsAsFactors=FALSE)
It is formatted for use in a program that requires placeholder rows for missing data entries. In this case, a missing entry is indicated by a numeric skip in the SortOrder column. An entry is complete if the column runs 6 - 7 - 8 - 9, with a new entry beginning again at 6.
I need a way to read through the data file, and insert a row of zeros for each missing entry, so that the file looks like this:
data <- textConnection("SNP_Pres,Chr_N,BP_A1F,A1_Beta,A2_SE,ForSortSNP,SortOrder
rs122,13,100461219,C,T,rs122,6
1,16362,0.8701,-0.0048,0.0056,rs122,7
1,19509,0.546015137607046,-0.0033,0.0035,rs122,8
1,17218,0.1539,-0.004,0.013,rs122,9
rs142,13,61952115,G,T,rs142,6
1,16387,0.1295,0.0044,0.0057,rs142,7
0,0,0,0,0,rs142,8
1,17218,0.8454,0.006,0.013,rs142,9
rs160,13,100950452,C,T,rs160,6
1,16387,0.549,-0.0021,0.0035,rs160,7
1,19509,0.519102731537216,0.003,0.0027,rs160,8
0,0,0,0,0,rs160,9
rs298,13,66664221,C,G,rs298,6
0,0,0,0,0,rs298,7
1,19509,0.308290808358246,-0.0032,0.0033,rs298,8
1,17218,0.7227,0.022,0.01,rs298,9")
mydata <- read.csv(data, header = T, sep = ",", stringsAsFactors=FALSE)
Ultimately, the last two columns, ForSortSNP and SortOrder will be deleted from the data file, but they are included now for convenience's sake.
Any suggestions are greatly appreciated.
Here is a solution using the expand.grid and merge functions.
grid <- with(mydata, expand.grid(ForSortSNP=unique(ForSortSNP), SortOrder=unique(SortOrder)))
complete <- merge(mydata, grid, all=TRUE, sort=FALSE)
complete[is.na(complete)] <- 0 # replace NAs with 0's
complete <- complete[order(complete$ForSortSNP, complete$SortOrder), ] # re-sort
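As a small follow-up (my own sketch, not part of the answer): since the question says ForSortSNP and SortOrder will eventually be dropped, the helper columns can be removed after the merge and the result written out (the file name below is just a placeholder):
complete$ForSortSNP <- NULL   # drop the helper columns used only for merging/sorting
complete$SortOrder  <- NULL
write.csv(complete, "mydata_complete.csv", row.names = FALSE, quote = FALSE)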
