Applying Function to Multiple Files & Handling the resulting dataframes - r

I've got a function which I'm trying to apply in a for loop that extracts a dataframe from multiple files and combines them into a single one.
From what I've read, this seemed like the best way to attack it, but I get an empty list returned when I was hoping for a list of dataframes which could be combined using bind_rows.
This is the code I'm using:
combined_functions <- function(file_name) {
  # combines the get_dfm_df and getcorp functions: gets the dfm tibble straight from the file name
  data_frame_returned <- get_dfm_df(getcorp(file_name))
  data_frame_returned
}

list_of_dataframes <- list()
file.list <- dir(pattern = "DOCX$")

for (file in file.list) {
  dataframe_of_file <- combined_function(file)
  append(list_of_dataframes, dataframe_of_file)
}
bind_rows(list_of_dataframes, .id = "column_label") #https://stackoverflow.com/questions/2851327/convert-a-list-of-data-frames-into-one-data-frame
It creates an empty list, gets a list of the file names, and then combined_function should create a dataframe from each file which, to my understanding, gets appended to the list. After all the files in the directory have been processed, bind_rows should combine them into one overall dataframe, but it only returns an empty tibble. list_of_dataframes is also empty.
I've tried the solution in this answer but it didn't help:
Append a data frame to a list
https://www.dropbox.com/sh/z8vh50b370gcb1j/AAAcbnfAUOM6-y8uWn4-lUWLa?dl=0
This is a link to the raw files I am using in this case, but I think the problem is a general one.
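For reference, here is a minimal sketch (not from the original code) of how a result list is usually grown in R, since append() returns a new list rather than modifying its argument in place; it assumes the same combined_function and file.list as above:
list_of_dataframes <- list()
for (file in file.list) {
  # either reassign the result of append(), or store each result by index/name with [[ ]]:
  list_of_dataframes[[file]] <- combined_function(file)
}
bind_rows(list_of_dataframes, .id = "column_label")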
Appendix:
These are the functions combined_functions refers to. They work on the individual cases, so I'm confident this is not the cause of the problem, but I've included them for completeness anyway.
rm(list = ls())
library(quanteda)
library(quanteda.corpora)
library(readtext)
library(LexisNexisTools)
library(tidyverse)
library(tools)
getcorp <- function(file_name){
  # function to take the lexis word document, convert it into a quanteda corpus object,
  # returns the duplicates df and the date from the filename in a list
  LNToutput <- lnt_read(file_name)
  duplicates_df <- lnt_similarity(LNToutput = LNToutput,
                                  threshold = 0.99)
  duplicates_df <- duplicates_df[duplicates_df$Similarity > 0.99] # https://github.com/JBGruber/LexisNexisTools creates dataframe of duplicate articles
  LNToutput <- LNToutput[!LNToutput@meta$ID %in% duplicates_df$ID_duplicate, ] # removes these duplicates from the main dataframe
  corp <- lnt_convert(LNToutput, to = "quanteda") # to return multiple values from the r function, must be placed in a list
  corp_date_from_file_name <- basename(file_name)
  file_date <- as.Date(corp_date_from_file_name, format = "%d_%m_%y")
  list_of_returns <- list(duplicates_df, corp, file_date) # duplicates df in first position, corpus in second, file date in third
  list_of_returns
}
get_dfm_df <- function(corp_list){
  # takes the corp from getcorp, applies the lexicoder dictionary, adds the neg_pos etc to their equivalent columns,
  # calculates the percentage each category is of the total number of sentiment bearing words, adds the date specified from the file name
  corpus_we_want <- corp_list[[2]]
  sentiment_df <- dfm(corpus_we_want, dictionary = data_dictionary_LSD2015) %>% # applies the dictionary
    convert("data.frame") %>%
    cbind(docvars(corpus_we_want)) %>% # https://stackoverflow.com/questions/60419692/how-to-convert-dfm-into-dataframe-but-keeping-docvars
    as_tibble() %>%
    mutate(combined_negative = negative + neg_positive, combined_positive = positive + neg_negative) %>%
    mutate(pos_percentage = combined_positive / (combined_positive + combined_negative),
           neg_percentage = combined_negative / (combined_positive + combined_negative)) %>%
    mutate(date = corp_list[[3]])
  sentiment_df
}

Related

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a kind of awkward structure. The files are generated by a software program that looks like it was intended to be printed and viewed rather than exported and used. The file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may change with different files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = ".txt",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
  x <- z[[i]]
  colnames(x) <- x[2, ]
  x[c(-1, -2), ]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
j <- for (n in cmpd_names) {
  result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency as far as time (I can import hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
  data.frame(file_name = c('test1.txt', 'test2.txt')) %>%
  rowwise %>%
  ## read file content into a raw string:
  mutate(raw = read_file(file_name)) %>%
  ## separate raw file contents into rows
  ## using newline and carriage return as row delimiters:
  separate_rows(raw, sep = '[\\n\\r]') %>%
  ## provide a compound column for later grouping
  ## by extracting the 'Compound' string from column raw
  ## or setting the compound column to NA otherwise:
  mutate(compound = ifelse(grepl('^Compound', raw),
                           gsub('.*(Compound .*):.*', '\\1', raw),
                           NA)
         ) %>%
  ## remove rows with empty raw text:
  filter(raw != '') %>%
  ## filling missing compound values (NAs) with last non-NA compound string:
  fill(compound, .direction = 'down') %>%
  ## keep only rows with tab-separated raw string
  ## indicating tabular data:
  filter(grepl('\\t', raw)) %>%
  ## insert a column header 'Index' because
  ## original format has four data columns but only three header cols:
  mutate(raw = gsub(' *\\tName', 'Index\tName', raw))
The steps above result in a dataframe with a column 'raw' containing the cleaned-up data as strings suited for conversion into tabular data (tab-delimited, with linefeeds).
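As suggested above, the intermediate stage can be inspected at any point; for instance:
head(intermediate_result) # shows the file_name, raw and compound columns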
From there on, we can either proceed by keeping and handling the future single tables inside the parent table as a so-called list column (Variant A), or by splitting column 'raw' and mapping over it (Variant B, credit to @Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
  group_by(compound) %>%
  ## the nifty piece: you can store dataframes inside a dataframe:
  mutate(
    tables = list(read.table(text = raw, header = TRUE, sep = '\t'))
  )
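A possible follow-up (a sketch assuming the same intermediate_result; not part of the original answer): to end up with exactly one table per compound, summarise() can replace mutate(), and the list column can then be pulled out and named:
tables_by_compound <- intermediate_result %>%
  group_by(compound) %>%
  summarise(tables = list(read.table(text = raw, header = TRUE, sep = '\t')))
# a named list of dataframes, one per compound:
list_of_tables <- setNames(tables_by_compound$tables, tables_by_compound$compound)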
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
  split(f = as.factor(.$compound)) %>%
  lapply(function(x) x %>%
           separate(raw,
                    into = unlist(str_split(x$raw[1], pattern = "\t"))
                    )
         )

How do I create / name dataframes in a for loop in R?

So I'm currently trying to scrape precinct results by county from JSON files on the Virginia Secretary of State's website. I got code working that gets the data from a URL and creates a dataframe named after the county. To speed up the process, I tried to put the code inside a for loop that iterates through Virginia's counties (which I'm sourcing from a 2020 election-by-county CSV already on my computer that I constructed from this: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ), constructs the URL for the county's JSON file (since the format is consistent), and saves it to a dataframe. My current code doesn't save the dataframes though, so only the last county remains.
This is the code:
library(dplyr)
library(tidyverse)
library(jsonlite)
va <- filter(biden_margin, biden_margin$state_po == "VA")
#i put this line here because the spreadsheet uses spaces to separate "X" and "city" but the URL uses an underscore
va$county_name <- gsub(" ", "_", va$county_name)
#i put this line here because the URLs have "county" in the name, but the spreadsheet doesn't; however the spreadsheet does have "city" for the independent cities, like the URLs (and the independent cities are the observations with FIPS above 51199)
va$county_name <- if_else(va$county_fips > 51199, va$county_name, paste0(va$county_name, "_COUNTY"))
#i did this as a list but i realize this might be a bad idea
governor_data <- vector(mode = "list", length = nrow(va))
for (i in nrow(va)) {
  precincts <- paste0("https://results.elections.virginia.gov/vaelections/2021%20November%20General/Json/Locality/", va$county_name[i], "/Governor.json")
  name <- paste0(va$county_name[i], "_governor_2021")
  java_source <- stream_in(file(precincts))
  df <- as.data.frame(java_source$Precincts)
  df$county <- java_source$Locality$LocalityName
  df <- unnest(df, cols = c(Candidates))
  df <- subset(df, select = -c(PoliticalParty, BallotOrder))
  df <- pivot_wider(df, names_from = BallotName, values_from = c(Votes, Percentage))
  #tried append before this, got the same result
  governor_data[i] <- assign(name, df)
}
Any thoughts?
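For context, a minimal sketch of the loop pattern that keeps every iteration's result (using the va data frame from the question; the JSON-processing steps are reduced to a placeholder here): the loop has to run over seq_len(nrow(va)) rather than nrow(va), and double brackets are needed to store a whole data frame as one list element.
governor_data <- vector(mode = "list", length = nrow(va))
for (i in seq_len(nrow(va))) {                  # 1..nrow(va), not the single value nrow(va)
  df <- data.frame(county = va$county_name[i])  # placeholder for the data frame built from the JSON
  governor_data[[i]] <- df                      # [[ stores the whole data frame as one element
}
names(governor_data) <- paste0(va$county_name, "_governor_2021")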

How to select the exact matches for a list of variables to append datasets

I have different datasets for different waves. Each wave has its own prefix for both dataset and variable names. I'm trying to import and append all the data files, keeping only the subset of variables I need. Therefore, I'm currently doing:
var_list <- c("pidp", "jbsat", "jbhrs", "jbnssec8_dv", "panssec8_dv", "manssec8_dv", "paedqf", "maedqf", "qfhigh", "age_dv",
"sex_dv", "psu", "strata", "employ", "jbhas", "jboff", "jbsem", "jbstat", "jbterm1", "jbterm2", "pjbptft", "fimnet_dv",
"fimngrs_dv", "fimnlabnet_dv", "seearnnet_dv", "fimnmisc_dv", "fimnprben_dv", "fimninvent_dv", "fimnpen_dv", "fimnsben_dv",
"hhtype_dv", "livesp_dv", "nch14resp", "nmpsp_dv", "tenure_dv", "urban_dv", "jbsat", "health", "sf1", "scghqa",
"scghqb", "scghqc", "scghqd", "scgqhe", "scgqhf", "scghqg", "scghqi", "scghqj", "scghqh", "scghql", "sclsat1",
"sclsat2", "sclsat3", "sclsat4", "indscus_lw", "indscub_xw")
Then import the data for the first wave, selecting these variables and remove the wave-prefix:
longfile <- read_dta(file = paste0(dir, "ukhls_w1/a_indresp.dta")) %>%
  select(matches(var_list)) %>%
  rename_at(vars(starts_with("a_")), ~str_replace(., "a_", "")) %>% #remove the wave prefix
  mutate(wave = 1)
At this point, I would simply use the following loop:
for (wn in 2:10) {
  wl <- paste0(letters[wn], "_")
  wave_data <- read_dta(paste0(dir, "ukhls_w", wn, "/", wl, "indresp.dta")) %>%
    select(matches(var_list)) %>%
    rename_at(vars(starts_with(wl)), ~str_replace(., wl, "")) %>% # remove prefix wave
    mutate(wave = wn)
  longfile <- rbind(longfile, wave_data)
}
However, the problem is that some variable names match more than one column in the files for the subsequent waves. For example, in the second wave there is a variable named "nxtjbhrs", which will be included because it matches "jbhrs". This creates an error in the rbind because the number of columns differs.
How can I select the exact matches in this case? Or force the appending of the dataset?
Thanks for your support!
select(setdiff(names(.), var_list))
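One possible way to force exact matches despite the wave prefix (a sketch of my own, not taken from the fragment above) is to anchor the variable names in a regular expression passed to matches(), allowing an optional wave prefix:
exact_pattern <- paste0("^(", wl, ")?(", paste(var_list, collapse = "|"), ")$")
wave_data <- read_dta(paste0(dir, "ukhls_w", wn, "/", wl, "indresp.dta")) %>%
  select(matches(exact_pattern)) %>%           # keeps e.g. b_jbhrs but drops b_nxtjbhrs
  rename_at(vars(starts_with(wl)), ~str_replace(., wl, "")) %>%
  mutate(wave = wn)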

Add New Field to File Names in Directory - R

I am pulling 10-Ks off the SEC website using the EDGAR package in R. Fortunately, the text files come with a consistent file naming convention: CIK number (this is a unique filing ID)_File type_Date.
Ultimately I want to analyze these by SIC/industry group, so I think the best way to do this would be to add the SIC industry code to this filename rule.
I am including an image of what I would like to do below. It is kind of like a database join, except my file names would take on the new field. Not sure how to do that; I am pretty new to R and file scripting.
I am assuming that you have a data.frame with a column filenames (or a vector containing all the filenames). See the code below:
# A data.frame with a character column 'filenames'
df$CIK <- sapply(df$filenames, FUN = function(x) {unlist(strsplit(x, split = "_"))[1]})
df$CIK <- as.character(df$CIK)
Now, let us assume that you have another data.frame with two columns: CIK and SIC.
# A data.frame with two character columns: 'CIK' and 'SIC'
# df2.
#
# We add another column to the first data.frame: 'new_filename'
df$new_filename <- sapply(1:nrow(df), FUN = function(idx, CIK, filenames, df2) {
  SIC <- df2$SIC[which(df2$CIK == CIK[idx])]
  new_filename <- as.character(paste(SIC, "_", filenames[idx], sep = ""))
  new_filename
}, CIK = df$CIK, filenames = df$filenames, df2 = df2)
# Now the new filenames are available in df$new_filename
View(df)
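A join-based alternative (just a sketch, assuming the same df and df2 as above) that mirrors the "database join" idea from the question:
df <- merge(df, df2, by = "CIK", all.x = TRUE)        # left join on the CIK column
df$new_filename <- paste0(df$SIC, "_", df$filenames)  # prepend the SIC code to each filename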

R, creating variables on the fly in a list using assign statement

I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and site. After reading, a particular column (the 13th, here) is assigned to a variable aot500. I want to have this variable returned from the function for each combination of product and site. For example, I need variables named aot500.AF, aot500.CM, aot500.RB in the list statement to be returned from this function. I am having trouble with the return statement. There is no error, but there is nothing in dat_in; I expect it to have dat_in$aot500.AF etc. Please advise what is wrong in the return statement. Furthermore, I want to read files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
  fname.dsa <- paste("../data/site_data_", prod[pp], "/daily_", site[kk], ".dat", sep = "")
  inp.aod <- read.csv(fname.dsa, skip = 4, sep = ",", stringsAsFactors = F, na.strings = "N/A")
  aot500 <- inp.aod[, 13]
  return(list(assign(paste("aot500", siteabbr[kk], sep = "."), aot500)))
}
Almost always there is no need to use assign(); we can solve the problem in two steps: read the files into a list, then give it names.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i){
  fname.dsa <- paste0("../data/site_data_",
                      prod_site[i, "prod"],
                      "/daily_",
                      prod_site[i, "site"],
                      ".dat")
  inp.aod <- read.csv(fname.dsa,
                      skip = 4,
                      stringsAsFactors = FALSE,
                      na.strings = "N/A")
  inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
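Once named, an individual series can then be pulled out directly by its constructed name, for example:
res[["aot500.inv.AF"]]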
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements:
#' the name of a product and the name of a site
readAOT_500 <- function(prodsite,
                        selectedcolumn = c("AOT_500"),
                        path = tempdir()){
  cat(path, prodsite)
  filename <- paste0(path, prodsite[1],
                     prodsite[2], ".csv")
  dtf <- read.csv(filename, stringsAsFactors = FALSE)
  dtf <- dtf[selectedcolumn]
  dtf$prod <- prodsite[1]
  dtf$site <- prodsite[2]
  return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv", "AF"),
                    c("tot", "AF"),
                    c("inv", "CM"),
                    c("tot", "CM"),
                    c("inv", "RB"),
                    c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
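If the list gets long, the same combination can also be done in a single call, which is usually a bit faster:
prodsitedata <- do.call(rbind, prodsitedata)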
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv', 'tot'),
                        site = c('AF', 'CM', 'RB')) %>%
  # Group variables to use do() later on
  group_by(prod, site)
Create 6 fake files by sampling from the data you provided
You can skip this section when you have real data. I used various sample lengths so that the number of observations differs for each site.
prodsite$samplelength <- sample(1:495, nrow(prodsite))
prodsite %>%
  do(stuff = write.csv(sample_n(daily_CM, .$samplelength),
                       paste0(tempdir(), .$prod, .$site, ".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
  do(read.csv(paste0(tempdir(), .$prod, .$site, ".csv"),
              stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
  select(prod, site, AOT_500)
