Importing multiple .txt files into R

I need to import multiple .txt files into R. Each file has multiple sentences in it (e.g., "On Monday, I went to the park."). I would like to import all the files at the same time and then add them to a tibble, so that I can do text analysis on them.
So far, I have tried:
# to create a vector of txt files
files <- list.files(pattern = "txt$")
# read all the files and create a FileName column to store filenames
files_list <- files %>%
  set_names(.) %>%
  map_df(read_table2, .id = "FileName")
my_data <- read.delim(file(files))
But I don't know how to actually load the text in each .txt file into the data. When I run the code above, it only reads in the text from one of the files, not all of them.
I also tried:
sapply(files, read.delim)
mainlist = list()
for (i in 1:length(files)) {
  mainlist[[i]] = read.delim(files[i], header = TRUE, sep = "\t")
}
And while it prints out all the info in each .txt file, when I try to put it in a tibble using
mainlist_tib <- tibble(mainlist)
the tibble is empty.
Any assistance would be greatly appreciated!
Edit: Regarding the tibble, I would like for it to have a column for the txt file name and then another column for the text from the file, and then to be able to use the unnest_tokens() function to have a tibble where each row contains only one word. Sort of like in the example from the text mining textbook by Silge and Robinson: https://www.tidytextmining.com/tidytext.html

You could try it like this:
library(dplyr)
library(purrr)
files %>%
  set_names(.) %>%
  map_dfr(~ readr::read_table(.x, col_names = FALSE), .id = "FileName")
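If you then want the tidytext workflow mentioned in the edit, here is a minimal sketch building on the same idea; it assumes readr::read_lines (so each line of text stays in one column rather than being split on whitespace) and the tidytext package:
library(dplyr)
library(purrr)
library(readr)
library(tidytext)
tidy_words <- files %>%
  set_names() %>%
  # one row per line of text, with the source file recorded in FileName
  map_dfr(~ tibble(text = read_lines(.x)), .id = "FileName") %>%
  # one row per word, as in Silge & Robinson
  unnest_tokens(word, text)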

Related

Read a folder of excel files and import individual sheets as separate df's in R with Names

I have a folder of Excel files that each contain multiple sheets. The sheets are named the same in each workbook. I'm trying to import one specific named sheet from all the Excel files as separate data frames. I have been able to import them; however, the names become df_1, df_2, df_3, etc. I've been trying to take the first word of the Excel file name and use that to identify the df.
For example, for an Excel file named "AAPL Multiple Sheets", the sheet I'm importing as a df would be named "Balance". I would like "AAPL Balance df" as the result.
The code that came closest to what I'm looking for is below; however, it names each data frame df_1, df_2, and so on.
library(purrr)
library(readxl)
files_list <- list.files(path = 'C:/Users/example/Drive/Desktop/Total_Related_Data/Analysis of Data/',
                         pattern = "*.xlsx", full.names = TRUE)
files_list %>%
  walk2(1:length(files_list),
        ~ assign(paste0("df_", .y), read_excel(path = .x), envir = globalenv()))
I tried using the file path variable `files_list` in the `paste0` function to label them and ended up with:
df_C:/Users/example/Drive/Desktop/Total_Related_Data/Analysis of Data/.xlsx1, df_C:/Users/example/Drive/Desktop/Total_Related_Data/Analysis of Data/.xlsx2,
and so on.
I tried to make a list of file names to use. This read the file names and created a list, but I couldn't make it work with the code above.
files_Names <- list.files(path = 'C:/Users/example/Drive/Desktop/Total_Related_Data/Analysis of Data/', pattern = NULL, all.files = FALSE, full.names = FALSE)
Which resulted in this:
"AAPL Analysis of Data.xlsx" for all the files in the list.
You can do the following (note that I'm using the openxlsx package for reading in Excel files, but you can replace that part with readxl of course):
library(openxlsx)
library(tidyverse)
Starting with your `files_list` we can do:
# using lapply to read in all files and store them as list elements in one list
list_of_dfs <- lapply(as.list(files_list), function(x) readWorkbook(x, sheet = "Balance"))
# Create a vector of names based on the first word of the filename + "Balance"
# Note that we can't use empty space in object names, hence the underscore
df_names <- paste0(str_extract(basename(files_list), "[^ ]+"), "_Balance_df")
# Assign the names to our list of dfs
names(list_of_dfs) <- df_names
# Push the list elements (i.e. data frames) to the Global environment
# I highly recommend NOT doing this. I'd say in 99% of the cases it's better to continue working in the list structure or combine the individual dfs into one large df.
list2env(list_of_dfs, env = .GlobalEnv)
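If you go the "one large df" route instead, a minimal sketch (assuming the sheets share the same format) is:
library(dplyr)
# one data frame with a column recording which file each row came from
all_balances <- bind_rows(list_of_dfs, .id = "source")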
I hope I could reproduce your example without your data. I would create a function to have more control over the new file name.
I would suggest:
library(purrr)
library(readxl)
library(openxlsx)
target_folder <- 'C:/Users/example/Drive/Desktop/Total_Related_Data/Analysis of Data'
files_list <- list.files(path = target_folder,
                         pattern = "*.xlsx", full.names = TRUE)
tease_out <- function(file) {
  data <- read_excel(file, sheet = "Balance")
  filename <- basename(file) %>% tools::file_path_sans_ext()
  new_filename <- paste0(target_folder, "/", filename, " Balance df.xlsx")
  write.xlsx(data, file = new_filename)
}
map(files_list, tease_out)
Let me know if it works. I assume you are just targeting the sheet "Balance"?
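If you would rather end up with named data frames in your session than renamed .xlsx files on disk, a sketch under the same assumptions (same files_list, sheet "Balance") could return a named list instead:
library(purrr)
library(readxl)
library(stringr)
balance_dfs <- files_list %>%
  set_names(paste0(str_extract(basename(files_list), "[^ ]+"), "_Balance_df")) %>%
  map(~ read_excel(.x, sheet = "Balance"))
# access e.g. balance_dfs$AAPL_Balance_df, or push to the global env with list2env()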

Read one worksheet from multiple excel files using purrr and readxl and add field

Suppose I have two Excel files, each named packs.xlsx, and each contains multiple sheets. I want to iteratively create a data frame using only one sheet from each file, the one named "summary". How can I go about this using purrr and readxl while also adding a field which contains the filename?
I'm successful when I save the sheets as CSVs using the following code:
filenames <- list.files(pattern="packs*.*csv")
dat <- map_dfr(filenames, read_xlsx, sheet = "summary") %>% glimpse()
How would I go about adding a field to show which file a given row came from? Thanks for any insight that can be offered!
Supposing the two packs.xlsx files are in different subfolders:
library(readxl)
filenames <- list.files(pattern = "packs.xlsx", recursive = TRUE)
df <- lapply(filenames, function(fn) {
  # get the sheet detail
  xl <- read_excel(fn, sheet = "summary")
  # add the filename as a field
  xl$filename <- fn
  # function return
  xl
})
# if both summary sheets have the same format, you can combine them into one
fin <- do.call(rbind, df)
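Since the question asks about purrr specifically, the same logic as a hedged map_dfr sketch (the names set by set_names become the .id column):
library(purrr)
library(readxl)
df <- filenames %>%
  set_names() %>%
  map_dfr(read_excel, sheet = "summary", .id = "filename")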

How to remove certain columns in multiple files in R?

Hi everyone. I want to remove certain columns in multiple files (csv).
For example, I have 50 files, and I want to delete columns a, b, and c in every file.
The point is I don't know how to loop over the files, save the change in every single file, and keep the original file name.
library(tidyverse)
library(here)
# I want to delete some columns which contain messy code
# input: a list of files
df <- list.files(here("Data"), pattern = ".csv", full.names = TRUE) %>%
  lapply(read_csv) %>%                     # read each csv
  lapply(subset, select = -c(a, b, c))     # remove the messy columns
write.csv(df, file = here())
# I want to save the change in the original files, but I don't know how to do it.
Read all the files (if all the files are in the working directory) directly into a list and process them.
files <- list.files(pattern = "\\.csv$") # all the csv files in the working directory
lst2 <- lapply(files, function(x) read.table(x, header = TRUE, sep = ","))
# drop the unwanted columns by name
lst2 <- lapply(lst2, function(x) x[setdiff(names(x), c("a", "b", "c"))])
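To then save each cleaned data frame back under its original file name (which the question asks for but the answer doesn't show), a minimal sketch reusing the same files vector:
# overwrite each original file with its cleaned version
Map(function(dat, path) write.csv(dat, path, row.names = FALSE), lst2, files)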

Naming the columns of a merged file equal to the folder name the source file comes from

I have written a script in R that combines my text files, each having one column of data, into a .csv file where all the columns are listed beside each other. Unfortunately, my analysis software always labels the text files in the same way, so that all the text files are called "List".
With the code below, I was able to combine the different text files into a .csv file.
fileList <- list.files(path = ".", recursive = TRUE, pattern = "DistList.txt", full.names = TRUE)
listData <- lapply(fileList, read.table)
names(listData) <- gsub("DistList.txt","",basename(fileList))
library(tidyverse)
library(reshape2)
bind_rows(listData, .id = "FileName") %>%
  group_by(FileName) %>%
  mutate(rowNum = row_number()) %>%
  dcast(rowNum ~ FileName, value.var = "V1") %>%
  select(-rowNum) %>%
  write.csv(file = "Result.csv")
Now I would like to change the column names in such a way that each is equal to the name of the folder in which the text file is located. As I don't have that much experience using R yet, I can't figure out how I should do it.
Thank you very much for your help already in advance!
The line
names(listData) <- gsub("DistList.txt", "", basename(fileList))
should be:
names(listData) <- gsub("DistList.txt", "", fileList)
Because by using basename we are removing all the folders, leaving us with the filename "DistList.txt", and that whole filename then gets replaced by the empty string "" using gsub, so every name ends up empty.
We might actually want the line below instead, which extracts the last directory and should give in this case something like c("C1.1", "C1.2", ...):
names(listData) <- basename(dirname(fileList))
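For example, with a hypothetical path just to illustrate:
fileList <- "Analysis/C1.1/DistList.txt"
basename(dirname(fileList))
#> [1] "C1.1"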

Loading a zip file, converting ".docx" into a text file, exporting back in R

I am trying to load a zip file into R. This zip file has hundreds of ".docx" documents in it. I want to convert each of these ".docx" documents into a ".txt" file.
Is there any way to automate this process in R?
The zip file is called "Documents.zip"!
With the code below, you can get a data.frame with the content of your documents stored in "Documents.zip".
library(officer)
library(purrr)
library(magrittr)
docx_scan_data <- unpack_folder("Documents.zip", folder = "docx_zips") %>%
  list.files(pattern = "\\.docx$", recursive = TRUE, full.names = TRUE) %>%
  map_df(function(x) {
    data <- read_docx(path = x) %>%
      docx_summary()
    data$path <- x
    data
  })
It should then be easy to create text files from the result; the text content is stored in the column `text`.
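For example, a minimal sketch (assuming one .txt per source document, named after its .docx file):
library(dplyr)
library(purrr)
docx_scan_data %>%
  group_split(path) %>%
  walk(function(d) {
    # swap the .docx extension for .txt and write one paragraph per line
    writeLines(d$text, sub("\\.docx$", ".txt", unique(d$path)))
  })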
