Transfer text file to table in R with some conditions on it

I have a text file that looks like this:
DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.
The text runs to roughly 300 lines, and sometimes the address spills onto a second line. I want to convert this text either to CSV format, which would hold the data like this:
DOB, Name, Address
13-03-2003,ABC,xyz.
or at least into one data frame. I have tried many things: with read.table("file.txt", sep = "\n") everything ends up in a single column. I also tried building the headers first with
header <- read.table("file.txt", sep = "\n")
and then data <- read.table("file.txt", skip = 3, sep = "\n"), combining both afterwards, but it does not work out: my header has 3 entries while the data has roughly 300 rows, so they do not line up as required. Any help will be really helpful :)

You could try
entries <- unlist(strsplit(text, "\\n")) #separate entries by line breaks
entries <- entries[nchar(entries) > 0] #remove empty lines
as.data.frame(matrix(entries, ncol=3, byrow=TRUE)) #assemble dataframe
# V1 V2 V3
#1 DOB Name Address
#2 13-03-2003 ABC xyz.
#3 12-08-2004 dfs 1 infinite loop.
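If the data lives in a file rather than in a string, readLines() already gives one element per line, so the strsplit() step can be skipped (a sketch, assuming the file is called file.txt as in the question; like the answer above, it assumes every record occupies exactly three lines):
entries <- readLines("file.txt") # one element per line
entries <- entries[nchar(entries) > 0] # drop any empty lines
as.data.frame(matrix(entries, ncol = 3, byrow = TRUE))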
Data
text <-'DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.'

Two assumptions were made. First, there will not be any blank names or dates of birth; by "blank" I do not mean "NA", "", or any other marker that the value is missing, but a genuinely empty line. Second, names and DOBs occupy only one line each (only an address may spill onto extra lines).
s1 <- gsub("^\n|\n$", "", strsplit(x, "\n\n+")[[1]]) # split records on blank lines, trim stray newlines
stars <- gsub("\n", ", ", sub("\n", "*", sub("\n", "*", s1))) # mark the first two line breaks with "*", join the rest with ", "
mat <- t(as.data.frame(strsplit(stars, "\\*"))) # split each record on "*" into three fields
dimnames(mat) <- NULL
write.csv(mat, "filename.csv")
We start by splitting the text on the blank lines and eliminating any leading or trailing newlines. Then we replace the first and second "\n" in each record with stars and join any remaining lines with ", ". Next we split on those star markers, which guarantees 3 elements for each row, build a matrix from the values, transpose it for display, and write the data to CSV.
When opened with Notepad on a test file, I get:
"","V1","V2","V3"
"1","DOB","Name","Address"
"2","13-03-2003","ABC","xyz."
"3","12-08-2004","dfs","1 infinite loop"
"4","01-01-2000","Bob Smith","1234 Main St, Suite 400"
Row names can be switched off with row.names = FALSE (see ?write.csv) if desired.
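Note that write.csv() ignores attempts to change col.names, so dropping the auto-generated "V1","V2","V3" header line as well calls for write.table(), for example:
write.table(mat, "filename.csv", sep = ",", row.names = FALSE, col.names = FALSE)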
Data
x <- "DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop
01-01-2000
Bob Smith
1234 Main St
Suite 400
"

Related

Name splitting in base R

I've got a list of names that have been written in a messy way in a single column. I'm trying to extract first name, middle names and last names out of this column to store separately.
To do this, I gsub the first word from each name entry and save it as the first name. I then remove the first and last word of each entry and save the remainder as the middle names. Finally, I gsub the last word from each entry and save it as the last name.
This gave me a problem: for entries that have only one name (so 'kevin' instead of 'kevin banks'), my code saves the first name as the last name ('kevin kevin'). I tried to fix it using a for-loop that blanks the lastname column when the original name entry has only 1 word. When I try this, ALL the lastname entries are empty, even the ones that do have a last name!
This is my code:
df <- data.frame(ego = c("linda", "wendy pralice of rivera", "bruce springsteen", "dan", "sam"))
df$firstname <- gsub("([A-Za-z]+).*", "\\1", df$ego)
df$middlename <- gsub("^\\w*\\s*", "", gsub("\\s*\\w*\\.*$", "", df$ego))
df$lastname <- gsub("^.* ([A-Za-z]+)", "\\1", df$ego)
for (n in df$ego) {
  if (lengths(strsplit(n, " ")) == 1) {
    df$lastname <- ""
  }
}
What am i doing wrong?
If there are 4 fields, put double quotes around the middle two. For example, a b c d would be changed to a "b c" d, giving s1. (If there are not 4 fields, no substitution is done and s1 is set to df$ego.)
If there are exactly two fields, insert a pair of double quotes between them. For example, a b would be changed to a "" b. (If there are not exactly two fields, no substitution is done and s2 is set to s1.)
Finally, read the result in.
s1 <- sub('^(\\w+) (\\w+ \\w+) (\\w+)$', '\\1 "\\2" \\3', df$ego)
s2 <- sub('^(\\w+) (\\w+)$', '\\1 "" \\2', s1)
read.table(text = s2, as.is = TRUE, fill = TRUE,
  col.names = c("first", "middle", "last"))
giving:
  first     middle        last
1 linda                       
2 wendy pralice of      rivera
3 bruce            springsteen
4   dan                       
5   sam                       
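As for what the original loop does wrong: df$lastname <- "" assigns to the entire column, so a single one-word entry blanks out every last name. A vectorised fix for just that part (a sketch that keeps the asker's gsub columns unchanged) would be:
one_word <- lengths(strsplit(df$ego, " ")) == 1 # TRUE for single-word entries
df$lastname[one_word] <- "" # blank the last name only on those rows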

Creating a table extracting the first letter in a string and counts in R

I am trying to extract the first letter of a string that are separated by commas, then counting how many times that letter appears. So an example of a column in my data frame looks like this:
test <- data.frame("Code" = c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK",
"RRRF"))
And I'd want a column added next to it that looks like this:
test2 <- data.frame("Code" = c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK",
"RRRF"), "Code_Count" = c("E1, S1", "E1", "S1, R2", "R1"))
The code count column extracts the first letter of the string and counts how many times that letter appears in that specific cell.
I looked into using strsplit to get the first letter in the column separated by commas, but I'm not sure how to attach the count of how many times that letter appears in the cell to it.
Here is one option using base R. This splits the Code column on the comma (and at least one space), then tabulates the number of times the first letter appears, then pastes them back together into your desired output. It does sort the new column in alphabetical order (which doesn't match your output). Hope this helps!
test2$Code_Count2 <- sapply(strsplit(test2$Code, ",\\s+"), function(x) {
  tab <- table(substr(x, 1, 1)) # Create a table of the first letters
  paste0(names(tab), tab, collapse = ", ") # Paste the letter together with its count and collapse
})
test2
              Code Code_Count Code_Count2
1       EKST, STFO     E1, S1      E1, S1
2             EFGG         E1          E1
3 SSGG, RRRR, RRFK     S1, R2      R2, S1
4             RRRF         R1          R1
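If keeping the letters in first-appearance order (to match the desired Code_Count exactly) matters, one option is to tabulate against a factor whose levels follow that order. A sketch, with Code_Count3 as a made-up column name:
test2$Code_Count3 <- sapply(strsplit(test2$Code, ",\\s+"), function(x) {
  first <- substr(x, 1, 1)
  tab <- table(factor(first, levels = unique(first))) # keep first-appearance order
  paste0(names(tab), tab, collapse = ", ")
})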
Here is a tidier, stringr/purrr solution that grabs the first letter of a word and does the same thing (instead of splitting the string)
library(purrr)
library(stringr)
map_chr(str_extract_all(test2$Code, "\\b[A-Z]{1}"), function(x) {
  tab <- table(x)
  paste0(names(tab), tab, collapse = ", ")
})
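On the sample data this should return the same counts, again sorted alphabetically within each cell:
# [1] "E1, S1" "E1"     "R2, S1" "R1"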
Data:
test2 <- data.frame("Code" = c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK",
"RRRF"), "Code_Count" = c("E1, S1", "E1", "S1, R2", "R1"))
test2[] <- lapply(test2, as.character) # factor to character

How to transform long names into shorter (two-part) names

I have a character vector of long names, each consisting of several words joined by dots as delimiters.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The names vary in length, but only the first two words of each full name are important.
My goal is to get names of at most 7 characters: the first 3 characters of each of the first two words, joined by a dot.
These examples come very close to what I need, but I do not know how to adapt them to my case:
R How to remove characters from long column names in a data frame
How to append names to "column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 word characters ((\\w{1,3})), then skip anything that is not a dot ([^\\.]*), match a dot (\\.), and again match up to 3 word characters ((\\w{1,3})). Finally, .* swallows whatever comes after. We then keep only the two captured groups, separated by a dot: \\1.\\2.
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i) {
  paste(substr(i[1], 1, 3), substr(i[2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not a regex expert.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
  words <- strsplit(ugly_string, "\\.")[[1]]
  short_words <- substr(words, 1, 3)
  new_name <- paste(short_words[1:2], collapse = ".")
  return(new_name)
}
# Testing the function (USE.NAMES = FALSE keeps the result unnamed)
sapply(x, cleaner_fun, USE.NAMES = FALSE)
[1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

How do I import XLSX files into an R data frame while maintaining carriage returns and line breaks?

I want to ingest all files in the working directory and scan all rows for line breaks or carriage returns. Instead of eliminating them, I'd like to divert them into a new output file for manual review. Here's what I have so far:
library(plyr)
library(dplyr)
library(readxl)
filenames <- list.files(pattern = "Sara Lee.*\\.xlsx$", ignore.case = TRUE)
read_excel_filename <- function(filename) {
  ret <- read_excel(filename, col_names = TRUE, skip = 5, trim_ws = FALSE)
  ret
}
import.list <- ldply(filenames, read_excel_filename)
returnornewline <- import.list[((import.list$"CUSTOMER SEGMENT")=="[\r\n]"|(import.list$"SECTOR NAME")=="[\r\n]"|
(import.list$"LOCATION NAME")=="[\r\n]"|(import.list$"LOCATION ID")=="[\r\n]"|
(import.list$"ADDRESS")=="[\r\n]"|(import.list$"CITY")=="[\r\n]"|
(import.list$"STATE")=="[\r\n]"|(import.list$"ZIP CODE")=="[\r\n]"|
(import.list$"DISTRIBUTOR NAME")=="[\r\n]"|(import.list$"REDISTRIBUTOR NAME")=="[\r\n]"|
(import.list$"TRANS DATE")=="[\r\n]"|(import.list$"DIST. INVOICE")=="[\r\n]"|
(import.list$"ITEM MIN")=="[\r\n]"|(import.list$"ITEM LABEL")=="[\r\n]"|
(import.list$"ITEM DESC")=="[\r\n]"|(import.list$"PACK SIZE")=="[\r\n]"|
(import.list$"REBATEABLE UOM")=="[\r\n]"|(import.list$"QUANTITY")=="[\r\n]"|
(import.list$"SALES VOLUME")=="[\r\n]"|(import.list$"X__1")=="[\r\n]"|
(import.list$"X__2")=="[\r\n]"|(import.list$"X__3")=="[\r\n]"|
(import.list$"VA PER")=="[\r\n]"|(import.list$"VA PER CODE")=="[\r\n]"|
(import.list$"TOTAL REBATE")=="[\r\n]"|(import.list$"TOTAL ADMIN FEE")=="[\r\n]"|
(import.list$"TOTAL INVOICED")=="[\r\n]"|(import.list$"STD VA PER")=="[\r\n]"|
(import.list$"STD VA PER CODE")=="[\r\n]"|(import.list$"EXC TYPE CODE")=="[\r\n]"|
(import.list$"EXC EXC VA PER")=="[\r\n]"|(import.list$"EXC VA PER CODE")=="[\r\n]"), ]
now <- Sys.time()
carriage_return_file_name <- paste(format(now,"%Y%m%d"),"ROWS with Carriage Returns or New Lines.csv",sep="_")
write.csv(returnornewline, carriage_return_file_name, row.names = FALSE)
Here's some sample data:
Customer Segment Address
BuyFood 123 Main St.\r
BigKetchup 679 Smith Dr.\r
DownUnderMeat 410 Crocodile Way
BuyFood 123 Main St.
I thought the trim_ws = FALSE argument would take care of this, but it doesn't.
Apologies for the column spam; I've yet to figure out an easier way to scan all the columns without listing them. Any help on that issue is appreciated as well.
EDIT: Added some sample data. I don't know how to show a carriage return in the address other than the regex of it. It doesn't look like that in the real sample data, that's just for our use here. Please let me know if that's not clear. The desired output would take the first 2 rows of data where there's a carriage return and output it to the csv file listed at the end of the code block.
EDIT 2: I used the code provided in the suggestion in place of the original long list of columns, as follows. However, this doesn't give me a new variable containing a data frame of rows with new lines or carriage returns. When I look at my global environment in RStudio I see another variable under Data called "returnornewline", but it shows as a large list, unlike the import.list variable, which shows as a data frame. This shouldn't be the case, because I've only added a carriage return in the first row of the first spreadsheet, so that list should not be so large:
returnornewline <- lapply(import.list, function(x) lapply(x, function(s) grep("\r", s)))
# (the original column-by-column filter shown above, commented out)
EDIT 3: I need to be able to take all rows in the newly created data frame "import.list" and scan them for any instances of carriage returns or new lines within all the rows. The example above is rudimentary, but the concept stands. In the example, I'd expect for the script to read the first two rows and say "hey, these rows have carriage returns, add this to the variable assigned to this line of code and at the end of the script output this data to a csv." The remaining two rows in the sample data above have no need to be output because they have no carriage returns in their data.
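As a side note, comparisons like import.list$"ADDRESS" == "[\r\n]" test whether a cell equals that literal string rather than matching it as a regular expression, which is why the original filter finds nothing. A vectorised filter that scans every column without naming them might look like this (a sketch, assuming import.list is the data frame produced by ldply()):
has_break <- Reduce(`|`, lapply(import.list, function(col) grepl("[\r\n]", col))) # TRUE where any column contains \r or \n
returnornewline <- import.list[has_break, ]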

R: Read text files with blanks and unequal number of columns

I am trying to read many text files into R using read.table. Most of the time we have clean text files which have defined columns.
The data that I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
You can see that the blanks and length of text files varies by report.
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt
My objective is to read many of these text files and combine them into a dataset.
If I can read one of them, then compiling should not be an issue. However, I am running into several issues because of the format of the text file:
1) The number of FIRMS varies from report to report. For example, sometimes there will be 3 rows of data to import (i.e. 3 firms that did business on that date) and sometimes there may be 10.
2) Blank fields cause problems. For example, under the FIRM section there should be a column for Deliveries (DEL) and one for Receipts (REC). When read in, the data from THIS section should look like:
df <- data.frame("FIRM_#" = c(407, 685, 800, 905),
"FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
"DEL" = c(1,1,15,1), "REC"= c(NA,18,NA,NA))
However, when I read this in, the formatting is all messed up and NA is not filled in for the blank values.
3) The above issues apply for "YARDS" and "FUTURE DELIVERIES SCHEDULED" section of the text file.
I have tried to read in sections of the text file and then format them accordingly, but since the number of firms changes from day to day the code does not generalize.
Any help would greatly be appreciated.
Here is an answer that starts from scratch, using rvest to download the data, and includes a lot of reformatting. The general idea is to identify fixed widths that can be used to separate columns; I used a little help from SO for this purpose: link.
You could then use read.fwf() in combination with cat() and tempfile(). In my first attempt this did not work due to some formatting issues, so I added some additional lines to get the final table format.
Maybe there are some more elegant options and shortcuts I have overlooked, but at least my answer should get you started. Of course, you will have to adapt the selection of lines and the identification of widths for splitting tables depending on which parts of the data you need. Once this is settled, you may loop through all the reports to gather data. I hope this helps...
library(rvest)
library(dplyr)
page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")
table <- page %>%
  html_text() %>%
  #reformat by splitting on line breaks
  { unlist(strsplit(., "\n")) } %>%
  #select range based on strings in specific lines
  "["(., (grep("FIRM #", .):(grep(" DELIVERIES SCHEDULED", .) - 1))) %>%
  #exclude empty rows
  "["(., !grepl("^\\s+$", .)) %>%
  #fix width of table to the right
  { substring(., 1, nchar(gsub("\\s+$", "", .[1]))) } %>%
  #strip white space on the left
  { gsub("^\\s+", "", .) }
headline <- unlist(strsplit(table[1], "\\s{2,}"))
get_split_position <- function(substring, string) {
  nchar(string) - nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string, perl = TRUE))
}
#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {
  get_split_position(x, table[1])
})
#exclude headline from split
table <- lapply(table[-1], function(x) {
  substring(x, c(1, split_positions + 1), c(split_positions, nchar(x)))
})
table <- do.call(rbind, table)
colnames(table) <- headline
#strip whitespace
table <- gsub("\\s+", "", table)
table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2, as.numeric)
table
# FIRM # FIRM NAME DEL REC
# 1 407 STRAITSFINLLC 1 NA
# 2 685 R.J.O'BRIENASSOC 1 18
# 3 800 ROSENTHALCOLLINSLL 15 NA
# 4 905 ADMINVESTORSERVICE 1 NA
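Once this works for a single report, gathering several of them could look roughly like this (a sketch; parse_report() is a hypothetical wrapper around the steps above, not a function defined in this answer):
parse_report <- function(url) {
  # wrap the steps above (read_html() through the final data frame) here
  # and return the finished firm table for one report
}
urls <- c("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt",
  "ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt")
all_firms <- do.call(rbind, lapply(urls, parse_report))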
