I'm trying to read thousands of CSV files into R with read.csv, but I run into a lot of trouble when my text contains commas.
Each file has 16 columns with headers. Some of the text in column 1 contains commas; column 2 is a string, and column 3 is always a number.
For instance an entry in column 1 is:
"I do not know Robert, Kim, or Douglas"- Marcus. A. Ten, Inc President
When I try to
df <- do.call("rbind", lapply(paste(CSVpath, fileNames, sep=""), read.csv, header=TRUE, stringsAsFactors=TRUE, row.names=NULL))
I get a df with more than 16 columns and the above text is split into 4 columns:
V1 V2 V3 V4
"I do not know Robert Kim or Douglas" - Marcus. A. Ten Inc President
when I need it all in one column as:
V1
"I do not know Robert, Kim, or Douglas"- Marcus. A. Ten, Inc President
First, if you have control over the data output format, I strongly urge you to either (a) correctly quote the fields, or (b) use another character as a delimiter (e.g., tab, pipe "|"). This is the ideal solution, as it will certainly speed up future processing and "fix the glitch", so to speak.
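If you do control the writer, here is a minimal sketch of both options (assuming the producing code is R and df holds the data; adapt to your actual writer):
# Option (a): write.csv() quotes character fields by default,
# so embedded commas become safe to re-read
write.csv(df, "fixed.csv", row.names = FALSE)
# Option (b): use a delimiter that never occurs in the text, e.g. a pipe
write.table(df, "fixed.psv", sep = "|", row.names = FALSE)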
Lacking that, you can try to programmatically fix all rows. Assuming that only the first column is problematic (i.e., all of the other columns are perfectly defined), you can, on a line-by-line basis, change the true separators to a different delimiter (e.g., pipe or tab).
For this example, I have 4 columns delimited with a comma, and I'm going to change the legitimate separators to a pipe.
Some data and magic constants:
txt <- '"I do not know Robert, Kim, or Douglas" - Marcus. A. Ten, Inc President,TRUE,0,14
"Something, else",FALSE,1,15
"Something correct",TRUE,2,22
Something else,FALSE,3,33'
nColumns <- 4 # known a priori
oldsep <- ","
newsep <- "|"
In your case, you'll read in the data:
txt <- readLines("path/to/malformed.csv")
nColumns <- 16
Do a manual (text-based, not parsing for data types) separation:
splits <- strsplit(readLines(textConnection(txt)), oldsep)
Realize that this reads, for example, the false fields as the verbatim characters "FALSE", not as a logical data type. This could be avoided if we took on the magic type-detection done by read.csv and cousins, but why bother?
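(Aside, not part of the fix: if you did want read.csv-style type guessing on individual fields later, base R exposes it as type.convert(), applied per column rather than per row.)
type.convert(c("TRUE", "FALSE"), as.is = TRUE)
# [1]  TRUE FALSE
type.convert(c("0", "14"), as.is = TRUE)
# [1]  0 14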
Per line: set aside the last nColumns-1 fields; recombine everything before them with the old separator, yielding a single first field (commas intact); then join that field and the remaining nColumns-1 fields with the new separator. (We also make sure embedded double-quotes are escaped correctly.)
txt2 <- sapply(splits, function(vec) {
  n <- length(vec)
  # short rows contain no embedded commas; just join with the new separator
  if (n < nColumns) return(paste(vec, collapse = newsep))
  # re-join the over-split leading fields into one field, commas intact
  vec1 <- paste(vec[1:(n - nColumns + 1)], collapse = oldsep)
  # quote the field, doubling any embedded double-quotes
  vec1 <- sprintf('"%s"', gsub('"', '""', vec1))
  # append the remaining well-formed fields, using the new separator
  paste(c(vec1, vec[(n - nColumns + 2):n]), collapse = newsep)
})
txt2[1]
# [1] "\"\"\"I do not know Robert, Kim, or Douglas\"\" - Marcus. A. Ten, Inc President\"|TRUE|0|14"
(The sprintf line may not be necessary if the original file has correct quoting of double-quotes ... but then again, if it had correct quoting, we wouldn't be having this problem in the first place.)
Now, either absorb the data directly into a data.frame:
read.csv(textConnection(txt2), header = FALSE, sep = newsep)
# V1 V2 V3 V4
# 1 "I do not know Robert, Kim, or Douglas" - Marcus. A. Ten, Inc President TRUE 0 14
# 2 "Something, else" FALSE 1 15
# 3 "Something correct" TRUE 2 22
# 4 Something else FALSE 3 33
or write these back to a file (good if you want to deal with these files elsewhere), adding con = "path/to/filename" as appropriate:
writeLines(txt2)
# """I do not know Robert, Kim, or Douglas"" - Marcus. A. Ten, Inc President"|TRUE|0|14
# """Something, else"""|FALSE|1|15
# """Something correct"""|TRUE|2|22
# "Something else"|FALSE|3|33
(Two notable changes: the correct comma-delimiters are now pipes, and all other commas are still commas; and there is correct quoting around the double-quotes. Yes, an escaped double-quote is just two double-quotes. That's what R expects if there are quotes within a field.)
NB: though this seems to work with my fabricated data (and I hope it works with yours), you do not hear of people touting R's speed and efficiency in doing text mangling in this fashion. There are certainly better ways to do this, perhaps using python, awk, or sed. There are possibly faster ways to do this in R.
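For what it's worth, here is one such possibly-faster all-R variant. It is a sketch that skips the re-quoting step (so read the result back with quote = ""): it peels the last nColumns-1 commas off each line from the right, one pass per separator, leaving any earlier (embedded) commas untouched.
# each pass converts the right-most remaining comma to a pipe;
# nColumns-1 passes leave only the embedded commas as commas
lines <- readLines(textConnection(txt))
for (i in seq_len(nColumns - 1)) {
  lines <- sub(",([^,]*)$", "|\\1", lines)
}
lines[1]
# [1] "\"I do not know Robert, Kim, or Douglas\" - Marcus. A. Ten, Inc President|TRUE|0|14"
read.csv(textConnection(lines), header = FALSE, sep = "|", quote = "")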
I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape the case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized throughout all the individual documents. See below for a reproduced example.
Is there a way I can use those two unique words, ("CASE HISTORY & INVESTIGATOR:"), to bound the text I would like to extract? If not, what sort of approach can I take to extracting the narrative data I need from each report?
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is what the expected output would be.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks so much for helping!
One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything after INVESTIGATOR: ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
After much deliberation I came to a solution I feel is worth sharing, so here we go:
# packages used below (readr/stringr/purrr for the helpers, magrittr for %>%)
library(readr)
library(stringr)
library(purrr)
library(magrittr)
# unlist text_data
file_contents_unlist <-
  paste(unlist(text_data), collapse = " ")
# read lines, squish for good measure.
file_contents_lines <-
  file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()
# Create indices in the lines of our text data based upon regex grepl
# calls; be sure they match if scraping multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)",
file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)",
file_contents_lines))
# function basically states, "give me back whatever's in those indices".
pull_case_num <-
  function(index_case_num_1, index_case_num_2) {
    file_contents_lines[index_case_num_1:index_case_num_2]
  }
# map2() to iterate.
case_nums <- map2(index_case_num_1,
index_case_num_2,
pull_case_num)
# transform to dataframe
case_nums_df <- as.data.frame.character(case_nums)
# Repeat pattern for other vectors as needed.
index_case_hist_1 <-
which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <-
which(grepl("Case#: ", file_contents_lines))
pull_case_hist <-
  function(index_case_hist_1, index_case_hist_2) {
    file_contents_lines[index_case_hist_1:index_case_hist_2]
  }
case_hist <- map2(index_case_hist_1,
index_case_hist_2,
pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)
# cbind() the vectors; also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
Thanks all for responding. I hope this solution helps someone out there in the future. :)
As an R newbie, I am using quanteda to find instances where a certain word appears somewhere before another certain word in a sentence. To be more specific, I am looking for instances where the word "investors" is located somewhere before the word "shall" in a sentence, in a corpus consisting of an international treaty concluded between Morocco and Nigeria (the text can be found here: https://edit.wti.org/app.php/document/show/bde2bcf4-e20b-4d05-a3f1-5b9eb86d3b3b).
The problem is that sometimes there are multiple words between these two words. For instance, sometimes it is written as "investors and investments shall". I tried to apply similar solutions offered on this website. When I tried the solution on (Keyword in context (kwic) for skipgrams?) and ran the following code:
kwic(corpus_mar_nga, phrase("investors * shall"))
I get 0 observations, since this matches only instances with exactly one word between "investors" and "shall".
And when I follow another solution offered on (Is it possible to use `kwic` function to find words near to each other?) and ran the following code:
toks <- tokens(corpus_mar_nga)
toks_investors <- tokens_select(toks, "investors", window = 10)
kwic(toks_investors, "shall")
I get instances where "investors" also appears after "shall", which changes the context fundamentally, since in that case the subject of the sentence is something different.
In the end, in addition to instances of "investors shall", I should also get, for example, instances reading "Investors, their investment and host state authorities shall", but I can't achieve that with the above code.
Could anyone offer me a solution on this issue?
Huge thanks in advance!
Good question. Here are two methods: one relying on regular expressions on the corpus text, and a second (as @Kohei_Watanabe suggests in the comment) using window with tokens_select().
First, create some sample text.
library("quanteda")
## Package version: 2.1.2
# sample text
txt <- c("The investors and their supporters shall do something.
Shall we tell the investors? Investors shall invest.
Shall someone else do something?")
Now reshape this into sentences, since your search occurs within a sentence.
# reshape to sentences
corp <- txt %>%
corpus() %>%
corpus_reshape(to = "sentences")
Method 1 uses regular expressions. We add a boundary (\\b) before "investors", and the .+ says one or more of any character in between "investors" and "shall". (This would not catch newlines, but corpus_reshape(x, to = "sentences") will remove them.)
# method 1: regular expressions
corp$flag <- stringi::stri_detect_regex(corp, "\\binvestors.+shall",
case_insensitive = TRUE
)
print(corpus_subset(corp, flag == TRUE), -1, -1)
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "The investors and their supporters shall do something."
##
## text1.2 :
## "Investors shall invest."
A second method applies tokens_select() with an asymmetric window before kwic(). First we select all documents (which are sentences) containing "investors", discarding the tokens before it and keeping all tokens after (1000 tokens should be plenty). Then we apply kwic(), keeping all context words but focusing on "shall", which by construction must come after "investors", since that was the first token kept.
# method 2: tokens_select()
toks <- tokens(corp)
tokens_select(toks, "investors", window = c(0, 1000)) %>%
kwic("shall", window = 1000)
##
## [text1.1, 5] investors and their supporters | shall | do something.
## [text1.3, 2] Investors | shall | invest.
The choice depends on what suits your needs best.
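(One small addition: if all you need is the number of matching sentences, the flag docvar from method 1 already provides it.)
sum(corp$flag)
## [1] 2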
Context
I need to clean financial data with mixed formats. The data has been punched in manually by different departments, some of them using "." as decimal and "," as grouping digit (e.g. US notation: $1,000,000.00) while others are using "," as decimal and "." as grouping digit (e.g. notation used in certain European countries: $1.000.000,00).
Input:
Here's a fictional example set:
df <- data.frame(Y2019= c("17.530.000,03","28000000.05", "256.000,23", "23,000",
"256.355.855","2565467,566","225,453.126")
)
Y2019
1 17.530.000,03
2 28000000.05
3 256.000,23
4 23,000
5 256.355.855
6 2565467,566
7 225,453.126
Desired result:
Y2019
1 17530000.03
2 28000000.05
3 256000.23
4 23000.00
5 256355855.00
6 2565467.566
7 225453.126
My attempt:
I got pretty close by considering the first occurrence (starting from the right) of "," or "." as the decimal separator and replacing the other occurrences accordingly. However, some entries have no decimals (e.g. entries 4 and 5) or have a variable number of decimals, rendering this strategy less useful.
Any input is greatly appreciated!
Edit:
As per request, I salvaged some of the code of the original attempt. I am sure it could be written a lot cleaner.
df %>%
mutate(Y2019r = ifelse(str_length(Y2019)- data.frame(str_locate(pattern =",",Y2019 ))[,1]==2, gsub("\\.","", Y2019),NA )) %>%
mutate(Y2019r = ifelse((is.na(Y2019r) & str_length(Y2019)- data.frame(str_locate(pattern ="\\.",Y2019 ))[,1]==2), gsub("\\.",",", Y2019),Y2019r ))%>%
mutate(Y2019r = gsub(",",".", Y2019r))
Y2019 Y2019r
1 17.530.000,03 17530000.03
2 28000000.05 28000000.05
3 256.000,23 256000.23
4 23,000 <NA>
5 256.355.855 <NA>
6 2565467,566 <NA>
7 225,453.126 <NA>
Here's a functional approach to build up the logic needed to parse the strings you might come across. I suppose it is built up from thinking about how we parse these strings when we read them, and trying to emulate that.
I think the key is realising that all we really need to know is whether the value after the last delimiter is decimal or not. If we could somehow label each string as having a terminal decimal portion or not, it would then be easy to parse the strings.
The following method involves splitting the character strings at the points and commas and trying to label them as having a terminal decimal or not. The split strings will be held as a list of string vectors, with each vector being composed of the "chunks" of digits between the delimiters.
First we will write two helper functions to create the final numbers from the string vectors once we have correctly labeled them as having a terminal decimal portion or not:
last_element_is_decimal <- function(x)
{
as.numeric(paste0(paste(x[-length(x)], collapse = ""), ".", x[length(x)]))
}
last_element_is_whole <- function(x)
{
as.numeric(paste0(x, collapse = ""))
}
It will be easy to decide what to do in the event of no delimiters, since we assume these are just whole numbers. Similarly, it is easy to see that any numbers containing both a comma and a stop (in either order) must have a terminal decimal component.
However, it is less obvious what to do when there is only a single delimiter; in these cases we have to use the length of the digit chunks to decide. If any chunk is longer than three digits, then a thousands separator isn't in use, and the presence of a delimiter indicates we have a decimal component. If the terminal chunk contains fewer than three digits, then we must have a decimal. In all other cases, we assume a whole number.
This says the same thing in code:
decide_last_element <- function(x)
{
if(max(nchar(x)) > 3)
return(last_element_is_decimal(x))
if(nchar(x[length(x)]) < 3)
return(last_element_is_decimal(x))
return(last_element_is_whole(x))
}
Now we can write our main function. It takes our strings as input and classifies each string into having either two types of delimiter, one type of delimiter or no delimiter. Then we can apply the functions above using lapply accordingly.
parse_money <- function(money_strings)
{
any_comma <- grepl(",", money_strings)
any_point <- grepl("[.]", money_strings)
both <- any_comma & any_point
neither <- !any_comma & !any_point
single <- (any_comma & !any_point) | (any_point & !any_comma)
digit_groups <- strsplit(money_strings, "[.]|,")
values <- rep(0, length(money_strings))
values[neither] <- as.numeric(money_strings[neither])
values[both] <- sapply(digit_groups[both], last_element_is_decimal)
values[single] <- sapply(digit_groups[single], decide_last_element)
return(format(round(values, 2), nsmall = 2))
}
So now we can just do
parse_money(df$Y2019)
#> [1] " 17530000.03" " 28000000.05" " 256000.23" " 23000.00" "256355855.00"
#> [6] " 2565467.57" " 225453.13"
Note I have output as strings so that rounding inaccuracies in the console output aren't ascribed to mistakes in the code.
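(If you would rather store numerics than formatted strings, despite that caveat, you can wrap the result in as.numeric(); Y2019_clean is just an illustrative name.)
df$Y2019_clean <- as.numeric(parse_money(as.character(df$Y2019)))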
In src$Review each row contains text in Russian. I want to count the number of positive smileys in each row. For example, in "My apricot is orange)) (for sure)" I want to count not the total number of closing brackets (which would include the ordinary brackets in "(for sure)"), but only the smiling characters: runs of at least two closing brackets ("))"), plus occurrences of ":)" and ":-)". So a bracket run counts only if at least two closing brackets appear together.
Assume the string "I love this girl!)))) (she makes me happy) every day:):) :-)!". Here we count "))))" (4 units), ":)" (2 units), and ":-)" (1 unit), then combine the units (i.e., 7 in total). Note that we do not count the brackets in "(she makes me happy)".
Now I have following code in my script:
smilecounts <- str_count(src$Review, "[))]")
As far as I can tell from comparing the data set with this command's output, it counts every closing bracket.
I need only the total count of ":)", ":-)", and "))" (i.e., closing brackets that appear in runs of two or more). For example, ")))))" contains 5 closing brackets; since at least two appear together, we count all the brackets in that run (i.e., 5 units).
Thank you so much for help in advance.
We can use regex lookarounds to extract each ) that follows another ) or a ! (for bracket runs), a :, or a :-, then use length to get the count.
library(stringr)
length(str_extract_all(str1, '(?<=\\)|\\!)\\)')[[1]])
#[1] 4
length(str_extract_all(str1, '(?<=:)\\)')[[1]])
#[1] 2
length(str_extract_all(str1, '(?<=:-)\\)')[[1]])
#[1] 1
Or this can be done using a loop
pat <- c('(?<=\\)|\\!)\\)', '(?<=:)\\)', '(?<=:-)\\)')
sum(sapply(lapply(pat, str_extract_all, string=str1),
function(x) length(unlist(x))))
#[1] 7
data
str1 <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
One way with gregexpr and regmatches:
vec <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
Solution:
#matches the locations of :-) or ))+ or :)
a <- gregexpr(':-)+|))+|:)+', vec)
#extracts those
b <- regmatches(vec, a)[[1]]
b
#[1] "))))" ":)" ":)" ":-)"
#table counts the instances
table(b)
b
)))) :-) :)
1 1 2
Then I suppose you could count the number of single )s using
nchar(b[1])
[1] 4
Or in a more automated way:
tab <- table(b)
#the following means "if a name of the table consists only of ) then
#count the number of )s"
tab2 <- ifelse(gsub(')', '', names(tab)) == '', nchar(names(tab)), tab)
names(tab2) <- names(tab)
tab2
)))) :-) :)
4 1 2
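Combining the units then gives the overall total the question asks for:
sum(tab2)
[1] 7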
I would like to calculate the number of lines spoken by different speakers from a text using R (it is a transcript of parliamentary speaking records). The basic text looks like:
MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that.
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.
MR. JOHN: Thank you
In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker each time they speak in a document, such that the above text would result in:
MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1
Thanks for pointers using R!
You can split the strings on the pattern ":" and then use table:
table(sapply(strsplit(x, ":"), "[[", 1))
# MR. JOHN MR. LEHMAN MS. SMITH
# 2 1 1
strsplit - splits each string at ":" and returns a list
sapply with [[ - selects the first element of each list entry (the speaker)
table - gets the frequencies
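(For completeness, and as an assumption about the setup: x above is a character vector with one line per element, e.g.)
x <- c("MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.",
       "MS. SMITH: Yes, I am aware of that.",
       "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.",
       "MR. JOHN: Thank you")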
Edit: Following OP's comment. You can save the transcripts in a text file and use readLines to read the text in R.
tt <- readLines("./tmp.txt")
Now, we'll have to find a pattern by which to filter this text for just those lines with the names of those who're speaking. I can think of two approaches based on what I saw in the transcript you linked.
Check for a : and then lookbehind the : to see if it is any of A-Z or [:punct:] (that is, if the character occurring before the : is any of the capital letters or any punctuation marks - this is because some of them have a ) before the :).
You can use strsplit followed by sapply (as shown below)
Using strsplit:
# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
There are other approaches possible (using gsub, for example) or alternative patterns. But this should give you an idea of the approach. If the pattern should differ, then you should just change it to capture all required lines.
Of course, this assumes that there is no other line, for example, like this:
"Mr. Chariman, whatever (bla bla): It is not a problem"
Because our pattern will give TRUE for ):. If this happens in the text, you'll have to find a better pattern.
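One stricter filter (a sketch, assuming speaker tags always start the line with a capitalized MR./MS./MRS. prefix, as in your example) would anchor the pattern at the start of the line:
# keep only lines that begin with an all-caps speaker tag and a colon
tt.f <- tt[grepl("^(MR|MS|MRS)\\. [A-Z. ]+:", tt)]
This would not match the "(bla bla):" example above, since that line does not begin with a capitalized speaker tag.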