I have a data frame that looks like this (sorry, I can't replicate the actual data frame with code, as the double quotes don't show up; Vx are variables):
V1, V2, V3, V4
home, 15, "grand", terminal
"give", 32, "cuz", good
"miles", 5, "before", ten
yes, 45, "sorry", fine
Question: how might I fix the double quote issue for my entire data frame, imported using the read.csv function, so that all the double quotes are removed?
What I'm looking for is the Excel or Word equivalent of FIND + REPLACE: find the double quote and replace it with nothing.
Notes:
1) I've confirmed it's a data frame by running the is.data.frame() function
2) The actual data frame has hundreds of columns, so going through each one and declaring the type of column it is isn't feasible
3) I tried using the following, and it didn't work: as.data.frame(sapply(my_data, function(x) gsub("\"", "", x)))
4) I confirmed that this isn't a simple print issue by testing with SQL on the data frame: it won't match values in double quotes unless I use LIKE instead of = (see the sketch below)
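For reference, the SQL check in note 4 looked roughly like this (a sketch using sqldf; the real column names differ):
library(sqldf)
my_data <- data.frame(V3 = c("\"grand\"", "\"cuz\""), stringsAsFactors = FALSE)
sqldf("SELECT * FROM my_data WHERE V3 = 'grand'")       # no rows: the quotes are stored in the value
sqldf("SELECT * FROM my_data WHERE V3 LIKE '%grand%'")  # matches the row containing "grand"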
Thanks in advance!
7/7/15 EDIT 01: as requested by @alexforrence, here is the dput output for a couple of columns:
  billing_first_name billing_last_name billing_company
3                                                   NA
4              Peldi        Guilizzoni              NA
5                                                   NA
6     "James Andrew"             Angus              NA
7                                                   NA
8               Nova           Spivack              NA
Here is a solution using dplyr and stringr. Note that purely numeric columns will be character columns afterwards. It's not clear to me from your description whether there are purely numeric columns. If there are, then you'd probably want to treat them separately, or alternatively convert them back into numbers afterwards (see the sketch after the example below).
require(dplyr)
require(stringr)
df <- data.frame(V1 = c("home", "\"give\"", "\"miles\"", "yes"),
                 V2 = c(15, 32, 5, 45),
                 V3 = c("\"grand\"", "\"cuz\"", "\"before\"", "\"sorry\""),
                 V4 = c("terminal", "good", "ten", "fine"))
df
## V1 V2 V3 V4
## 1 home 15 "grand" terminal
## 2 "give" 32 "cuz" good
## 3 "miles" 5 "before" ten
## 4 yes 45 "sorry" fine
# Note: mutate_each()/funs() have since been superseded in dplyr; the modern
# equivalent is mutate(across(everything(), ~ str_replace_all(.x, "\"", "")))
df %>% mutate_each(funs(str_replace_all(., "\"", "")))
## V1 V2 V3 V4
## 1 home 15 grand terminal
## 2 give 32 cuz good
## 3 miles 5 before ten
## 4 yes 45 sorry fine
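As a follow-up to the caveat above (a sketch, assuming you want the numeric columns back as numbers), base R's type.convert can re-parse every column after the replacement:
df2 <- df %>% mutate_each(funs(str_replace_all(., "\"", "")))
df2[] <- lapply(df2, type.convert, as.is = TRUE)
str(df2)  # V2 is numeric again; the character columns stay character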
You can check whether the quotes are really part of the stored value by counting characters with nchar(); the quotes R prints around a string are not themselves counted:
a <- ""
nchar(a) == 0
[1] TRUE
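Applied to the problem above, quotes that are actually stored in the value add to the character count, so nchar() can reveal them (a quick sketch):
nchar("grand")
[1] 5
nchar("\"grand\"")
[1] 7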
In addition to the above, I ran into a very strange problem. Using the tips above, I wrote this very short program:
# Define a class whose input conversion strips the offending character
setClass("char.with.deleted.quotes")
setAs("character", "char.with.deleted.quotes",
      function(from) gsub('„', "xxx", from, fixed = TRUE))
# Read the file, applying the conversion to the second column
TMP <- read.csv2("./test.csv", header = TRUE, sep = ";", dec = ",",
                 colClasses = c("character", "char.with.deleted.quotes"))
# Try the same replacement directly on the column
temp <- gsub('„', "xxx", TMP$Name, fixed = TRUE)
print(temp)
with the output:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
which reads the dummy csv:
Number;Name
X-23;This is some „Test
K-33.01;And another „Test
My goal is to get rid of this low double quote before the word Test, but so far it does not work, and the failure happens only with this particular double quote character.
If instead I replace a different part of the string, it does work, either with read.csv2 and the above class definition or directly with gsub saved into the temp variable.
Now what is really strange is the following: after running the program I copied the two lines "temp <- gsub" and "print(temp)" manually into the command line:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
>
> temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(temp)
[1] "This is some xxxTest" "And another xxxTest"
This for whatever reason works and it does also work if I modify the data frame directly:
> TMP$Name <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(TMP)
Number Name
1 X-23 This is some xxxTest
2 K-33.01 And another xxxTest
But if I repeat this command in the program and run it again, it does not work. And I really have no idea why.
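A likely explanation (an assumption, since it depends on how the files are saved) is encoding: the literal „ inside the sourced test.R may be encoded differently from the text that read.csv2 produced, so the pattern never matches, while the same character typed interactively is in the session's native encoding. Two sketches that sidestep this:
# Sketch 1: tell source() how the script file is encoded (assuming it was saved as UTF-8):
source('test.R', encoding = 'UTF-8')
# Sketch 2: avoid the literal character entirely and use the Unicode
# escape for the low double quotation mark (U+201E):
temp <- gsub('\u201E', "xxx", TMP$Name, fixed = TRUE)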
Related
I have a comma-separated csv file. However, some fields contain commas, like the company name "Apple, Inc", and such fields get split into two columns, which leads to the following error when using fread.
"Stopped early on line 5. Expected 26 fields but found 27."
Any suggestions on how to appropriately load this file?
Example rows are as follows. Some fields contain a comma but are not quoted; in those fields, however, the comma is always followed by whitespace.
100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com
1) Using the test file created in the Note at the end, and assuming that the file contains no semicolons (use some other character if it does): read in the lines, replace the first and last comma with a semicolon, and then read the result as a semicolon-separated file.
L <- readLines("firms.csv")
read.table(text = sub(",(.*),", ";\\1;", L), sep = ";")
## V1 V2 V3
## 1 100 Microsoft azure.com
## 2 300 IBM ibm.com
## 3 500 Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400 SAP, Inc sap.com
2) Another approach is to use gsub to replace every comma-followed-by-space with a semicolon followed by a space, and then use chartr to swap commas and semicolons (the remaining field-separating commas become semicolons, and the semicolons just inserted become commas again), reading the result as a semicolon-separated file.
L <- readLines("firms.csv")
read.table(text = chartr(",;", ";,", gsub(", ", "; ", L)), sep = ";")
## V1 V2 V3
## 1 100 Microsoft azure.com
## 2 300 IBM ibm.com
## 3 500 Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400 SAP, Inc sap.com
3) Another possibility, if there are not too many such rows, is to locate them and then put quotes around the offending fields in a text editor (or in R, as sketched below); then the file can be read in normally.
which(count.fields("firms.csv", sep = ",") != 3)
## [1] 4
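If a text editor is inconvenient, the same repair can be sketched in R, reusing the idea from approach 1: wrap the span between the first and last comma of each offending line in quotes, then read normally.
L <- readLines("firms.csv")
bad <- which(count.fields("firms.csv", sep = ",") != 3)
L[bad] <- sub(",(.*),", ",\"\\1\",", L[bad])
read.table(text = L, sep = ",")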
Note
Lines <- '100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com
'
cat(Lines, file = "firms.csv")
Works fine for me. Can you provide a reproducible example?
library(data.table)
# Create example and write out
df_out <- data.frame("X" = c("A", "B", "C"),
                     "Y" = c("a,A", "b,B", "C"))
write.csv(df_out, file = "df.csv", row.names = FALSE)
# Read in CSV with fread
df_in <- fread("./df.csv")
df_in
X Y
1: A a,A
2: B b,B
3: C C
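It works here because write.csv quotes the fields that contain commas, and fread understands quoted fields (a quick check):
readLines("df.csv")
[1] "\"X\",\"Y\""   "\"A\",\"a,A\"" "\"B\",\"b,B\"" "\"C\",\"C\""
The error in the question comes from rows whose embedded comma is not quoted, which is what the preprocessing answers above address.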
I've got the following .txt structure
test <- "A n/a:
4001
Exam date:
2020-01-01 15:38
Pos (deg):
18.19
18.37"
I'd like to read this into a list, where each list element is given the name of the row ending with a colon, and the values are given by the following rows (see expected output below).
Challenges
The number of rows (the length of each list element) can differ. There can be special characters (e.g., "A n/a"), and there is the date-time value, which contains a pesky colon.
My problem
My current solution (see below) is unsafe, because I cannot be sure that I have a full list of all expected elements: the file might contain unexpected list elements which I would then fail to capture, or worse, which would mess up the entire data.
What I tried
I tried reading the txt as JSON with jsonlite::fromJSON, because the structure somewhat resembled it, but this gave an error about an unexpected character.
I tried to read into a single string and split, but this leaves me, again, with all values in a single list element:
readr::read_file(test)
strsplit(test, split = ":\n")
My current approach is to read this in with read.csv2, generate a lookup of the (expected) row names, create a vector for splitting, and use the first element of each resulting chunk for naming.
myfile <- read.csv2(text = test, header = FALSE)
lu <- paste(c("A n", "date", "Pos"), collapse = "|")
ls_file <- split(myfile$V1, cumsum(grepl(lu, myfile$V1, ignore.case = TRUE)))
names(ls_file) <- unlist(lapply(ls_file, function(x) x[1]))
ls_file <- lapply(ls_file, function(x) x[2:length(x)])
## expected output is a named list
## The spaces and backticks below do not really bother me,
## but I would get rid of them in a next step.
ls_file
#> $`A n/a:`
#> [1] " 4001"
#>
#> $`Exam date:`
#> [1] " 2020-01-01 15:38"
#>
#> $`Pos (deg):`
#> [1] "18.19" "18.37"
Assuming the name of each element ends with ':', we can do:
res <- readLines(textConnection(test))
res <- split(res, cumsum(endsWith(res, ':')))
res <- setNames(lapply(res, `[`, -1), sapply(res, `[`, 1))
# > res
# $`A n/a:`
# [1] " 4001"
#
# $`Exam date:`
# [1] " 2020-01-01 15:38"
#
# $`Pos (deg):`
# [1] "18.19" "18.37"
(This is a follow-up to Regex in R: match collocates of node word.)
I want to extract word combinations (collocates) to the left and to the right of a target word (node) and store the three elements in a dataframe.
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
Aim:
The target word is the verb GO in any of its possible realizations, be it 'go', 'going', 'goes', 'gone', or 'went', and I'm interested in extracting 3 words to the left of GO and 3 words to the right of GO. The three words can cross sentence boundaries but the extracted strings should not include punctuation.
What I've tried so far:
To extract left-hand collocates I've used str_extract_all from stringr:
unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and" " It was" "s still"
[5] " And will" " and"
This captures most but not all matches and includes spaces.
The extraction of the node, by contrast, looks okay:
unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went" "went" "going" "Going" "going" "go" "go"
To extract the right hand collocates:
unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went" " on" " on for quite" " on for ages" " on" " on and on"
[7] " on forever"
Again the matches are incomplete and unwanted spaces are included.
And finally assembling all the matches in a dataframe throws an error:
collocates <- data.frame(
  Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
  Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
  Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")), :
arguments imply differing number of rows: 6, 7
Expected output:
Left                 Node  Right
This little sentence went  on and went
went on and          went  on It was
on It was            going on for quite
quite a while        Going on for ages
ages It's still      going on And will
on And will          go    on and on
and on and           go    on forever
Does anyone know how to fix this? Suggestions much appreciated.
If you use quanteda, you can get the following result. When you deal with text, you usually want lowercase letters, so I converted capital letters with tolower(). I also removed . and , with gsub(). Then I applied kwic() to the text. If you do not mind losing capital letters, dots, and commas, you get pretty much what you want.
library(quanteda)
library(dplyr)
library(splitstackshape)
myvec <- c("go", "going", "goes", "gone", "went")
mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
mydf <- kwic(x = mytext, pattern = myvec, window = 3) %>%
  as_tibble %>%
  select(pre, keyword, post) %>%
  cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide",
         type.convert = FALSE) %>%
  select(contains("pre"), keyword, contains("post"))
    pre_1  pre_2    pre_3 keyword post_1  post_2 post_3
1:   this little sentence    went     on     and   went
2:   went     on      and    went     on      it    was
3:     on     it      was   going     on     for  quite
4:  quite      a    while   going     on     for   ages
5:   ages   it's    still   going     on     and   will
6:     on    and     will      go     on     and     on
7:    and     on      and      go     on forever   <NA>
A little late, but not too late for posterity or contemporaries doing collocation research on unannotated text, here is my own answer to my question. Full credit goes to @jazzurro's pointer to quanteda and his answer.
My question was: how to compute collocates of a given node in a text and store the results in a dataframe (that's the part not addressed by @jazzurro).
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while.
Going on for ages. It's still going on. And will go on and on, and go on forever.")
Step 1: Prepare data for analysis
go <- gsub("[.!?;,:]", "", tolower(GO)) # get rid of punctuation
go <- gsub("'", " ", tolower(go)) # separate clitics from host
Step 2: Extract KWIC using regex pattern and argument valuetype = "regex"
concord <- kwic(go, "go(es|ing|ne)?|went", window = 3, valuetype = "regex")
concord
[text1, 4] this little sentence | went | on and went
[text1, 7] went on and | went | on it was
[text1, 11] on it was | going | on for quite
[text1, 17] quite a while | going | on for ages
[text1, 24] it s still | going | on and will
[text1, 28] on and will | go | on and on
[text1, 33] and on and | go | on forever
Step 3: Identify strings with fewer collocates than defined by window:
# Number of collocates on the left:
concord$nc_l <- lengths(strsplit(concord$pre, " ")); concord$nc_l
[1] 3 3 3 3 3 3 3 # nothing missing here
# Number of collocates on the right:
concord$nc_r <- lengths(strsplit(concord$post, " ")); concord$nc_r
[1] 3 3 3 3 3 3 2 # last string has only two collocates
Step 4: Add NA to strings with missing collocates:
# define window:
window <- 3
# change string:
concord$post[!concord$nc_r == window] <- paste(concord$post[!concord$nc_r == window], NA, sep = " ")
Step 5: Fill the dataframe with slots for collocates and node, using str_extract from library stringr as well as regex with lookarounds to determine the split points for collocates:
library(stringr)
L3toR3 <- data.frame(
  L3 = str_extract(concord$pre, "^\\w+\\b"),
  L2 = str_extract(concord$pre, "(?<=\\s)\\w+\\b(?=\\s)"),
  L1 = str_extract(concord$pre, "\\w+\\b$"),
  Node = concord$keyword,
  R1 = str_extract(concord$post, "^\\w+\\b"),
  R2 = str_extract(concord$post, "(?<=\\s)\\w+\\b(?=\\s)"),
  R3 = str_extract(concord$post, "\\w+\\b$")
)
Result:
L3toR3
     L3     L2       L1  Node R1      R2    R3
1  this little sentence  went on     and  went
2  went     on      and  went on      it   was
3    on     it      was going on     for quite
4 quite      a    while going on     for  ages
5    it      s    still going on     and  will
6    on    and     will    go on     and    on
7   and     on      and    go on forever    NA
I have some text as follows:
inputString <- "Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000 Date of Procedure:01/01/2010 Endoscopist:Dr. Sebastian Zeki: Nurses:Anthony Nurse , Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 Extent of Exam:second part of duodenum Visualization:Good Tolerance: Good Complications: None Co-morbidity:None INDICATIONS FOR EXAMINATION Illness Stomach pain. PROCEDURE PERFORMED Gastroscopy (OGD) FINDINGS Things found and biopsied DIAGNOSIS Biopsy of various RECOMMENDATIONS Chase for histology. FOLLOW UP Return Home"
I want to extract parts of the text into their own columns according to some text boundaries I have set:
myWords <- c("Patient Name","Date of Birth","Hospital Number","Date of Procedure","Endoscopist","Second Endoscopist","Trainee","Referring Physician","Nurses","Medications")
Not all of the delimiter words are in the text (but they are always in the same order).
I have a function that should separate them out (with the column title as the start of the word boundary):
delim <- myWords
inputStringdf <- data.frame(inputString, stringsAsFactors = FALSE)
inputStringdf <- inputStringdf %>%
  tidyr::separate(inputString, into = c("added_name", delim),
                  sep = paste(delim, collapse = "|"),
                  extra = "drop", fill = "right")
However, when there is no text between two delimiters, or when a delimiter does not exist in the input, rather than placing NA in that column it just fills it with the next text found between two delimiters. How can I make sure that the correct columns are filled with the correct text as defined by the delimiters?
Using the input shown in the Note at the end, transform it into DCF format and then read it in using read.dcf, which converts the input lines into a character matrix m. See ?read.dcf for more info. No packages are used.
pat <- sprintf("(%s)", paste(myWords, collapse = "|"))
g <- gsub(pat, "\n\\1", paste0(Lines, "\n"))
m <- read.dcf(textConnection(g))
Here are the first three columns:
m[, 1:3]
##      Patient Name       Date of Birth Hospital Number
## [1,] "MRS Comfor Atest" "23/02/1981"  "000000"
## [2,] "MRS Comfor Atest" NA            "000000"
Note
The input is assumed to have one record per patient, like this example, which has two records. We have simply repeated the first patient to synthesize an input data set, except that we have omitted the Date of Birth in the second record.
Lines <- c(inputString, sub("Date of Birth:23/02/1981 ", "", inputString))
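If a data frame is preferred over the character matrix, a minimal follow-up sketch:
df <- as.data.frame(m, stringsAsFactors = FALSE)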
Compare the conversion of a character string with as.numeric to how the same text is handled by read.fwf.
as.numeric("457") # 457
as.numeric("4 57") # NA with warning message
Now read from a file "fwf.txt" containing exactly " 5 7 12 4".
foo<-read.fwf('fwf.txt',widths=c(5,5),colClasses='numeric',header=FALSE)
   V1  V2
1  57 124
foo<-read.fwf('fwf.txt',widths=c(5,5),colClasses='character',header=FALSE)
     V1    V2
1  5 7  12 4
Now, I'll note that in the "numeric" version, read.fwf does concatenation the same way Fortran does. I was just a bit surprised that it doesn't throw an error or produce NA in the same manner as as.numeric. Anyone know why?
As @eipi10 pointed out, the space-eliminating behavior is not unique to read.fwf; it actually comes from the scan() function (which is used by read.table, which in turn is used by read.fwf). As it processes the input stream, scan() removes spaces (or tabs, if they are not specified as the delimiter) from any value that is not a character value. Once it has "cleaned" the value of spaces, it uses the same function as as.numeric to turn that value into a number. With character values it doesn't take out any whitespace unless you set strip.white=TRUE, which only removes spaces from the beginning and end of the value.
Observe these examples
scan(text="TRU E", what=logical(), sep="x")
# [1] TRUE
scan(text="0 . 0 0 7", what=numeric(), sep="x")
# [1] 0.007
scan(text=" text ", what=character(), sep="~")
# [1] " text "
scan(text=" text book ", what=character(), sep="~", strip.white=T)
# [1] "text book"
scan(text="F\tALS\tE", what=logical(), sep=" ")
# [1] FALSE
You can find the source for scan() in /src/main/scan.c and the specific part responsible for this behavior is around this line.
If you wanted as.numeric to behave like this, you could create a new function like
As.Numeric <- function(x) as.numeric(gsub(" ", "", x, fixed = TRUE))
in order to get
As.Numeric("4 57")
# [1] 457
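Since scan() also strips tabs when they are not the delimiter, a closer imitation would remove all whitespace rather than just spaces (a sketch):
As.Numeric2 <- function(x) as.numeric(gsub("[[:space:]]+", "", x))
As.Numeric2("4\t5 7")
# [1] 457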