Reading text into data.frame where string values contain spaces - r

What's the easiest way to read text from a printed data.frame into a data.frame when there are string values containing spaces that interfere with read.table? For instance, this data.frame excerpt does not pose a problem:
candname party elecVotes
1 BarackObama D 365
2 JohnMcCain R 173
I can paste it into a read.table call without a problem:
dat <- read.table(text = " candname party elecVotes
1 BarackObama D 365
2 JohnMcCain R 173", header = TRUE)
But if the data has strings with spaces like this:
candname party elecVotes
1 Barack Obama D 365
2 John McCain R 173
Then read.table throws an error as it interprets "Barack" and "Obama" as two separate variables.
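The failure can be reproduced directly; wrapping the call in try() captures the error (a quick sketch):

```r
# read.table sees more fields in the data rows than header names,
# so it stops with "more columns than column names"
res <- try(read.table(text = " candname party elecVotes
1 Barack Obama D 365
2 John McCain R 173", header = TRUE), silent = TRUE)
inherits(res, "try-error")  # TRUE
```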

Read the file into L, remove the row numbers and use sub with the indicated regular expression to insert commas between the remaining fields. (Note that "\\d" matches any digit and "\\S" matches any non-whitespace character.) Now re-read it using read.csv:
Lines <- " candname party elecVotes
1 Barack Obama D 365
2 John McCain R 173"
# L <- readLines("myfile") # read file; for demonstration use next line instead
L <- readLines(textConnection(Lines))
L2 <- sub("^ *\\d+ *", "", L) # remove row numbers
read.csv(text = sub("^ *(.*\\S) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L2), as.is = TRUE)
giving:
candname party elecVotes
1 Barack Obama D 365
2 John McCain R 173

Related

How to replace a pattern except specific pattern with another pattern in a column?

I have a data frame (x) with a "Symbol" column, in which I want to replace the hyphen and everything after it with "", but I don't want to change values such as 1-Mar, 1-Sep, 1-Dec, ...
x<-data.frame("ID"=c("a","b","c","d","e","f","g","h","i"),"Symbol"=c("3-Mar","STON1-GTF2A1L","1-Dec","NME1-NME2","12-Mar","TNFSF12-TNFSF13","8-Mar","TMEM189-UBE2V1","10-Sep"))
I tried this code: x$Symbol <- gsub("-*", "", x$Symbol)
But it also changes 1-Mar, 1-Sep and 1-Dec.
I need the data frame below:
x<-data.frame("ID"=c("a","b","c","d","e","f","g","h","i"),"Symbol"=c("3-Mar","STON1","1-Dec","NME1","12-Mar","TNFSF12","8-Mar","TMEM189","10-Sep"))
You may use
x$Symbol <- sub("-(?!(?:Jan|Feb|Mar|Apr|May|Ju[nl]|Aug|Sep|Oct|Nov|Dec)$).*", "", x$Symbol, perl=TRUE)
Details
- "-" - a hyphen
- (?!(?:Jan|Feb|Mar|Apr|May|Ju[nl]|Aug|Sep|Oct|Nov|Dec)$) - a negative lookahead that fails the match if, immediately to the right of the current location, there is an abbreviated month name at the end of the string. (NOTE: if you may have more text after the month name, use (?!(?:Jan|Feb|Mar|Apr|May|Ju[nl]|Aug|Sep|Oct|Nov|Dec)\\b) to match the month names as whole words, or (?!Jan|Feb|Mar|Apr|May|Ju[nl]|Aug|Sep|Oct|Nov|Dec) to match them as non-bounded substrings.)
- .* - any zero or more characters other than line break characters, as many as possible.
R demo:
df<-data.frame("ID"=c("a","b","c","d","e","f","g","h","i"),"Symbol"=c("3-Mar","STON1-GTF2A1L","1-Dec","NME1-NME2","12-Mar","TNFSF12-TNFSF13","8-Mar","TMEM189-UBE2V1","10-Sep"))
df$Symbol <- sub("-(?!(?:Jan|Feb|Mar|Apr|May|Ju[nl]|Aug|Sep|Oct|Nov|Dec)$).*", "", df$Symbol, perl=TRUE)
df
Output:
ID Symbol
1 a 3-Mar
2 b STON1
3 c 1-Dec
4 d NME1
5 e 12-Mar
6 f TNFSF12
7 g 8-Mar
8 h TMEM189
9 i 10-Sep
You could paste "-18" onto Symbol, check whether the result parses as a Date, and sub only the values that are not dates.
df$Symbol <- with(df, ifelse(is.na(as.Date(paste0(Symbol, "-18"), "%d-%b-%y")),
sub ("-.*", "", Symbol), Symbol))
df
# ID Symbol
#1 a 3-Mar
#2 b STON1
#3 c 1-Dec
#4 d NME1
#5 e 12-Mar
#6 f TNFSF12
#7 g 8-Mar
#8 h TMEM189
#9 i 10-Sep
First run
df$Symbol <- as.character(df$Symbol)
to convert Symbol from factor to character.
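The date test can also be checked in isolation. In an English locale, only the month-like values parse after "-18" is appended (a sketch):

```r
sym <- c("3-Mar", "STON1-GTF2A1L", "1-Dec", "NME1-NME2")
# values that parse as day-month-year dates are the ones to leave alone
is_date <- !is.na(as.Date(paste0(sym, "-18"), "%d-%b-%y"))
is_date  # TRUE FALSE TRUE FALSE (%b is locale-dependent)
```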

reading comma-separated strings with read.csv()

I am trying to load a comma-delimited data file that also has commas in one of its text columns. The following sample code generates such a file, 'test.csv', which I'll load using read.csv() to illustrate my problem.
> d <- data.frame(name = c("John Smith", "Smith, John"), age = c(34, 34))
> d
name age
1 John Smith 34
2 Smith, John 34
> write.csv(d, file = "test.csv", quote = F, row.names = F)
> d2 <- read.csv("test.csv")
> d2
name age
John Smith 34 NA
Smith John 34
Because of the ',' in Smith, John, d2 is not assigned correctly. How do I read the file so that d2 looks exactly like d?
Thanks.
1) read.pattern. read.pattern (in the gsubfn package) can read such files:
library(gsubfn)
pat <- "(.*),(.*)"
read.pattern("test.csv", pattern = pat, header = TRUE, as.is = TRUE)
giving:
name age
1 John Smith 34
2 Smith, John 34
2) two pass. Another possibility is to read it in, fix it up, and then re-read it. This uses no packages and gives the same output.
L <- readLines("test.csv")
read.table(text = sub("(.*),", "\\1|", L), header = TRUE, sep = "|", as.is = TRUE)
Note: For 3 fields where only the first field may contain commas, use this in (1):
pat <- "(.*),([^,]+),([^,]+)"
For the same situation in (2), use the following, assuming that there are non-spaces adjacent to each of the last two commas, at least one space adjacent to any comma in the text field, and fields of at least 2 characters:
text = gsub("(\\S),(\\S)", "\\1|\\2", L)
If you have some other arrangement, just modify the regular expression in (1), or the sub or gsub in (2), appropriately.
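As a sketch of the three-field case, combining the pattern from (1) with the two-pass re-read from (2), on hypothetical name/city/age data where only the first field may contain a comma:

```r
L <- c("name,city,age", "Smith, John,Boston,34")
# anchor on the last two commas; the greedy (.*) keeps any earlier
# commas inside the first field
fixed <- sub("(.*),([^,]+),([^,]+)$", "\\1|\\2|\\3", L)
read.table(text = fixed, header = TRUE, sep = "|", as.is = TRUE)
```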

Parsing names in mixed formats using R

I have a list of names in mixed formats, which I would like to separate into columns containing first and last names in R. An example dataset:
Names <- c("Mary Smith","Hernandez, Maria","Bonds, Ed","Michael Jones")
The goal is to assemble a dataframe that contains names in a format like this:
FirstNames <- c("Mary","Maria","Ed","Michael")
LastNames <- c("Smith","Hernandez","Bonds","Jones")
FinalData <- data.frame (FirstNames, LastNames)
I tried a few approaches to select either the first or last name based on whether the names are separated by a space only versus a comma and space. For instance, I wanted to use regular expressions in gsub to copy first names from rows in which a comma-space separates the names:
FirstNames2 <- gsub(".*,\\s", "", Names)
This worked for rows with names in the LastName, FirstName format, but gsub returned the entire string for rows in the FirstName LastName format.
So my request for help is to please advise: How would you tackle this problem? Thanks to all in advance!
Here is a one-liner. The pattern tries Firstname lastname first and if that fails it tries lastname, firstname. No packages are used.
read.table(text = sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names), as.is=TRUE)
giving:
V1 V2
1 Mary Smith
2 Maria Hernandez
3 Ed Bonds
4 Michael Jones
You could rearrange the comma version to "first last" order and then just strsplit:
FirstNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 1)
LastNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 2)
Another option is to split on either separator and reorder the pieces depending on whether the original name contained a comma:
temp <- strsplit(x = Names, split = "(, | )")
do.call(rbind, lapply(seq_along(temp), function(i) {
  if (grepl(pattern = ", ", x = Names[i])) {
    data.frame(F = temp[[i]][2], L = temp[[i]][1])
  } else {
    data.frame(F = temp[[i]][1], L = temp[[i]][2])
  }
}))
# F L
#1 Mary Smith
#2 Maria Hernandez
#3 Ed Bonds
#4 Michael Jones

Read data with space character in R

Usually, read.table solves most data input problems, like this one:
China 2 3
USA 1 4
Sometimes, the data can madden people, like:
Chia 2 3
United States 3 4
Here read.table cannot work as-is, and any assistance is appreciated.
P.S. The data file's format is .dat.
First set up some test data:
# create test data
cat("Chia 2 3
United States 3 4
", file = "file.with.spaces.txt")
1) Read in the data created above, insert commas between the fields, and re-read:
L <- readLines("file.with.spaces.txt")
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L) # 1
DF <- read.table(text = L2, sep = ",")
giving:
> DF
V1 V2 V3
1 Chia 2 3
2 United States 3 4
2) Another approach. Using L from above, replace the last run of spaces with a comma, twice (since there are three fields):
L2 <- L
for(i in 1:2) L2 <- sub(" +(\\S+)$", ",\\1", L2) # 2
DF <- read.table(text = L2, sep = ",")
If the column separator sep is indeed a whitespace character, it logically cannot differentiate between spaces in a name and spaces that actually separate columns. I'd suggest changing your country names to single strings, i.e., strings without spaces. Alternatively, use semicolons to separate your data columns and use:
data <- read.table("foo.dat", sep = ";")
If you have many rows in your .dat file, you can consider using regular expressions to find the spaces between the columns and replace them with semicolons.
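That space-to-semicolon replacement can be sketched with sub, assuming the last two fields are always space-free (here, numeric):

```r
L <- c("Chia 2 3", "United States 3 4")
# the greedy (.*) keeps everything up to the last two fields
# as the country name, spaces included
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1;\\2;\\3", L)
read.table(text = L2, sep = ";", as.is = TRUE)
```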

Regular expression to convert raw text into columns of data

I have a raw text output from a program that I want to convert into a DataFrame. The text file is not formatted and is as shown below.
10037 149439Special Event 11538.00 13542.59 2004.59
10070 10071Weekday 8234.00 9244.87 1010.87
10216 13463Weekend 145.00 0 -145.00
I am able to read the data into R using readLines() in the base package. How can I convert this into a data frame that looks like this (column names can be anything)?
A B C D E F
10037 149439 Special Event 11538.00 13542.59 2004.59
10070 10071 Weekday 8234.00 9244.87 1010.87
10216 13463 Weekend 145.00 0 -145.00
What regular expression should I use to achieve this? I know that this is ideal for applying a combination of regexec() and regmatches(). But I am unable to come up with an expression that splits the line into the desired components.
Here's a simple solution:
raw <- readLines("filename.txt")
data.frame(do.call(rbind, strsplit(raw, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)))
# X1 X2 X3 X4 X5 X6
# 1 10037 149439 Special Event 11538.00 13542.59 2004.59
# 2 10070 10071 Weekday 8234.00 9244.87 1010.87
# 3 10216 13463 Weekend 145.00 0 -145.00
The regular expression " {2,}|(?<=\\d)(?=[A-Z])" consists of two parts, combined with "|" (logical or).
" {2,}" means at least two spaces. This will split between the different columns only, since the text in the third column has a single space.
"(?<=\\d)(?=[A-Z])" denotes the positions that are preceded by a digit and followed by an uppercase letter. This is used to split between the second and the third column.
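On a single line (assuming, as in the original file, that the columns are separated by at least two spaces), the split yields six fields:

```r
x <- "10037  149439Special Event  11538.00  13542.59  2004.59"
# zero-width lookaround splits "149439Special" without consuming anything
parts <- strsplit(x, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)[[1]]
parts  # "10037" "149439" "Special Event" "11538.00" "13542.59" "2004.59"
```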
I created "txt.txt" from your data. Then we do some work with regular expressions:
> read <- readLines("txt.txt")
> S <- strsplit(read, "[A-Za-z]|\\s")
> W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
> col <- gsub("[^A-Za-z]", "", read)  # the text column: keep letters only
> D <- data.frame(W[,1:2], col, W[,3:5])
> names(D) <- LETTERS[seq(D)]
> D
## A B C D E F
## 1 10037 149439 SpecialEvent 11538.00 13542.59 2004.59
## 2 10070 10071 Weekday 8234.00 9244.87 1010.87
## 3 10216 13463 Weekend 145.00 0 -145.00
Toss it all into some curly brackets and you've got yourself a function to parse your files.
PS: If the space between "Special" and "Event" is important, please comment and I'll revise.
Something like this at least works on your example but I don't know all your corner cases...
([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)
The captured groups from 1 to 6 are resp. your columns from A to F.
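Plugged into regexec() and regmatches(), as the question suggested, the six captured groups come back after the full match (a sketch on one line):

```r
pat <- "([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)"
x <- "10070 10071Weekday 8234.00 9244.87 1010.87"
m <- regmatches(x, regexec(pat, x))[[1]]
m[-1]  # drop the full match, keeping the six fields
# "10070" "10071" "Weekday" "8234.00" "9244.87" "1010.87"
```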
