Drop rows after criteria - r

I have some data that I'm trying to clean up, and I noticed that I have 150 files containing rows that are subsets of previous rows. Is there a way to drop everything after certain criteria occur? I'm not sure how I'd write out sample data for this via code, so I've listed an example of the data as text below. I'd like to drop all rows at and below "Section 2".
Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
,,
Section 2,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
,,
Section 3,,
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
,,
Section 5,,
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
Expected output
Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd

Assuming your text file is called temp.txt, you can use readLines to read it in, find the line containing 'Section 2', and keep all the lines above it.
tmp <- readLines('temp.txt')
inds <- grep('Section 2', tmp) - 2  # stop above the blank ',,' line that precedes Section 2
data <- read.csv(text = paste0(tmp[1:inds], collapse = '\n'))
data
#       Name Age        Address
#1 Section 1  NA
#2      Abby  10     1 Baker St
#3     Alice  12      3 Main St
#4     Becky  13       156 F St
#5       Ben  14      2 18th St
#6   Cameron  15 4 Journey Road
#7     Danny  16  123 North Ave
#8      Eric  17  325 Hill Blvd
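Since you mention 150 such files, the same recipe extends to all of them with lapply. A minimal sketch, assuming the files share this structure, each contains one 'Section 2' marker, and they sit in a single folder (the folder name and file pattern here are placeholders):
files <- list.files('data_dir', pattern = '\\.txt$', full.names = TRUE)  # hypothetical folder
all_data <- lapply(files, function(f) {
  tmp <- readLines(f)
  cut <- grep('Section 2', tmp) - 2  # stop above the blank ',,' line
  read.csv(text = paste0(tmp[1:cut], collapse = '\n'))
})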

Here, I "read" in your data by calling strsplit and using the newline as the separator. If you were doing this from file, you could use readLines
I use grep to find the line number that contains "Section 2", use that to subset raw_data. I paste0(..., collapse="") the lines that do not start with "Section" and use read.table using sep="," with header=TRUE to parse as if I read just that section with read.csv.
raw_data <- strsplit(split = "\\n", "Name,Age,Address
Section 1,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd
,,
Section 2,,
Abby,10,1 Baker St
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
,,
Section 3,,
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
,,
Section 5,,
Alice,12,3 Main St
Becky,13,156 F St
Ben,14,2 18th St
Cameron,15,4 Journey Road
Danny,16,123 North Ave
Eric,17,325 Hill Blvd")
section2_idx <- grep('Section 2', raw_data[[1]])
raw_data_clean <- trimws(raw_data[[1]][1:(section2_idx - 2)])
allsect_idx <- grep('^Section', raw_data_clean)
if (length(allsect_idx) > 0)
  raw_data_clean <- raw_data_clean[-allsect_idx]
read.table(text = paste0(raw_data_clean, collapse = "\n"), sep = ",", header = TRUE)
#>      Name Age        Address
#> 1    Abby  10     1 Baker St
#> 2   Alice  12      3 Main St
#> 3   Becky  13       156 F St
#> 4     Ben  14      2 18th St
#> 5 Cameron  15 4 Journey Road
#> 6   Danny  16  123 North Ave
#> 7    Eric  17  325 Hill Blvd
Created on 2020-12-06 by the reprex package (v0.3.0)

Here is a made-up example that avoids having to type in your starting data.
mixed_data is 500 elements long, and each element is a string containing two commas. The string doesn't need to be broken apart if it looks like your example.
Create an empty vector to hold just one of each value. Then loop through the whole mixed list and add the unique entries to that vector. This example resulted in 444 unique items in one_of_each out of the original 500 in mixed_data.
set.seed(101)
a <- sample(LETTERS,500, replace = TRUE)
b <- sample(letters,500, replace = TRUE)
d <- sample(c(1:3),500, replace = TRUE)
mixed_data <- paste0(a,",",b,",",d)
head(mixed_data)
one_of_each <- c()  # starts empty
for (i in 1:length(mixed_data)) {
  if (mixed_data[i] %in% one_of_each == FALSE) {
    one_of_each <- c(one_of_each, mixed_data[i])  # if not found, then add
  }
}
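For what it's worth, base R's unique() collapses that whole loop into a single vectorised call and, like the loop, keeps the first occurrence of each value:
one_of_each <- unique(mixed_data)  # identical result to the loop above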

Related

Extracting best guess first and last names from a string

I have a set of names that looks as such:
names <- structure(list(name = c('Michael Smith ♕',
'Scott Lewis - Realtor',
'Erin Hopkins Ŧ',
'Katie Parsons | Denver',
'Madison Hollins Taylor',
'Kevin D. Williams',
'|Ryan Farmer|',
'l a u r e n t h o m a s',
'Dave Goodwin💦',
'Candice Harper Makeup Artist',
'dani longfeld // millenialmodels',
'Madison Jantzen | DALLAS, TX',
'Rachel Wallace Perkins',
'Kayla Wright Photography',
'Scott Green Jr.')), class = "data.frame", row.names = c(NA, -15L))
In addition to extracting first and last names from each of these, for ones like Rachel Wallace Perkins and Madison Hollins Taylor I'd like to create multiple extracted rows, since we don't really know which is their true last name. The final output would look something like this:
names_revised <- structure(list(name = c('Michael Smith',
'Scott Lewis',
'Erin Hopkins',
'Katie Parsons',
'Madison Hollins',
'Madison Taylor',
'Kevin Williams',
'Ryan Farmer',
'Lauren Thomas',
'Dave Goodwin',
'Candice Harper',
'Dani Longfeld',
'Madison Jantzen',
'Rachel Wallace',
'Rachel Perkins',
'Kayla Wright',
'Scott Green')), class = "data.frame", row.names = c(NA, -17L))
Based on some previous answers, I attempted the following (using the tidyr package):
names_extract <- tidyr::extract(names, name, c("FirstName", "LastName"), "([^ ]+) (.*)")
But that doesn't seem to do the trick, as the output it produces looks like this:
   FirstName                    LastName
 1   Michael                     Smith ♕
 2     Scott             Lewis - Realtor
 3      Erin                   Hopkins Ŧ
 4     Katie            Parsons | Denver
 5   Madison              Hollins Taylor
 6     Kevin                 D. Williams
 7     |Ryan                     Farmer|
 8         l       a u r e n t h o m a s
 9      Dave                   Goodwin💦
10   Candice        Harper Makeup Artist
11      dani longfeld // millenialmodels
12   Madison        Jantzen | DALLAS, TX
13    Rachel             Wallace Perkins
14     Kayla          Wright Photography
15     Scott                   Green Jr.
I know there are a ton of little edge cases that make this difficult, but overall, what would be the best approach for handling this that would capture the most results I'm trying for?
This fixes most of the rows.
library(dplyr)
library(tidyr)
names %>%
  mutate(name2 = sub("^[[:punct:]]", "", name) %>%
           sub(" \\w[.] ", " ", .) %>%
           sub("[[:punct:]]+ *[^[:punct:]]*$", "", .) %>%
           sub("\\W+[[:upper:]]+$", "", .) %>%
           trimws) %>%
  separate(name2, c("First", "Last"), extra = "merge")
giving:
                               name   First                  Last
 1                  Michael Smith ♕ Michael                 Smith
 2            Scott Lewis - Realtor   Scott                 Lewis
 3                   Erin Hopkins Ŧ    Erin               Hopkins
 4           Katie Parsons | Denver   Katie               Parsons
 5           Madison Hollins Taylor Madison        Hollins Taylor
 6                Kevin D. Williams   Kevin              Williams
 7                    |Ryan Farmer|    Ryan                Farmer
 8          l a u r e n t h o m a s       l a u r e n t h o m a s
 9                   Dave Goodwin??    Dave               Goodwin
10     Candice Harper Makeup Artist Candice  Harper Makeup Artist
11 dani longfeld // millenialmodels    dani              longfeld
12     Madison Jantzen | DALLAS, TX Madison               Jantzen
13           Rachel Wallace Perkins  Rachel       Wallace Perkins
14         Kayla Wright Photography   Kayla    Wright Photography
15                  Scott Green Jr.   Scott              Green Jr
Here's a first go at cleaning the data - (much) more will be needed to obtain perfect data:
library(stringr)
library(dplyr)
names %>%
  mutate(name = str_extract(name, "[\\w\\s.]+\\w"))
name
1 Michael Smith
2 Scott Lewis
3 Erin Hopkins
4 Katie Parsons
5 Madison Hollins Taylor
6 Kevin D. Williams
7 Ryan Farmer
8 l a u r e n t h o m a s
9 Dave Goodwin
10 Candice Harper Makeup Artist
11 dani longfeld
12 Madison Jantzen
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr
Here we use str_extract, which extracts just the first match in the string, which is convenient as most of the characters that you want to remove are right-end bound. The character class [\\w\\s.]+ matches any alphanumeric and whitespace characters and the dot occurring one or more times. It is followed by \\w, i.e., a single alphanumeric character to make sure that the extracted parts do not end on whitespace. As said, that's just a first go but the data is already very much tidier.
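As a quick sanity check of that pattern on a single string:
str_extract("Katie Parsons | Denver", "[\\w\\s.]+\\w")
# [1] "Katie Parsons"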

Cleaning addresses - add last token in street name (Ave, St,..) where missing, based on other records [closed]

In the example data below, some addresses are missing the last 'token' making up the street name - ave, st, dr, etc. I'm using OSM for geocoding and I find these records get a hit, but often in some other country. I would like to clean them further by adding the most likely missing token based on other records in the data.
valid_ends <- c("AVE", "ST", "EXT", "BLVD")
library(dplyr)
library(stringr)
df <- data.frame(address = c("75 NEW PARK AVE", "245 NEW PARK AVE", "42 NEW PARK",
                             "934 NEW PARK ST", "394 NEW PARK", "34 ASYLUM ST",
                             "42 ASYLUM", "953 ASYLUM AVE", "23 ASYLUM ST",
                             "65 WASHINGTON AVE EXT", "94 WASHINGTON AVE")) %>%
  mutate(addr_tokens = str_split(address, " ")) %>%
  mutate(addr_fix = NA)
Desired result: a new character column ("addr_fix") added to the above, containing an "augmented" address for records 3, 5, and 7 ("AVE", "AVE", "ST" respectively). A record is augmented when its last address token is not contained in valid_ends. The token appended is the one that occurs most frequently for that street elsewhere in the data (streets are matched after removing the numeric first token and the valid end tokens from the addresses).
A little messy, but this approach should work:
Start by getting the "core address" - the street name without its suffix - and copying the suffix/"valid end", if there is one, into a new column, end:
valid_ends_rgx <- paste0(valid_ends, collapse = "|")
df2 <- df %>%
  mutate(has_valid_end = str_detect(address, valid_ends_rgx),
         core_addr = str_remove_all(address, valid_ends_rgx) %>%
           str_trim() %>%
           str_remove("\\d+ "),
         end = str_match(address, valid_ends_rgx)[, 1])
df2
# A tibble: 11 x 4
   address               has_valid_end core_addr  end
   <chr>                 <lgl>         <chr>      <chr>
 1 75 NEW PARK AVE       TRUE          NEW PARK   AVE
 2 245 NEW PARK AVE      TRUE          NEW PARK   AVE
 3 42 NEW PARK           FALSE         NEW PARK   NA
 4 934 NEW PARK ST       TRUE          NEW PARK   ST
 5 394 NEW PARK          FALSE         NEW PARK   NA
 6 34 ASYLUM ST          TRUE          ASYLUM     ST
 7 42 ASYLUM             FALSE         ASYLUM     NA
 8 953 ASYLUM AVE        TRUE          ASYLUM     AVE
 9 23 ASYLUM ST          TRUE          ASYLUM     ST
10 65 WASHINGTON AVE EXT TRUE          WASHINGTON AVE
11 94 WASHINGTON AVE     TRUE          WASHINGTON AVE
Find the most common valid ending for each street:
replacements <- df2 %>%
  group_by(core_addr, end) %>%
  summarise(end_ct = n()) %>%
  group_by(core_addr) %>%
  summarise(most_end = end[which.max(end_ct)])
# A tibble: 3 x 2
  core_addr  most_end
  <chr>      <chr>
1 ASYLUM     ST
2 NEW PARK   AVE
3 WASHINGTON AVE
Update the address fields with missing ends, based on the most_end field in `replacements`:
df2 %>%
  left_join(replacements, by = "core_addr") %>%
  transmute(
    address = if_else(has_valid_end, address, str_c(address, most_end, sep = " "))
  )
# A tibble: 11 x 1
address
<chr>
1 75 NEW PARK AVE
2 245 NEW PARK AVE
3 42 NEW PARK AVE
4 934 NEW PARK ST
5 394 NEW PARK AVE
6 34 ASYLUM ST
7 42 ASYLUM ST
8 953 ASYLUM AVE
9 23 ASYLUM ST
10 65 WASHINGTON AVE EXT
11 94 WASHINGTON AVE
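One caveat worth flagging: valid_ends_rgx carries no word boundaries, so a suffix like ST can also match inside a street name (a hypothetical "12 POST RD" would lose the ST inside POST when building core_addr). A slightly safer variant anchors each suffix as a whole word:
valid_ends_rgx <- paste0("\\b(", paste(valid_ends, collapse = "|"), ")\\b")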

Removing Custom Words From Text Variables in R

I have a data set which looks like the following:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
> dat
  ID         ADDRESS
1  1    EAST SS BLVD
2  2 SOUTH AA STREET
3  3      XX EAST ST
4  4   ZZ NORTH ROAD
5  5   WEST TR TRAIL
I want to remove all details in the address that are not in my list of words. I am using the following code, which is not right and is not working.
dat$FEATURE <- gsub("^[(BLVD)|(BOULEVARD)|(DRIVE)|(DR)|(ROAD)|(RD)|(PL)|(PLACE)
|(SL)|(CIRCLE)|(CT)|(COURT)|(WY)|(WAY)|(ST)|(STREET)|(AVE)
|(AVENUE)|(PKWY)|(WAY)|(PARKWAY)|(LN)|(LANE)|(HWY)|(HIGHWAY)
|(TRAIL$)|(CIR$)]","",dat$ADDRESS)
> dat
  ID         ADDRESS        FEATURE
1  1    EAST SS BLVD    AST SS BLVD
2  2 SOUTH AA STREET OUTH AA STREET
3  3      XX EAST ST     XX EAST ST
4  4   ZZ NORTH ROAD  ZZ NORTH ROAD
5  5   WEST TR TRAIL   EST TR TRAIL
The output I want is:
> dat1
  ID         ADDRESS FEATURE
1  1    EAST SS BLVD    BLVD
2  2 SOUTH AA STREET  STREET
3  3      XX EAST ST      ST
4  4   ZZ NORTH ROAD    ROAD
5  5   WEST TR TRAIL   TRAIL
I am not great at regex; any help is appreciated, and any references for regex in R would be helpful.
You may use
(?xs).*\b # any 0+ chars, as many as possible, then word boundary
( # Group 1 start:
BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)? # Various words
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)? # you need to keep
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY # here
|TRAIL$|CIR$ # and here
) # Group 1 end
\b # Word boundary
.* # Rest of the string.
See the regex demo
Here, (?x) is the free-spacing/comment/verbose modifier, which allows formatting whitespace and comments inside the pattern. (?s) is the DOTALL modifier, allowing . to match any char including a newline (it is necessary as this is a PCRE pattern; note the perl=TRUE).
The "\\1" replacement inserts the value in Group 1 back into the replaced string.
See the R demo:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("(?xs).*\\b(BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)?
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY
|TRAIL$|CIR$)\\b.*","\\1",dat$ADDRESS, perl=TRUE)
dat
Output:
  ID         ADDRESS FEATURE
1  1    EAST SS BLVD    BLVD
2  2 SOUTH AA STREET  STREET
3  3      XX EAST ST      ST
4  4   ZZ NORTH ROAD    ROAD
5  5   WEST TR TRAIL   TRAIL
You could do it like this
#R version 3.3.2
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("\\b(?!AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y)).+?\\b","",dat$ADDRESS, perl=TRUE)
dat
http://rextester.com/GGYN78288
https://regex101.com/r/6RcXTi/1
I guess technically, this is more exact:
"\\b(?!(?:AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y))\\b).+?\\b"

How can I separate one column into two in R so that the all-capital words are in one column?

I have one column like this:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
# [1] WV West Virginia FL Florida
# [3] CA California SC South Carolina
How can I separate the abbreviation from the full state name? And I want to give the two new columns different headers. I think I can only solve this by splitting off the all-capital words.
With tidyr we can use separate to split the column into two while specifying the new names. The argument extra="merge" restricts the output to the given columns, merging any extra pieces into the last one. The separator defaults to non-alphanumeric characters:
library(tidyr)
separate(df, x, c("Abb", "State"), extra="merge")
#  Abb          State
#1  WV  West Virginia
#2  FL        Florida
#3  CA     California
#4  SC South Carolina
Data
df <- data.frame(x = c('WV West Virginia', 'FL Florida', 'CA California', 'SC South Carolina'))
Two approaches without external packages:
Approach 1: you could use substr in combination with nchar.
dat <- data.frame(raw = c("WV West Virginia", "FL Florida", "CA California", "SC South Carolina"),
                  stringsAsFactors = FALSE)
dat$code <- substr(dat$raw,1,2)
dat$state <- substr(dat$raw, 4, nchar(dat$raw))
> dat
                raw code          state
1  WV West Virginia   WV  West Virginia
2        FL Florida   FL        Florida
3     CA California   CA     California
4 SC South Carolina   SC South Carolina
Approach 2: you could use regular expressions to replace parts of your strings:
##approach two: regex
dat$code <- sub(" .+","",dat$raw)
dat$state <- sub("[A-Z]{2} ","",dat$raw)
Use the state.* constants that come with the base datasets package
DF = data.frame(raw=c("WV West Virginia","FL Florida","CA California","SC South Carolina"))
DF$state.abbr <- substr(DF$raw, 1, 2)
DF$state.name <- state.name[ match(DF$state.abbr, state.abb) ]
#                 raw state.abbr     state.name
# 1  WV West Virginia         WV  West Virginia
# 2        FL Florida         FL        Florida
# 3     CA California         CA     California
# 4 SC South Carolina         SC South Carolina
This way, you can afford to have typos or other oddities in the state names.
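For instance, a misspelled state name still resolves correctly because only the two-letter abbreviation feeds the lookup (hypothetical typo):
state.name[match(substr("WV West Viginia", 1, 2), state.abb)]
# [1] "West Virginia"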
Use the reshape2 package.
library(reshape2)
x <- rbind('WV West Virginia','FL Florida','CA California','SC South Carolina')
colsplit(x," ",c("Code","State"))
Output:
  Code          State
1   WV  West Virginia
2   FL        Florida
3   CA     California
4   SC South Carolina
Based on #rawr's comment, we could split 'x' at the white space that follows the first two characters, i.e. the regex lookbehind ((?<=^.{2})). The output will be a list, which we rbind, convert to a data.frame, and then cbind with the original vector 'x'.
cbind(x, as.data.frame(do.call(rbind, strsplit(x, '(?<=^.{2})\\s+', perl = TRUE)),
                       stringsAsFactors = FALSE))
#                  x V1             V2
#1  WV West Virginia WV  West Virginia
#2        FL Florida FL        Florida
#3     CA California CA     California
#4 SC South Carolina SC South Carolina
Or instead of the regex lookaround, we could use stri_split with n=2 and split at whitespace.
library(stringi)
cbind(x,as.data.frame(do.call(rbind,stri_split(x, regex='\\s+', n=2))))
Here's a data.table/gsub approach:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
data.table::data.table(x)[,
abb := gsub("(^[A-Z]{2})( .+)", "\\1", x)][,
state := gsub("(^[A-Z]{2})( .+)", "\\2", x)][]
##                    x abb          state
## 1:  WV West Virginia  WV  West Virginia
## 2:        FL Florida  FL        Florida
## 3:     CA California  CA     California
## 4: SC South Carolina  SC South Carolina

Import raw data into R

Can anyone help me import this data into R from a text or .dat file? It is space delimited, but city names such as NEW YORK should not be treated as two separate fields.
1 NEW YORK 7,262,700
2 LOS ANGELES 3,259,340
3 CHICAGO 3,009,530
4 HOUSTON 1,728,910
5 PHILADELPHIA 1,642,900
6 DETROIT 1,086,220
7 SAN DIEGO 1,015,190
8 DALLAS 1,003,520
9 SAN ANTONIO 914,350
10 PHOENIX 894,070
For your particular data, where true spaces within a field only occur between capital letters, consider using a regular expression:
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK 7,262,700")
# [1] "1 NEW-YORK 7,262,700"
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO 3,009,530")
# [1] "3 CHICAGO 3,009,530"
You can then interpret spaces as field separators.
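A minimal sketch of that final read step, assuming the raw lines live in a file called cities.txt (the file name is a placeholder):
txt <- gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", readLines("cities.txt"))
cities <- read.table(text = txt)                   # whitespace is now a safe separator
cities$V2 <- gsub("-", " ", cities$V2)             # restore the spaces in city names
cities$V3 <- as.numeric(gsub(",", "", cities$V3))  # drop thousands separators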
A variation on a theme... but first, some sample data:
cat("1 NEW YORK 7,262,700",
"2 LOS ANGELES 3,259,340",
"3 CHICAGO 3,009,530",
"4 HOUSTON 1,728,910",
"5 PHILADELPHIA 1,642,900",
"6 DETROIT 1,086,220",
"7 SAN DIEGO 1,015,190",
"8 DALLAS 1,003,520",
"9 SAN ANTONIO 914,350",
"10 PHOENIX 894,070", sep = "\n", file = "test.txt")
Step 1: Read the data in with readLines
x <- readLines("test.txt")
Step 2: Figure out a regular expression that you can use to insert delimiters. Here, the pattern seems to be (looking from the end of the lines) a set of numbers and commas preceded by space preceded by some words in ALL CAPS. We can capture those groups and insert some "tab" delimiters (\t). The extra slashes are to properly escape them.
gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
# [1] "1\t NEW YORK \t7,262,700" "2\t LOS ANGELES \t3,259,340"
# [3] "3\t CHICAGO \t3,009,530" "4\t HOUSTON \t1,728,910"
# [5] "5\t PHILADELPHIA \t1,642,900" "6\t DETROIT \t1,086,220"
# [7] "7\t SAN DIEGO \t1,015,190" "8\t DALLAS \t1,003,520"
# [9] "9\t SAN ANTONIO \t914,350" "10\t PHOENIX \t894,070"
Step 3: Since we know our gsub is working, and we know that read.delim has a "text" argument that can be used instead of a "file" argument, we can use read.delim directly on the result of gsub:
out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x),
header = FALSE, strip.white = TRUE)
out
#    V1           V2        V3
# 1   1     NEW YORK 7,262,700
# 2   2  LOS ANGELES 3,259,340
# 3   3      CHICAGO 3,009,530
# 4   4      HOUSTON 1,728,910
# 5   5 PHILADELPHIA 1,642,900
# 6   6      DETROIT 1,086,220
# 7   7    SAN DIEGO 1,015,190
# 8   8       DALLAS 1,003,520
# 9   9  SAN ANTONIO   914,350
# 10 10      PHOENIX   894,070
One possible last step would be to convert the third column to numeric:
out$V3 <- as.numeric(gsub(",", "", out$V3))
Expanding on #Hugh's answer, I would try the following, although it's not particularly efficient.
lines <- scan("cities.txt", sep="\n", what="character")
lines <- unlist(lapply(lines, function(x) {
  gsub(pattern = "([a-zA-Z]) ([a-zA-Z]+)", replacement = "\\1-\\2", x)
}))
citiesDF <- data.frame(num = rep(0, length(lines)),
                       city = rep("", length(lines)),
                       population = rep(0, length(lines)),
                       stringsAsFactors = FALSE)
for (i in 1:length(lines)) {
  splitted <- strsplit(lines[i], " +")
  citiesDF[i, "num"] <- as.numeric(splitted[[1]][1])
  citiesDF[i, "city"] <- gsub("-", " ", splitted[[1]][2])
  citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[[1]][3]))
}
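A more vectorised sketch of the same idea, reusing the already-hyphenated lines from above and dropping the row-by-row loop:
parts <- do.call(rbind, strsplit(lines, " +"))  # matrix with columns: num, city, population
citiesDF <- data.frame(num = as.numeric(parts[, 1]),
                       city = gsub("-", " ", parts[, 2]),
                       population = as.numeric(gsub(",", "", parts[, 3])),
                       stringsAsFactors = FALSE)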
