Removing Custom Words From Text Variables in R

I have a data set that looks like the following:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
> dat
ID ADDRESS
1 1 EAST SS BLVD
2 2 SOUTH AA STREET
3 3 XX EAST ST
4 4 ZZ NORTH ROAD
5 5 WEST TR TRAIL
I want to remove everything in the address that is not in my list of wanted words. The code I am using is not correct and does not work.
dat$FEATURE <- gsub("^[(BLVD)|(BOULEVARD)|(DRIVE)|(DR)|(ROAD)|(RD)|(PL)|(PLACE)
|(SL)|(CIRCLE)|(CT)|(COURT)|(WY)|(WAY)|(ST)|(STREET)|(AVE)
|(AVENUE)|(PKWY)|(WAY)|(PARKWAY)|(LN)|(LANE)|(HWY)|(HIGHWAY)
|(TRAIL$)|(CIR$)]","",dat$ADDRESS)
> dat
ID ADDRESS FEATURE
1 1 EAST SS BLVD AST SS BLVD
2 2 SOUTH AA STREET OUTH AA STREET
3 3 XX EAST ST XX EAST ST
4 4 ZZ NORTH ROAD ZZ NORTH ROAD
5 5 WEST TR TRAIL EST TR TRAIL
The output that I want is:
> dat1
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
I am not great at regex, so any help is appreciated, and any references for regex in R would be helpful.

You may use
(?xs).*\b # any 0+ chars, as many as possible, then word boundary
( # Group 1 start:
BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)? # Various words
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)? # you need to keep
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY # here
|TRAIL$|CIR$ # and here
) # Group 1 end
\b # Word boundary
.* # Rest of the string.
See the regex demo
Here, (?x) is the free-spacing (verbose) modifier, which allows formatting whitespace and comments inside the pattern. (?s) is the DOTALL modifier, which lets . match any character including a newline (it is needed because this is a PCRE pattern; note the perl=TRUE argument).
The "\\1" replacement inserts the value in Group 1 back into the replaced string.
See the R demo:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("(?xs).*\\b(BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)?
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY
|TRAIL$|CIR$)\\b.*","\\1",dat$ADDRESS, perl=TRUE)
dat
Output:
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
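As a rough sketch of the same idea, the pattern can also be assembled from a plain character vector, which may be easier to maintain than a hand-written alternation. The vector keep_words below is an assumed, shortened list and the row "NO MATCH HERE" is added only for illustration; addresses without any listed word are set to NA instead of being returned unchanged.

```r
# Assumed, shortened word list; extend it with the remaining street types.
keep_words <- c("BLVD", "BOULEVARD", "DRIVE", "DR", "ROAD", "RD",
                "STREET", "ST", "AVENUE", "AVE", "LANE", "LN", "TRAIL")

# The greedy ".*" makes the capture group land on the LAST listed word.
pattern <- paste0("(?s).*\\b(", paste(keep_words, collapse = "|"), ")\\b.*")

addresses <- c("EAST SS BLVD", "SOUTH AA STREET", "XX EAST ST",
               "ZZ NORTH ROAD", "WEST TR TRAIL", "NO MATCH HERE")

# sub() returns the input unchanged when nothing matches, so flag those rows.
has_word <- grepl(pattern, addresses, perl = TRUE)
feature <- ifelse(has_word,
                  sub(pattern, "\\1", addresses, perl = TRUE),
                  NA_character_)

print(data.frame(ADDRESS = addresses, FEATURE = feature))
```

The word boundaries keep "ST" from matching inside "EAST", so the order of the words in the vector should not matter here.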

You could do it like this
#R version 3.3.2
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("\\b(?!AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y)).+?\\b","",dat$ADDRESS, perl=TRUE)
dat
http://rextester.com/GGYN78288
https://regex101.com/r/6RcXTi/1
I guess technically, this is more exact:
"\\b(?!(?:AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y))\\b).+?\\b"
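To illustrate why the second form is more exact, here is a small sketch with a shortened, assumed word list (only DR/DRIVE and ST/STREET) and a made-up address. Without the trailing \\b inside the lookahead, any word that merely starts with a listed abbreviation is protected from removal.

```r
# Shortened patterns for illustration only; the full word list works the same way.
loose  <- "\\b(?!DR(?:IVE)?|ST(?:REET)?).+?\\b"
strict <- "\\b(?!(?:DR(?:IVE)?|ST(?:REET)?)\\b).+?\\b"

address <- "DRAKE ST"  # hypothetical address: "DRAKE" begins with "DR"

# The loose lookahead sees "DR" at the start of "DRAKE" and refuses to remove it.
loose_result <- gsub(loose, "", address, perl = TRUE)

# The strict lookahead requires a whole word, so "DRAKE" is removed.
strict_result <- gsub(strict, "", address, perl = TRUE)

print(c(loose = loose_result, strict = strict_result))
```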

Related

Remove specific value in R or Linux

Hi, I have a tab-separated file in the terminal with several columns, as below. The last column contains a comma followed by one or more characters.
1 100 Japan Na pa,cd
2 120 India Ca pa,ces
5 110 Japan Ap pa,cres
1 540 China Sn pa,cd
1 111 Nepal Le pa,b
I want to keep last column values before the comma so the file can look like
2 120 India Ca pa
5 110 Japan Ap pa
1 540 China Sn pa
1 111 Nepal Le pa
I have looked for sed but I cannot find a way to exclude them
Regards
In R you can read the file with a tab-separator and remove the values after comma.
result <- transform(read.table('file1.txt', sep = '\t'), V5 = sub(',.*', '', V5))
V5 is used on the assumption that the 5th column is the one whose values you want to change.
We can use
df1 <- read.table('file1.txt', sep="\t")
df1$V5 <- sub("^([^,]+),.*", "\\1", df1$V5)
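A self-contained sketch of the same approach, reading two of the sample rows from an inline string instead of file1.txt so it can be run as is. The variable name raw is only illustrative; write.table prints the result back in tab-separated form.

```r
# Two of the sample rows, tab-separated, supplied inline instead of a file.
raw <- "1\t100\tJapan\tNa\tpa,cd\n2\t120\tIndia\tCa\tpa,ces"

# header = FALSE is the default, so the columns are named V1..V5.
df1 <- read.table(text = raw, sep = "\t", stringsAsFactors = FALSE)

# Drop the first comma and everything after it in the last column.
df1$V5 <- sub(",.*", "", df1$V5)

# Write the result back out as a tab-separated table (here to the console).
write.table(df1, file = stdout(), sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```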

Recode by comparing a value to numbers in a vector

I want to code the values in a column into fewer values in another column.
For example,
if the value in zipcode column is one of the following c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028),
code it as "west" in district column.
How can I do it in R?
You can use the ifelse() function.
Set up the data in a dataframe:
df <- data.frame(zipcode = c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028))
Then use ifelse() to code a new value based on the values of zipcode.
df$district <- ifelse(df$zipcode %in% c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028),
"west",
NA)
> df
zipcode district
1 90272 west
2 90049 west
3 90077 west
4 90210 west
5 90046 west
6 90069 west
7 90024 west
8 90025 west
9 90048 west
10 90036 west
11 90038 west
12 90028 west
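If more than one district is needed, nested ifelse() calls get hard to read. A possible alternative is a lookup table combined with match(). In the sketch below, the "central" district and its zip codes (90001, 90002), as well as the test value 12345, are made up purely for illustration.

```r
west_zips <- c(90272, 90049, 90077, 90210, 90046, 90069,
               90024, 90025, 90048, 90036, 90038, 90028)
central_zips <- c(90001, 90002)  # hypothetical second group, illustration only

# One row per zip code with the district it belongs to.
lookup <- data.frame(
  zipcode  = c(west_zips, central_zips),
  district = rep(c("west", "central"),
                 times = c(length(west_zips), length(central_zips))),
  stringsAsFactors = FALSE
)

df <- data.frame(zipcode = c(90210, 90001, 12345))

# match() returns NA for zip codes missing from the lookup, so they stay NA.
df$district <- lookup$district[match(df$zipcode, lookup$zipcode)]
print(df)
```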

Splitting a string few characters after the delimiter

I have a large data set of names and states that I need to split. After splitting, I want to create new rows with each name and state. My data strings are in multiple lines that look like this
"Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ"
"Ralph Hogan, TX, Michael Johnson, FL"
I need the data to look like this
attr name state
1 Peter Johnson IN
2 Chet Charles TX
3 Ed Walsh AZ
4 Ralph Hogan TX
5 Michael Johnson FL
I can't figure out how to do this, perhaps split it somehow a few characters after the comma? Any help would be greatly appreciated.
If it is multiple line strings, then we can create a delimiter with gsub, split the strings using strsplit, create data.frame with the components of the split in the output list, and rbind it together.
d1 <- do.call(rbind, lapply(strsplit(gsub("([A-Z]{2})(\\s+|,)",
"\\1;", lines), "[,;]"), function(x) {
x1 <- trimws(x)
data.frame(name = x1[c(TRUE, FALSE)],state = x1[c(FALSE, TRUE)]) }))
cbind(attr = seq_len(nrow(d1)), d1)
# attr name state
#1 1 Peter Johnson IN
#2 2 Chet Charles TX
#3 3 Ed Walsh AZ
#4 4 Ralph Hogan TX
#5 5 Michael Johnson FL
Or this can be done in a compact way
library(data.table)
fread(paste(gsub("([A-Z]{2})(\\s+|,)", "\\1\n", lines), collapse="\n"),
col.names = c("names", "state"), header = FALSE)[, attr := 1:.N][]
# names state attr
#1: Peter Johnson IN 1
#2: Chet Charles TX 2
#3: Ed Walsh AZ 3
#4: Ralph Hogan TX 4
#5: Michael Johnson FL 5
data
lines <- readLines(textConnection("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ
Ralph Hogan, TX, Michael Johnson, FL"))
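Another possible route in base R is to extract each "name, state" pair directly instead of inserting a delimiter first. This is only a sketch: it assumes that every name starts with an uppercase letter, contains no comma, and is followed by a two-letter uppercase state code.

```r
lines <- c("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ",
           "Ralph Hogan, TX, Michael Johnson, FL")

# Each pair: a capitalised name without commas, a comma, then a 2-letter code.
pair_pattern <- "[A-Z][^,]*,\\s*[A-Z]{2}\\b"
pairs <- unlist(regmatches(lines, gregexpr(pair_pattern, lines, perl = TRUE)))

result <- data.frame(
  attr  = seq_along(pairs),
  name  = sub(",.*$", "", pairs),     # text before the comma
  state = sub("^.*,\\s*", "", pairs), # text after the comma
  stringsAsFactors = FALSE
)
print(result)
```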

How can I separate one column into two in R so that the all capital letter words are in one column?

I have a one column like this:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
# [1] WV West Virginia FL Florida
# [3] CA California SC South Carolina
How can I separate the abbreviation from the full state name? I also want to give the two new columns different headers. I think the only way to solve this is to split off the all-uppercase words.
With tidyr we can use separate to expand the column into two while specifying the new names. The argument extra="merge" limits the output to the given columns. By default the separator is any run of non-alphanumeric characters:
library(tidyr)
separate(df, x, c("Abb", "State"), extra="merge")
# Abb State
#1 WV West Virginia
#2 FL Florida
#3 CA California
#4 SC South Carolina
Data
df <- data.frame(x = c('WV West Virginia', 'FL Florida','CA California', 'SC South Carolina'), stringsAsFactors = FALSE)
Two approaches without external packages:
Approach 1: you could use substr in combination with nchar.
dat <- data.frame(raw=c("WV West Virginia","FL Florida", "CA California","SC South Carolina"),
stringsAsFactors=FALSE)
dat$code <- substr(dat$raw,1,2)
dat$state <- substr(dat$raw, 4, nchar(dat$raw))
> dat
raw code state
1 WV West Virginia WV West Virginia
2 FL Florida FL Florida
3 CA California CA California
4 SC South Carolina SC South Carolina
Approach two: you could use regular expressions to replace parts of your strings:
##approach two: regex
dat$code <- sub(" .+","",dat$raw)
dat$state <- sub("[A-Z]{2} ","",dat$raw)
Use the state.* constants that come with the base datasets package
DF = data.frame(raw=c("WV West Virginia","FL Florida","CA California","SC South Carolina"))
DF$state.abbr <- substr(DF$raw, 1, 2)
DF$state.name <- state.name[ match(DF$state.abbr, state.abb) ]
# raw state.abbr state.name
# 1 WV West Virginia WV West Virginia
# 2 FL Florida FL Florida
# 3 CA California CA California
# 4 SC South Carolina SC South Carolina
This way, you can afford to have typos or other oddities in the state names.
Use the reshape2 package.
library(reshape2)
x <- rbind('WV West Virginia','FL Florida','CA California','SC South Carolina')
colsplit(x," ",c("Code","State"))
Output:
Code State
1 WV West Virginia
2 FL Florida
3 CA California
4 SC South Carolina
Based on @rawr's comment, we could split 'x' at the white space that follows the first two characters, as expressed by the regex lookbehind ((?<=^.{2})). The output will be a list, which we rbind, convert to a data.frame and then cbind with the original vector 'x'.
cbind(x, as.data.frame(do.call(rbind,strsplit(x, '(?<=^.{2})\\s+', perl=TRUE)),
stringsAsFactors=FALSE))
# x V1 V2
#1 WV West Virginia WV West Virginia
#2 FL Florida FL Florida
#3 CA California CA California
#4 SC South Carolina SC South Carolina
Or instead of the regex lookaround, we could use stri_split with n=2 and split at whitespace.
library(stringi)
cbind(x,as.data.frame(do.call(rbind,stri_split(x, regex='\\s+', n=2))))
Here's a data.table/ gsub approach:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
data.table::data.table(x)[,
abb := gsub("(^[A-Z]{2})( .+)", "\\1", x)][,
state := gsub("(^[A-Z]{2})( .+)", "\\2", x)][]
## x abb state
## 1: WV West Virginia WV West Virginia
## 2: FL Florida FL Florida
## 3: CA California CA California
## 4: SC South Carolina SC South Carolina
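One more base R option, sketched under the assumption that every entry starts with exactly two uppercase letters followed by white space, is utils::strcapture (available since R 3.4.0). It returns a data frame whose column names and types come from the proto argument, so the new headers are set in the same step.

```r
x <- c("WV West Virginia", "FL Florida", "CA California", "SC South Carolina")

# Group 1 is the two-letter code, group 2 is the rest of the string.
res <- strcapture(
  pattern = "^([A-Z]{2})\\s+(.*)$",
  x       = x,
  proto   = data.frame(Abb = character(), State = character(),
                       stringsAsFactors = FALSE)
)
print(res)
```

Entries that do not match the pattern come back as NA in both columns, which makes malformed rows easy to spot.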

Clustering list for hclust function

Using the plot(hclust(dist(x))) method, I was able to draw a cluster tree. It works, but I would like to get a list of all clusters instead of a tree diagram, because I have a huge amount of data (around 150K nodes) and the plot gets messy.
In other words, if a, b, c form one cluster and d, e, f, g form another, I would like to get something like this:
1 a,b,c
2 d,e,f,g
Please note that this is not exactly what I want as "output"; it is just an example. I would just like to get a list of clusters instead of a tree plot. It could be a vector, a matrix or simply numbers that show which group each element belongs to.
How is this possible?
I will use a dataset available in R to demonstrate how to cut a tree into a desired number of pieces. The result is a named vector of group memberships.
Construct a hclust object.
hc <- hclust(dist(USArrests), "ave")
#plot(hc)
You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of groups with the k parameter. See ?cutree and the use of the parameter h, which may be more useful to you (see cutree(hc, k = 2) == cutree(hc, h = 110)).
cutree(hc, k = 2)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 2 1 1 2
Hawaii Idaho Illinois Indiana Iowa
2 2 1 2 2
Kansas Kentucky Louisiana Maine Maryland
2 2 1 2 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 2 1 2
Montana Nebraska Nevada New Hampshire New Jersey
2 2 1 2 2
New Mexico New York North Carolina North Dakota Ohio
1 1 1 2 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 2 2 1
South Dakota Tennessee Texas Utah Vermont
2 2 2 2 2
Virginia Washington West Virginia Wisconsin Wyoming
2 2 2 2 2
Let's say:
y<-dist(x)
clust<-hclust(y)
groups<-cutree(clust, k=3)
x<-cbind(x,groups)
Now each record has its cluster group attached.
You can subset the dataset as well:
x1<- subset(x, groups==1)
x2<- subset(x, groups==2)
x3<- subset(x, groups==3)
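To get something close to the "1 a,b,c" listing from the question, the membership vector returned by cutree can be split by group. This is a minimal sketch on the built-in USArrests data with k = 3 chosen arbitrarily; the names clusters and listing are only illustrative.

```r
hc <- hclust(dist(USArrests), "average")
groups <- cutree(hc, k = 3)  # named integer vector: state name -> cluster id

# One list element per cluster, each holding the member names.
clusters <- split(names(groups), groups)

# Collapse each cluster into a single comma-separated string.
listing <- vapply(clusters, paste, character(1), collapse = ",")
print(listing)
```

Note that dist() builds a full pairwise distance matrix, so for around 150K rows memory use may become the limiting factor before the plotting does.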
