I have a dataframe with UK postcodes in it. Unfortunately some of the postcode data is incorrect - i.e., some entries are purely numeric (all UK postcodes should start with an alphabetic character).
I have done some research and found the grepl function, which I've used to generate a TRUE/FALSE vector indicating whether an entry is only numeric:
Data$NewPostCode <- grepl("^.*[0-9]+[A-Za-z]+.*$|.*[A-Za-z]+[0-9]+.*$",Data$PostCode)
However, what I really want to do is blank out the postcode wherever the entry starts with a number.
Note that I don't want to remove the rows with an incorrect postcode, as I would lose information from the other variables. I simply want to remove that postcode.
Example data
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton 1254
London 1290C
Newcastle N1 3DC
Desired output
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton
London
Newcastle N1 3DC
There are a few ways to go from a TRUE/FALSE vector to the kind of replacement you want, but I prefer ifelse. A simpler way to generate the type of logical vector you're looking for is
grepl("^[0-9]", Data$PostCode)
which will be TRUE whenever PostCode starts with a number, and FALSE otherwise. You may need to adjust the regex if your needs are more complex.
You can then define a new column which is blank whenever the vector is TRUE and the old value whenever the vector is FALSE, as follows:
Data$NewPostCode <- ifelse(grepl("^[0-9]", Data$PostCode), "", Data$PostCode)
(May I suggest using NA instead of blank?)
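If you do take the NA route, a minimal sketch that overwrites the column in place rather than creating NewPostCode:
Data$PostCode[grepl("^[0-9]", Data$PostCode)] <- NA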
Related
I have a vector of city names:
Cities <- c("New York", "San Francisco", "Austin")
and I want to use it to find records in a 1,000,000+ element column of city/state names, contained in a bigger table, that match any of the items in the Cities vector:
Locations<- c("San Antonio/TX","Austin/TX", "Boston/MA")
I tried using lapply and grep, but it kept saying it can't use an input vector of length larger than 1.
Ideally I want to return the row positions in the Locations vector that contain any item in the Cities vector, so that I can select the matching rows in the broader table.
grep and family only allow a single pattern= in their call, but one can use Vectorize to help with this:
out <- Vectorize(grepl, vectorize.args = "pattern")(Cities, Locations)
rownames(out) <- Locations
out
# New York San Francisco Austin
# San Antonio/TX FALSE FALSE FALSE
# Austin/TX FALSE FALSE TRUE
# Boston/MA FALSE FALSE FALSE
(I added rownames(.) purely to identify columns/rows from the source data.)
With this, if you want to know which index points where, then you can do
apply(out, 1, function(z) which(z)[1])
# San Antonio/TX Austin/TX Boston/MA
# NA 3 NA
apply(out, 2, function(z) which(z)[1])
# New York San Francisco Austin
# NA NA 2
The first indicates the index within Cities that applies to each location. The second indicates the index within Locations that applies to each of the Cities. Both of these methods assume at most a 1-to-1 matching; if there is ever more than one match, the which(z)[1] will hide the second and subsequent ones, which is likely not what you want.
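If all you need is the row positions in Locations that match any city (the original goal), you can skip the matrix entirely. A sketch, assuming the city names contain no regex metacharacters: collapse them into a single alternation pattern.
grep(paste(Cities, collapse = "|"), Locations)
# [1] 2
Equivalently, from the matrix above, which(rowSums(out) > 0) gives the same positions.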
Ok, so I have a dataframe that I downloaded from Pew Research Center. One of the columns (called 'cregion') contains a series of numbers from 1-56, with each number corresponding to a geographic location in the U.S. Most of these locations are states, and the additional 6 are at the sub-state level. So, for example, the number '1' corresponds to 'Alabama', and '11' corresponds to the 'District Of Columbia'.
What I'd like to do is replace each of those numbers in the 'cregion' column with the ACTUAL name of the region it corresponds to. Unfortunately, there is no column in this data frame that I can use to swap the values, as the key for which number corresponds to which region exists completely separately (in a Word document). I'm new to R, and while I've been searching for a few hours for the best way to go about this, I can't find a method that works (or I just don't understand the explanation). Can anybody suggest a method?
If you have a vector of the state names as strings called statevec whose ith element corresponds to cregion i, and your data frame is named dat, just do
dat <- data.frame(cregion = sample(1:50), stuff = runif(50))
head(dat)
# cregion stuff
#1 25 0.665843896
#2 11 0.144631131
#3 13 0.691616240
#4 28 0.507454243
#5 9 0.416535139
#6 30 0.004196311
statevec <- state.name
dat$cregion <- statevec[dat$cregion]
head(dat)
# cregion stuff
#1 Missouri 0.665843896
#2 Hawaii 0.144631131
#3 Illinois 0.691616240
#4 Nevada 0.507454243
#5 Florida 0.416535139
#6 New Jersey 0.004196311
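If the codes aren't a clean 1..n sequence (the question mentions 56 codes, including sub-state regions such as '11' for the District Of Columbia), a named-vector lookup is a safer sketch; the key below is illustrative and would be filled in from the Word document:
key <- c("1" = "Alabama", "11" = "District Of Columbia")  # extend with the full 56-entry key
dat$cregion <- unname(key[as.character(dat$cregion)])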
I want to extract strings from elements in a data frame. Having gone through numerous previous questions, I am still unable to understand what to do. This is what I have tried so far:
unlist(strsplit(pcode2$Postcode,"'"))
I get the following error:
Error in strsplit(pcode2$Postcode, "'") : non-character argument
which I understand, because I am trying to reference the data rather than putting the text in the code itself. I have 16,000 cases in the data frame, so I am also not sure how to vectorise the operation.
Any help would be greatly appreciated.
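As an aside: strsplit requires a character vector, and this error usually means pcode2$Postcode is a factor; converting it first avoids the error:
unlist(strsplit(as.character(pcode2$Postcode), "'"))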
Data:
Postcode Locality State Latitude Longitude
1 ('200', Australian National University ACT -35.280, 149.120),
2 ('221', Barton ACT -35.200, 149.100),
3 ('3030', Werribee VIC -12.800, 130.960),
4 ('3030', Point Cook VIC -12.800, 130.960),
I want to get rid of the commas, quotes, and parentheses so that I am left with the numeric part of column 1 (Postcode) and the numeric parts of Latitude and Longitude. This is how I am hoping the final result will look:
Postcode Locality State Latitude Longitude
1 200 Australian National University ACT -35.280 149.120
2 221 Barton ACT -35.200 149.100
3 3030 Werribee VIC -12.800 130.960
4 3030 Point Cook VIC -12.800 130.960
Lastly, I would also like to understand how to nicely format the data in the questions.
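A minimal sketch of the cleanup, assuming the columns come in as character or factor (the pattern keeps only digits, minus signs, and decimal points):
pcode2$Postcode  <- as.numeric(gsub("[^0-9.-]", "", as.character(pcode2$Postcode)))
pcode2$Latitude  <- as.numeric(gsub("[^0-9.-]", "", as.character(pcode2$Latitude)))
pcode2$Longitude <- as.numeric(gsub("[^0-9.-]", "", as.character(pcode2$Longitude)))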
I have a little problem with a data frame that has duplicates in a column. I would like to remove the rows where a column contains duplicates. For example, my data frame looks like this:
Value City Card.Type ID
100 Michigan Silver 001
120 Angeles Gold 002
NA Kansas Gold 002
500 Michigan Silver 001
800 Texas Basic 005
You can see that the ID column contains two duplicates, one for 001 and one for 002. I tried the unique function, but it doesn't remove those duplicates. I would like to get something like this:
Value City Card.Type ID
100 Michigan Silver 001
120 Angeles Gold 002
800 Texas Basic 005
Thanks for your help.
The use of which should only be done in its "positive" form. The danger in using the construction -which() is that when none of the rows or items match the test, which() returns integer(0), and indexing with -integer(0) returns 'nothing', when the correct result is 'everything'. Use:
dat[!duplicated(dat), ]
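A quick illustration of the -which() trap, with a hypothetical vector:
x <- c(1, 2, 3)
x[-which(x > 10)]   # numeric(0): everything is silently dropped
x[!(x > 10)]        # 1 2 3: the correct "keep everything"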
In this case there were no fully duplicated rows, but the OP expected some to be removed, so presumably only two or three columns were under consideration. That is easy to accommodate: just run the duplication test on those columns:
dat[!duplicated(dat[, 2:3]), ]
Use the function duplicated.
Something like:
data.subset <- data[!duplicated(data$ID),]
duplicated returns a TRUE/FALSE vector; the second and subsequent occurrences of a value are marked TRUE, so negating it keeps only the first row for each ID.
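Applied to the example above, this keeps the first row for each ID and matches the desired output (assuming the data frame is named data):
data[!duplicated(data$ID), ]
#   Value     City Card.Type  ID
# 1   100 Michigan    Silver 001
# 2   120  Angeles      Gold 002
# 5   800    Texas     Basic 005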
I'm trying to convert from data.frame to data.table, and need some advice on some logical indexing I am trying to do on a single column. Here is a table I have:
library(data.table)
places <- data.table(name=c('Brisbane', 'Sydney', 'Auckland',
'New Zealand', 'Australia'),
search=c('Brisbane AU Australia',
'Sydney AU Australia',
'Auckland NZ New Zealand',
'NZ New Zealand',
'AU Australia'))
# name search
# 1: Brisbane Brisbane AU Australia
# 2: Sydney Sydney AU Australia
# 3: Auckland Auckland NZ New Zealand
# 4: New Zealand NZ New Zealand
# 5: Australia AU Australia
setkey(places, search)
I want to extract rows whose search column matches all words in a list, like so:
words <- c('AU', 'Brisbane')
hits <- places
for (w in words) {
hits <- hits[search %like% w]
}
# I end up with the 'Brisbane AU Australia' row.
I have one question:
Is there a more data.table-way to do this? It seems to me that storing hits each time seems like a data.frame way to do this.
This is subject to the caveat that I eventually want to use agrep rather than grep/%like%:
words <- c('AU', 'Bisbane') # note the mis-spelling
hits <- places
for (w in words) {
hits <- hits[agrep(w, search)]
}
I feel like this doesn't quite take advantage of data.table's capabilities and would appreciate thoughts on how to modify the code so it does.
EDIT
I want the for loop because places is quite large, and I only want to find rows that match all the words. Hence I only need to search in the results for the last word for the next word (that is, successively refine the results).
With the talk of "binary scan" vs "vector scan" in the data.table introduction (i.e. the "bad way" is DT[DT$x == "R" & DT$y == "h"], the "good way" is setkey(DT, x, y); DT[J("R", "h")]), I just wondered if there was some way I could apply this approach here.
Mathematical.coffee, as I mentioned in the comments, you cannot "partial match" by setting a column (or more columns) as key column(s). That is, in the data.table places, you've set the column "search" as the key column. Here, you can fast-subset using data.table's binary search (as opposed to vector-scan subsetting) by doing:
places["Brisbane AU Australia"] # binary search when "search" column is key'd
# is faster compared to:
places[search == "Brisbane AU Australia"] # vector scan
But in your case, you require:
places["AU"]
to give all rows that have a partial match of "AU" within the key column, and this is not possible (though it would certainly be a very interesting feature to have).
If the substring you're searching for does not itself contain mismatches, then you can try splitting the search strings into separate columns. That is, if the column search is split into three columns containing Brisbane, AU and Australia, then you can set the key of the data.table to the columns that contain AU and Brisbane and query the way you mention:
# fast subset, AU and Brisbane are entries of the two key columns
places[J("AU", "Brisbane")]
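A minimal sketch of that split-then-key idea; the regex extracting the two-letter country code is illustrative and assumes exactly one such token per search string:
places[, code := sub(".*\\b([A-Z]{2})\\b.*", "\\1", search)]  # pull out "AU"/"NZ"
setkey(places, code, name)
places[J("AU", "Brisbane")]  # binary search on both key columns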
You can vectorize the agrep function to avoid looping.
Note that the result of agrep2 is a list, hence the unlist call:
words <- c("Bisbane", "NZ")
agrep2 <- Vectorize(agrep, vectorize.args = "pattern")
places[unlist(agrep2(words, search))]
## name search
## 1: Brisbane Brisbane AU Australia
## 2: Auckland Auckland NZ New Zealand
## 3: New Zealand NZ New Zealand
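Note this returns rows matching any of the words. If, as in the original loop, you want only rows matching all the words, intersect the per-word index lists instead (a sketch; with the misspelling above the intersection happens to be empty):
places[Reduce(intersect, agrep2(words, search))]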