Related
I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add leading zero if the number is less than 5 digits. However, some numbers have "HQ" in the end, some don't.(I did check other posts, they dont have similar problem in the "HQ" part)
so the finally desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four numbers in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course if there is 5 numbers, then the regex isn't matched, so it doesn't change.
I was going to use the sprintf approach, but found the the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using #thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")
How to create columns of a data frame from a long character vector that contains three comma seperated values in each row. The first element contains the names of the data frame columns.
Not every row has three columns, some places there is just a trailing comma:
> string.split.cols[1] #This row is the .names
[1] "Acronym,Full form,Remarks"
> string.split.cols[2]
[1] "AC,Actual Cost, "
> string.split.cols[3]
[1] "ACWP,Actual Cost of Work Performed,Old term for AC"
> string.split.cols[4]
[1] "ADM,Arrow Diagramming Method,Rarely used now"
> string.split.cols[5]
[1] "ADR,Alternative Dispute Resolution, "
> string.split.cols[6]
[1] "AE,Apportioned Effort, "
The output should be a df with three columns, I'm only interested in the first two columns and will throw out the third.
This is the original string, some columns are not comma escaped but that isn't a big huge deal.
string.cols <- [1] "Acronym,Full form,Remarks\nAC,Actual Cost, \nACWP,Actual Cost of Work Performed,Old term for AC\nADM,Arrow Diagramming Method,Rarely used now\nADR,Alternative Dispute Resolution, \nAE,Apportioned Effort, \nAOA,Activity-on-Arrow,Rarely used now\nAON,Activity-on-Node, \nARMA,Autoregressive Moving Average, \nBAC,Budget at Completion, \nBARF,Bought-into, Approved, Realistic, Formal,from Rita Mulcahy's PMP Exam Prep\nBCR,Benefit Cost Ratio, \nBCWP,Budgeted Cost of Work Performed,Old term for EV\nBCWS,Budgeted Cost of Work Scheduled,Old term for PV\nCA,Control Account, \nCBR,Cost Benefit Ratio, \nCBT,Computer-Based Test, \n..."
Have you tried the text input for read.csv?
df <- read.csv( text = string.split.cols, header = T )
I found this routine to be very fast for splitting a string and converting to a data frame.
slist<-strsplit(mylist,",")
x<-sapply(slist, FUN= function(x) {x[1]})
y<-sapply(slist, FUN= function(x) {x[2]})
df<-data.frame(Column1Name=x, Column2Name=y, stringsAsFactors = FALSE)
where mylist is your vector of strings to split.
You can use rbind.data.frame to do this, after splitting the string:
x <- do.call(rbind.data.frame, strsplit(split.string.cols[-1], ','))
names(x) <- strsplit(split.string.cols[1], ',')[[1]]
x
## Acronym Full form Remarks
## 1 AC Actual Cost
## 2 ACWP Actual Cost of Work Performed Old term for AC
## ...
As a one-liner:
setNames(do.call(rbind.data.frame,
strsplit(split.string.cols[-1], ',')
),
strsplit(split.string.cols[1], ',')[[1]]
)
I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two integers. I would like the entire list of characters to have at least 3 integers like this(the issue is only in the BB tags if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
Ive looked into the sprintf and FormatC functions but still am having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub can be enough: add one trailing zero for two-digit numbers, two trailing zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
This works, but will have edge case
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using sprintf and gsub function.This step would extract numeric values and change its format.
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
Next step would be to paste back numbers with changed format
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")
I am trying to pull numbers and dates from text using R. Say I have a vector of text strings, V.text. The text strings are sentences that contain numbers and dates. For example:
"listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
I want to extract the number amounts for and dates as separate vector components.
So the output would be two vectors:
1 1500000 160000
2 2/14/2015 3/1/2015
I tried using scan() but couldn't get the desired result. I would appreciate any help
How about:
txt <- "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
lapply(c('[0-9,]{5,}',
'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}'),
function(re) {
matches <- gregexpr(re, txt)
gsub(',', '', regmatches(txt, matches)[[1]])
})
## [[1]]
## [1] "150000" "160000"
## [[2]]
## [1] "2/14/2015" "3/1/2015"
(The first match for numbers assumes 5 digits or more. If you have less, than this simpler regular expression will collide with the year of the date(s).)
First split out the "words". Then the ones with a slash are dates and the ones with only $, digit or comma are numbers. In the latter case strip off the non-digit characters and convert to numeric:
s <- strsplit(x, " ")[[1]]
grep("/", s, value = TRUE) # dates
## [1] "2/14/2015" "3/1/2015"
as.numeric(gsub("\\D", "", grep("^[$0-9,]+$", s, value = TRUE)))
## [1] 150000 160000
If negative numbers or decimal numbers are possible then change the last line of code to:
as.numeric(gsub("[^-0-9.]", "", grep("^-?[$0-9,.]+$", s, value = TRUE)))
Quick-and-dirty approach:
x<-"listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
mydate<-regmatches(x,gregexpr("\\d{1,2}/\\d{1,2}/\\d{4}",x,perl=TRUE))
mynumber<-regmatches(sub(",","",x),gregexpr("\\d{6}",sub(",","",x),perl=TRUE))
You can run the above code in r-fiddle:
I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with gsub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
>
I used sub instead of gsub to assure I only got the first match. .? says any single character and its optional (similar to just . but then it wouldn't match the case without the leading P). The () signify a group that I reference in the replacement '\\1'. If there were multiple sets of () I could reference them too with '\\2'. Inside the group, and you had the syntax correct, I want only numbers and I want exactly 4 of them. The final piece says zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces:
What's going on in the above line is as follows:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"