I want to replace every ",", "-", ")", "(" and space with "." in the variable DMA.NAME in the example data frame below. I referred to three posts and tried their approaches, but all of them failed:
Replacing column values in data frame, not included in list
R replace all particular values in a data frame
Replace characters from a column of a data frame R
Approach 1
> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."
Approach 2
> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)
Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
argument 'pattern' has length > 1 and only the first element will be used
Approach 3
> c[c == c(" ", ",", "(", ")", "-")] <- "."
Sample data frame
> df
DMA.CODE DATE DMA.NAME count
111 22 8/14/2014 12:00:00 AM Columbus, OH 1
112 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
79 18 7/30/2014 12:00:00 AM Boston (Manchester) 1
99 22 8/20/2014 12:00:00 AM Columbus, OH 1
112.1 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
208 27 7/31/2014 12:00:00 AM Minneapolis-St. Paul 1
I know the problem: gsub uses only the first element when pattern has length greater than 1. The other two approaches search for values that exactly equal one of those characters instead of searching within each value for the individual characters.
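For what it's worth, the removetext vector from Approach 2 can be collapsed into a single, length-1 pattern (a minimal sketch; unlike the answers below, consecutive matches are not merged here):
removetext <- c("-", ",", " ", "(", ")")
# One character class built from the vector; "-" comes first, so it is read
# literally rather than as a range.
pattern <- paste0("[", paste(removetext, collapse = ""), "]")  # "[-, ()]"
gsub(pattern, ".", c$DMA.NAME)
# "Columbus, OH" becomes "Columbus..OH" (the comma and the space each match).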
You can use the special classes [:punct:] and [:space:] inside a bracket expression ([...]) like this:
df <- data.frame(
  DMA.NAME = c(
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Boston (Manchester)",
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Minneapolis-St. Paul"),
  stringsAsFactors = FALSE)
##
> gsub("[[:punct:][:space:]]+","\\.",df$DMA.NAME)
[1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
If your data frame is big, you might want to look at this fast function from the stringi package. It replaces every character of a given class with another character. In this case the character class is L, i.e. letters (inside the {}), and the capital P (before the {}) indicates that we want the complement of that set, so every non-letter character is matched. merge = TRUE indicates that consecutive matches should be merged into a single replacement.
require(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}",".", merge=T)
## [1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
And some benchmarks:
x <- sample(df$DMA.NAME, 1000, T)
gsubFun <- function(x){
  gsub("[[:punct:][:space:]]+", "\\.", x)
}
striFun <- function(x){
  stri_replace_all_charclass(x, "\\P{L}", ".", merge = TRUE)
}
require(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))
Unit: microseconds
expr min lq median uq max neval
gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984 100
striFun(x) 877.259 893.3945 907.769 929.8065 3189.017 100
Related
I have a large data set structured like the demo data frame below and need to replace the second and third (but not the first) colons with dashes in the datetime rows. I have tried using various regex constructions with str_replace(), gsub(), and substr() in R, but I cannot figure out how to keep the first colon while replacing the second and third.
# Demo data
df <- data.frame(V1=c(
"case: 1",
"myvar2: 36",
"myvar3: First",
"datetime: 2018-11-29 02:27:16",
"case: 2",
"myvar2: 37",
"myvar3: Second",
"datetime: 2018-11-30 04:33:18",
"case: 3",
"myvar2: 38",
"myvar3: Third",
"datetime: 2018-12-01 15:21:48",
"case: 4",
"myvar2: 39",
"myvar3: Fourth",
"datetime: 2018-12-02 12:27:01"))
df
I'm trying to extend my rudimentary understanding of regex with R and would appreciate guidance on how to solve this problem.
Use str_replace with capture groups, then insert the - between the backreferences of the captured groups:
library(dplyr)
library(stringr)
df %>%
  mutate(V1 = str_replace(V1,
    '(datetime: \\d{4}-\\d{2}-\\d{2} \\d+):(\\d+):', '\\1-\\2-'))
-output
V1
1 case: 1
2 myvar2: 36
3 myvar3: First
4 datetime: 2018-11-29 02-27-16
5 case: 2
6 myvar2: 37
7 myvar3: Second
8 datetime: 2018-11-30 04-33-18
9 case: 3
10 myvar2: 38
11 myvar3: Third
12 datetime: 2018-12-01 15-21-48
13 case: 4
14 myvar2: 39
15 myvar3: Fourth
16 datetime: 2018-12-02 12-27-01
1) Since the colons that are to be changed are always followed by a digit and the colons that are not to be changed are not, we can replace each colon followed by a digit with a dash followed by that digit. No packages are used.
transform(df, V1 = gsub(":(\\d)", "-\\1", V1))
This variant uses a Perl regex with a zero-width lookahead (see ?regex) to check for a digit in the next position without having it replaced by the replacement string.
transform(df, V1 = gsub(":(?=\\d)", "-", V1, perl = TRUE))
2) If a different format for df is preferable, we could use this. Insert a newline before each case to separate the records and then use read.dcf, giving a character matrix. Then convert that to a data.frame and, using type.convert, convert the columns that can be numeric. Finally, operate on the datetime column, replacing colons with dashes. No packages are used.
df |>
  unlist() |>
  gsub(pattern = "^case:", replacement = "\ncase:") |>
  textConnection() |>
  read.dcf() |>
  as.data.frame() |>
  type.convert(as.is = TRUE) |>
  transform(datetime = gsub(":", "-", datetime))
giving:
case myvar2 myvar3 datetime
1 1 36 First 2018-11-29 02-27-16
2 2 37 Second 2018-11-30 04-33-18
3 3 38 Third 2018-12-01 15-21-48
4 4 39 Fourth 2018-12-02 12-27-01
I need to extract data from the text below (this is just a sample):
text <- c(" 9 A 1427107 -",
" 99 (B) 3997915 -",
" 999 (SOCIO) 7161315 -",
" 9999 #M 4035115 -",
" 99999 01 Z 2136481035115 8,621"
)
So far I have tried the following, but I could not create a pattern that works for all the columns:
as.numeric(gsub("([0-9]+).*$", "\\1",text))
I want my data frame output to look like this:
row_names Text ID Amount
9 A 1427107 -
99 (B) 3997915 -
999 (SOCIO) 7161315 -
9999 #M 4035115 -
99999 01 Z 2136481035115 8,621
row_names holds the leading numbers, and "Text" contains numbers and text;
the ID column contains numbers of 7 to 13 digits;
Amount is either a "-" or a number with a thousands separator (,).
We can use read.table to read the data into a data.frame
df1 <- read.table(text = text, header = FALSE, fill = TRUE)
Or using extract
library(tibble)
library(tidyr)
tibble(col1 = trimws(text)) %>%
  extract(col1, into = c('rn', 'Text', 'ID', 'Amount'),
          '^(\\d+)\\s+(.*)\\s+(\\d+)\\s+([-0-9,]+)', convert = TRUE)
In base R, we can use strcapture and provide the pattern and type of data to extract.
strcapture('\\s+(\\d+)\\s(.*?)\\s+(\\d+)\\s(.*)', text,
           proto = list(row_names = integer(), Text = character(),
                        ID = numeric(), Amount = character()))
# row_names Text ID Amount
#1 9 A 1427107 -
#2 99 (B) 3997915 -
#3 999 (SOCIO) 7161315 -
#4 9999 #M 4035115 -
#5 99999 01 Z 2136481035115 8,621
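If Amount should end up numeric as well, the thousands comma can be stripped afterwards (a sketch; Amount_num is just an illustrative column name, and the "-" entries simply become NA):
out <- strcapture('\\s+(\\d+)\\s(.*?)\\s+(\\d+)\\s(.*)', text,
                  proto = list(row_names = integer(), Text = character(),
                               ID = numeric(), Amount = character()))
# "8,621" -> 8621; "-" contains no digits, so it is coerced to NA (with a warning).
out$Amount_num <- as.numeric(gsub(",", "", out$Amount))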
I'm a beginner with R, working with strings.
I've been trying to remove periods from my data, but unfortunately I can't find a solution.
This is the data I'm working with, in a data frame df:
df <- read.table(text = " n mesAno receita
97 1/2009 3.812.819.062,06
98 2/2009 4.039.362.599,36
99 3/2009 3.652.885.587,18
100 4/2009 3.460.247.960,02
101 5/2009 3.465.677.403,12
102 6/2009 3.131.903.622,55
103 7/2009 3.204.983.361,46
104 8/2009 3.811.786.009,24
105 9/2009 3.180.864.095,05
106 10/2009 3.352.535.553,88
107 11/2009 5.214.148.756,95
108 12/2009 4.491.795.201,50
109 1/2010 4.333.557.619,30
110 2/2010 4.808.488.277,86
111 3/2010 4.039.347.179,81
112 4/2010 3.867.676.530,69
113 5/2010 6.356.164.873,94
114 6/2010 3.961.793.391,19
115 7/2010 3797656130.81
116 8/2010 4709949715.37
117 9/2010 4047436592.12
118 10/2010 3923484635.28
119 11/2010 4821729985.03
120 12/2010 5024757038.22",
header = TRUE,
stringsAsFactors = TRUE)
My objective is to convert the receita column to numeric, as it is being stored as a factor. But applying conversion functions like as.numeric(as.character(x)) does not work for rows 97:114 (they are coerced to NAs).
I suppose this is because of the periods separating billions/millions/thousands in this column.
The conversion functions work only on values that look like 3812819062.06, as in rows 115:120.
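A minimal illustration of that failure (the values are stored as factors, as in df):
# Thousands-separator periods plus a decimal comma cannot be parsed directly:
as.numeric(as.character(factor("3.812.819.062,06")))
#> [1] NA            # plus a warning: NAs introduced by coercion

# A value that already uses "." as the decimal mark converts fine:
print(as.numeric(as.character(factor("3797656130.81"))), digits = 12)
#> [1] 3797656130.81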
I tried mutating the dataset, adding another column and modelling it.
I don't really know if what I'm doing is fine, but I also tried extracting the anomalous numbers into a variable and applying sub/gsub on them, without success.
Is there some straightforward way of doing this, that is, to remove the thousands-separator periods and then replace the comma with a '.'?
I'm very confident that the function I need is gsub, but I'm having a hard time finding the correct usage. Any help will be appreciated.
Edit: My approach using dplyr::mutate(). Ugly but works.
library(dplyr)
library(stringr)

df <- df %>%
  mutate(receita_temp = receita) %>%
  mutate(dot_count = str_count(receita, '\\.')) %>%
  mutate(receita_temp = ifelse(dot_count == 3,
                               gsub('\\.', '', as.factor(receita_temp)),
                               gsub('\\,', '.', as.factor(receita_temp))
                               )) %>%
  mutate(receita_temp = ifelse(dot_count == 3,
                               gsub('\\,', '.', as.factor(receita_temp)),
                               receita_temp)) %>%
  select(-c(dot_count, receita)) %>%
  rename(., receita = receita_temp)
I'm using regex and some stringr functions to remove all the periods except those followed by two digits and the end of the string. That way, separator periods like those in 3.811.786.009,24 are removed, but a period marking the start of a decimal, as in 4821729985.03, is not. Using str_remove_all rather than str_remove means I don't have to worry about applying the removal repeatedly or about how well it scales. Then I replace the remaining commas with periods and make the column numeric.
library(tidyverse)
df2 <- df %>%
  mutate(receita = str_remove_all(receita, "\\.(?!\\d{2,}$)") %>%
           str_replace_all(",", ".") %>%
           as.numeric())
print(head(df2), digits = 12)
#> n mesAno receita
#> 1 97 1/2009 3812819062.06
#> 2 98 2/2009 4039362599.36
#> 3 99 3/2009 3652885587.18
#> 4 100 4/2009 3460247960.02
#> 5 101 5/2009 3465677403.12
#> 6 102 6/2009 3131903622.55
Created on 2018-09-04 by the reprex package (v0.2.0).
You can use the following.
First, create a function that will be used for the replacement:
repl = function(x)setNames(c("","."),c(".",","))[x]
This function takes either "." or "," as input and returns "" or "." respectively.
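A quick illustrative check of the helper on its own (the returned values carry names, which does not affect the replacement):
repl(".")   # ""  -- a period maps to the empty string
repl(",")   # "." -- a comma maps to a period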
Now use this function to do the replacement:
stringr::str_replace_all(as.character(df[,3]), "[.](?!\\d+$)|,", repl)
[1] "3812819062.06" "4039362599.36" "3652885587.18" "3460247960.02" "3465677403.12" "3131903622.55"
[7] "3204983361.46" "3811786009.24" "3180864095.05" "3352535553.88" "5214148756.95" "4491795201.50"
[13] "4333557619.30" "4808488277.86" "4039347179.81" "3867676530.69" "6356164873.94" "3961793391.19"
[19] "3797656130.81" "4709949715.37" "4047436592.12" "3923484635.28" "4821729985.03" "5024757038.22"
Of course you can do the rest, i.e. calling as.numeric(), etc.
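For example, to finish the conversion (a sketch; receita_num is just an illustrative column name):
df$receita_num <- as.numeric(
  stringr::str_replace_all(as.character(df[, 3]), "[.](?!\\d+$)|,", repl)
)
print(head(df$receita_num), digits = 12)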
To do this in base R:
sub(',','.',gsub('[.](?!\\d+$)','',as.character(df[,3]),perl=T))
Or, if you know the exact number of . and , in your data, you could do:
a = as.character(df[,3])
regmatches(a,gregexpr('[.](?!\\d+$)|,',df[,3],perl = T)) = list(c("","","","."))
a
df$num <- as.numeric(sapply(as.character(df$receita), function(x)
  gsub("\\,", "\\.", ifelse(grepl("\\,", x), gsub("\\.", "", x), x))))
should do the trick.
First, the function finds the rows containing ",", removes the "." in those rows, and finally converts every remaining "," into ".", so that the result can be converted to numeric without problems.
Use print(df$num, digits = 12) to see your data with 2 decimals.
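If you prefer it spelled out, the same logic step by step might look like this (a sketch in base R; num2 is just an illustrative column name):
vals <- as.character(df$receita)
has_comma <- grepl(",", vals)                       # rows that use "," as the decimal mark
vals[has_comma] <- gsub(".", "", vals[has_comma],
                        fixed = TRUE)               # drop the "." thousands separators there
vals <- gsub(",", ".", vals, fixed = TRUE)          # turn the decimal comma into a period
df$num2 <- as.numeric(vals)                         # now the coercion succeeds for every row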
I get the following style of table inside a for loop at every iteration.
> table
Status Description
1 Date: Monday 19 November 1945
2 Type: Curtiss R5C-1 Commando (C-46)
3 Operator: United States Marine Corps
4 Registration: 39592
5 C/n / msn: 87
6 First flight: 1944
7 Crew: Fatalities: 0 / Occupants:
8 Passengers: Fatalities: 0 / Occupants:
9 Total: Fatalities: 0 / Occupants:
10 Airplane damage: Damaged beyond repair
11 Location: Hishi-no-Shima ( Japan)
12 Phase: Unknown (UNK)
13 Nature: Military
14 Departure airport: ?
15 Destination airport: ?
16 Narrative: Force landed.
17 Probable Cause: <NA>
On every iteration, I wish to append this to the following data frame:
>individual_status
[1] Date Time Type Operator Registration
[6] C_n_msn First_flight Crew Passengers Total
[11] Airplane_damage Location Phase Nature Departure_airport
[16] Destination_airport Narrative Probable_Cause Engines Flightnumber
[21] Total_airframe_hrs Airplane_fate Operating_for Leased_from Cycles
[26] Crash_site_elevation Ground_casualties Operated_by On_behalf_of
<0 rows> (or 0-length row.names)
nrow(table) changes for every record, and the descriptions against each Status change accordingly. All possible values of table$Status are covered by colnames(individual_status).
Can someone please guide me on how to update the individual_status data.frame correctly on every iteration?
How about this:
table$Status <- gsub(":", "", table$Status)
reshapedTable <- data.frame(lapply(table$Description, function(x)
t(data.frame(x))))
names(reshapedTable) <- table$Status
require(plyr)
rbind.fill(reshapedTable, individual_status)
Here I created a minimal example of four columns:
status_codes1 <- c("Date", "Type", "Operator", "Registration")
status_codes2 <- paste(status_codes1, ":", sep = "")
table1 <- data.frame(Status = status_codes2, Description = 1:4, stringsAsFactors = F)
table1
individual_status <- setNames(data.frame(matrix(ncol = 4, nrow = 0)), sample(status_codes1))
table2 <- table1[sample(1:4),]
append_to_is <- function()
{
  table2 <- table1[sample(1:4), ]
  n_row <- nrow(individual_status)
  cols <- gsub(":", "", table2$Status)
  individual_status[n_row + 1, cols] <<- table2$Description
  return(list(table2, individual_status))
}
Note that the line with table2:
table2 <- table1[sample(1:4),]
creates a copy of the original table1 with its rows shuffled. Whatever order the new table arrives in, a regex replacement first deletes the trailing colons ":", and then the main data frame is indexed by those column names and the descriptions are appended as the next row.
The function returns the shuffled table and the updated individual_status. You can adapt the function to your liking.
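A hypothetical usage sketch (note that the function grows individual_status in the global environment via <<-):
res <- append_to_is()
res[[1]]   # the shuffled copy of table1 that was appended
res[[2]]   # individual_status with one more row filled in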
I would like to format my negative numbers in "Accounting" format, i.e. with brackets.
For example, I would like to format -1000000 as (1,000,000).
I know the way of introducing thousands-separator:
prettyNum(-1000000, big.mark=",",scientific=F)
However, I am not sure how to introduce the brackets. I would like to be able to apply the formatting to a whole vector, with only the negative numbers affected. Note that after introducing the thousands separator, the vector of numbers is now a character vector, for example:
"-50,000" "50,000" "-50,000" "-49,979" "-48,778" "-45,279" "-41,321"
Any ideas? Thanks.
You can try my package formattable, which has a built-in function accounting that applies accounting formatting to a numeric vector.
> # devtools::install_github("renkun-ken/formattable")
> library(formattable)
> accounting(c(123456,-23456,-789123456))
[1] 123,456.00 (23,456.00) (789,123,456.00)
You can print the numbers as integers:
> accounting(c(123456,-23456,-789123456), format = "d")
[1] 123,456 (23,456) (789,123,456)
These numbers work with arithmetic calculations:
> money <- accounting(c(123456,-23456,-789123456), format = "d")
> money
[1] 123,456 (23,456) (789,123,456)
> money + 5000
[1] 128,456 (18,456) (789,118,456)
It also works in data.frame printing:
> data.frame(date = as.Date("2015-01-01") + 1:10,
balance = accounting(cumsum(rnorm(10, 0, 100000))))
date balance
1 2015-01-02 (21,929.80)
2 2015-01-03 (246,927.59)
3 2015-01-04 (156,210.85)
4 2015-01-05 (135,122.80)
5 2015-01-06 (199,713.06)
6 2015-01-07 (91,938.03)
7 2015-01-08 (34,600.47)
8 2015-01-09 147,165.57
9 2015-01-10 180,443.31
10 2015-01-11 251,141.04
Another way, without regex:
x <- c(-50000, 50000, -50000, -49979, -48778, -45279, -41321)
x.comma <- prettyNum(abs(x), big.mark=',')
ifelse(x >= 0, x.comma, paste0('(', x.comma, ')'))
# [1] "(50,000)" "50,000" "(50,000)" "(49,979)" "(48,778)" "(45,279)" "(41,321)"
Here's an approach that gives the leading spaces:
x <- c(-10000000, -4444, 1, 333)
num <- gsub("^\\s+|\\s+$", "", prettyNum(abs(x), big.mark = ",", scientific = F))
num[x < 0] <- sprintf("(%s)", num[x < 0])
sprintf(paste0("%0", max(nchar(as.character(num))), "s"), num)
## [1] "(10,000,000)" " (4,444)" " 1" " 333"
A very easy approach is using paste0 and sub. Here's a simple function for this:
my.format <- function(num){
  ind <- grepl("-", num)
  num[ind] <- paste0("(", sub("-", "", num[ind]), ")")
  num
}
> num <- c("-50,000", "50,000", "-50,000", "-49,979", "-48,778", "-45,279", "-41,321")
> my.format(num)
[1] "(50,000)" "50,000" "(50,000)" "(49,979)" "(48,778)" "(45,279)" "(41,321)"
If you want to reverse the situation, let's say, you have a vector like this:
num2 <- my.format(num)
and you want to turn the parenthesized values back into negative numbers, i.e. replace the ( ) with a leading -, then try:
sub(")", "", sub("\\(", "-", num2))
Maybe this?:
a <- prettyNum(-1000000, big.mark = ",", scientific = F)
paste("(", sub("-", "", a), ")", sep = "")