I have a data frame where some values for "revenue" are listed in the hundreds, say "300," and others are listed as "1.5k." Obviously this is annoying, so I need to find some way of splitting the "k" and "." characters from those values and only those values. Any thoughts?
Another way to do this is just with Regex (and tidyverse for pipes)
library(tidyverse)
string <- c("300", "1.5k")
string %>% ifelse(
# check if string ends in k (upper/lower case)
grepl("[kK]$", .),
# if string ends in k, remove it and multiply by 1000
1000 * as.numeric(gsub("[kK]$", "", .)),
.) %>% as.numeric()
[1] 300 1500
You could create a function that remove "k", change to a numeric vector and multiple by 1,000.
to_1000 <- function(x){
x %>%
str_remove("k") %>%
as.numeric() %>%
{.*1000}
}
x <- c("3000","1.5k")
tibble(x) %>%
mutate(x_num = if_else(str_detect(x,"k"),to_1000(x),as.numeric(x)))
# A tibble: 2 x 2
x x_num
<chr> <dbl>
1 3000 3000
2 1.5k 1500
Related
I have a data frame like this
df <- data.frame(Income = c("$100to$200","under$100","above$1000"))
I would like this as output
df_final <- data.frame(Avg = c(150,100,1000))
I would like to extract the numeric value from the income column, if there are two numbers, take the average, if there is only one number, take that number.
A few key steps here. First we need to clean our data, in this case getting rid of the $ makes thing easier. Then we'll split into a From and To column. Finally we need to convert to numeric and calculate the row means.
library(tidyverse)
df %>%
mutate(Income = gsub("$", "", Income, fixed = TRUE)) %>%
separate(Income, "to|under|above", into = c("From", "To")) %>%
mutate_all(.,as.numeric) %>%
mutate(Avg = rowMeans(.,na.rm =TRUE))
From To Avg
1 100 200 150
2 NA 100 100
3 NA 1000 1000
You could try:
library(dplyr)
library(stringr)
df %>%
mutate(across(Income, ~ sapply(str_extract_all(.x, '\\d+'), \(x) {strsplit(x, ',') |>
as.numeric() |> mean()})))
Income
1 150
2 100
3 1000
A stringr approach using gsub to get the numerics, str_squish to remove the white space and str_split to get the entries in case of more then one value.
library(stringr)
data.frame(Avg = sapply(
str_split(str_squish(gsub("[[:alpha:]$]", " ", df$Income)), " "), function(x)
sum(as.numeric(x)) / length(x)))
Avg
1 150
2 100
3 1000
df %>%
transmute(
Avg = stringr::str_extract_all(Income, "(?<=\\$)\\d+") %>%
lapply(as.numeric) %>%
sapply(mean)
)
Avg
1 150
2 100
3 1000
Have a scenario where I have a lengthy (12 digit) index value being read into r as a double. I need to concact this with some other identifiers, but mutate(x = as.character(x)) converts to scientific format:
index <- c(123000789000, 123456000000, 123000000012)
concact_val <- c("C", "A", "B")
df <-
bind_cols(
as_tibble(index),
as_tibble(concact_val)
)
df %>%
mutate(index = as.character(index))
This outputs:
index concact_val
1.23e11 C
1.23e11 A
1.23e11 B
Whereas ideally I'd like to be able to do this:
df %>%
mutate(index = as.character(index),
index = paste0(concact_val, index)) %>%
select(-concact_val)
to output:
index
C123000789000
A123456000000
B123000000012
Is there a way around this? In this example, I created a vector for the index, but in the frame I'm reading in it's being read as a double via an API (unfortunately, I can't change the col type prior to reading in, it's being read differently than read_csv).
Use sprintf:
df %>%
mutate(result = sprintf("%s%0.0f", concact_val, index))
# # A tibble: 3 x 3
# index concact_val result
# <dbl> <chr> <chr>
# 1 123000789000 C C123000789000
# 2 123456000000 A A123456000000
# 3 123000000012 B B123000000012
If there is the chance that some index have fractional components, this will round them silently. If that's a concern (and you don't want to round), you can instead use floor(index) inside the sprintf.
We may use as.bigz from gmp
paste0(concact_val, gmp::as.bigz(index))
[1] "C123000789000" "A123456000000" "B123000000012"
Or another option is to specify the scipen in options to avoid converting to scientific format
options(scipen = 999)
In addition to sprintf and gmp solutions, we may try another option like below as a programming practice
f <- function(x) {
res <- c()
while (x) {
res <- append(res, x %% 10)
x <- x %/% 10
}
paste0(rev(res), collapse = "")
}
paste0(concact_val, Vectorize(f)(index))
# [1] "C123000789000" "A123456000000" "B123000000012"
I'm trying to extract dates from a Notes column using tidyr's extract function. The data I'm working on looks like this:
dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
End_date = NA,
Start_Date = NA )
I tried extracting the date following the string "S/DATE" like this:
extract <- extract(
dates,
col = "Notes",
into = "Start_date",
regex = "(?<=(S\\/DATE:)).*" # Using regex lookahead
)
However, this only extracts the string "S/DATE:", not the date after it. When I tried this on regex101.com, it works as expected.
Thanks. Ibrahim
You could use sub here for a base R option:
s_date <- ifelse(grepl("S/DATE", dates$Notes),
sub("^.*\\bS/DATE: (\\S+).*$", "\\1", dates$Notes), NA)
s_date
[1] NA "28/08/19" NA
Note that the call to grepl above is needed here, because sub by default will return the entire input string (in this case the full Notes) in the event that S/DATE be not found in the text.
One method can be like this one also. (Assuming that you need either of S/DATE or START as your expected new column name is Start_date). If however all such values aren't required you may easily modify this syntax.
Explanation -
In the innermost expr Notes column has been splitted into list by either of these separators : or \n.
In this list, blanks are removed then
In the modified list item next to Start or S/Date is extracted using sapply which simplifies the list into a vector (if possible)
lastly lubridate::dmy is used in outermost expr.
sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])
[1] "09/01/2019" "28/08/19" "04/12/2018"
If you'll wrap the above in lubridate::dmy dates will be correctly formatted too
dmy(sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))]))
[1] "2019-01-09" "2019-08-28" "2018-12-04"
Further, this can be passed into dplyr pipes, so as to simultaneously create a new column in your dates
dates %>% mutate(Start_Date = dmy(sapply(strsplit(Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])))
col1 Notes End_date Start_Date
1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
I would combine stringr and lubridate:
dates %>%
mutate(
Start_Date =
sub("\ns/date:", "\nstart:", tolower(Notes)) %>%
str_remove_all("(.*\nstart:)|(\n.*)") %>%
trimws() %>%
lubridate::dmy()
)
# col1 Notes End_date Start_Date
# 1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
# 2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
# 3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
The answer is not as concise, but I find it intuitive and easy to follow the steps.
First I substitute one start-pattern with another (sub), where I use tolower to make all lower caps. Then I remove everything before the start date, and everything after the line change str_remove_all. Finally I trim whitespace (trimws) and turn into a date (lubridate::dmy).
Another approach is splitting the text and dealing with smaller chunks.
Step by step illustration, with one row of data
# Split the text on newlines, yielding dates with labels
dates$Notes %>% head(1) %>% strsplit("\n")
[[1]]
[1] "DOB: 12/10/62" "START: 09/01/2019" "END: 09/01/2020"
Drilling down to the next level
# Split each name/value pair on colons
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*")
[[1]]
[1] "DOB" "12/10/62"
[[2]]
[1] "START" "09/01/2019"
[[3]]
[1] "END" "09/01/2020"
Extract the individual values
# extract a vector of name labels
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1])
[1] "DOB" "START" "END"
# extract a vector of associated values
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
[1] "12/10/62" "09/01/2019" "09/01/2020"
With some clever dplyr usage, you'll get a data frame
dates %>%
group_by(col1) %>%
# summarize can collapse many rows into one or expand one into many
summarize(
name = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1]),
value = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
) %>%
ungroup()
Result, all of the values separated and ready for further processing
# A tibble: 8 x 3
col1 name value
<chr> <chr> <chr>
1 customer DOB 12/10/62
2 customer START 09/01/2019
3 customer END 09/01/2020
4 customer2 NA NA
5 customer2 S/DATE 28/08/19
6 customer2 R/DATE 27/08/20
7 customer3 DOB 13/01/1980
8 customer3 Start 04/12/2018
I have data like this, below are the 3 rows from my data set:
total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;
Expected output as below,
total free used shared buffers cached
7871MB 5711MB 2159MB 0MB 304MB 1059MB
5751MB 71MB 5MB 3159MB 30MB 1059MB
5751MB 109MB 5MB 3159MB 30MB 1059MB
and the problem here is I want to make different columns using above data like total value, free value, used value, shared value.
I can do that by splitting using ; but in other rows values are getting shuffled, like first value coming as free then total followed by other values,
Is there any way using REGEX in , if we find total get value till ; and put into one column, if we find free get value till ; and put into another column?
Here is one possibility using strsplit.
df <- as.data.frame(matrix(unlist(lapply(strsplit(x, ";"), strsplit, "=")), nrow = 2))
colnames(df) = df[1,]
df = df[-1,]
df
# total free used shared buffers cached
# 2 7871MB 5711MB 2159MB 0MB 304MB 1059MB
Edit
I don't know how your data are structured. But you can do something like the following:
x <- "total=7871MB;free=5711MB;used=2159MB;shared=0MB; buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"
x %>% str_split("\n") %>% unlist() %>% as_tibble() %>%
mutate(total = str_extract(value, "total=(.*?)MB;"),
free = str_extract(value, "free=(.*?)MB;"),
used = str_extract(value, "used=(.*?)MB;"),
shared = str_extract(value, "shared=(.*?)MB;"),
buffers = str_extract(value, "buffers=(.*?)MB;"),
cached = str_extract(value, "cached=(.*?)MB;")) %>%
select(-value) %>%
mutate_all(~as.numeric(str_extract(.,"[[:digit:]]+")))
# # A tibble: 3 x 6
# total free used shared buffers cached
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 7871. 5711. 2159. 0. 304. 1059.
# 2 5751. 71. 5. 3159. 30. 1059.
# 3 5751. 109. 5. 3159. 30. 1059.
We can try using strsplit followed by sub to separate the data from the labels. Then, create a data frame using this data:
x <- 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;'
y <- unlist(strsplit(x, ';'))
names <- sapply(y, function(x) gsub("=.*$", "", x))
data <- sapply(y, function(x) gsub(".*=", "", x, perl=TRUE))
df <- data.frame(names=names, data=data)
df
Demo
In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to read in text (like the lion_tiger" example above).
We can split the 'pair' column by _, then sort and paste it back, use it in a group by function to get the sum
tapply(as.numeric(as.character(df$value)),
sapply(strsplit(as.character(df$pair), '_'), function(x)
paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
Using tidyverse and purrrlyr
df <- data.frame(name=c("lion_tiger","elephant_lion",
"tiger_lion"),value=c(1,2,3),stringsAsFactors=FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
by_row(.collate = "rows",
..f = function(this_row) {
paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
}) %>%
rename(sorted = ".out") %>%
group_by(sorted) %>%
summarize(sum(value))%>%show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4