I have data like this:
abc <- data.frame( a = c("[100-150)", "[150, 200)"))
I want to alter it to make it like this:
abc <- data.frame(a = c("100-149", "150-199"))
I know how to replace the brackets:
abc$a <- lapply(abc$a, gsub, pattern = "[", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = "]", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = ")", replacement = "", fixed = TRUE)
The problem is subtracting 1 from the upper bound of each interval.
Is there a way to do this?
Please note this is just an example; in reality my data has a column like this with about 2000 rows.
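One option uses gsubfn. We match the number (\\d+) that follows the - or the , (with an optional space), convert it to numeric, subtract 1, and paste it back after a -. The brackets are left in place, so they still need to be removed with the gsub calls from the question.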
library(gsubfn)
gsubfn("[-,] ?(\\d+)", ~ paste0("-", as.numeric(x) - 1) , as.character(abc$a))
#[1] "[100-149)" "[150-199)"
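For comparison, here is a minimal base R sketch that does the same thing without gsubfn and drops the brackets in the same pass. It assumes every value contains exactly two integers (lower bound first, upper bound second); the name bounds is only illustrative.

```r
abc <- data.frame(a = c("[100-150)", "[150, 200)"), stringsAsFactors = FALSE)

# Pull out every run of digits; each element of `bounds` holds two strings
bounds <- regmatches(abc$a, gregexpr("[0-9]+", abc$a))

# Rebuild "lower-(upper - 1)" for each row
abc$a <- vapply(
  bounds,
  function(b) paste0(b[1], "-", as.numeric(b[2]) - 1),
  character(1)
)

abc$a
```

Because it works on the whole column at once, the same code should apply unchanged to a column with about 2000 rows.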
I have a set of data frames with a large number of columns. The column names follow a pattern like so:
my.df <- data.frame(sentiment_brand1_1 = c(1,0,0,1), sentiment_brand1_2 = c(0,1,1,0),
sentiment_brand2_1 = c(1,1,1,1),
sentiment_brand2_2 = c(0,0,0,0),
brand1_rating_1 = c(1,2,3,4),
brand2_rating_1 = c(4,3,2,1))
I'd like to programmatically rename the columns, moving the substrings "brand1" and "brand2" from the middle of the column name to the end, e.g.:
desired_colnames <- c("sentiment_1_brand1",
"sentiment_2_brand1",
"sentiment_1_brand2",
"sentiment_2_brand2",
"rating_1_brand1",
"rating_1_brand2")
Capture the substrings as groups and rearrange them in the replacement:
sub("(.*)_(brand1)(.*)", "\\1\\3_\\2", v1)
-output
[1] "variable_1_brand1" "_stuff_1_brand1" "thing_brand1"
data
v1 <- c("variable_brand1_1", "_brand1_stuff_1", "_brand1thing")
## Input:
Test <- c("variable_brand1_1", "_brand1_stuff_1", "_brand1thing")
library("stringr")
paste(str_remove(Test, "_brand1"), "_brand1", sep = "")
## OutPut:
[1] "variable_1_brand1" "_stuff_1_brand1" "thing_brand1"
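Both answers hard-code brand1. A possible generalisation to the column names in the question is sketched below; it assumes every brand token has the form "brand" followed by digits and is always followed by an underscore, which holds for the example but may not hold for the real data.

```r
my.df <- data.frame(sentiment_brand1_1 = c(1, 0, 0, 1),
                    sentiment_brand1_2 = c(0, 1, 1, 0),
                    sentiment_brand2_1 = c(1, 1, 1, 1),
                    sentiment_brand2_2 = c(0, 0, 0, 0),
                    brand1_rating_1 = c(1, 2, 3, 4),
                    brand2_rating_1 = c(4, 3, 2, 1))

# \\1 = text before the brand token, \\2 = the brand token, \\3 = text after it.
# The lazy (.*?) needs perl = TRUE.
names(my.df) <- sub("^(.*?)(brand\\d+)_(.*)$", "\\1\\3_\\2",
                    names(my.df), perl = TRUE)

names(my.df)
```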
So I have this line of code in a file:
{"id":53680,"title":"daytona1-usa"}
But when I try to open it in R using this:
df <- read.csv("file1.txt", strip.white = TRUE, sep = ":")
It produces columns like this:
Col1: X53680.title
Col2: daytona1.usa.url
What I want to do is open the file so that the columns are like this:
Col1: 53680
Col2: daytona1-usa
How can I do this in R?
Edit: The actual file I'm reading in is this:
{"id":53203,"title":"bbc-moment","url":"https:\/\/wow.bbc.com\/bbc-ids\/live\/enus\/211\/53203","type":"audio\/mpeg"},{"id":53204,"title":"shg-moment","url":"https:\/\/wow.shg.com\/shg-ids\/live\/enus\/212\/53204","type":"audio\/mpeg"},{"id":53205,"title":"was-zone","url":"https:\/\/wow.was.com\/was-ids\/live\/enus\/213\/53205","type":"audio\/mpeg"},{"id":53206,"title":"xx1-zone","url":"https:\/\/wow.xx1.com\/xx1-ids\/live\/enus\/214\/53206","type":"audio\/mpeg"},], WH.ge('zonemusicdiv-zonemusic'), {loop: true});
After reading it in, I remove the first column and then every 3rd and 4th column with this:
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th columns
i1 <- rep(seq(3, ncol(df), 4) , each = 2) + 0:1
df <- df[,-i1]
Thank you.
Edit 2:
Adding this fixed it:
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)
If the file contains a single JSON object, then
jsonlite::read_json("file1.txt")
# $id
# [1] 53680
# $title
# [1] "daytona1-usa"
If it is instead NDJSON (newline-delimited JSON), then
jsonlite::stream_in(file("file1.txt"), verbose = FALSE)
# id title
# 1 53680 daytona1-usa
The answers above would be correct for properly formatted JSON, but they don't work for the data I have, so this is what I ended up going with:
df <- read.csv("file1.txt", header = FALSE, sep = ":", dec = "-")
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th columns
i1 <- rep(seq(3, ncol(df), 4) , each = 2) + 0:1
df <- df[,-i1]
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)
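As an alternative to splitting on ":" and deleting columns, the two fields can be pulled straight out of the raw text with regular expressions. This is only a sketch in base R: txt is a shortened copy of the line from the question (in practice it could come from readLines("file1.txt")), and it assumes every record has both an "id" and a "title" so the two vectors line up.

```r
# Shortened sample of the malformed line; the trailing JavaScript is kept on purpose
txt <- '{"id":53203,"title":"bbc-moment","url":"https:\\/\\/wow.bbc.com\\/bbc-ids\\/live\\/enus\\/211\\/53203","type":"audio\\/mpeg"},{"id":53204,"title":"shg-moment","url":"https:\\/\\/wow.shg.com\\/shg-ids\\/live\\/enus\\/212\\/53204","type":"audio\\/mpeg"},], WH.ge(\'zonemusicdiv-zonemusic\'), {loop: true});'

# Digits that directly follow "id":
ids <- regmatches(txt, gregexpr('(?<="id":)\\d+', txt, perl = TRUE))[[1]]

# Everything up to the closing quote after "title":"
titles <- regmatches(txt, gregexpr('(?<="title":")[^"]+', txt, perl = TRUE))[[1]]

df <- data.frame(id = as.integer(ids), title = titles, stringsAsFactors = FALSE)
df
```

The lookbehinds keep the numbers inside the URLs from being picked up as ids, and the hyphens in the titles are left untouched.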
I am trying to replace the last three zeros with K in a column of a data frame.
For example:
data <- data.frame(abc = c(1000, 100000, 450000))
Here 1000 should become 1K and 100000 should become 100K.
A plain gsub replaces the first three zeros instead of the last three.
I tried this:
lapply(data$abc, gsub, pattern = "000", replacement = "K", fixed = TRUE)
Also, how can I make it work on an interval like this:
data <- data.frame(abc = c("150000-250000", "100000-150000", "250000K+"))
An option is to use %/% with 1000 and paste the "K"
library(dplyr)
library(stringr)
data %>%
  mutate(abc = str_c(abc %/% 1000, "K"))
Or using sub, match the 3 zeros at the end ($) of the string and replace with "K"
options(scipen = 999)
sub("0{3}$", "K", data$abc)
#[1] "1K" "100K" "450K"
If the string contains an interval, change the pattern to match 3 zeros either at the end ($) or before a - and replace them with "K"
gsub("0{3}(?=-|$)", "K", "150000-250000", perl = TRUE)
#[1] "150K-250K"
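Here is a slight modification of your code. format turns off scientific notation (it also pads the values to a common width, hence the leading space in " 1K" below). Calling gsub on the whole column, rather than through lapply, returns a character vector. 000$ matches only the zeros at the end of the string.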
data <- data.frame(abc = c(1000, 100000, 450000))
data$abc <- format(data$abc, scientific = FALSE)
gsub(pattern = "000$", replacement = "K", data$abc)
# [1] " 1K" "100K" "450K"
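The two cases can be combined in one short base R sketch. It assumes the interval values are already character strings of the form "lower-upper"; trim = TRUE is used so that format does not pad with leading spaces. The names as_text, k_single, intervals and k_interval are only illustrative.

```r
data <- data.frame(abc = c(1000, 100000, 450000))

# Numeric column: convert to text without scientific notation or padding
as_text <- format(data$abc, scientific = FALSE, trim = TRUE)
k_single <- sub("0{3}$", "K", as_text)

# Interval strings: replace three zeros that sit right before "-" or the end
intervals <- c("150000-250000", "100000-150000")
k_interval <- gsub("0{3}(?=-|$)", "K", intervals, perl = TRUE)

k_single
k_interval
```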
I am working on a dataset that has a column, account_age. In this column, the age is stored as a character string in the format "1YRS 5MON".
How can I convert this to months?
We can match 'YRS' and 'MON' with gsubfn, replace them with the corresponding arithmetic ("*12 +" and "*1"), and evaluate the resulting expression
library(gsubfn)
unname(sapply(gsubfn("[A-Z]+", list(YRS = "*12 +", MON = "*1"),
df1$col1), function(x) eval(parse(text = x))))
#[1] 17
Another option is to extract the digits and take the sum of products
library(tidyverse)
map_dbl(str_extract_all(df1$col1, "\\d+"), ~ as.numeric(.x) %*% c(12, 1))
#[1] 17
Or we can remove the letters, read the result with read.table, and get the sum of products
as.matrix(read.table(text = gsub("[A-Z]+", "", df1$col1),
header = FALSE) )%*% c(12, 1)
data
df1 <- data.frame(col1 = "1YRS 5MON", stringsAsFactors = FALSE)
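The options above expect both a YRS and a MON part in every value. If some rows might hold only one of them (an assumption, since the question shows a single format), a base R sketch along these lines treats the missing part as 0. The helpers extract_unit and to_months are hypothetical names, not part of any package.

```r
# Return the number in front of `unit` for each element of x, or 0 if absent
extract_unit <- function(x, unit) {
  m <- regmatches(x, regexec(paste0("(\\d+)\\s*", unit), x))
  vapply(m, function(g) if (length(g) == 2) as.numeric(g[2]) else 0, numeric(1))
}

# Total months = years * 12 + months
to_months <- function(x) {
  extract_unit(x, "YRS") * 12 + extract_unit(x, "MON")
}

to_months(c("1YRS 5MON", "2YRS", "7MON"))
```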
I'm working on a data frame that has non-detects (with different decimal separators), missing and measured values.
I want to replace the non detects with half of the value after the less sign (<1 becomes 1/2=0.5).
This is what the imported data frame looks like:
df = data.frame(value=c("NA", "1.2", "<1.0", "<6,6"))
1) I convert the factor columns to character:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
2) I replace every "," with ".":
pattern = ","
grep(pattern, df, value = TRUE)
df <- data.frame(lapply(df, function(x) {gsub(pattern=pattern, replacement=".", x, perl = TRUE)}))
3) I can find all non-detects and replace each one with the value after the less-than sign:
pattern = "(^<)(\\d+)"
grep(pattern, df, value = TRUE)
df <- data.frame(lapply(df, function(x) {gsub(pattern=pattern, replacement="\\2", x, perl = TRUE)}))
I can't work out how to apply a math operation to the matched string in the replacement, something like:
replacement = as.character((as.numeric("\\2"))/2)
You can use the following code in Step 2:
df$value = gsub(",", ".", df$value, fixed = TRUE)
It will replace literal commas with literal dots in the value column.
Then, you may use the gsubfn package to match and manipulate substrings matched with regex:
> library(gsubfn)
> df$value = gsubfn("^<(\\d*\\.?\\d+)", ~ as.numeric(x)/2, df$value)
> df
value
1 NA
2 1.2
3 0.5
4 3.3
Here, ^<(\\d*\\.?\\d+) matches < at the start of the string, and \\d*\\.?\\d+ matches any float or integer value and captures it into Group 1. The callback function then converts that captured value to numeric and divides it by 2.
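If installing gsubfn is not an option, the same result can be reached in base R by doing the arithmetic outside the regex. This is a sketch, not the answer's method: it stores the result in a new numeric column, value_num (a name chosen here for illustration), and assumes every non-detect starts with "<".

```r
df <- data.frame(value = c("NA", "1.2", "<1.0", "<6,6"), stringsAsFactors = FALSE)

# Normalise the decimal separator
df$value <- gsub(",", ".", df$value, fixed = TRUE)

# Flag non-detects, strip the "<", and convert to numeric ("NA" becomes NA)
is_nd <- grepl("^<", df$value)
num <- as.numeric(sub("^<", "", df$value))

# Halve only the non-detects
df$value_num <- ifelse(is_nd, num / 2, num)
df
```

Keeping the result numeric avoids a second conversion later, and the is_nd flag can be kept as a column if the analysis needs to know which values were censored.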