Beginner Question: How do you remove a date from a column? (R)

I want to remove the date part from the first column, but I can't do it for the whole dataset.
Can someone advise, please?
Example of dataset:

You can use the sub() function, replacing ^[^[:alpha:]]+ (a regular expression matching all non-alphabetic characters at the beginning of the string) with "", i.e. the empty string.
sub("^[^[:alpha:]]+", "", data)
Example
data <- data.frame(
  good_column = 1:4,
  bad_column = c("13/1/2000pasta", "14/01/2000flour", "15/1/2000aluminium foil", "15/1/2000soap")
)
data
#> good_column bad_column
#> 1 1 13/1/2000pasta
#> 2 2 14/01/2000flour
#> 3 3 15/1/2000aluminium foil
#> 4 4 15/1/2000soap
data$bad_column <- sub("^[^[:alpha:]]+", "", data$bad_column)
data
#> good_column bad_column
#> 1 1 pasta
#> 2 2 flour
#> 3 3 aluminium foil
#> 4 4 soap
Created on 2020-07-29 by the reprex package (v0.3.0)
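If you prefer the tidyverse, stringr's str_remove() with the same pattern does the same job. A minimal sketch, assuming the same data frame as above:
library(stringr)
# Remove the leading run of non-alphabetic characters (the date) from each value
data$bad_column <- str_remove(data$bad_column, "^[^[:alpha:]]+")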

Related

Aggregate character string into vector in R

I have a data table test:
id  key
 1  2365
 1  2365
 1  3709
 2  6734
 2  1908
 2  4523
I want to aggregate the unique key values by id into a vector using the data.table package.
Expected output:
id  key_array
 1  "2365", "3709"
 2  "6734", "1908", "4523"
So, this should work like the array_agg SQL function.
I tried:
res <- test[, list(key_array = paste(unique(key), collapse = ", ")), by = "id"]
but I get just a string. I need to be able to find the length of each vector and work with its individual elements (for example, to find the intersection of two vectors).
1. Base R
This is an aggregate() one-liner.
x <- 'id key
1 2365
1 2365
1 3709
2 6734
2 1908
2 4523'
test <- read.table(textConnection(x), header = TRUE)
aggregate(key ~ id, test, \(x) c(unique(x)))
#> id key
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)
But if user @Chris's comment is right, then the correct solution is as follows.
aggregate(key ~ id, test, \(x) paste(unique(x), collapse = ", "))
Note that both c(unique(x)) and as.character(c(unique(x))) produce a list column, so if a plain character column is wanted, the latter solution with paste() is the right one anyway.
2. Package data.table
Once again a one-liner.
The output is a list column, with each list member a character vector. To keep the keys as integers, use
list(unique(key))
instead.
suppressPackageStartupMessages(library(data.table))
res <- setDT(test)[, .(key_array = list(as.character(unique(key)))), by = id]
res
#> id key_array
#> 1: 1 2365,3709
#> 2: 2 6734,1908,4523
str(res)
#> Classes 'data.table' and 'data.frame': 2 obs. of 2 variables:
#> $ id : int 1 2
#> $ key_array:List of 2
#> ..$ : chr "2365" "3709"
#> ..$ : chr "6734" "1908" "4523"
#> - attr(*, ".internal.selfref")=<externalptr>
Created on 2022-06-14 by the reprex package (v2.0.1)
Then, to access the vectors, use two extractors: one to extract the column and one to extract each vector.
res$key_array[[1]]
#> [1] "2365" "3709"
res$key_array[[2]]
#> [1] "6734" "1908" "4523"
Created on 2022-06-14 by the reprex package (v2.0.1)
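For completeness, a sketch of the integer-keeping variant mentioned above, using the same test data:
# Same aggregation, but the list members stay integer vectors
res_int <- setDT(test)[, .(key_array = list(unique(key))), by = id]
res_int$key_array[[1]]
#> [1] 2365 3709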
3. dplyr solution
Group by id and collapse the unique keys into a single string.
suppressPackageStartupMessages(library(dplyr))
test %>%
  group_by(id) %>%
  summarise(key_array = paste(unique(key), collapse = ", "))
#> # A tibble: 2 × 2
#> id key_array
#> <int> <chr>
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)
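If, as in the question, actual vectors are needed rather than one collapsed string, summarise() can also return a list column. A minimal sketch with the same data:
test %>%
  group_by(id) %>%
  summarise(key_array = list(unique(key)))
# key_array is now a list column; its first element is the vector 2365, 3709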

Create a date-of-"X" column when I have an age-in-days-at-"X" column and a birth date column in R

I'm having some trouble finding out how to do a specific thing in R.
In my dataset, I have a column with the date of birth of participants. I also have a column giving me the age in days at which a disease was diagnosed.
What I want to do is create a new column showing the date of diagnosis. I'm guessing it's a pretty easy thing to do since I have all the information needed (basically, birth date + X days = date of diagnosis), but I'm unable to figure out how to do it.
All of my searches give me information on the opposite, going from date to age. So if you're able to help me, it would be much appreciated!
library(tidyverse)
library(lubridate)
df <- tibble(
  birth = sample(seq("1950-01-01" %>% as.Date(), today(), by = "day"),
                 10, replace = TRUE),
  age = sample(3650:15000, 10, replace = TRUE)
)
df %>%
  mutate(diagnosis_date = birth %m+% days(age))
#> # A tibble: 10 x 3
#> birth age diagnosis_date
#> <date> <int> <date>
#> 1 1955-01-16 6684 1973-05-05
#> 2 1958-11-03 6322 1976-02-24
#> 3 2007-02-23 4312 2018-12-14
#> 4 2002-07-11 8681 2026-04-17
#> 5 2021-12-28 11892 2054-07-20
#> 6 2017-07-31 3872 2028-03-07
#> 7 1995-06-30 14549 2035-04-30
#> 8 1955-09-02 12633 1990-04-04
#> 9 1958-10-10 4534 1971-03-10
#> 10 1980-12-05 6893 1999-10-20
Created on 2022-06-30 by the reprex package (v2.0.1)
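Since R stores Date values as a number of days, plain addition works as well; a minimal base R sketch, assuming age is a whole number of days:
# Adding an integer to a Date shifts it by that many days
df$diagnosis_date <- df$birth + df$age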

How to remove specific words in a column

I have a column consisting of several Country Offices associated with a company, which I would like to shorten, e.g. "China Country Office" and "Bangladesh Country Office" to just "China" or "Bangladesh". In other words, I want to remove the words "Office" and "Country" from the column called Imp_Office.
I tried using the tm package, with reference to an earlier post, but nothing happened. This is what I wrote:
library(tm)
stopwords = c("Office", "Country","Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|","",
MY_df$Imp_Office))
Where I got the following error message:
Error in gsub(paste0(stopwords, collapse = "|", "", MY_df$Imp_Office)) :
  argument "x" is missing, with no default
I also tried using the function readLines:
stopwords = readLines("Office", "Country","Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|","",
MY_df$Imp_Office))
But this didn't help either.
I have considered the possibility of using some other string manipulation method, but I don't need to detect, replace or remove whitespace, so I am kind of lost here.
Thank you.
First, let's set up a dataframe with a column like what you describe:
library(tidyverse)
df <- data_frame(Imp_Office = c("China Country Office",
                                "Bangladesh Country Office",
                                "China",
                                "Bangladesh"))
df
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China Country Office
#> 2 Bangladesh Country Office
#> 3 China
#> 4 Bangladesh
Then we can use str_remove_all() from the stringr package to remove any bits of text that you don't want from them.
df %>%
  mutate(Imp_Office = str_remove_all(Imp_Office, " Country| Office"))
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China
#> 2 Bangladesh
#> 3 China
#> 4 Bangladesh
Created on 2018-04-24 by the reprex package (v0.2.0).
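For reference, the original gsub() call only fails because the replacement and the x argument ended up inside paste0(). With the arguments separated it works as intended; a sketch reusing the stopwords from the question on the example df, with an optional leading space in the pattern so no double spaces are left behind:
stopwords <- c("Office", "Country", "Regional")
# Build a single pattern such as " ?(Office|Country|Regional)"
pattern <- paste0(" ?(", paste(stopwords, collapse = "|"), ")")
df$Imp_Office <- gsub(pattern, "", df$Imp_Office)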

Tableau LOD R Equivalent

I'm using a Tableau Fixed LOD function in a report, and was looking for ways to mimic this functionality in R.
Data set looks like:
Soldto<-c("123456","122456","123456","122456","124560","125560")
Shipto<-c("123456","122555","122456","124560","122560","122456")
IssueDate<-as.Date(c("2017-01-01","2017-01-02","2017-01-01","2017-01-02","2017-01-01","2017-01-01"))
Method<-c("Ground","Ground","Ground","Air","Ground","Ground")
Delivery<-c("000123","000456","000123","000345","000456","000555")
df1<-data.frame(Soldto,Shipto,IssueDate,Method,Delivery)
What I'm looking to do is "For each Sold-to/Ship-to/Method count the number of unique delivery IDs".
The intent is to find the number of unique deliveries that could potentially be "aggregated."
In Tableau that function looks like:
{FIXED [Soldto],[Shipto],[IssueDate],[Method] : COUNTD([Delivery])}
Could this be done with aggregate or summarise, as in the example below?
df.new <- ddply(df, c("Soldto", "Shipto", "Method"), summarise,
                Deliveries = n_distinct(Delivery))
This is fairly easy with dplyr. You are looking for the number of unique deliveries for each combination of soldto, shipto and method, which is just group_by() and then summarise():
library(tidyverse)
tbl <- tibble(
  soldto = c("123456", "122456", "123456", "122456", "124560", "125560"),
  shipto = c("123456", "122555", "122456", "124560", "122560", "122456"),
  issuedate = as.Date(c("2017-01-01", "2017-01-02", "2017-01-01", "2017-01-02", "2017-01-01", "2017-01-01")),
  method = c("Ground", "Ground", "Ground", "Air", "Ground", "Ground"),
  delivery = c("000123", "000456", "000123", "000345", "000456", "000555")
)
tbl %>%
  group_by(soldto, shipto, method) %>%
  summarise(uniques = n_distinct(delivery))
#> # A tibble: 6 x 4
#> # Groups: soldto, shipto [?]
#> soldto shipto method uniques
#> <chr> <chr> <chr> <int>
#> 1 122456 122555 Ground 1
#> 2 122456 124560 Air 1
#> 3 123456 122456 Ground 1
#> 4 123456 123456 Ground 1
#> 5 124560 122560 Ground 1
#> 6 125560 122456 Ground 1
Created on 2018-03-02 by the reprex package (v0.2.0).
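To answer the aggregate part of the question directly, the same count works in base R as well; a minimal sketch using the tbl built above:
# Count distinct delivery IDs per soldto/shipto/method combination
aggregate(delivery ~ soldto + shipto + method, data = tbl,
          FUN = function(x) length(unique(x)))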

Skipping rows until row with a certain value

I need to read a .txt file from a URL, but would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:
"
... (number of rows)
... (number of rows)
... (number of rows)
DATE VALUE
1945-01-01 144855
1946-01-01 138515
1947-01-01 136405
1948-01-01 135486
1949-01-01 142455
"
I would like to skip all rows until the row with "DATE // VALUE" and start importing the data from this line onwards (including "DATE // VALUE"). Is there a way to do this with data.table's fread() - or any other way, such as with dplyr?
Thank you very much in advance for your effort and your time!
Best,
c.
Here's a way to extract that info from those text files using readr::read_lines, dplyr, and string handling from stringr.
library(tidyverse)
library(stringr)
df <- data_frame(lines = read_lines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt")) %>%
  filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
  mutate(date = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
         value = as.numeric(str_extract(lines, "[\\d-]+$"))) %>%
  select(-lines)
df
#> # A tibble: 286 x 2
#> date value
#> <chr> <dbl>
#> 1 1945-10-01 1245
#> 2 1946-01-01 NA
#> 3 1946-04-01 NA
#> 4 1946-07-01 NA
#> 5 1946-10-01 1298
#> 6 1947-01-01 NA
#> 7 1947-04-01 NA
#> 8 1947-07-01 NA
#> 9 1947-10-01 1413
#> 10 1948-01-01 NA
#> # ... with 276 more rows
I filtered for all the lines you want to keep using stringr::str_detect, then extracted out the info you want from the string using stringr::str_extract and regexes.
Combining fread with unix tools:
> fread("curl -s https://fred.stlouisfed.org/data/HNOMFAQ027S.txt | sed -n -e '/^DATE.*VALUE/,$p'")
DATE VALUE
1: 1945-10-01 1245
2: 1946-01-01 .
3: 1946-04-01 .
4: 1946-07-01 .
5: 1946-10-01 1298
---
282: 2016-01-01 6566888
283: 2016-04-01 6741075
284: 2016-07-01 7022321
285: 2016-10-01 6998898
286: 2017-01-01 7448792
>
Using:
file.names <- c('https://fred.stlouisfed.org/data/HNOMFAQ027S.txt',
                'https://fred.stlouisfed.org/data/DGS10.txt',
                'https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt')
text.list <- lapply(file.names, readLines)
skip.rows <- sapply(text.list, grep, pattern = '^DATE\\s+VALUE') - 1
# option 1
l <- Map(function(x,y) read.table(text = x, skip = y), x = text.list, y = skip.rows)
# option 2
l <- lapply(seq_along(text.list), function(i) fread(file.names[i], skip = skip.rows[i]))
will get you a list of data.frames (option 1) or data.tables (option 2).
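Note that fread() can also locate the header line by itself: its skip argument accepts a string, and reading then starts at the first line containing it. A minimal sketch, assuming the FRED URL above is still reachable (the VALUE column may come in as character, because missing values are printed as "." in these files):
library(data.table)
# skip = "DATE" starts reading at the first line containing "DATE"
dt <- fread("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt", skip = "DATE")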
