Splitting Columns by Number of Characters [duplicate] - r

I have a column of dates in a data table entered in 6-digit numbers as such: 201401, 201402, 201403, 201412, etc. where the first 4 digits are the year and second two digits are month.
I'm trying to split that column into two columns, one called "year" and one called "month". Been messing around with strsplit() but can't figure out how to get it to do number of characters instead of a string pattern, i.e. split in the middle of the 4th and 5th digit.

Without using any external package, we can do this with substr
transform(df1, Year = substr(dates, 1, 4), Month = substr(dates, 5, 6))
# dates Year Month
#1 201401 2014 01
#2 201402 2014 02
#3 201403 2014 03
#4 201412 2014 12
We can keep the original dates column, as here, or drop it afterwards.
Or with sub
cbind(df1, read.csv(text=sub('(.{4})(.{2})', "\\1,\\2", df1$dates), header=FALSE))
Or using some package solutions
library(tidyr)
extract(df1, dates, into = c("Year", "Month"), "(.{4})(.{2})", remove=FALSE)
Or with data.table
library(data.table)
setDT(df1)[, tstrsplit(dates, "(?<=.{4})", perl = TRUE)]
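Since the question is about splitting at a character position rather than on a pattern, the substr idea can be wrapped in a small helper. This is only a sketch: split_at and its left/right column names are made up for illustration, and it assumes every value is longer than pos characters.

```r
# Hypothetical helper (not from any package): split each value
# after its first `pos` characters.
split_at <- function(x, pos) {
  x <- as.character(x)  # numeric input such as 201401 becomes "201401"
  data.frame(
    left  = substr(x, 1, pos),
    right = substr(x, pos + 1, nchar(x)),
    stringsAsFactors = FALSE
  )
}

dates <- c(201401, 201402, 201403, 201412)
parts <- split_at(dates, 4)
names(parts) <- c("year", "month")
parts
```

Both pieces come back as character, so the leading zero of the month is preserved.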

tidyr::separate can take an integer for its sep parameter, which will split at a particular location:
library(tidyr)
df <- data.frame(date = c(201401, 201402, 201403, 201412))
df %>% separate(date, into = c('year', 'month'), sep = 4)
#> year month
#> 1 2014 01
#> 2 2014 02
#> 3 2014 03
#> 4 2014 12
Note the new columns are character; add convert = TRUE to coerce back to numbers.
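If numbers are what you want in the end and the column is already numeric, integer arithmetic avoids strings altogether. A minimal base R sketch, assuming every value really is a 6-digit YYYYMM number:

```r
date <- c(201401, 201402, 201403, 201412)

year  <- date %/% 100  # integer division drops the last two digits
month <- date %% 100   # the remainder keeps the last two digits

data.frame(date, year, month)
```

The month comes out as 1, 2, 3, 12 rather than "01", "02", and so on, which is the same trade-off as convert = TRUE.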

Related

How to organise a date in 3 different columns in r?

I have a lot of climate data organised by date, like this:
df = data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"), Precipitation = c(20, 22, 23))
And I want to organise it like this one
df = data.frame(year = c("2011", "2011","2011"), month = c("03","02","01"), day = c("24", "03", "02"), pp = c(20, 22, 23))
I have a lot of data, so I cannot do this manually.
Can anybody help me? Thanks a lot.
Using strsplit, you can do it this way:
Logic: strsplit splits each date on the dashes, giving a list with one element per row, each holding three parts (year, month, and day). do.call('rbind', ...) then row-binds those list elements into a character matrix with one row per date. Since the result is a matrix, we convert it to a data frame and name its columns with setNames. The final cbind attaches the original Precipitation column. The as.character() call guards against date being stored as a factor, which was the data.frame default before R 4.0.
cbind(setNames(data.frame(do.call('rbind', strsplit(as.character(df$date), '-'))), c('Year', 'month', 'day')), 'Precipitation' = df$Precipitation)
Output:
Year month day Precipitation
1 2011 03 24 20
2 2011 02 03 22
3 2011 01 02 23
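Another base R route goes through the Date class instead of splitting on dashes. A small sketch, assuming the dates are always in YYYY-MM-DD form (the names d and out are just for illustration):

```r
df <- data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"),
                 Precipitation = c(20, 22, 23))

d <- as.Date(df$date)  # the default format already matches "YYYY-MM-DD"

out <- data.frame(year  = format(d, "%Y"),
                  month = format(d, "%m"),
                  day   = format(d, "%d"),
                  pp    = df$Precipitation)
out
```

format() returns character, so month and day keep their leading zeros, as in the desired output in the question.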
Another option is the clock package. This returns integer values for year, month, and day. If you really need them as characters padded with 0, you can use formatC(x, width = 2, flag = "0") on the result.
library(clock)
library(dplyr)
df <- data.frame(
date = c("2011-03-24", "2011-02-03", "2011-01-02"),
pp = c(20, 22, 23)
)
df %>%
mutate(
date = as.Date(date),
year = get_year(date),
month = get_month(date),
day = get_day(date)
)
#> date pp year month day
#> 1 2011-03-24 20 2011 3 24
#> 2 2011-02-03 22 2011 2 3
#> 3 2011-01-02 23 2011 1 2
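For the zero padding mentioned above, a quick sketch of formatC on integer month and day values like the ones in that output (month_chr and day_chr are illustrative names):

```r
month <- c(3L, 2L, 1L)
day   <- c(24L, 3L, 2L)

# width = 2 pads to two characters; flag = "0" pads with zeros, not spaces
month_chr <- formatC(month, width = 2, flag = "0")
day_chr   <- formatC(day, width = 2, flag = "0")

month_chr
day_chr
```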

extract specific digits from column of numbers in R

Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I need to also create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins as the 6th digit in the code. So I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
mutate(
date=readr::parse_date(substr(code, 6, 13), format="%Y%m%d"),
day=format(date, "%d"),
month=format(date, "%m"),
year=format(date, "%Y"),
day.of.year=format(date, "%j")
)
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
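Edit: note that all the new columns are character. We can tell this without using str() because of the leading zeros in the new columns. To get integers instead, convert just those columns, for example out <- out %>% mutate(across(c(year, month, day, day.of.year), as.integer)), or append that mutate call to the end of the existing pipeline. Converting every column with mutate_all(as.integer) is not a good idea here: the 16-digit code values are too large for R's integer type and would become NA, and date would turn into a day count.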
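The same idea also works without any packages. A base R sketch, assuming the date always occupies characters 6 to 13 of code as described in the question:

```r
code <- c("1109619910224003", "1157919910102001",
          "1539820070315001", "1563120190907002")

# Characters 6 to 13 hold the date as yyyymmdd
date <- as.Date(substr(code, 6, 13), format = "%Y%m%d")

day       <- as.integer(format(date, "%d"))  # day of month
dayofyear <- as.integer(format(date, "%j"))  # day of year, 1 to 366

data.frame(code, date, day, dayofyear)
```

Wrapping format() in as.integer() drops the leading zeros, which matches the day and dayofyear values asked for in the question.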

Spreading row indices into columns in R

I have a data frame in R in the following format:
> old.dat
id type minDate maxDat eventNum
1 001 A may june 1
2 002 B apr oct 1
3 002 C may nov 2
4 002 B july dec 3
I want to turn rows into columns, based on eventNum. The max eventNum is 3, so if some IDs only have 1 eventNum, I want them filled with NA.
Goal:
id type1 minDate1 maxDat1 eventNum1 type2 minDate2 maxDat2 eventNum2 type3 minDate3 maxDat3 eventNum3
1 001 A may june 1 <NA> <NA> <NA> NA <NA> <NA> <NA> NA
2 002 B apr oct 1 C may nov 2 B july dec 3
Here is a code chunk to bring in the starting point.
old.dat <- data.frame(id = c("001","002","002","002"),
type = c("A","B","C","B"),
minDate = c("may","apr","may","july"),
maxDat = c("june", "oct", "nov", "dec"),
eventNum = c(1,1,2,3))
I wrote a for loop, but it's rather slow, and it's taking a long time to churn through my data set, so any faster suggestions would be great. Thanks!
Why? I don't know if I can imagine a situation in which that format will be useful, but here is an approach using tidyr.
First, I am saving a list of the new column names to make things easier to pull together:
newCols <- c("type", "minDate", "MaxDat")
(I will be adding the numbers below).
Then, I unite the values you want for each event into one column, spread the results to get a new column for each eventNum, and separate each of those back into individual columns (appending the number of the event to each name).
old.dat %>%
unite(toSpread, type, minDate, maxDat, sep = "::") %>%
spread(eventNum, toSpread) %>%
separate(`1`, paste0(newCols, "_1"), sep = "::") %>%
separate(`2`, paste0(newCols, "_2"), sep = "::") %>%
separate(`3`, paste0(newCols, "_3"), sep = "::")
Returns:
id type_1 minDate_1 MaxDat_1 type_2 minDate_2 MaxDat_2 type_3 minDate_3 MaxDat_3
1 001 A may june <NA> <NA> <NA> <NA> <NA> <NA>
2 002 B apr oct C may nov B july dec
Here is an alternative approach that converts the data to a long format first (with gather), then generates the new column names and does the spreading. The complicated mutate line assigning factor levels to the new columns is only needed to sort the columns and uses parse_number from readr to extract the event numbers. If you are OK with the output columns being sorted alphabetically, you can skip that step. This approach is robust to additional event numbers, as it will automatically add events for each unique value in eventNum.
old.dat %>%
gather(Metric, Value, type, minDate, maxDat) %>%
unite(newColHead, Metric, eventNum) %>%
mutate(newColHead = factor(newColHead
, levels =
unique(newColHead) %>%
{.[order(parse_number(.))]}
)) %>%
spread(newColHead, Value)
For this use case, the output is identical to the above.
(And, if you want evidence for why this may be better; note my edit -- I originally mislabeled event numbers 2/3.)
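For comparison, base R's reshape() can do this kind of long-to-wide conversion in one call. This is a sketch of an alternative, not what the answers above use. It names the new columns with a dot (type.1, minDate.1, ...) and drops eventNum itself:

```r
old.dat <- data.frame(id = c("001", "002", "002", "002"),
                      type = c("A", "B", "C", "B"),
                      minDate = c("may", "apr", "may", "july"),
                      maxDat = c("june", "oct", "nov", "dec"),
                      eventNum = c(1, 1, 2, 3))

# One row per id; each remaining column is repeated once per eventNum
wide <- reshape(old.dat, idvar = "id", timevar = "eventNum",
                direction = "wide")
wide
```

IDs with fewer events get NA in the columns for the missing event numbers.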

Conversion to Unique and ordered string of numbers

If I had a dataframe that has a string of numbers separated by a comma in a column, how can I convert that string to an ordered and unique converted set in another column?
Month String_of_Nums Converted
May 3,3,2 2,3
June 3,3,3,1 1,3
Sept 3,3,3, 3 3
Oct 3,3,3, 4 3,4
Jan 3,3,4 3,4
Nov 3,3,5,5 3,5
I tried splitting up the string of numbers to get unique to work
strsplit(df$String_of_Nums,",")
but I end up with spaces in the character list. Any ideas how to efficiently generate a Converted column? Also need to figure out how to operate on all elements of the column, etc.
Try:
df1 <- read.table(text="Month String_of_Nums
May '3,3,2'
June '3,3,3,1'
Sept '3,3,3,3'
Oct '3,3,3,4'
Jan '3,3,4'
Nov '3,3,5,5'", header = TRUE)
df1$converted <- apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ","))
df1
Month String_of_Nums converted
1 May 3,3,2 2,3
2 June 3,3,3,1 1,3
3 Sept 3,3,3,3 3
4 Oct 3,3,3,4 3,4
5 Jan 3,3,4 3,4
6 Nov 3,3,5,5 3,5
Here is another way. As far as I can see, Jay's example has String_of_Nums as a factor. Given you said that strsplit() worked, I am assuming that your String_of_Nums is character, so I have the column as character here as well. First, split each string (strsplit), find the unique values (unique), sort them (sort), and paste them together (toString). At this point, you have a list. You can convert that list to a character vector using as_vector from the purrr package. Out of interest, I also used microbenchmark to compare how the two approaches perform when creating the vector (i.e., Converted).
library(magrittr)
library(purrr)
lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character") -> mydf$out
# Month String_of_Nums out
#1 May 3,3,2 2, 3
#2 June 3,3,3,1 1, 3
#3 Sept 3,3,3,3 3
#4 Oct 3,3,3,4 3, 4
#5 Jan 3,3,4 3, 4
#6 Nov 3,3,5,5 3, 5
library(microbenchmark)
microbenchmark(
jazz = lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character"),
jay = apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ",")),
times = 10000)
# expr min lq mean median uq max neval
# jazz 358.913 393.018 431.7382 405.9395 420.1735 54779.29 10000
# jay 1099.587 1151.244 1233.5631 1167.0920 1191.5610 56871.45 10000
DATA
Month String_of_Nums
1 May 3,3,2
2 June 3,3,3,1
3 Sept 3,3,3,3
4 Oct 3,3,3,4
5 Jan 3,3,4
6 Nov 3,3,5,5
mydf <- structure(list(Month = c("May", "June", "Sept", "Oct", "Jan",
"Nov"), String_of_Nums = c("3,3,2", "3,3,3,1", "3,3,3,3", "3,3,3,4",
"3,3,4", "3,3,5,5")), .Names = c("Month", "String_of_Nums"), row.names = c(NA,
-6L), class = "data.frame")
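Neither example above includes the stray spaces from the question (e.g. "3,3,3, 3"). A base R sketch that handles them by trimming and converting each piece to an integer before deduplicating. The vector x is made-up input that mirrors the question's table:

```r
x <- c("3,3,2", "3,3,3,1", "3,3,3, 3", "3,3,3, 4", "3,3,4", "3,3,5,5")

converted <- vapply(strsplit(x, ","), function(p) {
  n <- as.integer(trimws(p))               # " 3" and "3" become the same value
  paste(sort(unique(n)), collapse = ",")   # dedupe, sort numerically, rejoin
}, character(1))

converted
```

Converting to integer also makes the sort numeric, which matters once values of 10 or more appear.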

Obtaining year and month from date column in R

I have a date column where dates look like this:
19940818
19941215
What is the proper command to extract the year and month from them?
If your data is e.g.
(df <- data.frame(date = c("19940818", "19941215")))
# date
#1 19940818
#2 19941215
To add two columns, one for month and one for year, you can do
within(df, {
year <- substr(date, 1, 4)
month <- substr(date, 5, 6)
})
# date month year
# 1 19940818 08 1994
# 2 19941215 12 1994
I don't see a need to convert to Date class here since all you want is a substring of the date column.
Another option is to use extract from tidyr. Using df from @Richard Scriven's post
library(tidyr)
extract(df, date, c('year', 'month'), '(.{4})(.{2}).*', remove=FALSE)
# date year month
#1 19940818 1994 08
#2 19941215 1994 12
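One case where going through Date may still be worth it is validation: substr will happily return a month of "13", while as.Date gives NA for a string that is not a real date. A small sketch, with a deliberately invalid third value added for illustration:

```r
date <- c("19940818", "19941215", "19941315")  # last value has month 13

parsed <- as.Date(date, format = "%Y%m%d")  # invalid dates become NA

data.frame(date,
           year  = as.integer(format(parsed, "%Y")),
           month = as.integer(format(parsed, "%m")))
```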
