Splitting a column with multiple and unevenly distributed delimiters in R - r

I have a column/vector of character data that I need to separate into different columns. The problem? There are different delimiters (which mean different things), and different lengths between each delimiter. For example:
column_name
akjhaa 1-29 y 12-30
bsd, 14-20
asdf asdf del 2-5 y 6
dkljwv 3-31
joikb 6-22
sqwzsxcryvyde jd de 1-2
pk, ehde 1-2
jsd 1-15
asdasd asedd 1,3
The numbers need to be separated into columns apart from the characters. However, the numbers can be separated by a comma or dash or 'y'. Moreover, the numbers separated by dash should be somehow designated, as eventually, I need to make a document/vector where each of the numbers in that range is in their own column also (such that the split aaa column would become aaa 1 2 3 4 5 .... 29 12 13 ... 30).
So far, I have tried separating into columns based on the different delimiters, but because sometimes the values have more than one '-','y', or the 'y' falls as a word in one of the first character parts, it is starting to get a bit complicated...is there an easier way?
For clarification, for the "column_name" I gave, the final output would have n columns, where n = (the largest count of numbers in any row) + 1 (for the character string). So, for the example "column_name" above, it would look like:
column_name n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 n15 n16 n17 n18 n19 n20 n21 n22 n23 n24 n25 n26 n27 n28 n29 n30 n31 n32 n33 n34 n35 n36 n37 n38 n39 n40 n41 n42 n43 n44 n45 n46 n47 n48 n49 n50 n51 n52 n53 n54 n55 n56 n57 n58
akjhaa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
bsd 14 15 16 17 18 19 20
asdf asdf del 2 3 4 5 6
dkljwv 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
joikb 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
sqwzsxcryvyde jd de 1 2
pk ehde 1 2
jsd 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
asdasd asedd 1 3

This isn't pretty, but it works. The result is a list column with the relevant values.
library(data.table)       # setDT(), rowid(), transpose()
library(magrittr)         # %>%
library(splitstackshape)  # cSplit()
setDT(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
  cSplit("CN", "SPLIT") %>%
  cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
  cSplit("CN_2", "-") %>%
  .[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
    .(CN_1, rowid(CN_1))] %>%
  .[, list(values = list(unlist(values))), .(CN_1)]
# CN_1 values
# 1: akjhaa 1,2,3,4,5,6,...
# 2: bsd, 14,15,16,17,18,19,...
# 3: asdf asdf del 2,3,4,5,6
# 4: dkljwv 3,4,5,6,7,8,...
# 5: joikb 6, 7, 8, 9,10,11,...
# 6: sqwzsxcryvyde jd de 1,2
# 7: pk, ehde 1,2
# 8: jsd 1,2,3,4,5,6,...
# 9: asdasd asedd 1,3
To get the extra columns instead of a list, you would need one more line: cbind(., .[, data.table::transpose(values)]):
as.data.table(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
cSplit("CN", "SPLIT") %>%
cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
cSplit("CN_2", "-") %>%
.[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
.(CN_1, rowid(CN_1))] %>%
.[, list(values = list(unlist(values))), .(CN_1)] %>%
cbind(., .[, data.table::transpose(values)])
The basic idea is to do the following steps:
1. Split the column names from the values.
2. Split values separated by "y" or by a "," into new rows.
3. Split values separated by "-" into multiple columns.
4. Create your list of vectors according to the rule that if any values in the second split column are NA, return just the value from the first column; otherwise, create the sequence from the value in the first column to the value in the second column. Since you have duplicated "id" values because you've converted the data into a longer form, use rowid() to help with the grouping.
5. Consolidate the values in the list column according to the actual IDs.
6. (Optionally, in my opinion) transform the list data into multiple columns.
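For comparison, the same steps can be sketched in base R without extra packages. This is a minimal sketch, not the method used in the answer above: `expand_entry` is a hypothetical helper, and it assumes that every entry contains at least one digit, that the label part contains no digits, and that commas in the label can be dropped (as in the desired output).

```r
column_name <- c("akjhaa 1-29 y 12-30", "bsd, 14-20", "asdf asdf del 2-5 y 6",
                 "dkljwv 3-31", "joikb 6-22", "sqwzsxcryvyde jd de 1-2",
                 "pk, ehde 1-2", "jsd 1-15", "asdasd asedd 1,3")

# Hypothetical helper: split one entry into its label and its expanded numbers.
expand_entry <- function(x) {
  # Everything before the first digit is the label; the rest holds the numbers.
  pos <- regexpr("[0-9]", x)
  label <- trimws(gsub(",", "", substr(x, 1, pos - 1)))
  nums <- substr(x, pos, nchar(x))
  # "y" and "," separate independent pieces; "-" marks a range.
  pieces <- trimws(strsplit(nums, "y|,")[[1]])
  values <- unlist(lapply(strsplit(pieces, "-", fixed = TRUE), function(p) {
    p <- as.integer(p)
    if (length(p) == 2) seq(p[1], p[2]) else p
  }))
  list(label = label, values = values)
}

entries <- lapply(column_name, expand_entry)
vals <- lapply(entries, function(e) e$values)

# Pad every row with NA up to the longest row to get one column per number.
n <- max(lengths(vals))
wide <- t(sapply(vals, function(v) c(v, rep(NA, n - length(v)))))
rownames(wide) <- sapply(entries, function(e) e$label)
```

Because only the part after the first digit is split, a "y" inside the label (as in "sqwzsxcryvyde") is never treated as a delimiter.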

Related

Split numeric variables by decimals in R

I have a data frame with a column that contains numeric values, which represent the price.
ID
Total
1124
12.34
1232
12.01
1235
13.10
I want to split the column Total by "." and create 2 new columns with the euro and cent amount. Like this:
ID
Total
Euro
Cent
1124
12.34
12
34
1232
12.01
12
01
1235
13.10
13
10
1225
13.00
13
00
The euro and cent column should also be numeric.
I tried:
df[c('Euro', 'Cent')] <- str_split_fixed(df$Total, "(\\.)", 2)
But I get 2 new columns of type character that looks like this:
ID
Total
Euro
Cent
1124
12.34
12
34
1232
12.01
12
01
1235
13.10
13
1
1225
13.00
13
If I convert the character columns (euro and cent) to numeric like this:
as.numeric(df$Euro)
the 00 cent value turns into NULL and the 10 cent turn into 1 cent.
Any help is welcome.
Two methods:
If class(dat$Total) is numeric, you can do this:
dat <- transform(dat, Euro = Total %/% 1, Cent = 100 * (Total %% 1))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
%/% is the integer-division operator, %% the modulus operator.
If class(dat$Total) is character, then
dat <- transform(dat, Euro = sub("\\..*", "", Total), Cent = sub(".*\\.", "", Total))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 01
# 3 1235 13.10 13 10
The two new columns are also character. Depending on what you need, you may want one of two more steps:
Remove leading 0s and keep the columns as character (note that a value of "00" becomes an empty string):
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], sub, pattern = "^0+", replacement = "")
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
Convert to numbers:
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], as.numeric)
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
(You can also use as.integer if you know both columns will always be such.)
Just use standard numeric functions:
df$Euro <- floor(df$Total)
df$Cent <- df$Total %% 1 * 100
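One caveat with the numeric approaches: values such as 12.34 cannot be stored exactly in binary floating point, so `Total %% 1 * 100` may come out as a number very slightly off 34. A minimal sketch of one way around this, assuming non-negative amounts with at most two decimals, is to round to whole cents first and then split with integer arithmetic. The data frame below is rebuilt from the question's example.

```r
dat <- data.frame(ID = c(1124, 1232, 1235, 1225),
                  Total = c(12.34, 12.01, 13.10, 13.00))

# Work in whole cents so the split is done on exact integers.
cents_total <- round(dat$Total * 100)
dat$Euro <- as.integer(cents_total %/% 100)
dat$Cent <- as.integer(cents_total %% 100)

# Numbers cannot carry leading zeros; format them only for display.
cent_label <- sprintf("%02d", dat$Cent)
```

Here `Cent` stays numeric (so 13.00 gives 0 rather than a missing value), while `cent_label` holds the two-digit text form.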

Applying mutate with a value from another line

I have a table and I would like to add a column that calculates the percentage compared to the previous line.
The calculation divides the value on line 1 by the value on line 2, and the result is shown on line 2.
Example
month <- c(10,11,12,13,14,15)
sell <-c(258356,278958,287928,312254,316287,318999)
df <- data.frame(month, sell)
df %>% mutate(augmentation = sell[month]/sell[month+1])
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984
dplyr
You can just use lag like this:
library(dplyr)
df %>%
mutate(resultat = lag(sell)/sell)
Output:
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984
data.table
Another option is using shift:
library(data.table)
setDT(df)[, resultat:= shift(sell)/sell][]
Output:
month sell resultat
1: 10 258356 NA
2: 11 278958 0.9261466
3: 12 287928 0.9688464
4: 13 312254 0.9220955
5: 14 316287 0.9872489
6: 15 318999 0.9914984
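If neither package is available, the same shift can be sketched in base R. This is only an illustration of what lag and shift do here; `previous` is a name introduced for the example.

```r
month <- c(10, 11, 12, 13, 14, 15)
sell <- c(258356, 278958, 287928, 312254, 316287, 318999)
df <- data.frame(month, sell)

# Shift the column down by one row: NA first, then every value but the last.
previous <- c(NA, head(df$sell, -1))
df$resultat <- previous / df$sell
```

The first row has no predecessor, so its result is NA, as in the outputs above.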

Hold current value until non-null value occurs [duplicate]

Hi, I come from a background in SAS and I am relatively new to R. I am attempting to convert an existing SAS program into equivalent R code.
I am unsure how to achieve the equivalent of SAS's "retain" and "by" behavior in R.
I have a dataframe with two columns: the first is a date column and the second is a numeric value.
The numeric column represents a result from a lab test. The test is conducted semi-regularly, so on some days there will be null values in the data. The data is ordered by date and the dates are sequential.
Example data looks like this:
Date Result
2017/01/01 15
2017/01/02 NA
2017/01/03 NA
2017/01/04 12
2017/01/05 NA
2017/01/06 13
2017/01/07 11
2017/01/08 NA
I would like to create a third column that contains the most recent result.
If the Result column is null, it should be set to the most recent non-null Result; otherwise it should contain the Result value.
My desired output would look like this:
Date Result My_var
2017/01/01 15 15
2017/01/02 NA 15
2017/01/03 NA 15
2017/01/04 12 12
2017/01/05 NA 12
2017/01/06 13 13
2017/01/07 11 11
2017/01/08 NA 11
In SAS I can achieve this with something like following code snippet:
data my_data;
retain My_var;
set input_data;
by date;
if Result not = . then
my_var = result;
run;
I am stumped as to how to do this in R. I do not think R supports BY-group processing as in SAS, or at least I don't know how to set that as an option.
I have naively tried:
my_data <- mutate(input_data, my_var = if(is.na(Result)) {lag(Result)} else {Result})
But I do not think that syntax is correct.
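We can use the na.locf function from the zoo package to fill in the missing values. Passing na.rm = FALSE keeps any leading NA, so the result always has the same length as the input column.
library(zoo)
dt$My_var <- na.locf(dt$Result, na.rm = FALSE)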
dt
# Date Result My_var
# 1 2017/01/01 15 15
# 2 2017/01/02 NA 15
# 3 2017/01/03 NA 15
# 4 2017/01/04 12 12
# 5 2017/01/05 NA 12
# 6 2017/01/06 13 13
# 7 2017/01/07 11 11
# 8 2017/01/08 NA 11
Or the fill function from the tidyr package.
library(dplyr)
library(tidyr)
dt <- dt %>%
mutate(My_var = Result) %>%
fill(My_var)
dt
# Date Result My_var
# 1 2017/01/01 15 15
# 2 2017/01/02 NA 15
# 3 2017/01/03 NA 15
# 4 2017/01/04 12 12
# 5 2017/01/05 NA 12
# 6 2017/01/06 13 13
# 7 2017/01/07 11 11
# 8 2017/01/08 NA 11
DATA
dt <- read.table(text = "Date Result
2017/01/01 15
2017/01/02 NA
2017/01/03 NA
2017/01/04 12
2017/01/05 NA
2017/01/06 13
2017/01/07 11
2017/01/08 NA",
header = TRUE, stringsAsFactors = FALSE)
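To make the "retain" logic explicit, here is a minimal base R sketch of last-observation-carried-forward. `locf` is a hypothetical helper written for illustration, not a function from zoo or tidyr, and the `Patient` grouping vector is invented to show how SAS-style BY-group processing could be approximated with ave().

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  # idx counts how many non-NA values have been seen so far (0 before the first).
  idx <- cumsum(!is.na(x))
  # Prepend NA so that idx == 0 (a leading NA) maps to NA.
  c(NA, x[!is.na(x)])[idx + 1]
}

Result <- c(15, NA, NA, 12, NA, 13, 11, NA)
My_var <- locf(Result)

# Assumed example of BY-group filling: values never leak across groups.
Patient <- c("a", "a", "b", "b")
filled <- ave(c(1, NA, NA, 2), Patient, FUN = locf)
```

In the grouped example the first value of group "b" stays NA, because there is no earlier result within that group to retain.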

Extract intervals from time data in R

My problem is simple. I have a table where each row is an event (month, day, hour, and minute are given). However, the machine was set to record 24/7, so I have more events (rows) than I need. How can I remove the surplus daytime rows and keep only the rows from the night (from sunset to sunrise)?
The dreadful thing is that the timing of sunrise/sunset is slightly different each day.
In this example I provide two tables. The first is a table with all events; the second contains the timings of sunset/sunrise for each day.
If possible, and noting that EACH night spans two dates, could an additional column containing the ID of the night be inserted into the table? (see scheme below)
# table with all events
my.table <- data.frame(event = 1:34,
day = rep(c(30,31,1,2,3), times = c(8,9,7,8,2)),
month = rep(c(3,4), each = 17),
hour = c(13,13,13,13,22,
22,23,23,2,2,2,
14,14,14,19,22,22,
2,2,2,14,15,22,22,
3,3,3,14,14,14,
23,23,2,14),
minute = c(11,13,44,55,27,
32,54,57,10,14,
26,12,16,46,30,
12,13,14,16,45,
12,15,12,15,24,
26,28,12,16,23,12,13,11,11))
# timings of sunset/sunrise for each day
sun.table <- data.frame(day = c(30,31,31,1,1,2,2,3),
month = rep(c(3,4), times = c(3,5)),
hour = rep(c(19,6), times = 4),
minute = c(30,30,31,29,32,
28,33,27),
type = rep(c("sunset","sunrise"), times = 4))
# right solution: reduced table would contain only rows:
# 5,6,7,8,9,10,11,16,17,18,19,20,23,24,25,26,27,31,32,33.
# nrow("reduced table") == 20
Here's one possible strategy
# convert sunset/sunrise times to proper date-times
ss <- with(sun.table, ISOdate(2000, month, day, hour, minute))
sunset <- ss[seq(1, length(ss), by = 2)]   # odd rows of sun.table are sunsets
sunrise <- ss[seq(2, length(ss), by = 2)]  # even rows of sun.table are sunrises
Here I assume the table is ordered, starts with a sunset, alternates between sunset and sunrise, and ends with a sunrise. Date values also need a year; here I just hard-coded 2000. As long as your data doesn't span years (or leap days) that should be fine, but you'll probably want to pop in the actual year of your observations.
Now do the same for events
tt <- with(my.table, ISOdate(2000, month, day, hour, minute))
Find rows that fall during a night
night <- sapply(tt, function(x) any(sunset < x & x < sunrise))
and extract those rows
my.table[night, ]
# event day month hour minute
# 5 5 30 3 22 27
# 6 6 30 3 22 32
# 7 7 30 3 23 54
# 8 8 30 3 23 57
# 9 9 31 3 2 10
# 10 10 31 3 2 14
# 11 11 31 3 2 26
# 16 16 31 3 22 12
# 17 17 31 3 22 13
# 18 18 1 4 2 14
# 19 19 1 4 2 16
# 20 20 1 4 2 45
# 23 23 1 4 22 12
# 24 24 1 4 22 15
# 25 25 2 4 3 24
# 26 26 2 4 3 26
# 27 27 2 4 3 28
# 31 31 2 4 23 12
# 32 32 2 4 23 13
# 33 33 3 4 2 11
Here we only grab events that occur after a sunset and before the following sunrise. Row 34 (14:11 on the last day) falls after the last sunrise in sun.table, and the table holds no later sunset, so it is not returned.
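The question also asks for an ID of the night. A minimal sketch of one way to get it is base R's findInterval, under the same assumption that sun.table is ordered and alternates sunset, sunrise, sunset, and so on. The column name `night` is introduced here for illustration. Unlike the strict comparison above, an event exactly at a sunset time would be counted as night.

```r
my.table <- data.frame(event = 1:34,
  day = rep(c(30, 31, 1, 2, 3), times = c(8, 9, 7, 8, 2)),
  month = rep(c(3, 4), each = 17),
  hour = c(13, 13, 13, 13, 22, 22, 23, 23, 2, 2, 2, 14, 14, 14, 19, 22, 22,
           2, 2, 2, 14, 15, 22, 22, 3, 3, 3, 14, 14, 14, 23, 23, 2, 14),
  minute = c(11, 13, 44, 55, 27, 32, 54, 57, 10, 14, 26, 12, 16, 46, 30, 12, 13,
             14, 16, 45, 12, 15, 12, 15, 24, 26, 28, 12, 16, 23, 12, 13, 11, 11))
sun.table <- data.frame(day = c(30, 31, 31, 1, 1, 2, 2, 3),
  month = rep(c(3, 4), times = c(3, 5)),
  hour = rep(c(19, 6), times = 4),
  minute = c(30, 30, 31, 29, 32, 28, 33, 27),
  type = rep(c("sunset", "sunrise"), times = 4))

# Year 2000 is a placeholder, as in the answer above.
ss <- with(sun.table, ISOdate(2000, month, day, hour, minute))
tt <- with(my.table, ISOdate(2000, month, day, hour, minute))

# k is the number of sunset/sunrise times at or before each event.
# An odd k means the last one passed was a sunset, so the event is at night.
k <- findInterval(as.numeric(tt), as.numeric(ss))
my.table$night <- ifelse(k %% 2 == 1, (k + 1) / 2, NA)

reduced <- my.table[!is.na(my.table$night), ]
```

Each night then carries a single ID even though it spans two calendar dates, and daytime events have NA in that column.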

Working with dataframe that uses ':' and 'x' as separators/identifiers within only one columns

I have a dataframe that provides all kinds of sales info- date, session, time, day of week, product type, total sales, etc. It also includes a single column that provides the order in which all products were purchased in that session. Some of the products are text names, some are numbers.
The products with text names never change, but the products with numerical names rotate as new ones are developed. (This is why they are listed in a single column- the "numerical" products change so much that the dataframe would get maddeningly wide in just a few months, plus some other issues)
Here's a small subset:
Session TotSales GameList
20764 15 ProductA
31976 7 ProductB:ProductB:ProductB
27966 25 1069x2
324 3 1067x1
6943 28 1071x1:1064x1:1038x2:1034x1:ProductE
14899 12 1062x2
25756 8 ProductC:ProductC:ProductB
27279 6 ProductD:ProductD:ProductD:PcoductC
31981 4 1067x1
2782 529 1046x2:1046x2:1046x1:1046x1:1046x1:1046x4
Okay, so in the above example, in session 20764 (the first one), sales were $15 and it was all spent on ProductA. In the next session, ProductB was purchased three times. In the third session, product 1069 was purchased twice, and so on.
I am going to be doing a lot with this, but I don't know how to tell R that, in this column, a ':' acts as a separator between products, and an 'x' signifies the number of "numerical" products that were purchased. Any ideas?
Some examples of what I am trying to know:
1. Which Product was purchased first in a session;
2. Which products were purchased most often with each other; and,
3. I'd like to be able to, say, aggregate sessions that contain certain combinations of products (e.g, 1067 and 1046 and Quinto)
I know this is a broad request for on here, but any info on how to get R to recognize these unique-to-this-column identifiers would be tremendously helpful. Thanks in advance.
Also, here's the dput()
structure(list(Session = c(20764L, 31976L, 27966L, 324L, 6943L,
14899L, 25756L, 27279L, 31981L, 2782L), TotSales = c(5, 5, 20,
1, 25, 2, 9, 5, 1, 520), GameList = structure(c(6L, 9L, 4L, 3L,
5L, 2L, 8L, 7L, 3L, 1L), .Label = c("1046x2:1046x2:1046x1:1046x1:1046x1:1046x4",
"1062x2", "1067x1", "1069x2", "1071x1:1064x1:1038x2:1034x1:ProductE",
"ProductA", "ProductD:ProductD:ProductD:ProductC", "ProductB:ProductB:ProductC",
"ProductB:ProductB:ProductB"), class = "factor")), .Names = c("Session",
"TotSales", "GameList"), row.names = c(320780L, 296529L, 98969L,
47065L, 19065L, 92026L, 327431L, 291843L, 296534L, 15055L), class = "data.frame")
Here is an alternate with data.table. I won't answer all your questions, but this should get you going. First, convert to long format:
library(data.table)
dt <- data.table(df) # assumes your data is in `df`
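split_fun <- function(x) {
  # split the session string into individual products
  y <- unlist(strsplit(as.character(x), ":"))
  # split "1069x2" into code and count; names without a trailing "x<digits>" stay whole
  z <- strsplit(y, "(?<=[0-9])x(?=[0-9]+$)", perl = TRUE)
  # repeat each numeric product by its count
  unlist(lapply(z, function(x) if (length(x) == 2) rep(x[[1]], as.integer(x[[2]])) else x[[1]]))
}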
dt.long <- dt[, list(TotSales, split_fun(GameList)), by=Session]
Now, to answer Q1 (first product in session):
dt.long[, head(V2, 1L), by=Session]
Produces:
Session V1
1: 20764 ProductA
2: 31976 ProductB
3: 27966 1069
4: 324 1067
... 6 rows omitted
And Q3 (aggregate sessions that contain multiple products):
dt.long[,
if(length(items <- .SD[all(c("ProductB") %in% V2), V2])) paste0(items, collapse=", "),
by=Session
]
Produces (note you don't have any sessions with more than one product shared, but you can easily modify the above for multiple products for your real data):
Session V1
1: 31976 ProductB, ProductB, ProductB
2: 25756 ProductC, ProductC, ProductB
Q2 is a bit trickier, but I'll leave that one to you. I'm also not 100% sure what you mean by that question. One thing worth highlighting, dt.long here has the products repeated however many times they were "xed". For example, with session 27966, product 1069 shows up twice, so you can count rows for each product if you want:
> dt.long[Session==27966]
Session TotSales V2
1: 27966 25 1069
2: 27966 25 1069
Note that the regular expression we use to split products will work so long as you don't have products with names (not codes) like "BLHABLBHA98877x998".
You need to parse the GameList column. This is probably kind of slow for bigger datasets, but should show the general idea:
options(stringsAsFactors=FALSE)
DF <- read.table(text="Session TotSales GameList
20764 15 ProductA
31976 7 ProductB:ProductB:ProductB
27966 25 1069x2
324 3 1067x1
6943 28 1071x1:1064x1:1038x2:1034x1:ProductE
14899 12 1062x2
25756 8 ProductC:ProductC:ProductB
27279 6 ProductD:ProductD:ProductD:PcoductC
31981 4 1067x1
2782 529 1046x2:1046x2:1046x1:1046x1:1046x1:1046x4", header=TRUE)
DF <- do.call(rbind,
lapply(seq_len(nrow(DF)),
function(i) cbind.data.frame(DF[i,-3],
Game=strsplit(DF$GameList, ":", fixed=TRUE)[[i]])))
DF <- cbind(DF,
t(sapply(strsplit(DF$Game, "x", fixed=TRUE),
function(x) {if (length(x)<2L) x <- c(x, 1); x})))
DF <- DF[,-3]
names(DF)[3:4] <- c("Game", "Amount")
DF$Amount <- as.integer(DF$Amount)
DF$index <- seq_len(nrow(DF))
# Session TotSales Game Amount index
# 1 20764 15 ProductA 1 1
# 2 31976 7 ProductB 1 2
# 3 31976 7 ProductB 1 3
# 4 31976 7 ProductB 1 4
# 31 27966 25 1069 2 5
# 41 324 3 1067 1 6
# 7 6943 28 1071 1 7
# 8 6943 28 1064 1 8
# 9 6943 28 1038 2 9
# 10 6943 28 1034 1 10
# 11 6943 28 ProductE 1 11
# 6 14899 12 1062 2 12
# 13 25756 8 ProductC 1 13
# 14 25756 8 ProductC 1 14
# 15 25756 8 ProductB 1 15
# 16 27279 6 ProductD 1 16
# 17 27279 6 ProductD 1 17
# 18 27279 6 ProductD 1 18
# 19 27279 6 PcoductC 1 19
# 91 31981 4 1067 1 20
# 21 2782 529 1046 2 21
# 22 2782 529 1046 2 22
# 23 2782 529 1046 1 23
# 24 2782 529 1046 1 24
# 25 2782 529 1046 1 25
# 26 2782 529 1046 4 26
Note that I assume that there is no x in the product names. If there is, you need a regex as shown by @BrodieG for splitting.
Now you can do things like this:
aggregate(Game~Session, DF, head, 1)
# Session Game
# 1 324 1067
# 2 2782 1046
# 3 6943 1071
# 4 14899 1062
# 5 20764 ProductA
# 6 25756 ProductC
# 7 27279 ProductD
# 8 27966 1069
# 9 31976 ProductB
# 10 31981 1067
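For the question about which products are purchased together (Q2), one possible starting point is a co-occurrence count built from the long format. This is a minimal sketch on a three-session subset of the example data; `long`, `present`, and `co` are names introduced here, and the long data frame is assumed to have one row per purchased product, as in the parsed DF above.

```r
# Long format: one row per product per session (subset of the example data).
long <- data.frame(
  Session = c(25756, 25756, 25756, 31976, 31976, 31976,
              6943, 6943, 6943, 6943, 6943),
  Game = c("ProductC", "ProductC", "ProductB",
           "ProductB", "ProductB", "ProductB",
           "1071", "1064", "1038", "1034", "ProductE"),
  stringsAsFactors = FALSE
)

# Session-by-product matrix: 1 if the product occurs in the session, else 0.
present <- 1 * (unclass(table(long$Session, long$Game)) > 0)

# co[i, j] is the number of sessions containing both product i and product j;
# the diagonal holds the number of sessions containing each product.
co <- crossprod(present)
```

For example, `co["ProductB", "ProductC"]` counts the sessions in which both ProductB and ProductC appear, regardless of how many times each was bought.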
