Homogenize use of single and double digit numbers in string


I have a very large data.table in which (a large number of) items are defined by strings including text and numbers.
library(data.table)
dd <- data.table(x = c("A4","A4","A4","A14","A14","A14","B4","B4","B4"),y = c("A4","A14","B4","A4","A14","B4","A4","A14","B4"), z = c(1,2,3,4,5,6,7,8,9))
x y z
A4 A4 1
A4 A14 2
A4 B4 3
A14 A4 4
A14 A14 5
A14 B4 6
B4 A4 7
B4 A14 8
B4 B4 9
Numbers can be single or double digit, so R sorts the strings character by character and places A14 before A4. gtools::mixedsort can handle this. However, when I reshape the long data to wide
wide <- dcast(dd, x ~ y, value.var = "z")
R again applies the default character ordering to the rows and columns:
x A14 A4 B4
A14 5 4 6
A4 2 1 3
B4 8 7 9
However, I need the original ordering for subsequent matrix calculations. Is there an efficient way to pad single-digit numbers to double digits (A4 -> A04), or another approach I have missed?

Another, and probably the easiest, option is to use mixedorder from the gtools-package:
wide <- dcast(dd, x ~ y, value.var = "z")[gtools::mixedorder(x)]
which gives:
> wide
x A14 A4 B4
1: A4 2 1 3
2: A14 5 4 6
3: B4 8 7 9
If you also want to get the column order set the same way, you can additionally use setcolorder:
setcolorder(wide, c(1, gtools::mixedorder(names(wide)[-1]) + 1))
which then gives:
> wide
x A4 A14 B4
1: A4 1 2 3
2: A14 4 5 6
3: B4 7 8 9
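If adding gtools is not an option, the same natural ordering can be sketched in base R by splitting each label into its prefix and its number and ordering on both. This is only a sketch: it assumes every label is a non-digit prefix followed by digits, and `ids`, `prefix`, `number` and `natural` are illustrative names, not part of the question.

```r
# Labels in the order the default character sort would return them
ids <- c("A14", "A4", "B4")

# Split each label into its non-digit prefix and its trailing number
prefix <- sub("[0-9]+$", "", ids)
number <- as.integer(sub("^[^0-9]+", "", ids))

# Order by prefix first, then numerically by the trailing number
natural <- order(prefix, number)
ids[natural]
# [1] "A4"  "A14" "B4"
```

The resulting index vector can be used wherever `gtools::mixedorder(x)` appears above, as long as the labels follow the assumed pattern.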

No additional zeros are required in this solution.
library(reshape2) # dcast for data frames
library(dplyr)
library(gtools)
# Data frame
df <- data.frame(x = c("A4","A4","A4","A14","A14","A14","B4","B4","B4"),
                 y = c("A4","A14","B4","A4","A14","B4","A4","A14","B4"),
                 z = c(1,2,3,4,5,6,7,8,9),
                 stringsAsFactors = FALSE)
# Reorder columns with `mixedsort` and rows with `mixedorder`.
wide <- dcast(df, x ~ y, value.var = "z") %>%
  select(x, all_of(mixedsort(unique(df$y)))) %>%
  slice(mixedorder(x))
gives,
# x A4 A14 B4
# 1 A4 1 2 3
# 2 A14 4 5 6
# 3 B4 7 8 9

You can use sprintf() to prepad numbers with 0s
sprintf("%s%02d", "A", 1:20)
# [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10" "A11" "A12" "A13" "A14" "A15" "A16" "A17" "A18" "A19" "A20"

You can add the 0s to your data with
dd[nchar(x) == 2, x := paste0(substr(x, 1, 1), 0, substr(x, 2, 2))]
dd[nchar(y) == 2, y := paste0(substr(y, 1, 1), 0, substr(y, 2, 2))]
# x y z
# 1: A04 A04 1
# 2: A04 A14 2
# 3: A04 B04 3
# 4: A14 A04 4
# 5: A14 A14 5
# 6: A14 B04 6
# 7: B04 A04 7
# 8: B04 A14 8
# 9: B04 B04 9
Or, if you need to apply to more columns:
to.change <- c('x', 'y')
dd[, (to.change) := lapply(.SD, function(x) ifelse(nchar(x) > 2, x
, paste0(substr(x, 1, 1), 0, substr(x, 2, 2))))
, .SDcols = to.change]
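The `substr` approach above relies on a one-letter prefix and at most two digits. A slightly more general sketch combines a regular expression with `sprintf`; `pad_id` and its `width` argument are hypothetical names introduced here, and the code assumes each label is a non-digit prefix followed only by digits.

```r
# Hypothetical helper: zero-pad the trailing number of each label to `width` digits
pad_id <- function(x, width = 2) {
  prefix <- sub("[0-9]+$", "", x)               # everything before the trailing digits
  number <- as.integer(sub("^[^0-9]*", "", x))  # the trailing digits as an integer
  sprintf(paste0("%s%0", width, "d"), prefix, number)
}

padded <- pad_id(c("A4", "A14", "B4"))
padded
# [1] "A04" "A14" "B04"
```

With data.table this could be applied as `dd[, (to.change) := lapply(.SD, pad_id), .SDcols = to.change]`, under the same assumption about the label format.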

You might want to consider implementing this order directly in the data through factors, so you don't have to fix it with data wrangling later.
If you already have these unique values sorted somewhere, you won't need mixedorder or mixedsort; just convert the columns to factors using those values as levels.
Otherwise, you can recover the order:
library(gtools)
dd[,1:2] <- lapply(dd[,1:2],function(x) factor(x, mixedsort(unique(x))))
And proceed normally:
dcast(dd, x ~ y, value.var = "z")
# x A4 A14 B4
# 1: A4 1 2 3
# 2: A14 4 5 6
# 3: B4 7 8 9
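Factor levels drive the layout in base R reshaping as well, which may be convenient if the wide result is needed as a matrix. The following is a minimal sketch, assuming the level order `lv` is already known; `lv` and `m` are illustrative names.

```r
# Desired order of the labels (assumed to be known in advance)
lv <- c("A4", "A14", "B4")

x <- factor(c("A4","A4","A4","A14","A14","A14","B4","B4","B4"), levels = lv)
y <- factor(c("A4","A14","B4","A4","A14","B4","A4","A14","B4"), levels = lv)
z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# tapply lays out rows and columns in factor-level order, not alphabetical order
m <- tapply(z, list(x, y), sum)
m
```

Rows and columns of `m` follow `lv`, so no reordering is needed before the matrix calculations.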

Related

Create multiple variables in data.table based other variables names [duplicate]

I am trying to create a series of variables, c1, c2, and c3, based on the values of two sets of variables, a1, a2, and a3, and b1, b2, and b3. The code below shows a hard-coded solution, but in reality I don't know the total number of sets of variables, say an and bn. As you can see, the names of the c variables depend on the names of the a and b variables.
Is there a way in data.table to do this? I tried to do it by using purrr::map2 within data.table but I could not make it work. I would highly appreciate your help.
Thanks.
library(data.table)
DT <- data.table(
a1 = c(1, 2, 3),
a2 = c(1, 2, 3)*2,
a3 = c(1, 2, 3)*3,
b1 = c(5, 6, 7),
b2 = c(5, 6, 7)*4,
b3 = c(5, 6, 7)*5
)
DT[]
#> a1 a2 a3 b1 b2 b3
#> 1: 1 2 3 5 20 25
#> 2: 2 4 6 6 24 30
#> 3: 3 6 9 7 28 35
DT[,
`:=`(
c1 = a1 + b1,
c2 = a2 + b2,
c3 = a3 + b3
)
]
DT[]
#> a1 a2 a3 b1 b2 b3 c1 c2 c3
#> 1: 1 2 3 5 20 25 6 22 28
#> 2: 2 4 6 6 24 30 8 28 36
#> 3: 3 6 9 7 28 35 10 34 44
Created on 2020-08-26 by the reprex package (v0.3.0)

This first part is mostly defensive, guarding against: a* variables without matching b* variables; vice versa; and different order of each:
anames <- grep("^a[0-9]+$", colnames(DT), value = TRUE)
bnames <- grep("^b[0-9]+$", colnames(DT), value = TRUE)
numnames <- intersect(gsub("^a", "", anames), gsub("^b", "", bnames))
anames <- sort(anames[gsub("^a", "", anames) %in% numnames])
bnames <- sort(bnames[gsub("^b", "", bnames) %in% numnames])
cnames <- gsub("^b", "c", bnames)
If you know the number ranges a priori and want something less-dynamic but more straight-forward, then
anames <- paste0("a", 1:3)
bnames <- paste0("b", 1:3)
cnames <- paste0("c", 1:3)
Now the magic:
DT[, (cnames) := Map(`+`, mget(anames), mget(bnames)) ]
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
# 1: 1 2 3 5 20 25 6 22 28
# 2: 2 4 6 6 24 30 8 28 36
# 3: 3 6 9 7 28 35 10 34 44
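The same pairing idea can be sketched on a plain data frame, which makes the defensive matching easy to check in isolation. The column names and values below are made up for illustration: `a3` has no `b3` partner, and the suffixes are ordered numerically so that 10 comes after 2.

```r
# Illustrative data: a3 has no matching b3, and the columns are out of order
df <- data.frame(a2 = c(1, 2), a10 = c(3, 4), a3 = c(5, 6),
                 b10 = c(10, 20), b2 = c(100, 200))

# Numeric suffixes present for each prefix
suffix_a <- sub("^a", "", grep("^a[0-9]+$", names(df), value = TRUE))
suffix_b <- sub("^b", "", grep("^b[0-9]+$", names(df), value = TRUE))

# Keep only suffixes that have both an a* and a b* column, in numeric order
common <- intersect(suffix_a, suffix_b)
common <- common[order(as.integer(common))]

# Add the matched pairs column by column
df[paste0("c", common)] <- unname(Map(`+`,
                                      df[paste0("a", common)],
                                      df[paste0("b", common)]))
df
```

Only `c2` and `c10` are created; `a3` is left alone because it has no partner.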

You could tackle this issue if you split DT column-wise by the pattern of the names first, and then aggregate it
# removes numbers from col names
(ptn <- sub("\\d", "", names(DT)))
# [1] "a" "a" "a" "b" "b" "b"
# get unique numbers contained in the col names (as strings but it doesn't matter here)
(nmb <- unique(sub("\\D", "", names(DT))))
# [1] "1" "2" "3"
Next step is to split DT and finally do the aggregation
DT[, paste0("c", nmb) := do.call(`+`, split.default(DT, f = ptn))]
Result
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44

We can melt to long format, create the column 'c', dcast into 'wide' format and then cbind
library(data.table)
cbind(DT, dcast(melt(DT, measure = patterns('^a', '^b'))[,
c := value1 + value2], rowid(variable) ~ paste0('c', variable),
value.var = 'c')[, variable := NULL])
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44

A base R option
u<-split.default(DT,gsub("\\D","",names(DT)))
cbind(DT,do.call(cbind,Map(rowSums,setNames(u,paste0("c",names(u))))))
which gives
a1 a2 a3 b1 b2 b3 c1 c2 c3
1: 1 2 3 5 20 25 6 22 28
2: 2 4 6 6 24 30 8 28 36
3: 3 6 9 7 28 35 10 34 44

Create a type column in dataframe

I have a dataframe:
x y
A1 ''
A2 '123,0'
A3 '4557777'
A4 '8756784321675'
A5 ''
A6 ''
A7
A8
A9 '1533,10'
A10
A11 '51'
I want to add column "type" to it, which has three types: 1,2,3. 1 is if value in y is a number without comma, 2 is for number with comma, 3 is for empty value ''(two apostrophes). So desired output is:
x y type
A1 '' 3
A2 '123,0' 2
A3 '4557777' 1
A4 '8756784321675' 1
A5 '' 3
A6 '' 3
A7
A8
A9 '1533,10' 2
A10
A11 '51' 1
How could I do it? The most unclear part for me is captioning each type in column y.

Here's a solution via ifelse and regex:
Data:
df <- data.frame(
y = c("", "", "1,234", "5678", "001,2", "", "455"), stringsAsFactors = F)
Solution:
df$type <- ifelse(grepl(",", df$y), 2,
ifelse(grepl("[^,]", df$y), 1, 3))
Result:
df
y type
1 3
2 3
3 1,234 2
4 5678 1
5 001,2 2
6 3
7 455 1
Update:
df <- data.frame(
y = c("''", "", "1,234", "5678", "001,2", "", "''", 455), stringsAsFactors = F)
df$type <- ifelse(grepl(",", df$y), 2,
ifelse(grepl("[^,']", df$y), 1,
ifelse(df$y=="", "", 3)))
df
y type
1 '' 3
2
3 1,234 2
4 5678 1
5 001,2 2
6
7 '' 3
8 455 1
Is this what you had in mind?
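A stricter variant can be sketched with anchored patterns, so that only values made entirely of digits, or digits separated by one comma, are classified. This is an assumption-laden sketch: it treats the two-apostrophe value as the literal string `''` and the blank rows as NA, and the sample vector `y` is made up.

```r
# Illustrative values: literal two-apostrophe strings, a missing value, and numbers
y <- c("''", NA, "1,234", "5678", "001,2", "''", "455")

# Start with NA so that missing or unrecognised values stay unclassified
type <- rep(NA_integer_, length(y))
type[grepl("^[0-9]+$", y)] <- 1L          # digits only
type[grepl("^[0-9]+,[0-9]+$", y)] <- 2L   # digits, one comma, digits
type[!is.na(y) & y == "''"] <- 3L         # the two-apostrophe "empty" value

type
# [1]  3 NA  2  1  2  3  1
```

`grepl` returns FALSE for NA input, which is why missing values keep their NA type here.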

Assuming the empty rows hold missing values (NA), I thought of dividing the rows into three parts:
Those which are empty strings (type 3)
Those which are convertible to numeric without producing NA (type 1)
Those which are missing (no value)
The only rows outside these sets are the ones that belong to type 2, so:
THREE <- which(df$y == "")
ONE <- which(!is.na(suppressWarnings(as.numeric(df$y))))
EMPTY <- which(is.na(df$y))
type <- rep(2, nrow(df)) # anything not matched below is type 2
type[THREE] <- 3
type[ONE] <- 1
type[EMPTY] <- NA
Finally, you have a vector that you can add to your data frame as a column with:
df2 <- cbind(df, type)

cbind with partially nested list

I'm trying to cbind or unnest or as.data.table a partially nested list.
id <- c(1,2)
A <- c("A1","A2","A3")
B <- c("B1")
AB <- list(A=A,B=B)
ABAB <- list(AB,AB)
nested_list <- list(id=id,ABAB=ABAB)
The length of id is the same as ABAB (2 in this case). I don't know how to unlist a part of this list (ABAB) and cbind another part (id). Here's my desired result as a data.table:
data.table(id=c(1,1,1,2,2,2),A=c("A1","A2","A3","A1","A2","A3"),B=rep("B1",6))
id A B
1: 1 A1 B1
2: 1 A2 B1
3: 1 A3 B1
4: 2 A1 B1
5: 2 A2 B1
6: 2 A3 B1

I haven't tested for more general cases, but this works for the OP example:
library(data.table)
as.data.table(nested_list)[, lapply(ABAB, as.data.table)[[1]], id]
# id A B
#1: 1 A1 B1
#2: 1 A2 B1
#3: 1 A3 B1
#4: 2 A1 B1
#5: 2 A2 B1
#6: 2 A3 B1
Or another option (which is probably faster, but is more verbose):
rbindlist(lapply(nested_list$ABAB, as.data.table),
idcol = 'id')[, id := nested_list$id[id]]

This is some super ugly base R, but produces the desired output.
Reduce(rbind, Map(function(x, y) setNames(data.frame(x, y), c("id", "A", "B")),
as.list(nested_list[[1]]),
lapply(unlist(nested_list[-1], recursive=FALSE),
function(x) Reduce(cbind, x))))
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
lapply takes a list of two elements (each containing the A and B variables) extracted with unlist and recursive=FALSE. It returns a list of character matrices with the B variable filled in by recycling. A list of the individual id variables from as.list(nested_list[[1]]) and the list of matrices are fed to Map, which converts corresponding pairs to a data.frame, gives the columns the desired names, and returns a list of data.frames. Finally, this list of data.frames is fed to Reduce, which rbinds the results into a single data.frame.
The final Reduce(rbind, could be replaced by data.table's rbindlist if desired.
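A shorter base R sketch of the same idea lets data.frame do the recycling of both `id` and `B` directly. It rebuilds the `nested_list` from the question so that it runs on its own; `pieces` and `result` are illustrative names.

```r
# Rebuild the example from the question
id <- c(1, 2)
AB <- list(A = c("A1", "A2", "A3"), B = "B1")
nested_list <- list(id = id, ABAB = list(AB, AB))

# One data frame per id; data.frame recycles the scalar id and the single B value
pieces <- Map(function(i, ab) data.frame(id = i, ab, stringsAsFactors = FALSE),
              nested_list$id, nested_list$ABAB)

# Stack the pieces into a single data frame
result <- do.call(rbind, pieces)
result
```

This relies on the shorter elements having a length that divides the longest one, as in the example; other shapes would need explicit handling.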

Here's another hideous solution
max_length = max(unlist(lapply(nested_list, function(x) lapply(x, lengths))))
data.frame(id = do.call(c, lapply(nested_list$id, rep, max_length)),
do.call(rbind, lapply(nested_list$ABAB, function(x)
do.call(cbind, lapply(x, function(y) {
if(length(y) < max_length) {
rep(y, max_length)
} else {
y
}
})))))
# id A B
#1 1 A1 B1
#2 1 A2 B1
#3 1 A3 B1
#4 2 A1 B1
#5 2 A2 B1
#6 2 A3 B1

And one more, also inelegant, but I'd gone too far by the time I saw the other answers.
restructure <- function(nested_l) {
ids <- as.numeric(max(unlist(lapply(unlist(nested_l, recursive = FALSE), function(x){
lapply(x, length)
}))))
temp = data.frame(rep(nested_l$id, each = ids),
sapply(1:length(nested_l$id), function(x){
out <-unlist(lapply(nested_l[[2]], function(y){
return(y[x])
}))
}))
names(temp) <- c("id", unique(substring(unlist(nested_l[2]), first = 1, last = 1)))
return(temp)
}
> restructure(nested_list)
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1

Joining the party:
library(tidyverse)
temp <- map(nested_list,~map(.x,~expand.grid(.x)))
df <- map_df(1:2,~cbind(temp$id[[.x]],temp$ABAB[[.x]]))
Var1 A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1

expand.grid with separate variable for each column

I would like to achieve the following data.frame in R:
i1 i2 i3
1 A1 A2 A3
2 No A2 A3
3 A1 No A3
4 No No A3
5 A1 A2 No
6 No A2 No
7 A1 No No
8 No No No
In each column the variable can either be the concatenated string "A" and the column number or "No". The data.frame should contain all possible combinations.
My idea was to use expand.grid, but I don't know how to create the list dynamically. Or is there a better approach?
expand.grid(list(c("A1", "No"), c("A2", "No"), c("A3", "No")))

I guess you could create your own helper function, something like that
MyList <- function(n) expand.grid(lapply(paste0("A", seq_len(n)), c, "No"))
Then simply pass it the number of elements (e.g., 3)
MyList(3)
# Var1 Var2 Var3
# 1 A1 A2 A3
# 2 No A2 A3
# 3 A1 No A3
# 4 No No A3
# 5 A1 A2 No
# 6 No A2 No
# 7 A1 No No
# 8 No No No
Alternatively, you could try data.table's CJ equivalent, which should be much more efficient than expand.grid for a big n
library(data.table)
DTCJ <- function(n) do.call(CJ, lapply(paste0("A", seq_len(n)), c, "No"))
DTCJ(3) # will return a sorted cross join
# V1 V2 V3
# 1: A1 A2 A3
# 2: A1 A2 No
# 3: A1 No A3
# 4: A1 No No
# 5: No A2 A3
# 6: No A2 No
# 7: No No A3
# 8: No No No
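To also get the column names i1, i2, ... shown in the question, the list can be named before it is passed to expand.grid. A minimal sketch follows; `make_grid` and `choices` are hypothetical names, and `stringsAsFactors = FALSE` is used so the columns stay character.

```r
# Hypothetical helper: all combinations of "A<i>" or "No" for columns i1..in
make_grid <- function(n) {
  choices <- lapply(seq_len(n), function(i) c(paste0("A", i), "No"))
  names(choices) <- paste0("i", seq_len(n))
  expand.grid(choices, stringsAsFactors = FALSE)
}

g <- make_grid(3)
g
```

As with plain expand.grid, the first column varies fastest, which matches the row order requested in the question.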

Another option is using Map with expand.grid
n <- 3
expand.grid(Map(c, paste0('A', seq_len(n)), 'No'))
Or
expand.grid(as.data.frame(rbind(paste0('A', seq_len(n)),'No')))

Another option, only using the most fundamental functions in R, is to use the indices:
df <- data.frame(V1 = c('A','A','A', 'A',rep('No',4)), V2 = c('A','A','No','No','A','A','No','No'), V3 = c('A','No','A','No','A','No','A','No'), stringsAsFactors = FALSE)
To get the row and column indices of the elements we need to change (the - 1 and + 1 keep the last row from wrapping around to 0):
rindex <- (which(df != 'No') - 1) %% nrow(df) + 1
cindex <- ceiling(which(df != 'No')/nrow(df))
the solution is basically a one-liner:
df[matrix(c(rindex,cindex),ncol=2)] <- paste0(df[matrix(c(rindex,cindex),ncol=2)],cindex)
> df
V1 V2 V3
1 A1 A2 A3
2 A1 A2 No
3 A1 No A3
4 A1 No No
5 No A2 A3
6 No A2 No
7 No No A3
8 No No No

3D array -> apply -> 3D array

It seems apply will not re-assemble 3D arrays when operating on just one margin. Consider:
arr <- array(
runif(2*4*3),
dim=c(2, 4, 3),
dimnames=list(a=paste0("a", 1:2), b=paste0("b", 1:4), c=paste0("c", 1:3))
)
# , , c = c1
#
# b
# a b1 b2 b3 b4
# a1 0.7321399 0.8851802 0.2469866 0.9307044
# a2 0.5896138 0.6183046 0.7732842 0.6652637
#
# , , c = c2
# b
# a b1 b2 b3 b4
# a1 0.5894680 0.7839048 0.3854357 0.56555024
# a2 0.6158995 0.6530224 0.8401427 0.04044974
#
# , , c = c3
# b
# a b1 b2 b3 b4
# a1 0.3500653 0.7052743 0.42487635 0.5689287
# a2 0.4097346 0.4527939 0.07192528 0.8638655
Now, make a 4 x 4 matrix to shuffle columns around in each of arr[, , i], and use apply to matrix multiply each a*b sub-matrix in arr to re-order their columns. The important point is that the result of each apply iteration is a matrix
cols.shuf.mx <- matrix(c(0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0), ncol=4)
apply(arr, 3, `%*%`, cols.shuf.mx)
# c
# c1 c2 c3
# [1,] 0.8851802 0.78390483 0.70527431
# [2,] 0.6183046 0.65302236 0.45279387
# [3,] 0.7321399 0.58946800 0.35006532
# [4,] 0.5896138 0.61589947 0.40973463
# [5,] 0.9307044 0.56555024 0.56892870
# [6,] 0.6652637 0.04044974 0.86386552
# [7,] 0.2469866 0.38543569 0.42487635
# [8,] 0.7732842 0.84014275 0.07192528
Whereas, I expected the result to be:
# , , c = c1
#
# a 1 2 3 4
# a1 0.8851802 0.7321399 0.9307044 0.2469866
# a2 0.6183046 0.5896138 0.6652637 0.7732842
#
# , , c = c2
#
# a 1 2 3 4
# a1 0.7839048 0.5894680 0.56555024 0.3854357
# a2 0.6530224 0.6158995 0.04044974 0.8401427
#
# , , c = c3
#
# a 1 2 3 4
# a1 0.7052743 0.3500653 0.5689287 0.42487635
# a2 0.4527939 0.4097346 0.8638655 0.07192528
I can get the expected result with plyr::aaply with:
aperm(aaply(arr, 3, `%*%`, cols.shuf.mx), c(2, 3, 1))
but was wondering if there is a simple base way to achieve this result (i.e. am I missing something obvious here to get the desired outcome).
I realize what occurs here is what is documented (If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1), but it still seems weird to me that if a function returns an object with dimensions they are basically ignored.

Here is a less than fantastic solution that requires foreknowledge of the dimensions of the function result matrix:
vapply(
1:dim(arr)[3],
function(x, y) arr[,,x] %*% y,
FUN.VALUE=arr[,,1],
y=cols.shuf.mx
)
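Since apply flattens each matrix result in column-major order, another base R sketch is to let apply flatten and then restore the dimensions afterwards. The array is regenerated here with a seed and without dimnames so the block runs on its own; `flat` and `res` are illustrative names, and the output dimensions still have to be known in advance.

```r
set.seed(1)
arr <- array(runif(2 * 4 * 3), dim = c(2, 4, 3))
cols.shuf.mx <- matrix(c(0,1,0,0, 1,0,0,0, 0,0,0,1, 0,0,1,0), ncol = 4)

# apply returns one flattened (column-major) 2 x 4 result per slice: an 8 x 3 matrix
flat <- apply(arr, 3, `%*%`, cols.shuf.mx)

# Restore the per-slice matrix shape: rows of arr, columns of the multiplier, slices
res <- array(flat, dim = c(dim(arr)[1], ncol(cols.shuf.mx), dim(arr)[3]))
dim(res)
# [1] 2 4 3
```

Dimnames are lost in this reshaping and would have to be reattached with `dimnames(res) <- ...` if they are needed.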

If you read the help page for apply, it basically agrees with your first sentence. It is set up with a particular design and you would need to construct a new function to do something differently. BTW: This gives you the same result much more simply than that aperm(aaply(...)) rigamarole:
arr[ , c(2,1,4,3) , ]
#-------------------------
, , c = c1
b
a b2 b1 b4 b3
a1 0.4089769 0.2875775 0.5281055 0.9404673
a2 0.8830174 0.7883051 0.8924190 0.0455565
, , c = c2
b
a b2 b1 b4 b3
a1 0.9568333 0.5514350 0.1029247 0.6775706
a2 0.4533342 0.4566147 0.8998250 0.5726334
, , c = c3
b
a b2 b1 b4 b3
a1 0.3279207 0.24608773 0.6405068 0.8895393
a2 0.9545036 0.04205953 0.9942698 0.6928034
