Splitting column into multi-columns - r

For the data given below,
data1<-structure(list(var1 = c("2 7", "2 6 7", "2 7", "2 7", "1 7",
"1 7", "1 5", "1 2 7", "1 5", "1 7", "1 2 3 4 5 6 7", "1 2 4 6"
)), .Names = "var1", class = "data.frame", row.names = c(NA,
-12L))
> data1
var1
1 2 7
2 2 6 7
3 2 7
4 2 7
5 1 7
6 1 7
7 1 5
8 1 2 7
9 1 5
10 1 7
11 1 2 3 4 5 6 7
12 1 2 4 6
I would like to split it into seven columns, as follows:
v1 v2 v3 v4 v5 v6 v7
1 NA 2 NA NA NA NA 7
2 NA 2 NA NA NA 6 7
3 NA 2 NA NA NA NA 7
4 NA 2 NA NA NA NA 7
5 1 NA NA NA NA NA 7
6 1 NA NA NA NA NA 7
7 1 NA NA NA 5 NA NA
8 1 2 NA NA NA NA 7
9 1 NA NA NA 5 NA NA
10 1 NA NA NA NA NA 7
11 1 2 3 4 5 6 7
12 1 2 NA 4 NA 6 NA
I use the tstrsplit from data.table package as follows:
library(data.table)
setDT(data1)[,tstrsplit(var1," ")]
V1 V2 V3 V4 V5 V6 V7
1: 2 7 NA NA NA NA NA
2: 2 6 7 NA NA NA NA
3: 2 7 NA NA NA NA NA
4: 2 7 NA NA NA NA NA
5: 1 7 NA NA NA NA NA
6: 1 7 NA NA NA NA NA
7: 1 5 NA NA NA NA NA
8: 1 2 7 NA NA NA NA
9: 1 5 NA NA NA NA NA
10: 1 7 NA NA NA NA NA
11: 1 2 3 4 5 6 7
12: 1 2 4 6 NA NA NA
This differs from the expected output. How can I get the expected output described above?

With data.table you may try
library(magrittr)
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))] %>%
dcast(., rn ~ V1)
rn 1 2 3 4 5 6 7
1: 1 NA 2 NA NA NA NA 7
2: 2 NA 2 NA NA NA 6 7
3: 3 NA 2 NA NA NA NA 7
4: 4 NA 2 NA NA NA NA 7
5: 5 1 NA NA NA NA NA 7
6: 6 1 NA NA NA NA NA 7
7: 7 1 NA NA NA 5 NA NA
8: 8 1 2 NA NA NA NA 7
9: 9 1 NA NA NA 5 NA NA
10: 10 1 NA NA NA NA NA 7
11: 11 1 2 3 4 5 6 7
12: 12 1 2 NA 4 NA 6 NA
To get rid of the rn column, we can use
setDT(data1)[, strsplit(var1," "), by = .(rn = 1:nrow(data1))][
, dcast(.SD, rn ~ V1)][, rn := NULL][]
Explanation
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))]
creates a data.table directly in long format
rn V1
1: 1 2
2: 1 7
3: 2 2
4: 2 6
5: 2 7
6: 3 2
7: 3 7
8: 4 2
9: 4 7
10: 5 1
11: 5 7
12: 6 1
13: 6 7
14: 7 1
15: 7 5
16: 8 1
17: 8 2
18: 8 7
19: 9 1
20: 9 5
21: 10 1
22: 10 7
23: 11 1
24: 11 2
25: 11 3
26: 11 4
27: 11 5
28: 11 6
29: 11 7
30: 12 1
31: 12 2
32: 12 4
33: 12 6
rn V1
which is then reshaped to wide format using dcast().
If we used tstrsplit() instead of strsplit(), we would get a data.table in wide format, which would then need to be reshaped to long format using melt():
setDT(data1)[,tstrsplit(var1," ")][, rn := .I][
, melt(.SD, id = "rn", na.rm = TRUE)][
, dcast(.SD, rn ~ paste0("V", value))][
, rn := NULL][]

In base R, we can do this by splitting the strings on one or more whitespace characters (\\s+), creating a row/column index matrix ('i1'), and filling an NA matrix ('m1') with the unlisted split values:
lst <- lapply(strsplit(data1$var1, "\\s+"), as.numeric)
i1 <- cbind(rep(1:nrow(data1), lengths(lst)), unlist(lst))
m1 <- matrix(NA, nrow = max(i1[,1]), ncol = max(i1[,2]))
m1[i1] <- unlist(lst)
as.data.frame(m1)
# V1 V2 V3 V4 V5 V6 V7
#1 NA 2 NA NA NA NA 7
#2 NA 2 NA NA NA 6 7
#3 NA 2 NA NA NA NA 7
#4 NA 2 NA NA NA NA 7
#5 1 NA NA NA NA NA 7
#6 1 NA NA NA NA NA 7
#7 1 NA NA NA 5 NA NA
#8 1 2 NA NA NA NA 7
#9 1 NA NA NA 5 NA NA
#10 1 NA NA NA NA NA 7
#11 1 2 3 4 5 6 7
#12 1 2 NA 4 NA 6 NA
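The base R answer above relies on indexing a matrix with a two-column matrix, where each row of the index is read as a (row, column) pair. A minimal sketch of that mechanism on a toy matrix (the names m and idx are made up for illustration):

```r
# An empty 2 x 4 matrix to fill in.
m <- matrix(NA_integer_, nrow = 2, ncol = 4)

# Each row of idx is one (row, column) position in m.
idx <- cbind(c(1, 1, 2), c(2, 4, 1))

# Write the column number into each addressed cell;
# all other cells stay NA.
m[idx] <- idx[, 2]
m
```

Because the split values double as column positions, the same vector serves as both the index and the value being written, which appears to be what `m1[i1] <- unlist(lst)` does above.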

Related

Easiest way to replace non-NA values by column index

I have a data frame like the following, with some NAs:
mydf=data.frame(ID=LETTERS[1:10], aaa=runif(10), bbb=runif(10), ccc=runif(10), ddd=runif(10))
mydf[c(1,4,5,7:10),2]=NA
mydf[c(1,2,4:8),3]=NA
mydf[c(3,4,6:10),4]=NA
mydf[c(1,3,4,6,9,10),5]=NA
> mydf
ID aaa bbb ccc ddd
1 A NA NA 0.08844614 NA
2 B 0.4912790 NA 0.88925139 0.1233173
3 C 0.1325188 0.1389260 NA NA
4 D NA NA NA NA
5 E NA NA 0.60750723 0.6357998
6 F 0.8218579 NA NA NA
7 G NA NA NA 0.5988206
8 H NA NA NA 0.4008338
9 I NA 0.8784563 NA NA
10 J NA 0.2959320 NA NA
What I want to accomplish here is the following:
1- replace non-NA values by column index -1, so that the output looks like this:
> mydf
ID aaa bbb ccc ddd
1 A NA NA 3 NA
2 B 1 NA 3 4
3 C 1 2 NA NA
4 D NA NA NA NA
5 E NA NA 3 4
6 F 1 NA NA NA
7 G NA NA NA 4
8 H NA NA NA 4
9 I NA 2 NA NA
10 J NA 2 NA NA
2- Then I would like to add an extra column that shows the following:
0 for all NAs in a row
0 for a row with more than 1 non-NA value
the actual value when it is the only non-NA value in a row
The final result should look like this:
> mydf
ID aaa bbb ccc ddd final
1 A NA NA 3 NA 3
2 B 1 NA 3 4 0
3 C 1 2 NA NA 0
4 D NA NA NA NA 0
5 E NA NA 3 4 0
6 F 1 NA NA NA 1
7 G NA NA NA 4 4
8 H NA NA NA 4 4
9 I NA 2 NA NA 2
10 J NA 2 NA NA 2
I could probably do all this with an ugly for loop, then aggregate for the final column, and substitute by 0 where appropriate...
But I was wondering if there would be a clean way to do this with some apply calls in just a few lines...
Thanks!
You could do:
mydf[-1] <- sapply(1:4, \(x) x * mydf[x+1]/mydf[x+1])
mydf$final <- apply(mydf[-1], 1, function(x) {
if(all(is.na(x)) | sum(!is.na(x)) > 1) 0 else na.omit(x)
})
Result:
mydf
#> ID aaa bbb ccc ddd final
#> 1 A NA NA 3 NA 3
#> 2 B 1 NA 3 4 0
#> 3 C 1 2 NA NA 0
#> 4 D NA NA NA NA 0
#> 5 E NA NA 3 4 0
#> 6 F 1 NA NA NA 1
#> 7 G NA NA NA 4 4
#> 8 H NA NA NA 4 4
#> 9 I NA 2 NA NA 2
#> 10 J NA 2 NA NA 2
Created on 2022-12-16 with reprex v2.0.2
Here is an idea,
mydf1 <- cbind.data.frame(ID = mydf$ID, mapply(function(x, y) replace(x, !is.na(x), y),
mydf,
seq(ncol(mydf)) - 1)[,-1])
mydf1$final <- apply(mydf1[-1], 1, \(i)
ifelse(sum(is.na(i)) == (ncol(mydf) - 1) | sum(!is.na(i)) > 1, 0, i[!is.na(i)]))
mydf1
ID aaa bbb ccc ddd final
1 A <NA> <NA> 3 <NA> 3
2 B 1 <NA> 3 4 0
3 C 1 2 <NA> <NA> 0
4 D <NA> <NA> <NA> <NA> 0
5 E <NA> <NA> 3 4 0
6 F 1 <NA> <NA> <NA> 1
7 G <NA> <NA> <NA> 4 4
8 H <NA> <NA> <NA> 4 4
9 I <NA> 2 <NA> <NA> 2
10 J <NA> 2 <NA> <NA> 2
A third option could be
tmp <- mydf[,-1]
tmp[!is.na(tmp)] <- 1
(mydf[,-1] <- tmp * as.list(1:4))
# aaa bbb ccc ddd
#1 NA NA 3 NA
#2 1 NA 3 4
#3 1 2 NA NA
#4 NA NA NA NA
#5 NA NA 3 4
#6 1 NA NA NA
#7 NA NA NA 4
#8 NA NA NA 4
#9 NA 2 NA NA
#10 NA 2 NA NA
The final column can be generated like this
idx <- rowSums(tmp, na.rm = TRUE) == 1
mydf$final <- idx * max.col(replace(tmp, is.na(tmp), -Inf))
Result
mydf
# ID aaa bbb ccc ddd final
#1 A NA NA 3 NA 3
#2 B 1 NA 3 4 0
#3 C 1 2 NA NA 0
#4 D NA NA NA NA 0
#5 E NA NA 3 4 0
#6 F 1 NA NA NA 1
#7 G NA NA NA 4 4
#8 H NA NA NA 4 4
#9 I NA 2 NA NA 2
#10 J NA 2 NA NA 2
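Another possible variant, sketched here on a cut-down version of the data (four rows, three value columns, values chosen for illustration), uses col() to get the column index of every cell and rowSums() to count the non-NA entries per row. The names vals, idx and n_present are assumptions of this sketch, not part of the answers above.

```r
# Small stand-in for mydf; only the NA pattern matters.
mydf <- data.frame(ID  = c("A", "B", "C", "D"),
                   aaa = c(NA, 0.49, 0.13, NA),
                   bbb = c(NA, NA, 0.14, NA),
                   ccc = c(0.09, 0.89, NA, NA))

vals <- as.matrix(mydf[-1])
idx <- col(vals)                 # column index of every cell
idx[is.na(vals)] <- NA           # keep NA where the data is NA

n_present <- rowSums(!is.na(idx))   # non-NA count per row

mydf[-1] <- as.data.frame(idx)
# The single non-NA index when there is exactly one, else 0.
mydf$final <- ifelse(n_present == 1, rowSums(idx, na.rm = TRUE), 0)
mydf
```

With exactly one non-NA cell in a row, the row sum with na.rm = TRUE is that cell's index, so no per-row apply() call is needed.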

Transpose multiple variable data by group in R

I have these data:
db <- read.table(header=T, text="
ID site R S data
1 1 2 10 01/02/2021
1 1 3 20 03/02/2021
1 2 4 50 05/01/2021
2 1 7 40 02/02/2021
2 2 2 30 05/02/2021
2 2 5 60 06/02/2021
2 2 9 10 07/02/2021
3 1 2 20 02/02/2021
3 2 6 30 03/02/2021
4 1 4 40 05/02/2021
5 1 5 20 07/02/2021")
And I want to get the following result:
db_transpose <- read.table(header=T, text="
ID site R S data R_1 S_1 data_1 R_2 S_2 data_2
1 1 2 10 01/02/2021 3 20 03/02/2021 NA NA NA
1 2 4 50 05/01/2021 NA NA NA NA NA NA
2 1 7 40 02/02/2021 NA NA NA NA NA NA
2 2 2 30 05/02/2021 5 60 06/02/2021 9 10 07/02/2021
3 1 2 20 02/02/2021 NA NA NA NA NA NA
3 2 6 30 03/02/2021 NA NA NA NA NA NA
4 1 4 40 05/02/2021 NA NA NA NA NA NA
5 1 5 20 07/02/2021 NA NA NA NA NA NA")
For every combination of ID and site, I would like to transpose the columns R, S and data, in order of data.
I've tried with reshape2 without success.
Here I've tried with the S and R variables:
require(reshape2)
dcast(db, ID + site ~ data, value.var=c ("S", "R"))
But I obtain this error message:
Error in .subset2(x, i, exact = exact) : index out of bounds
Warning message:
In if (!(value.var %in% names(data))) { :
the condition has length > 1 and only the first element will be used
Here I've tried only with one variable:
dcast(db, ID + site ~ data, value.var="S")
But I obtain a result totally different from what I need:
ID site 01/02/2021 02/02/2021 03/02/2021 05/01/2021 05/02/2021 06/02/2021 07/02/2021
1 1 2 NA NA NA 50 NA NA NA
2 1 1 10 NA 20 NA NA NA NA
3 2 1 NA 40 NA NA NA NA NA
4 2 2 NA NA NA NA 30 60 10
5 3 1 NA 20 NA NA NA NA NA
6 3 2 NA NA 30 NA NA NA NA
7 4 1 NA NA NA NA 40 NA NA
8 5 1 NA NA NA NA NA NA 20
Thank you
In base R >= 4.1 (the native pipe |> was introduced in R 4.1.0):
transform(db, time = ave(ID, ID, site, FUN = seq_along)) |>
reshape(dir = 'wide', idvar = c('ID', 'site'), sep = '_')
ID site R_1 S_1 data_1 R_2 S_2 data_2 R_3 S_3 data_3
1 1 1234 2 10 01/02/2021 3 20 03/02/2021 NA NA <NA>
3 1 1224 4 50 05/01/2021 NA NA <NA> NA NA <NA>
4 2 1234 7 40 02/02/2021 NA NA <NA> NA NA <NA>
5 2 1342 2 30 05/02/2021 5 60 06/02/2021 9 10 07/02/2021
8 3 1234 2 20 02/02/2021 NA NA <NA> NA NA <NA>
9 3 3421 6 30 03/02/2021 NA NA <NA> NA NA <NA>
10 4 1234 4 40 05/02/2021 NA NA <NA> NA NA <NA>
11 5 1234 5 20 07/02/2021 NA NA <NA> NA NA <NA>
In base R < 4.1 (no native pipe), do:
reshape(transform(db, time = ave(ID, ID, site, FUN = seq_along)),
dir = 'wide', idvar = c('ID', 'site'), sep = '_')
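Both base R variants depend on a within-group counter: ave() with FUN = seq_along numbers the rows inside each ID/site group, and reshape() then uses that counter as the time variable. A minimal sketch on a reduced table (only the R column, four rows taken from the question's data):

```r
# Reduced version of db: two rows for (ID 1, site 1), one each otherwise.
db <- data.frame(ID   = c(1, 1, 1, 2),
                 site = c(1, 1, 2, 1),
                 R    = c(2, 3, 4, 7))

# Number the rows within each ID/site group: 1, 2, 1, 1.
db$time <- ave(seq_along(db$ID), db$ID, db$site, FUN = seq_along)

# Spread R over the counter; groups with fewer rows get NA.
wide <- reshape(db, direction = "wide",
                idvar = c("ID", "site"), timevar = "time", sep = "_")
wide
```

Here timevar is spelled out for clarity; the answer above omits it because "time" is the default.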
In the tidyverse:
library(tidyverse)
db %>%
group_by(ID, site) %>%
mutate(name = row_number())%>%
pivot_wider(id_cols = c(ID, site), names_from = name, values_from = c(R, S, data), names_sep = '_')
# A tibble: 8 x 11
# Groups: ID, site [8]
ID site R_1 R_2 R_3 S_1 S_2 S_3 data_1 data_2 data_3
<int> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 1234 2 3 NA 10 20 NA 01/02/2021 03/02/2021 NA
2 1 1224 4 NA NA 50 NA NA 05/01/2021 NA NA
3 2 1234 7 NA NA 40 NA NA 02/02/2021 NA NA
4 2 1342 2 5 9 30 60 10 05/02/2021 06/02/2021 07/02/2021
5 3 1234 2 NA NA 20 NA NA 02/02/2021 NA NA
6 3 3421 6 NA NA 30 NA NA 03/02/2021 NA NA
7 4 1234 4 NA NA 40 NA NA 05/02/2021 NA NA
8 5 1234 5 NA NA 20 NA NA 07/02/2021 NA NA
With data.table:
library(data.table)
dcast(setDT(db), ID + site ~ rowid(ID, site), value.var = c('R', 'S', 'data'),sep = '_')
ID site R_1 R_2 R_3 S_1 S_2 S_3 data_1 data_2 data_3
1: 1 1224 4 NA NA 50 NA NA 05/01/2021 <NA> <NA>
2: 1 1234 2 3 NA 10 20 NA 01/02/2021 03/02/2021 <NA>
3: 2 1234 7 NA NA 40 NA NA 02/02/2021 <NA> <NA>
4: 2 1342 2 5 9 30 60 10 05/02/2021 06/02/2021 07/02/2021
5: 3 1234 2 NA NA 20 NA NA 02/02/2021 <NA> <NA>
6: 3 3421 6 NA NA 30 NA NA 03/02/2021 <NA> <NA>
7: 4 1234 4 NA NA 40 NA NA 05/02/2021 <NA> <NA>
8: 5 1234 5 NA NA 20 NA NA 07/02/2021 <NA> <NA>

R - List All Combinations With combn (Multiple m Values)

I would like to build a dataframe that lists all possible combinations of 6 numbers.
I realised that I can use combn(), but with only one value for m. With a bit of playing around I got the desired result by going through step-by-step with the following code -
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combi2 <- data.frame(t(combn(c(1:6), 2)))
combi3 <- data.frame(t(combn(c(1:6), 3)))
combi4 <- data.frame(t(combn(c(1:6), 4)))
combi5 <- data.frame(t(combn(c(1:6), 5)))
combi6 <- data.frame(t(combn(c(1:6), 6)))
Combi <- plyr::rbind.fill(combi1, combi2, combi3, combi4, combi5, combi6) # rbind.fill() comes from the plyr package
I had to transpose each of the DFs to get them in the right shape.
My problem is that this seems quite inefficient, and perhaps a bit simplistic. I thought there must surely be a quicker way to code this, but I haven't found any solution online that gives me what I'd like.
Possibly build it into a function or a loop somehow? I'm fairly new to R though and haven't had a great deal of practice writing functions.
Is this what you want?
combis <- vector("list", 6)
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combis[[1]] <- combi1
combis[2:6] <- lapply(2:6, function(n) data.frame(t(combn(c(1:6), n))))
do.call(plyr::rbind.fill, combis)
Result:
X1 X2 X3 X4 X5 X6
1 1 NA NA NA NA NA
2 2 NA NA NA NA NA
3 3 NA NA NA NA NA
4 4 NA NA NA NA NA
5 5 NA NA NA NA NA
6 6 NA NA NA NA NA
7 1 2 NA NA NA NA
8 1 3 NA NA NA NA
9 1 4 NA NA NA NA
10 1 5 NA NA NA NA
11 1 6 NA NA NA NA
12 2 3 NA NA NA NA
13 2 4 NA NA NA NA
14 2 5 NA NA NA NA
15 2 6 NA NA NA NA
16 3 4 NA NA NA NA
17 3 5 NA NA NA NA
18 3 6 NA NA NA NA
19 4 5 NA NA NA NA
20 4 6 NA NA NA NA
21 5 6 NA NA NA NA
22 1 2 3 NA NA NA
23 1 2 4 NA NA NA
24 1 2 5 NA NA NA
25 1 2 6 NA NA NA
26 1 3 4 NA NA NA
27 1 3 5 NA NA NA
28 1 3 6 NA NA NA
29 1 4 5 NA NA NA
30 1 4 6 NA NA NA
31 1 5 6 NA NA NA
32 2 3 4 NA NA NA
33 2 3 5 NA NA NA
34 2 3 6 NA NA NA
35 2 4 5 NA NA NA
36 2 4 6 NA NA NA
37 2 5 6 NA NA NA
38 3 4 5 NA NA NA
39 3 4 6 NA NA NA
40 3 5 6 NA NA NA
41 4 5 6 NA NA NA
42 1 2 3 4 NA NA
43 1 2 3 5 NA NA
44 1 2 3 6 NA NA
45 1 2 4 5 NA NA
46 1 2 4 6 NA NA
47 1 2 5 6 NA NA
48 1 3 4 5 NA NA
49 1 3 4 6 NA NA
50 1 3 5 6 NA NA
51 1 4 5 6 NA NA
52 2 3 4 5 NA NA
53 2 3 4 6 NA NA
54 2 3 5 6 NA NA
55 2 4 5 6 NA NA
56 3 4 5 6 NA NA
57 1 2 3 4 5 NA
58 1 2 3 4 6 NA
59 1 2 3 5 6 NA
60 1 2 4 5 6 NA
61 1 3 4 5 6 NA
62 2 3 4 5 6 NA
63 1 2 3 4 5 6
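Since the question asks about wrapping this in a function, here is one possible base R sketch that avoids plyr by padding each block of combinations with NA columns before stacking them. The function name all_combinations is hypothetical, and the sketch assumes x has at least two elements (combn() treats a single positive integer as seq_len(x)).

```r
# Hypothetical helper: all combinations of every size, NA-padded.
all_combinations <- function(x) {
  n <- length(x)
  blocks <- lapply(seq_len(n), function(m) {
    cmb <- t(combn(x, m))   # one combination per row
    # Pad on the right so every block has n columns.
    cbind(cmb, matrix(NA, nrow = nrow(cmb), ncol = n - m))
  })
  out <- as.data.frame(do.call(rbind, blocks))
  names(out) <- paste0("X", seq_len(n))
  out
}

combi <- all_combinations(1:6)
dim(combi)
```

For six elements this gives 6 + 15 + 20 + 15 + 6 + 1 = 63 rows, in the same order as the result above.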

cSplit_e from splitstackshape package not accounting for NA's?

I wanted to follow up on the question that I posted here. While I received base R and data.table solutions, I was trying to implement the same using cSplit_e from the splitstackshape package, as suggested in a comment on my previous post. With the modified data below (i.e. with NAs),
data1<-structure(list(reason = c("1", "1", NA, "1", "1", "4 5", "1",
"1", "1", "1", "1", "1 2 3 4", "1 2 5", NA, NA)), .Names = "reason", class = "data.frame", row.names = c(NA,
-15L))
#loading packages
library(data.table)
library(splitstackshape)
cSplit_e(setDT(data1),1," ",mode = "value") # with NA's doesn't work
Error in seq.default(min(vec), max(vec)) : 'from' must be a finite number
data2<-na.omit(setDT(data1),cols="reason") # removing NA's
cSplit_e(data2,1," ",mode = "value") # without NA's works
reason reason_1 reason_2 reason_3 reason_4 reason_5
1: 1 1 NA NA NA NA
2: 1 1 NA NA NA NA
3: 1 1 NA NA NA NA
4: 1 1 NA NA NA NA
5: 4 5 NA NA NA 4 5
6: 1 1 NA NA NA NA
7: 1 1 NA NA NA NA
8: 1 1 NA NA NA NA
9: 1 1 NA NA NA NA
10: 1 1 NA NA NA NA
11: 1 2 3 4 1 2 3 4 NA
12: 1 2 5 1 2 NA NA 5
So, the question is: does cSplit_e account for NAs in the column to be split?
This has been fixed in the bugfix release (v1.4.4) of "splitstackshape". Thanks for reporting it.
After using update.packages(), you should be able to do:
packageVersion("splitstackshape")
## [1] ‘1.4.4’
cSplit_e(data1, 1, " ", mode = "value")
## reason reason_1 reason_2 reason_3 reason_4 reason_5
## 1 1 1 NA NA NA NA
## 2 1 1 NA NA NA NA
## 3 <NA> NA NA NA NA NA
## 4 1 1 NA NA NA NA
## 5 1 1 NA NA NA NA
## 6 4 5 NA NA NA 4 5
## 7 1 1 NA NA NA NA
## 8 1 1 NA NA NA NA
## 9 1 1 NA NA NA NA
## 10 1 1 NA NA NA NA
## 11 1 1 NA NA NA NA
## 12 1 2 3 4 1 2 3 4 NA
## 13 1 2 5 1 2 NA NA 5
## 14 <NA> NA NA NA NA NA
## 15 <NA> NA NA NA NA NA
Note that 1.4.4 has moved "data.table" from "depends" to "imports".
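If upgrading is not an option, a base R fallback can produce the same kind of value-positioned columns while tolerating NA. The following is only a sketch: split_by_value and its prefix argument are made-up names, and it assumes the non-NA entries are space-separated positive integers with at least one non-NA value present.

```r
# Hypothetical helper: one column per value, NA rows left empty.
split_by_value <- function(x, sep = " ", prefix = "reason_") {
  # strsplit() returns NA for NA input; drop it so the row has no parts.
  parts <- lapply(strsplit(x, sep, fixed = TRUE),
                  function(p) as.integer(p[!is.na(p)]))
  values <- unlist(parts)
  out <- matrix(NA_integer_, nrow = length(x), ncol = max(values))
  # (row, column) pairs: the row each value came from, and the value itself.
  idx <- cbind(rep(seq_along(x), lengths(parts)), values)
  out[idx] <- values
  colnames(out) <- paste0(prefix, seq_len(ncol(out)))
  as.data.frame(out)
}

res <- split_by_value(c("1", NA, "4 5", "1 2 5"))
res
```

Rows whose input is NA contribute no index pairs, so they simply stay NA across all columns.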

Multiple aggregation in R with 4 parameters

I have four vectors (columns)
x y z t
1 1 1 10
1 1 1 15
2 4 1 14
2 3 1 15
2 2 1 17
2 1 2 19
1 4 2 18
1 4 2 NA
2 2 2 45
3 3 2 NA
3 1 3 59
4 3 3 23
1 4 3 45
4 4 4 74
2 1 4 86
How can I calculate mean and median of vector t, for each value of vector y (from 1 to 4) where x=1, z=1, using aggregate function in R?
How to do it with 3 parameters has already been discussed (Multiple Aggregation in R), but it's a little unclear how to do it with 4 parameters.
Thank you.
You could try something like this in data.table
library(data.table)
data <- data.table(yourdataframe) # yourdataframe: the data frame holding x, y, z and t
bar <- data[,.N,by=y]
foo <- data[x==1 & z==1,list(mean.t=mean(t,na.rm=T),median.t=median(t,na.rm=T)),by=y]
merge(bar[,list(y)],foo,by="y",all.x=T)
y mean.t median.t
1: 1 12.5 12.5
2: 2 NA NA
3: 3 NA NA
4: 4 NA NA
You probably could do the same in aggregate, but I am not sure you can do it in one easy step.
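For the aggregate() route that the question asks about, one possible base R sketch is to filter first, aggregate with a function that returns both statistics, and then merge back onto all values of y. The names df, stats and result are assumptions of this sketch; the data is retyped from the question.

```r
# The question's data.
df <- data.frame(
  x = c(1, 1, 2, 2, 2, 2, 1, 1, 2, 3, 3, 4, 1, 4, 2),
  y = c(1, 1, 4, 3, 2, 1, 4, 4, 2, 3, 1, 3, 4, 4, 1),
  z = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  t = c(10, 15, 14, 15, 17, 19, 18, NA, 45, NA, 59, 23, 45, 74, 86)
)

# Mean and median of t per y, restricted to x == 1 and z == 1.
# The formula interface drops rows with NA in t by default.
stats <- aggregate(t ~ y, data = subset(df, x == 1 & z == 1),
                   FUN = function(v) c(mean = mean(v), median = median(v)))
stats <- do.call(data.frame, stats)   # flatten the matrix column

# Keep every y from 1 to 4, with NA where no rows matched.
result <- merge(data.frame(y = sort(unique(df$y))), stats,
                by = "y", all.x = TRUE)
result
```

The flattening step turns the two-column matrix that aggregate() returns into ordinary columns named t.mean and t.median, which makes the merge straightforward.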
An answer to an additional request in the comments:
bar <- data.table(expand.grid(y=unique(data$y),z=unique(data[z %in% c(1,2,3,4),z])))
foo <- data[x==1 & z %in% c(1,2,3,4),list(
mean.t=mean(t,na.rm=T),
median.t=median(t,na.rm=T),
Q25.t=quantile(t,0.25,na.rm=T),
Q75.t=quantile(t,0.75,na.rm=T)
),by=list(y,z)]
merge(bar[,list(y,z)],foo,by=c("y","z"),all.x=T)
y z mean.t median.t Q25.t Q75.t
1: 1 1 12.5 12.5 11.25 13.75
2: 1 2 NA NA NA NA
3: 1 3 NA NA NA NA
4: 1 4 NA NA NA NA
5: 2 1 NA NA NA NA
6: 2 2 NA NA NA NA
7: 2 3 NA NA NA NA
8: 2 4 NA NA NA NA
9: 3 1 NA NA NA NA
10: 3 2 NA NA NA NA
11: 3 3 NA NA NA NA
12: 3 4 NA NA NA NA
13: 4 1 NA NA NA NA
14: 4 2 18.0 18.0 18.00 18.00
15: 4 3 45.0 45.0 45.00 45.00
16: 4 4 NA NA NA NA

Resources