I have these data:
db <- read.table(header=T, text="
ID site R S data
1 1 2 10 01/02/2021
1 1 3 20 03/02/2021
1 2 4 50 05/01/2021
2 1 7 40 02/02/2021
2 2 2 30 05/02/2021
2 2 5 60 06/02/2021
2 2 9 10 07/02/2021
3 1 2 20 02/02/2021
3 2 6 30 03/02/2021
4 1 4 40 05/02/2021
5 1 5 20 07/02/2021")
And I want to get the following result:
db_transpose <- read.table(header=T, text="
ID site R S data R_1 S_1 data_1 R_2 S_2 data_2
1 1 2 10 01/02/2021 3 20 03/02/2021 NA NA NA
1 2 4 50 05/01/2021 NA NA NA NA NA NA
2 1 7 40 02/02/2021 NA NA NA NA NA NA
2 2 2 30 05/02/2021 5 60 06/02/2021 9 10 07/02/2021
3 1 2 20 02/02/2021 NA NA NA NA NA NA
3 2 6 30 03/02/2021 NA NA NA NA NA NA
4 1 4 40 05/02/2021 NA NA NA NA NA NA
5 1 5 20 07/02/2021 NA NA NA NA NA NA")
For every combination of the columns ID and site, I would like to spread the values of columns R, S and data into wide format, ordered by data (the date).
I've tried reshape2 without success.
Here is what I tried with the S and R variables:
require(reshape2)
dcast(db, ID + site ~ data, value.var=c ("S", "R"))
But I obtain this error message:
Error in .subset2(x, i, exact = exact) : index out of bounds
Warning message:
In if (!(value.var %in% names(data))) { :
the condition has length > 1 and only the first element will be used
Here I've tried only with one variable:
dcast(db, ID + site ~ data, value.var="S")
But I obtain a result totally different from what I need:
ID site 01/02/2021 02/02/2021 03/02/2021 05/01/2021 05/02/2021 06/02/2021 07/02/2021
1 1 2 NA NA NA 50 NA NA NA
2 1 1 10 NA 20 NA NA NA NA
3 2 1 NA 40 NA NA NA NA NA
4 2 2 NA NA NA NA 30 60 10
5 3 1 NA 20 NA NA NA NA NA
6 3 2 NA NA 30 NA NA NA NA
7 4 1 NA NA NA NA 40 NA NA
8 5 1 NA NA NA NA NA NA 20
Thank you
In base R >= 4:
transform(db, time = ave(ID, ID, site, FUN = seq_along)) |>
reshape(dir = 'wide', idvar = c('ID', 'site'), sep = '_')
ID site R_1 S_1 data_1 R_2 S_2 data_2 R_3 S_3 data_3
1 1 1 2 10 01/02/2021 3 20 03/02/2021 NA NA <NA>
3 1 2 4 50 05/01/2021 NA NA <NA> NA NA <NA>
4 2 1 7 40 02/02/2021 NA NA <NA> NA NA <NA>
5 2 2 2 30 05/02/2021 5 60 06/02/2021 9 10 07/02/2021
8 3 1 2 20 02/02/2021 NA NA <NA> NA NA <NA>
9 3 2 6 30 03/02/2021 NA NA <NA> NA NA <NA>
10 4 1 4 40 05/02/2021 NA NA <NA> NA NA <NA>
11 5 1 5 20 07/02/2021 NA NA <NA> NA NA <NA>
In base R < 4 do:
reshape(transform(db, time = ave(ID, ID, site, FUN = seq_along)),
dir = 'wide', idvar = c('ID', 'site'), sep = '_')
In the tidyverse:
library(tidyverse)
db %>%
group_by(ID, site) %>%
mutate(name = row_number()) %>%
pivot_wider(id_cols = c(ID, site), names_from = name, values_from = c(R, S, data), names_sep = '_')
# A tibble: 8 x 11
# Groups: ID, site [8]
ID site R_1 R_2 R_3 S_1 S_2 S_3 data_1 data_2 data_3
<int> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 1 2 3 NA 10 20 NA 01/02/2021 03/02/2021 NA
2 1 2 4 NA NA 50 NA NA 05/01/2021 NA NA
3 2 1 7 NA NA 40 NA NA 02/02/2021 NA NA
4 2 2 2 5 9 30 60 10 05/02/2021 06/02/2021 07/02/2021
5 3 1 2 NA NA 20 NA NA 02/02/2021 NA NA
6 3 2 6 NA NA 30 NA NA 03/02/2021 NA NA
7 4 1 4 NA NA 40 NA NA 05/02/2021 NA NA
8 5 1 5 NA NA 20 NA NA 07/02/2021 NA NA
With data.table:
library(data.table)
dcast(setDT(db), ID + site ~ rowid(ID, site), value.var = c('R', 'S', 'data'),sep = '_')
ID site R_1 R_2 R_3 S_1 S_2 S_3 data_1 data_2 data_3
1: 1 1 2 3 NA 10 20 NA 01/02/2021 03/02/2021 <NA>
2: 1 2 4 NA NA 50 NA NA 05/01/2021 <NA> <NA>
3: 2 1 7 NA NA 40 NA NA 02/02/2021 <NA> <NA>
4: 2 2 2 5 9 30 60 10 05/02/2021 06/02/2021 07/02/2021
5: 3 1 2 NA NA 20 NA NA 02/02/2021 <NA> <NA>
6: 3 2 6 NA NA 30 NA NA 03/02/2021 <NA> <NA>
7: 4 1 4 NA NA 40 NA NA 05/02/2021 <NA> <NA>
8: 5 1 5 NA NA 20 NA NA 07/02/2021 <NA> <NA>
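The snippets above number the rows within each ID/site group in their existing order; none of them sorts by date first. A minimal base R sketch of that extra step, assuming the data column holds day/month/year strings (an assumption, since a value such as 05/01/2021 is ambiguous) and using a deliberately shuffled subset of the question's rows:

```r
# Shuffled subset of the question's rows, so that sorting has a visible effect
db_small <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID site R S data
2 2 9 10 07/02/2021
1 1 3 20 03/02/2021
2 2 2 30 05/02/2021
1 1 2 10 01/02/2021
2 2 5 60 06/02/2021")

# Assumption: dates are day/month/year
date_key <- as.Date(db_small$data, format = "%d/%m/%Y")
db_small <- db_small[order(db_small$ID, db_small$site, date_key), ]

# Running counter within each ID/site group, now in date order
db_small$time <- ave(db_small$ID, db_small$ID, db_small$site, FUN = seq_along)

wide <- reshape(db_small, direction = "wide",
                idvar = c("ID", "site"), timevar = "time", sep = "_")
wide
```

If the dates are actually month/day/year, only the format string needs to change.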
Related
I have a data frame like the following, with some NAs:
mydf=data.frame(ID=LETTERS[1:10], aaa=runif(10), bbb=runif(10), ccc=runif(10), ddd=runif(10))
mydf[c(1,4,5,7:10),2]=NA
mydf[c(1,2,4:8),3]=NA
mydf[c(3,4,6:10),4]=NA
mydf[c(1,3,4,6,9,10),5]=NA
> mydf
ID aaa bbb ccc ddd
1 A NA NA 0.08844614 NA
2 B 0.4912790 NA 0.88925139 0.1233173
3 C 0.1325188 0.1389260 NA NA
4 D NA NA NA NA
5 E NA NA 0.60750723 0.6357998
6 F 0.8218579 NA NA NA
7 G NA NA NA 0.5988206
8 H NA NA NA 0.4008338
9 I NA 0.8784563 NA NA
10 J NA 0.2959320 NA NA
What I want to accomplish here is the following:
1- replace non-NA values by the column index minus 1 (so aaa becomes 1, bbb becomes 2, and so on), so that the output looks like this:
> mydf
ID aaa bbb ccc ddd
1 A NA NA 3 NA
2 B 1 NA 3 4
3 C 1 2 NA NA
4 D NA NA NA NA
5 E NA NA 3 4
6 F 1 NA NA NA
7 G NA NA NA 4
8 H NA NA NA 4
9 I NA 2 NA NA
10 J NA 2 NA NA
2- Then I would like to add an extra column that shows the following:
0 for all NAs in a row
0 for a row with more than 1 non-NA value
the actual value when it is the only non-NA value in a row
The final result should look like this:
> mydf
ID aaa bbb ccc ddd final
1 A NA NA 3 NA 3
2 B 1 NA 3 4 0
3 C 1 2 NA NA 0
4 D NA NA NA NA 0
5 E NA NA 3 4 0
6 F 1 NA NA NA 1
7 G NA NA NA 4 4
8 H NA NA NA 4 4
9 I NA 2 NA NA 2
10 J NA 2 NA NA 2
I could probably do all this with an ugly for loop, then aggregate for the final column, and substitute by 0 where appropriate...
But I was wondering if there would be a clean way to do this with some apply calls in just a few lines...
Thanks!
You could do:
mydf[-1] <- sapply(1:4, \(x) x * mydf[x+1]/mydf[x+1])
mydf$final <- apply(mydf[-1], 1, function(x) {
if(all(is.na(x)) | sum(!is.na(x)) > 1) 0 else na.omit(x)
})
Result:
mydf
#> ID aaa bbb ccc ddd final
#> 1 A NA NA 3 NA 3
#> 2 B 1 NA 3 4 0
#> 3 C 1 2 NA NA 0
#> 4 D NA NA NA NA 0
#> 5 E NA NA 3 4 0
#> 6 F 1 NA NA NA 1
#> 7 G NA NA NA 4 4
#> 8 H NA NA NA 4 4
#> 9 I NA 2 NA NA 2
#> 10 J NA 2 NA NA 2
Created on 2022-12-16 with reprex v2.0.2
Here is an idea,
mydf1 <- cbind.data.frame(ID = mydf$ID, mapply(function(x, y) replace(x, !is.na(x), y),
mydf,
seq(ncol(mydf)) - 1)[,-1])
mydf1$final <- apply(mydf1[-1], 1, \(i)
ifelse(sum(is.na(i)) == (ncol(mydf) - 1) | sum(!is.na(i)) > 1, 0, i[!is.na(i)]))
mydf1
ID aaa bbb ccc ddd final
1 A <NA> <NA> 3 <NA> 3
2 B 1 <NA> 3 4 0
3 C 1 2 <NA> <NA> 0
4 D <NA> <NA> <NA> <NA> 0
5 E <NA> <NA> 3 4 0
6 F 1 <NA> <NA> <NA> 1
7 G <NA> <NA> <NA> 4 4
8 H <NA> <NA> <NA> 4 4
9 I <NA> 2 <NA> <NA> 2
10 J <NA> 2 <NA> <NA> 2
A third option could be
tmp <- mydf[,-1]
tmp[!is.na(tmp)] <- 1
(mydf[,-1] <- tmp * as.list(1:4))
# aaa bbb ccc ddd
#1 NA NA 3 NA
#2 1 NA 3 4
#3 1 2 NA NA
#4 NA NA NA NA
#5 NA NA 3 4
#6 1 NA NA NA
#7 NA NA NA 4
#8 NA NA NA 4
#9 NA 2 NA NA
#10 NA 2 NA NA
The final column can be generated like this
idx <- rowSums(tmp, na.rm = TRUE) == 1
mydf$final <- idx * max.col(replace(tmp, is.na(tmp), -Inf))
Result
mydf
# ID aaa bbb ccc ddd final
#1 A NA NA 3 NA 3
#2 B 1 NA 3 4 0
#3 C 1 2 NA NA 0
#4 D NA NA NA NA 0
#5 E NA NA 3 4 0
#6 F 1 NA NA NA 1
#7 G NA NA NA 4 4
#8 H NA NA NA 4 4
#9 I NA 2 NA NA 2
#10 J NA 2 NA NA 2
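A vectorised variant of the same idea, as a sketch on a small made-up data frame (the values below are hypothetical, not the question's random numbers): col() supplies the column index, and rowSums() counts the non-NA entries per row.

```r
# Hypothetical example data with the same layout as mydf
ex <- data.frame(ID  = c("A", "B", "C", "D"),
                 aaa = c(NA, 0.5, 0.1, NA),
                 bbb = c(NA, NA, 0.2, NA),
                 ccc = c(0.3, 0.8, NA, NA))

vals <- as.matrix(ex[-1])   # numeric part only (ID dropped)
idx  <- col(vals)           # position within vals = column index in ex minus 1
idx[is.na(vals)] <- NA      # keep NA where the original value was NA
ex[-1] <- idx

n_filled <- rowSums(!is.na(idx))
# The single value when exactly one entry is non-NA, otherwise 0
ex$final <- ifelse(n_filled == 1, rowSums(idx, na.rm = TRUE), 0)
ex
```

Unlike the division trick, this does not turn a genuine 0 in the data into NaN.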
I have data with over 6k columns. Each result comes with columns whose data are always the same.
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
I would like to remove the duplicated columns, e.g. the Sex variable. Is it possible to do that with data.table?
You can use match if you need to check for equality of all values.
df2 <- df[, unique(match(df, df)), with = FALSE]
df2
# XCODE Age Sex ResultA ResultB
# 1 X001 12 2 2 4
# 2 X002 23 2 4 66
# 3 X003 NA NA NA NA
# 4 X004 32 1 1 3
# 5 X005 NA NA NA NA
# 6 X001 NA NA NA NA
# 7 X002 NA NA NA NA
# 8 X003 33 1 8 6
# 9 X004 NA NA NA NA
# 10 X005 55 2 8 8
Data used:
df <- fread('
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
')[, -'V1']
Try this:
df[, unique(colnames(df))]
One caveat: this keeps only the first column for each name, so the second Sex column is dropped even if the two columns have the same name but different content. (On a data.table, add with = FALSE so that the names are treated as a column selection.)
If you have duplicated columns with different names, you can transpose your data frame, which lets you use unique() on what are now rows. You then transpose it back and convert it back to a data frame (because it became a matrix when you transposed it).
df = data.frame(c = 1:5, a = c("A", "B","C","D","E"), b = 1:5)
df = t(df)
df = unique(df)
df = t(df)
df = data.frame(df)
Edit: as markus points out, this is probably not a good option if you have columns of multiple types, because when t() coerces your data frame to a matrix it also coerces all your variables to the same type.
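One way to avoid the coercion problem is to compare the columns as list elements instead of transposing. A sketch, assuming a plain data frame built from the first three rows of the question's data; duplicated() applied to as.list(df) flags every column whose content repeats an earlier column, regardless of its name:

```r
# First rows of the question's data; check.names = FALSE keeps both "Sex" columns
df_ex <- data.frame(XCODE   = c("X001", "X002", "X003"),
                    Age     = c(12, 23, NA),
                    Sex     = c(2, 2, NA),
                    ResultA = c(2, 4, NA),
                    Sex     = c(2, 2, NA),
                    ResultB = c(4, 66, NA),
                    check.names = FALSE)

# TRUE for the first occurrence of each distinct column content
keep <- !duplicated(as.list(df_ex))
df_unique <- df_ex[, keep, drop = FALSE]
names(df_unique)
```

Column types are untouched because nothing is converted to a matrix.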
I would like to build a dataframe that lists all possible combinations of 6 numbers.
I realised that I can use combn(), but with only one value for m. With a bit of playing around I got the desired result by going through step-by-step with the following code -
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combi2 <- data.frame(t(combn(c(1:6), 2)))
combi3 <- data.frame(t(combn(c(1:6), 3)))
combi4 <- data.frame(t(combn(c(1:6), 4)))
combi5 <- data.frame(t(combn(c(1:6), 5)))
combi6 <- data.frame(t(combn(c(1:6), 6)))
# rbind.fill() comes from the plyr package
Combi <- plyr::rbind.fill(combi1, combi2, combi3, combi4, combi5, combi6)
I had to transpose each of the DFs to get them in the right shape.
My problem is that this seems quite inefficient, and maybe a bit simplistic. I thought there must surely be a quicker way to code this, but I haven't found any solution online that gives me what I'd like.
Possibly build it into a function or a loop somehow? I'm fairly new to R though and haven't had a great deal of practice writing functions.
Is this what you want?
combis <- vector("list", 6)
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combis[[1]] <- combi1
combis[2:6] <- lapply(2:6, function(n) data.frame(t(combn(c(1:6), n))))
do.call(plyr::rbind.fill, combis)
Result:
X1 X2 X3 X4 X5 X6
1 1 NA NA NA NA NA
2 2 NA NA NA NA NA
3 3 NA NA NA NA NA
4 4 NA NA NA NA NA
5 5 NA NA NA NA NA
6 6 NA NA NA NA NA
7 1 2 NA NA NA NA
8 1 3 NA NA NA NA
9 1 4 NA NA NA NA
10 1 5 NA NA NA NA
11 1 6 NA NA NA NA
12 2 3 NA NA NA NA
13 2 4 NA NA NA NA
14 2 5 NA NA NA NA
15 2 6 NA NA NA NA
16 3 4 NA NA NA NA
17 3 5 NA NA NA NA
18 3 6 NA NA NA NA
19 4 5 NA NA NA NA
20 4 6 NA NA NA NA
21 5 6 NA NA NA NA
22 1 2 3 NA NA NA
23 1 2 4 NA NA NA
24 1 2 5 NA NA NA
25 1 2 6 NA NA NA
26 1 3 4 NA NA NA
27 1 3 5 NA NA NA
28 1 3 6 NA NA NA
29 1 4 5 NA NA NA
30 1 4 6 NA NA NA
31 1 5 6 NA NA NA
32 2 3 4 NA NA NA
33 2 3 5 NA NA NA
34 2 3 6 NA NA NA
35 2 4 5 NA NA NA
36 2 4 6 NA NA NA
37 2 5 6 NA NA NA
38 3 4 5 NA NA NA
39 3 4 6 NA NA NA
40 3 5 6 NA NA NA
41 4 5 6 NA NA NA
42 1 2 3 4 NA NA
43 1 2 3 5 NA NA
44 1 2 3 6 NA NA
45 1 2 4 5 NA NA
46 1 2 4 6 NA NA
47 1 2 5 6 NA NA
48 1 3 4 5 NA NA
49 1 3 4 6 NA NA
50 1 3 5 6 NA NA
51 1 4 5 6 NA NA
52 2 3 4 5 NA NA
53 2 3 4 6 NA NA
54 2 3 5 6 NA NA
55 2 4 5 6 NA NA
56 3 4 5 6 NA NA
57 1 2 3 4 5 NA
58 1 2 3 4 6 NA
59 1 2 3 5 6 NA
60 1 2 4 5 6 NA
61 1 3 4 5 6 NA
62 2 3 4 5 6 NA
63 1 2 3 4 5 6
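The same table can also be built without plyr. A sketch in base R, where the names n, rows and combi_all are only illustrative: combn() pads each combination with NA up to length n through its FUN argument, and the padded vectors are then stacked with rbind().

```r
n <- 6

# For each size m, list all combinations of 1..n, padded with NA to length n
rows <- unlist(
  lapply(seq_len(n), function(m) {
    combn(n, m, FUN = function(v) c(v, rep(NA, n - m)), simplify = FALSE)
  }),
  recursive = FALSE
)

combi_all <- as.data.frame(do.call(rbind, rows))
names(combi_all) <- paste0("X", seq_len(n))
nrow(combi_all)   # 2^6 - 1 non-empty subsets
```

Changing n is enough to produce the table for a different number of values.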
For the data given below,
data1<-structure(list(var1 = c("2 7", "2 6 7", "2 7", "2 7", "1 7",
"1 7", "1 5", "1 2 7", "1 5", "1 7", "1 2 3 4 5 6 7", "1 2 4 6"
)), .Names = "var1", class = "data.frame", row.names = c(NA,
-12L))
> data1
var1
1 2 7
2 2 6 7
3 2 7
4 2 7
5 1 7
6 1 7
7 1 5
8 1 2 7
9 1 5
10 1 7
11 1 2 3 4 5 6 7
12 1 2 4 6
I would like it to split into seven columns (7) as follows:
v1 v2 v3 v4 v5 v6 v7
1 NA 2 NA NA NA NA 7
2 NA 2 NA NA NA 6 7
3 NA 2 NA NA NA NA 7
4 NA 2 NA NA NA NA 7
5 1 NA NA NA NA NA 7
6 1 NA NA NA NA NA 7
7 1 NA NA NA 5 NA NA
8 1 2 NA NA NA NA 7
9 1 NA NA NA 5 NA NA
10 1 NA NA NA NA NA 7
11 1 2 3 4 5 6 7
12 1 2 NA 4 NA 6 NA
I used tstrsplit() from the data.table package as follows:
library(data.table)
setDT(data1)[,tstrsplit(var1," ")]
V1 V2 V3 V4 V5 V6 V7
1: 2 7 NA NA NA NA NA
2: 2 6 7 NA NA NA NA
3: 2 7 NA NA NA NA NA
4: 2 7 NA NA NA NA NA
5: 1 7 NA NA NA NA NA
6: 1 7 NA NA NA NA NA
7: 1 5 NA NA NA NA NA
8: 1 2 7 NA NA NA NA
9: 1 5 NA NA NA NA NA
10: 1 7 NA NA NA NA NA
11: 1 2 3 4 5 6 7
12: 1 2 4 6 NA NA NA
This is different from the expected output. How can I get the expected output described above?
With data.table you may try
library(magrittr)
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))] %>%
dcast(., rn ~ V1)
rn 1 2 3 4 5 6 7
1: 1 NA 2 NA NA NA NA 7
2: 2 NA 2 NA NA NA 6 7
3: 3 NA 2 NA NA NA NA 7
4: 4 NA 2 NA NA NA NA 7
5: 5 1 NA NA NA NA NA 7
6: 6 1 NA NA NA NA NA 7
7: 7 1 NA NA NA 5 NA NA
8: 8 1 2 NA NA NA NA 7
9: 9 1 NA NA NA 5 NA NA
10: 10 1 NA NA NA NA NA 7
11: 11 1 2 3 4 5 6 7
12: 12 1 2 NA 4 NA 6 NA
To get rid of the rn column, we can use
setDT(data1)[, strsplit(var1," "), by = .(rn = 1:nrow(data1))][
, dcast(.SD, rn ~ V1)][, rn := NULL][]
Explanation
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))]
creates a data.table directly in long format
rn V1
1: 1 2
2: 1 7
3: 2 2
4: 2 6
5: 2 7
6: 3 2
7: 3 7
8: 4 2
9: 4 7
10: 5 1
11: 5 7
12: 6 1
13: 6 7
14: 7 1
15: 7 5
16: 8 1
17: 8 2
18: 8 7
19: 9 1
20: 9 5
21: 10 1
22: 10 7
23: 11 1
24: 11 2
25: 11 3
26: 11 4
27: 11 5
28: 11 6
29: 11 7
30: 12 1
31: 12 2
32: 12 4
33: 12 6
rn V1
which is then reshaped to wide format using dcast().
If we used tstrsplit() instead of strsplit(), we would get a data.table in wide format, which would first need to be reshaped to long format using melt():
setDT(data1)[,tstrsplit(var1," ")][, rn := .I][
, melt(.SD, id = "rn", na.rm = TRUE)][
, dcast(.SD, rn ~ paste0("V", value))][
, rn := NULL][]
In base R, we can do this by splitting the string on one or more whitespace characters (\\s+), creating a row/column index ('i1'), and using it to fill an NA matrix ('m1') with the unlisted split values:
lst <- lapply(strsplit(data1$var1, "\\s+"), as.numeric)
i1 <- cbind(rep(1:nrow(data1), lengths(lst)), unlist(lst))
m1 <- matrix(NA, nrow = max(i1[,1]), ncol = max(i1[,2]))
m1[i1] <- unlist(lst)
as.data.frame(m1)
# V1 V2 V3 V4 V5 V6 V7
#1 NA 2 NA NA NA NA 7
#2 NA 2 NA NA NA 6 7
#3 NA 2 NA NA NA NA 7
#4 NA 2 NA NA NA NA 7
#5 1 NA NA NA NA NA 7
#6 1 NA NA NA NA NA 7
#7 1 NA NA NA 5 NA NA
#8 1 2 NA NA NA NA 7
#9 1 NA NA NA 5 NA NA
#10 1 NA NA NA NA NA 7
#11 1 2 3 4 5 6 7
#12 1 2 NA 4 NA 6 NA
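The matrix-index approach sizes the result from the largest value it finds, so a column would be missing if, say, 7 never occurred. A sketch that fixes the set of columns up front, assuming the possible values are 1 to 7 (the name wanted is illustrative) and using three of the question's rows:

```r
data_ex <- data.frame(var1 = c("2 7", "1 5", "1 2 4 6"), stringsAsFactors = FALSE)

wanted <- 1:7   # assumption: the full set of values that may occur

split_vals <- lapply(strsplit(data_ex$var1, "\\s+"), as.integer)

# For each row, keep the value where it occurs and NA elsewhere
m <- t(sapply(split_vals, function(v) ifelse(wanted %in% v, wanted, NA)))
colnames(m) <- paste0("v", wanted)

res <- as.data.frame(m)
res
```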
I have four vectors (columns)
x y z t
1 1 1 10
1 1 1 15
2 4 1 14
2 3 1 15
2 2 1 17
2 1 2 19
1 4 2 18
1 4 2 NA
2 2 2 45
3 3 2 NA
3 1 3 59
4 3 3 23
1 4 3 45
4 4 4 74
2 1 4 86
How can I calculate mean and median of vector t, for each value of vector y (from 1 to 4) where x=1, z=1, using aggregate function in R?
How to do it with 3 parameters has been discussed (Multiple Aggregation in R), but it's a little unclear how to do it with 4 parameters.
Thank you.
You could try something like this in data.table
library(data.table)
# yourdataframe is the data frame holding the columns x, y, z and t
data <- data.table(yourdataframe)
bar <- data[,.N,by=y]
foo <- data[x==1 & z==1,list(mean.t=mean(t,na.rm=T),median.t=median(t,na.rm=T)),by=y]
merge(bar[,list(y)],foo,by="y",all.x=T)
y mean.t median.t
1: 1 12.5 12.5
2: 2 NA NA
3: 3 NA NA
4: 4 NA NA
You probably could do the same in aggregate, but I am not sure you can do it in one easy step.
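For reference, a sketch of how this might look with aggregate(), using the question's values typed in as a data frame (the name dat is illustrative). The filter on x and z is applied with subset() first, and FUN returns both statistics at once:

```r
dat <- data.frame(
  x = c(1, 1, 2, 2, 2, 2, 1, 1, 2, 3, 3, 4, 1, 4, 2),
  y = c(1, 1, 4, 3, 2, 1, 4, 4, 2, 3, 1, 3, 4, 4, 1),
  z = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  t = c(10, 15, 14, 15, 17, 19, 18, NA, 45, NA, 59, 23, 45, 74, 86)
)

# Keep only the rows with x = 1 and z = 1, then summarise t for each y
sub <- subset(dat, x == 1 & z == 1)
res <- aggregate(t ~ y, data = sub,
                 FUN = function(v) c(mean = mean(v), median = median(v)))
res
```

Two differences from the data.table version: the formula interface drops rows with NA in t by default, and only the y values present after filtering appear in the result, so the rows for y = 2, 3 and 4 would still have to be merged in if they are wanted.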
An answer to an additional request in the comments:
bar <- data.table(expand.grid(y=unique(data$y),z=unique(data[z %in% c(1,2,3,4),z])))
foo <- data[x==1 & z %in% c(1,2,3,4),list(
mean.t=mean(t,na.rm=T),
median.t=median(t,na.rm=T),
Q25.t=quantile(t,0.25,na.rm=T),
Q75.t=quantile(t,0.75,na.rm=T)
),by=list(y,z)]
merge(bar[,list(y,z)],foo,by=c("y","z"),all.x=T)
y z mean.t median.t Q25.t Q75.t
1: 1 1 12.5 12.5 11.25 13.75
2: 1 2 NA NA NA NA
3: 1 3 NA NA NA NA
4: 1 4 NA NA NA NA
5: 2 1 NA NA NA NA
6: 2 2 NA NA NA NA
7: 2 3 NA NA NA NA
8: 2 4 NA NA NA NA
9: 3 1 NA NA NA NA
10: 3 2 NA NA NA NA
11: 3 3 NA NA NA NA
12: 3 4 NA NA NA NA
13: 4 1 NA NA NA NA
14: 4 2 18.0 18.0 18.00 18.00
15: 4 3 45.0 45.0 45.00 45.00
16: 4 4 NA NA NA NA