I have four vectors (columns)
x y z t
1 1 1 10
1 1 1 15
2 4 1 14
2 3 1 15
2 2 1 17
2 1 2 19
1 4 2 18
1 4 2 NA
2 2 2 45
3 3 2 NA
3 1 3 59
4 3 3 23
1 4 3 45
4 4 4 74
2 1 4 86
How can I calculate the mean and median of vector t for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R?
It was discussed how to do this with 3 parameters (Multiple Aggregation in R), but it's a little unclear how to do it with 4 parameters.
Thank you.
You could try something like this in data.table
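library(data.table)
data <- data.table(yourdataframe)
bar <- data[, .N, by = y]   # one row per distinct value of y
foo <- data[x == 1 & z == 1, list(mean.t = mean(t, na.rm = TRUE), median.t = median(t, na.rm = TRUE)), by = y]
merge(bar[, list(y)], foo, by = "y", all.x = TRUE)   # left join keeps every y, with NA where nothing matched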
y mean.t median.t
1: 1 12.5 12.5
2: 2 NA NA
3: 3 NA NA
4: 4 NA NA
You probably could do the same in aggregate, but I am not sure you can do it in one easy step.
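For comparison, here is a minimal base R sketch of the same idea with aggregate(). It is only one way to do it, not necessarily the shortest. The names df, sub, res, flat and out are made up for the example, and df is simply the table from the question typed in by hand. The rows are filtered first, aggregate() summarises t by y, and a merge() against all values of y brings back the groups that had no matching rows.

```r
# The example data from the question, typed in by hand
df <- data.frame(
  x = c(1, 1, 2, 2, 2, 2, 1, 1, 2, 3, 3, 4, 1, 4, 2),
  y = c(1, 1, 4, 3, 2, 1, 4, 4, 2, 3, 1, 3, 4, 4, 1),
  z = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  t = c(10, 15, 14, 15, 17, 19, 18, NA, 45, NA, 59, 23, 45, 74, 86)
)

# Keep only the rows of interest, then summarise t within each y.
# The formula interface drops rows with NA in t by default (na.action = na.omit).
sub <- subset(df, x == 1 & z == 1)
res <- aggregate(t ~ y, data = sub,
                 FUN = function(v) c(mean = mean(v), median = median(v)))

# FUN returns two values, so res$t is a matrix column; flatten it
flat <- do.call(data.frame, res)   # columns: y, t.mean, t.median

# Left join against every y so that empty groups show up as NA
out <- merge(data.frame(y = sort(unique(df$y))), flat, by = "y", all.x = TRUE)
out
```

Note that aggregate() only returns groups that actually occur in the filtered data, which is why the extra merge() step is needed to get the NA rows for y = 2, 3 and 4.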
An answer to an additional request in the comments:
bar <- data.table(expand.grid(y=unique(data$y),z=unique(data[z %in% c(1,2,3,4),z])))
foo <- data[x==1 & z %in% c(1,2,3,4),list(
mean.t=mean(t,na.rm=T),
median.t=median(t,na.rm=T),
Q25.t=quantile(t,0.25,na.rm=T),
Q75.t=quantile(t,0.75,na.rm=T)
),by=list(y,z)]
merge(bar[,list(y,z)],foo,by=c("y","z"),all.x=T)
y z mean.t median.t Q25.t Q75.t
1: 1 1 12.5 12.5 11.25 13.75
2: 1 2 NA NA NA NA
3: 1 3 NA NA NA NA
4: 1 4 NA NA NA NA
5: 2 1 NA NA NA NA
6: 2 2 NA NA NA NA
7: 2 3 NA NA NA NA
8: 2 4 NA NA NA NA
9: 3 1 NA NA NA NA
10: 3 2 NA NA NA NA
11: 3 3 NA NA NA NA
12: 3 4 NA NA NA NA
13: 4 1 NA NA NA NA
14: 4 2 18.0 18.0 18.00 18.00
15: 4 3 45.0 45.0 45.00 45.00
16: 4 4 NA NA NA NA
I have a data frame like the following, with some NAs:
mydf=data.frame(ID=LETTERS[1:10], aaa=runif(10), bbb=runif(10), ccc=runif(10), ddd=runif(10))
mydf[c(1,4,5,7:10),2]=NA
mydf[c(1,2,4:8),3]=NA
mydf[c(3,4,6:10),4]=NA
mydf[c(1,3,4,6,9,10),5]=NA
> mydf
ID aaa bbb ccc ddd
1 A NA NA 0.08844614 NA
2 B 0.4912790 NA 0.88925139 0.1233173
3 C 0.1325188 0.1389260 NA NA
4 D NA NA NA NA
5 E NA NA 0.60750723 0.6357998
6 F 0.8218579 NA NA NA
7 G NA NA NA 0.5988206
8 H NA NA NA 0.4008338
9 I NA 0.8784563 NA NA
10 J NA 0.2959320 NA NA
What I want to accomplish here is the following:
1- replace non-NA values by column index -1, so that the output looks like this:
> mydf
ID aaa bbb ccc ddd
1 A NA NA 3 NA
2 B 1 NA 3 4
3 C 1 2 NA NA
4 D NA NA NA NA
5 E NA NA 3 4
6 F 1 NA NA NA
7 G NA NA NA 4
8 H NA NA NA 4
9 I NA 2 NA NA
10 J NA 2 NA NA
2- Then I would like to add an extra column that shows the following:
0 for all NAs in a row
0 for a row with more than 1 non-NA value
the actual value when it is the only non-NA value in a row
The final result should look like this:
> mydf
ID aaa bbb ccc ddd final
1 A NA NA 3 NA 3
2 B 1 NA 3 4 0
3 C 1 2 NA NA 0
4 D NA NA NA NA 0
5 E NA NA 3 4 0
6 F 1 NA NA NA 1
7 G NA NA NA 4 4
8 H NA NA NA 4 4
9 I NA 2 NA NA 2
10 J NA 2 NA NA 2
I could probably do all this with an ugly for loop, then aggregate for the final column, and substitute by 0 where appropriate...
But I was wondering if there would be a clean way to do this with some apply calls in just a few lines...
Thanks!
You could do:
mydf[-1] <- sapply(1:4, \(x) x * mydf[x+1]/mydf[x+1])
mydf$final <- apply(mydf[-1], 1, function(x) {
if(all(is.na(x)) | sum(!is.na(x)) > 1) 0 else na.omit(x)
})
Result:
mydf
#> ID aaa bbb ccc ddd final
#> 1 A NA NA 3 NA 3
#> 2 B 1 NA 3 4 0
#> 3 C 1 2 NA NA 0
#> 4 D NA NA NA NA 0
#> 5 E NA NA 3 4 0
#> 6 F 1 NA NA NA 1
#> 7 G NA NA NA 4 4
#> 8 H NA NA NA 4 4
#> 9 I NA 2 NA NA 2
#> 10 J NA 2 NA NA 2
Created on 2022-12-16 with reprex v2.0.2
Here is an idea,
mydf1 <- cbind.data.frame(ID = mydf$ID, mapply(function(x, y) replace(x, !is.na(x), y),
mydf,
seq(ncol(mydf)) - 1)[,-1])
mydf1$final <- apply(mydf1[-1], 1, \(i)
ifelse(sum(is.na(i)) == (ncol(mydf) - 1) | sum(!is.na(i)) > 1, 0, i[!is.na(i)]))
mydf1
ID aaa bbb ccc ddd final
1 A <NA> <NA> 3 <NA> 3
2 B 1 <NA> 3 4 0
3 C 1 2 <NA> <NA> 0
4 D <NA> <NA> <NA> <NA> 0
5 E <NA> <NA> 3 4 0
6 F 1 <NA> <NA> <NA> 1
7 G <NA> <NA> <NA> 4 4
8 H <NA> <NA> <NA> 4 4
9 I <NA> 2 <NA> <NA> 2
10 J <NA> 2 <NA> <NA> 2
A third option could be
tmp <- mydf[,-1]
tmp[!is.na(tmp)] <- 1
(mydf[,-1] <- tmp * as.list(1:4))
# aaa bbb ccc ddd
#1 NA NA 3 NA
#2 1 NA 3 4
#3 1 2 NA NA
#4 NA NA NA NA
#5 NA NA 3 4
#6 1 NA NA NA
#7 NA NA NA 4
#8 NA NA NA 4
#9 NA 2 NA NA
#10 NA 2 NA NA
The final column can be generated like this
idx <- rowSums(tmp, na.rm = TRUE) == 1
mydf$final <- idx * max.col(replace(tmp, is.na(tmp), -Inf))
Result
mydf
# ID aaa bbb ccc ddd final
#1 A NA NA 3 NA 3
#2 B 1 NA 3 4 0
#3 C 1 2 NA NA 0
#4 D NA NA NA NA 0
#5 E NA NA 3 4 0
#6 F 1 NA NA NA 1
#7 G NA NA NA 4 4
#8 H NA NA NA 4 4
#9 I NA 2 NA NA 2
#10 J NA 2 NA NA 2
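Yet another possibility, sketched here under the assumption that the value columns are exactly columns 2 to 5 as in the question, is to work on a matrix of column indices built with col(). The names vals, idx and n_present are only illustrative. With at most one non-NA entry per row, the row sum of the indices is that single index, so no apply() call is needed.

```r
# Rebuild the example data frame with the same NA pattern as in the question
mydf <- data.frame(ID = LETTERS[1:10], aaa = runif(10), bbb = runif(10),
                   ccc = runif(10), ddd = runif(10))
mydf[c(1, 4, 5, 7:10), 2] <- NA
mydf[c(1, 2, 4:8), 3] <- NA
mydf[c(3, 4, 6:10), 4] <- NA
mydf[c(1, 3, 4, 6, 9, 10), 5] <- NA

vals <- as.matrix(mydf[-1])
idx <- col(vals)            # 1..4, i.e. the data frame column index minus 1
idx[is.na(vals)] <- NA      # keep the index only where a value exists

n_present <- rowSums(!is.na(idx))   # number of non-NA values per row

mydf[-1] <- as.data.frame(idx)
# Exactly one value in the row: use it; otherwise (none or several): 0
mydf$final <- ifelse(n_present == 1, rowSums(idx, na.rm = TRUE), 0)
mydf
```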
I am trying to store all rows with NA in my columns Math_G1, Math_G2 and Math_G3 in a dataset variable. However, when I do this, additional rows pop up that have NA as the value of every attribute, including the row number (e.g. NA.1, NA.2 ...). How do I fix this?
I have already tried to use the c() function to filter out these results, but the rows are still there. In addition, I have also used the which() function, but they are still there.
Here is my code :
dat <- read.csv(file = "final merged.csv", stringsAsFactors=FALSE, na.strings=c("NA", "NULL"))
dat_small <- dat[c("age","traveltime","studytime",
"failures","famrel","freetime","goout","Dalc","Walc",
"health","absences","Math_G1","Math_G2","Math_G3","Por_G1","Por_G2","Por_G3","DoubleSub")]
sample_size <- 500
all_set <- sample(1:length(dat[,1]),sample_size,replace = F)
dat <- dat_small[all_set,]
index_na_math <- which(is.na(c(dat$Math_G1,dat$Math_G2,dat$Math_G3)))
index_na_por <- which(is.na(c(dat$Por_G1,dat$Por_G2,dat$Por_G3)))
index_na_both <- c(index_na_math,index_na_por)
#each row of my dataset helps define a specific student
#portugese and math are subjects that students within the dataset takes
dat_purepor <- dat[which(index_na_math),] #students who takes only portugese
dat_puremath <- dat[c(index_na_por),] # students who takes only math
dat_math <- dat[c(-index_na_math),] #students who takes math + students who take both
dat_por <- dat[c(-index_na_por),] #students who take portugese + students who take both
dat_both <- dat[c(-index_na_both),] #students who takes both math and portugese
dat_purepor
dat_puremath
I expected the output to be filtered according to my conditions but without any rows with NA as the values for all its columns so I don't understand why the final results return NA.
Here is a preview of the dataset dat_small:
> dat_small
age traveltime studytime failures famrel freetime goout Dalc Walc health absences Math_G1 Math_G2 Math_G3 Por_G1 Por_G2 Por_G3 DoubleSub
1 18 2 2 0 4 3 4 1 1 3 6 5 6 6 13 13 13 1
2 17 1 2 0 5 3 3 1 1 3 4 5 5 6 15 15 15 1
3 15 1 2 3 4 3 2 2 3 3 10 7 8 10 10 12 13 1
4 15 1 3 0 3 2 2 1 1 5 2 15 14 15 14 14 14 1
5 16 1 2 0 4 3 2 1 2 5 4 6 10 10 13 13 13 1
6 16 1 2 0 5 4 2 1 2 5 10 15 15 15 10 13 13 1
7 16 1 2 0 4 4 4 1 1 3 0 12 12 11 14 14 16 1
8 17 2 2 0 4 1 4 1 1 1 6 6 5 6 12 13 13 1
9 15 1 2 0 4 2 2 1 1 1 0 16 18 19 13 17 17 1
10 15 1 2 0 5 5 1 1 1 5 0 14 15 15 9 10 11 1
11 15 1 2 0 3 3 3 1 2 2 0 10 8 9 15 15 15 1
12 15 3 3 0 5 2 2 1 1 4 4 10 12 12 10 12 13 1
13 15 1 1 0 4 3 3 1 3 5 2 14 14 14 13 14 15 1
14 15 2 2 0 5 4 3 1 2 3 2 10 10 11 14 14 14 1
15 15 1 3 0 4 5 2 1 1 3 0 14 16 16 11 12 14 1
16 16 1 1 0 4 4 4 1 2 2 4 14 14 14 9 8 9 1
17 16 1 3 0 3 2 3 1 2 2 6 13 14 14 10 10 16 1
18 16 3 2 0 5 3 2 1 1 4 4 8 10 10 11 11 11 1
19 17 1 1 3 5 5 5 2 4 5 16 6 5 5 10 13 13 1
20 16 1 1 0 3 1 3 1 3 5 4 8 10 10 14 14 14 1
21 15 1 2 0 4 4 1 1 1 1 0 13 14 15 9 8 10 1
22 15 1 1 0 5 4 2 1 1 5 0 12 15 15 10 13 13 1
23 16 1 2 0 4 5 1 1 3 5 2 15 15 16 11 10 11 1
24 16 2 2 0 5 4 4 2 4 5 0 13 13 12 14 14 14 1
25 15 1 3 0 4 3 2 1 1 5 2 10 9 8 10 11 10 1
26 16 1 1 2 1 2 2 1 3 5 14 6 9 8 13 13 13 1
27 15 1 1 0 4 2 2 1 2 5 2 12 12 11 12 11 12 1
28 15 1 1 0 2 2 4 2 4 1 4 15 16 15 14 12 12 1
29 16 1 2 0 5 3 3 1 1 5 4 11 11 11 10 10 1 1
30 16 1 2 0 4 4 5 5 5 5 16 10 12 11 9 12 12 1
31 15 1 2 0 5 4 2 3 4 5 0 9 11 12 9 10 11 1
32 15 2 2 0 4 3 1 1 1 5 0 17 16 17 14 14 16 1
33 15 1 2 0 4 5 2 1 1 5 0 17 16 16 14 14 16 1
34 15 1 2 0 5 3 2 1 1 2 0 8 10 12 10 13 13 1
35 16 1 1 0 5 4 3 1 1 5 0 12 14 15 9 12 12 1
36 15 2 1 0 3 5 1 1 1 5 0 8 7 6 14 13 12 1
37 15 1 3 0 5 4 3 1 1 4 2 15 16 18 14 14 16 1
38 16 2 3 0 2 4 3 1 1 5 7 15 16 15 9 9 8 1
39 15 1 3 0 4 3 2 1 1 5 2 12 12 11 14 13 12 1
40 15 1 1 0 4 3 1 1 1 2 8 14 13 13 14 13 12 1
41 16 2 2 1 3 3 3 1 2 3 25 7 10 11 13 13 13 1
42 15 1 1 0 5 4 3 2 4 5 8 12 12 12 10 13 13 1
43 15 1 2 0 4 3 3 1 1 5 2 19 18 18 9 12 12 1
44 15 1 1 0 5 4 1 1 1 1 0 8 8 11 10 13 13 1
45 16 2 2 1 4 3 3 2 2 5 14 10 10 9 11 11 11 1
46 15 1 2 0 5 2 2 1 1 5 8 8 8 6 12 11 12 1
47 16 1 2 0 2 3 5 1 4 3 12 11 12 11 10 11 11 1
48 16 1 4 0 4 2 2 1 1 2 4 19 19 20 14 14 16 1
49 15 1 2 0 4 3 3 2 2 5 2 15 15 14 10 13 13 1
50 15 1 2 1 4 4 4 1 1 3 2 7 7 7 15 15 15 1
51 16 3 2 0 4 3 3 2 3 4 2 12 13 13 13 13 13 1
52 15 1 2 0 4 3 3 1 1 5 2 11 13 13 16 14 16 1
53 15 2 1 1 5 5 5 3 4 5 6 11 11 10 14 14 16 1
54 15 1 1 0 3 3 4 2 3 5 0 8 10 11 11 12 13 1
55 15 1 1 0 5 3 4 4 4 1 6 10 13 13 13 12 13 1
[ reached getOption("max.print") -- omitted 889 rows ]
Here is a preview of what happens when I print the dat_puremath dataset.
> dat_puremath
age traveltime studytime failures famrel freetime goout Dalc Walc health absences Math_G1 Math_G2 Math_G3 Por_G1 Por_G2 Por_G3 DoubleSub
918 15 2 4 0 4 4 2 2 3 3 12 16 16 16 NA NA NA 0
931 16 1 2 3 2 3 3 2 2 4 5 7 7 7 NA NA NA 0
933 16 1 2 0 3 3 4 1 1 4 0 12 13 14 NA NA NA 0
935 16 1 1 0 4 5 2 1 1 5 20 13 12 12 NA NA NA 0
927 16 2 2 0 3 4 4 1 4 5 2 13 13 11 NA NA NA 0
929 17 1 2 0 5 3 3 1 1 3 0 8 8 9 NA NA NA 0
942 17 1 3 0 3 3 2 2 2 3 3 11 11 11 NA NA NA 0
928 16 1 2 0 1 2 2 1 2 1 14 12 13 12 NA NA NA 0
936 17 1 3 0 3 2 3 1 1 4 4 10 9 9 NA NA NA 0
939 17 1 4 0 5 2 2 1 2 5 0 17 17 18 NA NA NA 0
941 17 1 2 0 4 2 2 1 1 3 12 11 9 9 NA NA NA 0
937 17 1 2 0 5 4 5 1 2 5 4 10 9 11 NA NA NA 0
925 16 1 2 0 4 4 2 1 1 3 0 14 14 14 NA NA NA 0
938 17 1 3 0 4 3 3 1 1 3 6 13 12 12 NA NA NA 0
921 15 1 3 0 4 2 2 1 1 5 2 9 11 11 NA NA NA 0
943 17 1 3 0 4 4 3 1 1 5 7 12 14 14 NA NA NA 0
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.11 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.12 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.16 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.17 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.18 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.19 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.20 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.21 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.22 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.23 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.24 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.26 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.27 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.28 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.29 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.30 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.31 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Can someone explain why this happens and how I can fix it? Thank you!
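c(dat$Math_G1,dat$Math_G2,dat$Math_G3) concatenates the three columns into a single vector of length 3*nrow(dat), so which(is.na(...)) can return positions larger than nrow(dat). Indexing dat with a row number that does not exist returns a row filled with NA (labelled NA, NA.1, ...), which is where the extra rows come from.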
Try the following
index_na_math <- (is.na(dat$Math_G1) | is.na(dat$Math_G2) | is.na(dat$Math_G3))
similarly for the other one, and then
index_na_both <- index_na_math | index_na_por
# or depending what you mean by 'both'
index_na_both <- index_na_math & index_na_por
The subsetting with dat_math <- dat[!index_na_math,] will yield the expected result (accordingly for the others).
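To make the difference concrete, here is a small sketch on a made-up three-row data frame. The values and the names bad_idx, bad, good_idx and good are invented for illustration, and only two Math columns are used to keep it short.

```r
# Tiny stand-in for the real data: 3 students
dat <- data.frame(Math_G1 = c(10, NA, 12),
                  Math_G2 = c(11, NA, 13),
                  Por_G1  = c(NA, 14, 15))

# Concatenating the columns gives a vector of length 6, so which()
# returns positions 2 and 5, and there is no row 5 in dat
bad_idx <- which(is.na(c(dat$Math_G1, dat$Math_G2)))
bad <- dat[bad_idx, ]      # second row is all NA, with row name "NA"

# A logical vector with one element per row stays within bounds
good_idx <- is.na(dat$Math_G1) | is.na(dat$Math_G2)
good <- dat[good_idx, ]    # only the real row 2

bad
good
```

With many columns, rowSums(is.na(dat[c("Math_G1", "Math_G2")])) > 0 builds the same logical vector without writing out each is.na() call.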
If I have data such as
idx<-c("1_1_2015_0_00_00","1_1_2015_0_10_00","1_1_2015_0_30_00","1_1_2015_0_40_00","1_1_2015_0_60_00","1_1_2015_0_80_00")
rr<-c(2,3,4,1,5,6)
no<-seq(1,6)
dat<-data.frame(no,idx,rr)
then I want to pair it with a standard index
id<-c("1_1_2015_0_00_00","1_1_2015_0_10_00","1_1_2015_0_20_00","1_1_2015_0_30_00","1_1_2015_0_40_00","1_1_2015_0_50_00","1_1_2015_0_60_00","1_1_2015_0_70_00","1_1_2015_0_80_00")
so that I get the rows in the order of the standard index, with NA where data are missing, such as
no idx rr
1 1 1_1_2015_0_00_00 2
2 2 1_1_2015_0_10_00 3
3 NA NA NA
4 3 1_1_2015_0_30_00 4
5 4 1_1_2015_0_40_00 1
6 NA NA NA
7 5 1_1_2015_0_60_00 5
8 NA NA NA
9 6 1_1_2015_0_80_00 6
How can I get this?
You can use match
dat[match(id, dat$idx), ]
# no idx rr
#1 1 1_1_2015_0_00_00 2
#2 2 1_1_2015_0_10_00 3
#NA NA <NA> NA
#3 3 1_1_2015_0_30_00 4
#4 4 1_1_2015_0_40_00 1
#NA.1 NA <NA> NA
#5 5 1_1_2015_0_60_00 5
#NA.2 NA <NA> NA
#6 6 1_1_2015_0_80_00 6
match(id, dat$idx) returns
#[1] 1 2 NA 3 4 NA 5 NA 6
and we use this vector to select rows of dat.
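If the standard index itself should stay visible in the rows that have no data, a small variation is to build the result column by column from the match() positions. This is only a sketch; pos and out are names chosen for the example.

```r
idx <- c("1_1_2015_0_00_00", "1_1_2015_0_10_00", "1_1_2015_0_30_00",
         "1_1_2015_0_40_00", "1_1_2015_0_60_00", "1_1_2015_0_80_00")
rr <- c(2, 3, 4, 1, 5, 6)
no <- seq(1, 6)
dat <- data.frame(no, idx, rr)

id <- c("1_1_2015_0_00_00", "1_1_2015_0_10_00", "1_1_2015_0_20_00",
        "1_1_2015_0_30_00", "1_1_2015_0_40_00", "1_1_2015_0_50_00",
        "1_1_2015_0_60_00", "1_1_2015_0_70_00", "1_1_2015_0_80_00")

# Position of each standard index in dat$idx, NA where it is absent
pos <- match(id, dat$idx)

# Indexing a vector with NA yields NA, so gaps are filled automatically
out <- data.frame(idx = id, no = dat$no[pos], rr = dat$rr[pos])
out
```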
I would like to build a dataframe that lists all possible combinations of 6 numbers.
I realised that I can use combn(), but with only one value for m. With a bit of playing around I got the desired result by going through step-by-step with the following code -
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combi2 <- data.frame(t(combn(c(1:6), 2)))
combi3 <- data.frame(t(combn(c(1:6), 3)))
combi4 <- data.frame(t(combn(c(1:6), 4)))
combi5 <- data.frame(t(combn(c(1:6), 5)))
combi6 <- data.frame(t(combn(c(1:6), 6)))
Combi <- rbind.fill(combi1, combi2, combi3, combi4, combi5, combi6)
I had to transpose each of the DFs to get them in the right shape.
My problem is that this seems to be quite an inefficient method, and maybe a bit simplistic. I thought there must surely be some quicker way to code this, but I haven't found any solution online that gives me what I'd like.
Possibly build it into a function or a loop somehow? I'm fairly new to R though and haven't had a great deal of practice writing functions.
Is this what you want?
combis <- vector("list", 6)
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combis[[1]] <- combi1
combis[2:6] <- lapply(2:6, function(n) data.frame(t(combn(c(1:6), n))))
do.call(plyr::rbind.fill, combis)
Result:
X1 X2 X3 X4 X5 X6
1 1 NA NA NA NA NA
2 2 NA NA NA NA NA
3 3 NA NA NA NA NA
4 4 NA NA NA NA NA
5 5 NA NA NA NA NA
6 6 NA NA NA NA NA
7 1 2 NA NA NA NA
8 1 3 NA NA NA NA
9 1 4 NA NA NA NA
10 1 5 NA NA NA NA
11 1 6 NA NA NA NA
12 2 3 NA NA NA NA
13 2 4 NA NA NA NA
14 2 5 NA NA NA NA
15 2 6 NA NA NA NA
16 3 4 NA NA NA NA
17 3 5 NA NA NA NA
18 3 6 NA NA NA NA
19 4 5 NA NA NA NA
20 4 6 NA NA NA NA
21 5 6 NA NA NA NA
22 1 2 3 NA NA NA
23 1 2 4 NA NA NA
24 1 2 5 NA NA NA
25 1 2 6 NA NA NA
26 1 3 4 NA NA NA
27 1 3 5 NA NA NA
28 1 3 6 NA NA NA
29 1 4 5 NA NA NA
30 1 4 6 NA NA NA
31 1 5 6 NA NA NA
32 2 3 4 NA NA NA
33 2 3 5 NA NA NA
34 2 3 6 NA NA NA
35 2 4 5 NA NA NA
36 2 4 6 NA NA NA
37 2 5 6 NA NA NA
38 3 4 5 NA NA NA
39 3 4 6 NA NA NA
40 3 5 6 NA NA NA
41 4 5 6 NA NA NA
42 1 2 3 4 NA NA
43 1 2 3 5 NA NA
44 1 2 3 6 NA NA
45 1 2 4 5 NA NA
46 1 2 4 6 NA NA
47 1 2 5 6 NA NA
48 1 3 4 5 NA NA
49 1 3 4 6 NA NA
50 1 3 5 6 NA NA
51 1 4 5 6 NA NA
52 2 3 4 5 NA NA
53 2 3 4 6 NA NA
54 2 3 5 6 NA NA
55 2 4 5 6 NA NA
56 3 4 5 6 NA NA
57 1 2 3 4 5 NA
58 1 2 3 4 6 NA
59 1 2 3 5 6 NA
60 1 2 4 5 6 NA
61 1 3 4 5 6 NA
62 2 3 4 5 6 NA
63 1 2 3 4 5 6
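If you would rather not depend on plyr, the same table can be sketched in base R by padding each set of combinations with NA columns before stacking them. This is an alternative shown for illustration only; n, rows and Combi are names made up for the example, and the columns come out named V1 to V6 rather than X1 to X6.

```r
n <- 6

rows <- lapply(seq_len(n), function(m) {
  cmb <- t(combn(n, m))                       # one combination of size m per row
  cbind(cmb, matrix(NA, nrow(cmb), n - m))    # pad on the right with NA up to n columns
})

# Stack the padded matrices and convert once at the end
Combi <- as.data.frame(do.call(rbind, rows))
dim(Combi)
```

Padding to a common width is what rbind.fill does implicitly; doing it by hand keeps everything as one integer matrix until the final conversion.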
For the data given below,
data1<-structure(list(var1 = c("2 7", "2 6 7", "2 7", "2 7", "1 7",
"1 7", "1 5", "1 2 7", "1 5", "1 7", "1 2 3 4 5 6 7", "1 2 4 6"
)), .Names = "var1", class = "data.frame", row.names = c(NA,
-12L))
> data1
var1
1 2 7
2 2 6 7
3 2 7
4 2 7
5 1 7
6 1 7
7 1 5
8 1 2 7
9 1 5
10 1 7
11 1 2 3 4 5 6 7
12 1 2 4 6
I would like it to split into seven columns (7) as follows:
v1 v2 v3 v4 v5 v6 v7
1 NA 2 NA NA NA NA 7
2 NA 2 NA NA NA 6 7
3 NA 2 NA NA NA NA 7
4 NA 2 NA NA NA NA 7
5 1 NA NA NA NA NA 7
6 1 NA NA NA NA NA 7
7 1 NA NA NA 5 NA NA
8 1 2 NA NA NA NA 7
9 1 NA NA NA 5 NA NA
10 1 NA NA NA NA NA 7
11 1 2 3 4 5 6 7
12 1 2 NA 4 NA 6 NA
I used tstrsplit() from the data.table package as follows:
library(data.table)
setDT(data1)[,tstrsplit(var1," ")]
V1 V2 V3 V4 V5 V6 V7
1: 2 7 NA NA NA NA NA
2: 2 6 7 NA NA NA NA
3: 2 7 NA NA NA NA NA
4: 2 7 NA NA NA NA NA
5: 1 7 NA NA NA NA NA
6: 1 7 NA NA NA NA NA
7: 1 5 NA NA NA NA NA
8: 1 2 7 NA NA NA NA
9: 1 5 NA NA NA NA NA
10: 1 7 NA NA NA NA NA
11: 1 2 3 4 5 6 7
12: 1 2 4 6 NA NA NA
This is different from the expected output. How can I get the expected output described above?
With data.table you may try
library(magrittr)
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))] %>%
dcast(., rn ~ V1)
rn 1 2 3 4 5 6 7
1: 1 NA 2 NA NA NA NA 7
2: 2 NA 2 NA NA NA 6 7
3: 3 NA 2 NA NA NA NA 7
4: 4 NA 2 NA NA NA NA 7
5: 5 1 NA NA NA NA NA 7
6: 6 1 NA NA NA NA NA 7
7: 7 1 NA NA NA 5 NA NA
8: 8 1 2 NA NA NA NA 7
9: 9 1 NA NA NA 5 NA NA
10: 10 1 NA NA NA NA NA 7
11: 11 1 2 3 4 5 6 7
12: 12 1 2 NA 4 NA 6 NA
To get rid of the rn column, we can use
setDT(data1)[, strsplit(var1," "), by = .(rn = 1:nrow(data1))][
, dcast(.SD, rn ~ V1)][, rn := NULL][]
Explanation
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))]
creates a data.table directly in long format
rn V1
1: 1 2
2: 1 7
3: 2 2
4: 2 6
5: 2 7
6: 3 2
7: 3 7
8: 4 2
9: 4 7
10: 5 1
11: 5 7
12: 6 1
13: 6 7
14: 7 1
15: 7 5
16: 8 1
17: 8 2
18: 8 7
19: 9 1
20: 9 5
21: 10 1
22: 10 7
23: 11 1
24: 11 2
25: 11 3
26: 11 4
27: 11 5
28: 11 6
29: 11 7
30: 12 1
31: 12 2
32: 12 4
33: 12 6
rn V1
which is then reshaped to wide format using dcast().
If we used tstrsplit() instead of strsplit(), we would get a data.table in wide format, which needs to be reshaped to long format using melt():
setDT(data1)[,tstrsplit(var1," ")][, rn := .I][
, melt(.SD, id = "rn", na.rm = TRUE)][
, dcast(.SD, rn ~ paste0("V", value))][
, rn := NULL][]
In base R, we can do this by splitting the string on one or more whitespace characters (\\s+), creating a row/column index ('i1'), and filling an NA matrix ('m1') with the unlisted split values at those positions
lst <- lapply(strsplit(data1$var1, "\\s+"), as.numeric)
i1 <- cbind(rep(1:nrow(data1), lengths(lst)), unlist(lst))
m1 <- matrix(NA, nrow = max(i1[,1]), ncol = max(i1[,2]))
m1[i1] <- unlist(lst)
as.data.frame(m1)
# V1 V2 V3 V4 V5 V6 V7
#1 NA 2 NA NA NA NA 7
#2 NA 2 NA NA NA 6 7
#3 NA 2 NA NA NA NA 7
#4 NA 2 NA NA NA NA 7
#5 1 NA NA NA NA NA 7
#6 1 NA NA NA NA NA 7
#7 1 NA NA NA 5 NA NA
#8 1 2 NA NA NA NA 7
#9 1 NA NA NA 5 NA NA
#10 1 NA NA NA NA NA 7
#11 1 2 3 4 5 6 7
#12 1 2 NA 4 NA 6 NA
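A row-by-row variant of the same base R idea, given here only as a sketch, starts each row as a vector of NAs and writes every value into the position equal to that value. It assumes the values are positive integers; vals, k, m and out are illustrative names.

```r
data1 <- data.frame(
  var1 = c("2 7", "2 6 7", "2 7", "2 7", "1 7", "1 7", "1 5", "1 2 7",
           "1 5", "1 7", "1 2 3 4 5 6 7", "1 2 4 6"),
  stringsAsFactors = FALSE
)

vals <- lapply(strsplit(data1$var1, "\\s+"), as.integer)
k <- max(unlist(vals))      # number of output columns (7 here)

# For each row: NA everywhere, then put each value v at position v
m <- t(sapply(vals, function(v) replace(rep(NA_integer_, k), v, v)))

out <- as.data.frame(m)
names(out) <- paste0("v", seq_len(k))
out
```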