I wanted to follow up on the question that I posted here. I received base R and data.table solutions, but I was trying to do the same thing with cSplit_e from the splitstackshape package, as suggested in a comment on my previous post. With the data modified as below (i.e. with NAs),
data1<-structure(list(reason = c("1", "1", NA, "1", "1", "4 5", "1",
"1", "1", "1", "1", "1 2 3 4", "1 2 5", NA, NA)), .Names = "reason", class = "data.frame", row.names = c(NA,
-15L))
#loading packages
library(data.table)
library(splitstackshape)
cSplit_e(setDT(data1),1," ",mode = "value") # with NA's doesn't work
Error in seq.default(min(vec), max(vec)) : 'from' must be a finite number
data2<-na.omit(setDT(data1),cols="reason") # removing NA's
cSplit_e(data2,1," ",mode = "value") # without NA's works
reason reason_1 reason_2 reason_3 reason_4 reason_5
1: 1 1 NA NA NA NA
2: 1 1 NA NA NA NA
3: 1 1 NA NA NA NA
4: 1 1 NA NA NA NA
5: 4 5 NA NA NA 4 5
6: 1 1 NA NA NA NA
7: 1 1 NA NA NA NA
8: 1 1 NA NA NA NA
9: 1 1 NA NA NA NA
10: 1 1 NA NA NA NA
11: 1 2 3 4 1 2 3 4 NA
12: 1 2 5 1 2 NA NA 5
So, the question is: does cSplit_e account for NAs in the column to be split?
This has been fixed in the bugfix release (v1.4.4) of "splitstackshape". Thanks for reporting it.
After using update.packages(), you should be able to do:
packageVersion("splitstackshape")
## [1] ‘1.4.4’
cSplit_e(data1, 1, " ", mode = "value")
## reason reason_1 reason_2 reason_3 reason_4 reason_5
## 1 1 1 NA NA NA NA
## 2 1 1 NA NA NA NA
## 3 <NA> NA NA NA NA NA
## 4 1 1 NA NA NA NA
## 5 1 1 NA NA NA NA
## 6 4 5 NA NA NA 4 5
## 7 1 1 NA NA NA NA
## 8 1 1 NA NA NA NA
## 9 1 1 NA NA NA NA
## 10 1 1 NA NA NA NA
## 11 1 1 NA NA NA NA
## 12 1 2 3 4 1 2 3 4 NA
## 13 1 2 5 1 2 NA NA 5
## 14 <NA> NA NA NA NA NA
## 15 <NA> NA NA NA NA NA
Note that 1.4.4 has moved "data.table" from "depends" to "imports".
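If updating is not an option, the same layout can be approximated in base R. This is only a sketch and not how cSplit_e is implemented; the names parts, n_col, m and out are made up for illustration, and it assumes the values are positive integers separated by single spaces.

```r
data1 <- data.frame(
  reason = c("1", "1", NA, "1", "1", "4 5", "1", "1", "1", "1", "1",
             "1 2 3 4", "1 2 5", NA, NA),
  stringsAsFactors = FALSE
)

# Split each entry; an NA entry stays a single NA after as.numeric()
parts <- lapply(strsplit(data1$reason, " ", fixed = TRUE), as.numeric)

# One column per possible value, ignoring the NA entries
n_col <- max(unlist(parts), na.rm = TRUE)
m <- matrix(NA_real_, nrow = nrow(data1), ncol = n_col,
            dimnames = list(NULL, paste0("reason_", seq_len(n_col))))

# Put each value into the column that carries the same number
for (i in seq_along(parts)) {
  vals <- parts[[i]][!is.na(parts[[i]])]
  if (length(vals) > 0) m[i, vals] <- vals
}

out <- cbind(data1, m)
out
```

Rows whose reason is NA simply stay NA in every reason_* column, which is the behaviour shown in the fixed output above.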
I have two objects; let's call them 1 and 2. Each object's x variable can take the value 1 or 2, and depending on that value, the object's binary y values are determined as depicted in the image.
For example, if x=1 then only yA can be 1. But if x=2, all yA, yB and yC for that object can be 1. The constraint is that for each object maximum one y can be 1. In the image, blue columns are for object 1 and greens are for object 2.
Is there an efficient way to do this, given that the number of variables in the original problem is much higher?
EDIT: The objective is to find all the possible combinations of y variables, as depicted in the image. The image is only meant to give an idea of the expected outcome.
A bit of a brute-force generation.
First, creating the basic frame of all y* columns:
dat <- data.frame(yA=c(1,NA,NA),yB=c(NA,1,NA),yC=c(NA,NA,1),ign=1)
dat <- merge(dat, dat, by="ign")
names(dat)[-1] <- c("y1A", "y1B", "y1C", "y2A", "y2B", "y2C")
dat
# ign y1A y1B y1C y2A y2B y2C
# 1 1 1 NA NA 1 NA NA
# 2 1 1 NA NA NA 1 NA
# 3 1 1 NA NA NA NA 1
# 4 1 NA 1 NA 1 NA NA
# 5 1 NA 1 NA NA 1 NA
# 6 1 NA 1 NA NA NA 1
# 7 1 NA NA 1 1 NA NA
# 8 1 NA NA 1 NA 1 NA
# 9 1 NA NA 1 NA NA 1
Merge (outer/cartesian) with a frame of x*:
alldat <- merge(data.frame(x1=c(1,1,2),x2=c(1,2,2),ign=1), dat, by="ign")
subset(alldat, (!is.na(y1B) | x1 > 1) & (!is.na(y2B) | x2 > 1), select = -ign)
# x1 x2 y1A y1B y1C y2A y2B y2C
# 5 1 1 NA 1 NA NA 1 NA
# 13 1 2 NA 1 NA 1 NA NA
# 14 1 2 NA 1 NA NA 1 NA
# 15 1 2 NA 1 NA NA NA 1
# 19 2 2 1 NA NA 1 NA NA
# 20 2 2 1 NA NA NA 1 NA
# 21 2 2 1 NA NA NA NA 1
# 22 2 2 NA 1 NA 1 NA NA
# 23 2 2 NA 1 NA NA 1 NA
# 24 2 2 NA 1 NA NA NA 1
# 25 2 2 NA NA 1 1 NA NA
# 26 2 2 NA NA 1 NA 1 NA
# 27 2 2 NA NA 1 NA NA 1
The ign column is merely to force/enable merge to do a cartesian/outer join.
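As a side note, merge() also produces a cartesian product when by is NULL (or has length zero), so the helper column is not strictly needed. A minimal sketch of the same frame built that way; the names ys, y1, y2, pairs and xs are illustrative only, and the row order may differ from the output above.

```r
# One row per "which y is set" choice for a single object
ys <- data.frame(yA = c(1, NA, NA), yB = c(NA, 1, NA), yC = c(NA, NA, 1))
y1 <- setNames(ys, c("y1A", "y1B", "y1C"))
y2 <- setNames(ys, c("y2A", "y2B", "y2C"))

# by = NULL asks merge() for the cartesian product of the two frames
pairs <- merge(y1, y2, by = NULL)

# Same idea for the x combinations
xs <- data.frame(x1 = c(1, 1, 2), x2 = c(1, 2, 2))
alldat <- merge(xs, pairs, by = NULL)

dim(alldat)
```

The subset() filter from the answer can then be applied to alldat as before, without the select = -ign part.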
Say I have a data frame as follows (in reality this is multiple data frames bound):
data.frame(
position = c(3,4,7,12,NA,NA,NA,NA,NA,NA,NA,NA),
colb = c(1,3,8,2,NA,NA,NA,NA,NA,NA,NA,NA),
colc = c(4,6,9,5,NA,NA,NA,NA,NA,NA,NA,NA),
position = c(2,7,8,10,11,12,15,16,19,21,24,26),
colb = c(1,3,8),
colc = c(4,6,9)
)
(Sorry, gets flagged if I post the data format myself.)
How can I transform this so that I have a unified way of indicating a 'position', i.e. one of the two formats below?
A single column scale:
position colb colc colb.1 colc.1
1 NA NA NA NA
2 NA NA 1 4
3 1 4 NA NA
4 3 6 NA NA
5 NA NA NA NA
6 NA NA NA NA
7 8 9 3 6
8 NA NA 8 9
9 NA NA NA NA
10 NA NA 1 4
11 NA NA 3 6
12 2 5 8 9
13 NA NA NA NA
14 NA NA NA NA
15 NA NA 1 4
16 NA NA 3 6
17 NA NA NA NA
18 NA NA NA NA
19 NA NA 8 9
20 NA NA NA NA
21 NA NA 1 4
22 NA NA NA NA
23 NA NA NA NA
24 NA NA 3 6
25 NA NA NA NA
26 NA NA 8 9
Or with separate columns, but 'matching':
position colb colc position.1 colb.1 colc.1
NA NA NA NA NA NA
NA NA NA 2 3 6
3 1 4 NA NA NA
4 3 6 NA NA NA
NA NA NA NA NA NA
NA NA NA NA NA NA
7 8 9 7 1 4
NA NA NA 8 3 6
NA NA NA NA NA NA
NA NA NA 10 1 4
NA NA NA 11 3 6
12 2 5 12 8 9
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA 15 8 9
NA NA NA 16 1 4
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA 19 1 4
NA NA NA NA NA NA
NA NA NA 21 8 9
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA 24 8 9
NA NA NA NA NA NA
NA NA NA 26 8 9
Any help is appreciated. Thanks.
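If df contains the data frame below, you can split it into its two halves, give both the same column names, stack them, drop the rows without a position, and sort by position: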
df <- data.frame(
position = c(3,4,7,12,NA,NA,NA,NA,NA,NA,NA,NA),
colb = c(1,3,8,2,NA,NA,NA,NA,NA,NA,NA,NA),
colc = c(4,6,9,5,NA,NA,NA,NA,NA,NA,NA,NA),
position = c(2,7,8,10,11,12,15,16,19,21,24,26),
colb = c(1,3,8),
colc = c(4,6,9)
)
df1 <- df[,1:3]
df2 <- df[,4:6]
names(df2) <- c("position", "colb", "colc")
df_out <- rbind(df1, df2)
df_out <- df_out[!is.na(df_out$position),]
df_out <- df_out[order(df_out$position),]
df_out
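The code above returns the data in long form. If the single-column scale from the question is wanted instead, one possible approach is to left-join both halves onto a complete run of positions. This is a sketch rather than the only way; first, second, scale and wide are names I made up, and it assumes positions are whole numbers starting at 1.

```r
df <- data.frame(
  position = c(3, 4, 7, 12, NA, NA, NA, NA, NA, NA, NA, NA),
  colb = c(1, 3, 8, 2, NA, NA, NA, NA, NA, NA, NA, NA),
  colc = c(4, 6, 9, 5, NA, NA, NA, NA, NA, NA, NA, NA),
  position = c(2, 7, 8, 10, 11, 12, 15, 16, 19, 21, 24, 26),
  colb = c(1, 3, 8),
  colc = c(4, 6, 9)
)

# data.frame() renames the duplicated columns to position.1, colb.1, colc.1
first  <- df[!is.na(df$position), 1:3]
second <- df[, 4:6]
names(second)[1] <- "position"   # keep colb.1 / colc.1 as they are

# Every position from 1 up to the largest one seen in either half
scale <- data.frame(
  position = as.numeric(seq_len(max(first$position, second$position)))
)

# Two left joins keep all positions and fill the gaps with NA
wide <- merge(merge(scale, first, by = "position", all.x = TRUE),
              second, by = "position", all.x = TRUE)
wide
```

Because both joins use all.x = TRUE, positions that appear in neither half are kept as rows of NA.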
For the data given below,
data1<-structure(list(var1 = c("2 7", "2 6 7", "2 7", "2 7", "1 7",
"1 7", "1 5", "1 2 7", "1 5", "1 7", "1 2 3 4 5 6 7", "1 2 4 6"
)), .Names = "var1", class = "data.frame", row.names = c(NA,
-12L))
> data1
var1
1 2 7
2 2 6 7
3 2 7
4 2 7
5 1 7
6 1 7
7 1 5
8 1 2 7
9 1 5
10 1 7
11 1 2 3 4 5 6 7
12 1 2 4 6
I would like to split it into seven columns, as follows:
v1 v2 v3 v4 v5 v6 v7
1 NA 2 NA NA NA NA 7
2 NA 2 NA NA NA 6 7
3 NA 2 NA NA NA NA 7
4 NA 2 NA NA NA NA 7
5 1 NA NA NA NA NA 7
6 1 NA NA NA NA NA 7
7 1 NA NA NA 5 NA NA
8 1 2 NA NA NA NA 7
9 1 NA NA NA 5 NA NA
10 1 NA NA NA NA NA 7
11 1 2 3 4 5 6 7
12 1 2 NA 4 NA 6 NA
I used tstrsplit from the data.table package as follows:
library(data.table)
setDT(data1)[,tstrsplit(var1," ")]
V1 V2 V3 V4 V5 V6 V7
1: 2 7 NA NA NA NA NA
2: 2 6 7 NA NA NA NA
3: 2 7 NA NA NA NA NA
4: 2 7 NA NA NA NA NA
5: 1 7 NA NA NA NA NA
6: 1 7 NA NA NA NA NA
7: 1 5 NA NA NA NA NA
8: 1 2 7 NA NA NA NA
9: 1 5 NA NA NA NA NA
10: 1 7 NA NA NA NA NA
11: 1 2 3 4 5 6 7
12: 1 2 4 6 NA NA NA
This is different from the expected output. How can I get the expected output described above?
With data.table you may try
library(magrittr)
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))] %>%
dcast(., rn ~ V1)
rn 1 2 3 4 5 6 7
1: 1 NA 2 NA NA NA NA 7
2: 2 NA 2 NA NA NA 6 7
3: 3 NA 2 NA NA NA NA 7
4: 4 NA 2 NA NA NA NA 7
5: 5 1 NA NA NA NA NA 7
6: 6 1 NA NA NA NA NA 7
7: 7 1 NA NA NA 5 NA NA
8: 8 1 2 NA NA NA NA 7
9: 9 1 NA NA NA 5 NA NA
10: 10 1 NA NA NA NA NA 7
11: 11 1 2 3 4 5 6 7
12: 12 1 2 NA 4 NA 6 NA
To get rid of the rn column, we can use
setDT(data1)[, strsplit(var1," "), by = .(rn = 1:nrow(data1))][
, dcast(.SD, rn ~ V1)][, rn := NULL][]
Explanation
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))]
creates a data.table directly in long format
rn V1
1: 1 2
2: 1 7
3: 2 2
4: 2 6
5: 2 7
6: 3 2
7: 3 7
8: 4 2
9: 4 7
10: 5 1
11: 5 7
12: 6 1
13: 6 7
14: 7 1
15: 7 5
16: 8 1
17: 8 2
18: 8 7
19: 9 1
20: 9 5
21: 10 1
22: 10 7
23: 11 1
24: 11 2
25: 11 3
26: 11 4
27: 11 5
28: 11 6
29: 11 7
30: 12 1
31: 12 2
32: 12 4
33: 12 6
rn V1
which is then reshaped to wide format using dcast().
If we used tstrsplit() instead of strsplit(), we would get a data.table in wide format, which would first need to be reshaped to long format using melt():
setDT(data1)[,tstrsplit(var1," ")][, rn := .I][
, melt(.SD, id = "rn", na.rm = TRUE)][
, dcast(.SD, rn ~ paste0("V", value))][
, rn := NULL][]
In base R, we can do this by splitting the string on one or more whitespace characters (\\s+), creating a row/column index ('i1'), and filling an NA matrix ('m1') at those index positions with the unlisted split values:
lst <- lapply(strsplit(data1$var1, "\\s+"), as.numeric)
i1 <- cbind(rep(1:nrow(data1), lengths(lst)), unlist(lst))
m1 <- matrix(NA, nrow = max(i1[,1]), ncol = max(i1[,2]))
m1[i1] <- unlist(lst)
as.data.frame(m1)
# V1 V2 V3 V4 V5 V6 V7
#1 NA 2 NA NA NA NA 7
#2 NA 2 NA NA NA 6 7
#3 NA 2 NA NA NA NA 7
#4 NA 2 NA NA NA NA 7
#5 1 NA NA NA NA NA 7
#6 1 NA NA NA NA NA 7
#7 1 NA NA NA 5 NA NA
#8 1 2 NA NA NA NA 7
#9 1 NA NA NA 5 NA NA
#10 1 NA NA NA NA NA 7
#11 1 2 3 4 5 6 7
#12 1 2 NA 4 NA 6 NA
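The key step is indexing a matrix with a two-column matrix, where each row of the index is one (row, column) pair. A tiny standalone sketch, with m and idx as illustrative names:

```r
# Empty 2 x 3 matrix to fill
m <- matrix(NA_real_, nrow = 2, ncol = 3)

# Each row of idx is one (row, column) position in m
idx <- cbind(c(1, 2, 2),   # row numbers
             c(2, 1, 3))   # column numbers

# Assign one value per (row, column) pair
m[idx] <- c(2, 1, 3)
m
#      [,1] [,2] [,3]
# [1,]   NA    2   NA
# [2,]    1   NA    3
```

In the answer, the column number and the value are the same thing, which is why unlist(lst) appears both in the index and on the right-hand side.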
I need some help with subset/filter of data.frame. Below is the code for my random dataset.
A <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
B <- c(3,3,3,3,4,4,4,4,1,1,1,1,2,2,2,2)
C <- c(1,1,1,1,3,3,3,3,2,2,2,2,4,4,4,4)
Fakey <- data.frame(A, B, C)
Filter_Fakey <- subset(Fakey, (Fakey>1 & Fakey<4))
That last line of code results in the following:
> Filter_Fakey
A B C
5 2 4 3
6 2 4 3
7 2 4 3
8 2 4 3
9 3 1 2
10 3 1 2
11 3 1 2
12 3 1 2
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
NA.9 NA NA NA
NA.10 NA NA NA
NA.11 NA NA NA
NA.12 NA NA NA
NA.13 NA NA NA
NA.14 NA NA NA
NA.15 NA NA NA
But what I really want is this:
> Filter_Fakey
A B C
5 2 3 3
6 2 3 3
7 2 3 3
8 2 3 3
9 3 2 2
10 3 2 2
11 3 2 2
12 3 2 2
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
NA.9 NA NA NA
NA.10 NA NA NA
NA.11 NA NA NA
NA.12 NA NA NA
NA.13 NA NA NA
NA.14 NA NA NA
NA.15 NA NA NA
I've tried subset(), subset(with a negation condition), filter{dplyr}, and the different bracket notations ('[' and '[['). Thanks for helping me out.
Use lapply to loop through the columns of the data frame and set the values that fail the condition to NA, if that is what you are after. Then use order(is.na(...)) to move the NA values to the end of each column:
do.call(cbind, lapply(Fakey, function(col) {
col[col <= 1 | col >= 4] <- NA; col[order(is.na(col))]
}))
A B C
1 2 3 3
2 2 3 3
3 2 3 3
4 2 3 3
5 3 2 2
6 3 2 2
7 3 2 2
8 3 2 2
9 NA NA NA
10 NA NA NA
11 NA NA NA
12 NA NA NA
13 NA NA NA
14 NA NA NA
15 NA NA NA
16 NA NA NA
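As for why the original subset() call returned rows of NA, the following is my reading of the output rather than something stated in the question. The condition Fakey > 1 & Fakey < 4 is a 16 x 3 logical matrix, and when it is used to pick rows it appears to be treated as one long vector of 48 values, so every TRUE beyond position 16 points at a row that does not exist. A small sketch to check this (keep and hits are illustrative names):

```r
A <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
B <- c(3,3,3,3,4,4,4,4,1,1,1,1,2,2,2,2)
C <- c(1,1,1,1,3,3,3,3,2,2,2,2,4,4,4,4)
Fakey <- data.frame(A, B, C)

# Comparing a whole data frame gives a logical matrix, not a row filter
keep <- Fakey > 1 & Fakey < 4
dim(keep)      # 16 rows, 3 columns
length(keep)   # 48 values when read as a plain vector

# Positions of the TRUE values, counted column by column
hits <- which(keep)
sum(hits <= nrow(Fakey))   # 8: real rows, from column A
sum(hits > nrow(Fakey))    # 16: positions past the last row
```

Those counts line up with the 8 real rows and 16 NA rows in the question's output.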
Another option is using length<- to pad NA at the end after subsetting each of the columns using the logical condition.
data.frame(lapply(Fakey, function(x) `length<-`(x[x > 1 & x <4], nrow(Fakey))))
# A B C
#1 2 3 3
#2 2 3 3
#3 2 3 3
#4 2 3 3
#5 3 2 2
#6 3 2 2
#7 3 2 2
#8 3 2 2
#9 NA NA NA
#10 NA NA NA
#11 NA NA NA
#12 NA NA NA
#13 NA NA NA
#14 NA NA NA
#15 NA NA NA
#16 NA NA NA
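Both answers rely on a small trick for pushing NA to the end of a vector. A minimal sketch of the two tricks on toy vectors (x, y, kept, padded and moved are illustrative names):

```r
# Trick 1: keep the matching values, then lengthen the vector;
# `length<-` pads the new positions with NA
x <- c(5, 2, 9, 3)
kept <- x[x > 1 & x < 4]
padded <- `length<-`(kept, length(x))
padded
# [1]  2  3 NA NA

# Trick 2: is.na() is FALSE for real values and TRUE for NA,
# so ordering by it moves the NA entries to the end
y <- c(NA, 2, NA, 3)
moved <- y[order(is.na(y))]
moved
# [1]  2  3 NA NA
```

Both keep the original order of the values that survive the filter.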
I am trying to find the number of discordant and concordant pairs in a clinical trial, and have come across the 'asbio' library which provides the function ConDis.matrix. (http://artax.karlin.mff.cuni.cz/r-help/library/asbio/html/ConDis.matrix.html)
The dataset they give as an example is:
crab<-data.frame(gill.wt=c(159,179,100,45,384,230,100,320,80,220,320,210),
body.wt=c(14.4,15.2,11.3,2.5,22.7,14.9,1.41,15.81,4.19,15.39,17.25,9.52))
attach(crab)
crabm<-ConDis.matrix(gill.wt,body.wt)
crabm
Which gives a result that looks like:
1 2 3 4 5 6 7 8 9 10 11 12
1 NA NA NA NA NA NA NA NA NA NA NA NA
2 1 NA NA NA NA NA NA NA NA NA NA NA
3 1 1 NA NA NA NA NA NA NA NA NA NA
4 1 1 1 NA NA NA NA NA NA NA NA NA
5 1 1 1 1 NA NA NA NA NA NA NA NA
6 1 -1 1 1 1 NA NA NA NA NA NA NA
7 1 1 0 -1 1 1 NA NA NA NA NA NA
8 1 1 1 1 1 1 1 NA NA NA NA NA
9 1 1 1 1 1 1 -1 1 NA NA NA NA
10 1 1 1 1 1 -1 1 1 1 NA NA NA
11 1 1 1 1 1 1 1 0 1 1 NA NA
12 -1 -1 -1 1 1 1 1 1 1 1 1 NA
The solution I can think of is adding up the 1s and -1s (for concordant and discordant pairs, respectively), but I don't know how to count values in a matrix. Alternatively, if someone has a better way of counting concordant/discordant pairs, I would love to know.
The solution you found works like this:
sum(crabm == 1, na.rm = TRUE)
[1] 57
sum(crabm == -1, na.rm = TRUE)
[1] 7
You could also try ConDisPairs from DescTools (C = concordant pairs, D = discordant pairs):
library(DescTools)
tab <- table(crab$gill.wt, crab$body.wt)
ConDisPairs(tab)[c("C","D")]
$C
[1] 57
$D
[1] 7
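If neither package is available, the counts can also be reproduced in base R by comparing every pair of observations directly. This is a sketch of the usual definition (a pair is concordant when both variables move in the same direction, discordant when they move in opposite directions, and neither when there is a tie); dx, dy, s, n_conc and n_disc are illustrative names.

```r
crab <- data.frame(
  gill.wt = c(159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210),
  body.wt = c(14.4, 15.2, 11.3, 2.5, 22.7, 14.9, 1.41, 15.81, 4.19, 15.39,
              17.25, 9.52)
)

# Sign of the difference between every pair of observations
dx <- sign(outer(crab$gill.wt, crab$gill.wt, "-"))
dy <- sign(outer(crab$body.wt, crab$body.wt, "-"))

# Product is 1 (concordant), -1 (discordant) or 0 (tie);
# the lower triangle counts each pair once
s <- (dx * dy)[lower.tri(dx)]

n_conc <- sum(s == 1)
n_disc <- sum(s == -1)
c(concordant = n_conc, discordant = n_disc)
# concordant discordant
#         57          7
```

The two remaining pairs out of 66 are ties in gill.wt, which matches the two zeros in the ConDis.matrix output.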