Suppose I have the vector C <- c(1,2,3,4,5).
I want to find all the unique pairs of numbers that can be extracted from this vector, e.g., 12, 13, 23, etc. How can I do this?
One option, with the vector stored as x (note that this returns ordered pairs, so both "12" and "21" appear):
na.omit(c(`diag<-`(sapply(x, paste0, x), NA)))
[1] "12" "13" "14" "15" "21" "23" "24" "25" "31" "32" "34" "35" "41" "42" "43" "45"
[17] "51" "52" "53" "54"
Using the RcppAlgos package:
## Combinations
unlist(RcppAlgos::comboGeneral(x, 2, FUN=function(x) Reduce(paste0, x)))
# [1] "12" "13" "14" "15" "23" "24" "25" "34" "35" "45"
## Permutations
unlist(RcppAlgos::permuteGeneral(x, 2, FUN=function(x) Reduce(paste0, x)))
# [1] "12" "13" "14" "15" "21" "23" "24" "25" "31" "32" "34" "35" "41" "42" "43"
# [16] "45" "51" "52" "53" "54"
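If you would rather avoid an extra package, base R's combn() should give the same unordered pairs. This is only a minimal sketch: it assumes the vector is stored as x, and the name pairs is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)

# combn() enumerates every 2-element combination of x;
# each combination is collapsed into a single string by paste0()
pairs <- combn(x, 2, FUN = paste0, collapse = "")
pairs
# [1] "12" "13" "14" "15" "23" "24" "25" "34" "35" "45"
```

Because combn() only generates combinations, each pair appears once ("12" but not "21").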
I have a data.table with many missing values, and I want a variable that is 1 for the first non-missing value in each group.
Say I have such a data.table:
library(data.table)
DT <- data.table(iris)[,.(Petal.Width,Species)]
DT[c(1:10,15,45:50,51:70,101:134),Petal.Width:=NA]
which now has missing values at the beginning, at the end, and in between. I have tried two versions. The first is:
DT[min(which(!is.na(Petal.Width))),first_available:=1,by=Species]
but it only finds the global minimum (in this case, setosa gets the correct 1), not the minimum by group. I think this is because data.table first subsets by i and then groups, correct? So it only ever works with the single row that is the global minimum of which(!is.na(Petal.Width)), i.e. the first non-NA value overall.
A second attempt with the test in j:
DT[,first_available:= ifelse(min(which(!is.na(Petal.Width))),1,0),by=Species]
which just returns a column of 1s. Here, I don't have a good explanation as to why it doesn't work.
My goal is this:
DT[,first_available:=0]
DT[c(11,71,135),first_available:=1]
but in reality I have hundreds of groups. Any help would be appreciated!
Edit: this question comes close, but it is not targeted at NAs and does not solve the issue here, if I understand it correctly. I tried:
DT <- data.table(DT, key = c('Species'))
DT[unique(DT[,key(DT), with = FALSE]), mult = 'first']
Here's one way:
DT[!is.na(Petal.Width), first := as.integer(seq_len(.N) == 1L), by = Species]
We can try
DT[DT[, .I[which.max(!is.na(Petal.Width))] , Species]$V1,
first_available := 1][is.na(first_available), first_available := 0]
Or a slightly more compact option is
DT[, first_available := as.integer(1:nrow(DT) %in%
DT[, .I[!is.na(Petal.Width)][1L], by = Species]$V1)][]
> DT[!is.na(DT$Petal.Width) & DT$first_available == 1]
# Petal.Width Species first_available
# 1: 0.2 setosa 1
# 2: 1.8 versicolor 1
# 3: 1.4 virginica 1
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 1]
# [1] "11" "71" "135"
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 0]
# [1] "12" "13" "14" "16" "17" "18" "19" "20" "21" "22" "23" "24"
# [13] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
# [25] "37" "38" "39" "40" "41" "42" "43" "44" "72" "73" "74" "75"
# [37] "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87"
# [49] "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
# [61] "100" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146"
# [73] "147" "148" "149" "150"
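For what it's worth, the second attempt in the question probably returns all 1s because min(which(...)) is a positive row position, and ifelse() treats any non-zero number as TRUE. As a cross-check on the data.table answers, here is a base R sketch of the same idea using ave(). It rebuilds the example as a plain data.frame called df, and first_true is a hypothetical helper name, not something from data.table.

```r
# Rebuild the example from the question as a plain data.frame
df <- iris[, c("Petal.Width", "Species")]
df$Petal.Width[c(1:10, 15, 45:50, 51:70, 101:134)] <- NA

# Hypothetical helper: TRUE only at the position of the first TRUE.
# If a group has no TRUE at all, which(ok)[1L] is NA and every element is FALSE.
first_true <- function(ok) seq_along(ok) %in% which(ok)[1L]

# ave() applies the helper within each Species and returns
# the results in the original row order
df$first_available <- as.integer(
  ave(!is.na(df$Petal.Width), df$Species, FUN = first_true)
)

which(df$first_available == 1L)
# [1]  11  71 135
```

The flagged rows match the 11, 71 and 135 that the question asks for.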
I have a panel dataset with many variables. The three most relevant variables are: "cid" (country code), "time" (0-65), and "event" (0, 1, 2, 3, 4, 5, 6).
I am trying to run a Cox regression (using coxph). However, since the time variable has different starting and ending points for each country, I first need to create a start time and an end time variable. This is where I run into my problem.
Here is what a sample of the three main variables may look like:
> data
cid time event
[1,] "AFG" "20" "0"
[2,] "AFG" "21" "0"
[3,] "AFG" "22" "0"
[4,] "AFG" "23" "0"
[5,] "AFG" "24" "0"
[6,] "AFG" "25" "0"
[7,] "AFG" "26" "1"
[8,] "AFG" "27" "1"
[9,] "AFG" "28" "1"
[10,] "AFG" "29" "1"
The idea is to convert this data into the following:
> data
cid time1 time2 event
[1,] "AFG" "20" "25" "0"
[2,] "AFG" "26" "29" "1"
How exactly does one go about doing this (keeping in mind that there are quite a few other explanatory variables in my dataset)?
You could use dplyr and the pipe operator. This solution will work if your data is always ordered sequentially, as in your example.
data<-data.frame(cid=rep("AFG",10),time=seq(20,29,1),event=c(0,0,0,0,0,0,1,1,1,1))
library(dplyr)
data %>% group_by(cid,event) %>%
summarise(time1=min(time),time2=max(time))
subset1<- data[data$event==0,]
subset1
subset2<- data[data$event==1,]
subset2
s1<- cbind(cid="AFG",time1=min(subset1$time),time2=max(subset1$time),event = 0)
s1
s2<- cbind(cid="AFG",time1=min(subset2$time),time2=max(subset2$time),event = 1)
s2
data1=rbind(s1,s2)
data1
# cid time1 time2 event
# [1,] "AFG" "20" "25" "0"
# [2,] "AFG" "26" "29" "1"
Hope this helps a little.
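Both approaches above group on the event value itself, so if a country returned to an earlier event value later on, the two separate spells would be merged into one interval. A minimal base R sketch that starts a new interval whenever cid or event changes from the previous row might look like this. It assumes the rows are already sorted by cid and time; the names new_spell, spell and spells are made up for illustration.

```r
data <- data.frame(cid = rep("AFG", 10), time = 20:29,
                   event = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1))

# A new spell starts on the first row and whenever cid or event
# differs from the previous row (assumes rows sorted by cid and time)
n <- nrow(data)
new_spell <- c(TRUE, data$cid[-1] != data$cid[-n] |
                     data$event[-1] != data$event[-n])
data$spell <- cumsum(new_spell)

# Collapse each spell to one row holding its first and last time
spells <- do.call(rbind, lapply(split(data, data$spell), function(d) {
  data.frame(cid = d$cid[1], time1 = min(d$time),
             time2 = max(d$time), event = d$event[1])
}))
spells
```

Any other explanatory variables would still need a rule for how to summarise them within a spell (for example, taking the first value), which this sketch does not attempt.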
I have a 5-level factor that looks like the following:
tmp
[1] NA
[2] 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
[3] NA
[4] NA
[5] 5,9,16,24,35,36,42
[6] 4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
[7] 8,39
5 Levels: 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46 ...
I want to access the items within each level except NA. So I use the levels() function, which gives me:
> levels(tmp)
[1] "1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46"
[2] "4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50"
[3] "5,9,16,24,35,36,42"
[4] "8,39"
[5] "NA"
Then I would like to access the elements in each level, and store them as numbers. However, for example,
> as.numeric(cat(levels(tmp)[3]))
5,9,16,24,35,36,42numeric(0)
Can you help me remove the commas between the numbers and the numeric(0) at the very end? I would like to have a numeric vector 5, 9, 16, 24, 35, 36, 42 so that I can use the values as indices into a data frame. Thanks!
You need to use a combination of unlist, strsplit and unique.
First, recreate your data:
dat <- read.table(text="
NA
1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
NA
NA
5,9,16,24,35,36,42
4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
8,39", stringsAsFactors = TRUE)$V1
Next, find all the unique levels, after using strsplit:
sort(unique(unlist(
sapply(levels(dat), function(x)unlist(strsplit(x, split=",")))
)))
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26"
[20] "27" "28" "29" "3" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "4" "40" "41" "42" "43"
[39] "44" "45" "46" "47" "48" "49" "5" "50" "6" "7" "8" "9"
Does this do what you want?
levels_split <- strsplit(levels(tmp), ",")
lapply(levels_split, as.numeric)
Using Andrie's dat
val <- scan(text=levels(dat),sep=",")
#Read 50 items
split(val,cumsum(c(TRUE,diff(val) <0)))
#$`1`
#[1] 1 2 3 6 11 12 13 18 20 21 22 26 29 33 40 43 46
#$`2`
#[1] 4 7 10 14 15 17 19 23 25 27 28 30 31 32 34 37 38 41 44 45 47 48 49 50
#$`3`
#[1] 5 9 16 24 35 36 42
#$`4`
#[1] 8 39
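As for why the original attempt printed numeric(0): cat() prints its argument and returns NULL, so as.numeric() receives NULL and gives a zero-length vector. Putting the strsplit() idea together, a sketch for pulling out one level as numeric indices could look like the following. The factor here is a shortened stand-in for the real tmp, and the name idx is made up for illustration.

```r
# Shortened stand-in for the factor in the question (NA is not a level here)
tmp <- factor(c(NA, "1,2,3", NA, "5,9,16,24,35,36,42", "8,39"))

# Split every level on commas and convert the pieces to numbers;
# the result is a list with one numeric vector per level
idx <- lapply(strsplit(levels(tmp), ",", fixed = TRUE), as.numeric)
names(idx) <- levels(tmp)

idx[["5,9,16,24,35,36,42"]]
# [1]  5  9 16 24 35 36 42

# The vectors can then be used directly as row indices, e.g. on iris
iris[idx[["5,9,16,24,35,36,42"]], "Species"]
```

If "NA" is itself a level, as in the levels(tmp) output in the question, its entry in the list will simply be NA and can be dropped before indexing.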