Grouping a variable with numerous levels - r

Let's say I have a factor variable with numerous levels and I am trying to group them into several groups.
> levels(dat$years_continuously_insured_order2)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18"
[19] "19" "20"
> levels(dat$age_of_oldest_driver)
[1] "-16" "1" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"
[22] "34" "35" "36" "37" "38" "39" "40
I have a script which runs through these variables and groups them into several categories. However, the number of levels could (and usually is) different each time my script runs. Therefore, if my original code to group the variables was the following (see below), it wouldn't be of use if in an hour later, my script runs and the levels are different. Instead of 15 levels, I could now have 25 levels and the values are different, but I still need to group them into specific categories.
dat$years_continuously_insured2 <- NA
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[1]] <- NA
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[2:3]] <- "1 or less"
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[4]] <- "2"
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[5:7]] <- "3 +"
dat$years_continuously_insured2 <- factor(dat$years_continuously_insured2)
How can I find a more elegant way to group variables into segments? Are there better ways to do this in R?
Thanks!

You could convert your factor levels in the continuously insured variable to numeric and then cut to your categories and re-factor(). The first step is described in the R-FAQ (to do properly it's a two step process):
dat$years_cont <- factor( cut( as.numeric(as.character(
dat$years_continuously_insured_order2)),
breaks=c(0,2,3, Inf), right=FALSE ),
labels=c( "1 or less", "2", "3 +")
)
#-----------------
> str(dat)
'data.frame': 100 obs. of 2 variables:
$ years_continuously_insured_order2: Factor w/ 20 levels "1","10","11",..: 4 15 19 5 8 4 16 12 12 18 ...
$ years_cont : Factor w/ 3 levels "1 or less","2",..: 3 3 3 3 3 3 3 2 2 3 ...

If your original column is a number, treat it as a number, not a factor. A much easier way to do what you're doing is:
bin.value = function(x) {
ifelse(x <= 1, "1 or less", ifelse(x == 2, "2", "3+"))
}
dat$years_continuously_insured2 = as.factor(bin.value(as.integer(dat$years_continuously_insured)))

Related

How to find unique couples of numbers in a vector in R?

Let us suppose to have C<-c(1,2,3,4,5)
I want to find all the unique couples of numbers that can be extracted from this vector, e.g.,12,13 23 etc. How can I do it?
One option could be:
na.omit(c(`diag<-`(sapply(x, paste0, x), NA)))
[1] "12" "13" "14" "15" "21" "23" "24" "25" "31" "32" "34" "35" "41" "42" "43" "45"
[17] "51" "52" "53" "54"
Using RcppAlgos package.
## Combinations
unlist(RcppAlgos::comboGeneral(x, 2, FUN=function(x) Reduce(paste0, x)))
# [1] "12" "13" "14" "15" "23" "24" "25" "34" "35" "45"
## Permutations
unlist(RcppAlgos::permuteGeneral(x, 2, FUN=function(x) Reduce(paste0, x)))
# [1] "12" "13" "14" "15" "21" "23" "24" "25" "31" "32" "34" "35" "41" "42" "43"
# [16] "45" "51" "52" "53" "54"

How can I reset rownames? [duplicate]

This question already has answers here:
How to reset row names?
(2 answers)
Closed 5 years ago.
I have a data frame named lagcolmean like this which begins with 2, since I drop the first one
MSFT AAPL GOOGL
2 20.91273 5.663524 97.50684
3 20.05333 5.681336 90.57909
4 20.09447 5.239416 99.60738
Now how can I convert it like this
MSFT AAPL GOOGL
1 20.91273 5.663524 97.50684
2 20.05333 5.681336 90.57909
3 20.09447 5.239416 99.60738
I actually used rownames(lagcolmean) but the output is this
[1] "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
[11] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21"
[21] "22" "23
if your data frame is called df, just do rownames(df)=NULL
another option:
rownames(df) <- 1:nrow(df)

R: find first non-NA observation in data.table column by group

I have a data.table with many missing values and I want a variable which gives me a 1 for the first non-missin value in each group.
Say I have such a data.table:
library(data.table)
DT <- data.table(iris)[,.(Petal.Width,Species)]
DT[c(1:10,15,45:50,51:70,101:134),Petal.Width:=NA]
which now has missings in the beginning, at the end and in between. I have tried two versions, one is:
DT[min(which(!is.na(Petal.Width))),first_available:=1,by=Species]
but it only finds the global minimum (in this case, setosa gets the correct 1), not the minimum by group. I think this is the case because data.table first subsets by i, then sorts by group, correct? So it will only work with the row that is the global minimum of which(!is.na(Petal.Width)) which is the first non-NA value.
A second attempt with the test in j:
DT[,first_available:= ifelse(min(which(!is.na(Petal.Width))),1,0),by=Species]
which just returns a column of 1s. Here, I don't have a good explanation as to why it doesn't work.
my goal is this:
DT[,first_available:=0]
DT[c(11,71,135),first_available:=1]
but in reality I have hundreds of groups. Any help would be appreciated!
Edit: this question does come close but is not targeted at NA's and does not solve the issue here if I understand it correctly. I tried:
DT <- data.table(DT, key = c('Species'))
DT[unique(DT[,key(DT), with = FALSE]), mult = 'first']
Here's one way:
DT[!is.na(Petal.Width), first := as.integer(seq_len(.N) == 1L), by = Species]
We can try
DT[DT[, .I[which.max(!is.na(Petal.Width))] , Species]$V1,
first_available := 1][is.na(first_available), first_available := 0]
Or a slightly more compact option is
DT[, first_available := as.integer(1:nrow(DT) %in%
DT[, .I[!is.na(Petal.Width)][1L], by = Species]$V1)][]
> DT[!is.na(DT$Petal.Width) & DT$first_available == 1]
# Petal.Width Species first_available
# 1: 0.2 setosa 1
# 2: 1.8 versicolor 1
# 3: 1.4 virginica 1
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 1]
# [1] "11" "71" "135"
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 0]
# [1] "12" "13" "14" "16" "17" "18" "19" "20" "21" "22" "23" "24"
# [13] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
# [25] "37" "38" "39" "40" "41" "42" "43" "44" "72" "73" "74" "75"
# [37] "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87"
# [49] "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
# [61] "100" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146"
# [73] "147" "148" "149" "150"

Reformatting Panel Data according to a time and event variable

I have a panel dataset with many variables. The three most relevant variables are: "cid" (country code), 'time" (0-65), and "event" (0, 1, 2, 3, 4, 5, 6).
I am trying to run a cox regression (using coxph), however, since the time variable has different starting and ending points for each country, I need to first create a start time and end time variable. Here is where I run into my problem.
Here is what a sample of the three main variables may look like:
> data
cid time event
[1,] "AFG" "20" "0"
[2,] "AFG" "21" "0"
[3,] "AFG" "22" "0"
[4,] "AFG" "23" "0"
[5,] "AFG" "24" "0"
[6,] "AFG" "25" "0"
[7,] "AFG" "26" "1"
[8,] "AFG" "27" "1"
[9,] "AFG" "28" "1"
[10,] "AFG" "29" "1"
The idea is to convert this data into the following:
> data
cid time1 time2 event
[1,] "AFG" "20" "25" "0"
[2,] "AFG" "26" "29" "1"
How exactly does one go about doing this (keeping in mind that there are quite a few other explanatory variables in my dataset)?
You could use dplyr and pipe. This solution will work if your data is always ordered sequentially as in your example.
data<-data.frame(cid=rep("AFG",10),time=seq(20,29,1),event=c(0,0,0,0,0,0,1,1,1,1))
library(dplyr)
data %>% group_by(cid,event) %>%
summarise(time1=min(time),time2=max(time))
subset1<- data[data$event==0,]
subset1
subset2<- data[data$event==1,]
subset2
s1<- cbind(cid="AFG",time1=min(subset1$time),time2=max(subset1$time),event = 0)
s1
s2<- cbind(cid="AFG",time1=min(subset2$time),time2=max(subset2$time),event = 1)
s2
data1=rbind(s1,s2)
data1
# cid time1 time2 event
# [1,] "AFG" "20" "25" "0"
# [2,] "AFG" "26" "29" "1"
Hope this would help a little.

Access the levels of a factor in R

I have a 5-level factor that looks like the following:
tmp
[1] NA
[2] 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
[3] NA
[4] NA
[5] 5,9,16,24,35,36,42
[6] 4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
[7] 8,39
5 Levels: 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46 ...
I want to access the items within each level except NA. So I use the levels() function, which gives me:
> levels(tmp)
[1] "1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46"
[2] "4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50"
[3] "5,9,16,24,35,36,42"
[4] "8,39"
[5] "NA"
Then I would like to access the elements in each level, and store them as numbers. However, for example,
>as.numeric(cat(levels(tmp)[3]))
5,9,16,24,35,36,42numeric(0)
Can you help me removing the commas within the numbers and the numeric(0) at the very end. I would like to have a vector of numerics 5, 9, 16, 24, 35, 36, 42 so that I can use them as indices to access a data frame. Thanks!
You need to use a combination of unlist, strsplit and unique.
First, recreate your data:
dat <- read.table(text="
NA
1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
NA
NA
5,9,16,24,35,36,42
4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
8,39")$V1
Next, find all the unique levels, after using strsplit:
sort(unique(unlist(
sapply(levels(dat), function(x)unlist(strsplit(x, split=",")))
)))
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26"
[20] "27" "28" "29" "3" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "4" "40" "41" "42" "43"
[39] "44" "45" "46" "47" "48" "49" "5" "50" "6" "7" "8" "9"
Does this do what you want?
levels_split <- strsplit(levels(tmp), ",")
lapply(levels_split, as.numeric)
Using Andrie's dat
val <- scan(text=levels(dat),sep=",")
#Read 50 items
split(val,cumsum(c(T,diff(val) <0)))
#$`1`
#[1] 1 2 3 6 11 12 13 18 20 21 22 26 29 33 40 43 46
#$`2`
#[1] 4 7 10 14 15 17 19 23 25 27 28 30 31 32 34 37 38 41 44 45 47 48 49 50
#$`3`
#[1] 5 9 16 24 35 36 42
#$`4`
#[1] 8 39

Resources