Access the levels of a factor in R

I have a 5-level factor that looks like the following:
tmp
[1] NA
[2] 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
[3] NA
[4] NA
[5] 5,9,16,24,35,36,42
[6] 4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
[7] 8,39
5 Levels: 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46 ...
I want to access the items within each level except NA. So I use the levels() function, which gives me:
> levels(tmp)
[1] "1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46"
[2] "4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50"
[3] "5,9,16,24,35,36,42"
[4] "8,39"
[5] "NA"
Then I would like to access the elements in each level, and store them as numbers. However, for example,
>as.numeric(cat(levels(tmp)[3]))
5,9,16,24,35,36,42numeric(0)
Can you help me remove the commas within the numbers and the numeric(0) at the very end? I would like to have a numeric vector 5, 9, 16, 24, 35, 36, 42 so that I can use the values as indices to access a data frame. Thanks!

You need to use a combination of unlist, strsplit and unique.
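First, recreate your data (in R 4.0.0 and later, pass stringsAsFactors = TRUE so that the column is read as a factor):
dat <- read.table(text="
NA
1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
NA
NA
5,9,16,24,35,36,42
4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
8,39", stringsAsFactors = TRUE)$V1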
Next, find all the unique levels, after using strsplit:
sort(unique(unlist(
sapply(levels(dat), function(x)unlist(strsplit(x, split=",")))
)))
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26"
[20] "27" "28" "29" "3" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "4" "40" "41" "42" "43"
[39] "44" "45" "46" "47" "48" "49" "5" "50" "6" "7" "8" "9"

Does this do what you want?
levels_split <- strsplit(levels(tmp), ",")
lapply(levels_split, as.numeric)
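To see why the attempt in the question fails: cat() only prints its argument and returns NULL, so as.numeric() receives NULL and gives numeric(0). Below is a minimal sketch of the split-then-convert approach for a single level. The small factor tmp_demo is made up for illustration and stands in for the tmp from the question.

```r
# Hypothetical stand-in for the question's factor, with "NA" stored as a string level
tmp_demo <- factor(c("NA", "1,2,3", "NA", "5,9,16,24,35,36,42", "8,39"))

# Drop the literal "NA" level before converting
lv <- levels(tmp_demo)
lv <- lv[lv != "NA"]

# strsplit() returns a list, so take its first element, then convert to numeric
idx <- as.numeric(strsplit(lv[2], ",", fixed = TRUE)[[1]])

# The same conversion for every remaining level at once
all_idx <- lapply(strsplit(lv, ",", fixed = TRUE), as.numeric)
```

idx is then an ordinary numeric vector that can be used as a row index into whatever data frame is being accessed.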

Using Andrie's dat
val <- scan(text=levels(dat),sep=",")
#Read 50 items
split(val, cumsum(c(TRUE, diff(val) < 0)))
#$`1`
#[1] 1 2 3 6 11 12 13 18 20 21 22 26 29 33 40 43 46
#$`2`
#[1] 4 7 10 14 15 17 19 23 25 27 28 30 31 32 34 37 38 41 44 45 47 48 49 50
#$`3`
#[1] 5 9 16 24 35 36 42
#$`4`
#[1] 8 39

Related

How to obtain values from a matrix using stored numbers as indexes in R

I am really new to R and I can't find a way of subsetting matrix rows given a list of indexes.
I have a dataframe called 'demo' with 855 rows and 3 columns that looks like this:
## Subject AGE DX
## 1 011_S_0002_bl 74.3 0
## 2 011_S_0003_bl 81.3 1
## 3 011_S_0005_bl 73.7 0
## 4 022_S_0007_bl 75.4 1
## 5 011_S_0008_bl 84.5 0
## 6 011_S_0010_bl 73.9 1
From this, I want to extract the indexes for all the rows that match DX == 1. So I do:
rownames(demo[demo$DX == 1,])
Which returns:
## [1] "2" "4" "6" "14" "20" "31" "33" "34" "36" "39" "40" "41"
## [13] "46" "47" "53" "54" "55" "58" "64" "67" "69" "70" "72" "81"
## [25] "84" "87" "88" "92" "96" "98" "100" "101" "106" "108" "109" "112"
....
Now I have a matrix called T_hat with 855 rows and 1 column that looks like this:
## [,1]
## [1,] 5.812925
## [2,] 10.477721
## [3,] 1.519726
## [4,] -0.221328
## [5,] 1.784920
What I want is to use the numbers in 'al' to subset the values with the corresponding numbers in the indexes and to get something like this:
## [,1]
## [2,] 10.477721
## [4,] -0.221328
...and so on.
I've tried all these options:
T_hat_a <- T_hat[rownames(demo[demo$DX == 1,]),1]
T_hat_b <- T_hat[is.numeric(rownames(demo[demo$DX == 1,])),1]
T_hat_c <- T_hat[rownames(T_hat) %in% rownames(demo[demo$DX == 1,]),1]
T_hat_d <- T_hat[rownames(T_hat) %in% is.numeric(rownames(demo[demo$DX == 1,])),1]
But none returns what I expect.
T_hat_a = ERROR "no 'dimnames' attributes for array"
T_hat_b = numeric(0)
T_hat_c = numeric(0)
T_hat_d = numeric(0)
I've also tried to convert my matrix to a df, but only the T_hat_a option returns a result, but it is not at all as desired, since it returns different values...
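A possible way to approach this, sketched under the assumption that T_hat and demo share the same row order. The attempts above fail for identifiable reasons: T_hat has no row names, so character indexing and rownames(T_hat) %in% ... have nothing to match against, and is.numeric() returns a single TRUE or FALSE rather than converting anything. Indexing by integer position avoids both problems. The toy objects below reuse only the first five rows shown in the question.

```r
# Toy stand-ins built from the first five rows shown in the question
demo <- data.frame(
  Subject = c("011_S_0002_bl", "011_S_0003_bl", "011_S_0005_bl",
              "022_S_0007_bl", "011_S_0008_bl"),
  AGE = c(74.3, 81.3, 73.7, 75.4, 84.5),
  DX = c(0, 1, 0, 1, 0)
)
T_hat <- matrix(c(5.812925, 10.477721, 1.519726, -0.221328, 1.784920), ncol = 1)

# Integer positions of the rows where DX == 1
rows <- which(demo$DX == 1)

# Optional: copy the row names over so the subset keeps its original labels
rownames(T_hat) <- rownames(demo)

# drop = FALSE keeps the result a one-column matrix instead of a plain vector
T_hat_dx1 <- T_hat[rows, , drop = FALSE]
```

A logical index such as T_hat[demo$DX == 1, , drop = FALSE] should select the same rows, again assuming the two objects are aligned row by row.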

data frame search == not finding all conditions that hold

I am trying to conditionally replace some fields in a dataframe; however, my code is finding about 25% of the actual instances present. I've searched through the other conditional search questions, but didn't find anything matching my problem -- I apologize in advance if I missed one.
Specifically, I am trying to replace all numbers 1 to 9 in dta$day, with a to i.
Here are the first 100 items in that vector: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 2 3 4 5 6 7 8 9
When I conditionally search for values 1 to 9, using:
dta$day == c("1","2","3","4","5","6","7","8","9")
It states that only the first and last set in that grouping match my condition as below (I've bolded ~what should be TRUE for your reference):
[1] **TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE** FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE **FALSE**
[33] **FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE** FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE **FALSE FALSE**
[65] **FALSE FALSE FALSE FALSE FALSE FALSE FALSE** FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE **TRUE TRUE TRUE TRUE TRUE TRUE**
[97] **TRUE TRUE TRUE**
The problem must be in that first step, but to show you the result, only the first and last set in that first 100 in my vector are appropriately replaced after applying this code:
dta[dta$day == c("1","2","3","4","5","6","7","8","9"), 1] <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
[1] **"a" "b" "c" "d" "e" "f" "g" "h" "i"** "10" "11" "12" "13" "14" "15" "16" "17" "18" "19"
[20] "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" **"1" "2" "3" "4" "5" "6" "7"**
[39] "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[58] "27" "28" **"1" "2" "3" "4" "5" "6" "7" "8" "9" "10"** "11" "12" "13" "14" "15" "16" "17"
[77] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" **"a" "b" "c" "d" "e"
[96] "f" "g" "h" "i"**
If useful, here is the initial state of that vector:
is.numeric(dta$day)
[1] TRUE
summary(dta$day)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 8.00 16.00 15.73 23.00 31.00
I am reproducing the data frame here:
day <- c(1:31,1:28,1:31,1:30)
month <- c(rep_len(1,31),rep_len(2,28),rep_len(3,31),rep_len(4,30))
temp <- rnorm(length(month),10,10)
dta=as.data.frame(cbind(day,month,temp))
And actually, although I am able to reproduce the problem with this toy example, I get a warning that I do not get with my actual data (not reproduced here because it is very large): "longer object length is not a multiple of shorter object length".
I would love some help, and if I haven't provided something or haven't done so in the format needed, please kindly let me know!
It looks like you're checking equivalence to a vector, rather than its components. Try %in% instead, like this:
dta[dta$day %in% c("1","2","3","4","5","6","7","8","9"), ]
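The difference can be seen on a small vector. This is only an illustrative sketch with made-up values: == compares element by element and recycles the shorter vector, while %in% tests each element for membership in the set.

```r
x <- c(1, 2, 3, 1, 2, 3)

# == compares position by position, recycling the shorter vector:
# x is effectively compared against 1, 2, 1, 2, 1, 2
eq <- x == c(1, 2)

# %in% asks, for each element of x, whether it appears anywhere in the set
inn <- x %in% c(1, 2)

print(eq)
print(inn)
```

R warns about recycling only when the longer length is not a multiple of the shorter one, which may be why the warning shows up for the toy data but not for the real data.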
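Use %in% rather than ==, then index the vector to replace 1:9 with a:i. Looking each letter up by its day value, rather than relying on recycling, keeps the mapping correct even when the days are not in order:
y <- 1:9
idx <- dta$day %in% y
dta$day[idx] <- letters[dta$day[idx]]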
Read more about the different behaviours of these operators here:
Difference between the == and %in% operators in R
And
Difference between `%in%` and `==`

Flatten a named list in R

This is a very simple question, I can't believe I can't figure it out. I've searched high and low for a solution.
I have a named list, like so:
> fitted(mdl)
1 2 3 4 5 6 7 8
-424.8135 -395.0308 -436.5832 -414.3145 -382.9686 -380.7277 -394.2808 -394.3340
9 10 11 12 13 14 15 16
-401.6710 -386.6691 -407.4558 -427.4056 -397.4963 -415.6302 -436.1703 -378.4489
17 18 19 20 21 22 23 24
-353.7718 -377.3190 -390.5177 -370.3608 -389.7843 -397.8872 -401.9937 -390.4119
25 26 27 28 29 30 31 32
-387.4962 -422.4953 -427.1638 -402.5654 -409.6334 -360.7378 -355.1824 -370.9121
33 34 35 36 37 38 39 40
-377.6591 -373.3049 -388.4417 -398.1172 -357.1107 -376.8618 -378.7070 -420.5362
41 42 43 44 45 46 47 48
-390.8324 -406.5956 -403.1015 -363.5008 -347.2580 -371.0433 -376.4454 -360.3895
49
-383.9711
mdl is an object returned from lm(), and I'm trying to extract the predicted values using the extractor function fitted().
I would like this to be without the 1,2,3,... names. str() told me that names is an attribute. I can do
> names(fitted(mdl))
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49"
And that is what I want, except with the data. After trying various combinations of unlist,cbind/rbind, do.call, c(), etc. I finally figured out a solution:
> data.frame(fitted(mdl))$fitted.mdl
[1] -424.8135 -395.0308 -436.5832 -414.3145 -382.9686 -380.7277 -394.2808
[8] -394.3340 -401.6710 -386.6691 -407.4558 -427.4056 -397.4963 -415.6302
[15] -436.1703 -378.4489 -353.7718 -377.3190 -390.5177 -370.3608 -389.7843
[22] -397.8872 -401.9937 -390.4119 -387.4962 -422.4953 -427.1638 -402.5654
[29] -409.6334 -360.7378 -355.1824 -370.9121 -377.6591 -373.3049 -388.4417
[36] -398.1172 -357.1107 -376.8618 -378.7070 -420.5362 -390.8324 -406.5956
[43] -403.1015 -363.5008 -347.2580 -371.0433 -376.4454 -360.3895 -383.9711
But this is a very roundabout hack for something that must be right under my nose.
Any suggestions at what I'm missing?
(I don't know how to phrase the problem very well, or come up with a better title for the question, as I don't know the terminology to describe what I want. So feel free to edit :)
If you're just trying to remove the names of the object, just use unname.
Here's a basic example:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
fitted(lm.D9)
# 1 2 3 4 5 6 7 8 9 10 11 12 13
# 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661
# 14 15 16 17 18 19 20
# 4.661 4.661 4.661 4.661 4.661 4.661 4.661
Remove the names:
unname(fitted(lm.D9))
# [1] 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661
# [14] 4.661 4.661 4.661 4.661 4.661 4.661 4.661
Here is another simple way:
set.seed(100)
x <- rpois(5, 5)
y <- 2*x + rnorm(5)
mod <- lm(y ~ x)
fitted_ <- fitted(mod)
fitted_
# 1 2 3 4 5
# 7.822806 6.312569 9.333042 4.802333 9.333042
names(fitted_) <- NULL
fitted_
# [1] 7.822806 6.312569 9.333042 4.802333 9.333042

Grouping a variable with numerous levels

Let's say I have a factor variable with numerous levels and I am trying to group them into several groups.
> levels(dat$years_continuously_insured_order2)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18"
[19] "19" "20"
> levels(dat$age_of_oldest_driver)
[1] "-16" "1" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"
[22] "34" "35" "36" "37" "38" "39" "40
I have a script which runs through these variables and groups them into several categories. However, the number of levels could be (and usually is) different each time my script runs. Therefore, if my original code to group the variables was the following (see below), it wouldn't be of use if my script ran an hour later and the levels were different. Instead of 15 levels, I could now have 25 levels with different values, but I would still need to group them into specific categories.
dat$years_continuously_insured2 <- NA
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[1]] <- NA
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[2:3]] <- "1 or less"
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[4]] <- "2"
dat$years_continuously_insured2[dat$years_continuously_insured %in% levels(dat$years_continuously_insured)[5:7]] <- "3 +"
dat$years_continuously_insured2 <- factor(dat$years_continuously_insured2)
How can I find a more elegant way to group variables into segments? Are there better ways to do this in R?
Thanks!
You could convert your factor levels in the continuously insured variable to numeric and then cut to your categories and re-factor(). The first step is described in the R-FAQ (to do properly it's a two step process):
dat$years_cont <- factor( cut( as.numeric(as.character(
dat$years_continuously_insured_order2)),
breaks=c(0,2,3, Inf), right=FALSE ),
labels=c( "1 or less", "2", "3 +")
)
#-----------------
> str(dat)
'data.frame': 100 obs. of 2 variables:
$ years_continuously_insured_order2: Factor w/ 20 levels "1","10","11",..: 4 15 19 5 8 4 16 12 12 18 ...
$ years_cont : Factor w/ 3 levels "1 or less","2",..: 3 3 3 3 3 3 3 2 2 3 ...
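The two-step conversion matters because as.numeric() applied directly to a factor returns the internal level codes, not the values the labels represent. A small sketch with a made-up factor f, using the same breaks and labels as above:

```r
# Made-up factor whose labels are numbers stored as text
f <- factor(c("1", "10", "2", "3", "20"))

# Directly converting gives the level codes (positions in levels(f))
codes <- as.numeric(f)

# Going through as.character() recovers the actual numbers
values <- as.numeric(as.character(f))

# Bin the recovered numbers: [0,2) -> "1 or less", [2,3) -> "2", [3,Inf) -> "3 +"
grp <- cut(values, breaks = c(0, 2, 3, Inf), right = FALSE,
           labels = c("1 or less", "2", "3 +"))
```

Because the breaks are expressed in terms of values rather than level positions, the same call keeps working when the set of levels changes between runs.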
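If your original column is a number, treat it as a number, not a factor. If it is stored as a factor, convert it through as.character() first, because as.integer() on a factor returns the internal level codes rather than the values. A much easier way to do what you're doing is:
bin.value = function(x) {
ifelse(x <= 1, "1 or less", ifelse(x == 2, "2", "3+"))
}
dat$years_continuously_insured2 = as.factor(bin.value(as.integer(as.character(dat$years_continuously_insured))))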

reformatting data frame with List in R

Hello, I am trying to reshape a data.frame in R such that each row is repeated once for every value in the corresponding entry of a list: the first row is repeated for each value in the first list entry, the second row for each value in the second entry, and so on.
The list is called wrk, dfx is the data frame I want to reshape, and listOut is what I want to end up with.
Thank you very much for your help.
> wrk
[[1]]
[1] "41" "42" "44" "45" "97" "99" "100" "101" "102"
[10] "103" "105" "123" "124" "126" "127" "130" "132" "135"
[19] "136" "137" "138" "139" "140" "141" "158" "159" "160"
[28] "161" "162" "163" "221" "223" "224" ""
[[2]]
[1] "41" "42" "44" "45" "98" "99" "100" "101" "102"
[10] "103" "105" "123" "124" "126" "127" "130" "132" "135"
[19] "136" "137" "138" "139" "140" "141" "158" "159" "160"
[28] "161" "162" "163" "221" "223" "224" ""
>dfx
projectScore highestRankingGroup
1 0.8852 1
2 0.8845 2
>listOut
projectScore highestRankingGroup wrk
1 0.8852 1 41
2 0.8852 1 42
3 0.8852 1 44
4 0.8852 1 45
5 0.8852 1 97
6 0.8852 1 99
7 0.8852 1 100
8 0.8852 1 101
...
35 0.8845 2 41
36 0.8845 2 42
37 0.8845 2 44
38 0.8845 2 45
39 0.8845 2 98
40 0.8845 2 99
41 0.8845 2 100
How about replicating the rows of dfx and cbind-ing them with the unlisted wrk:
listOut <- cbind(
dfx[rep(seq_along(wrk), sapply(wrk, length)), ],
wrk = unlist(wrk)
)
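A self-contained sketch of the same idea on shortened, made-up data, so the row repetition is easy to check by eye. It uses lengths() in place of sapply(wrk, length); the shortened wrk vectors are assumptions for illustration only.

```r
# Shortened stand-ins for the objects in the question
wrk <- list(c("41", "42", "44"), c("41", "98"))
dfx <- data.frame(projectScore = c(0.8852, 0.8845),
                  highestRankingGroup = c(1, 2))

# Number of values attached to each row of dfx
n <- lengths(wrk)

# Repeat row i of dfx n[i] times, then bind the flattened values alongside
listOut <- cbind(dfx[rep(seq_len(nrow(dfx)), n), ], wrk = unlist(wrk))

# Replace the repeated row labels (1, 1.1, 1.2, ...) with a plain sequence
rownames(listOut) <- NULL
```

This assumes wrk has exactly one entry per row of dfx, in the same order.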
How about:
If wrk contains simple vectors like in your example:
> szs<-sapply(wrk, length)
> fulldfr<-do.call(c, wrk)
> listOut<-cbind(dfx[rep(seq_along(szs), szs),], fulldfr)
If wrk contains dataframes:
> szs<-sapply(wrk, function(dfr){dim(dfr)[1]})
> fulldfr<-do.call(rbind, wrk)
> listOut<-cbind(dfx[rep(seq_along(szs), szs),], fulldfr)
How about:
expand.grid(dfx$projectScore, dfx$highestRankingGroup, wrk[[1]])
Edit:
Maybe you can elaborate a bit more, because this does seem to work:
a <- c("41","42","44","45","97","99","100","101","102","103","105", "123","124","126","127","130","132","135","136","137","138","139","140","141","158","159","160","161","162","163","221","223","224")
wrk <-list(a, a)
dfx <- data.frame(projectScore=c(0.8852, 0.8845), highestRankingGroup=c(1,2))
listOut <- expand.grid(dfx$projectScore, dfx$highestRankingGroup, wrk[[1]])
names(listOut) <- c("projectScore", "highestRankingGroup", "wrk")
listOut[order(-listOut$projectScore,listOut$highestRankingGroup, listOut$wrk),]
