Get label for given level of factor in R

Given this factor:
> str(some$factor)
Factor w/ 398 levels "13:23","13:24",..: 1 2 3 4 5 6 7 8 9 10 ...
> levels(some$factor)
[1] "13:23" "13:24" "13:25" "13:26" "13:27" ...
> labels(some$factor)
[1] "1" "2" "3" "4" "5" ...
how can I get a label (e.g. "2") for a given level (e.g. "13:24")?

In base R, we can use match to find the position of the level and use that index to extract the corresponding label:
labels(some$factor)[match("13:24", levels(some$factor))]
#[1] "2"
data
some <- data.frame(factor = c("13:23", "13:24", "13:25"), stringsAsFactors = TRUE)
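Note that labels() on a factor without names simply returns the element positions as character strings, so the "label" looked up here is really the position of the level within levels(). A minimal sketch on the same toy data (the names f and idx are only illustrative) makes this explicit:

```r
# Toy factor with the same levels as the example data
f <- factor(c("13:23", "13:24", "13:25"))

# Position of the level "13:24" within levels(f)
idx <- match("13:24", levels(f))
idx
# [1] 2

# labels(f) is just "1", "2", "3", so indexing it returns the position as text
labels(f)[idx]
# [1] "2"
```

If only the position is needed, as.character(idx) gives the same string without depending on how many elements the factor has.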

Related

Why does `ave` with `table` return character when first argument is character?

Consider two vectors v1 and v2,
v1 <- c(3, 3, 3, 3, 2, 2, 2, 1, 1)
v2 <- as.character(v1)
where their tables give identical numerical output.
table(v1)
# v1
# 1 2 3
# 2 3 4
table(v2)
# v2
# 1 2 3
# 2 3 4
Now, calling ave with a numeric first argument gives "numeric":
ave(v1, v1, FUN=table)
# [1] 4 4 4 4 3 3 3 2 2
ave(v1, v2, FUN=table)
# [1] 4 4 4 4 3 3 3 2 2
Whereas character as first argument gives "character":
ave(v2, v1, FUN=table)
# [1] "4" "4" "4" "4" "3" "3" "3" "2" "2"
ave(v2, v2, FUN=table)
# [1] "4" "4" "4" "4" "3" "3" "3" "2" "2"
Documentation of ave says:
Value
A numeric vector, say y of length length(x). [...]
For me that means it should always return "numeric".
Is this a bug or a feature?
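One likely explanation, offered as a sketch rather than an authoritative answer: ave writes the group results back into x by replacement, so the result keeps the type of x. Assigning numbers into a character vector coerces them to character, which the following reproduces; converting afterwards recovers the numbers.

```r
v1 <- c(3, 3, 3, 3, 2, 2, 2, 1, 1)
v2 <- as.character(v1)

# Replacing elements of a character vector with numbers keeps it character
x <- v2
x[1:4] <- 4L
typeof(x)
# [1] "character"

# Converting the ave() result back gives the numeric counts
as.numeric(ave(v2, v2, FUN = table))
# [1] 4 4 4 4 3 3 3 2 2
```

If a numeric result is wanted, passing a numeric first argument (as in ave(v1, v2, FUN=table) above) avoids the conversion step.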

How to return only the wanted vector in the which() function

I have this initial matrix:
> fil
2 3 6
1 1 1
> str(fil)
Named num [1:3] 1 1 1
- attr(*, "names")= chr [1:3] "2" "3" "6"
When I do this:
which(fil==min(fil,na.rm = TRUE))
I get this returned:
> which(fil==min(fil,na.rm = TRUE))
2 3 6
1 2 3
And I wanted the names of the vector to be returned:
2 3 6
When you see output like the one in the question, the upper line holds the names of the vector and the lower line holds its actual values. The first line of the output is not part of the data.
This is confirmed with str
str(fil)
# Named num [1:3] 1 1 1
# - attr(*, "names")= chr [1:3] "2" "3" "6"
It starts with Named num, so it is a named numeric vector.
The next line lists its attributes; the attribute in question is "names". R has accessor functions for frequently used attributes such as "names".
fil <- c('2' = 1, '3' = 1, '6' = 1)
fil
#2 3 6
#1 1 1
attributes(fil)
#$names
#[1] "2" "3" "6"
There are two ways to get the attribute "names". The second is the shortcut I will use:
attr(fil, "names")
#[1] "2" "3" "6"
names(fil)
#[1] "2" "3" "6"
Now, to answer the question, just subset the names that correspond to the minimum of the vector fil.
names(fil)[which(fil==min(fil,na.rm = TRUE))]
#[1] "2" "3" "6"
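Because which() keeps the names of the elements it selects, the names can also be taken from its result directly. The sketch below shows this, plus which.min() for the case where only the first minimum is wanted; fil2 is a made-up vector used purely for illustration.

```r
fil <- c("2" = 1, "3" = 1, "6" = 1)

# which() returns a named integer vector, so its names are the answer
names(which(fil == min(fil, na.rm = TRUE)))
# [1] "2" "3" "6"

# Hypothetical vector with a single minimum
fil2 <- c("2" = 5, "3" = 1, "6" = 4)
names(fil2)[which.min(fil2)]
# [1] "3"
```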

R Writing to data frame from inside for-loop

Brand new to R programming so please forgive me if I'm using wrong terminologies.
I'm trying to insert/append values to a data frame from inside a for-loop.
I can get the right values if I just print() them, but when I try to put it inside the data frame, I get mostly NA's. If I run this code it prints out the values I want.
output <- data.frame()
for (i in seq_along(Reasons)){
assign(paste(Reasons[i]), sum(ER$Reason == paste(Reasons[i])))
Tot <- get(paste(Reasons[i]))
assign(paste(Reasons[i],'ER',sep="_"), sum(grepl("ER|Er", ER$Disposition) & ER$Reason == paste(Reasons[i])))
Er <- get(paste(Reasons[i],'ER',sep="_"))
assign(paste(Reasons[i],'adm',sep="_"), sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & ER$Reason == paste(Reasons[i])))
Adm <- get(paste(Reasons[i],'adm',sep="_"))
assign(paste(Reasons[i],'admrate',sep="_"), sprintf("%.0f%%", (sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & ER$Reason == paste(Reasons[i])))/(sum(ER$Reason == paste(Reasons[i])))*100))
Rate <- get(paste(Reasons[i],'admrate',sep="_"))
print(c(Er,Adm,Tot,Rate))
#clear variables just created
rm(list=ls(pattern=Reasons[i]))
rm(Tot,Er,Adm,Rate)
}
[1] "7" "13" "20" "65%"
[1] "4" "8" "12" "67%"
[1] "12" "12" "24" "50%"
[1] "23" "7" "30" "23%"
[1] "7" "1" "8" "12%"
[1] "3" "1" "4" "25%"
[1] "3" "0" "3" "0%"
[1] "6" "5" "11" "45%"
[1] "2" "9" "11" "82%"
[1] "2" "4" "6" "67%"
[1] "10" "4" "14" "29%"
[1] "5" "0" "5" "0%"
[1] "10" "4" "14" "29%"
[1] "0" "3" "3" "100%"
[1] "7" "3" "10" "30%"
[1] "0" "4" "4" "100%"
But when I use
output <- rbind(output, c(Er, Adm, Tot, Rate))
Instead of
print(c(Er,Adm,Tot,Rate))
I get the first row of values (7, 13, 20, 65%), then all NA's except the "7" in rows 5 and 15... What am I doing wrong?
Thank you in advance
As I don't know what your data look like I cannot reproduce your error. If I understand it correctly, for each value in Reasons you want to find (a) the total number of observations, (b) the number of observations with the string "Er" in the variable Disposition, (c) the number of observations with the string "Admi" in the variable Disposition and (d) the percentage of observations with the string "Admi" in the variable Disposition. If that is the case then you don't have to use assign and get to do this.
Here is a simpler way to do it (although it's not the best way to do it, see below):
## Here I just generated some data that might look like the data
## you are dealing with:
Reasons <- LETTERS[1:10]
ER <- data.frame(Reason = LETTERS[sample.int(10,100, replace = TRUE)],
Disposition = c("ER", "Admi", "SomethingElse")[sample.int(3,100, replace = TRUE)])
output <- data.frame()
for (i in seq_along(Reasons)){
  Tot <- sum(ER$Reason == Reasons[i])
  Er <- sum(grepl("ER|Er", ER$Disposition) & (ER$Reason == Reasons[i]))
  Adm <- sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & (ER$Reason == Reasons[i]))
  Rate <- paste(round(Adm/Tot*100), "%")
  output <- rbind(output, c(Er, Adm, Tot, Rate))
}
> output
X.4. X.3. X.10. X.30...
1 4 3 10 30 %
2 2 3 6 50 %
3 2 1 6 17 %
4 5 2 14 14 %
5 3 5 11 45 %
6 2 4 11 36 %
7 3 6 14 43 %
8 2 2 5 40 %
9 1 7 11 64 %
10 4 4 12 33 %
Dynamically appending rows to a data frame or matrix is generally not a very good idea as it is quite memory intensive. If you know the dimensions of your matrix beforehand (as you do) you should initialize it with the right size and then fill the entries inside your loop:
## Initialize data:
output <- matrix(nrow = length(Reasons), ncol = 4)
for (i in seq_along(Reasons)){
  Tot <- sum(ER$Reason == Reasons[i])
  Er <- sum(grepl("ER|Er", ER$Disposition) & (ER$Reason == Reasons[i]))
  Adm <- sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & (ER$Reason == Reasons[i]))
  Rate <- paste(round(Adm/Tot*100), "%")
  output[i,] <- c(Er, Adm, Tot, Rate)
}
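One thing to keep in mind with the matrix version is that c(Er, Adm, Tot, Rate) mixes numbers with a string, so every entry ends up as character. A possible variant, sketched here on made-up data, preallocates a data frame with typed columns instead. The seed, the three-letter Reasons vector, and the column names are assumptions for illustration only.

```r
set.seed(1)  # assumption: fixed seed so the toy data is reproducible

# Made-up data in the same shape as the example above
Reasons <- LETTERS[1:3]
ER <- data.frame(
  Reason = sample(Reasons, 30, replace = TRUE),
  Disposition = sample(c("ER", "Admi", "SomethingElse"), 30, replace = TRUE),
  stringsAsFactors = FALSE
)

# Preallocate one row per reason, with a fixed type for each column
n <- length(Reasons)
output <- data.frame(
  Reason = Reasons,
  Er = integer(n), Adm = integer(n), Tot = integer(n),
  Rate = character(n),
  stringsAsFactors = FALSE
)

for (i in seq_along(Reasons)) {
  is_reason <- ER$Reason == Reasons[i]
  output$Tot[i] <- sum(is_reason)
  output$Er[i] <- sum(grepl("ER|Er", ER$Disposition) & is_reason)
  output$Adm[i] <- sum(grepl("Admi", ER$Disposition, ignore.case = TRUE) & is_reason)
  output$Rate[i] <- sprintf("%.0f%%", 100 * output$Adm[i] / output$Tot[i])
}
output
```

The counts stay integer, so they can still be summed or sorted numerically, and only Rate is stored as text.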
There are, however, even simpler ways to do this kind of evaluation. You could e.g. use the dplyr package, where you can group the data by a variable (the different values of ER$Reason in your case) and then evaluate the values you need:
## Load the package 'dplyr'
library(dplyr)
## Group the variable and evaluate:
output <- ER %>% group_by(Reason) %>%
dplyr::summarise(Er = sum(grepl("ER|Er", Disposition)),
Adm = sum(grepl("Admi|admi|ADMI|ADmi", Disposition)),
Tot = n(),
Rate = paste(round(Adm/Tot*100), "%"))
> output
# A tibble: 10 × 5
Reason Er Adm Tot Rate
<chr> <int> <int> <int> <chr>
1 A 4 3 10 30 %
2 B 2 3 6 50 %
3 C 2 1 6 17 %
4 D 5 2 14 14 %
5 E 3 5 11 45 %
6 F 2 4 11 36 %
7 G 3 6 14 43 %
8 H 2 2 5 40 %
9 I 1 7 11 64 %
10 J 4 4 12 33 %

Converting a vector of strings into a numerical vector, based on string-sequences

I have a vector like
A <- c("A","A","B","B", "C","C","C", "D")
I would like to convert it into a numerical vector based on the sequence in A, that would look like:
c(1:2, 3:4, 5:7, 8)
Is this possible?
Try:
A <- c("A","A","B","B", "C","C","C", "D")
as.numeric(factor(A))
[1] 1 1 2 2 3 3 3 4
and in case you really want a sequence from 1 to the length of the vector:
labels(factor(A))
[1] "1" "2" "3" "4" "5" "6" "7" "8"
or
1:length(A)
[1] 1 2 3 4 5 6 7 8
If the first sequence is what you want, you may find plyr::mapvalues interesting in case you have more complicated cases at some point. For instance,
library(plyr)
mapvalues(A, from=unique(A), to=1:4)
[1] "1" "1" "2" "2" "3" "3" "3" "4"
This comes in handy when you need a bit more control. For instance, you could easily supply other output as the to argument, e.g. month.name[1:4].
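If loading plyr is not desired, a similar mapping can be sketched in base R with match(), which numbers the values by order of first appearance. The name codes is only illustrative.

```r
A <- c("A", "A", "B", "B", "C", "C", "C", "D")

# Position of each element among the unique values, in order of appearance
codes <- match(A, unique(A))
codes
# [1] 1 1 2 2 3 3 3 4

# The codes can index any replacement vector, e.g. month names
month.name[codes]
# [1] "January"  "January"  "February" "February" "March"    "March"    "March"    "April"
```

Unlike as.numeric(factor(A)), which numbers the values in sorted level order, this numbers them in the order they first occur; for this A both give the same result.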

How to understand the output of using gsub on a data.frame

Can you use gsub on a data.frame?
dat="1 1W 16 2W 16
2 1 16 2W W
3 1W 16 16 0
4 4 64 64 0"
data=read.table(text=dat,header=F)
gsub("W",3,data)
Why do we get output like the following?
[1] "1:4" "c(2, 1, 2, 3)" "c(16, 16, 16, 64)" "c(2, 2, 1, 3)" "c(2, 3, 1, 1)"
It is hard to understand.
> str(data)
'data.frame': 4 obs. of 5 variables:
$ V1: int 1 2 3 4
$ V2: Factor w/ 3 levels "1","1W","4": 2 1 2 3
$ V3: int 16 16 16 64
$ V4: Factor w/ 3 levels "16","2W","64": 2 2 1 3
$ V5: Factor w/ 3 levels "0","16","W": 2 3 1 1
What is the meaning of the 2 1 2 3 in V2: Factor w/ 3 levels "1","1W","4": 2 1 2 3?
The output is the same as as.character(data).
Since the letter W never appears in any of these strings, gsub has no effect, other than the conversion to character.
As discussed in the comments, as.character has quirky behaviour on data frames. It calls as.vector(x, "character"), which needs to condense each column to a single value, and chooses to return the code needed to recreate the column, ignoring attributes. For factor columns this means that you get the underlying integer codes (the positions within the levels), not the string values, which is why W never appears.
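The integer codes can be inspected directly. As a small sketch, the column V2 from the question is rebuilt by hand as a factor here:

```r
# V2 from the question, rebuilt as a factor
V2 <- factor(c("1W", "1", "1W", "4"))

levels(V2)
# [1] "1"  "1W" "4"

# Each code is the position of the value within levels(V2)
as.integer(V2)
# [1] 2 1 2 3

# Indexing the levels by the codes recovers the strings
levels(V2)[as.integer(V2)]
# [1] "1W" "1"  "1W" "4"
```

This is the 2 1 2 3 shown by str(data): "1W" is the second level, "1" the first, and "4" the third.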
Instead, you need to apply gsub to each value in your data frame:
apply(data, 1:2, function(x) gsub("W", 3, x))
# V1 V2 V3 V4 V5
# [1,] "1" "13" "16" "23" "16"
# [2,] "2" "1" "16" "23" "3"
# [3,] "3" "13" "16" "16" "0"
# [4,] "4" "4" "64" "64" "0"
#Richie Cotton's comments explain why you need to do it this way.
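An alternative sketch that keeps the result as a data frame works column by column with lapply. The explicit as.character() is there so that factor columns are converted to their string values before the replacement, whichever R version read the data.

```r
dat <- "1 1W 16 2W 16
2 1 16 2W W
3 1W 16 16 0
4 4 64 64 0"
data <- read.table(text = dat, header = FALSE)

# Replace "W" with "3" in every column; data[] keeps the data frame structure
data[] <- lapply(data, function(col) gsub("W", "3", as.character(col), fixed = TRUE))

data$V2
# [1] "13" "1"  "13" "4"
data$V5
# [1] "16" "3"  "0"  "0"
```

All columns are character afterwards; as.numeric() can be applied per column if numbers are needed.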
