How to understand the output of using gsub on a data.frame - r

Can you use gsub on a data.frame?
dat="1 1W 16 2W 16
2 1 16 2W W
3 1W 16 16 0
4 4 64 64 0"
data=read.table(text=dat,header=F)
gsub("W",3,data)
Why do we get the output below?
[1] "1:4" "c(2, 1, 2, 3)" "c(16, 16, 16, 64)" "c(2, 2, 1, 3)" "c(2, 3, 1, 1)"
It is hard to understand.
> str(data)
'data.frame': 4 obs. of 5 variables:
$ V1: int 1 2 3 4
$ V2: Factor w/ 3 levels "1","1W","4": 2 1 2 3
$ V3: int 16 16 16 64
$ V4: Factor w/ 3 levels "16","2W","64": 2 2 1 3
$ V5: Factor w/ 3 levels "0","16","W": 2 3 1 1
What is the meaning of the 2 1 2 3 in V2: Factor w/ 3 levels "1","1W","4": 2 1 2 3?

The output is the same as as.character(data).
Since the letter W never appears in any of these strings, gsub has no effect, other than the conversion to character.
As discussed in the comments, as.character has quirky behaviour on data frames. It calls as.vector(x, "character"), which needs to condense each column to a single string, and chooses to return the code needed to recreate the column, ignoring attributes. For factor columns this means that you get the underlying integer codes, not the level strings, which is why W never appears.
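A minimal sketch of that behaviour, assuming the columns are read as factors as in the str output above. stringsAsFactors = TRUE is passed explicitly here because newer R versions read strings as character columns by default; the names codes, labels_back and whole are only illustrative.

```r
dat <- "1 1W 16 2W 16
2 1 16 2W W
3 1W 16 16 0
4 4 64 64 0"
data <- read.table(text = dat, header = FALSE, stringsAsFactors = TRUE)

# A factor stores integer codes that index into its levels.
codes <- as.integer(data$V2)           # positions in levels(data$V2)
labels_back <- levels(data$V2)[codes]  # the strings that were originally read

# as.character() on the whole data frame condenses each column to one
# string, so factor columns appear as their integer codes.
whole <- as.character(data)
```

This is also the answer to the "2 1 2 3" question: those numbers are the codes, and levels(data$V2)[codes] maps them back to the original strings.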

You need to apply gsub to each value in your data frame:
apply(data, 1:2, function(x) gsub("W", 3, x))
# V1 V2 V3 V4 V5
# [1,] "1" "13" "16" "23" "16"
# [2,] "2" "1" "16" "23" "3"
# [3,] "3" "13" "16" "16" "0"
# [4,] "4" "4" "64" "64" "0"
Richie Cotton's comments explain why you need to do it this way.
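If a data frame is wanted back instead of a character matrix, a column-wise variant is one possible alternative. This is only a sketch: the name fixed is hypothetical, and it assumes every column can be treated as text before the replacement.

```r
dat <- "1 1W 16 2W 16
2 1 16 2W W
3 1W 16 16 0
4 4 64 64 0"
data <- read.table(text = dat, header = FALSE, stringsAsFactors = TRUE)

fixed <- data
# Convert each column to character first, so gsub sees the level strings
# rather than the factor codes, then replace W with 3.
fixed[] <- lapply(data, function(col) gsub("W", "3", as.character(col)))
# Optionally turn the all-digit columns back into integers.
fixed[] <- lapply(fixed, type.convert, as.is = TRUE)
```

Assigning into fixed[] keeps the data frame structure and column names, which apply does not.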

Related

Why does `ave` with `table` return character when first argument is character?

Consider two vectors v1 and v2,
v1 <- c(3, 3, 3, 3, 2, 2, 2, 1, 1)
v2 <- as.character(v1)
where their tables give identical numerical output.
table(v1)
# v1
# 1 2 3
# 2 3 4
table(v2)
# v2
# 1 2 3
# 2 3 4
Now, calling ave with a numeric first argument gives "numeric":
ave(v1, v1, FUN=table)
# [1] 4 4 4 4 3 3 3 2 2
ave(v1, v2, FUN=table)
# [1] 4 4 4 4 3 3 3 2 2
Whereas a character first argument gives "character":
ave(v2, v1, FUN=table)
# [1] "4" "4" "4" "4" "3" "3" "3" "2" "2"
ave(v2, v2, FUN=table)
# [1] "4" "4" "4" "4" "3" "3" "3" "2" "2"
Documentation of ave says:
Value
A numeric vector, say y of length length(x). [...]
For me that means it should always return "numeric".
Is this a bug or a feature?
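One plausible explanation, based on how ave is written in current versions of R: it splits x by group, applies FUN to each piece, and assigns the results back into x. Assigning numbers into a character vector keeps it character, so the result takes the type of the first argument. The sketch below illustrates this under that assumption; counts_chr, counts_num and counts_int are illustrative names, and length is used in place of table for simplicity.

```r
v1 <- c(3, 3, 3, 3, 2, 2, 2, 1, 1)
v2 <- as.character(v1)

# Assigning integers into an existing character vector coerces them to text.
x <- c("a", "b", "c")
x[] <- 1:3

# Character first argument: counts come back as strings.
counts_chr <- ave(v2, v2, FUN = length)
# Two ways to get numbers instead:
counts_num <- as.numeric(counts_chr)
counts_int <- ave(seq_along(v2), v2, FUN = length)
```

Passing seq_along(v2) as the first argument avoids the round trip through character when only group sizes are needed.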

Get label for given level of factor in R

Given this factor:
> str(some$factor)
Factor w/ 398 levels "13:23","13:24",..: 1 2 3 4 5 6 7 8 9 10 ...
> levels(some$factor)
[1] "13:23" "13:24" "13:25" "13:26" "13:27" ...
> labels(some$factor)
[1] "1" "2" "3" "4" "5" ...
how can I get a label (e.g. "2") for a given level (e.g. "13:24")?
We can create an index with match to extract the corresponding label in base R:
labels(some$factor)[match("13:24", levels(some$factor))]
#[1] "2"
data
some <- data.frame(factor = c("13:23", "13:24", "13:25"), stringsAsFactors = TRUE)
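A short sketch of the same lookup using only match and levels, assuming that the "label" wanted is the integer code of the level. As far as I can tell, labels() on an unnamed factor returns element positions rather than anything tied to the levels, so indexing it by the match result only works when that is what is intended. The name code is illustrative.

```r
some <- data.frame(factor = c("13:23", "13:24", "13:25"), stringsAsFactors = TRUE)

# The integer code of a level is its position in levels().
code <- match("13:24", levels(some$factor))

# Going the other way: from a code back to the level string.
level_back <- levels(some$factor)[code]
```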

Convert pivot table generated from pivottabler package to dataframe

I'm trying to make a pivot table with pivottabler package. I want to convert the pivot table object to dataframe, so that I can convert it to data table (with DT) and render it in Shiny app, so that it's downloadable.
library(pivottabler)
pt = qpvt(mtcars, 'cyl', 'vs', 'n()')
I tried to convert it to a data frame:
as.data.frame(pt)
I got the error message below:
Error in as.data.frame.default(pt) : cannot coerce class ‘c("PivotTable", "R6")’ to a data.frame
Does anyone know how to convert the pivot table object to a data frame?
It is an R6 class. One option is to extract the data with the asDataFrame method, which is revealed if we check the str:
str(pt)
#...
#...
#asDataFrame: function (separator = " ", stringsAsFactors = default.stringsAsFactors())
#asJSON: function ()
#asList: function ()
#asMatrix: function (includeHeaders = TRUE, repeatHeaders = FALSE, rawValue = FALSE)
#asTidyDataFrame: function (includeGroupCaptions = TRUE, includeGroupValues = TRUE,
...
Therefore, applying asDataFrame() on the R6 object
out <- pt$asDataFrame()
out
# 0 1 Total
#4 1 10 11
#6 3 4 7
#8 14 NA 14
#Total 18 14 32
str(out)
#'data.frame': 4 obs. of 3 variables:
#$ 0 : int 1 3 14 18
#$ 1 : int 10 4 NA 14
#$ Total: int 11 7 14 32
Or, to get a matrix, use asMatrix:
pt$asMatrix()
# [,1] [,2] [,3] [,4]
#[1,] "" "0" "1" "Total"
#[2,] "4" "1" "10" "11"
#[3,] "6" "3" "4" "7"
#[4,] "8" "14" "" "14"
#[5,] "Total" "18" "14" "32"

How to select only integer values of a column

My data has many columns with different names. I want to see only the numeric values in the column name_id and store them in z.
z should contain only the numeric values of data$name_id; any value that contains a letter should not be stored in z.
z <- unique(data$name_id)
z
#[1] 10 11 12 13 14 3 4 5 6 7 8 9
#Levels: 10 11 12 13 14 3 4 5 6 7 8 9 a b c d e f
When I tried this:
z <- unique(as.numeric(data$name_id))
z
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
the output only contains values up to 12, but the column also has values greater than 12.
Considering your data as the character vector
b
# [1] "1" "2" "3" "4" "5" "13" "14" "15" "45" "567" "999" "Name" "Age"
apply this (str_extract comes from the stringr package):
library(stringr)
regexp <- "[[:digit:]]+"
z <- str_extract(b, regexp)
z[is.na(z)] <- ""
z
# [1] "1" "2" "3" "4" "5" "13" "14" "15" "45" "567" "999" "" ""
Hope this helps.
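Since the column in the question is a factor, a base R sketch along the following lines may be closer to what was asked. The sample values in name_id are made up for illustration; the point is that as.numeric on a factor returns its codes, while going through as.character recovers the values.

```r
# Hypothetical stand-in for data$name_id
name_id <- factor(c("10", "11", "12", "13", "14", "3", "4", "a", "b"))

# as.numeric() on a factor gives the integer codes, not the values.
codes <- as.numeric(name_id)

# Convert through character instead; non-numeric entries become NA.
vals <- suppressWarnings(as.numeric(as.character(name_id)))

# Keep only the entries that really were numbers.
z <- unique(vals[!is.na(vals)])
```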

Getting back original names from rpart.object

I have saved models which were created using the rpart package in R. I am trying to retrieve some information from these saved models; specifically from rpart.object. While the documentation - rpart doc - is helpful there are a few things it is not clear about:
How do I find out which variables are categorical and which are numeric? Currently, what I do is refer to the 'index' column in the splits matrix. I've noticed that for numeric variables only, the entry is not an integer. Is there a cleaner way to do this?
The csplit matrix refers to the various values a categorical variable can take using integers i.e. R maps the original names to integers. Is there a way to access this mapping? For ex. if my original variable, say, Country can take any of the values France, Germany, Japan etc, the csplit matrix lets me know that a certain split is based on Country == 1, 2. Here, rpart has replaced references to France, Germany with 1, 2 respectively. How do I get the original names - France, Germany, Japan - back from the model file? Also, how do I know what the mapping between the names and the integers is?
Generally it is the terms component that would have that sort of information. See ?rpart::rpart.object.
fit <- rpart::rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$terms # notice that the attribute dataClasses has the information
attr(fit$terms, "dataClasses")
#------------
Kyphosis Age Number Start
"factor" "numeric" "numeric" "numeric"
That example doesn't have a csplit node in its structure because none of the variables are factors. You could make one fairly easily:
> fit <- rpart::rpart(Kyphosis ~ Age + factor(findInterval(Number,c(0,4,6,Inf))) + Start, data = kyphosis)
> fit$csplit
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 1 3
[3,] 3 1 3
[4,] 1 3 3
[5,] 3 1 3
[6,] 3 3 1
[7,] 3 1 3
[8,] 1 1 3
> attr(fit$terms, "dataClasses")
Kyphosis
"factor"
Age
"numeric"
factor(findInterval(Number, c(0, 4, 6, Inf)))
"factor"
Start
"numeric"
The integers are just the values of the factor variables so the "mapping" is just the same as it would be from as.numeric() to the levels() of a factor. If I were trying to construct a character matrix version of the fit$csplit-matrix that substituted the names of the levels in a factor variable, this would be one path to success:
> kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)), labels=c("low","med","high"))
> str(kyphosis)
'data.frame': 81 obs. of 5 variables:
$ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
$ Age : int 71 158 128 2 1 1 61 37 113 59 ...
$ Number : int 3 3 4 5 4 2 2 3 2 6 ...
$ Start : int 5 14 5 1 15 16 17 16 16 12 ...
$ Numlev : Factor w/ 3 levels "low","med","high": 1 1 2 2 2 1 1 1 1 3 ...
> fit <- rpart::rpart(Kyphosis ~ Age +Numlev + Start, data = kyphosis)
> Levels <- fit$csplit
> Levels[] <- levels(kyphosis$Numlev)[Levels]
> Levels
[,1] [,2] [,3]
[1,] "low" "low" "high"
[2,] "low" "low" "high"
[3,] "high" "low" "high"
[4,] "low" "high" "high"
[5,] "high" "low" "high"
[6,] "high" "high" "low"
[7,] "high" "low" "high"
[8,] "low" "low" "high"
Response to comment: If you only have the model then use str() to look at it. I see an "ordered" leaf in the example I created that has the factor labels stored in an attribute named "xlevels":
$ ordered : Named logi [1:3] FALSE FALSE FALSE
..- attr(*, "names")= chr [1:3] "Age" "Numlev" "Start"
- attr(*, "xlevels")=List of 1
..$ Numlev: chr [1:3] "low" "med" "high"
- attr(*, "ylevels")= chr [1:2] "absent" "present"
- attr(*, "class")= chr "rpart"
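Putting the two pieces together, a sketch of pulling both the variable types and the level names from a fitted model alone, assuming the model is an rpart object carrying the "xlevels" attribute shown in the str output above. The names kyph, classes, factor_vars and lev are illustrative.

```r
library(rpart)  # recommended package, normally installed alongside R

# Rebuild the example model from above on a copy of the kyphosis data.
kyph <- kyphosis
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      labels = c("low", "med", "high"))
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyph)

# Which variables are categorical, according to the terms component.
classes <- attr(fit$terms, "dataClasses")
factor_vars <- names(classes)[classes == "factor"]

# Level names stored on the fitted object; position i is integer code i.
lev <- attr(fit, "xlevels")
```

Here lev$Numlev[k] is the original name behind integer k for the variable Numlev, which is the mapping asked about.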
