Row names disappear after as.matrix - r

I noticed that if the row names of a data frame are the sequence of numbers from 1 to the number of rows, they disappear after calling as.matrix. But the row names are kept if they are not such a sequence.
Here is a reproducible example:
test <- as.data.frame(list(x=c(0.1, 0.1, 1), y=c(0.1, 0.2, 0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
rownames(as.matrix(test[c(1, 3), ]))
# [1] "1" "3"
Does anyone have an idea on what is going on?
Thanks a lot

You can pass rownames = TRUE when you call as.matrix (this partially matches the rownames.force argument of as.matrix.data.frame):
> as.matrix(test, rownames = TRUE)
x y
1 0.1 0.1
2 0.1 0.2
3 1.0 0.3

First and foremost, we always have a numerical index for sub-setting that won't disappear and that we should not confuse with row names.
as.matrix(test)[c(1, 3), ]
# x y
# [1,] 0.1 0.1
# [2,] 1.0 0.3
WHAT's going on when you call rownames is the dimnames lookup in the source code of base::rownames(),
function (x, do.NULL = TRUE, prefix = "row")
{
    dn <- dimnames(x)
    if (!is.null(dn[[1L]]))
        dn[[1L]]
    else {
        nr <- NROW(x)
        if (do.NULL)
            NULL
        else if (nr > 0L)
            paste0(prefix, seq_len(nr))
        else character()
    }
}
which yields NULL for dimnames(as.matrix(test))[[1]] but yields "1" "3" in the case of dimnames(as.matrix(test[c(1, 3), ]))[[1]].
Note, that the method base:::row.names.data.frame is applied in case of data frames, e.g. rownames(test).
That should explain the WHAT. Fortunately you did not ask for the WHY, which would be rather opinion-based.

There is a difference between 'automatic' and non-'automatic' row names.
Here is a motivating example:
automatic
test <- as.data.frame(list(x = c(0.1,0.1,1), y = c(0.1,0.2,0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
non-'automatic'
test1 <- test
rownames(test1) <- as.character(1:3)
rownames(test1)
# [1] "1" "2" "3"
rownames(as.matrix(test1))
# [1] "1" "2" "3"
You can read about this in e.g. ?data.frame, which mentions the behavior you discovered at the end:
If row.names was supplied as NULL or no suitable component was found the row names are the integer sequence starting at one (and such row names are considered to be ‘automatic’, and not preserved by as.matrix).
When you call test[c(1, 3), ] then you create non-'automatic' rownames implicitly, which is kinda documented in ?Extract.data.frame:
If `[` returns a data frame it will have unique (and non-missing) row names.
(type `[.data.frame` into your console if you want to go deeper here.)
Others showed what this means for your case already, see the argument rownames.force in ?matrix:
rownames.force: ... The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
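A minimal sketch for checking this yourself, assuming the internal helper .row_names_info() behaves as its help page describes (with the default type it returns a negative count for 'automatic' row names). The variable names are only for illustration.

```r
test <- data.frame(x = c(0.1, 0.1, 1), y = c(0.1, 0.2, 0.3))

# Negative value: the row names are 'automatic'
auto_info <- .row_names_info(test)

# Positive value: subsetting produced non-'automatic' row names
sub_info <- .row_names_info(test[c(1, 3), ])

# rownames.force = TRUE keeps the row names in either case
forced <- rownames(as.matrix(test, rownames.force = TRUE))
forced
# [1] "1" "2" "3"
```

The sign of the first two values is what as.matrix looks at when rownames.force is left at its default of NA.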

The difference dataframe vs. matrix:
?rownames
rownames(x, do.NULL = TRUE, prefix = "row")
The important part is do.NULL, whose default is TRUE. From the documentation:
If do.NULL is FALSE, a character vector (of length NROW(x) or NCOL(x)) is returned in any case,
If the replacement versions are called on a matrix without any existing dimnames, they will add suitable dimnames. But constructions such as
rownames(x)[3] <- "c"
may not work unless x already has dimnames, since this will create a length-3 value from the NULL value of rownames(x).
For me that means (maybe not correct or professional) that for rownames() to return something for a matrix, the row dimnames must have been set beforehand; otherwise you get NULL, because that is the default behaviour of rownames().
In your example you experience this kind of behaviour:
Here you declare row 1 and 3 and get 1 and 3
rownames(as.matrix(test[c(1, 3), ]))
[1] "1" "3"
Here you declare nothing and get NULL because NULL is the default.
rownames(as.matrix(test))
NULL
You can overcome this by declaring before:
rownames(test) <- 1:3
rownames(as.matrix(test))
[1] "1" "2" "3"
or you could do :
rownames(as.matrix(test), do.NULL = FALSE)
[1] "row1" "row2" "row3"
rownames(as.matrix(test), do.NULL = FALSE, prefix="")
[1] "1" "2" "3"
Similar effect with rownames.force:
rownames.force
logical indicating if the resulting matrix should have character (rather than NULL) rownames. The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
dimnames(matrix_test)

I don't know exactly why it happens, but one way to fix it is to pass the argument rownames.force = TRUE to as.matrix:
rownames(as.matrix(test, rownames.force = TRUE))


Recode if string (with punctuation) contains certain text

How can I search through a character vector and, if the string at a given index contains a pattern, replace that index's value?
I tried this:
List <- c(1:8)
Types<-as.character(c(
"ABC, the (stuff).\n\n\n fun", "meaningful", "relevant", "rewarding",
"unpleasant", "enjoyable", "engaging", "disinteresting"))
for (i in List) {
if (grepl(Types[i], "fun", fixed = TRUE))
{Types[i]="1"
} else if (grepl(Types[i], "meaningful", fixed = TRUE))
{Types[i]="2"}}
The code works for "meaningful", but doesn't when there's punctuation or other things in the string, as with "fun".
The first argument to grepl is the pattern, not the string.
This would be a literal fix of your code:
for (i in seq_along(Types)) {
if (grepl("fun", Types[i], fixed = TRUE)) {
Types[i] = "1"
} else if (grepl("meaningful", Types[i], fixed = TRUE)) {
Types[i] = "2"
}
}
Types
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
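To see why the original loop handled "meaningful" but not "fun", it may help to compare the two argument orders directly. This is only an illustrative sketch; the variable names are made up here.

```r
long_string <- "ABC, the (stuff).\n\n\n fun"

# Correct order: pattern first, then the string to search in
right_order <- grepl("fun", long_string, fixed = TRUE)
right_order
# [1] TRUE

# Swapped order: searches for the whole long string inside "fun"
swapped_order <- grepl(long_string, "fun", fixed = TRUE)
swapped_order
# [1] FALSE

# When pattern and string are identical, the order makes no difference,
# which is why "meaningful" appeared to work
same_both_ways <- grepl("meaningful", "meaningful", fixed = TRUE)
same_both_ways
# [1] TRUE
```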
BTW, the use of List works, but it's a little extra: when you have separate variables like that, it is possible that one might go out of sync with the other. For instance, if you update Types and forget to update List, then it will break (or fail). For this, I used seq_along(Types) instead.
BTW: here's a slightly different version that drops the loop and introduces you to the power of vectorization. Note that it still overwrites Types in place, so assign to a copy first if you need to keep the original:
Types[grepl("fun", Types, fixed = TRUE)] <- "1"
Types[grepl("meaningful", Types, fixed = TRUE)] <- "2"
Types
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
The next level (perhaps over-complicating?) would be to store the patterns and recoding replacements in a frame (always a 1-to-1, you'll never accidentally update one without the other, can be stored in CSV if needed) and Reduce on it:
ptns <- data.frame(ptn = c("fun", "meaningful"), repl = c("1", "2"))
Reduce(function(txt, i) {
txt[grepl(ptns$ptn[i], txt, fixed = TRUE)] <- ptns$repl[i]
txt
}, seq_len(nrow(ptns)), init = Types)
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
You could use str_replace_all:
library(stringr)
pat <- c(fun = '1', meaningful = '2')
str_replace_all(Types, setNames(pat, sprintf('(?s).*%s.*', names(pat))))
[1] "1" "2" "relevant"
[4] "rewarding" "unpleasant" "enjoyable"
[7] "engaging" "disinteresting"
Try str_replace(string, pattern, replacement) from the stringr package.

Keep significant zeros when switching column to character formatting in R

I am cleaning up data in R and would like to maintain numeric formatting when switching my column from numeric to character, specifically the significant zeros in the hundredths place (in example below). My input columns mostly begin as Factor data and the below is an example of what I am trying to do.
I'm sure there is a better way, just hoping for some folks with more knowledge than I to shed some light. Most questions online deal with leading zeros or formatting purely numeric columns, but the aspect of the "<" symbol in my data throws me for a loop as to the proper way of doing this.
df = as.factor(c("0.01","5.231","<0.02","0.30","0.801","2.302"))
ind = which(df %in% "<0.02") # Locate the below detection value.
df[ind] <- NA # Substitute NA temporarily
df = as.numeric(as.character(df)) # Changes to numeric column
df = round(df, digits = 2) # Rounds to hundredths place
ind1 = which(df < 0.02) # Check for below reporting limit values
df = as.character(df) # Change back to character column...
df[c(ind,ind1)] = "<0.02" # so I can place the reporting limit back
> # RESULTS::
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3"
However, the 4th, 5th, and 6th values in the data are no longer reporting the zero in the hundredths place. What would be the proper order of operations for this? Perhaps changing the column back to character is incorrect? Any advice would be appreciated.
Thank you.
EDIT: ---- Upon recommendations from hrbrmstr and Mike:
Thanks for the advice. I tried the following and they both result in the same problem. Perhaps there is another way I could be indexing/replacing values?
format, same problem:
#... code from above...
ind1 = which(df < 0.02)
df = as.character(df)
df[!c(ind,ind1)] = format(df[!c(ind,ind1)],digits=2,nsmall=2)
> df
[1] "<0.02" "5.23" "<0.02" "0.3 " "0.8 " "2.3 "
sprintf, same problem:
# ... above code from example ...
ind1 = which(df < 0.02) # Check for below reporting limit values.
sprintf("%.2f",df) # sprintf attempt.
[1] "0.01" "5.23" "NA" "0.30" "0.80" "2.30"
df[c(ind,ind1)] = "<0.02" # Feed the symbols back into the column.
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.
Tried a different way of replacing the values, and same problem.
# ... above code from example ...
> ind1 = which(df < 0.02)
> df[c(ind,ind1)] = 9999999
> sprintf("%.2f",df)
[1] "9999999.00" "5.23" "9999999.00" "0.30" "0.80" "2.30"
> gsub("9999999.00","<0.02",df)
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.
You could just pad it out with a gsub and a bit of regex...
df <- c("<0.02", "5.23", "<0.02", "0.3", "4", "0.8", "2.3")
gsub("^([^\\.]+)$", "\\1\\.00", gsub("\\.(\\d)$", "\\.\\10", df))
[1] "<0.02" "5.23" "<0.02" "0.30" "4.00" "0.80" "2.30"
The inner gsub, which runs first, looks for a dot followed by a single digit at the end of the string, and replaces the digit (the capture group \\1) with itself followed by a zero. The outer gsub then matches strings that contain no dot at all and appends .00 to them.
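As an alternative to the regex, here is a sketch that does the formatting with sprintf and assigns the result before putting the "<0.02" marker back (the sprintf attempt in the question printed the right strings but never stored them). It assumes the reporting limit is 0.02 and that flagged values start with "<"; the variable names are hypothetical.

```r
raw <- factor(c("0.01", "5.231", "<0.02", "0.30", "0.801", "2.302"))
txt <- as.character(raw)

# Values already flagged as below the reporting limit
censored <- startsWith(txt, "<")

# Flagged values cannot be parsed and become NA (warning suppressed on purpose)
num <- round(suppressWarnings(as.numeric(txt)), digits = 2)

# Numeric values that fall below the limit after rounding
below <- censored | (!is.na(num) & num < 0.02)

# Format to exactly two decimals and keep the result as character
out <- sprintf("%.2f", num)
out[below] <- "<0.02"
out
# [1] "<0.02" "5.23"  "<0.02" "0.30"  "0.80"  "2.30"
```

The key difference from the attempts in the question is that the character vector returned by sprintf is kept, instead of converting the numeric column with as.character afterwards.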

names of leaves of nested list in R

I want to check if two nested lists have the same names at the last level.
If unlist gave an option not to concatenate names this would be trivial. However, it looks like I need some function leaf.names():
X <- list(list(a = pi, b = list(alpha.c = 1:5, 7.12)), d = "a test")
leaf.names(X)
[1] "a" "alpha.c" "NA" "d"
I want to avoid any inelegant grepping if possible. I feel like there should be some easy way to do this with rapply or unlist...
leaf.names <- function(X) names(rlang::squash(X))
or
leaf.names <- function(X){
while(any(sapply(X, is.list))) X <- purrr::flatten(X)
names(X)
}
gives
leaf.names(X)
# [1] "a" "alpha.c" "" "d"
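If you would rather not depend on rlang or purrr, a base R version can be sketched with a small recursive helper. The name leaf_names_base is made up for this example, and it assumes that unnamed leaves should be reported as "".

```r
leaf_names_base <- function(x) {
  nms <- names(x)
  # A list without any names has names(x) == NULL; treat every name as ""
  if (is.null(nms)) nms <- rep("", length(x))
  unlist(lapply(seq_along(x), function(i) {
    # Recurse into sub-lists, otherwise report the name of the leaf itself
    if (is.list(x[[i]])) leaf_names_base(x[[i]]) else nms[i]
  }))
}

X <- list(list(a = pi, b = list(alpha.c = 1:5, 7.12)), d = "a test")
leaf_names_base(X)
# [1] "a"       "alpha.c" ""        "d"
```

Two nested lists can then be compared with identical(leaf_names_base(X), leaf_names_base(Y)).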

Q: Distinction between levels() and unique() for zero-length strings in data frame

If a column in a data frame contains a zero-length string, passing it to the levels() function will return the zero-length string. However, unique() will not. This seems counter-intuitive to me, and I haven't been able to find any documentation that explains this. Does anyone know why this is the case?
Example:
d <- data.frame(col1 = c("", 'a', 'b'), stringsAsFactors = TRUE) # Contains ""; factor column (the default before R 4.0.0).
Call unique():
unique(d$col1)
unique() does not return the zero-length string element:
[1] a b
Levels: a b
levels() includes "" in the results:
levels(d$col1)
[1] "" "a" "b"
This is to do with printing methods. From the documentation, the return value for unique:
For a vector, an object of the same type of x, but with only one copy of each duplicated element. No attributes are copied (so the result has no names).
And indeed, you can see that unique returns a factor because col1 is also a factor:
class(unique(d$col1))
[1] "factor"
Whereas, the return value for levels is a character vector:
class(levels(d$col1))
[1] "character"
And thus, unique does print the empty string but without the double quotes. An alternative would be to define the df without factor columns:
d <- data.frame(col1 = c("", 'a', 'b'), stringsAsFactors = FALSE)
And here unique will indeed return the wanted "". Of course levels only applies to factors and would no longer be adequate:
unique(d$col1)
[1] "" "a" "b"
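A short sketch to confirm that the empty string is still present in the result of unique() on a factor and is only hard to see when printed. The factor is built explicitly here so the example does not depend on the stringsAsFactors default; the variable names are illustrative.

```r
f <- factor(c("", "a", "b"))
u <- unique(f)

# Still three elements: the empty string was not dropped
n_unique <- length(u)

# Converting to character makes the quotes, and hence "", visible
u_chr <- as.character(u)

# Membership test also finds the empty string
has_empty <- "" %in% u
```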

How assign names for each list[[i]] in a huge list file

I have a very big list. The dput() output for two of its elements is as below:
> dput(mydata).....
`NA` = c("SHC2", "GRB2", "HRAS", "KRAS", "NRAS", "SHC3",
"MAPK1", "MAPK3", "MAP2K1", "MAP2K2", "RAF1", "SHC1", "SOS1",
"YWHAB", "CDK1"), `NA` = c("NUP50", "NUPL2", "PSIP1", "NUP35",
"NUP205", "NUP210", "NUP188", "NUP62", "SLC25A4", "SLC25A5",
"SLC25A6", "HMGA1", "NUP43", "KPNA1", "NUP88", "NUP54", "NUP133",
"NUP107", "RANBP2", "LOC645870", "TPR", "NUP37", "NUP85",
"NUP214", "AAAS", "SEH1L", "RAE1", "BANF1", "NUP155", "NUP93",
"NUPL1", "POM121", "NUP153"), ....
I also have a file containing the names, but I can't assign them:
names(mydata)<-list("a", "b")# clears former data and replaces with "a" and "b"
names(mydata)<-c("a", "b")
I have tried using names(mydata) but it doesn't do what I need. I think "N" should be the name, which I don't know how to access. Right?
If yes, what should I do? Regards
I'm not sure what you are trying to do. If you want to name the elements of a list with the names from another file, here's how to do it:
x <- list (1,2,3,4,5)
y <- LETTERS [1:5]
names (x) <- y
Thanks
The problem was that I was using [[ ]] to retrieve the names, but [ ] should be used to see the names:
x <- list (1,2,3,4,5)
y <- LETTERS [1:5]
names (x) <- y
> x[[1]]
[1] 1
> x[1]
$A
[1] 1
> x[2]
$B
[1] 2
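To summarise the difference, here is a small sketch using the same toy list. It only illustrates standard list indexing; nothing in it is specific to the original mydata object.

```r
x <- list(1, 2, 3, 4, 5)
names(x) <- LETTERS[1:5]

all_names <- names(x)      # every name at once
one_name  <- names(x[2])   # "B": single brackets keep the list wrapper and its name
by_index  <- x[[2]]        # 2: double brackets return the bare element, without the name
by_name   <- x[["B"]]      # 2: elements can also be fetched by name
```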
