Keep significant zeros when switching column to character formatting in R

I am cleaning up data in R and would like to maintain numeric formatting when switching my column from numeric to character, specifically the significant zeros in the hundredths place (in example below). My input columns mostly begin as Factor data and the below is an example of what I am trying to do.
I'm sure there is a better way, just hoping for some folks with more knowledge than I to shed some light. Most questions online deal with leading zeros or formatting purely numeric columns, but the aspect of the "<" symbol in my data throws me for a loop as to the proper way of doing this.
df = as.factor(c("0.01","5.231","<0.02","0.30","0.801","2.302"))
ind = which(df %in% "<0.02") # Locate the below detection value.
df[ind] <- NA # Substitute NA temporarily
df = as.numeric(as.character(df)) # Changes to numeric column
df = round(df, digits = 2) # Rounds to hundredths place
ind1 = which(df < 0.02) # Check for below reporting limit values
df = as.character(df) # Change back to character column...
df[c(ind,ind1)] = "<0.02" # so I can place the reporting limit back
> # RESULTS::
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3"
However, the 4th, 5th, and 6th values in the data are no longer reporting the zero in the hundredths place. What would be the proper order of operations for this? Perhaps changing the column back to character is incorrect? Any advice would be appreciated.
Thank you.
EDIT: ---- Upon recommendations from hrbrmstr and Mike:
Thanks for the advice. I tried the following and they both result in the same problem. Perhaps there is another way I could be indexing/replacing values?
format, same problem:
#... code from above...
ind1 = which(df < 0.02)
df = as.character(df)
df[!c(ind,ind1)] = format(df[!c(ind,ind1)],digits=2,nsmall=2)
> df
[1] "<0.02" "5.23" "<0.02" "0.3 " "0.8 " "2.3 "
sprintf, same problem:
# ... above code from example ...
ind1 = which(df < 0.02) # Check for below reporting limit values.
sprintf("%.2f",df) # sprintf attempt.
[1] "0.01" "5.23" "NA" "0.30" "0.80" "2.30"
df[c(ind,ind1)] = "<0.02" # Feed the symbols back into the column.
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.
Tried a different way of replacing the values, and same problem.
# ... above code from example ...
> ind1 = which(df < 0.02)
> df[c(ind,ind1)] = 9999999
> sprintf("%.2f",df)
[1] "9999999.00" "5.23" "9999999.00" "0.30" "0.80" "2.30"
> gsub("9999999.00","<0.02",df)
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.

You could just pad it out with a gsub and a bit of regex...
df <- c("<0.02", "5.23", "<0.02", "0.3", "4", "0.8", "2.3")
gsub("^([^\\.]+)$", "\\1\\.00", gsub("\\.(\\d)$", "\\.\\10", df))
[1] "<0.02" "5.23" "<0.02" "0.30" "4.00" "0.80" "2.30"
The inner gsub (evaluated first) looks for a dot followed by a single digit at the end of the string, and replaces the digit (the capture group \\1) with itself followed by a zero. The outer gsub then matches values that contain no dot at all, and adds .00 to the end.
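Note that in the sprintf attempts above the formatted result is printed but never assigned back, which would explain why the trailing zeros vanish again. A minimal sketch of a version that keeps the formatted strings, assuming a single reporting limit of 0.02 (the names raw, limit, and out are illustrative, not from the original post):

```r
# Input as in the question: a factor with one censored value
raw <- factor(c("0.01", "5.231", "<0.02", "0.30", "0.801", "2.302"))
limit <- 0.02                                # assumed reporting limit

chr <- as.character(raw)
censored <- startsWith(chr, "<")             # values already flagged as below limit
num <- suppressWarnings(as.numeric(chr))     # "<0.02" becomes NA here

# TRUE where the value was censored or rounds to less than the limit
below <- censored | (!is.na(num) & round(num, 2) < limit)

# Format once, as character, and keep the result
out <- ifelse(below, sprintf("<%.2f", limit), sprintf("%.2f", num))
out
# [1] "<0.02" "5.23"  "<0.02" "0.30"  "0.80"  "2.30"
```

The key point is that the column is converted to character only once, at the end, by sprintf itself, so there is no later as.character() call to drop the zeros.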

Related

Row names disappear after as.matrix

I notice that if the row names of a data frame are the sequence of numbers from 1 to the number of rows, the row names disappear after using as.matrix. But the row names re-appear if they are not such a sequence.
Here is a reproducible example:
test <- as.data.frame(list(x=c(0.1, 0.1, 1), y=c(0.1, 0.2, 0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
rownames(as.matrix(test[c(1, 3), ]))
# [1] "1" "3"
Does anyone have an idea on what is going on?
Thanks a lot
You can pass rownames = TRUE when you apply as.matrix (this works through partial matching of the rownames.force argument):
> as.matrix(test, rownames = TRUE)
x y
1 0.1 0.1
2 0.1 0.2
3 1.0 0.3
First and foremost, we always have a numerical index for sub-setting that won't disappear and that we should not confuse with row names.
as.matrix(test)[c(1, 3), ]
# x y
# [1,] 0.1 0.1
# [2,] 1.0 0.3
WHAT is going on when you use rownames is the dimnames lookup in the source code of base::rownames(),
function (x, do.NULL = TRUE, prefix = "row")
{
    dn <- dimnames(x)
    if (!is.null(dn[[1L]]))
        dn[[1L]]
    else {
        nr <- NROW(x)
        if (do.NULL)
            NULL
        else if (nr > 0L)
            paste0(prefix, seq_len(nr))
        else character()
    }
}
which yields NULL for dimnames(as.matrix(test))[[1]] but yields "1" "3" in the case of dimnames(as.matrix(test[c(1, 3), ]))[[1]].
Note, that the method base:::row.names.data.frame is applied in case of data frames, e.g. rownames(test).
That should explain the WHAT. Fortunately you did not ask for the WHY, which would be rather opinion-based.
There is a difference between 'automatic' and non-'automatic' row names.
Here is a motivating example:
automatic
test <- as.data.frame(list(x = c(0.1,0.1,1), y = c(0.1,0.2,0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
non-'automatic'
test1 <- test
rownames(test1) <- as.character(1:3)
rownames(test1)
# [1] "1" "2" "3"
rownames(as.matrix(test1))
# [1] "1" "2" "3"
You can read about this in e.g. ?data.frame, which mentions the behavior you discovered at the end:
If row.names was supplied as NULL or no suitable component was found the row names are the integer sequence starting at one (and such row names are considered to be ‘automatic’, and not preserved by as.matrix).
When you call test[c(1, 3), ] then you create non-'automatic' rownames implicitly, which is kinda documented in ?Extract.data.frame:
If `[` returns a data frame it will have unique (and non-missing) row names.
(type `[.data.frame` into your console if you want to go deeper here.)
Others showed what this means for your case already, see the argument rownames.force in ?matrix:
rownames.force: ... The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
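To make the 'automatic' distinction visible, here is a small sketch using .row_names_info() from base R, which (with its default type) returns a negative count for automatic row names and a positive one otherwise. The variable name sub_df is only illustrative:

```r
test <- data.frame(x = c(0.1, 0.1, 1), y = c(0.1, 0.2, 0.3))

# Automatic row names: negative value, dropped by as.matrix by default
.row_names_info(test)
rownames(as.matrix(test))                          # NULL

# Forcing row names keeps them even when they are automatic
rownames(as.matrix(test, rownames.force = TRUE))   # "1" "2" "3"

# Subsetting rows produces non-automatic row names: positive value
sub_df <- test[c(1, 3), ]
.row_names_info(sub_df)
rownames(as.matrix(sub_df))                        # "1" "3"
```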
The difference dataframe vs. matrix:
?rownames
rownames(x, do.NULL = TRUE, prefix = "row")
The important part is do.NULL, whose default is TRUE. The documentation says:
If do.NULL is FALSE, a character vector (of length NROW(x) or NCOL(x)) is returned in any case,
If the replacement versions are called on a matrix without any existing dimnames, they will add suitable dimnames. But constructions such as
rownames(x)[3] <- "c"
may not work unless x already has dimnames, since this will create a length-3 value from the NULL value of rownames(x).
For me that means (maybe not correct or professional) that rownames() only returns names for a matrix whose row dimnames have already been set; otherwise you get NULL, because do.NULL = TRUE is the default setting in rownames().
In your example you experience this kind of behaviour:
Here you declare row 1 and 3 and get 1 and 3
rownames(as.matrix(test[c(1, 3), ]))
[1] "1" "3"
Here you declare nothing and get NULL because NULL is the default.
rownames(as.matrix(test))
NULL
You can overcome this by declaring before:
rownames(test) <- 1:3
rownames(as.matrix(test))
[1] "1" "2" "3"
or you could do :
rownames(as.matrix(test), do.NULL = FALSE)
[1] "row1" "row2" "row3"
> rownames(as.matrix(test), do.NULL = FALSE, prefix="")
[1] "1" "2" "3"
Similar effect with rownames.force:
rownames.force
logical indicating if the resulting matrix should have character (rather than NULL) rownames. The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
dimnames(matrix_test)
I don't know exactly why it happens, but one way to fix it is to pass the argument rownames.force = TRUE to as.matrix:
rownames(as.matrix(test, rownames.force = TRUE))

Setting maximum numbers of characters in number

I want to limit a number to a maximum number of characters. For example, take the value 517.1918.
If I set the maximum number of characters to three, it should give me just 517 (the first three characters).
My work so far
I tried to split my number into two parts, the first containing the first three digits and the second containing the remaining digits, with the following code:
d_convert<-function(x){
x<-sub('(.{3})(.{2})', '\\1', x)
x
}
d_convert(12345)
And it works, but I'm not sure how to replace (.{2}) with something like length(x)-3. I tried print(paste()) but it didn't work. Is there a simple way to do this?
Try using signif which rounds a number to a given number of significant digits.
> signif(517.1918, 3)
[1] 517
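Note that signif rounds to significant digits rather than cutting characters, so signif(12345, 3) gives 12300, not 123. If literal truncation to the first n characters is what is wanted, a minimal sketch with substr could look like this (first_chars is a hypothetical helper name, and the decimal point counts as a character here):

```r
# Keep only the first n characters of the number's printed form
first_chars <- function(x, n = 3) {
  s <- format(x, scientific = FALSE, trim = TRUE)  # number -> string, no exponent
  substr(s, 1, n)
}

first_chars(517.1918)   # "517"
first_chars(12345)      # "123"
signif(12345, 3)        # 12300, for comparison
```

The result is a character string; wrap it in as.numeric() if a number is needed afterwards.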
I'm not sure if I understood what you want, but you can try this:
d_convert2 <- function(x, digits = 3){
  x <- gsub("\\D", "", x)                 # drop everything that is not a digit
  num_string <- strsplit(x, "")[[1]]      # split into single characters
  out <- list(digits = num_string[1L:digits], remaining = num_string[(digits+1):length(num_string)])
  out <- lapply(out, paste0, collapse="") # glue each part back into one string
  return(out)
}
> d_convert2(12345)
$digits
[1] "123"
$remaining
[1] "45"
> d_convert2("1,234.5")
$digits
[1] "123"
$remaining
[1] "45"

Extract numbers after a pattern in vector of characters

I'm trying to extract values from a vector of strings. Each string in the vector, (there are about 2300 in the vector), follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(tidyverse)
Data <- c("PASTE YOUR WHOLE STRING")
str_locate(Data, "Sent. ")
Reference <- str_locate_all(Data, "Sent. ") %>% as.data.frame()
Reference %>% names() #Returns [1] "start" "end"
Reference <- Reference %>% mutate(end = end +1)
YourNumbers <- substr(Data,start = Reference$end[1], stop = Reference$end[1])
for (i in 2:dim(Reference)[1]){
Temp <- substr(Data,start = Reference$end[i], stop = Reference$end[i])
YourNumbers <- paste(YourNumbers, Temp, sep = "")
}
YourNumbers #Returns "1234567" (the sentence numbers right after "Sent. ", not the ratings between the underscores)
We can use str_match_all from stringr to get all the numbers that follow "Sent".
str_match_all(text, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
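For completeness, a base R sketch that returns the ratings as one concatenated string, as the question asks. It uses a shortened, made-up input (s_demo) with the same layout as the sample data, so the exact field texts are assumptions:

```r
# Shortened stand-in for one element of the real vector
s_demo <- "733|Overall (-2 to 2): ___2___|Sent. 1 (A...): ____1____|Sent. 2 (B...): __3___|Pro Anti|NA|"

# Grab each "Sent. ..." field up to and including its rating
m <- regmatches(s_demo, gregexpr("Sent\\.[^|]*?: _+\\d+", s_demo, perl = TRUE))[[1]]

# Strip everything up to the last underscore, leaving only the rating
ratings <- sub(".*_", "", m)
ratings
# [1] "1" "3"

# Collapse into a single string per input element
paste(ratings, collapse = "")
# [1] "13"
```

For the full vector of about 2300 strings, the same steps could be wrapped in vapply() over the elements.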

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two digits. I would like every tag in the list to have at least 3 digits, like this (the issue is only in the BB tags, if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
I've looked into the sprintf and formatC functions but am still having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub calls can be enough: add one leading zero for two-digit numbers, two leading zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
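This works, but has edge cases: all digits in a tag are pooled together, so tags with digits in more than one place may be reformatted incorrectly.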
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using the sprintf and gsub functions. This step extracts the numeric values and changes their format.
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
The next step is to paste the reformatted numbers back onto the letters. Note that this strips every non-letter from the tag, so "EP121_1" becomes "EP1211".
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")
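Since the question says only the BB tags need padding, a more conservative sketch is to touch only elements of the form BB followed by digits and leave everything else as it is. pad_bb is a hypothetical helper name, and the "^BB\\d+$" pattern is an assumption about what counts as a BB tag:

```r
pad_bb <- function(x, width = 3) {
  is_bb <- grepl("^BB\\d+$", x)                 # only plain BB<number> tags
  n <- as.integer(sub("^BB", "", x[is_bb]))     # numeric part of those tags
  # Zero-pad the number to the requested width and glue the prefix back on
  x[is_bb] <- paste0("BB", formatC(n, width = width, flag = "0"))
  x
}

pad_bb(c("Y188", "BB2", "BB20", "BB044", "EP121_1"))
# [1] "Y188"    "BB002"   "BB020"   "BB044"   "EP121_1"
```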

find occurrence of string starting with a value in R

Is there a function for printing the total number of values in a dataset that begin with a given value?
Consider this dataset of version numbers:
df <- c("1.20", "3.1.20", "2.45", "1.10", "1.67.4.3", "5.200.1", "70.1.2.7")
I need to print only the 1.x version numbers.
My output would be:
1.20, 1.10, 1.67.4.3
(because these are version numbers starting with "1."; I do not want to print 3.1.20 or 70.1.2.7 because they do not start with "1." even though they contain "1." as a substring)
df <- c("1.20", "3.1.20", "2.45", "1.10", "1.67.4.3", "5.200.1", "70.1.2.7")
grep("^1\\.", df, value = TRUE)
Use the function substring inside brackets for subsetting:
df[substring(df, 1,2) == "1."]
Or:
sum(substr(df, 1, 2) == "1.")
[1] 3
And for the values themselves:
df[substr(df, 1, 2) == "1."]
[1] "1.20" "1.10" "1.67.4.3"
df[df<"2"]
#[1] "1.20" "1.10" "1.67.4.3"
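Depending on your dataset (e.g., if there are version numbers with a leading zero), you might need to expand this suggested solution to df[df<"2" & df>="1"]. Note that this is a string comparison, so a major version such as "10.3" would also sort before "2" and be included.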
The total number of values starting with a "1" can in this case be obtained with length(df[df<"2"]) (or length(df[df<"2" & df >="1"]) ).
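Another base R option is startsWith (available since R 3.3.0), which tests the prefix literally and so needs no regex escaping. A minimal sketch, where the extra "10.3" entry is a made-up value added to show that a two-digit major version is not matched:

```r
versions <- c("1.20", "3.1.20", "2.45", "1.10", "1.67.4.3",
              "5.200.1", "70.1.2.7", "10.3")   # "10.3" is hypothetical

is_v1 <- startsWith(versions, "1.")   # TRUE only for a literal "1." prefix

versions[is_v1]
# [1] "1.20"     "1.10"     "1.67.4.3"

sum(is_v1)                            # number of 1.x versions
# [1] 3
```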
