Sprintf Function and Character Dates - r

I have a data set in which I want to pad zeroes in front of a set of dates that don't have six characters. For example, I have a date that reads 91003 (October 3rd, 2009) and I want it to read 091003, as well as any other date that is missing a zero in front. When I use the sprintf function, the code is:
data1$entrydate <- sprintf("%06d", data1$entrydate)
But what it spits out is something like 000127, or some other seemingly random number, for all the other dates in the data set. I don't understand what's going on, and I would appreciate some help on the issue. Thanks.
PS. I am sometimes also getting an error message that sprintf is only for character values; I don't know if there is any code for numerical values.

I guess you got different results than expected because the column class was factor. You can convert the column to numeric with either as.numeric(as.character(datacolumn)) or as.numeric(levels(datacolumn))[datacolumn]. According to ?factor:
To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
So, you can use
levels(data1$entrydate) <- sprintf('%06d', as.numeric(levels(data1$entrydate)))
Example
Here is an example that shows the problem
v1 <- factor(c(91003, 91104,90103))
sprintf('%06d', v1)
#[1] "000002" "000003" "000001"
Or, it is equivalent to
sprintf('%06d', as.numeric(v1)) #the formatted numbers are
# the numeric index of factor levels.
#[1] "000002" "000003" "000001"
When you convert the factor levels back to numeric, it works as expected (the output follows the sorted order of the levels):
sprintf('%06d', as.numeric(levels(v1)))
#[1] "090103" "091003" "091104"
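A minimal sketch of the full conversion that keeps the original element order, assuming the column is a factor whose labels are whole numbers. Here v1 is the toy vector from the example above; num and padded are names introduced only for illustration.

```r
# Toy factor standing in for data1$entrydate (assumed to be a factor).
v1 <- factor(c(91003, 91104, 90103))

# Index the numeric levels by the factor itself to recover each element's value.
num <- as.numeric(levels(v1))[v1]

# Pad to six digits; as.integer() makes the %d conversion explicit.
padded <- sprintf("%06d", as.integer(num))
padded
#[1] "091003" "091104" "090103"
```

Unlike formatting levels(v1) directly, this returns one value per element, in the original order, so the result can be assigned straight back to the column.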

Related

UseMethod("type") error; no applicable method for 'type' applied to an object of class "c('double', 'numeric')"

In a dataframe, I have a column that has numeric values and some mixed in character data for some rows. I want to remove all rows with the character data and keep those rows with a number value. The df I have is 6 million rows, so I simply made a small object to try to solve my issue and then implement at a larger scale.
Here is what I did:
a <- c("fruit", "love", 53)
b <- str_replace_all("^[:alpha:]", 0)
Reading answers to other UseMethod errors on here (about factors), I tried to change "a" to as.character(a) and attempt "b" again. But, I get the same error. I'm trying to simply make any alphabetic value into the number zero and I'm fairly new at all this.
There are several issues here, even in these two lines of code. First, a is a character vector, because c() coerces all of its elements to a common type and two of them are character strings. This means that your numeric 53 is coerced into a character.
> print(a)
[1] "fruit" "love" "53"
You've got the wrong syntax for str_replace_all. See the documentation for how to use it correctly. But that's not what you want here, because you want numerics.
The first thing you need to do is convert a to a numeric. A crude way of doing this is simply
> b <- as.numeric(a)
Warning message:
NAs introduced by coercion
> b
[1] NA NA 53
And then subset to include only the numeric values in b:
> b <- b[!is.na(b)]
> b
[1] 53
But whether that's what you want to do with a 6 million row dataframe is another matter. Please think about exactly what you would like to do, supply us with better test data, and ask your question again.
There's probably a more efficient way of doing this on a large data frame (e.g. something column-wise, instead of row-wise), but to answer your specific question about the vector a:
as.numeric(stringr::str_replace_all(a, "[a-z]+", "0"))
Note that the replacing value must be a character (the last argument in the function call, "0"). (You can look up the documentation from your R-console by: ?stringr::str_replace_all)
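For the data frame case, a column-wise sketch in base R might look like the following. The data frame df and its column names are hypothetical stand-ins for the 6 million row data; the idea is simply to coerce the whole column once and keep the rows that parsed.

```r
# Hypothetical small data frame; the column names are made up for illustration.
df <- data.frame(
  id = 1:4,
  value = c("fruit", "12.5", "love", "53"),
  stringsAsFactors = FALSE
)

# Non-numeric strings become NA; the coercion warning is expected, so silence it.
num <- suppressWarnings(as.numeric(df$value))

# Keep only the rows that parsed, and store the numeric values.
keep <- !is.na(num)
clean <- df[keep, ]
clean$value <- num[keep]
clean
```

Note that rows whose value was already missing are dropped as well, since they cannot be told apart from failed conversions after coercion.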

Parsing dates in R from strings with multiple formats

I have a tibble in R with about 2,000 rows. It was imported from Excel using read_excel. One of the fields is a date field: dob. It imported as a string, and has dates in three formats:
"YYYY-MM-DD"
"DD-MM-YYYY"
"XXXXX" (ie, a five-digit Excel-style date)
Let's say I treat the column as a vector.
dob <- c("1969-02-02", "1986-05-02", "34486", "1995-09-05", "1983-06-05",
"1981-02-01", "30621", "01-05-1986")
I can see that I probably need a solution that uses both parse_date_time and as.Date.
If I use parse_date_time:
dob_fixed <- parse_date_time(dob, c("ymd", "dmy"))
This fixes them all, except the five-digit one, which returns NA.
I can fix the five-digit one, by using as.integer and as.Date:
dob_fixed2 <- as.Date(as.integer(dob), origin = "1899-12-30")
Ideally I would run one and then the other, but because each returns NA on the strings that don't work I can't do that.
Any suggestions for doing all? I could simply change them in Excel and re-import, but I feel like that's cheating!
We create a logical index from the NA values left by the first run and use it to select the elements for the second run:
i1 <- is.na(dob_fixed)
dob_fixed[i1] <- as.Date(as.integer(dob[i1]), origin = "1899-12-30")
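An alternative sketch in base R only, assuming the three formats can be told apart by their shape. The names is_ymd, is_dmy, is_serial and dob_date are introduced here for illustration, and the shortened dob vector is a sample of the one in the question.

```r
dob <- c("1969-02-02", "34486", "01-05-1986", "30621")

# Classify each string by its shape before parsing.
is_ymd    <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dob)
is_dmy    <- grepl("^[0-9]{2}-[0-9]{2}-[0-9]{4}$", dob)
is_serial <- grepl("^[0-9]{5}$", dob)

# Start with all-NA dates and fill in each group with its own parser.
dob_date <- rep(as.Date(NA), length(dob))
dob_date[is_ymd]    <- as.Date(dob[is_ymd], format = "%Y-%m-%d")
dob_date[is_dmy]    <- as.Date(dob[is_dmy], format = "%d-%m-%Y")
dob_date[is_serial] <- as.Date(as.integer(dob[is_serial]), origin = "1899-12-30")
dob_date
```

Classifying first avoids a pitfall of trying each format in turn: as.Date() can partially match a string against the wrong format and return a plausible-looking but wrong date instead of NA.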

regex single digit

I have a question which I think is solved by regex use in R.
I have a set of dates (as chr) which I would like in a different format (as chr).
I have tried to fool around with the below examples where the first (new_dates) gives the right format for months 1-9 and wrong for 10-12 and (new_dates2) gives the right format for 10-12 but nothing for 1-9.
I see that the code in the first case matches a single digit twice for 10-12, but don't really know how to tell it to match only single digit.
The final vector of correct dates shows the result I would like.
dates <- c("1/2016", "2/2016", "3/2016", "4/2016", "5/2016", "6/2016", "7/2016", "8/2016", "9/2016", "10/2016", "11/2016", "12/2016", "1/2017")
new_dates <- sub("(\\d)[:/:](\\d{4})","\\2M0\\1", dates)
new_dates2 <- sub("(\\d{2})[:/:](\\d{4})","\\2M\\1", dates)
correctdates <- c("2016M01", "2016M02", "2016M03", "2016M04", "2016M05", "2016M06", "2016M07", "2016M08", "2016M09", "2016M10", "2016M11", "2016M12", "2017M01")
Here's a base R method that will return the desired format:
format(as.Date(paste0("1/",dates), "%d/%m/%Y"), "%YM%m")
[1] "2016M01" "2016M02" "2016M03" "2016M04" "2016M05" "2016M06" "2016M07" "2016M08" "2016M09"
[10] "2016M10" "2016M11" "2016M12" "2017M01"
The idea is to first convert to a Date object and then use the format function to create the desired character representation. I pasted on 1/ so that a day is present in each element.
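As @a p o m said, it might be better to look for another solution if you are manipulating dates, but if you want to stick with regular expressions you can try this one:
([02-9]|1[0-2]?)[:\/](\d{4})
In R, anchor the single-digit case with ^ so that it cannot match the second digit of 10-12, then handle the two-digit months in a second call:
new_dates <- sub("^(\\d)/(\\d{4})$", "\\2M0\\1", dates)
new_dates <- sub("^(\\d{2})/(\\d{4})$", "\\2M\\1", new_dates)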
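A regex-free sketch of the same conversion, assuming every element has the form month/year: split the string, convert both parts to integers, and let sprintf do the zero padding. The names parts, month and year are introduced here for illustration.

```r
dates <- c("1/2016", "9/2016", "10/2016", "12/2016", "1/2017")

# Split "month/year" into its two parts.
parts <- strsplit(dates, "/", fixed = TRUE)
month <- as.integer(vapply(parts, `[`, character(1), 1))
year  <- as.integer(vapply(parts, `[`, character(1), 2))

# %02d pads the month to two digits.
new_dates <- sprintf("%dM%02d", year, month)
new_dates
#[1] "2016M01" "2016M09" "2016M10" "2016M12" "2017M01"
```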

How can I convert a factor variable with missing values to a numeric variable?

I loaded my dataset (original.csv) to R:
original <- read.csv("original.csv")
str(original) showed that my dataset has 16 variables (14 factors, 2 integers). 14 variables have missing values. That was fine, but 3 variables that are really numbers were read in as factors.
I searched the web and found this command: as.numeric(as.character(original$Tumor_Size))
(Tumor_Size is one of the variables that was read in as a factor.)
By the way, missing values in my dataset are marked as dot (.)
After running as.numeric(as.character(original$Tumor_Size)), the values of Tumor_Size were listed, followed by the warning message “NAs introduced by coercion”.
I expected the variable to be converted to numeric after running the command above, but a second str(original) showed that I was wrong: Tumor_Size and the other two variables were still factors. Below is a sample of my dataset:
a piece of my dataset
How can I solve my problem?
The crucial information here is how missing values are encoded in your data file. The corresponding argument in read.csv() is called na.strings. So if dots are used:
original <- read.csv("original.csv", na.strings = ".")
I'm not 100% sure what your problem is but maybe this will help....
original<-read.csv("original.csv",header = TRUE,stringsAsFactors = FALSE)
original$Tumor_Size<-as.numeric(original$Tumor_Size)
This will introduce NAs because it cannot convert your dot (.) to a numeric value. If you replace the NAs with a dot again, the field will become character. To do this you can use:
original$Tumor_Size[is.na(original$Tumor_Size)]<-"."
Hope this helps.
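A small sketch that puts the two points together, assuming dots mark missing values. The inline CSV text stands in for original.csv; only the Tumor_Size name comes from the question, while the Grade column and all values are invented for illustration.

```r
# Inline stand-in for original.csv (hypothetical values).
csv_text <- "Tumor_Size,Grade\n2.5,1\n.,2\n4.0,3\n"

# Treat "." as missing while reading, so the column is parsed as numeric.
original <- read.csv(text = csv_text, na.strings = ".")
str(original)

# If a column is already a factor, assign the converted result back;
# calling as.numeric() without assignment only prints the values.
f <- factor(c("2.5", ".", "4.0"))
converted <- suppressWarnings(as.numeric(as.character(f)))
converted
```

The missing assignment is the likely reason the second str(original) in the question still showed factors: the conversion ran, but its result was never stored in the data frame.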

Adding conditional leading or trailing zeros

I need help conditionally adding leading or trailing zeros.
I have a dataframe with one column containing icd9 diagnoses. As a vector, the column looks like:
"33.27" "38.45" "9.25" "4.15" "38.45" "39.9" "84.1" "41.5" "50.3"
I need all the values to have a length of 5, including the period in the middle (not counting the quotation marks). If the value has one digit before the period, it needs a leading zero. If the value has one digit after the period, it needs a zero at the end. So the result should look like this:
"33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Here is the vector for R:
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3" )
This does it in one line
formatC(as.numeric(icd9),width=5,format='f',digits=2,flag='0')
ICD-9 codes have some formatting quirks which can lead to misinterpretation with simple string processing. The icd package on CRAN takes care of all the corner cases when doing ICD processing, and has been battle-tested over about six years of use by many R users.
Using a function called change, which accepts the maximum number of characters as an argument, might help. Note that it only pads on the left, so "39.9" becomes "039.9" rather than "39.90":
change<-function(x, n=max(nchar(x))) gsub(" ", "0", formatC(x, width=n))
icd92<-gsub(" ","",paste(change(icd9,5)))
You can also use sprintf after converting the vector into numeric.
sprintf("%05.2f", as.numeric(icd9))
[1] "33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Notes
See the examples in ?sprintf to work out the proper format.
There is some risk of introducing errors due to numerical precision here, though it works well in the example.
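To sidestep the numerical precision concern, a string-only sketch is possible. It assumes every code contains exactly one period with one or two characters on each side; the names parts, whole, frac and padded are introduced here for illustration.

```r
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3")

# Split on the literal period into the part before and the part after.
parts <- strsplit(icd9, ".", fixed = TRUE)
whole <- vapply(parts, `[`, character(1), 1)
frac  <- vapply(parts, `[`, character(1), 2)

# Left-pad the first part and right-pad the second part with zeros to two characters.
whole <- paste0(strrep("0", pmax(0, 2 - nchar(whole))), whole)
frac  <- paste0(frac, strrep("0", pmax(0, 2 - nchar(frac))))

padded <- paste(whole, frac, sep = ".")
padded
#[1] "33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
```

Because the values are never converted to numbers, nothing is rounded, but codes outside the assumed shape (for example, ones without a period) would need separate handling.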

Resources