Create character vector of filenames in numeric sequence with leading zeroes - r

I need to generate 100 file names.
How would you generate, in R, a character vector containing these 100 file names: plot01.png, plot02.png, plot03.png, ..., plot99.png, plot100.png? Notice that the numbers of the first nine files have a leading zero.
The obvious but very inefficient solution is to write out a vector of 100 file names by hand. I'm trying to figure out a more efficient way to create this character vector.

A concise option is paste0("plot", sprintf("%02d.png", 1:100)):
[1] "plot01.png" "plot02.png" "plot03.png" "plot04.png" ...
Another approach that takes more typing, but is maybe easier to follow, is string padding with str_pad from the stringr package:
library(stringr)
paste0("plot", str_pad(1:100, width = 2, side = "left", pad = "0"), ".png")

Combine paste0 and formatC:
paste0("plot", formatC(1:100, flag = "0", width = 2), ".png")
# [1] "plot01.png" "plot02.png" "plot03.png" "plot04.png" "plot05.png" ...


Count number of integers of a String then paste it into a new file - RStudio

I want to work with a very big text file (txt) that has more than 3 million lines, each line containing a different name made up of characters and digits.
My idea is to clean this file up a little (so it is easier to use) by removing the names that contain more than 2 digits.
I would like to parse the names, count the digits, and, if a name contains fewer than 3 digits, write it to a new file in R.
My big file would be something like this (each name is on its own line in the actual file):
susan123 susan1 john john22345 alex55 alex1234
And then I would have this new file:
susan1 john alex55
Is this possible in R?
Thanks
I'll leave reading and writing the file to your choice of functions. Once you've got the words in R as a vector, here's a way to subset it to just the names with fewer than 3 digits.
x = c("susan123", "susan1", "john", "john22345", "alex55", "alex1234")
library(stringr)
x[str_detect(x, pattern = "\\D+\\d{0,2}$")]
# [1] "susan1" "john" "alex55"
Base R:
We can strip the non-digits with gsub, count what remains with nchar, and subset with which:
x = c("susan123", "susan1", "john", "john22345", "alex55", "alex1234")
x[which(nchar(gsub("\\D", "", x)) < 3)]
# [1] "susan1" "john"   "alex55"

How to transform long names into shorter (two-part) names

I have a character vector of long names, each consisting of several words joined by dots as delimiters.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The lengths of the names differ, but only the first two words of each full name matter.
My goal is to get names of up to 7 characters: the 3 initial characters of each of the first two words, with a dot as the separator between them.
These examples come very close to what I need, but I do not know how to adapt their code to my case:
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
       "Bet.nan",
       "Sal.gla",
       "Sal.jen",
       "Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 word characters ((\\w{1,3})), then skip over anything that is not a dot ([^\\.]*), match a dot (\\.), and then again match up to 3 word characters ((\\w{1,3})). Finally, .* matches anything that comes after. We then keep only the two capture groups and separate them with a dot: \\1.\\2.
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i) {
  paste(substr(i[1], 1, 3), substr(i[2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not a regex expert.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes the first three characters of the first two words and merges them
cleaner_fun <- function(ugly_string) {
  words <- strsplit(ugly_string, "\\.")[[1]]
  short_words <- substr(words, 1, 3)
  new_name <- paste(short_words[1:2], collapse = ".")
  return(new_name)
}
# Testing the function (USE.NAMES = FALSE keeps the result unnamed)
sapply(x, cleaner_fun, USE.NAMES = FALSE)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

R: Read in a subset of lines and turn it into a conventional format (data.table approach preferred)

I have a file that has over a hundred million rows, and scattered throughout are rows with extra tab delimiters in their fields. I need to read the problematic rows into R whilst ignoring the others, due to the large file size involved.
Example txt file with extra delimiters in some rows:
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"
The first thing that I tried was the readLines function; however, whilst I can specify the line to stop at, it will still read everything up to that point, which could still be too much:
readLines(textConnection(text_file), n = 4)
[1] "My\tname\tis\tAlpha" "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta"
I then realised that I could also use the other dataset-import functions if I specified the delimiter to be something that should probably never appear in the data. The fread function from the data.table package would be perfect for this, as it is the fastest way to deal with large datasets like mine. However, when I tried it, the data was in a format that I couldn't really work with further:
library(data.table)
library(stringi)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 3)
> lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
> invalid_delimiter_rows <- which(stri_count_regex(lines, "\\t") != 3)
Warning message:
In stri_count_regex(lines, "\\t") :
argument is not an atomic vector; coercing
Preferably I shouldn't have to convert this data after importing; however, when I tried changing it to a character vector or list, it was still in a bad format (the concatenation is treated as part of the string, not as a function call). What is the most time-efficient way to approach this issue?
> class(lines)
[1] "data.table" "data.frame"
> as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\")"
Let's replicate the process up to the fread() import:
# your example string
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"
# import
library(data.table)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 5)
lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
4: My\tname\tis\tEcho
When you try
as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
it converts the whole data.table to character, so each column becomes a single concatenated string. See below:
as.character(data.table(lines$V1, lines$V1))
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
[2] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
What you want is to extract just lines$V1, which is already a character vector.
lines$V1
[1] "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta" "My\tname\tis\tEcho"

Format number in R with both comma thousands separator and specified decimals

I'd like to format numbers with both thousands separator and specifying the number of decimals. I know how to do these separately, but not together.
For example, I use format per this for the decimals:
FormatDecimal <- function(x, k) {
  return(format(round(as.numeric(x), k), nsmall = k))
}
FormatDecimal(1000.64, 1) # 1000.6
And for thousands separator, formatC:
formatC(1000.64, big.mark=",") # 1,001
These don't play nicely together though:
formatC(FormatDecimal(1000.64, 1), big.mark=",")
# 1000.6, since no longer numeric
formatC(round(as.numeric(1000.64), 1), nsmall=1, big.mark=",")
# Error: unused argument (nsmall=1)
How can I get 1,000.6?
Edit: This differs from the linked question, which asks about formatting 3.14 as 3,14 (mine was flagged as a possible dup).
Use format, not formatC:
format(round(as.numeric(1000.64), 1), nsmall = 1, big.mark = ",") # 1,000.6
Or formatC does work if you set format = "f":
formatC(1000.64, format = "f", big.mark = ",", digits = 1) # 1,000.6
(Sorry if I'm missing something.)
The scales library has a label_comma function:
scales::label_comma(accuracy = .1)(1000.64)
[1] "1,000.6"
It has additional arguments if you want to use something other than a comma in the thousands place, or another character instead of the decimal point, etc. (see below).
Note: the output of label_comma(...) is a function, which makes it easy to use in ggplot2 arguments, hence the extra pair of parentheses above. This is also helpful if you're using the same format repeatedly:
my_comma <- scales::label_comma(accuracy = .1, big.mark = ".", decimal.mark = ",")
my_comma(1000.64)
[1] "1.000,6"
my_comma(c(1000.64, 1234.56))
[1] "1.000,6" "1.234,6"
formattable provides comma:
library(formattable)
comma(1000.64, digits = 1) # 1,000.6
comma provides an elementary interface to formatC.

Avoid spaces in column names being replaced with periods (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem I am faced with: I read the data in with read.csv(filename, header=TRUE), and the spaces in the variable names become ".". For example, a variable named Full Code becomes Full.Code in the generated data frame. After the processing, I use write.xlsx(filename) to export the results, but the variable names have been changed. How do I address this problem?
Besides, in the output .xlsx file, the first column becomes row indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names = FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (backticks in some cases) or refer to the columns by position rather than by name while editing.
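A quick illustration of the difference, assuming a hypothetical data.csv with a Full Code column:
df <- read.csv("data.csv", header = TRUE, check.names = FALSE)
names(df)          # "Full Code" is kept as-is
df$`Full Code`     # names with spaces need backticks...
df[["Full Code"]]  # ...or double-bracket indexing by name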
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
                        pattern = "\\.",
                        replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
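Putting the two export fixes together (a sketch; write.xlsx with a row.names argument is the xlsx package's spelling, whereas openxlsx's write.xlsx calls it rowNames):
library(xlsx)
names(yourdata) <- gsub("\\.", " ", names(yourdata))    # dots back to spaces
write.xlsx(yourdata, "output.xlsx", row.names = FALSE)  # suppress the index column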
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
  # FIXME: Repetitive.
  # Convert any number of consecutive dots to a single space.
  names(ds) <- gsub(x = names(ds),
                    pattern = "(\\.)+",
                    replacement = " ")
  # Drop the trailing spaces.
  names(ds) <- gsub(x = names(ds),
                    pattern = "( )+$",
                    replacement = "")
  ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way to replace the "." (or any other punctuation) in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"
