Format columns of a data frame in R

I want to format certain columns of a data frame, changing the decimal mark and the number of decimal places:
a <- c(3.412324463,3.2452364)
b <- c(2.2342,4.234234)
c <- data.frame(A=a, B=b)
I can do it column by column, but I would rather apply it to several columns at once. I also cannot find how to set the number of decimals: "digits=2" gives me only two digits in total, counting the decimal part.
c$A <- format(c$A, decimal.mark = ",",digits = 2)

It is better not to use function names (such as c) to name objects. To apply the format to all the columns:
c[] <- lapply(c, format, decimal.mark = ",", digits = 2)
Or with formatC
c[] <- lapply(c, formatC, decimal.mark =",", format = "f", digits = 2)
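The difference between the two calls above comes down to what digits means. A minimal sketch, assuming the same numbers as in the question (the names x, sig, fixed and fixed_comma are only for illustration): in format(), digits is the number of significant digits, while in formatC() with format = "f" it is the number of digits after the decimal mark.

```r
x <- c(3.412324463, 3.2452364)

# format(): digits counts significant digits, so only one decimal survives
sig <- format(x, digits = 2)

# formatC() with format = "f": digits counts decimal places
fixed <- formatC(x, format = "f", digits = 2)

# Same, with a comma as the decimal mark
fixed_comma <- formatC(x, format = "f", digits = 2, decimal.mark = ",")

print(sig)          # "3.4" "3.2"
print(fixed)        # "3.41" "3.25"
print(fixed_comma)  # "3,41" "3,25"
```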
If we need to apply it to selected columns only, e.g. columns 1 to 3 and 7:10
j1 <- c(1:3, 7:10)
c[j1] <- lapply(c[j1], formatC, decimal.mark =",", format = "f", digits = 2)
Or another option with sprintf
c[] <- lapply(c, function(x) sub(".", ",", sprintf("%0.2f", x), fixed = TRUE))
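If the data frame also holds non-numeric columns, one option is to pick the numeric columns programmatically rather than by position. This is only a sketch: the data frame df, its label column and the helper vector num_cols are made up for illustration and are not part of the question.

```r
# Hypothetical data frame with one non-numeric column
df <- data.frame(A = c(3.412324463, 3.2452364),
                 B = c(2.2342, 4.234234),
                 label = c("x", "y"),
                 stringsAsFactors = FALSE)

# Names of the numeric columns only
num_cols <- names(df)[sapply(df, is.numeric)]

# Format those columns with two decimals and a comma as decimal mark;
# the result is character, so do this as the last step before printing/export
df[num_cols] <- lapply(df[num_cols], formatC,
                       format = "f", digits = 2, decimal.mark = ",")
print(df)
```

Note that the formatted columns are character vectors, so any arithmetic has to happen before this step.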

There is also a package called fmtr designed specifically for this situation. It lets you control the formatting for each column independently. There are two main functions fapply() and fdata().
fapply() applies a format to a vector or single column. Like this:
a <- c(3.412324463,3.2452364)
b <- c(2.2342,4.234234)
c1 <- data.frame(A=a, B=b)
c2 <- c1
c2$A <- fapply(c2$A, "%.3f")
c2$B <- fapply(c2$B, "%.4f")
c2
# A B
# 1 3.412 2.2342
# 2 3.245 4.2342
fdata() applies formats to an entire data frame. All you have to do is assign the format to the format attribute of the column, and then call the fdata() function on the data frame. fdata() will apply all the formats assigned, and leave any unformatted columns alone:
c3 <- c1
# Assign formats
attr(c3$A, "format") <- "%.1f"
attr(c3$B, "format") <- "%.2f"
# Apply formats
fdata(c3)
# A B
# 1 3.4 2.23
# 2 3.2 4.23
There is also a formats() function that allows you to more easily assign the format attributes to different columns. You just create a named list, where each name corresponds to the column you want to assign formats to. Then you can call the fdata() function like above:
c4 <- c1
# Assign formats
formats(c4) <- list(A = "%.2f",
B = "%.1f")
# Apply formats
fdata(c4)
# A B
# 1 3.41 2.2
# 2 3.25 4.2

Related

Is there an R function that reads text files with \n as a (column) delimiter?

The Problem
I'm trying to come up with a neat/fast way to read files delimited by newline (\n) characters into more than one column.
Essentially in a given input file, multiple rows in the input file should become a single row in the output, however most file reading functions sensibly interpret the newline character as signifying a new row, and so they end up as a data frame with a single column. Here's an example:
The input files look like this:
Header Info
2021-01-01
text
...
#
2021-01-02
text
...
#
...
Where the ... represents potentially multiple rows in the input file, and the # signifies what should really be the end of a row in the output data frame. So upon reading this file, it should become a data frame like this (ignoring the header):
X1          X2    ...  Xn
2021-01-01  text  ...  ...
2021-01-02  text  ...  ...
...         ...   ...  ...
My attempt
I've tried base, data.table, readr and vroom, and they all have one of two outputs, either a data frame with a single column, or a vector. I want to avoid a for loop, and so my current solution is using base::readLines(), to read it as a character vector, then manually adding some "proper" column separators (e.g. ;), and then joining and splitting again.
# Save the example data to use as input
writeLines(c("Header Info", "2021-01-01", "text", "#", "2021-01-02", "text", "#"), "input.txt")
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
# Clean up the example input
unlink("input.txt")
My code above works and gives the desired result, but surely there's a better way??
Edit: This is internal in a function, so part (perhaps the larger part) of the intention of any simplification is to improve the speed.
Thanks in advance!
1) Read in the data, locate the # signs giving logical variable at and then create a grouping variable g which has distinct values for each desired line. Finally use tapply with paste to rework it into lines that can be read using read.table and read it. (If there are commas in the data then use some other separating character.)
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","),
sep = ",")
giving this data frame:
V1 V2
1 2021-01-01 text
2 2021-01-02 text
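To see what the grouping variable does, here is a small sketch of the intermediate values in (1), using the example lines typed in directly instead of read from input.txt (the name rows is just for illustration):

```r
# Example lines, header already dropped
L <- c("2021-01-01", "text", "#", "2021-01-02", "text", "#")

at <- grepl("#", L)  # TRUE where a record ends
g <- cumsum(at)      # record id for each line: 0 0 1 1 1 2

# Paste the non-separator lines of each record into one comma-separated string
rows <- tapply(L[!at], g[!at], paste, collapse = ",")
print(rows)

# read.table() can then parse these strings as ordinary delimited rows
DF <- read.table(text = rows, sep = ",")
print(DF)
```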
2) Another approach is to rework the data into dcf form by removing the # sign and prefacing other lines with their column name and a colon. Then use read.dcf. cnames is a character vector of column names that you want to use.
cnames <- c("Date", "Text")
L <- readLines("input.txt")[-1]
LL <- sub("#", "", paste0(c(paste0(cnames, ": "), ""), L))
DF <- as.data.frame(read.dcf(textConnection(LL)))
DF[] <- lapply(DF, type.convert, as.is = TRUE)
DF
giving this data frame:
Date Text
1 2021-01-01 text
2 2021-01-02 text
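For reference, this is roughly what the reworked dcf text in (2) looks like before read.dcf sees it. The sketch uses the example lines typed in directly; the column names are recycled across the records and each # becomes the blank line that separates dcf records.

```r
cnames <- c("Date", "Text")
L <- c("2021-01-01", "text", "#", "2021-01-02", "text", "#")

# Prefix vector "Date: ", "Text: ", "" is recycled along L; the "#" lines
# get an empty prefix and are then blanked out by sub()
LL <- sub("#", "", paste0(c(paste0(cnames, ": "), ""), L))
print(LL)

con <- textConnection(LL)
DF <- as.data.frame(read.dcf(con), stringsAsFactors = FALSE)
close(con)
print(DF)
```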
3) This approach simply reshapes the data into a matrix and then converts it to a data frame. Note that (1) converts numeric columns to numeric whereas this one just leaves them as character.
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
## V1 V2
## 1 2021-01-01 text
## 2 2021-01-02 text
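One caveat with (3): the reshape only works if every record has the same number of lines. A hedged sketch of a guard you could put in front of it (the names sep_pos and regular are made up here):

```r
L <- c("2021-01-01", "text", "#", "2021-01-02", "text", "#")

k <- grep("#", L)[1]       # lines per record, including the "#" line
sep_pos <- which(L == "#") # positions of all separators

# Records are regular if the separators sit exactly at k, 2k, 3k, ...
regular <- length(L) %% k == 0 &&
  length(sep_pos) == length(L) / k &&
  all(sep_pos == seq_along(sep_pos) * k)

if (regular) {
  DF <- as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
} else {
  stop("records have differing numbers of lines; use approach (1) instead")
}
print(DF)
```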
Benchmark
The question did not mention speed as a consideration but in a comment it was later mentioned. Based on the data in the benchmark below (1) runs over twice as fast as the code in the question and (3) runs nearly 25x faster.
library(microbenchmark)
writeLines(c("Header Info",
rep(c("2021-01-01", "text", "#", "2021-01-02", "text", "#"), 10000)),
"input.txt")
microbenchmark(times = 10,
ques = {
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
},
ans1 = {
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
},
ans3 = {
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## ques 1146.62 1179.65 1188.74 1194.78 1200.11 1219.01 10 c
## ans1 518.95 522.75 548.33 532.59 561.55 647.14 10 b
## ans3 50.47 51.19 51.68 51.69 52.25 52.52 10 a
You can get round some of the string manipulation with something along the lines of:
input <- readLines("input.txt")[-1] #Read in and remove header
ncol <- which(input=="#")[1]-1 #Number of columns of data
data.frame(matrix(input[input != "#"], ncol = ncol, byrow=TRUE)) #Convert to dataframe
# X1 X2
#1 2021-01-01 text
#2 2021-01-02 text
At this point, you might consider going the full mile and use a proper grammar to parse it. I don't know how big or complex the situation really is, but using pegr it might look something like this:
input <-
"Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#
"
library(pegr)
library(magrittr) # provides %>%
peg <- new.parser(commonRules,action=TRUE) +
c("HEADER <- 'Header Info' EOL" , "{}" ) + # Rule to match literal 'Header Info' and a \n, then discard
c("TYPE <- 'text' EOL" , "{-}" ) + # Rule to match literal 'text', paste the match and store it as $TYPE
c("DATE <- (!EOL .)* EOL" , "{-}" ) + # Rule to match any character leading up to a new line. Could improve to look for a date format
c("EOS <- '#' EOL" , "{}" ) + # Rule to match end of section, then discard
c("BODY <- (!EOS .)*" , "{-}" ) + # Rule to match body of text, including newlines
c("SECTION <- DATE TYPE BODY EOS" ) + # Combining rules to match each section
c("DOCUMENT <- HEADER SECTION*" ) # Combining more rules to match the entire document
res <- peg[["DOCUMENT"]](input)
final <- matrix( value(res), ncol=3, byrow=TRUE ) %>%
as.data.frame %>%
setNames( names(value(res))[1:3])
final
Produces:
DATE TYPE BODY
1 2021-01-01 text multiple lines\nof\ntext\n
2 2021-01-02 text more\nlines of text\n
It might feel clunky if you don't know the syntax, but once you do, it's a fire-and-forget solution. It'll run according to spec until the spec doesn't hold. You don't have to worry about fragile pretreatment, and it is easy to adapt to changing formats in the future.
There's also the tidyverse way:
library(tidyr)
library(readr)
library(stringr)
max_columns <- 5
d <- {
readr::read_file("file.txt") %>%
stringr::str_remove("^Header Info\n") %>%
tibble::enframe(name = NULL) %>%
separate_rows(value, sep = "#\n") %>%
separate("value", into = paste0("X", 1:max_columns) , sep = "\n")
}
Using your example input in a file called file.txt, the d looks like:
# A tibble: 3 x 5
X1 X2 X3 X4 X5
<chr> <chr> <chr> <chr> <chr>
1 2021-01-01 text ... "" NA
2 2021-01-02 text ... "" NA
3 ... NA NA NA NA
Warning message:
Expected 5 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
Note that the warning is simply there to make sure you know you're getting NAs. This is inevitable if the number of rows between # markers varies.
I am using data similar to that provided by Sirius for demonstration. You can also do something like this to get a variable number of columns in the resulting data frame:
example <- "Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#"
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate_rows(dummy, sep = '\\n') %>%
filter(row_number() !=1) %>%
group_by(rowid = rev(cumsum(rev(dummy == '#')))) %>%
filter(dummy != '#') %>%
mutate(name = paste0('X', row_number())) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 2 x 6
#> # Groups: rowid [2]
#> rowid X1 X2 X3 X4 X5
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 2 2021-01-01 text multiple lines of text
#> 2 1 2021-01-02 text more lines of text <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
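For comparison, the same idea (records of varying length, short ones padded with NA) can be sketched in base R. This is not taken from any of the answers above; the names recs, n, padded and out are made up for the sketch.

```r
# Example lines, header already dropped
L <- c("2021-01-01", "text", "multiple lines", "of", "text", "#",
       "2021-01-02", "text", "more", "lines of text", "#")

at <- L == "#"                          # record separators
recs <- split(L[!at], cumsum(at)[!at])  # one character vector per record

n <- max(lengths(recs))                 # widest record decides the column count
padded <- lapply(recs, `length<-`, n)   # pad shorter records with NA

out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- paste0("X", seq_len(n))
print(out)
```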

Converting dates in column of a data frame R

I am having problems converting a column of imported dates in a data frame, represented as characters in a different date format, into date objects in that same data frame. Here is a toy example:
xx <- data.frame(A = c(10, 15, 20), B = c("10/15/2010", "9/8/2015", "8/5/2013"))
If I print xx,
A B
1 10 10/15/2010
2 15 9/8/2015
3 20 8/5/2013
I apply:
xx[, "B"] <- sapply(xx[, "B"], function(x) {as.Date(x,
format = "%m/%d/%Y", origin = "1970-01-01")})
and I get:
A B
1 10 14897
2 15 16686
3 20 15922
If I look at the mode of column B, it is numeric, not date. No matter what I try I cannot seem to get a result that converts column B to a date type. I can always add:
xx[, "B"] <- as.Date(xx[, "B"])
but there must be a way to do this in one statement.
If you have only one column to convert, you can do
xx$B <- as.Date(xx$B, "%m/%d/%Y")
If you have multiple columns use lapply instead of sapply
cols <- 2
xx[cols] <- lapply(xx[cols], as.Date, "%m/%d/%Y")
Or using lubridate where you don't need to specify the format argument.
xx$B <- lubridate::mdy(xx$B)
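As to why the sapply version in the question ends up numeric: sapply simplifies its result into a plain vector, and that step drops the Date class, leaving the underlying day counts. A small sketch with the dates from the question (the variable names are only illustrative):

```r
B <- c("10/15/2010", "9/8/2015", "8/5/2013")

# sapply() simplifies to a bare numeric vector: the Date class is lost
via_sapply <- sapply(B, as.Date, format = "%m/%d/%Y")
print(class(via_sapply))   # "numeric"

# lapply() keeps each column as returned by as.Date()
via_lapply <- lapply(list(B = B), as.Date, format = "%m/%d/%Y")
print(class(via_lapply$B)) # "Date"

# as.Date() is vectorised, so the direct call is enough for one column
direct <- as.Date(B, format = "%m/%d/%Y")
print(direct)
```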

Create combinations of measurements concatenated using underscore

I have a dataframe df1
ID <- c("A","B","C")
Measurement <- c("Length","Height","Breadth")
df1 <- data.frame(ID,Measurement)
I am trying to create combinations of measurements with an underscore between them and put it under the ID column "ALL"
Here is my desired output
ID Measurement
A Length
B Height
C Breadth
ALL Length_Height_Breadth
ALL Length_Breadth_Height
ALL Breadth_Height_Length
ALL Breadth_Length_Height
ALL Height_Length_Breadth
ALL Height_Breadth_Length
Also when there are similar measurements in the "measurement" column, I want to eliminate the underscore.
For example:
ID <- c("A","B")
Measurement <- c("Length","Length")
df2 <- data.frame(ID,Measurement)
Then I would want the desired output to be
ID Measurement
A Length
B Length
ALL Length
I am trying to do something like this which is totally wrong
df1$ID <- paste(df1$Measurement, df1$Measurement, sep="_")
Can someone point me in the right direction to achieving the above outputs?
I would like to see how it is done programmatically instead of using the actual measurement names. I am intending to apply the logic to a larger dataset that has several measurement names and so a general solution would be much appreciated.
We could use the permn function from the combinat package:
library(combinat)
sol_1 <- sapply(permn(unique(df1$Measurement)),
FUN = function(x) paste(x, collapse = '_'))
rbind.data.frame(df1, data.frame('ID' = 'All', 'Measurement' = sol_1))
# ID Measurement
# 1 A Length
# 2 B Height
# 3 C Breadth
# 4 All Length_Height_Breadth
# 5 All Length_Breadth_Height
# 6 All Breadth_Length_Height
# 7 All Breadth_Height_Length
# 8 All Height_Breadth_Length
# 9 All Height_Length_Breadth
sol_2 <- sapply(permn(unique(df2$Measurement)),
FUN = function(x) paste(x, collapse = '_'))
rbind.data.frame(df2, data.frame('ID' = 'All', 'Measurement' = sol_2))
# ID Measurement
# 1 A Length
# 2 B Length
# 3 All Length
Giving credit where credit is due: Generating all distinct permutations of a list.
We could also use permutations from the gtools package (HT @joel.wilson):
library(gtools)
unique_meas <- as.character(unique(df1$Measurement))
apply(permutations(length(unique_meas), length(unique_meas), unique_meas),
1, FUN = function(x) paste(x, collapse = '_'))
# "Breadth_Height_Length" "Breadth_Length_Height"
# "Height_Breadth_Length" "Height_Length_Breadth"
# "Length_Breadth_Height" "Length_Height_Breadth"
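If you would rather avoid an extra package, the permutations can also be generated with a small recursive helper in base R. This is only a sketch; perms is a made-up helper, not a function from combinat or gtools, and the number of permutations grows as n!, so it is only practical for a handful of distinct measurements.

```r
# Hypothetical helper: all orderings of the elements of v, as a list of vectors
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # Put v[i] first, then append every ordering of the remaining elements
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

df1 <- data.frame(ID = c("A", "B", "C"),
                  Measurement = c("Length", "Height", "Breadth"),
                  stringsAsFactors = FALSE)

meas <- unique(df1$Measurement)
labels <- vapply(perms(meas), paste, character(1), collapse = "_")
result <- rbind(df1, data.frame(ID = "ALL", Measurement = labels,
                                stringsAsFactors = FALSE))
print(result)

# With duplicated measurements only one label remains, without an underscore
single <- vapply(perms(unique(c("Length", "Length"))), paste,
                 character(1), collapse = "_")
print(single)
```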

Extra column using tidyr's `unite_` vs `unite`

In the following example, why is there an additional column in the unite_() output vs the unite() output?
library(tidyr)
x1 <- data.frame(Sample=c("A", "B"), "1"=c("-", "y"),
"2"=c("-", "z"), "3"=c("x", "a"), check.names=F)
# Sample 1 2 3
# 1 A - - x
# 2 B y z a
Here we see the desired output:
unite(x1, mix, 2:ncol(x1), sep=",")
# Sample mix
# 1 A -,-,x
# 2 B y,z,a
Why is there an additional column here (the 1 column)? The default is to remove the columns used by unite_().
unite_(x1, "mix", 2:ncol(x1), sep=",")
# Sample 1 mix
# 1 A - -,-,x
# 2 B y y,z,a
Note: tidyr version 0.5.1
The syntax is slightly different between the two functions:
#unite(data, col, ..., sep = "_", remove = TRUE)
#unite_(data, col, from, sep = "_", remove = TRUE)
From the unite_ help page, the from option is defined as: "Names of existing columns as character vector."
Using column names instead of column numbers gives the desired result:
unite_(x1, "mix", names(x1[,2:ncol(x1)]), sep=",")
# Sample mix
#1 A -,-,x
#2 B y,z,a
I tried with unite() but it did not work. However, it worked very well with the paste() function.
df$new_col <- paste(df$col1,df$col2,sep="-")
or if you have more columns to join,
df$new_col <- paste(df$col1,df$col2,df$col3,....,sep="-")
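When there are many columns to join, the paste approach does not need every column typed out. A minimal base R sketch using the x1 data from the question (the names cols and result are just for illustration): do.call passes the selected columns to paste as separate arguments.

```r
x1 <- data.frame(Sample = c("A", "B"), "1" = c("-", "y"),
                 "2" = c("-", "z"), "3" = c("x", "a"),
                 check.names = FALSE, stringsAsFactors = FALSE)

cols <- names(x1)[-1]  # every column except Sample

# Equivalent to paste(x1$`1`, x1$`2`, x1$`3`, sep = ",")
x1$mix <- do.call(paste, c(x1[cols], sep = ","))

# Drop the source columns, as unite() does by default
result <- x1[c("Sample", "mix")]
print(result)
```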

Unstacking a stacked dataframe unstacks columns in a different order

Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?
The last few lines of the stack (see utils:::stack.data.frame) function create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
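The same effect can be shown with column-style names, along with the way around it. A minimal sketch (the vector nms is made up here): by default the levels are sorted as strings, while passing levels = unique(nms) keeps the order of first appearance.

```r
nms <- c("V1", "V2", "V10")

# Default: levels are the sorted unique values, so "V10" sorts before "V2"
f_default <- factor(nms)
print(levels(f_default))

# Explicit levels: keep the original column order
f_ordered <- factor(nms, levels = unique(nms))
print(levels(f_ordered))
```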
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
if (!missing(select)) {
nl <- as.list(1L:ncol(x))
names(nl) <- names(x)
vars <- eval(substitute(select), nl, parent.frame())
x <- x[, vars, drop = FALSE]
}
keep <- unlist(lapply(x, is.vector))
if (!sum(keep))
stop("no vector columns were selected")
if (!all(keep))
warning("non-vector columns will be ignored")
x <- x[, keep, drop = FALSE]
data.frame(values = unlist(unname(x)),
# REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
# AND ADD THIS:
ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE
You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-) create columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE
