R Applying self made formatting function over data frame R - r

I am using R and I need to format the number within a dataframe, notably by imposing the number of digits before the decimal separator as well as after. E.g. 3.56 must become "0003,56000".
So I built my own function:
format <- function(x, nbr_before_comma, nbr_after_comma){
x= round(x, nbr_after_comma)
x = toString(x)
l = strsplit(x, "[.]")[[1]]
#print(l)
#print(nchar(l[2]))
before_comma = paste0(strrep("0",nbr_before_comma - nchar(l[1])),l[1])
after_comma = ifelse(length(l) > 1,
paste0(l[2],strrep("0",nbr_after_comma - nchar(l[2]))),
strrep("0", nbre_after_comma))
res = paste0(before_comma, ",", after_comma)
return(res)
}
Trying this on a single number will work. Now I am trying to apply this to a dataframe. Let's take the toy example:
df <- data.frame("a" = c(2.5,3.56,4.5))
I define moreprecisely what I want:
format44 <- function(x){
return(format(x,4,4))
}
I have tried several possibilities:
df[] <- lapply(df, format44)
with dplyr:
df <- df %>%
mutate(a = format44(a))
and finally:
df["a"] <- lapply(df["a"],format44)
None will work. actually, I get the same output everytime:
a
1 0002,5, 3
2 0002,5, 3
3 0002,5, 3
Any idea what the problem is ?

Use sprintf and then translate the decimal points to comma:
before <- after <- 4
fmt <- sprintf("%%0%d.%df", before + after + 1, after)
transform(df, a = chartr(".", ",", sprintf(fmt, a)))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000
or writing this with dplyr:
library(dplyr)
before <- after <- 4
df %>%
mutate(a = "%%0%d.%df" %>%
sprintf(before + after + 1, after) %>%
sprintf(a) %>%
chartr(".", ",", .))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000

In this case, mapply suits better you:
df$b <- mapply(format44, df$a)
You do not even need the format44 wrapper. You can use:
df$c <- mapply(format, df$a, 4,4)

Related

Paste leading zero in columns A and B if column A meets condition

Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A)
data$B[i] <- paste0(0, data$B)
}
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"))
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A[i])
data$B[i] <- paste0(0, data$B[i])
}
}
The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i] you would try to add the whole column.
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
data
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you were interested in using dplyr here's another solution using transmute.
df %>%
# Need to transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Just change the order of the columns to the original one
select(A,B)
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = 0), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length to detect length of string
You can use str_pad to add leading zeros. More information about str_pad() here
We can use grepl to detect strings with leading zeros in column A and add leading zeros to column B.
You may use the ifelse vectorized function here:
data$A <- ifelse(nchar(data$A) == 10, paste0("0", data$A), data$A)
data$B <- ifelse(nchar(data$B) == 10, paste0("0", data$B), data$B)
data
A B
1 02058600192 2058644
2 04087600101 4087601
3 30138182591 30138011

How do I get the column number from a dataframe which contains specific strings?

I have a data frame df with 7 columns and I have a list z containing multiple strings.
I want a dataframe containing only the columns in df which contain the sting from z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column number of the z strings in df? Or how do I get a dataframe with only the columns which contains the z strings.
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]
You can use:
df[,match(z, substring(colnames(df), 1, 3))]
With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means
You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.
You can use select and matches to subest the columns based on z
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means

Flipping two sides of string

I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiair with R colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.
Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach that might be more efficient using chartr, dirname and paste. But we need to get the indices of the columns that contain "NORM" first
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select variables which starts with r5 - r12 and create a dataframe in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given my experience with regular expressions span a day, I was wondering if anyone could help me write a better and compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can make a function with a generalization of the above code. This function has three arguments, the first is the vector of variables names, the second and the third are the limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6

R: Apply a function over only character columns without type coercion

I have a data frame with many columns. The columns differ in their types: some are numeric, some are character, etc. Here's a small example where we just have 3 variables with 2 types:
# Generate data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
I want to replace the value 3 with a space, but only for character columns. Here's my current code:
out <- as.data.frame(lapply(dat, function(x) {
ifelse(is.character(x),
gsub("3", " ", x),
x)}),
stringsAsFactors = FALSE)
The problem is that the ifelse() function ignores that y and z are numeric and that it also seems to coerce the numeric variables to character anyway.
And idea has been to pull out the character columns, gsub() them, then bind them back to the original data frame. This, however, changes the ordering of the columns. Key to any solution is that I do not need to specify variables by name but only by type.
One can also do this trivially using dplyr:
# Load package
library(dplyr)
# Create data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
# Replace 3's with spaces for character columns
dat <- dat %>% mutate_if(is.character, function(x) gsub(pattern = "3", " ", x))
I tried your code and for me it seems like ifelse did not work but separating if ad else does. Below is the code which works:
# Generate data
dat <- data.frame(x = c("1","2","3"),
y = c(1.0,2.5,3.3),
z = c(1,2,3),
stringsAsFactors = FALSE)
> lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x })
$x
[1] "1" "2" " "
$y
[1] 1.0 2.5 3.3
$z
[1] 1 2 3
> as.data.frame(lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x }))
x y z
1 1 1.0 1
2 2 2.5 2
3 3.3 3
It comes down to this line in ?ifelse
ifelse returns a value with the same shape as test ...
is.character is length one so the returned value is length 1. You can use if(...) yes else no as you have suggested instead as #Heikki have suggested.
Similar to #user3614648 solution:
library(dplyr)
dat %>%
mutate_if(is.character, funs(ifelse(. == "3", " ", .)))
x y z
1 1 1.0 1
2 2 2.5 2
3 3.3 3

Resources