Calculate variable stored as character in R - r

R: How to calculate a variable, which is stored as character?
I want to get a solution as a vector of numeric values. However, when reading my df from a csv, all the elements of the df, which contain a mix of characters and numbers (those characters are to be substituted with certain values when needed) are converted to characters.
Any idea how to avoid/solve that?
This code below just simulates my problem:
#create two vectors and bind them into a df
c1 <- c("v-3", "v")
c2 <- c("1-v",0)
df <- data.frame(c1,c2)
df
c1 c2
1 v-3 1-v
2 v 0
#I would like to substitute "v" with a number
v <- 2
df
c1 c2
1 v-3 1-v
2 v 0
Now, how can I convert the class of the elements of the df, so that the "v" can be substituted and the values calculated?
Or maybe I can read csv in such a way that my mix of characters and numbers would be stored in a more friendly way?
Thanks in advance.
Greg

You can use str_replace and then map eval/parse to evaluate the expression.
library(dplyr)
library(stringr) # provides str_replace
library(purrr) # provides map_dbl
df %>%
mutate(
across(everything(), str_replace, "v", "2"),
across(everything(), ~map_dbl(., function(to_eval) eval(parse(text=to_eval))))
)
c1 c2
1 -1 -1
2 2 0

This might be a more efficient way to do what you're after:
Write a small function that:
Uses gsub to replace the letter with a value.
Writes the result to a tempfile
Parses the tempfile
Evaluates the values and inserts them back into the structure of your original data.frame.
Here's the function:
fun <- function(df, patt, repl, fixed = TRUE) {
fil <- tempfile()
writeLines(gsub(patt, repl, as.matrix(df), fixed = fixed), con = fil)
df[] <- sapply(parse(fil), eval)
df
}
Here's how you'd use the function:
fun(df, "v", 2)
## c1 c2
## 1 -1 -1
## 2 2 0
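A variation on the same idea, sketched here under stated assumptions: instead of substituting the letter into the text, each cell can be parsed and evaluated with the variable bound in a list passed to eval. The helper name eval_cells and its values argument are hypothetical, not part of either answer. Note that eval(parse()) runs whatever code the cells contain, so this is only reasonable for trusted input.

```r
# Example data from the question; the 0 is stored as the string "0"
df <- data.frame(c1 = c("v-3", "v"), c2 = c("1-v", "0"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: evaluate every cell as an R expression,
# looking up variables such as v in the list `values`
eval_cells <- function(df, values) {
  df[] <- lapply(df, function(col) {
    vapply(as.character(col),
           function(expr) eval(parse(text = expr), values),
           numeric(1), USE.NAMES = FALSE)
  })
  df
}

res <- eval_cells(df, list(v = 2))
res
```

Because the value is looked up by name rather than pasted into the string, the same call also works when several different letters appear in the cells, as long as each has an entry in the list.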
Here's a timing comparison with the other answer, with a larger dataset.
options <- c("v-3", "v", "v*2", "1-v", "v/5", 0, "v+2")
nrow <- 10000
ncol <- 20
set.seed(1)
df <- data.frame(matrix(sample(options, nrow*ncol, TRUE),
nrow = nrow, ncol = ncol))
fun2 <- function(df, patt, repl) {
# df = input data.frame
# patt = pattern to search for
# repl = replacement value (as character)
df %>%
mutate(
across(everything(), str_replace, patt, repl),
across(everything(), ~map_dbl(., function(to_eval) eval(parse(text=to_eval))))
)
}
library(microbenchmark)
microbenchmark(fun(df, "v", 2), fun2(df, "v", "2"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# fun(df, "v", 2) 831.731 924.9648 1159.544 1012.590 1366.072 1882.586 10 a
# fun2(df, "v", "2") 4471.800 4721.3587 4847.252 4853.269 4959.595 5157.823 10 b

Related

Formatting numbers with functions

I am trying to format the numbers in a data frame using a function. Many columns need to be formatted, so I am trying to avoid repetition by writing a function.
##this is a dataframe
asdf <- data.frame(a = c(2133242,2421), b = c(4323,543532654))
##Function to format numbers
## df is a dataframe and column_name is column names
format_numbers <- function (df, column_name){
df["column_name"] <- ifelse(nchar(df["column_name"]) <= 5, paste(format(round(df["column_name"] / 1e3, 1), trim = TRUE), "K"),
paste(format(round(df["column_name"] / 1e6, 1), trim = TRUE), "M"))
return(df)
}
Expected output:
asdf$a <- format_numbers(asdf, a)
asdf
a b
1 2.1M 4323
2 2.4K 543532654
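One possible way to get the expected output, as a hedged sketch rather than a definitive fix: the column name is passed as a string and used with [[ ]] (the original indexes the literal name "column_name"), and sprintf is swapped in for paste/format/round. The cutoff of 1e5 is an assumption that mirrors the original test of at most five digits.

```r
asdf <- data.frame(a = c(2133242, 2421), b = c(4323, 543532654))

# df is a data frame, column_name is the column's name as a string
format_numbers <- function(df, column_name) {
  x <- df[[column_name]]
  df[[column_name]] <- ifelse(
    x < 1e5,                      # five digits or fewer: thousands
    sprintf("%.1fK", x / 1e3),
    sprintf("%.1fM", x / 1e6)     # otherwise: millions
  )
  df
}

# The function returns the whole data frame, so assign it back to asdf
asdf <- format_numbers(asdf, "a")
asdf
```

To format several columns, the call could be repeated in a loop over a vector of column names.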

Building DataFrame Column Names From List Generates Invalid Names

I'm trying to create a dataframe with column names from a list. However, it is treating my column names as invalid, and so is adding X. in front of them. The names should be valid. What is going on?
Example:
col_names = list("A", "A95", "p", "NN")
df = data.frame(col_names)
Output:
X.A. X.A95. X.p. X.NN.
1 A A95 p NN
Also, how do I get the contents of these columns to be empty or NA (preferably without writing models = list("A"=NA, "A95"=NA, "p"=NA, etc... for each variable)?
Edit: Found a 2 liner workaround.
df = data.frame(col_names)
setNames(df, col_names)
The behaviour is not odd, it is as documented. You are trying to convert a list to a data frame, saying nothing about the names of the variables. The variable names are inferred from the contents, and an "X" is added to each name, as in:
> data.frame(as.list(1:5))
# X1L X2L X3L X4L X5L
# 1 1 2 3 4 5
or:
> data.frame(as.list(as.numeric(1:5)))
# X1 X2 X3 X4 X5
# 1 1 2 3 4 5
Fortunately, there is setNames, so you can do:
cn <- list("A", "A95", "p", "NN")
df <- as.data.frame(setNames(cn,cn))
df
# A A95 p NN
# 1 A A95 p NN
Ok, to have the contents appear as NA you can do either...
df[] <- NA
or, from the start:
df <- as.data.frame(setNames(as.list(rep(NA,4)), cn))
What about using dplyr to do this:
library(dplyr)
col_names = list("A", "A95", "p", "NN")
df <- data.frame(matrix(NA, nrow = 0, ncol = 4)) %>% setNames(nm = c(col_names))
Output:
> df
[1] A A95 p NN
<0 rows> (or 0-length row.names)

Take mean of digits that are run together in one column

My data is in this format:
country gdp digits
US 100 2657
Aus 50 123
NZ 40 11
and I'd like to take the mean, for each country of the individual digits that are all stored in the digits column.
So this is what I'm after:
country gdp digits mean_digits
US 100 2657 5
Aus 50 123 2
NZ 40 11 1
I imagine I should split the digits column into individual digits in separate columns and then take an arithmetic mean, but I was just a little unsure, because different rows have different numbers of digits in the digits field.
Code for the reproducible data below:
df <- data.frame(stringsAsFactors=FALSE,
country = c("US", "AUS", "NZ"),
gdp = c(100, 50, 40),
digits = c(2657, 123, 11)
)
We need a function to split the number into digits and take the mean:
mean_digits = function(x) {
sapply(strsplit(as.character(x), split = "", fixed = TRUE),
function(x) mean(as.integer(x)))
}
df$mean_digits = mean_digits(df$digits)
df
# country gdp digits mean_digits
# 1 US 100 2657 5
# 2 AUS 50 123 2
# 3 NZ 40 11 1
as.character() converts the numeric input to character, strsplit splits the numbers into individual digits (resulting in a list), then with sapply, to each list element we convert to integer and take the mean.
We use fixed = TRUE for a little bit of efficiency, since we don't need any special regex to split every digit apart.
If you're using this function frequently, you may want to round the input or check that it is an integer: it will return NA if the input has decimals, because of the "." character.
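A minimal sketch of such a check, assuming non-negative whole numbers are the only valid input. The name mean_digits_checked is hypothetical. It also uses format(..., scientific = FALSE) instead of as.character(), because as.character(1e5) yields "1e+05", which would not split into digits.

```r
# Hypothetical checked variant of mean_digits()
mean_digits_checked <- function(x) {
  # Reject negatives, decimals and NA before splitting
  stopifnot(is.numeric(x), all(x >= 0), all(x == round(x)))
  # Fixed notation, so that 1e5 becomes "100000" rather than "1e+05"
  chars <- format(x, scientific = FALSE, trim = TRUE)
  vapply(strsplit(chars, "", fixed = TRUE),
         function(d) mean(as.integer(d)),
         numeric(1))
}

mean_digits_checked(c(2657, 123, 11))
```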
One tidyverse possibility could be:
df %>%
mutate(digits = str_split(digits, pattern = "")) %>%
unnest() %>%
group_by(country, gdp) %>%
summarise(digits = mean(as.numeric(digits)))
country gdp digits
<chr> <int> <dbl>
1 Aus 50 2
2 NZ 40 1
3 US 100 5
Or:
df %>%
mutate(digits = str_split(digits, pattern = "")) %>%
unnest() %>%
group_by(country, gdp) %>%
summarise_all(list(~ mean(as.numeric(.))))
1) strapply This one-liner uses strapply in gsubfn. It converts each digit to numeric and then takes the mean of each.
library(gsubfn)
transform(df, mean = sapply(strapply(digits, ".", as.numeric, simplify = TRUE), mean))
2) This is a little longer but still one statement and uses no packages. It inserts a space between digits, reads them using read.table and then applies rowMeans.
transform(df,
mean = rowMeans(read.table(text = gsub("\\b", " ", digits), fill = NA), na.rm = TRUE))
Another tidyverse one-liner without other dependencies:
df %>% mutate(mean_digits = map_dbl(strsplit(as.character(df$digits), ""),
~ mean(as.numeric(.x))))
# country gdp digits mean_digits
# 1 US 100 2657 5
# 2 AUS 50 123 2
# 3 NZ 40 11 1
Explanation
You use strsplit to split the digits into single digits. This gives you a list where each element contains the single digits.
Then you loop over this list and calculate the mean over these digits. Here we use map_dbl from purrr but a simple sapply would also do the trick.
Or a solution based on arithmetic rather than string splitting:
df %>% mutate(mean_digits =
map_dbl(digits,
~ mean((.x %/% 10 ^ (0:(nchar(as.character(.x)) - 1)) %% 10))))
Explanation
You integer-divide (%/%) each number by powers of 10 (i.e. 10^0, 10^1, 10^2, ..., 10^i, up to the number of digits) and take the result modulo 10 (which gives you exactly the original digit). Then you calculate the mean.
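The steps can be traced for a single number. This is only an illustration of the explanation above; the intermediate names powers and digits are introduced here for readability.

```r
x <- 2657

# 10^0, 10^1, ..., one power per digit of x
powers <- 10^(0:(nchar(as.character(x)) - 1))

# Integer division drops the lower digits, modulo 10 keeps the last one left
digits <- (x %/% powers) %% 10
digits        # digits of x, from the last to the first

mean(digits)
```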
Bare functions to be used for benchmarking
split_based <- function(x) {
sapply(strsplit(as.character(x), ""),
function(.x) mean(as.numeric(.x)))
}
## split_based(df$digits)
arithmetic_based <- function(.x) {
mean((.x %/% 10 ^ (0:(nchar(as.character(.x)) - 1)) %% 10))
}
## sapply(df$digits, arithmetic_based)
Here is a stringr alternative. It uses sapply with str_extract_all to extract the characters of df$digits for each row and calculates the mean.
library(stringr)
df$mean_digits <- sapply(str_extract_all(df$digits, ".{1}"), function(x) mean(as.numeric(x)))
df
country gdp digits mean_digits
1 US 100 2657 5
2 AUS 50 123 2
3 NZ 40 11 1
Or, if you really wanted to, you could do it by using the matrix output from str_extract_all and rowMeans. Note: for str_extract_all, simplify = FALSE is the default.
extracted_mat <- str_extract_all(df$digits, ".{1}", simplify = TRUE)
class(extracted_mat) <- "numeric"
df$mean_digits <- rowMeans(extracted_mat, na.rm = T)
EDIT: running benchmarks on a larger scale (i.e., using @Gregor's sample suggestion).
# Packages
library(stringr)
library(gsubfn)
# Functions
mean_digits = function(x) {
sapply(strsplit(as.character(x), split = "", fixed = TRUE),
function(x) mean(as.integer(x)))
}
mnDigit <- function(x) {
n <- nchar(x)
sq <- as.numeric(paste0("1e", n:0))
mean((x %% sq[-length(sq)]) %/% sq[-1])
}
mnDigit2 <- function(a) {
dig <- ceiling(log10(a + 1))
vec1 <- 10^(dig:1)
vec2 <- vec1 / 10
mean((a %% vec1) %/% vec2)
}
# Creating x
set.seed(1)
x = sample(1:1e7, size = 5e5)
microbenchmark::microbenchmark(mnDigit2=sapply(x, mnDigit2),
mnDigit=sapply(x, mnDigit),
stringr=sapply(str_extract_all(x, ".{1}"), function(x) mean(as.numeric(x))),
stringr_matrix = {
extracted_mat <- str_extract_all(x, ".{1}", simplify = TRUE)
class(extracted_mat) <- "numeric"
rowMeans(extracted_mat, na.rm = T)
},
strsplit=mean_digits(x),
rowMeans=rowMeans(read.table(text = gsub("\\b", " ", x), fill = NA), na.rm = TRUE),
#strapply=sapply(strapply(x, ".", as.numeric, simplify=TRUE), mean),
times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
mnDigit2 3154.4249 3226.633 3461.847 3445.867 3612.690 3840.691 10 c
mnDigit 6403.7460 6613.345 6876.223 6736.304 6965.453 7634.197 10 d
stringr 3277.0188 3628.581 3765.786 3711.022 3808.547 4347.229 10 c
stringr_matrix 944.5599 1029.527 1136.334 1090.186 1169.633 1540.976 10 a
strsplit 3087.6628 3259.925 3500.780 3416.607 3585.573 4249.027 10 c
rowMeans 1354.5196 1449.871 1604.305 1594.297 1745.088 1828.070 10 b
identical(sapply(x, mnDigit2), sapply(x, mnDigit))
[1] TRUE
identical(sapply(x, mnDigit2), sapply(str_extract_all(x, ".{1}"), function(x) mean(as.numeric(x))))
[1] TRUE
identical(sapply(x, mnDigit2), {
extracted_mat <- str_extract_all(x, ".{1}", simplify = TRUE)
class(extracted_mat) <- "numeric"
rowMeans(extracted_mat, na.rm = T)
})
[1] TRUE
identical(sapply(x, mnDigit2), mean_digits(x))
[1] TRUE
identical(sapply(x, mnDigit2), rowMeans(read.table(text = gsub("\\b", " ", x), fill = NA), na.rm = TRUE))
[1] TRUE
This might be done more efficiently with arithmetic.
Inspired from this solution we could do:
mnDigit <- function(x) {
n <- nchar(x)
sq <- as.numeric(paste0("1e", n:0))
mean((x %% sq[-length(sq)]) %/% sq[-1])
}
sapply(df$digits, mnDigit)
# [1] 5 2 1
Explanation: In the function nchar first counts the digits and creates a vector of powers of 10. The final line basically counts each power of 10 in modulo.
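To make the last line of mnDigit concrete, here is the same computation unrolled for one value. The names remainders and digits are introduced only for this illustration.

```r
x <- 123
n <- nchar(x)                               # number of digits
sq <- as.numeric(paste0("1e", n:0))         # 1000, 100, 10, 1

# Remainder after removing everything above each position
remainders <- x %% sq[-length(sq)]          # 123, 23, 3

# Integer division by the next lower power isolates the leading digit
digits <- remainders %/% sq[-1]             # 1, 2, 3

mean(digits)
```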
Applying the "more general solution" mentioned in the linked answer would look like this (thx to @thothal for fixing the error):
mnDigit2 <- function(a) {
dig <- ceiling(log10(a + 1))
vec1 <- 10^(dig:1)
vec2 <- vec1 / 10
mean((a %% vec1) %/% vec2)
}
Let's take a look at the benchmark:
Unit: milliseconds
expr min lq mean median uq max neval cld
mnDigit2 140.65468 152.48952 173.7740 171.3010 179.23491 248.25977 10 a
mnDigit 130.21340 151.76850 185.0632 166.7446 193.03661 292.59642 10 a
stringr 112.80276 116.17671 129.7033 130.6521 137.24450 149.82282 10 a
strsplit 106.64857 133.76875 155.3771 138.6853 148.58234 257.20670 10 a
rowMeans 27.58122 28.55431 37.8117 29.5755 41.82507 66.96972 10 a
strapply 6260.85467 6725.88120 7673.3511 6888.5765 8957.92438 10773.54486 10 b
split_based 363.59171 432.15120 475.5603 459.9434 528.20592 623.79144 10 a
arithmetic_based 137.60552 172.90697 195.4316 183.1395 208.44365 292.07671 10 a
Note: I've taken out the tidyverse solutions because they are too nested with additional data frame manipulation.
However, the expectation that arithmetic would be faster seems NOT to be true. In fact, the rowMeans/read.table approach seems to be by far the fastest.
Data
df <- structure(list(country = c("US", "AUS", "NZ"), gdp = c(100, 50,
40), digits = c(2657, 123, 11)), class = "data.frame", row.names = c(NA,
-3L))
Benchmark code
set.seed(42)
evav <- sample(1:1e5, size=1e4)
library(stringr) # for str_extract_all
library(gsubfn) # for strapply
microbenchmark::microbenchmark(mnDigit2=sapply(evav, mnDigit2),
mnDigit=sapply(evav, mnDigit),
stringr=sapply(str_extract_all(evav, ".{1}"), function(x) mean(as.numeric(x))),
strsplit=mean_digits(evav),
rowMeans=rowMeans(read.table(text = gsub("\\b", " ", evav), fill = NA), na.rm = TRUE),
strapply=sapply(strapply(evav, ".", as.numeric, simplify=TRUE), mean),
split_based=sapply(evav, split_based),
arithmetic_based=sapply(evav, arithmetic_based),
times=10L,
control=list(warmup=10L))
# see `mean_digits` `split_based` & `arithmetic_based` functions in other answers

R Copy column with less characters

I have a Dataframe that looks like this:
Tree Species
5 rops_002
6 tico_001
8 tico_004
I need to add a column with less characters, like this:
Tree Species Species1
5 rops_002 rops
6 tico_001 tico
8 tico_004 tico
does somebody know how to do this?
Thank you very much!
dt <- data.frame(a = 1:2)
dt$Species <- c("assa_12", "bssa_12")
dt
# a Species
# 1 1 assa_12
# 2 2 bssa_12
One way:
dt$Species1 <- substr(dt$Species, 1, 4)
dt
# a Species Species1
# 1 1 assa_12 assa
# 2 2 bssa_12 bssa
Second option:
dt$Species1 <- sapply(strsplit(dt$Species, "_"), function(x) x[1])
dt
# a Species Species1
# 1 1 assa_12 assa
# 2 2 bssa_12 bssa
More functions and benchmarks:
minem1 <- function(x) substr(x, 1, 4) # takes first 4 characters
minem2 <- function(x) sapply(strsplit(x, "_"), function(x) x[1]) # splits by "_" and takes first part
minem3 <- function(x) sapply(strsplit(x, "_", fixed = T), function(x) x[1]) # the same
andrewGustar <- function(x) gsub("_\\d+", "", x) # replaces anything after "_" with ""
koenV <- function(x) sub(x, pattern = "_.+", replacement = "") #changed a little
require(data.table)
setDT(dt)
minem4 <- function(x) data.table::tstrsplit(x, "_", fixed = T)[[1]]
# also splits and takes first part
# create a large test case:
n <- 100000
dt <- data.frame(a = 1:n,
Species = sample(c("aaaa", "abda", "asdf", "dads"), n, replace = T))
dt$Species <- paste(dt$Species, dt$a, sep = "_")
require(microbenchmark)
bench <- microbenchmark(minem1(dt$Species),
minem2(dt$Species),
andrewGustar(dt$Species),
koenV(dt$Species),
minem3(dt$Species),
minem4(dt$Species))
bench
Unit: milliseconds
# expr min lq mean median uq max neval cld
# minem1(dt$Species) 5.12257 5.465827 5.655002 5.620615 5.818871 6.94633 100 a
# minem2(dt$Species) 126.19138 133.780757 167.598675 176.696708 186.330236 627.31002 100 d
# andrewGustar(dt$Species) 40.24816 41.988833 42.591255 42.549435 42.942418 48.48893 100 b
# koenV(dt$Species) 37.91208 39.528120 40.369007 40.412091 40.885594 46.52658 100 b
# minem3(dt$Species) 80.40778 86.622198 112.163038 90.496686 137.788859 575.97141 100 c
# minem4(dt$Species) 15.28590 16.111006 17.737274 16.552911 17.054645 69.07255 100 a
autoplot(bench) # needs ggplot2 to be loaded
Conclusions: if you are sure that Species1 is always a 4-character string, use substr; if not, try tstrsplit from data.table. You could also look at the stringr and stringi packages for faster character subsetting.
Or df$Species1 <- gsub("_\\d+","",df$Species)
This will remove the _nnn part, whereas minem's answer just keeps the first four characters. It depends what you want! If they are always in the AAAA_nnn format, then both are equivalent.
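A small illustration of where the two approaches diverge. The third value, "quercus_010", is a made-up example that does not follow the four-letter format.

```r
# "quercus_010" is hypothetical: a prefix longer than four characters
species <- c("rops_002", "tico_001", "quercus_010")

first_four <- substr(species, 1, 4)        # keeps the first 4 characters
no_suffix  <- gsub("_\\d+", "", species)   # strips "_" plus the digits

first_four
no_suffix
```

For the first two values both results agree; for the third, substr truncates the name to "quer" while gsub keeps "quercus".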
one very simple way may be:
df$Species1 <- sub(x = df$Species, pattern = "_00.", replacement = "")
if your pattern to remove is always _00x, where x is one digit

Count values separated by a comma in a character string

I have this example data
d<-"30,3"
class(d)
I have character objects like this in one column of my working data frame, and I need to be able to identify how many numbers each one contains.
I have tried to use length(d), but it says 1
After looking for solution here I have tried
eval(parse(text='d'))
as.numeric(d)
as.vector.character(d)
But it still doesn't work.
Any straightforward approach to solve this problem?
These two approaches are each short, work on vectors of strings, do not involve the expense of explicitly constructing the split string and do not use any packages. Here d is a vector of strings such as d <- c("1,2,3", "5,2") :
1) count.fields
count.fields(textConnection(d), sep = ",")
2) gregexpr
lengths(gregexpr(",", d)) + 1
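One caveat with the gregexpr version, shown as a hedged sketch: for a string without any comma, gregexpr returns -1, which still has length 1, so the count comes out as 2 instead of 1. Wrapping the result in regmatches is one way around this; the variable names naive and safe are introduced here for illustration.

```r
d <- c("1,2,3", "5,2", "7")   # "7" contains no separator

# gregexpr() gives -1 (length 1) when nothing matches
naive <- lengths(gregexpr(",", d)) + 1

# regmatches() gives character(0) when nothing matches
safe <- lengths(regmatches(d, gregexpr(",", d))) + 1

naive
safe
```

If every string is known to contain at least one comma, the shorter form is sufficient.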
You could use scan.
v1 <- scan(text=d, sep=',', what=numeric(), quiet=TRUE)
v1
#[1] 30 3
Or using stri_split from stringi. This should take both character and factor class without converting explicitly to character using as.character
library(stringi)
v2 <- as.numeric(unlist(stri_split(d,fixed=',')))
v2
#[1] 30 3
You can do the count using base R by
length(v1)
#[1] 2
Or
nchar(gsub('[^,]', '', d))+1
#[1] 2
Update
If d is a column in a dataset df and you want to subset the rows where the number of values equals 2:
d<-c("30,3,5","30,5")
df <- data.frame(d,stringsAsFactors=FALSE)
df[nchar(gsub('[^,]', '',df$d))+1==2,,drop=FALSE]
# d
#2 30,5
Just to test
df[nchar(gsub('[^,]', '',df$d))+1==10,,drop=FALSE]
#[1] d
#<0 rows> (or 0-length row.names)
You could also try the stringi package's stri_count_* functions (should be very efficient)
library(stringi)
stri_count_regex(d, "\\d+")
## [1] 2
stri_count_fixed(d, ",") + 1
## [1] 2
stringr package has a similar functionality
library(stringr)
str_count(d, "\\d+")
## [1] 2
Update:
If you want to subset your data set to the rows that contain exactly 2 values, you could try
df[stri_count_regex(df$d, "\\d+") == 2,, drop = FALSE]
# d
# 2 30,5
Or simpler
subset(df, stri_count_regex(d, "\\d+") == 2)
# d
# 2 30,5
Update #2
Here's a benchmark that illustrates why one should consider using external packages (@rengis's answer wasn't included because it doesn't answer the question)
library(microbenchmark)
library(stringi)
d <- rep("30,3", 1e4)
microbenchmark( akrun = nchar(gsub('[^,]', '', d))+1,
GG1 = count.fields(textConnection(d), sep = ","),
GG2 = sapply(gregexpr(",", d), length) + 1,
DA1 = stri_count_regex(d, "\\d+"),
DA2 = stri_count_fixed(d, ",") + 1)
# Unit: microseconds
# expr min lq mean median uq max neval
# akrun 8817.950 9479.9485 11489.7282 10642.4895 12480.845 46538.39 100
# GG1 55451.474 61906.2460 72324.0820 68783.9935 78980.216 150673.72 100
# GG2 33026.455 43349.5900 60960.8762 51825.6845 72293.923 203126.27 100
# DA1 4730.302 5120.5145 6206.8297 5550.7930 7179.536 10507.09 100
# DA2 380.147 418.2395 534.6911 448.2405 597.259 2278.11 100
Here is a possibility
> as.numeric(unlist(strsplit("30,3", ",")))
# 30 3
A slight variation on the accepted answer, requires no packages. Using the example d <- c("1,2,3", "5,2")
lengths(strsplit(d, ","))
# [1] 3 2
Or as a data.frame
df <- data.frame(d = d)
df$counts <- lengths(strsplit(df$d, ","))
df
#----
d counts
1,2,3 3
5,2 2
