Copy a column with fewer characters in R

I have a data frame that looks like this:
Tree Species
   5 rops_002
   6 tico_001
   8 tico_004
I need to add a column with fewer characters, like this:
Tree Species  Species1
   5 rops_002 rops
   6 tico_001 tico
   8 tico_004 tico
Does somebody know how to do this?
Thank you very much!

dt <- data.frame(a = 1:2)
dt$Species <- c("assa_12", "bssa_12")
dt
# a Species
# 1 1 assa_12
# 2 2 bssa_12
One way:
dt$Species1 <- substr(dt$Species, 1, 4)
dt
# a Species Species1
# 1 1 assa_12 assa
# 2 2 bssa_12 bssa
Second option:
dt$Species1 <- sapply(strsplit(dt$Species, "_"), function(x) x[1])
dt
# a Species Species1
# 1 1 assa_12 assa
# 2 2 bssa_12 bssa
More functions and benchmarks:
minem1 <- function(x) substr(x, 1, 4) # takes the first 4 characters
minem2 <- function(x) sapply(strsplit(x, "_"), function(x) x[1]) # splits by "_" and takes the first part
minem3 <- function(x) sapply(strsplit(x, "_", fixed = TRUE), function(x) x[1]) # the same, with fixed matching
andrewGustar <- function(x) gsub("_\\d+", "", x) # removes "_" and the digits that follow
koenV <- function(x) sub(x, pattern = "_.+", replacement = "") # changed a little
require(data.table)
setDT(dt)
minem4 <- function(x) data.table::tstrsplit(x, "_", fixed = TRUE)[[1]]
# also splits and takes the first part
# create a large test case:
n <- 100000
dt <- data.frame(a = 1:n,
                 Species = sample(c("aaaa", "abda", "asdf", "dads"), n, replace = TRUE))
dt$Species <- paste(dt$Species, dt$a, sep = "_")
require(microbenchmark)
bench <- microbenchmark(minem1(dt$Species),
                        minem2(dt$Species),
                        andrewGustar(dt$Species),
                        koenV(dt$Species),
                        minem3(dt$Species),
                        minem4(dt$Species))
bench
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# minem1(dt$Species) 5.12257 5.465827 5.655002 5.620615 5.818871 6.94633 100 a
# minem2(dt$Species) 126.19138 133.780757 167.598675 176.696708 186.330236 627.31002 100 d
# andrewGustar(dt$Species) 40.24816 41.988833 42.591255 42.549435 42.942418 48.48893 100 b
# koenV(dt$Species) 37.91208 39.528120 40.369007 40.412091 40.885594 46.52658 100 b
# minem3(dt$Species) 80.40778 86.622198 112.163038 90.496686 137.788859 575.97141 100 c
# minem4(dt$Species) 15.28590 16.111006 17.737274 16.552911 17.054645 69.07255 100 a
library(ggplot2) # needed for the autoplot method
autoplot(bench)
Conclusions: if you are sure that Species1 is always a 4-character string, use substr; if not, try tstrsplit from data.table. You could also look at the stringr and stringi packages for faster character subsetting.
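As a hedged sketch of what those packages offer for this task (not part of the benchmark above; the regex variant assumes the wanted part is everything before the first underscore):
library(stringi)
dt$Species1 <- stri_extract_first_regex(dt$Species, "^[^_]+") # part before the first "_"
# or, when the prefix is always exactly 4 characters:
dt$Species1 <- stri_sub(dt$Species, 1, 4)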

Or: df$Species1 <- gsub("_\\d+", "", df$Species)
This will remove the _nnn part, whereas minem's answer just keeps the first four characters. It depends on what you want! If the values are always in the AAAA_nnn format, the two are equivalent.

One very simple way may be:
df$Species1 <- sub(x = df$Species, pattern = "_00.", replacement = "")
if your pattern to remove is always _00x, where x is a single digit.
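If the suffix is not always two zeros and a digit, a slightly more general (hedged) variant drops everything from the first underscore onward:
df$Species1 <- sub("_.*$", "", df$Species) # removes "_" and everything after it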

Take mean of digits that are run together in one column

My data is in this format:
country gdp digits
US 100 2657
Aus 50 123
NZ 40 11
and I'd like to take the mean, for each country, of the individual digits that are all stored together in the digits column.
So this is what I'm after:
country gdp digits mean_digits
US 100 2657 5
Aus 50 123 2
NZ 40 11 1
I imagine I should split the digits column into individual digits in separate columns and then take an arithmetic mean, but I was a little unsure because different rows have different numbers of digits in the digits field.
Code for the reproducible data below:
df <- data.frame(stringsAsFactors = FALSE,
                 country = c("US", "AUS", "NZ"),
                 gdp = c(100, 50, 40),
                 digits = c(2657, 123, 11))
We need a function to split the number into digits and take the mean:
mean_digits = function(x) {
  sapply(strsplit(as.character(x), split = "", fixed = TRUE),
         function(x) mean(as.integer(x)))
}
df$mean_digits = mean_digits(df$digits)
df
# country gdp digits mean_digits
# 1 US 100 2657 5
# 2 AUS 50 123 2
# 3 NZ 40 11 1
as.character() converts the numeric input to character, strsplit splits each number into individual digits (resulting in a list), and then sapply converts each list element to integer and takes its mean.
We use fixed = TRUE for a little extra efficiency, since we don't need any special regex to split the digits apart.
If you're using this function frequently, you may want to round, or check that the input is integer: it will return NA if the input has decimals, because of the "." character.
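A minimal sketch of such a guard (the wrapper name mean_digits_safe and the check are illustrative additions, not from the original answer):
mean_digits_safe <- function(x) {
  stopifnot(all(x %% 1 == 0)) # reject non-integer input early
  sapply(strsplit(as.character(x), split = "", fixed = TRUE),
         function(d) mean(as.integer(d)))
}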
One tidyverse possibility could be:
library(tidyverse)
df %>%
  mutate(digits = str_split(digits, pattern = "")) %>%
  unnest() %>%
  group_by(country, gdp) %>%
  summarise(digits = mean(as.numeric(digits)))
#   country   gdp digits
#   <chr>   <int>  <dbl>
# 1 Aus        50      2
# 2 NZ         40      1
# 3 US        100      5
Or:
df %>%
  mutate(digits = str_split(digits, pattern = "")) %>%
  unnest() %>%
  group_by(country, gdp) %>%
  summarise_all(list(~ mean(as.numeric(.))))
1) strapply This one-liner uses strapply from gsubfn. It converts each digit to numeric and then takes the mean for each row.
library(gsubfn)
transform(df, mean = sapply(strapply(digits, ".", as.numeric, simplify = TRUE), mean))
2) This is a little longer but still one statement and uses no packages. It inserts a space between digits, reads them using read.table and then applies rowMeans.
transform(df,
          mean = rowMeans(read.table(text = gsub("\\b", " ", digits), fill = NA), na.rm = TRUE))
Another tidyverse one-liner without other dependencies:
df %>% mutate(mean_digits = map_dbl(strsplit(as.character(df$digits), ""),
                                    ~ mean(as.numeric(.x))))
# country gdp digits mean_digits
# 1 US 100 2657 5
# 2 AUS 50 123 2
# 3 NZ 40 11 1
Explanation
You use strsplit to split the digits into single digits. This gives you a list where each element contains the single digits.
Then you loop over this list and calculate the mean over these digits. Here we use map_dbl from purrr but a simple sapply would also do the trick.
Or a solution based on arithmetic rather than string splitting:
df %>% mutate(mean_digits =
                map_dbl(digits,
                        ~ mean((.x %/% 10 ^ (0:(nchar(as.character(.x)) - 1)) %% 10))))
Explanation
You integer-divide (%/%) each number by powers of 10 (i.e. 10^0, 10^1, 10^2, ..., 10^i up to the number of digits) and take the result modulo 10 (which recovers exactly the original digits). Then you calculate the mean.
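Worked through for the first row (2657), the intermediate vector looks like this:
x <- 2657
x %/% 10^(0:3) %% 10       # [1] 7 5 6 2 (the digits, least significant first)
mean(x %/% 10^(0:3) %% 10) # [1] 5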
Bare functions to be used for benchmarking
split_based <- function(x) {
  sapply(strsplit(as.character(x), ""),
         function(.x) mean(as.numeric(.x)))
}
## split_based(df$digits)
arithmetic_based <- function(.x) {
  mean((.x %/% 10 ^ (0:(nchar(as.character(.x)) - 1)) %% 10))
}
## sapply(df$digits, arithmetic_based)
Here is a stringr alternative. It uses sapply with str_extract_all to extract the characters of df$digits for each row and calculates the mean.
library(stringr)
df$mean_digits <- sapply(str_extract_all(df$digits, ".{1}"), function(x) mean(as.numeric(x)))
df
#   country gdp digits mean_digits
# 1      US 100   2657           5
# 2     AUS  50    123           2
# 3      NZ  40     11           1
Or, if you really wanted to, you could do it by using the matrix output from str_extract_all and rowMeans. Note: for str_extract_all, simplify = FALSE is the default.
extracted_mat <- str_extract_all(df$digits, ".{1}", simplify = TRUE)
class(extracted_mat) <- "numeric"
df$mean_digits <- rowMeans(extracted_mat, na.rm = T)
EDIT: running benchmarks on a larger scale (i.e., using @Gregor's sample suggestion).
# Packages
library(stringr)
library(gsubfn)
# Functions
mean_digits = function(x) {
  sapply(strsplit(as.character(x), split = "", fixed = TRUE),
         function(x) mean(as.integer(x)))
}
mnDigit <- function(x) {
  n <- nchar(x)
  sq <- as.numeric(paste0("1e", n:0))
  mean((x %% sq[-length(sq)]) %/% sq[-1])
}
mnDigit2 <- function(a) {
  dig <- ceiling(log10(a + 1))
  vec1 <- 10^(dig:1)
  vec2 <- vec1 / 10
  mean((a %% vec1) %/% vec2)
}
# Creating x
set.seed(1)
x = sample(1:1e7, size = 5e5)
microbenchmark::microbenchmark(mnDigit2 = sapply(x, mnDigit2),
                               mnDigit = sapply(x, mnDigit),
                               stringr = sapply(str_extract_all(x, ".{1}"), function(x) mean(as.numeric(x))),
                               stringr_matrix = {
                                 extracted_mat <- str_extract_all(x, ".{1}", simplify = TRUE)
                                 class(extracted_mat) <- "numeric"
                                 rowMeans(extracted_mat, na.rm = T)
                               },
                               strsplit = mean_digits(x),
                               rowMeans = rowMeans(read.table(text = gsub("\\b", " ", x), fill = NA), na.rm = TRUE),
                               #strapply = sapply(strapply(x, ".", as.numeric, simplify = TRUE), mean),
                               times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
mnDigit2 3154.4249 3226.633 3461.847 3445.867 3612.690 3840.691 10 c
mnDigit 6403.7460 6613.345 6876.223 6736.304 6965.453 7634.197 10 d
stringr 3277.0188 3628.581 3765.786 3711.022 3808.547 4347.229 10 c
stringr_matrix 944.5599 1029.527 1136.334 1090.186 1169.633 1540.976 10 a
strsplit 3087.6628 3259.925 3500.780 3416.607 3585.573 4249.027 10 c
rowMeans 1354.5196 1449.871 1604.305 1594.297 1745.088 1828.070 10 b
identical(sapply(x, mnDigit2), sapply(x, mnDigit))
[1] TRUE
identical(sapply(x, mnDigit2), sapply(str_extract_all(x, ".{1}"), function(x) mean(as.numeric(x))))
[1] TRUE
identical(sapply(x, mnDigit2), {
extracted_mat <- str_extract_all(x, ".{1}", simplify = TRUE)
class(extracted_mat) <- "numeric"
rowMeans(extracted_mat, na.rm = T)
})
[1] TRUE
identical(sapply(x, mnDigit2), mean_digits(x))
[1] TRUE
identical(sapply(x, mnDigit2), rowMeans(read.table(text = gsub("\\b", " ", x), fill = NA), na.rm = TRUE))
[1] TRUE
This might be done more efficiently with arithmetic.
Inspired from this solution we could do:
mnDigit <- function(x) {
  n <- nchar(x)
  sq <- as.numeric(paste0("1e", n:0))
  mean((x %% sq[-length(sq)]) %/% sq[-1])
}
sapply(df$digits, mnDigit)
# [1] 5 2 1
Explanation: Inside the function, nchar first counts the digits, and from that a vector of powers of 10 (sq) is built. The final line extracts each digit via modulo and integer division and averages them.
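A hedged walk-through for a single value, x <- 2657:
x <- 2657
sq <- as.numeric(paste0("1e", 4:0))     # 10000 1000 100 10 1
(x %% sq[-length(sq)]) %/% sq[-1]       # [1] 2 6 5 7
mean((x %% sq[-length(sq)]) %/% sq[-1]) # [1] 5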
Applying the "more general solution" mentioned in the linked answer would look like this (thanks to @thothal for fixing the error):
mnDigit2 <- function(a) {
  dig <- ceiling(log10(a + 1))
  vec1 <- 10^(dig:1)
  vec2 <- vec1 / 10
  mean((a %% vec1) %/% vec2)
}
Let's take a look at the benchmark:
Unit: milliseconds
expr min lq mean median uq max neval cld
mnDigit2 140.65468 152.48952 173.7740 171.3010 179.23491 248.25977 10 a
mnDigit 130.21340 151.76850 185.0632 166.7446 193.03661 292.59642 10 a
stringr 112.80276 116.17671 129.7033 130.6521 137.24450 149.82282 10 a
strsplit 106.64857 133.76875 155.3771 138.6853 148.58234 257.20670 10 a
rowMeans 27.58122 28.55431 37.8117 29.5755 41.82507 66.96972 10 a
strapply 6260.85467 6725.88120 7673.3511 6888.5765 8957.92438 10773.54486 10 b
split_based 363.59171 432.15120 475.5603 459.9434 528.20592 623.79144 10 a
arithmetic_based 137.60552 172.90697 195.4316 183.1395 208.44365 292.07671 10 a
Note: I've taken out the tidyverse solutions because they are too nested with additional data frame manipulation.
However, the premise of this answer, that arithmetic beats string splitting, seems NOT to hold: in fact the rowMeans/read.table approach seems to be by far the fastest here.
Data
df <- structure(list(country = c("US", "AUS", "NZ"), gdp = c(100, 50,
40), digits = c(2657, 123, 11)), class = "data.frame", row.names = c(NA,
-3L))
Benchmark code
set.seed(42)
evav <- sample(1:1e5, size=1e4)
library(stringr) # for str_extract_all
library(gsubfn) # for strapply
microbenchmark::microbenchmark(mnDigit2 = sapply(evav, mnDigit2),
                               mnDigit = sapply(evav, mnDigit),
                               stringr = sapply(str_extract_all(evav, ".{1}"), function(x) mean(as.numeric(x))),
                               strsplit = mean_digits(evav),
                               rowMeans = rowMeans(read.table(text = gsub("\\b", " ", evav), fill = NA), na.rm = TRUE),
                               strapply = sapply(strapply(evav, ".", as.numeric, simplify = TRUE), mean),
                               split_based = sapply(evav, split_based),
                               arithmetic_based = sapply(evav, arithmetic_based),
                               times = 10L,
                               control = list(warmup = 10L))
# see `mean_digits` `split_based` & `arithmetic_based` functions in other answers

R Minimum Value from Datatable Not Equal to a Particular Value

How do I find the minimum value in an R data table other than a particular value?
For example, there could be zeroes in the data table, and the goal would be to find the minimum non-zero value.
I tried using sapply with min, but I am not sure how to specify the extra criterion so that the minimum is not equal to a certain value.
More generally, how do we find the minimum from a data table that is not equal to any element from a list of possible values?
If you want to find the minimum value from a vector while excluding certain values from that vector, then you can use %in%:
v <- c(1:10) # values 1 .. 10
v.exclude <- c(1, 2) # exclude the values 1 and 2 from consideration
min.exclude <- min(v[!v %in% v.exclude])
The logic won't change much if you are using a column from a data table/frame. In this case you can just replace the vector v with the appropriate column. If you have your excluded values in a list, then you can flatten it to produce your v.exclude vector.
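For instance, if the excluded values arrive as a list, the flattening step could look like this (a minimal sketch; exclude_list is an illustrative name):
v <- 1:10
exclude_list <- list(1, 2)        # excluded values held in a list
v.exclude <- unlist(exclude_list) # flatten the list to a plain vector
min(v[!v %in% v.exclude])
# [1] 3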
This can be done with data.table (as the OP mentioned data tables in the post) after setting the key:
library(data.table)
setDT(df, key='a')[!.(exclude)]
# a b
#1: 4 40
#2: 5 50
#3: 6 60
If we need the min value of 'a'
min(setDT(df, key='a')[!.(exclude)]$a)
#[1] 4
For finding the min in all the columns (using the setkey method), we loop over the columns of the dataset, set the key to each column in turn, subset the dataset, and store the min value in a previously created list.
setDT(df)
MinVal <- vector('list', length(df))
for(j in seq_along(df)){
  setkeyv(df, names(df)[j])
  MinVal[[j]] <- min(df[!.(exclude)][[j]])
}
MinVal
#[[1]]
#[1] 4
#[[2]]
#[1] 10
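For comparison, the same per-column minima can be obtained with a plain base-R sketch (no key-setting involved, so it is not equivalent performance-wise):
sapply(df, function(col) min(col[!col %in% exclude]))
#  a  b
#  4 10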
data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
                 b = c(10,10,20,20,30,30,40,40,50,60))
exclude <- c(0,1,2,3)
Assuming you are working with a data.frame
Data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
                 b = c(10,10,20,20,30,30,40,40,50,60))
Values to exclude from our minimum search:
exclude <- c(0,1,2,3)
We can find the minimum value from column a, excluding our exclude vector:
## minimum from column a
min(df[!df$a %in% exclude,]$a)
# [1] 4
Or from b
exclude <- c(10, 20, 30, 40)
min(df[!df$b %in% exclude,]$b)
# [1] 50
To return the row that corresponds to the minimum value
df[df$b == min( df[ !df$b %in% exclude, ]$b ),]
# a b
# 9 5 50
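An equivalent way to get that row is to filter first and then use which.min (a hedged variant; keep is an illustrative name):
keep <- df[!df$b %in% exclude, ]
keep[which.min(keep$b), ]
#   a  b
# 9 5 50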
Update
To find the minimum while excluding values from multiple columns, we can do it this way:
## values to exclude
exclude_a <- c(0,1)
exclude_b <- c(10)
## exclude rows/values from each column we don't want
df2 <- df[!(df$a %in% exclude_a) & !(df$b %in% exclude_b),]
## order the data
df3 <- df2[with(df2, order(a,b)),]
## take the first row
df3[1,]
#   a  b
# 4 2 20
Update 2
To select from multiple columns we can iterate over them as @akrun has shown, or alternatively we can construct our subsetting expression from the column names and evaluate it inside our [ operation:
exclude <- c(0,1,2, 10)
## construct a formula/expression using the column names
n <- names(df)
expr <- paste0("(", paste0(" !(df$", n, " %in% exclude) ", collapse = "&") ,")")
# [1] "( !(df$a %in% exclude) & !(df$b %in% exclude) )"
expr <- parse(text=expr)
df2 <- df[eval(expr),]
## order and select first row as before
df2 <- df2[with(df2, order(a,b)),]
df2 <- df2[1,]
And if we wanted to use data.table for this:
library(data.table)
setDT(df)[ eval(expr) ][order(a, b),][1,]
Comparison of methods:
library(microbenchmark)
fun_1 <- function(x){
  df2 <- x[eval(expr),]
  ## order and select first row as before
  df2 <- df2[with(df2, order(a,b)),]
  df2 <- df2[1,]
  return(df2)
}
fun_2 <- function(x){
  df2 <- setDT(x)[ eval(expr) ][order(a, b),][1,]
  return(df2)
}
## including @akrun's solution
fun_3 <- function(x){
  setDT(x)
  MinVal <- vector('list', length(x))
  for(j in seq_along(x)){
    setkeyv(x, names(x)[j])
    MinVal[[j]] <- min(x[!.(exclude)][[j]])
  }
  return(MinVal)
}
microbenchmark(fun_1(df), fun_2(df), fun_3(df), times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# fun_1(df) 770.376 804.5715 866.3499 833.071 869.2195 2728.740 1000
# fun_2(df) 854.862 893.1220 952.1207 925.200 962.6820 3115.119 1000
# fun_3(df) 1108.316 1148.3340 1233.1268 1186.938 1234.3570 5400.544 1000

Count values separated by a comma in a character string

I have this example data:
d <- "30,3"
class(d)
# [1] "character"
I have character objects like this in one column of my working data frame, and I need to identify how many numbers each one contains.
I have tried length(d), but it returns 1.
After looking for solutions here I have tried
eval(parse(text = 'd'))
as.numeric(d)
as.vector.character(d)
but none of these work.
Is there a straightforward approach to solve this problem?
These two approaches are each short, work on vectors of strings, do not involve the expense of explicitly constructing the split strings, and do not use any packages. Here d is a vector of strings such as d <- c("1,2,3", "5,2"):
1) count.fields
count.fields(textConnection(d), sep = ",")
2) gregexpr
lengths(gregexpr(",", d)) + 1
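A quick check on the example vector, with both calls returning one count per string:
d <- c("1,2,3", "5,2")
count.fields(textConnection(d), sep = ",")
# [1] 3 2
lengths(gregexpr(",", d)) + 1
# [1] 3 2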
You could use scan.
v1 <- scan(text=d, sep=',', what=numeric(), quiet=TRUE)
v1
#[1] 30 3
Or use stri_split from stringi. This handles both character and factor input without an explicit as.character conversion.
library(stringi)
v2 <- as.numeric(unlist(stri_split(d,fixed=',')))
v2
#[1] 30 3
You can get the count using base R with
length(v1)
#[1] 2
Or
nchar(gsub('[^,]', '', d))+1
#[1] 2
(The regex [^,] matches any single character that is not a comma, so the gsub call deletes everything except the commas.)
Update
If d is a column in a data frame df and you want to subset the rows whose count equals 2:
d<-c("30,3,5","30,5")
df <- data.frame(d,stringsAsFactors=FALSE)
df[nchar(gsub('[^,]', '',df$d))+1==2,,drop=FALSE]
# d
#2 30,5
Just to test
df[nchar(gsub('[^,]', '',df$d))+1==10,,drop=FALSE]
#[1] d
#<0 rows> (or 0-length row.names)
You could also try the stringi package's stri_count_* functions (they should be very efficient):
library(stringi)
stri_count_regex(d, "\\d+")
## [1] 2
stri_count_fixed(d, ",") + 1
## [1] 2
The stringr package has similar functionality:
library(stringr)
str_count(d, "\\d+")
## [1] 2
Update:
If you want to subset your data set to the length-2 vectors, you could try
df[stri_count_regex(df$d, "\\d+") == 2,, drop = FALSE]
# d
# 2 30,5
Or simpler
subset(df, stri_count_regex(d, "\\d+") == 2)
# d
# 2 30,5
Update #2
Here's a benchmark that illustrates why one should consider using external packages (@rengis's answer wasn't included because it doesn't answer the question):
library(microbenchmark)
library(stringi)
d <- rep("30,3", 1e4)
microbenchmark(akrun = nchar(gsub('[^,]', '', d))+1,
               GG1 = count.fields(textConnection(d), sep = ","),
               GG2 = sapply(gregexpr(",", d), length) + 1,
               DA1 = stri_count_regex(d, "\\d+"),
               DA2 = stri_count_fixed(d, ",") + 1)
# Unit: microseconds
# expr min lq mean median uq max neval
# akrun 8817.950 9479.9485 11489.7282 10642.4895 12480.845 46538.39 100
# GG1 55451.474 61906.2460 72324.0820 68783.9935 78980.216 150673.72 100
# GG2 33026.455 43349.5900 60960.8762 51825.6845 72293.923 203126.27 100
# DA1 4730.302 5120.5145 6206.8297 5550.7930 7179.536 10507.09 100
# DA2 380.147 418.2395 534.6911 448.2405 597.259 2278.11 100
Here is a possibility:
as.numeric(unlist(strsplit("30,3", ",")))
# [1] 30 3
A slight variation on the accepted answer that requires no packages. Using the example d <- c("1,2,3", "5,2"):
lengths(strsplit(d, ","))
# [1] 3 2
Or as a data.frame:
df <- data.frame(d = d)
df$counts <- lengths(strsplit(df$d, ","))
df
#       d counts
# 1 1,2,3      3
# 2   5,2      2

Replace parts of a variable using numeric indices in dplyr. Do I need to create an index column and use ifelse?

At one stage in a longer chain of dplyr functions, I need to replace parts of a variable, using numeric indices to specify which elements to replace.
My data looks like this:
df1 <- data.frame(grp = rep(1:2, each = 3),
                  a = 1:6,
                  b = rep(c(10, 20), each = 3))
df1
# grp a b
# 1 1 1 10
# 2 1 2 10
# 3 1 3 10
# 4 2 4 20
# 5 2 5 20
# 6 2 6 20
Assume that we, within each group, wish to replace elements in variable a with the corresponding elements in b, at one or more positions. In this simple example I use a single index (id), but this could be a vector of indices. First, here's how I would do it with ddply:
library(plyr)
id <- 2
ddply(.data = df1, .variables = .(grp), function(x){
  x$a[id] <- x$b[id]
  x
})
# grp a b
# 1 1 1 10
# 2 1 10 10
# 3 1 3 10
# 4 2 4 20
# 5 2 20 20
# 6 2 6 20
In dplyr I could think of some different ways to perform the replacement. (1) Use do with an anonymous function, similar to the one used in ddply. (2) Use mutate: concatenate a vector where the replacement is 'inserted' using numeric indexing. This is probably only fruitful for a single index. (3) Use mutate: create an index vector and use conditional replacement with ifelse (see e.g. here, here, here, and here).
detach("package:plyr", unload = TRUE)
library(dplyr)
# (1)
fun_do <- function(df){
  l <- df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[id] <- dat$b[id]
      dat
    })
  do.call(rbind, l)
}
# (2)
fun_mut <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = c(a[1:(id - 1)], b[id], a[(id + 1):length(a)])
    )
}
# (3)
fun_mut_ifelse <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      idx = 1:n(),
      a = ifelse(idx %in% id, b, a)) %.%
    select(-idx)
}
fun_do(df1)
fun_mut(df1)
fun_mut_ifelse(df1)
In a benchmark with a slightly larger data set, the 'jigsaw puzzle insertion' is fastest, but again, this method is probably only suited for single replacements. And it doesn't look very clean...
set.seed(123)
df2 <- data.frame(grp = rep(1:200, each = 3),
                  a = rnorm(600),
                  b = rnorm(600))
library(microbenchmark)
microbenchmark(fun_do(df2),
               fun_mut(df2),
               fun_mut_ifelse(df2),
               times = 10)
# Unit: microseconds
# expr min lq median uq max neval
# fun_do(df2) 48443.075 49912.682 51356.631 53369.644 55108.769 10
# fun_mut(df2) 891.420 933.996 1019.906 1066.663 1155.235 10
# fun_mut_ifelse(df2) 2503.579 2667.798 2869.270 3027.407 3138.787 10
Just to check the influence of the do.call(rbind, ...) part of the do function, try without it:
fun_do2 <- function(df){
  df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[2] <- dat$b[2]
      dat
    })
}
fun_do2(df1)
Then a new benchmark on a larger data set:
df3 <- data.frame(grp = rep(1:2000, each = 3),
                  a = rnorm(6000),
                  b = rnorm(6000))
microbenchmark(fun_do(df3),
               fun_do2(df3),
               fun_mut(df3),
               fun_mut_ifelse(df3),
               times = 10)
Again, a simple 'insertion' is fastest, while the do function loses ground. In the help text, do is described as "a general purpose complement" to the other dplyr functions, so to me it seemed a natural choice for an anonymous function. However, I was surprised that do was so much slower, even when the non-dplyr rbinding part was skipped. Currently the do documentation is rather scarce, so I wonder if I am abusing the function, and whether there are more appropriate (undocumented?) ways to do it.
I got no hits on index/indices when I searched the dplyr help text or vignette. So now I wonder:
Are there other dplyr methods to replace parts of a variable using numeric indices which I have overlooked? Specifically, is the creation of an index column in combination with ifelse the way to go, or are there more direct a[i] <- b[i]-like alternatives?
Edit, following a comment from @G.Grothendieck (thanks!). Added a replace alternative (a candidate for 'See also' in ?[).
fun_replace <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = replace(a, id, b[id]))
}
fun_replace(df1)
microbenchmark(fun_do(df3),
               fun_do2(df3),
               fun_mut(df3),
               fun_mut_ifelse(df3),
               fun_replace(df3),
               times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun_do(df3) 685.154605 693.327160 706.055271 712.180410 851.757790 10
# fun_do2(df3) 291.787455 294.047747 297.753888 299.624730 302.368554 10
# fun_mut(df3) 5.736640 5.883753 6.206679 6.353222 7.381871 10
# fun_mut_ifelse(df3) 24.321894 26.091049 29.361553 32.649924 52.981525 10
# fun_replace(df3) 4.616757 4.748665 4.981689 5.279716 5.911503 10
The replace function is fastest, and certainly easier to use than fun_mut when there is more than one index.
Edit 2: fun_do and fun_do2 no longer work in dplyr 0.2; Error: Results are not data frames at positions:
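For later dplyr versions, a hedged rewrite of the fastest approach above, using the %>% pipe instead of the long-gone %.% (fun_replace_new is an illustrative name, not from the post):
library(dplyr)
fun_replace_new <- function(df) {
  df %>%
    group_by(grp) %>%
    mutate(a = replace(a, id, b[id])) %>%
    ungroup()
}
fun_replace_new(df1)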
Here's a much faster modify-in-place approach:
library(data.table)
# select rows we want, then assign b to a for those rows, in place
fun_dt = function(dt) dt[dt[, .I[id], by = grp]$V1, a := b]
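To see what the inner expression does: dt[, .I[id], by = grp]$V1 collects, for each group, the global row numbers at position id. A quick check with id <- 2 on df1 from the question (a hedged illustration):
library(data.table)
dt1 <- as.data.table(df1)
dt1[, .I[id], by = grp]$V1
# [1] 2 5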
# benchmark
df4 = data.frame(grp = rep(1:20000, each = 3),
                 a = rnorm(60000),
                 b = rnorm(60000))
dt4 = as.data.table(df4)
library(microbenchmark)
# using fastest function from OP
microbenchmark(fun_dt(dt4), fun_replace(df4), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_dt(dt4) 15.62325 17.22828 18.42445 20.83768 21.25371 10
# fun_replace(df4) 99.03505 107.31529 116.74830 188.89134 286.50199 10
