I have a vector that looks like this:
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")
(the number of "/" characters is not always the same)
How can I create another vector with the sum of the numbers in each string?
Something like:
sum
3
3
4
4
4
One solution with strsplit and sapply:
sapply(strsplit(numbers, '/'), function(x) sum(as.numeric(x)))
#[1] 3 3 4 4 4
strsplit will split your strings on / (it doesn't matter how many /s you have). The output of strsplit is a list, so we iterate over it with sapply to calculate each sum.
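If you want a type-stable variant of the same idea, vapply() can stand in for sapply(); this is just a sketch of the same strsplit approach:

```r
# Same strsplit idea, but vapply() guarantees one numeric value per element
# (it errors early if a split yields something unexpected)
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")
vapply(strsplit(numbers, "/", fixed = TRUE),
       function(x) sum(as.numeric(x)), numeric(1))
#[1] 3 3 4 4 4
```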
What seems to me to be the most straightforward approach here is to convert your number strings to actual valid string arithmetic expressions, and then evaluate them in R using eval along with parse. Hence, the string 1/0/2 would become 1+0+2, and then we can simply evaluate that expression.
sapply(numbers, function(x) { eval(parse(text=gsub("/", "+", x))) })
1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
3 3 4 4 4
1) strapply: strapply matches each string of digits using \\d+ and then applies as.numeric to it, returning a list with one vector of numbers per input string. We then apply sum to each of those vectors. This solution seems particularly short.
library(gsubfn)
sapply(strapply(numbers, "\\d+", as.numeric), sum)
## [1] 3 3 4 4 4
2) read.table: This applies sum(read.table(...)) to each string. It is a bit longer (but still only one line of code) but uses no packages.
sapply(numbers, function(x) sum(read.table(text = x, sep = "/")))
## 1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
## 3 3 4 4 4
Add the USE.NAMES = FALSE argument to sapply if you don't want names on the output.
scan(textConnection(x), sep = "/", quiet = TRUE) could be used in place of read.table but is longer.
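Spelled out, the scan() variant mentioned above looks like this (a sketch; same output as the read.table version):

```r
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")
# scan() reads numeric fields by default; textConnection() feeds it one string
sapply(numbers,
       function(x) sum(scan(textConnection(x), sep = "/", quiet = TRUE)),
       USE.NAMES = FALSE)
## [1] 3 3 4 4 4
```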
['ax', 'byc', 'crm', 'dop']
This is a character string, and I want a count of all substrings, i.e. 4 here as output. I want to do this for the entire column containing such strings.
We may use str_count
library(stringr)
str_count(str1, "\\w+")
[1] 4
Or we may extract the alphanumeric characters into a list and get the lengths:
lengths(str_extract_all(str1, "[[:alnum:]]+"))
If it is a data.frame column, extract the column as a vector and apply str_count
str_count(df1$str1, "\\w+")
data
str1 <- "['ax', 'byc', 'crm', 'dop']"
df1 <- data.frame(str1)
Here are a few base R approaches. We use the two-row input defined reproducibly in the Note at the end. No packages are used.
lengths(strsplit(DF$b, ","))
## [1] 4 4
nchar(gsub("[^,]", "", DF$b)) + 1
## [1] 4 4
count.fields(textConnection(DF$b), ",")
## [1] 4 4
Note
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")
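Along the same lines, one more base sketch: count the quoted items directly rather than the separators (uses the DF from the Note; the '[^']*' pattern is my assumption about the quoting style):

```r
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")
# gregexpr finds every '...'-quoted item; lengths() counts them per row
lengths(regmatches(DF$b, gregexpr("'[^']*'", DF$b)))
## [1] 4 4
```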
I have two character vectors x and y, the former comprising (potential) sub strings of the latter, and both containing duplicate values. I want to return the index of the first match (if present) in y for each element in x, where the sub string is matched at the beginning of the string (cf. ^ anchor in regex), e.g:
x <- c("Halimid", "Halimid", "Callimid", "Diplid", "Halimid", "Cyathid")
y <- c("Bathymidae", "Bathymidae", "Halimidopidae", "Cyathidae", "Bothridae", "Cyathidae", "Diplididae", "Holothuridae")
some function(first match for each element of x in y if there is a match)
3, 3, NA, 7, 3, 4
i.e. the function should return a vector of the same length as x, containing the indices of the first match in y, or NA for elements without a match. I've already tried base::startsWith(), but it only works for a single substring, and pmatch() hasn't worked for me either. I want to avoid apply and loops if possible, so vectorized solutions are preferred.
I can’t think of a solution without lapply() or purrr::map(), not sure
if those are acceptable for you, but they are quite simple, so here we go:
x <- c("Halimid", "Halimid", "Callimid", "Diplid", "Halimid", "Cyathid")
y <- c("Bathymidae", "Bathymidae", "Halimidopidae", "Cyathidae", "Bothridae", "Cyathidae", "Diplididae", "Holothuridae")
Using lapply() and grep().
a <- lapply(x, function(z) grep(z, y)[1])
unlist(a)
#> [1] 3 3 NA 7 3 4
Using map_dbl() we can make the code a bit simpler, but it is essentially the same.
library(purrr)
map_dbl(x, ~grep(., y)[1])
#> [1] 3 3 NA 7 3 4
Created on 2020-11-02 by the reprex package (v0.3.0)
Using a traditional for loop:
v <- NULL
for(chr in x){
v <- c(v,grep(chr, y)[1])
}
v
[1] 3 3 NA 7 3 4
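Note that grep(z, y) matches the substring anywhere in y, while the question asked for matches anchored at the start. A sketch that pastes in the ^ anchor (same result on this data, but it would differ whenever a substring occurs mid-string):

```r
x <- c("Halimid", "Halimid", "Callimid", "Diplid", "Halimid", "Cyathid")
y <- c("Bathymidae", "Bathymidae", "Halimidopidae", "Cyathidae",
       "Bothridae", "Cyathidae", "Diplididae", "Holothuridae")
# "^" anchors the pattern to the start of each element of y;
# grep(...)[1] yields the first match, or NA when there is none
sapply(x, function(z) grep(paste0("^", z), y)[1], USE.NAMES = FALSE)
#> [1]  3  3 NA  7  3  4
```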
This question already has answers here:
How to transform a key/value string into distinct rows?
(2 answers)
Closed 4 years ago.
I have a large text file that I want to import in R with multimodal data encoded as such :
A=1,B=1,C=2,...
A=2,B=1,C=1,...
A=1,B=2,C=1,...
What I'd like to have is a dataframe similar to this :
A B C
1 1 2
2 1 1
1 2 1
Because the column name is repeated over and over for each row, I was wondering if there is a way to import that text file with fscanf-style functionality that would parse the A, B, C column names, such as "A=%d,B=%d,C=%d,...."
Or maybe there's a simpler way using read.table or scan? But I couldn't figure out how.
Thanks for any tips
1) read.pattern: read.pattern in the gsubfn package is very close to what you are asking for. Instead of %d, use (\\d+) when specifying the pattern. If the column names are not important, the col.names argument can be omitted.
library(gsubfn)
L <- c("A=1,B=1,C=2", "A=1,B=1,C=2", "A=1,B=1,C=2") # test input
pat <- "A=(\\d+),B=(\\d+),C=(\\d+)"
read.pattern(text = L, pattern = pat, col.names = unlist(strsplit(pat, "=.*?(,|$)")))
giving:
A B C
1 1 1 2
2 1 1 2
3 1 1 2
1a) percent format: Just for fun, we could implement it using exactly the format given in the question.
fmt <- "A=%d,B=%d,C=%d"
pat <- gsub("%d", "(\\\\d+)", fmt)
Now run the read.pattern statement above.
2) strapply: Using the same input and the gsubfn package again, an alternative is to pull out all strings of digits, eliminating the need for the pat shown in (1) and reducing the pattern to just "\\d+".
DF <- strapply(L, "\\d+", as.numeric, simplify = data.frame)
names(DF) <- unlist(strsplit(L[1], "=.*?(,|$)"))
3) read.csv: Even simpler is this base-only solution, which deletes the headings and reads in what is left, setting the column names as in the prior solution. Again, omit the col.names argument if column names are not important.
read.csv(text = gsub("\\w*=", "", L), header = FALSE,
col.names = unlist(strsplit(L[1], "=.*?(,|$)")))
I have a string in which I want to find the repeated letters. For example,
A <- c('A-B-A-B-C', 'A-B-C-D', 'A-B-A-C-D-E-F', 'A-B-A-B')
I want to create a vector B which takes a value of 0 if there is no repetition of letters, 1 otherwise.
B <- c('1','0','1','1')
You can combine strsplit and anyDuplicated in base R to get close to what you want.
sapply(strsplit(A, "-", fixed=TRUE), anyDuplicated)
[1] 3 0 3 3
anyDuplicated returns the first index of the duplicated value. As #rich-scriven notes, adding fixed=TRUE to the strsplit call should increase efficiency, as it does literal matching (no regular expressions involved).
You could wrap this in pmin to get your desired result:
pmin(sapply(strsplit(A, "-", fixed=TRUE), anyDuplicated), 1)
[1] 1 0 1 1
or as #rich-scriven notes, use sign to convert the values.
sign(sapply(strsplit(A, "-", fixed=TRUE), anyDuplicated))
[1] 1 0 1 1
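A regex-only alternative sketch: a backreference flags any repeated letter without splitting at all (assumes single uppercase letters as in A; wrap the result in as.character() if you need the quoted '1'/'0' form from the question):

```r
A <- c('A-B-A-B-C', 'A-B-C-D', 'A-B-A-C-D-E-F', 'A-B-A-B')
# "([A-Z]).*\\1" matches any letter that reappears later in the string
as.integer(grepl("([A-Z]).*\\1", A))
#[1] 1 0 1 1
```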
I have strings with values as given below, separated by vertical bars.
String1 <- "5|10|25|25|10|10|10|5"
String2 <- "5|10|25|25"
Is there any direct function to get the sum of the numbers in a string? In this case it should be 100 for String1 and 65 for String2. I also have a character vector of such strings:
> chk
chk
1 5|10|25|25|10|10|10|5
2 5|55|20|5|5|5|5
3 6
4 Not Available
> sum(scan(text=gsub("\\Not Available\\b", "NA", chk$chk), sep="|", what = numeric(), quiet=TRUE), na.rm = TRUE)
[1] 206
when it should be
[1] 100 100 6 NA
We can do a scan and then sum:
sum(scan(text=String1, sep="|", what = numeric(), quiet=TRUE))
For multiple vectors, place it in a list and do the same operation
sapply(mget(paste0("String", 1:2)), function(x)
sum(scan(text=x, sep="|", what=numeric(), quiet=TRUE)))
# String1 String2
# 100 65
Another option is eval(parse( (not recommended, though) after replacing the | with +:
eval(parse(text=gsub("[|]", "+", String1)))
#[1] 100
Or, as #thelatemail mentioned in the comments, assign (<-) + to | and then do the eval(parse(..
`|` <- `+`
eval(parse(text=String1))
#[1] 100
If we have a data.frame column with strings, then it may be better to split by | into a list of vectors, convert the vectors to numeric (all the non-numeric elements coerce to NA with a friendly warning), and get the sum with na.rm=TRUE:
sapply(strsplit(as.character(chk$chk), "[|]"),
function(x) sum(as.numeric(x), na.rm=TRUE))
#[1] 100 100 6 0
NOTE: The as.character is not needed if the 'chk' column is already of character class.
Otherwise, if we are using scan or eval(parse, it should be done for each element.
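For example, a per-element scan() sketch for the 'chk' column (reading fields as character first so that "Not Available" coerces to NA instead of breaking a numeric scan):

```r
chk <- data.frame(chk = c("5|10|25|25|10|10|10|5", "5|55|20|5|5|5|5",
                          "6", "Not Available"))
# what = "" reads character fields; as.numeric turns non-numbers into NA
sapply(as.character(chk$chk), function(x)
  sum(suppressWarnings(as.numeric(scan(text = x, sep = "|",
                                       what = "", quiet = TRUE))),
      na.rm = TRUE),
  USE.NAMES = FALSE)
#[1] 100 100 6 0
```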
We can extract all the numbers from the string and then sum them:
library(stringr)
sum(as.numeric(unlist(str_match_all(String1, "[0-9]+"))))
#[1] 100
sum(as.numeric(unlist(str_match_all(String2, "[0-9]+"))))
#[1] 65
For multiple vectors we can keep it in a list
sapply(list(String1, String2), function(x)
sum(as.numeric(unlist(str_match_all(x, "[0-9]+")))))
#[1] 100 65
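Side note: str_match_all is designed for capture groups; str_extract_all is the more direct stringr verb for plain extraction and gives the same result here:

```r
library(stringr)
String1 <- "5|10|25|25|10|10|10|5"
String2 <- "5|10|25|25"
# str_extract_all returns all digit runs per string; unlist + sum as before
sapply(list(String1, String2), function(x)
  sum(as.numeric(unlist(str_extract_all(x, "[0-9]+")))))
#[1] 100 65
```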