I have a vector as below:
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
I want to extract the number before the "X" in each element.
When an element contains two "X"s, e.g. "6X2X75ML", the result should be the product of the two numbers: 12 (6 multiplied by 2).
Expected output:
6, 24, 12, 168
Thank you for the help...
Here's a possible solution using regular expressions :
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
# this regular expression finds every group of digits followed
# by an upper-case 'X' in each string and returns a list of the matches
tokens <- regmatches(data,gregexpr('[[:digit:]]+(?=X)',data,perl=TRUE))
res <- sapply(tokens,function(x)prod(as.numeric(x)))
> res
[1] 6 24 12 168
Here is a method using base R:
dataList <- strsplit(data, split="X")
sapply(dataList, function(x) Reduce("*", as.numeric(head(x, -1))))
[1] 6 24 12 168
strsplit breaks up each element along "X". The resulting list is fed to sapply, which then performs an operation on all but the final element of each vector in the list. The operation is to convert the elements to numeric and then multiply them. The final element is dropped using head(x, -1).
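To make the steps above concrete, here is a small sketch that prints the intermediate values for the third element only. The name counts is introduced just for illustration.

```r
data <- c("6X75ML", "24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")

# Split every element at the literal "X"
dataList <- strsplit(data, split = "X")

dataList[[3]]                # the pieces of "6X2X75ML"
head(dataList[[3]], -1)      # everything except the trailing size ("75ML")

# Convert the remaining pieces to numbers and multiply them
counts <- as.numeric(head(dataList[[3]], -1))
prod(counts)
```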
As @zheyuan-li comments, prod can fill in for Reduce and will probably be a bit faster:
sapply(dataList, function(x) prod(as.numeric(head(x, -1))))
[1] 6 24 12 168
We can also use str_extract_all
library(stringr)
sapply(str_extract_all(data, "\\d+(?=X)"), function(x) prod(as.numeric(x)))
#[1] 6 24 12 168
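One edge case worth knowing about: with the regex approaches, an element that contains no "X" produces an empty match, and prod() of an empty vector is 1. The sketch below wraps the base R regex version in a hypothetical helper, extract_count, that returns NA for such elements instead. Whether NA or 1 is the right answer depends on your data, so treat this as an assumption.

```r
# Hypothetical helper: product of all numbers that directly precede an "X",
# or NA when the element has no such number.
extract_count <- function(x) {
  tokens <- regmatches(x, gregexpr("[0-9]+(?=X)", x, perl = TRUE))
  vapply(
    tokens,
    function(tok) if (length(tok) > 0) prod(as.numeric(tok)) else NA_real_,
    numeric(1)
  )
}

res <- extract_count(c("6X75ML", "6X2X75ML", "75ML"))
res
```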
ind=regexpr("X",data)
val=as.integer(substr(data, 1, ind-1))
data2=substring(data,ind+1)
ind2=regexpr("[0-9]+X", data2)
if (!all(ind2!=1)) {
val2 = as.integer(substr(data2[ind2==1], 1, attr(ind2,"match.length")[ind2==1]-1))
val[ind2==1] = val[ind2==1] * val2
}
I'm trying to learn R and a sample problem is asking to only reverse part of a string that is in alphabetical order:
String: "abctextdefgtext"
StringNew: "cbatextgfedtext"
Is there a way to identify alphabetical patterns to do this?
Here is one approach with base R based on the patterns shown in the example. We split the string into individual characters ('v1'), use match to find each character's position in the alphabet (letters), take the difference between adjacent positions and check whether it is not equal to 1; the cumulative sum of that logical vector gives a grouping variable ('i1'). Using the grouping variable, we reverse (rev) the characters within each group with ave. Finally, paste the characters together to get the expected output:
v1 <- strsplit(str1, "")[[1]]
i1 <- cumsum(c(TRUE, diff(match(v1, letters)) != 1L))
paste(ave(v1, i1, FUN = rev), collapse="")
#[1] "cbatextgfedtext"
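For readers who want to see how the grouping works, this sketch prints the intermediate vectors. The name pos is added here for illustration; everything else follows the code above.

```r
str1 <- "abctextdefgtext"
v1 <- strsplit(str1, "")[[1]]

# Alphabet position of each character (a = 1, b = 2, ...)
pos <- match(v1, letters)
pos

# A new group starts wherever the step from the previous letter is not +1
i1 <- cumsum(c(TRUE, diff(pos) != 1L))
i1

# Characters of the sixth group, i.e. the run "d", "e", "f", "g"
split(v1, i1)[["6"]]
```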
Or as @alexislaz mentioned in the comments:
v1 = as.integer(charToRaw(str1))
rawToChar(as.raw(ave(v1, cumsum(c(TRUE, diff(v1) != 1L)), FUN = rev)))
#[1] "cbatextgfedtext"
EDIT:
1) A mistake was corrected based on @alexislaz's comments
2) Updated with another method suggested by @alexislaz in the comments
data
str1 <- "abctextdefgtext"
You could do this in base R
vec <- match(unlist(strsplit(s, "")), letters)
x <- c(0, which(diff(vec) != 1), length(vec))
newvec <- unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i]+1):x[i+1]])))
paste0(letters[newvec], collapse = "")
#[1] "cbatextgfedtext"
Where s <- "abctextdefgtext"
First you find the positions of each letter in the sequence of letters ([1] 1 2 3 20 5 24 20 4 5 6 7 20 5 24 20)
Having the positions in hand, you look for consecutive numbers and, when found, reverse that sequence. ([1] 3 2 1 20 5 24 20 7 6 5 4 20 5 24 20)
Finally, you get the letters back in the last line.
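If you need this more than once, the same idea can be wrapped in a function. This is only a sketch: reverse_runs is a made-up name, and it assumes the input is a single string of lower-case letters (anything else gives NA from match and breaks the grouping).

```r
# Hypothetical helper: reverse every run of consecutive alphabet letters.
# Assumes a single string made up of lower-case letters only.
reverse_runs <- function(s) {
  chars <- strsplit(s, "")[[1]]
  pos <- match(chars, letters)
  run_id <- cumsum(c(TRUE, diff(pos) != 1L))
  paste(ave(chars, run_id, FUN = rev), collapse = "")
}

reverse_runs("abctextdefgtext")
reverse_runs("xyz")
```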
I have a vector:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
Each element contains two numbers separated by ",". I would like to get the indexes of the elements in which either number is 1.
So the expected index list is:
1, 6, 7, 9, 10
grep() will work nicely for this. By default, it returns the indices of the matched pattern.
grep("^1,|,1$", lst)
# [1] 1 6 7 9 10
The regular expression ^1,|,1$ looks to match a string that
^1, = starts with 1,
| OR
,1$ = ends with ,1
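A quick check of the pattern on the tricky elements shows why the anchors matter: "11,0" and "7,10" contain the character 1 but not the number 1. The word-boundary pattern at the end is an alternative that I am adding as an assumption, not part of the answer above; on this vector it gives the same indexes.

```r
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")

# Elements that merely contain the digit 1 are rejected by the anchored pattern
grepl("^1,|,1$", c("11,0", "7,10", "1,0", "0,1"))

# Alternative (assumption): match a 1 that stands alone between word boundaries
anchored <- grep("^1,|,1$", lst)
bounded  <- grep("\\b1\\b", lst, perl = TRUE)
identical(anchored, bounded)
```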
Each element contains two numbers. My answer is not ideal, but I got what I need:
m <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",1)))
n <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",2)))
sort(unique(c(which(m==1),which(n==1))))
Depending on background and context of this task it might be prudent to turn this vector into a data.frame:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
DF <- read.table(text = do.call(paste, list(lst, collapse = "\n")), sep = ",")
which(DF$V1 == 1L | DF$V2 == 1L)
#[1] 1 6 7 9 10
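A middle ground between the two answers above is to split each element once and test the numeric values directly. This is a sketch under the assumption that every element holds comma-separated numbers; parts and has_one are names made up for illustration.

```r
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")

# Split each element at the comma (fixed = TRUE: no regex needed)
parts <- strsplit(lst, ",", fixed = TRUE)

# TRUE where any of the numbers in the element equals 1
has_one <- vapply(parts, function(p) any(as.numeric(p) == 1), logical(1))
which(has_one)
```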
I have a data frame with several columns. One of those contains Plotids like AEG1, AEG2,..., AEG50, HEG1, HEG2,..., HEG50, SEG1, SEG2,..., SEG50. So, the data frame has 150 rows. Now I want to change only some of these Plotids, so that there is AEG01, AEG02,... instead of AEG1, AEG2, ... So, I just want to add a "0" to some of the column entries. I tried it by using lapply, a for loop, writing a function,... but nothing did the job. There was always the error message:
In if (nchar(as.character(dat_merge$EP_Plotid)) == 4)
paste(substr(dat_merge$EP_Plotid, ... :
the condition has length > 1 and only the first element will be used
So, this was my last try:
Plotid_func <- function(x) {
if(nchar(as.character(dat_merge$EP_Plotid))==4)
paste(substr(dat_merge$EP_Plotid, 1, 3), "0", substr(dat_merge$EP_Plotid, 4, 4), sep="")
}
dat_merge$Plotid <- sapply(dat_merge$EP_Plotid, Plotid_func)
With this, I wanted to select only those entries with four characters and add a 0 to just those entries. Can anybody help me? dat_merge is the name of my data frame and EP_Plotid is the column I want to edit. Thanks in advance.
Just extract the "string" portion and the "numeric" portion and paste them back together after using sprintf on the numeric portion.
An example:
## "x" is the "column" of plot ids. Here I go up to 12
## to demonstrate the zero padding that it sounds like
## you're looking for
x <- c(paste0("AEG", 1:12), paste0("HEG", 1:12))
## Extract the string values
Strings <- gsub("([A-Z]+)(.*)", "\\1", x)
## Extract the numeric values
Nums <- gsub("([A-Z]+)(.*)", "\\2", x)
## Put them back together
paste0(Strings, sprintf("%02d", as.numeric(Nums)))
# [1] "AEG01" "AEG02" "AEG03" "AEG04" "AEG05" "AEG06"
# [7] "AEG07" "AEG08" "AEG09" "AEG10" "AEG11" "AEG12"
# [13] "HEG01" "HEG02" "HEG03" "HEG04" "HEG05" "HEG06"
# [19] "HEG07" "HEG08" "HEG09" "HEG10" "HEG11" "HEG12"
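Applied to a data frame column like the one in the question, the same idea might look as follows. This is a sketch with a small made-up dat_merge; prefix and number are illustrative names, and it assumes every id is letters followed by digits.

```r
# Small stand-in for the real data frame (assumption)
dat_merge <- data.frame(
  EP_Plotid = c("AEG1", "AEG2", "AEG50", "HEG1"),
  stringsAsFactors = FALSE
)

# Letters: drop the trailing digits. Number: drop the leading letters.
prefix <- sub("[0-9]+$", "", dat_merge$EP_Plotid)
number <- as.integer(sub("^[A-Za-z]+", "", dat_merge$EP_Plotid))

# "%02d" pads the number to two digits with a leading zero
dat_merge$Plotid <- sprintf("%s%02d", prefix, number)
dat_merge
```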
Or you can just modify your function to actually use the input variable x (which is not happening in your original function)
dat_merge <- data.frame(EP_Plotid = c("AEG1", "AEG2", "AEG50", "HEG1", "HEG2", "HEG50", "SEG1", "SEG2", "SEG50"))
Plotid_func <- function(x) {
if(nchar(as.character(x)) == 4){
paste(substr(x, 1, 3), "0", substr(x, 4, 4), sep="")
} else as.character(x)
}
dat_merge$Plotid <- sapply(dat_merge$EP_Plotid, Plotid_func)
dat_merge
# EP_Plotid Plotid
# 1 AEG1 AEG01
# 2 AEG2 AEG02
# 3 AEG50 AEG50
# 4 HEG1 HEG01
# 5 HEG2 HEG02
# 6 HEG50 HEG50
# 7 SEG1 SEG01
# 8 SEG2 SEG02
# 9 SEG50 SEG50
A vectorized version of your function (which is much better than using sapply which is just a for loop) would be
dat_merge$Plotid <- ifelse(nchar(as.character(dat_merge$EP_Plotid))==4, paste(substr(dat_merge$EP_Plotid, 1, 3), "0", substr(dat_merge$EP_Plotid, 4, 4), sep=""), as.character(dat_merge$EP_Plotid))
Or use a combination of formatC with str_extract from library(stringr)
library(stringr)
Using x from Ananda's post:
1) Extract the letters and the numbers separately.
2) Pad the numbers with leading zeros using formatC.
3) paste together.
paste0(str_extract(x, "[[:alpha:]]+"), formatC(as.numeric(str_extract(x,"\\d+")), width=2, flag="0"))
#[1] "AEG01" "AEG02" "AEG03" "AEG04" "AEG05" "AEG06" "AEG07" "AEG08" "AEG09"
#[10] "AEG10" "AEG11" "AEG12" "HEG01" "HEG02" "HEG03" "HEG04" "HEG05" "HEG06"
#[19] "HEG07" "HEG08" "HEG09" "HEG10" "HEG11" "HEG12"
I know this should be simple but I just can't do it...I have a data frame called data that works nicely and does what I want it to with the correct column headers and everything. I can call colSums() to get a list of 21 numbers which are the sums of each column.
> a <- colSums(data,na.rm = TRUE)
> names(a) <- NULL
> a
[1] 1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80
[14] 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
The problem is I need a list with the first number alone, the sum of the next two, sum of the next 3, sum of the next 4 etc. until I run out of numbers. I imagine it would look something like this:
c(sum(a[1]),sum(a[2:3]),sum(a[4:6])... etc.
Any help or a different way to do this would be greatly appreciated!
Thank you.
You should only need to go out to something on the order of sqrt(length(vector)). The seq function lets you specify a start integer and a length, so sending a sequence of integers x to seq(1+x*(x-1)/2, length=x) should create the right set of index sequences. It wasn't clear whether an incomplete sequence at the end should return a result or NA, so I put in na.rm=TRUE; you might decide otherwise. (You did not illustrate a data frame but rather an ordinary numeric vector.)
sumsegs <- function(vec) sapply(1:sqrt(2*length(vec)), function(x)
sum( vec[seq(1+x*(x-1)/2, length=x)], na.rm=TRUE) )
a <- scan()
1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
# 22: enter carriage return to stop scan input
#Read 21 items
sumsegs(a)
#[1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3
I'm not exactly sure what the right upper limit is on the number to send to the inner function. sqrt(length(vec)) is too short, but sqrt(2*length(vec)) seems to be "working" at lower numbers anyway.
> sapply( sapply(1:sqrt(2*100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105
> sapply( sapply(1:sqrt(100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55
This returns the last index in the sequences so formed; making the factor 2.1 rather than 2 corrects minor deficiencies for lengths in the range 500-1000:
tail(lapply( sapply(1:sqrt(2.1*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 528
tail(lapply( sapply(1:sqrt(2*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 496
Going higher did not seem to degrade the "times 2" correction. There's probably some kewl number theory explanation for this.
tail(lapply( sapply(1:sqrt(2*100000), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 100128
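The explanation is probably just triangular numbers: k complete groups of sizes 1, 2, ..., k use k*(k+1)/2 elements, so the number of complete groups in a vector of length n is the largest k with k*(k+1)/2 <= n. Solving the quadratic gives the closed form below. The function name n_groups is made up for this sketch; floor(sqrt(2*n)) lands on either this k or k + 1, which would explain why the "times 2" factor works and sometimes picks up a partial last group.

```r
# Hypothetical helper: number of complete groups of sizes 1, 2, 3, ...
# that fit into a vector of length len.
n_groups <- function(len) floor((sqrt(8 * len + 1) - 1) / 2)

n_groups(21)      # complete groups in the 21-element example
n_groups(100)
n_groups(500)

# Compare with the sqrt(2 * n) heuristic used above
floor(sqrt(2 * c(21, 100, 500)))
```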
Alternatively a much more naive method is:
sums <- colSums(data)
n <- 0                  # the current group holds n + 1 elements
i <- 1                  # index of the first element of the current group
newVec <- numeric(0)
while (i <= length(sums)) {
  last <- min(i + n, length(sums))        # clip an incomplete final group
  newVec <- c(newVec, sum(sums[i:last]))
  i <- i + n + 1        # jump to the start of the next group
  n <- n + 1
}
Here's a similar approach, using rep(...) and by(...)
n <- (-1+sqrt(1+8*length(a)))/2 # number of groups
groups <- rep(1:n,1:n) # indexing vector
result <- as.vector(by(a,groups,sum))
result
# [1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3
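The formula for n above assumes that length(a) is exactly 1 + 2 + ... + n. If it is not, one option is to round the group count up and cut the indexing vector to the length of the data, so the last group is simply shorter. This is a sketch on a made-up 8-element vector; k is an illustrative name.

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8)   # 8 is not a triangular number

# Number of groups, counting a partial last group
k <- ceiling((sqrt(8 * length(a) + 1) - 1) / 2)

# 1, 2, 2, 3, 3, 3, 4, 4, ... truncated to the length of the data
groups <- rep(seq_len(k), seq_len(k))[seq_along(a)]

result <- as.vector(tapply(a, groups, sum))
result
```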
I've got the following task
Treatment$V010 <- as.numeric(substr(Treatment$V010,1,2))
Treatment$V020 <- as.numeric(substr(Treatment$V020,1,2))
[...]
Treatment$V1000 <- as.numeric(substr(Treatment$V1000,1,2))
I have 100 variables from $V010, $V020, $V030... to $V1000. Those are numbers of different length. I want to "extract" just the first two digits of the numbers and replace the old number with the new number which is two digits long.
My data frame "Treatment" has 80 more variables which I did not mention here, so my goal is for this function to be applied only to the 100 variables mentioned.
How can I do that? I could write that command 100 times but I am sure there is a better solution.
Alright, let's do it. First things first: as you want to get specific columns of your data frame, you need to specify their names to access them:
cnames = paste0('V',formatC(seq(10,1000,by=10), width = 3, format = "d", flag = "0"))
(cnames is a vector containing c('V010','V020', ..., 'V1000'))
Next, we will get their indexes:
coli=unlist(sapply(cnames, function (x) which(colnames(Treatment)==x)))
(coli is a vector containing the indexes in Treatment of the relevant columns)
Finally, we will apply your function over these columns:
Treatment[coli] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[coli])
Does it work?
PS: if anyone has a better/more concise way to do it, please tell me :)
EDIT:
The intermediate step is not useful, as you can already use the column names cnames to get the relevant columns, i.e.
Treatment[cnames] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[cnames])
(the only advantage of converting column names to column indexes is when some columns are missing from the data frame: in that case, Treatment['non existing column'] fails with the error "undefined columns selected")
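A somewhat more concise variant, sketched here on a small made-up Treatment, builds the names with sprintf, keeps only the ones that exist with intersect, and uses lapply so the result stays a list of columns. The name present is illustrative. One caveat to check on real data: substr works on the character form of the number, so very large round values that R prints in scientific notation (for example 1e+05) would not give the first two digits.

```r
# Small stand-in for the real data frame (assumption)
Treatment <- data.frame(V010 = c(1234, 56789), V020 = c(987, 65), other = c(100, 200))

# "V010", "V020", ..., "V1000"
cnames <- sprintf("V%03d", seq(10, 1000, by = 10))

# Only the target columns that actually exist in the data frame
present <- intersect(cnames, names(Treatment))

# Keep the first two digits of each value in those columns
Treatment[present] <- lapply(Treatment[present], function(x) as.numeric(substr(x, 1, 2)))
Treatment
```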
A solution where relevant columns are selected based on a pattern that can be described with a regular expression.
Regex explanation:
^: Start of string
V: Literal V
\\d{2}: Exactly 2 digits
Treatment <- data.frame(V010 = c(120, 130), x010 = c(120, 130), xV1000 = c(111, 222), V1000 = c(111, 222))
Treatment
# V010 x010 xV1000 V1000
# 1 120 120 111 111
# 2 130 130 222 222
# columns with a name that matches the pattern (logical vector)
idx <- grepl(x = names(Treatment), pattern = "^V\\d{2}")
# substr the relevant columns
Treatment[ , idx] <- sapply(Treatment[ , idx], FUN = function(x){
as.numeric(substr(x, 1, 2))
})
Treatment
# V010 x010 xV1000 V1000
# 1 12 120 111 11
# 2 13 130 222 22
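It can help to test the pattern on the column names alone before modifying anything. The sketch below also shows a stricter variant that anchors the end of the name; both the extra name "V01abc" and the strict pattern are assumptions added for illustration.

```r
nms <- c("V010", "x010", "xV1000", "V1000", "V01abc")

# Pattern from the answer: a V at the start followed by two digits
loose <- grepl("^V\\d{2}", nms)
loose

# Stricter (assumption): V followed by three or four digits and nothing else
strict <- grepl("^V\\d{3,4}$", nms)
strict
```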