My goal is to sort the elements of a character vector in R:
"+5x^{5}" "-2x^{3}" "5x^{7}" "0" "1"
I want to get this order:
"5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
So exponents from highest to lowest, then numerical order for the numbers.
How can I achieve this? Sorting the plain numbers in decreasing order is clear. For the terms with exponents, I would need to detect whether the string contains an x, extract the exponent, and order by it, but I don't know how to do that.
A more verbose option is to extract the exponents and the multipliers and then use arrange. This has the advantage of having these numbers ready if you need to use them.
library(stringr)
library(dplyr)
dat <- data.frame(x = c("+5x^{5}", "-2x^{3}", "5x^{7}", "0", "1"))
dat |> mutate(m = as.numeric(str_match(x, "([+-]?\\d+)x\\^\\{(\\d+)\\}")[, 2]),
              exp = as.numeric(str_match(x, "([+-]?\\d+)x\\^\\{(\\d+)\\}")[, 3]),
              n = as.numeric(x)) |>
  arrange(desc(exp), desc(n))
Output
#> x m exp n
#> 1 5x^{7} 5 7 NA
#> 2 +5x^{5} 5 5 NA
#> 3 -2x^{3} -2 3 NA
#> 4 1 NA NA 1
#> 5 0 NA NA 0
Created on 2022-06-13 by the reprex package (v2.0.1)
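Base R solution (the extracted key is converted with as.numeric so that multi-digit exponents also sort correctly):
x[
  order(
    as.numeric(
      gsub(
        ".*\\{(\\d+)\\}.*",
        "\\1",
        x
      )
    ),
    decreasing = TRUE
  )
]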
Input data:
x <- c(
"+5x^{5}",
"-2x^{3}",
"5x^{7}",
"0",
"1"
)
Well... it works
> x=c("+5x^{5}","-2x^{3}","5x^{7}","0","1")
> x[order(gsub("(.*\\^\\{)(.+)(\\}.*)","\\2",x),decreasing=TRUE)]
[1] "5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
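The regex (.*\\^\\{)(.+)(\\}.*) has three capture groups:
(.*\\^\\{) matches everything up to and including ^{ (group 1),
(.+) matches whatever is inside the curly brackets (group 2),
(\\}.*) matches the closing } and everything after it (group 3).
The replacement \\2 keeps only the contents of the second group, i.e. the exponent,
which is what we use to order the elements of the character vector. Strings without ^{...}, such as "0" and "1", do not match and are returned unchanged. Note that the keys are compared as character strings, so this works for single-digit exponents, but an exponent of 10 would be placed after 7.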
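Both base R answers above share one key for exponents and plain numbers, so a plain number such as 9 would sort ahead of x^{7}. A minimal sketch of one way around that, assuming every element is either a plain number or contains a single ^{...} part (the extra element "3x^{12}" and the helper names has_exp, expo and num are made up for illustration):

```r
x <- c("+5x^{5}", "-2x^{3}", "5x^{7}", "0", "1", "3x^{12}")

# TRUE for elements that carry an exponent of the form ^{digits}
has_exp <- grepl("\\^\\{\\d+\\}", x)

# Exponent as a number; left at 0 for plain numbers (only compared within their own group)
expo <- numeric(length(x))
expo[has_exp] <- as.numeric(sub(".*\\^\\{(\\d+)\\}.*", "\\1", x[has_exp]))

# Numeric value of the plain numbers; left at 0 for the terms with an exponent
num <- numeric(length(x))
num[!has_exp] <- as.numeric(x[!has_exp])

# Terms with an exponent first, then decreasing exponent, then decreasing value
sorted <- x[order(!has_exp, -expo, -num)]
sorted
# [1] "3x^{12}" "5x^{7}"  "+5x^{5}" "-2x^{3}" "1"       "0"
```

Because the first sort key separates the two groups, the placeholder zeros never mix exponents with plain numbers.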
Hi, I'm running the same calculations for 98 countries and occasionally need to take by(df$var, df$factor, sum). I create a segment factor variable with the cut function and need to calculate the sum by segment at a later point. This works fine, but for countries where the top segment is empty I get an NA for that segment in the sum. Is there a better way to avoid this than replacing the NAs with zero in an additional command afterwards? I want to keep the length of the output.
MWE where I get an NA for factor level "C" in df1:
df1 <- data.frame(val = rep(1:3, 4),
                  factor = cut(rep(1:3, 4), breaks = c(1, 2, 3, 4), include.lowest = TRUE, ordered_result = TRUE, labels = LETTERS[1:3]))
df2 <- data.frame(val = rep(1:4, 3),
                  factor = cut(rep(1:4, 3), breaks = c(1, 2, 3, 4), include.lowest = TRUE, ordered_result = TRUE, labels = LETTERS[1:3]))
by(df1$val,df1$factor,sum)
by(df2$val,df2$factor,sum)
You can use the droplevels function to drop the unused levels of the factor, so that by only reports sums for levels that actually occur:
by(df1$val,droplevels(df1$factor),sum)
droplevels(df1$factor): A
[1] 12
-------------------------------------------------------------------------------
droplevels(df1$factor): B
[1] 12
Or you can use ifelse to replace the NA with 0 (a numeric 0, so the sums stay numeric):
x <- by(df1$val,df1$factor,sum)
x <- ifelse(is.na(x), 0, x)
print(x)
df1$factor
 A  B  C 
12 12  0 
You can also convert the factor with as.numeric, which likewise drops the empty level:
by(df1$val,as.numeric(df1$factor),sum)
as.numeric(df1$factor): 1
[1] 12
-------------------------------------------------------------------------------
as.numeric(df1$factor): 2
[1] 12
# Mike's suggestion: convert with as.character instead
by(df1$val,as.character(df1$factor),sum)
as.character(df1$factor): A
[1] 12
-------------------------------------------------------------------------------
as.character(df1$factor): B
[1] 12
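If the goal is to keep one entry per level and get 0 rather than NA for the empty ones, another option, not mentioned in the answers above, is tapply with its default argument (available since R 3.4.0). A minimal sketch that rebuilds df1 from the question; the name sums is only for illustration:

```r
# Same construction as df1 in the question: level "C" has no observations
vals <- rep(1:3, 4)
df1 <- data.frame(
  val = vals,
  factor = cut(vals, breaks = c(1, 2, 3, 4), include.lowest = TRUE,
               ordered_result = TRUE, labels = LETTERS[1:3])
)

# default = 0 initialises the result array, so empty levels get 0 instead of NA
sums <- tapply(df1$val, df1$factor, sum, default = 0)
sums
# A = 12, B = 12, C = 0
```

The result keeps all three levels, so its length is the same for every country.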
I have a data frame like this:
v2 v3
1.000 2:3,3:2,5:2,
2.012 1:5,2:4,6:3,
The second column v3, consists of 'index-value' pairs, each pair separated by a ,.
Within each 'index-value' pair, the number preceding the : is the vector index. The number after the : is the corresponding value. E.g. in the first row, the vector indices are 2, 3, and 5, and the corresponding values are 3, 2, and 2.
Indices not represented in the string should have the value 0 in the resulting vector.
I wish to convert the 'index-value' vector to a vector of values.
Thus, for the two strings above the expected result is:
v2 v3
1.000 c(0,3,2,0,2,0)
2.012 c(5,4,0,0,0,3)
We make use of the data.table package just to use its tstrsplit function. It removes an intermediate step. Try this:
require(data.table)
df$v3 <- lapply(
  lapply(strsplit(as.character(df$v3), ",", fixed = TRUE), tstrsplit, ":"),
  function(x) {
    res <- numeric(6)
    res[as.numeric(x[[1]])] <- as.numeric(x[[2]])
    res
  })
# v2 v3
#1 1.000 0,3,2,0,2,0
#2 2.012 5,4,0,0,0,3
We first split each element of v3 at the commas (,).
We then split each piece again, using : as the separator.
We create a numeric vector of length 6, filled with zeros.
We finally assign the values at the given indices; all other positions stay 0.
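The same four steps can be traced on a single string in base R. This is only an illustrative sketch: the length 6 is taken from the expected output in the question, and the names s, pairs, parts, idx, val and res are made up here.

```r
s <- "2:3,3:2,5:2,"                           # first value of v3 in the question

# Step 1: split at the commas (the trailing comma leaves no empty piece)
pairs <- strsplit(s, ",", fixed = TRUE)[[1]]  # "2:3" "3:2" "5:2"

# Step 2: split each pair at the colon into index and value
parts <- strsplit(pairs, ":", fixed = TRUE)
idx <- as.integer(sapply(parts, `[`, 1))
val <- as.numeric(sapply(parts, `[`, 2))

# Steps 3 and 4: zero vector of length 6, then fill in the values by index
res <- numeric(6)
res[idx] <- val
res
# [1] 0 3 2 0 2 0
```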
I would suggest going with an approach like the one suggested by @nicola; however, for fun, here's an alternative.
Use read.dcf, which is used to read "tag:value" type data. To get all the "tags", use the fields argument. You've specified this as 1:6 in your comment to @nicola. Also, you need to replace your "," with newline characters ("\n").
We'll store all of this in a string so that deparse + textConnection will be able to handle it. Not necessary for this example, but just in case....
str <- gsub(",", "\n", mydf$v3)
x <- read.dcf(textConnection(str), fields = as.character(1:6))
x <- replace(x, is.na(x), 0)
x
# 1 2 3 4 5 6
# [1,] "0" "3" "2" "0" "2" "0"
# [2,] "5" "4" "0" "0" "0" "3"
To get it back in your data.frame as a list of numeric vectors, do this:
mydf$v3_l <- lapply(1:nrow(x), function(y) as.numeric(x[y, ]))
Here's the resulting str:
str(mydf)
'data.frame': 2 obs. of 3 variables:
$ v2 : num 1 2.01
$ v3 : chr "2:3,3:2,5:2," "1:5,2:4,6:3,"
$ v3_l:List of 2
..$ : num 0 3 2 0 2 0
..$ : num 5 4 0 0 0 3
Here's another approach using only base functions.
First the string is split (strsplit) at : or ,. Elements at odd positions correspond to indices, and elements at even positions to values. We pre-allocate a numeric vector whose length is the maximum index.
In the lapply loop, we assign the values of the split vector (i.e. the even elements; x[c(FALSE, TRUE)]) to the pre-allocated vector vec, at the indices (i.e. the odd elements of the split vector; x[c(TRUE, FALSE)]).
l <- strsplit(df$v3, "[:,]")
vec <- numeric(length = max(as.integer(unlist(l)[c(TRUE, FALSE)])))
df$v3 <- lapply(l, function(x){
x <- as.numeric(x)
vec[x[c(TRUE, FALSE)]] <- x[c(FALSE, TRUE)]
vec
})
df
# v2 v3
# 1 1.000 0, 3, 2, 0, 2, 0
# 2 2.012 5, 4, 0, 0, 0, 3
If I have a list for example:
Userid Total
Apple1 12
Apple2 8
Apple3 15
Apple4 3
Apple5 4
Apple6 6
Apple7 20
Apple8 22
Apple9 5
Apple10 11
Orange1 15
Orange2 8
but I want to do calculations on all Apple items together. How do I strip the numbers from the end? I have code that works if the number is a single digit, but I do not know what to do when it has two digits.
I currently am using:
substr(userid, 1, nchar(userid)-1)
which works for Apple1 to Apple9, but Apple10 then becomes Apple1. Any suggestions on what to do?
Try gsub to remove all digits:
x <- c("Apple10", "Apple3", "Orange123")
gsub("[0-9]", "", x)
#[1] "Apple" "Apple" "Orange"
This means: check each element of x and replace every digit with nothing.
Or, if your data is in a data.frame called df:
df$Userid <- gsub("[0-9]", "",df$Userid)
Now you can proceed with your calculations as you wish.
Using the stringr package, and a different approach:
library(stringr)
x <- c("Apple10", "Apple3", "Orange123")
str_replace_all(str = x, pattern = "\\d{1,3}$", replacement = "")
[1] "Apple" "Apple" "Orange"
The pattern to be replaced by "" is 1 to 3 digits at the end of a string.
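Since the question is ultimately about calculations per group, here is a minimal base R sketch that strips the trailing digits and then sums Total by group. The data frame is a shortened copy of the list in the question, and the column name Group and the object totals are assumptions made for illustration.

```r
df <- data.frame(
  Userid = c("Apple1", "Apple2", "Apple10", "Orange1", "Orange2"),
  Total  = c(12, 8, 11, 15, 8)
)

# Remove one or more digits at the end of the string (any number of digits)
df$Group <- sub("[0-9]+$", "", df$Userid)

# Sum Total within each group
totals <- aggregate(Total ~ Group, data = df, FUN = sum)
totals
#    Group Total
# 1  Apple    31
# 2 Orange    23
```

Anchoring the pattern with $ leaves digits elsewhere in the name untouched, which gsub("[0-9]", "", x) would not.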
I have a list of strings which contain random characters such as:
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
I'd like to know which numbers are present at least once (unique()) in this list. The solution of my example is:
solution: c(7,667,11,5,2)
If someone has a method that does not consider 11 as "eleven" but as "one and one", it would also be useful. The solution in this condition would be:
solution: c(7,6,1,5,2)
(I found this post on a related subject: Extracting numbers from vectors of strings)
For the second answer, you can use gsub to remove everything from the string that's not a number, then split the string as follows:
unique(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(ll)), ""))))
# [1] 7 6 1 5 2
For the first answer, similarly using strsplit,
unique(na.omit(as.numeric(unlist(strsplit(unlist(ll), "[^0-9]+")))))
# [1] 7 667 11 5 2
PS: don't name your variable list (as there's an inbuilt function list). I've named your data as ll.
Here is yet another answer, this one using gregexpr to find the numbers, and regmatches to extract them:
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
temp1 <- gregexpr("[0-9]", l) # Individual digits
temp2 <- gregexpr("[0-9]+", l) # Numbers with any number of digits
as.numeric(unique(unlist(regmatches(l, temp1))))
# [1] 7 6 1 5 2
as.numeric(unique(unlist(regmatches(l, temp2))))
# [1] 7 667 11 5 2
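A solution using stringi:
library(stringi)
# extract the numbers:
nums <- stri_extract_all_regex(unlist(list), "[0-9]+")
# Make vector and get unique numbers:
nums <- unlist(nums)
nums <- unique(nums)
And that's your first solution (as character strings; wrap it in as.numeric if you need numbers).
For the second solution, substr(x, 1, 1) would only keep the first digit of each number, so I would split the numbers into single digits with strsplit instead:
nums_digits <- unique(unlist(strsplit(nums, "")))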
You could use ?strsplit (as suggested in @Arun's answer in Extracting numbers from vectors (of strings)):
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
## split string at non-digits
s <- strsplit(l, "[^[:digit:]]")
## convert strings to numeric ("" become NA)
solution <- as.numeric(unlist(s))
## remove NA and duplicates
solution <- unique(solution[!is.na(solution)])
# [1] 7 667 11 5 2
A stringr solution with str_match_all and piped operators. For the first solution:
library(stringr)
str_match_all(ll, "[0-9]+") %>% unlist %>% unique %>% as.numeric
Second solution:
str_match_all(ll, "[0-9]") %>% unlist %>% unique %>% as.numeric
(Note: I've also called the list ll)
Use strsplit with a pattern that matches anything other than the digits 0-9.
For the example you have provided, do this:
tmp <- sapply(list, function (k) strsplit(k, "[^0-9]"))
Then simply take the union of all 'sets' in the list, like so:
tmp <- Reduce(union, tmp)
Then you only have to remove the empty string.
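Put together, including the last step of removing the empty string, the approach might look like the sketch below. It uses a character vector l instead of the list from the question and "[^0-9]+" as the split pattern; the names pieces, all_pieces and nums are made up for illustration.

```r
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")

# Split at runs of non-digits; a leading non-digit run leaves an empty string
pieces <- strsplit(l, "[^0-9]+")

# Union of all pieces across the elements (duplicates are dropped)
all_pieces <- Reduce(union, pieces)

# Drop the empty string and convert to numbers
nums <- as.numeric(all_pieces[all_pieces != ""])
nums
# [1]   7 667  11   5   2
```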
Check out the str_extract_numbers() function from the strex package.
pacman::p_load(strex)
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
charvec <- unlist(list)
print(charvec)
#> [1] "djud7+dg[a]hs667" "7fd*hac11(5)" "2tu,g7gka5"
str_extract_numbers(charvec)
#> [[1]]
#> [1] 7 667
#>
#> [[2]]
#> [1] 7 11 5
#>
#> [[3]]
#> [1] 2 7 5
unique(unlist(str_extract_numbers(charvec)))
#> [1] 7 667 11 5 2
Created on 2018-09-03 by the reprex package (v0.2.0).
I don't find the help page for the replace function from the base package to be very helpful. Worst part, it has no examples which could help understand how it works.
Could you please explain how to use it? An example or two would be great.
If you look at the function (by typing its name at the console) you will see that it is just a simple functionalized version of the [<- function, which is described at ?"[". [ is a rather basic function in R, so you would be well advised to look at that page for further details. Especially important is learning that the index argument (the second argument in replace) can be a logical, numeric or character vector. Recycling will occur when the second and third arguments differ in length.
You should "read" the function call as: "within the first argument, use the second argument as an index for placing the values of the third argument into the first":
> replace( 1:20, 10:15, 1:2)
[1] 1 2 3 4 5 6 7 8 9 1 2 1 2 1 2 16 17 18 19 20
Character indexing for a named vector:
> replace(c(a=1, b=2, c=3, d=4), "b", 10)
a b c d
1 10 3 4
Logical indexing:
> replace(x <- c(a=1, b=2, c=3, d=4), x>2, 10)
a b c d
1 2 10 10
You can also use logical tests
x <- data.frame(a = c(0,1,2,NA), b = c(0,NA,1,2), c = c(NA, 0, 1, 2))
x
x$a <- replace(x$a, is.na(x$a), 0)
x
x$b <- replace(x$b, x$b==2, 333)
Here are two simple examples:
> x <- letters[1:4]
> replace(x, 3, 'Z') #replacing 'c' by 'Z'
[1] "a" "b" "Z" "d"
>
> y <- 1:10
> replace(y, c(4,5), c(20,30)) # replacing 4th and 5th elements by 20 and 30
[1] 1 2 3 20 30 6 7 8 9 10
Be aware that in the examples given above the third parameter (values) is a constant (e.g. 'Z' or c(20,30)).
Defining the third parameter using values from the data frame itself can lead to confusion.
E.g. with a simple data frame such as this (using dplyr::data_frame):
tmp <- data_frame(a=1:10, b=sample(LETTERS[24:26], 10, replace=TRUE))
This will create something like this:
a b
(int) (chr)
1 1 X
2 2 Y
3 3 Y
4 4 X
5 5 Z
..etc
Now suppose you wanted to multiply the values in column 'a' by 2, but only where column 'b' is "X". My immediate thought would be something like this:
with(tmp, replace(a, b=="X", a*2))
That will not provide the desired outcome, however. The a*2 is evaluated once, as a full-length vector, before the replacement happens; it is not re-evaluated for each matching row. The vector 'a*2' will thus be
[1] 2 4 6 8 10 12 14 16 18 20
at the start of the 'replace' operation. Its elements are then used in order: in the first row where 'b' equals "X", the value in 'a' is replaced by 2; in the second such row, by 4; and so on. The value is not replaced by two times the value of 'a' in that particular row. R may also warn that the number of items to replace is not a multiple of the replacement length.
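One way to get the intended result is to subset the replacement values with the same condition that selects the positions. A minimal sketch, using plain vectors and a fixed b instead of sample() so that the result is reproducible (the names doubled and doubled2 are made up for illustration):

```r
a <- 1:10
b <- c("X", "Y", "Y", "X", "Z", "Y", "X", "Z", "Y", "X")  # fixed, not sampled

# The values are taken from the same rows that are being replaced
doubled <- replace(a, b == "X", a[b == "X"] * 2)
doubled
# [1]  2  2  3  8  5  6 14  8  9 20

# Equivalent element-wise formulation
doubled2 <- ifelse(b == "X", a * 2, a)
```

Here the replacement vector has exactly as many elements as there are matching rows, so nothing is recycled or truncated.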
Here's an example where I found the replace( ) function helpful for giving me insight. The problem required changing a long integer vector into a character vector, with its integers replaced by given character values.
## figuring out replace( )
(test <- c(rep(1,3),rep(2,2),rep(3,1)))
which looks like
[1] 1 1 1 2 2 3
and I want to replace every 1 with an A and 2 with a B and 3 with a C
letts <- c("A","B","C")
so in my own secret little "dirty-verse" I used a loop
for (i in 1:3) {
  test <- replace(test, test == i, letts[i])
}
which did what I wanted
test
[1] "A" "A" "A" "B" "B" "C"
In the first sentence I purposefully left out that the real objective was to make the big vector of integers a factor vector and assign the integer values (levels) some names (labels).
So another way of doing the replace( ) application here would be
(test <- factor(test,labels=letts))
[1] A A A B B C
Levels: A B C
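When the integers are exactly the positions 1, 2, 3 of the labels, as they are here, a third option is to use the integer vector directly as an index into letts. A small sketch, starting again from the original numeric test (the name mapped is made up for illustration):

```r
test <- c(rep(1, 3), rep(2, 2), rep(3, 1))
letts <- c("A", "B", "C")

# Each value of test picks the label at that position in letts
mapped <- letts[test]
mapped
# [1] "A" "A" "A" "B" "B" "C"
```

This only works as a lookup because the values coincide with positions; for arbitrary codes a named vector or factor() with explicit levels would be needed.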