by factor sum with empty category - r

Hi, I'm running the same calculations for 98 countries and occasionally need to take by(df$var, df$factor, sum). I create a segment factor variable with the cut function and need to calculate the sum by segment at a later point. This works fine, but I have countries where the top segment is empty, and then I get an NA for the top segment in the sum. Is there a better way to avoid this than just replacing the NAs with zero in an additional command afterwards? I want to keep the length of the output.
MWE where I get an NA for factor level "C" in df1:
df1 <- data.frame(val = rep(1:3, 4),
                  factor = cut(rep(1:3, 4), breaks = c(1, 2, 3, 4), include.lowest = TRUE, ordered_result = TRUE, labels = LETTERS[1:3]))
df2 <- data.frame(val = rep(1:4, 3),
                  factor = cut(rep(1:4, 3), breaks = c(1, 2, 3, 4), include.lowest = TRUE, ordered_result = TRUE, labels = LETTERS[1:3]))
by(df1$val,df1$factor,sum)
by(df2$val,df2$factor,sum)

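You can use the droplevels function, which drops the unused levels of the factor so that only the non-empty groups are summed and printed. Note that this removes the empty level from the output rather than filling it with zero: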
by(df1$val,droplevels(df1$factor),sum)
droplevels(df1$factor): A
[1] 12
-------------------------------------------------------------------------------
droplevels(df1$factor): B
[1] 12
Or you can replace the NA with ifelse, which keeps all three levels:
x <- by(df1$val,df1$factor,sum)
x <- ifelse(is.na(x),0,x)
print(x)
df1$factor
 A  B  C
12 12  0
You can also use as.numeric, because converting the factor to its integer codes discards the unused level:
by(df1$val,as.numeric(df1$factor),sum)
as.numeric(df1$factor): 1
[1] 12
-------------------------------------------------------------------------------
as.numeric(df1$factor): 2
[1] 12
# Mike's suggestion: convert the factor to character
by(df1$val,as.character(df1$factor),sum)
as.character(df1$factor): A
[1] 12
-------------------------------------------------------------------------------
as.character(df1$factor): B
[1] 12
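If the goal is to keep every level and get 0 instead of NA in one step, one option is tapply(), whose default argument (available since R 3.4.0) sets the value used for empty cells. A minimal sketch that rebuilds df1 from the question; note that it returns a plain named array rather than a "by" object:

```r
# df1 as in the question: level "C" has no observations
df1 <- data.frame(val = rep(1:3, 4))
df1$factor <- cut(df1$val, breaks = c(1, 2, 3, 4), include.lowest = TRUE,
                  labels = LETTERS[1:3])

# default = 0 fills groups without observations instead of leaving NA
sums <- tapply(df1$val, df1$factor, sum, default = 0)
print(sums)
```

The result keeps one entry per factor level, so its length is the same for every country.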

How to order from highest to lowest in R

My goal is to order multiple variables in a list in R:
"+5x^{5}" "-2x^{3}" "5x^{7}" "0" "1"
I want to get this order:
"5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
So exponents from highest to lowest, then numerical order for the numbers.
How can I achieve this? Decreasing order for the numbers alone is clear. For the exponents it would be necessary to detect whether there is an x in the string, then extract the exponent and order based on it. But I don't know how to do that.
A more verbose option is to extract the exponents and the multipliers and then use arrange. This has the advantage of having these numbers ready if you need to use them.
library(stringr)
library(dplyr)
dat <- data.frame(x = c("+5x^{5}", "-2x^{3}", "5x^{7}", "0", "1"))
dat |> mutate(m = as.numeric(str_match(x, "([+-]*\\d+)x\\^\\{(\\d)\\}")[, 2]),
              exp = as.numeric(str_match(x, "([+-]*\\d+)x\\^\\{(\\d)\\}")[, 3]),
              n = as.numeric(x)) |>
  arrange(desc(exp), desc(n))
Output
#>         x  m exp  n
#> 1  5x^{7}  5   7 NA
#> 2 +5x^{5}  5   5 NA
#> 3 -2x^{3} -2   3 NA
#> 4       1 NA  NA  1
#> 5       0 NA  NA  0
Created on 2022-06-13 by the reprex package (v2.0.1)
Base R solution:
x[
  order(
    gsub(
      ".*\\{(\\d+)\\}.*",
      "\\1",
      x
    ),
    decreasing = TRUE
  )
]
Input data:
x <- c(
  "+5x^{5}",
  "-2x^{3}",
  "5x^{7}",
  "0",
  "1"
)
Well... it works
> x=c("+5x^{5}","-2x^{3}","5x^{7}","0","1")
> x[order(gsub("(.*\\^\\{)(.+)(\\}.*)","\\2",x),decreasing=TRUE)]
[1] "5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
The regex (.*\\^\\{)(.+)(\\}.*) has three capture groups:
(.*\\^\\{) matches everything up to and including ^{ (first group),
(.+) matches whatever is inside the curly brackets (second group),
(\\}.*) matches } and everything after it (third group).
The replacement \\2 keeps only the contents of the second group,
which is what we use to order the elements of the string vector. Strings without ^{...} (here "0" and "1") do not match and are returned unchanged. The sort keys are strings and are compared as text, which works here because every key is a single digit.
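Because the keys above are compared as text, an exponent such as 10 would not sort correctly. A hedged sketch of a numeric variant follows; the extra element "4x^{10}" and the helper names has_exp, expo and const are illustrative additions, not part of the original answers:

```r
x <- c("+5x^{5}", "-2x^{3}", "5x^{7}", "0", "1", "4x^{10}")

# Which elements carry an exponent of the form ^{digits}?
has_exp <- grepl("\\^\\{\\d+\\}", x)

# Exponent as a number; plain constants get -Inf so they sort after every power
expo <- rep(-Inf, length(x))
expo[has_exp] <- as.numeric(sub(".*\\^\\{(\\d+)\\}.*", "\\1", x[has_exp]))

# Numeric value of the plain constants; 0 is a placeholder for the power terms
const <- rep(0, length(x))
const[!has_exp] <- as.numeric(x[!has_exp])

# Sort by exponent first, then by constant value, both decreasing
sorted <- x[order(expo, const, decreasing = TRUE)]
print(sorted)
```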

I use as.complex() to convert a string column to a numeric column in r

I have three columns that contain the characters A, B, and C respectively. I am using as.numeric to convert them to numeric and then assign them values, e.g. 1, 2 and 3, but as.numeric() returns NAs. In different data frames the order varies, e.g. ABC or ACB, but A=i+0i, B=2+3i and C is also a complex number. I want to first convert the string to a complex number and then assign values to them.
LV$phase1 <- as.numeric(LV$phase1)
class(phase1)
A=1
print(phase1)
This is the error:
"Warning message:
NAs introduced by coercion "
It does not usually make sense to convert character data to numeric, but if the letters refer to an ordered sequence of events/phases/periods, then it may be useful. R uses factors for this purpose. For example
set.seed(42)
phase <- sample(LETTERS[1:4], 10, replace=TRUE)
phase
# [1] "A" "A" "A" "A" "B" "D" "B" "B" "A" "D"
factor(phase)
# [1] A A A A B D B B A D
# Levels: A B D
as.numeric(factor(phase))
# [1] 1 1 1 1 2 3 2 2 1 3
If this is what you are trying to do, then
LV$phase1 <- as.numeric(factor(LV$phase1))
will convert the letters to factor levels (sorted alphabetically) and assign integer codes to represent those categories.
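If the letters are instead meant to stand for specific complex values, a named lookup vector is one way to do the mapping. This is only a sketch: A is assumed to be 1+0i, the value for C is made up because the question does not give it, and lookup and phase are illustrative names.

```r
# Assumed letter-to-value mapping; the value for C is hypothetical
lookup <- c(A = 1+0i, B = 2+3i, C = 4-1i)

phase <- c("A", "C", "B", "A")

# Index the lookup vector by name, so the order of the letters does not matter
phase_complex <- unname(lookup[phase])
print(phase_complex)

# A string that is already written as a complex number can be parsed directly
parsed <- as.complex("2+3i")
print(parsed)
```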

How to pass decreasing and/or na.last argument to sort through tapply in R

I am teaching myself the basics of R and have been encountering trouble using the function tapply when passing the sort function while trying to use non-default optional arguments for sort. Here is an example of the trouble I am facing:
Given the vectors
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c","d")
I find that
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
results in the same output regardless of the logical values I assign to decreasing and na.last. In fact, the output always matches the sort defaults
decreasing = FALSE, na.last = NA
For the record, when entering the above example, the output is
> tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
1 1.1 2.1
"b" "a" "c"
Let me also mention that if I define the alternate function
sort2 <- function(v) sort(v, decreasing=TRUE, na.last=TRUE);
and pass sort2 to tapply instead, I still encounter the same trouble.
I am running this code on Mac OS X 10.10.4, using R 3.2.0. Calling sort on its own, without passing it through tapply, gives the desired behavior: it responds appropriately when the decreasing and na.last arguments are changed.
Thank you in advance for any help.
I don't think you're using tapply() correctly.
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
The above line of code basically says "sort vector y, grouping by categorical vector x". Your vector x is not really categorical at all; it's a numeric vector with only distinct values, plus an NA. tapply() ignores the NA index and treats each of the remaining three distinct values in x as a separate group, so it passes each of the three corresponding strings from y to its own call of sort(). Sorting a single element changes nothing, which is why your customization arguments have no effect, and the result is returned ordered by the x groups.
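One way to check this is to look at the groups directly with split(), which groups y by x in the same manner (x is converted to a factor and the NA is dropped). A small sketch with the vectors from the question; groups is an illustrative name:

```r
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c", "d")

# Each distinct non-NA value of x becomes its own group
groups <- split(y, x)
print(groups)

# Every group holds exactly one string, so sort() has nothing to reorder
print(lengths(groups))
```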
Here's an example of how to do what I think you're trying to do:
x <- c(NA,1,2,3,NA,2,1,3)
g <- rep(letters[1:2],each=4)
x
## [1] NA 1 2 3 NA 2 1 3
g
## [1] "a" "a" "a" "a" "b" "b" "b" "b"
tapply(x,g,sort,decreasing=TRUE,na.last=TRUE)
## $a
## [1] 3 2 1 NA
##
## $b
## [1] 3 2 1 NA
##
Edit: When you want to sort a vector by another vector, you can use order(). With the x and y from your question:
y[order(x,decreasing=TRUE,na.last=TRUE)]
## [1] "c" "a" "b" "d"
y[order(x,decreasing=FALSE,na.last=TRUE)]
## [1] "b" "a" "c" "d"

"replace" function examples

I don't find the help page for the replace function from the base package to be very helpful. Worst part, it has no examples which could help understand how it works.
Could you please explain how to use it? An example or two would be great.
If you look at the function (by typing its name at the console) you will see that it is just a simple functionalized version of the [<- function, which is described at ?"[". [ is a rather basic function in R, so you would be well advised to look at that page for further details. Especially important is learning that the index argument (the second argument in replace) can be logical, numeric or character. Recycling will occur when the second and third arguments differ in length.
You should "read" the function call as: "within the first argument, use the second argument as an index for placing the values of the third argument into the first":
> replace( 1:20, 10:15, 1:2)
[1] 1 2 3 4 5 6 7 8 9 1 2 1 2 1 2 16 17 18 19 20
Character indexing for a named vector:
> replace(c(a=1, b=2, c=3, d=4), "b", 10)
a b c d
1 10 3 4
Logical indexing:
> replace(x <- c(a=1, b=2, c=3, d=4), x>2, 10)
a b c d
1 2 10 10
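To make the link to [<- concrete, here is a short sketch comparing replace() with a plain indexed assignment on a copy. The names via_replace and via_assign are illustrative only:

```r
x <- c(a = 1, b = 2, c = 3, d = 4)

# replace() returns a modified copy and leaves x untouched
via_replace <- replace(x, x > 2, 10)

# The same result with an explicit indexed assignment on a copy
via_assign <- x
via_assign[via_assign > 2] <- 10

print(identical(via_replace, via_assign))
print(x)
```

Because the original vector is not changed, the result has to be assigned back (x <- replace(...)) if the change should persist.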
You can also use logical tests
x <- data.frame(a = c(0,1,2,NA), b = c(0,NA,1,2), c = c(NA, 0, 1, 2))
x
x$a <- replace(x$a, is.na(x$a), 0)
x
x$b <- replace(x$b, x$b==2, 333)
Here are two simple examples:
> x <- letters[1:4]
> replace(x, 3, 'Z') #replacing 'c' by 'Z'
[1] "a" "b" "Z" "d"
>
> y <- 1:10
> replace(y, c(4,5), c(20,30)) # replacing 4th and 5th elements by 20 and 30
[1] 1 2 3 20 30 6 7 8 9 10
Be aware that in the examples given above, the third parameter (values) is a constant (e.g. 'Z' or c(20,30)).
Defining the third parameter using values from the data frame itself can lead to confusion.
E.g. with a simple data frame such as this (using dplyr::data_frame):
tmp <- data_frame(a=1:10, b=sample(LETTERS[24:26], 10, replace=TRUE))
This will create something like this:
       a     b
   (int) (chr)
1      1     X
2      2     Y
3      3     Y
4      4     X
5      5     Z
..etc
Now suppose what you wanted to do was multiply the values in column 'a' by 2, but only where column 'b' is "X". My immediate thought would be something like this:
with(tmp, replace(a, b=="X", a*2))
That will not provide the desired outcome, however. The a*2 is evaluated once as a fixed vector rather than as a per-row reference to the 'a' column. The vector 'a*2' will thus be
[1] 2 4 6 8 10 12 14 16 18 20
at the start of the 'replace' operation. Thus, in the first row where 'b' equals "X", the value in 'a' will be replaced by 2. In the second such row, it will be replaced by 4, etc. It will not be replaced by two times the value of 'a' in that particular row.
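One way around this is to subset the replacement values with the same index, so each selected row receives its own doubled value. A minimal sketch with a base data frame and fixed letters in place of the random sample; the values of b and the names is_x, doubled and doubled2 are illustrative:

```r
tmp <- data.frame(a = 1:10,
                  b = c("X", "Y", "Y", "X", "Z", "X", "Y", "Z", "X", "Y"))

is_x <- tmp$b == "X"

# Index the replacement values too, so positions line up row by row
doubled <- replace(tmp$a, is_x, tmp$a[is_x] * 2)
print(doubled)

# ifelse() expresses the same row-wise choice
doubled2 <- ifelse(is_x, tmp$a * 2, tmp$a)
```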
Here's an example where I found the replace( ) function helpful for giving me insight. The problem required a long integer vector be changed into a character vector and with its integers replaced by given character values.
## figuring out replace( )
(test <- c(rep(1,3),rep(2,2),rep(3,1)))
which looks like
[1] 1 1 1 2 2 3
and I want to replace every 1 with an A and 2 with a B and 3 with a C
letts <- c("A","B","C")
so in my own secret little "dirty-verse" I used a loop
for (i in 1:3) {
  test <- replace(test, test == i, letts[i])
}
which did what I wanted
test
[1] "A" "A" "A" "B" "B" "C"
In the first sentence I purposefully left out that the real objective was to make the big vector of integers a factor vector and assign the integer values (levels) some names (labels).
So another way of doing the replace( ) application here would be
(test <- factor(test,labels=letts))
[1] A A A B B C
Levels: A B C

Why does tapply take the subset as NA and not exclude them totally

I have a question. I want to make a barplot with the mean and errorbars, where it is grouped for two factors. To get the mean and the standard errors I used the function tapply.
However, for one of the factors I want to drop one level.
So what I did was:
dataFE <- data[-which(plant=="FS"),] # this works fine, I get exactly the data set I want without the FS level of the factor plant
Then to get the mean and standard error I use this:
means <- with(dataFE, as.matrix(tapply(leaves, list(plant, Orchestia), mean), nrow=2))
e <- with(dataFE, as.matrix(tapply (leaves, list(plant, Orchestia), function(x) sd(x)/sqrt(length(x))), nrow=2))
And there something strange happens, it does not calculate the FS, however it puts it in a table with NA:
  row.names       no      yes
1         F 7.009022 5.307185
2        FS       NA       NA
3         S 2.837139 2.111054
This I don't want, because if I use this in barplot2 (package gplots) then I will get an empty bar for the FS, whereas that one should not be there at all.
So does any of you have a solution or another method to get a nice barplot :). Thanks anyway!
Without a sample of your data, I'll just wager a guess:
Your column plant is a factor. While you have dropped the rows that have that value, the level "FS" still exists. Use levels(dataFE$plant) to see. You can then use droplevels to get rid of it.
> dat <- data.frame(x=1:15, y=factor(letters[1:3]))
> levels(dat$y)
[1] "a" "b" "c"
> dat <- dat[dat$y != 'a',]
> levels(dat$y)
[1] "a" "b" "c"
>
> tapply(dat$x, dat$y, sum)
 a  b  c
NA 40 45
>
> droplevels(dat$y)
[1] b c b c b c b c b c
Levels: b c
> dat$y <- droplevels(dat$y)
> tapply(dat$x, dat$y, sum)
 b  c
40 45
>
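Applied to the two-factor case from the question, droplevels() can also be called on the whole data frame right after subsetting. A sketch with made-up numbers; only the column names and levels are taken from the question:

```r
# Made-up data using the column names from the question
data <- data.frame(
  plant     = factor(rep(c("F", "FS", "S"), each = 4)),
  Orchestia = factor(rep(c("no", "yes"), times = 6)),
  leaves    = 1:12
)

# Remove the FS rows and drop the now-unused level in one step
dataFE <- droplevels(data[data$plant != "FS", ])

means <- with(dataFE, tapply(leaves, list(plant, Orchestia), mean))
print(means)
```

The resulting matrix has rows only for F and S, so a bar plot built from it has no empty FS bar.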
