Rank with tied numbers - r

I have to rank high values in a column of a data frame like this:
example <- data.frame(
country=c("Arg","Uru","Arg","Uru","Arg","Uru","Arg","Uru"),
value=c(1,1,2,3,4,5,6,10))
If I rank the values with
example$rank<- rank(-example$value)
I get this:
print(example$rank)
#[1] 7.5 7.5 6.0 5.0 4.0 3.0 2.0 1.0
But what I am actually looking for is this:
#[1] 7 8 6 5 4 3 2 1
If they are tied, I don't mind which one has a higher rank.
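One possible approach, sketched here as a suggestion rather than something taken from the thread, is the ties.method argument of base R's rank(). With "first", tied values receive increasing ranks in order of appearance; "random" is also available if the tie order should be arbitrary.

```r
example <- data.frame(
  country = c("Arg","Uru","Arg","Uru","Arg","Uru","Arg","Uru"),
  value = c(1,1,2,3,4,5,6,10))

# Negate the values so that the largest value gets rank 1;
# ties.method = "first" breaks ties by order of appearance.
example$rank <- rank(-example$value, ties.method = "first")
print(example$rank)
#[1] 7 8 6 5 4 3 2 1
```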

Related

Pulling columns based on values in a row

I am looking for a way to use the values in the first row to help filter values I want. Say if I want to keep certain columns in R based on the values in the first row. So in the first row, we have -0.5, 0.7, 1.1, and -1.2.
I want to only keep values that are equal to or greater than 1, or less than or equal to -1.2. Everything else will just be dropped.
So say my original data I have is DF1
ID      Location  XPL   SNA  AAS  APA
First   Park      -0.5  0.7  1.1  -1.2
Second  School    2     5    2    3
Second  Home      4     5    6    4
Third   Car       1     8    8    5
Third   Lake      7     5    4    6
Fourth  Prison    4     5    1    7
With the filter, I would now have a new DF:
ID      Location  AAS  APA
First   Park      1.1  -1.2
Second  School    2    3
Second  Home      6    4
Third   Car       8    5
Third   Lake      4    6
Fourth  Prison    1    7
What would be the best way to do this? I feel there should be a way to select columns based on the values in a row, but I cannot work out which commands to use.
ID <- c("First", "Second", "Second", "Third", "Third", "Fourth")
Location <- c("Park", "School", "Home", "Car", "Lake", "Prison")
XPL <- c(-0.5,2,4,1,7,4)
SNA <- c(0.7,5,5,8,5,5)
AAS <- c(1.1,2,6,8,4,1)
APA <- c(-1.2,3,4,5,6,7)
DF1 <- data.frame(ID, Location, XPL, SNA, AAS,APA)
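With dplyr, you can drop the numeric columns whose first value has an absolute value of 1 or less. This matches the example data, although it is not identical to the thresholds stated in the question (a first value of exactly 1 would be dropped, and -1.1 would be kept):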
library(dplyr)
DF1 %>%
select(!where(~ is.numeric(.x) && abs(first(.x)) <= 1))
#       ID Location AAS  APA
# 1  First     Park 1.1 -1.2
# 2 Second   School 2.0  3.0
# 3 Second     Home 6.0  4.0
# 4  Third      Car 8.0  5.0
# 5  Third     Lake 4.0  6.0
# 6 Fourth   Prison 1.0  7.0
Or with between:
DF1 %>%
select(!where(~ is.numeric(.x) && between(first(.x), -1.19, 0.99)))
If you are using the first row as the basis, you can convert it to a plain numeric vector and use the which function to find the indices of the columns to keep.
test.row <- as.numeric(DF1[1,3:6])
Here 3:6 is the range of column indices from XPL to APA.
DF1 <- DF1[,c(1:2, 2 + which(test.row >= 1 | test.row <= -1.2))]
We keep the columns ID and Location with 1:2, and add an offset of 2 to the result of which because test.row starts at the third column.
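The same idea can be sketched in base R without hard-coding the column positions. This is only an illustration under the question's stated thresholds; the helper names is_num, first_row, keep_num and DF2 are made up for the example.

```r
DF1 <- data.frame(
  ID       = c("First", "Second", "Second", "Third", "Third", "Fourth"),
  Location = c("Park", "School", "Home", "Car", "Lake", "Prison"),
  XPL = c(-0.5, 2, 4, 1, 7, 4),
  SNA = c(0.7, 5, 5, 8, 5, 5),
  AAS = c(1.1, 2, 6, 8, 4, 1),
  APA = c(-1.2, 3, 4, 5, 6, 7))

# Flag the numeric columns, then read their first-row values as a named vector
is_num    <- vapply(DF1, is.numeric, logical(1))
first_row <- unlist(DF1[1, is_num])

# Keep numeric columns whose first value is >= 1 or <= -1.2
keep_num <- names(first_row)[first_row >= 1 | first_row <= -1.2]

# Non-numeric columns are always kept and come first in the result
DF2 <- DF1[, c(names(DF1)[!is_num], keep_num)]
names(DF2)
```

Note that this places all non-numeric columns before the kept numeric ones, which happens to match the original column order here.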

How to rank data from multiple rows and columns?

Example data:
df <- data.frame(A = c(20,40,53), B = c(40,11,60))
What's the easiest way in R to get from this
A B
1 20 40
2 40 11
3 53 60
to this?
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
I couldn't find a way to make rank() or frank() work on multiple rows/columns and googling things like "r rank dataframe" "r rank multiple rows" yielded only questions on how to rank multiple rows/columns individually, which is weird, as I suspect the question must have been answered before.
Try rank as shown below:
df[] <- rank(df)
or
df <- list2DF(relist(rank(df),skeleton = unclass(df)))
and you will get
> df
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
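A more explicit variant, given here as a sketch rather than as the answer's own method, flattens the data frame with unlist(), ranks all values together, and then rebuilds the original shape. The name ranked is chosen only for this example.

```r
df <- data.frame(A = c(20, 40, 53), B = c(40, 11, 60))

# unlist() concatenates the columns (A first, then B) into one vector,
# so rank() compares all six values against each other.
all_ranks <- rank(unlist(df))

# Refill column by column to restore the original layout
ranked <- as.data.frame(
  matrix(all_ranks, nrow = nrow(df), dimnames = list(NULL, names(df))))
ranked
```

Because matrix() fills column by column, the ranks land back in the positions of the values they came from.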

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows, each corresponding to a glass sample, and 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is the first row:
       RI    Na   Mg   Al    Si    K   Ca  Ba  Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0    1
First, I cast column 10 so that R interprets it as a factor instead of an integer.
Now I need to create a vector with indices for all observations (values 1 to 214). I know how to create a vector with 214 values, but not one that holds the specific indices of the observations in a data frame.
If it helps, this is being done to set up training data for Naive Bayes. Thanks.
I'm not totally sure that I get what you're trying to do... So please forgive me if my solution isn't helpful. If your df's name is 'df', just use the dplyr package for reordering your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
df$ind <- seq.int(nrow(df))
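If the index vector is meant for splitting the data into training and test sets, a rough sketch could look like the following. The 70% training share and the seed are assumptions made for illustration, not something stated in the question, and the small df from the answer above stands in for the Glass data.

```r
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
                 col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))

idx <- seq_len(nrow(df))   # 1, 2, ..., nrow(df); would be 1:214 for Glass

set.seed(42)               # arbitrary seed, only for reproducibility
# Assumed 70/30 split: sample row indices without replacement
train_idx <- sample(idx, size = floor(0.7 * length(idx)))
test_idx  <- setdiff(idx, train_idx)

train <- df[train_idx, ]
test  <- df[test_idx, ]
```

seq_len(nrow(df)) is used instead of 1:nrow(df) because it also behaves sensibly for a data frame with zero rows.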

R: Mean of subvectors based on repeats in another vector

I am trying to build two subvectors of equal length from two vectors of equal length.
The values in the first vector are ordered as follows:
a<-c(9,9,9,8,8,7,6,5,5,5)
The second vector is random, but let's take
b<-c(1,2,3,4,5,6,7,8,9,10)
The first subvector is simple: it is just the vector a without repeats
f(a)<-c(9,8,7,6,5)
The second subvector should be made as follows:
For a value that appears only once in a, g(b) takes the value of b at the corresponding position. For a value that is repeated in a, g(b) should be the mean of the values of b at the corresponding positions. So:
g(b)<-c(mean(c(1,2,3)), mean(c(4,5)), 6, 7, mean(c(8,9,10)))
I have no idea where to start. Thanks for the help!
tapply is the function you want. See ?tapply to see how it works. Here:
res<-tapply(b,a,mean)
# 5 6 7 8 9
#9.0 7.0 6.0 4.5 2.0
If you want to preserve the order:
tapply(b,a,mean)[as.character(unique(a))]
# 9 8 7 6 5
#2.0 4.5 6.0 7.0 9.0
As you can see, it gives the unique values of a, and for each of them the desired function (in this case mean) is evaluated on the corresponding values of b.
We can also use ave
unique(ave(b, a))
#[1] 2.0 4.5 6.0 7.0 9.0
Or another option would be to convert the 'b' to factor with levels specified
tapply(b, factor(a, levels=unique(a)), FUN=mean)
# 9 8 7 6 5
#2.0 4.5 6.0 7.0 9.0
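One caveat with unique(ave(b, a)) is worth sketching: if two different groups happen to have the same mean, unique() collapses them into one value and the result no longer lines up with unique(a). The vectors below are made up to show this; indexing with !duplicated(a) is one way around it.

```r
a <- c(9, 9, 8, 8)
b <- c(1, 3, 2, 2)   # both groups have mean 2

# unique() drops the repeated mean, so only one value remains
collapsed <- unique(ave(b, a))
collapsed
#[1] 2

# Taking the first element of each group keeps one mean per distinct a
per_group <- ave(b, a)[!duplicated(a)]
per_group
#[1] 2 2
```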
You can do it this way:
uniqueA <- a[!duplicated(a)] # unique(a) also works; it keeps values in order of first appearance
uniqueB <- as.numeric(by(b,match(a,uniqueA),mean))
> uniqueA
[1] 9 8 7 6 5
> uniqueB
[1] 2.0 4.5 6.0 7.0 9.0

Computing a "rightmost" moving average?

I would like to compute a moving average (ma) over some time series data but I would like the ma to consider the order n starting from the rightmost of my series so my last ma value corresponds to the ma of the last n values of my series. The desired function rightmost_ma would produce this output:
data <- seq(1,10)
> data
[1] 1 2 3 4 5 6 7 8 9 10
rightmost_ma(data, n=2)
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I was reviewing the different ma possibilities e.g. package forecast and could not find how to cover this use case. Note that the critical requirement for me is to have valid non NA ma values for the last elements of the series or in other words I want my ma to produce valid results without "looking into the future".
Take a look at the rollmean function from the zoo package:
> library(zoo)
> rollmean(zoo(1:10), 2, align ="right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
You can also use rollapply:
> rollapply(zoo(1:10), width=2, FUN=mean, align = "right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I think using stats::filter is less complicated, and might have better performance (though zoo is well written).
This:
filter(1:10, c(1,1)/2, sides=1)
gives:
Time Series:
Start = 1
End = 10
Frequency = 1
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
If you don't want the result to be a ts object, use as.vector on the result.
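Putting the stats::filter idea into the rightmost_ma function the question asks for could look roughly like this. It is a minimal sketch: the function body is an assumption about what the asker wants, and the explicit stats:: prefix is there because dplyr::filter masks the base function when dplyr is attached.

```r
# Trailing moving average of order n: element i is the mean of x[(i-n+1):i].
# The first n - 1 elements are NA because a full window is not yet available.
rightmost_ma <- function(x, n = 2) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

data <- seq(1, 10)
rightmost_ma(data, n = 2)
#[1]  NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
```

With sides = 1 the filter only uses current and past values, so the last element is always defined and nothing "looks into the future".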
