Sort rows of data frame by shifting the rows so that the maximum value is on the top - r

I have a data frame like the one below, whose values need to be reordered.
Name Bin Value
a 1 10
a 2 1000
a 3 1
a 4 100
b 1 20
b 2 2
b 3 200
b 4 2000
I want the maximum value to move to the top while the values keep their positions relative to one another, so that the new order looks like this:
Name Bin Value
a 1 1000
a 2 1
a 3 100
a 4 10
b 1 2000
b 2 20
b 3 2
b 4 200
The point is not just to bring the maximum Value to the top: the whole sequence of Value has to be shifted along with the maximum, so that, for example, 1 always comes directly after 1000 in group a, in both the old and the new data frame.

Define a function which takes a vector and shifts it upwards moving the maximum to the top and then shifting the values before the maximum to the bottom. Use ave to apply that to Value by Name.
max2top <- function(x) {
  wx <- which.max(x) - 1
  if (wx == 0) x else c(tail(x, -wx), head(x, wx))
}
transform(DF, Value = ave(Value, Name, FUN = max2top))
giving
Name Bin Value
1 a 1 1000
2 a 2 1
3 a 3 100
4 a 4 10
5 b 1 2000
6 b 2 20
7 b 3 2
8 b 4 200
Note
The input in reproducible form:
Lines <- "Name Bin Value
a 1 10
a 2 1000
a 3 1
a 4 100
b 1 20
b 2 2
b 3 200
b 4 2000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
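For illustration, here is what max2top does to the values of a single group, next to an equivalent formulation that rotates by index arithmetic. This is only a sketch: rotate_to_max is a hypothetical name that does not appear in the answer above.

```r
# max2top as defined in the answer
max2top <- function(x) {
  wx <- which.max(x) - 1
  if (wx == 0) x else c(tail(x, -wx), head(x, wx))
}

# Hypothetical alternative: start reading at the maximum and wrap around
rotate_to_max <- function(x) {
  n <- length(x)
  start <- which.max(x)
  x[(seq_len(n) + start - 2) %% n + 1]
}

v <- c(10, 1000, 1, 100)   # the Value column for Name == "a"
max2top(v)                 # 1000 1 100 10
rotate_to_max(v)           # 1000 1 100 10
```

Both versions keep the cyclic order of the values. If the maximum occurs more than once, which.max picks the first occurrence.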

Related

assign values from same row in R

I want to assign the row value of B to row A only when A = 1. This is what I have done so far:
Data frame:
df <- data.frame('A' = c(1,1,2,5,4,3,1,2), 'B' = c(100,200,200,200,100,200,100,200))
A B
1 1 100
2 1 200
3 2 200
4 5 200
5 4 100
6 3 200
7 1 100
8 2 200
Output:
df$A[df$A == 1] <- df$B
A B
1 100 100
2 200 200
3 2 200
4 5 200
5 4 100
6 3 200
7 200 100
8 2 200
As you can see, rows 1 and 2 get the right values. Row 7 does not: it takes the value from row 3 instead, because the values are assigned sequentially.
My question: how do I assign values that are taken from the same row?
Use:
df$A[df$A == 1] <- df$B[df$A == 1]
You need to apply the same index to both the column being replaced and the column that holds the replacements.
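As a small check of this, the following sketch stores the logical index once and reuses it on both sides. The name idx is just an illustrative choice.

```r
df <- data.frame(A = c(1, 1, 2, 5, 4, 3, 1, 2),
                 B = c(100, 200, 200, 200, 100, 200, 100, 200))

# Compute the index once, before A is modified
idx <- df$A == 1

# Same index on both sides, so each replaced A takes B from its own row
df$A[idx] <- df$B[idx]
df$A
# 100 200 2 5 4 3 100 2
```

Row 7 now receives 100, the value of B in row 7, rather than the third element of B.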

R Matching based on multiple columns

Let's say I have the following data set, which acts like the key
x y value
1 2 10
1 1 20
2 1 30
1 1 20
2 3 200
I have another data set with many columns, two of them being x and y. I want to create a column value that is looked up from the key, e.g.
x y value and other columns
1 1 20
2 1 30
2 3 300
I can only get match to work when matching on one column. How do I extend it to matching on multiple columns?
You can use merge, as @MrFlick suggested:
df.key <- data.frame(
x=c(1,1,2,1,2),
y=c(2,1,1,1,3),
value=c(10,20,30,20,200))
##
df.add <- data.frame(
x=c(1,2,2),
y=c(1,1,3),
value=c(20,30,300),
a=rnorm(3),
b=rpois(3,0))
##
> merge(
x=df.key,
y=df.add)
x y value a b
1 1 1 20 0.9246104 0
2 1 1 20 0.9246104 0
3 2 1 30 0.2685016 0
##
> merge(
x=df.key,
y=df.add,
by=c("x","y"))
x y value.x value.y a b
1 1 1 20 20 0.9246104 0
2 1 1 20 20 0.9246104 0
3 2 1 30 30 0.2685016 0
4 2 3 200 300 -0.4174230 0
By default, this will join on the intersection of column names, like in the first example (x,y,value). Additionally, you can specify which columns to use from both data.frames using by=, as in the second example. Or, you can get more specific by using by.x= and/or by.y=. See ?merge.
Edit:
The problem is that df.key contains two rows where x=1, y=1 is TRUE, so the row in df.add with x=1,y=1 has to be duplicated in the join in order to preserve the data in df.key. I'm not sure how to make this adjustment elegantly (e.g. by specifying certain arguments to merge), but here's one approach:
R> merge(
x=df.key[!duplicated(df.key[,c(1:2)]),],
y=df.add)
x y value a b
1 1 1 20 -1.0185211 0
2 2 1 30 2.7507656 0
3 2 3 200 0.3986168 0
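If you would rather stay with match, one possible sketch is to paste the key columns into a single string and match on that. This is not from the answer above; the names key, other, note and idx are made up for illustration, and it assumes the separator "_" cannot occur inside the key values.

```r
key <- data.frame(x = c(1, 1, 2, 1, 2),
                  y = c(2, 1, 1, 1, 3),
                  value = c(10, 20, 30, 20, 200))

# Hypothetical second data set that has x and y but no value yet
other <- data.frame(x = c(1, 2, 2),
                    y = c(1, 1, 3),
                    note = c("p", "q", "r"))

# Build one composite key per row, then do an ordinary single-column match
idx <- match(paste(other$x, other$y, sep = "_"),
             paste(key$x, key$y, sep = "_"))
other$value <- key$value[idx]
other$value
# 20 30 200
```

Because match returns the first hit, the repeated (1, 1) row in the key does not duplicate rows in the result, and the row order of the second data set is preserved.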

Replace values in a series exceeding a threshold

In a dataframe I'd like to replace values in a series where they exceed a given threshold.
For example, within a group ('ID') in a series designated by 'time', if 'value' ever reaches 3, I'd like to make all following entries also equal 3.
ID <- as.factor(c(rep("A", 3), rep("B",3), rep("C",3)))
time <- rep(1:3, 3)
value <- c(c(1,1,2), c(2,3,2), c(3,3,2))
dat <- cbind.data.frame(ID, time, value)
dat
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 2
C 1 3
C 2 3
C 3 2
I'd like it to be:
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 3
C 1 3
C 2 3
C 3 3
This should be easy, but I can't figure it out. Thanks!
The ave function makes this very easy by allowing you to apply a function to each of the groupings. In this case, we will adapt cummax (cumulative maximum) to check whether we've seen a 3 yet.
dat$value2<-with(dat, ave(value, ID, FUN=
function(x) ifelse(cummax(x)>=3, 3, x)))
dat;
# ID time value value2
# 1 A 1 1 1
# 2 A 2 1 1
# 3 A 3 2 2
# 4 B 1 2 2
# 5 B 2 3 3
# 6 B 3 2 3
# 7 C 1 3 3
# 8 C 2 3 3
# 9 C 3 2 3
You could also just use FUN=cummax if you want never-decreasing values. I wasn't sure whether you wanted a sequence such as c(1,2,1) to stay unchanged or not.
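To make the difference between the two variants concrete, here is a small sketch on single vectors. The helper name cap_after and its limit argument are hypothetical and only wrap the anonymous function used above.

```r
# Once the running maximum reaches the limit, every later entry becomes the limit
cap_after <- function(x, limit = 3) ifelse(cummax(x) >= limit, limit, x)

cap_after(c(2, 3, 2))   # 2 3 3  (group B from the question)
cap_after(c(1, 2, 1))   # 1 2 1  (never reaches 3, so left unchanged)
cummax(c(1, 2, 1))      # 1 2 2  (plain cummax also raises the last value)
```

So the ifelse version only changes a group after the threshold has been reached, while plain cummax changes any sequence that decreases.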
If you can assume your data are sorted by group, then this should be fast, essentially relying on findInterval() behind the scenes:
library(IRanges)
id <- Rle(ID)
three <- which(value>=3L)
ir <- reduce(IRanges(three, end(id)[findRun(three, id)]))
dat$value[as.integer(ir)] <- 3L
This avoids looping over the groups.

R add index column to data frame based on row values

This is a continuation of r - How to add row index to a data frame, based on combination of factors
I tried to replicate what I believe to be the desired results using the green checked answer and am consistently getting something other than expected. I am sure I am doing something really basic wrong, but can't seem to see it OR I've misunderstood what the desired state is.
The data from the original post:
temp <- data.frame(
Dim1 = c("A","A","A","A","A","A","B","B"),
Dim2 = c(100,100,100,100,200,200,100,200),
Value = sample(1:10, 8)
)
Then I ran the following code: temp$indexLength <- ave( 1:nrow(temp), temp$Dim1, factor( temp$Dim2), FUN=function(x) 1:length(x) )
and: temp$indexSeqAlong <- ave( 1:nrow(temp), temp$Dim1, factor( temp$Dim2), FUN=seq_along )
and then I created the following: temp$indexDesired <- c(1, 1, 1, 1, 2, 2, 3, 4)
...ending up with the data frame below:
Dim1 Dim2 Value indexLength indexSeqAlong indexDesired
1 A 100 6 1 1 1
2 A 100 2 2 2 1
3 A 100 9 3 3 1
4 A 100 8 4 4 1
5 A 200 10 1 1 2
6 A 200 4 2 2 2
7 B 100 3 1 1 3
8 B 200 5 1 1 4
If I can figure out why I'm not getting the desired index -- and assuming the code is extensible to more than 2 variables -- I should be all set. Thanks in advance!
If you use data.table, there is a "symbol" .GRP which records this information ( a simple group counter)
library(data.table)
DT <- data.table(temp)
DT[, index := .GRP, by = list(Dim1, Dim2)]
DT
# Dim1 Dim2 Value index
# 1: A 100 10 1
# 2: A 100 2 1
# 3: A 100 9 1
# 4: A 100 4 1
# 5: A 200 6 2
# 6: A 200 1 2
# 7: B 100 8 3
# 8: B 200 7 4
Once the values in the first argument have been partitioned, there is no way that ave "knows" in what order they have been passed. You want a method that can look at changes in values. The duplicated function is generic and has a data.frame method that looks at multiple columns:
temp$indexSeqAlong <- cumsum(!duplicated(temp[, 1:2]) )
temp
Dim1 Dim2 Value indexSeqAlong
1 A 100 8 1
2 A 100 2 1
3 A 100 7 1
4 A 100 3 1
5 A 200 5 2
6 A 200 1 2
7 B 100 4 3
8 B 200 10 4
This is extensible to as many columns as you want.
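One caveat worth sketching: cumsum(!duplicated(...)) counts first appearances, so it gives a group index only when the rows of each group are adjacent, as they are in the question's data. If they might not be, a match against the unique keys is a possible alternative. The names grp and key below are illustrative only.

```r
temp <- data.frame(Dim1 = c("A", "A", "A", "A", "A", "A", "B", "B"),
                   Dim2 = c(100, 100, 100, 100, 200, 200, 100, 200))

# Rows of each group are adjacent here, so counting first appearances works
temp$index <- cumsum(!duplicated(temp[, c("Dim1", "Dim2")]))
temp$index
# 1 1 1 1 2 2 3 4

# With interleaved groups the two approaches differ
grp <- c("a", "b", "a")
cumsum(!duplicated(grp))   # 1 2 2
match(grp, unique(grp))    # 1 2 1

# The match version for several columns, using a pasted key
key <- paste(temp$Dim1, temp$Dim2)
match(key, unique(key))
# 1 1 1 1 2 2 3 4
```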

How to calculate the # of unique player (when repeat entry is allowed)?

I am trying to calculate the number of unique players in an experiment where each player is allowed to re-enter the game. Here is what the data look like:
x <- read.table(header=T, text="group timepast Name NoOfUniquePlayer
1 0.02703 A 1
1 0.02827 B 2
1 0.02874 A 2
1 0.02875 A 2
1 0.02875 D 3
2 0.03255 M 1
2 0.03417 K 2
2 0.10029 T 3
2 0.10394 T 3
2 0.10605 K 3
2 0.16522 T 3
3 0.11938 E 1
3 0.12607 F 2
3 0.13858 E 2
3 0.16084 G 3
3 0.19830 G 3
3 0.24563 V 4")
The original experiment data contain the first 3 columns: the first is the group number of each experiment (3 groups here), the second is the normalized time at which each player joined the experiment (I've sorted this column from smallest to largest), and the third is the name of each player (each player joins only one group).
What I want to generate is the last column, the number of unique players so far. For example, in group 1 five entries (A B A A D) are recorded but there are only 3 unique players (A B D): player A started the game (1st row) and re-joined (3rd row) after player B had played (2nd row), then player A joined again (4th row), and finally player D entered and finished the game.
Can anyone help me figure out how to program in R to get this problem solved?
I think this will give you what you want (I think there is an error in your example for group 2)
x$uniquenum <- unlist(
tapply(
x$Name,
x$group,
function(y)
cummax(as.numeric(factor(y,levels=y[!duplicated(y)])))
)
)
group timepast Name NoOfUniquePlayer uniquenum
1 1 0.02703 A 1 1
2 1 0.02827 B 2 2
3 1 0.02874 A 2 2
4 1 0.02875 A 2 2
5 1 0.02875 D 3 3
6 2 0.03255 M 1 1
7 2 0.03417 K 2 2
8 2 0.10029 T 3 3
9 2 0.10394 T 3 3
10 2 0.10605 K 4 3
11 2 0.16522 T 4 3
12 3 0.11938 E 1 1
13 3 0.12607 F 2 2
14 3 0.13858 E 2 2
15 3 0.16084 G 3 3
16 3 0.19830 G 3 3
17 3 0.24563 V 4 4
Slightly more compactly, using data.table:
library(data.table)
DT <- data.table(x)
DT[, uniqueNum := cummax(match(Name,unique(Name))), by = group]
If you want the total number of unique players per group, then:
DT[, totalUnique := max(uniqueNum), by = group]
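To see why cummax(match(Name, unique(Name))) gives a running count, here is a base R sketch on a small made-up data frame. The names count_seen and players are hypothetical, and the ave call over row indices is just one possible way to apply the idea per group without data.table.

```r
# Position of each name among the distinct names, then the running maximum
count_seen <- function(nm) cummax(match(nm, unique(nm)))

g1 <- c("A", "B", "A", "A", "D")   # group 1 from the question
match(g1, unique(g1))              # 1 2 1 1 3
count_seen(g1)                     # 1 2 2 2 3

# Small illustrative data set with two groups
players <- data.frame(group = c(1, 1, 1, 1, 1, 2, 2, 2),
                      Name = c("A", "B", "A", "A", "D", "M", "K", "M"),
                      stringsAsFactors = FALSE)

# ave() works on the row indices so that the result stays integer
players$uniquenum <- ave(seq_along(players$Name), players$group,
                         FUN = function(i) count_seen(players$Name[i]))
players$uniquenum
# 1 2 2 2 3 1 2 2
```

A returning player maps to a position that has already been seen, so the running maximum does not increase for them.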
