Grouped ranking in R

I have a data set with a primary key and ratio values like the following:
2.243164164
1.429242413
2.119270714
3.013427143
1.208634972
1.208634972
1.23657632
2.212136028
2.168583297
2.151961216
1.159886063
1.234106444
1.694206176
1.401425329
5.210125578
1.215267806
1.089189869
I want to add a rank column that groups these ratios into, say, 3 bins, with functionality similar to the SAS code:
PROC RANK DATA = TAB1 GROUPS = &NUM_BINS

I did the following:
Convert your vector to a data frame (here called test2, with a single column named test).
Create the variable rank:
test2$rank<-rank(test2$test)
> test2
test rank
1 2.243164 15.0
2 1.429242 9.0
3 2.119271 11.0
4 3.013427 16.0
5 1.208635 3.5
6 1.208635 3.5
7 1.236576 7.0
8 2.212136 14.0
9 2.168583 13.0
10 2.151961 12.0
11 1.159886 2.0
12 1.234106 6.0
13 1.694206 10.0
14 1.401425 8.0
15 5.210126 17.0
16 1.215268 5.0
17 1.089190 1.0
Define a function that converts ranks to percentile ranks, then store the result in pr.
percent.rank<-function(x) trunc(rank(x)/length(x)*100)
test3<-within(test2,pr<-percent.rank(rank))
Then I created the bins, given that you wanted 3 of them.
test3$bins <- cut(test3$pr, breaks=c(0,33,66,100), labels=c("0-33","34-66","66-100"))
test x rank pr bins
1 2.243164 15.0 15.0 88 66-100
2 1.429242 9.0 9.0 52 34-66
3 2.119271 11.0 11.0 64 34-66
4 3.013427 16.0 16.0 94 66-100
5 1.208635 3.5 3.5 20 0-33
6 1.208635 3.5 3.5 20 0-33
7 1.236576 7.0 7.0 41 34-66
8 2.212136 14.0 14.0 82 66-100
9 2.168583 13.0 13.0 76 66-100
10 2.151961 12.0 12.0 70 66-100
11 1.159886 2.0 2.0 11 0-33
12 1.234106 6.0 6.0 35 34-66
13 1.694206 10.0 10.0 58 34-66
14 1.401425 8.0 8.0 47 34-66
15 5.210126 17.0 17.0 100 66-100
16 1.215268 5.0 5.0 29 0-33
17 1.089190 1.0 1.0 5 0-33
That work for you?
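As a more direct alternative, the group can be computed from the ranks in one step. The sketch below assumes the formula that the SAS documentation gives for PROC RANK with GROUPS=, namely floor(rank * k / (n + 1)), which numbers the groups from 0. The vector name ratio and the helper name rank_groups are made up for illustration.

```r
# Ratio values from the question (the name `ratio` is an assumption)
ratio <- c(2.243164164, 1.429242413, 2.119270714, 3.013427143,
           1.208634972, 1.208634972, 1.23657632, 2.212136028,
           2.168583297, 2.151961216, 1.159886063, 1.234106444,
           1.694206176, 1.401425329, 5.210125578, 1.215267806,
           1.089189869)

# Hypothetical helper: group = floor(rank * k / (n + 1)), groups start at 0.
# Tied values receive their average rank, the default behaviour of rank().
rank_groups <- function(x, k) {
  n <- sum(!is.na(x))
  floor(rank(x, na.last = "keep") * k / (n + 1))
}

groups <- rank_groups(ratio, 3)
table(groups)
```

Add 1 to the result if the bins should be numbered from 1 rather than 0.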

A bit late, but given your data, we can use ntile from the dplyr package to get equal-sized groups:
df <- data.frame(values = c(2.243164164,
1.429242413,
2.119270714,
3.013427143,
1.208634972,
1.208634972,
1.23657632,
2.212136028,
2.168583297,
2.151961216,
1.159886063,
1.234106444,
1.694206176,
1.401425329,
5.210125578,
1.215267806,
1.089189869))
library(dplyr)
df <- df %>%
arrange(values) %>%
mutate(rank = ntile(values, 3))
values rank
1 1.089190 1
2 1.159886 1
3 1.208635 1
4 1.208635 1
5 1.215268 1
6 1.234106 1
7 1.236576 2
8 1.401425 2
9 1.429242 2
10 1.694206 2
11 2.119271 2
12 2.151961 2
13 2.168583 3
14 2.212136 3
15 2.243164 3
16 3.013427 3
17 5.210126 3
Or see cut_number from the ggplot2 package:
library(ggplot2)
df$rank2 <- cut_number(df$values, 3, labels = c(1:3))
values rank rank2
1 1.089190 1 1
2 1.159886 1 1
3 1.208635 1 1
4 1.208635 1 1
5 1.215268 1 1
6 1.234106 1 1
7 1.236576 2 2
8 1.401425 2 2
9 1.429242 2 2
10 1.694206 2 2
11 2.119271 2 2
12 2.151961 2 3
13 2.168583 3 3
14 2.212136 3 3
15 2.243164 3 3
16 3.013427 3 3
17 5.210126 3 3
Because your sample consists of 17 numbers, one bin holds 5 numbers while the other two hold 6. The two functions differ for row 12: ntile puts 6 numbers in the first and second groups, whereas cut_number puts 6 numbers in the first and third groups.
> table(df$rank)
1 2 3
6 6 5
> table(df$rank2)
1 2 3
6 5 6
See also here: Splitting a continuous variable into equal sized groups
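The 6/5/6 split can be reproduced in base R by cutting at the tertile breakpoints, which is, as far as I understand, what cut_number does internally, whereas ntile fills the buckets in rank order. A minimal sketch; the names values, breaks and grp are illustrative only.

```r
values <- c(2.243164164, 1.429242413, 2.119270714, 3.013427143,
            1.208634972, 1.208634972, 1.23657632, 2.212136028,
            2.168583297, 2.151961216, 1.159886063, 1.234106444,
            1.694206176, 1.401425329, 5.210125578, 1.215267806,
            1.089189869)

# Breakpoints at the 0, 1/3, 2/3 and 1 quantiles
breaks <- quantile(values, probs = seq(0, 1, length.out = 4))

# include.lowest = TRUE keeps the minimum in the first bin;
# labels = FALSE returns plain integer codes 1..3
grp <- cut(values, breaks = breaks, include.lowest = TRUE, labels = FALSE)
table(grp)
```

Note that this fails with "breaks are not unique" when heavy ties make two quantiles coincide, so it is only a sketch for data like the sample above.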

Related

Run function in R for mean

Based on my values, I need a function that produces the results described below.
The function has to calculate the mean of the current value and the 3 previous values.
The function should be flexible enough that the same calculation can be applied to 2, 4, 5 or x previous values, for example the mean of the current value and the 2 previous values.
Please note that my data contains random numbers, not ascending numbers as in the example above.
What you need is a rolling mean. The argument k (4 in my example) is the integer width of the rolling window. See the documentation for the rollmean function of the zoo package, ?rollmean.
zoo
library(zoo)
library(dplyr)
df <- data.frame(number = 1:20)
df %>% mutate(rolling_avg = rollmean(number, k = 4 , fill = NA, align = "right"))
RcppRoll
library(RcppRoll)
df %>% mutate(rolling_avg = roll_mean(number, n = 4, fill = NA, align = "right"))
Output
number rolling_avg
1 1 NA
2 2 NA
3 3 NA
4 4 2.5
5 5 3.5
6 6 4.5
7 7 5.5
8 8 6.5
9 9 7.5
10 10 8.5
11 11 9.5
12 12 10.5
13 13 11.5
14 14 12.5
15 15 13.5
16 16 14.5
17 17 15.5
18 18 16.5
19 19 17.5
20 20 18.5
Using the other vector you provided in the comments:
df <- data.frame(number = c(1,-3,5,4,3,2,-4,5,6,-4,3,2,3,-4,5,6,6,3,2))
df %>% mutate(rolling_avg = rollmean(number, 4, fill = NA, align = "right"))
Output
number rolling_avg
1 1 NA
2 -3 NA
3 5 NA
4 4 1.75
5 3 2.25
6 2 3.50
7 -4 1.25
8 5 1.50
9 6 2.25
10 -4 0.75
11 3 2.50
12 2 1.75
13 3 1.00
14 -4 1.00
15 5 1.50
16 6 2.50
17 6 3.25
18 3 5.00
19 2 4.25
You can also use the rollify function in the tibbletime package to create a custom rolling version of any function. For mean it would look like this (using data from @mpalanco's answer):
library(dplyr)
library(tibbletime)
rolling_mean <- rollify(mean, window = 4)
df %>% mutate(moving_average = rolling_mean(number))
which gives you:
number moving_average
1 1 NA
2 2 NA
3 3 NA
4 4 2.5
5 5 3.5
6 6 4.5
7 7 5.5
8 8 6.5
9 9 7.5
10 10 8.5
11 11 9.5
12 12 10.5
13 13 11.5
14 14 12.5
15 15 13.5
16 16 14.5
17 17 15.5
18 18 16.5
19 19 17.5
20 20 18.5
The benefit of this approach is that it is easy to extend to things other than rolling average.
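If adding a package is not an option, a trailing rolling mean can also be sketched in base R with stats::filter. The helper name rolling_mean_base is hypothetical; k is the window width, i.e. the current value plus the k - 1 previous ones.

```r
# Trailing rolling mean: current value plus the k - 1 previous values.
# `rolling_mean_base` is a made-up helper name, not part of any package.
rolling_mean_base <- function(x, k) {
  # The namespace is spelled out because dplyr masks filter() when loaded.
  # sides = 1 uses past values only, so the first k - 1 results are NA.
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

number <- 1:20
res <- rolling_mean_base(number, 4)
res
```

Changing k to 3 gives the mean of the current value and the 2 previous values, and so on.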

Calculating cumulative mean using floating conditions

My dataset has the following features: player IDs, team, week and points.
I want to calculate the mean of the TEAM points over previous weeks; not all past weeks, just the last 5, or fewer when fewer than 5 previous weeks exist.
Example: For team = A, week = 7, the result will be the average of POINTS for team = A and weeks 2, 3, 4, 5 and 6.
The dataset can be created using the following code:
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
I would like a solution without heavy looping, because the dataset is huge.
I have asked related questions here that may help, but I could not adapt them to this case.
Question 1
Question 2
Thank you!
If you want a dplyr solution, we can adapt the approach from my answer to one of your other questions:
library(dplyr)
library(zoo)
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
roll_mean <- function(x, k) {
result <- rollapplyr(x, k, mean, partial=TRUE, na.rm=TRUE)
result[is.nan(result)] <- NA
return( result )
}
It might first be easier to aggregate by team:
team_data <- mydata %>%
select(-player_id) %>%
group_by(team, week) %>%
arrange(week) %>%
summarise(team_points = sum(points)) %>%
mutate(rolling_team_mean = roll_mean(lag(team_points), k=5)) %>%
arrange(team)
team_data
# A tibble: 30 x 4
# Groups: team [2]
team week team_points rolling_team_mean
<fctr> <int> <dbl> <dbl>
1 A 1 13 NA
2 A 2 11 13.00
3 A 3 6 12.00
4 A 4 13 10.00
5 A 5 19 10.75
6 A 6 10 12.40
7 A 7 13 11.80
8 A 8 16 12.20
9 A 9 16 14.20
10 A 10 12 14.80
# ... with 20 more rows
Then, if you like we can put everything back together:
mydata <- inner_join(mydata, team_data) %>%
arrange(week, team, player_id)
mydata[1:12, ]
player_id team week points team_points rolling_team_mean
1 1 A 1 4 13 NA
2 2 A 1 9 13 NA
3 3 B 1 10 12 NA
4 4 B 1 2 12 NA
5 1 A 2 8 11 13
6 2 A 2 3 11 13
7 3 B 2 9 12 12
8 4 B 2 3 12 12
9 1 A 3 5 6 12
10 2 A 3 1 6 12
11 3 B 3 7 12 12
12 4 B 3 5 12 12
Here's one way:
# compute points per team per week
pts <- with(mydata, tapply(points, list(team, week), sum, default = 0))
pts
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 11 6 13 19 10 13 16 16 12 17 11 13 10 4
#B 12 12 12 11 10 6 13 11 6 9 5 7 13 13 6
# compute the 5-week averages
sapply(setNames(seq(2, ncol(pts)), seq(2, ncol(pts))),
function(i) {
apply(pts[, seq(max(1, i - 5), i - 1), drop = FALSE], 1, mean)
})
# 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 12 10 10.75 12.4 11.8 12.2 14.2 14.8 13.4 14.8 14.4 13.8 12.6
#B 12 12 12 11.75 11.4 10.2 10.4 10.2 9.2 9.0 8.8 7.6 8.0 9.4
This will give the wrong result if the week variable has gaps.
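One way to guard against gaps is to select the window by week number rather than by position. The sketch below is only an illustration: prev_weeks_mean is a hypothetical helper, it expects one row per week for a single team (for example the aggregated team totals), and it loops over rows, so it suits the small team-by-week table rather than the raw player data.

```r
# Mean of `value` over the k weeks before each row's week,
# using the week numbers themselves so that missing weeks are skipped.
# `prev_weeks_mean` is a made-up name for illustration.
prev_weeks_mean <- function(week, value, k = 5) {
  vapply(seq_along(week), function(i) {
    in_window <- week < week[i] & week >= week[i] - k
    if (any(in_window)) mean(value[in_window]) else NA_real_
  }, numeric(1))
}

# Team A totals for weeks 1 to 7, taken from the output above
team_a <- prev_weeks_mean(1:7, c(13, 11, 6, 13, 19, 10, 13))

# Made-up example with week 3 missing
gappy <- prev_weeks_mean(c(1, 2, 4), c(10, 20, 30))
```

For several teams, the helper would be applied to each team's rows separately, for instance inside a grouped mutate.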

r: append mean of a subset of columns by name

I have this df:
webvisits1 webvisits2 webvisits3 webvisits4
s001 2 0 11 2
s002 11 2 23 3
s003 12 1 1 5
s004 13 5 5 0
s005 4 3 9 3
I need to create an output data frame with an added column containing the mean of webvisits1 and webvisits2 minus the mean of webvisits3 and webvisits4, like so:
webvisits1 webvisits2 webvisits3 webvisits4 difference_mean
s001 2 0 11 2 -5.5
s002 11 2 23 3 -6.5
s003 12 1 1 5 3.5
s004 13 5 5 0 6.5
s005 4 3 9 3 -2.5
Is there an easy way to do so, considering that column names (webvisits) are important?
Thank you
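The rowSums function sums each row across the selected columns. Subtract the two row sums and divide by 2 to get the difference of the means (here the IDs sit in a first column, s.no, so the webvisits columns are columns 2 to 5):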
library(dplyr)
dt %>%
mutate(difference_mean = (rowSums(dt[,2:3])-rowSums(dt[,4:5]))/2)
s.no webvisits1 webvisits2 webvisits3 webvisits4 difference_mean
1 s001 2 0 11 2 -5.5
2 s002 11 2 23 3 -6.5
3 s003 12 1 1 5 3.5
4 s004 13 5 5 0 6.5
5 s005 4 3 9 3 -2.5
We subset the dataset into two parts (df[1:2], df[3:4]), take their difference, compute the row means with rowMeans, and create a new column 'differenceMean' using transform.
df <- transform(df, differenceMean = rowMeans(df[1:2]- df[3:4]))
df
# webvisits1 webvisits2 webvisits3 webvisits4 differenceMean
#s001 2 0 11 2 -5.5
#s002 11 2 23 3 -6.5
#s003 12 1 1 5 3.5
#s004 13 5 5 0 6.5
#s005 4 3 9 3 -2.5
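Since the question stresses that column names matter, the same calculation can be written with the columns selected by name instead of by position. A minimal sketch, rebuilding the sample data; first_cols and second_cols are illustrative names.

```r
df <- data.frame(
  webvisits1 = c(2, 11, 12, 13, 4),
  webvisits2 = c(0, 2, 1, 5, 3),
  webvisits3 = c(11, 23, 1, 5, 9),
  webvisits4 = c(2, 3, 5, 0, 3),
  row.names = c("s001", "s002", "s003", "s004", "s005")
)

# Select each group of columns by name, so column order does not matter
first_cols  <- c("webvisits1", "webvisits2")
second_cols <- c("webvisits3", "webvisits4")

df$difference_mean <- rowMeans(df[first_cols]) - rowMeans(df[second_cols])
df
```

With many columns, the name vectors could be built with grep or paste0 rather than typed out.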

Calculate mean and SD for every timepoint with different ID's and different doses

I have the following dummy data set:
ID TIME DDAY DV
1 0 50 6.6
1 12 50 6.1
1 24 50 5.6
1 48 50 7.6
2 0 10 6.6
2 12 10 6.6
2 24 10 6.6
2 48 10 6.6
3 0 50 3.6
3 12 50 6.8
3 24 50 9.6
3 48 50 7.1
4 0 10 8.6
4 12 10 6.4
4 24 10 4.6
4 48 10 5.6
I want to create a summary table of the mean and standard deviation of DV, as shown below:
N TIME DDAY MEAN-DV SD-DV
2 0 50 6.5 1.1
2 12 50 6.1 0.8
2 24 50 4.5 2.0
2 48 50 7.5 1.0
2 0 10 6.9 1.5
2 12 10 8.5 1.3
2 24 10 6.1 0.9
2 48 10 4.5 1.8
How do I do this in R?
You can use:
1) dplyr:
library(dplyr)
dat %>%
group_by(TIME, DDAY) %>%
summarise(MEAN_DV = mean(DV), SD_DV = sd(DV), N = length(DV))
# TIME DDAY MEAN_DV SD_DV N
# 1 48 10 6.10 0.7071068 2
# 2 24 10 5.60 1.4142136 2
# 3 12 10 6.50 0.1414214 2
# 4 0 10 7.60 1.4142136 2
# 5 48 50 7.35 0.3535534 2
# 6 24 50 7.60 2.8284271 2
# 7 12 50 6.45 0.4949747 2
# 8 0 50 5.10 2.1213203 2
where dat is the name of your data frame.
2) data.table:
library(data.table)
DT <- as.data.table(dat)
DT[ , list(MEAN_DV = mean(DV), SD_DV = sd(DV), N = .N), by = c("TIME", "DDAY")]
# TIME DDAY MEAN_DV SD_DV N
# 1: 0 50 5.10 2.1213203 2
# 2: 12 50 6.45 0.4949747 2
# 3: 24 50 7.60 2.8284271 2
# 4: 48 50 7.35 0.3535534 2
# 5: 0 10 7.60 1.4142136 2
# 6: 12 10 6.50 0.1414214 2
# 7: 24 10 5.60 1.4142136 2
# 8: 48 10 6.10 0.7071068 2
require(plyr)
# THIS COLLAPSES ON TIME
ddply(df, .(TIME), summarize, MEAN_DV=mean(DV), SD_DV=sd(DV), N=length(DV))
# THIS COLLAPSES ON TIME AND DDAY
ddply(df, .(TIME, DDAY), summarize, MEAN_DV=mean(DV), SD_DV=sd(DV), N=length(DV))
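The same summary can be sketched without extra packages using aggregate. The data frame below rebuilds the dummy data from the question; the resulting column names (DV.MEAN, DV.SD, DV.N) come from the names used inside the anonymous function and are otherwise arbitrary.

```r
dat <- data.frame(
  ID   = rep(1:4, each = 4),
  TIME = rep(c(0, 12, 24, 48), times = 4),
  DDAY = rep(c(50, 10, 50, 10), each = 4),
  DV   = c(6.6, 6.1, 5.6, 7.6,
           6.6, 6.6, 6.6, 6.6,
           3.6, 6.8, 9.6, 7.1,
           8.6, 6.4, 4.6, 5.6)
)

# One row per TIME/DDAY combination; the function returns three statistics
summ <- aggregate(DV ~ TIME + DDAY, data = dat,
                  FUN = function(v) c(MEAN = mean(v), SD = sd(v), N = length(v)))

# aggregate() stores the three statistics in a single matrix column;
# this flattens it into ordinary columns
summ <- do.call(data.frame, summ)
summ
```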

Merging data sets with unequal observations

I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participant, and an HHID, the ID of the house from which they were recruited (e.g. 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of the data with only one participant per household, namely the one who spent the most hours watching television. In this subset, I have computed a socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to the participants of the bigger data set (Healthdata) such that participants from the same house get the same score.
I can't simply merge them, because the data sets have different numbers of observations: 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck.
I want the output to look something like the table below, i.e. IDs (participants) from the same HHID (house) should have the same SSE score. The output below is only an example of what I need; the seed above will not reproduce it.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it merges on the columns the two data frames have in common (here ID and HHID), so only rows that match on both receive an SSE value:
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose the column to merge by:
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note that this is somewhat confusing because the IDs in the full data set come from a different set than those in the subset, i.e. ID.x == 4 is not the same participant as ID.y == 4 (in fact, in this case they are in different households). Because of that, I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only ID.x:
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
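If only the SSE column is needed and the row order of Healthdata should stay untouched, a lookup with match on HHID is another option. A minimal sketch; house_scores and its SSE values are made up for illustration and stand in for the HHID and SSE columns of Healthdata.1.

```r
Healthdata <- data.frame(
  ID   = 1:15,
  HHID = c(1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 8, 9, 10, 11)
)

# Made-up household scores (one row per house), for illustration only
house_scores <- data.frame(HHID = 11:1, SSE = seq(-2.5, 2.5, by = 0.5))

# match() returns, for each participant, the row of their house in house_scores
Healthdata$SSE <- house_scores$SSE[match(Healthdata$HHID, house_scores$HHID)]
Healthdata
```

Participants whose house has no score would get NA, which mirrors the behaviour of a left join.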
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")
