apply conditional numbering to grouped data in R

apply conditional numbering to grouped data in R - r

I have a table like the one below with 100's of rows of data.
ID RANK
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3
4 6
4 7
4 7
4 7
4 7
4 7
4 6
I want to try to find a way to group the data by ID so that I can ReRank each group separately. The ReRank column is based on the Rank column and basically renumbering it starting at 1 from least to greatest, but it's important to note that the the number in the ReRank column can be put in more than once depending on the numbers in the Rank column .
In other words, the output needs to look like this
ID Rank ReRANK
1 3 2
1 2 1
1 3 2
2 4 1
2 8 2
3 3 1
3 3 1
3 3 1
For the life of me, I can't figure out how to be able to ReRank the the columns by the grouped columns and the value of the Rank columns.
This has been my best guess so far, but it definitely is not doing what I need it to do
ReRANK = mat.or.vec(length(RANK),1)
ReRANK[1] = counter = 1
for(i in 2:length(RANK)) {
if (RANK[i] != RANK[i-1]) { counter = counter + 1 }
ReRANK[i] = counter
}
Thank you in advance for the help!!

Here is a base R method using ave and rank:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) rank(i, ties.method="min"))
The min argument in rank assures that the minimum ranking will occur when there are ties. the default is to take the mean of the ranks.
In the case that you have ties lower down in the groups, rank will count those lower values and then add continue with the next lowest value as the count of the lower values + 1. These values wil still be ordered and distinct. If you really want to have the count be 1, 2, 3, and so on rather than 1, 3, 6 or whatever depending on the number of duplicate values, here is a little hack using factor:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) {
as.integer(factor(rank(i, ties.method="min"))))
Here, we use factor to build values counting from upward for each level. We then coerce it to be an integer.
For example,
temp <- c(rep(1, 3), 2,5,1,4,3,7)
[1] 2.5 2.5 2.5 5.0 8.0 2.5 7.0 6.0 9.0
rank(temp, ties.method="min")
[1] 1 1 1 5 8 1 7 6 9
as.integer(factor(rank(temp, ties.method="min")))
[1] 1 1 1 2 5 1 4 3 6
data
df <- read.table(header=T, text="ID Rank
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3 ")

Related

How to extract a list of columns name based on the means of their data?

I'm pretty new to R and hope i'll make myself clear enough.
I have a table of several columns which are factors. I want to make a score for each of these columns. Then I want to calculate the mean of each score, and display the list of columns ranked by their mean scores, is that possible ?
Table would be:
head(musico[,69:73])
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4
I want to make a score for each:
musico$score1<-0
musico$score1[musico$AVIS1==1]<-1
musico$score1[musico$AVIS1==2]<-0.5
then do the mean of each column score: mean of score1, mean of score2, ...:
mean(musico$score1), mean(musico$score2), ...
My goal is to have a list of titles (avis1, avis2,...) ranked by their mean score.
Any advice appreciated !

Here's one way using base although it is somewhat unclear what you want. What does score1 have to do with AVIS1? I think you may be missing some of the data from musico.
Based on the example provided, here's a base R solution. vapply loops through the data.frame and produces the mean for each column. Then the stack and order are only there to make the output a dataframe that looks nice.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE)
means <- vapply(music, mean, 1)
stack(means[order(means, decreasing = TRUE)])
values ind
4 4.000000 AVIS4
3 3.166667 AVIS3
2 2.666667 AVIS2
5 2.500000 AVIS5
1 2.166667 AVIS1

This is how I would do it by first introducing a scores vector to be used as a lookup. I assume that scores are decreasing by 0.5 and that the number of scores needed are according to the maximum number of levels found in your columns (i.e. 6 seen in AVIS1).
Then using tidyr you can organise your data set such that you have to variables (i.e. AVIS and Value) containing the respective levels. Then add a score variable with the mutate function from dplyr in which the position of the score in the score vector matches the value in the Value variable. From here you can find the mean scores corresponding to the AVIS levels, arrange them accordingly and put them in a list.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE) # your data
scores <- seq(1, by = -0.5, length.out = 6) # vector of scores
library(tidyr)
library(dplyr)
music2 <- music %>%
gather(AVIS, Value) %>% # here you tidy the data
mutate(score = scores[Value]) %>% # match score to value
group_by(AVIS) %>% # group AVIS levels
summarise(score.mean = mean(score)) %>% # find mean scores for AVIS levels
arrange(desc(score.mean))
list <- list(AVIS = music2$AVIS) # here is the list
> list$AVIS
[1] "AVIS1" "AVIS5" "AVIS2" "AVIS3" "AVIS4"

Create a rolling index of pairs over groups

I need to create (with R) a rolling index of pairs from a data set that includes groups. Consider the following data set:
times <- c(4,3,2)
V1 <- unlist(lapply(times, function(x) seq(1, x)))
df <- data.frame(group = rep(1:length(times), times = times),
V1 = V1,
rolling_index = c(1,1,2,2,3,3,4,5,5))
df
group V1 rolling_index
1 1 1 1
2 1 2 1
3 1 3 2
4 1 4 2
5 2 1 3
6 2 2 3
7 2 3 4
8 3 1 5
9 3 2 5
The data frame I have includes the variables group and V1. Within each group V1 designates a running index (that may or may not start at 1).
I want to create a new indexing variable that looks like rolling_index. This variable groups rows within the same group and consecutive V1 value, thus creating a new rolling index. This new index must be consecutive over groups. If there is an uneven amount of rows within a group (e.g. group 2), then the last, single row gets its own rolling index value.

You can try
library(data.table)
setDT(df)[, gr:=as.numeric(gl(.N, 2, .N)), group][,
rollindex:=cumsum(c(TRUE,abs(diff(gr))>0))][,gr:= NULL]
# group V1 rolling_index rollindex
#1: 1 1 1 1
#2: 1 2 1 1
#3: 1 3 2 2
#4: 1 4 2 2
#5: 2 1 3 3
#6: 2 2 3 3
#7: 2 3 4 4
#8: 3 1 5 5
#9: 3 2 5 5
Or using base R
indx1 <- !duplicated(df$group)
indx2 <- with(df, ave(group, group, FUN=function(x)
gl(length(x), 2, length(x))))
cumsum(c(TRUE,diff(indx2)>0)|indx1)
#[1] 1 1 2 2 3 3 4 5 5
Update
The above methods are based on the 'group' column. Suppose you already have a sequence column ('V1') by group as showed in the example, creation of rolling index is easier
cumsum(!!df$V1 %%2)
#[1] 1 1 2 2 3 3 4 5 5
As mentioned in the post, if the 'V1' column do not start at '1' for some groups, we can get the sequence from the 'group' and then do the cumsum as above
cumsum(!!with(df, ave(seq_along(group), group, FUN=seq_along))%%2)
#[1] 1 1 2 2 3 3 4 5 5

There is probably a simpler way but you can do:
rep_each <- unlist(mapply(function(q,r) {c(rep(2, q),rep(1, r))},
q=table(df$group)%/%2,
r=table(df$group)%%2))
df$rolling_index <- inverse.rle(x=list(lengths=rep_each, values=seq(rep_each)))
df$rolling_index
#[1] 1 1 2 2 3 3 4 5 5

Select max or equal value from several columns in a data frame

I'm trying to select the column with the highest value for each row in a data.frame. So for instance, the data is set up as such.
> df <- data.frame(one = c(0:6), two = c(6:0))
> df
one two
1 0 6
2 1 5
3 2 4
4 3 3
5 4 2
6 5 1
7 6 0
Then I'd like to set another column based on those rows. The data frame would look like this.
> df
one two rank
1 0 6 2
2 1 5 2
3 2 4 2
4 3 3 3
5 4 2 1
6 5 1 1
7 6 0 1
I imagine there is some sort of way that I can use plyr or sapply here but it's eluding me at the moment.

There might be a more efficient solution, but
ranks <- apply(df, 1, which.max)
ranks[which(df[, 1] == df[, 2])] <- 3
edit: properly spaced!

Create sequence of repeated values, in sequence?

I need a sequence of repeated numbers, i.e. 1 1 ... 1 2 2 ... 2 3 3 ... 3 etc. The way I implemented this was:
nyear <- 20
names <- c(rep(1,nyear),rep(2,nyear),rep(3,nyear),rep(4,nyear),
rep(5,nyear),rep(6,nyear),rep(7,nyear),rep(8,nyear))
which works, but is clumsy, and obviously doesn't scale well.
How do I repeat the N integers M times each in sequence?
I tried nesting seq() and rep() but that didn't quite do what I wanted.
I can obviously write a for-loop to do this, but there should be an intrinsic way to do this!

You missed the each= argument to rep():
R> n <- 3
R> rep(1:5, each=n)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
R>
so your example can be done with a simple
R> rep(1:8, each=20)

Another base R option could be gl():
gl(5, 3)
Where the output is a factor:
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
If integers are needed, you can convert it:
as.numeric(gl(5, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

For your example, Dirk's answer is perfect. If you instead had a data frame and wanted to add that sort of sequence as a column, you could also use group from groupdata2 (disclaimer: my package) to greedily divide the datapoints into groups.
# Attach groupdata2
library(groupdata2)
# Create a random data frame
df <- data.frame("x" = rnorm(27))
# Create groups with 5 members each (except last group)
group(df, n = 5, method = "greedy")
x .groups
<dbl> <fct>
1 0.891 1
2 -1.13 1
3 -0.500 1
4 -1.12 1
5 -0.0187 1
6 0.420 2
7 -0.449 2
8 0.365 2
9 0.526 2
10 0.466 2
# … with 17 more rows
There's a whole range of methods for creating this kind of grouping factor. E.g. by number of groups, a list of group sizes, or by having groups start when the value in some column differs from the value in the previous row (e.g. if a column is c("x","x","y","z","z") the grouping factor would be c(1,1,2,3,3).

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the follows. Actually my real ’df’ dataframe is much bigger than this one here but I really do not want to confuse anybody so that is why I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occured 3 times and number '3' occured 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occured 4 times, number '2' occured 3 times and number '3' occured 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?

How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab[3,'a',1]
[1] 4

A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a grouping doesn't have all the elements in it, as in 1a, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5

I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.

A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4

You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

apply conditional numbering to grouped data in R - r

Related

How to extract a list of columns name based on the means of their data?

Create a rolling index of pairs over groups

Select max or equal value from several columns in a data frame

Create sequence of repeated values, in sequence?

Calculating the occurrences of numbers in the subsets of a data.frame

Categories

Resources