First: I want to sort a dataframe and then add a rank to it.
df <- data.frame(a = 3:1, b = 6:4, Rank = NA)  # create dataframe
df <- df[order(df[, 1], df[, 2]), ]            # sort dataframe
for (i in 1:nrow(df)) df[i, 3] <- i            # add the ranking
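For what it's worth, the loop can be replaced by a single vectorized assignment, since after sorting the rank is simply the row position (a sketch):
df$Rank <- seq_len(nrow(df))  # after sorting, rank equals row position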
Second: I want to sort and then rank within a group g.
df <- data.frame(g = sample(1:4, 4), num = 1:20, Rank = NA)
df <- df[order(df[, 1], df[, 2]), ]
row <- 1
for (x in 1:4) {
  rank <- 1
  df[row, 3] <- rank  # the first row of a group always gets rank 1
  row <- row + 1      # move to the next row
  while (df[row - 1, 1] == df[row, 1] & row < length(df[, 1]) + 1) {
    # keep going while the next row is still in the same group
    # and we have not yet run past the last row
    rank <- rank + 1    # increment the rank
    df[row, 3] <- rank  # put the rank in the dataframe
    row <- row + 1      # move to the next row
  }
}
It works, but I would like to accomplish the same tasks with more parsimonious or efficient code.
Try:
set.seed(123)
df = data.frame(g=sample(1:4, 4), num = 1:20, Rank = NA)
library(dplyr)
df %>% group_by(g) %>% arrange(num) %>% mutate(rank = seq_along(g))
Source: local data frame [20 x 4]
Groups: g
g num Rank rank
1 1 3 NA 1
2 1 7 NA 2
3 1 11 NA 3
4 1 15 NA 4
5 1 19 NA 5
6 2 1 NA 1
7 2 5 NA 2
8 2 9 NA 3
9 2 13 NA 4
10 2 17 NA 5
11 3 2 NA 1
12 3 6 NA 2
13 3 10 NA 3
14 3 14 NA 4
15 3 18 NA 5
16 4 4 NA 1
17 4 8 NA 2
18 4 12 NA 3
19 4 16 NA 4
20 4 20 NA 5
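If you are on a current version of dplyr, row_number() can rank within each group directly, without the arrange step (a sketch of the same idea):
library(dplyr)
df %>% group_by(g) %>% mutate(rank = row_number(num)) %>% ungroup()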
Is this what you need?
df = data.frame(g=sample(1:4, 4), num = 1:20, Rank = NA)
df <- df[order(df[,1],df[,2]),]
df$Rank <- rep(1:5,4)
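Note that rep(1:5, 4) assumes every group has exactly five rows after sorting; a base R alternative that adapts to the actual group sizes could use ave() (a sketch):
df$Rank <- ave(df$num, df$g, FUN = seq_along)  # per-group running count 1..n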
I have a dataframe that stores adjacency relations, and I want to divide numbers into different groups according to it. The dataframe is as follows:
df = data.frame(from=c(1,1,2,2,2,3,3,3,4,4,4,5,5), to=c(1,3,2,3,4,1,2,3,2,4,5,4,5))
df
from to
1 1 1
2 1 3
3 2 2
4 2 3
5 2 4
6 3 1
7 3 2
8 3 3
9 4 2
10 4 4
11 4 5
12 5 4
13 5 5
In the above dataframe, number 1 has links with numbers 1 and 3, and number 2 has links with numbers 2, 3 and 4, so number 1 cannot be in the same group as number 3, and number 2 cannot be in the same group as numbers 3 and 4. In the end, the groups can be c(1, 2, 5) and c(3, 4).
How can I program this?
First replace the values of to with NA when from and to are equal.
df2 <- transform(df, to = replace(to, from == to, NA))
Then iterate over the rows with Reduce(), binding each row onto the result only if its from value has not yet appeared in the to column of the rows accumulated so far.
Reduce(function(x, y) {
  if (y$from %in% x$to) x else rbind(x, y)
}, split(df2, 1:nrow(df2)))
# from to
# 1 1 NA
# 2 1 3
# 3 2 NA
# 4 2 3
# 5 2 4
# 12 5 4
# 13 5 NA
Finally, you could extract the unique elements of both columns to get the two groups.
The overall pipeline should be
df |>
  transform(to = replace(to, from == to, NA)) |>
  (\(dat) split(dat, 1:nrow(dat)))() |>
  Reduce(f = \(x, y) if (y$from %in% x$to) x else rbind(x, y))
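For completeness, here is a sketch of the final extraction step; res is a hypothetical name for the result of the pipeline above:
res <- df |>
  transform(to = replace(to, from == to, NA)) |>
  (\(dat) split(dat, 1:nrow(dat)))() |>
  Reduce(f = \(x, y) if (y$from %in% x$to) x else rbind(x, y))
unique(res$from)         # first group: 1 2 5
unique(na.omit(res$to))  # second group: 3 4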
Darren Tsai's answer solves this problem, but with some flaws.
The following is a very clumsy solution:
df = data.frame(from=c(1,1,2,2,2,3,3,3,4,4,4,5,5), to=c(1,3,2,3,4,1,2,3,2,4,5,4,5))
df.list = lapply(split(df, df$from), function(x) {
  x$to
})
group.idx = rep(1, length(unique(df$from)))
for (i in seq_along(df.list)) {
  df.vec <- df.list[[i]]
  curr.group = group.idx[i]
  remain.vec = setdiff(df.vec, i)
  for (j in remain.vec) {
    if (group.idx[j] == curr.group) {
      group.idx[j] = curr.group + 1
    }
  }
}
group.idx
[1] 1 1 2 2 1
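To map group.idx back onto the actual numbers, split() works (a sketch):
split(unique(df$from), group.idx)
# $`1`
# [1] 1 2 5
# $`2`
# [1] 3 4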
I have to subset the data into chunks of 6 rows at a time. How do I do that in R?
data:
col1 : 1,2,3,4,5,6,7,8,9,10
col2 : a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
I want to take a subset of 6 rows at a time: the first subset of rows will be 1:6, and the next will be 7:nrow(data). I have tried using the seq function.
seqData <- seq(1, nrow(data), 6)
Output: it gives only the 1st and 7th rows, but I want rows 1 to 6 first, and rows 7 to nrow(data) after that.
How can I get output like that?
Will this work:
set.seed(1)
dat <- data.frame(c1 = sample(1:5, 12, T),
                  c2 = sample(1:5, 12, T))
dat
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
split(dat, rep(1:ceiling(nrow(dat)/6), each = 6))
$`1`
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
$`2`
c1 c2
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
The function below creates a numeric vector whose integer values increase by 1 every n rows, and uses this vector to split the data as needed.
data <- data.frame(col1 = 1:10, col2 = paste0("a", 1:10))
split_nrows <- function(x, n) {
  f <- c(1, rep(0, n - 1))
  f <- rep(f, length.out = NROW(x))
  f <- cumsum(f)
  split(x, f)
}
split_nrows(data, 6)
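A shorter base R equivalent of the same idea, in case it helps (a sketch):
# ceiling(seq_len(n) / 6) gives 1,1,1,1,1,1,2,2,... and copes with a partial last chunk
split(data, ceiling(seq_len(nrow(data)) / 6))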
Here's a simple example with mtcars that yields a list of 6 subset dfs.
nrows <- nrow(mtcars)
breaks <- seq(1, nrows, 6)
listdfs <- lapply(breaks, function(x) mtcars[x:(x+5), ]) # increment by 5 not 6
listdfs[[6]] <- listdfs[[6]][1:2, ] #last df: remove 4 NA rows (36 - 32)
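Alternatively, a small tweak to the indexing avoids creating the NA rows in the first place (a sketch):
# clamp the end of each chunk to the last row instead of trimming afterwards
listdfs <- lapply(breaks, function(x) mtcars[x:min(x + 5, nrows), ])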
Given a two-column data.frame, with one column containing group labels and the other containing integer values ordered from smallest to largest, how can the data be expanded to create pairs of combinations of the integer column within each group?
I'm not sure of the best way to state this. I'm not interested in all possible permutations, but rather all unique combinations starting from the lowest value.
In R, the combn function gives the desired output when groups are not considered, for example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional permutation (2,1), which I don't need. How would one then apply a similar method by group?
For example, given a data.frame:
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
library(dplyr)  # needed for filter()
j <- 1
for (i in unique(test$Group)) {
  if (j == 1) {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test1 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    j <- j + 1
  } else {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test2 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    test1 <- rbind(test1, test2)
  }
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
  as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
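The split/rbindlist round trip can likely be avoided by letting data.table group directly (a sketch of the same idea):
library(data.table)
setDT(test)
test[, as.data.table(t(combn(Val, m = 2))), by = Group]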
You can set simplify = F in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
  group_by(Group) %>%
  summarise(Val = combn(Val, 2, simplify = F)) %>%
  unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
library(tidyverse)
# note: the data in this question is test, and n should match the group size
df2 <- split(test$Val, test$Group) %>%
  map(~ gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")
I have code to turn the upper triangle of a matrix into a vector and store the values from this vector along with their original coordinates from the matrix into a data frame.
How do I make the for loop skip an iteration when the element of the vector is zero? I have tried else statements and other approaches.
v <- matrix(sample(0:1, 10, replace = TRUE),9,9)
t <- v[upper.tri(v,diag=T)]
tful <- t[t!=0]
df <- data.frame(FP1 = rep(0, length(t)),
                 FP2 = rep(0, length(t)),
                 tanimoto = rep(0, length(t)))
for (i in 1:length(t)) {
  if (t[i] == 0) next
  else {
    col_num <- floor(sqrt(2*i - 7/4) + .5)
    row_num <- i - (.5*col_num^2 - .5*col_num + 1) + 1
    df$FP1[i] <- row_num
    df$FP2[i] <- col_num
    df$tanimoto[i] <- v[row_num, col_num]
  }
}
I don't want any zeros in my data frame, and I want the loop to skip these values.
I understand the data frame will then have fewer rows, but I am using this as an example.
Your next is working fine to skip the current iteration of the loop.
You still get 0s in the final result because all values of df were initialized to 0. When you skip an iteration, those values are not changed, so they remain 0. If you change the initialization to NA values, you'll see that no 0s are added.
df <- data.frame(FP1 = rep(NA, length(t)),
                 FP2 = rep(NA, length(t)),
                 tanimoto = rep(NA, length(t)))
for (i in 1:length(t)) {
  if (t[i] == 0) next
  else {
    col_num <- floor(sqrt(2*i - 7/4) + .5)
    row_num <- i - (.5*col_num^2 - .5*col_num + 1) + 1
    df$FP1[i] <- row_num
    df$FP2[i] <- col_num
    df$tanimoto[i] <- v[row_num, col_num]
  }
}
df
# FP1 FP2 tanimoto
# 1 1 1 1
# 2 1 2 1
# 3 2 2 1
# 4 1 3 1
# 5 2 3 1
# 6 3 3 1
# 7 NA NA NA
# 8 2 4 1
# 9 3 4 1
# 10 4 4 1
# 11 NA NA NA
# ...
A simple modification would be to filter your data frame as a last step: df = df[df$tanimoto != 0, ], or if you switch to NA, df = na.omit(df).
We could also create a non-looping solution:
v1 = v != 0
df2 = data.frame(FP1 = row(v)[v1], FP2 = col(v)[v1], tanimoto = v[v1])
df2 = subset(df2, FP1 <= FP2)
df2
# FP1 FP2 tanimoto
# 1 1 1 1
# 7 1 2 1
# 8 2 2 1
# 13 1 3 1
# 14 2 3 1
# 15 3 3 1
# 20 2 4 1
# 21 3 4 1
# 22 4 4 1
# 27 3 5 1
# 28 4 5 1
# 29 5 5 1
# 33 1 6 1
# 34 4 6 1
# 35 5 6 1
# ...
I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help and some other posts, I have tried using the duplicated function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
I'm not sure why I get the NA rows at the end (this wasn't happening with a similar table I was testing), and it works only partially on the values.
I have also tried using the data.table package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
This works, but it eliminates ALL duplicates from the data frame, not just the consecutive ones, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please forgive me if this counts as cross-posting. I tried some of the suggestions there, but none worked for eliminating only the consecutive duplicates.
I would appreciate any help.
Thanks
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [.
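For the example data, the intermediate values are (a sketch):
rle(df$x)
# Run Length Encoding
#   lengths: int [1:5] 3 2 1 2 1
#   values : num [1:5] 1 2 4 2 1
cumsum(rle(df$x)$lengths)
# [1] 3 5 6 8 9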
EDIT: for fun, here's a microbenchmark of the answers given so far, with rle being mine, consec being what I think is the most fundamentally direct answer (given by @James, and the one I would "accept"), and dp being the dplyr answer given by @Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle performs better than I thought it would.
You just need to check that each value differs from the one that follows it, i.e. x[i+1] != x[i]; note that the last value will always be kept.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
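To see why this works, here is the logical index for the example data; it keeps rows 3, 5, 6, 8 and 9 (a sketch):
c(df$x[-1] != df$x[-nrow(df)], TRUE)
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE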
A cheap solution with dplyr that I could think of:
Method:
library(dplyr)
df %>%
  mutate(id = lag(x, 1),
         decision = if_else(x != id, 1, 0),
         final = lead(decision, 1, default = 1)) %>%
  filter(final == 1) %>%
  select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data has the same x value at the bottom.
New input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Using the same method:
df2 %>%
  mutate(id = lag(x, 1),
         decision = if_else(x != id, 1, 0),
         final = lead(decision, 1, default = 1)) %>%
  filter(final == 1) %>%
  select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x.
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]
This way you compare each value of x with the value immediately following it and drop the rows where they match. Make sure to set fill to a value that does not occur in x so that the last row is handled correctly.
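For the example data this keeps the same five rows as the other answers (a sketch of the expected output):
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]
#    x  y z
# 1: 1 30 3
# 2: 2 49 5
# 3: 4 13 6
# 4: 2 49 8
# 5: 1 30 9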