Iteration through rows of a dataframe within groups of columns in R

I have a dataframe df with 6 fields: A, B, C, D, E & F. My requirement is to create a new column G equal to the previous value of C + the previous value of D + the previous value of G - the current value of F. This needs to be done at a group level through columns A & B (group by A & B). If it is the first row within the group, then the value in column G should be equal to E.
Sample Df -
A B C D E F
1 2 100 200 300 0
1 2 110 210 310 10
1 2 120 130 300 10
1 1 140 150 80 0
1 1 50 60 80 20
1 1 50 60 80 20
Output -
A B C D E F G
1 2 100 200 300 0 300
1 2 110 210 310 10 590
1 2 120 130 300 10 900
1 1 140 150 80 0 80
1 1 50 60 80 20 350
1 1 50 60 80 20 440
Please provide a suitable solution.

Here is one option with dplyr where we group by 'A' and 'B', take the lag of 'C', 'D', and 'E', add (+) them, subtract 'F', and coalesce the result with the 'E' column
library(dplyr)
df1 %>%
  group_by(A, B) %>%
  mutate(G = coalesce(lag(C) + lag(D) + lag(E) - F, E))
-output
# A tibble: 6 x 7
# Groups: A, B [2]
# A B C D E F G
# <int> <int> <int> <int> <int> <int> <int>
#1 1 2 100 200 300 0 300
#2 1 2 110 210 310 10 590
#3 1 2 120 130 300 10 620
#4 1 1 140 150 80 0 80
#5 1 1 50 60 80 20 350
#6 1 1 50 60 80 20 170
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 1L, 1L), B = c(2L, 2L, 2L,
1L, 1L, 1L), C = c(100L, 110L, 120L, 140L, 50L, 50L), D = c(200L,
210L, 130L, 150L, 60L, 60L), E = c(300L, 310L, 300L, 80L, 80L,
80L), F = c(0L, 10L, 10L, 0L, 20L, 20L)), class = "data.frame",
row.names = c(NA,
-6L))
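Note that this lag-based answer uses lag(E) rather than the running total G, so its rows 3 and 6 come out as 620 and 170 instead of the 900 and 440 requested in the question. If the recursion G[i] = previous C + previous D + previous G - current F is literally what's wanted, G has to be carried forward within each group. A minimal sketch with a per-group helper (the name calc_G is mine, not part of the original answer):
library(dplyr)
# compute G recursively: the first row of a group gets E, later rows get
# previous C + previous D + previous G - current F
calc_G <- function(C, D, E, F) {
  G <- numeric(length(C))
  G[1] <- E[1]
  for (i in seq_along(C)[-1]) {
    G[i] <- C[i - 1] + D[i - 1] + G[i - 1] - F[i]
  }
  G
}
df1 %>%
  group_by(A, B) %>%
  mutate(G = calc_G(C, D, E, F)) %>%
  ungroup()
# rows 3 and 6 now come out as 900 and 440, matching the requested output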

Related

Is there an R function for removing rows based on a condition within a group?

Hello all, my df looks like:
PID V1
123 1
123 2
123 3
111 1
111 2
111 1
122 3
122 1
122 1
333 1
333 4
333 2
I want to delete all rows for a PID whose V1 events are only 1s and 2s.
Expected output:
PID V1
123 1
123 2
123 3
122 3
122 1
122 1
333 1
333 4
333 2
You can do this in base R :
subset(df, !ave(V1 %in% 1:2, PID, FUN = all))
# PID V1
#1 123 1
#2 123 2
#3 123 3
#7 122 3
#8 122 1
#9 122 1
#10 333 1
#11 333 4
#12 333 2
dplyr
library(dplyr)
df %>% group_by(PID) %>% filter(!all(V1 %in% 1:2))
or data.table :
library(data.table)
setDT(df)[, .SD[!all(V1 %in% 1:2)], PID]
The logic of all of them is the same: remove groups (PID) that have only 1 and 2 in the V1 column.
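In case ave() is unfamiliar: in the base R version it replaces each element of V1 %in% 1:2 with its group's all(), producing a row-length logical mask, which is the same mask the dplyr and data.table variants build. A toy illustration (values made up for demonstration):
# each element is replaced by all() of its group, keeping the original length
ave(c(TRUE, TRUE, FALSE), c(1, 1, 2), FUN = all)
# [1]  TRUE  TRUE FALSE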
data
df <- structure(list(PID = c(123L, 123L, 123L, 111L, 111L, 111L, 122L,
122L, 122L, 333L, 333L, 333L), V1 = c(1L, 2L, 3L, 1L, 2L, 1L,
3L, 1L, 1L, 1L, 4L, 2L)), class = "data.frame", row.names = c(NA, -12L))

Select top x % of values per group - retain row ID

I am trying to identify the top 15% of scores for each watershed but retain the polygon ID when I print the results.
# here's a small example dataset (called "data"):
polygon watershed score
1 1 61
2 1 81
3 1 16
4 2 18
5 2 12
6 3 78
7 3 81
8 3 20
9 3 97
10 3 95
# I obtain the top 15% using this method:
top15 <- data %>%
  select(watershed, score) %>%
  group_by(watershed) %>%
  arrange(watershed, desc(score)) %>%
  filter(score > quantile(score, 0.15))
# results look like this:
watershed score
<int> <int>
1 1 81
2 1 61
3 2 18
4 3 97
5 3 95
6 3 81
7 3 78
How can I include the column "polygon" when I print the results?
Thanks so much for the help!
In your pipeline you selected only watershed and score, which excludes polygon. Remove the select() call and you should get what you want. Additionally, the arrange() doesn't add value, so I removed it:
library(dplyr)
mdat <- structure(list(polygon = 1:10,
watershed = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
score = c(61L, 81L, 16L, 18L, 12L, 78L, 81L, 20L, 97L, 95L)),
class = "data.frame", row.names = c(NA, -10L))
mdat %>%
  group_by(watershed) %>%
  filter(score > quantile(score, 0.15))
# # A tibble: 7 x 3
# # Groups: watershed [3]
# polygon watershed score
# <int> <int> <int>
# 1 1 1 61
# 2 2 1 81
# 3 4 2 18
# 4 6 3 78
# 5 7 3 81
# 6 9 3 97
# 7 10 3 95
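One caveat the original exchange leaves open: score > quantile(score, 0.15) keeps everything above the 15th percentile, i.e. roughly the top 85% of each group, which is what the printed results show. If the top 15% of scores is literally what's wanted, filter above the 85th percentile instead; a sketch with the same mdat:
# scores above the 85th percentile, i.e. roughly the top 15% per group
mdat %>%
  group_by(watershed) %>%
  filter(score > quantile(score, 0.85))
(dplyr >= 1.0 also offers slice_max(score, prop = 0.15), but prop rounds down, so groups as small as these would return no rows.)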

ddply using "group_by" logic

I'm trying to use ddply to find the smallest distance between two positions pos where the corresponding chrom is the same in two dataframes:
head(bps, 10)
chrom pos iteration
1 1 4 1
2 1 14 1
3 1 68 1
4 1 79 1
5 1 200 1
6 1 205 1
7 1 270 1
8 1 304 1
9 2 7 1
10 2 13 1
head(flocs)
chrom pos
1 1 100
2 1 200
3 1 220
4 1 312
5 2 500
6 2 501
As an example, for the first line in bps, I want to find the closest pos in flocs where chrom = 1, which gives a value of -96.
The pseudocode for what I'm trying to do is:
foreach iteration (bps$iteration):
    foreach chrom (bps$chrom):
        foreach pos (bps$pos):
            features_pos = pos in dataframe flocs closest to pos on the same chromosome
            min_dist = features_pos - pos
            return features_pos, min_dist
I am trying to do this with ddply:
minDists <- ddply(bp_data, c("chrom", "pos"), function(x) {
  index <- which.min(abs(flocs$pos[which(flocs$chrom == x$chrom)] - x$pos))
  closestMotif <- flocs$pos[index]
  chrom <- as.character(flocs$chrom[index])
  dist <- (x$pos - closestMotif)
  data.frame(features_pos = closestMotif, pos = x$pos, min_dist = dist, feature = feature)
})
But this doesn't constrain comparisons to the same chromosome:
head(minDists, 10)
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 100 7 -93 feature1 # bps chrom=2, flocs chrom=1
10 2 100 13 -87 feature1 # bps chrom=2, flocs chrom=1
The expected output here is:
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 500 7 -493 feature1 # bp1 chrom=2, flocs chrom=2
10 2 500 13 -487 feature1 # bp1 chrom=2, flocs chrom=2
I thought that providing the columns c("chrom", "pos") essentially applied a group_by before the function call.
Is there any way that I can improve what I've written to achieve the desired result?
bps <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"
), class = "factor"), pos = c(4L, 14L, 68L, 79L, 200L, 205L,
270L, 304L, 7L, 13L, 23L, 39L, 100L, 150L, 17L, 55L, 75L, 79L,
102L, 109L, 123L, 155L, 157L, 200L, 260L, 299L, 300L, 320L, 323L,
345L, 450L, 550L), iteration = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor")), row.names = c(NA,
-32L), class = "data.frame")
flocs <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1", "2", "3"), class = "factor"), pos = c(100L,
200L, 220L, 312L, 500L, 501L, 123L, 444L)), row.names = c(NA,
-8L), class = "data.frame")
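For the record, the reason the ddply attempt mixes chromosomes is an indexing slip rather than the grouping: which.min() returns a position within the subset flocs$pos[which(flocs$chrom == x$chrom)], but that position is then used to index the full flocs$pos vector. Indexing the same subset fixes it. A minimal sketch staying with plyr (the feature column is dropped here, since it referenced an undefined variable in the original code):
library(plyr)
minDists <- ddply(bps, c("chrom", "pos"), function(x) {
  # restrict candidate positions to the same chromosome before indexing
  sub_pos <- flocs$pos[flocs$chrom == x$chrom]
  closestMotif <- sub_pos[which.min(abs(sub_pos - x$pos))]
  data.frame(features_pos = closestMotif, min_dist = x$pos - closestMotif)
})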
data.table approach using a rolling join...
updated answer
(initially forgot all about the by-reference joining, which is faster and most certainly shorter ;-) )
library(data.table)
# set data as data.table, keyed by chromosome and position
setDT(bps, key = c("chrom", "pos"))
setDT(flocs, key = c("chrom", "pos"))
# by-reference rolling join: x.pos is the pos column of flocs (the x side of
# the join), matched to the nearest flocs position on the same chrom
bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
output
# chrom pos iteration mindist
# 1: 1 4 1 -96
# 2: 1 14 1 -86
# 3: 1 68 1 -32
# 4: 1 79 1 -21
# 5: 1 200 1 0
# 6: 1 205 1 5
# 7: 1 270 1 -42
# 8: 1 304 1 -8
# 9: 2 7 1 -493
# 10: 2 13 1 -487
# 11: 2 23 1 -477
# 12: 2 39 1 -461
# 13: 2 100 1 -400
# 14: 2 150 1 -350
# 15: 3 17 1 -106
# 16: 3 55 1 -68
# 17: 3 75 1 -48
# 18: 3 79 1 -44
# 19: 3 102 1 -21
# 20: 3 109 1 -14
# 21: 3 123 1 0
# 22: 3 155 1 32
# 23: 3 157 1 34
# 24: 3 200 1 77
# 25: 3 260 1 137
# 26: 3 299 1 -145
# 27: 3 300 1 -144
# 28: 3 320 1 -124
# 29: 3 323 1 -121
# 30: 3 345 1 -99
# 31: 3 450 1 6
# 32: 3 550 1 106
# chrom pos iteration mindist
Benchmarking answer until now
# Unit: milliseconds
# expr min lq mean median uq max neval
# Ronak_base 2.355879 2.555768 2.973069 2.626415 2.773581 8.016016 100
# Wimpel_data.table 1.697921 2.035788 2.416199 2.209616 2.361001 17.724528 100
# Pawel_tidyverse 14.845354 15.310505 16.333158 15.814819 16.541618 24.077871 100
microbenchmark::microbenchmark(
  Ronak_base = {
    bps$min_dist <- unlist(mapply(return_min_value, unique(bps$chrom), split(bps$pos, bps$chrom)))
  },
  Wimpel_data.table = {
    setDT(bps, key = c("chrom", "pos"))
    setDT(flocs, key = c("chrom", "pos"))
    # perform by-reference rolling join
    bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
  },
  Pawel_tidyverse = {
    bps %>%
      select(-iteration) %>%
      unite('bps') %>%
      crossing(flocs %>% unite('flocks')) %>%
      separate(bps, c('chrom_bps', 'pos')) %>%
      separate(flocks, c('chrom_flocks', 'features_pos')) %>%
      filter(chrom_bps == chrom_flocks) %>%
      select(-chrom_flocks) %>%
      rename_at(1, ~'chrom') %>%
      mutate_all(as.numeric) %>%
      mutate(min_dist = pos - features_pos) %>%
      group_by(chrom, pos) %>%
      filter(abs(min_dist) == min(abs(min_dist)))
  }
)
Looks like my data.table answer and the answer by Ronak Shah are pretty close together. I believe that data.table will gain a clear advantage when the data sets get large (but I haven't tested).
My base R attempt creates a helper function (return_min_value). This function subsets flocs based on the current chrom and returns, for each pos, the difference with the smallest absolute value. We split the pos column by chrom and pass those values, along with the unique chrom values, to return_min_value in mapply.
return_min_value <- function(x, y) {
  sapply(y, function(p) {
    vals <- p - flocs$pos[flocs$chrom == x]
    vals[which.min(abs(vals))]
  })
}
bps$min_dist <- unlist(mapply(return_min_value,
                              unique(bps$chrom), split(bps$pos, bps$chrom)))
bps
# chrom pos iteration min_dist
#1 1 4 1 -96
#2 1 14 1 -86
#3 1 68 1 -32
#4 1 79 1 -21
#5 1 200 1 0
#6 1 205 1 5
#7 1 270 1 -42
#8 1 304 1 -8
#9 2 7 1 -493
#10 2 13 1 -487
#...
Check this solution:
library(tidyverse)
bps %>%
  select(-iteration) %>%
  unite('bps') %>%
  crossing(flocs %>% unite('flocks')) %>%
  separate(bps, c('chrom_bps', 'pos')) %>%
  separate(flocks, c('chrom_flocks', 'features_pos')) %>%
  filter(chrom_bps == chrom_flocks) %>%
  select(-chrom_flocks) %>%
  rename_at(1, ~'chrom') %>%
  mutate_all(as.numeric) %>%
  mutate(min_dist = pos - features_pos) %>%
  group_by(chrom, pos) %>%
  filter(abs(min_dist) == min(abs(min_dist)))
Output:
chrom pos features_pos min_dist
<dbl> <dbl> <dbl> <dbl>
1 1 4 100 -96
2 1 14 100 -86
3 1 68 100 -32
4 1 79 100 -21
5 1 200 200 0
6 1 205 200 5
7 1 270 312 -42
8 1 304 312 -8
9 2 7 500 -493
10 2 13 500 -487
# ... with 22 more rows

Subtract min for every column by group - Add Subtracted Values to another column in df

I have a dataframe below:
date group col1 col2 col3 col4 col5
1234 1 -2 3 4 -5 100
1235 1 4 5 -2 -7 200
1234 1 -5 2 9 1 400
1235 1 8 2 -4 7 900
1235 2 -72 83 -54 98 800
1233 2 32 -21 -1 4 900
1342 2 -54 0 -10 -11 100
1234 2 98 -8 -9 -10 100
Here is what I want to do:
For columns df[,3] to the 2nd to last column, I want to do the following:
1) For each column take the minimum value for positive numbers and the minimum number for negative numbers by group.
2) Then replace the current values using this logic:
a) If the value is positive, subtract the minimum value found for the positive numbers by group.
b) If the value is negative, subtract the minimum value found for the negative numbers by group.
c) If the value is 0, make no change
3) Then take the total values that were subtracted for each value in that row and add it to the last column value.
Minimum for col1 neg, group 1 = -5
Minimum for col1 pos, group 1 = 4
Minimum for col1 neg, group 2 = -72
Minimum for col1 pos, group 2 = 32
Minimum for col2 neg, group 1 = NA
Minimum for col2 pos, group 1 = 2
etc.
I want my final output to look like this:
date group col1 col2 col3 col4 col5
1234 1 -2-(-5) 3-2 4-4 -5-(-7) 100+(-5)+2+4+(-7)
1235 1 4-4 5-2 -2-(-4) -7-(-7) 200+4+2+(-4)+(-7)
1234 1 -5-(-5) 2-2 9-4 1-1 400+(-5)+2+4+1
1235 1 8-4 2-2 -4-(-4) 7-1 900+4+2+(-4)+1
1235 2 -72-(-72) 83-83 -54-(-54) 98-4 800+(-72)+83+(-54)+4
1233 2 32-32 -21-(-21) -1-(-54) 4-4 900+32+(-21)+(-54)+4
1342 2 -54-(-72) 0-0 -10-(-54) -11-(-11) 100+(-72)+0+(-54)+(-11)
1234 2 98-32 -8-(-21) -9-(-54) -10-(-11) 100+32+(-21)+(-54)+(-11)
Expected Output:
date group col1 col2 col3 col4 col5
1234 1 3 1 0 2 94
1235 1 0 3 2 0 195
1234 1 0 0 5 0 402
1235 1 4 0 0 6 903
1235 2 0 0 0 94 761
1233 2 0 0 53 0 861
1342 2 18 0 44 0 -37
1234 2 66 13 45 1 46
After grouping by 'group', mutate the columns 'col1' to 'col4' to the minimum value among the numbers of the same sign, add the row-wise sum of those minima to 'col5', and then update 'col1' to 'col4' by subtracting them from the corresponding columns of the initial dataset ('df1').
library(dplyr)
df2 <- df1 %>%
group_by(group) %>%
mutate_at(3:6,
funs(case_when(. < 0 ~ if(any(. < 0)) as.numeric(min(.[. <0])) else NA_real_,
. > 0 ~ if(any(. > 0)) as.numeric(min(.[. > 0])) else NA_real_,
TRUE ~ as.numeric(.)))) %>%
ungroup %>%
mutate(col5 = col5 + rowSums(.[3:6]))
nm1 <- paste0("col", 1:4)
#nm1 <- 3:6
df2[nm1] <- df1[nm1] - df2[nm1]
df2
# A tibble: 8 x 7
# date group col1 col2 col3 col4 col5
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1234 1 3 1 0 2 94
#2 1235 1 0 3 2 0 195
#3 1234 1 0 0 5 0 402
#4 1235 1 4 0 0 6 903
#5 1235 2 0 0 0 94 761
#6 1233 2 0 0 53 0 861
#7 1342 2 18 0 44 0 -37
#8 1234 2 66 13 45 1 46
Or using a modification with parse_exprs
library(rlang)
expr <- paste(glue::glue('{nm1} - {nm1}_new'), collapse = ";")
df1 %>%
  group_by(group) %>%
  mutate_at(3:6, funs(new = ave(., sign(.), FUN = min))) %>%
  ungroup %>%
  mutate(col5 = col5 + select(., col1_new:col4_new) %>% reduce(`+`)) %>%
  transmute(date, group, !!! parse_exprs(expr), col5) %>%
  rename_at(3:6, ~ nm1)
# A tibble: 8 x 7
# date group col1 col2 col3 col4 col5
# <int> <int> <int> <int> <int> <int> <int>
#1 1234 1 3 1 0 2 94
#2 1235 1 0 3 2 0 195
#3 1234 1 0 0 5 0 402
#4 1235 1 4 0 0 6 903
#5 1235 2 0 0 0 94 761
#6 1233 2 0 0 53 0 861
#7 1342 2 18 0 44 0 -37
#8 1234 2 66 13 45 1 46
Or convert to 'long' format to do the calculations and then change it to 'wide'
library(tidyverse)
df1 %>%
  rownames_to_column('rn') %>%
  gather(key, val, col1:col4) %>%
  group_by(group, key, sn = sign(val)) %>%
  mutate(mnVal = min(val)) %>%
  group_by(rn) %>%
  mutate(col5 = col5 + sum(mnVal), val = val - mnVal) %>%
  select(-sn, -mnVal) %>%
  spread(key, val) %>%
  ungroup %>%
  select(names(df1))
data
df1 <- structure(list(date = c(1234L, 1235L, 1234L, 1235L, 1235L, 1233L,
1342L, 1234L), group = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), col1 = c(-2L,
4L, -5L, 8L, -72L, 32L, -54L, 98L), col2 = c(3L, 5L, 2L, 2L,
83L, -21L, 0L, -8L), col3 = c(4L, -2L, 9L, -4L, -54L, -1L, -10L,
-9L), col4 = c(-5L, -7L, 1L, 7L, 98L, 4L, -11L, -10L), col5 = c(100L,
200L, 400L, 900L, 800L, 900L, 100L, 100L)), .Names = c("date",
"group", "col1", "col2", "col3", "col4", "col5"),
class = "data.frame", row.names = c(NA,
-8L))
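As a side note, mutate_at() and funs() have since been superseded; with dplyr >= 1.0 the same logic can be written with across(). A sketch under that assumption (the helper name adj is mine):
library(dplyr)
# subtract the group minimum of the same sign, leaving zeros unchanged
adj <- function(x) {
  x <- as.numeric(x)
  neg_min <- if (any(x < 0)) min(x[x < 0]) else 0
  pos_min <- if (any(x > 0)) min(x[x > 0]) else 0
  case_when(x < 0 ~ x - neg_min,
            x > 0 ~ x - pos_min,
            TRUE  ~ x)
}
df1 %>%
  group_by(group) %>%
  mutate(across(col1:col4, adj, .names = "{.col}_adj")) %>%
  ungroup() %>%
  # whatever was subtracted from col1:col4 is added onto col5
  mutate(col5 = col5 + (col1 - col1_adj) + (col2 - col2_adj) +
           (col3 - col3_adj) + (col4 - col4_adj)) %>%
  select(date, group, col1 = col1_adj, col2 = col2_adj,
         col3 = col3_adj, col4 = col4_adj, col5)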

How to subtract one row from multiple rows by group, for data set with multiple columns in R?

I would like to learn how to subtract one row from multiple rows by group, and save the results as a data table/matrix in R. For example, take the following data frame:
data.frame("patient" = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
           "Time" = c(1, 2, 3),
           "Measure 1" = sample(1:100, size = 9, replace = TRUE),
           "Measure 2" = sample(1:100, size = 9, replace = TRUE),
           "Measure 3" = sample(1:100, size = 9, replace = TRUE))
patient Time Measure.1 Measure.2 Measure.3
1 a 1 19 5 75
2 a 2 64 20 74
3 a 3 40 4 78
4 b 1 80 91 80
5 b 2 48 31 73
6 b 3 10 5 4
7 c 1 30 67 55
8 c 2 24 13 90
9 c 3 45 31 88
For each patient, I would like to subtract the row where Time == 1 from all rows associated with that patient. The result would be:
patient Time Measure.1 Measure.2 Measure.3
1 a 1 0 0 0
2 a 2 45 15 -1
3 a 3 21 -1 3
4 b 1 0 0 0
5 b 2 -32 -60 -7
6 b 3 -70 -86 -76
7 c 1 0 0 0
....
I have tried the following code using the dplyr package, but to no avail:
raw_patient <- group_by(rawdata, patient, Time)
baseline_patient <- mutate(raw_patient, cpls = raw_patient[, ] - raw_patient["Time" == 0, ])
As there are multiple columns, we can use mutate_at by specifying the variables in vars(), and then, after grouping by 'patient', subtract from each column the element that corresponds to Time == 1.
library(dplyr)
df1 %>%
  group_by(patient) %>%
  mutate_at(vars(matches("Measure")), funs(. - .[Time == 1]))
# A tibble: 9 × 5
# Groups: patient [3]
# patient Time Measure.1 Measure.2 Measure.3
# <chr> <int> <int> <int> <int>
#1 a 1 0 0 0
#2 a 2 45 15 -1
#3 a 3 21 -1 3
#4 b 1 0 0 0
#5 b 2 -32 -60 -7
#6 b 3 -70 -86 -76
#7 c 1 0 0 0
#8 c 2 -6 -54 35
#9 c 3 15 -36 33
data
df1 <- structure(list(patient = c("a", "a", "a", "b", "b", "b", "c",
"c", "c"), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Measure.1 = c(19L,
64L, 40L, 80L, 48L, 10L, 30L, 24L, 45L), Measure.2 = c(5L, 20L,
4L, 91L, 31L, 5L, 67L, 13L, 31L), Measure.3 = c(75L, 74L, 78L,
80L, 73L, 4L, 55L, 90L, 88L)), .Names = c("patient", "Time",
"Measure.1", "Measure.2", "Measure.3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
