If function with looping to create new datatable - r

I have a dataset which I want to loop over with an if statement:
id <- c(1,2,3,4,5)
value1 <- c(25, 100, 15, 20, 30)
value2 <- c(130, 25, 10, 30, 20)
value3 <- c(50, 60, 20, 120, 10)
month <- c(2, 3, 4, 2, 3)
df1 <- data.frame(id, value1, value2, value3, month)
I need an if statement that calculates the differences between value1 and value2 and between value1 and value3, and checks whether one OR the other is higher than 35%. This should not be the change (increase or decrease) but just the difference in percentages. When the difference is equal to or higher than 35%, all the variables for that specific id should be added to a separate dataframe/table.
However, I want the function to loop over all the ids instead of just the top one.
This is what I have so far:
library(threadr)
if (percentage_difference(value1, value2) >= 35 | percentage_difference(value1, value3) >= 35) {print("bad")}

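If I understand you correctly, this should do the trick, though I am not sure how you would want to indicate whether it is >= 35. Note that the code below compares the raw differences value1 - value2 and value1 - value3, not percentage differences.
Here I used a boolean T/F: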
library(dplyr)
df2 <- df1
df2$perc_diff <- pmax(value1-value2, value1-value3)
df2$over35 <- df2$perc_diff >= 35
# id value1 value2 value3 month perc_diff over35
# 1 1 25 130 50 2 -25 FALSE
# 2 2 100 25 60 3 75 TRUE
# 3 3 15 10 20 4 5 FALSE
# 4 4 20 30 120 2 -10 FALSE
# 5 5 30 20 10 3 20 FALSE
Here I removed those < 35:
df3 <- df1
df3$perc_diff <- pmax(value1-value2, value1-value3)
df3[df3$perc_diff >= 35,]
# id value1 value2 value3 month perc_diff
# 2 100 25 60 3 75
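Since the question asks for a percentage difference rather than a raw difference, here is a minimal base R sketch of that variant. The helper `pct_diff` is hypothetical and assumes the symmetric definition (absolute difference divided by the mean of the two values); `threadr::percentage_difference` may use a different formula, so check its documentation before relying on this.

```r
# Hypothetical helper: symmetric percentage difference between two vectors,
# |a - b| relative to the mean of a and b (an assumed definition).
pct_diff <- function(a, b) abs(a - b) / ((a + b) / 2) * 100

df1 <- data.frame(
  id     = c(1, 2, 3, 4, 5),
  value1 = c(25, 100, 15, 20, 30),
  value2 = c(130, 25, 10, 30, 20),
  value3 = c(50, 60, 20, 120, 10),
  month  = c(2, 3, 4, 2, 3)
)

# Vectorised comparison: one TRUE/FALSE per row, so no explicit loop is needed.
over35 <- pct_diff(df1$value1, df1$value2) >= 35 |
  pct_diff(df1$value1, df1$value3) >= 35

# Keep every column for the ids that meet the condition.
flagged <- df1[over35, ]
```

Under this assumed definition every row of the sample data meets the 35% threshold, so `flagged` contains all five ids. A different denominator (for example value1 alone) would give a different selection.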

Related

How to populate a column using multiple conditionals across 2 dataframes?

I'm trying to populate a column with values based on two conditions across two separate dataframes. So,
if df1$day == df2$day & df1$hour == df2$hour, then fill df1$X with df2$depth
I struggle because I am not asking it to populate the column with a generic value (i.e. if x == y, then y2 = 1). I am trying to get it to select values across multiple rows. A mock example:
df1                 df2
day  hour  X        day  hour  depth
1    10    NA       1    10    50
1    11    NA       1    11    10
2    5     NA       1    3     100
5    9     NA       5    9     50
6    20    NA       7    17    80
7    17    NA       10   4     65
Any help would be greatly appreciated.
An easier option is join from data.table
library(data.table)
setDT(df1)[df2, X := depth, on = .(day, hour)]
df1
# day hour X
#1: 1 10 50
#2: 1 11 10
#3: 2 5 NA
#4: 5 9 50
#5: 6 20 NA
#6: 7 17 80
In base R, we can use match
df1$X <- with(df1, df2$depth[match(paste(day, hour), paste(df2$day, df2$hour))])
data
df1<- data.frame(day = c(1, 1, 2, 5:7), hour = c(10:11, 5, 9, 20, 17),
X = NA_integer_)
df2 <- data.frame(day = c(1, 1, 1, 5, 7, 10), hour = c(10, 11, 3, 9,
17, 4), depth = c(50, 10, 100, 50, 80, 65))
Using dplyr, we can do a left_join and then rename the depth column as X
library(dplyr)
left_join(df1, df2, by = c("day", "hour")) %>%
select(-X) %>%
rename(X = depth)
# day hour X
#1 1 10 50
#2 1 11 10
#3 2 5 NA
#4 5 9 50
#5 6 20 NA
#6 7 17 80
If the X column is not always NA you could use coalesce.
left_join(df1, df2, by = c("day", "hour")) %>%
mutate(X = coalesce(depth, X)) %>%
select(names(df1))
Or in base R :
merge(df1, df2, all.x = TRUE)[-3]

shift a column with lagged data from other column and enlarge data frame as needed

I have a data frame with values, and I need a new column containing those values shifted some rows down. The data frame has to gain extra rows to accommodate the shifted data.
What I've got so far:
df <- data.frame(day=1:5,value=floor(runif(5, min=0, max=101)))
> df %>% dplyr::mutate(value2=dplyr::lag(value,n=2, default = 0))
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
Expected result:
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
6 6 0 14
7 7 0 10
Stuck on making the data frame grow the needed rows.
Here's a way with dplyr -
df %>%
bind_rows(
tail(df, 2) %>%
mutate(day = day + 2, value = 0)
) %>%
mutate(value2 = lag(value, 2, default = 0))
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
6 6 0 14
7 7 0 10
Use a merge. Create the "target" dataset with however many rows you want, fill in NA values with 0, then remap the lagged value onto "value2". It's useful to store "lag" as a variable, at the risk of being more verbose.
have <- data.frame(
day= 1:5,
value = c(19, 78, 18, 14, 10),
value2 = c(0, 0, 19, 78, 18)
)
target <- data.frame(
day=1:7
)
want <- merge(have, target, by = 'day', all = TRUE)
want[is.na(want)] <- 0
lag <- 2
## just one way of mapping a lagged response
want$value2 <- c(rep(0, lag), rev(rev(want$value)[-{1:lag}]))
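The same idea (grow the frame by the lag, then shift) can be wrapped in a small base R function. This is only a sketch: `extend_and_lag` is a hypothetical name, and it assumes the columns are called `day` and `value`, that days are consecutive, and that the padding value is 0.

```r
# Hypothetical helper: append `lag` zero-valued rows, then add a column
# holding `value` shifted down by `lag` rows.
extend_and_lag <- function(df, lag) {
  n <- nrow(df)
  extra <- data.frame(
    day   = max(df$day) + seq_len(lag),  # continue the day counter
    value = rep(0, lag)                  # padded rows carry no value
  )
  out <- rbind(df[, c("day", "value")], extra)
  # The first `lag` entries have nothing to shift in, so they are 0.
  out$value2 <- c(rep(0, lag), df$value[seq_len(n)])
  out
}

df  <- data.frame(day = 1:5, value = c(19, 78, 18, 14, 10))
res <- extend_and_lag(df, 2)
```

With the sample values, `res` has seven rows and `value2` is 0, 0, 19, 78, 18, 14, 10, matching the expected result in the question.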

column with previous result

I'm working with R
WHAT I HAVE:
ID_1 ID_2 Date x_1 y_2
1 12 3 2011-12-21 15 10
2 12 13 2011-12-22 50 40
3 3 12 2011-12-22 20 30
4 15 13 2011-12-23 30 20
...
and so on
TARGET:
ID_1 ID_2 Date x_1 y_2 XX_1 YY_2
1 12 3 2011-12-21 15 10 0 0
2 12 13 2011-12-22 50 40 15 0
3 3 12 2011-12-22 20 30 10 50
4 15 13 2011-12-23 30 20 0 40
...
and so on
I want XX_1 and YY_2 to hold the values from columns x_1 and y_2 that correspond to the previous occurrence of the IDs in ID_1 and ID_2, or 0 if no value is available before that date. I don't know how to handle the fact that the same ID can appear in either ID_1 or ID_2 (like IDs 3 and 12 in the example).
@Ekatef
Matching on ID_1 AND ID_2 together (find a match for the whole pair of IDs, even if their order is switched):
ID_1 ID_2 Date x_1 y_2 XX_1 YY_2
1 12 3 2011-12-21 15 10 0 0
2 12 13 2011-12-22 50 40 0 0
3 3 12 2011-12-22 20 30 10 15
4 15 13 2011-12-23 30 20 0 0
5 12 13 2011-12-23 10 5 50 40
The OP has requested to copy the previous value for an ID (if any) to the appropriate new column.
This can be solved by reshaping multiple columns simultaneously from wide to long format, finding the previous value by shifting / lagging, and reshaping back to wide format:
library(data.table)
setDT(DF)[, rn := .I]
long <- melt(DF, id.vars = c("rn", "Date"), measure.vars = patterns("^ID", "^x|y"),
value.name = c("ID", "value"))
long[order(Date), previous := shift(value, fill = 0), by = ID]
dcast(long, rn + Date ~ variable, value.var = c("ID", "value", "previous"))
rn Date ID_1 ID_2 value_1 value_2 previous_1 previous_2
1: 1 2011-12-21 12 3 15 10 0 0
2: 2 2011-12-22 12 13 50 40 15 0
3: 3 2011-12-22 3 12 20 30 10 50
4: 4 2011-12-23 15 13 30 20 0 40
Alternatively, the final call to dcast() can be replaced by an update while joining:
DF[long, on = .(rn),
c("XX_1", "YY_2") := .(previous[variable == 1L], previous[variable == 2L])][
, rn := NULL]
DF
ID_1 ID_2 Date x_1 y_2 XX_1 YY_2
1: 12 3 2011-12-21 15 10 0 0
2: 12 13 2011-12-22 50 40 15 0
3: 3 12 2011-12-22 20 30 10 50
4: 15 13 2011-12-23 30 20 0 40
which reproduces exactly OP's expected result.
Data
library(data.table)
DF <- fread(
"i ID_1 ID_2 Date x_1 y_2
1 12 3 2011-12-21 15 10
2 12 13 2011-12-22 50 40
3 3 12 2011-12-22 20 30
4 15 13 2011-12-23 30 20 ",
drop = 1L
)
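For readers without data.table, the reshape-and-lag idea described in this answer can be sketched in base R. This is an illustrative sketch under the assumption that ties on Date are broken by row order; the names `long`, `rn` and `slot` are introduced here and are not part of the original answer.

```r
DF <- data.frame(
  ID_1 = c(12, 12, 3, 15),
  ID_2 = c(3, 13, 12, 13),
  Date = as.Date(c("2011-12-21", "2011-12-22", "2011-12-22", "2011-12-23")),
  x_1  = c(15, 50, 20, 30),
  y_2  = c(10, 40, 30, 20)
)
n <- nrow(DF)

# Long format: one row per (original row, ID slot) pair.
long <- data.frame(
  rn    = rep(seq_len(n), 2),
  slot  = rep(1:2, each = n),
  Date  = rep(DF$Date, 2),
  ID    = c(DF$ID_1, DF$ID_2),
  value = c(DF$x_1, DF$y_2)
)

# Order chronologically, then take the previous value within each ID (0 if none).
long <- long[order(long$Date, long$rn), ]
long$previous <- ave(long$value, long$ID,
                     FUN = function(v) c(0, head(v, -1)))

# Restore the original row order and write the results back in wide form.
long <- long[order(long$slot, long$rn), ]
DF$XX_1 <- long$previous[long$slot == 1]
DF$YY_2 <- long$previous[long$slot == 2]
```

On the four sample rows this gives XX_1 = 0, 15, 10, 0 and YY_2 = 0, 0, 50, 40, the same values as the TARGET table in the question.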
If I understand you correctly, the target ID should be looked up from left to right and from bottom to top in all the rows strictly above the given ID value. I would write the function that finds the coordinates of the preceding ID like this:
# find the indices of the preceding occurrence of an ID value
# id_matrix is your_data_frame[, c("ID_1", "ID_2")]
# [i_of_row, i_of_col] are the coordinates of the ID being considered
# requires i_of_row > 1
FindPreviousID <- function(id_matrix, i_of_row, i_of_col) {
  # only rows strictly above the current one are searched
  shorten_matrix <- id_matrix[1:(i_of_row - 1), , drop = FALSE]
  # search the reversed row-major vector, so the first hit is the latest occurrence
  rev_ind <- match(table = rev(t(shorten_matrix)),
                   x = id_matrix[i_of_row, i_of_col], nomatch = NA_real_)
  # convert the reversed position back to row/column coordinates
  n_row_found <- floor((length(shorten_matrix) - rev_ind) / ncol(shorten_matrix)) + 1
  n_col_found <- (length(shorten_matrix) - rev_ind) %% ncol(shorten_matrix) + 1
  return(c(row = n_row_found, col = n_col_found))
}
...and use it to calculate XX_1 and YY_2
# emulate the original dataframe
ID_1 <- c(12,12,3,15,16,3)
ID_2<-c(3,13,12,13,17,15)
ids <- cbind(ID_1, ID_2) # IDs columns
x1 <- c(15, 50, 20, 30, 51, 60)
y2 <- c(10, 40, 30, 20, 53, 62)
vars <- cbind(x1, y2) # x&y columns
# assuming that the first XX_1 & YY_2 should be always 0
indices_XX <- sapply(FUN = function(i) FindPreviousID(id_matrix = ids, i_of_col = 1, i),
X = seq(along.with = ids[, 1])[-1])
indices_YY <- sapply(FUN = function(i) FindPreviousID(id_matrix = ids, i_of_col = 2, i),
X = seq(along.with = ids[, 1])[-1])
# construct XX and YY columns
XX_column <- c(NA, vars[t(indices_XX)])
XX_column[is.na(XX_column)] <- 0
YY_column <- c(NA, vars[t(indices_YY)])
YY_column[is.na(YY_column)] <- 0
Hope that helps :)
Update: if you want to find a pair of IDs instead of a single ID, the function should be redesigned. One possible solution looks like this:
FindPreviousIDsPair <- function(id_matrix, i_of_row) {
shorten_matrix <- id_matrix[1:(i_of_row - 1),,drop = FALSE]
string_to_search_for <- id_matrix[i_of_row, ]
string_to_search_for_sorted <-
string_to_search_for[order(string_to_search_for)]
found_rows_boolean <- sapply(FUN = function(i) all(shorten_matrix[i,
order(shorten_matrix[i, ])] ==
string_to_search_for_sorted), X = 1:(i_of_row - 1))
found_row_n <- ifelse(any(found_rows_boolean),
max(which(found_rows_boolean)), NA_real_)
found_col_of_ID1 <- ifelse(any(found_rows_boolean),
match(string_to_search_for[1], shorten_matrix[found_row_n, ]), NA_real_)
found_col_of_ID2 <- ifelse(any(found_rows_boolean),
match(string_to_search_for[2], shorten_matrix[found_row_n, ]), NA_real_)
return(c(found_row_n, found_col_of_ID1, found_col_of_ID2))
}
Applying the redesigned look-up function to calculate XX and YY:
indices_of_vars <- sapply(FUN = function(i) FindPreviousIDsPair(id_matrix =
ids, i), X = seq(along.with = ids[, 1])[-1])
indices_XX <- indices_of_vars[1:2, ]
indices_YY <- indices_of_vars[c(1, 3), ]
XX_column <- c(NA, vars[t(indices_XX)])
XX_column[is.na(XX_column)] <- 0
YY_column <- c(NA, vars[t(indices_YY)])
YY_column[is.na(YY_column)] <- 0

Counting the number of observations by groups with conditions in R

I would like to count the number of observations within each group using conditions in R.
For example, I would like to count how many observations for ID "A" in every 10 days.
ID (A,A,A,A,A,A,A,A)
Day (7,14,17,25,35,37,42,57)
X (9,20,14,24,23,30,20,40)
Output Image
(In the first 10 days, we have one observation for ID "A". Days:7
In the next 10 days, we have two observations for ID "A". Days:14,17)
ID (A,A,A,A,A,A,A,A)
Day_10 (1,2,3,4,5,6)
Count_10 (1,2,1,2,1,1)
Also it would be great if I can calculate the number of observations before and after the certain values. For the given X value, I would like to know how many observation between [X-10, X+10] within ID "A".
The output image would be as follows:
ID (A,A,A,A,A,A,A,A)
X (9,20,14,24,23,30,40,50)
Count_X10 (3,3,3,3,3,3,2,1)
Count_X10: for a given X(=9) there are three observations within ID "A" [-1,19]
Here are the data loaded as a data.frame to keep the observations connected. Note that I added a second group to show how to handle that:
df <-
data.frame(
ID = rep(c("A","B"), each = 8)
, Day = c(7,14,17,25,35,37,42,57)
, X = c(9,20,14,24,23,30,20,40)
)
Then, I used dplyr to pass the data through a series of steps. First, I split by the ID column, then used lapply to run a function on each of those ID groups, including calculating two columns of interest (then returning the whole data.frame). Finally, I stitch the rows back together with bind_rows
df %>%
split(.$ID) %>%
lapply(function(x){
x$nextTen <- sapply(x$Day, function(thisDay){
sum(between(x$Day, thisDay, thisDay + 10))
})
x$plusMinusTen <- sapply(x$Day, function(thisDay){
sum(between(x$Day, thisDay - 10, thisDay + 10))
})
return(x)
}) %>%
bind_rows()
The result is
ID Day X nextTen plusMinusTen
1 A 7 9 3 3
2 A 14 20 2 3
3 A 17 14 2 4
4 A 25 24 2 3
5 A 35 23 3 4
6 A 37 30 2 3
7 A 42 20 1 3
8 A 57 40 1 1
9 B 7 9 3 3
10 B 14 20 2 3
11 B 17 14 2 4
12 B 25 24 2 3
13 B 35 23 3 4
14 B 37 30 2 3
15 B 42 20 1 3
16 B 57 40 1 1
But any condition you are interested in could be added to that lapply step.
Your sample data :
df = data.frame(
ID = rep('A', 8),
Day = c(7, 14, 17, 25, 35, 37, 42, 57),
X = c(9, 20, 14, 24, 23, 30, 40, 50),
stringsAsFactors = FALSE)
Note: You give two different values for vector X. I suppose it is c(9, 20, 14, 24, 23, 30, 40, 50), and not c(9, 20, 14, 24, 23, 30, 20, 40).
First calculation:
library(dplyr)
output1 = df %>%
mutate(Day_10 = ceiling(Day/10)) %>%
group_by(ID, Day_10) %>%
summarise(Count_10 = n())
The mutate step creates the 10-day ranges by rounding Day/10 up with ceiling(). Then we group by ID and Day_10 and count the number of observations within each group.
> output1
ID Day_10 Count_10
<chr> <dbl> <int>
1 A 1 1
2 A 2 2
3 A 3 1
4 A 4 2
5 A 5 1
6 A 6 1
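The binning step can also be sketched in base R with `aggregate()`, which may help if dplyr is not available. This is a minimal sketch on the same sample data; note that `ceiling(Day / 10)` puts day 10 into bin 1 and would put day 0 into bin 0, which is an edge case to decide on for real data.

```r
df <- data.frame(
  ID  = rep("A", 8),
  Day = c(7, 14, 17, 25, 35, 37, 42, 57)
)

# Assign each day to a 10-day bin: days 1-10 -> 1, 11-20 -> 2, and so on.
df$Day_10 <- ceiling(df$Day / 10)

# Count observations per ID and bin.
counts <- aggregate(Day ~ ID + Day_10, data = df, FUN = length)
names(counts)[names(counts) == "Day"] <- "Count_10"
```

For the sample data, `counts$Count_10` is 1, 2, 1, 2, 1, 1 for bins 1 to 6.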
Second calculation:
output2 = df %>%
group_by(ID) %>%
mutate(Count_X10 = sapply(X, function(x){sum(Day >= x-10 & Day <= x+10)})) %>%
select(-Day)
We group by ID, and for each X we count the number of days with this ID that are between X-10 and X+10.
> output2
ID X Count_X10
<chr> <dbl> <int>
1 A 9 3
2 A 20 3
3 A 14 3
4 A 24 3
5 A 23 3
6 A 30 3
7 A 40 3
8 A 50 2
Note: I suppose there's a mistake in the desired output you give, because for instance, when X = 50, there are 2 observations within [40, 60] with ID "A": days 42 and 57.
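The window count can also be written without `sapply` by building the full comparison matrix with `outer()`. This is a sketch for a single ID only (for several IDs it would have to be applied per group), and it assumes the corrected X vector used in this answer.

```r
Day <- c(7, 14, 17, 25, 35, 37, 42, 57)
X   <- c(9, 20, 14, 24, 23, 30, 40, 50)

# in_window[i, j] is TRUE when Day[j] lies within [X[i] - 10, X[i] + 10].
in_window <- outer(X, Day, function(x, d) d >= x - 10 & d <= x + 10)

# Number of days inside each window.
Count_X10 <- rowSums(in_window)
```

This reproduces the counts shown above: 3 for the first seven X values and 2 for X = 50. The matrix has length(X) * length(Day) cells, so for long vectors the `sapply` version may use less memory.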

comparing and finding overlap range in R

I have two tables, each containing ranges of numbers. One table is a subdivision of the other. I want to create a binary column in the first table that shows whether each range overlaps any range in the second table.
for example:
df1:
start1 end1
1 6
6 8
9 12
13 15
15 19
19 20
df2:
start2 end2
2 4
9 11
14 18
result: the result is the first table with column that shows if the overlap exists.
start1 end1 overlap
1 6 1
6 8 0
9 12 1
13 15 1
15 19 1
19 20 0
thanks.
You may also try foverlaps from data.table
library(data.table)
setkey(setDT(df1), start1, end1)
setkey(setDT(df2), start2, end2)
df1[,overlap:=foverlaps(df1, df2, which=TRUE)[, !is.na(yid),]+0]
df1
# start1 end1 overlap
#1: 1 6 1
#2: 6 8 0
#3: 9 12 1
#4: 13 15 1
#5: 15 19 1
#6: 19 20 0
With IRanges
library(IRanges)
ir1 = with(df1, IRanges(start1, end1))
ir2 = with(df2, IRanges(start2, end2))
df1$overlap = countOverlaps(ir1, ir2) != 0
If on the off chance this is genomic data, the GenomicRanges package is appropriate.
Here's an approach based on generating sequences:
nums <- unlist(apply(df2, 1, Reduce, f = seq))
df1$overlap <- as.integer(apply(df1, 1, function(x) any(seq(x[1], x[2]) %in% nums)))
# start1 end1 overlap
# 1 1 6 1
# 2 6 8 0
# 3 9 12 1
# 4 13 15 1
# 5 15 19 1
# 6 19 20 0
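The sequence approach only works for integer endpoints. A base R sketch that compares the endpoints directly is shown below; it assumes closed intervals, where two ranges overlap when each one starts no later than the other ends, and that touching endpoints count as overlap.

```r
df1 <- data.frame(start1 = c(1, 6, 9, 13, 15, 19),
                  end1   = c(6, 8, 12, 15, 19, 20))
df2 <- data.frame(start2 = c(2, 9, 14),
                  end2   = c(4, 11, 18))

# hits[i, j] is TRUE when range i of df1 overlaps range j of df2:
# start1 <= end2 and end1 >= start2.
hits <- outer(df1$start1, df2$end2, "<=") &
  outer(df1$end1, df2$start2, ">=")

# A df1 range is flagged when it overlaps at least one df2 range.
df1$overlap <- as.integer(rowSums(hits) > 0)
```

For the example tables this yields 1, 0, 1, 1, 1, 0, the same as the expected result. If touching endpoints should not count as overlap, replace `<=` and `>=` with `<` and `>`.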
You can use the ivs package, which is a package specifically about interval vectors. iv_overlaps() returns a logical vector that specifies if each interval of the column from df1 overlaps any interval from df2.
library(dplyr)
library(ivs)
df1 <- tribble(
~start1, ~end1,
1, 6,
6, 8,
9, 12,
13, 15,
15, 19,
19, 20
)
df2 <- tribble(
~start2, ~end2,
2, 4,
9, 11,
14, 18
)
df1 <- mutate(df1, range1 = iv(start1, end1), .keep = "unused")
df2 <- mutate(df2, range2 = iv(start2, end2), .keep = "unused")
df1 %>%
mutate(any_overlap = iv_overlaps(range1, df2$range2))
#> # A tibble: 6 × 2
#> range1 any_overlap
#> <iv<dbl>> <lgl>
#> 1 [1, 6) TRUE
#> 2 [6, 8) FALSE
#> 3 [9, 12) TRUE
#> 4 [13, 15) TRUE
#> 5 [15, 19) TRUE
#> 6 [19, 20) FALSE
