Conditionally copying reference values in R

I am trying to conditionally copy values from the x column into a new column based on a reference value. For example, in row 1 (time == 1) the ref value is 7, so the newx value should be the x value from the row where time == 1 and id == 7. The copied value always needs to come from the same time block.
If the ref value is 0, the newx value should also be 0.
I have tried a few approaches, and the one below is probably the closest I have come, but it still isn't working:
library(dplyr)
x <- sample(1:50, 24)
y <- sample(1:50, 24)
ref <- c(7,7,7,7,0,0,0,0,0,0,0,0,4,3,4,1,8,8,5,8,0,0,0,0)
id <- rep(seq(1,8,1), 3)
time <- rep(1:3, each = 8)
x y ref id time
1 41 29 7 1 1
2 18 37 7 2 1
3 50 25 7 3 1
4 47 7 7 4 1
5 2 40 0 5 1
6 22 19 0 6 1
7 48 9 0 7 1
8 26 36 0 8 1
9 49 47 0 1 2
10 46 18 0 2 2
11 25 23 0 3 2
12 38 3 0 4 2
13 28 31 4 5 2
14 34 4 3 6 2
15 21 32 4 7 2
16 9 48 1 8 2
17 43 43 8 1 3
18 39 38 8 2 3
19 6 16 5 3 3
20 12 41 8 4 3
21 1 13 0 5 3
22 19 17 0 6 3
23 7 34 0 7 3
24 33 10 0 8 3
df <- as.data.frame(cbind(x,y,ref,id,time))
df <- df %>% group_by(time) %>% mutate(Newx = case_when((ref > 0) ~ x[which(id==ref)],
T ~ 0,))

You can join df with itself. The last mutate just replaces the NAs in the ref == 0 rows with 0. You could also use tidyr::replace_na, but I wanted to stick to using only dplyr:
df %>%
  left_join(df %>% select(x, id, time) %>% rename(newx = x), by = c("time", "ref" = "id")) %>%
  mutate(newx = ifelse(is.na(newx), 0, newx))
Which results in:
x y ref id time newx
1 44 44 7 1 1 36
2 37 26 7 2 1 36
3 40 27 7 3 1 36
4 32 46 7 4 1 36
5 48 33 0 5 1 0
6 31 6 0 6 1 0
7 36 1 0 7 1 0
8 27 11 0 8 1 0
9 26 32 0 1 2 0
10 42 22 0 2 2 0
11 22 21 0 3 2 0
12 15 28 0 4 2 0
13 45 47 4 5 2 15
14 49 4 3 6 2 22
15 25 50 4 7 2 15
16 14 3 1 8 2 26
17 13 42 8 1 3 12
18 38 7 8 2 3 12
19 10 12 5 3 3 50
20 2 40 8 4 3 12
21 50 43 0 5 3 0
22 4 9 0 6 3 0
23 34 49 0 7 3 0
24 12 31 0 8 3 0
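To see the lookup that the self-join performs without any packages, here is a minimal base R sketch on a hypothetical toy data frame (the name toy and its values are made up for illustration). It builds a time/id key and uses match(), which is a swapped-in technique rather than the join from the answer above, but it should give the same kind of result as long as each id appears only once per time block.

```r
# Hypothetical toy data: two time blocks with three ids each
toy <- data.frame(
  x    = c(10, 20, 30, 40, 50, 60),
  ref  = c(3, 0, 1, 0, 2, 2),
  id   = c(1, 2, 3, 1, 2, 3),
  time = c(1, 1, 1, 2, 2, 2)
)

# Key identifying each row, and the key each row wants to look up
key_have <- paste(toy$time, toy$id)
key_want <- paste(toy$time, toy$ref)

# match() returns NA when no row has the wanted key (e.g. ref == 0)
toy$newx <- toy$x[match(key_want, key_have)]
toy$newx[is.na(toy$newx)] <- 0

print(toy)
```

Rows with ref == 0 find no matching key, so they end up as 0, mirroring the NA replacement step in the dplyr version.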

Using purrr::map_dbl you could do:
library(purrr)
library(dplyr)
df %>%
  group_by(time) %>%
  mutate(newx = map_dbl(ref, function(ref) if (ref > 0) .data$x[.data$id == ref] else 0)) %>%
  ungroup()
#> # A tibble: 24 × 6
#> x y ref id time newx
#> <int> <int> <dbl> <dbl> <int> <dbl>
#> 1 31 17 7 1 1 37
#> 2 15 43 7 2 1 37
#> 3 14 39 7 3 1 37
#> 4 3 12 7 4 1 37
#> 5 42 15 0 5 1 0
#> 6 43 32 0 6 1 0
#> 7 37 42 0 7 1 0
#> 8 48 7 0 8 1 0
#> 9 25 9 0 1 2 0
#> 10 26 41 0 2 2 0
#> # … with 14 more rows
DATA
set.seed(123)
x <- sample(1:50, 24)
y <- sample(1:50, 24)
ref <- c(7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 4, 3, 4, 1, 8, 8, 5, 8, 0, 0, 0, 0)
id <- rep(seq(1, 8, 1), 3)
time <- rep(1:3, each = 8)
df <- data.frame(x, y, ref, id, time)

Related

Add rows to dataframe in R based on values in column

I have a dataframe with 2 columns: time and day. There are 3 days, and for each day time runs from 1 to 12. I want to add new rows for each day with times: -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to insert at, but the row numbers shift each time a new row is added, which makes the process tedious. Thanks in advance.
We could use add_row, then slice the desired sequence, and bind everything into one dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
  add_row(time = -2:0, Day = c(1,1,1), .before = 1) %>%
  slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
  mutate(Day = rep(row_number(), each=15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
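As a small variation on the expand.grid() approach, the column names can be supplied as argument names, which avoids the separate colnames() call. The sketch below uses a shortened range (times -2:0, two days) purely to keep the output small; those values are an assumption for illustration.

```r
# Named arguments become column names; the first argument varies fastest
small <- expand.grid(time = -2:0, day = 1:2)
print(small)
#   time day
# 1   -2   1
# 2   -1   1
# 3    0   1
# 4   -2   2
# 5   -1   2
# 6    0   2
```

Because the first argument cycles fastest, putting time first yields rows already ordered by day and then by time.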
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
bind_rows(day) %>%
arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0 you can also use complete.
tidyr::complete(day, Day, time = -2:0)
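If the original rows need to be kept rather than rebuilt, the same idea can be sketched in base R: build the extra rows for every day, append them, and sort. The data frame day below is a hypothetical, shortened stand-in (times 1 to 3, two days), and the sketch assumes the intended extra times are -2, -1 and 0.

```r
# Hypothetical stand-in for the original data: times 1..3 for two days
day <- data.frame(time = rep(1:3, times = 2), Day = rep(1:2, each = 3))

# One row per combination of extra time and existing day
extra <- expand.grid(time = -2:0, Day = unique(day$Day))

# Append the extra rows, then restore the Day/time ordering
out <- rbind(extra, day)
out <- out[order(out$Day, out$time), ]
rownames(out) <- NULL
print(out)
```

Sorting after appending sidesteps the shifting row positions that make repeated add_row calls tedious.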

Fill zeros for missing values in R

I am trying to deal with this problem.
I have a df with a date column and I want to count the occurrences per hour. Here is what I've done:
x <- df %>%
mutate(hora = hour(date)) %>%
select(hora) %>%
count(hora)
that gives as a result:
> x
# A tibble: 19 x 2
hora n
<int> <int>
1 0 1
2 1 1
3 3 1
4 8 4
5 9 7
6 10 10
7 11 14
8 12 10
9 13 8
10 14 4
11 15 5
12 16 12
13 17 4
14 18 12
15 19 9
16 20 5
17 21 2
18 22 4
19 23 4
As you can see, there are hours that don't show up that would have n=0, like 2 or 4:7. What I want is it to add the hours that are not in x with n=0 so the table is complete.
The expected output should be something like this:
hora n
1 0 12
2 1 3
3 2 5
4 3 7
5 4 8
6 5 1
7 6 0
8 7 11
9 8 6
10 9 10
11 10 9
12 11 0
13 12 0
14 13 3
15 14 0
16 15 7
17 16 8
18 17 1
19 18 2
20 19 11
21 20 6
22 21 10
23 22 9
24 23 4
I tried creating a table with hours 0:23 and all n=0 and summing the two tables, but obviously that didn't work. I also tried x$hour <- 0:23, thinking that the missing values would be added, but that didn't work either.
You could convert hora to factor and use .drop = FALSE in count
library(dplyr)
library(lubridate)
df %>%
mutate(hora = factor(hour(date), levels = 0:23)) %>%
count(hora, .drop = FALSE)
Another option is to use complete:
df %>%
mutate(hora = hour(date)) %>%
count(hora) %>%
tidyr::complete(hora = 0:23, fill = list(n = 0))
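The factor idea from the first option can be seen in isolation with base R's table(): once all 24 levels are declared, hours that never occur are still counted, with a count of zero. This is a minimal sketch on a hypothetical vector of hours (the values are made up), not the original df.

```r
# Hypothetical hours extracted from some timestamps
hours <- c(0, 1, 3, 8, 8, 9)

# Declaring all 24 levels makes table() report unused hours as 0
counts <- table(factor(hours, levels = 0:23))

res <- data.frame(hora = 0:23, n = as.integer(counts))
head(res, 4)
#   hora n
# 1    0 1
# 2    1 1
# 3    2 0
# 4    3 1
```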
A solution in Base R merges a vector of hours with the summarized data, and sets the missing counts to 0.
textFile <- "row hour count
1 0 1
2 1 1
3 3 1
4 8 4
5 9 7
6 10 10
7 11 14
8 12 10
9 13 8
10 14 4
11 15 5
12 16 12
13 17 4
14 18 12
15 19 9
16 20 5
17 21 2
18 22 4
19 23 4"
data <- read.table(text = textFile,header = TRUE)[-1]
hours <- data.frame(hour = 0:23)
merged <- merge(data,hours,all.y = TRUE)
merged[is.na(merged$count),"count"] <- 0
...and the output:
> head(merged)
hour count
1 0 1
2 1 1
3 2 0
4 3 1
5 4 0
6 5 0
>

How to merge two data frames by ranges in R?

Suppose I have two data frames such like:
set.seed(123)
df0<-data.frame(pos=3:12,
count0=rbinom(10, 50, 0.5),
count2=rbinom(10, 20, 0.5))
df0
pos count0 count2
1 3 23 14
2 4 28 10
3 5 24 11
4 6 29 10
5 7 30 7
6 8 19 13
7 9 25 8
8 10 29 6
9 11 25 9
10 12 25 14
df1<-data.frame(start=c(4, 7, 11, 14),
end=c(6, 9, 12, 15),
cnv=c(1, 2, 3, 4))
df1
start end cnv
1 4 6 1
2 7 9 2
3 11 12 3
4 14 15 4
What I want is to merge df0 and df1 by matching df0$pos against the ranges given by df1$start and df1$end. If pos falls into the range start:end, fill in the cnv from df1; otherwise set cnv to zero. The output for the above example would be:
pos count0 count2 cnv
1 3 23 14 0
2 4 28 10 1
3 5 24 11 1
4 6 29 10 1
5 7 30 7 2
6 8 19 13 2
7 9 25 8 2
8 10 29 6 0
9 11 25 9 3
10 12 25 14 3
We can use sapply to check, for each pos, whether it falls inside any of the ranges; if it does, return the matching cnv, otherwise return 0.
df0$cnv <- sapply(df0$pos, function(x) {
  inds <- x >= df1$start & x <= df1$end
  if (any(inds))
    df1$cnv[inds]
  else 0
})
df0
# pos count0 count2 cnv
#1 3 23 14 0
#2 4 28 10 1
#3 5 24 11 1
#4 6 29 10 1
#5 7 30 7 2
#6 8 19 13 2
#7 9 25 8 2
#8 10 29 6 0
#9 11 25 9 3
#10 12 25 14 3
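One thing to be aware of with the sapply version is that it would return a list instead of a vector if a position fell into more than one range. A slightly more defensive sketch, assuming that taking the first matching range is acceptable, uses vapply so the result is always one number per position. The helper name lookup_cnv is hypothetical.

```r
df0 <- data.frame(pos = 3:12)
df1 <- data.frame(start = c(4, 7, 11, 14),
                  end   = c(6, 9, 12, 15),
                  cnv   = c(1, 2, 3, 4))

# Hypothetical helper: cnv of the first range containing p, or 0 if none
lookup_cnv <- function(p) {
  hit <- which(p >= df1$start & p <= df1$end)
  if (length(hit) > 0) df1$cnv[hit[1]] else 0
}

# vapply enforces exactly one numeric value per position
df0$cnv <- vapply(df0$pos, lookup_cnv, numeric(1))
df0$cnv
# [1] 0 1 1 1 2 2 2 0 3 3
```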

How do I elegantly calculate a variable in an R data.frame that uses values in a previous row?

Here is a simple scenario I constructed:
Say I have the following:
set.seed(1)
id<-sample(3,10,replace = TRUE)
n<-1:10
x<-round(runif(10,30,40))
df<-data.frame(id,n,x)
df
id n x
1 1 1 32
2 2 2 32
3 2 3 37
4 3 4 34
5 1 5 38
6 3 6 35
7 3 7 37
8 2 8 40
9 2 9 34
10 1 10 38
How do I elegantly calculate x.lag, where x.lag is the previous x for the same id, or 0 if a previous value does not exist?
This is what I did but I'm not happy with it:
df$x.lag<-rep(0,10)
for (id in 1:3)
df[df$id==id,]$x.lag<-c(0,df[df$id==id,]$x)[1:sum(df$id==id)]
df
id n x x.lag
1 1 1 32 0
2 2 2 32 0
3 2 3 37 32
4 3 4 34 0
5 1 5 38 32
6 3 6 35 34
7 3 7 37 35
8 2 8 40 37
9 2 9 34 40
10 1 10 38 38
We can use data.table
library(data.table)
setDT(df)[, x.lag := shift(x, fill=0), id]
Or with dplyr
library(dplyr)
df %>%
group_by(id) %>%
mutate(x.lag = lag(x, default = 0))
Or using ave from base R
df$x.lag <- with(df, ave(x, id, FUN = function(x) c(0, x[-length(x)])))
df$x.lag
#[1] 0 0 32 0 32 34 35 37 40 38
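For a self-contained check, the sketch below types in the data frame exactly as printed in the question (the output of sample() with a given seed may differ between R versions, so the values are entered by hand) and applies the ave() approach with a small named helper. The helper name lag_within is hypothetical.

```r
# Data entered by hand from the printed data frame in the question
df <- data.frame(
  id = c(1, 2, 2, 3, 1, 3, 3, 2, 2, 1),
  n  = 1:10,
  x  = c(32, 32, 37, 34, 38, 35, 37, 40, 34, 38)
)

# Shift a vector down by one position, padding the front with 0
lag_within <- function(v) c(0, head(v, -1))

# ave() applies the helper separately within each id
df$x.lag <- ave(df$x, df$id, FUN = lag_within)
df$x.lag
# [1]  0  0 32  0 32 34 35 37 40 38
```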

Group rows and add sum column of unique values

Here an example of my data.frame:
df = read.table(text='colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0', header = TRUE)
I need to group the rows by colA and colC and add a new column which states the sum of unique values based on colB.
Here, in steps, is what I need to do for this specific data.frame:
group rows with colA = 10 and 9, colA = 2 and 1, colA = 22 and colA = 11;
find the unique values of colB per each group;
add the unique values in a new col (newcolD).
Note that colC states the total number of observations for colA = 10 and 9, colA = 2 and 1, colA = 22 and colA = 11.
The data.frame needs to remain ordered decreasingly by colC.
My expected output is:
colA colB colC newcolD
10 11 7 5
10 34 7 5
10 89 7 5
10 21 7 5
9 8 0 5
9 11 0 5
9 21 0 5
2 23 5 4
2 21 5 4
2 56 5 4
1 45 0 4
1 23 0 4
22 14 3 3
22 19 3 3
22 90 3 3
11 19 2 2
11 45 2 2
Note that in df the duplicated colB values are: 11 and 21 for group 10 and 9, and 23 for group 2 and 1.
You can do that with dplyr. The trick is to create a new grouping column which groups consecutive values in colA. This is done with cumsum(c(1, diff(colA) < -1)) in the example below.
df1 = read.table(text='colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0', header = TRUE,stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
arrange(desc(colA)) %>%
group_by(group_sequential = cumsum(c(1, diff(colA) < -1))) %>%
mutate(newcolD=n_distinct(colB))
colA colB colC group_sequential newcolD
<int> <int> <int> <dbl> <int>
1 22 14 3 1 3
2 22 19 3 1 3
3 22 90 3 1 3
4 10 11 7 2 5
5 10 34 7 2 5
6 10 89 7 2 5
7 10 21 7 2 5
8 9 8 0 2 5
9 9 11 0 2 5
10 9 21 0 2 5
11 2 23 5 3 4
12 2 21 5 3 4
13 2 56 5 3 4
14 1 45 0 3 4
15 1 23 0 3 4
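To make the cumsum(c(1, diff(colA) < -1)) trick easier to follow, here is a base R sketch that runs the same steps on the already sorted columns from the output above. The intermediate names step and grp are hypothetical, and ave() is used here in place of the dplyr group_by/mutate pair.

```r
# colA and colB as they appear after sorting by descending colA
colA <- c(22, 22, 22, 10, 10, 10, 10, 9, 9, 9, 2, 2, 2, 1, 1)
colB <- c(14, 19, 90, 11, 34, 89, 21, 8, 11, 21, 23, 21, 56, 45, 23)

# TRUE wherever the value drops by more than 1, i.e. a new group starts
step <- diff(colA) < -1

# The leading 1 opens the first group; each TRUE bumps the group id by one
grp <- cumsum(c(1, step))
grp
# [1] 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3

# Number of distinct colB values within each group, repeated per row
newcolD <- ave(colB, grp, FUN = function(v) length(unique(v)))
newcolD
# [1] 3 3 3 5 5 5 5 5 5 5 4 4 4 4 4
```

A drop of exactly 1 (10 to 9, or 2 to 1) does not start a new group, which is why those values end up grouped together.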
EDIT FOR NEW DATA
With the data you added, we need to create a custom grouping. I use case_when in the example below. This matches the order you show in the desired output. In the text, you wrote that you wanted the table to be sorted by colC. To do so, change the last line to arrange(desc(colC)).
df1 = read.table(text='colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0', header = TRUE,stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(group_sequential = case_when(.$colA==10|.$colA==9~1,
.$colA==2|.$colA==1~2,
.$colA==22~3,
.$colA==11~4)) %>%
mutate(newcolD=n_distinct(colB)) %>%
arrange(desc(newcolD))
colA colB colC group_sequential newcolD
<int> <int> <int> <dbl> <int>
1 10 11 7 1 5
2 10 34 7 1 5
3 10 89 7 1 5
4 10 21 7 1 5
5 9 8 0 1 5
6 9 11 0 1 5
7 9 21 0 1 5
8 2 23 5 2 4
9 2 21 5 2 4
10 2 56 5 2 4
11 1 45 0 2 4
12 1 23 0 2 4
13 22 14 3 3 3
14 22 19 3 3 3
15 22 90 3 3 3
16 11 19 2 4 2
17 11 45 2 4 2
You're really not making it easy for us, reposting slight variations of the same question instead of updating the old one and presenting conditions that are vague and inconsistent with what the desired output implies. Anyhow, here is my attempt. This is more an answer to the second question you posted, as that was a bit more general in form.
It's a bit messy; it's pretty much a direct translation of your conditions into a for loop with some if statements. I chose to focus on your written conditions rather than the expected output, as those were easier to understand. If you want a better answer, please consider cleaning up your question(s) considerably.
df1 <- read.table(text="
colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0", header=TRUE)
df2 <- read.table(text="
colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
33 24 3
33 78 3
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0
32 11 0", header=TRUE)
df <- df1
for (i in 1:nrow(df)) {
  df$colD[i] <- ifelse(df$colC[i] == 0,
                       0,
                       length(unique(df$colA[1:i])))
  if (any(df$colA[i]-1 == df$colA[1:i]) & df$colC[i] != 0) {
    df$colD[i] <- df$colD[which(df$colA[i]-1 == df$colA[1:i])][1]
  }
}
# colA colB colC colD
# 10 11 7 1
# 10 34 7 1
# 10 89 7 1
# 10 21 7 1
# 2 23 5 2
# 2 21 5 2
# 2 56 5 2
# 22 14 3 3
# 22 19 3 3
# 22 90 3 3
# 11 19 2 1
# 11 45 2 1
# 1 45 0 0
# 1 23 0 0
# 9 8 0 0
# 9 11 0 0
# 9 21 0 0
df <- df2
for (i in 1:nrow(df)) {
  df$colD[i] <- ifelse(df$colC[i] == 0,
                       0,
                       length(unique(df$colA[1:i])))
  if (any(df$colA[i]-1 == df$colA[1:i]) & df$colC[i] != 0) {
    df$colD[i] <- df$colD[which(df$colA[i]-1 == df$colA[1:i])][1]
  }
}
df
# colA colB colC colD
# 10 11 7 1
# 10 34 7 1
# 10 89 7 1
# 10 21 7 1
# 2 23 5 2
# 2 21 5 2
# 2 56 5 2
# 33 24 3 3
# 33 78 3 3
# 22 14 3 4
# 22 19 3 4
# 22 90 3 4
# 11 19 2 1
# 11 45 2 1
# 1 45 0 0
# 1 23 0 0
# 9 8 0 0
# 9 11 0 0
# 9 21 0 0
# 32 11 0 0
To also group the rows where colC is zero, it's sufficient to adjust the conditionals like this:
for (i in 1:nrow(df)) {
  df$colD[i] <- length(unique(df$colA[1:i]))
  if (any(df$colA[i]-1 == df$colA[1:i])) {
    df$colD[i] <- df$colD[which(df$colA[i]-1 == df$colA[1:i])][1]
  }
}
