Create new column with values from certain rows of other columns - r

I have a dataframe that looks like this:
(Example edited)
df <- data.frame(Subject = c(rep("A", 9), rep("B", 8)),
Trial = c(1,1,2,3,4,4,5,6,6,1,2,2,3,4,5,5,6),
Feature_1 = c(rep(123, 2), 234, 345, rep(456, 2), 567, rep(678, 2), 831, rep(444, 2), 461, 921, rep(436, 2), 111),
Feature_2 = c(rep(321, 2), 543, 654, rep(765, 2), 876, rep(987, 2), 912, rep(302, 2), 900, 555, rep(382, 2), 197),
Feature_3 = c(rep(190, 2), 459, 392, rep(398, 2), 492, rep(587, 2), 761, rep(901, 2), 783, 312, rep(880, 2), 229),
Feature_correct = NA)
df
Subject Trial Feature_1 Feature_2 Feature_3 Feature_correct
1 A 1 123 321 190 NA
2 A 1 123 321 190 NA
3 A 2 234 543 459 NA
4 A 3 345 654 392 NA
5 A 4 456 765 398 NA
6 A 4 456 765 398 NA
7 A 5 567 876 492 NA
8 A 6 678 987 587 NA
9 A 6 678 987 587 NA
10 B 1 831 912 761 NA
11 B 2 444 302 901 NA
12 B 2 444 302 901 NA
13 B 3 461 900 783 NA
14 B 4 921 555 312 NA
15 B 5 436 382 880 NA
16 B 5 436 382 880 NA
17 B 6 111 197 229 NA
What I need is for the Feature_correct column to contain the values from Feature_n depending on Trial for each Subject. So:
Subject A & Trials 1 and 2: Feature_correct contains the values for Subject A & Trials 1 and 2 under Feature_1 (respectively).
Subject A & Trials 3 and 4: Feature_correct contains the values for Subject A & Trials 3 and 4 under Feature_2 (respectively).
Subject A & Trials 5 and 6: Feature_correct contains the values for Subject A & Trials 5 and 6 under Feature_3 (respectively).
and so on for Subject B.
This is my goal:
df$Feature_goal <- c(rep(123, 2), 234, 654, rep(765, 2), 492, rep(587, 2), 831, rep(444, 2), 900, 555, rep(880, 2), 229)
head(df)
Subject Trial Feature_1 Feature_2 Feature_3 Feature_correct Feature_goal
1 A 1 123 321 190 NA 123
2 A 1 123 321 190 NA 123
3 A 2 234 543 459 NA 234
4 A 3 345 654 392 NA 654
5 A 4 456 765 398 NA 765
6 A 4 456 765 398 NA 765
I know how to do this manually (specifying the Subject name and Trial number in the syntax), but I'd like to create a loop (or whatever else works) so that I don't have to type the name of each Subject (in my real dataset I have many participants and many "Feature" variables).
I've tried this for loop, but I get an error:
df <- for(i in 1:nrow(df$Subject)) {
if(df$Trial %in% c(1,2)){
df[df$Subject == i $ df$Trial %in% c(1,2),]$Feature_correct = df[df$Subject == i & df$Trial %in% c(1,2),]$Feature_1
}
if(df$Trial %in% c(3,4)){
df[df$Subject == i $ df$Trial %in% c(3,4),]$Feature_correct = df[df$Subject == i & df$Trial %in% c(3,4),]$Feature_2
}
if(df$Trial %in% c(5,6)){
df[df$Subject == i $ df$Trial %in% c(5,6),]$Feature_correct = df[df$Subject == i & df$Trial %in% c(5,6),]$Feature_3
}
}
> Error in 1:nrow(df$Subject) : argument of length 0
Indeed,
nrow(df$Subject)
> NULL
Would anyone know how to make this work (with the loop or in any other way)?

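A vectorized way is to build a row/column index matrix: paste "Feature_" with the Trial number, match the result against the column names, and use that matrix to subset values from the original dataframe. (The output below is from the original example, before the question was edited, in which Trial n maps directly to Feature_n.)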
df$Feature_Goal <- df[cbind(seq_len(nrow(df)),
match(paste0("Feature_", df$Trial), names(df)))]
df
# Subject Trial Feature_1 Feature_2 Feature_3 Feature_correct Feature_Goal
#1 A 1 123 321 190 NA 123
#2 A 1 123 321 190 NA 123
#3 A 2 234 543 459 NA 543
#4 A 2 234 543 459 NA 543
#5 A 3 345 654 392 NA 392
#6 A 3 345 654 392 NA 392
#7 B 1 456 765 398 NA 456
#8 B 1 456 765 398 NA 456
#9 B 2 567 876 492 NA 876
#10 B 2 567 876 492 NA 876
#11 B 3 678 987 587 NA 587
#12 B 3 678 987 587 NA 587
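
For the edited example in the question, where Trials 1 and 2 take Feature_1, Trials 3 and 4 take Feature_2, and Trials 5 and 6 take Feature_3, the same matrix-indexing idea can be adapted. The following is a minimal sketch, assuming exactly two trials per feature (hence `ceiling(Trial / 2)`). It indexes only the numeric Feature columns, because matrix-indexing the whole data frame goes through `as.matrix()` and returns character values when the Subject column is included. The names `feature_idx` and `features` are illustrative.

```r
df <- data.frame(
  Subject = c(rep("A", 9), rep("B", 8)),
  Trial = c(1,1,2,3,4,4,5,6,6,1,2,2,3,4,5,5,6),
  Feature_1 = c(rep(123, 2), 234, 345, rep(456, 2), 567, rep(678, 2), 831, rep(444, 2), 461, 921, rep(436, 2), 111),
  Feature_2 = c(rep(321, 2), 543, 654, rep(765, 2), 876, rep(987, 2), 912, rep(302, 2), 900, 555, rep(382, 2), 197),
  Feature_3 = c(rep(190, 2), 459, 392, rep(398, 2), 492, rep(587, 2), 761, rep(901, 2), 783, 312, rep(880, 2), 229)
)

# Assumption: two trials per feature, so trials 1-2 -> 1, 3-4 -> 2, 5-6 -> 3
feature_idx <- ceiling(df$Trial / 2)

# Numeric matrix holding only the Feature columns
features <- as.matrix(df[paste0("Feature_", 1:3)])

# One (row, column) pair per row of df picks the matching feature value
df$Feature_correct <- features[cbind(seq_len(nrow(df)), feature_idx)]
df
```

Because nothing refers to a Subject by name, the same code works for any number of participants. With more features, only the `1:3` and the trial-to-feature rule need to change.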

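Here is a solution using a loop. It assumes that Trial i maps to Feature_i, that the Feature columns start at column 3, and that Feature_correct is column 6.
for (i in 1:3) {
  idx <- which(df$Trial == i)   # rows belonging to trial i
  df[idx,6] <- df[idx,i+2]      # copy Feature_i into Feature_correct
}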

Related

Use a lookup table with a nested list

Hi, I have a bunch of hydrological data on streamflow (Q) that I want to standardize. The data is stored in a large nested list with a layout like the one below, which I need to keep:
Flowtestlist <- list(list("910" = data.frame( Q=c(650, 720, 550, 580, 800)),
"950" = data.frame( Q=c(550, 770, 520, 540, 790))),
list ("910" = data.frame( Q=c(450, 620, 750, 580, 800)),
"950" = data.frame( Q=c(650, 750, 580, 520, 890))))
I have levels [[1]] and [[2]], in reality, I have 9 of them and those are also model numbers. Within each model I have 18 subbasins numbered 910, 950, 1012, 1087 etc (in the example above just two subbasins 910, 950 for simplicity). The subbasins contain data on streamflow (Q).
There's also a lookup table:
test_model <- c(1,1,2,2)
test_subbasin <- c(910,950,910,950)
Q_mean <- c(870,765,823,689)
FlowtestDF <- data.frame(test_model, test_subbasin, Q_mean)
This data frame includes streamflow means (Q_mean) for the reference period for each model and subbasin. I want to take each Q from the nested table and find the matching model number and subbasin in the lookup table and divide it to get the standardized streamflow Q_st.
fun_st <- function(x, y=FlowtestDF) {
x$Q_st <- x$Q/y$Q_mean
x <- x
}
testresult <- lapply(Flowtestlist, lapply, fun_st)
It doesn't work. As I understand the function can't find the appropriate location of the needed number in the lookup table (model and subbasin). How can I make this work, while keeping the nested table structure of the data?
Are you looking for this?
# The inner Map() pairs each subbasin with its own Q_mean, so the mean is not recycled across rows
Map(\(x, y) Map(\(d, m) d / m, y[match(x$test_subbasin, names(y))], x$Q_mean),
    split(FlowtestDF, FlowtestDF$test_model),
    Flowtestlist)
# $`1`
# $`1`$`910`
# Q
# 1 0.7471264
# 2 0.8275862
# 3 0.6321839
# 4 0.6666667
# 5 0.9195402
#
# $`1`$`950`
# Q
# 1 0.7189542
# 2 1.0065359
# 3 0.6797386
# 4 0.7058824
# 5 1.0326797
#
#
# $`2`
# $`2`$`910`
# Q
# 1 0.5467801
# 2 0.7533414
# 3 0.9113001
# 4 0.7047388
# 5 0.9720535
#
# $`2`$`950`
# Q
# 1 0.9433962
# 2 1.0885341
# 3 0.8417997
# 4 0.7547170
# 5 1.2917271
Note: If you're (still) using R<4.1, instead of e.g. \(x, y) use function(x, y).
It is easier to do the processing if you have data in a flat dataframe. If for some reason you have to keep the dataframe in nested structure you may split it again.
library(dplyr)
library(purrr)
map_df(Flowtestlist, ~bind_rows(., .id = 'test_subbasin'), .id = 'test_model') %>%
type.convert(as.is = TRUE) %>%
left_join(FlowtestDF, by = c('test_subbasin', 'test_model')) %>%
mutate(Q_st = Q/Q_mean) %>%
split(.$test_model) %>%
map(~.x %>% select(Q, Q_st) %>% split(.x$test_subbasin))
#$`1`
#$`1`$`910`
# Q Q_st
#1 650 0.7471264
#2 720 0.8275862
#3 550 0.6321839
#4 580 0.6666667
#5 800 0.9195402
#$`1`$`950`
# Q Q_st
#6 550 0.7189542
#7 770 1.0065359
#8 520 0.6797386
#9 540 0.7058824
#10 790 1.0326797
#$`2`
#$`2`$`910`
# Q Q_st
#11 450 0.5467801
#12 620 0.7533414
#13 750 0.9113001
#14 580 0.7047388
#15 800 0.9720535
#$`2`$`950`
# Q Q_st
#16 650 0.9433962
#17 750 1.0885341
#18 580 0.8417997
#19 520 0.7547170
#20 890 1.2917271
library(tidyr)
extr <- function(x){
a <- data.frame(x)
names(a) <- names(x)
a$test_model <- parent.frame()$i
a <- pivot_longer(a,setdiff(names(a),'test_model'),names_to = 'test_subbasin',values_to = 'Q')
a
}
to_df <- lapply(Flowtestlist,extr)
df <- do.call(rbind,to_df)
with_lookup <- merge(df,FlowtestDF,by =c('test_model','test_subbasin'))
with_lookup$Q_st <- with_lookup$Q/with_lookup$Q_mean
with_lookup
Output:
test_model test_subbasin Q Q_mean Q_st
<int> <chr> <dbl> <dbl> <dbl>
1 1 910 650 870 0.747
2 1 910 720 870 0.828
3 1 910 550 870 0.632
4 1 910 580 870 0.667
5 1 910 800 870 0.920
6 1 950 550 765 0.719
7 1 950 770 765 1.01
8 1 950 520 765 0.680
9 1 950 540 765 0.706
10 1 950 790 765 1.03
11 2 910 450 823 0.547
12 2 910 620 823 0.753
13 2 910 750 823 0.911
14 2 910 580 823 0.705
15 2 910 800 823 0.972
16 2 950 650 689 0.943
17 2 950 750 689 1.09
18 2 950 580 689 0.842
19 2 950 520 689 0.755
20 2 950 890 689 1.29
The following derives the required output as a flat data frame. The lookup has to match on both model and subbasin. Matching on subbasin alone would apply the means of model 1 to model 2 as well.
q <- unlist(Flowtestlist)
df <- data.frame(Q = q, ref = names(q), test_model = rep(seq_along(Flowtestlist), lengths(lapply(Flowtestlist, unlist))), row.names = NULL)
key <- paste(df$test_model, sub("\\..*", "", df$ref))  # "model subbasin"
df$Q_st <- df$Q / FlowtestDF$Q_mean[match(key, paste(FlowtestDF$test_model, FlowtestDF$test_subbasin))]
df
# Q ref test_model Q_st
# 1 650 910.Q1 1 0.7471264
# 2 720 910.Q2 1 0.8275862
# 3 550 910.Q3 1 0.6321839
# 4 580 910.Q4 1 0.6666667
# 5 800 910.Q5 1 0.9195402
# 6 550 950.Q1 1 0.7189542
# 7 770 950.Q2 1 1.0065359
# 8 520 950.Q3 1 0.6797386
# 9 540 950.Q4 1 0.7058824
# 10 790 950.Q5 1 1.0326797
# 11 450 910.Q1 2 0.5467801
# 12 620 910.Q2 2 0.7533414
# 13 750 910.Q3 2 0.9113001
# 14 580 910.Q4 2 0.7047388
# 15 800 910.Q5 2 0.9720535
# 16 650 950.Q1 2 0.9433962
# 17 750 950.Q2 2 1.0885341
# 18 580 950.Q3 2 0.8417997
# 19 520 950.Q4 2 0.7547170
# 20 890 950.Q5 2 1.2917271
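
The goal stated in the question, adding a Q_st column next to Q while keeping the nested list, can also be sketched in base R without flattening. This is a minimal sketch under two assumptions: the position of each element in the outer list is its model number, and the inner list names are the subbasin numbers. `add_q_st()` is a hypothetical helper name, not part of any package.

```r
Flowtestlist <- list(
  list("910" = data.frame(Q = c(650, 720, 550, 580, 800)),
       "950" = data.frame(Q = c(550, 770, 520, 540, 790))),
  list("910" = data.frame(Q = c(450, 620, 750, 580, 800)),
       "950" = data.frame(Q = c(650, 750, 580, 520, 890)))
)
FlowtestDF <- data.frame(test_model = c(1, 1, 2, 2),
                         test_subbasin = c(910, 950, 910, 950),
                         Q_mean = c(870, 765, 823, 689))

# Hypothetical helper: standardize every subbasin of one model
add_q_st <- function(basins, model, lookup = FlowtestDF) {
  Map(function(d, basin) {
    # Look up the single reference mean for this model/subbasin pair
    q_mean <- lookup$Q_mean[lookup$test_model == model &
                            lookup$test_subbasin == as.numeric(basin)]
    d$Q_st <- d$Q / q_mean
    d
  }, basins, names(basins))
}

# Assumption: the index in the outer list is the model number
testresult <- Map(add_q_st, Flowtestlist, seq_along(Flowtestlist))
str(testresult)
```

Passing the model index and the subbasin name explicitly is what the original `fun_st` was missing: inside `lapply(Flowtestlist, lapply, fun_st)` the function only sees the innermost data frame and has no way to know which row of the lookup table applies.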

Pivot/Reshape data in R [duplicate]

Thank you all for your answers. I thought I was smarter than I am and hoped I would have understood any of it. I think I messed up the presentation of my data as well. I have edited my post to better show my sample data. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about converting wide to long format, but they don't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell that already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from the column names, based on the names_pattern argument. The names_pattern argument takes a regular expression with a capture group. In this case, here is the breakdown:
(.+) # capture group: match everything - the captured text becomes the ".value" column name
[0-9] # a single digit - keeps the trailing number out of ".value".
# If the trailing numbers can have several digits, use "(\\D+)[0-9]+" instead:
# with [0-9]+ alone, the greedy (.+) would still swallow all but the last digit
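
To check what a given pattern would capture before passing it to names_pattern, it can be tried on a few column names with base R's `sub()`. This is only a sketch: tidyr uses its own regex engine, so the code below assumes that these simple patterns behave the same way there. The column names with two-digit suffixes are made up for illustration.

```r
# Hypothetical column names, two of them with a two-digit suffix
cols <- c("measurement1", "Tdays1", "measurement12", "Tdays12")

# Single-digit pattern: the greedy (.+) keeps the first digit of "12"
greedy <- sub("(.+)[0-9]", "\\1", cols, perl = TRUE)

# Non-digit capture plus [0-9]+: the whole numeric suffix is dropped
safe <- sub("(\\D+)[0-9]+", "\\1", cols, perl = TRUE)

print(greedy)
print(safe)
```

With the single-digit pattern, "measurement12" would end up in a separate "measurement1" column, while the non-digit capture group sends it to "measurement" like the others.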
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
This keeps measurements and Tdays neatly together but leaves us without pid, which we can add by using rep to replicate the original pid 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not the order you expected. If that matters, you can sort the data.frame by pid and keep only the pid, Tdays, and measurement columns:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
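
The sorted result still contains rows in which both Time and Value are NA, whereas the desired output in the question has none. A minimal base R sketch of the whole approach, including dropping those rows with `complete.cases()`, is shown here on the first three rows of the sample data. The column groups are selected by name prefix, which assumes all measurement columns are spelled consistently. The name `long` is illustrative.

```r
data <- read.table(header = TRUE, text = '
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
')

# Pick the two column groups by their name prefix
m_cols <- grep("^measurement", names(data), value = TRUE)
t_cols <- grep("^Tdays", names(data), value = TRUE)

# Stack each group and repeat pid once per stacked column
long <- data.frame(pid   = rep(data$pid, length(m_cols)),
                   Time  = stack(data[t_cols])$values,
                   Value = stack(data[m_cols])$values)

# Drop rows with a missing Time or Value, then sort by pid
long <- long[complete.cases(long), ]
long <- long[order(long$pid), ]
long
```

Sorting with `order()` keeps the original measurement order within each pid, so the rows come out as measurement1, measurement2, and so on.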
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
tobind = df[, c(1,j,j+1)]
finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)

Convert hashtable/dictionary/array format data into regular column based data.frame

I am a beginner in R and have never processed this type of data before. I have the following two sample data sets (df1 and df2):
df1 <- c("{\"\"Wednesday\"\":4,\"\"Monday\"\":5,\"\"Saturday\"\":4,\"\"Thursday\"\":4,\"\"Tuesday\"\":5,\"\"Friday\"\":1,\"\"Sunday\"\":5,\"\"Missing day\"\":2}",
"{\"\"Wednesday\"\":6,\"\"Monday\"\":5,\"\"Saturday\"\":2,\"\"Thursday\"\":6,\"\"Tuesday\"\":0,\"\"Friday\"\":2,\"\"Sunday\"\":4,\"\"Missing day\"\":1}",
"{\"\"Wednesday\"\":5,\"\"Monday\"\":5,\"\"Saturday\"\":3,\"\"Thursday\"\":8,\"\"Tuesday\"\":4,\"\"Friday\"\":3,\"\"Sunday\"\":6,\"\"Missing day\"\":4}",
"{\"\"Wednesday\"\":3,\"\"Monday\"\":5,\"\"Saturday\"\":4,\"\"Thursday\"\":1,\"\"Tuesday\"\":5,\"\"Friday\"\":4,\"\"Sunday\"\":4,\"\"Missing day\"\":6}")
df2 <- c("[373,357,382,411,310,315,330,385,367,396,402,348,354,343,392,395,392,401,376,448,341,373,369,304,298,332,366,287,334,222]",
"[319,347,284,313,300,292,228,322,291,275,278,289,323,342,272,242,295,347,290,343,337,309,268,251,256,266,346,260,232,160]",
"[165,154,161,152,164,152,156,150,137,170,147,210,235,190,176,175,191,186,209,157,210,199,162,149,162,165,174,171,178,126]",
"[253,274,240,258,264,231,296,233,230,252,210,233,233,295,235,229,270,275,278,297,255,253,250,252,299,305,310,308,263,141]")
Now, I need to convert df1 into df1_final and df2 into df2_final. Here is how the final data sets should look:
df1_final <- data.frame("Day"=c("Wednesday","Monday", "Saturday", "Thursday", "Tuesday", "Friday", "Sunday", "Missing day"),
"Count1"=c(4,5,4,4,5,1,5,2),
"Count2"=c(6,5,2,6,0,2,4,1),
"Count3"=c(5,5,3,8,4,3,6,4),
"Count4"=c(3,5,4,1,5,4,4,6))
df2_final <- data.frame("group1"=c(373,357,382,411,310,315,330,385,367,396,402,348,354,343,392,395,392,401,376,448,341,373,369,304,298,332,366,287,334,222), "group2"=c(319,347,284,313,300,292,228,322,291,275,278,289,323,342,272,242,295,347,290,343,337,309,268,251,256,266,346,260,232,160), "group3"=c(165,154,161,152,164,152,156,150,137,170,147,210,235,190,176,175,191,186,209,157,210,199,162,149,162,165,174,171,178,126), "group4"=c(253,274,240,258,264,231,296,233,230,252,210,233,233,295,235,229,270,275,278,297,255,253,250,252,299,305,310,308,263,141))
Can someone please help me figure this out? I appreciate any help. Thank you!
You could use either reticulate or jsonlite. I will use jsonlite, as below:
For df1:
df1_f <- jsonlite::fromJSON(gsub('"+','"',sprintf("[%s]", paste0(df1, collapse = ","))))
data.frame(Day = names(df1_f), `colnames<-`(t(df1_f), paste0("count",1:4)), row.names = NULL)
Day count1 count2 count3 count4
1 Wednesday 4 6 5 3
2 Monday 5 5 5 5
3 Saturday 4 2 3 4
4 Thursday 4 6 8 1
5 Tuesday 5 0 4 5
6 Friday 1 2 3 4
7 Sunday 5 4 6 4
8 Missing day 2 1 4 6
For df2, since the arrays are not wrapped in {}, we have to transform the result into a dataframe manually:
df2_fin <- jsonlite::fromJSON(sprintf("[%s]",paste0(df2, collapse = ",")))
(df2_final <- setNames(data.frame(t(df2_fin)), paste0("group",1:4)))
group1 group2 group3 group4
1 373 319 165 253
2 357 347 154 274
3 382 284 161 240
4 411 313 152 258
5 310 300 164 264
6 315 292 152 231
7 330 228 156 296
8 385 322 150 233
9 367 291 137 230
10 396 275 170 252
11 402 278 147 210
12 348 289 210 233
13 354 323 235 233
:
:
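
If adding a package is not an option, the bracketed strings in df2 are simple enough to parse with base R. The following is a rough sketch, assuming every string is a flat list of numbers with no nesting. The input is shortened to three values per group for illustration, and the name `parsed` is hypothetical.

```r
# Shortened stand-in for df2: the first three values of the first two strings
df2 <- c("[373,357,382]", "[319,347,284]")

# Strip the square brackets and let scan() read the comma-separated numbers
parsed <- lapply(df2, function(s) {
  scan(text = gsub("\\[|\\]", "", s), sep = ",", quiet = TRUE)
})

# One column per input string, named group1, group2, ...
df2_final <- setNames(as.data.frame(do.call(cbind, parsed)),
                      paste0("group", seq_along(parsed)))
df2_final
```

This only covers the array case. For the key/value strings in df1, a JSON parser is the more robust choice.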

How to detect duplicates of single values in all rows and columns in R data.frame

I have a large data-set consisting of a header and a series of values in that column. I want to detect the presence and number of duplicates of these values within the whole dataset.
1 2 3 4 5 6 7
734 456 346 545 874 734 455
734 783 482 545 456 948 483
So for example, it would detect 734 3 times, 456 twice etc.
I've tried using the duplicated function in R, but this seems to only work on rows as a whole or columns as a whole. Using
duplicated(df)
doesn't pick up any duplicates, though I know there are two duplicates in the first row.
So I'm asking how to detect duplicates both within and between columns/rows.
Cheers
You can use table() and data.frame() to see the occurrence
data.frame(table(v))
such that
v Freq
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 346 1
9 455 1
10 456 2
11 482 1
12 483 1
13 545 2
14 734 3
15 783 1
16 874 1
17 948 1
DATA
v <- c(1, 2, 3, 4, 5, 6, 7, 734, 456, 346, 545, 874, 734, 455, 734,
783, 482, 545, 456, 948, 483)
You can transform it to a vector and then use table() as follows:
library(data.table)
library(dplyr)
df<-fread("734 456 346 545 874 734 455
734 783 482 545 456 948 483")
df%>%unlist()%>%table()
# 346 455 456 482 483 545 734 783 874 948
# 1 1 2 1 1 2 3 1 1 1
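
To list only the values that occur more than once, the table can be filtered by its counts. This is a small base R sketch that reads the two sample rows with `read.table()`. The names `counts` and `dups` are illustrative.

```r
df <- read.table(header = FALSE, text = "
734 456 346 545 874 734 455
734 783 482 545 456 948 483")

# Flatten all cells into one vector and count each distinct value
counts <- table(unlist(df))

# Keep only the values that appear at least twice
dups <- counts[counts > 1]
dups
```

Flattening with `unlist()` is what makes the check work across both rows and columns, since `duplicated(df)` compares whole rows only.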

Checking the value from given threshold in a set of observation and continue till end of vector

Task:
I have to check whether the values in a data vector are above a given threshold.
If I find at least 5 consecutive values greater than the threshold, I keep these values as they are.
If a run of values above the threshold is shorter than 5 consecutive values, I replace these values with NAs.
The sample data and required output are shown below. In this example the threshold value is 1000. X is the input data variable and the desired output is: Y = X(Threshold > 1000)
X Y
580 580
457 457
980 980
1250 NA
3600 NA
598 598
1200 1200
1345 1345
9658 9658
1253 1253
4500 4500
1150 1150
596 596
594 594
550 550
1450 NA
320 320
1780 NA
592 592
590 590
I have used the following code in R for my desired output but unable to get the appropriate one:
for (i in 1:nrow(X)) # X is my data vector
{counter=0
if (X[i]>10000)
{
for (j in i:(i+4))
{
if (X[j]>10000)
{counter=counter+1}
}
ifelse (counter < 5, NA, X[j])
}
X[i]<- NA
}
X
I am sure that the above code contain some error. I need help in the form of either a new code or modifying this code or any package in R.
Here is an approach using dplyr, using a cumulative sum of diff(x > 1000) to group the values.
library(dplyr)
df <- data.frame(x)
df
# x
# 1 580
# 2 457
# 3 980
# 4 1250
# 5 3600
# 6 598
# 7 1200
# 8 1345
# 9 9658
# 10 1253
# 11 4500
# 12 1150
# 13 596
# 14 594
# 15 550
# 16 1450
# 17 320
# 18 1780
# 19 592
# 20 590
df %>% mutate(group = cumsum(c(0, abs(diff(x>1000))))) %>%
group_by(group) %>%
mutate(count = n()) %>%
ungroup() %>%
mutate(y = ifelse(x <= 1000 | count >= 5, x, NA)) # keep runs of 5 or more values above 1000
# x group count y
# (int) (dbl) (int) (int)
# 1 580 0 3 580
# 2 457 0 3 457
# 3 980 0 3 980
# 4 1250 1 2 NA
# 5 3600 1 2 NA
# 6 598 2 1 598
# 7 1200 3 6 1200
# 8 1345 3 6 1345
# 9 9658 3 6 9658
# 10 1253 3 6 1253
# 11 4500 3 6 4500
# 12 1150 3 6 1150
# 13 596 4 3 596
# 14 594 4 3 594
# 15 550 4 3 550
# 16 1450 5 1 NA
# 17 320 6 1 320
# 18 1780 7 1 NA
# 19 592 8 2 592
# 20 590 8 2 590
Another approach :
Y<-rep(NA,nrow(X))
for (i in 1:nrow(X)) {
if (X[i,1]<1000) {Y[i]<-X[i,1]} else if (sum(X[i:min((i+4),nrow(X)),1]>1000)>=5) {
Y[i:min((i+4),nrow(X))]<-X[i:min((i+4),nrow(X)),1]}
}
returns
> Y
[1] 580 457 980 NA NA 598 1200 1345 9658 1253 4500 1150 596 594 550 NA 320 NA 592 590
This assumes that the values of X are in the first column of a dataframe named X.
It then creates Y filled with NA and only changes the values if the criteria are met.
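
The same rule can also be written without a loop or extra packages by using run-length encoding. This is a minimal sketch, assuming the data is a plain numeric vector `x`, the threshold is 1000, and a run must contain at least 5 values to be kept. The names `above`, `runs`, and `run_len` are illustrative.

```r
x <- c(580, 457, 980, 1250, 3600, 598, 1200, 1345, 9658, 1253,
       4500, 1150, 596, 594, 550, 1450, 320, 1780, 592, 590)

above <- x > 1000                 # TRUE where the value exceeds the threshold
runs <- rle(above)                # consecutive runs of TRUE / FALSE

# Length of the run that each element belongs to
run_len <- rep(runs$lengths, runs$lengths)

# Values above the threshold in a run shorter than 5 become NA
y <- ifelse(above & run_len < 5, NA, x)
y
```

Values at or below the threshold are never touched, because the `above` condition is FALSE for them regardless of the run length.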
