Fill in values between the start and end of multiple values in R

I have a question similar to this post: Fill in values between start and end value in R
The difference is that I need to fill in values between the start and end of multiple values, and my data doesn't contain an ID column.
My data look like this (the original data have many different values):
My final result should look like this:
Data:
structure(list(elevation = c(150L,140L, 130L, 120L, 110L, 120L, 130L, 140L, 150L, 90L, 80L, 70L,66L, 60L, 50L, 66L, 70L, 72L, 68L, 65L, 60L, 68L, 70L),code = c(NA, NA, "W", NA, NA, NA, "W", NA, NA, NA, NA, NA, "X", NA, NA, "X", NA, NA, "Y", NA, NA, "Y", NA)), class = "data.frame", row.names = c(NA,-23L))
Thanks in advance

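library(dplyr)
# only_within = TRUE fills NAs only when the previous and next non-NA values are the same
df %>%
mutate(code = runner::fill_run(code, only_within = TRUE))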
elevation code
1 150 <NA>
2 140 <NA>
3 130 W
4 120 W
5 110 W
6 120 W
7 130 W
8 140 <NA>
9 150 <NA>
10 90 <NA>
11 80 <NA>
12 70 <NA>
13 66 X
14 60 X
15 50 X
16 66 X
17 70 <NA>
18 72 <NA>
19 68 Y
20 65 Y
21 60 Y
22 68 Y
23 70 <NA>

This may not be pretty, but it works (here dd is the data frame from the question):
codepos <- which(!is.na(dd$code))
stopifnot(length(codepos)%%2==0)
for (group in split(codepos, (seq_along(codepos)+1)%/%2)) {
stopifnot(dd$code[group[1]] == dd$code[group[2]])
dd$code[group[1]:group[2]] <- dd$code[group[1]]
}
We start by finding the positions of all non-NA codes. We assume that they always come in pairs with matching values, and then fill in the range between the two positions of each pair.
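As a possible vectorized alternative to the loop, here is a minimal base R sketch. It fills the codes downward and upward and keeps a value only where both directions agree. The helper fill_down is a name made up for this illustration, and the approach assumes that neighbouring pairs carry different codes (two adjacent "W" pairs would be merged into one run).

```r
df <- data.frame(
  elevation = c(150L, 140L, 130L, 120L, 110L, 120L, 130L, 140L, 150L, 90L, 80L, 70L,
                66L, 60L, 50L, 66L, 70L, 72L, 68L, 65L, 60L, 68L, 70L),
  code = c(NA, NA, "W", NA, NA, NA, "W", NA, NA, NA, NA, NA, "X", NA, NA, "X",
           NA, NA, "Y", NA, NA, "Y", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last non-NA value forward (NA before the first one)
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))          # 0 until the first non-NA value is seen
  c(NA, x[!is.na(x)])[idx + 1]
}

down <- fill_down(df$code)             # last non-NA value at or before each row
up   <- rev(fill_down(rev(df$code)))   # next non-NA value at or after each row

# Keep a code only where the previous and next non-NA values agree
filled <- ifelse(!is.na(down) & !is.na(up) & down == up, down, NA_character_)
df$code <- filled
```

Unlike the loop, this does not stop with an error when the codes are unpaired; it simply leaves those rows as NA.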

Here's a tidyverse approach. It generates a temporary grouping by assigning values to the pattern given through the alternating NAs and characters.
library(dplyr)
library(tidyr)
df %>%
mutate(n = n(), l_c = lag(code)) %>%
group_by(grp = cumsum(lag(!is.na(code), default = F) == is.na(code)),
grp_in = grp %in% seq(2, unique(n), 4)) %>%
fill(l_c) %>%
ungroup() %>%
mutate(code = ifelse(grp_in, l_c, code)) %>%
select(elevation, code) %>%
print(n = Inf)
# A tibble: 23 × 2
elevation code
<int> <chr>
1 150 NA
2 140 NA
3 130 W
4 120 W
5 110 W
6 120 W
7 130 W
8 140 NA
9 150 NA
10 90 NA
11 80 NA
12 70 NA
13 66 X
14 60 X
15 50 X
16 66 X
17 70 NA
18 72 NA
19 68 Y
20 65 Y
21 60 Y
22 68 Y
23 70 NA

Related

How do I remove a whole row of a data frame if one column is 0 or NA

I have a dataframe (DF) with 4 columns. How do I make it so if column 4 is either a 0 or an NA, then remove the whole row? So in the example below only row 1 would be left.
Column 1 Column 2 Column 3 Column 4
11 24 234 2123
45 63 22 0
234 234 123 NA
using dplyr
library(dplyr)
df %>% filter(!is.na(Column.4) & Column.4 != 0)
You can use logical vectors to subset your data:
df[!is.na(df[,4]) & (df[,4]!=0), ]
Example:
df = data.frame(x = rnorm(30), y = rnorm(30), z = rnorm(30), a = rep(c(0,1,NA),10))
x y z a
2 -0.21772820 -0.5337648 -1.07579623 1
5 0.64536474 0.2011776 -0.12981424 1
8 2.36411372 0.0343823 2.03561701 1
11 1.09103526 -1.9287689 0.59511269 1
14 0.32482389 -0.5562136 -0.38943092 1
17 0.63621067 -1.6517097 -0.09804529 1
20 2.61892085 1.5575784 -0.50803567 1
23 0.07854647 1.1861483 -0.49798074 1
26 0.19561725 1.1036331 -0.66349688 1
29 0.22470875 -0.4192745 0.09153176 1
You can use sapply to loop through each row and keep the rows that satisfy the conditions (note that this checks every column of the row, not just column 4):
df[sapply(1:nrow(df), function(i) all(!is.na(df[i,])) & all(df[i,] != 0)), ]
Data:
structure(list(Column.1 = c(11L, 45L, 234L), Column.2 = c(24L,
63L, 234L), Column.3 = c(234L, 22L, 123L), Column.4 = c(2123L,
0L, NA)), class = "data.frame", row.names = c(NA, -3L)) -> df
Output:
# Column.1 Column.2 Column.3 Column.4
# 1 11 24 234 2123
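Another base R option, sketched here on the same three-row data, relies on which(): it returns only the positions where the condition is TRUE, so rows where the comparison yields NA are dropped without a separate is.na() test. The name kept is just for illustration.

```r
df <- data.frame(
  Column.1 = c(11L, 45L, 234L),
  Column.2 = c(24L, 63L, 234L),
  Column.3 = c(234L, 22L, 123L),
  Column.4 = c(2123L, 0L, NA)
)

# Column.4 != 0 is TRUE, FALSE, NA; which() keeps only the TRUE position
kept <- df[which(df$Column.4 != 0), ]
kept
```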

Sort a data frame based on another sorted column value in R

I have a data frame that is sorted on a numeric column in order to assign a rank. For the rows where this numeric column is zero, the data frame should instead be arranged by a character column.
The rank is based on var2, which is why I sorted on var2. If rows have identical values in var2, I have to use var3 to break the tie; see rows 2 and 3 of the data frame below, where the var2 values are identical. If var2 is zero, I have to sort by var1 (a character column) in alphabetical order and then assign the rank. If var2 is NA, no rank is assigned.
Below, the data frame is sorted on var2 in descending order, but var2 also contains zeros. The rows with var2 equal to zero need to be sorted alphabetically by var1, followed by the NA rows, also in alphabetical order of var1.
example:
# var1 var2 var3 rank
# 1 c 556 45 1
# 2 a 345 35 3
# 3 f 345 64 2
# 4 b 134 87 4
# 5 z 0 34 5
# 6 d 0 32 6
# 7 c 0 12 7
# 8 a 0 23 8
# 9 e NA
# 10 b NA
below is my code
df <- data.frame(var1=c("c","a","f","b","z","d", "c","a", "e", "b", "ad", "gf", "kg", "ts", "mp"), var2=c(134, NA,345, 200, 556,NA, 345, 200, 150, 0, 25,10,0,150,0), var3=c(65,'',45,34,68,'',73,12,35,23,34,56,56,78,123))
# To break the tie between var3 and var2
orderdf <- df[order(df$var2, df$var1, decreasing = TRUE), ]
#assigning rank
rankdf <- orderdf %>% mutate(rank = ifelse(is.na(var2),'', seq(1:nrow(orderdf))))
The expected output sorts var1 in alphabetical order for the rows where var2 is zero.
expected output:
# var1 var2 var3 rank
# 1 c 556 45 1
# 2 a 345 35 3
# 3 f 345 64 2
# 4 b 134 87 4
# 5 a 0 34 5
# 6 c 0 32 6
# 7 d 0 12 7
# 8 z 0 23 8
# 9 b NA
# 10 e NA
With dplyr you can use
df %>%
arrange(desc(var2), var1)
and afterwards you create the column rank
EDIT
The following code is a bit cumbersome but it gets the job done. Basically it orders the rows in which var2 is equal or different from zero separately, then combines the two ordered dataframes together and finally creates the rank column.
Data
df <- data.frame(
var1 = c("c","a","f","b","z","d", "c","a", "e", "z", "ad", "gf", "kg", "ts", "mp"),
var2 = c(134, NA,345, 200, 556,NA, 345, 200, 150, 0, 25,10,0,150,0),
var3 = as.numeric(c(65,'',45,34,68,'',73,12,35,23,34,56,56,78,123))
)
df
# var1 var2 var3
# 1 c 134 65
# 2 a NA NA
# 3 f 345 45
# 4 b 200 34
# 5 z 556 68
# 6 d NA NA
# 7 c 345 73
# 8 a 200 12
# 9 e 150 35
# 10 z 0 23
# 11 ad 25 34
# 12 gf 10 56
# 13 kg 0 56
# 14 ts 150 78
# 15 mp 0 123
Code
df %>%
# work on rows with var2 different from 0 or NA
filter(var2 != 0) %>%
arrange(desc(var2), desc(var3)) %>%
# merge with rows with var2 equal to 0 or NA
bind_rows(df %>% filter(var2 == 0 | is.na(var2)) %>% arrange(var1)) %>%
arrange(desc(var2)) %>%
# create the rank column only for the rows with var2 different from NA
mutate(
rank = seq_len(nrow(df)),
rank = ifelse(is.na(var2), NA, rank)
)
Output
# var1 var2 var3 rank
# 1 z 556 68 1
# 2 c 345 73 2
# 3 f 345 45 3
# 4 b 200 34 4
# 5 a 200 12 5
# 6 ts 150 78 6
# 7 e 150 35 7
# 8 c 134 65 8
# 9 ad 25 34 9
# 10 gf 10 56 10
# 11 kg 0 56 11
# 12 mp 0 123 12
# 13 z 0 23 13
# 14 a NA NA NA
# 15 d NA NA NA
Using only base R's order() function, sort first by descending var2 and then by ascending var1, passing the resulting integer vector to the square brackets:
df[order(-df$var2, df$var1), ]
Adding a rank column too is then just
df[order(-df$var2, df$var1), "rank"] <- 1:length(df$var1)
Using data.table
library(data.table)
setDT(df)[order(-var2, var1)][, rank := seq_len(.N)][]
data
df <- structure(list(var1 = structure(c(3L, 1L, 6L, 2L, 7L, 4L, 3L,
1L, 5L, 2L), .Label = c("a", "b", "c", "d", "e", "f", "z"), class = "factor"),
var2 = c(1456L, 456L, 345L, 134L, 0L, 0L, 0L, 0L, NA, NA)),
class = "data.frame", row.names = c(NA, -10L))
You can do it in base R, using order :
cols <- c('var1', 'var2')
remaining_cols <- setdiff(names(df), cols)
df1 <- df[cols]
cbind(transform(df1[with(df1, order(-var2, var1)), ],
rank = seq_len(nrow(df1))), df[remaining_cols])
# var1 var2 rank var3
#1 c 556 1 45
#2 a 345 2 35
#3 f 345 3 64
#4 b 134 4 87
#8 a 0 5 34
#7 c 0 6 32
#6 d 0 7 12
#5 z 0 8 23
#10 b NA 9 10
#9 e NA 10 11
data
df <- structure(list(var1 = structure(c(3L, 1L, 6L, 2L, 7L, 4L, 3L,
1L, 5L, 2L), .Label = c("a", "b", "c", "d", "e", "f", "z"), class = "factor"),
var2 = c(556L, 345L, 345L, 134L, 0L, 0L, 0L, 0L, NA, NA),
var3 = c(45L, 35L, 64L, 87L, 34L, 32L, 12L, 23L, 10L, 11L
)), class = "data.frame", row.names = c(NA, -10L))
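The tie-breaking described in the question can also be sketched with a single order() call in base R. This assumes the intended rule is: var2 descending, ties in non-zero var2 broken by var3 descending, rows with var2 equal to zero sorted alphabetically by var1, and NA rows last without a rank. The second sort key is a construction for this illustration, not something taken from the answers above.

```r
df <- data.frame(
  var1 = c("c", "a", "f", "b", "z", "d", "c", "a", "e", "b"),
  var2 = c(556L, 345L, 345L, 134L, 0L, 0L, 0L, 0L, NA, NA),
  var3 = c(45L, 35L, 64L, 87L, 34L, 32L, 12L, 23L, 10L, 11L),
  stringsAsFactors = FALSE
)

# Key 1: var2 descending (NA goes last by default)
# Key 2: var3 descending, but constant for the zero rows so that key 3 decides there
# Key 3: var1 ascending
ord <- order(-df$var2, ifelse(df$var2 == 0, 0, -df$var3), df$var1)
sorted <- df[ord, ]

# Rank follows the row order; rows with NA in var2 get no rank
sorted$rank <- ifelse(is.na(sorted$var2), NA_integer_, seq_len(nrow(sorted)))
sorted
```

Because var3 only matters for ties in var2, the third key never disturbs the ordering of the non-zero rows unless both var2 and var3 are tied.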

Reduce repeated pivoting to a single pivot

Using tidyr >= 1.0.0, one can use tidy selection in the cols argument as follows:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols=starts_with("DL_TM"),
names_to = "TM",values_to = "DM_TM") %>%
pivot_longer(cols=starts_with("DL_CD"),
names_to = "CD",values_to = "DL_CD") %>%
na.omit() %>%
select(-TM,-CD)
However, the above quickly gets cumbersome (repetitive) with many columns. How can one reduce this to a single pivot? I have imagined something conceptual like
pivot_longer(cols=starts_with("DL_TM | DL_CD")....), which will obviously not work because tidy selection only works for a single pattern (as far as I know).
Data
df <- structure(list(DL_TM1 = c(16L, 18L, 53L, 59L, 29L, 3L), DL_CD1 = c("AK",
"RB", "RA", "AJ", "RA", "RS"), DL_TM2 = c(5L, 4L, 8L, NA, 1L,
NA), DL_CD2 = c("CN", "AJ", "RB", NA, "AJ", NA), DL_TM3 = c(NA,
NA, 2L, NA, NA, NA), DL_CD3 = c(NA, NA, "AJ", NA, NA, NA), DL_TM4 = c(NA,
NA, NA, NA, NA, NA), DL_CD4 = c(NA, NA, NA, NA, NA, NA), DL_TM5 = c(NA,
NA, NA, NA, NA, NA), DL_CD5 = c(NA, NA, NA, NA, NA, NA), DEP_DELAY_TM = c(21L,
22L, 63L, 59L, 30L, 3L)), class = "data.frame", row.names = c(NA,
-6L))
Expected Output:
Same as the above but with single pivoting.
Based on the response to my comment, the code in the question does not actually produce the desired result; what was wanted is the result that the following produces:
df %>%
pivot_longer(-DEP_DELAY_TM, names_to = c(".value", "X"),
names_pattern = "(\\D+)(\\d)") %>%
select(-X) %>%
drop_na
giving:
# A tibble: 11 x 3
DEP_DELAY_TM DL_TM DL_CD
<int> <int> <chr>
1 21 16 AK
2 21 5 CN
3 22 18 RB
4 22 4 AJ
5 63 53 RA
6 63 8 RB
7 63 2 AJ
8 59 59 AJ
9 30 29 RA
10 30 1 AJ
11 3 3 RS
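To see why a single pivot is enough, it may help to look at how the names_pattern regular expression splits the column names. The sketch below applies the same pattern with base R's sub() to a few of the names; value_part and index_part are names invented for this illustration.

```r
nms <- c("DL_TM1", "DL_CD1", "DL_TM2", "DL_CD2")

# First capture group (\\D+): the non-digit prefix, which ".value" turns into output columns
value_part <- sub("(\\D+)(\\d)", "\\1", nms)

# Second capture group (\\d): the trailing digit, stored in the helper column X
index_part <- sub("(\\D+)(\\d)", "\\2", nms)

data.frame(name = nms, .value = value_part, X = index_part)
```

Columns that share the same digit end up in the same output row, which is what keeps each DL_TM value paired with its DL_CD value.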
Base R
We can alternately do this using base R's reshape. First split the column names (except the last column) by the non-digit parts giving the varying list and then reshape df to long form using that and finally run na.omit to remove the rows with NAs.
nms1 <- head(names(df), -1)
varying <- split(nms1, gsub("\\d", "", nms1))
na.omit(reshape(df, dir = "long", varying = varying, v.names = names(varying)))
giving:
DEP_DELAY_TM time DL_CD DL_TM id
1.1 21 1 AK 16 1
2.1 22 1 RB 18 2
3.1 63 1 RA 53 3
4.1 59 1 AJ 59 4
5.1 30 1 RA 29 5
6.1 3 1 RS 3 6
1.2 21 2 CN 5 1
2.2 22 2 AJ 4 2
3.2 63 2 RB 8 3
5.2 30 2 AJ 1 5
3.3 63 3 AJ 2 3
We can extract the column groupings ("TM" and "CD" in this case), map over each column group to apply pivot_longer to that group, and then full_join the resulting list elements. Let me know if this covers your real-world use case.
library(dplyr)
library(tidyr)
library(purrr) # map() and reduce() come from purrr
suffixes = unique(gsub(".*_(.{2})[0-9]*", "\\1", names(df)))
df.long = suffixes %>%
map(~ df %>%
mutate(id=1:n()) %>% # Ensure unique identification of each original data row
select(id, DEP_DELAY_TM, starts_with(paste0("DL_",.x))) %>%
pivot_longer(cols=-c(DEP_DELAY_TM, id),
names_to=.x,
values_to=paste0(.x,"_value")) %>%
na.omit() %>%
select(-matches(paste0("^",.x,"$")))
) %>%
reduce(full_join) %>%
select(-id)
DEP_DELAY_TM TM_value CD_value
1 21 16 AK
2 21 16 CN
3 21 5 AK
4 21 5 CN
5 22 18 RB
6 22 18 AJ
7 22 4 RB
8 22 4 AJ
9 63 53 RA
10 63 53 RB
11 63 53 AJ
12 63 8 RA
13 63 8 RB
14 63 8 AJ
15 63 2 RA
16 63 2 RB
17 63 2 AJ
18 59 59 AJ
19 30 29 RA
20 30 29 AJ
21 30 1 RA
22 30 1 AJ
23 3 3 RS

How to join data frames in R without duplicating original data values

I have 2 data frames (DF1 & DF2) and I would like to join them by an identifier called "acc_num". In DF2, payment was made twice by acc_num A and thrice by B. The data frames are as follows.
DF1:
acc_num total_use sales
A 433 145
A NA 2
A NA 18
B 149 32
DF2:
acc payment
A 150
A 98
B 44
B 15
B 10
My desired output is:
acc_num total_use sales payment
A 433 145 150
A NA 2 98
A NA 18 NA
B 149 32 44
B NA NA 15
B NA NA 10
I've tried full_join and merge but the output was not as desired. I couldn't work this out as I'm still a beginner in R, and haven't found the solution to this.
Example of the code I used was
test_full_join <- DF1 %>% full_join(DF2, by = c("acc_num" = "acc"))
The displayed output was:
acc_num total_use sales payment
A 433 145 150
A 433 145 98
A NA 2 150
A NA 2 98
A NA 18 150
A NA 18 98
B 149 32 44
B 149 32 15
B 149 32 10
This is contrary to my desired output because, in the end, I want the total sums of total_use, sales and payment. This output would definitely give me a wrong interpretation in data visualization later on.
We may need to do a join by row_number() based on 'acc_num'
library(dplyr)
df1 %>%
group_by(acc_num) %>%
mutate(grpind = row_number()) %>%
full_join(df2 %>%
group_by(acc_num = acc) %>%
mutate(grpind = row_number())) %>%
select(acc_num, total_use, sales, payment)
# A tibble: 6 x 4
# Groups: acc_num [2]
# acc_num total_use sales payment
# <chr> <int> <int> <int>
#1 A 433 145 150
#2 A NA 2 98
#3 A NA 18 NA
#4 B 149 32 44
#5 B NA NA 15
#6 B NA NA 10
data
df1 <- structure(list(acc_num = c("A", "A", "A", "B"), total_use = c(433L,
NA, NA, 149L), sales = c(145L, 2L, 18L, 32L)), class = "data.frame",
row.names = c(NA,
-4L))
df2 <- structure(list(acc = c("A", "A", "B", "B", "B"), payment = c(150L,
98L, 44L, 15L, 10L)), class = "data.frame", row.names = c(NA,
-5L))
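The same idea, a within-group row counter used as a second join key, can be sketched in base R with ave() and merge(). The column name grpind mirrors the dplyr answer; the explicit order() call is added here so that the row order does not depend on merge()'s own sorting.

```r
df1 <- data.frame(acc_num = c("A", "A", "A", "B"),
                  total_use = c(433L, NA, NA, 149L),
                  sales = c(145L, 2L, 18L, 32L),
                  stringsAsFactors = FALSE)
df2 <- data.frame(acc = c("A", "A", "B", "B", "B"),
                  payment = c(150L, 98L, 44L, 15L, 10L),
                  stringsAsFactors = FALSE)

# Number the rows within each account: 1, 2, 3, ... per acc_num / acc
df1$grpind <- ave(seq_along(df1$acc_num), df1$acc_num, FUN = seq_along)
df2$grpind <- ave(seq_along(df2$acc), df2$acc, FUN = seq_along)

# Full outer join on account and within-account row number
out <- merge(df1, df2,
             by.x = c("acc_num", "grpind"),
             by.y = c("acc", "grpind"),
             all = TRUE)
out <- out[order(out$acc_num, out$grpind), ]
out$grpind <- NULL
out
```

Each payment now appears exactly once, so column sums such as sum(out$payment, na.rm = TRUE) are not inflated by the join.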

R split each row of a dataframe into two rows

I would like to split each row of a (numeric) data frame into two rows. For example, part of the original data frame looks like this (nrow(original data frame) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
After splitting each row, we get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
The "value_1" and "value_2" columns are split off, and their elements are placed in a new row. For example, value_1 = 22 and value_2 = 54 form a new row.
Here is one option with data.table. We convert the 'data.frame' to 'data.table' by creating a column of rownames (setDT(df1, keep.rownames = TRUE)). Subset the columns 1:5 and 1, 6, 7 in a list, rbind the list element with fill = TRUE option to return NA for corresponding columns that are not found in one of the datasets, order by the row number ('rn') and assign (:=) the row number column to 'NULL'.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
rbindlist(list(df1[, 1:5, with = FALSE], setnames(df1[, c(1, 6:7),
with = FALSE], 2:3, c("ID", "X"))), fill = TRUE)[order(as.integer(rn))][, rn:= NULL][]
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
A tidyverse version of the above logic would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
c("rowname", "ID", "X"))) %>%
arrange(as.integer(rowname)) %>% # sort numerically; as text, row name "10" would come before "2"
select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Replicate your matrix
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Loop through each input row j in turn.
# Put its first four elements into output row 2*j - 1,
# and the next two into output row 2*j
# with two NAs attached.
for(j in seq_len(nrow(input_df))){
i <- 2*j - 1 # output row that receives input row j
output_df[i,] <- input_df[j, c(1:4)]
output_df[i+1,] <- c(input_df[j, c(5:6)],NA,NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
X1 X2 X3 X4
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header=T)
data1 <- data[,1:4]
data2 <- setdiff(data,data1)
names(data2) <- names(data1)[1:ncol(data2)]
combined <- plyr::rbind.fill(data1,data2)
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"),]
Though why you would need to do this beats me.
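For a data frame with millions of rows, a vectorized base R variant may be preferable to a row-by-row loop. The sketch below stacks the two halves with rbind() and then interleaves them with a single order() call; the names top, bottom and out are illustrative only.

```r
df1 <- data.frame(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L),
                  Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L),
                  value_1 = c(22L, 52L, 2L, 23L),
                  value_2 = c(54L, 71L, 34L, 47L))

top <- df1[1:4]                               # ID, X, Y, Z
bottom <- setNames(df1[5:6], c("ID", "X"))    # value_1, value_2 renamed to ID, X
bottom$Y <- NA_integer_
bottom$Z <- NA_integer_

stacked <- rbind(top, bottom)

# order() is stable for ties, so the keys 1, 2, ..., n, 1, 2, ..., n
# yield the positions 1, n + 1, 2, n + 2, ...
n <- nrow(df1)
out <- stacked[order(rep(seq_len(n), 2)), ]
rownames(out) <- NULL
out
```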
