Using tidyr >= 1.0.0, one can use tidy selection in the cols argument as follows:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("DL_TM"),
               names_to = "TM", values_to = "DM_TM") %>%
  pivot_longer(cols = starts_with("DL_CD"),
               names_to = "CD", values_to = "DL_CD") %>%
  na.omit() %>%
  select(-TM, -CD)
However, the above quickly gets cumbersome and repetitive with many columns. How can one reduce this to a single pivot? I have imagined something conceptual like
pivot_longer(cols = starts_with("DL_TM | DL_CD"), ...) which will obviously not work, because starts_with() matches only a single literal prefix (as far as I know).
Data
df <- structure(list(DL_TM1 = c(16L, 18L, 53L, 59L, 29L, 3L), DL_CD1 = c("AK",
"RB", "RA", "AJ", "RA", "RS"), DL_TM2 = c(5L, 4L, 8L, NA, 1L,
NA), DL_CD2 = c("CN", "AJ", "RB", NA, "AJ", NA), DL_TM3 = c(NA,
NA, 2L, NA, NA, NA), DL_CD3 = c(NA, NA, "AJ", NA, NA, NA), DL_TM4 = c(NA,
NA, NA, NA, NA, NA), DL_CD4 = c(NA, NA, NA, NA, NA, NA), DL_TM5 = c(NA,
NA, NA, NA, NA, NA), DL_CD5 = c(NA, NA, NA, NA, NA, NA), DEP_DELAY_TM = c(21L,
22L, 63L, 59L, 30L, 3L)), class = "data.frame", row.names = c(NA,
-6L))
Expected Output:
Same as the above but with single pivoting.
Based on the response to the comments on the question (from which this was moved), the code in the question does not actually produce the desired result; what was wanted is the result that this produces:
df %>%
  pivot_longer(-DEP_DELAY_TM, names_to = c(".value", "X"),
               names_pattern = "(\\D+)(\\d)") %>%
  select(-X) %>%
  drop_na()
giving:
# A tibble: 11 x 3
DEP_DELAY_TM DL_TM DL_CD
<int> <int> <chr>
1 21 16 AK
2 21 5 CN
3 22 18 RB
4 22 4 AJ
5 63 53 RA
6 63 8 RB
7 63 2 AJ
8 59 59 AJ
9 30 29 RA
10 30 1 AJ
11 3 3 RS
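The workhorse here is the special ".value" sentinel in names_to: the part of each name captured by the first group in names_pattern becomes a separate output column, while the second group becomes an ordinary key column. A minimal sketch on hypothetical toy data (column names a1/b1/a2/b2 are made up for illustration):

```r
library(tidyr)

# Hypothetical toy data: two families of columns sharing numeric suffixes
toy <- data.frame(a1 = 1:2, b1 = c("x", "y"), a2 = 3:4, b2 = c("z", "w"))

long <- pivot_longer(toy, everything(),
                     names_to = c(".value", "set"),
                     names_pattern = "(\\D+)(\\d)")
# Each input row now contributes one output row per suffix, with the
# "a" and "b" stems kept as separate columns
long
```
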
Base R
We can alternatively do this using base R's reshape. First split the column names (except the last column) by their non-digit parts, giving the varying list; then reshape df to long form using it, and finally run na.omit to remove the rows with NAs.
nms1 <- head(names(df), -1)
varying <- split(nms1, gsub("\\d", "", nms1))
na.omit(reshape(df, dir = "long", varying = varying, v.names = names(varying)))
giving:
DEP_DELAY_TM time DL_CD DL_TM id
1.1 21 1 AK 16 1
2.1 22 1 RB 18 2
3.1 63 1 RA 53 3
4.1 59 1 AJ 59 4
5.1 30 1 RA 29 5
6.1 3 1 RS 3 6
1.2 21 2 CN 5 1
2.2 22 2 AJ 4 2
3.2 63 2 RB 8 3
5.2 30 2 AJ 1 5
3.3 63 3 AJ 2 3
We can extract the column groupings ("TM" and "CD" in this case), map over each column group to apply pivot_longer to that group, and then full_join the resulting list elements. Let me know if this covers your real-world use case.
library(purrr)  # for map() and reduce()

suffixes <- unique(gsub(".*_(.{2})[0-9]*", "\\1", names(df)))

df.long <- suffixes %>%
  map(~ df %>%
        mutate(id = 1:n()) %>%  # ensure unique identification of each original data row
        select(id, DEP_DELAY_TM, starts_with(paste0("DL_", .x))) %>%
        pivot_longer(cols = -c(DEP_DELAY_TM, id),
                     names_to = .x,
                     values_to = paste0(.x, "_value")) %>%
        na.omit() %>%
        select(-matches(paste0("^", .x, "$")))) %>%
  reduce(full_join) %>%
  select(-id)

df.long
DEP_DELAY_TM TM_value CD_value
1 21 16 AK
2 21 16 CN
3 21 5 AK
4 21 5 CN
5 22 18 RB
6 22 18 AJ
7 22 4 RB
8 22 4 AJ
9 63 53 RA
10 63 53 RB
11 63 53 AJ
12 63 8 RA
13 63 8 RB
14 63 8 AJ
15 63 2 RA
16 63 2 RB
17 63 2 AJ
18 59 59 AJ
19 30 29 RA
20 30 29 AJ
21 30 1 RA
22 30 1 AJ
23 3 3 RS
I have a similar question to this post: Fill in values between start and end value in R
The difference is that I need to fill in values between the start and end of multiple values, and it doesn't contain an ID column.
My data look like this (the original data have many different values):
My final result should look like this:
Data :
df <- structure(list(
  elevation = c(150L, 140L, 130L, 120L, 110L, 120L, 130L, 140L, 150L, 90L, 80L,
                70L, 66L, 60L, 50L, 66L, 70L, 72L, 68L, 65L, 60L, 68L, 70L),
  code = c(NA, NA, "W", NA, NA, NA, "W", NA, NA, NA, NA, NA, "X", NA, NA,
           "X", NA, NA, "Y", NA, NA, "Y", NA)
), class = "data.frame", row.names = c(NA, -23L))
Thanks in advance
runner::fill_run() fills runs of NA from the preceding non-NA value; with only_within = TRUE it only fills gaps that are flanked by identical values:
library(dplyr)
df %>%
  mutate(code = runner::fill_run(code, only_within = TRUE))
elevation code
1 150 <NA>
2 140 <NA>
3 130 W
4 120 W
5 110 W
6 120 W
7 130 W
8 140 <NA>
9 150 <NA>
10 90 <NA>
11 80 <NA>
12 70 <NA>
13 66 X
14 60 X
15 50 X
16 66 X
17 70 <NA>
18 72 <NA>
19 68 Y
20 65 Y
21 60 Y
22 68 Y
23 70 <NA>
This may not be pretty but it works:
codepos <- which(!is.na(df$code))
stopifnot(length(codepos) %% 2 == 0)
for (group in split(codepos, (seq_along(codepos) + 1) %/% 2)) {
  stopifnot(df$code[group[1]] == df$code[group[2]])
  df$code[group[1]:group[2]] <- df$code[group[1]]
}
We start by finding the positions of all non-NA codes. Assuming they always come in matching pairs, we fill in the range between each pair.
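The integer-division trick that pairs up consecutive positions can be checked in isolation (the positions below are hypothetical):

```r
codepos <- c(3, 7, 13, 16)                      # pretend positions of non-NA codes
split(codepos, (seq_along(codepos) + 1) %/% 2)  # consecutive entries pair up
# $`1`
# [1]  3  7
#
# $`2`
# [1] 13 16
```
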
Here's a tidyverse approach. It generates a temporary grouping by assigning values to the pattern given through the alternating NAs and characters.
library(dplyr)
library(tidyr)
df %>%
  mutate(n = n(), l_c = lag(code)) %>%
  group_by(grp = cumsum(lag(!is.na(code), default = FALSE) == is.na(code)),
           grp_in = grp %in% seq(2, unique(n), 4)) %>%
  fill(l_c) %>%
  ungroup() %>%
  mutate(code = ifelse(grp_in, l_c, code)) %>%
  select(elevation, code) %>%
  print(n = Inf)
# A tibble: 23 × 2
elevation code
<int> <chr>
1 150 NA
2 140 NA
3 130 W
4 120 W
5 110 W
6 120 W
7 130 W
8 140 NA
9 150 NA
10 90 NA
11 80 NA
12 70 NA
13 66 X
14 60 X
15 50 X
16 66 X
17 70 NA
18 72 NA
19 68 Y
20 65 Y
21 60 Y
22 68 Y
23 70 NA
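The grouping index is the subtle part: a new group starts at every boundary where NA-ness flips relative to the previous element. A quick trace on a simplified, hypothetical vector:

```r
library(dplyr)

code <- c(NA, "W", NA, "W", NA, "X", NA, "X")
grp  <- cumsum(lag(!is.na(code), default = FALSE) == is.na(code))
grp
# [1] 0 1 2 3 4 5 6 7
```

Here every other group starting from 2 (i.e. groups 2, 6, ...) lies between a matching pair of values, which is exactly what the seq(2, n, 4) filter picks out.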
I have a data frame (DF) with 4 columns. How do I remove the whole row when column 4 is either 0 or NA? In the example below, only row 1 would be left.
Column 1 Column 2 Column 3 Column 4
11 24 234 2123
45 63 22 0
234 234 123 NA
Using dplyr:
library(dplyr)
df %>% filter(!is.na(Column.4) & Column.4 != 0)
You can use logical vectors to subset your data:
df[!is.na(df[,4]) & (df[,4]!=0), ]
Example:
df <- data.frame(x = rnorm(30), y = rnorm(30), z = rnorm(30), a = rep(c(0, 1, NA), 10))
df[!is.na(df[,4]) & (df[,4] != 0), ]
x y z a
2 -0.21772820 -0.5337648 -1.07579623 1
5 0.64536474 0.2011776 -0.12981424 1
8 2.36411372 0.0343823 2.03561701 1
11 1.09103526 -1.9287689 0.59511269 1
14 0.32482389 -0.5562136 -0.38943092 1
17 0.63621067 -1.6517097 -0.09804529 1
20 2.61892085 1.5575784 -0.50803567 1
23 0.07854647 1.1861483 -0.49798074 1
26 0.19561725 1.1036331 -0.66349688 1
29 0.22470875 -0.4192745 0.09153176 1
You can use sapply to loop through each row and keep the rows that satisfy the conditions (note that this version checks every column, not just the 4th, which coincides with the desired result for this data):
df[sapply(1:nrow(df), function(i) all(!is.na(df[i,])) & all(df[i,] != 0)), ]
Data:
structure(list(Column.1 = c(11L, 45L, 234L), Column.2 = c(24L,
63L, 234L), Column.3 = c(234L, 22L, 123L), Column.4 = c(2123L,
0L, NA)), class = "data.frame", row.names = c(NA, -3L)) -> df
Output:
# Column.1 Column.2 Column.3 Column.4
# 1 11 24 234 2123
I have a df in R as follows:
ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20
Now I want to remove only the rows where both Score1 and Score2 are missing (i.e. the 3rd row).
You can filter it like this:
df <- read.table(head=T, text="ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20")
df[!(is.na(df$Score1) & is.na(df$Score2)), ]
# ID Age Score1 Score2
# 1 2 22 12 NA
# 2 3 19 11 22
# 4 1 21 NA 20
I.e. take rows where there's not (!) Score1 missing and (&) Score2 missing.
Here are two versions with dplyr which can be extended to many columns with the prefix "Score".
Using filter_at
library(dplyr)
df %>% filter_at(vars(starts_with("Score")), any_vars(!is.na(.)))
# ID Age Score1 Score2
#1 2 22 12 NA
#2 3 19 11 22
#3 1 21 NA 20
and filter_if
df %>% filter_if(startsWith(names(.),"Score"), any_vars(!is.na(.)))
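Note that the scoped verbs filter_at() and filter_if() have since been superseded in dplyr; in current versions the same filter can be written with if_any() (a sketch, assuming dplyr >= 1.0.4):

```r
library(dplyr)

df <- data.frame(ID = c(2, 3, 4, 1), Age = c(22, 19, 20, 21),
                 Score1 = c(12, 11, NA, NA), Score2 = c(NA, 22, NA, 20))

# keep rows where at least one Score column is non-NA
df %>% filter(if_any(starts_with("Score"), ~ !is.na(.x)))
```
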
A base R version with apply
df[apply(!is.na(df[startsWith(names(df),"Score")]), 1, any), ]
One option is rowSums
df1[ rowSums(is.na(df1[grep("Score", names(df1))])) < 2,]
Or another option with base R
df1[!Reduce(`&`, lapply(df1[grep("Score", names(df1))], is.na)),]
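To unpack the Reduce() call: it AND-combines the per-column is.na() vectors element-wise, so the result is TRUE exactly where every Score column is NA. A standalone sketch with made-up logical vectors:

```r
x <- list(Score1 = c(FALSE, FALSE, TRUE, TRUE),   # pretend is.na(Score1)
          Score2 = c(TRUE, FALSE, TRUE, FALSE))   # pretend is.na(Score2)
Reduce(`&`, x)
# [1] FALSE FALSE  TRUE FALSE
```
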
data
df1 <- structure(list(ID = c(2L, 3L, 4L, 1L), Age = c(22L, 19L, 20L,
21L), Score1 = c(12L, 11L, NA, NA), Score2 = c(NA, 22L, NA, 20L
)), class = "data.frame", row.names = c(NA, -4L))
I have data in this shape:
Animal Date.1 Weight.1 Date.2 Weight.2 Date.3 Weight.3 Date.4 Weight.4
1 12/18/19 55 1/2/20 67 6/6/20 101
2 12/18/19 64 1/3/20 69 2/4/20 80
3 12/18/19 75
4 1/3/20 85
5 12/18/19 88 1/6/20 86 2/7/20 96 6/6/20 100
And I would like to select the last weight after weight.1, like this:
Animal Date.last Last Weight
1 6/6/20 101
2 2/4/20 80
3 NA NA
4 1/3/20 85
5 6/6/20 100
Sorry, I didn't show any scripts but I didn't even know where to start.
Here is an option after reshaping to 'long' format
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = -Animal, names_to = c(".value", "group"),
names_sep="[.]", values_drop_na = TRUE) %>%
group_by(Animal) %>%
slice(n()) %>%
ungroup %>%
mutate_at(vars(Date, Weight), ~ replace(., group == 1, NA)) %>%
select(-group) %>%
rename_at(2:3, ~ str_c(., 'last'))
# A tibble: 5 x 3
# Animal Datelast Weightlast
# <int> <chr> <int>
#1 1 6/6/20 101
#2 2 2/4/20 80
#3 3 <NA> NA
#4 4 1/3/20 85
#5 5 6/6/20 100
data
df1 <- structure(list(Animal = 1:5, Date.1 = c("12/18/19", "12/18/19",
"12/18/19", NA, "12/18/19"), Weight.1 = c(55L, 64L, 75L, NA,
88L), Date.2 = c("1/2/20", "1/3/20", NA, "1/3/20", "1/6/20"),
Weight.2 = c(67L, 69L, NA, 85L, 86L), Date.3 = c(NA, "2/4/20",
NA, NA, "2/7/20"), Weight.3 = c(NA, 80L, NA, NA, 96L),
Date.4 = c("6/6/20",
NA, NA, NA, "6/6/20"), Weight.4 = c(101L, NA, NA, NA, 100L
)), class = "data.frame", row.names = c(NA, -5L))
In base R, you could use the aggregate + reshape functions:
df1 <- reshape(`is.na<-`(df, df == ""), 2:ncol(df), idvar = "Animal", dir = "long")
aggregate(cbind(Date, Weight) ~ Animal, df1,
          function(x) if (is.na(x[2])) NA else as.character(x[max(which(!is.na(x)))]),
          na.action = identity)
Animal Date Weight
1 1 6/6/20 101
2 2 2/4/20 80
3 3 <NA> <NA>
4 4 1/3/20 85
5 5 6/6/20 100
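The `is.na<-`(df, df == "") call may look odd: it is the replacement function behind is.na() invoked directly, equivalent to copying df and then running is.na(copy) <- (df == ""), i.e. it returns a copy with empty strings turned into NA. A tiny sketch:

```r
x <- c("a", "", "b")
`is.na<-`(x, x == "")   # returns a copy with "" replaced by NA
# [1] "a" NA  "b"
```
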
Suppose I start with a data frame:
ID Measurement1 Measurement2
1 45 104
2 34 87
3 23 99
4 56 67
...
Then I have a second data frame which is meant to be used to update records in the first:
ID Measurement1 Measurement2
2 10 11
4 21 22
How do I use R to end up with:
ID Measurement1 Measurement2
1 45 104
2 10 11
3 23 99
4 21 22
...
The data frames in reality are very large datasets.
We can use match to get the row index. Using that index to subset the rows, we replace the 2nd and 3rd columns of the first dataset with the corresponding columns of second dataset.
ind <- match(df2$ID, df1$ID)
df1[ind, 2:3] <- df2[2:3]
df1
# ID Measurement1 Measurement2
#1 1 45 104
#2 2 10 11
#3 3 23 99
#4 4 21 22
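To make the indexing concrete, here is the match() step run on the sample data, with its intermediate result shown:

```r
df1 <- data.frame(ID = 1:4, Measurement1 = c(45L, 34L, 23L, 56L),
                  Measurement2 = c(104L, 87L, 99L, 67L))
df2 <- data.frame(ID = c(2L, 4L), Measurement1 = c(10L, 21L),
                  Measurement2 = c(11L, 22L))

ind <- match(df2$ID, df1$ID)  # positions in df1 of each ID present in df2
ind
# [1] 2 4

df1[ind, 2:3] <- df2[2:3]     # overwrite those rows' measurement columns
df1$Measurement1
# [1] 45 10 23 21
```
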
Or we can use data.table to join the dataset on the 'ID' column (after converting the first dataset to 'data.table' i.e. setDT(df1)), and assign the 'Cols' with the 'iCols' from the second dataset.
library(data.table)#v1.9.6+
Cols <- names(df1)[-1]
iCols <- paste0('i.', Cols)
setDT(df1)[df2, (Cols) := mget(iCols), on= 'ID'][]
# ID Measurement1 Measurement2
#1: 1 45 104
#2: 2 10 11
#3: 3 23 99
#4: 4 21 22
data
df1 <- structure(list(ID = 1:4, Measurement1 = c(45L, 34L, 23L, 56L),
Measurement2 = c(104L, 87L, 99L, 67L)), .Names = c("ID",
"Measurement1", "Measurement2"), class = "data.frame",
row.names = c(NA, -4L))
df2 <- structure(list(ID = c(2L, 4L), Measurement1 = c(10L, 21L),
Measurement2 = c(11L,
22L)), .Names = c("ID", "Measurement1", "Measurement2"),
class = "data.frame", row.names = c(NA, -2L))
Another dplyr option is to drop the rows being replaced with anti_join and then append the updated rows:
library(dplyr)
df1 %>%
anti_join(df2, by = "ID") %>%
bind_rows(df2) %>%
arrange(ID)
dplyr 1.0.0 introduced a family of SQL-inspired functions for modifying rows. In this case you can now use rows_update():
library(dplyr)
df1 %>%
rows_update(df2, by = "ID")
ID Measurement1 Measurement2
1 1 45 104
2 2 10 11
3 3 23 99
4 4 21 22
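The same family also covers the case where some IDs in the update table are new: rows_update() errors on unmatched keys by default, while rows_upsert() updates the matches and appends the rest (a sketch, assuming dplyr >= 1.0.0; the data here is made up):

```r
library(dplyr)

df1 <- data.frame(ID = 1:3, val = c(10, 20, 30))
df2 <- data.frame(ID = c(2L, 4L), val = c(99, 40))

# rows_update(df1, df2, by = "ID") would error here: ID 4 is absent from df1.
# rows_upsert() updates ID 2 in place and appends ID 4:
rows_upsert(df1, df2, by = "ID")
```
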