This question already has answers here:
filter for complete cases in data.frame using dplyr (case-wise deletion)
(7 answers)
Closed 1 year ago.
I would like to remove the NAs from my data set and then organise the remaining rows by ID.
My dataset is similar to this:
df <- read.table(text = "ID Name Surname Group A1 A2 A3 Goal Sea
21 Goal Robi A 4 4 4 G No
21 Goal Robi B NA NA NA NA NA
21 Goal Robi C NA NA NA NA NA
21 Goal Robi D 3 4 4 G No
33 Nami Si O NA NA NA NA NA
33 Nami Si P NA NA NA NA NA
33 Nami Si Q 3 4 4 G No
33 Nami Si Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No
98 Sara Bat NC 4 3 2 D No
98 Sara Bat MF NA NA NA NA NA
98 Sara Bat LC NA NA NA NA NA
66 Noor Shor MF NA NA NA NA NA
66 Noor Shor LC NA NA NA NA NA
66 Noor Shor MT1 4 4 4 G No
66 Noor Shor NC1 2 3 3 D No
", header=TRUE)
By removing the NA rows I should get a data frame free of NAs, with each ID's remaining rows combined on one line. So I would like to get this table:
ID Name Surname Group_1 A1 A2 A3 Goal_1 Sea_1 Group_2 A1_1 A2_2 A3_3 Goal_2 Sea_2
21 Goal Robi A 4 4 4 G No D 3 4 4 G No
33 Nami Si Q 3 4 4 G No Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No NC 4 3 2 D No
66 Noor Shor Mt1 4 4 4 G No NC1 2 3 3 D No
Is it possible to get this? It seems we could do it using pivot_longer, but I do not know how.
Search for complete.cases():
final = final[complete.cases(final), ]
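Applied to the example above (a minimal sketch, assuming the `df` built with `read.table()` in the question), `complete.cases()` returns one logical per row, TRUE where the row has no NA:

```r
# keep only the rows of df with no NA in any column
df_complete <- df[complete.cases(df), ]
```

This drops the all-NA rows; reshaping to one row per ID still needs a pivot, as in the answers below.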
A possible solution with the Tidyverse:
df <- structure(list(ID = c(21L, 21L, 21L, 21L, 33L, 33L, 33L, 33L,
98L, 98L, 98L, 98L, 66L, 66L, 66L, 66L), Name = c("Goal", "Goal",
"Goal", "Goal", "Nami", "Nami", "Nami", "Nami", "Sara", "Sara",
"Sara", "Sara", "Noor", "Noor", "Noor", "Noor"), Surname = c("Robi",
"Robi", "Robi", "Robi", "Si", "Si", "Si", "Si", "Bat", "Bat",
"Bat", "Bat", "Shor", "Shor", "Shor", "Shor"), Group = c("A",
"B", "C", "D", "O", "P", "Q", "Z", "MT", "NC", "MF", "LC", "MF",
"LC", "MT1", "NC1"), A1 = c(4L, NA, NA, 3L, NA, NA, 3L, 3L, 4L,
4L, NA, NA, NA, NA, 4L, 2L), A2 = c(4L, NA, NA, 4L, NA, NA, 4L,
3L, 4L, 3L, NA, NA, NA, NA, 4L, 3L), A3 = c(4L, NA, NA, 4L, NA,
NA, 4L, 3L, 4L, 2L, NA, NA, NA, NA, 4L, 3L), Goal = c("G", NA,
NA, "G", NA, NA, "G", "S", "S", "D", NA, NA, NA, NA, "G", "D"
), Sea = c("No", NA, NA, "No", NA, NA, "No", "No", "No", "No",
NA, NA, NA, NA, "No", "No")), class = "data.frame", row.names = c(NA,
-16L))
library(dplyr)
library(tidyr)

new_df <- df %>%
  drop_na() %>%
  group_by(ID) %>%
  mutate(n = row_number()) %>%
  pivot_wider(
    names_from = n,
    values_from = c(Group, A1, A2, A3, Goal, Sea)
  ) %>%
  relocate(ends_with("2"), .after = last_col())
print(new_df)
We can group_by the ID columns, filter out the rows that are all NA in the target columns, and keep the first remaining row per group:
df %>%
  group_by(ID, Name, Surname) %>%
  filter(!if_all(A1:Sea, is.na)) %>%
  slice_head(n = 1)
# A tibble: 4 × 9
# Groups: ID, Name, Surname [4]
ID Name Surname Group A1 A2 A3 Goal Sea
<int> <chr> <chr> <chr> <int> <int> <int> <chr> <chr>
1 21 Goal Robi A 4 4 4 G No
2 33 Nami Si Q 3 4 4 G No
3 66 Noor Shor MT1 4 4 4 G No
4 98 Sara Bat MT 4 4 4 S No
Related
I've been trying to summarize data by multiple groups, where the new column should be the ratio of one column's total to another's within these groups. Because the two columns never both contain a value in the same row, the ratio cannot be calculated per row.
Below is an example.
For each P_Common and Number_7 group, I'd like the total N_count / A_count.
structure(list(P_Common = c("B", "B", "C", "C", "D", "E", "E",
"F", "G", "G", "B", "G", "E", "D", "F", "C"), Number_7 = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L),
N_count = c(0L, 4L, 22L, NA, 7L, 0L, 44L, 16L, NA, NA, NA,
NA, NA, NA, NA, NA), A_count = c(NA, NA, NA, NA, NA, NA,
NA, NA, 0L, 4L, 7L, NA, 23L, 4L, 7L, 17L)), class = "data.frame", row.names = c(NA,
-16L))
P_Common Number_7 N_count A_count
B 1 0 NA
B 1 4 NA
C 1 22 NA
C 1 NA NA
D 2 7 NA
E 2 0 NA
E 2 44 NA
F 2 16 NA
B 1 NA 7
G 3 NA NA
E 1 NA 23
D 2 NA 4
F 1 NA 7
C 1 NA 17
In this example there would be quite a few 0 / NA values, but that's okay, they can stay in; overall it would become like
P_Common Number_7 Propo
B 1 0.571428571
C 1 1.294117647
D 2 1.75
... etc
You can do:
df %>%
  group_by(P_Common, Number_7) %>%
  summarise(Propo = sum(N_count, na.rm = TRUE) / sum(A_count, na.rm = TRUE))
P_Common Number_7 Propo
<chr> <int> <dbl>
1 B 1 0.571
2 C 1 1.29
3 D 2 1.75
4 E 1 0
5 E 2 Inf
6 F 1 0
7 F 2 Inf
8 G 3 0
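The Inf rows appear where A_count sums to 0. If NA is preferred for those groups, one option (a sketch on the same data) is to blank the zero denominator with `na_if()`:

```r
library(dplyr)

res <- df %>%
  group_by(P_Common, Number_7) %>%
  # na_if() turns a zero denominator into NA, so the division
  # yields NA instead of Inf for groups with no A_count values
  summarise(Propo = sum(N_count, na.rm = TRUE) /
              na_if(sum(A_count, na.rm = TRUE), 0L),
            .groups = "drop")
res
```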
I have a set of data roughly like this (more data in dput & desired results below):
id date u v
<chr> <date> <chr> <int>
1 a 2019-05-14 NA 0
2 a 2018-06-29 u 1
3 b 2020-12-02 u 1
4 b 2017-08-16 NA 1
5 b 2016-04-07 NA 0
6 c 2018-05-22 u 1
7 c 2018-05-22 u 1
8 e 2019-03-06 u 1
9 e 2019-03-06 NA 1
I am trying to create a new variable pr identifying, for each id, whether, when u == "u", there is an equal or earlier date where v == 1 within that id group (regardless of the value of u on that other row).
I know how generally to create a new variable based on in-group conditions:
library(dplyr)
x %>%
group_by(id) %>%
mutate(pr = case_when())
But I can't figure out how to compare the other dates within the group to the date corresponding to u, or how to identify the presence of v == 1 excluding the u row I am using as a reference. Note that a u row will always have v == 1.
Expected output is:
id date u v pr
<chr> <date> <chr> <int> <int>
1 a 2019-05-14 NA 0 NA
2 a 2018-06-29 u 1 0
3 b 2020-12-02 u 1 1
4 b 2017-08-16 NA 1 NA
5 b 2016-04-07 NA 0 NA
6 c 2018-05-22 u 1 1
7 c 2018-05-22 u 1 1
8 e 2019-03-06 u 1 1
9 e 2019-03-06 NA 1 NA
10 f 2020-10-20 u 1 0
11 f 2019-01-25 NA 0 NA
12 h 2020-02-24 NA 0 NA
13 h 2018-10-15 u 1 0
14 h 2018-03-07 NA 0 NA
15 i 2021-02-02 u 1 1
16 i 2020-11-19 NA 1 NA
17 i 2020-11-19 NA 1 NA
18 j 2019-02-11 u 1 1
19 j 2017-06-26 u 1 0
20 k 2018-12-13 u 1 0
21 k 2017-07-18 NA 0 NA
22 l 2018-05-08 u 1 1
23 l 2018-02-15 NA 0 NA
24 l 2018-02-15 u 1 0
25 l 2017-11-07 NA 0 NA
26 l 2015-09-10 NA 0 NA
The format of the variables isn't ideal; if there's any way for me to help clean it up let me know. Actual data is sensitive so I'm approximating.
> dput(x)
structure(list(id = c("a", "a", "b", "b", "b", "c", "c", "e",
"e", "f", "f", "h", "h", "h", "i", "i", "i", "j", "j", "k", "k",
"l", "l", "l", "l", "l"), date = structure(c(18030, 17711, 18598,
17394, 16898, 17673, 17673, 17961, 17961, 18555, 17921, 18316,
17819, 17597, 18660, 18585, 18585, 17938, 17343, 17878, 17365,
17659, 17577, 17577, 17477, 16688), class = "Date"), u = c(NA,
"u", "u", NA, NA, "u", "u", "u", NA, "u", NA, NA, "u", NA, "u",
NA, NA, "u", "u", "u", NA, "u", NA, "u", NA, NA), v = c(0L, 1L,
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L), pr = c(NA, 0L, 1L, NA, NA, 1L,
1L, 1L, NA, 0L, NA, NA, 0L, NA, 1L, NA, NA, 1L, 0L, 0L, NA, 1L,
NA, 0L, NA, NA)), row.names = c(NA, -26L), class = c("tbl_df",
"tbl", "data.frame"))
We may create a function
library(dplyr)
library(purrr)
f1 <- function(u, v, date) {
  # start with a vector of 0s, one element per row of the group
  tmp <- rep(0, length(u))
  # logical vectors: rows where u is "u" and rows where v is 1
  i1 <- u %in% "u"
  i2 <- v %in% 1
  # loop over the dates where v == 1: such a row scores 1 when all
  # the 'u' dates in the group fall on or before it and the group
  # has more than one v == 1 row
  tmp[i2] <- +(purrr::map_lgl(date[i2],
                              ~ all(.x >= date[i1])) & sum(i2) > 1)
  # finally, rows where u is NA get NA
  tmp[is.na(u)] <- NA
  tmp
}
and apply it after grouping
x1 <- x %>%
group_by(id) %>%
mutate(prnew = f1(u, v, date)) %>%
ungroup
> all.equal(x1$pr, x1$prnew)
[1] TRUE
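An alternative sketch without a helper function: for each row, test directly whether any *other* row of the same id group has v == 1 on an equal or earlier date (this matches the expected output, where a same-date v == 1 row counts):

```r
library(dplyr)

x2 <- x %>%
  group_by(id) %>%
  mutate(pr2 = ifelse(is.na(u), NA,
    # for row i, look only at the other rows of the group
    sapply(seq_along(date), function(i)
      as.integer(any(v[-i] == 1 & date[-i] <= date[i]))))) %>%
  ungroup()

all.equal(x2$pr, x2$pr2)
```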
I need to check whether rows are partially duplicated, and delete/overwrite any row whose 2 values match a different row that has 3 values present. One problem is that the "real" data frame contains a couple of list columns, which makes some operations unfeasible. In the best case, any row with a match would be checked independently of column numbers, meaning that among all rows with matching values, only the row with the most non-NA columns is kept.
o1 o2 o3
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 5 9 NA # this row has only 2 values, which match values from row 11, but the last value is NA
8 10 NA NA
9 12 NA NA
10 13 NA NA
11 5 9 14 # this row has values in all 3 columns
12 14 NA NA
13 8 11 15 # so does this row
14 16 NA NA
15 17 NA NA
16 18 NA NA
17 19 NA NA
18 20 NA NA
The result should be the same data frame - just without row 7 or where row 7 is overwritten by row 11.
This should be easy to do, but for some reason I didn't manage it (except with a convoluted for loop that would be hard to generalize should more columns be added later). Is there a straightforward way to do this?
dput of above df:
structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L,
5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L), o2 = c(NA, NA, NA, NA,
NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA), o3 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA,
NA, NA)), row.names = c(NA, -18L), class = "data.frame")
If there is already an answer for something like this, please let me know.
I thought of using dplyr:
library(dplyr)
df %>%
mutate(rn = row_number(),
count_na = rowSums(across(o1:o3, is.na))) %>%
group_by(o1, o2) %>%
slice_min(count_na) %>%
arrange(rn) %>%
ungroup() %>%
select(o1:o3)
This returns
# A tibble: 17 x 3
o1 o2 o3
<int> <int> <int>
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 10 NA NA
8 12 NA NA
9 13 NA NA
10 5 9 14
11 14 NA NA
12 8 11 15
13 16 NA NA
14 17 NA NA
15 18 NA NA
16 19 NA NA
17 20 NA NA
This solution is based on the following ideas:
For every row we count the number of NAs in this row.
We group for o1 and o2 to create groups of data that belong together. Here is a possible flaw: perhaps it is a better approach to group by o1 only or do some other grouping. This depends on the structure of your data: should 1, <NA>, <NA> be overwritten by 1, 2, <NA>?
After grouping, we select the row with the smallest number of NAs.
Finally we do some clean up: removing the auxiliary columns, arranging the data and ungrouping.
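For comparison, here is a sketch of the group-by-o1-only variant raised above; on this example data the result is identical, since only o1 == 5 repeats, but a 1, <NA>, <NA> row would now also be overwritten by 1, 2, <NA>:

```r
library(dplyr)

res_o1 <- df %>%
  mutate(rn = row_number(),
         count_na = rowSums(across(o1:o3, is.na))) %>%
  group_by(o1) %>%
  # within each o1 value, keep the single row with the fewest NAs
  slice_min(count_na, with_ties = FALSE) %>%
  arrange(rn) %>%
  ungroup() %>%
  select(o1:o3)
res_o1
```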
A partial solution to detect the duplicates; it remains to specify which rows to delete (I ran out of time). I've gone ahead and "duplicated" a couple more rows.
df <- read.table(text = "
o1 o2 o3
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 5 9 NA
8 10 NA NA
9 12 NA NA
10 13 NA NA
11 5 9 14
12 14 NA NA
13 8 11 15
14 16 NA NA
15 7 1 2
16 18 NA NA
17 7 1 3
18 20 NA NA", header = TRUE)
The main trick is to calculate a distance matrix and check which rows have a distance of zero, since dist automatically computes pairwise distances while ignoring missing values.
tmp <- as.matrix(dist(df))
diag(tmp) <- NA
tmp[lower.tri(tmp)] <- NA
tod <- data.frame(which(tmp == 0, arr.ind = TRUE))
resulting in
row col
X7 7 11
X6 6 15
X6.1 6 17
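To finish it off, one possible completion (hypothetical, continuing from the `tod` data frame above): for each zero-distance pair, drop whichever member has more NAs:

```r
# count NAs per row; for each matching pair mark the sparser row
na_count <- rowSums(is.na(df))
to_drop <- ifelse(na_count[tod$row] >= na_count[tod$col], tod$row, tod$col)
df_clean <- df[-unique(to_drop), ]
```

With the data above this removes rows 6 and 7, keeping the fuller rows 11, 15 and 17.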
Here's another way which considers all columns. It should work with any number of columns, regardless of their names or positions.
library(dplyr)
mydf <- structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L,
5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L),
o2 = c(NA, NA, NA, NA,
NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA),
o3 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA,
NA, NA)),
row.names = c(NA, -18L),
class = "data.frame")
columns <- names(mydf)
dummy_cols <- paste0(columns, "_dummy")
mydf %>%
  # duplicate the data frame alongside itself, with "_dummy" names
  cbind(mydf %>% `names<-`(dummy_cols)) %>%
  # arrange across all original columns
  arrange(across(all_of(columns))) %>%
  # fill NAs downwards in the dummy columns
  tidyr::fill(all_of(dummy_cols), .direction = "down") %>%
  # collapse the dummy columns into a single dummy ID
  tidyr::unite(id_dummy, all_of(dummy_cols), sep = "") %>%
  # group by the dummy ID and keep the first row of each group
  group_by(id_dummy) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(all_of(columns))
P.S. It also replaces 1 - NA - NA with 1 - 2 - NA, and 1 - NA - NA with 1 - NA - 3.
My data.frame looks like
ID Encounter Value1 Value2
1 A 1 NA
1 A 2 10
1 B NA 20
1 B 4 30
2 A 5 40
2 A 6 50
2 B NA NA
2 B 7 60
and I want it to look like
ID Encounter Value1 Value2
1 A 1 10
1 B 4 20
2 A 5 40
2 B 7 60
We can use dplyr. Grouped by 'ID' and 'Encounter', get the first value that is not NA (!is.na(.)) in each of the remaining columns. If, by any chance, all the values are NA, then return NA:
library(dplyr)
df1 %>%
group_by(ID, Encounter) %>%
summarise_at(vars(-group_cols()), ~ if(all(is.na(.))) NA_integer_
else first(.[!is.na(.)]))
# A tibble: 4 x 4
# Groups: ID [2]
# ID Encounter Value1 Value2
# <int> <chr> <int> <int>
#1 1 A 1 10
#2 1 B 4 20
#3 2 A 5 40
#4 2 B 7 60
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
Encounter = c("A",
"A", "B", "B", "A", "A", "B", "B"), Value1 = c(1L, 2L, NA, 4L,
5L, 6L, NA, 7L), Value2 = c(NA, 10L, 20L, 30L, 40L, 50L, NA,
60L)), class = "data.frame", row.names = c(NA, -8L))
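As a side note, `summarise_at()` is superseded in dplyr 1.0+; an equivalent sketch with `across()` relies on `first()` of an empty vector falling back to NA, which covers the all-NA case:

```r
library(dplyr)

res <- df1 %>%
  group_by(ID, Encounter) %>%
  # na.omit() strips the NAs; first() of the resulting empty
  # vector returns NA when a group has no non-NA value
  summarise(across(everything(), ~ first(na.omit(.x))),
            .groups = "drop")
res
```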
I would like to copy the last two columns from each month to the beginning of the next month. I did it as follows (below), but the data contain NAs, and when I convert a column through character, the program breaks down. How do I copy the columns so that they keep their types?
My code:
library(readxl)
library(tibble)
df<- read_excel("C:/Users/Rezerwa/Documents/Database.xlsx")
df=add_column(df, Feb1 = as.character(do.call(paste0, df["January...4"])), .after = "January...5")
df=add_column(df, Feb2 = as.numeric(do.call(paste0, df["January...5"])), .after = "Feb1")
My data:
df
# A tibble: 10 x 13
Product January...2 January...3 January...4 January...5 February...6 February...7 February...8 February...9 March...10 March...11 March...12 March...13
<chr> <lgl> <lgl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 a NA NA 754.00 4 754.00 4 754.00 4 754.00 4 754.00 4
2 b NA NA 706.00 3 706.00 3 706.00 3 706.00 3 706.00 3
3 c NA NA 517.00 3 517.00 3 517.00 3 517.00 3 517.00 3
4 d NA NA 1466.00 9 1466.00 9 1466.00 9 1466.00 9 1466.00 9
5 e NA NA 543.00 8 543.00 8 543.00 8 543.00 8 543.00 8
6 f NA NA NA NA NA NA NA NA NA NA NA NA
7 g NA NA NA NA NA NA NA NA NA NA NA NA
8 h NA NA NA NA NA NA NA NA NA NA NA NA
9 i NA NA 1466.00 8 NA NA NA NA NA NA NA NA
10 j NA NA NA NA 543.00 3 NA NA NA NA NA NA
My error:
> df=add_column(df, Feb1 = as.character(do.call(paste0, df["January...4"])), .after = "January...5")
> df=add_column(df, Feb2 = as.numeric(do.call(paste0, df["January...5"])), .after = "Feb1")
Warning message:
In eval_tidy(xs[[i]], unique_output) : NAs introduced by coercion
Using base R, we can split the columns based on the prefix of their names, select the last two columns from each group, and cbind them to the original df.
df1 <- cbind(df, do.call(cbind, lapply(
  split.default(df[-1], sub("\\..*", "", names(df)[-1])),
  function(x) {
    n <- ncol(x)
    x[, c(n - 1, n)]
  })))
To get data in order, we can do
cbind(df1[1], df1[-1][order(match(sub("\\..*", "", names(df1)[-1]), month.name))])
data
df <- structure(list(Product = structure(1:10, .Label = c("a", "b",
"c", "d", "e", "f", "g", "h", "i", "j"), class = "factor"), January...2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), January...3 = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), January...4 = c(754, 706, 517,
1466, 543, NA, NA, NA, 1466, NA), January...5 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, 8L, NA), February...6 = c(754, 706, 517,
1466, 543, NA, NA, NA, NA, 543), February...7 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, NA, 3L), February...8 = c(754, 706, 517,
1466, 543, NA, NA, NA, NA, NA), February...9 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, NA, NA), March...10 = c(754, 706, 517, 1466,
543, NA, NA, NA, NA, NA), March...11 = c(4L, 3L, 3L, 9L, 8L,
NA, NA, NA, NA, NA), March...12 = c(754, 706, 517, 1466, 543,
NA, NA, NA, NA, NA), March...13 = c(4L, 3L, 3L, 9L, 8L, NA, NA,
NA, NA, NA)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))