I have a set of data roughly like this (more data in dput & desired results below):
id date u v
<chr> <date> <chr> <int>
1 a 2019-05-14 NA 0
2 a 2018-06-29 u 1
3 b 2020-12-02 u 1
4 b 2017-08-16 NA 1
5 b 2016-04-07 NA 0
6 c 2018-05-22 u 1
7 c 2018-05-22 u 1
8 e 2019-03-06 u 1
9 e 2019-03-06 NA 1
I am trying to create a new variable pr that identifies, for each row where u == "u", whether there is another row with an equal or earlier date where v == 1 within that id group (regardless of that row's u value).
I know how generally to create a new variable based on in-group conditions:
library(dplyr)
x %>%
  group_by(id) %>%
  mutate(pr = case_when())
But I can't figure out how to compare the other dates within the group to the date of the u row, or how to detect v == 1 while excluding the u row I'm using as a reference. Note that rows where u == "u" always have v == 1.
Expected output is:
id date u v pr
<chr> <date> <chr> <int> <int>
1 a 2019-05-14 NA 0 NA
2 a 2018-06-29 u 1 0
3 b 2020-12-02 u 1 1
4 b 2017-08-16 NA 1 NA
5 b 2016-04-07 NA 0 NA
6 c 2018-05-22 u 1 1
7 c 2018-05-22 u 1 1
8 e 2019-03-06 u 1 1
9 e 2019-03-06 NA 1 NA
10 f 2020-10-20 u 1 0
11 f 2019-01-25 NA 0 NA
12 h 2020-02-24 NA 0 NA
13 h 2018-10-15 u 1 0
14 h 2018-03-07 NA 0 NA
15 i 2021-02-02 u 1 1
16 i 2020-11-19 NA 1 NA
17 i 2020-11-19 NA 1 NA
18 j 2019-02-11 u 1 1
19 j 2017-06-26 u 1 0
20 k 2018-12-13 u 1 0
21 k 2017-07-18 NA 0 NA
22 l 2018-05-08 u 1 1
23 l 2018-02-15 NA 0 NA
24 l 2018-02-15 u 1 0
25 l 2017-11-07 NA 0 NA
26 l 2015-09-10 NA 0 NA
The format of the variables isn't ideal; if there's anything I can do to clean it up, let me know. The actual data is sensitive, so I'm approximating.
> dput(x)
structure(list(id = c("a", "a", "b", "b", "b", "c", "c", "e",
"e", "f", "f", "h", "h", "h", "i", "i", "i", "j", "j", "k", "k",
"l", "l", "l", "l", "l"), date = structure(c(18030, 17711, 18598,
17394, 16898, 17673, 17673, 17961, 17961, 18555, 17921, 18316,
17819, 17597, 18660, 18585, 18585, 17938, 17343, 17878, 17365,
17659, 17577, 17577, 17477, 16688), class = "Date"), u = c(NA,
"u", "u", NA, NA, "u", "u", "u", NA, "u", NA, NA, "u", NA, "u",
NA, NA, "u", "u", "u", NA, "u", NA, "u", NA, NA), v = c(0L, 1L,
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L), pr = c(NA, 0L, 1L, NA, NA, 1L,
1L, 1L, NA, 0L, NA, NA, 0L, NA, 1L, NA, NA, 1L, 0L, 0L, NA, 1L,
NA, 0L, NA, NA)), row.names = c(NA, -26L), class = c("tbl_df",
"tbl", "data.frame"))
We may create a function
library(dplyr)
library(purrr)
f1 <- function(u, v, date) {
  # start with a vector of zeros, one element per row of the group
  # (n() works here because f1() is evaluated inside mutate())
  tmp <- rep(0, n())
  # flag the rows where u is "u" and the rows where v is 1
  i1 <- u %in% "u"
  i2 <- v %in% 1
  # for each date where v == 1, check that it is greater than or equal to
  # every date where u == "u", and that the group has more than one v == 1 row;
  # write the result (as 0/1) into the v == 1 positions
  tmp[i2] <- +(purrr::map_lgl(date[i2],
                              ~ all(.x >= date[i1])) & sum(i2) > 1)
  # rows where u is NA get NA
  tmp[is.na(u)] <- NA
  tmp
}
and apply it after grouping
x1 <- x %>%
  group_by(id) %>%
  mutate(prnew = f1(u, v, date)) %>%
  ungroup()
> all.equal(x1$pr, x1$prnew)
[1] TRUE
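A more literal translation of the rule in the question is also possible: for each row where u == "u", count the rows in the same id group that have v == 1 and an equal or earlier date, and flag the row when anything other than the reference row qualifies. This is just a sketch (prnew2 is an illustrative name, and it hasn't been tested beyond the sample above):
library(dplyr)
library(purrr)

x %>%
  group_by(id) %>%
  # for each row flagged "u": is there any other row in the group with v == 1
  # and an equal or earlier date? (prnew2 is just an illustrative column name)
  mutate(prnew2 = if_else(u %in% "u",
                          +map_lgl(date, ~ sum(v == 1 & date <= .x) > 1),
                          NA_integer_)) %>%
  ungroup()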
Related
I've been trying to summarize data by multiple groups, where the new column should give the ratio of one column to another within each group. Because the two columns never both contain a value in the same row, the ratio cannot be calculated per row.
Below is an example.
By P_Common and Number_7 group, I'd like the total N_count / A_count.
structure(list(P_Common = c("B", "B", "C", "C", "D", "E", "E",
"F", "G", "G", "B", "G", "E", "D", "F", "C"), Number_7 = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L),
N_count = c(0L, 4L, 22L, NA, 7L, 0L, 44L, 16L, NA, NA, NA,
NA, NA, NA, NA, NA), A_count = c(NA, NA, NA, NA, NA, NA,
NA, NA, 0L, 4L, 7L, NA, 23L, 4L, 7L, 17L)), class = "data.frame", row.names = c(NA,
-16L))
P_Common Number_7 N_count A_count
B 1 0 NA
B 1 4 NA
C 1 22 NA
C 1 NA NA
D 2 7 NA
E 2 0 NA
E 2 44 NA
F 2 16 NA
G 3 NA 0
G 3 NA 4
B 1 NA 7
G 3 NA NA
E 1 NA 23
D 2 NA 4
F 1 NA 7
C 1 NA 17
In this example there would be quite a few 0 / NA values, but that's okay; they can stay in. Overall it would become something like:
P_Common Number_7 Propo
B 1 0.571428571
C 1 1.294117647
D 2 1.75
... etc
You can do:
library(dplyr)

df %>%
  group_by(P_Common, Number_7) %>%
  summarise(Propo = sum(N_count, na.rm = TRUE) / sum(A_count, na.rm = TRUE))
P_Common Number_7 Propo
<chr> <int> <dbl>
1 B 1 0.571
2 C 1 1.29
3 D 2 1.75
4 E 1 0
5 E 2 Inf
6 F 1 0
7 F 2 Inf
8 G 3 0
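Groups where one of the two sums is zero show up as 0 or Inf (E 2 and F 2 above). If you would rather have NA than Inf when a group's A_count sum is zero, one possible tweak (a sketch) is to blank out the zero denominator with na_if():
df %>%
  group_by(P_Common, Number_7) %>%
  summarise(
    # na_if() turns a zero A_count sum into NA so the division gives NA, not Inf
    Propo = sum(N_count, na.rm = TRUE) / na_if(sum(A_count, na.rm = TRUE), 0),
    .groups = "drop"
  )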
Let's say I have this data frame. How would I go about removing only the NA values associated with name a, without removing them manually?
name ID1 ID2
a 1 4
a 7 3
a NA 4
a 6 3
a NA 4
a NA 3
a 2 4
a NA 3
a 1 4
b NA 2
c 3 NA
I've tried using !is.na(), but that removes all of the NA values in the ID1 column for every name. How can I specifically target the ones associated with name a?
You could subset your data frame as follows:
df_new <- df[!(df$name == "a" & is.na(df$ID1)), ]
This can also be written as:
df_new <- df[df$name != "a" | !is.na(df$ID1), ]
With dplyr:
library(dplyr)
df %>%
  filter(!(name == "a" & is.na(ID1)))
Or with subset:
subset(df, !(name == "a" & is.na(ID1)))
Output
name ID1 ID2
1 a 1 4
2 a 7 3
3 a 6 3
4 a 2 4
5 a 1 4
6 b NA 2
7 c 3 NA
Data
df <- structure(list(name = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "c"), ID1 = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA,
3L), ID2 = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)), class = "data.frame", row.names = c(NA,
-11L))
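If the same rule ever needs to cover more than one name, %in% generalizes the condition; a quick sketch, where "b" is just an illustrative second name:
df %>%
  filter(!(name %in% c("a", "b") & is.na(ID1)))  # "b" is only an example name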
I would like to remove the NAs from my data set and then organise it by ID.
My dataset is similar to this:
df<-read.table (text="ID Name Surname Group A1 A2 A3 Goal Sea
21 Goal Robi A 4 4 4 G No
21 Goal Robi B NA NA NA NA NA
21 Goal Robi C NA NA NA NA NA
21 Goal Robi D 3 4 4 G No
33 Nami Si O NA NA NA NA NA
33 Nami Si P NA NA NA NA NA
33 Nami Si Q 3 4 4 G No
33 Nami Si Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No
98 Sara Bat NC 4 3 2 D No
98 Sara Bat MF NA NA NA NA NA
98 Sara Bat LC NA NA NA NA NA
66 Noor Shor MF NA NA NA NA NA
66 Noor Shor LC NA NA NA NA NA
66 Noor Shor MT1 4 4 4 G No
66 Noor Shor NC1 2 3 3 D No
", header=TRUE)
By removing the all-NA rows and reshaping, I'd end up with a data frame with no NAs and one row per ID. So I would like to get this table:
ID Name Surname Group_1 A1 A2 A3 Goal_1 Sea_1 Group_2 A1_1 A2_2 A3_3 Goal_2 Sea_2
21 Goal Robi A 4 4 4 G No D 3 4 4 G No
33 Nami Si Q 3 4 4 G No Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No NC 4 3 2 D No
66 Noor Shor MT1 4 4 4 G No NC1 2 3 3 D No
Is it possible to get this? It seems we could do it using pivot_longer, but I do not know how to get there.
Search for complete.cases():
final = final[complete.cases(final), ]
A possible solution with the Tidyverse:
df <- structure(list(ID = c(21L, 21L, 21L, 21L, 33L, 33L, 33L, 33L,
98L, 98L, 98L, 98L, 66L, 66L, 66L, 66L), Name = c("Goal", "Goal",
"Goal", "Goal", "Nami", "Nami", "Nami", "Nami", "Sara", "Sara",
"Sara", "Sara", "Noor", "Noor", "Noor", "Noor"), Surname = c("Robi",
"Robi", "Robi", "Robi", "Si", "Si", "Si", "Si", "Bat", "Bat",
"Bat", "Bat", "Shor", "Shor", "Shor", "Shor"), Group = c("A",
"B", "C", "D", "O", "P", "Q", "Z", "MT", "NC", "MF", "LC", "MF",
"LC", "MT1", "NC1"), A1 = c(4L, NA, NA, 3L, NA, NA, 3L, 3L, 4L,
4L, NA, NA, NA, NA, 4L, 2L), A2 = c(4L, NA, NA, 4L, NA, NA, 4L,
3L, 4L, 3L, NA, NA, NA, NA, 4L, 3L), A3 = c(4L, NA, NA, 4L, NA,
NA, 4L, 3L, 4L, 2L, NA, NA, NA, NA, 4L, 3L), Goal = c("G", NA,
NA, "G", NA, NA, "G", "S", "S", "D", NA, NA, NA, NA, "G", "D"
), Sea = c("No", NA, NA, "No", NA, NA, "No", "No", "No", "No",
NA, NA, NA, NA, "No", "No")), class = "data.frame", row.names = c(NA,
-16L))
library(dplyr)
library(tidyr)

new_df <- df %>%
  drop_na() %>%
  group_by(ID) %>%
  mutate(n = row_number()) %>%
  pivot_wider(
    names_from = n,
    values_from = c(Group, A1, A2, A3, Goal, Sea)
  ) %>%
  relocate(ends_with("2"), .after = last_col())
print(new_df)
We can group_by the ID columns and then filter out rows with all NAs in the target columns:
df %>%
  group_by(ID, Name, Surname) %>%
  filter(!if_all(A1:Sea, is.na)) %>%
  slice_head(n = 1)
# A tibble: 4 × 9
# Groups: ID, Name, Surname [4]
ID Name Surname Group A1 A2 A3 Goal Sea
<int> <chr> <chr> <chr> <int> <int> <int> <chr> <chr>
1 21 Goal Robi A 4 4 4 G No
2 33 Nami Si Q 3 4 4 G No
3 66 Noor Shor MT1 4 4 4 G No
4 98 Sara Bat MT 4 4 4 S No
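Note that slice_head(n = 1) keeps only the first surviving row per ID. To reach the wide layout shown in the question (the remaining groups side by side), the filtered rows can be numbered and spread with pivot_wider, much like the previous answer; a rough, untested sketch:
library(dplyr)
library(tidyr)

df %>%
  # drop rows that are all NA in the measurement columns
  filter(!if_all(A1:Sea, is.na)) %>%
  # number the surviving rows within each ID ...
  group_by(ID, Name, Surname) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  # ... and spread them into Group_1/Group_2, A1_1/A1_2, and so on
  pivot_wider(names_from = n,
              values_from = c(Group, A1, A2, A3, Goal, Sea))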
I have the following data frame:
structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L,
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(2L, 2L, 1L, 2L,
2L), .Label = c("", "x"), class = "factor"), E = structure(c(2L,
1L, 2L, 2L, 1L), .Label = c("", "x"), class = "factor"), F = structure(c(2L,
1L, 2L, 2L, 2L), .Label = c("", "x"), class = "factor"), G = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("", "x"), class = "factor"), Y = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("", "x"), class = "factor")), .Names = c("A",
"B", "C", "D", "E", "F", "G", "Y"), class = "data.frame", row.names = c(NA,
-5L))
I would like to filter this data frame and remove rows with missing values in the columns (D, E, F, G, Y). I'm doing this using complete.cases in the following code:
completeFun <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}
However, what I noticed is that when I call the function, e.g. completeFun(test, c('E', 'F')), the following output is returned:
A B C D E F G Y
1 1 1 1 x x x x x
3 1 2 2 <NA> x x <NA> x
4 1 2 2 x x x <NA> <NA>
which is removing the rows where E OR F are NA and only keeping the rows where E AND F are NOT NA.
However, what I want instead is to keep the rows where at least one of those columns (E, F) is not NA, i.e. drop a row only when both E and F are NA, which means this output in this case:
A B C D E F G Y
1 1 1 1 x x x x x
3 1 2 2 <NA> x x <NA> x
4 1 2 2 x x x <NA> <NA>
5 2 1 1 x <NA> x <NA> <NA>
Of course I would like to keep the function as flexible as possible to be able to include more columns into the calculation.
What is the best R way to do this?
UPDATE
Based on Sotos' answer, here is a case where it does not work:
structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L,
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(1L, 1L, NA, 1L,
1L), .Label = "x", class = "factor"), E = structure(c(1L, NA,
1L, 1L, NA), .Label = "x", class = "factor"), F = structure(c(1L,
NA, 1L, 1L, 1L), .Label = "x", class = "factor"), G = structure(c(1L,
NA, NA, NA, NA), .Label = "x", class = "factor"), Y = structure(c(1L,
NA, 1L, NA, 1L), .Label = "x", class = "factor")), .Names = c("A",
"B", "C", "D", "E", "F", "G", "Y"), row.names = c(NA, -5L), class = "data.frame")
For this new data frame, if I call the function as follows: completeFun(test, cols = c('E', 'F', 'Y')), I get the following output:
A B C D E F G Y
1 1 1 1 x x x x x
NA NA NA NA <NA> <NA> <NA> <NA> <NA>
3 1 2 2 <NA> x x <NA> x
NA.1 NA NA NA <NA> <NA> <NA> <NA> <NA>
NA.2 NA NA NA <NA> <NA> <NA> <NA> <NA>
which is missing the last row of the dataframe where F AND Y have a non-empty value.
You can do this via rowSums, i.e.
completeFun <- function(df, cols) {
  return(df[rowSums(df[cols] == '') != length(cols), ])
}
completeFun(dd, cols = c('E', 'F'))
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 x x x
#4 1 2 2 x x x
#5 2 1 1 x x
completeFun(dd, cols = 'Y')
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 x x x
EDIT
In the previous example the OP had empty strings instead of NA, hence we were checking for those. If we want to check for NAs, we can modify the function to use is.na() instead.
completeFun <- function(df, cols) {
  df[rowSums(is.na(df[cols])) != length(cols), ]
}
completeFun(df, cols = c('E','F', 'Y'))
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 <NA> x x <NA> x
#4 1 2 2 x x x <NA> <NA>
#5 2 1 1 x <NA> x <NA> x
Similar to Sotos' answer, except it is a bit more flexible.
A row is kept if the number of non-NA values among the chosen columns is greater than or equal to the threshold thrsh.
completeFun <- function(dtf, cols, na.val = "", thrsh = 1) {
  dtf[dtf == na.val] <- NA
  ix <- rowSums(!is.na(dtf[, cols])) >= thrsh
  dtf[ix, ]
}
completeFun(test, cols=c("E", "F"))
# A B C D E F G Y
# 1 1 1 1 x x x x x
# 3 1 2 2 <NA> x x <NA> x
# 4 1 2 2 x x x <NA> <NA>
# 5 2 1 1 x <NA> x <NA> <NA>
completeFun(test, cols=c("D", "E", "F", "Y"), thrsh=3)
# A B C D E F G Y
# 1 1 1 1 x x x x x
# 3 1 2 2 <NA> x x <NA> x
# 4 1 2 2 x x x <NA> <NA>
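For the NA-coded data in the update, a dplyr equivalent of the default thrsh = 1 behaviour (a sketch) keeps rows where at least one of the chosen columns is non-NA:
library(dplyr)

# assumes `test` is the updated data frame where missing cells are NA
test %>%
  filter(if_any(c(E, F), ~ !is.na(.x)))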
For a sample dataframe:
df1 <- structure(list(i.d = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), group = c(1L,
1L, 2L, 1L, 3L, 3L, 2L, 2L, 1L), cat = c(0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, NA)), .Names = c("i.d", "group", "cat"), class = "data.frame", row.names = c(NA,
-9L))
I wish to add an additional column to my data frame ("pc.cat") which records the percentage of 1s in column cat by the group variable.
For example, there are four values in group 1 (i.d's a, b, d and i). The value for 'i' is NA, so it can be ignored for now. Only one of the remaining three values is 1, so the percentage would be 33.33 (to 2 dp). This value should be populated into column 'pc.cat' for every row in group 1 (including the NA row). The process would then be repeated for the other groups (2 and 3).
If anyone could help me with the code for this I would greatly appreciate it.
This can be accomplished with the ave function:
df1$pc.cat <- ave(df1$cat, df1$group, FUN=function(x) 100*mean(na.omit(x)))
df1
# i.d group cat pc.cat
# 1 a 1 0 33.33333
# 2 b 1 0 33.33333
# 3 c 2 1 66.66667
# 4 d 1 1 33.33333
# 5 e 3 0 0.00000
# 6 f 3 0 0.00000
# 7 g 2 1 66.66667
# 8 h 2 0 66.66667
# 9 i 1 NA 33.33333
library(data.table)
setDT(df1)
df1[!is.na(cat), mean(cat), by=group]
With data.table:
library(data.table)
DT <- data.table(df1)
DT[, list(sum(na.omit(cat)) / length(na.omit(cat))), by = "group"]
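Both data.table snippets summarise to one row per group. To attach the percentage to every row of the original data, as the ave() answer does, a := sketch along the same lines:
library(data.table)

setDT(df1)
# := adds pc.cat to df1 by reference, one value per group repeated on every row
df1[, pc.cat := 100 * mean(cat, na.rm = TRUE), by = group]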