I've been trying to summarize data by multiple groups, where the new column should be a summary of the proportion of one column to another, by these groups. Because these two columns never both contain a value, their proportions cannot be calculated per row.
Below is an example.
Grouping by P_Common and Number_7, I'd like the total N_count / A_count:
structure(list(P_Common = c("B", "B", "C", "C", "D", "E", "E",
"F", "G", "G", "B", "G", "E", "D", "F", "C"), Number_7 = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L),
N_count = c(0L, 4L, 22L, NA, 7L, 0L, 44L, 16L, NA, NA, NA,
NA, NA, NA, NA, NA), A_count = c(NA, NA, NA, NA, NA, NA,
NA, NA, 0L, 4L, 7L, NA, 23L, 4L, 7L, 17L)), class = "data.frame", row.names = c(NA,
-16L))
P_Common Number_7 N_count A_count
B 1 0 NA
B 1 4 NA
C 1 22 NA
C 1 NA NA
D 2 7 NA
E 2 0 NA
E 2 44 NA
F 2 16 NA
B 1 NA 7
G 3 NA NA
E 1 NA 23
D 2 NA 4
F 1 NA 7
C 1 NA 17
In this example there would be quite a few 0 / NA values, but that's okay, they can stay in. Overall the result would look like:
P_Common Number_7 Propo
B 1 0.571428571
C 1 1.294117647
D 2 1.75
... etc
You can do:
library(dplyr)
df %>%
group_by(P_Common, Number_7) %>%
summarise(Propo = sum(N_count, na.rm = TRUE) / sum(A_count, na.rm = TRUE))
P_Common Number_7 Propo
<chr> <int> <dbl>
1 B 1 0.571
2 C 1 1.29
3 D 2 1.75
4 E 1 0
5 E 2 Inf
6 F 1 0
7 F 2 Inf
8 G 3 0
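The Inf rows in this output come from groups where A_count sums to zero (E/2 and F/2). If NA is preferred there, one possible variation is sketched below. It assumes a zero denominator should count as "undefined"; the helper columns N_total and A_total are made-up names for this illustration.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  P_Common = c("B", "B", "C", "C", "D", "E", "E", "F",
               "G", "G", "B", "G", "E", "D", "F", "C"),
  Number_7 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L),
  N_count  = c(0L, 4L, 22L, NA, 7L, 0L, 44L, 16L, NA, NA, NA, NA, NA, NA, NA, NA),
  A_count  = c(NA, NA, NA, NA, NA, NA, NA, NA, 0L, 4L, 7L, NA, 23L, 4L, 7L, 17L)
)

res <- df %>%
  group_by(P_Common, Number_7) %>%
  summarise(
    N_total = sum(N_count, na.rm = TRUE),  # an all-NA group sums to 0
    A_total = sum(A_count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # divide only where the denominator is non-zero, otherwise return NA
  mutate(Propo = if_else(A_total == 0, NA_real_, N_total / A_total))

res
```

Keeping the totals next to the ratio also makes it easier to tell a true 0 from a group whose N_count values were all NA.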
I have a set of data roughly like this (more data in dput & desired results below):
id date u v
<chr> <date> <chr> <int>
1 a 2019-05-14 NA 0
2 a 2018-06-29 u 1
3 b 2020-12-02 u 1
4 b 2017-08-16 NA 1
5 b 2016-04-07 NA 0
6 c 2018-05-22 u 1
7 c 2018-05-22 u 1
8 e 2019-03-06 u 1
9 e 2019-03-06 NA 1
I am trying to create a new variable pr identifying, for each id, whether, when u == "u", there is an equal or earlier date where v == 1 within that id group (regardless of the value of u).
I know how generally to create a new variable based on in-group conditions:
library(dplyr)
x %>%
group_by(id) %>%
mutate(pr = case_when())
But I can't figure out how to compare the other dates within the group to the date corresponding to u and how to identify the presence of v == 1 not including the u row I am using as a reference. And u will always have v == 1.
Expected output is:
id date u v pr
<chr> <date> <chr> <int> <int>
1 a 2019-05-14 NA 0 NA
2 a 2018-06-29 u 1 0
3 b 2020-12-02 u 1 1
4 b 2017-08-16 NA 1 NA
5 b 2016-04-07 NA 0 NA
6 c 2018-05-22 u 1 1
7 c 2018-05-22 u 1 1
8 e 2019-03-06 u 1 1
9 e 2019-03-06 NA 1 NA
10 f 2020-10-20 u 1 0
11 f 2019-01-25 NA 0 NA
12 h 2020-02-24 NA 0 NA
13 h 2018-10-15 u 1 0
14 h 2018-03-07 NA 0 NA
15 i 2021-02-02 u 1 1
16 i 2020-11-19 NA 1 NA
17 i 2020-11-19 NA 1 NA
18 j 2019-02-11 u 1 1
19 j 2017-06-26 u 1 0
20 k 2018-12-13 u 1 0
21 k 2017-07-18 NA 0 NA
22 l 2018-05-08 u 1 1
23 l 2018-02-15 NA 0 NA
24 l 2018-02-15 u 1 0
25 l 2017-11-07 NA 0 NA
26 l 2015-09-10 NA 0 NA
The format of the variables isn't ideal; if there's any way for me to help clean it up let me know. Actual data is sensitive so I'm approximating.
> dput(x)
structure(list(id = c("a", "a", "b", "b", "b", "c", "c", "e",
"e", "f", "f", "h", "h", "h", "i", "i", "i", "j", "j", "k", "k",
"l", "l", "l", "l", "l"), date = structure(c(18030, 17711, 18598,
17394, 16898, 17673, 17673, 17961, 17961, 18555, 17921, 18316,
17819, 17597, 18660, 18585, 18585, 17938, 17343, 17878, 17365,
17659, 17577, 17577, 17477, 16688), class = "Date"), u = c(NA,
"u", "u", NA, NA, "u", "u", "u", NA, "u", NA, NA, "u", NA, "u",
NA, NA, "u", "u", "u", NA, "u", NA, "u", NA, NA), v = c(0L, 1L,
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L), pr = c(NA, 0L, 1L, NA, NA, 1L,
1L, 1L, NA, 0L, NA, NA, 0L, NA, 1L, NA, NA, 1L, 0L, 0L, NA, 1L,
NA, 0L, NA, NA)), row.names = c(NA, -26L), class = c("tbl_df",
"tbl", "data.frame"))
We may create a function
library(dplyr)
library(purrr)
f1 <- function(u, v, date) {
# create a variable with only 0s
tmp <- rep(0, length(u))
# create logical vectors based on 'u' value and 1 in `v`
i1 <- u %in% "u"
i2 <- v %in% 1
# loop over the subset of date where v values are 1
# check whether `all` of the dates are greater than or equal to
# subset of date where values are 'u' in `u`
# and if the number of v values are greater than 1
# assign it to the 'tmp' where v values are 1 and return the 'tmp'
# after assigning NA where u values are NA
tmp[i2] <- +(purrr::map_lgl(date[i2],
~ all(.x >= date[i1])) & sum(i2) > 1)
tmp[is.na(u)] <- NA
tmp
}
and apply it after grouping
x1 <- x %>%
group_by(id) %>%
mutate(prnew = f1(u, v, date)) %>%
ungroup()
> all.equal(x1$pr, x1$prnew)
[1] TRUE
Let's say I have this data frame. How would I go about removing only the NA values associated with name a without physically removing them manually?
name ID1 ID2
a 1 4
a 7 3
a NA 4
a 6 3
a NA 4
a NA 3
a 2 4
a NA 3
a 1 4
b NA 2
c 3 NA
I've tried using the function !is.na, but that removes all the NA values in the column ID1 for all the names. How would I specifically target the ones that are associated with name a?
You could subset your data frame as follows:
df_new <- df[!(df$name == "a" & is.na(df$ID1)), ]
This can also be written as:
df_new <- df[df$name != "a" | !is.na(df$ID1), ]
With dplyr:
library(dplyr)
df %>%
filter(!(name == "a" & is.na(ID1)))
Or with subset:
subset(df, !(name == "a" & is.na(ID1)))
Output
name ID1 ID2
1 a 1 4
2 a 7 3
3 a 6 3
4 a 2 4
5 a 1 4
6 b NA 2
7 c 3 NA
Data
df <- structure(list(name = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "c"), ID1 = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA,
3L), ID2 = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)), class = "data.frame", row.names = c(NA,
-11L))
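As a quick check that the two base R forms select the same rows (the second is the first rewritten with De Morgan's law), here is a small sketch using the data above. The names drop_form and keep_form are made up for this illustration.

```r
df <- data.frame(
  name = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "c"),
  ID1  = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA, 3L),
  ID2  = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)
)

# "drop rows where name is a AND ID1 is NA"
drop_form <- df[!(df$name == "a" & is.na(df$ID1)), ]

# "keep rows where name is not a OR ID1 is not NA"
keep_form <- df[df$name != "a" | !is.na(df$ID1), ]

identical(drop_form, keep_form)
nrow(drop_form)
```

Note that base subsetting keeps the original row names, whereas the dplyr version renumbers the rows.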
I would like to remove NA from my data set and then organise them by IDs.
My dataset is similar to this:
df<-read.table (text="ID Name Surname Group A1 A2 A3 Goal Sea
21 Goal Robi A 4 4 4 G No
21 Goal Robi B NA NA NA NA NA
21 Goal Robi C NA NA NA NA NA
21 Goal Robi D 3 4 4 G No
33 Nami Si O NA NA NA NA NA
33 Nami Si P NA NA NA NA NA
33 Nami Si Q 3 4 4 G No
33 Nami Si Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No
98 Sara Bat NC 4 3 2 D No
98 Sara Bat MF NA NA NA NA NA
98 Sara Bat LC NA NA NA NA NA
66 Noor Shor MF NA NA NA NA NA
66 Noor Shor LC NA NA NA NA NA
66 Noor Shor MT1 4 4 4 G No
66 Noor Shor NC1 2 3 3 D No
", header=TRUE)
Removing the rows that contain NA should leave a data frame without any NA. From that, I would like to get this table:
ID Name Surname Group_1 A1 A2 A3 Goal_1 Sea_1 Group_2 A1_1 A2_2 A3_3 Goal_2 Sea_2
21 Goal Robi A 4 4 4 G No D 3 4 4 G No
33 Nami Si Q 3 4 4 G No Z 3 3 3 S No
98 Sara Bat MT 4 4 4 S No NC 4 3 2 D No
66 Noor Shor MT1 4 4 4 G No NC1 2 3 3 D No
Is it possible to get this? It seems it could be done using pivot_longer, but I do not know how.
Look up complete.cases(). It returns TRUE for every row that has no NA, so you can use it to subset:
df <- df[complete.cases(df), ]
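To see what complete.cases() returns, here is a minimal sketch on a made-up four-row data frame with a few of the asker's columns (not the full data set). The names keep and df_clean are made up for this illustration.

```r
# Small illustrative data frame, loosely based on the question
df <- data.frame(
  ID    = c(21L, 21L, 33L, 33L),
  Group = c("A", "B", "P", "Q"),
  A1    = c(4L, NA, NA, 3L),
  Goal  = c("G", NA, NA, "G")
)

# One logical value per row: TRUE when the row has no NA in any column
keep <- complete.cases(df)
keep

# Keep only the complete rows
df_clean <- df[keep, ]
df_clean
```

This only removes the incomplete rows; reshaping to one row per ID is a separate step.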
A possible solution with the Tidyverse:
df <- structure(list(ID = c(21L, 21L, 21L, 21L, 33L, 33L, 33L, 33L,
98L, 98L, 98L, 98L, 66L, 66L, 66L, 66L), Name = c("Goal", "Goal",
"Goal", "Goal", "Nami", "Nami", "Nami", "Nami", "Sara", "Sara",
"Sara", "Sara", "Noor", "Noor", "Noor", "Noor"), Surname = c("Robi",
"Robi", "Robi", "Robi", "Si", "Si", "Si", "Si", "Bat", "Bat",
"Bat", "Bat", "Shor", "Shor", "Shor", "Shor"), Group = c("A",
"B", "C", "D", "O", "P", "Q", "Z", "MT", "NC", "MF", "LC", "MF",
"LC", "MT1", "NC1"), A1 = c(4L, NA, NA, 3L, NA, NA, 3L, 3L, 4L,
4L, NA, NA, NA, NA, 4L, 2L), A2 = c(4L, NA, NA, 4L, NA, NA, 4L,
3L, 4L, 3L, NA, NA, NA, NA, 4L, 3L), A3 = c(4L, NA, NA, 4L, NA,
NA, 4L, 3L, 4L, 2L, NA, NA, NA, NA, 4L, 3L), Goal = c("G", NA,
NA, "G", NA, NA, "G", "S", "S", "D", NA, NA, NA, NA, "G", "D"
), Sea = c("No", NA, NA, "No", NA, NA, "No", "No", "No", "No",
NA, NA, NA, NA, "No", "No")), class = "data.frame", row.names = c(NA,
-16L))
library(dplyr)
library(tidyr)
new_df <- df %>%
drop_na() %>%
group_by(ID) %>%
mutate(n = row_number()) %>%
pivot_wider(
names_from = n,
values_from= c(Group, A1, A2, A3, Goal, Sea)
) %>%
relocate(ends_with("2"), .after= last_col())
print(new_df)
We can group_by the ID columns, filter out rows with all NAs in the target columns, and keep the first remaining row per group:
df %>% group_by(ID, Name, Surname) %>%
filter(!if_all(A1:Sea, is.na)) %>%
slice_head(n = 1)
# A tibble: 4 × 9
# Groups: ID, Name, Surname [4]
ID Name Surname Group A1 A2 A3 Goal Sea
<int> <chr> <chr> <chr> <int> <int> <int> <chr> <chr>
1 21 Goal Robi A 4 4 4 G No
2 33 Nami Si Q 3 4 4 G No
3 66 Noor Shor MT1 4 4 4 G No
4 98 Sara Bat MT 4 4 4 S No
So, I have a large data frame with monthly observations of n individuals.
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
A 33 6 1 2 1 5
B 36 5 0 2 1 5
C 22 4 1 NA 1 5
D 2 2 0 2 1 5
E 5 2 1 2 1 6
F 7 1 0 2 1 5
G 8 6 1 2 1 5
H 2 8 0 2 2 5
I 1 3 1 2 1 5
J 3 2 0 2 1 5
I want to create a new data frame that includes only the individuals who meet some specific conditions.
E.g. if, for individual i, the columns y_0101:y_0312 do NOT include any of the values 3, 6 or NA, AND include a value of 2 or 1, THEN individual i should be included in the new data frame. This would produce the following data frame:
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
B 36 5 0 2 1 5
D 2 2 0 2 1 5
F 7 1 0 2 1 5
H 2 8 0 2 2 5
I tried different ways, but I can't figure out how to get multiple conditions included.
df <- df %>% filter(vars(starts_with("y_"))!=3 | !=6 | != NA)
or
df <- df %>% filter_at(vars(starts_with("y_")), all_vars(!=3 | !=6 | != NA)
I've tried some other things as well, like !%in%, but that doesn't seem to work. Any ideas?
I think you're almost there, but might need a slight shift in the logic:
library(dplyr)
df <- data.frame(A1 = 1:10,
A2 = 10:1,
A3 = 1:10,
B1 = 1:10)
df %>%
filter_at(vars(starts_with("A")), ~!(.x %in% c(3, 6, NA))) %>%
filter(if_any(starts_with("A"), ~ .x %in% c(1, 2)))
In the first step, I filter out all rows where any of the columns are 3, 6, or NA. In the second step, I filter down to only rows where at least one of the columns is 1 or 2. Does this help with your case?
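The same two steps can also be sketched with if_all() in place of filter_at(), so both conditions sit in a single filter() call. This is a variation on the code above, not a different method, and it assumes a dplyr version that provides if_all() and if_any() (1.0.4 or later).

```r
library(dplyr)

df <- data.frame(A1 = 1:10,
                 A2 = 10:1,
                 A3 = 1:10,
                 B1 = 1:10)

res <- df %>%
  filter(
    # every "A" column must be free of 3, 6 and NA
    if_all(starts_with("A"), ~ !(.x %in% c(3, 6, NA))),
    # at least one "A" column must be 1 or 2
    if_any(starts_with("A"), ~ .x %in% c(1, 2))
  )

res
```

Conditions separated by commas inside filter() are combined with AND, which matches running the two steps one after the other.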
Here is a base R option using rowSums:
cols <- grep('y_', names(df))
include <- c(1, 2)
not_include <- c(3, 6, NA)
result <- subset(df, rowSums(sapply(df[cols], `%in%`, include)) > 0 &
rowSums(sapply(df[cols], `%in%`, not_include)) == 0)
result
# ind y_0101 y_0102 y_0103 y_0104 y_0311 y_0312
#2 B 36 5 0 2 1 5
#4 D 2 2 0 2 1 5
#6 F 7 1 0 2 1 5
#8 H 2 8 0 2 2 5
data
df <- structure(list(ind = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"), y_0101 = c(33L, 36L, 22L, 2L, 5L, 7L, 8L, 2L, 1L,
3L), y_0102 = c(6L, 5L, 4L, 2L, 2L, 1L, 6L, 8L, 3L, 2L), y_0103 = c(1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L), y_0104 = c(2L, 2L, NA, 2L,
2L, 2L, 2L, 2L, 2L, 2L), y_0311 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L), y_0312 = c(5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L
)), class = "data.frame", row.names = c(NA, -10L))
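To make the rowSums approach easier to follow, here is a sketch on a three-row excerpt of the data (individuals A to C, three of the y_ columns) that keeps the intermediate logical matrices. The names hits_good, hits_bad and keep are made up for this illustration.

```r
# Excerpt of the data above: individuals A, B, C and three columns
df <- data.frame(
  ind    = c("A", "B", "C"),
  y_0101 = c(33L, 36L, 22L),
  y_0102 = c(6L, 5L, 4L),
  y_0104 = c(2L, 2L, NA)
)

cols <- grep("^y_", names(df))

# One logical column per y_ variable: is the value among the wanted/unwanted ones?
hits_good <- sapply(df[cols], `%in%`, c(1, 2))
hits_bad  <- sapply(df[cols], `%in%`, c(3, 6, NA))

# Keep a row if it has at least one wanted value and no unwanted value
keep <- rowSums(hits_good) > 0 & rowSums(hits_bad) == 0
df[keep, ]
```

Because %in% matches NA against NA, the missing value in row C counts as an unwanted hit, which is why NA can simply be listed alongside 3 and 6.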
My data.frame looks like
ID Encounter Value1 Value2
1 A 1 NA
1 A 2 10
1 B NA 20
1 B 4 30
2 A 5 40
2 A 6 50
2 B NA NA
2 B 7 60
and I want it to look like
ID Encounter Value1 Value2
1 A 1 10
1 B 4 20
2 A 5 40
2 B 7 60
We can use dplyr. Grouped by 'ID' and 'Encounter', get the first value that is not NA (!is.na(.)) in each of the remaining columns. If all the values in a group happen to be NA, return NA.
library(dplyr)
df1 %>%
group_by(ID, Encounter) %>%
summarise_at(vars(-group_cols()), ~ if(all(is.na(.))) NA_integer_
else first(.[!is.na(.)]))
# A tibble: 4 x 4
# Groups: ID [2]
# ID Encounter Value1 Value2
# <int> <chr> <int> <int>
#1 1 A 1 10
#2 1 B 4 20
#3 2 A 5 40
#4 2 B 7 60
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
Encounter = c("A",
"A", "B", "B", "A", "A", "B", "B"), Value1 = c(1L, 2L, NA, 4L,
5L, 6L, NA, 7L), Value2 = c(NA, 10L, 20L, 30L, 40L, 50L, NA,
60L)), class = "data.frame", row.names = c(NA, -8L))
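For newer dplyr versions, the same idea can be sketched with across() instead of summarise_at(). This is an assumed equivalent rather than part of the original answer. It relies on the fact that indexing an empty vector with [1] gives NA of the matching type, so the all-NA case needs no separate branch.

```r
library(dplyr)

df1 <- data.frame(
  ID        = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Encounter = c("A", "A", "B", "B", "A", "A", "B", "B"),
  Value1    = c(1L, 2L, NA, 4L, 5L, 6L, NA, 7L),
  Value2    = c(NA, 10L, 20L, 30L, 40L, 50L, NA, 60L)
)

res <- df1 %>%
  group_by(ID, Encounter) %>%
  # first non-NA value per column; NA if the group has none
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

res
```

Inside a grouped summarise(), everything() skips the grouping columns, so only Value1 and Value2 are summarised.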