How can I remove rows only when both columns are NA? - r

I have a df in R as follows:
ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20
Now I want to remove only the rows where both Score1 and Score2 are missing (i.e. the 3rd row).

You can filter it like this:
df <- read.table(header = TRUE, text = "ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20")
df[!(is.na(df$Score1) & is.na(df$Score2)), ]
# ID Age Score1 Score2
# 1 2 22 12 NA
# 2 3 19 11 22
# 4 1 21 NA 20
I.e. keep the rows where it is not (!) the case that Score1 is missing and (&) Score2 is missing.
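Equivalently (by De Morgan's law), keeping the rows where at least one score is present should give the same result:
# same result: at least one of the two scores is not NA
df[!is.na(df$Score1) | !is.na(df$Score2), ]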

Here are two versions with dplyr which can be extended to many columns with the prefix "Score".
Using filter_at
library(dplyr)
df %>% filter_at(vars(starts_with("Score")), any_vars(!is.na(.)))
# ID Age Score1 Score2
#1 2 22 12 NA
#2 3 19 11 22
#3 1 21 NA 20
and filter_if
df %>% filter_if(startsWith(names(.),"Score"), any_vars(!is.na(.)))
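Note that the scoped filter_at()/filter_if() verbs are superseded in dplyr >= 1.0.0; a sketch of the same filter written with if_any():
library(dplyr)
df %>% filter(if_any(starts_with("Score"), ~ !is.na(.x)))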
A base R version with apply
df[apply(!is.na(df[startsWith(names(df),"Score")]), 1, any), ]

One option is rowSums
df1[ rowSums(is.na(df1[grep("Score", names(df1))])) < 2,]
Or another option with base R
df1[!Reduce(`&`, lapply(df1[grep("Score", names(df1))], is.na)),]
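The < 2 above hard-codes two Score columns; a small sketch that generalizes to any number of them:
score_cols <- grep("Score", names(df1))
df1[rowSums(is.na(df1[score_cols])) < length(score_cols), ]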
data
df1 <- structure(list(ID = c(2L, 3L, 4L, 1L), Age = c(22L, 19L, 20L,
21L), Score1 = c(12L, 11L, NA, NA), Score2 = c(NA, 22L, NA, 20L
)), class = "data.frame", row.names = c(NA, -4L))

Related

How do I remove a whole row of a dataframe if one column is 0 or NA

I have a dataframe (DF) with 4 columns. How do I remove the whole row if column 4 is either 0 or NA? So in the example below only row 1 would be left.
Column 1 Column 2 Column 3 Column 4
11 24 234 2123
45 63 22 0
234 234 123 NA
Using dplyr
library(dplyr)
df %>% filter(!is.na(Column.4) & Column.4 != 0)
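For comparison, base R's subset() should read much the same (using the question's Column.4 name):
subset(df, !is.na(Column.4) & Column.4 != 0)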
You can use logical vectors to subset your data:
df[!is.na(df[,4]) & (df[,4]!=0), ]
Example:
df = data.frame(x = rnorm(30), y = rnorm(30), z = rnorm(30), a = rep(c(0,1,NA),10))
x y z a
2 -0.21772820 -0.5337648 -1.07579623 1
5 0.64536474 0.2011776 -0.12981424 1
8 2.36411372 0.0343823 2.03561701 1
11 1.09103526 -1.9287689 0.59511269 1
14 0.32482389 -0.5562136 -0.38943092 1
17 0.63621067 -1.6517097 -0.09804529 1
20 2.61892085 1.5575784 -0.50803567 1
23 0.07854647 1.1861483 -0.49798074 1
26 0.19561725 1.1036331 -0.66349688 1
29 0.22470875 -0.4192745 0.09153176 1
You can use sapply to loop through each row; note that this keeps only the rows where no column is NA or 0, which is a stricter check than looking at column 4 alone:
df[sapply(1:nrow(df), function(i) all(!is.na(df[i,])) & all(df[i,] != 0)), ]
Data:
structure(list(Column.1 = c(11L, 45L, 234L), Column.2 = c(24L,
63L, 234L), Column.3 = c(234L, 22L, 123L), Column.4 = c(2123L,
0L, NA)), class = "data.frame", row.names = c(NA, -3L)) -> df
Output:
# Column.1 Column.2 Column.3 Column.4
# 1 11 24 234 2123

How do you find if a number falls within any of multiple min and max ranges

In R I have:
DataSet1
A
1
4
13
19
22
DataSet2
(min)B (max)C
4 6
8 9
12 15
16 18
I am looking to set up a binary column D based on whether A is between B and C.
So D would be added to dataset 1 and calculated as follows:
A D
1 0
4 1
13 1
19 0
22 0
I have tried using the InRange function but it only checks against one row of B and C rather than against all the intervals.
Any help would be much appreciated.
Here is one option using fuzzy_left_join. Note that the match_fun below treats the upper bound as exclusive; swap < for <= if both bounds should be inclusive.
library(fuzzyjoin)
library(dplyr)
df1 %>%
  fuzzy_left_join(df2, by = c("A" = "B", "A" = "C"),
                  match_fun = list(`>=`, `<`)) %>%
  mutate(D = ifelse(is.na(B) & is.na(C), 0, 1))
A B C D
1 1 NA NA 0
2 4 4 6 1
3 13 12 15 1
4 19 NA NA 0
5 22 NA NA 0
Data
df1 <- structure(list(A = c(1L, 4L, 13L, 19L, 22L)), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(B = c(4L, 8L, 12L, 16L), C = c(6L, 9L, 15L, 18L)), class = "data.frame", row.names = c(NA, -4L))
Here's a way using sapply from base R:
df1$D <- sapply(df1$A, function(x) {
  # the unary + turns the logical result of any() into 0/1
  +any(x >= df2$B & x <= df2$C)
})
df1
A D
1 1 0
2 4 1
3 13 1
4 19 0
5 22 0
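Since the question mentions an InRange function: data.table's inrange() checks each value against all the intervals at once; a sketch, assuming inclusive bounds:
library(data.table)
df1$D <- +inrange(df1$A, df2$B, df2$C)  # TRUE/FALSE coerced to 1/0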

R split each row of a dataframe into two rows

I would like to split each row of a data frame (numeric) into two rows. For example, part of the original data frame looks like this (nrow(original data frame) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
And after splitting each row, we get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
the "value_1" and "value_2" columns are split and each element is set to a new row. For example, value_1 = 22 and value_2 = 54 are set to a new row.
Here is one option with data.table. We convert the 'data.frame' to 'data.table' while adding a column of row names (setDT(df1, keep.rownames = TRUE)). Then we take two subsets, columns 1:5 and columns 1, 6, 7, as a list, rbind the list elements with fill = TRUE so that columns missing from one subset are filled with NA, order by the row-number column ('rn'), and assign (:=) that column to NULL.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
rbindlist(list(df1[, 1:5, with = FALSE],
               setnames(df1[, c(1, 6:7), with = FALSE], 2:3, c("ID", "X"))),
          fill = TRUE)[order(rn)][, rn := NULL][]
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
A tidyverse version corresponding to the above logic (applied to the original data.frame df1) would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
  bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
                        c("rowname", "ID", "X"))) %>%
  arrange(rowname) %>%
  select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Simulate a data frame shaped like yours
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Loop through each input row in turn.
# Put its first four values into output row i,
# and its two "value" columns into output row i + 1,
# padded with two NAs.
for(i in seq(1, nrow(output_df), 2)){
  j <- (i + 1) / 2                      # matching row of the input
  output_df[i, ]     <- input_df[j, 1:4]
  output_df[i + 1, ] <- c(input_df[j, 5:6], NA, NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
ID X Y Z
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header=T)
data1 <- data[,1:4]
data2 <- data[, setdiff(names(data), names(data1))]  # the value_1/value_2 columns
names(data2) <- names(data1)[1:ncol(data2)]          # rename them ID, X
combined <- plyr::rbind.fill(data1,data2)
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"),]
Though why you would need to do this beats me.
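For large inputs (the question mentions more than 2.8 million rows), a vectorized base R sketch that reuses the data object read in above and avoids the row-by-row loop:
top <- data[, 1:4]
bottom <- setNames(data.frame(data$value_1, data$value_2, NA, NA), names(top))
rbind(top, bottom)[order(rep(seq_len(nrow(data)), 2)), ]  # interleave the two halves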

Update rows of data frame in R

Suppose I start with a data frame:
ID Measurement1 Measurement2
1 45 104
2 34 87
3 23 99
4 56 67
...
Then I have a second data frame which is meant to be used to update records in the first:
ID Measurement1 Measurement2
2 10 11
4 21 22
How do I use R to end up with:
ID Measurement1 Measurement2
1 45 104
2 10 11
3 23 99
4 21 22
...
The data frames in reality are very large datasets.
We can use match to get the row index. Using that index to subset the rows, we replace the 2nd and 3rd columns of the first dataset with the corresponding columns of the second dataset.
ind <- match(df2$ID, df1$ID)
df1[ind, 2:3] <- df2[2:3]
df1
# ID Measurement1 Measurement2
#1 1 45 104
#2 2 10 11
#3 3 23 99
#4 4 21 22
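One caveat, not in the original answer: this assumes every ID in df2 is also present in df1 (otherwise match() returns NA and the subscripted assignment fails); a cheap check makes that assumption explicit:
stopifnot(!anyNA(ind))  # every ID in df2 must already exist in df1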
Or we can use data.table to join the dataset on the 'ID' column (after converting the first dataset to 'data.table' i.e. setDT(df1)), and assign the 'Cols' with the 'iCols' from the second dataset.
library(data.table)#v1.9.6+
Cols <- names(df1)[-1]
iCols <- paste0('i.', Cols)
setDT(df1)[df2, (Cols) := mget(iCols), on= 'ID'][]
# ID Measurement1 Measurement2
#1: 1 45 104
#2: 2 10 11
#3: 3 23 99
#4: 4 21 22
data
df1 <- structure(list(ID = 1:4, Measurement1 = c(45L, 34L, 23L, 56L),
Measurement2 = c(104L, 87L, 99L, 67L)), .Names = c("ID",
"Measurement1", "Measurement2"), class = "data.frame",
row.names = c(NA, -4L))
df2 <- structure(list(ID = c(2L, 4L), Measurement1 = c(10L, 21L),
Measurement2 = c(11L,
22L)), .Names = c("ID", "Measurement1", "Measurement2"),
class = "data.frame", row.names = c(NA, -2L))
Another dplyr option is to drop the rows that will be updated with anti_join() and then append the replacement rows from df2:
library(dplyr)
df1 %>%
anti_join(df2, by = "ID") %>%
bind_rows(df2) %>%
arrange(ID)
dplyr 1.0.0 introduced a family of SQL-inspired functions for modifying rows. In this case you can now use rows_update():
library(dplyr)
df1 %>%
rows_update(df2, by = "ID")
ID Measurement1 Measurement2
1 1 45 104
2 2 10 11
3 3 23 99
4 4 21 22
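A related note: rows_update() errors by default if df2 contains an ID that df1 does not have; if such rows should be inserted rather than rejected, rows_upsert() (also from dplyr 1.0.0) is the likely alternative:
df1 %>% rows_upsert(df2, by = "ID")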

Find the last row in a data frame that meets certain criteria

I'm looking for a way to refer to a previous row in my data frame that has one column value in common with the 'current' row. Basically, if this were my data frame
A B D
1 10
4 5
6 6
3 25
1 40
I would want D(i) to contain the B value of the last row for which A has the same value as A(i). So for the last row that should be 10.
You could try this:
for(i in seq_len(nrow(dat))) {
  # index of the last earlier row with the same A; try() silently skips rows
  # with no earlier match, where the assignment would fail on a zero-length index
  try(dat$D[i] <- dat$B[tail(which(dat$A[seq_len(i - 1)] == dat$A[i]), 1)], silent = TRUE)
}
Results:
> dat
A B D
1 1 10 NA
2 4 5 NA
3 6 6 NA
4 3 25 NA
5 1 40 10
Data:
dat <- read.csv(text="A,B,D
1,10,NA
4,5,NA
6,6,NA
3,25,NA
1,40,NA")
You may try
library(dplyr)
df1 %>%
  group_by(A) %>%
  mutate(D = lag(B))
# A B D
#1 1 10 NA
#2 4 5 NA
#3 6 6 NA
#4 3 25 NA
#5 1 40 10
Or
library(data.table)#data.table_1.9.5
setDT(df1)[, D:=shift(B), A][]
data
df1 <- structure(list(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L,
25L, 40L)), .Names = c("A", "B"), class = "data.frame",
row.names = c(NA, -5L))
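For completeness, a base R sketch of the same grouped lag using ave() on the df1 from the data block:
df1$D <- ave(df1$B, df1$A, FUN = function(x) c(NA, head(x, -1)))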
