Rows With Blank Entries in R

I have a 721 x 26 data frame. Some rows have entries that are blank: not NULL
or NA, just empty, as in row 3 below. How can I delete the rows that contain these kinds of entries?
1 Y N Y N 86.8
2 N N Y N 50.0
3 76.8
4 N N Y N 46.6
5 Y Y Y Y 30.0

The answer to this question depends on how paranoid you want to be about the sort of things that might be in 'blank'-appearing character strings. Here's a fairly careful approach that will match the zero-length blank string "" as well as any string composed of one or more [[:space:]] characters (i.e. "tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters", according to the ?regex help page).
## An example data.frame containing all sorts of 'blank' strings
df <- data.frame(A = c("a", "", "\n", " ", " \t\t", "b"),
B = c("b", "b", "\t", " ", "\t\t\t", "d"),
C = 1:6)
## Test each element to see if it is either zero-length or contains just
## space characters
pat <- "^[[:space:]]*$"
subdf <- df[!names(df) %in% "C"] # drops columns not involved in the test (safe even if nothing matches)
matches <- data.frame(lapply(subdf, function(x) grepl(pat, x)))
## Subset df to remove rows fully composed of elements matching `pat`
df[!apply(matches, 1, all),]
# A B C
# 1 a b 1
# 2 b 2
# 6 b d 6
## OR, to remove rows with *any* blank entries
df[!apply(matches, 1, any),]
# A B C
# 1 a b 1
# 6 b d 6
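The same test can be wrapped in a small helper that finds the text columns itself instead of excluding C by name. This is only a sketch: the function name drop_blank_rows and its how argument are made up for illustration, and it assumes that only character or factor columns can hold blank entries.

```r
# Hypothetical helper (not from the answer above): drop rows whose character
# or factor columns are blank, using the same pattern as before.
drop_blank_rows <- function(d, how = c("any", "all")) {
  how <- match.arg(how)
  # Only character/factor columns can contain blank strings
  is_text <- vapply(d, function(col) is.character(col) || is.factor(col), logical(1))
  if (!any(is_text)) return(d)
  # One logical column per text column; cbind() always yields a matrix
  blank <- do.call(cbind, lapply(d[is_text], function(col) grepl("^[[:space:]]*$", col)))
  n_blank <- rowSums(blank)
  drop <- if (how == "any") n_blank > 0 else n_blank == ncol(blank)
  d[!drop, , drop = FALSE]
}

df <- data.frame(A = c("a", "", "\n", " ", " \t\t", "b"),
                 B = c("b", "b", "\t", " ", "\t\t\t", "d"),
                 C = 1:6)
drop_blank_rows(df, "any")  # keeps rows 1 and 6
drop_blank_rows(df, "all")  # keeps rows 1, 2 and 6
```

Because grepl() returns FALSE for NA, genuine NA entries are not treated as blank here; they would need a separate is.na() check if they should be dropped too.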

Related

Reinsert special character * into strings at predefined positions

(This question is based on a previous question Convert letters with duplicates to numbers)
I have series of events and non-events in column aoi, with events expressed as capital letters and non-events expressed as "*":
df <- data.frame(
Partcpt = c("B","A","B","C","A","B"),
aoi = c("B*B*B","*A*C*A*C","*B*B","A*C","*A*","*")
)
I need to convert the letters to consecutive numbers unless they are duplicates, in which case the previous number should be repeated. This conversion is accomplished by this:
df$aoi_0 <- sapply(strsplit(df$aoi, split = ""), function(x) paste(match(x[x!="*"], unique(x[x!="*"])), collapse = ""))
df
Partcpt aoi aoi_0
1 B B*B*B 111
2 A *A*C*A*C 1212
3 B *B*B 11
4 C A*C 12
5 A *A* 1
6 B *
But now the information on the non-events is lost. How can I reinstate that information in the strings themselves, by re-inserting the "*" character where appropriate, like so:
df
Partcpt aoi aoi_0
1 B B*B*B 1*1*1
2 A *A*C*A*C *1*2*1*2
3 B *B*B *1*1
4 C A*C 1*2
5 A *A* *1*
6 B * *
You can modify the anonymous function with an ifelse() to return * if the input is * but otherwise to follow the logic of your previous code, i.e. match the input to the vector of unique values.
df$aoi_1 <- sapply(
strsplit(df$aoi, split = ""),
\(x) paste0(
ifelse(
x=="*",
"*",
match(x, unique(x[x!="*"]))
), collapse = ""
)
)
df
# Partcpt aoi aoi_0 aoi_1
# 1 B B*B*B 111 1*1*1
# 2 A *A*C*A*C 1212 *1*2*1*2
# 3 B *B*B 11 *1*1
# 4 C A*C 12 1*2
# 5 A *A* 1 *1*
# 6 B * *
Another possible solution, based on the following ideas:
Try to match every character, including *, against unique(x[x!="*"]).
This yields no match for *.
Set nomatch = 0, so that each * becomes 0.
Use gsub to replace 0 with *.
df$aoi_0 <- sapply(strsplit(df$aoi, split = ""),
function(x) gsub("0", "*", paste(match(x, unique(x[x!="*"]), nomatch = 0),
collapse = "")))
df
#> Partcpt aoi aoi_0
#> 1 B B*B*B 1*1*1
#> 2 A *A*C*A*C *1*2*1*2
#> 3 B *B*B *1*1
#> 4 C A*C 1*2
#> 5 A *A* *1*
#> 6 B * *
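One caveat with replacing every "0": if a string ever contained ten or more distinct letters, the zero in 10, 20, and so on would be turned into * as well. A possible way around this, sketched below, is to leave non-matches as NA and fill in the * before pasting. The function name reinsert_star is hypothetical, and the long test string is invented for illustration.

```r
# Hypothetical helper: number the letters, keep "*" where it was.
reinsert_star <- function(s) {
  vapply(strsplit(s, split = ""), function(x) {
    idx <- match(x, unique(x[x != "*"]))  # NA wherever x is "*"
    out <- as.character(idx)
    out[is.na(idx)] <- "*"                # restore the non-events
    paste(out, collapse = "")
  }, character(1))
}

aoi <- c("B*B*B", "*A*C*A*C", "*B*B", "A*C", "*A*", "*")
reinsert_star(aoi)
# Ten distinct letters: the "10" survives instead of becoming "1*"
reinsert_star("ABCDEFGHIJ*A")
```

Multi-digit numbers are still ambiguous once pasted together without a separator, so this only protects the * positions.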

Replace values in a data.table based on row values in another table

I have two data.tables:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
right_table <- data.table(record = sample(LETTERS, 9))
I would like to replace the numeric entries in left_table by the values associated with the corresponding row numbers in right_table. e.g. All instances of 4 in left_table are replaced by whatever letter (or set of characters in my real data) is on row 4 of right_table and so on.
I have this solution but I feel it's a bit cumbersome and a simpler solution must be possible?
right_table <- data.table(row_n = as.character(seq_along(1:9)), right_table)
for (i in seq_along(left_table)){
cols <- colnames(left_table)
current_col <- cols[i]
# convert numbers to character to allow := to work for matching records
left_table[,(current_col) := lapply(.SD, as.character), .SDcols = current_col]
#right_table[,(current_col) := lapply(.SD, as.character), .SDcols = current_col]
#set key for quick joins
setkeyv(left_table, current_col)
setkeyv(right_table, "row_n")
# replace matching records
left_table[right_table, (current_col) := record]
}
You can create the new columns fetching the letters from right_table using the original variables.
left_table[, c("newa","newb","newc") :=
.(right_table[a,record],right_table[b,record],right_table[c,record])]
# a b c newa newb newc
# 1: 1 4 8 Y A R
# 2: 2 5 9 D B W
# 3: 3 6 10 G K <NA>
# 4: 4 7 11 A N <NA>
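The <NA> entries in newc come from how indexing works: right_table has only 9 rows, so rows 10 and 11 do not exist. The same behaviour shows up with a plain character vector in base R. In this sketch the record values are made up (the question uses an unseeded sample()), and the ifelse() fallback is just one possible way to keep the original number when there is no matching row.

```r
# Assumed lookup vector standing in for right_table$record
record <- c("N", "Y", "D", "G", "A", "B", "K", "Z", "R")
k <- c(8, 9, 10, 11)                 # values from column c

looked_up <- record[k]               # out-of-range positions give NA
# Fall back to the number itself (as text) where nothing was found
filled <- ifelse(is.na(looked_up), as.character(k), looked_up)

looked_up
filled
```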
Edit:
To make it more generic:
columnNames <- names(left_table)
left_table[, (columnNames) :=
lapply(columnNames, function(x) right_table[left_table[,get(x)],record])]
Although there is probably a better way to do this without needing to call left_table inside lapply()
Using mapvalues() from plyr:
library(plyr)
corresp <- function(x) mapvalues(x, seq_along(right_table$record), right_table$record)
left_table[,c(names(left_table)) := lapply(.SD,corresp),.SDcols = names(left_table)]
a b c
1: N K X
2: U Q V
3: Z I 10
4: K G 11
Here is my attempt. When we replace the numeric values with character values, we get NAs, as some of the other answers show, so I decided to take another approach. First, I created a vector using unlist(). Then I used fifelse() from the data.table package: the values in foo serve as indices into right_table$record, and any value with no corresponding row (10 and 11 in the sample data) is converted to character instead. Next, I created a matrix from the result and converted it to a data.table object. Finally, I assigned column names to the object.
library(data.table)
foo <- unlist(left_table)
temp <- fifelse(test = foo <= nrow(right_table),
yes = right_table$record[foo],
no = as.character(foo))
res <- as.data.table(matrix(data = temp, nrow = nrow(left_table)))
setnames(res, names(left_table))
# a b c
#1: B G J
#2: Y D I
#3: P T 10
#4: G S 11
I think it might be easier to just keep record as a vector and access it via indexing:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
# a b c
#1: 1 4 8
#2: 2 5 9
#3: 3 6 10
#4: 4 7 11
set.seed(0L)
right_table <- data.table(record = sample(LETTERS, 9))
record <- right_table$record
#[1] "N" "Y" "D" "G" "A" "B" "K" "Z" "R"
left_table[, names(left_table) := lapply(.SD, function(k) fcoalesce(record[k], as.character(k)))]
left_table
# a b c
# 1: N G Z
# 2: Y A R
# 3: D B 10
# 4: G K 11

Error when unlisting columns in a data frame

Suppose I have a data frame called DF:
options(stringsAsFactors = FALSE)
letters <- list("A", "B", "C", "D")
numbers <- list(list(1,2), 1, 1, 2)
score <- list(.44, .54, .21, .102)
DF <- data.frame(cbind(letters, numbers, score))
Note that all columns in the data frame are of class "list".
Also, take a look at the structure: DF$numbers[1] is also a list.
I'm trying to unlist each column.
DF$letters <- unlist(DF$letters)
DF$score <- unlist(DF$score)
DF$numbers <- unlist(DF$numbers)
However, because DF$numbers[1] is itself a list, I get this error:
Error in `$<-.data.frame`(`*tmp*`, numbers, value = c(1, 2, 1, 1, 2)) :
replacement has 5 rows, data has 4
Is there a way that I can unlist the whole column and keep the values of cells like DF$numbers[1] together, as a vector like c(1,2) or a string like 1,2?
Ideally I would like DF to look something like this, where the individual values in the number column are still of type int:
letters numbers score
A 1,2 .44
B 1 .54
C 1 .21
D 2 .102
The goal is to then write the data frame to a csv file.
You can apply unlist to each individual element of the column numbers instead of the whole column:
DF$numbers <- lapply(DF$numbers, unlist)
DF
# letters numbers score
#1 A 1, 2 0.440
#2 B 1 0.540
#3 C 1 0.210
#4 D 2 0.102
DF$numbers[1]
#[[1]]
#[1] 1 2
Or paste the elements as a single string if you want an atomic vector column:
DF$numbers <- sapply(DF$numbers, toString)
DF
# letters numbers score
#1 A 1, 2 0.44
#2 B 1 0.54
#3 C 1 0.21
#4 D 2 0.102
DF$numbers[1]
#[1] "1, 2"
class(DF$numbers)
# [1] "character"
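Since the stated goal is a CSV file, the flattening step can be followed directly by write.csv(). The sketch below rebuilds a small data frame with a list column rather than reusing DF, flattens every list column to a comma-separated string, and reads the file back to check it. The temporary file and the "," separator are choices made for illustration only.

```r
# Rebuild a small example with one list column (numbers)
DF2 <- data.frame(letters = c("A", "B", "C", "D"),
                  score = c(.44, .54, .21, .102),
                  stringsAsFactors = FALSE)
DF2$numbers <- list(c(1, 2), 1, 1, 2)

# Flatten each list column to one string per row
is_list <- vapply(DF2, is.list, logical(1))
DF2[is_list] <- lapply(DF2[is_list], function(col)
  vapply(col, function(el) paste(unlist(el), collapse = ","), character(1)))

out <- tempfile(fileext = ".csv")
write.csv(DF2, out, row.names = FALSE)   # strings are quoted, so "1,2" stays in one field
back <- read.csv(out, stringsAsFactors = FALSE)
back
```

The numbers column comes back as character, because a cell like "1,2" cannot be stored as a single integer.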
You can do:
DF$letters <- unlist(DF$letters)
DF$score <- unlist(DF$score)
DF$numbers <- unlist(as.character(DF$numbers))
This returns:
DF
letters numbers score
1 A c(1, 2) 0.440
2 B 1 0.540
3 C 1 0.210
4 D 2 0.102

Find the index of the row in data frame that contain one element in a string vector

If I have a data.frame like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to get the indices of the rows that contain any of the elements in c("a", "k", "n"); in this example, the result should be 1, 2, 5.
If you have a large data frame and you wish to check all columns, try this
x <- c("a", "k", "n")
Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
# [1] 1 5 2
and of course you can sort the end result.
s <- c('a','k','n');
which(df$col1%in%s|df$col3%in%s);
## [1] 1 2 5
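Here's another solution. This one works on the entire data.frame and captures the search strings as element names (you can get rid of those via unname()). Note that the [1] keeps only the first matching row for each search string, so row 5 appears here because of "n", not because of the second "a":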
sapply(s,function(s) which(apply(df==s,1,any))[1]);
## a k n
## 1 2 5
Original second solution:
sort(unique(rep(1:nrow(df),ncol(df))[as.matrix(df)%in%s]));
## [1] 1 2 5
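A compact variant of the all-columns approach is to test each column with %in% and combine the results with a logical OR. This is a sketch rather than one of the original answers; it returns the indices already sorted, and %in% treats NA as a non-match, whereas a comparison with == would return NA for such cells.

```r
df <- data.frame(col1 = c(letters[1:4], "a"), col2 = 1:5, col3 = letters[10:14])
s <- c("a", "k", "n")

# One logical vector per column, OR-ed together row by row
hit <- Reduce(`|`, lapply(df, function(col) col %in% s))
rows <- which(hit)
rows
# [1] 1 2 5
```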

Extract subset of data

Ok, I have a matrix of values with certain identifiers, such as:
A 2
B 3
C 4
D 5
E 6
F 7
G 8
I would like to pull out a subset of these values (using R) based on a list of the identifiers ("B", "D", "E") for example, so I would get the following output:
B 3
D 5
E 6
I'm sure there's an easy way to do this (some sort of apply?) but I can't seem to figure it out. Any ideas? Thanks!
If the letters are the row names, then you can just use this:
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
m[c("B","D","E"),]
# B D E
# 3 5 6
Note that there is a subtle but very important difference between: m[c("B","D","E"),] and m[rownames(m) %in% c("B","D","E"),]. Both return the same rows, but not necessarily in the same order.
The former uses the character vector c("B","D","E") as an index into m. As a result, the rows will be returned in the order of the character vector. For instance:
# result depends on order in c(...)
m[c("B","D","E"),]
# B D E
# 3 5 6
m[c("E","D","B"),]
# E D B
# 6 5 3
The second method, using %in%, creates a logical vector with length = nrow(m). Each element is TRUE if the corresponding row name is present in c("B","D","E"), and FALSE otherwise. Indexing with a logical vector returns rows in the original order:
# result does NOT depend on order in c(...)
m[rownames(m) %in% c("B","D","E"),]
# B D E
# 3 5 6
m[rownames(m) %in% c("E","D","B"),]
# B D E
# 3 5 6
This is probably more than you wanted to know...
Your matrix:
> m <- matrix(2:8, dimnames = list(LETTERS[1:7]))
You can use %in% to select the desired rows. If the original matrix has only a single column, using drop = FALSE will keep the matrix structure; without it, the result is converted to a named vector.
> m[rownames(m) %in% c("B", "D", "E"), , drop = FALSE]
# [,1]
# B 3
# D 5
# E 6
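The two indexing styles also differ when an identifier is missing from the matrix: character indexing stops with a "subscript out of bounds" error, while %in% silently ignores it. If the requested order matters and unknown identifiers should be skipped, intersect() is one option, sketched here with a made-up identifier "Z".

```r
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
ids <- c("E", "Z", "B")   # "Z" is deliberately not a row name of m

# Direct character indexing fails as soon as one name is missing
res <- tryCatch(m[ids, , drop = FALSE], error = function(e) "error")

# intersect() keeps the order of its first argument and drops unknown names
sub <- m[intersect(ids, rownames(m)), , drop = FALSE]
sub
```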
