How to select columns in an R dataframe based on string matching

I don't think this exact question has been asked yet (for R, anyway).
I want to retain any columns in my dataset (there are hundreds in actuality) that contain a certain string, and drop the rest. I have found plenty of examples of string searching column names, but nothing for the contents of the columns themselves.
As an example, say I have this dataset:
df = data.frame(v1 = c(1, 8, 7, 'No number'),
                v2 = c(5, 3, 5, 1),
                v3 = c('Nothing', 4, 2, 9),
                v4 = c(3, 8, 'Something', 6))
For this example, say I want to retain any columns containing the string "No", so that the resulting dataset is:
v1 v3
1 1 Nothing
2 8 4
3 7 2
4 No number 9
How can I do this in R? I am happy with any sort of solution (e.g., base R, dplyr, etc.)!
Thanks in advance!

Simply
df[grep("No", df)]
# v1 v3
# 1 1 Nothing
# 2 8 4
# 3 7 2
# 4 No number 9
This works because grep internally checks if (!is.character(x)), and if that's true it basically does:
s <- structure(as.character(df), names = names(df))
s
# v1
# "c(\"1\", \"8\", \"7\", \"No number\")"
# v2
# "c(5, 3, 5, 1)"
# v3
# "c(\"Nothing\", \"4\", \"2\", \"9\")"
# v4
# "c(\"3\", \"8\", \"Something\", \"6\")"
grep("No", s)
# [1] 1 3
Note:
R.version.string
# [1] "R version 4.0.3 (2020-10-10)"

Base R :
df[colSums(sapply(df, grepl, pattern = 'No')) > 0]
# v1 v3
#1 1 Nothing
#2 8 4
#3 7 2
#4 No number 9
Using dplyr :
library(dplyr)
df %>% select(where(~any(grepl('No', .))))
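If you prefer base R's functional style, Filter() gives the same result (a small equivalent sketch, not part of the original answers):
# Keep only the columns where at least one value matches "No"
Filter(function(col) any(grepl("No", col)), df)
# v1 v3
# 1 1 Nothing
# 2 8 4
# 3 7 2
# 4 No number 9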

Use dplyr::select_if() function:
df <- df %>% select_if(function(col) any(grepl("No", col)))

You can run grep on each column and keep the column if it matches any value.
df = data.frame(v1 = c(1, 8, 7, 'No number'),
                v2 = c(5, 3, 5, 1),
                v3 = c('Nothing', 4, 2, 9),
                v4 = c(3, 8, 'Something', 6))
find.no <- sapply(X = df, FUN = function(x) {
  any(grep("No", x = x))
})
df[, find.no]
v1 v3
1 1 Nothing
2 8 4
3 7 2
4 No number 9
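One small usage note (my addition, not part of the original answer): if only a single column matched, df[, find.no] would drop down to a plain vector; drop = FALSE keeps the result a data frame:
df[, find.no, drop = FALSE]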

Related

Checking conditions and assigning values by row in R

I have a dataset that has one row per subject, and there is a variable for which I want to reassign values based on a condition. For example, if the value of the variable is 6, I want to change the value to the mean of the other variables in the dataset.
Subject V1 V2 V3 V4
123 2 2 2 3
234 1 5 4 4
345 1 4 3 6
In the above dataset, for each patient, I want to reassign all 6's for V4 with the mean of that patient's V1, V2, V3. Thus, for subject 345, V4 would take on the new value 8/3 or ((1+4+3)/3). I was thinking of using an ifelse statement, but I haven't been able to get it to work. Any help would be greatly appreciated.
Given:
library(dplyr)
library(tibble)
data <- tibble(
  Subject = c("123", "234", "345"),
  V1 = c(2, 1, 1),
  V2 = c(2, 5, 4),
  V3 = c(2, 4, 3),
  V4 = c(3, 4, 6)
)
You could do this using base-R:
data$V4 <- ifelse(data$V4 == 6, (data$V1 + data$V2 + data$V3)/3, data$V4)
Or using a dplyr chain:
data <- data %>%
  mutate(V4 = ifelse(V4 == 6, (V1 + V2 + V3)/3, V4))
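If there were many more columns to average than three, writing out the sum by hand would get tedious. Assuming dplyr >= 1.0.0 is available, a hedged sketch using across() with rowMeans() scales to any column range:
data <- data %>%
  mutate(V4 = ifelse(V4 == 6, rowMeans(across(V1:V3)), V4))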
Turn the V4 values that equal 6 into NA, then replace them with rowMeans:
data$V4[data$V4 == 6] <- NA
data$V4 <- ifelse(is.na(data$V4), rowMeans(data[-1], na.rm = TRUE), data$V4)
data
# Subject V1 V2 V3 V4
#1 123 2 2 2 3.00
#2 234 1 5 4 4.00
#3 345 1 4 3 2.67
You can use either of the formulas below (assuming d contains only the columns V1 to V4, so that column 4 is V4):
d[, 4] <- ifelse(d[, 4] == 6, (d[, 1] + d[, 2] + d[, 3])/3, d[, 4])
d[, 4] <- ifelse(d[, 4] == 6, rowMeans(d[, 1:3]), d[, 4])

In R subtract a vector from each row of a dataframe

I'm searching for a better, more efficient solution to subtract a vector from each row of a dataframe (df1). My current solution repeats the vector (Vec) to create a dataframe (Vec_df1) with the same number of rows as df1 and then subtracts the two dataframes. Now I wonder if there is a more "direct" way to do this without having to create the new Vec_df1 dataframe (preferably in tidyverse). See the example data below.
# Example data
library(tibble)
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)
df1 <- tibble(V1, V2, V3)
Vec <- c(1, 1, 2)
# Current solution, creates a dataframe with the same nrows by repeating the vector.
Vec_df1 <- tibble::as_tibble(t(Vec)) %>%
  dplyr::slice(rep(dplyr::row_number(), nrow(df1)))
# Subtraction.
df2 <- df1-Vec_df1
df2
Thanks in advance
We can use sweep :
sweep(df1, 2, Vec, `-`)
# `-` is default FUN in sweep so you can also use
#sweep(df1, 2, Vec)
# V1 V2 V3
#1 0 3 5
#2 1 4 6
#3 2 5 7
Or an attempt similar to yours
df1 - rep(Vec, each = nrow(df1))
A similar approach using map2_df():
library(purrr)
map2_df(df1, Vec, `-`)
# A tibble: 3 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 0 3 5
2 1 4 6
3 2 5 7
The fastest way to do this:
as_tibble(t(t(df1) - Vec))
# A tibble: 3 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 0 3 5
2 1 4 6
3 2 5 7
We can also do
df1 - Vec[col(df1)]
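To see why this works (a quick illustration, not part of the original answer): col(df1) gives a matrix of column indices, and indexing Vec with it drops the dimensions, yielding Vec[j] repeated nrow(df1) times for each column j, which the subtraction then recycles column-wise:
col(df1)
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 1 2 3
# [3,] 1 2 3
Vec[col(df1)]
# [1] 1 1 1 1 1 1 2 2 2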

Using a column as a column index to extract value from a data frame in R

I am trying to use the values from a column to extract column numbers in a data frame. My problem is similar to this topic in r-bloggers. Copying the script here:
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c("x", "y", "x", "z"),
                 stringsAsFactors = FALSE)
However, instead of having column names in choice, I have column index number, such that my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3),
                 stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
  df[cbind(
    seq_len(nrow(df)),
    match(df$choice, colnames(df))
  )]
Instead of giving me an output that looks like this:
# x y choice newValue
# 1 1 5 1 1
# 2 2 6 2 6
# 3 3 7 1 3
# 4 4 8 3 NA
my newValue column returns all NA's:
# x y choice newValue
# 1 1 5 1 NA
# 2 2 6 2 NA
# 3 3 7 1 NA
# 4 4 8 3 NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already contains the column numbers we need to extract, match isn't required here. However, because the choice column itself shouldn't be a valid target when extracting values, we first turn any out-of-range index into NA before subsetting the data frame with a row-column index matrix.
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) - 1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
data
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))
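For completeness (my note, not part of the original answer): the original attempt returned all NA's because match() coerces the numeric choice values to character and compares them against the column names, where "1", "2" and "3" never occur:
match(df$choice, colnames(df))
# [1] NA NA NA NA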

Replacing NA's with LOCF using Sparklyr

My aim is to replace NA's in a Spark data frame using the Last Observation Carried Forward method. I wrote the following code, and it works. However, it seems to take longer than expected for a larger dataset.
It would be great if someone can recommend a better approach or improve the code.
Example and Code with Sparklyr
In the following example, NA's are replaced after ordering by time and grouping by grp.
df_with_nas <- data.frame(time = seq(as.Date('2001/01/01'),
                                     as.Date('2010/01/01'),
                                     length.out = 10),
                          grp = c(rep(1, 5), rep(2, 5)),
                          v1 = c(1, rep(NA, 3), 5, rep(NA, 5)),
                          v2 = c(NA, NA, 3, rep(NA, 4), 3, NA, NA))
tbl <- copy_to(sc, df_with_nas, overwrite = TRUE)
tbl %>%
  spark_apply(function(df) {
    library(dplyr)
    na_locf <- function(x) {
      v <- !is.na(x)
      c(NA, x[v])[cumsum(v) + 1]
    }
    df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate_at(vars(-v1, -grp), funs(na_locf(.)))
  })
# # Source: spark<?> [?? x 4]
# time grp v1 v2
# <dbl> <dbl> <dbl> <dbl>
# 1 11323 1 1 NaN
# 2 11688. 1 NaN NaN
# 3 12053. 1 NaN 3
# 4 12419. 1 NaN 3
# 5 12784. 1 5 3
# 6 13149. 2 NaN NaN
# 7 13514. 2 NaN NaN
# 8 13880. 2 NaN 3
# 9 14245. 2 NaN 3
# 10 14610 2 NaN 3
data.table
The following approach with data.table works quite fast for the data I have. I am expecting the size of the data to increase soon, and then I may have to rely on sparklyr.
library(data.table)
setDT(df_with_nas)
df_with_nas <- df_with_nas[order(time)]
cols <- c("v1", "v2")
df_with_nas[, (cols) := zoo::na.locf(.SD, na.rm = FALSE),
            by = grp, .SDcols = cols]
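Another local (non-Spark) option, assuming the tidyr package is available, is tidyr::fill(), which carries the last observation forward within each group (a sketch, not from the original answers):
library(dplyr)
library(tidyr)
df_with_nas %>%
  arrange(time) %>%
  group_by(grp) %>%
  fill(v1, v2, .direction = "down") %>%
  ungroup()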
I did this sort of loop; it's quite slow...
df_with_nas <- df_with_nas %>% mutate(row = 1:nrow(df_with_nas))
for (n in 1:50) {
  df_with_nas <- df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
}
Run it until no NA's remain; then
collect(df_with_nas)
will execute the computation.
You can leverage the spark_apply() function and run the na.locf function in each of your cluster nodes.
Install R runtimes on each of your cluster nodes.
Install the zoo R package on each node as well.
Run spark apply this way:
data_filled <- spark_apply(data_with_holes, function(df) zoo::na.locf(df))
You can do this quite quickly using SQL, with the added benefit that you can easily apply LOCF on a grouped basis. The pattern you want to use is LAST_VALUE(column, true) OVER (window) - this searches over the window for the most recent value of column which is not NA (passing true as the second argument to LAST_VALUE sets ignoreNulls to true). Since you want to look backwards from the current value, the window should be
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING
Of course, if the first value in the group is NA it will remain NA.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
test_table <- data.frame(
  v1 = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
  v2 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
) %>%
  sdf_copy_to(sc, ., "test_table")
spark_session(sc) %>%
  sparklyr::invoke("sql", "SELECT *, LAST_VALUE(v1, true)
                           OVER (PARTITION BY v2
                                 ORDER BY time
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING)
                           AS last_non_na
                           FROM test_table") %>%
  sdf_register() %>%
  mutate(v1 = ifelse(is.na(v1), last_non_na, v1))
#> # Source: spark<?> [?? x 4]
#> v1 v2 time last_non_na
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 NaN
#> 2 2 1 2 1
#> 3 2 1 3 2
#> 4 3 1 4 2
#> 5 3 1 5 3
#> 6 NaN 2 1 NaN
#> 7 5 2 2 NaN
#> 8 6 2 3 5
#> 9 6 2 4 6
Created on 2019-08-27 by the reprex package (v0.3.0)

Creating a new column which is a vector of other columns

I have a dataframe with two columns (V1 and V2), and I'd like to create another column which is a vector (via the combine function, c()) taking the other columns as arguments.
I'm using dplyr for all tasks, so I'd like to use it also in this context.
I've tried to create the new column with an apply-family function, but it returns a single vector containing all the rows (not a rowwise result), which surprises me because other functions work rowwise.
I've solved it using the function rowwise, but as it's not usually so efficient, I'd like to see if there's another option.
Here is the definition of the dataframe:
IDs <- structure(list(V1 = c("1", "1", "6"),
                      V2 = c("6", "8", "8")),
                 class = "data.frame",
                 row.names = c(NA, -3L))
Here is the creation of the columns (together1 is the wrong result, and together2 the correct one):
IDs <-
  IDs %>%
  mutate(together1 = list(mapply(function(x, y) c(x, y), V1, V2))) %>%
  rowwise() %>%
  mutate(together2 = list(mapply(function(x, y) c(x, y), V1, V2))) %>%
  ungroup()
Here are the printed results:
print(as.data.frame(IDs))
V1 V2 together1 together2
1 1 6 1, 6, 1, 8, 6, 8 1, 6
2 1 8 1, 6, 1, 8, 6, 8 1, 8
3 6 8 1, 6, 1, 8, 6, 8 6, 8
Thanks in advance!
You can do it with purrr's map2 function:
library(dplyr)
library(purrr)
IDs %>%
  mutate(together = map2(V1, V2, ~c(.x, .y)))
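As a small aside (my note): the lambda isn't strictly needed here, since c() already takes the two arguments directly:
IDs %>%
  mutate(together = map2(V1, V2, c))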
pmap() could be used here:
library(tidyverse)
IDs %>%
  mutate(together = pmap(unname(.), c))
# V1 V2 together
#1 1 6 1, 6
#2 1 8 1, 8
#3 6 8 6, 8
You've just missed the SIMPLIFY = FALSE in your mapply() call:
dplyr::mutate(IDs, together = mapply(c, V1, V2, SIMPLIFY = FALSE))
V1 V2 together
1 1 6 1, 6
2 1 8 1, 8
3 6 8 6, 8
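To see why the together1 attempt misbehaved (my illustration, not part of the original answer): with the default SIMPLIFY = TRUE, mapply() collapses the pairs into a 2 x 3 character matrix, and wrapping that single matrix in list() makes a length-one list column that mutate() recycles to every row:
mapply(function(x, y) c(x, y), IDs$V1, IDs$V2)
# 1 1 6
# [1,] "1" "1" "6"
# [2,] "6" "8" "8"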
