How to ignore NA when I paste columns using data.table?

I have a dataframe:
ID value1 value2
1 wad 11
2 NA NA
3 elf 1
When I do this:
dt[,new:=paste0(value1, value2)]
I get:
ID value1 value2 new
1 wad 11 wad11
2 NA NA NANA
3 elf 1 elf1
How to ignore NA? And remove value1 and value2? I want to get:
ID new
1 wad11
2 NA
3 elf1

A data.table option
setDT(df)[,
.(ID, new = do.call(paste, c(replace(.SD, is.na(.SD), ""), sep = ""))),
.SDcols = patterns("^value")
]
gives
ID new
1: 1 wad11
2: 2
3: 3 elf1
Data
> dput(df)
structure(list(ID = 1:3, value1 = c("wad", NA, "elf"), value2 = c(11L,
NA, 1L)), class = "data.frame", row.names = c(NA, -3L))

We can do
library(dplyr)
library(stringr) # provides str_c()
df %>%
transmute(ID, new = case_when(is.na(value1) & is.na(value2) ~ "",
TRUE ~str_c(value1, value2, sep='_')))
# ID new
#1 1 wad_11
#2 2
#3 3 elf_1

With data.table :
library(data.table)
setDT(df)
df[is.na(df)] <- ''
df[, new := paste0(value1, value2)][, .(ID, new)]
tidyr::unite has a one-liner solution to this :
tidyr::unite(df, new, value1, value2, na.rm = TRUE)
# ID new
#1 1 wad_11
#2 2
#3 3 elf_1
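The answers above leave an empty string in the all-NA row, whereas the question asks for NA there. A minimal base R sketch of that behaviour follows; the helper name paste_skip_na is hypothetical and not part of any package.

```r
df <- data.frame(ID = 1:3,
                 value1 = c("wad", NA, "elf"),
                 value2 = c(11L, NA, 1L),
                 stringsAsFactors = FALSE)

# Paste the non-missing values of each row; return NA when every value is missing.
# paste_skip_na is a hypothetical helper written for this illustration.
paste_skip_na <- function(...) {
  cols <- lapply(list(...), as.character)
  all_na <- Reduce(`&`, lapply(cols, is.na))                  # TRUE where all inputs are NA
  parts <- lapply(cols, function(x) ifelse(is.na(x), "", x))  # blank out single NAs
  out <- do.call(paste0, parts)
  out[all_na] <- NA_character_
  out
}

res <- data.frame(ID = df$ID,
                  new = paste_skip_na(df$value1, df$value2),
                  stringsAsFactors = FALSE)
res
```

The same helper could be called inside a data.table expression such as dt[, .(ID, new = paste_skip_na(value1, value2))].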

Related

Splitting values in a column

Sorry, I'm new to R, but I've got some data that looks like the following:
I'd like to count the number of times each object is mentioned in the findings. So the result would look like this:
I've tried tidyverse and separate but can't seem to get the hang of it. Any help would be amazing, thanks in advance!
To recreate my data:
df <- data.frame(
col_1 = paste0("image", 1:5),
findings = c("rock|cat|sun", "cat", "cat|dog|fish|sun", "sun", "dog|cat")
)
You can use separate_rows() and then count().
library(tidyverse)
df %>%
separate_rows(findings) %>%
count(findings)
# # A tibble: 5 x 2
# findings n
# <chr> <int>
# 1 cat 4
# 2 dog 2
# 3 fish 1
# 4 rock 1
# 5 sun 3
Data
df <- structure(list(col_1 = c("image_1", "image_2", "image_3", "image_4",
"image_5"), findings = c("rock|cat|sun", "cat", "cat|dog|fish|sun",
"sun", "dog|cat")), class = "data.frame", row.names = c(NA, -5L))
In base R:
as.data.frame(table(unlist(strsplit(df$col_2, "|", fixed = TRUE))))
# Var1 Freq
# 1 cat 4
# 2 dog 2
# 3 fish 1
# 4 rock 1
# 5 sun 3
Reproducible data (please provide it in your next post):
df <- data.frame(
col_1 = paste0("image", 1:5),
col_2 = c("rock|cat|sun", "cat", "cat|dog|fish|sun", "sun", "dog|cat")
)
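The fixed = TRUE argument in the base R answer matters because "|" is the alternation operator in a regular expression. A short sketch of the difference, using one value from the example data:

```r
x <- "rock|cat|sun"

# As a regex, "|" matches the empty string, so the input is split into single characters.
as_regex <- strsplit(x, "|")[[1]]

# Treating the pattern literally, or escaping it, splits on the pipe itself.
as_fixed <- strsplit(x, "|", fixed = TRUE)[[1]]
as_escaped <- strsplit(x, "\\|")[[1]]

length(as_regex)
as_fixed
```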
An option with cSplit
library(splitstackshape)
cSplit(df, 'col_2', 'long', sep="|")[, .N, col_2]
# col_2 N
#1: rock 1
#2: cat 4
#3: sun 3
#4: dog 2
#5: fish 1
data
df <- structure(list(col_1 = c("image1", "image2", "image3", "image4",
"image5"), col_2 = c("rock|cat|sun", "cat", "cat|dog|fish|sun",
"sun", "dog|cat")), class = "data.frame", row.names = c(NA, -5L
))
Using tidyverse:
df %>%
separate_rows(findings) %>%
group_by(findings) %>%
summarize(total_count_col=n())
First we convert the data into a long format using separate_rows, then group and count the number of rows with each finding.
Example:
df<-data.frame(col1=c(rep(letters[1:3],3),"d"),col2=c(rep("moose|cat|dog",9),"rock"), stringsAsFactors = FALSE)
df %>% separate_rows(col2) %>% group_by(col2) %>% summarize(total_count_col=n())
# A tibble: 4 x 2
col2 total_count_col
<chr> <int>
1 cat 9
2 dog 9
3 moose 9
4 rock 1

Check if column value is in between (range) of two other column values

I have a data frame that looks like this (Dataframe X):
id number found
1 5225 NA
2 2222 NA
3 3121 NA
I have another data frame that looks like this (Dataframe Y):
id number1 number2
1 4000 6000
3 2500 3300
3 7000 8000
What I want to do is this: for each value in the "number" column of Dataframe X, check whether it is equal to or between ANY of the "number1" and "number2" pairs of Dataframe Y. Additionally, the "id" of that pair must match the "id" in Dataframe X. If both conditions hold, I want to insert "YES" in the "found" column of the respective row in Dataframe X:
id number found
1 5225 YES
2 2222 NA
3 3121 YES
How would I go about doing this? Thanks for the help.
Using tidyverse functions, especially map_chr to iterate over each number:
library(tidyverse)
tbl1 <- read_table2(
"id number found
1 5225 NA
2 2222 NA
3 3121 NA"
)
tbl2 <- read_table2(
"id number1 number2
1 4000 6000
2 2500 3300
3 7000 8000"
)
tbl1 %>%
mutate(found = map_chr(
.x = number,
.f = ~ if_else(
condition = any(.x >= tbl2$number1 & .x <= tbl2$number2), # inclusive bounds, as the question asks
true = "YES",
false = NA_character_
)
))
#> # A tibble: 3 x 3
#> id number found
#> <int> <int> <chr>
#> 1 1 5225 YES
#> 2 2 2222 <NA>
#> 3 3 3121 YES
Created on 2018-10-18 by the reprex package (v0.2.0).
Here is an option using fuzzy_join
library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(X, Y[-1], by = c("number" = "number1", "number" = "number2"),
match_fun =list(`>=`, `<=`)) %>%
mutate(found = c(NA, "YES")[(!is.na(number1)) + 1]) %>%
select(names(X))
# id number found
#1 1 5225 YES
#2 2 2222 <NA>
#3 3 3121 YES
Or another option is a non-equi join with data.table
library(data.table)
setDT(X)[, found := NULL]
X[Y, found := "YES", on = .(number >= number1, number <= number2)]
X
# id number found
#1: 1 5225 YES
#2: 2 2222 <NA>
#3: 3 3121 YES
data
X <- structure(list(id = 1:3, number = c(5225L, 2222L, 3121L), found = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
Y <- structure(list(id = 1:3, number1 = c(4000L, 2500L, 7000L), number2 = c(6000L,
3300L, 8000L)), class = "data.frame", row.names = c(NA, -3L))
We can loop over each x$number using sapply and check if it lies in range of any of y$number1 and y$number2 and give the value accordingly.
x$found <- ifelse(sapply(x$number, function(p)
any(y$number1 <= p & y$number2 >= p)),"YES", NA)
x
# id number found
#1 1 5225 YES
#2 2 2222 <NA>
#3 3 3121 YES
Using the same logic but with replace
x$found <- replace(x$found,
sapply(x$number, function(p) any(y$number1 <= p & y$number2 >= p)), "YES")
EDIT
If we want to also compare the id value we could do
x$found <- ifelse(sapply(seq_along(x$number), function(i) {
inds <- y$number1 <= x$number[i] & y$number2 >= x$number[i]
any(inds & y$id == x$id[i]) # check the id of every matching range, not only the first
}), "YES", NA)
x$found
#[1] "YES" NA "YES"
Using sqldf:
library(sqldf)
sql <- "SELECT DISTINCT x.id, x.number, "
sql <- paste0(sql, "CASE WHEN y.id IS NOT NULL THEN 'YES' END AS found ")
sql <- paste0(sql, "FROM X x LEFT JOIN Y y ON x.number BETWEEN y.number1 AND y.number2")
X <- sqldf(sql)

Dynamically select all columns but among ones that start with a certain word exclude all but keep one

I have many data frames that come in such a format:
df1 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2, FORMULA_UNITE = 1:2,
FORMULA_CALCULATE = 1:2, FORMULA_JOIN = 1:2), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
FORMULA_META = c(NA, NA), FORMULA_DATA = 1:2, FORMULA_JOIN = 1:2,
FORMULA_TRANSFORM = 1:2, Group = 1:2), class = "data.frame", row.names = c(NA,
-2L))
View:
df1
ID Name Gender Group FORMULA_RULE FORMULA_TRANSFORM FORMULA_UNITE FORMULA_CALCULATE FORMULA_JOIN
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
df2
ID Name Gender FORMULA_RULE FORMULA_META FORMULA_DATA FORMULA_JOIN FORMULA_TRANSFORM Group
1 1 1 1 1 NA 1 1 1 1
2 2 2 2 2 NA 2 2 2 2
I want to write code that works on all such data frames so that all columns are kept, but among the columns that start with FORMULA_, only FORMULA_TRANSFORM is selected. Please note that the columns that do NOT start with FORMULA_ are not always the same. That is to say, I cannot simply write code that always selects ID, Name, Gender, Group, and FORMULA_TRANSFORM, because some data frames contain many other columns that do not start with FORMULA_, and I want to keep those.
My attempt to solve this problem is this ugly code which works as expected:
library(tidyverse)
for(i in 1:length(ls(pattern = "df"))){
get(paste0("df", i)) %>%
select(-starts_with("FORMULA"),
(names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T))[!names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T) %in% "FORMULA_TRANSFORM"]) %>%
print
}
Is there a more straight-forward way to do this?
With dplyr we can use select, and it's pretty straightforward using starts_with and contains.
library(dplyr)
df1 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
Let's try with a dataframe without "FORMULA_TRANSFORM" column
df3 <- df1
df3$FORMULA_TRANSFORM <- NULL
df3 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
With minus sign we are removing the columns that starts_with "FORMULA_" and selecting the one with "FORMULA_TRANSFORM". Instead of contains we can also use one_of() or matches() and it would still work.
Using base R we can use grep with invert and value set as TRUE
df1[c(grep("^FORMULA_", names(df1), invert = TRUE, value = TRUE),
"FORMULA_TRANSFORM")]
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
This creates a vector of column names where column name doesn't start with "FORMULA_" and we add "FORMULA_TRANSFORM" manually later.
The above method assumes that the "FORMULA_TRANSFORM" column is always present in your dataframe, and it will fail if it isn't. A safer option would be
get_selected_cols <- function(df1) {
cbind(df1[grep("^FORMULA_", names(df1), invert = TRUE)],
df1[names(df1) == "FORMULA_TRANSFORM"])
}
get_selected_cols(df1)
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
get_selected_cols(df3)
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
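To apply the same selection to many data frames without get() and paste0(), one option is to keep them in a list and use lapply. The sketch below is one possible arrangement; the helper name keep_transform and the list dfs are hypothetical. It keeps the original column order and does not fail when FORMULA_TRANSFORM is absent.

```r
df1 <- data.frame(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
                  FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2, FORMULA_UNITE = 1:2,
                  FORMULA_CALCULATE = 1:2, FORMULA_JOIN = 1:2)
df2 <- data.frame(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
                  FORMULA_META = c(NA, NA), FORMULA_DATA = 1:2, FORMULA_JOIN = 1:2,
                  FORMULA_TRANSFORM = 1:2, Group = 1:2)

# Keep every column that does not start with FORMULA_, plus FORMULA_TRANSFORM if present.
# keep_transform is a hypothetical helper written for this illustration.
keep_transform <- function(df) {
  nm <- names(df)
  df[!startsWith(nm, "FORMULA_") | nm == "FORMULA_TRANSFORM"]
}

dfs <- list(df1 = df1, df2 = df2)
res <- lapply(dfs, keep_transform)
lapply(res, names)
```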

extract key value pairs from R dataframe column

I have a dataframe with two columns. An ID column and a character column containing key value pairs delimited by semicolon.
ID | KeyValPairs
1 | "zx=1; ds=4; xx=6"
2 | "qw=5; df=2"
. | ....
I want to turn this into a dataframe with three columns
ID | Key | Val
1 | zx | 1
1 | ds | 4
1 | xx | 6
2 | qw | 5
2 | df | 2
There is no fixed number of key value pairs in the KeyValPairs column, and no closed set of possible keys. I have been goofing around with solutions that involve looping and reinserting into an empty dataframe, but it is not working properly and I am told I should avoid loops in R.
A tidyr and dplyr approach:
tidyr
library(tidyr)
library(reshape2)
s <- separate(df, KeyValPairs, paste0("kv", 1:3), sep=";") # into must be a character vector of column names
m <- melt(s, id.vars="ID")
out <- separate(m, value, c("Key", "Val"), sep="=")
na.omit(out[order(out$ID),][-2])
# ID Key Val
# 1 1 zx 1
# 3 1 ds 4
# 5 1 xx 6
# 2 2 qw 5
# 4 2 df 2
dplyrish
library(tidyr)
library(dplyr)
df %>%
mutate(KeyValPairs = strsplit(as.character(KeyValPairs), "; ")) %>%
unnest(KeyValPairs) %>%
separate(KeyValPairs, into = c("key", "val"), "=")
# courtesy of @jeremycg
Data
df <- structure(list(ID = c(1, 2), KeyValPairs = structure(c(2L, 1L
), .Label = c(" qw=5; df=2", " zx=1; ds=4; xx=6"), class = "factor")), .Names = c("ID",
"KeyValPairs"), class = "data.frame", row.names = c(NA, -2L))
A data.table solution, just to use tstrsplit:
library(data.table) # V 1.9.6+
setDT(df)[, .(key = unlist(strsplit(as.character(KeyValPairs), ";"))), by = ID
][, c("Key", "Val") := tstrsplit(key, "=")
][, key := NULL][]
# ID Key Val
#1: 1 zx 1
#2: 1 ds 4
#3: 1 xx 6
#4: 2 qw 5
#5: 2 df 2
Maybe also a case for {splitstackshape} from @AnandaMahto:
df <- read.table(sep = "|", header = TRUE, text = '
ID | KeyValPairs
1 | "zx=1; ds=4; xx=6"
2 | "qw=5; df=2"')
library(splitstackshape)
setNames(
cSplit(cSplit(df, 2, ";", "long"), 2, "="),
c("id", "key", "val")
)
# id key val
# 1: 1 zx 1
# 2: 1 ds 4
# 3: 1 xx 6
# 4: 2 qw 5
# 5: 2 df 2
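For comparison, the same reshaping can be sketched in base R without loops. This assumes every pair contains exactly one "=" and that pairs are separated by a semicolon with optional whitespace:

```r
df <- data.frame(ID = c(1, 2),
                 KeyValPairs = c("zx=1; ds=4; xx=6", "qw=5; df=2"),
                 stringsAsFactors = FALSE)

# Split each row into its "key=value" pieces.
pairs <- strsplit(df$KeyValPairs, ";\\s*")

# Flatten the pieces and split each one into key and value (one "=" per pair assumed).
flat <- trimws(unlist(pairs))
kv <- do.call(rbind, strsplit(flat, "=", fixed = TRUE))

# Repeat each ID once per pair found in its row.
out <- data.frame(ID = rep(df$ID, lengths(pairs)),
                  Key = kv[, 1],
                  Val = kv[, 2],
                  stringsAsFactors = FALSE)
out
```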

drop levels of factor for which there is one missing value for one column r

I would like to drop any occurrence of a factor level for which one row contains a missing value
Example:
ID var1 var2
1 1 2
1 NA 3
2 1 2
2 2 4
So, in this hypothetical, what would be left would be:
ID var1 var2
2 1 2
2 2 4
Here's a possible data.table solution (sorry @rawr)
library(data.table)
setDT(df)[, if (all(!is.na(.SD))) .SD, ID]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
If you only want to check var1 then
df[, if (all(!is.na(var1))) .SD, ID]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
Assuming that NAs would occur in both var columns,
df[with(df, !ave(!!rowSums(is.na(df[,-1])), ID, FUN=any)),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
Or if it is only specific to var1
df[with(df, !ave(is.na(var1), ID, FUN=any)),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
Or using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(!is.na(var1)))
# ID var1 var2
#1 2 1 2
#2 2 2 4
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L), var1 = c(1L, NA, 1L, 2L
), var2 = c(2L, 3L, 2L, 4L)), .Names = c("ID", "var1", "var2"
), class = "data.frame", row.names = c(NA, -4L))
Here's one more option in base R. It will check all columns for NAs.
df[!df$ID %in% df$ID[rowSums(is.na(df)) > 0],]
# ID var1 var2
#3 2 1 2
#4 2 2 4
If you only want to check in column "var1" you can do:
df[!with(df, ID %in% ID[is.na(var1)]),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
In the current development version of data.table, there's a new implementation of na.omit for data.tables, which takes cols = and invert = arguments.
The cols = argument specifies the columns in which to look for NAs, and invert = TRUE returns the NA rows instead of omitting them.
You can install the devel version by following these instructions. Or you can wait for 1.9.6 on CRAN at some point. Using that, we can do:
require(data.table) ## 1.9.5+
setkey(setDT(df), ID)
df[!na.omit(df, invert = TRUE)]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
How this works:
setDT converts data.frame to data.table by reference.
setkey sorts the data.table by the columns provided and marks those columns as key columns so that we can perform a join.
na.omit(df, invert = TRUE) gives just those rows that have NA anywhere.
X[!Y] does an anti-join by joining on the key column ID, and returns all the rows that don't match ID = 1 (from Y). Check this post to read in detail about data.table's joins.
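The steps above can also be sketched without setting a key, by restricting the NA check to one column with cols = and naming the join column with on =. This assumes a data.table version that supports both arguments (1.9.6 or later); the variable names dt, bad and res are only for illustration.

```r
library(data.table)

dt <- data.table(ID = c(1L, 1L, 2L, 2L),
                 var1 = c(1L, NA, 1L, 2L),
                 var2 = c(2L, 3L, 2L, 4L))

# Rows that have an NA in var1 (invert = TRUE returns them instead of dropping them).
bad <- na.omit(dt, cols = "var1", invert = TRUE)

# Anti-join on ID: drop every row whose ID appears among the NA rows.
res <- dt[!bad, on = "ID"]
res
```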
HTH
