Grab rows where the second column is equal to a value - r

I have a dataset which looks something like this:
print(animals_in_zoo)
// I only know the name of the first column, the second one is dynamic/based on a previously calculated variable
animals | dynamic_column_name
// What the data looks like
elefant x
turtle
monkey
giraffe x
swan
tiger x
What I want is to collect the rows in which the second column's value is equal to "x".
What I want to do is something like:
SELECT * from data where col2 == "x";
After that, I want to grab only the first column and create a string object like "elefant giraffe tiger", but that is the easy part.

You can reference that column by its index and use that to get the animals you want:
df1 <- structure(list(animal = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), dynamic_column = c("x", NA, NA, "x", NA, "x"
)), row.names = c(NA, -6L), class = "data.frame")
df1[, 1][df1[, 2] == "x" & !is.na(df1[, 2])]
#> [1] "elefant" "giraffe" "tiger"
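The question also mentions that the second column's name is only known at run time and that the final result should be a single string. A minimal sketch of both steps, assuming the same example data; `col_name` is a hypothetical variable standing in for the previously calculated name:

```r
df1 <- data.frame(
  animal = c("elefant", "turtle", "monkey", "giraffe", "swan", "tiger"),
  dynamic_column = c("x", NA, NA, "x", NA, "x"),
  stringsAsFactors = FALSE
)

# Hypothetical: the second column's name is only known at run time
col_name <- names(df1)[2]

# which() drops NA comparisons, so no separate is.na() check is needed
hits <- df1[which(df1[[col_name]] == "x"), "animal"]

# Collapse the matches into a single space-separated string
result <- paste(hits, collapse = " ")
result
```

Using `[[` with a character variable avoids hard-coding the column name, which `$` would require.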

We could use filter with grepl which searches for a pattern 'x' in the string:
# the data frame
df <- read.table(header = TRUE, text =
'my_col
"elefant x"
turtle
monkey
"giraffe x"
swan
"tiger x"'
)
library(dplyr)
df %>%
filter(grepl('x', my_col))
my_col
1 elefant x
2 giraffe x
3 tiger x

Use [: the first argument refers to the rows. You want the rows where the second column is "x". The second argument is the column you need in the end, and you want the column named "animals":
dat[dat[2] == "x", "animals"]
#[1] "elefant" "giraffe" "tiger"
data
dat <- structure(list(animals = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), V2 = c("x", "", "", "x", "", "x")), row.names = c(NA,
-6L), class = "data.frame")
# animals V2
# 1 elefant x
# 2 turtle
# 3 monkey
# 4 giraffe x
# 5 swan
# 6 tiger x

I guess you have a dataframe?
If so, something like df[df$col2 == 'x',] should work.

With base functions, you can do it like this:
# Option 1
your_dataframe[your_dataframe$col2 == "x", ]
# Option 2
your_dataframe[your_dataframe[,2] == "x", ]
With dplyr functions, you can do it like this:
library(dplyr)
your_dataframe %>%
filter(col2 == "x")
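One caveat with the base R options: if the column contains `NA`, a plain `==` comparison returns `NA` for those rows, and `[` then returns rows filled with `NA`. (dplyr's `filter()` drops such rows on its own.) A small sketch of the difference, using made-up data with the same placeholder column names:

```r
your_dataframe <- data.frame(
  col1 = c("elefant", "turtle", "giraffe"),
  col2 = c("x", NA, "x"),
  stringsAsFactors = FALSE
)

# The comparison is NA for the missing value, and an NA row index
# produces a row in which every cell is NA
with_na_row <- your_dataframe[your_dataframe$col2 == "x", ]
nrow(with_na_row)

# %in% never returns NA, so only real matches are kept
matches <- your_dataframe[your_dataframe$col2 %in% "x", ]
nrow(matches)
```

Wrapping the comparison in `which()` is another way to get the same result as `%in%` here.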

Related

create new variables in a dataframe from existing variables using lists of names

I was wondering if there was a way to construct a function to do the following:
Original dataframe, df:
Obs Col1 Col2
1 Y NA
2 NA Y
3 Y Y
Modified dataframe, df:
Obs Col1 Col2 Col1_YN Col2_YN
1 Y NA "Yes" "No"
2 NA Y "No" "Yes"
3 Y Y "Yes" "Yes"
The following code works just fine to create the new variables but I have lots of original columns with this structure and the “Yes” “No” format works better when constructing tables.
df$Col1_YN <- as.factor(ifelse(is.na(df$Col1), "No", "Yes"))
df$Col2_YN <- as.factor(ifelse(is.na(df$Col2), "No", "Yes"))
I was thinking along the lines of defining lists of input and output columns to be passed to a function, or using lapply but haven’t figured out how to do this.
We can use across to loop over the columns and create the new columns by modifying the .names
library(dplyr)
df1 <- df %>%
mutate(across(-Obs,
~ case_when(. %in% "Y" ~ "Yes", TRUE ~ "No"), .names = "{.col}_YN"))
-output
df1
Obs Col1 Col2 Col1_YN Col2_YN
1 1 Y <NA> Yes No
2 2 <NA> Y No Yes
3 3 Y Y Yes Yes
If we want to use lapply, loop over the columns of interest, apply the ifelse and assign it back to new columns by creating a vector of new names with paste
df[paste0(names(df)[-1], "_YN")] <-
lapply(df[-1], \(x) ifelse(is.na(x), "No", "Yes"))
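Since the question asks for a function that takes a list of column names, the lapply approach can be wrapped up roughly as follows. This is only a sketch: `add_yn` and its `suffix` argument are hypothetical names, and the result is kept as a factor to match the question's original code.

```r
# Hypothetical helper: add a "<col>_YN" factor column for each name in `cols`
add_yn <- function(data, cols, suffix = "_YN") {
  data[paste0(cols, suffix)] <- lapply(
    data[cols],
    function(x) factor(ifelse(is.na(x), "No", "Yes"), levels = c("No", "Yes"))
  )
  data
}

df <- data.frame(
  Obs = 1:3,
  Col1 = c("Y", NA, "Y"),
  Col2 = c(NA, "Y", "Y"),
  stringsAsFactors = FALSE
)

out <- add_yn(df, c("Col1", "Col2"))
out
```

Passing the column names explicitly keeps `Obs` and any other unrelated columns untouched.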
data
df <- structure(list(Obs = 1:3, Col1 = c("Y", NA, "Y"), Col2 = c(NA,
"Y", "Y")), class = "data.frame", row.names = c(NA, -3L))

How to remove NA from certain columns only using R

I have data with many columns, but I want to remove NA values from specific columns only.
If I have a df such as the one below
structure(list(id = c(1, 1, 2, 2), admission = c("2001/01/01",
"2001/03/01", "NA", "2005/01/01"), discharged = c("2001/01/07",
"NA", "NA", "2005/01/03")), class = "data.frame", row.names = c(NA,
-4L))
and I want to exclude the records for each id that contain NAs, to get a df such as the one below
structure(list(id2 = c(1, 2), admission2 = c("2001/01/01", "2005/01/01"
), discharged2 = c("2001/01/07", "2005/01/03")), class = "data.frame", row.names = c(NA,
-2L))
The NAs in the data are "NA". Convert to NA with is.na and then use na.omit
is.na(df1) <- df1 == "NA"
df1 <- na.omit(df1)
df1
id admission discharged
1 1 2001/01/01 2001/01/07
4 2 2005/01/01 2005/01/03
If it is specific columns, use
df1[complete.cases(df1[c("admission", "discharged")]),]
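To illustrate why restricting to specific columns matters, here is a minimal sketch that assumes an extra column `note` (hypothetical, not in the original data) whose missing values should not cause a row to be dropped:

```r
df1 <- data.frame(
  id = c(1, 1, 2, 2),
  admission = c("2001/01/01", "2001/03/01", "NA", "2005/01/01"),
  discharged = c("2001/01/07", "NA", "NA", "2005/01/03"),
  note = c(NA, "a", "b", "c"),  # hypothetical extra column
  stringsAsFactors = FALSE
)

cols <- c("admission", "discharged")

# Turn the literal string "NA" into a real missing value, in these columns only
df1[cols] <- lapply(df1[cols], function(x) replace(x, x == "NA", NA))

# Keep rows that are complete in the chosen columns; `note` is ignored
kept <- df1[complete.cases(df1[cols]), ]
kept
```

With this data, `na.omit(df1)` would also drop the first row because of the missing `note`, whereas the `complete.cases()` call keeps it.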
Base R option:
character "NA" to NA
then use complete.cases
df[df=="NA"] <- NA
dfNA <- complete.cases(df)
df[dfNA,]
id admission discharged
1 1 2001/01/01 2001/01/07
4 2 2005/01/01 2005/01/03
We can also do it like this:
library(tidyr)
df1[df1=="NA"] <- NA
df1_new <- df1 %>% drop_na()
df1_new
id admission discharged
1 1 2001/01/01 2001/01/07
2 2 2005/01/01 2005/01/03

If string has a certain character, fill an empty cell in the same row with a certain value

Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+",
"\\1\\2"), "(.)\\1+", "\\1"))
-output
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
Or another option is str_extract_all to extract the upper case letters, loop over the list with map, paste the unique elements
library(purrr)
df1 %>%
mutate(b = str_extract_all(a, "[A-Z]") %>%
map_chr(~ str_c(unique(.x), collapse="")))
Or using a corresponding base R option for the first tidyverse option
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
function(x) paste(unique(x), collapse=""))
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
b = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))

Unexpected results using str_split and union in a function with sapply

Given this data.frame:
library(dplyr)
library(stringr)
ml.mat2 <- structure(list(value = c("a", "b", "c"), ground_truth = c("label1, label3",
"label2", "label1"), predicted = c("label1", "label2,label3",
"label1")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
glimpse(ml.mat2)
Observations: 3
Variables: 3
$ value <chr> "a", "b", "c"
$ ground_truth <chr> "label1, label3", "label2", "label1"
$ predicted <chr> "label1", "label2,label3", "label1"
I want to measure the length of the intersect between ground_truth and predicted for each row, after splitting the repeated labels based on ,.
In other words, I would expect a result of length 3 with values of 2 2 1.
I wrote a function to do this, but it only seems to work outside of sapply:
m_fn <- function(x,y) length(union(unlist(sapply(x, str_split,",")),
unlist(sapply(y, str_split,","))))
m_fn(ml.mat2$ground_truth[1], y = ml.mat2$predicted[1])
[1] 2
m_fn(ml.mat2$ground_truth[2], y = ml.mat2$predicted[2])
[1] 2
m_fn(ml.mat2$ground_truth[3], y = ml.mat2$predicted[3])
[1] 1
Rather than iterating through the rows of the data set manually like this or with a loop, I would expect to be able to vectorize the solution with sapply like this:
sapply(ml.mat2$ground_truth, m_fn, ml.mat2$predicted)
However, the unexpected results are:
label1, label3 label2 label1
4 3 3
Since you're iterating over vectors of the same length, you can generate an index of row numbers and use it in your sapply:
sapply(1:nrow(ml.mat2), function(i) m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
#[1] 2 2 1
or with seq_len:
sapply(seq_len(nrow(ml.mat2)), function(i)
m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
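The unexpected 4 3 3 comes from sapply passing the whole `predicted` vector as `y` on every call, so each `ground_truth` entry is combined with all predicted labels at once. An alternative to indexing is `mapply`, which walks both vectors in parallel. The sketch below is base R only: `n_union` is a hypothetical stand-in for `m_fn` that uses `strsplit` and `trimws` instead of `str_split`.

```r
ground_truth <- c("label1, label3", "label2", "label1")
predicted    <- c("label1", "label2,label3", "label1")

# Hypothetical helper: split on commas, trim spaces, count the distinct labels
n_union <- function(x, y) {
  split_labels <- function(s) trimws(strsplit(s, ",", fixed = TRUE)[[1]])
  length(union(split_labels(x), split_labels(y)))
}

# mapply() calls n_union once per pair (ground_truth[i], predicted[i])
counts <- mapply(n_union, ground_truth, predicted, USE.NAMES = FALSE)
counts
```

Trimming whitespace is a choice made here so that " label3" and "label3" count as the same label; the original `m_fn` does not do this.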

replacing blank not NA

I have two variables a and b
a b
vessel hot
parts
nest NA
best true
neat smooth
I want to replace blank in b with a
la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
But it is not working
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), specify the condition in 'i' (b==''), and assign the values of 'a' that corresponds to TRUE values in 'i' to 'b'. It should be fast as we are assigning in place.
library(data.table)
setDT(df1)[b=='', b:= a]
df1
# a b
#1: vessel hot
#2: parts parts
#3: nest NA
#4: best true
#5: neat smooth
Or we can just use base R
i1 <- df1$b=='' & !is.na(df1$b)
df1$b[i1] <- df1$a[i1]
data
df1 <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a", "b"
), class = "data.frame", row.names = c(NA, -5L))
instead of
# la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
# what is i1? it doesn't seem to have any obvious function here
... it should be:
la$b <- ifelse(la$b == "", la$a, la$b)
assuming that you want to replace blank in b with a and that applies to all blanks
it works:
df <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a",
"b"), row.names = c(NA, -5L), class = "data.frame")
df$b <- ifelse(df$b=="", df$a, df$b)
# or, with `with`: df$b <- with(df, ifelse(b=="",a,b))
# > df
# a b
# 1 vessel hot
# 2 parts parts
# 3 nest <NA>
# 4 best true
# 5 neat smooth
