How to remove NA from certain columns only using R

I have data with many columns, but I want to remove rows that have NA values in specific columns.
Suppose I have a df such as the one below:
structure(list(id = c(1, 1, 2, 2), admission = c("2001/01/01",
"2001/03/01", "NA", "2005/01/01"), discharged = c("2001/01/07",
"NA", "NA", "2005/01/03")), class = "data.frame", row.names = c(NA,
-4L))
and I want to exclude the records for each id that contain NAs, to get a df such as the one below:
structure(list(id2 = c(1, 2), admission2 = c("2001/01/01", "2005/01/01"
), discharged2 = c("2001/01/07", "2005/01/03")), class = "data.frame", row.names = c(NA,
-2L))

The NAs in the data are the string "NA", not real missing values. Convert them to NA with is.na<- and then use na.omit:
is.na(df1) <- df1 == "NA"
df1 <- na.omit(df1)
df1
id admission discharged
1 1 2001/01/01 2001/01/07
4 2 2005/01/01 2005/01/03
If it is specific columns, use
df1[complete.cases(df1[c("admission", "discharged")]),]
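A minimal sketch of the column-specific variant, assuming the data frame from the question is called df1 and that only admission and discharged should be checked. The vector name cols and the object name result are made up for illustration:

```r
# Data from the question: missing values are stored as the string "NA"
df1 <- data.frame(
  id = c(1, 1, 2, 2),
  admission = c("2001/01/01", "2001/03/01", "NA", "2005/01/01"),
  discharged = c("2001/01/07", "NA", "NA", "2005/01/03"),
  stringsAsFactors = FALSE
)

# Hypothetical name: the columns that must not contain missing values
cols <- c("admission", "discharged")

# Turn the string "NA" into a real NA, but only in the chosen columns
df1[cols] <- lapply(df1[cols], function(x) replace(x, x == "NA", NA))

# Keep rows that are complete in those columns; other columns are not checked
result <- df1[complete.cases(df1[cols]), ]
result
```

Unlike na.omit(df1), this leaves rows alone when the only NA sits in a column outside cols.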

Base R option: convert the character "NA" to NA, then use complete.cases:
df[df=="NA"] <- NA
dfNA <- complete.cases(df)
df[dfNA,]
id admission discharged
1 1 2001/01/01 2001/01/07
4 2 2005/01/01 2005/01/03

We can also do it like this:
library(tidyr)
df1[df1=="NA"] <- NA
df1_new <- df1 %>% drop_na()
df1_new
id admission discharged
1 1 2001/01/01 2001/01/07
2 2 2005/01/01 2005/01/03

Related

Grabs rows where second column is equal to value

I have a dataset which looks something like this:
print(animals_in_zoo)
// I only know the name of the first column, the second one is dynamic/based on a previously calculated variable
animals | dynamic_column_name
// What the data looks like
elefant x
turtle
monkey
giraffe x
swan
tiger x
What I want is to collect the rows in which the second column's value is equal to "x".
What I want to do is something like:
SELECT * from data where col2 == "x";
After that, I want to grab only the first column and create a string object like "elefant giraffe tiger", but that is the easy part.
You can reference that column by its index and use that to get the animals you want:
df1 <- structure(list(animal = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), dynamic_column = c("x", NA, NA, "x", NA, "x"
)), row.names = c(NA, -6L), class = "data.frame")
df1[, 1][df1[, 2] == "x" & !is.na(df1[, 2])]
#> [1] "elefant" "giraffe" "tiger"
We could use filter with grepl which searches for a pattern 'x' in the string:
# the data frame
df <- read.table(header = TRUE, text =
'my_col
"elefant x"
turtle
monkey
"giraffe x"
swan
"tiger x"'
)
library(dplyr)
df %>%
filter(grepl('x', my_col))
my_col
1 elefant x
2 giraffe x
3 tiger x
Use [: the first argument refers to the rows. You want the rows where the second column is "x". The second argument is the column you need in the end, and you want the column named "animals":
dat[dat[2] == "x", "animals"]
#[1] "elefant" "giraffe" "tiger"
data
dat <- structure(list(animals = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), V2 = c("x", "", "", "x", "", "x")), row.names = c(NA,
-6L), class = "data.frame")
# animals V2
# 1 elefant x
# 2 turtle
# 3 monkey
# 4 giraffe x
# 5 swan
# 6 tiger x
I guess you have a dataframe?
If so, something like df[df$col2 == 'x',] should work.
With base functions, you can do it like this:
# Option 1
your_dataframe[your_dataframe$col2 == "x", ]
# Option 2
your_dataframe[your_dataframe[,2] == "x", ]
With dplyr functions, you can do it like this:
library(dplyr)
your_dataframe %>%
filter(col2 == "x")
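Since the question says the second column's name is only known at run time, here is a small sketch that looks the column up by a name held in a variable and then builds the space-separated string. The variable names col_name, keep and animal_string are hypothetical, and the data are reconstructed from the question under the assumption that empty cells are NA:

```r
animals_in_zoo <- data.frame(
  animals = c("elefant", "turtle", "monkey", "giraffe", "swan", "tiger"),
  dynamic_column_name = c("x", NA, NA, "x", NA, "x"),
  stringsAsFactors = FALSE
)

# Hypothetical variable holding the column name computed earlier
col_name <- "dynamic_column_name"

# [[ ]] looks a column up by the string in col_name; %in% returns FALSE for NA,
# so no separate is.na() check is needed
keep <- animals_in_zoo[[col_name]] %in% "x"

# Collapse the matching animals into one space-separated string
animal_string <- paste(animals_in_zoo$animals[keep], collapse = " ")
animal_string
```

If only the position is known, animals_in_zoo[[2]] can be used in place of animals_in_zoo[[col_name]].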

R: adding matching vector values from two dataframes in one column

I have a data frame which is configured roughly like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(0,0,0))
words
frequency
count
hello
7
0
yes
8
0
example
5
0
What I'm trying to do is add values to the third column from a different data frame, which is similar but looks like this:
df2 <- cbind(c('example','hello') ,c(5,6))
words
frequency
example
5
hello
6
My goal is to find matching values for the first column in both data frames (they have the same column name) and add matching values from the second data frame to the third column of the first data frame.
The result should look like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(6,0,5))
words
frequency
count
hello
7
6
yes
8
0
example
5
5
What I've tried so far is:
df <- merge(df,df2, by = "words", all.x=TRUE)
However, it doesn't work.
I could use some help understanding how could it be done. Any help will be welcome.
This is an "update join". My favorite way to do it is in dplyr:
library(dplyr)
df %>% rows_update(rename(df2, count = frequency), by = "words")
In base R you could do the same thing like this:
names(df2)[2] = "count2"
df = merge(df, df2, by = "words", all.x=TRUE)
df$count = ifelse(is.na(df$count2), df$count, df$count2)
df$count2 = NULL
Here is an option with data.table:
library(data.table)
setDT(df)[setDT(df2), on = "words", count := i.frequency]
Output
words frequency count
<char> <num> <num>
1: hello 7 6
2: yes 8 0
3: example 5 5
Or using match in base R:
df$count[match(df2$words, df$words)] <- df2$frequency
Or another option with tidyverse using left_join and pmax, which keeps the larger of the two counts and ignores NA:
library(tidyverse)
left_join(df, df2 %>% rename(count.y = frequency), by = "words") %>%
  mutate(count = pmax(count.y, count, na.rm = TRUE)) %>%
  select(-count.y)
Data
df <- structure(list(words = c("hello", "yes", "example"), frequency = c(7,
8, 5), count = c(0, 0, 0)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(words = c("example", "hello"), frequency = c(5, 6)), class = "data.frame", row.names = c(NA,
-2L))
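One likely reason the merge in the question fails is that cbind() on plain vectors returns a character matrix without column names, so there is no words column to merge by. The sketch below shows that, then repeats the match idea on proper data frames, indexing from the df side so that words missing from df2 keep their old count. The names m and idx are made up for illustration:

```r
# cbind() on plain vectors gives a character matrix with no column names
m <- cbind(c("hello", "yes", "example"), c(7, 8, 5), c(0, 0, 0))
is.matrix(m)
colnames(m)

# Build real data frames with named columns instead
df <- data.frame(words = c("hello", "yes", "example"),
                 frequency = c(7, 8, 5),
                 count = c(0, 0, 0),
                 stringsAsFactors = FALSE)
df2 <- data.frame(words = c("example", "hello"),
                  frequency = c(5, 6),
                  stringsAsFactors = FALSE)

# Position of each df word in df2 (NA when the word is absent from df2)
idx <- match(df$words, df2$words)

# Overwrite count only where a match exists
found <- !is.na(idx)
df$count[found] <- df2$frequency[idx[found]]
df
```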

create new variables in a dataframe from existing variables using lists of names

I was wondering if there was a way to construct a function to do the following:
Original dataframe, df:
Obs Col1 Col2
1 Y NA
2 NA Y
3 Y Y
Modified dataframe, df:
Obs Col1 Col2 Col1_YN Col2_YN
1 Y NA "Yes" "No"
2 NA Y "No" "Yes"
3 Y Y "Yes" "Yes"
The following code works just fine to create the new variables but I have lots of original columns with this structure and the “Yes” “No” format works better when constructing tables.
df$Col1_YN <- as.factor(ifelse(is.na(df$Col1), "No", "Yes"))
df$Col2_YN <- as.factor(ifelse(is.na(df$Col2), "No", "Yes"))
I was thinking along the lines of defining lists of input and output columns to be passed to a function, or using lapply but haven’t figured out how to do this.
We can use across to loop over the columns and create the new columns, naming them through the .names argument:
library(dplyr)
df1 <- df %>%
mutate(across(-Obs,
~ case_when(. %in% "Y" ~ "Yes", TRUE ~ "No"), .names = "{.col}_YN"))
-output
df1
Obs Col1 Col2 Col1_YN Col2_YN
1 1 Y <NA> Yes No
2 2 <NA> Y No Yes
3 3 Y Y Yes Yes
If we want to use lapply, loop over the columns of interest, apply the ifelse and assign it back to new columns by creating a vector of new names with paste
df[paste0(names(df)[-1], "_YN")] <-
lapply(df[-1], \(x) ifelse(is.na(x), "No", "Yes"))
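The same lapply idea can be wrapped in a function that takes a vector of column names, which is closer to what the question asks for. This is only a sketch: add_yn, cols and suffix are hypothetical names, and the result is stored as a factor because the question used as.factor:

```r
df <- data.frame(Obs = 1:3,
                 Col1 = c("Y", NA, "Y"),
                 Col2 = c(NA, "Y", "Y"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: for each column named in cols, add a factor column
# called "<column><suffix>" that is "No" where the value is NA, else "Yes"
add_yn <- function(data, cols, suffix = "_YN") {
  data[paste0(cols, suffix)] <- lapply(
    data[cols],
    function(x) factor(ifelse(is.na(x), "No", "Yes"), levels = c("No", "Yes"))
  )
  data
}

df <- add_yn(df, c("Col1", "Col2"))
df
```

Fixing the levels to c("No", "Yes") keeps both categories in later tables even when a column happens to contain only one of them.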
data
df <- structure(list(Obs = 1:3, Col1 = c("Y", NA, "Y"), Col2 = c(NA,
"Y", "Y")), class = "data.frame", row.names = c(NA, -3L))

Reference column name from another table to insert value from shared ID

I am attempting to reference a column in another dataframe (df2) based on a column of column names in my primary dataframe (df1$reference) to return the exact value where the ID and column name match in a new column (df1$new_col). Here is a simplified example of my data and the desired outcome:
I have tried rbind but ran into errors due to differences in the number of rows/columns. I also tried bind_rows/bind_cols, but am having a hard time joining only the referenced data (I would like to avoid a large join because my real data has many more columns). I feel like indexing would be able to accomplish this but I am not as familiar with indexing outside of simple tasks. I'm open to any and all suggestions / approaches!
We may use row/column indexing
DF1$new_col <- DF2[-1][cbind(seq_len(nrow(DF1)),
match(DF1$reference, names(DF2)[-1]))]
-output
> DF1
ID value reference new_col
1 1 4 colD no
2 2 5 colD no
3 3 6 colE no
data
DF1 <- structure(list(ID = 1:3, value = 4:6, reference = c("colD", "colD",
"colE")), class = "data.frame", row.names = c(NA, -3L))
DF2 <- structure(list(ID = 1:3, colD = c("no", "no", "yes"), colE = c("yes",
"no", "no"), colF = c("no", "yes", "no")),
class = "data.frame", row.names = c(NA,
-3L))
Maybe something like this?
library(dplyr)
left_join(DF1, DF2, by="ID") %>%
mutate(New_col = case_when(reference=="colD" ~ colD,
reference=="colE" ~ colE,
reference=="colF" ~ colF)) %>%
select(ID, value, reference, New_col)
ID value reference New_col
1 1 4 colD no
2 2 5 colD no
3 3 6 colE no
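The row/column indexing answer relies on both data frames listing the IDs in the same order. A hedged sketch that also matches rows by ID is shown below. DF2 is deliberately shuffled here to show the effect, and lookup, row_idx and col_idx are names invented for this example:

```r
DF1 <- data.frame(ID = 1:3, value = 4:6,
                  reference = c("colD", "colD", "colE"),
                  stringsAsFactors = FALSE)

# Same values as the DF2 above, but with the rows in a different order
DF2 <- data.frame(ID = c(3L, 1L, 2L),
                  colD = c("yes", "no", "no"),
                  colE = c("no", "yes", "no"),
                  colF = c("no", "no", "yes"),
                  stringsAsFactors = FALSE)

# Character matrix of the candidate columns (everything except ID)
lookup <- as.matrix(DF2[-1])

# For each row of DF1: which row of DF2 has that ID, and which column is referenced
row_idx <- match(DF1$ID, DF2$ID)
col_idx <- match(DF1$reference, colnames(lookup))

# A two-column index matrix picks one cell per row of DF1
DF1$new_col <- lookup[cbind(row_idx, col_idx)]
DF1
```

Only the referenced cells are read, so no wide join is created.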

replacing blank not NA

I have two variables a and b
a b
vessel hot
parts
nest NA
best true
neat smooth
I want to replace blank in b with a
la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
But it is not working
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), specify the condition in 'i' (b==''), and assign the values of 'a' that correspond to TRUE values in 'i' to 'b'. It should be fast as we are assigning in place.
library(data.table)
setDT(df1)[b=='', b:= a]
df1
# a b
#1: vessel hot
#2: parts parts
#3: nest NA
#4: best true
#5: neat smooth
Or we can just use base R
i1 <- df1$b=='' & !is.na(df1$b)
df1$b[i1] <- df1$a[i1]
data
df1 <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a", "b"
), class = "data.frame", row.names = c(NA, -5L))
instead of
# la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
# what is i1? it doesn't seem to have any obvious function here
... it should be:
la$b <- ifelse(la$b == "", la$a, la$b)
assuming that you want to replace blank in b with a and that applies to all blanks
it works:
df <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a",
"b"), row.names = c(NA, -5L), class = "data.frame")
df$b <- ifelse(df$b=="", df$a, df$b)
# or, with `with`: df$b <- with(df, ifelse(b=="",a,b))
# > df
# a b
# 1 vessel hot
# 2 parts parts
# 3 nest <NA>
# 4 best true
# 5 neat smooth
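If the replacement still seems not to work on real data, one possibility (an assumption, since the question does not show the raw values) is that the "blank" cells contain spaces rather than an empty string. A sketch that treats whitespace-only strings as blank and leaves NA untouched:

```r
la <- data.frame(a = c("vessel", "parts", "nest", "best", "neat"),
                 b = c("hot", "", NA, "true", "smooth"),
                 stringsAsFactors = FALSE)

# TRUE only for non-missing values that are empty after trimming whitespace
blank <- !is.na(la$b) & trimws(la$b) == ""

# Copy a into b for those rows; NA rows are left as they are
la$b[blank] <- la$a[blank]
la
```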
