I have a dataframe like below:
Col1 Col2 COl4 Col5
A B NA NA
M L NA lo
A N NA KE
How do I make the logic where, if Col1 = A, replace NA in COl4 with "Pass"?
When I try using ifelse, I do not get the expected output.
Expected output should be:
Col1 Col2 COl4 Col5
A B Pass NA
M L NA lo
A N Pass KE
I tried this but no luck:
df$COl4<-
ifelse(df$Col1=="A", "Pass", df$COl4)
No real need for ifelse() here. You can use standard index replacement.
df$COl4[df$Col1 == "A"] <- "Pass"
This replaces COl4 with "Pass" in every row where Col1 == "A". Additionally, unlike ifelse(), this method leaves the column's attributes intact.
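Note that the index replacement above overwrites every COl4 value in rows where Col1 is "A", including values that are not missing. If only the NA entries should change, as the question asks, a minimal sketch is to add an is.na() test to the index. The data frame below is rebuilt from the question; treating the columns as character is an assumption.

```r
# Sample data rebuilt from the question (character columns are an assumption)
df <- data.frame(
  Col1 = c("A", "M", "A"),
  Col2 = c("B", "L", "N"),
  COl4 = c(NA_character_, NA_character_, NA_character_),
  Col5 = c(NA, "lo", "KE"),
  stringsAsFactors = FALSE
)

# Replace only where Col1 is "A" AND COl4 is currently missing
idx <- df$Col1 == "A" & is.na(df$COl4)
df$COl4[idx] <- "Pass"
df
```

With the sample data every COl4 value is NA, so the result matches the plain replacement; the extra test only matters once COl4 holds real values.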
You can use case_when:
library(tidyverse)
tab <- tibble(Col1 = c("A", "M", "A"), Col2 = c("B", "L", "N"), COl4 = c(NA, NA, NA), Col5 = c(NA, "lo", "KE"))
tab %>%
mutate(COl4 = case_when(
Col1 == "A" ~ "Pass",
TRUE ~ as.character(COl4))
)
# A tibble: 3 x 4
Col1 Col2 COl4 Col5
<chr> <chr> <chr> <chr>
1 A B Pass NA
2 M L NA lo
3 A N Pass KE
The benefit of case_when shows when you have many conditions.
The TRUE branch covers the remaining rows of COl4 that match none of the conditions.
Related
Hello coding community,
If my data frame looks like:
ID Col1 Col2 Col3 Col4
Per1 1 2 3 4
Per2 2 NA NA NA
Per3 NA NA 5 NA
Is there any syntax to delete the row associated with ID = Per2, on the basis that Col2, Col3, AND Col4 = NA? I am hoping for code that will allow me to delete a row on the basis that three specific columns (Col2, Col3, and Col4) ALL are NA. This code would NOT delete the row ID = Per3, even though there are three NAs.
Please note that I know how to delete a specific row, but my data frame is big so I do not want to manually sort through all rows/columns.
Big thanks!
Test for NA with is.na, count the NAs per row with rowSums, and drop the rows where that count equals the number of columns tested.
dat[rowSums(is.na(dat[c('Col2', 'Col3', 'Col4')])) != 3, ]
# ID Col1 Col2 Col3 Col4
# 1 Per1 1 2 3 4
# 3 Per3 NA NA 5 NA
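The rowSums approach can be written without the hard-coded 3 by comparing against the number of columns tested. This is a small sketch; the names cols, all_missing and result are illustrative, and the data frame is rebuilt from the question.

```r
# Data rebuilt from the question
dat <- data.frame(
  ID   = c("Per1", "Per2", "Per3"),
  Col1 = c(1, 2, NA),
  Col2 = c(2, NA, NA),
  Col3 = c(3, NA, 5),
  Col4 = c(4, NA, NA),
  stringsAsFactors = FALSE
)

# Columns that must ALL be NA for a row to be dropped
cols <- c("Col2", "Col3", "Col4")

# Count NAs per row in the chosen columns and flag rows where every one is NA
all_missing <- rowSums(is.na(dat[cols])) == length(cols)

result <- dat[!all_missing, ]
result
```

Changing cols is then enough to test a different set of columns.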
You can use if_all
library(dplyr)
filter(df, !if_all(c(Col2, Col3, Col4), ~ is.na(.)))
# ID Col1 Col2 Col3 Col4
# 1 Per1 1 2 3 4
# 2 Per3 NA NA 5 NA
data
df <- structure(list(ID = c("Per1", "Per2", "Per3"), Col1 = c(1L, 2L,
NA), Col2 = c(2L, NA, NA), Col3 = c(3L, NA, 5L), Col4 = c(4L,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
Using if_any
library(dplyr)
df %>%
filter(if_any(Col2:Col4, complete.cases))
ID Col1 Col2 Col3 Col4
1 Per1 1 2 3 4
2 Per3 NA NA 5 NA
I have two columns
COL1 COL2
SCS NA
NA NA
NA PB
NA RM
Whenever col1 is na and col2 has a value, I want col2's value to overwrite the na.
Whenever col1 has a value, I want it to stay that value no matter what is in col2.
This can be done with coalesce by passing COL1 as the first argument and COL2 as the second: wherever COL1 is NA in a row, it is replaced by the value of COL2 from the same row.
library(dplyr)
data <- data %>%
mutate(COL1 = coalesce(COL1, COL2))
-output
data
COL1 COL2
1 SCS <NA>
2 <NA> <NA>
3 PB PB
4 RM RM
data
data <- structure(list(COL1 = c("SCS", NA, NA, NA), COL2 = c(NA, NA,
"PB", "RM")), class = "data.frame", row.names = c(NA, -4L))
With the data.table package and assuming your data.frame is named df:
library(data.table)
setDT(df)
df[is.na(COL1), COL1 := COL2]
This replaces a value in COL1 with the value in COL2 whenever the COL1 value is NA.
data <- data%>%
mutate(COL1 = case_when(is.na(COL1) ~ COL2,
TRUE ~ as.character(COL1)))
COL1 COL2
SCS NA
NA NA
PB PB
RM RM
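The same fill can also be sketched in base R with logical indexing, without loading any package. The data frame is rebuilt from the question and missing1 is an illustrative name.

```r
# Data rebuilt from the question
data <- data.frame(
  COL1 = c("SCS", NA, NA, NA),
  COL2 = c(NA, NA, "PB", "RM"),
  stringsAsFactors = FALSE
)

# Rows where COL1 is missing
missing1 <- is.na(data$COL1)

# Copy COL2 into COL1 for those rows only; existing COL1 values are untouched
data$COL1[missing1] <- data$COL2[missing1]
data
```

Row 2 stays NA because both columns are missing there.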
I have a dataset like below:
Col1 Col2 Col3
abckel NA 7
jdmelw njabc NA
8 jdken jdne
How do I subset my dataset so that it only keeps rows that contain the string "abc"?
Final Expected Output:
Col1 Col2 Col3
abckel NA 7
jdmelw njabc NA
With your data.frame:
d <- data.frame("Col1" = c("abckel", "jdmelw", 8),
"Col2" = c(NA, "njabc", NA),
"Col3" = c(7, NA, "jdne"),
stringsAsFactors = FALSE)
The following should return your desired result:
d_new <- d[apply(d, 1, function(x) any(grepl("abc", x))), ]
A dplyr solution:
library(dplyr)
df %>% filter_all(any_vars(grepl("abc", .)))
Output:
Col1 Col2 Col3
1: abckel <NA> 7
2: jdmelw njabc <NA>
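A column-wise base R variant is sketched below. It avoids converting each row to a character vector, as apply does, by running grepl once per column. The data frame is rebuilt from the question, hits and keep are illustrative names, and fixed = TRUE is an assumption that "abc" should be matched literally rather than as a regular expression.

```r
# Data rebuilt from the question (mixed values stored as character)
d <- data.frame(
  Col1 = c("abckel", "jdmelw", "8"),
  Col2 = c(NA, "njabc", NA),
  Col3 = c("7", NA, "jdne"),
  stringsAsFactors = FALSE
)

# One logical vector per column: TRUE where the cell contains "abc".
# grepl() returns FALSE for NA input, so missing cells never match.
hits <- lapply(d, function(col) grepl("abc", col, fixed = TRUE))

# Combine the per-column vectors with element-wise OR
keep <- Reduce(`|`, hits)

d_new <- d[keep, ]
d_new
```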
Have a situation where my code uses arrange for a certain column - say col1, but if that row does not have data available for that column, then I'd like it to use the col2, if col2 is not available, then I'd like it to use col3 and so on until col6.
so currently:
df <- data.frame(col1 = c("NA", "1999-07-01", "NA"),
col2 = c("NA", "09-22-2011", "01-12-2009"),
col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
col4 = c("04-01-2015", "NA", "01-12-2009"),
col5 = c("NA", "09-22-2011", "01-12-2009"),
col6 = c("04-01-2015", "09-22-2011", "NA"),
id = c(1251,16121,1209))
currently something similar to this is applied, but need to make it more flexible for the different cases mentioned above:
df %>%
mutate(col1 = as.Date(col1)) %>%
group_by(id) %>%
arrange(col1) %>%
mutate(diff = col1 - lag(col1))
I was thinking to use case_when in arrange but not sure how to translate that into the mutate aspect.
Alternatively, I was thinking about just creating another column i.e:
df <- df %>%
mutate(earliestDate = case_when(
!is.na(col1) ~ col1,
is.na(col1) ~ col2,
is.na(col2) ~ col3,
is.na(col3) ~ col4,
is.na(col4) ~ col5))
but the above doesn't update the new earliestDate column to have the earliest date, just grabs the first column?
I assume you want to order rows by earliestDate; why not do something like this?
df %>%
gather(key, date, starts_with("col")) %>%
group_by(id) %>%
mutate(earliestDate = min(as.Date(date, format = "%m-%d-%Y"), na.rm = TRUE)) %>%
spread(key, date)
## A tibble: 3 x 8
## Groups: id [3]
# id earliestDate col1 col2 col3 col4 col5 col6
# <dbl> <date> <chr> <chr> <chr> <chr> <chr> <chr>
#1 1209. 2009-01-12 NA 01-12-2009 01-12-2009 01-12-2009 01-12… NA
#2 1251. 2015-04-01 NA NA 04-01-2015 04-01-2015 NA 04-01…
#3 16121. 1999-07-01 07-01-1999 09-22-2011 09-22-2011 NA 09-22… 09-22…
Explanation: We convert data from wide to long, group by id and determine the earliestDate; we then convert data back from long to wide.
Note that the dates in your sample data are not fully consistent: most entries use the format "%m-%d-%Y", except for the first non-missing entry in col1, which is "1999-07-01". I have changed this in the sample data below.
Sample data
df <- data.frame(col1 = c("NA", "07-01-1999", "NA"),
col2 = c("NA", "09-22-2011", "01-12-2009"),
col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
col4 = c("04-01-2015", "NA", "01-12-2009"),
col5 = c("NA", "09-22-2011", "01-12-2009"),
col6 = c("04-01-2015", "09-22-2011", "NA"),
id = c(1251,16121,1209))
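As a quick check of the format string used in the answer above, the sketch below parses one of the sample dates both ways. A value such as "09-22-2011" can only be read as month-day-year, since there is no month 22. The variable names are illustrative.

```r
x <- "09-22-2011"

# Month-day-year, matching the format string used in the answer
mdy <- as.Date(x, format = "%m-%d-%Y")

# Day-month-year fails here because 22 is not a valid month
dmy <- as.Date(x, format = "%d-%m-%Y")

format(mdy, "%Y-%m-%d")  # "2011-09-22"
is.na(dmy)               # TRUE
```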
To start, your current "NA" values are character strings, not R's NA value, so convert them.
df[df == "NA"] <- NA
Then you can use apply over the rows (MARGIN = 1) to find the leftmost value that is not missing. This assumes that the leftmost value is what you want, rather than building true date objects as in Maurtis' answer.
df$left_most <- apply(df[-7], 1, function(x) x[which.min(is.na(x))])
df
col1 col2 col3 col4 col5 col6 id left_most
1 <NA> <NA> 04-01-2015 04-01-2015 <NA> 04-01-2015 1251 04-01-2015
2 07-01-1999 09-22-2011 09-22-2011 <NA> 09-22-2011 09-22-2011 16121 07-01-1999
3 <NA> 01-12-2009 01-12-2009 01-12-2009 01-12-2009 <NA> 1209 01-12-2009
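The which.min(is.na(x)) idiom works because FALSE sorts below TRUE and which.min returns the position of the first minimum, which is the first non-missing element. A small sketch on one row of the sample data, including the all-NA edge case (x, first and all_na are illustrative names):

```r
# First row of the sample data, after converting "NA" strings to NA
x <- c(NA, NA, "04-01-2015", "04-01-2015", NA, "04-01-2015")

is.na(x)                       # TRUE TRUE FALSE FALSE TRUE FALSE
first <- which.min(is.na(x))   # 3: position of the first FALSE
x[first]                       # "04-01-2015"

# Edge case: if every value is NA, the minimum is TRUE at position 1,
# so the lookup simply returns NA
all_na <- c(NA_character_, NA_character_)
all_na[which.min(is.na(all_na))]
```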
I can see two challenges in the data provided by the OP:
The date formats are not consistent. Sometimes the year comes first and sometimes it comes last.
The order of preference for the columns. col1 is considered first, then col2, and so on.
To handle dates in heterogeneous formats, one can use the parse_date_time function from lubridate, and then use coalesce so that col1 gets preference, then col2, and so on.
library(dplyr)
library(lubridate)
df %>%
mutate(across(col1:col6, ~ parse_date_time(.x, orders = c("ymd", "mdy"), quiet = TRUE))) %>%
mutate(col = coalesce(col1,col2,col3,col4,col5,col6)) %>%
select(id, col)
# id col
# 1 1251 2015-04-01
# 2 16121 1999-07-01
# 3 1209 2009-01-12
Data:
df <- data.frame(col1 = c("NA", "1999-07-01", "NA"),
col2 = c("NA", "09-22-2011", "01-12-2009"),
col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
col4 = c("04-01-2015", "NA", "01-12-2009"),
col5 = c("NA", "09-22-2011", "01-12-2009"),
col6 = c("04-01-2015", "09-22-2011", "NA"),
id = c(1251,16121,1209))
I would like to paste0 two columns when neither element is NA. If the element in one of the columns is NA, then keep only the element of the other column.
structure(list(col1 = structure(1:3, .Label = c("A", "B", "C"),
class = "factor"), col2 = c(1, NA, 3)), .Names = c("col1", "col2"),
class = "data.frame",row.names = c(NA, -3L))
# col1 col2
# 1 A 1
# 2 B NA
# 3 C 3
structure(list(col1 = structure(1:3, .Label = c("A", "B", "C"),
class = "factor"),col2 = c(1, NA, 3), col3 = c("A|1", "B", "C|3")),
.Names = c("col1", "col2", "col3"), row.names = c(NA,-3L),
class = "data.frame")
# col1 col2 col3
#1 A 1 A|1
#2 B NA B
#3 C 3 C|3
You can also do it with regular expressions:
df$col3 <- sub("NA\\||\\|NA", "", with(df, paste0(col1, "|", col2)))
That is, paste them the regular way and then replace any "NA|" or "|NA" with "". Note that a literal | must be written as \\| in the R string, because an unescaped | means "OR" in regular expressions; that is why the odd-looking pattern NA\\||\\|NA actually means "NA|" OR "|NA".
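One caveat with the pattern above: it is not anchored, so it also matches "NA|" inside a longer value. The sketch below compares it with an anchored variant that only strips a leading "NA|" or a trailing "|NA". The "DNA" value is a hypothetical addition to the sample data, used only to show the difference.

```r
# Sample data with a hypothetical value "DNA" that ends in "NA"
df <- data.frame(
  col1 = c("A", "B", "DNA"),
  col2 = c(1, NA, 3),
  stringsAsFactors = FALSE
)

pasted <- with(df, paste0(col1, "|", col2))   # "A|1" "B|NA" "DNA|3"

# Unanchored pattern also removes the "NA|" inside "DNA|3"
unanchored <- sub("NA\\||\\|NA", "", pasted)

# Anchored pattern: "NA|" only at the start, "|NA" only at the end
anchored <- sub("^NA\\||\\|NA$", "", pasted)

unanchored  # "A|1" "B" "D3"
anchored    # "A|1" "B" "DNA|3"
```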
As @Roland says, this is easy using ifelse (just translate the mental logic into a series of nested ifelse statements):
x <- transform(x,col3=ifelse(is.na(col1),as.character(col2),
ifelse(is.na(col2),as.character(col1),
paste0(col1,"|",col2))))
Update: as.character is needed when a column is a factor, since ifelse() would otherwise return the factor's integer codes.
Try:
> df$col1 = as.character(df$col1)
> df$col3 = with(df, ifelse(is.na(col1),col2, ifelse(is.na(col2), col1, paste0(col1,'|',col2))))
> df
col1 col2 col3
1 A 1 A|1
2 B NA B
3 C 3 C|3
You could also do:
library(stringr)
df$col3 <- apply(df, 1, function(x)
paste(str_trim(x[!is.na(x)]), collapse="|"))
df
# col1 col2 col3
#1 A 1 A|1
#2 B NA B
#3 C 3 C|3