I have a situation where my code uses arrange() on a certain column, say col1, but if a row does not have data available in that column, I'd like it to use col2; if col2 is not available, then col3, and so on up to col6.
so currently:
df <- data.frame(col1 = c("NA", "1999-07-01", "NA"),
                 col2 = c("NA", "09-22-2011", "01-12-2009"),
                 col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
                 col4 = c("04-01-2015", "NA", "01-12-2009"),
                 col5 = c("NA", "09-22-2011", "01-12-2009"),
                 col6 = c("04-01-2015", "09-22-2011", "NA"),
                 id = c(1251, 16121, 1209))
Currently something similar to this is applied, but I need to make it more flexible for the different cases mentioned above:
df %>%
  mutate(col1 = as.Date(col1)) %>%
  group_by(id) %>%
  arrange(col1) %>%
  mutate(diff = col1 - lag(col1))
I was thinking of using case_when() inside arrange(), but I am not sure how to translate that into the mutate() step.
Alternatively, I was thinking about just creating another column, i.e.:
df <- df %>%
  mutate(earliestDate = case_when(
    !is.na(col1) ~ col1,
    is.na(col1) ~ col2,
    is.na(col2) ~ col3,
    is.na(col3) ~ col4,
    is.na(col4) ~ col5))
but the above doesn't fill the new earliestDate column with the earliest available date; it just grabs the first column?
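As an aside, case_when() returns the right-hand side of the first condition that evaluates to TRUE, so to get the "first available column" behaviour each branch has to test the column it returns. A minimal sketch of that correction (it works whether the columns are still character or already Date, as long as they are all one type, and it picks the left-most available column rather than the minimum date):
library(dplyr)
df %>%
  mutate(earliestDate = case_when(
    !is.na(col1) ~ col1,
    !is.na(col2) ~ col2,
    !is.na(col3) ~ col3,
    !is.na(col4) ~ col4,
    !is.na(col5) ~ col5,
    TRUE ~ col6   # fall back to col6 when everything else is missing
  ))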
I assume you want to order rows by earliestDate; why not do something like this?
df %>%
  gather(key, date, starts_with("col")) %>%
  group_by(id) %>%
  mutate(earliestDate = min(as.Date(date, format = "%m-%d-%Y"), na.rm = TRUE)) %>%
  spread(key, date)
# A tibble: 3 x 8
# Groups: id [3]
# id earliestDate col1 col2 col3 col4 col5 col6
# <dbl> <date> <chr> <chr> <chr> <chr> <chr> <chr>
#1 1209. 2009-01-12 NA 01-12-2009 01-12-2009 01-12-2009 01-12… NA
#2 1251. 2015-04-01 NA NA 04-01-2015 04-01-2015 NA 04-01…
#3 16121. 1999-07-01 07-01-1999 09-22-2011 09-22-2011 NA 09-22… 09-22…
Explanation: We convert data from wide to long, group by id and determine the earliestDate; we then convert data back from long to wide.
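For reference, gather() and spread() have since been superseded in tidyr; with tidyr >= 1.0.0 the same idea would look roughly like this (a sketch, assuming the col* columns are plain character vectors):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(starts_with("col"), names_to = "key", values_to = "date") %>%
  group_by(id) %>%
  mutate(earliestDate = min(as.Date(date, format = "%m-%d-%Y"), na.rm = TRUE)) %>%
  pivot_wider(names_from = key, values_from = date)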
Note that the dates in your sample data are not 100% consistent: most entries are in the format "%m-%d-%Y", except for the first entry in col1, which is "1999-07-01". I have changed this in the sample data below.
Sample data
df <- data.frame(col1 = c("NA", "07-01-1999", "NA"),
                 col2 = c("NA", "09-22-2011", "01-12-2009"),
                 col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
                 col4 = c("04-01-2015", "NA", "01-12-2009"),
                 col5 = c("NA", "09-22-2011", "01-12-2009"),
                 col6 = c("04-01-2015", "09-22-2011", "NA"),
                 id = c(1251, 16121, 1209))
To start, your current "NA" values are not really R's NA value, so convert them first:
df[df == "NA"] <- NA
Then you can take advantage of the row margin in apply() to find the left-most value that is not missing (assuming that is what you want, rather than building true Date objects as in Maurtis' answer).
df$left_most <- apply(df[-7], 1, function(x) x[which.min(is.na(x))])
df
col1 col2 col3 col4 col5 col6 id left_most
1 <NA> <NA> 04-01-2015 04-01-2015 <NA> 04-01-2015 1251 04-01-2015
2 07-01-1999 09-22-2011 09-22-2011 <NA> 09-22-2011 09-22-2011 16121 07-01-1999
3 <NA> 01-12-2009 01-12-2009 01-12-2009 01-12-2009 <NA> 1209 01-12-2009
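A vectorised base R alternative to the row-wise apply() call, sketched against the same cleaned sample data: max.col() on the negated is.na() matrix returns the position of the first non-missing column in each row.
# index of the left-most non-NA column per row (returns 1 if a row is all NA)
first_ok <- max.col(!is.na(df[1:6]), ties.method = "first")
df$left_most <- as.matrix(df[1:6])[cbind(seq_len(nrow(df)), first_ok)]   # same result as the apply() version above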
I can see two challenges in the data provided by the OP:
1. The date formats are not consistent: sometimes the year is at the beginning and sometimes at the end.
2. The order of preference for the columns: col1 is considered first, then col2, and so on.
To handle dates in heterogeneous formats one can use the parse_date_time function from lubridate, and then coalesce to combine the columns so that col1 gets preference, then col2, and so on.
library(dplyr)
library(lubridate)
df %>%
  mutate_at(vars(1:6), funs(parse_date_time(., orders = c("ymd", "mdy"), quiet = TRUE))) %>%
  mutate(col = coalesce(col1, col2, col3, col4, col5, col6)) %>%
  select(id, col)
# id col
# 1 1251 2015-04-01
# 2 16121 1999-07-01
# 3 1209 2009-01-12
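To tie this back to the original arrange()/lag() logic, the coalesced column can serve as the sort key. A rough sketch building on the pipeline above (best_date and diff are made-up names, and across() is used in place of the older mutate_at()/funs() pair):
library(dplyr)
library(lubridate)
df %>%
  mutate(across(1:6, ~ parse_date_time(.x, orders = c("ymd", "mdy"), quiet = TRUE))) %>%
  mutate(best_date = as.Date(coalesce(col1, col2, col3, col4, col5, col6))) %>%
  group_by(id) %>%
  arrange(best_date, .by_group = TRUE) %>%
  mutate(diff = best_date - lag(best_date))   # difference in days within each id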
Data:
df <- data.frame(col1 = c("NA", "1999-07-01", "NA"),
                 col2 = c("NA", "09-22-2011", "01-12-2009"),
                 col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
                 col4 = c("04-01-2015", "NA", "01-12-2009"),
                 col5 = c("NA", "09-22-2011", "01-12-2009"),
                 col6 = c("04-01-2015", "09-22-2011", "NA"),
                 id = c(1251, 16121, 1209))
Related
I have two columns
COL1 COL2
SCS NA
NA NA
NA PB
NA RM
Whenever COL1 is NA and COL2 has a value, I want COL2's value to overwrite the NA.
Whenever COL1 has a value, I want it to keep that value no matter what is in COL2.
This can be done with coalesce() by specifying COL1 as the first argument followed by COL2, so that any NA in COL1 for a particular row is replaced by the corresponding value from COL2.
library(dplyr)
data <- data %>%
  mutate(COL1 = coalesce(COL1, COL2))
-output
data
COL1 COL2
1 SCS <NA>
2 <NA> <NA>
3 PB PB
4 RM RM
data
data <- structure(list(COL1 = c("SCS", NA, NA, NA), COL2 = c(NA, NA,
"PB", "RM")), class = "data.frame", row.names = c(NA, -4L))
With the data.table package and assuming your data.frame is named df:
library(data.table)
setDT(df)
df[is.na(COL1), COL1 := COL2]
This will replace values in COL1 with the corresponding values from COL2 wherever COL1 is `NA`.
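For completeness, data.table also has its own coalesce. A small sketch, assuming data.table >= 1.12.4 (where fcoalesce() was introduced) and the same df/COL1/COL2 names as above:
library(data.table)
setDT(df)
df[, COL1 := fcoalesce(COL1, COL2)]   # first non-NA of COL1, COL2 per row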
data <- data %>%
  mutate(COL1 = case_when(is.na(COL1) ~ COL2,
                          TRUE ~ as.character(COL1)))
  COL1 COL2
1  SCS <NA>
2 <NA> <NA>
3   PB   PB
4   RM   RM
I have a dataframe:
ID col1 col2
1 LOY A
2 LOY B
3 LOY B
4 LOY B
5 LOY A
I want to count the number of occurrences of the unique value combinations of col1 and col2. So, the desired result is:
event count
loy-a 2
loy-b 3
How could I do that?
You can also try:
library(dplyr)
#Code
new <- df %>%
  group_by(event = tolower(paste0(col1, '-', col2))) %>%
  summarise(count = n())
Output:
# A tibble: 2 x 2
event count
<chr> <int>
1 loy-a 2
2 loy-b 3
Some data used:
#Data
df <- structure(list(ID = 1:5, col1 = c("LOY", "LOY", "LOY", "LOY",
"LOY"), col2 = c("A", "B", "B", "B", "A")), class = "data.frame", row.names = c(NA,
-5L))
Here is an option where we convert the columns to lower case, then get the count, and unite the 'col1' and 'col2' values into a single 'event' column.
library(dplyr)
library(tidyr)
df1 %>%
  mutate(across(c(col1, col2), tolower)) %>%
  count(col1, col2) %>%
  unite(event, col1, col2, sep = '-')
-output
# event n
#1 loy-a 2
#2 loy-b 3
NOTE: Returns the OP's expected output
Or using base R
with(df1, table(tolower(paste(col1, col2, sep='-'))))
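If a two-column data frame like the expected output is preferred, the table can be converted. A small sketch (responseName simply renames the frequency column):
as.data.frame(
  with(df1, table(event = tolower(paste(col1, col2, sep = '-')))),
  responseName = "count"
)
#   event count
# 1 loy-a     2
# 2 loy-b     3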
data
df1 <- structure(list(ID = 1:5, col1 = c("LOY", "LOY", "LOY", "LOY",
"LOY"), col2 = c("A", "B", "B", "B", "A")),
class = "data.frame", row.names = c(NA,
-5L))
I have a dataframe like below:
Col1 Col2 COl4 Col5
A B NA NA
M L NA lo
A N NA KE
How do I write the logic so that, if Col1 == "A", the NA in COl4 is replaced with "Pass"?
When I try using ifelse, I do not get the expected output.
Expected output should be:
Col1 Col2 COl4 Col5
A B Pass NA
M L NA lo
A N Pass KE
I tried this but no luck:
df$COl4 <- ifelse(df$Col1 == "A", "Pass", df$COl4)
No real need for ifelse() here. You can use standard index replacement.
df$COl4[df$Col1 == "A"] <- "Pass"
This says that we are replacing the elements of COl4 where Col1 == "A" with "Pass". Additionally, this method will not mess with attributes the way ifelse() can.
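A small illustration of the attribute point, using made-up Date values rather than anything from the question: ifelse() builds its result from the test vector, so class attributes such as Date are dropped, while indexed replacement keeps them.
x <- as.Date(c("2020-01-01", "2020-06-01"))
ifelse(c(TRUE, FALSE), x, x)        # numeric 18262 18414 -- the Date class is lost
x[c(TRUE, FALSE)] <- as.Date("2021-01-01")
x                                   # still a Date vector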
You can use case_when:
library(tidyverse)
tab <- tibble(Col1 = c("A", "M", "A"), Col2 = c("B", "L", "N"), COl4 = c(NA, NA, NA), Col5 = c(NA, "lo", "KE"))
tab %>%
  mutate(COl4 = case_when(
    Col1 == "A" ~ "Pass",
    TRUE ~ as.character(COl4)
  ))
# A tibble: 3 x 4
Col1 Col2 COl4 Col5
<chr> <chr> <chr> <chr>
1 A B Pass NA
2 M L NA lo
3 A N Pass KE
The benefit of case_when() really shows when you have many conditions.
The TRUE branch covers the rest of COl4, i.e. the rows that don't match any condition.
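For instance, with a hypothetical second rule added just to show several conditions stacking up:
tab %>%
  mutate(COl4 = case_when(
    Col1 == "A" ~ "Pass",
    Col1 == "M" ~ "Check",          # hypothetical extra condition
    TRUE ~ as.character(COl4)       # everything else keeps its current value
  ))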
I have a dataset like below:
Col1 Col2 Col3
abckel NA 7
jdmelw njabc NA
8 jdken jdne
How do I subset my dataset so that it only keeps rows that contain the string "abc"?
Final Expected Output:
Col1 Col2 Col3
abckel NA 7
jdmelw njabc NA
With your data.frame:
d <- data.frame("Col1" = c("abckel", "jdmelw", 8),
                "Col2" = c(NA, "njabc", NA),
                "Col3" = c(7, NA, "jdne"),
                stringsAsFactors = F)
The following should return your desired result:
d_new <- d[apply(d, 1, function(x) any(grepl("abc", x))), ]
A dplyr solution:
library(dplyr)
df %>% filter_all(any_vars(grepl("abc", .)))
Output:
Col1 Col2 Col3
1 abckel  <NA>    7
2 jdmelw njabc <NA>
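In dplyr >= 1.0.4, filter_all()/any_vars() are superseded; an equivalent filter written with if_any(), sketched against the d data frame defined above, would be:
library(dplyr)
d %>% filter(if_any(everything(), ~ grepl("abc", .x)))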
I have a messy dataset where the information that belongs on one row is spread across two. I would like to take the second row, append it to the end of the first row, and create new columns in the process.
For example, I would like:
COL1 COL2
1 name1 score1
2 state1 rating1
3 name2 score2
4 state2 rating2
To become:
COL1 COL2 COL3 COL4
1 name1 score1 state1 rating1
2 name2 score2 state2 rating2
Is there anything simplistic in the Hadleyverse for this?
I would do this using unite() and separate() from tidyr, and lead() from dplyr.
library(dplyr)
library(tidyr)
df <- tribble(
  ~COL1,    ~COL2,
  "name1",  "score1",
  "state1", "rating1",
  "name2",  "score2",
  "state2", "rating2"
)
df %>%
  unite(old_cols, COL1, COL2) %>%
  mutate(new_cols = lead(old_cols)) %>%
  filter(row_number() %% 2 == 1) %>%
  separate(old_cols, into = c("COL1", "COL2")) %>%
  separate(new_cols, into = c("COL3", "COL4"))
#> # A tibble: 2 x 4
#> COL1 COL2 COL3 COL4
#> * <chr> <chr> <chr> <chr>
#> 1 name1 score1 state1 rating1
#> 2 name2 score2 state2 rating2
You could split the data frame into two data frames: one containing the odd rows and another the even rows. Caution: if there is an odd number of rows, the two pieces will have different numbers of rows and cbind() will fail, so the last row needs to be handled separately.
Odd rows: df[seq(1, nrow(df), 2), ]
Even rows: df[seq(2, nrow(df), 2), ]
The next step is to cbind them:
df_new = cbind(df[seq(1, nrow(df), 2), ], df[seq(2, nrow(df), 2), ])
The last step should be to rename the columns:
colnames(df_new) = c("COL1", "COL2", "COL3", "COL4")
With base R, we could use recycling of a logical vector to subset the odd and even rows into a list and then cbind them:
setNames(do.call(cbind, list(df[c(TRUE, FALSE), ],
                             df[c(FALSE, TRUE), ])), paste0("COL", 1:4))
# COL1 COL2 COL3 COL4
#1 name1 score1 state1 rating1
#3 name2 score2 state2 rating2
Here is a dplyr solution.
library(dplyr)
dt2 <- dt %>%
  mutate(Group = rep(1:2, times = nrow(.)/2)) %>%
  split(.$Group) %>%
  bind_cols() %>%
  select(-starts_with("Group")) %>%
  setNames(paste0("COL", 1:ncol(.)))
dt2
COL1 COL2 COL3 COL4
1 name1 score1 state1 rating1
2 name2 score2 state2 rating2
Or we can also use the purrr package together with dplyr.
library(dplyr)
library(purrr)
dt2 <- dt %>%
  mutate(Group = rep(1:2, times = nrow(.)/2)) %>%
  split(.$Group) %>%
  map_dfc(. %>% select(-Group)) %>%
  setNames(paste0("COL", 1:ncol(.)))
dt2
COL1 COL2 COL3 COL4
1 name1 score1 state1 rating1
2 name2 score2 state2 rating2
DATA
dt <- read.table(text = " COL1 COL2
1 name1 score1
2 state1 rating1
3 name2 score2
4 state2 rating2",
header = TRUE, stringsAsFactors = FALSE)