Combining first and second rows in a text file in R

I have a messy dataset in which information that belongs on one row is split across two. I would like to take the second row, append it to the end of the first row, and create new columns in the process.
For example, I would like:
  COL1   COL2
1 name1  score1
2 state1 rating1
3 name2  score2
4 state2 rating2
To become:
  COL1  COL2   COL3   COL4
1 name1 score1 state1 rating1
2 name2 score2 state2 rating2
Is there anything simplistic in the Hadleyverse for this?

I would do this using unite() and separate() from tidyr, and lead() from dplyr.
library(dplyr)
library(tidyr)

df <- tribble(
  ~COL1,    ~COL2,
  "name1",  "score1",
  "state1", "rating1",
  "name2",  "score2",
  "state2", "rating2"
)

df %>%
  unite(old_cols, COL1, COL2) %>%
  mutate(new_cols = lead(old_cols)) %>%
  filter(row_number() %% 2 == 1) %>%
  separate(old_cols, into = c("COL1", "COL2")) %>%
  separate(new_cols, into = c("COL3", "COL4"))
#> # A tibble: 2 x 4
#> COL1 COL2 COL3 COL4
#> * <chr> <chr> <chr> <chr>
#> 1 name1 score1 state1 rating1
#> 2 name2 score2 state2 rating2

You should separate the data frame into two data frames: one containing the odd rows and another containing the even rows. Caution: if there is an odd number of rows, the two pieces will have unequal row counts and cbind() will fail, so drop or pad the last row first (a complete sketch follows below).
Odd rows: df[seq(1, nrow(df), 2), ]
Even rows: df[seq(2, nrow(df), 2), ]
The next step is to cbind them:
df_new = cbind(df[seq(1, nrow(df), 2), ], df[seq(2, nrow(df), 2), ])
The last step should be to rename the columns:
colnames(df_new) = c("COL1", "COL2", "COL3", "COL4")
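Putting those steps together, here is a runnable sketch (assuming df is the four-row example above, with a guard for the odd-row-count caution):

odd  <- df[seq(1, nrow(df), 2), ]
even <- df[seq(2, nrow(df), 2), ]

# guard: cbind() needs both pieces to have the same number of rows
stopifnot(nrow(odd) == nrow(even))

df_new <- cbind(odd, even)
colnames(df_new) <- c("COL1", "COL2", "COL3", "COL4")
df_new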

With base R, we can use recycling of a logical vector to subset the odd and even rows into a list, and then cbind them:
setNames(do.call(cbind, list(df[c(TRUE, FALSE), ],
                             df[c(FALSE, TRUE), ])), paste0("COL", 1:4))
# COL1 COL2 COL3 COL4
#1 name1 score1 state1 rating1
#3 name2 score2 state2 rating2
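An equivalent base R trick goes through a matrix: transposing interleaves each pair of rows, so refilling by row yields one record per line. A sketch, assuming all-character columns and an even number of rows:

# t() interleaves the paired rows; byrow = TRUE refills one record per row
m <- matrix(t(as.matrix(df)), ncol = 4, byrow = TRUE)
setNames(as.data.frame(m, stringsAsFactors = FALSE), paste0("COL", 1:4))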

Here is a dplyr solution.
library(dplyr)

dt2 <- dt %>%
  mutate(Group = rep(1:2, times = nrow(.)/2)) %>%
  split(.$Group) %>%
  bind_cols() %>%
  select(-starts_with("Group")) %>%
  setNames(paste0("COL", 1:ncol(.)))
dt2
COL1 COL2 COL3 COL4
1 name1 score1 state1 rating1
2 name2 score2 state2 rating2
Or we can use the purrr package together with dplyr.
library(dplyr)
library(purrr)

dt2 <- dt %>%
  mutate(Group = rep(1:2, times = nrow(.)/2)) %>%
  split(.$Group) %>%
  map_dfc(. %>% select(-Group)) %>%
  setNames(paste0("COL", 1:ncol(.)))
dt2
COL1 COL2 COL3 COL4
1 name1 score1 state1 rating1
2 name2 score2 state2 rating2
DATA
dt <- read.table(text = " COL1 COL2
1 name1 score1
2 state1 rating1
3 name2 score2
4 state2 rating2",
header = TRUE, stringsAsFactors = FALSE)

Related

Join of column values for specific row values

I'd like to join (left_join) a tibble (df2) to another one (df1) only where the value of col2 in df1 is NA. I am currently using code that is not very elegant. Any advice on how to shorten it would be greatly appreciated!
library(tidyverse)
# df1 contains NAs that need to be replaced by values from df2, for relevant col1 values
df1 <- tibble(col1 = c("a", "b", "c", "d"), col2 = c(1, 2, NA, NA), col3 = c(10, 20, 30, 40))
df2 <- tibble(col1 = c("a", "b", "c", "d"), col2 = c(5, 6, 7, 8), col3 = c(50, 60, 70, 80))
# my current approach
df3 <- df1 %>%
  filter(!is.na(col2))

df4 <- df1 %>%
  filter(is.na(col2)) %>%
  select(col1) %>%
  left_join(df2)

# output tibble that is expected
df_final <- df3 %>%
  bind_rows(df4)
Here's a small dplyr answer that works for me, although it might get slow if you have tons of rows:
df1 %>%
  filter(is.na(col2)) %>%
  select(col1) %>%
  left_join(df2, by = "col1") %>%
  bind_rows(df1, .) %>%
  filter(!is.na(col2))
We can use data.table methods
library(data.table)
setDT(df1)[setDT(df2), col2 := fcoalesce(col2, i.col2), on = .(col1)]
-output
> df1
col1 col2 col3
1: a 1 10
2: b 2 20
3: c 7 30
4: d 8 40
Or an option with the tidyverse
library(dplyr)
library(stringr)
df1 %>%
  left_join(df2, by = c("col1")) %>%
  transmute(col1, across(ends_with(".x"),
                         ~ coalesce(., get(str_replace(cur_column(), ".x", ".y"))),
                         .names = "{str_remove(.col, '.x')}"))
-output
# A tibble: 4 x 3
col1 col2 col3
<chr> <dbl> <dbl>
1 a 1 10
2 b 2 20
3 c 7 30
4 d 8 40
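If your dplyr is 1.0.0 or newer, rows_patch() expresses this pattern directly: it overwrites only the NA cells of df1 with the matching values from df2, keyed by col1. A minimal sketch (note it patches NAs in every shared column, not just col2):

library(dplyr)

# only NA cells in df1 are replaced; existing values win
rows_patch(df1, df2, by = "col1")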

Count number of occurrences of two column cases

I have a dataframe:
ID col1 col2
1 LOY A
2 LOY B
3 LOY B
4 LOY B
5 LOY A
I want to count the number of occurrences of each unique combination of col1 and col2. So, the desired result is:
event count
loy-a 2
loy-b 3
How could I do that?
You can also try:
library(dplyr)
#Code
new <- df %>%
  group_by(event = tolower(paste0(col1, '-', col2))) %>%
  summarise(count = n())
Output:
# A tibble: 2 x 2
event count
<chr> <int>
1 loy-a 2
2 loy-b 3
Some data used:
#Data
df <- structure(list(ID = 1:5, col1 = c("LOY", "LOY", "LOY", "LOY",
"LOY"), col2 = c("A", "B", "B", "B", "A")), class = "data.frame", row.names = c(NA,
-5L))
Here is an option where we convert the columns to lower case, then get the count, and unite 'col1' and 'col2' into a single 'event' column.
library(dplyr)
library(tidyr)
df1 %>%
  mutate(across(c(col1, col2), tolower)) %>%
  count(col1, col2) %>%
  unite(event, col1, col2, sep = '-')
-output
# event n
#1 loy-a 2
#2 loy-b 3
NOTE: Returns the OP's expected output
Or using base R
with(df1, table(tolower(paste(col1, col2, sep='-'))))
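If you need the result as a data frame rather than a named table, the same base R call converts cleanly; a small sketch (naming the table dimension so the column comes out as 'event'):

tab <- with(df1, table(event = tolower(paste(col1, col2, sep = '-'))))
as.data.frame(tab, responseName = "count")
#   event count
# 1 loy-a     2
# 2 loy-b     3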
data
df1 <- structure(list(ID = 1:5, col1 = c("LOY", "LOY", "LOY", "LOY",
"LOY"), col2 = c("A", "B", "B", "B", "A")),
class = "data.frame", row.names = c(NA,
-5L))

Add 2 dataframes with different lengths in R

I have the following two data frames in R:
df1:          df2:
a  2          a 10
b  3          c  2
I want to add these two data frames together, so the output would be:
df:
a 12
b  3
c  2
Any advice would be much appreciated, thanks!
We can rbind the two datasets and do a group-by sum:
aggregate(col2 ~ col1, rbind(df1, df2), sum)
-output
# col1 col2
#1 a 12
#2 b 3
#3 c 2
Or in dplyr
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(col1) %>%
  summarise(col2 = sum(col2), .groups = 'drop')
-output
# A tibble: 3 x 2
# col1 col2
# <chr> <dbl>
#1 a 12
#2 b 3
#3 c 2
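For completeness, the same grouped sum in data.table (a sketch, using the data below):

library(data.table)

# stack both tables, then sum col2 within each col1 group
rbindlist(list(df1, df2))[, .(col2 = sum(col2)), by = col1]
#    col1 col2
# 1:    a   12
# 2:    b    3
# 3:    c    2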
data
df2 <- data.frame(col1 = c('a', 'c'), col2 = c(10, 2))
df1 <- data.frame(col1 = c('a', 'b'), col2 = c(2, 3))

Subset rows that contain string in any column

I have a dataset like below:
Col1 Col2 Col3
abckel NA 7
jdmelw njabc NA
8 jdken jdne
How do I subset my dataset so that it only keeps rows that contain the string "abc"?
Final Expected Output:
Col1 Col2 Col3
abckel NA 7
jdmelw njabc NA
With your data.frame:
d <- data.frame("Col1" = c("abckel", "jdmelw", 8),
"Col2" = c(NA, "njabc", NA),
"Col3" = c(7, NA, "jdne"),
stringsAsFactors = F)
The following should return your desired result:
d_new <- d[apply(d, 1, function(x) any(grepl("abc", x))), ]
A dplyr solution:
library(dplyr)
df %>% filter_all(any_vars(grepl("abc", .)))
Output:
    Col1  Col2 Col3
1 abckel  <NA>    7
2 jdmelw njabc <NA>
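Note that filter_all()/any_vars() are superseded in dplyr 1.0+; the current idiom is if_any() (a sketch; grepl() treats NA as no match, so NA cells are handled for free):

library(dplyr)

df %>% filter(if_any(everything(), ~ grepl("abc", .x)))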

Skip NA using Case_When

I have a situation where my code uses arrange() on a certain column, say col1, but if a row does not have data for that column, I'd like to fall back to col2; if col2 is not available, then col3, and so on up to col6.
so currently:
df <- data.frame(col1 = c("NA", "1999-07-01", "NA"),
                 col2 = c("NA", "09-22-2011", "01-12-2009"),
                 col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
                 col4 = c("04-01-2015", "NA", "01-12-2009"),
                 col5 = c("NA", "09-22-2011", "01-12-2009"),
                 col6 = c("04-01-2015", "09-22-2011", "NA"),
                 id = c(1251, 16121, 1209))
Currently something similar to this is applied, but I need to make it more flexible for the different cases mentioned above:
df %>%
  mutate(col1 = as.Date(col1)) %>%
  group_by(id) %>%
  arrange(col1) %>%
  mutate(diff = col1 - lag(col1))
I was thinking of using case_when() inside arrange(), but I'm not sure how to translate that into the mutate() step.
Alternatively, I was thinking about just creating another column i.e:
df <- df %>%
  mutate(earliestDate = case_when(
    !is.na(col1) ~ col1,
    is.na(col1) ~ col2,
    is.na(col2) ~ col3,
    is.na(col3) ~ col4,
    is.na(col4) ~ col5))
but the above doesn't fill the new earliestDate column with the earliest date; it just grabs the first matching column?
I assume you want to order rows by earliestDate; why not do something like this?
df %>%
  gather(key, date, starts_with("col")) %>%
  group_by(id) %>%
  mutate(earliestDate = min(as.Date(date, format = "%m-%d-%Y"), na.rm = TRUE)) %>%
  spread(key, date)
## A tibble: 3 x 8
## Groups: id [3]
# id earliestDate col1 col2 col3 col4 col5 col6
# <dbl> <date> <chr> <chr> <chr> <chr> <chr> <chr>
#1 1209. 2009-01-12 NA 01-12-2009 01-12-2009 01-12-2009 01-12… NA
#2 1251. 2015-04-01 NA NA 04-01-2015 04-01-2015 NA 04-01…
#3 16121. 1999-07-01 07-01-1999 09-22-2011 09-22-2011 NA 09-22… 09-22…
Explanation: We convert data from wide to long, group by id and determine the earliestDate; we then convert data back from long to wide.
Note that the dates in your sample data are not 100% consistent: most entries are in the format "%m-%d-%Y", except for the first entry in col1, which is "1999-07-01". I have changed this in the sample data below.
Sample data
df <- data.frame(col1 = c("NA", "07-01-1999", "NA"),
                 col2 = c("NA", "09-22-2011", "01-12-2009"),
                 col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
                 col4 = c("04-01-2015", "NA", "01-12-2009"),
                 col5 = c("NA", "09-22-2011", "01-12-2009"),
                 col6 = c("04-01-2015", "09-22-2011", "NA"),
                 id = c(1251, 16121, 1209))
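As an aside, gather() and spread() are superseded in current tidyr; the same wide-long-wide round trip with pivot_longer() and pivot_wider() would look like this (a sketch under the same assumptions, with character rather than factor columns):

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(starts_with("col"), names_to = "key", values_to = "date") %>%
  group_by(id) %>%
  mutate(earliestDate = min(as.Date(date, format = "%m-%d-%Y"), na.rm = TRUE)) %>%
  pivot_wider(names_from = key, values_from = date)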
To start, your current "NA" values are not really R's NA value, so convert them first.
df[df == "NA"] <- NA
Then you can take advantage of the row margin option in apply() to find the leftmost value that is not missing (assuming that's what you want to do, rather than building true date objects as in Maurits' answer).
df$left_most <- apply(df[-7], 1, function(x) x[which.min(is.na(x))])
df
col1 col2 col3 col4 col5 col6 id left_most
1 <NA> <NA> 04-01-2015 04-01-2015 <NA> 04-01-2015 1251 04-01-2015
2 07-01-1999 09-22-2011 09-22-2011 <NA> 09-22-2011 09-22-2011 16121 07-01-1999
3 <NA> 01-12-2009 01-12-2009 01-12-2009 01-12-2009 <NA> 1209 01-12-2009
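If you then need an actual Date for arithmetic rather than a string, a one-line follow-up (assuming the corrected, all-"%m-%d-%Y" sample data):

df$left_most <- as.Date(df$left_most, format = "%m-%d-%Y")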
I can see there are two challenges in the data provided by the OP:
1. The date formats are not consistent: sometimes the year part is at the beginning and sometimes at the end.
2. The order of preference for the columns: col1 is considered first, then col2, and so on.
To handle dates in heterogeneous formats, we can use the parse_date_time function from lubridate, and then use coalesce to combine the columns so that col1 data gets preference, then col2, and so on.
library(dplyr)
library(lubridate)
df %>%
  mutate_at(vars(1:6), ~ parse_date_time(., orders = c("ymd", "mdy"), quiet = TRUE)) %>%
  mutate(col = coalesce(col1, col2, col3, col4, col5, col6)) %>%
  select(id, col)
# id col
# 1 1251 2015-04-01
# 2 16121 1999-07-01
# 3 1209 2009-01-12
Data:
df <- data.frame(col1 = c("NA", "1999-07-01", "NA"),
                 col2 = c("NA", "09-22-2011", "01-12-2009"),
                 col3 = c("04-01-2015", "09-22-2011", "01-12-2009"),
                 col4 = c("04-01-2015", "NA", "01-12-2009"),
                 col5 = c("NA", "09-22-2011", "01-12-2009"),
                 col6 = c("04-01-2015", "09-22-2011", "NA"),
                 id = c(1251, 16121, 1209))
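To continue toward the OP's original goal of arranging by that date and taking differences, one possible (hypothetical) continuation of the pipeline above:

library(dplyr)
library(lubridate)

df %>%
  mutate_at(vars(1:6), ~ parse_date_time(., orders = c("ymd", "mdy"), quiet = TRUE)) %>%
  mutate(col = coalesce(col1, col2, col3, col4, col5, col6)) %>%
  arrange(col) %>%
  mutate(diff = col - lag(col)) %>%   # gap between consecutive earliest dates
  select(id, col, diff)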
