R: Data tidying when identifiers and variables are in the same column

I came across the following (stylized) data cleaning problem:
df <- data.frame(first_column = c("country1", "variable1", "variable2","country2", "variable1", "variable2"),
second_column = c(NA, "15", "16", NA, "62", "63")
)
df
#> first_column second_column
#> 1 country1 <NA>
#> 2 variable1 15
#> 3 variable2 16
#> 4 country2 <NA>
#> 5 variable1 62
#> 6 variable2 63
Created on 2020-11-02 by the reprex package (v0.3.0)
I was trying to convert this to a "tidy" (i.e. long or wide format) using pivot_longer_spec and pivot_wider_spec respectively, but couldn't work it out. There seems to be very little documentation on these functions, and it is difficult for me to find out how to specify the arguments correctly.
Can anyone tell me how to approach this problem, using either these functions or others?
Many thanks.

An alternative solution, using the zoo package:
library(zoo)
library(dplyr)
df <- data.frame(first_column = c("country1", "variable1", "variable2","country2", "variable1", "variable2"),
second_column = c(NA, "15", "16", NA, "62", "63"))
df %>%
dplyr::mutate(COUNTRY = ifelse(is.na(second_column), first_column, NA)) %>%
dplyr::mutate(COUNTRY = zoo::na.locf(COUNTRY)) %>%
dplyr::filter(!is.na(second_column)) %>%
tidyr::pivot_wider(names_from = first_column, values_from = second_column)
# A tibble: 2 x 3
COUNTRY variable1 variable2
<chr> <chr> <chr>
1 country1 15 16
2 country2 62 63

This could be achieved like so:
The "tricky" part is to put the country identifiers in a separate column, which I do using an ifelse that conditions on NA values in the second column (as in the approach by @DPH), then fill the country column downwards and filter to get rid of the "country rows".
After this we can simply pivot_wider:
library(tidyr)
library(dplyr)
df <- data.frame(first_column = c("country1", "variable1", "variable2","country2", "variable1", "variable2"),
second_column = c(NA, "15", "16", NA, "62", "63")
)
df %>%
mutate(country = ifelse(is.na(second_column), first_column, NA)) %>%
tidyr::fill(country) %>%
filter(first_column != country) %>%
tidyr::pivot_wider(names_from = "first_column", values_from = "second_column")
#> # A tibble: 2 x 3
#> country variable1 variable2
#> <chr> <chr> <chr>
#> 1 country1 15 16
#> 2 country2 62 63
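For completeness, since the question asked about pivot_wider_spec: below is a sketch of the spec-based route (assuming tidyr >= 1.0.0), reusing the tidied intermediate data from above. build_wider_spec() builds a small data frame describing the reshaping (.name = output column, .value = column the values come from, plus the names_from column), and pivot_wider_spec() applies it.
library(dplyr)
library(tidyr)
tidy_df <- df %>%
mutate(country = ifelse(is.na(second_column), first_column, NA)) %>%
fill(country) %>%
filter(!is.na(second_column))
# describe the reshaping: one output column per value of first_column,
# with the values taken from second_column
spec <- build_wider_spec(tidy_df, names_from = first_column, values_from = second_column)
pivot_wider_spec(tidy_df, spec)
This should give the same two-row tibble as the pivot_wider() call above.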

Related

Using coalesce function with many variables

I have two datasets with similar variables. dataset2 has values for variables that were not captured in dataset1. My aim is to use the dataset2 variables to fill in the corresponding values in dataset1. Is there a way to achieve this? It is possible to use coalesce, but listing all the variables is a bit cumbersome.
library(dplyr)
dat1 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c("Test", "Tested", "Testing", NA),
a2= c("Math", "Eng", NA, "French"),
a3= c("Science", NA, "Biology", "Chem"))
dat2 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c(NA, NA, NA, "Tested"),
a2= c("Math", NA, "UK", NA),
a3= c("Science", "Physic", NA, NA))
dat1 %>%
inner_join(dat2, by = "id") %>%
mutate(a1 = coalesce(a1.x, a1.y),
a2 = coalesce(a2.x, a2.y),
a3 = coalesce(a3.x, a3.y)) %>%
select(id, a1, a2, a3)
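As an aside (a sketch, not one of the answers below), the variables can be coalesced programmatically so nothing has to be typed out by hand; this assumes dplyr >= 1.0.0 for across() and cur_column():
library(dplyr)
dat1 %>%
inner_join(dat2, by = "id", suffix = c(".x", ".y")) %>%
# pair each *.x column with its *.y counterpart and coalesce the two
mutate(across(ends_with(".x"), ~ coalesce(.x, get(sub("\\.x$", ".y", cur_column()))))) %>%
select(id, ends_with(".x")) %>%
rename_with(~ sub("\\.x$", "", .x), ends_with(".x"))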
Another possible solution, based on powerjoin:
library(powerjoin)
library(tibble)
power_inner_join(dat1, dat2, by = "id", conflict = coalesce_xy)
#> # A tibble: 4 × 4
#> id a1 a2 a3
#> <chr> <chr> <chr> <chr>
#> 1 soo1 Test Math Science
#> 2 soo2 Tested Eng Physic
#> 3 soo3 Testing UK Biology
#> 4 soo4 Tested French Chem
You could also fill your values "downup" per group for every column like this:
library(dplyr)
library(tidyr)
dat1 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c("Test", "Tested", "Testing", NA),
a2= c("Math", "Eng", NA, "French"),
a3= c("Science", NA, "Biology", "Chem"))
dat2 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c(NA, NA, NA, "Tested"),
a2= c("Math", NA, "UK", NA),
a3= c("Science", "Physic", NA, NA))
dat1 %>%
bind_rows(dat2) %>%
group_by(id) %>%
fill(everything(), .direction = "downup") %>%
slice(1)
#> # A tibble: 4 × 4
#> # Groups: id [4]
#> id a1 a2 a3
#> <chr> <chr> <chr> <chr>
#> 1 soo1 Test Math Science
#> 2 soo2 Tested Eng Physic
#> 3 soo3 Testing UK Biology
#> 4 soo4 Tested French Chem
Created on 2022-07-18 by the reprex package (v2.0.1)
In dplyr, we may use rows_patch, which fills the NA values in dat1 with the corresponding values from dat2:
library(dplyr)
rows_patch(dat1, dat2, by = 'id')
-output
# A tibble: 4 × 4
id a1 a2 a3
<chr> <chr> <chr> <chr>
1 soo1 Test Math Science
2 soo2 Tested Eng Physic
3 soo3 Testing UK Biology
4 soo4 Tested French Chem

How to select row with maximum value if all values are NA in R

I have a dataframe that looks like this:
tt1 <- structure(list(sjlid = c("SJL1527107", "SJL1527107", "SJL1527107",
"SJL1527107", "SJL1527107"), condition = c("Abnormal_glucose_metabolism",
"Abnormal_glucose_metabolism", "Abnormal_glucose_metabolism",
"Abnormal_glucose_metabolism", "Abnormal_glucose_metabolism"),
grade = c(NA, NA, NA, NA, NA), ageevent = c(58.8352421588442,
62.1366120218579, 64.4872969533648, 68.9694887341867, 70.9612695561045
)), row.names = 72:76, class = "data.frame")
I need to run this code:
library(dplyr)
tt1 %>% group_by(condition) %>% top_n(1, grade) %>% top_n(1, ageevent)
This code works when there are multiple rows (like tt1), but if there is only one row (like tt2 below), it fails to return that particular row. For example, if I have my dataframe like this:
tt2 <- structure(list(sjlid = "SJL1527107", condition = "Abnormal_glucose_metabolism",
grade = NA, ageevent = 58.8352421588442), row.names = 72L, class = "data.frame")
tt2 %>% group_by(condition) %>% top_n(1, grade) %>% top_n(1, ageevent) returns nothing
# A tibble: 0 x 4
# Groups: condition [0]
# ... with 4 variables: sjlid <chr>, condition <chr>, grade <lgl>, ageevent <dbl>
Instead, I want it to return this row because it's the only row there.
sjlid condition grade ageevent
72 SJL1527107 Abnormal_glucose_metabolism NA 58.8352421588442
slice_max can be used - top_n has been superseded in favor of the slice_* functions:
library(dplyr)
tt1 %>%
group_by(condition) %>%
slice_max(n = 1, order_by = ageevent) %>%
ungroup
-output
# A tibble: 1 × 4
sjlid condition grade ageevent
<chr> <chr> <lgl> <dbl>
1 SJL1527107 Abnormal_glucose_metabolism NA 71.0
It also works with tt2 (if both columns need to be considered):
tt2 %>%
group_by(condition) %>%
slice_max(n = 1, order_by = pmax(ageevent, grade, na.rm = TRUE) ) %>%
ungroup
# A tibble: 1 × 4
sjlid condition grade ageevent
<chr> <chr> <lgl> <dbl>
1 SJL1527107 Abnormal_glucose_metabolism NA 58.8
If we need to prioritize one column over the other, another option is arrange followed by distinct:
tt2 %>%
arrange(condition, desc(ageevent), desc(grade)) %>%
distinct(condition, .keep_all = TRUE)
For the top_n, we can use
tt2 %>%
group_by(condition) %>%
top_n(pmax(grade, ageevent, na.rm = TRUE)) %>%
ungroup
Selecting by ageevent
# A tibble: 1 × 4
sjlid condition grade ageevent
<chr> <chr> <lgl> <dbl>
1 SJL1527107 Abnormal_glucose_metabolism NA 58.8

How to match corresponding values to part of string (before and after space)?

I have two dataframes, and want to add values from the 2nd one to the 1st one according to string values, but use partial string matching if there is a space.
df1:
cat
small dog
apple
df2:
cat 24
small 5
dog 400
apple 83
pear 55
I normally use "left_join" from tidyverse, which would be
df3 <- left_join(df1, df2, by="column_name")
df3:
cat 24
small dog NA
apple 83
but this means that "small dog" has a missing value. What I want to do this time is find the value for either "small" or "dog", and input whichever is bigger. I'm not able to find a function that will tell R to look separately before and after the space, though.
Another possible solution, based on inner_join:
library(tidyverse)
df1 %>%
mutate(spaces = row_number()*str_detect(column_name, " ")) %>%
separate_rows(column_name, sep = " ") %>%
inner_join(df2, by="column_name") %>%
group_by(spaces) %>%
mutate(col2 = if_else(spaces > 0, max(col2), col2),
column_name = if_else(spaces > 0, str_c(column_name, collapse = " "),
column_name)) %>%
ungroup %>% distinct %>% select(-spaces)
#> # A tibble: 3 × 2
#> column_name col2
#> <chr> <dbl>
#> 1 cat 24
#> 2 small dog 400
#> 3 apple 83
We may use regex_left_join from fuzzyjoin and then group by the original column to summarise the second column with its max value:
library(dplyr)
library(fuzzyjoin)
regex_left_join(df1, df2, by = "column_name") %>%
group_by(column_name = column_name.x) %>%
summarise(col2 = max(col2))
-output
# A tibble: 3 × 2
column_name col2
<chr> <dbl>
1 apple 83
2 cat 24
3 small dog 400
data
df1 <- structure(list(column_name = c("cat", "small dog", "apple")),
class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(column_name = c("cat", "small", "dog", "apple",
"pear"), col2 = c(24, 5, 400, 83, 55)), class = "data.frame", row.names = c(NA,
-5L))
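For reference, a base R sketch of the same idea (not taken from the answers above): split each name on spaces, look the pieces up in df2 and keep the largest matching value.
df1$col2 <- sapply(strsplit(df1$column_name, " "),
function(parts) max(df2$col2[df2$column_name %in% parts]))
df1
# names with no match in df2 would come back as -Inf (with a warning)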

R Replace NA for all Columns Except *

library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#> # A tibble: 4 x 5
#> Date col1 thisCol thatCol col999
#> <date> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 1 NA 25 99
#> 2 2020-01-01 2 8 26 99
#> 3 2020-01-01 3 NA 27 99
#> 4 NA 4 3 28 99
My actual R data frame has hundreds of columns that aren't neatly named, but can be approximated by the df data frame above.
I want to replace all values of NA with 0, with the exception of several columns (in my example I want to leave out the Date column and the thatCol column). I'd want to do it in this sort of fashion:
df %>% replace(is.na(.), 0)
#> Error: Assigned data `values` must be compatible with existing data.
#> i Error occurred for column `Date`.
#> x Can't convert <double> to <date>.
#> Run `rlang::last_error()` to see where the error occurred.
And my unsuccessful ideas for accomplishing the "everything except" replace NA are shown below.
df %>% replace(is.na(c(., -c(Date, thatCol)), 0))
df %>% replace_na(list([, c(2:3, 5)] = 0))
df %>% replace_na(list(everything(-c(Date, thatCol)) = 0))
Is there a way to select everything BUT in the way I need to? There are hundreds of columns, named inconsistently, so typing them one by one is not a practical option.
You can use mutate_at:
library(dplyr)
Remove them by name
df %>% mutate_at(vars(-c(Date, thatCol)), ~replace(., is.na(.), 0))
Remove them by position
df %>% mutate_at(-c(1,4), ~replace(., is.na(.), 0))
Select them by name
df %>% mutate_at(vars(col1, thisCol, col999), ~replace(., is.na(.), 0))
Select them by position
df %>% mutate_at(c(2, 3, 5), ~replace(., is.na(.), 0))
If you want to use replace_na
df %>% mutate_at(vars(-c(Date, thatCol)), tidyr::replace_na, 0)
Note that mutate_at is superseded by across in dplyr 1.0.0.
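For instance, the first call above could be written with across() as follows (a sketch, assuming dplyr >= 1.0.0):
df %>% mutate(across(-c(Date, thatCol), ~ replace(.x, is.na(.x), 0)))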
You have several options here based on data.table.
One of the coolest options: setnafill (version >= 1.12.4):
library(data.table)
setDT(df)
data.table::setnafill(df, fill = 0, cols = colnames(df)[!(colnames(df) %in% c("Date", "thatCol"))])
Note that your dataframe is updated by reference.
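If you want to keep the original data untouched, a sketch is to fill a copy instead (copy() is needed because plain assignment does not protect a data.table from modification by reference):
df_filled <- data.table::copy(df)
data.table::setnafill(df_filled, fill = 0,
cols = setdiff(names(df_filled), c("Date", "thatCol")))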
Another option, in base R:
to_change <- grep("^(this|col)", names(df))
df[to_change] <- sapply(df[to_change], function(x) replace(x, is.na(x), 0))
df
# A tibble: 4 x 5
Date col1 thisCol thatCol col999
<date> <dbl> <dbl> <int> <dbl>
1 2020-01-01 1 0 25 99
2 2020-01-01 2 8 26 99
3 2020-01-01 3 0 27 99
4 NA 0 3 28 99
Data (I changed one value):
df <- structure(list(Date = structure(c(18262, 18262, 18262, NA), class = "Date"),
col1 = c(1L, 2L, 3L, NA), thisCol = c(NA, 8, NA, 3), thatCol = 25:28,
col999 = c(99, 99, 99, 99)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
replace works on a data.frame, so we can just do the replacement by index and update the original dataset
df[-c(1, 4)] <- replace(df[-c(1, 4)], is.na(df[-c(1, 4)]), 0)
Or using replace_na with across (from the new dplyr)
library(dplyr)
library(tidyr)
df %>%
mutate(across(-c(Date, thatCol), ~ replace_na(., 0)))
If you know the ones that you don't want to change, you could do it like this:
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#dplyr
df_nonreplace <- select(df, c("Date", "thatCol"))
df_replace <- df[ ,!names(df) %in% names(df_nonreplace)]
df_replace[is.na(df_replace)] <- 0
df <- cbind(df_nonreplace, df_replace)
> head(df)
Date thatCol col1 thisCol col999
1 2020-01-01 25 1 0 99
2 2020-01-01 26 2 8 99
3 2020-01-01 27 3 0 99
4 <NA> 28 4 3 99

tidyr::spread resulting in multiple rows

I have a problem similar to the following one, but the solution presented in the linked question does not work for me:
tidyr spread does not aggregate data
I have a df in the following structure:
UndesiredIndex DesiredIndex DesiredRows Result
1 x1A x1 A 50,32
2 x1B x2 B 7,34
3 x2A x1 A 50,33
4 x2B x2 B 7,35
Using the code below:
dftest <- bd_teste %>%
select(-UndesiredIndex) %>%
spread(DesiredIndex, Result)
I expected the following result:
DesiredIndex A B
A 50,32 50,33
B 7,34 7,35
Although, I keep getting the following result:
DesiredIndex x1 x2
1 A 50.32 NA
2 B 7.34 NA
3 A NA 50.33
4 B NA 7.35
PS: Sometimes I force the column UndesiredIndex out with select(-UndesiredIndex), but I keep getting the following message:
Adding missing grouping variables: UndesiredIndex
It might be something easy to stack those rows, but I'm new to R and have been trying hard to solve this, without success.
Thanks in advance!
We group by DesiredIndex, create a sequence column and then do the spread:
library(tidyverse)
df1 %>%
select(-UndesiredIndex) %>%
group_by(DesiredIndex) %>%
mutate(new = LETTERS[row_number()]) %>%
ungroup %>%
select(-DesiredIndex) %>%
spread(new, Result)
# A tibble: 2 x 3
# DesiredRows A B
# <chr> <chr> <chr>
#1 A 50,32 50,33
#2 B 7,34 7,35
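As an aside, spread() has since been superseded; a sketch of the same pipeline with pivot_wider() (only the last verb changes):
df1 %>%
select(-UndesiredIndex) %>%
group_by(DesiredIndex) %>%
mutate(new = LETTERS[row_number()]) %>%
ungroup %>%
select(-DesiredIndex) %>%
pivot_wider(names_from = new, values_from = Result)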
Data
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
Shorter, but conceptually more round-about.
Data
(Thanks to #akrun!)
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
This is a great technique for concatenating rows.
df1 %>%
group_by(DesiredRows) %>%
summarise(Result = paste(Result, collapse = "|")) %>% #<Concatenate rows
separate(Result, into = c("A", "B"), sep = "\\|") #<Separate by '|'
#> # A tibble: 2 x 3
#> DesiredRows A B
#> <chr> <chr> <chr>
#> 1 A 50,32 50,33
#> 2 B 7,34 7,35
Created on 2018-08-06 by the reprex package (v0.2.0).
