Reordering columns using common names - dplyr - r

My data comes from a database which, depending on when I run my SQL query, could contain different values for POS from one week to the next.
Not knowing which values will be in a variable makes it very hard to automate the creation of a report.
My data looks as follows:
sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))
I need to pivot this data frame wider so that there's a column for each POS by cost (gross & net).
This can easily be achieved using pivot_wider:
x <- sample %>% pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST))
Objective
I would like to be able to keep the columns for each POS together i.e. the GROSS_COST_Hospital and NET_COST_Hospital would be side by side, similar for all other POS columns.
Is there an elegant way to group columns using string matching?

Unfortunately, I don't think there is a direct solution to this (yet!). See https://github.com/tidyverse/tidyr/issues/839 .
For now you can get the data in long format so you can control their ordering the way you want.
library(tidyr)
sample %>%
pivot_longer(cols = c(GROSS_COST, NET_COST)) %>%
pivot_wider(names_from = c(name, POS), values_from = value)
# DRUG GROSS_COST_Hosp… NET_COST_Hospit… GROSS_COST_Phys… NET_COST_Physic…
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 50 45 100 80
#2 B NA NA NA NA
# … with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>

We can do the ordering in the select step:
library(dplyr)
library(tidyr)
library(stringr)
sample %>%
pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST)) %>%
select(DRUG, names(.)[-1][order(str_extract(names(.)[-1], '[^_]+$'))])
# A tibble: 2 x 7
# DRUG GROSS_COST_Home NET_COST_Home GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physician NET_COST_Physician
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A NA NA 50 45 100 80
#2 B 60 40 NA NA NA NA
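If a POS value could itself contain an underscore (a hypothetical "Home_Care", say), ordering by the text after the last underscore may misgroup columns. A minimal sketch of an alternative, assuming the cost columns are always GROSS_COST and NET_COST, builds the column order from the POS values themselves. The names `sample_df`, `cost_cols` and `col_order` are illustrative and not from the original post:

```r
library(dplyr)
library(tidyr)

sample_df <- data.frame(
  DRUG = c("A", "A", "B"),
  POS = c("Hospital", "Physician", "Home"),
  GROSS_COST = c(50, 100, 60),
  NET_COST = c(45, 80, 40)
)

# Assumed to be known in advance; only the POS values vary week to week
cost_cols <- c("GROSS_COST", "NET_COST")

# One block of cost columns per POS value, in order of first appearance
col_order <- unlist(lapply(
  unique(sample_df$POS),
  function(p) paste(cost_cols, p, sep = "_")
))

wide <- sample_df %>%
  pivot_wider(names_from = POS, values_from = all_of(cost_cols)) %>%
  select(DRUG, all_of(col_order))

names(wide)
```

Because the order is derived from the data, nothing has to be edited when a new POS value shows up in the query result.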

A data.table option using dcast + melt
> dcast(melt(setDT(sample), id.vars = c("DRUG", "POS")), DRUG ~ variable + POS)
DRUG GROSS_COST_Home GROSS_COST_Hospital GROSS_COST_Physician NET_COST_Home
1: A NA 50 100 NA
2: B 60 NA NA 40
NET_COST_Hospital NET_COST_Physician
1: 45 80
2: NA NA

As of tidyr 1.2.0 the issue is resolved: you can do this directly with the names_vary argument.
library(tidyr)
sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))
sample %>%
pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST), names_vary = 'slowest')
#> # A tibble: 2 x 7
#> DRUG GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physi~ NET_COST_Physic~
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 50 45 100 80
#> 2 B NA NA NA NA
#> # ... with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>
Created on 2022-02-18 by the reprex package (v2.0.1)

Related

Places after decimal points discarded when extracting numbers from strings

I'd like to extract weight values from strings with the unit and the time of measurement using tidyverse.
My dataset is like as below:
df <- tibble(ID = c("A","B","C"),
Measurement = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
------
A tibble: 3 × 2
ID Measurement
<chr> <chr>
1 A 45kg^20221120
2 B 51.5kg^20221015
3 C 66.05kg^20221020
I use stringr in the tidyverse package with regular expressions.
library(tidyverse)
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))
----------
A tibble: 3 × 3
ID Measurement Weight
<chr> <chr> <dbl>
1 A 45kg^20221120 45
2 B 51.5kg^20221015 51.5
3 C 66.05kg^20221020 66.0
The second decimal place of C (.05) isn't extracted.
What's wrong with my code?
Any answers or comments are welcome.
Thanks.
Yes, it was extracted; the tibble print method just rounds it to 66.0 for display.
You can see the full value if you convert the result to a data.frame or View it.
Solution
Check this:
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
as.data.frame()
Output
#> ID Measurement Weight
#> 1 A 45kg^20221120 45.00
#> 2 B 51.5kg^20221015 51.50
#> 3 C 66.05kg^20221020 66.05
Or check this
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
View()
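To confirm that only the printed value is rounded, you can compare the stored number directly, or ask the tibble printer for more significant digits. A small sketch, assuming a reasonably recent pillar package (which handles tibble printing and reads the pillar.sigfig option); `res` is an illustrative name:

```r
library(dplyr)
library(stringr)

df <- tibble(ID = c("A", "B", "C"),
             Measurement = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))

res <- df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))

# Compare against the literal value: the stored number keeps both decimals
res$Weight[3] == 66.05

# Ask tibble printing for more significant digits (the default is 3)
options(pillar.sigfig = 5)
print(res)
```

Setting the option changes display only; the underlying doubles are the same before and after.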
You could try to pull all the data out of the string at once with extract:
library(tidyverse)
df <- tibble(ID = c("A","B","C"),
Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
df |>
extract(col = Weight,
into = c("weight", "unit", "date"),
regex = "(.*)(kg)\\^(.*$)",
remove = TRUE,
convert = TRUE) |>
mutate(date = lubridate::ymd(date))
#> # A tibble: 3 x 4
#> ID weight unit date
#> <chr> <dbl> <chr> <date>
#> 1 A 45 kg 2022-11-20
#> 2 B 51.5 kg 2022-10-15
#> 3 C 66.0 kg 2022-10-20
Note that, as stated in the comments, the .05 is just not printing, but is present in the data.

converting long to wide with columns starting at zero

I have the following data
county<-c("a","a","a","b","b","c")
id<-c(1,2,3,4,5,6)
data<-data.frame(county,id)
I need to convert from long to wide and get the following output
county<-c("a","b","c")
id__0<-c(1,4,6)
id__1<-c(2,5,NA)
id__2<-c(3,NA,NA)
data2<-data.frame(county,id__0,id__1,id__2)
My main problem is not in converting from long to wide, but how to make the columns start with id__0.
You could add an intermediate variable by grouping according to county and using mutate to build a sequence from 0 upwards for each county, then pivot_wider on that:
library(tidyr)
library(dplyr)
data %>%
group_by(county) %>%
mutate(id_count = seq(n()) - 1) %>%
pivot_wider(id_cols = county, names_from = id_count, values_from = id, names_prefix = "id_")
#> # A tibble: 3 x 4
#> # Groups: county [3]
#> county id_0 id_1 id_2
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 1 2 3
#> 2 b 4 5 NA
#> 3 c 6 NA NA
Created on 2022-02-10 by the reprex package (v2.0.1)
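The question asked for names with a double underscore (id__0, id__1, ...). A hedged variant of the same idea, using row_number() for the zero-based counter and dropping the grouping before the pivot; `idx` and `wide` are illustrative names:

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  county = c("a", "a", "a", "b", "b", "c"),
  id = 1:6
)

wide <- data %>%
  group_by(county) %>%
  mutate(idx = row_number() - 1) %>%   # 0, 1, 2, ... within each county
  ungroup() %>%
  pivot_wider(id_cols = county,
              names_from = idx,
              values_from = id,
              names_prefix = "id__")

wide
```

The zero-based start comes entirely from subtracting 1 in the mutate step; names_prefix only controls the text placed in front of each counter value.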

Cannot use multi word variables in dplyr or am I missing something?

Why doesn't dplyr like this format of 'beta linalool' in my function as compared to beta.linalool?
It took me a few hours of troubleshooting to figure out what the problem was. Is there any way to use data where variables are labeled as more than one word or should I just move everything to the beta.linalool type format?
Everything I have learned has been from Programming with dplyr.
library(ggplot2)
library(readxl)
library(dplyr)
library(magrittr)
Data3<- read_excel("Desktop/Data3.xlsx")
Data3 %>% filter(Variety=="CS 420A"&`Red Blotch`=="-")%>% group_by(`Time Point`)%>%
summarise(m=mean(`beta linalool`),SD=sd(`beta linalool`))
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End 0.00300 0.000117
2 Mid 0.00385 0.000353
3 Must 0.000254 0.00000633
4 Start 0.000785 0.000283
Now when I work it into a function:
cwine<-function(df,v,rb,c){
c<-enquo(c)
df %>% filter(Variety==v&`Red Blotch`==rb)%>%
group_by(`Time Point`) %>%
summarise(m=mean(!!c),SD=sd(!!c))
}
cwine(Data3,"CS 420A","-",'beta linalool')
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End NA NA
2 Mid NA NA
3 Must NA NA
4 Start NA NA
Warning messages:
1: In mean.default(~"beta linalool") :
argument is not numeric or logical: returning NA #this statement is repeated 4 more times
5: In var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
NAs introduced by coercion #this statement is repeated 4 more times
The problem lies in that beta linalool is typed in as 'beta linalool'. I figured this out by trying this methodology on the iris dataset and seeing that the column is passed as the bare name Petal.Length, not as a quoted string like 'Petal Width':
my_function<-function(ds,x,y,c){
c<-enquo(c)
ds %>%filter(Sepal.Length>x&Sepal.Width<y) %>%
group_by(Species) %>%
summarise(m=mean(!!c),SD=sd(!!c))
}
my_function(iris,5,4,Petal.Length)
# A tibble: 3 x 3
Species m SD
<fct> <dbl> <dbl>
1 setosa 1.53 0.157
2 versicolor 4.32 0.423
3 virginica 5.57 0.536
In fact my function works fine on a different variable:
> cwine(Data2,"CS 420A","-",nerol)
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End 0.000453 0.0000338
2 Mid 0.000659 0.0000660
3 Must 0.000560 0.0000234
4 Start 0.000927 0.0000224
Is dplyr just that sensitive or am I missing something?
One option would be to convert it to a symbol and evaluate it:
library(tidyverse)
cwine <- function(df,v,rb,c){
df %>%
filter(Variety==v & `Red Blotch` == rb)%>%
group_by(`Time Point`) %>%
summarise(m = mean(!!rlang::sym(c)),
SD = sd(!! rlang::sym(c)))
}
cwine(Data3,"CS 420A","-",'beta linalool')
# A tibble: 2 x 3
# `Time Point` m SD
# <int> <dbl> <dbl>
#1 2 -2.11 2.23
#2 4 0.0171 NA
Alternatively, if we capture the argument as a quosure (enquo), it works when the variable name is passed in backquotes. Unquoted names usually work as they are, but here the name contains a space, so backquotes are needed for it to be evaluated as a single name.
cwine <- function(df,v,rb,c){
c1 <- enquo(c)
df %>%
filter(Variety==v & `Red Blotch` == rb)%>%
group_by(`Time Point`) %>%
summarise(m = mean(!! c1 ),
SD = sd(!! c1))
}
cwine(Data3,"CS 420A","-",`beta linalool`)
# A tibble: 2 x 3
# `Time Point` m SD
# <int> <dbl> <dbl>
#1 2 -2.11 2.23
#2 4 0.0171 NA
data
set.seed(24)
Data3 <- tibble(Variety = sample(c("CS 420A", "CS 410A"), 20, replace = TRUE),
`Red Blotch` = sample(c("-", "+"), 20, replace = TRUE),
`Time Point` = sample(1:4, 20, replace = TRUE),
`beta linalool` = rnorm(20))
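With more recent releases (assuming dplyr 1.0 or later and rlang 0.4 or later), the same two calling styles can be sketched without enquo or sym: the .data pronoun accepts a column name given as a string, and the embrace operator {{ }} accepts a bare or backticked name. The function names `cwine_chr` and `cwine_sym` are illustrative:

```r
library(dplyr)

set.seed(24)
Data3 <- tibble(Variety = sample(c("CS 420A", "CS 410A"), 20, replace = TRUE),
                `Red Blotch` = sample(c("-", "+"), 20, replace = TRUE),
                `Time Point` = sample(1:4, 20, replace = TRUE),
                `beta linalool` = rnorm(20))

# Column name supplied as a string, looked up through the .data pronoun
cwine_chr <- function(df, v, rb, col) {
  df %>%
    filter(Variety == v, `Red Blotch` == rb) %>%
    group_by(`Time Point`) %>%
    summarise(m = mean(.data[[col]]), SD = sd(.data[[col]]), .groups = "drop")
}

# Column name supplied bare (backticked because it contains a space)
cwine_sym <- function(df, v, rb, col) {
  df %>%
    filter(Variety == v, `Red Blotch` == rb) %>%
    group_by(`Time Point`) %>%
    summarise(m = mean({{ col }}), SD = sd({{ col }}), .groups = "drop")
}

res_chr <- cwine_chr(Data3, "CS 420A", "-", "beta linalool")
res_sym <- cwine_sym(Data3, "CS 420A", "-", `beta linalool`)
```

Both calls should return the same summary; which style to prefer depends on whether callers are more likely to hold the column name in a character variable or to type it directly.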

Finding duplicate observations of selected variables in a tibble

I have a rather large tibble (called df.tbl with ~ 26k rows and 22 columns) and I want to find the "twins" of each object, i.e. each row that has the same values in column 2:7 (date:Pos).
If I use:
inner_join(df.tbl, df.tbl[i,], by = c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos"))
with i being the row I want to check for "twins", everything is working as expected, spitting out a 2 x 22 tibble, and I can expand this using:
x <- NULL
for (i in 1:nrow(df.tbl)) {
x[[i]] <- as_vector(inner_join(df.tbl[,],
df.tbl[i,],
by = c("date",
"forge",
"serNum",
"PinMain",
"PinMainNumber",
"Pos")) %>%
select(rowNum.x))
}
to create a list containing the row numbers for each twin for each object (row).
I cannot, however I try, use map to produce a similar result:
twins <- map(df.tbl, ~ inner_join(df.tbl,
.,
by = c("date",
"forge",
"serNum",
"PinMain",
"PinMainNumber",
"Pos")) %>%
select(rowNum.x) )
All I get is the following error:
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
How would I go about converting the for loop into an equivalent using map?
My original data look like this:
>head(df.tbl, 3)
# A tibble: 3 x 22
rowNum date forge serNum PinMain PinMainNumber Pos FrontBack flow Sharped SV OP max min mean
<dbl> <date> <chr> <fct> <fct> <fct> <fct> <fct> <chr> <fct> <fct> <chr> <dbl> <dbl> <dbl>
1 1 2017-10-18 NA 179 Pin 1 W F NA 3 36237 235 77.7 55.3 64.7
2 2 2017-10-18 NA 179 Pin 2 W F NA 3 36237 235 77.5 52.1 67.4
3 3 2017-10-18 NA 179 Pin 3 W F NA 3 36237 235 79.5 58.6 69.0
# ... with 7 more variables: median <dbl>, sd <dbl>, Round2 <dbl>, Round4 <dbl>, OrigData <list>, dataSize <int>,
# fileName <chr>
and I would like a list with a length the same as nrow(df.tbl) looking like this:
> twins
[[1]]
[1] 1 7
[[2]]
[1] 2 8
[[3]]
[1] 3 9
Almost all objects have one twin / duplicate (as above) but a few have two or even three duplicates (as defined above, i.e. column 2:7 are the same)
A bit late to the party, but you can do it much more neatly with nest().
tbl.df1 <- df.tbl %>% group_by(date, forge, serNum, PinMain, PinMainNumber, Pos) %>% nest(rowNum)
The twins will be in the list of tibbles created by nest.
tbl.df1$data
# [[1]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 1
# 2 7
#[[2]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 2
# 2 8
# etc
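If you need one list entry per row (length equal to nrow(df.tbl)), as in the question, a grouped mutate with a list-column is one possible sketch. The small `df.tbl` below is a made-up stand-in with only three of the key columns, and `key_cols` is an illustrative name; this assumes dplyr 1.0 or later for across():

```r
library(dplyr)

# Hypothetical stand-in for df.tbl: rows 1 and 4, 2 and 5, 3 and 6 are twins
df.tbl <- tibble(
  rowNum = 1:6,
  date = as.Date("2017-10-18"),
  serNum = "179",
  PinMainNumber = c(1, 2, 3, 1, 2, 3)
)

key_cols <- c("date", "serNum", "PinMainNumber")

twins <- df.tbl %>%
  group_by(across(all_of(key_cols))) %>%
  mutate(twins = list(rowNum)) %>%   # every row gets its whole group's rowNum vector
  ungroup() %>%
  pull(twins)

twins[[1]]
```

Each element contains the row's own number as well as those of its twins, matching the shape shown in the question.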
Do you really need to solve it with map?
I would solve it by combining duplicated with semi_join from the dplyr package, like this:
defining_columns <- c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos")
dplyr::semi_join(
df.tbl,
df.tbl[duplicated(df.tbl[defining_columns]),],
by = defining_columns
) %>%
group_by_at(defining_columns) %>%
arrange(.by_group = TRUE) %>%
summarise(twins = paste0(rowNum,collapse = ",")) %>%
pull(twins) %>%
strsplit(",")
duplicated tells us which rows are duplicates, and semi_join keeps only the rows in x that have a match in y.
Hope this helps!!

How to control new variables' names after tidyr's spread?

I have a dataframe with panel structure: 2 observations for each unit from two years:
library(tidyr)
mydf <- data.frame(
id = rep(1:3, rep(2,3)),
year = rep(c(2012, 2013), 3),
value = runif(6)
)
mydf
# id year value
#1 1 2012 0.09668064
#2 1 2013 0.62739399
#3 2 2012 0.45618433
#4 2 2013 0.60347152
#5 3 2012 0.84537624
#6 3 2013 0.33466030
I would like to reshape this data to wide format which can be done easily with tidyr::spread. However, as the values of the year variable are numbers, the names of my new variables become numbers as well which makes its further use harder.
spread(mydf, year, value)
# id 2012 2013
#1 1 0.09668064 0.6273940
#2 2 0.45618433 0.6034715
#3 3 0.84537624 0.3346603
I know I can easily rename the columns. However, if I would like to reshape within a chain with other operations, it becomes inconvenient. E.g. the following line obviously does not make sense.
library(dplyr)
mydf %>% spread(year, value) %>% filter(2012 > 0.5)
The following works but is not that concise:
tmp <- spread(mydf, year, value)
names(tmp) <- c("id", "y2012", "y2013")
filter(tmp, y2012 > 0.5)
Any idea how I can change the new variable names within spread?
I know some years have passed since this question was originally asked, but for posterity I also want to highlight the sep argument of spread. When not NULL, it is used as the separator between the key name and the values:
mydf %>%
spread(key = year, value = value, sep = "")
# id year2012 year2013
#1 1 0.15608322 0.6886531
#2 2 0.04598124 0.0792947
#3 3 0.16835445 0.1744542
This is not exactly as wanted in the question, but sufficient for my purposes. See ?spread.
Update with tidyr 1.0.0: tidyr 1.0.0 has now introduced pivot_wider (and pivot_longer), which allow more control in this respect through the arguments names_sep and names_prefix. So now the call would be:
mydf %>%
pivot_wider(names_from = year, values_from = value,
names_prefix = "year")
# # A tibble: 3 x 3
# id year2012 year2013
# <int> <dbl> <dbl>
# 1 1 0.347 0.388
# 2 2 0.565 0.924
# 3 3 0.406 0.296
To get exactly what was originally wanted (prefixing "y" only) you can of course now get that directly by simply having names_prefix = "y".
names_sep is used when you spread over multiple columns, as demonstrated below, where I have added quarters to the data:
# Add quarters to data
mydf2 <- data.frame(
id = rep(1:3, each = 8),
year = rep(rep(c(2012, 2013), each = 4), 3),
quarter = rep(c("Q1","Q2","Q3","Q4"), 3),
value = runif(24)
)
head(mydf2)
# id year quarter value
# 1 1 2012 Q1 0.8651470
# 2 1 2012 Q2 0.3944423
# 3 1 2012 Q3 0.4580580
# 4 1 2012 Q4 0.2902604
# 5 1 2013 Q1 0.4751588
# 6 1 2013 Q2 0.6851755
mydf2 %>%
pivot_wider(names_from = c(year, quarter), values_from = value,
names_sep = "_", names_prefix = "y")
# # A tibble: 3 x 9
# id y2012_Q1 y2012_Q2 y2012_Q3 y2012_Q4 y2013_Q1 y2013_Q2 y2013_Q3 y2013_Q4
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.865 0.394 0.458 0.290 0.475 0.685 0.213 0.920
# 2 2 0.566 0.614 0.509 0.0515 0.974 0.916 0.681 0.509
# 3 3 0.968 0.615 0.670 0.748 0.723 0.996 0.247 0.449
You can use backticks for column names starting with numbers, and filter should work as expected:
mydf %>%
spread(year, value) %>%
filter(`2012` > 0.5)
# id 2012 2013
#1 3 0.8453762 0.3346603
Another option would be to use unite to join two columns into a single column, after creating a second column 'year1' containing the string 'y'.
mydf %>%
mutate(year1='y') %>%
unite(yearN, year1, year) %>%
spread(yearN, value) %>%
filter(y_2012 > 0.5)
# id y_2012 y_2013
#1 3 0.8453762 0.3346603
We can even change the 'year' column within mutate by using paste:
mydf %>%
mutate(year=paste('y', year, sep="_")) %>%
spread(year, value) %>%
filter(y_2012 > 0.5)
Another option is to use the setNames() function as the next thing in the pipe:
mydf %>%
spread(year, value) %>%
setNames( c("id", "y2012", "y2013") ) %>%
filter(y2012 > 0.5)
The only problem using setNames is that you have to know exactly what your columns will be when you spread() them. Most of the time, that's not a problem, particularly if you're working semi-interactively.
But if you're missing a key/value pair in your original data, there's a chance it won't show up as a column, and you can end up naming your columns incorrectly without even knowing it. Granted, setNames() will throw an error if the number of names doesn't match the number of columns, so you've got a bit of error checking built in.
Still, the convenience of using setNames() has outweighed the risk more often than not for me.
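If not knowing the column names in advance is the concern, one hedged alternative is dplyr's rename_with(), which applies a function to the existing names instead of requiring a fixed vector. A minimal sketch, assuming dplyr 1.0 or later and using pivot_wider in place of spread; `wide` is an illustrative name:

```r
library(dplyr)
library(tidyr)

set.seed(1)
mydf <- data.frame(
  id = rep(1:3, each = 2),
  year = rep(c(2012, 2013), 3),
  value = runif(6)
)

wide <- mydf %>%
  pivot_wider(names_from = year, values_from = value) %>%
  rename_with(~ paste0("y", .x), .cols = -id)   # prefix every column except id

names(wide)
```

Because the new names are computed from whatever columns exist, a year missing from the data simply yields one column fewer rather than a mislabelled one.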
Using spread()'s successor pivot_wider() we can give a prefix to the created columns :
library(tidyr)
set.seed(1)
mydf <- data.frame(
id = rep(1:3, rep(2,3)),
year = rep(c(2012, 2013), 3),
value = runif(6)
)
pivot_wider(mydf, names_from = "year", values_from = "value", names_prefix = "y")
#> # A tibble: 3 x 3
#> id y2012 y2013
#> <int> <dbl> <dbl>
#> 1 1 0.266 0.372
#> 2 2 0.573 0.908
#> 3 3 0.202 0.898
Created on 2019-09-14 by the reprex package (v0.3.0)
rename() in dplyr should do the trick
library(tidyr); library(dplyr)
mydf %>%
spread(year,value)%>%
rename(y2012 = '2012',y2013 = '2013')%>%
filter(y2012>0.5)

Resources