Classifying columns based on str_detect - r

I am currently working with a data frame that looks like this:
Example <- structure(list(ID = c(12301L, 12301L, 15271L, 11888L, 15271L,
15271L, 15271L), StationOwner = c("Brian", "Brian", "Simon",
"Brian", "Simon", "Simon", "Simon"), StationName = c("Red", "Red",
"Red", "Green", "Yellow", "Yellow", "Yellow"), Parameter = c("Rain - Daily",
"Temperature -Daily", "VPD - Daily", "Rain - Daily", "Rain - Daily",
"Temperature -Daily", "VPD - Daily")), class = "data.frame", row.names = c(NA,
-7L))
I am looking into using str_detect to filter, for example, all the observations that start with "Rain -" and add what comes after the dash to a new column called "Rain". I have been able to filter out only the values that start with "Rain" using str_detect, but have not found a way to assign them automatically. Is there a specific function that would help with this? Appreciate the pointers, thanks!
Example of desired output that I am trying to achieve:
Desired <- structure(list(ID = c(12301L, 15271L, 12301L, 15271L
), StationOwner = c("Brian", "Simon", "Brian", "Simon"), StationName = c("Red",
"Red", "Green", "Yellow"), Rain = c("Daily", NA, "Daily", "Daily"
), Temperature = c("Daily", NA, NA, "Daily"), VPD = c(NA, "Daily",
NA, "Daily")), class = "data.frame", row.names = c(NA, -4L))

Directly using pivot_wider (from tidyr, with str_remove from stringr):
library(tidyr)
library(stringr)
pivot_wider(Example, names_from = Parameter, values_from = Parameter,
names_repair = ~str_remove(., ' .*'), values_fn = ~str_remove(., '.*- ?'))
# A tibble: 4 x 6
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily

This does not use str_detect, but you can achieve Desired with:
library(dplyr)
library(tidyr) # separate() and pivot_wider() come from tidyr
Example %>%
separate(Parameter, c('a', 'b'), sep = "-") %>%
mutate(across(where(is.character), ~trimws(.x))) %>%
pivot_wider(id_cols = c("ID","StationOwner", "StationName"), names_from = "a", values_from = "b")
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily
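For completeness, the single-column case from the question can also be done with str_detect itself. The block below is only a sketch on a trimmed-down copy of Example; the pattern "^Rain\\s*-\\s*" is my assumption about how the prefix and separator are written, so adjust it to your data.

```r
library(stringr)

# Trimmed-down copy of Example (only the columns needed for the illustration)
Example <- data.frame(
  ID = c(12301L, 12301L, 15271L),
  Parameter = c("Rain - Daily", "Temperature -Daily", "VPD - Daily"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose Parameter starts with "Rain"
is_rain <- str_detect(Example$Parameter, "^Rain")

# Keep what follows "Rain -" for matching rows, NA otherwise
Example$Rain <- ifelse(
  is_rain,
  str_remove(Example$Parameter, "^Rain\\s*-\\s*"),
  NA_character_
)
Example
```

This handles one parameter at a time, so the pivot_wider approaches above are probably less work when every parameter should become its own column.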

Related

R dataframe with values in the wrong columns

I have a dataframe like this one:
Name Characteristic_1 Characteristic_2
Apple Yellow Italian
Pear British Yellow
Strawberries French Red
Blackberry Blue Austrian
As you can see, a given characteristic can be in different columns depending on the row. I would like to obtain a dataframe where each column contains only the values of a specific characteristic.
Name Characteristic_1 Characteristic_2
Apple Yellow Italian
Pear Yellow British
Strawberries Red French
Blackberry Blue Austrian
My idea is to use the case_when function, but I would like to know if there are faster ways to achieve the same result.
Example data:
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
I suspect there is an easier way of solving the issue, but here is one potential solution:
# Load the libraries
library(tidyverse)
# Load the data
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
# R has 657 built in colour names. You can see them using the `colours()` function.
# Chances are your colours are contained in this list.
# The `str_to_title()` function capitalizes every colour in the list
list_of_colours <- str_to_title(colours())
# If your colours are not contained in the list, add them using e.g.
# `list_of_colours <- c(list_of_colours, "Octarine")`
# Create a new dataframe ("df2") by taking the original dataframe ("df")
df2 <- df %>%
# Create two new columns called "Colour" and "Origin" using `mutate()` with
# `ifelse` used to identify whether each word is in the list of colours.
# If the word is in the list of colours, add it to the "Colour" column; if
# it isn't, add it to the "Origin" column.
mutate(Colour = ifelse(!is.na(str_extract(Characteristic_1, paste(list_of_colours, collapse = "|"))),
Characteristic_1, Characteristic_2),
Origin = ifelse(is.na(str_extract(Characteristic_1, paste(list_of_colours, collapse = "|"))),
Characteristic_1, Characteristic_2)) %>%
# Then select the columns you want
select(Name, Colour, Origin)
df2
# A tibble: 4 x 3
# Name Colour Origin
# <chr> <chr> <chr>
#1 Apple Yellow Italian
#2 Pear Yellow British
#3 Strawberries Red French
#4 Blackberry Blue Austrian
There is probably a better way of achieving this, but for now this is the solution that came to mind:
library(dplyr)
library(stringr)
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
df %>%
mutate(char_1 = if_else(str_to_lower(Characteristic_1) %in% colours(distinct = TRUE),
Characteristic_1, Characteristic_2),
char_2 = if_else(Characteristic_1 == char_1, Characteristic_2, Characteristic_1)) %>%
select(-c(Characteristic_1, Characteristic_2))
# A tibble: 4 x 3
Name char_1 char_2
<chr> <chr> <chr>
1 Apple Yellow Italian
2 Pear Yellow British
3 Strawberries Red French
4 Blackberry Blue Austrian
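One caveat with the regex approach in the first answer, as far as I can tell: str_extract matches substrings, so a value such as "Tanzanian" would match the colour "Tan". Comparing whole values with %in% avoids that. A minimal base R sketch of the same swap, assuming every row holds exactly one colour and one origin:

```r
df <- data.frame(
  Name = c("Apple", "Pear", "Strawberries", "Blackberry"),
  Characteristic_1 = c("Yellow", "British", "French", "Blue"),
  Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"),
  stringsAsFactors = FALSE
)

# TRUE where the first characteristic is one of R's built-in colour names
is_colour <- tolower(df$Characteristic_1) %in% colours()

out <- data.frame(
  Name = df$Name,
  # Take the colour from whichever column holds it, and the origin from the other
  Colour = ifelse(is_colour, df$Characteristic_1, df$Characteristic_2),
  Origin = ifelse(is_colour, df$Characteristic_2, df$Characteristic_1),
  stringsAsFactors = FALSE
)
out
```

The column names Colour and Origin are just illustrative; colours that are not in colours() would need to be added to the lookup vector first.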

Create funnelarea chart from data frame using plotly rstudio

I want to create a funnelarea chart using plotly, this is the example from plotly:
fig <- plot_ly(
type = "funnelarea",
values = c(5, 4, 3, 2, 1),
text = c("The 1st","The 2nd", "The 3rd", "The 4th", "The 5th"),
marker = list(colors = c("deepskyblue", "lightsalmon", "tan", "teal", "silver"),
line = list(color = c("wheat", "wheat", "blue", "wheat", "wheat"), width = c(0, 1, 5,
0, 4))),
textfont = list(family = "Old Standard TT, serif", size = 13, color = "black"),
opacity = 0.65)
fig
I would like to use a dataframe to fill this chart, using categories from my dataframe columns instead of the hard-coded text and values, but I can't find a way to do it.
Example of my dataframe
funnel_stage size purchaser_payment
1. Available for Sale 10 10000
1. Available for Sale 15 15000
2. Pending on Sale 8 8000
2. Pending on Sale 9 9000
3. Already Sold 1 1000
3. Already Sold 45 45000
3. Already Sold 12 12000
I would like the funnel to be filled by counting how many times each value of the first column occurs. It would look like this:
It's probably easiest if you first bring your data into a shape with one row per category in the correct order, see df2 below.
library("dplyr")
library("plotly")
df <- structure(
list(funnel_stage = c("Available for Sale", "Available for Sale",
"Pending on Sale", "Pending on Sale", "Already Sold",
"Already Sold", "Already Sold"),
size = c(10L, 15L, 8L, 9L, 1L, 45L, 12L),
purchaser_payment = c(10000L, 15000L, 8000L, 9000L, 1000L, 45000L, 12000L)),
class = "data.frame", row.names = c(NA, -7L))
df$funnel_stage <- factor(df$funnel_stage, levels = c("Available for Sale",
"Pending on Sale",
"Already Sold"))
df2 <- df %>%
group_by(funnel_stage) %>%
count()
df2
#> # A tibble: 3 x 2
#> # Groups: funnel_stage [3]
#> funnel_stage n
#> <fct> <int>
#> 1 Available for Sale 2
#> 2 Pending on Sale 2
#> 3 Already Sold 3
plot_ly() %>%
add_trace(
type = "funnelarea",
values = df2$n,
text = df2$funnel_stage)
packageVersion("dplyr")
#> [1] '0.8.5'
packageVersion("plotly")
#> [1] '4.9.2.1'

How to create a row by dividing First row by third row

I have a dataset which has values in the first row and the total in the third row. I want to create a fourth row holding the percentage of the first row relative to the total, which can be done by dividing the first row by the third row.
Below is the structure of the dataframe:
ds = structure(list(t1 = structure(c("1", "2", "Total"), label = "currently smoke any tobacco product", labels = c(no = 0,
yes = 1), class = "haven_labelled"), c1Female = c(679357.516868591,
8394232.81394577, 9073590.33081436), c1Male = c(2254232.8617363,
5802560.20343018, 8056793.06516647), se.c1Female = c(63743.4459540534,
421866.610586848, 485610.056540901), se.c1Male = c(185544.754820322,
386138.725133411, 571683.479953732), Total_1 = c(`1` = 2933590.37860489,
`2` = 14196793.0173759, `3` = 17130383.3959808), per = c(`1` = 0.171250713471665,
`2` = 0.828749286528335, `3` = 1)), class = "data.frame", row.names = c(NA,
-3L))
My attempt (which does not work):
ds %>% mutate(percentage = .[1,]/.[3,])
OUTPUT SHOULD BE : Below is the dput of Output Dataframe that I want
structure(list(t1 = structure(c(1L, 2L, 4L, 3L), .Label = c("1",
"2", "Percentage", "Total"), class = "factor"), c1Female = c(679357.517,
8394232.814, 9073590.331, 0.074871963), c1Male = c(2254232.86,
5802560.2, 8056793.07, 0.279792821), se.c1Female = c(63743.446,
421866.611, 485610.057, 0.131264674), se.c1Male = c(185544.755,
386138.725, 571683.48, 0.324558539), Total_1 = c(2933590.38,
14196793.02, 17130383.4, 0.171250714), per = c(0.171250713, 0.828749287,
1, 0.171250713)), class = "data.frame", row.names = c(NA, -4L
))
Please share the tidyverse way to do this. Also, please explain what is wrong with the line below:
ds %>% mutate(percentage = .[1,]/.[3,])
We can use summarise_at to divide multiple column values to return a single row and then bind with the original dataset
library(dplyr)
ds %>%
summarise_at(-1, ~ .[1]/.[3]) %>%
mutate(t1 = 'Percentage') %>%
bind_rows(ds, .)
# t1 c1Female c1Male se.c1Female se.c1Male Total_1 per
#1 1 6.793575e+05 2.254233e+06 6.374345e+04 1.855448e+05 2.933590e+06 0.1712507
#2 2 8.394233e+06 5.802560e+06 4.218666e+05 3.861387e+05 1.419679e+07 0.8287493
#3 Total 9.073590e+06 8.056793e+06 4.856101e+05 5.716835e+05 1.713038e+07 1.0000000
#4 Percentage 7.487196e-02 2.797928e-01 1.312647e-01 3.245585e-01 1.712507e-01 0.1712507
Or another option is add_row (replace_na comes from tidyr)
library(tidyr)
ds %>%
add_row(t1 = 'Percentage') %>%
mutate_at(-1, ~ replace_na(., .[1]/.[3]))
Or do this within the add_row step itself
ds %>%
add_row(t1 = 'Percentage', !!!as.list(.[-1][1,]/.[-1][3,]))
# t1 c1Female c1Male se.c1Female se.c1Male Total_1 per
#1 1 6.793575e+05 2.254233e+06 6.374345e+04 1.855448e+05 2.933590e+06 0.1712507
#2 2 8.394233e+06 5.802560e+06 4.218666e+05 3.861387e+05 1.419679e+07 0.8287493
#3 Total 9.073590e+06 8.056793e+06 4.856101e+05 5.716835e+05 1.713038e+07 1.0000000
#4 Percentage 7.487196e-02 2.797928e-01 1.312647e-01 3.245585e-01 1.712507e-01 0.1712507
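As for what is wrong with the original attempt, as far as I can tell: mutate() adds columns rather than rows, and `.[1,]/.[3,]` divides two one-row data frames that still include the non-numeric t1 column. The base R sketch below shows the row-wise idea on small made-up numbers (not the real ds): divide only the numeric columns, then append the result as a new row.

```r
# Made-up stand-in for ds with the same layout (values are illustrative only)
ds <- data.frame(
  t1 = c("1", "2", "Total"),
  c1Female = c(10, 30, 40),
  c1Male = c(5, 15, 20),
  stringsAsFactors = FALSE
)

# Restrict the division to numeric columns so t1 is left out
num <- vapply(ds, is.numeric, logical(1))
pct <- ds[1, num] / ds[3, num]

# Label the new row and append it in the original column order
pct$t1 <- "Percentage"
out <- rbind(ds, pct[names(ds)])
rownames(out) <- NULL
out
```

The dplyr versions above do the same thing by excluding the first column with -1.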

Splitting a dataframe column where new column values depend upon original data

I often work with dataframes that have columns with character string values that need to be separated. This results from a "select multiple" option in the data entry programme (which I cannot change unfortunately). I have tried tidyr::separate but that does not order the results properly. An example:
require(tidyr)
df = data.frame(
x = 1:3,
sick = c(NA, "malaria", "diarrhoea malaria"))
df <- df %>%
separate(sick, c("diarrhoea", "cough", "malaria"),
sep = " ", fill = "right", remove = FALSE)
But I want the result to look like this:
df2 = data.frame(
x = 1:3,
sick = c(NA, "malaria", "diarrhoea malaria"),
diarrhoea = c(NA, NA, "diarrhoea"),
cough = c(NA, NA, NA),
malaria = c(NA, "malaria", "malaria"))
Any help in the right direction would be much appreciated.
We can try with separate_rows and dcast
library(tidyr)
library(reshape2)
library(dplyr)
separate_rows(df, sick) %>%
mutate(sick = factor(sick, levels = c("diarrhoea", "cough", "malaria")), sick1 = sick) %>%
dcast(., x~sick, value.var = "sick1", drop=FALSE) %>%
bind_cols(., df[2]) %>%
select(x, sick, diarrhoea, cough, malaria)
# x sick diarrhoea cough malaria
#1 1 <NA> <NA> <NA> <NA>
#2 2 malaria <NA> <NA> malaria
#3 3 diarrhoea malaria diarrhoea <NA> malaria
Or another option is using cSplit from splitstackshape with dcast from data.table
library(data.table)
library(splitstackshape)
dcast(cSplit(df, "sick", " ", "long")[, sick:= factor(sick, levels =
c("diarrhoea", "cough", "malaria"))], x~sick, value.var = "sick", drop = FALSE)[,
sick := df$sick][]
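If you would rather avoid extra packages, the same layout can be sketched in base R by splitting each string and testing membership per symptom. This assumes the full set of possible options is known up front (the symptoms vector below is my stand-in for it) and that options are separated by single spaces.

```r
df <- data.frame(
  x = 1:3,
  sick = c(NA, "malaria", "diarrhoea malaria"),
  stringsAsFactors = FALSE
)

# All options offered by the data entry programme (assumed known in advance)
symptoms <- c("diarrhoea", "cough", "malaria")

# One character vector of selected options per row; NA stays NA
tokens <- strsplit(df$sick, " ", fixed = TRUE)

# One column per symptom: the symptom name if selected, NA otherwise
for (s in symptoms) {
  selected <- vapply(tokens, function(tk) s %in% tk, logical(1))
  df[[s]] <- ifelse(selected, s, NA_character_)
}
df
```

Because the columns are created from symptoms, their order is fixed regardless of the order in which options appear in sick.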

Plot different colours based on the conditions

This is the first 10 rows of my data frame:
head(test.data,10)
# A tibble: 10 x 5
date o2.permeg co2.ppm apo o2.spike
<time> <dbl> <dbl> <dbl> <chr>
1 2015-01-01 00:00:00 -685.09 413.023 -354.1816 N
2 2015-01-01 00:02:00 -695.10 412.894 -364.8690 N
3 2015-01-01 00:04:00 -687.84 412.979 -357.1627 N
4 2015-01-01 00:06:00 -683.23 412.866 -353.1460 N
5 2015-01-01 00:08:00 -683.28 412.755 -353.7788 N
6 2015-01-01 00:10:00 -685.40 412.647 -356.4659 N
7 2015-01-01 00:12:00 -687.80 412.659 -358.8029 N
8 2015-01-01 00:14:00 -662.79 412.665 NA Y
9 2015-01-01 00:16:00 -684.17 412.762 -354.6321 N
10 2015-01-01 00:18:00 -680.37 412.720 -351.0526 N
As you can see there's a last column named o2.spike, which has characters N and Y in it. N means that the data point is not a spike, and Y means that it is a spike. In this sample, there's only 1 Y, but in the real frame, there are loads, and randomly placed.
My desire is to plot all the data points in a plot, and those marked with Y will be plotted in a different colour.
For your information, this is the code that I am currently using to plot everything. The first 3 variables are plotted in red, green, and blue, and I want the "Y" rows to be plotted in, for example, pink.
library(openair)
test.data$yr_day <- format(as.Date(test.data$date), "%Y-%m-%d")
dir.create(daily) # where "daily" is the path of the folder I want to save the plots into
for (d in unique(test.data$yr_day)) {
mypath <- file.path(daily, paste(name, d, ".png", sep = "" ))
png(filename = mypath, width = 963, height = 690)
timePlot(subset(test.data, yr_day == d),
plot.type = "p",
pollutant = c("co2.ppm", "o2.permeg", "apo"),
y.relation = "free",
date.pad = TRUE,
pch = c(19,19,19),
cex = 0.2,
xlab = paste("Time of day in hours on", d),
ylab = "CO2, O2, and APO concentrations",
name.pol = c("CO2 (ppm)", "O2 (per meg)", "APO (per meg)"),
date.breaks = 24,
date.format = "%H:%M"
)
dev.off()
}
An example plot (containing all the spikes with the same colour as the non-spike ones) is as follows:
So how do I plot the spikes in a different colour from the others? Thank you very much!
Edit:
As asked by Sebastian, I have added this (not sure how you guys will be able to extract the data from that)
dput(head(test.data,20))
structure(list(date = structure(c(1420070400, 1420070520, 1420070640,
1420070760, 1420070880, 1420071000, 1420071120, 1420071240, 1420071360,
1420071480, 1420071600, 1420071720, 1420071840, 1420071960, 1420072080,
1420072200, 1420072320, 1420072440, 1420072560, 1420072680), class = c("POSIXct",
"POSIXt"), tzone = "GMT"), o2.permeg = c(-685.09, -695.1, -687.84,
-683.23, -683.28, -685.4, -687.8, -662.79, -684.17, -680.37,
-684.66, -686.13, -683.27, -680.77, -682.16, -692.54, NA, NA,
NA, NA), co2.ppm = c(413.023, 412.894, 412.979, 412.866, 412.755,
412.647, 412.659, 412.665, 412.762, 412.72, 412.692, 412.71,
412.757, 412.838, 412.922, 413.019, NA, NA, NA, NA), apo = c(-354.181646778043,
-364.868973747017, -357.162673031026, -353.145990453461, -353.778806682578,
-356.465871121718, -358.802863961814, NA, -354.632052505966,
-351.052577565632, -355.489594272076, -356.86508353222, -353.75830548926,
-350.833007159904, -351.781957040573, -361.652649164678, NA,
NA, NA, NA), o2.spike = c("N", "N", "N", "N", "N", "N", "N",
"Y", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"
)), .Names = c("date", "o2.permeg", "co2.ppm", "apo", "o2.spike"
), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
))
Unfortunately, without having data, it's not easy to answer the question.
A ggplot2 solution could be:
library(ggplot2)
g1 <- ggplot(data=test.data, aes(x=date, y=o2.permeg, col=o2.spike)) + geom_point()
g1
Mapping a column of the data frame to "col" inside aes() gives each distinct value in that column its own colour.
It also creates a legend that pairs each value with its colour.
I tried this with another data frame ("iris", which ships with base R) and it worked; I hope it helps.
Edit:
To have the plots stacked in one figure, you can create 3 plots with ggplot and then use the function plot_grid() provided by the "cowplot" package.
library(cowplot)
g1 <- ggplot(data=test.data, aes(x=date, y=o2.permeg, col=o2.spike)) + geom_point()
g2 <- ggplot(data=test.data, aes(x=date, y=co2.ppm, col=o2.spike)) + geom_point()
g3 <- ggplot(data=test.data, aes(x=date, y=apo, col=o2.spike)) + geom_point()
plot_grid(g1, g2, g3, nrow=3, ncol=1)
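To pick the colours yourself (for example pink for the spikes, as asked), scale_colour_manual() can be added to any of these plots. The sketch below uses a small illustrative sample in the shape of test.data rather than the real data, and the choice of red for non-spikes is just an assumption.

```r
library(ggplot2)

# Small illustrative sample shaped like test.data (not the real measurements)
test.data <- data.frame(
  date = as.POSIXct("2015-01-01 00:00:00", tz = "GMT") + 120 * 0:4,
  o2.permeg = c(-685.09, -695.10, -687.84, -662.79, -684.17),
  o2.spike = c("N", "N", "N", "Y", "N"),
  stringsAsFactors = FALSE
)

# Map o2.spike to colour, then fix which colour each value gets
g <- ggplot(test.data, aes(x = date, y = o2.permeg, colour = o2.spike)) +
  geom_point() +
  scale_colour_manual(values = c(N = "red", Y = "pink"), name = "Spike")
g
```

The same scale can be added to g1, g2 and g3 before passing them to plot_grid().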
