I'm trying to extract specific parts from a column whose values are formatted very differently, and I'm having trouble with the code. I'm using the following data frame:
details<-data.frame(details=c("MG/0,9 ML SOL. INY. JRP",
"MG CM REC",
"MG LIOFIL P/INF. IV FAM",
"MG/ 5ML SOL. INY",
"MG/ML SOL.ORAL FC 100-200ML"))
I'm trying to use the extract() function, but I don't know how to write the regex part:
extract(details,"details",c("detail_1","detail_2"),regex = ??)
I want to end up with the following two columns:
  detail_1    detail_2
1 MG/0,9 ML   SOL. INY. JRP
2 MG          CM REC
3 MG          LIOFIL P/INF. IV FAM
4 MG/ 5ML     SOL. INY
5 MG/ML       SOL.ORAL FC 100-200ML
Any help is highly appreciated. Thank you in advance!
Using extract() we can do:
tidyr::extract(details, details, c("detail_1","detail_2"),
regex = '(.*(?:MG|ML)[^.$])(.*)')
#  detail_1    detail_2
#1 MG/0,9 ML   SOL. INY. JRP
#2 MG          CM REC
#3 MG          LIOFIL P/INF. IV FAM
#4 MG/ 5ML     SOL. INY
#5 MG/ML       SOL.ORAL FC 100-200ML
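For detail_1, the greedy ".*" captures everything up to and including the last "MG" or "ML" that is followed by one more character other than "." or "$" (inside a character class, "$" is a literal character, not an end-of-string anchor). Because that extra character is required, an "MG" or "ML" at the very end of the string is skipped. For detail_2 we capture everything after that.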
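To see what the two capture groups pick up, the same pattern can be tried in base R with sub(). This is only a sketch: the vector x and the trimws() cleanup are additions for illustration and are not part of the answer above.

```r
# Example strings copied from the question
x <- c("MG/0,9 ML SOL. INY. JRP",
       "MG CM REC",
       "MG LIOFIL P/INF. IV FAM",
       "MG/ 5ML SOL. INY",
       "MG/ML SOL.ORAL FC 100-200ML")

# Same pattern as above; perl = TRUE so the non-capturing group (?:...) is supported
pattern <- "(.*(?:MG|ML)[^.$])(.*)"

# "\\1" keeps the first capture group, "\\2" the second; trimws() drops the
# space that the [^.$] part pulls into the first group
detail_1 <- trimws(sub(pattern, "\\1", x, perl = TRUE))
detail_2 <- trimws(sub(pattern, "\\2", x, perl = TRUE))

data.frame(detail_1, detail_2)
```

Note how the last string ends in "ML" but is still split after the first "ML": the pattern needs one more character after the unit, so the trailing "ML" cannot end the first group.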
Another option using dplyr and stringr would be :
library(dplyr)
library(stringr)
details %>%
mutate(detail_1 = str_extract(details, ".*(MG|ML)[^.$]"),
# fixed() treats detail_1 as literal text, so "." is not read as a regex wildcard
detail_2 = str_remove(details, fixed(detail_1)))
I am using the pivot function from the lessR package, to create an Excel-like pivot table with two categorical variables that make up the vertical and horizontal categories, and a mean in each cell. (Hope this makes sense).
I followed the code that the documentation (https://cran.r-project.org/web/packages/lessR/vignettes/pivot.html) gives. Let's follow their example:
d <- Read("Employee")
a <- pivot(d, mean, Salary, Dept, Gender)
The data d is like this:
Years Gender Dept Salary JobSat Plan Pre Post
Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
Wu, James NA M SALE 94494.58 low 1 62 74
Hoang, Binh 15 M SALE 111074.86 low 3 96 97
Jones, Alissa 5 F <NA> 53772.58 <NA> 1 65 62
Downs, Deborah 7 F FINC 57139.90 high 2 90 86
Afshari, Anbar 6 F ADMN 69441.93 high 2 100 100
Knox, Michael 18 M MKTG 99062.66 med 3 81 84
Campagna, Justin 8 M SALE 72321.36 low 1 76 84
Kimball, Claire 8 F MKTG 61356.69 high 2 93 92
The pivot table a is a nice table, exactly as I want it to look in terms of cell contents, etc. It appears to be a knitr_kable object.
Gender F M
Dept
------- --------- ---------
ACCT 63237.16 59626.20
ADMN 81434.00 80963.35
FINC 57139.90 72967.60
MKTG 64496.02 99062.66
SALE 64188.25 86150.97
Next, I would like to make a dataframe out of this, for easier manipulation in my code and for copying it to the clipboard. However, I don't know how to convert a knitr_kable to a dataframe. Here is my code and the error it results in:
as.data.frame(a)
Error in as.data.frame.default(a) :
cannot coerce class ‘"knitr_kable"’ to a data.frame
The knitr-documentation does not say anything about this conversion - it is only about converting a dataframe to a knitr_kable, which is the opposite of what I want.
I have also tried pivottabler, but this has similar issues: the resulting class cannot be coerced to a dataframe either.
Here are two potential answers:
Most direct: Wrangle the data yourself
If you're open to a tidyverse-style approach, it only takes a few lines to do the wrangling and summarising yourself. That will give you a data frame (a tibble) that you can work with right away.
# load packages
library(lessR)
library(dplyr)
library(tidyr)
# load data
d <- Read("Employee")
# use tidyverse-style code to pivot and summarise the data yourself
d %>%
group_by(Gender, Dept) %>%
summarise(Salary_mean = mean(Salary)) %>%
pivot_wider(names_from= "Gender", values_from = "Salary_mean")
Read the knitr::kable() markdown output into a data frame
If you prefer to work backwards from a knitr::kable() output to a dataframe, this is addressed in this SO question: Markdown table to data frame in R
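As a rough sketch of that idea in base R, assuming the table is available as pipe-style markdown lines: the md_lines vector below is a hypothetical example typed in by hand, and the table printed by lessR above uses a different layout, so the cleanup steps would need adjusting for it.

```r
# Hypothetical pipe-style markdown table, one string per line
md_lines <- c(
  "|Dept |        F|        M|",
  "|:----|--------:|--------:|",
  "|ACCT | 63237.16| 59626.20|",
  "|ADMN | 81434.00| 80963.35|"
)

# Drop the separator row (only pipes, colons, spaces and dashes)
rows <- md_lines[!grepl("^[|: -]+$", md_lines)]

# Strip the leading and trailing pipe so they do not create empty columns
rows <- gsub("^\\||\\|$", "", rows)

# Parse the remaining lines as pipe-separated text
tbl <- read.table(text = rows, sep = "|", header = TRUE,
                  strip.white = TRUE, stringsAsFactors = FALSE)
tbl
```

The result is an ordinary data frame, so it can be filtered, joined or written to the clipboard like any other.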
So I am using tidyr in RStudio and I am trying to separate the data in the 'player' column (attached below) into 4 individual columns: 'number', 'name', 'position' and 'school'. I tried using the separate() function, but I can't get the number to separate, and I can't use str_sub because some numbers have two digits. Does anyone know how to separate this column into the appropriate 4 columns?
A method using a series of separate calls.
# Example data
df <- data.frame(
player = c('11Vita VeaDT | Washington',
'16Clelin FerrellEDGE | Clemson',
"17K'Lavon ChaissonEdge | LSU",
'15Cody FordOT | Oklahoma',
'20Derrius GuiceRB',
'1Joe BurrowQB | LSU'))
The steps are:
separate school using |
separate number using the distinction of numbers and letters
separate position using capital and lowercase, but starting at the end
cleanup, trim off white space, or extra spaces around the text
library(dplyr)
library(tidyr)

df %>%
  separate(player, into = c('player', 'school'), sep = '\\|') %>%
  separate(player, into = c('number', 'player'), sep = '(?<=[0-9])(?=[A-Za-z])') %>%
  # call the new column 'name' so it matches the results below
  separate(player, into = c('name', 'position'), sep = '(?<=[a-z])(?=[A-Z])') %>%
  mutate_if(is.character, trimws)
# Results
number name position school
1 11 Vita Vea DT Washington
2 16 Clelin Ferrell EDGE Clemson
3 17 K'Lavon Chaisson Edge LSU
4 15 Cody Ford OT Oklahoma
5 20 Derrius Guice RB <NA>
6 1 Joe Burrow QB LSU
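The same four pieces can also be pulled out with a single regular expression in base R. This is an alternative sketch, not the method above: the pattern and the helper grab() are made up for illustration, and the pattern assumes the name ends in a lowercase letter and the position starts with a capital, which holds for the example data but not for every possible name.

```r
player <- c("11Vita VeaDT | Washington",
            "16Clelin FerrellEDGE | Clemson",
            "17K'Lavon ChaissonEdge | LSU",
            "15Cody FordOT | Oklahoma",
            "20Derrius GuiceRB",
            "1Joe BurrowQB | LSU")

# number, name (ends in a lowercase letter), position (starts with a capital),
# then an optional "| school" part
pattern <- "^([0-9]+)(.*[a-z])([A-Z][A-Za-z]*)\\s*(?:\\|\\s*(.*))?$"

# Hypothetical helper: return capture group i for every element of player
grab <- function(i) trimws(sub(pattern, paste0("\\", i), player, perl = TRUE))

out <- data.frame(number   = as.integer(grab(1)),
                  name     = grab(2),
                  position = grab(3),
                  school   = grab(4),
                  stringsAsFactors = FALSE)

# A missing "| school" part leaves an empty string; turn it into NA
out$school[out$school == ""] <- NA
out
```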
I'm trying to split a column whose values are formatted very differently. For example:
pharma <- c("DOXORUBICINA CLORH. FAM 50MG POL O LIOF",
"DROSPIRENONA/ETINILESTR. 3/0,02MG CM REC",
"DROSPIRENONA/ETINILESTR. 3/0,03MG CM REC",
"ETRAVIRINA 100 MG CM",
"AGALSIDASA ALFA 1MG/ML X 3,5 ML FAM")
I'm using separate() to split it into two columns: I need to separate the product name (e.g. DOXORUBICINA CLORH. FAM) from the details (e.g. 50MG POL O LIOF). The code is:
separate(data.frame(A = pharma), col = "A" , into = c("x","y"),sep = "(?<=[a-zA-Z])\\s*(?=[0-9])")
But I get the following output from R:
  x                                        y
1 DOXORUBICINA CLORH. FAM                  50MG POL O LIOF
2 DROSPIRENONA/ETINILESTR. 3/0,02MG CM REC <NA>
3 DROSPIRENONA/ETINILESTR. 3/0,03MG CM REC <NA>
4 ETRAVIRINA                               100 MG CM
5 AGALSIDASA ALFA                          1MG/ML X
Warning messages:
1: Expected 2 pieces. Additional pieces discarded in 1 rows [5].
2: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [2, 3].
I can't see what is happening.
Any help is highly appreciated. Thank you in advance!
The data in the second and third rows contain a dot between the letters and the whitespace, but your pattern only accounts for zero or more whitespace characters between a letter and a digit.
You may use
sep = "(?<=[a-zA-Z])\\W+(?=[0-9])"
or
sep = "(?<=[a-zA-Z])\\W*(?=[0-9])"
The \W pattern matches any non-word character, that is, any character other than a letter, a digit or _.
See the regex demo.
R test:
> separate(data.frame(A = pharma), col = "A" , into = c("x","y"), sep = "(?<=[a-zA-Z])\\W*(?=[0-9])")
  x                       y
1 DOXORUBICINA CLORH. FAM 50MG POL O LIOF
2 DROSPIRENONA/ETINILESTR 3/0,02MG CM REC
3 DROSPIRENONA/ETINILESTR 3/0,03MG CM REC
4 ETRAVIRINA              100 MG CM
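For comparison, here is a base R sketch that splits only at the first place where the pattern matches, so a string with a second letter-to-digit boundary (such as the last one) still ends up in exactly two pieces. The names pieces and result are just illustrative.

```r
pharma <- c("DOXORUBICINA CLORH. FAM 50MG POL O LIOF",
            "DROSPIRENONA/ETINILESTR. 3/0,02MG CM REC",
            "DROSPIRENONA/ETINILESTR. 3/0,03MG CM REC",
            "ETRAVIRINA 100 MG CM",
            "AGALSIDASA ALFA 1MG/ML X 3,5 ML FAM")

# A letter, then one or more non-word characters, then a digit
pattern <- "(?<=[a-zA-Z])\\W+(?=[0-9])"

# regexpr() finds only the first match in each string; invert = TRUE returns
# the text before and after that match instead of the match itself
pieces <- regmatches(pharma, regexpr(pattern, pharma, perl = TRUE), invert = TRUE)

result <- data.frame(x = sapply(pieces, `[`, 1),
                     y = sapply(pieces, `[`, 2),
                     stringsAsFactors = FALSE)
result
```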
We can do this in base R
do.call(rbind, strsplit(pharma, "(?<=[A-Za-z])\\s+(?=[0-9])", perl = TRUE))
I have a table that looks like this-
LDAutGroup PatientDays ExposedDays sex Ageband DrugGroup Prop LowerCI UpperCI concat
Group1 100 23 M 5 to 10 PSY 23 15.84 32.15 23 (15.84 -32.15) F
Group2 500 56 F 11 to 17 HYP 11.2 8.73 14.27 11.2 (8.73 -14.27)
Group3 300 89 M 18 and over PSY 29.67 24.78 35.07 29.67 (24.78 -35.07)
Group1 200 34 F 5 to 10 PSY 17 12.43 22.82 17 (12.43 -22.82)
Group2 456 78 M 11 to 17 ANX 17.11 13.93 20.83 17.11 (13.93 -20.83)
Following this, I want a pivot table to lay out the concat column as the valuename. However, the pivottabler only works on integers or numeric values. The following code runs right with either of the Prop, LowerCI or UpperCI columns on their own, but gives an error message for the concat column-
library(readr)
library(dplyr)
library(epitools)
library(gtools)
library(reshape2)
library(binom)
library(pivottabler)
pt <- PivotTable$new()
pt$addData(a)
pt$addColumnDataGroups("LDAutGroup")
pt$addColumnDataGroups("sex")
pt$addRowDataGroups("DrugGroup")
pt$addRowDataGroups("Ageband")
pt$defineCalculation(calculationName="TotalTrains", type="value", valueName="Prop")
pt$renderPivot()
Is there a way I can make this work on the concat column? I want a table that has the following layout and the cells populated with the strings in concat column in the table above
Group1 Group2 Group3
M F M F M F
ANX 11 to 17
18 and over
Total
HYP 11 to 17
18 and over
5 to 10
Total
PSY 18 and over
5 to 10
Total
I am the pivottabler package author.
As you say, pivottabler currently only pivots integer/numerical columns. A workaround exists however, using a custom cell calculation function to calculate the value in each cell. Custom calculation functions were intended for more complex use cases, so using them in this way is a sledgehammer approach, but it does the job, and I suppose makes sense in some scenarios, e.g. if you have other numerical pivot tables and want a uniform appearance for the pivot tables in your output.
Adapting an example from the package vignettes:
library(pivottabler)
library(dplyr)
trainsConcatendated <- mutate(bhmtrains, ConcatValue = paste(TOC, TrainCategory, sep=" "))
getConcatenatedValue <- function(pivotCalculator, netFilters, format, baseValues, cell) {
# get the data frame
trains <- pivotCalculator$getDataFrame("trainsConcatendated")
# apply the filters coming from the headers in the pivot table
filteredTrains <- pivotCalculator$getFilteredDataFrame(trains, netFilters)
# get the distinct values
distinctValues <- distinct(filteredTrains, ConcatValue)
# get the value of the concatenated column
# this just returns the first concatenated value for the cell
# if there are multiple values, the others are ignored
if(length(distinctValues$ConcatValue)==0) { tv <- "" }
else { tv <- distinctValues$ConcatValue[1] }
# build the return value
# the raw value must be numerical, so simply set this to zero
value <- list()
value$rawValue <- 0
value$formattedValue <- tv
return(value)
}
pt <- PivotTable$new()
pt$addData(trainsConcatendated)
pt$addColumnDataGroups("TrainCategory", addTotal=FALSE)
pt$addRowDataGroups("TOC", addTotal=FALSE)
pt$defineCalculation(calculationName="ConcatValue",
type="function", calculationFunction=getConcatenatedValue)
pt$renderPivot()
Results:
Subtotals are questionable here: a confidence bound (lower or upper) cannot be aggregated to a subtotal level the way a mean can, and a subtotal of the concat string makes no sense either (at least in a simple pivot table).
Without subtotals, you can use the tidyr library to report a character variable in a wide ("spread") table with two lines of code. The first line creates the column groups and the second reshapes the table to the wide format:
library(tidyr)
Table_Original <- unite(Table_Original, "Col_pivot", c("LDAutGroup", "sex"), sep = "_", remove = FALSE)
Table_Pivot <- spread(Table_Original[ ,c("Col_pivot","DrugGroup", "Ageband", "concat")], Col_pivot, concat)
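The same two steps can be sketched in base R with paste() and reshape(). The data frame tab below is adapted by hand from the table in the question (only the columns that are needed), and the names tab and wide are assumptions for illustration.

```r
# Example rows adapted from the question
tab <- data.frame(
  LDAutGroup = c("Group1", "Group2", "Group3", "Group1", "Group2"),
  sex        = c("M", "F", "M", "F", "M"),
  Ageband    = c("5 to 10", "11 to 17", "18 and over", "5 to 10", "11 to 17"),
  DrugGroup  = c("PSY", "HYP", "PSY", "PSY", "ANX"),
  concat     = c("23 (15.84 -32.15)", "11.2 (8.73 -14.27)",
                 "29.67 (24.78 -35.07)", "17 (12.43 -22.82)",
                 "17.11 (13.93 -20.83)"),
  stringsAsFactors = FALSE
)

# Step 1: one column key per LDAutGroup/sex combination
tab$Col_pivot <- paste(tab$LDAutGroup, tab$sex, sep = "_")

# Step 2: one row per DrugGroup/Ageband, one text column per Col_pivot value
wide <- reshape(tab[, c("DrugGroup", "Ageband", "Col_pivot", "concat")],
                idvar = c("DrugGroup", "Ageband"),
                timevar = "Col_pivot",
                direction = "wide")
wide
```

Combinations that do not occur in the data are filled with NA, and the new columns are named concat.Group1_M, concat.Group2_F and so on.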
So I've got some golf data that I'm messing with in R:
player rd hole shot distToPin distShot
E. Els 1 1 1 525 367.6
E. Els 1 1 2 157.4 130.8
E. Els 1 1 3 27.5 27.4
E. Els 1 1 4 1.2 1.2
E. Els 1 2 1 222 216.6
E. Els 1 2 2 6.8 6.6
E. Els 1 2 3 0.3 0.3
E. Els 2 1 1 378 244.4
E. Els 2 1 2 135.9 141.6
E. Els 2 1 3 6.7 6.9
E. Els 2 1 4 0.1 0.1
I'm trying to make an "efficiency" computation. Basically, I want to compute the following formula (which I made up, if you can't tell) by round:
E = hole yardage / (sum(distance of all shots) - hole yardage)
And ultimately, I want my results to look like this:
rd efficiency
E.Els 1 205.25
2 25.2
That efficiency column is the averaged result of the efficiency for each hole over the entire round. The issue that I'm having is I can't quite figure out how to do such a complex calculation using dplyr::summarize():
efficiency <- df %>%
group_by(player, rd) %>%
summarize(efficiency = (sum(distShot) - distToPin))
But the problem with that particular script is that it returns the error:
Error: expecting a single value
I think my problem is that, were it to run, it wouldn't be able to tell WHICH distToPin to subtract. The one I want is obviously the first distToPin of each hole, i.e. the actual hole length (unfortunately, I don't have a column of just "hole yardage"). I want to pull that first distToPin of each hole out and use it within my summarize() arithmetic. Is this even possible?
I'm guessing that there is a way to do these types of complex, multi-step calculations within the summarize function, but maybe there's not! Any ideas or advice?
You seem to be missing some steps. Here is a deliberately labored version to show that, using dplyr. It assumes that your data frame is named golfdf:
library(dplyr)

golfdf %>%
  # the round column is called rd in the data shown above
  group_by(player, rd, hole) %>%
  summarise(hole.length = first(distToPin), shots.length = sum(distShot)) %>%
  group_by(player, rd) %>%
  summarise(efficiency = sum(hole.length) / (sum(shots.length) - sum(hole.length)))
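If the efficiency should instead be computed per hole and then averaged over the round, as in the desired output of the question, a base R sketch could look like this. The data frame is typed in from the question, and the intermediate names (hole_len, shots_len, per_hole) are made up for illustration.

```r
# Data typed in from the question
golfdf <- data.frame(
  player    = "E. Els",
  rd        = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  hole      = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1),
  shot      = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4),
  distToPin = c(525, 157.4, 27.5, 1.2, 222, 6.8, 0.3, 378, 135.9, 6.7, 0.1),
  distShot  = c(367.6, 130.8, 27.4, 1.2, 216.6, 6.6, 0.3, 244.4, 141.6, 6.9, 0.1)
)

# Make sure the first row of each hole really is the first shot
golfdf <- golfdf[order(golfdf$player, golfdf$rd, golfdf$hole, golfdf$shot), ]

# Per hole: hole yardage (first distToPin) and total length of all shots
hole_len  <- aggregate(distToPin ~ player + rd + hole, data = golfdf,
                       FUN = function(x) x[1])
shots_len <- aggregate(distShot ~ player + rd + hole, data = golfdf, FUN = sum)
per_hole  <- merge(hole_len, shots_len)

# E = hole yardage / (sum of shot distances - hole yardage), one value per hole
per_hole$eff <- per_hole$distToPin / (per_hole$distShot - per_hole$distToPin)

# Average the per-hole values within each round
result <- aggregate(eff ~ player + rd, data = per_hole, FUN = mean)
result
```

For round 1 this gives 525 / (527 - 525) = 262.5 for hole 1 and 222 / (223.5 - 222) = 148 for hole 2, which average to 205.25; round 2 has a single hole with 378 / (393 - 378) = 25.2.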