How to separate out letters in a sentence using R

I have a character vector that is a string of letters and punctuation. I want to create a data frame where each column is made up of a letter/character from this string.
e.g.
Character string = I WENT TO THE FAIR
Dataframe = | I | | W | E | N | T | | T | O | | T | H | E | | F | A | I | R |
I thought I could do this using a loop with substr, but I can't work out how to get R to write into separate columns rather than just writing over the previous letter. I'm new to writing loops, so I'm struggling to work out how to compose what I need.
Thanks for any help and advice that you can offer.
Best wishes,
Natalie

This should give that result:
string <- "I WENT TO THE FAIR"
# split into single characters, then transpose so each character becomes a column
df <- as.data.frame(t(as.data.frame(strsplit(string,""))), row.names = "1")
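A shorter variant of the same idea, as a minimal sketch in base R: split the string into single characters, then transpose the resulting vector into a one-row matrix before converting it. The variable name chars is just illustrative, and the column names (V1, V2, ...) are R's defaults for a matrix without column names.

```r
string <- "I WENT TO THE FAIR"

# strsplit() returns a list; [[1]] extracts the vector of single characters
chars <- strsplit(string, "")[[1]]

# t() turns the vector into a 1-row matrix, which converts to a 1-row data frame
df <- as.data.frame(t(chars), stringsAsFactors = FALSE)

dim(df)  # 1 row, 18 columns (each space gets its own column)
```

No loop is needed here because strsplit() already produces every character at once.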

How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1 with a binary 0-1 output (0 = no match, 1 = match). Most of grade_col2 is given as a letter grade, but every once in a while an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. A case like grade_col1 = F and grade_col2 = 96 should not be a match (just as grade_col1 = B- and grade_col2 = D is not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr. But once I got past the first "if" statement, I got stuck. I'd appreciate any help with this people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df; then you want something like the following (this assumes grade_col2 is stored as character):
library(dplyr)
output <- grades_df %>%
  # join on the lookup table, keeping every row of the grades table;
  # scale is converted to character so its type matches grade_col2
  left_join(mutate(lookup_df, scale = as.character(scale)),
            by = c(grade_col2 = "scale")) %>%
  # combine grade_col2 from grades_df and grade from lookup_df
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column: 1 = match, 0 = no match
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
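The same lookup logic can also be sketched in base R with match(), which may be easier to step through on a small sample. The data below is a made-up five-row excerpt in the shape described in the question, and idx is a hypothetical helper name; it assumes both columns hold plain character strings without padding.

```r
# Small illustrative sample, not the real data
grades_df <- data.frame(
  grade_col1 = c("A-", "B", "C+", "B-", "F"),
  grade_col2 = c("A-", "86", "C+", "D", "96"),
  stringsAsFactors = FALSE
)
lookup_df <- data.frame(
  grade = c("A", "A", "B", "B"),
  scale = c(96, 95, 86, 85),
  stringsAsFactors = FALSE
)

# Position of each grade_col2 entry in the lookup scale (NA for letter grades)
idx <- match(grades_df$grade_col2, as.character(lookup_df$scale))

# Translate numeric entries to letters, keep letter entries as they are
grades_df$grade_col2b <- ifelse(is.na(idx),
                                grades_df$grade_col2,
                                lookup_df$grade[idx])

# 1 = match, 0 = no match
grades_df$indicator <- as.integer(grades_df$grade_col1 == grades_df$grade_col2b)
grades_df$indicator  # 1 1 1 0 0
```

Here B/86 counts as a match because 86 translates to B, while F/96 does not because 96 translates to A.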

knitr's kable is printing 2.29e-30 as "0"

CODE:
# some data
dat <-
data.frame(
log2fc = c(0.28, 10.82, 8.54, 5.64, 8.79, 6.46),
pvalue = c(0.00e+00, 2.29e-30, 7.02e-30, 4.14e-29, 1.86e-28, 1.78e-27)
)
# observe in markdown format
knitr::kable(dat, format="markdown")
OUTPUT:
| log2fc| pvalue|
|------:|------:|
| 0.28| 0|
| 10.82| 0|
| 8.54| 0|
| 5.64| 0|
| 8.79| 0|
| 6.46| 0|
PROBLEM:
The problem with the output is that the last column, pvalue, is rendered as zeros. I want to retain the same format that I see in my dataframe. How do I do that? I've tried several solutions from various threads, but nothing seems to work. Can someone point me in the right direction?
Please do not suggest converting the pvalue column into a character vector. That is a quick and dirty solution that works, but I don't want to do that because:
I don't want to mess around with my dataframe.
I am interested in the reason for why the scientific format of the last column is not being retained while printing it in markdown.
I have many tables each with various columns with scientific format, I am looking for a way that automatically handles this issue.
kable() calls the base R function round() on numeric columns, which rounds those small values to zero unless you set digits to a large value (the default is getOption("digits"), usually 7). But you can do that, e.g.
knitr::kable(dat, format = "markdown", digits = 32)
which gives
| log2fc| pvalue|
|------:|--------:|
| 0.28| 0.00e+00|
| 10.82| 2.29e-30|
| 8.54| 7.02e-30|
| 5.64| 4.14e-29|
| 8.79| 1.86e-28|
| 6.46| 1.78e-27|
If you do want the regular rounding in some columns, you can specify multiple values for digits, e.g.
knitr::kable(dat, format = "markdown", digits = c(1, 32))
| log2fc| pvalue|
|------:|--------:|
| 0.3| 0.00e+00|
| 10.8| 2.29e-30|
| 8.5| 7.02e-30|
| 5.6| 4.14e-29|
| 8.8| 1.86e-28|
| 6.5| 1.78e-27|
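To see why the default output shows 0, it may help to call round() directly on one of the p-values. This is only a sketch of the rounding step described above, not of kable()'s full formatting; the variable names are illustrative.

```r
p <- 2.29e-30

# With 7 digits (the usual value of getOption("digits")) the value rounds to exactly 0
default_rounded <- round(p, digits = 7)
default_rounded == 0  # TRUE

# With a digits value larger than the exponent, the value survives rounding
wide_rounded <- round(p, digits = 32)
wide_rounded > 0      # TRUE
```

Any digits value that exceeds the size of the smallest exponent in the column should keep the values from collapsing to zero.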

In julia, how do I assign the output of an expression to a new variable?

Stupid example, I would like to do something like
X=println("hi"),
and get
X="hi".
The general solution is to use IOBuffer and takebuf_string, as described by @ARM above (on Julia 0.6 and later, String(take!(io)) replaces takebuf_string). If it's enough to capture the output of print, then
s = string(args...)
gives the string that would have been printed by print(args...). Also,
s = repr(X)
gives the string that would have been printed by showall(X). Both are implemented using IOBuffer and takebuf_string internally.
I think the poster wants to access the nice summary format that you can get from println. One way to access that as a string is to write to a buffer using print and then read it back as a string. There's probably also an easier way.
using DataFrames
data = DataFrame()
data[:turtle] = ["Suzy", "Suzy", "Bob", "Batman", "Batman", "Bob", "Adam"]
data[:mealType] = ["bug", "worm", "worm", "bug", "worm", "worm", "stick"]
stream = IOBuffer()
println(data)
print(stream, data)
yourString = takebuf_string(stream)
returns
"7x2 DataFrame\n| Row | turtle | mealType |\n|-----|----------|----------|\n| 1 | \"Suzy\" | \"bug\" |\n| 2 | \"Suzy\" | \"worm\" |\n| 3 | \"Bob\" | \"worm\" |\n| 4 | \"Batman\" | \"bug\" |\n| 5 | \"Batman\" | \"worm\" |\n| 6 | \"Bob\" | \"worm\" |\n| 7 | \"Adam\" | \"stick\" |"
If you are after formatted strings you can use @sprintf.
julia> x = @sprintf("%s", "hi")
"hi"
julia> x
"hi"
julia> x = @sprintf("%d/%d", 3, 4)
"3/4"
It's a macro, though, so be careful: the format string has to be a literal. On Julia 0.7 and later, run using Printf first.

Creating a unique integer on the basis of a string

I have a larger dataset (data.table with approx 9m rows) with a column that I would like to use to aggregate values (min and max etc). The column is a combination of various other columns and has a string based format, like the one below:
string <- "318XXXX | VNSGN | BIER"
To gain some speed in performing tasks, I would like to recode this to a unique integer. Another application that I use on a regular basis to deal with data has a built-in function that transforms a string like the one above into an integer (e.g. 73823). I was wondering whether there is a similar function in R. The idea is that a particular string will always result in the same integer; this will allow it to be used in merging data.tables etc.
Here is a small example of the data.table column that I would like to encode as simple integer values:
sample <- c("318XXXX | VNSGN | BIER", "462XXXX | TZZZH | 9905", "462XXXX | TZZZH | 9905",
"462XXXX | TZZZH | 9905", "511XXXX | FAWOR | 336H", "511XXXX | FAWOR | 336H",
"652XXXX | XXXXR | T136", "652XXXX | XXXXR | T136", "672XXXX | BQQSZ | 7777",
"672XXXX | BQQSZ | 7777")
I am hoping to encode the strings into an additional column to the table like the one below; note that the same strings result in the same numbers.
String Number
318XXXX | VNSGN | BIER 19872
462XXXX | TZZZH | 9905 78392
462XXXX | TZZZH | 9905 78392
462XXXX | TZZZH | 9905 78392
511XXXX | FAWOR | 336H 23053
511XXXX | FAWOR | 336H 23053
652XXXX | XXXXR | T136 95832
652XXXX | XXXXR | T136 95832
672XXXX | BQQSZ | 7777 71829
672XXXX | BQQSZ | 7777 71829
The data.table package will create indexes for you without making you handle them explicitly so it would be less work than the approach in the question. See the setkey function in data.table.
Also the sqldf package can use the SQL create index statement as per Examples 4h and 4i on the sqldf home page as can just about any database package.
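If an explicit integer column is still wanted, one possible sketch in base R is to number the distinct strings with match() against unique(). This is not a hash function: the codes depend on the order in which strings first appear, so they are only consistent within the same vector (or when the same unique() lookup is reused). The names keys and ids are illustrative.

```r
keys <- c("318XXXX | VNSGN | BIER", "462XXXX | TZZZH | 9905", "462XXXX | TZZZH | 9905",
          "462XXXX | TZZZH | 9905", "511XXXX | FAWOR | 336H", "511XXXX | FAWOR | 336H",
          "652XXXX | XXXXR | T136", "652XXXX | XXXXR | T136", "672XXXX | BQQSZ | 7777",
          "672XXXX | BQQSZ | 7777")

# Each string gets the position of its first occurrence among the distinct values
ids <- match(keys, unique(keys))
ids  # 1 2 2 2 3 3 4 4 5 5
```

Identical strings always receive the same integer, which is the property the question asks for, at least within one dataset.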

Replacement and non-matches with 'sub'

Months ago I ended up with a sub statement that originally worked with my input data. It has since stopped working causing me to re-examine my ugly process. I hate to share it but it accomplished several things at once:
active$id[grep("CIR",active$description)] <- sub(".*CIR0*(\\d+).*","\\1",active$description[grep("CIR",active$description)],perl=TRUE)
This statement created a new id column by finding rows that had an id embedded in the description column. The sub statement would find the number following a "CIR0" and populate the id column iff there was an id within a row's description. I recognize it is inefficient with the embedded grep subsetting either side of the assignment.
Is there a way to have a 'sub' replacement be NA or empty if the pattern does not match? I feel like I'm missing something very simple but ask for the community's assistance. Thank you.
Example with the results of creating an id column:
| name | id | description |
|------+-----+-------------------|
| a | 343 | Here is CIR00343 |
| b | | Didn't have it |
| c | 123 | What is CIR0123 |
| d | | CIR lacks a digit |
| e | 452 | CIR452 is next |
I was struggling with the same issue a few weeks ago. I ended up using the str_match function from the stringr package. It returns NA if the target string is not found. Just make sure you subset the result correctly. An example:
library(stringr)
str = "Little_Red_Riding_Hood"
sub(".*(Little).*","\\1",str) # Returns 'Little'
sub(".*(Big).*","\\1",str) # Returns 'Little_Red_Riding_Hood'
str_match(str,".*(Little).*")[1,2] # Returns 'Little'
str_match(str,".*(Big).*")[1,2] # Returns NA
I think in this case you could try using ifelse(), i.e.,
active$id[grep("CIR",active$description)] <- ifelse(match, replacement, "")
where match is a logical vector that is TRUE wherever the pattern matches (for example, the result of grepl()), and replacement is what that element would be replaced with in that case. Where match is FALSE, the element is replaced with an empty string (or NA if you prefer).
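The ifelse() idea can be sketched end to end in base R as follows. The data frame is rebuilt from the example table in the question, and pattern and has_id are illustrative names; the regular expression is the one from the original sub() call.

```r
# Example data taken from the table in the question
active <- data.frame(
  name = c("a", "b", "c", "d", "e"),
  description = c("Here is CIR00343", "Didn't have it", "What is CIR0123",
                  "CIR lacks a digit", "CIR452 is next"),
  stringsAsFactors = FALSE
)

pattern <- ".*CIR0*(\\d+).*"

# TRUE only where the full pattern (CIR followed by digits) is present
has_id <- grepl(pattern, active$description, perl = TRUE)

# Extract the digits where the pattern matches, otherwise NA
active$id <- ifelse(has_id,
                    sub(pattern, "\\1", active$description, perl = TRUE),
                    NA_character_)
active$id  # "343" NA "123" NA "452"
```

Testing with the full pattern rather than just "CIR" is what keeps row d ("CIR lacks a digit") from receiving its whole description as an id.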
