R count character column - r

I want to add a column containing the amount of letters a-z in a different column from the same row.
dataset$count <-length((gregexpr('[a-z]', as.character(dataset$text))[[1]]))
does not work.
The result I would like to acheive:
text | count
a | 1
ao | 2
ao2 | 2
as2e | 3
as2eA | 3

Tricky one:
nchar(gsub("[^a-z]","",x))

This should do the trick:
numchars<-function(txt){
#basically your code, but to be applied to 1 item
tmpres<-gregexpr('[a-z]', as.character(txt))[[1]]
ifelse(tmpres[1]==-1, 0, length(tmpres))
}
#now apply it to all items:
dataset$count <-sapply(dataset$text, numchars)
Another option is more of a two-step approach:
charmatches<-gregexpr('[a-z]', as.character(dataset$text))[[1]]
dataset$count<-sapply(charmatches, length)

Related

get the top N counts based on the value of another column

I have a file with two columns that looks like this:
a 3
a 7
b 6
a 6
b 1
b 8
c 1
b 1
For each value in the first column, I'd like to find the top N counts from the second column. Using this example, I'd like to find the top 2 values in column 2 for each string in column 1. The desired output would be:
a 7
a 6
b 8
b 6
c 1
I was trying to do something like this with awk, but I am not that familiar with it. This gives the max, not top N:
awk '$2>max[$1]{max[$1]=$2; row[$1]=$0} END{for (i in row) print row[i]}'
Could you please try following, using sort + awk solution.
sort -k2 -s -nr Input_file | awk '++array[$1]<=2' | sort -k1,1 -k2,2nr
OR
sort -k2 -s -nr Input_file | sort -k1,1 -k2,2nr | awk '++array[$1]<=2'
Logical brief explanation: First 2 sort commands are being used to sort data as per 1st and 2nd fields to get data in right order(as per OP's samples), then passing its output to awk to get only 1st 2 occurrences of each first field only as per ask.
$ sort -k1,1 -k2,2rn file | awk -v n=2 '(++cnt[$1])<=n'
a 7
a 6
b 8
b 6
c 1

Can Powerbi count multiple occurrences of a specific text within a cell?

What i am trying to find out is, for example let's take as an example the following table:
| Col 1 | Col 2 |
|-------|---------|
| ab | 1 |
| ab ab | 2 |
| ac | 1 |
| ae | 1 |
| ae ae | 2 |
| af | 1 |
So basically if there are two occurrences of the same item in the cell, I want to display 2 in the next column. If there are 3, then 3 and so on. The thing is that I am looking for specific strings most of the time. Its a text and number string.
Is this doable in Power BI?
Assuming you want to count the number of occurrences of the first non-space characters that occur before the first separating space, you can do the following:
Col 2 =
VAR Trimmed = TRIM(Table2[Col 1])
VAR FirstSpace = SEARCH(" ", Trimmed, 1, LEN(Trimmed) + 1)
VAR FirstString = LEFT(Trimmed, FirstSpace - 1)
RETURN DIVIDE(
LEN(Trimmed) - LEN(SUBSTITUTE(Trimmed, FirstString, "")),
FirstSpace - 1
)
Let's go through an example to see how this works. Suppose we have a string " abc abc abc ".
The TRIM function removes any extraneous spaces at the beginning an end, so Trimmed = "abc abc abc".
The FirstSpace searches for the first space in Trimmed. In this case, FirstSpace = 4. (If there is no first space, then we define FirstSpace to be the length of Trimmed + 1 so the next part works correctly.)
The FirstString uses FirstSpace to find the first chunk. In this case, FirstString = "abc".
Finally, we use SUBSTITUTE to replace each FirstString with an empty string (leaving only the middle spaces) and look at how that changes the length of Trimmed. We know LEN(Trimmed) = 11 and LEN(" ") = 2, so the difference is the 9 characters we removed by substitution. We know that the 9 characters are n copies of FirstString, "abc" and we know the length of FirstString is FirstSpace - 1 = 3.
Thus we can solve 3n = 9 for n to get n = 9/3 = 3, the count of the "abc" substrings.

Replacement and non-matches with 'sub'

Months ago I ended up with a sub statement that originally worked with my input data. It has since stopped working causing me to re-examine my ugly process. I hate to share it but it accomplished several things at once:
active$id[grep("CIR",active$description)] <- sub(".*CIR0*(\\d+).*","\\1",active$description[grep("CIR",active$description)],perl=TRUE)
This statement created a new id column by finding rows that had an id embedded in the description column. The sub statement would find the number following a "CIR0" and populate the id column iff there was an id within a row's description. I recognize it is inefficient with the embedded grep subsetting either side of the assignment.
Is there a way to have a 'sub' replacement be NA or empty if the pattern does not match? I feel like I'm missing something very simple but ask for the community's assistance. Thank you.
Example with the results of creating an id column:
| name | id | description |
|------+-----+-------------------|
| a | 343 | Here is CIR00343 |
| b | | Didn't have it |
| c | 123 | What is CIR0123 |
| d | | CIR lacks a digit |
| e | 452 | CIR452 is next |
I was struggling with the same issue a few weeks ago. I ended up using the str_match function from the stringr package. It returns NA if the target string is not found. Just make sure you subset the result correctly. An example:
library(stringr)
str = "Little_Red_Riding_Hood"
sub(".*(Little).*","\\1",str) # Returns 'Little'
sub(".*(Big).*","\\1",str) # Returns 'Little_Red_Riding_Hood'
str_match(str,".*(Little).*")[1,2] #Returns 'Little'
str_match(str,".*(Big).*")[1,2] # Returns NA
I think in this case you could try using ifelse(), i.e.,
active$id[grep("CIR",active$description)] <- ifelse(match, replacement, "")
where match should evaluate to true if there's a match, and replacement is what that element would be replaced with in that case. Likewise, if match evaluates to false, that element's replaced with an empty string (or NA if you prefer).

Code new variable based on grep return in R

I have a variable actor which is a string and contains values like "military forces of guinea-bissau (1989-1992)" and a large range of other different values that are fairly complex. I have been using grep() to find character patterns that match different types of actors. For example I would like to code a new variable actor_type as 1 when actor contains "military forces of", doesn't contain "mutiny of", and the string variable country is also contained in the variable actor.
I am at a loss as to how to conditionally create this new variable without resorting to some type of horrible for loop. Help me!
Data looks roughly like this:
| | actor | country |
|---+----------------------------------------------------+-----------------|
| 1 | "military forces of guinea-bissau" | "guinea-bissau" |
| 2 | "mutiny of military forces of guinea-bissau" | "guinea-bissau" |
| 3 | "unidentified armed group (guinea-bissau)" | "guinea-bissau" |
| 4 | "mfdc: movement of democratic forces of casamance" | "guinea-bissau" |
if your data is in a data.frame df:
> ifelse(!grepl('mutiny of' , df$actor) & grepl('military forces of',df$actor) & apply(df,1,function(x) grepl(x[2],x[1])),1,0)
[1] 1 0 0 0
grepl returns a logical vector and this can be assigned to whatever, e.g. df$actor_type.
breaking that appart:
!grepl('mutiny of', df$actor) and grepl('military forces of', df$actor) satisfy your first two requirements. the last piece, apply(df,1,function(x) grepl(x[2],x[1])) goes row by row and greps for country in actor.

Vertical headers in RestructuredText tables

In RestructuredText, you can render a header row in a table like this (taken from the documentation :
+------------------------+------------+----------+----------+
| Header row, column 1 | Header 2 | Header 3 | Header 4 |
| (header rows optional) | | | |
+========================+============+==========+==========+
| body row 1, column 1 | column 2 | column 3 | column 4 |
+------------------------+------------+----------+----------+
| body row 2 | Cells may span columns. |
+------------------------+------------+---------------------+
| body row 3 | Cells may | - Table cells |
+------------------------+ span rows. | - contain |
| body row 4 | | - body elements. |
+------------------------+------------+---------------------+
Is it possible to do the something similar with the first column?
An example, which clearly doesn't work, could be the following (note the double like at the end of column 1):
+------------------------++------------+----------+----------+
| Header row, column 1 || Header 2 | Header 3 | Header 4 |
| (header rows optional) || | | |
+========================++============+==========+==========+
| body row 1, column 1 || column 2 | column 3 | column 4 |
+------------------------++------------+----------+----------+
| body row 2 || Cells may span columns. |
+------------------------++------------+---------------------+
| body row 3 || Cells may | - Table cells |
+------------------------++ span rows. | - contain |
| body row 4 || | - body elements. |
+------------------------++------------+---------------------+
You may achieve this using list-table directive with option stub-columns. Or, you may even combine stub-columns with header-rows. See the http://docutils.sourceforge.net/docs/ref/rst/directives.html#list-table for the details. Hereafter is a simple example:
.. list-table:: Sample list table
:widths: 10 20 20
:header-rows: 1
:stub-columns: 1
* -
- Column 1
- Column 2
* - Row 1
- Hello
- World!
* - Row 2
- Hello
- List Table!
* - Row 3
- This
- Works
An obvious disadvantage is that you need to maintain table content as a list, which may be not that much convenient as with regular simple tables. So, you might want to check out the csv-table directive here: http://docutils.sourceforge.net/docs/ref/rst/directives.html#id1 , which also has option stub-columns.
If you need to stick to regular tables syntax - sorry, I'm not sure this is possible. As a workaround - you can use strong emphasis for text in the first column :-)

Resources