subset rows between two rows containing specific values? - r

I have multiple data frames with the generic layout below. The strings of text vary in length from a few words to multiple sentences. The title strings on each data frame all vary slightly but they all share a word in common (for example, all of the TitleBs on each data frame share the word “code” in common and all TitleCs share the word “write” in common).
|element|NumbID|String |
|-------|------|--------|
|header |1 |TitleA |
|para |2 |TxtStrng |
|header |3 |TitleB |
|header |4 |Subtit1 |
|para |5 |TxtStrng |
|header |6 |Subtit2 |
|para |7 |TxtStrng |
|header |8 |TitleC |
I am trying to figure out how to write code that can be used on all the data frames and will let me extract all the rows from TitleB up to, but not including, TitleC, as in the example below.
|element|NumbID|String |
|:-----:|:----:|:------:|
|header |3 |TitleB |
|header |4 |Subtit1 |
|para |5 |TxtStrng |
|header |6 |Subtit2 |
|para |7 |TxtStrng |
I thought maybe I could use subset() in some way to do this but I’m really struggling to figure out how to make it work.

So, you need a way to identify the TitleB and TitleC strings. From your description, I'll use grepl("code", String) for TitleB and grepl("write", String) for TitleC.
Then we need to identify rows where a TitleB has already occurred but a TitleC hasn't: we can use cumsum for this to generate a cumulative count of occurrences:
result = subset(
  your_data,
  cumsum(grepl("code", String)) > 0 &
    cumsum(grepl("write", String)) == 0
)
If you need more help, please make your example more reproducible, preferably using dput() to share a copy/pasteable version of the data in valid R syntax.
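For readers who want to check the cumulative-count logic outside R, here is a minimal Python sketch of the same idea. The sample rows and the helper name rows_between are made up for illustration; only the keywords "code" and "write" come from the question. Like grepl, the substring test is case-sensitive and would also match a paragraph that happens to contain the keyword.

```python
from itertools import accumulate

def rows_between(rows, start_word, stop_word):
    # Running count of rows seen so far whose String contains each marker word.
    started = accumulate(int(start_word in row["String"]) for row in rows)
    stopped = accumulate(int(stop_word in row["String"]) for row in rows)
    # Keep a row once the start marker has appeared and before the stop marker has.
    return [row for row, s, e in zip(rows, started, stopped) if s > 0 and e == 0]

# Hypothetical data mirroring the layout in the question.
rows = [
    {"element": "header", "NumbID": 1, "String": "Overview"},
    {"element": "para",   "NumbID": 2, "String": "some text"},
    {"element": "header", "NumbID": 3, "String": "How to code it"},
    {"element": "header", "NumbID": 4, "String": "Subtitle one"},
    {"element": "para",   "NumbID": 5, "String": "some text"},
    {"element": "header", "NumbID": 6, "String": "Subtitle two"},
    {"element": "para",   "NumbID": 7, "String": "some text"},
    {"element": "header", "NumbID": 8, "String": "How to write it up"},
]

selected = rows_between(rows, "code", "write")
print([row["NumbID"] for row in selected])
```

The row containing the stop word is excluded because its own running count is already 1 at that position.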

Related

SQLite Versioning. Is it possible to use EXCEPT to show differences between rows where only one column changes?

I'm quite new to SQLite and I'm trying to use an EXCEPT statement in order to compare two tables with very similar data. The data comes from a CSV file I download daily, and within the file new rows are added and deleted, and old rows can have one or more columns change each day. I'm trying to find a way to select rows that have had a column's data change, when I am unable to predict which column's data will change.
Say for example I have:
TABLE contracts:
|ID|Description|Name|Contract Type|
|1 |Plumbing |Bob |Paper |
|2 |Cooking |Ryan|Paper |
|3 |Driving |Eric|Paper |
|4 |Dancing |Emma|Paper |
and:
TABLE updated_contracts:
|ID|Description|Name|Contract Type|
|1 |Hiking |Bob |Paper |
|2 |Cooking |Ryan|Paper |
|3 |Driving |Eric|Paper |
|4 |Dancing |Emma|Digital |
I'd like it to return:
|1 |Hiking |Bob |Paper |
|4 |Dancing |Emma|Digital |
because contract 1 has changed the description and contract 4 has changed the contract type.
Is it possible to do this in SQLite?
One way to do it is with a LEFT join of updated_contracts to contracts where the matching rows are filtered out:
select uc.*
from updated_contracts uc left join contracts c
using(id, Description, Name, `Contract Type`)
where c.id is null
EXCEPT can also be used like this:
select * from updated_contracts
except
select * from contracts
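This works only if both SELECT statements return the same number of columns. Its advantage is that EXCEPT treats NULL values as equal when comparing rows, so two rows that are both NULL in the same column count as identical, whereas the equality comparison in the join never matches NULL to NULL.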
Results:
| ID | Description | Name | Contract Type |
| --- | ----------- | ---- | ------------- |
| 1 | Hiking | Bob | Paper |
| 4 | Dancing | Emma | Digital |
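Both queries can be tried against an in-memory database with Python's sqlite3 module. This is only a sketch: row 5, with a NULL Description in both tables, is an invented addition to the question's data, included to show where the two approaches differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contracts (ID INTEGER, Description TEXT, Name TEXT, "Contract Type" TEXT);
CREATE TABLE updated_contracts (ID INTEGER, Description TEXT, Name TEXT, "Contract Type" TEXT);
INSERT INTO contracts VALUES
  (1, 'Plumbing', 'Bob',  'Paper'),
  (2, 'Cooking',  'Ryan', 'Paper'),
  (3, 'Driving',  'Eric', 'Paper'),
  (4, 'Dancing',  'Emma', 'Paper'),
  (5, NULL,       'Zoe',  'Paper');   -- invented row, unchanged between tables
INSERT INTO updated_contracts VALUES
  (1, 'Hiking',   'Bob',  'Paper'),
  (2, 'Cooking',  'Ryan', 'Paper'),
  (3, 'Driving',  'Eric', 'Paper'),
  (4, 'Dancing',  'Emma', 'Digital'),
  (5, NULL,       'Zoe',  'Paper');
""")

# EXCEPT compares whole rows and treats NULLs as equal.
except_rows = conn.execute("""
    SELECT * FROM updated_contracts
    EXCEPT
    SELECT * FROM contracts
    ORDER BY ID
""").fetchall()

# The join compares columns with "=", so a NULL never matches a NULL.
join_rows = conn.execute("""
    SELECT uc.* FROM updated_contracts uc
    LEFT JOIN contracts c USING (ID, Description, Name, "Contract Type")
    WHERE c.ID IS NULL
    ORDER BY uc.ID
""").fetchall()

print(except_rows)
print(join_rows)
```

With this data the EXCEPT query returns contracts 1 and 4 only, while the join also reports the unchanged row 5 because its NULL Description cannot be matched.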

Dividing the time into periods each 30 min

I have a DataFrame that contains a "time" column. I want to add a new column holding the period number after dividing the day into 30-minute periods,
for example,
The original Dataframe
l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test| time|
+----+-------------------+
| A|2017-01-13 00:30:00|
| A|2017-01-13 00:00:01|
| E|2017-01-13 14:00:00|
| E|2017-01-13 12:08:15|
+----+-------------------+
The desired DataFrame is as follows:
+----+-------------------+------+
|test| time|period|
+----+-------------------+------+
| A|2017-01-13 00:30:00| 2|
| A|2017-01-13 00:00:01| 1|
| E|2017-01-13 14:00:00| 29|
| E|2017-01-13 12:08:15| 25|
+----+-------------------+------+
Are there ways to achieve that?
You can use the built-in hour and minute functions, together with the built-in when function, to compute the period:
from pyspark.sql import functions as F
df1.withColumn('period', (F.hour(df1['time'])*2)+1+(F.when(F.minute(df1['time']) >= 30, 1).otherwise(0))).show(truncate=False)
You should get:
+----+---------------------+------+
|test|time |period|
+----+---------------------+------+
|A |2017-01-13 00:30:00.0|2 |
|A |2017-01-13 00:00:01.0|1 |
|E |2017-01-13 14:00:00.0|29 |
|E |2017-01-13 12:08:15.0|25 |
+----+---------------------+------+
I hope the answer is helpful
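The arithmetic behind that expression can be checked without a Spark session. The plain-Python sketch below applies the same formula (two periods per hour, plus one to start counting at 1, plus one more for the second half of the hour) to the four timestamps from the question; the function name half_hour_period is made up for illustration.

```python
from datetime import datetime

def half_hour_period(ts):
    # Same formula as the Spark expression: hour*2 + 1, plus 1 when minute >= 30.
    return ts.hour * 2 + 1 + (1 if ts.minute >= 30 else 0)

times = [
    "2017-01-13 00:30:00",
    "2017-01-13 00:00:01",
    "2017-01-13 14:00:00",
    "2017-01-13 12:08:15",
]

periods = [half_hour_period(datetime.strptime(t, "%Y-%m-%d %H:%M:%S")) for t in times]
print(periods)  # [2, 1, 29, 25]
```

The periods therefore run from 1 (00:00 to 00:29) to 48 (23:30 to 23:59).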

Proxy for Excel's split cell in R

Apologies in advance if anyone finds this to be a duplicate to a question answered before. I haven't found anything so here it is:
I have a 3x3 contingency table I made in RStudio (I am specifying this as a data frame below but I can also produce this as as.matrix, if that'll work better):
mat.s=data.frame("WT(H)"=11,"DEL(H)"=2)
mat.s[2,1]=13
mat.s[2,2]=500369
row.names(mat.s)=c("DEL(T)", "WT(T)")
mat.s=cbind(mat.s, Total=rowSums(mat.s))
mat.s=rbind(mat.s, Total=colSums(mat.s))
which looks like:
kable(mat.s)
| | WT.H.| DEL.H.| Total|
|:------|-----:|------:|------:|
|DEL(T) | 11| 2| 13|
|WT(T) | 13| 500369| 500382|
|Total | 24| 500371| 500395|
However, if I wanted to split a cell in this table (like you can do in Excel) into two, how would I do that? So I'd like to get something like the following when I render the document with kable:
| | WT.H.| DEL.H.| Total|
|:------|-----:|------:|------:|
|DEL(T) | S D | 2| 13|
| | 8 3 | | |
|WT(T) | 13| 500369| 500382|
|Total | 24| 500371| 500395|
So that when I want to calculate something from this table, I can call the split 8 or 3. Sorry if this is something very simple and easy to do! Still learning. Thanks!

Append data in sqlite3 c++

I have a table in SQLite like this, for example:
Columns
|a |b |c |d |e |
|5 |6 |4 |3 |6 |
Columns a to e each hold an integer.
Now I need to add a number to some of the columns,
for example add 3 to column 'c' so that c holds 7.
How can I do it?
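I think this could be done with a simple UPDATE query like this (using 3 as the increment from your example; without a WHERE clause it applies to every row in the table):
UPDATE <table> SET c = c + 3;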
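An UPDATE of this kind can be tried out quickly with Python's sqlite3 module before wiring it into the C++ program, since the SQL is the same in either case. In this sketch the table name t is a placeholder, and the increment of 3 is the value from the question, passed as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "t" is a placeholder table name; the values match the question's example row.
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER, e INTEGER)")
conn.execute("INSERT INTO t VALUES (5, 6, 4, 3, 6)")

increment = 3
# Bind the increment as a parameter rather than formatting it into the SQL string.
conn.execute("UPDATE t SET c = c + ?", (increment,))
conn.commit()

row = conn.execute("SELECT a, b, c, d, e FROM t").fetchone()
print(row)  # (5, 6, 7, 3, 6)
```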

Creating a unique integer on the basis of a string

I have a larger dataset (data.table with approx 9m rows) with a column that I would like to use to aggregate values (min and max etc). The column is a combination of various other columns and has a string based format, like the one below:
string <- "318XXXX | VNSGN | BIER"
To gain some speed in performing tasks, I would like to recode this to a unique integer. Another application that I use on a regular basis to deal with data has a built-in function that transforms a string like the one above into an integer (e.g. 73823). I was wondering whether there is a similar function in R? The idea is that a particular string will always result in the same integer; this will allow it to be used in merging data.tables etc.
Here a little example of the data.table column that I would like to encode in simple integer values:
sample <- c("318XXXX | VNSGN | BIER", "462XXXX | TZZZH | 9905", "462XXXX | TZZZH | 9905",
"462XXXX | TZZZH | 9905", "511XXXX | FAWOR | 336H", "511XXXX | FAWOR | 336H",
"652XXXX | XXXXR | T136", "652XXXX | XXXXR | T136", "672XXXX | BQQSZ | 7777",
"672XXXX | BQQSZ | 7777")
I am hoping to encode the strings into an additional column to the table like the one below; note that the same strings result in the same numbers.
String Number
318XXXX | VNSGN | BIER 19872
462XXXX | TZZZH | 9905 78392
462XXXX | TZZZH | 9905 78392
462XXXX | TZZZH | 9905 78392
511XXXX | FAWOR | 336H 23053
511XXXX | FAWOR | 336H 23053
652XXXX | XXXXR | T136 95832
652XXXX | XXXXR | T136 95832
672XXXX | BQQSZ | 7777 71829
672XXXX | BQQSZ | 7777 71829
The data.table package will create indexes for you without making you handle them explicitly so it would be less work than the approach in the question. See the setkey function in data.table.
Also, the sqldf package can use the SQL CREATE INDEX statement, as in Examples 4h and 4i on the sqldf home page, as can just about any database package.
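If explicit integer codes are still wanted, one simple approach is a lookup table that assigns the next free integer the first time a string is seen. The Python sketch below illustrates that idea only; it is not what data.table or sqldf do internally, and the helper name encode is made up. The codes depend on the order of first appearance, so the same lookup table has to be reused for every table that is to be merged.

```python
def encode(values, codes=None):
    # codes maps each distinct string to a small integer; pass it back in
    # on later calls so that the same string keeps the same integer.
    codes = {} if codes is None else codes
    numbers = [codes.setdefault(v, len(codes) + 1) for v in values]
    return numbers, codes

sample = [
    "318XXXX | VNSGN | BIER", "462XXXX | TZZZH | 9905", "462XXXX | TZZZH | 9905",
    "462XXXX | TZZZH | 9905", "511XXXX | FAWOR | 336H", "511XXXX | FAWOR | 336H",
    "652XXXX | XXXXR | T136", "652XXXX | XXXXR | T136", "672XXXX | BQQSZ | 7777",
    "672XXXX | BQQSZ | 7777",
]

numbers, codes = encode(sample)
print(numbers)  # [1, 2, 2, 2, 3, 3, 4, 4, 5, 5]

# Reusing the lookup table gives a known string the same code again.
more, codes = encode(["511XXXX | FAWOR | 336H", "999XXXX | NEWWW | 0001"], codes)
print(more)  # [3, 6]
```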
