rdflib::as_rdf only recognizes some IRIs

I am trying to convert a dataframe to RDF. The dataframe contains literals as well as IRIs.
I'm doing

test_withangles.rdf <-
  rdflib::as_rdf(x = test_withangles,
                 key = 'uuid')
rdf_serialize(rdf = test_withangles.rdf, doc = "test_withangles.ttl")

on
+--------------------------------------+-----------------------------------------------+------------------------------------------------------+
| uuid                                 | source_ontology                               | source_term                                          |
+--------------------------------------+-----------------------------------------------+------------------------------------------------------+
| 7ca250c1-0747-4db0-b613-a1bb45848711 | <http://192.168.0.233:8080/ontologies/RXNORM> | <http://purl.bioontology.org/ontology/RXNORM/706898> |
| e10acb95-e9d9-4804-a227-844f3e551c78 | <http://192.168.0.233:8080/ontologies/RXNORM> | <http://purl.bioontology.org/ontology/RXNORM/214081> |
+--------------------------------------+-----------------------------------------------+------------------------------------------------------+
and getting
@base <localhost://> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<df:7ca250c1-0747-4db0-b613-a1bb45848711>
  <df:source_ontology> <http://192.168.0.233:8080/ontologies/RXNORM> ;
  <df:source_term> "<http://purl.bioontology.org/ontology/RXNORM/706898>" .

<df:e10acb95-e9d9-4804-a227-844f3e551c78>
  <df:source_ontology> <http://192.168.0.233:8080/ontologies/RXNORM> ;
  <df:source_term> "<http://purl.bioontology.org/ontology/RXNORM/214081>" .
Why is it interpreting my source_term column as strings, instead of IRIs?
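In the meantime I can work around it by asserting the triples myself with rdflib::rdf_add(), which takes an explicit objectType argument, though I'd rather as_rdf() handled this. A rough sketch, assuming objectType = "uri" is the right value to force IRI output and that the angle brackets need stripping first:

library(rdflib)

rdf <- rdf()
for (i in seq_len(nrow(test_withangles))) {
  rdf_add(rdf,
          subject    = paste0("df:", test_withangles$uuid[i]),
          predicate  = "df:source_term",
          # strip the surrounding angle brackets before asserting
          object     = gsub("[<>]", "", test_withangles$source_term[i]),
          # assumption: "uri" makes the object serialize as an IRI,
          # not a quoted literal
          objectType = "uri")
}
rdf_serialize(rdf = rdf, doc = "test_withangles.ttl")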

Related

How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| ---------- | ---------- |
| A-         | A-         |
| B          | 86         |
| C+         | C+         |
| B-         | D          |
| A          | A          |
| C-         | 72         |
| F          | 96         |
| B+         | B+         |
| B          | B          |
| A-         | A-         |
The other is a set of ≈700 entries that looks similar to this:
| grade | scale |
| ----- | ----- |
| A+    | 100   |
| A+    | 99    |
| A+    | 98    |
| A+    | 97    |
| A     | 96    |
| A     | 95    |
| A     | 94    |
| A     | 93    |
| A-    | 92    |
| A-    | 91    |
| A-    | 90    |
| B+    | 89    |
| B+    | 88    |
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1, with a binary 0/1 output (0 = no match, 1 = match). Most of grade_col2 is recorded as a letter grade, but every once in a while an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade rather than a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when grade_col1 is F and grade_col2 is 96 would this not be a match (just as grade_col1 = B- and grade_col2 = D is not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr, but once I got past the first "if" statement, I got stuck. I'd appreciate any help people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df; then you want something like the following:

output = grades_df %>%
  # join on the lookup, keeping everything in the grades table
  left_join(lookup_df, by = c("grade_col2" = "scale")) %>%
  # combine grade_col2 from grades_df and grade from lookup_df
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
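
For a quick sanity check, here's a self-contained version of the same pipeline on invented miniatures of the two tables (the data below are made up for illustration). One wrinkle: scale is numeric while grade_col2 is character, so the join key is cast to character first, otherwise left_join will refuse to match the columns:

library(dplyr)

# invented miniatures of the question's two dataframes
grades_df <- tibble::tibble(
  grade_col1 = c("A-", "B", "F", "B-"),
  grade_col2 = c("A-", "86", "96", "D")
)
lookup_df <- tibble::tibble(
  grade = c("A", "A", "B", "B"),
  scale = c(96, 95, 86, 85)
)

output = grades_df %>%
  # cast the numeric scale to character so the key types match
  left_join(lookup_df %>% mutate(scale = as.character(scale)),
            by = c("grade_col2" = "scale")) %>%
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))

# output$indicator is 1, 1, 0, 0: "B"/86 matches, "F"/96 does not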

List all strings appearing more than once in a file

I have a very large file (around 70GB), and I want to list all strings that appear more than once in the whole file.
I can list all the matches when I specify which string to search in a file, but I want to list all strings that have more than one occurrence.
For example, assuming my file looks like this:
+------+------------------------------------------------------------------+----------------------------------+
| HHID | VAL_CD64                                                         | VAL_CD32                         |
+------+------------------------------------------------------------------+----------------------------------+
| 203  | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 373aeda34c0b4ab91a02ecf55af58e15 |
| 7AB  | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | 2c4f97a04f02db5a36a85f48dab39b5b |
| 7AB  | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | 2111293e946703652070968b224875c9 |
| 348  | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 5c80a555fcda02d028fc60afa29c4a40 |
| 348  | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 6c10cd11b805fa57d2ca36df91654576 |
| 348  | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 6040b29107adf1a41c4f5964e0ff6dcb |
| 4D3  | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | 71a91c4768bd314f3c9dc74e9c7937e8 |
+------+------------------------------------------------------------------+----------------------------------+
And I want to list only the records whose HHID appears more than once, i.e., 7AB and 348.
Any idea how I can implement this?
awk to the rescue:
awk -F'[ |]+' '
  $2 ~ /^[[:alnum:]]+$/ { count[$2]++ }
  END {
    for (hhid in count) {
      if (count[hhid] >= 2) {
        print hhid
      }
    }
  }
' file
-F'[ |]+' sets the field separator to any run of blanks and pipes, so on the data rows $2 is the HHID column.
$2 ~ /^[[:alnum:]]+$/ skips the +----+ border lines, where $2 is empty. (The header row's HHID does get counted, but it occurs only once and is dropped by the >= 2 test.)
count[$2]++ increases the value at $2, the string we’re counting. On the first occurrence this initialises the value to 1. On the second occurrence it increases it to 2, and so on.
END is run after all lines have been processed.
for (hhid in count) iterates over the strings in count.
if (count[hhid] >= 2) skips any <2 counts.
print hhid prints the string.
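
Run against the sample file above, this prints the two expected IDs (in arbitrary order; for (hhid in count) makes no ordering guarantee):

7AB
348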

Parse data in Kusto

I am trying to parse the data below in Kusto and need help.
[[ObjectCount][LinkCount][DurationInUs]]
[ChangeEnumeration][[88][9][346194]]
[ModifyTargetInLive][[3][6][595903]]
I need a generic implementation, without any hardcoding.
Ideally, you'd be able to change the component that produces the source data in that format to use a standard format (e.g. CSV, JSON, etc.) instead.
The following could work, but you should consider it very inefficient:
let T = datatable(s:string)
[
    '[[ObjectCount][LinkCount][DurationInUs]]',
    '[ChangeEnumeration][[88][9][346194]]',
    '[ModifyTargetInLive][[3][6][595903]]',
];
let keys = toscalar(
    T
    | where s startswith "[["
    | take 1
    | project extract_all(@'\[([^\[\]]+)\]', s)
);
T
| where s !startswith "[["
| project values = extract_all(@'\[([^\[\]]+)\]', s)
| mv-apply with_itemindex = i keys on (
    extend Category = tostring(values[0]), p = pack(tostring(keys[i]), values[i + 1])
    | summarize b = make_bag(p) by Category
)
| project-away values
| evaluate bag_unpack(b)
This produces:
| Category | ObjectCount | LinkCount | DurationInUs |
|--------------------|-------------|-----------|--------------|
| ChangeEnumeration | 88 | 9 | 346194 |
| ModifyTargetInLive | 3 | 6 | 595903 |

How to duplicate input into outputs with jq?

I'm trying to adapt the following snippet:
echo '{"a":{"value":"b"}, "c":{"value":"d"}}' \
| jq -r '. as $in | keys[] | [$in[.].value | tostring + " 1"] | #tsv'
b 1
d 1
to output:
b 1
b 2
d 1
d 2
The following adaptation produces the desired output:
echo '{"a":{"value":"b"}, "c":{"value":"d"}}' |
jq -r '
def addindex(start;lessthan):
range(start;lessthan) as $i | "\(.) \($i)";
. as $in
| keys[]
| $in[.].value
| addindex(1;3)'
Note that keys emits the key names after they have been sorted, whereas keys_unsorted retains the ordering.

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A            | P      | X       | 173247      | 34649           |
| A            | P      | Y       | 451598      | 225799          |
| A            | P      | Z       | 759695      | 463414          |
| A            | Q      | X       | 358040      | 225565          |
| A            | Q      | Y       | 102068      | 36744           |
| A            | Q      | Z       | 994961      | 228841          |
| A            | R      | X       | 454672      | 231883          |
| A            | R      | Y       | 275994      | 124197          |
| A            | R      | Z       | 691100      | 165864          |
| B            | P      | X       | 755594      | 302238          |
| .            | .      | .       | .           | .               |
| .            | .      | .       | .           | .               |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't work: it summed the per-row defect rates (i.e., sum(Defective_Units / Total_Units)) instead of computing sum(Defective_Units) / sum(Total_Units):

myData[, Defect.Rate := Defective_Units / Total_Units]

This ended up giving me defect rates greater than 1. Is there anywhere I can declare a calculated field, i.e., a formula evaluated post-aggregation?
You're in luck: the creator of pivottable.js foresaw cases like yours (and mine, earlier today) and implemented an aggregator called "Sum over Sum", along with a few more like it; cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll pass "Sum over Sum" as the aggregatorName parameter, and the columns whose quotient we want as the vals parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars, rows = "gear", cols = c("cyl", "carb"),
            aggregatorName = "Sum over Sum",
            vals = c("mpg", "disp"),
            width = "100%", height = "400px")
