Read a file and extract text in R

I have the following data in a file:
Message-ID: <123.juii@jkk>
Date: Wed, 9 Mar 2002 16:12:51 -0800 (CST)
From: jennifer.mcquade@enron.com
To: abc@ron.com, def@ron.com, ghi@ron.com,
gty@ron.com, mkl@ron.com
Subject: Sales details
Please find attached the latest sales information
let me know what you can do.
Thanks,
jLian
I want to extract only the body of the e-mail. The only approach I could think of was to keep the lines that do not contain a ":" character, but this results in:
gty@ron.com, mkl@ron.com
Please find attached the latest sales information and
let me know what you can do.
Thanks,
jLian
Here the first line is a leftover from the To: header; only the lines from the second one onward are message content.
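library("stringr")

# Open the file and read it one line at a time
rawData <- file("mail1", "r")
while (TRUE) {
  line <- readLines(rawData, n = 1)
  if (length(line) == 0) {
    break
  }
  # Print only the lines that do not contain a colon
  if (!str_detect(line, ":")) {
    print(line)
  }
}
close(rawData)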

See if this works:
data:
mail <-
'Message-ID: <123.juii@jkk>
Date: Wed, 9 Mar 2002 16:12:51 -0800 (CST)
From: jennifer.mcquade@enron.com
To: abc@ron.com, def@ron.com, ghi@ron.com,
gty@ron.com, mkl@ron.com
Subject: Sales details

Please find attached the latest sales information
let me know what you can do.
Thanks,
jLian'
code:
# Drop everything up to and including the blank line that follows the Subject header
cat(sub(".*Subject:.*?\n\n", "", mail))
result:
#Please find attached the latest sales information
#let me know what you can do.
#Thanks,
#jLian
To apply this to many e-mails, read each message into a single multiline string and store the strings as elements of a list.
listOfMails <- list(mail, mail, mail)  # as many as you have
fun1 <- function(m) sub(".*Subject:.*?\n\n", "", m)
onlyContent <- lapply(listOfMails, fun1)
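For comparison, the same idea can be sketched outside R. The Python snippet below is only an illustration, not part of the answer: it assumes, as the regular expression does, that a blank line separates the headers from the body, and the function name and sample addresses are hypothetical.

```python
def extract_body(raw_message: str) -> str:
    """Return everything after the first blank line of a raw e-mail message."""
    # partition() splits at the first blank line only; if the message has no
    # blank line, the third element is empty and an empty string is returned.
    _headers, _separator, body = raw_message.partition("\n\n")
    return body


# Hypothetical sample message with a folded To: header.
sample = (
    "From: sender@example.com\n"
    "To: first@example.com,\n"
    "    second@example.com\n"
    "Subject: Sales details\n"
    "\n"
    "Please find attached the latest sales information\n"
    "let me know what you can do.\n"
)

print(extract_body(sample))
```

Splitting at the blank line rather than filtering on ":" avoids both problems from the question: folded header lines are dropped, and body lines that happen to contain a colon are kept.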

Related

Dagster: Multiple and Conditional Outputs (Type check failed for step output xxx PySparkDataFrame)

I'm executing the Dagster tutorial, and I got stuck at the Multiple and Conditional Outputs step.
In the solid definitions, it asks to declare (among other things):
output_defs=[
    OutputDefinition(
        name="hot_cereals", dagster_type=DataFrame, is_required=False
    ),
    OutputDefinition(
        name="cold_cereals", dagster_type=DataFrame, is_required=False
    ),
],
But there's no information about where DataFrame comes from.
First I tried pandas.DataFrame, but I got the error: {dagster_type} is not a valid dagster type. It happens when I try to load it via $ dagit -f multiple_outputs.py.
Then I installed dagster_pyspark and tried dagster_pyspark.DataFrame. This time I managed to submit the DAG to the UI. However, when I run it from the UI, I get the following error:
dagster.core.errors.DagsterTypeCheckDidNotPass: Type check failed for step output hot_cereals of type PySparkDataFrame.
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_plan.py", line 210, in _dagster_event_sequence_for_step
for step_event in check.generator(step_events):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 273, in core_dagster_event_sequence_for_step
for evt in _create_step_events_for_output(step_context, user_event):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 298, in _create_step_events_for_output
for output_event in _type_checked_step_output_event_sequence(step_context, output):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 221, in _type_checked_step_output_event_sequence
dagster_type=step_output.dagster_type,
Does anyone know how to fix it? Thanks for the help!

As Arthur pointed out, the full tutorial code is available on Dagster's github.
However, you do not need dagster_pandas. The key lines missing from your code are:
import typing
from typing import Any
from dagster import PythonObjectDagsterType

if typing.TYPE_CHECKING:
    DataFrame = list
else:
    DataFrame = PythonObjectDagsterType(list, name="DataFrame")  # type: Any
This structure exists for MyPy compliance; see the Types & Expectations section of the tutorial.
See also the documentation on Dagster types.

I was stuck here too, but luckily I found the updated source code.
The docs have been updated so that the OutputDefinition is declared without a dagster_type.
Update the code that comes before the sorting solid and the pipeline definition as follows:
import csv
import os

from dagster import (
    Bool,
    Field,
    Output,
    OutputDefinition,
    execute_pipeline,
    pipeline,
    solid,
)


@solid
def read_csv(context, csv_path):
    lines = []
    csv_path = os.path.join(os.path.dirname(__file__), csv_path)
    with open(csv_path, "r") as fd:
        for row in csv.DictReader(fd):
            row["calories"] = int(row["calories"])
            lines.append(row)
    context.log.info("Read {n_lines} lines".format(n_lines=len(lines)))
    return lines


@solid(
    config_schema={
        "process_hot": Field(Bool, is_required=False, default_value=True),
        "process_cold": Field(Bool, is_required=False, default_value=True),
    },
    output_defs=[
        OutputDefinition(name="hot_cereals", is_required=False),
        OutputDefinition(name="cold_cereals", is_required=False),
    ],
)
def split_cereals(context, cereals):
    # Each output is yielded only when its config flag is enabled
    if context.solid_config["process_hot"]:
        hot_cereals = [cereal for cereal in cereals if cereal["type"] == "H"]
        yield Output(hot_cereals, "hot_cereals")
    if context.solid_config["process_cold"]:
        cold_cereals = [cereal for cereal in cereals if cereal["type"] == "C"]
        yield Output(cold_cereals, "cold_cereals")
The complete code is also available here.
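The conditional-output pattern itself does not depend on Dagster. As a rough sketch in plain Python, the `Output` class below is a hypothetical stand-in for `dagster.Output`, the config flags are passed as ordinary arguments, and the cereal names are invented for illustration:

```python
from typing import Any, Dict, Iterator, List, NamedTuple


class Output(NamedTuple):
    """Hypothetical stand-in for dagster.Output: a value tagged with an output name."""
    value: Any
    output_name: str


def split_cereals(
    cereals: List[Dict[str, str]],
    process_hot: bool = True,
    process_cold: bool = True,
) -> Iterator[Output]:
    # Each branch yields its output only when its flag is set, so a
    # disabled output is never produced at all.
    if process_hot:
        yield Output([c for c in cereals if c["type"] == "H"], "hot_cereals")
    if process_cold:
        yield Output([c for c in cereals if c["type"] == "C"], "cold_cereals")


# Invented sample rows; only the "type" key matters here.
cereals = [
    {"name": "Hypothetical Oats", "type": "H"},
    {"name": "Hypothetical Flakes", "type": "C"},
]

# Collect whatever outputs were actually yielded, keyed by output name.
produced = {
    out.output_name: out.value
    for out in split_cereals(cereals, process_cold=False)
}
print(sorted(produced))
```

Because a disabled branch yields nothing, anything downstream of that output has no input to run on, which is why the outputs in the answer are declared with is_required=False.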

First, try installing the dagster_pandas integration:
pip install dagster_pandas
Then do:
from dagster_pandas import DataFrame
You can find the code from the tutorial here.

How can I seq_along an object of type response (httr package)

I have a list that contains 4 objects of type Response, as in an API response:
Response [https://api.idealista.com/3.5/es/search?&operation= etc. etc.]
Date: 2018-06-04 12:27
Status: 200
Content-Type: application/json;charset=UTF-8
Size: 45 kB
Suppose the list is called holle. I can access the contents of one element and assign them to another list, revs, as follows:
library(httr)
library(rlist)
revs[[1]] <- content(holle[[1]])$elementList
This works perfectly fine and all is well. However, I would like to seq_along each element and access the contents. When I write a for/seq_along, I get this error message:
for (i in seq_along(content(holle)$elementList)) {
  revs[[i]] <- content(holle[[i]])$elementList
}
"Error in content(holle) : is.response(x) is not TRUE".
Why?

R: How to cleanly retrieve the attributes of a remote file on the internet?

I can download a file from the internet easily enough using code such as this:
myurl <- "http://www.jatma.or.jp/toukei/xls/13_01.xls"
download.file(myurl, destfile = myfilepath, mode = 'wb')
However, usually I want to check the date the file was last modified before I download it. I can do this very easily in Perl using the LWP::Simple package. I've poked through the documentation for RCurl (which I admit I understand only poorly) and the closest thing I can find is the basicHeaderGatherer function.
library(RCurl)
if (url.exists("http://www.jatma.or.jp/toukei/xls/13_01.xls")) {
  h <- basicHeaderGatherer()
  foo <- getURL("http://www.jatma.or.jp/toukei/xls/13_01.xls",
                headerfunction = h$update)
  names(h$value())
  h$value()
}
h$value()[3]
By using the code above I can eventually access the 'Last-Modified' attribute, but not without generating errors as per the output below. How can I clean up my code to avoid this error and access the 'Last-Modified' attribute in a straightforward manner?
(Please note: this answer looks promising but it generates similar error messages to those shown below, so it doesn't resolve this particular issue.)
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) (from #3) :
embedded nul in string: ' \021ࡱ\032 \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0>\0\003\0 \t\0\006\0\0\0\0\0\0\0\0\0\0\0\001\0\0\09\0\0\0\0\0\0\0\0\020\0\0 \0\0\0\0 \0\0\0\08\0\0\0 \t\b\020\0\0\006\005\0g2 \a \0\002\0\006\006\0\0 \0\002\0 \004 \0\002\0\0\0 \0\0\0\\\0p\0\003\0\0CVC B\0\002\0 \004a\001\002\0\0\0 \001\0\0=\001\002\0$\0 \0\002\0\021\0\031\0\002\0\0\0\022\0\002\0\0\0\023\0\002\0\0\0 \001\002\0\0\0 \001\002\0\0\0=\0\022\0 \017\0xKX/8\0\0\0\
> h$value()[3]
Last-Modified
"Fri, 06 Dec 2013 05:33:53 GMT"
>

library(RCurl)
url.exists("http://www.jatma.or.jp/toukei/xls/13_01.xls", .header = TRUE)["Last-Modified"]
# Last-Modified
# "Fri, 06 Dec 2013 05:33:53 GMT"
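Once the header value is available, deciding whether to download comes down to parsing the HTTP date and comparing it with a reference time. The Python sketch below illustrates that step using only the standard library; the function names and the reference date are hypothetical, and `fetch_last_modified` is defined but not called, so the example runs without network access.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def fetch_last_modified(url: str) -> Optional[str]:
    """Send a HEAD request and return the raw Last-Modified header, if any."""
    # A HEAD request asks for the headers only, so the file body is not downloaded.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Last-Modified")


def is_newer_than(last_modified: str, reference: datetime) -> bool:
    """Compare an HTTP date string with a timezone-aware reference time."""
    return parsedate_to_datetime(last_modified) > reference


# Header value taken from the output shown above.
header = "Fri, 06 Dec 2013 05:33:53 GMT"
# Hypothetical timestamp of a local copy of the file.
local_copy_time = datetime(2013, 1, 1, tzinfo=timezone.utc)
print(is_newer_than(header, local_copy_time))
```

Requesting only the headers is also what avoids the "embedded nul in string" error from the question, which arises when the binary body of the .xls file is read into a character string.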

PSI - Statusing Web Service - Results not as expected

I'm trying to update status information on assignments via the Statusing Web Service (PSI). The problem is that the results are not as expected. I'll try to explain in detail what I'm doing:
Two cases:
1) An assignment for the resource exists on specified tasks. I want to report work actuals (update status).
2) There is no assignment for the resource on specified tasks. I want to create the assignment and report work actuals.
I have one task in my project (Auto scheduled, Fixed work). Resource availability of all resources is set to 100%. They all have the same calendar.
Name: Task 31 - Fixed Work
Duration: 12,5 days?
Start: Thu 14.03.13
Finish: Tue 02.04.13
Resource Names: Resource 1
Work: 100 hrs
First I execute an UpdateStatus with the following ChangeXML
<Changes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Proj ID="a8a601ce-f3ab-4c01-97ce-fecdad2359d9">
    <Assn ID="d7273a28-c038-486b-b997-cdb2450ceef5" ResID="8a164257-7960-4b76-9506-ccd0efabdb72">
      <Change PID="251658250">900000</Change>
    </Assn>
  </Proj>
</Changes>
Then I call a SubmitStatusForResource
client.SubmitStatusForResource(new Guid("8a164257-7960-4b76-9506-ccd0efabdb72"), null, "auto submit PSIStatusingGateway");
The following entry pops up in approval center (which is as I expected it):
Status Update; Task 31; Task update; Resource 1; 3/20/2012; 15h; 15%; 85h
Update in Project (still looks fine):
Task Name: Task 31 - Fixed Work
Duration: 12,5 days?
Start: Thu 14.03.13
Finish: Tue 02.04.13
Resource Names: Resource 1
Work: 100 hrs
Actual Work: 15 hrs
Remaining Work: 85 hrs
Then the second case is executed. First I create a new assignment:
client.CreateNewAssignmentWithWork(
    sName: Task 31 - Fixed Work,
    projGuid: "a8a601ce-f3ab-4c01-97ce-fecdad2359d9",
    taskGuid: "024d7b61-858b-40bb-ade3-009d7d821b3f",
    assnGuid: "e3451938-36a5-4df3-87b1-0eb4b25a1dab",
    sumTaskGuid: Guid.Empty,
    dtStart: 14.03.2013 08:00:00,
    dtFinish: 02.04.2013 15:36:00,
    actWork: 900000,
    fMilestone: false,
    fAddToTimesheet: false,
    fSubmit: false,
    sComment: "auto commit...");
Then I call the UpdateStatus again:
<Changes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Proj ID="a8a601ce-f3ab-4c01-97ce-fecdad2359d9">
    <Assn ID="e3451938-36a5-4df3-87b1-0eb4b25a1dab" ResID="c59ad8e2-7533-47bd-baa5-f5b03c3c43d6">
      <Change PID="251658250">900000</Change>
    </Assn>
  </Proj>
</Changes>
And finally the SubmitStatusForResource again
client.SubmitStatusForResource(new Guid("c59ad8e2-7533-47bd-baa5-f5b03c3c43d6"), null, "auto submit PSIStatusingGateway");
This creates the following entry in approval center:
Status Update; Task 31 - Fixed Work; New reassignment request; Resource 2; 3/20/2012; 15h; 100%; 0h
I accept it and update my project:
Name: Task 31 - Fixed Work
Duration: 6,76 days?
Start: Thu 14.03.13
Finish: Mon 25.03.13
Resource Names: Resource 1;Resource 2
Work: 69,05 hrs
Actual Work: 30 hrs
Remaining Work: 39,05 hrs
I really don't understand why the new work would be 69,05 hours. The results I expected are:
Name: Task 31 - Fixed Work
Duration: 6,76 days?
Start: Thu 14.03.13
Finish: Mon 25.03.13
Resource Names: Resource 1;Resource 2
Work: 65 hrs
Actual Work: 30 hrs
Remaining Work: 35 hrs
I've spent a lot of time trying to find out how to update the values to get the results I want. I would really appreciate some help. This makes me want to rip my hair out!
Thanks in advance
PS: Forgot to say that I'm working with MS Project Server 2010 and MS Project Professional 2010
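The question does not state the unit of the 900000 value sent in the ChangeXML. The 15h that the approval center shows for that change is consistent with work being expressed in thousandths of a minute. Treating that as an assumption, the conversion and the work totals quoted above can be checked with a short Python sketch (the function names are hypothetical):

```python
# Assumption inferred from the question: 900000 units correspond to 15 hours,
# i.e. one unit is 1/1000 of a minute.
UNITS_PER_HOUR = 60 * 1000


def hours_to_work_units(hours: float) -> int:
    """Convert hours to the integer work value sent in the change request."""
    return round(hours * UNITS_PER_HOUR)


def work_units_to_hours(units: int) -> float:
    """Convert a work value back to hours."""
    return units / UNITS_PER_HOUR


def total_work(actual_hours: float, remaining_hours: float) -> float:
    """Work is the sum of actual and remaining work."""
    return round(actual_hours + remaining_hours, 2)


print(work_units_to_hours(900000))   # value sent for each resource
print(total_work(30, 39.05))         # total reported by Project
print(total_work(30, 35))            # total the author expected
```

Both totals satisfy Work = Actual Work + Remaining Work, so the two results differ only in the remaining work that was calculated after the second assignment was accepted (39,05 hrs instead of the expected 35 hrs).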

Delayed Job converting time, giving weird mocha expectation error

I'm not sure whether you test like this, but I'm a TDD guy and keep stumbling into weird stuff. The timestamps are being converted somehow, either by DJ or by the time zone handling; I don't know which. A test example follows.
I'm using delayed_job 2.0.3
data = {:value => 0.856, :timestamp => Time.zone.now}
job = MyMailer.send_later :send_values, data, emails
MyMailer.expects(:send_values).with(data, emails).once
job.payload_object.perform

class MyMailer
  def self.send_values(data, emails)
  end
end
OK, test expectation failure
unexpected invocation: MyMailer.send_values({:value => 0.856576407208833, :timestamp => Thu Nov 11 22:01:00 UTC 2010 (1289512860.94962 secs)}, ..
unsatisfied expectations:
- expected exactly once, not yet invoked: MyMailer.send_values({:value => 0.856576407208833, :timestamp => Thu, 11 Nov 2010 23:01:00 CET +01:00}...
With DateTime.now instead of Time.zone.now the result is similar:
got :timestamp => Thu Nov 11 23:13:33 +0100 2010 (1289513613.0 secs)
expected :timestamp => 2010-11-11T23:13:33+01:00
What's happening? How can I control it (do I want to)?

I thought I had the answer to this when I first saw the question. But I don't anymore, so here is a guess:
Time.zone.now, DateTime.now and Time.now each return slightly different objects, and Time.zone is something Rails adds (I think), so perhaps some part of your testing framework loses the zone information along the way. Time.zone.now.to_datetime has rescued me from similar problems before: it lets you use the zone and gives you the same format as DateTime.now, which has no zone-aware equivalent (there is no DateTime.zone.now).
Try: data = {:value => 0.856, :timestamp => Time.zone.now.to_datetime}
