I'm trying to extract all the emails from my Gmail account to do some analysis. The end goal is a dataframe of emails. I'm using the gmailr package.
So far I've extracted all the email threads and "expanded" them by mapping all the thread IDs to gm_thread(). Here's the code for that:
threads <- gm_threads(num_results = 5)
thread_ids <- gm_id(threads)
#extract all the thread ids
threads_expanded <- map(thread_ids, gm_thread)
This returns a list of all the threads. The structure of this is a list of gmail_thread objects. When you drill down one level into the list of thread objects, str(threads_expanded[[1]], max.level = 1), you get a single thread object which looks like:
List of 3
$ id : chr "xxxx"
$ historyId: chr "yyyy"
$ messages :List of 3
- attr(*, "class")= chr "gmail_thread"
Then, if you drill down further into the messages composing the threads, you start to get the useful info. str(threads_expanded[[1]]$messages, max.level = 1) gets you a list of the gmail_message objects for that thread:
List of 3
$ :List of 8
..- attr(*, "class")= chr "gmail_message"
$ :List of 8
..- attr(*, "class")= chr "gmail_message"
$ :List of 8
..- attr(*, "class")= chr "gmail_message"
Where I'm stuck is actually extracting all the useful information from each email within all the threads. The end goal is a dataframe with a column for the message_id, thread_id, to, from, etc. I'm imagining something like this:
message_id | thread_id | to | from | ... |
-------------------------------------------------------------------------
1234 | abcd | me@gmail.com | pam@gmail.com | ... |
1235 | abcd | pam@gmail.com | me@gmail.com | ... |
1236 | abcf | me@gmail.com | tim@gmail.com | ... |
It's not the prettiest answer, but it works. I'm going to work on vectorizing it later:
library(gmailr)
library(purrr)      # map()
library(magrittr)   # %>%
library(tibble)     # tibble()
library(lubridate)  # dmy_hms()

# list the threads and extract all the thread ids
threads <- gm_threads(num_results = 5)
thread_ids <- gm_id(threads)
# fetch each thread in full
threads_expanded <- map(thread_ids, gm_thread)
# extract all the individual messages from each thread
msgs <- list()
for (i in seq_along(threads_expanded)) {
  msgs <- append(msgs, values = threads_expanded[[i]]$messages)
}
# get the message id for each message
msg_ids <- unlist(map(msgs, gm_id))
# get each message body, store in a vector
msg_body <- vector()
for (msg in msgs) {
  body <- gm_body(msg)
  attchmnt <- nrow(gm_attachments(msg))
  if (length(body) != 0 && attchmnt == 0) {
    # an empty body does not come back as NULL, but as a list of length 0;
    # so if the body is not empty and there are no attachments,
    # add it to the vector
    msg_body <- append(msg_body, body)
  } else {
    # otherwise (empty body, or the message has attachments),
    # add "" as a placeholder
    msg_body <- append(msg_body, "")
  }
}
msg_body <- unlist(msg_body)
# get datetime info, store in a vector
msg_datetime <- msgs %>%
  map(gm_date) %>%
  unlist() %>%
  dmy_hms()
message_df <- tibble(msg_ids, msg_datetime, msg_body)
# all the other possible categories, e.g., to, from, cc, subject, etc.,
# either use a similar for loop or a map call
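For the remaining columns (to, from, and so on), the underlying Gmail API message resource keeps them in a payload.headers list of name/value pairs, so the flattening step boils down to one row per message plus a header lookup. Below is a minimal sketch of that idea, written in Python rather than R purely for illustration. The sample data, the header() helper and flatten_threads() are all made up here and are not part of gmailr.

```python
# Made-up sample shaped like Gmail API thread resources (ids and addresses are invented).
threads = [
    {"id": "abcd", "messages": [
        {"id": "1234", "threadId": "abcd", "payload": {"headers": [
            {"name": "To", "value": "me@gmail.com"},
            {"name": "From", "value": "pam@gmail.com"}]}},
        {"id": "1235", "threadId": "abcd", "payload": {"headers": [
            {"name": "To", "value": "pam@gmail.com"},
            {"name": "From", "value": "me@gmail.com"}]}},
    ]},
    {"id": "abcf", "messages": [
        {"id": "1236", "threadId": "abcf", "payload": {"headers": [
            {"name": "From", "value": "tim@gmail.com"}]}},
    ]},
]


def header(message, name):
    """Return the first header called `name` (case-insensitive), or "" if absent."""
    for h in message.get("payload", {}).get("headers", []):
        if h["name"].lower() == name.lower():
            return h["value"]
    return ""


def flatten_threads(threads):
    """Build one row (a dict) per message across all threads."""
    return [
        {
            "message_id": msg["id"],
            "thread_id": msg["threadId"],
            "to": header(msg, "To"),
            "from": header(msg, "From"),
        }
        for thread in threads
        for msg in thread["messages"]
    ]


rows = flatten_threads(threads)
for row in rows:
    print(row)
```

Returning "" for a missing header keeps every row the same shape, which is the same reason the loop above pads msg_body with "".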
I'm trying to switch from the R package httr to httr2.
httr2 is the modern rewrite and should be superior, but I'm a novice when it comes to APIs and coding and I've been stuck all day trying to figure out what I'm doing wrong. I can only get httr to work with this API.
I believe I am messing up with adding the headers, as I don't think the path sent to the API is changing. So my problem is the request gets rejected simply because the API can't read my API key.
Here is what I have done in httr2:
gov_url <- "https://api.dummy.gov/aaa/bb/cc"
resp <- request(gov_url) %>%
req_headers(
param1 = "10",
api_key = "abcdefg",
param2 = "xyz",
param3 = "09/10/2022"
) %>%
req_dry_run()
Output:
GET /destiny/v1/placeholder HTTP/1.1
Host: api.dummy.gov
User-Agent: httr2/0.2.1 r-curl/4.3.2 libcurl/7.64.1
Accept: */*
Accept-Encoding: deflate, gzip
param1: 10
api_key: abcdefg
param2: xyz
param3: 09/10/2022
The first line of that output hasn't changed.
GET /destiny/v1/placeholder HTTP/1.1
Showing the structure of the 'resp' object with str(resp)
List of 7
$ url : chr "https://api.dummy.gov/destiny/v1/placeholder"
$ method : NULL
$ headers :List of 4
..$ param1 : chr "10"
..$ api_key : chr "abcdefg"
..$ param2 : chr "xyz"
..$ param3 : chr "09/10/2022"
$ body : NULL
$ fields : list()
$ options : list()
$ policies: list()
- attr(*, "class")= chr "httr2_request"
Sending the request with resp %>% req_perform(verbosity = 2) gives me this error:
HTTP/1.1 403 Forbidden
---
"error":
"code": "API_KEY_MISSING",
"message": "No api_key was supplied. Please submit with a valid API key."
But when I use httr though, I can pull data from the API.
gov_url <- "https://api.dummy.gov/destiny/v1/placeholder"
query_params <- list('param1' = '10',
'api_key' = "abcdefg",
'param2' = 'xyz',
'param3' = '09/10/2022')
gov_api <- GET(gov_url, query = query_params)
Showing the structure with str(gov_api) shows that the URL has changed, which is good because it matches the example input provided by the API:
List of 10
$ url : chr "https://api.dummy.gov/destiny/v1/placeholder?param1=10&api_key=abcdefg&param2"| __trunc
$ status_code: int 200
$ headers :List of 19
..$ xyz : chr "1"
..$ abc : chr "application/json"
..$ edg : chr "something"
And http_status(gov_api) shows it's making the connection:
$message
"Success: (200) OK"
Then I'm able to successfully use httr to pull data from the API.
Thank you to anyone who has read this far. I'd appreciate any feedback if possible.
Other things I've tried to no avail:
!!! to evaluate the list of expressions
Sending the API a list named query like I do in httr
Sending the param "accept" = "application/json"
Different syntax, quotations
I'm not sure what you're after, but if you want to reproduce the same behavior as httr with httr2, you can do:
library(httr2)
gov_url <- "https://api.dummy.gov/aaa/bb/cc"
param <- list(param1 = "10",
api_key = "abcdefg",
param2 = "xyz",
param3 = "09/10/2022")
resp <- request(gov_url) %>%
req_url_query(!!!param)
resp$url
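#> [1] "https://api.dummy.gov/aaa/bb/cc?param1=10&api_key=abcdefg&param2=xyz&param3=09%2F10%2F2022"
Perhaps the confusion comes from the fact that in your httr2 code you're setting headers, whereas in httr you're setting query parameters. Headers never change the URL, which is why the first line of the dry run stayed the same.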
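The headers-versus-query distinction is not specific to R. As a rough illustration, here is the same contrast using only Python's standard library; nothing is sent over the network, and the URL and parameter values are just the placeholders from the question.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "https://api.dummy.gov/aaa/bb/cc"  # placeholder URL from the question
params = {"param1": "10", "api_key": "abcdefg", "param2": "xyz", "param3": "09/10/2022"}

# Headers travel alongside the request; the URL itself is untouched.
with_headers = Request(base_url, headers=params)

# Query parameters are percent-encoded and appended to the URL.
with_query = Request(base_url + "?" + urlencode(params))

print(with_headers.full_url)
print(with_query.full_url)
```

An API that expects api_key in the query string will only find it in the second form, which would explain the API_KEY_MISSING response.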
I use str() to print out information about a computed variable:
str(samples)
'mcmc' num [1:1000, 1:228] 0.1079 -0.2367 -0.0757 -0.3414 -0.3382 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:228] "B[1,1,1]" "B[2,1,1]" "B[3,1,1]" "B[4,1,1]" ...
- attr(*, "mcpar")= num [1:3] 20005 25000 5
But how do I read this information, and what can this output tell us? For instance, the third component is attr(*, "mcpar")= num [1:3] 20005 25000 5. What does it mean?
The attr line means there is an attribute named 'mcpar', which is a numeric vector of length 3.
attributes(samples)
returns a list of all the attributes of samples, i.e. it will return dimnames as one attribute and 'mcpar' as another.
According to the documentation:
The ‘mcpar’ attribute of an MCMC object gives the start iteration the end iteration and the thinning interval of the chain.
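As a quick sanity check, the three mcpar values should be consistent with the number of rows str() reports. A small sketch of that arithmetic follows (in Python, just for illustration); the helper name retained_draws is made up, and the formula assumes the chain keeps every thin-th iteration from start to end inclusive.

```python
def retained_draws(start, end, thin):
    """Number of saved iterations when keeping every `thin`-th step from `start` to `end` inclusive."""
    return (end - start) // thin + 1


start, end, thin = 20005, 25000, 5  # the mcpar values from the str() output
n = retained_draws(start, end, thin)
print(n)  # 1000
```

That matches the [1:1000, 1:228] dimensions: 1000 retained draws for each of the 228 monitored parameters.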
I have this type of data that I want to send to a dataframe:
So, I am iterating through it and sending it to a vector. But my vector never keeps the data.
Dv = Vector{Dict}()
for item in reader
push!(Dv,item)
end
length(Dv)
This is what I get:
And I am sure this is the right way to do it. It works in Python:
EDIT
This is the code that I use to access the data that I want to send to a dataframe:
results=pyimport("splunklib.results")
kwargs_oneshot = (earliest_time= "2019-09-07T12:00:00.000-07:00",
latest_time= "2019-09-09T12:00:00.000-07:00",
count=0)
searchquery_oneshot = "search index=iis | lookup geo_BST_ONT longitude as sLongitude, latitude as sLatitude | stats count by featureId | geom geo_BST_ONT allFeatures=True | head 2"
oneshotsearch_results = service.jobs.oneshot(searchquery_oneshot; kwargs_oneshot...)
# Get the results and display them using the ResultsReader
reader = results.ResultsReader(oneshotsearch_results)
for item in reader
println(item)
end
ResultsReader is a streaming reader. This means you "consume" its elements as you iterate over them, so a second pass finds nothing left. You can convert it to an array with collect. Do not print the items before you collect.
results=pyimport("splunklib.results")
kwargs_oneshot = (earliest_time= "2019-09-07T12:00:00.000-07:00",
latest_time= "2019-09-09T12:00:00.000-07:00",
count=0)
searchquery_oneshot = "search index=iis | lookup geo_BST_ONT longitude as sLongitude, latitude as sLatitude | stats count by featureId | geom geo_BST_ONT allFeatures=True | head 2"
oneshotsearch_results = service.jobs.oneshot(searchquery_oneshot; kwargs_oneshot...)
# Get the results
reader = results.ResultsReader(oneshotsearch_results)
# collect them into an array
Dv = collect(reader)
# Now you can iterate over them without changing the result
for item in Dv
println(item)
end
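Since ResultsReader is a Python object underneath, the same one-pass behavior can be reproduced in plain Python. The sketch below uses a made-up generator, streaming_reader, as a stand-in for it (the records are invented); it is only meant to show why the vector ended up empty after the items had already been printed.

```python
def streaming_reader():
    """Stand-in for a streaming reader: yields each record exactly once."""
    yield {"featureId": "a", "count": "1"}
    yield {"featureId": "b", "count": "2"}


# Iterating first consumes the stream, so nothing is left to collect.
reader = streaming_reader()
for item in reader:
    print(item)
leftover = list(reader)

# Collecting first keeps the records, and the list can be iterated repeatedly.
reader = streaming_reader()
collected = list(reader)
for item in collected:
    print(item)

print(len(leftover), len(collected))
```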
I am reading Google Chrome's history from its SQLite database.
Table Name: Visits
Structure:
+-----+------------------+-----------+-----+--------+-----+
| "0" | "id" | "INTEGER" | "0" | "NULL" | "1" |
| "1" | "url" | "INTEGER" | "1" | "NULL" | "0" |
| "2" | "visit_time" | "INTEGER" | "1" | "NULL" | "0" |
| "3" | "from_visit" | "INTEGER" | "0" | "NULL" | "0" |
| "4" | "transition" | "INTEGER" | "1" | "0" | "0" |
| "5" | "segment_id" | "INTEGER" | "0" | "NULL" | "0" |
| "6" | "visit_duration" | "INTEGER" | "1" | "0" | "0" |
+-----+------------------+-----------+-----+--------+-----+
I was trying to find out what transition means, and I found the link Page Transitions. According to it, Google Chrome stores a transition value which identifies the type of transition between pages. These are stored in the history database to separate visits, and are reported by the renderer for page navigations.
There are many types of transitions, like LINK, TYPED, etc.
In the SQLite table, Google Chrome stores them as integer values.
Problem
How do I figure out the transition type from the integer value?
There are some more tables in the DB, but none of them contains a mapping for the meaning of these values.
Other tables are:
Probably a little late, but I'll just leave this here for someone else.
Here is the relevant code from Chromium source -
https://github.com/adobe/chromium/blob/cfe5bf0b51b1f6b9fe239c2a3c2f2364da9967d7/content/public/common/page_transition_types.cc
The basic idea is that you take the integer value from the database and perform a bitwise AND with the mask 0xff, which keeps only the lowest byte. No conversion to hex is needed; the mask is simply written in hex.
Run the result through a switch case to get the string value back.
For example, in JavaScript you can do the following:
>> 822083585 & 0xff
1
>> 1610612736 & 0xff
0
Based on @jayarma S's answer and on https://github.com/adobe/chromium/blob/cfe5bf0b51b1f6b9fe239c2a3c2f2364da9967d7/content/public/common/page_transition_types.h
You can map the Transition Types as follows:
LINK = 0
TYPED = 1
AUTO_BOOKMARK = 2
AUTO_SUBFRAME = 3
MANUAL_SUBFRAME = 4
GENERATED = 5
START_PAGE = 6
FORM_SUBMIT = 7
RELOAD = 8
KEYWORD = 9
KEYWORD_GENERATED = 10
You can get these core transition type values by applying the core mask: 0xFF
There are also qualifiers that further describe the transition:
FORWARD_BACK = 0x01000000
FROM_ADDRESS_BAR = 0x02000000
HOME_PAGE = 0x04000000
CHAIN_START = 0x10000000
CHAIN_END = 0x20000000
CLIENT_REDIRECT = 0x40000000
SERVER_REDIRECT = 0x80000000
IS_REDIRECT_MASK = 0xC0000000
You can get these qualifier transition type values by applying the qualifier mask: 0xFFFFFF00
Here is an SQLite query to get the transition types:
select u1.title as to_url_title,
u1.url as to_url,
CASE vs.transition & 0xff
WHEN 0
THEN 'LINK'
WHEN 1
THEN 'TYPED'
WHEN 2
THEN 'AUTO_BOOKMARK'
WHEN 3
THEN 'AUTO_SUBFRAME'
WHEN 4
THEN 'MANUAL_SUBFRAME'
WHEN 5
THEN 'GENERATED'
WHEN 6
THEN 'START_PAGE'
WHEN 7
THEN 'FORM_SUBMIT'
WHEN 8
THEN 'RELOAD'
WHEN 9
THEN 'KEYWORD'
WHEN 10
THEN 'KEYWORD_GENERATED'
ELSE NULL
END core_transition_type,
CASE vs.transition & 0xFFFFFF00
WHEN 0x01000000
THEN 'FORWARD_BACK'
WHEN 0x02000000
THEN 'FROM_ADDRESS_BAR'
WHEN 0x04000000
THEN 'HOME_PAGE'
WHEN 0x10000000
THEN 'CHAIN_START'
WHEN 0x20000000
THEN 'CHAIN_END'
WHEN 0x40000000
THEN 'CLIENT_REDIRECT'
WHEN 0x80000000
THEN 'SERVER_REDIRECT'
WHEN 0xC0000000
THEN 'IS_REDIRECT_MASK'
ELSE NULL
END qualifier_transition_type
from visits as vs
join urls u1 on u1.id = vs.url
order by vs.visit_time DESC;
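One caveat: the qualifiers are bit flags, so several can be set on the same visit (822083585 is 0x31000001, for instance), and a CASE that compares the masked value to a single flag only matches when exactly one qualifier is present. Here is a minimal sketch in Python that tests each flag separately instead; decode_transition is a made-up helper, and the tables are copied from the lists above.

```python
CORE_MASK = 0xFF

# Index in this list is the core transition value.
CORE_TYPES = [
    "LINK", "TYPED", "AUTO_BOOKMARK", "AUTO_SUBFRAME", "MANUAL_SUBFRAME",
    "GENERATED", "START_PAGE", "FORM_SUBMIT", "RELOAD", "KEYWORD",
    "KEYWORD_GENERATED",
]

# Individual qualifier flags (IS_REDIRECT_MASK is a mask, not a flag, so it is left out).
QUALIFIERS = {
    0x01000000: "FORWARD_BACK",
    0x02000000: "FROM_ADDRESS_BAR",
    0x04000000: "HOME_PAGE",
    0x10000000: "CHAIN_START",
    0x20000000: "CHAIN_END",
    0x40000000: "CLIENT_REDIRECT",
    0x80000000: "SERVER_REDIRECT",
}


def decode_transition(value):
    """Split a stored transition integer into (core type name, list of qualifier names)."""
    core = value & CORE_MASK
    core_name = CORE_TYPES[core] if core < len(CORE_TYPES) else None
    qualifiers = [name for flag, name in QUALIFIERS.items() if value & flag]
    return core_name, qualifiers


print(decode_transition(822083585))
print(decode_transition(1610612736))
```

For the two values used in the JavaScript example, this gives TYPED with FORWARD_BACK, CHAIN_START and CHAIN_END, and LINK with CHAIN_END and CLIENT_REDIRECT.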
In attempting to Answer this Question I came across this in the output of str()
## R reference
rref <- bibentry(bibtype = "Manual",
title = "R: A Language and Environment for Statistical Computing",
author = person("R Development Core Team"),
organization = "R Foundation for Statistical Computing",
address = "Vienna, Austria",
year = 2010,
isbn = "3-900051-07-0",
url = "http://www.R-project.org/")
> str(rref)
Class 'bibentry' hidden list of 1
$ :List of 7
..$ title : chr "R: A Language and Environment for Statistical Computing"
..$ author :Class 'person' hidden list of 1
.. ..$ :List of 5
.. .. ..$ given : chr "R Development Core Team"
.. .. ..$ family : NULL
.. .. ..$ role : NULL
.. .. ..$ email : NULL
.. .. ..$ comment: NULL
..$ organization: chr "R Foundation for Statistical Computing"
..$ address : chr "Vienna, Austria"
..$ year : chr "2010"
..$ isbn : chr "3-900051-07-0"
..$ url : chr "http://www.R-project.org/"
..- attr(*, "bibtype")= chr "Manual"
In particular, I'm puzzled by this bit:
> str(rref)
Class 'bibentry' hidden list of 1
$ :List of 7
What does the "hidden list" bit refer to? What kind of object is this? Is this just some formatting output from str() when there is only a single component in the object that is itself a list? If so, is there a way to force str() to show the full structure?
This seems like an artefact of str. My interpretation is that the words hidden list are printed in the output of str if the object is a list that is not a pairlist and has what str treats as an "irregular" class.
Since your object is of class bibentry, and there is no str method for bibentry, the method utils:::str.default is used to describe the structure.
Condensed extract from str.default:
...
if (is.list(object)) {
i.pl <- is.pairlist(object)
...
cat(if (i.pl)
"Dotted pair list"
else if (irregCl)
paste(pClass(cl), "hidden list")
else "List", " of ", le, "\n", sep = "")
...
}
The key bit that defines irregCl is:
....
else {
if (irregCl <- has.class && identical(object[[1L]],
object)) {
....
and that explains the hidden list bit: it hides the outer list if the object has a class and object[[1]] is identical to object. As you showed in the Answer you linked to, the [[ method returns an identical object if the list contains a single "bibentry" object.