Flume-twitter streaming API - hadoop-streaming

I am new to Flume. I have used Flume to stream data from Twitter using the Search API, but the Twitter JSON has the "geo" key set to null. So is there a way to get the Twitter data using the Streaming API in Flume?

Please refer to this link. It helped me a lot when I tried to do the same some time ago. Basically, you have to do the following:
Create an application in https://dev.twitter.com/apps/ in order to generate the OAuth keys. This step is probably already done since you say you have already queried Twitter in the past.
Download the Cloudera Flume source specifically designed for Twitter from here and add that jar to the Flume classpath by editing conf/flume-env.sh and adding this line:
FLUME_CLASSPATH="/home/training/Installations/apache-flume-1.3.1-bin/flume-sources-1.0-SNAPSHOT.jar"
Edit a Flume configuration file for a new Twitter agent called "TwitterAgent", something like:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = <comma-separated list of keywords you are interested in>
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Then, you are ready to start the Twitter Flume agent by issuing this command:
$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Related

Openstack API - Creating instances does not accept user-data = <bash script>

I am automating instance creation using the OpenStack SDK and passing a bash script with commands as userdata. But the script does not execute even though the instance is created. When I do this manually via the GUI, the bash script executes fine on the newly created instance.
# Reading bash script
with open('elk.sh', 'r') as f:
    init_script = f.read()

server = conn.compute.create_server(
    name=name,
    image_id=IMAGE_ID,
    flavor_id=FLAVOUR_ID,
    networks=[{"uuid": NETWORK_ID}],
    user_data=init_script,  # pass script to the instance
    key_name=KEY_PAIR
)
Note: I also tried to Base64-encode the file, but it still failed with "is not JSON serializable".
Code snippet:
with open(USER_DATA, 'r') as file:
    f = file.read()
    bytes_content = bytes(f, encoding='utf-8')
    init_script = base64.b64encode(bytes_content)
Can anyone advise on this, please?
Thanks
Python 3 handles strings and binary data differently. Also, to pass a bash/cloud-config file as user_data via the OpenStack SDK, it has to be Base64 encoded.
Code snippet:
import base64
from oslo_utils import encodeutils  # safe_encode lives in oslo_utils

with open(USER_DATA, 'r') as file:
    f = encodeutils.safe_encode(file.read().encode('utf-8'))
    init_script = base64.b64encode(f).decode('utf-8')
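Putting the two pieces together, here is a minimal sketch (reusing the placeholder names conn, IMAGE_ID, FLAVOUR_ID, NETWORK_ID and KEY_PAIR from the question, and using only the standard base64 module) of reading the script, Base64-encoding it to a plain string, and passing it as user_data:
import base64

# Read the bash script and Base64-encode it; the trailing .decode() turns the
# bytes back into a str so the request body stays JSON serializable
with open('elk.sh', 'r') as f:
    init_script = base64.b64encode(f.read().encode('utf-8')).decode('utf-8')

server = conn.compute.create_server(
    name=name,
    image_id=IMAGE_ID,
    flavor_id=FLAVOUR_ID,
    networks=[{"uuid": NETWORK_ID}],
    user_data=init_script,  # Base64-encoded user data
    key_name=KEY_PAIR
)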

Using Jsch ChannelSftp put to a GDG file on mainframe

I'm currently stuck trying to find a way to put to a generation data group (GDG) on a mainframe (z/OS) using JSch.
The command line syntax would be
put myFile.txt GDG.MYGDG(+1)
I know this works fine when using the sftp or ftp command via a Linux shell, but I have not found any way to replicate it via the put method of the ChannelSftp class.
I have searched far and wide looking for examples and found none, which makes me think it is not possible with this framework.
Any help would be appreciated.
UPDATE
Adding some example code. This is not the full code, but it shows what I have tried.
String localFileName = localFile.getName();
String remoteFile = "GDG.GDGROUP1(+1)";
FileInputStream inputStream = new FileInputStream(localFile);
ChannelSftp channelSftp = this.getChannelSftp();
String pwd = channelSftp.pwd();
channelSftp.put(inputStream, remoteFile);
The put call always returns an error saying that this is a directory. I have also tried this for the remote file:
String remoteFile = localFile + " GDG.GDGROUP1(+1)";
This just returns failure.
I have tried absolute paths using // and /-/. So far nothing has worked.

S3: How to do a partial read / seek without downloading the complete file?

Although they resemble files, objects in Amazon S3 aren't really "files", just like S3 buckets aren't really directories. On a Unix system I can use head to preview the first few lines of a file, no matter how large it is, but I can't do this on S3. So how do I do a partial read on S3?
S3 files can be huge, but you don't have to fetch the entire thing just to read the first few bytes. The S3 APIs support the HTTP Range: header (see RFC 2616), which takes a byte-range argument.
Just add a Range: bytes=0-NN header to your S3 request, where NN is the number of bytes to read, and you'll fetch only those bytes rather than reading the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. See the full GET Object documentation in Amazon's developer docs.
The AWS .NET SDK only shows that fixed-ended ranges are possible (re: public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at byte 1000 and read to the end", but I do not believe they have allowed for this in the .NET library.
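For what it's worth, the underlying GET Object API (and SDKs that pass the Range string through verbatim, such as boto3) does accept an open-ended range. A minimal sketch, assuming boto3 and placeholder bucket/key names:
import boto3

s3 = boto3.client('s3')
# Open-ended range: start at byte 1000 and read through to the end of the object
resp = s3.get_object(Bucket='my-bucket', Key='my-key', Range='bytes=1000-')
tail = resp['Body'].read()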
The get_object API has an argument for partial reads:
s3 = boto3.client('s3')
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes={}-{}'.format(start_byte, stop_byte-1))
res = resp['Body'].read()
Using Python you can preview the first records of a compressed file.
Connect using boto.
# Connect
import boto

s3 = boto.connect_s3()
bname = 'my_bucket'
bucket = s3.get_bucket(bname, validate=False)
Read the first 20 records from the gzip-compressed file:
# Read first 20 records
import io
import csv
from gzip import GzipFile
from boto.s3.key import Key

limit = 20
k = Key(bucket)
k.key = 'my_file.gz'
k.open()
gzipped = GzipFile(None, 'rb', fileobj=k)
reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
for id, line in enumerate(reader):
    if id >= int(limit):
        break
    print(id, line)
So it's the equivalent of the following Unix command:
zcat my_file.gz | head -20

R API Connection (Localytics) with getURL and RCurl Error

I am trying to get data from the mobile analytics service Localytics via their API (https://api.localytics.com/docs#query). In particular, I would like to translate the following cURL command into R:
curl --get 'https://api.localytics.com/v1/query' \
--user 'API_KEY:API_SECRET' \
--data 'app_id=APP_ID' \
--data 'metrics=users' \
--data 'dimensions=day' \
--data-urlencode 'conditions={"day":["between","2013-04-01","2013-04-07"]}'
My R code looks like this at the moment. The API key and API secret are of course replaced by the actual keys. However, I receive an error stating that at least a dimension or a metric has to be specified.
object <- getURL('https://api.localytics.com/v1/query',
                 userpwd = "API_Key:API_Secret",
                 httpheader = list(app_id = "app_id=03343434353534",
                                   metrics = "metrics=users",
                                   dimensions = "dimensions=day",
                                   conditions = toJSON('conditions={"day":["between","2014-07-01","2014-07-10"]}')),
                 ssl.verifypeer = FALSE)
What changes would be necessary to get it to work?
Thanks in advance for helping me out,
Peter
This is particularly easy with the dev version of httr:
library(httr)
r <- POST('https://api.localytics.com/v1/query',
  body = list(
    app_id = "APP_ID",
    metrics = "users",
    dimensions = "day",
    conditions = list(
      day = c("between", "2014-07-01", "2014-07-10")
    )
  ),
  encode = "json",
  authenticate("API_key", "API_secret")
)
stop_for_status(r)
content(r)
(I converted the request to a POST and used JSON encoding for everything, as described in the API docs.)
If you want to see exactly what's being sent to the server, use the verbose() config.
It looks like getURL passes the parameters you supplied as HTTP headers and not as query-string data, as your curl call does. You should use getForm instead. Also, I wasn't sure which library your toJSON function came from, but that's at least not the right syntax for the one from rjson.
Anyway, here's a call from R which should produce the same HTTP call as your curl command:
library(rjson)
library(RCurl)
object <- getForm('https://api.localytics.com/v1/query',
  app_id = "APP_ID",
  metrics = "users",
  dimensions = "day",
  conditions = toJSON(list(day = c("between", "2014-07-01", "2014-07-10"))),
  .opts = curlOptions(userpwd = "API_Key:API_Secret", httpauth = 1L)
)
I found that using the site http://requestb.in/ is very helpful in debugging these problems (and that's exactly what I used to create this solution). You can send requests to their site and they record the exact HTTP message that was sent so you can compare different methods.
The httpauth part came from this SO question; it seemed to be required to trigger authentication for the test site, and you may not need it for the "real" site.

How to completely script the process of importing SSL certificate and binding this certificate to a specific site

I have been looking around for a solution to this problem that works across different versions of Windows Server and IIS, but so far I haven't found a reasonable one. What I need is some sort of script or command-line tool that takes a certificate file (a .pfx, for example) and then, using the same script or tool, configures one website to use this certificate.
I found a good script on TechNet
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/96ccb49f-b669-4e05-965e-3090984a3594.mspx?mfr=true
CertImport.vbs
Option Explicit
Dim iiscertobj, pfxfile, pfxfilepassword, InstanceName, WebFarmServers, IISServer
Set iiscertobj = WScript.CreateObject("IIS.CertObj")
pfxfile = WScript.Arguments(0)
pfxfilepassword = WScript.Arguments(1)
InstanceName = WScript.Arguments(2)
WebFarmServers = split(WScript.Arguments(3), ",")
iiscertobj.UserName = WScript.Arguments(4)
iiscertobj.UserPassword = WScript.Arguments(5)
For Each IISServer in WebFarmServers
    iiscertobj.ServerName = IISServer
    iiscertobj.InstanceName = InstanceName
    iiscertobj.Import pfxfile, pfxfilepassword, true, true
Next
