Streaming Data to HDFS via webHDFS protocol - hadoop-streaming

We would like to write streaming data to HDFS using the WebHDFS (HTTPS) protocol.
Is that possible?

Yes. For the CREATE/APPEND request, put the connection into chunked mode with HttpURLConnection.setChunkedStreamingMode(CHUNK_SIZE) before you open its output stream, then write the data to that stream as it arrives.
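A minimal sketch of the chunked upload, not a complete WebHDFS client. It assumes `target` is already the URL that accepts the file data (for WebHDFS CREATE that is the datanode URL returned in the Location header of the first, body-less PUT to the namenode). The class name `ChunkedUpload`, the method name `streamTo` and the 8 KiB chunk size are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

final class ChunkedUpload {
    // Assumed chunk size; tune it for your workload.
    static final int CHUNK_SIZE = 8192;

    /**
     * Sends everything readable from `in` to `target` with an HTTP PUT,
     * using chunked transfer encoding so the total length need not be known.
     * Returns the HTTP status code of the response.
     */
    static int streamTo(URL target, InputStream in) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) target.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        // Must be called before the connection is opened.
        conn.setChunkedStreamingMode(CHUNK_SIZE);
        try (OutputStream out = conn.getOutputStream()) {
            byte[] buf = new byte[CHUNK_SIZE];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // forwarded without buffering the whole body
            }
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }
}
```

Note that once streaming mode is enabled, HttpURLConnection cannot follow redirects automatically, which is one more reason to resolve the datanode URL first and stream only to that.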

Related

Send a file to Winsock socket with curl utility

I need to send a file to an HTTP server with the curl utility. However, as my server application only needs to process a single file, I'd like to avoid using some large HTTP framework with a wide range of functionality, so I'd like to stick to TCP/UDP protocols with some simple HTTP parser.
A file to send to the server may be quite large and I doubt if it is reasonable to send this file as a single TCP packet, so I'm dreaming of splitting this file into UDP packets and sending them one by one. But on the client side, all this must be done with a simple curl command, like curl --data-binary @filename 127.0.0.1:80.
Is it possible to split this request into several packets using WinSock API? For example, the server reads the name of a file, detects its size, allocates as many packets as needed and starts receiving UDP packets. Or maybe should I look at other ways of solving it?

How to download image via gRPC Gateway?

I want to serve files (images) via gRPC Gateway from a gRPC server. Since protocol buffers messages have structure, I don't see how I could make the gateway send the content of the bytes field of the response message instead of the entire JSON-encoded message. Is there a native solution for this or does one simply have to write a dedicated HTTP muxer to handle these requests?
Rather than sending arbitrarily-sized files (probably as bytes), it would probably be better to include a URL to the file in the message and then host/serve the file over HTTP (e.g. from S3, Google Cloud Storage etc. from which you could generate signed URLs to limit access).
I think the max message size is 2GB (source?) and the recommendation (Large Data Sets) is to consider alternative techniques once message sizes exceed a few MB.

gRPC bi-directional stream processing with multi-threaded clients

A gRPC newbie question here.
We have a source system that exposes a bidirectional gRPC stream. In order to scale our application, we want to process the stream data in parallel. Is it possible to have multiple concurrent gRPC clients consuming from the stream without any conflicts in data processing, during the acknowledgement process, etc.?
Thanks
Is this in the context of a single streaming call? In that case the answer is no. You have a single gRPC client receiving one response stream and it can use worker threads to hand off messages from the stream.
If you are thinking of multiple gRPC clients in an application talking to the same server (I don't see any advantage of doing that) each one will make a separate call and will receive a separate response stream.
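The hand-off to worker threads could look roughly like the sketch below. It uses only the JDK and no gRPC classes: `stream` stands in for the messages arriving on the single response stream, and `StreamFanOut`, `process` and `handler` are hypothetical names. The point is that exactly one thread reads the stream while a fixed pool does the processing.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class StreamFanOut {
    /**
     * Reads messages from one stream on the calling thread and processes
     * them on `workers` pool threads. Results come back in completion
     * order, not in stream order.
     */
    static <T, R> List<R> process(Iterable<T> stream, int workers, Function<T, R> handler)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        ConcurrentLinkedQueue<R> results = new ConcurrentLinkedQueue<>();
        for (T message : stream) {
            // The single reader only dispatches; the work runs on the pool.
            pool.execute(() -> results.add(handler.apply(message)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return List.copyOf(results);
    }
}
```

For example, `StreamFanOut.process(List.of(1, 2, 3), 2, x -> x * x)` returns the three squares in some order. If the workers have to send acknowledgements back on the same call, those writes should be funnelled through one thread or a lock, since a grpc-java StreamObserver is not thread-safe.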

protobuf vs gRPC

I try to understand protobuf and gRPC and how I can use both. Could you help me understand the following:
1. Considering the OSI model, what is where? For example, is Protobuf at layer 4?
2. Thinking through a message transfer, how is the "flow"? What is gRPC doing that protobuf misses?
3. If the sender uses protobuf, can the server use gRPC, or does gRPC add something which only a gRPC client can deliver?
4. If gRPC can make synchronous and asynchronous communication possible, Protobuf is just for the marshalling and therefore does not have anything to do with state - true or false?
5. Can I use gRPC in a frontend application instead of REST or GraphQL?
I already know - or assume I do - that:
Protobuf:
- Binary protocol for data interchange
- Designed by Google
- Uses a generated struct-like description on the client and server to marshal and unmarshal messages
gRPC:
- Uses protobuf (v3)
- Again from Google
- Framework for RPC calls
- Makes use of HTTP/2 as well
- Synchronous and asynchronous communication possible
I again assume it's an easy question for someone already using the technology. I would still thank you for being patient with me and helping me out. I would also be really thankful for any network deep dive into the technologies.
Protocol buffers is (are?) an Interface Definition Language and serialization library:
You define your data structures in its IDL i.e. describe the data objects you want to use
It provides routines to translate your data objects to and from binary, e.g. for writing/reading data from disk
gRPC uses the same IDL but adds syntax "rpc" which lets you define Remote Procedure Call method signatures using the Protobuf data structures as data types:
You define your data structures
You add your rpc method definitions
It provides code to serve up and call the method signatures over a network
You can still serialize the data objects manually with Protobuf if you need to
In answer to the questions:
1. gRPC works at layers 5, 6 and 7. Protobuf works at layer 6.
2. When you say "message transfer", Protobuf is not concerned with the transfer itself. It only works at either end of any data transfer, turning objects into bytes and bytes back into objects.
3. Using gRPC by default means you are using Protobuf. You could write your own client that uses Protobuf but not gRPC to interoperate with gRPC, or plug other serializers into gRPC - but using gRPC would be easier.
4. True
5. Yes you can
Actually, gRPC and Protobuf are 2 completely different things. Let me simplify:
gRPC manages the way a client and a server can interact (just like a web client/server with a REST API)
protobuf is just a serialization/deserialization tool (just like JSON)
gRPC has 2 sides: a server side, and a client side, that is able to dial a server. The server exposes RPCs (ie. functions that you can call remotely). And you have plenty of options there: you can secure the communication (using TLS), add authentication layer (using interceptors), ...
You can use protobuf inside any program, that has no need to be client/server. If you need to exchange data, and want them to be strongly typed, protobuf is a nice option (fast & reliable).
That being said, you can combine both to build a nice client/server system: gRPC will be your client/server code, and protobuf your data protocol.
PS: I wrote this paper to show how one can build a client/server with gRPC and protobuf using Go, step by step.
gRPC is a framework built by Google and used in Google's own production projects. Hyperledger Fabric is built with gRPC, and so are many other open-source applications.
Protobuf is a data representation, like JSON. It is also by Google, and they have thousands of .proto files in their production projects.
grpc
gRPC is an open-source framework developed by google
It lets us define the request and response for an RPC, and the framework handles the rest
REST is CRUD oriented but gRPC is API oriented (no constraints)
Built on top of HTTP/2
Provides auth, load balancing, monitoring, logging
[HTTP/2]
HTTP/1.1 was released in 1997, a long time ago
HTTP/1.0 opens a new TCP connection for each request; HTTP/1.1 can keep a connection open but serves one request at a time on it
It doesn't compress headers
No server push, it just works with request and response
HTTP/2 was released in 2015 (it grew out of SPDY)
Supports multiplexing
client & server can push messages in parallel over the same TCP connection
Greatly reduces latency
HTTP2 supports header compression
HTTP2 is binary
proto buff is binary so it is a great match for HTTP2
[TYPES]
Unary
client streaming
server streaming
Bi directional streaming
grpc servers are Async by default
grpc clients can be sync or Async
protobuff
Protocol buffers are language agnostic
Parsing protocol buffers(binary format) is less CPU intensive
[Naming]
Use CamelCase (with an initial capital) for message names
underscore_separated names for fields
Use CamelCase for enum names and CAPITALS_WITH_UNDERSCORES for value names
[Comments]
Support //
Support /* */
[Advantages]
Data is fully Typed
Data is compactly encoded in binary (less bandwidth usage)
Schema (message) is needed to generate code and to read the data
Documentation can be embedded in the schema
Data can be read across any language
Schema can evolve any time in a safe manner
faster than XML
code is generated for you automatically
Google invented protobuf; they use 48000 protobuf messages & 12000 .proto files
Lots of RPC frameworks, including grpc use protocol buffers to exchange data
gRPC is an instantiation of RPC integration style that is based on protobuf serialization library.
There are five integration styles: RPC, File Transfer, MOM, Distributed Objects, and Shared Database.
RMI is another example of instantiation of RPC integration style. There are many others. MQ is an instantiation of MOM integration style. RabbitMQ as well. Oracle database schema is an instantiation of Shared Database integration style. CORBA is an instantiation of Distributed Objects integration style. And so on.
Avro is an example of another (binary) serialization library.
gRPC (google Remote Procedure Call) is a client-server structure.
Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.
service Greeter {
  rpc SayHello (HelloRequest) returns (HelloResponse) {}
}
message HelloRequest {
  string myname = 1;
}
message HelloResponse {
  string responseMsg = 1;
}
Protocol buffer is used to exchange data between gRPC client and gRPC Server. It is a protocol between gRPC client and gRPC Server. Protocol buffer is implemented as a .proto file in gRPC project. It defines interface, e.g. service, which is provided by server-side and message format between client and server, and rpc methods, which are used by the client to access the server.
Both the client side and the server side have the same proto files. (One real example: envoy xds grpc client side proto files, server side proto files.) It means that both the client and server know the interface, message format, and the way that the client accesses services on the server side.
The proto files (i.e. the protocol buffer definitions) will be compiled into code in the target language.
The generated code contains both stub code for clients to use and an abstract interface for servers to implement, both with the method defined in the service.
service defined in the proto file (e.g. protocol buffer) will be translated as abstract class xxxxImplBase (e.g. interface on the server side).
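In the generated Java code, newBlockingStub() creates a blocking (synchronous) stub and newStub() creates an asynchronous one; calling a method on the stub is how the client makes the remote procedure call (i.e. the rpc in the proto file).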
And the methods which build request and response messages are also implemented in the generated files.
I re-implemented simple client and server-side samples based on samples in the official doc. cpp client, cpp server, java client, java server, springboot client, springboot server
Recommended Useful Docs:
cpp/helloworld/README.md#generating-grpc-code,
cpp/basics/#generating-client-and-server-code,
cpp/basics/#defining-the-service,
generated-code/#client-stubs,
a blocking/synchronous stub
StreamObserver
how-to-use-grpc-with-spring-boot
Others: core-concepts,
gRPC can use protocol buffers as both its Interface Definition Language (IDL) and as its underlying message interchange format
In its simplest form, gRPC is like a public vehicle. It will exchange data between client and server.
The protocol buffer is the protocol, like your bus ticket, that decides where you should or shouldn't go.

Can FTP have multiple TCP connection for multiple parallel file transfer

While reading the FTP protocol description at http://www.pcvr.nl/tcpip/ftp_file.htm, I came across this: "FTP differs from the other applications that we've described because it uses two TCP connections to transfer a file". My question is, can FTP have multiple TCP connections for multiple parallel file transfers? For example, can I transfer two files in parallel over two TCP connections? Is this a matter of customization or standardization?
While it would be theoretically possible to make an FTP server support multiple, concurrent transfers, it's not supported by the RFC or any known implementation.
The block is a simple one in that the control connection, after receiving a transfer request, does not return a final status or accept new commands until the data transfer is completed. Thus, though you could queue up another transfer request it wouldn't actually be processed by the server until the current one completes.
If you want multiple file transfers, just log into the FTP server multiple times using different programs or command-line windows and have each initiate a transfer.
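If you would rather do that from one program than from several windows, a rough sketch is below. It assumes you rely on the JDK's built-in URL handlers, where each `openStream()` call on an `ftp://` URL sets up its own session, so every transfer gets its own control and data connection. `ParallelFetch` and `fetchAll` are hypothetical names, and the example holds each file fully in memory, which only suits small files.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelFetch {
    /**
     * Downloads every URL on its own thread. Each thread opens its own
     * connection, so the transfers are independent sessions that run
     * side by side. Results are returned in the order of `urls`.
     */
    static List<byte[]> fetchAll(List<URL> urls) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, urls.size()));
        try {
            List<Future<byte[]>> pending = new ArrayList<>();
            for (URL url : urls) {
                pending.add(pool.submit(() -> {
                    try (InputStream in = url.openStream()) {
                        return in.readAllBytes();
                    }
                }));
            }
            List<byte[]> results = new ArrayList<>();
            for (Future<byte[]> f : pending) {
                results.add(f.get()); // waits for that transfer to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Usage would be something like `ParallelFetch.fetchAll(List.of(new URL("ftp://user:secret@ftp.example.com/a.bin"), new URL("ftp://user:secret@ftp.example.com/b.bin")))`, where the host, credentials and paths are placeholders.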
No it can't. FTP uses a control connection for sending commands and a data connection that exists for the duration of the file transfer or directory listing retrieval, that's it.
For more information you can consult RFC 959, which defines the specs of the FTP protocol.
