Hadoop: setting MapReduce resource permissions - unix

Suppose we have a Hadoop MapReduce job to run. This job needs to access some resources on the local drive, i.e. on some node (in fact, we have to place those resources on all nodes).
The question is: which permissions should be given to that resource file?
I would like to make it readable by the user that runs Hadoop, but in fact the task will be executed under another user: 'yarn'. That is, if I want to place the resources in the home folder of the user that runs the Hadoop job (or the related Oozie job, etc.), I cannot, because the home folder of the user that actually owns the MapReduce tasks is /home/yarn/.
What is the best way to deal with this issue?
How do I control under which user MapReduce runs?
Where can I look up those settings?

I guess all you need is to create the required folders for such resources in HDFS and set the permissions on those folders and the contained files using the 'hadoop fs -chmod ..' command.
Please refer to the link below:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html
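As a minimal sketch (the HDFS path, file name, and the 755 mode here are assumptions, not from the question), staging a resource into HDFS and making it world-readable could be driven from Python via the hadoop CLI like this:

```python
import subprocess

def hdfs(*args):
    # Run a 'hadoop fs' subcommand and fail loudly if it reports an error.
    subprocess.run(["hadoop", "fs", *args], check=True)

# Hypothetical HDFS location for shared job resources.
resource_dir = "/apps/myjob/resources"

hdfs("-mkdir", "-p", resource_dir)              # create the folder
hdfs("-put", "-f", "lookup.txt", resource_dir)  # upload the local resource file
hdfs("-chmod", "-R", "755", resource_dir)       # readable by everyone, including the 'yarn' user
```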

First off, what the statement "MapReduce needs to access some system resources on local drive" describes is not possible when running a MapReduce program in distributed mode. Whatever file you need should be moved to HDFS. Give the file read permission for all users and everything should be fine. If you need to read the file in the Mapper or Reducer, rather than passing the file as the input to the MapReduce program, consider using the Distributed Cache mechanism provided by MapReduce.
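The Java API for this is Job.addCacheFile(); purely as a hedged illustration of the same idea with Hadoop Streaming (where the -files generic option symlinks an HDFS file into each task's working directory), a minimal Python mapper might look like this. The file name and record layout are assumptions:

```python
#!/usr/bin/env python3
# Hedged sketch: a Hadoop Streaming mapper that reads a side file shipped
# via the distributed cache. The job would be submitted with something like
#   -files hdfs:///apps/myjob/resources/lookup.txt#lookup.txt
# so that "lookup.txt" appears as a symlink in each task's working directory.
import sys

# Load the cached side file once per mapper process.
lookup = {}
with open("lookup.txt") as side_file:
    for line in side_file:
        if "\t" not in line:
            continue
        key, value = line.rstrip("\n").split("\t", 1)
        lookup[key] = value

# Standard streaming contract: records on stdin, tab-separated key/value on stdout.
for record in sys.stdin:
    key = record.split("\t", 1)[0].strip()
    print(f"{key}\t{lookup.get(key, 'UNKNOWN')}")
```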

Related

Generating ZIP files in azure blob storage

What is the best method to zip large files present in Azure blob storage and let the user download them as an archive file (zip/rar)?
Could using Azure Batch help?
Currently we implement this in a traditional way: we read the stream, generate the zip file, and return the result, but this takes a lot of server resources and time for users.
I'm asking about the best technical solution and technologies (preferably using Microsoft technologies).
There are a few ways you can do this **from an Azure Batch-only point of view**. (For the initial part, your code owns whatever zip API it uses to zip the files; but once the archive is in blob storage and you want to use it on the nodes, the options mentioned below apply.)
For the initial part of your question I found this, which could come in handy: https://microsoft.github.io/AzureTipsAndTricks/blog/tip141.html (this is mainly for the idea; you know your requirements better and will need to design your solution space accordingly).
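Purely as an illustration of that "initial part" (reading blobs and building the archive off the request path), here is a hedged Python sketch with the azure-storage-blob v12 SDK; the connection string, container, prefix, and blob names are assumptions:

```python
import io
import zipfile
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

# Hypothetical connection string and container name.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("user-files")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    for blob in container.list_blobs(name_starts_with="exports/"):
        data = container.get_blob_client(blob.name).download_blob().readall()
        archive.writestr(blob.name, data)

buffer.seek(0)
# Upload the finished archive back to blob storage so clients download it
# from there instead of streaming it through the web server.
container.get_blob_client("archives/export.zip").upload_blob(buffer, overwrite=True)
```

For genuinely large blobs you would stream chunks rather than call readall(), or push the whole job onto a Batch node as the options below describe.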
In options 1 and 3 below, you need to make sure your code handles unpacking the zip file. Option 2 is the Batch built-in feature for *.zip files at both pool and task level.
Option 1: You could add your *.rar or *.zip file as an Azure Batch resource file and then unzip it at the start task level, once the resource file has been downloaded (a sketch follows the options below). See: Azure Batch Pool start-up task to download resource file from Blob FileShare
Option 2: The best option, if you have a zip rather than a rar file in play, is the Azure Batch application packages feature, linked here: https://learn.microsoft.com/en-us/azure/batch/batch-application-packages
The application packages feature of Azure Batch provides easy management of task applications and their deployment to the compute nodes in your pool. With application packages, you can upload and manage multiple versions of the applications your tasks run, including their supporting files. You can then automatically deploy one or more of these applications to the compute nodes in your pool.
https://learn.microsoft.com/en-us/azure/batch/batch-application-packages#application-packages
An application package is a .zip file that contains the application binaries and supporting files that are required for your tasks to run the application. Each application package represents a specific version of the application.
With regard to size, refer to the maximum allowed blob size in the document linked above.
Option 3 (not sure this will fit your scenario): A long shot for your specific case, but you could also mount a blob container as a virtual drive when nodes join the pool, via the Azure Batch mount feature, and then write code in the start task (or similar) to unzip from the mounted location.
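A very rough sketch of option 1 with the azure-batch Python SDK (the SAS URL and file names are assumptions, and client/pool creation is omitted); the archive is declared as a resource file and unpacked by the start task:

```python
import azure.batch.models as batchmodels  # pip install azure-batch

# Hypothetical SAS URL for the uploaded archive in blob storage.
archive = batchmodels.ResourceFile(
    http_url="https://<account>.blob.core.windows.net/artifacts/resources.zip?<sas>",
    file_path="resources.zip",
)

# The start task downloads the resource file and unzips it into the shared
# directory so every task on the node can read the contents.
start_task = batchmodels.StartTask(
    command_line='/bin/bash -c "unzip -o resources.zip -d $AZ_BATCH_NODE_SHARED_DIR"',
    resource_files=[archive],
    wait_for_success=True,
)
# start_task would then be attached to the pool definition (e.g. PoolAddParameter)
# so every node unpacks the archive before it accepts tasks.
```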
Hope this helps :)

Deploying a Symfony 2 application in AWS Opsworks

I want to deploy a PHP application from a git repository to the AWS OpsWorks service.
I've set up an app and configured Chef cookbooks so it runs the database schema creation, dumps assets, etc.
But my application has some user-generated files in a subfolder under the web root. The git repository has a .gitignore file in that folder, so an empty folder is there when I run the deploy command.
My problem is: after some files have been generated in that folder (by using the site), if I run the 'deploy' command again, OpsWorks adds a new release under the 'site_name/releases/xxxx' folder and symlinks to it from the 'site_name/current' folder.
So my previously user-generated files become inaccessible. What is the best solution for this kind of situation?
Thanks in advance for your kind answers.
You have a few different options. Listed below in order of personal preference:
Use Simple Storage Service (S3) to store the files.
Add an Elastic Block Store (EBS) volume to your server and save files to the volume.
Save files to a database (This is something I would not do myself but the option is there.).
When using OpsWorks think of replicable/disposable servers.
What I mean by this is that if you can create one server (call it server A) and then switch to a different one in the same stack (call it server B), the result of using server A or server B should not impact how your application works.
While it may seem like a good idea to save your user-generated files in a directory that is shared between different versions of your app (every time you deploy, a new release directory is generated), when you destroy your server you run the risk of destroying your files.
Benefits and downsides of using S3?
Benefits:
S3 gives your files high redundancy and availability.
S3 is external to your application server, so if your server dies or you decide to move it to a different region, you can continue using the same S3 bucket.
Easy to scale: you could add multiple application servers that all read and write files to S3.
Downsides:
You need extra code in your application. You will have to use the AWS API in order to store and retrieve the files. Using the S3 API is not hard, but it may require an extra step to get where you need to be. Take a look at the "Using an Amazon S3 Bucket" walkthrough for reference; it shows the code they use to upload files to the S3 bucket in the example.
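As a toy illustration of that extra code (the app in the question is PHP/Symfony, so treat this Python/boto3 sketch, with its hypothetical bucket and key names, purely as the shape of the API calls):

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")
bucket = "my-app-user-uploads"  # hypothetical bucket name

# Store a user-generated file outside the release directory.
s3.upload_file("/tmp/invoice-123.pdf", bucket, "uploads/invoice-123.pdf")

# Later, hand the user a short-lived download link instead of serving
# the file from the application server's disk.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "uploads/invoice-123.pdf"},
    ExpiresIn=3600,
)
print(url)
```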
Benefits and downsides of using EBS?
Benefits:
EBS is an "external hard drive" that you can easily mount to your machine using the OpsWorks Resource Manager.
EBS volumes can be backed-up and restored.
It may be the fastest option to implement and integrate to your application.
Downsides:
You need to assign it to an instance before the instance is running.
It could be time-consuming to move from server A to server B (downtime may be required).
You cannot scale your application horizontally. While you can create copies of the EBS volume and assign them to different instances, the volumes will not be shared.
Downsides of using a database?
Just do a Google search on "storing files in a database".
Take a look at Storing Images in DB - Yea or Nay?
My preferred choice would be to use S3, but ultimately this is your decision.
Good luck!
EDIT:
Take a look at the opsworks-chef-cookbooks repository; it contains some recipes for deploying a Symfony2 application on OpsWorks. I have been using it for over a year and it works quite well.
Use Chef templates, and apply them in a recipe hooked into the OpsWorks deploy lifecycle event.

CFileFind::FindFile and network paths

I have a DLL that opens a file for processing. It attempts to find the file with the FindFile() function. I also have a service that calls the DLL, and here is the problem: when the path to the file is a network path, FindFile() fails to find it, but only when called from the service; if I call it directly from my application, it finds the file. I'm sure the FindFile() function gets the same parameter in both cases, as I write it to a log file. The parameter looks like this:
"\\SERVER\SERVER_USERS\USERX\TEST.TXT"
I know this is 6 months after the question, but I figured I'd answer it anyway ... Usually, it is a permissions thing. If the service does not have access to the network folder, then it won't find anything. Many services run as a local system account by default, and that account doesn't have built-in access to network files. So try making sure the service is running as an account that has access to the network folder in question.

Asp.net writing server side file

I needed to write a text file on the server as the output of a business process initiated by an ASP.NET app.
The file-writing code is in a library file and uses standard stream code.
Everything worked OK in the IDE.
After publishing, it falls over trying to write the file; IIS is reluctant to write to the file system.
Much rummaging around and hair-pulling finally led to a solution. It is not pretty and is only applicable when you have control over the web server.
Just saw your answer.
It doesn't need to be inside your inetpub or wwwroot directory, for that matter; it could be anywhere, as long as security permissions are set correctly for the user the application is running as.
But this is actually desirable. If it weren't, just imagine the consequences of allowing write access anywhere.
Also, there's no need for the virtual directory. You could create a directory like C:\ProcessOutput, grant permissions accordingly, and it should work just fine.
Another option would be to create a service account and impersonate that user within your application only when you need to write the output file.
Solution was:
Create a physical directory on the webserver with the physical path of:
c:\inetpub\wwwroot\mywebapp\myOutputFileDirectory
Make a virtual directory that points at the directory.
Using Windows Explorer, grant write permission on the physical directory to IIS_IUSRS.
Use the physical path c:\inetpub\wwwroot\mywebapp\myOutputFileDirectory in your StreamWriter code.
Maybe the virtual path could point somewhere more sensible across the LAN if you get the security sorted, but I am sufficiently battered to accept this small crumb with gratitude.

Strategy for handling user input as files

I'm creating a script to process files provided to us by our users. Everything happens within the same UNIX system (running on Solaris 10)
Right now our design is this
User places file into upload directory
Script placed on cron to run every 10 minutes.
Script looks for files in upload directory, processes them, deletes immediately afterward
For historical/legacy reasons, #1 can't change. Also, deleting the file after processing is a requirement.
My primary concern is concurrency. It is very likely that the situation will arise where the analysis script runs while an input file is still being written to. In this case, data will be lost, and this is (obviously) unacceptable.
Since we have no control over the user's chosen means of placing the input file, we cannot require them to obtain a file lock. As I understand, file locks are advisory only on UNIX. Therefore a user must choose to adhere to them.
I am looking for advice on best practices for handling this problem. Thanks
Obviously all the best solutions involve the client providing some kind of trigger indicating that it has finished uploading. That could be a second file, an atomic move of the file to a processing directory after writing it to a stage directory, or a REST web service. I will assume you have no control over your clients and are unable or unwilling to change anything about them.
In that case, you still have a few options:
You can use a pretty simple heuristic: check the file size, wait 5 seconds, then check the file size again. If it didn't change, it's probably good to go (see the sketch after this list).
If you have super-user privileges, you can use lsof to determine whether anyone has the file open for writing.
If you have access to whatever handles the upload (HTTP, FTP, a setuid script that copies files?), you can of course put triggers in there.
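A hedged sketch combining the first two checks (the upload directory, the five-second settle window, and the process_and_delete() hook are hypothetical; lsof needs enough privileges to see other users' open files):

```python
import os
import subprocess
import time

UPLOAD_DIR = "/var/uploads"  # hypothetical upload directory
SETTLE_SECONDS = 5           # how long the size must stay unchanged

def looks_complete(path):
    """Return True if the file size is stable and nobody still has it open."""
    size_before = os.stat(path).st_size
    time.sleep(SETTLE_SECONDS)
    if os.stat(path).st_size != size_before:
        return False  # still growing
    # lsof exits with status 0 when at least one process has the file open.
    still_open = subprocess.run(
        ["lsof", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0
    return not still_open

for name in os.listdir(UPLOAD_DIR):
    path = os.path.join(UPLOAD_DIR, name)
    if os.path.isfile(path) and looks_complete(path):
        process_and_delete(path)  # hypothetical hook: process the file, then delete it
```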
