Imagine I want to send the Iris dataset, which I have as a Hive table, to different reducers in order to run the same task in parallel in R. I can execute my R script through the transform function, and use lateral view explode in Hive to take the Cartesian product of the Iris dataset and an array containing my "partition" variable, as in the query below:
set source_table = iris;
set x_column_names = "sepallenght|sepalwidth|petallength|petalwidth";
set y_column_name = "species";
set output_dir = "/r_output";
set model_name = "paralelism_test";
set param_var = params;
set param_array = array(1,2,3);
set mapreduce.job.reduces=3;
select transform(id, sepallenght, sepalwidth, petallength, petalwidth, species, ${hiveconf:param_var})
using 'controlScript script.R ${hiveconf:x_column_names} ${hiveconf:y_column_name} ${hiveconf:output_dir} ${hiveconf:model_name} ${hiveconf:param_var}'
as (script_result string)
from
(select *
from ${hiveconf:source_table}
lateral view explode ( ${hiveconf:param_array} ) temp_table
as ${hiveconf:param_var}
distribute by ${hiveconf:param_var}
) data_query;
controlScript is a memory control script I call, so please ignore it for the sake of brevity.
What my script.R returns is the unique parameter it received (the "params" column populated with the param_array values) and the number of rows in the partition it got, as follows:
#The aim of this script is to validate the parallel computation of R scripts through Hive.
compute_model <- function(data){
  #data[[ncol(data)]] is the partition variable column appended by the lateral view
  paste("parameter", unique(data[[ncol(data)]]), ",", nrow(data), "lines")
}

main <- function(args){
  #Reading the input parameters.
  #These inputs were passed along the transform's "using" clause, on Hive.
  x_column_names <- as.character(unlist(strsplit(gsub(' ', '', args[1]), '\\|')))
  y_column_name  <- as.character(args[2])
  target_dir     <- as.character(args[3])
  model_name     <- as.character(args[4])
  param_var_name <- as.character(args[5])

  #Reading the data table that Hive streams in on stdin
  f <- file("stdin")
  open(f)
  data <- tryCatch({
      as.data.frame(
        read.table(f, header = FALSE, sep = '\t', stringsAsFactors = TRUE, dec = '.')
      )
    },
    #cat() cannot print a condition object, so report its message instead
    warning = function(w) message(conditionMessage(w)),
    error = function(e) stop(e),
    finally = close(f)
  )

  #Computes the model. Here, the model can be any computation.
  instance_result <- as.character(compute_model(data))

  #Writes the result to stdout separated by '\t'. This output must be a data frame
  #where each column represents a Hive table column.
  write.table(instance_result,
              quote = FALSE,
              row.names = FALSE,
              col.names = FALSE,
              sep = "\t",
              dec = '.')
}

#Main code
###############################################################
main(commandArgs(trailingOnly = TRUE))
What I want Hive to do is split the Iris dataset equally among these reducers. It works fine when I put sequential values in my param_array variable, but for values like array(10, 100, 1000, 10000) with mapreduce.job.reduces=4, or array(-5,-4,-3,-2,-1,0,1,2,3,4,5) with mapreduce.job.reduces=11, some reducers won't receive any data, and others will receive more than one key.
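I suspect this happens because distribute by assigns each row to reducer hash(key) mod numReducers, and for integer keys the hash is just the value itself (that part is an assumption on my side). A quick sanity check in R reproduces exactly the skew I am seeing:
params   <- c(10, 100, 1000, 10000)
reducers <- 4
table(params %% reducers)
# 100, 1000 and 10000 all land on reducer 0, while 10 lands on reducer 2,
# leaving reducers 1 and 3 with no data at all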
The question is: is there a way to make sure Hive distributes each partition to a different reducer?
Did I make myself clear?
It may look silly to do this, but I want to run a grid search on Hadoop, and I have restrictions on using other technologies that would be more suitable for the task.
Thank you!
I have a dataset with ~10,000 species. For each species in the dataset I want to query the IUCN database for the threats facing it. I can do this one species at a time using the rl_threats function from the rredlist package. Below is an example of the function; it pulls the threats facing Fratercula arctica and assigns them to the object test1 (key is a string that serves as a password for the IUCN API and stays constant; parse should be TRUE but is not as important).
test1<-rl_threats(name="Fratercula arctica",
key = '1234',
parse = TRUE)
I want to get threats for all 10,000 species in my dataset. My idea is to use a loop that passes the names from my dataset into the name argument of rl_threats. This is a basic loop I tried to construct to do this, but I'm getting lots of errors:
for (i in 1:df$scientific_name) {
rl_threats(name=i,
key = '1234',
parse = TRUE)
}
How would I pass the species names from the scientific_name column into the rl_threats function such that R would loop through and pull threats for every species?
Thank you.
You can create a list to store the output.
result <- vector('list', length(df$scientific_name))
# name the elements so they can be filled in by species name
names(result) <- df$scientific_name

for (i in df$scientific_name) {
  result[[i]] <- rl_threats(name = i, key = '1234', parse = TRUE)
}
You can also use lapply:
result <- lapply(df$scientific_name, function(x) rl_threats(name=x, key = '1234', parse = TRUE))
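Since you mention getting lots of errors, a defensive variation is to wrap each call in tryCatch so one failed lookup doesn't abort the whole run. This is only a sketch: the Sys.sleep pause is an assumption to stay under whatever rate limit the IUCN API enforces, and the key is the placeholder from your example.
library(rredlist)

result <- vector('list', length(df$scientific_name))
names(result) <- df$scientific_name
for (sp in df$scientific_name) {
  result[[sp]] <- tryCatch(
    rl_threats(name = sp, key = '1234', parse = TRUE),
    error = function(e) {
      message("Lookup failed for ", sp, ": ", conditionMessage(e))
      NULL  # keep a placeholder so the loop continues
    }
  )
  Sys.sleep(0.5)  # assumed pause; check the API's documented limits
}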
I am trying to automate the calculation of some animal energy requirements, where I have inputs such as days on feed and daily feed intake. My code first reads the initial data from a CSV, uses it to calculate some starting values outside the loop, runs a loop of that day's energy calculations for the time on feed, stores those results in a data frame, and then writes the final data frame to a CSV.
I have data on this individual-record basis from more than 300 sheep and want to automate reading in the files and writing the results to separate CSV files in a specific folder. I know this means a loop within a loop, but I am trying to figure out exactly how to go about it.
I know I need to read in the files using list.files, like this:
files = list.files("C:/Users/Me/Desktop/Sheepfiles/", pattern = "Sheep+.*csv")
but I want each file as its own data frame run through the model and I need to keep everything separate going in and out.
setwd("C:Users/....../Sheepfiles")
input = read.csv(file = "Sheep131.csv", header = TRUE, sep =",")
#set up initialized values outside loop here
LWt0 = input$LWT[1]
EBW = LWt0 * .96 * .891
#constants go here
Results = NULL
timefeed = input$DOF
#now the loop
for (i in timefeed)
{
#differential equations and calculations here
results1 = (c(t, NEG, MEI, OldMEI, HPmaint, EBW, ID, TRT))
names(results1) = c("DOF", "NEG", "MEI", "OldMEI","HPmaint", "EBW", "ID", "TRT")
print((results1))
Results = rbind(Results,results1)
#update variables to new values here
}
write.csv(Results, file = "Results131.csv")
What I want is to have files with SheepX in the name, one per sheep, where X is the eartag number, have those read in and run through the model, and then have the results automatically written out as ResultsX.csv. If it helps, the eartag number is in the original input file under the column "ID". So for Sheep1 through Sheep150 I'd have Results1 through Results150, and so on.
Later on, I'll need to be able to read those result files back in, extract outputs at specific days, and then pull those into a data frame for comparison with observations, but that's the next step after I get all these files run through the model.
You need to loop through your filenames and execute your existing code for each file, so a solution could look like this:
setwd("C:Users/....../Sheepfiles")
files = list.files("C:/Users/Me/Desktop/Sheepfiles/", pattern = "Sheep+.*csv")
for (i in files) {
input = read.csv(file = i,
header = TRUE,
sep = ",")
#set up initialized values outside loop here
LWt0 = input$LWT[1]
EBW = LWT0 * .96 * .891
#constants go here
Results = NULL
timefeed = input$DOF
#now the loop
for (i in timefeed)
{
#differential equations and calculations here
results1 = (c(t, NEG, MEI, OldMEI, HPmaint, EBW, ID, TRT))
names(results1) = c("DOF", "NEG", "MEI", "OldMEI", "HPmaint", "EBW", "ID", "TRT")
print((results1))
Results = rbind(Results, results1)
#update variables to new values here
}
# automatically generate filename for results
result.filename <- gsub("Sheep", "Results", i)
write.csv(Results, file = result.filename)
}
So you basically wrap a for loop around your code, with the filenames as the loop variable.
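One caveat: list.files returns bare file names, so read.csv only finds them if the working directory is the Sheepfiles folder. Here is a sketch of a variation that avoids setwd by keeping the folder path explicit (the folder path is assumed from your list.files call, and the model code is elided):
dir <- "C:/Users/Me/Desktop/Sheepfiles/"              # assumed input folder
files <- list.files(dir, pattern = "Sheep.*\\.csv$")  # bare names, no path
for (f in files) {
  input <- read.csv(file.path(dir, f), header = TRUE)
  Results <- input  # placeholder for the model loop shown above
  write.csv(Results, file.path(dir, gsub("Sheep", "Results", f)))
}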
I am trying to create a loop where I select one file name from a list of file names and use that single file to run read.capthist, then discretize, secr.fit, derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (a 'captfile' in the secr package's terms), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and have used Stack Overflow a lot to solve my issues, but I haven't been able to figure this one out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous output.
I know I can run this code by bringing in each file separately (it works for non-simulation runs of a couple of data sets), but since I'm hoping to run 100 simulations, that would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
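The error most likely comes from capt <- lst[i], which returns a one-element list rather than a file name; read.capthist wants the name of a single file. Indexing the files vector directly, and building iteration-specific output names, would address both that and the overwriting concern. A sketch, assuming simtraps and the working directory are set up as in the question:
library(secr)
files <- list.files(pattern = "female*")
for (i in seq_along(files)) {
  # files[i] is a single file name, which read.capthist accepts as its captfile
  femsimCH <- read.capthist(files[i], simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX',
                  method = 'BFGS', trace = FALSE, CL = TRUE)
  D.fit <- derived(fit)
  # iteration-specific file names so earlier results are not overwritten
  save(fit, file = sprintf("C:/temp/fit_%03d.Rdata", i))
  save(D.fit, file = sprintf("C:/temp/D.fit_%03d.Rdata", i))
}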
I would like to use the RODBC package to partially overwrite a Microsoft Access table with a data frame. Rather than overwriting the entire table, I am looking for a way to remove only specific rows from that table and then append my data frame to its end.
My method for appending the frame is pretty straightforward. I would use the following function:
sqlSave(ch, df, tablename = "accessTable", rownames = F, append = T)
The challenge is finding a function that will let me clear specific rows from the Access table ahead of time. The sqlDrop and sqlClear functions do not get me there, since they drop or clear the table as a whole.
Any recommendation to achieve this task would be much appreciated!
Indeed, consider using sqlQuery to subset your Access table to the rows you want to keep, then rbind the result with your current data frame, and finally sqlSave, purposely overwriting the original Access table with append = FALSE.
# IMPORT QUERY RESULTS INTO DATAFRAME
keeprows <- sqlQuery(ch, "SELECT * FROM [accesstable] WHERE timedata >= somevalue")
# CONCATENATE df to END
finaldata <- rbind(keeprows, df)
# OVERWRITE ORIGINAL ACCESS TABLE
sqlSave(ch, finaldata, tablename = "accessTable", rownames = FALSE, append = FALSE)
Of course you can also do the converse: delete the unwanted rows from the table with an action query, then append (not overwrite) with sqlSave:
# ACTION QUERY TO RUN IN DATABASE
sqlQuery(ch, "DELETE FROM [accesstable] WHERE timedata <= somevalue")
# APPEND TO ACCESS TABLE
sqlSave(ch, df, tablename = "accessTable", rownames = FALSE, append = TRUE)
The key is finding the SQL logic that specifies the rows you intend to keep.
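For example, if timedata is a date column, the cutoff can be built in R and spliced into the action query. This is only a sketch: the connection call, database path, and cutoff value are all hypothetical, and Access expects date literals wrapped in # characters:
library(RODBC)
ch <- odbcConnectAccess2007("C:/path/to/db.accdb")  # hypothetical database
cutoff <- "2020-01-01"                              # hypothetical cutoff value
sqlQuery(ch, sprintf("DELETE FROM [accesstable] WHERE timedata <= #%s#", cutoff))
sqlSave(ch, df, tablename = "accessTable", rownames = FALSE, append = TRUE)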
I want to construct a data frame by reading in a csv file for each day in the month. My daily csv files contain columns of characters, doubles, and integers of the same number of rows. I know the maximum number of rows for any given month and the number of columns remains the same for each csv file. I loop through each day of a month with fileListing, which contains the list of csv file names (say, for January):
output <- matrix(ncol=18, nrow=2976)
for ( i in 1 : length( fileListing ) ){
df = read.csv( fileListing[ i ], header = FALSE, sep = ',', stringsAsFactors = FALSE, row.names = NULL )
# each df is a data frame with 96 rows and 18 columns
# now insert the data from the ith date for all its rows, appending as you go
for ( j in 1 : 18 ){
output[ , j ] = df[[ j ]]
}
}
Sorry for having revised my question as I figured out part of it (duh), but should I use rbind to progressively insert data at the bottom of the data frame, or is that slow?
Thank you.
BSL
You can read them into a list with lapply, then combine them all at once:
data <- lapply(fileListing, read.csv, header = FALSE, stringsAsFactors = FALSE, row.names = NULL)
df <- do.call(rbind.data.frame, data)
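If the month has many files or they are large, data.table::rbindlist is typically faster than do.call with rbind (a sketch; rbindlist is a drop-in replacement here, but note that it returns a data.table):
library(data.table)
data <- lapply(fileListing, read.csv, header = FALSE, stringsAsFactors = FALSE)
df <- as.data.frame(rbindlist(data))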
First define a master data frame to hold all of the data. Then, as each file is read, append its data onto the master:
masterdf<-data.frame()
for ( i in 1 : length( fileListing ) ){
df = read.csv( fileListing[ i ], header = FALSE, sep = ',', stringsAsFactors = FALSE, row.names = NULL )
# each df is a data frame with 96 rows and 18 columns
masterdf<-rbind(masterdf, df)
}
At the end of the loop, masterdf will contain all of the data. This code can be improved, but for the size of the dataset it should be quick enough.
If the data is fairly small relative to your available memory, just read it all in and don't worry about it. After you have read in all the data and done some cleaning, save the result using save() and have your analysis scripts read that file back in using load(). Separating reading/cleaning scripts from analysis scripts is a good way to reduce this problem.
Two read.csv arguments that speed up reading are nrows and colClasses. Since you say that you know the number of rows in each file, telling R this will help speed up the reading. You can extract the column classes using
colClasses <- sapply(read.csv(file, nrows = 100), class)
then pass the result to the colClasses argument.
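Putting both together (96 is the per-file row count stated in the question):
df <- read.csv(file, colClasses = colClasses, nrows = 96)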
If the data is getting close to being too large, you may consider processing individual files and saving intermediate versions. There are a number of related discussions about managing memory on the site that cover this topic.
On memory usage tricks:
Tricks to manage the available memory in an R session
On using the garbage collector function:
Forcing garbage collection to run in R with the gc() command