Thats a really frequently asked question. Oozie is a workflow manager and scheduler. Most companies already have a workflow schedulers – Activebatch, Autosys, UC4, HP Orchestration. These workflow schedulers run jobs on all their existing databases – Oracle, Netezza, MySQL. Why does Hadoop need its own special workflow scheduler?
As usual, it depends. In general, you can keep using any workflow scheduler that works for you. No need to change, really.
However, Oozie does have some benefits that are worth considering:
Very common request I get from my customers is to parameterize the query executed by a Hive action in their Oozie workflow.
For example, the dates used in the query depend on a result of a previous action. Or maybe they depend on something completely external to the system – the operator just decides to run the workflow on specific dates.
There are many ways to do this, including using EL expressions, capturing output from shell action or java action.
Here’s an example of how to pass the parameters through the command line. This assumes that whoever triggers the workflow (Human or an external system) has the correct value and just needs to pass it to the workflow so it will be used by the query.
Here’s what the query looks like:
In general, most Hadoop ecosystem tools work rather transparently in a kerberized cluster. Most of the time things “just work”. This includes Oozie. Still, when things don’t “just work”, they tend to fail with slightly alarming and highly ambiguous error messages. Here are few tips for using Oozie when your Hadoop cluster is kerberized. Note that this is a client/user guide. I assume you already followed the documentation on how to configure the Oozie server in the kerberized cluster (or you are using Cloudera Manager, which magically configures it for you).
Next week is Strata + Hadoop World which is bound to be exciting for those who deal with big data on a daily basis. I’ll be spending my time talking about Cloudera Impala at various places so I’m posting my schedule for those interesting in catching about fast SQL on Hadoop. Hope to see you there!
Office Hour with Greg Rahn @ the Cloudera Booth
10/29/2013 11:00am – 11:30am EDT (30 minutes)
Room: 3rd Floor, Mercury Ballroom, Booth #403
I spent the last 6 month helping customers design, implement and deploy successful Hadoop systems. Over time, you start seeing patterns. Certain things that if the customer gets right before the project event starts increase the probability that the project will finish successfully and on-time.
Lets assume you already took the most important step – you have an actual use-case or a problem to solve, and you think Hadoop could be the right technology to use in this case. What now?
Last week, while working on customer engagement, I learned a new method of quantifying behavior of time-series data. The method is called “Control Chart” and credit to Josh Wills, our director of data science, for pointing it out. I thought I’ll share it with my readers as its easy to understand, easy to implement, flexible and very useful in many situations.
The problem is ages old – you collect measurements over time and want to know when your measurements indicate abnormal behavior. “Abnormal” is not well defined, and thats on purpose – we want our method to be flexible enough to match what you define as an issue.
For example, lets say Facebook are interested in tracking usage trend for each user, catching those with decreasing use
There are few steps to the Control Chart method:
Only a week after Oracle OpenWorld concluded and I already feel like I’m hopelessly behind on posting news and impressions. Behind or not, I have news to share!
The most prominent feature announced at OpenWorld is the “In-Memory Option” for Oracle Database 12c. This option is essentially a new part of the SGA that caches tables in column formats. This is expected to make data warehouse queries significantly faster and more efficient. I would have described the feature in more details, but Jonathan Lewis gave a better overview in this forum discussion, so just go read his post.
Why am I excited about a feature that has nothing to do with Hadoop?
Oracle OpenWorld was fantastic, as usual. The best show in San Francisco. This is the seventh year in a row that I’m attending – 3 times as HP employee, 3 times as Pythian employee, and now as a Clouderan. My life changes, but the event and people are always fantastic.
There will be a separate blogpost about what I learned at the event, new exciting products and my thoughts of them. But first, let me follow up on what I taught.
Yesterday I went to the Big Data machine engineered systems demo grounds, to get an insight, exclusive demo from Dmitry Lychagin. Dmitry, being part of the XDB team explained with a lot of enthusiasm the new XQuery connector for Hadoop. Although currently not yet released, but available shortly, he demonstrated the potential while using XQuery
Thank you all who attended my sessions at NYOUG Fall Conference this morning. I appreciate spending you most precious commodity - your time - with me. I sincerely hope you found both the presentations enlightening as well as entertaining.
Please see the details of the sessions below along with the download links.
Yet another Oracle version is out and so are about 1500 new features in a variety of areas. Some are well marketed (e.g. pluggable database or the multitenant option) and some shine by their sheer usefulness. And there are some that do not get a whole lot of coverage but are are hidden gems. In this session you learned 12 broad areas of Oracle Database 12c I feel are worth learning about to make your job as a DBA or developer better, easier, smoother and, in some cases, even make it possible what was hitherto impossible or impractical.