Edward Capriolo

Friday Mar 03, 2017

Support Hive

  

My important email to hive-dev, discussing actions by those in the Apache Spark community.

All,

I have compiled a short (non-exhaustive) list of items related to Spark's
forking of Apache Hive code and usage of Apache Hive trademarks.

1)
----------------------------
The original Spark proposal repeatedly claims that Spark "inter-operates"
with Hive.

https://wiki.apache.org/incubator/SparkProposal

"Finally, Shark (a higher layer framework built on Spark) inter-operates
with Apache Hive."

(EC note: Originally Spark may have linked to Hive, but now the situation
is much different.)
-------------------------

2)
------------------
Spark distributes jar files to maven repositories carrying the hive name.

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec

(EC note: These are not simple "ports"; features are added, missing, or
broken in artifacts named "hive".)
-----------------------

3)
---------------------------------
Spark carries forked and modified copies of Hive source code.

https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionHookContextImpl.java
--------------------------------------------

4)
-------------------------------
Spark has "imported" and modified components of Hive.


https://issues.apache.org/jira/browse/SPARK-12572

(EC note: Further discussions of the code make little to no reference to its
origins in propaganda.)
---------------------------------------------

5)
--------------------------------
Databricks, a company heavily involved in Spark development, uses the Hive
trademark to make claims.

https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"The Databricks platform provides a fully managed Hive Metastore that
allows users to share a data catalog across multiple Spark clusters."


This wiki page defining Hadoop (draft) is clear on this:
https://wiki.apache.org/hadoop/Defining%20Hadoop

"Products that are derivative works of Apache Hadoop are not Apache Hadoop,
and may not call themselves versions of Apache Hadoop, nor Distributions of
Apache Hadoop."

--------------------

6)
----------------------
https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"Apache Spark supports multiple versions of Hive, from 0.12 up to 1.2.1. "

Apache Spark can NOT support multiple versions of Hive because they are
working with a fork, and there is no standards body for "supporting Hive".

Some products have been released that have been described as "compatible"
with Hadoop, even though parts of the Hadoop codebase have either been
changed or replaced. The Apache™ Hadoop® developer team are not a standards
body: they do not qualify such (derivative) works as compatible. Nor do
they feel constrained by the requirements of external entities when
changing the behavior of Apache Hadoop software or related Apache software.
-----------------------

7)
---------------------------------
The Spark committers openly use the word "take" during the process of
"importing" Hive code.

https://github.com/apache/spark/pull/10583/files
"are there unit tests from Hive that we can take?"

The Apache foundation will not take a hostile fork for a proposal. Had the
original Spark proposal implied they wished to fork portions of the Hive
code base, I would have considered it a hostile fork (this is open to
interpretation).

(EC Note: Is this the Apache way? How can we build communities? How would
small projects feel if, for example, Hive "imported" (copied) their code
while they sat in incubation?)
------------------------------

8)
----------------------------
Databricks (after borrowing slabs of Hive code, using our trademarks, etc.)
makes disparaging comments about the performance of Hive.

https://databricks.com/blog/2017/02/28/voice-facebook-using-apache-spark-large-scale-language-model-training.html

"Spark-based pipelines can scale comfortably to process many times more
input data than what Hive could handle at peak. "

(EC Note: How is this statement verifiable?)
-----------------------------------------------

9)
--------------------------
https://issues.apache.org/jira/browse/SPARK-10793

It's easily enough added, to the code, there's just the risk of the fork
diverging more from ASF hive.

(EC Note: Even those responsible for this admit the code is diverging and
will diverge more from their actions.)
------------------------

10)
----------------------

My opinion of all of this:
The above points are hurtful to Hive. First, we are robbed of community.
People could be improving Hive by making it more modular, but instead they
are improving Spark's fork of Hive. Next, our code base is subject to
continued "poaching". Apache Spark "imports", copies, alters, and claims
compatibility with/from Hive (I pointed out above why the compatibility
claims should not be made). Finally, we are subject to unfair performance
comparisons, "x is faster than Hive", by software (Spark) that is
essentially

*POWERED BY Hive (via the forking and code copying).*

Hive has always had a bullseye on it as the best Hadoop SQL:
https://vision.cloudera.com/impala-v-hive/

In my hood we have a saying, "Haters gonna hate"

For every Impala and every Spark claiming to be better than Hive, there are
10 HadoopDBs that collapsed under the weight of themselves. We outlasted
fleets of them.

That being said, software like Hive Metastore is our baby. It is our TM. It is
our creation. It is what makes us special. People have the right to fork it
via the license. We can not stop that. But it can't be both ways: either
downstream needs to bring in our published artifacts, or they fork and give
what they are doing another name.

None of this activity represents what I believe is the "Apache Way". I
believe the Apache Way would be to communicate to us, the hive community,
about ways to make the components more modular and easier to use in other
projects. Users suffer when the same code "moves" between two projects:
there is fragmentation, and typically it leads to negative effects for both
projects.


--------------------------------------

Thanks, 

Edward 

Saturday Feb 04, 2017

Forget Bowling Green, worry about the Lusitania

There has been a lot of outrage, talk, and comedy about the Bowling Green Massacre. While revolting in its own way, it is not new ground for the current state of affairs in the US. We know where the administration stands on immigration and we know this is a divisive issue. Sides are dug in and we all know where we stand. However, I find two events far more troubling:

The first event is where the White House Press Secretary, Sean Spicer, claimed Iranians attacked a US ship. This claim was instantly debunked inside the press conference when a savvy reporter pointed out that the ship was a Saudi ship. Now, why is this a big deal? Well, in school we learned that the sinking of the RMS Lusitania is what drew the US into World War I. Claiming that Iran attacked a US ship is a big deal.

For the second event, examine the US raid on Yemen in which a US soldier and several civilians were killed. After the event the US released videos that they claimed were captured during the raid. It was very quickly determined that the videos were not captured in the raid. Instead the videos were 10 years old.

These two events are shocking. Either the administration is willfully misleading us, or they are grossly incompetent. Which is worse: being the White House press secretary and confusing whether a ship is US or Saudi, or purposely manipulating us? When the country is a tinderbox of opinions, pulling a video out of the archives and parading it as a result serves what purpose?

Knee-jerk Twitter reactions are unavoidable now. Even if something is "retracted" a day later, the damage is already done. Had a reporter not been 'Johnny on the spot' during Spicer's blunder, there would have been a massive fallout and more alternative-facts that might never be corrected in the minds of some. The result could be more far reaching than executive orders on immigration.

Thank you for listening 

 


Sunday Jan 29, 2017

Apache Incubator Gossip first release


In case you are wondering where I have been in the blogging world: a good portion of my free time has been spent on https://github.com/apache/incubator-gossip which is now in the Apache Incubator.

 

 

We are building a lot of very cool stuff, and there is talk about adding support for the SWIM protocol.

Friday Oct 28, 2016

Feature Flag dark launch library from the guy who has his own everything

 

Hopefully you do not deploy "dangerous" code, but sometimes if you want to "get dangerous" you might only want to roll the code out to a portion of your users. So if someone says "let's get dangerous" think

Darkwing 

If you really want a good argument for why you should have this, read this thing at Facebook.
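
For readers who have not seen a percentage rollout before, here is a minimal sketch of the idea in Java. It is not the Darkwing API: the class name `PercentageRollout`, its methods, and the feature name `new-search` are all made up for illustration. The assumption is that each user has a stable string id. The sketch hashes the feature name and user id into a bucket from 0 to 99 and turns the code on only for buckets below the configured percentage, so the same user always gets the same answer.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical percentage-based dark launch check; illustration only, not the Darkwing API. */
final class PercentageRollout {
    private final String featureName;
    private final int percentEnabled; // 0 means off for everyone, 100 means on for everyone

    PercentageRollout(String featureName, int percentEnabled) {
        if (percentEnabled < 0 || percentEnabled > 100) {
            throw new IllegalArgumentException("percentEnabled must be between 0 and 100");
        }
        this.featureName = featureName;
        this.percentEnabled = percentEnabled;
    }

    /** Maps (feature, user) to a stable bucket in the range [0, 100). */
    int bucketFor(String userId) {
        CRC32 crc = new CRC32();
        crc.update((featureName + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100); // getValue() is never negative
    }

    /** True when this user falls inside the enabled slice of buckets. */
    boolean isEnabledFor(String userId) {
        return bucketFor(userId) < percentEnabled;
    }

    public static void main(String[] args) {
        PercentageRollout rollout = new PercentageRollout("new-search", 10);
        int enabled = 0;
        for (int i = 0; i < 10_000; i++) {
            if (rollout.isEnabledFor("user-" + i)) {
                enabled++;
            }
        }
        System.out.println("enabled for " + enabled + " of 10000 simulated users");
    }
}
```

Including the feature name in the hash is a design choice: it keeps the same unlucky users from landing in the "dangerous" slice of every feature at once.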

Tuesday Oct 18, 2016

What is collusion?

 

I just happened upon this article: 

WIKILEAKS : More Media Collusion – HuffPo Reporter Emails Hillary Camp to WARN Them About FOX NEWS Programming

http://truthfeed.com/wikileaks-more-media-collusion-huffpo-reporter-emails-hillary-camp-to-warn-them-about-fox-news-programming/30338/ 

First, huffpost bloggers are not the same as huffpost reporters. I have a huffpo blog and I am not affiliated with huffpost anymore. (I am not sure if the writer mentioned in the article is a blogger or an employee.)

Second, the Huffington Post has been openly against Trump from the beginning. At the end of every Trump article is this footer: http://www.huffingtonpost.com/entry/donald-trump-women-sick_us_5804d6ece4b0e8c198a8fb66

Editor’s note: Donald Trump regularly incites political violence and is a serial liar, rampant xenophobe, racist, misogynist and birther who has repeatedly pledged to ban all Muslims — 1.6 billion members of an entire religion — from entering the U.S.

Third, the definition of collusion:


secret or illegal cooperation or conspiracy, especially in order to cheat or deceive others.


The Huffington Post has declared it does not like Trump; it is not a secret. I can not speak for what is legal, but without the secret you can not have a conspiracy, and I also do not see how the information is deceptive.

 


Tuesday Mar 29, 2016

Deus Ex: The Fall - Ed's review

I have decided to change gears a bit and review one of my favorite Android games, Deus Ex: The Fall. I was a big fan of Deus Ex 3, which came out on the xbox. For those not familiar, Deus Ex is a sneak shooter. I actually play 'the fall' on train rides home; it took me a few months of playing it periodically to beat it.




What makes this game special? 

In the near future humans can be outfitted with augmentations, or "augs". They do things like steady your gun arm, provide mimetic camouflage, etc. The way the Deus Ex game balances is that you can not afford all the augs, so you pick and choose ones that match your game play. For example, if you like the run and gun type, you focus on body armor, speed enhancements and take downs, but if you want to sneak around you focus on stealth enhancements.

What makes a BAD sneak shooter?

What makes a bad sneak shooter is huge missions. When you're walking through a warehouse and you have to choke out 500 people over 4 hours of game play, this is just annoying. Think about it: could you imagine that in three or four hours no one realized that 500 security guards have not checked in? Or that in 4 hours that one guy at the computer would not go for a bathroom break and just happen to look in one of the 90 lockers you have hidden bodies in? Just not possible and kinda silly.

Why does 'The Fall' avoid this?

Well obviously this is an Android game, so by its nature it avoids huge levels. This actually gives the game the right feel: they are small levels with a few rooms, you execute a few tactical take downs and you get a reward! In the xbox game a lot of time is spent moving/hiding bodies, so as not to alert others and bring about a free for all. In 'the fall' the bodies just vanish after a few seconds. Bodies vanishing is not realistic, but I think it goes with the style: you knock someone out and you move on. When I play I simply force myself into the mind of a character and play a 'realistic' way; there is no way an augmented human is going to huddle in a corner waiting for 3 hours for 3 different people to be in the "perfect place". You just make a move and be damned with the consequences.

Controls

I was rather impressed with the controls; in fact I enjoyed them more than the console version. On screen you can switch weapons fast, and icons appear when you are in take down range. A rather cool thing is that in the settings menu you can adjust the placement of each of the on screen controls. I was super impressed by this. I really did not have to move anything, but the fact that you could I thought was pretty neat.

Tidbits

One thing I enjoy is that around the game there are PDAs and computers that you can read or hack into to get some back story and hints into what is unfolding. I really like that in all games; they did this in Gears of War with journals and cogs. The nice part is this is always optional. You are not forced to watch 10 minute movies, but if you care you can review the data to understand the world better. You can also talk to random people like in a standard RPG, and while they do not offer a ton to say, that is still pretty cool.

 

Plot

You are an ex special forces character with augs drawn into something bigger than you. You are living below the radar and have to go on a variety of missions to acquire the drugs that keep you from rejecting your augs. As that goes down you have to deal with people who offer you what you need in exchange for your services, and you are free to embark on side quests. For a 99 cent Android game this plot is amazing, and it would still be a fairly in depth plot for a console game.

Pros

Flexible game play, large environment to explore, upgradeable character attributes, upgradeable weapons. Nice graphics and controls for a cell phone game. Retained a lot of the feel from the xbox game while moving to a cell platform.

Cons

While it is a sneak shooter the game is more biased towards the sneak, even with armor upgrades a couple well placed shots from enemies can put you down. The game is less fun to play as a shooter IMHO. Environments seem more detailed than characters. 

Overall

If you like the console game and you have a 30 minute train ride home every day this game is amazing. Since it is an older game it is totally worth the cost (~99 cents). I would still happily pay 3 or 4 dollars for it.

My score is 9.  

Wednesday Mar 09, 2016

Great Moments in Job Negotiation: Volume 1

Huffington Post is my current employer. Huffington Post is owned by AOL. The interview process has to go through two stages of HR. At the time the head of AOL also approved each hire.

After multiple interviews with multiple people over three weeks I finally got my offer letter.

I replied to the recruiter, "This is a nice offer, but if I don't have a floppy disk with 30 free hours of AOL in my mailbox by Monday, the deal is off."

Sunday Mar 06, 2016

Rasp PI 3 is here

Up until this point you have had to attach a WiFi or 4G card to your 'internet of things' device. Well no more! The new Raspberry Pi 3 has built-in wireless networking. This is going to get interesting.

 

 


Wednesday Feb 17, 2016

Python users / Data Scientists measuring PITA levels

Before I get started trashing people, let me say I have the greatest respect for former and current colleagues, but there is a large looming problem that needs to be addressed.

The fanboy level of Python usage in people, mainly data scientists, needs to stop.

A sick blind devotion to Python completely unchecked by reason

I was talking to a Python user about Spark: 
Me: "What were you looking to use Spark for?"
Them: "I hear there is PySpark."
Me: "Yes, very interesting. What are you looking to use it for?"
Them: "PySpark."

ROFL: The only take away about the spark platform is PySpark? Nothing else seemingly was interesting or caught your attention? Really nothing about streaming or in memory processing, just PySpark? lol #blinders

You would think [data] scientists want to learn things?

I encounter this debate mostly with hive-streaming. When someone asks me about Hive streaming I look at the problem. Admittedly there are actually a couple of tasks most easily addressed with streaming. But the majority of streaming things can be solved much more efficiently and correctly by writing a simple UDF or UDAF in Java. What normally is a common reply when a Hive committer, who wrote a book on Hive, explains unequivocally that a UDF is better for performance, debugging, and testability, and is not that hard to write?

"I don't want to learn how to compile things | learn about java | learn about what you think is the right way to do things". You would think that a data scientist who is trying to search for great truths would actually want to find the best way to use a tool they have been working with for years.

Just to note: In Hive streaming everything moves between processes via pipes, which is like 4 context switches and two serializations for each row (not including the processing that has to happen in the pipe).
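
To give a feel for how little code a UDF is, here is a minimal sketch. The class `NormalizeDomainSketch` and what it does are hypothetical, picked only for illustration. To keep the example compilable without the Hive jars it is written as a plain class; the assumption is that in a real deployment it would extend Hive's `org.apache.hadoop.hive.ql.exec.UDF` base class, which finds a method named `evaluate` by reflection. The point it tries to show is the testability argument above: the row logic is an ordinary method you can call from a unit test, in process, with no pipes involved.

```java
import java.util.Locale;

/**
 * Hypothetical row-level function: pulls a lower-cased host name out of a URL.
 * Assumption: a real Hive version would add "extends UDF"
 * (org.apache.hadoop.hive.ql.exec.UDF) and be compiled against the Hive jars.
 */
final class NormalizeDomainSketch {

    /** Called once per row; returns null for null or empty input. */
    public String evaluate(String url) {
        if (url == null) {
            return null;
        }
        String s = url.trim().toLowerCase(Locale.ROOT);
        int scheme = s.indexOf("://");
        if (scheme >= 0) {
            s = s.substring(scheme + 3); // drop "http://", "https://", ...
        }
        int slash = s.indexOf('/');
        if (slash >= 0) {
            s = s.substring(0, slash); // drop the path
        }
        if (s.startsWith("www.")) {
            s = s.substring(4);
        }
        return s.isEmpty() ? null : s;
    }

    public static void main(String[] args) {
        NormalizeDomainSketch udf = new NormalizeDomainSketch();
        System.out.println(udf.evaluate("HTTP://WWW.Example.com/a/b")); // prints "example.com"
    }
}
```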

I don't care that 100% of the environment is Java, im f*ckin special

A few years back someone (prototyping in Python) suggested we install LibHDFS. Later someone suggested we install WebHDFS. The only reason to install these things is that they must use Python to do things, even if there already are prior examples of doing this exact task in Java in our code base. Sysadmins should install new libraries, open new ports, monitor new services, and we should change our architecture, just because the Python user does not want to use Java for a task that 10 previous people have used Java for.

"I'm Just prototyping"

This is the biggest hand wave. When scoping out a new project don't bother looking for the best tool for the job. Just start hacking away at something and then whatever type of monstrosity appears, just say it's already done; someone will just have you jam it into production anyway. Good luck supporting the "prototype" with no unit tests in production for the next 4 years. You would think that someone would take a lead from a professional coder and absorb their best practices. No, of course not; they instead will just tell you how best practices don't apply to them. #ThisISSparta!

Anyway it's 7:00 am and I woke up to write this so that I can vent. But yeah, it's not Python, it's not data scientists, but there is just a hybrid intersection of the two that is so vexing.

 

Friday Jan 22, 2016

My day

[edward@bjack event-horizon-app]$ git log
commit 9de21fbc97a7f573f6b0564daff20f5ce23c723e
Author: Edward Capriolo <edward.capriolo@.com>
Date:   Fri Jan 22 16:14:20 2016 -0500

    Ow yes yaml cares about spaces...beacause ansible

commit de07401a0087e86253cbf9c0369010e21d248eb9
Author: Edward Capriolo <edward.capriolo@.com>
Date:   Fri Jan 22 16:10:57 2016 -0500

    Why not

commit 0be598151962f647528406bad21b3b8c8e887ffd
Author: Edward Capriolo <edward.capriolo@com>
Date:   Fri Jan 22 16:05:06 2016 -0500

    This is soo much better than just writing a shell script

commit 4f4ea0b8b462a61e3ecde71ff656da9e1324095b
Author: Edward Capriolo <edward.capriolo@.com>
Date:   Fri Jan 22 16:01:53 2016 -0500

    Why dont we have a release engineer

commit b77264618f2fbe689ecc09e4575e10935ba20600
Author: Edward Capriolo <edward.capriolo@.com>
Date:   Fri Jan 22 15:57:56 2016 -0500

    bla

commit 912597f1ba4284a5312398ad770f6fd1d76301a1
Author: Edward Capriolo <edward.capriolo@.com>
Date:   Fri Jan 22 15:52:21 2016 -0500

    The real yaml apparently

commit ee64c5c4340202b95a0f05784f30b63abd755d2d
Author: Edward Capriolo <edward.capriolo@.com>
Date:   Fri Jan 22 15:32:28 2016 -0500

    Always asume kill worked. so we can start if nothing is running

Tuesday Jan 12, 2016

'No deal' is better than a 'Bad Deal'

After working for a few companies a few things have become clear to me. Some background, I have been at small companies with no code, large companies with little code, small companies with a lot of code, and large companies where we constantly re-write the same code. 

I was watching an episode of 'shark tank'. Contestant X had a product, call it 'Product X', and four of the five sharks offered nothing. The 5th shark, being very shark like, used this opportunity to offer a 'bad' deal. The maker of 'Product X' thought it over, refused the deal, and left with no deal. The other sharks were more impressed with 'Contestant X' than 'Product X'. They remarked that "No deal is better than a Bad Deal". This statement is profound and software products should be managed the same way.

Think about the phrase tech-debt. People might say tech-debt kills your agility. But it is really not the tech-debt alone that kills your agility, it is 'bad deals' that lead to tech-debt. As software gets larger it becomes harder to shape and harder to manage. At some point software becomes very big, and change causes a cascade of tech-debt. Few people want to remove a feature. Think about Monkeys on a Ladder, and compare this to your software. Does anyone ever ask you to remove a feature? Even if something is rarely used or never used someone might advocate keeping it, as it might be used later. Removing something is viewed as a loss, even if it really is addition by subtraction. Even if no one knows who asked for this rule people might advocate keeping it anyway! Heck, even if you find the person who wanted the feature and they are no longer at the company, and no one else uses it, people might advocate keeping it anyway!

The result of just-keep-it thinking is you end up keeping around code you won't use, which prevents you from easily adding new code. How many times have you heard someone say, 'Project X (scoff)!? That thing is a mess! I can re-write that in scala-on-rails in 3 days'. 4 weeks later, when Project X on-scala-on-rails is released, a customer contacts you about how they were affected because some small business rule was not ported correctly due to an oversight.

The solution to these oversights is not test-coverage or sprints dedicated to removing tech-debt. The solution is never to make a bad deal. Do not write software with niche cases. Do not write software with surprising rules. The way I do this is a mental litmus test: take the exit criteria of an issue and ask yourself, "Will I remember this rule in one year?" If someone asks you to implement something and you realize it was implemented a year ago and no one ever used it, push back; let them know the software has already gone in this direction and it led nowhere. If you're a business and you're struggling to close deals because the 'tech people' can not implement X in time, close a deal that does not involve X.

'No deal' is better than a 'Bad Deal'

'No code' is better than 'Bad Code'

'No feature' is better than 'Bad Feature' 

 

 

Saturday Dec 26, 2015

Introducing TUnit

Some of my unit tests have annoying sleep statements in them. I open sourced TUnit for changing this.

The old way:

Thread.sleep(10000);
Assert.assertEquals(2 , s[2].getClusterMembership().getLiveMembers().size());

The new way:

TUnit.assertThat(new Callable() {
    public Object call() throws Exception {
        return s[2].getClusterMembership().getLiveMembers().size();
    }
}).afterWaitingAtMost(11, TimeUnit.SECONDS).isEqualTo(2);

You can see this in action here.
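
If you are curious what a poll-until-equal assertion boils down to, here is a rough sketch in plain Java. This is not TUnit's source: `PollingAssertSketch`, `assertEventually`, and the 50 ms poll interval are assumptions made for illustration. It re-evaluates the `Callable` until the value matches or the deadline passes, so a passing test returns as soon as the condition holds instead of always sleeping the full time.

```java
import java.util.Objects;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical polling assertion; a sketch of the idea, not TUnit itself. */
final class PollingAssertSketch {
    private static final long POLL_INTERVAL_MILLIS = 50; // assumed interval

    /** Polls actual until it equals expected, or fails once the timeout elapses. */
    static <T> void assertEventually(Callable<T> actual, T expected, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        T last;
        while (true) {
            try {
                last = actual.call();
            } catch (Exception e) {
                throw new AssertionError("callable threw while polling", e);
            }
            if (Objects.equals(expected, last)) {
                return; // condition met, no need to keep waiting
            }
            if (System.nanoTime() - deadline >= 0) {
                break; // out of time
            }
            try {
                Thread.sleep(POLL_INTERVAL_MILLIS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while polling", e);
            }
        }
        throw new AssertionError("expected " + expected + " but last value was " + last);
    }

    public static void main(String[] args) {
        // Stand-in for a cluster that converges after a few polls.
        AtomicInteger liveMembers = new AtomicInteger();
        assertEventually(liveMembers::incrementAndGet, 2, 11, TimeUnit.SECONDS);
        System.out.println("converged after " + liveMembers.get() + " polls");
    }
}
```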

Tuesday Dec 15, 2015

I am highly available

 

https://www.linkedin.com/in/edwardcapriolo


Monday Dec 14, 2015

Mounting a comeback!

Hey all! It has been a long time. Well if you don't know, my wife Stacey and I had a baby boy Ian! 



Well besides that I am gearing up for the next teknek release. Along with some cleanups I also replaced the crappy zookeeper lock recipe and added curator.

Sunday Jul 12, 2015

Why Hive on Cloudera is like Python on Redhat

I used to be fairly anti-cloudera. I was never really convinced you needed someone to package up hadoop for you; your admins should just learn it. These days Hadoop is N degrees harder and I don't really have as much give-a-crap for learning to configure all the knobs that change names all the time. Thus I am more or less happy to let cloudera handle installing the 9000 hadoop components.

But really cloudera's testing is not that great. In my last version of cdh, decommissioning NodeManagers caused yarn to stop accepting jobs. ::Major fail:: Upgrade, and in the new version hive can not support custom hive serdes because of an upstream Hive bug.

Filed this to CDH user:

https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/tTHw8kfanqQ

Got the ::cricket:: response

Thinking of getting away from CDH hive at this point. Why?

  1. Waited a long time for this so I could easily build in tez support
  2. Still no out of the box tez support even though it's clearly the way forward (and would make everything umpteen times faster)
  3. Does not really look like cloudera can/wants to keep up with Hive's release cycle
  4. Sabotaging features by adding check boxes and disabling things that work out of the box: "Check the box for Enable Hive on Spark (Unsupported)."
  5. Constant complaints in manager that you should have a metastore server or should have zookeeper when the truth is most users won't need either (and I sure do not need this)
  6. N day wait to confirm bugs, "Whenever we get to it" fixes
  7. 1 zillion unneeded jars in classpath, hbase etc., that I'm not actually using with hive

I'm tired of dealing with back-revved revisions and cloudera's "Why aren't you just using impala" type stance.

I am going back to rolling my own. I will still use cdh to manage hdfs proper and YAWN, but this hive situation is unmanageable. Hive on cloudera is like Python on Redhat 5. You are painted into an annoying box and you have no direct way to make it better other than ignoring it entirely and rolling your own!
