This morning we learnt that the Apache Software Foundation has announced a new top level project - Apache Arrow. 

We think the announcement is important for the BigData and next gen Apps space, as it brings BigData in memory and creates an exchange format for a large number of open source technologies that are critical for next generation applications. But let's take apart the press release in our customary style: 

Forest Hill, MD – UNDER EMBARGO UNTIL WEDNESDAY 17 Feb 2016 AT 7:00 AM ET -- The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today Apache Arrow as a new Top-Level Project.

MyPOV - Good to see activity at Apache all the way to a top level project.  

A high-performance cross-system data layer for columnar in-memory analytics, Apache Arrow provides the following benefits for Big Data workloads:
●        Accelerates the performance of various analytical workloads by more than 100x
●        Enables multi-system workloads by eliminating cross-system communication overhead 

MyPOV - Good summary of what Apache Arrow does - it creates an efficient in memory format that accelerates queries and - even more interestingly - enables cross-system communication in memory, the fastest way this can be done today. It also means that a variety of open source projects that could benefit from an in memory representation of their data do not have to look any further. 

Initially seeded by code from the Apache Drill project, Apache Arrow was built on top of a number of Open Source collaborations, and establishes a de-facto standard for columnar in-memory processing and interchange.

MyPOV - An ambitious statement, but Apache Arrow is off to a good start to become the de facto standard for in memory processing and data interchange.  

“The Open Source community has joined forces on Apache Arrow,” said Jacques Nadeau, Vice President of Apache Arrow and Vice President Apache Drill. “Developers from 13 major Open Source Big Data projects are already on board --by introducing a new era of columnar in-memory analytics, we anticipate the majority of the world’s data will be processed through Arrow within the next few years.”

MyPOV - 13 open source projects on board with Apache Arrow must be some kind of record, even for the very collaborative open source community. The wide support is proof of the attractiveness of the project and of the trust put into the project leaders (see also below). 

Code committers to Apache Arrow include developers from Apache Big Data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm as well as established and emerging Open Source projects such as Pandas and Ibis.

MyPOV - This reads like a who's who of BigData projects. Good for Arrow to see such wide support, but also a testament that it has hit the jackpot in terms of desirability for these other projects. 

“Arrow’s cross platform and cross system strengths will enable Python and R to become first-class languages across the entire Big Data stack,” said Wes McKinney, creator of Pandas. 

MyPOV - Well, we will see if they become first class languages, but the key aspect is that Arrow is polyglot and gives new as well as established languages a chance to access data.

Apache Arrow accelerates analytical processing by providing a high performance columnar in-memory representation. A number of processing algorithms benefit greatly from this memory design.

MyPOV - No surprise, but also good to see that the creators have paid special attention to making Arrow perform well on CPUs, where the 'rubber meets the road' for in memory.  
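
To make the columnar idea concrete, here is a minimal Python sketch using only the standard library. It is not Arrow's API or its actual memory format; the records and field names are made up for illustration. It only shows why an aggregate over one field can scan a single contiguous buffer when data is laid out by column.

```python
from array import array

# Hypothetical sample records; the field names are invented for this sketch.
rows = [
    {"user_id": 1, "latency_ms": 12.5},
    {"user_id": 2, "latency_ms": 7.0},
    {"user_id": 3, "latency_ms": 20.5},
]

# Columnar layout: one contiguous, typed buffer per field.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),        # signed 64-bit ints
    "latency_ms": array("d", (r["latency_ms"] for r in rows)),  # 64-bit floats
}


def mean_latency_rows(records):
    # Row-wise: every record object is visited to read a single field.
    return sum(r["latency_ms"] for r in records) / len(records)


def mean_latency_columnar(cols):
    # Column-wise: one scan over a single contiguous buffer.
    col = cols["latency_ms"]
    return sum(col) / len(col)
```

Both functions return the same mean; the difference is in how much unrelated data has to be touched to compute it, which is what a columnar engine exploits at scale.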

“A columnar in-memory data layer enables systems and applications to process data at full hardware speeds,” said Todd Lipcon, original Apache Kudu creator and Apache Arrow PMC. “Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing.”

MyPOV - By aligning work and data for efficient CPU utilization, Apache Arrow positions itself well to become the de facto in memory engine for many open source initiatives that don't have in memory capabilities yet. The better Arrow can use existing hardware, the less likely the rise of any competing open source project becomes.  

In many workloads, 70-80% of CPU cycles are spent serializing and deserializing data. Arrow solves this problem by enabling data to be shared between systems and processes with no serialization, deserialization or memory copies.

MyPOV - The open nature of Arrow makes it an attractive in memory format that can be shared by a multitude of open source players, which has a number of benefits: the projects don't need to build their own engines, the projects don't need to build integration interfaces to other projects, and customers don't have to buy expensive hardware for storing data multiple times in memory and will benefit from more efficient administration. 
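
The difference between serializing data and sharing it can be sketched in a few lines of standard-library Python. This is an illustration within a single process, under the assumption that a `memoryview` stands in for a shared buffer; it is not how Arrow itself is implemented and says nothing about its cross-process mechanics.

```python
import pickle
from array import array

# A hypothetical column of 64-bit floats.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization path: the consumer ends up with an independent copy,
# paid for with an encode step and a decode step.
copied = pickle.loads(pickle.dumps(column))

# Shared-buffer path: a memoryview exposes the same underlying bytes
# without copying them; slicing a memoryview does not copy either.
shared = memoryview(column)
window = shared[1:3]

# A write by the producer is visible through the shared view,
# but not in the deserialized copy.
column[0] = 99.0
```

After the last line, `copied[0]` is still the old value while `shared[0]` reflects the update, which is the behavior a common in memory format makes possible between cooperating systems.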

“An industry-standard columnar in-memory data layer enables users to combine multiple systems, applications and programming languages in a single workload without the usual overhead,” said Ted Dunning, Vice President of the Apache Incubator and Apache Arrow PMC.

MyPOV - Well said by Dunning. Now we will have to see adoption become real in code and customer projects in the coming months. 

In addition to traditional relational data, Arrow supports complex data with dynamic schemas. For example, Arrow can handle JSON data which is commonly used in IoT workloads, modern applications and log files. Implementations are also available (or underway) for a number of programming languages including Java, C++ and Python to allow greater interoperability among a number of Big Data solutions.

MyPOV - Arrow uniquely brings together in memory, columnar and support for complex (e.g. JSON) data types - a trifecta that is unusual and seldom achieved. Polyglot capabilities will make Arrow even more attractive to developers of next gen Apps.  
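
How nested JSON can live in a columnar layout is easier to see with a small example. The sketch below flattens a variable-length list field into one values buffer plus an offsets buffer, an idea commonly used by columnar formats. It is a simplified illustration in standard-library Python, with invented records and field names, and not Arrow's API or exact layout.

```python
import json
from array import array

# Hypothetical IoT-style records; sensor names and readings are invented.
raw = (
    '[{"sensor": "a", "readings": [1.5, 2.5]},'
    ' {"sensor": "b", "readings": []},'
    ' {"sensor": "c", "readings": [4.0]}]'
)
records = json.loads(raw)

values = array("d")        # all readings, flattened into one contiguous buffer
offsets = array("i", [0])  # record i owns values[offsets[i]:offsets[i + 1]]
for rec in records:
    values.extend(rec["readings"])
    offsets.append(len(values))


def readings_of(i):
    # Recover the nested list for record i from the two flat buffers.
    return values[offsets[i]:offsets[i + 1]].tolist()
```

The nested structure is fully recoverable from the two flat buffers, so the data stays columnar even though each record carries a list of a different length.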

“Real world use cases often include complex combinations of structured and rapidly growing complex-data. Already tested with Apache Drill, the efficient in-memory columnar representation and processing in Arrow will enable users to enjoy the performance of columnar processing with the flexibility of JSON,” said Parth Chandra, Apache Drill PMC and Apache Arrow PMC.

MyPOV - Good to see Apache Drill on board.  

Catch Apache Arrow in action at Strata + Hadoop World (San Jose: 30 March 2016, and London: 1-3 June 2016), as well as upcoming MeetUps and local events http://arrow.apache.org/events

MyPOV - No surprise. Good timing.  

Availability and Oversight

Apache Arrow software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Arrow, visit http://arrow.apache.org/

MyPOV - No surprise.

 

Overall MyPOV 

We are witnessing a likely milestone in what can be done with BigData powering the creation of next generation Applications. In all seven universal use cases of next generation Applications, BigData is the key underlying data technology. 

Finding a common and thus shared way to get into in memory is a major development that, if successful, will enable a whole new level of performance for modern applications, while saving substantially on the hardware side. That means $s saved on machines, which can be invested into the software projects. 

As such Apache Arrow accelerates next generation applications not only on a micro level, within each project itself, but also on a macro level, by funneling more investment into the projects on the software side. Apache Arrow is off to a great start, with the right charter, wide support and competent leadership in place. The latter always matters, but is even more important in the open source community. We will be watching.  

[I typed this on a cruise ship, one of the last places where you get charged for Internet access by the minute, as it comes over satellite. So no time for fancy formatting, will address when back on Terra Firma...]