Testing BigData projects

Posted in Data, Java

Writing tests that use a traditional database is hard, but writing tests in a project using Hadoop is even harder. Hadoop stacks are complex pieces of software, and testing your Hadoop projects can be a real nightmare:
– many components are involved: you are not just using HBase, but HBase, ZooKeeper and a DFS
– a lot of configuration is needed
– cleaning the data of the previous tests relies on many steps: truncate the HBase table, stop and delete the running Oozie workflows, clean…read more
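As a rough illustration of the multi-step cleanup the post describes, a test fixture can run each step in order and collect failures instead of aborting on the first broken one. This is a generic sketch, not the post's actual code; the step names are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Runs a list of named cleanup steps in order, collecting failures
// instead of stopping at the first broken step, so one flaky service
// does not leave the rest of the environment dirty.
public class CleanupRunner {
    public static List<String> runAll(Map<String, Runnable> steps) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, Runnable> step : steps.entrySet()) {
            try {
                step.getValue().run();
            } catch (RuntimeException e) {
                failures.add(step.getKey() + ": " + e.getMessage());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, Runnable> steps = new LinkedHashMap<>();
        // Placeholders for the real steps: truncate the HBase table,
        // kill the running Oozie workflows, clean the DFS directories...
        steps.put("truncate-hbase-table", () -> {});
        steps.put("kill-oozie-workflows", () -> { throw new RuntimeException("Oozie unreachable"); });
        steps.put("clean-dfs", () -> {});
        System.out.println(runAll(steps));
    }
}
```

Collecting the failures and asserting the list is empty at the end of each test makes it obvious which of the many components was left in a bad state.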

Fitting Java and Python with JPY

Posted in Java

There are many libraries in Java (more than 176,649 unique artifacts indexed on Maven Central alone), but sometimes you cannot find what you are looking for, except for a Python equivalent. In a previous project, I had to deal with custom MaxMind databases. MaxMind provides a Java library with a database reader, but does not provide a database writer. After some research, I found one official lib in Perl, and another (unofficial) one in Python. Since I didn't have the time to redevelop the equivalent in Java (this was…read more

Solr: manage time-based collections

Posted in Data

If you use Solr as your full-text search engine, you may be frustrated to miss the excellent Curator tool from Elastic, which allows you to manage your indices. Cloudera offers an admin tool for Solr, named solrctl, a light utility to supervise a SolrCloud deployment. Although solrctl has some useful commands, it gives you no way to delete old time-based collections. Time-based collections, and more generally sharding/partitioning per time frame, are a common pattern for aggregation but also for many other use cases. The idea is simple: the collections have names…read more
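To make the time-based pattern concrete, selecting the collections to drop boils down to parsing the date out of each collection name and comparing it to a retention cutoff. A minimal sketch, assuming a hypothetical `prefix_yyyy-MM-dd` naming convention (the post's actual convention is not shown in the excerpt):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.stream.Collectors;

// Given collection names suffixed with an ISO date (e.g. "logs_2017-03-01"),
// select the ones older than a retention period so they can be deleted.
public class CollectionRetention {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ISO_LOCAL_DATE;

    public static List<String> expired(List<String> collections, String prefix,
                                       LocalDate today, int retentionDays) {
        LocalDate cutoff = today.minusDays(retentionDays);
        return collections.stream()
                .filter(name -> name.startsWith(prefix + "_"))
                .filter(name -> LocalDate.parse(name.substring(prefix.length() + 1), FMT)
                        .isBefore(cutoff))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = List.of("logs_2017-02-01", "logs_2017-03-01", "logs_2017-03-10");
        // With a 15-day retention as of 2017-03-12, only the February
        // collection falls before the cutoff and is a deletion candidate.
        System.out.println(expired(names, "logs", LocalDate.of(2017, 3, 12), 15));
    }
}
```

Each name returned by this filter would then be deleted through the SolrCloud Collections API, which is exactly the step solrctl does not automate.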

HBase: having fun with the shell

Posted in Data

The HBase shell is a full interactive JRuby shell (IRB) providing tools that allow you to query your data or execute admin commands on an HBase cluster. Since it uses JRuby, this shell is a powerful interactive scripting environment. This post is not about presenting the commands available in the shell (you can easily find documentation or articles on the Internet), but about the possibilities of the shell. Add a custom command Actually, there is no easy way to extend the set of available commands in the HBase shell. If you want…read more

Knox in production: avoid pitfalls and common mistakes

Posted in Data, Java

I’ve already posted articles about Knox some weeks ago on two subjects: how to use the HBase REST API through Knox and how to submit Spark jobs via the Knox API. In my current mission, many projects now use Knox as the main gateway for many services like HBase and HDFS, but also for Oozie, Yarn… After some weeks of development and deployment in production, I’ve decided to write a post about some troubles that you may encounter when you are using the Java client. NOTE I’m a great supporter…read more

Working with Parquet files

Posted in Data

Apache Parquet is a columnar storage format available for most of the data processing frameworks in the Hadoop ecosystem: Hive, Pig, Spark, Drill, Arrow, Apache Impala, Cascading, Crunch, Tajo… and many more! In Parquet, the data are compressed column by column. This means that commands like these: hdfs dfs -cat hdfs:// hdfs dfs -text /…/file2 no longer work on Parquet files; all you can see are binary chunks in your terminal. Thankfully, Parquet provides a useful project to inspect Parquet files: Parquet Tools. For a more…read more
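The binary chunks are easy to explain: a Parquet file is framed by the 4-byte magic number `PAR1`, so you can recognize one without any Hadoop tooling just by checking its first bytes. A minimal sketch (a sanity check, not a replacement for Parquet Tools):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// A Parquet file starts (and ends) with the 4-byte magic number "PAR1";
// everything in between is binary column-chunk data, which is why
// `hdfs dfs -cat` only shows noise on the terminal.
public class ParquetMagic {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the buffer begins with the Parquet magic number.
    public static boolean looksLikeParquet(byte[] data) {
        return data.length >= 4 && Arrays.equals(Arrays.copyOf(data, 4), MAGIC);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeParquet("PAR1...".getBytes(StandardCharsets.US_ASCII)));   // true
        System.out.println(looksLikeParquet("plain text".getBytes(StandardCharsets.US_ASCII))); // false
    }
}
```

For anything beyond this check (schema, row counts, actual values), Parquet Tools is the right instrument.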

Using HBase REST API with the Knox Java client

Posted in Data

I’ve already introduced Knox in a previous post in order to deploy Spark jobs with Knox using the Java client. This post is still about the Knox Java client, but we’ll see here another usage with HBase. HBase provides a well documented and rich REST API with many endpoints exposing the data in various formats (JSON, XML and Protobuf!). First, we need to import the dependencies for the Knox Java client: <dependency> <groupId>org.apache.knox</groupId> <artifactId>gateway-shell</artifactId> <version>0.10.0</version> </dependency> Then, let’s write some code: Hadoop session = Hadoop.login("https://$KNOX_SERVER:8443/gateway/default", "user", "password"); String tableName…read more

Submitting Spark Job via Knox on Yarn

Posted in Data

Apache Knox is a REST API gateway for interacting with Apache Hadoop clusters. It offers an extensible reverse proxy securely exposing REST APIs and HTTP-based services in any Hadoop platform. Although Knox is not designed to be a channel for high-volume data ingest or export, it is perfectly suited for exposing a single entry point to your cluster and can be seen as a bastion for all your applications. One of the possible use cases of Knox is to deploy applications on Yarn, like Spark or Hive, without exposing the…read more
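The "single entry point" idea shows up directly in the URLs: Knox prepends its gateway path and topology name to the proxied service path, so a client only ever talks to the gateway host. A small sketch of that layout, assuming the YARN ResourceManager is mapped under the conventional `resourcemanager` path in the topology (host and topology names here are examples, not values from the post):

```java
// Builds the Knox-proxied URL for the YARN ResourceManager apps endpoint.
// Behind the gateway, this corresponds to the RM's /ws/v1/cluster/apps API.
public class KnoxYarnUrls {
    public static String appsEndpoint(String knoxBase, String topology) {
        return knoxBase + "/gateway/" + topology + "/resourcemanager/v1/cluster/apps";
    }

    public static void main(String[] args) {
        // The application host/port and topology are placeholders.
        System.out.println(appsEndpoint("https://knox.example.com:8443", "default"));
    }
}
```

Every service exposed through the same topology (HBase, HDFS, Oozie…) follows the same prefixing scheme, which is what lets a single bastion front the whole cluster.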

Microservices and gRPC: Use Atomix as service discovery

Posted in Cloud

gRPC is a modern open source high-performance RPC framework initiated by Google and supported by many languages and platforms (C++, Java, Go, Node, Ruby, Python and C# across Linux, Windows, and Mac). It is used by many projects (etcd/CoreOS, containerd/Docker, cockroachdb/Cockroach Labs…) and has reached a significant milestone with its 1.0 release. Used in distributed environments where a large number of microservices are running, gRPC supports rich cloud-oriented features like: load balancing/discovery, tracing, health checking, authentication. For now, gRPC only supports DNS as…read more
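To make the discovery idea concrete, here is a toy in-memory registry with the register/lookup shape a client-side resolver needs. This is purely illustrative: in the setup the post describes, Atomix would play this role with a replicated, fault-tolerant map instead of a local one, and the service names are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy registry: maps a service name to the addresses of its instances.
// A real deployment would back this with a replicated store (e.g. an
// Atomix distributed map) so every node shares the same view.
public class ServiceRegistry {
    private final Map<String, List<String>> services = new ConcurrentHashMap<>();

    public void register(String service, String address) {
        services.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    public List<String> lookup(String service) {
        return services.getOrDefault(service, List.of());
    }

    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("billing", "10.0.0.1:50051");
        registry.register("billing", "10.0.0.2:50051");
        // A gRPC client-side load balancer would pick one of these addresses.
        System.out.println(registry.lookup("billing"));
    }
}
```

The interesting part is exactly what this sketch leaves out: keeping the map consistent across nodes and evicting dead instances, which is where a coordination framework like Atomix earns its place.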

Efficient logging with Spring Boot, Logback and Logstash

Posted in DevOps, Java

Logging is an important part of any enterprise application, and Logback makes an excellent choice: it’s simple, fast, light and very powerful. Spring Boot has great support for Logback and provides a lot of features to configure it. In this article I will present an integration of an enterprise logging stack using Logback, Spring Boot and Logstash. WARNING Spring Boot recommends using the -spring variants for your logging configuration (for example logback-spring.xml rather than logback.xml). If you use standard configuration locations (logback.xml), Spring cannot completely control log…read more
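For a sense of what wiring these three together looks like, here is a minimal logback-spring.xml sketch that ships JSON events to Logstash over TCP. It assumes the logstash-logback-encoder library is on the classpath; the destination host and port are placeholders, not values from the article.

```xml
<!-- logback-spring.xml: a minimal sketch, assuming the
     logstash-logback-encoder dependency is available. -->
<configuration>
  <!-- Sends each log event to Logstash as a JSON document over TCP. -->
  <appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
    <destination>logstash.example.com:5000</destination>
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="LOGSTASH"/>
  </root>
</configuration>
```

Using the -spring variant of the file name, as the article's warning says, lets Spring Boot fully manage the configuration (profiles, property substitution) before Logback initializes.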