When you analyze data on Hadoop, Apache Pig is one of the simplest ways to read and transform it. An alternative is Apache Hive, which tends to be easier for people who already know SQL. I have used both, and I prefer writing Pig scripts because you can inspect your data at each step of the script. I also find Pig Latin more human-readable than SQL-style code blocks (nested queries, etc.).
In the last two years, I have written many Pig scripts. I would like to share some tips about Pig scripting.
- Use DEFINE to wrap your file-loading statements in macros and keep them in a separate Pig file, which can be named Loader.pig. Other scripts can then pull the macros in with IMPORT.
- When Pig does not provide the functionality you need, write your own User Defined Functions (UDFs) in Java. For example, if you need to compare object values or apply a custom sorting algorithm, you can write the logic in Java and call it from your Pig script. This feature greatly increases the flexibility of Apache Pig: once you enter the Java UDF world, you can do almost anything by combining Java and Pig. The main challenge is keeping track of the objects handled inside the UDF, but you get better at it with plenty of trial and error.
- Parameter substitution is a prominent feature of Pig. With %declare and %default statements, it is possible to define custom variables inside the script. However, assigning values dynamically is a challenge.
- Before running in MapReduce mode (the default when you run pig), finish your tests in local mode with pig -x local and a small amount of data, since waiting on a cluster job just to see the script's results is inefficient.
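
As a rough illustration of the loader, parameter, and local-mode tips together, here is a minimal sketch. The macro name load_events, the file name main.pig, the parameters INPUT_DIR and MIN_TS, the paths, and the three-column schema are all hypothetical and chosen only for this example. First, the loader macro kept in its own file:

```pig
-- Loader.pig: macros that wrap LOAD statements (hypothetical example)
-- The path is passed in as a macro argument rather than read from a script parameter.
DEFINE load_events(input_path) RETURNS events {
    $events = LOAD '$input_path' USING PigStorage('\t')
              AS (user_id:chararray, action:chararray, ts:long);
};
```

Then a script that imports the macro and uses parameter substitution at the top level:

```pig
-- main.pig (hypothetical example)
IMPORT 'Loader.pig';

-- %default supplies a fallback that a -param option on the command line can override.
%default INPUT_DIR '/data/events';
-- %declare fixes a value inside the script itself.
%declare MIN_TS '0';

events  = load_events('$INPUT_DIR');
recent  = FILTER events BY ts >= $MIN_TS;
by_user = GROUP recent BY user_id;
counts  = FOREACH by_user GENERATE group AS user_id, COUNT(recent) AS n;

-- DUMP prints the relation, which is handy for checking intermediate steps.
DUMP counts;
```

Assuming a small tab-separated sample sits in a local directory such as sample/events, the script could be tried in local mode with pig -x local -param INPUT_DIR=sample/events main.pig before it is submitted to the cluster.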