When you run analyses on Hadoop, Apache Pig is one of the simplest ways to load and transform the data. An alternative is Apache Hive, which feels easier for people who already know SQL. I have used both, but I prefer writing scripts with Pig, since you can inspect your data at each step of the script. Moreover, it is more human-readable than SQL-style code blocks (nested SQL, etc.).
In the last two years, I have written many Pig scripts, and I would like to share some tips about Pig scripting.
Use DEFINE statements to separate the file-loading logic into a different Pig file, which can be named Loader.pig and pulled into your main script with IMPORT.
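As a minimal sketch of this idea (the file path, field names, and macro name here are hypothetical), Loader.pig can hold a macro that each main script imports:

```pig
-- Loader.pig: reusable loading logic, kept out of the main scripts
DEFINE load_users(input_path) RETURNS users {
    $users = LOAD '$input_path' USING PigStorage('\t')
             AS (user_id:long, name:chararray, signup_date:chararray);
};
```

A main script then only needs two lines to reuse it:

```pig
IMPORT 'Loader.pig';
users = load_users('/data/users');
```

This way, when the input schema changes, you edit one file instead of every script that reads the data.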
When Pig does not provide the functionality you need, write your own User Defined Functions in Java. For example, if you need to compare object values, or if you want to use a custom sorting algorithm, you can write your own Java code and call it from a Pig script. This feature greatly increases the flexibility of Apache Pig. Once you enter the Java UDF world, you can do almost anything through the collaboration of Java and Pig. Here, the main challenge is tracking the objects passed into the UDF, but you improve with practice.
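As an illustrative sketch of the Pig side of this (the jar name, class name, and fields are hypothetical, assuming the UDF extends Pig's EvalFunc and is compiled into a jar), you register the jar, alias the class, and call it like a built-in function:

```pig
-- Register the jar that contains your compiled Java UDF classes
REGISTER 'my-udfs.jar';

-- Give the fully qualified class name a short alias
DEFINE CompareValues com.example.pig.CompareValues();

records = LOAD '/data/records' USING PigStorage('\t')
          AS (id:long, a:int, b:int);

-- The UDF is invoked inside FOREACH ... GENERATE like any other function
flagged = FOREACH records GENERATE id, CompareValues(a, b) AS cmp;
```

The tricky part the paragraph mentions — tracking objects inside the UDF — happens in the Java code, where every input arrives as a generic Tuple and you must cast fields to the types you declared in the schema.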
Parameter substitution is a prominent feature of Pig. With the %declare and %default preprocessor statements, it is possible to define custom variables. However, dynamic value assignment is a challenge.
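A small sketch of how this looks in practice (the paths and parameter names are hypothetical):

```pig
%default input_path '/data/default';  -- fallback used when no -param is passed
%declare run_suffix 'daily';          -- fixed at preprocessing time

data = LOAD '$input_path' USING PigStorage(',');
STORE data INTO '/out/run_$run_suffix';
```

You can override a parameter at launch time with `pig -param input_path=/data/today script.pig`. For the dynamic-assignment challenge, note that %declare also accepts a backquoted shell command (e.g. a date command), whose output becomes the parameter value at preprocessing time.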
Before running on the cluster, complete your tests in local mode (pig -x local) with a small sample of data, since it is inefficient to wait for script results in MapReduce mode.
I suggest you watch the TED video titled “A monkey that controls a robot with its thoughts”. I find this video inspiring for understanding what the future will look like, especially once we start directly controlling robots with our brains from a distance.
Today, with the ability to analyze billions of records in real time, companies can better understand their customers’ emotions. Tech companies in particular should develop big-data-based models to improve user experience and reduce churn.
At this point, I would like to share my experience with the Need for Speed iOS app. Over the last three days, there was an in-game competition that awarded players a Jaguar sports car. However, the car only becomes available after full dedication to the game, with a high probability of spending real money.
My wife and I played this game over those three days in our spare time, on the road, at home, etc. After reaching 75% progress, I realized I would not succeed within the time limit, so I deleted the game. My wife, on the other hand, completed 95% of it, but the ending was the same: she deleted the game too. The point I would like to emphasize in this story is that, using big data technology and a churn prediction model, the game company could have kept us playing — or at least my wife.
Here, an algorithm could calculate a churn score for each player. By collecting location data from each device, it is possible to identify couples (small communities). When one member of a community deletes the app, the churn probability of the other members increases sharply. Uninstalling the app can be inferred from a break in the user’s daily usage routine. Finally, the app could offer extra advantages to the remaining members of that community.
This is my advice to the Big Data team at EA Games. I am aware that they already track application usage routines and offer extra in-game rewards on each attempt. However, there is a substantial need for custom algorithms like this.
I am curious about this technology. I was expecting advances in Wi-Fi, but who could have imagined that LED lights would be the next data-streaming medium? There are many concerns about streaming data through light, including its potential health impacts. There are still grey areas in my mind about how fully synchronous data streaming all over the world could work with Li-Fi. This is a true innovation. Congrats, Prof. Harald Haas!
Overlapping community detection allows placing one node in multiple communities. Many algorithms have been proposed for this problem; however, their accuracy depends on the overlapping level of the structure. In this work, we aim to find relatively small overlapping communities independently of their overlapping level. We define k-connected node groups as cohesive groups in which each pair of nodes is connected by at least k node-disjoint paths. We propose the algorithm EMOC, which first finds k-connected groups from the perspective of each node and then merges them to detect overlapping communities. We evaluate the accuracy of EMOC on artificial networks by comparing its results with those of leading algorithms. The results indicate that EMOC can find small overlapping communities at any overlapping level. Results on a real-world network show that EMOC finds relatively small but consistent communities.