It’s been few years since I have been working on HIVE, Impala, Spark SQL, PySpark, Redshift and in the journey so far I have migrated many applications in different RDBMS like Teradata, Netezza, Oracle to cloud or on-premise Hadoop environment for various clients. I have seen the change in client preferences from Hive to ACID Hive to Spark SQL to PySpark. In this post I will share my observation so far when it comes to migrating existing Teradata application to Hadoop.
First question that shall come to your mind is that if you really want to move to Hadoop ?
Few common reasons are:
1) Migrate some heavy computation to Hadoop i.e. High AMPCPU consuming jobs to Hadoop.
2) Migrate jobs that involve high volume data or to free some storage space in Teradata.
3) Non-critical jobs or specific business applications.
4) Application with many adhoc requests.
All the points mentioned above cater to another reason that is making room for new business critical jobs in Teradata.
Before you move to Teradata, few technology related points you may have to consider.
1) Do you want to move to cloud or on-premise Hadoop ?
2) What shall be the tech-stack you want in Target environment ?
3) Do you have internal team or capability to support chosen technology ?
4) What are the pros and cons of each technology ?
5) Which technology suits best as per your requirement ?
6) How much budget you have for testing the first phase ?
Many clients prefer being Technology agnostic so you may not want to abide by a single vendor or hadoop distributor. A lot depends on your use-case which you have picked for moving to Teradata. I have worked with client, who intentionally involved Cloudera and Databricks along with AWS EMR just to have flexibility at later stage to move away from one vendor if something does not go as expected. I think that was a smart decision. Another client, which migrated Teradata dimension model to EMR – Spark. However for the business adhoc queries he preferred Redshift. So the ETL was done in EMR spark and the processed data was pushed to Redshift for business queries. That makes sense because if you prefer response in a express manner for your queries, spark or hive is not the ideal technology. Impala, Redshift may have better response time for adhoc business queries.
Let’s talk about the technical consideration from the SQL perspective you must consider before moving to Hadoop. You must understand that Teradata is built to run SQL queries at lightning fast speed. Teradata follows ANSI SQL standard with high overlapping percentage. In addition to that, Teradata also has extension to SQL which definitely makes SQL developer life easy. However Hadoop was NOT built to run SQL queries hence HIVE/Spark has yet to do lot of catching-up when it comes to supporting SQL standards. In short, many SQL queries which are supported in Teradata may not work on Hive/Spark.
Few common examples of SQL components that are not supported in HIVE:
1) Recursive Queries
2) Nested Subqueries and multiple Subqueries
3) Outer Joins with Non-equi join conditions
5) Stored Procedures
6) Views don’t support DML operations.
7) Datatypes like Time, Period and other complex types.
8) Constraints like Primary Key,Foreign Key, Unique, NOT NULL
9) SET tables
11) Update/Merge/Delete (need HIVE transactional tables to support)
13) Identity columns
Other than that few other things you may have to consider are :
1) String case-sensitiveness. Teradata is case-insensitive by default however HIVE is case-sensitive.
2) Also few things are supported however the syntax is different like Integer becomes Int in datatype.
3) Similarly INSERT query syntax changes if the table is partitioned in HIVE.
4) You may want to figure out any custom or Teradata in-built UDF (user-defined function) you are using which may not be available in HIVE.
5) SET operator like MINUS
6) UNION was not supported in HIVE 1.1 but 1.2 onwards it is supported.
7) Concat “||” to be replaced with CONCAT function
Although I have mentioned lot of points above which are not supported or indirectly supported I can tell you that even after these challenges , most of the SQL can be converted to HIVE compatible query.
After SQL syntax and components, I would like to talk about SQL Performance as well. Since I said it before Hadoop was not built for running usual SQL queries, so definitely you may see some obvious difference in execution time of SQL queries on HIVE when compared to Teradata. The execution time difference is bigger for queries like UPDATE/DELETE/MERGE. Other challenge I see here is Teradata has mature a lot in terms of handling complex queries and giving response in seconds. There are lots of material available on web for Teradata performance tuning which “actually work”. However for HIVE, you will find same performance tuning points every where and most of them won’t help much in actual scenarios. So I will recommend not to start with those applications which are time-bound i.e. time critical jobs. For example, there is a business application for which you must process ETL pipeline within 1 hour of receiving files from Source application. I will not recommend to start with this application. At the same time, if there is another application which is daily job and have bandwidth of 16-20 hours to complete on daily basis then it is recommended to try it on Hadoop. Note: 16-20 hours does not mean that job will take time to complete. But it is buffer time in which you can re-run jobs multiple times in a day in case of failures.
Are you planning to move to Hadoop from Teradata ? Or do you already have some experience ? I would love to hear about your comments and experience comparing Teradata & Hadoop Hive.
I have tried to cover topics at high level however if you wish to know anything specific let me know by leaving a comment. Hope this helps.