Ab Initio Interview Questions & Answers

1. What is the difference between rollup and scan?

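Rollup produces one summary record per key group, so it cannot generate cumulative (running) summary records. For those we use Scan, which emits an output record for every input record, carrying the running summary so far.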
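The contrast can be sketched in plain Python (this is not Ab Initio DML). The function names rollup_sum and scan_sum and the sample records are illustrative assumptions, and the input is assumed to be already sorted by key.

```python
from itertools import groupby
from operator import itemgetter

def rollup_sum(records):
    """Rollup-style: one summary record per key group."""
    return [(key, sum(amount for _, amount in group))
            for key, group in groupby(records, key=itemgetter(0))]

def scan_sum(records):
    """Scan-style: one output per input record, carrying the running
    total within its key group."""
    out = []
    for key, group in groupby(records, key=itemgetter(0)):
        running = 0
        for _, amount in group:
            running += amount
            out.append((key, running))
    return out

# Hypothetical (key, amount) records, already sorted by key.
records = [("A", 10), ("A", 5), ("B", 7), ("B", 3), ("B", 1)]
print(rollup_sum(records))  # [('A', 15), ('B', 11)]
print(scan_sum(records))    # [('A', 10), ('A', 15), ('B', 7), ('B', 10), ('B', 11)]
```

Rollup returns two records for two keys, while the scan returns five records for five inputs.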

2. What is the difference between partitioning with key and round-robin?

PARTITION BY KEY:

In this, we have to specify the key on which the partitioning will occur. All records with the same key value go to the same partition, so how well balanced the data is depends on the distribution of the key values. It is useful for key-dependent parallelism.

PARTITION BY ROUND-ROBIN:

In this, the records are partitioned sequentially, distributing data evenly in block-size chunks across the output partitions. It is not key-based and results in well-balanced data, especially with a block size of 1. It is useful for record-independent parallelism.
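A minimal Python sketch of the two strategies follows. The helper names, the sample records and the use of a CRC32 hash are assumptions made for illustration; they do not reflect how Ab Initio hashes keys internally.

```python
import zlib

def partition_by_key(records, key_index, n_partitions):
    """Send every record with the same key value to the same partition."""
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        # CRC32 is only a stand-in for a stable hash function.
        h = zlib.crc32(str(rec[key_index]).encode("utf-8"))
        parts[h % n_partitions].append(rec)
    return parts

def partition_round_robin(records, n_partitions, block_size=1):
    """Deal records out in block_size chunks, ignoring their content."""
    parts = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        parts[(i // block_size) % n_partitions].append(rec)
    return parts

# Hypothetical (customer, amount) records.
records = [("cust1", 10), ("cust2", 20), ("cust1", 30),
           ("cust3", 40), ("cust1", 50), ("cust2", 60)]

by_key = partition_by_key(records, 0, 3)
by_rr = partition_round_robin(records, 3)

print([len(p) for p in by_rr])  # [2, 2, 2]
```

With key partitioning all three cust1 records land in one partition, whatever its size; with round-robin the partition sizes are even, but records for one customer are spread out.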

3. How do you truncate a table?

There are several ways to do it.

1. Probably the easiest way is to use the Truncate Table component.

2. Run SQL or Update Table can be used to do the same thing.

3. Run Program.

4. What is the difference between a DB config and a CFG file?

A .dbc file holds the information Ab Initio needs to connect to the database in order to extract or load tables or views, while a .cfg file is the table configuration file created by db_config when using components such as Load DB Table.

5. Types of parallelism in detail.

There are 3 types of parallelism in Ab Initio.

1) Data Parallelism:

Data is processed at the different servers at the same time.

2) Pipeline parallelism:

In this, the records are processed in a pipeline, i.e. a component does not have to wait for the upstream component to finish processing all the records. Records that have been processed are passed straight on to the next component in the pipeline.

3) Component Parallelism:

In this, two or more components process the records in parallel.

Component parallelism :- A graph with multiple processes running simultaneously on separate data uses component parallelism.

Data parallelism :- A graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data parallelism. To support this form of parallelism, Ab Initio provides Partition components to segment data, and Departition components to merge segmented data back together.

Pipeline parallelism :- A graph with multiple components running simultaneously on the same data uses pipeline parallelism. Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel. NOTE: To limit the number of components running simultaneously, set phases in the graph.
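The streaming behaviour behind pipeline parallelism can be sketched with Python generators. This is an analogy only: generators interleave on one thread rather than running truly in parallel, and the stage names are invented for the example. It does show that a downstream stage handles a record before the upstream stage has produced the next one.

```python
def read_records(log):
    """Upstream stage: produce records one at a time."""
    for i in range(3):
        log.append(f"read {i}")
        yield i

def double(records, log):
    """Middle stage: transform each record as soon as it arrives."""
    for r in records:
        log.append(f"transform {r}")
        yield r * 2

def write(records, log):
    """Downstream stage: consume each record as soon as it is ready."""
    out = []
    for r in records:
        log.append(f"write {r}")
        out.append(r)
    return out

log = []
result = write(double(read_records(log), log), log)
print(result)   # [0, 2, 4]
print(log[:3])  # ['read 0', 'transform 0', 'write 0']
```

The first record is written before the second is read, which is the opposite of a phased run where one stage finishes completely before the next begins.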

6. What is the function you would use to transfer a string into a decimal?

For converting a string to a decimal we need to typecast it using the following syntax:

out.decimal_field :: ( decimal( size_of_decimal ) ) string_field;

The above statement converts the string to a decimal and populates the decimal field in the output.

7. How do you execute a graph from the start stage to the end stage? How do you run a graph on a non-Ab Initio system?

There are many ways to do this; here is one example. The components of a graph run phase by phase, in the order in which you defined the phases. You can also run the graph by creating ksh or sh scripts.

8. What is data mapping and data modelling?

Data mapping deals with the transformation of the extracted data at the FIELD level, i.e. the transformation of a source field to a target field is specified by the mapping defined on the target field. The data mapping is specified during the cleansing of the data to be loaded.

For example:

source;

string(35) name = "Siva Krishna ";

target;

string("01") nm=NULL(""); /* (maximum length is string(35)) */

Then we can have a mapping like:

Straight move. Trim the leading and trailing spaces.

The above mapping specifies the transformation of the field nm.

9. What is the difference between a sandbox and the EME? Can we perform check-in and check-out through a sandbox? Can anybody explain check-in and check-out?

Sandboxes are work areas used to develop, test or run code associated with a given project. Only one version of the code can be held within the sandbox at any time.

The EME datastore contains all versions of the code that have been checked into it. A particular sandbox is associated with only one project, whereas a project can be checked out to a number of sandboxes.

10. Explain environment variables with an example.

Environment variables serve as global variables in the Unix environment. They are used for passing values from one shell or process to another. They are inherited by Ab Initio as sandbox variables or graph parameters, such as

AI_SORT_MAX_CORE

AI_HOME

AI_SERIAL

AI_MFS etc.

To see which variables exist in your Unix shell, find out the naming convention and type a command like "env | grep AI". This will list all the matching variables set in the shell. You can refer to the graph parameters and components to see how these variables are used inside Ab Initio.

11. What are the graph parameters?

There are 2 types of graph parameters in Ab Initio:

1. Local parameters

2. Formal parameters (those whose values are supplied at run time)

How do you improve the performance of graphs in Ab Initio? Give some examples or tips.

There are many ways to improve the performance of graphs in Ab Initio. Here are a few points.

1. Use a multifile system (MFS) with Partition by Round-robin.

2. Where the data is large, use lookup_local rather than lookup if possible.

3. Take out unnecessary components such as Filter by Expression; apply the conditions in Reformat, Join or Rollup instead.

4. Use Gather instead of Concatenate.

5. Tune max-core for optimal performance.

6. Try to avoid unnecessary phases.

12. What is the difference between a checkpoint and a phase?

Checkpoint:

- A checkpoint is a recovery point: when a graph fails in the middle of processing, it can be restarted from the last checkpoint

- The rest of the process continues from the checkpoint

- The data saved at the checkpoint is fetched, and execution continues once the problem has been corrected.

Phase:

- If a graph is created with phases, each phase is assigned some part of memory, one after another.

- The phases run one by one

- The intermediate files are deleted

13. What is a deadlock and how does it occur?

- A graph or program hang is known as a deadlock.

- The program stops making progress when a deadlock occurs.

- Certain data flow patterns are likely to cause a deadlock

- If the flows of a graph diverge and then converge within a single phase, there is potential for a deadlock

- Where the flows converge, a component might wait for records to arrive on one flow while unread data accumulates on the others.

- In GDE version 1.8, the occurrence of a deadlock is very rare

14. State the relation between the EME, the GDE and the Co>Operating System.

EME:

- EME stands for Enterprise Metadata Environment

- It is the repository for Ab Initio. It holds transformations, database configuration files, metadata and target information

GDE:

- GDE stands for Graphical Development Environment

- It is the end-user environment. Graphs are developed in this environment

- It provides a GUI for editing and executing Ab Initio programs

Co>Operating System:

- The Co>Operating System is the Ab Initio server.

- It is installed on a specific OS platform, known as the native OS.

- All graphs developed in the GDE are later deployed and executed on the Co>Operating System

15. What kinds of parallelism does Ab Initio support?

Ab Initio supports 3 kinds of parallelism. They are

- Data Parallelism : The data is divided into partitions, and the same processing is applied to each partition in parallel within a single application

- Component Parallelism : Different components work on different data in parallel within a single application

- Pipeline Parallelism : Data is passed from one component to the next, and both components work on the data at the same time.

16. Which operations help to avoid duplicate records?

Duplicate records can be avoided by using the following:

- Using the Dedup Sorted component

- Performing aggregation

- Utilizing the Rollup component
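The sorted-dedup approach can be sketched in Python as follows. The function name and its keep argument are modelled only loosely on the component and are assumptions for illustration; the input must already be sorted on the key.

```python
from itertools import groupby

def dedup_sorted(records, key, keep="first"):
    """Keep one record per key group from key-sorted input."""
    out = []
    for _, group in groupby(records, key=key):
        group = list(group)
        out.append(group[0] if keep == "first" else group[-1])
    return out

# Hypothetical (id, value) records, sorted by id.
records = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")]

print(dedup_sorted(records, key=lambda r: r[0]))
# [(1, 'a'), (2, 'c'), (3, 'd')]
print(dedup_sorted(records, key=lambda r: r[0], keep="last"))
# [(1, 'b'), (2, 'c'), (3, 'e')]
```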

17. What is the max-core of a component?

- max-core is the maximum amount of memory a component may use for its calculations

- Different components have different max-core values

- The max-core setting influences a component's performance

- A poorly chosen max-core value can slow the process down, and a well-chosen one can speed it up

18. Explain the first_defined function with an example.

- This function is similar to the NVL() function in the Oracle database

- It returns the first value that is not NULL among the values passed to the function, and that value is assigned to the variable

Example: A set of variables, say v1,v2,v3,v4,v5,v6, are assigned NULL.

Another variable, num, is assigned the value 340 (num=340)

num = first_defined(NULL, v1,v2,v3,v4,v5,v6,num)

The result of num is 340
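The same behaviour can be mimicked in Python, with None standing in for NULL. This is an analogy to show the semantics, not the DML function itself.

```python
def first_defined(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None

v1 = v2 = v3 = v4 = v5 = v6 = None
num = 340
num = first_defined(None, v1, v2, v3, v4, v5, v6, num)
print(num)  # 340
```

Note that only None is skipped: a value such as 0 or an empty string counts as defined.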

19. Explain how the decimal_strip function works.

- decimal_strip extracts the decimal value from the data.

- It trims any leading zeros

- The result is a valid decimal number

Ex:

decimal_strip("-0184o") := "-184"

decimal_strip("oxyas97abc") := "97"

decimal_strip("+$78ab=-*&^*&%cdw") := "78"

decimal_strip("Honda") := "0"
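A rough Python approximation that reproduces the four examples above is shown here. It is a sketch built only from those examples: it ignores decimal points, treats a minus sign as significant only when it appears before the first digit, and may differ from the real function on other inputs.

```python
DIGITS = "0123456789"

def decimal_strip(text):
    """Approximate decimal_strip: keep digits, drop leading zeros."""
    digits = "".join(ch for ch in text if ch in DIGITS)
    # Position of the first digit, or the end of the string if none.
    first_digit = next(
        (i for i, ch in enumerate(text) if ch in DIGITS), len(text))
    # Assumption: only a minus sign before the first digit counts.
    negative = "-" in text[:first_digit]
    stripped = digits.lstrip("0") or "0"
    if negative and stripped != "0":
        return "-" + stripped
    return stripped

print(decimal_strip("-0184o"))             # -184
print(decimal_strip("oxyas97abc"))         # 97
print(decimal_strip("+$78ab=-*&^*&%cdw"))  # 78
print(decimal_strip("Honda"))              # 0
```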

20. Explain PDL with an example.

- PDL (Parameter Definition Language) is used to make a graph behave dynamically

- Suppose there is a need to add a dynamic field to a predefined DML while executing the graph

- Then a graph-level parameter can be defined

- Utilize this parameter while embedding the DML in the output port.

- For example: define a parameter named myfield with the value "string("|") name;"

- Use ${myfield} when embedding the DML in the out port.

- Use $ substitution as the interpretation option
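The effect of $ substitution can be imitated with Python's string.Template, which uses the same ${name} placeholder syntax. The record layout and the id field below are invented for the illustration; only the myfield parameter and its value come from the example above.

```python
from string import Template

# Hypothetical embedded DML with a placeholder for the dynamic field.
embedded_dml = Template("record string(10) id; ${myfield} end;")

# The graph-level parameter and its value from the example above.
parameters = {"myfield": 'string("|") name;'}

resolved = embedded_dml.substitute(parameters)
print(resolved)  # record string(10) id; string("|") name; end;
```

Changing the parameter value changes the record format without editing the embedded DML itself.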

21. Describe the order in which parameters are evaluated.

The order of evaluation is as follows:

- The host setup script is executed first

- All common (that is, included) parameters are evaluated

- All sandbox parameters are evaluated

- The project script, project-start.ksh, is executed

- All formal parameters are evaluated

- Graph parameters are evaluated

- The start script of the graph is executed

22. What is the function that transfers a string into a decimal?

- Use a decimal cast with the size in the transform function when the size of the string and the decimal is the same.

- Ex: The source field is defined as string(8).

- The destination is defined as decimal(8)

- Let us assume the field name is salary.

- The rule is out.field :: (decimal(8)) in.salary

- If the size of the destination field is smaller than that of the input, the string_substring() function can be used

- Ex: Say the destination field is decimal(5); then use

- out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5))

- The string_lrtrim function is used to remove leading and trailing spaces from the string

23. Explain the methods to improve the performance of a graph.

The following are ways to improve the performance of a graph:

- Make sure that a limited number of components is used in a particular phase

- Use optimum max-core values for sort and join components.

- Use the minimum number of sort components

- Use the minimum number of sorted join components, and replace them with in-memory (hash) joins if needed and possible

- Keep only the needed fields in sort, reformat and join components

- Use phasing or flow buffers with merges and sorted joins

- Use a sorted join when the two inputs are huge; otherwise use a hash join

24. Have you worked with packages?

Multistage transform components use packages by default. However, a user can create his own set of functions in a transfer function and include it in other transfer functions.

25. Have you used the Rollup component? Describe how.

If the user wants to group records on particular field values, Rollup is the best way to do that. Rollup is a multistage transform and it contains the following mandatory functions.

1. initialize

2. rollup

3. finalize

You also need to declare a temporary variable if you want to get counts for a particular group.

For each group, it first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once, after the last rollup call.
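The calling sequence described above can be sketched in Python. The driver run_rollup, the record layout and the temporary dictionary are assumptions chosen for the illustration; the three stage functions only mirror the roles of the DML functions, not their actual signatures.

```python
from itertools import groupby

def initialize(record):
    """Called once per group: set up the temporary variable."""
    return {"count": 0, "total": 0}

def rollup(temp, record):
    """Called once per record in the group: update the temporary."""
    temp["count"] += 1
    temp["total"] += record["amount"]
    return temp

def finalize(temp, record):
    """Called once after the last record: build the output record."""
    return {"dept": record["dept"],
            "count": temp["count"], "total": temp["total"]}

def run_rollup(records, key):
    """Drive the three stages over key-sorted input."""
    out = []
    for _, group in groupby(records, key=key):
        temp, last = None, None
        for rec in group:
            if temp is None:
                temp = initialize(rec)
            temp = rollup(temp, rec)
            last = rec
        out.append(finalize(temp, last))
    return out

# Hypothetical records, already sorted by dept.
records = [{"dept": "hr", "amount": 10},
           {"dept": "hr", "amount": 20},
           {"dept": "it", "amount": 5}]
summary = run_rollup(records, key=lambda r: r["dept"])
print(summary)
# [{'dept': 'hr', 'count': 2, 'total': 30}, {'dept': 'it', 'count': 1, 'total': 5}]
```

The temporary dictionary plays the part of the temporary variable that holds the per-group count.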

26. How do you add default rules in a transform?

Add Default Rules: opens the Add Default Rules dialog. Select one of the following:

Match Names: generates a set of rules that copies input fields to output fields with the same name.

Use Wildcard (.*) Rule: generates one rule that copies input fields to output fields with the same name.

1) If it is not already displayed, display the Transform Editor Grid.

2) Click the Business Rules tab if it is not already displayed.

3) Select Edit > Add Default Rules.

In a Reformat, if the destination field names are the same as, or a subset of, the source field names, there is no need to write anything in the Reformat xfr, provided you need no real transformation beyond reducing the set of fields or splitting the flow into several flows.

27. What is the difference between partitioning with key and round robin?

Partition by Key (hash partition): This partitioning technique is used to partition data when the keys are diverse. If a single key value is present in large volume, there can be large data skew. Even so, this method is the one used more often for parallel data processing.

Round-robin partition is another partitioning technique, which distributes the data uniformly across the destination data partitions. The skew is zero in this case when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt among 4 players in a round-robin manner.

28. How do you improve the performance of a graph?

There are many ways the performance of a graph can be improved.

1) Use a limited number of components in a particular phase

2) Use optimum max-core values for sort and join components

3) Minimise the number of sort components

4) Minimise the number of sorted join components and, if possible, replace them with in-memory (hash) joins

5) Use only the required fields in sort, reformat and join components

6) Use phasing or flow buffers with merges and sorted joins

7) If the two inputs are huge, use a sorted join; otherwise use a hash join with the proper driving port

8) For large datasets, don't use Broadcast as the partitioner

9) Minimise the use of regular expression functions such as re_index in transfer functions

10) Avoid repartitioning data unnecessarily

Try to keep the graph running in MFS for as long as possible. For this, the input files should be partitioned and, if possible, the output file should be partitioned as well.

29. How do you truncate a table?

Either from the Run SQL component in Ab Initio, using the DDL statement "truncate table", or by using the Truncate Table component in Ab Initio.

30. What is the relation between the EME, the GDE and the Co>Operating System?

EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co>Operating System can be described as the Ab Initio server. The relation between the Co>Operating System, the EME and the GDE is as follows.

The Co>Operating System is the Ab Initio server. It is installed on a particular OS platform, which is called the native OS. The EME is a repository, just like the repository in Informatica; it holds the metadata, transformations, db config files, and source and target information. The GDE is the end-user environment where graphs (the equivalent of mappings in Informatica) are developed. The designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The GDE is on the user side, whereas the EME is on the server side.