Pentaho Data Integration 4 Cookbook

Note that this approach is less flexible than the previous one. For example, if you have to provide values for parameters with different data types, you will not be able to put them in the same column on different rows. Suppose you want to run the statement for three different sets of parameter values. You don't have to run the transformation three times in order to do this: you can have a dataset with three rows, one for each set of parameters, as shown below. Then, in the Table Input setting window, you have to check the Execute for each row? option. This way, the statement will be prepared, and the values coming to the Table Input step will be bound to the placeholders once for each row in the dataset coming to the step. A sketch of such a statement follows.
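As an illustration, this is the kind of parameterized statement the Table Input step prepares; the table and column names here are hypothetical, not taken from the recipe:

```sql
-- A minimal sketch: one placeholder per incoming field, bound once per row.
SELECT title, price
FROM   books
WHERE  genre = ?    -- bound to the first field of each incoming row
  AND  price < ?    -- bound to the second field
```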

For this example, the result would look like this:

See also: Getting data from a database by running a query built at runtime.

Getting data from a database by running a query built at runtime

When you work with databases, most of the time you start by writing an SQL statement that gets the data you need. However, there are situations in which you don't know that statement exactly. Maybe the names of the columns to query are in a file, or the name of the column by which you will sort will come as a parameter from outside the transformation, or the name of the main table to query changes depending on the data stored in it (for example, sales tables split by period). Assume the following situation: you have a database with data about books and their authors, and you want to generate a file with a list of titles.

    Whether to retrieve the data ordered by title or by genre is a choice that you want to postpone until the moment you execute the transformation. The column that will define the order of the rows will be a named parameter. Remember that Named Parameters are defined in the Transformation setting window and their role is the same as the role of any Kettle variable.

If you prefer, you can skip this step and define a standard variable for this purpose. Now drag a Table Input step to the canvas. Then create and select the connection to the book's database. Check the option Replace variables in script?. Use an output step (for example, a Text file output step) to send the results to a file, save the transformation, and run it. Open the generated file and you will see the books ordered by title. A sketch of the query follows.
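As an illustration, the statement in the Table Input step could look like the following; the variable name ORDER_COLUMN and the column names are assumptions, not necessarily those used in the book:

```sql
-- ${ORDER_COLUMN} is replaced before the statement is prepared,
-- because Replace variables in script? is checked.
SELECT   title, genre
FROM     books
ORDER BY ${ORDER_COLUMN}
```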

Now try again. Press F9 to run the transformation one more time, this time providing the genre column as the value of the named parameter. Press the Launch button. Open the generated file. This time you will see the titles ordered by genre. When the transformation is initialized, PDI replaces the variables by their values, provided that the Replace variables in script? option is checked. You could even hold the full statement in a variable.

Note, however, that you need to be cautious when implementing this. A wrong assumption about the metadata generated by those predefined statements can make your transformation crash. You can also use the same variable more than once in the same statement. This is an advantage of using variables as an alternative to question marks when you need to execute parameterized SELECT statements.

See also: Getting data from a database by providing parameters. That recipe shows you an alternative way to parameterize a query.

Inserting or updating rows in a table

Two of the most common operations on databases, besides retrieving data, are inserting and updating rows in a table. PDI has several steps that allow you to perform these operations. Before inserting or updating rows in a table by using a PDI step, it is critical that you know which field or fields in the table uniquely identify a row.

If you don't have a way to uniquely identify the records, you should consider other steps, as explained in the There's more section. Assume this situation: you have a file with new employees of Steel Wheels, and you have to insert those employees in the database. The file also contains old employees who have changed either the office where they work, the extension number, or other basic information. You will take the opportunity to update that information as well.

Create a transformation and use a Text file input step to read the employees file. Provide the name and location of the file, specify comma as the separator, and fill in the Fields grid. Remember that you can quickly fill the grid by pressing the Get Fields button. Add an Insert/Update step and, as target table, type employees. Fill the grids as shown. Save and run the transformation. Explore the employees table. For each row in your stream, Kettle looks for a row in the table that matches the condition you put in the upper grid, the grid labeled The key(s) to look up the value(s):.

If it doesn't find one, it inserts a row following the directions you put in the lower grid. If it does find one, it updates that row according to what you put in the lower grid: it only updates the columns where you put Y under the Update column. If you run the transformation with log level Detailed, you will be able to see in the log the real prepared statements that Kettle performs when inserting or updating rows in a table; a sketch follows.
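As an illustration, these are the kinds of statements the Insert/Update step prepares; the column names are assumptions based on the Steel Wheels employees example:

```sql
-- Lookup: does a row matching the keys exist?
SELECT employeenumber, extension, officecode
FROM   employees
WHERE  employeenumber = ?;

-- If not found: insert a new row.
INSERT INTO employees (employeenumber, lastname, firstname, extension, officecode)
VALUES (?, ?, ?, ?, ?);

-- If found: update only the columns marked with Y.
UPDATE employees
SET    extension = ?, officecode = ?
WHERE  employeenumber = ?;
```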

Here there are two alternative approaches to this use case. This would be faster because you would be avoiding unnecessary lookup operations. The Table Output step is really simple to configure: just select the database connection and the table where you want to insert the records. If the names of the fields coming to the Table Output step are the same as the names of the columns in the table, you are done. In order to handle the error, when creating the hop from the Table Output step towards the Update step, select the Error handling of step option.

Alternatively, right-click the Table Output step and select Define error handling. Finally, fill the lower grid with those fields that you want to update, that is, those rows that had Y under the Update column. In this case, Kettle tries to insert all records coming to the Table Output step. The rows for which the insert fails go to the Update step, and those rows are updated. Otherwise, the Table Output would insert all the rows; those that already existed would be duplicated instead of updated.

In general, for best-practice reasons, this is not an advisable solution. If the table where you have to insert data defines a primary key, you should generate it:

• This recipe explains how to do it when the primary key is a simple sequence.
• Same as the previous bullet, but in this case the primary key is based on stored values.

Inserting new rows where a simple primary key has to be generated

It's very common to have tables in a database where the values for the primary key column can be generated by using a database sequence (in those DBMS that have that feature, for example Oracle), or simply by adding 1 to the maximum value in the table.

Loading data into these tables is very simple. This recipe teaches you how to do this through the following exercise: there are new offices at Steel Wheels. Getting ready: For this recipe you will use the Pentaho sample database. If you don't have that database, you'll have to follow the instructions in the introduction of this chapter.

As you will insert records into the offices table, it would be good to explore that table before doing any insert operations. Create a transformation and create a connection to the sampledata database. Use a Text file input step to read the offices file. Add a Combination lookup/update step. Double-click the step, select the connection to the sampledata database, and type offices as the Target table. Fill the Key fields grid as shown. For the Creation of technical key fields, leave the default values.

From the Output category of steps, add an Update step. It's time to save the transformation and run it to see what happens. As you might guess, three new offices have been added, with primary keys 8, 9, and 10. In many situations, before inserting data into a table you have to generate the primary key. Because the offices are new, there are no rows in the table with the same combination of address, city, and country values, so the lookup fails.

Then, it inserts a row with the generated primary key and the fields you typed in the grid. Finally, the step adds the generated primary key value to the stream. But as you could see, the step can also be used in the particular situation where you have to generate a primary key. In the recipe you generated the PK as the maximum plus one but, as you can see in the setting window, a database sequence can also be used instead; both options are sketched below.
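As an illustration, these are the two ways the step can obtain the next technical key; the column and sequence names are assumptions:

```sql
-- Option 1: use the maximum value in the table plus one.
SELECT COALESCE(MAX(officecode), 0) + 1 FROM offices;

-- Option 2: use a database sequence (in DBMSs that support them, e.g. Oracle).
SELECT seq_offices.NEXTVAL FROM dual;
```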

Now suppose that you had a row that already existed in the table. In that case, the lookup would have succeeded and the step wouldn't have inserted a new row. The existing primary key would have been added to the stream, ready to be used further in the transformation, for example for updating other fields as you did in the recipe, or for inserting data in a related table.

Note that this is a potentially slow step, as it uses all the values for the comparison. See also: Inserting new rows when the primary key has to be generated based on stored values. That recipe explains the case where the primary key to be generated is not as simple as adding one to the last primary key in the table.

Inserting new rows where the primary key has to be generated based on stored values

There are tables where the primary key is not a database sequence, nor a consecutive integer, but a column which is built based on a rule or pattern that depends on the keys already inserted.

For example, imagine a table where the values for the primary key are an A followed by a sequence number. In this case, you can guess the rule: the next key would be an A followed by the next number in the sequence. This seems too simple, but doing it in PDI is not trivial. This recipe will teach you how to load a table where a primary key has to be generated based on existing rows, as in that example.

Suppose that you have to load author data into the book's database. You have the main data for the authors, and you have to generate the primary key as in the example above. Getting ready: Run the script that creates and loads data into the book's database. Create a transformation and create a connection to the book's database. Use a Text file input step to read the authors file.

For simplicity, the authors file contains just the basic author data. To generate the next primary key, you need to know the current maximum, so use a Table Input step to get it (a sketch of the query follows). The task seems simple, but it will take several Kettle steps to do it. Then, using a Join Rows (cartesian product) step, join both streams. Your transformation should look like this:
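As an illustration, the Table Input query could look like the following; the table and column names, and the exact key format, are assumptions based on the example:

```sql
-- Grab the numeric part of the last primary key (keys look like 'A' + number).
SELECT MAX(CAST(SUBSTRING(id_author, 2) AS UNSIGNED)) AS last_key
FROM   authors;
```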

Add an Add sequence step; for the rest of the fields in the setting window, leave the default values. Add a Calculator step to build the keys; you do it by filling the setting window as shown. In order to insert the rows, add a Table output step, double-click it, and select the connection to the book's database. As Target table, type authors. Check the option Specify database fields, select the Database fields tab, and fill the grid as follows. Save and run the transformation, and explore the authors table.

When you have to generate a primary key based on the existing primary keys, unless the new primary key is simple to generate by adding one to the maximum, there is no direct way to do it in Kettle. One possible solution is the one shown in the recipe: getting the last primary key in the table, combining it with your main stream, and using those two sources for generating the new primary keys.

This is how it worked in this example. First, by using a Table Input step, you found out the last primary key in the table. In fact, you got only the numeric part needed to build the new key. In this exercise, the value was 9. With the Join Rows (cartesian product) step, you added that value as a new column in your main stream.

Taking that number as a starting point, you needed to build the new primary keys, one for each new row. The Calculator step adds the sequence to the starting value, then converts the result to a String, giving it a zero-padded format mask, and finally concatenates the literal A with the previously calculated number. Note that this approach works as long as you have a single-user scenario. If you run multiple instances of the transformation, they can select the same maximum value and try to insert rows with the same PK, leading to a primary key constraint violation.

The key in this exercise is to get the last (or maximum) primary key in the table, join it to your main stream, and use that data to build the new key. After the join, the mechanism for building the new key would depend on your particular case. See also: Inserting new rows when a simple primary key has to be generated.

If the primary key to be generated is simply a sequence, it is recommended to examine that recipe.

Deleting data from a table

If you face the second of the above situations, you can even use a Truncate table job entry. For more complex situations, you can use the Delete step. Let's suppose the following situation: you have a database with outdoor products. Each product belongs to a category: tools, tents, sleeping bags, and so on. Getting ready: In order to follow the recipe, you should download the material for this chapter: a script for creating and loading the database, and an Excel file with the list of categories. After creating the outdoor database and loading data by running the script provided, and before following the recipe, you can explore the database.

The value to which you will compare the price before deleting will be stored as a named parameter. Drag an Excel input step to the canvas to read the Excel file with the list of categories. After that, add a Database lookup step. So far, the transformation looks like this. For higher volumes, it's better to get the variable just once in a separate stream and join the two streams with a Join Rows (cartesian product) step. Select the Database lookup step and do a preview.

You should see this. Finally, add a Delete step; you will find it under the Output category of steps. Double-click the Delete step, select the outdoor connection, and fill in the key grid as follows. Run the transformation and explore the database. The Delete step allows you to delete rows in a table in a database based on certain conditions. In this case, you intended to delete rows from the table products where the price was less than or equal to 50, and the category was in a list of categories, so the Delete step is the right choice.

This is how it works: the step builds a prepared statement, and then, for each row in your stream, PDI binds the values of the row to the placeholders in the prepared statement. Let's see it by example. In the transformation, you built a stream where each row had a single category and the value for the price. Note that the conditions in the Delete step are based on fields in the same table. In this case, as you were provided with category descriptions, and the products table does not have the descriptions but the ID of the categories, you had to use an extra step to get that ID: a Database lookup. A sketch of the prepared statement follows.
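As an illustration, the prepared statement for this recipe could look like this; the column names are assumptions based on the outdoor example:

```sql
-- One execution per incoming row: price and category ID are bound each time.
DELETE FROM products
WHERE  price <= ?
  AND  id_category = ?;
```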

Suppose that the first row in the Excel file had the value tents. Refer to the Database lookup recipe if you need to understand how that step works.

Creating a database table from PDI (design time)

These are some use cases:

• You receive a flat file and have to load the full content in a temporary table.
• You have to create and load a dimension table with data coming from another database.

You could write a CREATE TABLE statement from scratch and then create the transformation that loads the table, or you could do all that in an easier way from Spoon.

In this case, suppose that you received a file with data about countries and the languages spoken in those countries. You need to load the full content into a temporary table. The table doesn't exist, and you have to create it based on the content of the data. Getting ready: In order to follow the instructions, you will need the countries file.

Create a transformation and create a connection to the database where you will save the data. Add a Get data from XML step to read the countries file. Fill the Fields grid as follows. The @ symbol preceding the field name is optional: by selecting Attribute as Element, Kettle automatically understands that it is an attribute. From the Output category, drag and drop a Table Output step.

Create a hop from the Get data from XML step to this new step. Double-click the Table Output step and select the connection you just created. Click on the SQL button; a window with the generated statement will show up. After clicking on Execute, a window will show up telling you that the statement has been executed, that is, the table has been created. Run the transformation: all the information coming from the XML file is saved into the table just created. PDI allows you to create or alter tables in your databases depending on the tasks implemented in your transformations or jobs.

To understand what this is about, let's explain the previous example. The insert is made based on the data coming to the Table Output step and the settings you put in the Table Output configuration window, for example the name of the table or the mapping of the fields. When you click on the SQL button in the Table Output setting window, Kettle builds the statements needed to execute that insert successfully. When the window with the generated statement appeared, you executed it; a sketch of such a statement is shown below.
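As an illustration, the generated DDL could look like the following; the table name countries_stage and the column definitions are assumptions based on the countries example:

```sql
-- Kettle derives the column definitions from the metadata of the incoming fields.
CREATE TABLE countries_stage (
  country    VARCHAR(50),
  language   VARCHAR(50),
  percentage DOUBLE
);
```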

This causes the table to be created, so you can safely run the transformation and insert into the new table the data coming from the file. The SQL button is present in several database-related steps. In all cases its purpose is the same: determine the statements to be executed in order to run the transformation successfully. Note that in this case the execution of the statement is not mandatory, but recommended. You can execute the SQL as it is generated, you can modify it before executing it as you did in the recipe, or you can just ignore it.

Sometimes the SQL generated includes dropping a column just because the column exists in the table but is not used in the transformation. In that case, you shouldn't execute it. Read the generated statement carefully before executing it. Finally, you must know that if you run the statement from outside Spoon, then in order to see the changes inside the tool you either have to clear the cache (by right-clicking the database connection and selecting the Clear DB Cache option) or restart Spoon.

See also: Creating or altering a database table from PDI (runtime). Instead of doing these operations from Spoon during design time, you can do them at runtime. That recipe explains the details.

Creating or altering a database table from PDI (runtime)

When you are developing with PDI, you know (or have the means to find out) whether the tables you need exist, and whether they have all the columns you will read or update. If they don't exist or don't meet your requirements, you can create or modify them, and then proceed.

Assume the following scenarios:

• You need to load some data into a temporary table. This task is part of a new requirement, so the table doesn't exist.
• The table you need exists, but you need to add some new columns to it before proceeding.

While you are creating the transformations and jobs, you have the chance to create or modify those tables. But if these transformations and jobs are to be run in batch mode in a different environment, nobody will be there to do these verifications, or to create or modify the tables.

You need to adapt your work so these things are done automatically. Suppose that you need to do some calculations and store the results in a temporary table that will be used later in another process. As this is a new requirement, it is likely that the table doesn't exist in the target database. You can create a job that takes care of this. Create a job, and add a Start job entry. Then add a Table exists entry and an SQL entry, and link all the entries as shown.

Please review the generated statement and fix it if needed, because you may be using a different DBMS. Save the job and run it. Run the job again: nothing should happen. The Table exists entry, as implied by its name, verifies that a table exists in your database. As with any job entry, this entry either succeeds or fails. If it fails, the job creates the table with an SQL entry. If it succeeds, the job does nothing.

The SQL entry is very useful, not only for creating tables as you did in the recipe, but also for executing very simple statements, for example setting a flag before or after running a transformation. Its main use, however, is executing DDL statements. On the other side, in order to decide whether it was necessary to create the table or not, you used a Table exists entry. In addition to this entry, and before verifying the existence of the table, you could have used the Check Db connections entry.

This entry allows you to see if the database is available. Now, let's suppose the table exists, but it is an old version that doesn't have all the columns you need. In this case, you can use an extra useful entry: Columns exist in a table. If you detect that a column is not present, you can alter the table by adding that column, also with an SQL job entry, as sketched below.
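As an illustration, the SQL job entries in this scenario could hold statements like these; the table and column names are assumptions:

```sql
-- SQL entry run when the Table exists check fails: create the table.
CREATE TABLE tmp_calculations (
  product_id INT,
  total      DOUBLE
);

-- SQL entry run when the Columns exist in a table check fails: add the column.
ALTER TABLE tmp_calculations ADD COLUMN calc_date DATE;
```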

Creating or altering tables is not a task that should be done as part of an ETL process; Kettle allows you to do it, but you should be careful when using these features. Instead of doing these operations at runtime, you can do them from Spoon while you are designing the jobs and transformations.

Inserting, deleting, or updating a table depending on a field

PDI allows you to do the basic operations that modify the data in your tables, that is: insert, update, and delete records. For each of those operations you have at least one step that allows you to do the task.

It may happen that you have to do one or another operation depending on the value of a field. That is possible with a little-known step named Synchronize after merge. Suppose you have a database with books. You received a file with a list of books. In that list there are books you already have, and there are books you don't have. For the books you already have, you intend to update the prices. Among the other books, you will insert in your database only those which have been published recently.

You will recognize them because they have the text NEW in the comment field. As the recipe will modify the data in the database, before proceeding, explore the database to see what is inside. Create a new transformation, and create a connection to the book's database. Use a Text file input step to read the file, setting the proper separator. Read all fields as String, except the price, which has to be read as a Number with a decimal format mask.

Do a preview to verify you have read the file properly; you should see this. Use a Split Fields step to split the name field into two: firstname and lastname. Use a Database lookup step to look in the authors table for an author that matches the firstname and lastname fields. Check the option Do not pass the row if the lookup fails, and close the window. Add a Synchronize after merge step; your transformation looks like this. Double-click the step. As Connection, select the connection you just created.

As Target table, type books. Fill the grids as shown; remember that you can avoid typing by clicking on the Get Fields and Get update fields buttons to the right. Select the Advanced tab. As Operation fieldname, select comment. As Insert when value equal, type NEW. As Update when value equal, type In Stock. Leave the other fields blank. Close the window and save the transformation.

Run the transformation. Explore the database again; in particular, run for the second time the same statements you ran before doing the recipe. The Synchronize after merge step allows you to insert, update, or delete rows in a table based on the value of a field in the stream. In the recipe, you used the Synchronize after merge step both for inserting the new books (for example, Mockingjay) and for updating the prices of the books you already had (for example, The Girl with the Dragon Tattoo).

In order to tell PDI whether to execute an insert or an update, you used the field comment. Note that, because you didn't intend to delete rows, you left the Delete when value equal option blank. However, you could also have configured this option in the same way you configured the others. An example of that could be deleting the books that will stop being published: if you recognize those books by the expression out of market, you could type that expression in the Delete when value equal option, and those books would be deleted. Let's see a little more about the step you used in this recipe.

It allows you to insert, update, and delete rows from a table, all in a single step, based on a field present in the dataset. For each row, Kettle uses the value of that column to decide which of the three basic operations to execute. This happens as follows. Suppose that the Operation fieldname is called op, and the values that should cause an insert, update, or delete are NEW, In Stock, and Discontinued respectively.
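As an illustration, these are the kinds of statements the step prepares, one per operation; the table and column names are assumptions based on the books example:

```sql
INSERT INTO books (id_title, title, price) VALUES (?, ?, ?);  -- rows where op = 'NEW'
UPDATE books SET price = ? WHERE id_title = ?;                -- rows where op = 'In Stock'
DELETE FROM books WHERE id_title = ?;                         -- rows where op = 'Discontinued'
```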

The insert is made for all rows where the field op is equal to the value NEW. The update is made for all rows where the field op is equal to the value In Stock. The delete is made for all rows where the field op is equal to the value Discontinued. The delete is made based on the key fields, just like in a Delete step; for delete operations, the content of the lower grid is ignored.

Synchronizing after merge

You may wonder what the name Synchronize after merge has to do with this, if you neither merged nor synchronized anything.

The fact is that the step was named after the Merge Rows (diff) step, as those steps can perfectly be used together. The Merge Rows (diff) step has the ability to find differences between two streams, and those differences are used later to update a table by using a Synchronize after merge step.

See also:

• Deleting data from a table, for understanding how the delete operations work.
• Inserting or updating rows in a table, for understanding how the inserts and updates work.

• For learning to use the Synchronize after merge step along with the Merge Rows (diff) step.

Changing the database connection at runtime

Sometimes you have several databases with exactly the same structure serving different purposes. These are some typical situations: a database for current information that is being updated daily, and one or more databases for historical data.

In any of those situations, it's likely that you need access to one or the other depending on certain conditions, or you may even have to access all of them, one after the other. Not only that; the number of databases may not be fixed, and may change over time (for example, when a new branch is opened). Suppose you face the second scenario: your company has several branches, and the sales for each branch are stored in a different database.

The database structure is the same for all branches; the only difference is that each of them holds different data. Now you want to generate a file with the total sales for the current date in every branch. Getting ready: Download the material for this recipe. You will find a sample file with database connections to three branches.

It looks like this:

branch,host,database
headquarters,localhost,sales

Create a transformation that reads the file with connection data and copies the rows to results. Create a second transformation. In it, create a database connection, choosing the proper Connection Type and filling the Settings with variables instead of fixed values. Use a Table Input step for getting the total sales from the database, using the connection just defined; a sketch of the settings and the query follows. Use a Text file output step for sending the sales summary to a text file. Don't forget to check the option Append under the Content tab of the setting window.
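As an illustration, the connection settings and the summary query could look like this; the variable names HOST and DATABASE and the sales table layout are assumptions:

```sql
-- Connection settings: Host Name = ${HOST}, Database Name = ${DATABASE}.
-- Query in the Table Input step, run once per branch database:
SELECT SUM(amount) AS total_sales
FROM   sales
WHERE  sale_date = CURRENT_DATE;
```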

Create a job with two Transformation job entries, linked one after the other. The job looks like this. Double-click the second Transformation entry, select the Advanced tab, and check the Copy previous results to parameters? and Execute for every input row? options. Select the Parameters tab and fill it as shown. Save both transformations. Save the job, and run it. Open the text file generated: it should have one line with sales information for each database in the file with the list of databases.

If you have to connect to several databases, and you don't know in advance which or how many databases you will have to connect to, you can't rely on a connection with fixed values, or variables defined in a single place, as for example in the kettle.properties file. In those situations, the best you can do is to define a connection with variables, and set the values for the variables at runtime. In the recipe, you created a text file with a summary sales line for each database in a list.

The transformation that wrote the sales summary used a connection with variables defined as named parameters. This means that whoever calls the transformation has to provide the proper values. The main job loops over the list of database connections. For each row in that list, it calls the transformation, copying the values from the row to the parameters in the transformation. In other words, each time the transformation runs, the named parameters are instantiated with the values coming from the file.

In the recipe, you changed the host and the name of the database. You could have parameterized any of the values that make up a database connection, for example the user and password. See also: Connecting to a database. That recipe explains how to connect to a database by using variables, and with it you will understand better the way the loop over the database connections works.

Loading a parent-child table

A parent-child table is a table in which there is a self-referencing relationship; in other words, there is a hierarchical relationship among its rows. A typical example of this is a table with employees, in which one of the columns contains references to the employee that is above each employee in the hierarchy. In this recipe you will load the parent-child table of employees of Steel Wheels. The hierarchy of roles in Steel Wheels is as follows:

• A sales representative reports to a sales manager.
• A sales manager reports to a vice-president.
• A vice-president reports to the president.
• The president is the highest level in the hierarchy.

There is a single employee with this role. You will load all employees from a file. For example, Gerard Bondur is a Sales Manager, and reports to the employee with e-mail [email protected], that is, Mary Patterson. Getting ready: In order to run this recipe, either truncate the employees table in Steel Wheels, or create the table employees in a different database.

Create a transformation that inserts the record for the president, who is the first in the hierarchy and doesn't report to anyone. Create another transformation to load the rest of the employees. Use a Text file input step to read the file of employees. Add a Filter rows step to filter the employees to load, based on their role.

Add a Table Output step, and use it to insert the records in the table employees. Your final transformation looks like this. Finally, create a job to put it all together: add a Start entry and four Transformation entries, and link all of them in a row. Use the first Transformation entry to execute the transformation that loads the president. Double-click the second Transformation entry and configure it to run the transformation that loads the other employees, typing the role to load. Repeat step 10 for the third Transformation entry, but this time type the value that matches the vice-president roles.

Repeat step 10 for the fourth Transformation entry, but this time type Sales Rep. Save and run the job. If you have to load a table with parent-child relationships, loading it all at once is not always feasible. Look at the sampledata database: we loaded all employees one role at a time, beginning with the president, followed by the roles below in the hierarchy. The transformation that loaded the other roles simply read the file, kept only the employees with the role being loaded, looked for the ID of the parent employee in the hierarchy, and inserted the records.

For the roles, you could have used fixed values, but you used regular expressions instead. In doing so, you avoided calling the transformation once for each different role. For example, for loading the vice-presidents you called the transformation once with a regular expression matching both VP roles. See also: Inserting or updating rows in a table.

If you are not confident with inserting data into a table, see that recipe.

Reading and Writing Files

PDI has the ability to read data from all kinds of files and different formats. It also allows you to write back to files in different formats. Reading and writing simple files is a very straightforward task: there are several steps under the Input and Output categories of steps that allow you to do it.

You pick the step, configure it quickly, and you are done. However, when the files you have to read or create are not simple (and that happens most of the time), the task of reading or writing can become a tedious exercise if you don't know the tricks. In this chapter, you will learn not only the basics for reading and writing files, but also all the tricks for dealing with them.

This chapter covers plain files (txt, csv, fixed width) and Excel files.

Reading a simple file

In this recipe, you will learn the use of the Text file input step. In the example, you have to read a simple file with a list of authors' information like the following:

```
"lastname","firstname","country","birthyear"
"Larsson","Stieg","Swedish",
"King","Stephen","American",
"Hiaasen","Carl","American",
"Handler","Chelsea","American",
"Ingraham","Laura","American",
```

Getting ready: In order to continue with the exercise, you must have the authors file.

Carry out the following steps:

1. Create a new transformation.
2. Drop a Text file input step to the canvas.
3. Type the name of the authors file in the File or directory textbox. Alternatively, you can select the file by clicking on the Browse button and looking for the file. The textbox will be populated with the complete path of the file.

4. Click on the Add button. The complete text will be moved from the File or directory textbox to the grid.
5. Select the Content tab and fill in the required fields, as shown in the following screenshot.
6. Select the Fields tab and click on the Get Fields button to get the definitions of the fields automatically. The grid will be populated, as shown in the following screenshot.

Kettle doesn't always guess the data types, size, or format as expected.

So, after getting the fields, you may change the definitions to whatever you consider more appropriate. When you read a file, it's not mandatory to keep the names of the columns as they are in the file; you are free to change the names of the fields as well.

7. Click on the Preview button and you will see some sample rows built with the data in your file.


You use the Text file input step in order to read text files, in this case, the authors file. Looking at the content of the file, you can see that the first line contains the header of the columns. In order to recognize that header, you have to check the Header checkbox under the Content tab, and type 1 in the Number of header lines textbox. You also have to indicate the field separator.


The separator can be made of one or more characters, the most used being the semicolon, colon, or tab. Finally, you can indicate the Enclosure string, in this case, ". Kettle takes all that information and uses it to parse the text file and fill the fields correctly. To work with these kinds of delimited text files, you could also choose the CSV file input step. This step has a less powerful configuration, but it provides better performance. If you explore the tabs of the Text file input setting window, you will see that there are more options to set, but the ones just explained are by far the most used.

Alternatively, the separator can be specified with a special hexadecimal notation; this notation makes more sense when your separators are non-printable characters. For the enclosure string, the same notation is also allowed. About file format and encoding: if you are trying to read a file without success, and you have already checked the most common settings (that is, the name of the file, the header, the separator, and the fields), you should take a look at, and try to fix, the other available settings.

Among those, you have Format and Encoding. If your file has a Unix format, you should change this setting. If you don't know the format, but you cannot guarantee that the format will be DOS, you can choose the mixed option. Encoding allows you to specify the character encoding to use. If you leave it blank, Kettle will use the default encoding of your system. Alternatively, if you know the encoding and it is different from the default, you should select the proper option from the drop-down list.

About data types and formats: when you read a file and tell Kettle which fields to get from that file, you have to provide at least a name and a data type for those fields. In order to tell Kettle how to read and interpret the data, you have more options. Most of them are self-explanatory, but the format, length, and precision deserve an explanation. If you are reading a number, and the numbers in your file have separators, dollar signs, and so on, you should specify a format to tell Kettle how to interpret that number.

Length is the total number of significant figures, while precision is the number of floating-point digits. If you don't specify format, length, or precision, Kettle will do its best to interpret the number, but this could lead to unexpected results. In the case of dates, the same thing happens; a couple of mask examples are sketched below.
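As an illustration, these are the kinds of masks you could type; the sample values are assumptions, and the masks follow the Java DecimalFormat and SimpleDateFormat conventions that Kettle uses:

```
$1,234.56   ->  Type: Number, Format: $#,##0.00, Length: 6, Precision: 2
23/10/2010  ->  Type: Date,   Format: dd/MM/yyyy
```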


When your text file has a date, you have to select or type a format mask, so Kettle can recognize the different components of the date in the field. Now suppose you want to move the country name to the end of the list of columns, changing it to a more suitable field name, such as nationality. In this case, add a Select values step.

The Select values step allows you to select, rename, reorder, and delete fields, or change the metadata of a field. Alternatively, you can do the renaming in the Text file input step itself by typing the names manually. CSV is the default value for the type of file, as you can see under the Content tab. You have another option here named Fixed, for reading files with fixed-width columns.

If you choose this option, a different helper GUI will appear when you click on the Get fields button. In the wizard, you can visually set the position for each of your fields. The CSV file input step, for its part, provides better performance and has a simpler, but less flexible, configuration.

Reading several files at the same time

Suppose you have several files to read, all with the same structure, but different data.

In this recipe, you will see how to read those files in a single step. The example uses a list of files containing names of museums in Italy. Getting ready: You must have a group of text files in a directory, all with the same format; each file has a list of names of museums, one museum on each line. Drop a Text file input step onto the work area.

In the File or directory textbox, type the directory where the files are, then click on the Add button. Note that if you use the internal variable holding the transformation's directory, the variable will be undefined until you save the transformation; therefore, it's necessary that you save before running a preview of the step. Under the Fields tab, add one row: type museum for the Name column and String under the Type column.

Save the transformation in the same place the museum directory is located. Previewing the step, you will obtain a dataset with the content of all the files with names of museums. With Kettle, it is possible to read more than one file at a time using a single Text file input step. In order to get the content of several files, you can add their names to the grid row by row. If the names of the files share the path and some part of their names, you can also specify the names of the files by using regular expressions, as shown in the recipe.

If you enter a regular expression, Kettle will take all the files whose names match it. You can test if the regular expression is correct by clicking on the Show filename(s)... button; that will show you a list of all the files that match the expression. If you fill the grid with the names of several files (with or without using regular expressions), Kettle will create a dataset with the content of all of those files, one after the other. A sketch of such a configuration follows.
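As an illustration, the Text file input grid could be filled like this; the directory variable is Kettle's internal transformation directory, and the file-name pattern is an assumption:

```
File or directory: ${Internal.Transformation.Filename.Directory}/museums
Wildcard (RegExp): museums_.*\.txt
```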

In the recipe, you read several files. It might happen that you have to read just one file, but you don't know the exact name of the file; the recipe is useful in cases like that as well.

Reading unstructured files

The simplest files for reading are those where all rows follow the same pattern: each row has a fixed number of columns, and all columns have the same kind of data in every row.

However, it is common to have files where the information does not have that format. On many occasions, the files have little or no structure. Suppose a file describing roller coasters, containing free-form lines such as: Kraken begins with a plunge from a height of several stories. As you can see, such a file is far from being a structured file that you can read simply by configuring a Text file input step.

Following this recipe, you will learn how to deal with this kind of file. Getting ready: When you have to read an unstructured file, such as the preceding sample file, the first thing to do is to take a detailed look at it. Try to understand how the data is organized; despite being unstructured, it has a hidden format that you have to discover in order to be able to read it.

So, let's analyze the sample file, which is available for download from the book's site. The file has data about several roller coasters. Let's take note of the characteristics of the file (as a useful exercise, you could do this yourself before reading the following list). There is a header that should be discarded. There are lines with the name of the park, lines with properties of the roller coaster, and lines with additional comments; nothing distinguishes these last lines — they simply do not fall into any of the other kinds of lines. Once you understand the content of your file, you are ready to read it and parse it.


Create a transformation and drag a Text file input step into it. Under the Content tab, uncheck the Header option and, as Separator, type a character that is not present in any part of the file; this way, you are sure that the whole line will be read as a single field. Under the Fields tab, enter a single field named text, of type String.

With a Modified Java Script Value step, create the field that groups the lines belonging to the same attraction; click on the Get variables button to populate the grid with the new field attraction. From the Transform category, add an Add value fields changing sequence step; in the first row of the grid, type attraction. Do a preview on this last step; you will see the following. So far, you've read the file and identified all the rows belonging to each roller coaster. It's time to parse the different lines. In the first place, let's parse the lines that contain properties.

    Configure the step as follows: As Field to evaluate select text. Check the Create free for capture groups option. As Regular expression: type. Fill the lower grid with two rows: as New field type code in the first row and desc in the second. In both rows, under Type, select String, and under Trim select both. In order to do a preview to see how the steps are transforming your data, you can add a Dummy step and send the false rows of the Filter rows step towards it.

The only purpose of this is to avoid the transformation crashing. Now you will parse the other lines: the lines that contain the park name, and the additional comments. Add another Filter rows step, and send the false rows of the first Filter rows step toward this one. Add two Add constants steps and a Select values step, and link all the steps as shown in the following diagram.

In the first Add constants step, add a String field named code with the value park; make sure the true rows of the Filter rows step go toward this step. In the second Add constants step, add the corresponding constant for the comment lines; make sure the false rows of the Filter rows step go toward this step. In the same stream, rename text as desc, and make sure that the fields are in this exact order: the metadata of both Select values steps must coincide. Now that you have parsed all the types of rows, it's time to join the rows together.

Join both Select values steps with a Sort rows step. Select the Sort rows step and do a preview; you should see the following.

How it works

When you have an unstructured file, the first thing to do is understand its content, in order to be able to parse the file properly. If the entities described in the file (roller coasters in this example) span several lines, the very first task is to identify the rows that make up a single entity.

The usual method is to do it with a JavaScript step. In this example, with the JavaScript code, you used the fact that the first line of each roller coaster was written with uppercase letters to create and add a field named attraction. In the same code, you removed unwanted lines. In this example, as you needed to know which row was the first in each group, you added an Add value fields changing sequence step. After doing this (which, as noted, is only necessary for a particular kind of file), you have to parse the lines.

If the lines do not follow the same pattern, you have to split your stream into as many streams as kinds of rows you have. In this example, you split the main stream into three: one for parsing the lines with properties (for example, Drop: 60 feet), one for setting the name of the amusement park where the roller coaster is located, and one for keeping the additional information. In each stream, you proceeded differently according to the format of the line.

The most useful step for parsing individual unstructured fields is the Regex Evaluation step: it validates whether a field follows a given pattern (provided as a regular expression) and, optionally, captures groups. In this case, you used that step to capture a code and a description. Throughout the book, there are examples and code that are ready for adaptation to individual needs. The book gives step-by-step instructions for solving data manipulation problems using PDI in the form of recipes, with plenty of well-organized tips, screenshots, tables, and examples to aid quick and easy understanding.

If you are a software developer, or anyone involved or interested in developing ETL solutions — or, in general, in doing any kind of data manipulation — this book is for you.