Languages

Import table data from HTML

This article describes a solution of common problem - we have a HTML file with some data, which should be imported into a database. This task can be acomplished using import from text file feature of DatAdmin database administration tool.

Problem definition

For demostration purposes, we want to import the following table (arrivals at airport web page):

Arrival Destination Flight
14:10 from Terminal 1 Dublin (IE) EI 0644
14:10 from Terminal 2 London / Heathrow (GB) BA 0860
14:40 from Terminal 1 Barcelona (ES) VY 8654
15:10 from Terminal 3 Kiev / Borispol (UA) OK 4919

We want to import this data into table with following columns:

  • ID, int autoincrement (auto-generated)
  • Time, DateTime
  • Terminal, varchar(50)
  • Destination, varchar(50)
  • Flight, varchar(50)

Solution

Suppose you want to create new table. Import into existing table is very similar. Connect to database, in the "Tables" context menu choose "Create table from file". Bulkcopy wizard will appear. Select "Import from text" importer.

Wizard has 4 steps:

  • Source - defines data, which will be imported. You can define text file, URL or directly copy and paste text from clipboard
  • Rows - defines how to delimite rows
  • Fields - defines how to extract fields from rows defined in previous step
  • Preview - check how import will look like

So in our example, in "Souce" tab, choose "Text" as source data and copy imported HTML into edit box. You could also import data from file or from URL.

<tr><td>14:10 from Terminal 1</td><td>Dublin (IE)</td><td>EI 0644</td>
</tr>
<tr><td>14:10 from Terminal 2</td><td>London / Heathrow (GB)</td><td>BA 0860</td>
</tr>
<tr><td>14:40 from Terminal 1</td><td>Barcelona (ES)</td><td>VY 8654</td>
</tr>
<tr><td>15:10 from Terminal 3</td><td>Kiev / Borispol (UA)</td><td>OK 4919</td>
</tr>

Define rows

HTML file has rows delimited by tags <tr> and </tr>, so configuration is very simple. Choose "Limitation" option, "Begin" should be "<tr>" and "End" </tr>.

DatAdmin Bulkcopy wizard

Define fields

Defining fields is the most complicate step of whole process, in our case we will need regular expressions. For every field we can confugre 4 attributes:

  • Field - field name in output table
  • Regular expression - regex used for scanning field in allready delimited row
  • Group - number of group in regular expression, which will be used as column value. If ommited, full match is used
  • Skip - number of occurences to skip

In our example, fields will be defined in following way:

Field Regular expression Group Skip
Time <td>([^<]*) from ([^<]*)</td> 1  
Terminal <td>([^<]*) from ([^<]*)</td> 2  
Destination <td>([^<]*)</td> 1 2
Flight <td>([^<]*)</td> 1 3

As you can see, we you only two different regular expressions. "Time" and "Terminal" fields are parsed from first cell, they differ in group (time is the first group - text between first parentheses, terminal the second group - text between second parentheses). "Destination" is taken from the third cell - so there is defined "Skip"=2 for skipping first two cells.

View preview and save your work

Last tab is "Preview" - click "Refresh" button for view your last changes.

Tip: You can use button "Save as template" to save this definition. You can later reuse it or load and modify it. Laso when you are not fully satisfied with results, you can try it again without neccessarity of configuring again.

Create table and define column mapping

We are almost finished - Click "Next" button, choose table name. In the next screen, wee see imported columns. Ooops! There is not ID column that we said we need! But solution is simple - add column named ID and choose "Row number" as type.

That's really all. Click "Next", "Finish" and you have new now table with imported data in your database.

Conclusion

I have explained how to import the arrivals table from Prague airport web pages. But as you can imagine, this tool is very generic, and it can be used for a big variety the data import tasks.

Post new comment

  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.

More information about formatting options