
Overview

This tutorial explains how to use the Grabber Scripts Editor to modify an existing grabber or create your own grabber script to 'grab' data from your favorite movie website.

Goal

The goal of this tutorial is to help you learn to write your own grabber script.

You should already be familiar with the Grabber Scripts Editor interface and settings, and how to Test a Grabber.

Note: As of v 6.0.0, My Films offers online updates of grabbers. Once your new or customized grabber is ready, please submit it for testing in the My Films Grabber Scripts Interface Forum sticky thread. Once tested, it will be added to SVN so that all users can update to it.

My Films Grabber Structure

Basic Functions

Before you start, it is important to understand the basic function of the grabber editor and how it uses grabber scripts to retrieve data.

A grabber script is an XML file that internally consists of four sections:

  1. Common definitions - the script name, internet site name and search page
  2. Search definitions - tell the grabber engine how to extract information about the movies found: title, year (optional, but important for good matching), director (optional) and the URL of the movie page
  3. Details definitions - tell the grabber engine how to retrieve all the data for a chosen movie
  4. Mapping definitions - tell the grabber engine how to return or combine the data for final output to My Films or AMC Updater

Internet sites provide their data as HTML code. Most of the work in writing a grabber script is therefore finding good definitions that retrieve and clean the data so that you get the results you want.

Headers

As of v 6.0.0, My Films grabber engine supports headers with multiple parameters:

Encodings

As of v 6.0.0, the My Films grabber engine supports different encodings for sub-pages, which allows you to use another encoding for a "foreign" sub-page when required.

The grabber engine supports "simple matches" as well as regular expressions (regex) to locate positions in the web page. Regex is usually more powerful and allows more flexible and more robust scrapers.

Regex Support

Regex help:

  • txt2re - online generator
  • RegexMagic - free evaluation, uses the JGsoft internal engine
  • Regex - Wikipedia

As of v 6.0.0, My Films grabber engine supports several new internal options for regular expressions (regex):

  • #ADD# - now also available on search page
  • #REGEX# - in 'Replace' box for cleanup of outer result
  • #REGEX##MULTI# for multiple replacements (in 'Replace' or 'With' box)
  • #LF# - to insert a line break, use "#LF#"; internally this is changed into a line break
    Example: Replace #REGEX# With #REGEX##MULTI#,|#LF# will replace comma-separated results with line-break-separated results, e.g. to display each actor on a separate line
  • #MULTI# - replacement option for multiple replacements - also supports Regex via "#REGEX##MULTI#"
     Example: "#REGEX##MULTI#movie|m;person|p" - replaces "movie" with "m" and "person" with "p"
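
The #MULTI# and #LF# options above can be illustrated with a small Python sketch. This is not the My Films implementation: the function name apply_multi and its parameters are hypothetical, and the parsing rules (pairs separated by ";", search and replacement separated by the first "|") are inferred only from the two examples above.

```python
import re


def apply_multi(text: str, spec: str, use_regex: bool = False) -> str:
    """Apply 'search|replacement' pairs, separated by ';', in order.

    Assumption: the first '|' in each pair splits search from replacement,
    so a regex search term in this sketch cannot itself contain '|'.
    """
    for pair in spec.split(";"):
        if not pair:
            continue  # ignore empty segments such as a trailing ';'
        search, _, replacement = pair.partition("|")
        # '#LF#' stands for a line break in the replacement text.
        replacement = replacement.replace("#LF#", "\n")
        if use_regex:
            text = re.sub(search, replacement, text)
        else:
            text = text.replace(search, replacement)
    return text


# Mirrors the "movie|m;person|p" example.
print(apply_multi("movie and person", "movie|m;person|p"))

# Mirrors the ",|#LF#" example: one actor per line.
print(apply_multi("Actor A,Actor B,Actor C", ",|#LF#"))
```

With use_regex=True the search part is treated as a regular expression, for example r",\s*|#LF#" to also swallow the spaces after each comma.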

Match Groups

A special feature is the so-called "match groups": after defining an area of a web page that contains a lot of similar information (such as actors), this information can be pulled simply as a string or, preferably, as regex match groups. With match groups you get clean information, e.g. for persons and roles, and this approach supports advanced options such as limiting the number of results or activating/deactivating the roles in a result.

We now have a much more flexible system for creating scripts, which allows you to use information from other pages as long as they can be properly referenced. See the IMDB.DE-OFDB grabber for an example.
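
As a rough illustration of the match-group idea, the following Python sketch pulls persons and roles out of a cast area using named regex groups. The HTML fragment, the group names person and role, and the function extract_cast are all invented for this example; real pages and the actual grabber engine will differ.

```python
import re

# Invented HTML fragment standing in for the "cast" area of a movie page.
CAST_HTML = """
<tr><td class="name">Alice Example</td><td class="role">Detective</td></tr>
<tr><td class="name">Bob Sample</td><td class="role">Suspect</td></tr>
<tr><td class="name">Carol Placeholder</td><td class="role">Witness</td></tr>
"""

# One match per cast row; each match exposes two named groups.
CAST_RE = re.compile(
    r'<td class="name">(?P<person>[^<]+)</td>'
    r'<td class="role">(?P<role>[^<]+)</td>'
)


def extract_cast(html, max_results=None, include_roles=True):
    """Return cast entries, optionally limited in number and without roles."""
    people = []
    for match in CAST_RE.finditer(html):
        if max_results is not None and len(people) >= max_results:
            break
        person = match.group("person").strip()
        role = match.group("role").strip()
        people.append(f"{person} ({role})" if include_roles else person)
    return people


print(extract_cast(CAST_HTML, max_results=2))
print(extract_cast(CAST_HTML, include_roles=False))
```

Because each piece of information arrives in its own group, limiting the result count or dropping the roles is a matter of how the groups are combined, not of re-parsing a long string.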

Secondary (Sub) Pages

The system of “secondary pages” can be used universally: you can link a sub-page to a field, or even cascade sub-pages, e.g. define a sub-page based on another sub-page, which is in turn based on the base page.

Example: Base page (details) -> sub-page (retrieving OFDB link) -> sub-page (OFDB data) -> Grab info from OFDB-site

These 'grab' pages can be freely assigned to any field or to a page/link field. Thus, for every field, you can choose whether its data is grabbed from the base page or from any of the defined sub-pages.
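
The cascade described above (base page -> sub-page -> sub-page, with each field assigned to one page) can be sketched in Python as follows. Everything here is hypothetical: the URLs, the page names sub1 and sub2, the field names and the in-memory FAKE_WEB dictionary, which replaces real downloads so the sketch runs offline. It only illustrates the idea and is not how the grabber engine is implemented.

```python
import re

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://base.example/movie/1":
        '<h1>Original Title</h1> <a href="http://link.example/ofdb?id=1">OFDB</a>',
    "http://link.example/ofdb?id=1":
        '<a class="film" href="http://ofdb.example/film/1">go</a>',
    "http://ofdb.example/film/1":
        '<div id="plot">Localized description</div>',
}

# Each sub-page names its parent page and a regex whose group 1 is the link.
SUB_PAGES = {
    "sub1": ("base", r'href="(http://link\.example/[^"]+)"'),
    "sub2": ("sub1", r'class="film" href="([^"]+)"'),
}

# Each field names the page it is grabbed from and a regex for its value.
FIELDS = {
    "OriginalTitle": ("base", r"<h1>([^<]+)</h1>"),
    "Description": ("sub2", r'<div id="plot">([^<]+)</div>'),
}


def load_page(name, base_url, cache):
    """Load a page by name, resolving parent pages first; cache the HTML."""
    if name in cache:
        return cache[name]
    if name == "base":
        html = FAKE_WEB[base_url]
    else:
        parent, link_pattern = SUB_PAGES[name]
        match = re.search(link_pattern, load_page(parent, base_url, cache))
        html = FAKE_WEB.get(match.group(1), "") if match else ""
    cache[name] = html
    return html


def grab(base_url):
    """Grab every field from the page it is assigned to."""
    cache = {}
    result = {}
    for field, (page, pattern) in FIELDS.items():
        match = re.search(pattern, load_page(page, base_url, cache))
        result[field] = match.group(1) if match else ""
    return result


print(grab("http://base.example/movie/1"))
```

The cache means each page is loaded at most once, even when several fields are assigned to the same sub-page.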

Advantages:

  • use the title page for both Original (Otitle) and Translated (Ttitle) titles, and define whichever title you would like to have
  • grab basic data from IMDb.com and grab only the data that you want in a localized language from the IMDb.XX site
  • mix the above methods

Grabbing from multiple websites/sources (AKA multi-script grabbing)

Sometimes localized pages (e.g. localized IMDb sites) have much less data than the original English pages, so a localized grabber ends up with a lot of empty fields.

The solution:

  • the merge options “merge (prefer source)” and “merge (prefer destination)”
  • 3 “generic grabber fields”, whose main purpose is to read data from, e.g., another website and use it via the matrix to “merge” it into AMC fields

Example 1:

You want a localized IMDb scraper such as IMDB.DE and grab the description from there. You get it in German, but the description is often missing.

So you define a “generic1” field that reads the description from the IMDb.com page and configure it to “merge (prefer destination)” into the Description field. As a result you get the localized description if it is available, and the English description if no localized description exists.

The same is possible with basically all other fields.
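
The merge behaviour from Example 1 can be sketched in Python as below. This is a minimal illustration under assumptions: the function name merge and its parameters are hypothetical, the “prefer destination” rule follows the example above, and the “prefer source” rule is assumed to be its mirror image, which the text does not spell out.

```python
def merge(destination: str, source: str, prefer: str = "destination") -> str:
    """Combine a field value (destination) with a generic field value (source).

    prefer="destination": keep the destination unless it is empty.
    prefer="source": take the source unless it is empty (assumed by symmetry).
    """
    if prefer == "destination":
        return destination if destination.strip() else source
    if prefer == "source":
        return source if source.strip() else destination
    raise ValueError(f"unknown merge mode: {prefer!r}")


# Localized description exists: it is kept.
print(merge("Deutsche Beschreibung", "English description"))

# Localized description is missing: the English one fills the gap.
print(merge("", "English description"))
```

The same rule expresses a fallback between two sites: use the value from the preferred site and take the other site's value only when the first is empty.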

Example 2:

You could use any TMDb grabber and add sub-pages (Generic link) for IMDb.com.

Thus you can combine the data you like from IMDb and TMDb in one operation.

With the generic fields and the merge options you could even create “fallbacks”, e.g. getting data from TMDb but using IMDb if TMDb has no data.

This is very close to “multi-script grabbing”! 

Other Features

Buttons to load the base page and sub-pages into the web browser, which makes script development much more convenient.

Metering: the grabber interface shows the size in bytes of the loaded sub-page, and the preview shows the loading time for all defined pages.

Of course (and unfortunately) I had to adapt all existing scripts. I have already done that, and I hope I didn't break anything.

Samples

You may use the IMDB.DE-OFDB script as a sample (or copy it and use it as a base) for 'multi-script' type grabbing from different sites, with different sub-pages for different fields.

Use IMDB.IT as a base for creating simple grabbers that mainly access one page, and are therefore quite fast, and that grab localized data from the IMDb web site.
