In this video you will learn how to create and edit crawler profiles in MailCom's expert.tools. MailCom expert.tools is a browser application that provides versatile tools for researching and enriching business data. Business professionals and marketing, sales, and purchasing managers use expert.tools to boost their business success.
In the following, I will first explain the crawling process of expert.tools. After that, I will show you how to use crawler profiles, how to create new ones, and how to edit or delete existing ones.
So, let's start with the crawling process of expert.tools. expert.tools can crawl a large number of web pages, and you may search for specific information on these pages, e.g. a specific keyword. To use crawling time efficiently and manage cost, you need control over the crawling process. You may want expert.tools to review (or, as we say, "crawl") all sub-pages of a website, or only specific sub-pages such as the about us, news, contact, or imprint page; sometimes the main page alone already contains the information you are looking for.
This is where crawler profiles come into play: they control the on-site crawling process for each website. With a crawler profile you determine how deep expert.tools crawls and which pages it investigates when searching for your desired information, such as a keyword or pattern.
Now I am going to show you where you can find the crawler profiles and how you can add, delete and edit them. As you can see, I have already uploaded an address list and I am currently showing it on the main screen of expert.tools. To open the crawler profile settings now, please go into the File menu, then go to Settings and finally click on Crawler. Now the crawler settings window pops up.
So, as you can see, there are already two predefined profiles named Main page only and Address pages. To switch between them, just click on the tabs in the upper left corner of the window. I am now going to create a new crawler profile to show you all the different crawling options you can choose from. So, how do we do that? As you can see, in the upper right corner of the window there are three round buttons. With the red one you can delete your current crawler profile, with the pencil button in the middle you can make a copy of your current profile to edit it, and with the plus button on the left you can create a new default profile for editing. So let's do that.
I will now explain each of the elements in the crawler profile. Let me start with the upper row, from left to right. Here, we first find the "Profile name" field. Let's enter "Career page search" as the name for our new crawler profile. The purpose of this profile is to focus the crawler on career-related pages only. I will show you later how we set that up by defining a suitable search pattern.
The next text field sets the maximum number of pages the crawler looks at, for example five. Then we also have the Max depth field. Here you decide how long the click path to a page may be at most, i.e. how many links may be followed to reach a page from the start page. Typically, two is a reasonable value here.
As you can see, there are also two check boxes where you can decide whether to ignore non-matching pages or main pages. Let us first talk about non-matching pages. A non-matching page is a page that matches neither the URL search pattern nor the link search pattern of our crawling profile; in other words, it does not match our defined search patterns. Sometimes it can be useful to crawl these pages anyway, e.g. if the information you are looking for is two or more clicks away from the main page. By default, non-matching pages get a low priority, so they are examined last; all matching pages are preferred for crawling. If you exclude non-matching pages, they will never be examined. For example, when you want the keyword search to be as fast as possible and have set a short maximum search depth, it can make sense to first search with non-matching pages excluded and then run a second search, including non-matching pages, on those addresses where you did not get any matches.
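By the way, if you like to think of this in code: conceptually, the page limit, the maximum depth, and the handling of non-matching pages interact roughly like in the following TypeScript sketch. Please note that this is only an illustration of the idea with made-up names, not the actual expert.tools implementation.

```typescript
// Conceptual sketch of a crawl queue with a page limit, a depth limit,
// and deprioritized non-matching pages. Hypothetical names; not the
// actual expert.tools implementation.

interface QueuedPage {
  url: string;
  depth: number;           // number of links followed from the main page
  matchesPattern: boolean; // does the URL or anchor text match a filter?
}

function pagesToCrawl(
  candidates: QueuedPage[],
  maxPages: number,
  maxDepth: number,
  ignoreNonMatching: boolean
): QueuedPage[] {
  return candidates
    // Pages beyond the allowed click path are never visited.
    .filter(p => p.depth <= maxDepth)
    // "Ignore non matching pages" drops non-matching pages entirely...
    .filter(p => p.matchesPattern || !ignoreNonMatching)
    // ...otherwise they are only deprioritized, i.e. crawled last.
    .sort((a, b) => Number(b.matchesPattern) - Number(a.matchesPattern))
    // The page limit caps how many pages are examined in total.
    .slice(0, maxPages);
}
```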
Let's now talk about the second checkbox, "ignore main pages". If you select it, the keyword search module will not be applied to the main page. The main page still has to be crawled, though, as all search paths to sub-pages start from it.
Finally, I will show you how to add URL patterns. This is the core setting of the crawler profile: here you define which pages expert.tools will crawl. To do that, click on the green "Add filter" button, which you find on the right side of the window within the table. You can now define which link texts and URL patterns the crawler shall follow, and with which priority. In the first column on the left you enter the anchor text or URL patterns that define which sub-pages should be crawled. expert.tools will then crawl everything that matches these URL or anchor text patterns with higher priority than pages that do not match any pattern.
In the next column, you can define the search type. Here you can choose between substring, prefix (the pattern must match the beginning of the text), suffix (the pattern must match the ending), RegEx, and exact match.
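For the technically minded: the five search types correspond to the following matching rules, sketched here as a small hypothetical TypeScript helper (not part of expert.tools itself).

```typescript
// Illustrative semantics of the five search types.
type SearchType = "substring" | "prefix" | "suffix" | "regex" | "exact";

function matches(text: string, pattern: string, type: SearchType): boolean {
  switch (type) {
    case "substring": return text.includes(pattern);         // pattern occurs anywhere in the text
    case "prefix":    return text.startsWith(pattern);       // text begins with the pattern
    case "suffix":    return text.endsWith(pattern);         // text ends with the pattern
    case "regex":     return new RegExp(pattern).test(text); // JavaScript regular expression
    case "exact":     return text === pattern;               // text equals the pattern exactly
  }
}
```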
Please remember: for this tutorial, we want to search for information that is typically found on the career-related sub-pages of a website. For example, we might be interested in whether the company offers internships or hires students.
First of all, we define that the crawler shall focus on career-related sub-pages. To do this, we choose the RegEx search type and define a search pattern that covers common variations of career page names. In the first column we therefore enter (karriere|career(s)?|job(s)?), a regular expression that matches every URL or link to a sub-page that includes "job", "career", or the German word for career, "karriere". Please note that our regular expressions follow the JavaScript RegEx syntax.
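Because the syntax is plain JavaScript, you can try the pattern yourself in any browser console. Here it is, tested against a few made-up example URLs:

```typescript
// The career-page pattern from above, tried against made-up example URLs.
const careerPattern = /(karriere|career(s)?|job(s)?)/;

console.log(careerPattern.test("https://example.com/careers"));     // true
console.log(careerPattern.test("https://example.com/jobs/intern")); // true
console.log(careerPattern.test("https://example.de/karriere"));     // true
console.log(careerPattern.test("https://example.com/about-us"));    // false
```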
In the third column, you can choose between anchor text search, URL text search, or both. The anchor text is the clickable text of a link, also known as the link label or link title; for example, the headlines you click on in the results of a search engine such as Google or Bing are anchor texts.
And the URL text is simply the text found within the URL of a page. To deselect one of the two options, just click on the small cross in its gray container. In our case, let's keep both options selected.
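To make the difference between the two concrete, imagine a hypothetical career link. With both options selected, a match on either the URL text or the anchor text is enough, roughly as in this sketch (the case-insensitive /i flag is my own assumption; the tutorial does not specify how case is handled):

```typescript
// A career link on a main page might look like this in HTML:
//   <a href="https://example.com/de/karriere">Jobs & Karriere</a>
// URL text:    "https://example.com/de/karriere"
// Anchor text: "Jobs & Karriere"
const careerPattern = /(karriere|career(s)?|job(s)?)/i; // /i = case-insensitive (assumption)

const link = { url: "https://example.com/de/karriere", anchorText: "Jobs & Karriere" };

// With "both" selected, a match on either text is enough to follow the link.
const follow = careerPattern.test(link.url) || careerPattern.test(link.anchorText);
console.log(follow); // true: here both the URL and the anchor text match
```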
And last but not least, in the last column you can select the search priority for this specific filter. Search priorities are useful if you have several filters and want to determine which filters should be preferred and which are less important. You can choose between High, Moderate, Low, and Ignore. Since we have only one filter in our example, I leave the priority at High.
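In case you wonder how several filters with different priorities might combine: one plausible reading is that a page gets the highest priority of any filter it matches, with Ignore contributing nothing. Here is that idea as a hypothetical sketch; the actual combination logic of expert.tools is not shown in this tutorial.

```typescript
// One possible reading of the four priority levels. Hypothetical sketch only.
const weight = { High: 3, Moderate: 2, Low: 1, Ignore: 0 } as const;
type Priority = keyof typeof weight;

interface Filter { pattern: RegExp; priority: Priority }

// A page gets the highest weight among the filters it matches (0 if none).
function pagePriority(url: string, filters: Filter[]): number {
  return Math.max(0, ...filters
    .filter(f => f.pattern.test(url))
    .map(f => weight[f.priority]));
}

// Example: a career filter at High and a news filter at Low.
const filters: Filter[] = [
  { pattern: /karriere|career|job/, priority: "High" },
  { pattern: /news|press/,          priority: "Low"  },
];
console.log(pagePriority("https://example.com/jobs", filters)); // 3 (High)
```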
We have now successfully created our own specific crawler profile. To save it, just click on the save button which you find at the bottom right. To use this crawler profile, we start a new keyword search and select the new crawler profile in the keyword search module settings.
Let me now show you how to do this. First of all, we need to enter a keyword to search for. For this tutorial I will use the keywords intern, internship, and student in the keyword search. Since the crawler focuses on career-related sub-pages, we try to identify those companies that offer internships.
So, once you have entered the keyword pattern, click on the keyword search settings button. Within the first tab of the new window, please select the website input column; then click on the second tab to select the output columns. Here I select the option to create a new output column during the search.
Finally, within the third tab, named Task settings, we can select the crawler profile that we created before. So, let's do that, keep the other settings as they are, and click on Save. Then we start the keyword search by clicking on the Start button.
As you can see, expert.tools added a new results column to our address table.
The red crosses stand for "false": our keywords were not found on the websites and sub-pages we crawled. The green check marks stand for "true": one of the keywords was found while crawling the respective sub-pages of the website.
If you only want to display the websites on which the keyword was found, just enter "true" as a filter in the column header; to display the websites on which the keyword was not found, enter "false" instead. By defining a crawler profile for a specific task, you can save a lot of crawling time and cost and solve your research tasks efficiently.
We have now reached the end of this tutorial. In this video I have shown you how to work with crawler profiles and how to create, edit, and delete them.
If you want to learn more about expert.tools, you will find more videos on our YouTube channel and on the expert.tools website. If you have any questions, suggestions, or topics for discussion, please let us know by commenting on the video below. You can also contact sales or just send us an email.