Guide to Fetch

By:  TiG  •  Tech Meta  •  4 years ago  •  1 comments

Fetch can streamline seeding articles on NT.
But always verify the content it provides.

The Challenge

Seeding is a clumsy process. Source articles have very different structures and embed advertisements and other distractions, at times including complex HTML that breaks the formatting. Time must be spent copying and pasting content and then cleaning up the resulting mess in NT. On top of that, downloading an appropriate image so that it can be uploaded and attached to the article is annoying. Finally, it is not always easy or possible to attribute the seed properly to the original author.


The Fetch functionality uses a collection of heuristic techniques to acquire as much good content as possible while ignoring as much bad content as possible. It aggressively pores over the HTML of seed articles to get:

Title :  This is almost always easy to find in an article, so Fetch will typically provide the correct title.

Quote or Summary :  If supplied by the author, Fetch will place the summary in the Quote field

Author (and site) :   Surprisingly, the author is often omitted or embedded in the actual content of the article.   Fetch uses a sophisticated method to determine the author.   But since this must be extracted from unstructured English, the author will not always be correct.   Also, sometimes articles provide a meta-data field for author but then put in the wrong information.   Author is one key field a seeder needs to scrutinize.

Content :   Ideally this is the core content of the seed. It is often buried as a small part of an otherwise complex mess of HTML and related instructions. Further, publishers often embed advertisements and teasers among the content (before, within and after). Although Fetch tries to weed out the bad content, the seeder will often need to do more editorial work.

Designated Image :   Almost every seed has a designated image. This is the image the author chose to go with the seeded article. Fetch can find this almost every time, but there are occasions when the seeded article has no image, states the wrong URL for the image, etc. In these cases, the seeder must manually provide an image (but Fetch can help there too, see: FETCH IMAGE).

Embedded Media :  Optionally, the seeder can ask Fetch to include appropriate embedded images and videos.   These are graphical media sprinkled throughout the content of the article.  This is not something the seeder would typically do, but the option is available.
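To make the fields above more concrete, here is a minimal sketch of how a title, summary, author and designated image might be pulled from a page's metadata. This is not Fetch's actual code. It assumes the page follows common conventions (Open Graph tags such as og:title and og:image, plus a plain author meta tag), and the names SeedMetadataParser and extract_seed_metadata are hypothetical. Real pages often omit or misuse these tags, which is exactly why the author field needs scrutiny.

```python
from html.parser import HTMLParser


class SeedMetadataParser(HTMLParser):
    """Hypothetical illustration only; not how Fetch is actually implemented."""

    # Assumed mapping from common meta tag keys to seed fields.
    FIELD_BY_KEY = {
        "og:title": "title",
        "og:description": "summary",
        "description": "summary",
        "author": "author",
        "article:author": "author",
        "og:site_name": "site",
        "og:image": "image",
    }

    def __init__(self):
        super().__init__()
        self.fields = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            field = self.FIELD_BY_KEY.get(key)
            content = (attrs.get("content") or "").strip()
            # Keep the first non-empty value seen for each field.
            if field and content and field not in self.fields:
                self.fields[field] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Remember the <title> text as a fallback for the title field.
        if self._in_title and data.strip() and not self.title_text:
            self.title_text = data.strip()


def extract_seed_metadata(html):
    """Return whichever seed fields the page's metadata supplies."""
    parser = SeedMetadataParser()
    parser.feed(html)
    parser.close()
    fields = dict(parser.fields)
    # Use the <title> element only when no og:title was present.
    if "title" not in fields and parser.title_text:
        fields["title"] = parser.title_text
    return fields
```

Any field the page does not declare is simply missing from the result, which corresponds to the cases where the seeder has to fill it in by hand.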

Fetch appears to the user as one or two buttons in the upper right corner of the Create Article or Update Article forms. The principal button is labeled FETCH SEED . This is the workhorse that accepts a URL from the user and then goes to the web to attempt to acquire what is needed for an NT seed. The FETCH SEED button appears when you are trying to create a Discussion or a Group Discussion . It does not appear when creating a Blog because blogs are always original work (never seeded).

The other button is FETCH IMAGE .  This button appears when you create or update any type of article.   It gives the user the option to enter a URL instead of downloading a file to one's local machine and then uploading into NT.  FETCH IMAGE does all that work for you.   

Note :  you will not usually use FETCH IMAGE since FETCH SEED   frequently brings in the author-designated image for you.


Fetch is atypical automation; it works on imperfect data and the results will vary per article and per site. Fetch contends with a very messy world of inconsistent standards, human error and wildly different formats and styles. It must grab raw code from an arbitrary target website (what browsers process before presenting a nice view to the user) and extract only what is needed for an NT seed. To make sense of this content, Fetch contains a complex set of inference tools designed to cleanse and surgically extract desired content. While it is usually not possible for Fetch to construct a seed as well as a human seeder (Fetch operates at the server level and thus does not have the benefit of a sophisticated browser), it does handle a lot of the grunt work and ideally leaves the seeder with minor cleanup work.

Sometimes Fetch delivers perfection, sometimes the results are ugly. Expect to find cases where Fetched content contains embedded advertisements, and sometimes complete junk (like portions of script code). These items are usually a result of poorly encoded HTML content that lacks sufficient information for Fetch (and its supporting third-party tools) to detect and remove. Basically, if an author / site embeds content in the middle of an article and provides no markers to distinguish it from the true content of the article, there is nothing Fetch can do. To Fetch, it all looks like article content.
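The difference between marked and unmarked junk can be shown with a small sketch. This is an illustration under simple assumptions, not Fetch's real cleansing logic: it drops anything inside script, style and noscript tags (junk the HTML clearly marks) and keeps all other text. The names VisibleTextExtractor and visible_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Simplified illustration: keep text that is not inside clearly marked junk."""

    # Tags whose contents are never article text.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Only keep text found outside every skipped tag.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text chunks that survive, one per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Given a page where an advertisement sits in an ordinary paragraph between two real paragraphs, the script code is removed but the advertisement survives, because nothing in the markup distinguishes it from the article text.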

In contrast to excess unwanted content, Fetch cannot always get necessary information from a site. Twitter delivers almost entirely junk, so Fetch will typically get the tweet itself at best. Some sites (e.g. Bloomberg) prevent bots ( Fetch is effectively a bot ) from acquiring their content. Some sites forbid access to their images (access denied). Other sites use dynamic content that appears only as a result of the browser interacting with the server; Fetch cannot even see this content (an example of this is a picture gallery). When Fetch is not able to provide good content, the seeder should create the seed using the old manual approach (copy & paste). However, even here, the FETCH IMAGE function will be available to facilitate acquiring a suitable image directly from the web (vs. download / upload).


When you click on the FETCH SEED button you will see a dialog box with the following fields:


Seed URL :  This is the URL you wish Fetch to access. Typically you would paste in your URL.

Include Compatible Embedded Media :  This checkbox is normally unchecked, so Fetch will normally not include pictures and videos. If you would like pictures and videos (at least those that are valid for NewsTalkers) then check this box.

Starting Point :  This is rarely used. Its purpose is to give Fetch your desired starting point in the article. It is used when the article has a bunch of junk content up front and Fetch has no way to distinguish it from good content (MSN likes to do this). In this case, you would copy & paste (this must be exact) a unique portion of the starting sentence in the article. Fetch will scan the content looking for this exact match and, if found, take that as the beginning of the article (and will throw away all content that precedes it).
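The Starting Point behavior can be sketched in a few lines. This is a minimal illustration, not Fetch's actual code; the function name apply_starting_point is hypothetical, and returning the content unchanged when no match is found is an assumption (the description above only covers the case where the match is found).

```python
def apply_starting_point(content, starting_point):
    """Drop everything that precedes the first exact match of starting_point."""
    if not starting_point:
        # No starting point supplied: keep the content as fetched.
        return content
    index = content.find(starting_point)
    if index == -1:
        # Assumption: with no exact match, the content is left untouched.
        return content
    return content[index:]
```

For example, with the content "Trending now. More stories. The council voted on Tuesday." and the starting point "The council voted", only "The council voted on Tuesday." remains. Because the match is exact, a starting point typed with different capitalization or spacing would not be found, which is why copy & paste is the safer way to supply it.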

Once you have entered your data (typically just the URL), press the FETCH button to automatically populate your seeded article.


Fetch seeks to make seeding easier. It will not eliminate all the work, but in most cases it greatly streamlines the process. If Fetch does not work well for a site, the legacy functionality (the copy & paste you do today) for seeds is still in place.

One should use Fetch as a support tool and always verify the content it provides.   The data Fetch can acquire will not always be perfect.


1  author  TiG    4 years ago

An overview of how Fetch works.