pic-upload.de - need some jscript help ;)


Max_Headroom
23rd June 2009, 02:31 AM
Hi. I'm sending BIIIIG thanx to CyberMatt for IHG and you all for supporting it ;)
At the moment I'm using it to download HQ-celeb-pics.
Unfortunately I came across a hoster, "pic-upload. de", that serves the image under a short temp name and only mentions the real filename in the HTML.
I tried to write a script that grabs the short temp name and downloads the image, but saves it under the real filename found in the title attribute of the link.
This is the source of the line in the HTML-file:
<a href=" http://www3.pic-upload. de/18.06.09/7ui9oz.jpg" rel="lightbox" title="Franziska-van-Almsick-Premiere-zum-Film-Cars-07.09.2006.jpg"><img src="http://www3.pic-upload. de/18.06.09/7ui9oz.jpg" width="750" height="1212.93800539" id="thepic" alt="Klicken, um in Original Größe zu sehen." border="1" style="border-color: #d4d6d7;border-width:1px;" /></a>
Original URL: http://www.pic-upload. de/view-2357935/Franziska-van-Almsick-Premiere-zum-Film-Cars-07.09.2006.jpg.html
The important part is the href of the first link, because it contains the filename on the server. The real filename is in the title attribute.
Scanning through the hosts included with IHG, I found that there's a retVal.fileName variable that seems to be used for this.
Is it possible to download the file using the href and save it under the filename given in the title attribute? ;)
I must admit... I'm a big "n00b" in the regex-world :P
This is my script that causes a lot of headaches:
// URL Pattern: ^http:\/\/www\.pic-upload\. de\/view.+\/.+\.html
//
// Using ONLY "ID: thepic" works. But it saves the image using the "href"-tag.
// TODO: Grab the title-tag and use its name.
function(pageData, pageUrl) {
    var retVal = new Object();
    // Scan for picture-URL
    var sPattern = pageData.match(/\"(http:\/\/www[0-9]\.pic-upload\. de)\"/);
    // If not found
    if (!sPattern) {
        retVal.imgUrl = null;
        retVal.status = "ABORT";
    }
    else {
        retVal.imgUrl = sPattern[1];
        retVal.status = "OK";
    }
    // This is the ID-value for the site - used to locate the image
    var theId = "thepic";
    // Scan for filename and use it.
    // Else use a random filename.
    var imgs = pageData.match(/rel="lightbox" title="(.+?)".+?>/);
    try {
        retVal.fileName = imgs[1] + ".jpg";
    }
    catch(e) {
        retVal.fileName = Math.random(). toString(). substring(2) + ".jpg";
    }
    // Return the result object to ImageHost Grabber
    return retVal;
}

Sorry, I had to include some spaces after the dots to skip the posting-rules :P
When I only put "ID: thepic" in the script field, it loads the image (and all other images hosted there, e.g. signatures), but saves it under the server's filename. When I use my script, it tries to load the image using the real filename, which of course fails :(
using "ID: thepic": http://server. org/img001_shortname.jpg
using the script: http://server. org/full_filename_in_the_title_tag_but_NOT_on_server.jpg
I bet I made some stuuuupid mistakes in the code, but maybe it serves as inspiration for a (hopefully) bugfree addition to the hosts-file :d
I need to get into JScript to fully understand what I did :whack0:
Previously I used WGet and some AutoIt scripts to automate the process of thread-sucking. But since I got IHG I switched over to this fantastic tool. Unfortunately my JScript-skills aren't nearly as good as my AU3 or PureBasic-skills ;) So I hope you can help me understand a bit more about scripting IHG, so maybe I can contribute some unknown hosts to the list.

Sincerely,
Max

cybormatt
23rd June 2009, 10:04 AM
try this on for size:

function(pageData, pageUrl) {
    var retVal = new Object();

    // Group 2 = image URL from the href, group 4 = real filename from the title.
    // Groups 1 and 3 capture the quote character so \1 and \3 can match the closing quote.
    var sPattern = pageData.match(/<a href=("|')(http:\/\/www\d+\.pic-upload\.de\/.+\/.+?)\1.+title=("|')(.+)\3>/);

    if (!sPattern) {
        retVal.imgUrl = null;
        retVal.status = "ABORT";
    }
    else {
        retVal.imgUrl = sPattern[2];
        retVal.status = "OK";

        try {
            retVal.fileName = sPattern[4];
        }
        catch(e) {
            // Fall back to a random filename
            retVal.fileName = Math.random().toString().substring(2) + ".jpg";
        }
    }

    return retVal;
}

sorry for the lack of explanation.

web_surfer
24th June 2009, 12:24 AM
Ok, here is a little explanation:

// Scan for picture-URL
var sPattern = pageData.match(/\"(http:\/\/www[0-9]\.pic-upload\. de)\"/);

That's the mistake. That regular expression covers only the first part of the URL (the host) and expects the closing quote right after it. You need to match the complete URL.

Look at cybormatt's code:
var sPattern = pageData.match(/<a href=("|')(http:\/\/www\d+\.pic-upload\.de\/.+\/.+?)\1.+title=("|')(.+)\3>/);

That regular expression matches the complete url.
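
To see the difference, here is a minimal sketch that runs both patterns against the line from the first post. It assumes the real host name without the spaces that were added for the forum, and the img tag is shortened for readability; the variable names are only for illustration.

```javascript
// Sample line from the page (forum spaces removed, img attributes shortened).
const pageData =
  '<a href="http://www3.pic-upload.de/18.06.09/7ui9oz.jpg" rel="lightbox" ' +
  'title="Franziska-van-Almsick-Premiere-zum-Film-Cars-07.09.2006.jpg">' +
  '<img src="http://www3.pic-upload.de/18.06.09/7ui9oz.jpg" id="thepic" /></a>';

// Original pattern: demands a closing quote directly after ".de",
// but on the page the URL continues with "/18.06.09/...".
const hostOnly = /"(http:\/\/www[0-9]\.pic-upload\.de)"/;

// Pattern from the reply: continues through the path up to the closing quote.
const fullUrl = /<a href=("|')(http:\/\/www\d+\.pic-upload\.de\/.+\/.+?)\1.+title=("|')(.+)\3>/;

const hostMatch = pageData.match(hostOnly);
const fullMatch = pageData.match(fullUrl);

console.log(hostMatch);    // null
console.log(fullMatch[2]); // http://www3.pic-upload.de/18.06.09/7ui9oz.jpg
console.log(fullMatch[4]); // Franziska-van-Almsick-Premiere-zum-Film-Cars-07.09.2006.jpg
```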

To get the filename:
retVal.fileName = sPattern[4];
The '\3' is a backreference to the third capturing group, i.e. the quote character that opened the title (it can be a " or '). If you know that the website uses ", you can simplify the expression:
var sPattern = pageData.match(/<a href="(http:\/\/www\d+\.pic-upload\.de\/.+\/.+?)".+title="(.+)">/);
Now the url is in sPattern[1] and the filename is in sPattern[2].
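
As a rough sketch, the simplified expression could be dropped into a host function like this. The name grabPicUpload and the test strings are made up for this example (the function in the hosts file is anonymous); the retVal fields are the ones used earlier in the thread.

```javascript
// Hypothetical named version of the host function, using the simplified pattern.
function grabPicUpload(pageData, pageUrl) {
  // Start pessimistic: no image found.
  const retVal = { imgUrl: null, status: "ABORT" };

  // Group 1 = image URL from the href, group 2 = real filename from the title.
  const m = pageData.match(/<a href="(http:\/\/www\d+\.pic-upload\.de\/.+\/.+?)".+title="(.+)">/);
  if (m) {
    retVal.imgUrl = m[1];
    retVal.fileName = m[2];
    retVal.status = "OK";
  }
  return retVal;
}

// Made-up page snippets for a quick check.
const okPage =
  '<a href="http://www3.pic-upload.de/18.06.09/7ui9oz.jpg" rel="lightbox" ' +
  'title="real-name.jpg"><img src="http://www3.pic-upload.de/18.06.09/7ui9oz.jpg" /></a>';
const okResult = grabPicUpload(okPage, "");
const badResult = grabPicUpload("<p>no image here</p>", "");

console.log(okResult.status, okResult.imgUrl, okResult.fileName);
console.log(badResult.status);
```

If the page ever switches to single quotes, this version stops matching, which is the trade-off against the ("|') form above.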

This text may help you with the parentheses issue:
These are called capturing parentheses. For example, /(foo)/ matches and remembers 'foo' in "foo bar." The matched substring can be recalled from the resulting array's elements [1], ... , [n].
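
A small sketch of capturing parentheses and a backreference in action. The title= strings are invented for the example.

```javascript
// Element [0] is the whole match, [1] is what the first pair of parentheses captured.
const simple = "foo bar".match(/(foo)/);
console.log(simple[0], simple[1]); // foo foo

// Group 1 remembers which quote opened the value; \1 requires the same quote to close it.
const quoted = /title=("|')(.+?)\1/;

const dbl = 'title="a.jpg"'.match(quoted);
const sgl = "title='b.jpg'".match(quoted);
const mixed = 'title="c.jpg\''.match(quoted); // opens with " but ends with '

console.log(dbl[2]); // a.jpg
console.log(sgl[2]); // b.jpg
console.log(mixed);  // null
```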

Max_Headroom
24th June 2009, 03:59 AM
Thank you very much !
It works perfect !
One more for the host-list :)
I really need to learn more about that powerful RegEx matching. It's much more complicated than my ol' Amiga knowledge where * and ? were enough to do all the work ;)

cybormatt
24th June 2009, 06:29 AM
Well done web_surfer. Thanks for the help with the explanation. As you say... in the old days * and ? were enough... but you also had to use string manipulation techniques. The Basic language includes a nice set of string manipulation functions, but it also required some serious thought as to how the algorithm should work. Technically, you can use only string manipulation to find the patterns in the web pages, but it would require a lot of extra work. Regular expressions have taken the job of pattern matching and made it easier. Think of them as wild cards on steroids, except you specify what kind of characters the wild cards should match and how many characters they should span. Check out the following sources for more information:

http://www.regular-expressions.info/
https://developer.mozilla.org/en/Core_JavaScript_1.5_Guide/Regular_Expressions
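
To make the "wild cards on steroids" idea concrete, here is a small sketch comparing a wildcard-style pattern with regular expressions that pin down the kind and number of characters. The file names are made up for the example.

```javascript
const names = ["img001.jpg", "imgabc.jpg", "img1234.jpg", "img007.png"];

// Roughly what the wildcard img???.jpg does: any three characters.
const wildcardLike = /^img...\.jpg$/;

// Kind of character specified: exactly three digits.
const threeDigits = /^img\d{3}\.jpg$/;

// Length specified too: one or more digits, and either extension.
const anyDigits = /^img\d+\.(jpg|png)$/;

const byWildcard = names.filter(n => wildcardLike.test(n));
const byThreeDigits = names.filter(n => threeDigits.test(n));
const byAnyDigits = names.filter(n => anyDigits.test(n));

console.log(byWildcard);    // [ 'img001.jpg', 'imgabc.jpg' ]
console.log(byThreeDigits); // [ 'img001.jpg' ]
console.log(byAnyDigits);   // [ 'img001.jpg', 'img1234.jpg', 'img007.png' ]
```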