This forum is locked: you cannot post, reply to, or edit topics.   This topic is locked: you cannot edit posts or make replies. Posted in: Widget Development

Extracting Info from Web Pages - XQuery?

Author Message
wuf810
Joined: 27 Mar 2005
Posts: 19
Location: United Kingdom (Great Britain)

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

Hi all,

I'm new to Dashboard but have developed a number of Sherlock 2 channels. Much like Dashboard in its implementation, Sherlock 2 also uses JavaScript for a lot of the coding. However, its other major language is XQuery (which itself uses XPath), and as such extracting "parts" of an HTML page through elements, nodes etc. is very simple.

My question is: how can this be achieved with Dashboard? If it involves writing a plug-in I'm stuffed, as I've no idea how to go about this. Can it be done natively in JavaScript, or is there another way to extract certain info from a web page?

Any ideas or suggestions gratefully accepted.

Thanks, Michael.

wuf810
Joined: 27 Mar 2005
Posts: 19
Location: United Kingdom (Great Britain)

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

I of course meant Sherlock 3. However my question still remains...

Z
Guest

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

You probably know this already, but it would depend on whether the pages you are talking about provide web services, for instance providing raw data in XML, or an RSS feed.

If the page does not, your next resort would be to 'scrape' the HTML for useful information. This can work well, but relies on the page layout staying consistent, and is slower than simply getting the data from a feed. Were there any specific sites you had in mind?

wuf810
Joined: 27 Mar 2005
Posts: 19
Location: United Kingdom (Great Britain)

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

Z, no. Unfortunately there is no web service and no XML, and the layout often changes. I've decided to use the DOM to parse the page instead. Not as easy or elegant as XPath, but at least it's a solution.

Thanks for the response.

Regards, Michael.

wuf810
Joined: 27 Mar 2005
Posts: 19
Location: United Kingdom (Great Britain)

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

OK it turns out I can't use the DOM or DHTML Object Model as planned because of the security aspect of "scraping" a webpage from a different domain.

It does appear (at least to me) that an XPath plug-in is a necessity for making gadgets that extract and display information from a variety of websites.

I've had a quick look at doing this, and all the tools are available within Cocoa and the Apple libraries (NSXML), but I just can't get my head around how to do it... I'm not a Cocoa programmer.

Anybody out there that can help me? Anyone written their own Plug-Ins? I'm willing to pay for your time....

Michael.

Chris
Joined: 27 Jan 2005
Posts: 344
Location: Durham, UK

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

Are you sure about that? I've written a bunch of widgets which scrape sites. Translate is the most noticeable one, although the method it uses isn't exactly nice.

I used to use XPath a lot a couple of years ago, but for some reason haven't ever tried it in Safari. I think I'll go and have a play...

As a quick thought, I'd presume something like:

var xpath_result = resp.evaluate("//some/query", resp, null, XPathResult.ANY_TYPE, null);

would work, where resp is the responseXML from an XMLHttpRequest (note that the context node is resp itself, not the widget's own document). To get that from an HTML page, you'd override its MIME type as part of the request.

That's all off the top of my head - I'll go and check it out when I have some time. It may just be that WebCore doesn't support XPath at the moment. If not, you could still use DOM methods on the responseXML... However, because the web pages you want to interact with may be badly written (not well formed), you may have to resort to regular expressions and other string methods to match what you want from the responseText.
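
For what it's worth, a minimal sketch of that XPath idea might look like the following. It assumes the engine implements the DOM Level 3 XPath evaluate method on the response document, which may not be true of WebCore at the moment; collectNodes is a made-up helper name, not part of any API.

```javascript
// Sketch only: collectNodes is a hypothetical helper name.
// XPathResult.ORDERED_NODE_SNAPSHOT_TYPE has the numeric value 7.
var ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Run an XPath query against a parsed document and return the matches as an array.
// Returns null when there is no document or no XPath support.
function collectNodes(doc, query) {
  if (!doc || typeof doc.evaluate !== "function") {
    return null;
  }
  // The context node is the response document itself.
  var result = doc.evaluate(query, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  var nodes = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
```

You would call it as collectNodes(xmlhttp.responseXML, "//some/query") once the request has finished, and treat a null return as the cue to fall back to DOM or string methods.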

wuf810
Joined: 27 Mar 2005
Posts: 19
Location: United Kingdom (Great Britain)

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

Hi Chris,

Initially I was using a hidden iframe to load the web page I wanted to "scrape" and then tried to use the DOM to grab the elements I needed. Unfortunately I hadn't realised that using the DOM with JavaScript requires that the page in the frame come from the same domain! I will try XMLHttpRequest and see how I get on. If I'm accessing a non-XML page, do I have to override its MIME type? If so, how would I do this?

I'm sure I read somewhere that XPath is not available directly within Safari (or Mozilla), which is why I think I need a plug-in. However, I had assumed it was part of WebCore, as I thought Sherlock was using WebCore... but to be honest I'm not knowledgeable enough to test it one way or the other. The more I read, the more confused I'm getting.

If you do find out anything, it would be much appreciated.

Thanks, Michael.

Chris
Joined: 27 Jan 2005
Posts: 344
Location: Durham, UK

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

XPath support is in Mozilla, and has been for quite a while.
Quoting: wuf810

If I'm accessing a non-XML page, do I have to override its MIME type? If so, how would I do this?

You use xmlhttp.overrideMimeType("text/xml"); making sure you put it before xmlhttp.send(); where xmlhttp is your XMLHttpRequest object.

If the responseXML you get is undefined, then try using regular expressions on the responseText instead.

I'm about to go and try some XPath stuff. I'll report back.
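
Put together, the request and the fallback could be sketched roughly like this. The names fetchPage and extractTitle, and the title regex, are only illustrations; the check for a parsererror root element is an assumption about how some engines report XML that failed to parse.

```javascript
// Sketch only: fetchPage, extractTitle and the regex are illustrative assumptions.

// Regex fallback: pull the contents of the first <title> element out of raw HTML.
function extractTitle(html) {
  var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].replace(/^\s+|\s+$/g, "") : null;
}

// Fetch a page and call onDone(xmlDocument, null) if it parsed as XML,
// or onDone(null, title) using the string fallback if it did not.
function fetchPage(url, onDone) {
  var xmlhttp = new XMLHttpRequest();
  xmlhttp.open("GET", url, true);
  xmlhttp.overrideMimeType("text/xml"); // must come before send()
  xmlhttp.onreadystatechange = function () {
    if (xmlhttp.readyState !== 4) { return; } // 4 means the request is complete
    var xml = xmlhttp.responseXML;
    var root = xml && xml.documentElement;
    if (root && root.nodeName !== "parsererror") {
      onDone(xml, null); // well formed: DOM methods can be used on it
    } else {
      onDone(null, extractTitle(xmlhttp.responseText)); // not well formed: match strings
    }
  };
  xmlhttp.send(null);
}
```

Only extractTitle is specific to one piece of data; for anything else you would swap in a pattern that matches the markup around the value you are after.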

wuf810
Joined: 27 Mar 2005
Posts: 19
Location: United Kingdom (Great Britain)

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

Hi Chris,

Yes, you're right about XPath support in Mozilla. I found a very helpful page at http://www-jcsu.jesus.cam.ac.uk/~jg307/mozilla/xpath-tutorial.html and its examples work perfectly in Firebird, but alas not in Safari. I haven't tried them in Dashboard but assume they won't work...

Thanks for your help with the XMLHttpRequest stuff. I didn't realise it was so powerful. It took me a while to get it working, though, as I took the following extract from an Apple document (http://developer.apple.com/internet/webcontent/xmlhttpreq.html) as the whole truth:

"Security Issues

When the XMLHttpRequest object operates within a browser, it adopts the same-domain security policies of typical JavaScript activity (sharing the same "sandbox," as it were). This has some important implications that will impact your application of this feature.

First, on most browsers supporting this functionality, the page that bears scripts accessing the object needs to be retrieved via http: protocol, meaning that you won't be able to test the pages from a local hard disk (file: protocol) without some extra security issues cropping up, especially in Mozilla and IE/Windows. In fact, Mozilla requires that you wrap access to the object inside UniversalBrowserRead security privileges. IE, on the other hand, simply displays an alert to the user that a potentially unsafe activity may be going on and offers a chance to cancel.

Second, the domain of the URL request destination must be the same as the one that serves up the page containing the script. This means, unfortunately, that client-side scripts cannot fetch web service data from other sources, and blend that data into a page. Everything must come from the same domain. Under these circumstances, you don't have to worry about security alerts frightening your users."


In fact that is not entirely true: for me, at least, it works only with the file: protocol and not over HTTP locally. It just proves one has to try it oneself to be sure...

Michael.
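
To make the "same domain" rule in that extract concrete, here is a rough sketch of the comparison a browser performs. sameOrigin is a hypothetical helper, it relies on the standard URL class (so it needs an engine that provides one), and it says nothing about how Dashboard itself treats widgets loaded via file:.

```javascript
// Sketch only: sameOrigin is a hypothetical helper, not a browser or Dashboard API.
// Two URLs count as the same origin when scheme, host name and port all match.
function sameOrigin(pageUrl, requestUrl) {
  var page = new URL(pageUrl);
  var target = new URL(requestUrl);
  // URL.host includes the port whenever it is not the default for the scheme.
  return page.protocol === target.protocol && page.host === target.host;
}
```

For example, sameOrigin("http://example.com/widget.html", "http://example.com/data.xml") is true, while a request from that page to http://other.example.com/ or to https://example.com/ is not.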

bartocc
Joined: 27 Mar 2005
Posts: 9

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

I do use XMLHttpRequest to get a non-XML page. This page is simple HTML. I load the responseText in a simple <div> tag so that it is parsed by the DOM, and then scrape the needed info with either getElementsByTagName or getElementById, plus innerText or innerHTML.
When I have finished manipulating the whole HTML, I just replace all of it with nothing and display the desired info as I want, inside the widget display borders.

don't know if this is any help

julien

wuf810
Joined: 27 Mar 2005
Posts: 19
Location: United Kingdom (Great Britain)

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

bartocc,

That does help thank you.

When you say you "load the responseText in a simple <div> tag so that it is parsed by the DOM", how do you do this?

M.

Chris
Joined: 27 Jan 2005
Posts: 344
Location: Durham, UK

Posted: Sun Nov 14, 2004 - 2:24 pm    Post subject: Imported from The Dashboader

document.getElementById("div_id").innerHTML = xmlhttp.responseText; is one way.
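
Expanded into the whole load, scrape and clear cycle that bartocc describes, a sketch might look like this. scrapeText is a made-up name, and the container is assumed to be a hidden <div> in the widget, such as the "div_id" element above.

```javascript
// Sketch only: scrapeText is a hypothetical helper name.
// Parses html inside container, collects the text of every tagName element,
// then empties the container again so none of the fetched page is displayed.
function scrapeText(container, html, tagName) {
  container.innerHTML = html; // the engine parses the markup here
  var found = container.getElementsByTagName(tagName);
  var texts = [];
  for (var i = 0; i < found.length; i++) {
    texts.push(found[i].innerText); // copy the text out before clearing
  }
  container.innerHTML = ""; // replace all of it with nothing
  return texts;
}
```

A call such as scrapeText(document.getElementById("div_id"), xmlhttp.responseText, "h2") would then return the text of each h2 heading on the fetched page. Bear in mind that markup assigned to innerHTML becomes part of the widget's own document, so its images will be requested while it is in place.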