Assignment 8: XML parsing?

Europe ? #
Polska http://pl.engadget.com
Deutschland http://de.engadget.com
Asia ? #
???? http://chinese.engadget.com
???? http://cn.engadget.com
??? http://japanese.engadget.com/
???? http://kr.engadget.com/
Español http://es.engadget.com
HD http://www.engadgethd.com
Mobile http://www.engadgetmobile.com
Engadget http://www.engadget.com/
Engadget #
Web http://search.aol.com/aol/search?invocationType=wl-gadget&query=
Images http://search.aol.com/aol/image?invocationType=wl-gadget&query=
Video http://search.aol.com/aol/video?invocationType=wl-gadget&query=
News http://search.aol.com/aol/news?invocationType=wl-gadget&query=
Local http://local.aol.com/aol/local?invocationType=wl-gadget&query=
RSS Feed /rss.xml
Contact us /contact/comment/
Tip us on news! /contact/tips/
http://www.monoprice.com/products/subdepartment.asp?c_id=104&cp_id=10428
Permalink http://www.engadget.com/2009/03/30/mini-displayport-adapters-now-available-for-20/
Email this /forward/1502779/
31 Comments http://www.engadget.com/2009/03/30/mini-displayport-adapters-now-available-for-20/#comments
http://www.businesswire.com/portal/site/google/?ndmViewId=news_view&newsId=20090330006184&newsLang=en
Permalink http://www.engadget.com/2009/03/30/intels-xeon-3500-5500-series-officially-unveiled-for-servers-a/
Email this /forward/1502788/
17 Comments http://www.engadget.com/2009/03/30/intels-xeon-3500-5500-series-officially-unveiled-for-servers-a/#comments

For this assignment I tried to get Engadget headlines and mix up the links so that the pairings don't make sense, but all I could parse was the links and junk shown above. I tried to use Getter.java and Homework.java to make this work. Later I found that Homework.java was written to parse HTML, not XML. So instead of feeding in the rss.xml link for Engadget, I fed in the direct HTML link. Since the HTML source was very messy, I could not separate out the elements I needed.

code:

import org.dom4j.Document;
import org.dom4j.DocumentFactory;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
import org.xml.sax.XMLReader;
import java.util.HashMap;
import java.util.List;

public class Getter1 {
    public static void main(String[] args) throws Exception {
        // String url = args[0];

        // Register the XHTML namespace so XPath can use the "xhtml:" prefix.
        HashMap<String, String> map = new HashMap<String, String>();
        map.put("xhtml", "http://www.w3.org/1999/xhtml");
        DocumentFactory factory = DocumentFactory.getInstance();
        factory.setXPathNamespaceURIs(map);

        // TagSoup turns messy real-world HTML into well-formed XHTML events.
        XMLReader tagsoup = new org.ccil.cowan.tagsoup.Parser();
        SAXReader reader = new SAXReader(tagsoup);
        // EasyHTTPGet is the helper class supplied with the assignment.
        EasyHTTPGet getter1 = new EasyHTTPGet("http://www.engadget.com");

        Document document = reader.read(getter1.responseAsInputStream());
        List<?> listItems = document.selectNodes("//xhtml:li");

        for (Object o : listItems) {
            Element elem = (Element) o;

            // Not every <li> has an <a> as a direct child, so skip those.
            Element anchor = (Element) elem.selectSingleNode("xhtml:a");
            if (anchor == null) {
                continue;
            }
            String linkText = anchor.getText();
            String href = anchor.attributeValue("href");
            System.out.println(linkText + " " + href);
        }
    }
}
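
For comparison, here is a rough sketch of what the original idea (read the RSS feed as XML, then pair each headline with the wrong link) might look like. It is not the assignment's Getter.java or Homework.java. It uses only the JDK's built-in DOM parser instead of dom4j and TagSoup, and it reads a tiny made-up feed from a string rather than fetching the real rss.xml. The class name RssShuffle, the sample headlines, and the example.com links are all placeholders.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssShuffle {
    // Made-up feed standing in for the real rss.xml (assumption, not real data).
    public static final String SAMPLE_RSS =
        "<?xml version=\"1.0\"?>"
        + "<rss version=\"2.0\"><channel><title>Sample feed</title>"
        + "<item><title>First headline</title><link>http://example.com/1</link></item>"
        + "<item><title>Second headline</title><link>http://example.com/2</link></item>"
        + "<item><title>Third headline</title><link>http://example.com/3</link></item>"
        + "</channel></rss>";

    // Text of the first <tag> element inside parent, or "" if there is none.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
    }

    // Parses RSS XML and returns one {title, link} pair per <item>.
    public static List<String[]> parseItems(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            result.add(new String[] { childText(item, "title"), childText(item, "link") });
        }
        return result;
    }

    // Keeps the headlines in order but pairs them with a shuffled copy of the links.
    public static List<String> mixUp(List<String[]> items, Random rng) {
        List<String> links = new ArrayList<>();
        for (String[] item : items) {
            links.add(item[1]);
        }
        Collections.shuffle(links, rng);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            lines.add(items.get(i)[0] + " " + links.get(i));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        for (String line : mixUp(parseItems(SAMPLE_RSS), new Random())) {
            System.out.println(line);
        }
    }
}
```

Because an RSS feed is already well-formed XML with one title and one link per item, there is no need for TagSoup or for guessing which li elements matter. To run this against a live feed, the string input could be swapped for the stream returned by EasyHTTPGet, assuming that helper behaves as it does in the code above.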

This entry was posted on Tuesday, March 31st, 2009 at 2:18 am and is filed under Uncategorized.
