drewp

Categories: weblog | rdf


2008-11-02T10:08:32 Graphs from sparql results:

This is a response to Download SPARQL results directly into a spreadsheet


So far you've motivated seeing the results of a query in a table and making a graph from them. I'd like to have both of those capabilities in a webapp. E.g. I should be able to embed a live graph in my own page like this:

<img src="http://sparqlgrapher.com/svg/example.com/query=SELECT+?date+?price+{...}">

Visiting my hypothetical sparqlgrapher.com directly would give you a UI to lay out and customize the graph. When you're done, you'd take that url and embed it elsewhere (or just take a copy of the image, if you want a one-off).
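Here's a rough sketch of how a page might build that embed url. sparqlgrapher.com doesn't exist, so the path layout is just copied from my example above, and the function names are made up for illustration. Unlike the hand-written url, this version percent-escapes the '?' and braces in the query.

```python
from urllib.parse import quote_plus

# Hypothetical service; the path layout is copied from the example url above.
GRAPHER_BASE = "http://sparqlgrapher.com/svg"

def graph_image_url(endpoint_host, sparql):
    """Return a url that asks the (imaginary) grapher to plot a query."""
    return "%s/%s/query=%s" % (GRAPHER_BASE, endpoint_host, quote_plus(sparql))

def img_tag(endpoint_host, sparql):
    """Wrap the url in an <img> tag ready to paste into a page."""
    return '<img src="%s">' % graph_image_url(endpoint_host, sparql)

url = graph_image_url("example.com", "SELECT ?date ?price { ?s ?p ?o }")
print(img_tag("example.com", "SELECT ?date ?price { ?s ?p ?o }"))
```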

2008-10-13T22:38:18 rdflib vs jena graph creation APIs:

I actually looked at the jena RDF API today, and I was interested to see how graph creation compares to rdflib's style, which is the one I normally use.

From the Jena introduction (minus the model setup and some comments):

String personURI  = "http://somewhere/JohnSmith";
String givenName  = "John";
String familyName = "Smith";
String fullName   = givenName + " " + familyName;

Resource johnSmith
  = model.createResource(personURI)
         .addProperty(VCARD.FN, fullName)
         .addProperty(VCARD.N,
                      model.createResource()
                           .addProperty(VCARD.Given, givenName)
                           .addProperty(VCARD.Family, familyName));

An rdflib python port of that:

johnSmith  = URIRef("http://somewhere/JohnSmith")
givenName  = "John"
familyName = "Smith"
fullName   = givenName + " " + familyName

graph.add((johnSmith, VCARD['FN'], Literal(fullName)))
name = BNode()
graph.add((johnSmith, VCARD['N'], name))
graph.add((name, VCARD['Given'], Literal(givenName)))
graph.add((name, VCARD['Family'], Literal(familyName)))

If I were making a new version of the rdflib API, here's what I'd consider:

  1. Don't expose the strings of URIRefs very easily. It should be somewhat hard to examine or operate on a URIRef's string value. This is first to encourage good practices (no more "print u.split('/')[-1]" or "if 'foo' in u:") but more importantly to allow for optimizations in query engines. A backend should be able to return URIRefs containing its internal ids while a query is running, and then any URIs that make it to the result set can be looked up (if needed). Actually getting the string form of a URIRef would still be possible of course, but it might require an explicit method call. Most people's URIRef serializations are in __repr__ calls and output formats, so this change shouldn't be that noticeable.
  2. Literal can still be a string subclass; BNode should not be.
  3. Like jena, don't require Literal() on strings unless they need a lang or datatype. People forget Literal() sometimes anyway, which rdflib sometimes handles. The rest of the time it corrupts your database. "hello" is pretty clearly the same as Literal("hello"), so I think it's fine to support "hello" the same way that rdflib now treats 5 as an xsd:int literal. Another choice would be to error quickly on strings, which would be a good move if people were providing URIRefs as strings accidentally.
  4. Support graph construction APIs with named methods, like graph.node(uri1).addProperty(pred1, "value1"). These are nice for new users since they bring terminology in quickly, and some of the condensed forms seem cool.  I don't like jena's 'createNode' method name, though, since as far as your graph is concerned, nothing got created. The fact that a java Node object was created is not important. Other possibilities:
    I also obviously prefer 'edge' to 'property', since edges sound more like the free-form graph that we're making. What system would have a property whose value is another property? That's a perfectly natural RDF construct, but pred1.addProperty(pred2, pred3) doesn't look so natural. It's also less surprising that edges can be traversed both ways. Other systems with "properties" don't always support that, causing users to make redundant inverse properties where they think they might need to traverse backwards.
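To make the list above a bit more concrete, here's a toy sketch of what such an API could feel like. None of this is rdflib or jena: the URI, Blank, Node and Graph classes and the node(), edge() and to_string() names are all invented for illustration, and the VCARD namespace string is my assumption about the one jena's vocabulary class uses. It covers points 1 through 4: the URI string is only reachable through an explicit call, blank nodes aren't strings, bare strings are accepted as literals, and graph construction uses named, chainable methods.

```python
# Assumed namespace for the vCard terms used in the jena example.
VCARD = "http://www.w3.org/2001/vcard-rdf/3.0#"

class URI:
    """Opaque identifier; the string needs an explicit to_string() call."""
    def __init__(self, value):
        self._value = value
    def to_string(self):
        return self._value
    def __eq__(self, other):
        return isinstance(other, URI) and self._value == other._value
    def __hash__(self):
        return hash(("URI", self._value))
    def __repr__(self):
        return "<%s>" % self._value

class Blank:
    """Blank node; deliberately not a string subclass."""
    _counter = 0
    def __init__(self):
        Blank._counter += 1
        self._id = Blank._counter
    def __repr__(self):
        return "_:b%d" % self._id

class Node:
    """A handle on one term in a graph; edge() adds a triple and chains."""
    def __init__(self, graph, term):
        self.graph, self.term = graph, term
    def edge(self, pred, obj):
        if isinstance(obj, Node):
            obj = obj.term          # allow nesting another node handle
        # plain Python strings are stored as-is and treated as literals
        self.graph.triples.add((self.term, pred, obj))
        return self

class Graph:
    def __init__(self):
        self.triples = set()
    def node(self, term=None):
        """Refer to a term; nothing is 'created' in the graph by this call."""
        return Node(self, term if term is not None else Blank())

graph = Graph()
name = (graph.node()
        .edge(URI(VCARD + "Given"), "John")
        .edge(URI(VCARD + "Family"), "Smith"))
(graph.node(URI("http://somewhere/JohnSmith"))
 .edge(URI(VCARD + "FN"), "John Smith")
 .edge(URI(VCARD + "N"), name))

for triple in graph.triples:
    print(triple)
```

The same four triples come out as in the rdflib port, without any Literal() or BNode() calls in the calling code.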

2008-08-03T22:25:34 New home page:

I played with a bunch of New Fangled Web Technologies and redid my home page. Almost everything is dynamically derived from data sources that I presumably keep up to date for other reasons. The foaf part and projects list part aren't done yet. I also haven't removed all the zope pages yet, unfortunately. (Zope turned out not to be a good system for making a low-maintenance site that lasts for 10+ years.)

I hope to have a DOAP document for each project, which will make them easy to list on my home page as well as other project-list systems.

 

2008-04-27T19:51:52 Using freebase to help with dbpedia searches:

I wrote this response to a thread on a mailing list, but I can't find anyplace where sourceforge has my reply online (I did receive it in an email). I would have expected it on this archived thread.

So here it is again, at a place I can link to.



On 21 Apr 2008, at 14:40, robl wrote:
>> SELECT * FROM pages WHERE page_title LIKE "Queen%Elizabeth"
>>
>> This would perform a case insensitive match on Queen(anything)
>> Elizabeth
>> (at least in mySQL).
>>
...
>> Is there quick way to do what I want ?  Are there any indexes I could
>> apply to improve things (I have already created the indexes
>> specified at
>> http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/
>> kidehen@openlinksw.com's%20BLOG%20%5B127%5D/1298)
>> ?
>>
>> Or do I need to create a conventional SQL table of resource names and
>> then do a SQL LIKE query on those ?
>>

You might also want to check out freebase. Here's the approach I'm about
to attempt, myself. Start with a reconciliation query:

http://sandbox.freebase.com/dataserver/reconciliation/?name=Queen+Elizabeth&types=%2Fpeople%2Fperson&responseType=html
- the reconciliation service handles misspellings and other variations
- s/html/json/ for the machine readable version

Then look at the freebase page or perform a query:

http://www.freebase.com/view/en/elizabeth_ii_of_the_united_kingdom

That page has this link:

http://en.wikipedia.org/wiki/index.html?curid=12153654

On that page, we have

<a href="http://en.wikipedia.org/wiki/Elizabeth_II_of_the_United_Kingdom">article</a>

Maybe freebase can just hand us that link instead of the curid one. I
haven't gotten to that part of my code yet. I don't know how often the
last word of the freebase URI is in sync with the WP one, but that seems
like it would be the least reliable. Following freebase's designated WP
link is probably more robust.

Finally, take the wiki name, and make a dbpedia URI:

http://dbpedia.org/page/Elizabeth_II_of_the_United_Kingdom
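A minimal sketch of that last step, assuming you already have the named wikipedia article link in hand. The function name is made up, and the sketch refuses the curid-style link rather than guessing at a title.

```python
from urllib.parse import urlparse

def dbpedia_page(wikipedia_url):
    """Turn a named wikipedia article url into the matching dbpedia page url."""
    path = urlparse(wikipedia_url).path
    prefix = "/wiki/"
    # curid links look like /wiki/index.html?curid=... and carry no title
    if not path.startswith(prefix) or path == prefix + "index.html":
        raise ValueError("not a named article url: %r" % wikipedia_url)
    return "http://dbpedia.org/page/" + path[len(prefix):]

print(dbpedia_page(
    "http://en.wikipedia.org/wiki/Elizabeth_II_of_the_United_Kingdom"))
```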



You probably noticed that elizabeth_ii_of_the_united_kingdom wasn't the
first result for 'Queen Elizabeth' of type /people/person. I'm not sure
if freebase considers that a bad result page or not. The reconciliation
service is new, so now's probably a great time to tell them how
important good results are to you :)

2008-03-05T23:53:13 Notes from the talks at Semantic Web: Are Scalable Graph Data Applications Possible?:


I was looking forward to more Oracle demos and roadmap-type discussion, but instead the highlight was allegro.



jeff from oracle says:
fraud detection is using graphs
we're generating data faster than we can process it
business value comes from: reduce cost of operations; aid decision-making; improve the transparency of business operations (e.g. for businesses that need to meet regs)
nice slide on DB approaches, broken into disk/ram, native/layered, etc
siderean is a company doing in-mem, multiple machine storage



david from mulgara
key to web scaling is the late binding of address to resource. Allows the information mgmt technique of the web to scale well

the next gen mulgara version, which the team was meeting about this week in SF, will use lots of disk, perhaps 40G ram, and store 100B (?) triples



vertica:
SQL DBMS, focus on analytics
50+ customers: verizon, comcast, level3
came from the cstore project, MIT
MIT library catalog is rdf, 50M triples: Barton Dataset
uniprot protein dataset is 262M triples. vertica serves that dataset for public querying



jans aasman, franz inc:
23-year-old company, 2 yrs with a triple store
customers do 'event handling and activity recognition'
50 customers, plus free download
monterey aquarium doing Marine Metadata Interoperability Project
los alamos is studying who reads what publications, graph structures in readership
sun doing baetle, the bug tracking one
japan telecom KDDI is doing spam and fraud detection with allegro. they need to determine what is spam across their busy network. they create new spamassassin rules over time.
OFFIS using rdf for info about power grid usage
allegro loads 1e9 quads in 8 hours
has sesame interface
supports xml schema datatypes, e.g. range queries on dates. Literals can be stored as their own numbers
'social network analytics library' for degrees, cliques, group stats
quick loads from oracle for temporary dbs used for analytics (coming soon)
RDFS++ reasoner for the usual inferences
temporal reasoning (allen's temporal logic, for intervals)
their time/space handling helps with event search. one query involves a place and radius, person connections, other event details
police, e.g., need temporal reasoning
"homeland security is interested in every type of imaginable event"
"find all meetings that happened in december within 5 miles of berkeley that was attended by the most important person in Jans' friends and friends of friends"
they have a custom query language for their various datatypes and their capabilities, like (geo-box-around !geoname:Berkeley ?event 5 miles)
even 3 months of american phone call records is already petabytes
jans' thesis was about car driving behavior
GPS (maybe plus phone) leads to rich data about people- work, purchasing, etc



My question, which I didn't get to ask: How do the approaches compare in terms of latency for very small queries? Many of my queries are not batched together well, or my app needs to make a lot of decisions during the graph traversal.

2008-02-22T00:38:14 Goals for a wiki system:

Some goals for a better wiki system:

A common case seems to be "add a new page and list it in some existing TOC section". Another one is "add a new section (paragraph or more) to this page". Editing words within an existing section that you didn't write, that might be rare.

I still like tinymce, although nelix_ isn't a fan.

Wikis that I use (that I'm trying to be better than) are: twiki, zwiki, confluence.

Related: rdf blog engine ideas

2007-12-14T09:15:52 Notes from Intelligence at the Interface:

Event: http://sdforum.org/index.cfm?fuseaction=Calendar.eventDetail&eventId=13012&nodeID=1

tom gruber, tomgruber.org

Progress in the user experience on the web, if we look at what the user has to do and what the rest of the system has to do:

breadcrumbs (just links, user does everything)
-then-> portals (user picks yahoo, yahoo does more work) -then-> search (user queries, search engine does work) -then-> room service (agents)

examples:

'sandy' is an email reminder assistant

'farecast' for airfare. suggests alternate cheaper flights, trends. Looked cool from the screenshot and description.

Tom remarked at the end that finally, intelligence and computation will be able to be what we compete with, instead of just having "brand bullies" :)

And, "each time AI does a job well, it always disappears"

twine, nova

Remember when you started using delicious? it took 5 mins to learn most of the functionality, but then several days to notice that this is really worthwhile and it's going to help a lot. I expect a similar, but stronger, effect from twine. You learn the mechanics of checking information in, then after doing it for a while you notice which of your former laborious tasks have melted away. I also have high hopes for systems connected to twine. It's like a more polished version of piggybank. And they're going to add in recommendations, which may bring the 'smarts' closer to what magitti or calo is doing.

check Nova's blog for slideshow

semweb says, put metadata in the data so new software can reuse the past work (naturally!)

seems very close to that friendlist thing from that other blogger i read, i forget the exact name

builds a 'semantic interest profile' about you. picks people/places/organizations/topics you're interested in

create a 'twine' (like squidoo lens, page about a topic). The twines had surprising urls: like http://twine.com/twine/my-house, right at the global level. Are the urls different depending on who's logged in? Or does Nova's own stuff just go to the top? :)

A bookmarklet opens a transparent frame right on top of an external page you want to tag. From there, it's like delicious, but gathers a bit more data automatically.

When he used the bookmarklet on an amazon page, twine pulled some more fields from the page about the book

on the marked pages, twine finds words and topics and makes the links

edit-in-place UI to fix the fields of the data it found; add more fields. like freebase

they do some auto-summary of text from a wikipedia page

query is like newegg power search (or most semweb stuff for that matter), pick a type, add your filters

email in your own items to your 'recent items' list, just like a ticketing system would accept new tickets. URLs in the mail get crawled and those sites show up in your items too. (calo had a more turbocharged version of this, where they'd go hunting for info about everything and build big profiles about users and stuff)

goal of twine is organization. is this automating my tasks? the users will reveal what is valuable to automate.

PARC magitti

Finally, some novel UI work on a phone-based UI. It looked really nice-- low on sparkles and icons, high on usability. The app itself (recommendations and guides for your leisure time) seems good, and it was amazing to see a Japanese paper-printing company looking for ways to get into new media. Feels like the only stories I hear in the USA are about old companies putting their effort in keeping their old businesses going (e.g. big oil). Anyway, there was some cool personal activity prediction stuff like where they look at your messages and your past trends to guess what you want to do -right now-. I hope to get into exactly that kind of thing on my home automation project.

the name = magic + (something) + digital graffiti

19-25 year olds have 2x as much free time as other youth (japan, at least)

important for them to know what everyone else is thinking

predicts what to do, e.g. 'eat' (when it thinks you're hungry based on time, place, your emails, your explicit queries). Nice.

it reads emails only to guess what kind of activity you're currently doing. 11% of the test email dataset had information related to leisure activities (which is all magitti cares about). That seems low to me. Maybe that's all the ones they were able to correctly process (or maybe there's something I'm not estimating right about the emails of 20-somethings in Japan)

look at your past behavior to learn your patterns of eat/see/shop/... They can make plots based on day-of-week and time. This is what I want for my home automation.

ppl want to use the phone UI with one hand. 6 big buttons surrounding the content

pie menu on the phone. 4 quadrants only, sometimes more narrow ones for the border buttons. They looked really usable.

see yelp-style ratings on businesses, takes your star rating as you look at the page. collaborative recommendation stuff

the action buttons were arranged like this:
'M' [camera capture] [settings] (some content here) [any [eat]] [your location] 'clock'

hit the lower-left one to change your activity from 'any' to something else. Even if you don't say anything, they still list good ideas from their best matches of your activity, place, reviews, etc

you can force the activity ('shopping for clothes') and it refilters.

From the QA session: "what does the next 10 years of AI look like?"
answer: "busy"

yahoo

The phone-photo-tag part of this demo gave the most feeling of "you are looking into the future of technology" of all the presentations tonight. The UI was not elaborate. Mainly, it's that your phone camera is helping you tag your photos in real time (like delicious, except it knows your position and millions of past flickr tags too) and it's readily presenting you with other photos of interest. Everyone using this would essentially be running their own little version of justin.tv (photos, not video). The heavily-assisted tagging helps you organize your photos, and therefore organize your memories. Valuable! The speaker mentioned an example of looking up where you last had dinner with that friend. Since it was so easy at the time, you would have taken a photo and tagged it with the friend and the restaurant. Problem solved.

flickr photo locations plus tags shows popular tags on the map. 'tagmaps' from yahoo research berkeley. pretty cool to zoom in and out. using 4M photos, last year's data

upcoming version has 30M photos. Sometimes, these tags annotate world maps better than the pros do.

autotag your vacation photos by using the place of the photo

see the 'fireeagle' project for how web apps can know your location

I don't have live notes about the best demos, since I had to change seats to see the screen. The phone app that shows various feeds of pics included "wallet" (the photos you often show people), "my wife" (the photos she's taking now), "any flickr photos tagged with 'happy' near this location".

when reviewing all the tags on flickr, they consider the time too, so as to figure out which things are actually events ('bluegrass festival') and not places ('the mission'). This is like a topic I got into at a semweb meetup once: with just the tags on delicious, could you produce the names of all the states and their capitals? (I think yes)

CALO

The calo express part of the demo was pretty nice. It's a much smarter desktop search that would easily beat whatever you're using now. Especially what I'm using now, which is nothing (and I've tried a few OSS projects a little). Things took a turn for the industrial-strength-awesome when it got into the meeting planning and recording features, mainly for the amount of tech they're throwing at the problem. The AI testing stuff was also amazing, and it helped connect the project back to real life: if they don't make a certain amount of progress in their AI evaluations, they don't get funded for the next year.

This is a big research project that covers CPOF (recently in a Wired article) and has some kind of cross funding and sharing with many other projects, including twine.

cognitive assistant that learns and organizes

SRI, darpa

includes Command Post of the Future

builds 'relational model' of user's world. not sure if it's rdf

guesses what emails are about, what tasks they go with. you give feedback

'meeting understanding'. remote people are in everyone's headsets. CALO writes transcript, action items, Q/A pairs.

when he comes to a mtg, calo knows what all the people have been doing

has some kind of chat bot for scheduling a meeting (and other tasks, apparently). you use limited natural language

AI uses 'probable beliefs', revises them as new facts come in. 'probabilistic consistency engine' can update knowledge with new facts.

each year, they test the system (like an SAT test) and it has to improve. questions like "what to do when tom can't make a meeting: A. reschedule; B. tell tom; ...". They compare the baseline untrained CALO to an instance with 16 users for 2 weeks, and note whether calo does better at the test after that learning.

they have a full self-contained office environment, and a lite version (used by DARPA). lite one has almost no interface

the lite version does: google desktop search PLUS nlp (!). calo found someone's home page, pulled number and address and job title. Noted the person's publications and web pages to see what the person does.

followup query: "people with expertise in learning" then ".. that work at SRI" to narrow it down

A query for "slides about iris" finds individual slides in past presentations. then you search for similar slides to a near-match. Apparently the normal desktop searches look for keywords and stuff in a whole .ppt, which is obviously not as useful.

make a new presentation just based on title. digs up all relevant slides

'preppak' for a meeting. finds all documents that are required or recommended for the meeting

in the meeting, you can watch the transcript, which knows the person since everyone wears a mic. Testing within the government

calo is a personal assistant, doesn't share much with groups. some things (e.g. meeting schedule) are shared. you don't reveal all your meeting time prefs, but the calos negotiate it

2007-12-02T03:19:09 Watching the X screen power state:

http://cvs.bigasterisk.com/viewcvs/room/sys.dpms?rev=1.1&view=auto

New program to watch whether my screens are powered on or DPMS-sleeping. I also track the idle time and the currently-focused window, since I happened to find code for those while I was working out the DPMS. The result is a little RDF graph:

@prefix _4: <http://dash/>.
@prefix _5: <http://bigasterisk.com/computerIdleState/power/>.
@prefix idle: <http://bigasterisk.com/computerIdleState/>.

 _4:console idle:focusClass "rxvt";
    idle:focusName "XTerm";
    idle:focusWindowName "drewp@dash:/my/proj/room <>";
    idle:lastNonIdle "1196593439.27"^^<http://www.w3.org/2001/XMLSchema#float>;
    idle:power _5:On.

(I know the URLs and date formats are poor right now)

The program should be easy to run if you're on X and you have rdflib, py2.5, and python-xlib.

This results format is part of my new plan to have each program regenerate entire graphs of whatever they measure. I'm thinking of sending the graphs around with jabber, using pubsub to send them only when they change. That would be unlike https://stpeter.im/?p=1328 which uses SPARQL as you'd expect.

For example, if the user (or DPMS timer) turned off the screen, the last triple in my example would change to the _5:Suspend node. Other listeners who have subscribed to the computerIdleState graph will get an updated version of it.
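The "send only when it changes" part could look something like this sketch. It's not the real program: triples are plain tuples of strings here instead of rdflib terms, and the send callable is a stand-in for whatever would actually publish to a jabber pubsub node.

```python
class GraphPublisher:
    """Remember the last graph sent and publish again only on a change."""
    def __init__(self, send):
        self._send = send      # stand-in for a jabber pubsub publish call
        self._last = None

    def update(self, triples):
        """Offer a freshly regenerated graph; return True if it was sent."""
        current = frozenset(triples)
        if current == self._last:
            return False       # same statements as last time, stay quiet
        self._last = current
        self._send(current)
        return True

sent = []
publisher = GraphPublisher(sent.append)
powered_on = [("dash:console", "idle:power", "power:On")]
suspended = [("dash:console", "idle:power", "power:Suspend")]

publisher.update(powered_on)   # first graph: sent
publisher.update(powered_on)   # unchanged: not sent
publisher.update(suspended)    # screen turned off: sent
print(len(sent))
```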

The reason I started tracking screen state was simply to measure how many hours my 300W monitors are on per month. Either this program, or some listener one, will have to log that data somewhere. Of course, there are obvious uses for logging idle time patterns too, and that measurement probably wants a bit more compression. (Example app: tell my friends on jabber that I'm out, but my average return time for Tuesdays is 9:45pm.)
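For the hours-per-month number, a listener could keep a list of power state transitions and total them up later. This is only a sketch with made-up sample data; the log format and function name are assumptions, and I'm treating 300W as the combined draw of the monitors.

```python
from collections import defaultdict
from datetime import datetime

def hours_on_by_month(events, end):
    """events: time-sorted list of (datetime, state); end: when logging stopped.

    Each interval is credited to the month in which it started, which is
    slightly off for an interval that spans a month boundary.
    """
    totals = defaultdict(float)
    following = events[1:] + [(end, None)]
    for (start, state), (stop, _) in zip(events, following):
        if state == "On":
            totals[start.strftime("%Y-%m")] += (stop - start).total_seconds() / 3600.0
    return dict(totals)

# Made-up sample log, not real measurements.
log = [
    (datetime(2007, 12, 1, 9, 0), "On"),
    (datetime(2007, 12, 1, 12, 0), "Suspend"),
    (datetime(2007, 12, 2, 20, 0), "On"),
]
hours = hours_on_by_month(log, end=datetime(2007, 12, 2, 22, 30))
WATTS = 300  # assumed combined draw
for month, h in sorted(hours.items()):
    print(month, h, "hours", h * WATTS / 1000.0, "kWh")
```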

I really have to move this old home automation project off CVS and onto darcs. I don't mind the conversion, but I want to keep at least some of the cvsweb urls working since I think I've pasted them into a lot of postings all over the web.

2007-09-09T04:14:40 RDF reasoning for home automation:

Updated: fixed FuXi link

I'm trying to do my home automation with RDF and reasoning. RDF is the unified way to write all the configurations, and I'm hoping to use a logic engine (maybe FuXi or Euler) to write the control systems. Hopefully those will make it easy for humans or computers to edit the setup.

I look forward to being able to ask an N3 proof system "why is the porch light on?" and having it tell me "the web said the sun has set by now, you tripped a motion sensor within the last 15 minutes, and there was no other light shining in this area, therefore I turned on the porch light".

Tonight I cobbled together the first working version of some home automation components talking RDF. A bluetooth dongle constantly searches for devices, and if it finds one, it states that [the bluetooth sensor] [senses] [the URI for the device]. Here's that program.

(BTW, avoid bluetooth chips by Integrated System Solution Corp and prefer ones by Cambridge Silicon Radio. The ISSC one I got has the lousy address 11:11:11:11:11:11 that's hard to change since I'm not using windows. Also, this bluetooth intro is really good.)

Next in my home automation system, a reasoning program hears about new statements and executes the right logic to produce more statements about what should happen. This program is a stub for now- it just turns the presence of my phone into a statement to power the door lock. But devices.n3 suggests what some of the logic might eventually look like.

Finally, an output program has been watching for statements about pins on the parallel port it controls. The reasoning program said to put power on bit2, so this program sets the output accordingly. On that pin is a circuit with an optoisolator, a triac, a transformer, and the electric strike that releases the door.

When the real logic is in place, the proof system should be able to say "I unlocked the door because someone friendly was nearby, because Drew is friendly and Drew carries a phone with the bluetooth address I saw".
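Here's a toy version of that idea in plain Python, just to show the shape of "derive a statement and keep the reasons". It is not N3, FuXi, or Euler, and it hard-codes a single rule; the names in the facts are invented stand-ins for real URIs.

```python
# Invented stand-ins for the real URIs in my setup.
FACTS = {
    ("btSensor", "senses", "drewPhone"),
    ("drew", "carries", "drewPhone"),
    ("drew", "a", "Friendly"),
}

def infer(facts):
    """Apply the one door rule; return {derived fact: supporting facts}."""
    derived = {}
    for sensor, p1, device in facts:
        if p1 != "senses":
            continue
        for person, p2, carried in facts:
            friendly = (person, "a", "Friendly")
            if p2 == "carries" and carried == device and friendly in facts:
                derived[("doorStrike", "power", "on")] = [
                    (sensor, p1, device),
                    (person, p2, carried),
                    friendly,
                ]
    return derived

def explain(fact, derived):
    """Render a derived fact with the statements that led to it."""
    reasons = "; ".join(" ".join(t) for t in derived[fact])
    return "%s because: %s" % (" ".join(fact), reasons)

conclusions = infer(FACTS)
for fact in conclusions:
    print(explain(fact, conclusions))
```

A real proof engine would give you this kind of trace for any rule set, not just one rule written out by hand.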

2007-09-01T17:34:55 Data table with tabulator:

Here's how to use tabulator to render a simple data table.

/attachment/weblog/2007/09/01/0/table.png

My test data might be a bit confusing since the terms overlap with tabulator terms. I'm trying to compare query runtimes of various queries on different databases. The result I'm trying to produce is a table showing how long each database took on each query.

Here's my mockup data in n3:

@prefix : <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:result rdfs:label "test result" .
:db rdfs:label "database" .
:time rdfs:label "elapsed time" .

<> :result
[a :Result; :query :q1; :db :rdflibBdb; :time ".5"],
[a :Result; :query :q2; :db :rdflibBdb; :time ".6"],
[a :Result; :query :q3; :db :rdflibBdb; :time ".7"],

[a :Result; :query :q1; :db :db2; :time ".8"],
[a :Result; :query :q2; :db :db2; :time ".9"],
[a :Result; :query :q3; :db :db2; :time ".11"]
.

Note the line which associates the 6 results with this document. Without that link, tabulator won't put the results in its outline.

I used cwm to create an XML version of that data, which you can view in tabulator with the following link. [Update: there was no need to convert; tabulator can read n3 thanks to a version of the cwm parser translated to js with pyjs!]

tabulator with mockup data

Tabulator has a query-building interface where you click on predicates and other nodes to constrain your result rows, but I couldn't figure out how to make the table I wanted. Instead, I used the SPARQL tab at the bottom and wrote my own query:

SELECT ?query ?bdb ?db2
WHERE
{
   ?v1 <http://example.org/db> <http://example.org/db2> .
   ?v0 <http://example.org/db> <http://example.org/rdflibBdb> .
   <http://bigasterisk.com/post-rdf/timing-results6.rdf> <http://example.org/result> ?v0 .
   ?v1 <http://example.org/query> ?query .
   ?v0 <http://example.org/time> ?bdb .
   ?v0 <http://example.org/query> ?query .
   ?v1 <http://example.org/time> ?db2 .
}

In english, that says "find queries with results for the two databases, and report their times in columns named after the databases". You can load tabulator with my datafile and that query together:

tabulator with mockup data and query

You have to click the radiobutton next to 'Query' to see the results.

Now I'll actually write my database benchmark, and I'll have it output result sets for each db. I should be able to combine the result sets together and display them in a table with the method described above. The biggest issue with abusing tabulator in this way is that I have to grow my query for each new database I test. Also, that query won't display a row unless it has results from all databases. It would be nice to have all cells optional, so I can still see a row if it only has a result from one database.
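In SPARQL itself, OPTIONAL is the usual way to keep a row when some cells are missing, though I haven't checked whether tabulator's query tab handles it. Another route is to pull out flat (query, db, time) rows and pivot them outside the query, which also avoids growing the query for each new database. A rough sketch, with sample rows loosely based on my mockup data and hypothetical function names:

```python
def pivot(results):
    """results: list of (query, db, time) rows.

    Returns (sorted db column names, {query: {db: time}}).
    """
    dbs = sorted({db for _, db, _ in results})
    table = {}
    for query, db, time in results:
        table.setdefault(query, {})[db] = time
    return dbs, table

def render(dbs, table):
    """Tab-separated text table; a missing result becomes an empty cell."""
    lines = ["\t".join(["query"] + dbs)]
    for query in sorted(table):
        cells = [table[query].get(db, "") for db in dbs]
        lines.append("\t".join([query] + cells))
    return "\n".join(lines)

rows = [
    ("q1", "rdflibBdb", ".5"), ("q2", "rdflibBdb", ".6"), ("q3", "rdflibBdb", ".7"),
    ("q1", "db2", ".8"), ("q2", "db2", ".9"),   # db2 has no q3 result here
]
dbs, table = pivot(rows)
print(render(dbs, table))
```

The q3 row still shows up, with an empty db2 cell.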

2007-04-06T16:46:43 unicode rdf symbol:

I discovered today that Unicode comes with an rdf symbol (almost): ༜

That's &#3868; (U+0F1C), "TIBETAN SIGN RDEL DKAR GSUM".

Use this if your font doesn't show the character.
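If you'd rather check the character yourself, Python's standard unicodedata module can look up the name for the code point. A small sketch:

```python
import unicodedata

rdf_sign = chr(3868)                    # decimal code point from the entity above
print(rdf_sign)
print("U+%04X" % ord(rdf_sign))         # the same code point in hex
print(unicodedata.name(rdf_sign))       # official Unicode character name
```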

2006-11-29T11:36:30 RDF literals as subjects:

Any proposal about allowing RDF literals as subjects, especially one that's for language purposes (in this case it's direction support), needs to address why RDF's current design has the exceptional 'language' attribute on literals. If your proposal is so good, why didn't RDF allow arcs from literals in the first place and avoid the language/datatype special cases altogether?

I really know nothing about the direction support issue, but if it's one of the last few language-specific issues and it really ought to be separate from the 'language' attribute, I am inclined to prefer one more special case on literals than a total redo of the constraints on rdf graphs. My main concern with literals as subjects is that people will treat them like "casual" URIs that aren't universally unique.



Unless otherwise noted, all content licensed by Drew Perttula
under a Creative Commons License.