I’m, at the moment, looking for a particular problem to work on for my dissertation. It feels a bit backwards the way I’m going about it–I know what kind of solution I want to deploy, but I’m looking for a problem to solve with it. It’s a bit like running around the house with a hammer, looking for nails to hit, or running around with a new saw, cutting up wood into bits for the hell of it. The danger is that I could end up cutting all my wood up into tiny shavings, having had a blast with the saw but finding myself homeless at the end of the day.
My tool in this case isn’t a saw, but the abstraction of narrative schemata. The idea is, using dependency parses and coreference chains, you can extract which verbs are likely to co-occur with a shared referent. For example, arrest, search, and detain often share role fillers of some kind–police, suspect, or something referring to something that is one of those two.
A corpus of news contains all kinds of relationships like those, buried inside the language data itself. Ideally, these represent some sort of shared world knowledge that can be applied to other tasks. To demonstrate that this isn’t mere idealism is what I’m looking to do my dissertation on at the moment.
Back in the spring, I took my first attempt at this, and it went ok. My hypothesis–one of convenience, mostly–didn’t pan out, but there were interesting trends in the data. That resulted in a problem, though; I had two things to sort out: was my hypothesis wrong? Was the measure I used to determine that fact suitable for doing so? There was some minor evidence that the measure was suitable, but nothing conclusive.
Instead, I started sniffing around for other hypotheses–things someone else had already thought of, and that may be demonstrable with narrative schemata as an overlying application. Per my typical procrastination, I stumbled upon a recent article on Salon that critiques national press coverage of Rick Perry, claiming that narratives presented in the national press diverge wildly from those presented in Texas papers.
With an author having shown this qualitatively, it’s ripe for quantitative replication. It would make a great experiment for showing the veracity of whatever measure I end up devising.
The difficulty comes in with corpus building. There isn’t a corpus of these texts lying around. I’d have to dig them up myself, from numerous scattered sources. Additionally, the number of sources is likely to be limited. I may be able to obtain a few hundred articles if I’m relentless. Prior work on schemata began with millions of articles. The robustness of the approach may be questionable, in this case.
Of course, the difference in size may be the source of an interesting result in and of itself, but it’s not what I’d set out to demonstrate when searching for a problem that demonstrates the veracity of my measure.