Doing a PhD
Published by Martin Kleppmann on 31 Mar 2009.
I have decided to apply to do a PhD in Cambridge.
This might come as a surprise, so please let me
explain. It is something which has been tempting me for a long time. I have always loved working
independently and getting deep into a project which I find cool, and a PhD (in computer science at
least) seemed to me the ultimate manifestation of this independence: three years in which you can
learn about and figure out an interesting topic, and invent new ways, with hardly any constraints
other than that you’re supposed to write up something vaguely insightful at the end. (I’m sure this
is an overly idealised notion of what a PhD entails, but please bear with my dreamworld for a moment.)
On the other hand, I have started a company and I’ve had an incredibly experience-packed two
years so far doing that. I would be a completely different person now if I had gone directly into a
PhD after graduating. Running a start-up has made me less risk-averse, more dynamic, more outgoing
and confident, more pragmatic, more focussed, and has given me a much better understanding of how
the world works.
You might think that returning to university is a cop-out, a return from the harsh
winds of a start-up into the safe haven of academia. Let me assure you that this is not the case, for two reasons.
- Firstly, I will keep my company going on the side. Obviously I won’t be doing it
full-time any more, but I am keeping all of my active clients, and I will continue the high level of
service they know from me. It won’t be a return to student lifestyle for me; if anything, my focus
will get sharper.
- Secondly, the research proposal I have written is not just any proposal. It is
aimed squarely at what I (and many others) believe will be the most influential technologies of the
next decade or two: technology which deals with the vast amount of data on the web, filtering,
processing and mining that information such that it becomes a source of useful insight.
In short: machine learning and computational linguistics.
We are rapidly moving towards a world
where everything which can be digitised and put on the web will be. Blogs, social networking sites,
Twitter and many other services increasingly become expressions of a person’s identity. Already now
I find that if, for example, I receive an email from somebody I don’t know, often the first thing I
do is to look them up on LinkedIn, find their blog or Twitter username, look up their company or
affiliation and find out what they do. This allows me to quickly judge the context of their enquiry,
gauge the level at which I should reply, or detect whether I need to be cautious for some reason. If
it is somebody I have dealt with before, I have a private database of contact history which helps
when I don’t remember details of conversations months or years back. (It’s nothing particularly
secret, it’s just an extended memory.)
Identity on the web further manifests itself in social
interactions with others. This can be a powerful source of insight: for example, if I don’t know
somebody, but I see that they publicly communicate with somebody I already know and trust, I will
immediately be more inclined to trust them too. This is not a rigorous decision, but a useful first
guess in the absence of other information.
However, gathering the pieces of a person’s identity
from across the web is currently a time-consuming manual process.
My own digital identity, for
example, is spread all over the interwebs. It manifests itself in
my blog (which you are currently reading),
my LinkedIn profile,
my undergraduate dissertation,
my open source projects,
my Facebook profile,
my photos and
my taste in music, not to mention the many other fragments
scattered about other sites, in the form of press articles about me or my projects, my archived
emails to mailing lists, my comments on other people’s blogs, etc. They are all there, and Google
has indexed them all (apart from the small number of things behind logins), but at the moment they
do not come together to form a coherent whole. They are scraps of data, but without further analysis
they don’t mean much.
In a nutshell, my PhD proposal is to gather that publicly available data
together and make it useful. For me, that means to map out the graph of connections between
different people, and relationships between people and topics. Who is interested in what, and who
discusses what topics with which people?
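To make this concrete, here is a toy sketch of the kind of graph I have in mind. The names, topics and messages below are all invented for illustration; from a handful of (speaker, listener, topic) interactions it tallies who talks to whom, and who is interested in what:

```python
# Toy sketch: build a people-to-people graph and a people-to-topic
# interest map from a few example interactions. All data is made up.
from collections import Counter

messages = [
    ("alice", "bob", "parsing"),
    ("alice", "bob", "parsing"),
    ("bob", "carol", "machine learning"),
    ("alice", "carol", "machine learning"),
]

# Edge weights between pairs of people (how often they interact).
edges = Counter((a, b) for a, b, _ in messages)

# Topic interests: both participants of a message count as interested.
interests = Counter(
    (person, topic) for a, b, topic in messages for person in (a, b)
)

print(edges[("alice", "bob")])                   # → 2
print(interests[("carol", "machine learning")])  # → 2
```

A real system would of course extract these interactions from messy web data rather than a clean list, which is exactly where the hard research problems lie.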
Consider an example to see why this might be useful: say
you are new to a particular field of specialism (whatever it may be), and you attend a conference to
find out more about it. The programme of the conference is a long list of sessions with names of
speakers and titles of talks. Someone who has been in the field for a while will know many of the
speakers’ names and will immediately know which sessions will be worth attending and which people to
talk to. But a newcomer will have no idea, and has no way to find out other than by spending years
getting to know the community. Why can’t you just visualise the relationships between the various
speakers and topics, so that you can immediately see who the most influential presenters are and
whose interests are closest to your own? Or even discover which attendees of the event would be most
worth talking to? At the moment we rely on personal referrals, serendipitous meetings and crude
markers (like First Tuesday’s
“green for start-up, red for investor, yellow for service provider”);
why can’t we have a more direct way of finding the people we should be talking to?
There are two
steps to making this work: firstly identifying which two bits of information on the web belong to
the same person (even if they are on different websites, using a variant spelling of the name or
pseudonym-like username, and without confusing two people who happen to share the same name), and
secondly mapping out the relationships between the people and the topics they talk about.
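The first step is the classic record-linkage problem. As a rough illustration of the idea (not the method I would actually propose — the sites, names and thresholds here are invented), one might combine a crude name-similarity score with corroborating evidence such as a shared homepage link:

```python
# Toy sketch of identity resolution: do two profiles on different
# sites belong to the same person? All profiles here are made up.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude similarity between two names, ignoring case and word order."""
    norm = lambda s: " ".join(sorted(s.lower().replace(",", "").split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def same_person(p1: dict, p2: dict, threshold: float = 0.8) -> bool:
    """Heuristic match: similar names plus a shared link as corroboration.
    A real system would weigh many more signals probabilistically."""
    if name_similarity(p1["name"], p2["name"]) < threshold:
        return False
    return bool(set(p1.get("links", [])) & set(p2.get("links", [])))

linkedin = {"name": "Martin Kleppmann", "links": ["martin.kleppmann.com"]}
twitter  = {"name": "Kleppmann, Martin", "links": ["martin.kleppmann.com"]}
other    = {"name": "Martin K.", "links": ["example.org"]}

print(same_person(linkedin, twitter))  # similar name and shared homepage
print(same_person(linkedin, other))    # name too different, no shared link
```

The interesting (and hard) part is doing this probabilistically at web scale, without merging two distinct people who share a name — which is why it is a research problem rather than a weekend hack.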
Google’s success rests, amongst other things, on the
PageRank algorithm which calculates a ‘quality’ rating for
each page on the web. Their core innovation was to realise that links between pages, not just the
pages’ content, were the measure which determined how useful a search result would be, and
implementing PageRank allowed them to achieve much better search results than the other search
engines at the time.
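For readers who haven't seen it, the idea behind PageRank fits in a few lines: each page repeatedly shares its rank equally among the pages it links to, with a damping factor to guarantee convergence. This is a minimal power-iteration sketch on an invented four-page web, not Google's production algorithm:

```python
# Minimal PageRank via power iteration on a toy web. The page names
# and link structure are invented for illustration.
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page gets a small "teleport" share, plus contributions
        # from pages that link to it.
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(web)
# "c" is linked to by three pages, so it ends up with the highest rank.
print(max(ranks, key=ranks.get))  # → c
```

The point of the sketch is the shape of the idea: rank flows along links, so a page is important if important pages point at it — a recursive definition that the iteration resolves.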
A lot has been said about the next big thing post-Google. I wouldn’t want to
make predictions, but let’s put it this way: I would not be surprised if the next core innovation is
to realise that individual people, and the connections between people, are even more powerful than
pages and links between pages. The marriage of the social web and the semantic web, waiting to be fully realised.
This is a difficult and multi-faceted problem, which is why I want to take it
on within the framework of a PhD rather than try to develop it as a product straight away. There is
a lot I need to learn, from the mathematical details of the best machine learning techniques to the
linguistic techniques needed to extract structured information from natural language text and small
clues on the web.
There is a lot of existing research on which to build.
Stephen Clark, my proposed PhD supervisor, is one of the
authors of the
C&C parser, which is maybe the finest statistical
natural language parser out there; also in the Computer Lab’s
Natural Language and Information Processing Group,
Simone Teufel and others’ work on
citation analysis is likely to be
relevant. And I hope to collaborate with the lovely people at the Cambridge Engineering department’s
Machine Learning group, including
Zoubin Ghahramani who is recognised as one of the top
researchers worldwide in the machine learning field. Very good reasons to be in Cambridge.
Please note that this is not at all certain yet – I have applied, but I may not get accepted, I may not
get funding, and the Board of Graduate Studies may lose/forget my papers. But all going well, this
is the general direction in which I’d like to head.
On a final note, it will also be interesting to
explore the ethical aspects of identity on the web. I believe that both open sharing of personal
information and automated mining of that information will increase massively in the coming years,
and exploring the ethical and social consequences, as well as protecting the rights of the
individual, should be a part of the research in this area.
PS. My favourite techy buzzword so far is “maximum entropy supertagger”
(one of the components of the C&C parser). Just say that out
loud. Maximum Entropy Supertagger. Doesn’t it sound awesome? Before my inner eye, there is a sci-fi
film of a group of heroes fighting off an alien invasion. The tentacled beasts from outer space are
everywhere, but the good guys are just managing to keep them at bay. But then… ominous music in
the background… a huge towering construction appears from behind a hill in the distance. Silence
falls. Everybody stares at the terrifying thing brought by the aliens. The guy who later will be in
charge of single-handedly saving the world turns around, and in a brief close-up shot he says to his
colleagues in a perfect Hollywood manner: “Oh my God. They’ve got a Maximum Entropy Supertagger.”