I've been learning OCaml for some time now but not really had a problem that I wanted to solve. As such, my progress has been rather slow and sporadic and I only make time for exercises when I'm travelling. In order to focus my learning, I have to identify and tackle something specific. That's usually the best way to advance and I recently found something I can work on.
As I've been trying to write more blog posts, I want to be able to keep as much content on my own site as possible and syndicate my posts out to other sites I run. Put simply, I want to be able to take multiple feeds from different sources and merge them into one feed, which will be served from some other site. In addition, I also want to render that feed as HTML on a webpage. All of this has to remain within the OCaml toolchain so it can be used as part of Mirage (i.e. I can use it when building unikernels).
What I'm describing might sound familiar and there's a well-known tool that does this called Planet. It's a 'river of news' feed reader, which aggregates feeds and can display posts on webpages and you can find the original Planet and it's successor Venus, both written in Python. However, Venus seems to be unmaintained as there are a number of unresolved issues and pull requests, which have been languishing for quite some time with no discussion. There does appear to be a more active Ruby implementation called Pluto, with recent commits and no reported issues.
Although I could use the one of the above options, it would be much more useful to keep everything within the OCaml ecosystem. This way I can make the best use of the unikernel approach with Mirage (i.e lean, single-purpose appliances). Obviously, the existing options don't lend themselves to this approach and there are known bugs as a lot has changed on the web since Planet Venus (e.g the adoption of HTML5). Having said that, I can learn a lot from the existing implementations and I'm glad I'm not embarking into completely uncharted territory.
In addition, the OCaml version doesn't need to (and shouldn't) be written as one monolithic library. Instead, pulling together a collection of smaller, reusable libraries that present clear interfaces to each other would make things much more maintainable. This would bring substantially greater benefits to everyone and OPAM can manage the dependencies.
The first cut is somewhat straightforward as we have a piece that deals with the consumption and manipulation of feeds and another that takes the result and emits HTML. This is also how the original Planet is put together, with a library called feedparser and another for templating pages.
For the feed-parsing aspect, I can break it down further by considering Atom and RSS feeds separately and then even further by thinking about how to (1) consume such feeds and (2) output them. Then there is the HTML component, where it may be necessary to consider existing representations of HTML. These are not new ideas and since I'm claiming that individual pieces might be useful then it's worth finding out which ones are already available.
The easiest way to find existing libraries is via the
OPAM package list. Some quick searches for
net bring up a lot of packages. The most relevant of these seem to be
xmlm, ocamlrss, cow and maybe xmldiff. I noticed that
nothing appears, when searching for
Atom, but I do know that
cow has an
Atom module for creating feeds. In terms of turning feeds into pages and
HTML, I'm aware of rss2html used on the OCaml website and parts of
ocamlnet that may be relevant (e.g
netstring) as well as
cow. There is likely to be other code I'm missing but this is useful as a
Overall, a number of components are already out there but it's not obvious if they're compatible (e.g html) and there are still gaps (e.g atom). Since I also want to minimise dependencies, I'll try and use whatever works but may ultimately have to roll my own. Either way, I can learn from what already exists. Perhaps I'm being overconfident but if I can break things down sensibly and keep the scope constrained then this should be an achievable project.
As this is an exercise for me to learn OCaml by solving a problem, I need to break it down into bite-size pieces and take each one at a time. Practically speaking, this means limiting the scope to be as narrow as possible while still producing a useful result for me. That last part is important as I have specific needs and it's likely that the first thing I make won't be particularly interesting for many others.
For my specific use-case, I'm only interested in dealing with Atom feeds as that's what I use on my site and others I'm involved with. Initial feedback is that creating an Atom parser will be the bulk of the work and I should start by defining the types. To keep this manageable, I'm only going to deal with my own feeds instead of attempting a fully compliant parser (in other words, I'll only consider the subset of RFC4287 that's relevant to me). Once I can parse, merge and write such feeds I should be able to iterate from there.
To make my requirements more concrete:
I've honestly no idea how long this might take and I'm treating it as a side-project. I know there are many people out there who could produce a working version of everything in a week or two but I'm not one of them (yet). There are also a lot of ancillary things I need to learn on the way, like packaging, improving my knowledge of Git and dealing with build systems. If I had to put a vague time frame on this, I'd be thinking in months rather than weeks. It might even be the case that others start work on parts of this and ship things sooner but that's great as I'll probably be able to use whatever they create and move further along the chain.
In terms of workflow, everything will be done in the open, warts and all, and I expect to make embarrassing mistakes as I go. You can follow along on my freshly created OCaml Atom repo, and I'll be using the issue tracker as the main way of dealing with bugs and features. Let the fun begin.