Friday, December 26, 2014

A web scraper/crawler in Java: Krawkraw

UPDATE: Krawkraw is now referred to as just Webmuncher.

Nobody sets out to write a web scraper, but here I am posting about just that, a general purpose web scraper I wrote…

How did this happen? Well, it started with wanting to play around with ElasticSearch a couple of months back. And in search of some data to index, I looked to websites' contents (with hindsight a more structured data format should have been sought, but well, the lesson has been learnt)...

So the question then was: how do I easily retrieve all the HTML pages of a site and have them thrown into ElasticSearch? Was there an easy web scraper in Java out there that I could just use? I can’t even remember actively searching for such a tool...which explains why I somehow decided it would be much more fulfilling if I threw a web scraper together...and in a couple of weekends, Krawkraw came to be.

Krawkraw is a tool that can be used to easily retrieve all the contents of a website or, more accurately, all the contents under a single domain. This is its ideal use case, and it reflects the original need for which it was written. So you can’t start at one edge of the web with Krawkraw and expect to crawl to another edge...Nope.

Krawkraw is available via Maven Central, and you can easily drop it into your project with these coordinates:


<dependency>
    <groupId>com.blogspot.geekabyte.krawkraw</groupId>
    <artifactId>krawler</artifactId>
    <version>${krawkraw.version}</version>
</dependency>


Or you can download the jar files here and include them in your classpath.

For more information on the API and its usage, the README should be your friend! If you happen to use Krawkraw and find it missing some features, or you have ideas for features it should have, then drop them in the issue tracker.

Wednesday, December 17, 2014

Hello Logging, Hello SLF4J

This is not a full-blown guide, tutorial or introduction to logging in Java with the SLF4J abstraction, but a post that captures some of the nuances and components I personally had to get familiar with when I stopped seeing logging as an activity that happens only during development and started seeing it as something that is actively built into software so as to have insight into its state in production.

Personally...in the early days it was var_dump() and alert("It works"). Then I discovered the console…(thanks to Firebug and, much later, Chrome DevTools) and I dropped alert in favor of console.log(). And that was it... This was what "logging" was mainly about: a process of ascertaining the correctness of the software as we develop…

But as soon as I moved into building software systems on the JVM using Java, production logging came to the fore, and with it the need for me, as a software developer, to proactively build logging into my applications. It was not a complex switch, but looking back now, I realize that for a while I unintentionally approached production logging as if it were the kind of logging that takes place during development.

They are in no way the same...

First, there are more moving parts. Second, the purpose differs: an application writes out log messages in production to give insight into its internal state and to provide a good source of forensics in case something goes wrong. It has even been estimated that any trivial system would have about 4% of its code base dedicated to logging. This shows how important production logging is.

This post thus captures the new eyes with which I had to view production logging compared to the type I had long been familiar with, which is logging during development. It is written with logback in mind, which is an implementation of SLF4J, which has grown to be the standard abstraction when it comes to logging in Java.

Almost all of what is mentioned would apply to any other SLF4J implementation. The differences that may exist would most likely be in the area of configuration...

...with that said, let’s start.


Saturday, November 15, 2014

Difference between BeanFactory and FactoryBean in Spring Framework

tl;dr A FactoryBean is an interface that you, as a developer, implement when writing factory classes whose created objects you want Spring to manage as beans. A BeanFactory, on the other hand, represents the Spring IoC container: it contains the managed beans and provides access for retrieving them. It is part of the core of the framework and implements the base functionality of an inversion of control container.

This post clarifies the difference between BeanFactory and FactoryBean in Spring Framework, and in the process sheds some light on how the core components of the Spring Framework stack together.
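To make the two roles concrete, here is a minimal toy model in plain Java. It is only a sketch: ToyFactoryBean, ToyBeanFactory and GreetingFactory are made-up names that imitate the roles described above, and none of them is Spring's actual API. The one detail borrowed from Spring is the "&" prefix, which asks the container for the factory itself rather than the object it produces.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these types imitate the roles of Spring's FactoryBean and
// BeanFactory, but they are NOT Spring's interfaces.
class FactoryBeanSketch {

    /** Stand-in for the FactoryBean role: an object that knows how to build another object. */
    interface ToyFactoryBean<T> {
        T getObject();
    }

    /** Stand-in for the BeanFactory role: the container that holds beans and hands them out. */
    static class ToyBeanFactory {
        private final Map<String, Object> beans = new HashMap<>();

        void register(String name, Object bean) {
            beans.put(name, bean);
        }

        Object getBean(String name) {
            // As in Spring, a leading "&" asks for the factory itself, not its product.
            if (name.startsWith("&")) {
                return beans.get(name.substring(1));
            }
            Object bean = beans.get(name);
            if (bean instanceof ToyFactoryBean) {
                // A registered factory is transparently replaced by what it creates.
                return ((ToyFactoryBean<?>) bean).getObject();
            }
            return bean;
        }
    }

    /** A hypothetical factory class a developer might write. */
    static class GreetingFactory implements ToyFactoryBean<String> {
        @Override
        public String getObject() {
            return "Hello from the factory";
        }
    }

    public static void main(String[] args) {
        ToyBeanFactory container = new ToyBeanFactory();
        container.register("greeting", new GreetingFactory());

        System.out.println(container.getBean("greeting"));   // Hello from the factory
        System.out.println(container.getBean("&greeting") instanceof GreetingFactory); // true
    }
}
```

The point of the sketch is the asymmetry: the developer writes the factory, while the container decides when to call it and hands out the product under the factory's name.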

The Spring Framework has grown to be a lot of things, but at its core, it is a very versatile Dependency Injection container, upon which other things get built. This core, this dependency injection container, is actually built from 4 components.

1. The org.springframework.core package. Here, base functionality that cuts across the whole framework is implemented, e.g. exceptions, version detection, etc.

2. The org.springframework.beans package. Here we have interfaces and classes for managing Java beans. The BeanFactory interface is found here.

3. The org.springframework.context package. This builds on the beans package. An example of an interface from this package that you would have used is the ApplicationContext (it extends the BeanFactory interface).

4. The org.springframework.expression package. This provides the expression language that can be used to traverse and manipulate the object graph within the container at runtime.


Saturday, November 01, 2014

How to Use Multiple SSH Keys

This post is a quick How-To.

It shows how to create multiple SSH keys for accessing multiple remote servers from a single local machine.

This post uses the Git hosting services GitHub and GitLab (which allow SSH connections to hosted repos) to illustrate the steps. Say, for example, at work you use GitLab but for personal projects you use GitHub, and you want to be able to use SSH to interact with these two services from the same local machine.

Although Git hosting services are used here, the steps outlined are in no way tied to them; they work just the same for any SSH connection.

First Step: Generating the SSH Keys.

The Git hosting service will provide instructions for generating an SSH public/private key pair and for adding the public key to your account. For an example, see GitHub's instructions.

It basically involves running this command:
ssh-keygen -t rsa -C "Comments you want attached to the key"


Tuesday, October 07, 2014

Injecting and Binding Objects to Spring MVC Controllers

I have written two previous posts on how to inject custom, non-Spring-specific objects into the request handling methods of a controller in Spring MVC. One used HandlerMethodArgumentResolver, the other @ModelAttribute.

An adventurous reader of these posts would have discovered that even if you remove the implemented HandlerMethodArgumentResolver or the @ModelAttribute annotation, the custom objects would still be instantiated and provided to the controller method!

Yes, try it!

The only problem is that the properties of the instantiated object would be null; that is, they won't have been initialized (this won't be the case, though, if these properties are initialized in the default constructor).
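The null-versus-initialized behavior can be reproduced without Spring at all. The sketch below is plain Java and rests on one assumption: that the framework creates the argument through its no-argument constructor, which is what the parenthetical above suggests. PlainPerson, DefaultedPerson and instantiate are hypothetical names used only for illustration.

```java
import java.lang.reflect.Constructor;

// Plain-Java stand-in for "the framework instantiates the object for you".
class DefaultConstructorSketch {

    /** Hypothetical object whose field is never assigned. */
    static class PlainPerson {
        String name;
        PlainPerson() { }
    }

    /** Hypothetical object that assigns a default in its no-arg constructor. */
    static class DefaultedPerson {
        String name;
        DefaultedPerson() { this.name = "anonymous"; }
    }

    /** Creates an instance through the no-arg constructor, using reflection. */
    static <T> T instantiate(Class<T> type) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No usable no-arg constructor on " + type, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate(PlainPerson.class).name);     // null
        System.out.println(instantiate(DefaultedPerson.class).name); // anonymous
    }
}
```

Nothing sets name on PlainPerson, so it stays null, while DefaultedPerson gets its value because the constructor itself assigns it.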


Monday, October 06, 2014

Using Spring's Type Converter to Inject Objects into Spring MVC Controllers

The pattern of having the identity of a resource articulated in the URL is not uncommon; in fact, it is one of the ways of expressing RESTful endpoints. For example, in an application that displays details about people, you can have a URL expressed thus:

http://www.example.com/people/777

where 777 is the number that serves as the identifier used to fetch the resource; in this case, that could be the corresponding Person object with an identifier of 777.

It won't be taking it too far, then, if we see the number 777 as a direct mapping to the Person object with an id of 777.

In fact, the controller method that maps the URL above may simply have logic that gets the 777 part of the URL and uses it to fetch the corresponding entry in the database backing the application. For example:
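A rough, plain-Java sketch of that logic follows. It is not Spring MVC code and not the original listing: the Person class, the in-memory map standing in for the database, and the helper names extractId and findPerson are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of "take the id from the URL, fetch the matching Person".
class PersonLookupSketch {

    /** Hypothetical domain object. */
    static class Person {
        final long id;
        final String name;
        Person(long id, String name) { this.id = id; this.name = name; }
    }

    /** In-memory stand-in for the database backing the application. */
    static final Map<Long, Person> PEOPLE = new HashMap<>();
    static {
        PEOPLE.put(777L, new Person(777L, "Jane Doe"));
    }

    /** Pulls the trailing identifier out of a URL such as .../people/777. */
    static long extractId(String url) {
        String lastSegment = url.substring(url.lastIndexOf('/') + 1);
        return Long.parseLong(lastSegment);
    }

    /** Identifier in, Person out; returns null when no entry matches. */
    static Person findPerson(String url) {
        return PEOPLE.get(extractId(url));
    }

    public static void main(String[] args) {
        Person person = findPerson("http://www.example.com/people/777");
        System.out.println(person.id + " -> " + person.name); // 777 -> Jane Doe
    }
}
```

Seen this way, the identifier and the Person are two representations of the same thing, and the lookup is just a conversion from one to the other.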


Injecting Objects into Spring MVC Controller Using @ModelAttribute Annotation

In How to Inject Objects Into Spring MVC Controller Using HandlerMethodArgumentResolver, I showed how to implement the HandlerMethodArgumentResolver interface in order to have a custom object passed as a method argument of a Spring MVC controller.

This post looks at how to still pass objects into the Spring MVC controller, but using a somewhat different approach: the @ModelAttribute annotation.

The @ModelAttribute annotation can be used to designate operations that can both retrieve stuff from and put stuff into the model. Its functionality depends on where it is placed...two places actually: on controller methods and on controller method arguments.

When @ModelAttribute is placed on a controller method, it allows putting stuff into the Model, and when placed on a controller method argument, it allows retrieving stuff from the Model...

I thus like to think of @ModelAttribute as an annotation-driven mechanism for writing to and reading from the Model.

This post looks at these two ways of using @ModelAttribute and how they can be used to inject objects into a controller's methods.