Monday, September 29, 2014

Value proposition of PaaS

It's simple: Bring your own code. What makes it so special then?


Have you ever had a new business idea but hesitated to try it out due to the risk of the initial investment? If your business is related to internet services, maybe PaaS can help you.

Perhaps the biggest promise of Platform as a Service (PaaS) is the user's ability to try new things out with minimal risk. Just take an instance, deploy your code, and pay per use. If it isn't working out as a business, just quit, and you haven't lost much.

Snapshot of Bluemix service configuration panel.

Let's take a closer look at some key aspects:

Pay per use

Do you know how many users you'll have on your new service? Seldom can one predict that, and it makes ROI calculation difficult. How about having your costs proportional to your user count? That means a fixed production cost per unit of production volume.

Modern PaaS services like Bluemix offer innovative payment models. The base charging unit of Bluemix is the gigabyte-hour (GBh), somewhat similar to the kilowatt-hours (kWh) you're paying for electricity. The analogy is obvious: pay per use.
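To make the GBh idea concrete, here is a small back-of-the-envelope sketch in Python. The 0.05 EUR/GBh rate is a made-up illustration for the arithmetic, not actual Bluemix pricing:

```python
def gb_hours(memory_gb, hours, instances=1):
    """Charging unit: memory allocated (GB) x wall-clock hours x instances."""
    return memory_gb * hours * instances

def monthly_cost(memory_gb, instances, rate_per_gbh, hours=730):
    """Approximate cost for a ~730-hour month at a given rate per GBh."""
    return gb_hours(memory_gb, hours, instances) * rate_per_gbh

# Example: two 512 MB instances running a full month at a hypothetical
# rate of 0.05 EUR per GBh.
usage = gb_hours(0.5, 730, instances=2)   # 730.0 GBh
cost = monthly_cost(0.5, 2, 0.05)         # about 36.5 EUR
```

Note how the cost scales with both memory footprint and instance count: double the users, double the instances, double the bill.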

Focus on your code
You do not need to care about hardware, connectivity, or the software platform. Just select the middleware services you need, and put all your energy into your application, which is what brings the added value after all. It's the service provided by your application that matters to your customers, not the underlying infrastructure.

In the traditional IaaS approach, you need someone capable of putting it all together and setting it up, and someone to maintain it and keep things rolling. With PaaS, it's all taken care of on your behalf. You only need to understand your application's details.

Instantly in service
Just a few clicks to select the components and services you need, and you're ready to deploy your code to production use. How about development, then? Just use the same PaaS, but develop in your own sandbox. Once it's ready to publish, deploy it to production with a simple mouse click.
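In a Cloud Foundry-based PaaS such as Bluemix, that deployment step is typically driven by a small application manifest. A minimal sketch (the application name and sizes here are made-up placeholders):

```yaml
# manifest.yml: picked up by `cf push` from the project root
applications:
- name: my-sandbox-app   # placeholder name
  memory: 256M           # memory per instance, billed as GB-hours
  instances: 1
  path: .
```

The same manifest works for the sandbox and for production; only the target space changes.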

Maintenance free
If something is advertised as maintenance free, it usually means you can't fix it. In the case of PaaS, you do not need to fix it, as someone is already doing that, without you even noticing it.


IBM wrote about my thoughts in their blog on September 23rd, 2014 (in Finnish):  Playground for developer - Early birds experiences

One can't discuss internet services without someone asking about security. That will be a topic for upcoming postings.

Wednesday, September 10, 2014

MQTT over WebSocket

MQTT over WebSocket sounds like "tårta på tårta"* at first, as the Swedes tend to say. But is it really like that, just plain overhead?

All major MQTT brokers support MQTT over WebSocket nowadays, including Mosquitto, ActiveMQ, HiveMQ, and more. What is it good for? I have already discussed the differences between these two protocols in my earlier posting MQTT vs. WebSocket.

There are at least two scenarios where MQTT over WebSocket does make sense:

a) The first and rather obvious one is using MQTT directly from a script running inside a web browser. WebSocket has been called the TCP socket of the Web, and it is typically the only supported way of exchanging real-time data, in addition to HTTP polling. So, whatever protocol you want to use in your dynamic web page with client-side content rendering, tunneling through WebSocket is the only way to go.

Tunneling always causes some extra overhead in data communication, but it can do some good as well. In many organizations, firewall rules may block direct MQTT communication, but WebSocket traffic is usually happily accepted. So it may be your only way to get through the firewall, even if you're doing no evil.

b) Not only firewalls: there may be other technical reasons preventing use of the protocol of your choice. Many cloud environments built on top of the Cloud Foundry standard only accept incoming connections in the form of HTTP(S) and WebSocket(s). To use plain MQTT, an external server running the broker is needed, and the connection is then created from the cloud to that server. Many clouds do allow this.

But that's a bit against the whole philosophy of cloud services. Better to have the MQTT broker running in the cloud, and let connections be created from the outside world over WebSocket.

MQTT tunneling over WebSocket
Two or more MQTT brokers may be connected over WebSocket. In a sensor network kind of scenario, it may make sense that not every sensor connects directly to the back-end system (cloud). Instead, let there be a number of intermediate brokers to reduce the number of direct connections the back-end system must handle.

If the sensors are installed in a plant or other controlled environment, they may use an unsecured means of communication to the nearest broker, so that only the inter-broker connection needs to be secured. The broker may also have some local data storage to buffer the data in case of connection interruptions, so that no data is lost permanently.
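The buffering idea can be sketched with a toy in-memory model (illustrative only, not a real broker implementation; the class and method names are my own):

```python
from collections import deque

class BufferingBridge:
    """Intermediate broker that buffers messages while the uplink is down."""

    def __init__(self, uplink):
        self.uplink = uplink      # callable delivering (topic, payload) upstream
        self.connected = True
        self.buffer = deque()     # local store for connection interruptions

    def publish(self, topic, payload):
        if self.connected:
            self.uplink(topic, payload)
        else:
            self.buffer.append((topic, payload))   # hold it: no data lost

    def reconnect(self):
        self.connected = True
        while self.buffer:                         # flush the backlog in order
            self.uplink(*self.buffer.popleft())

# Usage: simulate a connection break between broker and back-end.
received = []
bridge = BufferingBridge(lambda t, p: received.append((t, p)))
bridge.publish("sensor/temperature/22", "22.9")
bridge.connected = False
bridge.publish("sensor/humidity/22", "56")   # buffered, not dropped
bridge.reconnect()                           # both messages now upstream
```

A real intermediate broker would persist the buffer to disk, but the principle is the same: the outage is invisible to the back-end except as latency.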

* "tårta på tårta": the direct translation is "cake on cake". Depending on context, it may be interpreted as tautology or recursion.

Monday, August 25, 2014

Sensor data with MQTT

Messaging protocols like MQTT make it easy to design a dynamic, scalable, and modular architecture for sensor integration.

Message queues with a publish/subscribe scheme make data producers and consumers independent of each other. The producer/publisher does not need to know who, if anyone, is interested in the data. A consumer can be changed on the fly without any interruption to the producer, and vice versa.

Pub/sub architecture
By nature, such an architecture is unreliable, as the producer does not get an acknowledgement of whether the data was received by anybody. MQTT tries to tackle that by introducing QoS levels: "at most once", "at least once", and "exactly once" delivery. In addition, a publisher may define a last will and retained messages. The last will is delivered to subscribers if the connection to the publisher breaks.
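The decoupling can be illustrated with a toy in-memory broker. This is a sketch of the pub/sub idea only, not an actual MQTT implementation (no QoS, last will, or wildcards):

```python
class TinyBroker:
    """Minimal publish/subscribe hub: producers and consumers never meet."""

    def __init__(self):
        self.subscribers = {}   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # The publisher does not know (or care) who, if anyone, listens.
        for callback in self.subscribers.get(topic, []):
            callback(topic, payload)

broker = TinyBroker()
seen = []
broker.subscribe("sensor/temperature/22", lambda t, p: seen.append(p))
broker.publish("sensor/temperature/22", "22.9")   # delivered to one subscriber
broker.publish("sensor/pressure/7", "1013")       # nobody subscribed: dropped
```

Swapping the lambda for another consumer, or adding a second one, requires no change on the publishing side; that is the independence the post describes.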

One common mistake with MQTT is to treat it as a pipe for delivering structured data between producer and consumer. Even if MQTT can do that, it's not according to the original design principle. The topic field of each message should contain the relevant metadata about the meaning of the message. By burying the information in the payload data itself, the benefit of the broker is lost. Let the broker do its job!

Let's have a practical example


Tellstick Duo
Tellstick Duo from Telldus can receive data from various wireless sensors from different vendors. In the case of the WT450H transmitter, an example of raw sensor event data from the telldus-core driver consists of the following fields:
  • class: sensor
  • protocol: mandolyn
  • id: 22
  • model: temperaturehumidity
  • temp: 22.9
  • humidity: 56 

It feels quite natural to formulate the data as a JSON message and deliver it with a topic like /sensor/telldus. But! That's not how MQTT is supposed to be used. A better way is to put the metadata in the topic and let the payload contain only the actual data. Something like:

  /sensor/temperature/<id> 22.9
  /sensor/humidity/<id> 56

Why this way?  The broker can do its job by delivering messages to those, and only those, who are interested in that particular message. If the topic were just plain /sensor/telldus, every consumer would receive every message and would then have to parse each one at the application level to decide whether it is of any interest.
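As a sketch of this topic design, the Telldus event above can be mapped into per-quantity messages, and an MQTT-style wildcard filter ('+' matches one level, '#' the rest) shows how a broker can deliver only to interested subscribers. The function names and the temp-to-temperature rename are my own illustration, not part of telldus-core or any broker API:

```python
def telldus_to_messages(event):
    """Map a raw telldus-core event into (topic, payload) pairs:
    metadata goes into the topic, the payload carries only the value."""
    meta = {"class", "protocol", "id", "model"}
    names = {"temp": "temperature"}          # telldus field -> topic level
    return [("/sensor/%s/%s" % (names.get(k, k), event["id"]), str(v))
            for k, v in event.items() if k not in meta]

def topic_matches(filt, topic):
    """MQTT-style filter matching: '+' matches one level, '#' the rest."""
    f, t = filt.split("/"), topic.split("/")
    for i, part in enumerate(f):
        if part == "#":
            return True
        if i >= len(t) or (part != "+" and part != t[i]):
            return False
    return len(f) == len(t)

event = {"class": "sensor", "protocol": "mandolyn", "id": 22,
         "model": "temperaturehumidity", "temp": 22.9, "humidity": 56}
msgs = telldus_to_messages(event)
# [('/sensor/temperature/22', '22.9'), ('/sensor/humidity/22', '56')]
wanted = [m for m in msgs if topic_matches("/sensor/temperature/+", m[0])]
```

A subscriber to /sensor/temperature/+ receives only the temperature message; the humidity message never reaches it, with no application-level parsing needed.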

Thursday, August 21, 2014

Fault-tolerant IoT architecture

Distributed databases make it easy to set up a fault-tolerant architecture for IoT and more.

Let's assume a system with data sources like wireless sensors, gateways, and back-end servers. To ensure that there is no single point of failure, some level of redundancy is needed, achieved by duplicating gateways, connections, and back-end systems.


Fault-tolerant IoT architecture.

Data exchange between gateways and back-ends can be realized with the help of a distributed database, without the need for a separate transfer mechanism. As described in my earlier posting, distributed databases can be characterized by whether they favor availability or consistency.

GaianDB is a dynamic distributed federated database provided by IBM. GaianDB advocates a flexible "store locally, query anywhere" (SLQA) paradigm. Data is stored in one database, and queries are propagated across the whole cluster to find the requested data. This approach by itself does not guarantee high availability, but combined with redundancy it yields a nicely fault-tolerant system.

In the diagram above, each sensor is expected to be heard by two or more gateways under no-fault conditions. Each gateway has its own database storing data received from the sensors it can hear. This means there is redundant data recorded in the system. It is important to store or buffer data locally in the gateways: in case of a temporary connection failure, the data is not lost but can later be retrieved from the gateway.

The cluster is dynamically self-organizing, which means it always looks for the optimal route between nodes, if one exists. If an individual link or node is lost, data is routed the other way around. With the help of redundancy, no single failure blocks the whole system from working.
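The SLQA idea combined with redundancy can be sketched as a toy model (my own illustration of the paradigm, not the actual GaianDB API):

```python
class Node:
    """One gateway database: stores its own data, forwards queries to peers."""

    def __init__(self, name):
        self.name = name
        self.local = {}      # key -> value, stored locally only
        self.peers = []
        self.alive = True

    def store(self, key, value):
        self.local[key] = value

    def query(self, key):
        # Propagate across the cluster: answer from any reachable node.
        for node in [self] + self.peers:
            if node.alive and key in node.local:
                return node.local[key]
        return None

# Two gateways hear the same sensor (redundancy); queries go anywhere.
a, b = Node("gw-a"), Node("gw-b")
a.peers, b.peers = [b], [a]
a.store("sensor/22/temp", 22.9)
b.store("sensor/22/temp", 22.9)   # redundant copy
a.alive = False                   # gateway A fails...
print(b.query("sensor/22/temp"))  # ...but the data is still reachable
```

Without the redundant copy on gw-b, the same query would return nothing once gw-a is down: SLQA alone does not give high availability, exactly as noted above.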

Databases favoring consistency do not make good fault-tolerant architectures. Typically, such databases have one DB instance defined as the master for any given data entity. Data is available via every secondary DB, but if the master DB is out of order, all the secondary ones will cease providing the data, as they cannot guarantee its consistency. RethinkDB is a popular example of such a database.

Monday, August 4, 2014

Gain your DevOps attitude

Playing around with a cluster is a good exercise for any developer. "It works on my desktop" is not enough anymore.

I have to admit I have a developer background and a developer attitude. Since I assembled the RPi cluster reported in an earlier post, I have had to start thinking the operations way. How do I deploy, onto multiple instances, the configuration I'm now running on this single board? Four nodes are enough to make you understand that manual copying and configuration is not the way to go.


How to perform testing, deployment, configuration, and management in a traceable, reliable, and inexpensive way? This is one question DevOps tries to answer by emphasizing communication between the development, operations, and QA functions. To communicate efficiently, people must share common concepts and think more or less the way their counterparts do. This is where DevOps people come in, by mixing the roles.

DevOps is considered the third generation of software development methods, after waterfall and agile. There is some criticism of DevOps, as it is seen as consuming all of a developer's time with less challenging tasks like QA and operations. This is why automation is important. A DevOps engineer is not supposed to perform manual testing or manually configure several instances of cloud environments.

A DevOps engineer uses his or her developer skills to build an automated testing, deployment, and management environment, and then focuses on developing something new, letting computers run the less challenging and repetitive tasks. At least in theory.

Back to the RPi cluster. Compiling a decent database from sources natively on a single RPi takes a day or more. A cloud build environment could speed up the process significantly, combined with automated cluster deployment and management tools. When I decided to build a cluster, I thought it was about studying distributed databases and messaging for IoT, but in practice it's a cluster management exercise turning me into a DevOps engineer.

Friday, August 1, 2014

Distributed databases

Understanding the fundamental differences between distributed databases is crucially important when selecting the proper database for an IoT application.

There are dozens of databases available these days. Among the distributed ones, perhaps the best known are Cassandra, CouchDB, Riak, and RethinkDB. But which one is most suitable for my specific purpose?

Wikipedia describes several characteristics of distributed databases. However, the article does not pay enough attention to perhaps the most important factor: does the database favor data consistency or high availability?

Let's assume we have a cluster of four interconnected databases. (Surprisingly, this setup resembles what was described in the previous posting.) Let's assume each DB has a number of sensors and actuators behind it, each connected to only one DB at a time, as illustrated below.

Distributed database.
What happens when the connection to database A is lost? There are two scenarios:
a) high availability: the other databases B, C, D continue to provide the last known state of the nodes behind DB A
b) data consistency: the other databases B, C, D no longer provide information about the nodes behind A, as they cannot guarantee data consistency (integrity).

Which one is better? Well, it depends on your application. In a sensor-network type of application where time-series data is typically stored, the high-availability approach can do well. The historical data is not supposed to be altered in any way, so it remains valid. Redundant databases can continue providing the last known history of the lost nodes. If the connection to DB A is recovered later, all the databases can synchronize the data missed during the outage.

In a real-time control type of application, historical data is perhaps not that important; the only thing that matters is the current state of the system. With such an approach, when a connection is lost, it's better not to share the past state of the lost nodes. This is considered favoring data consistency: no stale data is provided. If a sub-system is not available, the data related to it is missing as well. This is quite natural.
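The two behaviors can be sketched like this; an illustrative toy model of the trade-off, not modeled on any particular database's API:

```python
class Replica:
    """One DB in the cluster, holding a replicated last-known state."""

    def __init__(self):
        self.state = {}          # key -> last value replicated to this DB
        self.partitioned = set() # keys whose master DB is unreachable

    def write(self, key, value):
        self.state[key] = value

    def read(self, key, favor="availability"):
        if key in self.partitioned and favor == "consistency":
            return None              # CP behavior: refuse possibly stale data
        return self.state.get(key)   # AP behavior: serve the last known value

# DB A replicated a sensor value to B, then the link to A broke.
b = Replica()
b.write("nodeA/valve", "open")
b.partitioned.add("nodeA/valve")
b.read("nodeA/valve")                        # -> 'open' (availability)
b.read("nodeA/valve", favor="consistency")   # -> None   (consistency)
```

The time-series history case is the first read: "open" is still a true statement about the past. The real-time control case is the second: "open" might now be false, so it is safer to report nothing.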

The CAP theorem provides a more scientific explanation of the difference described above. When selecting the most suitable DB for an IoT application, this is perhaps the most important factor to take into account. For example, how the databases synchronize with each other is not that relevant, as long as they are connected and can synchronize. In an IoT system, fault tolerance is one of the most important aspects, as network partitions (connection breaks) are more than likely to occur.

Thursday, July 31, 2014

Raspberry cluster

Raspberry Pi cluster for IoT experimentation of distributed and fault tolerant systems.

The summer vacation season is over in Finland now. The berries are picked from the raspberry bushes and it's time to get back to duty. Inspired by the raspberry bushes, I decided to build a Raspberry cluster.

The new form factor of the Revision B+ RPi makes it more suitable for mechanical assembly. Four mounting holes in a square and the removal of parts sticking out beyond the PCB, including the SD card and the composite video output, are welcome improvements. Replacing linear regulators with switching regulators, among other improvements, yields a 30% reduction in power consumption from B to B+.

Tower of four Raspberry Pi B+
Originally my intention was to use the BeagleBone Black, but they are having trouble with component availability: all the major distributors have zero units in stock and long lead times. Thus I had to content myself with the RPi. Why BBB? Because the AM335x CPU with its Cortex-A8 core is more industrial-grade and a typical design choice in my company, among others, whereas the BCM2835 SoC with its ARM11 core was initially intended for mobile devices, and its long-term availability is not guaranteed.

The cluster is intended for studying distributed databases, inter-system messaging (message queues), fault tolerance, big data, and other aspects of a reliable IoT system. The cluster consists of four RPi model B+ boards, a 5-port gigabit Ethernet switch, and a 7-port USB hub for power distribution. The only purpose of the USB hub is to distribute power; no data is delivered, since the RPi uses its micro-USB connector for power input. The only custom-made part in the whole system is a USB-to-DC-plug cable to provide power from the hub to the switch. This way the setup requires only a single 5 VDC 20 W (4 A) power supply.

Total of 16 USB ports available.
The tower is quite compact and has a sturdy mechanical structure. The riser screws are bolted straight to the metal chassis of the Ethernet switch, so no other mounting brackets are needed. I consider this a canonical form. A small flaw in its beauty is the cable ties used for mounting the USB hub. All in all, this setup costs 200-300 euros, depending on where the parts are purchased. Thus it's affordable even on a student budget.

In coming posts I will report my adventures in the world of distributed systems.