Saturday, May 26, 2018

The microservices challenge

It is 2018 and everything that is being built on the cloud has to be a microservices architecture. Not so long ago, one could build a single monolithic service, deploy it behind an elastic load balancer, and be set. Once the service received a REST call, it would return either a success or a failure based on what happened while processing the request.
But that architecture had limitations and issues that led people to look at a microservices architecture. As microservices become commonplace, we are faced with unique challenges that need to be resolved for a microservices-based cloud deployment.
Let's look at an example. Let's say we are building a School Management System. A typical school management system would have the following modules.

A sample of school management system
Now, let's take a very simple use case on this system. A parent whose child has been admitted to the school wants to pay the fees and confirm the admission. It would probably look something like the flow below.
Confirming admission sample flow

Now, let's assume each of the lines in the above picture is a separate microservice. When the parent tries to confirm the admission by paying the fees, he interacts with the Admissions microservice, which provides the details of the fees that need to be paid. In turn, the Accounts microservice calls the payment gateway interface and processes the payment.
It is possible that there is a transient error in the call between Accounts and Payment Gateway microservices. The likelihood of such errors across microservices is higher because every call across microservices is a network call. There are two options to handle such situations.

  1. The error from the payment gateway is captured and returned to the user, who is asked to try again after some time. This causes a bad user experience because we are returning an error to the end user without knowing the reason for the error. The user may think that there is something wrong with his credit card or bank account while the error may be just because of a network failure.
  2. If we are not sure about the reason for the error, another option is to pass the request on to a dead-letter-queue service. The whole purpose of the dead-letter-queue service is to retry the request based on a configured retry-count and retry-timeout. This makes sure we inform the customer, asynchronously, only in cases of genuine customer error.
Modules including dead letter queue
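The retry behaviour of such a dead-letter-queue service can be sketched roughly as below. This is a minimal in-memory illustration, not a production design: the names `FailedRequest`, `DeadLetterQueue`, `enqueue` and `drain` are made up for this example, and a real service would persist the queue and notify the customer when retries are exhausted.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FailedRequest:
    """A request that failed once and was handed over to the DLQ (hypothetical type)."""
    name: str
    send: Callable[[], int]  # re-issues the call and returns the HTTP status code
    attempts: int = 0


class DeadLetterQueue:
    """Hypothetical in-memory DLQ that retries each request up to retry_count times."""

    def __init__(self, retry_count: int = 3, retry_timeout: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.retry_count = retry_count
        self.retry_timeout = retry_timeout  # seconds to wait between attempts
        self._sleep = sleep                 # injectable so tests need not wait
        self._pending: List[FailedRequest] = []

    def enqueue(self, request: FailedRequest) -> None:
        self._pending.append(request)

    def drain(self) -> Dict[str, str]:
        """Retry every queued request; report 'succeeded' or 'exhausted' per request."""
        results: Dict[str, str] = {}
        while self._pending:
            request = self._pending.pop(0)
            outcome = "exhausted"
            while request.attempts < self.retry_count:
                request.attempts += 1
                if 200 <= request.send() < 300:
                    outcome = "succeeded"
                    break
                if request.attempts < self.retry_count:
                    self._sleep(self.retry_timeout)
            results[request.name] = outcome
        return results
```

With `retry_count=3`, a payment call that fails twice and then succeeds is reported as "succeeded", while a call that keeps failing is reported as "exhausted" and is the one the customer would need to hear about.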

We now have another microservice which takes care of retrying the failed request. Assuming the initial request for fee payment to the payment gateway was unsuccessful, the modified flow of requests will look like below.
Fee payment with dead letter queue
We can clearly see that this flow is more resilient and takes care of scenarios where the REST call between microservices may fail. The next big question is under what conditions a request needs to be sent to the dead letter queue (DLQ). To understand this, let's look at HTTP error codes and try to figure out what they really mean.

The HTTP error codes depicting failures are defined in the 4XX and 5XX series. Let's evaluate them one by one. As per the standards, 4XX error codes denote situations where the client has erred, while 5XX codes denote situations where the server has erred.


400 Bad Request, 411 Length Required, 412 Precondition Failed, 413 Request Entity Too Large, 414 Request-URI Too Long, 415 Unsupported Media Type, 416 Requested Range Not Satisfiable, 417 Expectation Failed: Normally these errors are caused solely by a bad request. But in a microservices architecture, the requests are composed by the microservices themselves. So unless we are talking about a customer-facing microservice where the request is coming from an external client or user, this is most probably caused by some interface confusion between microservices. There is nothing that a client or user can do about this request. We send this request to DLQ.

401 Unauthorized, 403 Forbidden: We need to understand how we are authenticating requests across microservices. If the authentication is being done with client/user credentials, then this is a fatal error and we can return the error to the client/user. If it is service-to-service authentication, we will have to handle it internally and we should send the request to DLQ.

404 Not Found, 405 Method Not Allowed, 406 Not Acceptable, 410 Gone: If the request is coming from a user/client, we can return the error back to the user/client. If it is a request between two microservices, we need to send the request to DLQ.

407 Proxy Authentication Required: Our application will see this error only in case of calls between microservices and we need to send this request to DLQ.

408 Request Timeout: This typically points to a slow or unavailable service and needs to be sent to DLQ.

409 Conflict: This error is generated because of an inconsistent state of the system. This needs to be sent to DLQ.


500 Internal Server Error: We treat this as a transient error and it should be sent to DLQ.

501 Not Implemented, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 505 HTTP Version Not Supported: If the request was received from an external client/user, this error is returned back to the client/user; otherwise the request should be sent to DLQ and retried when the error is resolved.
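
The rules above can be condensed into a small routing function. The sketch below is one reading of this walk-through rather than a standard: the names `Action` and `route_failure` and both flags are assumptions made for illustration, and for the 400-family it assumes that an error on a request from an external client is returned to that client.

```python
from enum import Enum


class Action(Enum):
    RETURN_TO_CLIENT = "return the error to the client/user"
    SEND_TO_DLQ = "send the request to the dead letter queue"


# Groupings follow the walk-through of the 4XX and 5XX series above.
BAD_REQUEST_CODES = {400, 411, 412, 413, 414, 415, 416, 417}
AUTH_CODES = {401, 403}
RESOURCE_CODES = {404, 405, 406, 410}
ALWAYS_DLQ_CODES = {407, 408, 409, 500}
SERVER_CODES = {501, 502, 503, 504, 505}


def route_failure(status: int, from_external_client: bool,
                  uses_user_credentials: bool = False) -> Action:
    """Decide what to do with a failed call, given its HTTP status code.

    from_external_client: the failing request came from an outside client/user.
    uses_user_credentials: the call was authenticated with client/user
    credentials rather than service credentials (only consulted for 401/403).
    """
    if status in ALWAYS_DLQ_CODES:
        return Action.SEND_TO_DLQ
    if status in AUTH_CODES:
        # User credentials: fatal for the caller. Service credentials: ours to fix.
        if uses_user_credentials:
            return Action.RETURN_TO_CLIENT
        return Action.SEND_TO_DLQ
    if status in BAD_REQUEST_CODES | RESOURCE_CODES | SERVER_CODES:
        if from_external_client:
            return Action.RETURN_TO_CLIENT
        return Action.SEND_TO_DLQ
    raise ValueError(f"status {status} is not covered by the rules above")
```

For example, `route_failure(404, from_external_client=False)` yields `Action.SEND_TO_DLQ`, while the same status on a request from an external client is returned to that client.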

Once we communicate an accurate understanding of each of the error codes with the team members and return proper error codes, a microservice architecture can be made more resilient with an architectural block like a dead letter queue.

Thursday, May 10, 2018

The agile conundrum

The agile manifesto has been around for quite some time and is being hailed as the biggest revolution since the stored-program concept. My personal experiences with agile never felt like a revolution. If one looks at the sparsely available data on project success rates in IT, one finds that agile projects did have higher success rates, but not so high that they could be called revolutionary.

So I decided to look at the agile manifesto again.
  1. Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.
  2. Welcome changing requirements, even late in development. Agile processes harness change for the customer's competitive advantage.
  3. Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale.
  4. Business people and developers must work together daily throughout the project.
  5. Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.
  6. The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.
  7. Working software is the primary measure of progress.
  8. Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.
  9. Continuous attention to technical excellence and good design enhances agility.
  10. Simplicity--the art of maximizing the amount of work not done--is essential.
  11. The best architectures, requirements, and designs emerge from self-organizing teams.
  12. At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.
As we look carefully at the above principles of the agile manifesto, there can't be any quarrel with 1, 3, 6, 7, 8, 9, 10, and 12. These are just good principles for any project team. They have nothing to do with agile versus traditional. Let's look at the other items in the agile principles.

Changing Requirements
I agree that software teams need to adapt to changing requirements, but I would not go so far as to say that we need to welcome requirement changes. Agile or traditional, any change in requirements does disturb the ongoing project. Agile teams may be better placed to handle that change, but the adaptability comes from the underlying software architecture and design rather than from the agile process per se. I have seen good agile teams whose underlying architecture and design resulted in the complete team being thrown off course as soon as a requirement change was encountered.
Business and Developers must work together daily
In most of the agile teams that I have worked on, business people can't afford to work with developers on a daily basis and hence the voice of business people is carried to developers through the product managers. The original intent of "the individual with the problem" working together with "the person who can code" is not achieved anyway. There is still content lost in translation.
Motivated Individuals
I think this is the most important aspect of any agile team. It also overpowers all the other principles of the agile manifesto. If I can gather a group of motivated individuals, I don't really need to worry about most of the other stuff. If the team is motivated to build the right thing that the customer wants, all our problems would be solved. If I were running a non-agile, traditional project with a set of motivated individuals, I would still most probably end up with a successful project.
Best architecture, requirements, designs emerge from self-organizing teams
I think there is very little difference between a team in chaos and a self-organizing team. When a team delivers the best architecture, requirements, and design, we call it a self-organizing team; otherwise it is labeled as chaos. I think this principle of the agile manifesto cannot be operationalized because, if we give a team the freedom to self-organize, whether it is effective or has descended into chaos will be known only once it produces its output, and by that time it is too late.
In my view, the biggest problem facing the agile manifesto today is that it has become almost akin to a religion. People fight battles over whether something should be called an Epic, Story or Task. People fight over what can be discussed in a standup. It is almost as if the process has become the most important aspect and nobody is worried about the end result.
Most people working in agile teams think that agile is all about coding; the thinking required to build a good product is considered a waste of time. People build bad software while proudly claiming, "we will refactor it later." In the name of a backlog list, visibility in project tracking is completely lost. It has almost become voodoo.
I believe the point of the agile manifesto was to eliminate activities that were not adding anything positive to the product. For example, there was absolutely no point in writing an interface control document because C headers or Java interfaces could explain the interfaces better and were easier to manage. But unfortunately, it has been taken as an excuse to eliminate activities that were actually contributing to product quality. In the end, we have not made the gains that agile should have given us.

Monday, May 7, 2018

Writing Secure Code

Security is the paramount issue when deploying a product over the internet. Here I am trying to put together a set of principles for writing secure code that I have collected from different places on the internet.

  1. Any functionality you have built can be used by anybody with an ulterior motive. Design functionality/endpoint/service/microservice with a mindset that guards against the most rogue user of that functionality.
  2. All user input coming your way should be assumed to be coming from the most rogue user of the functionality. Follow the steps defined below.
    1. Sanitize
    2. Validate
    3. Execute
    4. Display feedback
  3. If you have defined endpoints and provided a web client or an app, don't assume that it is the only method that will be used to access your endpoints.
  4. API keys are as important as, if not more important than, usernames and passwords. Guard them like that.
  5. Any place you are using API keys or passwords, think about what you will do in case these are exposed. How will you handle key rotation? Particularly if you have a device as your client, it should be capable of handling key rotations or forced password changes.
  6. Passwords and API keys should not be part of the code and should not be committed to configuration management systems.
  7. If you are using encryption with keys (e.g. AES 256), think about where you will store the keys themselves. Think of Hardware Security Modules.
  8. Unless you are a security researcher by occupation, don't fall for the temptation of designing your own encryption algorithm. Security through obscurity is just a bad idea.
  9. Before storing any data, think about what you intend to do with it. Any customer data stored in your systems has the potential to be leaked. Don't store a piece of data that you have no need for.
  10. Pay attention to filtering rules for your logging and audit your logs for any accidental sensitive data reaching the logs.
  11. Heed warnings from compilers, lint, and other analysis tools. A sprint is done only when all the warnings have been removed.
  12. In this day and age, don't use HTTP. Stick to HTTPS with at least TLS 1.2.
  13. Have a policy on employee turnover. Many times ex-employees may be the biggest source of data leakage.
  14. Specifically design and test against injection risks. Minimize native SQL queries as much as possible.
  15. Specifically design and test against broken authentication and authorization. Many systems suffer from issues such as users assuming the identities of other users, users being able to authorize themselves for higher roles, etc. Take the help of a security researcher if you don't have that expertise on the team.
  16. Each piece of data collected and stored should be specifically annotated with the level of privacy and encryption required. Detailed thought related to the type of data needs to be given right at the design stage.
  17. Auditing sensitive data is mandatory. The system needs to have an audit system built that can help us retrieve sufficient breadcrumbs in case of a customer complaint about data compromise.
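
As an illustration of items 2 and 14, here is a minimal sketch of the sanitize, validate, execute, display-feedback sequence with a parameterized query. It uses Python's built-in `sqlite3` module; the `students` table, the username rule, and the function names are assumptions made up for this example, not a recommendation for any particular schema.

```python
import re
import sqlite3

# Allow-list for usernames (an assumed rule for this example).
USERNAME_PATTERN = re.compile(r"[a-z0-9_]{3,20}")


def sanitize(raw: str) -> str:
    """Step 1: normalize the input by trimming whitespace and lowercasing."""
    return raw.strip().lower()


def validate(username: str) -> bool:
    """Step 2: accept only the allow-listed characters and length."""
    return USERNAME_PATTERN.fullmatch(username) is not None


def find_student(conn: sqlite3.Connection, raw_username: str) -> str:
    username = sanitize(raw_username)
    if not validate(username):
        # Step 4: feedback that does not echo the raw input back.
        return "Invalid username."
    # Step 3: execute with a bound parameter; the input is never
    # concatenated into the SQL text.
    row = conn.execute(
        "SELECT full_name FROM students WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row else "No such student."


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with one hypothetical student."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (username TEXT PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO students VALUES (?, ?)", ("asha_k", "Asha K"))
    return conn
```

With the demo database, a lookup for "  Asha_K " finds the student, while an input such as "x' OR '1'='1" is rejected at the validation step and never reaches the database.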
These are some of the items that I could think of. Comment with your thoughts.

What are some fundamentals of security every developer should understand?
OWASP Top 10 - 2017