It is 2018, and everything being built on the cloud has to use a microservices architecture. Not so long ago, one could build a single monolithic service, deploy it behind an elastic load balancer, and be set. Once the service received a REST call, it could return either a success or a failure based on what happened while processing the request.
But that architecture had limitations and issues that led people to look at microservices. As microservices become commonplace, we face unique challenges that need to be resolved for a microservices-based cloud deployment.
Let's look at an example. Let's say we are building a School Management System. A typical school management system would have the following modules.
Now, let's take a very simple use case in this system. A parent whose child has been admitted to the school wants to pay the fees and confirm the admission. The flow would probably look something like the one below.
Figure: A sample school management system
Figure: Confirming admission sample flow
Now, let's assume each of the lines in the above picture is a separate microservice. When the parent tries to confirm the admission by paying the fees, the parent interacts with the Admissions microservice, which provides the details of the fees to be paid. In turn, the Accounts microservice calls the payment gateway interface and processes the payment.
It is possible that there is a transient error in the call between the Accounts and Payment Gateway microservices. Such errors are more likely than in a monolith because every call across microservices is a network call. There are two options to handle such situations.
- The error from the payment gateway is captured and returned to the user, who is asked to try again after some time. This makes for a bad user experience because we are returning an error to the end user without knowing the reason for it. The user may think that something is wrong with the credit card or bank account, while the error may simply be due to a network failure.
- If we are not sure about the reason for the error, another option is to pass the request on to a dead-letter-queue (DLQ) service. The whole purpose of the dead-letter-queue service is to retry the request based on a configured retry-count and retry-timeout. This makes sure we inform the customer asynchronously, and only in cases of genuine customer error.
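The second option can be sketched roughly as follows. This is a minimal in-memory illustration, not a real queueing service: the class `DeadLetterQueue`, its `process` method, and the `flaky_gateway` function are hypothetical names, and `retry_timeout` is assumed here to mean the wait between two attempts.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeadLetterQueue:
    """Toy in-memory stand-in for a dead-letter-queue service (hypothetical)."""
    retry_count: int = 3        # maximum number of attempts per request
    retry_timeout: float = 0.0  # assumed meaning: seconds to wait between attempts
    failed: List[dict] = field(default_factory=list)  # requests that never succeeded

    def process(self, request: dict, call: Callable[[dict], bool]) -> bool:
        """Retry call(request) until it succeeds or the attempts run out."""
        for _ in range(self.retry_count):
            if call(request):
                return True
            time.sleep(self.retry_timeout)
        # Attempts exhausted: keep the request so the customer can be informed later.
        self.failed.append(request)
        return False


# Simulated payment gateway: two transient failures, then success.
attempts = {"n": 0}


def flaky_gateway(request: dict) -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


dlq = DeadLetterQueue(retry_count=5, retry_timeout=0.0)
paid = dlq.process({"student_id": 42, "amount": 1000}, flaky_gateway)
print(paid, attempts["n"])
```

In a real deployment the queue would be durable and the retries would run asynchronously, so the user is not kept waiting while the gateway recovers.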
Figure: Modules including the dead letter queue
We now have another microservice, which takes care of retrying failed requests. Assuming the initial fee payment request to the payment gateway was unsuccessful, the modified flow of requests will look like the one below.
Figure: Fee payment with dead letter queue
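This flow is clearly more resilient, because it covers the scenarios where a REST call between microservices fails. The next big question is under what conditions a request should be sent to the dead letter queue. To understand this, let's look at the HTTP error codes and figure out what they really mean.
The HTTP error codes depicting failures are defined in the 4XX and 5XX series. Let's evaluate them one by one. As per the standard, the 4XX series denotes situations where the client has erred, while the 5XX series denotes situations where the server has erred.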
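As a quick illustration of this split, the series can be read off the numeric range of the status code. The helper name `error_series` is made up for this sketch.

```python
def error_series(status_code: int) -> str:
    """Say which side a status code blames, based only on its series."""
    if 400 <= status_code <= 499:
        return "client error"
    if 500 <= status_code <= 599:
        return "server error"
    return "not an error"


for code in (200, 404, 503):
    print(code, error_series(code))
```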
4XX
400 Bad Request, 411 Length Required, 412 Precondition Failed, 413 Request Entity Too Large, 414 Request-URI Too Long, 415 Unsupported Media Type, 416 Requested Range Not Satisfiable, 417 Expectation Failed: Normally these errors are caused solely by a malformed request. But in a microservices architecture, the requests are composed by the microservices themselves. So unless we are talking about a customer-facing microservice where the request comes from an external client or user, the error is most probably caused by some interface confusion between microservices. There is nothing that a client or user can do about this request. We send this request to the dead letter queue (DLQ).
401 Unauthorized, 403 Forbidden: We need to understand how we are authenticating requests across microservices. If the authentication is done with client/user credentials, then this is a fatal error and we can return the error to the client/user. If it is service authentication between microservices, we have to handle it internally and should send the request to the DLQ.
404 Not Found, 405 Method Not Allowed, 406 Not Acceptable, 410 Gone: If the request is coming from the user/client, we can return the error to the user/client. If it is a request between two microservices, we need to send the request to the DLQ.
407 Proxy Authentication Required: Our application will see this error only in calls between microservices, and we need to send the request to the DLQ.
408 Request Timeout: In calls between microservices, this usually indicates that a service is down or too slow to respond. The request needs to be sent to the DLQ.
409 Conflict: This error is generated because of an inconsistent state of the system. The request needs to be sent to the DLQ.
5XX
500 Internal Server Error: This is treated as a transient error, and the request should be sent to the DLQ.
501 Not Implemented, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 505 HTTP Version Not Supported: If the request was received from an external client/user, the error is returned to the client/user; otherwise the request should be sent to the DLQ and retried once the error is resolved.
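The rules above can be condensed into one small decision function. This is only a sketch of how the rules might be encoded. The names `route_failure` and `ALWAYS_DLQ` and the two action strings are hypothetical. For 401 and 403, the distinction between user credentials and service credentials is approximated by whether the request came from an external client. Codes not discussed above fall through to the same origin-based rule, which is an assumption.

```python
# Codes that go to the DLQ regardless of who sent the request.
ALWAYS_DLQ = {407, 408, 409, 500}


def route_failure(status_code: int, from_external_client: bool) -> str:
    """Decide what to do with a failed request, following the rules above."""
    if not 400 <= status_code <= 599:
        raise ValueError(f"{status_code} is not an HTTP error code")
    if status_code in ALWAYS_DLQ:
        return "send_to_dlq"
    # Remaining codes depend on the origin: an external client can act on
    # the error, while a failure between microservices is handled internally.
    return "return_to_client" if from_external_client else "send_to_dlq"


print(route_failure(404, from_external_client=True))
print(route_failure(404, from_external_client=False))
print(route_failure(409, from_external_client=True))
```

Keeping the always-retry codes in a single set makes it easy for a team to review and adjust the policy in one place.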
Once the team shares an accurate understanding of each error code and the services return proper error codes, a microservices architecture can be made more resilient with an architectural block like a dead letter queue.