Recent versions of the Workload Automation product can seamlessly integrate with observability products such as Dynatrace, Instana, Datadog, Splunk, and others. This is especially useful for companies with large operations teams that already monitor their applications with those observability solutions.
Having the job/scheduling metrics, logs, and events, and correlating them with actual application performance data, makes it easy to uncover bottlenecks and identify potential SLA breaches. It also makes it easy for the operator or SRE to spot jobs running or abending in the environment.
In this blog post, I will describe one of the pillars of observability, metrics, from a Workload Automation point of view. HCL Workload Automation exposes metrics for its main components: the back end (Master Domain Manager), which reports metrics about job execution as well as the health of its application server (WebSphere Liberty), and the front-end web user interface (Dynamic Workload Console, or DWC).
Those metrics are exposed in the OpenMetrics format, a vendor-neutral format widely adopted by the community. It originated from the Prometheus project and has become the standard way to report metrics for cloud-native applications.
For HCL Workload Automation to start reporting metrics, we first need to enable the OpenMetrics endpoint on all WebSphere Liberty components (MDM / BKMDM / DWC). The process is well documented here.
Once that is done, the endpoints become available on the HTTP/HTTPS ports: https://MDMIP:31116/metrics and https://DWCIP:9443/metrics
When accessing those links, we should see data in the OpenMetrics format:
# TYPE base_REST_request_total counter
# HELP base_REST_request_total The number of invocations and total response time of this RESTful resource method since the start of the server. The metric will not record the elapsed time nor count of a REST request if it resulted in an unmapped exception. Also tracks the highest recorded time duration within the previous completed full minute and lowest recorded time duration within the previous completed full minute.
base_REST_request_total{class="com.ibm.tws.twsd.rest.engine.resource.EngineResource",method="getPluginsInfo_javax.servlet.http.HttpServletRequest"} 39
base_REST_request_total{class="com.ibm.tws.twsd.rest.plan.resource.JobStreamInPlanResource",method="getJobStreamInPlan_java.lang.String_java.lang.String_javax.servlet.http.HttpServletRequest"} 170
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.JobStreamModelResource",method="getJobStreamById_java.lang.String_java.lang.Boolean_javax.servlet.http.HttpServletRequest"} 51
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.FolderModelResource",method="getFolderById_java.lang.String_java.lang.Boolean_javax.servlet.http.HttpServletRequest"} 4
base_REST_request_total{class="com.ibm.tws.twsd.rest.eventrule.engine.resource.RuleInstanceEventRuleResource",method="queryNextRuleInstanceHeader_com.ibm.tws.objects.bean.filter.eventruleengine.QueryEventRuleEngineContext_javax.servlet.http.HttpServletRequest"} 5
base_REST_request_total{class="com.ibm.tws.twsd.rest.engine.resource.EngineResource",method="parametersToJsdl_com.ibm.tws.objects.bean.engine.ParametersInfo_javax.servlet.http.HttpServletRequest"} 1
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.FolderModelResource",method="getFolderContent_com.ibm.tws.objects.bean.model.FolderContentParameters_javax.servlet.http.HttpServletRequest"} 79
base_REST_request_total{class="com.ibm.tws.twsd.rest.eventrule.engine.resource.AuditRecordEventRuleResource",method="queryAuditRecordHeader_com.ibm.tws.objects.bean.filter.eventruleengine.QueryFilterEventRuleEngine_java.lang.Integer_javax.servlet.http.HttpServletRequest"} 105
base_REST_request_total{class="com.hcl.wa.wd.rest.ResourceBundleService",method="getBundle_javax.servlet.http.HttpServletRequest"} 39
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.EventRuleModelResource",method="getEventRuleById_java.lang.String_java.lang.Boolean_javax.servlet.http.HttpServletRequest"} 17
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.EventRuleModelResource",method="queryEventRuleHeader_com.ibm.tws.objects.bean.filter.model.QueryFilterModel_java.lang.Integer_java.lang.Integer_java.lang.Integer_javax.servlet.http.HttpServletRequest"} 19
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.JobDefinitionModelResource",method="listKeys_java.lang.String_java.lang.String_java.lang.String_javax.servlet.http.HttpServletRequest"} 3
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.EventRuleModelResource",method="updateEventRule_java.lang.String_java.lang.Boolean_java.lang.Boolean_com.ibm.tws.objects.rules.EventRule_javax.servlet.http.HttpServletRequest"} 1
base_REST_request_total{class="com.ibm.tws.twsd.rest.model.resource.WorkstationModelResource",method="unlockWorkstations_java.lang.String_java.lang.Boolean_java.lang.Boolean_javax.servlet.http.HttpServletRequest"} 1
base_REST_request_total{class="com.hcl.wa.fileproxy.rest.FileProxyResources",method="proxyPutResponse_java.lang.String_java.lang.String_java.io.InputStream_javax.servlet.http.HttpServletResponse"} 26
base_REST_request_total{class="com.hcl.wa.wd.rest.JsonService",method="getObjectProps_java.lang.String_java.lang.String_java.lang.String_java.lang.String_java.lang.String_java.lang.String_java.lang.String_java.lang.String_javax.servlet.http.HttpServletRequest"} 15
With the endpoints properly reporting the metrics, we can now move on to sending the data to observability products. In our case we will use Prometheus as the monitoring solution: we will set up Prometheus to scrape the HWA OpenMetrics endpoints so the data is ingested, and once it is there we can set up alerts and dashboards.
Below is a Prometheus configuration example (/etc/prometheus/prometheus.yml) to scrape HWA's OpenMetrics endpoints. Note the scrape_interval of 1 minute and that we are disabling TLS verification.
In the example below, the MDM HTTPS port is 31116 and the DWC's is 443 (the default is 9443).
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # metrics_path defaults to '/metrics'
  - job_name: 'hwa_mdm'
    scrape_interval: 1m
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['MDMIP:31116']
  - job_name: 'hwa_dwc'
    scrape_interval: 1m
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['DWCIP:443']
After restarting Prometheus, I can see the targets available in Prometheus's UI.
Figure 1 Prometheus targets
With the data being received by Prometheus, I am also able to search it by running PromQL queries and to visualize graphs.
Figure 2 Prometheus metrics
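For example, using the base_REST_request_total counter shown in the output earlier, a query along these lines gives the REST request rate per resource class (the 5-minute window is just an illustrative choice):

sum by (class) (rate(base_REST_request_total[5m]))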
The picture below shows a PromQL query to list jobs in error by workstation.
Figure 3 Prometheus error jobs by workstation
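As a textual sketch of such a query: the exact metric name and labels depend on the HWA version, so the name used below (application_wa_JobsInPlanCount, with status and workstation labels) is only an assumption; check your own /metrics output for the real names before using it.

sum by (workstation) (application_wa_JobsInPlanCount{status="error"})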
Having validated that the metrics are being reported properly in Prometheus, we can now leverage Grafana to build and display dashboards and/or leverage Alertmanager to be alerted in case of issues.
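On the alerting side, a Prometheus alerting rule on top of a query like the one above is enough to get notified through Alertmanager. The following is a minimal sketch, again using the hypothetical job-status metric from the previous example; the rule file must be referenced from rule_files in prometheus.yml and an Alertmanager must be configured.

groups:
  - name: hwa_alerts
    rules:
      - alert: HWAJobsInError
        # Hypothetical metric name; adjust to what your /metrics endpoint actually exposes
        expr: sum by (workstation) (application_wa_JobsInPlanCount{status="error"}) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Jobs in error on workstation {{ $labels.workstation }}"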
As for Grafana, we can leverage the Grafana dashboard available on yourautomationhub.io. The dashboard was built for Grafana with relevant data for scheduling environments. To use it in Grafana, we first need to define the Prometheus datasource, as shown in the picture below.
Figure 4 Grafana’s Prometheus datasource
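The datasource can be created through the UI as in the picture, or provisioned from a file. Below is a minimal provisioning sketch (for example /etc/grafana/provisioning/datasources/prometheus.yml; the URL is an assumption and should point at your Prometheus server):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Point this at your Prometheus server
    url: http://prometheus-host:9090
    isDefault: true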
Then all it takes is to import the HWA dashboard from Grafana's import section. Type the ID 14692 and it should load the dashboard automatically. Select the folder and the Prometheus datasource we set up in the previous step and click Import.
Figure 5 Import dashboard on Grafana
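If you prefer to automate this step rather than use the import UI, Grafana's dashboard provisioning can load the dashboard JSON from disk. A sketch, assuming you have exported the dashboard (ID 14692) JSON to the path below:

apiVersion: 1
providers:
  - name: 'HWA dashboards'
    type: file
    # Folder shown in the Grafana UI, and the path where the dashboard JSON files are stored
    folder: 'Workload Automation'
    options:
      path: /var/lib/grafana/dashboards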
Once imported, we can see all the metrics collected by Prometheus on Grafana's dashboard:
Figure 6 HWA/IWA Grafana dashboard
HWA nowadays is very observable, and its data is easy to ingest with a plethora of monitoring/observability products. And it is not only metrics: our logs also come in JSON format, which makes for seamless integration with products like Logstash, Splunk, and others.