
Troubleshooting

Tools

Monitoring

We have ping tests set up to monitor the availability of each environment. Alerts go to #benefits-notify in Slack.
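To reproduce a ping test by hand, a minimal sketch like the following may help. It assumes the app serves a /healthcheck path (the path filtered out of the live tail command on this page) and that a healthy response is HTTP 200. check_health is a hypothetical helper name, and you supply the environment's base URL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe an environment's health check by hand.
# Assumes the app serves /healthcheck and returns HTTP 200 when healthy.
check_health() {
  local base_url="${1%/}"  # strip a trailing slash, if any
  local status
  # -s: quiet; -o /dev/null: discard the body; -w: print only the status code.
  # curl prints 000 if it could not connect or timed out; "|| true" keeps a
  # failed connection from aborting callers that use "set -e".
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base_url/healthcheck") || true
  echo "$status"
  [ "$status" = "200" ]
}
```

For example, check_health https://<environment-host> prints the status code and exits non-zero unless it is 200.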

Logs

Azure App Service Logs

Open the Logs for the environment you are interested in. The following tables are likely of interest:

  • AppServiceConsoleLogs: stdout and stderr coming from the container
  • AppServiceHTTPLogs: requests coming through App Service
  • AppServicePlatformLogs: deployment information

For pre-defined queries, click Queries, set Group by to Query type, and look under Query pack queries.
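As a starting point for your own queries, this sketch prints a KQL query for recent server errors in AppServiceHTTPLogs, skipping health checks. The column names (ScStatus, CsUriStem, CsMethod, TimeTaken) are assumed from the standard AppServiceHTTPLogs schema. http_errors_query is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a KQL query for 5xx responses in the last N hours,
# excluding /healthcheck. Column names assume the standard AppServiceHTTPLogs schema.
http_errors_query() {
  local hours="${1:-1}"  # look-back window in hours (default: 1)
  [[ "$hours" =~ ^[0-9]+$ ]] || { echo "hours must be a whole number" >&2; return 1; }
  cat <<EOF
AppServiceHTTPLogs
| where TimeGenerated > ago(${hours}h)
| where ScStatus >= 500 and CsUriStem !startswith "/healthcheck"
| project TimeGenerated, CsMethod, CsUriStem, ScStatus, TimeTaken
| order by TimeGenerated desc
EOF
}
```

Paste the output into the Logs query editor. If you have the Log Analytics workspace ID, you can also run it from the Azure CLI with az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "$(http_errors_query 6)". WORKSPACE_ID is a placeholder, and this command may require the log-analytics CLI extension.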

Live tail

After setting up the Azure CLI, use the following command to stream live logs with health check requests filtered out:

az webapp log tail --resource-group RG-CDT-PUB-VIP-CALITP-P-001 --name AS-CDT-PUB-VIP-CALITP-P-001 2>&1 | grep -v /healthcheck
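To tail a different app without editing the command, it can be wrapped in a small function. This is a sketch: tail_app_logs is a hypothetical name, and its defaults are the production resource group and app name from the command above. The names for other environments are not listed here and would need to be looked up.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the live tail command. Defaults are the
# production resource group and app name; pass others to tail another app.
tail_app_logs() {
  local resource_group="${1:-RG-CDT-PUB-VIP-CALITP-P-001}"
  local app_name="${2:-AS-CDT-PUB-VIP-CALITP-P-001}"
  # 2>&1 sends stderr through the pipe too, so grep filters both streams;
  # grep -v drops the frequent /healthcheck requests.
  az webapp log tail --resource-group "$resource_group" --name "$app_name" 2>&1 \
    | grep -v /healthcheck
}
```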

SCM

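Docker container logs for the dev environment are also available from the App Service SCM (Kudu) site:

https://as-cdt-pub-vip-calitp-p-001-dev.scm.azurewebsites.net/api/logs/docker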

Sentry

Cal-ITP’s Sentry instance collects both errors (“Issues”) and app performance info.

Alerts are sent to #benefits-notify in Slack. Additional alerts can be configured.

You can troubleshoot Sentry itself by turning on debug mode and visiting /error/.

Specific issues

This section serves as the runbook for Benefits.

Terraform lock

General info

If Terraform commands fail (locally or in the Pipeline) with "Error acquiring the state lock":

  1. Check the Lock Info for the Created timestamp. If it’s in the past ten minutes or so, that probably means Terraform is still running elsewhere, and you should wait (stop here).
  2. Are any Pipeline runs stuck? If so, cancel the stuck run, then try re-running the Terraform command.
  3. Does any engineer have a Terraform command running locally? You’ll need to ask. For example, they may have started an apply that is sitting at the approval prompt. They will need to exit (gracefully) for the lock to be released.
  4. If none of the steps above identified the source of the lock, and especially if the Created time is more than ten minutes ago, the last Terraform command probably didn’t release the lock. Grab the ID from the Lock Info output and force-unlock the state.
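Steps 1 and 4 come down to comparing the lock's Created time against a ten-minute threshold. Here is a minimal sketch, assuming you have already converted Created to Unix epoch seconds. lock_is_stale is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper for steps 1 and 4: is the lock older than ten minutes?
# Both arguments are Unix epoch seconds; the second defaults to now.
lock_is_stale() {
  local created="$1"
  local now="${2:-$(date +%s)}"
  local threshold=$(( 10 * 60 ))  # "ten minutes or so", per this runbook
  # Arithmetic test: exit status 0 (stale) only if the age exceeds the threshold
  (( now - created > threshold ))
}
```

With GNU date, for example: lock_is_stale "$(date -d '2024-05-01 17:23:45 UTC' +%s)" && echo stale. Once the lock looks stale and steps 2 and 3 turned up nothing, release it with terraform force-unlock LOCK_ID. Run it from a directory configured with the same backend. Terraform asks for confirmation unless -force is passed.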

App fails to start

If the container fails to start, you should see a downtime alert. Assuming this app version was working in another environment, the issue is likely due to misconfiguration. Some things you can do:

Littlepay API issue

Littlepay API issues may show up as:

  • The monitor failing
  • The Connect your card button doesn’t work

A common cause of Littlepay API failures is an expired certificate. To resolve:

  1. Reach out to support@littlepay.com
  2. Receive a new certificate
  3. Put that certificate into the configuration data and/or the GitHub Actions secrets
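Before reaching out, it can help to confirm that expiry is actually the problem. This sketch assumes the current certificate is available locally as a PEM file. cert_expires_within and littlepay-cert.pem are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: does the PEM certificate expire within N days?
# Returns 0 if it expires within the window, 1 if not, 2 if the file is unreadable.
cert_expires_within() {
  local cert_file="$1"
  local days="$2"
  [ -r "$cert_file" ] || { echo "cannot read $cert_file" >&2; return 2; }
  # -checkend exits 0 if the certificate will still be valid that many
  # seconds from now, and 1 otherwise (including if it cannot be parsed).
  if openssl x509 -in "$cert_file" -noout -checkend $(( days * 86400 )) >/dev/null; then
    return 1  # still valid beyond the window
  fi
  return 0    # expires within the window, has already expired, or failed to parse
}
```

To see the exact expiry date, run openssl x509 -in littlepay-cert.pem -noout -enddate.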

Eligibility Server

If the Benefits application gets a 403 error when trying to make API calls to the Eligibility Server, it may be because the outbound IP addresses changed, and the Eligibility Server firewall is still restricting access to the old IP ranges.

  1. Grab the outbound_ip_ranges output values from the most recent Benefits deployment to the relevant environment.
  2. Update the IP ranges
    1. Go to the Eligibility Server Pipeline
    2. Click Edit
    3. Click Variables
    4. Update the relevant variable with the new list of CIDRs
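To see exactly which ranges changed before editing the variable, a set difference is enough. This sketch assumes that both the outbound_ip_ranges output and the existing Pipeline variable are comma-separated CIDR lists, though the actual variable format may differ. normalize_ranges and missing_ranges are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare the firewall's current allow list ($1) with the
# ranges from the latest deployment ($2). Both are comma-separated CIDR lists.
normalize_ranges() {
  # One range per line, whitespace removed, blanks dropped, sorted for comm
  tr ',' '\n' <<<"$1" | sed 's/[[:space:]]//g; /^$/d' | LC_ALL=C sort -u
}

# Prints each range present in the deployment list but absent from the allow list
missing_ranges() {
  LC_ALL=C comm -13 <(normalize_ranges "$1") <(normalize_ranges "$2")
}
```

Any output means the firewall is missing ranges. The variable should still be set to the complete new list, not just the difference.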

Note that there is nightly downtime as the Eligibility Server restarts and loads new data.