Troubleshooting ¶
Tools ¶
Monitoring ¶
We have ping tests set up to notify about availability of each environment. Alerts go to #benefits-notify.
Logs ¶
Azure App Service Logs ¶
Open the Logs for the environment you are interested in. The following tables are likely of interest:
- AppServiceConsoleLogs: stdout and stderr coming from the container
- AppServiceHTTPLogs: requests coming through App Service
- AppServicePlatformLogs: deployment information
For some pre-defined queries, click Queries, then Group by: Query type, and look under Query pack queries.
Live tail ¶
After setting up the Azure CLI, you can use the following command to stream live logs:
az webapp log tail --resource-group RG-CDT-PUB-VIP-CALITP-P-001 --name AS-CDT-PUB-VIP-CALITP-P-001 2>&1 | grep -v /healthcheck
SCM ¶
https://as-cdt-pub-vip-calitp-p-001-dev.scm.azurewebsites.net/api/logs/docker
Sentry ¶
Cal-ITP’s Sentry instance collects both errors (“Issues”) and app performance info.
Alerts are sent to #benefits-notify in Slack. Others can be configured.
You can troubleshoot Sentry itself by turning on debug mode and visiting /error/.
Specific issues ¶
This section serves as the runbook for Benefits.
Terraform lock ¶
If Terraform commands fail (locally or in the Pipeline) due to an Error acquiring the state lock:
- Check the Lock Info for the Created timestamp. If it's within the past ten minutes or so, Terraform is probably still running elsewhere, and you should wait (stop here).
- Are any Pipeline runs stuck? If so, cancel that build, and try re-running the Terraform command.
- Do any engineers have a Terraform command running locally? You'll need to ask them. For example, they may have started an apply that is sitting and waiting for them to approve it. They will need to (gracefully) exit for the lock to be released.
- If none of the steps above identified the source of the lock, and especially if the Created time is more than ten minutes ago, the last Terraform command probably didn't release the lock. You'll need to grab the ID from the Lock Info output and force unlock.
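For the last step, here is a minimal sketch of pulling the lock ID out of the error output and releasing the lock. It assumes the failed command's output was saved to a file (the name tf-error.txt is hypothetical) and that the Lock Info block contains an ID: line. The helper name extract_lock_id is also made up for illustration.

```shell
# Print the lock ID found in saved Terraform error output.
# $1: path to a file holding the output of the failed command (hypothetical).
# Scans every field so it also works when lines carry a leading "│" prefix.
extract_lock_id() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }' "$1"
}

# Hypothetical usage, run from the Terraform working directory that holds the lock:
# lock_id="$(extract_lock_id tf-error.txt)"
# terraform force-unlock "$lock_id"
```

terraform force-unlock asks for confirmation before releasing the lock. Only run it once you are confident no other Terraform process is still active.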
App fails to start ¶
If the container fails to start, you should see a downtime alert. Assuming this app version was working in another environment, the issue is likely due to misconfiguration. Some things you can do:
- Check the logs
- Ensure the environment variables and configuration data are set properly.
- Turn on debugging
- Force-push/revert the environment branch back to the old version to roll back
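The rollback in the last item can be sketched as a force-push of a known-good commit onto the environment branch. This is an illustration only: the function name rollback_env_branch is hypothetical, and the remote, branch, and commit are placeholders to replace with the values for the affected environment.

```shell
# Point an environment branch back at a known-good commit by force-pushing.
# $1: remote name, $2: environment branch, $3: known-good commit SHA or tag.
# All three are placeholders; substitute the values for your environment.
rollback_env_branch() {
  git push --force "$1" "$3:refs/heads/$2"
}

# Hypothetical usage:
# rollback_env_branch origin <environment-branch> <known-good-sha>
```

A force-push rewrites the remote branch, so confirm the target commit first, for example with git log.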
Littlepay API issue ¶
Littlepay API issues may show up as:
- The monitor failing
- The Connect your card button doesn't work
A common problem that causes Littlepay API failures is that the certificate expired. To resolve:
- Reach out to support@littlepay.com
- Receive a new certificate
- Put that certificate into the configuration data and/or the GitHub Actions secrets
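Before contacting Littlepay, it may help to confirm that the certificate has actually expired. A minimal sketch using openssl, assuming you have a local copy of the client certificate in PEM format (the file name littlepay-client.pem and the function name check_cert_expiry are hypothetical):

```shell
# Report whether a PEM certificate is still valid, then print its expiry date.
# $1: path to the certificate file (hypothetical; use your own copy).
check_cert_expiry() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 2
  fi
  # -checkend 0 exits non-zero if the certificate has already expired.
  if openssl x509 -checkend 0 -noout -in "$1" >/dev/null 2>&1; then
    echo "valid"
  else
    echo "expired"
  fi
  # Prints a line of the form "notAfter=<date>".
  openssl x509 -enddate -noout -in "$1"
}

# Hypothetical usage:
# check_cert_expiry littlepay-client.pem
```

If the certificate is still valid, the failure likely has another cause, and the logs are the next place to look.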
Eligibility Server ¶
If the Benefits application gets a 403 error when trying to make API calls to the Eligibility Server, it may be because the outbound IP addresses changed, and the Eligibility Server firewall is still restricting access to the old IP ranges.
- Grab the outbound_ip_ranges output values from the most recent Benefits deployment to the relevant environment.
- Update the IP ranges:
  - Go to the Eligibility Server Pipeline
  - Click Edit
  - Click Variables
  - Update the relevant variable with the new list of CIDRs
Note there is nightly downtime as the Eligibility Server restarts and loads new data.