Diagnostic Imaging Workflows

When Your PACS Queue Freezes at 2 AM—A 4-Step Recovery Checklist

The phone rings. The radiologist on call says studies stopped appearing in the worklist. You remote in and see it: the PACS queue is frozen, studies piling up with status "pending" or "queued" and nothing moving. It's 2 AM, and the ER has three stat CTs waiting. This is the moment that separates a smooth recovery from a long night of escalating tickets. This guide gives you a 4-step recovery checklist designed for exactly that scenario—plus the context you need to decide when to follow it and when to call for backup.

We've compiled this from patterns seen across dozens of imaging IT environments. No two PACS setups are identical, but the failure modes repeat. The checklist below assumes you have console or SSH access to the PACS server and basic familiarity with your queue management tool. If you don't have that access, step zero is calling whoever does.

1. The 2 AM Scenario: Where This Checklist Fits

Picture this: a mid-size hospital with a GE PACS, roughly 400 studies per day. The queue froze at 1:47 AM after a routine database maintenance window ended. The on-call tech tried restarting the queue service—twice—and now studies that were queued before the freeze are gone from the worklist entirely. The ER charge nurse is calling the radiology director. That's when they call you.

This checklist is for the person who gets that call. It's not a theoretical architecture review; it's a sequence of actions to take when the system is down and people are waiting. We assume you've already verified the basics: the PACS server is powered on, the network is up, and the database service is running. If any of those are false, fix those first. The checklist kicks in after those checks pass but the queue still isn't moving.

The four steps are: (1) Diagnose the queue state without touching it, (2) Clear the blockage with a controlled service restart, (3) Verify study flow and data integrity, and (4) Document what happened and set monitoring to prevent recurrence. Each step has sub-steps and failure modes we'll walk through.

Most teams miss the first step entirely. They jump straight to restarting services, which can make things worse—like when a database connection pool is exhausted and restarting the queue service just creates more connections, crashing the database. We'll cover that trap in detail.

When to deviate from the checklist

The two most common situations where this checklist does not apply are these. First, if your PACS vendor has published a known bug for your version that causes queue freezes, follow their documented recovery procedure instead. Second, if the freeze is accompanied by disk I/O errors or storage alerts, the problem may be hardware-level, and restarting software will not fix a failing RAID array. In that case, involve your storage team immediately. Section 6 covers these and two further exclusions: security incidents and planned maintenance overlap.

"The worst queue freezes I've seen were caused by someone restarting the database because the queue was stuck. That killed the entire PACS for four hours."

— Senior PACS analyst, Level I trauma center

2. Why Queues Freeze: The Foundations Most Teams Misunderstand

To recover a frozen queue, you need to understand what a PACS queue actually does. At its simplest, a queue is a list of studies waiting to be processed—ingested from modalities, routed to worklists, archived to long-term storage. But in practice, the queue is a set of database entries with status flags, and the queue service is a process that polls those entries, performs actions, and updates the status. When the process fails to update a status (because of a timeout, an exception, or a deadlock), the entry stays in "pending" and the queue appears frozen.
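To make that mechanism concrete, here is a toy model of a polling pass. It is a sketch, not any vendor's implementation: the `QueueEntry` class, the status values, and the `process` callback are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    study_uid: str
    status: str = "pending"  # hypothetical status values: "pending", "done"


def poll_once(entries, process):
    """One polling pass: handle pending entries in order, stop at the first failure."""
    completed = 0
    for entry in entries:
        if entry.status != "pending":
            continue
        try:
            process(entry)
        except Exception:
            # The status flag is never updated, so this entry and every
            # entry behind it stay "pending": the queue looks frozen.
            break
        entry.status = "done"
        completed += 1
    return completed
```

If `process` raises a timeout on the second of three entries, the first ends up "done" and the other two remain "pending", even though nothing was deleted.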

The most common root causes fall into three buckets:

  • Database connection exhaustion: The queue service opens connections to the database for each study. If connections are not closed properly (due to a bug or a spike in volume), the pool runs out. New studies are queued but never processed because the service can't connect to update their status.
  • Storage latency or timeout: When archiving a study, the queue service waits for a confirmation from the storage system. If the storage is slow or unresponsive, the service may hang waiting for a response, blocking the entire queue.
  • Auto-routing rule deadlocks: Complex routing rules (e.g., route chest CTs to radiologist A unless contrast, then to B) can create circular references or conflicts that cause the routing engine to hang on a single study, blocking subsequent studies.

Many teams assume a queue freeze is a network issue. They restart network services, which disconnects all active queue sessions and can corrupt the queue state. Blind service restarts are just as risky: in our composite scenario, the on-call tech who restarted the queue service twice was actually making things worse. Each restart spawned new connections to the database without closing the hung ones, eventually exhausting the connection pool. The correct first step is to check the database connection count and the queue service logs, not to restart blindly.

Another common misunderstanding is that "pending" studies are lost. They are not—they are still in the database with a status flag. The queue service just stopped updating them. Once the service is restarted correctly, those studies will be picked up again, as long as the service re-reads the queue from the database rather than from an in-memory cache. That's why step 2 of the checklist specifies a controlled restart that forces a full re-scan of the queue table.

The role of queue depth monitoring

If you have monitoring in place, a frozen queue often shows up as a distinctive pattern on the queue depth graph: the number of pending studies stops falling (it stays flat when few new studies arrive overnight, or climbs if modalities keep sending) while the number of processed studies stops increasing. If the queue service process is still running while this happens, the service is alive but not completing any actions, which points to a timeout or deadlock rather than a service crash. Many monitoring tools can alert on this pattern, but most teams don't configure that specific metric. We'll cover monitoring setup in step 4 of the checklist and in section 5.
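As a rough sketch of how that pattern could be checked from sampled metrics, the function below classifies a window of samples. The sample layout `(timestamp_s, pending, processed_total)`, the label names, and the function name are assumptions for illustration. The 300-second window mirrors the 5-minute figure used elsewhere in this guide.

```python
def classify_queue(samples, min_window_s=300):
    """Classify queue behavior from samples of (timestamp_s, pending, processed_total).

    Samples must be ordered oldest first.
    """
    if len(samples) < 2 or samples[-1][0] - samples[0][0] < min_window_s:
        return "insufficient-data"
    first, last = samples[0], samples[-1]
    if last[1] == 0:
        return "idle"  # nothing is waiting
    if last[2] - first[2] == 0:
        return "frozen"  # studies are waiting but none completed in the window
    if last[1] > first[1]:
        return "backing-up"  # completing work, but slower than arrivals
    return "draining"
```

A "frozen" result only says that nothing completed during the window. You still need the logs and the process list to tell a hung service from a crashed one.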

3. The 4-Step Recovery Checklist: Patterns That Usually Work

Here is the checklist itself, with the rationale and failure modes for each step. Print this and keep it near your console.

Step 1: Diagnose without touching

Before you restart anything, collect these data points:

  • Queue depth and age: Check the oldest pending study and the number of pending studies. Use your PACS admin console or a SQL query against the queue table. If the oldest study is more than 10 minutes old and the pending count isn't dropping, the queue is stuck.
  • Database connection count: Query the database for active connections from the queue service user. In the environments we've seen, a normal count is 5–20, but compare against your own baseline. If it's 100+ or near the max pool size, you have connection exhaustion.
  • Service logs: Look at the last few minutes of the queue service log. Search for "timeout", "error", "deadlock", or "connection refused". The last error before the freeze is often the root cause.
  • Storage system status: Check if the archive storage is responsive. Try a simple test—write a small file to the archive mount point or ping the storage controller.

If you see connection exhaustion, do not restart the queue service yet. First, kill the hung connections from the database side (with care—you may need DBA privileges). Then restart the queue service. If you see a storage timeout, check the storage system before restarting the queue—otherwise the new service instance will immediately hang on the same study.
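The diagnosis above can be sketched as two small helpers: one pulls the most recent suspicious log lines, the other turns the findings into a next action. The function names, the action labels, and the 0.9 "near the max pool" ratio are assumptions, as is checking storage before connections. Adjust them to your environment.

```python
import re

# Keywords taken from the checklist; matching is case-insensitive.
KEYWORDS = re.compile(r"timeout|error|deadlock|connection refused", re.IGNORECASE)


def last_matching_lines(log_lines, limit=5):
    """Return the last `limit` log lines that contain one of the keywords."""
    hits = [line.rstrip("\n") for line in log_lines if KEYWORDS.search(line)]
    return hits[-limit:]


def next_action(conn_count, max_pool, storage_ok, near_full=0.9):
    """Map step 1 findings to a next action (labels are illustrative)."""
    if not storage_ok:
        # A restarted service would hang on the same study again.
        return "check-storage-first"
    if conn_count >= near_full * max_pool:
        # Clear hung connections on the database side before any restart.
        return "kill-hung-connections-then-restart"
    return "controlled-restart"
```

For example, `next_action(98, 100, True)` points at connection cleanup first, while a healthy connection count with responsive storage falls through to the controlled restart in step 2.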

Step 2: Controlled service restart

If the diagnosis points to a deadlock or a transient error, perform a controlled restart:

  • Stop the queue service gracefully: Use the service stop command (e.g., systemctl stop pacs-queue). Wait for the process to exit completely. Check with ps or the service manager.
  • Clear any temporary lock files: Some queue services create lock files in /tmp or the application directory. Remove them if they exist. Check your vendor documentation for the exact path.
  • Start the queue service: Use the start command. Watch the logs for the first 30 seconds. You should see the service re-read the queue table and begin processing pending studies.
  • Verify processing: Check the queue depth after 2 minutes. It should be decreasing. If it stays flat, the service may be hung again—return to step 1.

A common failure at this step is restarting only the queue service when the database service is also down. Always verify the database is up and accepting connections before restarting the queue. Another pitfall: some PACS have multiple queue services (ingest, route, archive). Restart only the one that is frozen, but check that the others are still running.
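For teams that script this step, a minimal sketch follows. It assumes a systemd-managed service named `pacs-queue` (the example name used above). The `db_ready` hook is a hypothetical callable you would supply to confirm the database accepts connections. Lock-file cleanup is left as a comment because the path is vendor-specific.

```python
import subprocess
import time


def controlled_restart(service="pacs-queue", run=subprocess.run,
                       sleep=time.sleep, db_ready=lambda: True,
                       stop_timeout_s=60):
    """Stop the service, wait for it to exit, then start it again."""
    if not db_ready():
        raise RuntimeError("database is not accepting connections; fix that first")
    run(["systemctl", "stop", service], check=True)
    waited = 0
    # `systemctl is-active --quiet` exits 0 only while the unit is active.
    while run(["systemctl", "is-active", "--quiet", service]).returncode == 0:
        if waited >= stop_timeout_s:
            raise TimeoutError(f"{service} still active after {stop_timeout_s}s")
        sleep(1)
        waited += 1
    # Vendor-specific lock files would be removed here; the path is not assumed.
    run(["systemctl", "start", service], check=True)
```

The `run` and `sleep` parameters exist so the sequence can be rehearsed with stand-ins before it is ever pointed at a production server. Watching the logs and checking queue depth after two minutes remain manual steps.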

Step 3: Verify study flow and data integrity

Once the queue is moving again, confirm that studies are flowing correctly:

  • Check a sample study: Find a study that was pending before the freeze and verify it appears in the worklist with the correct status. Use the study UID from the queue table.
  • Check for duplicate studies: A restart can sometimes cause the same study to be processed twice, creating duplicates. Check the worklist for duplicate accession numbers from the time period around the freeze.
  • Verify archive: Ensure that studies that completed routing are also being archived. A frozen queue can mask a secondary archive failure.

If you find duplicates, you may need to manually merge or delete them. Most PACS have a utility for that, but it's a separate process. Document the duplicate UIDs for follow-up.
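The duplicate check can be sketched over an export of worklist rows. The row layout `(accession_number, study_uid)` is an assumption, so adapt it to whatever your PACS export provides.

```python
from collections import defaultdict


def find_duplicate_accessions(rows):
    """Group study UIDs by accession number; keep accessions seen more than once."""
    by_accession = defaultdict(list)
    for accession, study_uid in rows:
        by_accession[accession].append(study_uid)
    return {acc: uids for acc, uids in by_accession.items() if len(uids) > 1}
```

The returned mapping doubles as the list of UIDs to document for follow-up. It only flags candidates and does not merge or delete anything.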

Step 4: Document and set monitoring

After the immediate crisis is resolved, document what happened and why. Include:

  • Time of freeze and time of recovery
  • Root cause (from logs)
  • Actions taken (exactly which commands, in order)
  • Any studies affected (count, not patient-identifiable info)

Then, set up monitoring to catch the same pattern before it freezes again. At minimum, alert on queue depth > 50 pending studies for more than 5 minutes, and on queue service log errors. If you have a metrics pipeline, track the database connection count and alert on spikes.
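If your monitoring tool lacks a built-in "sustained threshold" rule, the queue depth alert can be approximated as below. The sample layout `(timestamp_s, pending)` and the function name are assumptions. The defaults mirror the 50-study and 5-minute figures above.

```python
def depth_alert(samples, threshold=50, sustain_s=300):
    """Return True if depth has stayed above `threshold` for more than `sustain_s`.

    `samples` is a list of (timestamp_s, pending), oldest first. Only a breach
    that is still ongoing at the latest sample counts.
    """
    breach_start = None
    for ts, pending in samples:
        if pending > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # depth recovered, so reset the clock
    return breach_start is not None and samples[-1][0] - breach_start > sustain_s
```

Resetting on any recovery keeps short bursts, such as a batch of studies arriving at once, from paging anyone.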

"After we added queue depth alerts, our average recovery time dropped from 45 minutes to 8. We catch it before anyone calls."

— Imaging IT manager, 400-bed hospital

4. Anti-Patterns: Why Teams Revert and Make Things Worse

This section covers the most common recovery mistakes. Avoiding these is as important as following the checklist.

Restarting the database first

When the queue is frozen, some teams instinctively restart the database because it's the "big hammer." This is almost always a mistake unless you've confirmed the database itself is hung. Restarting the database kills all connections (including those from other services like the web server and modality gateways), and it can take 10–30 minutes for the database to recover, especially if it needs to replay transaction logs. Meanwhile, modalities may start rejecting studies because they can't connect. The queue freeze becomes a full PACS outage.

Re-indexing without a backup

Another anti-pattern: running a queue re-index utility while the queue is frozen. Re-indexing can take hours on a large queue table and may lock the table, preventing the queue service from reading it even after a restart. If you must re-index, do it during scheduled downtime, and always take a database backup first. In a 2 AM emergency, re-indexing is almost never the right first step.

Ignoring the logs

It's tempting to skip log analysis and just restart services. But the logs often contain the exact error message that explains the freeze. A single line like "ORA-00060: deadlock detected" tells you there's a database deadlock that a restart alone won't fix—you need to identify the conflicting transactions. Ignoring the logs means you may restart into the same deadlock repeatedly.

Assuming it's a one-time glitch

If the queue freezes at 2 AM and you restart it and it works, you might be tempted to go back to sleep and investigate in the morning. But if you don't find the root cause, it will happen again—often at 2 AM the next night. The monitoring step in the checklist is not optional; it's what breaks the cycle.

One team we know had queue freezes every Tuesday at 2 AM for three weeks. Each time, they restarted the service and it worked. The root cause turned out to be a weekly database maintenance job that was locking the queue table. Once they rescheduled that job, the freezes stopped. They never would have found it without checking the logs and correlating with the maintenance schedule.
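That kind of correlation is easy to miss by eye. A small sketch like the following groups documented freeze times by weekday and hour so that a recurring slot stands out. The function name and the example dates are hypothetical.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def recurring_slots(freeze_times, min_count=2):
    """Count freezes per (weekday, hour) and return slots seen at least `min_count` times."""
    slots = Counter((DAYS[t.weekday()], t.hour) for t in freeze_times)
    return [(slot, n) for slot, n in slots.most_common() if n >= min_count]


# Hypothetical incident log: three freezes, one week apart.
freezes = [datetime(2024, 1, 2, 2, 5),
           datetime(2024, 1, 9, 2, 1),
           datetime(2024, 1, 16, 2, 3)]
print(recurring_slots(freezes))
```

A slot that keeps reappearing is the cue to compare against scheduled jobs such as database maintenance or backups.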

5. Maintenance, Drift, and Long-Term Costs

Even with a perfect recovery checklist, queue freezes will recur if you don't address the underlying system health. This section covers the maintenance practices that prevent drift and reduce the frequency of 2 AM calls.

Queue depth trending

Monitor the queue depth over time, not just when there's an alert. A gradual increase in average queue depth (e.g., from 10 to 50 over a month) can indicate a slow leak—maybe the archive storage is getting slower, or the routing rules are becoming more complex. If you catch it early, you can tune the system before it freezes.

Set up a weekly report of queue depth at peak hours. If the depth exceeds a threshold (e.g., 100 pending), investigate proactively. Many monitoring tools (for example, Prometheus fed by an exporter that reads your PACS queue table) can track this automatically.

Database connection pool tuning

The most common root cause we see is connection pool exhaustion. The fix is often simple: increase the max pool size in the queue service configuration. But there's a trade-off: too many connections can overwhelm the database. The right approach is to monitor the connection count during normal operation, then set the max pool to 2–3 times the normal peak. Also, ensure that connections are properly released—some PACS versions have a bug where connections are not closed after a timeout. Check with your vendor for patches.
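The sizing rule works out as simple arithmetic. The sketch below applies the 2x to 3x range and, as an added assumption, caps the result at a database-side connection limit if you know one.

```python
def pool_size_range(normal_peak_connections, db_max_connections=None):
    """Return (low, high) max-pool candidates: 2x to 3x the normal peak."""
    low = 2 * normal_peak_connections
    high = 3 * normal_peak_connections
    if db_max_connections is not None:
        # Never size the pool beyond what the database itself will accept.
        low = min(low, db_max_connections)
        high = min(high, db_max_connections)
    return low, high
```

With a normal peak of 20 connections this gives a range of 40 to 60. If the database only allows 50 connections in total, the upper end drops to 50, and connections used by other services should be subtracted as well.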

Storage performance baselining

Storage latency is another frequent culprit. Baseline the average time to archive a study (from queue entry to archive confirmation). If that time starts creeping up, investigate the storage system: are the disks near capacity? Is the network between PACS and storage saturated? Regular performance testing (e.g., writing a test file and measuring the response time) can catch degradation before it causes a freeze.
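A test-file probe along those lines might look like this sketch. The function name, the 1 MiB default size, and the probe file prefix are assumptions. Point `directory` at a scratch area on the archive mount, not at live study data.

```python
import os
import tempfile
import time


def write_latency_s(directory, size_bytes=1024 * 1024):
    """Write a temporary file, force it to storage, and return elapsed seconds."""
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency-probe-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # wait until the data has been handed to the device
        return time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the probe file
```

Run it on a schedule and record the result. The trend matters more than any single reading, since caches and network conditions make individual measurements noisy.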

Auto-routing rule audits

Routing rules are often added over time without review. A rule that routes studies based on modality, body part, contrast, and referring physician can create a combinatorial explosion of conditions. If two rules conflict (e.g., one routes chest CTs to radiologist A, another routes all CTs to radiologist B), the routing engine may deadlock on studies that match both. Schedule a quarterly review of routing rules: remove unused rules, simplify complex ones, and test new rules in a staging environment before deploying.
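As an aid for those audits, the following sketch flags pairs of rules that could match the same study yet send it to different destinations. The rule format (a `match` dict of exact field values plus a `route_to` target) is a simplified assumption. Real routing engines often add priorities, wildcards, or negations that this does not model.

```python
from itertools import combinations


def conflicting_rules(rules):
    """Return name pairs of rules that can match the same study but route differently."""
    conflicts = []
    for a, b in combinations(rules, 2):
        shared = a["match"].keys() & b["match"].keys()
        # Two rules overlap when no shared field demands different values.
        overlap = all(a["match"][k] == b["match"][k] for k in shared)
        if overlap and a["route_to"] != b["route_to"]:
            conflicts.append((a["name"], b["name"]))
    return conflicts


rules = [
    {"name": "chest-ct", "match": {"modality": "CT", "body_part": "CHEST"}, "route_to": "A"},
    {"name": "all-ct", "match": {"modality": "CT"}, "route_to": "B"},
    {"name": "all-mr", "match": {"modality": "MR"}, "route_to": "C"},
]
print(conflicting_rules(rules))
```

Here the chest CT rule and the all-CT rule are reported as a pair, which matches the conflict described above. A flagged pair is a prompt for review rather than proof of a deadlock.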

The cost of ignoring maintenance is not just the 2 AM call. It's the radiologist overtime, the delayed diagnoses, and the erosion of trust between IT and clinical staff. One hospital calculated that each queue freeze cost an average of $3,200 in overtime and lost productivity. Over a year with monthly freezes, that's $38,400—easily enough to justify a monitoring upgrade or a database tuning project.

6. When NOT to Use This Checklist

This checklist is for software-level queue freezes in a PACS that has been running normally. There are situations where following these steps will not help—and may harm.

Hardware failure

If the queue freeze is accompanied by disk errors, storage alerts, or server hardware faults, do not restart services. Restarting a server with a failing disk can cause data corruption. Instead, involve your hardware support team immediately. The checklist assumes the hardware is healthy. If you see S.M.A.R.T. errors on the storage array or RAID controller warnings, escalate first.

Known vendor bug

If your PACS vendor has published a known bug that causes queue freezes under specific conditions (e.g., a certain database version or a specific load pattern), follow their documented workaround. Using a generic checklist may not resolve the issue and could cause you to miss the vendor's recommended steps. Check your vendor's support portal for known issues before proceeding.

Security incident

If you suspect the queue freeze is caused by a security incident (e.g., ransomware, unauthorized access), do not restart services. Isolate the affected systems and follow your incident response plan. Restarting could destroy forensic evidence or allow the attacker to regain access. The checklist is for technical failures, not security events.

Planned maintenance overlap

If the queue froze during a scheduled maintenance window (e.g., a database upgrade), the freeze may be a normal part of the maintenance process. Check the maintenance plan before intervening. Some maintenance procedures intentionally pause the queue service. Restarting it prematurely could interfere with the maintenance and cause data inconsistency.

In all these cases, the correct first step is to call for help—your PACS vendor, your storage team, or your security team—rather than following a generic recovery checklist.

7. Open Questions and FAQ

Based on real questions from imaging IT teams, here are answers to the most common uncertainties.

Q: How do I know if the queue is truly frozen or just slow?

A slow queue still shows studies completing and depth decreasing over time, just at a slower rate. A frozen queue completes nothing for 5+ minutes: the depth stays flat, or grows as new studies arrive. Check the timestamp of the oldest pending study: if it's more than 10 minutes old and the count isn't dropping, it's frozen.

Q: Can I restart the queue service without stopping other services?

Yes, in most PACS architectures the queue service is independent. However, stopping the queue service will prevent new studies from being processed until it restarts. That's fine for a few minutes. But do not stop the database or the web server unless you've confirmed they are the cause.

Q: What if the queue service won't stop gracefully?

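Use a force stop (e.g., systemctl kill --signal=SIGKILL pacs-queue, or kill -9 with the process ID). Note that systemctl kill sends SIGTERM by default, so the signal must be specified for a true force stop. After a force stop, check for orphaned processes and lock files before restarting. Force stopping can leave the queue in an inconsistent state, so verify study integrity afterward.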

Q: Should I clear the pending studies manually?

Generally no. Clearing pending studies from the database can cause data loss if those studies haven't been archived. The queue service should reprocess them after a restart. Only clear studies if you have a backup and a specific reason (e.g., the study is corrupted and causing the freeze).

Q: How do I prevent this from happening again?

Implement the monitoring and maintenance practices from section 5. At minimum, set up queue depth alerts and log error alerts. Schedule quarterly reviews of routing rules and database connection pool usage. Also, document every freeze and review the pattern after three months—if you see a recurrence, escalate to your vendor.

Q: What if the checklist doesn't work?

If you've followed the four steps and the queue is still frozen after 15 minutes, escalate to your PACS vendor support. There may be a deeper issue that requires vendor intervention, such as database corruption or a software bug that needs a patch. While waiting, keep the logs from your diagnosis steps, because they will help the vendor troubleshoot faster.

Remember: the goal is not just to recover the queue, but to recover it safely without data loss. A few extra minutes of careful diagnosis are better than a corrupted database that takes days to restore.
