Optimizing Production Line Policies: A Queueing Theory (QT) Perspective

This article discusses using Queueing Theory (QT) to optimize maintenance in production lines, reducing downtime and costs while improving operational efficiency.

Prepared by Viktor Plamenov

Minimizing downtime in a production line is essential for maintaining operational efficiency and profitability. This analysis applies Queueing Theory (QT) to assess machine breakdowns and repair times, evaluating the financial impact of delays and exploring the benefit of adding a second maintenance and repair team. The analysis provides a framework for modeling systems that have a random component and shows how optimizing the maintenance policies can significantly reduce downtime and associated costs, leading to improved production continuity.

1. Scenario Overview

A beverage company operates a production line consisting of three key machines: Machine 1 (filling), Machine 2 (labeling), and Machine 3 (packing). These machines break down at different rates, and repairs are handled by a shared maintenance and repair team. The key challenge is to minimize machine downtime and reduce the financial impact of delayed repairs. The input data for the case study is summarized in Table 1.1, and the problem schematic is shown in Figure 1.1.

Figure 1.1: High level overview of the production line

The main objectives of the study are as follows:

Assess the expected costs incurred due to the downtime of each machine, quantifying the financial impact associated with machine inactivity.
Evaluate whether hiring an additional maintenance and repair team presents a cost-effective solution for reducing downtime expenses.
Analyze the distribution of downtime-related losses, including the probability of repair wait times exceeding a certain time threshold.

To answer these questions, we can model the production line as a queueing system [1][3].

Figure 1.2: One simulated trajectory of machine failures

2. Problem Setup

To effectively model this system using Queueing Theory (QT), we must gather key data on several parameters: the failure rates (λ) of the three machines, the repair team’s service rate in restoring machines to operation (µ), the service protocol (e.g., first-come, first-served or prioritized repair), and the number of available servers (repair teams). This information is essential for accurately representing system dynamics, enabling a rigorous analysis of performance and potential optimization strategies.

2.1 Total Failure Rate (λtotal)

The total breakdown rate across all machines is the sum of the individual failure rates:

λtotal = λ1 + λ2 + λ3 = 0.0175 + 0.0075 + 0.0100 = 0.035 failures per hour

With a total failure rate of 0.035 per hour, we can expect a failure once every 1/0.035 ≈ 28.6 hours on average, with machine 1 failing once every 57.1 hours, machine 2 once every 133.3 hours, and machine 3 once every 100 hours. Repairs follow the FIFO (First-In-First-Out) protocol, meaning that the machine that fails earliest is repaired first.
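As a quick check of the arithmetic above, the following minimal Python sketch recomputes the total failure rate and the mean times between failures. The dictionary keys and variable names are illustrative, and the per-machine rates are taken as the reciprocals of the failure intervals quoted in the text.

```python
# Failure rates per hour, taken as reciprocals of the mean times between
# failures quoted in the text (57.1 h, 133.3 h and 100 h).
failure_rates = {"machine_1": 0.0175, "machine_2": 0.0075, "machine_3": 0.0100}

# Independent failure streams add up, so the total rate is the plain sum.
lambda_total = sum(failure_rates.values())
mean_time_between_failures = 1.0 / lambda_total

print(f"Total failure rate: {lambda_total:.4f} per hour")
print(f"Mean time between failures (any machine): {mean_time_between_failures:.1f} hours")
for name, rate in failure_rates.items():
    print(f"{name}: one failure every {1.0 / rate:.1f} hours on average")
```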

Table 1.1: Machine Breakdown and Repair Data

2.2 Effective Service Rate (µ effective)

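The effective service rate can be estimated by first computing the share of failures attributable to each of the three machines. To do that, we divide the individual failure rates by the total failure rate:

Machine 1: 0.0175 / 0.035 = 0.50
Machine 2: 0.0075 / 0.035 ≈ 0.21
Machine 3: 0.0100 / 0.035 ≈ 0.29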

Based on this, we can see that the probability of the repair team working on machine 1 is more than twice the probability of it working on machine 2, which is to be expected because machine 1 breaks down more than twice as often. To get the effective service (repair) rate, we take the average of the individual service rates weighted by these probabilities, which gives µeffective ≈ 0.065 repairs per hour.
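The weighting step can be sketched in a few lines of Python. Note that the per-machine repair rates used here (0.065, 0.060 and 0.070 per hour) are an assumption: they are not listed in the text and were back-derived from the per-machine downtimes reported later in the article, so treat them as illustrative rather than as the original input data.

```python
failure_rates = {"machine_1": 0.0175, "machine_2": 0.0075, "machine_3": 0.0100}

# ASSUMED per-machine repair rates (repairs per hour). These are not given in
# the text; they are back-derived from the downtimes quoted in later sections.
repair_rates = {"machine_1": 0.065, "machine_2": 0.060, "machine_3": 0.070}

lambda_total = sum(failure_rates.values())

# Share of all failures that come from each machine.
failure_share = {m: rate / lambda_total for m, rate in failure_rates.items()}

# Effective service rate: repair rates weighted by the failure shares.
mu_effective = sum(failure_share[m] * repair_rates[m] for m in failure_rates)

for m, share in failure_share.items():
    print(f"{m}: share of failures = {share:.3f}")
print(f"Effective service rate: {mu_effective:.4f} repairs per hour")
```

Under these assumed rates the weighted average comes out very close to the 0.065 used in the rest of the analysis.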

3. Queue Properties

3.1 Service Time

Based on the service rate, we can obtain the mean service time, which in this case is 1/0.065 ≈ 15.4 hours. Note that this figure is aggregated across the three machines. The next natural question is how heavily the repair team is utilized: is it overutilized, underutilized, or about right? The utilization [1][2] of the repair team can be estimated as follows:

ρ = λtotal / µeffective = 0.035 / 0.065 ≈ 0.538

For a single-server queue, the expected number of machines waiting is ρ² / (1 − ρ) ≈ 0.63, and dividing this by λtotal gives the expected waiting time W1 ≈ 17.9 hours. In other words, on average 0.6 machines are waiting for repair, and a failed machine waits close to 18 hours before its repair starts. Returning to full operation after a breakdown therefore takes more than 33 hours on average (waiting plus repair). To be more precise, the total nonoperational time of each machine is the waiting time W1 plus the repair time of that machine. These formulae are valid for a system with a single server and need adjustment when there are more servers (repair teams). For the three machines we get the following nonoperational times:

Machine 1 (filling): 33.33 hours
Machine 2 (labeling): 34.62 hours
Machine 3 (packing): 32.24 hours

From these results, we observe that when machine 2 fails, on average it is back to an operational state almost two and a half hours later compared to machine 3.
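The single-team figures can be reproduced with the standard M/M/1 formulas, as in the sketch below. The total failure rate and effective repair rate come from the text; the per-machine repair rates are an assumption back-derived from the reported downtimes, and all variable names are illustrative.

```python
lambda_total = 0.035  # total failure rate per hour (from the text)
mu_effective = 0.065  # effective repair rate per hour (from the text)

# ASSUMED per-machine repair rates, back-derived from the reported downtimes.
repair_rates = {"machine_1": 0.065, "machine_2": 0.060, "machine_3": 0.070}

rho = lambda_total / mu_effective            # utilization of the single team
queue_length = rho ** 2 / (1 - rho)          # expected machines waiting (M/M/1)
waiting_time = queue_length / lambda_total   # Little's law: waiting time in hours
mean_downtime = waiting_time + 1 / mu_effective

# Downtime per machine = common waiting time + that machine's own repair time.
downtime_by_machine = {m: waiting_time + 1 / mu for m, mu in repair_rates.items()}

print(f"Utilization: {rho:.1%}")
print(f"Machines waiting: {queue_length:.2f}")
print(f"Waiting time: {waiting_time:.2f} h, mean downtime: {mean_downtime:.2f} h")
for m, hours in downtime_by_machine.items():
    print(f"{m}: {hours:.2f} h out of operation per failure")
```

Small differences in the last digit relative to the figures in the text are due to rounding of intermediate values.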

4. Downtime Costs with One Repair Team

Using the expected nonoperational time per failure (waiting plus repair, about 33.33 hours on average), the downtime costs for each machine are:

Machine 1 (filling) downtime cost:
33.33 hours × 5,000 $/hour = 166,650 $/failure
Machine 2 (labeling) downtime cost:
34.62 hours × 4,000 $/hour = 138,480 $/failure
Machine 3 (packing) downtime cost:
32.24 hours × 4,500 $/hour = 145,080 $/failure
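The per-failure costs listed above are simple products of downtime and hourly cost. A minimal Python sketch of that arithmetic, using the figures quoted in this section (the variable names are illustrative), is:

```python
# Expected hours out of operation per failure with one repair team (from the text).
downtime_hours = {"machine_1": 33.33, "machine_2": 34.62, "machine_3": 32.24}
# Hourly downtime cost in dollars (from the text).
cost_per_hour = {"machine_1": 5000, "machine_2": 4000, "machine_3": 4500}

cost_per_failure = {m: downtime_hours[m] * cost_per_hour[m] for m in downtime_hours}
# One "breakdown cycle" here means one failure of each machine.
total_per_cycle = sum(cost_per_failure.values())

for m, cost in cost_per_failure.items():
    print(f"{m}: ${cost:,.0f} per failure")
print(f"Total per breakdown cycle: ${total_per_cycle:,.0f}")
```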

Total downtime cost per breakdown cycle = $450,210. From a system utilization perspective, a utilization factor of 53.8% may appear moderate, or even low by some standards [3], and in many scenarios, such performance might be deemed acceptable. However, in this context, any operational disruption incurs substantial economic costs, making the decision to engage a second maintenance team potentially cost-effective. Here, the objective is not to maximize utilization but rather to minimize downtime costs, adjusted for the additional costs incurred. For this analysis, we define the cost of an additional repair team as S. The goal is to determine whether the cost savings from reduced downtime justify the expense of employing a second team.

5. Adding a Second Repair Team

Adding a second repair team reduces the system’s overall utilization and improves response times. The system becomes an M/M/2 queue. See [1][2] for a more detailed overview of queueing systems and the Kendall notation.

5.1 Utilization with Two Repair Teams (ρ2)

With two repair teams, the total system utilization drops to:

ρ2 = λtotal / (2 × µeffective) = 0.035 / (2 × 0.065) ≈ 0.27

The utilization factor indicates that each repair team is busy only about 27% of the time on average. It is important to note that adding a second repair team does not make an individual repair any faster. Instead, its primary effect is to reduce waiting times and queue length. When only one machine is down, the repair time remains, on average, the same regardless of whether one or two maintenance and repair teams are available.
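The utilization comparison can be checked with a short Python sketch. The function name `utilization` is illustrative; the rates are the ones used throughout the article.

```python
lambda_total = 0.035  # total failure rate per hour
mu_effective = 0.065  # effective repair rate per hour, per team

def utilization(teams: int) -> float:
    """Average fraction of time each team is busy when `teams` share the load."""
    return lambda_total / (teams * mu_effective)

for teams in (1, 2):
    print(f"{teams} team(s): utilization per team = {utilization(teams):.1%}")
```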

5.2 Queue Size and Waiting Time

The expected waiting time with two repair teams cannot be obtained from the single-server formula. The earlier formulae have to be adjusted for the multi-server case [1][2].

Figure 5.1: Failure arrivals as a function of time.

From this follow the nonoperational times when two repair teams are available:

Machine 1 (filling): 16.6 hours
Machine 2 (labeling): 17.9 hours
Machine 3 (packing): 15.5 hours

With two teams, the expected queue size is about 94% smaller than in the single-team scenario. The waiting time also decreases significantly, from nearly 18 hours to just over one hour. As a result, most of the downtime of a non-operational machine is now spent in active repair, whereas in the one-team setup the waiting time exceeded the repair time. The total time from breakdown to full operation is now 16.54 hours, a reduction of just over 50%.
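One common way to obtain the multi-server figures is the Erlang C formula for an M/M/c queue. The sketch below is a minimal implementation under the assumption that failures arrive as a Poisson stream and repair times are exponential with the effective rate of 0.065 per hour; the function names `erlang_c` and `mmc_metrics` are illustrative. Because it works with unrounded intermediate values, its results can differ slightly from the rounded figures quoted in the text.

```python
from math import factorial

def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving job has to wait in an M/M/c queue.

    offered_load is lambda / mu (in Erlangs) and must be below `servers`.
    """
    rho = offered_load / servers
    if rho >= 1:
        raise ValueError("system is unstable: utilization must be below 1")
    waiting_term = offered_load ** servers / factorial(servers) / (1 - rho)
    no_wait_terms = sum(offered_load ** k / factorial(k) for k in range(servers))
    return waiting_term / (no_wait_terms + waiting_term)

def mmc_metrics(lam: float, mu: float, servers: int) -> dict:
    """Utilization, queue length, waiting time and time in system for M/M/c."""
    offered_load = lam / mu
    rho = offered_load / servers
    p_wait = erlang_c(servers, offered_load)
    queue_length = p_wait * rho / (1 - rho)
    waiting_time = queue_length / lam  # Little's law
    return {
        "rho": rho,
        "p_wait": p_wait,
        "queue_length": queue_length,
        "waiting_time": waiting_time,
        "time_in_system": waiting_time + 1 / mu,
    }

one_team = mmc_metrics(0.035, 0.065, servers=1)
two_teams = mmc_metrics(0.035, 0.065, servers=2)
for label, m in (("one team", one_team), ("two teams", two_teams)):
    print(f"{label}: queue = {m['queue_length']:.3f} machines, "
          f"wait = {m['waiting_time']:.2f} h, downtime = {m['time_in_system']:.2f} h")
```

With a single server the function reduces to the M/M/1 results used earlier, which makes it a convenient consistency check.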

Table 5.1: Monthly Downtime Costs with One Repair Team. Numbers rounded to the nearest thousand.

Table 5.2: Monthly Downtime Costs with Two Repair Teams. Numbers rounded to the nearest thousand.

Figure 5.2: Simulated trajectory of machines waiting for repair.
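The analytic waiting times (close to 18 hours with one team, just over one hour with two) can be cross-checked with a small simulation. The sketch below is not the simulation used to produce the figures in this article; it is a minimal FIFO multi-server model assuming Poisson failures and exponential repair times at the effective rate, with a hypothetical function name `simulate_mean_wait` and an arbitrary random seed.

```python
import heapq
import random

def simulate_mean_wait(lam: float, mu: float, teams: int,
                       n_failures: int, seed: int = 0) -> float:
    """Estimate the mean wait before repair starts in a FIFO M/M/c queue."""
    rng = random.Random(seed)
    # Min-heap holding the time at which each repair team next becomes free.
    free_at = [0.0] * teams
    heapq.heapify(free_at)
    clock = 0.0
    total_wait = 0.0
    for _ in range(n_failures):
        clock += rng.expovariate(lam)          # time of the next failure
        earliest_free = heapq.heappop(free_at)  # team that frees up first
        start = max(clock, earliest_free)       # repair starts when both are ready
        total_wait += start - clock
        heapq.heappush(free_at, start + rng.expovariate(mu))
    return total_wait / n_failures

wait_one_team = simulate_mean_wait(0.035, 0.065, teams=1, n_failures=200_000)
wait_two_teams = simulate_mean_wait(0.035, 0.065, teams=2, n_failures=200_000)
print(f"Simulated mean wait, one team:  {wait_one_team:.2f} hours")
print(f"Simulated mean wait, two teams: {wait_two_teams:.2f} hours")
```

Because the estimate is random, it will only approximate the analytic values; a longer run narrows the gap.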


6. Downtime Costs with Two Repair Teams

With the reduced waiting time, the downtime costs are:

Machine 1 (filling) downtime cost:
16.6 hours × 5,000 $/hour = 83,000 $/failure
Machine 2 (labeling) downtime cost:
17.9 hours × 4,000 $/hour = 71,600 $/failure
Machine 3 (packing) downtime cost:
15.5 hours × 4,500 $/hour = 69,750 $/failure

Total downtime cost per breakdown cycle with two teams = $224,350. In order to compare the repair team costs and the downtime costs, we need to have a common unit of time. To that end, we can estimate the number of machine failures per month by assuming the operational time is 24 hours per day and we have 720 hours per month.

The expected number of monthly failures for each machine is calculated from its failure rate. We use the formula:

Expected monthly failures = λ × 720

where λ represents the failure rate per hour, and 720 is the number of hours in a month (assuming 30 days of continuous operation). Next, multiplying the monthly number of failures by the average downtime per failure and the hourly downtime cost of each machine gives the total monthly downtime cost for the single maintenance and repair team scenario. For this calculation, the average downtime is taken to be 33.3 hours, with a more detailed breakdown in Table 5.1.

The total monthly downtime cost with one repair team amounts to $3,893,000. With two maintenance and repair teams, the average downtime per failure falls to about 16.6 hours, and the total monthly downtime cost is reduced to $1,930,000. A more detailed breakdown of the costs is shown in Table 5.2.
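The monthly totals can be approximated with the sketch below, which multiplies the expected monthly failures by the per-failure downtime and the hourly cost. All inputs are figures quoted in the text; the names are illustrative. Because the article rounds each table entry to the nearest thousand, the sums computed here agree with the quoted totals only to within a few thousand dollars.

```python
HOURS_PER_MONTH = 720  # 30 days of continuous operation

failure_rates = {"machine_1": 0.0175, "machine_2": 0.0075, "machine_3": 0.0100}
cost_per_hour = {"machine_1": 5000, "machine_2": 4000, "machine_3": 4500}

# Hours out of operation per failure, as quoted for each scenario.
downtime_one_team = {"machine_1": 33.33, "machine_2": 34.62, "machine_3": 32.24}
downtime_two_teams = {"machine_1": 16.6, "machine_2": 17.9, "machine_3": 15.5}

def monthly_cost(downtime_hours: dict) -> float:
    """Expected monthly downtime cost summed over the three machines."""
    return sum(
        failure_rates[m] * HOURS_PER_MONTH * downtime_hours[m] * cost_per_hour[m]
        for m in failure_rates
    )

cost_one_team = monthly_cost(downtime_one_team)
cost_two_teams = monthly_cost(downtime_two_teams)
print(f"One team:  ${cost_one_team:,.0f} per month")
print(f"Two teams: ${cost_two_teams:,.0f} per month")
```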

7. Decision on Hiring Second Repair Team

The cost difference between the scenarios with one repair team and two repair teams is calculated as:

Monthly savings = $3,893,000 − $1,930,000 = $1,963,000

In this scenario, the estimated monthly savings from reducing downtime with two maintenance teams come to $1,963,000. The next step is to compare these savings directly against the monthly cost S of employing the second repair team to make an informed financial decision.
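The comparison against the team cost S can be expressed as a simple break-even check. In the sketch below, the two monthly cost totals come from the text, while the example values passed for S are purely hypothetical, since the article does not state what a second team costs.

```python
monthly_cost_one_team = 3_893_000   # from the text
monthly_cost_two_teams = 1_930_000  # from the text

monthly_savings = monthly_cost_one_team - monthly_cost_two_teams

def net_benefit(team_cost_s: float) -> float:
    """Monthly savings left over after paying for the second team."""
    return monthly_savings - team_cost_s

def worth_hiring(team_cost_s: float) -> bool:
    """The second team pays for itself when S is below the monthly savings."""
    return net_benefit(team_cost_s) > 0

# HYPOTHETICAL values of S, used only to illustrate the decision rule.
for s in (150_000, 2_500_000):
    print(f"S = ${s:,}: net benefit = ${net_benefit(s):,}, hire = {worth_hiring(s)}")
```

The break-even point is therefore S equal to the monthly savings; any team cost below that leaves a positive net benefit.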

8. Downtime Losses - Worst Case Scenario

The results presented above apply to the stationary distribution [1][2] and describe the average expected behavior of the system. To assess the likelihood of experiencing higher-than-average losses, we must estimate the probability of having n machines in the queue. By combining this with the estimated repair times, we can project potential losses under more extreme conditions. This analysis uses the Erlang C formula [1][2], which gives the probability of a non-zero waiting time and offers a deeper understanding of the risk associated with larger queues and prolonged downtime.

Table 8.1: Distribution of Probability Values for Machines in the Queue

Table 8.1 illustrates that, for instance, the probability of observing 3 machines down at the same time (waiting or under repair) is 7.2% with one maintenance and repair team and decreases to 2.2% with two teams available (P2). Furthermore, the likelihood of having 3 or more machines down is 15.6% with one team compared to 3.1% with two teams. Given an average downtime of approximately 33.3 hours per machine, the total time required to clear the queue and bring the last machine back online could exceed 100 hours of nonoperational time in a single episode.
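The probabilities quoted above are consistent with the stationary distribution of the number of machines down (waiting or under repair) in an M/M/c queue. The following sketch computes that distribution under the same Poisson and exponential assumptions; the function name `state_probabilities` is illustrative.

```python
from math import factorial

def state_probabilities(lam: float, mu: float, servers: int, n_max: int) -> list:
    """Stationary probabilities P(0..n_max machines down) for an M/M/c queue."""
    load = lam / mu
    rho = load / servers
    # Normalizing constant: finite sum for n < c plus the geometric tail for n >= c.
    p0 = 1.0 / (
        sum(load ** k / factorial(k) for k in range(servers))
        + load ** servers / factorial(servers) / (1 - rho)
    )
    probs = []
    for n in range(n_max + 1):
        if n < servers:
            probs.append(p0 * load ** n / factorial(n))
        else:
            probs.append(p0 * load ** n / (factorial(servers) * servers ** (n - servers)))
    return probs

p_one_team = state_probabilities(0.035, 0.065, servers=1, n_max=3)
p_two_teams = state_probabilities(0.035, 0.065, servers=2, n_max=3)

for label, p in (("one team", p_one_team), ("two teams", p_two_teams)):
    three_or_more = 1.0 - sum(p[:3])
    print(f"{label}: P(3 down) = {p[3]:.1%}, P(3 or more down) = {three_or_more:.1%}")
```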

Figure 8.1: Waiting probability with one and two maintenance teams

9. Business Insights and Recommendations

Current System Inefficiency: With one repair team, downtime averages 33.3 hours per breakdown (of which close to 18 hours is waiting for repair to start), leading to downtime costs of approximately $3.9 mln per month. This highlights the substantial financial impact of machine downtime in the current system, where any stoppage can severely hinder operational throughput.
Financial Justification for a Second Repair Team: Introducing a second repair team cuts downtime by approximately 50%, to 16.5 hours per breakdown, lowering the total monthly downtime costs to $1.93 mln. This translates into savings of $1.96 mln per month, providing a strong financial justification for the investment in a second maintenance team. As long as the monthly cost S of the additional team stays well below this figure, the investment is a prudent one.
Utilization Maximization vs Downtime Minimization: While adding a second repair team reduces utilization to around 27% for each team, the substantial reduction in downtime costs and the system’s ability to absorb unexpected surges in machine failures outweigh the concerns about underutilization. The second repair team ensures that the system can handle multiple breakdowns without significant delays, improving production reliability and efficiency.

References

[1] Gross, D., Shortle, J. F., Thompson, J. M., & Harris, C. M. (2008). Fundamentals of Queueing Theory (4th ed.). Wiley.
[2] Buzacott, J. A., & Shanthikumar, J. G. (1993). Stochastic Models of Manufacturing Systems. Prentice Hall.
[3] Alden, J. M., Burns, L. D., Costy, T., Hutton, R. D., Jackson, C. A., Kim, D. S., Kohls, K. A., Owen, J. H., Turnquist, M. A., & Vander Veen, D. J. (2006). General Motors Increases Its Production Throughput. Interfaces, 36(1), 6–25.