A review of simulation optimization techniques

We present a review of methods for optimizing stochastic systems using simulation. The focus is on gradient based techniques for optimization with respect to continuous decision parameters and on random search methods for optimization with respect to discrete decision parameters.


INTRODUCTION
Suppose that the performance of a stochastic system of interest depends on the chosen values of the parameters of this system, and that this system is of such complexity that it is necessary to use simulation to estimate the performance of the system for each set of parameter values. We present a review of methods that can be used to determine the values of the system parameters that will yield optimal performance of the system. We consider both the case when the parameters of the system can take a continuous range of values and the case when the parameter values must lie in a discrete set. Our focus is on gradient-based techniques for continuous parameter optimization and on random search methods for discrete parameter optimization. We do not review some other important classes of methods for simulation optimization, including response surface methodology and ranking, selection, and multiple comparison procedures; see for example Kleijnen (1998) and Goldsman and Nelson (1998) for recent reviews of these topics aimed at simulation practitioners. Additional material on simulation optimization can be found in Fu (1994), Andradóttir (1998a), and references therein.
The organization of this paper is as follows: Section 2 contains a review of gradient-based techniques for continuous parameter simulation optimization. Section 3 discusses random search methods and other recent developments for discrete parameter simulation optimization. Finally, some concluding remarks are given in Section 4.

CONTINUOUS PARAMETER SIMULATION OPTIMIZATION
Consider the optimization problem

min_{θ ∈ Θ} f(θ), (1)

where θ is the (possibly vector-valued) decision parameter consisting of the parameters of the stochastic system of interest, the feasible region Θ ⊂ ℝ^d is the set of possible values of the parameter θ, and the objective function values f(θ) represent the expected system performance when the values of the system parameters are given by θ ∈ Θ. We assume that the feasible region Θ is continuous, and that the objective function values f(θ), where θ ∈ Θ, are estimated using simulation. In this section we discuss gradient-based techniques for solving the optimization problem (1). We start by giving an overview of gradient estimation techniques in Section 2.1. Then we review two classes of methods for solving the optimization problem (1), namely stochastic approximation and sample path optimization, in Sections 2.2 and 2.3, respectively.

Gradient Estimation
In this section we discuss how simulation can be used to obtain estimates of the gradient of the expected system performance f (θ) with respect to the (continuous) parameter θ. For additional material on gradient estimation, the reader is referred to L'Ecuyer (1991), Fu (1994), Fu and Hu (1997), and references therein.
The most straightforward gradient estimation approach involves approximating the gradient using finite differences. Let e_i = (0, ..., 0, 1, 0, ..., 0) denote the ith coordinate vector (with one in the ith position and zeros everywhere else) for i = 1, ..., d, and let c be a small positive scalar. Then the value of the gradient g(θ) = ∇f(θ) can be estimated by ĝ(θ) = (ĝ_1(θ), ..., ĝ_d(θ))^T, where, for i = 1, ..., d,

ĝ_i(θ) = ( f̂(θ + c e_i) − f̂(θ) ) / c.

This is the gradient estimator obtained using forward differences. When central differences are used, then, for i = 1, ..., d,

ĝ_i(θ) = ( f̂(θ + c e_i) − f̂(θ − c e_i) ) / (2c).

In both cases, f̂(θ), f̂(θ + c e_i), and f̂(θ − c e_i) denote estimates of f(θ), f(θ + c e_i), and f(θ − c e_i) obtained from simulations at the parameter values θ, θ + c e_i, and θ − c e_i, respectively, for i = 1, ..., d. It is clear that when forward differences are used, it is necessary to conduct simulations at d + 1 sets of parameter values, whereas when central differences are used, it is necessary to conduct simulations at 2d parameter values. Thus, when d > 1, the central differences approach generally involves more computational effort than the forward differences approach. On the other hand, estimators obtained using central differences usually have smaller bias than those obtained using forward differences. A difficulty in implementing the finite differences gradient estimation approach is that in order for the bias to be small, it is necessary to let the scalar c be small, but when c is small, the estimators obtained usually have a large variance. One way of addressing this difficulty involves using common random numbers in the different simulations required to obtain an estimate ĝ(θ) of the value of the gradient g(θ) = ∇f(θ).
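The finite-difference estimators and the common-random-numbers idea can be sketched as follows. The toy objective, noise model, and all names are illustrative assumptions, not taken from the literature cited here; re-seeding the generator identically for every simulation plays the role of common random numbers, so the noise cancels in the differences.

```python
import numpy as np

def estimate_performance(theta, rng):
    """Hypothetical stand-in for an expensive stochastic simulation:
    a noisy estimate of f(theta) = theta[0]**2 + ... + theta[d-1]**2."""
    return float(theta @ theta) + rng.normal(scale=0.1)

def fd_gradient(theta, c=1e-2, central=True, seed=0):
    """Finite-difference gradient estimate. Re-seeding the generator for
    every run implements common random numbers, so the simulation noise
    largely cancels in the difference."""
    d = len(theta)
    g = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = c
        up = estimate_performance(theta + e, np.random.default_rng(seed))
        if central:
            down = estimate_performance(theta - e, np.random.default_rng(seed))
            g[i] = (up - down) / (2 * c)   # central differences: 2d simulations
        else:
            down = estimate_performance(theta, np.random.default_rng(seed))
            g[i] = (up - down) / c         # forward differences: d + 1 simulations
    return g

print(fd_gradient(np.array([1.0, -2.0])))  # close to the true gradient (2, -4)
```

Note that for this quadratic toy objective the forward-difference estimate carries the expected O(c) bias (each component is off by roughly c), while the central-difference estimate is essentially exact, illustrating the bias comparison made above.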
We now give very brief illustrations of two gradient estimation techniques that require only a single simulation run to obtain an estimate of the gradient g(θ) = ∇f(θ) (unlike finite differences). These techniques are perturbation analysis and the likelihood ratio method. We will use the following extremely simple example to illustrate some key ideas that these methods are based on:

f(θ) = E[X(θ)], (2)

where X(θ) is a random variable having the cumulative distribution function F_θ. We provide references for additional reading about these techniques, including how to estimate gradients of performance measures that depend on a sample path of a stochastic process, rather than only on a single random variable as in the example given in equation (2). We will not discuss frequency domain experimentation in this paper; this gradient estimation approach involves oscillating the value of the parameter θ during a single simulation run. For more details on this method, see for example Jacobson (1994) and references therein.

Perturbation analysis is a class of related gradient estimation approaches. The best known variant is infinitesimal perturbation analysis (IPA). To illustrate, consider the example given in equation (2). Suppose that observations of the random variable X(θ) are generated using the inverse transform technique; i.e., X(θ) = F_θ^{−1}(U) for all θ ∈ Θ, where U is a uniform random variable on the interval [0, 1] and F_θ^{−1}(u) = inf{x : F_θ(x) ≥ u} for all θ ∈ Θ and 0 ≤ u ≤ 1. Then we have that

g(θ) = ∇E[F_θ^{−1}(U)] = E[∇F_θ^{−1}(U)],

assuming that F_θ^{−1} is differentiable with respect to θ and that the interchange of the gradient and integration is valid (throughout this section, all gradients are taken with respect to the parameter θ). Therefore, we can generate independent and uniformly distributed random variables U_1, ..., U_N on the interval [0, 1] and use (1/N) Σ_{i=1}^{N} ∇F_θ^{−1}(U_i) to estimate the gradient g(θ).
In general, when the performance measure of interest depends on a sample path of a stochastic process (and not just on a single random variable as in equation (2)), then IPA involves considering how small perturbations in the underlying random variables affect the sample path generated using these random variables.
When IPA is valid (because the interchange of the gradient and integration is valid), it usually yields good results (i.e., unbiased estimates of the gradient whose variance is reasonably small). However, it is unfortunately not difficult to develop examples where IPA is not valid. Several other variants of perturbation analysis have been developed to address this problem. For more material on IPA and other perturbation analysis techniques, the reader is referred to the references given at the beginning of this section, and to Glasserman (1991), Ho and Cao (1991), and references therein.
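As a minimal numerical sketch of the IPA idea for the example in equation (2), assume (purely for illustration) that X(θ) is exponential with mean θ, so that F_θ^{−1}(u) = −θ log(1 − u). Differentiating the inverse transform with respect to θ gives ∇F_θ^{−1}(U) = −log(1 − U), and the true gradient is d/dθ E[X(θ)] = 1 for every θ:

```python
import numpy as np

# IPA sketch for f(theta) = E[X(theta)], assuming X(theta) is exponential
# with mean theta, generated by inverse transform:
#   X(theta) = F_theta^{-1}(U) = -theta * log(1 - U),  U ~ Uniform(0, 1).
# Differentiating the sample path in theta gives dX/dtheta = -log(1 - U),
# so the IPA estimator averages -log(1 - U_i); the true gradient is 1.

rng = np.random.default_rng(42)
N = 200_000
U = rng.uniform(size=N)
ipa_estimate = np.mean(-np.log(1.0 - U))
print(ipa_estimate)  # close to 1.0, the true derivative of E[X(theta)]
```

Here a single batch of uniforms yields the gradient estimate, with no simulations at perturbed parameter values, which is exactly the single-run advantage over finite differences noted above.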
Suppose now that the random variable X(θ) has a density function f_θ. The likelihood ratio method (also called the score function method) involves expressing the gradient g(θ) as an expectation that can be estimated via simulation. Assuming that we can interchange the order of the gradient and integration, we have that

g(θ) = ∇ ∫ x f_θ(x) dx = ∫ x ∇f_θ(x) dx.

Multiplying and dividing by f_θ(x) in the above integral yields

g(θ) = ∫ x (∇f_θ(x) / f_θ(x)) f_θ(x) dx = E[ X(θ) ∇f_θ(X(θ)) / f_θ(X(θ)) ].
(We adopt the convention that 0/0 = 1. It is easy to show that division by zero will not occur in the above equation; i.e., if f_θ(x) = 0 then ∇f_θ(x) = 0.) Therefore, we can generate independent observations X_1(θ), ..., X_N(θ) from the distribution F_θ and use (1/N) Σ_{i=1}^{N} X_i(θ) ∇f_θ(X_i(θ)) / f_θ(X_i(θ)) to estimate the gradient g(θ). In general, when the performance measure of interest depends on a number of random variables (e.g., it involves a sample path of a stochastic process), then the likelihood ratio method involves expressing this performance measure as an integral involving the product of the densities of the underlying random variables. For more material on the likelihood ratio method, see for example Glynn (1990), Rubinstein and Shapiro (1993), Andradóttir (1996b), and references therein.
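A parallel numerical sketch of the likelihood ratio estimator, again assuming (for illustration only) that X(θ) is exponential with mean θ, so f_θ(x) = (1/θ) exp(−x/θ) and ∇f_θ(x)/f_θ(x) = x/θ² − 1/θ; the true gradient of E[X(θ)] = θ is again 1:

```python
import numpy as np

# Likelihood ratio (score function) sketch for f(theta) = E[X(theta)],
# assuming X(theta) ~ Exponential(mean theta) with density
#   f_theta(x) = (1/theta) * exp(-x/theta).
# The score is grad f_theta(x) / f_theta(x) = x/theta**2 - 1/theta, so g(theta)
# is estimated by averaging X_i * (X_i/theta**2 - 1/theta); the true value is 1.

theta = 2.0
rng = np.random.default_rng(0)
N = 500_000
X = rng.exponential(scale=theta, size=N)
score = X / theta**2 - 1.0 / theta
lr_estimate = np.mean(X * score)
print(lr_estimate)  # close to 1.0, but noisier than the IPA estimator
```

For this example the likelihood ratio estimator has noticeably larger variance than the IPA estimator of the same quantity, consistent with the comparison of the two techniques made in the next paragraph.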
Note that both perturbation analysis and the likelihood ratio method involve exploiting the structure of the performance measure of interest. Therefore, these methods are not as easy to apply as the finite differences approach. However, perturbation analysis and the likelihood ratio method require only a single simulation run and often produce estimators having desirable statistical features such as unbiasedness and strong consistency (unlike the finite differences approach). The likelihood ratio method applies in more generality than IPA (other variants of perturbation analysis are frequently needed in order to obtain gradient estimates with desirable statistical properties) but often yields estimates having a larger variance than the estimates obtained using IPA (when both techniques are applicable).

Stochastic Approximation
Stochastic approximation refers to a class of methods that can be used to solve continuous parameter simulation optimization problems of the form given in equation (1). The best known stochastic approximation algorithm is the Robbins-Monro algorithm (see Robbins and Monro, 1951). When this algorithm is applied to solve optimization problems of the form given in equation (1) with a closed and convex feasible region Θ, it generates a sequence {θ_n} of estimates of the optimal solution, where

θ_{n+1} = π_Θ( θ_n − a_n ĝ(θ_n) ) (3)

for all n ≥ 1. Here {a_n} is a sequence of positive real numbers such that Σ_{n=1}^{∞} a_n = ∞ and Σ_{n=1}^{∞} a_n² < ∞, ĝ(θ_n) is an estimate of g(θ_n) = ∇f(θ_n) for all n, and the function π_Θ projects each element of ℝ^d onto the nearest point in Θ. When finite differences are used to obtain the gradient estimates in equation (3), the resulting procedure is called the Kiefer-Wolfowitz algorithm (see Kiefer and Wolfowitz, 1952). The Robbins-Monro algorithm generally will only converge to a local optimal solution of the optimization problem (1); see Gelfand and Mitter (1991) for a discussion of the use of stochastic approximation to solve global optimization problems. The sequence {a_n} is usually chosen to be of the form a_n = a/n for all n, where a is a positive scalar. With this choice of the sequence {a_n}, the Robbins-Monro algorithm will, under certain conditions, have the fastest possible asymptotic convergence rate; i.e., it will converge to the optimal solution at the rate n^{−1/2}. Nevertheless, even with this choice of the sequence {a_n}, the behavior of the Robbins-Monro algorithm depends on how well the scalar a is selected.
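The recursion (3) can be sketched on a toy problem. Everything below is illustrative: the objective, noise model, feasible interval, and step-size constant are assumptions, with a direct noisy-gradient oracle standing in for a simulation-based gradient estimator.

```python
import numpy as np

# Robbins-Monro sketch: minimize f(theta) = (theta - 3)**2 over Theta = [0, 10]
# using noisy gradient observations and step sizes a_n = a / n.

def noisy_gradient(theta, rng):
    """Stand-in for a simulation-based estimate of grad f(theta) = 2(theta - 3)."""
    return 2.0 * (theta - 3.0) + rng.normal(scale=1.0)

def project(theta, lo=0.0, hi=10.0):
    """pi_Theta: projection onto the feasible interval [0, 10]."""
    return min(max(theta, lo), hi)

rng = np.random.default_rng(1)
theta, a = 8.0, 1.0
for n in range(1, 5001):
    theta = project(theta - (a / n) * noisy_gradient(theta, rng))
print(theta)  # close to the minimizer theta* = 3
```

Rerunning this sketch with a poorly chosen constant (say a = 0.01 or a = 100) illustrates the sensitivity to the scalar a noted above: the iterates either crawl toward 3 or bounce around for a long time.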
We now briefly review some other stochastic approximation algorithms that have been developed to avoid some of the shortcomings of the Robbins-Monro algorithm. For more discussion of stochastic approximation, including a discussion of how stochastic approximation algorithms can be used for solving constrained optimization problems with noisy constraints, the reader is referred to Kushner and Clark (1978), Benveniste, Métivier, and Priouret (1990), Ruppert (1991), Ljung, Pflug, and Walk (1992), Kushner and Yin (1997), and references therein. For a discussion of the convergence of stochastic approximation algorithms when applied to solve simulation optimization problems of the form given in equation (1), see Glynn (1986), Fu (1990), Chong and Ramadge (1992, 1993), L'Ecuyer and Glynn (1994), and Andradóttir (1996b).
One problem with the Robbins-Monro algorithm is that when it is applied to solve optimization problems of the form given in equation (1) with an unbounded feasible set Θ (e.g., when Θ = ℝ^d), the convergence of the algorithm is not guaranteed when the objective function f grows faster than quadratically in the decision parameter θ. Andradóttir (1996a) has proposed an alternative approach that addresses this problem using scaling. Moreover, Chen and Zhu (1986), Yin and Zhu (1989), and Andradóttir (1995a) have proposed addressing this problem using projections onto an increasing sequence of sets.
From the discussion in Section 2.1, it is clear that when the Kiefer-Wolfowitz algorithm is applied to solve the optimization problem (1), it is necessary to conduct simulations at a minimum of d + 1 different parameter values in each iteration of the algorithm, where d is the dimension of the underlying optimization problem. This obviously means that when d is large, the Kiefer-Wolfowitz algorithm requires substantial computational effort per iteration, which can lead to slow convergence. To address this problem, Spall (1992) has proposed the use of simultaneous perturbations (requiring only two simulations per iteration, regardless of the dimension d) to estimate the gradient in equation (3). Similar ideas have been proposed by Kushner and Clark (1978) and Ermoliev (1983).
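The simultaneous-perturbation idea can be sketched as follows, in the spirit of Spall (1992): all d components of the gradient are estimated from only two simulations by perturbing every coordinate at once with independent random ±1 signs. The toy objective and all numerical choices are illustrative assumptions; here we only demonstrate that the estimator is (approximately) unbiased by averaging many of them at a fixed θ.

```python
import numpy as np

def noisy_f(theta, rng):
    """Toy stand-in for a simulation: f(theta) = |theta|^2 plus noise."""
    return float(theta @ theta) + rng.normal(scale=0.1)

def sp_gradient(theta, c, rng):
    """Simultaneous-perturbation gradient estimate from TWO simulations,
    regardless of the dimension of theta."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    f_plus = noisy_f(theta + c * delta, rng)
    f_minus = noisy_f(theta - c * delta, rng)
    return (f_plus - f_minus) / (2 * c * delta)        # componentwise division

rng = np.random.default_rng(7)
theta = np.array([1.0, -2.0, 0.5])
estimates = np.array([sp_gradient(theta, 0.1, rng) for _ in range(20_000)])
print(estimates.mean(axis=0))  # approaches the true gradient (2, -4, 1)
```

Each individual estimate is noisy across components, but its expectation matches the true gradient, which is what makes it usable inside the recursion (3) at a fixed cost of two simulations per iteration.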
A considerable amount of research has explored the choice of the sequence {a_n} in equation (3). For sequences of the form a_n = a/n for all n, where a is a positive scalar, several researchers have developed adaptive procedures where the value of the scalar a is updated as the number of iterations grows. For more details, see Venter (1967), Nevel'son and Has'minskii (1973), Lai and Robbins (1979), Ruppert (1985), and Wei (1987). For situations where the sequence {a_n} approaches zero too rapidly, the empirical convergence rate can sometimes be improved by only decreasing the value of this sequence when there is reason to believe that the current estimate of the solution is near the optimal solution. This idea has been studied by Kesten (1958) and Delyon and Juditsky (1993). Another approach is to use ideas from deterministic optimization to select the sequence {a_n}; see Wardi (1990), Yan and Mukai (1993), and Shapiro and Wardi (1996a). Finally, the use of a sequence {a_n} that decreases at a slower rate than 1/n, together with the use of averages of the sequence {θ_n} to estimate the optimal solution, has received much attention in the past few years. See Polyak (1990), Yin (1991), Polyak and Juditsky (1992), and Kushner and Yang (1993) for more discussion of this approach.
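The last idea (slowly decreasing steps plus iterate averaging) can be sketched as follows. The step-size exponent, constant, and toy problem are all illustrative assumptions; the point is that the running average of the iterates is used as the estimate of the solution, not the final iterate itself.

```python
import numpy as np

# Iterate-averaging sketch: slowly decreasing steps a_n = a / n**0.6 (slower
# than 1/n), with the solution estimated by the average of the iterates.
# Toy problem: minimize f(theta) = (theta - 3)**2 with noisy gradients.

rng = np.random.default_rng(2)
theta = 0.0
iterates = []
for n in range(1, 20_001):
    grad = 2.0 * (theta - 3.0) + rng.normal(scale=1.0)  # noisy gradient oracle
    theta -= (0.5 / n ** 0.6) * grad
    iterates.append(theta)

theta_bar = np.mean(iterates)  # averaged estimate of the optimal solution
print(theta_bar)               # close to the minimizer theta* = 3
```

The individual iterates wander because the steps shrink slowly, but their average settles down, which is the intuition behind the averaging results cited above.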

Sample Path Optimization
In this section, we discuss methods for continuous parameter simulation optimization that involve approximating the original simulation optimization problem (1) with a deterministic optimization problem. To illustrate, suppose that the objective function in equation (1) is of the form given in equation (2). Let N be a positive integer and assume that there exist independent random variables Y_1, ..., Y_N and a function h : Θ × ℝ → ℝ such that X_i(θ) = h(θ, Y_i) has the cumulative distribution function F_θ for i = 1, ..., N and all θ ∈ Θ. Then we can approximate the objective function with

f̂_N(θ) = (1/N) Σ_{i=1}^{N} h(θ, Y_i).

Once the random variables Y_1, ..., Y_N have been generated, the approximate objective function f̂_N(θ) is a deterministic function of the parameter θ. Therefore, we can approximate the original simulation optimization problem (1) with the deterministic optimization problem

min_{θ ∈ Θ} f̂_N(θ). (4)

Now a standard mathematical programming algorithm can be applied to solve this approximate deterministic optimization problem.
It remains to demonstrate how one can obtain random variables Y 1 , . . . , Y N and a function h : Θ × IR → IR for which X i (θ) = h(θ, Y i ) has the cumulative distribution function F θ for i = 1, . . . , N and all θ ∈ Θ. We briefly illustrate two approaches for accomplishing this for the above example, using IPA and likelihood ratios, respectively.
First, as in Section 2.1, assume that observations of the random variables X_i(θ) can be generated using the inverse transform technique. Then let Y_1, ..., Y_N be independent and uniformly distributed on the interval [0, 1] and let h(θ, y) = F_θ^{−1}(y) for all y ∈ [0, 1] and θ ∈ Θ. Note that the gradient of the function h with respect to θ coincides with the IPA gradient estimates discussed in Section 2.1.

Now assume that for all θ ∈ Θ, the random variable X(θ) has density function f_θ and that there exists θ_0 ∈ Θ such that f_{θ_0}(x) = 0 implies f_θ(x) = 0 for all x and θ ∈ Θ. Then we have that

f(θ) = ∫ x ( f_θ(x) / f_{θ_0}(x) ) f_{θ_0}(x) dx

for all θ ∈ Θ (recall our convention that 0/0 = 1). The term f_θ(x)/f_{θ_0}(x) is called the likelihood ratio. The second approach involves letting Y_1, ..., Y_N be independent observations drawn from the distribution F_{θ_0} and letting h(θ, y) = y f_θ(y)/f_{θ_0}(y) for all y and θ ∈ Θ. Note that the gradient of the function h with respect to θ evaluated at θ = θ_0 coincides with the likelihood ratio gradient estimates of g(θ_0) discussed in Section 2.1.

The simulation optimization approach described in this section is called sample path optimization because it uses simulation only to generate one sample path Y_1, ..., Y_N, and it yields an estimated optimal solution that depends on the sample path that the approximate deterministic optimization problem (4) is based on. Generally, the integer N needs to be large in order for the approximating optimization problem (4) to be close to the original optimization problem (1). Several researchers have studied simulation optimization approaches of this form. Rubinstein and Shapiro (1993) have analyzed this approach using likelihood ratios to obtain the approximate optimization problem; their approach is called the stochastic counterpart method. Plambeck et al. (1996) have used this method with IPA gradient estimates.
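A concrete sketch of the inverse-transform construction: fix one sample path of uniforms, build the deterministic approximation, and minimize it. To make the optimum interior we use the illustrative objective f(θ) = E[(X(θ) − t)²] with X(θ) exponential with mean θ (an assumption for this sketch, slightly richer than equation (2)); its true minimizer is θ* = t/2.

```python
import numpy as np

# Sample path optimization sketch. With X(theta) ~ Exponential(mean theta)
# generated by inverse transform, h(theta, y) = -theta * log(1 - y), the
# approximate objective for the (illustrative) problem
#   minimize f(theta) = E[(X(theta) - t)^2]
# becomes a DETERMINISTIC quadratic in theta once the uniforms are fixed.

rng = np.random.default_rng(3)
N, t = 100_000, 4.0
Y = rng.uniform(size=N)      # one fixed sample path Y_1, ..., Y_N
a = -np.log(1.0 - Y)         # so that h(theta, Y_i) = theta * a_i

def f_hat(theta):
    """Deterministic approximation f_hat_N(theta) for the fixed sample path."""
    return float(np.mean((theta * a - t) ** 2))

# Any standard mathematical programming algorithm applies; since f_hat is a
# quadratic in theta, we simply use its closed-form minimizer here.
theta_star_N = t * np.sum(a) / np.sum(a ** 2)
print(theta_star_N)  # close to the true minimizer t/2 = 2.0 for large N
```

The key point the sketch illustrates: after the uniforms are drawn once, no further simulation is needed, and the entire optimization is deterministic; the quality of θ*_N depends on the sample size N, as noted above.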
Healy and Schruben (1991) and Healy and Xu (1994) have also studied this method; they call it retrospective optimization. Additional research on the convergence of the sample path optimization method can be found in Robinson (1996) and Shapiro and Wardi (1996b).

DISCRETE PARAMETER SIMULATION OPTIMIZATION
In this section, we provide a brief review of random search methods for solving the optimization problem (1) when the feasible region Θ is discrete. We will also briefly describe some other recent advances in the field of discrete parameter simulation optimization. Related approaches that will not be discussed here include ranking, selection, and multiple comparison methods; methods for solving the multi-armed bandit problem; and learning automata procedures.

Random Search
The random search methods discussed in this section all involve moving successively between neighboring feasible points in search of the optimal solution. For all θ ∈ Θ, let N(θ) ⊂ Θ \ {θ} denote the set of all the neighbors of θ. The neighborhood structure {N(θ) : θ ∈ Θ} must be connected, in the sense that for all θ, θ′ ∈ Θ with θ ≠ θ′, there exist an integer l and points θ_0, ..., θ_l ∈ Θ such that θ_0 = θ, θ_l = θ′, and θ_{i+1} ∈ N(θ_i) for i = 0, ..., l − 1 (otherwise, a random search method may not converge for all starting points).

Andradóttir (1995b, 1996c) has developed two random search methods for discrete parameter simulation optimization. In each iteration of these methods, the values of the objective function at two neighboring feasible points are estimated via simulation, and the alternative that yields the better estimate is passed on to the next iteration. Both algorithms use the feasible alternative that has been visited most often in this process to estimate the optimal solution. The two methods differ primarily in the choice of the neighborhood structure used. One of the methods is locally convergent, while the other one is globally convergent.

Yan and Mukai (1992) have proposed a random search method called the stochastic ruler algorithm. This method compares observations of the objective function values with observations of a uniform random variable, called the "stochastic ruler," whose range covers the range of the observed objective function values. The number of such comparisons grows with the number of iterations. Yan and Mukai use the current element of the sequence generated by their algorithm to estimate the optimal solution. Alrefaei and Andradóttir (1998a) have developed a variant of the stochastic ruler algorithm that appears to perform better in practice than the original version of Yan and Mukai (1992). This variant requires less computational effort per iteration than the original method.
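A minimal sketch in the spirit of the two-point comparison methods described above (the feasible set, neighborhood structure, objective, noise level, and iteration budget are all illustrative assumptions; the cited algorithms differ in important details): each iteration simulates the current point and a random neighbor, moves to whichever looked better, and estimates the optimum by the most-visited point.

```python
import random

FEASIBLE = list(range(10))                 # discrete Theta = {0, 1, ..., 9}

def neighbors(theta):
    """Connected neighborhood structure: adjacent integers."""
    return [t for t in (theta - 1, theta + 1) if t in FEASIBLE]

def simulate(theta, rng):
    """Noisy estimate of f(theta) = (theta - 6)**2; the minimizer is 6."""
    return (theta - 6) ** 2 + rng.gauss(0, 1)

rng = random.Random(0)
visits = {t: 0 for t in FEASIBLE}
theta = 0
for _ in range(20_000):
    candidate = rng.choice(neighbors(theta))
    # Two-point comparison: keep whichever point has the better estimate.
    if simulate(candidate, rng) < simulate(theta, rng):
        theta = candidate
    visits[theta] += 1

estimate = max(visits, key=visits.get)     # most-visited point
print(estimate)  # expected to be 6, the true minimizer
```

Note that the current point θ is a noisy object that keeps moving even late in the run; the visit counts are what settle down, which is why the most-visited point (rather than the final point) serves as the estimate of the optimal solution.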
It uses the number of visits to the different states to estimate the optimal solution, similar to the approaches of Andradóttir (1995b, 1996c). Gong, Ho, and Zhai (1992) have analyzed a method for discrete simulation optimization called the stochastic comparison method. Their method combines ideas from Andradóttir (1996c) and Yan and Mukai (1992). It resembles the method of Yan and Mukai (1992) in the choice of approach for estimating the optimal solution and in that it uses a growing amount of computational effort per iteration as the number of iterations grows. It resembles the method of Andradóttir (1996c) in the choice of neighborhood structure and in that it compares estimated objective function values at neighboring points, rather than comparing the estimated objective function values with a stochastic ruler. Andradóttir (1998b) presents a variant of the stochastic comparison method that uses a different approach for estimating the optimal solution (discussed at the end of this subsection) and in which the computational effort per iteration does not grow with the number of iterations.
The use of simulated annealing to solve discrete simulation optimization problems has received a significant amount of attention in recent years. Bulgak and Sanders (1988) and Haddock and Mittenthal (1992) proposed heuristic simulated annealing approaches for discrete simulation optimization. Other versions of the simulated annealing approach, adapted to solve discrete simulation optimization problems, that are built on a rigorous foundation can be found in Gelfand and Mitter (1989), Gutjahr and Pflug (1996), Fox and Heine (1996), and Alrefaei and Andradóttir (1998c).
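A heuristic flavor of simulated annealing applied to a noisy objective can be sketched as follows; everything here (the objective, noise level, cooling schedule, and budget) is an illustrative assumption, and the rigorously analyzed algorithms cited above use carefully designed schedules and estimators of the optimal solution.

```python
import math
import random

def simulate(theta, rng):
    """Noisy estimate of f(theta) = (theta - 6)**2; the minimizer is 6."""
    return (theta - 6) ** 2 + rng.gauss(0, 1)

rng = random.Random(1)
theta = 0
for n in range(1, 10_001):
    T = 5.0 / math.log(n + 1)                      # logarithmic cooling schedule
    candidate = max(0, min(9, theta + rng.choice((-1, 1))))
    # Compare noisy estimates; accept apparent uphill moves with a
    # temperature-dependent probability.
    delta = simulate(candidate, rng) - simulate(theta, rng)
    if delta < 0 or rng.random() < math.exp(-delta / T):
        theta = candidate
print(theta)  # typically at or near the minimizer 6 by the end of the run
```

Early on the high temperature lets the chain escape poor regions; as T shrinks, apparent uphill moves are accepted less often and the chain concentrates near the minimizer, which is the intuition the rigorous analyses make precise for noisy objective estimates.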
Finally, Andradóttir (1998b) has proposed an approach for estimating the optimal solution that involves averaging all the estimated objective function values obtained so far by a random search method at the various feasible points, and then using the point with the best average as the estimated optimal solution. Numerical evidence presented by Andradóttir (1998b) and by Alrefaei and Andradóttir (1998b, 1998c) suggests that this approach for estimating the optimal solution yields improved performance relative to other approaches.

Norkin, Ermoliev, and Ruszczynski (1994) have proposed a version of the branch-and-bound method (originally developed for discrete deterministic optimization) that is designed for solving discrete simulation optimization problems. Their approach involves partitioning the feasible region Θ into subsets and estimating bounds on the objective function values within these subsets. Based on the values of these bounds, one promising subset is divided into smaller subsets and other, unpromising subsets are removed from consideration.

Pflug (1994) and Futschik and Pflug (1995) have discussed the use of confidence sets (having the feature that each global solution to the underlying discrete optimization problem lies in the confidence set with a given level of confidence) to solve discrete simulation optimization problems. They discuss how valid confidence sets can be obtained and how to conduct the simulation in such a way that the resulting confidence set is as small as possible.

Other Recent Developments
Ho, Sreenivas, and Vakili (1992) suggest that in situations where the feasible set Θ is large, one could quickly conduct simulations at the various feasible points to obtain a rough ranking of these points. Then, one could discard all except a few top points in this rough ranking, and use a discrete simulation optimization technique to locate the best point among the points that were not discarded. Ho, Sreenivas, and Vakili (1992) show that the probability that the set of points that were not discarded contains at least one near-optimal solution to the underlying optimization problem is often surprisingly large.
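This screening idea can be sketched as follows; the feasible set size, objective, noise level, and simulation budgets are illustrative assumptions chosen only to show the cheap-ranking-then-refine pattern.

```python
import numpy as np

rng = np.random.default_rng(4)
points = np.arange(100)                  # a large discrete feasible set
f = (points - 37) ** 2 / 100.0           # true objective, minimum at 37 (unknown)

def simulate(theta, reps):
    """Average of `reps` noisy observations of f(theta)."""
    return f[theta] + rng.normal(scale=3.0, size=reps).mean()

# Step 1: very short simulations give a rough (ordinal) ranking of all points.
rough = np.array([simulate(t, 2) for t in points])

# Step 2: keep only the apparent top 10 and discard the rest.
survivors = points[np.argsort(rough)[:10]]

# Step 3: spend the real simulation budget on the survivors only.
careful = {t: simulate(t, 500) for t in survivors}
best = min(careful, key=careful.get)
print(best)  # with high probability a near-optimal point survives the screening
```

Even though the two-replication screening estimates are individually very noisy, they are usually good enough to keep at least one near-optimal point among the survivors, which is the phenomenon Ho, Sreenivas, and Vakili (1992) quantify.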

CONCLUSION
We have provided an introduction to simulation optimization, with emphasis on gradient-based techniques for continuous parameter simulation optimization and on random search methods for discrete parameter simulation optimization. Although simulation optimization has received a fair amount of attention from the research community in recent years, the current methods generally require a considerable amount of technical sophistication on the part of the user, and they often require a substantial amount of computer time as well. Therefore, additional research aimed at increasing the efficiency and ease of application of simulation optimization techniques would be valuable.

ACKNOWLEDGMENT
This work was partially supported by the National Science Foundation under Grant DMI-9523111.