An Optimal Approximate Dynamic Programming Algorithm for the Energy Dispatch Problem with Grid-Level Storage

Juliana M. Nascimento
Warren B. Powell

Department of Operations Research and Financial Engineering
Princeton University

September 7, 2009

Abstract

We consider the problem of modeling hourly dispatch and energy allocation decisions in the presence of grid-level storage. The model makes it possible to capture hourly, daily and seasonal variations in wind, solar and demand, while also modeling the presence of hydroelectric storage to smooth energy production over both daily and seasonal cycles. The problem is formulated as a stochastic, dynamic program, where we approximate the value of stored energy. We provide a rigorous convergence proof for an approximate dynamic programming algorithm, which can capture the presence of both the amount of energy held in storage as well as other exogenous variables. Our algorithm exploits the natural concavity of the problem to avoid any need for explicit exploration policies.

1 Introduction

There is a long history of modeling energy dispatch using deterministic optimization models to capture the economic behavior of energy allocation and electric power dispatch. Linear programming has been used for energy policy modeling in models such as NEMS (NEMS (2003)), MARKAL (Fishbone & Abilock (1981)), MESSAGE III (Messner & Strubegger (1995)) and Meta*Net (Lamont (1997)). These models typically work in very aggregate time steps such as a year or more, which require that average values be used for intermittent energy sources such as wind and solar. However, it is well known (Lamont (2008)) that intermittent energy sources such as wind and solar look far more attractive when we use average production over periods as short as a few hours. Averaging reduces the peaks, where energy is often lost because we are unable to use brief bursts, and raises the valleys, which determine the amount of spinning reserve that must be maintained to supply demand in the event that these sources drop to their minimum. To accurately model the economics of intermittent energy, it is necessary to model energy supply and demand in time increments no greater than an hour. Some authors recognize the importance of capturing hourly wind variations, but still use deterministic linear programming models which are allowed to make storage and allocation decisions knowing the entire future trajectory (Castronuovo & Lopes (2004)). Garcia-Gonzalez et al. (2008) and Brown et al. (2008) use the idea of stochastic wind scenarios, but allow decisions to "see" the wind in the future within a particular scenario (violating what are known as nonanticipativity conditions). These models will also overstate the value of energy from intermittent sources by being able to make adjustments to other sources of energy in anticipation of future fluctuations in wind and solar. For this reason, it is important to capture not only the variability of wind and solar, but also the uncertainty. Furthermore, uncertainty arises from a number of sources such as demand, prices and rainfall, among others.

We address the problem of allocating energy resources over both the electric power grid (generally referred to as the electric power dispatch problem) as well as other forms of energy allocation (conversion of biomass to liquid fuels, conversion of petroleum to gasoline or electricity, and the use of natural gas in home heating or electric power generation). Determining


how much energy can be converted from each source to serve each type of demand can be modeled as a series of single-period linear programs linked by a single, scalar storage variable, as would be the case with grid-level storage. Let Rt be the scalar quantity of stored energy on hand, and let Wt be a vector-valued stochastic process describing exogenously varying parameters such as available energy from wind and solar, demand of different types (with hourly, daily and weekly patterns), energy prices and rainfall. We assume that Wt is Markovian to be able to model processes such as wind, prices and demand where the future (say, the price pt+1 at time t + 1) depends on the current price pt plus an exogenous change p̂t+1, allowing us to write pt+1 = pt + p̂t+1. If wt is the wind at time t, pt is a price (or vector of prices), and Dt is the demand, we would let Wt = (wt, pt, Dt), where the future of each stochastic process depends on the current value. The state of our system, then, is given by St = (Rt, Wt). In our problem, xt is a vector-valued control giving the allocation of different sources of energy (oil, coal, wind, solar, nuclear, etc.) over different energy pathways (conversion to electricity, transmission over the electric power grid, conversion of biomass to liquid fuel) and to satisfy different types of demands. xt also includes the amount of energy stored in an aggregate, grid-level storage device such as a reservoir for hydroelectric power. An optimal policy is described by Bellman's equation

    V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \big( C(S_t, x_t) + \gamma \, \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t \} \big),    (1)

where St+1 = f(St, xt, Wt+1) and C(St, xt) is a contribution function that is linear in xt. We use the convention that any variable indexed by t is Ft-measurable, where Ft is the sigma-algebra generated by W1, W2, . . . , Wt. We are going to assume that Wt is a set of discrete random variables, which includes exogenous energy supply (such as wind and solar), demand, prices, rainfall and any other uncertain parameters. When the supplies and demands are discrete (in particular, when each can be written as a scaling factor times an integer), it is possible to show that Vt(St) = Vt(Rt, Wt) can be written as a piecewise linear function in Rt. We treat xt as a continuous variable,

but we will show that the optimal policy involves solving a linear program which returns an optimal xt that is also defined on a discrete set of outcomes (the extreme point solutions of the linear program). In this case, we can ensure that the solution xt also takes on discrete values, but we solve the optimization problem as a continuous problem using a linear programming solver. Thus, we can view xt and Rt as continuous, while ensuring that they always take on discrete values. Although St may have only five or ten dimensions, this is often enough to render (1) computationally intractable. There is an extensive literature in approximate dynamic programming which assumes either discrete action spaces, or low dimensional, continuous controls. For our problem, xt can easily have hundreds or thousands of dimensions, depending on the level of aggregation. In addition, we are unable to compute the expectation analytically, introducing the additional complication of maximizing an expectation that is itself computationally intractable. The problem of vector-valued state, outcome and action spaces is known as the three curses of dimensionality (Powell (2007)). We propose a Monte Carlo-based algorithm based on the principles of approximate dynamic programming. We assume that C(St, xt) is linear in xt given St. Now define Ft(Rt) by

    F_t(R_t) = \max_{x_t \in \mathcal{X}_t} C(S_t, x_t)

subject to a flow conservation constraint involving the stored energy Rt. It is well known (see, e.g., Vanderbei (1997)) that Ft(Rt) is piecewise linear concave in Rt. Backward induction can then be used to show that the value function Vt(St) = Vt(Rt, Wt) is piecewise linear and concave in the resource dimension Rt. We exploit this concavity to show almost sure convergence of an ADP algorithm that uses pure exploitation. This is important for higher dimensional problems where exploration steps, which may be useful for proving convergence, can be extremely slow in practice (imagine inserting an exploration step when there are millions of states). This algorithmic strategy has proven to be very successful for industrial applications (see, for example, Godfrey & Powell (2002), Topaloglu & Powell (2006)) and most recently in an energy model that is the motivating application for this paper (Powell et

al. (2009)), but none of these prior papers provided any formal convergence results.

Our work is motivated by a particular energy application, but the problem class is more general. For example, the theory would apply to problems in logistics involving allocation of a single commodity from multiple suppliers to multiple consumers, with a single inventory linking time periods. We were recently introduced to such an application in Europe involving the distribution of cement provided by multiple suppliers to numerous construction projects. There are many other applications involving the distribution of a single product. This technique has also been applied in transportation to problems with multiple resource classes (Topaloglu & Powell (2006)), but without any formal convergence results.

Our algorithmic strategy is limited to problems with linear contribution functions subject to constraints which can be discretized. For this problem class, Ft(Rt) can be described as a piecewise linear, concave function. We exploit this property to show that we will visit, infinitely often, the slopes near the optimal solution, allowing us to prove convergence in our estimates of these slopes. Concavity allows us to avoid visiting other points infinitely often. Fortunately, there is an extremely wide class of problems that fit these conditions.

Our proof technique combines the convergence proof for a monotone mapping in Bertsekas & Tsitsiklis (1996) with the proof of the SPAR algorithm in Powell et al. (2004). Current proofs of convergence for approximate dynamic programming algorithms such as Q-learning (Tsitsiklis (1994), Jaakkola et al. (1994)) and optimistic policy iteration (Tsitsiklis (2002)) require that we visit states (and possibly actions) infinitely often. A convergence proof for a Real Time Dynamic Programming (Barto et al. (1995)) algorithm that considers a pure exploitation scheme is provided in Bertsekas & Tsitsiklis (1996) [Prop. 5.3 and 5.4], but it assumes that expected values can be computed and the initial approximations are optimistic, which produces a type of forced exploration. We make no such assumptions, but it is important to emphasize that our result depends on the concavity of the optimal value functions. Others have taken advantage of concavity (convexity for minimization) in the stochastic control literature. Hernandez-Lerma & Runggladier (1994) presents a number of theoretical results for general stochastic control problems for the case with i.i.d. noise (an assumption we do not make), but does not present a specific algorithm that could be applied to our problem

class. There is an extensive literature on stochastic control which assumes linear, additive noise (see, for example, Bertsekas (2005) and the references cited there). For our problem, noise arises in costs and, in particular, the constraints. Other approaches to deal with our problem class would be different flavors of Benders decomposition (Van Slyke & Wets (1969), Higle & Sen (1991), Chen & Powell (1999)) and sample average approximation (Shapiro (2003))(SAA). However, techniques based on stochastic programming such as Benders require that we create scenario trees to capture the trajectory of information. We are interested in modeling our problem over a full year to capture seasonal variations, producing a problem with 8,760 time periods. On the other hand, SAA relies on generating random samples outside of the optimization problems and then solving the corresponding deterministic problems using an appropriate optimization algorithm. Numerical experiments with the SAA approach applied to problems where an integer solution is required can be found in Ahmed & Shapiro (2002). We begin in section 2 with a brief summary of the energy policy model SMART. Section 3 gives the optimality equations and describes some properties of the value function. Section 4 describes the algorithm in detail. Section 5 gives the convergence proof, which is the heart of the paper. Section 6 describes the use of this algorithm in SMART, and shows that the approximate dynamic programming algorithm almost perfectly matches the solution provided by an optimal linear programming model for a deterministic problem. Section 7 concludes the paper.

2 An energy modeling application

Our problem is based on the SMART model (Powell et al. (2009)), a stochastic, multiscale model of multiple energy supplies and demands, together with an energy conversion network that captures both electric power transmission as well as activities such as biomass conversion. This model includes hydroelectric storage which is able to store energy (in the form of water) over long periods of time. This makes it possible to study both the storage of wind energy (which can be used to generate electricity to pump water back into the


reservoir) along with seasonal rainfall. In this section, we provide only a sketch of this model, since we only need the structure of the problem and specific mathematical properties such as concavity. We assume that the exogenous information process Wt is Markov, and that it includes both the demand for energy as well as any other exogenous information. The demand is vector-valued, representing different types of energy demand, which we represent using Dt = (Dt1, . . . , DtBd), where Bd is the number of different sources of demand. The demand does not need to be fully satisfied, but no backlogging is allowed (our energy model assumes that energy can be imported at a high price). Wt also includes wind and solar energy production, energy commodity prices, and rainfall. We assume that all elements of Wt ∈ Wt, including demands, have finite support, which means that they have been discretized at an appropriate level.

We denote by Rt the scalar storage level right before a decision is taken. To simplify the presentation of the algorithm, we discretize the storage, and let the amount in storage be given by ηRt where η is an appropriately chosen scalar parameter, while Rt is a nonnegative integer. If the units of all the forms of energy are chosen carefully, we can let η = 1. We model the problem as if Rt is continuous, but argue below that it will always take on integer values. We define St = (Wt, Rt) to be the pre-decision state (the information just before we make a decision), where we use St and (Wt, Rt) interchangeably.

The decision xt = (xdt, xst, xtrt) has three components. The first component, xdt, is a vector that represents how much to sell to satisfy the demands Dt. The second component, xst, is also a vector that represents how much to buy from Bs different energy sources (coal, nuclear, natural gas, hydroelectric, ethanol, biomass, and so on). Finally, xtrt is a scalar that denotes the amount of energy that should be transferred (discarded), in case the storage level is too high, even after demand has been satisfied.
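To make this notation concrete, the following minimal sketch (ours, not part of the paper; field names and numbers are hypothetical) shows one way to hold the exogenous information Wt, the pre-decision state St = (Wt, Rt) and the decision xt = (xdt, xst, xtrt):

```python
# Illustrative containers only; the paper does not prescribe an implementation.
from typing import NamedTuple
import numpy as np

class Information(NamedTuple):   # exogenous information W_t (discretized, finite support)
    wind: float                  # wind/solar production
    prices: np.ndarray           # energy commodity prices
    demand: np.ndarray           # D_t = (D_t1, ..., D_tB^d)
    rainfall: float

class State(NamedTuple):         # pre-decision state S_t = (W_t, R_t)
    W: Information
    R: float                     # storage level; eta * R_t units of stored energy

class Decision(NamedTuple):      # x_t = (x^d_t, x^s_t, x^tr_t)
    x_d: np.ndarray              # how much to sell to each of the B^d demand types
    x_s: np.ndarray              # how much to buy from each of the B^s supply sources
    x_tr: float                  # energy transferred (discarded)

W = Information(wind=3.2, prices=np.array([40.0]), demand=np.array([4.0, 2.0]), rainfall=0.0)
S = State(W=W, R=10.0)
x = Decision(x_d=np.array([3.0, 2.0]), x_s=np.array([1.0]), x_tr=0.0)
```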


The feasible set, which depends on the current state (Wt, Rt), is given by

    \mathcal{X}_t(W_t, R_t) = \Big\{ x^d_t \in \mathbb{R}^{B^d},\; x^s_t \in \mathbb{R}^{B^s},\; x^{tr}_t \in \mathbb{R} :
        \sum_{i=1}^{B^d} x^d_{ti} + x^{tr}_t \le \eta R_t,
        0 \le x^d_{ti} \le D_{ti}, \quad i = 1, \dots, B^d,
        0 \le x^s_{ti} \le M_{ti}, \quad i = 1, \dots, B^s,
        0 \le x^s_{ti} \le u^s_t(W_t), \quad i = 1, \dots, B^s,
        0 \le x^{tr}_t \le u^{tr}_t(W_t) \Big\},

where ust(·) and utrt(·) are nonnegative functions which take on discrete values and Mti is a deterministic bound. The first constraint indicates that backlogging is not allowed, the second indicates that we do not sell more than is demanded and the third indicates that each supplier can only offer a limited amount of resources. The last two constraints impose upper bounds that vary with the information vector Wt.

The contribution generated by the decision is given by

    C_t(W_t, R_t, x_t) = -\sum_{i=1}^{B^d} c^u_{ti}(W_t)(D_{ti} - x^d_{ti}) - c^o_t(W_t)\Big(R_t - \sum_{i=1}^{B^d} x^d_{ti}\Big) - c^{tr}_t(W_t)\, x^{tr}_t - \sum_{i=1}^{B^s} c^s_{ti}(W_t)\, x^s_{ti} + \sum_{i=1}^{B^d} c^d_{ti}(W_t)\, x^d_{ti},

where cuti(·), cot(·), ctrt(·), csti(·) and cdti(·) are nonnegative scalar functions. The first three terms represent underage, overage and transfer costs, respectively. The fourth term represents the buying cost and the last one represents the money that is made satisfying demand.

The decision takes the system to a new storage level, denoted by the scalar Rtx, given by

    R^x_t = f^x(R_t, x_t)                                                              (2)
          = R_t - \sum_{i=1}^{B^d} x^d_{ti} + \sum_{i=1}^{B^s} x^s_{ti} - x^{tr}_t,    (3)

where f x(Rt, xt) is our transition function that returns the post-decision resource state.
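As a concrete illustration, the following minimal sketch (our own, assuming scipy is available; all numbers are hypothetical placeholders for one (Wt, Rt)) solves the single-period problem of maximizing Ct(Wt, Rt, xt) over the feasible set Xt(Wt, Rt) defined above as a linear program:

```python
# Sketch of the single-period dispatch LP: maximize C_t over X_t(W_t, R_t).
import numpy as np
from scipy.optimize import linprog

B_d, B_s = 3, 2                       # demand types, supply sources
eta, R_t = 1.0, 10.0                  # discretization scale and storage level
D    = np.array([4.0, 3.0, 2.0])      # demands D_ti
M    = np.array([5.0, 6.0])           # deterministic supply bounds M_ti
u_s  = np.array([4.0, 6.0])           # state-dependent supply bounds u_t^s(W_t)
u_tr = 8.0                            # bound on transfer/discard u_t^tr(W_t)
c_u  = np.array([10.0, 8.0, 6.0])     # underage costs c_ti^u
c_o, c_tr = 0.5, 1.0                  # overage and transfer costs
c_s  = np.array([3.0, 2.0])           # buying costs c_ti^s
c_d  = np.array([9.0, 7.0, 5.0])      # selling prices c_ti^d

# Decision vector z = (x^d, x^s, x^tr); linprog minimizes, so negate C_t's coefficients.
obj = -np.concatenate([c_u + c_o + c_d, -c_s, [-c_tr]])
# Flow constraint: sum_i x^d_ti + x^tr_t <= eta * R_t.
A_ub = [np.concatenate([np.ones(B_d), np.zeros(B_s), [1.0]])]
b_ub = [eta * R_t]
bounds = ([(0, D[i]) for i in range(B_d)]
          + [(0, min(M[i], u_s[i])) for i in range(B_s)]
          + [(0, u_tr)])

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
x_d, x_s, x_tr = res.x[:B_d], res.x[B_d:B_d + B_s], res.x[-1]
constant = -(c_u @ D) - c_o * R_t          # terms of C_t that do not depend on x_t
print("C_t =", -res.fun + constant)
print("post-decision storage R_t^x =", R_t - x_d.sum() + x_s.sum() - x_tr)   # transition (3)
```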

If the problem is deterministic, it can be formulated as a single linear program of the form

    \max_{x} \sum_{t=0}^{T} C_t(W_t, R_t, x_t)    (4)

subject to xt ∈ Xt(Wt, Rt) and the transition equation (3). In section 6, we compare against the optimal solution obtained from this formulation (again, assuming Wt is deterministic) as a benchmark. When the problem is stochastic, we look for a decision function Xtπ(St) (or policy) that solves

    \max_{\pi \in \Pi} \mathbb{E} \sum_{t=0}^{T} C_t(W_t, R_t, X_t^{\pi}(S_t)).    (5)

In this paper, we design the decision function using approximate dynamic programming, where Xtπ(St) is given by

    X_t^{\pi}(S_t) = \max_{x_t \in \mathcal{X}_t} \big( C(S_t, x_t) + \gamma \, \mathbb{E}\{ \bar V_{t+1}(S_{t+1}) \mid S_t \} \big),

where V̄t(s) is an approximation of the value function at state s. This strategy introduces two challenges. The first is the presence of an expectation (which cannot be computed exactly) within the optimization problem. We can remove the expectation by using the concept of the post-decision state, Stx = (Wt, Rtx), which is the state immediately after we have made a decision but before any new information has arrived. Since Wt is not affected by a decision, we only have to worry about Rtx, which is the storage level after we have decided how much to add or subtract. The pre-decision state is then computed using

    R_{t+1} = R_t^x + \hat R_{t+1},

where R̂t+1 might represent exogenous rainfall to our reservoir (R̂t+1 would be part of Wt+1). The post-decision state (see Powell (2007), chapters 4, 11 and 12) allows us to solve problems of the form

    X_t^{\pi}(S_t) = \max_{x_t \in \mathcal{X}_t} \big( C(S_t, x_t) + \gamma \bar V_{t+1}(W_t, R_t^x) \big),    (6)

which is a deterministic linear program, where Rtx is given by equation (3). The second challenge is computing the value function V̄t(Stx) = V̄t(Wt, Rtx). V̄t(Wt, R) can be represented exactly as a piecewise linear function of R indexed by Wt. Throughout our presentation, we will write V̄t(Wt, R) for R = 0, 1, . . . , BR, by which we mean Rtx = ηR. We will treat R as if it is continuous, but it is easy to show by induction that Vtx(Wt, Rtx) is piecewise linear and concave in Rtx. Since energy supplies and demands are all assumed to be discretized (with the same discretization parameter η), the optimal solution of the linear program in equation (6) has the property that xt will also be discrete, and can be represented as integers scaled by η. For this reason, assuming the initial amount in storage also takes on one of these values (such as zero), then we are assured that Rtx (and Rt) will always take on a discrete value, even if we treat it as a continuous variable.

For our energy application, concavity of V̄t(Wt, R) in R is a significant structural property. In section 6, we report on comparisons against the optimal solution from a deterministic linear program. To match this solution, we had to choose η so that BR was on the order of 10,000, which means that the value function approximation could have as many as 10,000 segments (and one of these for each of 8,760 time periods). However, our algorithm requires that we visit only a small number of these (for each value of Wt), accelerating the algorithm by a factor of 1000. For this reason, it does not affect the computational performance of the algorithm if we were to discretize the function into a million segments.

Our algorithmic strategy can be extended to more complex problems in an approximate way. In Powell et al. (2009), energy is stored in a hydroelectric reservoir. If we extended the model to handle storage using compressed air, flywheels, batteries and compressed hydrogen, we would lose our convergence proof because we would have to approximate the value of this vector of energy storage resources using separable, piecewise linear approximations. Topaloglu & Powell (2006) does exactly this for a problem managing a fleet of vehicles of different types, and shows that the use of separable, piecewise linear value functions provides very good, but not optimal results.
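The following small sketch (ours, with hypothetical slope values) shows the slope representation used throughout the paper: a concave, piecewise linear V̄t(Wt, ·) is stored as a nonincreasing vector of slopes and evaluated as V̄t(Wt, R + y) = Σ_{r≤R} v̄t(Wt, r) + y·v̄t(Wt, R + 1) for a nonnegative integer R and 0 ≤ y ≤ 1:

```python
# Storing V-bar_t(W_t, .) by its slopes and evaluating the translated function.
import numpy as np

def make_value_function(slopes):
    """slopes[r-1] = v-bar_t(W_t, r); must be nonincreasing for concavity."""
    slopes = np.asarray(slopes, dtype=float)
    assert np.all(np.diff(slopes) <= 1e-12), "slopes must be nonincreasing"
    cum = np.concatenate([[0.0], np.cumsum(slopes)])   # cum[R] = V-bar(W, R), with V-bar(W, 0) = 0

    def value(R_plus_y):
        R = int(np.floor(R_plus_y))
        y = R_plus_y - R
        if R >= len(slopes):                 # beyond the last breakpoint: extend with the last slope
            return cum[-1] + (R_plus_y - len(slopes)) * slopes[-1]
        return cum[R] + y * slopes[R]        # slopes[R] holds v-bar(W, R+1)
    return value

V = make_value_function([5.0, 3.0, 3.0, 1.0, 0.5])     # hypothetical slopes, B^R = 5
print(V(0.0), V(2.5), V(4.0))                          # 0.0, 9.5, 12.0
```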


3 The Optimal Value Functions

We define, recursively, the optimal value functions associated with the storage class. We denote by Vt∗(Wt, Rt) the optimal value function around the pre-decision state (Wt, Rt) and by Vtx(Wt, Rtx) the optimal value function around the post-decision state (Wt, Rtx). At time t = T, since it is the end of the planning horizon, the value of being in any state (WT, RTx) is zero. Hence, VTx(WT, RTx) = 0. At time t − 1, for t = T, . . . , 1, the value of being in any post-decision state (Wt−1, Rt−1x) does not involve the solution of a deterministic optimization problem; it only involves an expectation, since the next pre-decision state (Wt, Rt) only depends on the information that first becomes available at t. On the other hand, the value of being in any pre-decision state (Wt, Rt) does not involve expectations, since the next post-decision state (Wt, Rtx) is a deterministic function of the decision xt; it only requires the solution of a deterministic optimization problem. Therefore,

    V^x_{t-1}(W_{t-1}, R^x_{t-1}) = \mathbb{E}\big[ V^*_t(W_t, R_t) \mid (W_{t-1}, R^x_{t-1}) \big],    (7)
    V^*_t(W_t, R_t) = \max_{x_t \in \mathcal{X}_t(W_t, R_t)} \big( C_t(W_t, R_t, x_t) + \gamma V^x_t(W_t, R^x_t) \big).    (8)

In the remainder of the paper, we only use the value function Vtx(Stx) defined around the post-decision state variable, since it allows us to make decisions by solving a deterministic problem as in (7). We show that Vtx(Wt, R) is concave and piecewise linear in R with BR discrete breakpoints. This structural property combined with the optimization/expectation inversion is the foundation of our algorithmic strategy and its proof of convergence.

For Wt ∈ Wt, let vt(Wt) = (vt(Wt, 1), . . . , vt(Wt, BR)) be a vector representing the slopes of a function Vt(Wt, R) : [0, ∞) → IR that is concave and piecewise linear with breakpoints R = 1, . . . , BR. It is easy to see that the function Ft∗(vt(Wt), Wt, ·), where

    F^*_t(v_t(W_t), W_t, R_t) = \max_{x_t, y_t} \; C_t(W_t, R_t, x_t) + \gamma \sum_{r=1}^{B^R} v_t(W_t, r)\, y_{tr}
    \text{subject to} \quad x_t \in \mathcal{X}_t(W_t, R_t), \quad \sum_{r=1}^{B^R} y_{tr} = f^x(R_t, x_t), \quad 0 \le y_t \le 1,

is also concave and piecewise linear with breakpoints R = 1, . . . , BR, since the demand vector Dt and the upper bounds usti(·), utrt(·) and Mti in the constraint set Xt(Wt, Rt) are all discrete. Moreover, the optimal solution (xt∗, yt∗) to the linear programming problem that defines Ft∗(vt(Wt), Wt, Rt) does not depend on Vt(Wt, 0) and is a discrete vector whenever the argument Rt is discrete. We also have that Ft∗(vt(Wt), Wt, Rt) is bounded for all Wt ∈ Wt. We use Ft∗ to prove the following proposition about the optimal value function.

Proposition 1. For t = 0, . . . , T and information vector Wt ∈ Wt, the optimal value function Vtx(Wt, R) is concave and piecewise linear with breakpoints R = 1, . . . , BR. We denote its slopes by vt∗(Wt) = (vt∗(Wt, 1), . . . , vt∗(Wt, BR)), where, for R = 1, . . . , BR and t < T, vt∗(Wt, R) is given by

    v^*_t(W_t, R) = V^x_t(W_t, R) - V^x_t(W_t, R-1)
                  = \mathbb{E}\big[ F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R + \hat R_{t+1}(W_{t+1}))
                    - F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R - 1 + \hat R_{t+1}(W_{t+1})) \mid W_t \big].    (9)

Proof: The proof is by backward induction on t. The base case t = T holds as VTx(WT, ·) is equal to zero for all WT ∈ WT. For t < T the proof is obtained noting that

    V^x_t(W_t, R^x_t) = \mathbb{E}\big[ \gamma V^x_{t+1}(W_{t+1}, 0) + F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R^x_t + \hat R_{t+1}(W_{t+1})) \mid W_t \big].

Due to the concavity of Vtx(Wt, ·), the slope vector vt∗(Wt) is monotone decreasing, that is, vt∗(Wt, R) ≥ vt∗(Wt, R + 1) (throughout, we use vt∗(Wt) to refer to the slope of the value function Vtx(Wt, Rtx) defined around the post-decision state variable). Moreover, throughout the paper, we work with the translated version Vt∗(Wt, ·) of Vtx(Wt, ·) given by

    V^*_t(W_t, R + y) = \sum_{r=1}^{R} v^*_t(W_t, r) + y\, v^*_t(W_t, R+1),

where R is a nonnegative integer and 0 ≤ y ≤ 1, since the optimal solution (xt∗, yt∗) associated with F∗(vt+1∗(Wt+1), Wt+1, R) does not depend on Vtx(Wt, 0).

We next introduce the dynamic programming operator H associated with the storage class. We define H using the slopes of piecewise linear functions instead of the functions themselves.

Let v = {vt(Wt) for t = 0, . . . , T, Wt ∈ Wt} be a set of slope vectors, where vt(Wt) = (vt(Wt, 1), . . . , vt(Wt, BR)). The dynamic programming operator H associated with the storage class maps a set of slope vectors v into a new set Hv as follows. For t = 0, . . . , T − 1, Wt ∈ Wt and R = 1, . . . , BR,

    (Hv)_t(W_t, R) = \mathbb{E}\big[ F^*_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R + \hat R_{t+1}(W_{t+1}))
                     - F^*_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R - 1 + \hat R_{t+1}(W_{t+1})) \mid W_t \big].    (10)

It is well known that the optimal value function in a dynamic program is unique. This is a trivial result for the finite horizon problems that we consider in this paper. Therefore, the set of slopes v∗ corresponding to the optimal value functions Vt∗(Wt, R) for t = 0, . . . , T, Wt ∈ Wt and R = 1, . . . , BR is unique. Moreover, the dynamic programming operator H defined by (10) is assumed to satisfy the following conditions. Let ṽ = {ṽt(Wt) for t = 0, . . . , T, Wt ∈ Wt} and ṽ̃ = {ṽ̃t(Wt) for t = 0, . . . , T, Wt ∈ Wt} be sets of slope vectors such that ṽt(Wt) = (ṽt(Wt, 1), . . . , ṽt(Wt, BR)) and ṽ̃t(Wt) = (ṽ̃t(Wt, 1), . . . , ṽ̃t(Wt, BR)) are monotone decreasing and ṽt(Wt) ≤ ṽ̃t(Wt). Then, for Wt ∈ Wt and R = 1, . . . , BR:

    (H\tilde v)_t(W_t) \text{ is monotone decreasing},    (11)
    (H\tilde v)_t(W_t, R) \le (H\tilde{\tilde v})_t(W_t, R),    (12)
    (H\tilde v)_t(W_t, R) - \eta e \le (H(\tilde v - \eta e))_t(W_t, R) \le (H(\tilde v + \eta e))_t(W_t, R) \le (H\tilde v)_t(W_t, R) + \eta e,    (13)

where η is a positive integer and e is a vector of ones. The conditions in equations (11) and (13) imply that the mapping H is continuous (see the discussion in Bertsekas & Tsitsiklis (1996), page 158). The dynamic programming operator H and the associated conditions (11)-(13) are used later on to construct deterministic sequences that are provably convergent to the optimal slopes.
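For intuition, the following minimal sketch (our own, assuming scipy is available; all data are hypothetical placeholders) evaluates Ft∗(vt(Wt), Wt, Rt) as a linear program. The breakpoint variables ytr ∈ [0, 1] sum to the post-decision level f x(Rt, xt); because the slopes are nonincreasing, the LP fills y1, y2, . . . in order on its own:

```python
# Sketch of the stage problem F*_t(v, W_t, R_t) with a piecewise linear value of storage.
import numpy as np
from scipy.optimize import linprog

def stage_value(v, R_t, D, M, u_s, u_tr, c_u, c_o, c_tr, c_s, c_d, gamma=1.0, eta=1.0):
    B_d, B_s, B_R = len(D), len(M), len(v)
    n = B_d + B_s + 1 + B_R                       # variables: x^d, x^s, x^tr, y_1..y_BR
    obj = np.zeros(n)                             # linprog minimizes, so negate C_t + gamma*v'y
    obj[:B_d] = -(c_u + c_o + c_d)
    obj[B_d:B_d + B_s] = c_s
    obj[B_d + B_s] = c_tr
    obj[B_d + B_s + 1:] = -gamma * np.asarray(v)
    # Inequality: sum_i x^d_i + x^tr <= eta * R_t.
    A_ub = np.zeros((1, n)); A_ub[0, :B_d] = 1.0; A_ub[0, B_d + B_s] = 1.0
    b_ub = [eta * R_t]
    # Equality: sum_r y_r = R_t - sum_i x^d_i + sum_i x^s_i - x^tr  (the transition (3)).
    A_eq = np.zeros((1, n))
    A_eq[0, :B_d] = 1.0; A_eq[0, B_d:B_d + B_s] = -1.0
    A_eq[0, B_d + B_s] = 1.0; A_eq[0, B_d + B_s + 1:] = 1.0
    b_eq = [R_t]
    bounds = ([(0, D[i]) for i in range(B_d)]
              + [(0, min(M[i], u_s[i])) for i in range(B_s)]
              + [(0, u_tr)] + [(0, 1)] * B_R)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    constant = -np.dot(c_u, D) - c_o * R_t        # C_t terms that do not depend on x_t
    return -res.fun + constant, res.x

val, sol = stage_value(v=[5.0, 3.0, 2.0, 0.5], R_t=2.0,
                       D=np.array([2.0, 1.0]), M=np.array([3.0]), u_s=np.array([2.0]), u_tr=4.0,
                       c_u=np.array([6.0, 4.0]), c_o=0.5, c_tr=1.0,
                       c_s=np.array([2.5]), c_d=np.array([5.0, 3.0]))
print(val)
```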

4 The SPAR-Storage Algorithm

We propose a pure exploitation algorithm, namely the SPAR-Storage Algorithm, that provably learns the optimal decisions to be taken at parts of the state space that can be reached by an optimal policy, which are determined by the algorithm itself. This is accomplished by learning the slopes of the optimal value functions at important parts of the state space, through the construction of value function approximations V̄tn(Wt, ·) that are concave, piecewise linear with breakpoints R = 1, . . . , BR. The approximation is represented by its slopes v̄tn(Wt) = (v̄tn(Wt, 1), . . . , v̄tn(Wt, BR)). Figure 1 illustrates the idea. In order to do so, the algorithm combines Monte Carlo simulation in a pure exploitation scheme and stochastic approximation integrated with a projection operation.

[Figure 1: Optimal value function and the constructed approximation. The constructed approximation V̄tn(Wt, ·) and the unknown optimal value function Vt∗(Wt, ·) are plotted against the asset dimension, highlighting the important region near the optimal solution.]

Figure 2 describes the SPAR-Storage algorithm. The algorithm requires an initial concave piecewise linear value function approximation V̄0(Wt, ·), represented by its slopes v̄t0(Wt) = (v̄t0(Wt, 1), . . . , v̄t0(Wt, BR)), for each information vector Wt ∈ Wt. Therefore the initial slope vector v̄t0(Wt) has to be monotone decreasing. For example, it is valid to set all the initial slopes equal to zero. For completeness, since we know that the optimal value function at the end of the horizon is equal to zero, we set v̄Tn(WT, R) = 0 for all iterations n, information vectors WT ∈ WT and asset levels R = 1, . . . , BR. The algorithm also requires an initial asset level to be used in all iterations. Thus, for all n ≥ 0, R−1x,n is set to a nonnegative integer r̄, as in STEP 0b.

as in STEP 0b. At the beginning of each iteration n, the algorithm observes a sample realization of the information sequence W0n , . . . , WTn , as in STEP 1. The sample can be obtained from a sample generator or actual data. After that, the algorithm goes over time periods t = 0, . . . , T .


STEP 0: Algorithm initialization:
    STEP 0a: Initialize v̄t0(Wt) for t = 1, . . . , T − 1 and Wt ∈ Wt, monotone decreasing.
    STEP 0b: Set R−1x,n = r̄, a nonnegative integer, for all n ≥ 0.
    STEP 0c: Set n = 1.
STEP 1: Sample/observe the information sequence W0n, . . . , WTn. Do for t = 0, . . . , T:
    STEP 2: Compute the pre-decision asset level: Rtn = Rt−1x,n + R̂t(Wtn).
    STEP 3: Find the optimal solution xtn of
        max_{xt ∈ Xt(Wtn, Rtn)} Ct(Wtn, Rtn, xt) + γ V̄tn−1(Wtn, f x(Rtn, xt)).
    STEP 4: Compute the post-decision asset level: Rtx,n = f x(Rtn, xtn).
    STEP 5: Update slopes: if t < T then
        STEP 5a: Observe v̂t+1n(Rtx,n) and v̂t+1n(Rtx,n + 1). See (14).
        STEP 5b: For Wt ∈ Wt and R = 1, . . . , BR:
            ztn(Wt, R) = (1 − ᾱtn(Wt, R)) v̄tn−1(Wt, R) + ᾱtn(Wt, R) v̂t+1n(R).
        STEP 5c: Perform the projection operation v̄tn = ΠC(ztn). See (18).
STEP 6: Increase n by one and go to STEP 1.

Figure 2: SPAR-Storage Algorithm

First, the pre-decision asset level Rtn is computed, as in STEP 2. Then, the decision xtn, which is optimal with respect to the current pre-decision state (Wtn, Rtn) and value function approximation V̄tn−1(Wtn, ·), is taken, as stated in STEP 3. We have that

    \bar V_t^n(W_t, R + y) = \sum_{r=1}^{R} \bar v_t^n(W_t, r) + y\, \bar v_t^n(W_t, R+1),

where R is a nonnegative integer and 0 ≤ y ≤ 1. Next, taking into account the decision, the algorithm computes the post-decision asset level Rtx,n, as in STEP 4.


Time period t is concluded by updating the slopes of the value function approximation. Steps 5a-5c describe the procedure and figure 3 illustrates it. Sample slopes relative to the post-decision states (Wtn, Rtx,n) and (Wtn, Rtx,n + 1) are observed, see STEP 5a and figure 3a. After that, these samples are used to update the approximation slopes v̄tn−1(Wtn), through a temporary slope vector ztn(Wtn). This procedure requires the use of a stepsize rule that is state dependent, denoted by ᾱtn(Wt, R), and it may lead to a violation of the property that the slopes are monotonically decreasing, see STEP 5b and figure 3b. Thus, a projection operation ΠC is performed to restore the property and updated slopes v̄tn(Wtn) are obtained, see STEP 5c and figure 3c. After the end of the planning horizon T is reached, the iteration counter is incremented, as in STEP 6, and a new iteration is started from STEP 1.

We have that xtn is easily computed since it is the first component of the optimal solution to the linear programming problem in the definition of Ft∗(v̄tn−1(Wtn), Wtn, Rtn). Moreover, given our assumptions and the properties of Ft∗ discussed in the previous section, it is clear that Rtn, xtn and Rtx,n are all integer valued. We also know that they are bounded. Therefore, the sequences of decisions, pre and post decision states generated by the algorithm, which are denoted by {xtn}n≥0, {Stn}n≥0 = {(Wtn, Rtn)}n≥0 and {Stx,n}n≥0 = {(Wtn, Rtx,n)}n≥0, respectively, have at least one accumulation point. Since these are sequences of random variables, their accumulation points, denoted by xt∗, St∗ and Stx,∗, respectively, are also random variables.

The sample slopes used to update the approximation slopes are obtained by replacing the expectation and the slopes vt+1∗(Wt+1) of the optimal value function in (9) by a sample realization of the information Wt+1n and the current slope approximation v̄t+1n−1(Wt+1n), respectively.

Thus, for t = 0, . . . , T − 1, the sample slope is given by

    \hat v_{t+1}^n(R) = F^*_{t+1}(\bar v_{t+1}^{n-1}(W_{t+1}^n), W_{t+1}^n, R + \hat R_{t+1}(W_{t+1}^n))
                      - F^*_{t+1}(\bar v_{t+1}^{n-1}(W_{t+1}^n), W_{t+1}^n, R - 1 + \hat R_{t+1}(W_{t+1}^n)).    (14)

[Figure 3: Update procedure of the approximate slopes. 3a: Current approximate function, optimal decision and sampled slopes. 3b: Temporary approximate function with violation of concavity. 3c: Level projection operation: updated approximate function with concavity restored.]

The update procedure is then divided into two parts. First, a temporary set of slope vectors ztn = {ztn(Wt, R) : Wt ∈ Wt, R = 1, . . . , BR} is produced combining the current approximation and the sample slopes using the stepsize rule ᾱtn(Wt, R). We have that

    \bar\alpha_t^n(W_t, R) = \alpha_t^n \, 1_{\{W_t = W_t^n\}} \big( 1_{\{R = R_t^{x,n}\}} + 1_{\{R = R_t^{x,n}+1\}} \big),

where αtn is a scalar between 0 and 1 and can depend only on information that became available up until iteration n and time t. Moreover, on the event that (Wt∗, Rtx,∗) is an accumulation


point of {(Wtn, Rtx,n)}n≥0, we make the standard assumptions that

    \sum_{n=1}^{\infty} \bar\alpha_t^n(W_t^*, R_t^{x,*}) = \infty \quad a.s.,    (15)

    \sum_{n=1}^{\infty} \big( \bar\alpha_t^n(W_t^*, R_t^{x,*}) \big)^2 \le B^{\alpha} < \infty \quad a.s.,    (16)

where Bα is a constant. Clearly, the rule αtn = 1/NVn(Wt∗, Rtx,∗) satisfies all the conditions, where NVn(Wt∗, Rtx,∗) is the number of visits to state (Wt∗, Rtx,∗) up until iteration n. Furthermore, for all positive integers N,

    \prod_{n=N}^{\infty} \big( 1 - \bar\alpha_t^n(W_t^*, R_t^{x,*}) \big) = 0 \quad a.s.    (17)

The proof for (17) follows directly from the fact that log(1 + x) ≤ x. The second part is the projection operation, where the temporary slope vector ztn(Wt), that may not be monotone decreasing, is transformed into another slope vector v̄tn(Wt) that has this structural property. The projection operator imposes the desired property by simply forcing the violating slopes to be equal to the newly updated ones. For Wt ∈ Wt and R = 1, . . . , BR, the projection is given by

    \Pi_C(z_t^n)(W_t, R) =
      \begin{cases}
        z_t^n(W_t^n, R_t^{x,n}),     & \text{if } W_t = W_t^n,\ R < R_t^{x,n},\ z_t^n(W_t, R) \le z_t^n(W_t^n, R_t^{x,n}), \\
        z_t^n(W_t^n, R_t^{x,n} + 1), & \text{if } W_t = W_t^n,\ R > R_t^{x,n} + 1,\ z_t^n(W_t, R) \ge z_t^n(W_t^n, R_t^{x,n} + 1), \\
        z_t^n(W_t, R),               & \text{otherwise.}
      \end{cases}    (18)
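A minimal sketch of the projection (18) (ours, with hypothetical slope values) restores the monotone decreasing property of the temporary slopes for the sampled information vector Wtn by leveling violating slopes to the two newly updated ones at Rtx,n and Rtx,n + 1:

```python
# Level projection Pi_C applied to the slope vector for W_t^n.
import numpy as np

def project_concave(z, R_x):
    """z[r-1] holds z_t^n(W_t^n, r) for r = 1..B^R; R_x is the post-decision level R_t^{x,n} >= 1."""
    v = np.array(z, dtype=float)
    i, j = R_x - 1, R_x                       # 0-based positions of the slopes at R_t^{x,n} and R_t^{x,n}+1
    v[:i] = np.maximum(v[:i], v[i])           # slopes below R_t^{x,n} may only be raised up to v[i]
    v[j + 1:] = np.minimum(v[j + 1:], v[j])   # slopes above R_t^{x,n}+1 may only be lowered to v[j]
    return v

z = [4.0, 2.5, 3.0, 1.0, 3.5, 0.5]    # hypothetical temporary slopes, concavity violated
print(project_concave(z, R_x=3))      # -> [4.  3.  3.  1.  1.  0.5]
```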

For t = 0, . . . , T, information vector Wt ∈ Wt and asset level R = 1, . . . , BR, the sequence of slopes of the value function approximation generated by the algorithm is denoted by {v̄tn(Wt, R)}n≥0. Moreover, as the function Ft∗(v̄tn−1(Wtn), Wtn, Rtn) is bounded and the stepsizes αtn are between 0 and 1, we can easily see that the sample slopes v̂tn(R), the temporary slopes ztn(Wt, R) and, consequently, the approximated slopes v̄tn(Wt, R) are all bounded. Therefore, the slope sequence {v̄tn(Wt, R)}n≥0 has at least one accumulation point, as the projection operation guarantees that the updated vectors of slopes are elements of a compact set. The accumulation points are random variables and are denoted by v̄t∗(Wt, R), as opposed to the deterministic optimal slopes vt∗(Wt, R).

The ability of the SPAR-storage algorithm to avoid visiting all possible values of Rt was significant in our energy application. In our energy model in Powell et al. (2009), Rt was the energy stored in a hydroelectric reservoir which served as a source of storage for the entire grid. The algorithm required that we discretize Rt into approximately 10,000 elements. However, the power of concavity requires that we visit only a small number of these a large number of times for each value of Wt . An important practical issue is that we effectively have to visit the entire support of Wt infinitely often. If Wt contains a vector of as few as five or ten dimensions, this can be computationally expensive. In a practical application, we would use statistical methods to produce more robust estimates using smaller numbers of observations. For example, we could use the hierarchical aggregation strategy suggested by George et al. (2008) to produce estimates of V (W, R) at different levels of aggregation of W . These estimates can then be combined using a simple weighting formula. It is beyond the scope of this paper to analyze the convergence properties of such a strategy, but it is important to realize that these strategies are available to overcome the discretization of Wt .
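To tie the pieces together, the following high-level sketch (ours, not the authors' implementation) outlines one run of the SPAR-Storage algorithm of figure 2. The helpers sample_information, solve_stage_problem and sample_slopes are hypothetical: the stage problem is the linear program of STEP 3 and the sample slopes are the differences in (14); W[t] is assumed to be a hashable record (e.g., a namedtuple) carrying the exogenous change R̂t in a field R_hat. Boundary storage levels are skipped for brevity.

```python
import numpy as np
from collections import defaultdict

def spar_storage(T, B_R, R_init, n_iters, sample_information, solve_stage_problem,
                 sample_slopes, gamma=1.0):
    # v[t][W] holds the slope vector of V-bar_t(W, .); all-zero slopes are a valid (monotone) start.
    v = [defaultdict(lambda: np.zeros(B_R)) for _ in range(T + 1)]
    visits = defaultdict(int)                                   # for the 1/N stepsize rule
    for n in range(1, n_iters + 1):
        W = sample_information(T)                               # STEP 1: W_0^n, ..., W_T^n
        R_post = R_init                                         # initial level R_{-1}^{x,n} = r-bar
        for t in range(T + 1):
            R_pre = R_post + W[t].R_hat                         # STEP 2: pre-decision level
            x, R_post = solve_stage_problem(t, W[t], R_pre, v[t][W[t]], gamma)   # STEPS 3-4
            if t < T and 1 <= R_post < B_R:                     # STEP 5 (interior levels only)
                v_hat_lo, v_hat_hi = sample_slopes(t + 1, W[t + 1], v[t + 1][W[t + 1]],
                                                   R_post, gamma)               # STEP 5a, eq. (14)
                visits[(t, W[t], R_post)] += 1
                alpha = 1.0 / visits[(t, W[t], R_post)]         # satisfies conditions (15)-(16)
                z = v[t][W[t]].copy()                           # STEP 5b: smooth the two slopes
                z[R_post - 1] = (1 - alpha) * z[R_post - 1] + alpha * v_hat_lo
                z[R_post]     = (1 - alpha) * z[R_post]     + alpha * v_hat_hi
                # STEP 5c: projection Pi_C as in (18) -- level the violating slopes
                z[:R_post - 1] = np.maximum(z[:R_post - 1], z[R_post - 1])
                z[R_post + 1:] = np.minimum(z[R_post + 1:], z[R_post])
                v[t][W[t]] = z
    return v
```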

5 Convergence Analysis

We start this section by presenting the convergence results we want to prove. The major result is the almost sure convergence of the approximation slopes corresponding to states that are visited infinitely often. On the event that (Wt∗, Rtx,∗) is an accumulation point of {(Wtn, Rtx,n)}n≥0, we obtain

    \bar v_t^n(W_t^*, R_t^{x,*}) \to v_t^*(W_t^*, R_t^{x,*}) \quad \text{and} \quad \bar v_t^n(W_t^*, R_t^{x,*}+1) \to v_t^*(W_t^*, R_t^{x,*}+1) \quad a.s.

As a byproduct of the previous result, we show that, for t = 0, . . . , T, on the event that (Wt∗, Rt∗, xt∗) is an accumulation point of {(Wtn, Rtn, xtn)}n≥0,

    x_t^* = \arg\max_{x_t \in \mathcal{X}_t(W_t^*, R_t^*)} \big( C_t(W_t^*, R_t^*, x_t) + \gamma V_t^*(W_t^*, f^x(R_t^*, x_t)) \big) \quad a.s.,    (19)

where Vt∗(Wt∗, ·) is the translated optimal value function.

Equation (19) implies that the algorithm has learned almost surely an optimal decision for all states that can be reached by an optimal policy. This implication can be easily justified as follows. Pick ω in the sample space. We omit the dependence of the random variables on ω for the sake of clarity. For t = 0, since R−1x,n = r̄, a given constant, for all iterations of the algorithm, we have that R−1x,∗ = r̄. Moreover, all the elements in W0 are accumulation points of {W0n}n≥0, as W0 has finite support. Thus, (19) tells us that the accumulation points x0∗ of the sequence {x0n}n≥0 along the iterations with pre-decision state (W0∗, R−1x,∗ + R̂0(W0∗)) are in fact an optimal policy for period 0 when the information is W0∗. This implies that all accumulation points R0x,∗ = f x(R−1x,∗ + R̂0(W0∗), x0∗) of {R0x,n}n≥0 are post-decision resource levels that can be reached by an optimal policy. By the same token, for t = 1, every element in W1 is an accumulation point of {W1n}n≥0. Hence, (19) tells us that the accumulation points x1∗ of the sequence {x1n} along iterations with (W1n, R1n) = (W1∗, R0x,∗ + R̂1(W1∗)) are indeed an optimal policy for period 1 when the information is W1∗ and the pre-decision resource level is R1∗ = R0x,∗ + R̂1(W1∗). As before, the accumulation points R1x,∗ = f x(R1∗, x1∗) of {R1x,n}n≥0 are post-decision resource levels that can be reached by an optimal policy. The same reasoning can be applied for t = 2, . . . , T.

5.1 Outline of the Convergence Proofs

Our proofs build on the ideas presented in Bertsekas & Tsitsiklis (1996) and in Powell et al. (2004). The first proves convergence assuming that all states are visited infinitely often. The authors do not consider a concavity-preserving step, which is the key element that has allowed us to obtain a convergence proof when a pure exploitation scheme is considered. Although the framework in Powell et al. (2004) also considers the concavity of the optimal value functions in the resource dimension, the use of a projection operation to restore concavity and a pure exploitation routine, their proof is restricted to two-stage problems. The difference is significant. In Powell et al. (2004), it was possible to assume that Monte Carlo estimates of the true value function were unbiased, a critical assumption in that paper. In our paper, estimates of the marginal value of additional resource at time t depend on a value function approximation at t + 1. Since this is an approximation, the estimates of marginal values at time t are biased.

The main concept to achieve the convergence of the approximation slopes to the optimal ones is to construct deterministic sequences of slopes, namely, {Ltk(Wt, R)}k≥0 and {Utk(Wt, R)}k≥0, that are provably convergent to the slopes of the optimal value functions. These sequences are based on the dynamic programming operator H, as introduced in (10). We then use these sequences to prove almost surely that for all k ≥ 0,

    L_t^k(W_t^*, R_t^{x,*}) \le \bar v_t^n(W_t^*, R_t^{x,*}) \le U_t^k(W_t^*, R_t^{x,*}),    (20)
    L_t^k(W_t^*, R_t^{x,*}+1) \le \bar v_t^n(W_t^*, R_t^{x,*}+1) \le U_t^k(W_t^*, R_t^{x,*}+1),    (21)

on the event that the iteration n is sufficiently large and (Wt∗, Rtx,∗) is an accumulation point of {(Wtn, Rtx,n)}n≥0, which implies the convergence of the approximation slopes to the optimal ones.

Establishing (20) and (21) requires several intermediate steps that need to take into consideration the pure exploitation nature of our algorithm and the concavity preserving operation. We give all the details in the proof of Ltk(Wt∗, Rtx,∗) ≤ v̄tn(Wt∗, Rtx,∗) and Ltk(Wt∗, Rtx,∗ + 1) ≤ v̄tn(Wt∗, Rtx,∗ + 1). The upper bound inequalities are obtained using a symmetrical argument. First, we define two auxiliary stochastic sequences of slopes, namely, the noise and the bounding sequences, denoted by {s̄tn(Wt, R)}n≥0 and {l̄tn(Wt, R)}n≥0, respectively. The first sequence represents the noise introduced by the observation of the sample slopes, which replaces the observation of true expectations and the optimal slopes. The second one is a convex combination of the deterministic sequence Ltk(Wt, R) and the transformed sequence (HLk)t(Wt, R). We then define the set S̃t to contain the states (Wt∗, Rtx,∗) and (Wt∗, Rtx,∗ + 1), such that (Wt∗, Rtx,∗) is an accumulation point of {(Wtn, Rtx,n)}n≥0 and the projection operation decreased or kept the same the corresponding unprojected slopes infinitely often. This is not the set of all accumulation points, since there are some points where the slope may have increased infinitely often. The stochastic sequences {s̄tn(Wt, R)} and {l̄tn(Wt, R)} are used to show that on the event that the iteration n is big enough and (Wt∗, R̃tx,∗) is an element of the random set S̃t,

    \bar v_t^{n-1}(W_t^*, \tilde R_t^{x,*}) \ge \bar l_t^{n-1}(W_t^*, \tilde R_t^{x,*}) - \bar s_t^{n-1}(W_t^*, \tilde R_t^{x,*}) \quad a.s.

Then, on {(Wt∗, R̃tx,∗) ∈ S̃t}, convergence to zero of the noise sequence, the convex combination property of the bounding sequence and the monotone decreasing property of the approximate slopes will give us

    L_t^k(W_t^*, \tilde R_t^{x,*}) \le \bar v_t^{n-1}(W_t^*, \tilde R_t^{x,*}) \quad a.s.

Note that this inequality does not cover all the accumulation points of {(Wtn, Rtx,n)}n≥0, since they are restricted to states in the set S̃t. Nevertheless, this inequality and some properties of the projection operation are used to fulfill the requirements of a bounding technical lemma, which is used repeatedly to obtain the desired lower bound inequalities for all accumulation points.

In order to prove (19), we note that Ft(v̄tn−1(Wtn), Wtn, Rtn, xt) = Ct(Wtn, Rtn, xt) + γ V̄tn−1(Wtn, f x(Rtn, xt)) is a concave function of xt and Xt(Wtn, Rtn) is a convex set. Therefore,

    0 \in \partial F_t(\bar v_t^{n-1}(W_t^n), W_t^n, R_t^n, x_t^n) - \mathcal{X}_t^N(W_t^n, R_t^n, x_t^n),

where xtn is the optimal decision of the optimization problem in STEP 3 of the algorithm, ∂Ft(v̄tn−1(Wtn), Wtn, Rtn, xtn) is the subdifferential of Ft(v̄tn−1(Wtn), Wtn, Rtn, ·) at xtn and XtN(Wtn, Rtn, xtn) is the normal cone of Xt(Wtn, Rtn) at xtn. This inclusion and the first convergence result are then combined to prove, as shown in section 5.4, that

    0 \in \partial F_t(v_t^*(W_t^*), W_t^*, R_t^*, x_t^*) - \mathcal{X}_t^N(W_t^*, R_t^*, x_t^*).

5.2 Technical Elements

In this section, we set the stage for the convergence proofs by defining some technical elements. We start with the definition of the deterministic sequence {Ltk(Wt, R)}k≥0. For this, we let Bv be a deterministic integer that bounds |v̂tn(R)|, |ztn(Wt, R)|, |v̄tn(Wt, R)| and |vt∗(Wt, R)| for all Wt ∈ Wt and R = 1, . . . , BR. Then, for t = 0, . . . , T − 1, Wt ∈ Wt and R = 1, . . . , BR, we have that

    L_t^0(W_t, R) = v_t^*(W_t, R) - 2B^v \quad \text{and} \quad L_t^{k+1}(W_t, R) = \frac{L_t^k(W_t, R) + (HL^k)_t(W_t, R)}{2}.    (22)

At the end of the planning horizon T, LTk(WT, R) = 0 for all k ≥ 0. The proposition below introduces the required properties of the deterministic sequence {Lk(Wt, R)}k≥0. Its proof is deferred to the appendix.

Proposition 2. For t = 0, . . . , T − 1, information vector Wt ∈ Wt and resource levels R = 1, . . . , BR,

    L_t^k(W_t, R) \text{ is monotone decreasing},    (23)
    (HL^k)_t(W_t, R) \ge L_t^{k+1}(W_t, R) \ge L_t^k(W_t, R),    (24)
    L_t^k(W_t, R) < v_t^*(W_t, R),    (25)

and

    \lim_{k \to \infty} L_t^k(W_t, R) = v_t^*(W_t, R).

The deterministic sequence {Utk(Wt, R)}k≥0 is defined in a symmetrical way. It also has the properties stated in proposition 2, with the reversed inequality signs.

We move on to define the random index N̄ that is used to indicate when an iteration of the algorithm is large enough for convergence analysis purposes. Let N̄ be the smallest integer such that all states (actions) visited (taken) by the algorithm after iteration N̄ are accumulation points of the sequence of states (actions) generated by the algorithm. In fact, N̄ can be required to satisfy other constraints of the type: if an event did not happen infinitely often, then it did not happen after N̄. Since we need N̄ to be finite almost surely, the number of additional constraints has to be finite.

We introduce the set of iterations, namely Nt(Wt, R), that keeps track of the effects produced by the projection operation. For Wt ∈ Wt and R = 1, . . . , BR, let Nt(Wt, R) be the set of iterations in which the unprojected slope corresponding to state (Wt, R), that is, ztn(Wt, R), was too large and had to be decreased by the projection operation. Formally, Nt(Wt, R) = {n ∈ IN : ztn(Wt, R) > v̄tn(Wt, R)}. For example, based on figure 3c, n ∈ Nt(Wtn, Rtn + 2). A related set is the set of states S̃t. A state (Wt, R) is an element of S̃t if (Wt, R) is equal to an accumulation point (Wt∗, Rtx,∗) of {(Wtn, Rtx,n)}n≥0 or is equal to (Wt∗, Rtx,∗ + 1). Its corresponding approximate slope also has to satisfy the condition ztn(Wt, R) ≤ v̄tn(Wt, R) for all n ≥ N̄, that is, the projection operation decreased or kept the same the corresponding unprojected slopes infinitely often.

We close this section dealing with measurability issues. Let (Ω, F, Prob) be the probability space under consideration. The sigma-algebra F is defined by F = σ{(Wtn, xtn), n ≥ 1, t = 0, . . . , T}. Moreover, for n ≥ 1 and t = 0, . . . , T,

    \mathcal{F}_t^n = \sigma\Big( \{ (W_{t'}^m, x_{t'}^m),\ 0 < m < n,\ t' = 0, \dots, T \} \cup \{ (W_{t'}^n, x_{t'}^n),\ t' = 0, \dots, t \} \Big).

Clearly, Ftn ⊂ Ft+1n and FTn ⊂ F0n+1. Furthermore, given the initial slopes v̄t0(Wt) and the initial resource level r̄, we have that Rtn, Rtx,n and αtn are in Ftn, while v̂t+1n(Rtx), ztn(Wt) and v̄tn(Wt) are in Ft+1n. A pointwise argument is used in all the proofs of almost sure convergence

presented in this paper. Thus, zero-measure events are discarded on an as-needed basis.

5.3 Almost sure convergence of the slopes

We prove that the approximation slopes produced by the SPAR-Storage algorithm converge almost surely to the slopes of the optimal value functions of the storage class for states that can be reached by an optimal policy. This result is stated in theorem 1 below. Along with the proof of the theorem, we present the noise and the bounding stochastic sequences and introduce three technical lemmas. Their proofs are given in the appendix so that the main reasoning is not disrupted.

Before we present the theorem establishing the convergence of the value function, we introduce the following technical condition. Given k ≥ 0 and t = 0, . . . , T − 1, we assume there exists a positive random variable Ntk such that on {n ≥ Ntk},

    (HL^k)_t(W_t^n, R_t^n) \le (H\bar v^{n-1})_t(W_t^n, R_t^n) \le (HU^k)_t(W_t^n, R_t^n) \quad a.s.,    (26)
    (HL^k)_t(W_t^n, R_t^n + 1) \le (H\bar v^{n-1})_t(W_t^n, R_t^n + 1) \le (HU^k)_t(W_t^n, R_t^n + 1) \quad a.s.    (27)

Proving this condition is nontrivial. It draws on a proof technique in Bertsekas & Tsitsiklis (1996), Section 4.3.6, although this technique requires an exploration policy that ensures that each state is visited infinitely often. Nascimento & Powell (2009) (section 6.1) show how this proof can be adapted to a similar concave optimization problem that uses pure exploration. The proof for our problem is essentially the same, and is not duplicated here.

Theorem 1. Assume the stepsize conditions (15)–(16). Also assume (26) and (27). Then, for all k ≥ 0 and t = 0, . . . , T, on the event that (Wt∗, Rtx,∗) is an accumulation point of {(Wtn, Rtx,n)}n≥0, the sequences of slopes {v̄tn(Wt∗, Rtx,∗)}n≥0 and {v̄tn(Wt∗, Rtx,∗ + 1)}n≥0 generated by the SPAR-Storage algorithm for the storage class converge almost surely to the optimal slopes vt∗(Wt∗, Rtx,∗) and vt∗(Wt∗, Rtx,∗ + 1), respectively.

Proof: As discussed in section 5.1, since the deterministic sequences {Ltk(Wt, Rtx)}k≥0 and {Utk(Wt, Rtx)}k≥0 do converge to the optimal slopes, the convergence of the approximation sequences is obtained by showing that for each k ≥ 0 there exists a nonnegative random variable Nt∗,k such that on the event that n ≥ Nt∗,k and (Wt∗, Rtx,∗) is an accumulation point of {(Wtn, Rtx,n)}n≥0, we have

    L_t^k(W_t^*, R_t^{x,*}) \le \bar v_t^n(W_t^*, R_t^{x,*}) \le U_t^k(W_t^*, R_t^{x,*}) \quad a.s.,
    L_t^k(W_t^*, R_t^{x,*}+1) \le \bar v_t^n(W_t^*, R_t^{x,*}+1) \le U_t^k(W_t^*, R_t^{x,*}+1) \quad a.s.

We concentrate on the inequalities Ltk(Wt∗, Rtx,∗) ≤ v̄tn(Wt∗, Rtx,∗) and Ltk(Wt∗, Rtx,∗ + 1) ≤ v̄tn(Wt∗, Rtx,∗ + 1). The upper bounds are obtained using a symmetrical argument.

The proof is by backward induction on t. The base case t = T is trivial as LTk(WT, R) = v̄Tn(WT, R) = 0 for all WT ∈ WT, R = 1, . . . , BR, k ≥ 0 and iterations n ≥ 0. Thus, we can pick, for example, NT∗,k = N̄, where N̄, as defined in section 5.2, is a random variable that denotes when an iteration of the algorithm is large enough for convergence analysis purposes. The backward induction proof is completed when we prove, for t = T − 1, . . . , 0 and k ≥ 0, that there exists Nt∗,k such that on the event that n ≥ Nt∗,k and (Wt∗, Rtx,∗) is an accumulation point of {(Wtn, Rtx,n)}n≥0,

    L_t^k(W_t^*, R_t^{x,*}) \le \bar v_t^n(W_t^*, R_t^{x,*}) \quad \text{and} \quad L_t^k(W_t^*, R_t^{x,*}+1) \le \bar v_t^n(W_t^*, R_t^{x,*}+1) \quad a.s.    (28)

Given the induction hypothesis for t + 1, the proof for time period t is divided into two parts. In the first part, we prove for all k ≥ 0 that there exists a nonnegative random variable Ntk such that

    L_t^k(W_t^*, \tilde R_t^{x,*}) \le \bar v_t^{n-1}(W_t^*, \tilde R_t^{x,*}) \quad a.s. \text{ on } \{\, n \ge N_t^k,\ (W_t^*, \tilde R_t^{x,*}) \in \tilde S_t \,\}.    (29)

Its proof is by induction on k. Note that it only applies to states in the random set S̃t. Then, again for t, we take on the second part, which takes care of the states not covered by the first part, proving the existence of a nonnegative random variable Nt∗,k such that the lower bound inequalities are true on {n ≥ Nt∗,k} for all accumulation points of {(Wtn, Rtx,n)}n≥0.

We start the backward induction on t. Pick ω ∈ Ω. We omit the dependence of the random elements on ω for compactness. Remember that the base case t = T is trivial and we pick NT∗,k = N̄. We also pick, for convenience, NTk = N̄.

Induction Hypothesis: Given t = T − 1, . . . , 0, assume, for t + 1 and all k ≥ 0, the existence of integers Nt+1k and Nt+1∗,k such that, for all n ≥ Nt+1k, (29) is true, and, for all n ≥ Nt+1∗,k, the inequalities in (28) hold true for all accumulation points (Wt+1∗, Rt+1x,∗).

Part 1: For our fixed time period t, we prove for any k the existence of an integer Ntk such that for n ≥ Ntk, inequality (29) is true. The proof is by forward induction on k. We start with k = 0. For every state (Wt, R), we have that −Bv ≤ vt∗(Wt, R) ≤ Bv, implying, by definition, that Lt0(Wt, R) ≤ −Bv. Therefore, (29) is satisfied for all n ≥ 1, since we know that v̄tn−1(Wt, R) is bounded by −Bv and +Bv for all iterations. Thus, Nt0 = max{1, Nt+1∗,0} = Nt+1∗,0.

The induction hypothesis on k assumes that there exists Ntk such that, for all n ≥ Ntk, (29) is true. Note that we can always make Ntk larger than Nt+1∗,k, thus we assume that Ntk ≥ Nt+1∗,k.

that measures the error incurred by observing a sample slope. Using sˆnt+1 (R) we define for each Wt ∈ Wt the stochastic noise sequence {¯ snt (Wt , R)}n≥0 . We have that s¯nt (Wt , R) = 0 on {n < Ntk } and, on {n ≥ Ntk }, s¯nt (Wt , R) is equal to  x,n n n n x,n + (R + 1)1 n}) . max 0, (1 − α ¯ tn (Wt , R)) s¯n−1 (W , R) + α ¯ (W , R)ˆ s (R 1 t t {R>R {R≤R } t t t t+1 t t t

25

The sample slopes are defined in a way such that   E sˆnt+1 (R)|Ftn = 0.

(30)

This conditional expectation is called the unbiasedness property. This property, together with the martingale convergence theorem and the boundedness of both the sample slopes and the approximate slopes are crucial for proving that the noise introduced by the observation of the sample slopes, which replace the observation of true expectations, go to zero as the number of iterations of the algorithm goes to infinity, as is stated in the next lemma. Lemma 5.1. On the event that (Wt∗ , Rtx,∗ ) is an accumulation point of {(Wtn , Rtx,n )}n≥0 we have that {¯ snt (Wt∗ , Rtx,∗ )}n≥0 → 0

and

{¯ snt (Wt∗ , Rtx,∗ + 1)}n≥0 → 0 a.s.

(31)

Proof: [Proof of lemma 5.1] Given in the appendix. Returning to our pointwise argument, where we have fixed ω ∈ Ω, we use the convention that the minimum of an empty set is +∞. Let ( δ k = min

) ˜ x,∗ ) − Lk (W ∗ , R ˜ x,∗ ) (HLk )t (Wt∗ , R t t t t ˜ x,∗ ) ∈ S˜t , (HLk )t (W ∗ , R ˜ x,∗ ) > Lk (W ∗ , R ˜ x,∗ ) . : (Wt∗ , R t t t t t t 4

If δ k < +∞ we define an integer NL ≥ Ntk to be such that NY L −1



 ˜ tx,∗ ) ≤ 1/4 1−α ¯ tm (Wt∗ , R

˜ tx,∗ ) ≤ δ k , s¯tn−1 (Wt∗ , R

and

(32)

m=Ntk

˜ tx,∗ ) ∈ S˜t . Such an NL exists because both (17) and (31) for all n ≥ NL and states (Wt∗ , R ˜ tx,∗ ) ∈ S˜t , (HLk )t (Wt∗ , R ˜ tx,∗ ) = Lkt (Wt∗ , R ˜ tx,∗ ) are true. If δ k = +∞ then, for all states (Wt∗ , R ˜ tx,∗ ) = Lkt (Wt∗ , R ˜ tx,∗ ) and we since (24) tells us that (HLk )t (Wt∗ ) ≥ Lkt (Wt∗ ). Thus, Lk+1 (Wt∗ , R t   ∗,k+1 and show that define the integer NL to be equal to Ntk . We let Ntk+1 = max NL , Nt+1 (29) holds for n ≥ Ntk+1 . ˜ tx,∗ ) ∈ S˜t . If Lk+1 ˜ tx,∗ ) = Lk (W ∗ , R ˜ tx,∗ ), then inequality We pick a state (Wt∗ , R (Wt∗ , R t t t (29) follows from the induction hypothesis. We therefore concentrate on the case where ˜ tx,∗ ) > Lkt (W ∗ , R ˜ tx,∗ ). Lk+1 (Wt∗ , R t t

26

First, we depart one more time from the pointwise argument to introduce the stochastic bounding sequence. We also state a lemma combining this sequence with the stochastic noise sequence. For Wt ∈ Wt and R = 1, . . . , B R , we have on {n < Ntk } that ¯ltn (Wt , R) = Lkt (Wt , R) and, on {n ≥ Ntk }, ¯ln (Wt , R) = (1 − α ¯ tn (Wt , R)(HLk )t (Wt , R). ¯ tn (Wt , R)) ¯ltn−1 (Wt , R) + α t The next lemma states that the noise and the stochastic bounding sequences can be used to provide a bound for the approximate slopes as follows. Lemma 5.2. On {n ≥ Ntk ,

˜ tx,∗ ) ∈ S˜t }, (Wt∗ , R

˜ tx,∗ ) ≥ ¯ln−1 (W ∗ , R ˜ tx,∗ ) − s¯n−1 (W ∗ , R ˜ tx,∗ ) a.s. v¯tn−1 (Wt∗ , R t t t t

(33)

Proof: [Proof of lemma 5.2] Given in the appendix.

Back to our fixed $\omega$, a simple inductive argument proves that $\bar{l}_t^n(W_t, R)$ is a convex combination of $L_t^k(W_t, R)$ and $(HL^k)_t(W_t, R)$. Therefore we can write
\[
\bar{l}_t^{n-1}(W_t^*, \tilde{R}_t^{x,*}) = \tilde{b}_t^{n-1}\, L_t^k(W_t^*, \tilde{R}_t^{x,*}) + \big(1 - \tilde{b}_t^{n-1}\big)\,(HL^k)_t(W_t^*, \tilde{R}_t^{x,*}),
\quad \text{where} \quad
\tilde{b}_t^{n-1} = \prod_{m=N_t^k}^{n-1}\big(1 - \bar{\alpha}_t^m(W_t^*, \tilde{R}_t^{x,*})\big).
\]
For $n \ge N_t^{k+1} \ge N_L$, we have $\tilde{b}_t^{n-1} \le 1/4$. Moreover, $L_t^k(W_t^*, \tilde{R}_t^{x,*}) \le (HL^k)_t(W_t^*, \tilde{R}_t^{x,*})$. Thus, using (22) and the definition of $\delta^k$, we obtain
\[
\begin{aligned}
\bar{l}_t^{n-1}(W_t^*, \tilde{R}_t^{x,*})
&\ge \tfrac{1}{4} L_t^k(W_t^*, \tilde{R}_t^{x,*}) + \tfrac{3}{4}(HL^k)_t(W_t^*, \tilde{R}_t^{x,*}) \\
&= \tfrac{1}{2} L_t^k(W_t^*, \tilde{R}_t^{x,*}) + \tfrac{1}{2}(HL^k)_t(W_t^*, \tilde{R}_t^{x,*}) + \tfrac{1}{4}\big((HL^k)_t(W_t^*, \tilde{R}_t^{x,*}) - L_t^k(W_t^*, \tilde{R}_t^{x,*})\big) \\
&\ge L_t^{k+1}(W_t^*, \tilde{R}_t^{x,*}) + \delta^k. \qquad (34)
\end{aligned}
\]
Combining (33) and (34), we obtain, for all $n \ge N_t^{k+1} \ge N_L$,
\[
\begin{aligned}
\bar{v}_t^{n-1}(W_t^*, \tilde{R}_t^{x,*})
&\ge L_t^{k+1}(W_t^*, \tilde{R}_t^{x,*}) + \delta^k - \bar{s}_t^{n-1}(W_t^*, \tilde{R}_t^{x,*}) \\
&\ge L_t^{k+1}(W_t^*, \tilde{R}_t^{x,*}) + \delta^k - \delta^k \\
&= L_t^{k+1}(W_t^*, \tilde{R}_t^{x,*}),
\end{aligned}
\]
where the last inequality follows from (32).


Part 2: We continue to consider the $\omega$ picked at the beginning of the proof of the theorem. In this part, we take care of the states $(W_t^*, R_t^{x,*})$ that are accumulation points but are not in $\tilde{S}_t$. In contrast to Part 1, the proof technique here is not by forward induction on $k$. We rely entirely on the definition of the projection operation and on the elements defined in section 5.2, as this part of the proof is all about states for which the projection operation decreased the corresponding approximate slopes infinitely often, which might happen when some of the optimal slopes are equal. Of course this fact is not verifiable in advance, as the optimal slopes are unknown.

Remember that at iteration $n$, time period $t$, we observe the sample slopes $\hat{v}_{t+1}^n(R_t^{x,n})$ and $\hat{v}_{t+1}^n(R_t^{x,n}+1)$, and it is always the case that $\hat{v}_{t+1}^n(R_t^{x,n}) \ge \hat{v}_{t+1}^n(R_t^{x,n}+1)$, implying that the resulting temporary slope $z_t^n(W_t^n, R_t^{x,n})$ is at least as large as $z_t^n(W_t^n, R_t^{x,n}+1)$. Therefore, according to our projection operator, the updated slopes $\bar{v}_t^n(W_t^n, R_t^{x,n})$ and $\bar{v}_t^n(W_t^n, R_t^{x,n}+1)$ are always equal to $z_t^n(W_t^n, R_t^{x,n})$ and $z_t^n(W_t^n, R_t^{x,n}+1)$, respectively. Due to our stepsize rule, as described in section 4, the slopes corresponding to $(W_t^n, R_t^{x,n})$ and $(W_t^n, R_t^{x,n}+1)$ are the only ones updated due to a direct observation of sample slopes at iteration $n$, time period $t$. All the other slopes are modified only if a violation of the monotone decreasing property occurs. Therefore, the slopes corresponding to states with an information vector $W_t \in \mathcal{W}_t$ different from $W_t^n$, no matter the resource level $R = 1, \dots, B^R$, remain the same at iteration $n$, time period $t$; that is, $\bar{v}_t^{n-1}(W_t, R) = z_t^n(W_t, R) = \bar{v}_t^n(W_t, R)$. On the other hand, the temporary slopes corresponding to states with information vector $W_t^n$ and resource levels smaller than $R_t^{x,n}$ can only be increased by the projection operation; if necessary, they are increased to be equal to $\bar{v}_t^n(W_t^n, R_t^{x,n})$. Similarly, the temporary slopes corresponding to states with information vector $W_t^n$ and resource levels greater than $R_t^{x,n}+1$ can only be decreased by the projection operation; if necessary, they are decreased to be equal to $\bar{v}_t^n(W_t^n, R_t^{x,n}+1)$; see figure 3c.

Keeping the previous discussion in mind, it is easy to see that, for each $W_t^* \in \mathcal{W}_t$, if $R_t^{Min}$ is the minimum resource level such that $(W_t^*, R_t^{Min})$ is an accumulation point of $\{(W_t^n, R_t^{x,n})\}_{n \ge 0}$, then the slope corresponding to $(W_t^*, R_t^{Min})$ could only be decreased by the projection operation at a finite number of iterations, as a decreasing requirement could only originate from a resource level smaller than $R_t^{Min}$. However, no state with information vector $W_t^*$ and resource level smaller than $R_t^{Min}$ is visited by the algorithm after iteration $\bar{N}$ (as defined in section 5.2), since only accumulation points are visited after $\bar{N}$. We thus have that $(W_t^*, R_t^{Min})$ is an element of the set $\tilde{S}_t$, showing that $\tilde{S}_t$ is a proper set. Hence, for all states $(W_t^*, R_t^{x,*})$ that are accumulation points of $\{(W_t^n, R_t^{x,n})\}_{n \ge 0}$ and are not elements of $\tilde{S}_t$, there exists another state $(W_t^*, \tilde{R}_t^{x,*})$, where $\tilde{R}_t^{x,*}$ is the maximum resource level smaller than $R_t^{x,*}$ such that $(W_t^*, \tilde{R}_t^{x,*}) \in \tilde{S}_t$. We argue that for all resource levels $R$ with $\tilde{R}_t^{x,*} < R \le R_t^{x,*}$, we have that $|\mathcal{N}_t(W_t^*, R)| = \infty$. Figure 4 illustrates the situation.

Figure 4: Illustration of technical elements related to the projection operation.
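The projection behavior just described can be sketched in a few lines of code. The following is an illustrative implementation under simplifying assumptions (a single information vector, slopes held in a plain list indexed by resource level); it is a sketch of the idea, not the paper's implementation.

def project_monotone(z, r):
    """Illustrative sketch of the projection step described above.

    z : list of temporary slopes z[R], monotone decreasing except possibly
        around the two freshly updated entries r and r + 1.
    r : index of the post-decision resource level R_t^{x,n} updated at this iteration.

    Entries r and r+1 are kept as observed; entries below r are raised to z[r]
    if they fall under it, entries above r+1 are lowered to z[r+1] if they exceed it.
    """
    v = list(z)
    # Slopes for resource levels smaller than r can only be increased.
    for i in range(r - 1, -1, -1):
        if v[i] < v[r]:
            v[i] = v[r]
    # Slopes for resource levels greater than r + 1 can only be decreased.
    for i in range(r + 2, len(v)):
        if v[i] > v[r + 1]:
            v[i] = v[r + 1]
    return v

# Example: the update at r = 2 broke monotonicity on both sides.
print(project_monotone([5.0, 4.0, 4.5, 3.0, 3.5, 1.0], r=2))
# -> [5.0, 4.5, 4.5, 3.0, 3.0, 1.0]

An update can therefore only propagate leftward as increases and rightward as decreases, which is exactly the property exploited in this part of the proof.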

As introduced in section 5.2, we have that $\mathcal{N}_t(W_t^*, R) = \{n \in \mathbb{N} : z_t^n(W_t^*, R) > \bar{v}_t^n(W_t^*, R)\}$. By definition, the sets $\tilde{S}_t$ and $\mathcal{N}_t(W_t^*, R)$ share the following relationship: given that $(W_t^*, R)$ is an accumulation point of $\{(W_t^n, R_t^{x,n})\}_{n \ge 0}$, then $|\mathcal{N}_t(W_t^*, R)| = \infty$ if and only if the state $(W_t^*, R)$ is not an element of $\tilde{S}_t$. Therefore, $|\mathcal{N}_t(W_t^*, R_t^{x,*})| = \infty$; otherwise $(W_t^*, R_t^{x,*})$ would be an element of $\tilde{S}_t$. If $\tilde{R}_t^{x,*} = R_t^{x,*} - 1$ we are done. If $\tilde{R}_t^{x,*} < R_t^{x,*} - 1$, we have to consider two cases, namely that $(W_t^*, R_t^{x,*} - 1)$ is an accumulation point and that $(W_t^*, R_t^{x,*} - 1)$ is not an accumulation point. For the first case, we have that $|\mathcal{N}_t(W_t^*, R_t^{x,*} - 1)| = \infty$ from the fact that this state is not an element of $\tilde{S}_t$. For the second case, since $(W_t^*, R_t^{x,*} - 1)$ is not an accumulation point, its corresponding slope is never updated due to a direct observation of sample slopes for $n \ge \bar{N}$, by the definition of $\bar{N}$. Moreover, every time the slope of $(W_t^*, R_t^{x,*})$ is decreased due to a projection (which is coming from the left), the slope of $(W_t^*, R_t^{x,*} - 1)$ has to be decreased as well. Therefore, $\mathcal{N}_t(W_t^*, R_t^{x,*}) \cap \{n \ge \bar{N}\} \subseteq \mathcal{N}_t(W_t^*, R_t^{x,*} - 1) \cap \{n \ge \bar{N}\}$, implying that $|\mathcal{N}_t(W_t^*, R_t^{x,*} - 1)| = \infty$. We then apply the same reasoning to the states $(W_t^*, R_t^{x,*} - 2), \dots, (W_t^*, \tilde{R}_t^{x,*} + 1)$, obtaining that the corresponding sets of iterations have an infinite number of elements. The same reasoning applies to states $(W_t^*, R_t^{x,*} + 1)$ that are not in $\tilde{S}_t$.

We state a lemma that is the key element for the proof of Part 2, once again departing from the pointwise argument.

Lemma 5.3. Given an information vector $W_t \in \mathcal{W}_t$ and a resource level $R = 1, \dots, B^R - 1$, if for all $k \ge 0$ there exists an integer random variable $N^k(W_t, R)$ such that $L_t^k(W_t, R) \le \bar{v}_t^{n-1}(W_t, R)$ almost surely on $\{n \ge N^k(W_t, R),\; |\mathcal{N}_t(W_t, R+1)| = \infty\}$, then for all $k \ge 0$ there exists another integer random variable $N^k(W_t, R+1)$ such that $L_t^k(W_t, R+1) \le \bar{v}_t^{n-1}(W_t, R+1)$ almost surely on $\{n \ge N^k(W_t, R+1)\}$.

Proof: [Proof of lemma 5.3] Given in the appendix.

Using the properties of the projection operator, we return to the proof of Part 2 and to our fixed $\omega$. Pick $k \ge 0$ and a state $(W_t^*, R_t^{x,*})$ that is an accumulation point but is not in $\tilde{S}_t$. (The same argument applies if $(W_t^*, R_t^{x,*} + 1) \notin \tilde{S}_t$.) Consider the state $(W_t^*, \tilde{R}_t^{x,*})$, where $\tilde{R}_t^{x,*}$ is the maximum resource level smaller than $R_t^{x,*}$ such that $(W_t^*, \tilde{R}_t^{x,*}) \in \tilde{S}_t$. This state satisfies the condition of lemma 5.3 with $N^k(W_t^*, \tilde{R}_t^{x,*}) = N_t^k$ (from Part 1 of the proof). Thus, we can apply this lemma in order to obtain, for all $k \ge 0$, an integer $N^k(W_t^*, \tilde{R}_t^{x,*} + 1)$ such that $L_t^k(W_t^*, \tilde{R}_t^{x,*} + 1) \le \bar{v}_t^{n-1}(W_t^*, \tilde{R}_t^{x,*} + 1)$ for all $n \ge N^k(W_t^*, \tilde{R}_t^{x,*} + 1)$.

After that, we use lemma 5.3 again, this time considering the state $(W_t^*, \tilde{R}_t^{x,*} + 1)$. Note that the first application of lemma 5.3 gave us the integer $N^k(W_t^*, \tilde{R}_t^{x,*} + 1)$, which is necessary to fulfill the conditions of this second usage of the lemma. We repeat the same reasoning, applying lemma 5.3 successively to the states $(W_t^*, \tilde{R}_t^{x,*} + 2), \dots, (W_t^*, R_t^{x,*} - 1)$. In the end, we obtain, for each $k \ge 0$, an integer $N^k(W_t^*, R_t^{x,*})$ such that $L_t^k(W_t^*, R_t^{x,*}) \le \bar{v}_t^{n-1}(W_t^*, R_t^{x,*})$ for all $n \ge N^k(W_t^*, R_t^{x,*})$. Figure 5 illustrates this process.

Figure 5: Successive applications of lemma 5.3.

Finally, if we pick $N_t^{*,k}$ to be greater than the $N_t^k$ of Part 1 and greater than $N^k(W_t^*, R_t^{x,*})$ and $N^k(W_t^*, R_t^{x,*} + 1)$ for all accumulation points $(W_t^*, R_t^{x,*})$ that are not in $\tilde{S}_t$, then (28) is true for all accumulation points and $n \ge N_t^{*,k}$.

5.4 Optimality of the Decisions

We finish the convergence analysis by proving that, with probability one, the algorithm learns an optimal decision for all states that can be reached by an optimal policy.

Theorem 2. Assume the conditions of Theorem 1 are satisfied. For $t = 0, \dots, T$, on the event that $(W_t^*, R_t^*, \bar{v}_t^*, x_t^*)$ is an accumulation point of the sequence $\{(W_t^n, R_t^n, \bar{v}_t^{n-1}, x_t^n)\}_{n \ge 1}$ generated by the SPAR-Storage algorithm, $x_t^*$ is almost surely an optimal solution of
\[
\max_{x_t \in \mathcal{X}_t(W_t^*, R_t^*)} F_t\big(v_t^*(W_t^*), W_t^*, R_t^*, x_t\big), \qquad (35)
\]
where
\[
F_t\big(v_t^*(W_t^*), W_t^*, R_t^*, x_t\big) = C_t(W_t^*, R_t^*, x_t) + \gamma\, V_t^*\Big(W_t^*,\; R_t^* - \sum_{i=1}^{B_t^d} x_{ti}^d + \sum_{i=1}^{B_t^s} x_{ti}^s - x_t^{tr}\Big).
\]

Proof: Fix $\omega \in \Omega$. As before, the dependence on $\omega$ is omitted. At each iteration $n$ and time $t$ of the algorithm, the decision $x_t^n$ in STEP 3 of the algorithm is an optimal solution to the optimization problem $\max_{x_t \in \mathcal{X}_t(W_t^n, R_t^n)} F_t(\bar{v}_t^{n-1}(W_t^n), W_t^n, R_t^n, x_t)$. Since $F_t(\bar{v}_t^{n-1}(W_t^n), W_t^n, R_t^n, \cdot)$ is concave and $\mathcal{X}_t(W_t^n, R_t^n)$ is convex, we have that
\[
0 \in \partial F_t(\bar{v}_t^{n-1}(W_t^n), W_t^n, R_t^n, x_t^n) + \mathcal{X}_t^N(W_t^n, R_t^n, x_t^n),
\]
where $\partial F_t(\bar{v}_t^{n-1}(W_t^n), W_t^n, R_t^n, x_t^n)$ is the subdifferential of $F_t(\bar{v}_t^{n-1}(W_t^n), W_t^n, R_t^n, \cdot)$ at $x_t^n$ and $\mathcal{X}_t^N(W_t^n, R_t^n, x_t^n)$ is the normal cone of $\mathcal{X}_t(W_t^n, R_t^n)$ at $x_t^n$. Then, by passing to the limit, we can conclude that each accumulation point $(W_t^*, R_t^*, \bar{v}_t^*, x_t^*)$ of the sequence $\{(W_t^n, R_t^n, \bar{v}_t^{n-1}, x_t^n)\}_{n \ge 1}$ satisfies the condition
\[
0 \in \partial F_t(\bar{v}_t^*(W_t^*), W_t^*, R_t^*, x_t^*) + \mathcal{X}_t^N(W_t^*, R_t^*, x_t^*).
\]
We now derive an expression for the subdifferential. We have that
\[
\partial F_t(\bar{v}_t^*(W_t^*), W_t^*, R_t^*, x_t) = \nabla C_t(W_t^*, R_t^*, x_t) + \gamma\, \partial \bar{V}_t^*(W_t^*, R_t^* + A \cdot x_t),
\]
where $A' = (\underbrace{-1, \dots, -1}_{B^d}, \underbrace{1, \dots, 1}_{B^s}, -1)$, that is, $A \cdot x_t = -\sum_{i=1}^{B_t^d} x_{ti}^d + \sum_{i=1}^{B_t^s} x_{ti}^s - x_t^{tr}$. From (Bertsekas et al., 2003, Proposition 4.2.5), for $x_t \in \mathbb{N}^{B^d + B^s + 1}$,
\[
\partial \bar{V}_t^*(W_t^*, R_t^* + A \cdot x_t) = \big\{(A_1 y, \dots, A_{B^d + B^s + 1}\, y)^T : y \in [\bar{v}_t^*(W_t^*, R_t^* + A \cdot x_t + 1),\; \bar{v}_t^*(W_t^*, R_t^* + A \cdot x_t)]\big\}.
\]
Therefore, as $x_t^*$ is integer valued,
\[
\partial F_t(\bar{v}_t^*(W_t^*), W_t^*, R_t^*, x_t^*) = \big\{\nabla C_t(W_t^*, R_t^*, x_t^*) + \gamma A y : y \in [\bar{v}_t^*(W_t^*, R_t^* + A \cdot x_t^* + 1),\; \bar{v}_t^*(W_t^*, R_t^* + A \cdot x_t^*)]\big\}.
\]
Since $(W_t^*, R_t^* + A \cdot x_t^*)$ is an accumulation point of $\{(W_t^n, R_t^{x,n})\}_{n \ge 0}$, it follows from Theorem 1 that $\bar{v}_t^*(W_t^*, R_t^* + A \cdot x_t^*) = v_t^*(W_t^*, R_t^* + A \cdot x_t^*)$ and $\bar{v}_t^*(W_t^*, R_t^* + A \cdot x_t^* + 1) = v_t^*(W_t^*, R_t^* + A \cdot x_t^* + 1)$. Hence, $\partial F_t(\bar{v}_t^*(W_t^*), W_t^*, R_t^*, x_t^*) = \partial F_t(v_t^*(W_t^*), W_t^*, R_t^*, x_t^*)$ and
\[
0 \in \partial F_t(v_t^*(W_t^*), W_t^*, R_t^*, x_t^*) + \mathcal{X}_t^N(W_t^*, R_t^*, x_t^*),
\]
which proves that $x_t^*$ is an optimal solution of (35).
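As an illustration of how the slopes $\bar{v}_t$ define the piecewise linear value function approximation, and of the post-decision resource transition $R_t^* + A \cdot x_t$ appearing above, the following sketch evaluates such an approximation under the assumption that it is anchored at $\bar{V}_t(W_t, 0) = 0$; the function and variable names are ours, not the paper's.

def post_decision_resource(R, x_d, x_s, x_tr):
    """R + A.x: resource left after serving demands x_d, adding supplies x_s,
    and transferring x_tr out of storage (all integer quantities)."""
    return R - sum(x_d) + sum(x_s) - x_tr

def value_from_slopes(slopes, R):
    """Evaluate a concave piecewise linear approximation V_bar(W, R) from its
    slopes v_bar(W, 1), ..., v_bar(W, R), assuming V_bar(W, 0) = 0."""
    return sum(slopes[:R])

# Hypothetical example: monotone decreasing slopes for one information vector W.
slopes = [10.0, 8.0, 5.0, 2.0, 1.0]                 # v_bar(W, 1..5)
R_post = post_decision_resource(R=4, x_d=[1, 1], x_s=[0], x_tr=0)
print(R_post, value_from_slopes(slopes, R_post))    # -> 2 18.0

Because the approximation is concave and separable in the storage level, the single-period decision problems remain linear programs in $x_t$.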

6 Numerical experiments

We have implemented the SPAR-Storage algorithm in an energy policy model called SMART (Stochastic Multiscale model for the Analysis of energy Resources, Technology and policy), described in Powell et al. (2009) (the numerical results reported here are based on this paper). The problem involved modeling the allocation of 15 types of energy resources, where some were converted to electricity and distributed over the grid, while others (notably transportation fuels and home heating needs) captured the conversion and movement of physical fuels to meet various demands. The model represents energy supplies and demand for the western United States at an aggregate level. The model captures hourly variations in wind, solar and demands over a one-year horizon. Each time period involves a fairly small linear program (comparable to the model in section 2) that is linked over time through the presence of a single, grid-level storage facility in the form of a reservoir for hydroelectric power. Water inflow to the reservoir, in the form of rainfall and snow melt, is a highly seasonal process, with typically high input in the late winter and spring. Generating electricity from hydroelectric power is extremely cheap on the margin, but there are benefits to holding water in the reservoir until it is needed in the summer. There is a limit on how much water can be stored. In the optimal solution, the tendency is to begin storing water late in the winter, hitting the maximum just before we begin drawing it down for the summer months with no rainfall. This means that the algorithm has to be smart enough to realize, in March, that there is a positive value for storing water that is not needed until August, thousands of time periods later.

We undertook two sets of experiments to test the performance of the algorithm. The first considered a deterministic dataset which could be solved optimally as a linear program over all 8,760 time periods at the same time. This is a good test of the algorithm, since the ADP algorithm is not allowed to take advantage of the deterministic future. The deterministic model was run with a single rainfall scenario and two demand patterns: the first used constant demand over the entire year, while the second used a sinusoidal curve to test more complex behaviors for storing and releasing water. The ADP algorithm was run for 200 iterations (requiring approximately 5 seconds per iteration for the entire year), and produced an objective function within 0.02 percent of the optimal solution produced by the linear programming model.

Figure 6: Optimal reservoir levels ((a) and (c)) and reservoir levels from the ADP algorithm ((b) and (d)) for constant and sinusoidal demand (from Powell et al. (2009)). Each panel plots reservoir level, rainfall and demand over time: (a) optimal, constant demand; (b) ADP, constant demand; (c) optimal, sinusoidal demand; (d) ADP, sinusoidal demand.

Despite this close agreement, we were interested in seeing the actual behavior in terms of the amount held in storage, since small differences in the objective function can hide significant differences in the decisions. Figure 6 shows the rainfall, demand and, most important, the amount of water in storage at each time period as produced by the ADP algorithm, and the optimal solution from the linear programming model. Clearly there is very close agreement. Note also that in both runs, the ADP solution was able to replicate the behavior of the optimal solution, where water was held in storage for over 2000 hours.

The ADP algorithm, of course, can also handle uncertainty. This was tested in the form of stochastic rainfall. Figure 7 shows the reservoir level produced by the ADP algorithm (as a single sample path taken from the last iteration). Also shown are the optimal solutions produced by the linear programming model when optimized over each sample path for rainfall. Each of these so-called optimal solutions is allowed to see into the future. Not surprisingly, the optimal, prescient solutions for each sample path are able to hold off longer before storing water, whereas the ADP solution, quite reasonably, stores water earlier in the season because of the possibility that the rainfall might be lower than expected.

Figure 7: Reservoir level over the time periods, produced by ADP (last iteration) for stochastic precipitation, and optimal reservoir levels produced by the LP when optimized over individual scenarios (from Powell et al. (2009)).

The ADP-based algorithm in the SMART model was found to offer some attractive features. First, while this paper focuses purely on storage tested over a one-year horizon, in the SMART model it was used in the context of an energy policy model that extended for 20 years (over 175,000 time periods) with run times of only a few hours on a PC. The same ADP algorithm described in this paper was also used to model energy investment decisions (how many megawatts of capacity should we have for ethanol, wind, solar, coal, natural gas, ...), but in this case, we used separable approximations of a nonseparable value function. For this setting, we can no longer prove convergence, but we did obtain results within 0.24 percent of the linear programming optimal solution, suggesting that the general algorithmic strategy scales to vector-valued resource allocation problems. This finding is in line with other algorithmic research with this class of algorithms (in particular, see Topaloglu & Powell (2006)).
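The deterministic benchmark discussed above (a single linear program with perfect foresight over the whole horizon) can be pictured with a toy reservoir problem. The sketch below uses hypothetical data, a tiny horizon and scipy's linprog; it is an illustration of the benchmark idea, not the SMART model or its data.

# A toy version of the deterministic benchmark: with perfect foresight, the whole
# horizon is one linear program. Data, variable names and costs are hypothetical.
import numpy as np
from scipy.optimize import linprog

T = 8                                    # a tiny horizon instead of 8,760 hours
inflow = np.array([5, 6, 7, 4, 1, 0, 0, 0], dtype=float)
demand = np.array([3, 3, 3, 3, 4, 5, 6, 6], dtype=float)
S0, S_max = 2.0, 15.0                    # initial storage and reservoir capacity
c_thermal = 10.0                         # marginal cost of thermal generation

# Decision vector x = [release_0..release_{T-1}, thermal_0..thermal_{T-1}]
c = np.concatenate([np.zeros(T), c_thermal * np.ones(T)])  # hydro is free on the margin

# Demand balance: release_t + thermal_t = demand_t
A_eq = np.hstack([np.eye(T), np.eye(T)])
b_eq = demand

# Storage bounds: 0 <= S0 + sum_{tau<=t}(inflow_tau - release_tau) <= S_max
L = np.tril(np.ones((T, T)))             # cumulative-sum operator
cum_inflow = L @ inflow
A_ub = np.vstack([np.hstack([ L, np.zeros((T, T))]),   #  cumulative release <= S0 + cumulative inflow
                  np.hstack([-L, np.zeros((T, T))])])  # -cumulative release <= S_max - S0 - cumulative inflow
b_ub = np.concatenate([S0 + cum_inflow, S_max - S0 - cum_inflow])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (2 * T), method="highs")
print(res.fun, res.x[:T])                # optimal cost and the hydro release schedule

The ADP algorithm, by contrast, never sees the whole inflow sequence at once; it only learns, through the slopes of the value function, how much future value a marginal unit of stored water carries.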


7 Summary

We propose an approximate dynamic programming algorithm using pure exploitation for the problem of planning energy flows over a year, in hourly increments, in the presence of a single storage device, taking advantage of our ability to represent the value function as a piecewise linear function. We are able to prove almost sure convergence of the algorithm using a pure exploitation strategy. This ability is critical for this application, since we discretized each value function (for 8,760 time periods) into thousands of breakpoints. Rather than needing to visit all the possible values of the storage level many times, we only need to visit a small number of these points (for each value of our information vector $W_t$) many times.

A key feature of our algorithm is that it is able to handle high-dimensional decisions in each time period. For the energy application, this means that we can solve the relatively small (but with hundreds of dimensions) decision problem that determines how much energy is coming from each energy source and being converted to serve different types of energy demands. We are able to solve sequences of deterministic linear programs very quickly by formulating the value function around the post-decision state variable.

Our numerical experiments with the SMART model suggest that the resulting algorithm is quite fast and, when applied to a deterministic problem, produced solutions that closely match the optimal solution produced by a linear program. This is a true test of the ADP algorithm, since it requires that it be able to determine when to start storing water in the reservoir so that we can meet demands late in the summer, thousands of hours later. The algorithm requires approximately five seconds to do a forward pass over all 8,760 hourly time periods within a year, and high quality solutions are obtained within 100 or 200 iterations.

Additional research is required to develop a practical algorithm in the presence of uncertainty when the information vector $W_t$ is large, since we have to estimate a piecewise linear value function for each possible value of $W_t$. There are a number of statistical techniques that can be used to address this problem. A possible line of investigation, described in George et al. (2008), uses hierarchical aggregation, where a piecewise linear value function is estimated for $W_t$ at different levels of aggregation, and an estimate of the value function is formed using a weighted sum of these different approximations. As the algorithm progresses, the weights are adjusted so that higher weights are put on more disaggregate levels. This strategy appears promising, but we need to show a) that this strategy would still produce a convergent algorithm and b) that it actually works in practice.
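As an illustration of the hierarchical aggregation idea referenced above (George et al. (2008)), the sketch below forms a value estimate as a weighted combination of estimates computed at different aggregation levels of $W_t$; the inverse-mean-squared-error weighting and all names are illustrative assumptions, not the paper's specification.

# Illustrative sketch of hierarchical aggregation: combine value estimates made at
# several aggregation levels of the information vector W_t into a single estimate.
# The inverse-MSE weighting below is one common choice, used here only as an example.

def aggregate_estimate(estimates, mses):
    """estimates[g] and mses[g] are the value estimate and its estimated
    mean squared error at aggregation level g (0 = most disaggregate)."""
    weights = [1.0 / m for m in mses]
    total = sum(weights)
    weights = [w / total for w in weights]
    return sum(w * v for w, v in zip(weights, estimates)), weights

# Hypothetical numbers: the disaggregate estimate is noisy early on (large MSE),
# so the aggregate levels receive most of the weight; once its MSE shrinks,
# the disaggregate level dominates.
early = aggregate_estimate(estimates=[12.0, 9.0, 8.0], mses=[25.0, 4.0, 2.0])
late  = aggregate_estimate(estimates=[12.0, 9.0, 8.0], mses=[0.5, 4.0, 2.0])
print(early)
print(late)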

Acknowledgements

The authors would like to acknowledge the valuable comments and suggestions of Andrzej Ruszczyński. This research was supported in part by AFOSR contract FA9550-08-10195.

References

Ahmed, S. & Shapiro, A. (2002), The sample average approximation method for stochastic programs with integer recourse, E-print available at http://www.optimization-online.org.

Barto, A. G., Bradtke, S. J. & Singh, S. P. (1995), 'Learning to act using real-time dynamic programming', Artificial Intelligence, Special Volume on Computational Research on Interaction and Agency 72, 81–138.

Bertsekas, D. (2005), Dynamic Programming and Stochastic Control Vol. I, Athena Scientific.

Bertsekas, D. & Tsitsiklis, J. (1996), Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.

Bertsekas, D., Nedic, A. & Ozdaglar, E. (2003), Convex Analysis and Optimization, Athena Scientific, Belmont, Massachusetts.

Brown, P. D., Peças-Lopes, J. A. & Matos, M. A. (2008), 'Optimization of pumped storage capacity in an isolated power system with large renewable penetration', IEEE Transactions on Power Systems 23(2), 523–531.

Castronuovo, E. D. & Lopes, J. A. P. (2004), 'On the Optimization of the Daily Operation of a Wind-Hydro Power Plant', IEEE Transactions on Power Systems 19(3), 1599–1606.

Chen, Z.-L. & Powell, W. B. (1999), 'A convergent cutting-plane and partial-sampling algorithm for multistage stochastic linear programs with recourse', Journal of Optimization Theory and Applications 102(3), 497–524.

Fishbone, L. & Abilock, H. (1981), 'MARKAL - A linear programming model for energy systems analysis: Technical description of the BNL version', International Journal of Energy Research 5, 353–375.

Garcia-Gonzalez, J., de la Muela, R. M. R., Santos, L. M. & Gonzalez, A. M. (2008), 'Stochastic Joint Optimization of Wind Generation and Pumped-Storage Units in an Electricity Market', IEEE Transactions on Power Systems 23(2), 460–468.

George, A., Powell, W. B. & Kulkarni, S. (2008), 'Value function approximation using multiple aggregation for multiattribute resource management', Journal of Machine Learning Research, pp. 2079–2111.

Godfrey, G. & Powell, W. B. (2002), 'An adaptive, dynamic programming algorithm for stochastic resource allocation problems I: Single period travel times', Transportation Science 36(1), 21–39.

Hernandez-Lerma, O. & Runggaldier, W. (1994), 'Monotone approximations for convex stochastic control problems', J. Math. Syst., Estim. and Control 4, 99–140.

Higle, J. & Sen, S. (1991), 'Stochastic decomposition: An algorithm for two stage linear programs with recourse', Mathematics of Operations Research 16(3), 650–669.

Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994), Convergence of stochastic iterative dynamic programming algorithms, in J. D. Cowan, G. Tesauro & J. Alspector, eds, 'Advances in Neural Information Processing Systems', Vol. 6, Morgan Kaufmann Publishers, San Francisco, pp. 703–710.

Lamont, A. (1997), User's Guide to the META*Net Economic Modeling System Version 2.0, Technical report, Lawrence Livermore National Laboratory.

Lamont, A. D. (2008), 'Assessing the long-term system value of intermittent electric generation technologies', Energy Economics 30(3), 1208–1231.

Messner, S. & Strubegger, M. (1995), User's Guide for MESSAGE III, Technical report, International Institute for Applied Systems Analysis (IIASA).

Nascimento, J. M. & Powell, W. B. (2009), 'An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem', Mathematics of Operations Research 34(1), 210–237.

NEMS (2003), The National Energy Modeling System: An Overview 2003, Technical report, Energy Information Administration, Office of Integrated Analysis and Forecasting, U.S. Department of Energy, Washington, DC 20585.

Powell, W. B. (2007), Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley and Sons, New York.

Powell, W. B., George, A., Lamont, A. & Stewart, J. (2009), 'SMART: A stochastic multiscale model for the analysis of energy resources, technology and policy', Informs Journal on Computing.

Powell, W. B., Ruszczyński, A. & Topaloglu, H. (2004), 'Learning algorithms for separable approximations of stochastic optimization problems', Mathematics of Operations Research 29(4), 814–836.

Shapiro, A. (2003), Monte Carlo sampling methods, in A. Ruszczyński & A. Shapiro, eds, 'Handbooks in Operations Research and Management Science: Stochastic Programming', Vol. 10, Elsevier, Amsterdam, pp. 353–425.

Shiryaev, A. (1996), Probability Theory, Vol. 95 of Graduate Texts in Mathematics, Springer-Verlag, New York.

Topaloglu, H. & Powell, W. B. (2006), 'Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems', Informs Journal on Computing 18(1), 31–42.

Tsitsiklis, J. (2002), 'On the convergence of optimistic policy iteration', Journal of Machine Learning Research 3, 59–72.

Tsitsiklis, J. N. (1994), 'Asynchronous stochastic approximation and Q-learning', Machine Learning 16, 185–202.

Van Slyke, R. & Wets, R. (1969), 'L-shaped linear programs with applications to optimal control and stochastic programming', SIAM Journal of Applied Mathematics 17(4), 638–663.

Vanderbei, R. (1997), Linear Programming: Foundations and Extensions, Kluwer's International Series.

Appendix

The appendix consists of a brief summary of notation to assist with reading the proofs, and the proofs themselves.

Notation

For each random element, we provide its measurability.

• Filtrations:
  $\mathcal{F} = \sigma\{(W_t^n, x_t^n),\; n \ge 1,\; t = 0, \dots, T\}$.
  $\mathcal{F}_t^n = \sigma\big(\{(W_{t'}^m, x_{t'}^m),\; 0 < m < n,\; t' = 0, \dots, T\} \cup \{(W_{t'}^n, x_{t'}^n),\; t' = 0, \dots, t\}\big)$.

• Post-decision state $(W_t^n, R_t^{x,n})$:
  $R_t^{x,n} \in \mathcal{F}_t^n$: resource level after the decision. Integer and bounded by 0 and $B^R$.
  $W_t^n \in \mathcal{F}_t^n$: Markovian information vector. Independent of the resource level.
  $D_t^n \in \mathcal{F}_t^n$: demand vector. Nonnegative and integer valued.
  $\mathcal{W}_t$: finite support set of $W_t^n$.

• Slopes (monotone decreasing in $R$ and bounded):
  $v_t^*(W_t, R)$: slope of the optimal value function at $(W_t, R)$.
  $z_t^n(W_t, R) \in \mathcal{F}_{t+1}^n$: unprojected slope of the value function approximation at $(W_t, R)$.
  $\bar{v}_t^n(W_t, R) \in \mathcal{F}_{t+1}^n$: slope of the value function approximation at $(W_t, R)$.
  $\bar{v}_t^*(W_t, R) \in \mathcal{F}$: accumulation point of $\{\bar{v}_t^n(W_t, R)\}_{n \ge 0}$.
  $\hat{v}_t^n(R) \in \mathcal{F}_t^n$: sample slope at $R$.

• Stepsizes (bounded by 0 and 1, sum is $+\infty$, sum of the squares is $< +\infty$):
  $\alpha_t^n \in \mathcal{F}_t^n$ and $\bar{\alpha}_t^n(W_t, R) = \alpha_t^n\big(1_{\{W_t = W_t^n,\, R = R_t^{x,n}\}} + 1_{\{W_t = W_t^n,\, R = R_t^{x,n}+1\}}\big)$.

• Finite random variable $\bar{N}$: an iteration large enough for the convergence analysis.

• Sets of iterations and states (due to the projection operation):
  $\mathcal{N}_t(W_t, R) \in \mathcal{F}$: iterations at which the unprojected slope at $(W_t, R)$ was decreased.
  $\tilde{S}_t \in \mathcal{F}$: states at which the projection has not decreased the unprojected slopes infinitely often.

• Dynamic programming operator $H$.

• Deterministic slope sequence $\{L_t^k(W_t, R)\}_{k \ge 0}$. Converges to $v_t^*(W_t, R)$.

• Error variable $\hat{s}_{t+1}^n(R) \in \mathcal{F}_{t+1}^n$.

• Stochastic noise sequence $\{\bar{s}_t^n(W_t, R)\}_{n \ge 0}$.

• Stochastic bounding sequence $\{\bar{l}_t^n(W_t, R)\}_{n \ge 0}$.
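As a small illustration of the stepsize rule above, the sketch below applies a stepsize only to the two resource levels visited at iteration $n$, producing the temporary slopes that are then passed to the projection; the flat stepsize value and all names are illustrative assumptions, not the paper's implementation.

def smooth_slopes(v_bar, r, v_hat_r, v_hat_r1, alpha):
    """Illustrative sketch of the per-iteration slope update: only the visited
    post-decision level r = R_t^{x,n} and r + 1 receive the stepsize alpha
    (all other levels effectively receive stepsize 0), producing the temporary slopes z."""
    z = list(v_bar)
    z[r]     = (1.0 - alpha) * v_bar[r]     + alpha * v_hat_r    # sample slope at r
    z[r + 1] = (1.0 - alpha) * v_bar[r + 1] + alpha * v_hat_r1   # sample slope at r + 1
    return z

# Hypothetical example with five resource levels and stepsize 0.5.
print(smooth_slopes([10.0, 8.0, 5.0, 2.0, 1.0], r=1, v_hat_r=6.0, v_hat_r1=4.0, alpha=0.5))
# -> [10.0, 7.0, 4.5, 2.0, 1.0]

In the algorithm, the projection operation sketched earlier would then restore the monotone decreasing property if this update breaks it.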

Proofs of Lemmas

We start with the proof of proposition 2; then we present the proofs for the lemmas of the convergence analysis section.

Proof: [Proof of proposition 2] We use the notational convention that $L^k$ is the entire set of slopes $L^k = \{L_t^k(W_t) \text{ for } t = 0, \dots, T,\; W_t \in \mathcal{W}_t\}$, where $L_t^k(W_t) = (L_t^k(W_t, 1), \dots, L_t^k(W_t, B^R))$.

We start by showing (23). Clearly, $L_T^k(W_T)$ is monotone decreasing (MD) for all $k \ge 0$ and $W_T \in \mathcal{W}_T$, since $L_T^k(W_T)$ is a vector of zeros. Thus, using condition (11), we have that $(HL^k)_{T-1}(W_{T-1})$ is MD. We keep in mind that, for $t = 0, \dots, T-1$, $L_t^0(W_t)$ is MD as $L^0 = v^* - 2B^v e$. By definition, we have that $L_{T-1}^1(W_{T-1})$ is MD. A simple induction argument shows that $L_{T-1}^k(W_{T-1})$ is MD. Using the same argument for $t = T-2, \dots, 0$, we show that $(HL^k)_t(W_t)$ and $L_t^k(W_t)$ are MD.

We now show (24). Since $v^*$ is the unique fixed point of $H$, we have $L^0 = Hv^* - 2B^v e$. Moreover, from condition (13), we have that $(Hv^*)_t(W_t, R) - 2B^v \le (H(v^* - 2B^v e))_t(W_t, R) = (HL^0)_t(W_t, R)$ for all $W_t \in \mathcal{W}_t$ and $R = 1, \dots, B^R$. Hence, $L^0 \le HL^0$ and $L^0 \le L^1 = \frac{L^0 + HL^0}{2} \le HL^0$. Suppose that (24) holds true for some $k > 0$. We shall prove it for $k+1$. The induction hypothesis and (23) tell us that (12) holds true. Hence, $HL^{k+1} \ge HL^k \ge L^{k+1}$ and $HL^{k+1} \ge L^{k+2} \ge L^{k+1}$.

Finally, we show (25). A simple inductive argument shows that $L^k \le v^*$ for all $k \ge 0$. Thus, as the sequence is monotone and bounded, it is convergent. Let $L$ denote the limit. It is clear that $L_t(W_t)$ is monotone decreasing. Therefore, conditions (11)-(13) are also true when applied to $L$. Hence, as shown in (Bertsekas & Tsitsiklis, 1996, pages 158-159), we have that $\|HL^k - HL\|_\infty \le \|L^k - L\|_\infty$. With this inequality, it is straightforward to see that $\lim_{k \to \infty} HL^k = HL$. Therefore, as in the proof of (Bertsekas & Tsitsiklis, 1996, Lemma 3.4),
\[
L = \frac{L + HL}{2}.
\]
It follows that $L = v^*$, as $v^*$ is the unique fixed point of $H$.

Each lemma assumes all the conditions imposed and all the results obtained before its statement in the proof of Theorem 1. To improve the comprehension of each proof, all the assumptions are presented beforehand.

Proof: [Proof of lemma 5.1] Assumptions: Assume stepsize conditions (15)-(16).

Fix $\omega \in \Omega$. Omitting the dependence on $\omega$, let $(W_t^*, R_t^{x,*})$ be an accumulation point of $\{(W_t^n, R_t^{x,n})\}_{n \ge 0}$. In order to simplify notation, let $\bar{s}_t^n(W_t^*, R_t^{x,*})$ be denoted by $\bar{s}_t^{*,n}$ and $\bar{\alpha}_t^n(W_t^*, R_t^{x,*})$ be denoted by $\bar{\alpha}_t^{*,n}$. Furthermore, let
\[
\hat{\theta}_{t+1}^n = \hat{s}_{t+1}^n\big(R_t^{x,n}\, 1_{\{R_t^{x,*} \le R_t^{x,n}\}} + (R_t^{x,n} + 1)\, 1_{\{R_t^{x,*} > R_t^{x,n}\}}\big).
\]

We have, for $n \ge 1$,
\[
(\bar{s}_t^{*,n})^2 \le \big[(1 - \bar{\alpha}_t^{*,n})\,\bar{s}_t^{*,n-1} + \bar{\alpha}_t^{*,n}\,\hat{\theta}_{t+1}^n\big]^2 = (\bar{s}_t^{*,n-1})^2 - 2\bar{\alpha}_t^{*,n}(\bar{s}_t^{*,n-1})^2 + A_t^n, \qquad (36)
\]
where $A_t^n = 2\bar{\alpha}_t^{*,n}\,\bar{s}_t^{*,n-1}\,\hat{\theta}_{t+1}^n + (\bar{\alpha}_t^{*,n})^2(\hat{\theta}_{t+1}^n - \bar{s}_t^{*,n-1})^2$. We want to show that
\[
\sum_{n=1}^{\infty} A_t^n = 2\sum_{n=1}^{\infty} \bar{\alpha}_t^{*,n}\,\bar{s}_t^{*,n-1}\,\hat{\theta}_{t+1}^n + \sum_{n=1}^{\infty} (\bar{\alpha}_t^{*,n})^2(\hat{\theta}_{t+1}^n - \bar{s}_t^{*,n-1})^2 < \infty.
\]

It is trivial to see that both $\bar{s}_t^{*,n-1}$ and $\hat{\theta}_{t+1}^n$ are bounded. Thus, $(\hat{\theta}_{t+1}^n - \bar{s}_t^{*,n-1})^2$ is bounded and (16) tells us that
\[
\sum_{n=1}^{\infty} (\bar{\alpha}_t^{*,n})^2(\hat{\theta}_{t+1}^n - \bar{s}_t^{*,n-1})^2 < \infty. \qquad (37)
\]

Define a new sequence $\{g_{t+1}^n\}_{n \ge 0}$, where $g_{t+1}^0 = 0$ and $g_{t+1}^n = \sum_{m=1}^{n} \bar{\alpha}_t^{*,m}\,\bar{s}_t^{*,m-1}\,\hat{\theta}_{t+1}^m$. We can easily check that $\{g_{t+1}^n\}_{n \ge 0}$ is an $\mathcal{F}_T^n$-martingale bounded in $L^2$. Measurability is obvious. The martingale equality follows from repeated conditioning and the unbiasedness property. Finally, the $L^2$-boundedness, and consequently the integrability, can be obtained by noticing that
\[
(g_{t+1}^n)^2 = (g_{t+1}^{n-1})^2 + 2 g_{t+1}^{n-1}\,\bar{\alpha}_t^{*,n}\,\bar{s}_t^{*,n-1}\,\hat{\theta}_{t+1}^n + (\bar{\alpha}_t^{*,n})^2\big(\bar{s}_t^{*,n-1}\,\hat{\theta}_{t+1}^n\big)^2.
\]
From the martingale equality and the boundedness of $\bar{s}_t^{*,n-1}$ and $\hat{\theta}_{t+1}^n$, we get
\[
E\big[(g_{t+1}^n)^2 \,\big|\, \mathcal{F}_T^{n-1}\big] \le (g_{t+1}^{n-1})^2 + C\,E\big[(\bar{\alpha}_t^{*,n})^2 \,\big|\, \mathcal{F}_T^{n-1}\big],
\]
where $C$ is a constant. Hence, taking expectations and repeating the process, we obtain, from the stepsize assumption (16) and $E[(g_{t+1}^0)^2] = 0$,
\[
E\big[(g_{t+1}^n)^2\big] \le E\big[(g_{t+1}^{n-1})^2\big] + C\,E\big[(\bar{\alpha}_t^{*,n})^2\big] \le E\big[(g_{t+1}^0)^2\big] + C\sum_{m=1}^{n} E\big[(\bar{\alpha}_t^{*,m})^2\big] < \infty.
\]

Therefore, the $L^2$-Bounded Martingale Convergence Theorem (Shiryaev, 1996, page 510) tells us that
\[
\sum_{n=1}^{\infty} \bar{\alpha}_t^{*,n}\,\bar{s}_t^{*,n-1}\,\hat{\theta}_{t+1}^n < \infty. \qquad (38)
\]


Inequalities (37) and (38) show us that

∞ X

Ant < ∞, and so, it is valid to write

n=1

Ant

=

∞ X

Am t

∞ X



m=n

Am t .

m=n+1

Therefore, as −2¯ αt∗,n (¯ s∗,n−1 )2 < 0, inequality (36) can be rewritten as t 2 (¯ s∗,n t ) +

∞ X

Am s∗,n−1 )2 + t t ≤ (¯

m=n+1

∞ X

Am t .

(39)

m=n

Thus, the sequence {(¯ s∗,n−1 )2 + t

∞ X

Am t }n≥1 is decreasing and bounded from below, as

m=n

∞ X

∞. Hence, it is convergent. Moreover, as

∞ X

Am t
Ntk ≥ N ˜ tx,∗ ) ˜ tx,∗ ) be denoted by α ˜ tn and v¯tn (Wt∗ , R In order to simplify the notation, let α ¯ tn (Wt∗ , R ˜ tx,∗ ) and ˜ tx,∗ ), ¯ltn (Wt∗ , R be denoted by v˜tn . We use the same shorthand notation for ztn (Wt∗ , R ˜ tx,∗ ) as well. Keeping in mind that the set of iterations Nt (W ∗ , R ˜ tx,∗ ) is finite and for s¯nt (Wt∗ , R t ¯ , v¯n (W ∗ , R ˜ tx,∗ ) ≥ z n (W ∗ , R ˜ tx,∗ ). We consider three different cases: all n ≥ Ntk ≥ N t t t t ˜ tx,∗ = Rtx,n . Case 1: Wt∗ = Wtn and R ˜ tx,∗ ) is the state being visited by the algorithm at iteration n at time In this case, (Wt∗ , R t. Thus, n v˜tn ≥ z˜tn = (1 − α ˜ tn )˜ vtn−1 + α ˜ tn vˆt+1 (Rtx,n ) n ≥ (1 − α ˜ tn )(˜ltn−1 − s˜n−1 )+α ˜ tn vˆt+1 (Rtx,n ) t

−α ˜ tn (H v¯n−1 )t (Wtn , Rtx,n ) + α ˜ tn (H v¯n−1 )t (Wtn , Rtx,n )

(40)

≥ (1 − α ˜ tn )(˜ltn−1 − s˜tn−1 ) − α ˜ tn sˆnt+1 (Rtx,n ) + α ˜ tn (HLk )t (Ptn , Rtx,n )

(41)

= ˜ltn − ((1 − α ˜ tn )˜ sn−1 +α ˜ tn sˆnt+1 (Rtx,n )) t

(42)

 ≥ ˜ltn − max 0, (1 − α ˜ tn )˜ sn−1 +α ˜ tn sˆnt+1 (Rtx,n ) t = ˜ltn − s˜nt .

(43)

The first inequality is due to the construction of set S˜t , while (40) is due to the induction hypothesis. Inequalities (26) and (27) for n ≥ Ntk explains (41). Finally, (42) and (43) come from the definition of the stochastic sequences ˜ltn and s˜nt , respectively. ˜ tx,∗ = Rn + 1. Case 2: Wt∗ = Wtn and R t n This case is analogous to the previous one, except that we use the sample slope vˆt+1 (Rtx,n + n 1) instead of vˆt+1 (Rtx,n ). We also consider the (Wtn , Rtx,n + 1) component, instead of

(Wtn , Rtx,n ). Case 3: Else. ˜ tx,∗ ) is not being updated at iteration n at time t due to a direct Here the state (Wt∗ , R 45

observation of sample slopes. Then, α ˜ tn = 0 and, hence, ˜ln = ˜ln−1 t t

and

s˜nt = s˜n−1 . t

Therefore, from the construction of set S˜t and the induction hypothesis v˜tn ≥ z˜tn = v˜tn−1 ≥ ˜ltn−1 − s˜n−1 = ˜ltn − s˜nt . t

Proof: [Proof of lemma 5.3] Assumptions: Assume stepsize conditions (15)-(16). Moreover, assume for all k ≥ 0 and integer Ntk that inequalities (26) and (27) hold true on {n ≥ Ntk }. Finally, assume there exists an integer random variable N k (Wt , R) such that v¯tn−1 (Wt , R) ≥ Lkt (Wt , R) on {n ≥ N k (Wt , R),

Nt (Wt , R + 1) = ∞}.

Pick ω ∈ Ω and omit the dependence of the random elements on w. We start by showing, for each k ≥ 0, there exists an integer Ntk,s such that v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) − s¯n−1 (Wt , R + 1) for all n ≥ Ntk,s . Then, we show, for all  > 0, there is an integer Ntk, such t that v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) −  for all n ≥ Ntk, . Finally, using these results, we prove existence of an integer N k (Wt , R + 1) such that v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) for all n ≥ N k (Wt , R + 1).  Pick k ≥ 0. Let Ntk,s = min n ∈ Nt (Wt , R + 1) : n ≥ N k (Wt , R) + 1, where N k (Wt , R) is the integer such that v¯tn−1 (Wt , R) ≥ Lkt (Wt , R) for all n ≥ N k (Wt , R). Given the projection Ntk,s −1

properties discussed in the proof of Theorem 1, Nt (Wt , R + 1) is infinite and v¯t Ntk,s −1

v¯t

(Wt , R) =

(Wt , R + 1). Therefore, Ntk,s is well defined. Redefine the noise sequence {¯ snt (Wt , R +

1)}n≥0 introduced in the proof of Theorem 1 using Ntk,s instead of Ntk . We prove that v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) − s¯n−1 (Wt , R + 1) for all n ≥ Ntk,s by induction on n. t For the base case n − 1 = Ntk,s − 1, from our choice of Ntk,s and the monotone decreasing property of Lkt (Wt ), we have that stn−1 (Wt , R+1). v¯tn−1 (Wt , R+1) = v¯tn−1 (Wt , R) ≥ Lkt (Wt , R) ≥ Lkt (Wt , R+1) = Lkt (Wt , R+1)−¯ Now, we suppose v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) − s¯tn−1 (Wt , R + 1) is true for n − 1 > Ntk,s and prove for n. We have to consider two cases: 46

Case 1: n ∈ Nt (Wt , R + 1) In this case, a projection operation took place at iteration n. This fact and the monotone decreasing property of Lkt (Wt ) give us v¯tn−1 (Wt , R+1) = v¯tn−1 (Wt , R) ≥ Lkt (Wt , R) ≥ Lkt (Wt , R+1) ≥ Lkt (Wt , R+1)−¯ sn−1 (Wt , R+1). t Case 2: n 6∈ Nt (Wt , R + 1) The analysis of this case is analogous to the proof of inequality (33) of lemma 5.2 for (Wt , R+1) ∈ S˜t . The difference is that we consider Lkt instead of the stochastic bounding sequence {¯ltn (Wt , R + 1)}n≥0 and we note that (1 − α ¯ tn (Wt , R + 1))Lkt (Wt , R + 1) + α ¯ tn (Wt , R + 1)(HLk )t (Wt , R + 1) ≥ Lkt (Wt , R + 1). Hence, we have proved that for all k ≥ 0 there exists an integer Ntk,s such that n−1 v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) − s¯t+ (Wt , R + 1) for all n ≥ Ntk,s .

We move on to show, for all k ≥ 0 and  > 0, there is an integer Ntk, such that v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) −  for all n ≥ Ntk, . We have to consider two cases: (i) (Wt , R + 1) is either equal to an accumulation point (Wt∗ , Rtx,∗ ) or (Wt∗ , Rtx,∗ + 1) and (ii) (Wt , R+1) is neither equal to an accumulation point (Wt∗ , Rtx,∗ ) nor (Wt∗ , Rtx,∗ +1). For the first case, lemma 5.1 tells us that s¯nt+ (Wt , R + 1) goes to zero. Then, there exists N  > 0 such that s¯nt+ (Wt , R + 1) <  for all n ≥ N  . Therefore, we just need to choose Ntk, = max(Ntk,s , N  ). N k,s −1

For the second case, α ¯ tn (Wt , R + 1) = 0 for all n ≥ Ntk,s and s¯t+t Ntk,s −1

s¯nt+ (Wt , R+1) = s¯t+

(Wt , R + 1) = 0. Thus,

(Wt , R+1) = 0 for all n ≥ Ntk,s and we just have to choose Ntk, = Ntk,s .

We are ready to conclude the proof. For that matter, we use the result of the previous paragraph. Pick k > 0. Let  = vt∗ (Wt , R + 1) − Lkt (Wt , R + 1) > 0. Since {Lkt (Wt , R + 1)}k≥0 0

increases to vt∗ (Wt , R + 1), there exists k 0 > k such that vt∗ (Wt , R + 1) − Lkt (Wt , R + 1) < /2. 0

Thus, Lkt (Wt , R + 1) − Lkt (Wt , R + 1) > /2 and the result of the previous paragraph tells k0 ,/2

us that there exists Nt

0

such that v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) − /2 > Lkt (Wt , R + k0 ,/2

1) + /2 − /2 = Lkt (Wt , R + 1) for all n ≥ Nt 47

. Therefore, we just need to choose

k0 ,/2

N k (Wt , R + 1) = Nt

and we have proved that for all k ≥ 0, there exists N k (Wt , R + 1)

such that v¯tn−1 (Wt , R + 1) ≥ Lkt (Wt , R + 1) for all n ≥ N k (Wt , R + 1).
