Sunday, March 29, 2015

Optimization Pro and Con

A tweet by Nate Brixius (@natebrix) led me to read the article "The Natural Order and Divine Order of Optimization" published by the Wisconsin Institute for Discovery, a rebuttal/counterpoint to a New York Times Magazine article titled "A Sucker is Optimized Every Minute". The former sings the praises of optimization (somewhat) and the latter vilifies it (somewhat).

As someone whose research focuses on optimization, I'm generally in the pro camp, but I'll concede that it can be overdone. One of my favorite quotes is of rather fuzzy provenance. I've seen variations of it from multiple authors, phrased as a monologue by, variously, a Zen master, a playboy or a dying man (but not, so far, a dying playboy Zen master). The version I recall:
I had my chances to marry, but I was looking for the perfect woman. Then one day I found her; but she was looking for the perfect man.
Optimization is fine for some things, inappropriate for others, and (like everything else in life) not a good idea if taken to an extreme.*

* I'm rather proud of myself that I resisted the nearly-inevitable "extreme point" pun there.

Saturday, March 28, 2015

CPLEX Performance Variability

It may come as a surprise to some users of CPLEX that its performance is, somewhat intentionally, random. (This may be true of some other optimization software as well. I'll limit the assertion to CPLEX because CPLEX is the only solver for which I have first-hand knowledge that it's true.)

A certain amount of uncertainty about performance is probably inevitable. Operating systems swap programs in and out of memory and processes in and out of CPU cores in ways uncontrollable by the CPLEX user, so "random" fluctuations in timing are to be expected even when running code with identical parameter settings repeatedly on a single problem on a single machine. When the code is itself multi-threaded, I suspect even more "random" timing issues occur. (CPLEX by default uses multiple threads, but can be restricted to a single thread via a parameter setting.)

Less well known is that CPLEX uses a pseudorandom number stream to control some of its internal decision-making. Marc-André Carle (@ORNinja) has an excellent series of posts on his blog about performance variability, so I'll save myself a ton of typing and refer you there.

I thought I'd share some recent results that demonstrate how one particular source of variability, the random seed, can make a large difference in performance. The name for the seed in the Java API is IloCplex.Param.RandomSeed. According to the CPLEX 12.6.1 parameters manual, its default value changes with each new release of the program, so if you run two different versions of CPLEX on the same problem, you can expect different performance even if none of the changes from one version to the next directly impact your problem. Put another way, your mileage WILL vary.

Here's an example. I took a moderately chewy MIP model that arises in a current research project. I'm not at liberty (nor inclined) to give a full description, but the first few lines of the CPLEX log show that it has what are, by contemporary standards, modest dimensions.

Reduced MIP has 20434 rows, 10239 columns, and 78243 nonzeros.
Reduced MIP has 215 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.08 sec. (37.97 ticks)
Found incumbent of value 216.000000 after 0.23 sec. (89.01 ticks)


This is a cost minimization problem with an integer-valued objective function. The trivial feasible solution with cost 216 is always found immediately; it's from there that the adventure begins. This version of the problem is not too difficult. Although I have not solved it to optimality yet, I'm fairly confident that the optimal solution has cost 11, and I suspect that a few hours of crunching on my quad-core PC would be enough to prove that solution is optimal.

I ran a little test, using CPLEX 12.6.1, consisting of 85 independent solves, each with a different value of the seed parameter. Runs were limited to 60 seconds each and three CPU threads, and the Emphasis.MIP parameter was set to 4 (emphasizing a search for hidden feasible solutions -- improving incumbents -- over moving the lower bound or proving optimality). In other words, I told CPLEX to find good solutions fast, and to worry about proving optimality later. All other CPLEX parameters were left at their default values.
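In the Java API, those settings amount to a handful of parameter assignments. The following is just a fragment (not a complete program), assuming cplex is an IloCplex instance and seed holds the value for the current run:

```java
// Settings for one short run (CPLEX 12.6.1 Java API):
cplex.setParam(IloCplex.Param.RandomSeed, seed);  // varies from run to run
cplex.setParam(IloCplex.Param.TimeLimit, 60.0);   // time limit (seconds)
cplex.setParam(IloCplex.Param.Threads, 3);        // restrict to three threads
cplex.setParam(IloCplex.Param.Emphasis.MIP, 4);   // hunt for hidden feasible solutions
```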

Here is a histogram of the lower bounds from those 85 runs:
[Histogram of lower bounds]
The bounds ranged from 9.38 to 9.60, and you can see that the vast majority were at the low (inferior) end of the range. Now for a tabulation of the incumbents found within one minute:

Incumbent    11   12   144
Frequency     1    3    81

Recall that I am fairly certain the optimal value is 11. Out of 85 attempts, I landed on the optimal solution once and a near-optimal solution three times; on the other 81 tries, I got a fairly terrible solution (costing more than 13 times the optimum). There are some intermediate solutions with costs in the 20s and 30s that show up in the node logs for the runs where I got lucky; CPLEX just never happened to be sitting on one of those when the time limit expired.

The point here is that, in practice, users don't run the same problem multiple times with different seeds (if they even know the seed parameter exists, which I suspect most do not). This caught my eye because I'd done a test run on this example, gotten the incumbent down from 216 to 144 (with a lower bound under 10) after a minute or so, and thought "hmm, this is going to take a while". I might add that 144 was found fairly quickly, so there was a lengthy stretch with no progress in the incumbent before time ran out. Had I been lucky on that pilot run (where, if the table is to be believed, being lucky has probability a bit under 1 in 20), I would have gotten a really good solution (11 or 12) in the first minute, with a lower bound of 10 (rounding up, since the objective is integer-valued), and thought "hmm, this problem is pretty easy".

So, is the problem easy or not? Beats me.

I'll finish this with a pointer to a slide in a recent presentation on advances in CPLEX. The slide mentions a "racing ramp-up phase" that can be used when running CPLEX on multiple processors in parallel. In essence, CPLEX tackles the same problem independently on multiple processors, using different parameter settings for each. After a set amount of time (the end of the "ramp-up" phase), the results are compared, a winner is declared, and the processors then all proceed to work on finishing the winner's search tree, using the winner's parameter settings.

With or without parallel processors, a user can replicate this to some extent. In particular, you can do a set of short runs with different seeds, find the best incumbent solution found by any of them, restart with that seed and do a full production run. I don't think there's a way to recycle the winner's search tree (unless you get lucky and the last run was the winner, in which case you can just continue it), but if the ramp-up phase is kept short, this could still help.
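As a sketch of that do-it-yourself ramp-up, here is some toy Java code. The shortRun function stands in for a time-limited CPLEX solve (which would set IloCplex.Param.RandomSeed to the given seed and return the best incumbent value found); the objective values in main are made up purely for illustration.

```java
import java.util.function.IntToDoubleFunction;

public class SeedRace {
    // Run a short, time-limited solve for each candidate seed and return the
    // seed that produced the best (smallest) incumbent objective value.
    public static int bestSeed(int[] seeds, IntToDoubleFunction shortRun) {
        int winner = seeds[0];
        double best = Double.POSITIVE_INFINITY;
        for (int s : seeds) {
            double obj = shortRun.applyAsDouble(s);  // short "scouting" solve
            if (obj < best) {  // minimization: smaller incumbent wins
                best = obj;
                winner = s;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // Toy stand-in: pretend seed 7 happens to find the good incumbent.
        IntToDoubleFunction toyRun = s -> (s == 7) ? 11.0 : 144.0;
        int winner = bestSeed(new int[]{1, 3, 5, 7, 9}, toyRun);
        System.out.println("winning seed: " + winner);
        // A production run would now restart with RandomSeed = winner and a
        // much longer (or no) time limit.
    }
}
```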

Bottom line: a "hard" problem might be truly hard (CPLEX will take a long time no matter which seed you use), or it might be hard in the sense that there are (many) more bad seeds than good. Then again, maybe you just need better karma.


Tuesday, March 24, 2015

Updated Java Utilities for CPLEX and CP Optimizer

I just finished adding a feature to a utility library I use in Java projects that employ either CPLEX or CP Optimizer. In addition, I moved the files to a new home. The library is free to use under the Eclipse Public License 1.0. The code is mentioned in previous posts, so I'll just quickly summarize the content here and refer interested parties to the earlier posts:
The latest source code is now available from the Michigan State University Gitlab server. You do not need to log in, and you do not need to be a Git user. You can just click the download button near the upper right to get a ZIP archive of the source code. If you run into any bugs, there is an issue tracker on the MSU site where you can report them (please!). If you just want a binary (.jar) file and the Javadoc documentation, you can download a .zip file from my MSU web space.

In the following description, please assume that cplex and cp are instances of IloCplex and IloCP respectively. The main reason I developed the library is that I like to experiment with different parameter settings for CPLEX and CP Optimizer. Rather than hard coding a parameter (e.g., cplex.setParam(IloCplex.Param.Emphasis.MIP, 3)) and then having to edit and recompile the code to try a different value, I prefer to pass CPLEX and CP Optimizer parameters to the program through the command line (or, if my program has a graphical interface, through a user dialog). The equivalent code to the previous example might look like cplex.setParam(pname, pval), where pname is a string with the value "Emphasis.MIP" (the minimum portion of the full parameter name necessary to uniquely identify the parameter) and pval is a string with the value "3".
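To illustrate the general idea (and only the idea -- this toy code is not how the library itself is implemented), one can keep a registry mapping parameter name strings to setter actions. The class below is a self-contained sketch in which a plain map stands in for the solver's parameter store; with the real Java API, the setters would apply values to an IloCplex instance instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class ParamDemo {
    // Toy stand-in for the solver's parameter store.
    private final Map<String, Integer> settings = new HashMap<>();

    // Registry mapping (possibly abbreviated) parameter names to setters.
    private final Map<String, Consumer<String>> registry = new HashMap<>();

    public ParamDemo() {
        registry.put("Emphasis.MIP", v -> settings.put("Emphasis.MIP", Integer.parseInt(v)));
        registry.put("Threads", v -> settings.put("Threads", Integer.parseInt(v)));
    }

    // Apply a (name, value) pair arriving as strings, e.g. from the command line.
    public void set(String pname, String pval) {
        Consumer<String> setter = registry.get(pname);
        if (setter == null) {
            throw new IllegalArgumentException("unknown parameter: " + pname);
        }
        setter.accept(pval);
    }

    public Integer get(String pname) {
        return settings.get(pname);
    }

    public static void main(String[] args) {
        ParamDemo p = new ParamDemo();
        p.set("Emphasis.MIP", "3");  // strings in, typed setting out
        System.out.println("Emphasis.MIP = " + p.get("Emphasis.MIP"));
    }
}
```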

With that as background, here is a summary of the contents of the library:
  • CplexParameterSetter: Use an instance of this class to apply parameter settings (each specified by two strings, parameter name and new value) to an instance of IloCplex. Example code appears in the May 2014 post (look for the version 2.0 sample).
  • CPOptimizerParameterSetter: Use an instance of this class to apply parameter settings to an instance of IloCP. Again, sample code is in the May 2014 post.
  • CPDisplay: The static method CPDisplay.asString(cp) will produce a summary of the model in cp, suitable for printing or displaying. For whatever reason, IloCplex.toString() produces a human-readable version of a CPLEX model, but IloCP.toString() just returns the object's hash code. This class provides a workaround. For the output to really be readable, you need to be assiduous about assigning meaningful names to the variables and constraints in the model. I originally released this (in the March 2014 post) as an override to IloCP.toString(), but I decided a static method was easier to use (does not require extending IloCP).
  • Main program: The project comes with a main program. If you run it, you will get an alphabetized list of all parameters known to the versions of CPLEX and CP Optimizer that you are using. For instance, run against CPLEX Studio 12.6.1, the output I get looks like this:
    CPLEX:
               AbsGap double  IloCplex.Param.MIP.Pool.AbsGap
            AbsMIPGap double  IloCplex.Param.MIP.Tolerances.AbsMIPGap
            AbsMIPGap double  IloCplex.Param.MIP.PolishAfter.AbsMIPGap
              Advance int  IloCplex.Param.Advance
    [...]
           WriteLevel int  IloCplex.Param.Output.WriteLevel
          ZeroHalfCut int  IloCplex.Param.MIP.Cuts.ZeroHalfCut
    
    CPOptimizer:
                      AllDiffInferenceLevel int  IloCP.IntParam.AllDiffInferenceLevel
               AllMinDistanceInferenceLevel int  IloCP.IntParam.AllMinDistanceInferenceLevel
    [...]
                               WarningLevel int  IloCP.IntParam.WarningLevel
                                    Workers int  IloCP.IntParam.Workers
    
    At least for me, the main use of this is to figure out which parameter names are unambiguous and which are repeated. For instance, the CPLEX parameter IloCplex.Param.MIP.Pool.AbsGap can be abbreviated "AbsGap"; but the minimum amount necessary to specify the absolute MIP gap for convergence is "Tolerances.AbsMIPGap", since "AbsMIPGap" could also refer to IloCplex.Param.MIP.PolishAfter.AbsMIPGap.
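That disambiguation rule can be made concrete with a bit of self-contained Java (a sketch of the idea, not the library's actual code): for each fully qualified name, take the shortest trailing run of dot-separated segments that matches no other name.

```java
import java.util.*;

public class SuffixDemo {
    // Given fully qualified dotted parameter names, return, for each name,
    // the shortest trailing segment sequence that identifies it uniquely.
    public static Map<String, String> minimalSuffixes(List<String> names) {
        Map<String, String> result = new HashMap<>();
        for (String name : names) {
            String[] parts = name.split("\\.");
            for (int k = 1; k <= parts.length; k++) {
                // Last k segments of the name, rejoined with dots.
                String suffix = String.join(".",
                        Arrays.copyOfRange(parts, parts.length - k, parts.length));
                int matches = 0;
                for (String other : names) {
                    if (other.equals(suffix) || other.endsWith("." + suffix)) {
                        matches++;
                    }
                }
                if (matches == 1) {  // only this name ends with the suffix
                    result.put(name, suffix);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "MIP.Pool.AbsGap",
                "MIP.Tolerances.AbsMIPGap",
                "MIP.PolishAfter.AbsMIPGap");
        System.out.println(minimalSuffixes(names));
    }
}
```

Run on the three names above, it reproduces the example: "AbsGap" suffices for the pool gap, but the convergence tolerance needs "Tolerances.AbsMIPGap".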
If you have questions about the proper use of the code, please feel free to post them as comments here. If you run into bugs (or missing features), please use the issue tracker. That will help me keep tabs on what needs to be done.

Thursday, March 19, 2015

An SSH Glitch

Something weird happened with SSH today, and I'm documenting it here in case it happens again.

I was minding my own business, doing some coding, on a project that is under version control using Git. After committing some changes, I was ready to push them up to the remote (a GitLab server here at Michigan State University). I had already set up the GitLab account with my RSA public key, and I had done previous pushes successfully. Today, suddenly, the push failed with a message that boiled down to the server not recognizing my key.

After the usual amount of fruitless messing around, I ran the command
ssh-add -l
in a terminal window. This is supposed to list the keys that the ssh-agent program recognizes. Oops! The response was "The agent has no identities." In other words, my public/private key pair, which I've been using for ages, was not being seen by ssh-agent.

I don't know if this is the result of a software update or a boot anomaly. The keys were right where they were supposed to be, in ~/.ssh, unchanged. Restarting ssh-agent did not help. What fixed it was to run
ssh-add ~/.ssh/id_rsa
in a terminal. I did that when I first set up the keys, and I don't know why I had to do it again. We'll see if I'm forced to do it a third time.

After that last command, ssh-add -l showed the "fingerprint" of my key. I still could not push to the GitLab server, though. So, on the GitLab server, I tried to install my public key (with a new name) and was told the fingerprint matched my existing key. In desperation, I deleted the old key and installed the "new" public key (the one with the same fingerprint as the key just deleted). Bear in mind that at no point in this mess did I generate new keys; the key I installed was the same one previously installed (and just deleted). For whatever reason, that worked; the server accepted the push.

<sigh>Now I have to make sure that every other server using my public key also recognizes my digital signature.</sigh>

Update: The business about using ssh-add may be a red herring. I just checked my laptop (which uses the same public/private key pair and has the same Git projects on it). I ran ssh-add -l and again got the "no identities" message. Then I ran a Git "pull" command (from the NetBeans IDE), using my key, and (after giving a password to unlock my keyring) it worked! I then tried ssh-add -l again, just to make sure unlocking the keyring had not caused ssh-agent to sober up, and the response was still "no identities". So having no listed identities apparently does not prevent applications from using the key pair.

On the home PC, I frequently (but not always) get the prompt to unlock my keyring when banging on a Git server. It didn't happen today, and I didn't think anything of that at the time. I'm now left to wonder if the whole problem with logging into the server had something to do with (a) the keyring being locked and (b) the system, for whatever reason, never asking me to unlock it?

Or maybe it's just gremlins. The little buggers tend to be a bit unpredictable.

Wednesday, March 18, 2015

Formulating Optimization Models

Periodically, on OR Exchange and other forums, I encounter what are surely homework problems involving the construction of optimization models. "The Acme Anvil Corporation makes two types of anvils, blue ones and red ones. Blue anvils use 185 kg of steel and have a gross revenue of $325 each; red anvils ..." Really? Does anyone seriously believe Acme Anvil uses metric units?? This is obviously a homework problem.

Not surprisingly, these sorts of questions tend not to get answered, or at least not fully. Many of the participants in the forums are academics (faculty or students) who are loath to corrupt the homework process; most of the rest have real jobs, and better things to do than other people's homework. Still, there's something to be said for providing some nonspecific guidance on how a student might get started formulating such a model.

After bumping into one on a forum earlier today, it occurred to me to post here, somewhat paraphrased and highly condensed, my "introduction to modeling" lecture from the Dark Ages, when I taught modeling to business students. A caveat is in order: this is for beginners in the modeling process. Experienced modelers will go through these steps implicitly/subconsciously/automatically or perhaps just dive straight into a relevant model by association with previously encountered models.

In essence, what follows is a checklist for formulating an optimization model. For whatever it's worth, here goes.

Variables: Identify your decisions.

An optimization model is typically a decision-making model. Rather than attempting to go from first reading of the problem to full-fledged algebraic formulation, start by identifying the key decisions you must make (e.g., how many red anvils to make in March at the Los Alamos plant). Each decision maps to a single variable in your model. You will not necessarily come up with all the variables at this step, but you should identify the important ones. It might be helpful to think of yourself as the supervisor of some diligent but not very bright (and thoroughly non-quantitative) employees. (Think Harvard MBAs here.) What orders would you need to give them to get things running properly? Those are at least some of your decisions/variables.

As you identify decisions, also select units for them. Are anvils produced individually or in pallets? Is paint ordered by the gallon, by the thousand gallons, by the jug ...? Also ask yourself whether the amounts are continuous or discrete. Time spent on something is pretty continuous. Number of anvils produced is rather discrete.

Since discrete variables generally make models harder to solve, you might ask whether it's reasonable to treat seemingly discrete amounts as continuous. One example I used for this is the manufacture of ball bearings. The little buggers are obviously discrete -- nobody uses 4.37 ball bearings in the manufacture of a device -- but you probably make the little guys by the rail carload. If you decide to make 2.134 carloads today, nobody's going to obsess about whether that translates to a partial ball bearing.

My classroom experience suggests that this step is critical. Almost every time I had a student foundering in an attempt to formulate a model, their difficulty traced back to incorrect identification of the decision variables.

Bounds: What are reasonable values for your decisions?

The question here is not what specific values you want; that's the solution to the model. The question is, in general terms, how big or small can each decision quantity be? The answers give upper and lower bounds for your decision variables. Often the lower bound will be zero, signaling that (a) you are measuring things the way most people do (negative quantities are inherently unnatural for most people) and (b) you might possibly not want to do whatever this variable represents at all. Upper bounds do not need to be very tight, but definitely should not be too tight (too tight meaning it cuts off a valid decision that might prove to be optimal). How many red anvils might I choose to make in a day? Without getting mired in resource issues, simple logic might suggest that more than 1,000 is out of the question. So I can use 1,000 as the upper bound for that quantity.

Objective function: What is my criterion for selecting a program of action?

Suppose that some intern came to you with several choices for what to do (how many anvils of each type to make, how many to put into storage, which ones to drop off cliffs, ...). How would you decide which solution was best? The criterion you use will turn into your objective function. Common choices include maximizing profit or minimizing cost (or, if you're the government, maximizing cost), but it could be something non-monetary such as maximizing the number of at-risk subjects inoculated against some disease or maximizing the amount of food aid distributed to a stricken area.

Having determined your decision criterion, you need to express it as a function (preferably linear, but life isn't always cooperative) of your decision variables. In attempting to do so, you may identify quantities other than decision variables that you need to know (or that would be helpful) in order to express the objective function. Those quantities will turn into auxiliary variables. Mathematically, there's no difference between a "decision" variable and an "auxiliary" variable; the distinction is largely in whether it looks like something you would decide explicitly (previous section) or implicitly (this section).

A common example of an auxiliary variable for business students is inventory. How many blue anvils should be put into inventory in March to be shipped in April? You might not have thought of that as an explicit decision when identifying the original variables. When you try to express profit (your criterion) as a function, however, you realize that there are costs for carrying inventory, and you must account for them ... which requires knowing how much inventory you have. Thus are born the inventory variables.
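The linkage those inventory variables provide is typically a balance equation. With hypothetical notation (invented here for illustration), let $x_t$ be the number of blue anvils produced in month $t$, $s_t$ the number shipped, and $I_t$ the inventory on hand at the end of month $t$; then

```latex
I_t = I_{t-1} + x_t - s_t, \qquad I_t \ge 0,
```

and the carrying cost in the objective becomes a (per-unit holding cost) constant times $I_t$.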

There are methods for handling multiple competing criteria (I want to maximize distribution of food aid while minimizing the cost of acquiring and distributing it), but beginners are urged to start with a single criterion. If you must account for multiple criteria, you will need to do one or more of the following:
  • prioritize them;
  • set reasonable targets (aspiration levels) for them;
  • weight them (e.g., I'm indifferent between feeding one more person and saving $20 in cost).

Constraints: What restricts your decisions?

At this juncture, the ideal set of decisions is likely to be obvious. Make an infinite number of anvils and sell them all for a profit (at the risk of causing road runner extinction). Eat as much as you want of all the foods you like. Sadly, various factors will constrain your free will here. There's only so much iron available for anvils, your factories have limited capacity, and demand for anvils is not infinite. Your wallet will not support that all-steak diet, your wardrobe will not support the change in dimensions resulting from all the ice cream you can eat, and your doctor (or mother) (or both) absolutely insists that something green is mandatory (and mint ice cream does not qualify).

Your task now is to identify the aspects of the situation that will limit your decisions and express them as equations or inequalities involving your decision variables. Again, you may identify auxiliary variables during this process. If you do not have a specific accounting cost for holding inventory, you might not have identified it in the previous section. When it comes time to connect anvil production with anvil shipments, though, you may find it convenient to create inventory variables.

For strictly mathematical reasons, you will need to avoid strict inequalities. Where quantities are discrete, this is easy. "I need to make more than 50 blue anvils" (strict inequality, verboten) is equivalent to "I need to make at least 51 blue anvils" (weak inequality, perfectly legal). In other cases, it may not be so easy. "We need to make some blue paint" sounds a lot like "blue paint production $> 0$", but that's a strict inequality. Your choices are either "blue paint production $\ge 0$" (allowing the possibility that you produce no blue paint) or "blue paint production $\ge \epsilon$", where $\epsilon$ is a strictly positive parameter (a lower production limit that is the minimum amount you can live with).
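Putting the checklist together on the toy anvil problem: with $b$ and $r$ the numbers of blue and red anvils to make (the red-anvil revenue, the steel usage figures other than the blue anvils', and the steel supply below are all invented for illustration), a complete little model might look like

```latex
\begin{aligned}
\max\quad & 325\,b + 280\,r && \text{(objective: gross revenue, \$)} \\
\text{s.t.}\quad & 185\,b + 160\,r \le 50000 && \text{(constraint: steel supply, kg)} \\
& 0 \le b \le 1000, \quad 0 \le r \le 1000 && \text{(bounds)} \\
& b, r \in \mathbb{Z} && \text{(anvils are discrete)}
\end{aligned}
```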

I hope someone finds that helpful.