Report to users

Two years ago I came to London and gave my first “Report to Users” and now I’m back to update it.

I had selected the title of my talk very carefully two years ago because it conveyed something I wanted to emphasize. The talk was not entitled “Comments from Above: Future developments in Stata”. User is not a dismissive term, especially in our business. Users know as much as or more than we at Stata know about statistics, about what they are doing, and about how they want to do it.

In fact, it is that characteristic that is almost unique to the statistical software industry. In most software, the people who write it know more than the people who use it. That is certainly true of word processors and spreadsheets and all the rest of the “office systems”. And that makes the design of those systems relatively easy: Since the people who write the software know more than users, they can evaluate designs through introspection.

In statistical software, on the other hand, when we at StataCorp are lucky, we know as much about a topic as our users. Generally, we know less. Oh, we know more about certain computer topics—how to write parsers, how to sort efficiently, how to make calculations in stable ways—but in terms of how the product is used, we really do know less and this puts us at a considerable disadvantage.

It is that realization that guides most of our actions.

I hope you will excuse me if I take a few minutes to summarize my previous talk, because some of you did not hear it, and where I want to pick up is where the last one ended. I will say what I said before in a different way.

What I tried to say last time is that we—the software developers—are at a considerable disadvantage. I went on to ask whether there is any software industry, any model, in which the developers are at a similar disadvantage relative to their users.

Yes, I said. The people who make compilers. The people who make operating systems. The people who make development tools. Now here’s a topic I know something about. James and I spend much of our time developing software and, in doing that, we use products for which we are the customers. The people who write compilers know more than James and I know about certain aspects of compiler technology—in particular I am very out of date concerning optimization schemes—but James and I know a lot about software development and, in fact, we know more than the providers of the tools we use about certain aspects of that development.

That was the insight. Our relationship with you is just like our providers’ relationship with us. So what is it on the developmental tool front that has proven successful? What are the characteristics that James and I share when we are the customers?

Here are the answers:

As a user of developmental software, I want to control my environment. I do not want to be forced to work this way or that, to be told that I cannot get there from here. I am a very smart person. I am smarter than the compiler I use. I can carry out complicated tasks. I can understand subtle concepts. Hide nothing from me. Tell me exactly what is going on and how and then I will exploit the environment to its fullest.

You are not so different. You are smarter than Stata. Stata does not stand in judgment and tell you no, you shouldn't do that, or this is what you ought to do next. Rather, it is up to you to perform sequences of steps to accomplish what you want, and sometimes those steps are complex and subtle.

Returning to the developer market, what has been remarkably successful is the incremental design of software tools and operating systems. That’s computer-science jargon and all it means is that I can add on to the operating system and, when I do, the result is just as if the operating system had shipped from the factory with that addition. We developers, who are constantly performing unique and complex tasks, write little tools—new commands—to make our tasks easier.

You are not so different. You researchers are constantly performing unique and complex tasks. Would it not be convenient if you could write little tools—new commands—to make your tasks easier?

And thus was Stata designed. The insight was that statistical systems are not like word processors, like spreadsheets, like office systems. The computer-science jargon for this is monolithic. A monolithic system comes preassembled from the factory—it does what it does and perhaps it does that quite well. Most existing statistical systems are monolithic designs.

The insight was that the proper model for a statistical system is the operating system. Statistical systems really should be Statistical Operating Systems.

And so the design of Stata was just stolen from the principles of incremental operating-system design. Once we had the insight, the rest was easy because the computer scientists had already done all the groundwork: there is a large literature on operating systems and it is just a matter of reading it and translating it to a statistical application.

The whole design of Stata is stolen. We looked at Unix, we looked at DOS, we read the computer science literature. The front end—what you see—will seem familiar if you know a variety of operating systems. There’s a Unix bent, but actually the front end has more of a Wylbur twist to it. The whole idea of return codes was taken from VM/CMS. Ado-files? Think Unix scripts and DOS bat files.
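
To make the ado-file idea concrete, here is a minimal sketch; the command name hello is made up for illustration. If these lines are saved as hello.ado somewhere along Stata’s ado-path, typing hello at the Stata prompt runs the program just as if it were a command we had shipped from the factory:

        program define hello            /* "hello" is an illustrative name */
                version 5.0
                display "hello, world"
        end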

The parts that you cannot see were similarly stolen. The whole internal structure of Stata is right out of the computer science literature for incremental operating systems. Stata would need a memory manager. It says so; just open the book. Stata would need a caching system. That’s chapter 7. Why is Stata so fast? Because Stata really was designed by experts after hundreds of person-years of experimentation. We just borrowed their knowledge and applied it to a different application.

Now, once you accept that the operating system is the right model for statistical software, that leads to a range of considerations in how the support services for that software are organized.

Users are smart. Users are knowledgeable. Users want to control their environment. Users want to be involved in the development process. Or at least, some users do. Why? Because those users know what they are doing. About that topic, they know more than James or me. They do something all the time, they are world-class experts, and the only way it is ever going to get done right, they realize, is if they do it themselves. It is hopeless for James and me to try to compete with them. James and I can, however, help.

If users—some users—are to be involved in the development process, they must have the same access to Stata's internals and to information about Stata as we at StataCorp do. Hence, our openness and our constant struggle to be even more open.

Let me give an example: I am not confused; I know Stata is a statistical package. Sometimes James and I even do statistics, but that’s not very interesting to talk about because it has no implications beyond the particular statistic implemented. Nevertheless, a statistical system without statistics would be pointless. Here, StataCorp is no different from any of the other statistical vendors—we try.

Two years ago I came to London and walked into a buzz saw. It started kindly with a talk by Michael Hills, who gently mentioned some shortcomings in our survival-analysis routines—all the while adding that we still had the best routines around. It was then continued by Peter Sasieni, who—this is Britain—was also understated but who left no doubt as to exactly what the shortcomings were.

We addressed that—that's unimportant right now—but how we addressed it is important. When we rewrote our survival-analysis routines, we documented everything. Open the manual to [R] st_is. The utility routines—the guts that make the whole system work—are fully documented. How the system works is fully documented. If you want to add to that system, you can, and you can do that as elegantly as we do. My understanding is that Michael Hills and David Clayton have already done that. In any case, that is what it means to be open.

By the way, if users—some users—are to be involved in the development process, they must have a way to distribute what it is they do. Hence, the STB.

So two years ago I told a story and it went like this:

  1. Researchers, the users of statistical software, are smart. Note well that it is they who develop the things that go into a statistical package.
  2. Most statistical packages are monolithic, meaning merely that they are written at the factory by computer professionals. Thank you very much, researcher, we'll take it from here, and contact us if you have any future thoughts.
  3. Researchers have never liked this but these packages are so well written in other ways—data management, for instance—that researchers used them. I told a biased story of the history of statistical computing that emphasized that every time the researchers had an opportunity, they struck out on their own.
  4. The proper design for a statistical system, I argued, is that of an operating system and, in particular, an incremental one.
  5. Incremental means merely that anybody can add new features—they do not have to be added at the factory—and once added, the new features are indistinguishable from those that were added at the factory.
  6. Not all researchers will want to make additions to Stata but all researchers will benefit if those additions are widely distributed.
  7. In addition to the software design itself, there must be ways for researchers to communicate with each other. The STB was the first tool to assist that and Statalist was the second.
  8. We are a statistical software provider. We have a responsibility to add new statistical routines to Stata.
  9. This just means that if researchers are to be allowed to participate too, then we at the factory must exercise certain cautions. We need to develop a language that we can all use whether you are an insider or an outsider. We must consider ourselves tool builders just like the compiler manufacturers. It makes more sense for us to spend our time at the factory adding to Stata's programming language and then using that language to implement new statistical procedures than to implement the procedures directly. It takes us a little longer to do it that way in terms of when the new statistical procedure is delivered but it subsequently allows others to implement new statistical procedures.
  10. For the short run, it is the statistical procedures that are in Stata that matter.
  11. For the long run, it is the openness and programming capabilities of Stata that matter.
  12. If outsiders are to implement additions to Stata, there must be a scheme by which those new additions can be certified and researchers can use those tools with some assurance that results are correct.

Two years ago I tied these 12 points into what we had done and where we thought we were going. As an aside, I left off with the puzzle of the Internet. The Internet, I said not very insightfully, was going to be important. Somehow there had to be a way that the Internet tied into these 12 points, but I did not know how, and that was what I was thinking about at the time.

So with that introduction, let me continue.

First I want to briefly run over the last two years and highlight certain events and then I want to talk about the future.


The last two years

I very much enjoy these meetings and, more importantly, I find them of great value. It is not the opportunity of giving this talk that I value—I don’t mind and, I admit, I even enjoy it, and more importantly I owe you this report—but that is not what I value. What I value—and do not always enjoy—is sitting in the back and listening.

I mentioned Michael Hills’ and Peter Sasieni’s talks two years ago. I came here unaware that Stata's survival-analysis routines needed updating but I left with no doubts. More importantly, I knew exactly what the problems were. So, just as a matter of reporting, I went back and said all of our survival-analysis routines go into the trash can. We start again and rewrite the whole thing. No patching—we rewrite. Mostly we did a pretty good job. We made one mistake—Michael Hills has already told me that time should be allowed to be negative and this I have agreed to.

I just want to report that we do listen and, with a lag, we respond.

What makes listening at these meetings so useful for me is that, in the face of criticism, while I may not enjoy it, I need not feel defensive. We are all Stata users and we all agree that Stata is a wonderful package. That said, we can talk honestly about how to make it better.

Next topic.

I think the most important thing in terms of future implications that has happened in the last two years is our use of the Internet.

Two years ago I mentioned my puzzlement about what to do with the Internet. Well, we are making some progress. Obviously, we have a website—I cannot remember whether it opened just before or after the meetings two years ago, but we have one.

Putting aside the obvious marketing value of the site, the important part is the “User Support” half, so let me update you on that:

Under User Support, we have

  • FAQs
  • Cool ado-files
  • STB and STB-FTP
  • Updates
  • StataQuest Additions
  • Statalist
  • Links
  • NetCourses

I want to go through each of these.


FAQs

This has been reasonably successful although we do not have enough. Too often, in my view, we reply to a question either privately through technical support or on Statalist and do not make a FAQ out of it.

In order to improve things in the future, the first thing we are doing is attributing—adding authors and dates—to each and every statistical FAQ and ultimately to all of them. That task should be finished soon; the work started just last week. For some of us, having our name up at the website provides an incentive to write these things.

In addition, we are adding instructions on how to cite a particular FAQ. I do not expect many will cite them, but I want to see more FAQs discussing purely statistical issues, and perhaps those will be cited.

The second part of this plan—which will be going into effect almost immediately—is to obtain FAQs from Statalist. It is not just StataCorp people who can write FAQs and, every so often, something excellent appears on Statalist. We will be combing Statalist for potential FAQs and then seeking the author’s permission to put them up on our website. With attribution, of course.

Finally—and this will be starting almost immediately—we are going to index the FAQs through the lookup system so that they have a higher profile. This way, if there is a relevant FAQ, you can find it. And remember, the lookup database is updated whenever you update Stata.
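
To give a hedged illustration of what that indexing will mean in practice—the keyword here is only an example—typing

        . lookup survival

would list whatever is indexed under that keyword and, once this plan is in place, any relevant FAQs as well.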


Cool Ado-files

This seemed like a good idea but has proven to be a failure. I attribute this to us: we simply have not gone through Statalist as we should and pulled the good stuff. I had hoped users would tell us what to add to Cool Ado and then it would just be our responsibility to copy, but it did not work out that way.

I’m not ready to give up on Cool Ado-files, but I expect the page will languish quite a while longer.


STB and STB-FTP

The STB portion is just advertising. The STB-FTP is a set of links to other sites that provide the STB diskettes.

StataCorp’s attitude toward the STB being distributed over the net has been schizophrenic. On the one hand, we hate to lose the revenue from diskettes. On the other hand, we agree with the theoretical proposition that the materials should be distributed freely. StataCorp's solution has been to not provide the STB diskettes itself but to allow others to distribute them. This has basically been how the technical group has agreed to disagree with the marketing group.

These days, the revenue from diskettes amounts to little and losing all that revenue would not bother us at all. By that logic, we should immediately put the STB diskettes up on our site, but we are not going to do that. We have other ideas for the STB diskettes, which I will get to when I talk about the future.


Updates

We provide updates on our website and this has proven quite useful to us. First, it lowers the cost when we make a mistake. Even before the web, we distributed updated ado-files via the STB, so we have always been committed to the idea of continual updates. The website allows us to distribute updates to the binary executable itself, something we could not afford to do when we were mailing updates on diskette.

This is a big success and, because of that, we are going to focus on this and we have some big changes coming.


StataQuest Additions

Some people have downloaded the StataQuest Additions but it is not important. There are actually two aspects to this:

  1. StataQuest itself and
  2. The menu programming language.

As things stand right now, Stata’s menu programming language is little used by users and that, we think, is because it is little used by us.

Nevertheless, the MPL is an important addition to Stata—we just need to show you how to use it. We will be doing that.

StataQuest plays little role in Stata’s long-run plans. We are proud of it and it is used, but we mainly entered into the StataQuest agreement with Duxbury because we wanted an application for testing our MPL.

Stata makes software for researchers. That is what we do. We are interested in seeing our software used for teaching future researchers but we have no interest in trying to make money off the education market per se. I believe it is important that we stay focused on our primary task.


Statalist

The website entry is just a way to subscribe to Statalist so, basically, I want to use this as an excuse to discuss the list itself.

From our perspective it is a resounding success. It is, in a sense, like these London meetings. Everybody agrees that Stata is wonderful, so let’s be honest, what about ...

From our perspective, Statalist is a place where users can gain access to StataCorp in a way that is profitable for us.

Pretend one of you emails me personally with a question and I reply personally. I ask you to think about this from a purely monetary point of view. Perhaps I spend half an hour drafting my reply. The result is that, if I do a good job, you feel warmly toward StataCorp. If we are very lucky, perhaps you feel so warmly that you mention it to somebody. The monetary return is small but positive. Perhaps I have increased the likelihood you will upgrade. Perhaps I have increased the probability that, in speaking to someone else, they will purchase Stata. The cost, however, is high—30 minutes of my salary plus overhead.

Now say you ask the same question on Statalist and I reply there. This is done in public. A thousand people will read the question and answer and, if I do a good job, 1,000 people will feel warmly toward Stata. The return is 1,000 times greater and the cost is the same. Now, if I can also convert your question and my answer into a FAQ, perhaps another 1,000 people will read it over the period of a year.

This we can afford. And, as a matter of fact, the more subscribers we can get, the more activity like this we can justify. So one goal is to increase the Statalist subscription rate. If anybody has any ideas on how to do that, I would like to hear them.

Statalist has another advantage from our point of view. Sometimes questions are asked and answered and we need do nothing. We want to promote others answering questions, and that is one of the major reasons we want to attribute the FAQs and promote Statalist responses into FAQs.


Links

We have links to all the other statistical software vendors. This page is always in our top-10 list of pages hit, with a couple hundred people a day visiting it.

This is, I must say, one of our better ideas. The important thing about this list is its even handedness. If we hear about software, or if a vendor asks us to list them—and vendors do—there are only two tests:

  1. Is it statistical software?
  2. Is the URL valid?

We provide no commentary because we could not do that in an unbiased way.

One minor change I would like to make in the future would be to list the freeware packages separately. I think the freeware providers and the users of the service would like that.

If you know of any links, please tell me.


NetCourses

The website page is just advertising but, as with Statalist, I will use this as an excuse to discuss NetCourses.

This is, in my judgment, the most inventive thing we have done on the net. Obviously they have been a success in terms of enrollment—we never have difficulty filling up a course. And they have been a success in terms of user feedback. I think only two people have ever asked for their money back—and both cases were situations that had to do with their personal lives rather than us. Comments are uniformly positive.

We offered our first NetCourse in June 1995, and 779 people have taken NetCourses since then. Among those who take them, they are a success.

Now let me discuss NetCourses from StataCorp’s point of view.

First, offering a NetCourse is extremely expensive. Obviously huge amounts of time go into a course the first time it is offered but we have discovered that very large amounts of time go into it afterwards, too. The cost of intelligently answering intelligent questions is high. On a revenue-minus-cost basis, NetCourses cost more than they make. We lose money on them.

One could still argue that they have additional benefits that accrue to Stata and are reflected in more sales, and that might be true, although we do not know how to measure it. We have debated this internally.

Here is what we do know: if we converted the NetCourses into books, we would make money, and a fair amount of it. Thus, comparing a NetCourse that, on a cash basis, loses money with a book that would certainly make money, it is difficult to argue in favor of the NetCourse.

Nevertheless, I am still very positive on NetCourses. Here is my current thinking:

  1. If one is going to write a book, giving a NetCourse has low marginal cost.
  2. Moreover, giving the NetCourse at least twice will improve the book. First editions become more like second editions. At this point, one can even argue that NetCourses make money.
  3. At some point thereafter, however, the NetCourse is a money-losing proposition and, on that basis, the NetCourse should be converted to a book.

We are very favorably disposed to writing books. Books make money and they promote the use of Stata, so given that, NetCourses can be an efficient investment for StataCorp.

The following are open questions:

  1. If one were to publish a book for, say, NC-151, would anyone still take NC-151 if it were still offered?
  2. If the answer to (1) is no, how many times must one offer a NetCourse before turning it into a book? Obviously once would not be enough. People would say, I'll wait. Is it twice? Three times?
  3. Do some of the courses have greater value as NetCourses rather than as books?

James Hardin is working on approaching this from a different angle. James claims that NetCourses would be cheaper to administer if they were WebCourses using more modern technology. James and I disagree on this. I say answering intelligent questions intelligently is expensive and that is independent of technology. James says there is an administrative burden to NetCourses that I do not fully appreciate. We are going to run an experiment. When we get back, we are offering NC-101 as a WebCourse.

I do not know exactly what we are going to do, but the thinking right now is to convert some of the NetCourses to books and then I do not know whether we will continue to offer those particular NetCourses or not. In any case, you will be seeing more NetCourses on varied topics because we believe providing documentation is both profitable and useful.


Returning to the past two years: Stata 5.0

That pretty much covers what I view as the important events of the last two years. Obviously we released Stata 5.0 and that was a major event, but I do not think it important. Producing new releases is something we do. The company would be in big trouble if it considered the mere fact of getting out a new release important in the sense of being something that requires a lot of thought or that is a major hurdle to clear.

When I use the word important I mean important in terms of future implications. Things with big-time current effects but little in the way of future implications I call major. Stata 5.0 was major.

Obviously, what goes into the release is important and I told you last time how that is determined. Partly whim—what we find interesting—and partly what users demand.

Let me run over the important additions:

  1. No set maxvar/maxobs.
  2. Larger limits, and even larger limits are coming.
  3. Better integration with Windows.
  4. xtgee. Obviously the analysis of panel data is something we think important; it is something we have identified as a major component of Stata.
  5. robust and cluster() options on nearly all the estimation commands. We'll complete that work in the next release.
  6. svy commands, which are related to (4).

I know people are curious as to how such decisions are made so let me run over the history of these last two to show how randomness enters the process.

We have had Huber estimators of variance for linear regression, logistic regression, and a few other estimators for some time. Why did that happen? Bill Rogers was a student of Peter Huber back at Stanford. All of Stata’s senior developers get to spend some amount of their time adding features to Stata they consider interesting or fun. That’s just a management tool we use: In my experience, people are more productive working on all assigned tasks if they get to assign some of the tasks to themselves. Moreover, on the tasks they do assign themselves, they produce very good software although sometimes the market for it is lacking.

Anyway, Bill Rogers added Huber estimation and it really was not much work. Those Huber commands languished virtually unused for years although, slowly, perhaps just because they were there, people began to use them. For other reasons I do not understand, people got more interested in this estimator of variance, and we started getting questions on it. When was it appropriate? What did it do? Should I use it in this case? There was a unique aspect of the Huber estimator we had in Stata: it provided clustering. Yes, we knew something about Huber but when Huber estimation was added to Stata, the addition of clustering was a Bill Rogers invention that was—as Bill Rogers said—obvious.
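
For anyone who has not used these options, a brief and hedged usage sketch may help; the variable names y, x, and id below are made up for illustration, not taken from any real analysis:

        /* y, x, and id are illustrative names */
        regress y x, robust cluster(id)

The robust option requests the Huber estimate of variance, and cluster() relaxes the usual independence assumption to independence across, but not necessarily within, the groups defined by id.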

Now Bill Rogers has a very Tukey-esque outlook on statistics. Here’s a reasonable thing to do, so do it. This resulted in us recommending the use of these Huber estimators, mostly in private communication with users. [Comment after the fact: My comment about Bill Rogers is unfair to both Bill Rogers and John Tukey and, as many of you know, I have the greatest respect for both of them.]

Every so often, one of those users would ask us for references and we did not have much we could supply and we had virtually nothing on the clustering option. I began to feel very uncomfortable. Here was a feature in Stata, admittedly not used much but showing signs of growing interest, and we really did not know much about it. We were in over our heads.

So I assigned to Bill Sribney the task of finding out about this Huber estimator. Explain it. Find references. Is this something we should pull back from or something we should go forward with? Make a report, especially about this clustering option. Bill dug and discovered its connection with the survey literature. At that point Bill Sribney said that, to answer my questions, we needed technical expertise we did not have, and we asked John Eltinge, a survey statistician, to consult with us.

At this point it was still a low-level effort, charged only with answering the question: is use of the Huber estimator to be promoted or not and, if so, can we provide references about its use? John and Bill traced the estimator’s entire history through both the survey literature and the robust literature.

Good stuff, they said. So at that point we made the decision to promote hreg, hlogit, etc., out of the suburbs and back into the main estimation commands. We would promote its use.

Now understand where we were: We had just gone through the survey literature, not because anybody had said "Go through the survey literature", but because answering the question on Huber had led in that direction. We now had a very competent survey statistician with whom we could consult. The cost of adding survey additions to Stata had just gone down. Moreover, Bill Sribney said he wanted to pursue it as his personal project. So we did.

I wish I could tell you that the statistical additions to Stata are carefully considered in terms of users’ needs and the size of the market, but it is not so. It is very much a random process. That is as opposed to the programming-language and computer-science-related additions to Stata, where we do have long-run plans and, mostly, we stick to them.

All right, let's go back to the other important additions to Stata:

  1. New survival commands, obviously.
  2. New table command.
  3. fracpoly.
  4. insheet.
  5. Menu-programming language. As I said earlier, this is not much used yet but we believe this will be an important component for the future of Stata.
  6. Graphics-programming language.

Finally, there was one other major (remember my definition of major vs. important) work:

  1. We revamped the manuals.

This is major but, unfortunately, probably will not turn out to be important because we have not yet figured out how to document Stata. This is an important problem for the future. Here are the issues:

  1. Stata is, at heart, simple.
  2. Over the years, Stata has picked up lots of capabilities, and it takes thousands of pages to document fully those capabilities.
  3. New users want to get going quickly and they do not want to be tied up in the details.
  4. Professional researchers want to know the details—they want references and they want formulas. Moreover, we want them to because we have to keep track of that information and putting it in the manual is convenient for us.
  5. People do not want heavy manuals. Information content is good. Pages are good. Weight is bad.
  6. The costs of printing and shipping the manuals are the major cost of producing a copy of Stata. We want to have low prices but the manuals get in the way of that.
  7. As we continue to add features to Stata, the documentation problem gets worse and worse.

We have not figured out a solution to this problem. Here is what we believe:

  1. The Getting Started manual is a great success for new users.
  2. The User's Guide that covers the basic is pretty good.
  3. The 3-volume reference manuals are growing without bound.

Right now we are focusing on (3).

We are thinking about allowing the reference manuals to continue to grow without bound for the professional users who want them, but introducing a 1-volume reference manual subset. This manual would not document the programming, matrix, and other advanced commands. Most technical notes would be eliminated. Methods and formulas would be deleted. References would remain. The manual would be lighter and good enough, as a computer manual, for many users.

It would be produced by deletion.

Comments and other suggestions appreciated.


The Future

[Thereupon followed lengthy comments speculating about the future.]