Second Biannual Report to Users

Speaker: William Gould, StataCorp
Date: 6 June 1997
Two years ago I came to London and gave my first “Report to
Users” and now I’m back to update it.
I had selected the title of my talk very carefully two years ago because it
conveyed something I wanted to emphasize. The talk was not entitled
“Comments from Above: Future developments in Stata”. User is
not a dismissive term, especially in our business. Users know as much as or
more than we at Stata know about statistics, about what they are doing, and
about how they want to do it.
In fact, it is that characteristic that is almost unique to the statistical
software industry. In most software, the people who write it know more than
the people who use it. That is certainly true of word processors and
spreadsheets and all the rest of the “office systems”. And that
makes the design of those systems relatively easy: Since the people who
write the software know more than users, they can evaluate designs through
introspection.
In statistical software, on the other hand, when we at StataCorp are lucky,
we know as much about a topic as our users. Generally, we know less. Oh,
we know more about certain computer topics—how to write parsers, how
to sort efficiently, how to make calculations in stable ways—but in
terms of how the product is used, we really do know less and this puts us at
a considerable disadvantage.
It is that realization that guides most of our actions.
I hope you will excuse me if I take a few minutes to summarize my previous
talk because some of you did not hear it and where I want to pick up is
where the last one ended. I will say what I said before in a different way.
What I tried to say last time is that we—the software
developers—are at a considerable disadvantage. I went on to ask
whether there is any software industry, any model, in which the developers
are at a similar disadvantage relative to their users.
Yes, I said. The people who make compilers. The people who make operating
systems. The people who make development tools. Now here’s a topic I
know something about. James and I spend much of our time developing
software and, in doing that, we use products for which we are the customers.
The people who write compilers know more than James and I know about certain
aspects of compiler technology—in particular I am very out of date
concerning optimization schemes—but James and I know a lot about
software development and, in fact, we know more than the providers of the
tools we use about certain aspects of that development.
That was the insight. Our relationship with you is just like our providers'
relationship with us. So what is it on the development-tool front that
has proven successful? What are the characteristics that James and I share
when we are the customers?
Here are the answers:
As a user of development software, I want to control my environment. I do
not want to be forced to work this way or that, to be told that I cannot get
there from here. I am a very smart person. I am smarter than the compiler
I use. I can carry out complicated tasks. I can understand subtle
concepts. Hide nothing from me. Tell me exactly what is going on and how
and then I will exploit the environment to its fullest.
You are not so different. You are smarter than Stata. Stata does not stand
in judgment and tell you no, you shouldn't do that, or this is what you
ought to do next. Rather, it is up to you to perform sequences of steps to
accomplish what you want, and sometimes those steps are complex and subtle.
Returning to the developer market, what has been remarkably successful is
the incremental design for software tools and operating systems.
That’s computer-science jargon and all it means is that I can add on
to the operating system and, when I do, the result is just as if the
operating system shipped from the factory with that addition. We developers
who are constantly performing unique and complex tasks constantly write
little tools—new commands—to make our tasks easier.
You are not so different. You researchers are constantly performing unique
and complex tasks. Would it not be convenient if you could write little
tools—new commands—to make your tasks easier?
And thus was Stata designed. The insight was that statistical systems are
not like word processors, like spreadsheets, like office systems. The
computer-science jargon for this is monolithic. A monolithic system comes
preassembled from the factory—it does what it does and perhaps it
does that quite well. Most existing statistical systems are monolithic
designs.
The insight was that the proper model for a statistical system is the
operating system. Statistical systems really should be Statistical
Operating Systems.
And so the design of Stata was just stolen from the principles of
incremental operating system design. Once we had the insight the rest was
easy because the computer scientists had already done all the ground work:
There is a large literature on operating systems and it is just a matter of
reading it and translating it to a statistical application.
The whole design of Stata is stolen. We looked at Unix, we looked at DOS,
we read the computer science literature. The front end—what you
see—will seem familiar if you know a variety of operating systems.
There’s a Unix bent, but actually the front end has more of a Wylbur
twist to it. The whole idea of return codes was taken from VM/CMS.
Ado-files? Think Unix scripts and DOS bat files.
The parts that you cannot see were similarly stolen. The whole internal
structure of Stata is right out of the computer science literature for
incremental operating systems. Stata would need a memory manager. It says
so, just open the book. Stata would need a caching system. That’s
chapter 7. Why is Stata so fast? Because Stata really was designed by
experts after hundreds of person years of experimentation. We just borrowed
their knowledge and applied it to a different application.
Now, once you accept that the operating system is the right model for
statistical software, that leads to a range of considerations in how the
support services for that software are organized.
Users are smart. Users are knowledgeable. Users want to control their
environment. Users want to be involved in the development process. Or at
least, some users do. Why? Because those users know what they are doing.
About that topic, they know more than James or me. They do something all
the time, they are world-class experts, and the only way it is ever going to
get done right, they realize, is if they do it themselves. It is hopeless
for James and me to try to compete with them. James and I can, however,
help.
If users—some users—are to be involved in the development
process, they must have the same access to Stata's internals and to
information about Stata as we at StataCorp do. Hence, our openness and our
constant struggle to be even more open.
Let me give an example: I am not confused, I know Stata is a statistical
package. Sometimes James and I even do statistics but that’s not very
interesting to talk about because it has no implications beyond the
particular statistic implemented. Nevertheless, a statistical system
without statistics would be pointless. Here, StataCorp is no different than
any of the statistical vendors—we try.
Two years ago I came to London and walked into a buzz saw. It started
kindly with a talk by Michael Hills who gently mentioned some shortcomings
in our survival analysis routines—all the while adding that we still
had the best routines around. It then was continued by Peter Sasieni who
—this is Britain—was also understated but who left no doubt
as to exactly what the shortcomings were.
We addressed that—that's unimportant right now—but how we
addressed it is important. When we rewrote our survival-analysis routines,
we documented everything. Open the manual to [R] st_is. The utility
routines—the guts that make the whole system work—are fully
documented. How the system works is fully documented. If you want to add
to that system, you can, and you can do that as elegantly as we do. My
understanding is that Michael Hills and David Clayton have already done
that. In any case, that is what it means to be open.
By the way, if users—some users—are to be involved in the
development process, they must have a way to distribute what it is they do.
Hence, the STB.
So two years ago I told a story and it went like this:
- Researchers, the users of statistical software, are smart. Note
well that it is they who develop the things that go into a statistical
package.
- Most statistical packages are monolithic, meaning merely that they
are written at the factory by computer professionals. Thank you
very much, researcher, we'll take it from here, and contact us if
you have any future thoughts.
- Researchers have never liked this but these packages are so well
written in other ways—data management, for instance—that
researchers used them. I told a biased story of the history
of statistical computing that emphasized that every time the
researchers had an opportunity, they struck out on their
own.
- The proper design for a statistical system, I argued, is that of
an operating system and, in particular, an incremental one.
- Incremental means merely that anybody can add new features—they
do not have to be added at the factory—and once added, the new
features are indistinguishable from those that were added at the
factory.
- Not all researchers will want to make additions to Stata but all
researchers will benefit if those additions are widely distributed.
- In addition to the software design itself, there must be
ways for researchers to communicate with each other. The STB
was the first tool to assist that and Statalist was the second.
- We are a statistical software provider. We have a
responsibility to add new statistical routines to Stata.
- This just means that if researchers are to be allowed to participate
too, then we at the factory must exercise certain cautions. We need
to develop a language that we can all use whether you are an insider
or an outsider. We must consider ourselves tool builders just like
the compiler manufacturers. It makes more sense for us to spend our
time at the factory adding to Stata's programming language and then
using that language to implement new statistical procedures than to
implement the procedures directly. It takes us a little longer to do
it that way in terms of when the new statistical procedure is
delivered but it subsequently allows others to implement new
statistical procedures.
- For the short run, it is the statistical procedures that are in
  Stata that matter.
- For the long run, it is the openness and programming capabilities
  of Stata that matter.
- If outsiders are to implement additions to Stata, there must be a
scheme by which those new additions can be certified and researchers
can use those tools with some assurance that results are correct.
Two years ago I tied these 12 points into what we had done and where we
thought we were going. As an aside, I left off with the puzzle of the
Internet. The Internet, I said not very insightfully, was going to be
important. Somehow there had to be a way that the Internet tied into these
12 points but I did not know how and that was what I was currently thinking
about.
So with that introduction, let me continue.
First I want to briefly run over the last two years and highlight certain
events and then I want to talk about the future.
The last two years
I very much enjoy these meetings and, more importantly, I find them of great
value. It is not the opportunity of giving this talk that I value—I
don’t mind and, I admit, I even enjoy it, and besides, I owe you this
report—but that is not what I value. What I value—and do not
always enjoy—is sitting in the back and listening.
I mentioned Michael Hills’ and Peter Sasieni’s talks two years
ago. I came here unaware that Stata's survival-analysis routines needed
updating but I left with no doubts. More importantly, I knew exactly what
the problems were. So, just as a matter of reporting, I went back and said
all of our survival-analysis routines go into the trash can. We start again
and rewrite the whole thing. No patching—we rewrite. Mostly we did
a pretty good job. We made one mistake—Michael Hills has already
told me that time should be allowed to be negative and this I have agreed
to.
I just want to report that we do listen and, with a lag, we respond.
What makes listening at these meetings so useful for me is that, in the face
of criticism, while I may not enjoy it, I need not feel defensive. We are
all Stata users and we all agree that Stata is a wonderful package. That
said, we can talk honestly about how to make it better.
Next topic.
I think the most important thing in terms of future implications that has
happened in the last two years is our use of the Internet.
Two years ago I mentioned my puzzlement about what to do with the Internet.
Well, we are making some progress. Obviously, we have a website—I
cannot remember whether it opened just before or after the meetings two
years ago, but we have one.
Putting aside the obvious marketing value of the site, the important part is
the “User Support” half, so let me update you on that:
Under User Support, we have
- FAQs
- Cool ado-files
- STB and STB-FTP
- Updates
- StataQuest Additions
- Statalist
- Links
- Netcourses
I want to go through each of these.
FAQs
This has been reasonably successful although we do not have enough. Too
often, in my view, we reply to a question either privately through technical
support or on Statalist and do not make a FAQ out of it.
To improve things in the future, the first thing we are doing is
attributing—adding authors and dates—to each and every
statistical FAQ and ultimately to all of them. That work started just last
week and should be finished soon. For some of us, having our name up at
the website provides an incentive to write these things.
In addition, we are adding instructions on how to cite a particular FAQ. I
do not expect many will cite but I want to see more purely statistical FAQs
discussing purely statistical issues and perhaps those will be cited.
The second part of this plan—which will be going into effect almost
immediately—is to obtain FAQs from Statalist. Not just StataCorp
people can write FAQs and, every so often, something excellent appears on
Statalist. We will be combing Statalist for potential FAQs and then seeking
the author’s permission to put it up on our Website. With
attribution, of course.
Finally—and this will be starting almost immediately—we are
going to index the FAQs through the lookup system so that they have a
higher profile. This way, if there is a relevant FAQ, you can find it. And
remember, the lookup database is updated whenever you update Stata.
Cool Ado-files
This seemed like a good idea but has proven to be a failure. I attribute
this to us: we simply have not gone through Statalist as we should and
pulled the good stuff. I had hoped users would tell us what to add to Cool
Ado and then it would just be our responsibility to copy, but it did not
work out that way.
I’m not ready to give up on cool-ado, but I expect it will languish
quite a while longer.
STB and STB-FTP
The STB portion is just advertising. The STB-FTP is a set of links to other
sites that provide the STB diskettes.
StataCorp’s attitude toward the STB being distributed over the net has
been schizophrenic. On the one hand, we hate to lose the revenue from
diskettes. On the other hand, we agree with the theoretical proposition
that the materials should be distributed freely. StataCorp's solution has
been to not provide the STB diskettes itself but to allow others to
distribute them. This has basically been how the technical group has agreed
to disagree with the marketing group.
These days, the revenue from diskettes amounts to little and losing all that
revenue would not bother us at all. By that logic, we should immediately
put the STB diskettes up on our site but we are not going to do that. We
have other ideas for the STB diskettes which I will get to when I talk about
the future.
Updates
We provide updates on our Website and this has proven quite useful to us.
First, it lowers the cost when we make a mistake. We have, even before the
Web, distributed updated ado-files via the STB, so we have always been
committed to the idea of continual updates. The website allows us to
distribute updates to the binary executable itself, something we could not
afford to do when we were mailing updates on diskette.
This is a big success and, because of that, we are going to focus on this
and we have some big changes coming.
StataQuest Additions
Some people have downloaded the StataQuest Additions, but that is not
important. There are actually two aspects to this:
- StataQuest itself and
- The menu programming language.
As things stand right now, Stata’s menu programming language is little
used by users and that, we think, is because it is little used by us.
Nevertheless, the MPL is an important addition to Stata—we just need
to show you how to use it. We will be doing that.
StataQuest plays little role in Stata’s long-run plans. We are proud
of it and it is used, but we mainly entered into the StataQuest agreement
with Duxbury because we wanted an application for testing our MPL.
Stata makes software for researchers. That is what we do. We are
interested in seeing our software used for teaching future researchers but
we have no interest in trying to make money off the education market per se.
I believe it is important that we stay focused on our primary task.
Statalist
The website entry is just a way to subscribe to Statalist so, basically, I
want to use this as an excuse to discuss the list itself.
From our perspective it is a resounding success. It is, in a sense, like
these London meetings. Everybody agrees that Stata is wonderful, so
let’s be honest, what about ...
From our perspective, Statalist is a way for users to gain access to
StataCorp that is also profitable for us.
Pretend one of you emails me personally with a question and I reply
personally. I ask you to think about this from a purely monetary point of
view. Perhaps I spend half an hour drafting my reply. The result is that,
if I do a good job, you feel warmly toward StataCorp. If we are very lucky,
perhaps you feel so warmly you mention it to somebody. The monetary return
is small but positive. Perhaps I have increased the likelihood you will
upgrade. Perhaps I have increased the probability that, in speaking to
someone else, they will purchase Stata. The cost, however, is high—30 minutes of my salary plus overhead.
Now say you ask the same question on Statalist and I reply there. This is
being done in public. A thousand people will read the question and answer
and, if I do a good job, a thousand people will feel warmly toward Stata.
The return is 1,000 times greater and the cost is the same. Now, if I can
also convert your question and my answer into a FAQ, perhaps another
thousand people will read it over the period of a year.
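The cost-per-reader arithmetic here can be sketched numerically. The
half-hour reply time and the 1,000-reader figures come from the talk; the
hourly rate is a hypothetical placeholder, since no salary figure is given:

```python
# Cost-per-reader comparison: private reply vs. Statalist reply.
# The hourly rate is a made-up illustrative number.
HOURLY_RATE = 60.0               # hypothetical salary plus overhead, per hour
reply_cost = 0.5 * HOURLY_RATE   # half an hour drafting one reply

private_readers = 1              # a personal email reaches one person
list_readers = 1000              # a Statalist reply reaches the whole list
faq_readers = 1000               # converting the reply to a FAQ reaches more

cost_per_reader_private = reply_cost / private_readers
cost_per_reader_list = reply_cost / (list_readers + faq_readers)

print(round(cost_per_reader_private, 2))  # 30.0
print(round(cost_per_reader_list, 3))     # 0.015
```

Whatever the hourly rate, the fixed cost of one good answer is spread over
every reader, which is why the per-reader cost falls by three orders of
magnitude and why growing the subscriber base makes each reply cheaper.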
This we can afford. And, as a matter of fact, the more subscribers we can
get, the more activity like this we can justify. So one goal is to increase
the Statalist subscription rate. If anybody has any ideas on how to do
that, I would like to hear them.
Statalist has another advantage from our point of view. Sometimes questions
are asked and answered and we need do nothing. We want to promote others
answering questions, and that is one of the major reasons we want to
attribute the FAQs and promote Statalist responses into FAQs.
Links
We have links to all the other statistical software vendors. This page is
always in our top-10 list of pages hit, with a couple hundred people a day
visiting it.
This is, I must say, one of our better ideas. The important thing about
this list is its even handedness. If we hear about software, or if a vendor
asks us to list them—and vendors do—there are only two
tests:
- Is it statistical software?
- Is the URL valid?
We provide no commentary because we could not do that in an unbiased way.
One minor change I would like to make in the future would be to list the
freeware packages separately. I think the freeware providers and the users
of the service would like that.
If you know of any links, please tell me.
NetCourses
The website page is just advertising but, as with Statalist, I will use
this as an excuse to discuss NetCourses.
This is, in my judgment, the most inventive thing we have done on the net.
Obviously they have been a success in terms of enrollment—we never
have difficulty filling up a course. And they have been a success in terms
of user feedback. I think only two people have ever asked for their money
back—and both cases were situations that had to do with their
personal lives rather than us. Comments are uniformly positive.
We offered our first NetCourse in June 1995, and 779 people have taken
NetCourses since. Among those who take them, they are a success.
Now let me discuss NetCourses from StataCorp’s point of view.
First, offering a NetCourse is extremely expensive. Obviously huge amounts
of time go into a course the first time it is offered but we have discovered
that very large amounts of time go into it afterwards, too. The cost of
intelligently answering intelligent questions is high. On a
revenue-minus-cost basis, NetCourses cost more than they make. We lose
money on them.
One could still argue that they have additional benefits that accrue to
Stata and reflect themselves in more sales, and that might be true although
we do not know how to measure it. We have debated this internally.
Here is what we do know: if we converted the NetCourses into books, we
would make money, and a fair amount of it. Thus, comparing a NetCourse
that loses money on a cash basis with a book that would certainly make
money, it is difficult to argue in favor of the NetCourse.
Nevertheless, I am still very positive on NetCourses. Here is my current
thinking:
- If one is going to write a book, giving a NetCourse has low marginal
cost.
- Moreover, giving the NetCourse at least twice will improve the book.
First editions become more like second editions.
At this point, one can even argue that NetCourses make money.
- At some point thereafter, however, the NetCourse is a money-losing
proposition and, on that basis, the NetCourse should be converted
to a book.
We are very favorably disposed to writing books. Books make money and they
promote the use of Stata, so given that, NetCourses can be an efficient
investment for StataCorp.
The following are open questions:
- If one were to publish a book for, say, NC-151, would anyone still
  take NC-151 if it were still offered?
- If the answer to the first question is no, how many times must one
  offer a NetCourse before turning it into a book? Obviously once
  would not be enough. People would say, I'll wait. Is it twice?
  Three times?
- Do some of the courses have greater value as NetCourses rather than
as books?
James Hardin is working on approaching this from a different angle. James
claims that NetCourses would be cheaper to administer if they were
WebCourses using more modern technology. James and I disagree on this. I
say answering intelligent questions intelligently is expensive and that is
independent of technology. James says there is an administrative burden to
NetCourses that I do not fully appreciate. We are going to run an
experiment. When we get back, we are offering NC-101 as a WebCourse.
I do not know exactly what we are going to do, but the thinking right now is
to convert some of the NetCourses to books and then I do not know whether we
will continue to offer those particular NetCourses or not. In any case, you
will be seeing more NetCourses on varied topics because we believe providing
documentation is both profitable and useful.
Returning to the past two years: Stata 5.0
That pretty much covers what I view as the important events of the last two
years. Obviously we released Stata 5.0 and that was a major event, but I
do not think it important. Producing new releases is something we do. The
company would be in big trouble if it considered the mere fact of getting
out a new release important in the sense of being something that requires a
lot of thought or that is a major hurdle to clear.
When I use the word important I mean important in terms of future
implications. Things with big-time current effects but little in the way of
future implications I call major. Stata 5.0 was major.
Obviously, what goes into the release is important and I told you last time
how that is determined. Partly whim—what we find
interesting—and partly what users demand.
Let me run over the important additions:
- No set maxvar/maxobs.
- Larger limits, and even larger limits are coming.
- Better integration with Windows.
- xtgee. Obviously analysis of panel data is something we think
important. The analysis of panel data is something we have
identified as a major component of Stata.
- robust and cluster() options on nearly all the estimation
  commands. We'll complete that work in the next release.
- svy commands, which are related to the robust and cluster() work.
I know people are curious as to how such decisions are made so let me run
over the history of these last two to show how randomness enters the
process.
We have had Huber estimators of variance for linear regression, logistic
regression, and a few other estimators for some time. Why did that happen?
Bill Rogers was a student of Peter Huber back at Stanford. All of
Stata’s senior developers get to spend some amount of their time
adding features to Stata they consider interesting or fun. That’s
just a management tool we use: In my experience, people are more productive
working on all assigned tasks if they get to assign some of the tasks to
themselves. Moreover, on the tasks they do assign themselves, they produce
very good software although sometimes the market for it is lacking.
Anyway, Bill Rogers added Huber estimation and it really was not much work.
Those Huber commands languished virtually unused for years although, slowly,
perhaps just because they were there, people began to use them. For other
reasons I do not understand, people got more interested in this estimator of
variance, and we started getting questions on it. When was it appropriate?
What did it do? Should I use it in this case? There was a unique aspect of
the Huber estimator we had in Stata: it provided clustering. Yes, we knew
something about Huber but when Huber estimation was added to Stata, the
addition of clustering was a Bill Rogers invention that was—as Bill
Rogers said—obvious.
Now Bill Rogers has a very Tukey-esque outlook on statistics. Here’s
a reasonable thing to do, so do it. This resulted in us recommending the
use of these Huber estimators, mostly in private communication with users.
[Comment after the fact: My comment about Bill Rogers is unfair to
both Bill Rogers and John Tukey and, as many of you know, I have the
greatest respect for both of them.]
Every so often, one of those users would ask us for references and we did
not have much we could supply and we had virtually nothing on the clustering
option. I began to feel very uncomfortable. Here was a feature in Stata,
admittedly not used much but showing signs of growing interest, and we
really did not know much about it. We were in over our heads.
So I assigned to Bill Sribney the task of finding out about this Huber
estimator. Explain it. Find references. Is this something we should pull
back from or something we should go forward with? Make a report, especially
about this clustering option. Bill dug and discovered its connection with
the survey literature. At that point Bill Sribney said we needed technical
expertise we did not have to answer my questions and we asked John Eltinge,
a survey statistician, to consult with us.
At this point it was still a low-level effort charged only with answering
the question: is use of the Huber estimator to be promoted or not and, if
so, can we provide references about its use? John and Bill traced the
estimator’s entire history both through the survey literature and the
robust literature.
Good stuff, they said. So at that point we made the decision to promote
hreg, hlogit, etc., out of the suburbs and back into the main estimation
commands. We would promote its use.
Now understand where we were: We had just gone through the survey literature
not because anybody had said "Go through the survey literature" but because
answering the question on Huber had led that direction. We now had a very
competent survey statistician with whom we could consult. The cost of
adding survey additions to Stata had just gone down. Moreover, Bill Sribney
said he wanted to pursue it as his personal project. So we did.
I wish I could tell you that the statistical additions to Stata are
carefully considered in terms of users' needs and size of market, but it is
not so. It is very much a random process. That is as opposed to the
programming language and computer-science related additions to Stata where
we do have long-run plans and, mostly, we stick to them.
All right, let's go back to the other important additions to Stata:
- New survival commands, obviously.
- New table command.
- fracpoly.
- insheet.
- Menu-programming language.
As I said earlier, this is not much used yet but we believe this
will be an important component for the future of Stata.
- Graphics-programming language.
Finally, there was one other major (remember my definition of major vs.
important) work:
- We revamped the manuals.
This is major but, unfortunately, probably will not turn out to be important
because we have not yet figured out how to document Stata. This is an
important problem for the future. Here are the issues:
- Stata is, at heart, simple.
- Over the years, Stata has picked up lots of capabilities, and it
takes thousands of pages to document fully those capabilities.
- New users want to get going quickly and they do not want to be
tied up in the details.
- Professional researchers want to know the details—they want
references and they want formulas. Moreover, we want them to
because we have to keep track of that information and putting
it in the manual is convenient for us.
- People do not want heavy manuals. Information content is good.
  Pages are good. Weight is bad.
- The costs of printing and shipping the manuals are the major
cost of producing a copy of Stata. We want to have low prices
but the manuals get in the way of that.
- As we continue to add features to Stata, the documentation problem
gets worse and worse.
We have not figured out a solution to this problem. Here is what we
believe:
- The Getting Started manual is a great success for new users.
- The User's Guide that covers the basics is pretty good.
- The 3-volume reference manuals are growing without bound.
Right now we are focusing on the third of these.
We are thinking about allowing the reference manuals to continue to grow
without bound for the professional users that want them, but introducing a
1-volume reference manual subset. This manual would not document the
programming, matrix, and other advanced commands. Most technical notes
would be eliminated. Methods and formulas would be deleted. References
would remain. The manual would be lighter and good enough, as a
computer manual, for many users.
It would be produced by deletion.
Comments and other suggestions appreciated.
The Future
[Thereupon followed lengthy comments speculating about the future.]