Second Biannual Report to Users
Speaker: William Gould, StataCorp
Date: 6 June 1997
Two years ago I came to London and gave my first "Report to Users" and now I'm
back to update it.
I had selected the title of my talk very carefully two years ago because it
conveyed something I wanted to emphasize. The talk was not entitled "Comments
from Above: Future developments in Stata". User is not a dismissive term,
especially in our business. Users know as much as or more than we at Stata
know about statistics, about what they are doing, and about how they want to
do it.
In fact, it is that characteristic that is almost unique to the statistical
software industry. In most software, the people who write it know more than
the people who use it. That is certainly true of word processors and
spreadsheets and all the rest of the "office systems". And that makes the
design of those systems relatively easy: Since the people who write the
software know more than users, they can evaluate designs through
introspection.
In statistical software, on the other hand, when we at StataCorp are lucky, we
know as much about a topic as our users. Generally, we know less. Oh, we
know more about certain computer topics — how to write parsers, how to
sort efficiently, how to make calculations in stable ways — but in terms
of how the product is used, we really do know less and this puts us at a
considerable disadvantage.
It is that realization that guides most of our actions.
I hope you will excuse me if I take a few minutes to summarize my previous
talk because some of you did not hear it and where I want to pick up is where
the last one ended. I will say what I said before in a different way.
What I tried to say last time is that we, the software developers, are at a
considerable disadvantage. I went on to ask whether there is any software
industry, any model, in which the developers are at a similar disadvantage
relative to their users.
Yes, I said. The people who make compilers. The people who make operating
systems. The people who make development tools. Now here's a topic I know
something about. James and I spend much of our time developing software and,
in doing that, we use products for which we are the customers. The people who
write compilers know more than James and I know about certain aspects of
compiler technology — in particular I am very out of date concerning
optimization schemes — but James and I know a lot about software development
and, in fact, we know more than the providers of the tools we use about
certain aspects of that development.
That was the insight. Our relationship with you is just like our providers'
relationship with us. So what is it on the developmental tool front that has
proven successful? What are the characteristics that James and I share when
we are the customers?
Here are the answers:
As a user of developmental software, I want to control my environment. I do
not want to be forced to work this way or that, to be told that I cannot get
there from here. I am a very smart person. I am smarter than the compiler I
use. I can carry out complicated tasks. I can understand subtle concepts.
Hide nothing from me. Tell me exactly what is going on and how and then I
will exploit the environment to its fullest.
You are not so different. You are smarter than Stata. Stata does not stand
in judgment and tell you no, you shouldn't do that, or this is what you ought
to do next. Rather, it is up to you to perform sequences of steps to
accomplish what you want, and sometimes those steps are complex and subtle.
Returning to the developer market, what has been remarkably successful is the
incremental design for software tools and operating systems. That's
computer-science jargon and all it means is that I can add on to the operating
system and, when I do, the result is just as if the operating system shipped
from the factory with that addition. We developers who are constantly
performing unique and complex tasks constantly write little tools — new
commands — to make our tasks easier.
You are not so different. You researchers are constantly performing unique
and complex tasks. Would it not be convenient if you could write little
tools — new commands — to make your task easier?
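To make the idea of an incremental design concrete, here is a minimal sketch in Python. It is an analogy only, not a description of how Stata is implemented: the registry `COMMANDS`, the decorator `command`, and the dispatcher `run` are hypothetical names invented for this illustration. The point it shows is that a command added later goes through exactly the same mechanism as one that shipped from the factory.

```python
from typing import Callable, Dict, List

# Hypothetical registry: one table holds every command, whoever wrote it.
COMMANDS: Dict[str, Callable[[List[float]], float]] = {}


def command(name: str):
    """Return a decorator that registers a function under a command name."""
    def register(func):
        COMMANDS[name] = func
        return func
    return register


# A "factory" command, shipped with the system.
@command("mean")
def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# A user-written command, added later through the same mechanism.
@command("range")
def value_range(values: List[float]) -> float:
    return max(values) - min(values)


def run(name: str, values: List[float]) -> float:
    """Dispatch by name; factory and user commands take the same path."""
    if name not in COMMANDS:
        raise KeyError(f"unrecognized command: {name}")
    return COMMANDS[name](values)


data = [2.0, 4.0, 9.0]
print(run("mean", data))
print(run("range", data))
```

Nothing in `run` can tell which command came from the factory and which from a user, and that is the property the word incremental is meant to capture.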
And thus was Stata designed. The insight was that statistical systems are not
like word processors, like spreadsheets, like office systems. The
computer-science jargon for such systems is monolithic. A monolithic system comes
preassembled from the factory — it does what it does and perhaps it does
that quite well. Most existing statistical systems are monolithic designs.
The insight was that the proper model for a statistical system is the
operating system. Statistical systems really should be Statistical Operating
Systems.
And so the design of Stata was just stolen from the principles of incremental
operating system design. Once we had the insight the rest was easy because
the computer scientists had already done all the ground work: There is a
large literature on operating systems and it is just a matter of reading it
and translating it to a statistical application.
The whole design of Stata is stolen. We looked at Unix, we looked at DOS, we
read the computer science literature. The front end — what you see
— will seem familiar if you know a variety of operating systems.
There's a Unix bent, but actually the front end has more of a Wylbur twist to
it. The whole idea of return codes was taken from VM/CMS. Ado-files? Think
Unix scripts and DOS bat files.
The parts that you cannot see were similarly stolen. The whole internal
structure of Stata is right out of the computer science literature for
incremental operating systems. Stata would need a memory manager. It says
so, just open the book. Stata would need a caching system. That's chapter 7.
Why is Stata so fast? Because Stata really was designed by experts after
hundreds of person years of experimentation. We just borrowed their knowledge
and applied it to a different application.
Now, once you accept that the operating system is the right model for
statistical software, that leads to a range of considerations in how the
support services for that software are organized.
Users are smart. Users are knowledgeable. Users want to control their
environment. Users want to be involved in the development process. Or at
least, some users do. Why? Because those users know what they are doing.
About that topic, they know more than James or me. They do something all the
time, they are world-class experts, and the only way it is ever going to get
done right, they realize, is if they do it themselves. It is hopeless for
James and me to try to compete with them. James and I can, however, help.
If users — some users — are to be involved in the development process, they
must have the same access to Stata's internals and to information about Stata
as we at StataCorp do. Hence, our openness and our constant struggle to be
even more open.
Let me give an example: I am not confused, I know Stata is a statistical
package. Sometimes James and I even do statistics but that's not very
interesting to talk about because it has no implications beyond the particular
statistic implemented. Nevertheless, a statistical system without statistics
would be pointless. Here, StataCorp is no different than any of the
statistical vendors — we try.
Two years ago I came to London and walked into a buzz saw. It started kindly
with a talk by Michael Hills who gently mentioned some shortcomings in our
survival analysis routines — all the while adding that we still had the
best routines around. It then was continued by Peter Sasieni who — this
is Britain — was also understated but who left no doubt as to exactly
what the shortcomings were.
We addressed that — that's unimportant right now — but how we
addressed it is important. When we rewrote our survival-analysis routines, we
documented everything. Open the manual to [R] st_is. The utility routines
— the guts that make the whole system work — are fully documented.
How the system works is fully documented. If you want to add to that system,
you can, and you can do that as elegantly as we do. My understanding is that
Michael Hills and David Clayton have already done that. In any case, that is
what it means to be open.
By the way, if users — some users — are to be involved in the development
process, they must have a way to distribute what it is they do. Hence, the
STB.
So two years ago I told a story and it went like this:
- Researchers, the users of statistical software, are smart. Note
well that it is they who develop the things that go into a statistical
package.
- Most statistical packages are monolithic, meaning merely that they
are written at the factory by computer professionals. Thank you
very much, researcher, we'll take it from here, and contact us if
you have any future thoughts.
- Researchers have never liked this but these packages are so well
written in other ways — data management, for instance — that
researchers used them. I told a biased story of the history
of statistical computing that emphasized that every time the
researchers had an opportunity, they struck out on their
own.
- The proper design for a statistical system, I argued, is that of
an operating system and, in particular, an incremental one.
- Incremental means merely that anybody can add new features — they
do not have to be added at the factory — and once added, the new
features are indistinguishable from those that were added at the
factory.
- Not all researchers will want to make additions to Stata but all
researchers will benefit if those additions are widely distributed.
- In addition to the software design itself, there must be
ways for researchers to communicate with each other. The STB
was the first tool to assist that and Statalist was the second.
- We are a statistical software provider. We have a
responsibility to add new statistical routines to Stata.
- This just means that if researchers are to be allowed to participate
too, then we at the factory must exercise certain cautions. We need
to develop a language that we can all use whether you are an insider
or an outsider. We must consider ourselves tool builders just like
the compiler manufacturers. It makes more sense for us to spend our
time at the factory adding to Stata's programming language and then
using that language to implement new statistical procedures than to
implement the procedures directly. It takes us a little longer to do
it that way in terms of when the new statistical procedure is
delivered but it subsequently allows others to implement new
statistical procedures.
- For the short run, it is the statistical procedures that are in
Stata that matter.
- For the long run, it is the openness and programming capabilities
of Stata that matter.
- If outsiders are to implement additions to Stata, there must be a
scheme by which those new additions can be certified and researchers
can use those tools with some assurance that results are correct.
Two years ago I tied these 12 points into what we had done and where we
thought we were going. As an aside, I left off with the puzzle of the
Internet. The Internet, I said not very insightfully, was going to be
important. Somehow there had to be a way that the Internet tied into these 12
points but I did not know how and that was what I was currently thinking
about.
So with that introduction, let me continue.
First I want to briefly run over the last two years and highlight certain
events and then I want to talk about the future.
The last two years
I very much enjoy these meetings and, more importantly, I find them of great
value. It is not the opportunity of giving this talk that I value — I don't
mind and, I admit, I even enjoy it — and more importantly I owe you this
report — but that is not what I value. What I value — and do not always
enjoy — is sitting in the back and listening.
I mentioned Michael Hills' and Peter Sasieni's talks two years ago. I came
here unaware that Stata's survival-analysis routines needed updating but I
left with no doubts. More importantly, I knew exactly what the problems were.
So, just as a matter of reporting, I went back and said all of our
survival-analysis routines go into the trash can. We start again and rewrite
the whole thing. No patching — we rewrite. Mostly we did a pretty good job.
We made one mistake — Michael Hills has already told me that time should be
allowed to be negative and this I have agreed to.
I just want to report that we do listen and, with a lag, we respond.
What makes listening at these meetings so useful for me is that, in the face
of criticism, while I may not enjoy it, I need not feel defensive. We are all
Stata users and we all agree that Stata is a wonderful package. That said, we
can talk honestly about how to make it better.
Next topic.
I think the most important thing in terms of future implications that has
happened in the last two years is our use of the Internet.
Two years ago I mentioned my puzzlement about what to do with the Internet.
Well, we are making some progress. Obviously, we have a web site — I cannot
remember whether it opened just before or after the meetings two years ago,
but we have one.
Putting aside the obvious marketing value of the site, the important part is
the "User Support" half, so let me update you on that:
Under User Support, we have
- FAQs
- Cool ado-files
- STB and STB-FTP
- Updates
- StataQuest Additions
- Statalist
- Links
- Netcourses
I want to go through each of these.
FAQs
This has been reasonably successful although we do not have enough. Too
often, in my view, we reply to a question either privately through technical
support or on Statalist and do not make a FAQ out of it.
In order to improve things in the future, the first thing we are doing is
adding attribution (authors and dates) to each and every statistical FAQ
and ultimately to all of them. This task should be done by now; the work
started just last week. For some of us, having our name up at the web site
provides an incentive to write these things.
In addition, we are adding instructions on how to cite a particular FAQ. I do
not expect many will cite but I want to see more purely statistical FAQs
discussing purely statistical issues and perhaps those will be cited.
The second part of this plan — which will be going into effect almost
immediately — is to obtain FAQs from Statalist. Not just StataCorp people
can write FAQs and, every so often, something excellent appears on Statalist.
We will be combing Statalist for potential FAQs and then seeking the author's
permission to put it up on our Web site. With attribution, of course.
Finally — and this will be starting almost immediately — we are going to
index the FAQs through the lookup system so that they have a higher
profile. This way, if there is a relevant FAQ, you can find it. And
remember, the lookup database is updated whenever you update Stata.
Cool Ado-files
This seemed like a good idea but has proven to be a failure. I attribute this
to us: we simply have not gone through Statalist as we should and pulled the
good stuff. I had hoped users would tell us what to add to Cool Ado and then
it would just be our responsibility to copy, but it did not work out that way.
I'm not ready to give up on cool-ado, but I expect it will languish quite a
while longer.
STB and STB-FTP
The STB portion is just advertising. The STB-FTP is a set of links to other
sites that provide the STB diskettes.
StataCorp's attitude toward the STB being distributed over the net has been
schizophrenic. On the one hand, we hate to lose the revenue from diskettes.
On the other hand, we agree with the theoretical proposition that the
materials should be distributed freely. StataCorp's solution has been to not
provide the STB diskettes itself but to allow others to distribute them. This
has basically been how the technical group has agreed to disagree with the
marketing group.
These days, the revenue from diskettes amounts to little and losing all that
revenue would not bother us at all. By that logic, we should immediately put
the STB diskettes up on our site but we are not going to do that. We have
other ideas for the STB diskettes which I will get to when I talk about the
future.
Updates
We provide updates on our Web site and this has proven quite useful to us.
First, it lowers the cost when we make a mistake. We have, even before the
Web, distributed updated ado-files via the STB, so we have always been
committed to the idea of continual updates. The Web site allows us to
distribute updates to the binary executable itself, something we could not
afford to do when we were mailing updates on diskette.
This is a big success and, because of that, we are going to focus on this and
we have some big changes coming.
StataQuest Additions
Some people have downloaded the StataQuest Additions but it is not important.
There are actually two aspects to this:
- StataQuest itself and
- The menu programming language.
As things stand right now, Stata's menu programming language is little used by
users and that, we think, is because it is little used by us.
Nevertheless, the MPL is an important addition to Stata — we just need to
show you how to use it. We will be doing that.
StataQuest plays little role in Stata's long-run plans. We are proud of it
and it is used, but we mainly entered into the StataQuest agreement with
Duxbury because we wanted an application for testing our MPL.
Stata makes software for researchers. That is what we do. We are interested
in seeing our software used for teaching future researchers but we have no
interest in trying to make money off the education market per se. I believe
it is important that we stay focused on our primary task.
Statalist
The web site entry is just a way to subscribe to Statalist so, basically, I
want to use this as an excuse to discuss the list itself.
From our perspective it is a resounding success. It is, in a sense, like
these London meetings. Everybody agrees that Stata is wonderful, so let's be
honest, what about ...
From our perspective, Statalist is a place where one can gain access to
StataCorp in a way that is profitable for us.
Pretend one of you emails me personally with a question and I reply personally.
I ask you to think about this from a purely monetary point of view. Perhaps I
spend half an hour drafting my reply. The result is that, if I do a good job,
you feel warmly toward StataCorp. If we are very lucky, perhaps you feel so
warmly you mention it to somebody. The monetary return is small but positive.
Perhaps I have increased the likelihood you will upgrade. Perhaps I have
increased the probability that, in speaking to someone else, they will
purchase Stata. The cost, however, is high — 30 minutes of my salary plus
overhead.
Now say you ask the same question on Statalist and I reply there. This is
being done in public. 1,000 people will read the question and answer and,
if I do a good job, 1,000 people will feel warmly toward Stata. The return is
1,000 times greater and the cost is the same. Now, if I can also convert your
question and my answer into a FAQ, perhaps another 1,000 people will read it
over the period of a year.
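The arithmetic behind that comparison can be written out as a small Python sketch. The 30 minutes and the audiences of 1 and 1,000 readers (plus another 1,000 for a FAQ) come from the talk; the hourly cost is a hypothetical placeholder, and "readers reached per unit of cost" is only a stand-in for the goodwill described above, not a measured return.

```python
def reply_cost(minutes: float, hourly_cost: float) -> float:
    """Cost of writing one reply: time spent times salary plus overhead."""
    return hourly_cost * minutes / 60.0


def readers_per_cost(readers: int, minutes: float, hourly_cost: float) -> float:
    """Readers reached per unit of money spent on the reply."""
    return readers / reply_cost(minutes, hourly_cost)


HOURLY_COST = 100.0  # hypothetical figure; it cancels out of the ratios below

private = readers_per_cost(1, 30, HOURLY_COST)       # private email reply
on_list = readers_per_cost(1_000, 30, HOURLY_COST)   # same reply on Statalist
with_faq = readers_per_cost(2_000, 30, HOURLY_COST)  # reply also kept as a FAQ

print(f"list vs. private reply: {on_list / private:.0f}x")
print(f"list plus FAQ vs. private reply: {with_faq / private:.0f}x")
```

Because the cost of writing the reply is the same in every case, the comparison depends only on how many people read it, whatever hourly figure is assumed.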
This we can afford. And, as a matter of fact, the more subscribers we can
get, the more activity like this we can justify. So one goal is to increase
the Statalist subscription rate. If anybody has any ideas on how to do that,
I would like to hear them.
Statalist has another advantage from our point of view. Sometimes questions
are asked and answered and we need to do nothing. We want to promote others
answering questions and that is one of the major reasons we want to attribute
the FAQs and promote Statalist responses into FAQs.
Links
We have links to all the other statistical software vendors. This page is
always in our top-10 list of pages hit, with a couple hundred people a day
visiting it.
This is, I must say, one of our better ideas. The important thing about this
list is its even-handedness. If we hear about software, or if a vendor asks
us to list them — and vendors do — there are only two tests:
- Is it statistical software?
- Is the URL valid?
We provide no commentary because we could not do that in an unbiased way.
One minor change I would like to make in the future would be to list the
freeware packages separately. I think the freeware providers and the users of
the service would like that.
If you know of any links, please tell me.
NetCourses
The web-site page is just advertising but, as with the Statalist, I will use
this as an excuse to discuss NetCourses.
This is, in my judgment, the most inventive thing we have done on the net.
Obviously they have been a success in terms of enrollment — we never have
difficulty filling up a course. And they have been a success in terms of user
feedback. I think only two people have ever asked for their money back — and
both cases were situations that had to do with their personal lives rather
than us. Comments are uniformly positive.
We offered our first NetCourse in June 1995. Since then, 779 people have taken
NetCourses. Among those who take them, they are a success.
Now let me discuss NetCourses from StataCorp's point of view.
First, offering a NetCourse is extremely expensive. Obviously huge amounts of
time go into a course the first time it is offered but we have discovered that
very large amounts of time go into it afterwards, too. The cost of
intelligently answering intelligent questions is high. On a
revenue-minus-cost basis, NetCourses cost more than they make. We lose money
on them.
One could still argue that they have additional benefits that accrue to Stata
and show up as more sales, and that might be true although we do not
know how to measure it. We have debated this internally.
Here is what we do know: if we converted the NetCourses into books, we would
make money and a fair amount of it. Thus, given that a NetCourse loses money
on a cash basis and would certainly make money if converted into a book, it
is difficult to argue in favor of the NetCourse.
Nevertheless, I am still very positive on NetCourses. Here is my current
thinking:
- If one is going to write a book, giving a NetCourse has low marginal
cost.
- Moreover, giving the NetCourse at least twice will improve the book.
First editions become more like second editions.
At this point, one can even argue that NetCourses make money.
- At some point thereafter, however, the NetCourse is a money-losing
proposition and, on that basis, the NetCourse should be converted
to a book.
We are very favorably disposed to writing books. Books make money and they
promote the use of Stata, so given that, NetCourses can be an efficient
investment for StataCorp.
The following are open questions:
1. If one were to publish a book for, say, NC-151, would anyone still
take NC-151 if it were still offered?
2. If the answer to (1) is no, how many times must one offer a NetCourse
before turning it into a book? Obviously once would not be enough.
People would say, I'll wait. Is it twice? Three times?
3. Do some of the courses have greater value as NetCourses rather than
as books?
James Hardin is working on approaching this from a different angle. James
claims that NetCourses would be cheaper to administer if they were WebCourses
using more modern technology. James and I disagree on this. I say answering
intelligent questions intelligently is expensive and that is independent of
technology. James says there is an administrative burden to NetCourses that I
do not fully appreciate. We are going to run an experiment. When we get
back, we are offering NC-101 as a WebCourse.
I do not know exactly what we are going to do, but the thinking right now is
to convert some of the NetCourses to books and then I do not know whether we
will continue to offer those particular NetCourses or not. In any case, you
will be seeing more NetCourses on varied topics because we believe providing
documentation is both profitable and useful.
Returning to the past two years: Stata 5.0
That pretty much covers what I view as the important events of the last two
years. Obviously we released Stata 5.0 and that was a major event, but I do
not think it important. Producing new releases is something we do. The
company would be in big trouble if it considered the mere
fact of getting out a new release important in the sense of being something
that requires a lot of thought or that is a major hurdle to clear.
When I use the word important I mean important in terms of future
implications. Things with big-time current effects but little in the way of
future implications I call major. Stata 5.0 was major.
Obviously, what goes into the release is important and I told you last time
how that is determined. Partly whim — what we find interesting — and partly
what users demand.
Let me run over the important additions:
- No set maxvar/maxobs.
- Larger limits, and even larger limits are coming.
- Better integration with Windows.
- xtgee. Obviously analysis of panel data is something we think
important. The analysis of panel data is something we have
identified as a major component of Stata.
- robust and cluster() options on nearly all the estimation
commands. We'll complete that work in the next release.
- svy commands, which are related to the robust and cluster() options.
I know people are curious as to how such decisions are made so let me run over
the history of these last two to show how randomness enters the process.
We have had Huber estimators of variance for linear regression, logistic
regression, and a few other estimators for some time. Why did that happen?
Bill Rogers was a student of Peter Huber back at Stanford. All of Stata's
senior developers get to spend some amount of their time adding features to
Stata they consider interesting or fun. That's just a management tool we use:
In my experience, people are more productive working on all assigned tasks if
they get to assign some of the tasks to themselves. Moreover, on the tasks
they do assign themselves, they produce very good software although sometimes
the market for it is lacking.
Anyway, Bill Rogers added Huber estimation and it really was not much work.
Those Huber commands languished virtually unused for years although, slowly,
perhaps just because they were there, people began to use them. For other
reasons I do not understand, people got more interested in this estimator of
variance, and we started getting questions on it. When was it appropriate?
What did it do? Should I use it in this case? There was a unique aspect of
the Huber estimator we had in Stata: it provided clustering. Yes, we knew
something about Huber but when Huber estimation was added to Stata, the
addition of clustering was a Bill Rogers invention that was — as Bill Rogers
said — obvious.
Now Bill Rogers has a very Tukey-esque outlook on statistics. Here's a
reasonable thing to do, so do it. This resulted in us recommending the use of
these Huber estimators, mostly in private communication with users.
[Comment after the fact: My comment about Bill Rogers is unfair to both
Bill Rogers and John Tukey and, as many of you know, I have the greatest
respect for both of them.]
Every so often, one of those users would ask us for references and we did not
have much we could supply and we had virtually nothing on the clustering
option. I began to feel very uncomfortable. Here was a feature in Stata,
admittedly not used much but showing signs of growing interest, and we really
did not know much about it. We were in over our heads.
So I assigned to Bill Sribney the task of finding out about this Huber
estimator. Explain it. Find references. Is this something we should pull
back from or something we should go forward with? Make a report, especially
about this clustering option. Bill dug and discovered its connection with the
survey literature. At that point Bill Sribney said we needed technical
expertise we did not have to answer my questions and we asked John Eltinge, a
survey statistician, to consult with us.
At this point it was still a low-level effort charged only with answering the
question: is use of the Huber estimator to be promoted or not and, if so, can
we provide references about its use? John and Bill traced the estimator's
entire history both through the survey literature and the robust literature.
Good stuff, they said. So at that point we made the decision to promote hreg,
hlogit, etc., out of the suburbs and back into the main estimation commands.
We would promote its use.
Now understand where we were: We had just gone through the survey literature
not because anybody had said "Go through the survey literature" but because
answering the question on Huber had led that direction. We now had a very
competent survey statistician with whom we could consult. The cost of adding
survey additions to Stata had just gone down. Moreover, Bill Sribney said he
wanted to pursue it as his personal project. So we did.
I wish I could tell you that the statistical additions to Stata are carefully
considered in terms of users' needs and size of market, but it is not so. It
is very much a random process. That is as opposed to the programming language
and computer-science related additions to Stata where we do have long-run
plans and, mostly, we stick to them.
All right, let's go back to the other important additions to Stata:
- New survival commands, obviously.
- New table command.
- fracpoly.
- insheet.
- Menu-programming language.
As I said earlier, this is not much used yet but we believe this
will be an important component for the future of Stata.
- Graphics-programming language.
Finally, there was one other major (remember my definition of major vs.
important) work:
- We revamped the manuals.
This is major but, unfortunately, probably will not turn out to be important
because we have not yet figured out how to document Stata. This is an
important problem for the future. Here are the issues:
- Stata is, at heart, simple.
- Over the years, Stata has picked up lots of capabilities, and it
takes thousands of pages to document fully those capabilities.
- New users want to get going quickly and they do not want to be
tied up in the details.
- Professional researchers want to know the details — they want
references and they want formulas. Moreover, we want them to
because we have to keep track of that information and putting
it in the manual is convenient for us.
- People do not want heavy manuals. Information content is good.
Pages are good. Weight is bad.
- The costs of printing and shipping the manuals are the major
cost of producing a copy of Stata. We want to have low prices
but the manuals get in the way of that.
- As we continue to add features to Stata, the documentation problem
gets worse and worse.
We have not figured out a solution to this problem. Here is what we believe:
1. The Getting Started manual is a great success for new users.
2. The User's Guide that covers the basics is pretty good.
3. The 3-volume reference manuals are growing without bound.
Right now we are focusing on (3).
We are thinking about allowing the reference manuals to continue to grow
without bound for the professional users that want them, but introducing a
1-volume reference manual subset. This manual would not document the
programming, matrix, and other advanced commands. Most technical notes would
be eliminated. Methods and formulas would be deleted. References would
remain. The manual would be lighter and good-enough, as a computer manual,
for many users.
It would be produced by deletion.
Comments and other suggestions appreciated.
The Future
[Thereupon followed lengthy comments speculating about the future.]