Using Stata’s official dyndoc command.
A couple of things to note:
If a <<dd_do>> block has any Stata input or output, it must be surrounded by ~~~~ code-block fences.
Be careful not to have successive ~~~~ blocks, as they will generate extra vertical space.
The Stata commands within a <<dd_do>> block will show any leading whitespace in the web page, so they should be flush left.
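As a minimal sketch of these notes, a fragment of a dyndoc source file might look like the following. The two Stata commands are only illustrative; the fences sit directly around the <<dd_do>> block, and the commands are flush left.

```markdown
~~~~
<<dd_do>>
sysuse auto, clear
describe
<</dd_do>>
~~~~
```

Assuming the source were saved under the hypothetical name example.txt, typing dyndoc example.txt, replace in Stata would write the web page to example.html.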
This is meant to be a very simple exposition about modeling energy usage using Stata’s auto dataset. What makes the dataset special is that it is from the year 1978. Notable occurrences in 1978 were that Wayne Gretzky signed with the Indianapolis Racers of the World Hockey Association, and that the Boston Red Sox folded a 14-game lead to the Yankees. Eek. Oh, and homebrewing of beer was legalized in the U.S. To make things work more nicely, let’s pretend that this is some sort of sample of measurements, so that when we talk about “average energy consumption”, it will make some sense.
Let’s open the auto dataset, and look at its structure.
. sysuse auto, clear
(1978 Automobile Data)
. describe
Contains data from /Applications/AAApplications/MathTools/Stata15/ado/base/a/au
> to.dta
obs: 74 1978 Automobile Data
vars: 12 13 Apr 2016 17:45
size: 3,182 (_dta has notes)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
make str18 %-18s Make and Model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair Record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn Circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear Ratio
foreign byte %8.0g origin Car type
-------------------------------------------------------------------------------
Sorted by: foreign
We could use a codebook command here to look at all the variables, but it will take up too much space. Let’s do this instead:
. codebook, compact
Variable Obs Unique Mean Min Max Label
-------------------------------------------------------------------------------
make 74 74 . . . Make and Model
price 74 74 6165.257 3291 15906 Price
mpg 74 21 21.2973 12 41 Mileage (mpg)
rep78 69 5 3.405797 1 5 Repair Record 1978
headroom 74 8 2.993243 1.5 5 Headroom (in.)
trunk 74 18 13.75676 5 23 Trunk space (cu. ft.)
weight 74 64 3019.459 1760 4840 Weight (lbs.)
length 74 47 187.9324 142 233 Length (in.)
turn 74 18 39.64865 31 51 Turn Circle (ft.)
displacement 74 31 197.2973 79 425 Displacement (cu. in.)
gear_ratio 74 36 3.014865 2.19 3.89 Gear Ratio
foreign 74 2 .2972973 0 1 Car type
-------------------------------------------------------------------------------
For those unfamiliar with the system of weights and measures used in the United States (and Liberia), the important conversions to remember are that
One other oddity in the so-called traditional (or Standard or English or Imperial) system is that energy usage is measured in miles per gallon (mpg). This is not good for analysis, because it makes for a non-linear relationship between weight and energy. This can be seen in the following graph:
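A graph of this kind could be drawn along the following lines. This is a sketch, not necessarily the command behind the original figure; the lowess smoother is added here only to make the curvature easier to see.

```stata
* Load the auto dataset shipped with Stata
sysuse auto, clear
* Scatter mpg against weight and overlay a lowess smooth to show the curvature
twoway (scatter mpg weight) (lowess mpg weight), ///
    ytitle("Mileage (mpg)") xtitle("Weight (lbs.)")
```

The /// line continuation works in a do-file or a dyndoc source, not when typing at the command line.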
To make the analysis work better, we should make a variable measuring gallons used per 100 miles driven:
. gen gp100m = 100/mpg, before(mpg)
. label variable gp100m "Gallons per 100 miles"
One last conversion: 1 gallon per 100 miles is about 75/32 (= 2.344) liters per 100 km.
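The fraction 75/32 is what comes out if a US gallon is rounded to 3.75 liters and a mile to 1.6 km. That rounding is an assumption made here to explain the figure, not something the text states. With the exact definitions (1 US gallon = 3.785411784 L, 1 mile = 1.609344 km) the factor is slightly larger, about 2.352. Both can be checked with display:

```stata
* Rounded factor: assumes 1 gallon is about 3.75 L and 1 mile is about 1.6 km
display 3.75/1.6
* Exact factor from the defined values of the US gallon and the mile
display 3.785411784/1.609344
```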
Let’s take a look at various variables by whether the cars are from the US (domestic), or whether they are from outside the US (foreign). This was 1978, so country of manufacture mostly matched location of company. This is, of course, no longer the case.
. tabstat gp100m weight length turn displacement gear_ratio, ///
> statistics( mean sd count ) by(foreign)
Summary statistics: mean, sd, N
by categories of: foreign (Car type)
foreign | gp100m weight length turn displa~t gear_r~o
---------+------------------------------------------------------------
Domestic | 5.318155 3317.115 196.1346 41.44231 233.7115 2.806538
| 1.224346 695.3637 20.04605 3.967582 85.26299 .3359556
| 52 52 52 52 52 52
---------+------------------------------------------------------------
Foreign | 4.312848 2315.909 168.5455 35.40909 111.2273 3.507273
| 1.144388 433.0035 13.68255 1.501082 24.88054 .2969076
| 22 22 22 22 22 22
---------+------------------------------------------------------------
Total | 5.01928 3019.459 187.9324 39.64865 197.2973 3.014865
| 1.279856 777.1936 22.26634 4.399354 91.83722 .4562871
| 74 74 74 74 74 74
----------------------------------------------------------------------
This works, but it would be nice to have a table which makes it easier to see comparisons. For a simple example (with fewer statistics), we can use Ian Watson’s tabout, version 3, from http://tabout.net.au. To make it easier to rerun the command for different output types, its options have been put in the file tabout_oneway.options.
Mean values for US and Non-US cars
Car type | Gp100M | Weight | Displacement | Gear Ratio |
Domestic (70%) | 5.32 | 3,317.1 | 233.7 | 2.807 |
Foreign (29%) | 4.31 | 2,315.9 | 111.2 | 3.507 |
Total (100%) | 5.02 | 3,019.5 | 197.3 | 3.015 |
Source: auto.dta
If time permits, we should be able to make a more-complete version of this table.
Before modelling, we should take a look to see if there could be collinearities in the predictors.
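One simple way to look for collinearity is to inspect pairwise correlations and a scatterplot matrix of the candidate predictors. The sketch below uses only official Stata commands; it is one possible check, not necessarily the one used for this page.

```stata
* Load the auto dataset shipped with Stata
sysuse auto, clear
* Pairwise correlations among the candidate predictors
correlate weight displacement gear_ratio foreign
* Scatterplot matrix of the continuous predictors (lower triangle only)
graph matrix weight displacement gear_ratio, half
```

Correlations close to 1 in absolute value between two predictors would suggest that their separate effects will be hard to estimate.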
Finally, how about modelling? Let’s first run a regression with many variables and then store the results.
. regress gp100m weight displacement gear_ratio foreign
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(4, 69) = 56.84
Model | 91.7374232 4 22.9343558 Prob > F = 0.0000
Residual | 27.8388375 69 .403461414 R-squared = 0.7672
-------------+---------------------------------- Adj R-squared = 0.7537
Total | 119.576261 73 1.63803097 Root MSE = .63519
------------------------------------------------------------------------------
gp100m | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | .0014428 .000216 6.68 0.000 .0010118 .0018737
displacement | .0012388 .0021161 0.59 0.560 -.0029828 .0054603
gear_ratio | -.2037991 .3258603 -0.63 0.534 -.8538726 .4462744
foreign | .733736 .2301493 3.19 0.002 .2746007 1.192871
_cons | .8147969 1.239181 0.66 0.513 -1.657301 3.286895
------------------------------------------------------------------------------
We can see that, as expected, heavier cars take more energy to move. Perhaps unexpectedly, non-US cars use more gas at the same weight. It appears that we can throw out both displacement and gear_ratio as predictors and fit a simpler model.
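The t tests above look at one coefficient at a time, so a joint test is a useful check before dropping both predictors together. A minimal sketch, refitting the full model quietly and then using Stata's test command:

```stata
* Recreate the data and the outcome variable used above
sysuse auto, clear
generate gp100m = 100/mpg
* Refit the full model without showing the output again
quietly regress gp100m weight displacement gear_ratio foreign
* Joint F test that both coefficients are zero
test displacement gear_ratio
```

A large p-value from this test would support removing the two predictors at once.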
. regress gp100m weight foreign
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(2, 71) = 113.97
Model | 91.1761694 2 45.5880847 Prob > F = 0.0000
Residual | 28.4000913 71 .400001287 R-squared = 0.7625
-------------+---------------------------------- Adj R-squared = 0.7558
Total | 119.576261 73 1.63803097 Root MSE = .63246
------------------------------------------------------------------------------
gp100m | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | .0016254 .0001183 13.74 0.000 .0013896 .0018612
foreign | .6220535 .1997381 3.11 0.003 .2237871 1.02032
_cons | -.0734839 .4019932 -0.18 0.855 -.8750354 .7280677
------------------------------------------------------------------------------
We can put these coefficients in a table:
--------------------------------------------
(1) (2)
gp100m gp100m
--------------------------------------------
weight 0.00144*** 0.00163***
(6.68) (13.74)
displacement 0.00124
(0.59)
gear_ratio -0.204
(-0.63)
foreign 0.734** 0.622**
(3.19) (3.11)
_cons 0.815 -0.0735
(0.66) (-0.18)
--------------------------------------------
N 74 74
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
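The layout of this table matches the default output of esttab, from Ben Jann's community-contributed estout package. That esttab was used here is an assumption. Under that assumption, storing both models and tabulating them could look like the sketch below, where full and simple are hypothetical names for the stored estimates.

```stata
* One-time install of the community-contributed estout package:
* ssc install estout
sysuse auto, clear
generate gp100m = 100/mpg
* Fit and store the full model
quietly regress gp100m weight displacement gear_ratio foreign
estimates store full
* Fit and store the simpler model
quietly regress gp100m weight foreign
estimates store simple
* Coefficients with t statistics in parentheses, one column per model
esttab full simple
```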
From the simple model, cars from 40 years ago used about 0.163 more gallons per 100 miles for each extra 100 pounds, on average. Also, non-US cars used about 0.622 more gallons per 100 miles, on average, all other things being equal.
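Because gp100m is measured in gallons per 100 miles, the effect of 100 extra pounds is 100 times the weight coefficient. A sketch of getting that figure together with a confidence interval, using lincom after the simple model:

```stata
* Recreate the data and refit the simple model
sysuse auto, clear
generate gp100m = 100/mpg
quietly regress gp100m weight foreign
* Change in gallons per 100 miles associated with 100 extra pounds
lincom 100*weight
```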