# Tag Archives: Slope

The following applies to elections for Parliament, say for the US House of Representatives or the UK House of Commons, and it may also apply for the election of a city council. When the principle is one man, one vote then we would want that the shares of “seats won” would be equal to the shares of “votes received”. When there are differences then we would call this inequality or disproportionality.

Such imbalance is not uncommon. At the US election of November 8 2016, the Republicans got 49.1% of the votes and 55.4% of the seats, while the Democrats got 48% of the votes and 44.6% of the seats. At the UK general election of June 8 2017, the Conservatives got 42.2% of the votes and 48.8% of the seats while Labour got 39.9% of the votes and 40.3% of the seats (the Wikipedia data of October 16 2017 are inaccurate).

This article clarifies a new and better way to measure this inequality or disproportionality of votes and seats. The new measure is called Sine-Diagonal Inequality / Disproportionality (SDID) (weblink to main article). The new measure falls under descriptive statistics. Potentially it might be used in any area where one matches shares or proportions, like the proportions of minerals in different samples. SDID is related to statistical concepts like R-squared and the regression slope. This article looks at some history, as Karl Pearson (1857-1936) created the R-Squared and Ronald A. Fisher (1890-1962) in 1915 determined its sample distribution. The new measure would also be relevant for Big Data. William Gosset (1876-1937) a.k.a. “Student” was famously unimpressed by Fisher’s notion of “statistical significance” and now is vindicated by descriptive statistics and Big Data.

Statistics has the triad of Design, Description and Decision.

• Design is especially relevant for the experimental sciences, in which plants, lab rats or psychology students are subjected to alternate treatments. Design is informative but less applicable for observational sciences, like macro-economics and national elections when the researcher cannot experiment with nations.
• Descriptive statistics has measures for the center of location – like mean or median – and measures of dispersion – like range or standard deviation. Important are also the graphical methods like the histogram or the frequency polygon.
• Statistical decision making involves the formulation of hypotheses and the use of loss functions to evaluate those hypotheses. A hypothesis on the distribution of the population provides an indication for choosing the sample size. A typical example is the definition of decision error (of the first kind) that a hypothesis is true but still rejected. One might accept a decision error in say 5% of the cases, called the level of statistical significance.

Historically, statisticians have been working on all these areas of design, description and decision, but the most difficult was the formulation of decision methods, since this involved both the calculus of reasoning and the more complex mathematics on normal, t, chi-square, and other frequency distributions. In practical work, the divide between the experimental and the non-experimental (observational) sciences appeared insurmountable. The experimental sciences have the advantages of design and decisions based upon samples, and the observational sciences basically rely on descriptive statistics. When the observational sciences do regressions, there is an ephemeral application of statistical significance that invokes the Central Limit Theorem, that the sum of many small errors approximates the normal distribution.

This traditional setup of statistics is being challenged in the last decades by Big Data – see also this discussion by Rand Wilcox in Significance May 2017. When all data are available, and when you actually have the population data, then the idea of using a sample evaporates, and you don’t need to develop hypotheses on the distributions anymore. In that case descriptive statistics becomes the most important aspect of statistics. For statistics as a whole, the emphasis shifts from statistical decision making to decisions on content. While descriptive statistics had been applied mostly to samples, Big Data now causes the additional step how these descriptions relate to decisions on content. In fact, such questions already existed for the observational sciences like for macro-economics and national elections, in which the researcher only had descriptive statistics, and lacked the opportunity to experiment and base decisions upon samples. The disadvantaged areas now provide insights for the earlier advantaged areas of research.

The key insight is to transform the loss function into a descriptive statistic itself. An example is the Richter scale for the magnitude of earthquakes. It is both a descriptive statistic and a factor in the loss function. A nation or regional community has on the one hand the cost of building and construction and on the other hand the risk of losing the entire investments and human lives. In the evaluation of cost and benefit, the descriptive statistic helps to clarify the content of the issue itself. The key issue is no longer a decision within statistical hypothesis testing, but the adequate description of the data so that we arrive at a better cost-benefit analysis.

##### Existing measures on votes versus seats

Let us return to the election for the House of Representatives (USA) or the House of Commons (UK). The criterion of One man, one vote translates into the criterion that the shares of seats equal the shares of votes. We are comparing two vectors here.

The reason why the shares of seats and votes do not match is because the USA and UK use a particular setup. The setup is called an “electoral system”, but since it does not satisfy the criterion of One man, one vote, it does not really deserve that name. The USA and UK use both (single member) districts and the criterion of Plurality per district, meaning that the district seat is given to the candidate with the most votes – also called “first past the post” (FPTP). This system made some sense in 1800 when the concern was district representation. However, when candidates stand for parties then the argument for district representation loses relevance. The current setup does not qualify for the word “election” though it curiously continues to be called so. It is true that voters mark ballots but that is not enough for a real election. When you pay for something in a shop then this is an essential part of the process, but you also expect to receive what you ordered. In the “electoral systems” in the USA and UK, this economic logic does not apply. Only votes for the winner elect someone but the other votes are obliterated. For such reasons Holland switched to equal / proportional representation in 1917.

For descriptive statistics, the question is how to measure the deviation of the shares of votes and seats. For statistical decision making we might want to test whether the US and UK election outcomes are statistically significantly different from equality / proportionality. This approach requires not only a proper descriptive measure anyway, but also some assumptions on the distribution of votes which might be rather dubious to start with. For this reason the emphasis falls on descriptive statistics, and the use of a proper measure for inequality / disproportionality (ID).

A measure proposed by, and called after, Loosemore & Hanby in 1971 (LHID) uses the sum of the absolute deviations of the shares (in percentages), divided by 2 to correct for double counting. The LHID for the UK election of 2017 is 10.5 on a scale of 100, which means that 10.5% of the 650 seats (68 seats) in the UK House of Commons are relocated from what would be an equal allocation. When the UK government claims to have a “mandate from the people” then this is only because the UK “election system” is so rigged that many votes have been obliterated. The LHID gives the percentage of relocated seats but is insensitive to how these actually are relocated, say to a larger or smaller party.
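
As a minimal sketch of the LHID calculation, the following Python function halves the sum of the absolute differences between seat and vote shares. The input uses the US 2016 percentages quoted earlier in this text. Lumping the remaining 2.9% of the votes into a single "Other" entry without seats is an assumption made here for illustration, and the function name `lhid` is hypothetical.

```python
def lhid(votes, seats):
    """Loosemore-Hanby: half the sum of absolute share differences (in %)."""
    return sum(abs(s - v) for v, s in zip(votes, seats)) / 2

# US House 2016 shares as quoted in the text: Republicans, Democrats,
# plus an assumed "Other" remainder (100 - 49.1 - 48.0) that won no seats.
votes = [49.1, 48.0, 2.9]
seats = [55.4, 44.6, 0.0]

print(round(lhid(votes, seats), 1))
```

The division by 2 reflects that every seat gained by one party is a seat lost by another, so each relocated seat is otherwise counted twice.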

The Euclid / Gallagher measure proposed in 1991 (EGID) uses the Euclidean distance, again corrected for double counting. For an election with only two parties EGID = LHID. The EGID has become something like the standard in political science. For the UK 2017 the EGID is 6.8 on a scale of 100, which cannot be interpreted as a percentage of seats like LHID, but which indicates that the 10.5% of relocated seats are not concentrated in the Conservative party only.
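
The claim that EGID = LHID for two parties can be checked with a small sketch. The share vectors below are hypothetical, and the function names `lhid` and `egid` are chosen here for illustration only.

```python
import math

def lhid(votes, seats):
    """Loosemore-Hanby: half the sum of absolute share differences."""
    return sum(abs(s - v) for v, s in zip(votes, seats)) / 2

def egid(votes, seats):
    """Gallagher: square root of half the sum of squared share differences."""
    return math.sqrt(sum((s - v) ** 2 for v, s in zip(votes, seats)) / 2)

# Hypothetical two-party outcome (shares in %): both measures coincide.
two_votes, two_seats = [45.0, 55.0], [40.0, 60.0]
print(lhid(two_votes, two_seats), egid(two_votes, two_seats))

# Hypothetical three-party outcome: EGID falls below LHID when the
# deviations are spread over several parties.
three_votes, three_seats = [45.0, 35.0, 20.0], [55.0, 30.0, 15.0]
print(lhid(three_votes, three_seats), egid(three_votes, three_seats))
```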

Alan Renwick in 2015 tends to see more value in LHID than EGID: “As the fragmentation of the UK party system has increased over recent years, therefore, the standard measure of disproportionality [thus EGID] has, it would appear, increasingly understated the true level of disproportionality.”

##### The new SDID measure

The new Sine-Diagonal Inequality / Disproportionality (SDID) measure – presented in this paper – looks at the angle between the vectors of the shares of votes and seats.

• When the vectors overlap, the angle is zero, and then there is perfect equality / proportionality.
• When the vectors are perpendicular then there is full inequality / disproportionality.
• While this angle varies from 0 to 90 degrees, it is more useful to transform it into sine and cosine that are in the [0, 1] range.
• The SDID takes the sine for inequality / disproportionality and the cosine of the angle for equality / proportionality.
• With Sin[0] = 0 and Cos[0] = 1, the sine thus gives a scale that is 0 for full equality / proportionality and 1 for full inequality / disproportionality, and the cosine gives the reverse scale.
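
The bullets above can be sketched in a few lines of Python. The function name `cosine_sine` and the example vectors are hypothetical; the only ingredient is the standard formula for the cosine of the angle between two vectors.

```python
import math

def cosine_sine(votes, seats):
    """Cosine and sine of the angle between the vote and seat vectors."""
    dot = sum(v * s for v, s in zip(votes, seats))
    vv = sum(v * v for v in votes)
    ss = sum(s * s for s in seats)
    cos = dot / math.sqrt(vv * ss)
    # sin^2 = 1 - cos^2, written so that rounding cannot make it negative.
    sin = math.sqrt(max(0.0, vv * ss - dot * dot) / (vv * ss))
    return cos, sin

print(cosine_sine([60, 40], [60, 40]))   # overlapping vectors
print(cosine_sine([100, 0], [0, 100]))   # perpendicular vectors
```

Since shares are non-negative, the angle stays between 0 and 90 degrees, which is why both values remain in the [0, 1] range.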

It appears that the sine is more sensitive than either the absolute value (LHID) or the Euclidean distance (EGID). It is closer to the absolute value for small angles, and closer to the Euclidean distance for larger angles. See said paper, Figure 1 on page 10. SDID is something like a compromise between LHID and EGID but also better than both.

##### The role of the diagonal

When we regress the shares of the seats on the shares of the votes without using a constant – i.e. using Regression Through the Origin (RTO) – then this gives a single regression coefficient. When there is equality / proportionality then this regression coefficient is 1. This has the easy interpretation that this is the diagonal in the votes & seats space. This explains the name of SDID: when the regression coefficient generates the diagonal, then the sine is zero, and there is no inequality / disproportionality.

Said paper – see page 38 – recovers a key relationship between on the one hand the sine and on the other hand the Euclidean distance and this regression coefficient. On the diagonal, the sine and Euclidean distance are both zero. Off-diagonal, the sine differs from the Euclidean distance in nonlinear manner by means of a factor given by the regression coefficient. This relationship determines the effect that we indicated above, how SDID compromises between and improves upon LHID and EGID.
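
One elementary way to see a relationship of this kind uses the RTO slope b = v·s / v·v and the angle θ between v and s. The derivation below is standard vector geometry and is offered here as a sketch. It is not taken from the paper, whose exact formula may differ.

```latex
\begin{aligned}
s - v &= (s - b\,v) - (1 - b)\,v, \qquad (s - b\,v) \perp v \\
\lVert s - v \rVert^2 &= \lVert s - b\,v \rVert^2 + (1 - b)^2 \lVert v \rVert^2 \\
\lVert s - b\,v \rVert^2 &= \lVert s \rVert^2 \sin^2\theta \\
\sin^2\theta &= \frac{\lVert s - v \rVert^2 - (1 - b)^2 \lVert v \rVert^2}{\lVert s \rVert^2}
\end{aligned}
```

On the diagonal we have b = 1 and s = v, so that both the Euclidean distance and the sine are zero. Off the diagonal the two differ by a term in (1 – b), which is nonlinear in the regression coefficient.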

##### Double interpretation as slope and similarity measure

There appears to be a relationship between said regression coefficient and the cosine itself. This allows for a double interpretation as both slope and similarity measure. This weblog text is intended to avoid formulas as much as possible and thus I refer to said paper for the details. Suffice it to say here that, at first, it may seem to be a drawback that such a double interpretation is possible, yet, on closer inspection the relationship makes sense and it is an advantage to be able to switch perspective.
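
One elementary link between slope and cosine can be checked numerically: the cosine is the geometric mean of the two RTO slopes (seats on votes, and votes on seats). This is a standard identity and an assumption about what is meant here, since the paper may formulate the relationship differently. The data and function names are hypothetical.

```python
import math

def rto_slope(x, y):
    """Regression Through the Origin of y on x: slope = x.y / x.x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def cosine(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

votes = [45.0, 35.0, 20.0]   # hypothetical shares in %
seats = [55.0, 30.0, 15.0]

b = rto_slope(votes, seats)  # seats regressed on votes
p = rto_slope(seats, votes)  # votes regressed on seats
print(b, p, math.sqrt(b * p), cosine(votes, seats))
```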

##### Weber – Fechner sensitivity, factor 10, sign

In human psychology there appears to be a distinction between actual differences and perceived differences. This is called the Weber – Fechner law. When a frog is put into a pan with cool water and slowly boiled to death, it will not jump out. When a frog is put into a pan with hot water it will jump out immediately. People may notice differences between low vote shares and high seat shares, but they may be less sensitive to small differences, while these differences actually can still be quite relevant. For this reason, the SDID uses a sensitivity transform. It uses the square root of the sine.

(PM. A hypothesis why the USA and UK still call their national “balloting events” “elections” is that the old system of districts has changed so gradually into the method of obliterating votes that many people did not notice. It is more likely though that some parties recognised the effect, but have an advantage under the present system, and then do not want to change to equal / proportional representation.)

Subsequently, the sine and its square root have values in the range [0, 1]. In itself this is an advantage, but it comes with leading zeros. We might multiply by 100 but this might cause the confusion that these are percentages. The second digit might give a false sense of accuracy. It is more useful to multiply by 10. This gives values like on a report card. We can compare here to Bart Simpson, who appreciates low values on his report card.

Finally, when we compare, say, votes {49, 51} and seats {51, 49}, then we see a dramatic change of majority, even though there is only a slight inequality / disproportionality. It is useful to have an indicator for this too. It appears that this can be done by using a negative sign when such majority reversal occurs. This method of indicating majority reversals is not so sophisticated yet, and at this stage consists of using the sign of the covariance of the vectors of votes and seats.

##### In sum: the full formula

This present text avoids formulas but it is useful to give the formula for the new measure of SDID, so that the reader may link up more easily with the paper in which the new measure is actually developed. For the vectors of votes and seats we use the symbols v and s, and the angle between the two vectors gives the cosine and then the sine:

SDID[v, s] = sign 10 √ Sin[v, s]
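
The formula can be sketched in Python as follows. This is an illustration under stated assumptions and not the reference implementation of the paper: the function name `sdid` is hypothetical, and treating a zero covariance as a positive sign is a choice made here.

```python
import math

def sdid(votes, seats):
    """Sketch of SDID = sign * 10 * sqrt(sine of the angle between v and s)."""
    dot = sum(v * s for v, s in zip(votes, seats))
    vv = sum(v * v for v in votes)
    ss = sum(s * s for s in seats)
    sine = math.sqrt(max(0.0, vv * ss - dot * dot) / (vv * ss))
    # Sign of the covariance flags a majority reversal; zero covariance is
    # treated as positive here, which is an assumption of this sketch.
    n = len(votes)
    mean_v, mean_s = sum(votes) / n, sum(seats) / n
    cov = sum((v - mean_v) * (s - mean_s) for v, s in zip(votes, seats))
    sign = -1.0 if cov < 0 else 1.0
    return sign * 10.0 * math.sqrt(sine)

print(sdid([49, 51], [51, 49]))   # majority reversal: negative value
print(sdid([49, 51], [49, 51]))   # equal shares: no disproportionality
```

The example with votes {49, 51} and seats {51, 49} shows the effect of the square root: the sine itself is only about 0.04, yet the reported value is close to 2 in absolute size.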

For the UK 2017, the SDID value is 3.7. For comparison the values of Holland with equal / proportional representation are: LHID 3, EGID 1.7, SDID 2.5. It appears that Holland is not yet as equal / proportional as can be. Holland uses the Jefferson / D’Hondt method, that favours larger parties in the allocation of remainder seats. At elections there is also the wasted vote, when people vote for fringe parties that do not succeed in getting seats. In a truly equal or proportional system, the wasted vote can be respected by leaving seats empty or by having a qualified majority rule.

##### Cosine and R-squared

Remarkably, Karl Pearson (1857-1936) also used the cosine when he created R-squared, also known as the “coefficient of determination”. Namely:

• R-squared is the cosine-squared applied to centered data. Such centered data arise when one subtracts the mean value from the original data. For such data it is advisable to use a regression with a constant, which constant captures the mean effect.
• Above we have been using the original (non-centered) data. Alternatively put, when we do above Regression Through the Origin (RTO) and then look for the proper coefficient of determination, then we get the cosine-squared.
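
Both statements can be checked numerically for a regression with a single explanatory variable. The data below are hypothetical, and the helper names `cos2` and `centered` are chosen here for illustration.

```python
def cos2(x, y):
    """Squared cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot * dot / (sum(a * a for a in x) * sum(b * b for b in y))

def centered(x):
    """Subtract the mean from every element."""
    m = sum(x) / len(x)
    return [a - m for a in x]

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [1.2, 1.9, 3.2, 3.8, 5.1]

# OLS with a constant: R^2 = 1 - SSE / SST around the mean.
xc, yc = centered(x), centered(y)
slope = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
intercept = sum(y) / len(y) - slope * sum(x) / len(x)
sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
r2 = 1 - sse / sum(b * b for b in yc)

# RTO: coefficient of determination taken around zero instead of the mean.
b_rto = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
sse0 = sum((b - b_rto * a) ** 2 for a, b in zip(x, y))
r2_rto = 1 - sse0 / sum(b * b for b in y)

print(r2, cos2(xc, yc))      # centered data, regression with constant
print(r2_rto, cos2(x, y))    # original data, regression through the origin
```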

The SDID measure thus provides a “missing link” in statistics between centered and non-centered data, and also provides a new perspective on R-squared itself.

Apparently till now statistics found little use for original (non-centered) data and RTO. A possible explanation is that statistics fairly soon neglected descriptive statistics as less challenging, and focused on statistical decision making. Textbooks prefer the inclusion of a constant in the regression, so that one can test whether it differs from zero with statistical significance. The constant is essentially used as an indicator for possible errors in modeling. The use of RTO or the imposition of a zero constant would block that kind of application. However, this (traditional, academic) focus on statistical decision making apparently caused the neglect of a relevant part of the analysis, that now comes to the surface.

##### R-squared has relatively little use

R-squared is often mentioned in statistical reports about regressions, but actually it is not much used for other purposes than reporting only. Cosma Shalizi (2015:19) states:

“At this point, you might be wondering just what R-squared is good for — what job it does that isn’t better done by other tools. The only honest answer I can give you is that I have never found a situation where it helped at all. If I could design the regression curriculum from scratch, I would never mention it. Unfortunately, it lives on as a historical relic, so you need to know what it is, and what misunderstandings about it people suffer from.”

At the U. of Virginia Library, Clay Ford summarizes Shalizi’s points on the uselessness of R-squared, with a reference to his lecture notes.

Since the cosine is symmetric, the R-squared is the same for regressing y given x, or x given y. Shalizi (2015, p18) infers from the symmetry: “This in itself should be enough to show that a high R² says nothing about explaining one variable by another.” This is too quick. When theory shows that x is a causal factor for y then it makes little sense to argue that y explains x conversely. Thus, for research the percentage of explained variation can be informative. Obviously it matters how one actually uses this information.

When it is reported that a regression has an R-squared of 70% then this means that 70% of the variation of the explained variable is explained by the model, i.e. by variation in the explanatory variables and the estimated coefficients. In itself such a report does not say much, for it is not clear whether 70% is a little or a lot for the particular explanation. For evaluation we obviously also look at the regression coefficients.

One can always increase R-squared by including other and even nonsensical variables. For a proper use of R-squared, we would use the adjusted R-squared. R-adj finds its use in model specification searches – see Dave Giles 2013. For an increase of R-adj, the added coefficients must have an absolute t-value larger than 1. A proper report would show how R-adj increases by the inclusion of particular variables, e.g. also compared to studies by others on the same topic. Comparison on other topics obviously would be rather meaningless. Shalizi also rejects R-adj and suggests to work directly with the mean squared error (MSE, also corrected for the degrees of freedom). Since R-squared is the cosine-squared, the MSE relates to the sine-squared, and these are basically different sides of the same coin, so that this discussion is much ado about little. For standardised variables (difference from mean, divided by standard deviation) in a regression with a single explanatory variable, the correlation R (the signed square root of R-squared) is also the coefficient of regression, and then it is relevant for the effect size.

R-squared is a sample statistic. Thus it depends upon the particular sample. A hypothesis is that the population has a ρ-squared. For this reason it is important to distinguish between a regression on fixed data and a regression in which the explanatory variables also have a (normal) distribution (errors in variables). In his 1915 article on the sample distribution of R-squared, R.A. Fisher (digital library) assumed the latter. With fixed data, say X, the outcome is conditional on X, so that it is better to write ρ[X], lest one forgets about the situation. See my earlier paper on the sample distribution of R-adj. Dave Giles has a fine discussion about R-squared and adjusted R-squared. A search gives more pages. He confirms the “uselessness” of R-squared: “My students are often horrified when I tell them, truthfully, that one of the last pieces of information that I look at when evaluating the results of an OLS regression, is the coefficient of determination (R²), or its “adjusted” counterpart. Fortunately, it doesn’t take long to change their perspective!” Such a statement should not be read as the uselessness of cosine or sine in general.

##### A part of history of statistics that is unknown to me

I am not familiar with the history of statistics, and it is unknown to me what else Pearson, Fisher, Gosset and other founding and early authors wrote about the application of the cosine or sine. The choice to apply the cosine to centered data to create R-squared is deliberate, and Pearson would have been aware that it might also be applied to original (non-centered) data. It is also likely that he would not have the full perspective above, because then it would have been in the statistical textbooks already. It would be interesting to know what the considerations at the time were. Quite likely the theoretical focus was on statistical decision making rather than on description, yet this history, unknown to me, would put matters more into perspective.

##### Statistical significance

Part of the history is that R.A. Fisher with his attention for mathematics emphasized precision while W.S. Gosset with his attention to practical application emphasized the effect size of the coefficients found by regression. Somehow, statistical significance in terms of precision became more important than content significance, and empirical research has rather followed Fisher than the practical relevance of Gosset. This history and its meaning are discussed by Stephen Ziliak & Deirdre McCloskey 2007, see also this discussion by Andrew Gelman. As said, for standardised variables, the regression coefficient is the correlation R (the signed square root of R-squared), and this is best understood with attention for the effect size. For some applications a low R-squared would still be relevant for the particular field.

##### Conclusion

The new measure SDID provides a better description of the inequality or disproportionality of votes and seats compared to existing measures. The new measure has been tailored to votes and seats, by means of greater sensitivity to small inequalities, and because a small change in inequality may have a crucial impact on the (political) majority. For different fields, one could tailor measures in similar manner.

That the cosine could be used as a measure of similarity has been well-known in the statistics literature since the start, when Pearson used the cosine for centered data to create R-squared. For the use of the sine I have not found direct applications, but its use is straightforward when we look at the opposite of similarity.

The proposed measure provides an enlightening bridge between descriptive statistics and statistical decision making. This comes with a better understanding of what kind of information the cosine or R-squared provides, in relation to regressions with and without a constant. Statistics textbooks would do well by providing their students with this new topic for both theory and practical application.

Exponential functions have the form b^x, where b > 0 is the base and x the exponent.

Exponential functions are easily introduced as growth processes. The comparison of x² and 2^x is an eye-opener, with the stories of duckweed or the grain on the chess board. The introduction of the exponential number e is a next step. What intuitions can we use for smooth didactics on e ?
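
The comparison can be made concrete with a few lines of Python. The chess board story is taken here in its usual form, with one grain on the first square and a doubling on each next square, which is an assumption about the version intended.

```python
# Compare polynomial growth x^2 with exponential growth 2^x.
for x in (1, 2, 4, 8, 16, 32):
    print(x, x ** 2, 2 ** x)

# Grain on the chess board: 1 grain on the first square, doubling on each
# of the 64 squares, gives 2^0 + 2^1 + ... + 2^63 = 2^64 - 1 grains.
total_grains = sum(2 ** k for k in range(64))
print(total_grains)
```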

##### The “discover-e” plot

There is the following “intuitive graph” for the exponential number e = 2.71828…. The line y = e is found by requiring that the inclines (tangents) to b^x all run through the origin at {0, 0}. The (dashed) value at x = 1 helps to identify the function e^x itself. (Check that the red curve indicates 2^x).

2^x, e^x and 4^x, and inclines through {0, 0}

Remarkably, Michael Range (2016:xxix) also looks at such an outcome e = 2^(1 / c), where c is the derivative of y = 2^x at x = 0, or c = ln[2]. NB. Instead of the opaque term “logarithm” let us use “recovered exponent”, denoted as rex[y].

Perhaps above plot captures a good intuition of the exponential number ? I am not convinced yet but find that it deserves a fair chance.

NB. Dutch mathematics didactician Hessel Pot, in an email to me of April 7 2013, suggested above plot. There appears to be a Wolfram Demonstrations Project item on this too. Their reference is to Helen Skala, “A discover-e,” The College Mathematics Journal, 28(2), 1997 pp. 128–129 (Jstor), and it has been included in the “Calculus Collection” (2010).

##### Deductions

The point-slope version of the incline (tangent) of function f[x] at x = a is:

y – f[a] = s (x – a)

The function b^x has derivative rex[b] b^x. Thus at arbitrary a:

y – b^a = rex[b] b^a (x – a)

This line runs through the origin {x, y} = {0, 0} iff

0 – b^a = rex[b] b^a (0 – a)

1 = rex[b] a

Thus with H = -1, a = rex[b]^H = 1 / rex[b]. Then also:

y = f[a] = b^a = b^(rex[b]^H) = e^(rex[b] rex[b]^H) = e^1 = e

The inclines running through {0, 0} also run through {rex[b]^H, e}. Alternatively put, inclines can thus run through the origin and then cut y = e.

For example, in above plot, with 2^x as the red curve, rex[2] ≈ 0.69 and a = 1 / rex[2] ≈ 1.44, and there we find the intersection with the line y = e.

Subsequently also at a = 1, the point of tangency is {1, e}, and we find with b = e that rex[e] = 1.
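
The deduction can be checked numerically with a short sketch. The name `incline_through_origin` is hypothetical, and `rex` is simply bound to the natural logarithm of Python's `math` module.

```python
import math

rex = math.log  # "recovered exponent" rex[y], i.e. the natural logarithm

def incline_through_origin(b):
    """For base b: tangency point {a, b^a} and the incline's value at x = 0."""
    a = 1 / rex(b)                  # a = 1 / rex[b]
    height = b ** a                 # should equal e for every base b
    slope = rex(b) * height         # derivative of b^x at x = a
    at_origin = height + slope * (0 - a)
    return a, height, at_origin

for b in (2, math.e, 4):
    print(b, incline_through_origin(b))
```

For each base the height at the point of tangency is e and the incline passes through the origin, up to rounding error.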

The drawback of this exposition is that it presupposes some algebra on e and the recovered exponents. Without this deduction, it is not guaranteed that above plot is correct. It might be a delusion. Yet since the plot is correct, we may present it to students, and it generates a sense of wonder what this special number e is. Thus it still is possible to make the plot and then begin to develop the required math.

Another drawback of this plot is that it compares different exponential functions and doesn’t focus on the key property of e^x, namely that it is its own derivative. A comparison of different exponential functions is useful, yet for what purpose exactly ?

##### Descartes

Our recent weblog text discussed how Cartesius used Euclid’s criterion of tangency of circle and line to determine inclines to curves. The following plots use this idea for e^x at point x = a, for a = 0 and a = 1.

Incline to e^x at x = 0 (left) and x = 1 (right)

Let us now define the number e such that the derivative of e^x is given by e^x itself. At point x = a we have s = e^a. Using the point-slope equation for the incline:

y – f[a] = s (x – a)

y – e^a = e^a (x – a)

y = e^a (x – (a – 1))

Thus the inclines cut the horizontal axis at {x, y} = {a – 1, 0}, and the slope indeed is given by the tangent s = (f[a] – 0) / (a – (a – 1)) = f[a] / 1 = e^a.

The center {u, 0} and radius r of the circle can be found from the formulas of the mentioned weblog entry (or Pythagoras), and check e.g. a = 0:

u = a + s f[a] = a + (e^a)²

r = f[a] √ (1 + s²) = e^a √ (1 + (e^a)²)
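
As a check on these formulas, the following sketch computes the center and radius for a = 0 and a = 1 and verifies that the distance from the center {u, 0} to the incline equals the radius, which is the criterion of tangency. The function names are hypothetical.

```python
import math

def tangent_circle(a):
    """Center {u, 0} and radius r of the circle touching e^x at x = a."""
    f = s = math.exp(a)             # f[a] = e^a and slope s = e^a
    u = a + s * f                   # u = a + s f[a]
    r = f * math.sqrt(1 + s * s)    # r = f[a] sqrt(1 + s^2)
    return u, r

def distance_center_to_incline(a):
    """Distance from {u, 0} to the incline y = e^a (x - (a - 1))."""
    s = math.exp(a)
    u, _ = tangent_circle(a)
    # Line in the form s x - y - s (a - 1) = 0, evaluated at {u, 0}.
    return abs(s * u - s * (a - 1)) / math.sqrt(s * s + 1)

for a in (0.0, 1.0):
    u, r = tangent_circle(a)
    print(a, u, r, distance_center_to_incline(a))
```

For a = 0 this gives the center {1, 0} and the radius √2, so that the circle indeed passes through {0, 1}.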

A key problem with this approach is that the notion of “derivative” is not defined yet. We might plug in any number, say e^2 = 10 and e^3 = 11. For any location the Pythagorean Theorem allows us to create a circle. The notion of a circle is not essential here (yet). But it is nice to see how Cartesius might have done it, if he had had e = 2.71828….

##### Conquest of the Plane (COTP) (2011)

Conquest of the Plane (2011:167+), pdf online, has the following approach:

• §12.1.1 has the intuition of the “fixed point” that the derivative of e^x is given by e^x itself. For didactics it is important to have this property firmly established in the minds of the students, since they tend to forget this. This might be achieved perhaps in other ways too, but COTP has opted for the notion of a fixed point. The discussion is “hand waving” and not intended as a real development of fixed points or theory of function spaces.
• §12.1.2 defines e with some key properties. It holds by definition that the derivative of e^x is given by e^x itself, but there are also some direct implications, like the slope of 1 at x = 0. Observe that COTP handles integral and derivative consistently as interdependent notions. (Shen & Lin (2014) use this approach too.)
• §12.1.3 gives the existence proof. With the mentioned properties, such a number and function appears to exist. This compares e^x with other exponential functions b^x and the recovered exponents rex[y] – i.e. logarithm ln[y].
• §12.1.4 uses the chain rule to find the derivatives of b^x in general. The plot suggested by Hessel Pot above would be a welcome addition to confirm this deduction and extension of the existence proof.
• §12.1.5-7 have some relevant aspects that need not concern us here.
• §12.1.8.1 shows that the definition is consistent with the earlier formal definition of a derivative. Application of that definition doesn’t generate an inconsistency. No limits are required.
• §12.1.8.2 gives the numerical development of e = 2.71828… There is a clear distinction between deduction that such a number exists and the calculation of its value. (The approach with limits might confuse these aspects.)
• §12.1.8.3 shows that also the notion of the dynamic quotient (COTP p57)  is consistent with above approach to e. Thus, the above hasn’t used the dynamic quotient. Using it, we can derive that 1 = {(e^h – 1) // h, set h = 0}. Thus the latter expression cannot be simplified further but we don’t need to do so since we can determine that its value is 1. If we would wish so, we could use this (deduced) property to define e as well (“the formal approach”).

The key difference between COTP and above “approach of Cartesius” is that COTP shows how the (common) numerical development of e can be found. This method relies on the formula of the derivative, which Cartesius didn’t have (or didn’t want to adopt from Fermat).

##### Difference of COTP and a textbook introduction of e

In my email of March 27 2013 to Hessel Pot I explained how COTP differed from a particular Dutch textbook on the introduction of e.

• The textbook suggests that f ‘[0] = 1 would be an intuitive criterion. This is only partly true.
• It proceeds in reworking f ‘[0] = 1 into a more general formula. (I didn’t mention unstated assumptions in 2013.)
• It eventually boils down to indeed positing that e^x has itself as its derivative, but this definition thus is not explicitly presented as a definition. The clarity of positing this is obscured by the path leading there. Thus, I feel that the approach in COTP is a small but actually key innovation to explicitly define e^x as being equal to its derivative.
• It presents e only with three decimals.

##### Conclusion

There are more ways to address the intuition for the exponential number, like the growth process or the surface area under 1 / x. Yet the above approaches are more fitting for the algebraic approach. Of these, COTP has a development that is strong and appealing. The plots by Cartesius and Pot are useful and supportive but no alternatives.

The Appendix contains a deduction that was done in the course of writing this weblog entry. It seems useful to include it, but it is not key to above argument.

##### Appendix. Using the general formula on factor x – a

The earlier weblog entry on Cartesius and Fermat used a circle and generated a “general formula” on a factor x – a. This is not really factoring, since the factor only holds when the curve lies on a circle.

Using the two relations:

f[x] – f[a] = (x – a) (2u – x – a) / (f[x] + f[a])    … (* general)

u = a + s f[a]       … (for a tangent to a circle)

we can restate the earlier theorem that s defined in this manner generates the slope that is tangent to a circle.

f[x] – f[a] = (x – a) (2 s f[a] – (x – a)) / (f[x] + f[a])

It will be useful to switch to x – a = h:

f[a + h] – f[a]  = h (2 s f[a] – h) / (f[a + h] + f[a])

Thus with the definition of the derivative via the dynamic quotient we have:

df / dx = {Δf // Δx, set Δx = 0}

= {(f[a + h] – f[a]) // h, set h = 0}

= { (2 s f[a] – h) / (f[a + h] + f[a]), set h = 0}

= s

This merely shows that the dynamic quotient restates the earlier theorem on the tangency of a line and circle for a curve.
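
For a curve that actually lies on a circle, the general formula and its version in s and h can be verified numerically. The circle below (center {1, 0}, radius 2) and the chosen values of a and h are hypothetical.

```python
import math

U, R = 1.0, 2.0   # hypothetical circle: center {U, 0}, radius R

def f(x):
    """Upper half of the circle (x - U)^2 + y^2 = R^2."""
    return math.sqrt(R * R - (x - U) ** 2)

a, h = 0.5, 0.7
x = a + h
s = (U - a) / f(a)                  # from u = a + s f[a]

lhs = f(x) - f(a)
general = (x - a) * (2 * U - x - a) / (f(x) + f(a))
with_s = h * (2 * s * f(a) - h) / (f(a + h) + f(a))
quotient_at_zero = (2 * s * f(a) - 0) / (f(a + 0) + f(a))

print(lhs, general, with_s)
print(quotient_at_zero, s)
```

Setting h = 0 in the quotient leaves 2 s f[a] / (2 f[a]) = s, without any division by zero.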

This holds for any function and thus also for the exponential function. Now we have s = e^a by definition. For e^x this gives:

e^(a + h) – e^a  = h (2 s e^a – h) / (e^(a + h) + e^a)

For COTP §12.1.8.3 we get, with Δx = h:

df / dx = {Δf // Δx, set Δx = 0}

= {(e^(a + h) – e^a) // h, set h = 0}

= {(2 s e^a – h) / (e^(a + h) + e^a), set h = 0}

= s

This replaces Δf // Δx by the expression from the general formula, while the general formula was found by assuming a tangent circle, with s as the slope of the incline. There is the tricky aspect that we might choose any value of s as long as it satisfies u = a + s f[a]. However, we can refer to the earlier discussion in §12.1.8.2 on the actual calculation.

The basic conclusion is that this “general formula” enhances the consistency of §12.1.8.3. The deduction however is not needed, since we have §12.1.8.1, but it is useful to see that this new elaboration doesn’t generate an inconsistency. In a way this new elaboration is distractive, since the conclusion that 1 = {(e^h – 1) // h, set h = 0} is much stronger.
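For readers who like to see numbers: the following sketch only evaluates the ordinary quotient (e^h – 1) / h for some nonzero values of h. It is a numerical illustration and not a replacement for the algebraic step of setting h = 0. The function name is hypothetical.

```python
import math

def difference_quotient_exp(h):
    """(e^h - 1) / h for h != 0, using expm1 to limit rounding error."""
    return math.expm1(h) / h

for h in (1.0, 0.1, 0.001, 1e-6):
    print(h, difference_quotient_exp(h))
```

The printed values move towards 1 as h shrinks, which agrees with the value that the dynamic quotient assigns at h = 0.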

Isaac Newton (1642-1727) invented the differentials, calling them evanescent quantities. Since then, the world has been wondering what these are. Just to be sure, Newton wrote his Principia (1687) by using the methods of Euclidean geometry, so that his results could be accepted in the standard of his day (context of reconstruction and presentation), and so that his results were not lost in a discussion about the new method of these differentials (context of discovery). However, this only increased the enigma. What can these quantities be, that are so efficient for science, and that actually disappear when mathematically interesting?

Gottfried Leibniz (1646-1716) gave these infinitesimals their common labels dy and dx, and thus they became familiar as household names in academic circles, but this didn’t reduce their mystery.

Charles Dodgson (1832-1898) as Lewis Carroll had great fun with the Cheshire Cat, who disappears but leaves its grin.

Abraham Robinson (1918-1974) presented an interpretation called “non-standard analysis”. Many people think that he clinched it, but when I start reading then my intuition warns me that this is making things more difficult. (Perhaps I should read more though.)

In 2007, I developed an algebraic approach to the derivative. This was in the book “A Logic of Exceptions” (ALOE), later also included in “Elegance with Substance” (EWS) (2009, 2015), and a bit later there was a “proof of concept” in “Conquest of the Plane” (COTP) (2011). The pdfs are online, and a recent overview article is here. A recent supplement is the discussion on continuity.

In this new algebraic approach there wasn’t a role for differentials, yet. The notation dy / dx = f ‘[x] for y = f [x] can be used to link up to the literature, but up to now there was no meaning attached to the symbolism. In my perception this was (a bit of) a pity since the notation with differentials can be useful on occasion, see the example below.

Last month, reading Joop van Dormolen (1970) on the didactics of derivatives and the differential calculus – in a book for teachers Wansink (1970) volume III – I was struck by his admonition (p213) that dy / dx really is a quotient of two differentials, and that a teacher should avoid identifying it as a single symbol and as the definition of the derivative. However, when he proceeded, I was disappointed, since his treatment didn’t give the clarity that I looked for. In fact, his treatment is quite in line with that of Murray Spiegel (1962), “Advanced calculus (Metric edition)”, Schaum’s outline series, see below. (But Van Dormolen very usefully discusses the didactic questions, that Spiegel doesn’t look into.)

Thus, I developed an interpretation of my own. In my impression this finally gives the clarity that people have been looking for starting with Newton. At least: I am satisfied, and you may check whether you are too.

I don’t want to repeat myself too much, and thus I assume that you read up on the algebraic approach to the derivative in case of questions. (A good place to start is the recent overview.)

##### Ray through an origin

Let us first consider a ray through the origin, with horizontal axis x and vertical axis y. The ray makes an angle α with the horizontal axis. The ray can be represented by a function as y =  f [x] = s x, with the slope s = tan[α]. Observe that there is no constant term (c = 0).

The quotient y / x is defined everywhere, with the outcome s, except at the point x = 0, where we get an expression 0 / 0. This is quite curious. We tend to regard y / x as the slope (there is no constant term), and at x = 0 the line has that slope too, but we seem unable to say so.

There are at least three responses:

(i) Standard mathematics then takes off, with limits and continuity.

(ii) A quick fix might be to try to define a separate function to find the slope of a ray, but we can wonder whether this is all nice and proper, since we can only state the value s at 0 when we have solved the value elsewhere. If we substitute y when it isn’t a ray, for example x², then we get a curious construction, and thus the definition isn’t quite complete since there ought to be a test on being a ray.

(iii) The algebraic approach uses the following definition of the dynamic quotient:

y // x ≡ { y / x, unless x is a variable and then: assume x ≠ 0, simplify the expression y / x, declare the result valid also for the domain extension x = 0 }

Thus in this case we can use y // x = s x // x = s, and this slope also holds for the value x = 0, since this has now been included in the domain too.
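The recipe can be mimicked in a few lines of code. The sketch below is a deliberately restricted illustration, under the assumption that y is a polynomial in x, stored as a list of coefficients. It is not the general dynamic quotient, and the function names are hypothetical.

```python
from fractions import Fraction

def dynamic_quotient_by_x(coeffs):
    """Simplify p[x] / x under the assumption x != 0.

    coeffs[k] is the coefficient of x**k. The factor x only cancels
    algebraically when the constant term is zero.
    """
    if coeffs and coeffs[0] != 0:
        raise ValueError("x does not cancel: constant term is not zero")
    return list(coeffs[1:])

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

# Ray y = s x with an illustrative slope s = 3/2 (an assumption).
s = Fraction(3, 2)
ray = [0, s]                         # 0 + s x
slope = dynamic_quotient_by_x(ray)   # the simplified expression, here just s
print(evaluate(slope, 0))            # value at x = 0 after the domain extension

# Response (ii): substituting x**2 does not give a constant slope.
parabola = [0, 0, 1]
print(dynamic_quotient_by_x(parabola))   # coefficients of the expression x
```

The point of the design is that the simplification happens on the expression, before any value is substituted. Only afterwards is x = 0 admitted.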

##### In a nutshell for dy / dx

In a nutshell, we get the following situation for dy / dx:

Properties are exactly as Van Dormolen explained:

• “dy” and “dx” are names for variables, and thus they have their own realm with their own axes.
• The definition of their relationship is dy = f ‘[x] dx.

The news is:

• The mistake in history was to write dy / dx instead of dy // dx.

The latter “mistake” can be understood, since the algebraic approach uses notions of set theory, domain and range, and dynamics as in computer algebra, and thus we can forgive Newton for not getting there yet.

To link up with history, we might define that the “symbol dy / dx as a whole” is a shortcut for dy // dx. This requires additional effort to develop the notion of “symbol as a whole” however. My impression is that it is better to use dy // dx unless it is so accepted that it might become pedantic. (You must only explain that the Earth isn’t flat while people don’t know that yet.)

##### Application to Spiegel 1962 gives clarity

Let us look at Spiegel (1962) p58-59, and see how above discussion can bring clarity. The key points can all be discussed with reference to his figure 4-1.

Looking at this with a critical eye, we find:

• At the point P, there is actually the creation of two new sets of axes, namely, both the {Δx, Δy} plane and the {dx, dy} plane.
• These two new planes have both rays through the origin, one with angle θ and one with angle α.
• The two planes help to define the error. An error is commonly defined from the relation “true value = estimate + error”. The true value of the angle is θ and our estimate is α.
• Thus we get absolute error Δf = s Δx + ε where s = dy / dx. This error is a function of Δx, or ε = ε[Δx]. It solves as ε = Δf – s Δx.
• The relative error is Δf / Δx =  dy / dx + r which solves as r = Δf / Δx – dy / dx. This is still a function r[Δx]. We use the quotient of the differentials instead of the true quotient of the differences.
• We better re-consider the error in terms of the dynamic quotient, replacing / by // in the above, because at P we like the error to be zero. Thus in above figure we have ε = Δf – s Δx, where s = dy // dx.
• A source of confusion is that Spiegel suggests that dx ≈ Δx or even dx = Δx but this is numerically true only sometimes and conceptually there surely is no identity since these are different axes.
• In the algebraic approach, Δx is set to zero to create the derivative, in particular the value of f ‘[x] = tan[α] at point P.  In this situation, Δx = 0 thus clearly differs from the values of dx that are still available on dx ‘s whole own axis. This explains why the creation of the differentials is useful. For, while Δx is set to 0, then the differentials can take any value, including 0.

Just to be sure, the algebraic approach uses this definition:

f ’[x] = {Δf // Δx, then set Δx = 0}

Subsequently, we define dy = f ‘[x] dx, so that we can discuss the relative error r = Δf // Δx – dy // dx.
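A worked instance may help. The sketch below assumes the function f[x] = x², which is not taken from Spiegel, and spells out the absolute error ε and the relative error r for a few values of Δx, including Δx = 0. The function names are hypothetical.

```python
from fractions import Fraction

def f(x):
    return x * x

def derivative_of_square(x):
    """f'[x] for f[x] = x**2: Δf // Δx = 2 x + Δx, then set Δx = 0."""
    return 2 * x

def absolute_error(x, delta_x):
    """ε = Δf - s Δx, with s = f'[x]."""
    s = derivative_of_square(x)
    return f(x + delta_x) - f(x) - s * delta_x

def relative_error(x, delta_x):
    """r = Δf // Δx - dy // dx, written out for f[x] = x**2 so that Δx = 0 is allowed."""
    # Δf // Δx simplifies to 2 x + Δx, and dy // dx = f'[x] = 2 x.
    return (2 * x + delta_x) - derivative_of_square(x)

x = Fraction(1)
for delta_x in (Fraction(1), Fraction(1, 2), Fraction(1, 10), Fraction(0)):
    print(delta_x, absolute_error(x, delta_x), relative_error(x, delta_x))
```

For this f we find ε = Δx² and r = Δx, so both errors are zero at the point P itself, where Δx = 0.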

PM. Check COTP p224 for the discussion of (relative) error, with the same notation. This present discussion still replaces the statement on differentials in COTP p155, step number 10.

##### A subsequent point w.r.t. the standard approach

Our main point thus is that the mistake in history was to write dy / dx instead of dy // dx. There arises a subsequent point of didactics. When you have real variables x and z, then these have their own axes, and you don’t put them on the same axis just because they are both reals.

See Appendix A for a quote from Spiegel (1962), and check that it is convoluted at times.

Appendix B contains a quote from p236 from Adams & Essex (2013). We can see the same confusions as in Spiegel (1962). It really is a standard approach, and convoluted.

The standard approach takes Δx = dx and joins the axis for the variable Δy with the axis for the variable dy, with the common idea of “a change from y“. The idea of this setup is that it shows the error for values of Δx = dx.

It remains an awkward setup. It may well be true that John from Los Angeles is called Harry in New York, but when John calls his mother back home and introduces himself as “Mom, this is Harry”, then she will be confused. Eventually she can get used to this kind of phonecalls, but it remains awkward didactics to introduce students to these new concepts in this manner. (Especially when John adds: “Mom, actually I don’t exist anymore because I have been set to zero.”)

Thus, in good didactics we should drop this Δx = dx.

Alternatively put: We might define dy = f ’[x] Δx = {Δf // Δx, then set Δx = 0} Δx. In the latter expression Δx occurs twice: both as a local and bound variable within { … } and as a global free variable outside of { … }. This is okay. In the past, mathematicians apparently thought that it might make things clearer to write dx for the free global variable: dy = f ’[x] dx. In a way this is okay too. But for didactics it doesn’t work. We should rather avoid an expression in which the same variable (name) is used both locally bound and globally free.

##### Clear improvement

Remarkably, we are using 99% of the same apparatus as the standard approach, but there are clear improvements:

• There is no use of limits. All information is contained in the algebra of both the function f and the dynamic quotient. See here for continuity.
• There is a clear distinction between the three realms {x, y}, {Δx, Δy} and {dx, dy}.
• There is the new tool of the {dx, dy} space that can be used for analysis of variations.
• Didactically, it is better to first define the derivative in chapter 1, and then introduce the differentials in chapter 2, since the differentials aren’t needed to understand chapter 1.
• There is clarity about the error, that one doesn’t take dx ≈ Δx but considers ε = Δf – s Δx, where s has been found from the recipe s = f ’[x] = {Δf // Δx, then set Δx = 0}.
##### Example by Van Dormolen (1970:219)

This example assumes the total differential of the function f[x, y]:

df = (∂f // ∂x) dx + (∂f // ∂y) dy

Question. Give the slope of the tangent in the point {3, 4} of the circle x² + y²  = 25.

Answer. The point is on the circle indeed. We write the equation as f[x, y] = x² + y²  = 25. The total differential gives 2x dx + 2y dy = 0. Thus dy // dx = – x // y. Evaluation at the point {3, 4} gives the slope – 3/4.  □

PM. We might develop y algebraically as a function of x and then use the +√ rather than the -√. However, more abstractly, we can use y = g[x], and use dy = g ‘[x] dx, so that the slope of the tangent is g ‘[x] at the point {3, 4}. Subsequently we use g ‘[x] = dy // dx.
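The arithmetic of this example can be retraced as follows. The sketch hard-codes the partial derivatives 2x and 2y of x² + y², and adds a rough numerical cross-check on the explicit branch y = +√(25 – x²). The function names are hypothetical, and the cross-check uses an ordinary difference quotient only as a sanity test.

```python
from fractions import Fraction
import math

def slope_from_total_differential(x, y):
    """For f[x, y] = x**2 + y**2: 2x dx + 2y dy = 0 gives dy // dx = -x // y."""
    df_dx = 2 * x
    df_dy = 2 * y
    return Fraction(-df_dx, df_dy)

def g(x):
    """Upper branch y = +sqrt(25 - x**2) of the circle."""
    return math.sqrt(25 - x * x)

x0, y0 = 3, 4
assert x0 ** 2 + y0 ** 2 == 25            # the point lies on the circle
slope = slope_from_total_differential(x0, y0)
print(slope)                               # -3/4

# Rough numerical cross-check on the explicit branch.
h = 1e-6
print((g(x0 + h) - g(x0 - h)) / (2 * h))
```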

PM. In the Dutch highschool programme, partial derivatives aren’t included, but when we can save time by a clear presentation, then they surely should be introduced.

##### Conclusion

The conclusion is that the algebraic approach to the derivative also settles the age-old question about the meaning of the differentials.

For texts in the past the interpretation of the differential is a mess. For the future, textbooks now have the option of above clarity.

Again, a discussion about didactics is an inspiration for better mathematics. Perhaps research mathematicians have abandoned this topic for ages, and it is only looked at by researchers on didactics.

##### Appendix A. Spiegel (1962)

Quote from Murray Spiegel (1962), “Advanced calculus (Metric edition)”, Schaum’s outline series, p58-59.

##### Appendix B. Adams & Essex (2013)

The following quote is from Robert A. Adams & Christopher Essex (2013), “Calculus. A Complete Course”, Pearson, p236.

• It is a pity that they use c as a value of x rather than as a universal name for a constant (value on the y axis).
• For them, the differential cannot be zero, while Spiegel conversely states that it is “not necessarily zero”.
• They clearly show that you can take f ‘[x] Δx in {Δx, Δy} space, and that you then need a new symbol for the outcome, since Δy already has been defined differently. However, it is awkward to say: “For such an approximation, the quantity Δx is traditionally denoted as dx (…)”. It may well be true that John from Los Angeles is called Harry in New York, … etcetera, see above.