The “supremum” norm

Today I shall speak on a topic that I feel is important.

All of us have encountered the “\sup” norm in Functional Analysis. In C[a,b], \|f_1-f_2\|=\sup\limits_{x\in[a,b]}|f_1(x)-f_2(x)|. In B(X,Y), the space of bounded linear operators from X to Y, \|T_1-T_2\|=\sup\limits_{x\in X,\ x\neq 0}\frac{\|(T_1-T_2)x\|}{\|x\|}. What is the utility of this “sup” norm? Why can’t our norm be based on “inf”, the infimum? Or even \frac{\sup+\inf}{2}?

First we’ll talk about C[a,b]. Let us take a straight line in the X-Y plane, and a sequence of continuous functions converging to it. How do we know they’re converging? Through the norm. Had the norm been based on \inf, the functions would only have to come close to the limit somewhere, not everywhere. For example, according to such a “norm”, the sequence f_n=|x|+\frac{1}{n} converges to y=0, since \inf\limits_{x}|f_n(x)-0|=\frac{1}{n}\to 0. This does not appeal to our aesthetic sense, as we’d want the shapes of the graphs to gradually start resembling y=0.
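Here is a minimal numerical sketch of this contrast (my own illustration, taking the interval to be [-1,1] and a uniform grid of sample points):

```python
import numpy as np

# Compare the sup- and inf-"distances" between f_n(x) = |x| + 1/n and the zero function on [-1, 1].
x = np.linspace(-1.0, 1.0, 10001)

for n in (1, 10, 100, 1000):
    diff = np.abs(np.abs(x) + 1.0 / n)          # |f_n(x) - 0|
    print(f"n={n:5d}  sup-distance={diff.max():.4f}  inf-distance={diff.min():.4f}")

# The inf-distance shrinks to 0 (the graphs touch y = 0 "somewhere", near x = 0),
# but the sup-distance stays near 1, so f_n does not converge uniformly to 0.
```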

Now let’s talk about B(X,Y). If we had the \inf norm, we might not be able to take every point x\in X and show that the sequence (T_n x), n\in\Bbb{N}, is a Cauchy sequence in Y. So what if we cannot take every x\in X and say (T_n x) is a Cauchy sequence? This would crush the completeness proof. How? Because then \lim\limits_{n\to\infty} T_n would not resemble the terms of the Cauchy sequence at all points of X, and hence we wouldn’t be able to comment on whether \lim\limits_{n\to\infty} T_n is linear (and bounded) on all of X. Considering that B(X,Y) contains only bounded linear operators, the limit of the Cauchy sequence (T_n) failing to lie in B(X,Y) would mean that B(X,Y) is not complete. Hence, in order for us to be able to prove that \lim\limits_{n\to\infty} T_n is a bounded linear operator and that B(X,Y) is complete (which it is, whenever Y is complete), we need to use the \sup norm.
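To make the role of the \sup norm concrete, here is the standard estimate, spelled out directly from the definition of the operator norm above: for every x\in X, \|T_nx-T_mx\|=\|(T_n-T_m)x\|\leq \|T_n-T_m\|\,\|x\|. So if (T_n) is Cauchy in the \sup (operator) norm, then (T_nx) is automatically Cauchy in Y at every single point x, and we may define Tx=\lim\limits_{n\to\infty}T_nx pointwise; linearity of T then follows from the linearity of each T_n. An \inf-based “norm” would give no such uniform control over all of X.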

On making a choice between hypotheses

At 15:33, Peter Millican says “How can any criterion of reliable knowledge be chosen, unless we already have some reliable criterion for making that choice?”

What does this actually mean? Say I have two hypotheses, A and B. One of them is true, whilst the other is false. But I don’t know which is which. How should I choose? Maybe I should choose the one which corroborates the existing body of knowledge to a greater extent. Or I could choose the one which corroborates my personal experiences to a greater extent (my personal experiences and the existing body of knowledge would certainly overlap, but the former may or may not be a subset of the latter). Clearly, we have a problem here. On what criterion should I base my selection? This is precisely the question being discussed here.

This, I feel, brings us to an even more fundamental question: how do we make choices? Let us suppose I have to choose between two identical-looking bars, based on sight alone. One is a gold bar, and the other an imitation. As they look exactly the same, I have no reason to choose one over the other. This, of course, is true only after removing biases like “I should choose the bar on the left as the left is my lucky side”, or “Today is the 13th, and I had an accident on the 13th while turning to the right, hence I should choose left”, etc. Hence, the criterion for selection, without the addition of any further knowledge, would be arbitrary.

However, when making choices between two non-identical hypotheses, we have a definite bias. I might say my textbook supports A. Or that my personal experiences support B. Now based on whether I trust my textbook more or my personal experiences and reasoning, I shall make a choice accordingly. The criterion for making a choice in such situations is hence based on evaluating biases, of one form or another, and deciding which choice I’m naturally more inclined or biased towards. A lot of this evaluation takes place implicitly in the brain in our daily lives.

This, of course, assumes that the choice-making process is completely rational, which in humans it is not. I could, for example, arbitrarily choose the option which the weighted sum of my biases does not support.

The factoring of polynomials

This article is about the factorization of polynomials:

First, I’d like to discuss the most important trick that is used directly or implicitly in most of the theorems given below. Let us consider the polynomial f(x)=(a_mx^m+a_{m-1}x^{m-1}+\dots+a_0)(b_nx^n+b_{n-1}x^{n-1}+\dots+b_0), and fix a prime p. Let a_i be the first coefficient of a_mx^m+a_{m-1}x^{m-1}+\dots+a_0, starting from a_0, not to be a multiple of p, and let b_j be the first such coefficient of b_nx^n+b_{n-1}x^{n-1}+\dots+b_0 (assuming such coefficients exist). Then the coefficient of x^r in f(x) is definitely a multiple of p whenever r\leq i+j-1, the coefficient of x^{i+j} is definitely NOT a multiple of p, and about the coefficient of x^t for t>i+j we can say nothing in general, except that the coefficient of x^{m+n}, namely a_mb_n, is not a multiple of p whenever p divides neither leading coefficient.
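Here is a quick numerical illustration of this trick (a sketch; the polynomials and the prime p=3 below are my own choices):

```python
# Sketch: illustrate the "first coefficient not divisible by p" trick, with p = 3.
# a(x) = 3 + 6x + 2x^2 + 4x^3  -> first coefficient not divisible by 3 is a_2 (i = 2)
# b(x) = 9 + 1x + 5x^2         -> first coefficient not divisible by 3 is b_1 (j = 1)
# So every coefficient of a(x)b(x) of degree below i + j = 3 should be divisible by 3,
# the coefficient of x^3 should not be, and beyond that nothing is guaranteed
# (except the leading one, since 3 divides neither leading coefficient).
p = 3
a = [3, 6, 2, 4]
b = [9, 1, 5]

prod = [0] * (len(a) + len(b) - 1)
for i, ai in enumerate(a):
    for j, bj in enumerate(b):
        prod[i + j] += ai * bj

for r, c in enumerate(prod):
    print(f"coefficient of x^{r}: {c:4d}   divisible by {p}: {c % p == 0}")
```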

1. Gauss’s lemma (polynomial)- A primitive polynomial is one with integer coefficients such that the gcd of all the coefficients is 1. The lemma states that if a primitive polynomial is irreducible over \Bbb{Z}, then it is irreducible over \Bbb{Q}. A useful consequence: a monic polynomial with integer coefficients can have a rational root only if that root is an integer. The converse is also obviously true, as \Bbb{Z\subset Q}. A very interesting proof can be found here, although Proof 1 is not completely correct. The characterization of c_{i+j} should actually be \sum\limits_{0\leq k\leq m,\,0\leq i+j-k\leq n}{a_k b_{i+j-k}}. This is to include the possibility of i+j>m or i+j>n. Also, note that in the other coefficients of f(x)g(x) we would not find such a glaring contradiction (a coefficient that p clearly fails to divide). This proof by explicit demonstration is indeed brilliant. But wait. This just proves that the product of primitive polynomials is primitive. It doesn’t prove that a primitive polynomial can only be factored into primitive polynomials. This proves the aforementioned. The most important thing to concentrate on here is how to write a rational polynomial as a constant times a primitive polynomial. Any polynomial that factors over \Bbb{Q} can be written as a constant times a product of two primitive polynomials; it is only when the original polynomial is itself primitive that this constant is a unit (\pm 1). (A small computational check of the primitivity claim appears after this list.)

2. Rational root theorem- It states that if a polynomial with integer coefficients has a rational root \frac{p}{q} (in lowest terms), then p divides the constant term and q divides the leading coefficient (if the polynomial is a_n x^n +a_{n-1}x^{n-1}+\dots +a_0, then p|a_0 and q|a_n). A proof is given here. I’d like to add certain things to the proof given in the article for a clearer exposition. First of all, make the polynomial primitive by taking out the (integer) gcd of the coefficients; call this gcd g, and work with the resulting primitive polynomial. Using Gauss’s lemma and the proof in the article, we can easily deduce that q|\frac{a_n}{g} and p|\frac{a_0}{g}. As both of these hold and g is an integer, we get q|a_n and p|a_0. (A short computational sketch of the resulting root search appears after this list.)

3. Eisenstein criterion- The statement and the proof are given here. I want to discuss the true essence of the proof. Write the polynomial as a_nx^n+\dots+a_0 and suppose it factors as (b_kx^k+\dots+b_0)(c_lx^l+\dots+c_0), with both factors of positive degree. The most important condition is that p^2\not| a_0. What this essentially does is split p between b_0 and c_0; i.e., p cannot divide both of them. How does this splitting change anything? We’ll come to that. Another important condition is that p\not| a_n. This forces us to conclude that not all the b_i, and not all the c_i, are divisible by p. Hence there is a first coefficient b_r and a first coefficient c_t not divisible by p. If p\not|b_0, then a_t is not divisible by p, which contradicts the third condition, that p|a_i for i<n (note that t\leq l<n). How? Because a_t=b_0c_t+b_1c_{t-1}+\dots+b_tc_0, and all terms except b_0c_t are divisible by p. This is where the splitting helped: if there were no such splitting (if p divided both b_0 and c_0), then a_t could still have been divisible by p. Similarly, if p\not| c_0, then p\not|a_r. Remember the point that I elaborated at the very beginning of the article? Try to correlate that with this proof: here a_{i+j} becomes a_t or a_r, because the first coefficient not divisible by p in one of the factors is b_0 or c_0, i.e. it sits at index 0. (A small check of the criterion appears after this list.)
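Referring back to point 1: here is a small computational check (my own example polynomials) that the product of two primitive polynomials is again primitive, i.e. that the content (gcd of the coefficients) of the product is 1:

```python
from math import gcd
from functools import reduce

def content(poly):
    """gcd of the coefficients of an integer polynomial (listed from the constant term up)."""
    return reduce(gcd, (abs(c) for c in poly))

def multiply(a, b):
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] += ai * bj
    return prod

f = [6, 10, 15]   # 6 + 10x + 15x^2, content 1
g = [4, 9, 6]     # 4 + 9x  + 6x^2,  content 1
print(content(f), content(g), content(multiply(f, g)))   # expect: 1 1 1
```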
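Referring back to point 2: a short sketch of the root search the theorem licenses (the example polynomial is my own, chosen to have the roots 3, -2 and 1/2):

```python
from math import gcd

def rational_roots(coeffs):
    """Find the rational roots of a_0 + a_1 x + ... + a_n x^n (coeffs listed from a_0 up, a_0 != 0),
    by testing the candidates p/q with p | a_0 and q | a_n given by the rational root theorem."""
    def divisors(m):
        m = abs(m)
        return [d for d in range(1, m + 1) if m % d == 0]
    n = len(coeffs) - 1
    roots = set()
    for p in divisors(coeffs[0]):
        for q in divisors(coeffs[-1]):
            for num in (p, -p):
                # q^n * P(num/q), computed in exact integer arithmetic
                val = sum(c * num**k * q**(n - k) for k, c in enumerate(coeffs))
                if val == 0:
                    g = gcd(abs(num), q)
                    roots.add((num // g, q // g))
    return roots

# 2x^3 - 3x^2 - 11x + 6 has the rational roots 3, -2 and 1/2.
print(rational_roots([6, -11, -3, 2]))   # roots as (numerator, denominator) pairs
```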
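Referring back to point 3: a minimal check of the Eisenstein conditions (the polynomial and the prime p=5 are my own illustrative choices):

```python
def eisenstein_applies(coeffs, p):
    """Check Eisenstein's conditions for a_0 + a_1 x + ... + a_n x^n at the prime p:
    p | a_i for all i < n,  p does not divide a_n,  and p^2 does not divide a_0."""
    a0, an = coeffs[0], coeffs[-1]
    return (all(c % p == 0 for c in coeffs[:-1])
            and an % p != 0
            and a0 % (p * p) != 0)

# x^4 + 10x^3 + 15x^2 + 5x + 5 is irreducible over Q by Eisenstein at p = 5.
print(eisenstein_applies([5, 5, 15, 10, 1], 5))   # True
```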

Euclidean rings and prime factorization

Now we will talk about the factorization of elements in Euclidean rings. On pg.146 of “Topics in Algebra” by Herstein, it says:

“Let R be a Euclidean ring. Then every element in R is either a unit in R or can be written as the product of a finite number of prime elements in R.”

This seems elementary. Take any element a\in R. If it is prime, then we’re done. If it is not, then keep on splitting it into factors. For example, let a=bc. If b is prime, then we leave it as it is. If it is not (if b=fg), we split it as fgc. The same with c, and so on.

The theorem says a can be represented as the product of a finite number of primes. But what if this splitting is a never-ending process? We don’t face this problem with \Bbb{N}, as splitting causes positive integers to decrease in magnitude and there’s a lower limit to how much a positive integer can decrease. But we might face this problem with other Euclidean rings.

Circumventing this problem throws light on a fact that is often forgotten: d(a)\leq d(ab). When we take a and start splitting it, we’re decreasing the d-values of the individual factors as we continue to split them. Note that we will not split an element into a unit times one of its associates. For example, if f and g are associates, and a=ft, then we will not split f as u_1 g where u_1 is a unit, as we’re only looking to split elements that are not prime. Hence, if the only possible splittings of f involve units, we know that f is prime, and we leave it as it is. Let us suppose f=pq, where neither p nor q is a unit. Then d(p),d(q)<d(f), as neither factor is an associate of f. This shows that as we keep splitting non-prime elements into factors that are not units, the d-value of each individual factor keeps strictly decreasing. As d-values are non-negative integers, such a strictly decreasing process cannot go on forever, and we’re bound to arrive at a finite factorization in a finite number of steps.
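As a small illustration, here is the splitting process in the most familiar Euclidean ring, \Bbb{Z} with d(a)=|a| (my own example, restricted to positive integers for simplicity; the point is only that the d-value of every factor strictly drops at each split, so the process terminates):

```python
def split_once(a):
    """Return a factorization (b, c) of a with both factors non-units (1 < b, c < a), or None."""
    for b in range(2, a):
        if a % b == 0 and a // b > 1:
            return b, a // b
    return None

def split_into_primes(a):
    """Keep splitting non-prime factors; d(x) = |x| strictly drops at every split, so this terminates."""
    factors, stack = [], [a]
    while stack:
        x = stack.pop()
        s = split_once(x)
        if s is None:
            factors.append(x)      # x cannot be split into non-units, so we leave it as it is
        else:
            b, c = s
            print(f"split {x} = {b} * {c}   (d-values drop: {x} -> {b}, {c})")
            stack.extend([b, c])
    return sorted(factors)

print(split_into_primes(360))      # [2, 2, 2, 3, 3, 5]
```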

What’s the deal with units then? We’ll return to this after a short discussion on d(0) and d(1).

If a\neq 0, then d(1)\leq d(1.a)=d(a) for every a\in R. Hence, d(1)=d(u) for every unit u\in R (proof: if u is a unit, then d(u)\leq d(u.u^{-1})=d(1) as well as d(1)\leq d(1.u)=d(u); note also that the associates of 1 are exactly the units, since any associate of 1 has the form u_1.1=u_1 with u_1 a unit), and d(1)<d(a) for all non-zero non-units in R.

Now what about d(0)? If the axiom of a Euclidean ring had been a=qb+r\implies d(r)<d(b), rather than d(r)<d(b) provided r\neq 0, then we could conduct some investigation into this. Let us imagine d(0) exists. Writing a=1.a+0, we’d get d(0)<d(a) for all a\in R. But 0=0.0+0 would then give d(0)<d(0), which is impossible. Hence, in order to keep d well-defined and also keep a=qb+r\implies d(r)<d(b) as an axiom used to define a Euclidean ring, we forego defining d(0).

Now returning to our discussion: we’ve already noted that d(0) is not defined, and that 1 has the lowest d-value in R. Moreover, as all associates of 1 are units, d(u)=d(1) for every unit u. If it were possible to split u into factors ab such that neither a nor b is a unit, then d(a),d(b)<d(u), which is not possible, as d(u)=d(1) is already the lowest d-value. Hence no unit can be split into non-units; in this sense, every unit u is prime in a Euclidean ring (though by the usual convention units are excluded from being called prime). Note that this is not a natural property of such structures, but a result of the arbitrary axiom that d(ab)\geq d(a),d(b).

Summarising the above arguments:

1. A unit is prime in a Euclidean ring R.

2. Every non-zero element in R can be split into a finite number of prime factors.

3. In order to avoid contradictions, d(0) is not defined. Also, 1 has the lowest d-value.

4. Removing the axiom d(ab)\geq d(a),d(b) would nullify a lot of these properties.

Euclidean rings and generators of ideals

This is to address a point that has just been glossed over in “Topics in Algebra” by Herstein.

In a Euclidean ring R, there is a function d from the non-zero elements of R to the non-negative integers, with d(a)\leq d(ab) for all non-zero a,b\in R, such that for any two elements a,b\in R with b\neq 0, \exists q,r\in R such that a=bq+r, where either r=0 or d(r)<d(b).

We also know that an element with the lowest d-value in R generates the whole ring R. The proof of this is elementary.

But what if there is more than one element with the same lowest d-value? Do they all generate R?

Yes. Proof: Let d(x)=d(y), where x and y are elements of R with the lowest d-value. Then for any third element c\in R, c=q_1 x=q_2 y for some q_1,q_2\in R. Hence, both x and y divide c. In particular, they also divide each other. Hence, x and y have to be associates. In other words, x=uy, where u is a unit in R.

Let us approach this from the opposite direction now. Suppose x=uy, where u is a unit. Axiomatically, d(y)\leq d(uy)=d(x). Similarly, since y=u^{-1}x, we get d(x)\leq d(y). This shows that d(x)=d(y). Therefore, whenever two elements a and b are associates, their d-values are the same.
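A tiny illustration in \Bbb{Z} with d(a)=|a| (my own example): in the ideal generated by 12 and 18, the elements of lowest d-value are 6 and -6; they are associates, have the same d-value, and each generates the ideal.

```python
from math import gcd

# The set {12m + 18n : m, n in Z} is the ideal 6Z, where 6 = gcd(12, 18).
sample = sorted({12 * m + 18 * n for m in range(-10, 11) for n in range(-10, 11)})
nonzero = [x for x in sample if x != 0]
lowest = min(abs(x) for x in nonzero)

print(gcd(12, 18))                                # 6
print([x for x in nonzero if abs(x) == lowest])   # [-6, 6]: associates, with the same d-value
print(all(x % 6 == 0 for x in nonzero))           # True: 6 (and -6) divides every element, so each generates the ideal
```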

Note that if we did not have the axiom that d(a)\leq d(ab), then there would be no reason to believe that if a and b are associates, then d(a)=d(b). Hence, ideals could then potentially be generated  by elements whose d-values would not be the lowest in the ideal, with the restriction that all those elements would be associates of the lowest d-value element.

A summary of the important points is:

1. Associates have the same d-value.

2. An element a generates an ideal iff it has the lowest d-value in the ideal.

3. All associates of the lowest d-value element in an ideal generate the same ideal.

4. If we did not have the axiom d(a)\leq d(ab), then point 1 would not be true, point 2 would not be true (a generator of an ideal wouldn’t have to have the lowest d-value), but point 3 would still be true.

_______________________________________________________________________________

There are a couple of things I’d like to add here.

Why is it that a prime element should be such that its factorization does not contain a unit element? Generally, when we think about prime numbers among the positive integers, we imagine a number which is absolutely not factorizable except in the form 1.p (p being the prime). A sense of unbreakability is felt. Here, the same prime element p can be broken up in at least n ways, where n is the number of units in the Euclidean ring R. The sense of absolute unbreakability is lost. I suppose the reason for this is that the concept of ‘unit’ is just an extension of 1 in the natural numbers. As factorizations of the form 1.p are not counted when dealing with natural numbers, factorizations of the form u.p_1 shouldn’t count in R, where p_1 is an associate of p.

Also note that the addition or deletion of axioms would greatly change the structure of Euclidean rings. For example, deleting the axiom d(ab)\geq d(a),d(b) could allow elements with no finite prime factorization, and adding an axiom such as d(a+b)<d(ab) would further alter the structure of R. One should not forget that the properties of the elements of R are a result of these defining axioms, and that adding or deleting axioms would cause substantial alterations. It is because Euclidean rings mimic so many properties of the natural numbers that we find them important to study.

Integral domains and characteristics

Today we shall talk about the characteristic of an integral domain, concentrating mainly on misconceptions and important points.

An integral domain is a commutative ring with the property that if a\neq 0 and b\neq 0, then ab\neq 0. Hence, if ab=0, then a=0 or b=0 (or both).

The characteristic of an integral domain is the lowest positive integer c such that \underbrace{1+1+\dots +1}_{\text{ c times}}=0. If no such positive integer exists, the characteristic is defined to be 0.

Let a\in D, where D is our integral domain. Then \underbrace{a+a+\dots +a}_{\text{ c times}}=a\underbrace{(1+1+\dots +1)}_{\text{ c times}}=0. This is because a.0=0.

If \underbrace{a+a+\dots +a}_{\text{ d<c times}}=0, then we have a\underbrace{(1+1+\dots +1)}_{\text{ d times}}=0. This is obvious for a=0. If a\neq 0, then this implies \underbrace{1+1+\dots +1}_{\text{ d times}}=0, which contradicts the fact that c is the lowest positive integer such that 1 added c times to itself is equal to 0. Hence, if c is the characteristic of the integral domain D, then it is the lowest positive integer such that any non-zero member of D, added c times to itself, gives 0. No member of D can be added a lower number of times to itself to give 0.

Sometimes \underbrace{a+a+\dots +a}_{\text{ c times}} is written as ca. One should remember that this has nothing to do with the multiplication operation in the ring. In other words, this does not imply that \underbrace{a+a+\dots +a}_{\text{ c times}}=c.a, where c is a member of the domain. In fact, c does NOT have to be a member of the domain. It is just an arbitrary positive integer.

Now on to an important point: something that is not emphasized, but should be. By distributivity, any expression of the form

\underbrace{\underbrace{a+a+\dots +a}_{\text{m times}}+\underbrace{a+a+\dots +a}_{\text{m times}}+\dots +\underbrace{a+a+\dots +a}_{\text{m times}}}_{\text{n times}}=\underbrace{(a+a+\dots +a)}_{\text{m times}}(\underbrace{1+1+\dots +1}_{\text{n times}}).

Now use this knowledge to prove that the characteristic of an integral domain is either 0 or a prime.
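For completeness, here is a sketch of how the identity above gets used. Suppose the characteristic c is finite but composite, say c=mn with 1<m,n<c. Then

0=\underbrace{1+1+\dots +1}_{\text{ c times}}=\underbrace{(1+1+\dots +1)}_{\text{ m times}}(\underbrace{1+1+\dots +1}_{\text{ n times}}),

and since an integral domain has no zero divisors, one of the two bracketed sums must itself be 0, contradicting the minimality of c. Hence a finite characteristic must be prime (and if no such c exists at all, the characteristic is 0).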

Ordinals- just what exactly are they?!

If ordinals have not confused you, you haven’t really made a serious attempt to understand them.

Let me illustrate this. If I have 5 fruits (all different) and 5 plates (all different), then I can bijectively map the fruits to the plates. However I arrange the fruits or the plates, I can still bijectively map them.

Let’s suppose I have a finite set A, and don’t know its cardinality. But I know that it bijectively maps to B. This directly implies that however I arrange A or B, they will still bijectively map to each other.

This intuition fails for infinite sets. In fact, weird things start happening for infinite sets. The natural numbers can be bijectively mapped to the rational numbers. What?!! Isn’t the set of rational numbers a superset of the natural numbers?! Yes. Then how can there be a bijection between them? Bijection implies both sets contain the same number of elements.

No. That the cardinality should be the same is not part of the definition of bijection. A bijection is defined as an injective and surjective mapping between two sets. It is just that, in the case of finite sets, a bijection matches our intuitive notion of “same number of elements”. For the rational numbers, Cantor’s diagonal (zig-zag) enumeration assigns to every rational number a unique pre-image amongst the natural numbers. Hence, we have a bijection.
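Here is a sketch of that enumeration (positive rationals only, for simplicity; the code is my own illustration):

```python
from math import gcd

def positive_rationals():
    """Walk the diagonals p + q = 2, 3, 4, ... of the (p, q) grid, skipping fractions that are
    not in lowest terms, so that every positive rational appears exactly once."""
    s = 2
    while True:
        for p in range(1, s):
            q = s - p
            if gcd(p, q) == 1:
                yield f"{p}/{q}"
        s += 1

gen = positive_rationals()
print([next(gen) for _ in range(10)])
# ['1/1', '1/2', '2/1', '1/3', '3/1', '1/4', '2/3', '3/2', '4/1', '1/5']
```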

Coming back to ordinals, let \omega denote the ordered set of natural numbers. Does \Bbb{N} bijectively map to \Bbb{N}\cup \{\pi\}? Not if you use the mapping f(n)=n. However, if you map f(1)=\pi and f(n)=n-1 for n\geq 2, then you’re done. Note that if two infinite sets are in bijection, that does not imply that every one-to-one mapping between them will be surjective. It just means that there exists *one* such mapping. This is in direct contrast with the case of finite sets, in which every injective mapping between two sets of the same cardinality is surjective.

What if you map \omega to \omega+1? Note that as \omega is ordered and the mapping must preserve that order, there is only *one* mapping we’re allowed to have: f(n)=n. We can clearly see that \omega (i.e. \Bbb{N} with its usual order) cannot be mapped onto \omega+1 in an order-preserving way: the extra top element is never reached.

Where most mathematical texts fail is in actually explaining these finer points to students. Most of them just regurgitate the material present in the “classics” of the subject. The most important thing to note here is that if we have two infinite sets considered without any ordering, then there *might* be some bijective mapping between them, and finding one can be tricky; Cantor’s diagonal enumeration, for example, is brilliant and non-trivial. Hence, for un-ordered infinite sets, we can’t immediately be sure whether they’re bijective with \Bbb{N}. However, in the case of ordinals, as there is only one order-preserving mapping to consider, determining whether the set is order-isomorphic to \omega is trivial.

 

Why substitution works in indefinite integration

Let’s integrate \int{\frac{dx}{\sqrt{1-x^2}}}. We know the trick: substitute x=\sin\theta. We get dx=\cos\theta d\theta. Substituting into the original integral, we get \int{\frac{\cos\theta d\theta}{\sqrt{1-\sin^2\theta}}}=\int{\frac{\cos\theta d\theta}{|\cos\theta|}}. Let us assume \cos\theta remains positive throughout the interval under consideration. Then we get the integral as \theta+c, or \arcsin x+c.

I have performed similar operations for close to five years of my life now. But I was never, ever, quite convinced by it. How can you, just like that, substitute dx by \cos\theta d\theta? My teacher once told me this: \frac{dx}{d\theta}=\cos\theta. Multiplying by d\theta on both sides, we get dx=\cos\theta d\theta. What?!! It doesn’t work like that!!

It was a year back that I finally derived why this ‘ruse’ works.

Take the function x^2. If you differentiate this with respect to x, you get 2x. If you integrate 2x, you get x^2+c. Simple.

Now take the function \sin^2\theta. Differentiate it with respect to \theta. You get 2\sin\theta.\cos\theta. If you integrate 2\sin\theta.\cos\theta, you get \sin^2\theta+c.

The thing to notice is that when you integrate the two functions, 2x and 2\sin\theta.\cos\theta (each with respect to its own variable), you get a function of the form y^2 in both cases. Whichever variable I integrate with respect to, I ultimately want a function of the form y^2, so that I can put y=x (directly, or via y=\sin\theta=x) and get x^2.

In the original situation, let us imagine there’s a function f(x)=\int{\frac{dx}{\sqrt{1-x^2}}}. We’ll discuss the properties of f(x). If we were to make the substitution x=\sin\theta in f(x) and differentiate it with respect to \theta, we’d get a function of the form \frac{1}{\sqrt{1-y^2}}\cos\theta, where y is \sin\theta. There are two things to note here:

1. The form of the derivative of f(x) wrt \theta is the same as that of f'(x), which is \frac{1}{\sqrt{1-y^2}}, multiplied by \cos\theta, the derivative of \sin\theta wrt \theta.

2. When any function is differentiated with respect to some variable, integrating with respect to that same variable gives us back the same function (up to a constant). Hence, \int{\frac{df}{dx}dx}=\int{\frac{df}{d\theta}d\theta}

Coming back to \int{\frac{dx}{\sqrt{1-x^2}}}, let us assume its integral is f(x). Its derivative, on substituting x=\sin\theta and differentiating wrt \theta, has the same form as \frac{df}{dx} multiplied by \cos\theta. This is a result of the chain rule of differentiation. Now following point 2, we know \int{\frac{dx}{\sqrt{1-x^2}}}=\int{\frac{\cos\theta d\theta}{\sqrt{1-\sin^2\theta}}}.

How is making the substitution x=\sin\theta justified? Could we have made any other substitution, like x=\theta^2 +\tan\theta^3? Let us say we substitute x=g(\theta), with g differentiable. We want g(\theta) to take all the values x can take; this is the condition that must be satisfied by any substitution. For values that g(\theta) takes but x doesn’t, we restrict the domain of \theta so that the range of g(\theta) matches that of x. Note that the shapes of f(x) as plotted against x and f(\sin\theta) as plotted against \theta will be different. But that is irrelevant, as long as we recover the same Cartesian pairs (m,n), where m is the x-coordinate and n is the y-coordinate.

Summing up the argument: we predict the form the derivative of f(x) will take when the substitution x=\sin\theta is made, and then integrate this new form wrt \theta to get back the original function. This is why the ‘trick’ works.
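As a sanity check (my own, purely numerical), one can verify on a definite integral that the two sides really do agree, say from 0 to 0.5:

```python
import numpy as np

def riemann(f, a, b, n=200000):
    """Simple midpoint Riemann sum; good enough for a sanity check."""
    t = np.linspace(a, b, n + 1)
    mid = 0.5 * (t[:-1] + t[1:])
    return np.sum(f(mid)) * (b - a) / n

lhs = riemann(lambda x: 1.0 / np.sqrt(1.0 - x**2), 0.0, 0.5)
rhs = riemann(lambda th: np.cos(th) / np.sqrt(1.0 - np.sin(th)**2), 0.0, np.arcsin(0.5))
print(lhs, rhs, np.arcsin(0.5))   # all three agree to several decimal places (about 0.5236)
```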

Fermat’s Last Theorem

When in high school, spurred by Mr. Scheelbeek’s end-of-term inspirational lecture on Fermat’s Last Theorem, I tried proving the same for…about one and a half long years!
For documentation purposes, I’m attaching my proof. Feel free to outline the flaws in the comments section.

Let us assume FLT is false, i.e. x^n + y^n =z^n for some positive integers x,y,z and some n>2. We know x^n + y^n<(x+y)^n (n is assumed to be greater than one here). Hence, z<x+y. Moreover, we know z^n-x^n<(z+x)^n. Hence, y<z+x. Similarly, x<y+z.

So we have the three inequalities: z<x+y, y<x+z, and x<y+z.

x,y,z satisfy the triangle inequalities! Hence, x,y,z form a triangle.

Using the cosine rule, we get z^2=x^2 +y^2 -2xy\cos C, where C is the angle opposite side z.

Raising both sides to the power \frac{n}{2}, we get z^n=(x^2 +y^2 -2xy\cos C)^{\frac{n}{2}}. Now if n=2 and C=\frac{\pi}{2}, we get z^2=x^2+y^2. This is the case of the right-angled triangle.

However, if n\geq 3, then the right hand side, which is (x^2 +y^2 -2xy\cos C)^{\frac{n}{2}}, is unlikely to simplify to x^n + y^n.

There are multiple flaws in this argument. Coming to terms with them was a huge learning experience.

Binomial probability distribution

What exactly is binomial distribution?

Q. A manufacturing process is estimated to produce 5\% nonconforming items. If a random sample of five items is chosen, find the probability of getting two nonconforming items.

Now one could say: let there be 100 items in total. Then the required probability would be \frac{{5\choose 2}{95\choose 3}}{{100\choose 5}} . In what order the items are chosen is irrelevant. This comes out to be roughly 0.018, while the textbook answer is about 0.021. Where did we go wrong?

Why should we assume there are exactly 100 items in total? Let us instead let n\to\infty and determine \lim\limits_{n\to\infty}\frac{{.05n\choose 2}{.95n\choose 3}}{{n\choose 5}} . What if 0.95n and 0.05n are not integers? We use the gamma function.

We get \frac{{.05n\choose 2}{.95n\choose 3}}{{n\choose 5}}=\frac{\frac{\Gamma(0.05n+1)}{\Gamma(3)\,\Gamma(0.05n-1)}\cdot\frac{\Gamma(0.95n+1)}{\Gamma(4)\,\Gamma(0.95n-2)}}{\frac{\Gamma(n+1)}{\Gamma(6)\,\Gamma(n-4)}}, where \Gamma(z)=\int_{0}^{\infty}{t^{z-1}e^{-t} dt} extends the factorial to non-integer arguments.

My textbook says this tends to {5\choose 2}(0.05)^2 (0.95)^3. This is something you could verify for yourself.
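Here is one way to verify it (my own sketch, using the gamma-function form \binom{x}{k}=\frac{\Gamma(x+1)}{\Gamma(k+1)\Gamma(x-k+1)} for non-integer x):

```python
from math import comb, lgamma, exp

def gen_binom(x, k):
    """Generalized binomial coefficient C(x, k) via the gamma function, for possibly non-integer x."""
    return exp(lgamma(x + 1) - lgamma(k + 1) - lgamma(x - k + 1))

binomial = comb(5, 2) * 0.05**2 * 0.95**3
for n in (100, 1000, 10000, 100000):
    hyper = gen_binom(0.05 * n, 2) * gen_binom(0.95 * n, 3) / gen_binom(n, 5)
    print(f"n = {n:6d}   hypergeometric probability = {hyper:.5f}")
print(f"binomial limit                 = {binomial:.5f}")   # about 0.02143
```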

Another question: say you roll a die 5 times. Find the probability of getting exactly two 6s. The probability as determined by combinatorics is \frac{{5\choose 2}5^3}{6^5} . You must have applied the binomial distribution before in such problems; you know the answer to be {5\choose 2}(\frac{1}{6})^2 (\frac{5}{6})^3 . This matches the answer determined before. So why is it that we’re exactly right here in determining the probability, while we were not before?

Binomial probability agrees with elementary probability when separate arrangements of the selected items are counted as distinct, and when the total number of items is known rather than merely guessed at. When the total number of items is not known and only the percentage of success is known, the binomial probability is an approximation arrived at by letting n approach infinity.