tag:blogger.com,1999:blog-32146486079968395292023-10-27T00:07:19.986+01:00The Winding NumberInsights on interesting things.Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.comBlogger139125tag:blogger.com,1999:blog-3214648607996839529.post-56328914373914472792023-06-15T20:14:00.002+01:002023-06-15T20:15:45.587+01:00Ordinals and well-founded recursion<p>&nbsp;Two closely related questions about ordinals that I found quite confusing at first and couldn't find a satisfactory answer online (self-answering):</p><p></p><ul style="text-align: left;"><li>I've heard sentences like "$\omega_1^{CK}$ is the least ordinal that cannot be represented on a computer". What does this mean, precisely? Surely after exhausting notations for all computable ordinals, I could just represent $\omega_1^{CK}$ with something else? E.g. even in Kleene's $\mathcal{O}$ notation, I could just denote $\omega_1^{CK}$ by $-1$ or something?</li><li>We say things like "the strength of a theory corresponds to the smallest recursive ordinal it cannot prove well-founded" -- but <em>we know</em> every ordinal is well-founded, right? So are we better than every theory?</li></ul><p></p><hr /><div><br /></div><div><strong>implementing the ordinals</strong></div> <p>The thing that really clarified this for me was to actually implement ordinals in Python. Implementations in Haskell are plentiful of course (<a href="http://blog.jbapple.com/2007/02/countable-ordinals-in-haskell.html" rel="nofollow noreferrer">e.g.</a>) but Haskell doesn't feel as grounded, since you don't actually see the computations being done.</p> <pre><code>from __future__ import annotations
from typing import Callable

class Ordinal:
    def __init__(self, children : None | Ordinal | Callable[[int], Ordinal]):
        self.children = children
</code></pre> <p>That's it. 
An ordinal is defined by its "children" -- if that's <code>None</code>, it's 0; if that's a single ordinal, it's the successor of that ordinal; if that's an infinite sequence of ordinals, it's the limit of that sequence.</p> <p>You can then define some obvious operations.</p> <pre><code>Zero = Ordinal(None)

def Succ(x : Ordinal) -&gt; Ordinal:
    return Ordinal(x)

def Lim(xs : Callable[[int], Ordinal]) -&gt; Ordinal:
    return Ordinal(xs)

def Add(x : Ordinal, y : Ordinal) -&gt; Ordinal:
    if y.children is None:                 # x + 0 = x
        return x
    elif isinstance(y.children, Ordinal):  # x + (z+1) = (x+z) + 1
        return Succ(Add(x, y.children))
    else:                                  # x + lim z_n = lim (x + z_n)
        return Lim(lambda n : Add(x, y.children(n)))

def Mul(x : Ordinal, y : Ordinal) -&gt; Ordinal:
    if y.children is None:                 # x * 0 = 0
        return Zero
    elif isinstance(y.children, Ordinal):  # x * (z+1) = x*z + x
        return Add(Mul(x, y.children), x)
    else:                                  # x * lim z_n = lim (x * z_n)
        return Lim(lambda n : Mul(x, y.children(n)))

def Pow(x : Ordinal, y : Ordinal) -&gt; Ordinal:
    if y.children is None:                 # x ^ 0 = 1
        return Succ(Zero)
    elif isinstance(y.children, Ordinal):  # x ^ (z+1) = x^z * x
        return Mul(Pow(x, y.children), x)
    else:                                  # x ^ lim z_n = lim (x ^ z_n)
        return Lim(lambda n : Pow(x, y.children(n)))
</code></pre> <p>And represent some small ordinals:</p> <pre><code># ordinal corresponding to some finite n
def Fin(n : int) -&gt; Ordinal:
    x = Zero
    for i in range(n):
        x = Succ(x)
    return x

Omega = Lim(Fin)

# tower omega^omega^...^omega of height n
def Foo(n : int) -&gt; Ordinal:
    x = Succ(Zero)
    for i in range(n):
        x = Pow(Omega, x)
    return x

Epsilon0 = Lim(Foo)
</code></pre> <hr /> <p><strong>explicitly seeing them as functions</strong></p> <p>What stops us from making an infinite sequence of <em>all</em> (computable) ordinals? That would be <a href="https://en.wikipedia.org/wiki/Burali-Forti_paradox" rel="nofollow noreferrer">paradoxical</a>, of course -- if you had an ordinal whose children/descendants included itself, its definition would not be a well-founded recursion, it would be circular. 
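</p>

<p>Before going on, a quick sanity check on the arithmetic above (a hedged sketch; the snippet restates the definitions so it runs standalone). Ordinal addition is not commutative, and this representation makes that visible: $\omega+1$ is a successor, while $1+\omega$ is a limit -- in fact just another presentation of $\omega$, since $1+\omega=\lim_n (1+n)$.</p>

```python
from __future__ import annotations
from typing import Callable

class Ordinal:
    def __init__(self, children: None | Ordinal | Callable[[int], Ordinal]):
        self.children = children

Zero = Ordinal(None)

def Succ(x: Ordinal) -> Ordinal:
    return Ordinal(x)

def Lim(xs: Callable[[int], Ordinal]) -> Ordinal:
    return Ordinal(xs)

def Add(x: Ordinal, y: Ordinal) -> Ordinal:
    if y.children is None:
        return x
    elif isinstance(y.children, Ordinal):
        return Succ(Add(x, y.children))
    else:
        return Lim(lambda n: Add(x, y.children(n)))

def Fin(n: int) -> Ordinal:
    x = Zero
    for _ in range(n):
        x = Succ(x)
    return x

Omega = Lim(Fin)

# omega + 1 is a successor: its children slot holds a single Ordinal
assert isinstance(Add(Omega, Fin(1)).children, Ordinal)
# 1 + omega is a limit: its children slot holds a function of n
assert callable(Add(Fin(1), Omega).children)
```

<p>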
But it's easy to enumerate all programs, is it really so hard to enumerate specifically programs that represent objects of the <code>Ordinal</code> class?</p> <p>Let's think of a way to reduce an ordinal to a simple-to-understand program. The natural way to do this is:</p> <pre><code>def call(x : Ordinal) -&gt; Callable | None:
    if x.children is None:
        return None
    elif isinstance(x.children, Ordinal):
        return lambda : call(x.children)
    else:
        return lambda n : call(x.children(n))
</code></pre> <p>So e.g. <code>call(Epsilon0)</code> returns a curried function that you can keep calling as follows:</p><p>$$\epsilon_0(2)=\omega^\omega$$</p><p>$$\omega^\omega(4) = \omega^4$$</p><p>$$\omega^4(5) = \omega^3\cdot 5$$</p><p>$$\omega^3\cdot 5(2) = \omega^3\cdot4+\omega^2\cdot2$$</p><p>$$\omega^3\cdot4+\omega^2\cdot2(4) = \omega^3\cdot4+\omega^2+\omega\cdot4$$</p><p>$$\omega^3\cdot4+\omega^2+\omega\cdot4(3)=\omega^3\cdot4+\omega^2+\omega\cdot3+3$$</p><p>$$\omega^3\cdot4+\omega^2+\omega\cdot3+3()()()=\omega^3\cdot4+\omega^2+\omega\cdot3$$</p><p>... and ultimately $\epsilon_0(2)(4)(5)(2)(4)(3)()()()(1)()(1)()(1)()(2)(2)()()(1)()(0)(0)(0)(1)(2)(3)()()()(2)()()=0$, which you can check by running</p><p></p> <pre><code>print(call(Epsilon0)(2)(4)(5)(2)(4)(3)()()()(1)()(1)()(1)()(2)(2)()()(1)()(0)(0)(0)(1)(2)(3)()()()(2)()())
</code></pre> <p>You may think this is quite long, but I've actually chosen fairly small numbers at each step. The ridiculous thing is that despite being absurdly long, all of these paths do, in fact, terminate! 
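</p>

<p>For a descent small enough to check by hand (a standalone sketch, restating just the pieces of the code above that it needs): start at $\omega$, pick $5$, and the path $\omega \to 5 \to 4 \to 3 \to 2 \to 1 \to 0$ bottoms out after five more steps.</p>

```python
class Ordinal:
    def __init__(self, children):
        self.children = children

Zero = Ordinal(None)

def Fin(n: int) -> Ordinal:
    x = Zero
    for _ in range(n):
        x = Ordinal(x)  # successor
    return x

Omega = Ordinal(Fin)  # limit of 0, 1, 2, ...

def call(x: Ordinal):
    if x.children is None:
        return None
    elif isinstance(x.children, Ordinal):
        return lambda: call(x.children)
    else:
        return lambda n: call(x.children(n))

# omega, then 5, then five successor steps down to zero
assert call(Omega)(5)()()()()() is None
```

<p>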
This is a statement that <em>should initially surprise you</em> by the way -- you might think you can just provide larger and larger numbers at each step and get a non-terminating sequence, but nope.</p><p>$$\epsilon_0(2)(3)(4)(5)(6)()^6(7)()^7(8)()^8(9)()^9(10)(11)(12)()^{12}(13)()^{13}\dots(22)()^{22}(23)(24)()^{24}(25)()^{25}\dots(45)()^{45}(46)(47)(48)()^{48}(49)()^{49}\dots(94)()^{94}=0$$</p><p>In general, the main point is that these sequences always terminate, i.e. the ordinals represent <strong>total</strong> functions. But given an arbitrary such function, it is not clear if it represents an ordinal. This is still a bit of a claim to unpack, at least to me, so it is helpful to think of ordinals from an entirely new perspective--</p> <hr /> <p><strong>nested for loops</strong></p> <p>Remember the first time you actually bothered to implement recursion, because nested <code>for</code> loops wouldn't suffice? <code>while</code> loops would do it, of course, but often that's inelegant.</p> <p>Generally speaking, if the only function we are allowed to define without recursion is the successor function (as is the case in primitive recursion), one nested for loop is sufficient to define addition $f_1(n)$, two to define multiplication $f_2(n)$, three to define exponentiation $f_3(n)$, and so on to define any <a href="https://en.wikipedia.org/wiki/Hyperoperation" rel="nofollow noreferrer">hyperoperation</a>.</p> <p>And indeed this demonstrates that the hyperoperations stratify (in terms of growth rate) the primitive recursive functions -- every primitive recursive function is eventually dominated by some hyperoperation -- but there are total computable functions that grow faster than any hyperoperation (i.e. than any primitive recursion), which you can define by diagonalization: i.e. $f_\omega(n)$, which uses $n$ for loops where $n$ is the input value (this is basically the "Ackermann function"), and you can go further and further. 
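</p>

<p>This diagonalization is easy to sketch with the <code>Ordinal</code> class from the first half of the post (a hedged sketch: the function names and the choice of fundamental sequences here are mine, and the definitions are restated so the snippet runs standalone):</p>

```python
class Ordinal:
    def __init__(self, children):
        self.children = children

Zero = Ordinal(None)

def Fin(n: int) -> Ordinal:
    x = Zero
    for _ in range(n):
        x = Ordinal(x)  # successor
    return x

Omega = Ordinal(Fin)  # limit of 0, 1, 2, ...

def f(a: Ordinal, n: int) -> int:
    """The fast-growing hierarchy f_a(n), reading fundamental
    sequences off the ordinal's children."""
    if a.children is None:                 # f_0(n) = n + 1
        return n + 1
    elif isinstance(a.children, Ordinal):  # f_{a+1}(n) = f_a iterated n times on n
        result = n
        for _ in range(n):
            result = f(a.children, result)
        return result
    else:                                  # f_{lim a_k}(n) = f_{a_n}(n)
        return f(a.children(n), n)

assert f(Zero, 5) == 6     # f_0(n) = n + 1
assert f(Fin(1), 3) == 6   # f_1(n) = 2n
assert f(Fin(2), 3) == 24  # f_2(n) = n * 2^n
assert f(Omega, 2) == 8    # diagonalization: f_omega(2) = f_2(2)
```

<p>Already $f_\omega$ has Ackermann-like growth; feeding in larger ordinals from the post gives functions that dwarf it.</p>

<p>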
This is called the&nbsp;<a href="https://en.wikipedia.org/wiki/Fast-growing_hierarchy" rel="nofollow noreferrer">fast-growing hierarchy</a>.</p> <p>Quite generally such total computable functions can be written recursively, by <a href="https://en.wikipedia.org/wiki/Kleene%27s_recursion_theorem#Kleene%27s_second_recursion_theorem" rel="nofollow noreferrer">Kleene's fixed-point theorem</a> (such a function is the "fixed point" of an operator like $\lambda (f, x) \mapsto xf(x-1)$ or whatever). What seems to be a good reference:</p> <p><a href="https://core.ac.uk/download/pdf/82045685.pdf" rel="nofollow noreferrer">Fairtlough &amp; Wainer (1992). Ordinal Complexity of Recursive Definitions.</a></p> <p>TL;DR: Specifying an ordinal is equivalent to specifying a particular recursion. It is not immediately obvious that a given recursion is well-founded. ZFC can't actually prove "bla bla recursion equals bla bla big ordinal", even though it (non-constructively) "has" the latter ordinal.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-28017232084694909882022-10-23T15:17:00.016+01:002023-01-03T11:43:32.405+00:00A crash course on mathematical logic: Part I<p><i>Update:</i>&nbsp;the up-to-date version of this is now available as a post on LessWrong:&nbsp;<a href="https://www.lesswrong.com/posts/xqxXrAohXSD3akYCg/meaningful-things-are-those-the-universe-possesses-a">Meaningful things are stuff the universe possesses a semantics for</a>.&nbsp;</p><hr /><p><strong><a href="https://math.stackexchange.com/a/4559696/78451">[post also on math stackexchange]</a></strong></p> <p>Revisiting my Godel/model theory questions after two years -- let me put down some thoughts on the generality in which I think these theorems should be viewed, in a way that naturally addresses the standard philosophical questions asked around this stuff, like:</p> <ol> <li> <p>Godel's incompleteness theorem 
sounds a lot like the Halting problem -- are they analogous or equivalent in any fundamental way?</p> </li> <li> <p>Does Godel's incompleteness theorem <em>or</em> the Halting problem <em>or</em> the Entscheidungsproblem mean that we can't know everything?</p> </li> <li> <p>The Entscheidungsproblem seems to "mix" Godel's incompleteness theorem with Turing completeness somehow -- it says that not only are there statements which can neither be proven nor disproven; there is, moreover, no algorithm that determines whether a given statement can be proven. Where does this really fit into this whole picture?</p> </li> <li> <p>Why does the term "first-order logic" (actually "first-order arithmetic") keep showing up in these contexts? How important is first-order logic to Godel?</p> </li> <li> <p>If "true" is different from "provable", what does "true" even mean?</p> </li> <li> <p>How do <em>we</em> know all these things? Are we beyond formal systems? Are we beyond computers? Wait, first of all, are we comparable to formal systems or to computers?</p> </li> <li> <p>Does Godel mean there cannot be a Theory of Everything/that we cannot find the Theory of Everything?</p> </li> </ol> <p>Out of these questions, only the last is truly a bad question (we know the rules governing computers anyway, that doesn't mean we can figure out their arbitrary-time behaviour), the other ones are quite good.</p> <p>Standard responses to these questions take the form "Godel's theorems are very specific statements about some formal systems, it's not useful to overgeneralize them beyond this", etc. This is the MTC-CYA ("minimal, technically correct, covers-your-ass") answer, and I think it's quite narrow and does a disservice to just how basic computation is in our world.</p> <p>I also don't think it is useful to get straight into specifics about models and theories, logics and languages, etc. because Godel's incompleteness theorem is a lot more general. 
If you say "Peano arithmetic is incomplete because there are non-standard models", but then second-order arithmetic <em>doesn't</em> have non-standard models, and sure there's another reason why that's incomplete, but somehow <em>we</em> still know that it has no non-standard models so are we beyond computers? (no, we just believe in things that don't believe in themselves) Then you have people saying "Oh, Godel's incompleteness isn't that bizarre -- it's like how group theory has many models!" AND WHY WOULD YOU EVEN MENTION NON-STANDARD MODELS -- OR ANY KIND OF MODELS -- IN THE FIRST PLACE? These are special ways in which incompleteness manifests, but incompleteness is a lot more general than that.</p> <hr /> <p><strong>Table of contents:</strong></p> <ol> <li> <p>Godel's first theorem: Imagine a rebellious computer. Panic.</p> </li> <li> <p>Godel's second theorem: why does the first theorem sound wrong?</p> </li> <li> <p>The Entscheidungsproblem; Turing degrees</p> </li> <li> <p>First-order arithmetic; arithmetical hierarchy</p> </li> <li> <p>Tarski: truth, interpretation and language</p> </li> <li> <p>Exercise: Lob's theorem</p> </li> <li> <p>Exercise: Related paradoxes</p> </li> <li> <p>Further reading</p> </li> </ol> <hr /> <p><strong>Godel's first theorem: Imagine a rebellious computer. Panic.</strong></p> <p>The right way to understand Godel's incompleteness theorem is to entertain all those philosophical questions about how it applies to the human mind -- and regard it as a statement far more generally about an agent with beliefs. We can reproduce the Halting argument directly for a human mind:</p> <blockquote> <p><em>Alice cannot predict the actions of Bob, where Bob is a computer (resp. person) programmed (resp. 
committed) to act as follows: I will read Alice's mind, and do something other than what she predicts I will do.</em></p> </blockquote> <p>In particular, Bob can be identical to Alice, so Alice cannot predict her own behaviour.</p> <p>This applies <em>regardless</em> of what system Alice uses to form her beliefs.</p> <p>Alice might adopt an axiomatic framework, say ZFC, which is capable of expressing Bob's mind and believe all the theorems of that axiomatic framework (i.e. Alice is an Oracle for ZFC), and Bob can still trick her. (This is the classic Godel's incompleteness theorem.)</p> <p>Alice might have some pre-defined algorithm to decide Bob's behaviour, and believe whatever this algorithm outputs, and Bob can still trick her. (This is the Halting problem.)</p> <p>So this general Godel's theorem really does tell you that "you can't know everything". If your "beliefs" work by assigning "True" or "False" to every statement, this means either (1) there are statements that you don't know are true or false, i.e. incompleteness (2) there are statements that you're wrong in your beliefs about, i.e. unsoundness.</p> <p>All that is required is that (1) you are capable of conceptualizing statements about computers (2) computers are able to read your mind -- in the case of formal systems, this means your theorems are computably enumerable. 
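</p>

<p>The whole diagonal argument fits in a few lines of Python (a toy sketch: "mind-reading" is modelled as Bob literally calling Alice's predictor on his own source):</p>

```python
def bob(alice):
    """Bob reads Alice's mind -- i.e. runs her predictor on himself --
    and then does the opposite of whatever she predicts."""
    prediction = alice(bob)  # True = "Bob will hum", False = "Bob won't"
    return not prediction

# No predictor is right about Bob, whatever it predicts:
assert bob(lambda who: True) != True
assert bob(lambda who: False) != False
```

<p>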
In the case of a formal system, the only statements you necessarily cannot know are the ones about long-term behaviour, because the way that a computer reads your mind is by enumerating your theorems for an indefinite amount of time -- for real minds which can just be scanned in a flash, incompleteness is even <em>worse</em>.</p> <hr /> <p><strong>Godel's second theorem: why does the first theorem sound wrong?</strong></p> <p>Even with Godel's incompleteness theorem written down explicitly for a human being, it is still tempting to think you are beyond it.</p> <p>I mean, sure, suppose Bob has two choices: to "hum" or to "not hum" at any given moment, and he decides to adopt his usual demeanor of tricking Alice: he will keep choosing "hum", but if Alice ever becomes certain that he will always choose "hum", he will choose "not hum" in that very moment.</p> <p>So Alice can never become certain that Bob will keep choosing "hum", and yet that is precisely what Bob does.</p> <p>But surely Alice can <em>see</em> that! Surely Alice can <em>see</em> the same argument we just saw for why Bob will keep choosing "hum"?</p> <p>What's going on?</p> <p>Let's write down the argument a bit more formally to see what's wrong.</p> <p>Represent Alice's axioms by $A$, and <em>Bob will always hum</em> by $B$. Then how do we "know" $B$ is true? Well, if it weren't -- if Bob were to have ever chosen to "not hum" -- then Alice must have found a proof that Bob will always choose "hum". We believe that if $A$ is a true model of reality -- i.e. if it proves something, that thing is really true in the real world (or rather in our belief system, or whatever). In other words, we argue: $\lnot B \implies (A\vdash B) \implies B$, which is a contradiction, and we conclude that $B$ must be true.</p> <p>The <em>only</em> additional assumption we made was $A$'s soundness -- we assumed $(A\vdash B)\implies B$. So <em>that</em> is the assumption Alice can't prove -- her own soundness. 
Technically, Godel's second incompleteness theorem differs from the first only in this canonical choice of unprovable statement.</p> <p>Repeat this to yourself: stronger theories are not smarter, they're just more confident. So the theory ZFC+"ZFC is consistent" can prove more things than ZFC, but it's not like ZFC doesn't know that ZFC+"ZFC is consistent" proves these things. It <em>can</em> read a stronger theory's mind, it just doesn't believe what it sees.</p> <hr /> <p><strong>The Entscheidungsproblem; Turing degrees</strong></p> <p>So can the Entscheidungsproblem -- the problem of determining whether a given statement is a theorem of the theory -- be resolved in general by a computer?</p> <p>No, because then in particular, we would be able to determine if a given Turing machine halts (as that can be formulated as a statement of ZFC/whatever). This is called a "Turing reduction" from one problem to another (in this case, from the Halting problem to the Entscheidungsproblem). There is also a Turing reduction from the Entscheidungsproblem to the Halting problem -- to determine if a statement is provable, just determine if the computer that searches for its proof halts.</p> <p>A Turing reduction in both ways is called "Turing equivalence", which is an equivalence relation -- an equivalence class under this equivalence is called a "Turing degree". In particular, the Turing degree of computable problems is <strong>0</strong>; the Turing degree of problems computable once given access to an Oracle for Halting is called the "Turing jump", denoted <strong>0'</strong>.</p> <p>Remember how the second condition of the Godelian argument was that (2) computers are able to read your mind? So an Oracle for Halting can decide the halting of regular computers because regular computers cannot read its mind -- but it cannot predict the halting of computers that have access to the Oracle. 
So you have this infinite chain of Turing jumps.</p> <p>An Oracle for PA, an Oracle for ZFC, an Oracle for Halting are all Turing-equivalent.</p> <hr /> <p><strong>First-order arithmetic; arithmetical hierarchy</strong></p> <p>Remember how the first condition of the Godelian argument was that (1) you are capable of conceptualizing statements about computers?</p> <p>The basic reason we care about first-order arithmetic is that it is capable of conceptualizing statements about computers. Note that it is <em>not</em> the "smallest" such system -- rather, it is capable of expressing the entire <a href="https://en.wikipedia.org/wiki/Arithmetical_hierarchy">arithmetical hierarchy</a>.</p> <ul> <li>A $\Sigma_1$ decision problem is one given by a rule of the form $\exists (...), P(x,(...))$ (i.e. input $x$ returns True if this condition is satisfied, False otherwise)</li> <li>a $\Pi_1$ decision problem is one given by a rule of the form $\forall (...), P(x,(...))$</li> <li>a $\Sigma_2$ decision problem is one given by a rule of the form $\exists (...), \forall (...), P(x,(...))$</li> <li>a $\Pi_2$ decision problem is one given by a rule of the form $\forall (...), \exists (...), P(x,(...))$</li> <li>a $\Sigma_3$ decision problem is one given by a rule of the form $\exists (...), \forall (...), \exists (...), P(x,(...))$</li> <li>a $\Pi_3$ decision problem is one given by a rule of the form $\forall (...), \exists (...), \forall (...), P(x,(...))$</li> </ul> <p>[this is called <a href="https://en.wikipedia.org/wiki/Prenex_normal_form">prenex normal form</a>, by the way]</p> <p>Note how $\Sigma_1$ is equivalent to computable enumerability; being both $\Sigma_1$ and $\Pi_1$ is equivalent to computability (do you see why?). 
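</p>

<p>Here is the $\Sigma_1$-to-computable-enumerability direction in code (a toy sketch; the example set is my own): a $\Sigma_1$ problem $\exists w, P(x, w)$ with decidable $P$ is semi-decided by searching for a witness -- the search halts exactly on the "yes" instances.</p>

```python
from typing import Callable

def semi_decide(P: Callable[[int, int], bool], x: int) -> bool:
    """Semi-decide the Sigma_1 set { x : exists w, P(x, w) }:
    halts (returning True) iff some witness w exists."""
    w = 0
    while True:
        if P(x, w):
            return True
        w += 1

# example Sigma_1 set: perfect squares, with the square root as witness
assert semi_decide(lambda x, w: w * w == x, 49)
# semi_decide(lambda x, w: w * w == x, 50) would loop forever --
# that yes/no asymmetry is exactly computable enumerability
```

<p>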
Essentially, $\exists$ corresponds to the operation of a computer, and $\forall$ corresponds to the operation of an Oracle -- so $\Sigma_{n+1}$ is equivalent to problems that are computably enumerable by a Turing machine with access to an Oracle for $\varnothing^{(n)}$.</p> <p>This basic connection between first-order arithmetic and Turing degrees is known as <a href="https://en.wikipedia.org/wiki/Post%27s_theorem">Post's theorem</a>.</p> <hr /> <p><strong>Tarski: truth, interpretation and language</strong></p> <p>[This should not take so long to explain, but I chose this section to actually start getting formal, and to start specializing to actual formal systems.]</p> <p>You know how we can define really big numbers with very few letters, like with Knuth's uparrow notation? You might wonder what the <em>largest</em> number is that you can express with only, say, 1000 characters.</p> <p>Oh, wait -- but whatever that number might be, I can express a <em>larger</em> number, also with less than 1000 characters, by writing: "The <em>largest</em> number you can express with 1000 characters, plus one".</p> <p>This is Berry's paradox, by the way.</p> <p>Let's think about it more formally. What we have is a map $f$ that assigns to each short formula some number -- then what does the string "max f + 1" represent?</p> <p>$$\texttt{equals one} \to 1$$</p> <p>$$\texttt{fourth fermat prime} \to 257$$</p> <p>$$\texttt{is even} \to \mathrm{NaN}$$</p> <p>$$\texttt{is smaller than itself} \to \mathrm{NaN}$$</p> <p>$$\texttt{kumquat} \to \mathrm{NaN}$$</p> <p>$$\texttt{square root of four} \to 2$$</p> <p>$$\texttt{max f + 1} \to \, ???$$</p> <p>For the paradox in natural language, one can just say -- well, it's natural language, the $f$ we're thinking of -- the $f$ that behaves as/assigns values to strings in a manner consistent with what we expect, just doesn't exist. But one can also talk about such an $f$ in a more formal context. 
For any formula $\texttt{x}$ of &lt;1000 characters, define:</p> <p>$$f(\texttt{x}) = \begin{cases} n &amp; \mathrm{if} \, x\, n, \, \forall m, \, x \, m \implies m = n \\ 0 &amp; \mathrm{if} \, \lnot\exists n, \, (x\, n, \, \forall m, \, x \, m \implies m = n)\end{cases}$$</p> <p>Then $f(\texttt{max f + 1}) = \max f + 1$, which is a contradiction. What's wrong?</p> <p>Note that getting $x$ from $\texttt{x}$ (interpreting what a string actually says) is perfectly acceptable -- it's a very simple and silly-looking operation, where you say the equal sign really means equals, the exist sign really means exists, the forall sign really means forall -- it's called the <a href="https://en.wikipedia.org/wiki/T-schema">T-schema</a>.</p> <p>What really goes wrong is when you try to condition on $x, n$ -- when you try to define a general predicate (like $f$) on all strings $\texttt{x}$ that tells you whether or not it holds. A minimal working example is -- for any formula $\texttt{x}$, define:</p> <p>$$t(\texttt{x}) = \begin{cases} 1 &amp; \mathrm{if} \, x \\ 0 &amp; \mathrm{else}\end{cases}$$</p> <p>$t$ is called the truth predicate, and it's not definable. Or in other words -- the set of Godel numbers of true sentences is not an arithmetical set. This is "Tarski's theorem". People like to write Tarski's theorem as "truth in the standard model cannot be defined in the theory", but I don't really like that pedagogically, even though I know those are <em>technically</em> the same. You would think that you could define a predicate on propositions that is true iff the proposition is true -- maybe PA doesn't know which propositions are true, and so it won't know the value of this predicate either, but apparently we can't even <em>define</em> it; it's just not an FOL formula. 
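</p>

<p>Concretely, this is the standard diagonal argument (a sketch of the textbook proof): if $t$ were an FOL formula, the diagonal lemma would give a sentence $L$ with $\vdash L \iff \lnot t(\texttt{L})$; but the defining property of $t$ is $t(\texttt{L}) \iff L$, and together these give $L \iff \lnot L$ -- the liar, a contradiction.</p>

<p>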
So "the largest number definable by a FOL formula satisfying some property" just isn't an FOL formula.</p> <p>[Perhaps something that will help make Tarski's theorem a little less unintuitive -- you can define a truth predicate on all $\Sigma_k$ sentences; you just can't quantify on <em>all</em> sentences. There is a certain analogy to set theory, where you cannot quantify on all sets.]</p> <p>This is all a bit vague, because we keep saying "A theory cannot define a truth predicate", but if you can't define a truth predicate, how are WE even talking about a truth predicate, and what does it <em>mean</em> that a theory can't define a truth predicate if <em>it can't even define it</em>?</p> <p>The thing is that all this while -- and I don't just mean in this answer, but in general, in math, in everything you've ever done and thought about, we've always been talking about a theory <em>from the perspective of another</em>. Philosophically, this "relative" perspective is exactly what meaning is -- which is why this theory of interpretations is <em>semantics</em>.</p> <p>[Sometimes this relative picture is lost in introducing semantics, and people just say semantics is about giving meaning to formal systems, without specifying that this meaning is within another theory. Because why not? We do all other math without annoyingly mentioning "this is done within ZFC" -- this is what we mean by a "foundation" for mathematics, a sufficiently powerful theory that believes the beliefs of most theories of interest to us in math, we can just study formal systems like we study any other mathematical structure. Thinking about semantics this way leads to the whole discipline of "model theory" especially "abstract model theory", and it turns out that model theory has a lot of interesting math of its own -- blabla compactness blabla Lowenheim-Skolem ... 
]</p> <p>There is no one way to define semantics, just like there is no one way to define "agents with beliefs" (although formal systems -- which by the way are in the most general setting tuples $(L, T)$ where $L$ is a set of sentences called a "language" -- which represents everything the agent can imagine or express, $T\subseteq L$ is some subset which we call the "theorems" -- which represents everything the agent believes -- are fairly general, you know, they're not perfect) -- <a href="https://math.stackexchange.com/a/3818951/78451">this answer</a> provides a fairly general definition -- theory $(L', T')$ "interprets" theory $(L, T)$ if there is a computable translation function $\iota: L\to L'$ such that $\iota(T)\subseteq T'$.</p> <p>So how does this help us define truth? Remember how we defined soundness -- "A formal system is sound if its theorems are actually true", i.e. $(T\vdash P)\implies P$; in fact it is better to write $(T\vdash P)\implies P'$. Hiding in this is an assumed definition for truth -- a sentence $P$ is true if and only if $P'$ -- $t(\texttt{x})\iff x$, just as what you'd think, except now $\texttt{x}$ is in one system and $x$ is in another system which is interpreting the first. This is "Convention T", now known as the <a href="https://en.wikipedia.org/wiki/Semantic_theory_of_truth">model-theoretic semantics</a>. Again, this is not the only possible way to do semantics, it is not the only possible way to assign meaning to a formal system, there are other ways in which a formal system can become meaningful in the real world (see <a href="https://en.wikipedia.org/wiki/Semantics_of_logic">Semantics of logic</a> for some examples, game semantics is a fun one).</p> <p>[But semantics in the form of model-theoretic semantics is just the right way to think about things like expressiveness (of languages) and strength (of theories). 
So if you ever wondered what it means for something to be a foundation for mathematics, you know.]</p> <p>And Tarski's theorem tells us that a predicate satisfying Convention T isn't definable within the theory -- i.e. there is no predicate that can be proven to satisfy Convention T ($t(\texttt{x})\iff x$) by any equally strong or stronger consistent theory.</p> <p>By the way, Tarski's theorem does have an equivalent for computers.</p> <p>Recall that the main elements of the Berry's paradox contradiction ($f(\texttt{max f + 1})&gt;\max f$) were: (1) that such an $f$ -- which assigns values to these strings in a manner consistent with what we expect, e.g. $\texttt{+ 1}$ means $+1$, etc. and (2) that $f$ can be described by sentences in this system -- for FOL this meant being expressible as a FOL statement, for computers this will mean computability.</p> <p>So you can construct the same paradox for computers -- suppose a computer were to go over all the computer programs of length &lt;1000, determine their outputs, and output 1 + their max. So this means a computer program cannot decide the output of an arbitrary computer program -- actually, the paradox does not require determining the output per se, but any non-trivial property of the output/"behaviour" (the property must be non-trivial -- i.e. not just apply to all or no programs -- so that our paradoxical program is actually able to choose to obey or violate it itself). 
For this connection, properties of program output are called "semantic properties", and "All non-trivial semantic properties of programs are undecidable to programs" is called Rice's theorem.</p> <p>[The standard way of formalizing the notion of a "semantic property" is as "a property of the language recognized by the program" (the set of strings that the program does not return "screw yourself" for) -- this is because although "language recognized by the program" sounds like something to do with the input, it's actually about what the program outputs in response. I don't like it, though, it obfuscates things -- maybe that formulation makes sense in formal grammar and stuff.]</p> <p>Tarski's theorem implies Godel's theorem (because in particular, provability cannot satisfy Convention T, "what I know" cannot be equivalent to "what is true"); Rice's theorem generalizes the Halting problem (halting is a semantic property).</p> <p>[There's a weird resemblance of Rice's theorem to Kolmogorov's zero-one law. It's nonsense, but I had to write this down somewhere or it would drive me insane.]</p> <hr /> <p><strong>Exercise: Lob's theorem</strong></p> <p>Lob's theorem is kind of an alternate way of looking at Godel's second theorem -- Godel's second theorem tells us we cannot believe in our own soundness, that we cannot believe that whatever we believe in fact holds. Lob's theorem says "Yeah, in fact, if you're sound, the only statements you can believe in your soundness for, are those that you believe anyway." -- i.e. if $T \vdash ((T\vdash P)\implies P)$, then $T\vdash P$.</p> <p>The proof of this is "Precommit to the following: if you believe that you will only eat a potato if you believe Germany borders China, then eat a potato". If you're sound, you'd better not be eating that potato.</p> <p>This is kind of interesting. 
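</p>

<p>The potato argument is the standard modal-logic derivation in disguise (a hedged sketch of the textbook proof, writing $\square Q$ for "$T$ proves $Q$"). By the diagonal lemma, take a sentence $\psi$ with $T \vdash \psi \iff (\square \psi \implies P)$ -- this is the potato precommitment. Then, reasoning inside $T$,</p>

<p>$$\square\psi \implies \square(\square\psi \implies P) \implies (\square\square\psi \implies \square P) \implies \square P,$$</p>

<p>using $\square\psi \implies \square\square\psi$ in the last step. Now suppose $T \vdash (\square P \implies P)$. Chaining, $T \vdash (\square\psi \implies P)$, which is exactly $\psi$; so $T \vdash \psi$, hence $T \vdash \square\psi$ by necessitation, hence $T \vdash \square P$, hence $T \vdash P$.</p>

<p>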
A classic application of Lob's theorem is as follows -- instead of the rebel Bob, you have the obedient Carl, who reads Alice's mind to see if she believes he will halt, and does so if she does. Will Carl halt?</p> <p>There is no obvious logical contradiction either way (I remember asking a similar question on PhysicsForums when I was eleven -- about the set of all sets which <em>do</em> contain themselves -- it's so silly! We get so used to these absurd paradoxical Bobs that now we're bewildered when a program <em>isn't</em> out to get us), but Lob's theorem gives us the answer: by construction, Alice knows that if she believes Carl will halt, he will -- so by Lob's theorem she must believe he halts.</p> <p><em>Exercise:</em> But aren't Carl's options symmetric? Suppose instead Carl said -- I'll raise whichever hand Alice believes I'll raise (and if she doesn't have a belief on either, I will raise my right hand by default). Which hand will he raise?</p> <blockquote class="spoiler" data-spoiler="Reveal spoiler"> <p> <em>Solution:</em> This modified problem actually <em>does not</em> have a Lob premise -- Alice does <em>not</em> believe that her believing Carl will raise some hand actually implies he will, because Carl does not consider the case where Alice believes both. This is fine for Carl if he believes in Alice's soundness, but Alice does not trust her own soundness, so this means nothing to her. [He could instead precommit to raising all the hands she believes he will raise -- and in this case, he would actually end up raising both hands, but there is no unsoundness, because raising his left and raising his right are no longer mutually exclusive, and Alice has in fact predicted correctly.]</p> </blockquote> <hr /> <p><strong>Exercise: Jailor paradox</strong></p> <p>A jailor tells a prisoner he will be hung on one of Days 1 or 2 -- and when he is hung, he will be surprised, i.e. 
he will not have expected it to have occurred on that day with certainty.</p> <p>The prisoner, on Day 1, reasons as follows: I cannot be hung on Day 2, because then I will know with certainty that I will be hung then; thus I must be hung today. But now I expect to be hung, and thus cannot be hung ...</p> <p>There are two aspects to this paradox: (1) is the same as the argument of Godel's theorem (an antagonistic agent who swears to do the opposite of what you expect him to), and (2) is the fact that the jailor has <em>promised</em> to certainly hang the prisoner. The first is just Godel's theorem; the second part is actually paradoxical (to see this, consider the one-day case -- there the jailor's promise is outright impossible to keep).</p> <p>If you're not convinced, represent this problem formally. Represent the prisoner's beliefs on Day 2 (should he still be alive then) as a formal system A2, with axioms:</p> <p>A2.0 -- $X = 2$</p> <p>A2.1 -- $(A2 \vdash X = 2) \implies \lnot (X = 2)$</p> <p>And his beliefs on Day 1 as a formal system A1, with axioms:</p> <p>A1.0 -- $X = 1 \lor X = 2$</p> <p>A1.1 -- $(A1 \vdash X = 1) \implies \lnot (X = 1)$</p> <p>A1.2 -- $X\ne 1 \land (A2 \vdash X = 2) \implies \lnot (X = 2)$</p> <p>Without the A1.0, A2.0 axioms, you just have Godel's incompleteness theorem. But with them, you have a real paradox (because then the prisoner can literally prove he will be executed that day), so the jailor's promise becomes self-contradictory: his promise to only execute the prisoner when it's a surprise contradicts his prerogative to necessarily execute the prisoner on one of these days. 
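</p><p>To see the inconsistency explicitly (a sketch; the second step uses the fact that a formal system can verify its own concrete proofs, i.e. provable $\Sigma_1$-completeness):</p><p>\begin{align}A2 &amp;\vdash X = 2 &amp;&amp; \text{(axiom A2.0)}\\ A2 &amp;\vdash (A2 \vdash X = 2) &amp;&amp; \text{($A2$ can verify the one-line proof above)}\\ A2 &amp;\vdash \lnot (X = 2) &amp;&amp; \text{(by A2.1)}\end{align}</p><p>-- which contradicts axiom A2.0, so $A2$ is inconsistent. 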
When the jailor comes to execute the prisoner the very next day, he's violating his promise, since the prisoner <em>can</em> prove that he will be executed (that he can also prove he will not be executed is irrelevant).</p> <hr /> <p><strong>Further reading</strong></p> <p>To read about how classical "paradoxes" can be treated as instances in a general, category-theoretic framework, see: <a href="https://arxiv.org/abs/math/0305282">Yanofsky's <em>A Universal Approach to Self-Referential Paradoxes, Incompleteness and Fixed Points</em></a>&nbsp;(there's a <a href="https://www.youtube.com/watch?v=dwNxVpbEVcc">Youtube video by Thricery</a> explaining this paper, if you don't like reading).</p> <p>Not covered in this exposition: stuff related to the philosophical question of "What the right axioms are?" -- physics, reflection, ordinal analysis/Church-Kleene limit, Chaitin's constant, axiom of choice and infinite hats and what can be modelled by computers. Relevant reading: <a href="https://terrytao.wordpress.com/2010/03/19/a-computational-perspective-on-set-theory">Terry Tao's <em>A computational perspective on set theory</em></a>, <a href="https://philosophy.stackexchange.com/a/2634/4313">Ron Maimon's <em>Was mathematics invented or discovered?</em></a>, <a href="https://www1.maths.leeds.ac.uk/~rathjen/realm.pdf">Michael Rathjen's <em>The Realm of Ordinal Analysis</em></a>.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-80460111007032846192022-07-24T15:51:00.010+01:002022-07-24T19:47:53.671+01:00Number theory: Part I<p><b>Table of contents:</b></p><p></p><ul style="text-align: left;"><li><b>Introduction: the general strategy of finding homomorphic spaces</b></li><li><b>Integral points on hyperplanes: prerequisites</b></li><li><b><b>Integral points on hyperplanes</b><b>: Bezout, Euclid's algorithm, Gaussian elimination and Smith normal form</b></b></li></ul><hr 
/><p><b>Introduction</b></p><p>Number theory is just algebra done over the integers. This seems kind of weird at first -- the integer lattice seems like a rather arbitrary lattice to care about; I mean, if you're looking for solutions for an equation over the reals, you can always show existence and approximate the solution by analytic means, but whether or not an arbitrary real number is an integer is just so <i>arbitrary</i>, why would you even CARE about that--</p><p>But the integers do have some special structure. The most obvious interesting thing about the integers is <i>divisibility</i>. One can show something isn't a solution by taking both sides $\mod b$ for some $b$ -- i.e. one can get restrictions on the space of possible solutions by considering the problem in $\mathbb{Z} \mod b$.&nbsp;</p><p>This is an example of a much more general strategy in mathematics to restrict the space of possible solutions -- apply a homomorphism to both sides and eliminate non-solutions in the homomorphic space. If you apply an <i>isomorphism</i>, you can also do the converse -- i.e. accept solutions in the isomorphic space.</p><p>So we're interested in the structure of $\mathbb{Z}$, so as to know what it can homomorph to. Addition and multiplication are rather important to the structure of $\mathbb{Z}$&nbsp;-- we like to call this a <i>ring</i>. In fact, $\mathbb{Z}$&nbsp;is the "free ring". If you know any category theory, this means $\mathbb{Z}$&nbsp;is the ring that has homomorphisms to every other ring.&nbsp;</p> <div class="twn-furtherinsight">Actually, $\mathbb{Z}$&nbsp;is the "free ring with no generators" -- the free ring with some generators $X_1,\dots X_n$ is called the polynomial ring with these variables, and is the free object in the category of "rings that contain the elements $X_1,\dots X_n$". BTW this is why Galois theory is so important, because it studies quotients of $K[X]$, the free ring with one generator. 
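</div><p>As a quick concrete instance of the mod-$b$ strategy above (a minimal sketch in plain Python, nothing assumed beyond the text): squares are $0$ or $1 \bmod 4$, so an equation like $x^2+y^2=4k+3$ has no integer solutions.</p>

```python
# x^2 mod 4 depends only on x mod 4, so enumerating four residues suffices.
squares_mod_4 = {x * x % 4 for x in range(4)}          # {0, 1}
sums_mod_4 = {(a + b) % 4 for a in squares_mod_4
              for b in squares_mod_4}                  # {0, 1, 2}

# 3 is not a sum of two squares mod 4,
# so x^2 + y^2 = 4k + 3 has no solutions over the integers.
assert 3 not in sums_mod_4
```

<div class="twn-furtherinsight">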
Also $\mathbb{Q}$ plays the analogous role for fields: it is the prime field of characteristic zero, embedding uniquely in every such field, and the rational function fields play this role with some generators.&nbsp;</div> <p>In basic number theory, there are two sorts of rings we will be interested in homomorphisms to: $\mathbb{Z}/b\mathbb{Z}$ (which we've discussed) and $\mathbb{Z}[\alpha]$ for some non-integer $\alpha$.&nbsp;</p><p>This is why the algebraic properties of these rings are of so much importance in number theory -- that's why <i>algebra</i>&nbsp;is so important in number theory. The Chinese remainder theorem -- the statement that for pairwise coprime $n_1,\dots n_k$, the obvious projection map $\mathbb{Z}/(n_1\dots n_k)\mathbb{Z}\to\mathbb{Z}/n_1\mathbb{Z}\times\dots\times\mathbb{Z}/n_k\mathbb{Z}$ is a ring isomorphism -- is such an algebraic property. Stuff about quadratic and higher residues (squares mod $b$) are such algebraic properties. Stuff about when $\mathbb{Z}[\alpha]$ is a UFD, PID blabla are such algebraic properties. We will make reference to them throughout this course without getting into the details of things.</p><p>So the topic is&nbsp;<i>solving equations over the integers. </i>In particular, solving polynomial equations over the integers.&nbsp;</p><hr /><p><b>Integral points on hyperplanes: prerequisites</b></p><p>The following animation is a picture illustration of Euclid's algorithm:</p> <div class="separator" style="clear: both;"><a href="https://upload.wikimedia.org/wikipedia/commons/1/1c/Euclidean_algorithm_1071_462.gif" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="800" data-original-width="345" height="320" src="https://upload.wikimedia.org/wikipedia/commons/1/1c/Euclidean_algorithm_1071_462.gif" /></a></div> <div class="twn-furtherinsight">Make sure you understand exactly what this algorithm is doing at each step, and how it corresponds to Euclid's algorithm as you know it (think of it with a particular example). 
Why does this work -- how do we know we will finish up the rectangle in a finite number of steps (i.e. actually get a GCD), what property of the integers ensures this?</div> <div class="twn-beg">Anyone got intuition for Bezout's identity? I can say that I'm not a fan of using Euclid's algorithm -- instead, I prefer the following argument: let $d$ be the smallest positive $ax+by$ -- certainly the gcd must divide $d$, but also $d$ must divide $a$ (and analogously $b$), or else the remainder $a \bmod d$ -- itself of the form $ax+by$ -- would be a smaller positive such combination. So $d$ must be the gcd.</div> <p><span></span></p><hr /><p><b>Integral points on hyperplanes</b><b>: Bezout, Euclid's algorithm, Gaussian elimination and Smith normal form</b></p><p>$$381x+66y=b$$</p><p>So Bezout's identity tells us that the equation $a_1x+a_2y=b$ (by convention $a_1, a_2, b$ are known/"constants" while $x, y, z$ are to be solved for/"variables") is solvable iff $\mathrm{gcd}(a_1,a_2)$ divides $b$. In fact it says the equation $\sum a_ix_i=b$ is solvable iff $\mathrm{gcd}((a_i))$ divides $b$.&nbsp;</p><p>In fact it tells us there is an algorithm to compute this solution $x_i$ -- namely, Euclid's algorithm.&nbsp;</p><p>To help us generalize this: the nicest way to think of Euclid's algorithm algorithmically is as a sequence of matrix multiplications. 
E.g.</p><p>\begin{align}381&amp;=5\times 66+51\\ 66&amp;=1\times 51+15\\ 51&amp;=3\times 15+6\\15&amp;=2\times 6+3\\6&amp;=2\times 3+0\end{align}</p><p>\begin{align}\left[\begin{array}{c}381\\66\end{array}\right]&amp;=\left[\begin{array}{cc}5&amp;1\\1&amp;0\end{array}\right]\left[\begin{array}{cc}1&amp;1\\1&amp;0\end{array}\right]\left[\begin{array}{cc}3&amp;1\\1&amp;0\end{array}\right]\left[\begin{array}{cc}2&amp;1\\1&amp;0\end{array}\right]\left[\begin{array}{cc}2&amp;1\\1&amp;0\end{array}\right]\left[\begin{array}{c}3\\0\end{array}\right]\\&amp;=\left[\begin{array}{cc}127&amp;52\\22&amp;9\end{array}\right]\left[\begin{array}{c}3\\0\end{array}\right]\\ \left[\begin{array}{cc}3&amp;0\end{array}\right] &amp;= \left[\begin{array}{cc}381&amp;66\end{array}\right]\left[\begin{array}{cc}-9&amp;22\\52&amp;-127\end{array}\right]&nbsp;\end{align}</p><p>So in effect, we're solving the linear equation (where $\mathbf{a},\mathbf{d}$ are row vectors) $\mathbf{a}\mathbf{x}=b$ by instead solving $\mathbf{d}\mathbf{y}=b$ where $\mathbf{d}=\mathbf{a}\mathbf{T}$ is of the form $(d,0)$ and $\mathbf{T}\in\mathrm{GL}(\mathbb{Z}^2)$.&nbsp;</p><p>Note also that left- (resp. right-) multiplication by $\left[\begin{array}{cc}\lambda&amp;1\\1&amp;0\end{array}\right]$ can also be understood as "add $\lambda$ copies of Row 1 to Row (resp. Col) 2, then switch Row (resp. Col) 1 and Row (resp. Col) 2"&nbsp; -- i.e.</p><p>$$\left[\begin{array}{cc}\lambda&amp;1\\1&amp;0\end{array}\right]=\left[\begin{array}{cc}0&amp;1\\1&amp;0\end{array}\right]\left[\begin{array}{cc}1&amp;0\\\lambda&amp;1\end{array}\right]$$</p> <div class="twn-furtherinsight">The group $\mathrm{GL}(\mathbb{Z}^2)$ can be understood as the group of <i>GCD-preserving transformations</i>. 
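</div><p>The $381x+66y$ computation above can be cross-checked with a short extended-Euclid sketch (a standard recursion; the function name is mine):</p>

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    # b*x + (a % b)*y == g and a % b == a - (a // b)*b, so rearrange:
    return g, y, x - (a // b) * y

g, x, y = ext_gcd(381, 66)
assert (g, x, y) == (3, -9, 52)   # matching the matrix computation above
assert 381 * x + 66 * y == 3
```

<div class="twn-furtherinsight">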
(The algorithm to simplify $\mathrm{gcd}(ax+by,cx+dy)$ -- using the identity $\mathrm{gcd}(x+\lambda y, y)=\mathrm{gcd}(x,y)$ -- is precisely Euclid's algorithm.)</div> <p>But the nice thing about this matrix approach is that it can be used to solve <i>systems</i>&nbsp;of linear equations. It is not sufficient to solve these equations independently and state the conditions as $d_1\mid c_1,\dots d_m\mid c_m$, because the solution sets for each equation needn't intersect (think: $x+2y=1, x+4y=2$ -- each is solvable alone, but together they force $2y=1$).&nbsp;</p><p>Instead, we solve $\mathbf{A}\mathbf{x}=\mathbf{b}$ by solving $\mathbf{D}\mathbf{y}=\mathbf{c}$ where $\mathbf{D}=\mathbf{S}\mathbf{A}\mathbf{T}$, $\mathbf{c}=\mathbf{S}\mathbf{b}$ and $\mathbf{D}$ is diagonal, $\mathbf{S},\mathbf{T}\in\mathrm{GL}(\mathbb{Z}^n)$. This $\mathbf{D}$ is called the <i>Smith normal form</i>&nbsp;of $\mathbf{A}$, and can be seen as a generalization of the Greatest Common Divisor to a set of integer vectors.</p> <div class="twn-beg">If anyone has a good thorough exposition for the algorithm to calculate the Smith normal form -- and some analogous intuition to that for its special case the Euclidean algorithm -- let me know.</div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-10867939527941169012022-06-27T13:48:00.005+01:002022-06-27T13:48:37.097+01:00Bayes, bias, p-hacking and the Monty Hall problem<p><b>TL;DR:</b>&nbsp;Bias is a fundamental concept in the science of agents; bias of a source is fundamentally related to cognitive bias.</p><hr /><p>You open the news in the morning, and you see the following headline from <i>The Pan-Pan times</i>: <b>80% of Chimpanzee Party lawmakers have criminal cases filed against them.</b></p><p>"Outrageous!" You cry-- "The Chimpanzee Party has no respect for the law! 
I certainly shall not be voting for them!"&nbsp;</p><p>But upon subsequent investigation, you find that while this statement is true, in fact, 80% of Bonobo Party lawmakers <i>also</i>&nbsp;have criminal cases filed against them. Or perhaps you find that they have criminal cases filed against them -- in an unrecognized country created by some crackpot on Reddit. And the person responsible for generating these headlines was aware of this fact.</p><p>You feel cheated, even though everything you heard was the truth. Even though The Pan-Pan times only gave you information, and you acted as a rational agent when exposed to this information (well, actually you didn't, but let's ignore that for now), you feel like you've been "exploited" somehow, "tricked" even.&nbsp;</p><p>Is this truly possible? Can a Bayes-rational agent truly be fooled by cleverly selecting information to provide?</p><p>Let's think about the problem more carefully.&nbsp;</p><p>Our ultimate decision might be governed by the following rule: <i>vote for whichever party you expect to have fewer criminal cases filed against its lawmakers</i>. We may have some prior distribution on the fraction of criminal-accused lawmakers of each party, say $\mathrm{B}(2,3)$ and $\mathrm{B}(3,2)$ for the Chimpanzee and Bonobo parties respectively. 
Under the prior, the expected fractions of criminal-accused lawmakers are 0.4 and 0.6 respectively, and one would vote for the Chimpanzee Party; under the posterior, the expected fractions of criminal-accused lawmakers are 0.8 and 0.6 respectively, and one would vote for the Bonobo Party.&nbsp;</p><p>Or.</p><p>Our ultimate decision might be governed by the following rule: <i>vote for whichever party minimizes the expected sum $X_1+\dots+X_{100}$</i>, where say: $X_1$ is the regulatory burden the party will place upon coming to power, $X_2$ is the tax rate the party will implement upon coming to power, $X_3$ is the number of criminal cases against the party lawmakers, $X_4$ is the number of bad words the party's candidates use on TV, $X_5$ is the number of lies the party's candidates say on TV, $X_6$ is the number of dissidents the party will throw in prison upon coming to power, etc.</p><p>Let's say, for simplicity, that these are all Bernoulli distributed variables -- further that each $X_i$ (Chimpanzee Party) and $Y_i$ (Bonobo Party) are distributed as $\mathrm{Bernoulli}(0.5)$. Then in this prior distribution, we would be uncertain as to whom to vote for, as both parties have the same expected sum: $\mathrm{E}[\sum X_i]=\mathrm{E}[\sum Y_i]=50$.&nbsp;</p><p>And suppose the Pan-Pan Times tells us: "We looked at $X_{3}$ and it turned out to be 1 for the Chimpanzee Party!" Now, $\mathrm{E}[\sum X_i]=50.5$, while $\mathrm{E}[\sum Y_i]=50$, so we vote for the Bonobo Party.&nbsp;</p><p>But here's the thing though: the precise information you receive isn't <i>$X_3$ is equal to 1</i> -- it is <i>the Pan-Pan Times's report is "$X_3$ is equal to 1"</i>. And <i>that's</i>&nbsp;what you should be conditioning on.</p><p>If you were to condition on&nbsp;<i>the Pan-Pan Times's report is "$X_3$ is equal to 1"</i>&nbsp;(call this variable $\Pi$), well, what would your inference look like? You could apply Bayes's theorem, etc. 
but simply put -- let's say the Pan-Pan Times's report is some generative process that looks at all the $X_i$, and reports one that is equal to 1 (i.e. tosses a hundred coins and reports a heads) -- it can do so in all but $2^{-100}$ of outcomes, so the only information we're given is that that particular outcome (where all $X_i$ are 0) is not the case -- the only information we're given is that the Chimpanzee Party isn't literally perfect -- so that $\mathrm{E}[\sum X_i] = 50/(1-2^{-100})$.</p><p>(OK, in this case, the decision is the same -- but for example, suppose our decision was instead "donate some sum of money proportional to the difference in $\mathrm{E}[\sum X_i]-\mathrm{E}[\sum Y_i]$" then the decision would be different.)</p><p>To instead condition on just $X_3=1$ -- rather than the full information provided -- is a <i>cognitive bias</i>. As actually implementing Bayes's theorem everywhere is expensive, the mind processes information using <i>heuristics</i>&nbsp;-- one such heuristic is that only some information is selected to be conditioned on, leading to <i>selection bias</i>. Indeed, all such "source biases" are fundamentally manifested as some form of cognitive bias -- the use of negative terms to describe the Chimpanzee Party, for example, is the exploitation of some sort of the association fallacy, repeating the word "Bonobo Party" for hours on screentime is an exploitation of the availability heuristic, etc.&nbsp;</p><p>A perfectly Bayes-rational agent -- that takes into account <i>all</i>&nbsp;information it is exposed to -- is immune to being tricked in this way. But a real agent, which uses heuristics, can be exploited. 
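</p><p>The exact-conditioning computation above can be sanity-checked by brute force for a small number of coins: with $n$ fair coins and a reporter who shows you a heads whenever one exists, $\mathrm{E}[\text{heads} \mid \text{a report exists}] = (n/2)/(1-2^{-n})$.</p>

```python
from fractions import Fraction
from itertools import product

def expected_heads_given_some_head(n):
    """E[number of heads | at least one head] over n fair coins, exactly."""
    outcomes = [bits for bits in product((0, 1), repeat=n) if any(bits)]
    # All retained outcomes are equally likely, so average the head-counts.
    return Fraction(sum(sum(bits) for bits in outcomes), len(outcomes))

for n in (2, 5, 10):
    assert expected_heads_given_some_head(n) == Fraction(n, 2) / (1 - Fraction(1, 2**n))
```

<p>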
The idea is that if such biases are systematic, then they can be predicted and avoided cheaply.&nbsp;</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-41952533522805234292022-05-07T18:32:00.000+01:002022-05-07T18:32:32.888+01:00Curvature is just the Hessian<p>If you recall some basic calculus, the gradient of a scalar function $f(x_1,\dots x_n)$ is just the generalization of the derivative:&nbsp;</p><p>$$f'(x_1,\dots x_n) =\left[\begin{array}{c}\frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{array} \right]$$</p><p>And the Hessian of a scalar function $f(x_1,\dots x_n)$ is just the generalization of the second derivative:</p><p>$$f''(x_1,\dots x_n) =\left[\begin{array}{ccc}\frac{\partial^2 f}{\partial x_1^2} &amp; \dots &amp; \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \vdots &amp; \ddots &amp; \vdots \\ \frac{\partial^2 f}{\partial x_n\partial x_1} &amp; \dots &amp; \frac{\partial^2 f}{\partial x_n^2} \end{array} \right]$$</p><p>Why is this interesting? 
Consider just $f$ quadratic -- then, just as in one dimension $f$ can be written purely in terms of its value, derivative and second derivative at $0$, here $f$ can be written purely in terms of its value, gradient and Hessian at $0$.&nbsp;</p><p>&nbsp; \begin{align} f(x,y) &amp;= c + (c_1x+c_2y) + (c_{11}x^2 + c_{12}xy + c_{22}y^2) \\ &amp;= f(0) + \left(f_x(0)x+f_y(0)y\right) + \frac12 \left(f_{xx}(0)x^2+2f_{xy}(0)xy+ f_{yy}(0)y^2\right) \\ f(\mathbf{x}) &amp;= f(0) + f'(0)\cdot \mathbf{x} + \frac12 \mathbf{x}\cdot f''(0) \mathbf{x} \end{align}</p><p>What this tells us is:</p><p></p><ul style="text-align: left;"><li>The gradient is <i>naturally thought of</i>&nbsp;as a linear form.</li><li>The Hessian is <i>naturally thought of</i> as&nbsp;a quadratic form.</li></ul><p>A what and a what?</p><p>There are two ways of thinking of a thing like $\left[\begin{array}{c}a \\ b \end{array} \right]$ -- a vector $a\mathbf{e}_1+b\mathbf{e}_2$, or a linear expression $ax_1+bx_2$, a function on $x_1,x_2$. The former is an object in the space $\mathbb{R}^n$, while the latter is a function $\mathbb{R}^n\to\mathbb{R}$ (do you see why?).&nbsp;&nbsp;</p><p>Similarly, there are two ways of thinking of a matrix $\left[\begin{array}{cc}a_{11} &amp; a_{12} \\ a_{21} &amp; a_{22} \end{array} \right]$ -- a linear transformation $\mathbb{R}^n\to\mathbb{R}^n$, or a quadratic expression $a_{11}x_1^2+(a_{12}+a_{21})x_1x_2+a_{22}x_2^2$, which is a function on $x_1,x_2$ -- equivalently, a bilinear function $\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ (do you see why?).&nbsp;</p> <div class="twn-furtherinsight">This is what duality is in linear algebra. Also in tensor notation, vectors are $v^i$ while linear forms are $v_i$; linear transformations are $A_i^j$, quadratic forms are $A_{ij}$ -- <a href="https://thewindingnumber.blogspot.com/2019/01/covectors-conjugates-and-why-gradient.html">see also</a>. Don't bother with this if you don't want to.</div> <p>So e.g. 
the gradient should naturally be thought of as a function that, given some vector as input, gives you the directional derivative in the direction of that vector.</p><p>(Make sure you understand this very clearly.)</p><p>Similarly, the Hessian should be thought of as a function that, given two vectors as input, gives the second derivative in their directions, $f_{xy}$.</p><p>(Make sure you understand this VERY clearly.)</p><p>Now suppose we wanted to talk about the <i>curvature of a surface</i>.</p><p>We know that the curvature of some curve $\phi(t)$ (parameterized by arc length) at the point $t=0$ is $\phi''(0)$. Naturally, we'd like the "curvature of a surface" to be something like a function that gives you the curvature in each direction -- that gives you the second derivative in each direction. So naturally, you'd want something like the Hessian.</p><div class="twn-beg">I'm not sure if the cross-derivative $f_{xy}$, i.e. $A(X, Y)$, has any natural geometric interpretation. Does this have anything to do with torsion? Does $A(X, Y)$ ever come of use?</div> <p>So we'd like to define some quadratic form $A$ such that $\phi'(0)\cdot A \phi'(0)$ is the curvature $\phi''(0)$. Actually, it should just be the <i>normal curvature</i>, the component of $\phi''(0)$ normal to the surface, the sort of curvature that can be attributed entirely to the surface, rather than to the curve wiggling around on the surface.</p><p>[For whomsoever it may concern, Theorem 10.4 in your notes is what computes this quadratic form $A$ as the differential of the Gauss map, and is what motivates the Gauss map in the first place. 
This is why you should start with the last chapter and read backwards.]</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com2tag:blogger.com,1999:blog-3214648607996839529.post-11951261008635764152022-04-12T22:07:00.001+01:002022-04-13T10:35:32.590+01:00Walkthrough of Galois theory<p><b>Symmetries of polynomials; the Galois group</b></p><p>You have often seen that there is a certain fundamental symmetry between the roots of a polynomial. For example, there is an obvious symmetry between $i$ and $-i$, the roots of $x^2+1=0$. But even without imaginary numbers, there is a certain symmetry between $2+\sqrt{3}$ and $2-\sqrt{3}$, the roots of $x^2-4x+1=0$.&nbsp;</p><p>What is this symmetry precisely? The idea is this: $i$ and $-i$ can only "relate" to the reals through squaring -- thus any equation in them with real coefficients remains true up to permutation of roots. In equations like $\alpha^2+1=0$, $\alpha+\beta=0$, etc. one may permute the roots $\alpha,\beta$ and maintain the equations. The only way you can break this symmetry is with an equation whose coefficients are themselves complex.&nbsp;</p><p>It's not necessary that <i>every</i>&nbsp;permutation is a symmetry of the polynomial. For example, consider the equation $x^4-10x^2+1=0$, whose roots are $\sqrt2+\sqrt3$, $\sqrt2-\sqrt3$, $-\sqrt2+\sqrt3$, $-\sqrt2-\sqrt3$. In fact, the only permutations that are symmetries of these roots are: $e$ (which conjugates nothing), $(12)(34)$ (which conjugates $\sqrt3$), $(13)(24)$ (which conjugates $\sqrt2$), $(14)(23)$ (which conjugates both $\sqrt2$ and $\sqrt3$) -- these permutations are the Klein 4-group $\mathbb{Z}_2\times \mathbb{Z}_2$.&nbsp;</p><p>As a more extreme example, consider $x^2-3x+2=0$, whose roots are $1$ and $2$. There is no non-trivial permutation of these roots that preserves e.g. the equation $\alpha-1=0$.</p><p>This group of symmetries is known as the <i>Galois group</i>.</p><p>How might we formalize this? 
Well, you may notice that none of what we described really depends on the polynomial -- just its roots. Fundamentally, we are "extending" the rational numbers with these roots -- e.g. to $\mathbb{Q}[i]$ or $\mathbb{Q}[\sqrt2]$ or whatever -- then, instead of saying "a permutation of these roots that preserves all $\mathbb{Q}$-algebraic equations in them", we can say "a field automorphism that fixes $\mathbb{Q}$" (because under such an automorphism, all arithmetical operations and $\mathbb{Q}$ coefficients remain the same); this is nice, because it is more natural to think of complex conjugation as a symmetry of the complex plane as a whole rather than just a permutation of the roots.</p><p>Thus our formal definition: the Galois group of some field extension $L/K$ is its automorphism group (the group of automorphisms of $L$ that fix $K$).&nbsp;</p><p>Why is this actually useful or relevant? It's a natural construction so of course it's useful, go perforate your head IDC.</p><p><b>Why field theory? Constructible numbers</b></p> <p>Geometrical constructability is defined as follows: given some initial geometric figure (collection of points with known distances), what geometric figure can be constructed with only a straightedge and compass?&nbsp;</p> <p>More formally, we have the following axioms -- to define constructible <i>points</i>, <i>shapes</i>&nbsp;and <i>numbers</i>. 
Where $G$ is a collection of points in the plane:</p> <ol style="text-align: left;"> <li>The points in $G$ are constructible points.</li> <li>A line through two constructible points is a constructible shape.</li> <li>A circle with its centre a constructible point and its radius a constructible line is a constructible shape.</li> <li>The intersections of two constructible shapes are constructible points.&nbsp;</li> <li>The co-ordinates of a constructible point are constructible numbers.</li> </ol> <p><i>Pedantry:</i>&nbsp;</p> <p>In fact, these are the axioms for constructability with a <i>collapsible compass</i>&nbsp;-- a compass whose legs collapse once it is taken off the paper -- so you cannot use it to mark an identical distance elsewhere. For constructability with a non-collapsible compass, Axiom 3 would instead read: A circle with its centre a constructible point and its radius <i>equal to the length of</i> a constructible line is a constructible shape.&nbsp;</p><p>Exercise: show that these definitions are equivalent -- that under the axioms for a collapsible compass, you can move a line to an arbitrary point. 
<a href="https://planetmath.org/circlewithgivencenterandgivenradius">Solution</a>.</p><p>In particular, this means that constructability only depends on the initial set of <i>distances</i>, not the precise points themselves, because any arrangement of these lengths can be shifted around&nbsp;-- we can formulate the question as "what numbers are constructible from some initial set of numbers?".</p><p><i>Exercises:&nbsp;</i>Given numbers $a$ and $b$, show that the following numbers are constructible from them:</p><p></p><ol style="text-align: left;"><li>$a+b$</li><li>$|a-b|$</li><li>$ab$</li><li>$a/b$</li><li>$\sqrt{a}$</li></ol><div><a href="https://en.wikipedia.org/wiki/Constructible_number#Equivalence_of_algebraic_and_geometric_definitions">Solution</a>.</div><div><br /></div>Numbers expressible through only these operations: $+,-,\times,/,\sqrt{\cdot}$ are called <i>algebraically constructible</i>&nbsp;from some base set of numbers.&nbsp;<p></p><p>The above exercise demonstrates that all algebraically constructible numbers are geometrically constructible.</p><p>Conversely, as all geometrically constructible lengths can be constructed using these operations (you know, Pythagoras theorem and stuff), all geometrically constructible numbers are algebraically constructible.</p><p>Thus algebraical and geometrical constructability are equivalent.&nbsp;</p><p>When no base figure is specified, the tacit assumption is that the base figure is a line segment of length 1, i.e. the base set is "1". We simply call the numbers constructible from this to be the <i>constructible numbers</i>.&nbsp;</p> <div class="twn-furtherinsight">Other forms of constructability besides straightedge-and-compass are known -- such as conic constructability, solid constructability, neusis constructability, origami constructability. We want a theory that handles not only straightedge-and-compass (i.e. 
square roots), but also these more general forms.</div> <p>This immediately demonstrates why:</p><p></p><ul style="text-align: left;"><li>It is impossible to double the cube.</li><li>It is impossible to trisect an arbitrary angle (because that is equivalent to constructing a cube root).</li><li>It is impossible to square the circle (because $\pi$ cannot be constructed with radicals).&nbsp;</li></ul><div>These are not <i>proofs</i>&nbsp;per se ... it still remains to be shown that you literally can't construct a cube root with some finite nesting of square roots, and so on.&nbsp;</div><p></p><p>A set closed under $+,-,\times,/$ is a field. So the constructible numbers form some type of field. What kind of field, and what do we do with it? Yeah, yeah, something.</p><p><b>Computing the Galois group</b></p><p>So how do we actually figure out the Galois group of a thing?</p><p>First of all, we said that it's really the roots that matter, not the polynomial. Does that mean that we can adjoin any elements to a field and calculate the Galois group of that extension?</p><p>Not really. There's nothing interesting to be said about the automorphism group of $\mathbb{Q}[\sqrt[4]{2}]$, for example. $x^4-2=0$ has imaginary roots and stuff, and we want to be able to permute them. The automorphism group of this field -- with only one root adjoined -- says nothing of the properties of $x^4-2=0$.&nbsp;</p><p>The fundamental theorem of Galois theory -- which we haven't yet stated -- will not apply to such a shitty field extension.</p><p>No -- the field extensions we are interested in are those generated by adjoining <i>all</i>&nbsp;the roots of a polynomial. This is a "normal extension", "Galois extension", "splitting field", whatever (yeah, yeah, a Galois extension must be <i>normal and separable</i>&nbsp;but do I look like the kind of guy who'd bother with things that aren't separable? 
-- what even is separable, your FACE is separable, I'll split it in two).</p><p>We can "intuit" out the Galois group of things.</p><p>The splitting field of $x^4-1$ is $\mathbb{Q}[i]$; then the image of $i$ determines the Galois group, which is thus $S_2$.</p><p>The Galois group of $x^4+1$ is $K_4$. Prove it.</p><p>The splitting field of $x^4-2$ is $\mathbb{Q}[\sqrt[4]{2},i]$; then the images of $\sqrt[4]{2}$ (for which there are 4 options) and $i$ (for which there are 2 options) are sufficient to determine the Galois group, which is thus $D_4$.</p><p>The Galois group of $\mathbb{Q}[\sqrt{5},\sqrt{7}]$ is $K_4$. Prove it.</p><p>The Galois group of $x^3-1$ is $S_2$. Prove it.</p><p>The Galois group of $x^3+1$ is $S_2$. Prove it.</p><p>The Galois group of $x^3-2$ is $S_3$. Prove it.</p><p>The Galois group of $x^5-1$ is $C_4$. Prove it.</p><p>The Galois group of $x^5+1$ is $C_4$. Prove it.</p><p>The Galois group of $x^5-2$ is ... okay, just see <a href="https://math.stackexchange.com/questions/204552/computing-the-galois-group-of-polynomials-xn-a-in-mathbbqx">this Math.SE question</a>. It's some semidirect product, and you should be able to see why by now.</p><p>If you worked through the examples above, you should have a good intuitive grasp of the <i>Fundamental theorem of Galois theory</i>&nbsp;by now. 
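</p><p>As a quick numeric sanity check of the $x^4-10x^2+1$ example from the start of this post: its roots really are $\pm\sqrt2\pm\sqrt3$, and the elementary symmetric functions in them are rational -- which is exactly why permuting the roots preserves all rational equations.</p>

```python
# The four roots +-sqrt(2)+-sqrt(3) from the Klein four-group example.
roots = [s2 * 2**0.5 + s3 * 3**0.5 for s2 in (1, -1) for s3 in (1, -1)]

# Expand prod(x - r): multiplying a monic polynomial by (x - r) maps the
# coefficient list c (highest degree first) to (c + [0]) - r*([0] + c).
coeffs = [1.0]
for r in roots:
    coeffs = [a - r * b for a, b in zip(coeffs + [0.0], [0.0] + coeffs)]

# The expanded polynomial is x^4 - 10x^2 + 1, up to floating-point error.
assert all(abs(c - e) < 1e-9 for c, e in zip(coeffs, [1, 0, -10, 0, 1]))
```

<p>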
We've been constructing the Galois group of a field extension by determining some number of "basis elements" whose images suffice to determine the automorphism; extensions by these basis elements (intermediate field extensions) correspond to specific subgroups (intermediate subgroups of the subgroup lattice).</p><p><b>Solvable groups</b></p><p>add.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com1tag:blogger.com,1999:blog-3214648607996839529.post-42344149673124100862021-09-19T17:07:00.003+01:002021-09-20T03:41:37.247+01:00Causation, transfer and symmetry<p>In this article we attempt to build intuition for <i>disentangled representations</i>, and the closely related ideas of transfer learning, equivariant learning and causation.</p><p></p><ul style="text-align: left;"><li>Transfer learning and causation</li><li>Transfer learning and hierarchical models; counterfactual reasoning</li><li>Out-of-distribution generalization and Independent Causal Mechanisms</li><li>Disentangled representations and symmetry</li></ul><p></p><p><b>Transfer learning and causation</b></p><p>Transfer learning is set in the context where two models are not independent. An example we've previously discussed is <a href="https://thewindingnumber.blogspot.com/2021/08/causation-and-semi-supervised-learning.html">semi-supervised learning</a>, which analyzes the correlation between $P(X)$ and $P(Y\mid X)$ (viewed as random variables from some distribution $\phi_{p_{x,y}}$ on possible distributions). As with the semi-supervised learning example, one way to think about "our belief about a distribution" is to consider that distribution as parameterized by/conditioned on some third variable $Z$ and talk about our beliefs on $Z$.&nbsp;</p><p>More generally, we may consider correlations between arbitrary such distributions. 
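</p><p>As a concrete instance of this kind of structure: in a chain $Z \to X \to Y$, conditioning on $X$ renders $Y$ independent of $Z$. A quick exact check by enumeration (binary toy distributions of my own choosing):</p>

```python
from fractions import Fraction
from itertools import product

# A chain Z -> X -> Y over binary variables: each node copies its
# parent with some probability (the numbers are arbitrary choices).
pZ = {z: Fraction(1, 2) for z in (0, 1)}
pX_given_Z = {z: {z: Fraction(3, 4), 1 - z: Fraction(1, 4)} for z in (0, 1)}
pY_given_X = {x: {x: Fraction(2, 3), 1 - x: Fraction(1, 3)} for x in (0, 1)}

joint = {(z, x, y): pZ[z] * pX_given_Z[z][x] * pY_given_X[x][y]
         for z, x, y in product((0, 1), repeat=3)}

def cond(y, x, z=None):
    """P(Y=y | X=x [, Z=z]) by direct summation over the joint."""
    num = sum(p for (zz, xx, yy), p in joint.items()
              if xx == x and yy == y and (z is None or zz == z))
    den = sum(p for (zz, xx, yy), p in joint.items()
              if xx == x and (z is None or zz == z))
    return num / den

# Conditioning on X blocks the only path between Y and Z -- exactly.
for x, y in product((0, 1), repeat=2):
    assert cond(y, x, z=0) == cond(y, x, z=1) == cond(y, x)
```

<p>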
For $P(Y\mid X)$ to be independent of some $Z$ is an equivalent way of saying that $Y\perp\!\!\!\perp Z\mid X$ -- in a faithful causal model, this means that all causal paths between $Y$ and $Z$ are blocked by conditioning on $X$. Thus the following diagrams represent the only possible structures where $Z$ carries information about $P(Y\mid X)$; transfer learning is only possible if there exists some $Z$ that plays this role for both distributions:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-t31hc1NCnlA/YT4QTGurXgI/AAAAAAAAOrk/9p9dC7jHSyUq00wzRSNESFyVX4yPUVEPACLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="544" data-original-width="500" height="400" src="https://lh3.googleusercontent.com/-t31hc1NCnlA/YT4QTGurXgI/AAAAAAAAOrk/9p9dC7jHSyUq00wzRSNESFyVX4yPUVEPACLcBGAsYHQ/w368-h400/image.png" width="368" /></a></div><div><br /></div><div><b>Transfer learning and hierarchical models; counterfactual reasoning</b><p></p><p>The classic example of a hierarchical model goes something like this: you're trying to model the score of students based on a couple of parameters, and the students can be divided into a bunch of schools. While the school category is itself a variable with some value and can thus be considered a parameter, our treatment is often different in application -- we may e.g. be faced with an entirely new school we haven't heard of before, or with a school with few data points.
Then our predictions about such a school should take into account what we know from the other schools.</p><p>A very simple hierarchical model might look like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-6fnHFG-zr64/YUEGFsBVuRI/AAAAAAAAOsM/Rpev3ysEa3MmAU5iUjE8b88aFOdZNZP1gCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="141" data-original-width="100" height="101" src="https://lh3.googleusercontent.com/-6fnHFG-zr64/YUEGFsBVuRI/AAAAAAAAOsM/Rpev3ysEa3MmAU5iUjE8b88aFOdZNZP1gCLcBGAsYHQ/w72-h101/image.png" width="72" /></a></div>-- i.e. instead of simply describing score as having a particular distribution $f_\mu(x)$, we recognize that $\mu$ is a random variable in itself and consider $f(x\mid\mu)$. The main idea of importance in a hierarchical model, however, has to do with the inference of $\mu$ for a particular sample. The full model for something like this looks like:<p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-TjeRpirNrR8/YUEidr06o7I/AAAAAAAAOsU/yPaozQ0k7KM5DFOScziQWwa6npBrsK1iACLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="142" data-original-width="748" height="112" src="https://lh3.googleusercontent.com/-TjeRpirNrR8/YUEidr06o7I/AAAAAAAAOsU/yPaozQ0k7KM5DFOScziQWwa6npBrsK1iACLcBGAsYHQ/w587-h112/image.png" width="587" /></a></div><br />And we would know correlations between the distributions $P(\mu_1)$ and $P(\mu_2)$, representable by another random variable $\theta$ -- essentially our $\mu$s are sampled from some unknown distribution, and the identity of this distribution (e.g. 
a parameter) is $\theta$.<div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-vJT9RLsaN0U/YUExDzoOkJI/AAAAAAAAOss/0qq0pg2TSk4QyPo-1_yUQ44YPxv_x4HsgCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="231" data-original-width="748" height="175" src="https://lh3.googleusercontent.com/-vJT9RLsaN0U/YUExDzoOkJI/AAAAAAAAOss/0qq0pg2TSk4QyPo-1_yUQ44YPxv_x4HsgCLcBGAsYHQ/w566-h175/image.png" width="566" /></a></div><br />Another place that this idea -- of "extending" samples as random variables in a causal diagram -- is of importance is in the context of counterfactuals. A classic counterfactual question looks like this:</div><blockquote>X is elected president, and the GDP becomes 1. What would the GDP of the country have been if Y were elected president instead?</blockquote><div><p></p><p>To understand the precise meaning of such a question (so as to formalize it as a causal diagram) requires us to think about the purpose of the answer to the question. For example, we might be interested in a future election, or the use of President X's vs President Y's policies in some other time or country, etc. 
In general, we are interested in an alternate <i>world</i>&nbsp;whose mechanisms are correlated with this one.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-1O8NZDGTXwY/YUFQHFyv7oI/AAAAAAAAOs0/ZySWogviJvEcOBeTAcwB0SV-SyWhxjl-QCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="159" data-original-width="506" height="101" src="https://lh3.googleusercontent.com/-1O8NZDGTXwY/YUFQHFyv7oI/AAAAAAAAOs0/ZySWogviJvEcOBeTAcwB0SV-SyWhxjl-QCLcBGAsYHQ/image.png" width="320" /></a></div><p>A particular observation of President-1 and GDP-1 gives us information on Mechanism, which gives us information to make inferences in the counterfactual world.</p><div>Relevant references: [<a href="https://www.cs.cmu.edu/~liuy/atl.pdf">1</a>] [<a href="https://arxiv.org/abs/2005.08697">2</a>]</div><p><b>Out-of-distribution generalization and Independent Causal Mechanisms</b></p><div>Ordinary machine learning involves some random variable $(X, Y)$ which we sample IID and attempt to learn the underlying joint distribution of -- or some functional of the joint distribution, like $P(Y\mid X)$. Instead, we might be interested in sampling from several random variables $(X_i, Y_i)$ with different distributions and exploit this&nbsp;<i>full</i>&nbsp;information to infer desired distributions. The problem of transferring information to some new distribution $(X_{n+1}, Y_{n+1})$ from which no data points have been sampled is called&nbsp;<i>domain generalization</i>, while combining this prior with information from the sample of the specific task $(X_n, Y_n)$ is called&nbsp;<i>multi-task learning</i>.</div><div><br /></div><div>This general setting is exactly the same as that in the first section, of course.
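<p>A minimal sketch of the multi-task idea (an illustrative toy of mine, assuming Gaussian tasks; the shared hyperparameter plays the role of $Z$ from the first section): each task's mean is drawn from a common hyperprior, and a low-data task is estimated by shrinking its sample mean toward the pooled mean -- here using the grand mean as an empirical-Bayes plug-in for the hyperprior mean:</p>

```python
# Toy multi-task setting: each task i has a mean mu_i drawn from a shared
# hyperprior N(theta, tau^2); we observe a few points per task and estimate
# a task's mean by shrinking its sample mean toward the pooled grand mean.
import random

random.seed(0)
theta, tau, sigma = 5.0, 1.0, 2.0   # hyperprior mean/std, observation noise
mus = [random.gauss(theta, tau) for _ in range(20)]
data = {i: [random.gauss(mu, sigma) for _ in range(3)] for i, mu in enumerate(mus)}

n_total = sum(len(xs) for xs in data.values())
grand_mean = sum(x for xs in data.values() for x in xs) / n_total

def shrunk_estimate(xs):
    # Gaussian posterior mean for mu_i: a precision-weighted average of the
    # task's own sample mean and the pooled mean.
    n = len(xs)
    task_mean = sum(xs) / n
    w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
    return w * task_mean + (1 - w) * grand_mean

est = shrunk_estimate(data[0])
lo, hi = sorted([sum(data[0]) / len(data[0]), grand_mean])
print(lo <= est <= hi)   # True: the estimate is pulled toward the pooled mean
```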
However, having prior information on the precise correlation between the distributions of the mechanisms $P(Y_1\mid X_1)$ and $P(Y_2\mid X_2)$ is practically quite difficult; we are often interested in the case where we plainly know that these mechanisms are&nbsp;<i>identical</i>. You may immediately see the connection to&nbsp;<a href="https://thewindingnumber.blogspot.com/2021/08/causation-and-semi-supervised-learning.html">independent causal mechanisms</a>.&nbsp;Indeed, the precise connection has to do with the idea of extending samples as random variables, as in the previous section. See [<a href="https://arxiv.org/abs/1507.05333">1</a>]&nbsp;[<a href="https://www.pnas.org/content/pnas/113/27/7345.full.pdf">2</a>]&nbsp;for details on precise algorithms for such tasks.</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-3Jo1piPbUmE/YUdeyom5SuI/AAAAAAAAOt0/qaAMyth_31gA_73MhKFmz3CiQwMb1u7CwCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="281" data-original-width="374" height="240" src="https://lh3.googleusercontent.com/-3Jo1piPbUmE/YUdeyom5SuI/AAAAAAAAOt0/qaAMyth_31gA_73MhKFmz3CiQwMb1u7CwCLcBGAsYHQ/image.png" width="319" /></a></div><br /><br /></div><div><br /></div><b>Disentangled representations and symmetry</b><br /><p></p><p>[article to be extended]</p><p>Relevant references: [<a href="https://arxiv.org/abs/1812.02230">1</a>] [<a href="https://arxiv.org/abs/1909.13739">2</a>] [<a href="https://arxiv.org/abs/1909.13789">3</a>] [<a href="https://openreview.net/forum?id=q8qLAbQBupm">4a</a>] [<a href="http://ai.stanford.edu/blog/neural-mechanics/">4b</a>].</p></div></div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-74351588121996916252021-08-30T14:38:00.006+01:002021-09-05T13:34:34.458+01:00What is energy?
What is physics?<p>In your high-school physics classes, you may have studied the dynamics of various physical systems, but you may have also heard that physics is the "general science" that is universally applicable. One may wonder if physics can be formulated as some general framework that can handle any type of system without more specific assumptions about such a system, and what interesting conclusions can be drawn about physics in such a general setting.</p><p>Fundamentally, we seek to make predictions about some <i>observables</i>. You may imagine that all the observable information about a system is given by some abstract "phase space" of "state vectors" such that each observable is some function of the state vector. For example, if the phase space is parameterized by position and momentum, then the position and momentum observables would be projection functions, and various other observables can be written as compositions of functions with these projections, e.g. $E=\frac{p^2}{2m}+U(x,p)$.</p><div class="twn-furtherinsight">The question of <i>why</i>&nbsp;position and momentum are sufficient to specify our phase space in so many real-life situations is a rather advanced one -- here's a paper that explains that this general "first-derivative sufficiency" in physics arises as a special case of something to do with Fisher information [<a href="https://journals.aps.org/pre/abstract/10.1103/PhysRevE.52.2274">1</a>] -- but I don't understand Fisher information.</div><p><b>The Lie theory of dynamics</b></p><p>One may say that we are particularly interested in the time-evolution of such observables, $dA/dt$. If we can find a differential equation for this, then we can find the value of $A$ given some initial condition.
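<p>To make the "observables as functions on phase space" picture concrete, here is a small sketch (my own, with a unit-mass harmonic oscillator assumed as the example system): states are points $(x,p)$, observables are functions of the state, and evaluating the energy observable along the time evolution shows it staying essentially constant:</p>

```python
# Phase-space states are (x, p) pairs and observables are functions of the
# state. Evolve a unit-mass harmonic oscillator (U = x^2/2, so F = -x) with
# a leapfrog step and watch the energy observable stay constant.
def energy(state):
    x, p = state
    return p * p / 2 + x * x / 2

def leapfrog(state, dt):
    x, p = state
    p -= dt / 2 * x          # half kick: dp/dt = F = -x
    x += dt * p              # drift: dx/dt = p/m = p
    p -= dt / 2 * x          # half kick
    return (x, p)

state = (1.0, 0.0)
e0 = energy(state)
for _ in range(10_000):
    state = leapfrog(state, 0.001)
print(abs(energy(state) - e0) < 1e-5)   # True: E is conserved along the flow
```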
But more generally, we may not have an <i>initial</i>&nbsp;condition as such -- maybe the problem we're trying to solve isn't even a differential equation, and we just want to speak more abstractly about the dynamics of the system.</p><p>This should all be screaming "Lie groups" to you.&nbsp;</p><p>One may imagine that time evolution is the flow under a certain vector field.&nbsp;</p><p>More precisely, suppose the state of a system is given by some$\psi=(x, p)$, which represents the position and momentum of a particle at time$t$. Then the "position at time$t$" is an observable given by the function$X(\psi)=x$, the projection onto the$x$co-ordinate -- but we may also think about the observable "position at time$t+\Delta t$", which is some function$\Phi^{\Delta t}X$that projects onto a different co-ordinate system.</p><p>This co-ordinate transformation on the phase space is a type of&nbsp;<i>canonical transformation</i> -- we will soon define this generally, but it is important to realize that motion/time-evolution is a type of canonical transformation.&nbsp;</p><p>You may imagine that such canonical transformations form a Lie group, and as is standard the Lie algebra to this Lie group consists of vector fields (flows) on the phase space.&nbsp;</p><p><b>Noether and symplectic geometry</b></p><p>The question, then, is what sort of transformations count as "canonical transformations", or equivalently, what sort of vector fields are we interested in. What kind of "co-ordinate transformations" are acceptable?</p><p>We want to be as general as possible, so we do not wish to restrict what sort of paths are predicted on the phase space -- rather, we want to restrict the relationships between the predicted paths, i.e. how a distribution evolves under time evolution.
We won't go into details of why this is true (because I do not fully understand it yet -- apparently it is called <i>Liouville's theorem</i>&nbsp;and has to do with "conservation of information", which has to do with probabilities inferred from some symmetries, see <a href="https://en.wikipedia.org/wiki/A_priori_probability">a priori probability</a>), but we expect the vector fields to be solenoidal, i.e. have zero divergence. Analogous to how irrotational vector fields are precisely those that are gradients of functions, solenoidal vector fields are precisely those that are the <i>symplectic gradients</i>&nbsp;of functions (see <a href="https://www.reddit.com/r/mathematics/comments/1z791c/puzzle_of_the_moment_hamiltonian_vector_fields/">reddit</a> for a proof):</p><p>$$\nabla\cdot \vec{v}=0\leftrightarrow \exists g, \vec{v}=\frac{\partial g}{\partial y}\hat{x}-\frac{\partial g}{\partial x}\hat{y}$$</p><p>This "symplectic gradient" always points along the contours of$g$-- thus, these flows conserve&nbsp;the quantity$g$. More generally, the change in some observable$f$over the flow that is the symplectic gradient of$g$is given by $\vec{v}\cdot\nabla f$, and is called the <i>Poisson bracket</i>&nbsp;of the two observables:</p><p>$$\{f,g\}=\frac{\partial g}{\partial y}\frac{\partial f}{\partial x}-\frac{\partial g}{\partial x}\frac{\partial f}{\partial y}$$</p><p>It should be intuitively clear that this Poisson bracket corresponds to the Lie bracket of their symplectic gradients -- if the vector fields commute, they must be constant under the flow of the other, etc.</p><p><b>The Hamiltonian and conservation of energy</b></p><p>The evolution of a classical Newtonian state is given by&nbsp;</p><p>$$\dot{x}=p/m$$</p><p>$$\dot{p}=F$$</p><p>We wish to find the observable$H$such that this evolution is represented by the symplectic gradient of$H$, i.e.
so that:</p><p>$$p/m=\{x, H\} = \frac{\partial H}{\partial p}\frac{\partial x}{\partial x}-\frac{\partial H}{\partial x}\frac{\partial x}{\partial p}=\frac{\partial H}{\partial p}$$</p><p>$$F = \{p, H\}=-\frac{\partial H}{\partial x}$$</p><p>Integrating, the quantity we want is</p><p>$$H=\frac{1}{2m}p^2-\int F dx$$</p><p>This quantity is called the "energy" of the system -- often we call$p^2/2m$the "kinetic energy" and$U=-\int F dx$the "potential energy". In particular, it is immediately clear that energy is conserved. In general, the physics of a system can be completely specified by some "Hamiltonian" $H$, this rather generalized definition of energy, whose symplectic gradient represents time translations.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-75796200764890014372021-08-16T11:46:00.006+01:002021-08-17T00:54:21.152+01:00Causation and semi-supervised learning<p><b>Table of contents</b></p><p></p><ul style="text-align: left;"><li>Semi-supervised learning</li><li>Semi-supervised learning and causation</li><li>Independent noise and causation</li><li>Hierarchical models and transfer learning</li></ul><p></p><p><b>Semi-supervised learning</b></p><p>We have <a href="https://thewindingnumber.blogspot.com/2021/05/machine-learning-and-information-theory.html">previously</a> discussed that machine learning can be considered an information-theoretic problem of finding an optimal representation. Most of it had to do with unsupervised&nbsp;learning, while supervised learning requires a notion of optimality with respect to some target variable (formalized via the information bottleneck). In particular, we're interested in whether the unsupervised representations are <i>correlated</i> with the supervised ones.</p><p>More specifically, suppose you're trying to classify points into some number of categories (i.e.
learn a (random) function $f:X\to Y$), and have some number of labeled points (values of$f(x)$for some$x$). Then when -- if ever -- will a further sampling from$X$(i.e. without labels) help you achieve greater classification accuracy? Does$P(X)$provide <i>information</i> on $P(Y\mid X)$?</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-iupnpRakQfY/YRhA7-ZvTTI/AAAAAAAAOo8/2YCIuNYNeSI23AXG9-badUmGahSH0Z9_QCLcBGAsYHQ/image.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="337" data-original-width="291" height="240" src="https://lh3.googleusercontent.com/-iupnpRakQfY/YRhA7-ZvTTI/AAAAAAAAOo8/2YCIuNYNeSI23AXG9-badUmGahSH0Z9_QCLcBGAsYHQ/image.png" width="207" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://en.wikipedia.org/wiki/Semi-supervised_learning#/media/File:Example_of_unlabeled_data_in_semisupervised_learning.png">Image source</a></td></tr></tbody></table><p>So for example, if you believe that clustered $x$ tend to have similar labels, then$P(X)$gives information on$P(Y\mid X)$. Formally, we have a belief distribution$\phi_{p_{x,y}}$on the possible distributions$p_{x,y}$that generate$X, Y$-- so the above hypothesis represents a high prior probability for distributions&nbsp;$p_{x,y}$that predict a high probability for nearby$x$to produce nearby$y$. The random distributions$p_x$and$p_{y\mid x}$can be extracted as function(al)s of the random distribution$p_{x,y}$, and so it makes sense to talk about their independence -- and if they are independent, semi-supervised learning doesn't work.</p><p><b>Semi-supervised learning and causation</b></p><p>One may think about the physical mechanisms underlying such data. Suppose that$P(Y\mid X)$does give us information about$P(X)$-- e.g.
we believe that if$Y$tends to cluster with$X$then$X$has a correspondingly clustered process -- then that tells us that either$X$and$Y$have a common cause, or that$Y$causes$X$.</p><p>Why? Let's be a bit more concrete about this and suppose that$X$and$Y$represent altitude and temperature (of some weather stations) respectively. And suppose we believe that observing a distribution like this for $X$:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-mbpFVuGhIqU/YRkr0vz2zLI/AAAAAAAAOpU/yVgB7na_BI0ishYi7zW2rG6KN_1ZRmzYgCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="263" data-original-width="477" height="176" src="https://lh3.googleusercontent.com/-mbpFVuGhIqU/YRkr0vz2zLI/AAAAAAAAOpU/yVgB7na_BI0ishYi7zW2rG6KN_1ZRmzYgCLcBGAsYHQ/image.png" width="320" /></a></div><br /><p>made it likely for the relationship between $X$ and $Y$ to follow the same clustering, i.e. to look like this:</p><div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-oasoMPL3d9g/YRksnwg-uOI/AAAAAAAAOpc/kyKL_u08YVY7JtZpZeyrXcLx66EAtdEvgCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="263" data-original-width="477" height="176" src="https://lh3.googleusercontent.com/-oasoMPL3d9g/YRksnwg-uOI/AAAAAAAAOpc/kyKL_u08YVY7JtZpZeyrXcLx66EAtdEvgCLcBGAsYHQ/image.png" width="320" /></a></div><br /><p>Would that then be consistent with a model in which altitude causes temperature?</p></div><div><p></p><p>No. In fact, one way to think about "our belief about a distribution" is to consider that distribution as parameterized by (conditioned on) some third variable $Z$ and think about our beliefs on $Z$. For$P(X)$and$P(Y\mid X)$to be "independent" means that there is no$Z$such that$P(X)$and$P(Y\mid X)$both depend on$Z$, i.e.
for all variables$Z$, either$X$and$Z$are independent, or$Y$and$Z$are independent conditioned on$X$:</p><p>$$\forall Z, (X\perp\!\!\!\perp Z)\lor (Y\perp\!\!\!\perp Z\mid X)$$</p><p><i>&lt;=&gt; For any$Z$correlated with$X$,$Z$affects$Y$only through$X$.</i></p><p><i>&lt;=&gt; If there is an unblocked causal path between$X$and some$Z$, then conditioning on$X$blocks all causal paths between$Y$and$Z$.</i></p><p><i>&lt;=&gt; If there are unblocked causal paths between some$Z$and$X$,$Y$, then the paths between$Z$and$Y$contain$X$&nbsp;</i><i>as either a mediator or a common cause.&nbsp;</i></p><p><i>&lt;=&gt;&nbsp;</i><i>If there are unblocked causal paths between some$Z$and$X$,$Y$, then the paths between$Z$and$Y$contain$X$, and$X$is a cause of$Y$.</i></p><p><i>&lt;=&gt;$X$causes$Y$and they have no common causes.</i></p><p>This is called the "Principle of Independent Causal Mechanisms" -- if semi-supervised learning doesn't add anything, then there is a defined causal direction$X\to Y$.</p><p>(Our treatment of the Principle of Independent Causal Mechanisms is similar to that by Sokolovska &amp; Wuillemin [<a href="https://www.mdpi.com/1099-4300/23/8/928">1</a>]. We have not covered the algorithmic formulation of the principle -- worth reading in this regard: Janzing et al on what ICM translates to in physics [<a href="https://arxiv.org/abs/1512.02057">2</a>], Besserve et al on group-theoretic formulation [<a href="https://arxiv.org/abs/1705.02212">3</a>]. Furthermore, semi-supervised learning can be considered a special case of transfer learning, relevant papers in this area/on the theory of transfer learning include: [<a href="https://www.cs.cmu.edu/~liuy/atl.pdf">4</a>] [<a href="https://arxiv.org/abs/2005.08697">5</a>] [<a href="https://mitpress.mit.edu/books/elements-causal-inference">6</a>&nbsp;-- p.
177])</p><p><b>Independent noise and causation</b></p><p>Although we have formulated the notion of$P(X)$and$P(Y\mid X)$being independent in terms of our beliefs about what those distributions might be, such beliefs can also be updated by data. Indeed the characterization above, in which the variable$Z$is used for parameterization, is one such formulation.&nbsp;</p><p><i>If there are unblocked causal paths between some$Z$and$X$,$Y$, then the paths between$Z$and$Y$contain$X$&nbsp;</i><i>as either a mediator or a common cause.&nbsp;</i></p><p>The contrapositive of this reads:</p><p><i>If$Y$depends on$Z$not through$X$, then$X$and$Z$must be independent.</i></p><p><i>&lt;=&gt;$X$is independent of the noise term in $Y$.</i></p><p>Indeed, this is one of the most common characterizations used to establish causation when a full picture of the causal model and its specific variables is not known. The idea is that the noise term captures the relevant stuff from the unknown variables in the true causal model.</p><p>(See Goudet et al [<a href="https://arxiv.org/pdf/1709.05321.pdf">4</a>] for the standard treatment of causal inference from noise.)</p></div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-54240187416322057602021-08-12T01:42:00.018+01:002021-08-17T17:13:15.681+01:00Further notes on causation<p>Some further topics on causal models, as a follow-up to the previous&nbsp;<a href="https://thewindingnumber.blogspot.com/2021/02/statistical-models-causal-models.html">introductory article</a>.</p><p><b>Table of contents</b></p><p></p><ul style="text-align: left;"><li>Causal models and truth tables</li><li>The do operator; ceteris paribus</li><li>d-separation, RCT and the do operator</li><li>Ancestral graphs</li><li>Time</li><li>Newcomb's problem and the limits of CDT</li></ul><p><b>Causal models and truth tables</b></p><div><p>When first introduced to causal modeling (or for that
matter any multivariate modeling that involves looking at conditional distributions), you may have wondered why one couldn't just directly apply the joint probability distribution to model your data, rather than talk about conditional independences, etc. Well, the standard answer has to do with computational problems with dealing with joint distributions with way too many free parameters, but some insight can be gained from considering the logical analog of the statistical problem.</p><p>The logical analog of a joint distribution is a truth table. Well, formally these propositions should be considered quantified over some variable, or over some models with incomplete theories, etc., so that the propositions aren't just always true or always false. Then a truth table represents every possible state the propositions could have in any model.</p><p>$$\begin{array}{*{20}{c}}{\begin{array}{*{20}{c}}P&amp;Q&amp;R\\\hline0&amp;0&amp;0\\0&amp;1&amp;1\\1&amp;0&amp;1\\1&amp;1&amp;0\end{array}}&amp;{\begin{array}{*{20}{c}}P&amp;Q&amp;R\\\hline0&amp;0&amp;0\\1&amp;1&amp;1\end{array}}\end{array}$$</p><p>Above are two possible truth tables for some propositions $P,Q,R$. Causally, they can be represented by the following diagrams respectively.</p><p>$$\begin{array}{*{20}{c}}{\begin{array}{*{20}{c}}P&amp; {} &amp;Q\\{}&amp; \searrow &amp; \downarrow \\{}&amp;{}&amp;R\end{array}}&amp;{P \to Q \to R}\end{array}$$</p><p>These arrows do&nbsp;not&nbsp;directly represent implication (although they do up to negation), and obviously the precise conditional implications need to be specified to extract the joint distribution, but the causal diagrams tell us, as before, the pattern of conditional independences. Learning the precise distribution given a causal model (i.e.
given a causal graph, without knowledge of the precise implications/"conditional distributions" represented by each link) amounts to checking smaller truth tables for each vertex in the causal model.</p><p>So for a system with$i$propositions, learning the truth table directly would take$2^i$passes, but given a causal model would take$\sum_v 2^{i_v}\sim i2^{i_v}$passes (where$i_v$is the number of parents$v$has), which is far more computationally scalable.</p><p>Evaluating models is also more efficient in causal form, although this is less important -- the example models above, in truth table form, respectively read formally as:</p><p>$$\left( {\neg P \wedge \neg Q \wedge \neg R} \right) \vee \left( {\neg P \wedge Q \wedge R} \right) \vee \left( {P \wedge \neg Q \wedge R} \right) \vee \left( {P \wedge Q \wedge \neg R} \right)$$</p><p>$$(P \wedge Q \wedge R) \vee (\neg P \wedge \neg Q \wedge \neg R)$$</p><p>And in causal form, read as (in a way analogous to Markov factorization):</p><p>$$\left( {P \vee \neg P} \right) \wedge \left( {Q \vee \neg Q} \right) \wedge \left( {\neg P \wedge \neg Q \to \neg R} \right) \wedge \left( {\neg P \wedge Q \to R} \right) \wedge \left( {P \wedge \neg Q \to R} \right) \wedge \left( {P \wedge Q \to \neg R} \right)$$</p><p>$$\left( {P \to Q} \right) \wedge \left( {\neg P \to \neg Q} \right) \wedge \left( {Q \to R} \right) \wedge \left( {\neg Q \to \neg R} \right)$$</p><p>While the causal expressions may seem more convoluted, consider the problem of checking whether a given state$(P,Q,R)$is valid according to our truth table (testing a model in the logical setting amounts to checking whether given data points fit the model). This is a <i>decidable problem</i>&nbsp;-- there is a simple algorithm to reduce any logical expression with given values of$(P,Q,R)$to a boolean value. 
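<p>The $2^i$-versus-$\sum_v 2^{i_v}$ comparison can be illustrated with a quick count (a sketch of mine; it counts parent configurations, not the exact equality-check averages):</p>

```python
# Configurations to examine when learning a joint truth table directly (2^i)
# versus vertex-by-vertex given a causal graph (sum of 2^(number of parents)).
def direct_cost(n_vars):
    return 2 ** n_vars

def causal_cost(parents):
    # parents maps each variable to the list of its parents in the causal graph
    return sum(2 ** len(ps) for ps in parents.values())

# The P -> Q -> R chain model from the text.
chain = {"P": [], "Q": ["P"], "R": ["Q"]}
print(direct_cost(3), causal_cost(chain))        # 8 5

# The scaling is what matters: a 20-variable chain.
big_chain = {i: ([] if i == 0 else [i - 1]) for i in range(20)}
print(direct_cost(20), causal_cost(big_chain))   # 1048576 39
```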
Which expression is computationally more feasible to evaluate?</p><p>In the first model, the truth table takes an average of 7.5 equality checks, while the causal model takes an average of 10.5 equality checks. In the second model, the truth table takes an average of 5.625 equality checks, while the causal model takes an average of 6 equality checks. But look at how these values scale -- in general for some causal model, the causal model takes the following number of equality checks on average:</p><p>$$\sum_{v\in V}{\sum_{w\in 2^{I_v}}{2-\frac{1}{2^{i_v+o_w}}}}\approx \sum_{v}{2^{i_v+1}}$$</p><p>Where$v\in V$are the vertices (variables) in the model,$w\in 2^{I_v}$are the$2^{i_v}$possible configurations of$v$'s$i_v$causal variables$I_v$,$o_w$is the number of possibilities left open even after all causal variables are specified (i.e. the noise term). While the truth table takes the following number of equality checks on average:</p><p>$$2^i o-\frac{o(o-1)}{2}$$</p><p>Where$i$is the total number of variables and$o$is the number of allowed states (i.e. the total unexplained variance). For $i_v \ll i &lt; o$, the causal model is far more efficient to test than the joint distribution. Perhaps some connection can be made to divide-and-conquer or factorization algorithms -- or maybe not, since divide-and-conquer is recursive while this isn't.</p><p><b>The do operator; ceteris paribus</b></p><p>Causation is the idea that crashing cars won't cause people to drink more. It captures this within the framework of statistical modeling by observing that "accident rate" and "go crash a car" are two different variables, and when you "go crash a car" in the real world, the precise mechanism by which you do so does not involve increasing people's drinking ability. "Go crash a car" is an instrumental variable, and is typically denoted using Pearl's do operator as do(crash).
In the below diagram, releasing storks is "do(stork)".</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-0Gps0XgUoWU/YQ7mbY_2MdI/AAAAAAAAOmk/5O7xcfQ430QbHr87ls5bRuenjCnEDGCkgCLcBGAsYHQ/s550/instrumental.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="261" data-original-width="550" height="152" src="https://1.bp.blogspot.com/-0Gps0XgUoWU/YQ7mbY_2MdI/AAAAAAAAOmk/5O7xcfQ430QbHr87ls5bRuenjCnEDGCkgCLcBGAsYHQ/s320/instrumental.png" width="320" /></a></div><p>The do operator captures the basic intuition, or motivation, behind causality. We're interested in finding out what increasing tax rates does to the GDP -- it doesn't matter for our purposes that richer countries tend to have higher taxes for political reasons, we'll change taxes without all that politics -- <i>conditioning</i>&nbsp;on the politics.</p><p>What the do operator does is allow us to formalize the notion of <i>ceteris paribus</i>. On first glance, the concept of ceteris paribus seems inherently flawed -- suppose you have four variables: momentum, mass, velocity and crater size; then saying "an increase in mass results in an increase in crater size, ceteris paribus" is ill-defined -- you can't hold velocity <i>and</i>&nbsp;momentum constant while increasing mass.</p><p>No, the notion of ceteris paribus is defined with respect to a causal model -- thus, both the following models are ok:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-r8H5NGICmAk/YQ_LiULtoAI/AAAAAAAAOns/NUpiNkq5vMknsLTRFHgqd5Cyugpl-xMlwCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="293" data-original-width="701" height="218" src="https://lh3.googleusercontent.com/-r8H5NGICmAk/YQ_LiULtoAI/AAAAAAAAOns/NUpiNkq5vMknsLTRFHgqd5Cyugpl-xMlwCLcBGAsYHQ/w522-h218/image.png" width="522" /></a></div><br /><p>(Our use of the do operator is 
perhaps non-standard, but equivalent to the standard one. We define $\mathrm{do}(X=x)$ as a random variable with conditional probabilities$P(X=x'\mid\mathrm{do}(X)=x)=\delta_{xx'}$, and$P(X=x'\mid \mathrm{do}(X)=\bot)$is as before. This becomes equivalent to erasing causal arrows from parents when$\mathrm{do}(X)$is active. If you let $\mathrm{do}(X)$ be distributed in some way, this captures the notion of a <i>randomized control trial</i>.)</p><p><b>d-separation, RCT and the do operator</b></p><p>The idea of "do" interventions is given formal grounding by the idea of d-separation, which is the basic calculus of causation upon which all experiments to establish causation are based.</p><p>Fundamentally, d-separation describes when a causal pathway allows for the "flow of information" (a concept that is also relevant to belief propagation stuff) between random variables. The rules of d-separation follow from the Markov and faithfulness assumptions:</p><p></p><ul style="text-align: left;"><li>In a chain junction $A\to B\to C$, conditioning on$B$blocks the pathway.</li><li>In a confounding (fork) junction$A\leftarrow B\to C$, conditioning on&nbsp;$B$blocks the pathway.</li><li>In a colliding junction$A\to B\leftarrow C$, the opposite is true: the chain starts out blocked, but conditioning on$B$unblocks it due to the "explain-away effect". For example, a sprinkler may operate independently of the rain, but if you're told that the grass is wet, then they become negatively correlated.</li><li>Conditioning on the descendant of a variable is also like conditioning on the variable itself. Conditioning on the descendant of a mediator or confounding variable partially blocks the pathway; conditioning on the descendant of a collider partially unblocks the pathway.
This is "partial", in the sense that parent nodes take priority over child nodes in deciding how the pathway is blocked.</li></ul><div>So by adding $\mathrm{do}(X)$ to a causal model, $X$ becomes a collider, and so $\mathrm{do}(X)$ can be adjusted independently of $X$'s other parents, thus blocking the flow of information through the parents' causal pathways -- this is also precisely what a randomized control trial does, in which $\mathrm{do}(X)$ is given some distribution.</div><p></p></div><div><p><b>Ancestral graphs</b></p></div><div><p>Here's a somewhat harder example, involving common causes: suppose the true model of smoking and cancer looks like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-EuCUu6QLoL4/YQ79Kqw5kTI/AAAAAAAAOm0/TYkl5geypDY7rQ-67SmcqlJldMIWvKyeACLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="112" data-original-width="301" height="119" src="https://lh3.googleusercontent.com/-EuCUu6QLoL4/YQ79Kqw5kTI/AAAAAAAAOm0/TYkl5geypDY7rQ-67SmcqlJldMIWvKyeACLcBGAsYHQ/image.png" width="320" /></a></div>And suppose that risk-averseness is unobserved/not part of our universe. Can we still recover the fact that there is no causal link between smoking and cancer (according to this model) without this common cause being known? In other words, consider the following scenario: a correlation is observed between smoking and cancer, but a randomized experiment shows no increase in cancer from do(Smoking), and no increase in smoking from do(Cancer) either (so you can't just point an arrow from cancer to smoking either). 
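This scenario is easy to simulate. A minimal sketch (the variable names and all probabilities are made up for illustration): a latent "risk" trait drives both smoking and cancer, so observational data show a strong association, while intervening on smoking reveals no causal effect:

```python
import random

def sample(do_smoking=None, rng=random):
    # Hypothetical model: an unobserved "risk" trait drives both
    # smoking and cancer; there is no causal arrow between the two.
    risk = rng.random() < 0.5
    smoking = risk if do_smoking is None else do_smoking
    cancer = risk
    return smoking, cancer

random.seed(0)
observational = [sample() for _ in range(100_000)]
rct = [sample(do_smoking=True) for _ in range(100_000)]

# Conditioning: P(cancer | smoking) -- inflated by the common cause.
smokers = [cancer for smoking, cancer in observational if smoking]
p_cond = sum(smokers) / len(smokers)

# Intervening: P(cancer | do(smoking)) -- just the base rate of "risk".
p_do = sum(cancer for _, cancer in rct) / len(rct)
print(p_cond, round(p_do, 2))
```

Conditioning gives 1.0 (in this toy model the smokers are exactly the at-risk group), while intervening gives about 0.5 -- the gap between $P(\text{cancer}\mid\text{smoking})$ and $P(\text{cancer}\mid\mathrm{do}(\text{smoking}))$.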
The <i>true</i>&nbsp;model looks something like this:<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-29J0K5B8gK4/YQ7-g6uKs2I/AAAAAAAAOnM/NMCPANhNcfcSff9bCukCR_dy3TrfhwavQCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="112" data-original-width="546" height="97" src="https://lh3.googleusercontent.com/-29J0K5B8gK4/YQ7-g6uKs2I/AAAAAAAAOnM/NMCPANhNcfcSff9bCukCR_dy3TrfhwavQCLcBGAsYHQ/w470-h97/image.png" width="470" /></a></div><p>To simplify, let's consider a logical model -- the truth table looks like this:</p> <center> <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-0lax{text-align:left;vertical-align:top} .tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top} .tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top} </style> <table class="tg"> <thead> <tr> <th class="tg-fymr"><span style="font-weight: bold;">Risk</span></th> <th class="tg-fymr">do(smoking)</th> <th class="tg-fymr">smoking</th> <th class="tg-fymr">do(cancer)</th> <th class="tg-fymr">cancer</th> </tr> </thead> <tbody> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> </tr> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> </tr> <tr> <td 
class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> </tr> <tr> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> </tr> </tbody> </table> </center> <p>And if risk is unobserved:</p> <center> <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top} .tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top} </style> <table class="tg"> <thead> <tr> <th class="tg-fymr">do(smoking)</th> <th class="tg-fymr">smoking</th> <th class="tg-fymr">do(cancer)</th> <th class="tg-fymr">cancer</th> </tr> </thead> <tbody> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> </tr> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> <td 
class="tg-0pky">1</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">0</td> </tr> <tr> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">0</td> <td class="tg-0pky">1</td> </tr> <tr> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> <td class="tg-0pky">1</td> </tr> </tbody> </table> </center> <p>Well, here's the thing: smoking's parent cannot be do(smoking) alone, because smoking and cancer are not independent once conditioned on do(smoking) -- when do(smoking) is 0, you can never have smoking without cancer. Likewise if you condition on do(smoking) and do(cancer) both: when both are 0, smoking and cancer cannot take on different values. So cancer must itself be a parent to smoking -- and by a similar argument, smoking must itself be a parent to cancer.&nbsp;</p><p>Obviously, this is not possible in a DAG, so if we assume that the universe is causal, this data <i>implies</i>&nbsp;the existence of an unobserved common cause. Such data -- where some vertices of a DAG are marginalized over -- can be represented with an <i>ancestral graph </i>[<a href="https://arxiv.org/pdf/0908.3605.pdf">1</a>]. The question, then, is how the full DAG can actually be <i>learned</i> from the data, i.e. what the causal structure is, what its distribution is, etc. 
Some relevant papers: [<a href="https://www.degruyter.com/document/doi/10.1515/jci-2017-0020/html">1</a>][<a href="https://iopscience.iop.org/article/10.1088/1367-2630/16/11/113043/pdf">2</a>][<a href="https://arxiv.org/pdf/1209.2978.pdf">3</a>][<a href="https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full">4</a>]. Fundamentally, however, this seems closely related to the problem of causal discovery and representation learning.</p><p><b>Time</b></p><p>One may imagine taking the limit to a continuum of random variables, in which a DAG approaches a partial order.&nbsp;</p><p>It is notable that this is a partial order, and not a total order. This allows for something like relativity, in which time ordering cannot be established for spacelike separated events. Indeed, the causal model of spacetime in special relativity looks like this, and light cones are simply sets of paths in the DAG:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-umenBDa-9CA/YROpNo9dYjI/AAAAAAAAOoI/nlCBjWJjTf8T9Ys8Ys4K5GiwneAQYGShACLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="412" data-original-width="540" height="240" src="https://lh3.googleusercontent.com/-umenBDa-9CA/YROpNo9dYjI/AAAAAAAAOoI/nlCBjWJjTf8T9Ys8Ys4K5GiwneAQYGShACLcBGAsYHQ/image.png" width="315" /></a></div><br /><p></p><p><b>Newcomb's problem and the limits of CDT</b></p><blockquote><p>An AI offers you two boxes, and gives you the option of choosing either Box A alone, or Box A + Box B. It says it has modelled your mind, and set up the boxes based on its predictions of your behavior: Box B will contain £1 no matter what, but Box A will contain £1000 only if you one-box. 
What do you choose?</p></blockquote><p>Causation is fundamentally a decision-theoretic idea, and Newcomb's problem reveals a fundamental problem with causal decision theory: causal decision theory pretends that the agent is completely independent of the environment -- from physics. In reality, the agent is a "part" of physics -- the $\mathrm{do}(X)$ variable itself has arrows going into it --&nbsp;</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-FYLdMcaVZzQ/YROx4m8aHtI/AAAAAAAAOoQ/BULYn6Wa9mYHGrro9KZUvrkodXtiEw3PwCLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="324" data-original-width="430" height="240" src="https://lh3.googleusercontent.com/-FYLdMcaVZzQ/YROx4m8aHtI/AAAAAAAAOoQ/BULYn6Wa9mYHGrro9KZUvrkodXtiEw3PwCLcBGAsYHQ/image.png" width="319" /></a></div><p>I believe that causal decision theory should be considered a <i>special case</i>&nbsp;of evidential decision theory in which certain choices can be made "freely", in which true "do" operators <i>exist</i>.&nbsp;</p><p></p><p>A similar problem arises with time travel as in&nbsp;<a href="https://lesswrong.com/hpmor">Harry Potter and the Methods of Rationality</a>. Here too there are arrows pointing into $\mathrm{do}(X)$, arising from the causal graph no longer being a DAG.&nbsp;</p><p></p><p></p><p></p><p></p><blockquote style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">There was another pause, and then Madam Bones’s voice said, "I have information which I learned four hours into the future, Albus. Do you still want it?" 
Albus paused— (weighing, Minerva knew, the possibility that he might want to go back more than two hours from this instant; for you couldn’t send information further back in time than six hours, not through any chain of Time-Turners) —and finally said, “Yes, please.”</blockquote><p>(Of course, "I have information from four hours into the future" is itself information, as is any such interaction, and indeed if time loops of constant length exist at every point in time then arbitrarily large time loops can be created simply by adding them.)</p> </div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-36989883504982512682021-05-26T13:33:00.004+01:002021-07-21T13:41:56.444+01:00The mathematical definition of an image<p>We have previously <a href="https://thewindingnumber.blogspot.com/2020/07/overview-of-neural-network-architectures.html">noted</a> that convolutional layers are effective in image processing applications. In this article we go a bit further and claim that the <i>definition</i>&nbsp;of an image is that convolutions are effective on it.</p><p>The motivation for the convolution operation -- as well as for the use of the term "convolution" to describe it -- is the idea that <i>an image is best understood in its frequency representation</i>. This is the basis of various simple compression algorithms like JPEG, as well as for general image processing. 
The Discrete Fourier (or equivalently cosine) transform allows an image to be written as a linear combination of sinusoids, like so:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-JJkx0ETCa9I/YK4qqBLiBOI/AAAAAAAAOjg/SRvzOBxDmZAyyrdPGFR7zKUDur6qCUoBACLcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="684" data-original-width="684" height="240" src="https://lh3.googleusercontent.com/-JJkx0ETCa9I/YK4qqBLiBOI/AAAAAAAAOjg/SRvzOBxDmZAyyrdPGFR7zKUDur6qCUoBACLcBGAsYHQ/image.png" width="240" /></a></div><p>The idea is then that the properties of the image can be inferred from this frequency representation through "simpler" functions than from the position representation. This is the <i>defining characteristic</i>&nbsp;of an image (or rather of the distribution of images) -- if we instead dealt with some other data type, which did not exhibit spatial regularities like an image does, this would no longer be true.</p><p>Multiplying the frequency representation by an enveloping function to enhance particular frequencies corresponds to a <i>convolution</i>&nbsp;of the image with the Fourier transform of said envelope. Any arbitrary such envelope can be constructed from a sequence of such convolution operations.</p><p>(Play with frequency representations of images in <a href="https://colab.research.google.com/drive/1gzVvEIKOFfkc0qP_wAJAJlwOIbauR3yx?usp=sharing">this Colab notebook</a>.)</p><hr /><p>Another type of data whose distribution we know something about is time series.</p><p>The key thing about time series is that the data representation should handle sequences of varying lengths. This is not a tall order -- in general, entropy encodings are perfectly capable of handling data of arbitrary lengths if we know the distribution of the data. 
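(Back to images for a moment: the correspondence between multiplying frequency representations and convolving position representations is easy to check numerically. Here is a 1-D sketch with an arbitrary toy signal and kernel -- the 2-D image case works the same way, axis by axis:)

```python
import cmath

def dft(xs):
    # naive O(n^2) discrete Fourier transform, for clarity
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(xs)) for k in range(n)]

def idft(Xs):
    n = len(Xs)
    return [sum(X * cmath.exp(2j * cmath.pi * k * i / n)
                for k, X in enumerate(Xs)) / n for i in range(n)]

def circular_conv(xs, ks):
    n = len(xs)
    return [sum(xs[(i - j) % n] * ks[j] for j in range(n)) for i in range(n)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 1.0]   # a toy "row of pixels"
kernel = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]  # a smoothing kernel

# Convolving in position space...
direct = circular_conv(signal, kernel)
# ...equals multiplying the frequency representations pointwise:
via_freq = idft([X * K for X, K in zip(dft(signal), dft(kernel))])
err = max(abs(a - b) for a, b in zip(direct, via_freq))
print(err < 1e-9)  # → True
```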
To model the distribution of such sequences means figuring out the probability $p(x_{i+1}\mid x_1,\dots x_i)$ -- or equivalently $p(h_{i+1}\mid h_i)$ where the $h_i$ are the data representations. So the natural way of representing sequences of arbitrary length is as states of a Markov chain, with each new item in the sequence providing information to determine the next state.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-57653927501919191332021-05-26T00:12:00.006+01:002021-05-26T10:00:56.016+01:00Machine learning and information theory<p>A simple and abstract way to look at machine learning is as a method to find optimal data compression algorithms (and indeed, data compression can be used as a measure of general intelligence). Fundamentally, the following two goals are equivalent: (1) approximating the distribution of some data; (2) finding the optimal representation (compression algorithm) for said data -- the optimal representation is just the entropy encoding given the distribution.</p><p>Note that all of this is said about unsupervised learning algorithms -- generative modelling, dimensionality reduction/clustering, etc. -- but such structure in the data can also be uncovered during supervised learning (e.g. decision trees can emulate hierarchical clustering, feed-forward neural networks embed data into its feature representations), when the desired labels depend strongly on the clustering.</p><p>In supervised learning, you are no longer seeking to simply find an optimal compression for the data, but a compression that "agrees" with the labelling -- you are no longer seeking a representation that best predicts the original data, but a representation that best predicts the labels. 
In other words, in unsupervised learning you are trying to find a representation of fixed length $T(X)$ that maximizes the expected information that a representation provides on the data, the mutual entropy $I(X;T(X))$; in supervised learning you are trying to find a representation $T(X)$ that maximizes the expected information that a representation provides on the labels, $I(y;T(X))$. This is how information gain appears in decision trees.&nbsp;</p><p>(See <a href="https://arxiv.org/abs/1905.07822">Minimal Achievable Sufficient Statistic Learning</a>. See also&nbsp;<a href="https://en.wikipedia.org/wiki/Information_bottleneck_method">information bottleneck method</a>, a generalization of minimal sufficient statistics to allow for representations that lose some information about the labels.)</p><p>There is yet a third place that entropy functions appear in machine learning, and that has to do with <i>proper scoring rules</i>&nbsp;for making probabilistic predictions.&nbsp;</p><p>Suppose you were betting on an election, and you believed there was a 70% chance of a candidate winning -- under a classic betting scheme, you should put all your money on the candidate you believe is more likely to win, to maximize your expected earnings. Thus the fraction of money you bet on a candidate (which is 100%) does not reflect your predicted belief of them winning the election (which is 70%). In order to have your bets reflect actual probabilities, one needs to use a <i>proper scoring rule</i>&nbsp;to score your prediction. The only additive proper scoring rule is $\log(q)$, where $q$ is the probability you assign to the outcome -- so where $p$ is the actual probability distribution of the outcomes, the expected score is $\sum p \log q$, the negative of the cross-entropy. 
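A quick numerical sketch of this (the 70% belief and the grid of reports are arbitrary choices): the log score is maximized by reporting your true belief, while the classic betting scheme pushes you all-in:

```python
import math

p = 0.7  # your actual belief that the candidate wins
grid = [i / 1000 for i in range(1, 1000)]  # probabilities you might report

def expected_log_score(q):
    # expected log score when the outcome really has probability p
    return p * math.log(q) + (1 - p) * math.log(1 - q)

def expected_naive_bet(q):
    # stake a fraction q on the candidate and 1 - q against
    return p * q + (1 - p) * (1 - q)

best_log = max(grid, key=expected_log_score)
best_naive = max(grid, key=expected_naive_bet)
print(best_log, best_naive)  # → 0.7 0.999
```

The log rule is honest -- its optimum sits exactly at $q = p$ -- whereas the naive bet is optimized at the edge of the grid, regardless of how confident you actually are.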
The same applies to machine learning algorithms making probabilistic predictions for the labels.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-87551032180416195332021-05-23T16:11:00.003+01:002021-05-23T16:14:22.484+01:00Information theory of multiple random variables<p>In the previous articles [<a href="https://thewindingnumber.blogspot.com/2020/07/information-and-bits.html">1</a>] [<a href="https://thewindingnumber.blogspot.com/2020/07/lossless-compression-and-invertible.html">2</a>] we defined the information of an observation as the negative logarithm of its probability -- and the entropy of a random variable as the expected amount of information gained from observing it (and thus as a measure of how little information the prior carries). Naturally, in the multivariate case it is trivial to define a "joint information" that is the negative logarithm of the joint distribution, and we have the joint entropy:</p><p>$$H(\mathbf{X})=-\sum_{\mathbf{x}}{p(\mathbf{x})\log p(\mathbf{x})}$$</p><p>Because the full joint distribution is at least as informative as the marginal distributions taken separately, we have the property:</p><p>$$H(X,Y)\le H(X)+H(Y)$$</p><p>We often informally explain notions of independence and relationships between random variables in terms of information -- with information theory, we can now formalize these descriptions. 
The Venn diagram below shows the additive relationships between various entropies of multiple random variables:</p><p></p><div class="separator" style="clear: both; text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg/1116px-Entropy-mutual-information-relative-entropy-relation-diagram.svg.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="563" data-original-width="800" height="232" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg/1116px-Entropy-mutual-information-relative-entropy-relation-diagram.svg.png" width="330" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">(<a href="https://en.wikipedia.org/wiki/File:Entropy-mutual-information-relative-entropy-relation-diagram.svg">source</a>)</td></tr></tbody></table></div>$H(X)$ and $H(Y)$ are represented by the respective circles, and $H(X,Y)$ by the combined area of the two circles; the mutual entropy and the conditional entropies are as indicated. The mutual entropy (more commonly <i>mutual information</i>) is the expectation of the mutual information (more commonly <i>pointwise mutual information</i>):<br /><p></p><p>$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$</p><p>Even though the pointwise mutual information may be positive or negative (the probability of $y$ may go up or down depending on the observed $x$), its expectation is always non-negative, in a way analogous to conservation of expected evidence. 
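A small sketch of this with a made-up joint distribution over two binary variables -- individual pmi terms can be negative, but their expectation is positive:

```python
import math

# a made-up joint distribution over two binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: joint[x, 0] + joint[x, 1] for x in (0, 1)}  # marginal of X
py = {y: joint[0, y] + joint[1, y] for y in (0, 1)}  # marginal of Y

def pmi(x, y):
    # pointwise mutual information, in bits
    return math.log2(joint[x, y] / (px[x] * py[y]))

pmis = {xy: pmi(*xy) for xy in joint}
# pmi is negative for the rarer, "surprising" outcomes...
# ...but its expectation -- the mutual entropy -- is non-negative:
mi = sum(joint[xy] * pmis[xy] for xy in joint)
print(pmis[(0, 1)] < 0, mi > 0)  # → True True
```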
These ideas can be generalized to beyond two variables:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/VennInfo3Var.svg/902px-VennInfo3Var.svg.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="797" data-original-width="800" height="277" src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/VennInfo3Var.svg/902px-VennInfo3Var.svg.png" width="278" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">(<a href="https://en.wikipedia.org/wiki/File:VennInfo3Var.svg">source</a>)</td></tr></tbody></table><p>The mutual entropy represents the reduction in the number of bits necessary to encode two correlated variables together, as opposed to separately. This is a special case of the <i>entropy gain</i>&nbsp;(or "Kullback-Leibler divergence") of two probability distributions $p$ and $q$: it is the expected number of extra bits used when expressing a $p(x)$-distributed random variable with a $q(x)$-entropy encoding. 
The mutual entropy is the entropy gain $KL(f_{X,Y}(x,y) | f_X(x)f_Y(y))$.</p><p>\begin{align*}KL(p(X) | q(X)) &amp;= \sum -p(x) \log {q(x)} - \sum -p(x) \log {p(x)} \\ &amp;= \sum p(x) \log \frac{p(x)}{q(x)}\end{align*}</p><p>The first term of this expression (the number of bits required to express the random variable in the incorrect encoding) is also known as the <i>cross-entropy</i>.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-43768836693136280692021-04-26T13:50:00.001+01:002021-04-26T13:50:13.607+01:00The three theorems of complete statistics<p><b>The Lehmann-Scheffe theorem</b></p><p>In the <a href="https://thewindingnumber.blogspot.com/2021/04/sufficient-statistics.html">last article</a>, we discussed the Rao-Blackwell theorem, which allows us to improve any estimator by averaging on a sufficient statistic (i.e. transforming it into a function of a sufficient statistic). Furthermore, the maximum improvement is attained by averaging on a minimal sufficient statistic, and the resulting estimator cannot be improved by further Rao-Blackwellization -- it is the best estimator "of that class".</p><p>One may wonder if averaging an unbiased estimator on a particular MSS could lead to the best estimator <i>overall</i>.&nbsp;</p><p>Let's call such an MSS a <i>complete sufficient statistic</i>, or CSS. What we really require, then, is the uniqueness of the Rao-Blackwellization on a CSS -- then no other estimator could possibly be better, because you can't be better than your Rao-Blackwellization.&nbsp;</p><p>We can then "reverse mathematics" out the definition of a complete statistic: $T$ is called a <b>complete statistic</b>&nbsp;if there is at most one unbiased estimator of $\theta$ that can be written as a function of $T$. 
Equivalently, the only unbiased estimator of 0 that can be written as a function of $T$ is 0 itself.</p><p>It is in fact sufficient for the estimator to be unique up to disagreement on a set of measure zero -- i.e. the probability of two estimators disagreeing is zero.</p><p>The statement that an unbiased estimator based on a CSS is the best unbiased estimator overall is known as the <i>Lehmann-Scheffe theorem</i>.</p> <hr /> <p><b>Basu's theorem</b></p><p>Often the data provides information on stuff other than the parameter of interest. We cannot quite ask for a sufficient statistic that provides information <i>only</i>&nbsp;on $\theta$ (by the law of multiple explanations, this is impossible unless the statistic determines $\theta$ with certainty), but there are various ways we can ask that irrelevant information be "minimal".&nbsp;</p><p>One is in the definition of an MSS, as in the last article. The other has to do with something known as ancillary statistics.</p><p>An ancillary statistic is a statistic that provides <i>no</i>&nbsp;information on $\theta$ -- e.g. $X-\bar{X}$ does not provide information on $\mu$ in a normal model. So we might want to ask for a sufficient statistic such that no function of it is ancillary. Well, this is implied by completeness but is still not as strong as completeness -- it is possible to have statistics whose distribution depends on $\theta$ but whose mean is constant.</p><p>In fact, a stronger result, known as <i>Basu's theorem</i>, is implied by completeness: not only is no function of a CSS ancillary, but a CSS is <i>independent</i>&nbsp;of any ancillary statistic $V$. This follows from the fact that $P(V|T)-P(V)$ is itself a function of $T$, and thus its mean being 0 implies its triviality.</p> <hr /> <p><b>Bahadur's theorem</b></p><p>In the above theorems, CSSs seem like everything we wanted from MSSs. 
In fact, every CSS is an MSS (provided an MSS exists at all) -- this is <i>Bahadur's theorem</i>.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-6382670436645743052021-04-20T17:42:00.009+01:002021-04-21T09:40:52.278+01:00Sufficient statistics and the Rao-Blackwell theorem<p>The motivation for the definition of a <b>sufficient statistic</b> is very obvious from a Bayesian standpoint -- we call a statistic $T(X)$ sufficient for $\theta$ if it carries all the information that $X$ has to offer on $\theta$, i.e. if</p><p>$$P(\theta|T\land X)=P(\theta|T)$$</p><p>(Or equivalently, since $T$ is a function of $X$, $P(\theta|T)=P(\theta|X)$), i.e. $\theta$ and $X$ are conditionally independent knowing $T$. An equivalent, more common formulation is that $P(X|T)$ is independent of $\theta$:</p><p>$$P(X|T\land\theta)=P(X|T)$$</p><p>This captures the intuitive idea that $\theta$ is causally linked to $X$ only through $T$ -- from this perspective, the two formulations are precisely symmetric.&nbsp;</p><p>One can check whether a statistic is sufficient simply by looking at the distribution of $X$. We can write $P(X|\theta)=P(X|T(X))P(T(X)|\theta)$ for any statistic $T(X)$ -- thus for $T(X)$ to be <i>sufficient</i>, i.e. for $P(X|T(X))$ to be independent of $\theta$, we only require that such a factorization exists:</p><p>$$f(X|\theta)=h(X)g(T(X),\theta)$$</p><p>I.e. the distribution function can be written as the product of a function that doesn't depend on $\theta$ and a function that depends on $\theta$ and $T(X)$ but not on $X$ directly. 
This is called the <b>factorization criterion</b>.</p><p>The interpretation of this theorem is that when doing Bayesian inference, two sets of data yielding the same value of $T(X)$ should yield the same inference about $\theta$ -- for this to be possible, the likelihoods' dependence on $\theta$ should only be in conjunction with $T(X)$, so that the direct dependence of the posterior on $X$ cancels out through the normalization factor.</p><p>We are further interested in sufficient statistics that are "minimal" in the sense that they don't carry any more superfluous information than other sufficient statistics -- "necessary and sufficient statistics", if you will.&nbsp;</p> <div class="twn-furtherinsight">Can we define "necessary and sufficient statistics" as statistics that give <em>no</em> information other than that about $\theta$, i.e. such that knowing $T$ doesn't give us any additional information on $X$ than $\theta$ does? Consider if basic examples of minimal sufficient statistics have this property, and provide a simple counter-example.</div> <p>We thus define a <b>minimal sufficient statistic</b>&nbsp;as a sufficient statistic that can be written as a function of any other sufficient statistic. So, for example, the sample mean may be a minimal sufficient statistic for the mean, but the entire sample itself is not, because it cannot be obtained from the sample mean alone.</p><p>A particular MSS of importance is the equivalence class of likelihood-ratios independent of $\theta$: one can partition the space of $X$s by the equivalence relation of $P(X|\theta)/P(Y|\theta)$ being independent of $\theta$ -- the resulting quotient function (that sends $X$ to its equivalence class) is then a minimal sufficient statistic for $\theta$.</p><p>As all MSSs can be recovered from each other, this is often used as a characterization of MSSs: $T$ is an MSS iff:&nbsp;</p><p>$$T(X)=T(Y)\iff \frac{P(X|\theta)}{P(Y|\theta)}\text{ independent of }\theta$$</p><p>E.g. 
if two samples have the same mean, then the difference in their likelihoods comes from a factor other than the distribution mean.</p><hr /><b><br /></b><div><b>Rao-Blackwell theorem</b><p>If an estimator $\hat{\theta}$ of $\theta$ takes different values for different $X$ with the same value of the sufficient statistic $T(X)$, then it stands to reason that this estimator is sub-optimal, in that irrelevant features of the data contribute to the variance in $\hat{\theta}$. So for any sample $X$, we might want to take the average of $\hat{\theta}$ among all samples with the same value of $T(X)$, which would give a new estimator $\tilde{\theta}$.</p><p>This is known as the <b>Rao-Blackwell theorem</b>. Specifically, the Rao-Blackwell theorem says that the new estimator $\tilde{\theta}=E(\hat{\theta}|T(X))$ has the same bias as, and variance less than or equal to, $\hat{\theta}$ (with equality if $\hat{\theta}$ was already a function of $T$). The proof follows straightforwardly from the conditional breakdown of variance (ANOVA):</p><p>$$\mathrm{Var}(\hat{\theta})=\mathrm{E}(\mathrm{Var}(\hat{\theta}|T))+\mathrm{Var}(\mathrm{E}(\hat{\theta}|T))$$</p><div>Continuing this line of reasoning, we would like to average over the <i>minimal</i>&nbsp;sufficient statistic to eliminate as much superfluous information as possible. 
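A sketch of Rao-Blackwellization in action (the Bernoulli model and all parameters are arbitrary choices): start from the crude unbiased estimator $\hat{\theta}=X_1$, and average it conditional on the sufficient statistic $T=\sum X_i$, which gives the sample mean:

```python
import random

random.seed(0)
p, n, trials = 0.3, 10, 20_000

naive, blackwellized = [], []
for _ in range(trials):
    xs = [random.random() < p for _ in range(n)]
    naive.append(float(xs[0]))         # unbiased, but ignores most of the data
    blackwellized.append(sum(xs) / n)  # E[X1 | sum(xs)] is the sample mean

def var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

# Both estimators are unbiased for p; Rao-Blackwellization drops the
# variance from about p(1-p) = 0.21 to about p(1-p)/n = 0.021.
print(var(naive), var(blackwellized))
```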
Indeed, it is easy to show that if $T_1$ and $T_2=h(T_1)$ are both sufficient statistics, then averaging over $T_2$ gives a lower-variance estimator than averaging over $T_1$.</div></div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-29729048781220290752021-03-31T10:43:00.001+01:002021-03-31T10:43:32.229+01:00Dutch book argument for probability theory<p><b>aka "how to profit off people's irrationalities".</b></p><p>The philosophical question of what it means to be rational is answered by the popular slogan "Rationality is systematized winning" -- so, in particular, a rational belief is one that is expected to deliver you a profit.</p><p>This is closely linked to the "practical" definition of probability -- assigning probability $p$ to observation $X$ means valuing at $p$ a unit wager on $X$ (an asset with payoff 1 if $X$ happens, and 0 otherwise) -- you are willing to buy the wager for any price less than $p$, and sell it for any price higher.&nbsp;</p><p>(We use money to make things simple, but more fundamentally we have a cardinal utility function whose expectation we seek to maximize -- so this is a valid abstract definition and does not depend on a specific currency or monetary system.)</p><p>However, if you accept that probability is definitionally subjective (even if updated by evidence via Bayes's theorem, your priors are still subjective), then assigning a probability $p$ to a particular event cannot itself be rational or irrational. You can't quite calculate whether someone who holds that belief will make a profit or loss, because doing so would require assigning an underlying "objective" probability to $X$.&nbsp;</p><p>You can only say that a person is irrational if you can prove that their beliefs <i>always</i>&nbsp;lead to them losing, regardless of the results of the observation. 
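A concrete sketch of such always-losing beliefs (with made-up numbers): suppose you believe $P(X_1)=0.5$ and $P(X_2)=0.1$ even though $X_1\subseteq X_2$. Trading two unit wagers against you yields me a profit in every possible outcome:

```python
# Your (incoherent) beliefs: P(X1) = 0.5, P(X2) = 0.1, despite X1 ⊆ X2.
# I sell you a unit wager on X1 at 0.5 and buy one on X2 from you at 0.1.
p1, p2 = 0.5, 0.1

def my_profit(x1_happens, x2_happens):
    assert not (x1_happens and not x2_happens), "X1 implies X2"
    profit = p1 - p2                    # premiums collected minus premiums paid
    profit -= 1 if x1_happens else 0    # I pay out on the X1 wager I sold
    profit += 1 if x2_happens else 0    # I collect on the X2 wager I bought
    return profit

# All possible outcomes, given X1 ⊆ X2:
outcomes = [(False, False), (False, True), (True, True)]
profits = [my_profit(*o) for o in outcomes]
print(profits)  # positive in every case -- a Dutch book
```

Whatever happens, I net at least 0.4: your probability assignments alone guarantee your loss, which is what makes them irrational.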
Such beliefs are said to be <i>incoherent</i>, a notion that is identical to the financial concept of arbitrage. Irrationality in this sense is about an <i>inconsistency</i>&nbsp;in beliefs about different, related observations.</p><p>Consider observations $X_1,\dots X_n$ (in the language of probability theory, these are elements of our sigma algebra). Suppose you assigned to them the probabilities $p_1,\dots p_n$.&nbsp;</p><p>Your beliefs are irrational iff I can come up with a list of unit wagers (called a <b>Dutch book</b>) to trade with you that always deliver me a profit regardless of the outcome of each observation.</p><p>For example, if $X_1\subseteq X_2$ and you assign probabilities $p_1=0.5,p_2=0.1$, then I can buy from you a unit wager on $X_2$ and sell you a unit wager on $X_1$, and always make a profit.</p><p>The axioms of probability theory can then be shown to be equivalent to forbidding a Dutch book. If the values of all bets are between 0 and 1, the value of the bet on the entire sample space is 1 and bets are additive, then it becomes impossible to construct a Dutch book. Conversely, a violation of any of these axioms allows a bet on something equivalent to the sample space to be traded for less than or greater than 1, allowing for a Dutch book.</p><p>Similar Dutch book arguments can be extended to conditionalization and Bayes's theorem.</p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-40662873893435190042021-03-30T10:13:00.003+01:002021-03-30T10:13:58.264+01:00You can't hack Bayes's theorem<p><b>Aka the difference between rationalization and rationality.</b></p><p>Eliezer Yudkowsky calls this the dilemma of the clever arguer: A propagandist (clever arguer) is hired to sell you a box that may or may not contain a diamond -- he tells you that the box has a blue stamp on it (which you know occurs more on boxes containing diamonds). 
If you could handle the box yourself, you could rationally evaluate all the characteristics of the box and compile their influences on your probability estimate. Are you then forced, for each argument the clever arguer provides, to helplessly update your probabilities as the clever arguer wishes, even though you know the propagandist has omitted the evidence he doesn't want you to know of?</p><p>There are various equivalent formulations of the problem including:</p><p></p><ul style="text-align: left;"><li><a href="https://thewindingnumber.blogspot.com/2020/04/i-dont-believe-p-hacking-is-problem.html">p-hacking</a></li><li>Filtered evidence: an experimenter flips a coin 10 times and tells you that the 4th, 5th and 7th tosses came up heads, without telling you anything about the other tosses.&nbsp;</li></ul><div>The key matter to realize is that the information you get is not simply <i>the 4th, 5th and 7th tosses came up heads</i>, but instead: <i>the propagandist tells me that the 4th, 5th and 7th tosses came up heads</i>. The statistical process we are studying is no longer the "natural" (IID Bernoulli) process we're used to, but a <i>different process</i>, which depends on the inner mechanism used by the propagandist. 
For example:</div><div><ul style="text-align: left;"><li>If the propagandist always tells you only the 4th, 5th and 7th tosses, you update your beliefs from this evidence as normal.</li><li>If the propagandist only tells you the coin tosses that came up heads, then you now know that the other seven tosses came up tails, and you update your beliefs accordingly.</li><li>If the propagandist chooses any three heads that came up to tell you about, then the probability of 4, 5 and 7 specifically&nbsp;being chosen is only slightly greater with a biased coin (you can calculate this, but the key point is that the process we're observing is not about which coins come up heads, but which coins are chosen by the propagandist).&nbsp;</li></ul><div>We could have a probability distribution on the various possible mechanisms underlying the propagandist, and then the information we get from our observations is actually <i>split</i>&nbsp;between information on the propagandist's mechanism and information on the coin's mechanism.</div></div><div><br /></div><div>Here's a model that makes it easier to perform inference on: we have $n$ features $X_i$ distributed as $N(\mu, 1)$ with prior $N(0,1)$ on $\mu$, and the propagandist chooses to tell us the value $x$ of the $X_j$ that has the greatest value among all the $X_i$s. The posterior density on $\mu$ can then be calculated:</div><div>$\frac{\Phi_\mu(x)^{n-1}\phi_\mu(x)\phi_0(\mu)}{\int_\mu \Phi_\mu(x)^{n-1}\phi_\mu(x)\phi_0(\mu)\,d\mu}$</div><div>Where $\phi_\mu$, $\Phi_\mu$ are the normal PDF and CDF respectively. 
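This density is straightforward to evaluate numerically; a minimal sketch (the quadrature range, step count and tolerances are arbitrary choices of mine):

```python
from math import erf, exp, pi, sqrt

def phi(z):
    # standard normal PDF
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):
    # standard normal CDF, via the error function
    return (1 + erf(z / sqrt(2))) / 2

def posterior_mean(x, n, lo=-6.0, hi=6.0, steps=4000):
    # posterior density on mu is proportional to Phi(x - mu)**(n-1) * phi(x - mu) * phi(mu);
    # integrate on a grid to normalize and take the mean
    h = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps + 1):
        mu = lo + i * h
        w = Phi(x - mu) ** (n - 1) * phi(x - mu) * phi(mu)
        num += mu * w
        den += w
    return num / den

# the best of n = 10 features coming in at only x = 1 is evidence that mu is *below* its prior mean
assert posterior_mean(1.0, 10) < 0
# with n = 1 there is no selection effect: the posterior mean is the familiar x/2
assert abs(posterior_mean(1.0, 1) - 0.5) < 1e-3
```

With $n=10$ and $x=1$ the computed posterior mean is negative, below the prior mean of 0, whereas with $n=1$ (no selection) it is the usual $x/2$.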
Here's an interactive visualization of this density:</div> <figure> <iframe frameborder="0" height="500px" src="https://www.desmos.com/calculator/y5zui7gljk?embed" style="border: 1px solid #ccc;" width="500px"></iframe> <figcaption><a href="https://www.desmos.com/calculator/y5zui7gljk">Link to interactive version.</a> (Note that in this visualization, $\mu$ is $x$ and $x$ is $z$.)</figcaption> </figure> For example, for $n=10$, $x=1$, the distribution actually shifts leftwards, because <i>surely</i>&nbsp;if $\mu$ were 0, the propagandist could have found a feature with a better value than 1.<div><br /></div><div>The moral of the story is that you can't hack Bayes's theorem; you can't fool a rational agent. If you have a parameter whose value you know, you can't systematically produce misinformation about the parameter. This is a result of the <b><a href="https://www.lesswrong.com/posts/jiBFC7DcCrZjGmZnJ/conservation-of-expected-evidence">conservation of expected evidence</a></b>: the expectation of the posterior probability of each value is its prior probability.</div><div><br /></div><div>So when we talk about scientific standards -- about scientists revealing all relevant information and not p-hacking, etc. -- these are not <i>requirements</i>&nbsp;for Bayesian inference, but they're simply a way to ensure that scientific research is maximally informative. If you don't know the underlying process that scientists use to report their data, or if you know that they use a "biased" process, then your estimator of the relevant parameter will be less informative than it could have otherwise been.</div><div><br /></div><div>The method they teach you in primary school to combat filtered evidence -- listing "arguments and counter-arguments", "pros and cons", etc. -- is far inferior to the standard of scientific ethics. 
Listing arguments and counter-arguments replaces a one-sided rationalization with a two-sided rationalization, but it doesn't truly approach rationality -- you just have two propagandists instead of one.&nbsp;</div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-84380848311741873622021-02-11T10:14:00.007+00:002021-06-03T16:10:59.890+01:00Exogenous variables and normal linear models<p>In the last article, we described two simplifications that allow us to more sensibly (than trying to fit a complicated joint distribution) deal with modeling the relationships between random variables. The first of these simplifications was <i>causal modeling</i>, which we discussed at some length.</p><p>In this section, we shall focus on the latter type of statistical model, where we do not try to model the explanatory variables themselves, but are only interested in modeling our dependent variable in terms of them. In particular, we shall discuss the special case of&nbsp;<i>normal linear models</i>, a toy set-up to start thinking about this sort of statistical modeling.</p><div>In the normal linear model: $X$ is a vector-valued ($\in \mathbb{R}^n$) random variable and $Y$ is a vector-valued ($\in\mathbb{R}^m$) random variable, and we write $$Y\sim N(\mu+\mathbf{B}X, \Sigma)$$ Where $\mu\in\mathbb{R}^m$ and $\mathbf{B}\in\mathbb{R}^{m\times n}$. 
For $p$ data points, we can write the model concisely as:</div><div><br /></div><div>$$\mathbf{Y}=\mathbf{X}\mathbf{B}+\mathbf{E}$$</div><div><br /></div><div>Where $\mathbf{Y}\in\mathbb{R}^{p\times m}$ are the observed values of $Y$, $\mathbf{X}\in\mathbb{R}^{p\times (n+1)}$ is the design matrix (the first column is a column of 1s), $\mathbf{B}\in\mathbb{R}^{(n+1)\times m}$ is the parameter matrix and $\mathbf{E}\in\mathbb{R}^{p\times m}$ is the error matrix, with each row being distributed $N(0, \Sigma)$.</div><div><br /></div><div>$\mathbf{B}$ and $\Sigma$ are thus the parameters we want to estimate from data. In particular, we want to construct estimators and <a href="https://thewindingnumber.blogspot.com/2020/04/special-case-of-bayes-confidence-regions.html">confidence regions</a> for them.</div><div><br /></div><div>The description above is usually called a <i>general</i>&nbsp;or <i>multivariate</i>&nbsp;linear model, while the special case $m=1$ is what is usually referred to by the term <i>normal linear model</i>. Here we have:</div><div><br /></div><div>$$Y=X\beta+E$$</div><div>where $Y\in\mathbb{R}^p$, $X\in\mathbb{R}^{p\times(n+1)}$, $\beta\in\mathbb{R}^{n+1}$ and $E\in\mathbb{R}^p$ with each $E_i\sim N(0,\sigma^2)$.&nbsp;</div><div><br /></div><div><b>Maximum Likelihood Estimator for $\beta$ given known variance</b></div><div><b><br /></b></div><div>The square term in the exponent of the normal distribution closely links it to the square-minimization that you see so much in variance/error-related statistics. 
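Concretely, writing out the log-likelihood of the $m=1$ model makes this explicit:

$$\log L(\beta, \sigma^2) = -\frac{p}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left\| Y - X\beta \right\|^2$$

For fixed $\sigma^2$, maximizing the likelihood over $\beta$ is precisely minimizing the sum of squared residuals $\left\| Y - X\beta \right\|^2$.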
In particular, the MLE for the parameter $\beta$ in the normal linear model is the same as the Least Squares Estimator.</div><div><ul style="text-align: left;"><li>$\hat \beta = {\left( {{X^T}X} \right)^{ - 1}}{X^T}Y$</li><li>$\hat\beta$ is the same as the Least Squares Estimator, because $X\left( {{X^T}X} \right)^{ - 1}{X^T}$ is the symmetric projection matrix onto the image of $X$, so the vector of residuals $Y-\hat{Y}$ has minimal length (its squared norm is called the <i>residual sum of squares</i>&nbsp;$\mathrm{RSS}$).&nbsp;</li><li>$\hat{\beta}$ is linear in $Y$ and unbiased.</li><li>$\mathrm{cov}(\hat{\beta})=\sigma^2(X^TX)^{-1}$</li></ul></div><div>Observe that $\hat\beta$ is only defined ("identifiable") when $X^T X$ is invertible, i.e. when $X$ is full column rank. When $X$ is not full column rank, the data points are "vertical", so there are multiple models that give rise to the same data.</div><div><br /></div><div><b>Unbiased estimator for the variance $\sigma^2$&nbsp;</b></div><div><b><br /></b></div><div>This is a generalization of the classic statistical problem of sample variance. In the simplest example, you have some random variable $X$ with some mean $\mu$ and variance $\sigma^2$, and you want to estimate $\sigma^2$ from an $n$-sample of $X$. The "obvious" answer -- and this is the Maximum Likelihood Estimator -- is just $\frac{1}{n}\sum{(x_i-\hat{\mu})^2}$.</div><div><br /></div><div>But if you consider, e.g. the situation with only one data point (then the sample has zero variance), it becomes very clear that this underestimates the population variance. In fact, the variance of the population has two additive components: the variance within the sample itself, and the variance between the sample and the population mean. 
For example in the one-data-point case, the value of the data point is <i>itself</i>&nbsp;an estimator of variance (a distribution with higher variance is more likely to produce samples away from the mean), or at least it would be if we knew the population mean.&nbsp;</div><div><br /></div><div>In the situation with one explanatory variable, the same result applies when you have <i>two</i>&nbsp;data points, because a single line can fit both data points without error, unless the two data points are vertically colinear. In general when you have $n$ explanatory variables, you can perfectly fit $n+1$ data points with zero error, unless the $(n+1)$-plane passing through them is vertical.&nbsp;&nbsp;</div><div><br /></div><div>So the expected generalization of the $\frac{\mathrm{RSS}}{p-1}$ estimator (known as <i>Bessel's correction</i>) to the case of a linear model is $\frac{\mathrm{RSS}}{p-r}$ where $r$ is the rank of the design matrix. Indeed, one can check that this is an unbiased estimator for the variance of the error term.</div><div><br /></div><div>In fact it turns out that $\frac{\mathrm{RSS}}{\sigma^2}\sim\chi_{p-r}^2$.</div><div><br /></div><div><b>Confidence region for $\beta$</b></div><div><b><br /></b></div><div>Let $c\in\mathbb{R}^{n+1}$. We have:</div><div><br /></div><div>$$\frac{c^T\hat{\beta}-c^T\beta}{\sqrt{\mathrm{Var}(c^T\hat{\beta})}}\sim N(0, 1)$$</div><div>$$\frac{c^T\hat{\beta}-c^T\beta}{\sqrt{{c}^T(X^TX)^{-1}{c}\sigma^2}}\sim N(0, 1)$$</div><div><br /></div><div>Note that for $Z\sim N(0, 1)$ and $V \sim \chi_{p-k}^2$ independent, $Z/\sqrt{V/(p-k)}\sim t_{p-k}$. 
So:&nbsp;</div><div><br /></div><div>$$\frac{c^T\hat{\beta}-c^T\beta}{\sqrt{c^T(X^TX)^{-1}c\cdot \frac{\mathrm{RSS}}{p-(n+1)}}}\sim t_{p-(n+1)}$$</div><div><br /></div><div>Which can be rewritten as a confidence region for $\beta$:</div><div><br /></div><div>$$\frac{(\hat{\beta}-\beta)^T X^T X (\hat\beta - \beta)}{\mathrm{RSS}}\frac{p-(n+1)}{n+1}\sim F_{n+1,p-(n+1)}$$</div><div><br /></div><div><b>Confidence interval for predictions</b></div><div><b><br /></b></div><div>When you use your linear model to make new predictions (i.e. estimate $Y$ for some $x$, as $\hat{Y}=\hat{\beta}^T x$), there are two sources of error, since $Y\sim N(\beta^T x, \epsilon^2)$:</div><div><ul style="text-align: left;"><li>Our estimator $\hat{\beta}$ is not an exact estimate of $\beta$.</li><li>$Y$ has some variance $\epsilon^2$ even after specifying $x$.&nbsp;</li></ul><div>In principle, if we had a Bayesian model, we could calculate the exact distribution of $Y$ by compounding the distributions of $\hat{\beta}\mid\beta$ and $Y\mid \hat{\beta}$. In the frequentist setting, $\beta$ is not seen to have a <i>distribution</i>&nbsp;as such, so we instead want to construct a confidence interval for $Y$.</div></div><div><br /></div><div><hr /><p><b>ANOVA</b></p></div><div><p>Statistical modeling is closely linked to decision theory -- exogenous variables are those that we have "direct control" over, so the effects of the decision can be seen by suppressing the distribution of and the internal correlations between these controllable variables. These "projected" correlations are causations.</p><p>This problem is related to the problem of&nbsp;<i>explaining variance</i>. 
Sure, statistical modeling is more general -- the variability of a random variable has more to it than just its variance -- but in the special case of a normal linear model, the distribution is summarized by the mean and the variance, and so explaining the variance in $Y$ through exogenous variables is equivalent to determining a statistical model.</p><p>The basic motivating fact behind ANOVA is the&nbsp;<b>law of total variance</b>: variance in a dependent variable can be broken down, in a Pythagorean fashion, into variances in the exogenous variables. This is a result of the independence of $Y\mid X$ from the residuals.</p><p>$$\mathrm{Var}(Y)=\mathrm{Var}\left(\mathrm{E}\left(Y\mid X\right)\right)+\mathrm{E}\left(\mathrm{Var}\left(Y\mid X\right)\right)$$</p><p>This simplifies rather nicely in the case of a normal linear model, where errors are assumed to be independent of exogenous variables.</p><p>A very simple application of ANOVA is in assessing the "importance" of a particular exogenous variable to $Y$, by looking at the fractions of variance explained by each exogenous variable. More generally, ANOVA can be used to test the validity of any sub-model -- if a particular factor doesn't explain much of the variance in a variable $Y$, it can probably be discarded while still retaining a suitable model.</p><p>Any linear model can be represented as $E(Y)\in\mathrm{span}(X)$, representing the hypothesis that the mean of $Y$ is a plane/that the mean of $Y$ is a linear function of $X$. 
A submodel $E(Y)\in\mathrm{span}(X_0)$ (where $X_0$ is a submatrix of columns in $X$) represents the further hypothesis that $Y$ does not correlate with any of the variables in $X$ except those in $X_0$.</p><p>The fraction of variance in $Y$ unexplained by the sub-model is then a test statistic for the sub-model (the larger this fraction is, the less likely the sub-model is to be true), and its distribution can be calculated under the sub-model as a null hypothesis.</p><p>$$F=\frac{\mathrm{RSS}_0-\mathrm{RSS}}{\mathrm{RSS}}\cdot \frac{p-r}{r-r_0} \sim F_{r-r_0,p-r}$$</p><div><hr /><p><b>Model diagnostics</b></p></div><p>Specifying a model, one can then do inference to determine the model parameters from the data. However, in the general paradigm of statistical inference, we don't know at all that the specified model is valid (in the full paradigm of Solomonoff Induction, the model is also inferred statistically). With simple data of low dimensionality, we often "eyeball" the data to set the model. Heuristics that help us decide on/evaluate a model are called&nbsp;<i>model diagnostics</i>.</p><p>We earlier discussed&nbsp;<a href="https://thewindingnumber.blogspot.com/2020/05/statistical-models-noise-anova-and.html">ANOVA</a>, which is a diagnostic to evaluate sub-models of a normal linear model. A related approach to evaluate the suitability of a normal linear model (without reference to a super-model) is the&nbsp;<b>Coefficient of Determination</b>&nbsp;$R^2$, which is defined as the fraction of variance in the response variable explained by the model.</p><p>Well, even an uncorrelated explanatory variable will spuriously explain some of the variance in $Y$, because the model will fit to whatever insignificant sample correlation is observed -- in particular if we have a full set of $p$ explanatory variables (where $p$ is the number of data points), $Y$ is fitted exactly and $R^2$ is 1. 
This is not a result of the accuracy of the model -- the model isn't&nbsp;<i>predicting</i>&nbsp;anything, it simply lacks the freedom to equal anything but the observed data. So analogous to our degree-of-freedom argument for scaling by $1/(n-p)$ for the sample variance, one may want to divide the numerator and denominator of $R^2$ by the degrees of freedom $n-p$ and $n-1$ to obtain unbiased estimates of the errors of each model.</p><p>+leverages etc.</p><hr /><p><b>Linear parameterization: PCA</b></p></div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-70697903330713419002021-02-06T20:46:00.007+00:002021-05-04T10:49:32.215+01:00Statistical models; causal models<p>A <b>statistical model</b>&nbsp;in the most general sense is a proposed distribution for some random variable. In the simplest case, one can have a univariate distribution, e.g. $X\sim N(\mu, \sigma^2)$, and experiment to get posterior distributions for the values of $\mu$, $\sigma$.&nbsp;</p><p>More generally, the random variable in question may be correlated with some other random variable. 
Well, the most general way to express such a model is via its joint distribution, but the joint distribution might be too expensive to reasonably test -- a classic example is that of <a href="https://www.lesswrong.com/posts/hzuSDMx7pd2uxFc5w/causal-diagrams-and-causal-models">causal relationship</a>.</p><p>However, there are various special cases of this general description that naturally capture the assumptions that one would make while attempting to model these variables:&nbsp;</p><p></p><ul style="text-align: left;"><li><b>Hierarchical/Causal model:</b>&nbsp;We arrange the random variables in a tree-like format, with each parent's distribution taking the values of its child random variables as explanatory variables -- the distribution of each node is marginalized against the distribution of its parents so that it only depends on its own children. E.g. $Y\sim N(\mu_{0Y} + \mu_{1Y} X, 1)$, $X\sim N(\mu_X, 1)$.&nbsp;</li><li><b>Exogenous variables:</b>&nbsp;We don't bother about the distribution of the explanatory variables, treating them as "exogenous variables". In the context of decision theory, these should be seen as variables which we can manipulate at will -- e.g. if we're discussing the effects of taxation on GDP growth, then tax rates should be seen as an exogenous variable (even though it might be true that the tax rate is&nbsp;<i>actually</i>&nbsp;influenced by voter priorities and can be predicted in such a way -- that's just not the question we're interested in addressing). Essentially, we are only interested in modeling the conditional distribution.</li></ul><div><br /></div><div><hr /><br /></div><div><b>Causation</b></div><div><b><br /></b></div><div>So causation is different from correlation, and closely related to&nbsp;<i>partial correlation</i>.&nbsp;</div><div><br /></div><div>To illustrate what we mean, consider this highly publicized study (reported e.g. 
<a href="https://prospect.org/power/want-expand-economy-tax-rich/">here</a>) arguing that higher taxes benefit the economy, because the U.S. economy grew slightly faster in the 1960s when taxes were higher. This study had a host of other problems (see the <a href="https://thewindingnumber.blogspot.com/2020/04/i-dont-believe-p-hacking-is-problem.html">p-hacking article</a>, where I point out that doing an experimental study on such noisy data is meaningless in the absence of a solid theoretical backing for your claim), but the main problem was that there are a lot of differences between the 1960s and now <i>apart</i>&nbsp;from tax rates, e.g. regulations were fewer in the 1960s, the kinds of industries and their accounting in GDP were different in the 1960s.&nbsp;</div><div><br /></div><div>The fundamental problem is this: the question you <i>want</i>&nbsp;to answer is "what will the effect of increasing taxes be on economic growth?" Since it is unlikely that increasing taxes will reduce regulations or change accounting methods, we want to remove the contributions of these correlations in the explanatory variables from the correlation between tax rates and economic growth.</div><div><br /></div><div>(The obvious solution would be to take cross-sectional data (comparing the U.S. to countries that had different time trends in their tax rates) instead of longitudinal data -- this is an example of randomized interventions, or reasoning on the joint probability distribution.)</div><div><br /></div><div>This is the idea behind <b>partial correlations</b>. To study the <i>causal link</i> between a dependent variable $Y$ and an independent variable $X_i$ means to study the <b>conditional distribution</b>&nbsp;of $Y\mid X_{j\ne i}$.</div><div><br /></div><div>Notice how this definition&nbsp;<i>completely depends on the random variables we have chosen to control</i>. 
For example, if we had also controlled for some economic variables that taxation affects GDP growth "through", the effect of changing taxation would perhaps be smaller. The choice of these variables depends on our purpose -- i.e. based on what we're actually able to control.</div><div><br /><blockquote class="tr_bq"><div class="twn-furtherinsight">Apply this reasoning to the context of a classic example: since wind speed correlates with the rotation of a windmill, does this mean the windmill rotation affects wind speed? What are we really asking here? What are our variables (you should have at least 3 of them)?</div></blockquote><br /><div class="twn-furtherinsight">We can also immediately carry over constructions from correlation to causation. We define the&nbsp;<b>partial correlation</b>&nbsp;of $Y\mid X_{j\ne i}$ as the value of the correlation computed from the conditional distribution of $Y\mid X_{j\ne i}$, which may geometrically be interpreted as the $\cos\theta$ of the projections of $Y$ and $X_i$ (<a href="https://thewindingnumber.blogspot.com/2018/02/random-variables-and-their-properties.html">as vectors</a>) onto the orthogonal complement of the span of the controlled variables (therefore eliminating the correlations $Y$ and $X_i$ have with them).</div></div><div><br /></div><div><div><hr /><br /></div></div><div><b>Causal models</b></div><div><b><br /></b></div><div>It is important to note that the existence of partial correlation is not actually equivalent to causation either. Causation is an <i>underlying explanation</i>&nbsp;-- it is represented in the form of <i>causal networks</i>, which are directed acyclic graphs that are <i>models</i>&nbsp;of the underlying behavior of the system. 
It's much like how quarks are an underlying model of the observed particle zoo -- the quarks themselves cannot be observed, but they form a framework on which predictions can be made.</div><div><br /></div><div>So a causal model might look like this:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-rq59jfoqAxk/YBvSkpsFLII/AAAAAAAAOeU/0HpjY8RvyyUmDC3j4_ZgWbdhryIFkLf9gCLcBGAsYHQ/s502/causal_netwok.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="456" data-original-width="502" src="https://1.bp.blogspot.com/-rq59jfoqAxk/YBvSkpsFLII/AAAAAAAAOeU/0HpjY8RvyyUmDC3j4_ZgWbdhryIFkLf9gCLcBGAsYHQ/s320/causal_netwok.png" width="320" /></a></div></div><br /><div>With distributions $f_T$, $f_{X_1\mid T}$, $f_{X_2\mid T}$, $f_{Y\mid X_1, X_2}$.&nbsp;</div><div><br /></div><div>${f_{T,{X_1},{X_2},Y}}(t,{x_1},{x_2},y) = {f_{Y\mid {X_1},{X_2}}}(y\mid {x_1},{x_2}){f_{{X_1}\mid T}}({x_1}\mid t){f_{{X_2}\mid T}}({x_2}\mid t){f_T}(t)$</div><div><br /></div><div>So the idea is that multiple different causal models may be <i>equivalent</i>&nbsp;in the sense that they lead to the same joint distribution -- saying that something causes another thing is not an absolute truth, but simply <i>a</i> model that is consistent with the truth. If two models lead to an identical joint distribution, they are "equivalent" for all physical purposes.</div><div><br /></div><div>E.g. 
consider the following models:</div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-qu8dny8kdFA/YB7ov8qjwgI/AAAAAAAAOe0/-WEBP1dmT14gYx8hKv4ZjiXINFkDgZhUgCLcBGAsYHQ/s856/causal_networks.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="364" data-original-width="856" height="170" src="https://1.bp.blogspot.com/-qu8dny8kdFA/YB7ov8qjwgI/AAAAAAAAOe0/-WEBP1dmT14gYx8hKv4ZjiXINFkDgZhUgCLcBGAsYHQ/w400-h170/causal_networks.png" width="400" /></a></div><br /><div><br /></div><div>Then the joint distribution corresponding to each causal model is:</div><div><br /></div><div>${f_{ABR}}(a,b,r) = {f_{A\mid B}}(a\mid b){f_{B\mid R}}(b\mid r){f_R}(r) = \frac{{f(a,b)f(b,r)}}{{f(b)}}$</div><div>${f_{ABR}}(a,b,r) = {f_{R\mid B}}(r\mid b){f_{B\mid A}}(b\mid a){f_A}(a) = \frac{{f(a,b)f(b,r)}}{{f(b)}}$</div><div>${f_{ABR}}(a,b,r) = {f_{A\mid B}}(a\mid b){f_{R\mid B}}(r\mid b){f_B}(b) = \frac{{f(a,b)f(b,r)}}{{f(b)}}$</div><div><br /></div><div>Which are equivalent.</div><div><br /></div><div>On the other hand, the following three models:</div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-K_aGBy73bCI/YB72J0pS7jI/AAAAAAAAOfA/MKA8UTyos64WJh65ZtM7SxxrH8xswyBQwCLcBGAsYHQ/s868/causal_networks2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="370" data-original-width="868" height="170" src="https://1.bp.blogspot.com/-K_aGBy73bCI/YB72J0pS7jI/AAAAAAAAOfA/MKA8UTyos64WJh65ZtM7SxxrH8xswyBQwCLcBGAsYHQ/w400-h170/causal_networks2.png" width="400" /></a></div><br /><div>${f_{ABR}}(a,b,r) = {f_{A\mid B}}(a\mid b){f_{B\mid R}}(b\mid r){f_R}(r) = \frac{{f(a,b)f(b,r)}}{{f(b)}}$</div><div>${f_{ABR}}(a,b,r) = {f_{B\mid A,R}}(b\mid a,r){f_B}(b){f_R}(r)$</div><div>${f_{ABR}}(a,b,r) = {f_{B\mid A}}(b\mid a){f_{A\mid R}}(a\mid r){f_R}(r) = 
\frac{{f(a,b)f(a,r)}}{{f(a)}}$</div><div><br /></div><div>Are <i>not</i>&nbsp;equivalent, and can thus be distinguished by experiment.</div><div><br /></div><div>Because this equivalence is a property of graphs, we expect to be able to read off equivalences from simply glancing at the causal trees. The key idea here is that a causal tree is defined by assumptions of conditional independence (because causation is <i>about</i>&nbsp;conditional independence), and so the equivalence of two trees means that they have the same pattern of conditional independence. Reading off patterns of conditional independence is known as <b>D-separation</b>.</div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-47008648876347399632021-01-27T12:09:00.001+00:002021-01-27T12:09:20.876+00:00Frequentism as a bureaucratic restriction on speech<p>Although it is traditional to claim that Bayesian and frequentist statistics are two "schools" of probability and statistics, there are actually two distinct ideas that are labeled "frequentist" (and analogously "Bayesian"):</p><p></p><div><ul style="text-align: left;"><li><b>Frequentist probability:</b>&nbsp;An <i>interpretation</i>&nbsp;of probability, an alternative to the subjective interpretation of probability normally associated with Bayesianism. Under the frequentist interpretation, probability is understood as having to do with frequencies of observations, e.g. a probability of 1/2 means that you can expect 50 coin tosses out of 100 to come up heads. In reality, this merely shifts the problem to interpreting <i>expectation</i>&nbsp;instead of probability.</li><li><b>Frequentist statistics:</b>&nbsp;A formulation of statistical inference, in which we don't really discuss probabilities of theories at all, but restrict ourselves to talking about tautological statements about the probabilities of observations <i>given</i>&nbsp;theories. 
The supposed advantage of this is that we don't need to follow unproven beliefs about priors.</li></ul><div>In the Bayesian view, the frequentist interpretation of probability appears as follows: given a random variable $X$, we sample a sequence of IID variables $X_i$ -- the "limiting" distribution of these $X_i$s is then seen as the probability distribution of $X$.</div></div><div><br /></div><div>(From a "positivist"/"scientific" viewpoint, it is rather absurd to define the probability of heads on a coin-flip in terms of a large number of <i>other</i>&nbsp;coin flips. After all, it is only an approximate <i>assumption</i>&nbsp;that all these coin flips are distributed identically. The purest way to talk about IID variables would be to consider an infinite number of <i>universes</i>, each in which a coin flip occurs. Of course, if the universes were identical in every way, the coin flips would be the same in every universe and the probability of heads would be 0 or 1. Probability arises from the fact that we have <i>limited information</i>&nbsp;about the state of the universe -- thus the average is not being taken over <i>all identical universes</i>, but over <i>all universes which match the limited information we have</i>. This "purification" of frequentism is simply Bayesian probability, as the "set of universes" is just the sample space, and the distribution of the unknown information is a Bayesian prior.)</div><div><br /></div><div>The reason&nbsp;that frequentist statistics is typically associated with frequentist probability is that frequentist probabilists see it as odd to talk about the <i>frequency</i>&nbsp;of a theory being true -- to them, the same theory is true in each experiment conducted, since frequentist probability does not consider the situations in "alternate universes". 
Of course, this distinction between theory and observation is completely arbitrary -- on a logical level, theories and observations are both just logical statements, and analogously parameters and observed quantities are both just random variables.</div><div><br /></div><div>To "forbid" talking about the probability of a theory is simply a bureaucratic restriction on what one can <i>say</i>, not something that affects your <i>decisions</i>&nbsp;in any way. When it comes to making decisions, one has to compute expected utilities, which requires a Bayesian approach (because frequentism does not consider "multiple universes"). Even if you don't <i>say</i>&nbsp;the probability of a theory is 80%, and only say that the probability of some observation would be 95% given that theory, the fact that your discipline "accepts" a p-value of 0.05 itself implicitly means that you believe the theory to be true with sufficient certainty.</div><p></p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-65882595143563334962021-01-12T21:30:00.000+00:002021-01-12T21:30:01.347+00:00Outer and inner measure, complete measure<p>Defining measure <a href="https://thewindingnumber.blogspot.com/2020/12/introduction-to-measure.html">in the sense of Borel</a>, we are only really able to define measure on the specific sets that we can find in the sigma algebra. While this might seem intuitively acceptable (we only really have a feeling for the notion of "length" for line intervals, and not some arbitrarily bizarre sets we may be able to construct), it is easy to see that we might be able to do "better" than what the Borel measure allows.</p><p>For instance, we should be able to assign "bounds" to measures on sets. If a non-measurable set is fully contained in a set of some measure, then its measure should be less or equal to that of that set. 
In particular, if a non-measurable set is fully contained in a set of zero measure, its measure should just be equal to zero (classic examples of this are subsets of the Cantor set).&nbsp;</p><p>This is the idea behind the&nbsp;<b>inner and outer measures</b>.</p><p>Informally, the idea is that the measure of a non-measurable set is bounded below by the measures of the sets it contains and above by the measures of the sets it is contained in. More formally: the inner measure is the supremum of the measures of the measurable sets contained within and the outer measure is the infimum of the measures of the measurable&nbsp;sets containing it.&nbsp;</p><p>I.e. given a measure space $(X, \Sigma, \mu)$, we define the inner and outer measures:</p><p>$$\mu^-(S)=\sup\{\mu(T):T\in\Sigma;\,T\subseteq S\}$$</p><p>$$\mu^+(S)=\inf\{\mu(T):T\in\Sigma;\,T\supseteq S\}$$</p><p>Our goal, as previously discussed, is to be able to measure any set that <i>can</i>&nbsp;be consistently measured. The statement we <i>wish</i> to make is of the form "a measure is called a <b>complete measure</b>&nbsp;if all sets whose inner and outer measures agree are measurable", i.e. $\mu$ is called a complete measure if:</p><p>$$\forall S\subseteq X, \mu^+(S)=\mu^-(S)\implies S\in\Sigma$$</p><p>It is easy to show that this is equivalent to demanding that <b>all subsets of measure-zero sets are measurable</b> (with measure zero), which is traditionally the more common definition of a complete measure.</p> <hr />More generally, we may attempt to define the <i>notion</i> of an outer/inner measure without&nbsp;reference to a measure inducing it -- and in fact we see that we may construct a measure <i>from</i>&nbsp;just an outer/inner measure alone.&nbsp;<div><div><br /></div></div><div>Well, the outer measure of a set is supposed to represent the measure of the smallest measurable set containing it -- informally, one may suggest that an outer measure is the composition of a measure and a "closure operator". 
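<p>(Aside: the inner and outer measures defined above are concrete enough to compute by brute force on a finite measure space. A minimal sketch -- the four-element example and the names <code>inner</code>/<code>outer</code> are my own illustration, not from this post:)</p>

```python
# Brute-force inner and outer measures on a small finite measure space.
# X = {0,1,2,3} with the sigma algebra generated by the partition {{0,1},{2,3}}.
X = frozenset({0, 1, 2, 3})
A, B = frozenset({0, 1}), frozenset({2, 3})
Sigma = [frozenset(), A, B, X]
mu = {frozenset(): 0, A: 1, B: 2, X: 3}

def inner(S):
    """sup of mu(T) over measurable T contained in S"""
    return max(mu[T] for T in Sigma if T <= S)

def outer(S):
    """inf of mu(T) over measurable T containing S"""
    return min(mu[T] for T in Sigma if T >= S)

S = frozenset({0})          # not measurable in this sigma algebra
print(inner(S), outer(S))   # 0 1 -- the bounds genuinely disagree
print(inner(A), outer(A))   # 1 1 -- a measurable set is pinned down
```

<p>(Note that here the inner and outer measures of the non-measurable $\{0\}$ disagree, so the completeness criterion above does not force it to be measurable.)</p>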
The "axioms" defining this notion are then (based directly off the axioms for sigma algebras and measures, and the notion of an outer measure):</div><div><ul style="text-align: left;"><li>$\mathrm{cl}:2^X\to 2^X$</li><li>$\forall(S_i),\,\exists S,\,\bigcup\mathrm{cl}(S_i)=\mathrm{cl}(S)$</li><li>$\forall T,\,\exists S,\,\mathrm{cl}(T)^C=\mathrm{cl}(S)$</li><li>$S\subseteq\mathrm{cl}(S)$</li><li>$S\subseteq\mathrm{cl}(T)\implies\mathrm{cl}(S)\subseteq\mathrm{cl}(T)$</li><li>$\mathrm{cl}(\mathrm{cl}(S))=\mathrm{cl}(S)$</li><li>$\mu:\mathrm{Im}(\mathrm{cl})\to[0,\infty]$</li><li>$\mu(\bigcup S_i)=\sum\mu(S_i)$</li></ul></div><div>What sort of $\mu^+:2^X\to[0,\infty]$ can be written as $\mu^+=\mu\circ\mathrm{cl}$ so that the above conditions are satisfied?&nbsp;</div><div><br /></div><div>To axiomatise the outer measure so, we want to first find a necessary and sufficient condition for a set to be measurable under some $\mu^+$, i.e. we wish to represent the condition $\mathrm{cl}(S)=S$ purely in terms of $\mu^+$. Well, the "idea" represented by $\mathrm{cl}(S)=S$ is that $S$ contains its "boundary", so its outer measure does not overlap with that of $S^C$ -- so $S$ and $S^C$ can be used as "building blocks" for the outer measures of other sets. I.e. for an arbitrary set $A$,&nbsp;</div><div><br /></div><div>\begin{align}\mu^+(A) &amp;= \mu(\mathrm{cl}(A)) \\&amp;= \mu\left(\mathrm{cl}(A)\cap S\right) + \mu\left(\mathrm{cl}(A)\cap (\mathrm{cl}(S)-S)\right) + \mu\left(\mathrm{cl}(A) \cap \mathrm{cl}(S)^C\right)\\&amp;=\mu^+\left(A\cap S\right) + \mu^+\left(A \cap S^C\right)\end{align}</div><div>This is known as the <b>Caratheodory criterion</b>&nbsp;for the measurability of a set with respect to an outer measure.</div><div><br /></div><div>Now we can simply think of $\mathrm{cl}$ as the operator that takes a set to the smallest Caratheodory-measurable set containing it. We then ask: what conditions must $\mu^+$ satisfy so that it is a measure on the sigma algebra of Caratheodory-measurable sets? 
We want to be able to show that the measure of the union of disjoint Caratheodory sets is the sum of their measures.</div><div><br /></div><div>Well, you should be able to work out the axiomatization of the outer measure then as:</div><div><ul style="text-align: left;"><li>$S\subseteq T\implies\mu^+(S)\le\mu^+(T)$</li><li>$\mu^+\left(\bigcup S_i\right)\le\sum\mu^+(S_i)$</li></ul><div>And similarly for the inner measure:</div></div><div><br /></div><div><ul style="text-align: left;"><li>$S\subseteq T\implies\mu^-(S)\le\mu^-(T)$</li><li>$\mu^-\left(\bigcup S_i\right)\ge\sum\mu^-(S_i)$</li></ul></div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-16852923474049274232020-12-24T12:42:00.005+00:002020-12-24T12:42:35.568+00:00Introduction to measure<p>Although integration is often introduced in terms of Riemann sums, it is rather clear that Riemann integration only really represents a specific algorithm for approximating volumes based on some intuition that applies to sufficiently familiar-looking functions. In particular, it doesn't really help us in trying to understand what a volume <i>is</i>&nbsp;(the standard example is that it doesn't tell us what the volume of $\mathbb{Q}$ is).</p><p>Even without calculus, we can already calculate the volumes of polygonal shapes like lines and trapeziums through basic linear algebra -- and calculus is the science of being able to define volumes of more complicated shapes based on limiting rules from these building blocks. Riemann integration is an example of such a limiting process.</p><p>The general idea behind abstract measure theory is that we can define volume on an arbitrary space based on some "building block" sets. 
Essentially, if we have a set $\mathcal{F}_0$ (called a <b>sigma basis</b>)&nbsp;of shapes whose volumes we can know axiomatically, then we should be able to say that the <b>volume of a countable union of such&nbsp;</b><b>disjoint shapes is the sum of their volumes</b>&nbsp;and the <b>volume of the difference between a shape and its subshape is the difference between their volumes</b>.</p><p>(These axioms look awfully similar to <a href="https://thewindingnumber.blogspot.com/2019/10/sigma-fields-are-venn-diagrams.html">those of probability theory</a>, which is why measure theory appears so much in probability. Countable unions in the definition of sigma algebras also appear more naturally in measure theory than in probability theory.)</p><p>This motivates the definition of <b>sigma algebras </b>in measure theory: for a space $X$, some set of its subsets $\mathcal{F}\subseteq 2^X$ is called a $\sigma$-algebra if:</p><p></p><ul style="text-align: left;"><li>For sets $A_i\in\mathcal{F}$ ($i$ is countable), $\bigcup A_i\in\mathcal{F}$.</li><li>For a set $A\in\mathcal{F}$, $A^c\in\mathcal{F}$.</li></ul><div>And we define a <b>measure</b> $\mu:\mathcal{F}\to[0,\infty]$ on a sigma algebra as a map satisfying:</div><div><ul style="text-align: left;"><li>For disjoint $A_i$ ($i$ countable), $\mu\left(\bigcup A_i\right) = \sum\mu(A_i)$.</li></ul><div>The tuple $(X,\mathcal{F})$ is called a <b>measurable space</b>&nbsp;and the tuple $(X,\mathcal{F},\mu)$ is called a <b>measure space</b>. 
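<p>(These axioms can be sanity-checked mechanically on a finite example. A sketch -- the three-element set and counting measure are my own illustration -- taking $\mathcal{F} = 2^X$:)</p>

```python
from itertools import chain, combinations

# Check the sigma algebra and measure axioms on a finite example:
# X a three-element set, F = 2^X (the power set), mu = counting measure.
X = frozenset({'a', 'b', 'c'})
F = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(X), r) for r in range(len(X) + 1))]

def mu(A):
    return len(A)  # counting measure: mu(A) = number of points in A

# closed under complements
assert all(X - A in F for A in F)
# closed under unions (finite, which here exhausts the countable case)
assert all(A | B in F for A in F for B in F)
# additive on disjoint measurable sets
assert all(mu(A | B) == mu(A) + mu(B)
           for A in F for B in F if not A & B)
print(len(F))  # 8 measurable sets, i.e. 2^3
```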
The elements of the sigma algebra $\mathcal{F}$ are called <b>measurable sets</b>, and the basis of the sigma algebra is hopefully something simple and intuitive, like open intervals.</div></div><p></p><p></p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-17356760308478510882020-12-03T20:07:00.007+00:002020-12-07T08:27:37.063+00:00Introduction to Projective Geometry<p>We are also interested in the null sets of systems of polynomials, and these are also considered algebraic varieties. Geometrically, this corresponds to studying the&nbsp;<b>intersections</b>&nbsp;of algebraic varieties.&nbsp;</p><div>Like working over the complex plane allows us to be more general when dealing with the shapes and dimensions of algebraic varieties themselves -- because every polynomial has a root over the complex numbers, etc. -- working with the&nbsp;<b>projective complex plane</b>&nbsp;allows us to be general when dealing with intersections of algebraic varieties -- because intersections work in a nice and general way on the projective complex plane.</div><div><br /></div><div>The idea is that we want to be able to say that the intersection of two 1-dimensional varieties is "pretty much always" 0-dimensional, the intersection of two 2-dimensional varieties is "pretty much always" 1-dimensional, etc. Here are the ways that this can go wrong, or we can have a "non-general" situation:</div><div><ul><li>The varieties are asymptotically parallel (e.g. two parallel lines/planes)</li><li>The intersection has multiplicity (e.g. a paraboloid tangent to a plane)</li><li>The varieties coincide on a region/have a common segment (e.g. 
two coincident lines)</li></ul><div>The&nbsp;<i>first</i>&nbsp;problem in particular is addressed by the&nbsp;<b>projective complex numbers</b>.</div></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/d/d7/Parabola_%26_cubic_curve_in_projective_space.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="600" data-original-width="800" height="399" src="https://upload.wikimedia.org/wikipedia/commons/d/d7/Parabola_%26_cubic_curve_in_projective_space.png" width="532" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Parabola (red) and cubic curve (blue) in projective space</td></tr></tbody></table><br /><div><br /></div><div><br /></div><div>The idea behind projective geometry is that any two parallel lines are said to meet at a particular "point at infinity" (and the point is the same in both directions, because we want to preserve the axiom that&nbsp;<i>any&nbsp;two lines intersect at a single point</i>). 
We thus want to add a point at infinity for each "direction" in the plane.</div><div><br /></div><div>The standard construction to enable this is to define the projective complex set $P\mathbb{C}^n$ as the set of lines through the origin in $\mathbb{C}^{n+1}$, identifying each line with the point where it intersects an off-origin copy of $\mathbb{C}^n$.&nbsp;</div><div><br /></div><div>This construction is directly motivated by perspective drawings -- the origin from which the rays are extended is equivalent to the "eye" from which the perspective is taken (and the rays correspond to the actual rays of light through which the landscape is viewed), and the points at the horizon are seen by rays parallel to the ground.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/2/2a/Railroad-Tracks-Perspective.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="600" height="421" src="https://upload.wikimedia.org/wikipedia/commons/2/2a/Railroad-Tracks-Perspective.jpg" width="316" /></a></div><div><br /></div><div>Co-ordinates in the space $\mathbb{C}^{n+1}$ are then called&nbsp;<b>homogeneous co-ordinates</b>&nbsp;for $P\mathbb{C}^n$ -- and each point in $P\mathbb{C}^n$ can be represented by the co-ordinates of any point on the line in $\mathbb{C}^{n+1}$ it is identified with.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/dBH-Id8VC3U" width="320" youtube-src-id="dBH-Id8VC3U"></iframe></div><br /><hr /><br /><div>The <a href="https://thewindingnumber.blogspot.com/2019/04/geometry-positive-definiteness-and.html">definition of geometry</a>&nbsp;is that it is the set of symmetries of a set. 
For the ordinary geometry we deal with, the manifold is $K^n$ (for some field $K$) and the symmetries are the rigid transformations (combinations of rotations, reflections and translations). In projective geometry, we deal with the symmetries of the manifold $PK^n$ under the so-called <b>projective transformations</b>&nbsp;(or&nbsp;<b>homographies</b>), which generalize affine transformations in that they cannot, in general, be written as affine maps on $K^n$ itself -- though, as we will see, they are linear in homogeneous co-ordinates.</div><div><br /></div><div>The key idea behind projective transformations is that the points at infinity are not truly "special" or "different" from ordinary points, so transformations that transform between Euclidean points and points at infinity are symmetries of the projective space.&nbsp;</div><div><br /></div><div>The obvious way to do this is to define projective transformations on $PK^n$ as restrictions of (non-singular) linear transformations in homogeneous co-ordinates (i.e. on $K^{n+1}$), because these are precisely the transformations that send lines to lines (which are identified with points of $PK^n$).&nbsp;</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-uKTR1N0mCFE/X8qOqbnb1wI/AAAAAAAAObc/oAii_7VPtOIT-A0ONryx_p3qEtdjjy9twCLcBGAsYHQ/s668/homography.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="488" data-original-width="668" height="234" src="https://1.bp.blogspot.com/-uKTR1N0mCFE/X8qOqbnb1wI/AAAAAAAAObc/oAii_7VPtOIT-A0ONryx_p3qEtdjjy9twCLcBGAsYHQ/w320-h234/homography.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The transformation sending the green line to the lime line is a homography on the projective space.</td></tr></tbody></table><br /><div>To express a homography within the projective space, we just evaluate $\phi^{-1}A\phi$, where $\phi$ is the conversion 
to homogeneous co-ordinates and $A$ is the linear transformation in $K^{n+1}$. </div><div><br /></div><div><b>Exercise:</b>&nbsp;show that the homographies on the projective line $PK^1$ are precisely the functions of the form $f(z)=\frac{az+b}{cz+d}$ (with $ad-bc\neq 0$).</div><div><br /></div><div><hr /><br /></div><div><b>Exercise (homogeneous polynomials):</b>&nbsp;show that a polynomial in projective space is a homogeneous polynomial (all its component monomials have the same total degree) in homogeneous co-ordinates.</div><div><br /></div><div>E.g. the polynomial $x^2-y=0$ can be written in homogeneous co-ordinates as $x_0^2-x_1x_2=0$.</div>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-65015606255759666682020-11-22T20:40:00.016+00:002020-12-06T07:32:35.832+00:00Introduction to Algebraic Geometry<p>Algebraic geometry -- at least for the purposes of this article -- is the geometric study of the solution sets of polynomial equations. Specifically, it speaks of a certain duality between $(n+1)$-variable polynomial equations and their $n$-dimensional solution sets, called <b>algebraic varieties</b>. We are interested in the precise nature of this duality, i.e.</p><p></p><ul style="text-align: left;"><li>How "<b>injective</b>" is it, i.e. when do two polynomials have the same solution space?&nbsp;</li><li>How "<b>surjective</b>" is it, i.e. what sort of sets can be algebraic varieties?</li></ul><div>In more general terms:</div><div><ul style="text-align: left;"><li>for a set of polynomials $S$ we write their null set (the intersection of all their null sets) as $V(S)$</li><li>for some set of points $U$ we write the ideal of polynomials whose null set contains it as $I(U)$.&nbsp;</li></ul></div><div>Then we can formulate our previous two questions as:&nbsp;</div><div><ul style="text-align: left;"><li>when is $S=I(V(S))$? (this is the&nbsp;<b>Nullstellensatz problem</b>)</li><li>when is $U=V(I(U))$? 
(this has to do with the <b>Zariski topology</b>)</li></ul></div><hr /><div><b><br /></b></div><div><div>We generally prefer to work over the complex numbers (and in fact something slightly more interesting than the complex numbers) in algebraic geometry, as the fundamental theorem of algebra makes things easier. However, we are often interested in looking at real or even rational/integer solutions (the latter marks the connection between algebraic geometry and&nbsp;<b>number theory</b>) to equations.</div><div><br /></div><div><b>Definition (rational points):</b>&nbsp;Let $V\subseteq K^n$ be an algebraic variety and $K_0\le K$ be a subfield. Then $V\cap {K_0}^n$ is the set of&nbsp;<b>$K_0$-rational points</b>&nbsp;of $V$.</div></div><div><b><br /></b></div><div><b>Exercise:</b>&nbsp;Find all the $\mathbb{Q}$-rational points of the equation ${x_1}^2+{x_2}^2=1$ [e.g. $(3/5, 4/5)$] -- these correspond to the <i>Pythagorean triples</i>, which are the integer solutions of $a^2+b^2=c^2$.</div><div><br /></div><div>We can parameterize this variety (the unit circle) with rational functions using the parameter $t=\tan(\theta/2)$:</div><div><br /></div><div>$$x(t)=\frac{1-t^2}{1+t^2};\,\,\,y(t)=\frac{2t}{1+t^2}$$</div><div>Since $t$ is the slope of the line connecting $(x,y)$ and $(-1, 0)$, the rational points have a rational $t$, and since $(x,y)$ can be written as rational functions of $t$, rational $t$ give rise to rational points. 
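<p>(A quick sanity check of this parametrization, using exact rational arithmetic -- the sketch is mine, not from the post:)</p>

```python
from fractions import Fraction

def circle_point(t):
    # the rational parametrization x = (1-t^2)/(1+t^2), y = 2t/(1+t^2),
    # where t is the slope of the line through (x, y) and (-1, 0)
    t = Fraction(t)
    return (1 - t**2) / (1 + t**2), 2 * t / (1 + t**2)

for t in [Fraction(1, 2), Fraction(2, 3), Fraction(-3, 7)]:
    x, y = circle_point(t)
    assert x**2 + y**2 == 1   # holds exactly, since Fractions are exact
    print(x, y)
# t = 1/2 gives (3/5, 4/5), recovering the Pythagorean triple (3, 4, 5)
```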
Writing $t=u/v$, this leads to</div><div><br /></div><div>$$x(t)=\frac{v^2-u^2}{v^2+u^2};\,\,\,y(t)=\frac{2uv}{u^2+v^2}$$</div><div><br /></div><div>(Analogously, Fermat's last theorem is that for $n\ge 3$, the solution set of ${x_1}^n+{x_2}^n=1$ does <i>not</i>&nbsp;have rational points beyond the trivial ones with ${x_1}{x_2}=0$.)</div><p></p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0tag:blogger.com,1999:blog-3214648607996839529.post-18028711439117788152020-09-13T18:45:00.003+01:002020-09-14T07:47:30.831+01:00Riddles and mental models<p>I've been thinking about some assorted cognitive processes over the past week, trying to remove the mysterious/magical feel to cognition. I have no idea if my observations reflect the cognitive science literature, I don't know if what is true of my mind is true of others', and I haven't done any actual serious testing of my guesses, but as long as the ideas described are a possible description of reality, they should work as an acceptable basis for doing AI, which is the main purpose of cognitive science anyway.</p><p><b>Observation: thoughts are wordless</b></p><p>People often describe their thoughts in words, and often insist that their thoughts <i>are</i>&nbsp;in the form of words mentally. This is obviously false, because (1) words need to be rooted to some notion of meaning, and (2) you need to already <i>know</i>&nbsp;what the sentence you're about to think is before you think it, or you won't get into the right grammar, etc.&nbsp;</p><p>So "thoughts", whatever they are, are not words. So what are they? 
Let's think about some example thoughts one may have, in the form of words:</p><p></p><ul style="text-align: left;"><li>"I'll find carrots tastier than cucumbers, so let me eat it."</li><li>"Alright, let's focus on this for now."</li><li>"It's sunny, let me close the curtain."</li><li>"What are thoughts?"</li><li>"Saying the words <i>What are thoughts?</i>&nbsp;won't help me progress on the question."</li><li>"What will help me progress on the question?"</li><li>"What thoughts am I having right now?"</li><li>"I need to develop a stronger intuition for this"</li><li>"Wait, I thought thoughts <i>weren't</i>&nbsp;words, what's this?"</li><li>"Stupid question, never mind, but I should add that clarification to the sentence above."</li><li>"Ah, stupid spelling mistake."</li></ul><div>I think that the key abstraction here is the idea of <b>mental theories and models</b>. Thoughts are <i>beliefs</i>&nbsp;and <i>reasoning</i>&nbsp;about implications between mental models -- i.e. a very <a href="https://thewindingnumber.blogspot.com/p/model-theory.html" target="_blank">model theoretic</a> concept. This is true for thoughts about how things are, thoughts about decisions to make, introspective thoughts (these are just self-referential sentences), whatever. </div><div><br /></div><div>One may consider these model theoretic sentences to be generalizations of linguistic sentences -- a sort of language that every conscious being naturally has and vividly imagines, <i>lives in</i>. It is not necessary to talk about how your brain attaches meaning to sentences in these models -- associates them with reality, because your brain <i>lives in</i>&nbsp;these models, that <i>is</i>&nbsp;its reality, it <i>is</i>&nbsp;the source of all meaning. 
One of these mental models is the very visual picture you see of your surroundings.</div><div><br /></div><div><b>Observation: riddle solving.</b></div><div><b><br /></b></div><div>There are two kinds of riddles -- one, the ancient Greek kind, like "what has four legs at dawn, two legs at noon, three legs at dusk, one foot stuck in its skull, and zero legs after time-traveling back to dawn?" and "what comes first? the chicken or the egg?". These are stupid and pretentious, and have nothing interesting to tell us.</div><div><br /></div><div>Two, the kind of riddle that people scoff at as childish and uncultured, and is therefore actually somewhat interesting. Some examples:</div><div><ul style="text-align: left;"><li>A bus driver was heading down a street in Colorado. He went right past a stop sign without stopping, he turned left where there was a "no left turn" sign, and he went the wrong way on a one-way street. Then he went on the left side of the road past a cop car. Still - he didn't break any traffic laws. Why not?</li><li>Samuel was out for a walk when it started to rain. He did not have an umbrella and he wasn't wearing a hat. His clothes were soaked, yet not a single hair on his head got wet. How could this happen?</li><li>A pet shop owner had a parrot with a sign on its cage that said "Parrot repeats everything it hears". Davey bought the parrot and for two weeks he spoke to it and it didn't say a word. He returned the parrot but the shopkeeper said he never lied about the parrot. How can this be?</li><li>The 22nd and 24th presidents of the United States of America had the same parents, but were not brothers. 
How can this be possible?</li><li>What goes place to place yet stays in one place?</li><li>What is harder to catch the faster you run?</li></ul><div>(Source: <a href="http://Riddles.com">Riddles.com</a>, a site I frequented as a kid.)</div><div><br /></div><div>(There's also a third kind of riddle, having to do with pattern matching, like <i>What did the ___ tell to the ____?</i>&nbsp;and <i>Why did the _____ _____?</i>)</div></div><div><br /></div><div>When you hear a riddle like the one about Samuel and the rain, your brain immediately infers a <i>particular</i>&nbsp;mental theory from the problem text -- and very quickly ends up at a contradiction (or <i>confusion</i>), something that eliminates all possible models. In a sense, riddles are all about the art of <a href="https://www.lesswrong.com/s/zpCiuR4T343j9WkcK" target="_blank">noticing confusion</a> -- about identifying the axiom you subconsciously assumed in your reasoning. You <i>subconsciously assumed</i>&nbsp;that the bus driver was in a bus, that was included in your mental picture, but was not implied by the form of the question itself.</div><div><br /></div><div>The last two appear different, but are based on the same principle of being able to infer multiple possible theories and evaluate their consequences.</div><p></p>Abhimanyu Pallavi Sudhirhttp://www.blogger.com/profile/17891661511466934614noreply@blogger.com0