Divisions by Two in Collatz Sequences: A Data Science Approach

The Collatz conjecture is an unsolved number theory problem. We approach the question by examining the divisions by two that are performed within Collatz sequences. Aside from classical mathematical methods, we use techniques of data science. Based on the analysis of 10,000 sequences we show that the number of divisions by two lies within clear boundaries. Building on the results, we develop and prove an equation to calculate the maximum possible number of divisions by two for any given Collatz sequence. Whenever this maximum is reached, a sequence leads to the result one, as conjectured by Lothar Collatz. Furthermore, we show how many divisions by two are required for a cycle of a specific length. The findings are valuable for further investigations and could form the basis for a comprehensive proof of the conjecture.


The Problem
The Collatz conjecture is a well-known number theory problem and is the subject of numerous publications.¹ Therefore, our description of the topic will be brief. The mathematician Lothar Collatz introduced a function $g : \mathbb{N} \to \mathbb{N}$ as follows:

$$g(x) = \begin{cases} 3x + 1 & \text{if } x \text{ is odd} \\ x / 2 & \text{if } x \text{ is even} \end{cases}$$

The conjecture, as treated in this paper, claims that the above function leads to the final result one for every natural starting number when applied recursively. The series of numbers produced by this process is called a Collatz sequence. With the aim of contributing to a proof of the conjecture, this paper analyses a central aspect of the problem: the divisions by two.²

Determining Odd Numbers
Sultanow, Koch and Cox demonstrated that the odd numbers of Collatz sequences can be calculated with the following recursive equation:

$$v_{n+1} = 3^n \cdot v_1 \cdot \prod_{i=1}^{n} \left(1 + \frac{1}{3 v_i}\right) \cdot \prod_{i=1}^{n} 2^{-\alpha_i} \tag{2}$$

The variable $v_1$ denotes the first odd number of the sequence, that is, the starting value. The variable $v_i$ symbolises the odd number that is the result of a particular iteration.⁴ The exponent $n$ stands for the count of odd numbers that are processed by the algorithm. In the remainder of this paper we call the parameter $n$ the length of a sequence. Finally, the exponent $\alpha_i$ represents the number of divisions by two that are performed in a specific iteration. Accordingly, the sum of the $\alpha_i$ is the count of divisions by two leading from the starting value $v_1$ to the outcome $v_{n+1}$.⁵ Let us consider the example $v_1 = 13$ and $n = 2$. Applying equation 2 yields:⁶

$$v_{2+1} = 3^2 \cdot 13 \cdot \left(1 + \frac{1}{3 \cdot 13}\right) \left(1 + \frac{1}{3 \cdot 5}\right) \cdot 2^{-3} \cdot 2^{-4} = 1$$

Starting with $v_1 = 33$ for $n = 3$ we obtain the result:

$$v_{3+1} = 3^3 \cdot 33 \cdot \left(1 + \frac{1}{3 \cdot 33}\right) \left(1 + \frac{1}{3 \cdot 25}\right) \left(1 + \frac{1}{3 \cdot 19}\right) \cdot 2^{-5} = 29$$

To improve readability, we denote the factor $1 + \frac{1}{3 v_i}$ with the variable $\beta_i$. In addition, we generalise the formula by replacing the factor three with the variable $k$. This will be useful for further analysis and leads us to the following generalised version of equation 2:

$$v_{n+1} = k^n \cdot v_1 \cdot \prod_{i=1}^{n} \beta_i \cdot \prod_{i=1}^{n} 2^{-\alpha_i}, \qquad \beta_i = 1 + \frac{1}{k \cdot v_i} \tag{3}$$

In order to calculate odd numbers correctly with formula 3, we must first define the halting conditions of the algorithm in the next section.
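The recursion and its closed form can be cross-checked numerically. The following is a minimal sketch, not the authors' code: the function names `odd_steps` and `closed_form` are hypothetical, and the closed form assumes $\beta_i = 1 + 1/(k \cdot v_i)$, which is what the worked examples above imply. Exact rational arithmetic avoids rounding errors.

```python
from fractions import Fraction

def odd_steps(v1, n, k=3):
    """Return the odd numbers v_1..v_{n+1} and the exponents alpha_1..alpha_n."""
    values, alphas = [v1], []
    for _ in range(n):
        x = k * values[-1] + 1      # even, because k and v_i are both odd
        alpha = 0
        while x % 2 == 0:           # divide by two until the result is odd
            x //= 2
            alpha += 1
        values.append(x)
        alphas.append(alpha)
    return values, alphas

def closed_form(values, alphas, k=3):
    """Evaluate k^n * v_1 * prod(beta_i) * 2^(-alpha) exactly."""
    n = len(alphas)
    beta = Fraction(1)
    for v in values[:n]:
        beta *= 1 + Fraction(1, k * v)   # assumed: beta_i = 1 + 1/(k * v_i)
    return k**n * values[0] * beta / 2**sum(alphas)

values, alphas = odd_steps(13, 2)
print(values, alphas, closed_form(values, alphas))   # [13, 5, 1] [3, 4] 1
```

The step-by-step iteration and the product formula agree for both examples in the text ($v_1 = 13$ and $v_1 = 33$).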

Halting Conditions
In compliance with the Collatz conjecture, algorithms 2 and 3 halt if at least one of the following conditions is fulfilled:

$$1.\; v_{n+1} = 1 \qquad\qquad 2.\; v_{n+1} \in \{v_1, v_2, \ldots, v_n\}$$

When the first condition applies, the Collatz conjecture is true for the specific sequence. If the second condition is fulfilled, the sequence has led to a cycle. For every starting value except $v_1 = 1$, such a cycle falsifies the Collatz conjecture.⁷ Let us consider the example $k = 3$, $v_1 = 13$ and $n = 2$. Applying equation 3 yields:

$$v_{2+1} = 3^2 \cdot 13 \cdot \left(1 + \frac{1}{3 \cdot 13}\right) \left(1 + \frac{1}{3 \cdot 5}\right) \cdot 2^{-3} \cdot 2^{-4} = 1$$

In the above example the algorithm halts after two iterations because the first condition is fulfilled. If we examine the case $v_1 = 1$, we realise that the algorithm finishes after the first iteration, since both halting conditions are true:

$$v_{1+1} = 3^1 \cdot 1 \cdot \left(1 + \frac{1}{3 \cdot 1}\right) \cdot 2^{-2} = 1$$

Here the sequence stops because the result is one. At the same time, the sequence has led to a cycle.

⁵ For a glossary of notations see section "Glossary of Notations" in the appendix.
⁶ The result of the first iteration $v_{1+1}$ equals five.
⁷ This statement refers to the Collatz conjecture in its original form $3v + 1$.
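The two halting conditions can be sketched as a small iteration loop. This is an illustrative sketch only: the function name `collatz_odd_sequence` and the iteration limit `max_n` are assumptions, and the cycle test assumes that the second condition means "the new odd number has already occurred in the sequence".

```python
def collatz_odd_sequence(v1, k=3, max_n=100):
    """Iterate v -> (k*v + 1) / 2^alpha until a halting condition holds.

    Returns (values, reason), where reason is "one", "cycle" or "limit".
    """
    values = [v1]
    for _ in range(max_n):
        x = k * values[-1] + 1
        while x % 2 == 0:
            x //= 2
        hit_cycle = x in values      # condition 2: value occurred before
        values.append(x)
        if x == 1:                   # condition 1: the sequence reached one
            return values, "one"
        if hit_cycle:
            return values, "cycle"
    return values, "limit"           # assumed safeguard, not a halting condition

print(collatz_odd_sequence(13))        # ([13, 5, 1], 'one')
print(collatz_odd_sequence(1))         # ([1, 1], 'one')
print(collatz_odd_sequence(13, k=5))   # ([13, 33, 83, 13], 'cycle')
```

For $v_1 = 1$ both conditions hold at once; the sketch reports the first one.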

Boundaries of $\alpha_i$
We know that in every iteration of equations 2 and 3 at least one division by two is performed. This follows from the constraints of the Collatz problem. Consequently, we can define the minimum of $\alpha_i$ with the following condition:

$$\alpha_i \geq 1$$

The maximum can be specified in a similarly easy way. According to the halting conditions defined in the previous section, a Collatz sequence finishes when $v_{n+1} = 1$. The maximum of $\alpha_i$, hereinafter called $\hat{\alpha}_i$, can hence be defined as:

$$2^{\hat{\alpha}_i} = k \cdot v_i + 1 \quad\Rightarrow\quad \hat{\alpha}_i = \log_2(k \cdot v_i + 1) = \log_2 k + \log_2 v_i + \log_2 \beta_i \tag{5}$$

The formula above builds on the fact that the expression $2^{\hat{\alpha}_i}$ must equal the next even number $k \cdot v_i + 1$ in order to lead to $v_{n+1} = 1$. If it were greater, the result $v_{n+1}$ would be less than one. The second step inverts the exponentiation by taking the binary logarithm. Finally, we replace the operation plus one by the factor $\beta_i$, since $k \cdot v_i + 1 = k \cdot v_i \cdot \beta_i$. For a better understanding of the above term, let us consider the example $k = 3$ and $v_1 = 5$. In this case equation 5 results in:

$$\hat{\alpha}_1 = \log_2 3 + \log_2 5 + \log_2 \frac{16}{15} = \log_2 16 = 4$$

Whenever a sequence reaches the maximum $\hat{\alpha}_i$, it finishes with one, thus verifying the Collatz conjecture for that sequence. If we could prove that every odd number finally leads to this maximum for $k = 3$, the Collatz problem would be solved. In summary, we can define the following boundaries for $\alpha_i$:

$$1 \leq \alpha_i \leq \log_2 k + \log_2 v_i + \log_2 \beta_i \tag{6}$$

Before we continue, we validate theorem 6 empirically. We will do so at various points in this paper to avoid obvious errors in our mathematical reasoning. The basis for the validation is a sample of 10,000 Collatz sequences. The data set comprises information about sequences for the odd starting numbers $v_1 \in \{1, 3, 5, \ldots, 3999\}$ and $k \in \{1, 3, 5, 7, 9\}$. Since we do not know whether all generated sequences halt, we limited the number of iterations per sequence to $n = 100$. For further details on the data set, see section "Data Set" in the appendix.
Unsurprisingly, we found that all values of $\alpha_i$ in the sample are compliant with theorem 6.⁸ In the next section we move on to more sophisticated considerations and study the properties of $\prod_{i=1}^{n} 2^{\alpha_i}$.
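A sample with the same parameters can be regenerated in a few lines. The sketch below is an assumption about how such a data set might be produced, not the script described in the appendix; the generator name `alpha_exponents` is hypothetical. The upper bound $\alpha_i \leq \log_2(k \cdot v_i + 1)$ is checked in integer form as $2^{\alpha_i} \leq k \cdot v_i + 1$ to avoid floating-point rounding.

```python
def alpha_exponents(v1, k, max_n=100):
    """Yield (v_i, alpha_i) pairs until the sequence halts or max_n is reached."""
    seen, v = {v1}, v1
    for _ in range(max_n):
        x, alpha = k * v + 1, 0
        while x % 2 == 0:
            x //= 2
            alpha += 1
        yield v, alpha
        if x == 1 or x in seen:      # halting conditions: one reached, or cycle
            return
        seen.add(x)
        v = x

sequences = 0
for k in (1, 3, 5, 7, 9):
    for v1 in range(1, 4000, 2):     # odd starting numbers 1, 3, ..., 3999
        sequences += 1
        for v, alpha in alpha_exponents(v1, k):
            # 2^alpha divides k*v + 1, hence 1 <= alpha <= log2(k*v + 1)
            assert alpha >= 1 and 2**alpha <= k * v + 1

print(sequences)   # 10000
```

The upper bound holds by construction, because $2^{\alpha_i}$ is a divisor of $k \cdot v_i + 1$; the loop merely confirms this for the sample.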

Boundaries of α
In equations 2 and 3, the expression $\prod_{i=1}^{n} 2^{\alpha_i}$ represents the divisions by two performed by the algorithms. The number of divisions by two can be determined with the following formula and will be symbolised by $\alpha$:

$$\alpha = \sum_{i=1}^{n} \alpha_i$$

⁸ Source: Own empirical analysis, see appendix "Data Set" for details.

International Journal of Pure Mathematical Sciences Vol. 21 3
Based on theorem 6 we can define the minimum of $\alpha$ as follows:

$$\alpha \geq n$$

Since we carry out at least one division by two in every iteration of formulas 2 and 3, the minimum of $\alpha$ equals the sequence's length. The maximum value is harder to determine. In the first step we derive it empirically from the data set mentioned in the previous section. Based on the observed data we formulate the hypothesis that the maximum of $\alpha$, denoted $\hat{\alpha}$, can be calculated with the following equation:

$$\hat{\alpha} = \lfloor n \cdot \log_2 k + \log_2 v_1 \rfloor + 1 \tag{7}$$

The hypothesis holds for all Collatz sequences in the empirical data set.⁹ If a Collatz sequence reaches the above maximum, it finishes with one, as conjectured by Lothar Collatz.¹⁰ Let us, for example, consider the case where $v_1 = 13$, $n = 2$ and $k = 3$. Applying theorem 7 and formula 3 leads to:

$$\hat{\alpha} = \lfloor 2 \cdot \log_2 3 + \log_2 13 \rfloor + 1 = 7, \qquad v_{2+1} = 3^2 \cdot 13 \cdot \left(1 + \frac{1}{39}\right) \left(1 + \frac{1}{15}\right) \cdot 2^{-7} = 1$$

The empirical validation supports our hypothesis, but does not prove it for all Collatz sequences. Throughout the next sections we will formulate a comprehensive proof of theorem 7 step by step.
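The hypothesised maximum can be evaluated without floating-point logarithms. The sketch below relies on the identity $\lfloor \log_2 m \rfloor + 1 = $ `m.bit_length()` for positive integers $m$, so that $\lfloor n \cdot \log_2 k + \log_2 v_1 \rfloor + 1$ is the bit length of $k^n \cdot v_1$. The function names `alpha_max` and `total_divisions` are hypothetical.

```python
def alpha_max(v1, n, k=3):
    """floor(n*log2(k) + log2(v1)) + 1, computed exactly as a bit length."""
    return (k**n * v1).bit_length()

def total_divisions(v1, n, k=3):
    """Return (alpha, v_{n+1}): the divisions by two over n odd steps and the result."""
    v, alpha = v1, 0
    for _ in range(n):
        x = k * v + 1
        while x % 2 == 0:
            x //= 2
            alpha += 1
        v = x
    return alpha, v

print(total_divisions(13, 2), alpha_max(13, 2))   # (7, 1) 7
print(total_divisions(33, 3), alpha_max(33, 3))   # (5, 29) 10
```

For $v_1 = 13$ the maximum is reached and the sequence ends in one; for $v_1 = 33$ and $n = 3$ the count stays well below the bound.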

Proving $\hat{\alpha}$ for $k = 1$
First, we examine the case $k = 1$, where theorem 7 can be simplified as follows:

$$\hat{\alpha} = \lfloor \log_2 v_1 \rfloor + 1 \tag{8}$$

In order to prove theorem 7, we have to demonstrate that the number of divisions by two, $\alpha$, is less than or equal to the maximum $\hat{\alpha}$. This can be achieved by analysing the binary representation of Collatz numbers.¹¹ Let us consider the case $v_1 = 25$ and $k = 1$ in the decimal system. Applying equation 3 leads to the sequence shown in table 1.
The sequence presented in table 1 starts with the decimal number $v_1 = 25$ at $n = 1$. Subsequently it comprises the odd numbers $v_2 = 13$, $v_3 = 7$ and finally $v_4 = 1$. In the binary system the sequence accordingly starts with $v_1 = 11001_2$. The binary length of the starting number, $\operatorname{len}(v_1)$, equals five.¹²

⁹ Source: Own empirical analysis, see appendix "Data Set" for details.
¹⁰ The parameter $n$, representing the length of a sequence, cannot be predicted for a specific $k$ and $v_1$ with the formula.
¹¹ To avoid confusion between decimal and binary numbers, we label binary numbers with a subscripted 2.
¹² By binary length we mean the number of digits of a binary number.

This observation is crucial for our proof. For clarification, it is important to note that the length of a binary number can be calculated with the following equation:¹³

$$\operatorname{len}(v_i) = \lfloor \log_2 v_i \rfloor + 1 \tag{9}$$

For example, consider the case $v_i = 13$ in decimal, which is $v_i = 1101_2$ in binary. Here, equation 9 leads to the following result:

$$\operatorname{len}(13) = \lfloor \log_2 13 \rfloor + 1 = 3 + 1 = 4$$

Comparing equation 9 with formula 8 makes it clear that they are identical. This raises the question of why the maximum number of divisions by two of a Collatz sequence corresponds to the binary length of $v_1$.¹⁴ To answer this, we take a closer look at the mechanics of a Collatz sequence in the binary system.
We start with $v_1 = 11001_2$ in the above example. Adding one, we obtain the even number $v_1 + 1 = 11010_2$. The binary length of $v_1$ equals the binary length of $v_1 + 1$, which is five. The trailing zero shows immediately that $v_1 + 1$ is even. A division by two can be performed in the binary system by deleting the trailing zero. The result is $v_2 = 1101_2$. Adding one again leads to the next even number $v_2 + 1 = 1110_2$. Deleting the trailing zero once more results in $v_3 = 111_2$.
Up to this point we have performed two divisions by two. The parameter $\alpha$ therefore equals two. The case $v_3 = 111_2$ is very important for our proof. Adding one to $v_3 = 111_2$ leads to an overflow of the binary number. As a result, we obtain the even number $v_3 + 1 = 1000_2$, which is a power of two and equals $2^3$ in decimal. Knowing that every power of two in a Collatz sequence directly leads to the terminal value $v_{n+1} = 1$, we can deduce that the sequence ends after the third iteration.
The binary length $\operatorname{len}(v_3) = 3$ increases to $\operatorname{len}(v_3 + 1) = 4$ in the final step. This situation occurs only once in a Collatz sequence for $k = 1$. Whenever adding one to a number $v_n$ causes an overflow of its binary representation, the result $v_n + 1$ is a power of two. In this scenario the binary length increases from $\operatorname{len}(v_n)$ to $\operatorname{len}(v_n) + 1$, and the sequence consequently halts. For all other cases the following condition applies:¹⁵

$$\operatorname{len}(v_n + 1) = \operatorname{len}(v_n)$$

Only the final iteration increases the length of the binary number. In any other case the binary length decreases from $v_n$ to $v_{n+1}$.
Let us now reflect on what this implies for the maximum $\hat{\alpha}$. We know that the binary length of the starting value $v_1$ can be calculated with equation 9. In order to reach the final result $v_{n+1} = 1$, starting at $v_1$, we have to perform the following number of divisions by two:

$$\hat{\alpha} = \operatorname{len}(v_1) + 1 - \operatorname{len}(v_{n+1}) = \lfloor \log_2 v_1 \rfloor + 1 + 1 - 1 = \lfloor \log_2 v_1 \rfloor + 1$$

The equation builds on the binary length of the starting value, $\operatorname{len}(v_1)$. We add one to account for the binary overflow in the final iteration. Furthermore, we subtract the binary length of the final result, $\operatorname{len}(v_{n+1}) = 1$ for $v_{n+1} = 1$. No value of $\alpha$ can exceed this maximum, since $\hat{\alpha}$ directly leads to the terminal value $v_{n+1} = 1$, halting the sequence.
The above equation thus proves theorem 7 for k = 1. In the next section we will explain why this argumentation is in principle valid for all k.
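The binary argument for $k = 1$ can be illustrated with Python's built-in integer operations. This is a minimal sketch with a hypothetical function name `divisions_k1`; it counts the divisions by two until the sequence reaches one and compares the count with the binary length `v1.bit_length()`.

```python
def divisions_k1(v1):
    """Count the divisions by two for k = 1 until the sequence reaches one."""
    v, alpha = v1, 0
    while True:
        x = v + 1                # k = 1: add one to the odd number
        while x % 2 == 0:        # delete trailing zeros of the binary number
            x //= 2
            alpha += 1
        if x == 1:
            return alpha
        v = x

for v1 in (25, 13, 7):
    # starting value, binary digits, binary length, divisions by two
    print(v1, bin(v1)[2:], v1.bit_length(), divisions_k1(v1))
# 25 11001 5 5
# 13 1101 4 4
# 7 111 3 3
```

In each row the number of divisions by two equals the binary length of the starting value, in line with equation 8.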

Proving $\hat{\alpha}$ for $k > 1$
Let us now examine the case $k = 3$, which is the most interesting because it relates to the original Collatz conjecture. The first question we need to address is whether the principles discussed in the previous section are transferable to this form of the problem. To find an answer, we analyse a sequence starting with $v_1 = 17$ and $k = 3$. The results are displayed in table 2. The example presented in table 2 reveals that, in comparison to the previous case $k = 1$, the algorithm performs an additional operation, namely the multiplication by three. This operation leads to a growth of the binary length when comparing $v_n$ to $3 v_n$. The result of the operation can be calculated as follows:

$$\operatorname{len}(3 v_n) = \lfloor \log_2 3 + \log_2 v_n \rfloor + 1$$

In determining the maximum $\hat{\alpha}$ for $k = 3$, we have to take the additional binary growth into account. With regard to the operation $+1$ we can use the same arguments as in the previous section. Whenever adding one leads to an overflow in the binary representation of $3 v_n$, the result is a power of two, halting the sequence. In this case the length of $3 v_n + 1$ exceeds the length of $3 v_n$ by one. This can happen only once in a Collatz sequence, since the resulting power of two leads to termination.
In order to prove our hypothesis, we have to adjust equation 8 by considering the additional binary growth that is caused by the multiplications by three. We therefore obtain the following formula:

$$\hat{\alpha} = \lfloor n \cdot \log_2 3 + \log_2 v_1 \rfloor + 1$$

The above term proves theorem 7 for the case $k = 3$. A closer look makes clear that it is valid not only for $k = 3$, but for all $k$. The appendix outlines an alternative approach to the verification of theorem 7. In conclusion, we can define the following boundaries for the number of divisions by two in a Collatz sequence:

$$n \leq \alpha \leq \lfloor n \cdot \log_2 k + \log_2 v_1 \rfloor + 1$$

If one could establish that every sequence finally leads to $\hat{\alpha}$, that is, to a binary overflow of $3 v_n + 1$, the Collatz problem would be solved. In the following we discuss the consequences of our findings for the occurrence of cycles and further confirm our line of reasoning.

Definition
A promising way to falsify the Collatz conjecture in its original form is to find a cycle. We have found such a counterexample if the following halting condition from section "Introduction" is fulfilled:

$$v_{n+1} \in \{v_1, v_2, \ldots, v_n\}$$

The single known cycle for $k = 3$ is the trivial one starting with $v_1 = 1$:

$$v_{1+1} = 3^1 \cdot 1 \cdot \left(1 + \frac{1}{3 \cdot 1}\right) \cdot 2^{-2} = 1$$

The Collatz conjecture claims that the above example is the only possible cycle for $k = 3$. Based on equation 3, and setting $v_{n+1} = v_1$, we derive the following condition for the occurrence of a cycle within a Collatz sequence:¹⁶

$$2^{\alpha} = k^n \cdot \prod_{i=1}^{n} \beta_i \tag{13}$$

For the convenience of the reader, the expression $\prod_{i=1}^{n} \beta_i$ will subsequently be referred to as $\beta$. Showing that, for $k = 3$, equation 13 can only be satisfied by the trivial cycle would partially prove the Collatz conjecture. Yet there would still remain the possibility of an eternally growing sequence. This makes theorem 7 particularly interesting.
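A major difficulty in analysing cycles in Collatz sequences is that there seems to be just one example. This is, however, not true for our generalised form of the problem. Let us consider the case $k = 5$ and $v_1 = 13$. Applying formula 3 leads to a cycle of length $n = 3$:

$$v_{3+1} = 5^3 \cdot 13 \cdot \left(1 + \frac{1}{5 \cdot 13}\right) \left(1 + \frac{1}{5 \cdot 33}\right) \left(1 + \frac{1}{5 \cdot 83}\right) \cdot 2^{-7} = 13$$

Setting $k = 5$ and $v_1 = 13$ in equation 13, we obtain the following result after three iterations:

$$2^7 = 5^3 \cdot \left(1 + \frac{1}{65}\right) \left(1 + \frac{1}{165}\right) \left(1 + \frac{1}{415}\right) = 128$$

To determine the number of divisions by two that can lead to a cycle, we need to investigate the parameter $\beta$ more thoroughly.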
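The cycle for $k = 5$ and $v_1 = 13$ can be verified with exact fractions. The sketch below assumes that the cycle condition obtained from equation 3 with $v_{n+1} = v_1$ reads $2^{\alpha} = k^n \cdot \beta$, where $\beta$ is the product of the factors $1 + 1/(k \cdot v_i)$; the function name `cycle_check` is hypothetical.

```python
from fractions import Fraction

def cycle_check(v1, n, k):
    """Return (v_{n+1}, alpha, k^n * beta) after n odd steps."""
    v, alpha, beta = v1, 0, Fraction(1)
    for _ in range(n):
        beta *= 1 + Fraction(1, k * v)   # assumed: beta_i = 1 + 1/(k * v_i)
        x = k * v + 1
        while x % 2 == 0:
            x //= 2
            alpha += 1
        v = x
    return v, alpha, k**n * beta

v_end, alpha, rhs = cycle_check(13, 3, 5)
print(v_end, alpha, rhs, rhs == 2**alpha)   # 13 7 128 True
```

The sequence returns to 13 after three odd steps and seven divisions by two, and $5^3 \cdot \beta$ equals $2^7$ exactly.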

Analysing β
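The starting point of our analysis of $\beta$ is theorem 7. The formula can be used to calculate the maximum possible number of divisions by two of a Collatz sequence:

$$\hat{\alpha} = \lfloor n \cdot \log_2 k + \log_2 v_1 \rfloor + 1$$

In section "Analysing α" we showed that $\hat{\alpha}$ relates to the binary length of the starting value $v_1$. Furthermore, the equation accounts for the binary growth caused by the $n$-fold multiplication by $k$ as well as the final overflow triggered by the operation $+1$. If a sequence reaches $\hat{\alpha}$, it halts at the terminal value $v_{n+1} = 1$. In order to learn more about the parameter $\beta = \prod_{i=1}^{n} \beta_i$, we take a look at the relation between theorem 7 and equation 3. We examine the situation in which formula 3 leads to the final result one. Consequently, we set $v_{n+1} = 1$ and $\alpha = \hat{\alpha}$:

$$1 = k^n \cdot v_1 \cdot \beta \cdot 2^{-\hat{\alpha}}$$

$$\log_2 \beta = \hat{\alpha} - n \cdot \log_2 k - \log_2 v_1$$

$$\log_2 \beta = -n \cdot \log_2 k - \log_2 v_1 + \lfloor n \cdot \log_2 k + \log_2 v_1 \rfloor + 1 \tag{14}$$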

For a better understanding of the above term, let us examine two examples. We begin with the border case where $k = 1$ and $v_1 = 1$. Here, equation 14 leads to:

$$\log_2 \beta = -n \cdot \log_2 1 - \log_2 1 + \lfloor n \cdot \log_2 1 + \log_2 1 \rfloor + 1 = 1, \qquad \beta = 2$$

Moreover, we study the example where $k = 5$, $v_1 = 19$ and $n = 2$. Equation 14 in this case results in:

$$\log_2 \beta = -2 \cdot \log_2 5 - \log_2 19 + \lfloor 2 \cdot \log_2 5 + \log_2 19 \rfloor + 1 \approx 0.1082, \qquad \beta \approx 1.0779$$

Based on equation 14 and the fact that $\beta$ must always be greater than one, we define the following boundaries of $\beta$:

$$1 < \beta \leq 2 \tag{15}$$

The limits formulated by theorem 15 are confirmed through a validation with our empirical data set.¹⁷ Figure 1 shows the maximum $\beta$ for different values of $k$ in the sample. The diagram also depicts the corresponding starting number $v_1$ that leads to this maximum. As we can see from figure 1, the maximum for $k = 1$ equals 2. The maxima for the other values of $k$ lie below this value. For example, the maximum for $k = 3$ equals $\frac{4}{3} \approx 1.33$. The diagram reveals that the maximum $\beta$ for every $k$ is reached for the starting number $v_1 = 1$. A proof of this finding will be provided in a future article. In the next section we discuss the implications of theorem 15 for the occurrence of cycles.

¹⁷ Source: Own empirical analysis, see appendix "Data Set" for details.
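The examples for $\beta$ can be reproduced exactly with rational arithmetic. The following sketch assumes $\beta = \prod_{i=1}^{n} \left(1 + 1/(k \cdot v_i)\right)$ and stops at the halting conditions used throughout the paper; the function name `beta_product` and the limit `max_n` are hypothetical.

```python
from fractions import Fraction

def beta_product(v1, k, max_n=100):
    """Return (beta, n) for the sequence from v1, halting at one or at a cycle."""
    seen, v, beta = {v1}, v1, Fraction(1)
    for n in range(1, max_n + 1):
        beta *= 1 + Fraction(1, k * v)   # assumed: beta_i = 1 + 1/(k * v_i)
        x = k * v + 1
        while x % 2 == 0:
            x //= 2
        if x == 1 or x in seen:          # halting conditions
            return beta, n
        seen.add(x)
        v = x
    return beta, max_n

print(beta_product(1, 1))    # (Fraction(2, 1), 1)
print(beta_product(1, 3))    # (Fraction(4, 3), 1)
print(beta_product(19, 5))   # (Fraction(512, 475), 2)
```

The value $512/475 \approx 1.0779$ matches the second worked example, and all three results lie in the interval $1 < \beta \leq 2$.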

Analysing $\bar{\alpha}$
How many divisions by two can lead to a cycle within a Collatz sequence? We can derive an equation for this number, subsequently called $\bar{\alpha}$, on the basis of formula 3 and theorem 15. To this end, we examine the case in which equation 3 leads to a cycle by setting $v_{n+1} = v_1$:

$$\begin{aligned} v_1 &= k^n \cdot v_1 \cdot \beta \cdot 2^{-\bar{\alpha}} \\ 2^{\bar{\alpha}} &= k^n \cdot \beta \\ \bar{\alpha} &= n \cdot \log_2 k + \log_2 \beta \\ \bar{\alpha} &= \lfloor n \cdot \log_2 k \rfloor + 1 \end{aligned} \tag{16}$$

The last transformation above is applied since $\bar{\alpha}$ is a whole number.¹⁸ Because $1 < \beta \leq 2$, and therefore $0 < \log_2 \beta \leq 1$, we truncate the fractional part of $n \cdot \log_2 k$ and add one to the result. In a Collatz sequence a cycle can only occur if the number of divisions by two equals $\bar{\alpha}$. Conversely, this does not imply that reaching $\bar{\alpha}$ inevitably leads to a cycle. The following example demonstrates this. Let us consider the case where $k = 3$, $v_1 = 83$ and $n = 3$. Here, theorem 16 and formula 3 yield the following result:

$$\bar{\alpha} = \lfloor 3 \cdot \log_2 3 \rfloor + 1 = 5, \qquad v_{3+1} = 3^3 \cdot 83 \cdot \left(1 + \frac{1}{249}\right) \left(1 + \frac{1}{375}\right) \left(1 + \frac{1}{141}\right) \cdot 2^{-5} = 71 \neq v_1$$

Before we continue, we validate theorem 16 empirically. Our tool is a linear search performed by a Python script. For details on the program see section "Cycle Finder" in the appendix. With the script we searched for and evaluated cycles in Collatz sequences for the odd starting numbers $v_1 \in \{1, 3, 5, \ldots, 9999\}$ and $k \in \{1, 3, 5, \ldots, 999\}$. In order to restrict the runtime of the program we limited the length of the investigated cycles to $n = 100$. The results of our empirical validation are shown in table 3. As one can see in table 3, we found several cycles for our generalised form of the Collatz problem, all of which comply with theorem 16.¹⁹
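A linear search of this kind can be sketched in a few lines. The code below is not the "Cycle Finder" script from the appendix but a reduced illustration with hypothetical names (`find_cycle`, `alpha_cycle`) and a much smaller search range. It uses the identity $\lfloor n \cdot \log_2 k \rfloor + 1 = $ `(k**n).bit_length()`, which holds for every positive integer $k^n$.

```python
def find_cycle(v1, k, max_n=100):
    """Return (n, alpha) if the sequence returns to v1 within max_n steps, else None."""
    v, alpha = v1, 0
    for n in range(1, max_n + 1):
        x = k * v + 1
        while x % 2 == 0:
            x //= 2
            alpha += 1
        v = x
        if v == v1:
            return n, alpha
    return None

def alpha_cycle(n, k):
    """floor(n * log2(k)) + 1, computed exactly as the bit length of k^n."""
    return (k**n).bit_length()

# Reduced search range (assumption): k in {1, 3, ..., 9}, v1 in {1, 3, ..., 199}
for k in range(1, 10, 2):
    for v1 in range(1, 200, 2):
        hit = find_cycle(v1, k)
        if hit:
            n, alpha = hit
            print(k, v1, n, alpha, alpha == alpha_cycle(n, k))
```

For $k = 5$ and $v_1 = 13$ the search reports $n = 3$ and $\alpha = 7$, which equals `alpha_cycle(3, 5)`. The example $k = 3$, $v_1 = 83$ reaches five divisions by two after three steps without returning to 83, so no cycle is reported.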

Binary Growth
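As we have emphasised at several points in this paper, theorem 7 builds on the binary length of the starting value, $\operatorname{len}(v_1)$. Furthermore, it accounts for the maximum binary growth, henceforth denoted by $\hat{\Lambda}$. We define binary growth as the total number of digits by which the binary length of $v_1$ increases in a sequence.²⁰ In order to reach the final result $v_{n+1} = 1$ with $\operatorname{len}(v_{n+1}) = 1$, we have to subtract $\hat{\alpha}$ from the sum of the binary length of $v_1$ and the binary growth:

$$\begin{aligned} 1 &= \operatorname{len}(v_1) + \hat{\Lambda} - \hat{\alpha} \\ \hat{\Lambda} &= \lfloor n \cdot \log_2 k + \log_2 v_1 \rfloor - \lfloor \log_2 v_1 \rfloor + 1 \\ \lfloor n \cdot \log_2 k \rfloor + 1 &\leq \hat{\Lambda} \leq \lfloor n \cdot \log_2 k \rfloor + 2 \end{aligned} \tag{17}$$

In the final step the above equation is condensed by eliminating the starting value $v_1$. As a result, we obtain a range for $\hat{\Lambda}$. The reason is a possible overflow which can be caused by the expression $n \cdot \log_2 k + \log_2 v_1$. Let us examine two examples to illustrate this. Starting with the case $k = 3$, $v_1 = 13$ and $n = 2$, we find that the result is equal to the lower limit of $\hat{\Lambda}$:

$$\hat{\Lambda} = \lfloor 2 \cdot \log_2 3 + \log_2 13 \rfloor - \lfloor \log_2 13 \rfloor + 1 = 6 - 3 + 1 = 4 = \lfloor 2 \cdot \log_2 3 \rfloor + 1$$

Setting $v_1 = 7$, $k = 3$ and $n = 5$ leads to the upper limit of the variable:

$$\hat{\Lambda} = \lfloor 5 \cdot \log_2 3 + \log_2 7 \rfloor - \lfloor \log_2 7 \rfloor + 1 = 10 - 2 + 1 = 9 = \lfloor 5 \cdot \log_2 3 \rfloor + 2$$

The parameter $\hat{\Lambda}$ represents the maximum binary growth of a Collatz sequence. In other words, the binary growth of a sequence cannot exceed $\hat{\Lambda}$, even if we did not perform any divisions by two. Examining formula 17, it is not surprising that we find the following relation to theorem 16:

$$\bar{\alpha} = \lfloor n \cdot \log_2 k \rfloor + 1 \leq \hat{\Lambda}$$

As we know, a cycle occurs in a Collatz sequence when the condition $v_1 = v_{n+1}$ is fulfilled. The binary length of the starting number $v_1$ must therefore grow exactly as much as it is reduced by the divisions by two. Thus, for a cycle to occur, the number of divisions by two has to be equal to the binary growth.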
One might argue that this reasoning is erroneous since a sequence does not necessarily reach the maximum binary growth. We build on formula 3 to show that our arguments are valid. By setting $v_{n+1} = v_1$ we examine the case where the growth of the binary length of a sequence is neutralised by the divisions by two. The binary growth of such a cycle is subsequently called $\bar{\Lambda}$:

$$v_1 = k^n \cdot v_1 \cdot \beta \cdot 2^{-\bar{\Lambda}} \quad\Rightarrow\quad \bar{\Lambda} = n \cdot \log_2 k + \log_2 \beta$$

Knowing that $1 < \beta \leq 2$, we derive the following limits for $\bar{\Lambda}$:

$$n \cdot \log_2 k < \bar{\Lambda} \leq \lfloor n \cdot \log_2 k \rfloor + 1$$

The binary growth of every Collatz sequence that leads to a cycle must lie within these boundaries. Since $\bar{\alpha}$ is a whole number, it must equal the maximum on the right side of the expression. For all other cases a cycle is impossible.

²⁰ This means that $\hat{\Lambda}$ does not account for the divisions by two that reduce the binary length of $v_1$.
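The two binary-growth examples can be recomputed with integer bit lengths. The sketch assumes that the maximum binary growth follows from theorem 7 as $\hat{\Lambda} = \hat{\alpha} - \operatorname{len}(v_1) + 1$ and that its limits are $\lfloor n \cdot \log_2 k \rfloor + 1$ and $\lfloor n \cdot \log_2 k \rfloor + 2$; the function names `growth_max` and `growth_bounds` are hypothetical.

```python
def growth_max(v1, n, k):
    """Maximum binary growth: alpha_hat - len(v1) + 1, using exact bit lengths."""
    alpha_hat = (k**n * v1).bit_length()   # floor(n*log2 k + log2 v1) + 1
    return alpha_hat - v1.bit_length() + 1

def growth_bounds(n, k):
    """Assumed limits: floor(n*log2 k) + 1 and floor(n*log2 k) + 2."""
    lower = (k**n).bit_length()
    return lower, lower + 1

print(growth_max(13, 2, 3), growth_bounds(2, 3))   # 4 (4, 5)
print(growth_max(7, 5, 3), growth_bounds(5, 3))    # 9 (8, 9)
```

The first example meets the lower limit and the second the upper limit, as described in the text.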

Summary
In our paper we have shed light on a central aspect of the Collatz conjecture: the divisions by two. We analysed the problem in its original form 3v + 1 as well as in the generalised variant kv + 1. Based on mathematical reasoning and empirical studies we derived and proved theorems on the occurrence of cycles and the termination of sequences. Our reasoning primarily builds on the binary representation of Collatz numbers and the underlying operations. Theorem 16 determines the number of divisions by two that can lead to a cycle. The theorem is based on the simple truth that a cycle can only occur if the binary growth of a sequence is exactly neutralised by the divisions by two. Theorem 7 determines the maximum number of divisions by two that can be performed in a sequence. If one could show that every starting number finally leads to this maximum, the Collatz problem would be solved. We are convinced that a profound study of the binary mechanics of Collatz sequences will lead to this proof.
The above term represents the (hypothetical) case in which a sequence rises to its maximum for a specific starting value $v_1$. For the Collatz conjecture this is the worst-case scenario, because the equation never leads to the result one due to the steady increase. Let us consider the example $v_1 = 7$ and $n = 1$.