Sequence Indexing

Introduction

The theory of Sequence Indexing deals with the notion of structured Enumeration, the mathematical concept of counting objects. It tries to answer one simple question : Given a collection of objects - how to create, access and identify various groups of objects drawn from that collection?

This article presents few classes that abstact the formulae for indexing and de-indexing some of the well-known Sequences.

By Sequence, here, we mean, collection of all possible groups that could be formed with any given set of symbols. The exact group formation mechanism may vary from Sequence to Sequence. For example, considering the Sequence {123,132,213,231,312,321 }, it was made of three symbols 1 ,2,3 and has six unique symbol-groups such as 123, 132 etc.. which are called sequents (1stSequent=123, 2ndSequent=132 ... 6thSequent=321 ). The number of sequents in any Sequence gives the length of that Sequence. For this example, the length of the Sequence is 6.

Depending on the mechanism of formation, each Sequence would have its own regulations and principles. In the following, we discuss the details of some of the common sequences we find useful in daily life, such as the Range Sequence, Permutation Sequence and the Combinational Sequence. These Sequences were abstracted into different classes aptly named: CRangeSequenceIndexer, CPermutationSeuqnceIndexer and CCombinationalSequenceIndexer respectively.

While these classes use different mathematical manipulations for their indexing mechanism, they all, nevertheless, have one single operation in common: Factorial. Written as n! to indicate Factorial(n), it is equivalent to n*(n-1)*(n-2)*...*3*2*1, the multiplication of all consecutive integers right from n down to 1 , it finds important role in all these classes. One of the nice properties of Factorial is its ability to be expressed in its own terms such as Factorial(n) = n * Factorial(n-1); For example,

Factorial(4) = 4! = 4 * 3 * 2 * 1 = 24;

Factorial(5) = 5! = 5 * 4 * 3 * 2 * 1 = 120;

Factorial(5) == 5*4! == 5*4*3! == 5*4*3*2! == 5*4*3*2*1!; //(The value of Factorial(1) is considered to be 1).

We can use this property to our advantage by pre-computing and storing the smaller factorial values to avoid the recomputation later for larger values. However, one big problem with Factorial is that, due to its multiplicative nature, the values grow out of bounds quite quickly. A typical 4-byte integer can not hold Factorial values for numbers beyond 15 or 20. We have to use larger data types (such as Arbitrary Precision Mathematical Library Types) if we need to support larger values. As for the sample code here, we have type defined a BIGNUMBER data type from unsigned long. You might later want to change it with your own larger data types as the need be. The type definitions and the Factorial function are presented below.

typedef unsigned long BIGNUMBER;   // Replace unsigned long with Arbitrary Precision Types to support Large Numbers
typedef BIGNUMBER POSITION;        // Factorial value, Needs bigger types !!
typedef unsigned long SYMBOLINDEX; // Number of symbols can not exceed the maximum unsigned long number
typedef std::vector<symbolindex> SYMBVECTOR;
  
static const BIGNUMBER& Factorial(long lNumber)
{
  if(lNumber <= 1) return m_FactArray[0];
  if(lNumber <= (long)m_FactArray.size())
  {
    static BIGNUMBER Product;
    
    Product = m_FactArray[m_FactArray.size() - 1];
    
    for(long l = (long)m_FactArray.size(); l <= lNumber; ++l)
    {
      Product *= l;
      m_FactArray.push_back(Product);
    }
  }
  return m_FactArray[lNumber];
}

In the above code for the factorial function, we are storing all the computed factorial values in a big array (implemented with STL vector). When we are required to compute a factorial value for any new number, we first check if it has already been computed. If yes, then we return the stored value for that number. If no, we take the largest number we have computed the factorial value till now, and start computing from that number (instead of starting from 1 every time). As you could easily see, this simple mechanism saves lot of time (of course, at the cost of memory). With this simple factorial function in place, we can now turn our attention back to our Sequence classes.

Range Sequence

This is the sequence that is formed by set of symbols each of which takes values from a fixed range. For example, a symbol denoting the day of a week can take values only from a fixed set of 7 weekdays, and a symbol that denotes the hour of a day can have only one of 24 values etc...Now considering an object made of two symbols, one for day and one for hour, then all the possible unique objects that can be made from those two symbols would make up a Range Sequence. Each element of the sequence would have a unique day-hour value that is different from that of all others. For example, consider a hypothetical class as defined below:

class WesternPersonDetails{

    Sex    sVal;    // can take two values : { Male, Female}

    Nature nVal;    // can take three values:{ Good, Bad, Ugly}

    Living lVal;    // can take two values : { Alive, Dead }

};

The above class has three member variables each of which can take 2,3 and 2 values respectively. Now, all the possible unique objects that can be created for that class would makeup the following sequence.

Range[] = { 2, 3, 2 };        // Range[0] == 2; Range[1] == 3; Range[2] == 2;

SymbolCount = 3;              // SymbolCount == (sizeof(Range)/sizeof(Range[0]) )

SequenceLength =  2*3*2 = 12; // SequenceLength == Range[0] * Range[1] * Range[2] == 2 * 3 * 2 == 12

Index

Generalized Sequence

0

{ 0 0 0 }

1

{ 0 0 1 }

2

{ 0 1 0 }

3

{ 0 1 1 }

4

{ 0 2 0 }

5

{ 0 2 1 }

6

{ 1 0 0 }

7

{ 1 0 1 }

8

{ 1 1 0 }

9

{ 1 1 1 }

10

{ 1 2 0 }

11

{ 1 2 1 }

As can be seen from the above table, there can be 12 unique objects for the afore mentioned class. Any other objects created would be just a repetition of one of these 12. In the above table, index 0 presents a generalized sequence {0,0,0} that represents an object {Male , Good , Alive } and index 1 presents a generalized sequence {0,0,1} that represents an objects {Male , Good , Dead } and so on. Given any class whose variables have ranges given by the array Range[] , the number of unique objects for that class can be obtained by multiplying all the range values as: Range[0]*Range[1]*Range[2]*... For any such class of unique objects, how to compute what would be the variable values for ith object for any given i without going through all the rest of the objects?

The class CRangeSequenceIndexer makes this possible with the function CRangeSequenceIndexer::GetAt(). It takes the index i as the argument and returns the ith unique object as the result. It does the computation without going through all the rest of the objects. Further, if one need to compute all of them instead, it is quite possible that the function could be called within a loop, as shown below, by single-step incrementing the loop-index from 0 to SequenceLength-1 , the value of SequenceLength having obtained from the function CRangeSequenceIndexer:: TotalSequenceLength();

RANAGE Ranges[]={2,3,2};
CRangeSequenceIndexer CRIndexer(Ranges, sizeof(Ranges)/sizeof(Ranges[0]));
for(long i=0; i < CRIndexer.TotalSequenceLength(); ++i)
  cout << CRIndexer.GetAt(i);

The above code snippet prints only the generalized sequence. One has to re-index the generalized symbols to get the exact sequence. This way, it becomes quite easy to adapt the class for any set of symbols instead of few hard coded ones. Some of the implementation details of CRangeSequenceIndexer are as follows.

class CRangeSequenceIndexer
{
  SYMBOLINDEX m_nSymbolCount;
  POSITION m_SeqLengths[];
public:
  CRangeSequenceIndexer(const RANGE* pRange, const SYMBOLINDEX lSymbolCount)
  {
    POSITION Product = 1;

    m_SeqLengths[lSymbolCount] = 1;	//We need this as Sentinel

    for(long s = lSymbolCount - 1; s >= 0; --s)
    {
      m_SeqLengths[s] = Product * pRange[s];
      Product = m_SeqLengths[s];		
    }
      
    m_nSymbolCount = lSymbolCount;
  }
};
 
SYMBVECTOR CRangeSequenceIndexer::GetAt(const POSITION& SeqPosIndex)
{
  SYMBVECTOR SymbVector;
  
  for(SYMBOLINDEX s = 0; s < m_nSymbolCount; ++s)
    SymbVector[s] = (SeqPosIndex % m_SeqLengths[s]) / m_SeqLengths[s+1];
  
  return SymbVector;
}
 
POSITION CRangeSequenceIndexer::GetIndexOf(const SYMBVECTOR& SymbolVector)
{
  POSITION Index = 0;

  for(SYMBOLINDEX s = 0; s < m_nSymbolCount; ++s)		
    Index += (SymbolVector[s] * m_SeqLengths[s+1]);
  
  return Index;
}

The another important method supported by CRangeSequenceIndexer is GetIndexOf() . Given any sequent, this method would tell us the index of that sequent in constant time (w.r.to the number of sequents in the Sequence). This would be useful when we have the object information with us and we would like to know what is its position in the set of all possible objects. For example, considering our above hypothetical WesternPersonDetails class, suppose we have some object that has values {Female , Good , Alive }, and we would like to know what would be its index among the 12 possible objects. We can achieve this easily by first identifying its generalized sequent {1,0,0 } and then submitting it to the method CRangeSequenceIndexer:: GetIndexOf() . Identifying the generalized sequent is pretty straight forward in that all that need be done is replacing the Female with its index 1 (1 because Sex can take 2 values {Male,Female} and Female is in position 1), and Good with its index 0 (0 because Nature can take 3 values {Good,Bad,Ugly} and Good is in position 0), and so on. As can be seen from the table, the generalized sequent {1,0,0 } has index 6 and hence GetIndexOf() would return 6 .

The combination of  CRangeSequenceIndexer::GetAt() and CRangeSequenceIndexer:: GetIndexOf() would prove quite useful in maintaining the persistence (storing and retrieving) of class objects. These two methods complement each other and hence complete the functionality of the Sequence. For complete implementation details please refer the source code available at: SequenceIndexingSourceCode.html

Permutation Sequence

In mathematics the concept of a permutation expresses the idea of arranging objects in various different ways. For any given set of symbols, the exhaustive rearrangements of all the symbols would give us the Permutation Sequence. Each element of the Sequence (a sequent, that is) would correspond to one unique arrangement of the symbols. The position of symbol is the crucial concept here and so an arrangement 1234 is not same as 4321. For example, consider a scenario where we have 3 symbols {7,4,8} and we are supposed to rearrange them in various ways. Then the list of all possible arrangements would form the following sequence:

Symbol[] = { 7, 4, 8 }; // Symbol[0] == 7; Symbol[1] == 4; Symbol[2] == 8;

SymbolCount = 3;        // SymbolCount == (sizeof(Symbol)/sizeof(Symbol[0]))

SequenceLength = 6;     // SequenceLength == Factorial(SymbolCount) == 3! == 3 * 2 * 1 == 6

Index

Generalized Sequence

Exact Sequence

0

{ 0 1 2 }

7 4 8

1

{ 0 2 1 }

7 8 4

2

{ 1 0 2 }

4 7 8

3

{ 1 2 0 }

4 8 7

4

{ 2 1 0 }

8 4 7

5

{ 2 0 1 }

8 7 4

As can be seen from the above table, we have 6 ways of arranging 3 symbols. The first way (indicated by index 0 in the above table) is to simply place the symbols in their given order as 748 and the second way is to keep the first element intact while switching the positions of other two elements (as indicated by the generalized sequence {0 2 1} in the above table across index 1 ) resulting in 784 and so on. If we consider our total number of symbols as n, then the number of all possible ways of arranging them would be given by the mathematical formula n! = Factorial(n) = n*(n-1)*(n-2)*...*3*2*1 ; With so many ways of arrangements, how to compute what would be the ith way for any given i without going through all the rest of them?

The class CPermutationSequenceIndexer makes this possible with the function CPermutationSequenceIndexer::GetAt(). It takes the index i as the argument and returns the ith arrangement as the result. It does the computation without going through all the rest of the rearrangements. Further, if one need to compute all of them instead, it is quite possible that the function could be called within a loop, as shown below, by single-step incrementing the loop-index from 0 to SequenceLength-1 , the value of SequenceLength having obtained from the function CPermutationSequenceIndexer:: TotalSequenceLength();

CPermutationSequenceIndexer CPIndexer(3);
for(long i=0; i < CPIndexer.TotalSequenceLength(); ++i)
  cout << CPIndexer.GetAt(i);

The above code snippet prints only the generalized sequence. One has to re-index the generalized symbols to get the exact sequence. This way, it becomes quite easy to adapt the class for any set of symbols instead of few hard coded ones. Some of the implementation details of CPermutationSequenceIndexer are as follows.

  
typedef std::vector<SYMBOLINDEX> POSVECTOR;

class CPermutationSequenceIndexer
{
  SYMBOLINDEX	m_nSymbolCount;
  SYMBVECTOR	m_SymbVector;
  POSVECTOR	m_Pos;
public:
  CPermutationSequenceIndexer(const SYMBOLINDEX lLength)
  {
    m_Pos.reserve(lLength + 16);
    m_SymbVector.reserve(lLength + 16);

    for(SYMBOLINDEX l=0; l < lLength; ++l)
    {
      m_Pos.push_back(l);
      m_SymbVector.push_back(l);
    }

    m_nSymbolCount = lLength;
    m_nCompleteSeqLength = Factorial(lLength);
  }
};

SYMBVECTOR CPermutationSequenceIndexer::GetAt(const POSITION& SeqPosIndex )
{
  SYMBOLINDEX lSwapVal;
  SYMBOLINDEX lLength = m_nSymbolCount;
  
  for(long j = lLength - 1; j >=0; --j)
  {
    lSwapVal = (SeqPosIndex % Factorial((lLength - j))) / Factorial(lLength - (j+1));
    lSwapVal += j;
    
    m_Pos[j] = j;
    
    m_SymbVector[j] = lSwapVal;
    m_SymbVector[m_Pos[lSwapVal]] = j;

    m_Pos[j] = m_Pos[lSwapVal];
    m_Pos[lSwapVal] = j;
  }
  
  return m_SymbVector;
}

POSITION CPermutationSequenceIndexer::GetIndexOf(SYMBVECTOR SymbolVector)
{
  SYMBOLINDEX lLength = (SYMBOLINDEX) m_nSymbolCount;

  for(long j = 0; j < lLength; ++j)
    m_Pos[SymbolVector[j]] = j;

  POSITION Index = 0;

  for(SYMBOLINDEX j = 0; j < lLength; ++j)
  {
    Index += ((SymbolVector[j]-j) * Factorial(lLength - (j+1)));

    SymbolVector[m_Pos[j]] = SymbolVector[j];
    m_Pos[SymbolVector[j]] = m_Pos[j];
    m_Pos[j] = j;
    SymbolVector[j] = j;
  }

  return Index;
}

The another important method supported by CPermutationSequenceIndex is GetIndexOf() . Given any sequent, this method would tell us the index of that sequent in constant time (w.r.to the number of sequents in the Sequence). This would be quite useful when we have some particular arrangement with us whose position we would like to know in the set of all possible arrangements. For example, considering our above scenario of 3 symbols {7,4,8} , suppose we have one arrangement 478 , and we would like to know what is its index among the 6 possible arrangements. We can achieve this easily by first identifying its generalized sequent {1,0,2 } and then submitting it to the method CRangeSequenceIndexer:: GetIndexOf() . Identifying the generalized sequent is pretty straight forward in that all that need be done is replacing the 4 with its index 1, 7 with its index 0, and so on. As can be seen from the table, the generalized sequent {1,0,2 } has index 2 and hence GetIndexOf() would return 2 .

The combination of CPermutationSequenceIndexer::GetAt() and CPermutationSequenceIndexer:: GetIndexOf() would prove quite useful in simulating the randomness, among other things. For example, if we associate one event with each of the numbers 4,8 and 7 then we can change the order in which events occur just by changing the index for the arrangements, and conversely we can as well predict the process that is creating the events by just observing the order of the events and finding the corresponding index for that order. These two methods GetAt() and GetIndexOf() complement each other and hence complete the functionality of the Sequence. For complete implementation details please refer the source code available at: SequenceIndexingSourceCode.html

Combinational Sequence

Combinational Sequences are the result of selection of few symbols out of many. In contrast to the Permutation Sequence (where we care for the position of the symbol), Combinational Sequences do not respect the positional differences of symbols. Thus, both 1234 and 1432 are treated as one and the same here. It is the combination that makes the difference and not the position. For example, consider a scenario where we have 5 symbols {7,4,8,1,5} and we are supposed to choose 3 out of them. Then the list of all possible selections would form the following sequence:

Symbol[] = { 7, 4, 8, 1, 5 };
n = SymbolCount = 5; // SymbolCount == (sizeof(Symbol)/sizeof(Symbol[0]))
r = Choice = 3;      // Any value such that 0 <= Choice <= SymbolCount

SequenceLength = 10; // SequenceLength == ncr == n! /(r! (n-r)!) == 5! / (3!*2!) == 120/12 == 10

Index

Generalized Sequence

Exact Sequence

0

{ 0 1 2 }

7 4 8

1

{ 0 1 3 }

7 4 1

2

{ 0 1 4 }

7 4 5

3

{ 0 2 3 }

7 8 1

4

{ 0 2 4 }

7 8 5

5

{ 0 3 4 }

7 1 5

6

{ 1 2 3 }

4 8 1

7

{ 1 2 4 }

4 8 5

8

{ 1 3 4 }

4 1 5

9

{ 2 3 4 }

8 1 5

As can be seen from the above table, we have 10 ways of selecting 3 symbols out of 5 . The first way (indicated by index 0 in the above table) is to simply choose the first three symbols 748 and the second way is to choose the first two symbols followed by fourth symbol (indicated by the generalized sequence {0 1 3} in the above table across index 1 ) resulting in 741 and so on. If we consider our total number of symbols as n and the number of symbols to choose as r, then the number of ways of selecting r symbols out of n symbols is given by the mathematical formula ncr = Factorial(n) / (Factorial(r)*Factorial(n-r)); With so many ways of selection, how to compute what would be the ith way for any given i without going through all the rest of them?

The class CCombinationalSequenceIndexer makes this possible with the function CCombinationalSequenceIndexer::GetAt(). It takes the index i as the argument and returns the ith combination as the result. It does the computation without going through all the rest of the combinations. Further, if one need to compute all the possible combinations instead, it is quite possible that the function could be called within a loop, as shown below, by single-step incrementing the loop-index from 0 to SequenceLength-1 , the value of SequenceLength having obtained from the function CCombinationalSequenceIndexer:: TotalSequenceLength();

CCombinationalSequenceIndexer CSIndexer(5, 3);
for(long i=0; i < CSIndexer.TotalSequenceLength(); ++i)
  cout << CSIndexer.GetAt(i);

The above code snippet prints only the generalized sequence. One has to re-index the generalized symbols to get the exact sequence. This way, it becomes quite easy to adapt the class for any set of symbols instead of few hard coded ones. Some of the implementation details of CCombinationalSequenceIndexer are as follows.

 
class CCombinationalSequenceIndexer
{
  SYMBOLINDEX	m_nSymbolCount;
  SYMBOLINDEX	m_nChoice;
public:
  CCombinationalSequenceIndexer(SYMBOLINDEX NoofSymbolsExisting, SYMBOLINDEX NoofSymbolsToChoose)
  {
    m_nSymbolCount = NoofSymbolsExisting;

    m_nChoice = NoofSymbolsToChoose;
  }
  
  POSITION ComputeNcR(SYMBOLINDEX n, SYMBOLINDEX r)
  {
    return Factorial(n) / (Factorial(r) * Factorial(n-r)) ;
  }
};
  
SYMBVECTOR CCombinationalSequenceIndexer::GetAt(POSITION SeqPosIndex) // Compute Combination from Index
{	
  SYMBVECTOR SymbVector;
  POSITION SeqLength, CumulativeSeqLength;
  
  SYMBOLINDEX r = m_nChoice;
  SYMBOLINDEX n = m_nSymbolCount;

  for(SYMBOLINDEX s=0; s < m_nChoice; ++s)
  {
    CumulativeSeqLength = 0;

    for(SYMBOLINDEX i=0, nMax = n - r; i <= nMax; ++i)
    {
      SeqLength = ComputeNcR(n-(i+1), r-1);

      CumulativeSeqLength += SeqLength;

      if(SeqPosIndex < CumulativeSeqLength)
      {
        SeqPosIndex += SeqLength;
        SeqPosIndex -= CumulativeSeqLength;

        SymbVector[s] = (m_nSymbolCount - n) + i;

        r -= 1;
        n -= (i+1);

        break;
      }
    }
  }	
  return SymbVector;
}

POSITION CCombinationalSequenceIndexer::GetIndexOf(const SYMBVECTOR& SymbolVector) // Compute Index from Combination
{
  SYMBOLINDEX r = m_nChoice;
  SYMBOLINDEX n = m_nSymbolCount;
  
  POSITION Index = 0;

  for(SYMBOLINDEX s=0; s < m_nChoice; ++s)
  {
    const SYMBOLINDEX& SelectedIndex = SymbolVector[s] - (m_nSymbolCount - n);

    for(SYMBOLINDEX i=0 ; i < SelectedIndex; ++i)
      Index += ComputeNcR(n-(i+1), r-1);
    
    r -= 1;
    n -= (SelectedIndex + 1);
  }

  return Index;
}

The another important method supported by CCombinationalSequenceIndexer is GetIndexOf() . Given any sequent, this method would tell us the index of that sequent in constant time (w.r.to the number of sequents in the Sequence). This would be useful when we have one set of choices with us and we would like to know what is its position in the set of all possible choices. For example, considering our above example of selecting 3 symbols out of {7,4,8,1,5} , suppose we have chosen 415 , and we would like to know what is its index among the 10 possible choices. We can achieve this easily by first identifying its generalized sequent {1,3,4 } and then submitting it to the method CRangeSequenceIndexer:: GetIndexOf() . Identifying the generalized sequent is pretty straight forward in that all that need be done is replacing the 4 with its index 1, 1 with its index 3, and so on. As can be seen from the table, the generalized sequent {1,3,4 } has index 8 and hence GetIndexOf() would return 8 .

The combination of CCombinationalSequenceIndexer::GetAt() and CCombinationalSequenceIndexer:: GetIndexOf() would prove quite useful in all situations where selective ignorance (or elimination) is required. For example, consider a situation where we need to prepare a questionnaire for r questions out of a data base of n questions. We need to guarantee that the questionnaire we prepare is not a duplicate of any of the old questionnaires. We can achieve this by using the indexing to generate the questionnaire and then comparing the index for any past duplicates. This index comparision for the set of selected questions is far efficient compared with the method of comparing each and every question against each and every one from the past. These two methods GetAt() and GetIndexOf() complement each other and hence complete the functionality of the Sequence. For complete implementation details please refer the source code available at: SequenceIndexingSourceCode.html

Conclusions

Theory of Sequence Indexing deals with the notion of  structured Enumeration. It is about ordering and indexing the objects based on their properties. The classes presented here index different kinds of Sequences suitable for different purposes at different times. Range Sequences are suitable for indexing variables with well known ranges such as class objects, graph statistics etc.. Permutation Sequences are suitable for indexing exhaustive rearrangements of any given set of symbols, such as enumerating all braches in a source code etc... and the Combination Sequences find their usage when choosing a subset of symbols out of many, such as picking up samples for observations etc... Given suitable details, these classes help computing any element of a Sequence without going through all the rest of the elements.

By   

P.GopalaKrishna

Homepage     Other Articles