Executive Summary: The Logic Gate of Market Taxonomy
In the high-frequency and data-driven landscape of the Indian equity markets, the classification of a company is not merely a descriptive label; it is a fundamental quantitative pivot. For institutional investors, quantitative researchers, and software engineers building financial platforms, the primary sector assignment serves as the “logic gate” through which capital flows. Whether a company is labeled “Information Technology” or “Financial Services” dictates its inclusion in major indices like the Nifty 50, its weightage in exchange-traded funds (ETFs), and the peer group against which its valuation multiples are benchmarked.
The Quantitative Pivot and Investor Database Metadata
The primary sector assignment is arguably the most critical metadata field in an investor’s database. A misclassification can lead to significant tracking errors and skewed risk assessments. For instance, if a company is migrating its operations from traditional textiles to high-margin technical fibers, its valuation should ideally transition from a low P/E commodity multiple to a higher growth multiple. Investors who rely on automated Python scripts to detect these segmental shifts gain a “first-mover” advantage before the broader market or the stock exchanges formally re-classify the security.
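A minimal sketch of such a screen, using hypothetical segment names and revenue figures, flags segments whose revenue share is rising sharply while still sitting below the 50% mark:

```python
import pandas as pd

# Hypothetical two-year segmental history for one company
history = pd.DataFrame({
    'year':    [2022, 2022, 2023, 2023],
    'segment': ['Textiles', 'Technical Fibres', 'Textiles', 'Technical Fibres'],
    'revenue': [800, 200, 600, 550],
})

# Revenue share of each segment within its fiscal year
history['share'] = history['revenue'] / history.groupby('year')['revenue'].transform('sum')

# Year-over-year change in share, per segment
pivot = history.pivot(index='segment', columns='year', values='share')
pivot['delta'] = pivot[2023] - pivot[2022]

# Flag segments gaining share rapidly while still below the 50% threshold
rising = pivot[(pivot['delta'] > 0.10) & (pivot[2023] < 0.50)]
print(rising)
```

In this toy case the "Technical Fibres" segment is flagged: its share jumps from 20% to roughly 48%, signalling an imminent re-classification before the segment formally crosses the majority threshold.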
Python’s Role in Automating Taxonomy
Modern classification frameworks have moved beyond manual entry. Leveraging the Python ecosystem—specifically Pandas for data manipulation, NumPy for vectorized mathematical operations, and Scikit-Learn for clustering—analysts can now ingest thousands of XBRL (eXtensible Business Reporting Language) filings from the NSE and BSE. This automation allows for the extraction of granular segmental data, enabling a real-time “Segmental Revenue Rule” audit that identifies “Thematic Drifts” in diversified conglomerates.
Python Implementation for Segmental Data Ingestion Strategy
import pandas as pd
import numpy as np

def initialize_metadata_engine(ticker_list):
    """
    Initializes a structured DataFrame to store raw segmental data
    extracted from exchange filings.
    """
    # Define the core columns for the classification database
    columns = ['ticker', 'segment_name', 'revenue', 'ebitda', 'capital_employed']
    # Initialize an empty DataFrame for the tickers to be ingested
    segment_df = pd.DataFrame(columns=columns)
    # Placeholder for a batch ingestion process (e.g., from a PostgreSQL
    # source, where db_engine is a pre-configured database connection):
    # segment_df = pd.read_sql("SELECT * FROM raw_annual_segments", db_engine)
    return segment_df
Step-by-step Summary: The script imports Pandas and NumPy, the backbone of Indian market data analysis. 'initialize_metadata_engine' prepares the environment for high-volume data ingestion. It establishes the key metrics: Revenue (Volume), EBITDA (Value), and Capital (Infrastructure). This structure allows for vectorized operations in subsequent classification stages.
Conceptual Theory: The Architecture of Identity
Financial identity in the Indian market is often blurred by the presence of large, diversified business houses. Understanding the “Architecture of Identity” requires a transition from qualitative descriptions to a rigorous quantitative taxonomy ladder. This section explores how the “operational reality” of a firm—revealed through its segmental filings—interacts with its “thematic behavior” in the stock market.
The Identity Crisis of Diversified Entities
The spectrum of Indian equities ranges from “Pure Plays” (companies focused on a single product line) to complex “Hybrids” or conglomerates. A Pure Play, such as a dedicated software services firm, is easily categorized. However, many Indian firms are in a state of flux, where a legacy business provides the cash flow for a new, high-growth venture. The taxonomy ladder provides a structured path: starting from a “Basic Industry” (e.g., Spinning) to an “Industry” (e.g., Textiles) and finally to a “Broad Sector” (e.g., Consumer Discretionary).
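The taxonomy ladder can be represented as a simple lookup structure. The mappings below are illustrative examples only, not an official classification table:

```python
# Hypothetical three-level taxonomy ladder: basic industry -> (industry, broad sector)
TAXONOMY_LADDER = {
    'Spinning':          ('Textiles', 'Consumer Discretionary'),
    'Software Services': ('IT Services', 'Information Technology'),
    'Urea':              ('Fertilisers', 'Commodities'),
}

def climb_ladder(basic_industry):
    """Resolve a basic industry to its (industry, broad sector) pair."""
    return TAXONOMY_LADDER.get(basic_industry, ('Unclassified', 'Unclassified'))

print(climb_ladder('Spinning'))  # ('Textiles', 'Consumer Discretionary')
```

Unknown basic industries fall through to an "Unclassified" bucket, which keeps the pipeline robust when a new segment name appears in a filing before the master taxonomy is updated.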
Why Classification Dictates Capital Flow
The importance of accurate classification is magnified by the rise of passive investing in India. ETFs and index funds are programmed to buy or sell based on sector-specific indices. If a company is mislabeled, it creates an “algorithmic error” where millions in capital are misallocated. For the software engineer, building a “Truth Source” database involves reconciling regulatory codes (like the National Industrial Classification or NIC) with the actual revenue drivers disclosed in the notes to accounts under Ind AS 108.
Mathematical Specification of the Primary Sector Assignment Rule
To mathematically define the primary sector of a company, we utilize the Argmax Function subject to a Majority Threshold. This ensures that a sector is only assigned if it represents the dominant economic engine of the enterprise:

Sprimary = arg max ( Ri / ∑ Rj ) over i = 1…n, assigned only if max ( Ri / ∑ Rj ) > τ; otherwise the company is tagged “Diversified / Conglomerate”.
Variable and Symbol Definitions:
- Sprimary: The Resultant Primary Sector Tag assigned to the ticker.
- Ri: The Revenue generated by the i-th business segment (The Numerator).
- ∑ Rj: The Total Revenue of the company, calculated as the sum of all ‘n’ segments (The Denominator).
- n: The total number of reportable segments as per Ind AS 108.
- τ: The Threshold Coefficient (typically set at 0.50 or 50% for the “Golden Rule”).
- arg max: The operator that selects the index ‘i’ which yields the maximum value of Revenue.
Python Algorithm for Primary Sector Assignment (50% Rule)
def calculate_primary_sector(df, ticker, threshold=0.5):
    """
    Applies the Argmax logic with a threshold constraint
    to determine the primary sector.
    """
    # Filter data for the specific company
    company_data = df[df['ticker'] == ticker].copy()
    # Calculate Total Revenue (the denominator)
    total_revenue = company_data['revenue'].sum()
    # Calculate the percentage contribution of each segment
    company_data['contribution'] = company_data['revenue'] / total_revenue
    # Find the segment with the maximum contribution (the argmax)
    top_segment = company_data.loc[company_data['contribution'].idxmax()]
    # Apply the threshold (τ) logic
    if top_segment['contribution'] > threshold:
        return top_segment['segment_name']
    else:
        return "Diversified / Conglomerate"
Step-by-step Summary: The function isolates data for a single 'ticker' to ensure domain isolation. It computes the total revenue, acting as the mathematical denominator. Each segment's relative weight is calculated (Normalization). 'idxmax()' identifies the index of the highest revenue-generating segment. A conditional check determines if the segment meets the 'Rule of 50'.
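A standalone demonstration of the 50% Rule on toy data (the function is restated here in full so the snippet runs on its own; tickers and figures are hypothetical):

```python
import pandas as pd

def calculate_primary_sector(df, ticker, threshold=0.5):
    """Argmax with a majority-threshold constraint (restated for this demo)."""
    company = df[df['ticker'] == ticker].copy()
    company['contribution'] = company['revenue'] / company['revenue'].sum()
    top = company.loc[company['contribution'].idxmax()]
    return top['segment_name'] if top['contribution'] > threshold else "Diversified / Conglomerate"

# Hypothetical filings: one dominant company, one evenly split company
segments = pd.DataFrame({
    'ticker':       ['PUREIT', 'PUREIT', 'MIXCO', 'MIXCO', 'MIXCO'],
    'segment_name': ['IT Services', 'Hardware', 'Fertilisers', 'Pesticides', 'Seeds'],
    'revenue':      [900, 100, 450, 450, 100],
})

print(calculate_primary_sector(segments, 'PUREIT'))  # IT Services (90% > 50%)
print(calculate_primary_sector(segments, 'MIXCO'))   # Diversified / Conglomerate
```

Note that MIXCO's largest segment holds exactly 45% of revenue, so the strict `> threshold` comparison correctly withholds a primary tag.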
To further enhance your analysis of the Indian stock market, integrating diverse data streams is essential. You can utilize TheUniBit to access high-fidelity financial data and advanced analytics that complement these Python workflows.
The Core Methodology: The Quantitative Thresholds
The transition from theory to practice requires a rigid set of rules to handle edge cases. While the “50% Rule” is the standard, the methodology must also account for disparities between revenue and profitability, as well as capital allocation.
The 50% Rule: The Golden Standard
In most instances, a company is assigned to a sector if a single segment contributes more than 50% of the total revenue. This provides a clean classification for the majority of the NSE 500. However, when a company’s revenue is split nearly equally (e.g., 40/40/20), a “Capital Employed Override” is used. This tie-breaker assigns the sector based on where the majority of the balance sheet—fixed assets and working capital—is deployed, as this indicates management’s long-term strategic commitment.
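A minimal sketch of the "Capital Employed Override" described above, under the assumption that it simply falls back to the segment holding the most capital when no segment clears the revenue threshold (names and figures are hypothetical):

```python
def assign_with_capital_override(segments, threshold=0.5):
    """
    If no segment clears the revenue threshold, fall back to the segment
    holding the most capital. `segments` is a list of dicts with
    'name', 'revenue', and 'capital_employed' keys.
    """
    total_rev = sum(s['revenue'] for s in segments)
    leader = max(segments, key=lambda s: s['revenue'])
    if leader['revenue'] / total_rev > threshold:
        return leader['name']  # Clean 50% Rule outcome
    # Tie-breaker: where is the balance sheet actually deployed?
    return max(segments, key=lambda s: s['capital_employed'])['name']

# Hypothetical 40/40/20 revenue split, with capital concentrated in one unit
segs = [
    {'name': 'Textiles', 'revenue': 400, 'capital_employed': 100},
    {'name': 'Realty',   'revenue': 400, 'capital_employed': 900},
    {'name': 'Trading',  'revenue': 200, 'capital_employed': 50},
]
print(assign_with_capital_override(segs))  # Realty
```

Here revenue alone cannot break the 40/40 tie, but 86% of capital employed sits in Realty, so the override assigns the Realty tag.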
Revenue vs. EBITDA: The Profitability Nuance
A common paradox in Indian equity analysis is the “Trading vs. Manufacturing” conflict. A company might report 80% of its revenue from low-margin commodity trading (High Volume) but 70% of its EBITDA from a small specialty chemical manufacturing unit (High Value). In such cases, the “Value Driver” (EBITDA) is prioritized. The rationale is that the stock market values earnings and cash flow potential rather than raw turnover.
Mathematical Specification of the Weighted Classification Score (WCS)
To resolve conflicts between Revenue and EBITDA, we use a weighted linear combination to derive a Weighted Classification Score (WCS) for each segment:

WCSi = w1 · ( Ri / ∑R ) + w2 · ( Ei / ∑E ) + w3 · ( Ci / ∑C )
Variable and Symbol Definitions:
- WCSi: Weighted Classification Score for segment ‘i’.
- Ri, Ei, Ci: Revenue, EBITDA, and Capital Employed for segment ‘i’ respectively.
- w1, w2, w3: Weighting Coefficients, where ∑ w = 1. Standard weights are often w1=0.3, w2=0.5, w3=0.2.
- ∑R, ∑E, ∑C: The total Aggregate Revenue, EBITDA, and Capital across all business units.
Python Function for Profitability-Weighted Classification
def get_weighted_sector(df, weights={'rev': 0.3, 'ebitda': 0.5, 'cap': 0.2}):
    """
    Calculates a multi-metric score to identify the true economic
    driver of a multi-segment company.
    """
    df = df.copy()
    # Normalize metrics to obtain relative contributions (0 to 1 range)
    df['rel_rev'] = df['revenue'] / df['revenue'].sum()
    df['rel_ebitda'] = df['ebitda'] / df['ebitda'].sum()
    df['rel_cap'] = df['capital_employed'] / df['capital_employed'].sum()
    # Calculate the Weighted Classification Score (WCS)
    df['wcs'] = (df['rel_rev'] * weights['rev'] +
                 df['rel_ebitda'] * weights['ebitda'] +
                 df['rel_cap'] * weights['cap'])
    # Return the segment with the highest WCS
    return df.loc[df['wcs'].idxmax(), 'segment_name']
Step-by-step Summary:
The function accepts custom weights, allowing flexibility for different industries.
It normalizes Revenue, EBITDA, and Capital Employed to make them comparable.
It applies a linear combination to compute the 'WCS' for each segment.
This approach prevents 'Revenue-only' bias and highlights high-margin divisions.
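The "Trading vs. Manufacturing" paradox described earlier can be reproduced on toy data. The scorer is restated compactly here so the demo runs standalone; segment names and figures are hypothetical:

```python
import pandas as pd

def get_weighted_sector(df, weights={'rev': 0.3, 'ebitda': 0.5, 'cap': 0.2}):
    """Restated WCS scorer: a weighted blend of relative contributions."""
    df = df.copy()
    df['wcs'] = (df['revenue'] / df['revenue'].sum() * weights['rev']
                 + df['ebitda'] / df['ebitda'].sum() * weights['ebitda']
                 + df['capital_employed'] / df['capital_employed'].sum() * weights['cap'])
    return df.loc[df['wcs'].idxmax(), 'segment_name']

# Hypothetical conflict: trading dominates revenue, but specialty
# chemicals dominate EBITDA and capital employed
df = pd.DataFrame({
    'segment_name':     ['Commodity Trading', 'Specialty Chemicals'],
    'revenue':          [800, 200],
    'ebitda':           [30, 70],
    'capital_employed': [20, 80],
})
print(get_weighted_sector(df))  # Specialty Chemicals
```

Despite an 80/20 revenue split in favour of trading, the EBITDA-heavy weighting (w2 = 0.5) produces a WCS of 0.57 for chemicals versus 0.43 for trading, so the "Value Driver" wins the classification.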
Trading Impact: The “Fetch-Store-Measure” Workflow
The “Fetch-Store-Measure” workflow is the operational backbone of this methodology. In the Fetch phase, Python scrapers collect segment notes from PDF annual reports. In the Store phase, this data is cleaned and saved into a relational database with historical versioning. Finally, the Measure phase runs the WCS algorithms described above.
The impact on trading varies across time horizons:
- Short-Term: News of a major new contract in a secondary segment can trigger a “Re-Classification Pop” if the market anticipates a shift in the primary sector.
- Medium-Term: Relative valuation becomes more accurate as “Diversified Discounts” are removed when a company successfully pivots to a pure-play growth sector.
- Long-Term: Structural alignment ensures the portfolio is exposed to the intended themes (e.g., Green Energy) rather than legacy anchors.
Regulatory Framework: AS-17 and Ind AS 108
In the Indian corporate landscape, the granularity of segmental data is governed by specific accounting standards. Transitioning from the older AS-17 to the modern Ind AS 108 has significantly altered how “Primary Sector Assignment” is executed. For a quantitative researcher using Python, understanding these regulatory nuances is essential for identifying reporting gaps and management subjectivity.
Decoding AS-17 (Segment Reporting)
Under the legacy AS-17 framework, segment reporting was largely based on the “Risks and Returns” approach. Companies were required to disclose information for any segment that contributed more than 10% of total revenue, results, or assets. However, this often led to fragmented reporting where companies would group disparate business units under “Others” to avoid revealing competitive data. For an automated Python workflow, this creates a “Reporting Gap” where unstructured PDF data requires advanced Natural Language Processing (NLP) to map vague segment names to standardized industrial codes.
Transition to Ind AS 108: The Management Approach
Ind AS 108 introduced the “Management Approach,” where segments are defined based on how the Chief Operating Decision Maker (CODM)—usually the CEO or the Board—reviews the business. While this provides insight into how the company is internally managed, it introduces subjectivity. A company might report “Consumer Electronics” as a single segment even if it includes both manufacturing and retail services. The challenge for the analyst is to use Python to cross-reference these management-defined segments against external benchmarks like the National Industrial Classification (NIC).
Mathematical Specification of the Segment Significance Threshold
To determine if a segment is “Reportable” under regulatory mandates, we apply the Indicator Function across three financial dimensions—Revenue, Profit/Loss, and Assets:

Mreportable = 1 if ( Ri / ∑R ≥ 0.10 ) ∨ ( |Pi| / ∑|P| ≥ 0.10 ) ∨ ( Ai / ∑A ≥ 0.10 ), else 0
Variable and Symbol Definitions:
- 𝑀reportable: A Boolean indicator (1 if the segment must be disclosed, 0 otherwise).
- Ri, Pi, Ai: Segmental Revenue, Profit (or Loss), and Assets respectively.
- ∑R, ∑P, ∑A: The total consolidated Revenue, Profit, and Assets of the entity.
- ∨: The Logical OR operator, indicating that meeting any one of the three criteria triggers disclosure.
- |Pi|: The absolute value of profit or loss, used to handle segments currently in a loss-making phase.
Python Implementation for Regulatory Compliance Audit
def audit_segment_disclosure(segments_list):
    """
    Evaluates which business units cross the 10% threshold for
    mandatory disclosure under Ind AS 108.
    """
    results = []
    total_rev = sum(s['revenue'] for s in segments_list)
    total_assets = sum(s['assets'] for s in segments_list)
    # Calculate total absolute profit/loss for the denominator
    total_abs_profit = sum(abs(s['profit']) for s in segments_list)
    for seg in segments_list:
        # Check against the 10% threshold for each metric
        is_rev_sig = (seg['revenue'] / total_rev) >= 0.10
        is_prof_sig = (abs(seg['profit']) / total_abs_profit) >= 0.10
        is_asset_sig = (seg['assets'] / total_assets) >= 0.10
        # Trigger the reportable flag if any condition is met
        is_reportable = is_rev_sig or is_prof_sig or is_asset_sig
        results.append({
            'name': seg['name'],
            'reportable': is_reportable,
            'reason': "Regulatory Requirement" if is_reportable else "Internal Disclosure"
        })
    return results
Step-by-step Summary: The function iterates through a list of segment dictionaries. It calculates the aggregate totals for Revenue, Assets, and Absolute Profit. Each segment is tested against the 10% benchmark (Indicator Function logic). It returns a compliance report identifying which segments management is legally bound to disclose.
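A standalone run on hypothetical segments shows why the absolute-profit denominator matters: a small, loss-making segment can still be reportable (the audit is restated compactly so the snippet runs on its own):

```python
def audit_segment_disclosure(segments_list):
    """Restated 10%-threshold audit across revenue, |profit|, and assets."""
    total_rev = sum(s['revenue'] for s in segments_list)
    total_assets = sum(s['assets'] for s in segments_list)
    total_abs_profit = sum(abs(s['profit']) for s in segments_list)
    return [
        {'name': s['name'],
         'reportable': (s['revenue'] / total_rev >= 0.10
                        or abs(s['profit']) / total_abs_profit >= 0.10
                        or s['assets'] / total_assets >= 0.10)}
        for s in segments_list
    ]

# Hypothetical filing: a tiny segment with a large loss still triggers disclosure
segments = [
    {'name': 'Cement',   'revenue': 950, 'profit': 100, 'assets': 900},
    {'name': 'EV Cells', 'revenue': 50,  'profit': -40, 'assets': 80},
    {'name': 'Misc',     'revenue': 20,  'profit': 2,   'assets': 30},
]
report = audit_segment_disclosure(segments)
print(report)
```

"EV Cells" contributes under 5% of revenue and assets, but its loss is roughly 28% of the company's total absolute profit, so the OR condition forces disclosure, while "Misc" stays below all three thresholds.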
Technical Workflow: Data Fetch → Store → Measure
To scale the Segmental Revenue Rule across the 5,000+ companies listed on the NSE and BSE, a robust technical pipeline is required. This workflow ensures that messy, unstructured regulatory filings are transformed into actionable trading signals.
Data Ingestion: The “Fetch” Phase
The primary sources for Indian market data are XBRL files and annual report PDFs. Python’s BeautifulSoup and xml.etree.ElementTree are used to parse XBRL instances directly from exchange websites. For legacy companies that only provide PDFs, OCR (Optical Character Recognition) tools combined with LLM-based extraction (Large Language Models) are employed to find the “Segment Information” table in the notes to accounts.
Standardization and Mapping: The “Store” Phase
Segment names in India are notoriously non-standardized. A company might label its chemical division as “Chemicals,” “Agro-Inputs,” or “Specialty Solutions.” To solve this, we use Fuzzy Matching (via the RapidFuzz library) to map these erratic strings to a master taxonomy. The data is then stored in a PostgreSQL database using a Relational Schema that tracks “Point-in-Time” sector assignments, allowing for backtesting of sector migration events.
Python Workflow for Fuzzy Segment Mapping
from rapidfuzz import process, fuzz

def map_segment_to_master(raw_name, master_tags):
    """
    Standardizes inconsistent segment names using fuzzy logic.
    """
    # Use partial_ratio to handle prefixes/suffixes like "Textile Division"
    match = process.extractOne(raw_name, master_tags, scorer=fuzz.partial_ratio)
    # Set a confidence threshold (e.g., 80%) to avoid false positives
    if match and match[1] > 80:
        return match[0]
    else:
        return "Unclassified/Other"

# Master list of standardized Indian sectors
indian_sectors = ["Textiles", "Chemicals", "IT Services", "Banking", "Automobiles"]

# Example usage
raw = "Textile Manufacturing Unit A"
clean_tag = map_segment_to_master(raw, indian_sectors)  # Returns "Textiles"
Step-by-step Summary: 'RapidFuzz' is utilized for high-speed string comparison. 'partial_ratio' accounts for sub-strings and descriptive noise in segment names. A confidence score threshold ensures data integrity. This mapping is critical for building a 'Truth Source' database for peer comparisons.
Trading Impact: The Mechanics of Value Shift
The transition from “Fetch” to “Measure” provides a quantitative bridge for traders. By monitoring the Shift in Capital Allocation (CapEx) before it reflects in revenue, analysts can predict sector re-classifications.
- Short-Term: High-frequency bots monitor SEBI filings for “Business Diversification” announcements. An automated mapping of these announcements to the “Segmental Revenue Rule” can trigger buy/sell orders milliseconds before manual traders react.
- Medium-Term: As a company crosses the 50% threshold in a new sector, its “Peer Group” changes. This leads to Mean Reversion or Multiple Expansion as the stock is re-rated by sell-side analysts.
- Long-Term: Investors use the “Store” phase data to track the long-term survival and profitability of new segments, ensuring the company isn’t falling into the “Diversified Trap” (where secondary businesses destroy value).
For more advanced data-fetching capabilities and automated market insights, you can explore the specialized tools available at TheUniBit, which streamline the ingestion of Indian corporate filings for systematic trading strategies.
Trading Impact: Short, Medium, and Long Term
The “Segmental Revenue Rule” is not merely an accounting exercise; it is a catalyst for significant price action in the Indian equity markets. When a company’s primary revenue driver shifts, it triggers a chain reaction across institutional portfolios, index weights, and valuation models. Understanding the temporal impact of these shifts allows traders to position themselves ahead of official re-classifications.
Short-Term: The Re-Classification Pop and Front-Running
In the short term, the market reacts to the “event” of sector migration. When a company officially crosses the 50% threshold in a high-growth sector—such as a legacy chemical firm becoming a “Specialty Chemical” or “Battery Materials” player—it often experiences a “Re-Classification Pop.” This is driven by alpha-seeking algorithms that scrape SEBI corporate filings for segmental updates. Quantitative traders can “front-run” index rebalancing by predicting which stocks will be added to sector-specific indices like the Nifty IT or Nifty Realty based on their latest annual report data.
Medium-Term: Relative Valuation and the Peer Group Fallacy
Medium-term trading impact is dictated by the correction of the “Peer Group Fallacy.” For years, a company with a 50/50 split between Textiles and Real Estate might have been valued at a low textile multiple. As the Real Estate segment becomes the dominant engine (>50%), analysts are forced to value the company using Realty multiples. This shift leads to Multiple Expansion or “Multiple Compression” depending on the relative desirability of the new sector. Python-based “Peer Group Finders” are used to calculate the segmental similarity between companies to identify undervalued “misfit” stocks.
Mathematical Specification of Segmental Similarity (Cosine Similarity)
To identify mispriced peers, we calculate the Cosine Similarity between the segmental revenue vectors of two companies. This quantitative measure determines how closely a hybrid company matches a pure-play benchmark:

similarity(A, B) = ∑ Ai Bi / ( √∑ Ai² · √∑ Bi² )
Variable and Symbol Definitions:
- Ai: Revenue contribution percentage of segment ‘i’ for Company A.
- Bi: Revenue contribution percentage of segment ‘i’ for Company B.
- ∑ Ai Bi: The Dot Product of the two segmental revenue vectors (The Numerator).
- √∑ Ai2: The Euclidean Norm (Magnitude) of Company A’s vector.
- n: The total number of standardized industry segments in the master taxonomy.
Python Implementation of Peer Group Similarity Algorithm
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

def find_true_peers(ticker, full_market_segments):
    """
    Computes the similarity between a target ticker and all other
    listed companies based on segmental revenue profiles.
    """
    # Create a pivot table where rows are tickers and columns are segments
    pivot_table = full_market_segments.pivot(
        index='ticker', columns='segment_name', values='rel_rev'
    ).fillna(0)
    # Extract the vector for the specific ticker
    target_vector = pivot_table.loc[[ticker]]
    # Calculate similarity across the entire market (vectorized)
    similarities = cosine_similarity(target_vector, pivot_table)
    # Return the ten closest peers sorted by similarity score
    peer_df = pd.DataFrame({
        'peer_ticker': pivot_table.index,
        'similarity': similarities[0]
    }).sort_values(by='similarity', ascending=False)
    return peer_df.head(10)
Step-by-step Summary:
The function pivots the database to create a segment-wise matrix (Vectorization).
'cosine_similarity' calculates the angular distance between revenue vectors.
Companies with scores near 1.0 are "Pure Play" peers.
This tool identifies companies that are misclassified by broad market indices.
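As a quick numeric sanity check of the cosine formula, the computation can be reproduced with NumPy alone on hypothetical revenue-share vectors over a four-segment master taxonomy:

```python
import numpy as np

# Hypothetical revenue-share vectors over (Textiles, Realty, IT Services, Chemicals)
hybrid    = np.array([0.55, 0.45, 0.0, 0.0])  # a textile/realty hybrid
pure_real = np.array([0.0, 1.0, 0.0, 0.0])    # pure-play realty benchmark
pure_it   = np.array([0.0, 0.0, 1.0, 0.0])    # pure-play IT benchmark

def cos_sim(a, b):
    """Cosine similarity: dot product over the product of Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cos_sim(hybrid, pure_real), 3))  # meaningful overlap with realty
print(cos_sim(hybrid, pure_it))              # 0.0 -- no shared segments
```

The hybrid scores roughly 0.63 against the realty benchmark and exactly zero against IT, which is the property the peer-group finder exploits at market scale.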
Long-Term: Structural Portfolio Alignment and Thematic Drifts
Long-term investors monitor “Thematic Drifts”—the gradual shift in capital expenditure (CapEx) from legacy segments to future growth drivers. By tracking the Segmental Asset Intensity over 5-10 years, an investor can determine if a company is successfully transitioning its “Identity Architecture.” For example, an Indian auto-ancillary company pivoting into EV components will show a surge in CapEx for the EV segment years before the “Segmental Revenue Rule” triggers a re-classification.
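A minimal sketch of such drift monitoring, using hypothetical CapEx figures for a legacy segment and an EV segment:

```python
import pandas as pd

# Hypothetical CapEx history: the EV segment absorbs a growing share of investment
capex = pd.DataFrame({
    'year':   [2020, 2021, 2022, 2023],
    'legacy': [90, 80, 60, 40],
    'ev':     [10, 20, 40, 60],
})
capex['ev_share'] = capex['ev'] / (capex['ev'] + capex['legacy'])

# A sustained, monotonic rise in CapEx share is the drift signal that
# precedes any revenue-based re-classification
drifting = (capex['ev_share'].diff().dropna() > 0).all()
print(bool(drifting))  # True -- EV share rose every year
```

In this toy series the EV share of CapEx climbs from 10% to 60% while revenue may still be dominated by the legacy business, which is exactly the window in which long-term investors position ahead of the "Segmental Revenue Rule" trigger.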
Hypothetical Logic Scenarios (The Researcher’s Manual)
To master the methodology, researchers must navigate complex scenarios where revenue alone is an insufficient indicator. These scenarios represent the “Edge Cases” where human-in-the-loop validation or advanced Python logic is required.
Scenario A: The Textile-to-Real Estate Pivot
A legacy cotton mill owner in Mumbai stops production due to labor issues but begins developing the mill land into luxury residential towers. In the transition year, Revenue is still 60% Textiles (due to clearing of old stock), but 90% of the Capital Employed is in Real Estate. Decision: The company should be classified as “Realty” despite the revenue lag. This is because future cash flow drivers and the “Economic Essence” of the balance sheet have already migrated.
Scenario B: The Agri-Chemical Hybrid
Consider a company reporting 45% Fertilisers, 45% Pesticides, and 10% Seeds. No single segment crosses the 50% “Golden Rule.” Decision: The algorithm should aggregate the first two segments into a broader “Agro-Chemicals” bucket. If aggregation is not possible, the “Diversified” tag is applied until a clear leader emerges. This prevents “Sector Flipping,” where a stock jumps between sectors annually due to minor revenue fluctuations.
Mathematical Specification of the Aggregation Logic (Entropy-Based Diversification)
To decide whether to label a company as “Diversified,” we calculate the Shannon Entropy of its revenue streams. High entropy indicates the absence of a single primary sector driver:

H = − ∑ pi · log2( pi ), normalized by dividing by log2( n )
Variable and Symbol Definitions:
- H: The Entropy Score. A higher H signifies a more diversified/conglomerate-like entity.
- pi: The proportion of revenue from segment ‘i’ (where ∑ pi = 1).
- log2: Base-2 logarithm, standard in information theory for measuring “bits” of uncertainty.
- n: Number of segments.
Python Logic for Segment Aggregation and Entropy Check
import math

def check_diversification_entropy(proportions):
    """
    Measures the degree of diversification using Shannon entropy.
    """
    # Calculate entropy: -sum(p * log2(p))
    entropy = -sum(p * math.log2(p) for p in proportions if p > 0)
    # Normalized entropy (0 to 1), where 1 is perfectly diversified
    max_entropy = math.log2(len(proportions))
    norm_entropy = entropy / max_entropy if max_entropy > 0 else 0
    return norm_entropy

# Example usage
props = [0.45, 0.45, 0.10]
score = check_diversification_entropy(props)
if score > 0.7:
    print("Label as Diversified")
Step-by-step Summary: The function takes the revenue proportions as an input list. It computes the uncertainty (entropy) of the business profile. Normalization allows for a universal threshold across companies with different segment counts. If entropy is high, the '50% Rule' is suppressed in favor of a 'Diversified' tag.
By applying these quantitative measures, traders can move beyond basic ticker symbols and understand the deep fundamental identity of their investments. For high-resolution historical data to test these entropy and similarity models, TheUniBit provides the granular segmental datasets necessary for institutional-grade equity analysis.
Python Libraries & Implementation Toolkit
The transition from manual spreadsheet-based sector analysis to an automated, Python-centric framework requires a specialized stack of libraries. Each library in this toolkit addresses a specific challenge in the “Segmental Revenue Rule” pipeline, ranging from raw numerical computation to the handling of erratic natural language in Indian corporate filings.
The Core Quantitative Stack
For the “Measure” phase of our workflow, performance and vectorization are paramount. Processing the segmental history of 5,000+ tickers across 10 years creates millions of data points. Standard Python loops are insufficient for this scale, necessitating the use of vectorized operations provided by NumPy and Pandas.
| Library | Feature | Key Function | Use Case |
|---|---|---|---|
| Pandas | Data Manipulation | groupby().transform() | Calculating % contribution per segment within each ticker. |
| NumPy | Vectorized Math | np.where() | Applying conditional logic for sector thresholds. |
| RapidFuzz | String Matching | fuzz.ratio() | Standardizing erratic segment names from PDF/XBRL. |
| SQLAlchemy | Database ORM | session.query() | Querying historical sector migrations for backtesting. |
| Scikit-Learn | Clustering | KMeans() | Grouping companies by their multi-segmental profiles. |
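The two features at the top of the table compose naturally: `groupby().transform()` produces per-ticker contributions, and `np.where()` applies the threshold in one vectorized pass (tickers and figures below are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical multi-ticker segment table
df = pd.DataFrame({
    'ticker':  ['A', 'A', 'B', 'B'],
    'segment': ['Textiles', 'Realty', 'Banking', 'Insurance'],
    'revenue': [300, 700, 600, 400],
})

# Pandas: % contribution per segment within each ticker
df['contribution'] = df['revenue'] / df.groupby('ticker')['revenue'].transform('sum')

# NumPy: vectorized conditional logic for the 50% Rule
df['dominant'] = np.where(df['contribution'] > 0.5, df['segment'], 'sub-threshold')
print(df)
```

Because both steps are vectorized, the same two lines scale from four rows to the full multi-million-row segment history without Python-level loops.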
Advanced Algorithmic Implementation: The Herfindahl-Hirschman Index (HHI)
While the “Rule of 50” is a binary classifier, sophisticated investors use the Herfindahl-Hirschman Index (HHI) to measure the “Concentration Risk” of a company’s revenue streams. This quantitative metric identifies firms that are dangerously reliant on a single segment versus those that are truly diversified.
Mathematical Specification of Segmental HHI
The HHI is calculated by summing the squares of the percentage revenue share of each segment within the company:

HHI = ∑ ( si × 100 )²
Variable and Symbol Definitions:
- HHI: The Resultant Concentration Score (ranging from 0 to 10,000).
- si: The Segment Revenue Proportion (Segment Revenue / Total Revenue).
- n: The total number of reportable business segments.
- ∑: The summation operator across all ‘n’ segments.
- 100: Scaling factor to convert proportions to percentages before squaring.
Python Algorithm for Concentration Risk Assessment
def calculate_segmental_hhi(segment_proportions):
    """
    Computes the HHI to determine if a company is a 'Pure Play'
    or a 'Diversified Conglomerate'.
    """
    # Convert proportions (0.0 - 1.0) to percentages (0 - 100)
    percentages = [p * 100 for p in segment_proportions]
    # Square the percentages and sum them
    hhi = sum(p**2 for p in percentages)
    # Interpretation guide:
    #   HHI > 8000:        highly concentrated (Pure Play)
    #   2500 < HHI < 8000: moderately concentrated (Hybrid)
    #   HHI < 2500:        highly diversified (Conglomerate)
    return hhi
Step-by-step Summary: The function takes a list of segmental revenue weights. Each weight is converted to a percentage to adhere to standard HHI notation. Squaring the percentages gives disproportionate weight to larger segments. The resulting score acts as a filter for 'Sector Pure Play' strategies.
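Two hypothetical revenue profiles illustrate the interpretation bands (the function is restated in one line so the demo runs standalone):

```python
def calculate_segmental_hhi(proportions):
    """Restated HHI: sum of squared percentage revenue shares."""
    return sum((p * 100) ** 2 for p in proportions)

# Hypothetical profiles
pure_play    = [0.95, 0.05]              # dominated by one segment
conglomerate = [0.25, 0.25, 0.25, 0.25]  # perfectly even four-way split

print(calculate_segmental_hhi(pure_play))     # ~9050 -> 'Pure Play' band (> 8000)
print(calculate_segmental_hhi(conglomerate))  # 2500  -> diversified boundary
```

Squaring makes the index highly sensitive to the largest segment: moving from a 95/5 split to an even four-way split cuts the score by more than two-thirds.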
Data Sourcing & Methodology
High-quality sector assignment is only as good as the underlying data. In the Indian context, data sourcing involves navigating both structured exchange feeds and unstructured corporate disclosures.
Primary Source: NSE/BSE XBRL Filings
The most reliable source for segmental data is the XBRL (eXtensible Business Reporting Language) instance files filed with the National Stock Exchange (NSE) and Bombay Stock Exchange (BSE). Unlike PDFs, XBRL provides tagged data points for Revenue, Results, and Capital Employed. Python’s lxml or xml.etree libraries are used to parse these files, targeting the specific tags defined by the Ministry of Corporate Affairs (MCA).
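The parsing step can be sketched with the standard library alone. The element and attribute names below are illustrative placeholders, not the actual MCA taxonomy tags:

```python
import xml.etree.ElementTree as ET

# Illustrative XBRL-like fragment (tag names are NOT the real MCA taxonomy)
xbrl_fragment = """
<xbrl>
  <SegmentRevenue segment="Textiles">600</SegmentRevenue>
  <SegmentRevenue segment="Realty">400</SegmentRevenue>
</xbrl>
"""

root = ET.fromstring(xbrl_fragment)
# Extract a segment -> revenue mapping from the tagged data points
segments = {
    el.get('segment'): float(el.text)
    for el in root.findall('SegmentRevenue')
}
print(segments)  # {'Textiles': 600.0, 'Realty': 400.0}
```

A production parser would additionally resolve XML namespaces and context references against the MCA taxonomy, but the core extraction pattern is the same.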
Secondary Source: Annual Reports (PDF) and LLM Extraction
For historical data (pre-2015) or for companies with non-standard filings, the Annual Report PDF remains the “Source of Truth.” Extracting segment tables from these documents requires a specialized “Fetch-Store-Measure” workflow:
- Fetch: Use requests and PyPDF2 to download and ingest PDF documents.
- Store: Use Tabula-py or LLM-based vision models to convert graphical tables into machine-readable JSON formats.
- Measure: Compare extracted totals against the consolidated Profit & Loss statement to ensure data integrity before classification.
API Integration: The Efficiency Layer
To avoid the overhead of custom scrapers, institutional-grade APIs like TheUniBit offer pre-standardized segmental data. These APIs handle the messy “Mapping and Standardization” phase, allowing Python developers to focus on the “Measure” phase—building the actual trading algorithms and sector-rotation models.
Python Snippet for API Data Validation
import requests
def fetch_standardized_segments(ticker, api_key):
    """
    Fetches processed segmental data from a high-fidelity provider.
    """
    endpoint = f"https://api.theunibit.com/v1/segments/{ticker}"
    params = {'token': api_key, 'standardized': 'true'}
    response = requests.get(endpoint, params=params)
    data = response.json()
    # Validate the 'Unallocated' segment ratio
    unallocated = next((s for s in data if s['name'] == 'Unallocated'), None)
    if unallocated and (unallocated['revenue_share'] > 0.15):
        print(f"Warning: High Data Opacity for {ticker}")
    return data
Step-by-step Summary:
The script calls an external API to retrieve pre-cleaned data.
It checks for 'standardized' tags to ensure mapping has already occurred.
A validation step monitors 'Unallocated Corporate Assets' to detect data hiding.
This ensures that the primary sector assignment is based on 'Core Operations'.
By combining these technical tools with the regulatory knowledge of Ind AS 108, the researcher can build a classification engine that is far more accurate and responsive than traditional, slow-moving exchange categories. This data-driven approach is the cornerstone of modern alpha generation in Indian equity markets.
Database Structure & Storage Design
To institutionalize the “Segmental Revenue Rule,” a robust relational database architecture is required. This system must move beyond static CSV files to a temporal data model that tracks how a company’s identity evolves over decades. For a Python developer, this involves designing schemas that can handle “Point-in-Time” queries, ensuring that backtests of sector-rotation strategies are free from look-ahead bias.
Relational Schema for Sectoral Evolution
The database must be structured to separate fixed company metadata from time-varying segmental disclosures. This separation allows for efficient joining of tables when calculating aggregate sector weights or detecting business pivots.
- Table: Company_Metadata
  - ticker: Primary Key (e.g., RELIANCE)
  - cin: Corporate Identity Number (Unique Identifier)
  - current_sector: The latest assigned primary sector
  - is_active: Boolean flag for listed status
- Table: Segment_History
  - year: Fiscal year of disclosure
  - ticker: Foreign Key to Metadata
  - segment_name: Raw name from the annual report
  - standardized_tag: The mapped industry name (from Fuzzy Matching)
  - revenue: Numeric value of segmental turnover
  - ebitda: Segmental operating profit
  - capex: Capital expenditure for the segment
Python Implementation for Temporal Data Storage (SQLAlchemy)
```python
from sqlalchemy import Column, Integer, String, Float, Boolean
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class SegmentHistory(Base):
    """
    SQLAlchemy model representing historical segmental disclosures.
    Designed for Point-in-Time sectoral analysis.
    """
    __tablename__ = 'segment_history'

    id = Column(Integer, primary_key=True)
    ticker = Column(String(20), index=True)
    fiscal_year = Column(Integer)
    segment_name = Column(String(255))
    revenue = Column(Float)
    ebitda = Column(Float)
    capital_employed = Column(Float)
    is_primary = Column(Boolean, default=False)  # Result of the 50% Rule
```
Step-by-step Summary:
- The script defines a 'SegmentHistory' table using an ORM (Object-Relational Mapping).
- 'index=True' on the ticker column ensures high-speed querying for specific stocks.
- All primary financial metrics (revenue, EBITDA, capital employed) are stored as Floats for fast numeric aggregation.
- 'is_primary' allows the system to store the historical result of the assignment algorithm.
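A minimal sketch of how a Point-in-Time query against this table avoids look-ahead bias, using Python's built-in sqlite3 module for brevity (the table name and columns mirror the SQLAlchemy model above; the sample rows and the helper name primary_sector_as_of are illustrative):

```python
import sqlite3

# In-memory stand-in for the segment_history table defined above.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE segment_history (
        id INTEGER PRIMARY KEY,
        ticker TEXT,
        fiscal_year INTEGER,
        segment_name TEXT,
        revenue REAL,
        is_primary INTEGER DEFAULT 0
    )
""")
rows = [
    ('RELIANCE', 2015, 'Refining', 2400.0, 1),
    ('RELIANCE', 2023, 'Digital Services', 1200.0, 1),
]
conn.executemany(
    "INSERT INTO segment_history "
    "(ticker, fiscal_year, segment_name, revenue, is_primary) "
    "VALUES (?, ?, ?, ?, ?)", rows)

def primary_sector_as_of(conn, ticker, as_of_year):
    """Latest primary segment disclosed on or before `as_of_year`.
    Restricting to fiscal_year <= as_of_year is what removes look-ahead bias."""
    cur = conn.execute(
        "SELECT segment_name FROM segment_history "
        "WHERE ticker = ? AND fiscal_year <= ? AND is_primary = 1 "
        "ORDER BY fiscal_year DESC LIMIT 1",
        (ticker, as_of_year))
    row = cur.fetchone()
    return row[0] if row else None

# A backtest evaluated as of 2018 must see the 2015 disclosure,
# not the 2023 pivot to Digital Services.
print(primary_sector_as_of(conn, 'RELIANCE', 2018))
```

The same `fiscal_year <=` filter translates directly into a SQLAlchemy query against the SegmentHistory model.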
Final Compendium: Missed Algorithms & Curated Sources
In cases where the “Rule of 50” fails (e.g., three segments contributing 33% each), a tie-breaker is required. The Weighted Dominance Index (WDI) is the preferred algorithm. It evaluates not just the largest segment, but the gap between the top two segments to determine if a “Lead Sector” truly exists.
Mathematical Specification of the Tie-Breaker Algorithm (Dominance Gap)
The Dominance Gap measures the distance between the primary and secondary segments relative to the total business size:

Δdom = (R(1) − R(2)) / ∑R
Variable and Symbol Definitions:
- Δdom: The Dominance Gap Score.
- R(1): Revenue of the largest segment (Rank 1).
- R(2): Revenue of the second-largest segment (Rank 2).
- ∑R: Total Consolidated Revenue.
- Condition: If Δdom < 0.15 (15%), the company is automatically flagged as “Diversified” regardless of the largest segment’s size.
Python Tie-Breaker Logic for Multi-Segment Companies
```python
def apply_tie_breaker(segment_df):
    """
    Handles cases where no segment crosses 50%.
    Evaluates the 'Dominance Gap' between top contenders.
    """
    # Sort segments by revenue descending
    sorted_segs = segment_df.sort_values(by='revenue', ascending=False)
    total_rev = sorted_segs['revenue'].sum()
    r1 = sorted_segs.iloc[0]['revenue']
    r2 = sorted_segs.iloc[1]['revenue']

    # Calculate Dominance Gap (Δ_dom)
    dom_gap = (r1 - r2) / total_rev
    if dom_gap > 0.15:
        # If the gap is significant, assign to the leader
        return sorted_segs.iloc[0]['standardized_tag']
    else:
        # Otherwise, the entity is a true conglomerate
        return "Diversified"
```
Step-by-step Summary:
- The function ranks business segments by revenue volume.
- It calculates the spread between the two largest units (the Delta).
- A 15% gap acts as a buffer to prevent 'Sector Flipping' due to minor volatility.
- This ensures classification stability for index inclusion models.
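Applied across a whole universe, the same tie-breaker can run per ticker with a pandas groupby. A minimal, self-contained sketch (it repeats the tie-breaker logic so the snippet runs standalone; the tickers and revenue figures are illustrative):

```python
import pandas as pd

def tie_break_group(segment_df, gap_threshold=0.15):
    """Dominance-gap tie-breaker for one ticker's segments, as described above."""
    sorted_segs = segment_df.sort_values(by='revenue', ascending=False)
    if len(sorted_segs) < 2:  # single-segment companies need no tie-break
        return sorted_segs.iloc[0]['standardized_tag']
    total_rev = sorted_segs['revenue'].sum()
    r1 = sorted_segs.iloc[0]['revenue']
    r2 = sorted_segs.iloc[1]['revenue']
    if (r1 - r2) / total_rev > gap_threshold:
        return sorted_segs.iloc[0]['standardized_tag']
    return "Diversified"

# Illustrative universe: ABC is a three-way split, XYZ has a clear leader.
segments = pd.DataFrame({
    'ticker': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ'],
    'standardized_tag': ['Chemicals', 'Textiles', 'Realty', 'IT Services', 'BPO'],
    'revenue': [340.0, 330.0, 330.0, 700.0, 300.0],
})
labels = segments.groupby('ticker').apply(tie_break_group)
print(labels.to_dict())
```

ABC's 1% dominance gap falls inside the buffer and is flagged "Diversified", while XYZ's 40% gap assigns it cleanly to its lead segment.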
Curated Data & Official Sources
To maintain a high-quality classification framework, practitioners should cross-reference their Python outputs with the following official and technical sources:
- Official NIC Codes (MoSPI): The National Industrial Classification (NIC) provides the standardized hierarchy used by the Government of India for industrial surveys.
- SEBI Listing Regulations (LODR): Specifically Regulation 33, which mandates the submission of segmental financial results to the exchanges.
- News Triggers for Re-Classification: Monitor keywords such as “Demerger,” “Slump Sale,” “Asset Monetization,” and “Strategic Pivot” via Python-based news aggregators.
- Python-Friendly APIs:
- TheUniBit: For standardized segmental revenue and EBITDA data with high-frequency updates.
- Jugaad-Data: For direct library access to NSE/BSE metadata.
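The news-trigger idea from the list above can be sketched as a simple keyword scan (the headline strings and the scan function are illustrative; a production system would feed this from an aggregator API rather than a hard-coded list):

```python
import re

# Corporate-action keywords that typically precede a re-classification event.
RECLASS_TRIGGERS = ["demerger", "slump sale", "asset monetization", "strategic pivot"]

def scan_headlines(headlines):
    """Return (ticker, keyword) pairs for headlines mentioning a trigger.
    Headlines are (ticker, text) tuples from any news source."""
    hits = []
    for ticker, text in headlines:
        lowered = text.lower()
        for keyword in RECLASS_TRIGGERS:
            # Word boundaries avoid matching inside unrelated words.
            if re.search(r'\b' + re.escape(keyword) + r'\b', lowered):
                hits.append((ticker, keyword))
    return hits

sample = [
    ('ABCLTD', 'ABC Ltd board approves demerger of chemicals arm'),
    ('XYZLTD', 'XYZ Ltd reports quarterly results in line with estimates'),
]
print(scan_headlines(sample))
```

Each hit can then be routed into the Segment_History pipeline as a flag to re-run the primary sector assignment for that ticker.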
Conclusion: The Strategic Advantage of Segmental Mastery
The “Segmental Revenue Rule” is the definitive methodology for navigating the complexity of Indian equity markets. By moving beyond the surface-level sector tags provided by exchanges and applying a rigorous Python-centric workflow—Fetch, Store, Measure—investors can uncover the true economic identity of a company. Whether it is identifying a “Thematic Drift” in a chemical-to-EV pivot or front-running an index rebalancing event, the ability to quantitatively assign primary sectors is an indispensable skill for the modern alpha generator.
For those looking to bypass the complexity of manual scraping and fuzzy mapping, TheUniBit offers a comprehensive suite of financial data tools that provide institutional-grade segmental analysis at scale, empowering you to focus on the trading strategies that matter most.