NCD Analytics Calculation Reference (Detailed)
==============================================

Purpose
-------
This file documents exactly how the NCD analytics results are calculated in the current Laravel + Python implementation.
Use this as a technical reference for medical stakeholders, QA, and data audits.

Code paths used
---------------
- Python pipeline: ncd_analysis.py
- SQL views: metrics_views.sql
- Laravel dashboard calculations: app/Http/Controllers/Dashboard1Controller.php
- NCD dashboard page: resources/views/ncd_dashboard.blade.php

Primary source tables
---------------------
1) ncd_pt_registers
- Key columns used:
  - Pid, FuchiaID, Clinic_code
  - Reg_Date
  - visit_Age, Current_Age
  - Gender
  - 1stBP, 2ndBP, 3rdBP
  - 1stHypertension, 2nd_Hypertension, staging_Hypertension
  - 1st_tot_Diabetes, 1st_RBS, 1st_RBS_date
  - 2nd_tot_Diabetes, 2nd_RBS, 2nd_RBS_date

2) ncd_followups
- Key columns used:
  - Pid, FuchiaID, Clinic_code
  - Visit_date, Next_Appointment
  - own_clinic_Bp, own_Bp_Stage
  - NCD_Diagnosis
  - FBS, FBS_test_date, Loaction_test
  - 2HPP, 2HPP_test_date, Loaction_Test2
  - HBA1C, Lab_res_Date
  - RBS result (legacy)
  - Creatinine, CRCL, Uring_AC_ratio, CVD_Risk
  - Medication changed, Patient_adhe medic, Drug_Supply
  - Foth_medi_spec
  - Out_come, Tout_mam_clinic, Ncd_Tout_icmv_location

3) patients
- Key columns used:
  - Pid
  - Date of Birth
  - Gender (used for gender correction on cumulative follow-up chart)

High-level processing flow
--------------------------
Step 1. Per clinic DB, create/update views:
- v_ncd_base_join: registration + patient join
- v_ncd_followups_clean: follow-up normalized columns + BP parse
- v_ncd_monthly_summary_seed: monthly seed counts

Step 2. Read view data into pandas:
- registers (from v_ncd_base_join)
- followups (from v_ncd_followups_clean)
- monthly seed

Step 3. Decrypt configured encrypted columns via Laravel helper:
- tools/laravel_decryptor.php
- Python sends batches to PHP and receives decrypted values.

Step 4. Data quality date limits:
- config.yaml -> data_quality.min_valid_date / max_valid_date
- Dates outside range are flagged.
- For core row date columns:
  - register row dropped if reg_date invalid range
  - follow-up row dropped if visit_date invalid range
- Other invalid date fields are nulled and tracked.

Step 5. Build core patient-level dataset:
- Latest follow-up per Pid is selected (max visit_date, tie by followup_id desc).
- This dataset is saved as patient_latest.csv and is used by many dashboard KPIs.

Step 6. Export clinic outputs:
- outputs/{DB}/registers_clean.csv
- outputs/{DB}/followups_clean.csv
- outputs/{DB}/patient_latest.csv
- plus trend and summary CSVs (listed below).

Step 7. Build combined overall output:
- Concatenate all clinic datasets (source_db retained).
- Recalculate metrics for outputs/overall.

Privacy and ID handling
-----------------------
- If privacy.mask_ids=true, stakeholder CSVs hash pid/patient_id/fuchia_id with SHA-256 + id_salt.
- Doctor list can keep full IDs using privacy.keep_full_ids_in_doctor_lists=true.
- For cumulative follow-up by gender, gender is remapped from patients table; encrypted Gender is decrypted with decrypt_light('General').

Core parsing rules
------------------
BP parsing
- Pattern: ^\\s*\\d{2,3}\\s*/\\s*\\d{2,3}\\s*$
- Valid range:
  - SBP: 50 to 300
  - DBP: 30 to 200
- Out-of-range BP is treated as invalid for control/staging.

BP stage classification
- Stage 3: SBP >= 180 OR DBP >= 110
- Stage 2: SBP >= 160 OR DBP >= 100
- Stage 1: SBP >= 140 OR DBP >= 90
- Normal: otherwise

Diabetes control hierarchy
- Per row, first available test in this order:
  1) HbA1c
  2) 2HPP
  3) FBS
  4) RBS (legacy fallback)
- Controlled thresholds (config.yaml):
  - HbA1c < 7.0
  - 2HPP < 180
  - FBS < 126
  - RBS < 200

Age derivation
- Primary: date_of_birth from patients + reference date
- Formula: floor((reference_date - dob in days) / 365.2425)
- Fallbacks when DOB unavailable:
  - current_age
  - visit_age
- Dashboard age distribution uses ncd_pt_registers.visit_Age and includes only ages 1..120.

Dashboard metrics: exact calculation steps
------------------------------------------

1) Yearly/Monthly cohort and diagnosis trend
Source:
- ncd_pt_registers (register rows within filters)

Grouping:
- Yearly: period = year(Reg_Date)
- Monthly: period = month(Reg_Date) for selected trend_year

Total cohort:
- Count of register rows per period.

Diagnosis flags:
- Hypertension diagnosis present if 1stHypertension contains New/Known/Know (case-insensitive normalized).
- Diabetes diagnosis present if 2nd_Hypertension contains New/Known/Know.
- Both = both flags true.
- No diagnosis = total cohort - (HTN only + DM only + both).

Gender segmentation:
- Male/Female split for HTN only, DM only, both, and no diagnosis.

2) Cumulative follow-up visits by year
Source:
- ncd_followups.Visit_date + NCD_Diagnosis

Year bucket:
- year(Visit_date)

Diagnosis category per visit:
- hypertension if text contains hypertension/htn
- diabetes if text contains diabetes/dm
- both if both terms appear (or contains "both")
- other for unknown/non-target

Two outputs are produced:
- Yearly raw counts by category
- Cumulative counts by category across years

3) Cumulative follow-up visits by year (gender)
Source:
- follow-up visits + gender remapped from patients table (preferred over followup gender text)

Per year:
- Male/Female counts are split by diagnosis category.
- Chart displays cumulative male/female totals over years.

4) Age distribution
Source:
- ncd_pt_registers.visit_Age

Rules:
- Use floor(age)
- Include only 1 <= age <= 120
- Bin size: 5 years
- Bins: 1-5, 6-10, ..., 116-120

5) LTFU from latest appointment (+84 days)
Source:
- For each Pid, use latest follow-up by Visit_date.

Classification logic:
- If Out_come indicates died/death/dead/tout/transfer out -> Exited
- Else if Next_Appointment missing -> Missing next appointment
- Else cutoff = Next_Appointment + grace_days (default 84)
  - cutoff < observe_date -> LTFU
  - otherwise -> Active

Observe date:
- current time in dashboard calculation (Laravel side), or configured report end in pipeline summaries.

Also provided:
- gender-stacked counts for Active/LTFU/Exited/Missing.

6) Clinic visit plan status by year
Source:
- Consecutive visit pairs per Pid in ncd_followups

For each current visit i:
- Needs Next_Appointment on visit i
- Compare to next visit i+1
- delta_days = next_visit_date - next_appointment_date

Status:
- Unplan (early): delta_days < 0
- Ontime: 0 <= delta_days <= 7
- Late: 8 <= delta_days < 84

Exclusions:
- Next_Appointment year < minYear (default 2020) excluded
- delta_days >= 84 excluded
- Missing Next_Appointment tracked
- No following visit tracked

7) Return to care by year
Source:
- Consecutive Visit_date pairs per Pid

Rule:
- If gap between consecutive visits >= 84 days, later visit counted as return-to-care event.

Output:
- yearly event counts
- summary: total events, unique patients with >=1 event, excluded pairs under 84 days.

8) Control status (BP stage change matrix)
Source:
- Baseline stage from register
- Latest stage from follow-up

Baseline stage extraction priority:
- 3rdBP -> 2ndBP -> 1stBP (first valid BP parse)
- fallback to staging_Hypertension text parse

Latest stage extraction priority:
- numeric sbp/dbp columns if valid
- fallback parse own_clinic_Bp
- fallback own_Bp_Stage text parse

Then for each patient:
- Require valid baseline and valid latest stage.
- Build 4x4 matrix: baseline stage (rows) vs latest stage (columns)
- Direction:
  - Improved if latest rank < baseline rank
  - Unchanged if equal
  - Worsened if latest rank > baseline rank

Data quality tracked:
- baseline missing/invalid
- latest missing/invalid
- top invalid raw examples (baseline and latest)

9) BP control %
Patient-level metric from patient_latest.csv:
- Denominator: patients with valid latest SBP and DBP (bp_with_values=true)
- Numerator: denominator patients with SBP < 140 AND DBP < 90 (bp_controlled=true)
- Rate = numerator / denominator * 100

Exclusion groups are tracked:
- missing BP
- invalid format
- out-of-range
- other

10) DM control %
Patient-level metric from patient_latest.csv:
- Denominator: patients with at least one usable DM test (dm_with_values=true)
- Numerator: denominator patients controlled by hierarchy (dm_controlled=true)
- Rate = numerator / denominator * 100

Hierarchy and source fields:
- HbA1c -> HBA1C/Lab_res_Date
- 2HPP -> 2HPP/2HPP_test_date/Loaction_Test2
- FBS -> FBS/FBS_test_date/Loaction_test
- RBS fallback -> RBS result (legacy)

11) Active caseload and LTFU (patient-level flags)
Reference date:
- report_end (selected end date or config/date bounds fallback)

Active patient:
- latest visit_date >= report_end - active_days

LTFU:
- latest visit_date < report_end - ltfu_days (or missing visit_date)

Current defaults:
- active_days = 180
- ltfu_days = 90

12) Missed appointment (visit-level and patient-level forms)
Visit-level continuity (displayed % in continuity section):
- Only active patients included.
- For each follow-up with Next_Appointment:
  - if no next visit exists -> missed
  - else if next visit date != Next_Appointment date -> missed
- Missed appointment rate = missed visits / total follow-up visits considered

Patient-level boolean (used in some summaries):
- true if patient has at least one missed appointment event using same rule.

13) Quality-of-care coverage
Source:
- follow-up rows, grouped by patient

Lookback windows from report_end:
- HbA1c: last 6 months
- Kidney (Creatinine/CRCL/Uring_AC_ratio): last 12 months
- CVD risk: last 12 months

For each patient:
- metric is true if any visit within lookback has non-missing value in that field.

Coverage rate:
- covered patients / total patients with follow-up records.

14) Risk stratification
From patient latest follow-up:
- High CVD risk: cvd_risk >= cvd_risk_high (default 20) or text contains "high"
- CKD marker:
  - creatinine > creatinine_high (default 1.3), OR
  - crcl < crcl_low (default 60), OR
  - uring_ac_ratio >= uring_ac_ratio_high (default 30)
- Diabetic foot / neuropathy / hypoglycemia flags:
  - yes/y/true/1/late/on treated as positive

15) Operations
- Medication changed rate:
  - numerator: latest rows where medication_changed is truthy
  - denominator: all latest rows
- Adherence and Drug supply distributions:
  - category counts from patient_adherence and drug_supply values

16) Other medications (Foth_medi_spec)
Source:
- ncd_followups.Foth_medi_spec

Normalization:
- remove dose/schedule/time tokens (mg, od/bd/tid, wk/weeks, etc.)
- tokenize and map aliases from config.yaml (e.g., ibuprofen -> ibu, multivitamin -> mv)
- count mentions and share_of_mentions.

17) Equity summaries
Grouped by:
- sex (gender)
- age band

For each group:
- patients count
- BP control rate
- DM control rate
- LTFU rate

18) Control improvement
Baseline status:
- earliest register per patient
- BP baseline from first valid of 1st/2nd/3rd BP
- DM baseline from first_dm_* then second_dm_*

Eligible for improvement:
- baseline uncontrolled AND has latest valid measurement

Rate:
- improved / eligible * 100

19) Sustained control
For each patient and condition:
- need at least 2 valid follow-up measurements
- sustained if last two are both controlled

Rate:
- sustained / eligible * 100

Export files and what they represent
------------------------------------
Per clinic in outputs/{DB}/:
- registers_clean.csv: cleaned register-level dataset
- followups_clean.csv: cleaned follow-up-level dataset
- patient_latest.csv: one latest follow-up record per patient (main denominator table)
- monthly_summary.csv: registrations and follow-ups by month
- bp_control_trend.csv: monthly BP control trend
- dm_control_trend.csv: monthly DM control trend
- continuity_metrics.csv: active/ltfu/missed/late counts + rates
- quality_metrics.csv: lab coverage metrics
- risk_metrics.csv: risk markers and symptoms
- operations_metrics.csv: medication changed summary
- operations_distributions.csv: adherence + drug supply distributions
- other_medications.csv: normalized Foth_medi_spec distribution
- equity_bp_control.csv, equity_dm_control.csv: equity tables
- kpi_summary.csv: top KPI table
- data_quality_report.csv: quality counters
- data_quality/invalid_date_ranges.csv: row-level invalid date list
- doctor/doctor_action_lists.csv: clinical follow-up list

Overall outputs in outputs/overall/:
- Same structure computed on combined clinic data.
- Includes kpi_summary_by_clinic.csv and overall invalid_date_ranges.csv.

Important denominator notes
---------------------------
1) "Patients" in KPI cards:
- count of rows in patient_latest (latest-follow-up cohort), not all raw attendance rows.

2) BP/DM control denominator:
- not all patients; only those with valid measurements for that metric.

3) Visit-level missed appointment percentage:
- denominator is follow-up visits assessed among active patients.
- this can differ from patient-level missed count.

Current key default parameters (config.yaml)
--------------------------------------------
- date_range: 2015-01-01 to 2025-12-31
- data_quality.min_valid_date: 2018-01-01
- data_quality.max_valid_date: 2025-12-31
- active_days: 180
- ltfu_days: 90
- BP control: <140/<90
- DM thresholds: HbA1c<7, 2HPP<180, FBS<126, RBS<200

Audit/troubleshooting locations
-------------------------------
- Background status: storage/app/ncd_analysis_status.json
- Background log: storage/logs/ncd_analysis_background.log
- Last outputs timestamp: derived from latest CSV mtime in outputs/{clinic}/

Notes
-----
- Unknown/NA/null-like values are treated as missing in most calculations.
- Date filtering and trend filters are applied before most metrics.
- If a metric looks inconsistent, first compare:
  - selected clinic
  - timeframe / trend year
  - denominator definition (patient-level vs visit-level).