Implementing Spatial Lag Models in Python

Implementing spatial lag models in Python requires estimating a spatial autoregressive parameter (ρ\rho) that explicitly captures how neighboring observations directly influence the dependent variable. Unlike ordinary least squares (OLS), this specification accounts for spatial spillover effects, making it the standard choice for Spatial Regression Models where geographic proximity drives correlation. The production-ready approach relies on the spreg module within the PySAL ecosystem, which provides both maximum likelihood (ML) and generalized method of moments (GMM) estimators. You will need a GeoDataFrame, a properly constructed spatial weights matrix, and either spreg.ML_Lag (for small-to-medium datasets) or spreg.GM_Lag (for larger datasets or when endogeneity is a concern).

Environment Setup & Compatibility

Spatial lag estimation depends heavily on sparse linear algebra and eigenvalue decomposition. Ensure your environment meets these baseline requirements:

  • Python: 3.9+ (3.10+ recommended for libpysal performance optimizations)
  • PySAL Ecosystem: libpysal>=4.8.0, spreg>=1.5.0
  • Data Handling: geopandas>=0.14.0, pandas>=2.0.0, numpy>=1.24.0
  • Backend: scipy>=1.10.0 (required for sparse matrix inversion and Jacobian computation)

Install dependencies via:

bash
pip install geopandas libpysal spreg numpy scipy pandas

Compatibility Warning: spreg dropped legacy pysal v1.x syntax in v2.0. Always import directly from spreg, not pysal.spreg. Weight matrices must be row-standardized (transform='r') before passing to ML estimators to ensure ρ\rho remains bounded between 1-1 and 11. GMM estimators can accept raw contiguity but converge faster with standardized inputs. Consult the official spreg documentation for estimator-specific parameter mappings.

Step-by-Step Implementation

The pipeline below demonstrates a complete spatial lag specification using modern PySAL patterns. It assumes a GeoDataFrame (gdf) containing a dependent variable y, independent variables X1 and X2, and a valid geometry column.

python
import geopandas as gpd
import numpy as np
import pandas as pd
import libpysal
from spreg import ML_Lag, GM_Lag

# 1. Load and prepare data
# gdf = gpd.read_file("your_spatial_data.gpkg")
model_vars = ["y", "X1", "X2"]
df = gdf[model_vars].dropna().copy()

# 2. Build spatial weights matrix
# Queen contiguity is standard for polygon data; k-NN or distance bands work for points
w = libpysal.weights.Queen.from_dataframe(df, use_index=True)
w.transform = "r"  # Row-standardize (critical for ML_Lag interpretation)

# 3. Extract dependent and independent arrays
y = df["y"].values.reshape(-1, 1)
X = df[["X1", "X2"]].values

# 4. Fit Spatial Lag Model
# ML_Lag is efficient for N < 2000. GM_Lag scales better and handles endogenous weights.
try:
    model = ML_Lag(
        y, X, w,
        name_y="y",
        name_x=["X1", "X2"],
        name_w="Queen_Weights"
    )
    estimator_type = "ML"
except Exception as e:
    print(f"ML estimation failed ({e}). Falling back to GMM.")
    model = GM_Lag(
        y, X, w,
        name_y="y",
        name_x=["X1", "X2"],
        name_w="Queen_Weights"
    )
    estimator_type = "GMM"

# 5. Extract key results
print(f"\n--- {estimator_type} Spatial Lag Results ---")
print(f"Spatial Autoregressive Coefficient (ρ): {model.rho[0,0]:.4f}")
print(f"R²: {model.r2:.4f}")
print(f"AIC: {model.aic:.2f}")
print(f"Log-Likelihood: {model.ll:{estimator_type=='ML' and '.4f' or '.2f'}}")

Diagnostics & Interpretation

A spatial lag model is only valid if spatial dependence is statistically confirmed. Before trusting ρ\rho, run Lagrange Multiplier (LM) diagnostics to verify that the lag specification outperforms a spatial error alternative.

python
# Access diagnostic statistics (available in both ML and GM outputs)
lm_lag = model.lm_lag
lm_lag_robust = model.lm_lag_robust
p_value = lm_lag[1]

print(f"LM-Lag Statistic: {lm_lag[0]:.4f} (p={p_value:.4f})")
print(f"Robust LM-Lag: {lm_lag_robust[0]:.4f} (p={lm_lag_robust[1]:.4f})")

Interpretation Guidelines:

  • Significant LM-Lag (p<0.05p < 0.05): Confirms spatial dependence in the dependent variable. Proceed with the spatial lag specification.
  • Significant ρ\rho: Indicates direct spatial spillover. A positive ρ\rho means high values cluster with high neighbors; negative ρ\rho indicates dispersion.
  • Marginal Effects: Unlike OLS, coefficients in spatial lag models represent direct effects only. Total impact requires computing the spatial multiplier: (IρW)1X(I - \rho W)^{-1}X. Use spreg’s built-in spatial_lag_effects utilities or manually compute the Leontief inverse for policy simulations.

For deeper statistical validation, review the libpysal weights documentation to ensure your neighbor definition aligns with the underlying spatial process.

Production & Performance Guidelines

When deploying spatial regression pipelines in research or production environments, follow these optimization and validation rules:

  1. Estimator Selection: ML_Lag uses dense matrix operations and exact Jacobian computation. It becomes computationally prohibitive beyond N2,500N \approx 2,500. Switch to GM_Lag for larger datasets; it uses instrumental variables and sparse algebra to scale linearly.
  2. Row-Standardization Enforcement: Always apply w.transform = 'r' before ML estimation. Unstandardized weights distort the eigenvalue spectrum, causing ρ\rho to exceed theoretical bounds and producing biased standard errors.
  3. Island Handling: Isolated polygons (zero neighbors) break row-standardization. Resolve them by:
  • Assigning nearest neighbors: w = libpysal.weights.KNN.from_dataframe(df, k=1)
  • Using w.remap_ids() to drop islands before modeling
  • Applying w.transform = 'b' (binary) only if explicitly required by your theoretical framework
  1. Endogeneity Checks: If your spatial weights matrix is derived from the dependent variable (e.g., distance-decay based on outcome values), spatial lag becomes endogenous. Use GM_Lag with external instruments or switch to a spatial Durbin model.
  2. Workflow Integration: Embed spatial diagnostics directly into your CI/CD or notebook pipelines. Automated checks for model.lm_lag[1] < 0.05 prevent silent OLS mis-specification. This pattern aligns with standardized Python Workflows for Spatial Modeling & Regression where reproducibility and diagnostic gating are enforced before model deployment.

By adhering to these specifications, you ensure that your spatial lag implementation remains statistically rigorous, computationally efficient, and production-ready across varying dataset scales.