How to Build Analysis Pipelines with Dplyr-style Operations
============================================================

**Goal:** Chain operations to build reproducible analysis workflows.

**Prerequisites:** A loaded network (see :doc:`load_and_build_networks`).

Basic Pipeline
--------------

Node Operations
~~~~~~~~~~~~~~~

.. code-block:: python

    from py3plex.core import multinet
    from py3plex.graph_ops import nodes

    network = multinet.multi_layer_network()
    network.load_network("data.multiedgelist", input_type="multiedgelist")

    # Build pipeline
    result = (
        nodes(network)
        .filter(lambda n: n["degree"] > 2)
        .mutate(score=lambda n: n["degree"] * 2)
        .arrange("degree", reverse=True)
        .to_pandas()
    )

    print(result.head())

Available Operations
--------------------

filter()
~~~~~~~~

Select nodes based on a condition:

.. code-block:: python

    result = (
        nodes(network)
        .filter(lambda n: n["layer"] == "friends")
        .filter(lambda n: n["degree"] > 5)
        .to_pandas()
    )

mutate()
~~~~~~~~

Add or modify columns:

.. code-block:: python

    result = (
        nodes(network)
        .mutate(
            degree_squared=lambda n: n["degree"] ** 2,
            is_hub=lambda n: n["degree"] > 10
        )
        .to_pandas()
    )

select()
~~~~~~~~

Choose specific columns:

.. code-block:: python

    result = (
        nodes(network)
        .select("node", "layer", "degree")
        .to_pandas()
    )

arrange()
~~~~~~~~~

Sort results:

.. code-block:: python

    result = (
        nodes(network)
        .arrange("degree", reverse=True)  # Descending
        .to_pandas()
    )

group_by() and summarize()
~~~~~~~~~~~~~~~~~~~~~~~~~~

Aggregate data:

.. code-block:: python

    result = (
        nodes(network)
        .group_by("layer")
        .summarize(
            avg_degree=lambda g: g["degree"].mean(),
            max_degree=lambda g: g["degree"].max(),
            count=lambda g: len(g)
        )
        .to_pandas()
    )

**Expected output:**

.. code-block:: text

         layer  avg_degree  max_degree  count
    0  friends        3.45          12     46
    1     work        2.87           8     46
    2   family        4.12          15     42

Complex Pipelines
-----------------

Multi-Step Analysis
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    result = (
        nodes(network)
        # Step 1: Filter active nodes
        .filter(lambda n: n["layer_count"] > 1)
        # Step 2: Add computed score
        .mutate(
            activity_score=lambda n: n["degree"] * n["layer_count"]
        )
        # Step 3: Filter by score
        .filter(lambda n: n["activity_score"] > 20)
        # Step 4: Sort by score
        .arrange("activity_score", reverse=True)
        # Step 5: Get top 20
        .head(20)
        # Convert to pandas
        .to_pandas()
    )

Combining with DSL
------------------

Mix dplyr-style operations with the DSL:

.. code-block:: python

    from py3plex.dsl import Q
    from py3plex.graph_ops import nodes

    # Use the DSL to compute metrics
    dsl_result = (
        Q.nodes()
        .compute("degree", "betweenness_centrality")
        .execute(network)
    )

    # Use dplyr-style operations to filter and transform
    final_result = (
        nodes(dsl_result)
        .filter(lambda n: n["degree"] > 5)
        .mutate(
            centrality_rank=lambda n: n["betweenness_centrality"] * 100
        )
        .arrange("centrality_rank", reverse=True)
        .to_pandas()
    )

Sklearn-Style Pipelines
-----------------------

For machine learning workflows:

.. code-block:: python

    from py3plex.pipeline import NetworkPipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Define the pipeline
    pipeline = NetworkPipeline([
        ('features', 'compute_features'),   # Extract features
        ('scaler', StandardScaler()),       # Normalize
        ('cluster', KMeans(n_clusters=5))   # Cluster
    ])

    # Fit and predict
    labels = pipeline.fit_predict(network)
    print(f"Assigned {len(set(labels))} clusters")

Exporting Pipelines
-------------------

Save Pipeline Definition
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import json

    pipeline_config = {
        'steps': [
            {'operation': 'filter', 'condition': 'degree > 2'},
            {'operation': 'mutate', 'column': 'score', 'expr': 'degree * 2'},
            {'operation': 'arrange', 'by': 'score', 'reverse': True}
        ]
    }

    with open('pipeline.json', 'w') as f:
        json.dump(pipeline_config, f, indent=2)

Load and Execute
~~~~~~~~~~~~~~~~

.. code-block:: python

    with open('pipeline.json', 'r') as f:
        config = json.load(f)

    # Execute the pipeline described by the config
    result = execute_pipeline(network, config)
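The snippet above calls an ``execute_pipeline`` helper that is not defined in this guide. If your py3plex version does not ship such a function, a minimal sketch along the following lines can replay the saved steps. The config keys match the JSON written in the previous section; the use of ``eval`` and the assumption that each row passed to the lambdas behaves like a mapping of attribute names are simplifications for illustration, not part of the py3plex API.

.. code-block:: python

    from py3plex.graph_ops import nodes

    def execute_pipeline(network, config):
        """Replay a saved pipeline config against a network (illustrative sketch)."""
        pipeline = nodes(network)
        for step in config["steps"]:
            op = step["operation"]
            if op == "filter":
                # Evaluate the saved condition with the row's attributes as locals,
                # e.g. "degree > 2"; eval() keeps the sketch short, a real
                # implementation would want a safer expression parser.
                cond = step["condition"]
                pipeline = pipeline.filter(lambda n, cond=cond: eval(cond, {}, dict(n)))
            elif op == "mutate":
                expr = step["expr"]
                pipeline = pipeline.mutate(
                    **{step["column"]: lambda n, expr=expr: eval(expr, {}, dict(n))}
                )
            elif op == "arrange":
                pipeline = pipeline.arrange(step["by"], reverse=step.get("reverse", False))
            else:
                raise ValueError(f"Unknown pipeline operation: {op!r}")
        return pipeline.to_pandas()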
Reusable Pipelines
------------------

Create Pipeline Functions
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    def identify_hubs(network, threshold=10):
        """Reusable pipeline to identify hub nodes."""
        return (
            nodes(network)
            .filter(lambda n: n["degree"] > threshold)
            .mutate(hub_score=lambda n: n["degree"] * n["layer_count"])
            .arrange("hub_score", reverse=True)
            .to_pandas()
        )

    # Use it
    hubs = identify_hubs(network, threshold=5)
    print(hubs)

Next Steps
----------

* **Query with DSL:** :doc:`query_with_dsl`
* **Complete workflows:** :doc:`reproduce_workflows`
* **API reference:** :doc:`../reference/api_index`