Calculate Workdays Between Dates With Polars

by Pedro Alvarez 45 views

Hey everyone! Ever found yourself needing to calculate the number of workdays between two dates in a Polars DataFrame? It's a common task in data analysis, especially when dealing with timelines, project management, or financial data. In this article, we'll dive into how you can efficiently achieve this using Polars, a blazing-fast DataFrame library for Python. We'll break down the problem, explore different approaches, and provide you with practical code examples. So, let's get started and make those workday calculations a breeze!

Understanding the Challenge

Before we jump into the code, let's clarify what we mean by "workdays." Generally, workdays are weekdays, meaning Monday through Friday. We need to exclude weekends (Saturday and Sunday) from our calculations. Additionally, we might want to consider holidays, which are non-working days that fall on weekdays. Factoring in holidays adds another layer of complexity, but don't worry, we'll tackle that too!

The main challenge lies in efficiently determining the number of weekdays between two dates within a DataFrame. Traditional methods, like iterating through dates, can be slow, especially for large datasets. Polars, with its optimized columnar processing, offers a much faster and more elegant solution. We'll leverage Polars' powerful date and time functions along with its expression language to achieve our goal.

So, what are the specific scenarios we'll cover? First, we'll look at the basic case of calculating workdays without considering holidays. Then, we'll extend our solution to incorporate a list of holiday dates. We'll also discuss how to handle edge cases and potential performance optimizations. By the end of this article, you'll have a solid understanding of how to calculate workdays between dates in Polars and be equipped to apply this knowledge to your own projects.

Setting Up the Polars DataFrame

First, let's set the stage by creating a sample Polars DataFrame. This DataFrame will have two columns: a date column representing the start date and a maturity_date column representing the end date. We'll populate these columns with some example dates to work with.

import polars as pl
from datetime import date

df = pl.DataFrame({
    "date": [date(2024, 1, 15), date(2024, 2, 1), date(2024, 3, 10)],
    "maturity_date": [date(2024, 1, 31), date(2024, 2, 29), date(2024, 3, 20)],
})

print(df)

This code snippet creates a DataFrame with three rows and two date columns. The date column contains the start dates, and the maturity_date column contains the end dates. We'll use this DataFrame as the foundation for our workday calculations. Feel free to modify the dates to experiment with different scenarios.

Before we move on, let's take a moment to appreciate the elegance of Polars' DataFrame representation. The output is clean and easy to read, making it simple to inspect and verify our data. This is just one of the many reasons why Polars is a joy to work with.

Now that we have our DataFrame set up, we're ready to dive into the core of the problem: calculating the number of workdays between these dates. Let's explore the different approaches we can take, starting with the simplest case of excluding weekends.

Calculating Workdays Excluding Weekends

Now, let's get to the heart of the matter: calculating workdays! Our initial goal is to determine the number of workdays between the date and maturity_date columns, excluding weekends (Saturdays and Sundays). We'll achieve this using Polars' expression language, which allows us to perform calculations on entire columns efficiently.

The key idea here is to first calculate the total number of days between the two dates and then subtract the number of weekend days. Polars provides the pl.col("maturity_date") - pl.col("date") expression to get the difference between two date columns, resulting in a Duration object. We can then access the number of days using the .dt.days() method.

To count the weekend days, we can use a clever trick involving the dt.weekday() method, which returns the day of the week as an integer (0 for Monday, 6 for Sunday). By filtering the dates based on their weekday, we can count the number of Saturdays and Sundays within the date range.

Here's the code that puts it all together:

def calculate_workdays(df: pl.DataFrame) -> pl.DataFrame:
    return df.with_columns(
        workdays=((pl.col("maturity_date") - pl.col("date")).dt.days() - (
            ((pl.col("maturity_date").dt.weekday() - pl.col("date").dt.weekday()) % 7) // 5
        )).alias("workdays")
    )

df_with_workdays = calculate_workdays(df)
print(df_with_workdays)

Let's break down this code step by step:

  1. We define a function calculate_workdays that takes a Polars DataFrame as input and returns a new DataFrame with an additional workdays column.
  2. We use the with_columns method to add the new column. This method allows us to perform calculations and create new columns in a concise and efficient way.
  3. Inside with_columns, we define the expression for the workdays column. This expression is the core of our calculation.
  4. We start by calculating the total number of days between the maturity_date and date columns using (pl.col("maturity_date") - pl.col("date")).dt.days(). This gives us the total duration in days.
  5. Next, we subtract the number of weekend days. To do this, we use the formula `(((pl.col(