Explaining the migration path for Copy-on-Write
Introduction
The introduction of Copy-on-Write (CoW) is a breaking change that will have some impact on existing pandas code. We will examine how we can adapt our code to avoid errors when CoW becomes enabled by default. This is currently planned for the pandas 3.0 release, which is scheduled for April 2024. The first post in this series explained the behavior of Copy-on-Write while the second post dove into performance optimizations that are related to Copy-on-Write.
We are planning on adding a warning mode that will warn for all operations that change behavior with CoW. The warning will be very noisy for users and thus has to be treated with some care. This post explains common cases and how you can adapt your code to avoid changes in behavior.
Chained assignment
Chained assignment is a technique where one object is updated through two subsequent operations.
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3]})
df["x"][df["x"] > 1] = 100
The first operation selects the column "x" while the second operation restricts the number of rows. There are many different combinations of these operations (e.g. combined with loc or iloc). None of these combinations will work under CoW. Instead, they will raise a ChainedAssignmentError warning to remove these patterns instead of silently doing nothing.
Generally, you can use loc instead:
df.loc[df["x"] > 1, "x"] = 100
The first dimension of loc always corresponds to the row-indexer. This means that you can select a subset of rows. The second dimension corresponds to the column-indexer, which enables you to select a subset of columns.
It is generally faster to use loc when you want to set values into a subset of rows, so this will clean up your code and provide a performance improvement.
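A minimal sketch of the recommended pattern, using the same DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Chained assignment like df["x"][df["x"] > 1] = 100 raises a
# ChainedAssignmentError warning under CoW and leaves df unchanged.
# The loc equivalent performs the update in a single operation:
df.loc[df["x"] > 1, "x"] = 100
print(df["x"].tolist())  # [1, 100, 100]
```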
This is the obvious case where CoW will have an impact. It will also affect chained inplace operations:
df["x"].replace(1, 100, inplace=True)
The pattern is the same as above. The column selection is the first operation. The replace method tries to operate on the temporary object, which will fail to update the initial object. You can also remove these patterns quite easily by specifying the columns you want to operate on:
df = df.replace({"x": 1}, {"x": 100})
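A short sketch showing that the dictionary form restricts the replacement to the intended column and returns a new object instead of mutating a temporary one (the second column "y" is added here purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [1, 2, 3]})

# Per-column dictionaries: replace the value 1 with 100, but only in "x".
df = df.replace({"x": 1}, {"x": 100})
print(df["x"].tolist())  # [100, 2, 3]
print(df["y"].tolist())  # [1, 2, 3] -- untouched
```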
Patterns to avoid
My previous post explains how the CoW mechanism works and how DataFrames share the underlying data. A defensive copy will be performed if two objects share the same data while you are modifying one object inplace.
df2 = df.reset_index()
df2.iloc[0, 0] = 100
The reset_index operation will create a view of the underlying data. The result is assigned to a new variable df2, which means that two objects share the same data. This holds true until df is garbage collected. The setitem operation will thus trigger a copy. This is completely unnecessary if you don't need the initial object df anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.
df = df.reset_index()
df.iloc[0, 0] = 100
Summarizing, creating multiple references in the same method keeps unnecessary references alive.
Temporary references that are created when chaining different methods together are fine.
df = df.reset_index().drop(...)
This will only keep one reference alive.
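A small sketch of the method-chaining pattern; the column names here are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 4]}, index=[10, 11])

# The intermediate object produced by reset_index() is a short-lived
# temporary; only the final result is bound to df, so only one
# reference stays alive after this line.
df = df.reset_index().drop(columns=["index"])
print(df.columns.tolist())  # ['a']
```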
Accessing the underlying NumPy array
pandas currently gives us access to the underlying NumPy array through to_numpy or .values. The returned array is a copy if your DataFrame consists of different dtypes, e.g.:
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
df.to_numpy()

[[1.  1.5]
 [2.  2.5]]
The DataFrame is backed by two arrays which have to be combined into one. This triggers the copy.
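A quick sketch confirming the copy: modifying the returned array does not touch the DataFrame, because the mixed dtypes forced an allocation.

```python
import pandas as pd

# Mixed dtypes: an int64 and a float64 column cannot share one array,
# so to_numpy() has to allocate and copy into a common float64 array.
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
arr = df.to_numpy()
print(arr.dtype)  # float64

arr[0, 0] = 100.0        # modifies only the copy
print(df.loc[0, "a"])    # 1 -- the DataFrame is unaffected
```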
The other case is a DataFrame that is only backed by a single NumPy array, e.g.:
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_numpy()

[[1 3]
 [2 4]]
We can directly access the array and get a view instead of a copy. This is much faster than copying all the data. We can now operate on the NumPy array and potentially modify it inplace, which will also update the DataFrame and potentially all other DataFrames that share data. This becomes much more complicated with Copy-on-Write, since we removed many defensive copies. Many more DataFrames will now share memory with each other.
to_numpy and .values will return a read-only array because of this. This means that the resulting array is not writeable.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy()
arr[0, 0] = 1
This will trigger a ValueError:
ValueError: assignment destination is read-only
You can avoid this in two different ways:
- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution:

arr.flags.writeable = True
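A minimal sketch of the first option, triggering the copy manually; this works the same way regardless of whether CoW is enabled:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Copy explicitly: the result is always writeable and never shares
# memory with the DataFrame, so no CoW rules are circumvented.
arr = df.to_numpy().copy()
arr[0, 0] = 100          # safe: only the copy changes
print(df.loc[0, "a"])    # 1 -- the DataFrame is unaffected
```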
There are cases where this is not possible. One common occurrence is when you are accessing a single column which is backed by PyArrow:

ser = pd.Series([1, 2], dtype="int64[pyarrow]")
arr = ser.to_numpy()
arr.flags.writeable = True
This returns a ValueError:
ValueError: cannot set WRITEABLE flag to True of this array
Arrow arrays are immutable, hence it is not possible to make the NumPy array writeable. The conversion from Arrow to NumPy is zero-copy in this case.
Conclusion
We looked at the most invasive Copy-on-Write related changes. These changes will become the default behavior in pandas 3.0. We also investigated how we can adapt our code to avoid breaking it when Copy-on-Write is enabled. The upgrade process should be quite smooth if you can avoid these patterns.