Pyspawner

Subprocess that spawns children quickly, using clone().

How it works

Read an Explainer.

How to use

Create a pyspawner.Client that preloads the “common” Python modules your sandboxed code will import. (These import statements aren’t sandboxed, so be sure you trust the Python modules.)

Then call pyspawner.Client.spawn_child() each time you want to create a new child. It will invoke the pyspawner’s child_main function with the given arguments.
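As a rough illustration of the child side, here is what a child_main function could look like. The module and function names are hypothetical, and the sketch assumes each element of spawn_child(args=[...]) arrives as a positional argument; check the pyspawner source before relying on that.

```python
import sys


def main(*args: object) -> None:
    """Hypothetical child_main: runs inside each sandboxed child."""
    # Assumption: the list passed as spawn_child(args=[...]) is unpacked
    # into positional arguments.
    print(f"child received {len(args)} argument(s): {args!r}")
    # The parent can read this from ChildProcess.stderr.
    print("diagnostics go to stderr", file=sys.stderr)
```

If this lived in mymodule.py, you would pass child_main="mymodule.main" to pyspawner.Client.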

Here’s pseudo-code for invoking the pyspawner part:

from pathlib import Path

import pyspawner

# pyspawner.Client() is slow; ideally, you'll just call it during startup.
with pyspawner.Client(
    child_main="mymodule.main",
    environment={"LC_ALL": "C.UTF-8"},
    preload_imports=["pandas"],  # put all your slow imports here
) as cloner:
    # cloner.spawn_child() is fast; call it as many times as you like.
    child_process: pyspawner.ChildProcess = cloner.spawn_child(
        args=["arg1", "arg2"],  # List of picklable Python objects
        process_name="child-1",
        sandbox_config=pyspawner.SandboxConfig(
            chroot_dir=Path("/path/to/chroot/dir"),
            network=pyspawner.NetworkConfig()
        )
    )

    # child_process has .pid, .stdin, .stdout, .stderr.
    # Read from its stdout and stderr, and then wait for it.

For each child, read from stdout and stderr until end-of-file; then wait() for the process to exit. Reading from two pipes at once is a standard exercise in UNIX, so the minutiae are left as an exercise. A safe approach:

  1. Register both stdout and stderr in a selectors.DefaultSelector

  2. Loop, calling selectors.BaseSelector.select() and reading from whichever file descriptors have data. Unregister each file descriptor when it reaches EOF, and read but _ignore_ data past a predetermined buffer size. Kill the child process if this is taking too long. (Keep reading after killing the child to avoid deadlock.)

  3. Wait for the child process (using os.waitpid()) to clean up its system resources.
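Steps 1 and 2 can be sketched as follows. This is one possible approach, not code from pyspawner: drain_pipes, max_bytes, timeout and on_timeout are names invented for this example.

```python
import os
import selectors
import time
from typing import BinaryIO, Callable, Optional, Tuple


def drain_pipes(
    stdout: BinaryIO,
    stderr: BinaryIO,
    max_bytes: int = 1024 * 1024,
    timeout: Optional[float] = None,
    on_timeout: Optional[Callable[[], None]] = None,
) -> Tuple[bytes, bytes]:
    """Read both pipes until EOF, keeping at most max_bytes from each."""
    buffers = {stdout.fileno(): bytearray(), stderr.fileno(): bytearray()}
    deadline = None if timeout is None else time.monotonic() + timeout
    with selectors.DefaultSelector() as selector:
        selector.register(stdout, selectors.EVENT_READ)
        selector.register(stderr, selectors.EVENT_READ)
        n_open = 2
        while n_open > 0:
            remaining = None
            if deadline is not None:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    # Too slow: let the caller kill the child, then keep
                    # reading (without a deadline) until both pipes hit EOF.
                    if on_timeout is not None:
                        on_timeout()
                    deadline = remaining = None
            for key, _events in selector.select(remaining):
                # os.read() bypasses any Python-level buffering on the file.
                chunk = os.read(key.fd, 65536)
                if not chunk:  # EOF on this pipe
                    selector.unregister(key.fileobj)
                    n_open -= 1
                    continue
                buffer = buffers[key.fd]
                # Read, but ignore, anything past the size limit.
                buffer.extend(chunk[: max(0, max_bytes - len(buffer))])
    return bytes(buffers[stdout.fileno()]), bytes(buffers[stderr.fileno()])
```

With a pyspawner.ChildProcess, a call might look like drain_pipes(child_process.stdout, child_process.stderr, timeout=30, on_timeout=child_process.kill), followed by child_process.wait(0) for step 3.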

Setting up your environment

Your system must have libcap.so.2 installed. In Debian, the libcap2 package provides it.

Pyspawner relies on Linux’s clone() system call to create child-process containers. If you’re using pyspawner from a Docker container, subcontainers are disabled by default. Run Docker with --security-opt seccomp=/path/to/pyspawner/docker/pyspawner-seccomp-profile.json to allow creating subcontainers.

By default, sandboxed children cannot access the Internet. If you want to enable networking for child processes, ensure your process has the CAP_NET_ADMIN capability (docker run --cap-add NET_ADMIN ...). Also, you’ll need to configure NAT in the parent-process environment, which is beyond the scope of this README. Finally, you may want to supply a chroot_dir to give child processes a custom /etc/resolv.conf.

Ideally, sandboxed children would not be able to write anywhere on the main filesystem. Unfortunately, the umount() and pivot_root() system calls are restricted in many environments. As a placeholder, you’re encouraged to supply a chroot_dir to provide an environment for your sandboxed child code. chroot_dir must be in a separate filesystem from the root filesystem. (In the future, when the Linux container ecosystem evolves enough, chroot_dir will make children unmount the root filesystem.) Again, chroot is beyond the scope of this README.

class pyspawner.ChildProcess(pid: int, stdin: BinaryIO, stdout: BinaryIO, stderr: BinaryIO)

A handle for the parent to interact with a spawned child process.

This is akin to a subprocess.Popen object … but with fewer features. (Rationale: subprocess.Popen has too many features.)

kill()

Terminate the child process with SIGKILL.

Return type

None

pid: int

Child process ID as seen from the parent.

(The child process will see its own ID as 1.)

stderr: BinaryIO

Readable pipe, written in the child as sys.stderr.

stdin: BinaryIO

Writable pipe, readable in the child as sys.stdin.

stdout: BinaryIO

Readable pipe, written in the child as sys.stdout.

wait(options)

Wait for the child process to complete.

You must call this for every child process. Otherwise, children will become zombie processes when they terminate, consuming system resources.

Return type

Tuple[int, int]
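The return value is not described further here. Assuming it follows the os.waitpid() convention of (pid, status), the status can be decoded with the standard os.W* helpers. describe_wait_status is a hypothetical helper written for this example.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Turn a raw wait status (as from os.waitpid()) into readable text."""
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        return f"killed by {signal.Signals(os.WTERMSIG(status)).name}"
    return f"unrecognized wait status {status}"
```

Under that assumption, usage would be along the lines of _, status = child_process.wait(0) followed by describe_wait_status(status).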

class pyspawner.Client(*, child_main, environment={}, preload_imports=[], executable=sys.executable)

Launch Python quickly, sharing most memory pages.

The problem this solves: we want to spin up many children quickly; but as soon as a child starts running we can’t trust it. Starting Python with lots of imports like Pyarrow+Pandas can take ~2s and cost ~100MB RAM.

The solution: a mini-server process, the “pyspawner”, preloads Python modules. Then we clone() each time we need a subprocess. (clone() is near-instantaneous.) Beware: since clone() copies all memory, the “pyspawner” shouldn’t load anything sensitive before clone(). (No Django: it reads secrets!)

This is similar to Python’s multiprocessing.forkserver, except…:

  • Children are not managed. It’s up to the caller to kill and wait for the process. Children are direct children of the _caller_, not of the pyspawner. (We use CLONE_PARENT.)

  • asyncio-safe: we don’t listen for SIGCHLD, because asyncio’s subprocess-management routines override the signal handler.

  • Thread-safe: multiple threads may spawn multiple children, and they may all run concurrently (unless child code writes files or uses networking).

  • No multiprocessing.context. This Client is the context.

  • No Connection (or other high-level constructs).

  • The caller interacts with the pyspawner process via an _unnamed_ AF_UNIX socket, rather than a named socket. (multiprocessing writes a pipe to /tmp.) No messing with hmac. Instead, we mess with locks. (“Aren’t locks worse?” – [2019-09-30, adamhooper] probably not, because clone() is fast; and multiprocessing and asyncio have a race in Python 3.7.4 that causes forkserver children to exit with status code 255, so their named-pipe+hmac approach does not inspire confidence.)
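The preload-then-clone idea can be illustrated with a small stand-alone sketch. It uses os.fork() as a stand-in for clone() and does no sandboxing at all, so it shows only why the child starts fast; it is not how pyspawner is implemented. preload and fork_child are hypothetical names.

```python
import importlib
import os
import sys
from typing import Callable, Iterable, Tuple


def preload(module_names: Iterable[str]) -> None:
    """Pay the import cost once, in the long-lived parent."""
    for name in module_names:
        importlib.import_module(name)


def fork_child(child_main: Callable[[], str]) -> Tuple[bytes, int]:
    """Fork a child that inherits every preloaded module; return its output."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: sys.modules already holds the parent's imports.
        exit_code = 1
        try:
            os.close(read_fd)
            os.write(write_fd, child_main().encode("utf-8"))
            exit_code = 0
        finally:
            os._exit(exit_code)  # never fall back into the parent's code
    # Parent: read until EOF, then reap the child to avoid a zombie.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        output = reader.read()
    _, status = os.waitpid(pid, 0)
    return output, status
```

For example, after preload(["json"]), calling fork_child(lambda: str("json" in sys.modules)) runs a child that finds json already imported.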

Parameters
  • child_main (str) – The full name (including module name) of the function each child should run. (Must be importable.)

  • environment (Dict[str, str]) – Environment variables for child processes. (Must all be str.)

  • preload_imports (List[str]) – List of module names pyspawner should import at startup. These modules (plus pyspawner’s internal imports) will be preloaded in all child processes.

  • executable (str) – Python executable to invoke. (Default: current-process executable).

close()

Kill the pyspawner.

Spawned child processes continue to run: they are entirely disconnected from their pyspawner.

Return type

None

spawn_child(args=[], *, process_name=None, sandbox_config)

Make our server spawn a process, and return it.

Parameters
  • args (List[Any]) – List of arguments to pass to the child-process function. (Must be picklable.)

  • process_name (Optional[str]) – Process name to display for the child process in ps and other sysadmin tools. (Useful for debugging.)

  • sandbox_config (pyspawner.SandboxConfig) – Sandbox settings.

Raises
  • OSError – if the clone() system call fails.

  • pyroute2.NetlinkError – if network configuration fails.

Return type

pyspawner.ChildProcess

class pyspawner.NetworkConfig(kernel_veth_name: str = 'veth-pyspawn', child_veth_name: str = 'veth-pyspawn-c', kernel_ipv4_address: str = '192.168.123.1', child_ipv4_address: str = '192.168.123.2')

Network configuration that lets children access the Internet.

Pyspawner will create a veth interface that may be used to route traffic from the child to the Internet via network address translation (NAT). You must write the iptables rules yourself! pyspawner does not invoke iptables! The intent is for you to set up iptables rules once, and then reuse the same rules for every clone.

One iptables rule to route network traffic from a child process to the Internet:

iptables -t nat -A POSTROUTING -s [child_ipv4_address] -j SNAT --to-source=[our IP address]

You should also firewall the traffic to secure the rest of your network from sandboxed processes. See tests/setup-sandbox.sh for a minimal set of iptables rules.
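If you manage the rule from Python, one option is to build the argument list and hand it to subprocess yourself. The helper below is a hypothetical sketch, not part of pyspawner; it only constructs the command and does not run it.

```python
import ipaddress
from typing import List


def snat_rule_argv(child_ipv4_address: str, source_ipv4_address: str) -> List[str]:
    """Build the iptables SNAT command as an argv list (nothing is executed)."""
    # ipaddress raises ValueError on malformed input, so a typo fails early.
    child = ipaddress.IPv4Address(child_ipv4_address)
    source = ipaddress.IPv4Address(source_ipv4_address)
    return [
        "iptables", "-t", "nat",
        "-A", "POSTROUTING",  # -A appends the rule to the chain
        "-s", str(child),
        "-j", "SNAT",
        "--to-source", str(source),
    ]
```

The result can be passed to subprocess.run(argv, check=True) from a process that is allowed to modify iptables.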

We do not yet support IPv6, because Kubernetes support is shaky. Follow https://github.com/kubernetes/kubernetes/issues/62822.

Here’s how networking works. When cloning, the child process gets a new, anonymous network namespace. pyspawner creates a veth pair, and it passes the “child” veth interface to the child process. The child process brings up its network interface and can only see the public Internet.

After the child dies, the Linux kernel will delete the network interface. (There’s a bit of a race here: the interface may exist a few milliseconds after the child dies. To compensate, pyspawner explicitly ensures any leftover interface of the same name is deleted before creating a new one.)

Beware if running multiple children at once that all access the Internet. Each must have a unique interface name and IP addresses.

The default values match those in tests/setup-sandbox.sh. Don’t edit one without editing the other.
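When running several networked children at once, you need some scheme for handing out unique names and addresses. The sketch below is one hypothetical scheme (one /24 per child index, carved out of 192.168.0.0/16); the function name, the naming pattern and the address range are all assumptions, and your iptables rules would need to cover whatever range you choose.

```python
import ipaddress
from typing import Dict

MAX_IFNAME_LEN = 15  # longer interface names fail with NetlinkError 34


def network_params_for_child(
    index: int, base_network: str = "192.168.0.0/16"
) -> Dict[str, str]:
    """Return unique NetworkConfig keyword arguments for child number `index`."""
    subnets = list(ipaddress.ip_network(base_network).subnets(new_prefix=24))
    if not 0 <= index < len(subnets):
        raise ValueError(f"index must be in [0, {len(subnets)}), got {index}")
    subnet = subnets[index]
    names = {
        "kernel_veth_name": f"veth-ps-{index}",
        "child_veth_name": f"veth-ps-{index}-c",
    }
    for name in names.values():
        if len(name) > MAX_IFNAME_LEN:
            raise ValueError(f"interface name too long: {name!r}")
    return {
        **names,
        # Both addresses sit in the same /24, as child_ipv4_address requires.
        "kernel_ipv4_address": str(subnet.network_address + 1),
        "child_ipv4_address": str(subnet.network_address + 2),
    }
```

The keys match the NetworkConfig fields, so pyspawner.NetworkConfig(**network_params_for_child(3)) would build a config for the fourth child.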

child_ipv4_address: str = '192.168.123.2'

IPv4 address of the child.

The kernel will maintain iptables rules to route from this IP address to the public Internet.

This must be in the same /24 network block as kernel_ipv4_address.

child_veth_name: str = 'veth-pyspawn-c'

Name of veth interface run by the child.

Maximum length is 15 characters. Any longer gives NetlinkError 34.

This name must not conflict with any other network device in the kernel’s container. (The kernel creates this device before sending it into the child’s network namespace.)

kernel_ipv4_address: str = '192.168.123.1'

IPv4 address of the kernel.

This must not conflict with any other IP address in the kernel’s container.

This should be a private address. Be sure it doesn’t conflict with your network’s addresses. Kubernetes uses 10.0.0.0/8; Docker uses 172.16.0.0/12. The hard-coded “192.168.123/24” should be safe for Docker and Kubernetes.

The child will use this address as its default gateway.

kernel_veth_name: str = 'veth-pyspawn'

Name of veth interface run by the kernel.

Maximum length is 15 characters. Any longer gives NetlinkError 34.

This name must not conflict with any other network device in the kernel’s container.

class pyspawner.SandboxConfig(chroot_dir: Union[pathlib.Path, NoneType] = None, network: Union[pyspawner.sandbox.NetworkConfig, NoneType] = None, skip_sandbox_except: FrozenSet[str] = <factory>)

chroot_dir: Optional[pathlib.Path] = None

Setting for “chroot” security layer.

If chroot_dir is set, it must point to a directory on the filesystem. Remember that we call setuid() to an extreme UID (>65535) by default: that means the child will only be able to read files that are world-readable (i.e., “chmod o+r”).
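Because the child can only read world-readable files, it can help to audit a chroot directory before using it. The following is a hypothetical helper (not part of pyspawner) that lists entries lacking "other" read permission, or "other" read and execute permission for directories.

```python
import os
import stat
from pathlib import Path
from typing import List


def find_paths_hidden_from_others(chroot_dir: Path) -> List[Path]:
    """List files and directories under chroot_dir that "other" cannot read."""
    hidden = []
    for dirpath, dirnames, filenames in os.walk(chroot_dir):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            mode = path.lstat().st_mode
            if stat.S_ISLNK(mode):
                continue  # a symlink's own permission bits are not meaningful
            needed = stat.S_IROTH
            if stat.S_ISDIR(mode):
                needed |= stat.S_IXOTH  # directories also need o+x to traverse
            if mode & needed != needed:
                hidden.append(path)
    return sorted(hidden)
```

An empty result means every entry the walk could reach is readable by an unprivileged UID; it says nothing about whether the contents are safe to expose.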

(TODO chroot_dir should use pivot_root, for security. When Kubernetes lets us modify our mount namespace in an unprivileged container, switch to pivot_root.)

network: Optional[pyspawner.sandbox.NetworkConfig] = None

If set, network configuration so child processes can access the Internet.

If None, child processes have no network interfaces.

Type

pyspawner.NetworkConfig

skip_sandbox_except: FrozenSet[str]

Security layers to enable in child processes. (DO NOT USE IN PRODUCTION.)

MUST BE EXACTLY frozenset(). Other values are only for unit tests. See protocol.SpawnChild for details.

By default, child processes are sandboxed: user code should not be able to access the rest of the system. (In particular, it should not be able to access parent-process state; influence parent-process behavior in any way but its stdout, stderr and exit code; or communicate with any internal services.)

Our layers of sandbox security overlap. For instance, we (a) restrict the user code to run as non-root _and_ (b) disallow root from escaping its chroot. We can’t test layer (b) unless we disable layer (a), and that’s what this feature is for.

By default, all sandbox features are enabled. To enable only a subset, set skip_sandbox_except to a frozenset() with one or more of the following strings:

  • “drop_capabilities”: limit root’s capabilities

  • “setuid”: become an anonymous, non-root user

  • “no_new_privs”: prevent setuid-root programs from gaining capabilities

  • “seccomp”: filter system calls