cgroupv2

cgroupv2

28 October 2019, 17:48:40

Is Artix using cgroups2 in the kernel?

Can we track and control cgroups without systemd? I know that sounds like a huge question (why should the init control and track cgroups!), but process control through cgroups is still powerful and important. What tools do we have?

Ruben

Re: cgroupv2

Reply #1 – 28 October 2019, 20:08:53

Runit starts up hybrid cgroups by default (which uses cgroups2 as well as cgroups1). You can configure it in rc.conf to use just cgroups1 or cgroups2 if you want. From there, you can use the cgmanager daemon to control it. I assume openrc works the same way. cgroups in s6 currently only start up in hybrid mode (work in progress here).
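
To see which of the three modes a running system actually ended up in, something like the sketch below might do. It assumes the common layout where a pure v2 mount has cgroup.controllers at the top of /sys/fs/cgroup and a hybrid setup mounts v2 at /sys/fs/cgroup/unified. The function name cgroup_mode is made up for this example.

```shell
#!/bin/sh
# Report how cgroups are mounted under a given root (default: /sys/fs/cgroup).
# Assumed layout: a pure v2 mount exposes cgroup.controllers at the root,
# while a hybrid setup keeps the v1 controllers as subdirectories and
# mounts v2 at <root>/unified.
cgroup_mode() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        echo unified
    elif [ -f "$root/unified/cgroup.controllers" ]; then
        echo hybrid
    elif [ -d "$root" ] && [ -n "$(ls -A "$root" 2>/dev/null)" ]; then
        echo legacy
    else
        echo none
    fi
}
```

Calling cgroup_mode with no argument checks the live system; passing a directory is only there so the logic can be tried without root.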

Re: cgroupv2

Reply #2 – 28 October 2019, 20:23:30

I use libcgroup-openrc-git. It's a fully working package but I didn't know what I could do with it other than use it myself, as it's an openrc Artix package so might not be suitable for the AUR, especially as I don't have a straight Arch install to test it with, and Artix package guidelines say no git packages.
The patch came from their mailing list; they were very helpful. AFAIK (which may not be that far) it's the best, most full-featured cgroup manager. It hasn't had a release for a few years, I suppose because cgroup management functionality has been added to systemd, leaving other methods as a low priority. But a lot of useful commits have been added since the last release, so the git version should be better.
Had you also noticed ionice has been broken due to kernel changes? This also breaks btrfs scrub niceness. The recommendation is to use cgroups instead.
Here's the PKGBUILD and associated files (it provides / conflicts with the libcgroup AUR pkg so it won't cause a problem if you had that installed before):

Code:

$ cat PKGBUILD 
pkgname=libcgroup-openrc-git
pkgver=r920.62f7665
pkgrel=1
pkgdesc='Library that abstracts the control group file system in Linux, configured for OpenRC, GIT version.'
arch=('i686' 'x86_64')
url='http://libcg.sourceforge.net'
license=(LGPL)
makedepends=('git')
provides=('libcgroup')
conflicts=('libcgroup')
backup=('etc/cgconfig.conf'
        'etc/cgrules.conf'
        'etc/cgsnapshot_blacklist.conf')
options=('!libtool')
install=libcgroup.install
source=("git://git.code.sf.net/p/libcg/libcg" "fix-segfault.patch" "libcgroup")
md5sums=('SKIP' 'SKIP' 'SKIP')

pkgver() {
        cd "$srcdir/libcg"
        printf "r%s.%s" "$(git rev-list --count HEAD)" "$(git rev-parse --short HEAD)"
}

prepare() {
	cd "${srcdir}/libcg"
	patch --forward --strip=1 --input="${srcdir}/fix-segfault.patch"
}

build() {
	cd "${srcdir}/libcg"
	autoreconf -i
	./configure \
		--prefix=/usr \
		--sysconfdir=/etc \
		--localstatedir=/var \
		--sbindir=/usr/bin \
		--enable-opaque-hierarchy=name=openrc
	make
}

package() {
	cd "${srcdir}/libcg"
	make DESTDIR="${pkgdir}" pkgconfigdir="/usr/lib/pkgconfig" install
	install -D -m0644 samples/cgconfig.conf "${pkgdir}/etc/cgconfig.conf"
	install -D -m0644 samples/cgrules.conf "${pkgdir}/etc/cgrules.conf"
	install -D -m0644 samples/cgsnapshot_blacklist.conf "${pkgdir}/etc/cgsnapshot_blacklist.conf"
	install -D -m0755 "${srcdir}/libcgroup" "${pkgdir}/etc/init.d/libcgroup"
	install -d -m0755 "${pkgdir}/etc/cgconfig.d"
	# Keep only the real PAM module, under its unversioned name
	rm -f "${pkgdir}"/usr/lib/security/pam_cgroup.{la,so,so.0}
	mv "${pkgdir}/usr/lib/security/pam_cgroup.so.0.0.0" "${pkgdir}/usr/lib/security/pam_cgroup.so"
	rm -rf "${pkgdir}/etc/rc.d"
	# Make cgexec setgid cgred (GID 160, created by libcgroup.install)
	chown root:160 "${pkgdir}/usr/bin/cgexec"
	chmod 2755 "${pkgdir}/usr/bin/cgexec"
}

$ cat fix-segfault.patch 
diff --git a/src/parse.y b/src/parse.y
index 98f7699..e67ad54 100644
--- a/src/parse.y
+++ b/src/parse.y
@@ -45,9 +45,9 @@ int yywrap(void)
        int val;
        struct cgroup_dictionary *values;
 }
-%type <name> ID DEFAULT
+%type <name> ID DEFAULT group_name
 %type <val> mountvalue_conf mount task_namevalue_conf admin_namevalue_conf
-%type <val> admin_conf task_conf task_or_admin group_conf group start group_name
+%type <val> admin_conf task_conf task_or_admin group_conf group start
 %type <val> namespace namespace_conf default default_conf
 %type <values> namevalue_conf
 %type <val> template template_conf

$ cat libcgroup
#!/usr/bin/openrc-run

name="libcgroup"
description="libcgroup daemon - cgrulesengd"
command="/usr/bin/cgrulesengd"
command_args="-s -g cgred"

start_pre() {
	/usr/bin/cgconfigparser -l /etc/cgconfig.conf -s 1664
}

stop_post() {
	/usr/bin/cgclear -l /etc/cgconfig.conf -e
}

$ cat libcgroup.install 
post_install() {
  getent group cgred &>/dev/null || groupadd -r -g 160 cgred >/dev/null
}

post_upgrade() {
  post_install
}

post_remove() {
  getent group cgred &>/dev/null && groupdel cgred >/dev/null
}

Re: cgroupv2

Reply #3 – 29 October 2019, 05:21:43

cgroups2

Re: cgroupv2

Reply #4 – 29 October 2019, 16:08:36

Doesn't libcgroup work with v2?
Most of the commands here are from libcgroup, plus it has a daemon to place processes and users in cgroups according to rules without doing it manually every time:
https://wiki.archlinux.org/index.php/cgroups
cgmanager can't do much.

cgmanager /usr/bin/cgm
cgmanager /usr/bin/cgmanager
cgmanager /usr/bin/cgproxy

libcgroup-openrc-git /usr/bin/cgclassify
libcgroup-openrc-git /usr/bin/cgclear
libcgroup-openrc-git /usr/bin/cgconfigparser
libcgroup-openrc-git /usr/bin/cgcreate
libcgroup-openrc-git /usr/bin/cgdelete
libcgroup-openrc-git /usr/bin/cgexec
libcgroup-openrc-git /usr/bin/cgget
libcgroup-openrc-git /usr/bin/cgrulesengd
libcgroup-openrc-git /usr/bin/cgset
libcgroup-openrc-git /usr/bin/cgsnapshot
libcgroup-openrc-git /usr/bin/lscgroup
libcgroup-openrc-git /usr/bin/lssubsys

Re: cgroupv2

Reply #5 – 29 October 2019, 20:54:00

Quote from: ####### – on 29 October 2019, 16:08:36

Doesn't libcgroup work with v2?
Most of the commands here are from libcgroup, plus it has a daemon to place processes and users in cgroups according to rules without doing it manually every time:
https://wiki.archlinux.org/index.php/cgroups
cgmanager can't do much.


OK so looking at this I see:

Code:

CGM(1)                           User Commands                          CGM(1)

NAME
       cgm - a client script for cgmanager

DESCRIPTION
       cgm  is  a client script to simplify making requests of the cgroup man‐
       ager.  It simply calls dbus-send to send requests to the running cgman‐
       ager or cgproxy.

       Usage:

       cgm ping

       cgm create <controller> <cgroup>

So can this be used without dbus? Also, I see nothing about cgroups2. I'm interested in the firewall rules that are implemented in cgroups2, so I can limit, say, mariadb to my internal network without implementing a netfilter setup that interferes with VPNs and crypto.

Re: cgroupv2

Reply #6 – 30 October 2019, 04:38:42

The libcgroup devs are currently discussing plans to implement v2 cgroup support, in addition to or as an alternative to v1, for systemd and non-systemd environments. Apparently that's why so much commit activity has taken place over the last year or so: to get the existing codebase updated in preparation. So as it turns out, not yet.
Apparently Docker containers can be useful for various resource-limiting purposes. I looked at that a little, but I thought it was too heavyweight a solution for my uses (Docker runs apps in a mini-OS chroot/VM type thing). It might be worth considering too, though. Docker uses cgroups itself, so it could be considered a cgroup tool. Apparently they are working on v2 support, but it could probably manage isolation issues now due to the design.

Re: cgroupv2

Reply #7 – 30 October 2019, 09:43:17

openrc cgroups support v1 and v2, our runit-rc implements the openrc cgroups, and our new s6-rc does it as well, because elogind is built for the openrc cgroups.

In other words, if you use a different cgroups, you are on your own.

Re: cgroupv2

Reply #8 – 30 October 2019, 12:31:30

Quote from: artoo – on 30 October 2019, 09:43:17

openrc cgroups support v1 and v2, our runit-rc implements the openrc cgroups, and our new s6-rc does it as well, because elogind is built for the openrc cgroups.

In other words, if you use a different cgroups, you are on your own.

I don't know what this means exactly. Cgroups are a replacement, as I understand it, for the process system that Unix usually has. Both the light process (threads) and the full process (processes with PID numbers) were rewritten as control groups. That's how I have come to learn it. openrc cgroups, I assume, is a privileged userspace program to interact with cgroup processes, so what does it have to do with elogind?

Re: cgroupv2

Reply #9 – 30 October 2019, 12:35:47

Quote from: mrbrklyn – on 30 October 2019, 12:31:30

I don't know what this means exactly. Cgroups are a replacement, as I understand it, for the process system that Unix usually has. Both the light process (threads) and the full process (processes with PID numbers) were rewritten as control groups. That's how I have come to learn it. openrc cgroups, I assume, is a privileged userspace program to interact with cgroup processes, so what does it have to do with elogind?

Read the docs please, I am not the all answering oracle.

Logind requires cgroups.

Re: cgroupv2

Reply #10 – 30 October 2019, 13:04:19

Quote from: artoo – on 30 October 2019, 12:35:47

Read the docs please, I am not the all answering oracle.

Logind requires cgroups.

thanks

Most of what I have learned about cgroups comes from hacking the scheduler

Quote

            CGROUPS
            -------

Written by Paul Menage <[email protected]> based on
Documentation/cgroups/cpusets.txt

Original copyright statements from cpusets.txt:
Portions Copyright (C) 2004 BULL SA.
Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <[email protected]>
Modified by Christoph Lameter <[email protected]>

CONTENTS:
=========

1. Control Groups
1.1 What are cgroups ?
1.2 Why are cgroups needed ?
1.3 How are cgroups implemented ?
1.4 What does notify_on_release do ?
1.5 What does clone_children do ?
1.6 How do I use cgroups ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Attaching processes
2.3 Mounting hierarchies by name
2.4 Notification API
3. Kernel API
3.1 Overview
3.2 Synchronization
3.3 Subsystem API
4. Questions

1. Control Groups
=================

1.1 What are cgroups ?
----------------------

Control Groups provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour.

Definitions:

A *cgroup* associates a set of tasks with a set of parameters for one
or more subsystems.

A *subsystem* is a module that makes use of the task grouping
facilities provided by cgroups to treat groups of tasks in
particular ways. A subsystem is typically a "resource controller" that
schedules a resource or applies per-cgroup limits, but it may be
anything that wants to act on a group of processes, e.g. a
virtualization subsystem.

A *hierarchy* is a set of cgroups arranged in a tree, such that
every task in the system is in exactly one of the cgroups in the
hierarchy, and a set of subsystems; each subsystem has system-specific
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.

At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.

User level code may create and destroy cgroups by name in an
instance of the cgroup virtual file system, specify and query to
which cgroup a task is assigned, and list the task pids assigned to
a cgroup. Those creations and assignments only affect the hierarchy
associated with that instance of the cgroup file system.

On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.

1.2 Why are cgroups needed ?
----------------------------

There are multiple efforts to provide process aggregations in the
Linux kernel, mainly for resource tracking purposes. Such efforts
include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
namespaces. These all require the basic notion of a
grouping/partitioning of processes, with newly forked processes ending
in the same group (cgroup) as their parent process.

The kernel cgroup patch provides the minimum essential kernel
mechanisms required to efficiently implement such groups. It has
minimal impact on the system fast paths, and provides hooks for
specific subsystems such as cpusets to provide additional behaviour as
desired.

Multiple hierarchy support is provided to allow for situations where
the division of tasks into cgroups is distinctly different for
different subsystems - having parallel hierarchies allows each
hierarchy to be a natural division of tasks, without having to handle
complex combinations of tasks that would be present if several
unrelated subsystems needed to be forced into the same tree of
cgroups.

At one extreme, each resource controller or subsystem could be in a
separate hierarchy; at the other extreme, all subsystems
would be attached to the same hierarchy.

As an example of a scenario (originally proposed by [email protected])
that can benefit from multiple hierarchies, consider a large
university server with various users - students, professors, system
tasks etc. The resource planning for this server could be along the
following lines:

       CPU :          "Top cpuset"
                       /       \
               CPUSet1         CPUSet2
                  |               |
            (Professors)      (Students)

               In addition (system tasks) are attached to topcpuset (so
               that they can run anywhere) with a limit of 20%

       Memory : Professors (50%), Students (30%), system (20%)

       Disk : Professors (50%), Students (30%), system (20%)

       Network : WWW browsing (20%), Network File System (60%), others (20%)
                               / \
               Professors (15%)  students (5%)

Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
into NFS network class.

At the same time Firefox/Lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).

With the ability to classify tasks differently for different resources
(by putting those resource subsystems in different hierarchies) then
the admin can easily set up a script which receives exec notifications
and depending on who is launching the browser he can

# echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks

With only a single hierarchy, he now would potentially have to create
a separate cgroup for every browser launched and associate it with
appropriate network and other resource class. This may lead to
proliferation of such cgroups.

Also let's say that the administrator would like to give enhanced network
access temporarily to a student's browser (since it is night and the user
wants to do online gaming), or give one of the students' simulation
apps enhanced CPU power.

With ability to write pids directly to resource classes, it's just a
matter of :

   # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
   (after some time)
   # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks

Without this ability, he would have to split the cgroup into
multiple separate ones and then associate the new cgroups with the
new resource classes.

1.3 How are cgroups implemented ?
---------------------------------

Control Groups extends the kernel as follows:

- Each task in the system has a reference-counted pointer to a
   css_set.

- A css_set contains a set of reference-counted pointers to
   cgroup_subsys_state objects, one for each cgroup subsystem
   registered in the system. There is no direct link from a task to
   the cgroup of which it's a member in each hierarchy, but this
   can be determined by following pointers through the
   cgroup_subsys_state objects. This is because accessing the
   subsystem state is something that's expected to happen frequently
   and in performance-critical code, whereas operations that require a
   task's actual cgroup assignments (in particular, moving between
   cgroups) are less common. A linked list runs through the cg_list
   field of each task_struct using the css_set, anchored at
   css_set->tasks.

- A cgroup hierarchy filesystem can be mounted for browsing and
   manipulation from user space.

- You can list all the tasks (by pid) attached to any cgroup.

The implementation of cgroups requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

- in init/main.c, to initialize the root cgroups and initial
   css_set at system boot.

- in fork and exit, to attach and detach a task from its css_set.

In addition a new file system, of type "cgroup" may be mounted, to
enable browsing and modifying the cgroups presently known to the
kernel. When mounting a cgroup hierarchy, you may specify a
comma-separated list of subsystems to mount as the filesystem mount
options. By default, mounting the cgroup filesystem attempts to
mount a hierarchy containing all registered subsystems.

If an active hierarchy with exactly the same set of subsystems already
exists, it will be reused for the new mount. If no existing hierarchy
matches, and any of the requested subsystems are in use in an existing
hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
is activated, associated with the requested subsystems.

It's not currently possible to bind a new subsystem to an active
cgroup hierarchy, or to unbind a subsystem from an active cgroup
hierarchy. This may be possible in future, but is fraught with nasty
error-recovery issues.

When a cgroup filesystem is unmounted, if there are any
child cgroups created below the top-level cgroup, that hierarchy
will remain active even though unmounted; if there are no
child cgroups then the hierarchy will be deactivated.

No new system calls are added for cgroups - all support for
querying and modifying cgroups is via this cgroup file system.

Each task under /proc has an added file named 'cgroup' displaying,
for each active hierarchy, the subsystem names and the cgroup name
as the path relative to the root of the cgroup file system.
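
A small sketch of reading that file from a script. It relies on the colon-separated layout the file uses (hierarchy id, comma-separated subsystem names, cgroup path). The function name cgroup_of and its optional file argument are made up for this example.

```shell
#!/bin/sh
# Print the cgroup path a task has in the hierarchy that carries a given
# subsystem. The second argument is the file to read and defaults to
# /proc/self/cgroup.
cgroup_of() {
    subsys=$1
    file=${2:-/proc/self/cgroup}
    # Fields: hierarchy-id : comma-separated subsystems : path
    awk -F: -v want="$subsys" '
        {
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == want) {
                    # The path may itself contain ":", so rebuild it.
                    path = $3
                    for (j = 4; j <= NF; j++) path = path ":" $j
                    print path
                    exit
                }
        }' "$file"
}
```

For example, `cgroup_of cpuset` prints the calling shell's cpuset cgroup, or nothing if no mounted hierarchy carries that subsystem.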

Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:

- tasks: list of tasks (by pid) attached to that cgroup. This list
   is not guaranteed to be sorted. Writing a thread id into this file
   moves the thread into this cgroup.
- cgroup.procs: list of tgids in the cgroup. This list is not
   guaranteed to be sorted or free of duplicate tgids, and userspace
   should sort/uniquify the list if this property is required.
   Writing a thread group id into this file moves all threads in that
   group into this cgroup.
- notify_on_release flag: run the release agent on exit?
- release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)
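
Since cgroup.procs may come back unsorted and with duplicates, a reader could be wrapped roughly as follows. This is only a sketch, and the function name cgroup_procs is made up here.

```shell
#!/bin/sh
# List the thread-group ids of a cgroup directory, numerically sorted
# and with duplicates removed.
cgroup_procs() {
    sort -n -u "$1/cgroup.procs"
}
```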

Other subsystems such as cpusets may add additional files in each
cgroup dir.

New cgroups are created using the mkdir system call or shell
command. The properties of a cgroup, such as its flags, are
modified by writing to the appropriate file in that cgroup's
directory, as listed above.

The named hierarchical structure of nested cgroups allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cgroup allows organizing the work load
on a system into related sets of tasks. A task may be re-attached to
any other cgroup, if allowed by the permissions on the necessary
cgroup file system directories.

When a task is moved from one cgroup to another, it gets a new
css_set pointer - if there's an already existing css_set with the
desired collection of cgroups then that group is reused, else a new
css_set is allocated. The appropriate existing css_set is located by
looking into a hash table.

To allow access from a cgroup to the css_sets (and hence tasks)
that comprise it, a set of cg_cgroup_link objects form a lattice;
each cg_cgroup_link is linked into a list of cg_cgroup_links for
a single cgroup on its cgrp_link_list field, and a list of
cg_cgroup_links for a single css_set on its cg_link_list.

Thus the set of tasks in a cgroup can be listed by iterating over
each css_set that references the cgroup, and sub-iterating over
each css_set's task set.

The use of a Linux virtual file system (vfs) to represent the
cgroup hierarchy provides for a familiar permission and name space
for cgroups, with a minimum of additional kernel code.

1.4 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup. This enables automatic
removal of abandoned cgroups. The default value of
notify_on_release in the root cgroup at system boot is disabled
(0). The default value of other cgroups at creation is the current
value of their parent's notify_on_release setting. The default value of
a cgroup hierarchy's release_agent path is empty.
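
Turning this on comes down to two writes, sketched below. The function name enable_release_notify is made up, and the absolute-path check is a precaution of this sketch, not something the text above requires.

```shell
#!/bin/sh
# Enable release notification for one cgroup.
#   $1 = root directory of the hierarchy (where release_agent lives)
#   $2 = directory of the cgroup to watch
#   $3 = path of the release agent program
enable_release_notify() {
    hier=$1
    cg=$2
    agent=$3
    case $agent in
        /*) ;;
        *)  echo "release agent should be an absolute path" >&2
            return 1 ;;
    esac
    printf '%s\n' "$agent" > "$hier/release_agent" || return 1
    printf '1\n' > "$cg/notify_on_release"
}
```

The agent is then run with the pathname of the abandoned cgroup, relative to the mount point, so it can clean that cgroup up.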

1.5 What does clone_children do ?
---------------------------------

If the clone_children flag is enabled (1) in a cgroup, then all
cgroups created beneath will call the post_clone callbacks for each
subsystem of the newly created cgroup. Usually when this callback is
implemented for a subsystem, it copies the values of the parent
subsystem; this is the case for the cpuset.

1.6 How do I use cgroups ?
--------------------------

To start a new job that is to be contained within a cgroup, using
the "cpuset" cgroup subsystem, the steps are something like:

1) mount -t tmpfs cgroup_root /sys/fs/cgroup
2) mkdir /sys/fs/cgroup/cpuset
3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
4) Create the new cgroup by doing mkdir's and write's (or echo's) in
the /sys/fs/cgroup virtual file system.
5) Start a task that will be the "founding father" of the new job.
6) Attach that task to the new cgroup by writing its pid to the
/sys/fs/cgroup/cpuset/tasks file for that cgroup.
7) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will setup a cgroup
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cgroup:

mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/cpuset
mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpuset.cpus
/bin/echo 1 > cpuset.mems
/bin/echo $$ > tasks
sh
# The subshell 'sh' is now running in cgroup Charlie
# The next line should display '/Charlie'
cat /proc/self/cgroup
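
The same sequence could be wrapped in a function so that only the launched command moves into the cgroup and the calling shell stays put. This is a sketch: it assumes the cpuset hierarchy is already mounted, and the function name run_in_cpuset is made up here.

```shell
#!/bin/sh
# Create (or reuse) a cpuset cgroup and run a command inside it.
#   $1 = mounted cpuset hierarchy, $2 = cgroup name,
#   $3 = CPU list, $4 = memory node list, rest = command to run
run_in_cpuset() {
    root=$1
    name=$2
    cpus=$3
    mems=$4
    shift 4
    dir="$root/$name"
    mkdir -p "$dir" || return 1
    printf '%s\n' "$cpus" > "$dir/cpuset.cpus" || return 1
    printf '%s\n' "$mems" > "$dir/cpuset.mems" || return 1
    (
        # Writing 0 attaches the writing task, which here is this
        # subshell, so the calling shell is not moved.
        echo 0 > "$dir/tasks" || exit 1
        exec "$@"
    )
}
```

Usage would be along the lines of `run_in_cpuset /sys/fs/cgroup/cpuset Charlie 2-3 1 sh`.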

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, using the cgroups can be done through the cgroup
virtual filesystem.

To mount a cgroup hierarchy with all available subsystems, type:
# mount -t cgroup xxx /sys/fs/cgroup

The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.

Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used.

As explained in section `1.2 Why are cgroups needed?' you should create
different hierarchies of cgroups for each single resource or group of
resources you want to control. Therefore, you should mount a tmpfs on
/sys/fs/cgroup and create directories for each cgroup resource or resource
group.

# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/rg1

To mount a cgroup hierarchy with just the cpuset and memory
subsystems, type:
# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1

To change the set of subsystems bound to a mounted hierarchy, just
remount with different options:
# mount -o remount,cpuset,blkio hier1 /sys/fs/cgroup/rg1

Now memory is removed from the hierarchy and blkio is added.

Note this will add blkio to the hierarchy but won't remove memory or
cpuset, because the new options are appended to the old ones:
# mount -o remount,blkio /sys/fs/cgroup/rg1

To specify a hierarchy's release_agent:
# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
xxx /sys/fs/cgroup/rg1

Note that specifying 'release_agent' more than once will return failure.

Note that changing the set of subsystems is currently only supported
when the hierarchy consists of a single (root) cgroup. Supporting
the ability to arbitrarily bind/unbind subsystems from an existing
cgroup hierarchy is intended to be implemented in the future.

Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
is the cgroup that holds the whole system.

If you want to change the value of release_agent:
# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent

It can also be changed via remount.

If you want to create a new cgroup under /sys/fs/cgroup/rg1:
# cd /sys/fs/cgroup/rg1
# mkdir my_cgroup

Now you want to do something with this cgroup.
# cd my_cgroup

In this directory you can find several files:
# ls
cgroup.procs notify_on_release tasks
(plus whatever files added by the attached subsystems)

Now attach your shell to this cgroup:
# /bin/echo $$ > tasks

You can also create cgroups inside your cgroup by using mkdir in this
directory.
# mkdir my_sub_cs

To remove a cgroup, just use rmdir:
# rmdir my_sub_cs

This will fail if the cgroup is in use (has cgroups inside, or
has processes attached, or is held alive by other subsystem-specific
reference).
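
A script can check the first two of those conditions before calling rmdir, roughly like this. It is a sketch with a made-up function name (remove_cgroup), and it cannot see subsystem-specific references, so rmdir may still fail.

```shell
#!/bin/sh
# Remove a cgroup only if it looks unused: no child cgroups, no tasks.
remove_cgroup() {
    dir=$1
    for child in "$dir"/*/; do
        if [ -d "$child" ]; then
            echo "$dir: has child cgroups" >&2
            return 1
        fi
    done
    # Read the file instead of testing its size, since control files
    # on a cgroup filesystem may not report a meaningful size.
    if [ -n "$(cat "$dir/tasks" 2>/dev/null)" ]; then
        echo "$dir: still has tasks attached" >&2
        return 1
    fi
    rmdir "$dir"
}
```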

2.2 Attaching processes
-----------------------

# /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:

# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
   ...
# /bin/echo PIDn > tasks

You can attach the current shell task by echoing 0:

# echo 0 > tasks

You can use the cgroup.procs file instead of the tasks file to move all
threads in a threadgroup at once. Echoing the pid of any task in a
threadgroup to cgroup.procs causes all tasks in that threadgroup to be
attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
in the writing task's threadgroup.

Note: Since every task is always a member of exactly one cgroup in each
mounted hierarchy, to remove a task from its current cgroup you must
move it into a new cgroup (possibly the root cgroup) by writing to the
new cgroup's tasks file.

Note: Due to some restrictions enforced by some cgroup subsystems, moving
a process to another cgroup can fail.
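
Because only one PID can be written at a time and any single move can fail, a loop that reports failures is the obvious shape. A sketch, with a made-up function name (attach_tasks):

```shell
#!/bin/sh
# Attach several tasks to one cgroup, one write per PID.
# Reports each PID that could not be attached and returns non-zero
# if any attach failed.
attach_tasks() {
    dir=$1
    shift
    failed=0
    for pid in "$@"; do
        # On a cgroup filesystem every write is a separate attach.
        # On an ordinary file only the last PID would remain.
        if ! echo "$pid" 2>/dev/null > "$dir/tasks"; then
            echo "could not attach $pid" >&2
            failed=1
        fi
    done
    return "$failed"
}
```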

2.3 Mounting hierarchies by name
--------------------------------

Passing the name=<x> option when mounting a cgroups hierarchy
associates the given name with the hierarchy. This can be used when
mounting a pre-existing hierarchy, in order to refer to it by name
rather than by its set of active subsystems. Each hierarchy is either
nameless, or has a unique name.

The name should match [\w.-]+

When passing a name=<x> option for a new hierarchy, you need to
specify subsystems manually; the legacy behaviour of mounting all
subsystems when none are explicitly specified is not supported when
you give a subsystem a name.

The name of the hierarchy appears as part of the hierarchy description
in /proc/mounts and /proc/<pid>/cgroup.
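
A sketch of checking a name against that pattern and building the matching mount options. Both function names are made up for this example, and the "none" option for a named hierarchy with no subsystems is taken from the cgroup v1 mount options, not from the text above.

```shell
#!/bin/sh
# Succeed if the name matches [\w.-]+, i.e. is non-empty and contains
# only letters, digits, underscore, dot and dash.
valid_hierarchy_name() {
    case $1 in
        ''|*[!A-Za-z0-9_.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the -o option string for mounting a named hierarchy.
#   $1 = hierarchy name, rest = subsystems to bind (may be empty)
named_hierarchy_opts() {
    name=$1
    shift
    valid_hierarchy_name "$name" || return 1
    if [ "$#" -eq 0 ]; then
        echo "none,name=$name"
    else
        # Join the subsystem list with commas inside a subshell so the
        # caller's IFS is left alone.
        (IFS=,; echo "name=$name,$*")
    fi
}
```

The result would be passed to mount, for example `mount -t cgroup -o "$(named_hierarchy_opts rg1 cpuset memory)" hier1 /sys/fs/cgroup/rg1`.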

2.4 Notification API
--------------------

There is a mechanism that allows you to get notifications about changes in
the status of a cgroup.

To register a new notification handler you need to:
- create a file descriptor for event notification using eventfd(2);
- open a control file to be monitored (e.g. memory.usage_in_bytes);
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
   Interpretation of args is defined by the control file implementation.

The eventfd will be woken up by the control file implementation or when the
cgroup is removed.

To unregister a notification handler, just close the eventfd.

NOTE: Support of notifications should be implemented for the control
file. See documentation for the subsystem.

3. Kernel API
=============

3.1 Overview
------------

Each kernel subsystem that wants to hook into the generic cgroup
system needs to create a cgroup_subsys object. This contains
various methods, which are callbacks from the cgroup system, along
with a subsystem id which will be assigned by the cgroup system.

Other fields in the cgroup_subsys object include:

- subsys_id: a unique array index for the subsystem, indicating which
entry in cgroup->subsys[] this subsystem should be managing.

- name: should be initialized to a unique subsystem name. Should be
no longer than MAX_CGROUP_TYPE_NAMELEN.

- early_init: indicate if the subsystem needs early initialization
at system boot.

Each cgroup object created by the system has an array of pointers,
indexed by subsystem id; this pointer is entirely managed by the
subsystem; the generic cgroup code will never touch this pointer.

3.2 Synchronization
-------------------

There is a global mutex, cgroup_mutex, used by the cgroup
system. This should be taken by anything that wants to modify a
cgroup. It may also be taken to prevent cgroups from being
modified, but more specific locks may be more appropriate in that
situation.

See kernel/cgroup.c for more details.

Subsystems can take/release the cgroup_mutex via the functions
cgroup_lock()/cgroup_unlock().

Accessing a task's cgroup pointer may be done in the following ways:
- while holding cgroup_mutex
- while holding the task's alloc_lock (via task_lock())
- inside an rcu_read_lock() section via rcu_dereference()

3.3 Subsystem API
-----------------

Each subsystem should:

- add an entry in linux/cgroup_subsys.h
- define a cgroup_subsys object called <name>_subsys

If a subsystem can be compiled as a module, it should also have in its
module initcall a call to cgroup_load_subsys(), and in its exitcall a
call to cgroup_unload_subsys(). It should also set its_subsys.module =
THIS_MODULE in its .c file.

Each subsystem may export the following methods. The only mandatory
methods are create/destroy. Any others that are null are presumed to
be successful no-ops.

struct cgroup_subsys_state *create(struct cgroup_subsys *ss,
               struct cgroup *cgrp)
(cgroup_mutex held by caller)

Called to create a subsystem state object for a cgroup. The
subsystem should allocate its subsystem state object for the passed
cgroup, returning a pointer to the new object on success or a
negative error code. On success, the subsystem pointer should point to
a structure of type cgroup_subsys_state (typically embedded in a
larger subsystem-specific object), which will be initialized by the
cgroup system. Note that this will be called at initialization to
create the root subsystem state for this subsystem; this case can be
identified by the passed cgroup object having a NULL parent (since
it's the root of the hierarchy) and may be an appropriate place for
initialization code.

void destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
(cgroup_mutex held by caller)

The cgroup system is about to destroy the passed cgroup; the subsystem
should do any necessary cleanup and free its subsystem state
object. By the time this method is called, the cgroup has already been
unlinked from the file system and from the child list of its parent;
cgroup->parent is still valid. (Note - can also be called for a
newly-created cgroup if an error occurs after this subsystem's
create() method has been called for the new cgroup).

int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);

Called before checking the reference count on each subsystem. This may
be useful for subsystems which hold some extra references even if
there are no tasks in the cgroup. If pre_destroy() returns an error code,
rmdir() will fail with it. Because of this, pre_destroy() can be
called multiple times against the same cgroup.
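
The retry behaviour described above can be modeled in userspace. This is only a sketch in Python, not kernel code: BusySubsystem, its extra_refs counter, and the rule that one reference is dropped per attempt are all hypothetical, chosen to show why pre_destroy() must be safe to call more than once.

```python
import errno


class BusySubsystem:
    """Hypothetical subsystem that holds extra references to a cgroup."""

    def __init__(self, extra_refs):
        self.extra_refs = extra_refs
        self.calls = 0

    def pre_destroy(self, cgrp):
        # May run several times for the same cgroup, so it must be repeatable.
        self.calls += 1
        if self.extra_refs > 0:
            self.extra_refs -= 1  # assumption: one reference released per attempt
        return -errno.EBUSY if self.extra_refs > 0 else 0


def rmdir(cgrp, subsystems):
    """Model of rmdir(): the first pre_destroy() error becomes the result."""
    for ss in subsystems:
        ret = ss.pre_destroy(cgrp)
        if ret:
            return ret
    return 0


ss = BusySubsystem(extra_refs=2)
first = rmdir("demo", [ss])   # one reference still held, so rmdir fails
second = rmdir("demo", [ss])  # last reference dropped, so rmdir succeeds
```

Here the first rmdir() attempt fails with -EBUSY and the second succeeds, so pre_destroy() has run twice against the same cgroup.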

int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
        struct cgroup_taskset *tset)
(cgroup_mutex held by caller)

Called prior to moving one or more tasks into a cgroup; if the
subsystem returns an error, this will abort the attach operation.
@tset contains the tasks to be attached and is guaranteed to have at
least one task in it.

If there are multiple tasks in the taskset, then:
- it's guaranteed that all are from the same thread group
- @tset contains all tasks from the thread group whether or not
they're switching cgroups
- the first task is the leader

Each @tset entry also contains the task's old cgroup, and tasks which
aren't switching cgroup can be skipped easily using the
cgroup_taskset_for_each() iterator. Note that this isn't called on a
fork. If this method returns 0 (success) then that result should remain
valid while the caller holds cgroup_mutex, and it is guaranteed that
either attach() or cancel_attach() will be called in the future.

void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
         struct cgroup_taskset *tset)
(cgroup_mutex held by caller)

Called when a task attach operation has failed after can_attach() has succeeded.
A subsystem whose can_attach() has side effects should provide this
function so that it can roll them back. Otherwise it is not necessary.
This will be called only for subsystems whose can_attach() operation has
succeeded. The parameters are identical to can_attach().

void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
     struct cgroup_taskset *tset)
(cgroup_mutex held by caller)

Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
The parameters are identical to can_attach().
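
The contract between can_attach(), cancel_attach() and attach() can be sketched as a small userspace model. This is Python rather than kernel code, and the Subsystem class, its veto flag, and attach_tasks() are hypothetical names used only for illustration. It assumes every subsystem is asked in order and that the first refusal aborts the whole operation.

```python
import errno


class Subsystem:
    """Hypothetical stand-in for a cgroup_subsys that logs its callbacks."""

    def __init__(self, name, veto=False):
        self.name = name
        self.veto = veto
        self.log = []

    def can_attach(self, cgrp, tset):
        if self.veto:
            return -errno.EPERM  # refusing aborts the whole attach
        self.log.append("can_attach")
        return 0

    def cancel_attach(self, cgrp, tset):
        self.log.append("cancel_attach")  # roll back can_attach() side effects

    def attach(self, cgrp, tset):
        self.log.append("attach")


def attach_tasks(subsystems, cgrp, tset):
    approved = []
    for ss in subsystems:
        ret = ss.can_attach(cgrp, tset)
        if ret:
            # Only subsystems whose can_attach() succeeded are rolled back.
            for done in approved:
                done.cancel_attach(cgrp, tset)
            return ret
        approved.append(ss)
    # The tasks themselves would be moved at this point.
    for ss in subsystems:
        ss.attach(cgrp, tset)
    return 0


a, b, c = Subsystem("a"), Subsystem("b", veto=True), Subsystem("c")
failed = attach_tasks([a, b, c], "demo", [1234])

x, y = Subsystem("x"), Subsystem("y")
ok = attach_tasks([x, y], "demo", [1234])
```

In the failing run, only a sees cancel_attach(): b refused and c was never asked. In the successful run, each subsystem sees can_attach() followed by attach().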

void fork(struct cgroup_subsys *ss, struct task_struct *task)

Called when a task is forked into a cgroup.

void exit(struct cgroup_subsys *ss, struct task_struct *task)

Called during task exit.

int populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
(cgroup_mutex held by caller)

Called after creation of a cgroup to allow a subsystem to populate
the cgroup directory with file entries. The subsystem should make
calls to cgroup_add_file() with objects of type cftype (see
include/linux/cgroup.h for details). Note that although this
method can return an error code, the error code is currently not
always handled well.

void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
(cgroup_mutex held by caller)

Called during cgroup_create() to do any parameter
initialization which might be required before a task could attach. For
example in cpusets, no task may attach before 'cpus' and 'mems' are set
up.

void bind(struct cgroup_subsys *ss, struct cgroup *root)
(cgroup_mutex and ss->hierarchy_mutex held by caller)

Called when a cgroup subsystem is rebound to a different hierarchy
and root cgroup. Currently this will only involve movement between
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).

4. Questions
============

Q: What's up with this '/bin/echo'?
A: bash's builtin 'echo' command does not check calls to write() for
   errors. If you use it in the cgroup file system, you won't be
   able to tell whether a command succeeded or failed.

Q: When I attach processes, only the first one on the line actually gets attached!
A: We can only return one error code per call to write(). So you should
   put only ONE pid in each write.
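
Both answers come down to issuing one write() per pid and checking its result. A minimal Python sketch follows. The function name attach_pids and the path argument are hypothetical; pass whatever tasks file your hierarchy actually exposes.

```python
import os


def attach_pids(tasks_path, pids):
    """Write each pid with its own write() call.

    Returns a dict mapping pid -> 0 on success or the errno on failure,
    so a rejected pid does not hide the result for the others.
    """
    results = {}
    fd = os.open(tasks_path, os.O_WRONLY)
    try:
        for pid in pids:
            try:
                os.write(fd, f"{pid}\n".encode())  # exactly one pid per write()
                results[pid] = 0
            except OSError as exc:
                results[pid] = exc.errno  # the error the kernel returned
    finally:
        os.close(fd)
    return results
```

Unlike the shell builtin, os.write() raises OSError when the kernel rejects the write, so each pid gets its own success or failure entry.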

Re: cgroupv2

Reply #11 – 30 October 2019, 13:55:55

Please just post the link to the docs if you have to, do not spam our forum with walls of text or endless one liner threads with just some quote.

Re: cgroupv2

Reply #12 – 30 October 2019, 15:20:38

cgmanager does support v2, but the README on the GitHub page says:
Please note that the CGManager project has been deprecated in favor of
using the kernel's CGroup Namespace or lxcfs' simulated cgroupfs.
lxcfs is a package in the community repo.

Re: cgroupv2

Reply #13 – 31 October 2019, 02:49:57

Quote from: ####### – on 30 October 2019, 15:20:38

cgmanager does support v2, but the README on the GitHub page says:
Please note that the CGManager project has been deprecated in favor of
using the kernel's CGroup Namespace or lxcfs' simulated cgroupfs.
lxcfs is a package in the community repo.

this is beginning to sound like my C++ codebase...

For what it is worth, what got me interested in cgroups2 was the ability to restrict firewall rules to specific processes rather than the network as a whole with filters.

Re: cgroupv2

Reply #14 – 21 November 2019, 05:05:25

Quote from: artoo – on 30 October 2019, 13:55:55

Please just post the link to the docs if you have to, do not spam our forum with walls of text or endless one liner threads with just some quote.

I was just trying to be helpful

So I spent much of last night listening to lectures on cgroups and namespaces, and what I learned is that this is complex, with a poor interface, and difficult to learn or even make much sense of. And yet the entire kernel is wrapped in cgroups. Obviously docker is at the front of promoting it, and the complexity makes the use of helpful tools like Ansible almost essential.