Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SIGCHLD handler to pause container #36853

Merged
merged 5 commits into from Jan 20, 2017
Merged

Conversation

verb
Copy link
Contributor

@verb verb commented Nov 15, 2016

What this PR does / why we need it: This allows pause to reap orphaned zombies in a shared PID namespace. (#1615)

Special notes for your reviewer: I plan to discuss this with SIG Node to ensure compatibility with future runtimes.

Release note: This will have no effect until shared PID namespace is enabled, so recommend release-note-none.

This allows pause to reap zombies in the upcoming Shared PID namespace
(#1615). Uses the better defined sigaction() instead of signal() for all
signals both for consistency (SIGCHLD handler avoids SA_RESTART) and to
avoid the implicit signal()->sigaction() translation of various libc
versions.

Also makes warnings errors and includes a tool to make orphaned zombies
for manual testing.


This change is Reviewable

@k8s-ci-robot
Copy link
Contributor

Can a kubernetes member verify that this patch is reasonable to test? If so, please reply with "@k8s-bot ok to test" on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands will still work. Regular contributors should join the org to skip this step.

If you have questions or suggestions related to this bot's behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

@verb
Copy link
Contributor Author

verb commented Nov 15, 2016

cc @vishh @uluyol

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Nov 16, 2016
Copy link
Contributor

@uluyol uluyol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits. Don't block approval of this on me, I just translated pause to C.

int main() {
pid_t pid;
pid = fork();
if ( pid == 0 ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: style mismatch with pause.c. I don't know if kubernetes has C style conventions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could run clang formatter on the C code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Ran clang-format on pause.c and orphan.c.

int main() {
if (signal(SIGINT, sigdown) == SIG_ERR)
if (getpid() != 1) {
fprintf(stderr, "Warning: pause should be the first process in a pod\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make this an error? I'm not sure what value this has as a warning.

Also nit: capitalization (other messages are lowercase).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the capitalization on the other messages. I'm inclined to keep it a warning rather than taking down the infra container when presented with an unusual situation. Plus it was helpful to be able to run it outside of a container.

@liggitt
Copy link
Member

liggitt commented Nov 16, 2016

cc @derekwaynecarr

@verb
Copy link
Contributor Author

verb commented Nov 16, 2016

Added to Agenda for the 2016-11-22 SIG Node meetup

@derekwaynecarr
Copy link
Member

@mrunalp - PTAL

@mrunalp
Copy link
Contributor

mrunalp commented Nov 22, 2016

Looks fine to me.

@derekwaynecarr derekwaynecarr added this to the v1.6 milestone Nov 22, 2016
@derekwaynecarr derekwaynecarr added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Nov 22, 2016
@derekwaynecarr
Copy link
Member

LGTM , I will let @dchen1107 make an additional review.

fprintf(stderr, "error: infinite loop terminated\n");
return 42;
if (getpid() != 1)
fprintf(stderr, "Warning: pause should be the first process in a pod\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: We use pause images for testing as well. So this is necessarily not an error condition.

return 1;
if (sigaction(SIGTERM, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
return 2;
if (sigaction(SIGCHLD, &(struct sigaction){.sa_handler = sigreap,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not place this under a if getpid() == 1 {..} block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no downside to installing a signal handler that will never get called so my preference would be to avoid special cases. If there's going to be a failure installing the handler it should happen whether or not it's run in a container.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned because pause image is used for arbitrary testing too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible to make pause create a child, so it will never automatically receive a SIGCHLD outside of a shared PID namespace. If some other process sends it a SIGCHLD (which would be strange) then waitpid() will immediately return 0 (because it has no children) and go back to the infinite loop.

Its use in arbitrary testing is an argument for keeping its behavior consistent no matter how it's executed, but I might have missed your point. Could you explain what you want to protect against?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is to not make much assumptions about how pause is expected to be used. Instead, we make it behave like an init process whenever its pid is 1. Otherwise, its old behavior can continue to exist.

.sa_flags = SA_NOCLDSTOP},
NULL) < 0)
return 3;
sigaction(SIGKILL, &(struct sigaction){.sa_handler = sigdown}, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to trap sigkill?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add that part, but I believe PID 1 has special handling and will ignore signals by default, including SIGKILL. I could look into it more but perhaps @uluyol just knows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied it over from the original Go port. I didn't think to remove it because as @verb said above it shouldn't harm anything. I don't know whether it's possible to trap it.

Copy link
Contributor

@mrunalp mrunalp Dec 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @vishh. I think this should be removed as process can't catch SIGKILL sent to the process itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this on a single host and PID 1 (in the container) ignored SIGKILL even with a handler installed. Doing a little research, sysvinit installs the handler for SIGKILL while upstart & systemd appear not to install one. I have no preference here and will defer to the reviewer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not installing one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@vishh vishh self-assigned this Dec 7, 2016
@vishh
Copy link
Contributor

vishh commented Dec 12, 2016

Once this PR lands, there might be some changes required to how pod restarts are handled. As of now, infra container death does not bring down other containers. If a pod needs to be greacefully restarted, all its user containers will have to be stopped (or killed) gracefully first, and then the infra container can be restarted, followed by restarting of user containers.
Given that the infra container will now be the parent of all other user containers, forwarding signals from the infra container to all other processes might also help.

@verb
Copy link
Contributor Author

verb commented Dec 13, 2016

@vishh I agree that those are concerns, but they're not concerns for this PR which only modifies the pause container and does not enable the shared PID namespace. Let's address that in #1615.

I'd really like to avoid forwarding signals and other init features which would be better handled by the kubelet. @dchen1107 has longer term plans for init-like functionality that will work across run times.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2016
This allows pause to reap zombies in the upcoming Shared PID namespace
(kubernetes#1615). Uses the better defined sigaction() instead of signal() for all
signals both for consistency (SIGCHLD handler avoids SA_RESTART) and to
avoid the implicit signal()->sigaction() translation of various libc
versions.

Also makes warnings errors and includes a tool to make orphaned zombies
for manual testing.
Run clang-format on the C files and capitalize error messages.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 16, 2016
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2016
@philips
Copy link
Contributor

philips commented Dec 22, 2016

cc @ethernetdan @lucab @jonboulle

fprintf(stderr, "error: infinite loop terminated\n");
return 42;
if (getpid() != 1)
fprintf(stderr, "Warning: pause should be the first process in a pod\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning? shouldn't this be a distinct fatal condition?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(oh, I guess maybe you intend this for testing - though in the production case I think it should fail hard)

Copy link
Contributor Author

@verb verb Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the pause container is used in some circumstances where this wouldn't be an error (#36853 (comment))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for missing that previous comment!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I guess a code comment would help :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Added a comment.

@vishh
Copy link
Contributor

vishh commented Dec 24, 2016

I'd really like to avoid forwarding signals and other init features which would be better handled by the kubelet. @dchen1107 has longer term plans for init-like functionality that will work across run times.

Pause container can function across runtimes AFAIK. How do you expect kubelet to forward signals? It is only aware of the init pid in a container. Kubelet needs to rely on this custom init process for all init functionalities. In any case, I'm ok with punting on handling init issues until we turn on shared pidns. It might help if you can capture these comments in #1615 just so that we don't forget them in the future.

@verb
Copy link
Contributor Author

verb commented Jan 14, 2017

Whoops, didn't realize this was blocked on me.

I don't expect the kubelet to forward signals. I meant that init features other than signal handling are better handled by the kubelet. For example, users should rely on the kubelet to start processes in a pod rather than running an init system in the container.

Regarding signals, I expect the runtime (docker) to deliver signals to the proper process in containers that share a namespace. For example:

% docker run -id --name c1 gcr.io/verb-test/signalsquealer                        
cb85b55459ba08188191ac4be88c631d264a108faa83f849f73a307c18ec7a28
% docker run -id --name c2 --pid=container:cb85b55459ba08188191ac4be88c631d264a108faa83f849f73a307c18ec7a28 gcr.io/verb-test/signalsquealer 
b713384eca2fcf76b66cd32473757f51e3f35ccd4730182f905f2804ba93cb32
% docker run --rm --pid=container:cb85b55459ba08188191ac4be88c631d264a108faa83f849f73a307c18ec7a28 alpine ps auxww           
PID   USER     TIME   COMMAND
    1 root       0:00 /go/bin/signalsquealer
   11 root       0:00 /go/bin/signalsquealer
   22 root       0:00 ps auxww

(signalsquealer just prints out what signal it receives and ignores it.) I expect signals sent to c2 to be delivered to the process for c2 and not c1, and they are:

% docker kill -s TERM c2
c2
% docker logs c1 
Commence to catch (and ignore) signals
% docker logs c2 
Commence to catch (and ignore) signals
We get signal. All your terminated are belong to us.

It's the case that docker 1.12 currently doesn't reparent PIDs correctly, so this isn't an entirely accurate test, but I still expect docker kill to deliver signals to the correct process after that issue is resolved.

However if you're referring to delivering signals to children of processes in other containers, I still think that shouldn't be handled by the infra container, even if it's a process that double forks and is inherited by pause.

Containers shipping their own in-container init will continue to function correctly as long as they're not doing anything too exotic, but in-container init isn't necessary for Kubernetes and we shouldn't add complexity to support it.

I don't think we should establish the convention that a signal delivered to the pause container must be forwarded to every process in other containers in the pod.

Caveats:

  1. Imagine a process double forks inside of a shared namespace and is inherited by pause. What happens when the runtime is instructed to stop its origin container? I don't how, but I expect it to be killed.
  2. I just noticed that if you kill the container that's the source of the pid namespace then all other containers sharing that namespace are also killed. So restarting the non-infra containers may be less of an issue, at least for docker. It might be something we need to codify in the CRI.

@verb
Copy link
Contributor Author

verb commented Jan 19, 2017

@vishh @dchen1107 @derekwaynecarr (assigned reviewers)

This PR has been open for 2 months. What else do I need to do for it?

@vishh
Copy link
Contributor

vishh commented Jan 19, 2017

Imagine a process double forks inside of a shared namespace and is inherited by pause. What happens when the runtime is instructed to stop its origin container? I don't how, but I expect it to be killed.

runtimes are expected to track processes using cgroups and the scenario above is also expected to be covered. We should ideally have a test case for it.

I just noticed that if you kill the container that's the source of the pid namespace then all other containers sharing that namespace are also killed. So restarting the non-infra containers may be less of an issue, at least for docker. It might be something we need to codify in the CRI.

Yeah. The infra container is essentially the "pod sandbox" in CRI. It's should be managed accordingly.

@vishh
Copy link
Contributor

vishh commented Jan 19, 2017

@verb This comment thread is still open - #36853 (comment)
Github reviews are a pain if we update existing commits.

Other than that comment, this PR LGTM. Happy to tag once you address that.

@verb
Copy link
Contributor Author

verb commented Jan 19, 2017

@vishh Thanks, I totally missed that, too. My github-fu is weak. :-/

Added a commit to remove the SIGKILL handler.

@vishh
Copy link
Contributor

vishh commented Jan 19, 2017 via email

@vishh
Copy link
Contributor

vishh commented Jan 19, 2017 via email

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 19, 2017
@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 39446, 40023, 36853)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet