Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

classic Classic list List threaded Threaded
54 messages Options
123
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Yann Ylavic
Hi Luca,

[better/easier to talk about details on dev@]

On Fri, Jun 30, 2017 at 11:05 AM,  <[hidden email]> wrote:

> https://bz.apache.org/bugzilla/show_bug.cgi?id=60956
>
> --- Comment #11 from Luca Toscano <[hidden email]> ---
> Other two interesting trunk improvements that have not been backported yet:
>
> http://svn.apache.org/viewvc?view=revision&revision=1706669
> http://svn.apache.org/viewvc?view=revision&revision=1734656
>
> IIUC these ones are meant to provide a more async behavior to most of the
> output filters, namely setting aside buckets (on the heap) to avoid blocking.

These are quite orthogonal I think, and don't seem to fix this particular issue.

>
> After a bit of thinking it seems to me that we'd need to find a solution that
> prevents the mod_ssl_output filter to block, but in a safe way.

IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
modssl_smart_shutdown(), only in the "abortive" mode of
ssl_filter_io_shutdown().

The caller of the ssl output filters should decide when to flush, so
http://svn.apache.org/r1651077 was not the good fix in this regard.

Even if we don't BIO_flush, openssl shouldn't retain the close-notify
by itself, so at least it should go down to the next/core ouput filter
(and stay there until the socket is write-able or asked to flush).

>
> In this particular case we assume this about start_lingering_close_nonblocking:
>
> """
> /*
>  * Close our side of the connection, NOT flushing data to the client.
>  * This should only be called if there has been an error or if we know
>  * that our send buffers are empty.
>  * Pre-condition: cs is not in any timeout queue and not in the pollset,
>  *                timeout_mutex is not locked
>  * return: 0 if connection is fully closed,
>  *         1 if connection is lingering
>  * may be called by listener thread
>  */
> """
>
> I tried the following patch:
>
> """
> Index: server/mpm/event/event.c
> ===================================================================
> --- server/mpm/event/event.c    (revision 1800362)
> +++ server/mpm/event/event.c    (working copy)
> @@ -744,10 +744,7 @@
>      conn_rec *c = cs->c;
>      apr_socket_t *csd = cs->pfd.desc.s;
>
> -    if (ap_prep_lingering_close(c)
> -        || c->aborted
> -        || ap_shutdown_conn(c, 0) != APR_SUCCESS || c->aborted
> -        || apr_socket_shutdown(csd, APR_SHUTDOWN_WRITE) != APR_SUCCESS) {
> +    if (ap_prep_lingering_close(c) || c->aborted) {
>          apr_socket_close(csd);
>          ap_push_pool(worker_queue_info, cs->p);
>          if (dying)
> """
>
> So the idea was to brutally close the connection only if
> ap_prep_lingering_close(c) is not 0 or if the client has already aborted, but
> to leave all the other cases to the start_lingering_close_common. This is
> probably not enough/correct because the connection would go into the
> lingering_close queue, to be picked up again by
> process_timeout_queue(linger_q,..) after the timeout that would call
> stop_lingering_close, that would in turn simply close the socket without giving
> the possibility to mod_ssl to flush its close-notify (because no
> ap_shutdown_conn would be called).

With a possibly non-blocking modssl_smart_shutdown(), I think we could
make ap_shutdown_conn(c, 0) return something like APR_INCOMPLETE for
the case the shutdown was "buffered" in the output filter stack (e.g.
core output filter).

In mpm_event, we would then go to (or stay in) the WRITE_COMPLETION
state instead of LINGER, until every remaining piece data is flushed
successfully.

Does this sound OK?
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Luca Toscano
Hi Yann!

2017-06-30 12:18 GMT+02:00 Yann Ylavic <[hidden email]>:
Hi Luca,

[better/easier to talk about details on dev@]

On Fri, Jun 30, 2017 at 11:05 AM,  <[hidden email]> wrote:
> https://bz.apache.org/bugzilla/show_bug.cgi?id=60956
>
> --- Comment #11 from Luca Toscano <[hidden email]> ---
> Other two interesting trunk improvements that have not been backported yet:
>
> http://svn.apache.org/viewvc?view=revision&revision=1706669
> http://svn.apache.org/viewvc?view=revision&revision=1734656
>
> IIUC these ones are meant to provide a more async behavior to most of the
> output filters, namely setting aside buckets (on the heap) to avoid blocking.

These are quite orthogonal I think, and don't seem to fix this particular issue.

Sorry for the noise, I am still reviewing those to understand what they really do and added them as reference to the task.
I thought that they might have been useful while thinking about a solution, since from my (ignorant :) point of view the fact that ap_core_output_filter was blocked by mod_ssl's BIO_flush was somehow pointing me to those commits.  

Will skip these notes in the future, they might be confusing!

[..]
 
With a possibly non-blocking modssl_smart_shutdown(), I think we could
make ap_shutdown_conn(c, 0) return something like APR_INCOMPLETE for
the case the shutdown was "buffered" in the output filter stack (e.g.
core output filter).

In mpm_event, we would then go to (or stay in) the WRITE_COMPLETION
state instead of LINGER, until every remaining piece data is flushed
successfully.

IIUC in this case the listener is calling process_timeout_queue(write_completion_q, ..), so the conn has already been in the WRITE_COMPLETION state for Timeout seconds and needs to be closed. Your suggestion would be to still force the listener to call ap_shutdown_conn(c, 0), but a "smarter" version that eventually returns APR_INCOMPLETE rather than blocking. Then the listener could put the connection again in the WRITE_COMPLETION queue, and wait for the client to unblock and read the missing bytes about close-notify (or just let Timeout seconds to pass, forcing the listener to shutdown the socket and be done with it).

You are free to trash the email if it doesn't make sense, I am probably missing too many important pieces of the puzzle :)

Thanks for the help! 

Luca  

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Ruediger Pluem
In reply to this post by Yann Ylavic


On 06/30/2017 12:18 PM, Yann Ylavic wrote:

> Hi Luca,
>
> [better/easier to talk about details on dev@]
>
> On Fri, Jun 30, 2017 at 11:05 AM,  <[hidden email]> wrote:
>> https://bz.apache.org/bugzilla/show_bug.cgi?id=60956
>>
>> --- Comment #11 from Luca Toscano <[hidden email]> ---
>> Other two interesting trunk improvements that have not been backported yet:
>>
>> http://svn.apache.org/viewvc?view=revision&revision=1706669
>> http://svn.apache.org/viewvc?view=revision&revision=1734656
>>
>> IIUC these ones are meant to provide a more async behavior to most of the
>> output filters, namely setting aside buckets (on the heap) to avoid blocking.
>
> These are quite orthogonal I think, and don't seem to fix this particular issue.
>
>>
>> After a bit of thinking it seems to me that we'd need to find a solution that
>> prevents the mod_ssl_output filter to block, but in a safe way.
>
> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
> modssl_smart_shutdown(), only in the "abortive" mode of
> ssl_filter_io_shutdown().

I think the issue starts before that.
ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
to this hook can perform all sort of long lasting blocking operations there.
While it can be argued that this would be a bug in the module I think the only safe way
is to have the whole start_lingering_close_nonblocking being executed by a worker thread.

Regards

Rüdiger

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Yann Ylavic
In reply to this post by Luca Toscano
On Fri, Jun 30, 2017 at 12:52 PM, Luca Toscano <[hidden email]> wrote:

>
> 2017-06-30 12:18 GMT+02:00 Yann Ylavic <[hidden email]>:
>> >
>> > http://svn.apache.org/viewvc?view=revision&revision=1706669
>> > http://svn.apache.org/viewvc?view=revision&revision=1734656
>> >
>> > IIUC these ones are meant to provide a more async behavior to most of
>> > the
>> > output filters, namely setting aside buckets (on the heap) to avoid
>> > blocking.
>>
>> These are quite orthogonal I think, and don't seem to fix this particular
>> issue.
>
> Sorry for the noise, I am still reviewing those to understand what they
> really do and added them as reference to the task.

NP, it's not really a noise either because with my proposed change
they'd imply a different fix for 2.4 and trunk.

>>
>> With a possibly non-blocking modssl_smart_shutdown(), I think we could
>> make ap_shutdown_conn(c, 0) return something like APR_INCOMPLETE for
>> the case the shutdown was "buffered" in the output filter stack (e.g.
>> core output filter).
>>
>> In mpm_event, we would then go to (or stay in) the WRITE_COMPLETION
>> state instead of LINGER, until every remaining piece data is flushed
>> successfully.
>
> IIUC in this case the listener is calling
> process_timeout_queue(write_completion_q, ..), so the conn has already been
> in the WRITE_COMPLETION state for Timeout seconds and needs to be closed.
> Your suggestion would be to still force the listener to call
> ap_shutdown_conn(c, 0), but a "smarter" version that eventually returns
> APR_INCOMPLETE rather than blocking. Then the listener could put the
> connection again in the WRITE_COMPLETION queue, and wait for the client to
> unblock and read the missing bytes about close-notify (or just let Timeout
> seconds to pass, forcing the listener to shutdown the socket and be done
> with it).

We'd use a shorter timeout but that was the idea yes.

But nevermind, I like Rüdiger's idea better actually, less invasive (I
think) and more bullets proof. Continued there...
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Yann Ylavic
In reply to this post by Ruediger Pluem
On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:

>
> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>
>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>> modssl_smart_shutdown(), only in the "abortive" mode of
>> ssl_filter_io_shutdown().
>
> I think the issue starts before that.
> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
> to this hook can perform all sort of long lasting blocking operations there.
> While it can be argued that this would be a bug in the module I think the only safe way
> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.

Correct, that'd be much simpler/safer indeed.
We need a new SHUTDOWN state then, right?


Regards,
Yann.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Eissing

> Am 30.06.2017 um 13:33 schrieb Yann Ylavic <[hidden email]>:
>
> On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>>
>> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>>
>>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>>> modssl_smart_shutdown(), only in the "abortive" mode of
>>> ssl_filter_io_shutdown().
>>
>> I think the issue starts before that.
>> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
>> to this hook can perform all sort of long lasting blocking operations there.
>> While it can be argued that this would be a bug in the module I think the only safe way
>> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.
>
> Correct, that'd be much simpler/safer indeed.
> We need a new SHUTDOWN state then, right?

+1


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Luca Toscano
In reply to this post by Yann Ylavic
Hi Yann and Ruediger,

2c from a mpm-event newbie inline:

2017-06-30 13:33 GMT+02:00 Yann Ylavic <[hidden email]>:
On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>
> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>
>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>> modssl_smart_shutdown(), only in the "abortive" mode of
>> ssl_filter_io_shutdown().
>
> I think the issue starts before that.
> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
> to this hook can perform all sort of long lasting blocking operations there.
> While it can be argued that this would be a bug in the module I think the only safe way
> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.

This makes a lot of sense and I agree, but at the same time I feel that it would be really great not to move lingering close responsibilities away from the listener. As far as we know mod_ssl is the only one that shows this "buggy" behavior, would it make sense to attempt to "fix it" now and postpone the decision about pushing start_lingering_close_nonblocking to a worker thread? 
 
Correct, that'd be much simpler/safer indeed.
We need a new SHUTDOWN state then, right?

IIUC in each case that the listener calls start_lingering_close_nonblocking we'd need to set the connection to SHUTDOWN, check if there is a free worker and push2worker the connection to it right?  

Thanks!

Luca
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Yann Ylavic
In reply to this post by Yann Ylavic
On Fri, Jun 30, 2017 at 1:33 PM, Yann Ylavic <[hidden email]> wrote:

> On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>>
>> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>>
>>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>>> modssl_smart_shutdown(), only in the "abortive" mode of
>>> ssl_filter_io_shutdown().
>>
>> I think the issue starts before that.
>> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
>> to this hook can perform all sort of long lasting blocking operations there.
>> While it can be argued that this would be a bug in the module I think the only safe way
>> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.
>
> Correct, that'd be much simpler/safer indeed.
> We need a new SHUTDOWN state then, right?
Actually it was less simple than expected, and it has some caveats obviously...

The attached patch does not introduce a new state but reuses the
existing CONN_STATE_LINGER since it was not really considered by the
listener thread (which uses CONN_STATE_LINGER_NORMAL and
CONN_STATE_LINGER_SHORT instead), but that's a detail.

Mainly, start_lingering_close_nonblocking() now simply schedules a
shutdown (i.e. pre_close_connection() followed by immediate close)
that will we be run by a worker thread.
A new shutdown_linger_q is created/handled (with the same timeout as
the short_linger_q, namely 2 seconds) to hold connections to be
shutdown.

So now when a connection times out in the write_completion or
keepalive queues, it needs (i.e. the listener may wait for) an
available worker to process its shutdown/close.
This means we can *not* close kept alive connections immediatly like
before when becoming short of workers, which will favor active KA
connections over new ones in this case (I don't think it's that
serious but the previous was taking care of that. For me it's up to
the admin to size the workers appropriately...).

Same when a connection in the shutdown_linger_q itself times out, the
patch will require a worker immediatly to do the job (see
shutdown_lingering_close() callback).

So overall, this patch may introduce the need for more workers than
before, what was (wrongly) done by the listener thread has to be done
somewhere anyway...

Finally, I think there is room for improvements like batching
shutdowns in the same worker if there is no objection on the approach
so far.

WDYT?

Regards,
Yann.

shutdown_linger_q.patch (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Eissing

> Am 14.07.2017 um 21:52 schrieb Yann Ylavic <[hidden email]>:
>
> On Fri, Jun 30, 2017 at 1:33 PM, Yann Ylavic <[hidden email]> wrote:
>> On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>>>
>>> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>>>
>>>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>>>> modssl_smart_shutdown(), only in the "abortive" mode of
>>>> ssl_filter_io_shutdown().
>>>
>>> I think the issue starts before that.
>>> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
>>> to this hook can perform all sort of long lasting blocking operations there.
>>> While it can be argued that this would be a bug in the module I think the only safe way
>>> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.
>>
>> Correct, that'd be much simpler/safer indeed.
>> We need a new SHUTDOWN state then, right?
>
> Actually it was less simple than expected, and it has some caveats obviously...
>
> The attached patch does not introduce a new state but reuses the
> existing CONN_STATE_LINGER since it was not really considered by the
> listener thread (which uses CONN_STATE_LINGER_NORMAL and
> CONN_STATE_LINGER_SHORT instead), but that's a detail.
>
> Mainly, start_lingering_close_nonblocking() now simply schedules a
> shutdown (i.e. pre_close_connection() followed by immediate close)
> that will we be run by a worker thread.
> A new shutdown_linger_q is created/handled (with the same timeout as
> the short_linger_q, namely 2 seconds) to hold connections to be
> shutdown.
>
> So now when a connection times out in the write_completion or
> keepalive queues, it needs (i.e. the listener may wait for) an
> available worker to process its shutdown/close.
> This means we can *not* close kept alive connections immediatly like
> before when becoming short of workers, which will favor active KA
> connections over new ones in this case (I don't think it's that
> serious but the previous was taking care of that. For me it's up to
> the admin to size the workers appropriately...).
>
> Same when a connection in the shutdown_linger_q itself times out, the
> patch will require a worker immediatly to do the job (see
> shutdown_lingering_close() callback).
>
> So overall, this patch may introduce the need for more workers than
> before, what was (wrongly) done by the listener thread has to be done
> somewhere anyway...
>
> Finally, I think there is room for improvements like batching
> shutdowns in the same worker if there is no objection on the approach
> so far.
>
> WDYT?

I will test the patch, most likely today. I lot of +1s for the initiative!

-Stefan


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Yann Ylavic
On Mon, Jul 17, 2017 at 9:33 AM, Stefan Eissing
<[hidden email]> wrote:
>
> I will test the patch, most likely today. I lot of +1s for the initiative!

Thanks Stefan, as I said the proposed patch currently reuses the
existing CONN_STATE_LINGER state to shutdown connections, but if it
needs to be set from outside mpm_event (eg. mod_h2 ;) we could add a
new state...
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Eissing
Threw it into my test suite and works nicely.

> Am 17.07.2017 um 14:02 schrieb Yann Ylavic <[hidden email]>:
>
> On Mon, Jul 17, 2017 at 9:33 AM, Stefan Eissing
> <[hidden email]> wrote:
>>
>> I will test the patch, most likely today. I lot of +1s for the initiative!
>
> Thanks Stefan, as I said the proposed patch currently reuses the
> existing CONN_STATE_LINGER state to shutdown connections, but if it
> needs to be set from outside mpm_event (eg. mod_h2 ;) we could add a
> new state...
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Luca Toscano
In reply to this post by Stefan Eissing


2017-07-17 9:33 GMT+02:00 Stefan Eissing <[hidden email]>:

> Am 14.07.2017 um 21:52 schrieb Yann Ylavic <[hidden email]>:
>
> On Fri, Jun 30, 2017 at 1:33 PM, Yann Ylavic <[hidden email]> wrote:
>> On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>>>
>>> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>>>
>>>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>>>> modssl_smart_shutdown(), only in the "abortive" mode of
>>>> ssl_filter_io_shutdown().
>>>
>>> I think the issue starts before that.
>>> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
>>> to this hook can perform all sort of long lasting blocking operations there.
>>> While it can be argued that this would be a bug in the module I think the only safe way
>>> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.
>>
>> Correct, that'd be much simpler/safer indeed.
>> We need a new SHUTDOWN state then, right?
>
> Actually it was less simple than expected, and it has some caveats obviously...
>
> The attached patch does not introduce a new state but reuses the
> existing CONN_STATE_LINGER since it was not really considered by the
> listener thread (which uses CONN_STATE_LINGER_NORMAL and
> CONN_STATE_LINGER_SHORT instead), but that's a detail.
>
> Mainly, start_lingering_close_nonblocking() now simply schedules a
> shutdown (i.e. pre_close_connection() followed by immediate close)
> that will we be run by a worker thread.
> A new shutdown_linger_q is created/handled (with the same timeout as
> the short_linger_q, namely 2 seconds) to hold connections to be
> shutdown.
>
> So now when a connection times out in the write_completion or
> keepalive queues, it needs (i.e. the listener may wait for) an
> available worker to process its shutdown/close.
> This means we can *not* close kept alive connections immediatly like
> before when becoming short of workers, which will favor active KA
> connections over new ones in this case (I don't think it's that
> serious but the previous was taking care of that. For me it's up to
> the admin to size the workers appropriately...).
>
> Same when a connection in the shutdown_linger_q itself times out, the
> patch will require a worker immediatly to do the job (see
> shutdown_lingering_close() callback).
>
> So overall, this patch may introduce the need for more workers than
> before, what was (wrongly) done by the listener thread has to be done
> somewhere anyway...
>
> Finally, I think there is room for improvements like batching
> shutdowns in the same worker if there is no objection on the approach
> so far.
>
> WDYT?

I will test the patch, most likely today. I lot of +1s for the initiative!


Thanks a lot for this work Yann, really nice! I will try to test it as well during the next days, I was not convinced in the beginning that triggering a (potential) increase in workers usage was super great for our users but it is definitely the most correct and safe thing to do. Even if we find a way to fix the ssl-lingering-close-block issue we might encounter a similar one in the future, so it is better imho to fix it at the source.

Luca 

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Priebe - Profihost AG
In reply to this post by Yann Ylavic
Hello Yann,

i'm observing some deadlocks again.

I'm using
httpd 2.4.27
+ mod_h2
+ httpd-2.4.x-mpm_event-wakeup-v7.1.patch
+ your ssl linger fix patch from this thread

What kind of information do you need? If you need a full stack backtrace
 - from which pid? Or from all httpd pids?

Thanks!

Greets,
Stefan

Am 14.07.2017 um 21:52 schrieb Yann Ylavic:

> On Fri, Jun 30, 2017 at 1:33 PM, Yann Ylavic <[hidden email]> wrote:
>> On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>>>
>>> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>>>
>>>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>>>> modssl_smart_shutdown(), only in the "abortive" mode of
>>>> ssl_filter_io_shutdown().
>>>
>>> I think the issue starts before that.
>>> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
>>> to this hook can perform all sort of long lasting blocking operations there.
>>> While it can be argued that this would be a bug in the module I think the only safe way
>>> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.
>>
>> Correct, that'd be much simpler/safer indeed.
>> We need a new SHUTDOWN state then, right?
>
> Actually it was less simple than expected, and it has some caveats obviously...
>
> The attached patch does not introduce a new state but reuses the
> existing CONN_STATE_LINGER since it was not really considered by the
> listener thread (which uses CONN_STATE_LINGER_NORMAL and
> CONN_STATE_LINGER_SHORT instead), but that's a detail.
>
> Mainly, start_lingering_close_nonblocking() now simply schedules a
> shutdown (i.e. pre_close_connection() followed by immediate close)
> that will we be run by a worker thread.
> A new shutdown_linger_q is created/handled (with the same timeout as
> the short_linger_q, namely 2 seconds) to hold connections to be
> shutdown.
>
> So now when a connection times out in the write_completion or
> keepalive queues, it needs (i.e. the listener may wait for) an
> available worker to process its shutdown/close.
> This means we can *not* close kept alive connections immediatly like
> before when becoming short of workers, which will favor active KA
> connections over new ones in this case (I don't think it's that
> serious but the previous was taking care of that. For me it's up to
> the admin to size the workers appropriately...).
>
> Same when a connection in the shutdown_linger_q itself times out, the
> patch will require a worker immediatly to do the job (see
> shutdown_lingering_close() callback).
>
> So overall, this patch may introduce the need for more workers than
> before, what was (wrongly) done by the listener thread has to be done
> somewhere anyway...
>
> Finally, I think there is room for improvements like batching
> shutdowns in the same worker if there is no objection on the approach
> so far.
>
> WDYT?
>
> Regards,
> Yann.
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Priebe - Profihost AG

Am 19.07.2017 um 16:59 schrieb Stefan Priebe - Profihost AG:

> Hello Yann,
>
> i'm observing some deadlocks again.
>
> I'm using
> httpd 2.4.27
> + mod_h2
> + httpd-2.4.x-mpm_event-wakeup-v7.1.patch
> + your ssl linger fix patch from this thread
>
> What kind of information do you need? If you need a full stack backtrace
>  - from which pid? Or from all httpd pids?

Something i forgot to tell:

it seems httpd is running at max threads:
awk '{print $10 $11}' lsof.txt | sort | uniq -c | grep LISTEN
  25050 *:http(LISTEN)
  25050 *:https(LISTEN)

Stefan

>
> Thanks!
>
> Greets,
> Stefan
>
> Am 14.07.2017 um 21:52 schrieb Yann Ylavic:
>> On Fri, Jun 30, 2017 at 1:33 PM, Yann Ylavic <[hidden email]> wrote:
>>> On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>>>>
>>>> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>>>>
>>>>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>>>>> modssl_smart_shutdown(), only in the "abortive" mode of
>>>>> ssl_filter_io_shutdown().
>>>>
>>>> I think the issue starts before that.
>>>> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
>>>> to this hook can perform all sort of long lasting blocking operations there.
>>>> While it can be argued that this would be a bug in the module I think the only safe way
>>>> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.
>>>
>>> Correct, that'd be much simpler/safer indeed.
>>> We need a new SHUTDOWN state then, right?
>>
>> Actually it was less simple than expected, and it has some caveats obviously...
>>
>> The attached patch does not introduce a new state but reuses the
>> existing CONN_STATE_LINGER since it was not really considered by the
>> listener thread (which uses CONN_STATE_LINGER_NORMAL and
>> CONN_STATE_LINGER_SHORT instead), but that's a detail.
>>
>> Mainly, start_lingering_close_nonblocking() now simply schedules a
>> shutdown (i.e. pre_close_connection() followed by immediate close)
>> that will we be run by a worker thread.
>> A new shutdown_linger_q is created/handled (with the same timeout as
>> the short_linger_q, namely 2 seconds) to hold connections to be
>> shutdown.
>>
>> So now when a connection times out in the write_completion or
>> keepalive queues, it needs (i.e. the listener may wait for) an
>> available worker to process its shutdown/close.
>> This means we can *not* close kept alive connections immediatly like
>> before when becoming short of workers, which will favor active KA
>> connections over new ones in this case (I don't think it's that
>> serious but the previous was taking care of that. For me it's up to
>> the admin to size the workers appropriately...).
>>
>> Same when a connection in the shutdown_linger_q itself times out, the
>> patch will require a worker immediatly to do the job (see
>> shutdown_lingering_close() callback).
>>
>> So overall, this patch may introduce the need for more workers than
>> before, what was (wrongly) done by the listener thread has to be done
>> somewhere anyway...
>>
>> Finally, I think there is room for improvements like batching
>> shutdowns in the same worker if there is no objection on the approach
>> so far.
>>
>> WDYT?
>>
>> Regards,
>> Yann.
>>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Priebe - Profihost AG
Hello,

fullstatus says:
   Slot  PID  Stopping   Connections    Threads       Async connections
                       total accepting busy idle writing keep-alive  closing
   0    25042 no       0     no        2    198  0       0
4294966698
   1    4347  no       0     no        0    200  0       0
4294966700
   2    26273 no       0     no        1    199  0       0
4294966698
   3    4348  no       0     no        0    200  0       0
4294966699
   4    10224 no       0     no        0    200  0       0
4294966697
   5    12157 no       0     no        0    200  0       0
4294966700
   6    23027 no       0     no        0    200  0       0
4294966698
   7    28597 no       0     no        0    200  0       0
4294966698
   8    7519  no       0     no        0    200  0       0
4294966697
   9    18609 no       0     no        2    198  0       0
4294966698
   10   3183  no       0     no        0    200  0       0
4294966698
   11   14704 no       0     no        0    200  0       0
4294966698
   12   26237 no       0     no        0    200  0       0
4294966700
   13   32070 no       0     no        0    200  0       0
4294966697
   14   12070 no       1     no        0    200  0       0
4294966699
   15   16627 no       0     no        0    200  0       0
4294966698
   16   29413 no       0     no        0    200  0       0
4294966699
   17   435   no       0     no        0    200  0       0
4294966699
   18   24808 no       0     no        0    200  0       0
4294966700
   19   1822  no       0     no        0    200  0       0
4294966699
   20   1721  no       0     no        0    200  0       0
4294966698
   21   2875  no       0     no        0    200  0       0
4294966698
   22   25879 no       0     no        0    200  0       0
4294966698
   23   28091 no       0     no        0    200  0       0
4294966696
   24   31452 no       0     no        0    200  0       0
4294966698
   25   32706 no       0     no        0    200  0       0
4294966698
   26   8858  no       14    yes       3    197  0       6
4294966783
   27   10203 no       5     yes       2    198  0       2
4294966949
   Sum  28    0        20              10   5590 0       8          -16400

Greets,
Stefan

Am 19.07.2017 um 17:05 schrieb Stefan Priebe - Profihost AG:

>
> Am 19.07.2017 um 16:59 schrieb Stefan Priebe - Profihost AG:
>> Hello Yann,
>>
>> i'm observing some deadlocks again.
>>
>> I'm using
>> httpd 2.4.27
>> + mod_h2
>> + httpd-2.4.x-mpm_event-wakeup-v7.1.patch
>> + your ssl linger fix patch from this thread
>>
>> What kind of information do you need? If you need a full stack backtrace
>>  - from which pid? Or from all httpd pids?
>
> Something i forgot to tell:
>
> it seems httpd is running at max threads:
> awk '{print $10 $11}' lsof.txt | sort | uniq -c | grep LISTEN
>   25050 *:http(LISTEN)
>   25050 *:https(LISTEN)
>
> Stefan
>
>>
>> Thanks!
>>
>> Greets,
>> Stefan
>>
>> Am 14.07.2017 um 21:52 schrieb Yann Ylavic:
>>> On Fri, Jun 30, 2017 at 1:33 PM, Yann Ylavic <[hidden email]> wrote:
>>>> On Fri, Jun 30, 2017 at 1:20 PM, Ruediger Pluem <[hidden email]> wrote:
>>>>>
>>>>> On 06/30/2017 12:18 PM, Yann Ylavic wrote:
>>>>>>
>>>>>> IMHO mod_ssl shoudn't (BIO_)flush unconditionally in
>>>>>> modssl_smart_shutdown(), only in the "abortive" mode of
>>>>>> ssl_filter_io_shutdown().
>>>>>
>>>>> I think the issue starts before that.
>>>>> ap_prep_lingering_close calls the pre_close_connection hook and modules that are registered
>>>>> to this hook can perform all sort of long lasting blocking operations there.
>>>>> While it can be argued that this would be a bug in the module I think the only safe way
>>>>> is to have the whole start_lingering_close_nonblocking being executed by a worker thread.
>>>>
>>>> Correct, that'd be much simpler/safer indeed.
>>>> We need a new SHUTDOWN state then, right?
>>>
>>> Actually it was less simple than expected, and it has some caveats obviously...
>>>
>>> The attached patch does not introduce a new state but reuses the
>>> existing CONN_STATE_LINGER since it was not really considered by the
>>> listener thread (which uses CONN_STATE_LINGER_NORMAL and
>>> CONN_STATE_LINGER_SHORT instead), but that's a detail.
>>>
>>> Mainly, start_lingering_close_nonblocking() now simply schedules a
>>> shutdown (i.e. pre_close_connection() followed by immediate close)
>>> that will we be run by a worker thread.
>>> A new shutdown_linger_q is created/handled (with the same timeout as
>>> the short_linger_q, namely 2 seconds) to hold connections to be
>>> shutdown.
>>>
>>> So now when a connection times out in the write_completion or
>>> keepalive queues, it needs (i.e. the listener may wait for) an
>>> available worker to process its shutdown/close.
>>> This means we can *not* close kept alive connections immediatly like
>>> before when becoming short of workers, which will favor active KA
>>> connections over new ones in this case (I don't think it's that
>>> serious but the previous was taking care of that. For me it's up to
>>> the admin to size the workers appropriately...).
>>>
>>> Same when a connection in the shutdown_linger_q itself times out, the
>>> patch will require a worker immediatly to do the job (see
>>> shutdown_lingering_close() callback).
>>>
>>> So overall, this patch may introduce the need for more workers than
>>> before, what was (wrongly) done by the listener thread has to be done
>>> somewhere anyway...
>>>
>>> Finally, I think there is room for improvements like batching
>>> shutdowns in the same worker if there is no objection on the approach
>>> so far.
>>>
>>> WDYT?
>>>
>>> Regards,
>>> Yann.
>>>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Luca Toscano
In reply to this post by Stefan Priebe - Profihost AG
Hello Stefan,

2017-07-19 17:05 GMT+02:00 Stefan Priebe - Profihost AG <[hidden email]>:

Am 19.07.2017 um 16:59 schrieb Stefan Priebe - Profihost AG:
> Hello Yann,
>
> i'm observing some deadlocks again.
>
> I'm using
> httpd 2.4.27
> + mod_h2
> + httpd-2.4.x-mpm_event-wakeup-v7.1.patch
> + your ssl linger fix patch from this thread
>
> What kind of information do you need? If you need a full stack backtrace
>  - from which pid? Or from all httpd pids?

Something i forgot to tell:

it seems httpd is running at max threads:
awk '{print $10 $11}' lsof.txt | sort | uniq -c | grep LISTEN
  25050 *:http(LISTEN)
  25050 *:https(LISTEN)


First of all let me tell you how awesome is your regular testing, thank you! It is helping a ton to deliver stable code :)

From my point of view I think that you can attach gdb to one or more httpd processes and do the usual "thread apply all", IIUC your httpd ends up having all of its processes in more or less the same state right?

Let's use https://apaste.info/ or similar though otherwise we'll need to exchange super long emails.

Thanks again!

Luca  
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Priebe - Profihost AG
Hello Luca,

i need to wait until a machine is crashing again. What looks strange
from a first view is that async connections closing has very high and
strange values:
4294967211

Even a not yet crashed system has those:

   Slot  PID  Stopping   Connections    Threads       Async connections
                       total accepting busy idle writing keep-alive  closing
   0    25157 no       12    yes       4    196  0       9
4294967231 <== HERE ==
   1    25159 no       22    yes       8    192  0       13
4294967211 <== HERE ==
   Sum  2     0        34              12   388  0       22         -150

Greets,
Stefan

Am 19.07.2017 um 17:48 schrieb Luca Toscano:

> Hello Stefan,
>
> 2017-07-19 17:05 GMT+02:00 Stefan Priebe - Profihost AG
> <[hidden email] <mailto:[hidden email]>>:
>
>
>     Am 19.07.2017 um 16:59 schrieb Stefan Priebe - Profihost AG:
>     > Hello Yann,
>     >
>     > i'm observing some deadlocks again.
>     >
>     > I'm using
>     > httpd 2.4.27
>     > + mod_h2
>     > + httpd-2.4.x-mpm_event-wakeup-v7.1.patch
>     > + your ssl linger fix patch from this thread
>     >
>     > What kind of information do you need? If you need a full stack backtrace
>     >  - from which pid? Or from all httpd pids?
>
>     Something i forgot to tell:
>
>     it seems httpd is running at max threads:
>     awk '{print $10 $11}' lsof.txt | sort | uniq -c | grep LISTEN
>       25050 *:http(LISTEN)
>       25050 *:https(LISTEN)
>
>
> First of all let me tell you how awesome is your regular testing, thank
> you! It is helping a ton to deliver stable code :)
>
> From my point of view I think that you can attach gdb to one or more
> httpd processes and do the usual "thread apply all", IIUC your httpd
> ends up having all of its processes in more or less the same state right?
>
> Let's use https://apaste.info/ or similar though otherwise we'll need to
> exchange super long emails.
>
> Thanks again!
>
> Luca  
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Stefan Priebe - Profihost AG
In reply to this post by Luca Toscano
Hello,

here we go:

This one is from a server where the first httpd process got stuck:

   Slot  PID  Stopping   Connections    Threads       Async connections
                       total accepting busy idle writing keep-alive  closing
   0    31675 no       0     no        0    200  0       0
4294966700

gdb thread apply all from this pid:
https://apaste.info/cBT5

Greets,
Stefan

Am 19.07.2017 um 17:48 schrieb Luca Toscano:

> Hello Stefan,
>
> 2017-07-19 17:05 GMT+02:00 Stefan Priebe - Profihost AGlo
> <[hidden email] <mailto:[hidden email]>>:
>
>
>     Am 19.07.2017 um 16:59 schrieb Stefan Priebe - Profihost AG:
>     > Hello Yann,
>     >
>     > i'm observing some deadlocks again.
>     >
>     > I'm using
>     > httpd 2.4.27
>     > + mod_h2
>     > + httpd-2.4.x-mpm_event-wakeup-v7.1.patch
>     > + your ssl linger fix patch from this thread
>     >
>     > What kind of information do you need? If you need a full stack backtrace
>     >  - from which pid? Or from all httpd pids?
>
>     Something i forgot to tell:
>
>     it seems httpd is running at max threads:
>     awk '{print $10 $11}' lsof.txt | sort | uniq -c | grep LISTEN
>       25050 *:http(LISTEN)
>       25050 *:https(LISTEN)
>
>
> First of all let me tell you how awesome is your regular testing, thank
> you! It is helping a ton to deliver stable code :)
>
> From my point of view I think that you can attach gdb to one or more
> httpd processes and do the usual "thread apply all", IIUC your httpd
> ends up having all of its processes in more or less the same state right?
>
> Let's use https://apaste.info/ or similar though otherwise we'll need to
> exchange super long emails.
>
> Thanks again!
>
> Luca  
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Eric Covener
On Wed, Jul 19, 2017 at 2:25 PM, Stefan Priebe - Profihost AG
<[hidden email]> wrote:

> Hello,
>
> here we go:
>
> This one is from a server where the first httpd process got stuck:
>
>    Slot  PID  Stopping   Connections    Threads       Async connections
>                        total accepting busy idle writing keep-alive  closing
>    0    31675 no       0     no        0    200  0       0
> 4294966700
>
> gdb thread apply all from this pid:
> https://apaste.info/cBT5
>

summary from a script I use:

1: read>read>ap>child_main>make_child>server_main_loop>event>ap_run_mpm>main
300: pthread_cond_wait@@GLIBC_2.3.2>apr_thread_cond_wait>get_next>slot>start_thread>clone
200: pthread_cond_wait@@GLIBC_2.3.2>apr_thread_cond_wait>ap_queue_pop_something>worker_thread>start_thread>clone
1: epoll_wait>impl_pollset_poll>listener_thread>start_thread>clone
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [Bug 60956] Event MPM listener thread may get blocked by SSL shutdowns

Yann Ylavic
In reply to this post by Stefan Priebe - Profihost AG
Hi Stefan,

thanks for testing again!

On Wed, Jul 19, 2017 at 7:42 PM, Stefan Priebe - Profihost AG
<[hidden email]> wrote:
>
> What looks strange
> from a first view is that async connections closing has very high and
> strange values:
> 4294967211

Indeed, I messed up with mpm_event's lingering_count in the first patch.
And it can lead to disabling the listener, which I think is what you observe.

Attached is a v2 if you feel confident enough, still ;)


Regards,
Yann.

shutdown_linger_q-v2.patch (20K) Download Attachment
123
Loading...