[Bug 63328] New: Apparent race condition causes undeserved 500 / connection reset by peer errors


Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=63328

            Bug ID: 63328
           Summary: Apparent race condition causes undeserved 500 /
                    connection reset by peer errors
           Product: Apache httpd-2
           Version: 2.4.25
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: mod_fcgid
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

There appears to be a race condition in mod_fcgid.

Here's what I see:

A Perl (CGI::Fast) application decides (actually is told) to exit.

In response to a POST, it issues a (303) redirect to its status page and
exit()s - expecting that a new instance will be started to service the
redirect.

HTTPD reports:

[Sun Apr 07 13:24:13.991499 2019] [fcgid:warn] [pid 17236] (104)Connection reset by peer: [client ...] mod_fcgid: error reading data from FastCGI server, referer: ...
[Sun Apr 07 13:24:13.994622 2019] [core:error] [pid 17236] [...] End of script output before headers: notice.fcgi, referer: https://.../fastbrowser

Modsec's helpful audit log says:
Apache-Error: [file "fcgid_proc_unix.c"] [line 627] [level 4] [status 104]
mod_fcgid: error reading data from FastCGI server
Apache-Error: [file "util_script.c"] [line 500] [level 3] %s: %s
Apache-Handler: fcgid-script


And the web browser gets a 500 error from httpd.

What's interesting is that if the server generates a 200 response, the error
doesn't happen.

Further, if the application generates the 303 without doing anything else, the
500 isn't generated; the redirect works.

The failure seems to be timing-sensitive.  My working theory is that:

The Perl application (the FastCGI server) exits, closing the FastCGI
connection.  If a 200 is delivered before the exit, all goes as expected.  A
redirect takes time - the browser sends the follow-up GET some time later.  If
it arrives much later, it hits a new server instance.  If it arrives at just
the wrong moment, it starts to be routed to the (now exiting) server; the
connection close is noticed, and the request is lost to the 500.
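The (104) in the log is ECONNRESET - exactly what a reader sees when its peer
aborts the connection while data is still in flight.  A minimal sketch of that
mechanism with plain sockets (nothing mod_fcgid-specific; the SO_LINGER trick
just forces an RST the way an exiting FastCGI process effectively can):

```python
import errno
import socket
import struct
import threading

def abortive_peer(listener):
    """Accept one connection, read part of it, then abort with an RST -
    modelling a FastCGI process that exits with the connection open."""
    conn, _ = listener.accept()
    conn.recv(16)  # consume only part of the "request"
    # linger on, timeout 0: close() sends RST instead of a graceful FIN.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
peer = threading.Thread(target=abortive_peer, args=(listener,))
peer.start()

client = socket.socket()
client.connect(listener.getsockname())
client.sendall(b"GET /test.fcgi/shutdown HTTP/1.1\r\n\r\n")
peer.join()

try:
    client.recv(4096)
    reset_errno = None
except ConnectionResetError as e:
    reset_errno = e.errno  # ECONNRESET (104 on Linux)

print(reset_errno)
```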

This reproduces consistently with a real application.  I've tried to cut it
down to a reproducer, but failed.

I tried various ways to prevent this - including sending 'LastCall' - but none
work in the real application.

Versions: httpd 2.4.25, mod_fcgid 2.3.9, CGI::Fast 2.15, FCGI 0.78.

Here is my attempt at a small reproducer.  While I haven't found the right
magic to reproduce the problem, it clearly illustrates the failing application
structure. (For simplicity, this is all done with GET, but that shouldn't
matter.)

Usage:

Set up shutdown.fcgi to run as an FCGI script, as, say, /test.fcgi

Browse to /test.fcgi - hit refresh, you will see the Requests served counter
increment.

Now browse to /test.fcgi/shutdown - the server issues a redirect and exits.
You will see that the response has a new PID, the requests-served count goes
back to 1, and the URL in the address bar is now /test.fcgi/LoopExit.

Alternatively (change the if( 01 ) to if( 0 )), it invokes LastCall - which
tells the library explicitly not to accept further requests - then falls out
of the loop synchronously to exit.  The GET invoked by the redirect should
start a new server instance; instead you get the 500 error.  In the real
application, the 500 errors are 100% reproducible.  I haven't found the right
timing to make the reproducer fail - and if I had, I suspect that timing would
not be portable to other machines.

What I expect is that once the server exits (and especially with LastCall
invoked), mod_fcgid will pass incoming requests to another server instance,
starting a new one if necessary.  (In the real app, it is guaranteed that
there is only one server at this time.)  If one can't be found or started, the
response should be something like "no servers available", not "Internal Server
Error" with logging that blames the FastCGI server.
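For whoever triages this: the mod_fcgid process-pool directives below are the
ones that interact with process lifetime.  This is only a sketch of settings
that might narrow the window (my assumption - the race appears to be in
request routing, so I don't expect them to close it):

```apache
# Keep a spare process per class so the redirected GET has somewhere to
# land while the old process is exiting (narrows, does not close, the window).
FcgidMinProcessesPerClass 2

# Let mod_fcgid recycle processes itself instead of the application
# calling exit(), so the module knows the process is going away.
FcgidMaxRequestsPerProcess 1000

# How long mod_fcgid waits on the FastCGI connection before erroring out.
FcgidIOTimeout 40
```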

Here's the (very small) almost-reproducer.  The structure is the same as the
real application.

#!/usr/bin/perl


use warnings;
use strict;

require CGI::Fast;

my $n = 0;    # requests served by this instance
my $q;
while( ( $q = CGI::Fast->new ) ) {
    # Variable work here

    if( ( $ENV{PATH_INFO} // '' ) eq '/shutdown' ) {
        if( 01 ) {    # change to if( 0 ) to exercise the LastCall path
            print( <<"xx" );
Status: 303 See other
Location: /test.fcgi/LoopExit

Server $$ shutdown after $n requests
xx
            exit(0);
        }
        no warnings 'once';
        $CGI::Fast::Ext_Request->LastCall;    # accept no further requests
        next;
    }

    $n++;
    print( <<"XX" );
Status: 200 OK
Content-Type: text/plain

Server $$, Requests served: $n
XX
}
# Here when CGI::Fast->new returns undef to shut down.
print STDERR ( "ERR: Server $$ shutdown after $n requests\n" ) if( 0 );


exit(0);

Finally, my workaround is to send a buffer page - it waits 15 seconds and then
does a JavaScript redirect.  This works every time - but it is a horrible user
experience...
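For completeness, the buffer page amounts to something like this (the
15-second delay and the target URL are as described above; the markup itself
is my reconstruction, not the actual page):

```html
<!DOCTYPE html>
<html>
<head><title>Restarting...</title></head>
<body>
  <p>The server is restarting; you will be redirected shortly.</p>
  <script>
    // Give the old FastCGI process time to exit and a new one to start,
    // then navigate to the status page (URL is illustrative).
    setTimeout(function () {
      window.location.replace("/test.fcgi/LoopExit");
    }, 15000);
  </script>
</body>
</html>
```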
