kafka-server-stop.sh not working when Kafka started from Python script

After deploying some Apache Kafka instances on remote nodes, I noticed a problem with the kafka-server-stop.sh script that ships in the Kafka archive.

By default it contains:

#!/bin/sh
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
# 
#    http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}' | xargs kill -SIGTERM

and this script works fine if I run Apache Kafka as a foreground process, for example:

/var/lib/kafka/bin/kafka-server-start.sh /var/lib/kafka/config/server.properties

It also works when I run Kafka as a background process:

/var/lib/kafka/bin/kafka-server-start.sh /var/lib/kafka/config/server.properties &

but on my remote nodes I start it (via Ansible) with this Python script:

#!/usr/bin/env python
import argparse
import os
import subprocess

KAFKA_PATH = "/var/lib/kafka/"

def execute_command_pipe_output(command_to_call):
  return subprocess.Popen(command_to_call, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

def execute_command_no_output(command_to_call):
  with open(os.devnull, "w") as null_file:
    return subprocess.Popen(command_to_call, stdout=null_file, stderr=subprocess.STDOUT)  

def start_kafka(args):
  command_to_call = ["nohup"]
  command_to_call += [KAFKA_PATH + "bin/zookeeper-server-start.sh"]
  command_to_call += [KAFKA_PATH + "config/zookeeper.properties"]

  proc = execute_command_no_output(command_to_call)

  command_to_call = ["nohup"]
  command_to_call += [KAFKA_PATH + "bin/kafka-server-start.sh"]
  command_to_call += [KAFKA_PATH + "config/server.properties"]

  proc = execute_command_no_output(command_to_call)

def stop_kafka(args):
  command_to_call = [KAFKA_PATH + "bin/kafka-server-stop.sh"]

  proc = execute_command_pipe_output(command_to_call)
  for line in iter(proc.stdout.readline, b''):
    print line,

  command_to_call = [KAFKA_PATH + "bin/zookeeper-server-stop.sh"]

  proc = execute_command_pipe_output(command_to_call)
  for line in iter(proc.stdout.readline, b''):
    print line,


if __name__ == "__main__":
  parser = argparse.ArgumentParser(description="Starting Zookeeper and Kafka instances")
  parser.add_argument('action', choices=['start', 'stop'], help="action to take")

  args = parser.parse_args()

  if args.action == 'start':
    start_kafka(args)
  elif args.action == 'stop':
    stop_kafka(args)
  else:
    parser.print_help()

after executing

manage-kafka.py start
manage-kafka.py stop

Zookeeper is shut down (as it should be) but Kafka is still running.

What is more interesting, when I invoke (by hand)

nohup /var/lib/kafka/bin/kafka-server-stop.sh

or

nohup /var/lib/kafka/bin/kafka-server-stop.sh &

kafka-server-stop.sh properly shuts down the Kafka instance. I suspect this problem may be caused by some Linux/Python interaction.



Solution 1:[1]

The Kafka brokers need to finish their shutdown process before the ZooKeeper nodes do.

So start the ZooKeeper nodes first; the Kafka brokers will then retry the shutdown process.

I had a similar case. The problem was that my configuration did not wait for the Kafka brokers to shut down. Hope this helps somebody; it took me a while to figure out.
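A minimal sketch of the "wait for the brokers to shut down" step this answer describes (my own illustration, reusing the ps pipeline from the stock stop script):

```shell
# After calling kafka-server-stop.sh, poll for up to 30 seconds until no
# broker JVM (kafka.Kafka) remains; only then is it safe to stop ZooKeeper.
for i in $(seq 1 30); do
  ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep >/dev/null || break
  sleep 1
done
```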

Solution 2:[2]

I faced this issue a lot before figuring out a brute-force way to solve it. What happens is that Kafka shuts down abruptly but the port remains in use.

Follow these steps:

  1. Find the process ID of the process listening on that port: lsof -t -i :YOUR_PORT_NUMBER (this is for macOS, but the flags work on Linux too).
  2. Kill that process: kill -9 process_id
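The two steps above can be combined into one guarded command (a sketch of mine; port 9092 is Kafka's default listener, so substitute your own):

```shell
PORT=9092                             # Kafka's default listener port; adjust as needed
PID=$(lsof -t -i :$PORT 2>/dev/null)  # PID of whatever is listening on that port
if [ -n "$PID" ]; then
  kill -9 $PID                        # force-kill the process holding the port
else
  echo "nothing listening on port $PORT"
fi
```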

Solution 3:[3]

Run kafka-server-stop.sh before executing zookeeper-server-stop.sh. The first script disconnects the broker from ZooKeeper, and the second then stops ZooKeeper itself. Allow 3-4 seconds before you start again.

Solution 4:[4]

My guess: kafka-server-stop.sh uses shell pipes, so Popen would need the shell=True argument.

See https://docs.python.org/2/library/subprocess.html#subprocess.Popen
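A minimal illustration of the difference (my own sketch, not the asker's exact command): a command string containing a pipe only works when a shell interprets it, i.e. with shell=True:

```python
import subprocess

# A pipeline in a single command string needs a shell to interpret the '|'.
proc = subprocess.Popen(
    "echo kafka.Kafka | grep -c -i 'kafka\\.kafka'",
    shell=True,                 # run via /bin/sh so the pipe works
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate()
print(out.decode().strip())     # -> 1 (the pipeline ran end to end)
```

Without shell=True, the same string would be treated as a single program name and fail with "No such file or directory".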

Solution 5:[5]

Changing the command in kafka-server-stop.sh to the following solved my issue:

PIDS=$(ps axww | grep -i 'kafka\.Kafka' | grep java | grep -v grep | nawk '{print $1}')


Explanation:
The issue is that kafka-server-stop.sh uses the following command to get the PIDs to kill:

PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}')

'ps' 80-column truncation issue:
The problem is that ps ax truncates each line of its output to the terminal width (traditionally 80 columns, the default terminal width in the old days; mine was 168 columns, as reported by stty -a). With a long Java command line, the kafka.Kafka class name can fall past the cutoff, so the grep never matches. Changing the call to ps axww removes the width limit entirely.

awk input-record-length issue:
The other issue is that some awk implementations cap input records at about 3000 characters, as described here. nawk, by contrast, has no such limit (it is bounded only by the size of a C long); gawk will also work.

The downside is that I am modifying a core script, which could be overwritten during an upgrade. It's quick and possibly dirty, but it does the job for me.

P.S. I found a JIRA issue here, if you are interested.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 mohdnaveed
Solution 3 user9059436
Solution 4 Stephane Martin
Solution 5