Paho MQTT Client (JavaScript), Envoy, VerneMQ: debugging disconnects

After deploying picockpit to my server, I observed a bug: the web frontend would disconnect at very regular intervals.

This had not been present on the local development environment. Users of picockpit reported the same bug.

I debugged it today and applied a hotfix. It turned out to be a timing issue.

TL;DR: how to fix this

Set a shorter keepAliveInterval for Paho. 60 seconds is the default, but with network latencies this leads to the timeouts described below.

Here is my code for the connection (inside pcp-code.js which is loaded by the webpage):

/* https://www.eclipse.org/paho/files/jsdoc/Paho.MQTT.Client.html */
function MQTTconnect_inner(params){
    mqtt = new Paho.Client(params.uri, params.clientid); /* not setting the optional constructor parameters, as JS does not allow passing named parameters */
    mqtt.onConnectionLost = mqtt_onConnectionLost;
    mqtt.onMessageDelivered = mqtt_onMessageDelivered;
    mqtt.onMessageArrived = mqtt_onMessageArrived;
    mqtt.onConnected = mqtt_onConnect_proper;

    var options = {
        timeout: 3, /* if the connection is not successful within 3 seconds, it is deemed to have failed */
        onSuccess: mqtt_onConnect_connect,
        onFailure: mqtt_onConnectFailure_connect,
        userName: params.userName,
        password: params.password,
        keepAliveInterval: 30,
        useSSL: true,
        reconnect: false,
        mqttVersion: 4
    };
    mqtt.connect(options);
}
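Purely for illustration, this is roughly how such a function could be called; the endpoint, client id and credentials below are made-up placeholders, not my actual values:

/* hypothetical call - all values are placeholders */
MQTTconnect_inner({
    uri: "wss://example.com/mqtt",     /* WebSocket endpoint terminated by Envoy */
    clientid: "js." + Date.now(),      /* unique client id, similar in shape to the ids in the traces below */
    userName: "exampleuser",
    password: "examplepassword"
});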


Debugging steps

Timing

I sat down with my mobile phone to time the timeouts and see whether there was a pattern.

In my implementation, Paho is reconnected by the surrounding JavaScript code every 10 seconds after a timeout.
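A minimal sketch of what this reconnect logic could look like (not my exact production code; connection_params is a hypothetical variable holding the last set of connection parameters):

function mqtt_onConnectionLost(responseObject) {
    if (responseObject.errorCode !== 0) {
        console.log("MQTT connection lost: " + responseObject.errorMessage);
    }
    /* try again in 10 seconds with the same parameters */
    setTimeout(function () {
        MQTTconnect_inner(connection_params); /* connection_params: hypothetical global with the last params used */
    }, 10000);
}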

We see the Paho JavaScript MQTT client losing the connection every minute or every two minutes. This is already a strong indication that the problem is systematic, possibly a forced disconnect by the server.

JavaScript console

The error message I see in the console is:

WebSocket is already in CLOSING or CLOSED state.

In retrospect, this is possibly Paho trying to send the PINGREQ but failing by just a couple of seconds, or possibly microseconds, as the server had already disconnected the socket.

NB: For better compatibility with company firewalls and the like, I run Paho over WebSockets.

Verification with another browser

I have seen this issue with both Chrome (my main browser) and Firefox, so it is not a browser-specific problem.

(I was initially led to believe that it might be due to a browser issue by this thread: https://github.com/socketio/socket.io/issues/3259)

Debugging VerneMQ

Show full client_id and mountpoint to avoid truncation:

vmq-admin session show --mountpoint --client_id

Run a trace:

vmq-admin trace client client-id=js.1567943187982KSWuLXb --mountpoint=5cc72b97a6b76f00013d35d2

As you can see, you get a trace of what is sent to the client and received from the client.

Starting trace for 1 existing sessions for client "js.1567943187982KSWuLXb" with PIDs

[<9043.1475.12>]

<9043.1475.12> MQTT SEND: MP: "5cc72b97a6b76f00013d35d2" CID: "js.1567943187982KSWuLXb" PUBLISH(d0, q0, r0, m0, "pi/000000007eaefc47/com.picockpit/sensors/$state/pi") with payload:

{"wlan_signal_info": "Link Quality=61/70 Signal level=-49 dBm", "cpu_load": 1.4, "net_io_bytes_received": 798582943, "root_partition_used": 2028572672, "net_io_bytes_sent": 1482989984, "soc_temperatu (truncated)

I also saw a PINGREQ() and PINGRESP(), keeping the session alive in this trace:

<9043.1475.12> MQTT RECV: MP: "5cc72b97a6b76f00013d35d2" CID: "js.1567943187982KSWuLXb" PINGREQ()

<9043.1475.12> MQTT SEND: MP: "5cc72b97a6b76f00013d35d2" CID: "js.1567943187982KSWuLXb" PINGRESP()

Here is what the disconnect looks like:

<9043.1475.12> MQTT SEND: MP: "5cc72b97a6b76f00013d35d2" CID: "js.1567943187982KSWuLXb" PUBLISH(d0, q0, r0, m0, "pi/00000000327f9cbf/com.picockpit/sensors/$state/pi") with payload:

{"wlan_signal_info": "Link Quality=68/70 Signal level=-42 dBm", "ram_total_bytes": 969392128, "$uuid": "928ca964-d22e-11e9-9e22-dca632007438", "disk_io_bytes_written": 619947008, "root_partition_tota (truncated)

<9043.1475.12> Trace session for js.1567943187982KSWuLXb stopped

Unfortunately, the trace gives no indication that this is due to a timeout waiting for a PINGREQ from the client.

Debugging Envoy

As there was no indication that this was due to a PINGREQ timeout, suspicion fell upon Envoy disconnecting due to timeouts (possibly handling WebSockets incorrectly because I had not set some config option properly).

The configuration was, however, applied properly:

upgrade_configs:
  - upgrade_type: "websocket"
    enabled: true

for the correct path (/mqtt)
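For context, this is a sketch of one possible placement of a route-scoped WebSocket upgrade in Envoy's v2 API; it is not my complete listener configuration, and the cluster name vernemq_ws is a placeholder:

routes:
  - match: { prefix: "/mqtt" }
    route:
      cluster: vernemq_ws            # placeholder backend cluster name
      upgrade_configs:
        - upgrade_type: "websocket"
          enabled: true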

There are no timeouts apart from connect_timeout for the backends (which are set to different values, certainly nowhere near a minute).
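For illustration, a backend cluster with such a connect_timeout might look roughly like this in the v2 API (name, address and value are placeholders, not my actual config):

clusters:
  - name: vernemq_ws
    connect_timeout: 5s              # placeholder; the point is that it is far below a minute
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    hosts:
      - socket_address: { address: vernemq, port_value: 8080 }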

To enable debug output from Envoy, you have to add an environment variable to the docker-compose.yaml:

environment:
  loglevel: debug

as the last lines in the stanza for the envoy service.
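A minimal sketch of what that stanza could look like in docker-compose.yaml (image tag, volumes and ports are placeholders, not my actual deployment):

services:
  envoy:
    image: envoyproxy/envoy:v1.11.1        # placeholder image tag
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml
    ports:
      - "443:443"
    environment:
      loglevel: debug                      # the variable that switches on debug output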

Here is what I see in the trace:

':status', '101'

'connection', 'Upgrade'

'date', 'Sun, 08 Sep 2019 12:08:36 GMT'

'sec-websocket-accept', 'mWAEi6p3RBFflHkNbxTMbnhaG4I='

'sec-websocket-extensions', 'permessage-deflate; client_max_window_bits=15'

'sec-websocket-protocol', 'mqtt'

'server', 'envoy'

'upgrade', 'websocket'

'content-length', '0'


[2019-09-08 12:08:38.757][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:08:43.760][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:08:48.755][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:08:53.759][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:08:58.762][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:03.762][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:08.762][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:13.765][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:18.766][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:23.766][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:28.765][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:33.766][1][debug][main] [source/server/server.cc:170] flushing stats

[2019-09-08 12:09:37.190][26][debug][connection] [source/common/network/connection_impl.cc:520] [C9] remote Close

[2019-09-08 12:09:37.190][26][debug][connection] [source/common/network/connection_impl.cc:190] [C9] closing socket: 0

[2019-09-08 12:09:37.190][26][debug] [source/common/http/codec_client.cc:82] [C9] disconnect. resetting 1 pending requests

[2019-09-08 12:09:37.190][26][debug] [source/common/http/codec_client.cc:105] [C9] request reset

[2019-09-08 12:09:37.190][26][debug][router] [source/common/router/router.cc:868] [C8][S10834556124828578161] upstream reset: reset reason connection termination

[2019-09-08 12:09:37.190][26][debug][http] [source/common/http/conn_manager_impl.cc:187] [C8][S10834556124828578161] doEndStream() resetting stream

[2019-09-08 12:09:37.191][26][debug][http] [source/common/http/conn_manager_impl.cc:1596] [C8][S10834556124828578161] stream reset

[2019-09-08 12:09:37.191][26][debug][connection] [source/common/network/connection_impl.cc:101] [C8] closing data_to_write=0 type=2

[2019-09-08 12:09:37.191][26][debug][connection] [source/common/network/connection_impl.cc:653] [C8] setting delayed close timer with timeout 1000 ms

[2019-09-08 12:09:37.191][26][debug][pool] [source/common/http/http1/conn_pool.cc:129] [C9] client disconnected, failure reason:

[2019-09-08 12:09:38.192][26][debug][connection] [source/common/network/connection_impl.cc:642] [C8] triggered delayed close

[2019-09-08 12:09:38.192][26][debug][connection] [source/common/network/connection_impl.cc:190] [C8] closing socket: 1

[2019-09-08 12:09:38.192][26][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:270] [C8] SSL shutdown: rc=0

[2019-09-08 12:09:38.192][26][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C8]

[2019-09-08 12:09:38.193][26][debug][main] [source/server/connection_handler_impl.cc:80] [C8] adding to cleanup list

[2019-09-08 12:09:38.767][1][debug][main] [source/server/server.cc:170] flushing stats

This coincides with the disconnect on the client.

As you can see, this is initiated by

[C9] remote Close

This again throws suspicion back upon VerneMQ disconnecting the client due to a timeout issue.

Paho timing options

A good source for reading about MQTT and keep-alives is Steve’s Internet Guide:

The client must send keepalive packets (PINGREQ), in the absence of other control packets, and is responsible for the interval between the packets not exceeding the Keep Alive value set on the broker.

The client can send the packets earlier

The next step of the investigation led to looking at Paho and the keepalive. My initial concern was that Paho would not send keepalives by itself, so that I would have to run a loop and send them.

This, however, turned out not to be true: Paho actually handles the keepalives itself.

Here is the source code of the Paho MQTT JavaScript client:

Line 703 and following contain the “Pinger”.

It defines a doTimeout function, which in turn returns doPing.

doPing is defined a bit further below; it runs the actual ping and sets up a new timeout with this._keepAliveInterval:

this._client.socket.send(pingReq);
this.timeout = setTimeout(doTimeout(this), this._keepAliveInterval);

By now, I am pretty certain that the keepAliveInterval has to be decreased from Paho's default of 60 seconds. Possibly due to network latencies, what worked for me locally does not work remotely any more.

So how do we set a different keep-alive interval?

Paho documentation:

The Paho client instance has a method connect, which takes an object as its single parameter.

The object has several attributes, one of them being keepAliveInterval.

I set this keepAliveInterval to 30 seconds. Here is the current set of parameters I use:

var options = {
    timeout: 3, /* if the connection is not successful within 3 seconds, it is deemed to have failed */
    onSuccess: mqtt_onConnect_connect,
    onFailure: mqtt_onConnectFailure_connect,
    userName: params.userName,
    password: params.password,
    keepAliveInterval: 30,
    useSSL: true,
    reconnect: false,
    mqttVersion: 4
};
mqtt.connect(options);

With mqtt being an instance of the Paho MQTT Client.