Debugging VerneMQ connection / authentication problems for dual authentication with webhooks and vmq_diversity
TL;DR
https://github.com/vernemq/vernemq/blob/master/apps/vmq_diversity/src/vmq_diversity_plugin.erl
includes a statement which will block authentication / authorization through any other channels if your Lua script returns “false”:
will lead to:
instead of trying other plugins, as you might have assumed!
In order to avoid this problem, simply do not return anything from your Lua script, if you do not want to handle the message!
Debugging
VerneMQ has a built-in trace tool, which will help you to see which problems can occur. First, display the sessions, in order to find out which data your clients are trying to connect with:
vmq-admin session show
Then run a trace:
vmq-admin trace client client-id=000000007eaefc47 –mountpoint=5c61a59ba6b76f558d256f42
Note that the mountpoint defaults to “” (the default mountpoint for Verne) – if you are not setting a custom mountpoint on the listeners ( as me), then you can omit this.
Initially vmq-admin trace will report that no sessions are found for this client – if your client is not currently connected.
Connect your client, and there will be more output.
It is also helpful to increase the verbosity level in the logger to debug (for vernemq.conf):
log.console.level = debug
econnrefused
In my particular case, my webhook endpoint was not running. After starting it up, I now get (multiple attempts with similar output):
As you can see, the client is now being authenticated, but not being authorized to publish. The client is then disconnected (as per MQTT 3.1.1 spec).
And this is the log output:
what is happening here, is that the client is authenticated using a webhook, and then it tries to become authorized using a Lua script ( MongoDB authorization).
I am not sure why auth_on_publish for the client is not called using a webhook as well. This client in any case is supposed to be authenticated fully using MongoDB.
I also only see requests for auth-on register:
To show which webhooks are registered, you can run
vmq-admin webhooks show
see: https://docs.vernemq.com/plugindevelopment/webhookplugins
All webhooks are called with the method POST, and need to be answered with the HTTP code 200 to be considered successful.
A request for the webhook server
Here is how a request will look like for the Webhook server:
headers are passed, with the header vernemq-hook set to auth_on_register in this case.
The post body will include different information on the client trying to connect.
After using “next” as a reply from the webhook, the publish works correctly – the client is authenticated against the MongoDB, and afterwards the publish is authorized, as desired.
Here’s how the answer from the webhook should look like to pass authentication on to the next plugin:
{“result”: “next”}
And here’s how the trace looks now:
At the bottom you can see the keepalives for the client (PINGREQ, PINGRESP).
So, one client is working now. Onwards to the next client – which should be authenticated and authorized using webhooks!
Webhook client
As you can see, by the way, the mountpoint gets rewritten from “piclient” to a different mountpoint by the MongoDB Lua script.
Basically I am using the mountpoint on the listeners to indicate which authentication method should be used for this client.
In version 2 of VerneMQ, we are promised, there will be a separation of Plugins per Listener – therefore we can avoid these checks which add overhead to each connection.
apparently only one trace can be run at a time. Trying again after shutting down the other trace:
vmq-admin trace client client-id=my_clientid –mountpoint=jsclient
Please note that the message auth_on_publish debug is printed by an additional statement I inserted in the Lua script:
in auth_commons.lua
Understanding what to return from the Lua script
What we want is probably to return from the Lua script that the next hook should be called.
“In case the plugin can’t validate the publish message it is best to return next
as this would allow subsequent plugins in the chain to validate the request.”
(see https://docs.vernemq.com/plugindevelopment/publishflow )
I assumed that returning false would imply that, but apparently not?
Let’s look in the Erlang source of vernemq
https://github.com/vernemq/vernemq
This is a test suite for the auth_cache for vmq_diversity (where the Lua scripts live):
vmq-admin trace client client-id=my_clientid –mountpoint=jsclient
Adding some additional debugging for the message in Lua:
function auth_on_publish(pub)
print(“***auth_on_publish debug*** (LUA)”)
for k,v in pairs(pub) do
print(k .. ” ” .. v)
end
print(pub)
return false
end
Note that the code seems to be faulty (false apparently can’t be concatenated using .. in Lua – OK.)
In any case, in the Lua table, which is passed in as “pub”, we have the following parameters:
- client_id
- mountpoint
- payload
- qos
now we could check for mountpoint being “jsclient” (which should be authenticated using webhooks), and for now force true:
function auth_on_publish(pub)
print(“***auth_on_publish debug*** (LUA)”)
if pub[“mountpoint”] == “jsclient” then
return true
else
return false
end
end
Here is how a trace looks like for this artificial condition:
Back to Erlang
https://github.com/vernemq/vernemq/blob/master/apps/vmq_diversity/src/vmq_diversity_plugin.erl
when auth_on_publish is called for the vmq_diversity_plugin.erl the result value will be obtained like this:
- if vmq_diversity_cache:match_publish_acl will return true, it will be “ok”.
- if it returns false, it will throw an error
- if it returns no_cache, the result will be dependend on the all_till_ok function (called with auth_on_publish)
as you can see here are the values passed on to our Lua script, including the mountpoint
conv_res will modify the returned result:
I highly recommend if you are not familiar with Erlang to read my previous article, https://pi3g.com/2019/05/12/understanding-erlang-lua-luerl-for-vernemq-mongodb-auth_on_register-hook/
Basically if {error, lua_script_returned_false} is not matched within the results, for any (_) name and any Other, the result will be Other – simply passing on the results of all_till_ok to the calling function.
For the name lua_script_returned_false, the error will be not_authorized.
I am not entirely sure if this vmq_plugin_helper.erl is where the main controlling logic (of passing on on next) is:
in any case, probably “next” is expected to be returned.
Here, in vmq_diversity_plugin.erl all_till_ok is defined:
https://github.com/vernemq/vernemq/blob/master/apps/vmq_diversity/src/vmq_diversity_plugin.erl
As you can see, if “true” is returned by the script, the return value is set to “ok”.
If “false” is returned from the script, the previously mentioned error is returned, and execution stops –> not authorized.
This means that if you return false in the Lua scripts, no other plugins are going to be called!
alternatively modifiers can be returned.
The trick is not to return anything, in case you do not want to handle the request:
function auth_on_publish(pub)
print(“***auth_on_publish debug*** (LUA)”)
if pub[“mountpoint”] == “jsclient” then
print(“*** jsclient detected ***”)
print(“not returning anything”)
— return true
else
print(“+++ other client detected +++”)
return false
end
end
Note that — (two dashes) is a comment in Lua.
In this case, the auth_on_publish hook gets called using webhooks:
Bonus
Error generating config with cuttlefish:
This can be debugged further by setting the command in docker-compose.yml to:
command: “vernemq config generate -l debug /vernemq/etc/vernemq.conf.local”
and bringing the container up (note that I have created a .local file, so certain settings are not overwritten anymore).
The error message I get is:
jalpine_morpheus | escript: exception error: no function clause matching
jalpine_morpheus | lists:unzip([[ws,ws],
jalpine_morpheus | {[“log”,”console”,”file”],”./log/console.log”},
apparently this is related to cuttlefish choking on lines which contain only a space:
https://github.com/basho/cuttlefish/issues/222
pressing Alt + P in nano will enable whitespace display – good for debugging this particular issue.
The problematic section in my config looks like this:
here there are two tabs (?) where they should be nothing.
Removing the problematic line will allow VerneMQ to start again.
Further reading
This might also be helpful:
http://www.steves-internet-guide.com/mqtt-protocol-messages-overview/
Changelog of VerneMQ:
https://github.com/vernemq/vernemq/blob/fe514561cbeebb2ac4f7c98a87c64d766587c1d7/changelog.md
NB: in Version 1.8 an interesting new plugin is the vmq_pulse plugin, which will push cluster information to a pulse backend via HTTP(S).
Showing some additional Lua scripting for VerneMQ diversity.