12
responses

Hi.

We are facing this problem that NXLog takes a lot of memory when using it to collect logs from PostgreSQL database. I tried to modify the polling interval in the config but it did not help.

Our config is like this now:

<Input PostgreSQL>

Module im_dbi
Driver pgsql
SavePos false
PollInterval 5
Option host 127.0.0.1
Option username *****
Option password **************
Option dbname messagelog
SQL SELECT id, discriminator, time, queryid, message, timestamprecord, response, memberclass, membercode, subsystemcode FROM logrecord
Exec $SourceName = 'PostgreSQL';
Exec to_json();
</Input>

<Output out>
Module om_tcp
Host 192.168.1.1
Port 1468
Exec to_syslog_ietf();
</Output>

<Route 1>
Path PostgreSQL => out
</Route>

I enabled debug and it produces a huge amount of these lines per second:

2019-04-09 15:00:57 DEBUG worker 0 processing event 0x7f67240a6d80
2019-04-09 15:00:57 DEBUG PROCESS_EVENT: POLL (out)
2019-04-09 15:00:57 DEBUG nx_module_pollset_poll: out
2019-04-09 15:00:57 DEBUG worker 2 got signal for new job
2019-04-09 15:00:57 DEBUG worker 2 got no event to process
2019-04-09 15:00:57 DEBUG worker 2 waiting for new event
2019-04-09 15:00:57 DEBUG [out] no poll events, pollset_poll timed out
2019-04-09 15:00:57 DEBUG nx_event_to_jobqueue: POLL (out)
2019-04-09 15:00:57 DEBUG event added to jobqueue
2019-04-09 15:00:57 DEBUG worker 0 processing event 0x7f67240a4fb0
2019-04-09 15:00:57 DEBUG PROCESS_EVENT: POLL (out)
2019-04-09 15:00:57 DEBUG nx_module_pollset_poll: out
2019-04-09 15:00:57 DEBUG worker 1 got signal for new job
2019-04-09 15:00:57 DEBUG worker 1 got no event to process
2019-04-09 15:00:57 DEBUG worker 1 waiting for new event
2019-04-09 15:00:57 DEBUG [out] no poll events, pollset_poll timed out
2019-04-09 15:00:57 DEBUG nx_event_to_jobqueue: POLL (out)
2019-04-09 15:00:57 DEBUG event added to jobqueue
2019-04-09 15:00:57 DEBUG worker 0 processing event 0x7f67240a6d80
2019-04-09 15:00:57 DEBUG PROCESS_EVENT: POLL (out)
2019-04-09 15:00:57 DEBUG nx_module_pollset_poll: out
2019-04-09 15:00:57 DEBUG worker 2 got signal for new job
2019-04-09 15:00:57 DEBUG worker 2 got no event to process
2019-04-09 15:00:57 DEBUG worker 2 waiting for new event
2019-04-09 15:00:57 DEBUG [out] no poll events, pollset_poll timed out
2019-04-09 15:00:57 DEBUG nx_event_to_jobqueue: POLL (out)
2019-04-09 15:00:57 DEBUG event added to jobqueue
2019-04-09 15:00:57 DEBUG worker 0 processing event 0x7f67240a4fb0

We are not seeing the memory usage problem on other log collecting methods. Any ideas what could cause this?

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21239 root 20 0 162120 2396 1600 R 0.3 0.1 0:00.04 top
28670 nxlog 20 0 1589652 1.3g 3676 S 0.3 34.4 14:33.41 nxlog

AskedApril 9, 2019 - 2:07pm

Answer (1)

Thanks for the bug report. We will investigate the source of the memory leak.

Comments (11)

  • JaVa's picture

    Here is what the system logs say about it:

    Mar 15 14:49:28 server01 kernel: [8814660.305262] [32363] 993 32363 624602 412246 1141 143229 0 nxlog
    Mar 15 14:49:28 server01 kernel: [8814660.305291] Out of memory: Kill process 32363 (nxlog) score 451 or sacrifice child
    Mar 15 14:49:28 server01 kernel: [8814660.305337] Killed process 32363 (nxlog) total-vm:2498408kB, anon-rss:1648984kB, file-rss:0kB, shmem-rss:0kB
    Mar 15 14:49:28 server01 kernel: [32363] 993 32363 624602 412246 1141 143229 0 nxlog
    Mar 15 14:49:28 server01 kernel: Out of memory: Kill process 32363 (nxlog) score 451 or sacrifice child
    Mar 15 14:49:28 server01 kernel: Killed process 32363 (nxlog) total-vm:2498408kB, anon-rss:1648984kB, file-rss:0kB, shmem-rss:0kB
    Mar 15 14:49:28 server01 systemd: nxlog.service: main process exited, code=killed, status=9/KILL
    Mar 15 14:49:28 server01 nxlog: removing stale pidfile /opt/nxlog/var/run/nxlog/nxlog.pid
    Mar 15 14:49:28 server01 systemd: Unit nxlog.service entered failed state.
    Mar 15 14:49:28 server01 systemd: nxlog.service failed.
    Mar 15 14:49:31 server01 systemd: nxlog.service holdoff time over, scheduling restart.
    Mar 15 14:49:31 server01 nxlog: 2019-03-15 14:49:31 INFO configuration OK

    Basically it takes eventually all the memory the system has and then it gets killed. This happens once in a week on our systems that are gathering logs from PostgreSQL with NXLog. We are running Red Hat Enterprise Linux Server release 7.6 (Maipo), postgresql-server.x86_64 9.2.24-1.el7_5 and libdbi-dbd-pgsql.x86_64 0.8.3-16.el7.

  • erempel's picture

    I can confirm that there is a significant memory leak. After 2 minutes of running 2MB is already lost, and that keeps growing indefinitely.
    The leak is in nxlog code exclusively. It appears that nxlog is NOT linked against the shared PostgreSQL libraries on the system.

    ==2014== 2,529,011 (98,016 direct, 2,430,995 indirect) bytes in 1,021 blocks are definitely lost in loss record 29 of 29
    ==2014== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
    ==2014== by 0x74CF432: ???
    ==2014== by 0x795C449: ???
    ==2014== by 0x74C9F13: ???
    ==2014== by 0x72C25E2: ???
    ==2014== by 0x72C3AFA: ???
    ==2014== by 0x43E9F3: nx_event_process (in /usr/bin/nxlog)
    ==2014== by 0x45272E: ??? (in /usr/bin/nxlog)
    ==2014== by 0x445F57: ??? (in /usr/bin/nxlog)
    ==2014== by 0x549BAA0: start_thread (in /lib64/libpthread-2.12.so)
    ==2014== by 0x5799C4C: clone (in /lib64/libc-2.12.so)

    All other shared libraries are Redhat 6 sources.

  • JaVa's picture

    Now we are facing a new problem and this time NXLog is taking a lot of CPU. This apparently started after upgrading to the newer version that b0ti gave to us. The problem is with that postgresql reading but i don't know if it happens with other log types as we haven't installed this new version to other machines than those who collect psql logs.

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    24488 postgres 20 0 236412 8168 5852 R 70.7 0.2 1:47.21 postgres
    24478 nxlog 20 0 329896 4848 3136 S 27.0 0.1 0:40.79 nxlog

    As you can see it actually makes psql do a lot of work also. If we stop NXLog service, the psql process also stops taking CPU resource.

  • JaVa's picture

    I enabled debug logging and we are seeing hundreds of these log messages per second:

    2019-05-13 21:27:50 DEBUG im_dbi read 0 rows
    2019-05-13 21:27:50 DEBUG worker 0 waiting for new event
    2019-05-13 21:27:50 DEBUG new event in event_thread [_fileop:SCHEDULE]
    2019-05-13 21:27:50 DEBUG new event in event_thread [_fileop:SCHEDULE]
    2019-05-13 21:27:50 DEBUG new event in event_thread [__CORE__:CONFIG_CACHE_FLUSH]
    2019-05-13 21:27:50 DEBUG new event in event_thread [TestSQL:READ]
    2019-05-13 21:27:50 DEBUG nx_event_to_jobqueue: READ (TestSQL)
    2019-05-13 21:27:50 DEBUG event added to jobqueue
    2019-05-13 21:27:50 DEBUG future event, event thread sleeping 2827006us in cond_timedwait
    2019-05-13 21:27:50 DEBUG worker 1 got signal for new job
    2019-05-13 21:27:50 DEBUG worker 1 processing event 0x7f4bc4009700
    2019-05-13 21:27:50 DEBUG PROCESS_EVENT: READ (TestSQL)
    2019-05-13 21:27:50 DEBUG im_dbi sql: SELECT id, discriminator, time, queryid, message, timestamprecord, response, memberclass, membercode, subsystemcode FROM logrecord WHERE id > 34603 LIMIT 50
    2019-05-13 21:27:50 DEBUG im_dbi read 0 rows
    2019-05-13 21:27:50 DEBUG worker 1 waiting for new event
    2019-05-13 21:27:50 DEBUG new event in event_thread [_fileop:SCHEDULE]
    2019-05-13 21:27:50 DEBUG new event in event_thread [_fileop:SCHEDULE]
    2019-05-13 21:27:50 DEBUG new event in event_thread [__CORE__:CONFIG_CACHE_FLUSH]
    2019-05-13 21:27:50 DEBUG new event in event_thread [TestSQL:READ]
    2019-05-13 21:27:50 DEBUG nx_event_to_jobqueue: READ (TestSQL)
    2019-05-13 21:27:50 DEBUG event added to jobqueue
    2019-05-13 21:27:50 DEBUG future event, event thread sleeping 2826778us in cond_timedwait
    2019-05-13 21:27:50 DEBUG worker 0 got signal for new job
    2019-05-13 21:27:50 DEBUG worker 0 processing event 0x7f4bbc002920
    2019-05-13 21:27:50 DEBUG PROCESS_EVENT: READ (TestSQL)
    2019-05-13 21:27:50 DEBUG im_dbi sql: SELECT id, discriminator, time, queryid, message, timestamprecord, response, memberclass, membercode, subsystemcode FROM logrecord WHERE id > 34603 LIMIT 50

    I tried to set the pollinginterval to 5 seconds but it makes no difference. Why is the software acting like this? That thread sleeping time should near three seconds and the id stays the same unless there are new lines to read and it reads them as it is supposed to so atleast that part is working.

    We also updated the nxlog-dbi when updating NXLog itself to the newest version so i don't think that is the problem?