Empty fields omitted in JSON conversion


#1 Patrick_Berchtold

This is a complex question about a complex problem, but please feel free to read it anyway :-)

We use NXLog to read the Windows eventlog and also csv files. We send the data to a linux loghost which does some regexp-based parsing.

We now encounter problems with missing fields.

Example 1: A csv file with three columns A, B and C. It looks like this:

#A,B,C
a,b,c
1,2,3
x,y,z

NXLog reads this file, uses an xm_csv module to parse the content, uses an xm_json module to convert it to JSON, uses an xm_syslog module to further convert it to syslog and finally sends it to the syslog server. At first glance this works fine. Here is the result from the syslog server:

...SourceModuleType:im_file,A:a,B:b,C:c,Hostname:...
...SourceModuleType:im_file,A:1,B:2,C:3,Hostname:...
...SourceModuleType:im_file,A:x,B:y,C:z,Hostname:...

However, as soon as we have empty values in a csv row, we run into problems:

#A,B,C
a,b,
1,,3
,y,

leads to:

...SourceModuleType:im_file,A:a,B:b,Hostname:...
...SourceModuleType:im_file,A:1,C:3,Hostname:...
...SourceModuleType:im_file,B:y,Hostname:...

All the fields that are empty in the csv file are now absent in the syslog message. (And this is a huge issue for our regexp parser.)

Interestingly, this:

#A,B,C
"a","b",""
1,"",3
,y,""

leads to:

...SourceModuleType:im_file,A:a,B:b,C:,Hostname:...
...SourceModuleType:im_file,A:1,B:,C:3,Hostname:...
...SourceModuleType:im_file,B:y,C:,Hostname:...

So it looks like NXLog treats an empty string in a different way than "nothing". (However, this is of limited value, as we are dealing with csv files created by applications, such as Exchange Server logfiles.)

The same behaviour applies not only to csv-based file inputs but also to the Windows eventlog input.

Example 2: Windows security log

Windows event 4624 (successful login) includes the two fields "TargetUserName" and "TargetDomainName". If users log in to a system using "DOMAIN\username" as their username, everything works fine:

...,TargetUserName:Administrator,TargetDomainName:DEMO,TargetLogonId:...

However, if a user logs in with the UPN (user.name@domain.org), Windows writes the UPN into the "TargetUserName" field and leaves the "TargetDomainName" field empty. This results in:

...,TargetUserName:Administrator@demo.local,TargetLogonId:...

The "TargetDomainName" field is missing.

I have already spent a lot of time troubleshooting this issue, but still haven't found THE solution. This is what I found out so far:

  • The parse_csv() function of the xm_csv extension module does not always create an NXLog field for each value in a row. If there is a value, such as in 1,2,3, a field with the respective value is generated. For empty strings, such as in 1,"",3, a field is generated as well, with an empty string as its value. But for "nothing", such as in 1,,3, no field is generated, and this seems to be the root cause of our problem.
  • Both to_json() and to_kvp() add all existing NXLog fields to the message, even the ones having "undef" values. But of course, fields that don't exist do not appear in the message.
  • I could not find a way to distinguish between an NXLog field that is present but has an "undef" value and a field that is not present. The if defined($A) construct returns false in both cases.
  • There is a (not so elegant) solution for the problem that applies to csv files only: Before calling parse_csv() all fields can be initialized manually, like this: $A = ""; $B = ""; $C = ""; parse_csv(); However, this does not apply to the Windows eventlog input, because the fields differ between Windows event ids.
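
To make the findings above concrete, here is a small Python model of the observed behaviour and of the pre-initialisation workaround. It is only a sketch of what the outputs above suggest, not NXLog's actual implementation: parse_row() and to_kv() are hypothetical helper names, and the naive split on "," does not handle commas inside quoted values.

```python
def parse_row(line, columns, preset=None):
    """Toy model of the behaviour observed above (not NXLog's real code)."""
    # Start from any pre-initialised fields, as in: $A = ""; $B = ""; $C = "";
    record = dict(preset or {})
    for name, token in zip(columns, line.split(",")):
        if token == "":
            # "Nothing" between two commas: no field is created,
            # so a pre-initialised value (if any) survives.
            continue
        # A quoted empty string ("") becomes a field holding "".
        record[name] = token.strip('"')
    return record


def to_kv(record):
    """Render the fields the way they show up in the syslog excerpts."""
    return ",".join(f"{key}:{value}" for key, value in record.items())


columns = ["A", "B", "C"]
defaults = {name: "" for name in columns}

print(to_kv(parse_row('1,,3', columns)))            # A:1,C:3
print(to_kv(parse_row('1,"",3', columns)))          # A:1,B:,C:3
print(to_kv(parse_row('1,,3', columns, defaults)))  # A:1,B:,C:3
```

The third call shows why the workaround helps for csv input: the column list is known in advance, so every field can be given a default. For the Windows eventlog there is no such fixed list, which is exactly the limitation described in the last bullet.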

So finally, the questions:

  1. Does anybody have a (config-based) solution for this problem?
  2. Is a change in NXLog behaviour needed to resolve the root cause? (I hope NXLog staff is reading this post.)
#2 b0ti Nxlog ✓

I think the problem is caused by NXLog and its modules not handling the following two cases consistently:

  • field exists but is not defined, i.e. it is undef.
  • field does not exist.

Some modules took the liberty of omitting a field instead of storing the undef value in it, as in most situations the two are equivalent. Obviously we should follow Perl here with respect to the language and implement the exists() function, and also review all code to make sure existence and defined-ness are handled consistently.

This question should explain better.
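
As a rough illustration of the distinction (in Python rather than Perl or the NXLog language, and only as an analogy), a record can be modelled as a dict in which None plays the role of undef. The helper names exists() and defined() are hypothetical and simply mirror the Perl functions of the same name; nothing here is an existing NXLog function.

```python
def exists(record, name):
    """True if the field is present at all, even if its value is undef (None)."""
    return name in record


def defined(record, name):
    """True only if the field is present and holds a value other than None."""
    return record.get(name) is not None


# Field omitted by the module: it does not exist.
omitted = {"TargetUserName": "Administrator@demo.local"}

# Field stored with an undef value: it exists but is not defined.
stored_undef = {
    "TargetUserName": "Administrator@demo.local",
    "TargetDomainName": None,
}

for record in (omitted, stored_undef):
    print(exists(record, "TargetDomainName"), defined(record, "TargetDomainName"))
# False False
# True False
```

defined() gives the same answer for both records, which matches what was reported in the first post; only an exists()-style check tells them apart.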