497 Day Bug

I recently ran into a server on a site that had seemingly stopped servicing DHCP requests. After a little digging, the System event log showed when the DHCP service first started throwing errors: Event ID 1059, “The DHCP service failed to see a directory server for authorisation.” So DHCP was having trouble talking to a domain controller, which seemed odd, as the server was a domain controller itself. A quick check with dcdiag returned the following output:

Ldap search capabality attribute search failed on server HOSTNAME, return
value = 81

It appeared that the server had stopped servicing LDAP requests too (return value 81 is the LDAP “server down” error). And yes, the word “capability” really is misspelt in the error dcdiag returns.

About the same time the DHCP service started reporting problems contacting a domain controller, the Group Policy client also started reporting that it was unable to find a DC. That was to be expected, as it was trying to contact the server itself and failing in the same way. The Directory Service event log was complaining about replication, but not much else. Checking replication with repadmin /replsummary threw another error about communicating via LDAP.
Running the same command from another machine suggested the RPC server was down on the affected server, which it wasn’t; the service was up. So I checked the ports with a quick netstat -no and was greeted with tens of thousands of ports, all in a TIME_WAIT state. That would explain things: if there are no free ports left for RPC, various things will start to break. Googling “Ports not closing TIME_WAIT” led me to a hotfix from Microsoft, “All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days from system startup in Windows Vista, in Windows 7, in Windows Server 2008 and in Windows Server 2008 R2”.
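
To put a number on a TIME_WAIT pile-up like the one described above, the netstat output can be grouped by connection state. What follows is only a rough sketch: the function name Get-TcpStateCount is my own invention, and the sample rows (addresses, ports and PIDs) are made up to mimic the layout netstat -no prints.

```powershell
# Count TCP connections per state from netstat-style text.
# Get-TcpStateCount is a hypothetical helper name, not a built-in cmdlet.
function Get-TcpStateCount {
    param([string[]]$NetstatLines)

    # Keep only TCP rows (headers and UDP rows have no state column),
    # then take the 4th whitespace-separated column, which is the state.
    $NetstatLines |
        Where-Object { $_ -match '^\s*TCP\s' } |
        ForEach-Object { ($_.Trim() -split '\s+')[3] } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object Name, Count
}

# Made-up sample rows in the shape netstat -no prints (illustrative values only)
$sample = @(
    '  Proto  Local Address          Foreign Address        State           PID',
    '  TCP    10.0.0.5:49152         10.0.0.9:389           TIME_WAIT       0',
    '  TCP    10.0.0.5:49153         10.0.0.9:389           TIME_WAIT       0',
    '  TCP    10.0.0.5:49154         10.0.0.9:135           ESTABLISHED     612',
    '  UDP    0.0.0.0:67             *:*                                    1400'
)
$counts = Get-TcpStateCount -NetstatLines $sample
$counts
```

On a live server the same function could be fed the real thing with Get-TcpStateCount -NetstatLines (netstat -no); a TIME_WAIT count in the tens of thousands is the symptom to look for.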

A little further Googling showed that this problem is not limited to Microsoft, with various other vendors and products mentioned as affected, such as Avaya, Brocade, Cisco, EMC, QLogic and VAX/VMS:

The 497 Day Uptime Bug
497 – The number of the IT beast

From the IBM post linked above:

Basically a 32bit counter used to record uptime will cause this problem when it overflows. If you record a tick for every 10 msec of uptime, then a 32-bit counter will overflow after approximately 497.1 days. This is because a 32 bit counter equates to 2^32, which can count 4,294,967,296 ticks. Because a tick is counted every 10 msec, we create 8,640,000 ticks per day (100*60*60*24). So after 497.102696 days, the counter will overflow.
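
The arithmetic in that quote is easy to check for yourself. This is just a quick sketch of the same sum in PowerShell; the variable names are mine.

```powershell
# Reproduce the overflow arithmetic: a 32-bit counter ticking every 10 ms.
$ticksPerSecond = 100                               # one tick every 10 ms
$ticksPerDay    = $ticksPerSecond * 60 * 60 * 24    # 8,640,000
$counterRange   = [math]::Pow(2, 32)                # 4,294,967,296 distinct values
$daysToOverflow = $counterRange / $ticksPerDay

# Days of uptime before the counter wraps, rounded to six places
[math]::Round($daysToOverflow, 6)                   # 497.102696
```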

So all that was left was to patch the thing. While a hotfix is fine, this one is fairly old, and I wondered whether it had been included in standard Windows updates. Helpfully, Microsoft advises that if you have the following security bulletin installed then the hotfix is not needed, suggesting the fix is included in the security patch:

Microsoft Security Bulletin MS12-032 – Important

Again though, that’s a pretty old patch itself, and I guessed it must have been rolled into a later update at some point. A quick search of the Microsoft Update Catalog for MS12-032 shows that update and all the updates that supersede it, so you can check whether that KB or any superseding KB is installed on your system.
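
Once you have the list of KB numbers from the catalog, the check itself is just "is at least one of these installed?". Here is a minimal sketch of that comparison. The function name Test-AnyHotFixInstalled is hypothetical, and the KB numbers are placeholders rather than the real ones, so substitute whatever the Update Catalog lists.

```powershell
# Returns $true when at least one of the wanted KB IDs is in the installed list.
# Test-AnyHotFixInstalled is a hypothetical helper name, not a built-in cmdlet.
function Test-AnyHotFixInstalled {
    param(
        [string[]]$InstalledIds,
        [string[]]$WantedIds
    )
    # -contains is case-insensitive, so 'kb0000002' matches 'KB0000002'
    [bool]($WantedIds | Where-Object { $InstalledIds -contains $_ })
}

# Placeholder KB numbers for illustration only; these are NOT the real update IDs
$wanted    = @('KB0000001', 'KB0000002')
$installed = @('KB0000002', 'KB0000003')
$covered   = Test-AnyHotFixInstalled -InstalledIds $installed -WantedIds $wanted
$covered
```

On a real server the installed list could come from Get-HotFix | Select-Object -ExpandProperty HotFixID, with -ComputerName added to query a remote machine.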

If one of them is installed, you should be covered against this. If not, keep an eye on your systems: once they pass 497.1 days of uptime, you may find that some services start to fail, such as AD DS and other dependent services.

Below is a basic PowerShell script I wrote to get the uptime of domain controllers, in this example to see if any were approaching the 497-day mark, so they could then be checked in WSUS for the right patches. WSUS is your friend here when it comes to rolling out fixes like this.

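# Needs the ActiveDirectory module for Get-ADDomainController
$dcs = Get-ADDomainController -Filter * | Sort-Object Name
foreach ($dc in $dcs)
{
    $name = $dc.Name
    # Ask each DC for its last boot time over WMI
    $os = Get-WmiObject -ComputerName $name -Class Win32_OperatingSystem
    # Convert the WMI (DMTF) timestamp to a DateTime and work out the uptime
    $boot = [Management.ManagementDateTimeConverter]::ToDateTime($os.LastBootUpTime)
    Write-Host "`n"
    $name
    $boot
    "Uptime: {0:N1} days" -f ((Get-Date) - $boot).TotalDays
}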
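
Building on the last boot time the script above retrieves, you could also work out how many days each server has left before the counter wraps. This is only a sketch: Get-DaysUntilTickOverflow is a name I made up, it assumes the 10 ms tick from the quote above, and the dates in the example are illustrative.

```powershell
# Days remaining before a 32-bit, 10 ms tick counter overflows.
# Get-DaysUntilTickOverflow is a hypothetical helper name, not a built-in cmdlet.
function Get-DaysUntilTickOverflow {
    param(
        [datetime]$LastBoot,
        [datetime]$Now = (Get-Date)
    )
    # 2^32 ticks at 8,640,000 ticks per day is roughly 497.1 days
    $overflowDays = [math]::Pow(2, 32) / (100 * 60 * 60 * 24)
    $uptimeDays   = ($Now - $LastBoot).TotalDays
    # A negative result means the server is already past the overflow point
    [math]::Round($overflowDays - $uptimeDays, 1)
}

# Illustrative dates only: exactly 365 days of uptime
$remaining = Get-DaysUntilTickOverflow -LastBoot ([datetime]'2015-01-01') -Now ([datetime]'2016-01-01')
$remaining    # 132.1
```

Anything showing a small or negative number is a candidate for patching or a planned reboot first.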

Obviously the lesson here is to keep your servers updated, but as we all know there are times when that’s not possible. In that case, at least this should help you find and fix the ones that might be affected.