Sunday 27 November 2011

RHEL 3 and AMD Devil chips, I think, help me please


Hello People
I have an unusual problem that I have been looking at for ages. I finally think I am getting close to the cause, but my test to verify it has failed, and I am not sure I tested it in the correct manner.
We have several servers (all our really important DB infrastructure) running on various really old HP servers. All the servers that have the problem are dual-processor AMD Opteron 252 machines. Only the RHEL 3 servers running on these chips have this problem; newer CPUs, and VMs on newer CPUs, have never had it. The problem is that these servers reboot at random intervals, with absolutely no OS logging as to why. The only indication that something fishy has gone on is in the iLO log, where you get the messages below:

Informational iLO 11/24/2011 11:03 11/24/2011 11:03 1 On-board clock set; was 11/23/2011 22:03:41.
Informational iLO 11/24/2011 10:59 11/24/2011 10:59 1 Server power restored.
Caution iLO 11/24/2011 10:59 11/24/2011 10:59 1 Server reset.
Informational iLO 11/23/2011 21:59 11/23/2011 21:59 1 On-board clock set; was 11/24/2011 10:59:48.
Informational iLO 11/18/2011 15:33 11/18/2011 15:33 1 Server power restored.
Caution iLO 11/18/2011 15:33 11/18/2011 15:33 1 Server reset.

As you can see, the on-board clock appears to be switching between local time and UTC. Local time is GMT +12.
We always see this entry when we get this type of unexplained reboot. The hardware clock is set to UTC, and we have hard-set that in the startup scripts.
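
For what it is worth, the size of the step in those "On-board clock set" entries can be worked out from the log itself. Below is a small C sketch of that arithmetic. The function names are made up for illustration, and the seconds of each "set" time are assumed to be :00, because the iLO column only shows minutes.

```c
#include <stdio.h>

/* Days since 1970-01-01 for a Gregorian calendar date.
   Pure calendar arithmetic, so no time zone or DST rules get involved. */
long long days_from_civil(int y, int m, int d)
{
    y -= (m <= 2);                       /* treat Jan/Feb as months of the previous year */
    long long era = (y >= 0 ? y : y - 399) / 400;
    long long yoe = y - era * 400;                                 /* year of era, 0..399 */
    long long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; /* day of year, March-based */
    long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;         /* day of era */
    return era * 146097 + doe - 719468;
}

/* Seconds since 1970-01-01 00:00:00 for a wall-clock reading, with no zone applied. */
long long civil_to_seconds(int y, int mo, int d, int h, int mi, int s)
{
    return days_from_civil(y, mo, d) * 86400LL + h * 3600LL + mi * 60LL + s;
}

#ifdef CLOCK_STEP_MAIN   /* hypothetical switch: build with -DCLOCK_STEP_MAIN to get a program */
int main(void)
{
    /* Entry logged at 11/23/2011 21:59: "On-board clock set; was 11/24/2011 10:59:48" */
    long long was  = civil_to_seconds(2011, 11, 24, 10, 59, 48);
    long long set  = civil_to_seconds(2011, 11, 23, 21, 59, 0);  /* seconds assumed :00 */
    long long step = was - set;

    printf("clock stepped by %lld s (%.2f h)\n", step, step / 3600.0);
    return 0;
}
#endif
```

For the entry logged at 11/23/2011 21:59 the step comes out at 46848 seconds (13 h 0 min 48 s), and for the entry logged at 11/24/2011 11:03 it is 46759 seconds in the other direction. That is roughly 13 hours rather than 12, which might simply be daylight saving on top of GMT +12, but that part is a guess.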

We do not get anything like this on hardware that does not have those CPUs, but we get it on p-Class blades and DLs alike.
I have logged a call with HP, but they were not much help. While working on another problem on Windows, however, I came across the TSC drift issue on AMD CPUs.

I cannot post a URL, but if you google "AMD TSC drift Redhat", the second result will lead you to what I am talking about.

Now, before I attempt to apply any of these recommendations, I have been tasked with replicating the problem first in our DEV Oracle cluster (which has never experienced this issue), on the same type of hardware and OS.

I compiled a C program that constantly calls gettimeofday and ran it half a dozen times, specifying a different CPU each time, but I still could not replicate the problem.
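
A rough sketch of that kind of test is below, in case it helps to compare notes. It is not the program described above, only an illustration: it calls gettimeofday in a tight loop and counts every reading that is earlier than the one before it. The function names and the GETTIME_CHECK_MAIN switch are made up. One assumption worth stating: if the drift is between the TSCs of the two CPUs, a backward step would presumably only show up when consecutive readings come from different CPUs, so a run that stays on a single CPU might never see it.

```c
#include <stdio.h>
#include <sys/time.h>

/* Microseconds from reading a to reading b; negative if b is earlier than a. */
long long tv_diff_us(const struct timeval *a, const struct timeval *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000LL
         + (long long)(b->tv_usec - a->tv_usec);
}

/* Call gettimeofday() `iterations` times and count how often the clock
   appears to go backwards between two consecutive calls. */
long count_backward_steps(long iterations)
{
    struct timeval prev, now;
    long backward = 0;
    long i;

    gettimeofday(&prev, NULL);
    for (i = 0; i < iterations; i++) {
        gettimeofday(&now, NULL);
        long long delta = tv_diff_us(&prev, &now);
        if (delta < 0) {
            backward++;
            printf("backward step of %lld us at %ld.%06ld\n",
                   -delta, (long)now.tv_sec, (long)now.tv_usec);
        }
        prev = now;
    }
    return backward;
}

#ifdef GETTIME_CHECK_MAIN   /* hypothetical switch: build with -DGETTIME_CHECK_MAIN */
int main(void)
{
    /* The iteration count is arbitrary; a real soak test would run far longer. */
    long backward = count_backward_steps(10000000L);
    printf("%ld backward steps seen\n", backward);
    return 0;
}
#endif
```

Built with something like `gcc -DGETTIME_CHECK_MAIN -o gettime_check gettime_check.c` (the file name is hypothetical), it could be left running without any CPU pinning, so the scheduler is free to move it between the two processors.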

I have had many problems with the AMD quad- and six-core CPUs with Linux and databases, and I am sure that the AMD CPU factory is run by the devil. I also think that maybe I have not tested this as well as many of the much more gifted geeks out there would.

Has anybody got any ideas, had a similar issue, or dealt with the TSC time drift issue on RHEL 3? The kernel is:
2.4.21-27.ELsmp i686 athlon i386 GNU/Linux
