Challenge Accepted

No comments

Once in a while, a disaster happens and when it happens, the first question will be which servers were affected?

Unfortunately, the answer to this question is not accurate. You are probably thinking that I am wrong. Since the “Failed to connect to computer” alert will pop up when the agent will stop sending a heartbeat for 3 seconds (the default value) and no ping answer from the Management Server. This is a great monitor, however, the issue comes up when the agent has some kind of a problem. Let’s say the agent had a problem before the disaster such as service being shut down or data being corrupted or even somehow the agent is deleted from the server. In these scenarios, the “Failed to connect to computer” alert will not be pop up. I’m sure many of the SCOM administrators are familiar with the problem.

This problem motivated me to create a new management pack.

My solution is based on SQL query and ping check from a management server.

The following steps will explain to you how to solve the problem:

First, get all the unhealthy agents that are not in maintenance mode. The reason for this is that the unhealthy agents are the only agents that could be affected by the disaster. Second, check the ping status from agent’s management server (Note: The management server itself can be shut down, therefore, if necessary we will check the ping from the secondary management server and if needed from the RMS server). The output will be stored as an event in the Operations Manager database. The third step is to run a simple SQL query that will return a list of servers that have been affected as a result of the disaster (I added a sample query at the end of this page).

While I was working on this management pack I had another idea, which was to repair the unhealthy agent automatically. Let me explain since we are already collecting unhealthy agents that answer to ping, we can easily reach the conclusion, that the problem is not a server. Therefore, trying to repair the agent is necessary. So to repair the agent, all we will do is run a task of restarting the agent. For the agents that have a problem with collecting performance data, we will run the flush agent task.

Link to the SCOM Administration Add-Ons.

SQL Query Example:

DECLARE @StartTime DateTime

DECLARE @EndTime DateTime




SET @StartTime =DATEADD(MINUTE, ((DATEPART(MINUTE,  GETUTCDATE()) / 5) * 5)-30, DATEADD(HOUR, DATEDIFF(HOUR, 0, GETUTCDATE()), 0))

SET @EndTime =  DATEADD(MINUTE, ((DATEPART(MINUTE,  GETUTCDATE()) / 5) * 5), DATEADD(HOUR, DATEDIFF(HOUR, 0, GETUTCDATE()), 0))




SELECT

       v.TimeGenerated

       ,DATEADD(MINUTE, ((DATEPART(MINUTE, v.TimeGenerated) / 5) * 5), DATEADD(HOUR, DATEDIFF(HOUR, 0, v.TimeGenerated), 0)) AS TimeGeneratedFixed

       ,EventParametersXML

       ,y.Status

       ,y.StatusCode

       ,y.ResponseTime

       ,y.AgentServerName

       ,y.ManagementServerName

INTO #EventAllView

FROM EventAllView v

OUTER APPLY (SELECT

              CAST(v.EventParameters AS XML) AS EventParametersXML) x

OUTER APPLY (SELECT

              x.value('Param[1]', 'VARCHAR(80)') AS Status

              ,x.value('Param[2]', 'VARCHAR(80)') AS StatusCode

              ,x.value('Param[3]', 'VARCHAR(80)') AS ResponseTime

              ,x.value('Param[4]', 'VARCHAR(80)') AS AgentServerName

              ,x.value('Param[5]', 'VARCHAR(80)') AS ManagementServerName

       FROM x.EventParametersXML.nodes('/') AS NodeValues (x)) y

WHERE

PublisherName = 'ServerConnectivityCheck'

AND TimeGenerated >= @StartTime

AND DATEADD(MINUTE, DATEDIFF(MINUTE, 0, TimeGenerated), 0) <= @EndTime




;WITH TimesCTE ([Date])

AS

(

       SELECT

       @StartTime

       UNION ALL

       SELECT DATEADD(MINUTE, 5, [Date]) FROM TimesCTE WHERE [Date] < @EndTime

)

SELECT

res.AgentServerName,

res.[Date],

v.TimeGenerated,

v.Status,

v.StatusCode,

v.ResponseTime,

v.ManagementServerName into #tmpPivot

FROM

(

       SELECT

       srv.AgentServerName,

       c.[Date]

       FROM

       (

              SELECT DISTINCT AgentServerName FROM #EventAllView

       ) srv

       CROSS APPLY

       (

              SELECT [Date] FROM TimesCTE

       ) c   

) res

JOIN

#EventAllView v

       ON DATEADD(MINUTE, ((DATEPART(MINUTE, v.TimeGenerated) / 5) * 5), DATEADD(HOUR, DATEDIFF(HOUR, 0, v.TimeGenerated), 0)) = res.[Date]

       AND v.AgentServerName = res.AgentServerName

ORDER BY

res.AgentServerName,

res.[Date]







DECLARE @CMD VARCHAR(MAX)

SET @CMD = 'SELECT * FROM (SELECT AgentServerName, StatusCode, CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108) AS [Date] FROM #tmpPivot) AS SourceTable PIVOT (MAX(StatusCode) FOR [Date] IN ('




SET @CMD = @CMD + (

       SELECT

       LEFT(txt.grouped, LEN(txt.grouped) - 1)

       FROM

       (

              SELECT

              '[' + CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108) + '],' AS [text()]

              FROM

              #tmpPivot

              GROUP BY

              CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108)

                       ORDER BY

                       CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108)

              FOR XML PATH('')

       ) txt(grouped))




SET @CMD = @CMD + ')) AS PivotTable ORDER BY AgentServerName'




EXEC (@CMD)







DROP TABLE #EventAllView

DROP TABLE #tmpPivot

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s