Nine Steps to Troubleshooting SQL Server problems
DBAs often have a need to identify why a problem has occurred, or is occurring in their SQL Server database. This article covers some of the tools you can use to look for clues, and the steps you might go through to help troubleshoot a SQL Server problem.We all have problems that occur from time to time, where we need to work through some problem solving steps to identify why the problem occurred, or is occurring. In these situations, you need to act like a Crime Scene Investigator (CSI) to uncover the root cause of the problem. No, we do not put yellow caution tape around our servers, our network cables, and our desktop machines. However, sometimes we do unplug the network cable to prevent the machine from being further contaminated, in rare cases. Sometimes it is obvious what caused the problem, but not always. Regardless of the problem, you will need to do some forensic analysis to determine the cause of the problem. In this article, I will cover some of the tools you can use to look for clues, and the steps you might go through to help troubleshoot a problem.
Problem Solving ToolsThere are a number of different tools you can use for troubleshooting. I cannot cover, nor do I know of all the tools you could possibly use for troubleshooting, therefore I will cover the most common tools that are available within SQL Server and the Windows OS. In most cases, you should be able to find enough information using these tools to provide you with enough clues to determine the cause of a particular problem.
Here is a list of those commonly used tools:
- Event viewer
Notepad does not seem like much of a tool; however, it can be used to open different log files. Notepad allows you to do string searches within large log files to quickly locate information. . If your log files are too big you might have to use WordPad as an alternative.
Sometime when your system is having problems, event records will be written to the Windows event logs. You can browse through the events one at a time using the Event viewer. Events in the event log may provide a quick answer to why your SQL Server instance is not behaving as it should, provided there are event records associated with the problem you are trying to solve.
Information Gathering PhaseIn order to diagnose a problem you first need to gather some information about the problem. You also need to review log files to determine what kind of system error messages and log records exist that might help you to diagnose the problem. Below are a set of steps you should consider when going through the information gathering phase of your forensic analysis.
Step 1: Gathering the FactsThe first step in any problem solving exercise is to gather the facts. You need to know what kind of problem is happening. This is where you need to interview the customer, or programmer to understand how and when the problem occurs. You need to determine if it is system wide, or is it more localized to a particular application, or component of an application. You also need to know the timeframe around when the problem occurred, and whether or not it is still a problem. In addition to this, you need to know the last time the system was working correctly. You need to determine if any new system or application changes were introduced that might have caused the problem. Armed with some facts about the problem you can start to look for clues that might help identify the root cause of the problem.
Step 2: Test in Different environments and MachinesIt is worth testing in different environments, if you have them. This is a fact gathering exercise, but I spelled it out as a separate step because lots of times seasoned staff do not think about performing tests in separate environments when they gather facts.
You might find only one environment is affected, a set of environments or all environments. If only one environment is affected, the problem might be a configuration issue with that environment, or the other working environments. Alternatively, it might be the data in the environment that is causing the problem.
Additionally you might want to try different client machines, or application servers. Occasional, you might find that a different configuration or a set up is causing the application to work, or not work. You need to explore all the different setup and configuration options and then document those that work, and those that do not work.
Step 3: Review the SQL Server Error LogSQL Server creates a log file called "ERRORLOG". A new ERRORLOG file is created every time SQL Server starts up. SQL Server by default keeps six old errors log files, where each one has a sequential number associated. The ERRORLOG file by default is stored in the "Log" folder within the standard "…Program FilesMicrosoft SQL Server..." folder structure.
Find the log file that is associated with the timeframe for when the problem first occurred. Look to see if there are any anomalies in the messages being outputted by SQL Server. Sometime if SQL Server detects a change, or encounters a problem, it will be logged in the ERRORLOG file.
Step 4: Review the Event LogYou should use the Event Viewer to look at the different event log records. The event log contains both informational warnings and error events. You should look at all the events that occurred shortly before, during and after the timeframe of the identified problem. You need to make sure you review both the "Application" and "System" events, as well as the "Security" events.
Step 5: Review the Default TraceThe default trace, as stated earlier, is a trace that SQL Server starts automatically when it starts up, if the default trace option is enabled. It is similar to the flight recorder in a modern jet. This trace captures all configuration changes to an instance. By reviewing the default trace information you can identify what kind of database changes might have been made during the period of time when the problem was identified.
The default trace files are stored in the same log folder as the ERRORLOG. They are named like "log_xxx.trc", where xxx is a sequential number. You can open these files with profiler to see the recorded events. Alternatively, you can use the "fn_trace_gettable" function to process the file using T-SQL, like so:
SELECT * FROM fn_trace_gettable ('C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLLOGlog_155.trc', default);
Step 6: Review the Change LogReview your organization's change log. I hope your organization has one. A change log is some centralized location that identifies all changes the have been introduced. If your organization has one, this helps identify any changes that have recently occurred. This log might provide you with some clues as to why a particular problem is occurring, especially if the application that is having the problem is the one that has been modified recently. If your organization does not have a change log, then during step 1 you might ask a programmer when the last application change was made.
Analysis PhaseNow that you have gathered some information, you need to analyze the data that you have gathered. Review the information collected in each step. Look for anomalies that would support the problem identified by the customer or programmer.
Take the situation identified in step 1 above and try to determine how each log or the trace file might help you identify why the problem is occurring. Review the information available in each step to see if there are any clues that will allow you to gaining a better understanding of what is causing the problem.
After you have done this analysis, you might be lucky and identify the cause of the problem. However, there will be times when the steps above do not yield a solution to the problem. In this case, you will need to move on and do some additional testing and information gathering.