| Author |
Message |
Scott Schnoll [MSFT]
Guest
|
Posted:
Mon Dec 13, 2004 10:42 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Delayed failover when pulling the public NIC is a known issue (and actually
by design, although not intentionally so). The issue lies within DSAccess,
and we are examining what can be done to shorten failover times in cases
where the public NIC is lost.
As for your storage solution, have you reviewed our Optimizing Exchange
Server 2003 Storage guide
(http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/optimizestorage.mspx)?
It may be able to help here with some planning and design issues.
I've never heard of RAID 4. How are the disks laid out in that
configuration? Also, have you verified that your entire cluster is in the
Windows Server Catalog (formerly the HCL)?
--
Scott Schnoll
This posting is provided "AS IS" with no warranties, and confers no
rights. Please do not send email directly to this alias. This alias is for
newsgroup
purposes only.
"Michael Mahar" <mmahar@fireflyconsulting.net> wrote in message
news:PKWdnfnAIbMqoSDcRVn-oQ@giganews.com...
| Quote: | Scott-
This is correct. We are seeing recovery times of two to four minutes
on an unloaded server with no user store. Diskpar was not used to align
the partitions. I'm not quite sure that Diskpar would be of value in this
instance, as the storage vendor has elected to place all volumes into one
RAID 4 group and partition LUNs out from there. Once we resolve all the
functional issues surrounding the cluster and achieve the impossible (see
previous post to Rodney regarding ten second failover), then we will dive
into capability and performance.
I'm pretty platform agnostic when it comes to storage, though this is
my first implementation with Network Appliance as a SAN. I have experience
with many vendors, including HDS, EMC, IBM, and HP. We have some concerns
around NetApp's ability to support the Exchange cluster from a SAN point
of view. We are anticipating a 2.1 TB total mail store with IOPS
consumption between 2200 and 3250, and to achieve those
metrics they are throwing 24 TB worth of disk at it. Couple this with the
mandatory RAID 4 and the disk block level emulation, and there is a lot of
uncertainty that needs to be validated in test.
M.
Scott Schnoll [MSFT] wrote:
So getting back to the original issue, is it fair to say that you are
only seeing extended failover times in tests where the public NIC is
unplugged?
As an aside, did you use Diskpar on your storage volumes? That can boost
performance as much as 15-20% (mostly on the disk containing the log
files). |
|
|
|
 |
Michael Mahar
Guest
|
Posted:
Tue Dec 14, 2004 4:50 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Scott-
Is the delayed failover a set time, or is it a variable setting?
Any idea what this might account for in my failover time? I've been
using the Optimizing Exchange Storage guide extensively, but I'm not
sure how much of it we can take advantage of with regards to storage
layout and configuration.
RAID 4 is striping across a large set of disks, with a dedicated
disk for parity. It's optimized for reads, and has one of the worst
performance metrics for writes. The Storage Vendor has elected to use all
the drives on the SAN as part of one giant disk group, and delegate the
LUNs out. I'm not terribly optimistic about the performance
characteristics of this configuration, but it's out of my hands.
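The read/write asymmetry of dedicated-parity striping can be illustrated with a minimal sketch. It models textbook RAID 4 only, under the assumption of a plain read-modify-write small-write path; a vendor's implementation may batch or reorder writes and behave differently. The function names are hypothetical and the block contents are arbitrary sample bytes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def small_write(data_disks, parity, index, new_block):
    """Update one data block the textbook RAID 4 way.

    Returns (new_parity, io_count). Every small write touches the single
    parity disk, which is why that disk becomes the write bottleneck.
    """
    old_block = data_disks[index]                             # I/O 1: read old data
    new_parity = xor_blocks([parity, old_block, new_block])   # I/O 2: read old parity
    data_disks[index] = new_block                             # I/O 3: write new data
    return new_parity, 4                                      # I/O 4: write new parity


# Three data disks plus one dedicated parity disk (sample bytes only).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)

# A one-block update costs four disk I/Os, two of them on the parity disk.
parity, io_count = small_write(data, parity, 1, b"\xaa\xbb")

# Any single lost block can be rebuilt from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Reads need no parity access at all, while every small write in this model queues behind the one parity disk.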
I believe that only one solution from NetApp is on the HCL for
clusters, and it is completely different from what we have to work with
here: an IBM server against a different filer, instead of a Dell server
through a fibre switch.
A final data point: I talked with someone who did extensive
testing with a similar configuration and found that he was able to get
7000 concurrent connections to fail over in four minutes.
M.
|
|
|
 |
Scott Schnoll [MSFT]
Guest
|
Posted:
Tue Dec 14, 2004 5:44 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
This is starting to sound like an unsupported configuration. Have you
reviewed http://support.microsoft.com/kb/810986?
It's not really a set time. If you review your cluster.log file, you'll see
exactly what is happening. Look for things like "[EXRES] Terminate
requested" (without the quotes) and you'll see where each component is
timing out.
When DSAccess determines that no viable DCs are available (as in the case
when the public NIC is lost), it retries all active searches for 2½ minutes
(just in case it is a temporary network glitch). If there are many sequential
searches, it obviously takes longer.
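A minimal sketch of the cluster.log search described above. It only filters lines containing the `[EXRES]` marker and reports their line numbers; the function name is hypothetical, and the assumption is that each log entry occupies a single line in the file on disk.

```python
def exres_lines(lines, marker="[EXRES]"):
    """Yield (line_number, text) for every log line containing the marker."""
    for number, line in enumerate(lines, start=1):
        if marker in line:
            yield number, line.rstrip("\n")


def print_exres(path):
    """Print every [EXRES] entry in the log file at the given path."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as log:
        for number, text in exres_lines(log):
            print(f"{number}: {text}")
```

Calling `print_exres` with the path to a copy of cluster.log lists the Exchange resource DLL entries in order, so the gap between a terminate request and the next entry shows which component is timing out.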
--
Scott Schnoll
|
|
|
 |
Michael Mahar
Guest
|
Posted:
Wed Dec 15, 2004 11:43 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Scott-
My configuration is most assuredly not a supported configuration,
but it is what my customer would like to see for an end deployment. I
explained about Microsoft support and everything, and still they went
this route.
I've spent the better part of the day looking through my
cluster.log, and I did not see anything to indicate a network timeout
issue causing my delay. What I did see is below:
00000960.000002a8::2004/12/15-01:57:13.379 INFO [MM] MmSetQuorumOwner(0,1), old owner 2.
00000960.00000ec8::2004/12/15-01:57:13.863 INFO [FM] FmpCompleteMoveGroup: Completing the move for group MSDTC to node 1 (1)
00000960.00000ec8::2004/12/15-01:57:13.863 INFO [FM] FmpOfflineResource: Offline resource <MSDTC Disk> returned pending
00000960.00000ec8::2004/12/15-01:57:13.863 INFO [FM] FmpCompleteMoveGroup: Exit, status = 997
This repeats every 500 milliseconds until:
00000960.00000ec8::2004/12/15-01:58:12.861 INFO [FM] FmpCompleteMoveGroup: Completing the move for group MSDTC to node 1 (1)
00000960.00000ec8::2004/12/15-01:58:12.861 INFO [FM] FmpOfflineResource: Offline resource <MSDTC Disk> returned pending
00000960.00000ec8::2004/12/15-01:58:12.861 INFO [FM] FmpCompleteMoveGroup: Exit, status = 997
00000114.00000af0::2004/12/15-01:58:12.892 ERR Microsoft Exchange Information Store <Exchange Information Store Instance (PTSJ-EXCHVS1)>: [EXRES] EventLogging: Exchange Information Store Instance (PTSJ-EXCHVS1): Failed to terminate this resource because of timeout. Error Code: 1460.
Group MSDTC contains a network name, IP address, disk resource, and the
MSDTC service. I see a similar occurrence on the other node, matching time
and activity. Any ideas as to what normally should be happening to avoid
this minute of confusion? Thanks much.
M.
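As a rough check on the "minute of confusion", the two timestamps bracketing the repeated pending entries in the excerpt above can be subtracted. This is a minimal sketch that assumes the timestamp sits between `::` and the first space of each entry, as it does in the quoted lines; the helper name is hypothetical. For reference, status 997 is the Win32 code ERROR_IO_PENDING and 1460 is ERROR_TIMEOUT.

```python
from datetime import datetime

STAMP_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def stamp(entry):
    """Parse the timestamp between '::' and the first space of a log entry."""
    return datetime.strptime(entry.split("::", 1)[1].split(" ", 1)[0], STAMP_FORMAT)


# First and last "returned pending" entries, copied from the excerpt.
first_pending = ("00000960.00000ec8::2004/12/15-01:57:13.863 INFO [FM] "
                 "FmpOfflineResource: Offline resource <MSDTC Disk> returned pending")
last_pending = ("00000960.00000ec8::2004/12/15-01:58:12.861 INFO [FM] "
                "FmpOfflineResource: Offline resource <MSDTC Disk> returned pending")

stall = (stamp(last_pending) - stamp(first_pending)).total_seconds()
print(f"MSDTC Disk offline stayed pending for {stall:.3f} s")
# prints "MSDTC Disk offline stayed pending for 58.998 s"
```

The disk resource in the MSDTC group stays in the offline-pending state for just under 59 seconds, and the Information Store timeout error is logged about 30 milliseconds after the last pending entry.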
|
|
|
 |
Scott Schnoll [MSFT]
Guest
|
Posted:
Wed Dec 15, 2004 12:49 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Do you have the DTC resource in a dedicated group, or in the default cluster
group? If your cluster is dedicated to Exchange we recommend leaving the
DTC resource in the cluster group. It's only used by Exchange Setup and
Exchange Service Pack Setup. Do you see the same behavior after putting the
DTC in the cluster group?
--
Scott Schnoll
|
|
|
 |
Scott Schnoll [MSFT]
Guest
|
Posted:
Wed Dec 15, 2004 12:51 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
As an aside, the primary reason for clustering is high availability.
Deploying an unsupported cluster configuration is contrary to this goal.
--
Scott Schnoll
|
|
|
 |
Michael Mahar
Guest
|
Posted:
Wed Dec 15, 2004 1:55 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Currently, the DTC resource is in a dedicated group, but I'll reconfigure
the cluster to make it part of the primary cluster group. I'll take
metrics and let you know how it fares.
M.
|
|
|
 |
Michael Mahar
Guest
|
Posted:
Wed Dec 15, 2004 1:59 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Scott-
You are absolutely correct with respect to this. The customer has
made a substantial investment in Network Appliance kit, and is committed
to utilizing that investment moving forward. We have actually begun the
process of seeing what it would take to get our vendor to put their
current configuration through WHQL certification for a cluster solution.
There are many advantages to this, most importantly the assurance by
Microsoft that their software and applications are guaranteed to
function appropriately. The storage solution for this deployment is
still up in the air as long as these types of issues remain unresolved.
Thanks much for the help with this, I truly appreciate it.
M.
|
|