| Author |
Message |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 3:07 am Post subject:
Clustered Exchange Failover Consistancy |
|
|
Greetings all-
I am configuring an Exchange 2003 SP1 Active/Passive cluster on
Windows Server 2003, on a pair of Dell 6650 (4 way, 4GB) machines
through a Brocade 3900 to a Network Appliance FAS 940 via Fibre Channel.
The Network Appliance box is acting as a SAN, presenting disk at the block
level. This system is only in test and we are seeing inconsistent, or
what appear to us to be inconsistent, failover times on removal of the
active node's public network cable. The failover times range from 1:59 to
3:31, with no real rhyme or reason as to why one takes any given length
of time. I am looking for more of a sanity check that this range
is an acceptable time frame for a destructive failover. Thanks much.
M.
 |
Rodney R. Fournier [MVP]
Guest
|
Posted:
Mon Dec 13, 2004 3:28 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
How big are the stores? Information? Public? How many Information Stores do
you have?
Cheers,
Rod
MVP - Windows Server - Clustering
http://www.nw-america.com - Clustering
http://www.msmvps.com/clustering - Blog
 |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 3:46 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Rodney-
Hehe, sorry about the lack of detail. This is a completely
pristine install, so the Information Stores would be the default size
that comes after a clean install. What's that, maybe a couple of MB
each? There should be one mail store and one public folder store, both in
the First Administrative Group. No users, or anything. We are
attempting to benchmark the system for both failover and storage
performance so we have control values when we move into pilot and
production.
Here's some of the data, to this point. My apologies for the poor
formatting:
Failover Direction   Time (seconds)   Time (hh:mm:ss)
1 -> 2               262              0:04:22
2 -> 1               214              0:03:34
1 -> 2               131              0:02:11
2 -> 1               119              0:01:59
1 -> 2               214              0:03:34
2 -> 1               195              0:03:15
1 -> 2               137              0:02:17
2 -> 1               136              0:02:16
1 -> 2               210              0:03:30
2 -> 1               212              0:03:32
1 -> 2               151              0:02:31
2 -> 1               211              0:03:31
1 -> 2               180              0:03:00
2 -> 1               133              0:02:13
1 -> 2               176              0:02:56
2 -> 1               135              0:02:15
Thanks much.
M.
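For anyone who wants to put numbers on the spread in the table above, here is a minimal Python sketch. It is not from the original poster: the function names `summarize` and `by_direction` are made up for illustration, and "half-range as a percentage of the mean" is just one assumed way of expressing how scattered the runs are.

```python
# Failover timings (direction, seconds) copied from the table above.
RUNS = [
    ("1->2", 262), ("2->1", 214), ("1->2", 131), ("2->1", 119),
    ("1->2", 214), ("2->1", 195), ("1->2", 137), ("2->1", 136),
    ("1->2", 210), ("2->1", 212), ("1->2", 151), ("2->1", 211),
    ("1->2", 180), ("2->1", 133), ("1->2", 176), ("2->1", 135),
]


def summarize(seconds):
    """Return count, min, max, mean, and half-range as a percentage of the mean."""
    avg = sum(seconds) / len(seconds)
    half_range = (max(seconds) - min(seconds)) / 2
    return {
        "n": len(seconds),
        "min": min(seconds),
        "max": max(seconds),
        "mean": avg,
        "half_range_pct": 100 * half_range / avg,
    }


def by_direction(runs):
    """Group the timings by failover direction (hypothetical helper)."""
    groups = {}
    for direction, secs in runs:
        groups.setdefault(direction, []).append(secs)
    return groups


# Overall summary first, then one line per direction.
print("all runs:", summarize([secs for _, secs in RUNS]))
for direction, secs in by_direction(RUNS).items():
    print(direction, summarize(secs))
```

Across all sixteen runs the mean works out to 176 seconds, with individual runs anywhere between 119 and 262 seconds, so the scatter is large in both directions rather than tied to one node.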
 |
Rodney R. Fournier [MVP]
Guest
|
Posted:
Mon Dec 13, 2004 6:33 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Those times are slow. They could just be what your hardware is capable
of, though I seriously doubt it.
Check your heartbeat config - http://support.microsoft.com/?id=258750 -
most importantly, 10 Mbps/half duplex.
Also check the NAS drive firmware, Brocade firmware, HBAs, etc.
Something is amiss; I would double-check all drivers too.
It could just be that the NAS is slow.
Rod
 |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 6:53 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Rod-
I agree with you that this smells like a storage-related issue,
but I'm not able to prove it conclusively. If we remove the EVS (Exchange
Virtual Server) from the cluster configuration, we see Cluster Group, MSDTC,
and the Exchange virtual IP/Name/Storage failing over in roughly thirty
seconds or less (19 to 31). This would seem to rule the storage out as an
issue. Adding Exchange to the mix seems to cause all the inconsistency on
failover, but also muddies the path as far as solutions go.
This whole exercise is purely proof of concept now, since we've
been having tremendous issues getting the cluster rolling on the Network
Appliance SAN kit. We are going straight to the filer via Fibre Channel,
which is presenting us the LUNs, so it's not a NAS presentation, but a
NAS presenting itself as a block-level disk device, like a SAN. Perhaps
we will look at a CX300 or a 9500V and see what kind of failover times
we get with that. Any ideas on what I should expect to see?
Thanks much, once again.
M.
 |
Scott Schnoll [MSFT]
Guest
|
Posted:
Mon Dec 13, 2004 7:31 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Is this the only failover test you are performing? If you just kill the
power on one node instead of pulling the public NIC, do you get the expected
faster failover?
--
Scott Schnoll
This posting is provided "AS IS" with no warranties, and confers no rights.
Please do not send email directly to this alias. This alias is for newsgroup
purposes only.
 |
Rodney R. Fournier [MVP]
Guest
|
Posted:
Mon Dec 13, 2004 7:46 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Taking Exchange out of the mix and failing over just the Cluster Group and
MSDTC does not prove the NAS is the problem. It could be your Exchange
install, though I have no idea what.
For reference: Exchange with no data on a SAN - 10-20 second forced failover.
Keep digging.
Rod
 |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 8:27 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Status from the trenches:
I am working on gathering the data per Scott's recommendation,
doing the "fun" tests. Everything on node A, move groups A->B, then
B->A, and yank the power cord on A. This test, in either direction, takes
approximately fifty-one seconds to complete, plus or minus a single
second, and is highly repeatable. I do the same move-group routine to ensure
that each node is capable of handling it after the hard reboot.
Next I'll do the multipathed HBAs on each system and see where
that gets me. If I get a similar time with an Exchange storage
disconnect (not the O/S, mind you) then I am thinking that drive
arbitration might be something to consider. If there are other failover
tests that I should be looking at or that you'd just like to see me run
for grins, please feel free to let me know. I'll report back any and all
data that people are interested in.
On a personal note, Rodney, I am now irked that you said 10 to 20
seconds on forced failover for Exchange against a SAN, as I now have a
target to hit. :) More on this story as it develops, and once again,
thanks for the extra eyes and help.
M.
 |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 9:28 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Replying to myself isn't really indicative of good solid mental health,
is it?
The next series of tests used HBA failures on the active node, and the
length of time to fail over was recorded. Recovery times were pretty
consistent at sixty-four seconds, plus or minus four seconds. I tend to
believe that plus or minus ten percent is indicative of a repeatable event.
All of this leads me to believe that something is going on with the
arbitration of the drives on non-storage destructive failover, which is
why destructive storage and catastrophic failovers are accomplished so
quickly. It's not 10 to 20 seconds, mind you, but that's a target to
shoot for once I've figured out my delay. Am I missing anything here in
my thoughts?
M.
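The plus-or-minus-ten-percent rule of thumb above is easy to check against the figures reported in this thread. The following is a rough Python sketch only: `relative_spread` and `is_repeatable` are hypothetical names, and treating the public NIC pull results as a midpoint plus or minus half of their 119 to 262 second range is an assumption made here so that all three tests can be compared on the same footing.

```python
def relative_spread(center, half_width):
    """Half-width of the observed band as a fraction of its center."""
    return half_width / center


def is_repeatable(center, half_width, threshold=0.10):
    """Apply the +/-10% rule of thumb (threshold is adjustable)."""
    return relative_spread(center, half_width) <= threshold


# (center seconds, +/- seconds) for each test type reported in the thread.
SCENARIOS = {
    "power cord pull": (51, 1),
    "HBA failure": (64, 4),
    # Assumption: midpoint and half-range of the 119..262 s NIC pull runs.
    "public NIC pull": ((119 + 262) / 2, (262 - 119) / 2),
}

for name, (center, half_width) in SCENARIOS.items():
    verdict = "repeatable" if is_repeatable(center, half_width) else "not repeatable"
    print(f"{name}: +/-{relative_spread(center, half_width):.1%} -> {verdict}")
```

By this yardstick the power pull and HBA failure tests both sit inside the ten percent band, while the public NIC pull does not, which matches the impression that only the non-storage failover is erratic.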
 |
Rodney R. Fournier [MVP]
Guest
|
Posted:
Mon Dec 13, 2004 10:12 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
I think sometimes I am the only one that will listen to me ;)
Right, you are getting there. 10-20 was on the high end, buddy. On our SAN,
Exchange fails over in under 10 seconds (on a move group), usually about 6
seconds. Our 300 GB of SQL databases take 9 seconds to move. And our servers
are only dual procs :)
Keep on keeping on! Good luck!
Rod
"Michael Mahar" <mmahar@fireflyconsulting.net> wrote in message
news:o72dnUkOZKWhkSDcRVn-ug@giganews.com...
| Quote: | Replying to myself isn't really indicative of good solid mental health, is
it?
Next series of tests used HBA failures on the active node, and the length
of time to failover was recorded. Recovery times were pretty consistant at
sixty four seconds, plus or minus four seconds. I tend to believe that
plus or minus ten percent is indicative of a repeatable event.
All of this leads me to believe that something is going on with the
arbitration of the drives on non-storage destructive failover, which is
why destructive storage and catastrophic failovers are accomplished so
expediently. It's not 10 to 20 seconds, mind you, but that's a target to
shoot for once I've figured out my delay. Am I missing anything here in my
thoughts?
M.
Michael Mahar wrote:
Status from the trenches:
I am working on gathering the data per Scott's recommendation, doing
the "fun" tests. Everything on node A, move groups A->B, then B->A, and
yank the power cord on A. This test, in either direction takes
approximately fifty one seconds to complete, plus or minus a single
second, highly repeatable. I do the same move group deal, to ensure that
each node is capable of handling it after the hard reboot.
Next I'll do the multipathed HBAs on each system and see where that
gets me. If I get a similar time with an Exchange storage disconnect (not
the O/S, mind you) then I am thinking that drive arbitration might be
something to consider. If there are other failover tests that I should be
looking at or that you'd just like to see me run for grins, please feel
free to let me know. I'll report back any and all data that people are
interested in.
On a personal note, Rodney, I am now irked that you said 10 to 20
seconds on forced failover for Exchange against a SAN, as I now have a
target to hit. :) More on this story as it develops, and once again,
thanks for the extra eyes and help.
M.
Rodney R. Fournier [MVP] wrote:
Taking Exchange out of the mix and failing just the Cluster and MSDTC
does not prove the NAS is the problem. It could be your Exchange
install, though I have no idea what.
Exchange with no data on a SAN - 10-20 second forced failover.
Keep digging.
Rod
"Michael Mahar" <mmahar@fireflyconsulting.net> wrote in message
news:J6adncDlmv2XdSHcRVn-rQ@giganews.com...
Rod-
I agree with you that this smells like a storage related issue, but
I'm not able to prove it conclusively. If we remove the EVS from the
cluster configuration, we see Cluster Group, MSDTC, and the Exchange
virtual IP/Name/Storage failing over in nearly thirty seconds (19 to
31). This would seem to rule the storage out as an issue. Adding
Exchange to the mix seems to cause all the inconsistancy on failover,
but also muddies the path as far as solutions go.
This whole exercise is purely proof of concept now, since we've
been having tremendous issues getting the cluster rolling on the
Network Appliance SAN kit. We are going straight to the filer via Fibre
Channel, which is presenting us the LUNS, so it's not a NAS
presentation, but a NAS presenting itself as a disk block level device,
like a SAN. Perhaps we will look at a CX300 or a 9500V and see what
kind of fail over times we are getting with that. Any ideas on what I
should expect to see? Thanks much, once again.
M.
Rodney R. Fournier [MVP] wrote:
Those times are slow, but they could just be what your hardware is
capable of, though I seriously doubt it.
Check your config on the heartbeat -
http://support.microsoft.com/?id=258750 most important 10 MB/Half.
NAS drives firmware, Brocade firmware, HBA's, etc.
Something is a miss, I would double check all drivers too.
It could just be the NAS is slow.
Rod
"Michael Mahar" <mmahar@fireflyconsulting.net> wrote in message
news:LJOdnTRe0o2PISHcRVn-qA@giganews.com...
Rodney-
Hehe, sorry about the lack of detail. This is a completely
pristine install, so the Information Stores would be the default size
that comes after a clean install. What's that, maybe a couple of MB
each? There should be on mail store and one public folder store, both
in the First Administrative groups. No users, or anything. We are
attempting to benchmark the system for both failover and storage
performance so we have control values when we move into pilot and
production.
Here's some of the data, to this point. My apologies for the poor
formatting:
Failover Direction   Time (seconds)   Time (hh:mm:ss)
1 -> 2               262              0:04:22
2 -> 1               214              0:03:34
1 -> 2               131              0:02:11
2 -> 1               119              0:01:59
1 -> 2               214              0:03:34
2 -> 1               195              0:03:15
1 -> 2               137              0:02:17
2 -> 1               136              0:02:16
1 -> 2               210              0:03:30
2 -> 1               212              0:03:32
1 -> 2               151              0:02:31
2 -> 1               211              0:03:31
1 -> 2               180              0:03:00
2 -> 1               133              0:02:13
1 -> 2               176              0:02:56
2 -> 1               135              0:02:15
Thanks much.
M.
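For anyone who wants to play with these numbers, here is a minimal Python sketch (not part of the original test procedure) that re-derives the hh:mm:ss column from the seconds column and summarizes the runs overall and per direction. The names RUNS, to_hms, and summarize are hypothetical and exist only for this illustration.

```python
from statistics import mean

# (direction, seconds) pairs copied from the failover table in the post
RUNS = [
    ("1 -> 2", 262), ("2 -> 1", 214), ("1 -> 2", 131), ("2 -> 1", 119),
    ("1 -> 2", 214), ("2 -> 1", 195), ("1 -> 2", 137), ("2 -> 1", 136),
    ("1 -> 2", 210), ("2 -> 1", 212), ("1 -> 2", 151), ("2 -> 1", 211),
    ("1 -> 2", 180), ("2 -> 1", 133), ("1 -> 2", 176), ("2 -> 1", 135),
]


def to_hms(seconds):
    """Format a whole number of seconds as h:mm:ss."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def summarize(runs):
    """Return count, min, max, and mean (in seconds) for a list of runs."""
    times = [seconds for _, seconds in runs]
    return {
        "count": len(times),
        "min": min(times),
        "max": max(times),
        "mean": mean(times),
    }


if __name__ == "__main__":
    print("all runs:", summarize(RUNS))
    for direction in ("1 -> 2", "2 -> 1"):
        subset = [run for run in RUNS if run[0] == direction]
        stats = summarize(subset)
        print(direction, stats, "slowest:", to_hms(stats["max"]))
```

Splitting by direction is just one way to slice it; it is a quick check on whether one node is consistently slower to pick up the group than the other.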
Rodney R. Fournier [MVP] wrote:
How big are the stores? Information? Public? How many Information
Stores do you have?
Cheers,
Rod
MVP - Windows Server - Clustering
http://www.nw-america.com - Clustering
http://www.msmvps.com/clustering - Blog
"Michael Mahar" <mmahar@fireflyconsulting.net> wrote in message
news:uMqdnfC7ociBLiHcRVn-pQ@giganews.com...
Greetings all-
I am configuring an Exchange 2003 SP1 Active/Passive cluster on
Windows Server 2003, on a pair of Dell 6650 (4 way, 4GB) machines
through a Brocade 3900 to a Network Appliance FAS 940 via Fibre
Channel. The Network Appliance box is acting as a SAN, presenting
disk block level. This system is only in test and we are seeing
inconsistent, or what appear to us to be inconsistent, failover
times on removal of the active node's public network cable. The
failover times range from 1:59 to 3:31, with no real rhyme or
reason as to why one takes any given length of time. I am looking
for more of a sanity check, that this given range is an acceptable
time frame for a destructive failover. Thanks much.
M.
|
|
|
| Back to top |
|
 |
Scott Schnoll [MSFT]
Guest
|
Posted:
Mon Dec 13, 2004 10:12 am Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
So getting back to the original issue, is it fair to say that you are only
seeing extended failover times in tests where the public NIC is unplugged?
As an aside, did you use Diskpar on your storage volumes? That can boost
performance as much as 15-20% (mostly on the disk containing the log files).
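As a rough illustration of what Diskpar alignment is about, the Python sketch below checks whether a partition's starting offset falls on a boundary of the storage's allocation unit. It assumes 512-byte sectors and the 63-sector starting offset that Windows Server 2003 typically uses by default on MBR disks. The 64 KB unit is only a placeholder, so ask the storage vendor for the real value. The function names are hypothetical.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors


def offset_bytes(start_sector):
    """Byte offset of a partition that starts at the given sector."""
    return start_sector * SECTOR_BYTES


def is_aligned(start_sector, unit_bytes):
    """True when the partition start falls exactly on a unit boundary."""
    return offset_bytes(start_sector) % unit_bytes == 0


if __name__ == "__main__":
    unit = 64 * 1024  # placeholder stripe/allocation unit, not a vendor figure
    for start in (63, 64, 128):
        print(start, offset_bytes(start), is_aligned(start, unit))
```

With a misaligned start, a single I/O can straddle two units on the back end, which is the extra work that alignment is meant to avoid.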
--
Scott Schnoll
This posting is provided "AS IS" with no warranties, and confers no
rights. Please do not send email directly to this alias. This alias is for
newsgroup purposes only.
"Michael Mahar" <mmahar@fireflyconsulting.net> wrote in message
news:o72dnUkOZKWhkSDcRVn-ug@giganews.com...
| Quote: | Replying to myself isn't really indicative of good solid mental health, is
it?
The next series of tests used HBA failures on the active node, and the time
to fail over was recorded. Recovery times were pretty consistent at
sixty-four seconds, plus or minus four seconds. I tend to believe that
plus or minus ten percent is indicative of a repeatable event.
All of this leads me to believe that something is going on with the
arbitration of the drives on non-storage destructive failover, which is
why destructive storage and catastrophic failovers are accomplished so
expediently. It's not 10 to 20 seconds, mind you, but that's a target to
shoot for once I've figured out my delay. Am I missing anything here in my
thoughts?
M.
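The "plus or minus ten percent" rule of thumb can be written down as a quick check. This is a minimal sketch with hypothetical function names. It treats each test as its fastest and slowest run and measures the half-range against the midpoint. The HBA and power-cord figures come from the posts, and the public NIC figures are the fastest and slowest runs from the failover table posted earlier in the thread.

```python
def relative_spread(times):
    """Half the min-to-max range divided by the midpoint, as a fraction."""
    low, high = min(times), max(times)
    midpoint = (low + high) / 2
    return (high - low) / 2 / midpoint


def is_repeatable(times, tolerance=0.10):
    """Apply the plus-or-minus ten percent rule of thumb."""
    return relative_spread(times) <= tolerance


if __name__ == "__main__":
    tests = {
        "HBA failure (64 s +/- 4 s)": [60, 68],
        "power cord (51 s +/- 1 s)": [50, 52],
        "public NIC (119 s to 262 s)": [119, 262],
    }
    for name, times in tests.items():
        spread = relative_spread(times)
        print(f"{name}: {spread:.1%} repeatable={is_repeatable(times)}")
```

By this measure the HBA and power-cord tests sit comfortably inside the tolerance, while the public NIC test is far outside it, which matches the impression that only that scenario is erratic.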
|
|
|
| Back to top |
|
 |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 12:45 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Rodney-
I'm sorry, what'd you say? On a serious note, can I ask what
storage vendor you are using to get this turnaround? Thanks much.
M.
Rodney R. Fournier [MVP] wrote:
| Quote: | I think sometimes I am the only one that will listen to me ;)
Right, you are getting there. 10-20 was on the high end, buddy. On our SAN,
Exchange fails over in under 10 seconds (on a move group), usually about 6
seconds. Our 300 GB of SQL databases take 9 seconds to move. And our servers
are only dual procs :)
Keep on keeping on! Good luck!
Rod
|
|
|
| Back to top |
|
 |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 12:54 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Scott-
This is correct. We are seeing recovery times of two to four
minutes on an unloaded server with no user store. Diskpar was not used
to align the partitions. I'm not quite sure that Diskpar would be of
value in this instance, as the storage vendor has elected to place all
volumes into one RAID 4 group and partition LUNs out from there. Once we
resolve all the functional issues surrounding the cluster and achieve
the impossible (see previous post to Rodney regarding ten second
failover), then we will dive into capability and performance.
I'm pretty platform agnostic when it comes to storage, though this
is my first implementation with Network Appliance as a SAN. I have
experience with many vendors, including HDS, EMC, IBM, and HP. We have
some concerns around NetApp's ability to support the Exchange cluster
from a SAN point of view. We are anticipating a 2.1TB total mail store
with IOPS consumption somewhere between 2200 and 3250, and to achieve
those metrics they are throwing 24 TB worth of disk at it. Couple
this with the mandatory RAID 4 and the disk block level emulation, and
there is a lot of uncertainty that needs to be validated in test.
M.
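To put the sizing figures above in perspective, here is a small Python sketch of the arithmetic. It assumes decimal units (1 TB = 1000 GB) and uses hypothetical function names. It only restates the numbers from the post and says nothing about how the vendor actually arrived at 24 TB.

```python
MAIL_STORE_TB = 2.1            # anticipated total mail store, from the post
RAW_DISK_TB = 24               # disk the vendor proposes, from the post
IOPS_LOW, IOPS_HIGH = 2200, 3250


def iops_per_gb(iops, capacity_tb, gb_per_tb=1000):
    """I/O density: IOPS per GB of mail store (decimal units assumed)."""
    return iops / (capacity_tb * gb_per_tb)


def disk_to_data_ratio(raw_tb, data_tb):
    """How many TB of raw disk are provisioned per TB of data."""
    return raw_tb / data_tb


if __name__ == "__main__":
    print(f"low:  {iops_per_gb(IOPS_LOW, MAIL_STORE_TB):.2f} IOPS/GB")
    print(f"high: {iops_per_gb(IOPS_HIGH, MAIL_STORE_TB):.2f} IOPS/GB")
    print(f"ratio: {disk_to_data_ratio(RAW_DISK_TB, MAIL_STORE_TB):.1f}x")
```

A ratio this large suggests the disk count is being driven by the IOPS target rather than by capacity, which is one more reason to validate the design in test.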
Scott Schnoll [MSFT] wrote:
| Quote: | So getting back to the original issue, is it fair to say that you are only
seeing extended failover times in tests where the public NIC is unplugged?
As an aside, did you use Diskpar on your storage volumes? That can boost
performance as much as 15-20% (mostly on the disk containing the log files). |
|
|
| Back to top |
|
 |
Michael Mahar
Guest
|
Posted:
Mon Dec 13, 2004 12:57 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Something just occurred to me: the Network Appliance software that
handles the configuration and availability of the LUNs is actually
running on each server. I wonder if that software has some influence
on the arbitration process, so that when the public interface is down
it can't contact the filer via TCP/IP, and that is where the delays
come from. Something to ask about.
M.
Michael Mahar wrote:
| Quote: | Scott-
This is correct. We are seeing recovery times of two to four
minutes on an unloaded server with no user store. Diskpar was not used
to align the partitions. I'm not quite sure that Diskpar would be of
value in this instance, as the storage vendor has elected to place all
volumes into one RAID 4 group and partition LUNs out from there. Once we
resolve all the functional issues surrounding the cluster and achieve
the impossible (see previous post to Rodney regarding ten second
failover), then we will dive into capability and performance.
I'm pretty platform agnostic when it comes to storage, though this
is my first implementation with Network Appliance as a SAN. I have
experience with many vendors, including HDS, EMC, IBM, and HP. We have
some concerns around NetApp's ability to support the Exchange cluster
from a SAN point of view. We are anticipating a 2.1TB total mail store
with IOPS consumption somewhere between 2200 and 3250, and to achieve
those metrics they are throwing 24 TB worth of disk at it. Couple
this with the mandatory RAID 4 and the disk block level emulation, and
there is a lot of uncertainty that needs to be validated in test.
M. |
|
|
| Back to top |
|
 |
Rodney R. Fournier [MVP]
Guest
|
Posted:
Mon Dec 13, 2004 6:41 pm Post subject:
Re: Clustered Exchange Failover Consistancy |
|
|
Right now, QLogic 2310 HBAs and Brocade switches with very old StorageTek
9176 drives.
Rod
"Michael Mahar" <mmahar@fireflyconsulting.net> wrote in message
news:3KCdnXTBY7gOpyDcRVn-vQ@giganews.com...
| Quote: | Rodney-
I'm sorry, what'd you say? On a serious note, can I ask what storage
vendor you are using to get this turnaround? Thanks much.
M.
|
|
|
| Back to top |
|
 |
|
|
|
|