Capacity Planning and PXE Issues for Microsoft Endpoint Configuration Manager

Hi Guys,

I am back with a new blog. This is an informative post about why capacity planning is important for keeping SCCM working at a time when most people are working from home. Recently one of my clients had an issue where PXE boot stopped working during peak hours but worked fine during off-peak hours. This behavior made me think that the issue might be related to performance.

I started my log analysis and could see that the PXE server was able to reply to the client with the boot files, but the client was not able to load the boot file and failed with a timeout error. This could be caused either by a network bandwidth limitation or by overutilization of the MP or DP servers.

But this was not enough to find the root cause and fix the issue, so we collected network traces from the MP, DP and database servers, and this is where we found out that the MP was taking 25 seconds to reply.

In the image we can see the initial request time and the MP response time.

Checking the network trace further, we can see that it took a total of 46 seconds for the client to receive the reply from the DP, which is a very long time.

To put the analysis into perspective: if the client takes more than 5 s to receive the boot file, it will not find the boot file, as UEFI boot has a limit of 10 s.
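To make the comparison concrete, here is a minimal Python sketch that checks the delays from the trace against a boot-file timeout. The 25 s and 46 s values are the ones seen in the trace, and the 10 s limit is the UEFI figure quoted above. The function names and the idea of a single timeout parameter are my own simplification for illustration, not something SCCM exposes.

```python
def exceeds_timeout(observed_seconds: float, timeout_seconds: float) -> bool:
    """Return True when an observed delay is longer than the allowed timeout."""
    return observed_seconds > timeout_seconds


def overshoot_seconds(observed_seconds: float, timeout_seconds: float) -> float:
    """Return how far the delay is past the timeout (0.0 if it is within it)."""
    return max(0.0, observed_seconds - timeout_seconds)


# Values taken from the network trace described in this post.
MP_REPLY_SECONDS = 25.0     # time the MP took to reply
TOTAL_REPLY_SECONDS = 46.0  # time until the client received the DP reply
UEFI_LIMIT_SECONDS = 10.0   # UEFI limit quoted in this post

for label, observed in [("MP reply", MP_REPLY_SECONDS), ("DP reply", TOTAL_REPLY_SECONDS)]:
    late = exceeds_timeout(observed, UEFI_LIMIT_SECONDS)
    extra = overshoot_seconds(observed, UEFI_LIMIT_SECONDS)
    print(f"{label}: {observed:.0f}s, too slow={late}, over the limit by {extra:.0f}s")
```

Both delays are well past the limit, which is why the client gives up before the boot file arrives.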

The whole PXE process works like this: the client sends a request to the DP, and the DP communicates with the MP, which runs a stored procedure against the DB. If the connections between the DP, MP and DB are slow, we may encounter a PXE error.

The slowness may be caused by the factors below.
- The network between the DP and MP, or between the MP and DB, is slow.
- The MP has a performance issue.
- The DB has a performance issue.

We checked how many clients are managed by each MP, using the query below.

SELECT COUNT(*) AS ClientCount, LastMPServerName FROM vSMS_CombinedDeviceResources GROUP BY LastMPServerName
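Once the query returns, a few lines of Python can highlight which MPs carry the most clients. This is only a sketch: the rows are made-up sample data in the same (count, LastMPServerName) shape as the query output, the server names are hypothetical, and the per-MP limit is a parameter you would choose from the capacity planning guidance, not a figure I am asserting.

```python
from typing import Iterable, List, Tuple


def overloaded_mps(rows: Iterable[Tuple[int, str]], max_clients_per_mp: int) -> List[Tuple[str, int]]:
    """Return (mp_name, client_count) pairs above the limit, busiest first."""
    over = [(name, count) for count, name in rows if count > max_clients_per_mp]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample rows shaped like the query output: (count, LastMPServerName).
sample_rows = [
    (27000, "MP03.contoso.example"),
    (9000, "MP02.contoso.example"),
    (31000, "MP01.contoso.example"),
]

# The limit below is an illustrative parameter, not an official number.
for name, count in overloaded_mps(sample_rows, max_clients_per_mp=25000):
    print(f"{name} manages {count} clients, above the chosen limit")
```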

We also checked the HTTPERR.LOG on the problematic MPs and noticed a huge number of errors related to DP Connection_Dropped. This is usually related to a performance issue: the MP cannot handle the request in time, so the connection is dropped.

Refer to https://support.microsoft.com/en-ca/help/937692/a-connection-dropped-event-message-is-logged-in-the-httperr-log-file-o

06-17-2020 08:07:04.000 64942 443 HTTP/1.1 CCM_POST /bgb/handler.ashx?RequestType=Continue – 1 Connection_Dropped CCM+Client+Notification+Proxy+Pool

06-18-2020 03:37:58.000 443 HTTP/1.1 CCM_POST /bgb/handler.ashx?RequestType=LogIn – 1 Timer_SslRenegotiation CCM+Client+Notification+Proxy+Pool
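When the log is large, counting the reason phrases by script is quicker than reading it by eye. Below is a minimal Python sketch. It assumes that the reason is the second-to-last whitespace-separated field on each line and that the queue name is the last one, which matches the two sample lines above. The function name is my own, and other log layouts may need a different field position.

```python
from collections import Counter
from typing import Iterable


def count_reasons(lines: Iterable[str]) -> Counter:
    """Count the reason phrase found on each HTTPERR line.

    Assumption: the reason is the second-to-last field and the queue
    name is the last field, as in the sample lines from this post.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        counts[fields[-2]] += 1
    return counts


# The two sample lines from this post, copied as-is.
sample = [
    "06-17-2020 08:07:04.000 64942 443 HTTP/1.1 CCM_POST /bgb/handler.ashx?RequestType=Continue – 1 Connection_Dropped CCM+Client+Notification+Proxy+Pool",
    "06-18-2020 03:37:58.000 443 HTTP/1.1 CCM_POST /bgb/handler.ashx?RequestType=LogIn – 1 Timer_SslRenegotiation CCM+Client+Notification+Proxy+Pool",
]

for reason, total in count_reasons(sample).most_common():
    print(reason, total)
```

For a real log, pass an open file object instead of the sample list, since a file is also an iterable of lines.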

At this stage, we located the cause of the MP performance issue: the hardware was not sufficient, and there were not enough DPs for the VPN clients. Too many clients were requesting content from the DPs and failing, because at any given time an SCCM DP can serve 4000 clients, and any further clients are simply told that the queue is full. These failed requests increase the load on the MPs, which then start to respond to requests with a delay.
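As a rough sizing aid, the 4000-clients-per-DP figure mentioned above can be turned into a quick estimate of how many DPs a given load needs. This Python sketch assumes clients spread evenly across DPs, which real boundary groups will not guarantee. The function name and the 10,000-client example are hypothetical.

```python
def min_dps_needed(concurrent_clients: int, clients_per_dp: int = 4000) -> int:
    """Smallest number of DPs so that no DP goes over its client limit.

    The default of 4000 is the per-DP figure quoted in this post; the even
    spread of clients across DPs is a simplifying assumption.
    """
    if concurrent_clients < 0:
        raise ValueError("concurrent_clients must not be negative")
    if clients_per_dp <= 0:
        raise ValueError("clients_per_dp must be positive")
    # Integer ceiling division, so a partly filled DP still counts as one DP.
    return (concurrent_clients + clients_per_dp - 1) // clients_per_dp


# Hypothetical load: 10,000 clients requesting content at the same time.
print(min_dps_needed(10_000))
```

Treat the result as a lower bound, and check it against the hardware guidance before ordering servers.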

Resolution stage: deliver an action plan to resolve the issue.

We have to resolve the DP issue first, because a DP failure causes clients to re-send location requests, which in turn increases the MP load. We suggested increasing the number of DPs and moving the MP out of the VPN DP boundary to improve performance.
After adding hardware, that is, increasing the number of DPs as well as the CPU and memory, the issue was resolved.
Please refer to the following article for capacity planning:

https://docs.microsoft.com/en-us/mem/configmgr/core/plan-design/configs/recommended-hardware