How to handle Page Faults on Beckhoff systems

An engineer who has never dealt with a Page Fault can hardly be called an experienced Beckhoff engineer. In my opinion, Page Fault is probably the most widespread problem after adding ADS routes, and its causes and solutions are far more complex. Here I have done my best to compile the Page Fault information that my colleagues and I have handled and collected internally over the years. I do not guarantee that this article is comprehensive, deep, or fully accurate. If you have comments, please feel free to correct or supplement it.

 

1. Symptoms

The machine stops for no apparent reason. If a monitor is connected to the controller, you will see a PLC Stop message and a Page Fault alarm from the PlcTask server.


2. Trigger mechanism of Page Fault

First of all, what is a page? A page is the unit in which the memory management unit (MMU) divides memory. Processes do not access physical memory directly; they access memory through the MMU. The addresses a process works with are virtual memory addresses, not physical ones. The MMU maintains the mapping between virtual and physical addresses and divides physical memory into pages. The Windows CE operating system supports two page sizes: 1 KB and 4 KB. The MMU manages the mapping between pages in the process's virtual address space and pages in physical memory, and provides a degree of memory protection.

Each process has its own working set within the virtual address space: the collection of pages in physical memory that currently belong to the process. In Windows XP each process has its own exclusive 4 GB virtual address space, whereas in Windows CE all processes share a single 4 GB virtual address space.

When a process references a virtual memory page that is not in its working set, a page fault occurs.
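As a rough illustration of the mechanism described above, the following Python sketch models a virtual address as a page number plus an offset, and raises an error when the page has no mapping. The names `split_address`, `translate`, and `PageFault`, the dictionary used as a page table, and the choice of the 4 KB page size are assumptions made for illustration only; a real MMU does this in hardware.

```python
PAGE_SIZE = 4096  # assumed page size of 4 KB for this illustration


class PageFault(Exception):
    """Raised when a virtual page has no mapping (illustrative only)."""


def split_address(vaddr, page_size=PAGE_SIZE):
    """Split a virtual address into (page number, offset within the page)."""
    return vaddr // page_size, vaddr % page_size


def translate(vaddr, page_table, page_size=PAGE_SIZE):
    """Map a virtual address to a physical one using a dict as a toy page table.

    page_table maps virtual page numbers to physical page numbers.
    """
    page, offset = split_address(vaddr, page_size)
    if page not in page_table:
        # The page is not mapped for this process: this is the page fault.
        raise PageFault(f"no mapping for virtual page {page}")
    return page_table[page] * page_size + offset
```

With a toy table such as `{1: 7}`, an address inside virtual page 1 translates normally, while address 0 (the typical value of an uninitialized pointer) falls into the unmapped page 0 and raises `PageFault`.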

3. Possible causes of Page Fault in TwinCAT PLC

Arithmetic errors: result overflow, division by zero, square root of a negative number, and so on. These trigger a Page Fault immediately.

 

Pointer errors. These trigger a Page Fault immediately. The most common case is a pointer that was never initialized, or was cleared, and therefore points to memory address 0.

 

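Array index out of bounds. This does not trigger a Page Fault immediately, because the access usually lands in neighboring memory that is still valid, so the fault tends to show up later and somewhere else.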

 

All "write" operations that use the ADR() function to obtain the target address are direct memory accesses. This includes functions such as ADSWRITE, MEMCPY, and MEMSET. If the written content or length is wrong, a Page Fault may be triggered.

 

Memory leaks: memory usage rises sharply because some part of the program is not handled properly.

 

The TwinCAT PLC HMI (CE) screens perform illegal operations or use variables that do not exist.
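For the write operations through ADR() mentioned in the list above, the usual defense is to validate offset and length before copying. The following Python sketch shows the idea on a `bytearray`; the function name `checked_copy` and its return convention are hypothetical, and it is not a model of how MEMCPY itself behaves in TwinCAT.

```python
def checked_copy(dest, dest_offset, src, length):
    """Copy `length` bytes from src into dest at dest_offset.

    Returns True on success. Returns False, leaving dest untouched,
    when the request would read past src or write past dest.
    """
    if length < 0 or dest_offset < 0:
        return False
    if length > len(src):
        # Would read beyond the end of the source buffer.
        return False
    if dest_offset + length > len(dest):
        # Would write beyond the end of the destination buffer.
        return False
    dest[dest_offset:dest_offset + length] = src[:length]
    return True
```

A call that fits, such as copying 4 bytes to offset 4 of an 8-byte buffer, succeeds; the same copy to offset 6 is refused instead of overwriting whatever lies behind the buffer.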

 

4. Determine whether division by zero or overflow has occurred

 

During project development and debugging, first disable the TwinCAT PLC HMI. If the problem persists, carry out the steps below. If the project has already been running stably for some time, the HMI can be ruled out as the cause.

 

On Beckhoff controllers without a connected display, a Page Fault is not easy to notice. Customers often just report that the machine stopped unexpectedly, and to keep up with production they usually restart right away instead of leaving the fault state in place while the cause is tracked down. In that case you can check TcSysLog to see whether a division by zero or an overflow has occurred.

How do you view TcSysLog? On Win7 systems there is a system log. On CE systems, open the "Beckhoff CX Configuration Tool" from the Control Panel, go to the TwinCAT Setting page, and check the "Enable Log File" option under Log File Configuration in the upper left. The default size is 4096 bytes and the default storage path is blank. The log file TcSysLog then appears in the root directory. It is a text file that can be opened in Notepad. If it contains entries such as Divide Zero or Page Fault, the fault has indeed occurred.
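If the log has been copied off the controller, searching it for these keywords can be automated. The following is a minimal sketch, assuming only that the log is plain text and that the keywords named above appear somewhere in the relevant lines; the exact layout of TcSysLog entries is not described here, and the function name `find_fault_lines` is hypothetical.

```python
KEYWORDS = ("Divide Zero", "Page Fault")  # keywords taken from the text above


def find_fault_lines(lines, keywords=KEYWORDS):
    """Return (line number, matched keyword, stripped line) for every hit.

    `lines` is any iterable of strings, for example an open text file.
    Matching is case-insensitive; only the first matching keyword per
    line is reported.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                hits.append((number, keyword, line.strip()))
                break
    return hits
```

The function can be fed a file object opened on the copied log, or any list of strings, and returns an empty list when none of the keywords occur.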

 

5. Dealing with Page Faults caused by arithmetic overflow, null pointers, and array index out of bounds

 

Prevention: in TC2, reference SysFuntion.lib or write check functions such as CheckBounds yourself; in TC3, add check functions such as CheckBounds as implicit checks. These checks increase CPU load, so once you have confirmed that the program is free of errors you can remove them again.

Locating the line of code that causes the error:

The first method is to use CheckerFunction.exp, a function file written by colleagues in the wind power industry. When customers ask for help, the problem is often intermittent, so letting a program record the errors works better than waiting for someone to catch them. Once imported, it records the type of exception, the number of out-of-bounds accesses, and the upper and lower bounds involved.

The second method is to use the attachment "Use CheckBounds to locate the error code.rar" and set a breakpoint so that execution stops before the error occurs.

Code optimization: perform validity checks before every pointer operation, array access, division, and square root operation.
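The idea behind the check functions and validity checks above can be sketched in a few lines. This is a language-neutral illustration in Python, not TwinCAT code, and it is not the content of CheckerFunction.exp. The names `BoundsRecorder`, `safe_div`, and `safe_sqrt` are hypothetical. The bounds check clamps the index to the nearest valid bound, which is a common way to write a CheckBounds function, and records each violation so that intermittent errors can be counted.

```python
import math


class BoundsRecorder:
    """Clamp out-of-range indices and remember how often that happened."""

    def __init__(self):
        self.violations = 0         # number of out-of-bounds accesses seen
        self.last_violation = None  # (index, lower, upper) of the latest one

    def check_bounds(self, index, lower, upper):
        """Return an index guaranteed to lie within [lower, upper]."""
        if index < lower:
            self._record(index, lower, upper)
            return lower
        if index > upper:
            self._record(index, lower, upper)
            return upper
        return index

    def _record(self, index, lower, upper):
        # A breakpoint placed here stops at every violation.
        self.violations += 1
        self.last_violation = (index, lower, upper)


def safe_div(numerator, denominator, fallback=0.0):
    """Return (result, ok); never divides by zero."""
    if denominator == 0:
        return fallback, False
    return numerator / denominator, True


def safe_sqrt(value, fallback=0.0):
    """Return (result, ok); never takes the root of a negative number."""
    if value < 0:
        return fallback, False
    return math.sqrt(value), True
```

Returning a flag alongside the value lets the caller decide whether to raise an alarm or continue with the fallback, instead of letting the runtime stop with an exception.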

 

6. Dealing with memory leaks and Page Faults caused by memory access

 

Try to increase the available memory and keep CPU utilization as low as possible.

To check whether there is a memory leak or abnormal CPU utilization:

Method 1: Copy the TaskMgr.exe tool to a path on the controller, enter the CE system, and run it.

Method 2: Use the attached program Check_bound+CPU_SD_Mem_Monitor.pro to monitor and record CPU utilization and memory changes.

Optimization measures that can be taken include:

Choose a larger card for CE, which helps with memory consumption. The larger the card, the more virtual memory is available, and the lower the chance of a Page Fault caused by an out-of-bounds array index.

Set the memory allocation for the CE system: Storage can be made larger, while Program memory is made smaller. As shown in the figure, this is done in the System applet of the Control Panel.
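Once memory readings have been recorded, as in Method 2 above, a leak typically shows up as free memory that keeps falling and never recovers. The following sketch shows one simple way to flag that pattern. The function name `looks_like_leak`, the thresholds, and the unit (KB of free memory per sample) are assumptions for illustration and have nothing to do with the internals of the attached monitor program.

```python
def looks_like_leak(free_kb_samples, min_samples=5, min_drop_kb=1024):
    """Heuristic: free memory never rises and has dropped by a clear margin.

    free_kb_samples: free-memory readings in KB, oldest first.
    """
    if len(free_kb_samples) < min_samples:
        return False  # not enough data to judge
    pairs = zip(free_kb_samples, free_kb_samples[1:])
    never_recovers = all(later <= earlier for earlier, later in pairs)
    total_drop = free_kb_samples[0] - free_kb_samples[-1]
    return never_recovers and total_drop >= min_drop_kb
```

Readings that fluctuate around a stable level are not flagged; only a steady decline over the whole window is. The thresholds need tuning to the sampling interval and to the amount of memory on the controller.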



Appendix: field troubleshooting experience




• On February 23, 2017, a Page Fault occurred in TC3 Demo operation.

Solution:

Remove the POU call from the task, disable PLC autostart, and activate the configuration.

Then add the POU call back to the task, activate, log in, and start. The error disappears.



• On March 8, 2017, the controller automatically switched from RUN to CONFIG mode and reported "pagefault".

Solution:

A closer look at the program showed that an INPUT variable of a function block was of pointer type, and the function block was called without an address being passed in, so the pointer pointed to address 0. Because the customer uses ladder diagram, the problem is not easy to spot, so be careful when using pointers, arrays, and the like.



• On April 13, 2017, the question came up whether potential causes of Page Faults, such as division by zero or out-of-bounds array access, can be detected at compile time.

Solution: No.

• On April 14, 2017, a TC3 project stopped at irregular intervals and reported an error: Exception code, Page Fault.

Solution:

This problem is generally caused by an array index out of bounds, a null pointer, division by zero, and so on. Testing with the corresponding check functions found no null pointer or division by zero, but many out-of-bounds array accesses. After all of the out-of-bounds code in the program was corrected, the problem did not reappear in subsequent observation. Note that in TC3, check functions such as CheckBounds must be added from the project; check code added manually is not executed.

• On June 29, 2017, a Page Fault occurred.

Solution:

Following the attached CheckBounds documentation, add the check code to the program, set a breakpoint in it, log in, and wait for the problem to appear.



• On July 11, 2017, the TC LED on a customer's controller turned yellow during operation and a Page Fault alarm was raised.

Solution:

The customer deleted the BOOT file in the controller's CE system directory, added the check functions to the program, and ran it again. The alarm no longer appeared, which indicates that the code contains hidden problems such as an array index out of bounds, a null pointer, or division by zero.



• On December 22, 2017, the controller frequently reported Page Faults.

Solution:

Checking the program found no division by zero and no out-of-bounds access. The cause was that the TCP/IP communication part of the program was not handled properly and created a large number of sockets.
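One way to avoid piling up sockets is to give the connection a single owner that always closes the old socket before opening a new one. The sketch below shows the pattern in Python; the class name `SingleSocketOwner` and the injectable `factory` parameter are hypothetical, and in a TwinCAT project the same discipline would have to be applied with whatever connect and close function blocks the program uses.

```python
import socket


def _tcp_socket():
    """Default factory: a plain, unconnected TCP socket."""
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM)


class SingleSocketOwner:
    """Holds at most one open socket at any time."""

    def __init__(self, factory=_tcp_socket):
        self._factory = factory
        self._sock = None
        self.opened = 0  # total sockets created
        self.closed = 0  # total sockets closed

    def open(self):
        """Close any existing socket, then create and return a new one."""
        self.close()
        self._sock = self._factory()
        self.opened += 1
        return self._sock

    def close(self):
        """Close the current socket if there is one."""
        if self._sock is not None:
            self._sock.close()
            self._sock = None
            self.closed += 1

    @property
    def open_count(self):
        """Number of sockets currently open (0 or 1)."""
        return self.opened - self.closed
```

However often `open()` is called, for example on every reconnect attempt, `open_count` never exceeds 1, so a faulty retry loop cannot exhaust the system's sockets.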