An engineer who has never dealt with a Page Fault can hardly be considered an experienced Beckhoff engineer. In my opinion, Page Fault may be the most widespread problem after adding ADS routes, and its causes and solutions are far more complex. Here I have done my best to compile the Page Fault information that my colleagues and I have handled and collected internally over the years. I do not guarantee that this article is comprehensive, deep, or accurate. If you have comments, please feel free to correct or supplement it.
1. Problem phenomenon
If the device stops for no apparent reason and a monitor is connected to the controller, you will see a PLC Stop prompt and a Page Fault alarm from the PlcTask server.
2. Trigger mechanism of Page Fault
First of all, what is a Page here? A page is the unit into which the MMU divides memory. Processes do not access physical memory directly; they access memory through the memory management unit (MMU). The addresses a process works with are virtual memory addresses, not physical ones. The MMU maintains the mapping between virtual and physical addresses and divides physical memory into pages. The Windows CE operating system supports two page sizes: 1 KB and 4 KB. The MMU manages the mapping between pages in the virtual address space of the process and pages in physical memory, and provides a degree of memory protection.
Each process has its own working set within the virtual address space: the collection of pages in physical memory that currently belong to the process. Each process in Windows XP has an exclusive 4 GB virtual address space, while all processes in Windows CE share a single 4 GB virtual address space.
When a process references a virtual memory page that is not in its working set, a page fault occurs.
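The mechanism above can be illustrated with a small Python model. It is only a sketch: the 4 KB page size is one of the two sizes mentioned above, while the dictionary used as a page table and the function names are assumptions made for illustration, not how Windows CE or TwinCAT implement it.

```python
from __future__ import annotations

PAGE_SIZE = 4096  # 4 KB, one of the two page sizes mentioned above


def split_address(virtual_address: int, page_size: int = PAGE_SIZE) -> tuple[int, int]:
    """Split a virtual address into (page number, offset inside the page)."""
    return divmod(virtual_address, page_size)


def access(virtual_address: int, page_table: dict[int, int],
           page_size: int = PAGE_SIZE) -> int:
    """Translate a virtual address to a physical one.

    page_table is a hypothetical mapping {virtual page -> physical page}.
    An address on a page that is not in the table raises LookupError,
    which stands in for the page fault.
    """
    page, offset = split_address(virtual_address, page_size)
    if page not in page_table:
        raise LookupError(f"page fault: virtual page {page} is not mapped")
    return page_table[page] * page_size + offset
```

With a table that maps only virtual page 1, `access(0x1234, {1: 7})` succeeds, while `access(0, {1: 7})` raises: address 0 lies on page 0, which is not mapped. This is the same situation as a pointer that was never initialized.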
3. Possible causes of Page Fault in TwinCAT PLC
• A calculation result overflows, or a division by zero, a square root of a negative number, or similar occurs. This immediately triggers a Page Fault.
• Pointer error. This immediately triggers a Page Fault. The most common case is a pointer that was never initialized, or was cleared, and therefore points to memory address 0.
• Array index out of bounds. This does not immediately trigger a Page Fault.
• All "write" operations that take the result of the ADR() function as the target address are in fact direct memory accesses, including functions such as ADSWRITE, MEMCPY, and MEMSET. If the written content or length is incorrect, a Page Fault may be triggered.
• Memory leak: a sharp increase in memory usage caused by a part of the program that is not handled properly.
• Illegal operations or non-existent variables are used in a TwinCAT PLC HMI (CE) screen.
4. Determine whether division by zero or overflow has occurred
During the project development and debugging stage, first disable the TwinCAT PLC HMI. If the problem persists, carry out the steps below. If the project has already been running stably for some time, this cause can be ruled out.
On Beckhoff controllers that have no control panel connected, a Page Fault is not easy to notice. Customers often just report that the device stopped unexpectedly, and to keep production on schedule they usually restart right away instead of leaving the fault state in place so that the cause can be investigated. In that case, you can check TcSysLog to see whether a division by zero or an overflow has occurred.
How do you view TcSysLog? On Win7 systems, look in the system log. On CE systems, first open the "Beckhoff CX Configuration Tool" from the control panel, go to the TwinCAT Setting page, and tick the "Enable Log File" option under Log File Configuration in the upper left. The default size is 4096 bytes, and the default storage path is blank. You can then find the log file TcSysLog in the root directory. It is a text file that can be opened in Notepad. If it contains words such as Divide Zero or Page Fault, the fault has occurred.
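When the log has been copied to an engineering PC, the keyword search can be scripted instead of done by eye. The following Python sketch counts lines containing the two phrases named above. The exact line format of TcSysLog is not given in this article, so the function only assumes plain text and matches the phrases regardless of case and spacing (for example "PageFault" or "page fault"); the function name is hypothetical.

```python
import re
from collections import Counter

KEYWORDS = ("Divide Zero", "Page Fault")  # the two phrases named in the text above


def scan_log(text: str, keywords=KEYWORDS) -> Counter:
    """Count the lines in which each keyword occurs, ignoring case and spacing."""
    patterns = {
        # "Page Fault" becomes the pattern Page\s*Fault, so "PageFault" also matches
        kw: re.compile(r"\s*".join(map(re.escape, kw.split())), re.IGNORECASE)
        for kw in keywords
    }
    hits = Counter()
    for line in text.splitlines():
        for kw, pattern in patterns.items():
            if pattern.search(line):
                hits[kw] += 1
    return hits
```

A typical call would read the copied file into a string first and pass it to `scan_log`. The file encoding on the controller is an assumption to verify before relying on the result.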
5. Steps for handling Page Faults caused by calculation overflow, pointer 0, and array index out of bounds
Prevention: for example, reference SysFuntion.lib in TC2, write check functions such as CheckBounds yourself, or in TC3 add the implicit check functions such as CheckBounds to the project. These checks increase CPU load, so once you have confirmed that the program contains no errors you can remove them again.
Find the line of code that has the error.
The first method is to use CheckerFunction.exp, a function file written by colleagues in the wind power industry. When customers ask for help with sporadic problems, letting a program record the errors is better than waiting for someone to intervene. Once imported, it can detect the type of exception, the number of out-of-bounds accesses, and the upper and lower bounds.
The second method is to use the attachment "Use CheckBounds to locate the error code.rar" and stop at a breakpoint before the error occurs.
Code optimization: perform a validity check before every pointer operation, array access, division, and square root operation.
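The logic behind such checks can be modelled in a few lines. The sketch below is written in Python purely to show the idea; it is not TwinCAT code and not the content of CheckerFunction.exp. The class, the field names, and the fallback values are assumptions made for illustration: each check returns a safe value instead of letting the fault happen, and counts what it caught.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CheckStats:
    """Counters for caught errors (hypothetical names)."""
    out_of_bounds: int = 0
    divide_by_zero: int = 0
    negative_sqrt: int = 0
    last_violation: Optional[Tuple[int, int, int]] = None  # (index, lower, upper)


def check_bounds(index: int, lower: int, upper: int, stats: CheckStats) -> int:
    """Clamp index into [lower, upper] and record the violation."""
    if index < lower or index > upper:
        stats.out_of_bounds += 1
        stats.last_violation = (index, lower, upper)
        return lower if index < lower else upper
    return index


def safe_div(numerator: float, divisor: float, stats: CheckStats,
             fallback: float = 0.0) -> float:
    """Divide, returning fallback and counting the event when divisor is zero."""
    if divisor == 0:
        stats.divide_by_zero += 1
        return fallback
    return numerator / divisor


def safe_sqrt(value: float, stats: CheckStats, fallback: float = 0.0) -> float:
    """Square root, returning fallback and counting the event for negative input."""
    if value < 0:
        stats.negative_sqrt += 1
        return fallback
    return math.sqrt(value)
```

Clamping hides the symptom but not the bug, so the counters and the recorded bounds are the useful part: they tell you which kind of error occurred and how often, which is what the recorder described above is for.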
6. Methods for handling memory leaks and Page Faults caused by memory access
Try to increase the available memory and minimize CPU utilization.
To check whether there is a memory leak or abnormal CPU utilization:
Method 1: Copy the TaskMgr.exe tool to a path on the controller, enter the CE system, and run it.
Method 2: Use the attached program Check_bound+CPU_SD_Mem_Monitor.pro to monitor and record CPU utilization and memory changes.
Optimization measures that can be taken include:
Choose a larger card for CE, which helps with memory consumption. The larger the card, the lower the chance of a Page Fault caused by an array index out of bounds, because a larger amount of virtual memory is available.
Set the memory allocation of the CE system: Storage can be larger, while Program memory is smaller. As shown in the figure, this is set under System in the Control Panel.
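Once memory usage has been recorded at regular intervals, a leak shows up as a steady upward trend rather than as a single high value. The following Python sketch estimates that trend with a least-squares slope. The input format (a list of used-memory values, one per sample) and the threshold are assumptions for illustration and do not describe the output of the monitoring program mentioned above.

```python
def growth_per_sample(samples):
    """Least-squares slope of memory usage over the sample index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def looks_like_leak(samples, threshold_per_sample):
    """True when usage grows faster than the given threshold per sample."""
    return growth_per_sample(samples) > threshold_per_sample
```

A slope close to zero over a long recording suggests stable memory use. A slope that stays positive across restarts of the same program is a reason to look at resources that are allocated but never released.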
Attached: on-site troubleshooting experience
• On February 23, 2017, a Page Fault occurred while running a TC3 demo.
Solution:
Remove the POU call from the task, disable PLC autostart, and activate the configuration.
Then add the POU call back to the task, activate, log in, and start. The error disappears.
• On March 8, 2017, the controller automatically switched from RUN to CONFIG mode and reported "pagefault".
Solution:
A closer look at the program showed that an INPUT variable of a function block was of pointer type, and the function block was called without an address being assigned to it, so the pointer pointed to 0. Because the customer used ladder diagram, the problem was not easy to see, so be careful when using pointers, arrays, etc.
• On April 13, 2017, a question: is it possible to detect potential causes of Page Faults, such as division by zero or array index out of bounds, at compile time?
Solution: No.
• On April 14, 2017, a TC3 project stopped at irregular intervals and reported an error: Exception code, Page Fault.
Solution:
Generally, this problem is caused by array index out of bounds, pointer 0, division by zero, etc. Testing with the corresponding check functions found no pointer 0 or division by zero problems, but many array out-of-bounds accesses. After all the out-of-bounds code in the program was corrected, the problem did not recur. Note that functions such as CheckBounds in TC3 must be added from the project; code added manually is not executed.
• On June 29, 2017, a Page Fault occurred.
Solution:
Following the attached CheckBounds documentation, add the code to the program, set a breakpoint, log in, and wait for the problem to appear.
• On July 11, 2017, the TC light on the customer's controller turned yellow during operation, with a PageFault alarm.
Solution:
Have the customer delete the BOOT file in the controller's CE system directory, add the CHECK functions to the program, and run it again. The alarm no longer appeared, which indicates that the code contained hidden problems such as array index out of bounds, pointer 0, or division by zero.
• On December 22, 2017, the controller frequently reported Page Faults.
Solution:
The program was checked and contained no division by zero or out-of-bounds access. The cause was the TCP/IP communication code, which was not handled properly and created a large number of sockets.