Material selection plays a crucial role in product development. Last week we shared an infographic about metals and their basic properties. This week we deal with ceramics. This material group is less known, but ceramic materials can be vital especially in applications used in demanding environments.
A few years ago, we bought a new mobile router for my family. Soon after the purchase we realised that the router had reliability problems: it stopped working properly after we had used it for only a very short time. The router worked fine after a restart, but every time it had been on for a while, it stopped working properly again.
So, we took the router back to the store, where the clerk promised that they would check what was wrong. I was pretty sure that we had a device with an intermittent failure. Such failures typically occur only under specific conditions and can cause very strange failure modes. Unfortunately, after service, the router did not work any better. Hence, we took it back to the store, where it became clear that they had done nothing, even though we had explained the problem. They had turned the router on, checked that it worked, and decided that it was fine. They were sure that the problem was in our system, not in the router.
We repeated the service cycle once more. This time, however, the service clerk told me that there is no such thing as an intermittent failure and that this was not their problem. Luckily, we managed to return the router, go to another store, and buy a new one, which worked perfectly without any failures.
I must admit that I was somewhat shocked by the service clerk’s attitude. Of course we got really bad service, but what surprised me most was that he had apparently never come across a device with such issues and declared that they were not possible, even though intermittent failures are quite common in electronics. An intermittent failure basically means a failure which comes and goes but is not permanent. Since electronic devices are very complex, there are plenty of potential reasons for such a failure to occur.
Intermittent failures may appear, for example, due to cracks, relaxation of plastics, migration of materials, corrosion, and bad connectors. In testing we commonly see these failures when a crack occurs in an interconnection. A crack in a solder joint may lead to a situation in which the joint is closed at high temperatures, but bending at temperatures below zero opens the joint, and the device does not function as it should. Even more commonly, during thermal cycling testing the interconnection is closed at both the high and the low temperature but open while the temperature is changing, and the device does not work. The picture below shows an example of a cross-sectioned cracked solder joint.
At Trelic, we do lots of different accelerated life tests, which means that we expose test structures, components, or whole devices to different environmental conditions. Quite often a test with fluctuating test conditions is involved. During testing we prefer to measure the functionality or the electrical signals of the tested components and devices in situ. This way we can see in real time how failures occur and whether they are permanent or not, i.e. whether we can see the failure both during and after the test.
Quite often we see failures which occur only at low or high temperature or at high humidity. Even though the samples show a failure during the test, after testing they may function perfectly well, with no indication of any failure. In these situations it is not uncommon for our customer to mention that such failures have also been seen in real use conditions as intermittent failures, and that they have not been able to determine the reason for them. In this case it is of course great that we have been able to reproduce the failures, because we can then start to analyse what is causing them. One of the main problems with intermittent failures is that their location and cause are difficult to determine. Additionally, they may be very hard to replicate. For example, as shown in the picture below, electronics have lots of components with different polymers and plastics, which makes them vulnerable to intermittent failures due to humidity, and such failures are typically difficult to find.
An intermittent failure can also be a real challenge for in-depth failure analysis (read more about failure analysis here), sometimes even a nightmare, as the heading says. When a failure occurs, for example, only at very high humidity and temperature, it is typically not seen at room conditions. However, it is difficult to study a device at such high humidity and temperature to determine the exact location of the failure. Furthermore, detailed failure analysis techniques can usually be used only at room conditions, and because of this it may be impossible to confirm the reason for the failure. The picture below shows a corroded copper-plated via. Corrosion may be one of the causes of failures which are difficult to locate, or it may make connections unstable.
Unfortunately, intermittent failures are common in electronics and cause problems in reliability analysis. Nowadays, there is also a risk that such a failure is caused by a combination of hardware and software problems, which makes it even harder to analyse the reasons and locate the original cause. As I mentioned before, one way to find and analyse such failures is to measure the functionality of the device while ageing it in various environmental conditions. If this is done already in product development, the risk of intermittent failures is significantly reduced. If failures are seen in the test conditions, they can be studied further, and the test conditions already give important clues about the problems and the reasons for them. Building set-ups for electrical measurements during environmental testing is sometimes time consuming, but typically well worth the extra information gained.
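As a rough illustration of what in-situ measurement makes possible, the sketch below scans a logged resistance trace for open events and checks whether the connection recovered afterwards. The function names, the 1 ohm threshold, and the sample data are all invented for this example; a real set-up would choose the failure criterion according to the circuit being monitored.

```python
from typing import List, Tuple


def find_open_events(resistance_ohm: List[float], threshold_ohm: float) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of samples above the threshold."""
    events = []
    start = None
    for i, r in enumerate(resistance_ohm):
        if r > threshold_ohm:
            if start is None:
                start = i  # a new open event begins here
        elif start is not None:
            events.append((start, i - 1))  # the connection has closed again
            start = None
    if start is not None:
        events.append((start, len(resistance_ohm) - 1))  # still open when logging stopped
    return events


def classify_trace(resistance_ohm: List[float], threshold_ohm: float) -> str:
    """Very coarse verdict based only on whether the last event recovered."""
    events = find_open_events(resistance_ohm, threshold_ohm)
    if not events:
        return "no failure seen"
    if events[-1][1] == len(resistance_ohm) - 1:
        return "open at end of log"
    return "intermittent"


# Invented example data: two short opens that recover on their own.
trace = [0.1, 0.1, 50.0, 60.0, 0.1, 0.1, 70.0, 0.1]
events = find_open_events(trace, threshold_ohm=1.0)
verdict = classify_trace(trace, threshold_ohm=1.0)
print(events, verdict)
```

Comparing the event indices with the logged chamber temperature or humidity at the same time stamps would then show under which conditions the connection opens.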
Thermal cycling testing is very widely used, especially in electronics. It exposes devices and components to fluctuating temperature, which causes fatigue in the components and especially in their interconnections. Cracking due to fatigue is one of the most common failure mechanisms in electronics and is therefore a very important consideration in reliability analysis.
Fatigue failures are caused by differences in the coefficients of thermal expansion (CTE) of the materials used in electronic devices. For example, the CTE of silicon is very small (about 3 ppm/°C), while the CTE of polymer parts can be very high (more than 100 ppm/°C), especially if the polymer materials need to be used unfilled. Even filled polymer materials tend to have rather high CTEs, as shown in the picture below, which can cause very high stresses to form in the structures. Even though fluctuating stresses due to CTE differences are the main reason for failures, several other potential failure mechanisms are also present in thermal cycling. For example, diffusion and relaxation of materials at high temperature may affect the failure modes.
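To get a feel for the size of the CTE mismatch, the sketch below computes the difference in free thermal strain between two materials over one temperature swing, using the CTE values mentioned above. The 165 °C swing (for example -40 °C to +125 °C) is an assumed value, and the function name is made up for this example. The result is only the unconstrained strain difference; the actual stresses depend on geometry, stiffness, and how the materials are joined.

```python
def mismatch_strain(cte_a_ppm_per_c: float, cte_b_ppm_per_c: float, delta_t_c: float) -> float:
    """Difference in free thermal strain between two materials for a temperature change."""
    return abs(cte_a_ppm_per_c - cte_b_ppm_per_c) * 1e-6 * delta_t_c


# Silicon (about 3 ppm/°C) against an unfilled polymer (about 100 ppm/°C),
# over an assumed 165 °C temperature swing.
strain = mismatch_strain(3.0, 100.0, 165.0)
print(f"Mismatch strain: {strain * 100:.1f} %")
```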
As mentioned above, thermal cycling testing is a very common test method. Because of this, there are numerous test standards and recommendations for these tests and how they should be conducted. However, many standards give lots of options for test parameters or even mention that the parameters should be tailored to the application. Although it is common to choose a test which has been widely used earlier, before testing it is useful to consider whether the parameters of the test really are suitable, or the most efficient ones, for the studied component or device. There are several parameters to consider, and picking the best combination is not straightforward.
The main parameters of thermal cycling testing include temperature limits, dwell time at both limits and the change rate between the limits. In the picture below the main test parameters are shown. All of them are important and affect the stresses formed during testing. Of course, the number of test cycles is also a critical factor and the duration of testing should always be carefully considered.
The temperature limits are critical for the acceleration level of testing. The greater the difference between the limits, the higher the stresses they cause. However, if the limits are too extreme, there is a marked risk of overstress failures, i.e. early failures which would never occur in use conditions. A typical example of a critical limit is the glass transition temperature, Tg, of polymer materials. A temperature limit above Tg may lead to a catastrophic failure which is easily seen, but it may also just change the failure mechanisms to unrealistic ones. Then again, it is useful to use as high a temperature limit as possible: if the difference between the limits is not great enough, the test has a very small acceleration factor and the test time becomes very long.
In addition to the stresses caused by the difference between the temperature limits, the exposure to either high or low temperature may cause degradation leading to failures. For example, high temperature accelerates many harmful processes, including diffusion, migration, and oxidation. An example of such a process is the growth of intermetallic layers in solder joints, which typically reduces the mechanical robustness of the joints. High temperature also causes degradation and oxidation of polymers and permanently weakens their properties. To take these factors into account, it is important to consider how long an exposure is suitable at each temperature limit, i.e. the dwell time at each limit.
If a long dwell time is used, the test duration increases unless the number of cycles is reduced. The picture below shows the effect of cycle time on the test duration. If the aim is to do 500 cycles with a 30 min cycle, we need approximately 250 h or 1.5 weeks of testing. With a 120 min cycle the test time increases to 1,000 h or 6 weeks. Then again, a long dwell time may sometimes even accelerate testing, if it causes changes in the structure which increase the stresses during the temperature changes. For example, at high temperature polymer materials relax or creep: the polymer chains in the material move to reduce the stresses caused by the high temperature. When the temperature is lowered, these changes may significantly increase the stresses formed in the structures. However, a long enough dwell time is required for these changes to occur. It is often difficult to optimise the dwell time, but it is good to consider whether critical changes may occur at high temperatures and whether an extended dwell time is required.
Furthermore, the change rate of the temperature is critical. Very fast cycling (or shock testing) may cause thermal gradients to form in the tested structures. This means that different parts of a device or component heat up at different rates, and warping of the system may occur. If such rapid changes of temperature may occur in use conditions, it is important to test their effects. Typically, however, such shocks are not present in use, and in testing they cause incorrect failure mechanisms. Consequently, it is typically better to use slower change rates, which allow different materials to warm up at a similar rate. A slower change rate naturally increases the testing time. The maximum change rate depends greatly on the structure tested. Small structures or components warm up quickly and can normally be tested with very fast change rates, but large devices typically require slow change rates and also longer dwell times.
When the test profile has been chosen, it is good to remember that the actual temperature within the test chamber, or more importantly within the tested component, may be quite different from the programmed temperature. The picture below shows the temperature measured from a tested component together with the programmed profile. As can be seen, the actual change rate is clearly slower than the programmed one, which makes the dwell time shorter. With large components there is therefore a substantial risk that the components do not reach the temperature limits at all, especially when a short cycle time with a fast change rate is used. It is thus useful to measure regularly what the test sample is really exposed to in the test conditions, and to adjust the test parameters if needed.
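One simple way to check this from a logged thermocouple trace is to count how long the sample actually stayed close to the programmed limit. The sketch below assumes evenly spaced samples; the function name, the 3 °C tolerance band, and the temperature values are invented for the example.

```python
from typing import List


def time_at_limit_min(temps_c: List[float], sample_interval_min: float,
                      limit_c: float, tolerance_c: float) -> float:
    """Time the measured temperature spent within +/- tolerance of the programmed limit."""
    samples_at_limit = sum(1 for t in temps_c if abs(t - limit_c) <= tolerance_c)
    return samples_at_limit * sample_interval_min


# Invented measurement from a component, one sample per minute,
# for a profile programmed to dwell at +125 °C.
measured = [25, 60, 90, 110, 119, 123, 124, 125, 125, 124, 100, 60]
actual_dwell = time_at_limit_min(measured, sample_interval_min=1.0,
                                 limit_c=125.0, tolerance_c=3.0)
print(f"Time within 3 °C of the limit: {actual_dwell:.0f} min")
```

If the value is clearly shorter than the programmed dwell, or zero, the profile needs a slower change rate or a longer dwell for that sample.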
Finally, the number of cycles is a critical parameter to determine. It depends on several other factors, for example the test parameters, the use conditions, and the expected use life. For some structures, such as solder joints, several formulas for calculating suitable test durations exist. However, there is no easy answer to how the duration of the test should be determined, and it should always be considered on the basis of the tested components and structures.
Humidity testing is one of the most commonly used accelerated reliability test methods. This makes sense, since high humidity is one of the most common causes of failures. Moreover, humidity testing is relatively easy to conduct, and equipment for it is commonly available. A typical humidity test is a constant humidity test, in which test samples are exposed to steady high humidity for extended periods. Typically, high temperature is used as an additional accelerating factor, since this makes the test conditions highly accelerating and thereby reduces the test time.
When humidity testing is planned, it is important to choose suitable test conditions. This is not always easy, since there are lots of test standards available, and these standards give numerous different combinations of test temperature, relative humidity (RH), and duration. In electronics it is quite common that the same test is used repeatedly without really considering its suitability. For example, a test at 85% RH and 85°C, the so-called 85/85 test, has been very widely used as a basic test for almost everything in electronics, even though it is a very harsh test.
At least, when reliability testing for something new is developed or when an old design is markedly changed, the test methods should be carefully considered. Test parameters are even more important when acceleration factors are determined using several tests. In humidity testing this means that both test temperature and humidity level need to be considered.
The level of humidity may be expressed as absolute or relative humidity. The absolute humidity tells the actual amount of water in the air and is expressed in g/m³. The relative humidity tells how large a percentage of the maximum amount of water the air can hold is present. Consequently, the relative humidity changes with temperature, i.e. at low temperature a smaller amount of water leads to a higher relative humidity than at higher temperatures. In the figure below, the relation between the relative and absolute humidity is given at different temperatures. As can be seen, the absolute humidity can be very different at different temperatures, even though the relative humidity stays the same.
Typically, we use test standards and literature data to plan humidity tests. The standards give the humidity as relative humidity, most likely because it is much easier to use than absolute humidity. Moreover, the test chambers are programmed using relative humidity. The problem with relative humidity is that it easily leads to a situation in which one does not really realise how much the real amount of water changes between the tests and conditions. For example, if the use conditions are on average 35°C and 90% RH, it is very humid, but the absolute humidity is still only slightly more than a tenth of the humidity in the 85/85 test.
Below, the absolute humidity values of some common test conditions are compared. As can be seen, the absolute humidity rises very quickly to very high values with increasing temperature.
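The conversion from temperature and relative humidity to absolute humidity can be approximated with a few lines of code. The sketch below uses a Magnus-type approximation for the saturation vapour pressure and the ideal gas law for water vapour; the coefficients are one commonly used set, the function names are made up for this example, and the accuracy degrades somewhat towards the upper end of the temperature range, so the results should be read as estimates.

```python
import math

R_WATER_VAPOUR = 461.5  # specific gas constant of water vapour, J/(kg*K)


def saturation_vapour_pressure_hpa(temp_c: float) -> float:
    """Magnus-type approximation of the saturation vapour pressure over water."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def absolute_humidity_g_m3(temp_c: float, rh_percent: float) -> float:
    """Approximate mass of water vapour per cubic metre of air."""
    vapour_pressure_pa = saturation_vapour_pressure_hpa(temp_c) * 100.0 * rh_percent / 100.0
    return 1000.0 * vapour_pressure_pa / (R_WATER_VAPOUR * (temp_c + 273.15))


use_conditions = absolute_humidity_g_m3(35.0, 90.0)
test_85_85 = absolute_humidity_g_m3(85.0, 85.0)
print(f"35 °C / 90 %RH: {use_conditions:.0f} g/m3")
print(f"85 °C / 85 %RH: {test_85_85:.0f} g/m3")
print(f"Ratio: {use_conditions / test_85_85:.2f}")
```

With these approximations the ratio comes out a little above one tenth, in line with the comparison between the 35°C/90% RH use conditions and the 85/85 test.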
So why is this important? We are using more and more plastic or polymer materials in all engineering applications. Unlike metals and ceramics, polymers and plastics tend to absorb moisture. For some polymers this absorption can be considerable, and for most polymers it depends on the amount of humidity present. So, if the amount of water in the test conditions is significantly higher than anything possible in use conditions, there is a risk of accelerating failures which would never occur in real conditions. One example of this is chemical degradation of polymers by water, such as hydrolysis, which may be accelerated hugely in high temperature and high humidity conditions but is not relevant in use conditions. Or the thermal and mechanical properties of a plastic part may change considerably due to high water content and again lead to a failure which would not have happened in normal use conditions.
The challenge is that the combination of high temperature and high humidity is an excellent way to accelerate reliability testing. Especially when the use life of a product is long, high acceleration is essential to reach reasonable testing times. Consequently, we must compromise between extreme test conditions and the risks related to their use. However, before testing it is always important to consider the risks and whether it is meaningful to test with extremely high absolute humidity values. Sometimes it is better to lower the temperature and humidity levels, even if this means longer testing.
The failure analysis process is used when something fails and we need to know how and why it happened. The analysis may be very complex, with numerous different analysis methods and tools. However, the overall process is typically quite simple and does not change markedly whether we are trying to figure out why a beer can has exploded (like the one in the picture above) or what happened to a complex electrical unit which has stopped working.
Commonly, the process can be described with four steps which you may need to repeat multiple times, especially with complicated failures which may involve several failure types.
Step 1. Data collection
The first part of the failure analysis process is the collection of data. The purpose of this part is to gather as much information as possible about the failure and the factors related to it. Basically, this means verifying what has failed, how the failure happened or how it was observed, where and when it occurred, whether the product was used as it should be or whether it had been used at all, and anything else specific to the failure. In addition to the failure, it is important to gather as much information as possible about the failed device or structure, for example what kinds of materials, components, and manufacturing processes have been used.
Failure analysis often starts in a hurry, especially if something critical has failed. Because of this, finding enough time to collect all the necessary data can be a challenge. Furthermore, finding correct information can be quite difficult or even impossible. However, data collection is a crucial part of the process, because without enough data there is a major risk of using unsuitable analysis methods or drawing the wrong conclusions.
Step 2. Hypothesis of the cause
After the data has been gathered, it is used to make a guess about what could have happened. In addition to the collected data, we often need literature data, earlier experience, and history data from similar products to support our hypothesis.
Sometimes making such a hypothesis is very easy and just by using visual inspection we can see the probable cause for the failure. For example, there could be a crack which is clearly due to fatigue or obvious corrosion which has caused an electrical breakdown. However, it is common, especially in complex systems, that there are several possibilities and making a hypothesis of the cause is very difficult. It is even possible that we will start by eliminating potential causes just to get more information.
Even though making a hypothesis can be a challenge, it is important, since we use it to move to the next stage, i.e. to determine which analysis methods should be used.
Step 3. Analysis using various analytical methods
The next stage of the process is to use suitable methods to analyse the failed product. There are lots of different methods we can use to do this. Some are quite straightforward and obvious, for example visual inspection is used to check what we are dealing with and in electrical systems some kind of electrical measurements are almost always used. However, often analysis methods are complex, require lots of experience and knowledge, and are expensive to use. Because of this, it is important to use the data from step 1 and the hypothesis from step 2 to decide which analysis methods to start with to ensure that they are both efficient and meaningful.
When the methods have been decided, the analysis should start with the least destructive methods. This way the failed samples are not destroyed and can be further analysed with other techniques if needed.
Moving to destructive techniques is often necessary after the non-destructive methods. This can be problematic, especially, if there is only one sample to analyse. It is possible that only one analysis can be conducted from the sample, since the analysis may fully destroy the sample. Consequently, it is vital to pick the right methods.
If you are unlucky, several failure mechanisms act at the same time, which can make the analysis very difficult. In such a case you may just need to make an educated guess and hope for the best. Sometimes it can also be useful to start by eliminating most of the potential causes and to use the results to decide how to proceed.
Step 4. Analysis of the results and conclusions
After the data has been collected and the samples analysed, we analyse the results and draw conclusions, i.e. determine what the cause of the failure was. Then we can write a report including all relevant data about the samples, a description of the analysis techniques used, the results, and the conclusions, and the failure analysis process is complete.
However, in the real world this might not be how the process goes. When we start to analyse the results, or even before that, we may realise that we lack some critical data and need to go back to step 1 to collect more information. Or the results given by the various analytical methods are not conclusive, and we need to go back to step 3 and use additional techniques. Or the results indicate something unexpected, and we need to revise our hypothesis or maybe even start the whole process from the beginning. In practice this means that the failure analysis process often contains several cycles before it is finally concluded.
Finally, it is important to note that more often than not there is a considerable amount of uncertainty in the final results of a failure analysis, even when it is well conducted. We can reduce the uncertainty by repeating the different parts, for example by adding more analysis methods or even trying to replicate the failure mechanism. This can go on for a long time. Consequently, it is sometimes important to consider not only how to conduct the process but also what is enough and when to stop.