Oracle SOA Suite 12c – BPEL: tuning a domain and improving the performance
Author: Eduardo Barra Cordeiro. Category: Oracle
Last year, I had the chance to work with Oracle SOA Suite 12c. I invested a lot of time in BPEL and was involved in countless incidents and troubleshooting sessions during operations support. In this post, I want to share some tips based on that experience, so that your environment can perform consistently well too. Most of the tips are related to DB cleanup.
In my case, I had big databases (over 500 GB) with 8 weeks of data retention. The purge process was set up with the row_movement strategy because the instances were usually long-running.
1. Work with a good Oracle DBA close to you
BPEL relies heavily on dehydration to keep the process state up to date. Even if you reduce the audit trail, dehydration will always be there. For this reason, it’s important to keep your queries executing in under 0.01 seconds. You, as a Fusion Middleware/WebLogic administrator, don’t have responsibility for the BPEL database, right? Wrong. You do, even though you often lack the permissions or even the skills to maintain the product database.
If you have the chance, work closely with a good DBA and help them learn more about the product. I had the chance to work with two great DBAs, and for more than a year we had zero incidents (on an 11g product version, in 2015). In 12c, I didn’t have the same luck, partly because the product was still stabilizing and partly because of bad luck with infrastructure incidents (storage and network outages). Luckily, the combined work and knowledge of the DBA and SOA operations teams let us anticipate a lot of problems and limit the business impact.
2. For DB storage, prefer SSDs over hard disks; keep monitoring everything
During the move from 11g to 12c, I had to choose between hard disks and SSDs for DB storage. I chose SSDs. They helped keep performance stable at under 0.02 seconds per transaction. The database in that particular situation was 2.5 TB, with a weekly purge and 10 partitions of retention. You should calculate the ROI for your case, since SSDs are usually more expensive. Calculate the loss per hour when your environment is down and use this information to justify the investment, or to rule it out.
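The ROI reasoning above can be sketched as a simple payback calculation. This is only an illustration: the function names, the inputs, and the idea of comparing the SSD price premium against avoided downtime are assumptions for the example, not a formula from any Oracle guide.

```python
def ssd_payback_hours(extra_ssd_cost, loss_per_hour_down):
    """Hours of avoided downtime needed to pay back the SSD price premium.

    extra_ssd_cost: price difference between the SSD and hard disk options.
    loss_per_hour_down: estimated business loss per hour of outage.
    Both values are hypothetical inputs in the same currency.
    """
    if loss_per_hour_down <= 0:
        raise ValueError("loss_per_hour_down must be positive")
    return extra_ssd_cost / loss_per_hour_down


def ssd_is_justified(extra_ssd_cost, loss_per_hour_down, downtime_hours_avoided):
    """True when the downtime you expect to avoid covers the price premium."""
    return downtime_hours_avoided >= ssd_payback_hours(
        extra_ssd_cost, loss_per_hour_down
    )


# Example with made-up numbers: a 20,000 premium and 5,000 lost per hour of outage.
print(ssd_payback_hours(20_000, 5_000))
print(ssd_is_justified(20_000, 5_000, downtime_hours_avoided=6))
```

The estimate of avoided downtime is the weak point of any such calculation, so treat the result as a conversation starter with the business rather than a verdict.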
3. Improve product queries and indexes. Execute periodic maintenance
DBAs have good tools to monitor and evaluate the performance of the product queries. Each scenario is different, and the query usage depends on your case. For instance, there is a huge difference between having only synchronous instances and using a long-running approach, and between using events or not. Slow queries can even affect Fusion Middleware EM itself, where you will notice poor performance. Ask the DBA to analyze query performance and check whether Oracle DB suggests a new index or another profile.
Also, if you run the purge frequently, it’s important to rebuild your indexes. Otherwise you will see your environment become slower after a while, and not just because of the database size. It’s good practice to run periodic maintenance at the DB level every quarter or every six months, depending on your case. In Oracle 12c you can run it online, without downtime.
4. Purge your data during non-business hours, even in non-production environments
Oracle provides a set of scripts to purge (delete) your old data. Check them out in the official documentation. Those scripts are not the same as the auto-purge feature introduced in 12c. The scripts will delete whole partitions, not only terminated instances. This task is quite fast if you only have synchronous instances, and slower if you need the row-movement approach (in case your project uses long-running instances). Execute this activity during non-business hours, because in very large databases with row movement it can take more than 50 hours.
Regarding the time to retain your data, it depends on the business requirements. If it’s information the business can’t provide, choose at least 3 months of data (12 weeks). In case your DB grows too much and too fast, 2 weeks or 1 month can be enough.
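To get a feeling for what a retention change means for database size, a rough projection can help. The sketch below assumes linear growth and a purge that reclaims everything older than the retention period; both are simplifications, and the function names are hypothetical. The 500 GB and 8 weeks figures come from the environment described at the start of this post.

```python
def weekly_growth_gb(current_size_gb, retention_weeks):
    """Average growth per week, assuming the DB holds exactly the retained data."""
    if retention_weeks <= 0:
        raise ValueError("retention_weeks must be positive")
    return current_size_gb / retention_weeks


def projected_size_gb(growth_per_week_gb, retention_weeks):
    """Steady-state size for a given retention, assuming linear growth."""
    return growth_per_week_gb * retention_weeks


growth = weekly_growth_gb(500, 8)      # 62.5 GB per week
print(projected_size_gb(growth, 12))   # size if retention moves to 12 weeks
print(projected_size_gb(growth, 2))    # size if retention drops to 2 weeks
```

Long-running instances that are moved to newer partitions instead of being purged will push the real number above this estimate.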
5. Avoid the auto-purge feature during business hours
As described before, the auto-purge feature introduced in 12c deletes only terminated instances, not partitions. It’s a good option for non-production environments, where you can run it daily during non-business hours. I don’t recommend it in production without good performance tests, because its deletion strategy can create contention at the DB level. Since each case is different, test and check your results, then make a final decision. If you need support to set it up, take a look at this post.
6. Keep the domain clean. Undeploy old composite versions
In both non-production and production environments you need some kind of composite governance. I would have liked to say that each newly deployed composite version means the old one can be undeployed, but in most cases that isn’t true. You will need to agree with your business and development teams on a strategy to keep your domain clean. Usually, teams keep the default version and one older version.
If you are in doubt about whether a composite is ready to be undeployed, you can run the query below to check whether it still has running or terminated instances. It returns an instance count per revision for a specific composite. Customize it as needed.
SELECT domain_name, composite_name, composite_revision, COUNT(*)
  FROM YOUR_SOAINFRA.cube_instance
 WHERE composite_name = 'YOUR_COMPOSITE_NAME'
 GROUP BY domain_name, composite_name, composite_revision
 ORDER BY composite_name, domain_name;
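Because the query only returns revisions that still have rows in cube_instance, the interesting revisions are the deployed ones that do not show up in the result. The following sketch compares the two lists. The function name, the input shapes, and the sample composite are hypothetical, and how you obtain the list of deployed revisions is left open.

```python
def revisions_without_instances(deployed, query_rows):
    """Return deployed revisions that have no instance rows left.

    deployed: iterable of (domain_name, composite_name, composite_revision).
    query_rows: rows from the query above, as
                (domain_name, composite_name, composite_revision, count).
    """
    with_instances = {
        (domain, name, revision)
        for domain, name, revision, count in query_rows
        if count > 0
    }
    return sorted(rev for rev in set(deployed) if rev not in with_instances)


# Made-up data: two deployed revisions, only 2.0 still has instances.
deployed = [
    ("default", "OrderProcess", "1.0"),
    ("default", "OrderProcess", "2.0"),
]
rows = [("default", "OrderProcess", "2.0", 42)]
print(revisions_without_instances(deployed, rows))
```

A revision on this list is only a candidate. A zero count says nothing about whether callers still reference that revision, so the decision to undeploy still belongs to the business and development teams.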
Another option is to set old composites to lazy loading. A few days after your server starts, you can check whether each composite has actually been loaded. This is also a 12c feature and can be set at the domain level or the composite level. Take a look at this post for further information.
7. Abort non-terminated old instances
This tip is useful if you have long-running instances and are using the row_movement strategy. For different reasons, usually related to some fault, your instances can be stuck in a recovery state for a long time. First of all, investigate and try to recover them. If you have no way to recover them, abort them. Non-terminated instances are not purgeable and will be kept in your database. Aborting them allows the purge to delete them instead of moving them to a newer partition.
To do that, you can use Error Hospital in the Fusion Middleware EM console.
You can also write your own script using the product API to do this in bulk. Keep in mind that Oracle decided not to document the API in 12c and discourages developers from using it.
8. Learn how your application works
Even if it sounds obvious, keep in mind that this requires time and a lot of observation. If your domain serves users globally, for instance, it will be difficult to find a low-volume slot for your online maintenance. If it’s regional, you know you’ll have about 8 hours available for online maintenance every day.
Some incidents occur because a business application sends messages in batches, creating stuck threads and contention in the BPEL layer if you don’t have throttling control set. This situation is usually triggered at the exact same hour every day, since it’s a scheduled process.
Check your logs and review the DB traffic graphs with the DBA after a few days or weeks. This will give you a good picture of your environment and help you be proactive and responsive in case of a real incident.
9. Disable auto-recovery during business hours
Once you understand how your application works, you will be able to define when the low-volume hours are (usually the non-business hours) and what the best timeframe is to have auto-recovery enabled. If auto-recovery runs during peak business hours, you can see stuck threads and contention in the BPEL layer, even if it doesn’t create a formal incident.
You can change the auto-recovery settings in EM >> Domain >> SOA Infrastructure >> BPEL Properties. See the screenshots below.
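When you pick the timeframe, remember that a low-volume window often crosses midnight (for example 22:00 to 06:00), which is easy to get wrong when checking it by hand or in a monitoring script. The helper below is a generic sketch of that check; the function name and the example window are assumptions and not part of the product configuration.

```python
from datetime import time


def in_low_volume_window(now, start, end):
    """True if `now` falls inside [start, end), including windows that wrap midnight."""
    if start <= end:
        # Same-day window, e.g. 01:00 to 05:00.
        return start <= now < end
    # Window that crosses midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


window_start, window_end = time(22, 0), time(6, 0)  # hypothetical window
print(in_low_volume_window(time(23, 30), window_start, window_end))
print(in_low_volume_window(time(12, 0), window_start, window_end))
```

The same check can be reused to verify that scheduled batch jobs and the purge do not overlap with the window you choose for auto-recovery.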
10. Set correct timeouts
During domain tuning you can use the following rule to define your timeouts. It was an Oracle recommendation during a project I was involved in:
syncMaxWaitTime < BPEL EJB's transaction timeout < Global Transaction Timeout (JTA timeout) < XA timeout < distributed_lock_timeout < http_timeout
The syncMaxWaitTime value must be lower than the http_timeout, with every timeout in between growing from the innermost layer to the outermost. With this rule in place, you avoid timeouts caused purely by a wrong setup.
SOA Suite is hugely database-dependent. Most of the time your stuck threads will be related to DB slowness or poor performance. Continuous maintenance will help you avoid incidents and keep your environment stable.
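The ordering rule is easy to verify automatically once you have collected the values. Here is a minimal sketch, assuming all timeouts are expressed in seconds and gathered by hand into a dictionary. The key names other than syncMaxWaitTime, distributed_lock_timeout and http_timeout are labels invented for this example, and the numbers are made up.

```python
# Order taken from the rule above, innermost timeout first.
TIMEOUT_ORDER = [
    "syncMaxWaitTime",
    "bpel_ejb_transaction_timeout",  # hypothetical label
    "jta_timeout",                   # hypothetical label
    "xa_timeout",                    # hypothetical label
    "distributed_lock_timeout",
    "http_timeout",
]


def timeout_violations(timeouts):
    """Return the (smaller, larger) pairs that break the strict ordering."""
    violations = []
    for smaller, larger in zip(TIMEOUT_ORDER, TIMEOUT_ORDER[1:]):
        if not timeouts[smaller] < timeouts[larger]:
            violations.append((smaller, larger))
    return violations


config = {
    "syncMaxWaitTime": 45,
    "bpel_ejb_transaction_timeout": 60,
    "jta_timeout": 50,  # too low: must be greater than the EJB timeout
    "xa_timeout": 120,
    "distributed_lock_timeout": 180,
    "http_timeout": 300,
}
print(timeout_violations(config))
```

An empty list means the chain is consistent. Any pair in the result names the two settings to revisit.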