Oracle SOA Suite 12c – BPEL: tuning a domain and improving performance
Author: Eduardo Barra Cordeiro
Category: Oracle
In recent years I had the chance to work with Oracle SOA Suite 12c, and I invested a lot of my time in BPEL. I was involved in more incidents and troubleshooting sessions than I can count during operations support. In this post I want to summarize some tips, based on my experience, for keeping your environment performing well at all times. Most of them are related to DB cleanup.
In my scenario I had large databases (over 500 GB) with 8 weeks of data retention. The purge process was set up with the row_movement strategy because the instances were long-running.
1. Work with a good Oracle DBA close to you
BPEL relies heavily on dehydration to keep the process state up to date. Even if you reduce the audit trail, dehydration will always be there. For that reason it is important to keep your queries executing in under 0.01 seconds. You, as a Fusion Middleware/WebLogic administrator, have no responsibility for the BPEL database, right? Wrong. You do. But often you have neither the permissions nor the skills to maintain the product database.
If you have the chance, work closely with a good DBA and help them learn about the product. I had the chance to work with two great DBAs, and for more than one year we had zero incidents (in product version 18.104.22.168, in 2015). In 12c I didn't have the same luck, partly because of product stabilization and partly because of bad luck with some incidents related to infrastructure (storage and network outages). But we were able to anticipate a lot of problems and business impact thanks to the teamwork and shared knowledge between the DBA and SOA operations.
2. For DB storage, prefer SSDs to hard disks. Keep everything under monitoring
During the project to move from 11g to 12c, I was able to choose between hard disks and SSDs for DB storage. I chose the second option. It helped keep performance stable at under 0.02 seconds per transaction. The database in that situation was 2.5 TB in size, with a weekly purge and a retention of 10 partitions. Calculate the ROI for your case, since SSDs will usually be more expensive. Calculate the loss per hour when your environment is down and use this information to justify the investment, or not.
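The ROI reasoning above can be sketched as a simple break-even calculation. This is only an illustration: the function name and both figures are hypothetical, and a real business case would include more factors (capacity growth, support contracts, SLA penalties).

```python
def ssd_break_even_hours(ssd_extra_cost: float, loss_per_hour: float) -> float:
    """Hours of avoided downtime needed to pay back the extra cost of SSD storage."""
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    return ssd_extra_cost / loss_per_hour


# Hypothetical figures: SSD storage costs 40,000 more than hard disks,
# and one hour of outage costs the business 5,000 (same currency).
hours = ssd_break_even_hours(ssd_extra_cost=40_000, loss_per_hour=5_000)
print(f"SSD pays for itself after {hours:.1f} hours of avoided downtime")
```

If the expected downtime avoided over the life of the storage is well above the break-even figure, the investment is easier to justify.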
3. Improve product queries and indexes. Execute periodic maintenance
The DBAs have good tools to monitor and evaluate the performance of the product queries. Each scenario is different and query usage will depend on your case. For instance, there is a huge difference between having only synchronous instances and using a long-running approach, and also between having events or not. Even in Fusion Middleware EM you can see poor performance and be affected by slow queries. Ask the DBA to analyze query performance and check whether Oracle DB suggests a new index or another profile.
Also, if you run the purge frequently, it is important to rebuild your indexes. Otherwise you will see your environment become slow after some time, and not only because of the database size. A good practice is to run periodic maintenance at the DB level every quarter or every six months, depending on your case. In Oracle 12c you can run it online, without downtime.
4. Purge your data in non-business hours, even in non-production environments
Oracle provides a set of scripts to purge (delete) your old data. Check them out in the official documentation. These scripts are not the same as the auto-purge feature, which was introduced in 12c. The scripts delete whole partitions, not only terminated instances. This task will be quite fast if you have only synchronous instances, and will take longer if you need the row-movement approach because your project uses long-running instances. Choose non-business hours for this activity. In very large databases, with row movement, it can take more than 50 hours.
How long to retain your data depends on the business requirements. If the business can't give you this information, keep at least 3 months of data (12 weeks). If your DB grows too much and too fast, 2 weeks or 1 month can be enough.
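As a small illustration of the retention arithmetic, the sketch below computes the cutoff date for a given retention in weeks. It is not part of Oracle's purge scripts; the function name and the run date are hypothetical, and the result would only be an input to whatever purge procedure you actually use.

```python
from datetime import date, timedelta


def purge_cutoff(today: date, retention_weeks: int) -> date:
    """Return the date before which data falls outside the retention window."""
    if retention_weeks <= 0:
        raise ValueError("retention_weeks must be positive")
    return today - timedelta(weeks=retention_weeks)


# 12 weeks of retention, evaluated on a hypothetical run date.
cutoff = purge_cutoff(date(2024, 6, 30), retention_weeks=12)
print(f"Data older than {cutoff.isoformat()} is eligible for purge")
```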
5. Avoid the auto-purge feature in business hours
As already described, the auto-purge feature, introduced in 12c, deletes only terminated instances, not partitions. It is a good option for non-production environments, where you can run it daily in non-business hours. I don't recommend it in production without good performance tests. It can create contention at the DB level due to its deletion strategy. Since each case is different, test and check your results, and then make a final decision. If you need help setting it up, take a look at this post.
6. Keep the domain clean. Undeploy old composite versions
In both non-production and production environments you need some kind of composite governance. I would like to say that each newly deployed composite version means the old one can be undeployed, but that is not true in most cases. Because of that, you need to discuss with the business and development teams what the strategy is for keeping your domain clean. I have usually seen teams define, as a general rule, keeping the default version plus one old version.
If you are unsure whether a composite is ready to be undeployed, you can run the query below to confirm whether there are running or terminated instances for a specific composite. It shows a count per domain and revision of that composite. Customize it as needed.
-- Count the BPEL instances stored for each revision of one composite
SELECT domain_name, composite_name, composite_revision, COUNT(*)
FROM YOUR_SOAINFRA.cube_instance
WHERE composite_name = 'YOUR_COMPOSITE_NAME'
GROUP BY domain_name, composite_name, composite_revision
ORDER BY composite_name, domain_name
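The "default version plus one old" rule, combined with the counts from the query above, can be sketched as follows. This is a minimal illustration, not a product feature: the function name, the dotted numeric revision format, and the sample counts are all assumptions, and it conservatively proposes only revisions with no instances left in the database.

```python
def undeploy_candidates(revisions, default_revision, instance_counts, keep_old=1):
    """Return revisions that could be undeployed under a 'default plus N old' rule.

    revisions: revision strings such as "1.0" or "2.1" (assumed dotted numbers)
    instance_counts: dict mapping revision -> COUNT(*) from the query
    """
    def key(rev):
        return tuple(int(part) for part in rev.split("."))

    others = sorted(
        (r for r in revisions if r != default_revision), key=key, reverse=True
    )
    # Keep the newest `keep_old` non-default revisions; the rest are candidates.
    candidates = others[keep_old:]
    # Only propose revisions with no instances left in the database.
    return [r for r in candidates if instance_counts.get(r, 0) == 0]


# Hypothetical composite with four deployed revisions.
counts = {"1.0": 0, "1.1": 14, "2.0": 3}
print(undeploy_candidates(["1.0", "1.1", "2.0", "2.1"], "2.1", counts))
```

Revision 1.1 is held back here because it still has instances; whether those must be purged or aborted first is a decision for the business and development teams.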
Another option is to set old composites to lazy loading. A few days after your server starts, you can check whether the composite has been loaded or not. This is also a 12c feature and can be set at the domain level or the composite level. Take a look at this post for further information.
7. Abort non-terminated old instances
This tip is useful if you have long-running instances and are using the row_movement strategy. For different reasons, usually related to some fault, your instances can be stuck in a recovery state for a long time. First of all, investigate and try to recover them. If there is no way to recover them, abort them. Non-terminated instances are not purgeable and will be kept in your database. Aborting them allows the purge to delete them instead of moving them to a newer partition.
To do that you can use the Error Hospital in the Fusion Middleware EM console.
You can also write your own script using the product API to do it in bulk. Keep in mind that in 12c Oracle decided not to document the API, and they do not encourage developers to use it.
8. Learn how your application works
Even if it is obvious, it will require time and a lot of observation. For instance, your domain may serve users globally and not only regionally. That means you will struggle to find a free slot for online maintenance during low-volume hours. Otherwise, it may be regional, and you know you have 8 hours available every day for online maintenance.
Some incidents can occur because a business application sends messages in batches, creating stuck threads and contention in the BPEL layer if you don't have throttling control set. Such a situation will usually be triggered at the exact same hour each day, since it is a scheduled process.
Check your logs, and review the DB traffic graphs with the DBA over some days or weeks. This will give you a good picture of your environment and help you be proactive and respond before a real incident.
9. Disable auto-recovery in business hours
Once you understand how your application works, you will be able to define the low-volume hours, usually related to non-business hours, and the best timeframe to have auto-recovery enabled. During peak business hours, with auto-recovery enabled, you can see stuck threads and contention in the BPEL layer, even if it does not create a formal incident.
You can change the auto-recovery settings in EM >> Domain >> SOA Infrastructure >> BPEL Properties. See the screens below.
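If you drive purge, maintenance, or recovery jobs from your own scheduler script, a small guard like the one below can check whether the current time falls inside the low-volume window. This is a sketch that does not touch any EM setting; the function name and the 22:00 to 06:00 window are hypothetical.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


# Hypothetical low-volume window: 22:00 to 06:00.
LOW_VOLUME_START = time(22, 0)
LOW_VOLUME_END = time(6, 0)

for sample in (time(23, 30), time(12, 0)):
    allowed = in_window(sample, LOW_VOLUME_START, LOW_VOLUME_END)
    print(sample.strftime("%H:%M"), "inside window" if allowed else "outside window")
```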
10. Set correct timeouts
During domain tuning you can use the following rule to define your timeouts. It was an Oracle recommendation during a project I was involved in:
syncMaxWaitTime < BPEL EJB's transaction timeout < Global Transaction Timeout (JTA timeout) < XA timeout < distributed_lock_timeout < http_timeout
The syncMaxWaitTime value must be the lowest and http_timeout the highest. With this rule in place, you avoid timeouts caused by a wrong setup.
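The ordering rule can be checked mechanically once you have collected the configured values. The sketch below assumes the values have already been gathered by hand into a dictionary; the keys bpel_ejb_transaction_timeout, jta_timeout, and xa_timeout are hypothetical labels for this example, and the numbers are illustrative, not recommendations.

```python
# Order taken from the rule above, from the lowest expected value to the highest.
TIMEOUT_ORDER = [
    "syncMaxWaitTime",
    "bpel_ejb_transaction_timeout",  # hypothetical label
    "jta_timeout",                   # hypothetical label
    "xa_timeout",                    # hypothetical label
    "distributed_lock_timeout",
    "http_timeout",
]


def timeout_violations(timeouts: dict) -> list:
    """Return adjacent (lower, higher) pairs that break the strictly increasing rule."""
    violations = []
    for lower, higher in zip(TIMEOUT_ORDER, TIMEOUT_ORDER[1:]):
        if timeouts[lower] >= timeouts[higher]:
            violations.append((lower, higher))
    return violations


# Illustrative values in seconds; the XA timeout is deliberately too low.
settings = {
    "syncMaxWaitTime": 45,
    "bpel_ejb_transaction_timeout": 300,
    "jta_timeout": 360,
    "xa_timeout": 350,
    "distributed_lock_timeout": 420,
    "http_timeout": 480,
}
for lower, higher in timeout_violations(settings):
    print(f"{lower} ({settings[lower]}s) must be lower than {higher} ({settings[higher]}s)")
```

Checking adjacent pairs is enough, because a strictly increasing chain implies that every earlier value is lower than every later one.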
SOA Suite is hugely database-dependent. Most of the time your stuck threads will be related to DB slowness or poor performance. Continuous maintenance will help you avoid incidents and keep your environment stable.