背景
前几天在客户的生产环境上发现一个问题,开发的一个java服务,cpu占用率特别高,严重影响机器性能。当时重启了这个服务,临时解决了这个问题,还好在重启之前生成了一下javaCore文件,这几天分析java core后发现这是一个bug。
分析 javaCore
通过分析间隔几秒生成的javaCore文件。占用cpu资源最多的几个线程都在执行如下操作:
由于现象是cpu占用过高,结合上面的堆栈,以及这个项目是用netty建立了一个tcp服务器,猜测是某个地方可能有死循环,导致cpu过高。根据堆栈提供的信息,直接查看netty源码,从栈顶往下找,第一个netty类,SelectorUtil:
/*
* Copyright 2012 The Netty Project
*
* The Netty Project licenses this file to you under the Apache License,
* version 2.0 (the "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at:
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/
package org.jboss.netty.channel.socket.nio;
import org.jboss.netty.logging.InternalLogger;
import org.jboss.netty.logging.InternalLoggerFactory;
import org.jboss.netty.util.internal.SystemPropertyUtil;
import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Selector;
import java.util.concurrent.TimeUnit;
final class SelectorUtil {
private static final InternalLogger logger =
InternalLoggerFactory.getInstance(SelectorUtil.class);
static final int DEFAULT_IO_THREADS = Runtime.getRuntime().availableProcessors() * 2;
static final long DEFAULT_SELECT_TIMEOUT = 500;
static final long SELECT_TIMEOUT =
SystemPropertyUtil.getLong("org.jboss.netty.selectTimeout", DEFAULT_SELECT_TIMEOUT);
static final long SELECT_TIMEOUT_NANOS = TimeUnit.MILLISECONDS.toNanos(SELECT_TIMEOUT);
static final boolean EPOLL_BUG_WORKAROUND =
SystemPropertyUtil.getBoolean("org.jboss.netty.epollBugWorkaround", false);
// Workaround for JDK NIO bug.
//
// See:
// - http://bugs.sun.com/view_bug.do?bug_id=6427854
// - https://github.com/netty/netty/issues/203
static {
String key = "sun.nio.ch.bugLevel";
try {
String buglevel = System.getProperty(key);
if (buglevel == null) {
System.setProperty(key, "");
}
} catch (SecurityException e) {
if (logger.isDebugEnabled()) {
logger.debug("Unable to get/set System Property '" + key + '\'', e);
}
}
if (logger.isDebugEnabled()) {
logger.debug("Using select timeout of " + SELECT_TIMEOUT);
logger.debug("Epoll-bug workaround enabled = " + EPOLL_BUG_WORKAROUND);
}
}
static Selector open() throws IOException {
return Selector.open();
}
static int select(Selector selector) throws IOException {
try {
return selector.select(SELECT_TIMEOUT);
} catch (CancelledKeyException e) {
if (logger.isDebugEnabled()) {
logger.debug(
CancelledKeyException.class.getSimpleName() +
" raised by a Selector - JDK bug?", e);
}
// Harmless exception - log anyway
}
return -1;
}
private SelectorUtil() {
// Unused
}
}
查看select()
方法,在catch
代码块一个一个日志输出,提示jdk bug但是有没有详细说明。
往上看有一个静态语句块,看注释,提到了一个jdk的bugNullPointerException in Selector.open()看了一下跟我遇到的不一样。在往上看有一个系统属性org.jboss.netty.epollBugWorkaround
,看名字好像是为了绕开epollBug用的。好像跟这个有关系,直接查看这个变量被谁引用了。发现在AbstractNioSelector.run()
方法中有引用,二在上面的线程栈中,正好也有这个方法,直接进入这个方法查看。
public void run() {
thread = Thread.currentThread();
startupLatch.countDown();
int selectReturnsImmediately = 0;
Selector selector = this.selector;
if (selector == null) {
return;
}
// use 80% of the timeout for measure
final long minSelectTimeout = SelectorUtil.SELECT_TIMEOUT_NANOS * 80 / 100;
boolean wakenupFromLoop = false;
for (;;) {
wakenUp.set(false);
try {
long beforeSelect = System.nanoTime();
int selected = select(selector);
if (selected == 0 && !wakenupFromLoop && !wakenUp.get()) {
long timeBlocked = System.nanoTime() - beforeSelect;
if (timeBlocked < minSelectTimeout) {
boolean notConnected = false;
// loop over all keys as the selector may was unblocked because of a closed channel
for (SelectionKey key: selector.keys()) {
SelectableChannel ch = key.channel();
try {
if (ch instanceof DatagramChannel && !ch.isOpen() ||
ch instanceof SocketChannel && !((SocketChannel) ch).isConnected() &&
// Only cancel if the connection is not pending
// See https://github.com/netty/netty/issues/2931
!((SocketChannel) ch).isConnectionPending()) {
notConnected = true;
// cancel the key just to be on the safe side
key.cancel();
}
} catch (CancelledKeyException e) {
// ignore
}
}
if (notConnected) {
selectReturnsImmediately = 0;
} else {
if (Thread.interrupted() && !shutdown) {
// Thread was interrupted but NioSelector was not shutdown.
// As this is most likely a bug in the handler of the user or it's client
// library we will log it.
//
// See https://github.com/netty/netty/issues/2426
if (logger.isDebugEnabled()) {
logger.debug("Selector.select() returned prematurely because the I/O thread " +
"has been interrupted. Use shutdown() to shut the NioSelector down.");
}
selectReturnsImmediately = 0;
} else {
// Returned before the minSelectTimeout elapsed with nothing selected.
// This may be because of a bug in JDK NIO Selector provider, so increment the counter
// which we will use later to see if it's really the bug in JDK.
selectReturnsImmediately ++;
}
}
} else {
selectReturnsImmediately = 0;
}
} else {
selectReturnsImmediately = 0;
}
if (SelectorUtil.EPOLL_BUG_WORKAROUND) {
if (selectReturnsImmediately == 1024) {
// The selector returned immediately for 10 times in a row,
// so recreate one selector as it seems like we hit the
// famous epoll(..) jdk bug.
rebuildSelector();
selector = this.selector;
selectReturnsImmediately = 0;
wakenupFromLoop = false;
// try to select again
continue;
}
} else {
// reset counter
selectReturnsImmediately = 0;
}
// 'wakenUp.compareAndSet(false, true)' is always evaluated
// before calling 'selector.wakeup()' to reduce the wake-up
// overhead. (Selector.wakeup() is an expensive operation.)
//
// However, there is a race condition in this approach.
// The race condition is triggered when 'wakenUp' is set to
// true too early.
//
// 'wakenUp' is set to true too early if:
// 1) Selector is waken up between 'wakenUp.set(false)' and
// 'selector.select(...)'. (BAD)
// 2) Selector is waken up between 'selector.select(...)' and
// 'if (wakenUp.get()) { ... }'. (OK)
//
// In the first case, 'wakenUp' is set to true and the
// following 'selector.select(...)' will wake up immediately.
// Until 'wakenUp' is set to false again in the next round,
// 'wakenUp.compareAndSet(false, true)' will fail, and therefore
// any attempt to wake up the Selector will fail, too, causing
// the following 'selector.select(...)' call to block
// unnecessarily.
//
// To fix this problem, we wake up the selector again if wakenUp
// is true immediately after selector.select(...).
// It is inefficient in that it wakes up the selector for both
// the first case (BAD - wake-up required) and the second case
// (OK - no wake-up required).
if (wakenUp.get()) {
wakenupFromLoop = true;
selector.wakeup();
} else {
wakenupFromLoop = false;
}
cancelledKeys = 0;
processTaskQueue();
selector = this.selector; // processTaskQueue() can call rebuildSelector()
if (shutdown) {
this.selector = null;
// process one time again
processTaskQueue();
for (SelectionKey k: selector.keys()) {
close(k);
}
try {
selector.close();
} catch (IOException e) {
logger.warn(
"Failed to close a selector.", e);
}
shutdownLatch.countDown();
break;
} else {
process(selector);
}
} catch (Throwable t) {
logger.warn(
"Unexpected exception in the selector loop.", t);
// Prevent possible consecutive immediate failures that lead to
// excessive CPU consumption.
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
// Ignore.
}
}
}
}
看这个方法,第一个映入眼帘的就是一个大的for (;;){...}
死循环,难道是这里的问题?看着这里最符合我们遇到的现象,cpu占用过高。但是又感觉不太可能,因为这个是tcpserver,一般都是这么实现的,在一个死循环里,处理发生的事件。详细看循环里面的代码,里面有个select(selector)
方法,这个方式会最终调用jdk的Selector.select(long timeout)
方法,查看jdk源码,看方法注释,这个方法会阻塞的,所以理论上除非有事件就绪或者超时,否则该方法会阻塞,整个循环是不会占用太多CPU资源的。继续往下看,看到引用org.jboss.netty.epollBugWorkaround
的地方,发现这个地方,做了一个判断,看注释可以发现这是jdk的一个bug,selector在某些情况下会不会阻塞,会立即返回,而且也没有就绪的时间,相当于在空转,所以会导致cpu过高,这段代码已经处理了这个问题,解决方法就是发生bug的时候新建一个selector,废弃旧的selector,但是这个属性如果不设置的话默认是false,这也解释了为什么项目上会遇到问题,要想解决这个问题,在启动的时候要加参数 -Dorg.jboss.netty.epollBugWorkaround=true
。此参数是netty 3系列中的解决方案,在4中已经默认开启了此参数,所以4系列不需要做任何操作。这里有一个逻辑就是说程序怎么判断自身已经陷入了这个bug中呢,在netty3中是这样的,首先设置一个阈值,final long minSelectTimeout = SelectorUtil.SELECT_TIMEOUT_NANOS * 80 / 100;
其中SelectorUtil.SELECT_TIMEOUT_NANOS是每次调用Selector.select(long timeout)
的参数,也就是超时时间,minSelectTimeout
用来衡量是否触发了这个bug,程序在每次select(long timeout)
前后做了计时,只要这个时间小于minSelectTimeout
,且返回就绪事件数是0,就认为触发了一次bug,selectReturnsImmediately
用来记录触发bug的次数,如果这个次数达到了1024次,就进行重建selector,绕开这个bug。
结论
经过查询,发现两个先关的jdkbug
- JDK-6670302 : (se) NIO selector wakes up with 0 selected keys infinitely [lnx 2.4]
- JDK-6403933 : (se) Selector doesn’t block on Selector.select(timeout) (lnx)
这两个bug说明中,都是说这个bug跟linux内核有关,但实际上我不仅在linux环境遇到过,在windows下同样遇到过这种问题。解决方案也是一样的。所以感觉不像是操作系统的问题,可能还是jdk本身的问题。
实际上这个bug是比较底层的bug,只要用了nio的相关功能,就有可能会触发这个bug,目前知道的有以下几个软件也提供了响应的解决方案:
-
Hazelcast 增加启动参数
-Dhazelcast.io.selectorMode=selectwithfix
- Jetty JVM NIO Bug