背景

前几天在客户的生产环境上发现一个问题，开发的一个java服务，cpu占用率特别高，严重影响机器性能。当时重启了这个服务，临时解决了这个问题，还好在重启之前生成了一下javaCore文件，这几天分析java core后发现这是一个bug。

分析 javaCore

通过分析间隔几秒生成的javaCore文件。占用cpu资源最多的几个线程都在执行如下操作：

由于现象是cpu占用过高，结合上面的堆栈，以及这个项目是用netty建立了一个tcp服务器，猜测是某个地方可能有死循环，导致cpu过高。根据堆栈提供的信息，直接查看netty源码，从栈顶往下找，第一个netty类，SelectorUtil：

/*
 * Copyright 2012 The Netty Project
 *
 * The Netty Project licenses this file to you under the Apache License,
 * version 2.0 (the "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at:
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */
package org.jboss.netty.channel.socket.nio;

import org.jboss.netty.logging.InternalLogger;
import org.jboss.netty.logging.InternalLoggerFactory;
import org.jboss.netty.util.internal.SystemPropertyUtil;

import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Selector;
import java.util.concurrent.TimeUnit;

final class SelectorUtil {
    private static final InternalLogger logger =
        InternalLoggerFactory.getInstance(SelectorUtil.class);

    static final int DEFAULT_IO_THREADS = Runtime.getRuntime().availableProcessors() * 2;
    static final long DEFAULT_SELECT_TIMEOUT = 500;
    static final long SELECT_TIMEOUT =
            SystemPropertyUtil.getLong("org.jboss.netty.selectTimeout", DEFAULT_SELECT_TIMEOUT);
    static final long SELECT_TIMEOUT_NANOS = TimeUnit.MILLISECONDS.toNanos(SELECT_TIMEOUT);
    static final boolean EPOLL_BUG_WORKAROUND =
            SystemPropertyUtil.getBoolean("org.jboss.netty.epollBugWorkaround", false);

    // Workaround for JDK NIO bug.
    //
    // See:
    // - http://bugs.sun.com/view_bug.do?bug_id=6427854
    // - https://github.com/netty/netty/issues/203
    static {
        String key = "sun.nio.ch.bugLevel";
        try {
            String buglevel = System.getProperty(key);
            if (buglevel == null) {
                System.setProperty(key, "");
            }
        } catch (SecurityException e) {
            if (logger.isDebugEnabled()) {
                logger.debug("Unable to get/set System Property '" + key + '\'', e);
            }
        }
        if (logger.isDebugEnabled()) {
            logger.debug("Using select timeout of " + SELECT_TIMEOUT);
            logger.debug("Epoll-bug workaround enabled = " + EPOLL_BUG_WORKAROUND);
        }
    }

    static Selector open() throws IOException {
        return Selector.open();
    }

    static int select(Selector selector) throws IOException {
        try {
            return selector.select(SELECT_TIMEOUT);
        } catch (CancelledKeyException e) {
            if (logger.isDebugEnabled()) {
                logger.debug(
                        CancelledKeyException.class.getSimpleName() +
                        " raised by a Selector - JDK bug?", e);
            }
            // Harmless exception - log anyway
        }
        return -1;
    }

    private SelectorUtil() {
        // Unused
    }
}

查看select()方法，在catch代码块一个一个日志输出，提示jdk bug但是有没有详细说明。
往上看有一个静态语句块，看注释，提到了一个jdk的bugNullPointerException in Selector.open()看了一下跟我遇到的不一样。在往上看有一个系统属性org.jboss.netty.epollBugWorkaround，看名字好像是为了绕开epollBug用的。好像跟这个有关系，直接查看这个变量被谁引用了。发现在AbstractNioSelector.run()方法中有引用，二在上面的线程栈中，正好也有这个方法，直接进入这个方法查看。

public void run() {
        thread = Thread.currentThread();
        startupLatch.countDown();

        int selectReturnsImmediately = 0;
        Selector selector = this.selector;

        if (selector == null) {
            return;
        }
        // use 80% of the timeout for measure
        final long minSelectTimeout = SelectorUtil.SELECT_TIMEOUT_NANOS * 80 / 100;
        boolean wakenupFromLoop = false;
        for (;;) {
            wakenUp.set(false);

            try {
                long beforeSelect = System.nanoTime();
                int selected = select(selector);
                if (selected == 0 && !wakenupFromLoop && !wakenUp.get()) {
                    long timeBlocked = System.nanoTime() - beforeSelect;
                    if (timeBlocked < minSelectTimeout) {
                        boolean notConnected = false;
                        // loop over all keys as the selector may was unblocked because of a closed channel
                        for (SelectionKey key: selector.keys()) {
                            SelectableChannel ch = key.channel();
                            try {
                                if (ch instanceof DatagramChannel && !ch.isOpen() ||
                                        ch instanceof SocketChannel && !((SocketChannel) ch).isConnected() &&
                                                // Only cancel if the connection is not pending
                                                // See https://github.com/netty/netty/issues/2931
                                                !((SocketChannel) ch).isConnectionPending()) {
                                    notConnected = true;
                                    // cancel the key just to be on the safe side
                                    key.cancel();
                                }
                            } catch (CancelledKeyException e) {
                                // ignore
                            }
                        }
                        if (notConnected) {
                            selectReturnsImmediately = 0;
                        } else {
                            if (Thread.interrupted() && !shutdown) {
                                // Thread was interrupted but NioSelector was not shutdown.
                                // As this is most likely a bug in the handler of the user or it's client
                                // library we will log it.
                                //
                                // See https://github.com/netty/netty/issues/2426
                                if (logger.isDebugEnabled()) {
                                    logger.debug("Selector.select() returned prematurely because the I/O thread " +
                                            "has been interrupted. Use shutdown() to shut the NioSelector down.");
                                }
                                selectReturnsImmediately = 0;
                            } else {
                                // Returned before the minSelectTimeout elapsed with nothing selected.
                                // This may be because of a bug in JDK NIO Selector provider, so increment the counter
                                // which we will use later to see if it's really the bug in JDK.
                                selectReturnsImmediately ++;
                            }
                        }
                    } else {
                        selectReturnsImmediately = 0;
                    }
                } else {
                    selectReturnsImmediately = 0;
                }

                if (SelectorUtil.EPOLL_BUG_WORKAROUND) {
                    if (selectReturnsImmediately == 1024) {
                        // The selector returned immediately for 10 times in a row,
                        // so recreate one selector as it seems like we hit the
                        // famous epoll(..) jdk bug.
                        rebuildSelector();
                        selector = this.selector;
                        selectReturnsImmediately = 0;
                        wakenupFromLoop = false;
                        // try to select again
                        continue;
                    }
                } else {
                    // reset counter
                    selectReturnsImmediately = 0;
                }

                // 'wakenUp.compareAndSet(false, true)' is always evaluated
                // before calling 'selector.wakeup()' to reduce the wake-up
                // overhead. (Selector.wakeup() is an expensive operation.)
                //
                // However, there is a race condition in this approach.
                // The race condition is triggered when 'wakenUp' is set to
                // true too early.
                //
                // 'wakenUp' is set to true too early if:
                // 1) Selector is waken up between 'wakenUp.set(false)' and
                //    'selector.select(...)'. (BAD)
                // 2) Selector is waken up between 'selector.select(...)' and
                //    'if (wakenUp.get()) { ... }'. (OK)
                //
                // In the first case, 'wakenUp' is set to true and the
                // following 'selector.select(...)' will wake up immediately.
                // Until 'wakenUp' is set to false again in the next round,
                // 'wakenUp.compareAndSet(false, true)' will fail, and therefore
                // any attempt to wake up the Selector will fail, too, causing
                // the following 'selector.select(...)' call to block
                // unnecessarily.
                //
                // To fix this problem, we wake up the selector again if wakenUp
                // is true immediately after selector.select(...).
                // It is inefficient in that it wakes up the selector for both
                // the first case (BAD - wake-up required) and the second case
                // (OK - no wake-up required).

                if (wakenUp.get()) {
                    wakenupFromLoop = true;
                    selector.wakeup();
                } else {
                    wakenupFromLoop = false;
                }

                cancelledKeys = 0;
                processTaskQueue();
                selector = this.selector; // processTaskQueue() can call rebuildSelector()

                if (shutdown) {
                    this.selector = null;

                    // process one time again
                    processTaskQueue();

                    for (SelectionKey k: selector.keys()) {
                        close(k);
                    }

                    try {
                        selector.close();
                    } catch (IOException e) {
                        logger.warn(
                                "Failed to close a selector.", e);
                    }
                    shutdownLatch.countDown();
                    break;
                } else {
                    process(selector);
                }
            } catch (Throwable t) {
                logger.warn(
                        "Unexpected exception in the selector loop.", t);

                // Prevent possible consecutive immediate failures that lead to
                // excessive CPU consumption.
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    // Ignore.
                }
            }
        }
    }

看这个方法，第一个映入眼帘的就是一个大的for (;;){...}死循环，难道是这里的问题？看着这里最符合我们遇到的现象，cpu占用过高。但是又感觉不太可能，因为这个是tcpserver，一般都是这么实现的，在一个死循环里，处理发生的事件。详细看循环里面的代码，里面有个select(selector)方法，这个方式会最终调用jdk的Selector.select(long timeout)方法，查看jdk源码，看方法注释，这个方法会阻塞的，所以理论上除非有事件就绪或者超时，否则该方法会阻塞，整个循环是不会占用太多CPU资源的。继续往下看，看到引用org.jboss.netty.epollBugWorkaround的地方，发现这个地方，做了一个判断，看注释可以发现这是jdk的一个bug，selector在某些情况下会不会阻塞，会立即返回，而且也没有就绪的时间，相当于在空转，所以会导致cpu过高，这段代码已经处理了这个问题，解决方法就是发生bug的时候新建一个selector，废弃旧的selector，但是这个属性如果不设置的话默认是false，这也解释了为什么项目上会遇到问题，要想解决这个问题，在启动的时候要加参数 -Dorg.jboss.netty.epollBugWorkaround=true。此参数是netty 3系列中的解决方案，在4中已经默认开启了此参数，所以4系列不需要做任何操作。这里有一个逻辑就是说程序怎么判断自身已经陷入了这个bug中呢，在netty3中是这样的，首先设置一个阈值，final long minSelectTimeout = SelectorUtil.SELECT_TIMEOUT_NANOS * 80 / 100;其中SelectorUtil.SELECT_TIMEOUT_NANOS是每次调用Selector.select(long timeout)的参数，也就是超时时间，minSelectTimeout用来衡量是否触发了这个bug，程序在每次select(long timeout)前后做了计时，只要这个时间小于minSelectTimeout，且返回就绪事件数是0，就认为触发了一次bug，selectReturnsImmediately用来记录触发bug的次数，如果这个次数达到了1024次，就进行重建selector，绕开这个bug。

结论

经过查询，发现两个先关的jdkbug

JDK-6670302 : (se) NIO selector wakes up with 0 selected keys infinitely [lnx 2.4]
JDK-6403933 : (se) Selector doesn’t block on Selector.select(timeout) (lnx)
这两个bug说明中，都是说这个bug跟linux内核有关，但实际上我不仅在linux环境遇到过，在windows下同样遇到过这种问题。解决方案也是一样的。所以感觉不像是操作系统的问题，可能还是jdk本身的问题。

实际上这个bug是比较底层的bug，只要用了nio的相关功能，就有可能会触发这个bug，目前知道的有以下几个软件也提供了响应的解决方案：
Hazelcast 增加启动参数 -Dhazelcast.io.selectorMode=selectwithfix
Jetty JVM NIO Bug

yuandm.tech

yuandm.tech

标签：selector

Selector doesn’t block on Selector.select(timeout)

背景

分析 javaCore

结论