Ticket #1497 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Segmentation fault after several hours

Reported by: j0k3r
Owned by: rakshasa
Priority: highest
Component: libtorrent
Version:
Severity: normal
Keywords:
Cc:

Description

Hello. We've tested this several times: rtorrent runs in a screen session, but I've noticed that it crashes with a segmentation fault 12-24 hours after starting, having run fine until then.

We are using CentOS 5.2 (Linux 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:49:24 EDT 2008 i686 athlon i386 GNU/Linux) with all updates.

libtorrent-0.12.3, rtorrent-0.8.3, curl-7.19.0 (compiled from source to use the newest version). All announces are http, not https.

Caught Segmentation fault, dumping stack:
0 rtorrent [0x807eb71]
1 rtorrent [0x80846c6]
2 [0xfa0420]
3 /usr/local/lib/libtorrent.so.11(_ZN7torrent9PollEPoll7performEv+0x55) [0x14a845]
4 rtorrent [0x80bfced]
5 rtorrent [0x8080081]
6 /lib/libc.so.6(libc_start_main+0xdc) [0x6f2dec]
7 rtorrent(_ZNK7torrent8FileList14free_diskspaceEv+0x9d) [0x8050d21]
Aborted

How can I solve that problem?

Attachments

garbage-cut.txt Download (3.2 KB) - added by synthemesc.antichrist@gmail.com 3 years ago.
cut of an strace of a crash

Change History

  Changed 3 years ago by j0k3r

Just got it again

Caught Segmentation fault, dumping stack:
0 rtorrent [0x807eb71]
1 rtorrent [0x80846c6]
2 [0x350420]
3 /usr/local/lib/libtorrent.so.11(_ZN7torrent9PollEPoll7performEv+0x55) [0x1fa845]
4 rtorrent [0x80bfced]
5 rtorrent [0x8080081]
6 /lib/libc.so.6(libc_start_main+0xdc) [0x366dec]
7 rtorrent(_ZNK7torrent8FileList14free_diskspaceEv+0x9d) [0x8050d21]
Aborted

  Changed 3 years ago by anonymous

Please compile libtorrent/rtorrent with debug information and install them without stripping debug information. The trace generated by your version without debug info is practically useless.

Even better for debugging would be to run rtorrent (with debug info) in gdb and post (pastebin) the output of "backtrace full" when it crashes.

  Changed 3 years ago by j0k3r

libtorrent-0.12.3: ./configure --enable-debug=yes
rtorrent-0.8.3: ./configure --enable-debug=yes
Is that OK? So, waiting for a crash, huh.

  Changed 3 years ago by j0k3r

Got it again:

Caught Segmentation fault, dumping stack:
0 rtorrent [0x807eb71]
1 rtorrent [0x80846c6]
2 [0xd26420]
3 /usr/local/lib/libtorrent.so.11(_ZN7torrent9PollEPoll7performEv+0x55) [0xac9845]
4 rtorrent [0x80bfced]
5 rtorrent [0x8080081]
6 /lib/libc.so.6(libc_start_main+0xdc) [0x6f2dec]
7 rtorrent(_ZNK7torrent8FileList14free_diskspaceEv+0x9d) [0x8050d21]
Aborted

  Changed 3 years ago by j0k3r

So it has crashed again with the same output. Is that enough? Or should I recompile it with other parameters?

follow-up: ↓ 8   Changed 3 years ago by H

Well, as mentioned by someone else above, you'd have to run rtorrent under gdb to get more useful debug info. First, try compiling with --enable-extra-debug (even though I'm not sure how much "extra" there is). Then, in order to run under gdb, instead of entering "rtorrent" in your shell, enter "gdb rtorrent". You'll get a prompt saying "(gdb)". Enter "run" and it will start running rtorrent. If and when rtorrent crashes, enter "bt full" and you'll get a full backtrace.

  Changed 3 years ago by anonymous

And if gdb complains about a lack of symbol information, that means it got installed without debug info again, so fix that first.

in reply to: ↑ 6   Changed 3 years ago by j0k3r

Replying to H:

First, try compiling with --enable-extra-debug (even though I'm not sure how much "extra" there is).

What should I recompile?

Then, in order to run under gdb, instead of entering "rtorrent" in your shell, enter "gdb rtorrent". You'll have a prompt saying "(gdb)". Enter "run", it will start running rtorrent. If and when rtorrent crashes, enter "bt full" and you'll get a full backtrace.

Already running under gdb and waiting for a crash.

  Changed 3 years ago by H

Recompile everything, libtorrent and rtorrent. In fact, I have no idea what it really does, but when in doubt just recompile everything. Throw in a make clean while you're at it XD

  Changed 3 years ago by j0k3r

libtorrent/rtorrent are already recompiled with enable-debug; I haven't seen any info about "enable-extra-debug". OK, rtorrent crashed in gdb:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208531232 (LWP 5984)]
torrent::PollEPoll::perform (this=0x99c1de0) at poll_epoll.cc:57
57 Table::value_type entry = m_table[e->file_descriptor()];

(gdb) bt full
#0 torrent::PollEPoll::perform (this=0x99c1de0) at poll_epoll.cc:57
itr = (epoll_event *) 0x99c1e08
last = (epoll_event *) 0x99c1e14
#1 0x080bfced in core::PollManagerEPoll::poll (this=0x99c1c30, timeout={m_time = 4}) at poll_manager_epoll.cc:74
No locals.
#2 0x08080081 in main (argc=1, argv=0xbf88de84) at main.cc:276
firstArg = <value optimized out>

So, what's the problem?

  Changed 3 years ago by j0k3r

It was like that, sorry:

(gdb) bt full
#0 torrent::PollEPoll::perform (this=0x99c1de0) at poll_epoll.cc:57
itr = (epoll_event *) 0x99c1e08
last = (epoll_event *) 0x99c1e14
#1 0x080bfced in core::PollManagerEPoll::poll (this=0x99c1c30, timeout={m_time = 4}) at poll_manager_epoll.cc:74
No locals.
#2 0x08080081 in main (argc=1, argv=0xbf88de84) at main.cc:276
firstArg = <value optimized out>

  Changed 3 years ago by Monsta

Hmmm... so the problem occurs in poll_epoll.cc at line 147.

if (itr->events & EPOLLERR && itr->data.ptr != NULL && event_mask((Event*)itr->data.ptr) & EPOLLERR)
  ((Event*)itr->data.ptr)->event_error();

When itr->data.ptr is really NULL the event_mask(...) method is executed anyway.
And then it fails at line 57 since "Event *e" is NULL:

Table::value_type entry = m_table[e->file_descriptor()];

Well, the first portion of code can be changed to something like this:

if (itr->events & EPOLLERR && itr->data.ptr != NULL)
  if (event_mask((Event*)itr->data.ptr) & EPOLLERR)
    ((Event*)itr->data.ptr)->event_error();

That may solve the problem for most cases.
But what if the socket can remove itself somewhere between "(itr->data.ptr != NULL ...)" and "(event_mask((Event*)itr->data.ptr ...)"... can it? Well, if that can happen, some synchronization (maybe mutexes?) might be useful.
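A note on the two forms compared above: in C++ the built-in && operator short-circuits, so in the single-condition form event_mask(...) is not evaluated once itr->data.ptr != NULL is false, and the nested form should behave the same way. A NULL check of either shape cannot catch a pointer that is non-NULL but refers to an object that has already been deleted. The sketch below only illustrates the short-circuit rule; Event, event_mask and the two check functions are hypothetical stand-ins, not libtorrent's real code.

```cpp
#include <cassert>

// Hypothetical stand-in for libtorrent's Event; not the real class.
struct Event {
  int fd;
};

int g_mask_calls = 0;  // counts how often event_mask() actually runs

// Hypothetical stand-in for event_mask(): it dereferences e, so calling it
// with a null pointer would be undefined behaviour.
int event_mask(Event* e) {
  ++g_mask_calls;
  return e->fd >= 0 ? 1 : 0;
}

// Single-condition form, shaped like the quoted line from poll_epoll.cc.
bool check_combined(bool has_error, Event* e) {
  return has_error && e != nullptr && (event_mask(e) & 1);
}

// Nested form, shaped like the proposed change.
bool check_nested(bool has_error, Event* e) {
  if (has_error && e != nullptr)
    if (event_mask(e) & 1)
      return true;
  return false;
}
```

Calling either function with a null Event pointer returns false without ever entering event_mask(), which is why the rewrite alone would not be expected to change the crash behaviour.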

  Changed 3 years ago by josef

I think I know now what's going on. I had someone else do this too and check the local variables with me, and it turned out that the kernel returns an event with a bogus file descriptor (like -237211), and rtorrent crashes when trying to look that up in the table.

This can only mean that rtorrent's associated Event object has been deleted but the kernel still had pending events for the file descriptor (because otherwise it would've crashed earlier while setting the event). And this can only mean that libcurl told us "I'm closing this socket" and hence rtorrent removed its relevant internal structures for it. But then libcurl doesn't actually close the socket, so the events are left in the kernel epoll queue, and when rtorrent gets to them it will crash because the Event structure for them will have been deleted.

I'll try fixing this by not relying on the pointer that the kernel returns at all.
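The idea of not relying on the returned pointer can be sketched roughly as follows. This is not the content of the patch, only a minimal Linux-only illustration under assumptions: the file descriptor itself is stored in the epoll user data, and each returned event is looked up in a table of live objects, so an event whose object is gone gets skipped instead of dereferenced. Event, EventTable, watch_fd, poll_once and demo_stale_event are all hypothetical names.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <map>

// Hypothetical stand-in for libtorrent's Event; not the real class.
struct Event {
  int fd;
  int handled;  // how many events were dispatched to this object
};

// Live Event objects keyed by file descriptor.
typedef std::map<int, Event*> EventTable;

// Register fd for reading, storing the descriptor (not a pointer) as user data.
bool watch_fd(int epfd, int fd) {
  epoll_event ev = {};
  ev.events = EPOLLIN;
  ev.data.fd = fd;
  return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Wait once and dispatch. Returns how many events were skipped because
// their descriptor no longer has an Event in the table.
int poll_once(int epfd, EventTable& table, int timeout_ms) {
  epoll_event events[16];
  int n = epoll_wait(epfd, events, 16, timeout_ms);
  int skipped = 0;
  for (int i = 0; i < n; ++i) {
    EventTable::iterator it = table.find(events[i].data.fd);
    if (it == table.end()) {  // stale event: the object is already gone
      ++skipped;
      continue;
    }
    ++it->second->handled;
  }
  return skipped;
}

struct DemoResult {
  int handled_while_live;
  int skipped_after_delete;
};

// Simulates the scenario described above with a pipe: the table entry is
// dropped while the descriptor stays open and registered with epoll.
DemoResult demo_stale_event() {
  DemoResult result = {-1, -1};
  int fds[2];
  if (pipe(fds) != 0) return result;
  int epfd = epoll_create1(0);
  if (epfd < 0) { close(fds[0]); close(fds[1]); return result; }

  Event ev = {fds[0], 0};
  EventTable table;
  table[fds[0]] = &ev;

  char byte = 'x';
  if (watch_fd(epfd, fds[0]) && write(fds[1], &byte, 1) == 1) {
    poll_once(epfd, table, 100);  // object still present: dispatched
    result.handled_while_live = ev.handled;
    table.erase(fds[0]);          // object "deleted", fd never closed
    result.skipped_after_delete = poll_once(epfd, table, 100);
  }
  close(epfd);
  close(fds[0]);
  close(fds[1]);
  return result;
}
```

In this sketch the second poll still receives the readable event for the pipe, but the table lookup finds nothing and the event is ignored. A real fix would probably also want to remove the descriptor from the epoll set at that point so the same event does not keep coming back.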

  Changed 3 years ago by josef

OK, here's an experimental fix for this. Please try it and report whether it works:

 http://ovh.ttdpatch.net/~jdrexler/rt/experimental/poll-by-fd.diff

  Changed 3 years ago by anonymous

Patch seems to fix the problem for me.

  Changed 3 years ago by ramier

About 30 hours and I haven't segfaulted yet, awesome.

  Changed 3 years ago by j0k3r

I have been waiting for a crash since 13.10 but had "no luck" :) So today I've just restarted rtorrent with the new patch - hope it will be OK ;)

Changed 3 years ago by synthemesc.antichrist@gmail.com

cut of an strace of a crash

  Changed 3 years ago by synthemesc.antichrist@gmail.com

I am having the same problem. It seemed to start happening when I upgraded cURL from 7.16.2 to 7.19.0. I tried both the new version (0.12.3/0.8.3) and the old version (0.12.2/0.8.3) that worked fine before. Even after recompiling each of them, the same thing happens.

I tried limiting the number of open sockets to 200, but it still keeps happening.

(see attachment garbage-cut.txt for the tail end of the strace where the crash happened)

follow-up: ↓ 20   Changed 3 years ago by anonymous

that "trace" only contains the syscalls used to generate the crash message, so it's pretty useless

in reply to: ↑ 19   Changed 3 years ago by anonymous

thx for the patch josef. what would rtorrent be without your patches.. :) perhaps you'll find a solution for ticket 929 too ;)

  Changed 3 years ago by anonymous

Is this patch incorporated into svn 1073? It seems that way, anyway.

Thanks for the nice work! r1072 and r1073 are working better and better!

follow-up: ↓ 23   Changed 3 years ago by anonymous

No, it's not in r1073.

in reply to: ↑ 22   Changed 3 years ago by anonymous

The patch doesn't seem to change anything on my system. I'm running svn r1073 on Debian lenny with kernel 2.6.25-2, and I applied the patch above as well. I configured rtorrent and libtorrent with ./configure --prefix=/usr --enable-debug=yes and compiled with gcc-4.3-1.

curl 7.18.2 (x86_64-pc-linux-gnu) libcurl/7.18.2 OpenSSL/0.9.8g zlib/1.2.3.3 libidn/1.8 libssh2/0.18

here's a full backtrace

#0 0x00007f3eb8ae8d01 in ?? () from /usr/lib/libcurl.so.4
No symbol table info available.
#1 0x00007f3eb8ae90ca in ?? () from /usr/lib/libcurl.so.4
No symbol table info available.
#2 0x00007f3eb8afec71 in curl_multi_remove_handle () from /usr/lib/libcurl.so.4
No symbol table info available.
#3 0x0000000000479bb1 in core::CurlStack::remove_get (this=0x838900, get=0xdeedd0) at curl_stack.cc:198
itr = {_M_cur = 0x189e568, _M_first = 0x189e560, _M_last = 0x189e760, _M_node = 0x8389d8}
#4 0x000000000047c2a5 in core::CurlGet::close (this=0xdeedd0) at curl_get.cc:103
No locals.
#5 0x00007f3eb88a7a05 in torrent::TrackerHttp::close (this=0xded260) at tracker_http.cc:162
No locals.
#6 0x00007f3eb88a7a5a in torrent::TrackerHttp::receive_failed (this=0x197ec20, msg={static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x459f347623f18 <Address 0x459f347623f18 out of bounds>}}) at tracker_http.cc:243
No locals.
#7 0x00007f3eb88aa90c in sigc::internal::slot_call1<sigc::bound_mem_functor1<void, torrent::TrackerHttp, std::string>, void, std::string const&>::call_it (rep=0xdeeea0, a_1=<value optimized out>) at /usr/include/sigc++-2.0/sigc++/functors/mem_fun.h:1851
No locals.
#8 0x0000000000479f1a in core::CurlStack::transfer_done (this=<value optimized out>, handle=<value optimized out>, msg=0x4af080 "Timed out") at /usr/include/sigc++-2.0/sigc++/signal.h:690
itr = {_M_cur = 0x189e568, _M_first = 0x189e560, _M_last = 0x189e760, _M_node = 0x8389d8}
#9 0x00000000004350e9 in main (argc=<value optimized out>, argv=0x7fffc13766f8) at ../rak/functional_fun.h:102
firstArg = <value optimized out>
e = <value optimized out>

  Changed 3 years ago by anonymous

That's a different crash. I don't know why libcurl crashes when trying to remove a transfer (announce) from it, or even how to work around that.

  Changed 3 years ago by anonymous

I think I had the same crash as the OP. I was trying to use 0.8.3/0.12.3 when I first saw it. My first try was to go back to 0.8.0/0.12.0, but then rtorrent needed literally minutes to react to keystrokes; not sure why, though.

So I updated to SVN-1073 but still had the crashes, from epoll I guess. Depending on activity in rtorrent (idle, hashing, adding/removing torrents) they happened more or less frequently.

Then I applied the "poll-by-fd" patch and it has been running without fault for over 24 hours since. Fingers crossed that it stays this way.

  Changed 3 years ago by anonymous

same system as 3 posts before

rtorrent: PollEPoll::modify(...) epoll_ctl(6, 1 -> 1, 170, [0x867620:8]) = 9: Bad file descriptor

Program exited with code 0377.
(gdb) bt full
No stack.

  Changed 3 years ago by rakshasa

  • status changed from new to closed
  • resolution set to fixed

The patch was committed in r1074. The above does seem like it's related to a libcurl bug, so closing this ticket.

Try a newer libcurl version, and if r1074 still crashes with that, create a new ticket with the relevant information.
