Comments - Issues with MariaDB ColumnStore 1.0.4

7 years, 5 months ago David Thompson

Hi, the best thing to do would be to file a bug in our jira: https://jira.mariadb.org, project MCOL with more details. But first here are some pointers to check before doing this.

Tips on preparing your system, which are especially worth checking for a multi-server deployment, are here: https://mariadb.com/kb/en/mariadb/preparing-for-columnstore-installation/ Common issues are firewalls blocking ports or files, inconsistent passwords or ssh-keys, and inconsistent OS setup.

Tips on troubleshooting are available here: https://mariadb.com/kb/en/mariadb/system-troubleshooting-mariadb-columnstore/

If you just want to try the product out, the single server installation is obviously much simpler.

After filing a Jira issue, ideally you should run the columnstoreSupport -a tool and email its output to us separately, since our Jira is public.

 
7 years, 5 months ago Zeng Chun

The root cause may be related to the file /usr/local/mariadb/columnstore/etc/AlarmConfig.xml. In some cases the file size becomes zero, which prevents the PM node from starting. It is a new feature added in this release. Please help investigate the issue. Thank you very much.
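
A quick way to confirm this symptom on a PM node is to test whether the file exists and has a non-zero size. This is a minimal sketch, assuming the install path quoted above; the helper name `check_nonempty` is made up for illustration and is not part of ColumnStore.

```shell
# Hypothetical helper: report whether a config file is missing, empty, or OK.
check_nonempty() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
        return 2
    elif [ ! -s "$1" ]; then   # -s is true only when the file size is > 0
        echo "empty: $1"
        return 1
    fi
    echo "ok: $1"
}

# Example (path taken from the comment above):
#   check_nonempty /usr/local/mariadb/columnstore/etc/AlarmConfig.xml
```

Running it before and after a restart on each PM would show at which step the file is truncated.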

 
7 years, 5 months ago Zeng Chun

Thanks for your feedback. I have reviewed the web links and followed the guides on those pages carefully. I can occasionally start the system to Active status, so I don't think firewalls are the issue.

My configuration is 2UM3PM. I found some core files in the user home directory of PM2 and PM3 when restarting the system. They were probably caused by a ProcMon conflict after I manually restarted the PM processes with the command "/usr/local/mariadb/columnstore/bin/columnstore restart".

My workaround is to install the system as 2UM1PM and then add the other PM modules, i.e. PM2 and PM3. However, it is still difficult to restart the system to Active status. Thanks.

OS: CentOS release 6.5 (Final) 2.6.32_431-3 #2 SMP Wed Aug 24 14:40:53 CST 2016 x86_64

OS locale: en_US.UTF-8

---

Core was generated by `/usr/local/mariadb/columnstore/bin/ProcMon'.
Program terminated with signal 6, Aborted.
#0  0x00000032d3e325e5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 libxml2-2.7.6-14.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000032d3e325e5 in raise () from /lib64/libc.so.6
#1  0x00000032d3e33dc5 in abort () from /lib64/libc.so.6
#2  0x0000003dc98bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003dc98bcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003dc98bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003dc98bcd22 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x000000000045806e in void boost::throw_exception<boost::lock_error>(boost::lock_error const&) ()
#7  0x0000000000456de8 in startup::StartUp::installDir() ()
#8  0x00007fa1ffbca694 in config::Config::makeConfig(char const*) () from /usr/local/mariadb/columnstore/lib/libconfigcpp.so.1
#9  0x00007fa1ffdedb2f in logging::Message::Message(unsigned int) () from /usr/local/mariadb/columnstore/lib/libloggingcpp.so.1
#10 0x000000000042967f in processmonitor::MonitorLog::writeLog(int, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, logging::LOG_TYPE) ()
#11 0x0000000000439618 in processmonitor::ProcessMonitor::stopProcess(int, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool) ()
#12 0x000000000041d16a in main ()
 
7 years, 5 months ago David Thompson

This looks very similar to another case we just saw. Can you run the following on each of the servers:

localedef -i en_US -f UTF-8 en_US.UTF-8

Missing locale definitions for en_US and UTF-8 were the problem there. Although your server is running in UTF-8, this step may still be needed to make them available to the libraries. This will be added to the pre-req doc.
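
To see what each server currently reports before running localedef, a small check along these lines may help. It is only a sketch: the helper names are hypothetical, and it assumes `locale -a` lists the locale as `en_US.utf8`, as glibc systems typically do. Since the advice above is to run localedef even when the server already uses UTF-8, treat the result as informational rather than as a reason to skip the step.

```shell
# Hypothetical helpers: succeed if the locale list on stdin contains the
# requested locale. Names are compared lower-cased and without hyphens so
# that "en_US.UTF-8" matches the "en_US.utf8" spelling printed by `locale -a`.
normalize_locale() {
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr -d '-'
}

has_locale() {
    want=$(normalize_locale "$1")
    while IFS= read -r line; do
        [ "$(normalize_locale "$line")" = "$want" ] && return 0
    done
    return 1
}

# Example, to be run on every server:
#   locale -a | has_locale en_US.UTF-8 || echo "en_US.UTF-8 not listed on $(hostname)"
```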

Let me know if that fixes the issue.

 
7 years, 5 months ago Zeng Chun

It seems that the system can be started more easily than before. However, the system cannot be stopped successfully with either restartSystem or stopSystem.

The error messages are as follows:

ProcessMonitor[29575]: 24.095830 |0|0|0| E 18 CAL0000: EXCEPTION ERROR on setProcessStatus: Caught unknown exception!

controllernode[29910]: 37.135918 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)

ProcessManager[29910]: 36.016391 |0|0|0| E 17 CAL0000: line: 1211 STOPSYSTEM: Failed, timeout waiting for module to stop
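
When collecting messages like these from several servers, a small filter can cut down the noise. The sketch below assumes the layout of the three sample lines above, where the field after `|0|0|0|` is a severity letter (`E` in every sample); the meaning of other letters is not stated in this thread, and the function name is invented for illustration.

```shell
# Hypothetical filter: print only lines whose severity field is "E".
# ASSUMED layout, taken from the sample lines quoted above:
#   name[pid]: timestamp |a|b|c| E nn CALnnnn: message
columnstore_errors() {
    grep -E '\|[0-9]+\|[0-9]+\|[0-9]+\| E [0-9]+ CAL[0-9]+:'
}

# Example, where "$logfile" is any log file collected from a server:
#   columnstore_errors < "$logfile"
```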

By the way, can you help investigate the issue of AlarmConfig.xml mentioned in my first reply?

 
7 years, 5 months ago David Thompson

We are not aware of issues that could cause AlarmConfig.xml to get emptied. The localedef issue causes some weird behavior because the install is only partial, so I wanted to rule that out first (we will update the doc, and the postCfg script is being updated to log better info if this is not done). On the stop, it would be helpful to check if there is anything in the logs for the other servers. Also, the localedef command needs to be performed on all servers, not just the install one.

 
7 years, 5 months ago David Hill

For starters, you shouldn't be using '/usr/local/mariadb/columnstore/bin/columnstore restart' to restart the system. This will just restart the columnstore service on the local node. As documented in the KB guides, you will want to use the mcsadmin console for these commands. These are best run from pm1, which is where the install took place.

mcsadmin shutdownsystem y: stops all processes on all nodes.

mcsadmin startsystem: starts all processes. If ssh-key is not set up, you need to provide the user password as the third argument when this is run after a shutdown; the password is not required when running after a stopsystem.

mcsadmin stopsystem: stops all the DB processes on all nodes and leaves Proc-Mgt running.
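
As an illustration, a full restart driven from pm1 with the commands above could be wrapped as follows. This is a sketch only: the `run` and `restart_cluster` functions and the `DRY_RUN` switch are hypothetical, and only the mcsadmin subcommands come from this comment. The password argument that startsystem needs when ssh keys are not set up is deliberately not modeled.

```shell
# Hypothetical wrapper: print each command, and execute it only when
# DRY_RUN=0. The default is to print, so nothing is stopped by accident.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "0" ] || return 0
    "$@"
}

# Full restart from pm1: stop all processes on all nodes, then start them.
# Assumes ssh keys are set up, so startsystem needs no password argument.
restart_cluster() {
    run mcsadmin shutdownsystem y || return 1
    run mcsadmin startsystem
}

# Example:
#   restart_cluster              # prints the two commands only
#   DRY_RUN=0 restart_cluster    # actually runs them
```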

 
7 years, 5 months ago David Hill

Additional information to help diagnose the issue. Once you install the packages on the initial server, pm1, run post-install and postConfigure. If you get to the point where it says Starting system processes but it seems to hang or not return, here are some things to check:

On pm1, create the alias if you haven't already:

. /usr/local/mariadb/columnstore/bin/columnstoreAlias

Then run the following command and check the process status:

mcsadmin getsysteminfo

Check whether ProcMon is ACTIVE on all configured servers. If not, check the log files on the associated server to see what error ProcMon is reporting. Also make sure ProcMgr is ACTIVE on pm1.
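
The ACTIVE check above can be narrowed with a small filter. This sketch ASSUMES the status output has one line per process with the process name and its status on the same line; the exact layout is not shown in this thread. The name to search for is passed as an argument, and `ProcessMonitor` in the example is a guess based on the log lines quoted earlier.

```shell
# Hypothetical filter: read status output on stdin and print every line that
# mentions the given process name without the whole word ACTIVE on it.
# ASSUMPTION: one line per process, name and status on the same line.
not_active() {
    grep -F -- "$1" | grep -vw 'ACTIVE'
}

# Example, run on pm1:
#   mcsadmin getsysteminfo | not_active ProcessMonitor
```

No output means every matching line carried the word ACTIVE; any printed line points at the server whose logs to read first.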

logs are located in:

/var/log/mariadb/columnstore

Generally, when ProcMon/ProcMgr isn't active, it's because of one of these issues:

  1. If using external storage, /etc/fstab isn't set up on a PM.
  2. A messaging issue between the servers is causing the ProcMons and ProcMgr to fail to communicate. Make sure all server firewalls are disabled, along with SELinux.
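
The SELinux part of this check can be scripted per server. A minimal sketch, assuming the standard `getenforce` tool, which prints `Enforcing`, `Permissive`, or `Disabled`; the helper name is hypothetical. Permissive mode logs but does not block, so it is treated as harmless here, although the advice above is to disable SELinux outright.

```shell
# Hypothetical helper: interpret the one-word output of `getenforce`.
# Succeeds (returns 0) only when SELinux could actually block traffic.
selinux_blocks() {
    case "$1" in
        Enforcing) return 0 ;;   # policy is enforced
        *)         return 1 ;;   # Permissive or Disabled
    esac
}

# Example on CentOS 6, to be run on every server:
#   selinux_blocks "$(getenforce)" && echo "SELinux is enforcing on $(hostname)"
#   service iptables status
```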

 
7 years, 5 months ago Zeng Chun

Thanks for your feedback.

I reinstalled the MariaDB ColumnStore 1.0.4 from scratch after resetting the locale on all worker nodes. Right now, the installation can be finished successfully and the system status is Active. So it seems that it is not an issue of firewalls. In addition, I used local disks for installation.

I have copied the ssh key from PM1 to the other nodes. I also need to copy the ssh key from UM1 to UM2 so that UM1 can configure the replication between UM1 and UM2.

My configuration is 2UM3PM. I can reproduce the zero-size AlarmConfig.xml by restarting the processes on a given PM node with the command "/usr/local/mariadb/columnstore/bin/columnstore restart". You are right, I should use mcsadmin to manage the whole system, but I had to try other workarounds when I failed to stop the system. I have checked the firewall settings and stopped the iptables service, and I still encountered the problem when stopping the system. I observed that some processes could not be stopped while the status of the PM node was failed. So please help investigate the stop issue. Thank you very much.

 
7 years, 5 months ago David Thompson

I've created Jira bug https://jira.mariadb.org/browse/MCOL-396 to track the AlarmConfig issue with direct PM server restart.

 
7 years, 5 months ago Zeng Chun

A similar bug, MCOL-404, was submitted. I used the non-root installation guide and encountered the same issue. Thanks.

 
7 years, 5 months ago David Thompson

Yes, this was a miss in the 1.0.4 release. This will be fixed in our next RC release 1.0.5. Thanks for testing!

 
7 years, 5 months ago Zeng Chun

Ok. Thank you very much.

 