My apologies for the past day or so of downtime.
I was at a work conference all of last week. On the last morning, around 4am, before I headed back to my timezone, “something” inside my Kubernetes cluster took a dump.
While I can remotely reboot nodes, and even access them, the scope of what went wrong was far beyond what I can accomplish remotely from my phone.
After returning home yesterday evening, I started plugging away a bit and quickly realized something was seriously wrong with the cluster. From previous experience, I knew it would be quicker to just tear it down, rebuild it, and restore from backups. So, I started that process.
However, since I had not seen my wife in a week, I felt spending some time with her was slightly more important at the time. But I was able to finish getting everything restored today.
Due to these issues, I will be rebuilding some areas of my infrastructure to be slightly more redundant.
Whereas before I had bare-metal machines running Ubuntu, going forward I will be leveraging Proxmox for compute clustering and HA, along with Ceph for storage HA.
That being said, sometime soon I will have Ansible playbooks set up to get everything pushed out and running.
Again, my apologies for the downtime. It was completely unexpected, and came out of the blue. I honestly still have no idea what happened.
The best suspicion I have is disk failure… and yet after rebooting the machine, it came back to life?
Regardless, I will work to improve this moving forward. Also, I don’t plan on being out of town soon, so that will help too.
There may be some slight downtime later on as I work on things and move them around. If that is the case, it will be short. But for now, the goal is just restoring my other services and getting back up and running.
Update 2023-07-23 CST
There are still a few kinks being worked out. I have noticed things are still occasionally disconnecting.
Still working on ironing out the issues. Please bear with me.
(This issue appears to be due to a single Realtek NIC in the cluster… Realtek = bad)
Update 9:30pm CST
Well, it has been a “fun” evening. I have been finding issues left and right.
- A piece of bad fiber cable.
- The aforementioned server with a Realtek NIC, which was bringing down the entire cluster.
- STP/RSTP issues, likely caused by the above two issues.
Still working and improving…
Update 2023-07-24
Update 9am CST
Working out a few minor kinks still. Finish line is in sight.
Update 5pm CST
Happened to find an SFP+ module which was in the process of dying. Swapped it out with a new one, and… magically, many of the spotty network issues went away.
Have new fiber ordered, will install later this week.
Update 9pm CST
- Broken/Intermittent SFP+ Module replaced.
- Server with crappy Realtek NIC removed. Re-added server with 10G SFP+ connectivity.
- Clustered servers moved to dedicated switch.
- New fiber stuff ordered to replace longer-distance (50ft) 10G copper runs.
I am aware of current performance issues. These will start going away as I expand out the cluster. Still focusing on rebuilding everything to a working state.