Thanks for digging into this. Yes, I suspected that delayed results going to subsequent commands would cause crashes (e.g. Cmd1 expects a number and times out, then Cmd3 gets that reply, and things go haywire).
So yes, buffer clearing is a good idea, but how long would one wait? SWS has a default timeout of 200 ms, plus the time a command takes on OnStep. It varies. Maybe a 250 ms minimum timeout is safe? I don't know.
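To make the failure mode concrete, here is a minimal Python sketch (the real driver is C++). `FakeLink`, the command names and the reply values are all hypothetical and only model the timing; this is not how SWS or the INDI driver is implemented.

```python
from collections import deque


class FakeLink:
    """Hypothetical stand-in for a link whose replies can arrive late."""

    def __init__(self):
        self.rx = deque()    # replies that have arrived and wait to be read
        self.in_flight = []  # (ticks_until_arrival, reply) still in transit

    def send(self, reply, delay_ticks):
        # Pretend the device will answer `reply` after `delay_ticks` ticks.
        self.in_flight.append((delay_ticks, reply))

    def tick(self):
        # Advance time by one tick; replies whose delay has elapsed arrive.
        remaining = []
        for delay, reply in self.in_flight:
            if delay <= 1:
                self.rx.append(reply)
            else:
                remaining.append((delay - 1, reply))
        self.in_flight = remaining

    def read(self):
        # Oldest waiting reply, or None to model a read timeout.
        return self.rx.popleft() if self.rx else None

    def flush(self):
        # Discard anything left over from earlier commands.
        self.rx.clear()


def run(flush_first):
    link = FakeLink()
    results = []
    # (command, reply, delay in ticks); Cmd1's reply is one tick too slow.
    for name, reply, delay in [("Cmd1", "42#", 2), ("Cmd2", "1#", 1), ("Cmd3", "0#", 1)]:
        if flush_first:
            link.flush()
        link.send(reply, delay)
        link.tick()  # wait one tick for the reply, then give up
        results.append((name, link.read()))
        link.tick()  # idle time before the next command
    return results


print(run(flush_first=False))
print(run(flush_first=True))
```

Without the flush, Cmd1 times out and every later command reads the previous command's reply. With the flush, Cmd1 still times out but the later commands line up again. Note that the flush only helps if the late reply has already arrived when the flush runs, which is exactly the "how long would one wait?" question.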
Yes, if you bypass the Wemos you will get different results. The thing is, the Wemos runs new software (SWS) which has some internal processing, and is a bit different from the addon. So what you are seeing is probably buried in SWS somewhere.
Regarding the rotator, are you testing with the conditional GX98 code? If not, then yes, the :r commands will take a long time and cause problems.
I don't have much to add here, so will wait until the experts finish their analysis.
In the meantime, should Alain push the small GX98 patch to Jasem, to prevent having others report the same issue?
No, I don't want to try the GX98 command, since I know it works.
What I am saying is that since the ":rx#" commands work without any trouble over USB, they should work in the same manner over WiFi.
There is no reason it should be different.
If it is, something is also wrong on the WiFi side.
If I understand correctly, people running over ASCOM have similar issues.
So maybe we have to look not only at what we do wrong on the INDI side, but also at what's going on in SWS.
I think there are two separate things here:
1. Investigate what is different in SWS vs. USB (and even the older WiFi addon)
2. Proper device detection (rotator, focuser, and in the future thermometers, and other features), and not polling devices that are not there.
Task 1 is a technical development task to get to the bottom of why SWS delays responses. Howard gave some clue in the thread I linked to earlier.
Task 2 needs to eventually go into Jasem's repo, because it is the efficient way of doing things. Only 10% or 20% of OnStep users have a rotator, so why make INDI "too chatty" for something that is not there? Focusers are used more, but most users who have them have only one. So why poll the second one? And so on.
I am okay with delaying task 2 until you get to the underlying issue for task 1. But task 2 must go in for efficiency's sake.
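A rough sketch of what task 2 means in practice, in Python for brevity. `FakeMount`, `detect_devices` and `poll_once` are hypothetical names, not the driver's actual functions; the point is only that detection happens once and polling is limited to what was found.

```python
class FakeMount:
    """Hypothetical mount that only has the devices listed in `present`."""

    def __init__(self, present):
        self.present = set(present)
        self.queries = []  # every query sent, so the traffic can be counted

    def query(self, device):
        self.queries.append(device)
        return device in self.present


def detect_devices(mount, candidates):
    # Ask once, at connect time, which optional devices actually exist.
    return [d for d in candidates if mount.query(d)]


def poll_once(mount, detected):
    # On each polling cycle, talk only to devices that were detected.
    for device in detected:
        mount.query(device)


mount = FakeMount(present=["focuser1"])
detected = detect_devices(mount, ["rotator", "focuser1", "focuser2"])
for _ in range(10):
    poll_once(mount, detected)
print(detected, len(mount.queries))
```

In this toy setup, ten polling cycles cost 13 queries in total (3 for detection plus 10 for the one focuser), instead of 30 if all three devices were polled every cycle.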
At least, after flashing the Wemos again with board library version 2.4.2, I can say it does not work any better than before with version 3.02 (for the :r commands), but I cannot reproduce the crashes so far.
I will leave it running tonight and see.
But I suppose I should keep this 2.4.2 version since it is what is advised by Howard.
Yes, I am using ESP8266 Board Manager version 2.4.2, and it works.
Espressif, the company that makes the ESP8266 and ESP32, does not have a good track record for stability on newer versions. They always introduce problems. The classic case of good hardware but inferior software.
> So yes, buffer clearing is a good idea, but how long would one wait? SWS has a default of 200 ms timeout, plus the time a command takes on OnStep. It varies. Maybe 250 ms minimum timeout is safe? I don't know.
For network (CONNECTION_TCP) specifically, it's detected on startup (like everything should be). I've got the timeout set to 2 seconds for network connections, otherwise 100 ms if wired. Messages in the INFO log (i.e. displayed for users) should be:
* Network based connection, detection timeouts set to 2 seconds
* Non-Network based connection, detection timeouts set to 0.1 seconds
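As a sketch of that selection logic (in Python, not the driver's C++): the string values `"tcp"` and `"serial"` and the function name `detection_timeout` are assumptions for illustration, while the timeouts and log messages are the ones quoted above.

```python
def detection_timeout(connection_type):
    """Return (timeout in seconds, INFO message) for a connection type.

    `connection_type` is a hypothetical string stand-in for the driver's
    connection kind ("tcp" playing the role of CONNECTION_TCP).
    """
    if connection_type == "tcp":
        return 2.0, "Network based connection, detection timeouts set to 2 seconds"
    return 0.1, "Non-Network based connection, detection timeouts set to 0.1 seconds"


for kind in ("tcp", "serial"):
    timeout, message = detection_timeout(kind)
    print(f"{kind}: {timeout} s -> {message}")
```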
With the latest version (which I don't think I have uploaded to GitHub yet), after running for about 9 hours I saw the buffer get flushed 29 times. Checking again a bit later, it is now 32, with 33 events indicating a read error of some type. So, at least on my network, that's about 3-4 an hour, sometimes in batches, sometimes alone.
One example: a flushIO error type of 0 indicates it read something, and -4 indicates it timed out. (The timeout is expected if there is no data for it to clear. While playing around I increased it to 10 ms.) In this example, :GX95# is correct, and you see the 2 second timeout on the RES to GX96, after which it returns from ReadScopeStatus, so things like the focusers don't get checked. You then see the next :GU# get called. (I have a really low polling interval, 50 ms, which is why it might look like it goes immediately to :GU#.)
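For readers who want to see the idea rather than the log, here is a minimal Python sketch of such a flush. It is not the driver's flushIO code: `flush_io` and the constant names are hypothetical, and only the return values 0 and -4 and the 10 ms wait are taken from the description above.

```python
import select
import socket

READ_SOMETHING = 0  # stale data was found and discarded
TIMED_OUT = -4      # nothing arrived within the timeout (expected when clean)


def flush_io(sock, timeout_s=0.010):
    """Drain stale bytes from `sock`; return (code, discarded bytes)."""
    discarded = b""
    while True:
        # Wait up to timeout_s for leftover data from an earlier command.
        readable, _, _ = select.select([sock], [], [], timeout_s)
        if not readable:
            break
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        discarded += chunk
    return (READ_SOMETHING if discarded else TIMED_OUT), discarded


# A local socket pair stands in for the link to the mount.
driver_end, mount_end = socket.socketpair()
mount_end.sendall(b"0#")     # a late reply to an earlier command
print(flush_io(driver_end))  # stale reply is discarded
print(flush_io(driver_end))  # nothing left, so the flush times out
driver_end.close()
mount_end.close()
```

The trade-off is the one already discussed: every flush on a clean buffer costs the full timeout, so a longer wait catches more late replies but slows down each polling cycle.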
One thing I don't like, and whose behavior I may change, is that returning false from the function sets the state to ERROR. Per the docs that is technically correct; in practice, however, it isn't entirely right. Yes, there was a communication error, but with the bounds checking in there now, and bailing, I'm not sure I'd say that the telescope is in an error state, and I worry about things bailing out. I may also change the communication errors from LOG_ERROR to LOG_WARN (they would still show up for users). Does anyone have thoughts on either of those changes?
Actually, after a quick check in KStars, I'm going to change that, because there are a couple of places where it bails out if the state changes, and then it won't restart jobs. (Slewing is one; I'm not completely sure about its effect on imaging.)
Last edit: 1 year 1 week ago by james_lan. Reason: Clarify expected error/timeout on flushIO log part
For reference as well: my esp8266 is at 2.7.1, and I haven't noticed esp8266 problems with it for OnStep or other projects. I think that's what all of mine were built with. (It is at that version because I haven't bothered to upgrade or downgrade since I last installed it.)
Though I mostly haven't tested with a rotator, except simulated via USB.
Sorry, I have some translation problems.
After 10 minutes of reading, Google Translate still gives me total nonsense.
I more or less understand the rest, but when it comes to "bail out" I have no idea what it could mean.
Is there any other wording for that?
The best way to explain my use of it would be: to return, escape, abort or stop (almost always early, i.e. before the work is done).
I'll make a note to avoid it in the future, because after searching Google I can understand your confusion! There are several meanings I didn't even think of, as a native speaker. Hopefully my somewhat verbose explanation above didn't confuse you or anyone else.
So, in tests running for multiple days, I see plenty of network timeouts but 0 crashes, even while accessing it with multiple things. That's about 6-ish timeouts per hour at the 2 second timeout on my network.
I did go ahead and change to LOG_WARN and to returning true from ReadScopeStatus, so the scope doesn't go into error status. (Focusers, rotators and such are still LOG_ERROR.)
Unless anyone sees a problem I'll see about a PR to main tonight.
NOTE: The logs will have a combination of INFO/WARN/ERROR. I mostly kept the wording the same, but I'm not sure I like it: This update aborted, will retry... (Kept the same for easy find/replace.)
Mount will not change state due to the LOG_ERRORs. (To prevent stopping captures, sequences, etc.)
Mount will not disconnect.
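The behavior change can be sketched like this, in Python rather than the driver's C++. `MountSketch`, `CommTimeout`, `poll` and the `timeout_is_error` switch are hypothetical names used only to contrast the old and new behavior; the real ReadScopeStatus does much more.

```python
import logging

log = logging.getLogger("onstep_sketch")


class CommTimeout(Exception):
    """Hypothetical error raised when the mount does not answer in time."""


class MountSketch:
    def __init__(self, get_status, timeout_is_error=False):
        self.get_status = get_status  # callable that may raise CommTimeout
        self.timeout_is_error = timeout_is_error
        self.state = "OK"
        self.last_status = None

    def read_scope_status(self):
        try:
            self.last_status = self.get_status()
        except CommTimeout:
            if self.timeout_is_error:
                # Old behavior: report failure, caller flags ERROR.
                log.error("Communication error, this update aborted, will retry...")
                return False
            # New behavior: warn, skip this update, keep the mount state.
            log.warning("Communication error, this update aborted, will retry...")
            return True
        return True

    def poll(self):
        # Stand-in for the framework: a False return puts the mount in ERROR.
        if not self.read_scope_status():
            self.state = "ERROR"


def always_timeout():
    raise CommTimeout()


new = MountSketch(always_timeout)
old = MountSketch(always_timeout, timeout_is_error=True)
new.poll()
old.poll()
print(new.state, old.state)
```

With the new behavior a single dropped reply costs one polling cycle and a warning in the log, while a running capture or sequence that watches the mount state is left alone.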
TODO (related or discovered due to this):
* Convert the Focusers & such that are in functions other than ReadScopeStatus to others.
* One issue is that the scope doesn't register as disconnected when on the network. I believe this is the case with the existing network code as well. (But you were more likely to have a crash than to notice that.)
* LX200 adjustable timeout. This is a bit trickier than I thought and needs to touch a lot more things, so I'm going to do it separately. (This is IMO better than doing it just for OnStep: it will cover all the LX200 functions we use, plus it should benefit anyone using a network adapter for an LX200, TeenAstro or others.)
* Alain's investigations into the rotator differences on SWS vs direct.
* Park: I'm not sure I'm understanding the process right (I don't use it normally), but when I set a position it's now slewing to it.
Question: Should I raise the network timeout from 2 sec to 3 sec or higher?
I see little reason not to keep the serial timeout at 0.1 sec. I don't see errors with it, even on the slowest platform we have. There is one edge case: serial to WiFi or Bluetooth to serial might be slower, but only if it's set up as a serial port, as opposed to accessing the serial port over a network. I don't have a Bluetooth serial module to test with right now (and I've had horrible luck with them). That should be resolved with the adjustable timeout patch later on.