After looking at this a lot more and thinking it over: the main problem is the timeout, but even a longer timeout will still crash things sometimes, for a few reasons. (Yay wireless, also *#$%* ISPs.)
After a timeout, the connection often does still deliver the response, just late. Unfortunately tcflush doesn't clear it, nor do a few other things I tried. Say I sent :A?# (returns ABC, where A = max stars, B = current star, C = number of stars in the alignment, or B and C are flipped), followed by :GU# (the nPada/250-looking string that carries status info), and then :GR#. If :A?# timed out, :GU# would get back 600#, :GR# would get nPada/250, and so on down the line. (Occasionally something would get it back in step.) I didn't have input verification in every function, so when the shifted data hit one without it, the driver could crash. Once I added input verification I still had a problem: everything stayed shifted, so while it wouldn't crash, you'd get endless error messages.
After playing with tcflush and other approaches that didn't work, I eventually decided to just read with a very short timeout (1 ms; maybe lower it, since the data should already be in the buffer?) until the buffer is cleared. It seems kind of inelegant, but it works. I also introduced getCommandDoubleResponse (no corresponding function in lx200drivers), which has the same calling convention as the SingleCharErrorLongResponse one, with a double added. It includes the clear-out step, and after testing for a while (with quite large logfiles!) I can't reproduce the continuous shift or the crashes. In the past I could: after some period of time, be it 5 minutes or 2 hours, the issue would eventually show up. It doesn't clean up the code much compared with doing it in the LongResponse function, but it does set things up more nicely for checksummed commands/resend. (I think.)
I will add the buffer clearing to the other functions, and I do wonder about adding it to the lx200driver calls, because they also operate over the network. I don't have another LX200 to test with, but I suspect that if someone hooked up an ESP8266 in its original 'dumb' serial port <-> WiFi role, the same problem might be seen.
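For anyone following along, the "read with a very short timeout until the buffer is clear" idea can be sketched roughly as below. This is only an illustration under assumptions, not the code on the branch: `drainStaleInput`, `quietMs` and `maxReads` are hypothetical names, and it uses plain POSIX `poll`/`read` on the file descriptor rather than any INDI helper.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Hypothetical helper (not from INDI or lx200driver): throw away any bytes
// already waiting on fd. Returns the number of bytes discarded, or -1 if
// poll() or read() fails.
long drainStaleInput(int fd, int quietMs = 1, int maxReads = 64)
{
    assert(fd >= 0);
    char scratch[64];
    long discarded = 0;

    // maxReads bounds the loop so a device that never stops talking
    // cannot keep us here forever.
    for (int i = 0; i < maxReads; ++i)
    {
        struct pollfd pfd = {fd, POLLIN, 0};
        int ready = poll(&pfd, 1, quietMs);
        if (ready < 0)
            return -1; // poll failed (includes being interrupted by a signal)
        if (ready == 0)
            break;     // nothing arrived within quietMs: buffer is clear

        ssize_t n = read(fd, scratch, sizeof(scratch));
        if (n < 0)
            return -1; // read failed
        if (n == 0)
            break;     // end of file / peer closed the link
        discarded += n;
    }
    return discarded;
}
```

The intent would be to call something like this just before writing each command, so a late reply from an earlier timed-out command cannot be mistaken for the new reply. A return of 0 is the normal case and simply means there was nothing stale to clear.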
Occasional incorrect values (because not all functions flush the buffer yet. This should mostly affect strings, like :GU# or similar. I will be adding flushes to reduce/eliminate this.)
INFO/ERROR messages appear only when something makes it past that buffer flush, so there should be only 1 for any given message.
Lots of error messages on disconnection (need to work on this.)
This is on the network-timeouts branch so far. Please test and let me know if there are any crashes or problems on wifi (or serial).
I usually assume I am the cause when something goes wrong, but with this problem there is something strange going on regardless.
I am also testing the comms intensively and have found that, at the very least, something is wrong on the OnStep side.
There is definitely different behavior between USB and WiFi, and it is not due to the network but to the handling of the bridge between serial (OnStep side) and the network.
Sending the same sequences over USB and then over WiFi I have two different results:
- USB does not crash, time out, or hang
- WiFi hangs for all Rotator commands
I still do not understand why, but in my opinion there should be no difference.
I connected directly to the serial port in place of the Wemos (bypassing the Wemos) and did not observe errors.
So the errors must be due to the way the Wemos handles serial to WiFi (I also observed hangs when stressing the web interface while communicating over port 9998 in parallel).
There are also hangs over port 9999 for the Android app.
So I will try to see what's going on on the Wemos side. But first I must understand how it is handled ;-(
Thanks for digging into this. Yes, I suspected that delayed results going to the subsequent commands would crash things (e.g. expecting a number on Cmd1, which times out, then Cmd3 gets it, and things go haywire).
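To make the shift concrete, here is a toy model of it, purely as an illustration. Nothing in it is driver code: `FakeLink` and `runSequence` are made-up names, the replies are just strings in a queue, and `<RA>#` is a placeholder for whatever :GR# would really return. It assumes the reply to the first command arrives only after that command's read has already timed out.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Toy stand-in for the serial/WiFi link: replies that have arrived but
// have not been read yet.
struct FakeLink
{
    std::deque<std::string> pending;

    // Returns the oldest unread reply, or "" to model a read timeout.
    std::string readReply()
    {
        if (pending.empty())
            return "";
        std::string r = pending.front();
        pending.pop_front();
        return r;
    }

    void drain() { pending.clear(); }
};

// Sends three commands in order (:A?#, :GU#, :GR#) and records what each
// read actually receives. The reply to the first command is late.
std::vector<std::string> runSequence(bool drainBeforeSend)
{
    const std::vector<std::string> replies = {"600#", "nPada/250#", "<RA>#"};
    FakeLink link;
    std::vector<std::string> seen;

    for (std::size_t i = 0; i < replies.size(); ++i)
    {
        if (drainBeforeSend)
            link.drain();                        // discard stale replies first
        if (i != 0)
            link.pending.push_back(replies[i]);  // on-time reply
        seen.push_back(link.readReply());
        if (i == 0)
            link.pending.push_back(replies[0]);  // late reply shows up now
    }
    assert(seen.size() == replies.size());
    return seen;
}
```

In this model, `runSequence(false)` gives a timeout, then `600#` for :GU# and `nPada/250#` for :GR#, which is the permanent shift described earlier in the thread. `runSequence(true)` still loses the first command to the timeout, but the next two reads line up with their own replies again.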
So yes, buffer clearing is a good idea, but how long would one wait? SWS has a default of 200 ms timeout, plus the time a command takes on OnStep. It varies. Maybe 250 ms minimum timeout is safe? I don't know.
Yes, if you bypass the Wemos you will get different results. The thing is, the Wemos runs new software (SWS) which has some internal processing, and is a bit different from the addon. So what you are seeing is probably buried in SWS somewhere.
Regarding the rotator, are you testing with the conditional GX98 code? If not, then yes, the :r commands will take a long time and cause problems.
I don't have much to add here, so will wait until the experts finish their analysis.
In the meantime, should Alain push the small GX98 patch to Jasem, to prevent having others report the same issue?
No, I don't want to try the GX98 command, knowing it works.
What I am saying is that since the ":rx#" commands work without any trouble over USB, they should work the same way over WiFi.
There is no reason it should be different.
If it is, something is also wrong on the WiFi side.
If I understand correctly, people running over ASCOM have similar issues.
So maybe we have to look not only at what we do wrong on the INDI side but also at what's going on in SWS.
I think there are two separate things here:
1. Investigate what is different in SWS vs. USB (and even the older WiFi addon)
2. Proper device detection (rotator, focuser, and in the future thermometers, and other features), and not polling devices that are not there.
Task 1 is a technical development task to get to the bottom of why SWS delays responses. Howard gave some clues in the thread I linked to earlier.
Task 2 needs to eventually go into Jasem's repo, because it is the efficient way of doing things. Only 10% or 20% of OnStep users have a rotator, so why make INDI "too chatty" for something that is not there? Focusers are used more, but most users who have them have only one. So why poll the second one? And so on.
I am okay with delaying task 2 until you get to the underlying issue for task 1. But task 2 must go in for efficiency's sake.
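As a rough picture of what task 2 could look like, here is a minimal sketch: probe each optional device once at connect time, remember the answer, and only poll the ones that are present. Everything in it is hypothetical, including the `OptionalDevice` struct, the two functions, and the placeholder command strings, which are deliberately not real OnStep commands.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record for one optional device (rotator, focuser, ...).
struct OptionalDevice
{
    std::string name;
    std::string probeCommand; // placeholder for whatever detects the device
    std::string pollCommand;  // placeholder for the periodic status query
    bool present;
};

// Callback that sends a probe and reports whether the device answered.
using Probe = std::function<bool(const std::string &)>;

// Run once after connecting: ask each device whether it exists.
void detectOnce(std::vector<OptionalDevice> &devices, const Probe &answers)
{
    for (auto &d : devices)
        d.present = answers(d.probeCommand);
}

// Run on every polling cycle: only devices found at connect time are queried.
std::vector<std::string> commandsToPoll(const std::vector<OptionalDevice> &devices)
{
    std::vector<std::string> out;
    for (const auto &d : devices)
        if (d.present)
            out.push_back(d.pollCommand);
    return out;
}
```

With a setup that has one focuser and no rotator, the polling list would contain only the focuser query, so the slow rotator commands would never be sent at all.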
At least, after flashing the Wemos again, this time with board library version 2.4.2, I can say it does not work any better than before with version 3.02 (for the :r commands), but I cannot reproduce the crashes so far.
I will leave it running tonight and see.
But I suppose I should keep this 2.4.2 version since it is what is advised by Howard.
Yes, I am using ESP8266 Board Manager version 2.4.2, and it works.
Espressif, the company that makes the ESP8266 and ESP32, does not have a good track record for stability on newer versions. They always introduce problems. The classic case of good hardware but inferior software.
> So yes, buffer clearing is a good idea, but how long would one wait? SWS has a default of 200 ms timeout, plus the time a command takes on OnStep. It varies. Maybe 250 ms minimum timeout is safe? I don't know.
For network (CONNECTION_TCP) connections specifically, it's detected on startup (like everything should be). I've got it set to 2 seconds, otherwise 100 ms if wired. The messages in the INFO log (i.e. displayed to users) should be:
* Network based connection, detection timeouts set to 2 seconds
* Non-Network based connection, detection timeouts set to 0.1 seconds
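In code terms, the choice above boils down to something like the following sketch. The `LinkKind` enum and `detectionTimeout` function are hypothetical names for illustration; only the two values (2 seconds for network, 0.1 seconds otherwise) come from the post.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for however the driver knows its connection type.
enum class LinkKind
{
    Serial,
    Tcp
};

// Pick the timeout once at startup, based on the connection type.
constexpr std::chrono::milliseconds detectionTimeout(LinkKind kind)
{
    return kind == LinkKind::Tcp ? std::chrono::milliseconds(2000)
                                 : std::chrono::milliseconds(100);
}

static_assert(detectionTimeout(LinkKind::Tcp).count() == 2000,
              "network links get 2 seconds");
static_assert(detectionTimeout(LinkKind::Serial).count() == 100,
              "wired links get 0.1 seconds");
```

Keeping the decision in one place like this would make it easy to tune the network value later if 2 seconds turns out to be too long or too short for SWS.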
As of the latest version, which I don't think I've uploaded to GitHub quite yet, after a run of about 9 hours I see 29 buffer flushes. Actually, I just checked a bit later and it's now 32, with 33 events indicating a read error of some type. So at least on my network that's about 3-4 an hour, sometimes in batches, sometimes alone.
One example (flushIO error type = 0 indicates it read something; -4 indicates it timed out, which is expected if there is no data for it to clear. In playing around I increased the timeout to 10 ms.) In this one, :GX95# is correct and you see the 2 second timeout on the RES to GX96, after which it returns from ReadScopeStatus, so things like the focusers don't get checked. You then see the next :GU# get called. (I have a really low polling interval, 50 ms, which is why it might look like it goes immediately to :GU#.)
One thing I don't like, and whose behavior I may change, is that returning false from the function sets the state to ERROR. Per the docs that is technically correct, but in practice it isn't entirely right. Yes, there was a communication error, but with the bounds checking in there now, and bailing out early, I'm not sure I'd say the telescope is in an error state, and I worry about other things bailing out because of it. I may also change the communication errors from LOG_ERROR to LOG_WARN (they would still show up for users). Does anyone have thoughts on either of those changes?
Actually, after a quick check in KStars, I'm going to change that, because there are a couple of places where it bails out if the state changes and then won't restart jobs. (Slewing is one; I'm not completely sure about its effect on imaging.)
Last edit: 7 months 5 days ago by james_lan. Reason: Clarify expected error/timeout on flushIO log part