Khalid - can you crash at will? If so, humour me here. As you know, I have a number of systems, and under constant settings and conditions I crash randomly. BUT I have found that if I turn off logging completely, I do not crash randomly. Your crash on park may not be the park process itself but the LOGGING about the park process. Every crash I get is somehow related to a Q lib (either io or audio), so my suspicion is that the values used in motion and parking blow up when they are logged. If you can replicate your crash, try a few times with logging turned off; it would almost verify my theory. My crashes are too random for me to verify it myself.
I try to set up different configurations with the hardware I have available (Bluepill / Arduino-Mega / Fysetc S6 / Max PCB2).
Not that I believe the errors come from the different platforms; it is just that this is what I have for setting up different scenarios.
(With / Without Focuser, Rotator, Equatorial vs Altaz ...)
I could not do them all since I still have no soldering iron, but I was able to set up an Arduino Mega 2560 with flying wires.
just to see if my thoughts are correct:
1) -If I connect to Serial (via usb-serial dongle) dedicated to WiFi I should be able to connect with Indi or terminal and send commands / receive responses, correct?
2) - If I use an Arduino Mega 2560 I connect my ESP8266 Rx/Tx to serial 1, correct?
If I try (1) I still sometimes get errors that I don't get via standard USB.
If I try (2) I get some other errors (with kstars or with a python script):
a) I cannot connect at all
b) I can connect but after a while all is blocked
c) I have the message "Serial Interface to OnStep is Down!" in the Web browser
So, after looking at this a lot more and thinking about some things: the problem is mainly the timeout, but even setting it longer will sometimes crash things, for a few reasons. (Yay wireless, also *#$%* ISPs.)
After a timeout, the connection often does still get the response. Unfortunately, tcflush doesn't work for that, nor do a few other things. So what was happening was this: say I sent :A?# (which returns ABC, where A = max stars, B = current star, C = # stars in alignment, or B/C is flipped), followed by :GU# (which returns the nPada/250-looking string that's status info), and then :GR#. You'd get a timeout on :A?#, so :GU# would get back 600#, :GR# would get nPada/250, and so on down the line. (Occasionally something would help it.) I didn't have the input verification needed for every function, so when it hit one of those it could crash. When I did add the input verification, I still had a problem: everything was shifted. So while it wouldn't crash, you'd get endless error messages.
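The shift is easy to reproduce in a toy model. The sketch below is not the driver's code; it is a hypothetical simulation in Python where replies sit in a FIFO receive buffer and the reply to a timed-out command arrives late. The function name `run_session` and the :GR# value are made up for illustration; the :A?# and :GU# values are the ones from the example above.

```python
from collections import deque

def run_session(commands, replies, late):
    """Return what the client reads for each command.

    replies: dict mapping command -> reply the mount sends for it.
    late:    commands whose reply arrives only after the client timed out.
    """
    pending = deque()  # replies sitting unread in the receive buffer
    seen = {}
    for cmd in commands:
        if cmd in late:
            # The client reads before this reply has arrived: it gets
            # stale data if any is buffered, otherwise a timeout (None).
            seen[cmd] = pending.popleft() if pending else None
            # The late reply lands in the buffer afterwards.
            pending.append(replies[cmd])
        else:
            # The reply arrives in time, but the client always reads
            # the OLDEST buffered reply, not necessarily its own.
            pending.append(replies[cmd])
            seen[cmd] = pending.popleft()
    return seen

replies = {
    ":A?#": "600#",
    ":GU#": "nPada/250#",
    ":GR#": "05:12:33#",   # placeholder value, not from a real mount
}
result = run_session([":A?#", ":GU#", ":GR#"], replies, late={":A?#"})
for cmd, reply in result.items():
    print(cmd, "->", reply)
```

One late reply is enough to leave every later command reading its predecessor's answer, which is why input verification alone stops the crashes but not the error messages.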
Unfortunately, after playing with tcflush and others, which didn't work, I eventually just decided to read with a very short timeout (1 ms; lower it, since the data should already be in the buffer?) until the buffer was cleared. That seems kind of inelegant, but it works. I also introduced getCommandDoubleResponse (no corresponding function in lx200drivers), which has the same calling convention as the SingleCharErrorLongResponse, with a double added. It has the clear-out portion, and after testing for a while (and quite large logfiles!) I can't reproduce the continuous shift or crashes. (In the past I could: after some period of time, be it 5 minutes or 2 hours, it would eventually have the issue.) It doesn't clean up the code much at all compared with doing it in the LongResponse, but it does set things up more nicely for checksummed commands/resend. (I think.)
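For anyone who wants to try the idea outside the driver, here is a minimal sketch of the "read with a very short timeout until the buffer is clear" approach. It is Python on a plain socket using select, not the actual C++ driver code, and the names `drain` and `read_reply` and the default timeouts are my own assumptions for illustration.

```python
import select
import socket

def drain(sock, quiet_s=0.001, chunk=256):
    """Read and discard bytes until nothing arrives for quiet_s seconds.

    Returns the discarded bytes so they can be logged.
    """
    discarded = b""
    while True:
        readable, _, _ = select.select([sock], [], [], quiet_s)
        if not readable:
            return discarded      # buffer stayed empty: we are clean
        data = sock.recv(chunk)
        if not data:
            return discarded      # peer closed the connection
        discarded += data

def read_reply(sock, timeout_s=1.0):
    """Read one '#'-terminated reply, or return None on timeout."""
    reply = b""
    while not reply.endswith(b"#"):
        readable, _, _ = select.select([sock], [], [], timeout_s)
        if not readable:
            return None           # timed out waiting for the next byte
        data = sock.recv(1)
        if not data:
            return None           # peer closed the connection
        reply += data
    return reply
```

The intended use is to call `drain` right before sending each command, so that a reply which arrived after an earlier timeout is thrown away instead of being read as the answer to the new command. The 1 ms quiet period only covers data already sitting in the buffer; a reply still in flight on the network would not be caught by it.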
I will add the buffer clearing to others, and I do wonder about adding it to the lx200driver calls, because they also operate over the network. I don't have another lx200 to test, but I suspect that if someone hooked up an esp8266 in its original 'dumb' serial port <-> wifi role, the same thing might be seen.
Occasional value is incorrect (due to not all functions flushing the buffer. This should mostly be strings, like :GU#, or similar. I will be adding flushes to reduce/eliminate this.)
INFO/ERROR messages only when something makes it past that buffer flush, so there should be only 1 for any given message.
Lots of error messages on disconnection (need to work on this.)
This is on the network-timeouts branch so far. Please test and let me know if there are any crashes or problems on wifi (or serial).
I usually believe I am the reason when something goes wrong, but with our problem there is anyhow something strange.
I am also testing the comms intensively and found at least that something is wrong on the OnStep side.
There is definitely a different behavior between USB and WiFi, and it is due not to the network but to the handling of the bridge between serial (OnStep side) and network.
Sending the same sequences over USB and then over WiFi I have two different results:
- USB does not crash, time out, or hang
- WiFi hangs for all Rotator commands
I still do not understand why, but in my opinion there should be no difference.
I connected directly to Serial in place of the Wemos (bypassing the Wemos) and do not observe errors.
So the errors must be due to the way the Wemos handles Serial to WiFi. (I also observed hangs when challenging the Web and communicating over port 9998 in parallel.)
There are also hangs over port 9999 for the android App.
So I will try to see what's going on on the Wemos side. But first I must understand how it is handled ;-(
Thanks for digging into this. Yes, I suspected that delayed results going to the subsequent commands would crash (e.g. expecting a number on Cmd1, which times out, then Cmd3 gets it, and things go haywire).
So yes, buffer clearing is a good idea, but how long would one wait? SWS has a default of 200 ms timeout, plus the time a command takes on OnStep. It varies. Maybe 250 ms minimum timeout is safe? I don't know.
Yes, if you bypass the Wemos you will get different results. The thing is, the Wemos runs new software (SWS) which has some internal processing, and is a bit different from the addon. So what you are seeing is probably buried in SWS somewhere.
Regarding the rotator, are you testing with the conditional GX98 code? If not, then yes, the :r commands will take a long time and cause problems.
I don't have much to add here, so will wait until the experts finish their analysis.
In the meantime, should Alain push the small GX98 patch to Jasem, to prevent having others report the same issue?