PCMHammer P04

Post by **antus** » Thu Aug 03, 2023 10:22 am

Yeah, you are right about that one. But 3 of the loop counters did have that fault where the counter is loaded with a 16 bit move.w and then the dbf or dbeq uses the whole 32 bits. Even if the high bits are clear because of other functions, that's risky, as the function should be self contained to make it easier to develop further without creating unintended consequences later. We both know how hard it can be to debug obscure breakages on 68k. We'll get there yet... All targets are working with the full kernel now, but like you described it, 'it's on a knife edge'. The DLC code is still not as stable as we want. Now I have to break it again, and my goal is to get to the bottom of that so that we don't need any padding or alignment for the data segment, as I am convinced alignment is the symptom, not the cause, and then see if we are off that knife edge. I know this because I have seen it transmit the kernel version with the data starting on both an odd and an even address. I think it's to do with timing: we are driving it too fast, and alignment must slow the PCM down by a single clock tick or something to shift the data 8 bits to load it when it's on an odd address, or something. I doubt we'll ever truly know what's going on at that level; the goal is just to get it stable.

Post by **Gampy** » Fri Aug 04, 2023 2:47 am

antus wrote: But 3 of the loop counters did have that fault where the counter is loaded with a 16 bit move.w and then the dbf or dbeq uses the whole 32 bits. Even if the high bits are clear because of other functions, that's risky, as the function should be self contained to make it easier to develop further without creating unintended consequences later.

HA HA HA That's funny, there are only two, they are DBcc instructions (dbeq and dbne to be exact) and RTFM!

Attachment: M68k-PRM_DBcc.png (39.01 KiB)

-Enjoy

Post by **antus** » Fri Aug 04, 2023 10:44 am

Well, you learn something every day. I always thought it was universal that instructions without the size qualifier were 32 bit. But that does explain why IntelEraseSector has:

Code: Select all

    move.l  #0x320000, %d2             | Erase Loop Timeout index
<sniP>
    subq.l  #1, %d2                    | Decrement index
    bne.s   IntelEraseSectorNotReady   | Fail Status Check Test again

i.e. it uses subq.l and bne.s rather than dbf, which did make me wonder why it was different, but I decided to trust it and move on, figuring it worked and I didn't need to touch it at this stage. Now I see: dbf wouldn't have been able to handle the 24/32 bit counter.
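To make the counter width concrete, here is a small Python model of a dbf loop. It is only a sketch of the documented DBcc behaviour (the instruction decrements the low 16 bits of the data register and falls through when that word reaches -1); the function name and the idea of counting body executions are assumptions made for illustration, not anything from the kernel source.

```python
MAX_DBCC_ITERATIONS = 0x10000  # a 16-bit counter can run the body at most 65536 times


def dbf_iterations(d_reg):
    """Model `loop: <body>; dbf %dN, loop` for a 32-bit register value.

    Returns (number of times the body ran, final 32-bit register value).
    Hypothetical helper for illustration only.
    """
    count = 0
    while True:
        count += 1                               # the loop body executes once
        low = (d_reg - 1) & 0xFFFF               # dbf decrements only the low word
        d_reg = (d_reg & 0xFFFF0000) | low       # the high word is left untouched
        if low == 0xFFFF:                        # counter word reached -1: fall through
            return count, d_reg


# A counter of 3 in the low word runs the body 4 times.
print(dbf_iterations(0x00000003))
# The erase timeout value has a zero low word, so dbf would exit after one pass.
print(dbf_iterations(0x00320000))
```

With the erase timeout of 0x320000 (3,276,800) the low word is zero, so a dbf loop would fall through after a single pass, which is why a subq.l / bne.s pair is needed for a counter that large.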

Post by **antus** » Fri Aug 04, 2023 11:50 am

I spent last night going over all the DLC routines and testing on my P08. P08 write is stable, but presents the same DLC issue they all do, where we need data to start on ODD alignment. This alignment should not be a requirement, and is a loose end we have not yet been able to understand. Further, I have been able to get some data to transmit from EVEN alignment, proving that it's repeatable but not consistent, and the root cause is not understood. This may or may not be related to the P04 problems. So I decided to investigate this on P08, where I have an easy connection to BDM.

I read the code, compared it to the MC68HC58 DLC datasheet, checked the register reads and writes, and checked the logic. I added short delays to slow the DLC communications a bit to see if we were accessing the command and status registers too quickly (the datasheet documents a 3 µs delay required after writing the status register). Then at the end of the night, after finding nothing conclusive in that regard, I reset the kernel back to the P08 branch state and hacked it to just pump out kernel version responses as fast as it could, hacked PCMHammer to pump out VIN requests as fast as it could, and set up Universal Patcher's logger with the MDI to monitor the bus. This was to create an overnight stress test of lots of bi-directional traffic, probably collisions, any type of mess we need the DLC to be able to handle. This morning it was still running after about 8 hours. So I am now pretty sure the DLC code is stable. I'll now repeat this test on P04 and see if there are different results...

I also tested creating a routine that keeps the watchdog happy then loops to itself. This test was also on P08. I moved the data to EVEN alignment and fixed the size of the kernel. Essentially this function halts the PCM to silence it but prevents a reboot. I wanted a way to test which instructions were executing without having to use the DLC or the DLC code to verify what was going on inside the PCM. Essentially I wanted an instruction level test. When the PCM crashes, it restarts into the factory OS, which gets quite talkative on the bus immediately. So I was able to move a "jmp haltnoreboot" command around, stepping over each instruction one at a time. Each step moves one instruction backwards in position and the jmp instruction forward one position, with no other changes to any other code's address or to the length of the kernel. It's the minimum change possible which has observable results.

So in this state the kernel version request is still received and the response transmitted, disproving the alignment theory about the data segment. But then there is a crash somewhere after the OSID request is sent and before the OSID response comes back. So my working theory is: solve this on a platform more stable than the P04, then see if it solves our P04 problems and removes the 'not alignment' alignment requirement on the rest, too.

What I found was when setting up the length of the VPW response for the OSID request, the crash was before we call the DLC function. Rather it was in an innocuous looking move instruction.

No crash:

Code: Select all

jsr.w haltnoreboot
move.l #0x9, %d1
....

Does crash:

Code: Select all

move.l #0x9, %d1
jsr.w haltnoreboot
....

The change is 6 bytes: the 2 byte move.l (which the assembler has converted to a more efficient moveq) and the 4 byte jsr.w are swapped in the kernel.
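As a sanity check on the byte counts, the swap can be laid out with a few lines of Python. This is only arithmetic on the sizes quoted in the post (2 bytes for the moveq, 4 bytes for the jsr.w); the `layout` helper, the labels and the base offset of 0 are made up for illustration, and the real addresses in the kernel are not known here.

```python
def layout(instructions, base=0):
    """Return {label: offset} for a list of (label, size_in_bytes) pairs."""
    offsets = {}
    addr = base
    for name, size in instructions:
        offsets[name] = addr
        addr += size
    return offsets


MOVEQ = ("moveq #9,%d1", 2)       # size as quoted in the post
JSR = ("jsr haltnoreboot", 4)     # size as quoted in the post

no_crash = layout([JSR, MOVEQ])   # jsr first, then the move
crash = layout([MOVEQ, JSR])      # move first, then the jsr

# How far each instruction shifted between the two builds.
moved = {name: crash[name] - no_crash[name] for name in crash}
print(no_crash, crash, moved)
```

Under these assumptions the moveq shifts by 4 bytes and the jsr by 2, both stay on even offsets, and nothing after the pair changes address.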

So this makes it look like the move instruction is crashing, which is a completely unexpected find, as was the fact that the assembler converts the move.l to a moveq in the background when the argument is a value between -128 and +127 being written to a 32 bit register. I haven't been able to come to any conclusion as to what this means, if anything, yet.

Ideas or theories to test welcome.

Attachment: change.png (39.11 KiB)

Post by **kur4o** » Fri Aug 04, 2023 4:10 pm

While working with these I have experienced several crash situations.

The first one was odd vs even addresses when writing a value. If you try to write a value [usually a word or dword] to an odd address, it crashes.
It was hard to debug and only by chance I figured it out. These CPUs are word based and don't like single byte handling.

Second, not all opcodes work on all PCMs. It looks like there are upgraded CPUs even within the same family that don't support all opcodes.

Some CPUs need absolute RAM addressing for some opcodes [00FFxxxx for example], while others can get away with relative addressing [xxxx].

The last crash was real stunner.

For some reason the outgoing DLC message was echoed back to the PCM as an incoming message, and some specific sequences of bytes led to an exit [20 for example].

It is also possible that the DLC overwrites some memory locations that are used by the flash routine code. Some code to monitor flash routine integrity may be needed if that is the case.
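The odd vs even point in the first crash case can be written down as a simple rule. The sketch below assumes a 68000/CPU32-style core, where a word or long operand access at an odd address raises an address error while byte accesses are fine at any address; whether every PCM in this thread behaves that way is an assumption, and both helper names are hypothetical.

```python
def access_is_aligned(address, size):
    """True if an access of `size` bytes (1, 2 or 4) at `address` is legal
    on a core that requires word and long operands to sit on even addresses."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    if size == 1:
        return True               # byte accesses work at odd and even addresses
    return address % 2 == 0       # word and long accesses need an even address


def align_up_even(address):
    """Round an address up to the next even address (padding a buffer start)."""
    return (address + 1) & ~1


print(access_is_aligned(0xFF8001, 2))   # word write to an odd address
print(hex(align_up_even(0xFF8001)))
```

A check like this could be run over a map of buffer addresses on the PC side before a kernel is uploaded.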

Post by **antus** » Sun Aug 06, 2023 1:04 pm

It's certainly been a lot of non-fun. Still no closer to nailing down exactly what's going on. I am seriously starting to consider that the P04 and P08 might have some CPU bugs in the silicon. But I cannot prove it.

Firstly, this moveq that causes a crash if I move it 4 bytes: it has the same word and 32 bit alignment in both locations. I wondered if maybe moveq was faulty on these CPUs, so I looked at the factory bins and I can see plenty of moveq opcodes there, so that does not seem to be the case.

When you say not all opcodes work on different PCMs, do you mean they don't work properly or don't work at all? I can confirm our kernels work sometimes, then you make a minor change and they don't anymore. So I guess this is not a missing opcode, because if that was the case it would always fail and be easy to find.

Now lastly, I am looking at P08 again and I found that I was not able to reliably get the kernel upload accepted by the factory OS, but other tools could. So I moved my load address higher and reinstated the loader. Now I can get the kernel in there reliably, but when I try to erase I get back a response of 5. This is intriguing because the only two possible code paths between the Intel erase main loop and sending the response both clear d0, and then the fail path writes a 1 to d0. So I can't see any way a 5 could get into d0. The 5 is consistent across multiple runs.

Post by **kur4o** » Sun Aug 06, 2023 4:26 pm

The opcodes don't work at all, and they have some specific usage. Use them outside that and expect a crash. Some disassembly analysis of typical usage can be very helpful.

I looked more closely at how moveq is used. Usually a condition byte is moved to a register.

If you use 70 7f [moveq #7f, d0] I guess the content of d0 will be 0000007f
but 70 81 will be d0=FFFFFF81

So the 8 bit value is sign extended to 32 bits.
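The two byte pairs quoted here can be decoded mechanically. The following is a sketch of the MOVEQ encoding as documented for the 68k family (high nibble 0111, three register bits, a zero bit, then an 8 bit immediate that is sign extended to 32 bits); the function names are made up for the example.

```python
def decode_moveq(opcode):
    """Decode a 16-bit MOVEQ opcode word.

    Returns (data register number, resulting 32-bit register value).
    """
    if opcode & 0xF100 != 0x7000:
        raise ValueError("not a MOVEQ opcode: 0x%04X" % opcode)
    reg = (opcode >> 9) & 0x7
    data = opcode & 0xFF
    if data & 0x80:                 # bit 7 set: the immediate is negative
        data -= 0x100
    return reg, data & 0xFFFFFFFF   # sign-extended value as stored in the register


def fits_moveq(value):
    """True if a signed value can be loaded with MOVEQ."""
    return -128 <= value <= 127


for word in (0x707F, 0x7081, 0x7209):
    reg, value = decode_moveq(word)
    print("%04X -> d%d = %08X" % (word, reg, value))
```

So 70 7F leaves 0000007F in d0 and 70 81 leaves FFFFFF81, as described, and 72 09 is the moveq #9 into d1 from the earlier test case.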

To move 4 bytes to a register, move.l is a much better choice. It also covers a broader range of uses than moveq, which is limited to an immediate value.

The only way to debug is low level analysis of the opcodes being used. Extensive use of the stack plus some specific compiler opcodes will be way harder to debug than usual.
It is also possible some of the opcodes are not used as expected, and under some conditions it works [some random data aligns correctly] but under others it crashes.

If the expected data is not received, either it gets overwritten by some other means or something interferes with it [a shared RAM conflict].

I guess the current issue is related to the Intel code that handles the erase and write portion. It has already been said, but Vpp can really screw up some stuff. In all the routines I have examined there is a good waiting time when Vpp is applied and removed, with some register checking. Some extra margin added can make the bus happier. You can also hook a monitor to the Vpp pin and watch it when the crash happens. For some reason it may be left on, leading to a DLC crash and not an actual code crash.

Post by **Gampy** » Tue Aug 08, 2023 4:59 am

Ho Hum ... Same old symphony different conductor!

kur4o wrote: The first one was odd vs even addresses when writing a value. If you try to write a value [usually a word or dword] to an odd address, it crashes.
It was hard to debug and only by chance I figured it out. These CPUs are word based and don't like single byte handling.

Thanks kur4o, I was hoping that would help, apparently not ... I have been stating for a long long time this is a problem, it is clearly documented that the processor is word sized, and everything must remain word aligned on an even address!

This is what I believe the issue to be: IMO the assembly kernel's VPWReceive is broken. It forces us to have its receive buffer off alignment (starting on an odd address), and this affects everything right down the chain.
Moving bytes around proves absolutely nothing, it's a complete waste of time. It does not relocate the receive buffer to an even address, and if it does, it breaks VPWReceive.

I believe this is causing addressing errors at the beginning and end of RAM pages, and this is what causes PCMs with smaller RAM sizes to crash (RAM pages are smaller).
It is also highly likely causing byte overlaps of used RAM, thus stepping on each other's toes. One does it aligned, the other does not, and this will move around accordingly, just exactly as is happening!

I assure you, I am working as hard as I can to learn how to prove this ... ATM, I have an almost complete kernel that is 100% aligned. It has one problem: it is deaf, not silent, but deaf! It cannot receive.
There is only one oddity: the ProcessCRC response can only send 13 bytes; beyond 13 bytes VPWSend goes into an infinite loop in:

Code: Select all

VPWSendWaitForFlush:
    bsr.w   ResetWatchdog              | Scratch the dog
    bsr.w   WasteTime                  | Twiddle thumbs
    move.b  (J1850_Status).l, %d0      | Get status byte
    andi.b  #0xE0, %d0                 | Mask RFS 1110 0000
    cmpi.b  #0xE0, %d0                 | Empty except for completion byte status
    bne.s   VPWSendWaitForFlush        | Loop until true
    move.b  (J1850_RX_FIFO).l, %d0     | Read FIFO

The only 'alignment' changes made are to the reply part of ProcessCRC, doing exactly the same process as every other sub that uses VPWSend to load the reply buffer.
All other subroutines that use VPWSend can send much longer messages. I tested 2048 bytes (+header and sum) from several subs and let it run overnight, and the next day it was still zipping along happily!

I do not know the DLC, nor do I have a PRM or datasheet on it and I really don't see how it could have been affected, especially only by one sub routine, ProcessCRC.

I do know one thing for absolute certainty, the way it is right now in the repo, is wrong, it is well documented that code and data must be aligned and until it's fixed there is no proving anything else because there will always be that question!

As for buggy or missing instructions ... In the case of the assembly kernel, I call hogwash on that!
We are using nothing but well developed core instructions that have been in every M68k ... The core design was solid by the time the PCMs we are interested in came out. I don't know when GM started using the M68k and don't care.
I do know they have been around a very very long time, longer than many around here have been alive, 44 years, they are used in a lot of things and are extremely well documented, including their quirks.

There are no issues with moveq, one just needs to understand its valid use by design ... It is only good for source values ranging from -128 to 127, yet it is a 32 bit signed instruction, no different than addq and subq with a valid source range from 1 to 8.
They are strictly instructions meant for speed critical use!

Absolute vs Indirect addressing: I have tested both ways on every PCM (P01, P04 AMD and Intel, P08, P10, P12, P12b, P59 AMD and Intel, E54) and have yet to find any issues or change accessing RAM.

Yes, 16 bit addresses are sign extended to 32 bits ...
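For anyone following along, the sign extension of a 16 bit (absolute short) address looks like this. The first function is the documented 68k rule; the second assumes a part with a 24 bit external address bus, which is an assumption about these PCMs and is only there to show how a short address could line up with the 00FFxxxx form mentioned earlier. Both names are hypothetical.

```python
def abs_short_to_address(word):
    """Sign-extend a 16-bit absolute short address to 32 bits."""
    word &= 0xFFFF
    return (word | 0xFFFF0000) if word & 0x8000 else word


def bus_address(address, address_bits=24):
    """Address as seen on a bus with `address_bits` lines (assumed 24 here)."""
    return address & ((1 << address_bits) - 1)


for short in (0x7FFF, 0x8000, 0xE000):
    full = abs_short_to_address(short)
    print("%04X -> %08X -> bus %06X" % (short, full, bus_address(full)))
```

Under that assumption only the bottom 32 KiB and the top 32 KiB of the address space are reachable with the short form, which is why short and long absolute addressing can hit the same RAM location.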

The code is so simple and carefully crafted there is no need to complicate it with stack tricks.

-Enjoy

Post by **kur4o** » Tue Aug 08, 2023 6:32 am

Word alignment to an even address is needed only when writing to RAM. All other cases should be no problem.

I was trying to use a loop that would dump the whole bin in multiple messages, increasing some values each cycle. It was random hit and miss to make it work [a crash vs working], until I finally figured out the odd vs even pattern in where the data is written.

I did test some code that works on P01 and tried it on P04. No luck [crash]. It wasn't regular code but some very odd instructions, but it still didn't work as expected. P04 CPUs are a little different [custom design] [P12 is relative addressing only, absolute leads to a crash], and there are earlier versions of the CPUs that have some tabs at the edges. They also use slightly different instruction sets.

I know for sure that the very same code will work on P01 and P04 [including Intel and AMD erase/write], only changing the DLC registers, COP, and Vpp.

Also, some PCMs are slower in communication than others, P59 being the fastest one.

Post by **antus** » Tue Aug 08, 2023 4:45 pm

What confuses me about our DLC code is that read and write are both byte sized operations, so they read and write odd and even bytes sequentially.

I have also observed that moving the VPW buffer and/or data looks like it causes problems, but I don't believe buffer or data alignment has anything to do with it, even if it appears so. I think if that was the case, we'd have solid and consistent test cases by now, and a solution. I have seen the code work properly buffer / data starting on both odd and even, too, not exclusively one way or the other.

I am actually starting to lean away from DLC problems, though I don't rule out a transmitter crash, as we have no code to detect this and reset it. It should be possible to be stable without this though; most other aftermarket kernels don't have DLC crash detect and reset either.

When I next find the time I plan to investigate kernel integrity. Either a function that can sum itself and send back an OK, or a button in hammer that can be pressed at any time to read the RAM out to verify there are no unexpected changes at particular addresses. Or perhaps use the CRC code to send back a CRC of RAM blocks. I am still very sus that this moveq #9, %d0 appears to crash in one of my test cases. It's a byte operation which is not communicating with RAM or DLC hardware. I can't see any way it could cause a crash unless it's changed to something else and is no longer a moveq in RAM.
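On the PC side, the block comparison could look something like the sketch below. It is only an outline of the idea: it uses Python's zlib.crc32 as a stand-in, which is not necessarily the CRC the kernel implements, and the function names, the 256 byte block size and the sample data are all made up for illustration.

```python
import zlib


def block_crcs(image, block_size=256):
    """CRC32 of each consecutive block of `image` (a bytes-like object)."""
    return [zlib.crc32(bytes(image[i:i + block_size]))
            for i in range(0, len(image), block_size)]


def changed_blocks(reference, readback, block_size=256):
    """Offsets of the blocks whose CRC differs between the uploaded kernel
    image and what was read back from RAM."""
    if len(reference) != len(readback):
        raise ValueError("images must be the same length")
    pairs = zip(block_crcs(reference, block_size),
                block_crcs(readback, block_size))
    return [i * block_size for i, (a, b) in enumerate(pairs) if a != b]


# Stand-in data: a 1 KiB "kernel" and a readback with one flipped byte.
kernel = bytes(range(256)) * 4
readback = bytearray(kernel)
readback[300] ^= 0xFF
print([hex(offset) for offset in changed_blocks(kernel, readback)])
```

If the kernel's own CRC routine were used instead, only `block_crcs` would need to change, and a mismatch would narrow the search to one block rather than the whole kernel.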

I have stress tested the DLC hardware and routines with the PCM and a PC both smashing the bus and reading it from opposite ends with messages for 10 hours. At about the 8 hour mark the AVT-852 crashed and required a reboot on the PC side, but the PCM held up the whole time. This was a P08.
