Commit Graph

61 Commits

Author SHA1 Message Date
9db73f74cf ARMeilleure: Respect FZ/RM flags for all floating point operations (#4618)
* ARMeilleure: Respect Fz flag for all floating point operations.

This is a change in strategy for emulating the Fz FPCR flag. Before, it was set before instructions that "needed it" and reset after. However, this missed a few hot instructions like the multiplication instruction, and the entirety of A32.

The new strategy is to set the Fz flag only in the following circumstances:

- Set to match FPCR before translated functions/loop are executed.
- Reset when calling SoftFloat methods, set when returning.
- Reset when exiting execution.

This allows us to remove the code around the existing Fz aware instructions, and get the accuracy benefits on all floating point instructions executed while in translated code.

Single step executions now need to be called with a context wrapper - right now it just contains the Fz flag initialization, and won't actually do anything on ARM.

This fixes a bug in Breath of the Wild where some physics interactions could randomly crash the game due to subnormal values not flushing to zero.

This is draft right now because I need to answer the questions:
- Does dotnet avoid changing the value of Mxcsr?
- Is it a good idea to assume that? Or should the flag set/restore be done on every managed method call, not just softfloat?
- If we assume that, do we want a unit test to verify the behaviour?

I recommend testing a bunch of games, especially games affected when this was originally added, such as #1611.

* Remove unused method

* Use FMA for Fmadd, Fmsub, Fnmadd, Fnmsub, Fmla, Fmls

...when available.

Similar implementation to A32

* Use FMA for Frecps, Frsqrts

* Don't set DAZ.

* Add round mode to ARM FP mode

* Fix mistakes

* Add test for FP state when calling managed methods

* Add explanatory comment to test.

* Cleanup

* Add A64 FPCR flags

* Vrintx_S A32 fast path on A64 backend

* Address feedback 1, re-enable DAZ

* Fix FMA instructions By Elem

* Address feedback
2023-04-10 12:22:58 +02:00
0992310b76 ARMeilleure: Check for XSAVE cpuid flag for AVX{2,512} (#4584)
Protection for the `xgetbv` instruction for systems that do not support
`xcr0` such as nehalem processors.

The `XSAVE` cpuid indicates support for `XSAVE`, `XRESTOR`, `XSETBV`,
`XGETBV` while `OSXSAVE` indicates if the operating system itself has
`XSAVE` turned on. Both must be checked at the same time.
2023-03-22 14:51:21 -03:00
17620d18db ARMeilleure: Add initial support for AVX512 (EVEX encoding) (cont) (#4147)
* ARMeilleure: Add AVX512{F,VL,DQ,BW} detection

Add `UseAvx512Ortho` and `UseAvx512OrthoFloat` optimization flags as
short-hands for `F+VL` and `F+VL+DQ`.

* ARMeilleure: Add initial support for EVEX instruction encoding

Does not implement rounding, or exception controls.

* ARMeilleure: Add `X86Vpternlogd`

Accelerates the vector-`Not` instruction.

* ARMeilleure: Add check for `OSXSAVE` for AVX{2,512}

* ARMeilleure: Add check for `XCR0` flags

Add XCR0 register checks for AVX and AVX512F, following the guidelines
from section 14.3 and 15.2 from the Intel Architecture Software
Developer's Manual.

* ARMeilleure: Remove redundant `ReProtect` and `Dispose`, formatting

* ARMeilleure: Move XCR0 procedure to GetXcr0Eax

* ARMeilleure: Add `XCR0` to `FeatureInfo` structure

* ARMeilleure: Utilize `ReadOnlySpan` for Xcr0 assembly

Avoids an additional allocation

* ARMeilleure: Formatting fixes

* ARMeilleure: Fix EVEX encoding src2 register index

> Just like in VEX prefix, vvvv is provided in inverted form.

* ARMeilleure: Add `X86Vpternlogd` acceleration to `Vmvn_I`

Passes unit tests, verified instruction utilization

* ARMeilleure: Fix EVEX register operand designations

Operand 2 was being sourced improperly.

EVEX encoded instructions source their operands like so:
Operand 1: ModRM:reg
Operand 2: EVEX.vvvvv
Operand 3: ModRM:r/m
Operand 4: Imm

This fixes the improper register designations when emitting vpternlog.
Now "dest", "src1", "src2" arguments emit in the proper order in EVEX instructions.

* ARMeilleure: Add `X86Vpternlogd` acceleration to `Orn_V`

* ARMeilleure: PTC version bump

* ARMeilleure: Update EVEX encoding Debug.Assert to Debug.Fail

* ARMeilleure: Update EVEX encoding comment capitalization
2023-03-20 16:09:24 -03:00
5131b71437 Reducing memory allocations (#4537)
* add RecyclableMemoryStream dependency and MemoryStreamManager

* organize BinaryReader/BinaryWriter extensions

* add StreamExtensions to reduce need for BinaryWriter

* simple replacments of MemoryStream with RecyclableMemoryStream

* add write ReadOnlySequence<byte> support to IVirtualMemoryManager

* avoid 0-length array creation

* rework IpcMessage and related types to greatly reduce memory allocation by using RecylableMemoryStream, keeping streams around longer, avoiding their creation when possible, and avoiding creation of BinaryReader and BinaryWriter when possible

* reduce LINQ-induced memory allocations with custom methods to query KPriorityQueue

* use RecyclableMemoryStream in StreamUtils, and use StreamUtils in EmbeddedResources

* add constants for nanosecond/millisecond conversions

* code formatting

* XML doc adjustments

* fix: StreamExtension.WriteByte not writing non-zero values for lengths <= 16

* XML Doc improvements. Implement StreamExtensions.WriteByte() block writes for large-enough count values.

* add copyless path for StreamExtension.Write(ReadOnlySpan<int>)

* add default implementation of IVirtualMemoryManager.Write(ulong, ReadOnlySequence<byte>); remove previous explicit implementations

* code style fixes

* remove LINQ completely from KScheduler/KPriorityQueue by implementing a custom struct-based enumerator
2023-03-17 13:14:50 +01:00
f0562b9c75 CPU: Avoid argument value copies on the JIT (#4484)
* Minor refactoring of the pre-allocator

* Avoid LoadArgument copies

* PPTC version bump
2023-03-08 23:25:35 +01:00
5e0f8e8738 Implement JIT Arm64 backend (#4114)
* Implement JIT Arm64 backend

* PPTC version bump

* Address some feedback from Arm64 JIT PR

* Address even more PR feedback

* Remove unused IsPageAligned function

* Sync Qc flag before calls

* Fix comment and remove unused enum

* Address riperiperi PR feedback

* Delete Breakpoint IR instruction that was only implemented for Arm64
2023-01-10 19:16:59 -03:00
ee0f9b03a4 Eliminate zero-extension moves in more cases on 32-bit games (#4140)
* Eliminate zero-extension moves in more cases on 32-bit games

* PPTC version bump

* Revert X86Optimizer changes
2022-12-19 14:45:58 -03:00
f93c5f006a Revert "ARMeilleure: Add initial support for AVX512(EVEX encoding) (#3663)" (#4145)
This reverts commit 295fbd0542.
2022-12-18 20:21:10 -03:00
295fbd0542 ARMeilleure: Add initial support for AVX512(EVEX encoding) (#3663)
* ARMeilleure: Add AVX512{F,VL,DQ,BW} detection

Add `UseAvx512Ortho` and `UseAvx512OrthoFloat` optimization flags as
short-hands for `F+VL` and `F+VL+DQ`.

* ARMeilleure: Add initial support for EVEX instruction encoding

Does not implement rounding, or exception controls.

* ARMeilleure: Add `X86Vpternlogd`

Accelerates the vector-`Not` instruction.

* ARMeilleure: Add check for `OSXSAVE` for AVX{2,512}

* ARMeilleure: Add check for `XCR0` flags

Add XCR0 register checks for AVX and AVX512F, following the guidelines
from section 14.3 and 15.2 from the Intel Architecture Software
Developer's Manual.

* ARMeilleure: Increment InternalVersion

* ARMeilleure: Remove redundant `ReProtect` and `Dispose`, formatting

* ARMeilleure: Move XCR0 procedure to GetXcr0Eax

* ARMeilleure: Add `XCR0` to `FeatureInfo` structure

* ARMeilleure: Utilize `ReadOnlySpan` for Xcr0 assembly

Avoids an additional allocation

* ARMeilleure: Formatting fixes
2022-12-18 16:46:13 -03:00
4da44e09cb Make structs readonly when applicable (#4002)
* Make all structs readonly when applicable. It should reduce amount of needless defensive copies

* Make structs with trivial boilerplate equality code record structs

* Remove unnecessary readonly modifiers from TextureCreateInfo

* Make BitMap structs readonly too
2022-12-05 14:47:39 +01:00
45ce540b9b ARMeilleure: Add gfni acceleration (#3669)
* ARMeilleure: Add `GFNI` detection

This is intended for utilizing the `gf2p8affineqb` instruction

* ARMeilleure: Add `gf2p8affineqb`

Not using the VEX or EVEX-form of this instruction is intentional. There
are `GFNI`-chips that do not support AVX(so no VEX encoding) such as
Tremont(Lakefield) chips as well as Jasper Lake.

13df339fe7/GenuineIntel/GenuineIntel00806A1_Lakefield_LC_InstLatX64.txt (L1297-L1299)

13df339fe7/GenuineIntel/GenuineIntel00906C0_JasperLake_InstLatX64.txt (L1252-L1254)

* ARMeilleure: Add `gfni` acceleration of `Rbit_V`

Passes all `Rbit_V*` unit tests on my `i9-11900k`

* ARMeilleure: Add `gfni` acceleration of `S{l,r}i_V`

Also added a fast-path for when the shift amount is greater than the
size of the element.

* ARMeilleure: Add `gfni` acceleration of `Shl_V` and `Sshr_V`

* ARMeilleure: Increment InternalVersion

* ARMeilleure: Fix Intrinsic and Assembler Table alignment

`gf2p8affineqb` is the longest instruction name I know of. It shouldn't
get any wider than this.

* ARMeilleure: Remove SSE2+SHA requirement for GFNI

* ARMeilleure Add `X86GetGf2p8LogicalShiftLeft`

Used to generate GF(2^8) 8x8 bit-matrices for bit-shifting for the `gf2p8affineqb` instruction.

* ARMeilleure: Append `FeatureInfo7Ecx` to `FeatureInfo`
2022-10-02 11:17:19 +02:00
f5235fff29 ARMeilleure: Hardware accelerate SHA256 (#3585)
* ARMeilleure/HardwareCapabilities: Add Sha

* ARMeilleure/Intrinsic: Add X86Sha256Rnds2

* ARmeilleure: Hardware accelerate SHA256H/SHA256H2

* ARMeilleure/Intrinsic: Add X86Sha256Msg1, X86Sha256Msg2

* ARMeilleure/Intrinsic: Add X86Palignr

* ARMeilleure: Hardware accelerate SHA256SU0, SHA256SU1

* PTC: Bump InternalVersion
2022-08-25 10:12:13 +00:00
951700fdd8 Removed unused usings. (#3593)
* Removed unused usings.

* Added back using, now that it's used.

* Removed extra whitespace.
2022-08-18 18:04:54 +02:00
6dfb6ccf8c PreAllocator: Check if instruction supports a Vex prefix in IsVexSameOperandDestSrc1 (#3587) 2022-08-14 17:35:08 -03:00
c3c3914ed3 Add a limit on the number of uses a constant may have (#3097) 2022-02-09 17:42:47 -03:00
f3bfd799e1 Fix calls passing V128 values on Linux (#3034)
* Fix calls passing V128 values on Linux

* PPTC version bump
2022-01-24 11:23:24 +01:00
f0824fde9f Add host CPU memory barriers for DMB/DSB and ordered load/store (#3015)
* Add host CPU memory barriers for DMB/DSB and ordered load/store

* PPTC version bump

* Revert to old barrier order
2022-01-21 12:47:34 -03:00
f39fce8f54 misc: Migrate usage of RuntimeInformation to OperatingSystem (#2901)
Very basic migration across the codebase.
2021-12-04 20:02:30 -03:00
ecc64c934d Add Operand.Label support to Assembler (#2680)
* Add `Operand.Label` support to `Assembler`

This adds label support to `Assembler` and enables branch tightening
when compiling with relocatables. Jump management and patching has been
moved to the `Assembler`.

* Move instruction table to `Assembler.Table`

* Set PTC internal version

* Rename `Assembler.Table` to `AssemblerTable`
2021-10-05 14:04:55 -03:00
312be74861 Optimize HybridAllocator (#2637)
* Store constant `Operand`s in the `LocalInfo`

Since the spill slot and register assigned is fixed, we can just store
the `Operand` reference in the `LocalInfo` struct. This allows skipping
hitting the intern-table for a look up.

* Skip `Uses`/`Assignments` management

Since the `HybridAllocator` is the last pass and we do not care about
uses/assignments we can skip managing that when setting destinations or
sources.

* Make `GetLocalInfo` inlineable

Also fix a possible issue where with numbered locals. See or-assignment
operator in `SetVisited(local)` before patch.

* Do not run `BlockPlacement` in LCQ

With the host mapped memory manager, there is a lot less cold code to
split from hot code. So disabling this in LCQ gives some extra
throughput - where we need it.

* Address Mou-Ikkai's feedback

* Apply suggestions from code review

Co-authored-by: VocalFan <45863583+Mou-Ikkai@users.noreply.github.com>

* Move check to an assert

Co-authored-by: VocalFan <45863583+Mou-Ikkai@users.noreply.github.com>
2021-09-29 01:38:37 +02:00
a9343c9364 Refactor PtcInfo (#2625)
* Refactor `PtcInfo`

This change reduces the coupling of `PtcInfo` by moving relocation
tracking to the backend. `RelocEntry`s remains as `RelocEntry`s through
out the pipeline until it actually needs to be written to the PTC
streams. Keeping this representation makes inspecting and manipulating
relocations after compilations less painful. This is something I needed
to do to patch relocations to 0 to diff dumps.

Contributes to #1125.

* Turn `Symbol` & `RelocInfo` into readonly structs

* Add documentation to `CompiledFunction`

* Remove `Compiler.Compile<T>`

Remove `Compiler.Compile<T>` and replace it by `Map<T>` of the
`CompiledFunction` returned.
2021-09-14 01:23:37 +02:00
22b2cb39af Reduce JIT GC allocations (#2515)
* Turn `MemoryOperand` into a struct

* Remove `IntrinsicOperation`

* Remove `PhiNode`

* Remove `Node`

* Turn `Operand` into a struct

* Turn `Operation` into a struct

* Clean up pool management methods

* Add `Arena` allocator

* Move `OperationHelper` to `Operation.Factory`

* Move `OperandHelper` to `Operand.Factory`

* Optimize `Operation` a bit

* Fix `Arena` initialization

* Rename `NativeList<T>` to `ArenaList<T>`

* Reduce `Operand` size from 88 to 56 bytes

* Reduce `Operation` size from 56 to 40 bytes

* Add optimistic interning of Register & Constant operands

* Optimize `RegisterUsage` pass a bit

* Optimize `RemoveUnusedNodes` pass a bit

Iterating in reverse-order allows killing dependency chains in a single
pass.

* Fix PPTC symbols

* Optimize `BasicBlock` a bit

Reduce allocations from `_successor` & `DominanceFrontiers`

* Fix `Operation` resize

* Make `Arena` expandable

Change the arena allocator to be expandable by allocating in pages, with
some of them being pooled. Currently 32 pages are pooled. An LRU removal
mechanism should probably be added to it.

Apparently MHR can allocate bitmaps large enough to exceed the 16MB
limit for the type.

* Move `Arena` & `ArenaList` to `Common`

* Remove `ThreadStaticPool` & co

* Add `PhiOperation`

* Reduce `Operand` size from 56 from 48 bytes

* Add linear-probing to `Operand` intern table

* Optimize `HybridAllocator` a bit

* Add `Allocators` class

* Tune `ArenaAllocator` sizes

* Add page removal mechanism to `ArenaAllocator`

Remove pages which have not been used for more than 5s after each reset.

I am on fence if this would be better using a Gen2 callback object like
the one in System.Buffers.ArrayPool<T>, to trim the pool. Because right
now if a large translation happens, the pages will be freed only after a
reset. This reset may not happen for a while because no new translation
is hit, but the arena base sizes are rather small.

* Fix `OOM` when allocating larger than page size in `ArenaAllocator`

Tweak resizing mechanism for Operand.Uses and Assignemnts.

* Optimize `Optimizer` a bit

* Optimize `Operand.Add<T>/Remove<T>` a bit

* Clean up `PreAllocator`

* Fix phi insertion order

Reduce codegen diffs.

* Fix code alignment

* Use new heuristics for degree of parallelism

* Suppress warnings

* Address gdkchan's feedback

Renamed `GetValue()` to `GetValueUnsafe()` to make it more clear that
`Operand.Value` should usually not be modified directly.

* Add fast path to `ArenaAllocator`

* Assembly for `ArenaAllocator.Allocate(ulong)`:

  .L0:
    mov rax, [rcx+0x18]
    lea r8, [rax+rdx]
    cmp r8, [rcx+0x10]
    ja short .L2
  .L1:
    mov rdx, [rcx+8]
    add rax, [rdx+8]
    mov [rcx+0x18], r8
    ret
  .L2:
    jmp ArenaAllocator.AllocateSlow(UInt64)

  A few variable/field had to be changed to ulong so that RyuJIT avoids
  emitting zero-extends.

* Implement a new heuristic to free pooled pages.

  If an arena is used often, it is more likely that its pages will be
  needed, so the pages are kept for longer (e.g: during PPTC rebuild or
  burst sof compilations). If is not used often, then it is more likely
  that its pages will not be needed (e.g: after PPTC rebuild or bursts
  of compilations).

* Address riperiperi's feedback

* Use `EqualityComparer<T>` in `IntrusiveList<T>`

Avoids a potential GC hole in `Equals(T, T)`.
2021-08-17 15:08:34 -03:00
9d7627af64 Add multi-level function table (#2228)
* Add AddressTable<T>

* Use AddressTable<T> for dispatch

* Remove JumpTable & co.

* Add fallback for out of range addresses

* Add PPTC support

* Add documentation to `AddressTable<T>`

* Make AddressTable<T> configurable

* Fix table walk

* Fix IsMapped check

* Remove CountTableCapacity

* Add PPTC support for fast path

* Rename IsMapped to IsValid

* Remove stale comment

* Change format of address in exception message

* Add TranslatorStubs

* Split DispatchStub

Avoids recompilation of stubs during tests.

* Add hint for 64bit or 32bit

* Add documentation to `Symbol`

* Add documentation to `TranslatorStubs`

Make `TranslatorStubs` disposable as well.

* Add documentation to `SymbolType`

* Add `AddressTableEventSource` to monitor function table size

Add an EventSource which measures the amount of unmanaged bytes
allocated by AddressTable<T> instances.

 dotnet-counters monitor -n Ryujinx --counters ARMeilleure

* Add `AllowLcqInFunctionTable` optimization toggle

This is to reduce the impact this change has on the test duration.
Before everytime a test was ran, the FunctionTable would be initialized
and populated so that the newly compiled test would get registered to
it.

* Implement unmanaged dispatcher

Uses the DispatchStub to dispatch into the next translation, which
allows execution to stay in unmanaged for longer and skips a
ConcurrentDictionary look up when the target translation has been
registered to the FunctionTable.

* Remove redundant null check

* Tune levels of FunctionTable

Uses 5 levels instead of 4 and change unit of AddressTableEventSource
from KB to MB.

* Use 64-bit function table

Improves codegen for direct branches:

    mov qword [rax+0x408],0x10603560
 -  mov rcx,sub_10603560_OFFSET
 -  mov ecx,[rcx]
 -  mov ecx,ecx
 -  mov rdx,JIT_CACHE_BASE
 -  add rdx,rcx
 +  mov rcx,sub_10603560
 +  mov rdx,[rcx]
    mov rcx,rax

Improves codegen for dispatch stub:

    and rax,byte +0x1f
 -  mov eax,[rcx+rax*4]
 -  mov eax,eax
 -  mov rcx,JIT_CACHE_BASE
 -  lea rax,[rcx+rax]
 +  mov rax,[rcx+rax*8]
    mov rcx,rbx

* Remove `JitCacheSymbol` & `JitCache.Offset`

* Turn `Translator.Translate` into an instance method

We do not have to add more parameter to this method and related ones as
new structures are added & needed for translation.

* Add symbol only when PTC is enabled

Address LDj3SNuD's feedback

* Change `NativeContext.Running` to a 32-bit integer

* Fix PageTable symbol for host mapped
2021-05-29 18:06:28 -03:00
e318022b89 Fold constant offsets and group constant addresses (#2285)
* Fold constant offsets and group constant addresses

* PPTC version bump
2021-05-13 21:26:57 +02:00
d43a56726c (CPU) Fix CRC32 instruction when constant values are used as input (#2183) 2021-04-07 23:43:08 +02:00
bcbf240d2e PPTC: Fix unwanted propagation of a relocatable constant in a specific case. (#1990)
* Fix unwanted propagation of a relocatable constant in a specific case.

* Ptc.InternalVersion = 1990

* Nit to retrigger the Checks.
2021-02-23 13:15:45 +01:00
dc0adb533d PPTC & Pool Enhancements. (#1968)
* PPTC & Pool Enhancements.

* Avoid buffer allocations in CodeGenContext.GetCode(). Avoid stream allocations in PTC.PtcInfo.

Refactoring/nits.

* Use XXHash128, for Ptc.Load & Ptc.Save, x10 faster than Md5.

* Why not a nice Span.

* Added a simple PtcFormatter library for deserialization/serialization, which does not require reflection, in use at PtcJumpTable and PtcProfiler; improves maintainability and simplicity/readability of affected code.

* Nits.

* Revert #1987.

* Revert "Revert #1987."

This reverts commit 998be765cf.
2021-02-22 03:23:48 +01:00
40797a1283 Optimization | Modify Add (Integer) Instruction to use LEA instead. (#1971)
* Optimization | Modify Add Instruction to use LEA instead.

Currently, the add instruction requires 4 registers to take place. By using LEA, we can effectively perform the same working using 3 registers, reducing memory spills and improving translation efficiency.

* Fix IsSameOperandDestSrc1 Check for Add

* Use LEA if Dest != SRC1

* Update IsSameOperandDestSrc1 to account for Cases where Dest and Src1 can be same for add

* Fix error in logic

* Typo

* Add paranthesis for clarity

* Compare registers as requested.

* Cleanup if statement, use same comparison method as generateCopy

* Make change as recommended by gdk

* Perform check only when Add calls are made

* use ensure sametype for lea, fix else

* Update comment

* Update version #
2021-02-08 10:49:46 +11:00
c3e0c41da3 CPU (A64): Add Fmaxnmp & Fminnmp Scalar Inst.s, Fast & Slow Paths; with Tests. (#1894) 2021-01-20 09:12:33 +11:00
8a33e884f8 Fix Vnmls_S fast path (F64: losing input d value). Fix Vnmla_S & Vnmls_S slow paths (using fused inst.s). Fix Vfma_V slow path not using StandardFPSCRValue(). (#1775)
* Fix Vnmls_S fast path (F64: losing input d value). Fix Vnmla_S & Vnmls_S slow paths (using fused inst.s).

Add Vfma_S & Vfms_S Fma fast paths.
Add Vfnma_S inst. with Fma/Sse fast paths and slow path.
Add Vfnms_S Sse fast path.

Add Tests for affected inst.s.

Nits.

* InternalVersion = 1775

* Nits.

* Fix Vfma_V slow path not using StandardFPSCRValue().

* Nit: Fix Vfma_V order.

* Add Vfms_V Sse fast path and slow path.

* Add Vfma_V and Vfms_V Test.
2020-12-17 20:43:41 +01:00
3332b29f01 CPU: Implement VFMA (Vector) (#1762)
* Implement VFMA.F64

* Simplify switch

* Simplify FMA Instructions into their own IntrinsicType.

* Remove whitespace

* Fix indentation

* Change tests for Vfnms -- disable inf / nan

* Move args up, not description ;)

* Implementation Complete.

All Tests Pass (Slow / Fast Path)

* Move location of function in assembler + test updates.

* Shift params upwards

* Remove unused function

* Update PTC version.

* Add comments / re-oreder opcode table.

* Remove whitespace

* Fix nit

* Fix nit.

* Fix whitespace

* Wrong opcode was used by a bad merge.

* Addressed rip's comments.
2020-12-15 00:01:52 -03:00
47ba81c661 Fix pre-allocator shift instruction copy on a specific case (#1752)
* Fix pre-allocator shift instruction copy on a specific case

* Fix to make shift use int rather than long for constants
2020-12-14 17:56:07 -03:00
36f6bbf5b9 CPU: Implement VFNMA.F32 | F.64 (#1783)
* Implement VFNMA.F<32/64>

* Update PTC Version

* Update Implementation & Renames & Correct Order

* Fix alignment

* Update implementation to not trigger assert

* Actually use the intrinsic that makes sense :)
2020-12-07 21:04:01 -03:00
567ea726e1 Add support for guest Fz (Fpcr) mode through host Ftz and Daz (Mxcsr) modes (fast paths). (#1630)
* Add support for guest Fz (Fpcr) mode through host Ftz and Daz (Mxcsr) modes (fast paths).

* Ptc.InternalVersion = 1630

* Nits.

* Address comments.

* Update Ptc.cs

* Address comment.
2020-12-07 10:37:07 +01:00
b479a43939 CPU: Implement VFNMS.F32/64 (#1758)
* Add necessary methods / op-code

* Enable Support for FMA Instruction Set

* Add Intrinsics / Assembly Opcodes for VFMSUB231XX.

* Add X86 Instructions for VFMSUB231XX

* Implement VFNMS

* Implement VFNMS Tests

* Add special cases for FMA instructions.

* Update PPTC Version

* Remove unused Op

* Move Check into Assert / Cleanup

* Rename and cleanup

* Whitespace

* Whitespace / Rename

* Re-sort

* Address final requests

* Implement VFMA.F64

* Simplify switch

* Simplify FMA Instructions into their own IntrinsicType.

* Remove whitespace

* Fix indentation

* Change tests for Vfnms -- disable inf / nan

* Move args up, not description ;)

* Undo vfma

* Completely remove vfms code.,

* Fix order of instruction in assembler
2020-12-03 20:20:02 +01:00
0679084f11 CPU (A64): Add FP16/FP32 fast paths (F16C Intrinsics) for Fcvt_S, Fcvtl_V & Fcvtn_V Instructions. Now HardwareCapabilities uses CpuId. (#1650)
* net5.0

* CPU (A64): Add FP16/FP32 fast paths (F16C Intrinsics) for Fcvt_S, Fcvtl_V & Fcvtn_V Instructions. Switch to .NET 5.0.

Nits.

Tests performed successfully in both debug and release mode (for all instructions involved).

* Address comment.

* Update appveyor.yml

* Revert "Update appveyor.yml"

This reverts commit 27cdd59e8b.

* Remove Assembler CpuId.

* Update appveyor.yml

* Address comment.
2020-11-18 19:35:54 +01:00
f60033e0aa Implement block placement (#1549)
* Implement block placement

Implement a simple pass which re-orders cold blocks at the end of the
list of blocks in the CFG.

* Set PPTC version

* Use Array.Resize

Address gdkchan's feedback
2020-09-19 20:00:24 -03:00
36ec1bc6c0 Relax block ordering constraints (#1535)
* Relax block ordering constraints

Before `block.Next` had to follow `block.ListNext`, now it does not.
Instead `CodeGenerator` will now emit the necessary jump instructions
to ensure control flow.

This makes control flow and block order modifications easier. It also
eliminates some simple cases of redundant branches.

* Set PPTC version
2020-09-12 12:32:53 -03:00
ee22517d92 Improve branch operations (#1442)
* Add Compare instruction

* Add BranchIf instruction

* Use test when BranchIf & Compare against 0

* Propagate Compare into BranchIfTrue/False use

- Propagate Compare operations into their BranchIfTrue/False use and
  turn these into a BranchIf.

- Clean up Comparison enum.

* Replace BranchIfTrue/False with BranchIf

* Use BranchIf in EmitPtPointerLoad

- Using BranchIf early instead of BranchIfTrue/False improves LCQ and
  reduces the amount of work needed by the Optimizer.

  EmitPtPointerLoader was a/the big producer of BranchIfTrue/False.

- Fix asserts firing when assembling BitwiseAnd because of type
  mismatch in EmitStoreExclusive. This is harmless and should not
  cause any diffs.

* Increment PPTC interval version

* Improve IRDumper for BranchIf & Compare

* Use BranchIf in EmitNativeCall

* Clean up

* Do not emit test when immediately preceded by and
2020-08-05 08:52:33 +10:00
9878fc2d3c Implement inline memory load/store exclusive and ordered (#1413)
* Implement inline memory load/store exclusive

* Fix missing REX prefix on 8-bits CMPXCHG

* Increment PTC version due to bugfix

* Remove redundant memory checks

* Address PR feedback

* Increment PPTC version
2020-07-30 11:29:28 -03:00
b3c051bbec Use movd,movq for i32/64 VectorExtract %x, 0x0 (#1439)
* Use movd,movq for i32/64 VectorExtract %x, 0x0

* Increment PPTC interval version

* Use else-if instead

- Address gdkchan's feedback.
- Clean up Debug.Assert calls

* Inline `count` expression into Debug.Assert

Apparently the CoreCLR JIT will not eliminate this. :(
2020-07-30 15:52:26 +10:00
d7044b10a2 Add SSE4.2 Path for CRC32, add A32 variant, add tests for non-castagnoli variants. (#1328)
* Add CRC32 A32 instructions.

* Fix CRC32 instructions.

* Add CRC intrinsic and fast path.

Loop is currently unrolled, will look into adding temp vars after tests are added.

* Begin work on Crc tests

* Fix SSE4.2 path for CRC32C, finialize tests.

* Remove unused IR path.

* Fix spacing between prefix checks.

* This should be Src.

* PTC Version

* OpCodeTable Order

* Integer check improvement. Value and Crc can be either 32 or 64 size.

* This wasn't necessary...

* If size is 3, value type must be I64.

* Fix same src+dest handling for non crc intrinsics.

* Pre-fix (ha) issue with vex encodings
2020-07-13 20:48:14 +10:00
38b26cf424 Mask shift constants on x86 backend (#1382)
* Mask shift constants on x86 backendd

* Version bump
2020-07-11 15:52:38 +10:00
c050994995 Fix PPTC on Windows 7. (#1369)
* Fix PPTC on Windows 7.

* Address gdkchan comment.
2020-07-09 10:45:24 +10:00
5e724cf24e Add Profiled Persistent Translation Cache. (#769)
* Delete DelegateTypes.cs

* Delete DelegateCache.cs

* Add files via upload

* Update Horizon.cs

* Update Program.cs

* Update MainWindow.cs

* Update Aot.cs

* Update RelocEntry.cs

* Update Translator.cs

* Update MemoryManager.cs

* Update InstEmitMemoryHelper.cs

* Update Delegates.cs

* Nit.

* Nit.

* Nit.

* 10 fewer MSIL bytes for us

* Add comment. Nits.

* Update Translator.cs

* Update Aot.cs

* Nits.

* Opt..

* Opt..

* Opt..

* Opt..

* Allow to change compression level.

* Update MemoryManager.cs

* Update Translator.cs

* Manage corner cases during the save phase. Nits.

* Update Aot.cs

* Translator response tweak for Aot disabled. Nit.

* Nit.

* Nits.

* Create DelegateHelpers.cs

* Update Delegates.cs

* Nit.

* Nit.

* Nits.

* Fix due to #784.

* Fixes due to #757 & #841.

* Fix due to #846.

* Fix due to #847.

* Use MethodInfo for managed method calls.

Use IR methods instead of managed methods about Max/Min (S/U).
Follow-ups & Nits.

* Add missing exception messages.

Reintroduce slow path for Fmov_Vi.
Implement slow path for Fmov_Si.

* Switch to the new folder structure.

Nits.

* Impl. index-based relocation information. Impl. cache file version field.

* Nit.

* Address gdkchan comments.

Mainly:
- fixed cache file corruption issue on exit; - exposed a way to disable AOT on the GUI.

* Address AcK77 comment.

* Address Thealexbarney, jduncanator & emmauss comments.

Header magic, CpuId (FI) & Aot -> Ptc.

* Adaptation to the new application reloading system.

Improvements to the call system of managed methods.
Follow-ups.
Nits.

* Get the same boot times as on master when PTC is disabled.

* Profiled Aot.

* A32 support (#897).

* #975 support (1 of 2).

* #975 support (2 of 2).

* Rebase fix & nits.

* Some fixes and nits (still one bug left).

* One fix & nits.

* Tests fix (by gdk) & nits.

* Support translations not only in high quality and rejit.

Nits.

* Added possibility to skip translations and continue execution, using `ESC` key.

* Update SettingsWindow.cs

* Update GLRenderer.cs

* Update Ptc.cs

* Disabled Profiled PTC by default as requested in the past by gdk.

* Fix rejit bug. Increased number of parallel translations. Add stack unwinding stuffs support (1 of 2).

Nits.

* Add stack unwinding stuffs support (2 of 2). Tuned number of parallel translations.

* Restored the ability to assemble jumps with 8-bit offset when Profiled PTC is disabled or during profiling.

Modifications due to rebase.
Nits.

* Limited profiling of the functions to be translated to the addresses belonging to the range of static objects only.

* Nits.

* Nits.

* Update Delegates.cs

* Nit.

* Update InstEmitSimdArithmetic.cs

* Address riperiperi comments.

* Fixed the issue of unjustifiably longer boot times at the second boot than at the first boot, measured at the same time or reference point and with the same number of translated functions.

* Implemented a simple redundant load/save mechanism.

Halved the value of Decoder.MaxInstsPerFunction more appropriate for the current performance of the Translator.
Replaced by Logger.PrintError to Logger.PrintDebug in TexturePool.cs about the supposed invalid texture format to avoid the spawn of the log.
Nits.

* Nit.

Improved Logger.PrintError in TexturePool.cs to avoid log spawn.
Added missing code for FZ handling (in output) for fp max/min instructions (slow paths).

* Add configuration migration for PTC

Co-authored-by: Thog <me@thog.eu>
2020-06-16 20:28:02 +02:00
f8cd072b62 Faster crc32 implementation (#1294)
* Add Pclmulqdq intrinsic

* Implement crc32 in terms of pclmulqdq

* Address PR comments
2020-06-05 20:58:27 +10:00
3b70a28087 Unwinding Follow-up. Fix a bug in JitUnwindWindows where ... (#1238)
... in case of "Vector" unwind codes the remaining unwind codes could be corrupted.
Nits.
2020-05-15 13:46:35 +02:00
96c7988671 Remove CpuId IR instruction (#1227) 2020-05-13 15:30:21 +10:00
f77694e4f7 Implement a new physical memory manager and replace DeviceMemory (#856)
* Implement a new physical memory manager and replace DeviceMemory

* Proper generic constraints

* Fix debug build

* Add memory tests

* New CPU memory manager and general code cleanup

* Remove host memory management from CPU project, use Ryujinx.Memory instead

* Fix tests

* Document exceptions on MemoryBlock

* Fix leak on unix memory allocation

* Proper disposal of some objects on tests

* Fix JitCache not being set as initialized

* GetRef without checks for 8-bits and 16-bits CAS

* Add MemoryBlock destructor

* Throw in separate method to improve codegen

* Address PR feedback

* QueryModified improvements

* Fix memory write tracking not marking all pages as modified in some cases

* Simplify MarkRegionAsModified

* Remove XML doc for ghost param

* Add back optimization to avoid useless buffer updates

* Add Ryujinx.Cpu project, move MemoryManager there and remove MemoryBlockWrapper

* Some nits

* Do not perform address translation when size is 0

* Address PR feedback and format NativeInterface class

* Remove ghost parameter description

* Update Ryujinx.Cpu to .NET Core 3.1

* Address PR feedback

* Fix build

* Return a well defined value for GetPhysicalAddress with invalid VA, and do not return unmapped ranges as modified

* Typo
2020-05-04 08:54:50 +10:00
fe5bb439f1 Do temp constant copy for CompareAndSwap, other improvements to PreAllocator (#1126)
* Do temp constant copy for CompareAndSwap, other improvements to PreAllocator

* Nit
2020-04-25 23:00:54 +10:00