IIUC it's critically dependent on the manufacturing process used.
> how much would it cost
A CPU IC isn't very interesting until it has some I/O, so it's much more meaningful to talk about an SoC with one or more of these darkriscv cores.
Sorry, I don't have an answer other than to say "this isn't quite complete enough for it to be useful for most tasks." That said, there's probably tons of open source implementations of DDR/SPI/PCI/USB interfaces (on opencores.org, e.g.). So it's "only" a matter of integrating these.
The fastest speeds depend on two things. The first is the manufacturing technology used (smaller is faster because parasitic capacitance is smaller, although FinFETs are starting to have large parasitic resistance). The second factor is the datapath between flip-flops: some architectures handle very fast clock rates because the datapath is simple (or well pipelined), while others are quite slow. I've seen some architectures top out at 500MHz while another architecture runs at over 1GHz in the same chip.
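As a rough illustration of that second point, here is a minimal sketch of the usual register-to-register timing model (clock-to-Q delay + combinational logic + setup time). The delay numbers are invented for the example, not taken from any real process:

```python
# Illustrative only: rough f_max estimate for a register-to-register path.
# The nanosecond figures below are made up for the example, not measured values.

def fmax_mhz(t_clk_to_q_ns, t_logic_ns, t_setup_ns):
    """The slowest flip-flop-to-flip-flop path sets the maximum clock rate."""
    period_ns = t_clk_to_q_ns + t_logic_ns + t_setup_ns
    return 1000.0 / period_ns

# A long combinational datapath (e.g. a single-cycle ALU plus bypassing)...
print(fmax_mhz(0.5, 8.0, 0.5))   # ~111 MHz
# ...versus the same work split across pipeline stages, shortening each path.
print(fmax_mhz(0.5, 3.0, 0.5))   # ~250 MHz
```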
I am not sure I'd recommend integrating this design into an ASIC, because it is not so stable yet, but the synthesis tool reported 133MHz in a Xilinx Artix-7 FPGA, running at 1 clock per instruction. A more pipelined and stable RISC-V design, such as the VexRiscv, can easily reach 346MHz in the same FPGA and uses less logic, but with only 0.5 instructions per clock. A performance-optimized VexRiscv runs at 183MHz in the same FPGA with an impressive 1.44 instructions per clock, but uses more logic. There are lots of RISC-V implementations, each with different features.
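If it helps to compare these, a crude way to rank them is sustained throughput ≈ clock × instructions per clock, ignoring memory stalls and everything else. A quick sketch using only the figures quoted above:

```python
# Back-of-the-envelope comparison: MIPS ~= clock (MHz) * instructions per clock.
# Numbers are the ones quoted above; real throughput also depends on memory stalls.

configs = {
    "darkriscv (Artix-7)":          (133, 1.0),
    "VexRiscv (smaller, 0.5 IPC)":  (346, 0.5),
    "VexRiscv (perf-optimized)":    (183, 1.44),
}

for name, (mhz, ipc) in configs.items():
    print(f"{name:30s} {mhz:4d} MHz x {ipc:.2f} IPC = {mhz * ipc:6.1f} MIPS")
```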
Microsemi can use a 111MHz clock for their RISC-V implementation, but they recommend a 70MHz limit on the mil-spec FPGAs. If you scroll down in the readme, they list some other max clock rates for other chips. Also, you can usually mess around a bit with the timing, either by changing the seed the synthesizer uses for place and route, or by manually placing and routing signals if you have high LUT utilization and the tool is struggling with automatic placement.
The first configuration works in a zero wait-state environment with separate instruction and data high-speed synchronous memories working on different clock phases (weeeeeird!). As long as there is no latency, this configuration works at 75MIPS with a 2-stage pipeline, which means only one clock is lost when the pipeline is flushed by a branch.
The second configuration uses a small high-speed cache with 256 bytes for instructions and 256 bytes for data, a 3-stage pipeline (which means two clocks are lost when the pipeline is flushed by a branch), a more conventional single-phase clock architecture, and a memory with about 3 wait states. Although it still peaks at 75MIPS, the cache misses and the longer pipeline decrease the performance to around 51MIPS.
The third configuration is the core from the first scenario, but with the small high-speed cache from the second scenario and the 3 wait states. In this configuration the clock decreased to 50MHz and, according to my calculations, the performance is around 34MIPS.
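Just to illustrate roughly where these numbers come from, here is a very simple model of the three configurations. The branch frequency and cache-miss rate below are assumed values picked only for illustration (they are not measured), and the quoted MIPS figures come from my own measurements and calculations, not from this model:

```python
# Rough throughput model for the three configurations above.
# BRANCH_FRACTION and MISS_RATE are ASSUMED values for illustration only.

BRANCH_FRACTION = 0.15   # assumed: fraction of instructions that flush the pipeline
MISS_RATE       = 0.06   # assumed: cache misses per instruction

def effective_mips(clock_mhz, flush_cycles, miss_penalty_cycles):
    # Cycles per instruction = base of 1 clock/instruction plus stall cycles.
    cpi = 1.0 + BRANCH_FRACTION * flush_cycles + MISS_RATE * miss_penalty_cycles
    return clock_mhz / cpi

# config 1: 75 MHz, 2-stage pipe (1-cycle flush), zero wait states, no cache
print(effective_mips(75, 1, 0))   # branch flushes bring the 75 MIPS peak to ~65
# config 2: 75 MHz, 3-stage pipe (2-cycle flush), cache misses cost ~3 wait states
print(effective_mips(75, 2, 3))   # ~51 MIPS, close to the quoted figure
# config 3: 50 MHz, 2-stage pipe (1-cycle flush), same cache and wait states
print(effective_mips(50, 1, 3))   # ~38 MIPS vs. the quoted ~34 MIPS
```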
So, if it is possible to work only with the internal FPGA memory, the first configuration is better; otherwise, you can use the second configuration.
I guess it is possible to create a fourth configuration with the 3-stage pipeline and zero wait states (no cache), but I would need to implement a two-clock load instruction. In this case, I guess it could peak around 100MHz.
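For example, reusing the same rough model (with assumed branch and load frequencies, just for illustration), a 100MHz clock with a two-clock load might land somewhere like this:

```python
# Sketch of the hypothetical fourth configuration: 3-stage pipeline, zero wait
# states, loads taking 2 clocks. LOAD_FRACTION and BRANCH_FRACTION are ASSUMED.

BRANCH_FRACTION = 0.15   # assumed: fraction of instructions that flush the pipeline
LOAD_FRACTION   = 0.20   # assumed: fraction of instructions that are loads

clock_mhz    = 100       # the ~100 MHz peak guessed above
flush_cycles = 2         # 3-stage pipeline loses two clocks per flush

cpi = 1.0 + BRANCH_FRACTION * flush_cycles + LOAD_FRACTION * 1  # one extra clock per load
print(clock_mhz / cpi)   # ~67 MIPS under these assumptions
```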
How much would this increase if it used an ASIC instead of an FPGA? And how much would it cost for different batch sizes?