GCC compile error with 2GB of data (stackoverflow.com)
74 points by p4bl0 on June 9, 2011 | 21 comments



tl;dr:

The error is returned by the linker rather than the compiler itself. It is not a bug, just a size limitation of the default memory model. Linux x86_64 provides a `large model' -- as pointed out by VJo -- but it is not supported by GCC before 4.6: http://stackoverflow.com/questions/6296837/gcc-compile-error...
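
For anyone who wants to reproduce the failure mode, here's a minimal sketch (my own toy repro, not the questioner's code) of blowing past the default small memory model:

    // big.cpp -- a single static object larger than 2GB overflows the
    // 32-bit relocations assumed by the default -mcmodel=small.
    static char blob[3ULL * 1024 * 1024 * 1024];  // 3GB, lands in .bss

    int main() {
        blob[0] = 1;    // touch it so the access is actually emitted
        return blob[0];
    }

Linking this with a plain `g++ big.cpp' fails with something along the lines of "relocation truncated to fit: R_X86_64_32S against symbol `blob'" -- i.e. it's the linker, not the compiler, that gives up.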


What a great technical problem, except for this bit:

> Btw, I didn't try to hide behind "this is scientific computing -- no way to optimize". It's just that the basis for this code is something that comes out of a "black box" where I have no real access to

Black box != science.


A research assistantship I held was based on classified data; all of the published work had to be approved by the DoD, and the actual data we used wasn't allowed to be published, which made our results entirely unreproducible.


I'm certain the methods being used here are just wrong. It looks like it's code generated by a symbolic algebra system. Chances are that using a more scalable methodology would eliminate the code size problem. For example, symbolic manipulations often have exponential space complexity in the depth of the expression tree, but automatic differentiation can do the required operation in polynomial time and space (often linear or log-linear).

Unfortunately, "no way to optimize" is basically a statement that the questioner is not interested in an algorithmic way to make the problem go away.
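
For the curious, here's a toy sketch of the forward-mode automatic differentiation idea using dual numbers -- my own illustration, nothing to do with the questioner's actual code. Each operation carries a (value, derivative) pair, so the cost stays proportional to the original expression instead of exploding the way a fully expanded symbolic derivative can:

    // dual.cpp -- forward-mode automatic differentiation with dual numbers
    #include <cstdio>

    struct Dual {
        double v;   // value
        double d;   // derivative with respect to the input
    };

    Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
    Dual operator*(Dual a, Dual b) {
        return {a.v * b.v, a.d * b.v + a.v * b.d};  // product rule
    }

    // f(x) = x * (x + 3) * (x + 3): nested products like this can blow
    // up symbolically, but cost O(1) extra work per operation here.
    Dual f(Dual x) {
        Dual c{3.0, 0.0};           // constants carry zero derivative
        return x * (x + c) * (x + c);
    }

    int main() {
        Dual x{2.0, 1.0};           // seed with dx/dx = 1
        Dual y = f(x);
        std::printf("f(2) = %g, f'(2) = %g\n", y.v, y.d);  // 50 and 45
        return 0;
    }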


It seems you don't understand the context of research projects. When you're assigned to write a program that processes data X coming from another group, it's valid to say that that input data is a 'black box' to you. That still doesn't mean that the whole research is a 'black box' or somehow not 'real science'.


I wish that were true.


Science is a black box. We make theories about the world we live in based upon the results of experiments. Richard Feynman commented on this way of thinking about science in one of his videos.

http://www.youtube.com/watch?v=o1dgrvlWML4


Science studies a black box. The science itself shouldn't be a black box. What is the added value of a black box representation of a black box?


> What is the added value of a black box representation of a black box?

The ability to make predictions.


How? You handed me a black box. You wrote some code, and you fed it some data, and you made some prediction, and all I really have is your word for it that your prediction worked, but I sure can't do anything with it. I don't know what it is or how it works, how to extend it, how to build on it, anything.

That's not science by any useful definition of the term.


What do you call a machine so complicated that we do not know its source code?

It's called the human body, and the black box is known as Medical Science. :)


We know its source code. We just don’t understand it.

http://www.ornl.gov/sci/techresources/Human_Genome/home.shtm...


Is including all that data in the object code really necessary?


Generally not. My last project involved building a framework in which scientists could run calculations for certain types of risk. A prototype built by some of said scientists involved code like that seen in the SO question. After looking at the types of calculations being done, we figured out that most of the equations used were similar and could be generalized. We also moved all the coefficients and parameters for the equations into configuration files. I suspect that a thorough evaluation of this code would reveal something similar to what I found in my case.
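
In rough outline it looked something like this (hypothetical sketch -- the file name and the polynomial form are stand-ins for our actual equations):

    // eval.cpp -- instead of one hard-coded expression per equation,
    // read coefficients from a config file and evaluate one
    // generalized form (here a polynomial, via Horner's rule).
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        // coeffs.txt: whitespace-separated coefficients, highest
        // degree first, e.g. "2.0 -3.5 1.0" for 2x^2 - 3.5x + 1
        std::ifstream in("coeffs.txt");
        std::vector<double> c;
        for (double a; in >> a; ) c.push_back(a);

        double x = 1.5, y = 0.0;
        for (double a : c) y = y * x + a;   // Horner's rule

        std::cout << "p(" << x << ") = " << y << "\n";
        return 0;
    }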


Some of the comments suggest that the problem is too much code, and that the data (of which there is a lot, granted) isn't contributing to the problem. I'm not sure that's the case, though.

If the data is part of the problem, I wonder if he can write a new linker script to rearrange the sections in the file so all code is below the signed 32-bit (~2GB) boundary. Though that raises the question... will it be able to address the data? Does initialized data access use 32- or 64-bit offsets in the small/medium models?

At any rate, it seems gcc 4.6 supports the x86_64 "large model", which should solve the problem without code/data changes.
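
As far as I can tell from the GCC docs, the small model uses 32-bit offsets for everything, while the medium model keeps code and small data below 2GB but moves objects bigger than -mlarge-data-threshold into .ldata/.lbss with full 64-bit addressing. So, roughly (untested):

    # medium model: code stays below 2GB; big data objects get
    # 64-bit addressing (threshold is in bytes, default 65536)
    g++ -mcmodel=medium -mlarge-data-threshold=65536 big.cpp

    # large model: no 2GB assumption for code or data (GCC >= 4.6)
    g++ -mcmodel=large big.cpp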


Absolutely not. This appears to be code generated from a data set, precomputing at build time what most developers would compute at runtime.


Those expressions look a lot like an alternating series to me. You should be able to generate an expression to produce them fairly easily.
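
e.g., if the generated file is really just the unrolled terms of an alternating series, a loop with a term recurrence replaces all of it. A toy sketch (we can't see the real expressions, so sin(x)'s Taylor series stands in):

    // series.cpp -- compute an alternating series term by term instead
    // of as one giant unrolled expression in generated source code.
    #include <cmath>
    #include <cstdio>

    // sin(x) = x - x^3/3! + x^5/5! - ...; each term is the previous
    // one times -x^2 / ((2k)(2k+1)).
    double sin_series(double x, int terms) {
        double term = x, sum = x;
        for (int k = 1; k < terms; ++k) {
            term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
            sum  += term;
        }
        return sum;
    }

    int main() {
        std::printf("series: %.12f\n", sin_series(1.0, 10));
        std::printf("libm:   %.12f\n", std::sin(1.0));
        return 0;
    }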


This sorta reminds me of the silly things people do with JavaScript.


Such as... an MP3 decoder? https://github.com/nddrylliog/jsmad


I suggested trying LuaJIT.

It has very good double-precision floating point support, and it looks like SSE/SIMD weren't being used directly in the code anyway.

LuaJIT garbage-collects unused code and can always regenerate it if needed. It should be able to compile and execute the tons of code he has (and it looks like all his stuff is generated anyway, just targeting C++ instead).


Last I checked, the LuaJIT allocator on AMD64 platforms uses only a small part of the address space for the Lua heap, partly so that more efficient type-punned representations can be used internally. I don't remember what the limit is exactly, but it's only a few GB (and beyond that the GC starts having trouble anyway). I don't know whether this applies to the machine-code JIT output, or to external cdata arrays, but it's something to watch out for here.



