pouët.net

GCC questions

category: general [glöplog]
 
Thanks to your help in the topic from a month ago, I'm not working with Visual C 6 anymore (I can't install it under Windows, and I also wanted to be able to port easily). Now I use MinGW.

Well, my questions:

- With -O2 I'm getting code about 10% faster than with -O3. Isn't -O3 supposed to optimize more? And also, doesn't 10% look like too big a difference?

- Then, I get absolutely no difference from the -march option (CPU architecture). Should I get any difference or not? I'm not sure what this option is for...

- Does GCC do any kind of MMX/SSE optimizations by itself, or do I have to code for vectors directly? (I suppose the second option.)

- And finally... do you know if there is a big difference in optimization between GCC and the VC compiler?

Now, not related to GCC, but to modern computers:

In my VC code I used to do 64-bit MMX memory accesses, since these were faster than 32-bit memory accesses (well, in the same time you got 64 bits of data instead of 32). Do current computers (Core 2 Duo and similar) have the same thing? Are 128-bit accesses as fast as 64-bit ones, for example? And does multicore code affect this?
added on the 2008-10-24 02:11:08 by texel texel
- With -O2 I'm getting code about 10% faster than with -O3. Isn't -O3 supposed to optimize more? And also, doesn't 10% look like too big a difference?

Answer: Usual GCC bullcrap... I'd recommend looking at the output, but most GCCs (even those for embedded systems) produce better code with -O2 in my experience.

- Then, I get absolutely no difference from the -march option (CPU architecture). Should I get any difference or not? I'm not sure what this option is for...

Answer: It will make a difference if and only if the compiler is capable of producing code for different architectures/processors/instruction sets.

- Does GCC do any kind of MMX/SSE optimizations by itself, or do I have to code for vectors directly? (I suppose the second option.)

Answer: As far as I know GCC 4.x does. I'm not sure though and it's 5 o'clock, sorry, just google it :)

- And finally... do you know if there is a big difference in optimization between GCC and the VC compiler?

Answer: Do I even need to answer this? ...
added on the 2008-10-24 04:40:40 by decipher decipher
texel, -march is not really about performance, it's about using the full set of instructions available for an architecture, i.e. it's more about compatibility than performance. (Of course, making certain instruction sets available can enable better performance, but it might be very marginal.)
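To make that concrete, a rough sketch (the file and function names are made up, not from this thread): the C source stays the same, only the instruction set GCC is allowed to emit changes.

Code:
/* hypothetical clamp.c; compile e.g. with
 *   gcc -O2 -march=i386 -c clamp.c      (baseline, no cmov/SSE)
 *   gcc -O2 -march=pentium4 -c clamp.c  (cmov and SSE2 become available)
 * the second binary may be a bit faster but won't run on CPUs older
 * than the one you asked for -- that's the compatibility trade-off. */
float clamp01(float x)
{
    if (x < 0.0f) x = 0.0f;
    if (x > 1.0f) x = 1.0f;
    return x; /* with a newer -march (plus -mfpmath=sse) this can become branchless minss/maxss */
}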

In any case, about O2 vs O3.. are you comparing a small piece of code or a full program? I *think* (but have not tested) that the code produced by O2 might be slightly more compact than what O3 produces.. that might be changing your cache behaviour.

GCC4.x has provisions to do auto vectorization. See thread:

http://www.nabble.com/vectorization-td15339532.html

Read it in *full* (and do your own tests if you're so inclined)
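
For illustration (my own sketch, not taken from the linked thread), this is the sort of trivial loop the 4.x vectorizer can handle:

Code:
/* e.g.  gcc -O3 -msse2 -ftree-vectorizer-verbose=2 -c vec.c
 * the verbose flag makes gcc report which loops it managed to vectorize
 * (it may add a runtime aliasing check since these pointers could overlap) */
void add_arrays(float *dst, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}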
added on the 2008-10-24 06:19:49 by _-_-__ _-_-__
I don't see how a compiler could vectorize code for you. Say, a raytracer.
added on the 2008-10-24 10:17:19 by iq iq
It actually might only be applicable to very, very basic pieces of code, and I'm not sure it's worth the effort..
added on the 2008-10-24 10:23:07 by _-_-__ _-_-__
As far as I know O3 tries to optimize the code to ridiculous extents, thus producing much bigger executables (for example, with O3 gcc unrolls all the loops it can), which sometimes leads to a decrease in performance. Using O3 is in fact discouraged (well, at least by the Gentoo guys), use O2 instead.
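A quick way to see this for yourself (rough sketch, names made up): compile the same loop at both levels and compare the object sizes.

Code:
/*   gcc -O2 -c loop.c && size loop.o
 *   gcc -O3 -c loop.c && size loop.o
 * -O3 will typically unroll (and maybe vectorize) this loop, so the
 * .text size usually comes out noticeably bigger than with -O2, and
 * whether that's actually faster depends on your cache. */
void scale(float *dst, const float *src, int n)
{
    int i;
    for (i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}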
added on the 2008-10-24 10:25:12 by masterm masterm
FUCK! And I always set it to -O3 and forgot it :(
added on the 2008-10-24 11:50:14 by Optimus Optimus
..that's why all gamepark makefiles had -O2, and I always wondered why and set it to -O3? Me silly :P
added on the 2008-10-24 11:50:54 by Optimus Optimus
Quote:
I don't see how a compiler could vectorize code for you. Say, a raytracer.


i haven't checked too much GCC output code (only some ps3 ppu stuff) but i've seen vc++ do it in a few cases (i was sceptical as well until someone showed me)

furthermore, what masterm says is about right. -O3 will for example inline your socks off and that isn't always the best thing to do, obviously (though just as obviously, taking advice from a few *nix distro goons isn't good practice either). however, if you adhere to the strict aliasing rule, i have seen -O3 take advantage of it much better.

just check the output code to see what's going on.
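
in case someone wonders what the strict aliasing rule looks like in practice, a small sketch (my own example, names made up):

Code:
/* casting between unrelated pointer types breaks strict aliasing, which is
 * undefined behaviour with -fstrict-aliasing (on by default at -O2/-O3),
 * blocks optimizations, and gcc may warn about it with -Wstrict-aliasing */
float bad_abs(float f)
{
    unsigned int *p = (unsigned int *)&f;  /* aliasing violation */
    *p &= 0x7fffffffu;
    return f;
}

/* the usual workaround: type-pun through a union (documented by gcc) or memcpy */
float ok_abs(float f)
{
    union { float f; unsigned int u; } v;
    v.f = f;
    v.u &= 0x7fffffffu;
    return v.f;
}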



added on the 2008-10-24 17:45:04 by superplek superplek
oh and sure -- spot-on vectorization of algorithms can only be done in its most optimal form when taking the context into account, which is something a compiler just can't do for ya. but that aside.
added on the 2008-10-24 17:46:21 by superplek superplek
Oh, I remember reading as a kid that "assembler is good for when you want to optimize what you have done in C" (this was around the '90s). Is there anyone who ever optimized anything they did in C etc. after 1979?
added on the 2008-10-24 17:53:56 by El Topo El Topo
there's a big difference between "optimizing in asm" and utilizing vector architecture through c intrinsics and such.

Quote:
In my VC code I used to do 64-bit MMX memory accesses, since these were faster than 32-bit memory accesses


these assumptions in general may hurt you. it strongly depends on cpu/codepath.
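
for what it's worth, a rough sketch with sse intrinsics (my own example, not texel's code) of what a 128-bit access looks like and where the cpu/codepath dependency bites:

Code:
#include <xmmintrin.h>

/* moves 128 bits per load/store; whether this beats two 64-bit accesses
 * depends on the cpu (core 2 has full 128-bit sse units, older cores split
 * these ops in two) and above all on alignment */
void copy4(float *dst, const float *src)
{
    __m128 v = _mm_loadu_ps(src);  /* unaligned 128-bit load */
    _mm_storeu_ps(dst, v);
    /* if you can guarantee 16-byte alignment, _mm_load_ps/_mm_store_ps
     * are the faster path */
}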
added on the 2008-10-24 18:41:33 by superplek superplek
Quote:
- Does GCC do any kind of MMX/SSE optimizations by itself, or do I have to code for vectors directly? (I suppose the second option.)

Yes it does, and 4.2.x does a really good job as well; just add these to your args:
Code:-mmmx -msse -msse2


I believe that -march= and -mcpu= are deprecated today and are there for compatibility reasons, but I might be wrong.

Remember that sometimes compilers need extra help understanding what's going on, so try being "obvious" with your code. The following code will compile wonderfully with SSE:
Code:
typedef union vec4 {
    // Cast operator, for []
    inline operator float* () { return (float*)this; }
    // Const cast operator, for const []
    inline operator const float* () const { return (const float*)this; }

    union {
        // Lowercase xyzw
        struct { float x; float y; float z; float w; };
        // Uppercase XYZW
        struct { float X; float Y; float Z; float W; };
    };
    /*
    union {
        // RGB maybe?
    }
    */

    inline vec4 operator += (const vec4 &v) {
        for(int i = 0; i < 4; ++i)
            (*this)[i] += v[i];
        return *this;
    }
    inline vec4 operator += (float f) {
        for(int i = 0; i < 4; ++i)
            (*this)[i] += f;
        return *this;
    }
    inline vec4 operator + (const vec4 &v) const { return (vec4(*this) += v); }
    inline vec4 operator + (float f) const { return (vec4(*this) += f); }
    // ... More ops ...
} vec4;


On a side note, if you're not using RTTI/exceptions, you might as well disable them for the extra code/ops reduction:
Code:-fno-rtti -fno-exceptions -fomit-frame-pointer -ffast-math


-ffast-math proves really useful even though it "breaks" IEEE float/double compliance, which means you can't rely on floats coming out bit-for-bit identical (think of the famous Quake 3 inverse sqrt trick, but the compiler does it for you!)
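
As a small illustration (my own sketch, function name made up): depending on the GCC version and flags (e.g. -mrecip on newer 4.x releases together with -ffast-math and SSE math), something like this may be turned into rsqrtss plus a Newton-Raphson step instead of a full sqrt and divide:

Code:
#include <math.h>

/* with strict IEEE settings this stays a sqrtss + divss;
 * with -ffast-math (and -mfpmath=sse / -mrecip where available)
 * gcc is allowed to use the approximate rsqrtss path instead */
float inv_sqrt(float x)
{
    return 1.0f / sqrtf(x);
}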

Hope it helps.
added on the 2008-10-24 20:19:31 by LiraNuna LiraNuna
LiraNuna, I don't think you have the right to do what you've just done above.

(When returning your float* you're assuming floats in the structure are all tightly packed)
added on the 2008-10-24 20:24:37 by _-_-__ _-_-__
I must say that the code above was tested with GCC 4.2.4 on x86, x64, ARM and Blackfin, as well as with VC 2008 Express targeting x86 and x64.

I have never seen a case where 4 floats (4 bytes x 4) were padded for whatever reason, but there's always the
Code:#pragma pack(4)
directive. Then again, I've never seen an arch with an alignment requirement greater than 4...
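
If you'd rather check than assume (a sketch of my own, not part of the original post), you can make the build fail if padding ever shows up, and ask GCC for 16-byte alignment when you want aligned SSE loads:

Code:
/* pre-C++11 compile-time check: the array size goes negative (a compile
 * error) if vec4 ever picks up padding */
typedef char vec4_is_tightly_packed[(sizeof(vec4) == 4 * sizeof(float)) ? 1 : -1];

/* gcc-specific: request 16-byte alignment so aligned SSE loads are safe */
typedef union vec4_aligned {
    float f[4];
    /* ...same members as vec4 if you like... */
} __attribute__((aligned(16))) vec4_aligned;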
added on the 2008-10-24 20:36:57 by LiraNuna LiraNuna
Yeah I agree that practically it should work just about anywhere
added on the 2008-10-24 20:50:56 by _-_-__ _-_-__
Would vectorizing single operations like that ^ always be a win anyway?
Quote:
there's a big difference between "optimizing in asm" and utilizing vector architecture through c intrinsics and such.
Most likely; I don't know much about such things. The question here is: when coding a 64b noise flutter in asm, can you do it better than GCC?
added on the 2008-10-24 22:21:56 by El Topo El Topo
Quote:
(*this)[i] += v[i];


i never like it when people use that pointer this way either :)

Quote:
Would vectorizing single operations like that ^ always be a win anyway?


as long as gcc takes the context and architecture into careful consideration when making the choice, it'd be safe.

that said, where the issues are still so sensitive i personally don't care to rely on compiler heuristics too much (certainly for multi-architecture code).. better to implement your vectorization at a somewhat more obvious and controllable level (i.e. above the language).
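
one middle ground worth mentioning (my own sketch, gcc-specific and not something from this thread): gcc 4.x's vector extensions keep the vector width explicit in the source while still letting the compiler pick the actual instructions per -march/-msse:

Code:
/* gcc builtin vector type: four packed floats (16 bytes) */
typedef float v4sf __attribute__((vector_size(16)));

v4sf madd(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;  /* becomes mulps/addps when sse is enabled */
}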
added on the 2008-10-24 22:24:18 by superplek superplek
minus tag fuckups.
added on the 2008-10-24 22:25:57 by superplek superplek
Thanks so much. LiraNuna, your code is very useful for understanding how to do it (I read something about it before but the examples were not clear at all).
added on the 2008-10-25 00:00:01 by texel texel
