graphics filter ape if anyone's interested...

cfp

12th September 2002 05:52 UTC

if you're fluent with any graphics program you'll probably have played around with the "custom filter"/"user defined effect"

this ape implements one of them. this means it can do anything from blur to sharpen to water ripples to cellular-automata-like stuff

it is not fast by any means though so don't get your hopes up too high, but it's certainly worth playing round with and i'd love to see what you guys think of it. i think it's the kind of thing which should be added to the next avs.

hope you enjoy it

tom

p.s. a brief description of how it works follows:

you have to enter a 5*5 matrix of integers, a "bias" and a "scaling factor" (also integers)

for each pixel the plugin takes the sum of the 25 pixels in the 5*5 block centred on it, each weighted by the matching matrix entry, adds the bias, then divides by the scaling factor. for example something like:

0 0 1 0 0
0 1 2 1 0
1 2 3 2 1
0 1 2 1 0
0 0 1 0 0

with bias 0 and scaling factor 19 would function as a blur.
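
the description above can be sketched in plain C. this is only a minimal, unoptimised sketch and not the plugin's actual code: it assumes a single 8-bit channel, a row-major image, a kernel stored as 25 integers row by row, and edge pixels repeated at the borders. the names filter5x5 and clamp_channel are made up for illustration.

```c
#include <stdint.h>

/* Clamp an intermediate result into the 0..255 range of one colour channel. */
static int clamp_channel(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/*
 * Apply a 5x5 integer kernel to one 8-bit channel of a width*height image.
 * Assumptions (not taken from the plugin): single channel, row-major
 * layout, edge pixels repeated at the borders, scale != 0.
 */
static void filter5x5(const uint8_t *src, uint8_t *dst, int width, int height,
                      const int kernel[25], int bias, int scale)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int ky = -2; ky <= 2; ky++) {
                for (int kx = -2; kx <= 2; kx++) {
                    int sy = y + ky, sx = x + kx;
                    /* Repeat the nearest edge pixel outside the image. */
                    if (sy < 0) sy = 0;
                    if (sy >= height) sy = height - 1;
                    if (sx < 0) sx = 0;
                    if (sx >= width) sx = width - 1;
                    sum += kernel[(ky + 2) * 5 + (kx + 2)] * src[sy * width + sx];
                }
            }
            /* Weighted sum, plus bias, divided by the scaling factor. */
            dst[y * width + x] = (uint8_t)clamp_channel((sum + bias) / scale);
        }
    }
}
```

the entries of the example matrix add up to 19, which is why a scaling factor of 19 keeps the overall brightness unchanged.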

dirkdeftly

12th September 2002 06:43 UTC

Sweet...but somehow it seems like it could be a lot faster. I wouldn't think something that simple would be that slow...but maybe it's just me.

geozop

12th September 2002 11:12 UTC

Perhaps
Is the trans re-doing the math for each pixel on every frame, even when the variables have not been changed? I'm not sure how flexible programming an ape is, but is there a way to keep it from recalculating everything every frame?
Like the Movement trans... it only re-evaluates the function a second after the equation has been changed, rather than every frame... Perhaps a button to "tell" your ape to use updated numbers?

This is all assuming that my theory is correct, anyway.

I really like all the different kinds of blurs you can create with this, though. Hopefully you'll find something to work it better...

[edit]: Did you know you can put negative numbers in the fields? spiffy.

[edit2]: when the scale variable is negative, it does not remember the number when selecting the effect (in the sidebar), after selecting another trans or render

cfp

12th September 2002 13:16 UTC

it doesn't re-get all the data from the boxes every frame, it only "gets" that data when you edit a box.

i'm going to produce a single channel one which should be much much faster i reckon, i'm currently working on it...

i'll post both that and an updated version of the original without the negative scale bug later on today.

thanks for the bug spotting and comments

tom

cfp

12th September 2002 14:02 UTC

here's the custom filter with the negative scaling bug removed.

no speed ups as yet.

UnConeD

12th September 2002 15:44 UTC

Does it use MMX? I wrote a new APE yesterday using an optimized MMX routine and the speedup is incredible..

cfp

12th September 2002 20:07 UTC

much improved version
no it doesn't use mmx. i'm not that experienced a programmer tbh, i don't think i know how to access mmx routines...

anyway i have now produced a version which gives about 50% better frame rates by only working on a single channel. it's attached to this post, along with some examples of its use and a very slightly improved version of the multichannel original.

tom

p.s. any speed up advice would be appreciated, if you want to mail me my address is: cfp@DELETEALLTHESECAPSmyrealbox.com

UnConeD

12th September 2002 21:49 UTC

MMX is a godsend for 32-bit RGBA operations. For example, in order to add pixels together 2 at a time (2 pixels plus 2 pixels, 8 byte-wise adds in one instruction), you simply do:

movq mm0, source;
movq mm1, destination;
paddusb mm0, mm1;
movq destination, mm0;

This will do:

R1 G1 B1 00   R2 G2 B2 00
+
R3 G3 B3 00   R4 G4 B4 00

It requires you to be fluent in regular x86 assembly already though.
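
For anyone who wants to see what the saturating add does without writing assembly, here is a scalar C model of it. It is only a sketch for illustration (the name paddusb_model is made up), and it runs one byte at a time, so it shows the arithmetic rather than the speed.

```c
#include <stdint.h>

/*
 * Scalar model of PADDUSB on one 64-bit MMX register: eight independent
 * unsigned byte additions, each clamped to 255 instead of wrapping.
 */
static uint64_t paddusb_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned sum = (unsigned)((a >> (8 * i)) & 0xFF)
                     + (unsigned)((b >> (8 * i)) & 0xFF);
        if (sum > 255)
            sum = 255;                      /* saturate instead of wrapping */
        result |= (uint64_t)sum << (8 * i);
    }
    return result;
}
```

Because each byte saturates on its own, a bright channel in one pixel never carries over into the neighbouring channel, which is what makes the instruction handy for additive blending.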

cfp

13th September 2002 00:06 UTC

that mmx code sounds interesting but i'm not sure how useful it would be for this, because you are scaling the values of pixels way out of unsigned char range and adding lots of such scaled values together.

still i'd be interested to find out more, are there any web guides to mmx programming you could point me to?? admittedly i've not programmed in asm since my amstrad cpc 464 back in the day but i'm sure i could pick it up again.

tom

UnConeD

13th September 2002 01:27 UTC

There are a few resources, but I don't have any urls. You can locate a lot of interesting things through Google though, that's what I did.

Actually what you're describing is no problem at all for MMX. You can unpack the 4 unsigned chars of a pixel into 16-bit words, multiply them, and pack the result back down with saturation... so no clipping or maximum checks are needed. I'm sure you'll find alpha-blending code immediately, which can be adapted for your purpose.

The only thing I see that can slow your APE down is the fact that you usually don't use all 25 pixels. The best solution would be one that compiles a dynamic blending routine at run-time with optimized memory access and no jumps. But that's a lot more complicated of course.
Or you could make special predefined 'shapes' of the matrix... 5x5, 3x3, 1x3, 3x1, 1x5, etc. which each have an optimized loop. That way you don't have unnecessary jumps and still give the user the flexibility needed for most effects.
In any case, I made an APE like yours a while ago (the official name is a Convolution Matrix if I'm not mistaken): I spent quite a while optimizing a routine for a 3x3 matrix in regular x86 assembly (no mmx) and it was still quite slow.

If you need help with MMX just ask, but I'm not a pro either, I just learnt it yesterday. I've had quite some regular x86 assembly experience though, and MMX is really just a multi-byte/word version of most basic arithmetic such as adding and shifting (which makes it perfect for RGBA 32-bit pixels).

jheriko

13th September 2002 17:10 UTC

Damn, reading about all of your ape adventures really irritates me that I lost MSVS.net the other day during a reformat :( I had only just downloaded the SDK as well...

I'm gonna have to find a freeware C compiler and get in on the fun...

EDIT: quick question:

you don't need to make the whole ape in asm do you?

UnConeD

13th September 2002 19:30 UTC

Nah, you just include the assembly in __asm { } blocks. Like, if color is a BGR0 windows COLORREF, and you want to convert it into an RGB0 AVS color, you could do:

color = ((color & 0xFF) << 16) | (color & 0xFF00) | ((color & 0xFF0000) >> 16);

But that's all quite complicated. Using assembly, we turn it into:

mov eax, color
bswap eax
shr eax, 8
mov color, eax

That's only 2 instructions doing the real work (plus a load and a store)! The C code had 3 ANDs, 2 bitshifts and 2 ORs.
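
To check that the two approaches agree, both can be written as small C functions. This is just a sketch: bswap32 is a portable stand-in for the BSWAP instruction, and all three function names are made up for illustration.

```c
#include <stdint.h>

/* Mask-and-shift version: swap the lowest and third bytes, keep the middle. */
static uint32_t swap_rb_masks(uint32_t color)
{
    return ((color & 0xFF) << 16) | (color & 0xFF00) | ((color & 0xFF0000) >> 16);
}

/* Portable stand-in for the BSWAP instruction: reverse all four bytes. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0xFF00) | ((v << 8) & 0xFF0000) | (v << 24);
}

/* BSWAP followed by a logical shift right by 8, as in the assembly. */
static uint32_t swap_rb_bswap(uint32_t color)
{
    return bswap32(color) >> 8;
}
```

Both versions throw away whatever was in the top byte, so they give the same result for any 32-bit input.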

jheriko

14th September 2002 03:39 UTC

cool. I need to look into that then, I'm fluent in C but my asm is a bit lacking, I had a few ideas for things that I'd like to implement so I'm gonna have to get a compiler again, then maybe I'll make something. :)

dirkdeftly

14th September 2002 19:49 UTC

I really want to learn APE programming...but NONE of the SDKs I've found work. Not that I can't figure them out, they simply do not WORK. It's kind of annoying, especially since I can't learn to do things like, say, graphics...If anyone's willing to teach me (that has some idea what they're doing), I'd really appreciate it.

cfp

18th September 2002 22:40 UTC

a fast version of these filters is on its way
just posting to let you know that i've not abandoned this project.

i've spent the last couple of days frantically coding in every spare hour and i now have a fast, working (multi-channel) version of my filter.

at the moment it makes a few assumptions about the input conditions which i'm going to remove over the next few days, but basically, expect the final release soon.

i've not done detailed speed tests yet but its speed does seem at least comparable to, if not better than, the default blur for small matrices.

many thanks to unconed for his advice on using mmx and dynamically generating the render function (i.e. it just doesn't exist until runtime). programming in asm opcodes is perhaps the most tedious programming i've ever done, but i'm pretty darn chuffed with the results.

tom

UnConeD

19th September 2002 00:21 UTC

Sounds awesome :). Are you doing the opcodes by hand? If so, do you have a nice reference of all the opcodes in binary form?

Another idea is to use the disassembler in MSVC... just compile, set a breakpoint in front of the code and you'll be able to copy/paste the opcodes from the disassembler.

cfp

19th September 2002 04:09 UTC

i've been doing a bit of both. i've got my pdf copy of the "IA-32 Intel® Architecture Software Developer’s Manual Volume 2: Instruction Set Reference" (from the intel website) always open + a second vc project where i can easily insert code to get the opcodes by using the generate asm with opcodes and source compile option.

if both me and the rest of the world had pentium 4's then my job would be 100 times easier cos sse2 fixes the botch job that is mmx. eg. mmx has no division, mmx can only have 4 words making up a register instead of 4 dwords in sse2... which means less code for me to write...

ahh if only, i can dream (^_^)

tom

UnConeD

19th September 2002 04:20 UTC

Hmm no division is a pain, but couldn't you use a fixed point hack instead?

Suppose you need to divide by 6. That's the same as multiplying by 0.166... As an approximation, you can multiply by 43 (0.166.. * 256) and then divide by 256 (shift right by 8 bits).

I'm not sure if the loss of accuracy will matter a lot, but you can try ;). Fixed point of course means that you lose the dynamic range on the scalar values...
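
Here is the same trick as a C sketch, assuming unsigned inputs small enough that the multiply doesn't overflow 32 bits. The helper reciprocal_fixed is hypothetical and just shows how the magic multiplier could be precomputed for any divisor.

```c
#include <stdint.h>

/* Approximate x / 6 as (x * 43) >> 8, i.e. multiply by 43/256 ~= 0.168. */
static uint32_t div6_approx(uint32_t x)
{
    return (x * 43u) >> 8;
}

/*
 * Same idea for any divisor: precompute round(2^shift / divisor) once,
 * then replace each division by a multiply and a shift.
 * Hypothetical helper; assumes divisor > 0 and shift < 32.
 */
static uint32_t reciprocal_fixed(uint32_t divisor, unsigned shift)
{
    return ((1u << shift) + divisor / 2) / divisor;
}
```

Since 43/256 is slightly more than 1/6, big inputs come out a little high: 600 gives the exact 100, but 1530 gives 256 instead of 255. Using more fractional bits (a larger shift) shrinks that error.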

I know AVS's built-in components take short-cuts too: the blur causes banding artifacts in areas with little color difference, which I can only attribute to rounding errors.
And the APE SDK comes with a 50/50 blend function that discards the lower bits: ((a >> 1) & 0x7F7F7F) + ((b >> 1) & 0x7F7F7F). This causes white + white to blend to a tone darker.
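
The darkening is easy to reproduce in C. The first function below is the blend quoted above; the second is a commonly used exact alternative, which is not from the SDK and is only shown for comparison. Both assume 0x00RRGGBB pixels, and the function names are made up.

```c
#include <stdint.h>

/* The blend quoted above: halve each pixel first, then add. */
static uint32_t blend_half_lossy(uint32_t a, uint32_t b)
{
    return ((a >> 1) & 0x7F7F7F) + ((b >> 1) & 0x7F7F7F);
}

/*
 * A commonly used exact alternative (not from the SDK): the bits both
 * pixels share, plus half of the bits that differ, gives
 * floor((a + b) / 2) for every channel.
 */
static uint32_t blend_half_exact(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F);
}
```

With two white pixels (0xFFFFFF) the first version returns 0xFEFEFE, while the second returns 0xFFFFFF at the cost of one extra operation.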

cfp

19th September 2002 04:46 UTC

how did you handle people without mmx in your apes, unconed??
i'm tempted just to leave it doing nothing or just doing the very first, very slow algorithm i had, for people without mmx, because it seems like most people who are able to run winamp 3 would at least have a pentium 2 or a k6-2. do you think this is reasonable?? it would mean seriously more work for me if it wasn't...

also do you think that it is safe to put a max of 256 on the sums of the positive elements in the matrix and a min of -256 on the sums of the negative elements? again this would save me a heck of lot of work... (you should probably bear in mind that the matrix is now 7x7)

thanks for any advice

tom

Yathosho

19th September 2002 09:33 UTC

how about posting the source-code instead of discussing endlessly? if you really want to contribute, i think it's the best. it could still be you coordinating about what's in and what's not?

..or are you expecting the big bucks here :p

jheriko

19th September 2002 09:53 UTC

ugh... asm code, there's bound to be hundreds of lines.

UnConeD

19th September 2002 14:13 UTC

Jheriko: it depends on the application. And considering we're using inline asm inside C++, only the tough spots are hand-optimized. I've only got about 30 lines of asm code in my latest (complicated) APE.

As far as people without MMX, I ignored them ;). We do occasionally get requests in here of people who want the non-MMX version of AVS, but I believe it is an older version, so most presets made today wouldn't work on it either.

And considering the fastest processor without MMX was around 200 MHz, I doubt these people are getting an enjoyable experience anyhow.

cfp

19th September 2002 14:41 UTC

i'm not sure if the source would be that useful cos it's now nearly 1000 lines at least half of which is just lines like:

((LPBYTE)draw)[codelength] = 0x0F; // movq mm0, mm7
codelength++;
((LPBYTE)draw)[codelength] = 0x7F;
codelength++;
((LPBYTE)draw)[codelength] = 0xF8;
codelength++;
((LPBYTE)draw)[codelength] = 0x0F; // movq mm1, mm7
codelength++;
((LPBYTE)draw)[codelength] = 0x7F;
codelength++;
((LPBYTE)draw)[codelength] = 0xF9;
codelength++;

which does not make fun reading...
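
for anyone curious where those bytes come from: the third byte of each instruction is a ModRM byte that packs the two register numbers, so the register-to-register movq can be emitted by one small helper. this is a sketch only, not the code from the ape, and emit_movq_mm is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Append the three bytes of "movq mmDst, mmSrc" (register to register)
 * to buf at offset pos and return the new offset.
 * Encoding used here: 0F 7F /r, with ModRM = 11 src dst in binary.
 * Hypothetical helper; dst and src must be in 0..7 and buf large enough.
 */
static size_t emit_movq_mm(uint8_t *buf, size_t pos, unsigned dst, unsigned src)
{
    buf[pos++] = 0x0F;
    buf[pos++] = 0x7F;
    buf[pos++] = (uint8_t)(0xC0 | (src << 3) | dst);
    return pos;
}
```

calling it with dst 0 and src 7 produces 0x0F 0x7F 0xF8, and with dst 1 and src 7 it produces 0x0F 0x7F 0xF9, matching the two instructions listed above.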

tom

UnConeD

19th September 2002 15:18 UTC

Couldn't you at least use a macro? :)

Like:

#define emitcode1(a) { ((LPBYTE)draw)[codelength] = (a); codelength++; }
#define emitcode2(a,b) { emitcode1(a) emitcode1(b) }
#define emitcode3(a,b,c) { emitcode2(a,b) emitcode1(c) }

etc.

Macros can't be overloaded by argument count, so each size needs its own name: emitcode1, emitcode2, ...

cfp

19th September 2002 16:32 UTC

you know i never thought of that... neat programming never was my strong point

(^_^)

cfp

20th September 2002 04:54 UTC

i'm getting close to finishing now. attached is a version which is complete apart from load/save routines and one remaining bug: you can't have more than one copy of it running at the same time.

avs does start a new instance of the class doesn't it unconed?
if not where should you store variables which persist between render frames?

any help would be appreciated

tom

jheriko

20th September 2002 05:26 UTC

YES!! a new toy... (i'd been playing with the first one)

UnConeD

20th September 2002 14:31 UTC

Normally you shouldn't have any problems if you store everything in the class. Make sure you're not using global or static variables.

Where and how does it crash? Using a debugger you should be able to easily locate that...

cfp

20th September 2002 16:56 UTC

the error is in the call to VirtualProtect used to give the generated function execute rights.

there are no static or global variables apart from the pointer to the main class which was set as static in the tutorial so i guess should be.

tis strange...

UnConeD

20th September 2002 17:11 UTC

I tried making an AVS compiler a while ago, but I didn't get far due to lack of time. The part that did work, worked fine without VirtualProtect... I must say I don't know much about protecting memory and such. I used something like:

typedef void CompiledCode(void);

// nop (x6) (to make sure the disassembler recognizes the code)
// mov ebx, 0xC0DEBABE
// int 3 (breakpoint)
// ret

unsigned char function[] = { 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0xBB, 0xBE, 0xBA, 0xDE, 0xC0, 0xCC, 0xC3 };
CompiledCode *code = (CompiledCode *)(void *)function;
code();

This works fine for me in an APE (you can test it easily because of the break point).

cfp

20th September 2002 17:49 UTC

god you're right you know... the amount of hassle VirtualProtect has given me and i didn't even need it... (^_^)

thanks loads.

i'm not even going to try and understand why i don't need virtual protect... i guess it's just possible that the whole of the data segment for avs has execute rights cos i wouldn't be surprised if the superscope/movements were dynamically generated.

expect a final release with save and load functionality, examples and documentation by the end of tonight.

i'll put it in a new thread for ease of future reference.

tom