It was quite well received. The business problem to solve was the implementation of an automated software regression harness for a CAD tool that my company acquired some time back. The CAD tool already has a legacy automated regression suite, but it only runs on a single Windows 95 box that is well outside any reasonable compliance with modern standards. The regressions are based on screenshots of the tool with projects loaded, compared against golden screenshots.

Comparing images for regression tests is quite standard. The difficulty here was that, due to imperfections specifically in graphics resolutions, and generally in life and the world, the screenshots inevitably contained very minor differences: individual pixel mismatches, angled lines shifted over by one pixel, and so on. Standard image-compare algorithms would correctly flag these diffs as fails, but we needed them to be passes, because nothing in the code could be changed to generate better screenshots. There was no actionable bug. The legacy Win95-based regression tool would produce hundreds of fails from these kinds of diffs, leaving some poor engineer to manually review them all just to determine that they were not actually fails.

I first tried various vector-compare approaches: dumping the graphics out as ASCII DXF, HPGL, and other formats, then doing text-to-text comparison of those datasets. I thought it would be an easy way to go, but those approaches required writing code to post-process the output data, and I could never get it clean enough to be reliable across almost 1,500 tests and four software versions.

In a moment of inspiration, I asked ChatGPT what techniques could be used to compare images that are almost exactly the same but do have pixel differences. It responded with a wealth of information about image comparison that would probably mean more to a graphics artist than to me, including techniques like "normalized cross-correlation" and "structural similarity index". Both techniques return a value ranging from -1 to 1, with 1 being a perfect match. So comparing two example images from a regression test might return 0.999458, indicating a close but not perfect match.
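For anyone curious how those two metrics look in practice, here is a minimal sketch (not my actual code) that computes both scores for a pair of screenshots. It assumes Pillow, NumPy, and scikit-image are installed, and the file names are just placeholders:

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def load_gray(path):
    # Load an image and flatten it to a grayscale float array.
    return np.asarray(Image.open(path).convert("L"), dtype=np.float64)

def ncc(a, b):
    # Zero-normalized cross-correlation: 1.0 is a perfect match,
    # values near -1 mean the images are inverses of each other.
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

golden = load_gray("golden.png")        # reference ("golden") screenshot
candidate = load_gray("candidate.png")  # screenshot from the test run

print("NCC: ", ncc(golden, candidate))
print("SSIM:", structural_similarity(golden, candidate, data_range=255.0))
```

Two nearly identical screenshots with a handful of shifted pixels will typically score something like 0.999 on both metrics instead of failing an exact byte-for-byte compare.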

This was the inspiration I needed to finish the project. I used ChatGPT to whip up some Python code to compare images and check the returned score against thresholds set with environment variables. For example, the "structural similarity index" environment variable is set to 0.99, and any test that returns a value higher than 0.99 is a pass. This allows some measure of difference to exist and still generate a pass, while more significant differences still cause the test to fail. The final implementation is a little more technical than that, but you get the idea.
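To make that concrete, here is a hedged sketch of the pass/fail check, again not the real harness; the variable name SSIM_THRESHOLD and the file names are my own inventions for illustration:

```python
import os
import sys
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def load_gray(path):
    return np.asarray(Image.open(path).convert("L"), dtype=np.float64)

# Threshold comes from the environment, defaulting to the 0.99 mentioned above.
threshold = float(os.environ.get("SSIM_THRESHOLD", "0.99"))

golden = load_gray("golden.png")
candidate = load_gray("candidate.png")

score = structural_similarity(golden, candidate, data_range=255.0)
if score > threshold:
    print(f"PASS: SSIM {score:.6f} > {threshold}")
    sys.exit(0)  # exit code 0 so the harness counts a pass
print(f"FAIL: SSIM {score:.6f} <= {threshold}")
sys.exit(1)      # nonzero exit code flags the test as a fail
```

Driving the threshold through an environment variable means individual tests (or whole test suites) can loosen or tighten the tolerance without touching the comparison code.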

I put together a presentation for my team highlighting some of the ways I used ChatGPT to accomplish this project. The presentation got good visibility up the management chain, and although they probably didn't understand all the details, they're happy that they have an internal AI success story to ponder and share.

To keep the momentum going, I started a monthly "Large Language Model Discussion Group" where we meet to talk about the topic and share ideas. This keeps me ahead of the curve on this topic relative to my peers, and should give me an edge over them should any business shake-ups happen in the next couple of years.

Right now I'm using an LLM to fuzz some input files and guide my team in discussions on software robustness; a rough sketch of the idea is below. That's the topic for the next couple of discussion groups.
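Since the post gives no implementation details, here is a speculative sketch of what LLM-assisted fuzzing could look like; the model name, prompt, and file names are all assumptions. It uses the official openai Python client:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical seed file; the post doesn't say what format the inputs are.
seed = open("sample_input.dat").read()

prompt = (
    "Here is a sample input file for a CAD tool:\n\n"
    f"{seed}\n\n"
    "Produce one variant of this file with small, plausible corruptions "
    "(truncated fields, out-of-range values, swapped delimiters) that "
    "might expose robustness bugs in the parser. Output only the file."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; any chat model would do
    messages=[{"role": "user", "content": prompt}],
)

# Save the mutated input so it can be fed to the tool under test.
with open("fuzzed_input.dat", "w") as f:
    f.write(response.choices[0].message.content)
```

The appeal over purely random fuzzing is that the model can produce inputs that are malformed in structurally plausible ways, which tend to reach deeper into a parser.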

Have any of you used an LLM in a practical way in your business? I'd like to hear your story.


(post is archived)

[–] 2 pts

I avoid recommending AI to anyone I work with because I'm worried they would use it without checking stuff and it would just cause problems for everyone.

Also, am I the only one on this website who's not an engineer or in IT?

[–] 2 pts

I've used ChatGPT to test it out. I asked it a technical question that I knew the answer to, something I knew had been advanced and updated, but it gave me the old results, which were no longer applicable in practice. So yeah, it is useful, but the operator still needs to know how to use their tool.

[–] 2 pts

worried they would use it without checking stuff

A big reason I did the presentation was to help demystify the technology for our group. I could have just finished the regression test project and kept the details to myself. By sharing, they gain some new insight, and hopefully we all avoid the boss coming down with a rule that LLMs shouldn't be used.

Yes, one of us wrangled deer for a living, but now that he's gone, all that's left are engineers and IT. Except one guy, and he is between gigs at the moment.

[–] 1 pt

don't use anything proprietary.