Accelerating CUDA Development with Windsurf
In this post, we'll walk through how we use Windsurf as a modern development environment for CUDA projects. The stack includes C++, CMake for builds, Conan for dependency management, and the Catch2 unit testing framework. The setup is fast and lightweight, and it has significantly streamlined our CUDA development workflow.
Full working project on GitHub: https://github.com/richjohnson-wwt/stock-walk-blog
As a proof of concept, we will develop a stock price signal predictor. This application analyzes historical stock data using moving averages to generate simple but effective buy, sell or hold recommendations — a practical way to experiment with parallel GPU computing for financial analytics.
GPU environment
The GPU environment for this project was set up in the ATC GPUaaS area. In the OpenShift dashboard, I set up instance type wwt1.l40s.small.
Project setup
stock-walker/
├── .venv
├── CMakeLists.txt
├── conanfile.txt
├── README.md
├── data/
│ ├── stock_prices.csv
├── src/
│ ├── main.cu
│ ├── CMakeLists.txt
├── test/
│ ├── test_main.cpp
│ └── CMakeLists.txt
The project was structured using CMake, Conan and Catch2 to support modular development and testing. Before we start chatting with Windsurf Cascade, we want some context established, such as the data/stock_prices.csv file. Windsurf can look in that file, see the CSV format, and use it to generate the CSV reader. We then prompt Windsurf to generate code in src/main.cu. Prefixing a filename with the @ symbol in a prompt, e.g. '@main.cu', points Windsurf at our working file.
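To give a feel for the parsing work Windsurf can infer from the CSV alone, here is a minimal sketch of a reader for a timestamp,open,high,low,close,volume layout. The struct and function names are illustrative, not the generated code from the repo:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Illustrative record matching the columns in stock_prices.csv:
// timestamp, open, high, low, close, volume
struct PriceRow {
    std::string timestamp;
    double open, high, low, close;
    long volume;
};

// Parse one CSV line into a PriceRow; returns false on malformed input.
bool parseRow(const std::string& line, PriceRow& row) {
    std::stringstream ss(line);
    std::string field;
    std::vector<std::string> fields;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    if (fields.size() != 6) return false;
    try {
        row.timestamp = fields[0];
        row.open   = std::stod(fields[1]);
        row.high   = std::stod(fields[2]);
        row.low    = std::stod(fields[3]);
        row.close  = std::stod(fields[4]);
        row.volume = std::stol(fields[5]);
    } catch (...) {
        return false;   // non-numeric field
    }
    return true;
}

// Read all rows from a stream, skipping the header line
// and silently dropping rows that fail to parse.
std::vector<PriceRow> readCsv(std::istream& in) {
    std::vector<PriceRow> rows;
    std::string line;
    std::getline(in, line);   // header
    PriceRow row;
    while (std::getline(in, line))
        if (parseRow(line, row)) rows.push_back(row);
    return rows;
}
```

Having a concrete file like this in the workspace is exactly the context that lets Cascade generate a reader matched to the real column layout.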
Windsurf assistance
Windsurf offers two ways to assist in writing code. One is to provide ghost text in the editor, delivering real-time, context-aware code suggestions as you type. These suggestions, powered by Windsurf's understanding of your codebase, appear as inline hints, allowing you to accept them with the tab key to maintain your coding flow. Second, Windsurf's Cascade Chat/Write feature offers a conversational approach to coding assistance.
- Chat mode: Cascade won't make changes to your code, but will give code snippets that can be copied and pasted into your code
- Write mode: Cascade will make changes directly in the IDE editor. Use caution.
Cascade is accessible via cmd/ctrl+L or the icon in the IDE. This tool lets you interact with an AI agent that understands your project's context. Both methods are useful and empowered me to focus on the logic of my stock predictor while Windsurf handled the heavy lifting of code generation and error checking.
We began by prompting Windsurf Cascade with a clear high-level request:
I am an experienced C++ developer and I am learning CUDA programming.
I have an L40s GPU available on this system. I would like to write a
small CUDA program that will determine buy/sell/hold signals from
reading the historical data @stock_prices.csv file. The code
should be in @main.cu. The program main should call out to a method
moving_average_kernel to determine the moving_avg given a window_size,
and then a second kernel, detect_signals, to flag when the price rises,
falls, or stays flat.
After streaming some thought-process output in the Cascade chat, I got this response:
Perfect! I can see the CSV file contains 3,817 stock price records from
2010 to 2025 with timestamp, open, high, low, close, and volume data.
The main.cu file is currently empty and ready for implementation.
Based on your requirements and my experience with CUDA stock analysis
programs, I'll create a comprehensive CUDA program that implements both
the moving average calculation and signal detection kernels. Let me start
building this:
Windsurf responded immediately with context-aware insight, recognizing the structure and content of the stock_prices.csv file. It proposed a CUDA program with two core kernels: moving_average_kernel and detect_signals. After reviewing the response, I accepted the code suggestions directly in main.cu.
View the generated file here: https://github.com/richjohnson-wwt/stock-walk-blog/blob/main/src/main.cu
Windsurf then offers the command to compile the program:
nvcc -o stock_signal main.cu
We will eventually move to CMake for the build process, but for now, we'll accept the command to test the generated code. The compile succeeds, and Windsurf then offers the command to run:
./stock_signal
The output looks great:
=== CUDA Stock Signal Analysis ===
Reading stock price data from: ../data/stock_prices.csv
Loaded 3815 price points
Launching CUDA kernels...
Grid size: 15, Block size: 256
Moving Average kernel completed in 124.137 ms
Signal Detection kernel completed in 0.019 ms
Total GPU computation time: 124.157 ms
=== STOCK SIGNAL ANALYSIS ===
Window Size: 20
Total Data Points: 3815
Signal Legend: BUY(1), SELL(-1), HOLD(0)
Index Price MovAvg Signal
----------------------------------------
20 319.04 327.60 0
21 321.20 326.07 0
22 325.59 325.02 1
23 316.06 323.82 -1
24 305.30 323.08 -1
25 332.56 323.52 1
26 332.11 323.66 0
27 328.49 323.32 0
28 329.65 323.28 1
29 319.41 322.47 -1
...
3805 18.14 20.48 -1
3806 17.05 20.35 -1
3807 17.40 20.24 0
3808 17.46 20.09 0
3809 15.80 19.82 -1
3810 16.11 19.52 0
3811 19.20 19.44 0
3812 21.96 19.54 1
3813 23.83 19.71 1
3814 23.89 19.87 1
=== SIGNAL SUMMARY ===
BUY signals: 1030
SELL signals: 1307
HOLD signals: 1458
Total analyzed: 3795
Analysis complete!
This is a great starting point for the program we want to construct. But first, we take all the time needed to fully understand the generated code. If any of it is unclear, we can highlight that code and the Windsurf editor will pop up two options: Edit or Chat with the highlighted code. For example, we select the line around line 42, which is:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
We select the Chat option, and Windsurf puts this into the Cascade chat window:
@blog_stock_price_main.cu#L50 Can you tell me about this code?
We get a response that fully explains CUDA thread indexing which is crucial to understanding how CUDA runs the kernels in parallel on the GPU.
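The gist of that explanation: each block holds blockDim.x threads, so block b, thread t handles global element b * blockDim.x + t, and threads whose index lands past the end of the data do nothing. A small CPU simulation (illustrative, not the generated code) shows the mapping:

```cpp
#include <vector>

// Simulate the CUDA index calculation on the CPU:
//   int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Each (block, thread) pair claims exactly one element; threads whose
// idx lands past the end of the data are skipped by a bounds check.
int coveredElements(int gridDim, int blockDim, int n) {
    std::vector<int> hits(n, 0);
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            int idx = blockIdx * blockDim + threadIdx;
            if (idx < n) hits[idx] += 1;   // the kernel's bounds check
        }
    }
    int covered = 0;
    for (int h : hits) covered += (h == 1);
    return covered;   // equals n when every element is hit exactly once
}
```

With the launch configuration from the run above (grid size 15, block size 256), the 15 * 256 = 3,840 threads cover all 3,815 data points, each exactly once.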
CMake setup with Windsurf
Before we proceed with further development, we want to complete the CMake build configuration, and Windsurf can assist. Since we will be using Conan for package dependencies such as the Catch2 unit test framework, we first set up a Python virtual environment. This also gives Windsurf context about our build environment.
uv venv
source .venv/bin/activate
uv pip install conan
conan profile detect
With Catch2 added to the root conanfile.txt, we can now run:
conan install . \
--output-folder=build/debug \
--build=missing \
--settings=build_type=Debug
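For reference, a minimal conanfile.txt for this setup might look like the sketch below. The Catch2 version is an assumption; pin whichever release you actually use:

```ini
[requires]
catch2/3.7.1

[generators]
CMakeDeps
CMakeToolchain
```

The CMakeDeps and CMakeToolchain generators are what produce the files CMake consumes in the next step, which is how Windsurf is able to detect the Conan environment.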
Now we are ready for Windsurf to assist with the CMake setup. Here is the prompt in Cascade:
The next thing I would like help with in this blog-stock-walk project
is to add CMakeLists.txt files to the root and the src directory that
will build this CUDA project with nvcc
The response:
I can see you want to add CMake build system support for your CUDA project.
This will provide a more professional build setup that integrates well with
Conan and your development workflow. Let me update the plan to include the
CMake build system tasks.
Perfect! I'll help you create CMakeLists.txt files that integrate with your existing
Conan setup and properly build the CUDA project with nvcc. Let me examine the
project structure first and then create the CMake files.
Notice that Windsurf detected the Conan environment and built the CMake files accordingly. See the CMakeLists.txt files in the repo root and under the src directory on GitHub.
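In outline, a Conan-aware root CMakeLists.txt for a CUDA project typically looks something like this (a sketch of the general shape, not the repo's exact contents):

```cmake
cmake_minimum_required(VERSION 3.24)
project(stock_walk LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)

# Conan's CMakeToolchain generator writes conan_toolchain.cmake into the
# build folder; pass it via -DCMAKE_TOOLCHAIN_FILE (or a CMake preset)
# so find_package() can locate Catch2 and friends.

add_subdirectory(src)

enable_testing()
add_subdirectory(test)
```

Enabling the CUDA language up front is what lets CMake drive nvcc for the .cu sources without any hand-written compiler rules.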
Next, Windsurf prompts us to configure the project with the following command, run from the build/debug folder:
cmake ../.. -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Debug
The generation of Makefiles under the build/debug folder succeeded. Windsurf then offers the CMake command to build the project:
cmake --build . --parallel
The build succeeded, and Windsurf then offers to run the executable from build/debug:
./stock_signal
When running the application from the CMake build directory, the program couldn't locate the CSV file due to a relative path mismatch:
./stock_signal
=== CUDA Stock Signal Analysis ===
Reading stock data from: ../data/stock_prices.csv
Loaded 0 price points
Error: No data loaded from CSV file
Windsurf quickly diagnosed the issue and proposed an edit to the file path. While this temporarily resolved the issue, I wanted a more robust, maintainable solution. Windsurf helped me add a DATA_DIR variable to the root CMakeLists.txt, making the path configurable at build time. After rebuilding, the application ran as expected — now with a cleaner and portable setup.
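One way to wire this up (a sketch of the approach, not the repo's exact code) is to inject the path as a compile definition from CMake, e.g. `target_compile_definitions(stock_signal PRIVATE DATA_DIR="${CMAKE_SOURCE_DIR}/data")`, and fall back to the old relative path when the macro is absent:

```cpp
#include <string>

// DATA_DIR is expected to be supplied by the build system, e.g.:
//   target_compile_definitions(stock_signal PRIVATE
//       DATA_DIR="${CMAKE_SOURCE_DIR}/data")
// Fall back to the old relative path when it isn't defined.
#ifndef DATA_DIR
#define DATA_DIR "../data"
#endif

// Build the CSV path from the (possibly build-time-injected) data dir.
std::string csvPath() {
    return std::string(DATA_DIR) + "/stock_prices.csv";
}
```

Because the path is resolved at build time, the binary finds its data no matter which directory it is launched from.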
Unit test setup
The next task to delegate to Cascade is to create the CMakeLists.txt under the test directory and an initial unit test for opening and reading the stock price CSV file. I pause here to consider my project structure as it is, and where I want it to be after adding unit tests. Right now, a single main.cu CUDA file holds all of the code. I'd prefer to move the tested code into a separate file. My plan is to create that file as a placeholder and ask Cascade to move the CSV-reading code to the new location. I expect Cascade will add any needed includes.
I would like to add unit testing to this blog-stock-walk project. To prepare, I created an
empty file @stockWalk.cu . Can you help me move the csv reading code from @main.cu
to the @stockWalk.cu file including any #includes that are needed?
Cascade accepted the request. Four files were touched in this update:
- stockWalk.cu - received the relocated CSV-reading code
- stockWalk.cuh - a new header containing the function declarations, which is the correct way to split declarations from definitions in C++ and CUDA
- CMakeLists.txt - add_executable was updated to build the new source file
- main.cu - the CSV-reading code was removed and replaced with calls into its new location
It didn't add the code to the test/CMakeLists.txt, so we will ask for assistance with that:
Next, can you add the executable creation code to the test @CMakeLists.txt ?
The add_executable code was correctly added.
Next, can you help me add a unit test using Catch2 TEST_CASE that will test the
code in @stockWalk.cu for csv file reading?
Cascade added two test cases with multiple sections that exercise the CSV reader logic. It also set up a dedicated set of CSV test data, which I like. After building, we run the ctest command:
ctest
Test project /home/ubuntu/workspace/blog-stock-walk/build/debug
Start 1: StockData CSV Reading Tests
1/2 Test #1: StockData CSV Reading Tests ......... Passed 0.00 sec
Start 2: StockData Memory Management Tests
2/2 Test #2: StockData Memory Management Tests ... Passed 0.00 sec
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 0.01 sec
When AI goes too far: Avoiding overreach
One lesson I learned while working with Windsurf: don't give it too much control at once. When I asked Cascade to make what seemed like a small algorithm change, it went further than I expected, touching many files and leaving my project in an unstable state.
The takeaway is simple — start in Chat mode to discuss changes, break them into smaller steps, and only use Write mode once you're confident about the scope. This way, you stay in control of your codebase while still benefiting from Windsurf's speed and insight.
Debugging
Debugging CUDA kernels is outside the scope of this post, but I'll quickly touch on it for context.
Debugging GPU code is a very different challenge compared to regular C++ on the CPU. Kernels may run across thousands of threads in parallel; they launch asynchronously; and you don't have the same kind of step-through debugger support you'd expect on the CPU side.
Because of this, developers often rely on strategies like:
- printf debugging inside kernels (crude but effective in small doses)
- CUDA-MEMCHECK / Compute Sanitizer to catch memory and race errors
- NVIDIA Nsight tools for profiling and stepping through kernels at scale
Windsurf debugging setup
In VSCode, when you set up a C++ project, it pretty much hand-holds you into installing the official Microsoft C/C++ extension (with GDB/LLDB support baked in). The marketplace integration is smooth, so you don't really have to think much about which debugger extension you're using—it "just works."
In Windsurf, the extension marketplace isn't identical to VSCode's, so the exact Microsoft debugger extension might not show up. That doesn't mean you can't debug, though—you just have to pick a compatible one. In my case, I used the KylinIdeTeam C/C++ Debug (GDB) extension, and it worked fine with CUDA host-side code.
So the main difference is:
- VSCode: Extension selection is mostly automatic.
- Windsurf: You might need to manually choose a GDB debugging extension that works with your setup.
You can see our project setup for debugging on the following branch for the stock-walk-blog project: https://github.com/richjohnson-wwt/stock-walk-blog/tree/dual-avg
The three callouts are:
- File: .vscode → launch.json
- File: .vscode → tasks.json
- Marketplace extension install: KylinIdeTeam C/C++ Debug (GDB)
Once these updates are made, you can add breakpoints in VSCode and launch the debugger from the left sidebar "Run & Debug" tab. The drop-down in the "Run & Debug" sidebar will have the name of the debug configuration in the launch.json. For this project, it is
"name": "Debug Calc Tests",
Closing thoughts
This project shows how Windsurf can dramatically accelerate CUDA development. From setting up CMake and Conan, to generating CUDA kernels, to refactoring code for testing, the environment consistently removed boilerplate friction and let me focus on the real problem: building a GPU-accelerated stock signal predictor.
At the same time, I learned the importance of staying in control. AI assistants like Cascade are powerful, but they can overreach if you hand them too much at once. Breaking changes into smaller, well-scoped steps — often starting in Chat mode before switching to Write — proved to be the safest and most productive workflow.
Overall, Windsurf gave me a faster path into CUDA programming while reinforcing best practices for C++ development. I'm excited to expand this project with more advanced kernels and apply the same workflow to other GPU-heavy workloads.