Debug: Campaign List Not Loading In Production

by ADMIN

Hey guys! Ever faced the dreaded white screen of death in your production environment when trying to load your campaign listing? It's a common issue, and thankfully, one that can be tackled with a methodical approach. This guide is here to walk you through a comprehensive checklist to help you pinpoint and resolve the issue of campaign listings failing to load in your production environment. We'll dive deep into potential causes, from database configurations to deployment pipeline errors, and provide clear steps and code references to get you back on track. So, let's jump right in and get those campaigns loading!

Understanding the Problem

Before we get our hands dirty, let's clearly define the problem. In a production environment, the campaign listing page isn't loading any content. This means no campaign templates, no user campaigns, nada! The page just sits there, blank and unresponsive, like a digital ghost town. There's no error message to guide us, making it a silent failure – the kind that can make developers sweat. This usually means the frontend isn't getting the data it needs to render anything, or something is crashing silently on the frontend. We've got some screenshots that show this blank page, and now we need a plan of attack. This guide will give you a checklist of all the probable causes for this issue, along with what you need to verify for each cause, complete with references to the relevant code snippets. So, let’s get this sorted out!

Possible Causes & Checks

Let's break down the potential culprits behind this issue. We'll go through each possibility step by step, providing checks and insights to help you narrow down the cause. We'll cover everything from database woes to deployment hiccups and frontend gremlins. Remember, a systematic approach is key to debugging, so let’s dive in!

1. Database Configuration Issues

Database configurations are often the first place to look when things go wrong in production. A misconfigured database connection can bring your entire application to its knees. Think of it as the foundation of your house – if it's shaky, the whole thing is going to crumble. So, let’s make sure our foundation is solid.

Environment Variable Missing

The first thing to check is whether the DATABASE_URL environment variable is correctly set in your production environment. This variable is like the key to your database – without it, your application can't access the data. If it's missing, the backend will default to using a local SQLite database (sqlite:///./app.db). While SQLite is great for development, it’s generally a no-go for production, especially in a containerized environment. Using SQLite without persistent storage or proper permissions in production can lead to failures – the file might become read-only or get wiped out on restart.

Action: Make sure you have a valid and correctly formatted DATABASE_URL environment variable set in your production environment. This should point to your persistent database (like PostgreSQL) and include all the necessary credentials. Also, make sure you have the necessary database driver installed (e.g., psycopg2 for Postgres). This variable is crucial, so double-check it! We really need to ensure the app has a valid connection string to a persistent database.
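
To make a missing variable fail loudly instead of quietly falling back, a startup guard along these lines can help. This is a minimal sketch, not the app's actual code: resolve_database_url and its production flag are hypothetical names, and only the sqlite:///./app.db default comes from the behaviour described above.

```python
import os

SQLITE_FALLBACK = "sqlite:///./app.db"  # the default named in this guide


def resolve_database_url(env=os.environ, production=True):
    """Return the database URL, refusing the SQLite fallback in production."""
    url = env.get("DATABASE_URL", "").strip()
    if not url:
        if production:
            raise RuntimeError("DATABASE_URL is not set; refusing to fall back to SQLite")
        return SQLITE_FALLBACK
    if production and url.startswith("sqlite"):
        raise RuntimeError("DATABASE_URL points at SQLite, which may not persist here")
    return url
```

Calling something like this at the top of the startup sequence turns a blank page into an explicit crash with a readable message in the container logs.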

Incorrect Connection String

Even if the DATABASE_URL is present, it could be malformed or pointing to an unreachable host. Think of it like having the right key but trying to open the wrong door. If the connection string is incorrect, the application’s database engine initialization will fail silently. The FastAPI startup event calls init_db() on launch, and if the database can't be reached or authenticated, this initialization might fail without a clear error message. This can be super frustrating because it prevents data from loading without yelling about it.

Action: Carefully verify that the connection string uses the correct dialect (e.g., postgresql://), credentials, and host. Also, ensure that network access (firewall/VNet rules) allows your application to reach the database. Think of it as making sure your app has a clear path to the database server. Double-check the user name, password, database name, host, and port in your connection string. A single typo can cause a silent failure!
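
A quick way to catch typos before blaming the network is to parse the connection string and look at each part. The sketch below uses only the standard library; connection_string_problems is a hypothetical helper, and it assumes a PostgreSQL target. It checks the shape of the URL, not whether the host is actually reachable.

```python
from urllib.parse import urlsplit


def connection_string_problems(url):
    """Return a list of human-readable problems found in a database URL."""
    problems = []
    parts = urlsplit(url)
    # SQLAlchemy URLs may carry a driver suffix, e.g. postgresql+psycopg2.
    dialect = parts.scheme.split("+")[0]
    if dialect != "postgresql":
        problems.append(f"unexpected dialect: {dialect or '(none)'}")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.username or parts.password is None:
        problems.append("missing user name or password")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    try:
        _ = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        problems.append("invalid port")
    return problems
```

An empty list means the string is at least well formed; the firewall and credential checks above still apply.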

Azure DB/Security Config

If you're using an Azure managed database, you need to ensure that your application's managed identity or connection string is properly authorized to access the database. It’s like giving your app the right permissions to enter the building. A common mistake is forgetting to allow the container's IP or identity access to the database. This results in queries failing at runtime, often caught only as generic exceptions, which doesn’t give us much to go on.

Action: Check your production logs for any database connection errors around the time the application starts or when the listing API is called. In Azure, this usually involves checking the firewall settings, the managed identity configurations, and the network security group rules. Ensure that your app's managed identity has the database roles it actually needs (e.g., db_datareader and db_datawriter for reads and writes; grant something as broad as db_owner only if the app really must create tables at startup). If you see authorization errors in your logs, you've probably found your culprit.

2. Missing Migrations or Schema Mismatch

Missing migrations and schema mismatches are classic culprits when your application behaves differently in production than in development. It's like trying to fit a square peg in a round hole – if your database schema doesn't match your application's expectations, things are bound to break.

Unapplied Schema Changes

The backend usually uses SQLAlchemy's Base.metadata.create_all() to create tables on startup. This is a good thing, but with a catch! It creates tables if they don’t exist, but it doesn't migrate or alter existing tables. This means if your production database was initialized with an older schema and hasn’t been recreated or migrated, it might be missing columns or tables needed by the current code.

For example, the Campaign model includes fields like is_template, is_custom, and template_id. If any of these are missing in the actual database schema, queries could fail or return incorrect data, leading to a blank page. This is like having a blueprint for a house that doesn’t match the actual house that’s built.

Action: Inspect the production database schema against your backend's model definitions (e.g., backend/app/models/db_models.py) to ensure all fields exist and have the correct types. Running the application with debug logs or using a migration tool (like Alembic) can reveal if the schema is out of sync. You can also manually query the database schema to check for the existence of tables and columns. If you find discrepancies, you'll need to apply the necessary migrations to bring your database schema up to date.
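
The manual schema check can be scripted. Here is a minimal sketch using an in-memory SQLite database to stand in for production; the expected column set is an assumption for illustration (only is_template, is_custom, and template_id are named above), and missing_columns is a hypothetical helper. On PostgreSQL the same comparison would read from information_schema.columns instead of the SQLite-specific PRAGMA.

```python
import sqlite3

# Hypothetical expectation; in practice, derive this from the Campaign model.
EXPECTED_CAMPAIGN_COLUMNS = {"id", "name", "is_template", "is_custom", "template_id"}


def missing_columns(conn, table, expected):
    """Return the expected column names that the live table does not have."""
    # PRAGMA table_info is SQLite-specific; row[1] is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    actual = {row[1] for row in rows}
    return sorted(expected - actual)


# Simulate a database that was created from an older schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (id TEXT PRIMARY KEY, name TEXT, is_template INTEGER)")
print(missing_columns(conn, "campaigns", EXPECTED_CAMPAIGN_COLUMNS))
# prints ['is_custom', 'template_id']
```

Because create_all() skips tables that already exist, a non-empty result here is exactly the situation it will never repair on its own.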

Missing Template Data

The application is designed to automatically create default campaign templates on startup using campaign_service.create_template_campaigns(). These templates are essential for the application to function correctly. If the database schema was incorrect or the insertion failed (perhaps due to a constraint issue), the templates list could be empty. This is like opening a restaurant and finding out you have no menu!

Action: Verify that the default templates (e.g., "Lost Mine of Phandelver", "Dragon Heist") are present in the campaigns table. The code checks for existing templates by name and inserts them if they are not present. If none of these entries exist in production, it indicates that the startup template seeding didn't run or failed. This could be due to a database error that was swallowed. In this case, fix any database issues and restart the application to seed the templates, or manually insert the template campaigns. You can also check the application logs for any errors during startup related to template creation.
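
Checking for the seeded templates is a one-query job. The sketch below assumes a campaigns table with name and is_template columns (the real table has more) and uses in-memory SQLite as a stand-in; missing_templates is a hypothetical helper, and the two template names are the examples mentioned above.

```python
import sqlite3

EXPECTED_TEMPLATES = ["Lost Mine of Phandelver", "Dragon Heist"]


def missing_templates(conn, expected_names):
    """Return the template names that have no is_template row in campaigns."""
    rows = conn.execute("SELECT name FROM campaigns WHERE is_template").fetchall()
    present = {row[0] for row in rows}
    return [name for name in expected_names if name not in present]


# Simulate a database where seeding only partly succeeded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (name TEXT, is_template INTEGER)")
conn.execute("INSERT INTO campaigns VALUES ('Dragon Heist', 1)")
print(missing_templates(conn, EXPECTED_TEMPLATES))
# prints ['Lost Mine of Phandelver']
```

If every expected name comes back as missing, the seeding step most likely never ran or failed on its first insert.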

Index/Relationship Issues

While not likely to cause a complete failure to load, poor indexing or broken foreign key relationships can definitely impact data retrieval. It's like having a library where the books are all out of order – it's hard to find what you're looking for.

For instance, NPCs and interactions reference campaigns by ID via foreign keys. If referential integrity is violated (e.g., an NPC refers to a campaign ID that doesn't exist), certain ORM queries might throw exceptions. Although this is less likely to completely break the listing, it's worth checking.

Action: Check the database for any integrity issues (or run an integrity check) to rule out errors that could bubble up when querying campaigns. You can use database-specific commands or tools to check for foreign key constraint violations and orphaned records. Also, review the indexes on your tables, especially on columns used in queries for campaign listing. Adding appropriate indexes can significantly improve query performance.
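
Orphaned rows are easy to find with a LEFT JOIN. This is a sketch under assumptions: the table name npcs and its campaign_id column are guesses at the real schema, and SQLite is used only so the example runs on its own.

```python
import sqlite3


def orphaned_npcs(conn):
    """Return (npc_id, campaign_id) pairs that point at a missing campaign."""
    query = """
        SELECT n.id, n.campaign_id
        FROM npcs AS n
        LEFT JOIN campaigns AS c ON c.id = n.campaign_id
        WHERE n.campaign_id IS NOT NULL AND c.id IS NULL
        ORDER BY n.id
    """
    return conn.execute(query).fetchall()


# Tiny in-memory example: one valid NPC and one orphan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE npcs (id TEXT PRIMARY KEY, campaign_id TEXT)")
conn.execute("INSERT INTO campaigns VALUES ('c1')")
conn.executemany("INSERT INTO npcs VALUES (?, ?)", [("n1", "c1"), ("n2", "gone")])
print(orphaned_npcs(conn))
# prints [('n2', 'gone')]
```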

3. Deployment Pipeline Errors

Deployment pipelines are the arteries through which your code flows into production. If there's a blockage in the pipeline, your code might not make it to its destination, or it might arrive incomplete. Let's make sure the delivery is smooth.

Missing Migrations in Pipeline

If you're using a CI/CD pipeline (and you should be!), you need to ensure that any database migration or initialization steps are included in your deployment process. It's like having a recipe but forgetting to include a crucial ingredient. The deployment process should run the backend with the startup event so that init_db() and template creation occur.

If the backend was deployed but started without running the startup sequence (or crashed before completing it), the database might not be ready. This is a very common gotcha!

Action: Review the deployment logs to see if the backend container reported running the startup tasks. Look for log lines like “Initializing database...” and “Creating default campaign templates...” on startup. If these logs are absent or show errors, the issue likely occurred during deployment. You may need to adjust your deployment pipeline to explicitly run migrations or ensure the startup sequence completes successfully.
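
If you have the container logs as text, the check for those two startup lines can be automated. A small sketch, assuming the log messages match the wording quoted above exactly; missing_startup_steps and the sample log are made up for illustration.

```python
EXPECTED_STARTUP_LINES = [
    "Initializing database...",
    "Creating default campaign templates...",
]


def missing_startup_steps(log_text, expected=EXPECTED_STARTUP_LINES):
    """Return the expected startup messages that never appear in the log."""
    return [line for line in expected if line not in log_text]


# Illustrative log from a container that died between the two steps.
sample_log = """\
INFO  Starting application
INFO  Initializing database...
ERROR connection refused
"""
print(missing_startup_steps(sample_log))
# prints ['Creating default campaign templates...']
```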

Environment Variables in Pipeline

The GitHub Actions workflow sets REACT_APP_API_URL for the frontend build to point to the backend URI. If this value isn’t correctly passed or is empty, the frontend won't know where to fetch data. It's like trying to call someone without knowing their phone number. Also, make sure the backend deployment is supplying all required environment variables (database URL, Azure OpenAI keys, etc.). A frequent deployment oversight is not configuring the production environment with the same variables used in development.

Action: Double-check your infrastructure-as-code (e.g., Bicep templates) to see that DATABASE_URL and other critical settings are provided to the backend container. If the pipeline deploys the backend without a valid DATABASE_URL, the application will run using SQLite by default, which might not persist or function as expected in Azure. Additionally, make sure REACT_APP_API_URL is correctly set in your production build and points to the correct backend endpoint.

Azure OpenAI Config Gating

The application requires Azure OpenAI credentials for certain operations. During startup, init_settings() will throw a ValueError if the OpenAI settings are missing, stating that the demo “requires proper Azure OpenAI setup”. Even though listing campaigns might not directly call the AI, the startup still calls init_settings() unconditionally.

If production doesn't have these environment variables, the entire application initialization could fail. It’s like having a fancy car that won't start because you forgot to put gas in it, even if you only wanted to use the radio.

Action: Check that AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_CHAT_DEPLOYMENT, and AZURE_OPENAI_EMBEDDING_DEPLOYMENT are set in production. If not, either provide dummy valid values to let the application start, or adjust the code to not hard-require them for non-AI operations. This can involve modifying the startup sequence to conditionally initialize AI-related components.
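
One way to stop the AI settings from gating the whole app is to check them up front and degrade instead of raising. The sketch below is an assumption about how that could look, not the project's init_settings(): missing_ai_settings and ai_features_enabled are hypothetical names, while the four variable names are the ones listed above.

```python
import logging
import os

logger = logging.getLogger(__name__)

REQUIRED_AI_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_DEPLOYMENT",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
)


def missing_ai_settings(env=os.environ):
    """Return the names of required Azure OpenAI variables that are unset or blank."""
    return [name for name in REQUIRED_AI_SETTINGS if not env.get(name, "").strip()]


def ai_features_enabled(env=os.environ):
    """Warn and disable AI features instead of aborting startup."""
    missing = missing_ai_settings(env)
    if missing:
        logger.warning("AI features disabled; missing settings: %s", ", ".join(missing))
        return False
    return True
```

With a guard like this, the campaign listing can still start up and serve data while the warning in the logs tells you exactly which variables to add.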

4. Poor Database Modeling or Missing Indexes

Poor database modeling and missing indexes can lead to performance bottlenecks, especially as your data grows. It's like trying to find a specific grain of sand on a beach – without proper organization, it's going to take forever. Let's optimize our beach!

Performance Bottlenecks

The data modeling (e.g., using JSON columns and storing full campaign data in a single field) could lead to slow queries as the data grows. This is like trying to read an entire book at once instead of chapter by chapter. In production, a very slow query might time out and result in no data being returned, which would explain a blank page.

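The campaign listing query filters by booleans and orders by creation date, which should be fast on a small table. The primary-key index on id doesn't help with that sort, though, so on a large table it's an index on created_at that keeps the query quick.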

Action: Ensure that the campaigns table has an index on created_at if it's large. If the table is huge and unindexed on the sort/filter columns, the request might be timing out. Check the query execution on the production database and add indexes if needed. You can use database-specific tools to analyze query performance and identify missing indexes.
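
Adding the index is a single statement. The sketch below runs it against in-memory SQLite and lists the table's indexes to confirm it exists; the index name ix_campaigns_created_at is an arbitrary choice, and the two-column table is a stand-in for the real one. On PostgreSQL you would confirm the effect with EXPLAIN ANALYZE on the listing query rather than with the SQLite PRAGMA used here.

```python
import sqlite3


def ensure_created_at_index(conn):
    """Create the created_at index if absent and return the table's index names."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS ix_campaigns_created_at ON campaigns (created_at)"
    )
    # PRAGMA index_list is SQLite-specific; row[1] is the index name.
    return [row[1] for row in conn.execute("PRAGMA index_list(campaigns)")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (id TEXT PRIMARY KEY, created_at TEXT)")
index_names = ensure_created_at_index(conn)
print("ix_campaigns_created_at" in index_names)
# prints True
```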

Logical Data Issues

The model flags is_template and is_custom are used to categorize campaigns. The code assumes a campaign is either a template or custom (or both, in the case of including all). If, due to a bug, a campaign ended up with both flags set to false, it would be omitted from list_campaigns() results since the OR filter would exclude it. This could lead to an empty list being returned if, say, all templates failed to be marked correctly.

Action: Review the data in the campaigns table to ensure that all entries have sensible flag values (templates should have is_template=true, is_custom=false, user campaigns vice versa). Any anomalies should be corrected via a migration or script. You can write SQL queries to identify campaigns with inconsistent flag values and update them accordingly.
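
The anomaly query for flag values might look like the sketch below. It assumes the two flag columns described above and an id column, and it runs on in-memory SQLite for demonstration; campaigns_without_flags is a hypothetical helper. The IS NOT TRUE form is used so that NULL flags are caught as well as false ones.

```python
import sqlite3


def campaigns_without_flags(conn):
    """Return ids of campaigns that are neither template nor custom."""
    # IS NOT TRUE also matches NULL, which a plain "= 0" comparison would miss.
    query = """
        SELECT id FROM campaigns
        WHERE is_template IS NOT TRUE AND is_custom IS NOT TRUE
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (id TEXT, is_template INTEGER, is_custom INTEGER)")
conn.executemany(
    "INSERT INTO campaigns VALUES (?, ?, ?)",
    [("a", 1, 0), ("b", 0, 1), ("c", 0, 0), ("d", None, None)],
)
print(campaigns_without_flags(conn))
# prints ['c', 'd']
```

Any id returned here is a campaign the OR filter in list_campaigns() would silently skip.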

5. Broken or Malformed SQL/ORM Queries

Broken or malformed SQL/ORM queries are like speaking the wrong language to your database – it's not going to understand you, and you won't get the data you need. Let's make sure we're speaking the same language.

Template Listing Query

The API endpoint for templates (/api/game/campaign/templates) simply fetches all campaigns where is_template is true. This is implemented in CampaignService.get_templates(), which queries the database for CampaignDB.is_template. If this query fails (due to connection issues or an ORM error), an exception will be raised and translated to an HTTP 500 error with the detail “Failed to get templates: ...”.

The frontend should catch a 500 and display an error message. If, instead, we see a silent hang, it suggests the request may not even be completing. This is where things get sneaky. We need to dig deeper.

Action: Check the backend logs when a templates fetch is triggered. If there’s a stack trace or error (e.g., malformed SQL or attribute error), that’s likely the culprit. Common issues might include using an incompatible SQL dialect in a query, or a failure in converting the SQLAlchemy result to a Pydantic model. Note that get_templates() converts database models to Pydantic Campaign objects via a dict_to_campaign function. If the data in the database doesn’t conform to the expected schema (say, a field has an unexpected type), the Pydantic model validation could throw an error. This would also surface as a 500. Validating one of the stored CampaignDB.data JSON blobs against the Campaign schema might reveal such issues.
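
To spot-check a stored blob without touching the app, you can run a rough shape check with the standard library. This is only a stand-in for real validation: the REQUIRED_FIELDS mapping is a guess at part of the Campaign schema, and blob_problems is a hypothetical helper. Running the blob through the actual Pydantic Campaign model would be the more faithful test.

```python
import json

# Hypothetical required fields; the real list comes from the Campaign schema.
REQUIRED_FIELDS = {"id": str, "name": str, "is_template": bool}


def blob_problems(raw):
    """Return a list of problems found in one stored campaign JSON blob."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError) as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"{field} has type {type(data[field]).__name__}")
    return problems


print(blob_problems('{"id": 7, "name": "Dragon Heist"}'))
# prints ['id has type int', 'missing field: is_template']
```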

All Campaigns Listing

Similarly, if the frontend uses the /api/game/campaigns endpoint (which returns both custom and template campaigns), issues in that route could affect listing. The list_campaigns() service method uses an or_ filter for flags. SQLAlchemy’s usage of or_(*conditions) is correct, but if the conditions list ends up empty or the flag columns are not boolean types in the database, the filter could misbehave.

Action: Ensure the production database columns for is_template and is_custom are boolean types. In some cases, a boolean in SQLite could be stored as 0/1 integers; SQLAlchemy should handle that, but if not, the truthiness of the filter might differ. No obvious bug is present in the query code, but it’s worth running a quick query test in production (e.g., via a database client) to ensure that selecting campaigns with those conditions returns results as expected. You can also add logging around the query execution to inspect the generated SQL and the results.

6. API Endpoint or Route Misconfiguration

API endpoint and route misconfigurations are like having the right address but a broken GPS – you might be close, but you'll never actually arrive. Let’s get our directions straight.

URL Prefix Mismatch

The frontend calls the campaign templates API via GET /game/campaign/templates (the React code uses the base URL then /game/campaign/templates). The backend, however, mounts the router with a prefix /api/game. This means the actual endpoint is /api/game/campaign/templates. In production, if REACT_APP_API_URL is set to the root of the backend (e.g., https://example.com without /api), the frontend request URL becomes https://example.com/game/campaign/templates, which would return a 404 error.

This kind of misrouting would cause the frontend call to fail. Axios rejects non-2xx responses by default, so the 404 surfaces as a failed request with an error response rather than as data.

Action: Verify what REACT_APP_API_URL is in the production build. It should include the /api path if the backend is not automatically prefixing routes. Another approach is to ensure the API gateway or Azure Static Web App config forwards /game/* to the backend’s /api/game/*. In the infrastructure config, the static web app sets REACT_APP_API_URL to the backend URL – confirm if that URL already contains /api or not. Resolving this mismatch may simply involve updating the environment variables or the code to align on a consistent path.
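
The mismatch is easy to reproduce on paper. The sketch below rebuilds the URL the frontend would request for a given REACT_APP_API_URL and checks whether its path falls under the backend's /api/game prefix. The two path constants come from the description above; the function names are hypothetical.

```python
from urllib.parse import urlsplit

BACKEND_PREFIX = "/api/game"                 # router prefix on the backend
TEMPLATES_PATH = "/game/campaign/templates"  # path the frontend appends


def frontend_request_url(api_base):
    """Build the URL the frontend would call for a given REACT_APP_API_URL."""
    return api_base.rstrip("/") + TEMPLATES_PATH


def will_hit_backend_route(api_base):
    """Return True if the resulting path falls under the backend's prefix."""
    path = urlsplit(frontend_request_url(api_base)).path
    return path.startswith(BACKEND_PREFIX + "/")


print(frontend_request_url("https://example.com"))
# prints https://example.com/game/campaign/templates
print(will_hit_backend_route("https://example.com"))
# prints False
print(will_hit_backend_route("https://example.com/api"))
# prints True
```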

CORS Configuration

The backend enables CORS (Cross-Origin Resource Sharing) for all origins during startup, which should allow the frontend (on a different domain) to call it. However, you need to ensure that this code is actually running in production. If the environment variable APP_DEBUG or others altered the behavior (e.g., if in production they intended to lock down CORS), a misconfigured CORS could block the requests. This is like having a gatekeeper that's not letting anyone in.

Action: Check the network calls in the browser’s developer console – if you see CORS errors (blocked by policy), then the issue lies there. The solution would be to adjust allow_origins to include the production frontend’s domain or simply keep it as ["*"] for now. You should also check your backend logs for any CORS-related warnings or errors.
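
If you'd rather check this outside the browser, send a request with an Origin header (from a script or an HTTP client) and inspect the response headers. The helper below is a minimal sketch of the comparison a browser performs for a simple request; cors_allows is a hypothetical name and https://app.example.com is a placeholder for your frontend's origin.

```python
def cors_allows(response_headers, origin):
    """Return True if the response headers permit the given browser origin."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    headers = {name.lower(): value for name, value in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    return allowed == "*" or allowed == origin


frontend_origin = "https://app.example.com"  # placeholder origin
print(cors_allows({"Access-Control-Allow-Origin": "*"}, frontend_origin))
# prints True
print(cors_allows({"Content-Type": "application/json"}, frontend_origin))
# prints False
```

A False result for a response that otherwise looks healthy means the browser will discard it, which matches the "blocked by policy" message in the console.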

API Not Deployed/Running

It might sound basic, but ensure the backend container is actually up and the /api/game/campaign/templates route is reachable. If the static frontend was deployed but the backend failed to deploy or start, all API calls will fail. This is like setting up a store but forgetting to open the doors.

Action: Hitting the health check (/api/health) directly in production can verify if the backend is running. If it’s down, focus on why (e.g., crash on startup, which could be due to the Azure config issue above or other exceptions). Logs from the container app will be crucial in diagnosing this. You can also check your deployment platform’s dashboard to ensure that your backend services are running and healthy.
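
A scripted version of that health check, using only the standard library, might look like this. It is a sketch: check_health is a hypothetical helper, and it assumes the health route lives at /api/health as mentioned above and answers 200 when healthy.

```python
import urllib.error
import urllib.request


def check_health(base_url, timeout=5.0):
    """Return (ok, detail) for a GET against the backend health route."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (404, 500, ...).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, or timeout: the backend is unreachable.
        return False, f"unreachable: {exc}"
```

The two failure branches point in different directions: an HTTP error status means the container is up but unhappy, while "unreachable" means nothing is listening at all.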

7. Frontend/Backend Integration Mismatch

Frontend/Backend Integration Mismatches are like a couple speaking different languages – they might be trying to communicate, but they're not quite understanding each other. Let's translate!

Response Format Changes

The frontend expects the templates API to return JSON in the form { templates: [...] }. This is defined in the code and used as response.data.templates. The current backend implementation does return exactly that shape. However, if there was any change in the API (for example, if you considered returning both campaigns and templates together at some point), the frontend might be calling the wrong endpoint or parsing the response incorrectly.

Action: Double-check that the frontend is calling the /campaign/templates endpoint and not the combined /campaigns endpoint in production (the code suggests it calls the templates one). Also, confirm the backend is returning the expected fields. Any discrepancy here could cause the frontend code to throw an error: if response.data.templates is undefined because the key is different, the .map over templates would throw a TypeError before the error state is ever set. In such a case, the React component might crash or simply not update state, resulting in a blank page. Use the browser’s network inspector to look at the actual response payload from the templates API and ensure it contains the templates array.

Version Misalignment

Ensure the deployed frontend and backend come from the same build. If the frontend is older, it might call a route that no longer exists or miss required data. If the backend is older, the frontend could be expecting a feature that isn’t there. This is like trying to assemble a puzzle with pieces from different sets. For example, if the frontend is expecting template campaigns but the backend hadn’t yet implemented them (or vice versa), that would cause an empty or failing result.

Action: This is less likely if you deploy them together from the main branch, but it's worth confirming. The issue could simply be resolved by redeploying the latest compatible versions of both. You can also use versioning or tagging in your deployment pipeline to ensure that the frontend and backend versions are always aligned.

8. Silent Errors and Unhandled Exceptions

Silent errors and unhandled exceptions are the ninjas of debugging – they strike without warning and leave no trace. But fear not, we can learn to detect them!

Frontend Silent Failures

The React CampaignGallery component does set an error message if the fetch fails. If nothing at all is rendering (no “Failed to load” message), it implies that the code might have crashed before updating the state. This is like a tree falling in the woods: if no one hears it, does it make a sound?

Action: Check the browser console for a React error (such as “Cannot read property ‘templates’ of undefined” or similar). A runtime error would prevent the component from rendering any fallback UI. This can happen if, for instance, the getCampaignTemplates() succeeded but returned an unexpected value that breaks downstream. Additionally, ensure that the component is indeed mounted – if the routing in production is different (say, the user isn’t navigating to the correct page due to a routing config), the gallery might not be loading at all. Verify the React app’s behavior via console logs or adding temporary debug logs around the data load. This can help identify if the useEffect is running and if it hits the catch block.

Backend Exceptions Suppressed

It’s possible the backend encountered an exception that wasn’t logged clearly. For example, if campaign_service.get_templates() raised an error outside the try/except in the route, it might have aborted the response without sending the error JSON. The FastAPI route code should catch exceptions and convert them to HTTP 500 errors, but a lower-level problem (like a missing await or an event loop issue) could break that flow.

Action: Review server logs around the time of requests. If you see an error trace without a matching HTTP response, consider adding more logging or wrapping the service call in a broader try/except to surface the issue. Also, confirm how the FastAPI app reports errors: in debug mode it returns a traceback in the response, but in production an unhandled exception typically comes back as a generic 500 Internal Server Error with no detail. Enabling more verbose logging temporarily in production (e.g., set APP_LOG_LEVEL=DEBUG) can help catch silent failures in the logs.
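
For the extra logging, a small decorator that records the full traceback and then re-raises is usually enough. This is a sketch with made-up names (log_exceptions, the "campaigns" logger, and the stand-in get_templates); it wraps plain synchronous functions, so an async route or service method would need an async variant of the wrapper.

```python
import functools
import logging

logger = logging.getLogger("campaigns")


def log_exceptions(func):
    """Log the full traceback of any exception, then re-raise it unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception logs at ERROR level and attaches the traceback.
            logger.exception("Unhandled error in %s", func.__name__)
            raise
    return wrapper


@log_exceptions
def get_templates():
    # Stand-in for the real service call; always fails for demonstration.
    raise RuntimeError("database unavailable")
```

Because the exception is re-raised, the route's existing error handling still runs; the only change is that the failure now leaves a trace in the logs.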

Logging Configuration

The logging is set up at startup with a certain level. If it’s not verbose, some warnings or errors might not be printed. Make sure the production log level is at least INFO (default) or DEBUG when troubleshooting. The code logs key events like database initialization and template creation. If those logs are missing, the failure likely happened before or during those steps. If they are present, the failure might be at request time rather than startup.

Action: Use this information to narrow down where things go wrong (startup vs. runtime). You can adjust the logging level in your environment variables or configuration files. Also, make sure that your logs are being stored and accessible in your production environment.
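
Driving the log level from an environment variable can be as simple as the sketch below. APP_LOG_LEVEL is the variable mentioned above; resolve_log_level, configure_logging, and the log format are assumptions, and the app's real setup may differ.

```python
import logging
import os

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def resolve_log_level(env=os.environ):
    """Map APP_LOG_LEVEL to a logging constant, defaulting to INFO."""
    name = env.get("APP_LOG_LEVEL", "INFO").strip().upper()
    return LEVELS.get(name, logging.INFO)


def configure_logging(env=os.environ):
    """Apply the resolved level to the root logger at startup."""
    logging.basicConfig(
        level=resolve_log_level(env),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
```

Falling back to INFO on an unrecognised value keeps a typo in the variable from silencing the startup messages you're trying to find.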

Conclusion / Next Steps

By methodically checking each of the above areas, you can identify why the campaign listing fails to load in production. Start with environmental and configuration issues (they are common culprits for environment-specific bugs), then verify database setup and data integrity, and finally, inspect the integration between the frontend and backend. Each potential cause above names the relevant functions, files, and endpoints to aid in locating the implementation for further investigation.

Going through this checklist and ruling out each item will greatly help in pinpointing the root cause and getting the campaign listing to load properly. Debugging can be challenging, but with a systematic approach, you'll conquer this issue. So, roll up your sleeves, dig into the logs, and get those campaigns loading!
