Classifying verbatims

Classifying open-ended verbatim transcripts (levels 1 and 2) with SmartInterview

Once the data has been collected, whether via SmartInterview or an external file, the next step is to transform the open-ended responses into usable thematic codes. SmartInterview allows you to:
  • Define a coding plan with 1 or 2 levels of depth (a third level is coming soon)
  • Automatically generate themes using AI, with customized instructions
  • Precisely control the number of codes per respondent using a rules system
  • Import a training set to guide classification
  • Pre-classify a sample and correct the results
  • Run the full classification on all responses
  • Evaluate code quality with a MECE (mutually exclusive, collectively exhaustive) correlation matrix
  • Export results to Excel
  • Analyze results on the dashboard
This article explains, step by step, how to perform a 1 or 2-level classification in the platform.

1. Choose the data source

The classification accepts two data sources:

| Source | Usage | When to use it |
|---|---|---|
| SmartInterview survey | Select an existing survey, then an open-ended question | You collected responses via SmartInterview |
| Excel file | Import a file containing the transcripts | You have data from an external tool |

Excel file:

[Screenshot: Excel file import]

Survey:

[Screenshot: survey selection]

File import: column selection

When importing a file, you must tell the system:
  • The respondent column (unique identifier for each respondent)
    • If your file does not contain an identifier, choose “No column”: the system will automatically number the respondents from 1 to N
  • The column of responses to be classified (the verbatim comments)
_Tip: The system automatically detects common columns (Respondent_ID, Serial, Responses, Answer, etc.) from the file headers._
[Screenshot: selecting columns in the configuration]
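The header auto-detection described in the tip above can be approximated with a simple matching heuristic. This is a minimal, hypothetical sketch: the candidate name lists are assumptions for illustration, not the platform's actual lists.

```python
# Hypothetical header auto-detection: match well-known candidate
# names case-insensitively against the imported file's headers.
ID_CANDIDATES = {"respondent_id", "serial", "id", "respondent"}
ANSWER_CANDIDATES = {"responses", "answer", "response", "verbatim"}

def detect_columns(headers):
    """Return (id_column, answer_column); either may be None."""
    id_col = answer_col = None
    for h in headers:
        key = h.strip().lower()
        if id_col is None and key in ID_CANDIDATES:
            id_col = h
        if answer_col is None and key in ANSWER_CANDIDATES:
            answer_col = h
    return id_col, answer_col
```

When no identifier column is found (the "No column" case), the system falls back to numbering respondents from 1 to N.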

2. Choose the classification depth

Depth: 1 level (L1 only)

A flat list of main themes. Each answer is associated with one or more themes.
Use case: exploratory studies, initial rapid analysis, short verbatim transcripts.

Depth: 2 levels (L1 + L2)

Main themes (L1) with attached sub-themes (L2). The structure is hierarchical: each sub-theme belongs to a single parent theme.
Use case: in-depth studies requiring fine granularity, distinguishing nuances within the same theme, coding that conforms to market research standards.
In this example:
  • Ease of use → L1 (main theme)
  • Lines with an ID → L2 (sub-themes)

3. Define the coding plan

You have two ways to create your coding plan:
  • A - Import an Excel codeframe (as in the example)
  • B - Let the AI generate the themes

Option A: Import the codes via Excel

If you already have a codeframe, import it directly.

Format for 1 level

A file with at least one column containing the theme labels:
| ID | Label |
|---|---|
| 1 | Interface is intuitive |
| 2 | Ease of use |
| 3 | Performance is fast |
| 4 | App crashes |
| 5 | Nothing |

Format for 2 levels

The file must be structured hierarchically with L1 and L2. The system automatically detects the ID and Label columns from the headers.

Option 1: Separate level columns (in an Excel sheet):
| L1 | L2 |
|---|---|
| Ease of use | Interface is intuitive |
| Ease of use | Navigation is confusing |
| Ease of use | Easy to complete tasks |
| Ease of use | Sensation is smooth |
| Ease of use | The shape is nice |
Option 2: With identifiers and a Parent_ID column, for example:

| ID | Label | Parent_ID |
|---|---|---|
| 1 | Ease of use | |
| 1.1 | Interface is intuitive | 1 |
| 1.2 | Navigation is confusing | 1 |
💡 Tip: You must store the topics in a separate sheet of your Excel file (e.g., a “Topics” tab). The system will prompt you to select the sheet containing the codes.
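Both formats reduce to the same hierarchy: each L2 sub-theme attached to a single L1 parent. A minimal sketch of parsing "Option 1" rows, assuming each row is an (L1 label, L2 label) pair (the function name and shapes are illustrative):

```python
# Hypothetical sketch: build a hierarchy from rows of
# (L1 label, L2 label), attaching each sub-theme to its
# single parent theme and skipping duplicates.
def build_codeframe(rows):
    frame = {}  # L1 label -> ordered list of L2 labels
    for l1, l2 in rows:
        frame.setdefault(l1, [])
        if l2 and l2 not in frame[l1]:
            frame[l1].append(l2)
    return frame
```

An L1 theme appearing with an empty L2 cell simply gets an empty sub-theme list, matching the hierarchical rule that every sub-theme has exactly one parent.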

Preview and filtering

After the import, a preview of the code frame is displayed with:
  • The number of topics detected (updated automatically)
  • The ability to filter by column (useful for excluding certain categories)
  • The ability to manually exclude individual rows

Option B: Generate the codes using AI

If you do not have a pre-existing coding plan, the AI analyzes a sample of your responses and automatically discovers recurring themes.

How it works

  1. The system samples up to 400 responses from your file.
  2. AI identifies recurring themes and formulates them into clear labels.
  3. The themes are sorted by estimated frequency (the indicative number of respondents concerned).
  4. The themes are automatically numbered (sequential IDs).
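The bookkeeping around the generation step (the 400-response sampling cap, frequency ordering, and sequential numbering) can be sketched as follows. The AI theme-discovery step itself is external; all function names here are illustrative assumptions:

```python
import random

# Hypothetical sketch of the generation pipeline's bookkeeping.
def sample_responses(responses, limit=400, seed=0):
    """Sample up to `limit` responses; keep everything if fewer."""
    if len(responses) <= limit:
        return list(responses)
    return random.Random(seed).sample(responses, limit)

def number_themes(themes):
    """themes: list of (label, estimated_frequency) pairs.
    Sort by estimated frequency, then assign sequential IDs."""
    ranked = sorted(themes, key=lambda t: t[1], reverse=True)
    return [(i + 1, label, freq) for i, (label, freq) in enumerate(ranked)]
```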

Provide personalized instructions (guidelines)

You can guide the generation by providing text instructions in the “Guidelines” field. These instructions directly influence:
  • The vocabulary used for the labels
  • The level of granularity (more or fewer themes)
  • The analytical perspective (sensory, emotional, functional, etc.)
  • The language of the labels
⚠️ Important: These instructions are in Beta. They work well for guiding the generation process, but results may vary. Always check the generated themes.

Generation in 2-level mode

In 2-level mode, the process involves two steps:
  1. L1 generation: AI identifies the main themes
  2. Automatic L2 generation: For each L1 theme, AI automatically generates sub-themes based on the corresponding responses.
L1s that do not yet have sub-themes are automatically detected, and the system starts generating the missing L2s before starting the classification.

Theme editor

Whether imported or generated, themes appear in the theme editor (left sidebar), where you can:
| Action | Comment |
|---|---|
| Rename a theme | Click on the label and edit it directly |
| Delete a theme | Click on the trash can icon 🗑️ |
| Add a theme | Click the + button at the bottom of the list |
| Reorder the themes | Drag and drop using the handle ≡ |
| Unfold/fold the L2s | Click the arrow ▶ next to an L1 theme |
| Regenerate the themes | Click the ✨ button to restart AI generation |
| Regenerate the L2s of a parent | Click ✨ next to a specific L1 theme |
💡 The estimated frequencies (indicative number displayed next to each theme) are recalculated after each classification. Before the first classification, they come from the AI’s estimate during generation.

4. Configure the classification rules

The rules control how many codes can be assigned to each respondent. They are applied at three levels: during pre-classification, on imported examples, and during full classification.

Rules for level 1

| Setting | Description | Default |
|---|---|---|
| Max codes | Maximum number of themes per respondent | 0 (unlimited) |

_Example: With Max codes = 3, a respondent can receive at most 3 themes, even if their answer mentions more._

Rules for 2 levels

In 2-level mode, three additional parameters allow for fine control:
| Setting | Internal code | Description | Default |
|---|---|---|---|
| Max L1 | maxCodesL1 | Maximum number of main themes per respondent | 0 (unlimited) |
| Max L2 | maxCodesL2 | Maximum overall number of sub-themes per respondent | 0 (unlimited) |
| Max L2 per L1 | maxCodesL2PerL1 | Maximum number of sub-themes per parent theme | 0 (unlimited) |
Order of application of the rules:
  1. Max L1: limits the number of main themes (Pass 1)
  2. Max L2/L1: limits sub-themes per parent (Pass 2, one call per parent)
  3. Max L2 global: final ceiling after merging all sub-themes (post-processing)
💡 Tip: The Max L2/L1 setting is particularly useful when some L1 themes are very broad and might monopolize all the sub-themes. For example, with Max L2/L1 = 2, each parent theme can contribute at most 2 sub-themes, ensuring a balanced distribution.
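The three passes above can be sketched in a few lines. This is a minimal illustration, assuming each respondent's assignment is a mapping from L1 theme to its list of L2 sub-themes, and that a cap of 0 means unlimited; the function and data shapes are assumptions, not the platform's internals:

```python
# Hypothetical sketch of the three-pass rule application.
def apply_rules(assignment, max_l1=0, max_l2=0, max_l2_per_l1=0):
    """assignment: {l1_theme: [l2_themes]}; 0 means unlimited."""
    # Pass 1: limit the number of main themes.
    l1s = list(assignment)
    if max_l1:
        l1s = l1s[:max_l1]
    # Pass 2: limit sub-themes per parent.
    result = {}
    for l1 in l1s:
        l2s = assignment[l1]
        if max_l2_per_l1:
            l2s = l2s[:max_l2_per_l1]
        result[l1] = list(l2s)
    # Post-processing: global ceiling on the merged sub-theme list.
    if max_l2:
        budget = max_l2
        for l1 in result:
            take = min(len(result[l1]), budget)
            result[l1] = result[l1][:take]
            budget -= take
    return result
```

For example, with Max L1 = 2, Max L2 = 3, and Max L2/L1 = 2, a respondent with three sub-themes under one broad parent keeps only two of them, leaving room for sub-themes under another parent.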

5. Import a training set (past data) (optional)

Why import examples?

A training set (few-shot examples) lets you show the AI examples of already coded verbatims. These examples are sent as context with each classification batch. Importing one is recommended when:
  • The themes are nuanced or closely related.
  • You want continuity within a project or between several projects.
  • You have specific coding conventions (e.g., certain expressions must always be classified under a particular theme).
  • You want to replicate an existing classification on new data.
  • Pre-classification without examples yields unsatisfactory results.

Training file format

The Excel file should look like this:
| Answer | ANSW_1a | COMM1 | ANSW_2a | COMM2 | ANSW_3a | COMM3 | ANSW_4a | COMM4 | ANSW_5a | COMM5 |
|---|---|---|---|---|---|---|---|---|---|---|
| The interface remains fluid from beginning to end, very close to a premium application. | 21 | | | | | | | | | |
| The application is fine, nothing particularly remarkable. | 18 | | 207 | | | | | | | |
| Very smooth navigation, some pleasant animations | 18 | | 207 | | | | | | | |
| Sometimes a little choppy, and some sections seem poorly optimized. | 45 | | 212 | | 233 | | 240 | | | |
The system automatically detects columns containing codes by comparing them to the themes defined in your coding plan. Columns whose values correspond to known themes are identified automatically.
⚠️ Limit: Only 30 examples are kept. The codes in the imported file must match those defined in the initial coding plan.
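The code-column detection can be approximated as follows. A minimal sketch, assuming the file is available as a mapping from header to cell values and that the coding plan's theme IDs form a set (names and shapes are illustrative):

```python
# Hypothetical sketch: a column is treated as a code column when
# all of its non-empty values match IDs from the coding plan.
def detect_code_columns(columns, known_ids):
    """columns: {header: [cell values]}; known_ids: set of theme IDs."""
    code_cols = []
    for header, values in columns.items():
        cells = [v for v in values if v not in ("", None)]
        if cells and all(v in known_ids for v in cells):
            code_cols.append(header)
    return code_cols
```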

Verification and translation

Each imported example is displayed with:
  • The text of the response (verbatim)
  • The assigned theme badges (color-coded)
  • An individual translation button with language selection (French, English, German)
The translation allows you to check the content of the examples in your working language, without modifying the data sent to the classification.

6. Pre-classify a sample

What is pre-classification?

Before running the classification on the entire dataset, the system classifies the first 30 responses as a test. This is the most important step to validate the quality of your coding plan. The pre-classification uses the exact same algorithm as the full classification, but on a smaller sample to allow for quick verification.

What the pre-classification shows you

For each answer, you see:
  • The verbatim text (with keywords corresponding to the themes highlighted)
  • The assigned L1 badges (with color coding)
  • The assigned L2 badges (if depth = 2), grouped under their L1 parent badges
  • A summary: number of classified responses out of the total

Correct the results

The pre-classification is interactive; you can correct each line:
| Action | Gesture | Effect |
|---|---|---|
| Remove a theme | Click the × on the badge | The theme is removed from this response |
| Add an L1 theme | Click the + next to the L1 badges | Drop-down menu with all available L1 themes |
| Add an L2 theme | Click the + next to the L2 badges | Filtered drop-down menu: only sub-themes of already assigned L1 themes are offered |
| Search for a theme | Type in the menu's search field | Real-time filtering of available themes |
💡 Validation area: All lines between your first and last correction are considered validated. They are highlighted in blue and are automatically treated as validated for the full classification.

7. Launch the full classification

When to start the classification?

Start the full classification when:
  • The pre-classification themes match your expectations.
  • Any necessary corrections have been made to the first 30 lines.
  • Your training set, if any, has been imported.
  • The rules (Max codes) are correctly configured.

What’s happening in the background

  1. The responses are divided into batches.
  2. Each batch is sent to the AI with:
    • The list of available themes
    • The training examples (imported + pre-classification corrections)
    • The configured limit rules
  3. In 2-level mode:
    • Pass 1: L1 classification on all batches
    • Pass 2: For each assigned L1 theme, L2 classification by parent
    • Post-processing: Application of the global L2 ceiling (Max L2)
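The batching step above can be sketched as follows. The batch size and request shape are assumptions for illustration, not the platform's actual values:

```python
# Hypothetical sketch of the batching step.
def make_batches(responses, batch_size=50):
    """Split responses into fixed-size batches (last one may be short)."""
    return [responses[i:i + batch_size]
            for i in range(0, len(responses), batch_size)]

def build_request(batch, themes, examples, rules):
    # The same shared context accompanies every batch (assumed shape).
    return {"responses": batch, "themes": themes,
            "examples": examples, "rules": rules}
```

In 2-level mode, Pass 1 runs this loop once for L1, then Pass 2 runs it again per assigned parent theme before the global Max L2 ceiling is applied.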

Result

After classification, you see:
  • A success banner: “Classification complete: N classified responses.”
  • The first 30 responses with their assigned codes (editable)
  • The imported examples (expandable section, if a training set was used)
  • The correlation matrix (see next section)

8. Evaluate the results using the correlation matrix

The MECE principle

A quality coding plan must be MECE:
  • Mutually Exclusive: Each theme covers a distinct aspect. Two themes should not describe the same thing.
  • Collectively Exhaustive: Together, the themes cover all responses. No verbatim should remain without a relevant code.

Read the co-occurrence matrix

The matrix displays the percentage of respondents who received two themes simultaneously. The diagonal is always 100% (a theme is always correlated with itself).
| | Interface is intuitive (122) | App is fast (6) | Navigation is confusing (14) | App crashes or freezes (28) |
|---|---|---|---|---|
| Interface is intuitive (122) | 100% | 33% | 21% | 0% |
| App is fast (6) | 33% | 100% | 0% | 0% |
| Navigation is confusing (14) | 21% | 0% | 100% | 0% |
| App crashes or freezes (28) | 0% | 0% | 0% | 100% |
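The computation behind such a matrix can be sketched in a few lines. This assumes one plausible normalization, dividing each pair count by the row theme's respondent count, which makes the diagonal 100% as described above; the actual platform formula is not documented here:

```python
# Sketch of a co-occurrence matrix: for each (a, b) pair, the
# percentage of respondents holding theme a who also hold theme b.
def cooccurrence(assignments, themes):
    """assignments: list of sets of themes, one set per respondent."""
    counts = {t: sum(1 for s in assignments if t in s) for t in themes}
    matrix = {}
    for a in themes:
        for b in themes:
            both = sum(1 for s in assignments if a in s and b in s)
            matrix[(a, b)] = round(100 * both / counts[a]) if counts[a] else 0
    return matrix
```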

How to interpret the matrix

| Signal | Value | Meaning | Recommended action |
|---|---|---|---|
| 🔴 High correlation | > 50% | The two themes often overlap: possibly redundant | Merge the themes or rephrase their definitions |
| 🟠 Medium correlation | 20–50% | The themes are related but distinct: acceptable | Check a few answers to confirm |
| 🟢 Weak correlation | < 20% | The themes are properly mutually exclusive | Nothing to change |
| ⚪ Zero correlation | 0% | The themes never co-occur | Expected for antagonistic themes (e.g., “Nothing” vs. the others) |
| ⚠️ Low count | 1–2 | The theme concerns very few respondents | Perhaps too specific; consider merging it with a parent theme or removing it in a single-level coding |
💡 Highly correlated cells are highlighted in color to quickly identify problems.

Example analysis
In the matrix above:
  • App is fast × Interface is intuitive = 33% → These two themes are sometimes mentioned together. This is acceptable: the themes remain distinct.
  • Nothing × everything else = 0% → Perfect: respondents who have nothing to say are not categorized under other themes.
  • Interface is intuitive (122) is the dominant theme: 122 out of 232 respondents, or more than half.

Acting on the results

If the matrix reveals problems:
  1. Click “Back to codes” to return to the theme editor.
  2. Merge redundant themes or rewrite ambiguous definitions.
  3. Rerun the classification; the corrections made to the first 30 lines are saved as training examples (“Re-classify with corrections” button).
This iterative cycle (classify → evaluate → adjust → re-classify) lets you gradually converge on a robust, MECE coding plan.

9. Export the results

Once the classification is validated, click on “Download Excel” to obtain a structured file:
| Sheet | Content | Description |
|---|---|---|
| FilesQO | Classified data | Each respondent with their text and assigned codes (L1 and L2 columns if applicable) |
| Topics | Coding plan | The complete list of themes with their identifiers, organized hierarchically |
| Top Topics | Frequency summary | The most frequent themes with their counts and percentages |
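The contents of the three sheets can be sketched as plain data structures. This is a stdlib-only illustration of the payload shapes, not the platform's export code, which writes an actual .xlsx workbook; all names and the code-joining format are assumptions:

```python
from collections import Counter

# Hypothetical sketch of the three-sheet export payload.
def build_export(classified, themes):
    """classified: list of (respondent, text, [codes]);
    themes: {code: label}."""
    files_qo = [(r, t, ";".join(str(c) for c in codes))
                for r, t, codes in classified]
    counts = Counter(c for _, _, codes in classified for c in codes)
    total = len(classified)
    top = [(themes[c], n, round(100 * n / total))
           for c, n in counts.most_common()]
    return {"FilesQO": files_qo, "Topics": sorted(themes.items()),
            "Top Topics": top}
```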

Practical advice

How many themes should be defined?

| Number of responses | Recommended L1 themes | Recommended L2 themes |
|---|---|---|
| < 100 | 5 – 10 | 2 – 4 per L1 |
| 100 – 500 | 10 – 20 | 3 – 6 per L1 |
| > 500 | 15 – 30 | 5 – 10 per L1 |

When to use 1 level vs 2 levels?

| Criteria | 1 level | 2 levels |
|---|---|---|
| Rapid exploratory objective | ✓ | |
| Initial data analysis | ✓ | |
| Fine granularity required | | ✓ |
| Long, detailed transcripts | | ✓ |
| Short verbatims (< 20 words) | ✓ | |