The AI Cognitive Decline Narrative Has Not Tested What It Claims

May 4, 2026 by Basil Puglisi

A Methodological Audit of AI Cognition Research, with HAIA-RECCLIN Reasoning and HEQ with AIS Composite as Testable Counter-Proposals and a Five-Arm Validation Design

Abstract

Public discourse increasingly claims that artificial intelligence use is producing cognitive decline. The peer-reviewed evidence base on AI and cognition does not yet support that conclusion at the methodological standards adjacent fields require, and it does not yet support the opposite conclusion either. Two scientific questions remain open. The first asks whether structured human-governed AI use can accelerate cognitive development beyond what either no AI or unstructured AI use produces. The second asks what augmented intelligence is in practice and how its presence or absence can be measured. Neither has been answered with the methodological consistency that randomized controlled trial reporting standards, risk-of-bias frameworks, and process evaluation guidance specify for cognitive intervention research.

This paper makes a methodological argument. AI use as currently practiced lacks structure, lacks development discipline, and lacks the human verification that would distinguish governance from acceptance. Cognitive science across more than six decades has established why those three things matter for cognitive development: scaffolding within a zone of proximal development, metacognitive monitoring, active engagement modes, productive cognitive dissonance, and the conversion of suboptimal cognitive offloading into strategic offloading. The augmentation tradition has established what the exploration of augmented intelligence requires: a hybrid unit of analysis that treats the human-AI pairing rather than the AI alone, a calibrated trust relationship, and a multi-dimensional rather than scalar conception of intelligence under partnership conditions. The AI cognition research that the existing peer-reviewed literature has produced did not test for, look for, or consider any architecture that provides all three. The studies tested AI as a tool. The methods of use were the missing variable, not any particular architecture that would supply them.

The principle that methods of use change cognitive outcomes is not new. The questions the field should have been asking from the start were obvious. How is AI being used? What is the result of how it is being used? Should it be used differently? Is there a way to use it differently? HAIA-RECCLIN Reasoning (Human Artificial Intelligence Assistant — Researcher, Editor, Coder, Calculator, Liaison, Ideator, Navigator) is offered as one operationalization of the methods-of-use category that is concrete enough to test: a single-platform structured cognitive interaction architecture organized around seven cognitive role functions and human-arbiter checkpoints, synthesized from prerequisites the literature already named. The Human Enhancement Quotient (HEQ) is offered as the measurement-space proposal: a four-dimension behavioral instrument that decomposes augmented intelligence into Cognitive Agility Speed, Ethical Alignment Index, Collaborative Intelligence Quotient, and Adaptive Growth Rate, with the Augmented Intelligence Score (AIS) as composite. Neither proposal has been independently validated. The paper closes with a five-arm randomized controlled trial design that would test these proposals, and any other operationalization of the methods-of-use category, against the methodological standards Section 5 articulates.

1. Two Open Questions in AI Cognition Research

Much of the public discourse now centers on two confident claims. The first claim says AI use causes or contributes to cognitive decline. The second claim says structured human-governed AI use is unproven and therefore cannot be relied on as a counter-condition. Both claims circulate as if the underlying research had answered them. Neither claim is supported by evidence that meets the standards adjacent fields already accept for causal cognitive evaluation.

For the purposes of this paper, structured human-governed AI use refers to AI interaction in which the human assigns a cognitive role, verifies sources, preserves dissent, connects claims to tactics and measurable outcomes, and makes the final arbitration decision before any output is accepted. The full architecture is specified in Section 7, but the operational definition belongs at the front of the paper because the term carries the central argument.

A note on terminology. This paper uses generative AI and AI throughout because these are the names the field uses for the current generation of large language model systems. Neither term is used as a settled technical category. The systems do not generate in the sense a human author generates, and they do not exhibit general intelligence in any rigorous sense of either word. The paper retains the field’s vocabulary because the readers the paper is written for use it, not because the author endorses the implied claims the vocabulary makes.

Two questions sit underneath the discourse and remain open. The first asks whether structured human-governed AI use can accelerate cognitive development beyond what either no AI or unstructured AI use produces. The second asks what augmented intelligence is in practice and how its presence or absence can be measured. These questions are not new. The cognitive science literature has spent more than six decades specifying what cognitive development requires. The augmentation tradition has spent the same period specifying what human-machine partnership entails. The questions have become urgent because the scale and frequency of generative AI deployment now make rigorous empirical testing more feasible than ever before, and the field has not yet run the tests.

The paper proceeds in nine sections. Sections 2 and 3 establish what cognitive science and the augmentation tradition say each question requires. Section 4 explains why the default single-platform AI deployment model cannot reliably answer either question. Section 5 specifies the methodological standards the field would have to meet. Section 6 audits the existing evidence against those standards. Section 7 introduces the methods proposal. Section 8 introduces the measurement proposal and validation design. Section 9 acknowledges the proposal’s own limitations.

The field already possesses the cognitive science, the methodological standards, and the design precedents required. What it has not yet produced is a single study that brings all three together to test a fully governed intervention class. This paper offers one such class and the measurement approach that would allow it to be evaluated rigorously.

2. What Cognitive Science Says Cognitive Development Acceleration Requires

2.1 Scaffolding within a zone of proximal development

The foundational claim that cognitive development can be supported by structured assistance comes from Vygotsky (1978). The zone of proximal development describes the range between what a learner can accomplish independently and what the same learner can accomplish with appropriate support. Acceleration occurs when the support is calibrated to that range. Support that falls below the range is redundant, since the learner could already accomplish the task. Support that exceeds the range bypasses the cognitive work that produces development. The mechanism is structural: the scaffold extends what the learner can reach, and the learner internalizes the extended reach as new independent capability across repeated cycles.

The implication for AI use is direct. A condition in which the AI produces the answer and the human accepts it does not extend the human’s reach. It substitutes for the cognitive work the zone of proximal development requires. A condition in which the AI provides specific support that the human integrates into a process the human is performing extends reach in the Vygotskian sense. The difference between substitution and scaffolding is not a property of the AI tool. The difference is a property of how the interaction is structured.

2.2 Metacognitive monitoring as the active mediator

Flavell (1979) defined metacognition as cognition about cognition, comprising knowledge of the task at hand, knowledge of available strategies, knowledge of oneself as a cognitive agent, and the ongoing monitoring of progress toward a goal. Subsequent decades of research in educational psychology have established metacognitive monitoring as one of the most consistent predictors of learning quality across domains and across age ranges. Learners who monitor their own understanding identify gaps earlier, select more appropriate strategies, and adjust those strategies more effectively than learners who do not.

Metacognitive monitoring is a plausible and often central mediator between intervention structure and learning outcome. An intervention that bypasses metacognition produces compliance with the intervention but not durable cognitive change. The Sidra and Mason (2026) Collaborative AI Literacy and Collaborative AI Metacognition Scales provide recent peer-reviewed instruments for measuring metacognitive functioning specifically during AI collaboration. Their validation work supports the claim that metacognition-with-AI is a measurable construct distinguishable from general metacognition. The measurement work also indicates that AI collaboration can either support or suppress metacognitive monitoring depending on how the interaction is structured.

2.3 Engagement modes and the engagement-outcome gradient

The Interactive-Constructive-Active-Passive (ICAP) framework from Chi and Wylie (2014) classifies learning behaviors into four hierarchical modes ordered by the degree of cognitive engagement they require. Passive engagement involves receiving information without overt activity. Active engagement involves doing something with the information, such as repeating it or highlighting it. Constructive engagement involves producing something that goes beyond the original input, such as a summary in the learner’s own words. Interactive engagement involves dialogue with another agent that produces co-constructed knowledge. The hierarchy predicts that learning outcomes increase as engagement moves from passive through active and constructive to interactive modes.

Wekerle et al. (2024) provide bounded empirical support for the hierarchy in technology-enhanced higher education. Their findings confirm the prediction at the passive and interactive endpoints, while the middle modes show mixed results that the authors attribute to measurement challenges. Thurn et al. (2023) raise additional measurement concerns. The framework remains a central engagement-mode hierarchy in current educational psychology research and is treated in this paper as established but bounded.

Freeman et al. (2014) provide independent meta-analytic confirmation that engagement matters at the population level. Their analysis of 225 studies in undergraduate science, technology, engineering, and mathematics education found an effect size of 0.47 standard deviations on examination performance under active learning conditions compared to traditional lecturing, with failure rate odds 1.95 times higher under traditional lecturing. The Freeman finding does not depend on the ICAP framework specifically. The convergence between ICAP and the Freeman meta-analysis supports the broader claim that engagement structure mediates learning outcomes.
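To make the odds-ratio figure concrete, the short sketch below converts it into failure rates under an assumed baseline. The 20 percent active-learning failure rate is a hypothetical value chosen for illustration, not a figure reported by Freeman et al.

```python
# Hypothetical illustration: what "failure rate odds 1.95 times higher under
# traditional lecturing" implies, assuming a 20% failure rate under active
# learning (the baseline is an assumption, not a reported figure).

active_failure_rate = 0.20
active_odds = active_failure_rate / (1 - active_failure_rate)   # 0.25
lecture_odds = active_odds * 1.95                                # odds 1.95x higher
lecture_failure_rate = lecture_odds / (1 + lecture_odds)         # ~0.33

print(f"Implied lecture failure rate: {lecture_failure_rate:.1%}")  # about 33%
```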

The Bloom taxonomy and its revision (Bloom et al., 1956; Anderson & Krathwohl, 2001; Krathwohl, 2002; Larsen et al., 2022) provide the cognitive-process classification that any engagement-mode framework presumes. The taxonomy distinguishes lower-order cognitive processes such as remember and understand from higher-order processes such as analyze, evaluate, and create. Acceleration of cognitive development means progression up the taxonomy, not faster execution at the same level. Productivity gains at lower levels do not constitute cognitive development if higher-level processes are not exercised.

2.4 Productive cognitive dissonance as the driver of reasoning

Festinger (1957) established cognitive dissonance theory as the account of how inconsistency between beliefs, between beliefs and behavior, or between expectations and outcomes produces psychological discomfort that motivates reasoning. Vaidis and Bran (2019) provide modern conceptual refinement that distinguishes the inconsistency trigger from the dissonance state and from the broader theory. Their refinement matters because a learning environment that exposes learners to dissonance produces reasoning, while a learning environment that suppresses dissonance produces compliance.

Default AI use can suppress dissonance when it presents a single confident output without requiring arbitration. The standard interaction pattern frames disagreement as a prompt problem rather than as evidence to preserve, and it resolves apparent inconsistency by offering an alternative single output rather than keeping the conflict in view for the user to arbitrate. A learner who never encounters preserved disagreement does not have the conditions for the reasoning that resolves it. A learner who encounters preserved disagreement and is asked to arbitrate it has the conditions Festinger identified more than six decades ago.

2.5 The strategic and suboptimal offloading distinction

Risko and Gilbert (2016) define cognitive offloading as the use of physical action or external resources to alter the information processing requirements of a task. Their central distinction separates strategic offloading, which improves performance by reallocating cognitive resources to the most demanding parts of a task, from suboptimal offloading, which substitutes the external resource for cognitive work that would have produced learning. The same external resource can support either form of offloading depending on the user’s metacognitive monitoring and on the task structure.

Sparrow et al. (2011) provide the foundational empirical work in the search engine context. Their findings have a mixed replication record and predate large language models, but the underlying mechanism remains established in subsequent research. The implication for AI use is that the question is not whether AI offloads cognitive work. Many forms of AI use externalize cognitive work and therefore raise the offloading question directly. The question is whether the offloading is strategic or suboptimal, and the answer depends on whether the surrounding interaction structure supports the metacognitive monitoring that distinguishes the two.

2.6 What this means in practice

The cognitive science literature points to five conditions that any intervention claiming to accelerate cognitive development should satisfy. The intervention should scaffold within the zone of proximal development rather than substitute for the cognitive work that produces development. It should support rather than suppress metacognitive monitoring. It should move learners toward the higher engagement modes and the higher cognitive process levels rather than locking them into lower ones. It should expose learners to productive dissonance and require them to arbitrate it. It should convert offloading from a suboptimal form into a strategic form by surrounding the offloaded task with the structure that preserves cognitive development at the user’s end.

These conditions are familiar to anyone who has read the cognitive intervention research literature. They are the same prerequisites that have been expected of educational and cognitive interventions for decades. An AI cognition research program that did not address them would be expected to produce mixed signals, contested replications, effect sizes that swing with measurement choices, and findings that cannot be aggregated into a stable causal claim regardless of which direction the headline result points.

3. What the Augmentation Tradition Says Exploration of Augmented Intelligence Requires

3.1 The structural framing from Licklider and Engelbart

Licklider (1960) introduced the concept of man-computer symbiosis in the IRE Transactions on Human Factors in Electronics. The paper proposed that the productive partnership between humans and computers would not consist of computers replacing human cognition or of humans operating computers as tools. The productive partnership would consist of a coupled relationship in which humans set goals, formulated hypotheses, and evaluated results while computers performed the routinizable work in between. The argument was structural rather than tool-deployment focused. The shape of the partnership, not the capability of the machine, would determine whether the partnership produced augmentation.

Engelbart (1962) extended the framing in his SRI report on a conceptual framework for augmenting human intellect. The report argued that increases in human intellectual capability would come from designed combinations of human capabilities, language, methodology, training, and computer support. The combination was the unit of analysis; no single component, including the computer, was the source of augmentation. The combination itself was, and the design of the combination was the locus of the engineering problem.

Both Licklider and Engelbart predate large language models by more than five decades. The framing they established remains relevant. Modern generative AI is a substantially more capable computational partner than earlier augmentation theorists could test. The question of whether generative AI produces augmentation remains the question Licklider and Engelbart posed, scaled up to the new capability range.

The shape-of-the-partnership argument has direct implications for the architecture this paper proposes in Section 7. If the human’s contribution to the partnership is reduced to acceptance, the partnership has the shape of automation rather than augmentation, regardless of how capable the AI component is. If the human contributes role assignment, source verification, dissent capture, arbitration of conflicting outputs, and the final decision authority grounded in professional values such as morals, ethics, beliefs, and accountability for outcomes, the partnership has the shape Licklider and Engelbart specified. The architecture is one operational expression of that shape; other operational expressions are possible.

Augmentation also has a measurable definition. The combined human-AI performance must exceed the performance of either component alone. The relationship is therefore not aesthetic but quantitative, and it is testable. Section 8 proposes one instrument for that test.
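Stated as a formula, with P(H), P(A), and P(H+A) denoting the task performance of the human alone, the AI alone, and the pairing (the notation is introduced here for illustration, not drawn from the cited sources):

```latex
% Exceedance criterion for augmentation (illustrative notation)
\Delta_{\mathrm{aug}} = P(H{+}A) - \max\{\,P(H),\; P(A)\,\}, \qquad
\text{augmentation holds when } \Delta_{\mathrm{aug}} > 0
```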

3.2 Hybrid intelligence as the unit of analysis

Dellermann et al. (2019) provide the most cited modern formalization of the unit-of-analysis claim. Their definition of hybrid intelligence specifies a combined human-AI capability that exceeds either component alone, with continuous mutual learning across repeated interactions. The definition makes three commitments that matter for measurement: the combined capability is the unit; the standard is exceedance rather than summation; and the relationship is dynamic, with both components changing over time.

The unit-of-analysis commitment also has direct implications for what a measurement instrument needs to capture. The instrument cannot measure the human alone or the AI alone. The instrument must capture the quality of the integration: whether the human integrates diverse perspectives within the AI-augmented context, whether the human exercises source discipline across AI platforms and external resources, whether the human preserves dissent rather than forcing convergence, and whether the human arbitrates conflicting outputs with documented reasoning. Section 8 introduces one measurement instrument that operationalizes this requirement.

The unit-of-analysis commitment has direct implications for research design. A study that measures human-only performance and AI-only performance and reports their relationship has measured neither augmentation nor hybrid intelligence. A study that measures combined human-AI performance against the relevant counterfactuals (best human alone, best AI alone, average human alone, average AI alone) has begun to measure augmentation. A study that measures combined human-AI performance across repeated interactions and tracks change at both ends has begun to measure hybrid intelligence.
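A minimal sketch of that comparison logic follows, assuming each condition is summarized by a mean task score. The values and function names are illustrative, not drawn from any study in the source pool.

```python
# Illustrative sketch: scoring a combined human-AI condition against the four
# counterfactual baselines named above. Scores are placeholder condition means.

def augmentation_margins(combined, baselines):
    """Margin of the combined human-AI condition over each counterfactual baseline."""
    return {name: round(combined - score, 3) for name, score in baselines.items()}

baselines = {
    "best_human_alone": 0.78,
    "best_ai_alone": 0.81,
    "avg_human_alone": 0.65,
    "avg_ai_alone": 0.72,
}
margins = augmentation_margins(combined=0.84, baselines=baselines)

# The hybrid-intelligence standard is exceedance of the strongest counterfactual,
# not merely of the average ones.
strict_augmentation = all(margin > 0 for margin in margins.values())
print(margins, strict_augmentation)
```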

Vaccaro et al. (2024) provide the meta-analytic evidence that the field has not consistently met the second standard. Their analysis of 106 experimental studies and 370 effect sizes found that human-AI combinations on average underperformed the best of humans or AI alone. The finding is not an argument against augmentation in principle. The finding is the empirical concession that combinations require structured testing rather than presumed synergy. The variance Vaccaro documents across task types and combination structures is the variance proper testing should explain. Section 8 proposes one instrument designed to measure the variance, and Section 8.4 specifies the trial design that would test the structural argument against the variance pattern.

3.3 Trust calibration as the active mechanism

Lee and See (2004) established trust in automation as a research field with their foundational paper in Human Factors. Their framework defined appropriate reliance as the calibration of trust to the observed capability and limits of the automated system, with both overtrust and undertrust identified as failure modes. The framework has been widely cited across research on automated systems, decision aids, and now AI collaboration.

The implication for augmented intelligence research is that trust is the active mechanism through which the human contribution to the partnership is regulated. Overtrust produces the failure mode in which the human accepts AI outputs without evaluation, which collapses the partnership into AI-alone operation with the human as a passive conduit. Undertrust produces the failure mode in which the human rejects AI outputs by default, which collapses the partnership into human-alone operation with the AI as ignored noise. Augmentation requires calibrated trust, and calibrated trust requires evidence about when the AI is reliable and when it is not.

The Buçinca et al. (2021) work on cognitive forcing functions provides the conditional-mechanism evidence at the intervention level. Their experimental work found that interventions designed to require deliberation reduced overreliance, with people higher in Need for Cognition benefiting more. The Vasconcelos et al. (2023) work on AI explanations found that explanations can reduce overreliance under specific conditions including high task difficulty, low explanation effort, and salient incentives. The Vered et al. (2023) work confirms that explanations alone do not reduce automation bias. The conditional pattern these three studies establish is that the structural design of the interaction is what produces calibrated trust, not the AI tool by itself.

The default human-in-the-loop deployment pattern, where a human is nominally present but not structurally engaged, does not produce calibrated trust because the conditions calibration requires are not present. Calibration requires four things the default pattern does not enforce. The first is named accountability: someone whose professional identity is on the line for the output. The second is access to verification resources outside the AI session: search, books, primary sources, peer review by other humans, and where available cross-checks against other AI platforms. The third is sufficient prior education or domain knowledge to evaluate AI outputs against what the human already knows; without it, the human is a rubber stamp regardless of how capable they are in unrelated areas. The fourth is a method that forces engagement with all three rather than allowing them to remain optional. Section 7 proposes one such method.

3.4 Multi-dimensional intelligence as the precedent for measurement

Gardner (1983) established the multi-dimensional intelligence framework in Frames of Mind. The argument was that intelligence is not a scalar quantity but a set of relatively independent capabilities including linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, and intrapersonal intelligences. The framework remains contested in cognitive psychology, but the precedent it established for measuring intelligence as multi-dimensional rather than scalar is foundational for any modern instrument that claims to measure a complex cognitive construct.

Ganuthula and Balaraman (2025) provide the most recent peer-reviewed example of a multi-dimensional measurement framework specifically for human collaboration with AI. Their Artificial Intelligence Quotient framework defines components for measuring how individuals collaborate with AI systems. The framework is independent of the Human Enhancement Quotient with AIS Composite (HEQ with AIS, introduced in Section 8), was developed in parallel, and provides the cleanest peer-reviewed comparison anchor for any new measurement instrument in this space.
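For readers who want a concrete picture of what a four-dimension instrument with a composite looks like, the sketch below uses the HEQ dimension names from this paper; the 0-to-100 scale and the unweighted-mean aggregation are assumptions made for illustration, not the validated scoring rule.

```python
# Illustrative sketch only: HEQ dimension names come from this paper, but the
# 0-100 scale and unweighted-mean composite are assumptions, not the instrument.

from dataclasses import dataclass
from statistics import mean

@dataclass
class HEQProfile:
    cognitive_agility_speed: float               # CAS
    ethical_alignment_index: float               # EAI
    collaborative_intelligence_quotient: float   # CIQ
    adaptive_growth_rate: float                  # AGR

    def ais_composite(self) -> float:
        """Augmented Intelligence Score as an assumed unweighted mean of the four dimensions."""
        return mean([
            self.cognitive_agility_speed,
            self.ethical_alignment_index,
            self.collaborative_intelligence_quotient,
            self.adaptive_growth_rate,
        ])

profile = HEQProfile(72.0, 85.0, 68.0, 74.0)
print(round(profile.ais_composite(), 1))  # the profile is reported alongside the composite, not replaced by it
```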

3.5 Developmental framing

Dweck (2006) established the growth mindset framework in Mindset, distinguishing fixed conceptions of intelligence from growth conceptions. The growth-mindset commitment is that intelligence is responsive to deliberate practice and that the rate of change is itself a measurable construct. Augmented intelligence research that adopts the developmental framing treats the rate of measurable change in cognitive performance across repeated interactions as a primary outcome rather than as noise around a fixed underlying capability.

3.6 What this means in practice

The augmentation literature points to three commitments that any rigorous exploration of augmented intelligence should adopt. The hybrid unit of analysis, with the human-AI pairing rather than either component alone as the relevant unit. Trust calibration as the active mechanism, with both overtrust and undertrust as testable failure modes. Multi-dimensional and developmental measurement, with intelligence treated as a set of capabilities that change over time under conditions of partnership rather than as a scalar trait.

These commitments have been built up across the augmentation tradition since the 1960s. A research program that did not adopt them would be measuring something else.

4. Why Default Single-Platform AI Use Cannot Answer Either Question

The default deployment model for generative AI, treated here as an ideal type, is a single-platform exchange in which the human submits a prompt, the AI returns an output, and the human accepts the output. The pattern is the dominant interaction mode across the consumer chat interfaces, the enterprise deployments, and the educational applications that account for the majority of current AI use. The pattern produces measurable productivity gains in narrow tasks. The pattern cannot answer either of the two questions Section 1 framed.

The reason is structural rather than capability-related. Default single-platform AI use satisfies none of the conditions established earlier for cognitive development acceleration and none of the commitments established earlier for augmented intelligence exploration. The failures are not accidents of the current generation of models; they are properties of the interaction structure itself.

4.1 Failure mode one: hallucinated content accepted without verification

Generative AI produces plausible content that includes citations, dates, statistics, and quotations. Some of that content may be fabricated, misattributed, or unsupported. The default interaction pattern does not require the user to verify any specific claim. The user can accept the output, paste it into a downstream document, and proceed with the assumption that the content is accurate.

The cognitive science consequence is that the user has not engaged the higher-order processes the Bloom revision identifies as cognitive development. The user has not analyzed the content, evaluated the sources, or constructed a position. In the unverified version of this pattern, the user has retrieved and accepted rather than analyzed and evaluated. The augmented intelligence consequence is that the trust calibration Lee and See identified as the active mechanism has been bypassed in the direction of overtrust. The Vasconcelos et al. (2023) and Buçinca et al. (2021) work shows that the conditions under which trust calibration occurs are conditions the default pattern does not produce.

4.2 Failure mode two: surface fluency mistaken for reasoning depth

Generative AI produces fluent prose, and fluency is a property of the output’s linguistic surface. Reasoning depth is a property of the cognitive work that produced the conclusions the prose expresses. The two are independent. A model can produce highly fluent text that reflects shallow or absent reasoning, and a user can mistake the fluency for the depth.

A related problem is the reading-level mismatch between AI output and user comprehension. AI systems default to a register that mirrors academic or professional prose: long sentences, multi-clause constructions, technical vocabulary, structured logical scaffolding. Users with strong reading comprehension can evaluate this output. Users with weaker reading comprehension or domain knowledge cannot, and they often cannot tell that they cannot. The output looks authoritative because it carries the surface markers of authority, and the user accepts it because the user has no internal benchmark for what the output actually says. The cognitive consequence is that the user offloads not only the work but the standard against which the work would be judged. The user has no way to know whether the output is correct or whether they have understood it correctly, and the default interaction pattern does not create occasions for either check.

The cognitive science consequence is that the user receives no signal of the cognitive work that should have occurred to produce the position. The metacognitive monitoring Flavell identified as the active mediator has nothing to monitor, because the work that monitoring would assess was not the user’s. The augmented intelligence consequence is that the hybrid unit of analysis Dellermann specified collapses, because the human contribution has been reduced to acceptance and the AI contribution has been treated as if it were the partnership.

4.3 Failure mode three: productivity gains mistaken for cognitive gains

Noy and Zhang (2023) found that ChatGPT users reduced task time by 40 percent and improved task quality by 18 percent on midlevel writing tasks. Doshi and Hauser (2024) found that AI access raised individual story creativity but reduced collective diversity. Both findings are valid for what they measured, and neither finding speaks to cognitive development. Producing more output faster is not the same as developing the cognitive capabilities that would let the user produce comparable output independently.

The conflation of productivity with cognitive development is widespread in the popular discourse and appears in some of the academic discussion as well. The conflation matters because it allows the field to claim that AI is helping users when the only thing that has been measured is throughput. A user who has produced more documents per hour using AI has not necessarily developed better writing capability. A user who has produced more code per hour has not necessarily developed better debugging capability.

The cognitive question deserves a more careful framing than the productivity literature usually gives it. The naive version of the question asks whether the human alone, with AI removed, performs better than the human who learned with AI. Within a narrow specialty in the short run, the answer is often yes, and this paper grants that point. In the short run, the human who studied architecture for ten years without AI assistance will know architecture in greater specialist depth than the human who studied architecture with AI assistance for the same period. That comparison is not the augmentation question, and the augmentation question asks something different. Consider two learners over the same period: the first studies architecture without AI, while the second studies architecture with AI assistance and also acquires working knowledge of plumbing, electrical systems, and carpentry because the AI assistance frees enough cognitive capacity to make breadth tractable. In the short run, the first learner outperforms the second on architecture-specific tasks. In the long run, the second learner returns to architecture with a working understanding of how the building actually goes together, and that macro understanding produces architectural decisions the specialist could not produce.

The augmentation argument is not that the augmented learner outperforms the specialist on the specialty. The argument is that the augmented learner reaches the specialty plus the macro, and that the macro changes how the specialty itself is practiced.

This reframe matters for two reasons. First, it concedes the legitimate finding embedded in the cognitive decline narrative: human-only specialty depth in narrow domains in the short run is often greater than augmented depth in the same domains. Second, it identifies what the augmentation literature actually claims: breadth and integration that exceed what either alone could reach, with the specialty itself eventually deepened by the integration. The cognitive question requires measuring what the augmented user can do across domains that the unaugmented specialist never reached, not just what the augmented user can do in the specialist’s domain with the AI removed.

A second condition shapes the argument. Augmentation produces cognitive development only when the AI structurally supplies the dissent and conflict that a formal teacher or peer cohort would otherwise provide. A learner working alone with AI in the absence of teacher, peer review, or formal curriculum receives the cognitive benefit only if the AI is configured to disagree, surface counterevidence, identify weaknesses, and force the learner to arbitrate. Default AI use does not do this, but the architecture in Section 7 does.

4.4 Failure mode four: source-authority confusion

The default pattern treats the AI output as the authoritative content of the interaction. Sources cited by the AI appear in the output as if they had been verified. Disagreements between the AI and the user are typically resolved by the user accepting the AI’s framing or the user reformulating the prompt until the AI produces the framing the user wanted. Neither pattern preserves the user’s authority as arbiter.

The cognitive science consequence is that the dissonance Festinger identified as the driver of reasoning is suppressed rather than preserved. The user does not encounter productive disagreement that requires arbitration. The augmented intelligence consequence is that the human-AI relationship loses the structural property that distinguishes governance from automation. When the human is the arbiter, the partnership is governed. When the AI is the arbiter and the human is the consumer of its outputs, the partnership has collapsed into automation.

The distinction between governance and automation maps onto two terms that have become standard in the AI deployment literature: Responsible AI and AI Governance. The two are not synonyms. Responsible AI describes the configuration in which the AI system itself is constrained by safety filters, reliability standards, and audit logs, and in which automated checks verify that those constraints hold. The configuration improves the behavior of the machine. Responsible AI does not require a named human with binding override authority over any particular output. The accountability for individual decisions lives in the system rather than in a person. AI Governance describes the configuration in which a named human holds binding checkpoint authority, with the authority backed by professional and personal accountability for outcomes. The named human can override any AI output, must answer for the consequences of the override or the failure to override, and exercises that authority through documented arbitration grounded in the human’s professional values. Responsible AI is necessary but not sufficient for cognitive interactions where the user must produce work the user is responsible for. AI Governance is what the cognitive question requires, because the cognitive question is about the human’s development, not about the machine’s behavior. The architecture in Section 7 specifies the named-human-with-checkpoint-authority configuration that AI Governance entails.

4.5 The absence of audit trail and metacognitive scaffolding

The four failure modes share a common structural feature, which is that the default pattern produces no audit trail. There is no record of which sources were verified, which alternatives were considered, which disagreements were preserved, or which reasoning steps the user performed. The absence of the audit trail is not a logging problem. The absence reflects the fact that the activities the audit trail would record are not part of the default interaction.

The K-12 mathematics classroom offers the clearest illustration of why this matters. A teacher does not accept the answer alone but demands that the student show the work. The reason is not bureaucratic; showing the work makes the student’s process visible to the student and to the teacher, which means that when the answer is wrong, the teacher and the student together can locate where the process broke down. A correct answer with no shown work is an answer the student may or may not understand, and a wrong answer with no shown work is an error with no traceable cause. Default AI use is the answer-only condition. The user receives the output, accepts it or rejects it, and has no shown work to inspect when something later turns out to have been wrong.

The metacognitive scaffolding that cognitive science identifies as the active mediator depends on the user being able to monitor the cognitive work occurring during the interaction. The audit trail is the external record that supports that monitoring. A pattern that produces neither the work nor the record cannot support the monitoring. The pattern can produce outputs. The pattern cannot produce the cognitive development that the questions in Section 1 are asking about.

5. The Methodological Standards the Field Would Have to Meet

The methodological standards required to answer either question already exist in adjacent literatures. They are not novel proposals. They are the established reporting and design standards that randomized trial research, intervention research, and cognitive intervention research have spent decades developing. AI cognition research has not yet held itself to these standards consistently.

5.1 The four standards

The CONSORT 2025 Statement specifies the reporting requirements for randomized controlled trials. The statement requires defined interventions and comparators, predefined primary and secondary outcomes, and randomization procedures that support causal inference. The CONSORT-AI extension (Liu et al., 2020), originally developed against CONSORT 2010 and applicable to AI-trial reporting, adds requirements specific to interventions that include AI components, including reporting of model versions, prompt structures, training data characteristics where deployers can report them, and human-AI interaction protocols.

The Cochrane Risk of Bias 2 framework specifies the criteria for evaluating individual study quality. The framework distinguishes bias arising from the randomization process, bias due to deviations from intended interventions, bias due to missing outcome data, bias in measurement of the outcome, and bias in selection of the reported result. The framework gives particular weight to outcome measurement under unblinded conditions, which is the dominant condition in AI cognition research because participants generally know whether they are using AI.

The Medical Research Council process evaluation guidance, originally developed for complex public health interventions, specifies that outcome evaluation alone leaves unanswered questions about implementation, mechanisms, and context. A trial that reports only whether the intervention worked, without reporting how it was implemented, what mechanisms produced the effect, and how context modified the effect, has not produced the evidence that practice and policy decisions require.

A long-standing norm in cognitive intervention design specifies the role of business-as-usual control conditions. A business-as-usual control captures what participants would have done anyway, including beneficial non-AI activities the intervention might displace, and it shows whether an active intervention produces benefit beyond that baseline. Cognitive intervention research that omits a business-as-usual control cannot distinguish intervention effect from substitution effect.

5.2 The three-arm minimum design

The four standards together imply a minimum design for any study claiming to evaluate AI use against cognitive outcomes. The design has three arms. The first arm is a no-AI condition, which establishes what participants would have accomplished without AI access. The second arm is a general AI condition, which establishes what participants accomplish with default unstructured AI use. The third arm is a structured AI condition, which establishes what participants accomplish with the specific intervention being tested.

A two-arm design that compares AI to no AI cannot distinguish between the AI tool and the way the AI tool was used. A two-arm design that compares structured AI to general AI cannot distinguish between the structure and the AI use itself. The three-arm minimum is the design that lets researchers separate the three contributions and identify which one produced the effect. Studies with fewer than three arms can answer narrower questions within their design scope, but they cannot answer the question of whether structure mediates the cognitive effect of AI use. The same logic applies to any tool: how the tool is used matters at least as much as whether the tool is present, and a serious study reports the instructional protocol in detail rather than treating “had AI access” as a condition.
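A minimal sketch of the three-arm logic follows. The block randomization and the contrast comments are illustrative defaults, not a registered protocol.

```python
# Minimal sketch of the three-arm minimum design: block randomization plus the
# two contrasts that separate the structure effect from the AI-access effect.
# Arm labels follow the text above; nothing here is drawn from a real protocol.

import random

ARMS = ["no_ai", "general_ai", "structured_ai"]

def randomize(participant_ids, seed=42):
    """Assign participants to arms using balanced blocks of size six."""
    rng = random.Random(seed)
    assignments, block = {}, []
    for pid in participant_ids:
        if not block:
            block = ARMS * 2
            rng.shuffle(block)
        assignments[pid] = block.pop()
    return assignments

assignments = randomize([f"P{i:03d}" for i in range(30)])

# Primary contrasts on the predefined cognitive KPI (measured after AI removal):
#   structured_ai vs general_ai -> effect of the interaction structure
#   general_ai   vs no_ai       -> effect of default AI access
print(sum(arm == "structured_ai" for arm in assignments.values()))  # 10 per arm
```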

5.3 Treatment fidelity, process evaluation, and outcome hierarchies

Three additional requirements follow from the standards. Treatment fidelity verification requires evidence that participants in each arm actually performed the activities the arm specified. Logs of prompts used, measurements of adherence to the protocol, and documentation of dosage and compliance are the standard methods. Without treatment fidelity verification, a study cannot distinguish a failed intervention from an unimplemented intervention.

Process evaluation requires evidence about what occurred during the intervention beyond the inputs and outputs. The process evaluation answers questions about how participants reasoned, where they relied on AI versus relied on their own cognition, where they challenged AI outputs versus accepted them, and where they revised their work versus passed it through unchanged. Process evaluation is what distinguishes a study that reports a result from a study that explains a mechanism.

Outcome hierarchies require multiple measurements at multiple levels, yet productivity and surface quality are the outcomes most often measured. Cognitive development requires measuring retention beyond the immediate task, transfer to related but distinct tasks, metacognitive monitoring during and after the task, critical reasoning about the AI outputs, and delayed performance after a period during which the AI is unavailable. A study that measures only the immediate output cannot speak to cognitive development regardless of how favorable the immediate result appears. The outcome hierarchy should also test for cumulative effects: a structured intervention applied repeatedly may produce gains that compound across cycles within the trial period and that persist or grow after the trial ends, while a default intervention may produce only the within-cycle productivity gains that disappear when the AI is removed. The compounding question is itself a measurable outcome, as the sketch below illustrates.
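One way to make the compounding question operational is to treat the per-participant slope of delayed-assessment scores across cycles as the outcome. The sketch below assumes equally spaced cycles and uses hypothetical scores; it is not an analysis plan from any cited study.

```python
# Illustrative sketch: testing for compounding by fitting a per-participant
# slope across repeated cycles of delayed assessment (AI removed) and comparing
# mean slopes between arms. Scores are hypothetical.

from statistics import mean

def slope(scores):
    """Least-squares slope of scores across equally spaced cycles."""
    xs = range(len(scores))
    x_bar, y_bar = mean(xs), mean(scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

structured_arm = [[61, 66, 70, 75], [58, 63, 69, 72]]   # gains that compound across cycles
default_arm = [[60, 64, 63, 62], [57, 60, 59, 58]]      # within-cycle gains that fade

print(round(mean(slope(p) for p in structured_arm), 2))  # positive mean slope
print(round(mean(slope(p) for p in default_arm), 2))     # slope near zero
```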

5.4 The nine-element audit rubric

The standards combine into a nine-element audit rubric that any AI cognition study can be scored against. The nine elements are:

  • Defined AI intervention. Model, interface, prompt rules, allowed tasks, time, feedback, and source requirements specified.
  • No-AI comparison. A human-only or business-as-usual condition using conventional resources. A parallel traditional method comparator should be added when the claim asks whether structured AI outperforms a meaningful non-AI methodological alternative.
  • General AI comparison. An unguided default AI use condition with no prompt structure or scaffolding.
  • Structured AI comparison. A condition that requires source use, dissent search, fact checking, tactic formation, KPI review, and a reflection checkpoint.
  • Treatment fidelity verification. Logs, prompts, adherence checks, dosage measurements, and participant compliance records.
  • Outcome hierarchy beyond immediate output. Productivity, quality, retention, transfer, metacognition, critical reasoning, and delayed assessment.
  • Process evaluation. Records of how users reasoned, relied, challenged, revised, or checked outputs during the session.
  • Dissent or error exposure. Conflicting evidence, wrong AI answer trials, source checking, and calibration tests.
  • Predefined cognitive KPI. A specific success metric registered before the study begins.

Diagram 1 displays the rubric in scorecard form, and the rubric is offered as a practical checklist rather than a verdict. Its purpose is to help readers quickly see where existing studies are strong and where they leave room for stronger designs. A study that satisfies all nine elements supports causal claims about AI use and cognitive outcomes within the population it sampled. A study that satisfies fewer than nine elements supports narrower claims within its design scope only.

Diagram 1. Nine-Element Study Audit Rubric for AI Cognition Research

The rubric is not a pass-fail instrument but a scope-clarification instrument. A correlational survey that satisfies elements one and six but does not satisfy elements two through five is not a failed study. The survey is a study that has produced evidence about associations within its design scope. The rubric distinguishes that evidence from the causal claims being made from it in public discourse.
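As a practical illustration of how the rubric could be applied in scorecard form, the sketch below encodes the nine elements and reports which ones a given study satisfies. The element keys paraphrase the list above; the boolean scoring scheme is an assumption for illustration, not a published instrument.

```python
# Practical sketch of the nine-element rubric as a scorecard. Element keys
# paraphrase the list above; the boolean scoring scheme is an assumption.

RUBRIC = [
    "defined_ai_intervention",
    "no_ai_comparison",
    "general_ai_comparison",
    "structured_ai_comparison",
    "treatment_fidelity_verification",
    "outcome_hierarchy_beyond_output",
    "process_evaluation",
    "dissent_or_error_exposure",
    "predefined_cognitive_kpi",
]

def audit(study):
    """Return the count of satisfied elements and the elements that limit claim scope."""
    satisfied = [element for element in RUBRIC if study.get(element, False)]
    missing = [element for element in RUBRIC if element not in satisfied]
    return len(satisfied), missing

# Example: a correlational survey that defines its measure and part of its
# outcome hierarchy but has no comparison arms supports associational claims only.
survey = {"defined_ai_intervention": True, "outcome_hierarchy_beyond_output": True}
score, gaps = audit(survey)
print(score, gaps)
```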

6. What the Existing Evidence Shows and Does Not Show

The existing peer-reviewed literature on AI use and cognitive outcomes can be sorted into two clusters when scored against the nine-element rubric. The first cluster is the structural-argument cluster that approaches the standards required to evaluate the question this paper poses. The second cluster is the descriptive-decline cluster that establishes important problem definition but does not meet the standards for the causal claims being made from it.

6.1 The structural-argument cluster

Bastani et al. (2025) is one of the most relevant structural precedents in the current source pool. The field experiment with approximately one thousand high school mathematics students compared two AI tutor designs. The GPT Base condition gave students standard ChatGPT access. The GPT Tutor condition used prompts designed to provide teacher-style hints rather than direct answers. GPT Base improved practice grades by 48 percent but reduced exam grades by 17 percent when AI access was removed. GPT Tutor improved practice grades by 127 percent, and the negative learning effects observed in the unguarded condition were largely mitigated by the safeguards in GPT Tutor.

The result matters in three ways at once. The same AI tool produced opposite cognitive outcomes depending on how the interaction was structured, which shifts the empirical question from “is AI good or bad for learning” to “what about the interaction structure determines the cognitive effect.” The size of the effect (a 48 percent practice gain accompanied by a 17 percent exam decline in one arm; a 127 percent practice gain with the negative exam effect largely mitigated in the other) is large enough to warrant serious follow-up. The study was run in a real classroom with real students at real scale, which is rare in this literature and gives the finding ecological weight that laboratory studies cannot produce on their own.

Bastani satisfies several elements of the rubric but remains incomplete for the present paper’s target claim. The AI conditions are defined. The study has a baseline measure (textbook-only) that functions as a no-AI comparator within Bastani’s research question. The study has a general AI condition (GPT Base) and a partial structured AI condition (GPT Tutor). The outcome hierarchy includes both immediate practice grades and delayed exam performance after AI removal. The predefined KPI is reported. Treatment fidelity is evidenced by the prompt structures and the platform configuration.

Two distinctions matter in the Bastani discussion. A no-AI condition using standard course resources and the same teacher review phase can function as a no-AI comparator, but it is not the same as a parallel human-supported method, and it does not by itself establish whether structured AI outperforms a valuable non-AI instructional alternative. The control arm is a baseline measure in Bastani’s design, but for the present paper’s target claim it is not a meaningful non-AI methodological alternative.

The structured AI condition in Bastani uses fixed safeguarding prompts that provide scaffolded mathematics help, without human-arbiter checkpoints, source evaluation, dissent capture, or integrated reasoning chains. The Bastani structured prompts were designed to avoid harm rather than to teach or to challenge the subject matter; the prompts kept the AI from giving away the answer, but they did not configure the AI to function as a tutor that surfaces dissent, presses the student on misconceptions, or forces metacognitive reflection. These are not necessarily weaknesses relative to Bastani’s stated research question, which was whether AI tutoring could be designed to avoid harm to independent performance, but they are material limitations for the present paper’s target claim, which is whether fully governed structured AI use accelerates cognitive development beyond what default AI use produces.

A practical caveat applies to all current AI cognition research, Bastani included. The studies use specific model versions that age out of the deployed environment quickly. Bastani’s GPT Base and GPT Tutor conditions used the model generation available at the time of the trial, which has since been superseded multiple times. The structural argument the paper extracts from Bastani survives the model-generation question because the argument is about the shape of the interaction, not the capability of any specific model. Future replications will need to lock model versions explicitly and report them, because a study that reports “ChatGPT” without specifying the version is reporting on a moving target. Bastani is cited throughout this paper as the closest available structural precedent for the broader claim, with the understanding that the model-specific findings will need to be reproduced under newer models before the practice-grade and exam-grade effect sizes can be treated as durable.

Garg et al. (2025) provide the cleanest three-arm design ally available. Their study with 157 first-year engineering students compared three conditions: a control group with internet access and no generative AI, a generative AI group without prompt training, and a generative AI group with structured prompt training. The structured-prompt-training group outperformed both alternatives across Bloom-aligned levels. The study is the first published example of the three-arm minimum design Section 5.2 specified. The study lacks the human-arbiter governance layer the methods proposal in Section 7 introduces, but the architecture itself is the design ally the field needs.

Fütterer et al. (2026) provide the randomized controlled trial discipline precedent. Their trial with 371 school students compared two generative-AI-supported self-regulated-learning interventions against standard ChatGPT control. The interventions targeted different components: one targeted motivational components through utility-value framing, the other targeted strategic components through cognitive-learning-strategy support. The findings were mixed across motivation, strategy use, effort, and learning outcomes. The mixed findings reinforce the position that structured AI must be tested by specific cognitive outcome rather than assumed beneficial across outcomes.

Gerlich (2025b) provides the closest available four-condition mapping. The cross-country experimental study (n=150 across Germany, Switzerland, and the United Kingdom) compared four conditions: human-only, AI-only, human plus AI unguided, and human plus AI guided with structured prompting. Across 450 evaluated responses, structured prompting reduced cognitive offloading and enhanced both critical reasoning and reflective engagement, as measured by expert rubric ratings and self-report indices. The study is single-author and published in MDPI Data, which warrants methodological caution, but the four-condition design is the closest existing approximation to the architecture this paper argues the field needs. The study should be distinguished from Gerlich (2025a), which is a separate correlational survey treated in Section 6.2 below.

Vaccaro et al. (2024) provide the meta-analytic concession. Their preregistered systematic review and meta-analysis of 106 studies and 370 effect sizes found that human-AI combinations on average underperformed the best of humans or AI alone. The finding is the strongest available counter-evidence to any presumption that AI access automatically produces augmentation. The finding does not refute the structural argument. The finding establishes that the structural argument requires testing rather than assertion, and the variance Vaccaro documents across task types and combination structures is the variance that proper testing of structured interventions should explain.

6.2 The descriptive-decline cluster

Five sources are cited prominently in the public discourse as evidence that AI use causes or contributes to cognitive decline. Each source establishes important findings within its design scope. None of the five satisfies the standards in Section 5 for the causal claims being made from them.

Gerlich (2025a) is a cross-sectional correlational study with 666 United Kingdom participants. The mixed-methods design used questionnaire items based on the Halpern Critical Thinking Assessment and Terenzini critical thinking measures, plus 50 semi-structured interviews. Headline correlations include AI use with cognitive offloading at r = 0.72 and offloading with critical thinking at r = −0.75. The correlations are large; the sample is reasonable for survey research; and the findings establish a clear association. The study cannot establish causation. The study cannot rule out that participants who tend toward suboptimal offloading are more likely to use AI, that participants with weaker critical thinking skills find AI more useful, or that a third variable explains both the AI use and the offloading. The audit verdict is that the study identifies the association and cannot establish the causal claim being made from it in public discourse.

Lee et al. (2025) is a survey of 319 knowledge workers reporting on 936 first-hand examples of generative AI use in workplace tasks. Higher confidence in generative AI predicted less reported critical thinking, while higher self-confidence predicted more reported critical thinking. The study establishes the perceived effect among knowledge workers, which matters for understanding adoption patterns and self-reported experience. The study cannot establish whether the perceived reduction in critical thinking corresponds to actual reduction, because the outcome is self-reported under unblinded conditions where participants knew they were reporting on their AI use. The audit verdict is that the study identifies the perceived effect and cannot establish that structured intervention prevents the effect.

Zhai et al. (2024) is a PRISMA systematic review of 14 articles on over-reliance on AI dialogue systems. The review synthesizes the existing problem literature and identifies the cognitive abilities most often affected, including decision-making, critical thinking, and analytical reasoning. The review covers a thin evidence base because the empirical literature on AI over-reliance is itself early-stage. The audit verdict is that the review synthesizes the problem literature and cannot test the structured intervention because the structured intervention has not yet been tested in studies the review could include.

Sparrow et al. (2011) is the foundational cognitive offloading study in the search engine context. The study established that participants showed altered memory for information they expected to be able to retrieve later. The replication record for the specific findings is mixed, and the work predates large language models by more than a decade. The study is correctly cited as foundational for the cognitive offloading mechanism. The study is incorrectly cited when it is presented as evidence that generative AI specifically causes cognitive decline. The audit verdict is that the study establishes the offloading mechanism, that the replication record is mixed, and that the evidence base predates large language models.

Kosmyna et al. (2025, preprint, excluded from the verified peer-reviewed source pool) reports EEG analysis of writing performance under three conditions: large language model assistance, search engine use, and unaided writing. The preprint reports reduced brain connectivity and reduced ownership of essays in the LLM condition. The preprint has received extensive media coverage and is widely cited in public discourse as evidence of cognitive harm from AI use. The study is not peer-reviewed. The sample is small and WEIRD (Western, Educated, Industrialized, Rich, Democratic). This paper does not include Kosmyna in the verified source pool, and the audit verdict is that the source is preprint and should not carry weight in the peer-reviewed evidence base until and unless it survives peer review.

6.3 The pattern

The descriptive-decline cluster shares a pattern. Each source is correlational, self-report, small or single-author, preprint, or a review of a thin evidence base. Each source is valuable for what it establishes within its design scope. None of the five establishes the causal claim being made from it in public discourse, because none of the five satisfies the methodological standards required for that claim.

The structural-argument cluster also has a pattern. Each source approaches but does not fully meet the standards in Section 5. Bastani is partial on two fronts, and Garg lacks the human-governance layer. Fütterer compares two structured interventions against unstructured AI but does not include a no-AI control. Gerlich (2025b) is single-author. Vaccaro is meta-analytic and inherits the limitations of the underlying studies it synthesizes.

The combined pattern across both clusters is that no source in the verified peer-reviewed pool fully tests the intervention class that structured human-governed AI use represents. The descriptive-decline cluster cannot establish that AI causes cognitive decline. The structural-argument cluster cannot establish that the structural intervention this paper proposes produces cognitive acceleration. The field has the design precedents, the methodological standards, and the relevant theoretical foundations. The field has not yet integrated them into a single study that tests the proposed intervention class against the standards required for the causal claim.

7. The Methods Space: HAIA-RECCLIN Reasoning as the Proposal

The argument the paper makes in Sections 4 and 6 reduces to a single observation. AI use as currently practiced lacks structure, lacks development discipline, and lacks the human verification that would distinguish governance from acceptance. The cognitive science literature reviewed in Sections 2 and 3 establishes why those three things matter for cognitive development: structure organizes what the user attends to, development discipline ensures the user’s cognitive work is observable to the user and to others, and human verification preserves the dissent and arbitration that productive cognitive dissonance requires. The AI cognition research that the existing peer-reviewed literature has produced did not test for, look for, or consider any architecture that provides all three. The studies tested AI as a tool and treated the methods of use as a fixed background condition rather than as the variable that determines whether the cognitive outcome is decline, neutrality, or development. The methods of use were the missing variable, not any particular architecture that would supply them. The cognitive question cannot be settled until the methods of use are tested as the intervention they actually are. HAIA-RECCLIN Reasoning is offered as one operationalization of the methods-of-use category that is concrete enough to test. Other operationalizations are possible and welcome. The remaining sections specify this operationalization, the measurement instrument, and the validation design that would close the question for any operationalization the field chooses to test.

The principle that methods of use change cognitive outcomes is not new. Educational research has documented for decades that lecture, discussion, problem sets, apprenticeship, and structured practice produce different cognitive results from the same nominal subject matter. Research methodology has documented for decades that single-arm studies, missing controls, and conflated independent variables produce findings that do not survive replication. Cognitive science has documented for decades that scaffolding, dissent, metacognition, and arbitration are conditions under which learning happens. None of this was new in 2022 when generative AI tools entered general use. None of it required HAIA-RECCLIN Reasoning to articulate. The architecture this paper proposes is a synthesis of prerequisites the literature already named, applied to a tool the literature had not yet evaluated under the methods-of-use frame the literature itself established. The questions the field should have been asking from the start were obvious. How is AI being used? What is the result of how it is being used? Should it be used differently? Is there a way to use it differently? These questions were not asked because the field framed AI as a tool to be evaluated rather than as a context within which methods of use would determine the cognitive outcome. The failure was a failure of vision, not a failure of capability or resources. The literature, the methodology, and the cognitive science were all available the entire time.

7.1 The methods-space gap

Section 4 specified what default single-platform AI use does not provide, and Section 5 specified what proper research requires. Section 6 established that the existing peer-reviewed literature has not produced a structured intervention that satisfies both the cognitive science conditions in Section 2 and the methodological standards in Section 5. The methods-space gap is the absence of a publicly available structured-AI intervention class that operationalizes the known prerequisites for cognitive development acceleration within the constraints of single-platform AI use.

The proposal in this paper is HAIA-RECCLIN Reasoning (Human Artificial Intelligence Assistant — Researcher, Editor, Coder, Calculator, Liaison, Ideator, Navigator), which is offered as one operationalization of the intervention class the field needs to test. The proposal is not offered as the only possible operationalization, and the proposal is not claimed to have been independently validated. The proposal is claimed to be a coherent integration of the cognitive science prerequisites and the methodological standards, suitable for testing under the design specified in Section 8.

7.2 The architecture

HAIA-RECCLIN Reasoning is a single-platform structured cognitive interaction architecture organized around three components. The first component is role separation across seven cognitive role functions. The second component is a defined work-product chain consisting of source verification, dissent capture, and Fact-Tactic-KPI integration. The third component is the human-arbiter checkpoint structure that bookends the AI interaction with structured human cognitive work.

Diagram 2 displays the architecture. The seven roles are Researcher (sources, verification, evidence gathering), Editor (structure, clarity, audience adaptation), Coder (software generation, code review, and debugging), Calculator (quantitative analysis, data processing), Liaison (coordinating perspectives, stakeholder communication), Ideator (generating creative options, novel approaches), and Navigator (documenting dissent, preserving trade-offs without resolution). The roles are not arbitrary categories. The roles correspond to distinguishable cognitive functions that the work of evaluating an AI output decomposes into, and the role separation is what allows a single user working with a single AI session to engage each function deliberately rather than collapsing all of them into the single act of accepting the output.

Diagram 2. HAIA-RECCLIN Reasoning Seven-Role Architecture for Single-Platform Cognitive Interaction with Human-Arbiter Checkpoints

The human-arbiter checkpoints sit at the entry and exit of the AI session. At entry, the human arbiter constructs the prompt, assigns the cognitive role the task requires, and defines the scope of acceptable output. At exit, the human arbiter performs source verification on cited references, captures any dissent that arose during the session, audits the Fact-Tactic-KPI chain for coherence, and approves, rejects, or revises the session output. The audit trail produced by the checkpoint structure is the artifact that converts the session from a default exchange into a governed cognitive interaction.

A short worked example clarifies what a participant would actually do. A user begins a session by assigning the Researcher role: the user states the question, names the kinds of sources required, and asks the AI to surface candidate references. The AI returns a candidate list with brief summaries. The user verifies a sample of the candidates before any reliance. Verification means opening the cited source’s URL or DOI, confirming the source actually exists in the venue named, confirming the cited claim appears in the source’s text rather than being a related claim attributed to it, confirming the publication venue and date match what the AI said, and confirming the authorship is what the AI said. Where the AI summary paraphrases the source, the user reads the relevant passage in the source and compares. Where the AI provides a quotation, the user finds the quotation in the source verbatim. Sources that fail any check are rejected, and sources that pass are retained with the verification noted. The user then assigns the Navigator role: the user asks the AI to identify the strongest counter-positions to the emerging argument and to surface internal disagreements among the verified sources. Dissent is captured rather than reconciled. The user assigns the Editor role to draft a working position that integrates the verified evidence and preserves the dissent. The user closes by completing the Fact-Tactic-KPI chain: each claim is paired with the action it would inform and the measurable outcome that would test it. Only after the chain is complete and the user has approved or revised the output does the session output enter any downstream document. The audit trail records each step.
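The sketch below illustrates, under stated assumptions, what the audit trail from such a session might record. The field names and structure are hypothetical illustrations of the entry and exit checkpoints described above, not a published HAIA-RECCLIN Reasoning schema.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not a published HAIA-RECCLIN Reasoning schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceCheck:
    citation: str           # the reference as the AI presented it
    locator: str            # URL or DOI the user opened during verification
    exists: bool            # source found in the venue the AI named
    claim_supported: bool   # cited claim appears in the source text
    retained: bool          # kept only if every check passed

@dataclass
class SessionAuditRecord:
    role_assigned: str                      # e.g. "Researcher", "Navigator", "Editor"
    entry_prompt: str                       # prompt constructed at the entry checkpoint
    source_checks: List[SourceCheck] = field(default_factory=list)
    dissents_captured: List[str] = field(default_factory=list)   # counter-positions preserved, not reconciled
    fact_tactic_kpi: List[dict] = field(default_factory=list)    # each item pairs a claim, an action, a measurable outcome
    arbiter_decision: str = "pending"       # "approved", "revised", or "rejected" at the exit checkpoint
```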

7.3 The cognitive science correspondence

Each component maps directly onto the cognitive science conditions laid out earlier. Role assignment before engagement enforces the metacognitive positioning that learning quality research identifies as one of its strongest predictors. Source verification at the exit checkpoint pulls the user into the higher-order cognitive processes (analyze, evaluate, construct) that distinguish development from retrieval. Dissent capture preserves the productive disagreement that reasoning needs to engage rather than resolving it by reformulating the prompt until the model produces the desired framing. The Fact-Tactic-KPI chain forces the user to construct rather than retrieve, pairing every claim with the action it would inform and the measurable outcome that would test it. The seven-role structure exercises multiple distinguishable cognitive functions in a single session, each with its own quality criteria and its own characteristic failure modes, rather than collapsing all of them into the single act of accepting the output.

7.4 What no other publicly available method currently provides at this layer

Several publicly available frameworks address subsets of the architecture: prompt engineering frameworks address role specification and prompt structure, critical AI literacy curricula address source verification, augmented decision-support systems address human-arbiter checkpoints in domain-specific applications, and hybrid intelligence research addresses the unit of analysis at the conceptual level. The reviewed source pool did not identify a publicly available general-purpose method that integrates all three components at this layer for single-platform AI use.

The integration matters because the cognitive science prerequisites do not function independently. Role assignment without source verification produces structured prompts that still permit hallucination acceptance. Source verification without dissent capture produces audit of the output without arbitration of alternatives. Dissent capture without Fact-Tactic-KPI integration produces preserved disagreement that does not resolve into action. The architecture proposes the integration as a single class to be tested as a class.

The architecture also assumes the user brings resources to the session beyond the AI itself. A user who has access only to a single AI platform with no external resources cannot perform the verification step, cannot resolve dissent without an external referent, and cannot arbitrate between competing AI outputs without something to arbitrate by. The full architecture presumes the user has access to and uses search engines for primary-source retrieval, books or institutional repositories for canonical references, peer collaboration with classmates or coworkers or colleagues for cross-check on interpretation, and where available, other AI platforms as cross-reference for outputs that warrant a second opinion. The architecture also presumes the user brings prior education or domain knowledge sufficient to evaluate the AI output against what the user already knows. Where the user does not have these resources or this knowledge, the architecture’s checkpoints are formalities rather than governance, and the cognitive benefit the architecture is designed to produce does not accrue. This is a strong precondition. It is also the precondition any cognitive intervention has, including formal classroom instruction, which presumes the student brings prior knowledge, peer interaction, and access to reference materials beyond the teacher.

The dissent function deserves separate treatment because it is the active mechanism through which the architecture produces cognitive development rather than just structured prompting. When the AI is configured to surface counterevidence, identify weaknesses, and present alternatives the user did not request, the user is forced into the cognitive work the engagement-mode hierarchy and the productive dissonance literature both identify as the prerequisite for higher-order learning. Dissent forces the user to verify the original claim against what the AI has now surfaced as a competing claim. Dissent forces the user to seek out external resources because the user cannot resolve the disagreement from the AI session alone. Dissent forces the user to articulate why one position is preferred over another, which exercises the evaluation and synthesis levels of the Bloom revision. In the absence of a teacher, peer cohort, or formal curriculum, the AI configured for dissent provides the friction that makes learning happen. Default AI use, configured for agreement, produces the opposite: the user receives confirmation of whatever framing the user brought, and the cognitive work the dissent would have prompted does not occur. The architecture’s value is not in the role names or the checkpoint structure as such; the value is in the integration of dissent with verification and arbitration as the pattern that converts AI use from substitution into scaffolding.

7.5 One operational path, not a claim of exclusivity

A reader who has reached this section has a fair question. The argument so far establishes that the cognitive science prerequisites and the methodological standards exist, that default AI use does not satisfy them, and that an integrated intervention class is needed. Why this particular integration? Why these seven roles? Why these checkpoints? Why this Fact-Tactic-KPI chain rather than some other structure?

The honest answer is that this is one operational path the author developed while running practitioner work that needed governance the field had not yet built. The roles came from observing which cognitive functions were repeatedly collapsed into the single act of accepting an AI output. The checkpoints came from a working career that involved physical safety oversight, where the difference between governance and assumption is settled by who has the authority to stop the line. The chain came from Factics (Puglisi, 2012), an earlier methodology developed before large language models existed, that paired claims with the actions they would inform and with the measurable outcomes that would test the actions.

The motivation underneath the architecture is worth naming directly. The architecture was built over several years as a personal cognitive development discipline, not as a research deliverable. The author needed a method that would allow learning across multiple domains at the speed AI assistance makes possible without surrendering the ability to challenge what was learned, verify what was claimed, and hold professional accountability for the resulting work. No available method offered the four-component integration the author wanted to hold the practice to. The architecture exists because the gap was personal first, and the architecture is offered to the field because the same gap appears to be the methodological gap the field has not yet closed at scale.

Other operational paths can satisfy the same prerequisites. A different role decomposition could work, a different checkpoint structure could work, a different reasoning chain could work. The architecture is not offered as the answer. The architecture is offered as evidence that an operational path exists, that it can be specified concretely enough to test, and that it is internally coherent across the cognitive science, methodological, and practical layers it touches.

What the proposal does claim is narrower. The integration of role separation, source verification with dissent capture, integrated reasoning chain, and human-arbiter checkpoints is a coherent intervention class suitable for the testing the field has not yet performed. The value of this proposal lies not in claiming uniqueness, but in showing that a fully specified, testable intervention class can be built from the cognitive science and methodological standards the field already possesses. The next section specifies the test.

8. The Measurement Space and the Validation Design: HEQ with AIS Composite as the Proposal

8.1 The measurement-space gap

The augmentation tradition Section 3 reviewed has had access to the conceptual framework for human-AI partnership since Licklider in 1960 and Engelbart in 1962, and to formalized definitions of hybrid intelligence since Dellermann et al. in 2019. The field does not have a publicly available cross-platform behavior-anchored governance-integrated measurement instrument that decomposes augmented intelligence into measurable behavioral dimensions. Ganuthula and Balaraman (2025) provide the closest peer-reviewed parallel effort. The field needs more than one such instrument so that comparison and convergent validation become possible.

The proposal in this paper is the Human Enhancement Quotient (HEQ), a four-dimension behavioral instrument, with the Augmented Intelligence Score (AIS) as its arithmetic-mean composite. HEQ is the dimensional instrument; AIS is the score derived from HEQ. The two are reported together because diagnosis requires the dimensions and tracking requires the composite. The proposal is offered as the measurement-space contribution that the validation design Section 8.4 requires. The proposal is not claimed to have been independently validated. The proposal is claimed to be a coherent operationalization of the augmentation tradition’s measurement commitments suitable for use as a secondary outcome measure in the trial design that closes this paper.

8.2 The four behavioral dimensions and the AIS composite

HEQ comprises four behavioral dimensions that together characterize human-AI collaboration quality. The definitions below follow the canonical instrument specification (Puglisi, 2025e; 2026a). Each dimension is scored on a five-band structure with score ranges 0-49, 50-69, 70-79, 80-89, and 90-100. The 70-79 and 80-89 bands together form the practitioner-significant range; the boundary between them is the construct boundary the instrument is most carefully calibrated against, because it distinguishes principled-but-reactive collaboration from architected collaboration. AIS is the arithmetic mean of the four dimensions and is computed as AIS = (CAS + EAI + CIQ + AGR) / 4. Diagram 3 displays the structure with the theoretical foundation for each component.

Diagram 3. HEQ Four-Dimension Structure with AIS Composite as Cross-Platform Behavior-Anchored Measurement Instrument
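A minimal sketch of the scoring arithmetic just described follows, assuming dimension scores on the published 0-100 scale; the example values are illustrative rather than drawn from any administration.

```python
# Minimal sketch of the five-band structure and the AIS arithmetic mean.
# Example scores are hypothetical, not drawn from any administration.
BANDS = [(0, 49), (50, 69), (70, 79), (80, 89), (90, 100)]

def band(score: int) -> str:
    """Return the published score range a dimension score falls into."""
    for low, high in BANDS:
        if low <= score <= high:
            return f"{low}-{high}"
    raise ValueError("score must be between 0 and 100")

def ais(cas: float, eai: float, ciq: float, agr: float) -> float:
    """AIS composite as the arithmetic mean of the four HEQ dimensions."""
    return (cas + eai + ciq + agr) / 4

# Example: dimension scores of 82 (CAS), 78 (EAI), 74 (CIQ), 80 (AGR)
# give AIS = 78.5, with CAS and AGR in the 80-89 band and EAI and CIQ in 70-79.
```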

Cognitive Agility Speed (CAS). How quickly and clearly a person processes, connects, and articulates ideas when working with AI. The operational scoring mechanism is the rate of accurate insight generation given AI-augmented working memory load. Low-band CAS describes processing that lags, struggles to integrate new information, and depends heavily on the AI to structure thoughts. Mid-band CAS connects ideas across two or three domains with clear logic and adapts to new information with minor adjustments, but slows when integrating a fourth idea or shifting contexts. High-band CAS connects four or more domains without prompting, maintains clarity under complexity, anticipates at least one level of counterargument while reasoning, and shifts between concrete examples, strategic models, and higher-order implications without losing the thread. The construct boundary at the 75-to-85 transition distinguishes seeing connections when pointed there from seeing connections before being asked, and catching the first-order inconsistency from catching the third-order implication. The difference is not just speed; it is whether speed stays intelligible under complexity.

Ethical Alignment Index (EAI). How well the person’s thinking reflects fairness, responsibility, and transparency when operating with AI. The operational scoring mechanism is the consistency of human-AI reasoning with declared ethical frameworks under uncertainty. Low-band EAI shows values absent or inconsistently applied, tradeoffs ignored, authority deference without critical examination, and no accountability for claims. Mid-band EAI states fairness and transparency as goals, acknowledges tradeoffs when asked directly, follows existing rules, and notes general limitations without operationalizing them; ethical reasoning is present but reactive. High-band EAI builds values into system design before being asked, surfaces uncomfortable truths proactively, names specific blind spots, chooses open standards over lock-in, sets defaults that protect the less powerful party, and shows intellectual honesty even when inconvenient. The construct boundary at 75-to-85 distinguishes ethics as a principled posture from ethics as a working control system, and reactive limitation acknowledgment from proactive limitation identification with proposed remedies.

Collaborative Intelligence Quotient (CIQ). How effectively the person integrates diverse perspectives within AI-augmented collaboration. The perspectives can come from AI platforms (single or multiple), personal domain expertise, external research (search engines, academic papers, regulatory documents), human collaborators (colleagues, clients, peer reviewers), or any other source brought into the collaboration context. CIQ scores the integration behavior, not the source type: appropriate reliance (the ratio of correct trust to correct skepticism across all inputs), dissent engagement (whether provided or self-generated), source diversity, and governed decision-making. The operational scoring mechanism is the Reliance Calibration Score. CIQ is the dimension most directly tied to the architecture in Section 7, because the exit checkpoint produces the audit trail that CIQ measurement uses. The instrument carries an explicit construct note: CIQ measures human-to-AI collaborative intelligence, not human-to-human collaboration; external source integration brought into the AI-augmented workflow does score CIQ because it represents source diversity within that workflow. Low-band CIQ accepts AI outputs without verification and shows minimal iterative dialogue. Mid-band CIQ uses one or two AI platforms effectively, notices disagreements but does not deeply investigate their source, and accepts AI output unless it is blatantly wrong. High-band CIQ uses multiple sources, catches specific AI errors, preserves dissent rather than forcing convergence, treats AI as governed partner rather than oracle, and integrates external perspectives substantively.

Adaptive Growth Rate (AGR). How the person learns from feedback and applies it forward across AI collaboration contexts. The operational scoring mechanism is the acceleration rate of capability gain per AI interaction cycle, distinct from the rate of change because acceleration is the second derivative of capability rather than the first; that is, the dimension asks whether improvement itself accelerates across repeated cycles, not just whether improvement occurs. Low-band AGR ignores or dismisses feedback and repeats the same patterns despite correction. Mid-band AGR accepts feedback and corrects specific points raised, makes surface-level adjustments, names lessons from failures, and shows awareness of knowledge gaps but reverts to prior patterns when context changes. High-band AGR integrates feedback rapidly, applies lessons to related areas that were not flagged, shows pattern recognition across critiques, and creates systems to prevent repeating errors. The construct boundary at 75-to-85 distinguishes situational learning from cumulative infrastructure: at 75, the person corrects the artifact; at 85, the person changes the process that produced the artifact.
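A small worked illustration of the rate-versus-acceleration distinction follows, using hypothetical capability scores across four interaction cycles.

```python
# Hypothetical capability scores across four AI interaction cycles.
capability = [60, 64, 70, 78]

# First difference: how much capability improves per cycle.
rate = [b - a for a, b in zip(capability, capability[1:])]    # [4, 6, 8]

# Second difference: whether the improvement itself is growing.
acceleration = [b - a for a, b in zip(rate, rate[1:])]        # [2, 2]

# Positive second differences are the pattern high-band AGR describes:
# the gain per cycle increases, rather than merely recurring at the same size.
```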

The AIS composite is produced through a multi-platform administration protocol. The canonical instrument requires a minimum of three AI platforms, with the cross-platform mean as the reported AIS and the standard deviation across platforms as the confidence band. The multi-platform requirement is a construct-level necessity rather than a methodological preference: augmented intelligence, properly defined, is shown through effective human orchestration across AI systems rather than dependence on a single architecture. A person who collaborates fluently with one platform and struggles with others shows narrower collaborative intelligence than one who adapts effectively across architectures. Human arbitration of the composite is required before the score is finalized.
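The arithmetic of the multi-platform protocol can be sketched as follows. The platform labels and scores are hypothetical; the three-platform minimum, cross-platform mean, and standard-deviation confidence band follow the canonical specification described above.

```python
# Sketch of the multi-platform AIS administration arithmetic.
# Platform names and per-platform scores are hypothetical.
from statistics import mean, stdev

def cross_platform_ais(per_platform_ais: dict[str, float]) -> tuple[float, float]:
    """Return (reported AIS, confidence band) from at least three platform administrations."""
    if len(per_platform_ais) < 3:
        raise ValueError("the canonical protocol requires at least three AI platforms")
    scores = list(per_platform_ais.values())
    return mean(scores), stdev(scores)

reported_ais, confidence_band = cross_platform_ais(
    {"platform_a": 81.0, "platform_b": 77.5, "platform_c": 79.0}
)
# reported_ais ≈ 79.17 (cross-platform mean); confidence_band ≈ 1.76 (standard deviation)
# Human arbitration of the composite is still required before the score is finalized.
```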

A scope note applies to this paper. The methods proposal in Section 7 describes single-platform HAIA-RECCLIN Reasoning, while AIS as canonically specified requires multi-platform administration. The two are not in conflict. HEQ administered within a single-platform session can score the dimensions for that session, with the understanding that AIS as a construct-level measurement of augmented intelligence requires the multi-platform protocol. The validation design in Section 8.4 accommodates both: the single-platform structured arm uses HEQ scoring within the session, and the protocol allows the multi-platform AIS administration as an additional secondary outcome where resources permit.

8.3 The author’s own practice as feasibility evidence

The measurement gap Section 8.1 names has produced a familiar pattern in the public conversation about augmented intelligence. Frameworks are proposed, diagrams are drawn, conferences are held, and validation studies are called for, but the instruments themselves are rarely run, even by the people proposing them. The reader is left without evidence that the proposed measurement is operationally possible at any scale, including the scale of one user across one period of time.

The author has run HEQ on the author’s own work across four instances over the past six months. The four AIS scores have been recorded, the four-dimension breakdowns (CAS, EAI, CIQ, AGR) have been recorded, and the composite has been tracked across the period. A narrative summary of the four-instance record and the band-boundary observation appears in Appendix B. The underlying score data is held by the author and is available for inspection by researchers conducting validation work, and the observation is not used as a threshold in this paper.

The four-instance record is not a research finding. It is one user, closed observation, no control group, no blinding, no independent rater, no comparison condition. The author does not present it as evidence that HEQ measures what it claims to measure or that AIS reflects a real construct that exists outside this single case. The author presents it for a different and narrower purpose: to show that the proposed instrument can be operated, that the four dimensions produce numbers, that the composite can be computed and tracked over time, and that the construct does not collapse on first contact with practice.

The distinction between feasibility evidence and validation matters because a measurement gap can persist for two reasons. The first is that nobody has built a candidate instrument. The second is that candidate instruments have been proposed but never operated. The augmented intelligence field is in the second condition: multiple frameworks have been proposed for measuring human-AI collaboration quality, but few have been run. The four-instance record is the author’s contribution to moving the field from talking about measurement to showing measurement, even at the smallest possible scale and with the explicit acknowledgment that scientific validation requires the multi-arm trial Section 8.4 specifies.

The proposal does not claim that HEQ with AIS Composite has been validated, and the proposal does not claim that the four-dimension decomposition is the only useful structure or that AIS as composite is preferred over reporting the four dimensions separately. The four-instance record removes one practical obstacle: it shows that HEQ can be run by a single practitioner in real work. Other researchers and organizations can now test it at larger scale, compare results, and contribute the independent replication that the author cannot supply alone.

8.4 The validation design

The validation design that would test both the methods proposal in Section 7 and the measurement proposal in Section 8 is a multi-arm randomized controlled trial with four or five arms.

Arm one is the no-AI condition. Participants perform the assigned cognitive tasks without AI access, using only conventional resources. The arm satisfies the no-AI baseline element of the rubric and provides the business-as-usual control that Section 5.3 specified.

Arm two is the parallel traditional method condition. Participants perform the assigned cognitive tasks with structured human support, such as a human tutor, that does not include AI. The arm distinguishes AI presence from methodological alternative, addressing the partial-control gap the Bastani study leaves open.

Arm three is the unstructured single-platform AI condition. Participants perform the assigned cognitive tasks using default single-platform AI access without structured prompting or governance scaffolding. The arm satisfies the general AI condition element of the rubric and represents the dominant current deployment model.

Arm four is the structured single-platform AI condition without the full governance layer. Participants perform the assigned cognitive tasks using a structured prompting protocol comparable to the Garg et al. (2025) study but without the human-arbiter checkpoint structure or the complete role-separation architecture. The arm isolates the contribution of structured prompting from the contribution of the full proposed intervention class.

Arm five is the structured single-platform AI condition with the full HAIA-RECCLIN Reasoning architecture. Participants perform the assigned cognitive tasks using the role separation, source verification, dissent capture, Fact-Tactic-KPI chain, and human-arbiter checkpoint structure that Section 7 specified. The arm tests the complete proposed intervention class.

The design satisfies the three-arm minimum specified in Section 5.2 and extends to five arms to isolate the structured prompting contribution from the full governance contribution. For an initial feasibility trial, a four-arm version may omit either the parallel traditional-method condition or the structured-prompting-only condition, but the omitted arm should be named explicitly because the resulting study would answer a narrower question.

The trial design requires the standard randomized-controlled-trial controls. Preregistration of hypotheses, primary and secondary outcomes, analysis plan, and stopping rules through a recognized registry such as ClinicalTrials.gov, the Open Science Framework, or AsPredicted is required before recruitment begins. A power analysis based on the smallest effect size of practical interest in delayed cognitive performance must precede sample size determination. The randomization procedure must be specified, blocked by relevant baseline characteristics, and reported transparently. Baseline cognitive measures must be administered before random assignment so that adjusted analyses are possible.
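For concreteness, the sketch below shows one way the blocked randomization could be implemented, assuming a simple stratification on a baseline score. The arm labels, banding rule, block size, and seed are illustrative assumptions rather than part of any preregistered protocol.

```python
# Minimal sketch of stratified permuted-block randomization for the five arms.
# Arm labels, the baseline banding rule, and the seed are illustrative assumptions.
import random
from collections import defaultdict

ARMS = ["no_ai", "traditional_method", "unstructured_ai", "structured_prompting", "full_architecture"]

def stratified_blocked_assignment(baseline_scores: dict[str, float], seed: int = 2026) -> dict[str, str]:
    """Map participant id -> arm, balancing arm counts within low/mid/high baseline strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pid, score in baseline_scores.items():
        stratum = "low" if score < 40 else "mid" if score < 70 else "high"
        strata[stratum].append(pid)
    assignments = {}
    for members in strata.values():
        rng.shuffle(members)
        for start in range(0, len(members), len(ARMS)):
            block = members[start:start + len(ARMS)]
            arms = ARMS.copy()
            rng.shuffle(arms)
            assignments.update(zip(block, arms))   # a final partial block leaves a small imbalance
    return assignments
```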

Outcome scoring must be blinded. Human raters who score participant work products on critical thinking, retention, transfer, and delayed performance must not know which arm produced which work product. Interrater reliability for HEQ scoring must be reported in the form of intraclass correlation coefficients or comparable agreement statistics across at least two raters per work product, with disagreement adjudicated by a third rater. AI model versions used in arms three through five must be locked at the start of the trial, with model identifiers, deployment dates, and any version changes during the trial documented in the methods section. Prompt logs from arms three, four, and five must be preserved for treatment fidelity verification and post-hoc analysis. Contamination controls between arms must be implemented through randomization at the level of natural clusters (classroom, team, cohort) rather than at the level of individuals where individuals share a learning environment.

Treatment fidelity verification requires logs of prompts used in each AI condition, documentation of adherence to the role-assignment and checkpoint protocol in arm five, and measurement of dosage in terms of session count, session duration, and total active interaction time across the trial period.

Process evaluation requires recording of the cognitive activities participants performed during each session, including which sources were verified in arm five, which dissents were captured, and which arbitration decisions were made. Mixed-method process evaluation may include think-aloud protocols and post-session structured interviews.

The outcome hierarchy includes the conventional cognitive assessments matched to the task domain, including standardized measures of critical thinking, retention beyond the immediate task, transfer to related but distinct tasks, and delayed performance after a period during which AI access is removed for all arms. The delayed no-AI transfer test is the strongest single outcome for the cognitive development question because productivity gains that depend on continued AI access cannot speak to the cognitive question this paper poses. HEQ subdimension scoring sits in the outcome hierarchy as a secondary outcome measure within the single-platform arms, capturing the augmented intelligence dimensions Section 8.2 specified. AIS should be reported only where the multi-platform administration protocol is implemented, with the four subdimensions reported separately and the composite reported in addition.

The predefined KPI is the difference in delayed performance and metacognitive monitoring between arm five and arm three, adjusted for baseline performance. The KPI specifies that the proposed intervention class must produce a measurable cognitive benefit that survives the removal of AI access. A productivity gain that disappears when the AI is removed is not the cognitive development the question in Section 1 asks about.
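As one illustration of how the predefined KPI could be analyzed, the sketch below fits a baseline-adjusted model comparing arm five with arm three. The file name, column names, and arm labels are hypothetical placeholders, and the sketch is not the preregistered analysis plan.

```python
# Sketch of a baseline-adjusted comparison of delayed performance, arm five vs arm three.
# File name, column names, and arm labels are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_outcomes.csv")
subset = df[df["arm"].isin(["arm3_unstructured_ai", "arm5_full_architecture"])]

# ANCOVA-style model: delayed performance by arm, adjusted for baseline performance.
model = smf.ols("delayed_score ~ C(arm) + baseline_score", data=subset).fit()
print(model.summary())   # the arm coefficient estimates the baseline-adjusted difference
```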

The full design is resource-intensive. Blinded expert scoring across multiple cognitive outcome measures, multi-platform AIS administration, and delayed transfer testing without AI access together require sustained funding and multi-site institutional collaboration. A staged approach is therefore plausible. An initial three-arm or four-arm feasibility pilot may precede the full five-arm trial, omitting the parallel traditional-method arm, the structured-prompting-only arm, or both, in order to test instrumentation, recruitment, and treatment fidelity before scaling to the full design. The feasibility pilot would not answer the substantive cognitive question on its own, and the staged approach is a sequencing recommendation rather than a substitute for the full trial.

8.5 What the trial would establish

A trial that produced significant differences in delayed performance and metacognitive monitoring favoring arm five over arm three would provide evidence, within that population and task domain, that structured human-governed AI use can accelerate cognitive development beyond what default unstructured AI use produces. A trial that produced significant differences favoring arm five over arm two would provide evidence that the proposed intervention class is competitive with established traditional methods. A trial that produced no significant differences across arms three through five would indicate that the structural argument as currently operationalized does not produce the predicted cognitive benefit, and the paper would have to be revised accordingly.

The trial is falsifiable. The proposal therefore meets the standard the paper invokes against the cognitive decline narrative. Many public versions of the cognitive decline narrative are not yet framed in falsifiable intervention terms because the studies cited as evidence do not specify the design that would distinguish the causal claim from the available alternative explanations. The proposal in this paper specifies the design.

Until the trial is run, claims in either direction run ahead of the evidence cited for them. The proposal’s purpose is to convert claims into testable hypotheses and to invite the trial.

9. Limitations of This Proposal

This proposal has clear limitations that the paper acknowledges directly rather than burying in caveats.

The work is the output of a single practitioner, not a research team. The architectures and instruments have been developed and tested in practitioner contexts but have not been independently validated. The four-instance HEQ record shows feasibility at the smallest possible scale but does not establish reliability, validity, or generalizability. The trial design in Section 8.4 is ambitious and would require resources, institutional partnership, and methodological expertise that a single practitioner cannot supply.

The cognitive science and augmentation literatures the paper draws on are themselves contested in places that the paper does not litigate. The Vygotsky zone-of-proximal-development framework remains a productive metaphor whose precise operationalization is debated, Gardner’s multi-dimensional intelligence framework is contested in cognitive psychology, and ICAP measurement challenges are noted but not resolved. The paper takes these as background commitments rather than as foundations to be defended in this manuscript.

The verified source pool is limited to peer-reviewed publications in journals of record, peer-reviewed conference proceedings, and foundational documents with established academic authority. Working papers, preprints, theses, and retracted publications were excluded under the source-quality principle the paper applies to itself. This excludes some recent work that may eventually become important to the question once it survives peer review. The paper depends on this pool as the evidence base and does not constitute a full systematic review. The single working-document citation in this paper, Puglisi (2012) Factics, is referenced for architectural provenance of the Fact-Tactic-KPI chain rather than as evidentiary support for any cognitive or methodological claim, which preserves the source-pool exclusion principle as applied to the evidence base.

The author developed the framework being proposed and the measurement instrument being recommended. This author-developer relationship is a recognized conflict of interest in framework-validation research, and the paper acknowledges it directly. The validation design in Section 8.4 is structured so that an independent research team can run it without the author’s involvement, which is the appropriate response to the conflict.

The trial itself carries logistical requirements distinct from the framework’s single-practitioner origin. The full five-arm design requires multi-site institutional collaboration, independent funding, and coordinated administration of cognitive assessment instruments at multiple time points across multiple cohorts. These requirements are common to ambitious cognitive intervention trials and are not unique to this proposal, but they should be named directly so that any group considering the trial design understands the resource commitment in advance.

HAIA-RECCLIN Reasoning imposes time and training costs on users. The role separation, source verification, dissent capture, and Fact-Tactic-KPI chain take longer than default AI use produces. Adoption is therefore likely to favor users with strong professional motivation to verify outputs (researchers, journalists, analysts, policy practitioners) and to encounter resistance from users for whom default speed is the primary value. This selection effect is itself a research question that any deployment study should measure.

These limitations are stated openly because the paper’s central claim is not that these tools are proven. The central claim is that they are now concrete enough to be tested. The limitations are the work the field needs to do next.

References

Anderson, L. W., & Krathwohl, D. R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman. WorldCat OCLC 44811619.

Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26), e2422633122. https://doi.org/10.1073/pnas.2422633122 (Correction: Proceedings of the National Academy of Sciences, 122(34), e2518204122. https://doi.org/10.1073/pnas.2518204122)

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives, Handbook I: The cognitive domain. David McKay.

Buçinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1), Article 188. https://doi.org/10.1145/3449287

Chi, M. T. H., & Wylie, R. (2014). The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4), 219–243. https://doi.org/10.1080/00461520.2014.965823

Dellermann, D., Ebel, P., Söllner, M., & Leimeister, J. M. (2019). Hybrid intelligence. Business & Information Systems Engineering, 61(5), 637–643. https://doi.org/10.1007/s12599-019-00595-2

Doshi, A. R., & Hauser, O. P. (2024). Generative AI enhances individual creativity but reduces the collective diversity of novel content. Science Advances, 10(28), eadn5290. https://doi.org/10.1126/sciadv.adn5290

Dweck, C. S. (2006). Mindset: The new psychology of success. Random House.

Engelbart, D. C. (1962). Augmenting human intellect: A conceptual framework (SRI Summary Report AFOSR-3223). Stanford Research Institute.

Festinger, L. (1957). A theory of cognitive dissonance. Stanford University Press.

Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10), 906–911. https://doi.org/10.1037/0003-066X.34.10.906

Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wenderoth, M. P. (2014). Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences, 111(23), 8410–8415. https://doi.org/10.1073/pnas.1319030111

Fütterer, T., Bardach, L., Kuhn, J., Keller, S. D., & Gerjets, P. (2026). Enhancing school students’ self-regulated learning through generative AI support: A randomized controlled trial. Educational Psychology Review, 38, Article 42. https://doi.org/10.1007/s10648-026-10133-8

Ganuthula, V. R. R., & Balaraman, K. K. (2025). Artificial intelligence quotient framework for measuring human collaboration with artificial intelligence. Discover Artificial Intelligence, 5, Article 268. https://doi.org/10.1007/s44163-025-00516-1

Gardner, H. (1983). Frames of mind: The theory of multiple intelligences. Basic Books.

Garg, A., Soodhani, K. N., & Rajendran, R. (2025). Enhancing data analysis and programming skills through structured prompt training: The impact of generative AI in engineering education. Computers and Education: Artificial Intelligence, 8, Article 100380. https://doi.org/10.1016/j.caeai.2025.100380

Gerlich, M. (2025a). AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies, 15(1), Article 6. https://doi.org/10.3390/soc15010006 (Correction: Societies, 15(9), Article 252. https://doi.org/10.3390/soc15090252)

Gerlich, M. (2025b). From offloading to engagement: An experimental study on structured prompting and critical reasoning with generative AI. Data, 10(11), Article 172. https://doi.org/10.3390/data10110172

Hopewell, S., Chan, A.-W., Collins, G. S., Hróbjartsson, A., Moher, D., Schulz, K. F., Tunn, R., Aggarwal, R., Berkwits, M., Berlin, J. A., Bhandari, N., Butcher, N. J., Campbell, M. K., Chidebe, R. C. W., Elbourne, D., Farmer, A., Fergusson, D. A., Golub, R. M., Goodman, S. N., … Boutron, I. (2025). CONSORT 2025 statement: Updated guideline for reporting randomised trials. BMJ, 389, e081123. https://doi.org/10.1136/bmj-2024-081123

Krathwohl, D. R. (2002). A revision of Bloom’s taxonomy: An overview. Theory Into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2

Larsen, T. M., Endo, B. H., Yee, A. T., Do, T., & Lo, S. M. (2022). Probing internal assumptions of the revised Bloom’s taxonomy. CBE—Life Sciences Education, 21(4), Article ar66. https://doi.org/10.1187/cbe.20-08-0170

Lee, H.-P., Sarkar, A., Tankelevitch, L., Drosos, I., Rintel, S., Banks, R., & Wilson, N. (2025). The impact of generative AI on critical thinking: Self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3706598.3713778

Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50–80. https://doi.org/10.1518/hfes.46.1.50_30392

Licklider, J. C. R. (1960). Man-computer symbiosis. IRE Transactions on Human Factors in Electronics, HFE-1, 4–11. https://doi.org/10.1109/THFE2.1960.4503259

Noy, S., & Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654), 187–192. https://doi.org/10.1126/science.adh2586

Puglisi, B. C. (2012). Factics: Facts, Tactics, and Measurable Outcomes. Working methodology document. basilpuglisi.com.

Puglisi, B. C. (2025e). The Human Enhancement Quotient (HEQ): Enterprise White Paper, Operational Scoring Mechanisms. basilpuglisi.com.

Puglisi, B. C. (2026a). Bridging the Measurement Gap in Augmented Intelligence: The Human Enhancement Quotient (HEQ) and Augmented Intelligence Score (AIS). SSRN Abstract ID 6583419.

Risko, E. F., & Gilbert, S. J. (2016). Cognitive offloading. Trends in Cognitive Sciences, 20(9), 676–688. https://doi.org/10.1016/j.tics.2016.07.002

Sidra, S., & Mason, C. (2026). Generative AI in human-AI collaboration: Validation of the Collaborative AI Literacy and Collaborative AI Metacognition Scales for effective use. International Journal of Human–Computer Interaction, 42(7), 5084–5108. https://doi.org/10.1080/10447318.2025.2543997

Sparrow, B., Liu, J., & Wegner, D. M. (2011). Google effects on memory: Cognitive consequences of having information at our fingertips. Science, 333(6043), 776–778. https://doi.org/10.1126/science.1207745

Sterne, J. A. C., Savović, J., Page, M. J., Elbers, R. G., Blencowe, N. S., Boutron, I., Cates, C. J., Cheng, H.-Y., Corbett, M. S., Eldridge, S. M., Emberson, J. R., Hernán, M. A., Hopewell, S., Hróbjartsson, A., Junqueira, D. R., Jüni, P., Kirkham, J. J., Lasserson, T., Li, T., … Higgins, J. P. T. (2019). RoB 2: A revised tool for assessing risk of bias in randomised trials. BMJ, 366, l4898. https://doi.org/10.1136/bmj.l4898

Thurn, C. M., Edelsbrunner, P. A., Berkowitz, M., Deiglmayr, A., & Schalk, L. (2023). Comment on the role of cognitive engagement in learning. npj Science of Learning, 8(1), Article 49. https://doi.org/10.1038/s41539-023-00200-y

Vaccaro, M., Almaatouq, A., & Malone, T. (2024). When combinations of humans and AI are useful: A systematic review and meta-analysis. Nature Human Behaviour, 8(12), 2293–2303. https://doi.org/10.1038/s41562-024-02024-1

Vaidis, D. C., & Bran, A. (2019). Some prior considerations about dissonance to understand its reduction: Comment on McGrath (2017). Frontiers in Psychology, 10, Article 1189. https://doi.org/10.3389/fpsyg.2019.01189

Vasconcelos, H., Jörke, M., Grunde-McLaughlin, M., Gerstenberg, T., Bernstein, M. S., & Krishna, R. (2023). Explanations can reduce overreliance on AI systems during decision-making. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1), Article 129. https://doi.org/10.1145/3579605

Vered, M., Livni, T., Howe, P. D. L., Miller, T., & Sonenberg, L. (2023). The effects of explanations on automation bias. Artificial Intelligence, 322, Article 103952. https://doi.org/10.1016/j.artint.2023.103952

Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Harvard University Press.

Wekerle, C., Daumiller, M., Janke, S., Dickhäuser, O., Dresel, M., & Kollar, I. (2024). Using digital technology to promote higher education learning: The importance of different learning activities and their relations to learning outcomes. Scientific Reports, 14(1), Article 16295. https://doi.org/10.1038/s41598-024-65961-x

Zhai, C., Wibowo, S., & Li, L. D. (2024). The effects of over-reliance on AI dialogue systems on students’ cognitive abilities: A systematic review. Smart Learning Environments, 11(1), Article 28. https://doi.org/10.1186/s40561-024-00316-7

Appendix A. Institutional Standards Referenced

Hoffmann, T. C., Glasziou, P. P., Boutron, I., Milne, R., Perera, R., Moher, D., Altman, D. G., Barbour, V., Macdonald, H., Johnston, M., Lamb, S. E., Dixon-Woods, M., McCulloch, P., Wyatt, J. C., Chan, A.-W., & Michie, S. (2014). Better reporting of interventions: Template for intervention description and replication (TIDieR) checklist and guide. BMJ, 348, g1687. https://doi.org/10.1136/bmj.g1687

Hopewell, S., Chan, A.-W., Collins, G. S., Hróbjartsson, A., Moher, D., Schulz, K. F., Tunn, R., Aggarwal, R., Berkwits, M., Berlin, J. A., Bhandari, N., Butcher, N. J., Campbell, M. K., Chidebe, R. C. W., Elbourne, D., Farmer, A., Fergusson, D. A., Golub, R. M., Goodman, S. N., … Boutron, I. (2025). CONSORT 2025 statement: Updated guideline for reporting randomised trials. BMJ, 389, e081123. https://doi.org/10.1136/bmj-2024-081123

Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J., Denniston, A. K., & SPIRIT-AI and CONSORT-AI Working Group. (2020). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine, 26(9), 1364–1374. https://doi.org/10.1038/s41591-020-1034-x

Skivington, K., Matthews, L., Simpson, S. A., Craig, P., Baird, J., Blazeby, J. M., Boyd, K. A., Craig, N., French, D. P., McIntosh, E., Petticrew, M., Rycroft-Malone, J., White, M., & Moore, L. (2021). A new framework for developing and evaluating complex interventions: Update of Medical Research Council guidance. BMJ, 374, n2061. https://doi.org/10.1136/bmj.n2061

Sterne, J. A. C., Savović, J., Page, M. J., Elbers, R. G., Blencowe, N. S., Boutron, I., Cates, C. J., Cheng, H.-Y., Corbett, M. S., Eldridge, S. M., Emberson, J. R., Hernán, M. A., Hopewell, S., Hróbjartsson, A., Junqueira, D. R., Jüni, P., Kirkham, J. J., Lasserson, T., Li, T., … Higgins, J. P. T. (2019). RoB 2: A revised tool for assessing risk of bias in randomised trials. BMJ, 366, l4898. https://doi.org/10.1136/bmj.l4898

Appendix B. Author’s Four-Instance HEQ Practitioner Record

This appendix documents the author’s four-instance HEQ practitioner record referenced in Section 8.3. The record is presented for transparency rather than as evidence of construct validity, reliability, or generalizability of the instrument.

The author administered HEQ to the author’s own work across four observation periods over six months. AIS composite scores were recorded at each instance along with the four sub-dimension breakdowns (CAS, EAI, CIQ, AGR). Across the four observations, AIS scores clustered in a narrow band, with movement above and below an apparent threshold around the 79 to 80 range that appeared to separate observations made under favorable conditions (sufficient time, low external pressure, available source materials) from observations made under stress (compressed timeline, high external pressure, partial source access).

The 79 to 80 band-boundary observation is a single-user pattern that the author reports for replication rather than as a calibrated threshold. The pattern may reflect a genuine instrument property, an artifact of single-user scoring drift, an artifact of the specific work types observed, or some combination. The validation design in Section 8.4 is structured to test whether any band boundary appears across multiple users in independent administration. The observation is not used as a threshold elsewhere in this paper.

The score data itself is held by the author and available for inspection by researchers conducting validation work. The data is not published in this paper because it derives from a single observer scoring the observer’s own work, and publication of the specific scores in the absence of independent rating would not contribute to the field’s evaluation of the instrument.

#AIassisted under the HAIA ecosystem, RECCLIN & CAIPR, with Checkpoint-Based Governance by Basil Puglisi; all rights reserved; contact me@basilpuglisi for data access.

FAQ

Does AI use cause cognitive decline?

The peer-reviewed evidence does not yet support the causal claim. The studies cited as evidence are correlational, self-report, small sample, preprint, or reviews of a thin evidence base. None satisfies the design requirements adjacent fields apply to causal cognitive evaluation, including a no-AI condition, a default-AI condition, a structured-AI condition, treatment fidelity, and delayed transfer testing.

What is structured human-governed AI use?

AI interaction in which the human assigns a cognitive role, verifies sources, preserves dissent, connects claims to tactics and measurable outcomes, and makes the final arbitration decision before any output is accepted. The pattern enforces metacognitive monitoring, source discipline, and dissent capture that default single-platform AI use does not require of the user.

What is HAIA-RECCLIN Reasoning?

A single-platform structured cognitive interaction architecture organized around seven cognitive role functions, a defined work-product chain of source verification with dissent capture and Fact-Tactic-KPI integration, and human-arbiter checkpoints at session entry and exit. The seven roles are Researcher, Editor, Coder, Calculator, Liaison, Ideator, and Navigator.
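As a reading aid only, the sketch below encodes the seven role functions and the entry and exit checkpoints as a minimal data structure. The class names, field names, and acceptance rule are illustrative assumptions, not a published specification of the architecture.

```python
from enum import Enum
from dataclasses import dataclass, field

class Role(Enum):
    # The seven HAIA-RECCLIN cognitive role functions named in the paper.
    RESEARCHER = "Researcher"
    EDITOR = "Editor"
    CODER = "Coder"
    CALCULATOR = "Calculator"
    LIAISON = "Liaison"
    IDEATOR = "Ideator"
    NAVIGATOR = "Navigator"

@dataclass
class Session:
    """Illustrative session record with human-arbiter checkpoints at entry and exit."""
    active_role: Role
    entry_checkpoint_passed: bool = False  # human confirms scope and sources at session entry
    dissent_log: list[str] = field(default_factory=list)  # dissenting or conflicting findings preserved
    exit_checkpoint_passed: bool = False   # human arbitration before any output is accepted

    def accept_output(self) -> bool:
        # Output is accepted only if both checkpoints were exercised by the human arbiter.
        return self.entry_checkpoint_passed and self.exit_checkpoint_passed
```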

What is the Human Enhancement Quotient (HEQ) and the Augmented Intelligence Score (AIS)?

HEQ is a four-dimension behavioral instrument decomposing augmented intelligence into Cognitive Agility Speed, Ethical Alignment Index, Collaborative Intelligence Quotient, and Adaptive Growth Rate. AIS is the arithmetic-mean composite of the four dimension scores. Each dimension uses a five-band score structure; the apparent boundary at the 79 to 80 transition is a single-user practitioner observation reported in Appendix B for replication, not a calibrated threshold.

What does the nine-element audit rubric evaluate?

The rubric scores any AI cognition study against nine elements drawn from CONSORT 2025, Cochrane RoB 2, Medical Research Council process evaluation guidance, and cognitive intervention methods consensus. The nine elements are defined intervention, no-AI comparison, general AI comparison, structured AI comparison, treatment fidelity verification, outcome hierarchy beyond immediate output, process evaluation, dissent or error exposure, and predefined cognitive KPI.
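A minimal sketch of how the rubric could be applied appears below. The element strings follow the list above; the boolean scoring, dictionary input format, and unweighted count are assumptions for illustration, since the paper does not prescribe a weighting scheme.

```python
# The nine audit elements, as listed in the answer above.
AUDIT_ELEMENTS = [
    "defined intervention",
    "no-AI comparison",
    "general AI comparison",
    "structured AI comparison",
    "treatment fidelity verification",
    "outcome hierarchy beyond immediate output",
    "process evaluation",
    "dissent or error exposure",
    "predefined cognitive KPI",
]

def audit_score(study: dict[str, bool]) -> int:
    """Count how many of the nine elements a study satisfies (unweighted, illustrative)."""
    return sum(study.get(element, False) for element in AUDIT_ELEMENTS)

# Hypothetical example: a study with a defined intervention but no predefined cognitive KPI.
example_study = {"defined intervention": True, "predefined cognitive KPI": False}
print(f"Elements satisfied: {audit_score(example_study)} of {len(AUDIT_ELEMENTS)}")
```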

Why does the field need a five-arm randomized controlled trial?

A three-arm minimum cannot separate the contribution of structured prompting alone from the contribution of full governance, nor distinguish either from default AI use, no AI use, or a parallel traditional method. The five-arm design adds a parallel traditional method comparator and separates structured prompting alone from full governance, which lets researchers identify which component produces the cognitive effect rather than aggregating the contributions into a single AI condition.
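The allocation logic below is a minimal sketch of the five-arm structure described above. The arm labels follow the paper; the simple seeded randomization and sample handling are generic illustrative choices, not the trial protocol, which would specify blocked or stratified allocation.

```python
import random

# The five arms named in the validation design.
ARMS = [
    "no AI",
    "default AI",
    "structured prompting alone",
    "full governance",
    "parallel traditional method",
]

def allocate(participant_ids: list[str], seed: int = 42) -> dict[str, str]:
    """Evenly assign participants across the five arms after a seeded shuffle (illustrative only)."""
    rng = random.Random(seed)
    ids = participant_ids[:]
    rng.shuffle(ids)
    return {pid: ARMS[i % len(ARMS)] for i, pid in enumerate(ids)}

# Hypothetical participant identifiers for demonstration.
assignments = allocate([f"P{n:03d}" for n in range(1, 11)])
for pid, arm in sorted(assignments.items()):
    print(pid, "->", arm)
```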

What does Bastani 2025 establish and what does it leave open?

Bastani found GPT Base improved practice grades by 48 percent but reduced exam grades by 17 percent when AI access was removed, while GPT Tutor improved practice grades by 127 percent with the negative learning effects largely mitigated. The study shifts the empirical question from whether AI helps learning to what about the interaction structure determines the cognitive effect. The study does not test full governance with human-arbiter checkpoints, source verification, or dissent capture, which leaves the present paper’s target claim open.

What is the difference between Responsible AI and AI Governance in this paper?

Responsible AI describes a configuration in which the AI system is constrained by safety filters, reliability standards, and audit logs verified by automated checks. The accountability lives in the system. AI Governance describes a configuration in which a named human holds binding checkpoint authority backed by professional and personal accountability for outcomes. The named human can override any AI output and exercises that authority through documented arbitration grounded in professional values.

#AIassisted


