Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration

Shao Zhang1*, Xihuai Wang1*, Wenhao Zhang1, Chaoran Li1, Junru Song1, Tingyu Li1, Lin Qiu2, Xuezhi Cao2, Xunliang Cai2, Wen Yao3, Weinan Zhang1, Xinbing Wang1, Ying Wen1#
1 Shanghai Jiao Tong University, 2 Meituan, 3 Intelligent Game and Decision Laboratory

*Equal Contribution #Corresponding Author

How DPT-Agent collaborates with a human simultaneously in real time.

Abstract

Agents built on large language models (LLMs) excel at turn-by-turn human-AI collaboration but struggle with simultaneous tasks that require real-time interaction: latency and the difficulty of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of applying Dual Process Theory (DPT) to real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making; its System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and make reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. DPT-Agent effectively helps LLMs convert correct slow thinking and reasoning into executable actions, thereby improving performance. To the best of our knowledge, DPT-Agent is the first language agent framework to achieve successful real-time simultaneous human-AI collaboration autonomously.
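
To make the framework concrete, the two systems can be pictured as loops running at different rates: System 1 acts on every game tick through its FSM, while System 2 reflects asynchronously and rewrites the policy System 1 acts on. The Python sketch below illustrates this split under simplifying assumptions; the names (DPTAgentSketch, system2_reflect), the cooking-style states, and the 10 Hz tick rate are all hypothetical, and the rule-table patch merely stands in for the executable policy code that System 2 generates in the actual framework.

```python
# A minimal sketch of a dual-process control loop, NOT the paper's code.
import asyncio
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    FETCH = auto()
    COOK = auto()
    SERVE = auto()

class DPTAgentSketch:
    def __init__(self) -> None:
        self.state = State.IDLE
        # "Code-as-policy", simplified: System 2 may replace these transition
        # rules at runtime (the real framework generates executable code).
        self.policy = {
            State.IDLE: State.FETCH,
            State.FETCH: State.COOK,
            State.COOK: State.SERVE,
            State.SERVE: State.IDLE,
        }

    def system1_step(self) -> State:
        # Fast path: a pure FSM transition, so no LLM call sits on the
        # real-time action loop.
        self.state = self.policy[self.state]
        return self.state

    async def system2_reflect(self) -> None:
        # Slow path: ToM inference plus reflection over recent outcomes.
        # A real implementation would call an LLM here; the sleep stands
        # in for that latency.
        while True:
            await asyncio.sleep(1.0)           # simulated LLM latency
            inferred_human_goal = State.SERVE  # stand-in for ToM output
            # Asynchronous reflection: patch the policy without ever
            # blocking System 1.
            self.policy[State.IDLE] = inferred_human_goal

async def main() -> None:
    agent = DPTAgentSketch()
    reflector = asyncio.create_task(agent.system2_reflect())
    for tick in range(30):                     # the real-time game loop
        action = agent.system1_step()
        print(f"tick {tick:02d}: act={action.name}")
        await asyncio.sleep(0.1)               # ~10 Hz decision rate
    reflector.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```

The point of the split is that no LLM call ever sits on the action path: System 1 keeps acting on the last committed policy at a fixed rate, while System 2's slower ToM inference and reflection update that policy in the background.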

Visualization

DPT-Agent Demo 1

Map 1

DPT-Agent (in the red hat) collaborating with a human (in the blue hat) through division of labor.

DPT-Agent Demo 2

Map 2

DPT-Agent (in the red hat) collaborating with a human (in the blue hat) by using the central counter.

Experiment Results

ReAct

| Model | Score | Score Efficiency | Latency (s) |
|---|---|---|---|
| GPT-4o | 21.00 (7.01) | 3.08 (0.30) | 7.10 (0.29) |
| GPT-4o-mini | -28.50 (6.23) | 0.60 (0.28) | 3.06 (0.07) |
| o3-mini-low | 5.50 (5.86) | 2.51 (0.25) | 8.64 (0.27) |
| DeepSeek-V2.5-236b | -21.50 (3.56) | 1.72 (0.24) | 6.45 (0.18) |
| DeepSeek-R1-70b | -17.00 (4.32) | 1.48 (0.17) | 7.79 (0.20) |
| DeepSeek-R1-32b | -15.50 (4.51) | 1.49 (0.18) | 5.77 (0.18) |
| DeepSeek-R1-14b | -7.00 (4.94) | 2.67 (0.19) | 2.91 (0.03) |
| Llama3.3-70b | 20.00 (4.21) | 2.86 (0.16) | 5.44 (0.05) |
| Mistral-nemo-12b | -10.00 (3.31) | 2.40 (0.13) | 1.10 (0.03) |
| Mistral-small-24b | 59.50 (5.04) | 4.63 (0.20) | 2.69 (0.02) |
| Mixtral-8x22b | -5.00 (5.23) | 1.73 (0.22) | 5.56 (0.10) |
| Qwen2.5-14b | -5.00 (5.31) | 1.98 (0.21) | 1.55 (0.03) |
| Qwen2.5-32b | 10.00 (0.50) | 2.94 (0.02) | 1.93 (0.04) |
| Qwen2.5-72b | 16.50 (3.22) | 2.71 (0.09) | 4.60 (0.09) |
| QwQ-32b | 8.00 (2.77) | 2.46 (0.12) | 10.75 (0.24) |

Reflexion

| Model | Score | Score Efficiency | Latency (s) |
|---|---|---|---|
| GPT-4o | -1.50 (3.78) | 2.14 (0.17) | 7.49 (0.27) |
| GPT-4o-mini | -40.00 (2.17) | 0.00 (0.14) | 3.11 (0.08) |
| o3-mini-low | -16.50 (7.12) | 1.78 (0.26) | 8.86 (0.23) |
| DeepSeek-V2.5 | -25.56 (2.91) | 1.24 (0.18) | 7.64 (0.16) |
| DeepSeek-R1-70b | -20.00 (4.79) | 1.44 (0.19) | 7.78 (0.17) |
| DeepSeek-R1-32b | -37.50 (4.77) | 0.90 (0.21) | 7.39 (0.11) |
| DeepSeek-R1-14b | -10.50 (4.12) | 1.93 (0.22) | 4.01 (0.11) |
| Llama3.3-70b | 20.00 (4.47) | 3.25 (0.19) | 5.20 (0.06) |
| Mistral-nemo-12b | -40.00 (0.00) | 0.00 (0.00) | 1.60 (0.02) |
| Mistral-small-24b | -5.00 (3.63) | 1.43 (0.03) | 3.11 (0.05) |
| Mixtral-8x22b | 0.50 (4.33) | 2.44 (0.20) | 5.58 (0.23) |
| Qwen2.5-14b | -4.00 (4.45) | 2.44 (0.24) | 1.87 (0.05) |
| Qwen2.5-32b | -40.00 (0.00) | 0.00 (0.00) | 2.93 (0.05) |
| Qwen2.5-72b | -25.00 (2.76) | 1.47 (0.09) | 4.66 (0.05) |
| QwQ-32b | -50.00 (0.75) | 0.00 (0.11) | 7.75 (0.11) |

DPT-Agent w/o ToM

| Model | Score | Score Efficiency | Latency (s) |
|---|---|---|---|
| GPT-4o | 20.50 (5.41) | 3.05 (0.24) | 5.08 (0.15) |
| GPT-4o-mini | 21.00 (4.47) | 3.50 (0.23) | 2.13 (0.01) |
| o3-mini-low | 37.50 (4.81) | 3.68 (0.19) | 7.03 (0.28) |
| DeepSeek-V2.5 | 31.50 (3.40) | 3.40 (0.14) | 4.73 (0.11) |
| DeepSeek-R1-70b | 60.00 (4.35) | 4.19 (0.15) | 9.09 (0.26) |
| DeepSeek-R1-32b | 39.50 (7.68) | 3.35 (0.27) | 6.58 (0.25) |
| DeepSeek-R1-14b | 23.00 (5.42) | – | 3.87 (0.07) |
| Llama3.3-70b | -10.00 (6.46) | 1.82 (0.34) | 2.28 (0.10) |
| Mistral-nemo-12b | 30.00 (5.20) | 3.49 (0.21) | 1.31 (0.03) |
| Mistral-small-24b | -1.50 (3.63) | 2.05 (0.17) | 3.61 (0.31) |
| Mixtral-8x22b | 0.00 (15.00) | 2.70 (0.20) | 4.21 (0.17) |
| Qwen2.5-14b | 1.50 (4.11) | 2.68 (0.22) | 1.18 (0.02) |
| Qwen2.5-32b | 1.00 (3.83) | 2.26 (0.13) | 1.65 (0.03) |
| Qwen2.5-72b | 11.00 (4.88) | 2.66 (0.21) | 3.01 (0.12) |
| QwQ-32b | -51.00 (4.74) | 3.90 (0.15) | 14.96 (0.78) |

BibTeX

@article{zhang2025ldpt,
  title={Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration},
  author={Shao Zhang and Xihuai Wang and Wenhao Zhang and Chaoran Li and Junru Song and Tingyu Li and Lin Qiu and Xuezhi Cao and Xunliang Cai and Wen Yao and Weinan Zhang and Xinbing Wang and Ying Wen},
  year={2025},
  eprint={2502.11882},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2502.11882},
}