NeurIPS 2023 Workshop SoLaR Submissions
Testing Language Model Agents Safely in the Wild
Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
James Campbell, Phillip Guo, Richard Ren
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Predictive Minds: LLMs As Atypical Active Inference Agents
Jan Kulveit
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Breaking Physical and Linguistic Borders: Privacy-Preserving Multilingual Prompt Tuning for Low-Resource Languages
Wanru Zhao, Yihong Chen
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Spotlight
Readers: Everyone
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Rhys Gould, Euan Ong, George Ogden, Arthur Conmy
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Towards Optimal Statistical Watermarking
Baihe Huang, Banghua Zhu, Hanlin Zhu, Jason Lee, Jiantao Jiao, Michael Jordan
Published: 23 Oct 2023, Last Modified: 30 Nov 2023
SoLaR Spotlight
Readers: Everyone
Towards Publicly Accountable Frontier LLMs
Markus Anderljung, Everett Smith, Joe O'Brien, Lisa Soder, Benjamin Bucknall, Emma Bluemke, Jonas Schuett, Robert Trager, Lacey Strahm, Rumman Chowdhury
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Eliciting Language Model Behaviors using Reverse Language Models
Jacob Pfau, Alex Infanger, Abhay Sheshadri, Ayush Panda, Julian Michael, Curtis Huebner
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Spotlight
Readers: Everyone
FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs
Swanand Kadhe, Anisa Halimi, Ambrish Rawat, Nathalie Baracaldo
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Are Models Biased on Text without Gender-related Language?
Catarina Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Understanding Hidden Context in Preference Learning: Consequences for RLHF
Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
SuperHF: Supervised Iterative Learning from Human Feedback
Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Probing Explicit and Implicit Gender Bias through LLM Conditional Text Generation
Xiangjue Dong, Yibo Wang, Philip Yu, James Caverlee
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
A Simple Test of Expected Utility Theory with GPT
Mengxin Wang
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Spotlight
Readers: Everyone
Training Private and Efficient Language Models with Synthetic Data from LLMs
Da Yu, Arturs Backurs, Sivakanth Gopi, Huseyin Inan, Janardhan Kulkarni, Zinan Lin, Chulin Xie, Huishuai Zhang, Wanrong Zhang
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
MoPe: Model Perturbation-based Privacy Attacks on Language Models
Jason Wang, Jeffrey Wang, Marvin Li, Seth Neel
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Efficient Evaluation of Bias in Large Language Models through Prompt Tuning
Jacob-Junqi Tian, D. Emerson, Deval Pandya, Laleh Seyyed-Kalantari, Faiza Khattak
Published: 23 Oct 2023, Last Modified: 04 Dec 2023
SoLaR Poster
Readers: Everyone
An Archival Perspective on Pretraining Data
Meera Desai, Abigail Jacobs, Dallas Card
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Spotlight
Readers: Everyone
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Towards a Situational Awareness Benchmark for LLMs
Rudolf Laine, Alexander Meinke, Owain Evans
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Spotlight
Readers: Everyone
Large Language Model Unlearning
Yuanshun Yao, Xiaojun Xu, Yang Liu
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Post-Deployment Regulatory Oversight for General-Purpose Large Language Models
Carson Ezell, Abraham Loeb
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
Comparing Optimization Targets for Contrast-Consistent Search
Hugo Fry, Seamus Fallows, Jamie Wright, Ian Fan, Nandi Schoots
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone
FlexModel: A Framework for Interpretability of Distributed Large Language Models
Matthew Choi, Muhammad Adil Asif, John Willes, D. Emerson
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Spotlight
Readers: Everyone
AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun
Published: 23 Oct 2023, Last Modified: 28 Nov 2023
SoLaR Poster
Readers: Everyone