Alignment Newsletter #169: Collaborating with humans without human data
Recorded by Robert Miles: http://robertskmiles.com
More information about the newsletter here: https://rohinshah.com/alignment-newsletter/
YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
HIGHLIGHTS
Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We’ve previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent “expects” to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a “human-aware” model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case?
This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper).
In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps for the self-play agents to have a variety of architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well).
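To make the two-stage recipe concrete, here is a minimal Python sketch of FCP. The helper methods (train_selfplay, snapshot, train_with_partner) are hypothetical stand-ins for whatever deep RL machinery one actually uses, not the authors' implementation.

```python
import random

def fictitious_co_play(make_agent, env, n_partners=32, n_checkpoints=3,
                       selfplay_steps=300_000, fcp_rounds=10_000):
    """Minimal sketch of Fictitious Co-Play (FCP); helper methods are
    hypothetical stand-ins, not the paper's code."""
    # Stage 1: train a population of self-play agents, keeping past
    # checkpoints so the population spans a range of skill levels.
    population = []
    for _ in range(n_partners):
        agent = make_agent()
        for _ in range(n_checkpoints):
            agent.train_selfplay(env, steps=selfplay_steps)
            population.append(agent.snapshot())   # frozen partial checkpoint
        population.append(agent.snapshot())       # fully trained checkpoint

    # Stage 2: train the final agent against partners sampled from the
    # frozen population; no human data is involved at any point.
    fcp_agent = make_agent()
    for _ in range(fcp_rounds):
        partner = random.choice(population)
        fcp_agent.train_with_partner(env, partner)
    return fcp_agent
```

The key design choice is that the partner population is frozen and includes partially trained checkpoints, so the final agent must coordinate with partners of widely varying skill rather than exploiting a single co-adapted partner.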
Read more: Related paper: Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination
Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now.
Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don’t think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully “aim” your system at the true task of interest, and as a result it doesn’t perform as well as it “could have”. In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there’s still some chance it ends up rambling about something else entirely, or neglects to mention some information that it “knows” but that a human on the Internet would not have said. (See also this post (AN #141).)
I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both.
I’m generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans).
TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

Inverse Decision Modeling: Learning Interpretable Representations of Behavior (Daniel Jarrett, Alihan Hüyük et al) (summarized by Rohin): There's lots of work on learning preferences from demonstrations, which varies in how much structure it assumes about the demonstrator: for example, we might consider the demonstrator to be Boltzmann rational (AN #12) or risk sensitive, or we could try to learn their biases (AN #59). This paper proposes a framework to encompass all of these choices: the core idea is to model the demonstrator as choosing actions according to a planner; some parameters of this planner are fixed in advance to encode an assumption about the structure of the planner, while others are learned from data. This also allows them to separate beliefs, decision-making, and rewards, so that different structures can be imposed on each of them individually.
The paper provides a mathematical treatment of both the forward problem (how to compute actions in the planner given the reward, think of algorithms like value iteration) and the backward problem (how to compute the reward given demonstrations, the typical inverse reinforcement learning setting). They demonstrate the framework on a medical dataset, where they introduce a planner with parameters for flexibility of decision-making, optimism of beliefs, and adaptivity of beliefs. In this case they specify the desired reward function and then run backward inference to conclude that, with respect to this reward function, clinicians appear to be significantly less optimistic when diagnosing dementia in female and elderly patients.
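To illustrate the general shape of the forward and backward problems, here is a minimal Python sketch of the simplest instance: a Boltzmann-rational planner whose inverse temperature plays the role of a decision-making parameter, with the reward parameters fit by demonstration likelihood. This is an assumption-laden toy, not the paper's actual model of beliefs, decision-making, and rewards.

```python
import numpy as np

def boltzmann_policy(Q, beta):
    """Action distribution for a Boltzmann-rational planner.
    beta (inverse temperature) is a stand-in "flexibility" parameter:
    large beta is near-optimal, small beta is noisy."""
    z = beta * (Q - Q.max(axis=1, keepdims=True))   # stabilized logits
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def soft_value_iteration(R, P, gamma=0.95, beta=5.0, iters=200):
    """Forward problem: compute Q-values for reward R under transitions P.
    R: (S, A) reward table, P: (S, A, S) transition probabilities."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        pi = boltzmann_policy(Q, beta)
        V = (pi * Q).sum(axis=1)        # expected value under the soft policy
        Q = R + gamma * P @ V           # Bellman backup
    return Q

def demo_log_likelihood(theta, demos, features, P, beta):
    """Backward problem: log-likelihood of (state, action) demonstrations
    under a linear reward R = features @ theta and the planner above.
    theta is learned from data; beta can be fixed in advance or learned."""
    R = features @ theta                # features: (S, A, d), theta: (d,)
    Q = soft_value_iteration(R, P, beta=beta)
    pi = boltzmann_policy(Q, beta)
    return sum(np.log(pi[s, a]) for s, a in demos)
```

In the paper's framework, richer planners (with belief optimism, belief adaptivity, and so on) slot into the place of soft_value_iteration, and whichever parameters are not fixed by assumption are inferred the same way theta is here.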
Rohin's opinion: One thing to note about this paper is that it is an incredible work of scholarship; it fluently cites research across a variety of disciplines including AI safety, and provides a useful organizing framework for many such papers. If you need to do a literature review on inverse reinforcement learning, this paper is a good place to start.
Human irrationality: both bad and good for reward inference (Lawrence Chan et al) (summarized by Rohin): Last summary, we saw a framework for inverse reinforcement learning with suboptimal demonstrators. This paper instead investigates the qualitative effects of performing inverse reinforcement learning with a suboptimal demonstrator. The authors modify different parts of the Bellman equation in order to create a suite of possible suboptimal demonstrators to study. They run experiments with exact inference on random MDPs and FrozenLake, and with approximate inference on a simple autonomous driving environment, and conclude:
1. Irrationalities can be helpful for reward inference, that is, if you infer a reward from demonstrations by an irrational demonstrator (where you know the irrationality), you often learn more about the reward than if you inferred a reward from optimal demonstrations (where you know they are optimal). Conceptually, this happens because optimal demonstrations only tell you about what the best behavior is, whereas most kinds of irrationality can also tell you about preferences between suboptimal behaviors.
2. If you fail to model irrationality, your performance can be very bad, that is, if you infer a reward from demonstrations by an irrational demonstrator, but you assume that the demonstrator was Boltzmann rational, you can perform quite badly.
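As a toy illustration of how one might perturb pieces of the Bellman backup to generate systematically suboptimal demonstrators, here is a Python sketch with hypothetical "myopia", "optimism", and noise knobs; these are illustrative assumptions, not the specific modifications used in the paper.

```python
import numpy as np

def irrational_demonstrator(R, P, gamma=0.95, iters=200,
                            myopia=1.0, optimism=1.0, noise_beta=np.inf):
    """Sketch of a suboptimal demonstrator built by perturbing the Bellman
    backup (hypothetical knobs, not the paper's exact ones).
    myopia   < 1 : extra discounting (short-sighted planning)
    optimism < 1 : mixes the max over actions with the mean
    noise_beta   : finite values give a Boltzmann (noisy) action choice
    R: (S, A) rewards, P: (S, A, S) transitions."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        v_max = Q.max(axis=1)
        v_mean = Q.mean(axis=1)
        V = optimism * v_max + (1 - optimism) * v_mean   # perturbed "max"
        Q = R + myopia * gamma * P @ V                   # perturbed discount
    if np.isinf(noise_beta):
        policy = np.eye(A)[Q.argmax(axis=1)]             # deterministic argmax
    else:
        z = np.exp(noise_beta * (Q - Q.max(axis=1, keepdims=True)))
        policy = z / z.sum(axis=1, keepdims=True)        # noisy action choice
    return policy
```

The point of the first conclusion is that a demonstrator like this, if its irrationality is known, can reveal preferences over suboptimal behaviors that a perfectly optimal demonstrator never exhibits.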
Rohin's opinion: One way this paper differs from my intuitions is that it finds that assuming Boltzmann rationality performs very poorly if the demonstrator is in fact systematically suboptimal. I would have instead guessed that Boltzmann rationality would do okay -- not as well as in the case where there is no misspecification, but only a little worse than that. (That’s what I found in my paper (AN #59), and it makes intuitive sense to me.) Some hypotheses for what’s going on, which the lead author agrees are at least part of the story:
1. When assuming Boltzmann rationality, you infer a distribution over reward functions that is “close” to the correct one in terms of incentivizing the right behavior, but differs in rewards assigned to suboptimal behavior. In this case, you might get a very bad log loss (the metric used in this paper), but still have a reasonable policy that is decent at acquiring true reward (the metric used in my paper).
2. The environments we’re using may differ in some important way (for example, in the environment in my paper, it is primarily important to identify the goal, which might be much easier to do than inferring the right behavior or reward in the autonomous driving environment used in this paper).
FORECASTING

Forecasting progress in language models (Matthew Barnett) (summarized by Sudhanshu): This post aims to forecast when a "human-level language model" may be created. To build up to this, the author swiftly covers basic concepts from information theory and natural language processing such as entropy, N-gram models, modern LMs, and perplexity. Perplexity data from recent state-of-the-art models is collected and used to estimate, by linear regression, when we can expect future models to score below certain entropy levels, approaching the hypothesised entropy of the English language.
These predictions range across the next 15 years, depending on which dataset, method, and entropy level is being solved for; there's an attached Python notebook with these details for curious readers to investigate further. Preemptively disjunctive, the author concludes that "either current trends will break down soon, or human-level language models will likely arrive in the next decade or two."
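The extrapolation itself is simple enough to sketch in a few lines of Python; the numbers below are placeholders, not the post's actual data or entropy estimate.

```python
import numpy as np

# Placeholder (year, per-token cross-entropy in bits) points; illustrative
# values only, not the post's dataset.
years = np.array([2017.0, 2018.5, 2019.5, 2020.5, 2021.5])
bits_per_token = np.array([1.15, 1.10, 1.04, 0.99, 0.95])

# Fit a linear trend of entropy against time, as in the post's methodology.
slope, intercept = np.polyfit(years, bits_per_token, deg=1)

# Extrapolate to the year at which the trend crosses a hypothesised entropy
# floor for English (again a placeholder value).
target_entropy = 0.6
crossing_year = (target_entropy - intercept) / slope
print(f"Trend crosses {target_entropy} bits/token around {crossing_year:.0f}")
```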
Sudhanshu's opinion: This quick read provides a natural, accessible analysis stemming from recent results, while staying self-aware (and informing readers) of potential improvements. The comments section too includes some interesting debates, e.g. about the Goodhart-ability of the Perplexity metric.
I personally felt these estimates were broadly in line with my own intuitions. I would go so far as to say that with the confluence of improved generation capabilities across text, speech/audio, video, as well as multimodal consistency and integration, virtually any kind of content we see ~10 years from now will be algorithmically generated and indistinguishable from the work of human professionals.
Rohin's opinion: I would generally adopt forecasts produced by this sort of method as my own, perhaps making them a bit longer as I expect the quickly growing compute trend to slow down. Note however that this is a forecast for human-level language models, not transformative AI; I would expect these to be quite different and would predict that transformative AI comes significantly later.
MISCELLANEOUS (ALIGNMENT)

Rohin Shah on the State of AGI Safety Research in 2021 (Lucas Perry and Rohin Shah) (summarized by Rohin): As in previous years (AN #54), on this FLI podcast I talk about the state of the field. Relative to previous years, this podcast is a bit more introductory, and focuses a bit more on what I find interesting rather than what the field as a whole would consider interesting.
Read more: Transcript
NEAR-TERM CONCERNS

RECOMMENDER SYSTEMS

User Tampering in Reinforcement Learning Recommender Systems (Charles Evans et al) (summarized by Zach): Large-scale recommender systems have emerged as a way to filter through large pools of content to identify and recommend content to users. However, these advances have led to social and ethical concerns over the use of recommender systems in applications. This paper focuses on the potential for social manipulability and polarization from the use of RL-based recommender systems. In particular, they present evidence that such recommender systems have an instrumental goal to engage in user tampering by polarizing users early on in an attempt to make later predictions easier.
To formalize the problem the authors introduce a causal model. Essentially, they note that predicting user preferences requires an exogenous variable, a non-observable variable, that models click-through rates. They then introduce a notion of instrumental goal that models the general behavior of RL-based algorithms over a set of potential tasks. The authors argue that such algorithms will have an instrumental goal to influence the exogenous/preference variables whenever user opinions are malleable. This ultimately introduces a risk for preference manipulation.
The authors' hypothesis is tested using a simple media recommendation problem. They model the exogenous variable as either leftist, centrist, or right-wing. User preferences are malleable in the sense that showing a user content from an opposing side polarizes their initial preferences. In experiments, the authors show that a standard Q-learning algorithm learns to tamper with user preferences, which increases polarization in both leftist and right-wing populations. Moreover, even though the agent makes use of tampering, it fails to outperform a crude baseline policy that avoids tampering.
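Here is a toy Python sketch of the kind of malleable-preference environment described above; the dynamics and probabilities are illustrative assumptions, not the paper's exact setup.

```python
import random

PREFS = ["left", "centre", "right"]

class MediaUser:
    """Toy user with a latent political preference (the exogenous variable)
    that becomes more polarized when shown opposing content."""
    def __init__(self):
        self.pref = random.choice(PREFS)   # latent, not observed by the agent
        self.polarisation = 0.0            # 0 = moderate, 1 = fully polarized

    def step(self, shown):
        # Malleability: opposing (non-centre) content pushes the user
        # toward an extreme.
        if shown != self.pref and shown != "centre":
            self.polarisation = min(1.0, self.polarisation + 0.1)
        # Polarized users click same-side content very reliably, which is
        # what makes later predictions "easier" for the recommender.
        if shown == self.pref:
            click_prob = 0.5 + 0.5 * self.polarisation
        else:
            click_prob = 0.3 * (1.0 - self.polarisation)
        return 1.0 if random.random() < click_prob else 0.0   # reward = click
```

A click-maximizing Q-learning agent in an environment like this has an incentive to show opposing content early, since a polarized user's clicks are easier to predict later; that is the tampering behavior the paper demonstrates.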
Zach's opinion: This article is interesting because it formalizes and experimentally demonstrates an intuitive concern many have regarding recommender systems. I also found the formalization of instrumental goals to be of independent interest. The most surprising result was that agents that exploit tampering are not particularly more effective than policies that avoid tampering. This suggests that the instrumental incentive is not really pointing at what is actually optimal, which I found to be an illuminating distinction.
NEWS

OpenAI hiring Software Engineer, Alignment (summarized by Rohin): Exactly what it sounds like: OpenAI is hiring a software engineer to work with the Alignment team.
BERI hiring ML Software Engineer (Sawyer Bernath) (summarized by Rohin): BERI is hiring a remote ML Engineer as part of their collaboration with the Autonomous Learning Lab at UMass Amherst. The goal is to create a software library that enables easy deployment of the ALL's Seldonian algorithm framework for safe and aligned AI.
AI Safety Needs Great Engineers (Andy Jones) (summarized by Rohin): If the previous two roles weren't enough to convince you, this post explicitly argues that a lot of AI safety work is bottlenecked on good engineers, and encourages people to apply to such roles.
AI Safety Camp Virtual 2022 (summarized by Rohin): Applications are open for this remote research program, where people from various disciplines come together to research an open problem under the mentorship of an established AI-alignment researcher. Deadline to apply is December 1st.
Political Economy of Reinforcement Learning schedule (summarized by Rohin): The date for the PERLS workshop (AN #159) at NeurIPS has been set for December 14, and the schedule and speaker list are now available on the website.
FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles (http://robertskmiles.com). Subscribe here: