Sensor Essay

Learning to Select Robotic Grasps Using
Vision on the Stanford Artificial Intelligence
Robot
Lawson Wong1
Grasping is an essential ability for manipulation; for robots such as the Stanford Artificial
Intelligence Robot (STAIR) to be resourceful in real-world environments, they must know
how to grasp. While this is a well-studied problem in the case when a full 3-D model of
the target object is known, it is difficult for real-world scenarios, where the robot must rely
on imperfect perception to model the scenario. This paper presents a novel approach for
grasping that only uses local 3-D information acquired from sensors. Given data of the
environment from 3-D sensors, our

The ability to grasp is crucial;
if we were unable to grasp with our
hands, we would find it very difficult
to perform essential tasks such as
eating, and more complex actions such
as cooking and working in an office
would definitely be unachievable. A
robust and infallible grasping system
is therefore necessary for STAIR to
achieve its goal.

1Stanford University

In this paper, a novel approach for
robotic grasping will be discussed. By
considering information acquired from
our 3-D visual sensors, we developed
a reliable and efficient grasping system
for STAIR that works in unknown and
cluttered environments.


The problem of robotic
grasping has been well
studied over the past few decades. The
conventional approach uses the forces
applied by the fingers on the object
at their contact points to determine
whether a stable grasp can be achieved1.
While in theory this fully determines
the result of the grasp, this approach is
not practical because a complete and
precise model of the target object is
necessary. If the model were inaccurate,
force computations would likely be
incorrect. When working in unknown
and dynamic real-world environments,
STAIR can only acquire a model of the
environment through visual perception,
which is subject to inaccuracies and
incompleteness. In practice, applying
force computations directly on these
models leads to poor results.
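
To make the contact-force reasoning above concrete, the following is a minimal sketch of one classical planar test, not the method of the cited work or of this paper. It assumes two point contacts with Coulomb friction, for which a grasp is in force closure when the line joining the contacts lies strictly inside both friction cones. The function names, the friction coefficient `mu`, and the example geometry are all hypothetical and chosen only for illustration.

```python
import math


def in_friction_cone(direction, normal, mu):
    """Return True if `direction` lies strictly inside the 2-D friction
    cone centred on the inward surface `normal` (half-angle atan(mu))."""
    dx, dy = direction
    nx, ny = normal
    d_len = math.hypot(dx, dy)
    n_len = math.hypot(nx, ny)
    if d_len == 0.0 or n_len == 0.0:
        return False  # degenerate input: no meaningful direction
    cos_angle = (dx * nx + dy * ny) / (d_len * n_len)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.acos(cos_angle) < math.atan(mu)


def two_finger_force_closure(p1, n1, p2, n2, mu):
    """Planar two-contact test: the segment joining the contacts must lie
    inside the friction cone at each contact. n1 and n2 are the inward
    surface normals at contact points p1 and p2."""
    line_12 = (p2[0] - p1[0], p2[1] - p1[1])
    line_21 = (-line_12[0], -line_12[1])
    return (in_friction_cone(line_12, n1, mu)
            and in_friction_cone(line_21, n2, mu))


def rotated_inward_normal(angle_deg):
    """Hypothetical helper: the normal (-1, 0) tilted by `angle_deg`,
    standing in for an error in the perceived surface orientation."""
    a = math.radians(angle_deg)
    return (-math.cos(a), math.sin(a))


# Assumed example: fingers on opposite sides of an object, mu = 0.5.
MU = 0.5
P1, N1 = (-1.0, 0.0), (1.0, 0.0)
P2 = (1.0, 0.0)

exact_model = two_finger_force_closure(P1, N1, P2, rotated_inward_normal(0), MU)
small_error = two_finger_force_closure(P1, N1, P2, rotated_inward_normal(20), MU)
large_error = two_finger_force_closure(P1, N1, P2, rotated_inward_normal(30), MU)
```

With `mu = 0.5` the cone half-angle is about 26.6 degrees, so under these assumptions the test passes for the exact model and for a 20-degree normal error but fails at 30 degrees. This illustrates why the outcome of a force-based analysis can flip when the perceived surface geometry is wrong.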

Figure 1: STAIR grasping from a very cluttered environment.

Figure 2: STAIR. 7-dof Barrett WAM Arm and
4-dof 3-fingered BarrettHand with “open” spread
pictured. The spread can be “closed” such that
all 3 fingers will be at the top. Vision system
mounted on robot frame; blue arrow marks
SwissRanger, green arrow marks Bumblebee2.

Figure 3: Imperfect perception. Original bowl, and the point cloud obtained via vision (shown in
simulation). Red points come from Bumblebee2, gray points from SwissRanger. Only some edges
are picked up by the Bumblebee2, and neither the bowl surface nor the table is seen. The SwissRanger
gives a much more complete bowl front face and table, but no other side of the bowl is seen.
Interestingly, the two cameras complement each other in this scenario; however, the perception of the
bowl is still far from complete.

The limitations imposed by
perception have spurred interest over
the past two decades in vision-based
grasping systems. In particular, it has
been found that perception of 2-D planar
objects usually suffers from fewer
problems. For such objects, the object
surface contour can be found reliably
from vision. Criteria for successful
grasps, derived from the theoretical
force computations mentioned above, can
then be found for the object2,3. A similar
approach was used by Kamon, Flash,
and Edelman, where features indicative
of successful grasps were computed
given a 2-D image of the object4. A
learnt model then used these features
to compute an...

