2 February 2022

The First Two Miles: Building Research-Ready Programming Skills in the Classroom

Imad Pasha, Yale University

In the shipping business, the “last mile” problem refers to how a package gets from the final distribution center (as part of a controlled supply chain) to the actual front door of a home. With coding in astronomy, I often see a sort of “first mile” problem — students enter astronomy with passion and enthusiasm, and quickly need to learn a lot of basic (and then highly specialized) programming skills before they’re ready to carry out novel research.

It has become apparent that programming skills are critical to the development of new astronomers. As programming literacy has increased within our field over the last several decades, astrophysics papers have come to regularly employ either large-scale, complex simulations or complex statistical frameworks such as hierarchical Bayesian modeling and machine learning, all of which require an advanced level of coding acumen. These works are increasingly being helmed by early career scientists (postdocs, graduate students, and even undergraduates), meaning basic programming skills must be conferred at earlier and earlier stages in a student’s academic career.

This “moving standard” mirrors other trends within our field. As the number of applicants to PhD programs increases, undergraduates are expected more and more to have prior research experience (among other things) to be considered among the most competitive applicants, where just a generation ago, some graduate students entered with little to no prior coding experience. Despite this increasing need, programming, the core skill set underpinning modern astronomy and astrophysics research, is not yet a universal offering within undergraduate astronomy major curricula. It is instead deemed primarily an extracurricular, and students are expected to learn this critical skill from a patchwork of resources ranging from websites like Codecademy, to summer internships and REUs (National Science Foundation-funded research experience programs), to university computer science departments, or simply on their own via independent research. If astronomy departments wish to look to the future, the need for more uniform programming training, as early as possible, is clear.

It’s also worth noting that many of the places students have been (to date) expected to learn coding outside of the astronomy classroom are ultimately drivers of inequity in our field. They represent a limited resource to which not all students have access. Offering courses within the astronomy major of a given program gives all students an equal playing field to develop career-necessary skills.

Many astronomy programs have recognized the importance of teaching “research-grade coding” at the undergraduate level, and several have implemented programs or courses that work toward this goal (for example, the Pre-Major in Astronomy Program at the University of Washington and Phys 104 at Haverford College). But the adoption of full-semester, programming-first courses has been slow and is, at present, far from widespread.

I have taught a version of a programming course (in Python) since 2015. At UC Berkeley, I created and co-taught an introductory course for first- and second-year undergraduate students, designed to be a true introduction to programming. At Yale University, I have created and co-taught a course aimed at those with introductory Python under their belt (generally third- and fourth-year students). Below, I discuss how this field has changed since then and share some useful tips and lessons I learned during the most recent iteration.

Building Resources

One area that has improved dramatically even from the time I first taught this Python course in 2015 to today is the availability of not only high quality Python resources, but those focused specifically on astrophysics. In 2015, python4astronomers was one of the few astronomy-focused Python resources, and it was not a comprehensive introduction for beginners. In fact, astronomy-focused introductory materials were so sparse that I wrote my own introductory textbook in order to facilitate my class. Even though my book is an incomplete introduction, it has remained a popular resource. I host it online (free and open source), along with associated tutorials, and it has been viewed roughly 20,000 times in the past year.

I am happy to say that I am now in the process of writing a revised and significantly expanded version of that early book. I am even happier to say that the community has not been stagnant in the intervening years. Many astronomers and data scientists have taken the time to make high quality resources freely available to the public. Great examples include these SDSS labs, this data visualization book (for Matplotlib), and Jake VanderPlas’s Python Data Science Handbook.

Into the Classroom

While it is amazing that there is now a slew of resources available to early career students seeking to gain a solid foundation in astronomical coding practices, it remains the case that these resources are largely scattered across the web (I myself maintain an informal list of many shared resources as Twitter bookmarks). To use the words of a friend and faculty member, onboarding new students into research often involves “a hodge-podge of online and physical resources that students find via Google or that instructors hack together.” In short, the resources are there, but students are still left to their own devices vis-à-vis actually compiling them into a coherent educational tool.

A primary question, then, is how do we bring these myriad resources into the classroom itself, in a way that is pedagogically sound and that produces lasting, positive outcomes for students’ programming journeys.

This is exactly what I attempted this year, in a brand new course (Astro 330; an upper level course) co-created and co-taught with Marla Geha at Yale. We built the course from the ground up as a post-introductory, Python-focused methods class. This is why this blog post is titled “The First Two Miles”: the first mile is the absolute, introductory basics of Python, and the second mile is the set of Python skills one would need to, say, carry out an undergraduate (or early graduate) level research project to completion. I am interested in both of these “miles” and the gap that often exists between them. For this particular blog post, I’d like to focus on my experience with this course (teaching the second mile), and the lessons we learned along the way. I hope these heuristics and the resources we created can be of use in the teaching of other similarly focused classes.

Lesson 1: Don’t try to shoehorn an existing course; write a new course and make it “programming-first”

Our first lesson was learned right away, in the planning stages of even applying to create this course. I recognize that building (and getting approval for) new courses is a pain, and a daunting task at that. But the set of topics needing coverage, pedagogically, to get students from mile one to mile two is simply not coverable in a course already designed to have astronomy-based goals. Unlike a typical syllabus for an astronomy course that involves some programming, ours listed programming topics as its core topics, e.g.,

  • functional programming,
  • classes and object-oriented programming,
  • fitting methods,
  • visualization and Matplotlib,
  • pandas, and
  • software development.

We used astronomical data and examples as the vehicles for these core topics, but did not attempt to make students learn a lot of new astronomy and programming at the same time. In this sense, the underlying astronomy being used was what would be reasonably covered in introductory coursework. As we worked our way through the semester, it became clear we would’ve had no chance at achieving our learning goals on the programming side if we had chosen an astronomy-concept-first approach.

Lesson 2: Use real data

I mentioned our programming-first approach above, and how we thus chose (relatively) simple astronomical examples. To be specific, tasks like measuring the flux of stars in an image or shifting and matching a spectrum to get a Doppler shift did not require much new conceptual science knowledge for our students. This let us focus on the actual goals, e.g., writing a peak-finding and centroiding function in Python (to measure the stars) or comparing the usage of chi-squared vs. Bayesian fitting (to fit the spectrum).
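As a rough illustration of the kind of peak-finding and centroiding function mentioned above, here is a minimal sketch. It is not the course's solution: the function names `find_peak` and `centroid`, the square-window approach, and the `half_width` parameter are all assumptions made for this example, and the synthetic Gaussian "star" is only a stand-in for a real image.

```python
import numpy as np


def find_peak(image):
    """Return the (row, col) index of the brightest pixel in a 2D image."""
    return np.unravel_index(np.argmax(image), image.shape)


def centroid(image, peak, half_width=3):
    """Flux-weighted centroid in a square window around `peak`.

    `half_width` (a hypothetical parameter) sets the window size; the
    window is clipped at the image edges.
    """
    r0, c0 = peak
    r_lo, r_hi = max(r0 - half_width, 0), min(r0 + half_width + 1, image.shape[0])
    c_lo, c_hi = max(c0 - half_width, 0), min(c0 + half_width + 1, image.shape[1])
    window = image[r_lo:r_hi, c_lo:c_hi]
    # Coordinate grids covering the same pixels as the window.
    rows, cols = np.mgrid[r_lo:r_hi, c_lo:c_hi]
    total = window.sum()
    return (rows * window).sum() / total, (cols * window).sum() / total


# Synthetic stand-in for a star: a 2D Gaussian centered off the pixel grid.
yy, xx = np.mgrid[0:25, 0:25]
star = np.exp(-((yy - 12.3) ** 2 + (xx - 8.7) ** 2) / (2 * 1.5 ** 2))

peak = find_peak(star)
print(peak, centroid(star, peak, half_width=5))
```

A sketch like this assumes a single, isolated source and no sky background. A real image would typically need the background subtracted first, and a way to handle multiple stars and bad pixels.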

That said, at a fundamental level, the problems that arise in writing such algorithms in any programming language are often results of the data itself. Canned examples with known answers, or data that has been previously pruned and cleaned by the instructors, move the learning significantly away from the ins and outs of actual research programming: for example, the error messages one often gets, or the ways in which data are often riddled with issues.

We made a commitment to use only real, “certified-untreated” data in our course. This caused its fair share of headaches, but I am convinced now that this was the right choice.

An unexpected side effect of this choice was that the students’ engagement with the material and their willingness to power through the bugs in their code and the inherent challenge set by the assignments were strengthened, I believe, by the knowledge that they were working with real data. Real data is messy, and sometimes requires imperfect and messy solutions. We were able to show our class which of these were OK (in a pinch) and which, for example, fundamentally change the statistics or outcome of a measurement, and should thus be avoided. We got the distinct impression that, especially for the students wanting to pursue astronomy research, knowing that their problem sets were a true analog to astronomy research (at least, observationally focused data analysis) gave additional motivation to complete and learn the content.

Lesson 3: Be flexible, be modular, be interactive

Flexibility

We learned very quickly that our students’ familiarity with the foundational Python we were building on was spotty: great in some areas, almost nonexistent in others. This is a direct consequence, in my mind, of a lack of “mile one” coursework and, I suspect, will be a common theme for anyone teaching such a course. Even in a class of 15 students, we had a variety of background experience.

Especially in a semester marked by a pandemic, changing guidelines, and general uncertainty, we were ready to be flexible with our students on most deadlines and assignments. But it turns out that building flexibility into our course content actually helped with the course’s overall learning goals.

To give a specific example, one lab involved the students figuring out several non-trivial algorithms for performing aperture photometry, and also involved them packaging this all into a Python class, a type of structure (object-oriented programming) that is often confusing for introductory Python students. Rather than leave the learning half-finished, we extended this lab by an extra week, and spent more time in class discussing the elements that were going into that assignment.
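To make the idea of packaging aperture photometry into a class concrete, here is a minimal sketch. It is an assumption-laden illustration rather than the lab's actual design: the class name `AperturePhotometry`, its method names, and the choice of a median sky estimate from an annulus are all hypothetical, and the toy image is made up.

```python
import numpy as np


class AperturePhotometry:
    """Circular-aperture photometry with a median sky estimate (illustrative)."""

    def __init__(self, image):
        self.image = np.asarray(image, dtype=float)
        # Pixel coordinate grids, computed once and reused by every method.
        self._rows, self._cols = np.indices(self.image.shape)

    def _distance_from(self, center):
        """Distance of every pixel from `center`, given as (row, col)."""
        return np.hypot(self._rows - center[0], self._cols - center[1])

    def sky_level(self, center, r_in, r_out):
        """Median pixel value in the annulus r_in <= r < r_out."""
        d = self._distance_from(center)
        return np.median(self.image[(d >= r_in) & (d < r_out)])

    def flux(self, center, radius, r_in, r_out):
        """Sky-subtracted sum of the pixels within `radius` of `center`."""
        aperture = self._distance_from(center) <= radius
        sky = self.sky_level(center, r_in, r_out)
        return self.image[aperture].sum() - sky * aperture.sum()


# Toy image: flat sky of 10 counts plus a single 100-count "star" pixel.
image = np.full((21, 21), 10.0)
image[10, 10] += 100.0

phot = AperturePhotometry(image)
# The sky-subtracted flux should recover the injected 100 counts.
print(phot.flux(center=(10, 10), radius=3, r_in=5, r_out=8))
```

Holding the image and its coordinate grids on the object lets the sky and flux methods share state, which is the usual motivation for using a class here rather than a set of loose functions.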

A second concrete example: We spent one module focused on fitting models to data, building on resources compiled by David Hogg, Dan Foreman-Mackey, and Dustin Lang (among others). We built slowly from handwritten chi-squared fits, to handwritten Metropolis-Hastings MCMCs, to actually using emcee to run sampling. Multiple times during this section it was necessary to offer extensions of elements of the problem sets or take an extra lecture to really cover a particular element of Bayesian fitting. Given the topic’s importance, we took the time to do this rather than rushing through. Six programming topics, well covered, are better than eight topics that are rushed.
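The progression from a handwritten chi-squared to a handwritten Metropolis-Hastings sampler can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course material: it assumes a straight-line model, Gaussian errors, flat priors, and a Gaussian proposal, and the function names, step sizes, and synthetic data are all invented for the example.

```python
import numpy as np


def chi_squared(theta, x, y, sigma):
    """Chi-squared of a straight line y = m*x + b against the data."""
    m, b = theta
    return np.sum(((y - (m * x + b)) / sigma) ** 2)


def log_posterior(theta, x, y, sigma):
    """With flat priors (an assumption), log posterior = -chi^2 / 2 + const."""
    return -0.5 * chi_squared(theta, x, y, sigma)


def metropolis_hastings(log_prob, start, step, n_steps, rng, args=()):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    current = np.asarray(start, dtype=float)
    current_lp = log_prob(current, *args)
    chain = np.empty((n_steps, current.size))
    n_accept = 0
    for i in range(n_steps):
        proposal = current + step * rng.standard_normal(current.size)
        proposal_lp = log_prob(proposal, *args)
        # Accept with probability min(1, p(proposal) / p(current)).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            n_accept += 1
        chain[i] = current
    return chain, n_accept / n_steps


# Synthetic data (illustrative only): a line with slope 2 and intercept 1.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
sigma = 0.5
y = 2.0 * x + 1.0 + sigma * rng.standard_normal(x.size)

chain, accept_frac = metropolis_hastings(
    log_posterior, start=[1.0, 0.0], step=np.array([0.02, 0.1]),
    n_steps=20000, rng=rng, args=(x, y, sigma))
samples = chain[2000:]  # discard the early steps as burn-in
print(samples.mean(axis=0), samples.std(axis=0), accept_frac)
```

Writing the log-probability as a function of a parameter vector plus extra arguments is a deliberate choice here: as far as I am aware, it is the same general shape that emcee expects, so it should carry over to a library sampler with little change.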

Modularity

In the early part of the course, the labs built very heavily on one another, using concepts and even code from previous labs in order to build the next. I would advise those writing similar courses to try to enforce a bit more modularity and independence between the assignments. We found that anyone struggling with a particular concept or element would fall further behind or grow increasingly frustrated as time went on. The less each problem set requires a priori “perfect” knowledge of the previous lab, the more jumping-on points a student has to re-engage and learn the new material.

Interactivity

It’s been shown time and time again that the traditional lecture model does not lead to much synthesis or even knowledge retention. Coding especially is learned by doing, so wherever possible, lectures involved interactive demos or think-pair-shares. In one lecture, everyone in the class wrote and installed a Python package together. Building on this, students got a full class period each week to work on their labs (which were also their homework). During this hour they worked with those around them and could ask questions freely of the instructors (we walked around the room). This gave us the ability to also see and correct incorrect mental models or bad programming practices, right at the source, before they had time to calcify.

Additionally, every week, we took the time in class to cover our instructor solutions to each lab (often, Marla and I had differing solutions). Given the number of ways to correctly (and incorrectly) code any particular solution, we found that it was important to stop and discuss with some depth why the instructors’ solution looked a certain way, and to take questions on anything the students might find confusing in our implementations.

Lesson 4: Allow creativity

Finally, many people will tell you that the best way to learn programming is to pick a project and then do it. I believe there’s a lot of truth to this; putting together the pieces of a larger whole, looking things up along the way — it’s how many of us learned to code. Our goal was to replicate this principle as much as possible in our course. Individual problem sets were structured like realistic research tasks, and students had significant freedom in how they chose to code their solutions.

Most of all, their final projects — for which we asked students to write fully-fledged, installable, GitHub-hosted Python packages — demonstrated this effect in action. We did not require students to choose astronomy themes for their projects (which were completed in the last two months of the semester), and we got a slew of interesting projects both within and outside of astronomy. Ultimately, students picked topics they were excited and passionate about, and this directly resulted in them spending time on their code, and ultimately developing more as programmers. We made the final projects a hefty portion of their grade to reward this work, and we supported students through it by having a set of check-ins and code reviews in which their incremental progress was tracked and graded, and feedback was given. We were seriously impressed by many of the projects, and would encourage anyone teaching a similar course to include a major, student-driven project for this reason.

Looking to the Future

I was continuously impressed by what our students were able to accomplish, and where their programming skill was by the end of the course compared to the start. I believe courses like this one should exist in every astronomy department, whether they mirror ours in structure, or perhaps try other styles, such as structuring the course as several longer, more in-depth “projects.” Either way, the creation and running of such courses becomes more and more important, and also grows easier, since the open source community has made many resources readily available. For our part, anything on my website or our Astro 330 course website is free to use, adapt, or serve as inspiration. And if you know of or have created coding assignments that would be relevant for such a course and would like to share them, I would love to begin compiling a list that can be accessed and shared by other educators in our community, so please reach out to me at [email protected].