
LeaRning R in ChemistRy at Reed College

Author(s): 

Danielle Cass, Department of Chemistry, Reed College, Portland, OR.
Chester Ismay, DataCamp, https://www.datacamp.com/, New York, NY.

Abstract: 

In 2015, the general chemistry program at Reed decided to switch from a spreadsheet system to R, a more comprehensive program, and the RStudio interface for all data analysis and visualization.  Excel was originally used for this purpose, and its advantage was that most students had used it in high school, so there was no steep learning curve for its use in labs.  However, the switch was motivated by continued frustration with the limits of Excel and by the arrival of an R expert on the Reed staff.  In hindsight, we can add that this switch also benefits the students' futures.  This paper outlines why and how we implemented the use of R in the introductory curriculum.

Introduction

Educational ideas swing like a pendulum, and the teaching of computer skills is no different.  In 1988 a plenary speaker at the BCCE conference explained how departments should shift their computer focus from coding to software skills (1).  Yet in 2018 the opposite was suggested: an editorial in C&EN said that the future of chemistry will include directly working and talking with machines (2).  The author goes on to say, “Unfortunately, few chemists can actually code, let alone program a robot or write an algorithm” (2).  If we are trying to provide students with an education that they can take with them, a switch to R seems like a solid move, one that feeds a need in our current and future chemistry workforce.

So why switch from Excel to R?  How are they different, and what does R offer that Excel does not?  R is a programming language that was originally developed by two statisticians in New Zealand (3).  The way data analysis is executed and visualized in R is quite different from Excel.  First, everything done in R is written as code instead of through the point-and-click format of spreadsheets.  RStudio, the interface we use, shows the code and the results of the analysis separately (Figure 1A).  It is also possible to integrate the code directly into the final report (Figure 2).  This is drastically different from Excel, where the analysis is not written as code: a cell displays either the formula used or the data, and there is no record of the changes made through the point-and-click options (Figure 1B).  This leads to the second advantage of R: reproducibility.  Reproducing an analysis in Excel is often complicated and requires changing many settings by hand.  In R, however, a person can set up code that creates a scatter plot fit to a straight line, and then anyone who wants to create this type of plot can use the same code.  This is an important point for general chemistry because it allows students to begin building an analysis toolbox that they can take with them anywhere.
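
As a rough illustration of the reusable scatter-plot code described above, the following base R sketch wraps a straight-line fit and its plot in one function.  The function name scatter_with_fit and the calibration values are hypothetical, made up for this example; this is not the code shown in Figure 1 and not a function from any package.

```r
# Hypothetical helper: scatter plot with a least-squares straight line.
# Any student can reuse it by passing a data frame and two column names.
scatter_with_fit <- function(data, x, y, xlab = x, ylab = y) {
  # Build the formula y ~ x from the column names and fit the line
  fit <- lm(reformulate(x, response = y), data = data)
  plot(data[[x]], data[[y]], xlab = xlab, ylab = ylab, pch = 19)
  abline(fit)     # draw the fitted line on top of the points
  invisible(fit)  # return the model so slope and intercept can be inspected
}

# Made-up calibration data, for illustration only
calibration <- data.frame(
  concentration = c(0.0, 0.1, 0.2, 0.3, 0.4),
  absorbance    = c(0.01, 0.21, 0.39, 0.61, 0.80)
)

fit <- scatter_with_fit(calibration, "concentration", "absorbance",
                        xlab = "Concentration (M)", ylab = "Absorbance")
coef(fit)  # intercept and slope of the fitted line
```

Because every step is in the script, rerunning it with a different data frame reproduces the same style of plot without any manual formatting.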

Figure 1:  The RStudio interface compared to the Excel interface.  A.  Every step made to create the graph in R is written in the code lines 13-20 with the resulting graph shown below.  B. Excel has no record of the graphical changes and mathematical changes can only be visualized separately (bottom).

 

Figure 2: Output of RStudio into a Word document.  The code used to create the graphs can be shown in the final document as boxed areas.

 

Excel is also limited in the size of data set it can hold (a worksheet has a maximum of 1,048,576 rows), while R can handle much larger data sets, limited mainly by the computer's memory.  Although this is not relevant in most general chemistry analyses, it is important when thinking about students' futures and the kinds of analysis they are likely to do.  Finally, one important factor to remember is that R is open source and free.  Excel requires a license, which means that there is a monetary hurdle to its use.  Excel is provided free to students at Reed, but only during their time as students.  Also, because R is open source, there is a large online user community that has created packages that may be useful for specific types of analysis.

 

Introducing R

As mentioned above, a hurdle to using R for data analysis is that most students don’t start the course with any experience using the language and program.  Therefore, we needed to change the lab structure to allow for more in-class analysis time.  This is good for providing help to students with R, but it is also just a good pedagogical plan.  Most students don’t appear to struggle with the acquisition of data; what to do with that data is where they need extra help.  Yet traditionally the labs had an instructor and TA present during all of the acquisition stage and very little presence during the analysis stage.  The change to R made us rethink this strategy.  As a result, we went from sixteen to twelve labs over the entire first-year lab sequence, spreading labs across multiple weeks and using the extra time for data analysis.  This allowed us to spend more time helping students with R and their subsequent analysis during their allotted lab sections.

Currently, R is first introduced in the second lab of the semester.  The data acquisition stage occurs during week three of the semester, and the data analysis occurs in the following lab period.  Previously we had tried to begin with R in week 1, but because introductory students are mostly freshmen, we found it was too much for them to take in along with everything else happening during their first weeks as college students.  In this first analysis period we introduce students to the RStudio platform, introduce them to the resources they will have available for all subsequent lab reports, and help them through their data analysis.  This is done through a short lecture followed by a step-by-step instruction guide (see attached pdf).

How to introduce students to R without devoting an entire course to it is a continuing struggle for us.  The rest of the paper will be devoted to explaining the resources we have developed and used to help in this regard.

 

Chemistr

One advantage of R, as mentioned above, is that it is open source.  This allows many people to develop their own code and share it with others.  In R these shareable sets of code are called packages and can be accessed from a variety of locations.  The main location is CRAN (the Comprehensive R Archive Network).  Currently there are over 12,000 packages available, which means that many computational scenarios have already been worked out and have gone through rigorous quality checks (4).  However, we decided to create our own package so that we could make it specific to the needs and abilities of introductory chemistry students.  This package is called Chemistr and is available through GitHub, another location for publishing packages, including those that have not gone through the quality checks required by CRAN (5).  The Chemistr package is designed to help students with little to no programming experience create reproducible analyses using R and R Markdown.  Currently there are six defined functions in Chemistr (Figure 3a), one of which is chem_scatter.  This function takes a complex set of code (Figure 3b) and simplifies it into a small set of arguments that students can look up in the HELP window in RStudio (Figure 3c).  These arguments correspond to the different types of plots the students will be creating (Figure 3d).

Figure 3:  Chemistr package.  A.  The functions that are part of the Chemistr package.  B.  The code that is behind the chem_scatter function.  C.  The information available through the HELP window about the chem_scatter function.  D.  The use of the chem_scatter function to create and analyze a plot of transmittance versus concentration.  This is a plot that is created for the first RStudio lab.

 

Templates

Another advantage of creating our own package is that when students use Chemistr, they also gain access to the templates created specifically for every lab report that uses R.  These templates provide a basic layout of the report along with helpful hints and code where necessary (Figure 4a).  Students then fill out the template by adding their own data (Figure 4b, lines 18-22), doing any mathematical manipulation of the data (Figure 4b, lines 23-27), and performing a statistical analysis (Figure 4b, lines 41-44).  The output of this code is a Word document that has a consistent style and can be further edited for significant figures and other small changes (Figure 4c).
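
The three steps a student fills in (data entry, mathematical manipulation, statistical analysis) might look something like the following base R sketch.  It is not the content of the actual Chemistr templates; the masses, the choice of NaCl, and all variable names are assumptions made up for illustration.

```r
# 1. Enter the data (made-up replicate masses, in grams)
trials <- data.frame(trial = 1:4, mass_g = c(1.02, 0.98, 1.01, 0.99))

# 2. Mathematical manipulation: convert mass to moles
molar_mass <- 58.44                      # g/mol for NaCl, chosen only as an example
trials$moles <- trials$mass_g / molar_mass

# 3. Statistical analysis: mean, standard deviation, and 95% confidence interval
n          <- nrow(trials)
mean_moles <- mean(trials$moles)
sd_moles   <- sd(trials$moles)
half_width <- qt(0.975, df = n - 1) * sd_moles / sqrt(n)  # t-based half-width
ci_95      <- c(lower = mean_moles - half_width, upper = mean_moles + half_width)

ci_95
```

The interval computed by hand here matches what t.test(trials$moles)$conf.int reports, so a template could use either form depending on how much of the calculation the instructor wants students to see.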

Figure 4: An example template found in the Chemistr package. A. An empty template has an outline for the report, including the types of analysis that need to be done.  B. Students then fill in the template and add extra code and formatting. C. The final report is knitted into a Word document that can then undergo final editing and modifications.

 

E-Book

Finally, we have created an online resource (6).  This resource takes students from the initial setup of RStudio all the way to the details of writing each individual lab report.  The advantage of an e-book format is that we can include videos and GIFs showing how to navigate RStudio, along with links to relevant outside resources (Figure 5).  The e-book becomes the one-stop shop for all lab report needs.  The downside is that it requires constant updating as labs, lab instructions, or teaching methods change.  This has been difficult to do, and we are in the process of making it a more sustainable endeavor.

 

 

Figure 5:  LeaRning R in ChemistRy at Reed College.  A.  The book gives basic information on how to get started using R.  B.  There are also individual chapters on each lab report.  In this example there is information on how to use code from a previous lab in a new report.  C.  A GIF showing a user how to upload a picture that a student may want to include in their lab report.

 

Future Work

Now that we are starting to see how to implement RStudio in the classroom, our next project is to make it even easier for students to begin this process outside of the classroom.  This entails condensing all of our lab resources, including the lab instructions, into our online book.  Within this book we would like to include more videos and links to our other resources.  We are also working on creating a course in the style of DataCamp, focused specifically on the basic R skills students will need for their first year of chemistry.  Students could take this course before, during, or after general chemistry.

 

References

  1. Earl, B.L.; Emerson, D.W.; Johnson, B.J.; Titus, R.L. Teaching Practical Computer Skills to Chemistry Majors. J. Chem. Educ. 1994, 71, 1065.
  2. Martinez, J.G. The new chemist. C&EN News 2018, 92, 2.
  3. Ihaka, R. R: Past and Future History. https://cran.r-project.org/doc/html/interface98-paper/paper.html (Accessed March 25, 2018).
  4. CRAN. https://cran.r-project.org/ (Accessed April 2, 2018).
  5. Chemistr. https://github.com/ismayc/chemistr (Accessed April 2, 2018).
  6. Cass, D.; Ismay, C. LeaRning R in ChemistRy. https://ismayc.github.io/chemistr-book/

 

Date: 
05/07/18 to 05/09/18

Comments

You are terribly brave. 

The principal advantage of R is that it is increasingly being used in just about any area that uses statistical analysis, from social science to areas of the humanities to hard-core STEM.  My question is whether other departments are being, or will be, approached to use R in teaching their classes/labs at all levels.  Without that I would fear that the students would not retain much.  This, of course, requires that the faculty learn R, which is an entirely different problem.

You are correct that, after students take the time and effort to learn this language, it would be a shame for them to encounter it only in my course; thankfully, that is not necessarily the case.  I try to emphasize this broad appeal to students when they are struggling, although I find that they believe the words of other students more than mine.  Currently the only other introductory course using R is introduction to statistics, which has been using R for decades longer than we have.  However, we hope that seeing it used in our course, along with having faculty with R experience, might change that in the future.  As for upper-division courses, R is used in psychology, linguistics, biology, and chemistry.  In chemistry, we currently use R in analytical chemistry and biochemistry, and many students use it during their senior thesis.

 

Thank you for this paper.  I learned some R in college many years ago, but found it much more challenging than Excel because I am not a programmer by nature.  I have taught my general chemistry students to use Excel and it seems to be mostly accessible to them.  However, I often have students who run into Mac incompatibilities.  If R does not have these incompatibilities, I could see the advantages of switching to R.  Do you know if there are differences in what R can do for PC users versus Mac users?

Xavier Prat-Resina

RStudio works great on Mac. You need to install R itself first.

The only problem that I saw when my students tried to use RStudio is that if they programmatically open a file, Windows and Mac use different syntax to indicate the path to that file (backslash or slash). For example
for Windows: read.csv("C:\\Users\\username\\Desktop\\myfile.txt") (inside an R string each backslash must be doubled; forward slashes, as in "C:/Users/username/Desktop/myfile.txt", also work on Windows)
for Mac: read.csv("/Users/username/myfile.txt")

For some reason, the whole idea of a path to a file and whether to use slash or backslash was challenging.
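
One way to sidestep the slash-versus-backslash question is to build paths with file.path(), which joins the pieces with a forward slash that R accepts on Windows, macOS, and Linux.  The sketch below is only an illustration; the folder and file names are hypothetical.

```r
# Build a path from its pieces instead of typing separators by hand.
# "data" and "myfile.csv" are hypothetical names used only for illustration.
data_file <- file.path("data", "myfile.csv")
data_file

# Self-contained demonstration: write a small file into R's temporary
# folder and read it back using a path built the same way.
csv_path <- file.path(tempdir(), "myfile.csv")
write.csv(data.frame(x = 1:3, y = c(2.0, 4.1, 5.9)), csv_path, row.names = FALSE)
readings <- read.csv(csv_path)
readings
```

Paths written relative to a project folder, rather than starting from C:\ or /Users, also let the same script run unchanged on a classmate's computer.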

Donna Wrublewski

I teach R at Caltech as part of The Carpentries program, and am a certified Carpentry instructor. The goal of the program is to get non-programmer researchers up to speed quickly with skills to work with data in a reproducible way. As part of a standard Carpentry lesson, we include a lesson on navigating the shell environment for pretty much exactly that reason. Many lessons are available online and CC-licensed: https://software-carpentry.org/lessons/ and http://www.datacarpentry.org/lessons/ - perhaps some of that material could be adapted for use?

I'd like to echo what Donna mentioned about the Carpentries lessons. I am also a certified instructor. The workshops on R we host at Colorado School of Mines are very popular, and many attendees are chemistry graduate students and researchers. The materials Donna linked can certainly be adapted. One thing this community can do together is to identify those examples and use cases that make particular sense to chemists, especially those that can demonstrate how reproducible data analysis can improve the quality of the research and facilitate collaborative explorations across disciplines.

Dear Donna and Yi,

Thanks.  This will be helpful in trying to figure out what is already available and how we can use or adapt it for our use.  I am looking forward to getting started on this project over the summer.

-Danielle

 

I agree that most coding, except for file finding, works the same on both a Mac and a PC.  One issue that I didn't discuss in the paper, but that was also a motivator, is that students were coming to campus with a larger variety of computers and often wanting to use a wider variety of spreadsheet programs.  I was not able to help or keep up with this, so it was an added bonus to move to a program that had the same accessibility for everyone.

Another aspect that I don't discuss is a change the college made after our first year of using RStudio.  The first year we used RStudio, we had all students download the program on their own computer.  This worked, but caused issues because the download process was different for different computers.  It became a bit of a headache.  The following year, however, the college decided to provide RStudio through a server.  This simplified the process 100% because now a student can access RStudio through a website and I can make sure that all the packages are up to date and ready to be used by students.  It also allows students to store their files on the server and share them with other users, including me.  This allows me to help troubleshoot issues the students might encounter.

 

Bob Belford

Hi Danielle,

I really enjoyed your paper and am very impressed with what you are doing. Our students get exposed to R in bioinformatics, and some of them get very excited about it. 

I have a few lines of questions.  The first parallels Josh's, and that is, do upper-division chemistry courses like analytical use R or Excel?  And if they use Excel, are the students coming in wanting to use R, and how is that handled?

Second, is there some sort of "tutorial" that compares Excel to R?  My thought is students probably need to know both, and being able to know which is best for what purpose could be of value.

Third, let me see if I understand things right.  Material on CRAN is vetted, and that is why you use GitHub, right?  Where I am leading is, where would I go if I wanted to find material like yours to potentially adopt in my classes? Are there "communities of R educators"? Are there other schools using R, and how can we find them?

Finally, what are your students like?  By this I mean, how selective is your admissions process, do any of your students need remediation? Many of our freshmen do.  In some ways I think there is a "taboo" against looking at code in much of the population that relies on computers, and I think having that experience is important for 21st century students. As Josh said, "you are terribly brave".

I am very glad you shared your work with us.  Thank you.

Bob

Dear Bob,

Thanks for the encouragement.

As for a student's desire to use R after taking CHEM101, I am not sure.  I will say that since I started using R three years ago, I have seen an uptick in the use of R for the analysis in student projects.  Since students know I am familiar with it, I am starting to have non-general chemistry students knock on my door for help.  I find that a good sign.  I even had a colleague stop by after realizing that Excel was not able to produce the graphs they desired. 

Analytical chemistry is currently using R for their analysis.  They began this the year after I began using R in general chemistry. Unfortunately there has been a lot of change-over in faculty teaching that course, so I don't have a good sense of how much the use of R in general chemistry is influencing their buy-in for upper division classes.  Since next year's seniors will have learned R in general chemistry, I think it would be an ideal time to poll the students about their experience and how it has influenced their choice in data analysis afterwards.

As for a transition program from Excel to R, I completely agree that would be helpful.  There are a lot of posts about transitioning from one to the other, but I have not seen anything that directly helps students see the connections.  I also want to point out that I have not abandoned Excel (or spreadsheets in general) completely.  They do serve a very important function, and I think helping students see when and how to use a spreadsheet versus R would be useful.  Also, acknowledging their current knowledge and letting them know we are not throwing that out the window would help many students open up to a new program.  This may be something to consider when we create our Introduction to R for Chemists mini-course.

As for where to find useful packages for chemists, I don't have a one-stop shop.  (If anyone does, please let us know.)  Most of what I find is by searching.  For example, we were doing a pH titration of an unknown acid and I wanted to plot both the titration curve and its first derivative.  I knew I could code this myself, but it would take me a while.  So instead I did a bit of searching and found the package titrationCurves by David Harvey of DePauw University (https://cran.r-project.org/web/packages/titrationCurves/vignettes/titrationCurves.pdf).  I found that his package allowed me to do what I wanted and even allows my students to model titration curves in a very simple manner.
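
For readers curious what coding such a plot by hand involves, here is a minimal base R sketch of a titration curve and its first derivative.  It does not use the titrationCurves package, and it models a strong acid titrated with a strong base rather than an unknown acid; the function name, concentrations, and volumes are all assumptions chosen for illustration.

```r
# pH while titrating a strong monoprotic acid with a strong base.
# Concentrations (M) and volumes (mL) are illustrative assumptions.
titration_ph <- function(v_base, c_acid = 0.100, v_acid = 25.0, c_base = 0.100) {
  kw <- 1e-14
  # Excess base concentration after mixing (negative while acid is in excess)
  excess <- (c_base * v_base - c_acid * v_acid) / (v_acid + v_base)
  root <- sqrt(excess^2 + 4 * kw)
  # Charge balance gives [H+]^2 + excess * [H+] - Kw = 0; pick the form of the
  # positive root that avoids subtracting two nearly equal numbers
  h <- ifelse(excess > 0, 2 * kw / (excess + root), (root - excess) / 2)
  -log10(h)
}

volume <- seq(0, 50, by = 0.5)   # mL of base added
ph     <- titration_ph(volume)

# First derivative by finite differences, placed at the interval midpoints
d_ph  <- diff(ph) / diff(volume)
v_mid <- (head(volume, -1) + tail(volume, -1)) / 2
v_eq  <- v_mid[which.max(d_ph)]  # steepest point estimates the equivalence volume

old_par <- par(mfrow = c(2, 1))
plot(volume, ph, type = "l", xlab = "Volume of base (mL)", ylab = "pH")
plot(v_mid, d_ph, type = "l", xlab = "Volume of base (mL)", ylab = "dpH/dV")
par(old_par)
```

Changing c_acid or v_acid in the call to titration_ph() and rerunning the script shows how the curve and the equivalence volume shift.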

As for my students, I think they come in with a lot of the same baggage as any other group of students.  I will say, however, that Reed does not have a chemistry course for non-majors.  Thus, this course is taken both by students who want to major in a science and by students just looking to fulfill their science requirement.  So their math, computer, and chemistry backgrounds vary immensely.  We don't have any remedial courses for students to take, so I can't give you numbers on that.  I can just say that I have not been blown away (or even impressed) by their incoming math skills.

I agree that R is a useful program for statistical purposes, and is doubly attractive because it is freely available.  There might exist other useful programs for specific applications in the learning, teaching and practice of chemistry, but Maple, which facilitates general computation, contains all those statistical modes that any student of chemistry is ever likely to need, combined with superlative symbolic and graphic facilities.  As you can read in the paper by Erick Castellon and myself being discussed in the second half of this week, the ability to use Maple for any arithmetical or mathematical purpose gives any user -- professor or student -- a tremendous power to attack any problem with arithmetical and mathematical -- including statistical and actuarial -- aspects.  Maple is not free software, but an institution can purchase a site licence for use in various modes.  The student version is absolutely the full version, simply bearing a student label for marketing purposes; if a student must purchase an individual copy, its cost is comparable with or even less than the cost of a standard textbook for chemistry.  Whereas that textbook might be useful for only one course in one semester, the use of Maple under the appropriate licence is eternal.

The learning curve for Maple is less steep than for Mathematica and likely also for R; when one learns to use Maple for calculus, linear algebra, differential and integral equations, it is a simple extension to use commands for statistical applications, including fitting of data that with R is absolutely more cumbersome.  Why should anybody select software that covers only one aspect of mathematical applications in chemistry when a general program includes all mathematical aspects that any student, or professor, of chemistry is ever likely to encounter?

Dear John,

I completely agree. I have never used Maple but I am looking forward to reading and discussing your paper next.  I think this is a bit like thinking about what language a student should learn beyond their native language.  We can all discuss the most useful or "best" language to learn, but in the end most people would agree that learning a second language is a good thing and one that will teach students how to learn any language in the future. I think of our R work the same way.  I want students to have exposure to a program that is fairly limitless in what they can do with it.  Then, if confronted with a different program in the future, they have the basic skills to see how that program works and learn the new language.

-Danielle

 

Just adding my two cents here. An open source tool like R can be cumbersome to use in some cases compared to commercial products like Maple. But I think it's still worth it because in the long run it maximizes the possibility that anyone could reproduce the data analysis process without cost barriers. It may not sound that significant for undergraduate students trying to complete their homework. But being able to transfer this ability to use the "second language" (as Danielle mentioned) will help these future researchers collaborate better with other researchers and may generate more cross-discipline collaboration opportunities. The sense of solving a problem in a community-driven approach can be one of the important learning outcomes from this kind of experience too. After all, tools are just tools. The best practices and research habits that can be developed while using these tools are more important.

As someone who grades lab reports, I love the idea of having a template for a data table and for a scatter plot! With that in place (and the integration of word files into the work flow), you can concentrate on the data. For the students, there are huge benefits as well, but it takes some hard work on their part before they realize those benefits.

The question is how to sell the switch to an unfamiliar workflow to the students. I just finished a biochemistry lab where students had to make figures of protein structures using Jmol. I had the option of letting students work with a graphical user interface or with a script. I opted for the script because it is easier to tweak, revise or redraw figures that are script-based. We worked on this in two sessions. One group had the program crash in the middle of the first session, and they appreciated that they had a script to quickly pick up after restarting the program. All groups appreciated that they did not have to recreate the figure during the second session when we put in the final touches (labels, etc.) but could just rerun the script. Also, some had to make very similar figures based on different structural data, and they spent much less time on the second figure by simply making some changes in the script.

You could design your first lab in a way that students see the benefit of having script-based data analysis and figure preparation. For the Beer's law lab, for example, the figures for the films with different thicknesses and the solutions with different concentrations would be very similar, so making the second set of figures should be easier. If they are asked to do multiple measurements and analyze averages and standard deviations, they would also be able to recycle the script.

You could also require a revision of their first lab report, and they would experience that it is much easier to change the script and run it again than what they would have to do in the MS Office world. For example, if you ask them to make the labels larger for all graphs (maybe to use the figure in a poster or presentation instead of a report), they would have to go into each figure to tweak the graph, export it back into the Word document, and repeat all the editing (resizing, etc.) they did in the Word document.

Anyone who has made publication-quality graphs, with the many edits and tweaks they usually come with, will appreciate graphing software more robust than the graphing capabilities of a spreadsheet program, but I think it is worth thinking about how to convince students to use sharper tools.

It is fairly easy to provide templates for graphing in spreadsheets and, of course, while Excel dominates today, there are other spreadsheet programs, each with a different set of advantages (e.g. IgorPro, LibreOffice Calc, etc., and if you want something fancier, SigmaPlot and its ilk).

That being said, to my mind the issues are

1.  Market share.  It is a disservice to students to teach them an app that is not used almost universally.  It's hard to avoid AutoCAD for design and today Python for programming.

2.  Local market share, e.g. will this software be used in other courses, at best across the university.  For example, if there is a school of engineering you are hard pressed to avoid MathCAD.

3.  Cost, both immediate and future.  Community-driven freeware is, to me, the best choice by far.  In that respect R and SageMath have an important advantage.  IDL is dying out because the high cost drives even commercial users away, even with substantial educational discounts.

4.  Finally, a point that has not been raised: the ability for students to collaborate online.  This is a particular advantage of Google Docs.  A colleague of mine has used this in GChem Lab http://www.pucrs.br/ciencias/viali/tic_literatura/artigos/planilhas/Sinex.pdf

Best

Josh Halpern

Dear Karsten,

I agree that buy-in with a steep learning curve is definitely my biggest challenge.  This idea of emphasizing a toolbox to pull from is definitely the key.  I like the idea of asking them to make minor changes and emphasizing to the students how easy this change is when doing something script-based.  I have also tried to keep the number of toolbox scripts to a minimum so that students can begin to see that copy code, paste code, modify code is the easiest workflow model.  This is why in many templates I refer them back to previous lab reports for an example of the code that would be useful for a given analysis.  We also do an end-of-year project in which student groups get to choose their own experiment.  Since all of the data analysis is different and their lab report is a poster, it gives me an opportunity to work with smaller groups of students on designing their posters.  This is when I have started to see who has bought into the R analysis, because they will come to me with ideas of how they want to analyze and present their data, and then together we work on finding and modifying code that will work for them.

-Danielle

 

Your paper focuses on the implementation of R and the pedagogy used to introduce the technology. I am wondering about the students' experience learning chemistry now that they have the power of R at their disposal. Which learning goals are now better supported by the lab program? Are there any topics or concepts that were too tedious or complicated to tackle by hand but are now feasible using R?

Dear Karsten,

Great question.  First of all, there is nothing I used to do in Excel that I think is too complicated, or that I have chosen not to do, in R.  The advantages I see for student learning with R are:

1.  The biggest learning goal that I can now easily control is that I can choose when I want students to focus on interpreting their data and when on analyzing their data.  This is done through how much code I put into their templates and how much I require them to do on their own.  For example, there is a lab in which they synthesize a product and determine the mass % of its components.  We then compile all of the class data into a spreadsheet, and they need to use this data to determine the overall empirical formula.  At the end, this requires them to combine uncertainties.  However, this is not something we discuss in our general chemistry course.  Instead we focus on understanding that values have uncertainties and on the meaning of a confidence interval (focused on the 95% confidence interval).  Therefore, instead of having them figure out how to propagate their 95% confidence intervals into the final value, I have included all of the code to do this (i.e., it is a plug-and-chug sort of setup).  This takes the focus off of how to do the analysis and moves it to what these numbers mean.

 

2.  Another advantage is that you can choose to fit non-linear models.  This could be used in a kinetics lab instead of plotting the data in a linearized form.  We tried this one year and it was easy to do in R, but we switched back to linear models because that is how students were being taught to look at the data in lecture.  However, one could imagine that kinetics could be taught in a way that emphasizes the different models that would fit the data without discussing the linearization of that data.  Then this would be an advantage.

3.  Finally, something we haven't done but could do using R is have students work with models and explore how the model parameters influence the outcome.  For example, students could be given a script that models acid-base titrations, and they could change the pKa or acid concentration to see how the titration curves would change.  This is something that would be way too tedious for a student to do by hand, but it could be assigned as a homework or in-class problem to look at and discuss.
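
The non-linear fitting mentioned in point 2 can be sketched in base R as follows, comparing a direct exponential fit with the traditional linearized fit for first-order kinetics.  The data are made up for illustration (generated from an assumed rate constant of 0.020 1/s with small fixed offsets), and the variable names are hypothetical; this is not the code used in the course.

```r
# Made-up first-order decay data: concentration (M) versus time (s)
time_s <- c(0, 20, 40, 60, 80, 100, 120)
conc   <- 0.50 * exp(-0.020 * time_s) +
  c(0.004, -0.003, 0.002, -0.002, 0.003, -0.001, 0.001)
kinetics <- data.frame(time_s, conc)

# Linearized approach: ln(conc) versus time is a straight line with slope -k
linear_fit <- lm(log(conc) ~ time_s, data = kinetics)
k_linear   <- -coef(linear_fit)[["time_s"]]

# Non-linear approach: fit conc = A0 * exp(-k * t) directly,
# using the linearized estimates as starting values
nonlinear_fit <- nls(
  conc ~ A0 * exp(-k * time_s),
  data  = kinetics,
  start = list(A0 = exp(coef(linear_fit)[["(Intercept)"]]), k = k_linear)
)
k_nonlinear <- coef(nonlinear_fit)[["k"]]

c(linear = k_linear, nonlinear = k_nonlinear)

# Plot the raw data with the fitted exponential curve on top
plot(time_s, conc, pch = 19, xlab = "Time (s)", ylab = "Concentration (M)")
curve(predict(nonlinear_fit, newdata = data.frame(time_s = x)), add = TRUE)
```

The two estimates of k are close for clean data like these; they tend to differ more when the late-time points are noisy, since taking the logarithm changes how much weight those points carry.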

-Danielle