I Have a Research Question About MOOCs That Your Elite Institution Can Answer in Under an Hour

I’ve been really curious about how much (and in what way) xMOOC students use forums. And I can’t find any good data on it. Not even a “per capita visits to forum” number.

This is pretty suboptimal for the field, since one of the main advantages of MOOCs often presented is the conversation they allow between students. Now, there is a question of whether these online conversations have any learning impact. But we haven’t even got to the basic question of whether the median xMOOC user uses the forums at all, outside a few brief scans. That seems insane to me.

Fortunately, if you work at an institution that offers a Coursera MOOC, you can answer that question pretty easily. What’s more, you can be the first to do it.

Here’s the data export procedure and SQL schema (thank you for the link, George Veletsianos!):


There are some unique challenges, due to the anonymization of user IDs on a per-session basis. But post data is not anonymized, so you could get a metric like median posts per user, or percent of users who post. And on the forum-reading side, perhaps the percent of sessions involving a forum visit?

I’m not sure how hard it is to piece the database together from the download. But a competent SQL coder could use the document and database to get at these sorts of questions in a matter of minutes once the database is up and running and the document read. These are very simple sorts of SQL statements.
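To give a sense of how small these queries are, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`users`, `forum_posts`, `user_id`, `post_id`) are hypothetical stand-ins that I made up for illustration; they are not taken from the Coursera export document, and the real schema will almost certainly differ. The toy rows are invented too.

```python
import sqlite3
from statistics import median


def forum_post_stats(conn):
    """Return (percent of users who posted, median posts per user).

    Assumes two hypothetical tables: users(user_id) and
    forum_posts(post_id, user_id). Adjust to the real export schema.
    """
    # LEFT JOIN keeps enrolled users who never posted (count = 0),
    # so the median reflects all users, not just posters.
    rows = conn.execute(
        """
        SELECT u.user_id, COUNT(p.post_id) AS n_posts
        FROM users AS u
        LEFT JOIN forum_posts AS p ON p.user_id = u.user_id
        GROUP BY u.user_id
        """
    ).fetchall()
    counts = [n_posts for _, n_posts in rows]
    if not counts:
        return 0.0, 0
    pct_posting = 100.0 * sum(1 for n in counts if n > 0) / len(counts)
    # SQLite has no built-in median, so compute it in Python.
    return pct_posting, median(counts)


# Toy in-memory database: five users, two of whom post.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE forum_posts (post_id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users (user_id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO forum_posts (user_id) VALUES (1), (1), (1), (2);
    """
)

pct_posting, median_posts = forum_post_stats(conn)
print(f"{pct_posting:.1f}% of users posted; median posts per user = {median_posts}")
```

On the toy data this reports that 40% of users posted and that the median user posted nothing, which is exactly the kind of headline number I'm asking for. Against a real export you would point the connection at the loaded database and swap in the actual table names.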

I don’t think these will answer any research questions definitively, but they will help us refine the questions that we ask in very helpful ways. And that, in turn, will help us improve education. I’ve seen a lot of institutions say they are doing Coursera courses as a charitable effort, to help improve education for all. Data is essential to that effort, so now is the time to see whether the rhetoric reflects reality.

Show me the data!


17 thoughts on “I Have a Research Question About MOOCs That Your Elite Institution Can Answer in Under an Hour”

    • That’s a good point, a lot of forum activity is in the “introduce yourself” thread. And like Martin Hawksey has done in his Twitter analysis, there are interesting things you can do by looking at questions (and ostensibly, responses).

      Otherwise let’s just keep counting the clicks at the entrance turnstile.

  2. Hi Mike,
    I’m working on collecting and analyzing MOOC data for UofT, and I might be able to answer this question in a few months. As much as I’d love to just run a quick SQL query and tweet the answer right now, there are several constraints. UofT only has access to the data for their own courses, and this access is administered by the institution, which is currently exploring how data/findings might be shared with the world (this is an ongoing internal discussion among those responsible at present). In addition, we need ethics approvals to begin to query this data (and as you see in the data export document there are various levels depending on how easy it is to de-anonymize the data, etc). At UofT we have already received ethics permission to analyze the data from five MOOCs, and hope to be able to publish something based on this, but it won’t be for a while yet. I will keep this question in mind though!

    • Thanks, Stian. I wonder if broad aggregate data regarding behavior (and in particular, simple measures of central tendency and the like) could be treated differently than more complex and specific measures of performance. One could argue, for example, that average posts per user to the forums is public information anyway; the SQL just makes tabulating it more convenient. Additionally, IRBs usually have broad exceptions for non-identifiable student data related to educational research; it would seem to me that simple uses could be fast-tracked.

      That said, determinations of who gets to use the data when it contains identifiable elements are more tricky, and I can see how that might be holding things up.

      • We’re a bit further along in our ethics discussions here at Vanderbilt since we already have two MOOCs up and running. Our IRB has two key criteria for deciding how to handle educational data mining projects.

        1) Is the data being collected specifically for research purposes? For instance, one might run a control group experiment that involves assigning some students to watch video A and other students to watch video B, then comparing their responses on a subsequent quiz. If the data is collected specifically for research purposes, then IRB needs to look at it.

        2) Can the data be de-identified? If you don’t need to use personally identifiable information (name, email, IP address) and can strip that information out of your data set, then there is less IRB oversight. For instance, if you’re having students complete a survey that’s clearly for research purposes (criterion #1) and the results of that survey can be de-identified (without preventing the data analysis you want to do), then it’s probably the quick “exempt” level of IRB review, rather than the lengthier “expedited” or “full” reviews.

        What about the kind of data analysis that Mike proposed in his post? In this case, the data isn’t being collected specifically for research purposes; it’s being collected as part of the normal operation of the course. And the research questions can be answered with de-identified data. Here at Vanderbilt, this means I don’t need to get IRB approval to share the data.

        As I’ve discovered very recently, it’s not just IRB I need to work with, it’s Coursera, too. They have a data privacy policy that limits what I can do with data obtained through the instructor interface on the Coursera platform. If the data is being used for research, the request for data should go through the data export procedure that Mike linked to in his post. The data shouldn’t be pulled from the instructor interface, since the instructor interface makes it easy to see personally identifiable information.

        One reason this is relevant is that the Coursera instructor interface makes available some very useful aggregate course statistics, including ones that might answer Mike’s question here, depending on which metric you want to use. When Mike posted this, I asked my Coursera contact if I could share publicly those aggregate course statistics, or if I needed to request a SQL data export and see if my very rusty SQL skills are up to the task of extracting the right statistics. Coursera said that using aggregate data from the instructor interface is fine for “university reporting.” Mike posed his question as a *research* question, of course, but I think as long as I stick to using aggregate data from the instructor interface, I’m fine to share.

        Whew! This may be way more information than anyone needed, but I figured I’d get this down in writing somewhere. Mike, let me see if I have time later today to put together a short blog post sharing some data that might answer your question.

      • No this is the perfect amount of information! We’ll definitely pull this up at the Hangout shindig on Tuesday. I particularly appreciate the Coursera permission angle — not necessarily something I would have thought of off the top of my head.

    • We are in the same boat more or less as U of T: we will be able to answer in a few months. We are doing a blanket IRB to look at multiple dimensions of MOOC use, participation, performance, retention, post content, etc., but won’t really get started until after we launch our first course. I suspect we will probably find a way to look at this in various stripes … people that enroll, people that participate, people that drop out (and when), and people that complete. One of the things that interests us in all of this is the potential to have lots of data and to look at it in very smart ways. We are lucky enough to have our Center for Online Innovations in Learning as a research and assessment partner in our MOOC endeavors (http://coil.psu.edu). When we start to see trends we will be reporting!

      • Thanks, Cole! Smart doing a blanket IRB before you launch. And knowing you, I’m sure you’ll get some of that out in ways more immediate and accessible than conference papers, which seems to me key. Policy is moving too fast to be discussing this in a traditional journal cycle — there are real world impacts to sitting on data.

  3. If anyone’s taking requests, I’m really interested in the visualisation of interaction between forum users, as I’m trying to figure out the impact of perceived language competency in Anglophone MOOCs, which are still the dominant model. I feel that the raw number of posts for users who might be defined as “highly engaged” is likely itself to be the result of other kinds of “cultural capital” advantages that they enjoy in relation to English. Is anyone looking into this? My own lack of skill in network visualisation is truly startling, so I can only offer the question.

      • In a couple of weeks, our two ongoing MOOCs will conclude, and I’ll request de-identified data exports from these two courses. I’ll get those data exports set up as SQL databases I can query, and, at that point, I would be glad to take requests! As I mentioned above, as long as the data I’m working with is de-identified, I’m clear with IRB and Coursera to share analysis of those data publicly.

        I don’t know if I’ll have access to data on students’ first languages, so I might not be able to answer this particular question, but I’m happy to see what I can pull out of the data using my limited SQL skills.

  4. Yes, and I meant to say earlier that I strongly agree with your comment that policy/institutional-decision-making/investment-of-real-time-and-money [you name it] is moving much too fast for the traditional research cycle to support evidence-based calculations from the usual sources. Educational research cycle: write grant, wait for result, get result, months later get grant, establish project, actually do research, analyse results, write up, submit to journals, wait, wait, wait, wait, article published behind paywall, end.
    Your approach seems better to me.

  5. Hi, I’m Kristin Palmer from UVa. We’ll be happy to collaborate with folks on research. We have a few proposals that have been approved by our IRB around motivation, interventions and authentic learning. If you’d like to connect with someone here, let me know at kristinatvirginiadotedu. We don’t have a SQL person to just run the #s and I think both Amy and Stian have interesting points. I’ll be happy to jump on a call to discuss if you want to help coordinate folks. We really could use some cross-platform force to help coordinate research if anyone has lots of resources and is looking for something to do!

    • Thanks for offering, Kristin! Fortunately, we have an educational data mining research group here at Vanderbilt interested in exploring the data generated by our Coursera courses. They have their own research agenda, of course, but I can make suggestions to them. I’ll see if they’re interested in working with non-Vanderbilt data.

      And as I indicated above, I have some rusty SQL skills I can try to apply here. Frankly, time is a bigger constraint for me than skill level. But I can see what I can fit in.

      • One other note: I had a very bright former student, a math/CS major, do an independent study with me this spring, mining some of our Coursera data. He’s almost done with his report, which I’ll make sure we share one way or another. You might look around for smart undergraduates who have an interest in educational data mining…

