The Route To Personal Cyberinfrastructure Is Through Storage-Neutral Apps

Jim’s got a great summary of the larger idea behind UMW Domains (written by Ryan Brazell) up on his site. The core idea — personal cyberinfrastructure — is one I buy into, but at the same time the current mechanisms for it (cPanel, personal servers, and the like) seem clunky and not poised for greater adoption (although I watch the Thali project with interest).

Rather, the route to personal cyberinfrastructure is likely to run through storage-neutral apps. Briefly, the way most apps work now is that there is a program on your tablet/desktop/phone that is owned by Company A, and then there is often a certain amount of web storage used by that app, also owned by Company A. There’s a certain amount of web-based processing, also done on servers owned by Company A. This is somewhat different than the PC model, where Adobe sold you software but you owned the disk that held all your image creations, Microsoft sold you MS Word but your computer ran it, etc.

The cPanel-as-infrastructure response to that is to move to an all-web-app where you own the server. Some of the apps have mobile extensions to them, but by and large you avoid the lock-in of both modern web apps (Google Docs, Dropbox, Tumblr) and modern apps by going to open, HTML-based web apps.

This works, but it seems to me an intermediate step. You get the freedoms you want, but the freedoms you care about are actually a pain in the ass to exercise. Klint Finley, in a post on what a new open software movement might look like, nicely summarizes the freedoms people actually want from most applications (as opposed to content):

  • Freedom to run software that I’ve paid for on any device I want without hardware dongles or persistent online verification schemes.
  • Freedom from the prying eyes of government and corporations.
  • Freedom to move my data from one application to another.
  • Freedom to move an application from one hosting provider to another.
  • Freedom from contracts that lock me in to expensive monthly or annual plans.
  • Freedom from terms and conditions that offer a binary “my way or the highway” decision.

You get all those freedoms from the web-app personal cyberinfrastructure, but you get them because you do all the work yourself. Additionally, your average user does not care about some of the hard-won freedoms baked into things like WordPress, such as the ability to hack the code (we care about that very much, but the average person does not). They really just want to use it without being locked forever into a provider to keep their legacy content up.

What I think people want (and what they are not provided) is a means to buy software where others do all this work for you, but you hold on to these freedoms. And assuming we live in a market that tries to match people with products they want (big assumption), the way that will come about is storage-neutral, net-enabled apps. I’ll own virtual server space and cycles somewhere (Amazon, Google, Microsoft, Squarespace, wherever). I’ll buy apps. But instead of installing software and data on the app-provider’s server, they’ll install to my stack on the web. And because they’ll encrypt that data, the company that runs my server won’t be able to see it either. My subscription to Adobe or Word will operate much like older subscriptions. Subscription will get me updates, but if at any point I stop paying Adobe, I can still run my web app on my server in the state it was in when I stopped paying them.
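
To make the shape of that concrete, here is a minimal sketch of what "storage-neutral" might look like from the app's side. Everything in it is hypothetical: the `Storage` interface, the `InMemoryStorage` stand-in, and the `NotesApp` class are names invented for illustration, not anyone's actual API. Encryption is left as a comment, since a real app would need a proper crypto library for that step.

```python
from typing import Protocol


class Storage(Protocol):
    """Hypothetical interface: anything that holds bytes under a key."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStorage:
    """Stand-in for the user's own rented stack (a real one would call a host)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class NotesApp:
    """A vendor's app: it owns the logic, but is handed the storage to use."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def save_note(self, title: str, body: str) -> None:
        # A real app would encrypt the bytes here, before they leave the app,
        # so the hosting company could not read them.
        self._storage.put(f"notes/{title}", body.encode("utf-8"))

    def read_note(self, title: str) -> str:
        return self._storage.get(f"notes/{title}").decode("utf-8")


# The user picks where the data lives; the app just receives it.
my_stack = InMemoryStorage()
NotesApp(my_stack).save_note("todo", "renew domain")

# A fresh app instance reads the same note, because the data sits in the
# user's store rather than inside the vendor's service.
recovered = NotesApp(my_stack).read_note("todo")
print(recovered)
```

The point of the design is the constructor: the app never decides where the data goes, so swapping hosts means swapping the object passed in, not migrating out of a vendor.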

Why is this more possible than the open web app model? None of the major providers have much incentive to go this route. Subscriptions are a lucrative business with undreamed-of lock-in potential. I would say there are two reasons. First, companies with a virtual server platform (Microsoft, Google, Amazon) have some incentive to promote this model. Even Apple has a chance here to pair its app store with virtual server space. Second, and more importantly, such a scheme would be a huge boon to small developers and hackers. Knowing that they don’t have to scale up server architecture to sell server-powered apps frees them to focus on the software instead of scalability, the way that API-rich operating systems allowed previous generations of developers to focus on their own core product. And as this broadens out to where everyone’s phone has a slice of supercomputer attached to it, some really neat things become possible: truly federated wikis where pages are spread across multiple personal sites, music software that can write down effect-laden tracks in near real-time using rented processor time, music library apps written in 200 lines of code. That’s the larger win, and that’s where we want to be heading, the place where practical user freedoms and developer capabilities meet.

Doing Big Data and Analytics Right

From the Chronicle, a surprisingly good article on Big Data:

This month Mr. Lazer published a new Science article that seemed to dump a bucket of cold water on such data-mining excitement. The paper dissected the failures of Google Flu Trends, a flu-monitoring system that became a Big Data poster child. The technology, which mines people’s flu-related search queries to detect outbreaks, had been “persistently overestimating” flu prevalence, Mr. Lazer and three colleagues wrote. Its creators suffered from “Big Data hubris.” An onslaught of headlines and tweets followed. The reaction, from some, boiled down to this: Aha! Big Data has been overhyped. It’s bunk.

Not so, says Mr. Lazer, who remains “hugely” bullish on Big Data. “I would be quite distressed if this resulted in less resources being invested in Big Data,” he says in an interview. Mr. Lazer calls the episode “a good moment for Big Data, because it reflects the fact that there’s some degree of maturing. Saying ‘Big Data’ isn’t enough. You gotta be about doing Big Data right.”

I don’t know if I have to sketch out the parallels in education, but just in case: we have two really unhelpful parties in learning analytics. We have the “it’s all bunk” crowd, and we have the evangelists. And I don’t know which is worse.

Here’s the thing — saying “Big Data is bunk” is pretty close in ridiculousness to saying “Oceanography is bunk”. Seventy percent of the planet is ocean. Likewise, the “data exhaust” we emit on a daily basis is growing exponentially. There is no future where the study of this data is not going to play a large role in the research we do and the solutions we create. None. Nada.

How we do it is the issue. And the “science” in “data science” is supposed to bring an element of rigor to that.

But for various reasons, the Big Data world is surprisingly unscientific, surprisingly data illiterate, surprisingly uncritical of its own products. I hear supposed data scientists quoting the long-debunked claim that Obama won the 2012 election through use of Big Data (unclear and unlikely). They latch on to the same story about Target predicting pregnancy, which remains, years later, an anecdote that has never seen external scrutiny. They cite Netflix, even though Netflix has walked back from a purer data approach, hand-tooling micro-genres to make results more meaningful.

It gets worse. As the Chronicle article points out, Google statisticians published the original Nature article on using Google searches to predict flu outbreaks in 2009. Google Flu Trends, the result of that research, is used by public health professionals as one measure of likely flu incidence. That persistent over-estimation that was just discovered? Flu Trends was overestimating physician visits by about 100%. That’s bad. But here’s the kicker: It’s been doing that for three years.

And this, unfortunately, is par for the course. A recent article by an Open University analytics expert cites the Purdue Course Signals experiment, apparently unaware that a large portion of those findings came under substantial critique last year, which raised questions that have still not been answered to this day. Meanwhile the examples used by MOOC executives are either trivial or so naively interpreted that one has to assume that they are deliberately deceiving the public (the alternative, that professors at Stanford do not understand basic issues in research methods, is just too frightening to contemplate). Yet, if they are called on these errors, it is generally by the “Big Data is bunk” crowd.

There are people, people I know personally, who are in the happy middle here, believing not that we should support analytics or Big Data but that we should support better practice, period. But there are too few of them.

So here’s my proposal. We’ve all used these anecdotes — Google Flu Trends, Course Signals, Netflix recommendations, Obama’s election — to make a point in a presentation.

Early in this field, that was probably OK. We needed stories, and there wasn’t a whole lot of rigorous work to pull from. But it’s not OK anymore. I’m declaring a moratorium on poorly sourced anecdotes. If you are truly a data person, research the examples you plan to cite. See what has happened in the four or five years since you first saw them. Don’t cite stuff that is questionable or seemingly permanently anecdotal. And if you hear people cite this stuff in a keynote, call them on it. Be the skunk at the party. Because it’s intelligent skunks, not cheerleaders, that this field needs right now.

Like Tumblr for Wikis (Sample Implementation. Downloadable Code.)

I radically simplified the approach to wiki article reuse. I think for the better. I’d like you to tell me what you think:

Keep in mind this is only the start. The idea would be to build communities around the reuse. So, for example, when your page gets rewiki’d a central system logs that, and feeds back to your page a little snippet of text that says something like “26 clones, 3 forks” and, like Tumblr, lists all the different sites that have reused it. You could also create a central hub where the most re-used content of the week floats to the top. Etc., etc.
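
For the curious, here is a rough sketch of what that central log might look like. None of this is the actual implementation: `ReuseLog`, its method names, and the example URLs are all made up for illustration. I'm also assuming a "clone" means a page copied as-is and a "fork" means a page copied and then edited, and the hub ranking ignores the one-week window to keep things short.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ReuseLog:
    """Hypothetical central log of rewiki events."""

    # Each event is (source_page, reusing_site, kind).
    events: list[tuple[str, str, str]] = field(default_factory=list)

    def record(self, source_page: str, reusing_site: str, kind: str) -> None:
        # Assumption: "clone" = copied as-is, "fork" = copied then edited.
        if kind not in ("clone", "fork"):
            raise ValueError(f"unknown reuse kind: {kind}")
        self.events.append((source_page, reusing_site, kind))

    def badge(self, source_page: str) -> str:
        """The little snippet fed back to the original page."""
        counts = Counter(k for page, _, k in self.events if page == source_page)
        return f"{counts['clone']} clones, {counts['fork']} forks"

    def reusing_sites(self, source_page: str) -> list[str]:
        """Every site that has reused the page, Tumblr-notes style."""
        return sorted({s for page, s, _ in self.events if page == source_page})

    def most_reused(self, n: int = 10) -> list[tuple[str, int]]:
        """Pages ranked by total reuse, for a hub page (no time window here)."""
        return Counter(page for page, _, _ in self.events).most_common(n)


log = ReuseLog()
page = "wiki.example.edu/wrapped-moocs"  # placeholder URL
log.record(page, "site-a.example.org", "clone")
log.record(page, "site-b.example.org", "fork")
log.record(page, "site-c.example.org", "clone")
log.record("wiki.example.edu/other-page", "site-a.example.org", "clone")

print(log.badge(page))
print(log.reusing_sites(page))
print(log.most_reused(1))
```

Storing raw events rather than running totals is deliberate: the same log can then answer "who reused this?" and "what is popular?" without a second data structure.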

If you have a Dokuwiki instance and some hacker blood, you can try it out yourself. Instructions here:

My coding on this stuff is very hacky. I’d love to do this cleanly through XML-RPC in such a way that the only thing you would need is the bookmarklet (and a standard dokuwiki install). If you’re the genius who can make that happen quickly and cleanly, come share the glory!

Thanks to all the people I’ve talked this last iteration through with, but probably especially Devlin Daley, who helped me stumble toward what I wanted to do during an hour-long videochat. I came out of it with a clearer sense of what the core product was.

Truth, Durability, and Big Data

Ages ago in MOOCtime there was this media think-nugget going around about the glories of Big Data in MOOCs. It reached its apex in the modestly titled BBC piece “We Can Build the Perfect Teacher”:

One day, Sebastian Thrun ran a simple and surprising experiment on a class of students that changed his ideas about how they were learning.

The students were doing an online course provided by Udacity, an educational organisation that Thrun co-founded in 2011. Thrun and his colleagues split the online students into two groups. One group saw the lesson’s presentation slides in colour, and another got the same material in black and white. Thrun and Udacity then monitored their performance. The outcome? “Test results were much better for the black-and-white version,” Thrun told Technology Review. “That surprised me.”

Why was a black-and-white lesson better than colour? It’s not clear. But what matters is that the data was unequivocal – and crucially it challenged conventional assumptions about teaching, providing the possibility that lessons can be tweaked and improved for students.

The data was unequivocal. But was the truth it found durable? I’ve argued before that the Big Data truth of A/B testing is different from the truth of theoretically grounded models. And one of the differences is durability. We saw this with the A/B testing during the Obama campaign, when they thought they had found the Holy Grail of campaign email marketing:

It quickly became clear that a casual tone was usually most effective. “The subject lines that worked best were things you might see in your in-box from other people,” Fallsgraff says. “ ‘Hey’ was probably the best one we had over the duration.” Another blockbuster in June simply read, “I will be outspent.” According to testing data shared with Bloomberg Businessweek, that outperformed 17 other variants and raised more than $2.6 million.

The “magic formula”, right? Well, no:

But these triumphs were fleeting. There was no such thing as the perfect e-mail; every breakthrough had a shelf life. “Eventually the novelty wore off, and we had to go back and retest,” says Showalter.

And today there is news that the “Upworthy effect” — that A/B tested impulse to click on those “This man was assaulted for his beliefs. You won’t believe what he did next.” sort of headlines — is fading:

[Mordecai] lets everyone in on his newest data discovery, which is that descriptive headlines—ones that tell you exactly what the content is—are starting to win out over Upworthy’s signature “curiosity gap” headlines, which tease you by withholding details. (“She Has a Horrifying Story to Tell. Except It Isn’t Actually True. Except It Actually Is True.”) How then, someone asks, have they been getting away with teasing headlines for so long? “Because people weren’t used to it,” says Mordecai.

Now, Upworthy is an amazing organization, and I’m pretty sure they’ll stay ahead of the curve. But they are ahead of the curve precisely because they understand something that many Big Data in Education folks don’t — the truths of A/B are not the truths of theory. Thrun either believed or pretended to believe he had discovered something eternal about black and white slides and cognition. Which is ridiculous. Because the likelihood is he discovered something about how students largely fed color slides reacted to a slideset strangely reduced to black and white.

Had he scaled that truth up and delivered all slides in black and white he would have found that suddenly color slides were more effective.

There’s nothing wrong with this. Chasing the opportunities of the moment with materials keyed to the specific set of students in front of you is worthwhile. In fact, it’s more than worthwhile; it’s much of what teaching is *about*. Big Data can help us do that better. But it can only do that if we realize the difference between discovering a process that gets at eternal truths vs. discovering a process that gets at the truth of the moment.
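
One practical upshot is that an A/B result is worth re-measuring rather than filing away. Below is a minimal sketch of that habit: compute the variant's lift over the control separately for each time period and watch whether it holds. The function names are my own, and the counts are invented purely for illustration; they are not data from Udacity, the Obama campaign, or Upworthy.

```python
def conversion_rate(successes: int, trials: int) -> float:
    """Fraction of trials that succeeded (clicks, passes, donations...)."""
    return successes / trials


def lift_by_period(periods):
    """Relative lift of variant B over control A, one value per period.

    Each period is a tuple: (a_successes, a_trials, b_successes, b_trials).
    """
    lifts = []
    for a_succ, a_n, b_succ, b_n in periods:
        rate_a = conversion_rate(a_succ, a_n)
        rate_b = conversion_rate(b_succ, b_n)
        lifts.append((rate_b - rate_a) / rate_a)
    return lifts


# Invented counts: the variant wins big at first, then the edge fades.
periods = [
    (100, 1000, 150, 1000),  # early: the variant is novel
    (100, 1000, 120, 1000),  # later: people are getting used to it
    (100, 1000, 100, 1000),  # later still: no difference left
]

lifts = lift_by_period(periods)
for label, lift in zip(["early", "later", "latest"], lifts):
    print(f"{label}: {lift:+.0%}")
```

A real retest would also want a significance check on each period, which this sketch skips. The only point here is that "B beat A" carries a date stamp.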

Goal: Make Wiki Page Reuse as Easy and Natural as Reblogging on Tumblr

Short post, but a note on where I’m at with the wiki project.

I’m a maddeningly circular developer, because I write code to help me think about problems. The Dokuwiki work is coming along well, given the amount of time I actually have to give to it, and given I took a detour to make it look a bit prettier.

But I wanted to get an idea down that has crystallized one of the major goals for me. On Tumblr, most content on your blog is borrowed — reblogged — from other sources. While you link to those sources, you display the content in your own space, and technically you display a copy of the content.

Initially more traditional bloggers like me thought this was a bit weird. Shouldn’t you link to stuff instead of copying it? What the heck is “re-blogging” anyway?

But with a combination of attribution features, favorite love, and interface genius, reblogging became a Tumblr norm. And now content flows through that system incredibly fast.

Wiki pages are different, certainly, in many ways. But I want to rethink them, with this idea of “reuse in your own space” as a controlling principle. I want a world where if I find a great treatment of how to use a wrapped MOOC on Derek Bruff’s Vanderbilt wiki I can “re-wiki” it to my own space, and the attribution, recognition, and linkbacks are such that Derek will actually like that.

That’s the pitch. Scared yet?

Google Apps For Education Sued For Data-Mining Students

Google is not supposed to be building profiles of students for advertising purposes. It’s looking like they did.

The suit maintains that, because such non-Gmail users who send emails to Gmail users never signed on to Google’s terms of services, they can never have given, in Google’s terms, “implied consent” to scan their email.

The plaintiffs are seeking payouts for millions of Gmail users. The financial damages would amount to $100 per day of each day of violation for every individual who sent or received an email message using Google Apps for Education during a two-year period beginning in May 2011.

My guess is that it’s not nefarious. Nobody was evilly rubbing together hands over all the student data they were collecting.

It’s worse than that. Building profiles of users is so built into the DNA of how Google works that they can’t remove it. And Google itself does not see the cultural difference between building a profile to increase quality of product, and building one to shill junk.

I’m not storming out of my Google account (I actually use more nowadays anyway). But I think it has become increasingly hard to argue that one should use Google on a large scale in education. They just aren’t built for it.

What the Heck Is Going On With Lesson Plan Searches?

Via @fnoschese, here’s the Google Trend on searching for “Lesson Plans”


Wow. I can think of lots of little reasons why this might be happening, but none seems sufficient in itself. I know for instance that my wife Nicole used to search a lot more for art lesson plans but two things happened. First, she got into a Pinterest community that started bringing that stuff to her. Second, after a big sweep through her materials her needs became much more specific.

So that’s a possibility, though frankly I don’t think Nicole is the median teacher.

Another possibility is that the rise of the lesson plan swapping mega-site (and the lesson plans for profit industry generally) has reduced the number of people turning to the open web for these needs.

A third possibility is that for various reasons this term is just a lousy proxy for people searching for lesson plans. I saw a few people on Twitter mention that perhaps this was due to fragmentation of queries: people are searching for more specific things. But playing around with this idea in Trends I was surprised to find that for most queries I typed in, the decline was still evident. “Common Core” queries formed a tiny, tiny blip upwards, but for most other queries (“Second Grade Reading Activities”, etc.) you found the same pattern. From whatever point Trends started tracking, it headed towards decline:


Pumping these search terms into Google Trends began to feel spooky, like stumbling on the inevitable heat death of the universe. There’s a general rule when you find something happening everywhere at once like this — it’s either an artifact of measurement, or something is going on at a really fundamental level. I’m still leaning towards artifact of measurement (we’re missing something fundamental about how these figures come about), but maybe that’s because the other explanations are either unconvincing (The Pinterest Revolution!) or disconcerting (Triumph of the Textbook Curriculum; Rise of the Lesson Plan Mega-site). Either way, though, anyone who cares about use of open resources in education should be watching this closely.
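
Here is one way a measurement artifact like that could arise, as a sketch rather than a diagnosis. It assumes the Trends index reports a term's share of all searches, rescaled so the peak period equals 100, instead of raw counts. That is my understanding of how such figures are usually described, but it is an assumption worth checking against Google's own documentation. The function name and every number below are invented for illustration.

```python
def share_index(term_counts, total_counts):
    """Term's share of all searches per period, scaled so the peak is 100."""
    shares = [t / total for t, total in zip(term_counts, total_counts)]
    peak = max(shares)
    return [round(100 * s / peak) for s in shares]


# Invented numbers: searches for the term rise every period...
term_searches = [1000, 1200, 1400, 1600]
# ...but total search volume rises much faster.
all_searches = [100_000, 200_000, 400_000, 800_000]

index = share_index(term_searches, all_searches)
print(index)  # [100, 60, 35, 20]
```

Under that assumption, a steadily falling line would tell you the term is shrinking relative to everything else people search for, which is not the same as fewer teachers looking for lesson plans.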