
*What Is Data Science?*

Analyzing real, and often dirty, data using a mixture of programming and statistics. Or, as Josh Wills put it:

A data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.

From my perspective, the whole process looks like this:

- ask a question that is relevant to the project
- get data (CSV, SQL, plain text)
- process it (joining, cleaning, supplementing it)
- run analysis (statistical tests or machine learning)
- interpret and use results (being able to understand the above)
- present results (a report, plot, interactive data visualization)
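
To make those steps concrete, here is a minimal sketch in Python using only the standard library. Everything specific in it is an assumption for illustration: the inline CSV, the column names, and the helper functions (`load_rows`, `clean_rows`, `mean_by_city`, `report`) are made up, and a real project would read from a file or database and probably lean on a dedicated library.

```python
import csv
import io
import statistics

# Hypothetical raw data: an inline CSV stands in for a real file or SQL query.
RAW_CSV = """city,temperature_c
Oslo,4.5
Oslo,
Lima,19.0
Lima,21.0
oslo,5.5
"""

def load_rows(text):
    """Get data: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_rows(rows):
    """Process: normalize city names and drop rows with a missing reading."""
    cleaned = []
    for row in rows:
        value = row["temperature_c"].strip()
        if not value:
            continue  # skip dirty rows that have no measurement
        cleaned.append((row["city"].strip().title(), float(value)))
    return cleaned

def mean_by_city(cleaned):
    """Analyze: average temperature per city."""
    groups = {}
    for city, temp in cleaned:
        groups.setdefault(city, []).append(temp)
    return {city: statistics.mean(temps) for city, temps in groups.items()}

def report(means):
    """Present: a plain-text summary, one line per city."""
    return "\n".join(f"{city}: {avg:.1f} C" for city, avg in sorted(means.items()))

rows = load_rows(RAW_CSV)
means = mean_by_city(clean_rows(rows))
print(report(means))
```

Notice that most of the code deals with loading and cleaning; the analysis itself is a single line. That ratio is fairly typical.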

To get into data science, it also helps to understand its history, which is discussed briefly after the bottom line of this post.

It’s an exciting time for data science. The field is new, but growing quickly. There’s huge demand for data scientists – average compensation in SF is well north of 100 thousand dollars a year. Where there’s money, there are also people trying to earn it. The data science skills gap means that many people are learning or trying to learn data science.

The first step to learning data science is usually asking “how do I learn data science?”. The response to this question tends to be a long list of courses to take and books to read, starting with linear algebra or statistics. I went through this myself a few years ago when I was learning. I had no programming background, but knew that I wanted to work with data.

I can’t fully explain how immensely unmotivating it is to be given a huge list of resources without any context. It’s akin to a teacher handing you a stack of textbooks and saying “read all of these”. I struggled with this approach when I was in school. If I had started learning data science this way, I never would have kept going.

Some people learn best with a list of books, but I learn best by building and trying things. I learn when I’m motivated, and when I know why I’m learning something. Best of all, when you learn this way, you come out with immediately useful skills. From my conversations with new learners over the years, I know many share these views.

That’s why I don’t think your first goal should be to learn linear algebra or statistics. If you want to learn data science, your first goal should be to learn to love data. Interested in finding out how? Read on to see how to actually learn data science.

*Learn To Love Data*

Nobody ever talks about motivation in learning. Data science is a broad and fuzzy field, which makes it hard to learn. Really hard. Without motivation, you’ll end up stopping halfway through and believing you can’t do it, when the fault isn’t with you; it’s with the teaching.

You need something that will motivate you to keep learning, even when it’s midnight, formulas are starting to look blurry, and you’re wondering if this will be the night that neural networks finally make sense.

You need something that will make you find the linkages between statistics, linear algebra, and neural networks. Something that will prevent you from struggling with the “what do I learn next?” question.

My entry point to data science was predicting the stock market, although I didn’t know it at the time. Some of the first programs I coded to predict the stock market involved almost no statistics. But I knew they weren’t performing well, so I worked day and night to make them better.

I was obsessed with improving the performance of my programs. I was obsessed with the stock market. I was learning to love data. And because I was learning to love data, I was motivated to learn anything I needed to make my programs better.

Not everyone is obsessed with predicting the stock market, I know. But it’s important to find the thing that makes you want to learn.

It can be figuring out new and interesting things about your city, mapping all the devices on the internet, finding the real positions NBA players play, mapping refugees by year, or anything else. The great thing about data science is that there are infinite interesting things to work on: it’s all about asking questions and finding a way to get answers.

Take control of your learning by tailoring it to what you want to do, not the other way around.

*Learn Data Science Skills On Your Own*

Going into data science means acquiring several useful skill sets that can be employed across many sectors, from pharmaceuticals and research to marketing and technology. But it also requires a lot of commitment. Luckily, there are plenty of online courses to start the fire. Try an Introduction to Data Science course on an online education platform like Coursera or Udemy. It’s hard not to get excited about data science after seeing all the possibilities! Follow that up by trying your hand at programming, in either R or Python. In school, be sure to take statistics, perhaps even an advanced class. If all these topics still interest you, you may be looking at a career in data science. You can also try courses in linear algebra or machine learning to further test the waters.

**1. Learn by doing**

Learning about neural networks, image recognition, and other cutting-edge techniques is important. But most data science doesn’t involve any of it:

- 90% of your work will be data cleaning.
- Knowing a few algorithms really well is better than knowing a little about many algorithms.
- If you know linear regression, k-means clustering, and logistic regression well, can explain and interpret their results, and can actually complete a project from start to finish with them, you’ll be much more employable than if you know every single algorithm, but can’t use them.
- Most of the time, when you use an algorithm, it will be a version from a library (you’ll rarely be coding your own SVM implementations, since it takes too long).

What all of this means is that the best way to learn is to work on projects. By working on projects, you gain skills that are immediately applicable and useful. You also have a nice way to build a portfolio.

One technique to start projects is to find a dataset you like. Answer an interesting question about it. Rinse and repeat.

Here are some good places to find datasets to get you started:

Another technique (and my technique) was to find a deep problem, predicting the stock market, that could be broken down into small steps. I first connected to the Yahoo Finance API and pulled down daily price data. I then created some indicators, like the average price over the past few days, and used them to predict the future (no real algorithms here, just technical analysis). This didn’t work so well, so I learned some statistics, and then used linear regression. Then I connected to another API, scraped minute-by-minute data, and stored it in a SQL database. And so on, until the algorithm worked well.
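
As a rough illustration of the indicator and linear regression steps, here is a small sketch. It is not my original code: the prices are made up, the helpers `moving_average` and `fit_line` are hypothetical names, and the regression is a plain least-squares fit written by hand so the example needs no libraries.

```python
def moving_average(prices, window):
    """Mean of the previous `window` prices, for each day with enough history."""
    return [
        sum(prices[i - window:i]) / window
        for i in range(window, len(prices))
    ]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up daily closing prices, purely for illustration.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
window = 3

indicator = moving_average(prices, window)  # feature: mean of the prior 3 days
targets = prices[window:]                   # label: the price on the day itself

slope, intercept = fit_line(indicator, targets)

# Predict the next, unseen day from the most recent 3 prices.
next_indicator = sum(prices[-window:]) / window
prediction = slope * next_indicator + intercept
print(f"slope={slope:.2f} intercept={intercept:.2f} next={prediction:.2f}")
```

On a perfectly linear toy series like this one the fit is exact, which real price data never is. The point is only to show how an indicator becomes the input to a regression.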

The great thing about this is that I had context for my learning. I didn’t just learn SQL syntax; I used it to store price data, and thus learned 10x as much as I would have by just studying syntax. Knowledge that is never applied isn’t retained very well, and won’t prepare you to do actual data science work.
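
Here is a minimal sketch of what storing price data in SQL can look like. It assumes SQLite through Python’s built-in `sqlite3` module, which is my choice for the example and not necessarily the database you’ll end up using. The table layout, the ticker `ABC`, and the quotes are invented.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices ("
    "symbol TEXT, ts TEXT, price REAL, PRIMARY KEY (symbol, ts))"
)

# Made-up minute-by-minute quotes for a hypothetical ticker.
quotes = [
    ("ABC", "2016-01-04T09:30", 10.00),
    ("ABC", "2016-01-04T09:31", 10.05),
    ("ABC", "2016-01-04T09:32", 9.95),
]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", quotes)
conn.commit()

# Let the database answer the question instead of looping in Python.
low, high, count = conn.execute(
    "SELECT MIN(price), MAX(price), COUNT(*) FROM prices WHERE symbol = ?",
    ("ABC",),
).fetchone()
print(low, high, count)
```

Even a toy like this forces you to think about keys, types, and queries, which is exactly the kind of context that plain syntax drills don’t give you.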

**2. Learn to communicate insights**

Data scientists constantly need to present the results of their analysis to others. Skill at doing this can be the difference between an okay and a great data scientist.

Part of communicating insights is understanding the topic and theory well. Another part is understanding how to clearly organize your results. The final piece is being able to explain your analysis clearly.

It’s hard to get good at communicating complex concepts effectively, but here are some things you should try:

- Start a blog. Post the results of your data analysis.
- Try to teach your less tech-savvy friends and family about data science concepts. It’s amazing how much teaching can help you understand concepts.
- Try to speak at meetups.
- Use GitHub to host all your analysis.

Get active on data science blogs and communities like DataCamp, Quora, DataTau, and the machine learning subreddit.

**3. Learn from peers**

It’s amazing how much you can learn from working with others. In data science, teamwork can also be very important in a job setting.

Some ideas here:

- Find people to work with at meetups.
- Contribute to open source packages.
- Message people who write interesting data analysis blogs to see if you can collaborate.

Try out Kaggle, a machine learning competition site, and see if you can find a teammate.

**4. Constantly increase the degree of difficulty**

Are you completely comfortable with the project you’re working on? Was the last time you used a new concept a week ago? It’s time to work on something more difficult. Data science is a steep mountain to climb, and if you stop climbing, it’s easy to never make it.

If you find yourself getting too comfortable, here are some ideas:

- Work with a larger dataset. Learn to use Spark.
- See if you can make your algorithm faster.
- How would you scale your algorithm to multiple processors? Can you do it?
- Understand the theory of the algorithm you’re using in more depth. Does this change your assumptions?
- Try to teach a novice to do the same things you’re doing now.

*Prepare For A Degree Program*

Will you need a degree to become a data scientist? Probably. In fact, 88% of data scientists have a master’s and 46% have a PhD. Most of these scientists, however, never took specialized “data science” courses. Many of them started in related fields and then turned their skills toward data science. The real question is: what should you study to become a data scientist?

The answer depends heavily on the individual. More surprisingly, employers and working data scientists are quite skeptical of specialized “data science” degrees. Not every degree program is worthwhile, and many programs are simply repackaged existing courses with no deeper understanding of data science. Some insiders recommend getting a BA in statistics to create a solid theoretical foundation for your career. Many more suggest supplementing traditional statistical and computer science studies with online courses in data science topics like SQL, NoSQL or Hadoop.

When choosing a university program, it’s key to choose based on the quality of the curriculum and professors rather than just the title. A degree in data science is useless if it doesn’t include the skills required for the job. When choosing a bachelor’s program, be sure it will enable you to pursue a master’s. Even if a master’s seems far off or impossible, a degree in computer science, mathematics, statistics or engineering may make it much easier to get into the field.

*Perfect The Soft Skills*

The term “soft skills” refers to abilities that are personal rather than learned skills like coding. In data science, soft skills are actually much more important than they appear. This has a lot to do with the career trajectory of data scientists. A degree or a certain skill isn’t necessarily a “fast track” into real data science work. On top of powerful skill sets, data scientists must be adaptable and prepared to use their abilities in a variety of ways. It’s important to prove you have not only theoretical knowledge, but practical knowledge as well. Building a portfolio is just as important as going to class. More importantly, students can build portfolios completely on their own.

After grasping the basics of data science and analytics, students can play around with data tools and create real results. There are several open source tools available to mine data, to analyze it or create visualizations. Try asking a question and use data to find the answer. Data can be found all over the internet, often in nice downloadable collections. Try mining Twitter for information on what’s popular or who’s saying what. Learn from Wikidata and put your findings into visualizations. Open source programs and open data are all free to use. Technology will always be changing, but it’s good to become acquainted with popular programs and how they work. Degrees may teach skills, but doing data science is the only way to get good at data science.

**Get Comfortable With Data And Have Fun**

While data science is full of theoretical skills that can be tough and time-consuming to learn, don’t forget about the fun aspects of data. The internet is full of great datasets and visualizations, so get inspired by what’s out there! Check out TED Talks on data use to see how people are using data in real life. Read up on the history of data science to understand where it comes from and what it means. Try to understand all the different ways data science is used.

If you’re getting stuck on the lingo, try the Big Data Dictionary. Or read up on the three most important algorithms and find out what they really do. Tune into a data podcast on your drive to school. There’s no one path to becoming a data scientist, so find out what part of data excites you, follow it, and make your way into data science.

*Bottom line*

This is less a roadmap of exactly what to do than it is a rough set of guidelines to follow as you learn data science. If you do all of these things well, you’ll find that you’re naturally developing data science expertise.

I generally dislike the “here’s a big list of stuff” approach, because it makes it extremely hard to figure out what to do next. I’ve seen a lot of people give up learning when confronted with a giant list of textbooks and MOOCs.

I personally believe that anyone can learn data science if they approach it with the right frame of mind.

**History Of Data Science**

The idea of data science spans many different fields, and has been slowly making its way into the mainstream for over fifty years. In fact, many considered last year the fiftieth anniversary of its official introduction. While many proponents have taken up the torch and made new assertions and challenges, there are a few names and dates you need to know.

**1962**. **John Tukey** writes “The Future of Data Analysis.” In this paper, published in The Annals of Mathematical Statistics, a major venue for statistical research, he brought the relationship between statistics and analysis into question. One famous quote has since struck a chord with modern data lovers:

*“For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt…I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”*

**1974**. After Tukey, there is another important name that any data enthusiast should know: **Peter Naur**. He published the Concise Survey of Computer Methods, which surveyed data processing methods across a wide variety of applications. More importantly, the very term “data science” is used repeatedly. Naur offers his own definition of the term: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.” It would take some time for the ideas to really catch on, but the general push toward data science started to pop up more and more often after his paper.

**1977**. The International Association for Statistical Computing (IASC) was founded. Their mission was to “link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.” In this year, Tukey also published a second major work: “Exploratory Data Analysis.” Here, he argues that emphasis should be placed on using data to suggest hypotheses for testing, and that exploratory data analysis should work side-by-side with confirmatory data analysis. In **1989**, the first Knowledge Discovery in Databases (KDD) workshop was organized, which would become the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).

In **1994** the early forms of modern marketing began to appear. One example comes from the Business Week cover story “Database Marketing.” Here, readers get the news that companies are gathering all kinds of data in order to start new marketing campaigns. While companies had yet to figure out what to do with all of the data, the ominous line that “still, many companies believe they have no choice but to brave the database-marketing frontier” marked the beginning of an era.

In **1996**, the term “data science” appeared for the first time in a conference title, at the meeting of the International Federation of Classification Societies in Japan. The topic? “Data science, classification, and related methods.” The next year, in **1997**, **C.F. Jeff Wu** gave an inaugural lecture titled simply “Statistics = Data Science?”

Already in **1999**, we get a glimpse of the burgeoning field of big data. Jacob Zahavi, quoted in “Mining Data for Nuggets of Knowledge” in Knowledge@Wharton, had some more insight that would only prove to be true over the following years:

*“Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Scalability is a huge issue in data mining. Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address web-site decisions.”*

And this was only in **1999**! **2001** brought even more, including the first usage of “software as a service,” the fundamental concept behind cloud-based applications. Data science and big data seemed to grow and work perfectly with the developing technology. Another important name is **William S. Cleveland**. He co-edited Tukey’s collected works, developed valuable statistical methods, and published the paper “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” Cleveland put forward the notion that data science was an independent discipline and named six areas in which he believed data scientists should be educated: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.

**2008**. The term “data scientist” is often attributed to Jeff Hammerbacher and DJ Patil, of Facebook and LinkedIn, because they carefully chose it. Attempting to describe their teams and work, they settled on “data scientist” and a buzzword was born. (Oh, and Patil continues to make waves as the current Chief Data Scientist at the White House Office of Science and Technology Policy.)

**2010**. The term “data science” has fully infiltrated the vernacular. Between just **2011** and **2012**, “data scientist” job listings increased 15,000%. There has also been an increase in conferences and meetups devoted solely to data science and big data. The theme of data science hasn’t only become popular by this point, it has become highly developed and incredibly useful.

**2013** was the year data got really big. IBM shared statistics showing that 90% of the world’s data had been created in the preceding two years alone.

**2016** may have only just begun, but predictions are already being made for the upcoming year. Data science is entrenched in machine learning, and many expect this to be the year of deep learning. With access to vast amounts of data, deep learning will be key to moving forward into new areas. This will go hand-in-hand with opening up data and creating open source data solutions that enable non-experts to take part in the data science revolution.

In the past decade, the idea of data science exploded and slowly became what we recognize today. One vital point analysts understand is that data science and big data are not simply a matter of “scaling up” data. Instead, they represent a shift in study and analysis. Although data science seems completely ordinary in today’s world, like something that could not possibly be removed from research and study, its nature and importance were not always so clear, and its exact nature will continue to develop alongside technology.