How to start the Udacity machine learning course with Anaconda, Jupyter Notebook, and Python 2.7?

I want to start Udacity's machine learning course, so I downloaded ud120-projects-master.zip and extracted it into my Downloads folder. I have installed Anaconda with Jupyter Notebook (Python 2.7).

The first mini-project is Naive Bayes, so I opened Jupyter Notebook and used %load nb_author_id.py to convert the script into an .ipynb notebook. But I think I first have to run startup.py in the tools folder to download and extract the data.

So I ran startup.ipynb:

# %load startup.py
print
print "checking for nltk"
try:
    import nltk
except ImportError:
    print "you should install nltk before continuing"

print "checking for numpy"
try:
    import numpy
except ImportError:
    print "you should install numpy before continuing"

print "checking for scipy"
try:
    import scipy
except:
    print "you should install scipy before continuing"

print "checking for sklearn"
try:
    import sklearn
except:
    print "you should install sklearn before continuing"

print
print "downloading the Enron dataset (this may take a while)"
print "to check on progress, you can cd up one level, then execute <ls -lthr>"
print "Enron dataset should be last item on the list, along with its current size"
print "download will complete at about 423 MB"
import urllib
url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
urllib.urlretrieve(url, filename="../enron_mail_20150507.tgz") 
print "download complete!"


print
print "unzipping Enron dataset (this may take a while)"
import tarfile
import os
os.chdir("..")
tfile = tarfile.open("enron_mail_20150507.tgz", "r:gz")
tfile.extractall(".")

print "you're ready to go!"

But I get an error:

checking for nltk
checking for numpy
checking for scipy
checking for sklearn

downloading the Enron dataset (this may take a while)
to check on progress, you can cd up one level, then execute <ls -lthr>
Enron dataset should be last item on the list, along with its current size
download will complete at about 423 MB




---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-1-c30fe1ced56a> in <module>()
     32 import urllib
     33 url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
---> 34 urllib.urlretrieve(url, filename="../enron_mail_20150507.tgz")
     35 print "download complete!"
     36 

This is nb_author_id.py:

# %load nb_author_id.py
#!/usr/bin/python

""" 
    This is the code to accompany the Lesson 1 (Naive Bayes) mini-project. 

    Use a Naive Bayes Classifier to identify emails by their authors

    authors and labels:
    Sara has label 0
    Chris has label 1
"""

import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()




#########################################################
### your code goes here ###


#########################################################

Running it produces this warning and output:

C:\Users\jr31964\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)




no. of Chris training emails: 7936
no. of Sara training emails: 7884

How do I get started with the Naive Bayes mini-project, and what prerequisite steps are needed?



Solution 1:[1]

Since the course is, I presume, written in Python 3, I would suggest creating a conda environment with Python 3. You can do this even though your base installation is Python 2. This should save you from converting all the Python 3 course code to Python 2.

conda create --name UdacityCourseEnvironment python=3.6

# to get into your new environment (mac/linux)
source activate UdacityCourseEnvironment

# to get into your new environment (windows)
activate UdacityCourseEnvironment

# When you need new packages inside your new environment 
conda install nameOfPackage

Source: Switching between python 2 and 3 with Conda
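
Once the environment is active, it can be worth confirming which interpreter the notebook kernel is actually using. A minimal sketch using only the standard library (the helper name describe_python is made up for illustration):

```python
import sys

def describe_python():
    """Return a short description of the interpreter running this code."""
    major, minor = sys.version_info[0], sys.version_info[1]
    return "Python %d.%d at %s" % (major, minor, sys.executable)

# Run this in a notebook cell to see which interpreter the kernel uses.
print(describe_python())
```

If the output still shows the base installation's version, the notebook is probably running the base kernel rather than the new environment's.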

Solution 2:[2]

You made the right decision to go with Anaconda: it solves a bunch of incompatibility issues between Python 2 and Python 3 and the various package dependencies. I did it the hard way and am converting the code to Python 3 (and its dependencies) as I go along, because I want an up-to-date environment and up-to-date programming skills when I finish; but that's just me.

You can ignore that deprecation warning: sklearn 0.19.0 still works. As the warning says, the cross_validation module will be removed in 0.20, so anyone who runs this with sklearn 0.20 or later will hit an import error. If you find the warning annoying (like me), you can edit the file tools/email_preprocess.py and change the following lines (original in comments):

# from sklearn import cross_validation
from sklearn.model_selection import train_test_split

and

#features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)
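
If the same copy of the code has to run against both older and newer sklearn releases, one option is to try the new module path first and fall back to the old one. A minimal sketch of that idea; first_importable is a hypothetical helper, not part of sklearn or the course code:

```python
import importlib

def first_importable(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This candidate is not available; try the next one.
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(module_names))

# Usage (requires sklearn to be installed):
# cv = first_importable(["sklearn.model_selection", "sklearn.cross_validation"])
# train_test_split = cv.train_test_split
```

The call to train_test_split itself can then stay the same, since both modules accept the arguments used in email_preprocess.py.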

Also, some installs depend on others: an earlier successful install (e.g. numpy) can cause the install of other packages (e.g. scipy) to fail, because the prerequisite for scipy is numpy+mkl. If you just installed plain numpy, it needs to be uninstalled and replaced. See more on that at https://github.com/scipy/scipy/issues/7221

The next problem I hit was that, on my machine, the volume of email files in enron_mail_20150507.tgz was so large that startup.py ran for several hours without reaching the completion message:

print "you're ready to go!"

Turns out that my IDE (PyCharm) was indexing the files as they were being unpacked and this was killing the disk. As indexing text files is unnecessary I turned that off for the directory 'maildir'. That allowed startup.py to finish.
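
An extraction that runs for hours with no output is hard to tell apart from a hang. One way to get some feedback is to extract member by member and print a running count. This is a sketch rather than something startup.py does; the function name and the report_every parameter are assumptions for illustration:

```python
import tarfile

def extract_with_progress(archive_path, dest_dir=".", report_every=10000):
    """Extract a gzipped tar archive, printing a running count of members.

    Returns the total number of members extracted.
    """
    count = 0
    with tarfile.open(archive_path, "r:gz") as tfile:
        for member in tfile:
            tfile.extract(member, dest_dir)
            count += 1
            if count % report_every == 0:
                print("extracted %d members so far" % count)
    return count

# Usage, with the archive name from startup.py:
# extract_with_progress("enron_mail_20150507.tgz", ".")
```

If the count keeps climbing but slowly, the disk is the likely bottleneck, which is what the IDE indexing turned out to be in my case.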

The urllib error comes from a change in the package: in Python 3, urlretrieve moved to urllib.request, so you need to change the import statement to:

import urllib.request

...and then your line 34 (error message above) to:

urllib.request.urlretrieve(url, filename="../enron_mail_20150507.tgz")
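
If the script needs to run under both Python 2 and Python 3, the import can be guarded instead of rewritten. The sketch below also skips the download when the archive is already on disk, which avoids repeating the roughly 423 MB transfer. The download_if_missing helper is hypothetical; the URL and filename in the usage comment are the ones from startup.py:

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve  # Python 2 location

def download_if_missing(url, filename):
    """Download url to filename unless the file already exists.

    Returns True if a download was performed, False if it was skipped.
    """
    if os.path.exists(filename):
        return False
    urlretrieve(url, filename)
    return True

# Usage, with the values from startup.py:
# url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
# download_if_missing(url, "../enron_mail_20150507.tgz")
```

A partially downloaded file would also be treated as present, so delete it before retrying after an interrupted download.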

Note also this link on github is very helpful: https://github.com/MLTO/general/wiki/Python-Setup-for-Udacity-ud120-course

The rest of this answer relates to Windows 10, so Linux users can skip this.

The next problem I encountered was that some of the package imports were failing because the installs were not correctly optimized for Windows 10. An invaluable resource for resolving this is a set of Windows-optimized .whl (wheel) files that can be found at http://www.lfd.uci.edu/~gohlke/pythonlibs/

The next problem was that unpacking the .tgz file introduced the probably familiar LF/CRLF line-ending issues between Linux and Windows files. There is a fix for this from @monkshow92 on GitHub here: https://github.com/udacity/ud120-projects/issues/46
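
The general shape of such a fix is to rewrite the affected files with Unix line endings. A minimal sketch of that idea, not necessarily the exact script from the linked issue; the function name is a placeholder:

```python
def convert_crlf_to_lf(src_path, dst_path):
    """Copy src_path to dst_path, replacing CRLF line endings with LF.

    Returns the number of CRLF sequences that were replaced.
    """
    # Read and write in binary mode so Python does not translate newlines itself.
    with open(src_path, "rb") as src:
        content = src.read()
    replaced = content.count(b"\r\n")
    with open(dst_path, "wb") as dst:
        dst.write(content.replace(b"\r\n", b"\n"))
    return replaced
```

Writing to a separate destination keeps the original file intact. Only apply this to text-format files; rewriting a genuinely binary file this way would corrupt it.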

Apart from that, it was a breeze....

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Michael James Kali Galarnyk
Solution 2