Monday, June 23, 2014

Basics of Statistics and Machine Learning Used to Analyze Data (3)

Machine Learning:

A branch of artificial intelligence concerned with the construction and study of systems that can learn from data.

Machine Learning vs Statistics:

Statistics is about drawing valid conclusions

It cares deeply about how the data was collected, methodology, and statistical properties of the estimator. Much of Statistics is motivated by problems where you need to know precisely what you're doing (clinical trials, other experiments).

Statistics insists on proper and rigorous methodology, and is comfortable with making and noting assumptions. It cares about how the data was collected, the resulting properties of the estimator or experiment (e.g. p-value, unbiased estimators), and the kinds of properties you would expect if you did a procedure many times.

Machine Learning is about prediction

It cares deeply about scalability and using the predictions to make decisions. Much of Machine Learning is motivated by problems that need to have answers (e.g. image recognition, text inference, ranking, computer vision, medical and healthcare, search engines.)

ML is happy to treat the algorithm as a black box as long as it works. Prediction and decision-making is king, and the algorithm is only a means to an end. It's very important in ML to make sure that your performance would improve (and not take an absurd amount of time) with more data.



Types of Machine Learning Algorithms:

  • Supervised learning algorithms are trained on labelled examples, i.e., input where the desired output is known. The supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs which can then be used speculatively to generate an output for previously unseen inputs.

  • Unsupervised learning algorithms operate on unlabelled examples, i.e., input where the desired output is unknown. Here the objective is to discover structure in the data (e.g., clustering), not to generalize a mapping from inputs to outputs.
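As a minimal illustration of the two types above (with made-up numbers), the sketch below fits a line to labelled data (supervised) and then groups unlabelled points into two clusters with a tiny one-dimensional k-means loop (unsupervised):

```python
import numpy as np

# Supervised: inputs x come with known outputs y, so we can learn a mapping.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])          # labels (desired outputs)
slope, intercept = np.polyfit(x, y, 1)      # learn y ~ slope*x + intercept

# Unsupervised: only inputs, no labels; look for structure (two clusters).
points = np.array([0.1, 0.2, 0.15, 5.0, 5.2, 4.9])
centers = np.array([points.min(), points.max()])
for _ in range(10):                          # a tiny 1-D k-means loop
    labels = np.abs(points[:, None] - centers).argmin(axis=1)
    centers = np.array([points[labels == k].mean() for k in (0, 1)])
```

The supervised half recovers a slope near 2 because the labels were generated that way; the unsupervised half separates the points near 0 from the points near 5 without ever being told which is which.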


Linear Regression:
In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X.
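A minimal sketch of this, using made-up data where y is roughly 1 + 2x plus noise: stacking a column of ones into X lets the least-squares solver estimate the intercept along with the slope.

```python
import numpy as np

# Hypothetical data: y is approximately 1 + 2*x with a little noise.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # column of 1s for the intercept
y = np.array([3.1, 4.9, 7.2, 8.8])

theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
# theta[0] is the estimated intercept, theta[1] the estimated slope
```

Gradient descent (below) reaches the same kind of answer iteratively instead of solving the system directly.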




Gradient Descent in Python:

import numpy
import pandas

def compute_cost(features, values, theta):
    """
    Compute the cost of a list of parameters, theta, given a list of features
    (input data points) and values (output data points).
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2 * m)

    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """

    # Perform num_iterations updates to the elements of theta. Every time the
    # cost is computed for a given theta, append it to cost_history.

    cost_history = []
    m = len(values)
    for i in range(num_iterations):
        predicted_values = numpy.dot(features, theta)
        theta -= (alpha / m) * numpy.dot((predicted_values - values), features)
        cost_history.append(compute_cost(features, values, theta))

    return theta, pandas.Series(cost_history)  # leave this line for the grader
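To show the update rule at work, here is a self-contained sketch with made-up data (y is roughly 1 + 2x) and a learning rate and iteration count chosen by hand; it repeats the same loop body as gradient_descent above:

```python
import numpy as np

# Toy data: values ~ 1 + 2*x; the feature matrix includes a bias column of 1s.
features = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
values = np.array([3.0, 5.0, 7.0, 9.0])
theta = np.zeros(2)
alpha, num_iterations = 0.05, 2000   # hand-picked learning rate and iteration count

m = len(values)
cost_history = []
for _ in range(num_iterations):      # same update rule as gradient_descent above
    predicted = features.dot(theta)
    theta -= (alpha / m) * features.T.dot(predicted - values)
    cost_history.append(np.square(features.dot(theta) - values).sum() / (2 * m))
```

With these settings theta converges to roughly [1, 2], and cost_history decreases over the run; too large an alpha would make the cost diverge instead.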


Coefficient of determination:
In statistics, the coefficient of determination, denoted R² and pronounced "R squared", indicates how well data points fit a statistical model – sometimes simply a line or curve.


Calculating R Squared:

import numpy as np

def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced 
    # predictions.
    # 
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.

    # YOUR CODE GOES HERE
    mean = np.mean(data)
    SSr = np.sum(np.square(data - predictions))
    SSt = np.sum(np.square(data - mean))
    r_squared = 1.0 - (SSr / SSt)


    return r_squared
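A quick self-contained check of the same computation on made-up observations and predictions, spelling out the two sums of squares:

```python
import numpy as np

# Made-up observations and model predictions for illustration.
data = np.array([2.0, 4.0, 6.0, 8.0])
predictions = np.array([2.1, 3.9, 6.2, 7.8])

mean = data.mean()
ss_res = np.square(data - predictions).sum()   # residual sum of squares
ss_tot = np.square(data - mean).sum()          # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
```

Because the predictions here sit close to the data, r_squared comes out near 1; a model no better than predicting the mean would score near 0.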


