Code for writing a categorical decision tree in Python

Decision trees are widely used for classification in machine learning.

Advantages: low computational complexity, results that are easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Disadvantage: prone to overfitting.
Applicable data types: numeric and nominal.

1. Information gain

The purpose of splitting a dataset is to make disordered data more ordered. One way to organize disordered data is to use measures from information theory. The usual choice is information gain, which is the difference between the entropy before and after the split. The more disordered the data, the higher the entropy, and the feature that yields the largest information gain is the best one to split on.
Entropy is defined as the expected value of information. The information of symbol x_i is defined as:

$$l(x_i) = -\log_2 p(x_i)$$

where p(x_i) is the probability of choosing that class.
Entropy, the expected value of this information over all n classes, is:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
The code for calculating entropy is as follows:

import math

def calcShannonEnt(dataSet):
  numEntries = len(dataSet)
  labelCounts = {}
  # Count how often each class label (last column) occurs
  for featVec in dataSet:
    currentLabel = featVec[-1]
    if currentLabel not in labelCounts:
      labelCounts[currentLabel] = 0
    labelCounts[currentLabel] += 1
  shannonEnt = 0
  for key in labelCounts:
    prob = labelCounts[key]/numEntries
    shannonEnt -= prob*math.log2(prob)
  return shannonEnt

With entropy available, the dataset can be split on whichever feature gives the maximum information gain.
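
As a quick sanity check of the entropy formula, the sketch below computes the entropy of five labels, two 'yes' and three 'no', the same class distribution as the sample data in the complete listing. The helper name `entropy` and the use of `collections.Counter` are choices made here for illustration and are not part of the original code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

labels = ['yes', 'yes', 'no', 'no', 'no']
h = entropy(labels)
print(round(h, 4))  # 0.971
```

A value close to 1 bit means the two classes are almost evenly mixed.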

2. Splitting the dataset

Splitting the dataset means extracting all rows whose value for the chosen feature equals a given value, with that feature's column removed.

def splitDataSet(dataSet,axis,value):
  retDataset = []
  for featVec in dataSet:
    if featVec[axis] == value:
      # Copy the row without the column used for the split
      newVec = featVec[:axis]
      newVec.extend(featVec[axis+1:])
      retDataset.append(newVec)
  return retDataset
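
To make the effect of a split concrete, here is a minimal sketch of the same idea written as a list comprehension and applied to the sample rows from the complete listing. The name `split_rows` is hypothetical and only used for this illustration.

```python
def split_rows(rows, axis, value):
    """Return rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
left = split_rows(data, 0, 1)
right = split_rows(data, 0, 0)
print(left)   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(right)  # [[1, 'no'], [1, 'no']]
```

Because slicing builds new lists, the original rows are left unchanged.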

3. Choosing the best feature to split on

Information gain is the reduction in entropy, that is, the reduction in disorder, achieved by a split.

def chooseBestFeatureToSplit(dataSet):
  numFeatures = len(dataSet[0]) - 1  # the last column is the class label
  bestInfoGain = 0
  bestFeature = -1
  baseEntropy = calcShannonEnt(dataSet)
  for i in range(numFeatures):
    allValue = [example[i] for example in dataSet]  # list comprehension: all values of feature i
    allValue = set(allValue)  # unique values of feature i
    newEntropy = 0
    for value in allValue:
      splitset = splitDataSet(dataSet,i,value)
      # Weight each subset's entropy by its share of the rows
      newEntropy = newEntropy + len(splitset)/len(dataSet)*calcShannonEnt(splitset)
    infoGain = baseEntropy - newEntropy
    if infoGain > bestInfoGain:
      bestInfoGain = infoGain
      bestFeature = i
  return bestFeature  # -1 if no feature gives a positive gain
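
The sketch below works through the information gain of each feature on the sample rows from the complete listing. The helper names `entropy` and `info_gain` are assumptions made for this example; unlike `splitDataSet`, the subsets here keep the split column, which does not affect the entropy because only the class label in the last column is counted.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy (base 2) of the class labels in the last column."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, axis):
    """Entropy of `rows` minus the weighted entropy of the subsets split on column `axis`."""
    remainder = 0.0
    for value in {row[axis] for row in rows}:
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(data, i) for i in range(2)]
print([round(g, 3) for g in gains])  # [0.42, 0.171]
```

Feature 0 has the larger gain, so it is the one chosen for the first split.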

4. Recursively create a decision tree

Recursion stops when all features of the dataset have been used up, or when all instances under a branch belong to the same class.
If all features have been used but the class labels in a branch are still mixed, a majority vote decides the class of the leaf node.

import operator

def majorityCnt(classList):
 classCount = {}
 for value in classList:
  if value not in classCount: classCount[value] = 0
  classCount[value] += 1
 # Sort (label, count) pairs by count, most frequent first
 classCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
 return classCount[0][0]
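
As an alternative sketch, the same majority vote can be written with `collections.Counter` from the standard library. This swaps in a different technique from the one used in the article; the name `majority_label` is hypothetical.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label in class_list."""
    # most_common(1) returns a one-element list of (label, count) pairs
    return Counter(class_list).most_common(1)[0][0]

winner = majority_label(['no', 'yes', 'no'])
print(winner)  # no
```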

Generate decision tree:

def createTree(dataSet,labels):
 classList = [example[-1] for example in dataSet]
 labelsCopy = labels[:]
 # Stop when every instance has the same class
 if classList.count(classList[0]) == len(classList):
  return classList[0]
 # Stop when no features are left and fall back to a majority vote
 if len(dataSet[0]) == 1:
  return majorityCnt(classList)
 bestFeature = chooseBestFeatureToSplit(dataSet)
 # No feature improves on the current entropy: also use a majority vote
 if bestFeature == -1:
  return majorityCnt(classList)
 bestLabel = labelsCopy[bestFeature]
 myTree = {bestLabel:{}}
 featureValues = [example[bestFeature] for example in dataSet]
 featureValues = set(featureValues)
 del labelsCopy[bestFeature]
 for value in featureValues:
  subLabels = labelsCopy[:]
  myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
 return myTree
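
The tree is a nested dictionary: each inner node has a single key (the feature name) whose value maps feature values to subtrees or class labels. Tracing the algorithm by hand on the five-row sample from the complete listing gives the dictionary written out literally below. The helpers `count_leaves` and `depth` are hypothetical and serve only to show how such a structure can be walked.

```python
# Tree traced by hand for the sample data; written as a literal for illustration
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

def count_leaves(node):
    """Number of leaf (class label) nodes below `node`."""
    if not isinstance(node, dict):
        return 1
    branches = next(iter(node.values()))  # {feature value: subtree or label}
    return sum(count_leaves(child) for child in branches.values())

def depth(node):
    """Number of feature tests on the longest path from `node` to a leaf."""
    if not isinstance(node, dict):
        return 0
    branches = next(iter(node.values()))
    return 1 + max(depth(child) for child in branches.values())

print(count_leaves(tree), depth(tree))  # 3 2
```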

5. Testing the algorithm: classifying with the decision tree

Classification also works recursively: at each node, follow the branch that matches the test vector's value for that feature until a leaf is reached.

def classify(inputTree, featLabels, testVec):
 currentFeat = list(inputTree.keys())[0]
 secondTree = inputTree[currentFeat]
 # Raises ValueError if the tree uses a feature that is missing from featLabels
 featureIndex = featLabels.index(currentFeat)
 classLabel = None  # stays None if the test value was never seen in training
 for value in secondTree.keys():
  if value == testVec[featureIndex]:
   if isinstance(secondTree[value], dict):
    # Inner node: keep descending
    classLabel = classify(secondTree[value], featLabels, testVec)
   else:
    # Leaf node: this is the class label
    classLabel = secondTree[value]
 return classLabel
6. The complete code is as follows

import math
import operator
def createDataSet():
 dataSet = [[1,1,'yes'],
    [1,1,'yes'],
    [1,0,'no'],
    [0,1,'no'],
    [0,1,'no'],]
 label = ['no surfacing','flippers']
 return dataSet,label
def calcShannonEnt(dataSet):
 numEntries = len(dataSet)
 labelCounts = {}
 # Count how often each class label (last column) occurs
 for featVec in dataSet:
  currentLabel = featVec[-1]
  if currentLabel not in labelCounts:
   labelCounts[currentLabel] = 0
  labelCounts[currentLabel] += 1
 shannonEnt = 0
 for key in labelCounts:
  prob = labelCounts[key]/numEntries
  shannonEnt -= prob*math.log2(prob)
 return shannonEnt
def splitDataSet(dataSet,axis,value):
 retDataset = []
 for featVec in dataSet:
  if featVec[axis] == value:
   # Copy the row without the column used for the split
   newVec = featVec[:axis]
   newVec.extend(featVec[axis+1:])
   retDataset.append(newVec)
 return retDataset
def chooseBestFeatureToSplit(dataSet):
 numFeatures = len(dataSet[0]) - 1  # the last column is the class label
 bestInfoGain = 0
 bestFeature = -1
 baseEntropy = calcShannonEnt(dataSet)
 for i in range(numFeatures):
  allValue = [example[i] for example in dataSet]
  allValue = set(allValue)
  newEntropy = 0
  for value in allValue:
   splitset = splitDataSet(dataSet,i,value)
   # Weight each subset's entropy by its share of the rows
   newEntropy = newEntropy + len(splitset)/len(dataSet)*calcShannonEnt(splitset)
  infoGain = baseEntropy - newEntropy
  if infoGain > bestInfoGain:
   bestInfoGain = infoGain
   bestFeature = i
 return bestFeature  # -1 if no feature gives a positive gain
def majorityCnt(classList):
 classCount = {}
 for value in classList:
  if value not in classCount: classCount[value] = 0
  classCount[value] += 1
 # Sort (label, count) pairs by count, most frequent first
 classCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
 return classCount[0][0]
def createTree(dataSet,labels):
 classList = [example[-1] for example in dataSet]
 labelsCopy = labels[:]
 # Stop when every instance has the same class
 if classList.count(classList[0]) == len(classList):
  return classList[0]
 # Stop when no features are left and fall back to a majority vote
 if len(dataSet[0]) == 1:
  return majorityCnt(classList)
 bestFeature = chooseBestFeatureToSplit(dataSet)
 # No feature improves on the current entropy: also use a majority vote
 if bestFeature == -1:
  return majorityCnt(classList)
 bestLabel = labelsCopy[bestFeature]
 myTree = {bestLabel:{}}
 featureValues = [example[bestFeature] for example in dataSet]
 featureValues = set(featureValues)
 del labelsCopy[bestFeature]
 for value in featureValues:
  subLabels = labelsCopy[:]
  myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
 return myTree
def classify(inputTree, featLabels, testVec):
 currentFeat = list(inputTree.keys())[0]
 secondTree = inputTree[currentFeat]
 # Raises ValueError if the tree uses a feature that is missing from featLabels
 featureIndex = featLabels.index(currentFeat)
 classLabel = None  # stays None if the test value was never seen in training
 for value in secondTree.keys():
  if value == testVec[featureIndex]:
   if isinstance(secondTree[value], dict):
    # Inner node: keep descending
    classLabel = classify(secondTree[value], featLabels, testVec)
   else:
    # Leaf node: this is the class label
    classLabel = secondTree[value]
 return classLabel
if __name__ == "__main__":
 dataset, label = createDataSet()
 myTree = createTree(dataset,label)
 a = [1,1]
 print(classify(myTree,label,a))

7. Programming techniques

Difference between extend and append

 newVec.extend(featVec[axis+1:])
 retDataset.append(newVec)

extend() adds each element of the given list to the end of the target list
append() adds its argument as a single element to the end of the target list
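
A minimal illustration of the difference, using throwaway lists `a` and `b` that are not part of the original code:

```python
a = [1, 2]
a.extend([3, 4])  # each element of the argument is added individually
print(a)  # [1, 2, 3, 4]

b = [1, 2]
b.append([3, 4])  # the whole argument becomes one new element
print(b)  # [1, 2, [3, 4]]
```

This is why splitDataSet uses extend to rebuild a row and append to add the finished row to the result.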

List comprehension

Creating a new list

allValue = [example[i] for example in dataSet]

Extracting the unique elements of a list

allValue = set(allValue)
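
A short sketch of both steps on three made-up rows. The variable names are chosen for this example; `sorted` is applied only because a set has no guaranteed display order.

```python
data = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]

# List comprehension: collect column 1 of every row
column = [row[1] for row in data]
print(column)  # [1, 0, 1]

# set() drops duplicates
unique = set(column)
print(sorted(unique))  # [0, 1]
```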

Sorting a list/tuple with the sorted() function

classCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
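
To show what this call produces, the sketch below sorts a small hand-made count dictionary. The dictionary contents are made up for the example.

```python
import operator

class_count = {'yes': 2, 'no': 3}

# items() yields (label, count) pairs; itemgetter(1) selects the count as the sort key
ranked = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)  # [('no', 3), ('yes', 2)]
```

Note that sorted() returns a list of tuples, not a dictionary, which is why majorityCnt reads the winner with `[0][0]`.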

Copying a list

labelsCopy = labels[:]
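
The copy matters because createTree deletes entries from the label list. The sketch below, with illustrative variable names, contrasts a slice copy with a plain assignment, which only creates a second name for the same list.

```python
labels = ['no surfacing', 'flippers']

labels_copy = labels[:]  # new list with the same elements
del labels_copy[0]
print(labels)       # ['no surfacing', 'flippers']
print(labels_copy)  # ['flippers']

alias = labels  # no copy: both names refer to one list
del alias[0]
print(labels)  # ['flippers']
```

A slice makes a shallow copy, which is enough here because the labels are plain strings.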

Code and dataset download: Decision tree
