CvSVM: Balanced Cross Validation (Feature #314)


Added by Arman Savran almost 15 years ago. Updated over 14 years ago.


Status: Done
Priority: High
Assignee: -
% Done: 0%
Category: ml
Target version: -
Start date: -
Due date: -
Difficulty: -
Pull request: -

Description

I suggest an improvement to CvSVM::train_auto (mlsvm.cpp), which performs cross-validation to optimize the hyper-parameters. I noticed that the current code does not ensure balanced cross-validation sets; it only randomly permutes the samples. However, if the numbers of positive and negative samples differ greatly, this becomes an important issue. Below is a patch that keeps the original ratio of positive and negative samples in every cross-validation set. Would you consider adding it?

Arman
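
For context, here is a minimal standalone sketch of the same idea, independent of the mlsvm.cpp internals (the helper name stratifiedFolds and the toy labels are hypothetical, not part of the patch): each class is spread round-robin over the folds, so every fold keeps roughly the global positive/negative ratio. The actual patch below instead works in place on the already permuted samples array, greedily swapping samples between the fold with the lowest and the fold with the highest smallest-class ratio.

// Standalone illustration only; not part of the proposed patch.
#include <cstdio>
#include <vector>

// Returns a fold index (0..k_fold-1) for every sample, distributing each
// class round-robin so the per-fold class ratios stay close to the global one.
static std::vector<int> stratifiedFolds( const std::vector<int>& labels, int k_fold )
{
    std::vector<int> fold( labels.size() );
    int next_pos = 0, next_neg = 0;          // round-robin counters per class
    for( size_t i = 0; i < labels.size(); ++i )
    {
        if( labels[i] > 0 )
            fold[i] = next_pos++ % k_fold;   // positives spread over all folds
        else
            fold[i] = next_neg++ % k_fold;   // negatives spread over all folds
    }
    return fold;
}

int main()
{
    // 12 positives and 3 negatives: a purely random split can easily leave a
    // fold with no negatives at all; the stratified assignment cannot.
    const int raw[] = {1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1};
    std::vector<int> labels( raw, raw + sizeof(raw)/sizeof(raw[0]) );
    std::vector<int> fold = stratifiedFolds( labels, 3 );
    for( size_t i = 0; i < labels.size(); ++i )
        std::printf( "sample %2d  label %+d  fold %d\n", (int)i, labels[i], fold[i] );
    return 0;
}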


// Per-fold bookkeeping: how many samples of the smallest and the biggest
// class fall into fold `ind`, and the resulting smallest-class ratio `val`.
struct indexedratio
{
    double val;
    int ind;
    int count_smallest, count_biggest;

    void eval() { val = (double)count_smallest/(count_smallest+count_biggest); }
};

// qsort comparator: orders folds by ascending smallest-class ratio, so the
// most deficient fold comes first and the most over-represented fold last.
static int CV_CDECL
icvCmpIndexedratio( const void* a, const void* b )
{
    return ((const indexedratio*)a)->val < ((const indexedratio*)b)->val ? -1
         : ((const indexedratio*)a)->val > ((const indexedratio*)b)->val ? 1
         : 0;
}

-------------------------------------------------------------------------------------------------------------------------
The code below goes right after the block marked // randomly permute samples and responses
(around line 1750 of mlsvm.cpp).

bool do_balanced = true;
if( !is_regression && class_labels->cols==2 && do_balanced )
{
    // count class samples
    int num_0=0, num_1=0;
    for( i = 0; i < sample_count; ++i )
    {
        if( responses->data.i[i] == class_labels->data.i[0] )
            ++num_0;
        else
            ++num_1;
    }

    int label_smallest_class;
    int label_biggest_class;
    if( num_0 < num_1 )
    {
        label_biggest_class = class_labels->data.i[1];
        label_smallest_class = class_labels->data.i[0];
    }
    else
    {
        label_biggest_class = class_labels->data.i[0];
        label_smallest_class = class_labels->data.i[1];
        int y;
        CV_SWAP( num_0, num_1, y );
    }
    const double class_ratio = (double)num_0/sample_count;

    // calculate the class ratio of each fold
    indexedratio* ratios = 0;
    ratios = (indexedratio*)cvAlloc( k_fold*sizeof(*ratios) );
    for( int k = 0, i_begin = 0; k < k_fold; ++k, i_begin += testset_size )
    {
        int count0 = 0;
        int count1 = 0;
        int i_end = i_begin + (k < k_fold-1 ? testset_size : last_testset_size);
        for( int i = i_begin; i < i_end; ++i )
        {
            if( responses->data.i[i] == label_smallest_class )
                ++count0;
            else
                ++count1;
        }
        ratios[k].ind = k;
        ratios[k].count_smallest = count0;
        ratios[k].count_biggest = count1;
        ratios[k].eval();
    }

    // initial distance between the fold ratios and the global class ratio
    qsort( ratios, k_fold, sizeof(ratios[0]), icvCmpIndexedratio );
    double old_dist = 0.0;
    for( int k = 0; k < k_fold; ++k )
        old_dist += fabs( ratios[k].val - class_ratio );
    double new_dist = 1.0;

    // iterate to make the folds more balanced
    while( new_dist > 0.0 )
    {
        if( ratios[0].count_biggest==0 || ratios[k_fold-1].count_smallest==0 )
            break; // we are not able to swap samples anymore
        // what if we swap the samples: calculate the new distance
        ratios[0].count_smallest++;
        ratios[0].count_biggest--;
        ratios[0].eval();
        ratios[k_fold-1].count_smallest--;
        ratios[k_fold-1].count_biggest++;
        ratios[k_fold-1].eval();
        qsort( ratios, k_fold, sizeof(ratios[0]), icvCmpIndexedratio );
        new_dist = 0.0;
        for( int k = 0; k < k_fold; ++k )
            new_dist += fabs( ratios[k].val - class_ratio );
        if( new_dist < old_dist )
        {
            // swapping really improves, so swap the samples
            // index of a biggest_class sample from the minimum-ratio fold
            int i1 = ratios[0].ind * testset_size;
            for( ; i1 < sample_count; ++i1 )
            {
                if( responses->data.i[i1] == label_biggest_class )
                    break;
            }
            // index of a smallest_class sample from the maximum-ratio fold
            int i2 = ratios[k_fold-1].ind * testset_size;
            for( ; i2 < sample_count; ++i2 )
            {
                if( responses->data.i[i2] == label_smallest_class )
                    break;
            }
            // swap the two samples and their responses
            const float* temp;
            int y;
            CV_SWAP( samples[i1], samples[i2], temp );
            CV_SWAP( responses->data.i[i1], responses->data.i[i2], y );
            old_dist = new_dist;
        }
        else
            break; // does not improve, so break the loop
    }
    cvFree( &ratios );
}

Associated revisions

Revision bad4ca2a
Added by Vadim Pisarevsky over 14 years ago

added the optional balanced cross-validation in SVM::train_auto (by arman, ticket #314)

Revision abb9e086
Added by Andrey Kamaev about 12 years ago

Merge pull request #314 from vpisarev:2.4

History

Updated by Vadim Pisarevsky over 14 years ago

Thank you, and sorry for the delay! Your patch has finally been integrated in r4128. I added a balanced parameter to CvSVM::train_auto. By default this feature is turned off until it is better tested, but users can pass balanced=true to run your balancing algorithm.

  • Status changed from Open to Done
  • (deleted custom field) set to fixed
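
For reference, a sketch of how the integrated feature might be invoked, assuming the cv::Mat overload of CvSVM::train_auto from later OpenCV 2.x releases and that the balanced flag is its trailing parameter (defaulting to false, as stated above); the toy data and RNG seed are made up for illustration.

// Usage sketch only; check the ml.hpp of your OpenCV version for the exact
// train_auto signature.
#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

int main()
{
    // Imbalanced toy problem: 40 positives around 0.25, 8 negatives around 0.75.
    cv::Mat samples( 48, 2, CV_32F ), responses( 48, 1, CV_32S );
    cv::RNG rng( 12345 );
    for( int i = 0; i < 48; ++i )
    {
        float center = (i < 40) ? 0.25f : 0.75f;
        samples.at<float>(i,0) = center + (float)rng.gaussian( 0.1 );
        samples.at<float>(i,1) = center + (float)rng.gaussian( 0.1 );
        responses.at<int>(i)   = (i < 40) ? 1 : -1;
    }

    CvSVMParams params;
    params.svm_type    = CvSVM::C_SVC;
    params.kernel_type = CvSVM::RBF;

    CvSVM svm;
    // k_fold = 5, default grids for every hyper-parameter, balanced = true so
    // that each cross-validation fold keeps the 40:8 class ratio.
    svm.train_auto( samples, responses, cv::Mat(), cv::Mat(), params, 5,
                    CvSVM::get_default_grid(CvSVM::C),
                    CvSVM::get_default_grid(CvSVM::GAMMA),
                    CvSVM::get_default_grid(CvSVM::P),
                    CvSVM::get_default_grid(CvSVM::NU),
                    CvSVM::get_default_grid(CvSVM::COEF),
                    CvSVM::get_default_grid(CvSVM::DEGREE),
                    true );
    return 0;
}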
