Puzzle 17: Fraud Detection

Mind Boggling Puzzles, to keep that grey matter in shape...
Forum rules
Always post answers in a "Hidecode" tag, so that others have a chance to answer the question too.

Post by damber on Sun Nov 02, 2008 11:13 pm

OK, we had a difficult challenge last time, so we are going to keep this one relatively simple.

The challenge is to identify falsification of data sets. Given a set of numbers from a natural source (e.g. naturally occurring values like credit card payments, not machine/human generated ones like telephone numbers), the program needs to estimate the probability that the data is naturally occurring rather than falsified.

That sounds quite hard as it stands, but don't fear, we have a simple, basic way for you to check whether the data is likely to be naturally occurring or not... "Benford's Law", which gives the expected frequency of each leading digit (1-9) in a list of naturally occurring values. This way, you can determine whether the source data matches the expected profile within a given threshold.
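For anyone who has not met it before, Benford's law puts the expected share of leading digit d at log10(1 + 1/d), so about 30.1% of values start with 1 and only about 4.6% start with 9. Below is a minimal Python sketch of that profile and of counting leading digits in a data set. The function names are just illustrative choices, not part of the puzzle.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First non-zero digit of a number, ignoring sign and leading zeros."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("value has no non-zero digit")


def observed_shares(values):
    """Observed share of each leading digit 1..9 in a list of numbers."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}
```

Comparing `observed_shares(data)` against `benford_expected()` digit by digit is the starting point for whatever scoring you choose.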

The acceptable deviation threshold is up to you... and the calculation of the resulting probability of falsification is also up to you... So, you could be really strict and say the data has to have exactly the same distribution as the standard profile, or you could allow a +/- 10% variance at each data point - the choice is yours, though you will need to be able to identify 2 of the 3 data sets below correctly.
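One possible reading of the "+/- 10% at each data point" idea is a relative tolerance per digit. The sketch below assumes that reading (an absolute tolerance in percentage points would be just as defensible), and the function name and default value are made up for illustration.

```python
import math


def within_tolerance(observed, tolerance=0.10):
    """Check each digit's observed share against Benford's law.

    `observed` maps digits 1..9 to shares (fractions summing to 1).
    The deviation is measured relative to the expected share, which is
    an assumption of this sketch, not something the puzzle prescribes.
    Returns (all_digits_ok, worst_relative_deviation).
    """
    worst = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        deviation = abs(observed.get(d, 0.0) - expected) / expected
        worst = max(worst, deviation)
    return worst <= tolerance, worst
```

A relative tolerance is harsh on the high digits, whose expected shares are small, so a small data set may need a looser threshold.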

The data sets...

1. This is a valid / true dataset. Your program should return a positive judgement when validating the data - you MUST get this right.

(World Populations by Country): http://en.wikipedia.org/wiki/List_of_co ... population
Code is hidden, SHOW


2. This is a valid dataset that has been moderately changed so that it sits in between 'completely false' and 'completely true'. Your program should return a negative judgement when validating the data; however, the probability returned should reflect its potential ambiguity.

Code is hidden, SHOW


3. This is an invalid dataset that has been completely made up. Your program should return a negative judgement when validating the data - you MUST get this right.

Code is hidden, SHOW


The output expected from the program is simply 2 values:
- A Proposed Validity: e.g. "Valid" or "Invalid"
- The Probability of the validity: e.g. "80%"

If you have not heard of Benford's law, an overview video can be found here: http://videos.kirix.com/data-and-the-we ... ds-law.htm
and of course, Wikipedia can also help: http://en.wikipedia.org/wiki/Benfords_law

As always, the programming language is your choice, though the code must be posted with your answer (please use the hidecode tags).

The program should output the following validity statements, and the probability percentage estimate it calculates for each:
1. Valid - xx%
2. Invalid - xx%
3. Invalid - xx%

Your program should at least identify 1 and 3 correctly, and show that the percentage probability of the data being valid decreases for each dataset (from 1-3).
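The output format above could be produced along these lines. This is only a sketch: the 50% cutoff, the function names, and the example scores are placeholders, not values computed from the puzzle's data sets.

```python
def format_report(validity_scores, cutoff=50.0):
    """Turn per-dataset validity percentages into the requested lines.

    `cutoff` is a hypothetical threshold: scores at or above it are
    reported as "Valid", everything else as "Invalid".
    """
    lines = []
    for number, score in enumerate(validity_scores, start=1):
        verdict = "Valid" if score >= cutoff else "Invalid"
        lines.append(f"{number}. {verdict} - {score:.0f}%")
    return lines


def strictly_decreasing(validity_scores):
    """True if each dataset scores lower than the one before it."""
    return all(a > b for a, b in zip(validity_scores, validity_scores[1:]))


if __name__ == "__main__":
    example_scores = [87.2, 40.0, 3.4]  # placeholder numbers, not real results
    print("\n".join(format_report(example_scores)))
```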

Have fun... :-)
a smile is worth a thousand kind words, so smile, it's easy! :-)


CODE: $5
WORKING CODE: $500
PROPERLY DESIGNED & WORKING CODE: Priceless
damber
LTD Admin
LTD Silver - Rating: 660
Posts: 3134
Joined: Tue Oct 09, 2007 1:48 pm
Location: North Wales, UK

Re: Puzzle 17: Fraud Detection

Post by tisodotsk on Thu Nov 06, 2008 2:15 am

My solution in PHP:
Code is hidden, SHOW

Output:
Code is hidden, SHOW


Extra:

Chart of the datasets' distributions via the Google Charts API:
[Image]
(maybe it should be hidden too)
I can post the solution with this graph generation here, if anybody wants it...
I am trying to improve my English language skills. Most things I do better than this.
tisodotsk
Apprentice
LTD Bronze - Rating: 62
Posts: 22
Joined: Fri Aug 08, 2008 12:45 pm
Location: Bratislava, Slovakia

Re: Puzzle 17: Fraud Detection

Post by damber on Sun Nov 09, 2008 7:44 pm

Once again, tisodotsk, you've come up with the goods :-) Congrats!

We'll give it one more week to see if anyone else has what it takes to step up to the mark...

Re: Puzzle 17: Fraud Detection

Post by funkture on Wed Jan 04, 2012 8:29 pm

tisodotsk: it looks like you should be resetting the deviation at the top of each loop iteration.
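The PHP code being discussed is hidden, so the following is only a generic Python illustration of the kind of bug described: an accumulator that is not reset per data set lets one data set's deviation leak into the next. All names here are hypothetical.

```python
import math


def score_datasets(datasets):
    """Sum of absolute deviations from Benford's law, one score per dataset.

    Each element of `datasets` maps digits 1..9 to observed shares.
    """
    results = []
    for observed in datasets:
        # Reset the accumulator for every dataset. If this line sat
        # above the outer loop, later datasets would inherit the
        # deviation of earlier ones.
        deviation = 0.0
        for d in range(1, 10):
            expected = math.log10(1 + 1 / d)
            deviation += abs(observed.get(d, 0.0) - expected)
        results.append(deviation)
    return results
```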
funkture
Newbie
Posts: 1
Joined: Wed Jan 04, 2012 8:26 pm

Re: Puzzle 17: Fraud Detection

Post by joefkelley on Wed Jan 16, 2013 12:42 am

Here's a solution in R. I let the chisq.test function do the heavy lifting:
Code is hidden, SHOW

The transformation of the p-value into a more intuitive "confidence" is somewhat arbitrary, but seems to work well.
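The R code itself is hidden, so here is a rough standard-library Python sketch of the same general idea: a chi-square goodness-of-fit test of leading-digit counts against Benford's law. It is not joefkelley's code and does not reproduce the "confidence" transformation mentioned above. With nine digit categories there are 8 degrees of freedom, and for an even number of degrees of freedom the chi-square tail probability has a closed form, which is what the sketch uses.

```python
import math


def benford_chi_square(counts):
    """Chi-square goodness-of-fit of leading-digit counts vs Benford's law.

    `counts` maps digits 1..9 to how many values start with that digit.
    Returns (statistic, p_value) with 8 degrees of freedom.
    """
    n = sum(counts.get(d, 0) for d in range(1, 10))
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (counts.get(d, 0) - expected) ** 2 / expected
    # Survival function of chi-square with df = 8 (closed form for even df):
    # exp(-x/2) * sum_{i=0}^{3} (x/2)^i / i!
    half = statistic / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))
    return statistic, p_value
```

A small p-value means the digit counts are unlikely under Benford's law. How to map that onto a "probability of validity" is still a judgement call.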

Output on all three datasets, in order:
Code is hidden, SHOW
joefkelley
Newbie
LTD Bronze - Rating: 3
Posts: 1
Joined: Wed Jan 16, 2013 12:20 am