CS 526 - Computer Graphics II -- Yiwen Sun
Project 3 Badge and Network Traffic Viz
In this project I created a multi-view visualization to solve a cyber analytics problem:
An embassy employee is suspected of sending data to an outside criminal organization. Two data sets are provided for analysis. The goals are to identify which computer the employee most likely used to send information to his contact, and to characterize the patterns of behavior associated with suspicious computer use. A detailed description of the problem can be found at the IEEE VAST Challenge 2009.
The program was written in Java on a Windows XP system. JFreeChart is used to implement the plots.
Data
Two data sets are provided:
1) Proximity (prox) card log. This log records employees entering and leaving a facility, where they must use their badge (also called a "proximity card") to gain access to either the building or a classified area inside the building. The data are a CSV file giving the event datetime, the employee ID, and the type of event (prox-in-building, prox-in-classified, prox-out-classified). Here is an example:
Datetime          ID  Type
2008-01-01T07:28  44  prox-in-building
2008-01-01T08:31  44  prox-in-classified
2008-01-01T09:23  38  prox-in-building
2008-01-01T09:56  44  prox-out-classified
2008-01-01T11:06  44  prox-in-building
2008-01-01T11:15  38  prox-in-classified
2008-01-01T12:14  38  prox-out-classified
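As an illustration of the record layout, here is a minimal Java sketch of reading one row of the prox log. It is not the project's actual code: the class name `ProxEvent` and its fields are hypothetical, the comma delimiter is an assumption based on the file being described as CSV, and `java.time` (Java 8 or later) is assumed.

```java
import java.time.LocalDateTime;

/** One badge event; the field names mirror the columns Datetime, ID, Type. */
final class ProxEvent {
    final LocalDateTime time;
    final int employeeId;
    final String type; // prox-in-building, prox-in-classified or prox-out-classified

    ProxEvent(LocalDateTime time, int employeeId, String type) {
        this.time = time;
        this.employeeId = employeeId;
        this.type = type;
    }

    /** Parses one data row. Assumption: the three fields are separated by commas. */
    static ProxEvent parse(String line) {
        String[] f = line.trim().split("\\s*,\\s*");
        if (f.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        // LocalDateTime.parse accepts ISO timestamps without seconds, e.g. 2008-01-01T07:28
        return new ProxEvent(LocalDateTime.parse(f[0]), Integer.parseInt(f[1]), f[2]);
    }
}
```

For example, `ProxEvent.parse("2008-01-01T08:31,44,prox-in-classified")` yields an event for employee 44.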
2) Network traffic logs. The data is a month's worth of computer use in the form of IP logs. Each entry contains the source IP address, the access time, the destination IP address, the port, the size of the request in bytes (called the request payload), and the size of the response (called the response payload). Here is an example:
USER WARNING    SourceIP       AccessTime               DestIP           Socket  ReqSize  RespSize
Synthetic Data  37.170.100.38  2008-01-01T09:40:29.276  37.170.100.200   80      7063     49591
Synthetic Data  37.170.100.38  2008-01-01T09:43:08.861  37.157.76.124    80      5171     434285
Synthetic Data  37.170.100.38  2008-01-01T09:47:41.282  37.170.30.250    25      32818    182798
Synthetic Data  37.170.100.38  2008-01-01T09:49:21.413  37.116.192.39    80      4455     46397
Synthetic Data  37.170.100.38  2008-01-01T09:50:12.995  10.24.74.254     80      5949     10166
Synthetic Data  37.170.100.38  2008-01-01T09:50:41.467  105.133.117.251  80      30999    56102
Synthetic Data  37.170.100.44  2008-01-01T09:57:17.588  37.170.100.200   80      3785     53246
Synthetic Data  37.170.100.44  2008-01-01T09:57:40.142  100.204.207.157  80      70031    10505
Synthetic Data  37.170.100.38  2008-01-01T09:59:55.643  37.254.130.230   80      43917    846347
Synthetic Data  37.170.100.44  2008-01-01T10:00:08.287  101.160.27.28    80      5013     331066
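A row of the network log could be read along the following lines. This is only a sketch: `TrafficRecord` and its field names are hypothetical, and because the delimiter of the raw file is not stated here, the parser splits on commas or whitespace and takes the last six tokens, which skips the leading "Synthetic Data" marker.

```java
import java.time.LocalDateTime;

/** One IP log entry; fields follow SourceIP, AccessTime, DestIP, Socket, ReqSize, RespSize. */
final class TrafficRecord {
    final String sourceIp;
    final LocalDateTime accessTime;
    final String destIp;
    final int port;      // the "Socket" column
    final long reqSize;  // request payload in bytes
    final long respSize; // response payload

    TrafficRecord(String sourceIp, LocalDateTime accessTime, String destIp,
                  int port, long reqSize, long respSize) {
        this.sourceIp = sourceIp;
        this.accessTime = accessTime;
        this.destIp = destIp;
        this.port = port;
        this.reqSize = reqSize;
        this.respSize = respSize;
    }

    /** Assumption: fields are separated by commas and/or whitespace. */
    static TrafficRecord parse(String line) {
        String[] t = line.trim().split("[,\\s]+");
        int n = t.length;
        if (n < 6) {
            throw new IllegalArgumentException("Expected at least 6 fields: " + line);
        }
        // Index from the end so that any leading marker tokens are ignored.
        return new TrafficRecord(t[n - 6], LocalDateTime.parse(t[n - 5]), t[n - 4],
                Integer.parseInt(t[n - 3]), Long.parseLong(t[n - 2]), Long.parseLong(t[n - 1]));
    }
}
```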
Data Aggregation
The network traffic log contains over 115K entries covering one month, so we aggregate them by hour for effective visualization and analysis. Three measures (attributes) are considered in this problem:
maximum request payload per hour
maximum response payload per hour
number of packets per hour.
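The hourly aggregation of these three measures could be sketched as follows. This is not the project's actual code: the names `HourlyAggregator`, `HourlyStats`, `add` and `get` are hypothetical, each log entry is counted as one packet, and buckets are kept per source IP and per clock hour.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.Map;

/** The three measures for one source IP in one clock hour. */
final class HourlyStats {
    long maxRequest;   // maximum request payload seen in the hour
    long maxResponse;  // maximum response payload seen in the hour
    int packetCount;   // number of log entries in the hour
}

final class HourlyAggregator {
    // Key: source IP plus the start of the hour, e.g. "37.170.100.38|2008-01-01T09:00"
    private final Map<String, HourlyStats> buckets = new HashMap<>();

    private static String key(String sourceIp, LocalDateTime time) {
        return sourceIp + "|" + time.truncatedTo(ChronoUnit.HOURS);
    }

    /** Folds one log entry into the bucket for its source IP and hour. */
    void add(String sourceIp, LocalDateTime time, long reqSize, long respSize) {
        HourlyStats s = buckets.computeIfAbsent(key(sourceIp, time), k -> new HourlyStats());
        s.maxRequest = Math.max(s.maxRequest, reqSize);
        s.maxResponse = Math.max(s.maxResponse, respSize);
        s.packetCount++;
    }

    /** Returns the stats for the hour containing the given time, or null if there is no data. */
    HourlyStats get(String sourceIp, LocalDateTime anyTimeInHour) {
        return buckets.get(key(sourceIp, anyTimeInHour));
    }
}
```

Each bucket then corresponds to one point in an hourly time-series plot.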
Challenge
To detect abnormal computer use, we need to analyze the data and find a baseline model. Here, a statistical model is applied. The model is defined as the average network traffic over a week, which captures the time_of_day and day_of_week patterns. The average value over all the computers (source IPs) is defined as the baseline model.
A computer whose traffic is far beyond the baseline model and shows irregular patterns is considered suspicious.
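One way to read this baseline is as 7 x 24 weekly slots, each holding the mean of a measure over all computers and all weeks. The sketch below follows that reading. It rests on assumptions: the class `WeeklyBaseline` and its methods are hypothetical, all hourly values in a slot are pooled into one mean, and "far beyond" is expressed as a caller-supplied multiple `factor` of the baseline, since no threshold is specified here.

```java
import java.time.LocalDateTime;

/** Mean of one measure per (day_of_week, time_of_day) slot, pooled over all computers. */
final class WeeklyBaseline {
    private final double[] sum = new double[7 * 24];
    private final int[] count = new int[7 * 24];

    /** Slot index: Monday 00:00 is 0, Sunday 23:00 is 167. */
    static int slot(LocalDateTime hour) {
        return (hour.getDayOfWeek().getValue() - 1) * 24 + hour.getHour();
    }

    /** Adds one hourly value (for any computer) to its weekly slot. */
    void add(LocalDateTime hour, double value) {
        int s = slot(hour);
        sum[s] += value;
        count[s]++;
    }

    /** Baseline for the slot containing the given time; 0 if the slot has no data. */
    double baseline(LocalDateTime hour) {
        int s = slot(hour);
        return count[s] == 0 ? 0.0 : sum[s] / count[s];
    }

    /** Assumption: "far beyond" means more than factor times the baseline. */
    boolean exceeds(LocalDateTime hour, double value, double factor) {
        double b = baseline(hour);
        return b > 0 && value > factor * b;
    }
}
```

A separate `WeeklyBaseline` would be kept for each of the three measures; values for a single computer can then be tested against it hour by hour.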
Visualization and Analysis
- Visualize Network Traffic Data in Scatter Plot
In this scenario, the three measures of network traffic are visualized as time-series scatter plots.
Visualize the maximum request payload for the whole period. Here, each point represents one hour of data for one computer. Different computers are color coded.
Visualize the three measures together over a one-week period with a shared timeline.
All the computers (source IPs) are shown in a list. The user can hide or show each IP's data and change its color.
- Highlight on Network Traffic Data
The user can click a point to highlight all the packets from the same computer (source IP). The highlighted points are drawn with a black outline.
When the mouse hovers over a point, its details (source IP, date and time, measure value) are shown in a tooltip.
The raw IP log entries for a highlighted data point can be shown in a table.
- Visualize Statistical Analysis Results
The baseline model is shown as a thick red line. This makes it easy to compare one computer's behavior with the others' and to detect anomalies (e.g. the one shown in blue).
- Network traffic data can be shown as an animation
- Visualize Badge Proximity Data in Bar Chart
The two types of event (in-building, in-classified) are color coded green and yellow, respectively, in the bar chart. The length of each bar indicates the duration.
Sliders change the scale of the employee ID axis and scroll through all the employee IDs.
The raw proximity log entries for a highlighted employee can be shown in a table.
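For the classified-area bars, a duration can be derived by pairing each prox-in-classified event with the next prox-out-classified event of the same employee. The sketch below shows only that pairing. It is an assumption about how durations might be computed, not the project's code: the class `ClassifiedStays` is hypothetical, and in-building durations are left out because the log has no building exit event.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

final class ClassifiedStays {
    /**
     * Returns one duration per completed stay in the classified area.
     * times and types are parallel lists holding one employee's events, sorted by time.
     */
    static List<Duration> durations(List<LocalDateTime> times, List<String> types) {
        List<Duration> out = new ArrayList<>();
        LocalDateTime enteredAt = null;
        for (int i = 0; i < times.size(); i++) {
            String type = types.get(i);
            if ("prox-in-classified".equals(type)) {
                enteredAt = times.get(i);
            } else if ("prox-out-classified".equals(type) && enteredAt != null) {
                out.add(Duration.between(enteredAt, times.get(i)));
                enteredAt = null; // stay closed; wait for the next entry
            }
            // prox-in-building events do not affect classified stays
        }
        return out;
    }
}
```

With the sample rows for employee 44 (in at 08:31, out at 09:56), this gives a single stay of 85 minutes.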
- Linkage between the Two Visualizations
When this option is toggled on, brushing a time period in the network traffic view changes the badge proximity view to the corresponding time period.
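Such a linkage is commonly built as a listener that forwards the brushed time range from one view to the other. The following plain-Java sketch illustrates the idea without any charting library. The names `TimeRangeListener` and `TimeRangeLink` are hypothetical, and how the actual program wires its views together is not described here.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/** Implemented by a view that can jump to a given time range (e.g. the badge view). */
interface TimeRangeListener {
    void timeRangeSelected(LocalDateTime start, LocalDateTime end);
}

/** Forwards brushed time ranges to registered views while the linkage option is on. */
final class TimeRangeLink {
    private final List<TimeRangeListener> listeners = new ArrayList<>();
    private boolean enabled; // state of the linkage toggle, off by default

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    void addListener(TimeRangeListener listener) {
        listeners.add(listener);
    }

    /** Called by the brushing view (e.g. the network traffic view). */
    void brush(LocalDateTime start, LocalDateTime end) {
        if (!enabled) {
            return; // linkage is off: the other view keeps its current range
        }
        for (TimeRangeListener l : listeners) {
            l.timeRangeSelected(start, end);
        }
    }
}
```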
- User Interface
Here is a snapshot of the user interface. The left side holds the various controls; the right side is the view panel, which is a dockable tabbed panel.
- Image Export
Both views can save the current view as a PNG file.
- Solution
From the statistical analysis model, we find that computer 37.170.100.31 shows an abnormal pattern: a high request payload on Tuesday afternoon at 5 pm.
Its behavior model is shown below in purple, compared with the baseline model in red:
When linking with the proximity data, we find that employee 31's behavior is the most likely match for this computer's abnormal network pattern.
by Yiwen Sun