Wednesday, 24 July 2013

Parsing socket / cpu / hyperthreading information from /proc/cpuinfo

#!/bin/bash

# total number of sockets
NUM_SOCKETS=$(grep 'physical id' /proc/cpuinfo | sort -u | wc -l)

# number of cores per socket
NUM_CORES=$(grep 'cpu cores' /proc/cpuinfo | sort -u | awk '{print $4}')

# total number of physical cores (cores per socket * number of sockets)
NUM_PHYSICAL_CPUS=$((NUM_SOCKETS * NUM_CORES))

echo "${NUM_SOCKETS} sockets"
echo "${NUM_CORES} cores per socket"
echo "${NUM_PHYSICAL_CPUS} physical processors"

# Work out if hyperthreading is enabled.
#   This is done by working out how many siblings each core has.
#   If that differs from the number of cores per socket then
#     hyperthreading must be on.
NUM_SIBLINGS=$(grep siblings /proc/cpuinfo | sort -u | awk '{print $3}')
echo "${NUM_SIBLINGS} siblings"
if [ "${NUM_SIBLINGS}" -ne "${NUM_CORES}" ]
then
    # total number of logical cpus (ie: physical cpus + hyperthreading cpus)
    NUM_LOGICAL_CPUS=$(grep -c ^processor /proc/cpuinfo)
    echo "hyperthreading is enabled - ${NUM_LOGICAL_CPUS} logical processors"
fi

# display which socket each core is on
echo "Sockets: Cores"
egrep "physical id|processor" /proc/cpuinfo | tr '\n' ' ' | sed 's/processor/\nprocessor/g' | grep -v '^$' | awk '{printf "%d: %02d\n",$7,$3}' | sort
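Where util-linux is available, the parsed values can be cross-checked against lscpu (assuming it is installed):

```shell
# socket, core and thread topology as reported by lscpu
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
```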

Notes on performance counters and profiling with PAPI

Performance counters

There are two main ways to profile applications with performance counters: aggregate (direct) and statistical (indirect).

  • Aggregate: Involves reading the counters before and after the execution of a region of code and recording the difference. This usage model permits explicit, highly accurate, fine-grained measurements. There are two sub-cases of aggregate counter usage: Summation of the data from multiple executions of an instrumented location, and trace generation, where the counter values are recorded for every execution of the instrumentation.
  • Statistical: The PM hardware is set to generate an interrupt when a performance counter reaches a preset value. This interrupt carries with it important contextual information about the state of the processor at the time of the event. Specifically, it includes the program counter (PC), the text address at which the interrupt occurred. By populating a histogram with this data, users obtain a probabilistic distribution of PM interrupt events across the address space of the application. This kind of profiling facilitates a good high-level understanding of where and why the bottlenecks are occurring. For instance, the questions, "What code is responsible for most of the cache misses?" and "Where is the branch prediction hardware performing poorly?" can quickly be answered by generating a statistical profile.

PAPI supports two types of events, preset and native.
  • Preset events have a symbolic name associated with them that is the same for every processor supported by PAPI.
  • Native events, on the other hand, provide a means to access every possible event on a particular platform, regardless of whether a predefined PAPI event name exists for it.
PAPI supports per-thread measurement; that is, each measurement only contains counts generated by the thread performing the PAPI calls.

#include <papi.h> // defines long_long, PAPI_OK and the PAPI_* event codes

int events[2] = { PAPI_L1_DCM, PAPI_FP_OPS }; // L1 data cache misses; hardware flops
long_long values[2];

// both calls return PAPI_OK on success
PAPI_start_counters(events, 2);
// do work
PAPI_read_counters(values, 2); // copies the counts into values and resets the counters


Taken from a Dr. Dobb's article


Monday, 22 July 2013

Voluntary/involuntary context switches

$ cat /proc/$PID/status

Voluntary context switches occur when your application blocks in a system call and the kernel decides to give its time slice to another process.

Involuntary context switches occur when your application has used up the entire timeslice the scheduler allotted to it.
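These counters can be read per process from /proc; for example, for the current shell (any PID works in place of `$$`):

```shell
# voluntary_ctxt_switches / nonvoluntary_ctxt_switches for this shell
grep ctxt /proc/$$/status
```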

Monday, 17 June 2013

intercepting libc functions with LD_PRELOAD

What follows is an example of how to intercept uname

// _GNU_SOURCE must be defined before any includes so <dlfcn.h> exposes RTLD_NEXT
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>

// pseudo-handle RTLD_NEXT: find the next occurrence of a function in the search order after the current library. This allows one to provide a wrapper around a function in another shared library.
#ifndef RTLD_NEXT
#    define RTLD_NEXT ((void *) -1L)
#endif

#define REAL_LIBC RTLD_NEXT

// function pointer which will store the location of libc's uname (ie: the 'real' uname function)
static int (*real_uname)(struct utsname *buf) = 0;

static void init (void) __attribute__ ((constructor));
static void init (void)
{
    if(!real_uname)
    {
        real_uname = dlsym(REAL_LIBC, "uname");
        if(!real_uname)
        {
            fprintf(stderr, "missing symbol: uname\n");
            exit(1);
        }
    }
}

static int do_uname(struct utsname *buf, int (*uname_proc)(struct utsname *buf))
{
    return uname_proc(buf);
}
__attribute__ ((visibility("default"))) int uname(struct utsname *buf)
{
    init(); // we must always call init as constructor may not be called in some cases (such as loading 32bit pthread library)

    int rc = do_uname(buf, real_uname);
    if(!rc)
    {
        // do special processing
    }
    return rc;
}

Compile this into a shared library, and intercept libc's uname by using LD_PRELOAD=libname.so

Tuesday, 5 March 2013

SFINAE decltype comma operator trick


Note the decltype expression below contains two elements, reserve and bool: decltype(t.reserve(0), bool())

This is a trick using SFINAE and the comma operator: SFINAE will cull the function if 'reserve' doesn't exist, and the comma operator means the resulting type of the decltype expression is bool.

This means we can easily implement an 'enable_if'-esque check for the existence of a member function called 'reserve'.

// Culled by SFINAE if reserve does not exist or is not accessible
template <typename T>
constexpr auto has_reserve_method(T& t) -> decltype(t.reserve(0), bool()) { return true; }

// Used as fallback when SFINAE culls the template method
constexpr bool has_reserve_method(...) { return false; }

template <typename T, bool b>
struct Reserver
{
  static void apply(T& t, size_t n) { t.reserve(n); }
};

template <typename T>
struct Reserver <T, false>
{
  static void apply(T& t, size_t n) {}
};

template <typename T>
bool reserve(T& t, size_t n)
{
  Reserver<T, has_reserve_method(t)>::apply(t, n);
  return has_reserve_method(t);
}

(Thanks to Matthieu M for his post on stackoverflow here)

--------------------------

Another implementation uses two SFINAE functions to access a member int, items_n or items_c, ultimately falling back to 0 if neither exists

// culled by SFINAE if items_n does not exist
template<typename T>
constexpr auto has_items_n(int) -> decltype(std::declval<T>().items_n, bool())
{
    return true;
}
// catch-all fallback for items with no items_n
template<typename T> constexpr bool has_items_n(...)
{
    return false;
}
//-----------------------------------------------------

template<typename T, bool>
struct GetItemsN
{
    static int value(T& t)
    {
        return t.items_n;
    }
};
template<typename T>
struct GetItemsN<T, false>
{
    static int value(T&)
    {
        return 0;
    }
};
//-----------------------------------------------------

// culled by SFINAE if items_c does not exist
template<typename T>
constexpr auto has_items_c(int) -> decltype(std::declval<T>().items_c, bool())
{
    return true;
}
// catch-all fallback for items with no items_c
template<typename T> constexpr bool has_items_c(...)
{
    return false;
}
//-----------------------------------------------------

template<typename T, bool>
struct GetItemsC
{
    static int value(T& t)
    {
        return t.items_c;
    }
};
template<typename T>
struct GetItemsC<T, false>
{
    static int value(T&)
    {
        return 0;
    }
};
//-----------------------------------------------------

template<typename T>
int get_items(T& t)
{
    if (has_items_n<T>(0))
        return GetItemsN<T, has_items_n<T>(0)>::value(t);
    if (has_items_c<T>(0))
        return GetItemsC<T, has_items_c<T>(0)>::value(t);
    return 0;
}
//-----------------------------------------------------

When you have two candidate function templates and want to use SFINAE to choose between them, sometimes both overloads are viable for a given argument.

To prevent ambiguity you can favour one overload over the other: by exploiting implicit conversions we can make one overload a better match, resolving the ambiguity.

#include <iostream>

template<class T>
auto serialize_imp(std::ostream& os, T const& obj, int)
    -> decltype(os << obj, void())
{
    os << obj;
}

template<class T>
auto serialize_imp(std::ostream& os, T const& obj, long)
    -> decltype(obj.stream(os), void())
{
    obj.stream(os);
}

template<class T>
auto serialize(std::ostream& os, T const& obj)
    -> decltype(serialize_imp(os, obj, 0), void())
{
    serialize_imp(os, obj, 0);
}

struct X
{
    void stream(std::ostream&) const
    {
        std::cout << "\nX::stream()\n";
    }
};

int main(){
    serialize(std::cout, 5);
    X x;
    serialize(std::cout, x);
}

Here the ostream operator<< overload is chosen for an object that provides both operator<< and stream(), because passing 0 as the third argument of serialize_imp selects the overload taking int (0 is an int), whereas the overload taking long would require an implicit conversion.

(Thanks to Xeo for his post on stackoverflow here)

Monday, 4 March 2013

Boost Compute - GPGPU programming

Pre-release version of boost compute by Kyle Lutz

http://kylelutz.github.com/compute/index.html
https://github.com/kylelutz/compute