Sunday 8 December 2013

Debugging node servers with node-inspector

Install node inspector

$ npm install -g node-inspector

Run node in debug mode

$ node --debug your/node/program.js

Alternatively, if node is already running without --debug, enable the debugger by sending the process a USR1 signal

$ pkill -USR1 node

Launch node-inspector in background mode

$ node-inspector &

Open up the debugger in your browser by navigating to http://localhost:8080/debug?port=5858

If you have something like livereload restarting your node server when you change a source file, then re-launching node-inspector can be done with something like this:

$ PID=`ps -ef | grep node-inspector | grep -v grep | awk '{print $2}'`; if [[ -n "$PID" ]]; then echo "Killing node-inspector (pid $PID)"; kill $PID; fi; echo "Sending USR1 signal to node"; pkill -USR1 node; echo "Launching node-inspector"; node-inspector &
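
The same steps as a more readable script (a sketch; it drops the wait call from the one-liner above, since wait only works on child processes of the current shell):

#!/bin/bash
# Restart node-inspector: kill any running instance, re-enable debugging
# on the node server, then relaunch the inspector in the background.
PID=$(pgrep -f node-inspector)
if [[ -n "$PID" ]]; then
    echo "Killing node-inspector (pid $PID)"
    kill $PID
fi
echo "Sending USR1 signal to node"
pkill -USR1 node
echo "Launching node-inspector"
node-inspector &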

Node inspector GitHub page: https://github.com/node-inspector/node-inspector

Wednesday 23 October 2013

atomics & fences

Acquire
    cannot move anything up beyond an acquire

Release
    cannot move anything down beyond a release

Note
    acquire/release cannot be reordered with respect to each other
 
What does this mean?

instructions can be reordered from before an acquire to after an acquire
instructions can be reordered from after a release to before a release
acquire cannot be reordered before or after a release
release cannot be reordered before or after an acquire

std::atomic
    read = load_acquire --> read the value == acquire the value
    write = store_release --> write the value == release the value
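
A minimal C++11 sketch of these rules, using the classic message-passing pattern (the names here are illustrative):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                   // A: plain write
    ready.store(true, std::memory_order_release);   // B: A cannot move down past B
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))  // C: the read below cannot move up past C
        ;                                           // spin until the release is visible
    assert(payload == 42);                          // A happens-before this read
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}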

Sequential Consistency
Transitivity / Causality
Total Store Order


using /proc/irq/#/smp_affinity to shield cpus from IRQs

An IRQ (interrupt request) is a hardware-level request for service, raised by a device and handled by the kernel.

When an IRQ arrives the kernel switches to interrupt context and runs the ISR (interrupt service routine) registered for that IRQ number, which processes the interrupt.

IRQs have an affinity mask which defines which CPUs the ISR can run on. This is defined in /proc/irq/#/smp_affinity.

The irqbalance service distributes IRQs across the processors in their associated affinity masks on a multiprocessor system. It uses /proc/irq/#/smp_affinity if it exists, otherwise it falls back to /proc/irq/default_smp_affinity.

The affinity mask in /proc/irq/#/smp_affinity and /proc/irq/default_smp_affinity is a 32-bit hex bitmask of CPUs.
    eg: ffffffff means all 32 CPUs 0-31
If there are more than 32 cores, we build up multiple comma-separated 32-bit hex masks.
    eg: ffffffff,00000000 means CPUs 32-63

The irqbalance service uses the environment variable IRQBALANCE_BANNED_CPUS to tell it which CPUs it can't use for ISRs and IRQBALANCE_BANNED_INTERRUPTS to tell it which IRQs to ignore.

IRQBALANCE_BANNED_CPUS follows the same comma-separated 32-bit hex format as /proc/irq/#/smp_affinity.
IRQBALANCE_BANNED_INTERRUPTS is a space-separated list of integer IRQs.
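
For example, to stop irqbalance routing interrupts to CPU 2 (mask 0x4), set something like this in irqbalance's environment file (on Fedora/RHEL that's /etc/sysconfig/irqbalance; the path is distro-specific):

# keep CPU 2 free of balanced IRQs
IRQBALANCE_BANNED_CPUS=00000004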

We can shield some CPUs from being interrupted in order to dedicate them to our own tasks.

We can also pin a network-data-consuming process onto a certain CPU and set the affinity mask of the associated NIC (network interface controller) IRQs to the same CPU so that they can share cache lines, as in the sketch below.
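
A sketch of both ideas together; the IRQ number and CPU here are hypothetical, so check /proc/interrupts for your NIC's actual IRQs:

# suppose eth0's receive queue is IRQ 24 and we dedicate CPU 2 (mask 0x4)
grep eth0 /proc/interrupts            # discover the NIC's IRQ number(s)
echo 4 > /proc/irq/24/smp_affinity    # steer IRQ 24's ISR onto CPU 2 only
taskset -c 2 ./network_consumer       # pin the consuming process to the same CPU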



Monday 21 October 2013

Create angular app and express server / mongodb with yeoman

install (or update) yeoman
npm install -g yo 
npm update -g yo

install (or update) yeoman angular fullstack generator (angular frontend / express server)
npm install -g generator-angular-fullstack
npm update -g generator-angular-fullstack

create angular app
yo angular-fullstack [name] # creates ng-app="nameApp"; if blank, uses the current directory name
I chose yes to add bootstrap, no to scss authoring, selected angular-resource (ngResource) and angular-route (ngRoute) from the module list, and yes to mongoose/mongodb

add angular-bootstrap (angular directives for twitter bootstrap)
bower install angular-bootstrap --save

run karma tests
grunt karma

get rid of the missing file warning by commenting out the following line from the files array:
    'test/mock/**/*.js',

grab PhantomJS for headless testing
http://phantomjs.org/download.html

configure karma to run PhantomJS instead of Chrome:
in karma.conf.js change the browsers array to PhantomJS
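
The relevant setting in karma.conf.js ends up something like this (assumes PhantomJS is on your PATH, or is provided by the karma-phantomjs-launcher plugin):

// karma.conf.js
browsers: ['PhantomJS'],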

run karma tests again to validate there are no warnings and that we're running through PhantomJS
grunt karma

serve angular app
grunt server


Github page for the angular-fullstack generator here:
https://github.com/DaftMonk/generator-angular-fullstack

Great blog entry with more details here:

Wednesday 16 October 2013

MongoDB on Fedora 19

Install and start server

yum install mongodb-server
systemctl start mongod
systemctl enable mongod
systemctl status mongod

Install client and verify it can connect to the server

yum install mongodb
mongo

You should now be in the mongo shell - test you can save and retrieve an object

db.test.save( { a: 1 } )
db.test.find()

Should display something like this:

{ "_id" : ObjectId("525f2fb01ec8e4af43c529c0"), "a" : 1 }

Thursday 3 October 2013

hello world app with node.js and express.js

requires node and express to be installed (see the 2 October entry below for instructions)

go to the root directory where your hello world app will be created
$ cd ~/src/web

create an express app called 'helloWorld' (and use the less css tool)
$ express helloWorld -c less

obtain the required dependencies and install them
$ cd helloWorld
$ npm install

run the server
$ npm start # npm calls 'node app'

add nodemon to our devDependencies so the server gets restarted automatically during development
$ vim package.json  

add the following
"devDependencies": {
    "nodemon": "*"
}

change the scripts / start value:    
"start": "nodemon app.js"

download dependency nodemon
$ npm install

start the server via nodemon
$ npm start

Wednesday 2 October 2013

node.js / express.js / yeoman / angular installation on Fedora 19

download the prebuilt binary:
http://nodejs.org/dist/v0.10.20/node-v0.10.20-linux-x64.tar.gz

or, download and build from source:
cd /tmp
wget http://nodejs.org/dist/v0.10.20/node-v0.10.20.tar.gz
tar -xf node-v0.10.20.tar.gz
cd node-v0.10.20/

configure, build and install
export PREFIX=/usr/local # or whatever your prefix is
./configure --prefix=$PREFIX
export LINK=g++ # only required if you're building on NFS
make
make install


clean up temporary files
rm -rf /tmp/node-v0.10.20*

add node to your path
export PATH=$PREFIX/bin:$PATH

display node and npm versions
node --version
v0.10.20
npm --version
1.3.11


install express.js
npm install -g express

display express version
express --version
3.4.0


install yeoman
npm install -g yo

install yeoman angular generator
npm install -g generator-angular

create angular app
yo angular app-name

serve angular app
grunt server


Wednesday 25 September 2013

Setting up a basic local git repo

create a basic local repo

$ git init

configure git with my user

$ git config --global user.email email.address@email.com
$ git config --global user.name "My Name"

create a .gitignore file

$ echo install/ > .gitignore

add all files not in .gitignore to git

$ git add .

show my pending adds

$ git status

perform initial commit

$ git commit -m 'created'

show current status

$ git status
# On branch master
nothing to commit (working directory clean)

Friday 20 September 2013

execute a command remotely using ssh

Execute a command remotely

eg: Set remote date to local date (useful if say you don't have ntpd running)

    ssh root@nas 'ntpdate 0.fedora.pool.ntp.org 1.fedora.pool.ntp.org'

Sunday 18 August 2013

Generalized function evaluation

#include <type_traits>
#include <utility>

// functions, functors, lambdas, etc.
template<
    class F, class... Args,
    class = typename std::enable_if<!std::is_member_function_pointer<F>::value>::type,
    class = typename std::enable_if<!std::is_member_object_pointer<F>::value>::type
    >
auto eval(F&& f, Args&&... args) -> decltype(f(std::forward<Args>(args)...))
{
    return f(std::forward<Args>(args)...);
}

// const member function
template<class R, class C, class P, class... Args>
auto eval(R(C::*f)(Args...) const, P&& p, Args&&... args) -> R
{
    return (*p.*f)(std::forward<Args>(args)...);
}

template<class R, class C, class... Args>
auto eval(R(C::*f)(Args...) const, C& c, Args&&... args) -> R
{
    return (c.*f)(std::forward<Args>(args)...);
}

// non-const member function
template<class R, class C, class P, class... Args>
auto eval(R(C::*f)(Args...), P&& p, Args&&... args) -> R
{
    return (*p.*f)(std::forward<Args>(args)...);
}

// member object
template<class R, class C>
auto eval(R(C::*m), const C& c) -> const R&
{
    return c.*m;
}

template<class R, class C>
auto eval(R(C::*m), C& c) -> R&
{
    return c.*m;
}
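
A quick usage sketch (Foo is a made-up type):

#include <iostream>

struct Foo
{
    int x = 42;
    int  get() const { return x; }
    void set(int v)  { x = v; }
};

int main()
{
    auto add = [](int a, int b) { return a + b; };
    std::cout << eval(add, 1, 2) << '\n';      // functor/lambda: prints 3

    Foo foo;
    std::cout << eval(&Foo::get, foo) << '\n'; // const member function: prints 42
    eval(&Foo::set, &foo, 7);                  // non-const member function, via pointer
    std::cout << eval(&Foo::x, foo) << '\n';   // member object: prints 7
}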

Taken from here: http://functionalcpp.wordpress.com/2013/08/03/generalized-function-evaluation/

Sunday 4 August 2013

Quickly stabilize NTP

By design NTP converges on the correct time slowly so as to prevent jumps in the clock time. It achieves this by varying the frequency of the clock rather than stepping the time.

When the clock is far out (say, after a reboot when the hardware clock isn't in sync with the system clock), NTP applies a huge frequency correction to accelerate the convergence between the kernel clock time and the real time, and this causes overshoot.

Overshoot causes the clock to oscillate, eventually settling down to the correct time and stabilizing.

If we want to quickly stabilize NTP we can apply the following process:

1. Stop NTP - /etc/init.d/ntp stop
2. Reset kernel bias - /usr/sbin/ntptime -f 0
3. Run ntpdate to sync the time - ntpdate -p8 <server>
4. Repeat step 3 several times; the extra measurements give the kernel a more accurate idea of the time
5. Start NTP - /etc/init.d/ntp start
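
As a single script, something like this (a sketch: the init-script path and the server are placeholders, as in the steps above):

#!/bin/bash
/etc/init.d/ntp stop            # 1. stop NTP
/usr/sbin/ntptime -f 0          # 2. reset the kernel frequency bias
for i in 1 2 3 4; do            # 3/4. sync several times for more measurements
    ntpdate -p8 0.pool.ntp.org
done
/etc/init.d/ntp start           # 5. start NTP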

Wednesday 24 July 2013

Parsing socket / cpu / hyperthreading information from /proc/cpuinfo

#!/bin/bash

# total number of sockets
NUM_SOCKETS=`grep physical\ id /proc/cpuinfo | sort -u | wc -l`

# total number of cores per socket
NUM_CORES=`grep cpu\ cores /proc/cpuinfo | sort -u | awk '{print $4}'`

# total number of physical cpus (cores per socket * number of sockets)
NUM_PHYSICAL_CPUS=$((NUM_SOCKETS * NUM_CORES))

echo "${NUM_SOCKETS} sockets"
echo "${NUM_CORES} cores per socket"
echo "${NUM_PHYSICAL_CPUS} physical processors"

# Work out if hyperthreading is enabled. 
#   This is done by working out how many siblings each core has
#   If it's not the same as the number of cores per socket then 
#     hyperthreading must be on
NUM_SIBLINGS=`grep siblings /proc/cpuinfo | sort -u | awk '{print $3}'`
echo "$NUM_SIBLINGS siblings"
if [ ${NUM_SIBLINGS} -ne ${NUM_CORES} ]
then
    # total number of local cpus (ie: physical cpus + hyperthreading cpus)
    NUM_LOGICAL_CPUS=`grep processor /proc/cpuinfo | sort -u | wc -l`
    echo "hyperthreading is enabled - ${NUM_LOGICAL_CPUS} logical processors"
fi

# display which socket each core is on
echo "Sockets: Cores"
cat /proc/cpuinfo | egrep "physical id|processor" | tr \\n ' ' | sed 's/processor/\nprocessor/g' | grep -v ^$ | awk '{printf "%d: %02d\n",$7,$3}' | sort

Notes on performance counters and profiling with PAPI

Performance counters

There are 2 main types of profiling applications with performance counters: aggregate (direct) and statistical (indirect).

  • Aggregate: Involves reading the counters before and after the execution of a region of code and recording the difference. This usage model permits explicit, highly accurate, fine-grained measurements. There are two sub-cases of aggregate counter usage: Summation of the data from multiple executions of an instrumented location, and trace generation, where the counter values are recorded for every execution of the instrumentation.
  • Statistical: The PM hardware is set to generate an interrupt when a performance counter reaches a preset value. This interrupt carries with it important contextual information about the state of the processor at the time of the event. Specifically, it includes the program counter (PC), the text address at which the interrupt occurred. By populating a histogram with this data, users obtain a probabilistic distribution of PM interrupt events across the address space of the application. This kind of profiling facilitates a good high-level understanding of where and why the bottlenecks are occurring. For instance, the questions, "What code is responsible for most of the cache misses?" and "Where is the branch prediction hardware performing poorly?" can quickly be answered by generating a statistical profile.

PAPI supports two types of events, preset and native. 
  • Preset events have a symbolic name associated with them that is the same for every processor supported by PAPI. 
  • Native events, on the other hand, provide a means to access every possible event on a particular platform, regardless of there being a predefined PAPI event name for it.
PAPI supports measurements per-thread; that is, each measurement only contains counts generated by the thread performing the PAPI calls.

#include <papi.h> // assumes the PAPI library is installed

int events[2] = { PAPI_L1_DCM, PAPI_FP_OPS }; // L1 data cache misses; hardware flops
long_long values[2];                          // long_long is PAPI's 64-bit typedef

if (PAPI_start_counters(events, 2) != PAPI_OK)
    ; // handle the error
// do work
if (PAPI_read_counters(values, 2) != PAPI_OK)
    ; // handle the error


Taken from a Dr. Dobb's article


Monday 22 July 2013

Voluntary/involuntary context switches

$ cat /proc/$PID/status

Voluntary context switches occur when your application is blocked in a system call and the kernel decides to give its time slice to another process.

Involuntary (non-voluntary) context switches occur when your application has used up the entire time slice the scheduler attributed to it.
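
The counters appear in the status file as voluntary_ctxt_switches and nonvoluntary_ctxt_switches, so to see just those:

$ grep ctxt_switches /proc/$PID/status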

Monday 17 June 2013

intercepting libc functions with LD_PRELOAD

What follows is an example of how to intercept uname

// pseudo-handle RTLD_NEXT: find the next occurrence of a function in the search order after the current library. This allows one to provide a wrapper around a function in another shared library.
#include <dlfcn.h>       // dlsym
#include <stdio.h>       // fprintf
#include <stdlib.h>      // exit
#include <sys/utsname.h> // struct utsname

#ifndef RTLD_NEXT
#    define RTLD_NEXT ((void *) -1L)
#endif

#define REAL_LIBC RTLD_NEXT

// function pointer which will store the location of libc's uname  (ie: the 'real' uname function)
int (*real_uname)(struct utsname *buf) = 0;

static void init (void) __attribute__ ((constructor));
static void init (void)
{
    if(!real_uname)
    {
        real_uname = dlsym(REAL_LIBC, "uname");
        if(!real_uname)
        {
            fprintf(stderr, "missing symbol: uname");
            exit(1);
        }
    }
}

static int do_uname(struct utsname *buf, int (*uname_proc)(struct utsname *buf))
{
    return uname_proc(buf);
}
__attribute__ ((visibility("default"))) int uname(struct utsname *buf)
{
    init(); // we must always call init as constructor may not be called in some cases (such as loading 32bit pthread library)

    int rc = do_uname(buf, real_uname);
    if(!rc)
    {
        // do special processing
    }
    return rc;
}

Compile this into a shared library, and intercept libc's uname by using LD_PRELOAD=libname.so
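
For example, assuming the interception code above lives in uname_intercept.c (the file and library names are arbitrary):

$ gcc -shared -fPIC -o libuname_intercept.so uname_intercept.c -ldl
$ LD_PRELOAD=$PWD/libuname_intercept.so uname -a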

Tuesday 5 March 2013

SFINAE decltype comma operator trick


Note the decltype expression below contains 2 elements, reserve and bool: decltype(t.reserve(0), bool())

This is a trick using SFINAE and the comma operator: SFINAE will cull the function if 'reserve' doesn't exist, and the comma operator means the result type of the decltype expression is bool.

This means we can easily implement an 'enable_if'-esque check for the existence of a member function called 'reserve':

// Culled by SFINAE if reserve does not exist or is not accessible
template <typename T>
constexpr auto has_reserve_method(T& t) -> decltype(t.reserve(0), bool()) { return true; }

// Used as fallback when SFINAE culls the template method
constexpr bool has_reserve_method(...) { return false; }

template <typename T, bool b>
struct Reserver
{
  static void apply(T& t, size_t n) { t.reserve(n); }
};

template <typename T>
struct Reserver <T, false>
{
  static void apply(T& t, size_t n) {}
};

template <typename T>
bool reserve(T& t, size_t n)
{
  Reserver<T, has_reserve_method(t)>::apply(t, n);
  return has_reserve_method(t);
}
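
A quick usage sketch:

#include <vector>
#include <list>

int main()
{
    std::vector<int> v;
    std::list<int> l;
    reserve(v, 100); // vector has reserve(): capacity grows, returns true
    reserve(l, 100); // list has no reserve(): falls back to the no-op, returns false
}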

(Thanks to Matthieu M for his post on stackoverflow here)

--------------------------

Another implementation, with 2 SFINAE functions to access a member int, items_n or items_c, ultimately falling back to 0 if neither exists:

// culled by SFINAE if items_n does not exist
template<typename T>
constexpr auto has_items_n(int) -> decltype(std::declval<T>().items_n, bool())
{
    return true;
}
// catch-all fallback for items with no items_n
template<typename T> constexpr bool has_items_n(...)
{
    return false;
}
//-----------------------------------------------------

template<typename T, bool>
struct GetItemsN
{
    static int value(T& t)
    {
        return t.items_n;
    }
};
template<typename T>
struct GetItemsN<T, false>
{
    static int value(T&)
    {
        return 0;
    }
};
//-----------------------------------------------------

// culled by SFINAE if items_c does not exist
template<typename T>
constexpr auto has_items_c(int) -> decltype(std::declval<T>().items_c, bool())
{
    return true;
}
// catch-all fallback for items with no items_c
template<typename T> constexpr bool has_items_c(...)
{
    return false;
}
//-----------------------------------------------------

template<typename T, bool>
struct GetItemsC
{
    static int value(T& t)
    {
        return t.items_c;
    }
};
template<typename T>
struct GetItemsC<T, false>
{
    static int value(T&)
    {
        return 0;
    }
};
//-----------------------------------------------------

template<typename T>
int get_items(T& t)
{
    if (has_items_n<T>(0))
        return GetItemsN<T, has_items_n<T>(0)>::value(t);
    if (has_items_c<T>(0))
        return GetItemsC<T, has_items_c<T>(0)>::value(t);
    return 0;
}
//-----------------------------------------------------

When you have two candidate function templates and want to use SFINAE to choose between them, sometimes you may have a parameter for which both overloads will work.

To prevent ambiguity you can favour one overload over the other.

By using implicit conversions we can make one overload a better match, thereby resolving the ambiguity.

#include <iostream>

template<class T>
auto serialize_imp(std::ostream& os, T const& obj, int)
    -> decltype(os << obj, void())
{
    os << obj;
}

template<class T>
auto serialize_imp(std::ostream& os, T const& obj, long)
    -> decltype(obj.stream(os), void())
{
    obj.stream(os);
}

template<class T>
auto serialize(std::ostream& os, T const& obj)
    -> decltype(serialize_imp(os, obj, 0), void())
{
    serialize_imp(os, obj, 0);
}

struct X
{
    void stream(std::ostream&) const
    {
        std::cout << "\nX::stream()\n";
    }
};

int main(){
    serialize(std::cout, 5);
    X x;
    serialize(std::cout, x);
}

Here the ostream operator overload is chosen for an object that has both operator<< and stream(): passing 0 as the 3rd argument of serialize_imp selects the overload with the int parameter, because 0 is an int, whereas the long parameter would require an implicit conversion.

(Thanks to Xeo for his post on stackoverflow here)

Monday 4 March 2013

Boost Compute - GPGPU programming

Pre-release version of boost compute by Kyle Lutz

http://kylelutz.github.com/compute/index.html
https://github.com/kylelutz/compute

Sunday 3 March 2013

C++ implementation of the Disruptor pattern

original: https://github.com/fsaintjacques/disruptor--

fork: https://github.com/jwakely/disruptor--

Friday 22 February 2013

enable samba for sharing to windows

/etc/samba/smb.conf:

[global]
# workgroup must match the windows workgroup
workgroup = WORKGROUP
server string = NAS samba server %v

security = user
passdb backend = tdbsam

[homes]
comment = Home Directories
browseable = yes
writable = yes

[raid]
path = /mnt/raid/
public = yes
writable = yes
browseable = yes
available = yes
create mask = 0777
directory mask = 0777

$ systemctl start smb.service nmb.service
$ systemctl enable smb.service nmb.service

After starting the samba service, you need to enable samba with SELinux.

Details here: http://linux.die.net/man/8/samba_selinux

$ setsebool -P samba_domain_controller on
$ setsebool -P samba_enable_home_dirs on
$ chcon -t samba_share_t /mnt/raid/
$ semanage fcontext -a -t samba_share_t "/mnt/raid(/.*)?"
$ restorecon -R -v /mnt/raid/