
Some ideas for setting up a couple of servers


Andy
08-11-2012, 20:17
Perhaps you could think about rate limiting the API for people who are abusing it, or whose scripts are not using it correctly. This would force those users to do their own caching, taking that responsibility away from you. A lot of places do this; you should too.
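A minimal sketch of such a limit, reusing the memcache the site already runs; the key scheme, the 60 requests/minute cap and the $apiKey variable are made up for illustration:

Code:
<?php
// Naive per-API-key rate limit on top of the existing memcache.
$mc = new Memcache();
$mc->connect('127.0.0.1', 11211);

$bucket = 'rate:' . $apiKey . ':' . date('YmdHi'); // one bucket per minute
$hits = $mc->increment($bucket);
if ($hits === false) {            // first hit this minute, create the bucket
    $mc->set($bucket, 1, 0, 120); // expire after two minutes
    $hits = 1;
}
if ($hits > 60) {                 // assumed cap: 60 requests per minute
    header('HTTP/1.1 429 Too Many Requests');
    exit;
}
?>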

Kode
08-11-2012, 20:02
That sounds interesting Myatu, but I wasn't even talking about the images, just the API. Tbh I don't know what kind of bottleneck the HDD is going to be, because we are currently on an SSD but running out of space, hence moving to a normal disk.

Myatu
08-11-2012, 19:06
Quote Originally Posted by Kode
I'm not sure trying to access 12 million (and growing) small files a month from a 3.5" disk is going to be better than returning them from memory or a database query; the disk would be a pretty big bottleneck, I would have thought. Thanks for the ideas though.
One way of dealing with that is hashing the filename - MD5 is enough for that. Say the filename is "test":

Code:
echo test | md5sum 
d8e8fca2dc0f896fd7cb4cb0031ba249  -
Based on the first portion of the hash, you can store the file in a directory:

/var/somewhere/d8/e8/fc/d8e8fca2dc0f896fd7cb4cb0031ba249.jpg or the like. This allows you to split file storage by d8, d8/e8 and/or d8/e8/fc (or deeper, if you want) and makes caching easier.
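A minimal PHP sketch of that layout, assuming the /var/somewhere base from the example (note the shell example above actually hashes "test\n", since echo appends a newline, so PHP's md5('test') yields a different hash):

Code:
<?php
// Map a filename to a hash-sharded storage path.
function shardedPath($filename, $base = '/var/somewhere', $depth = 3)
{
    $hash  = md5($filename);
    $parts = array();
    for ($i = 0; $i < $depth; $i++) {
        $parts[] = substr($hash, $i * 2, 2); // e.g. d8, e8, fc
    }
    return $base . '/' . implode('/', $parts) . '/' . $hash . '.jpg';
}
?>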

And if you start noticing that the HDD or traffic becomes a bottleneck, you can add another server or cloud instance that specifically serves files that start with, in this example, d8. You could then return a URL like d8.files.fanart.tv/d8e8fca2dc0f896fd7cb4cb0031ba249.jpg. This allows your API back-end to serve the proper file without really having to know exactly which server it's stored on, as the hostname d8.files.fanart.tv already solves that.
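The URL side of that is a one-liner; a sketch with the hostname scheme assumed from the example above:

Code:
<?php
// Derive the shard hostname from the first two hex characters of the hash.
$hash  = md5($filename);
$shard = substr($hash, 0, 2); // "d8" for the hash above
$url   = 'http://' . $shard . '.files.fanart.tv/' . $hash . '.jpg';
?>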

The same goes for local caching, which can be done at the level of Apache or Nginx: check whether the file already exists (e.g., a preview image) and serve it, before sending the request further down the line to PHP.
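With Nginx, for instance, that check is a try_files directive; the paths and back-end address here are placeholders:

Code:
# Serve the file straight from disk if it exists; only pass the
# request on to the PHP back-end when it doesn't.
location /preview/ {
    root /var/somewhere;
    try_files $uri @backend;
}
location @backend {
    proxy_pass http://127.0.0.1:8080; # Apache/PHP behind Nginx
}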

Kode
08-11-2012, 15:31
I'm not sure trying to access 12 million (and growing) small files a month from a 3.5" disk is going to be better than returning them from memory or a database query; the disk would be a pretty big bottleneck, I would have thought. Thanks for the ideas though.

Ashley
08-11-2012, 12:32
Also, I'd use the machines with the better hardware for the harder tasks and use the kimsufi to serve static files.

Ashley
08-11-2012, 12:30
I agree with Myatu about ditching WordPress.

Efficiency comes from purpose-built code. At the moment it sounds like you're using a lot of things (WordPress) to perform jobs they weren't meant for (APIs).

Also, memcache for caching isn't always the best solution. Ever thought about caching stuff to disk in small, easily accessed files rather than running a **** load of logic each time?

I mean, you're providing artwork for TV shows and whatnot; how often is this information going to change? It's not time-critical data such as traffic or weather, so maybe it can have a lifespan of a couple of days instead of being on-demand, high-availability.

Store the result in a file on disk, have a cron job that deletes files older than a few hours, and when that result gets accessed, see if it exists on disk first - if so, return that to the browser, else create it, store it and return it.
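A rough sketch of that flow, with a made-up cache path and $requestKey, and a hypothetical buildApiResponse() standing in for the expensive work:

Code:
<?php
// Serve from the disk cache when present, otherwise build and store.
$cacheFile = '/var/cache/api/' . md5($requestKey) . '.json';

if (is_file($cacheFile)) {
    readfile($cacheFile);          // cached copy exists, return it as-is
} else {
    $result = buildApiResponse();  // hypothetical expensive step
    file_put_contents($cacheFile, $result);
    echo $result;
}
?>
And the cleanup cron job can be a single line:

Code:
# hourly: delete cache files older than three hours
0 * * * * find /var/cache/api -type f -mmin +180 -delete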

This simple solution is how a friend of mine saved 75% on his cloud computing costs.

Myatu
26-10-2012, 18:36
Nginx is "set and forget", just like Apache really. You can run it stand-alone, or in front of Apache. If you run it in front of Apache, you have the benefit of leaving everything "as-is". On a 4GB Pentium DualCore machine at OVH, I managed 4500+ requests per second with NginX on a WordPress site (which also used a database cache).

Nginx can handle incoming requests much faster than Apache, and an added benefit is that you can pool several back-end servers together (as with the cloud service I mentioned). So it can spread the load across the Apache servers, which do the hard work. And if the output is cachable, Nginx will take care of it without bothering the back-end servers.
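As an illustration of that setup (the ports, addresses and cache times below are placeholders):

Code:
# Nginx in front of one or more Apache back-ends, with a simple
# proxy cache so cachable output never reaches Apache at all.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api:10m;

upstream apache_pool {
    server 127.0.0.1:8080;   # local Apache
    # server 10.0.0.2:8080;  # further back-end servers, if pooled
}

server {
    listen 80;
    location / {
        proxy_pass        http://apache_pool;
        proxy_set_header  Host $host;
        proxy_cache       api;
        proxy_cache_valid 200 5m;
    }
}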

As for a message queue, you have various options, such as BeanstalkD. You simply place the necessary values into the queue, while a cron job running a PHP script picks them up later to be sent. You could even write a simple servlet in Node.js (which, even though it's server-side Javascript, I've found very robust and flexible).

The added benefit of throwing it into a message queue is that you can work with the data "at your own pace": if, for example, you received an error from Google Analytics, you can re-send the data at another time or perform some additional logic. All this without holding up the actual API request.
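A sketch of both ends, assuming the Pheanstalk client library for BeanstalkD (the tube name and payload are made up):

Code:
<?php
// API side: queue the analytics hit instead of sending it inline.
$queue = new Pheanstalk('127.0.0.1');
$queue->useTube('analytics')->put(json_encode($gaData));

// Worker side (run via cron): process jobs at your own pace.
$job  = $queue->watch('analytics')->reserve();
$data = json_decode($job->getData(), true);
// ... send to Google Analytics; on failure, release() the job to retry later
$queue->delete($job);
?>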

Edit: Oh yes, you can definitely benefit from using PHP's own DB handlers instead of $wpdb. The thing is that wp-load.php does a very large number of additional includes, loading classes you don't need, which in all takes up valuable memory, thrashes your HDD and slows down the request.
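For example, a bare PDO connection (the DSN, credentials and table name are placeholders) instead of pulling in wp-load.php just for $wpdb:

Code:
<?php
// Lightweight alternative to $wpdb for something like the API key lookup.
$db   = new PDO('mysql:host=localhost;dbname=fanart', 'user', 'pass');
$stmt = $db->prepare('SELECT id FROM api_keys WHERE api_key = ? AND active = 1');
$stmt->execute(array($apiKey));
$row  = $stmt->fetch(PDO::FETCH_ASSOC);
?>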

Kode
26-10-2012, 13:42
I did look at Nginx; it looks like a pain to use with WordPress, which the site uses, but maybe I could run it alongside Apache and just have the API on Nginx.

I could also try using plain PHP MySQL calls and manually loading the right class, rather than using wp-load and the $wpdb database functions.

How would you put step 2 into a message queue?

I'll have a look at noSQL, thanks very much for your help so far.

Myatu
26-10-2012, 02:29
Quote Originally Posted by Kode
It's also using wordpress functions so I'm loading in require_once($root."/wp-load.php");
For the API? That's a *huge* bottleneck.

Myatu
26-10-2012, 02:27
Personally, I would throw step 2 into a message queue, so as not to cause a bottleneck.

Running a MySQL Master-Slave setup is actually easier than it appears, but if it's very simple data (key-data pairs), then you might want to look into noSQL options, as they can deal with these types of requests with much lower overhead than MySQL. Couchbase comes to mind, as you already use memcache too (which Couchbase can provide, but in a fail-over setup).
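For reference, the minimal pieces of a Master-Slave setup are a couple of my.cnf lines plus one statement on the slave; the hostnames, credentials and log coordinates below are placeholders:

Code:
# master my.cnf
[mysqld]
server-id = 1
log-bin   = mysql-bin

# slave my.cnf
[mysqld]
server-id = 2

-- then, on the slave:
CHANGE MASTER TO MASTER_HOST='master.example.com',
    MASTER_USER='repl', MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=0;
START SLAVE;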

Also, think about adding Nginx in front of Apache.

Finally, did you consider the cloud offerings? The key here is that you can have them "off" when you don't need them. When you reach a peak - evenings in your case - you can switch them on, spreading the load. So you do not have to keep them running constantly, and you can scale within 5 minutes...

Kode
26-10-2012, 02:11
It's also using WordPress functions, so I'm loading in require_once($root."/wp-load.php");

I also have an abstract class fanart which defines abstract function buildApi()

So depending on the section it would load $external = new tv_fanart;

And then

$external->buildApi($format, $images, $allowedart, $external->fanimages["folder"], $external->fanimages["id"], $sectiontype, $filepath);

is used.
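(Roughly, the structure described above would look like the following; this is an illustrative reconstruction only, with the signature guessed from the call above:)

Code:
<?php
abstract class fanart
{
    public $fanimages = array(); // e.g. "folder" and "id", as used above

    abstract public function buildApi($format, $images, $allowedart,
                                      $folder, $id, $sectiontype, $filepath);
}

class tv_fanart extends fanart
{
    public function buildApi($format, $images, $allowedart,
                             $folder, $id, $sectiontype, $filepath)
    {
        // section-specific API building happens here
    }
}
?>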

Kode
26-10-2012, 01:59
I'm not saying it's not the application code that's the problem.

It currently does the following:

1) Checks the database to see if the API key used exists and is active, using memcache results if available

2) Sends details to Google Analytics using:
$gaURL='http://www.google-analytics.com/__utm.gif?utm_source=API-Access&utm_medium=api&utm_campaign='.$api->api_key.'&utmwv=1&utmn='.$ga_randNum.'&utmsr=-&utmsc=-&utmul=-&utmje=0&utmfl=-&utmdt=-&utmhn='.$ga_domain.'&utmr='.$ga_referrer.'&utmp='.$ga_hitPage.'&utmac='.$ga_uid.'&utmcc=__utma%3D'.$ga_cookie.'.'.$ga_rand.'.'.$ga_today.'.'.$ga_today.'.'.$ga_today.'.2%3B%2B__utmb%3D'.$ga_cookie.'%3B%2B__utmc%3D'.$ga_cookie.'%3B%2B__utmz%3D'.$ga_cookie.'.'.$ga_today.'.2.2.utmccn%3D(direct)%7Cutmcsr%3D(direct)%7Cutmcmd%3D(none)%3B%2B__utmv%3D'.$ga_cookie.'.'.$ga_userVar.'%3B';

$handle = @fopen($gaURL, "r"); // request the tracking gif
$fget = @fgets($handle);       // read (and discard) the response
@fclose($handle);              // close the handle

3) Gets all the possible image types for the chosen section, using memcache results if available, and builds an array

4) Pulls back all the images for the selected id(s) using memcache results if available

5) Formats all the returned images into an array

6) Outputs the array as JSON or serialized PHP, or uses SimpleXML if XML has been selected (rough sketch below)
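That last step could look something like this; the variable names are illustrative:

Code:
<?php
switch ($format) {
    case 'json':
        echo json_encode($images);
        break;
    case 'php':
        echo serialize($images);
        break;
    case 'xml':
        $xml = new SimpleXMLElement('<fanart/>');
        foreach ($images as $name => $url) {
            $xml->addChild($name, (string) $url);
        }
        echo $xml->asXML();
        break;
}
?>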

elcct
25-10-2012, 22:50
Maybe you should look into your application code? From a brief look at your API functionality, it seems there is nothing that would justify such a big load.
Do you have any caching in place, etc.?
Your server should easily handle that number of hits daily.

Kode
25-10-2012, 22:20
Thought I'd put this here to see if any of you more experienced people had some ideas.

I run a website called fanart.tv, which is used by a fair few media centres, including XBMC, for getting artwork for movies, TV shows and music.

I'm currently on an SP SSD, but we are running out of disk space, and while the site runs fine most of the time, quite often in the evening when we are under heavy load the server's load can go up to 20+.

The biggest traffic comes from the API we provide, which in the last month has been hit 10,552,720 times; when we are under high load the site becomes unresponsive.

We rely on donations to pay for hosting, but I have set a target of £1440 for January. If we hit or exceed it, I was thinking of running a 2013 equivalent of the SP BestOf and maybe a small kimsufi.

But I don't know if it would be best to have just the site on the kimsufi and all API traffic on the SP, or to have the API on the kimsufi for most traffic, and then put the site plus the API traffic from contributors on the SP.

Also, what would be the best way to keep them both in sync? rsync, I guess, but I've never been sure whether the rsync commands I use transfer just the new images or send everything. It's keeping the MySQL databases in sync that confuses me the most, though; I had a bit of a read-up on setting up a master and slave configuration a while ago, but only having one server makes it hard to try.
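(For what it's worth, rsync's delta transfer only sends files that are new or have changed, never everything; a minimal invocation, with placeholder paths and hostname:)

Code:
# -a preserves permissions/times, -v lists what was actually sent,
# so you can see that unchanged images are skipped.
rsync -av /var/www/images/ kimsufi.example.com:/var/www/images/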