Please critique my implementation of a test assignment
From: developer201208  
Date: 16.08.12 15:51
The company has already turned me down,

but I would still like to know what exactly is wrong, since I described every potential flaw that one could nitpick in the README and in the comments
(there is no point in fixing those flaws; this is a test assignment after all, not a real application).

Archive with the complete assignment:
http://files.rsdn.ru/101777/test_task.zip

The assignment has 2 parts: a web application and an algorithmic part.
The algorithmic part fits in a single file, so I am pasting it here together with the task statement
(they asked for it in Perl, but I don't know Perl, so I wrote it in PHP).
Anyone interested in the first part can download the archive.

PART 2: Perl

Please write a Perl script that will check a log file and count how many
times a pattern appears within hh hours.
For example...

checklog.pl /home/saag/mylog.txt "created '\w[1-10]$" 48

This would count how many times "created 'aa'" and similar matches
appear between now and 48 hours ago. You can assume each line of the log
file has a datetime stamp at the front like so: "Nov 11 11:06:39: "
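
For reference, the task exactly as stated (year-less stamp, one pattern check per line) can be sketched as a plain line-by-line scan. This is only an illustration in PHP, the language of the solution below, and not the submitted code: the names `resolveStamp` and `countRecent`, the UTC time zone, and the rule "a stamp that would land in the future belongs to the previous year" are all assumptions.

```php
<?php
// Illustrative sketch only; names and the year-guessing rule are assumptions.

// Turn a year-less stamp such as "Nov 11 11:06:39" into a Unix time,
// picking the most recent year that does not place the stamp after $now.
function resolveStamp(string $stamp, int $now): ?int
{
    $utc = new DateTimeZone('UTC'); // assumption: the log is written in UTC
    $year = (int) gmdate('Y', $now);
    foreach ([$year, $year - 1] as $candidate) {
        $dt = DateTimeImmutable::createFromFormat('!M j H:i:s Y', "$stamp $candidate", $utc);
        if ($dt !== false && $dt->getTimestamp() <= $now) {
            return $dt->getTimestamp();
        }
    }
    return null;
}

// Count lines whose text after the "Mon dd hh:mm:ss: " prefix matches $pattern
// and whose stamp lies within the last $hours hours before $now.
function countRecent($handle, string $pattern, int $hours, int $now): int
{
    $cutoff = $now - $hours * 3600;
    $count = 0;
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if (!preg_match('/^([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}): (.*)$/', $line, $m)) {
            continue; // not a record line
        }
        $time = resolveStamp("$m[1] $m[2] $m[3]", $now);
        if ($time === null || $time < $cutoff) {
            continue; // unparsable or older than the window
        }
        // Assumption: the user pattern contains no "~", which is used as the delimiter.
        if (preg_match('~' . $pattern . '~', $m[4]) === 1) {
            $count++;
        }
    }
    return $count;
}
```

Because every line is stripped of its line ending and matched on its own, `$` in the user's pattern anchors to the end of the record, as in the task's example. The scan reads the whole file, so it says nothing about the large-file concerns addressed below.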


/* 
The parsing algorithm is based on the following assumptions:
- log files may be very large, up to several gigabytes;
- they may be open for writing while the parser is running;
- records may be written to the log file very quickly, for example several thousand per second;
- records in the log have different lengths;
- timestamps are distributed evenly over the whole file;
- record lengths are distributed evenly over the whole file
(in other words, there is no situation where the first 50% of records are 10000 bytes long and the remaining 50% are 100 bytes long);
- the timestamp has a fixed length and the format "Nov 11 2012 11:06:39 "
(the year was added to the format given in the task, because a timestamp without a year can make a log ambiguous to parse);
- the regex is applied once per record;
- records are stored in chronological order;
- there are no empty records in the file.

IMPORTANT:
Below some file size it is faster to read the whole file into memory and parse it with a single regular 
expression than to read the file block by block. I don't know this threshold, although it can certainly be determined 
empirically. That's why I first set it to 1000 to test the first branch of the condition in the code, 
and then set it to 0 to test the second branch with the algorithm for large files.

Notes: 
* no testing on large log files has been done, because you were interested in the approach rather than in working code;
* error handling for the file reading and file pointer functions is omitted in many places for simplicity; 
in a real environment I would wrap all these calls if error checks that throw exceptions 
were required;
* the regex is applied to the content part of the record, as you can see below, so using $ (as in the task statement) is not possible;
use \r for logs created on Windows, or nothing at all for logs created on Linux.

Examples of regexes you can use:
(.*)j(.*)
created (\w{1,10})\r
*/

class LogParser
{
    const MAX_SIZE_FOR_REGEX = 0; // see "IMPORTANT" comment above
    
    // this value is used only for testing; in practice it should be much higher (at least the largest possible record size);
    // strictly speaking, the first constant above would be enough, and this one exists mainly for testing purposes
    const MAX_BLOCK_SIZE_FOR_PARSING_LARGE_FILES = 50; 
    
    public static function run (
        $filePath, // full path
        $regex, // regex for the content part of each record
        $numHours // num hours back from the current time
        )
    {
        $tmpFilePath = $filePath.".tmp";
        $searchedTime = time() - $numHours * 3600;
        if ($searchedTime < 0) exit ("Invalid number of hours.");
        $regex = "/^(.{20})\s(".$regex.")$/m";
        $res = 0;
        
        try
        {
            if (!@copy ($filePath, $tmpFilePath)) throw new Exception ("Failed to create a temporary copy of the file.");
            if (($fileSize = filesize ($tmpFilePath)) === false) throw new Exception ("Failed to get file size.");
            
            ////////////////////////////////////////////////////////////////////////////
            // file is small enough to be analyzed by single regex
            if ($fileSize <= self::MAX_SIZE_FOR_REGEX)
            {
                $str = file_get_contents ($tmpFilePath);
                $matches = array ();
                if (false === @preg_match_all ($regex, $str, $matches) || !is_array ($matches) || sizeof ($matches) < 3 || !is_array ($matches[1])) 
                    throw new Exception ("Invalid regular expression.");
                for ($i = sizeof ($matches[1]) - 1; $i >= 0; $i--)
                {
                    if (strtotime ($matches[1][$i]) < $searchedTime) break;
                    $res++;
                }
            }
            ////////////////////////////////////////////////////////////////////////////
            // large file
            else
            {
                $startParsingFrom = self::findLocationToStartParsing ($tmpFilePath, $searchedTime, $fileSize);
                if ($startParsingFrom >= 0) $res = self::parseFile ($tmpFilePath, $startParsingFrom, $regex);
            }
            ////////////////////////////////////////////////////////////////////////////
            
            echo "Number of found records: ".$res."\n";
        }
        catch (Exception $e)
        {
            echo $e->getMessage();
        }
        
        if (file_exists ($tmpFilePath)) @unlink ($tmpFilePath);
    }
    
    // notes:
    // this method looks for the record with the required timestamp by repeatedly halving the remaining part of the 
    // file and moving the pointer forward and back; it should work fairly quickly even for very large files;
    // the logic is not completely tested, but as you wrote in your letter, you were interested in the approach 
    // rather than in completely working code
    protected static function findLocationToStartParsing ($tmpFilePath, $searchedTime, $fileSize)
    {
        $startParsingFrom = -1;
        $currentStart = $currentPointer = 0;
        $currentEnd = $fileSize;
        $arrVisitedPointers = array ();
        
        if (($f = @fopen ($tmpFilePath, "r")) === false) throw new Exception ("Failed to open a temporary copy of the log file.");

        while (true)
        {
            $str = "";
            if (($str = fgets ($f)) === false) throw new Exception ("Failed to read a record from a temporary copy of the log file.");
            
            $currentTime =  strtotime (substr ($str, 0, 20)); // get time for the current record
            $arrVisitedPointers[] = $currentPointer;
            
            // find the middle of the remaining part of the file
            if ($currentTime < $searchedTime && !feof ($f))
            {
                $currentStart = $currentPointer;
                $currentPointer = $currentPointer + floor (($currentEnd - $currentPointer) / 2);
            }
            // find the middle of the previous part of the file
            else if ($currentTime >= $searchedTime && $currentPointer > 0)
            {
                $currentEnd = $currentPointer;
                $currentPointer = $currentPointer - floor (($currentPointer - $currentStart) / 2);
            }
            // the first record in file meets time requirements
            else if ($currentTime >= $searchedTime && $currentPointer == 0)
            {
                $startParsingFrom = $currentPointer;
                break;
            }
            fseek ($f, $currentPointer);
            // correction of the pointer placement by the beginning of the current record
            while ($currentPointer > 0 && fgetc ($f) != "\n")
            {
                $currentPointer -= 1;
                fseek ($f, $currentPointer);
            }
            if (in_array ($currentPointer, $arrVisitedPointers))
            {
                fgets ($f);
                if (feof ($f)) break;
                if (($currentPointer = ftell ($f)) == $fileSize) break;
                if (in_array ($currentPointer, $arrVisitedPointers))
                {
                    $startParsingFrom = $currentPointer;
                    break;
                }
            }
        }
        
        fclose ($f);
        
        return $startParsingFrom;
    }
    
    protected static function parseFile ($tmpFilePath, $startParsingFrom, $regex)
    {
        $res = 0;
        if (($f = @fopen ($tmpFilePath, "r")) === false) throw new Exception ("Failed to open a temporary copy of the log file.");
        fseek ($f, $startParsingFrom);
        $i = 0;
        while (!feof ($f))
        {
            $str = fread ($f, self::MAX_BLOCK_SIZE_FOR_PARSING_LARGE_FILES);
            $n = strrpos ($str, "\n");
            if ($n !== false && $n != strlen ($str) - 1)
            {
                fseek ($f, ftell($f) - (strlen ($str) - ($n + 1)));
                $str = substr ($str, 0, $n);
            }

            $matches = array ();
            if (false === @preg_match_all ($regex, $str, $matches) || !is_array ($matches) || sizeof ($matches) < 3 || !is_array ($matches[0])) 
                throw new Exception ("Invalid regular expression.");
            $res += sizeof ($matches[0]);
        }
        fclose ($f);
        
        return $res;
    }
}
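
The search in findLocationToStartParsing can also be written as a bisection over byte offsets, which needs no list of visited pointers. The block below is a minimal sketch under the assumptions from the header comment (20-character stamp with year, records in chronological order, "\n" line endings) plus one more: stamps are read as UTC. The names `stampToTime`, `lineStartAtOrAfter` and `findFirstOffset` are hypothetical, and nothing here has been measured on large files.

```php
<?php
// Sketch only: helper names are hypothetical; format assumptions as in the header comment.

// Parse a stamp such as "Nov 11 2012 11:06:39"; returns null if it does not parse.
function stampToTime(string $stamp): ?int
{
    $dt = DateTimeImmutable::createFromFormat('!M j Y H:i:s', $stamp, new DateTimeZone('UTC'));
    return $dt === false ? null : $dt->getTimestamp();
}

// Return the offset of the first line that starts at or after $offset.
function lineStartAtOrAfter($f, int $offset): int
{
    if ($offset === 0) {
        return 0;
    }
    fseek($f, $offset - 1);
    fgets($f); // consume the rest of the line that contains byte $offset - 1
    return ftell($f);
}

// Return the offset of the first record whose timestamp is >= $cutoff,
// or $size when there is no such record.
function findFirstOffset($f, int $size, int $cutoff): int
{
    $lo = 0;
    $hi = $size;
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        $start = lineStartAtOrAfter($f, $mid);
        $isRecent = true; // "no line at or after $mid" counts as recent enough
        if ($start < $size) {
            fseek($f, $start);
            $time = stampToTime((string) fread($f, 20));
            $isRecent = $time !== null && $time >= $cutoff; // unparsable stamps are treated as old
        }
        if ($isRecent) {
            $hi = $mid;
        } else {
            $lo = $mid + 1;
        }
    }
    return lineStartAtOrAfter($f, $lo);
}
```

The loop relies on one monotone property: as the offset grows, the first line at or after it never gets older. Each step does one seek and reads one line, so the number of reads grows with the logarithm of the file size. The returned offset could be passed to a counting routine such as parseFile.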
 