Better build names in Jenkins

I'm becoming a real fan of the automation server Jenkins. It's very helpful for a team to keep a record of branches built and various deployments in a shared location. One of the problems with Grunt alone is that it is quite easy to build a local branch that is perhaps still uncommitted and make a product no one else on the team can reproduce. An agreement to create Grunt builds through Jenkins helps solve that problem, since any resources available to Jenkins can also be available to your other teammates.
Maybe it's the lack of sunlight in winter, but I've become fixated on making the information provided by Jenkins more useful in a dashboard kind of way.
One of the first little steps with a great payoff is to modify the build name that is stored in the log to be more helpful than a sequential number.
This is achieved by installing the build-name-setter plugin in Jenkins and setting the name to a concatenation of variables from the environment or provided by Jenkins itself. These variables are resolved in two passes: once when the job first kicks off and once again at the end. Any variables placed into ENV by inputs to the Jenkins job can be accessed with the syntax ${ENV,var="Branch_name"}. If you want a list of the variables that are available thanks to the inner workings of Jenkins, there is a helpful one provided by clicking the '?' button next to the branch name input.
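For example, a template along these lines keeps the sequential number but adds the branch. This is just a sketch: it assumes your job actually defines a Branch_name parameter, while BUILD_NUMBER is one of the variables Jenkins provides on its own.

#${BUILD_NUMBER} - ${ENV,var="Branch_name"}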

Complexity


The issue of managing complexity has been coming up in my Project Euler problems: the best optimized solutions don't set up your CPU to double as an electric blanket.

Complexity, in a nutshell, is about how the work required to find a solution scales with the size of the problem. For example, assigning two variables is just twice as demanding as assigning one variable, so it's not a very complex task. In contrast, the complexity of finding multiple pairs of numbers with nested iteration expands as a square: every additional cycle of the external iteration requires a whole series of nested cycles.

Let me use a concrete example of a problem involving large numbers. I solved it two ways: once quite naively and once with reduced complexity.

The question involved locating the optimal times in one day to purchase and then sell a single stock, when the stock value is saved in an array every minute after midnight.

My first solution to find the optimal times of purchase and sale had a wheel-within-a-wheel system. The stock could only be sold after it was purchased (I assumed one minute of latency), so as I progressed through the day analyzing the potential purchase times, I compared each with all future times a sale could be made. The best optimization I could do was to analyze only times of sale that came after the purchase, but that just cuts the processing in half.

Note that this program will not recommend a purchase or sale time on days in which the stock price only declined.

class ProfitFinderOne
  attr_accessor :purchase_time, :sale_time, :best_profit

  PRICES = [] # one price per minute after midnight

  def initialize
    @best_profit = 0
    @purchase_time = nil
    @sale_time = nil
    scan
  end

  # Compare every purchase minute against every later sale minute.
  def scan
    purchase_index = 0
    while purchase_index < PRICES.length
      purchase_price = PRICES[purchase_index]
      sale_index = purchase_index + 1
      while sale_index < PRICES.length
        sale_price = PRICES[sale_index]
        if sale_price - purchase_price > @best_profit
          @best_profit = sale_price - purchase_price
          @purchase_time = purchase_index
          @sale_time = sale_index
        end
        sale_index += 1
      end
      purchase_index += 1
    end
  end
end
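As a quick sanity check, here is the finder run against a made-up six-minute day of prices:

ProfitFinderOne::PRICES.replace([5, 3, 8, 1, 9, 4])
finder = ProfitFinderOne.new
finder.best_profit   #=> 8 (buy at 1, sell at 9)
finder.purchase_time #=> 3
finder.sale_time     #=> 4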

To reduce the complexity, however, I can remove that wheel inside the other wheel. This time I chose to pre-process information about the optimal future sale price and the time that price is available. Now, as the program progresses through the day, each comparison is made against only one element in the parallel array, reducing the complexity from a square to a linear growth pattern.

class ProfitFinderTwo
  attr_accessor :purchase_time, :sale_time, :best_profit

  PRICES = [] # one price per minute after midnight
  BEST_FUTURE_SALE = []

  def initialize
    @best_profit = 0
    @purchase_time = nil
    @sale_time = nil
    populate_sale_array
    scan
  end

  # Walk the day in reverse, recording for each minute the best sale
  # price still ahead and the minute it occurs.
  def populate_sale_array
    temp_array = PRICES.reverse
    last_best = [PRICES[-1], PRICES.length - 1]
    temp_array.each_with_index do |value, index|
      original_index = PRICES.length - 1 - index
      last_best = [value, original_index] if value > last_best[0]
      BEST_FUTURE_SALE << last_best
    end
    BEST_FUTURE_SALE.reverse!
  end

  # Each purchase minute is now compared against a single entry
  # in the parallel array.
  def scan
    PRICES.each_with_index do |purchase_price, index|
      profit = BEST_FUTURE_SALE[index][0] - purchase_price
      if profit > @best_profit
        @best_profit = profit
        @purchase_time = index
        @sale_time = BEST_FUTURE_SALE[index][1]
      end
    end
  end
end
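Running the same made-up day through this version gives the same answer, just without the nested loop:

ProfitFinderTwo::PRICES.replace([5, 3, 8, 1, 9, 4])
finder = ProfitFinderTwo.new
finder.best_profit   #=> 8
finder.purchase_time #=> 3
finder.sale_time     #=> 4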

About "time"

I've been working a lot on Project Euler problems. I try to push code to GitHub every day, and these little code snacks have been really fun to do between larger projects. What I enjoy a lot about the Project Euler questions is that you can pretty much always get some kind of naive answer, and the real satisfaction of the project is to optimize it all Ugly-Duckling style until it is faster, shorter, and much more snappy.

Many of the problems hinge on REALLY big numbers to make them challenging. Since it's super boring to sit and watch your terminal grind while it calculates prime factors or whatnot, benchmarking is important in optimization, and the Unix "time" command is a sweet way of knowing how much processing you have whittled off of your solution. All you have to do is preface your Ruby execution command with "time" like so:


> time ruby problem3.rb

While writing my restaurant data scraper I wanted to benchmark my optimizations, but "time" doesn't work in the Rails console. My workaround for that was to make a named Time.now object at the beginning of the parser and a second one at the end. Subtracting the two gave me the time in seconds that my parser ran.
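A minimal sketch of that workaround, with parse! standing in for whatever actually does the work:

start_time = Time.now
parse! # the slow part being measured
elapsed = Time.now - start_time # subtracting two Times gives Float seconds
puts "Parser ran for #{elapsed} seconds"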

Restaurant Cleaner

I'm talking today at Hack and Tell about the parser I wrote to convert the NYC Restaurant Inspection results into a useful and hostable JSON file.
New York City has a very progressive policy of making all data acquired with taxpayer money publicly available. They have some great and easy-to-use stuff up there, like hosted JSON of all the greenspaces in the city, that would make a very nice dynamic map.
What is not so nice is the restaurant inspection data provided by the Department of Health and Mental Hygiene. It is only available as a download of a nearly 1 gigabyte plain text file. It's supposedly in a CSV format, but according to complaints on the boards it can't be opened properly, as there are commas and double quotes inside titles which disrupt the format. I wanted it in JSON anyhow.

This is what a bit of it looks like:

"CAMIS","DBA","BORO","BUILDING","STREET","ZIPCODE","PHONE","CUISINECODE","INSPDATE", "ACTION","VIOLCODE","SCORE","CURRENTGRADE","GRADEDATE","RECORDDATE" "40280083","INTERCONTINENTAL THE BARCLAY","1","111 ","EAST 48 STREET ","10017","2129063134","03","2014-02-07 00:00:00","D","10F","4","A","2014-02-07 00:00:00","2014-03-20 06:01:11.660000000" "40356483","WILKEN'S FINE FOOD","3","7114","AVENUE U","11234","7184443838","27","2014-01-14 00:00:00","D","10F","10","A","2014-01-14 00:00:00","2014-03-20 06:01:11.660000000" "40362869","SHASHEMENE INT'L RESTAURA","3","195","EAST 56 STREET","11203","3474300871","17","2013-05-08 00:00:00","D","10B","7","A","2013-05-08 00:00:00","2014-03-20 06:01:11.660000000" "50008280","WILD ORCHID BAR & LOUNGE INC.","4","111-48 ","LEFFERTS BOULEVARD ","11420","3479609997","99","1900-01-01 00:00:00","","","","","","2014-03-20 06:01:19.813000000"  "50008286","Espinal Restaurant","3","1039","BELMONT AVENUE","11208","7188275230","99","1900-01-01 00:00:00","","","","","","2014-03-20 06:01:19.813000000" "81642687","FAMOUS RAY'S RESTAURANT CORP.","1","582","WEST 207 STREET","10034","8624525735","99","1900-01-01 00:00:00","","","","","","2014-03-20 06:01:19.813000000"

You may notice the fanciful use of white space and capslock.

Ziiiiiip!

My first idea was to take each line, which is a restaurant, and split it into an array of data elements. I knew from the column headings which data elements I wanted and planned to populate a temporary array with the desired elements. Then I would run that in parallel with my known column headings to make a temporary hash for each restaurant that can then be made into a JSON object.

desired_data = [1, 3, 4, 5, 7, 8, 10, 12]
column_names = [:name, :street_address, :zip, :cuisine, :inspection_date, :violation, :current_grade]

## stuff happens here

temp_array.each_with_index { |item, index| temp_hash[column_names[index]] = item }

I wanted to do as little stuff as possible, though, as there are over 53,000 records to process, many of which are out of date or incomplete. I wrote in logic for a completeness flag that would flip if any of my required data elements were missing, so that I could exit the loop, not save that record, and move on as quickly as possible.

element_array.each_with_index do |data_element, index|
  is_complete = false if (data_element == "" && index != 10)
  ## processing code

  temp_array << data_element unless index == 4
end
formatted_array << temp_hash if is_complete

But this is where things really started to balloon. I thought I'd be able to just shovel items from the restaurant array into one half of the hash zipper right away, but more and more things needed processing. Dates had to be checked to be sure they were current, codes had to be converted to human-readable text, and the crazy caps lock situation on names was resolved with a specific method.

def namify(element)
  element.split(' ').map(&:capitalize).join(' ')
end
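Borrowing a name from the sample data above:

namify("WILKEN'S FINE FOOD") #=> "Wilken's Fine Food"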

It became evident that nearly everything needed to be processed, so I had to consider whether it would be more efficient to remove the iteration and select items to go into the hash directly. For this version, I pulled all the data processing logic into its own method containing many ternary statements that return from the method if data is missing.

## stuff before ..
element_array[1] ? temp_hash[:name] = namify(element_array[1]) : return
## further processing

I wanted to know what the difference in potential optimization was between the two approaches. I couldn't use the "time" command for benchmarking in the Rails console, so I created a Time.now object at the beginning of the parser and another at the end. Subtracting the two gives me the total time in seconds required to run the parser, so I didn't need to sit there with a stopwatch and could read Zooborns.com instead.

Surprisingly, the old-style parser was consistently, if only slightly, faster than the new version, implying that removing the iteration did not give me the hoped-for optimization; I had traded it for more searching.

Reading Old Books with Tesseract


Our current project is in conjunction with the NYPL and concerns the transcription of digital archives. Because the particular resource we are working with is typeset, we will be using OCR (Optical Character Recognition) on the digital images to assist the transcriptionist. A bitmap image such as a .png, .jpg, or .tiff is just information about the color of the pixels in the image, and it takes some interesting programming to get an understanding of what those pixels mean.

The two basic types of OCR processing are Matrix Matching and Feature Extraction. Matrix Matching has a lower computational cost and works best with reproducible typefaces; a phone bill could be scanned quickly and well with a Matrix Matching engine. The computer metaphorically overlays a stencil of a letter on a grouping of pixels and records the letter with the closest match. Feature Extraction works much more like the human visual system and searches for, well, features. It looks for edges, monochrome fields, line intersections, and other such topography. Feature Extraction is more versatile than Matrix Matching for unusual typefaces, different sizes of the same type, or uneven backgrounds.

For this project we will be using Tesseract, an OCR engine developed by HP and made open source in 2005. It is a feature-extraction engine with a couple of optimization options. Before embedding a language-specific hidden Markov model or training a convolutional neural network with an evolutionary optimization algorithm such as Particle Swarm Optimization, there are some more basic steps you can take to improve your OCR results.

  • First use a good resolution copy. 300dpi is about the minimum requirement to ride the ride. Grey-scale or color is better than black and white.
  • Tesseract does a lot of background and contrast adjustments itself, so trying to anticipate what it wants is not very likely to help much.
  • If the background of the image is known to be unevenly aged, setting the background adjustments to "tile" local adjustments may work better than averaging across the entire image.
  • Tesseract has different language files, so be sure to use the one appropriate to your particular document. This is important for Tesseract to anticipate what characters it might come across (see the example after this list).
  • It is possible to train Tesseract on a particular font by correcting and saving early attempts at OCR. A good tutorial is located here.
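On the language-file point, a command-line invocation looks something like this — a sketch with made-up file names, assuming the English training data is installed:

> tesseract scanned_page.tif scanned_page -l eng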

I miss .each

I miss the Ruby method .each so very, very much.


I've been refactoring many of my old Ruby todos (one thing I learned is that adorably younger me took a LITERALLY arbitrary approach to spacing. Want two spaces, four spaces, THREE spaces, a tab? Sure, why not! Don't bother me with this formatting minutia, I'm crushing code here!), and recently I've been converting the same todos into JavaScript. 'Cause you know, why not. Turns out that there is a "why not," and it is the fact that the lack of a direct .each replacement in JavaScript is actually making me depressed. I even pop open old code just to see its friendly little vowel-filled face sitting above its securely fenced code block, like looking at old photos after a breakup.

This is the todo that made me lose my mind:


## Write a function so that test = ['cat', 'dog', 'fish', 'fish']
## count(test) #=> { 'cat' => 1, 'dog' => 1, 'fish' => 2 }

def count(array)
  unique_array = array.uniq
  return_hash = {}
  unique_array.each do |item|
    return_hash[item] = array.select { |repeat| repeat == item }.length
  end
  return_hash
end

This, by the way, is not at all the least processor-intensive way to solve this, but because Ruby comes packaged with so many useful named modules it is at least short. This solution relies on a lot of subtle flavors of iteration, which made my attempt at implementing it in JavaScript balloon distressingly before I realized that I don't know how to do nuanced iteration in JavaScript AT ALL:

function count(array){
  var returnValue = {};
  var givenArray = array;
  var allKeys = unique(givenArray); // function declarations hoist, so this call works
  function unique(array){
    var uniqueArray = [];
    for(var i = 0; i < array.length; i++){
      var standard = array[i];
      if (!uniqueArray.some(function(element){ return element === standard; })){
        uniqueArray.push(standard); // concat returns a new array; push adds to this one
      }
    }
    return uniqueArray;
  }
  for(var i = 0; i < allKeys.length; i++){
    var key = allKeys[i];
    returnValue[key] = givenArray.filter(function(element){ return element === key; }).length;
  }
  return returnValue;
}

The thing is that a lot of the methods I am used to in Ruby come along in the Enumerable module and so are implemented consistently across the different classes. JavaScript does not have the same consistent application of methods across classes. Partially, this is due to the greater degree of functional style in JavaScript; it is more object-oriented to have methods assigned to different classes and to only manipulate data through these pre-approved channels. JavaScript does have a small vocabulary of methods that do iteration, largely methods in the Array class such as .every, .filter, .forEach, .map, and .some.

The documentation could never be accused of being user friendly, so I'm going to give one small but practical example of how to use a prepackaged Array method without using the term "this," "callback," or, amusingly, "thisp." Let's say you want to step through an array and make each item an attribute in an object. You can start by writing what would be your block in Ruby as the body of a function that you want executed on each element.

    if (!containerObject[item]) {
      containerObject[item] = 1;
    } else {
      containerObject[item]++;
    }

Now, if you break out your MDN decoder ring that you got from the bottom of one of the many cans of coffee you drank while reading documentation late at night, you will discover that Array.forEach is the best method for executing a function on each element of an array, and that the parameters piped from .forEach to the function it executes are, in this order:

  • the element value
  • the element index
  • the array itself

so now your attempt at a block can be wrapped in proper JavaScript like so:

var containerObject = {};
function addToContainer(item, index, array) {
  if (!containerObject[item]) {
    containerObject[item] = 1;
  } else {
    containerObject[item]++;
  }
}

If you feel that it is untidy naming parameters that your function doesn't use, you can leave out the unused "index" and "array," as JavaScript will simply ignore any extra arguments it is given. Now you can call your function by name as an argument to Array.forEach, letting it know what it should do for each item in the array.

  var containerObject = {};
  function addToContainer(item) {
    if (!containerObject[item]) {
      containerObject[item] = 1;
    } else {
      containerObject[item]++;
    }
  }

  ["fish", "cat", "fish"].forEach(addToContainer);
  containerObject; //=> { fish: 2, cat: 1 }

This is why Arrays are great. But now let's say you just have to work with the attributes of an object. In Ruby, Hashes include the Enumerable module so you can call .each on them. This is not true of JavaScript objects. The methods provided by the Object class are sparser, as the functional style of JavaScript suggests that, instead of using prewritten and preapproved class methods on objects, custom functions should be written and invoked with the object as a parameter. This is best achieved with a plain vanilla "for" loop. JavaScript has multiple flavors of "for" loop, although most of them are not recommended. There is "for each...in," which I was originally excited about as it seemed like a possible .each replacement, but it was deprecated right after being introduced, like Crystal Pepsi.

Refactor Friday Part I

I went back to refactor one of the earliest homework assignments that we were given at the Flatiron School, the Number Guesser. Although it was all on one page and the tests totally didn't pass, I wasn't too displeased with how it has aged. Hardly fine wine, but at least not compost either. I ended up with a very long commit history that I tried to make into the general outline of this blog post.

This:

def checkforequivalency
  if @input == @standard
    true
  else
    false
  end
end

Becomes:

def equivalent?
  @input == @standard
end

  • Had to unchain method calls as .guess returns a string. Whaaaat?! The "s" in gets means string. ^%$#!!!

  • Realized that gets converts input into a string. Included to_i to counter that, and a validity check to recognize post-to_i strings.

  • Changed method name to reflect the true/false response. One of the few teachable moments here: having explicit names that indicate both the work a method does and the return it gives makes for friendlier code. Noticed that the response was missing the interpolated value when I ran the program. Added it back in.

  • Added tests to methods. MOAR TESTS!!!

  • Got up to get a cup of tea. Thought of more features. Didn't yet decide if they were app or class features. Proud of myself that I resisted dropping in three half-baked def's. Wrote notes to myself that I wanted to get back to making these.

  • Added loop to allow repeated guesses. Can end the game with exit.

  • Added exit test and higher/lower functionality. New functionality in the app and in the class now that I know what goes where.

The natural history of HTML tags


In biology, plants evolve defenses against the bugs that eat them, then bugs evolve the ability to eat these harder-to-eat plants until the cycle reaches some pretty amazing extremes. This sort of feedback loop has also been in play in the development of HTML elements and those who would like to either make use or take advantage of them.

If you take a straight-ahead interpretation of the establishing chapter of Mark Pilgrim's book "Dive Into HTML5," it would seem that many of the basic HTML elements were mostly just agreed upon by the members of a large email correspondence. Even if you think that seems a little simplified, it is certainly the case that elements become standard because they are useful, and that browsers adapt to respond to these tags to better display content. Already, we are describing the first of a number of interlocking adaptive relationships between browser developers and web page authors.

When the development of the Internet was still bathed in the light of amber CRT screens, the most compelling reason to apply an HTML tag to a piece of content that it did not accurately describe was formatting. See also the rampant abuse of tables for spacing purposes. Soon though, browsers evolved the ability to glean and record data about web pages, and about how web pages relate to each other, to make it easier for users to navigate the web. There were some interesting categorization efforts made by AOL and Yahoo! that are worth their own blog post, but the point of this story is that when the network got far too large for one metaphorical shelf, Larry Page of Google fame developed his self-titled ranking system that treated the content on the web like a bucket of stuff. A bucket of Internet that, as the name implies, had the particular quality of being networked. The general reasoning behind the PageRank algorithm is that the most important web content is that which is regarded as authoritative by its peers. In this case, he used the HTML element of a hypertext link as a stand-in for a vote of confidence in the importance of another web page: the more links to a web page from other well-linked pages, the more likely it was to be of good quality.

This has led to interesting uses and abuses of the humble href. One story recently in the news is the scandal of Rap Genius, who were attempting to farm links from bloggers. They requested that bloggers spam links to their site in return for favors, in an attempt to stuff the virtual ballot box and drive their site higher in the ranking algorithm. Another interesting adaptation is the "nofollow" attribute, taken up by authors who want to refer to a source without providing that web site the benefit of increased significance on the Web. "nofollow" is a signal to web crawlers to discount a link in the Internet popularity contest. For example, it could be used by reporters who want to refer to the site Rap Genius without incidentally providing the same linking services the company was attempting to solicit.

Doing Less by Doing Less


This is a post about the value of doing nothing. While reading Sandi Metz's book about program design, I have been intrigued by how often she recommends not making a decision and waiting until the point that the flaws in your program design bite you in the butt. This doesn't seem like the advice one would expect from a book about project design: that you, y'know, not design. Here's what makes it interesting: it may well be the better choice to sit with stinky code than to clean it up, even if it bites you more often than not. Here's the math I did while smooshed in a commuter train.

Let's say that you spend one hour optimizing a class every time the chance comes up, because it takes two hours to fix stinky code later. In the case that half the classes turn out stinky, this is a wash: for eight classes, assuming that your optimization is right every time, you invest eight hours of work to save eight hours of work.

What this doesn't take into account, though, is that optimizing too early means that some of the time your code will get stinky anyhow, so optimizing for eight hours to avoid eight hours of work actually costs more, because it is unlikely that you will avoid all future stink. Waiting, on the other hand, lets you identify what actually needs fixing and spend time only on the code that needs it. To put numbers on it: if only three of the eight classes would ever have turned stinky, waiting costs six hours of fixes, while optimizing everything up front costs eight hours, plus two more for every class that stinks despite the effort.

So ultimately you can do less by doing less.

Optimism

This post is in response to the post on optimism by Mr. R. S. Braythwayt, Esq. found at http://braythwayt.com/homoiconic/2009/05/01/optimism.html

I usually get fairly tense reading posts that discuss psychology research. I sort of scootch over to the side of my chair. Perhaps I am concerned that I might pull an extraocular muscle from the imminent eye rolling once the results from one unimportant but highly quotable study get generalized into a global statement about why we do everything that we do.


Fortunately, my anxiety was unfounded. The writing itself refreshingly avoided becoming maudlin, and Mr. Braythwayt comes across as a pragmatic gentleman. Since he chose to keep his message clear in order to make it more actionable, it's good that he took a while getting to the punchline and invited the reader to attempt a prediction concerning optimistic behavior. Going straight to the message, that most folks overvalue negative experiences and do not put as much emphasis on the things that go right, would have ended up being just a recitation from the Journal of Things We Already Know.

But just because we already know something doesn't mean it isn't important.


A week ago I attended the Bris of my friend's baby. This came after weeks of a complicated pregnancy in which the lives of both the mother and son had been imperiled. During the ceremony, the cantor asked that the attendees make a point of working as hard to make time to observe happy occasions as sad ones. In the case of a funeral, it seems important enough to leave work and pay for a ticket to attend in person, but attending a wedding or visiting a new baby somehow does not. Leaving a blog comment because someone is wrong on the Internet seems worth populating a text field, when having an equally strong but positive opinion does not.

This is why so much of the Internet has become a bad neighborhood.

Stupid Blocks


Spent most of the day working on Ruby. Besides being the general topic of the week, it's what I worked on most prior to acceptance, so I feel less behind. The whole loops-and-control-flow thing lets me feel smart.

Blocks, procs, and lambdas do not.

I would be interested in seeing some in the wild.