Snacks for your mind

Tuesday, June 12, 2012

Using git for simple projects.

This is not really related to main purpose of this blog, but I had to write this article anyway, so I though I might as well publish it.

Git is becoming more and more popular collaborative programming tool. It main purpose is version control, so you can always revert to previous versions of your code and so on.

The problem is many people get really confused by it and there is no really good tutorial that would describe how proper workflow should look like. I know this can be really confusing, so I created this article.

Initial setup:

You need to update git variables. Type:
git config --global user.email "YOUREMAIL@gmail.com"
git config --global user.name "YOURNICKINGIT"
git config --global core.editor vim

This is so that other people can associate your changes to code with you. You will understand later on when we get to git log.

If you don't have vim you have to install it. In ubuntu type (if no ubuntu, I assume you are geeky enough to be able to google that).:

sudo apt-get install vim

Setting up a repository

Use github - this is a really nice website and I really recommend using it as it has a really nice interface and allows incredibly easy sharing. The downside of that that they make you pay if your project is not public.
DIY - you can set up a repository on ssh server that other people would be able to fork on their own computers and edit it. In this tutorial I follow a model of single master repo and multiple edit repos, which is very simple and powerful. So steps are:
1. Log in to your ssh server and go to directory where you want to set up a master git repository.
2. type
```
git init --shared=group project
```
  This will create a directory named project. Inside this directory there is a hidden folder .git which keeps all of the git state. All files in project folder are subject to version control by git.
3. There is some necessary initialization work that needs to be done. Don't worry about it for now (for curious: to be able to successfully clone git repository and be able to push, it has to be nonempty; it also must not be checked out to branch to which you are pushing; we will be pushing to branch master, so it is going to be checkout to devs branch).
```
cd project/ && touch TODO && git add TODO && git commit -a -m"Initial commit" && git checkout -b dev 
```
4. Now this is your master repo. Ideally you should never explicitly modify it from the inside. Instead you should take your own local copy and push to it. Also make sure that all your friends have rights to modify it (if they use different user to log in to the server - this is advanced if you do not know unix permissions yet; if you know it you will be able to do it).

To get the local copy of repo

git clone <address>

where address can be for example:
local directory - for example:

/home/memyselfandi/projects/massdestructionofbananalovers

remote directory on ssh server - it has to be structured like this:

ssh://username@server:pathtorepo

for example:

ssh://friendofmemyselfandi@secretgovermentorganisation.gov:/home/memyselfandi/projects/massdestructionofbananalovers

This will create a folder with the name of your project containing the repository.

Convention: In all the commands described it is assumed that you are inside your local repo.

To work with code:

git branch awesome_feature

Creates a new branch copying a contents of current one. Branch can be thought of as a new direction in which project is going (≈feature or set of features). If you type git branch you will see list of all the branches that you have on your local repo.
To understand branches better imagine, you are to add new feature to a project without git. What you would do is copy a directory with your project and work on a copy, because you will be worried that you will screw up the original. That is exactly what branches do. You can have many branches to work on many features in parallel.
There is one special branch - master - it is a useful convention to keep this one unchanged. That way you can always keep the newest set of changes committed to master repository there.
Observe the star near currently used branch. We want to change it to our newly created branch, so:

git checkout awesome_feature

- changes branch to awesome_feature. Branches are meant to develop logically independent changes. If you are working on two features in parallel, create additional branch. Be careful not to create it from awesome_feature, because it would make a copy of it rather than master (unless your branch diverged into two different directions - but even in this case you would be probably better of first pushing your change to master, possibly disabled with a boolean flag and then starting two new copies of updated master).

Now is the time do actual coding. Write whatever you want and once you are sure a logical part of your code is done clean up the code (what you commit will be shared with your friends and you don't want to make mess on their computer, do you?). Now type git status. What you will see will look something like:

# On branch awesome_feature
# Changed but not updated:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified:   TODO
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
# bananalover_decapitator.cpp
# banana_poisoner.cpp

While TODO is file that already existed and you modified it, bananalover_decapitator.cpp is completely new file. By default git will just ignore new files, so you have to add them to git's consideration.

git add <files>

- adds a file to gits consideration. Make sure you add all the files you want to share with your friends. Make sure it worked by typing git status again and checking files is listed under new files. Also you can add entire directories to git in one go. This is a sensible thing to do if you start version control of a previously uncontrolled project. Sometimes however that would also add some files you would prefer to ignore. This a actually really common situation, so git let's you specify, things to be ignored. They are located in a specific file in located in your repository:

$ cat .gitignore

# Can ignore specific files
photo_of_red_banana

# Use wildcards as well
*~
*.swp

# Can also ignore all directories and files in that directory.
build/**/*

git commit -a

Commits your changes as a part of current branch (awesome_feature). It will open your file editor - in this case vim. You will be asked to type a short message describing your set of changes. Make sure is descriptive enough for your friends to understand what you did. Editor you are using is vim, so hit i (letter 'i' on keyboard) to start typing. Type your message normally. Hit esc and then ":wq" + enter (without quotes) to save your text file (if you need to know more on vim, just google it). Now if you type git log you should see your change on top (q to exit if list of changes is long). Now it is crucial to understand what just happened. Your modifications used to be only saved as changes to local directory. But committing them you you added them to your branch on your local computer. One commit is a smallest unit that git can manage. You can merge your commits with commits of your friends, revert to the state after a particular commit etc.
Now that you saved your work on your computer, it's time to share with your friends, but first...

make sure you are not conflicting with newest changes (made by your friends):

git checkout master # go to branch master
git pull # download the newest changes from master branch 
git checkout awesome_feature # back to your branch
git rebase master # magically merge newest updates by your friends with your code.

conflict resolution: Sometimes it might happen that you and your friend were editing the same file in a similar place. Should that happen you will have to resolve a conflict. Git makes resolving conflicts really easy! In fact most of them are just resolved automatically. If that didn't happen git will output a message. To fix it: type git status to see which files include conflicts. For each of them individually you have to open them and merge the changes. Git will list two different versions it found in this file:

<<<<<<< HEAD:banana_bomb.cpp
kill_banana_lovers();
=======
kill_banana_lovers_and_their_families();
>>>>>>> 7b08cdcedc68edc6c5e87d7dc5ed7:banana_bomb.cpp

Once you resolved conflicts, you have to git add again all the files that use to include conflict. Once this is done you may continue your rebase: git rebase --continue

Now, if there were any changes that you merged with, make sure that merging didn't break anything. If it did, see and try if you can fix it by yourself, or consult a friend with whose changes you were merging.

All the changes that you made have to be committed again. Sensible thing to do is to add them as a part of your last commit. You can quickly do that by using git commit -a --amend

Now share it! Push your code from your local branch awesome_feature to branch master on master repository.
git push origin awesome_feature:master
This will push where ever your origin is (remote or local).

Additional features

git blame <file>

Displays names of people who last modified each line in a given files.

git diff

Displays uncomitted changes. You can also compare files between different revisions (states after different commits).

git reset --hard <commit_hash>

Resets current branch to given commit. In case smb screwed up. In case the change that is bad was not the last change, you are most probably better off writing new commit undoing the changes.

git stash

When you want to switch to different branch, but don't want to commit changes or want to transfer small set of changes between branches without committing and rebasing, just stash them - use responsibly.

git bisect

No sure which commit caused the failure? Use git bisect to binary search for it.

git reflog

Git is so meta - it even version controls what happens with itself. To display recent actions in git use git reflog. Each action also has its own hash, to which you can reset but using git reset --hard <action_hash>

Working with server-side code
Some people think that it is impossible to work on local repository if you are working on server side code, as you need to have access to server resources. This is only partially true. In most cases you can run local copy of server and access resources with hard state like SQL database remotely. Using PHP+MySQL as an example you can have separate file locally and on server specifying how to connect to database. You include in PHP files. Just gitignore that file, so that git does not change it around. Run apache at your local computer to execute PHP code.
Make sure you gitignore all your local configurations.

That is the end of this simple git usage tutorial. In general you have lot more interesting problems and features with git, but this actually covers more of the cases for small and medium sized projects.

Good luck and remember - with great power comes great responsibility.

Also don't be afraid to experiment with git - you always can use reflog to repair your mess.

Please comment if you find any bug in this tutorial. It is one of the longest things I ever wrote, so it probably has some.

Thursday, October 13, 2011

Scaling coordinates

Sometimes we want to iterate over all coordinates of points in some space. In many cases the distance between points is not important, what is important is their relative order of occurrence along axis. Then we can scale {-5, 0, 5} to {0,1,2} for example along x-axis. Quite often this makes it possible to implement a static interval tree along the axis, which can be useful for some application. Here is really nice an compact code in C++ that achieves that:

vector < int > scale(vector < int > v) {
    vector < int > X = v;
    map < int, int > M;
    sort(X.begin(), X.end());
    X.erase(unique(X.begin(), X.end()), X.end());
    int nc = 0;
    for (int i = 0; i < X.size(); ++i) M[X[i]] = nc++;
    for (int i = 0; i < v.size(); ++i) v[i] = M[v[i]];
}

Hope you will find this one helpful!

Friday, July 29, 2011

Snack for your tired mind no 1

What does computer scientist, mathematician and humanist do when they see a pretty girl?

Computer Scientist starts a chat, tries to be funny, get her to know and go on a date.

Mathematics starts a casual chat, get her to know and gets her phone number.

Humanist completely don't know how to behave, so he starts thinking about the reasons, because of which computer scientist or mathematician can't date her.

Wednesday, July 27, 2011

Algorithms on strings - XVIII Polish CS Olympiad - Difference (4/10)

This was quite a fun question. Although solution turned out to be simple it was a little bit unconventional and it took me a couple of minutes to work out the solution.

You can find the english version of the text as well as the checker here:

http://main.edu.pl/en/archive/oi/18/roz

So first natural approach to this problem is to analyse all subsequences separately. There are $\Theta(n^2)$ of them. For each sequence we just count for each letter how many times it occurs in this sequence and then find a minimum and maximum. As sequence can have pessimistically about $\Theta(n)$ length, the whole algorithm complexity is $\Theta(n^3)$. This is way is sufficient for short sequence but way to slow for $1000000$ element sequence.

The Maximum Sum contiguous subsequence problem

You should probably know this problem before you proceed. If you already know it you can skip this section. The problem is the following:
You are given the sequence of integers (can be negative). Find such a contiguous subsequence, that sum of it's elements is maximum possible.
For example if we consider sequence

2 1 -5 4 -2 4 -1 -2 -3 5

The answer is subseqence 4 -2 4 having sum 6.

Now assuming we have $n$ elements this problem can be solved in $\Theta(n)$. At this point you might want to try to work it out on your own.

As this problem is quite well described in the internet I won't give exhaustive description. Here's how you can approach it. As the target algorithm is $\Theta(n)$ it's sort of a clue that we scan the sequence only once. OK, let's try left to right maybe this will work out. The simplest idea that comes to my head is to calculate sum of the prefixes and see what happens.

Seq.	2	1	-5	4	-2	4	-1	-2	-3	5
Sum	2	3	-2

[Best so far: 5]

And look what happened now! We have negative prefix sum. This means that we don't want to add this prefix (2 1 -5) to our final result (there's no benefit at all!). But what about (1 -5) or (-5) ? Both of them sums to negative number and therefore are worthless. So let's totally forget about first three numbers. We will indicate that in our calculations by replacing -2 with 0 in our table.

Seq.	2	1	-5	4	-2	4	-1	-2	-4	5
Sum	2	3	0	4	2	6	5	3	-1

[Best so far: 6]

Ok now it happened again. In case somebody's lost we are currently considering subsequence (4 -2 4 -1 -2 -4). Now again let's notice that however many terms you strip from the beginning of this sequence we will never exceed -1. Let's forget about all of them then analogous to above.

Seq.	2	1	-5	4	-2	4	-1	-2	-4	5
Sum	2	3	0	4	2	6	5	3	0	5

[Best averally: 6]

Voille! The result is 6.
But hold on! How did it happened? Each time we were able to forget about whole prefix totally. What if it was say 1 -3 -3 4. It sums to -1 so we would get rid of all of it. But we can strip first three terms and get 4 alone, which would be quite useful in our future calculations. What now? Is the algorithm wrong? Did we just choose lucky example above?
No! Think about it! Here's the trick! We would never be considering such a prefix in our algorithm! After reading 1 and then -3 we would get rid of them (because they sum to negative value) and passed on 0. Then we read -3, which we also get rid of and pass on 0. Then we read 4 and pass it on. Hey, it works!

This way we ended up with linear time algorithm. We just do one pass and always get rid of currently considered terms once we get a negative sum. Also we don't have to think about the terms explicitly. Just change the variable holding sum so far to 0 once it's negative. Simple, isn't it?

// sequence contains 1-based n-element sequence. 
int res = 0, best = 0; 
for (int i = 1; i <= n; ++i) {
  res += sequence[i];
  if(res > best) best = res;   
  if(res < 0) res = 0;
}

For geeks: More formally we can prove this algorithm is correct inductively on the sequence length. The hypothesis is that after ith iteration of loop variable res contains maximum sum that subsequence (may be empty) finishing at i-th index can give. This indeed true. Assuming it's true for i-1. Now assume that to get better result for i we need to use other sequence that the one from i-1. Then it would be better for i-1 as well if we stripped the last term. Contradicion with the fact that we had optimal answer for i-1. Also we didn't take into account that in particular we can just have an empty sequence. It gives 0 so it's worth doing only if otherwise we would get negative number.

Back to reality now - the problem is quite complicated so let's solve simpler version first. Let's assume that magic fairy told us that in the subsequence which is the answer the letter that occurs the most is 'b' and the letter that occurs the least is 'a'. Problem is to calculate such a subsequence that difference between number of occurrences of 'b' and 'a' (if there are more 'a' then it would be negative in this case) is maximum possible. Now - can we do that?

I seriously recommend that you try to work it out on your own given the solution to the problem in yellow box before continuing. It's important at least to try, you don't have to succeed!
Here's the solution. Let's call the letter that occurs the biggest number of times the big letter and the one that occurs the least number of times the small letter . Hence, each big letter is sort of like +1, right? Each small letter is like a -1. What about rest? Completely unimportant so let it be 0. Now we just have to run the algorithm for maximum sum subsequence, right? Not quite, but close. If we read the question in detail we will notice the sentence "We are assuming that the least frequent letter has to occur at least once in the resulting fragment". So we can just check if in current sequence there is such a letter. The code would look sort of like this.

int calculate_maximum_difference(char small_letter, char big_letter) {   
  int res = 0, best = 0;   
  bool has_small = false;   
  for(int i = 0; i < n; ++i) {
      if (sequence[i] == small_letter) {
        --res;
        has_small = true;
        if(res < 0) {
          has_small = false;
          res = 0;
        } 
      } else if (sequence[i] == big_letter) {
        ++res;
      }
      if(best < res && has_small) best = res;
  }
  return best;
}

OK, so this code have to work! No? Of course not. Consider example:

baaaa

where b is small and a is big. Now we read one b then get rid of it, then read 4 a's but make no use out of them because we do not have a small letter. So sometimes it might be worth keeping the a letter even if it subtracts one from final result. But on the other hand in this example:

baaaabaaaaa

We definitely don't want to consider first b. So when to do what? I don't care! Let's just test each case separately. Our function will have additional boolean parameter discard, that will determine whether this time we are discarding the small letter or not. OK great, let's change this to general problem solution! We just consider each pair of letters. The worst that could happen if we consider wrong pair is that we will get result worse than optimal, but we don't really care because at some point we will encounter the optimal result. The full code of this solution is presented below:

#include <cstdio> 
#include <vector> 
#include <algorithm> 
using namespace std;  

int n; 
char sequence[1000006]; // always add some more just in case :)  

int calculate_maximum_difference(char small_letter, 
                                 char big_letter, 
                                 bool discard) {   
  int res = 0, best = 0;   
  bool has_small = false;   
  for(int i = 0; i < n; ++i) {
    if (sequence[i] == small_letter) {
      --res; has_small = true;
      if(discard) {
        if(res<0) { res = 0; has_small = false; }
      } else {
        if(res<-1) { res = -1; }
      }
    } else if (sequence[i] == big_letter) {
      ++res;
    }
    if(best < res && has_small) best = res;
  }
  return best;
}


int main() {
  scanf("%d", &n);
  scanf(" %s", sequence); // one of the fastest ways to read a string
  int best = 0;
  for (char i = 'a'; i <= 'z'; ++i) {
    for(char  j = 'a'; j <= 'z'; ++j) {
      if(i==j) continue;
      best = max(max(calculate_maximum_difference(i, j, true),
                 calculate_maximum_difference(i, j, false)), 
             best);
    }
  }
  printf("%d\n", best);
}

This is working solution. The only disadvantage of this solution is that it's to slow to satisfy the task requirements (it get's 60/100 points in tester). Let's think about a complexity. Let's denote the alphabet size by $\Omega$. For each unordered pair we do single scan through entire sequence. This is $\Theta(\Omega^2 n)$. It's quite a lot. Looking at the actual numbers ($26 * 26 * 1 000 000 = 676 000 000$). Another optimization is quite straightforward. Guessing the appropriate pair of letters sound difficult and non-trivial, so let's think what we can actually do to make calculate_maximum_difference run faster. So what's the actual thing? Oh, come on it's easy. Thinking... Thinking... ... Got it! We can remove every letter from considered string that is nor small, neither big. For example if a is big and b is small then string:

abcsdsfbabgfabgab

is equivalent to:

abbababab

But how to do it efficient? We can just store for each letter it's indexes in the input string. Then we can create strings looking just at the two necessary letters rather than whole string. So after another bit of thinking we end up with the code:

#include <cstdio> 
#include <vector> 
#include <algorithm> 
#include <string> 
using namespace std;  

int n; 
char sequence[1000006]; // always add some more just in case :) 
vector<int> letter_positions[256]; // letter_positions['a'] is 
                                         // a vector of all the indexes 
                                         // at which letter 'a' occurs 
                                         // in big string.  
int calculate_maximum_difference(const string& text, 
                                 char small_letter, 
                                 char big_letter, 
                                 bool discard) {   
  int res = 0, best = 0;   
  bool has_small = false;   
  for(int i = 0; i < text.size(); ++i) {
      if (text[i] == small_letter) {
        --res; has_small = true;
        if(discard) {
          if(res<0) { res = 0; has_small = false; }
        } else {
          if(res<-1) { res = -1; }
        }
      } else {
        ++res;
      }
      if(best < res && has_small) best = res;
  }
  return best;
}


int main() {
  scanf("%d", &n);
  scanf(" %s", sequence); // one of the fastest ways to read a string
  for(int i=0; i < n; ++i) 
    letter_positions[sequence[i]].push_back(i);
  int best = 0;
  for (char i = 'a'; i <= 'z'; ++i) {
    for(char  j = 'a'; j <= 'z'; ++j) {
      if(i==j) continue;
      string text; // we will store a text that only contains i and j 
                   // letters preserving the original order
      vector<int>::iterator small = letter_positions[i].begin(), 
                                  big = letter_positions[j].begin();
      while(small != letter_positions[i].end() || 
            big != letter_positions[j].end()) {
        if(big == letter_positions[j].end() ||
           (small != letter_positions[i].end() && *small < *big)) {
          text.push_back(i); ++small;
        } else { 
          text.push_back(j); ++big;
        }
      }
      best = max(max(calculate_maximum_difference(text, i, j, true),
                 calculate_maximum_difference(text, i, j, false)), 
             best);
    }
  }
  printf("%d\n", best);
}

This time when we send it, we wait, wait, wait and yes!!!!! It gets a 100/100. But why? Was the optimization that good? What is the complexity now? Hmm... hard to tell. But intuitively, the biggest amount of work will be done if we do substantial amount of work for each pair of letter. How to construct such a case? Let's have equally many letters of each kind. So we have $\Theta(n/ \Omega)$ letters of each kind. For each of $\Omega^2$ pairs we do approximately $\Theta(n/ \Omega)$ operations. So the final complexity is $\Theta(\Omega n)$. That's about $26000000$ operations, which is not that bad, assuming we keep our code tidy. So yeah, it works! Yeey!

Bugs in my mind: (in this section i describe the problems I had while solving this problem):

First of all it was hard to came up with the right ideas. I kept thinking about standard approaches to such problems: binary search length of the substring, crawl along it, do some other standard techniques. Only after about 20 minutes it stroke me that we actually can do sth for each pair and consider only relevant letters ant there for decrease complexity by a factor of $\Omega$. Then rest followed.
I also spend quite some time on getting the discarding/non-discarding to work. I was trying some more general approaches and each of them failed. Now it seems trivial, but it took me about an half an hour of tries and looking at the tests that I kept getting wrong answer on to actually fix that.
The last very, very stupid error I made that was about 15 minute to debug is that I created vector instead of vector initially.

Monday, July 25, 2011

Number Theory - Concrete Mathematics 4.9 (2/10)

The text of the question is:

Show that the number $(3^{77}-1)/2$ is odd and composite.

The hint proposed in the question text is to compute $3^{77} \mod\ 4$ (mod means division reminder: $ 13 \mod\ 5\ =\ 3$). Let's do that then. The useful think is that there is only a small number of possible reminders mod 4 (four exactly), so we if we calculate first, second, third etc. power of 3 we will have some cycle:
$ 3 \mod\ 4\ =\ 3$
$ 3^2 \mod\ 4\ =\ 1$
$ 3^3 \mod\ 4\ =\ 3$
$ 3^4 \mod\ 4\ =\ 1$
...
Bum! Pattern spotted. Therefore: $3^{77} \mod\ 4\ =\ 3$. Now by simple transform:
$3^{77}-1 \mod\ 4\ =\ 2$
$(3^{77}-1)/2 \mod\ 2\ =\ 1$
So the number is odd. If any of above transformations confuses you please read about modular arithmetic. The prove that it's composite you have to know some of the less popular quick multiplication formulas:

In our case formula is a little simpler:

$a^n-1 = (a-1)(a^{n-1}+ .. + 1)$

Now how to use it? $a = 3$ gives factor of 2 but we divide by two in the formula, so that's not helpful, right?

Right?

Wrong!

And I have to admit a fall for this a the beginning, but after consultation with a friend I went back on the right path:
Let's consider $ a = 3^{7}$. It gives us
$(3^7)^{11}-1 = (3^7-1)*((3^7)^{10} + (3^7)^9.. + 1)$
Where X is some integer greater than one. So then
$( (3^{77}-1)/2 = (3^7-1)/2 * ((3^7)^{10} + (3^7)^9.. + 1) $, so it's composite. Eureca!

OK. I have to admit this question was boring. I am sorry I picked up question at random just to write some test thing on blog. It was a little challenging but come on, who likes tedious arithmetic?

LateX