Revision as of 18:57, 3 May 2019


Project
DSL Encoding
Project Information
Has title DSL Encoding
Has owner Hiep Nguyen
Has start date 2019/04/26
Has deadline date
Has project status Active


Approach

Currently, I am thinking about using one-hot vectors to encode the structure of a DSL page. The author of the pix2code project took the same approach. However, the preprocessing step is not discussed in detail in the paper, and the source code is not well commented. This article gives more detailed instructions for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated GitHub page can be found here

File and scripts

The current scripts, which I wrote by following the pix2code source code, live in

E:/projects/embedding

So far, I have been experimenting with only one simple DSL file, '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), run

python convert_gui.py

Implementation

One-hot encoding represents a word or token as a vector that is all zeroes except for a single 1. The length of the vector equals the number of unique tokens in the DSL file, and the position of the 1 identifies the token. Let's look at a concrete DSL file from pix2code as an example. The process is as follows:

# Read the DSL file line by line, stripping newlines and braces
tokens = []
with open('00CDC9A8-3D73-4291-90EF-49178E408797.gui') as gui:
    for line in gui:
        line = line.strip('\n').strip('}').strip('{')
        tokens.append(line)
        print(line)

What we just did is open a DSL file, go through every line, strip some symbols, and store the resulting tokens in a list. Each line of the file becomes one token, and a line that held only a brace becomes an empty string. The tokens variable now looks something like this:

tokens

['header ',
'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
'',
'row ',
'quadruple ',
'small-title, text, btn-orange',
'',
'quadruple ',
'small-title, text, btn-red',
'',
'quadruple ',
'small-title, text, btn-green',
'',
'quadruple ',
'small-title, text, btn-orange',
'',
'',
'row ',
'single ',
'small-title, text, btn-green',
'',
''
]
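The tokenizing step can also be tried without the original file. The following is a minimal sketch, assuming a short DSL snippet held in memory; SAMPLE_GUI and the tokenize helper are hypothetical names introduced here for illustration and are not part of the pix2code source.

```python
import io

# Hypothetical, shortened DSL snippet in the pix2code style (an assumption,
# not the contents of the .gui file used in this article)
SAMPLE_GUI = """header {
btn-inactive, btn-active
}
row {
single {
small-title, text, btn-green
}
}
"""

def tokenize(stream):
    """Return one token per line, with newlines and braces stripped."""
    tokens = []
    for line in stream:
        tokens.append(line.strip('\n').strip('}').strip('{'))
    return tokens

tokens = tokenize(io.StringIO(SAMPLE_GUI))
print(tokens)
```

Because only the brace is stripped, block names such as 'header ' keep their trailing space, and closing-brace lines turn into empty strings.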

Now, based on this list, we can collect the unique tokens (the vocabulary) and see how many there are:

chars = sorted(list(set(tokens)))

which results in

['',
'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
'header ',
'quadruple ',
'row ',
'single ',
'small-title, text, btn-green',
'small-title, text, btn-orange',
'small-title, text, btn-red']
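To make the difference between the total number of tokens and the number of unique tokens explicit, here is a small sketch that counts both. The token list is copied from the example output in this article; the use of collections.Counter is an addition for illustration only.

```python
from collections import Counter

# Token list copied from the example output in this article
tokens = [
    'header ',
    'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
    '',
    'row ',
    'quadruple ', 'small-title, text, btn-orange', '',
    'quadruple ', 'small-title, text, btn-red', '',
    'quadruple ', 'small-title, text, btn-green', '',
    'quadruple ', 'small-title, text, btn-orange', '',
    '',
    'row ',
    'single ', 'small-title, text, btn-green', '',
    '',
]

chars = sorted(set(tokens))   # unique tokens, i.e. the vocabulary
counts = Counter(tokens)      # how often each token occurs

print(len(tokens), len(chars))  # 22 lines in total, 9 unique tokens
print(counts.most_common(2))
```

The empty string is the most frequent token here, since every closing brace maps to it.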

As we can see, there are 9 unique elements in this example, so each vector has length 9. Next, we assign a number to each symbol; that number is the index at which the symbol's 1 appears in the vector.

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

This results in

char_indices
{'': 0,
'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive': 1,
'header ': 2,
'quadruple ': 3,
'row ': 4,
'single ': 5,
'small-title, text, btn-green': 6,
'small-title, text, btn-orange': 7,
'small-title, text, btn-red': 8}

Hence, for a line with the token 'header ' (note the trailing space), the one-hot representation is [0,0,1,0,0,0,0,0,0]. The 1 sits at index 2, counting from zero, which is the index assigned to 'header ' in char_indices.
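The single-token case can be sketched in plain Python as follows. The one_hot helper is a hypothetical name introduced for illustration; the vocabulary is the nine-element list shown earlier in this article.

```python
# Vocabulary copied from the sorted list shown earlier in this article
chars = [
    '',
    'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
    'header ',
    'quadruple ',
    'row ',
    'single ',
    'small-title, text, btn-green',
    'small-title, text, btn-orange',
    'small-title, text, btn-red',
]
char_indices = {c: i for i, c in enumerate(chars)}

def one_hot(token, char_indices):
    """Return a list of zeroes with a single 1 at the token's index."""
    vec = [0] * len(char_indices)
    vec[char_indices[token]] = 1
    return vec

header_vec = one_hot('header ', char_indices)
print(header_vec)
```

Note that the lookup is exact: 'header' without the trailing space is not in the vocabulary and would raise a KeyError.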

Now, let's apply this embedding rule to our GUI file

import numpy as np

# Each line of the DSL file is treated as one "sentence"
sentences = list(tokens)
# One row per line of the file, one column per unique token
one_hot_vector = np.zeros((len(sentences), len(chars)))
for t, char in enumerate(sentences):
    one_hot_vector[t, char_indices[char]] = 1

The matrix that represents our GUI looks like this, with one row per line of the file and one column per unique token:

array([[0., 0., 1., 0., 0., 0., 0., 0., 0.],
      [0., 1., 0., 0., 0., 0., 0., 0., 0.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.],
      [0., 0., 0., 0., 1., 0., 0., 0., 0.],
      [0., 0., 0., 1., 0., 0., 0., 0., 0.],
      [0., 0., 0., 0., 0., 0., 0., 1., 0.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.],
      [0., 0., 0., 1., 0., 0., 0., 0., 0.],
      [0., 0., 0., 0., 0., 0., 0., 0., 1.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.],
      [0., 0., 0., 1., 0., 0., 0., 0., 0.],
      [0., 0., 0., 0., 0., 0., 1., 0., 0.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.],
      [0., 0., 0., 1., 0., 0., 0., 0., 0.],
      [0., 0., 0., 0., 0., 0., 0., 1., 0.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.],
      [0., 0., 0., 0., 1., 0., 0., 0., 0.],
      [0., 0., 0., 0., 0., 1., 0., 0., 0.],
      [0., 0., 0., 0., 0., 0., 1., 0., 0.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.],
      [1., 0., 0., 0., 0., 0., 0., 0., 0.]])
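As a sanity check, the encoding can be reversed with indices_char to recover the original tokens. The sketch below assumes a shortened, hypothetical token list rather than the full file, and it builds the matrix by indexing an identity matrix, which is an alternative to the explicit loop used in this article rather than the pix2code method.

```python
import numpy as np

# Hypothetical shortened token list (an assumption for illustration)
tokens = ['header ', 'btn-inactive, btn-active', '',
          'row ', 'single ', 'small-title, text, btn-green', '', '']

chars = sorted(set(tokens))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode: pick row i of the identity matrix for the token with index i
ids = np.array([char_indices[t] for t in tokens])
one_hot = np.eye(len(chars))[ids]

# Decode: the position of the 1 in each row is the token index
decoded = [indices_char[int(i)] for i in one_hot.argmax(axis=1)]

print(one_hot.shape)
print(decoded == tokens)
```

If the decoded list matches the token list, the two dictionaries and the matrix are consistent with each other.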