edegan.com - User contributions [en]

http://www.edegan.com/mediawiki/api.php?action=feedcontributions&feedformat=atom&user=Hiep edegan.com - User contributions [en] 2026-08-02T08:08:47Z User contributions MediaWiki 1.34.2 http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25804 DSL Encoding 2019-05-30T15:40:29Z

<p>Hiep: </p> <hr /> <div>{{Project
|Has title=DSL Encoding
|Has owner=Hiep Nguyen
|Has start date=2019/04/26
|Has project status=Active
}}
==Approach==
Currently, I am planning to use one-hot vectors to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project took the same approach. However, the paper does not discuss the preprocessing step in detail, and the source code is sparsely commented. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives more detailed instructions for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated GitHub page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here].

==File and scripts==
The current scripts, which I wrote by following the pix2code source code, live in
 E:/projects/embedding
So far, I have been experimenting with only one simple DSL file, '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), run
 python convert_gui.py

== Explanation and Implementation==
One-hot encoding represents a word or token as a vector that is all zeroes except for a single 1, where the length of the vector equals the number of unique tokens in the DSL file. Let's look at a DSL file from pix2code as an example. The process is as follows:
 # Read the DSL file line by line; each stripped line becomes one token.
 tokens = []
 with open('00CDC9A8-3D73-4291-90EF-49178E408797.gui') as gui:
     for line in gui:
         line = line.strip('\n').strip('}').strip('{')
         tokens.append(line)
         print(line)

What we just did is open a DSL file, go through every line, strip the newline and brace symbols, and store the resulting tokens in a list.
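The tokenization step above can also be tried without the real .gui file. The following is a minimal sketch, assuming a tiny made-up DSL snippet in the pix2code style; the names SAMPLE_DSL and tokenize_dsl are hypothetical and are not part of convert_gui.py.

```python
import io

# Hypothetical miniature DSL text in the pix2code style (an assumption, not the real file).
SAMPLE_DSL = """header {
btn-inactive, btn-active
}
row {
single {
small-title, text, btn-green
}
}
"""

def tokenize_dsl(lines):
    """Return one token per DSL line, with the newline and braces stripped."""
    tokens = []
    for line in lines:
        # Same stripping order as in the walkthrough: newline, then '}', then '{'.
        tokens.append(line.strip('\n').strip('}').strip('{'))
    return tokens

# io.StringIO stands in for the opened file so the sketch is self-contained.
tokens = tokenize_dsl(io.StringIO(SAMPLE_DSL))
print(tokens)
```

In this sketch a line that holds only a closing brace reduces to an empty token, and an opening line such as "header {" keeps its trailing space, which is why "header " and "header" would count as different tokens.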
The ''tokens'' variable now looks something like this:
 tokens
 [
 'header ',
 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
 ' ',
 'row ',
 'quadruple ',
 'small-title, text, btn-orange',
 ' ',
 'quadruple ',
 'small-title, text, btn-red',
 ' ',
 'quadruple ',
 'small-title, text, btn-green',
 ' ',
 'quadruple ',
 'small-title, text, btn-orange',
 ' ',
 ' ',
 'row ',
 'single ',
 'small-title, text, btn-green',
 ' ',
 ' '
 ]

Now, based on this list, we can collect the unique tokens with
 chars = sorted(list(set(tokens)))

which results in
 [
 ' ',
 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
 'header ',
 'quadruple ',
 'row ',
 'single ',
 'small-title, text, btn-green',
 'small-title, text, btn-orange',
 'small-title, text, btn-red'
 ]

As we can see, there are 9 unique elements in this example, which means each vector has length 9. Now we need to assign a number to each token; that number is the index of the token's position in the vector.
 char_indices = dict((c, i) for i, c in enumerate(chars))
 indices_char = dict((i, c) for i, c in enumerate(chars))

This results in
 char_indices
 {' ': 0,
 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive': 1,
 'header ': 2,
 'quadruple ': 3,
 'row ': 4,
 'single ': 5,
 'small-title, text, btn-green': 6,
 'small-title, text, btn-orange': 7,
 'small-title, text, btn-red': 8}

Hence, if we have a line with the token 'header ', its one-hot representation is [0,0,1,0,0,0,0,0,0].
There is a '1' at index 2, which is the index assigned to 'header ' in ''char_indices''.

Now, let's apply this embedding rule to our GUI file:
 import numpy as np
 sentences = list(tokens)  # one entry per DSL line
 # One row per DSL line, one column per unique token.
 one_hot_vector = np.zeros((len(sentences), len(chars)))
 for t, char in enumerate(sentences):
     one_hot_vector[t, char_indices[char]] = 1

The matrix that represents our GUI file (one row per line) looks like this:
 array([[0., 0., 1., 0., 0., 0., 0., 0., 0.],
 [0., 1., 0., 0., 0., 0., 0., 0., 0.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 1., 0., 0., 0., 0.],
 [0., 0., 0., 1., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 1., 0.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 1., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 1.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 1., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 1., 0., 0.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 1., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 1., 0.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 1., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 1., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 1., 0., 0.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [1., 0., 0., 0., 0., 0., 0., 0., 0.]])

==Proposed Training Model==
Given the preprocessing described above, one way we can train our model to detect all the markup in an HTML page is as follows.

(1) First, store all the tokens in our training data in two Python dictionaries for later use: the first has the format {'token_index':'token'} and the second has the format {'token':'token_index'}.

(2) Decide a proper length for each training point.
For instance, we can set the max length of a training point to 20 lines; pages that contain fewer than 20 lines will then be padded with zeroes. For example, a file with 17 lines will be padded with all-zero rows [0, 0, 0, ..., 0] for the remaining 3 rows.

(3) Apply one-hot encoding, as described in the previous section, to the whole training dataset. Each data point (one DSL file) will then have shape [max length, number of unique tokens].

(4) Define y. We are trying to predict the next token given the previous tokens, so our label y is the one-hot representation of the token that follows the sequence in the training data.

(5) Use an LSTM, or possibly a bidirectional LSTM, to train on the data using mini-batches with the Adam optimizer. Each batch that goes into the model will have shape [batch size, max length, number of unique tokens].

A sample LSTM cell in TensorFlow (1.x API; tf.contrib is not available in TensorFlow 2) is as follows:
 import tensorflow as tf
 def lstm_cell(keep_prob):
     '''
     Define one single lstm cell
     args:
         keep_prob: tensor scalar
     '''
     # num_units is the number of hidden units in the LSTM cell (defined outside this function).
     if tf.test.is_gpu_available():
         lstm = tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(num_units)
     else:
         lstm = tf.nn.rnn_cell.LSTMCell(num_units, forget_bias=1.0)
     # Apply dropout to the cell output in both cases.
     lstm = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
     return lstm

Then, we unroll the cell over the sequence to build our network. tf.nn.dynamic_rnn does this for us (internally it runs the cell inside a tf.while_loop).
The sample code is
 def lstm_network(x, W, b, keep_prob):
     '''
     define stacked cells and prediction
     x: data with shape [batch_size,max_len,len_unique_tokens]
     W, b: output-layer weights and bias that map the final hidden state to token scores
     '''
     lstm = lstm_cell(keep_prob)
     outputs, states = tf.nn.dynamic_rnn(lstm, x, dtype=tf.float32)
     # states.h is the hidden state after the last time step.
     prediction = tf.add(tf.matmul(states.h, W), b, name='prediction')
     return prediction

If we want to stack multiple LSTM layers, we can replace '''lstm=lstm_cell(keep_prob)''' with '''lstm= tf.contrib.rnn.MultiRNNCell([lstm_cell(keep_prob) for _ in range(num_layers)])''', where '''num_layers''' is an integer giving the number of LSTM layers we want. In that case ''states'' is a tuple with one state per layer, so the prediction should use '''states[-1].h''' instead of '''states.h'''.

A sample training script lives in
 E:\projects\embedding\Web_extractor_model\train_sample.py

In the '''utils.py''' file, there are a few hyperparameters to remember.

max_len: the length (number of lines) of each training point

step: the number of lines to advance the window to generate the next training point

num_units: the number of LSTM hidden units; a safe choice is 128

len_unique_chars: the total number of unique tokens in all training data</div>
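The max_len and step hyperparameters listed above, together with the next-token label from step (4), can be illustrated with a sliding window over one encoded file. This is a minimal sketch of one plausible reading of those parameters, not the code in utils.py or train_sample.py; the function name make_training_points and the values max_len=10 and step=3 are assumptions chosen for illustration, and padding of short files is not handled here.

```python
import numpy as np

def make_training_points(token_ids, vocab_size, max_len, step):
    """Slice one encoded DSL file into (window, next-token) pairs.

    token_ids: list of integer token indices, one per DSL line.
    Returns X with shape [n, max_len, vocab_size] and y with shape [n, vocab_size].
    """
    # Each window needs max_len tokens plus one following token as its label.
    starts = range(0, len(token_ids) - max_len, step)
    X = np.zeros((len(starts), max_len, vocab_size), dtype=np.float32)
    y = np.zeros((len(starts), vocab_size), dtype=np.float32)
    for n, start in enumerate(starts):
        window = token_ids[start:start + max_len]
        X[n, np.arange(max_len), window] = 1.0   # one-hot rows of the window
        y[n, token_ids[start + max_len]] = 1.0   # one-hot of the next token
    return X, y

# Token indices of the 22-line example file under the char_indices mapping shown earlier.
token_ids = [2, 1, 0, 4, 3, 7, 0, 3, 8, 0, 3, 6, 0, 3, 7, 0, 0, 4, 5, 6, 0, 0]
X, y = make_training_points(token_ids, vocab_size=9, max_len=10, step=3)
print(X.shape, y.shape)
```

With these illustrative values the 22-line file yields windows starting at lines 0, 3, 6 and 9. A smaller step gives more, heavily overlapping training points; a file with max_len lines or fewer yields none in this sketch and would need the zero padding described in step (2).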

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25803 DSL Encoding 2019-05-30T15:37:30Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25802 DSL Encoding 2019-05-30T15:28:29Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_generator&diff=25800 DSL generator 2019-05-30T15:13:23Z

<p>Hiep: </p> <hr /> <div>{{Project
|Has title=DSL Generator
|Has owner=Hiep Nguyen
|Has start date=2019/05/16
|Has project status=Active
}}

==Approach==
The most important data we want to capture are the companies' tags, the images' tags, the companies' short descriptions, and the companies' long descriptions. For example, if we look at
 view-source:https://www.boost.vc/companies/

we want to capture the tags associated with the companies' logos, the names of the companies, their short descriptions, and their websites. The goal is to compress all of those into one DSL so that our [http://www.edegan.com/wiki/DSL_Encoding model] can learn the structure that we want.

This [https://metacpan.org/pod/DSL::HTML::Compiler article] may be a good starting point.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25739 DSL Encoding 2019-05-19T21:36:45Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one simple DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py<br /> <br /> == Explanation and Implementation==<br /> One-hot-encoding can be understood as representing a word or token as a vector with a lot of zeroes, where the number of zeroes is equal to the number of unique tokens in the DSL file. Let's look at a DSL file from pix2code as an example. The process is as follows:<br /> gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')<br /> tokens=[]<br /> for line in gui:<br /> line=line.strip('\n').strip('}').strip('{')<br /> tokens.append(line)<br /> print(line)<br /> <br /> What we just did is opening a DSL file, going through every single line, stripping some symbols and storing all the tokens in a list. 
The ''tokens'' variable now looks something like this<br /> tokens<br /> <br /> [<br /> 'header ',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> ' ',<br /> 'row ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> ' ',<br /> 'quadruple ',<br /> 'small-title, text, btn-red',<br /> ' ',<br /> 'quadruple ',<br /> 'small-title, text, btn-green',<br /> ' ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> ' ',<br /> ' ',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> ' ',<br /> ' '<br /> ]<br /> <br /> Now, based on this list, to see the total number of tokens we can do<br /> <br /> chars = sorted(list(set(tokens)))<br /> <br /> which results in<br /> [<br /> ' ',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> 'header ',<br /> 'quadruple ',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> 'small-title, text, btn-orange',<br /> 'small-title, text, btn-red'<br /> ]<br /> <br /> As we can see, we have 9 elements in this example, which means the length of each vector would be 9. Now, we need to assign a number for each of the symbol, and the number will indicate the index of that element in the vector. <br /> char_indices = dict((c, i) for i, c in enumerate(chars))<br /> indices_char = dict((i, c) for i, c in enumerate(chars))<br /> <br /> This results in<br /> char_indices<br /> {' ': 0,<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive': 1,<br /> 'header ': 2,<br /> 'quadruple ': 3,<br /> 'row ': 4,<br /> 'single ': 5,<br /> 'small-title, text, btn-green': 6,<br /> 'small-title, text, btn-orange': 7,<br /> 'small-title, text, btn-red': 8}<br /> <br /> Hence, if we have a line with token 'header', the one-hot representation of it is [0,0,1,0,0,0,0,0,0]. 
There is a '1' at index 3, which indicates that 3 is there.<br /> <br /> Now, let's apply this embedding rule to our GUI file<br /> sentences=[]<br /> for i in range(0, len(tokens)):<br /> sentences.append(tokens[i])<br /> one_hot_vector = np.zeros((len(sentences),len(chars)))<br /> for i, sentence in enumerate(sentences):<br /> for t, char in enumerate(sentences):<br /> one_hot_vector[t, char_indices[char]] = 1<br /> <br /> The vector that represents our GUI will be something like this.<br /> array([[0., 0., 1., 0., 0., 0., 0., 0., 0.],<br /> [0., 1., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 1., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 1., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 0., 1.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 1., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 1., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 1., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 1., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 1., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.]])<br /> <br /> ==Proposed Training Model==<br /> Given the preprocessing described above, one way we can train our model to detect all the marking in a HTML page is as follows.<br /> <br /> (1) First, we will store all the tokens in our training data into two Python dictionaries, where the first one has format {'token_index':'token'} and the second one has format {'token':'token_index'} to use later.<br /> <br /> (2) Decide a proper length for each training point. 
For instance, we can determine the max length of a training point to be 10 lines and pages that contain less than 20 lines will be padded with zeroes. For example, if we decide our max length to be 20, then files with 17 lines will be padded with [0, 0, 0...,0] for the remaining 3 rows.<br /> <br /> (3) Applied one-hot encoding as described in the previous section to all training dataset. When we do so, each data point ( one DSL file) will have shape [max length, number of unique tokens]<br /> <br /> (4) Define y. What we are trying to do is to predict the next token given the previous tokens, so our label y will be the one-hot representation of the token after the sequence from the training data<br /> <br /> (5) Used a LSTM, or possibly bi-directional LSTM to traing the data using mini batches with Adam optimizer. Each batch that goes into the model will have shape [batch size, max length, number of unique tokens]<br /> <br /> A sample LSTM cell in tensorflow is as follows:<br /> import tensorflow as tf<br /> def lstm_cell(keep_prob):<br /> '''<br /> Define one single lstm cell<br /> args:<br /> keep_prob: tensor scalar<br /> '''<br /> if tf.test.is_gpu_available():<br /> lstm = tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(num_units) #num_units is the number of hidden units in the LSTM cell.<br /> else:<br /> lstm = tf.nn.rnn_cell.LSTMCell(num_units,forget_bias=1.0)<br /> lstm=tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)<br /> return lstm<br /> <br /> Then, we applied a tf.while loop through the cell to build our network. tf.nn.dynamic_rnn will do the work. 
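The sample code is<br /> <br /> def lstm_network(x, W, b, keep_prob):<br />     '''<br />     Run one LSTM cell over the sequence and project its final hidden state to a prediction.<br />     x: data with shape [batch_size, max_len, len_unique_tokens]<br />     W, b: output weights and bias; for the matmul below to work, W must have shape [num_units, output size]<br />     '''<br />     lstm = lstm_cell(keep_prob)<br />     outputs, states = tf.nn.dynamic_rnn(lstm, x, dtype=tf.float32)<br />     # states.h is the hidden state after the last time step, shape [batch_size, num_units]<br />     prediction = tf.add(tf.matmul(states.h, W), b, name='prediction')<br />     return prediction<br /> <br /> A sample training script lives in<br /> E:\projects\embedding\train_sample.py</div>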
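Steps (2) to (4) of the training proposal above can be sketched in plain Python as follows. This is only a rough sketch under some assumptions that the page does not settle: the helper names (build_vocab, one_hot, encode_prefix, next_token_pairs) are hypothetical, one training pair is produced per prefix of a file, padding rows are appended at the end, and prefixes longer than max_len keep only their most recent lines. The token lines are a shortened version of the example GUI file.

```python
def build_vocab(files):
    """Map every distinct token line in the corpus to an index, and back."""
    vocab = sorted({token for tokens in files for token in tokens})
    token_to_index = {token: i for i, token in enumerate(vocab)}
    index_to_token = {i: token for token, i in token_to_index.items()}
    return token_to_index, index_to_token


def one_hot(token, token_to_index):
    """Return a list of zeroes with a single 1.0 at the token's index."""
    row = [0.0] * len(token_to_index)
    row[token_to_index[token]] = 1.0
    return row


def encode_prefix(prefix, token_to_index, max_len):
    """One-hot encode a prefix and pad it with all-zero rows up to max_len."""
    # Assumption: if the prefix is too long, keep only the last max_len lines.
    rows = [one_hot(token, token_to_index) for token in prefix[-max_len:]]
    padding = [[0.0] * len(token_to_index) for _ in range(max_len - len(rows))]
    return rows + padding


def next_token_pairs(tokens, token_to_index, max_len):
    """Build (x, y) pairs: x is a padded prefix, y is the one-hot next token."""
    xs, ys = [], []
    for end in range(1, len(tokens)):
        xs.append(encode_prefix(tokens[:end], token_to_index, max_len))
        ys.append(one_hot(tokens[end], token_to_index))
    return xs, ys


# Shortened example file (8 lines, 6 distinct tokens).
tokens = [
    'header ',
    'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
    ' ',
    'row ',
    'single ',
    'small-title, text, btn-green',
    ' ',
    ' ',
]

token_to_index, index_to_token = build_vocab([tokens])
xs, ys = next_token_pairs(tokens, token_to_index, max_len=10)

# [number of pairs, max length, number of unique tokens]
print(len(xs), len(xs[0]), len(xs[0][0]))  # prints: 7 10 6
```

The nested lists can be passed to np.asarray to get an array of shape [number of pairs, max length, number of unique tokens], which matches the batch layout described in step (5).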

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25738 DSL Encoding 2019-05-19T21:36:22Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_generator&diff=25737 DSL generator 2019-05-19T21:08:55Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Generator<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/05/16<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> The most important data we want to capture are the companies' tags, the images' tags, the companies' short descriptions, and the companies' long descriptions. For example, if we look at <br /> view-source:https://www.boost.vc/companies/<br /> <br /> we want to capture the tags associated with the companies' logos, the names of the companies, their short descriptions, and their websites. The goal is to compress all of those into one DSL so that our [http://www.edegan.com/wiki/DSL_Encoding model] can learn the structure that we want.<br /> <br /> This [https://metacpan.org/pod/DSL::HTML::Compiler article] may be a good starting point.</div>
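As a rough illustration of the capture step, the sketch below walks an HTML snippet with Python's standard html.parser and records, in document order, the tags that would carry a company's website, logo, and short description. The markup, the class name CompanyTagCollector, and the choice of tags are all hypothetical; the real structure of the boost.vc page is not reproduced here.

```python
from html.parser import HTMLParser

# Hypothetical markup, used only to exercise the collector.
SAMPLE_HTML = """
<div class="company">
  <a href="https://example.com"><img src="/logos/example.png" alt="Example Co"></a>
  <p class="short">One-line pitch.</p>
</div>
"""


class CompanyTagCollector(HTMLParser):
    """Record the opening tags we care about as (tag, attributes) pairs."""

    WANTED = {"a", "img", "p"}  # assumption: links, logos, descriptions

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag in self.WANTED:
            self.found.append((tag, dict(attrs)))


collector = CompanyTagCollector()
collector.feed(SAMPLE_HTML)
for tag, attrs in collector.found:
    print(tag, attrs)
```

The ordered list of (tag, attributes) pairs is the kind of raw material that could later be compressed into DSL tokens.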

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_generator&diff=25682 DSL generator 2019-05-16T15:56:54Z

<p>Hiep: Created page with "{{Project |Has title=DSL Generator |Has owner=Hiep Nguyen |Has start date=2019/05/16 |Has project status=Active }} ==Approach== The most important data we want to capture are the..."</p> <hr /> <div>{{Project<br /> |Has title=DSL Generator<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/05/16<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> The most important data we want to capture are the companies' tags, the images' tags, the companies' short descriptions, and the companies' long descriptions. For example, if we look at <br /> view-source:https://www.boost.vc/companies/<br /> <br /> we want to capture the tags associated with the companies' logos, the names of the companies, their short descriptions, and their websites. The goal is to compress all of those into one DSL so that our [http://www.edegan.com/wiki/DSL_Encoding model] can learn the structure that we want.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25541 DSL Encoding 2019-05-10T21:41:05Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25540 DSL Encoding 2019-05-10T21:40:13Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25539 DSL Encoding 2019-05-10T21:39:21Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25536 DSL Encoding 2019-05-09T17:04:04Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25535 DSL Encoding 2019-05-09T17:00:23Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25534 DSL Encoding 2019-05-09T16:59:35Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25505 DSL Encoding 2019-05-06T18:02:39Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25504 DSL Encoding 2019-05-06T18:02:15Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one simple DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py<br /> <br /> == Explanation and Implementation==<br /> One-hot-encoding can be understood as representing a word or token as a vector with a lot of zeroes, where the number of zeroes is equal to the number of unique tokens in the DSL file. Let's look at a DSL file from pix2code as an example. The process is as follows:<br /> gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')<br /> tokens=[]<br /> for line in gui:<br /> line=line.strip('\n').strip('}').strip('{')<br /> tokens.append(line)<br /> print(line)<br /> <br /> What we just did is opening a DSL file, going through every single line, stripping some symbols and storing all the tokens in a list. 
The ''tokens'' variable now looks something like this<br /> tokens<br /> <br /> [<br /> 'header ',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> ' ',<br /> 'row ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> ' ',<br /> 'quadruple ',<br /> 'small-title, text, btn-red',<br /> ' ',<br /> 'quadruple ',<br /> 'small-title, text, btn-green',<br /> ' ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> ' ',<br /> ' ',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> ' ',<br /> ' '<br /> ]<br /> <br /> Now, based on this list, to see the total number of tokens we can do<br /> <br /> chars = sorted(list(set(tokens)))<br /> <br /> which results in<br /> [<br /> '',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> 'header ',<br /> 'quadruple ',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> 'small-title, text, btn-orange',<br /> 'small-title, text, btn-red'<br /> ]<br /> <br /> As we can see, we have 9 elements in this example, which means the length of each vector would be 9. Now, we need to assign a number for each of the symbol, and the number will indicate the index of that element in the vector. <br /> char_indices = dict((c, i) for i, c in enumerate(chars))<br /> indices_char = dict((i, c) for i, c in enumerate(chars))<br /> <br /> This results in<br /> char_indices<br /> {'': 0,<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive': 1,<br /> 'header ': 2,<br /> 'quadruple ': 3,<br /> 'row ': 4,<br /> 'single ': 5,<br /> 'small-title, text, btn-green': 6,<br /> 'small-title, text, btn-orange': 7,<br /> 'small-title, text, btn-red': 8}<br /> <br /> Hence, if we have a line with token 'header', the one-hot representation of it is [0,0,1,0,0,0,0,0,0]. 
The '1' is at index 2, which is the index that char_indices assigns to 'header '.<br /> <br /> Now, let's apply this encoding rule to our GUI file<br /> import numpy as np<br /> <br /> # one row per line of the GUI file, one column per unique token<br /> one_hot_vector = np.zeros((len(tokens), len(chars)))<br /> for t, token in enumerate(tokens):<br />     one_hot_vector[t, char_indices[token]] = 1<br /> <br /> The resulting matrix, with one row per line of the GUI file, looks like this.<br /> array([[0., 0., 1., 0., 0., 0., 0., 0., 0.],<br /> [0., 1., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 1., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 1., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 0., 1.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 1., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 1., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 1., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 1., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 1., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.]])</div>
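The page builds ''indices_char'' but never uses it. As a sanity check on the encoding, here is a small sketch that encodes a token list and then decodes the matrix back into tokens. It assumes a shortened token list rather than the full 22-line file, and the function names <code>encode</code> and <code>decode</code> are hypothetical, not part of the pix2code code.

```python
import numpy as np

# Shortened token list for illustration (an assumption, not the full file).
tokens = [
    'header ',
    'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
    '',
    'row ',
    'single ',
    'small-title, text, btn-green',
    '',
    '',
]

# Vocabulary and the two lookup tables, built as on this page.
chars = sorted(set(tokens))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}


def encode(token_list):
    """Return a (number of tokens) x (vocabulary size) one-hot matrix."""
    one_hot = np.zeros((len(token_list), len(chars)))
    for t, token in enumerate(token_list):
        one_hot[t, char_indices[token]] = 1
    return one_hot


def decode(one_hot):
    """Map each row back to its token via the position of the 1."""
    return [indices_char[int(i)] for i in one_hot.argmax(axis=1)]


matrix = encode(tokens)
recovered = decode(matrix)
print(matrix.shape, recovered == tokens)
```

If the round trip returns the original list, every row has exactly one 1 and the two lookup tables agree with each other.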

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25503 DSL Encoding 2019-05-06T18:01:34Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25502 DSL Encoding 2019-05-06T18:00:58Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25501 DSL Encoding 2019-05-06T18:00:10Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25498 DSL Encoding 2019-05-04T18:45:01Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25497 DSL Encoding 2019-05-03T23:00:41Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25496 DSL Encoding 2019-05-03T23:00:03Z


Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25495 DSL Encoding 2019-05-03T22:57:36Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one simple DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py<br /> <br /> ==Implementation==<br /> One-hot-encoding can be understood as representing a word or token as a vector with a lot of zeroes, where the number of zeroes is equal to the number of unique tokens in the DSL file. Let's look at a concrete DSL file from pix2code as example. The process is as follows<br /> gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')<br /> tokens=[]<br /> for line in gui:<br /> line=line.strip('\n').strip('}').strip('{')<br /> tokens.append(line)<br /> print(line)<br /> <br /> What we just did is opening a DSL file, going through every single line, stripping some symbols and store all the tokens in a list. 
The ''tokens'' variable now looks something like this:<br /> tokens<br /> <br /> ['header ',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> '',<br /> 'row ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-red',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-green',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> '',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> '',<br /> ''<br /> ]<br /> <br /> Now, based on this list, we can collect the unique tokens with<br /> <br /> chars = sorted(list(set(tokens)))<br /> <br /> which results in<br /> ['',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> 'header ',<br /> 'quadruple ',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> 'small-title, text, btn-orange',<br /> 'small-title, text, btn-red']<br /> <br /> There are 9 unique tokens in this example, so each one-hot vector has length 9. Next, we assign a number to each token; that number is the index of the token's position in the vector. <br /> char_indices = dict((c, i) for i, c in enumerate(chars))<br /> indices_char = dict((i, c) for i, c in enumerate(chars))<br /> <br /> This results in<br /> char_indices<br /> {'': 0,<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive': 1,<br /> 'header ': 2,<br /> 'quadruple ': 3,<br /> 'row ': 4,<br /> 'single ': 5,<br /> 'small-title, text, btn-green': 6,<br /> 'small-title, text, btn-orange': 7,<br /> 'small-title, text, btn-red': 8}<br /> <br /> Hence, if we have a line with the token 'header ' (note the trailing space), its one-hot representation is [0,0,1,0,0,0,0,0,0]. 
The '1' sits at index 2 (counting from zero, i.e. the third position), which is the index assigned to 'header ' in ''char_indices''.<br /> <br /> Now, let's apply this encoding rule to our GUI file:<br /> import numpy as np<br /> # One row per line of the DSL file, one column per unique token<br /> one_hot_vector = np.zeros((len(tokens), len(chars)))<br /> for t, token in enumerate(tokens):<br />     one_hot_vector[t, char_indices[token]] = 1<br /> <br /> The matrix that represents our GUI file looks like this:<br /> array([[0., 0., 1., 0., 0., 0., 0., 0., 0.],<br /> [0., 1., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 1., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 1., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 0., 1.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 1., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 1., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 0., 1., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 1., 0., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 1., 0., 0., 0.],<br /> [0., 0., 0., 0., 0., 0., 1., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.],<br /> [1., 0., 0., 0., 0., 0., 0., 0., 0.]])</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25494 DSL Encoding 2019-05-03T22:48:48Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py<br /> <br /> ==Implementation==<br /> One-hot-encoding can be understood as representing a word or token as a vector with a lot of zeroes, where the number of zeroes is equal to the number of unique tokens in the DSL file. Let's look at a concrete DSL file from pix2code as example. The process is as follows<br /> gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')<br /> tokens=[]<br /> for line in gui:<br /> line=line.strip('\n').strip('}').strip('{')<br /> tokens.append(line)<br /> print(line)<br /> <br /> What we just did is opening a DSL file, going through every single line, stripping some symbols and store all the tokens in a list. 
The ''tokens'' variable now looks something like this<br /> ['header ',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> '',<br /> 'row ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-red',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-green',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> '',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> '',<br /> ''<br /> ]</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25493 DSL Encoding 2019-05-03T22:48:35Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py<br /> <br /> ==Implementation==<br /> One-hot-encoding can be understood as representing a word or token as a vector with a lot of zeroes, where the number of zeroes is equal to the number of unique tokens in the DSL file. Let's look at a concrete DSL file from pix2code as example. The process is as follows<br /> gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')<br /> tokens=[]<br /> for line in gui:<br /> line=line.strip('\n').strip('}').strip('{')<br /> tokens.append(line)<br /> print(line)<br /> <br /> What we just did is opening a DSL file, going through every single line, stripping some symbols and store all the tokens in a list. 
The ''tokens'' variable now looks something like this<br /> ['header ',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> '',<br /> 'row ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-red',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-green',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> '',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> '',<br /> '']</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25492 DSL Encoding 2019-05-03T22:48:11Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py<br /> <br /> ==Implementation==<br /> One-hot-encoding can be understood as representing a word or token as a vector with a lot of zeroes, where the number of zeroes is equal to the number of unique tokens in the DSL file. Let's look at a concrete DSL file from pix2code as example. The process is as follows<br /> gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')<br /> tokens=[]<br /> for line in gui:<br /> line=line.strip('\n').strip('}').strip('{')<br /> tokens.append(line)<br /> print(line)<br /> <br /> What we just did is opening a DSL file, going through every single line, stripping some symbols and store all the tokens in a list. 
The ''tokens'' variable now looks something like this<br /> ['header ',<br /> 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',<br /> '',<br /> 'row ',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-red',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-green',<br /> '',<br /> 'quadruple ',<br /> 'small-title, text, btn-orange',<br /> '',<br /> '',<br /> 'row ',<br /> 'single ',<br /> 'small-title, text, btn-green',<br /> '',<br /> '']</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25491 DSL Encoding 2019-05-03T22:46:55Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py<br /> <br /> ==Implementation==<br /> One-hot-encoding can be understood as representing a word or token as a vector with a lot of zeroes, where the number of zeroes is equal to the number of unique tokens in the DSL file. Let's look at a concrete DSL file from pix2code as example. The process is as follows<br /> gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')<br /> tokens=[]<br /> for line in gui:<br /> line=line.strip('\n').strip('}').strip('{')<br /> tokens.append(line)<br /> print(line)</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25435 DSL Encoding 2019-04-29T19:05:59Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated github page can be found [https://github.com/emilwallner/Screenshot-to-code/blob/master/README.md here]<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25422 DSL Encoding 2019-04-26T22:19:00Z

<p>Hiep: /* Approach */</p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented properly. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing.<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25421 DSL Encoding 2019-04-26T22:18:36Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vector to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project also had the same approach. However, the preprocessing part was not discussed carefully in the paper and the source code was not commented. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives a more detailed instruction for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing.<br /> <br /> ==File and scripts==<br /> The current scripts that I wrote by following pix2code source code are living on <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), write<br /> python convert_gui.py</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&diff=25420 LP Extractor Protocol 2019-04-26T22:15:50Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=LP Extractor Protocol<br /> |Has owner=Lasya Rajan,<br /> |Has project status=Active<br /> }}<br /> <br /> ==Summary==<br /> <onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar "paired input" networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.</onlyinclude><br /> <br /> Files location:<br /> E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21<br /> <br /> ==Proposed Method==<br /> <br /> According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. 
First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are an adjacency matrix, an edges to vertices matrix, and a DFS (depth-first search) traversal. <br /> <br /> ==== DFS Encoding ====<br /> <br /> Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as it goes before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. On a tree with n nodes, a DFS traversal runs in O(n) time. <br /> <br /> ==== Adjacency Matrix ====<br /> <br /> By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. The elements of the matrix represent whether their corresponding vertices are adjacent in the graphical representation. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since no node is adjacent to itself. This approach takes O(n^2) time and space, where n = |V|.<br /> <br /> ==== Edges to Vertices Matrix ====<br /> <br /> Any tree with n nodes has n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach takes O(n) time and space. 
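<br /> <br /> The three encodings described above can be sketched on a toy example. This is a minimal illustration only, not the project's implementation: the <code>tree</code> dictionary, the node names, and the helper functions <code>dfs_encode</code>, <code>edge_list</code>, and <code>adjacency_matrix</code> are all hypothetical, and the tree is assumed to be stored as a mapping from each node to its ordered list of children.

```python
# Hypothetical toy tree (not project data): node -> ordered list of children.
tree = {
    "html": ["head", "body"],
    "head": [],
    "body": ["div", "p"],
    "div": [],
    "p": [],
}


def dfs_encode(tree, root):
    """DFS encoding: emit 1 when a node is found, 0 once it is fully explored."""
    bits = [1]
    for child in tree[root]:
        bits.extend(dfs_encode(tree, child))
    bits.append(0)
    return bits


def edge_list(tree):
    """Edges to vertices matrix: one (parent, child) row per edge, n-1 rows."""
    return [(parent, child)
            for parent, children in tree.items()
            for child in children]


def adjacency_matrix(tree):
    """Adjacency matrix: |V| x |V| grid with 1 where two nodes share an edge."""
    nodes = list(tree)
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for parent, child in edge_list(tree):
        i, j = index[parent], index[child]
        matrix[i][j] = matrix[j][i] = 1  # undirected, so the matrix is symmetric
    return nodes, matrix


print(dfs_encode(tree, "html"))   # 2n entries for a tree with n nodes
print(edge_list(tree))            # n-1 rows of two vertices each
nodes, matrix = adjacency_matrix(tree)
for node, row in zip(nodes, matrix):
    print(node, row)
```

The DFS sequence has exactly 2n entries, while the adjacency matrix grows as n^2, which is one reason the DFS form may be more practical for large pages. The recursive version is kept for readability; a very deep tree would call for an explicit stack instead.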
<br /> <br /> ==== Supervised Learning Approach (HTML to DSL) ====<br /> <br /> Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. <br /> <br /> [[File:Pix2code.png|thumb|center|upright=3|Image from "Project Goal V2" of Pix2Code architecture]]<br /> <br /> ==Literature==<br /> <br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] <br /> :This is the paper describing the Pix2Code architecture mentioned above. <br /> <br /> * [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]<br /> :This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.<br /> <br /> * [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]<br /> :This approach to web content extraction focuses exclusively on less structured web pages, and on classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. 
<br /> <br /> * [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]<br /> : This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.<br /> <br /> *[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]<br /> : This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. <br /> <br /> *[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]<br /> : The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. <br /> <br /> *[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]<br /> : This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.<br /> <br /> *[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]<br /> : This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. 
<br /> <br /> *[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]<br /> : This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]<br /> : This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. <br /> <br /> * [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]<br /> : node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. <br /> <br /> * [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]<br /> : V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. 
<br /> <br /> * [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]<br /> : This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. <br /> <br /> * [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]<br /> : In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. <br /> <br /> * [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]<br /> :This article describes another approach to learning graph-level representations, this time through a combination of supervised and unsupervised learning components. <br /> <br /> <br /> === General ===<br /> <br /> * [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]<br /> : This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.<br /> <br /> * [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]<br /> : This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. 
Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.<br /> <br /> <br /> == Implementation ==<br /> <br /> This section contains possible implementation libraries and tools for various components of the extractor.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]<br /> : A simple Python library that can parse HTML files into "Beautiful Soup objects," which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.<br /> <br /> *[https://docs.scrapy.org/en/latest/index.html Scrapy]<br /> : Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating "selectors" specified by CSS or XPath expressions. <br /> <br /> ==== pix2code ====<br /> <br /> * [https://github.com/tonybeltramelli/pix2code pix2Code]<br /> : Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&feature=youtu.be Video] demo of trained neural network. <br /> <br /> * [https://github.com/fjbriones/pix2code2 pix2code2]<br /> : An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.<br /> <br /> * [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]<br /> : Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. 
<br /> <br /> * [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]<br /> : pix2code implemented in PyTorch, also not ready for general usage yet.<br /> <br /> * [https://github.com/ngundotra/code2pix code2pix]<br /> : A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://github.com/aditya-grover/node2vec node2Vec]<br /> : GitHub repo that contains the reference implementation of the node2vec algorithm as a Python module. See above node2vec paper.<br /> : A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow<br /> : [https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef Here] is a very good and elementary introduction to node2vec <br /> * [https://networkx.github.io/documentation/stable/index.html NetworkX]<br /> : NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices. <br /> <br /> *[https://radimrehurek.com/gensim/ Gensim]<br /> : Gensim is a Python library used to analyze plain-text documents for semantic structure. It is required for node2vec.<br /> <br /> * [http://www.numpy.org/ NumPy]<br /> : NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. It is required for pix2code.<br /> <br /> === DSL Development ===<br /> <br /> * [http://hackage.haskell.org/package/lucid Lucid]<br /> : Lucid is a DSL implemented with Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements. 
<br /> <br /> <br /> === General ===<br /> <br /> * [https://keras.io/ Keras]<br /> : In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.<br /> <br /> * [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]<br /> : From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.<br /> <br /> * [https://www.h5py.org/ H5PY]<br /> : The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy<br /> <br /> === Useful tutorials ===<br /> : Since we will be using a two-layer LSTMs in tensorflow, this [https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40 article] might be useful.<br /> <br /> === Proposed Model ===<br /> : Here is a visualization of the model that we might want to use for our extractor<br /> [[File: Extractor-Model.png| first diagram of extractor model]]<br /> <br /> ==DSL Encoder==<br /> To encode the structure of the DSL scripts, we can try using one-hot vector. More details can be found [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ here] and [http://www.edegan.com/wiki/DSL_Encoding here].</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=DSL_Encoding&diff=25419 DSL Encoding 2019-04-26T22:11:11Z

<p>Hiep: Created page with "{{Project |Has title=DSL Encoding |Has owner=Hiep Nguyen |Has start date=2019/04/26 |Has project status=Active }} ==Approach== Currently, I am thinking about using one-hot ve..."</p> <hr /> <div>{{Project<br /> |Has title=DSL Encoding<br /> |Has owner=Hiep Nguyen<br /> |Has start date=2019/04/26<br /> |Has project status=Active<br /> }}<br /> <br /> ==Approach==<br /> Currently, I am thinking about using one-hot vectors to encode the structure of a DSL page. The author of the [http://www.edegan.com/wiki/Pix2code pix2code] project took the same approach. This [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ article] gives more detailed instructions for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing.<br /> <br /> ==File and scripts==<br /> The current code lives in <br /> E:/projects/embedding<br /> So far, I have been experimenting with only one DSL file, which is '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), run<br /> python convert_gui.py</div>
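The one-hot step that the script does not yet perform could look roughly like the sketch below. It is only an illustration under stated assumptions: the token list is a made-up sample in the style of a pix2code DSL file, and the helper names build_vocabulary and one_hot are hypothetical, not part of convert_gui.py.

```python
from typing import Dict, List


def build_vocabulary(tokens: List[str]) -> Dict[str, int]:
    """Assign each unique token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def one_hot(token: str, vocab: Dict[str, int]) -> List[int]:
    """Return a vector of zeroes with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


# Hypothetical sample tokens; a real run would read them from the .gui file.
tokens = ["header", "btn-active", "btn-inactive", "row", "btn-active"]
vocab = build_vocabulary(tokens)
encoded = [one_hot(token, vocab) for token in tokens]

for token, vector in zip(tokens, encoded):
    print(token, vector)
```

Each vector has one entry per unique token, so the repeated token "btn-active" maps to the same vector both times it appears.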

Hiep http://www.edegan.com/mediawiki/index.php?title=GPU_fix&diff=25381 GPU fix 2019-04-20T23:04:53Z

<p>Hiep: </p> <hr /> <div>==Fixing the GPU on the RDP (April 2019)==<br /> <br /> At the time of writing, CUDA and TensorFlow cannot connect to our GPU. This can be verified by opening python3 and typing<br /> import tensorflow as tf<br /> tf.test.is_gpu_available()<br /> <br /> In addition, GeForce Experience and the NVIDIA Control Panel do not work right now.<br /> <br /> Hiep has checked the GPU driver by going to Device Manager -> Display adapter -> RTX Titan -> Properties -> Driver -> Update Driver. However, nothing new was installed since the current driver is already up to date.<br /> <br /> Hiep has also restarted several NVIDIA services by opening Run -> typing services.msc -> scrolling down to the NVIDIA services -> clicking Restart. However, this did not resolve the issue.<br /> <br /> Since GeForce Experience cannot be opened, Hiep has followed a solution from the NVIDIA forum: open Run -> type services.msc -> scroll down to the NVIDIA services -> select the NVIDIA telemetry container -> choose Log On -> check "Log on as Local System account".<br /> <br /> However, none of the above approaches actually resolved the issue. Reinstalling the NVIDIA software might be necessary.<br /> <br /> The error we get while opening GeForce Experience is <br /> The code execution cannot proceed because wlanapi.dll was not found. Reinstalling the program may fix this problem.<br /> <br /> However, after re-downloading and re-installing GeForce Experience, the error does not go away.<br /> <br /> According to this [https://www.lifewire.com/how-to-fix-wlanapi-dll-not-found-or-missing-errors-2624238 article], adding a feature called Wireless LAN Service may fix the problem.<br /> <br /> In addition, the computer cannot find the GPU when we type 'dxdiag' in the cmd prompt. 
This may be a driver-related issue, and this [http://www.bsocialshine.com/2016/06/how-to-fix-all-dll-file-missing-error.html article] suggests that installing the 'DirectX End-User Runtime Web Installer' will fix the problem.<br /> <br /> ==Uninstall and Reinstall Drivers (04/20/2019)==<br /> <br /> I have uninstalled the NVIDIA graphics drivers by <br /> <br /> (1) opening Control Panel -> remove application -> remove NVIDIA graphics driver, <br /> (2) opening Device Manager -> remove device -> remove device and associated drivers, and<br /> (3) restarting the computer.<br /> <br /> I then installed a new driver downloaded from [https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/Windows/419.67/419.67-desktop-win10-64bit-international-crd-whql.exe&lang=us&type=TITAN NVIDIA page] using the default settings.<br /> <br /> However, the old error<br /> The code execution cannot proceed because wlanapi.dll was not found. Reinstalling the program may fix this problem.<br /> is still there.<br /> There is an [http://techgenix.com/enabling-physical-gpus-hyper/ article] that talks about enabling GPU(s) on Windows Server.</div>
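The manual checks on this page can be collected into one small script, which may be handy for re-testing after each attempted fix. This is a sketch under stated assumptions: the function names are hypothetical, it assumes the nvidia-smi command-line tool ships with the installed driver, and it falls back to the tf.test.is_gpu_available() call used above when the newer tf.config.list_physical_devices API is not present.

```python
import shutil
import subprocess
from typing import Dict, List, Optional


def nvidia_smi_gpus() -> Optional[List[str]]:
    """Return the GPU lines printed by `nvidia-smi -L`, or None if the
    tool is missing or exits with an error (assumed driver problem)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return [line for line in result.stdout.splitlines() if line.strip()]


def tensorflow_sees_gpu() -> Optional[bool]:
    """Return True/False if TensorFlow can be imported, else None."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    config = getattr(tf, "config", None)
    if config is not None and hasattr(config, "list_physical_devices"):
        return len(config.list_physical_devices("GPU")) > 0
    # Older TensorFlow versions: same check as the one shown on this page.
    return bool(tf.test.is_gpu_available())


report: Dict[str, object] = {
    "nvidia_smi": nvidia_smi_gpus(),
    "tensorflow": tensorflow_sees_gpu(),
}
print(report)
```

A value of None means the corresponding tool could not be run at all, which is a different failure from TensorFlow running but reporting no GPU.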

Hiep http://www.edegan.com/mediawiki/index.php?title=GPU_fix&diff=25380 GPU fix 2019-04-20T23:01:06Z

<p>Hiep: </p> <hr /> <div>==Fixing the GPU on the RDP (April 2019)==<br /> <br /> At the time of writing, CUDA and TensorFlow cannot connect to our GPU. This can be verified by opening python3 and typing<br /> import tensorflow as tf<br /> tf.test.is_gpu_available()<br /> <br /> In addition, GeForce Experience and the NVIDIA Control Panel do not work right now.<br /> <br /> Hiep has checked the GPU driver by going to Device Manager -> Display adapter -> RTX Titan -> Properties -> Driver -> Update Driver. However, nothing new was installed since the current driver is already up to date.<br /> <br /> Hiep has also restarted several NVIDIA services by opening Run -> typing services.msc -> scrolling down to the NVIDIA services -> clicking Restart. However, this did not resolve the issue.<br /> <br /> Since GeForce Experience cannot be opened, Hiep has followed a solution from the NVIDIA forum: open Run -> type services.msc -> scroll down to the NVIDIA services -> select the NVIDIA telemetry container -> choose Log On -> check "Log on as Local System account".<br /> <br /> However, none of the above approaches actually resolved the issue. Reinstalling the NVIDIA software might be necessary.<br /> <br /> The error we get while opening GeForce Experience is <br /> The code execution cannot proceed because wlanapi.dll was not found. Reinstalling the program may fix this problem.<br /> <br /> However, after re-downloading and re-installing GeForce Experience, the error does not go away.<br /> <br /> According to this [https://www.lifewire.com/how-to-fix-wlanapi-dll-not-found-or-missing-errors-2624238 article], adding a feature called Wireless LAN Service may fix the problem.<br /> <br /> In addition, the computer cannot find the GPU when we type 'dxdiag' in the cmd prompt. 
This may be a driver-related issue, and this [http://www.bsocialshine.com/2016/06/how-to-fix-all-dll-file-missing-error.html article] suggests that installing the 'DirectX End-User Runtime Web Installer' will fix the problem.<br /> <br /> ==Uninstall and Reinstall Drivers (04/20/2019)==<br /> <br /> I have uninstalled the NVIDIA graphics drivers by <br /> <br /> (1) opening Control Panel -> remove application -> remove NVIDIA graphics driver, <br /> (2) opening Device Manager -> remove device -> remove device and associated drivers, and<br /> (3) restarting the computer.<br /> <br /> I then installed a new driver downloaded from [https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/Windows/419.67/419.67-desktop-win10-64bit-international-crd-whql.exe&lang=us&type=TITAN NVIDIA page] using the default settings.<br /> <br /> However, the old error<br /> The code execution cannot proceed because wlanapi.dll was not found. Reinstalling the program may fix this problem.<br /> is still there.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&diff=25379 LP Extractor Protocol 2019-04-19T21:00:57Z

<p>Hiep: /* DFS Encoding */</p> <hr /> <div>{{Project<br /> |Has title=LP Extractor Protocol<br /> |Has owner=Lasya Rajan,<br /> |Has project status=Active<br /> }}<br /> <br /> ==Summary==<br /> <onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar "paired input" networks, and are in the process refining our understanding of the pre-existing code and work related to each step.</onlyinclude><br /> <br /> Files location:<br /> E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21<br /> <br /> ==Proposed Method==<br /> <br /> According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing, analyzing and classifying the textual content of the HTML page. The second method is to use image based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel method is to structurally analyze the HTML tree structure, and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. 
First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are an adjacency matrix, an edges to vertices matrix, and DFS (depth-first search) algorithms. <br /> <br /> ==== DFS Encoding ====<br /> <br /> Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, the traversal runs in O(n) time and the resulting encoding has length 2n. <br /> <br /> ==== Adjacency Matrix ====<br /> <br /> By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. The elements of the matrix represent whether their corresponding vertices are adjacent in the graphical representation. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. Building and storing this matrix takes O(n^2) time and space.<br /> <br /> ==== Edges to Vertices Matrix ====<br /> <br /> Any tree with n nodes has n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. Building and storing this matrix takes O(n) time and space. 
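To make the three encodings concrete, here is a minimal sketch that applies each of them to the same small tree. The tree, the node numbering, and the function names are all hypothetical and chosen only for illustration. The sketch assumes nodes are numbered 0 to n-1 and that each node maps to an ordered list of its children.

```python
from typing import Dict, List, Tuple

# Hypothetical toy tree: node id -> ordered list of child ids.
Tree = Dict[int, List[int]]


def dfs_encode(tree: Tree, root: int) -> List[int]:
    """Record 1 when a node is first found and 0 when it is fully explored."""
    bits: List[int] = []
    stack: List[Tuple[int, bool]] = [(root, False)]
    while stack:
        node, explored = stack.pop()
        if explored:
            bits.append(0)
            continue
        bits.append(1)
        # Revisit this node after all of its children have been handled.
        stack.append((node, True))
        for child in reversed(tree.get(node, [])):
            stack.append((child, False))
    return bits


def edge_list(tree: Tree) -> List[Tuple[int, int]]:
    """Edges to vertices matrix: one (parent, child) row per edge."""
    return [(parent, child) for parent in sorted(tree) for child in tree[parent]]


def adjacency_matrix(tree: Tree) -> List[List[int]]:
    """Symmetric |V| x |V| matrix with 1 where two nodes share an edge."""
    n = len(tree)
    matrix = [[0] * n for _ in range(n)]
    for parent, child in edge_list(tree):
        matrix[parent][child] = 1
        matrix[child][parent] = 1
    return matrix


tree: Tree = {0: [1, 2], 1: [3], 2: [], 3: []}

print(dfs_encode(tree, 0))     # [1, 1, 1, 0, 0, 1, 0, 0]
print(edge_list(tree))         # [(0, 1), (0, 2), (1, 3)]
for row in adjacency_matrix(tree):
    print(row)
```

For this four-node tree the DFS encoding has 2n = 8 entries, the edge list has n-1 = 3 rows, and the adjacency matrix has n^2 = 16 entries, which illustrates why the matrix grows fastest as pages get larger.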
<br /> <br /> ==== Supervised Learning Approach (HTML to DSL) ====<br /> <br /> Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. <br /> <br /> [[File:Pix2code.png|thumb|center|upright=3|Image from "Project Goal V2" of Pix2Code architecture]]<br /> <br /> ==Literature==<br /> <br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] <br /> :This paper documents the Pix2Code architecture mentioned above. <br /> <br /> * [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]<br /> :This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.<br /> <br /> * [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]<br /> :This approach to web content extraction focuses exclusively on less structured web pages and on classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. 
<br /> <br /> * [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]<br /> : This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.<br /> <br /> *[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]<br /> : This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that captures the relationship between text content and tag content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. <br /> <br /> *[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]<br /> : The approach described in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. <br /> <br /> *[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]<br /> : This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.<br /> <br /> *[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]<br /> : This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. 
<br /> <br /> *[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]<br /> : This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]<br /> : This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. <br /> <br /> * [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]<br /> : node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. <br /> <br /> * [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]<br /> : V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. 
<br /> <br /> * [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]<br /> : This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. <br /> <br /> * [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]<br /> : In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. <br /> <br /> * [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]<br /> :This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. <br /> <br /> <br /> === General ===<br /> <br /> * [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]<br /> : This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.<br /> <br /> * [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]<br /> : This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. 
Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.<br /> <br /> <br /> == Implementation ==<br /> <br /> This section contains possible implementation libraries and tools for various components of the extractor.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]<br /> : A simple Python library that can parse HTML files into "Beautiful Soup objects," which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.<br /> <br /> *[https://docs.scrapy.org/en/latest/index.html Scrapy]<br /> : Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating "selectors" specified by CSS or XPath expressions. <br /> <br /> ==== pix2code ====<br /> <br /> * [https://github.com/tonybeltramelli/pix2code pix2Code]<br /> : Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&feature=youtu.be Video] demo of trained neural network. <br /> <br /> * [https://github.com/fjbriones/pix2code2 pix2code2]<br /> : An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.<br /> <br /> * [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]<br /> : Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. 
<br /> <br /> * [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]<br /> : pix2code implemented in PyTorch, also not ready for general usage yet.<br /> <br /> * [https://github.com/ngundotra/code2pix code2pix]<br /> : A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://github.com/aditya-grover/node2vec node2Vec]<br /> : Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.<br /> : A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow<br /> : [https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef Here] is a very good and elementary introduction to node2vec <br /> * [https://networkx.github.io/documentation/stable/index.html NetworkX]<br /> : NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. <br /> <br /> *[https://radimrehurek.com/gensim/ Gensim]<br /> : Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.<br /> <br /> * [http://www.numpy.org/ NumPy]<br /> : NumPy is a computing package that includes a N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.<br /> <br /> === DSL Development ===<br /> <br /> * [http://hackage.haskell.org/package/lucid Lucid]<br /> : Lucid is a DSL implemented with Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements. 
<br /> <br /> <br /> === General ===<br /> <br /> * [https://keras.io/ Keras]<br /> : In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.<br /> <br /> * [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]<br /> : From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.<br /> <br /> * [https://www.h5py.org/ H5PY]<br /> : The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy<br /> <br /> === Useful tutorials ===<br /> : Since we will be using a two-layer LSTMs in tensorflow, this [https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40 article] might be useful.<br /> <br /> === Proposed Model ===<br /> : Here is a visualization of the model that we might want to use for our extractor<br /> [[File: Extractor-Model.png| first diagram of extractor model]]</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&diff=25375 LP Extractor Protocol 2019-04-18T20:51:34Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=LP Extractor Protocol<br /> |Has owner=Lasya Rajan,<br /> |Has project status=Active<br /> }}<br /> <br /> ==Summary==<br /> <onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar "paired input" networks, and are in the process refining our understanding of the pre-existing code and work related to each step.</onlyinclude><br /> <br /> Files location:<br /> E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21<br /> <br /> ==Proposed Method==<br /> <br /> According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing, analyzing and classifying the textual content of the HTML page. The second method is to use image based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel method is to structurally analyze the HTML tree structure, and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. 
First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are an adjacency matrix, an edges to vertices matrix, and DFS (depth-first search) algorithms. <br /> <br /> ==== DFS Encoding ====<br /> <br /> Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, the traversal runs in O(n) time and the resulting encoding has length 2n. <br /> <br /> ==== Adjacency Matrix ====<br /> <br /> By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. The elements of the matrix represent whether their corresponding vertices are adjacent in the graphical representation. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. Building and storing this matrix takes O(n^2) time and space.<br /> <br /> ==== Edges to Vertices Matrix ====<br /> <br /> Any tree with n nodes has n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. Building and storing this matrix takes O(n) time and space. 
<br /> <br /> ==== Supervised Learning Approach (HTML to DSL) ====<br /> <br /> Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. <br /> <br /> [[File:Pix2code.png|thumb|center|upright=3|Image from "Project Goal V2" of Pix2Code architecture]]<br /> <br /> ==Literature==<br /> <br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] <br /> :This paper documents the Pix2Code architecture mentioned above. <br /> <br /> * [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]<br /> :This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.<br /> <br /> * [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]<br /> :This approach to web content extraction focuses exclusively on less structured web pages and on classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. 
<br /> <br /> * [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]<br /> : This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.<br /> <br /> *[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]<br /> : This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that captures the relationship between text content and tag content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. <br /> <br /> *[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]<br /> : The approach described in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. <br /> <br /> *[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]<br /> : This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.<br /> <br /> *[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]<br /> : This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. 
<br /> <br /> *[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]<br /> : This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]<br /> : This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. <br /> <br /> * [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]<br /> : node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. <br /> <br /> * [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]<br /> : V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. 
<br /> <br /> * [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]<br /> : This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. <br /> <br /> * [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]<br /> : In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. <br /> <br /> * [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]<br /> :This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. <br /> <br /> <br /> === General ===<br /> <br /> * [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]<br /> : This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.<br /> <br /> * [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]<br /> : This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. 
Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.<br /> <br /> <br /> == Implementation ==<br /> <br /> This section contains possible implementation libraries and tools for various components of the extractor.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]<br /> : A simple Python library that can parse HTML files into "Beautiful Soup objects," which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.<br /> <br /> *[https://docs.scrapy.org/en/latest/index.html Scrapy]<br /> : Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating "selectors" specified by CSS or XPath expressions. <br /> <br /> ==== pix2code ====<br /> <br /> * [https://github.com/tonybeltramelli/pix2code pix2Code]<br /> : Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&feature=youtu.be Video] demo of trained neural network. <br /> <br /> * [https://github.com/fjbriones/pix2code2 pix2code2]<br /> : An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.<br /> <br /> * [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]<br /> : Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. 
<br /> <br /> * [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]<br /> : pix2code implemented in PyTorch, also not ready for general usage yet.<br /> <br /> * [https://github.com/ngundotra/code2pix code2pix]<br /> : A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://github.com/aditya-grover/node2vec node2Vec]<br /> : Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.<br /> : A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow<br /> <br /> * [https://networkx.github.io/documentation/stable/index.html NetworkX]<br /> : NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. <br /> <br /> *[https://radimrehurek.com/gensim/ Gensim]<br /> : Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.<br /> <br /> * [http://www.numpy.org/ NumPy]<br /> : NumPy is a computing package that includes a N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.<br /> <br /> === DSL Development ===<br /> <br /> * [http://hackage.haskell.org/package/lucid Lucid]<br /> : Lucid is a DSL implemented with Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements. <br /> <br /> <br /> === General ===<br /> <br /> * [https://keras.io/ Keras]<br /> : In conjunction with tensorflow, Keras will support the deep learning components of the project. 
Is required for pix2code.<br /> <br /> * [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]<br /> : From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.<br /> <br /> * [https://www.h5py.org/ H5PY]<br /> : The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy<br /> <br /> === Useful tutorials ===<br /> : Since we will be using a two-layer LSTM in tensorflow, this [https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40 article] might be useful.<br /> <br /> === Proposed Model ===<br /> : Here is a visualization of the model that we might want to use for our extractor<br /> [[File: Extractor-Model.png| first diagram of extractor model]]</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&diff=25374 LP Extractor Protocol 2019-04-18T20:51:13Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=LP Extractor Protocol<br /> |Has owner=Lasya Rajan,<br /> |Has project status=Active<br /> }}<br /> <br /> ==Summary==<br /> <onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar "paired input" networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.</onlyinclude><br /> <br /> Files location:<br /> E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21<br /> <br /> ==Proposed Method==<br /> <br /> According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing, analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure, and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. 
First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. <br /> <br /> ==== DFS Encoding ====<br /> <br /> Currently, we are leaning towards utilizing DFS algorithms. DFS algorithms operate by starting at the root node (or an arbitrary node for a graph) and traversing the current branch fully before backtracking to the last split before the branch terminated. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. A DFS algorithm has an efficiency of O(n). <br /> <br /> ==== Adjacency Matrix ====<br /> <br /> By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. The elements of the matrix represent whether their corresponding vertices are adjacent in the graphical representation. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. This approach has an algorithmic efficiency of O(n^2).<br /> <br /> ==== Edges to Vertices Matrix ====<br /> <br /> For any given tree, we have n-1 edges, where n is the number of nodes. For every edge, we can record the two ending vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has an algorithmic efficiency of O(n). 
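<br /> <br /> The 1/0 DFS encoding described above can be sketched as follows. This is only a minimal illustration: the dictionary-of-children tree format, the function name <code>dfs_encode</code>, and the toy tag names are assumptions made for the example, not part of the project's code.

```python
from typing import Dict, List

# Hypothetical toy format: each key is a node, each value lists its children.
Tree = Dict[str, List[str]]


def dfs_encode(tree: Tree, root: str) -> List[int]:
    """Record 1 when a node is first found and 0 when it is fully explored."""
    bits: List[int] = []
    # Iterative DFS; each stack entry is (node, already_opened).
    stack = [(root, False)]
    while stack:
        node, opened = stack.pop()
        if opened:
            # All descendants of this node have been visited.
            bits.append(0)
            continue
        bits.append(1)
        stack.append((node, True))
        # reversed() keeps children in their original left-to-right order.
        for child in reversed(tree.get(node, [])):
            stack.append((child, False))
    return bits


# Toy tree: html -> (head, body), body -> (div, p).
example: Tree = {"html": ["head", "body"], "body": ["div", "p"]}
encoding = dfs_encode(example, "html")
print(encoding)  # [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```

Each node contributes exactly one 1 and one 0, so a tree with n nodes yields a vector of length 2n, which is consistent with the O(n) cost noted above.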
<br /> <br /> ==== Supervised Learning Approach (HTML to DSL) ====<br /> <br /> Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below) which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. <br /> <br /> [[File:Pix2code.png|thumb|center|upright=3|Image from "Project Goal V2" of Pix2Code architecture]]<br /> <br /> ==Literature==<br /> <br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] <br /> :This is the documentation for the Pix2Code architecture mentioned. <br /> <br /> * [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]<br /> :This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.<br /> <br /> * [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]<br /> :This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. 
<br /> <br /> * [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]<br /> : This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.<br /> <br /> *[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]<br /> : This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. <br /> <br /> *[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]<br /> : The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. <br /> <br /> *[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]<br /> : This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.<br /> <br /> *[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]<br /> : This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. 
<br /> <br /> *[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]<br /> : This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]<br /> : This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. <br /> <br /> * [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]<br /> : node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. <br /> <br /> * [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]<br /> : V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. 
<br /> <br /> * [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]<br /> : This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. <br /> <br /> * [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]<br /> : In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. <br /> <br /> * [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]<br /> :This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. <br /> <br /> <br /> === General ===<br /> <br /> * [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]<br /> : This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.<br /> <br /> * [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]<br /> : This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. 
Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.<br /> <br /> <br /> == Implementation ==<br /> <br /> This section contains possible implementation libraries and tools for various components of the extractor.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]<br /> : A simple Python library that can parse HTML files into "Beautiful Soup objects," which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.<br /> <br /> *[https://docs.scrapy.org/en/latest/index.html Scrapy]<br /> : Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating "selectors" specified by CSS or XPath expressions. <br /> <br /> ==== pix2code ====<br /> <br /> * [https://github.com/tonybeltramelli/pix2code pix2Code]<br /> : Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&feature=youtu.be Video] demo of trained neural network. <br /> <br /> * [https://github.com/fjbriones/pix2code2 pix2code2]<br /> : An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.<br /> <br /> * [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]<br /> : Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. 
<br /> <br /> * [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]<br /> : pix2code implemented in PyTorch, also not ready for general usage yet.<br /> <br /> * [https://github.com/ngundotra/code2pix code2pix]<br /> : A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://github.com/aditya-grover/node2vec node2Vec]<br /> : Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.<br /> : A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow<br /> <br /> * [https://networkx.github.io/documentation/stable/index.html NetworkX]<br /> : NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. <br /> <br /> *[https://radimrehurek.com/gensim/ Gensim]<br /> : Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.<br /> <br /> * [http://www.numpy.org/ NumPy]<br /> : NumPy is a computing package that includes a N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.<br /> <br /> === DSL Development ===<br /> <br /> * [http://hackage.haskell.org/package/lucid Lucid]<br /> : Lucid is a DSL implemented with Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements. <br /> <br /> <br /> === General ===<br /> <br /> * [https://keras.io/ Keras]<br /> : In conjunction with tensorflow, Keras will support the deep learning components of the project. 
Is required for pix2code.<br /> <br /> * [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]<br /> : From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.<br /> <br /> * [https://www.h5py.org/ H5PY]<br /> : The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy<br /> <br /> === Useful tutorials ===<br /> : Since we will be using a two-layer LSTM in tensorflow, this [https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40 article] might be useful.<br /> <br /> === Proposed Model ===<br /> : Here is a visualization of the model that we might want to use for our extractor<br /> [[File: Extractor-Model.png| first diagram of extractor model]]</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=File:Extractor-Model.png&diff=25373 File:Extractor-Model.png 2019-04-18T20:48:51Z

<p>Hiep: </p> <hr /> <div></div>

Hiep http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&diff=25356 LP Extractor Protocol 2019-04-18T05:37:07Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=LP Extractor Protocol<br /> |Has owner=Lasya Rajan,<br /> |Has project status=Active<br /> }}<br /> <br /> ==Summary==<br /> <onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar "paired input" networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.</onlyinclude><br /> <br /> Files location:<br /> E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21<br /> <br /> ==Proposed Method==<br /> <br /> According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing, analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure, and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. 
First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. <br /> <br /> ==== DFS Encoding ====<br /> <br /> Currently, we are leaning towards utilizing DFS algorithms. DFS algorithms operate by starting at the root node (or an arbitrary node for a graph) and traversing the current branch fully before backtracking to the last split before the branch terminated. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. A DFS algorithm has an efficiency of O(n). <br /> <br /> ==== Adjacency Matrix ====<br /> <br /> By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. The elements of the matrix represent whether their corresponding vertices are adjacent in the graphical representation. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. This approach has an algorithmic efficiency of O(n^2).<br /> <br /> ==== Edges to Vertices Matrix ====<br /> <br /> For any given tree, we have n-1 edges, where n is the number of nodes. For every edge, we can record the two ending vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has an algorithmic efficiency of O(n). 
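<br /> <br /> To make the size difference between the two matrix encodings concrete, here is a rough sketch of both. The parent-array representation of the tree and the function names <code>edge_matrix</code> and <code>adjacency_matrix</code> are assumptions chosen for illustration; the project does not prescribe them.

```python
from typing import List, Tuple


def edge_matrix(parents: List[int]) -> List[Tuple[int, int]]:
    """Return one (parent, child) row per edge; the root is marked with -1."""
    return [(parent, child) for child, parent in enumerate(parents) if parent >= 0]


def adjacency_matrix(parents: List[int]) -> List[List[int]]:
    """Return the symmetric |V| x |V| 0/1 matrix of the tree viewed as a graph."""
    n = len(parents)
    matrix = [[0] * n for _ in range(n)]
    for parent, child in edge_matrix(parents):
        # The tree is treated as undirected, so both directions are marked.
        matrix[parent][child] = 1
        matrix[child][parent] = 1
    return matrix


# Hypothetical toy tree: node 0 is the root, nodes 1 and 2 are its children,
# and nodes 3 and 4 are children of node 2.
parents = [-1, 0, 0, 2, 2]
edges = edge_matrix(parents)
adjacency = adjacency_matrix(parents)
print(edges)
for row in adjacency:
    print(row)
```

For this five-node tree the edges to vertices matrix has 4 rows of 2 entries, while the adjacency matrix has 25 entries, most of them zero.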
<br /> <br /> ==== Supervised Learning Approach (HTML to DSL) ====<br /> <br /> Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below) which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. <br /> <br /> [[File:Pix2code.png|thumb|center|upright=3|Image from "Project Goal V2" of Pix2Code architecture]]<br /> <br /> ==Literature==<br /> <br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] <br /> :This is the documentation for the Pix2Code architecture mentioned. <br /> <br /> * [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]<br /> :This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.<br /> <br /> * [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]<br /> :This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. 
<br /> <br /> * [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]<br /> : This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.<br /> <br /> *[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]<br /> : This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. <br /> <br /> *[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]<br /> : The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. <br /> <br /> *[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]<br /> : This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.<br /> <br /> *[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]<br /> : This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. 
<br /> <br /> *[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]<br /> : This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]<br /> : This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. <br /> <br /> * [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]<br /> : node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. <br /> <br /> * [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]<br /> : V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. 
<br /> <br /> * [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]<br /> : This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. <br /> <br /> * [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]<br /> : In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. <br /> <br /> * [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]<br /> :This article describes another approach to learning graph-level representations, this time through a combination of supervised and unsupervised learning components. <br /> <br /> <br /> === General ===<br /> <br /> * [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]<br /> : This article describes methods to simplify the content of a noisy HTML page, specifically by using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.<br /> <br /> * [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]<br /> : This article classifies various web data extraction techniques into five types of tools and one category of web-extraction-specific languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. 
Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.<br /> <br /> <br /> == Implementation ==<br /> <br /> This section contains possible implementation libraries and tools for various components of the extractor.<br /> <br /> === HTML Tree Structure Analysis ===<br /> <br /> * [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]<br /> : A simple Python library that can parse HTML files into "Beautiful Soup objects," which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.<br /> <br /> *[https://docs.scrapy.org/en/latest/index.html Scrapy]<br /> : Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating "selectors" specified by CSS or XPath expressions. <br /> <br /> ==== pix2code ====<br /> <br /> * [https://github.com/tonybeltramelli/pix2code pix2Code]<br /> : Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&feature=youtu.be Video] demo of trained neural network. <br /> <br /> * [https://github.com/fjbriones/pix2code2 pix2code2]<br /> : An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.<br /> <br /> * [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]<br /> : Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. 
<br /> <br /> * [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]<br /> : pix2code implemented in PyTorch, also not ready for general usage yet.<br /> <br /> * [https://github.com/ngundotra/code2pix code2pix]<br /> : A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.<br /> <br /> === DFS Encoding ===<br /> <br /> * [https://github.com/aditya-grover/node2vec node2Vec]<br /> : Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.<br /> : A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow<br /> <br /> * [https://networkx.github.io/documentation/stable/index.html NetworkX]<br /> : NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. <br /> <br /> *[https://radimrehurek.com/gensim/ Gensim]<br /> : Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.<br /> <br /> * [http://www.numpy.org/ NumPy]<br /> : NumPy is a computing package that includes a N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.<br /> <br /> === DSL Development ===<br /> <br /> * [http://hackage.haskell.org/package/lucid Lucid]<br /> : Lucid is a DSL implemented with Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements. <br /> <br /> <br /> === General ===<br /> <br /> * [https://keras.io/ Keras]<br /> : In conjunction with tensorflow, Keras will support the deep learning components of the project. 
Is required for pix2code.<br /> <br /> * [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]<br /> : From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.<br /> <br /> * [https://www.h5py.org/ H5PY]<br /> : The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy.<br /> <br /> === Useful tutorials ===<br /> : Since we will be using a two-layer LSTM in tensorflow, this [https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40 article] might be useful.</div>
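As a rough illustration of the depth-first encoding idea from the "DFS Encoding" sections above, the sketch below walks an HTML string in document order and emits one token per opening and closing tag. It only uses Python's standard-library `html.parser`; the names `DfsTokenizer` and `dfs_tokens` and the `/tag` closing-marker convention are assumptions made for this example, not part of our DSL or of pix2code.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so no "/tag" token is emitted for them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DfsTokenizer(HTMLParser):
    """Collects tag tokens in depth-first (document) order. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Entering a node: record its tag name (attributes and text are ignored).
        self.tokens.append(tag)

    def handle_endtag(self, tag):
        # Leaving a node: record a closing marker so the nesting can be recovered.
        if tag not in VOID_TAGS:
            self.tokens.append("/" + tag)


def dfs_tokens(html):
    """Return the depth-first token sequence for an HTML string."""
    parser = DfsTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For example, `dfs_tokens("<div><p>hi</p><br></div>")` yields `div, p, /p, br, /div`. A sequence like this could then be mapped to integer ids or one-hot vectors before being fed to an LSTM, but that step is left out here.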

Hiep http://www.edegan.com/mediawiki/index.php?title=GPU_fix&diff=25304 GPU fix 2019-04-16T18:14:49Z

<p>Hiep: </p> <hr /> <div>==Fixing the GPU on the RDP (April 2019)==<br /> <br /> At the time of writing, CUDA and tensorflow cannot connect to our GPU. This can be verified by opening python3 and typing<br /> import tensorflow as tf<br /> tf.test.is_gpu_available()<br /> <br /> In addition, GeForce Experience and the NVIDIA Control Panel do not work right now.<br /> <br /> Hiep has checked the GPU driver by going to Device Manager -> Display adapters -> RTX Titan -> Properties -> Driver -> Update Driver. However, nothing new was installed since the current driver is already up to date.<br /> <br /> Hiep has also restarted several NVIDIA services by opening Run -> typing services.msc -> scrolling down to the NVIDIA services -> clicking Restart. However, this did not resolve the issue.<br /> <br /> Since GeForce Experience cannot be opened, Hiep has followed a solution from the NVIDIA forum: open Run -> type services.msc -> scroll down to the NVIDIA services -> under NVIDIA Telemetry Container -> choose Log On -> check Log on as Local System account.<br /> <br /> However, none of the above approaches resolved the issue. Reinstalling the NVIDIA software might be necessary.<br /> <br /> The error we get when opening GeForce Experience is <br /> The code execution cannot proceed because wlanapi.dll was not found. Reinstalling the program may fix this problem.<br /> <br /> However, after re-downloading and re-installing GeForce Experience, the error does not go away.<br /> <br /> According to this [https://www.lifewire.com/how-to-fix-wlanapi-dll-not-found-or-missing-errors-2624238 article], adding a feature called Wireless LAN Service may fix the problem.<br /> <br /> In addition, the computer cannot find the GPU when we type 'dxdiag' in the cmd prompt. 
This may be a driver-related issue, and this [http://www.bsocialshine.com/2016/06/how-to-fix-all-dll-file-missing-error.html article] suggests that installing the 'DirectX End-User Runtime Web Installer' will fix the problem.</div>
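The two-line GPU check above can be wrapped in a small script so that it reports a result instead of raising when tensorflow is missing. This is only a sketch: the function name `gpu_status` and its three return strings are made up for this page. It uses `tf.test.is_gpu_available()` as quoted above, and prefers `tf.config.list_physical_devices` when the installed tensorflow provides it (newer releases do, and mark the older call as deprecated).

```python
def gpu_status():
    """Return 'no-tensorflow', 'gpu', or 'cpu-only'. Hypothetical helper."""
    try:
        import tensorflow as tf
    except ImportError:
        # tensorflow is not installed (or its DLLs failed to load) in this interpreter.
        return "no-tensorflow"

    # Newer tensorflow releases expose tf.config.list_physical_devices;
    # look it up defensively so the script also runs on older versions.
    list_devices = getattr(getattr(tf, "config", None), "list_physical_devices", None)
    if list_devices is not None:
        return "gpu" if list_devices("GPU") else "cpu-only"

    # Fallback: the call used in the manual check above.
    return "gpu" if tf.test.is_gpu_available() else "cpu-only"


if __name__ == "__main__":
    print(gpu_status())
```

A result of `cpu-only` on the RDP would match the symptom described on this page: tensorflow imports fine but does not see the GPU.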

Hiep http://www.edegan.com/mediawiki/index.php?title=GPU_fix&diff=25294 GPU fix 2019-04-16T17:54:27Z

<p>Hiep: </p> <hr /> <div>==Fixing the GPU on the RDP (Aprl 2019)==<br /> <br /> At the time of writing, CUDA and tensorflow cannot connect to our GPU. This can be verified by opening python3 and type<br /> import tensorflow as tf<br /> tf.test.is_gpu_available()<br /> <br /> In addition, GeForce Experience and NVIDIA control panel do not work right now.<br /> <br /> Hiep has checked GPU driver by going to device manager -> Display adapter -> RTX Titan -> Properties -> Driver -> Update Driver. However, nothing new is installed since the current driver is already up-to-date.<br /> <br /> Hiep has also restarted several NVIDIA services by opening Run -> type in services.msc -> scrolled down to NVIDIA services -> clicked on restart. However, it still cannot resolve the issue.<br /> <br /> Since Geforce Experience cannot be opened, Hiep has followed a solution from NVIDIA forum, which is opening Run -> type in services.msc -> scrolled down to NVIDIA service -> Under nvidia telemetry container -> Choose Log on -> Check Log on as Local System Account.<br /> <br /> However, none of the above approaches actually resolved the issue. Reinstalling NVIDIA software might be necessary.<br /> <br /> The error we have while opening GeForce experience is <br /> The code execution cannot proceed because wlanapi.dll was not found. Reinstalling the program may fix this problem.<br /> <br /> According to this [https://www.lifewire.com/how-to-fix-wlanapi-dll-not-found-or-missing-errors-2624238 article], adding a featured called Wireless LAN service may fix the problem.<br /> <br /> In addition, the computer cannot find the GPU when we type 'dxdiag' in the cmd prompt. This may be a driver-related issue and this [http://www.bsocialshine.com/2016/06/how-to-fix-all-dll-file-missing-error.html article] suggests that installing 'Directx end user runtime web installer' will fix the problem.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=GPU_fix&diff=25264 GPU fix 2019-04-15T16:06:40Z

<p>Hiep: </p> <hr /> <div>==Fixing the GPU on the RDP (Aprl 2019)==<br /> <br /> At the time of writing, CUDA and tensorflow cannot connect to our GPU. This can be verified by opening python3 and type<br /> import tensorflow as tf<br /> tf.test.is_gpu_available()<br /> <br /> In addition, GeForce Experience and NVIDIA control panel do not work right now.<br /> <br /> Hiep has checked GPU driver by going to device manager -> Display adapter -> RTX Titan -> Properties -> Driver -> Update Driver. However, nothing new is installed since the current driver is already up-to-date.<br /> <br /> Hiep has also restarted several NVIDIA services by opening Run -> type in services.msc -> scrolled down to NVIDIA services -> clicked on restart. However, it still cannot resolve the issue.<br /> <br /> Since Geforce Experience cannot be opened, Hiep has followed a solution from NVIDIA forum, which is opening Run -> type in services.msc -> scrolled down to NVIDIA service -> Under nvidia telemetry container -> Choose Log on -> Check Log on as Local System Account.<br /> <br /> However, none of the above approaches actually resolved the issue. Reinstalling NVIDIA software might be necessary.<br /> <br /> The error we have while opening GeForce experience is <br /> The code execution cannot proceed because wlanapi.dll was not found. Reinstalling the program may fix this problem.<br /> <br /> According to this [https://www.lifewire.com/how-to-fix-wlanapi-dll-not-found-or-missing-errors-2624238 article], adding a featured called Wireless LAN service may fix the problem.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=GPU_fix&diff=25261 GPU fix 2019-04-11T21:24:03Z

<p>Hiep: Created page with "==Fixing the GPU on the RDP (Aprl 11th 2019)== At the time of writing, CUDA and tensorflow cannot connect to our GPU. This can be verified by opening python3 and type import..."</p> <hr /> <div>==Fixing the GPU on the RDP (Aprl 11th 2019)==<br /> <br /> At the time of writing, CUDA and tensorflow cannot connect to our GPU. This can be verified by opening python3 and type<br /> import tensorflow as tf<br /> tf.test.is_gpu_available()<br /> <br /> In addition, GeForce Experience and NVIDIA control panel do not work right now.<br /> <br /> Hiep has checked GPU driver by going to device manager -> Display adapter -> RTX Titan -> Properties -> Driver -> Update Driver. However, nothing new is installed since the current driver is already up-to-date.<br /> <br /> Hiep has also restarted several NVIDIA services by opening Run -> type in services.msc -> scrolled down to NVIDIA services -> clicked on restart. However, it still cannot resolve the issue.<br /> <br /> Since Geforce Experience cannot be opened, Hiep has followed a solution from NVIDIA forum, which is opening Run -> type in services.msc -> scrolled down to NVIDIA service -> Under nvidia telemetry container -> Choose Log on -> Check Log on as Local System Account.<br /> <br /> However, none of the above approaches actually resolved the issue. Reinstalling NVIDIA software might be necessary.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=RDP_Software_Configuration&diff=25240 RDP Software Configuration 2019-04-11T18:41:25Z

<p>Hiep: </p> <hr /> <div>All software installed on the RDP, as well as its configuration, should be recorded on this page!<br /> <br /> ==Base installation==<br /> <br /> Ed installed the following during the build:<br /> *ActiveState Perl 5.26.3<br /> *ArcGIS Desktop (instructions at http://answers.library.georgetown.edu/faq/247307)<br /> *ArcGIS Reader (ESU196456098)<br /> **Python 2.7 (installed with ArcGIS in C:\Python27\ArcGIS10.6)<br /> *CUDA 10.1<br /> *Google Chrome<br /> *Komodo 9 IDE (licence is E:\mcnair\installs\Komodo-IDE-9-Windows-S19344C4830A.exe)<br /> *.NET 3.5 (install from media, see instructions [https://awsbloglink.wordpress.com/2018/10/25/windows-server-2019-measures-to-be-taken-when-installing-net-framework-3-5-fails/])<br /> *Matlab 2018a (instructions at http://uis.georgetown.edu/computers/purchase/software/matlab/install)<br /> *Office 2019<br /> *STATA 15MP (24 core, network edition, 2 licenses)<br /> *SDC Platinum<br /> *Textpad 8<br /> *Visual Studio 2018 Community Edition<br /> **Anaconda 3 & Python 3.6 (installed with Microsoft Visual Studio, in C:\Program Files (x86)\Microsoft Visual Studio\Shared\)<br /> <br /> Hiep installed the following:<br /> *Git windows 2.21.0<br /> *Git bash (to use git, no new path added)<br /> ==Python and R==<br /> <br /> Ed installed additional new versions of:<br /> *Python 2.7<br /> *Anaconda 3 (with the add to path option)<br /> *R 3.5.3<br /> <br /> Afterwards C:\Python27, C:\Python27\Lib and C:\Program Files\R\R-3.5.3\bin\x64 were added to the path (search "edit system environment variables"). C:\Python27\python.exe was copied to C:\Python27\python2.exe and C:\ProgramData\Anaconda3\python.exe was copied to python3.exe. 
<br /> <br /> Users wanting to run python can therefore run any of the following:<br /> python -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> python3 -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> python2 -- runs python 2.7 in C:\Python27<br /> py -3 -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> py -2 -- runs python 2.7 in C:\Python27<br /> <br /> For some reason this configuration stopped working. It seems that C:\ProgramData\Anaconda3 was removed from the path. It has now been added back. If you have an issue, please try closing and reopening your shell, or disconnecting and reconnecting your session.<br /> <br /> For the old RDP configuration, see notes on [[Python on the RDP]]. There was also a GIT server on the old RDP, which hosted our [[Software Repository]]. All of the projects in the [[Software Repository Listing]] are on the E drive. We may install a new GIT server at some point.<br /> <br /> ==Adding libraries==<br /> <br /> If you add a library or package to a programming language, for instance through pip or manually, record what you did here!<br /> <br /> The following packages have been downloaded for python3 via pip<br /> tensorflow 1.13.1<br /> keras 2.2.4<br /> open-cv 4.0<br /> networkx 4.3<br /> sklearn 0.20.1<br /> numpy 1.16.2 (upgraded from 1.15.4)</div>
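To confirm which interpreter a given command actually starts, and whether the packages listed above are visible to it, something like the following sketch can be run with each of `python`, `python2`, `python3`, `py -2` and `py -3`. The helper names `installed_versions` and `report` are invented for this example. The import names are assumptions about how the pip packages are imported (`cv2` for OpenCV, `sklearn` for scikit-learn).

```python
import importlib
import sys


def installed_versions(module_names):
    """Map each module name to its version string, 'unknown', or None if not importable."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Most packages expose __version__; fall back to 'unknown' when they do not.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


def report():
    """Print the running interpreter and the versions of the packages listed on this page."""
    print(sys.executable)
    print(sys.version.split()[0])
    packages = ["tensorflow", "keras", "cv2", "networkx", "sklearn", "numpy"]
    for name, version in installed_versions(packages).items():
        print(name, version if version is not None else "not installed")
```

Calling `report()` prints the interpreter path first, which makes it easy to spot when C:\ProgramData\Anaconda3 has dropped off the path again.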

Hiep http://www.edegan.com/mediawiki/index.php?title=RDP_Software_Configuration&diff=25202 RDP Software Configuration 2019-04-09T20:12:39Z

<p>Hiep: </p> <hr /> <div>All software installed on the RDP, as well as its configuration, should be recorded on this page!<br /> <br /> ==Base installation==<br /> <br /> Ed installed the following during the build:<br /> *ActiveState Perl 5.26.3<br /> *ArcGIS Desktop (instructions at http://answers.library.georgetown.edu/faq/247307)<br /> *ArcGIS Reader (ESU196456098)<br /> **Python 2.7 (installed with ArcGIS in C:\Python27\ArcGIS10.6)<br /> *CUDA 10.1<br /> *Google Chrome<br /> *Komodo 9 IDE (licence is E:\mcnair\installs\Komodo-IDE-9-Windows-S19344C4830A.exe)<br /> *.NET 3.5 (install from media, see instructions [https://awsbloglink.wordpress.com/2018/10/25/windows-server-2019-measures-to-be-taken-when-installing-net-framework-3-5-fails/])<br /> *Matlab 2018a (instructions at http://uis.georgetown.edu/computers/purchase/software/matlab/install)<br /> *Office 2019<br /> *STATA 15MP (24 core, network edition, 2 licenses)<br /> *SDC Platinum<br /> *Textpad 8<br /> *Visual Studio 2018 Community Edition<br /> **Anaconda 3 & Python 3.6 (installed with Microsoft Visual Studio, in C:\Program Files (x86)\Microsoft Visual Studio\Shared\)<br /> <br /> ==Python and R==<br /> <br /> Ed installed additional new versions of:<br /> *Python 2.7<br /> *Anaconda 3 (with the add to path option)<br /> *R 3.5.3<br /> <br /> Afterwards C:\Python27, C:\Python27\Lib and C:\Program Files\R\R-3.5.3\bin\x64 were added to the path (search "edit system environment variables"). C:\Python27\python.exe was copied to C:\Python27\python2.exe and C:\ProgramData\Anaconda3\python.exe was copied to python3.exe. 
<br /> <br /> Users wanting to run python can therefore run any of the following:<br /> python -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> python3 -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> python2 -- runs python 2.7 in C:\Python27<br /> py -3 -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> py -2 -- runs python 2.7 in C:\Python27<br /> <br /> For some reason this configuration stopped working. It seems that C:\ProgramData\Anaconda3 was removed from the path. It has now been added back. If you have an issue, please try closing and reopening your shell, or disconnecting and reconnecting your session.<br /> <br /> For the old RDP configuration, see notes on [[Python on the RDP]]. There was also a GIT server on the old RDP, which hosted our [[Software Repository]]. All of the projects in the [[Software Repository Listing]] are on the E drive. We may install a new GIT server at some point.<br /> <br /> ==Adding libraries==<br /> <br /> If you add a library or package to a programming language, for instance through pip or manually, record what you did here!<br /> <br /> The following packages have been downloaded for python3 via pip<br /> tensorflow 1.13.1<br /> keras 2.2.4<br /> open-cv 4.0<br /> networkx 4.3<br /> sklearn 0.20.1<br /> numpy 1.16.2 (upgraded from 1.15.4)</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=RDP_Software_Configuration&diff=25170 RDP Software Configuration 2019-04-08T19:39:21Z

<p>Hiep: </p> <hr /> <div>All software installed on the RDP, as well as its configuration, should be recorded on this page!<br /> <br /> ==Base installation==<br /> <br /> Ed installed the following during the build:<br /> *ActiveState Perl 5.26.3<br /> *ArcGIS Desktop (instructions at http://answers.library.georgetown.edu/faq/247307)<br /> *ArcGIS Reader (ESU196456098)<br /> **Python 2.7 (installed with ArcGIS in C:\Python27\ArcGIS10.6)<br /> *CUDA 10.1<br /> *Google Chrome<br /> *Komodo 9 IDE (licence is E:\mcnair\installs\Komodo-IDE-9-Windows-S19344C4830A.exe)<br /> *.NET 3.5 (install from media, see instructions [https://awsbloglink.wordpress.com/2018/10/25/windows-server-2019-measures-to-be-taken-when-installing-net-framework-3-5-fails/])<br /> *Matlab 2018a (instructions at http://uis.georgetown.edu/computers/purchase/software/matlab/install)<br /> *Office 2019<br /> *STATA 15MP (24 core, network edition, 2 licenses)<br /> *SDC Platinum<br /> *Textpad 8<br /> *Visual Studio 2018 Community Edition<br /> **Anaconda 3 & Python 3.6 (installed with Microsoft Visual Studio, in C:\Program Files (x86)\Microsoft Visual Studio\Shared\)<br /> <br /> ==Python and R==<br /> <br /> Ed installed additional new versions of:<br /> *Python 2.7<br /> *Anaconda 3 (with the add to path option)<br /> *R 3.5.3<br /> <br /> Afterwards C:\Python27, C:\Python27\Lib and C:\Program Files\R\R-3.5.3\bin\x64 were added to the path (search "edit system environment variables"). C:\Python27\python.exe was copied to C:\Python27\python2.exe and C:\ProgramData\Anaconda3\python.exe was copied to python3.exe. 
<br /> <br /> Users wanting to run python can therefore run any of the following:<br /> python -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> python3 -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> python2 -- runs python 2.7 in C:\Python27<br /> py -3 -- runs python 3.7 in C:\ProgramData\Anaconda3<br /> py -2 -- runs python 2.7 in C:\Python27<br /> <br /> For the old RDP configuration, see notes on [[Python on the RDP]]. There was also a GIT server on the old RDP, which hosted our [[Software Repository]]. All of the projects in the [[Software Repository Listing]] are on the E drive. We may install a new GIT server at some point.<br /> <br /> ==Adding libraries==<br /> <br /> If you add a library or package to a programming language, for instance through pip or manually, record what you did here!<br /> <br /> The following packages have been downloaded for python3 via pip<br /> tensorflow 1.13.1<br /> keras 2.2.4<br /> open-cv 4.0<br /> networkx 4.3<br /> sklearn 0.20.1</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=Pix2code&diff=25138 Pix2code 2019-04-05T22:45:14Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=Pix2code experimentation<br /> |Has owner=Hiep Nguyen,<br /> }}<br /> <br /> ==Brief Introduction==<br /> Pix2code is an AI model that converts GUI images to DSL code and then uses a compiler to convert the DSL code to HTML, Android XML, or iOS Storyboard. More details can be found [https://arxiv.org/pdf/1705.07962.pdf here] in the original paper. Instructions to train and use the models can be found on the original [https://github.com/tonybeltramelli/pix2code github] page. There is an improved version of pix2code, which is [https://github.com/fjbriones/pix2code2 pix2code2]. It uses a Convolutional Neural Network (CNN) as an autoencoder for the GUI before training. The authors also include a pre-trained model to experiment with. What we have on the RDP right now is pix2code2.<br /> <br /> ==Usage of pix2code on RDP==<br /> Currently, the source code and pre-trained model for pix2code live in <br /> E:/projects/pix2code_test<br /> <br /> To generate DSL code from a specific GUI image, first place the image in the pix2code_test directory, then do the following<br /> <br /> cd pix2code_test/model<br /> ./sample.py ../bin pix2code2 ../test_img.png ../code greedy #to use greedy algorithm, replace greedy with 1,2,3..,k to use beam search with size k.<br /> <br /> The generated GUI code will be inside the pix2code_test/code directory<br /> <br /> To compile the generated GUI file to HTML:<br /> cd compiler<br /> ./web_compiler.py ./code/test_img.gui<br /> <br /> ==Discussion==<br /> While pix2code can preserve the structure of the HTML page quite well, it cannot preserve the contents of the website. Most of the text from the original page is distorted in the generated DSL. Moreover, pix2code is extremely expensive to train, and the current model only works for very simple GUIs that are similar to the ones in the training set. Hence, the pix2code model would not be suited for building an information extractor. 
However, the source code still shows how to feed in and structure GUI data, how to build LSTM networks on top of the GUI input, and how to output DSL code.</div>
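The `greedy` versus beam-size option of `sample.py` mentioned above can be illustrated with a toy decoder. This is not pix2code's sampler. It is a minimal sketch of the two search strategies, and `toy_model`, its probabilities, and the `<END>` token are all made up for the example.

```python
import math


def beam_search(next_token_probs, beam_size, max_len, end_token="<END>"):
    """Return the most probable token sequence found with the given beam size.

    next_token_probs(prefix) takes a tuple of tokens and returns {token: probability}.
    beam_size=1 is greedy decoding: only the single best prefix is kept at each step.
    """
    beams = [(0.0, [])]  # each entry is (log-probability, tokens so far)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == end_token:
                candidates.append((logp, seq))  # finished sequences are carried over
                continue
            for token, p in next_token_probs(tuple(seq)).items():
                if p > 0:
                    candidates.append((logp + math.log(p), seq + [token]))
        # Keep only the beam_size highest-scoring prefixes.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(seq and seq[-1] == end_token for _, seq in beams):
            break
    return beams[0][1]


def toy_model(prefix):
    """Hypothetical next-token distribution in which the greedy choice is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"z": 1.0},
    }
    return table.get(prefix, {"<END>": 1.0})
```

With this toy model, greedy decoding (`beam_size=1`) commits to `a` first and ends with a sequence of probability 0.33, while `beam_size=2` keeps `b` alive and finds the sequence `b, z` with probability 0.4. Larger beams cost more computation per step, which is the trade-off behind the `greedy` versus `k` choice.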

Hiep http://www.edegan.com/mediawiki/index.php?title=Pix2code&diff=25137 Pix2code 2019-04-05T22:43:41Z

<p>Hiep: </p> <hr /> <div>{{Project<br /> |Has title=Pix2code experimentation<br /> |Has owner=Hiep Nguyen,<br /> }}<br /> <br /> ==Brief Introduction==<br /> Pix2code is an AI model that can convert GUI images to DSL codes and then uses a compiler to convert DSL code to HTML, Android XML, and iOS Storyboard. More details can be found [https://arxiv.org/pdf/1705.07962.pdf here] in the original paper. Instructions to train and use the models can be found on the original [https://github.com/tonybeltramelli/pix2code github] page. There is an improved version of pix2code, which is [https://github.com/fjbriones/pix2code2 pix2code2]. It uses a Convolutional Neural Network (CNN) as an autoencoder for the GUI before training. The users also include a pre-trained model to experiment with.<br /> <br /> ==Usage of pix2code on RDP==<br /> Currently, source code and pre-trained model for pix2code are living on <br /> E:/projects/pix2code_test<br /> <br /> To generate DSL code from specific GUI images, first place the image on pix2code_test directory, then do the following<br /> <br /> cd pix2code_test/model<br /> ./sample.py ../bin pix2code2 ../test_img.png ../code greedy #to use greedy algorithm, replace greedy with 1,2,3..,k to use beam search with size k.<br /> <br /> The GUI code will be inside the pix2code_test/code directory<br /> <br /> To generate GUI to HTML:<br /> cd compiler<br /> ./web_compiler.py ./code/test_img.gui<br /> <br /> ==Discussion==<br /> While pix2code can preserve the structure of the HTML page quite well, it cannot preserve the contents of the website. Most of the texts from the original page are distorted in the generated DSL. Moreover, pix2code is extremely expensive to train and the current model only works for very simple GUIs that are similar to ones in the training set. Hence, pix2code model would not be suited for building an information extractor. 
However, we can learn from the source code how to input and structure GUI data and construct LSTM networks on top of GUI and output DSL code.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=Pix2code&diff=25136 Pix2code 2019-04-05T22:42:15Z

<p>Hiep: /* Usage on RDP */</p> <hr /> <div>{{Project<br /> |Has title=Pix2code experimentation<br /> |Has owner=Hiep Nguyen,<br /> }}<br /> <br /> ==Brief Introduction==<br /> Pix2code is an AI model that can convert GUI images to DSL codes and then uses a compiler to convert DSL code to HTML, Android XML, and iOS Storyboard. More details can be found [https://arxiv.org/pdf/1705.07962.pdf here] in the original paper. Instructions to train and use the models can be found on the original [https://github.com/tonybeltramelli/pix2code github] page. There is an improved version of pix2code, which is [https://github.com/fjbriones/pix2code2 pix2code2]. It uses a Convolutional Neural Network (CNN) as an autoencoder for the GUI before training. The users also include a pre-trained model to experiment with.<br /> <br /> ==Usage of pix2code on RDP==<br /> Currently, source code and pre-trained model for pix2code are living on <br /> E:/projects/pix2code_test<br /> <br /> To generate DSL code from specific GUI images, first place the image on pix2code_test directory, then do the following<br /> <br /> cd pix2code_test/model<br /> ./sample.py ../bin pix2code2 ../test_img.png ../code greedy #to use greedy algorithm, replace greedy with 1,2,3..,k to use beam search with size k.<br /> <br /> The GUI code will be inside the pix2code_test/code directory<br /> <br /> To generate GUI to HTML:<br /> cd compiler<br /> ./web_compiler.py ./code/test_img.gui<br /> <br /> ==Discussion==<br /> While pix2code can preserve the structure of the HTML page quite well, it cannot preserve the contents of the websites. Most of the texts from the original page are distorted in the generated DSL. Moreover, pix2code is extremely expensive to train and the current model only works for very simple GUIs that are similar to ones in the training set. Hence, pix2code model would not be suited for building an information extractor. 
However, we can learn from the source code how to input and structure GUI data and construct LSTM networks on top of GUI and output DSL code.</div>

Hiep http://www.edegan.com/mediawiki/index.php?title=Pix2code&diff=25135 Pix2code 2019-04-05T22:41:53Z

<p>Hiep: Created page with "{{Project |Has title=Pix2code experimentation |Has owner=Hiep Nguyen, }} ==Brief Introduction== Pix2code is an AI model that can convert GUI images to DSL codes and then uses..."</p> <hr /> <div>{{Project<br /> |Has title=Pix2code experimentation<br /> |Has owner=Hiep Nguyen,<br /> }}<br /> <br /> ==Brief Introduction==<br /> Pix2code is an AI model that can convert GUI images to DSL codes and then uses a compiler to convert DSL code to HTML, Android XML, and iOS Storyboard. More details can be found [https://arxiv.org/pdf/1705.07962.pdf here] in the original paper. Instructions to train and use the models can be found on the original [https://github.com/tonybeltramelli/pix2code github] page. There is an improved version of pix2code, which is [https://github.com/fjbriones/pix2code2 pix2code2]. It uses a Convolutional Neural Network (CNN) as an autoencoder for the GUI before training. The users also include a pre-trained model to experiment with.<br /> <br /> ==Usage on RDP==<br /> Currently, source code and pre-trained model for pix2code are living on <br /> E:/projects/pix2code_test<br /> <br /> To generate DSL code from specific GUI images, first place the image on pix2code_test directory, then do the following<br /> <br /> cd pix2code_test/model<br /> ./sample.py ../bin pix2code2 ../test_img.png ../code greedy #to use greedy algorithm, replace greedy with 1,2,3..,k to use beam search with size k.<br /> <br /> The GUI code will be inside the pix2code_test/code directory<br /> <br /> To generate GUI to HTML:<br /> cd compiler<br /> ./web_compiler.py ./code/test_img.gui<br /> <br /> ==Discussion==<br /> While pix2code can preserve the structure of the HTML page quite well, it cannot preserve the contents of the websites. Most of the texts from the original page are distorted in the generated DSL. 
Moreover, pix2code is extremely expensive to train and the current model only works for very simple GUIs that are similar to ones in the training set. Hence, pix2code model would not be suited for building an information extractor. However, we can learn from the source code how to input and structure GUI data and construct LSTM networks on top of GUI and output DSL code.</div>

Hiep