Adventures of a self-learning data scientist

I have finished manually cleaned up text files from 1947 to 2020

I put the cleaned-up files in texts/presidents/ of this GitHub repo. I name a file as 1947_pres.txt. These files include only texts of the president part, not including those of the Council of Economic Advisers part.

I cleaned up by:

correcting word order, where digitization made mistakes in shaping lines;
correcting words, where optical recognition made mistakes due to dirt, or the author apparently misspelled;
making punctuation common over reports, following my memory of “The Mac is Not a Typewriter” by Robin P. Williams, which I must have, but could not find now. As an exception, using minus mark instead of hyphen; and
changing lines, where I encounter “;”, “:”, “—” or “,” and when I feel changed lines are natural if they were in President Johnson’s reports.

I tried to be consistent over reports of different authors. Honestly, I am not sure whether “text” has become more consistent or not, as a result.

I tried Google Colab, but failed

The file in the same repository, let_pres_speak_failed.ipynb, is my failed attempt to use Google Colab to create a model.

It timed out at 186th among 500 epochs.

I tried docker, struggled with an error of dnn implementation, and eventually found a solution

So I came back to docker, and tried tensorflow 2.3.0rc1, hoping newer version had fixed the errors.

docker run -u $(id -u):$(id -g) \
  --gpus all -it -p 8888:8888 -v `pwd`:/tf/notebooks \
  tensorflow/tensorflow:2.3.0rc1-gpu-jupyter

However, I encountered the same problem of dnn implementation. I searched and found tensorflow: Fail to find dnn implementation in StackOverflow. I added the lines:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

And it worked! To allow GPU grow was the key.

I created a model, and generated a text

The file in the same repository, let_em_speak_retry.ipynb, is the result I’ve got.

To my surprise, docker working in my machine was 4 times faster than Google Colab. It took approximately 8 hours.

Then I seeded the text, “I have some proposals to the Congress”. The model continued:

“of the first months of the past year have been accomplished to keep that matter up their abundance ending upon the war ii and hearts inflation continues to adjust peoples testifies nor paperwork proposals know moreover coverage to assures visible in our economic destabilizing coverage overregulated that skillfully assures we beef destabilizing coverage under coverage implies antitrust destabilizing fiscally testify to approve urban coverage coverage coverage and stymies consumption authorities dividend coverage appropriated coverage reaching else coverage propose proposals propose proposals stymies preference loans continued values meant that urgent coverage downward know coverage fostered that referred sdr disease coverage for”

Grammar is not correct, and the word “coverage” appears repeatedly. However, some sets of words sound natural, as the model guesses next word after word.

You can try

Although this model has no practical use, I have saved it as “ErpPresModel.h5”. You can play with it, by opening you_can_try.ipynb in Google Colab.

--- title: 'Update: let POTUS speak' author: Mitsuo Shiota date: '2020-07-17' categories: - economics - Python - tensorflow --- ## I have finished manually cleaned up text files from 1947 to 2020 I put the cleaned-up files in `texts/presidents/` of [this GitHub repo](https://github.com/mitsuoxv/erp). I name a file as `1947_pres.txt`. These files include only texts of the president part, not including those of the Council of Economic Advisers part. I cleaned up by: 1. correcting word order, where digitization made mistakes in shaping lines; 1. correcting words, where optical recognition made mistakes due to dirt, or the author apparently misspelled; 1. making punctuation common over reports, following my memory of ["The Mac is Not a Typewriter" by Robin P. Williams](https://www.goodreads.com/book/show/41600.The_Mac_is_Not_a_Typewriter), which I must have, but could not find now. As an exception, using minus mark instead of hyphen; and 1. changing lines, where I encounter ";", ":", "—" or "," and when I feel changed lines are natural if they were in President Johnson's reports. I tried to be consistent over reports of different authors. Honestly, I am not sure whether "text" has become more consistent or not, as a result. ## I tried Google Colab, but failed The file in [the same repository](https://github.com/mitsuoxv/erp), [let_pres_speak_failed.ipynb](https://github.com/mitsuoxv/erp/blob/master/let_pres_speak_failed.ipynb), is my failed attempt to use Google Colab to create a model. It timed out at 186th among 500 epochs. ## I tried docker, struggled with an error of dnn implementation, and eventually found a solution So I came back to docker, and tried tensorflow 2.3.0rc1, hoping newer version had fixed the errors. ``` docker run -u $(id -u):$(id -g) \ --gpus all -it -p 8888:8888 -v `pwd`:/tf/notebooks \ tensorflow/tensorflow:2.3.0rc1-gpu-jupyter ``` However, I encountered the same problem of dnn implementation. I searched and found [tensorflow: Fail to find dnn implementation](https://stackoverflow.com/questions/56333388/tensorflow-fail-to-find-dnn-implementation) in StackOverflow. I added the lines: ``` from tensorflow.compat.v1 import ConfigProto from tensorflow.compat.v1 import InteractiveSession config = ConfigProto() config.gpu_options.allow_growth = True session = InteractiveSession(config=config) ``` And it worked! To allow GPU grow was the key. ## I created a model, and generated a text The file in [the same repository](https://github.com/mitsuoxv/erp), [let_em_speak_retry.ipynb](https://github.com/mitsuoxv/erp/blob/master/let_em_speak_retry.ipynb), is the result I've got. To my surprise, docker working in [my machine](https://mitsuoxv.rbind.io/posts/2020-05-30-i-built-a-pc-installed-pop-os-and-am-running-rstudio/) was 4 times faster than Google Colab. It took approximately 8 hours. Then I seeded the text, "I have some proposals to the Congress". The model continued: "of the first months of the past year have been accomplished to keep that matter up their abundance ending upon the war ii and hearts inflation continues to adjust peoples testifies nor paperwork proposals know moreover coverage to assures visible in our economic destabilizing coverage overregulated that skillfully assures we beef destabilizing coverage under coverage implies antitrust destabilizing fiscally testify to approve urban coverage coverage coverage and stymies consumption authorities dividend coverage appropriated coverage reaching else coverage propose proposals propose proposals stymies preference loans continued values meant that urgent coverage downward know coverage fostered that referred sdr disease coverage for" Grammar is not correct, and the word "coverage" appears repeatedly. However, some sets of words sound natural, as the model guesses next word after word. ## You can try Although this model has no practical use, I have saved it as "ErpPresModel.h5". You can play with it, by opening [you_can_try.ipynb](https://github.com/mitsuoxv/erp/blob/master/you_can_try.ipynb) in Google Colab.