Chapter 1: Warm-up

09. Typoglycemia

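Given a sequence of words separated by spaces, write a program that keeps the first and last characters of each word and randomly shuffles the order of the remaining characters. Words of length 4 or less are left unshuffled. Feed it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.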

#coding: UTF-8
import random

text = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
textsp = text.split()

ans = []
cut = []

#def typo(text):
for i in textsp:
  if len(textsp) <= 4:
      ans.append(i)
  else:
      naka = list(i[1:-1])
      nakara = random.sample(naka,len(naka))
      cut.append(i[0:1])
      cut.append(nakara)
      cut.append(i[:-1])
      ans.append(" ".join(cut))
  print " ".join(ans)

Execution result
haruka@ubuntu:~/NLP100$ python 09.py
Traceback (most recent call last):
  File "09.py", line 20, in <module>
    ans.append(" ".join(cut))
TypeError: sequence item 1: expected string, list found
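The TypeError comes from `cut.append(nakara)`: `nakara` is a list, and `" ".join(cut)` only accepts strings. A few other things in the attempt would also need changing: the length check looks at `textsp` (the whole word list) rather than the word `i`, `cut` is never reset between words, `i[:-1]` is everything except the last character rather than the last character itself, and the pieces should be joined with an empty string. Below is a rough sketch of one possible fix. It is written for Python 3 (an assumption, since the attempt above is Python 2), and the function name `typoglycemia` and the `rng` parameter are made up for this example.

```python
import random


def typoglycemia(text, rng=random):
    """Shuffle the interior letters of every word longer than 4 characters."""
    result = []
    for word in text.split():
        if len(word) <= 4:
            # Short words are left untouched, as the problem requires.
            result.append(word)
        else:
            middle = list(word[1:-1])
            rng.shuffle(middle)  # shuffles the list in place
            # Join with "" so the word is rebuilt without extra spaces.
            result.append(word[0] + "".join(middle) + word[-1])
    return " ".join(result)


sentence = ("I couldn't believe that I could actually understand what "
            "I was reading : the phenomenal power of the human mind .")
print(typoglycemia(sentence))
```

The output differs on every run because the shuffle is random. Passing something like `random.Random(0)` as `rng` should make it repeatable.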

Chapter 2: UNIX Command Basics

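hightemp.txt is a file that stores records of the highest temperatures in Japan in tab-separated format with the columns "prefecture", "location", "℃", and "date". Write programs that perform the following tasks, and run them with hightemp.txt as the input file. In addition, perform the same tasks with UNIX commands and check the programs' output against them.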

15. Output the last N lines

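Receive a natural number N via a command-line argument or similar means, and display only the last N lines of the input. Use the tail command to check the result.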

#coding: UTF-8
import sys

argvs = sys.argv[1]

with open("hightemp.txt","r") as f:
  lines = f.readlines()
  for i in range(len(lines)-int(argvs),len(lines)):
      print lines[i],

Execution result
haruka@ubuntu:~/NLP100$ python 15.py 5
埼玉県	鳩山	39.9	1997-07-05
大阪府	豊中	39.9	1994-08-08
山梨県	大月	39.9	1990-07-19
山形県	鶴岡	39.9	1978-08-03
愛知県	名古屋	39.9	1942-08-02

Command check
haruka@ubuntu:~/NLP100$ tail -5 "hightemp.txt"
埼玉県	鳩山	39.9	1997-07-05
大阪府	豊中	39.9	1994-08-08
山梨県	大月	39.9	1990-07-19
山形県	鶴岡	39.9	1978-08-03
愛知県	名古屋	39.9	1942-08-02
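The script for problem 15 matches tail here, but it reads the whole file into a list, and if N were larger than the number of lines the start of the range would go negative and the indices would wrap around. One alternative, sketched here in Python 3 (an assumption, the script above is Python 2), is to let `collections.deque` keep only the last N lines. The function name `tail` and the sample lines are made up for this example.

```python
from collections import deque


def tail(lines, n):
    """Return the last n items from any iterable of lines."""
    # deque(maxlen=n) drops the oldest item whenever a new one arrives,
    # so at most n lines are kept in memory.
    return list(deque(lines, maxlen=n))


# Hypothetical stand-in for the lines of a file.
sample = ["line 1\n", "line 2\n", "line 3\n", "line 4\n"]
for line in tail(sample, 2):
    print(line, end="")
```

With the real data, the same function could be fed an open file object, for example `tail(open("hightemp.txt", encoding="utf-8"), int(sys.argv[1]))` after importing `sys`.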

17. Distinct strings in the first column

Find the set of distinct strings (the types of strings) in the first column. Use the sort and uniq commands to check the result.

#coding: UTF-8

k = set("")
with open("hightemp.txt","r") as f:
  for line in f:
    ki = k.add((line.split()[0])
  for i in ki
    print i

Execution result
haruka@ubuntu:~/NLP100$ python 17.py
set(['\xa5', '\xe7', '\xe9', '\xab', '\x8c', '\x98', '\x9c', '\x9f'])
set(['\x9c', '\xe5', '\xe7', '\x89', '\x8c', '\x8e', '\xbc', '\x9f'])
set(['\xe5', '\xe7', '\xe9', '\x8c', '\x90', '\xb2', '\x98', '\x9c'])
set(['\xa2', '\xe5', '\xe7', '\x8c', '\xb1', '\xbd', '\x9c'])
set(['\xa2', '\xe5', '\xe7', '\xe6', '\xa8', '\x8c', '\xb1', '\x9c'])
set(['\xe5', '\xe7', '\xe6', '\xad', '\x8c', '\xb1', '\x92', '\x9c'])
set(['\xa1', '\xe5', '\xe7', '\xe9', '\x8c', '\xb2', '\x99', '\x9d', '\x9c'])
set(['\xa2', '\xe5', '\xe7', '\xe6', '\xa8', '\x8c', '\xb1', '\x9c'])
set(['\x9c', '\xe5', '\xe7', '\x89', '\x8c', '\x8e', '\xbc', '\x9f'])
set(['\xa4', '\xe7', '\xa6', '\xe9', '\xac', '\x8c', '\x9c', '\xbe'])
set(['\xa4', '\xe7', '\xa6', '\xe9', '\xac', '\x8c', '\x9c', '\xbe'])
set(['\xa5', '\x84', '\xe7', '\xe6', '\x8c', '\x9b', '\x9c', '\x9f'])
set(['\x83', '\xe5', '\xe7', '\x89', '\xe8', '\x8d', '\x8c', '\x91', '\x9c'])
set(['\xa1', '\xe5', '\xe7', '\xe9', '\x8c', '\xb2', '\x99', '\x9d', '\x9c'])
set(['\xe5', '\x84', '\xe7', '\xe6', '\xaa', '\x8c', '\x9b', '\x9c'])
set(['\xa2', '\xe5', '\xe7', '\x8c', '\xb1', '\xbd', '\x9c'])
set(['\xe5', '\xe7', '\xe9', '\x8c', '\x90', '\xb2', '\x98', '\x9c'])
set(['\xa4', '\xe7', '\xa6', '\xe9', '\xac', '\x8c', '\x9c', '\xbe'])
set(['\x83', '\xe5', '\xe7', '\x89', '\xe8', '\x8d', '\x8c', '\x91', '\x9c'])
set(['\x9c', '\xe5', '\xe7', '\x89', '\x8c', '\x8e', '\xbc', '\x9f'])
set(['\xe5', '\xa4', '\xa7', '\xe9', '\xaa', '\x98', '\xba', '\x9c'])
set(['\xa2', '\xe5', '\xe7', '\xe6', '\xa8', '\x8c', '\xb1', '\x9c'])
set(['\xa2', '\xe5', '\xe7', '\x8c', '\xb1', '\xbd', '\x9c'])
set(['\xa5', '\x84', '\xe7', '\xe6', '\x8c', '\x9b', '\x9c', '\x9f'])

Command check
haruka@ubuntu:~/NLP100$ cut -f1 "hightemp.txt" | sort | uniq
愛知県
愛媛県
岐阜県
群馬県
高知県
埼玉県
山形県
山梨県
静岡県
千葉県
大阪府
和歌山県
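The Python output for problem 17 does not match the command check. It looks like what `set()` produces when it is handed a single byte string: the string is broken into its individual bytes, one set per input line. In the script as written, `k.add(...)` also returns None, so `ki` would never hold the set. A minimal sketch of one way to collect the first column, assuming Python 3 (the attempt above is Python 2) and using a made-up function name `first_column_values`:

```python
def first_column_values(lines):
    """Collect the distinct values found in the first tab-separated column."""
    values = set()
    for line in lines:
        if line.strip():  # skip blank lines
            # Add the whole first field as one element of the set.
            values.add(line.split("\t")[0])
    return values


# Two rows taken from the tail output for problem 15; the first one is
# repeated on purpose to show that duplicates collapse.
sample = [
    "埼玉県\t鳩山\t39.9\t1997-07-05\n",
    "大阪府\t豊中\t39.9\t1994-08-08\n",
    "埼玉県\t鳩山\t39.9\t1997-07-05\n",
]
for name in sorted(first_column_values(sample)):
    print(name)
```

`sorted()` orders by code point, so the order may differ from what `sort` prints under a Japanese locale even when the set of names is the same.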

Chapter 3: Regular Expressions

There is a file, jawiki-country.json.gz, that contains Wikipedia articles exported in the following format.

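Each line stores the information for one article in JSON format. On each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON. The whole file is compressed with gzip.

Write programs that perform the following tasks.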

20. Reading JSON data

Read the JSON file of Wikipedia articles and display the body of the article about "イギリス" (the United Kingdom). For problems 21-29, work on the article body extracted here.

#coding: UTF-8

import json

uk = open("jawiki-uk.txt","w")

with open("jawiki-country.json","r") as f:
  for line in f.readlines():
    aa = json.loads(line,"utf-8")
    if aa["title"] == u"イギリス":
       res =  aa["text"]
       uk.write(res)

Execution result
haruka@ubuntu:~/NLP100$ python 20.py
Traceback (most recent call last):
  File "20.py", line 12, in <module>
    uk.write(res)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-21: ordinal not in range(128)
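The UnicodeEncodeError happens because, in Python 2, writing a unicode string to a file opened with the built-in `open()` makes Python encode it with the ASCII codec. Within Python 2, `uk.write(res.encode("utf-8"))` would be one way around it. The sketch below takes a different route and assumes Python 3, where both files can be opened with an explicit encoding. It also reads the .gz file directly instead of an unpacked copy, which is a choice made for this example. The function names `find_article_text` and `save_text` are made up.

```python
import gzip
import json


def find_article_text(path, title):
    """Return the "text" of the first article whose "title" matches, else None."""
    # Mode "rt" with an explicit encoding makes gzip decode bytes to str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            if article["title"] == title:
                return article["text"]
    return None


def save_text(path, text):
    """Write text to path as UTF-8."""
    # An explicit encoding avoids falling back to ASCII or a locale default.
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
```

With the real data this might be called as `save_text("jawiki-uk.txt", find_article_text("jawiki-country.json.gz", "イギリス"))`, after checking that the lookup did not return None.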

Chapter 4: Morphological Analysis

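Run morphological analysis with MeCab on the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result to a file named neko.txt.mecab. Use this file to implement programs for the following problems.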

For problems 37, 38, and 39, matplotlib or Gnuplot is a good choice.

30. Reading the morphological analysis results

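Implement a program that reads the morphological analysis results (neko.txt.mecab). Store each morpheme in a mapping type with the keys surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1), and represent each sentence as a list of morphemes (mapping types). Use the program written here for the remaining problems in Chapter 4.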

#coding: UTF-8

res = []
with open("neko.txt.mecab","r") as f:
  for i in f:
    i.replace("\t",",")
    i.split(",")
    surface = i[0]
    base = i[1]
    pos = i[2]
    pos1 = i[6]

    res.append({
    surface,base,pos,pos1
    })

print res
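The attempt at problem 30 has a few problems. `i.replace(...)` and `i.split(...)` return new values that are thrown away, so `i[0]`, `i[1]` and so on are single characters of the raw line. `{surface,base,pos,pos1}` builds a set rather than a mapping with keys. Sentence boundaries (the `EOS` lines) are not handled. The sketch below assumes Python 3 and MeCab's default output with the IPADIC dictionary, where each line is the surface form, a tab, then comma-separated features with the part of speech at index 0, subcategory 1 at index 1, and the base form at index 6. The function name `parse_mecab` is made up, and the sample lines were typed by hand in that layout rather than copied from neko.txt.mecab.

```python
def parse_mecab(lines):
    """Turn MeCab output lines into a list of sentences (lists of dicts)."""
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "EOS":
            # EOS closes a sentence; empty sentences are skipped.
            if current:
                sentences.append(current)
                current = []
            continue
        if "\t" not in line:
            continue  # ignore blank or malformed lines
        # Split on the tab first so a comma as the surface form stays intact.
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        if len(fields) < 7:
            continue  # not the assumed IPADIC layout
        current.append({
            "surface": surface,
            "base": fields[6],
            "pos": fields[0],
            "pos1": fields[1],
        })
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the assumed layout.
sample = [
    "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n",
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n",
    "で\t助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ\n",
    "ある\t助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル\n",
    "EOS\n",
]
for morpheme in parse_mecab(sample)[0]:
    print(morpheme["surface"], morpheme["base"], morpheme["pos"], morpheme["pos1"])
```

For the real file, the lines could come from `open("neko.txt.mecab", encoding="utf-8")`. Replacing the tab with a comma and splitting once, as in the attempt above, would misalign the fields whenever the surface form itself is a comma, which is why the sketch splits on the tab first.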

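Today's summary: a terrible harvest. It hurts not being able to solve these, but I like this kind of thing. I want to make a bit more progress tomorrow. The only one I solved today was 15.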

No progress today

Diary

I want to become a search engineer

言語処理100本ノック 2015, Day 3