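I found 言語処理100本ノック (NLP 100 Exercise), so I am working through it. My environment is Python 2.7 + Ubuntu 15.10. There will be rough spots, but thank you for bearing with me.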

Already solved: 00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15, 17, 18, 20, 21, 22, 24
Not solved yet: lots

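Chapter 2: UNIX command basics

hightemp.txt is a file that stores records of the highest temperatures in Japan in tab-separated format, with the columns "prefecture", "location", "℃", and "date". Write programs that perform the following processing and run them with hightemp.txt as the input file. In addition, perform the same processing with UNIX commands and check the programs' results against them.

16. Split a file into N parts

Receive a natural number N by some means such as a command-line argument, and split the input file into N parts by line. Achieve the same processing with the split command.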

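#coding: UTF-8

import sys

# NOTE: the argument is used as the number of lines per chunk
# (the same meaning as `split -l N`), not as the number of chunks.
lines_per_chunk = int(sys.argv[1])

# Read the file once; the with-block closes it automatically.
with open("hightemp.txt","r") as fileread:
  filereadlines = fileread.readlines()

numoflines = len(filereadlines)

# Only split when the file divides evenly (this also avoids dividing by zero).
if lines_per_chunk > 0 and numoflines % lines_per_chunk == 0:
  for i in xrange(numoflines // lines_per_chunk):
    splitfile = "".join(filereadlines[lines_per_chunk*i:lines_per_chunk*(i+1)])
    print splitfile
else:
  print "No split by this argument"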

Output:
haruka@ubuntu:~/NLP100$ python nlp16.py 8
高知県	江川崎	41	2013-08-12
埼玉県	熊谷	40.9	2007-08-16
岐阜県	多治見	40.9	2007-08-16
山形県	山形	40.8	1933-07-25
山梨県	甲府	40.7	2013-08-10
和歌山県	かつらぎ	40.6	1994-08-08
静岡県	天竜	40.6	1994-08-04
山梨県	勝沼	40.5	2013-08-10

埼玉県	越谷	40.4	2007-08-16
群馬県	館林	40.3	2007-08-16
群馬県	上里見	40.3	1998-07-04
愛知県	愛西	40.3	1994-08-05
千葉県	牛久	40.2	2004-07-20
静岡県	佐久間	40.2	2001-07-24
愛媛県	宇和島	40.2	1927-07-22
山形県	酒田	40.1	1978-08-03

岐阜県	美濃	40	2007-08-16
群馬県	前橋	40	2001-07-24
千葉県	茂原	39.9	2013-08-11
埼玉県	鳩山	39.9	1997-07-05
大阪府	豊中	39.9	1994-08-08
山梨県	大月	39.9	1990-07-19
山形県	鶴岡	39.9	1978-08-03
愛知県	名古屋	39.9	1942-08-02

Checking with the command:
haruka@ubuntu:~/NLP100$ split -l 8 "hightemp.txt" nlp16_split_file

高知県    江川崎   41  2013-08-12
埼玉県   熊谷  40.9    2007-08-16
岐阜県   多治見   40.9    2007-08-16
山形県   山形  40.8    1933-07-25
山梨県   甲府  40.7    2013-08-10
和歌山県    かつらぎ    40.6    1994-08-08
静岡県   天竜  40.6    1994-08-04
山梨県   勝沼  40.5    2013-08-10

埼玉県    越谷  40.4    2007-08-16
群馬県   館林  40.3    2007-08-16
群馬県   上里見   40.3    1998-07-04
愛知県   愛西  40.3    1994-08-05
千葉県   牛久  40.2    2004-07-20
静岡県   佐久間   40.2    2001-07-24
愛媛県   宇和島   40.2    1927-07-22
山形県   酒田  40.1    1978-08-03

岐阜県    美濃  40  2007-08-16
群馬県   前橋  40  2001-07-24
千葉県   茂原  39.9    2013-08-11
埼玉県   鳩山  39.9    1997-07-05
大阪府   豊中  39.9    1994-08-08
山梨県   大月  39.9    1990-07-19
山形県   鶴岡  39.9    1978-08-03
愛知県   名古屋   39.9    1942-08-02
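
The script above treats the argument as the number of lines per chunk, which is what `split -l N` does. For reference, here is a minimal sketch of the other reading of the task, where N is the number of parts. It is written for Python 3 (not the Python 2.7 used in this post), and the function name `split_lines` is my own choice for illustration.

```python
def split_lines(lines, n):
    """Split a list of lines into n consecutive chunks.

    Chunk sizes differ by at most one line; when the line count is not
    divisible by n, the earlier chunks receive the extra lines.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    base, extra = divmod(len(lines), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks
```

With the 24 lines of hightemp.txt, `split_lines(lines, 8)` should give 8 chunks of 3 lines each. On the command side, GNU split has a `-n l/N` option that splits into N files without breaking lines, but as far as I know it balances by byte size rather than by line count, so the pieces may not line up exactly with this sketch.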

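19. Find the frequency of each string in the first column and sort by descending frequency

Find the frequency of occurrence of the string in the first column of each line, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands to check.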

#coding: UTF-8
from collections import Counter

ans = []
# Collect the first (tab-separated) column of every line.
with open("hightemp.txt","r") as fileread:
  for line in fileread:
    ans.append(line.split("\t")[0])

counter = Counter(ans)
# most_common() yields (word, count) pairs, most frequent first.
for word, count in counter.most_common():
  print "\t" + word, count

Output:
haruka@ubuntu:~/NLP100$ python nlp19.py
山形県 3
埼玉県 3
群馬県 3
山梨県 3
愛知県 2
岐阜県 2
千葉県 2
静岡県 2
高知県 1
和歌山県 1
愛媛県 1
大阪府 1

Checking with the command:
haruka@ubuntu:~/NLP100$ cut -f1 "hightemp.txt" | sort | uniq -c | sort -r
3 山梨県
3 山形県
3 埼玉県
3 群馬県
2 千葉県
2 静岡県
2 岐阜県
2 愛知県
1 和歌山県
1 大阪府
1 高知県
1 愛媛県
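
The Python output and the command output agree on the counts but order the ties differently, because neither side specifies a tie-break. As a small sketch (Python 3, not the Python 2.7 used in this post; the function name `first_column_frequencies` is hypothetical), sorting on the pair (descending count, name) makes the order reproducible:

```python
from collections import Counter


def first_column_frequencies(lines):
    """Count first-column values and return (value, count) pairs.

    Pairs are sorted by descending count; ties are broken by the value
    itself so the result does not depend on input order.
    """
    counts = Counter(
        line.rstrip("\n").split("\t")[0]
        for line in lines
        if line.strip()  # skip blank lines
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

On the shell side, `sort -rn` compares the counts numerically, which is probably safer than plain `sort -r` once a count reaches two digits.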

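Chapter 3: Regular expressions

There is a file, jawiki-country.json.gz, in which Wikipedia articles have been written out in the following format.

Each line stores the information for one article in JSON format. In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format. The whole file is compressed with gzip. Write programs that perform the following processing.

Note: I caused so many errors with "jawiki-country.json" that at some point it had turned into a swap file, so below I work with "jawiki-countrys.json" instead.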
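
The format described above (one JSON object per line, gzip-compressed as a whole) can be read directly without unpacking the file first. This is only a sketch in Python 3 (not the Python 2.7 used in this post); the function name `load_article_text` is my own, and it assumes the file is UTF-8 encoded.

```python
import gzip
import json


def load_article_text(path, title):
    """Return the "text" of the first article whose "title" matches.

    Reads the gzip file line by line, so the whole dump is never held
    in memory. Returns None when no article has the given title.
    """
    # "rt" opens the compressed stream in text mode (assumed UTF-8).
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            if article["title"] == title:
                return article["text"]
    return None
```

For example, `load_article_text("jawiki-country.json.gz", "イギリス")` would return the body of that article if it is present, which could then be saved to a text file for the exercises below.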

23. Section structure

Display the section names contained in the article and their levels (for example, 1 for "== Section name ==").

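#coding: UTF-8

import re

with open("jawiki-uks.txt","r") as f:
  for line in f:
    # A heading is a run of two or more "=", the name, then the same run again.
    matchtext = re.match(r"(?P<marks>={2,})(?P<section>.+?)(?P=marks)\s*$",line)
    if matchtext is not None:
      print "Level:"
      # "==" is level 1, "===" is level 2, and so on.
      print len(matchtext.group("marks")) - 1
      print "SectionName:"
      print line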

Output:
haruka@ubuntu:~/NLP100$ python nlp23.py
Level: 1 SectionName: ==国名==

Level: 1 SectionName: ==歴史==

Level: 1 SectionName: ==地理==

Level: 2 SectionName: ===気候===

Level: 1 SectionName: ==政治==

Level: 1 SectionName: ==外交と軍事==

Level: 1 SectionName: ==地方行政区分==

Level: 2 SectionName: ===主要都市===

Level: 1 SectionName: ==科学技術==

Level: 1 SectionName: ==経済==

Level: 2 SectionName: ===鉱業===

Level: 2 SectionName: ===農業===

Level: 2 SectionName: ===貿易===

Level: 2 SectionName: ===通貨===

Level: 2 SectionName: ===企業===

Level: 1 SectionName: ==交通==

Level: 2 SectionName: ===道路===

Level: 2 SectionName: ===鉄道===

Level: 2 SectionName: ===海運===

Level: 2 SectionName: ===航空===

Level: 1 SectionName: ==通信==

Level: 1 SectionName: ==国民==

Level: 2 SectionName: ===言語===

Level: 2 SectionName: ===宗教===

Level: 2 SectionName: === 婚姻 ===

Level: 2 SectionName: ===教育===

Level: 1 SectionName: ==文化==

Level: 2 SectionName: ===食文化===

Level: 2 SectionName: ===文学===

Level: 2 SectionName: === 哲学 ===

Level: 2 SectionName: ===音楽===

Level: 3 SectionName: ====イギリスのポピュラー音楽====

Level: 2 SectionName: ===映画===

Level: 2 SectionName: ===コメディ===

Level: 2 SectionName: ===国花===

Level: 2 SectionName: ===世界遺産===

Level: 2 SectionName: ===祝祭日===

Level: 1 SectionName: ==スポーツ==

Level: 2 SectionName: ===サッカー===

Level: 2 SectionName: ===競馬===

Level: 2 SectionName: ===モータースポーツ===

Level: 1 SectionName: ==脚注==

Level: 1 SectionName: ==関連項目==

Level: 1 SectionName: ==外部リンク==
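
The output above still shows the "=" markers around each name. As a sketch of one way to return the bare section name together with its level (Python 3, not the Python 2.7 used in this post; `SECTION_RE` and `parse_sections` are names I made up for illustration), the heading can be matched with a backreference so the closing run of "=" has to equal the opening run:

```python
import re

# Opening run of 2+ "=", optional spaces, the name, optional spaces,
# then the same run of "=" again (\1) and nothing else on the line.
SECTION_RE = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")


def parse_sections(text):
    """Return a list of (section_name, level) pairs found in text.

    The level is the number of "=" on one side minus one, so
    "== name ==" is level 1 and "=== name ===" is level 2.
    """
    sections = []
    for line in text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            sections.append((match.group(2), len(match.group(1)) - 1))
    return sections
```

Lines such as "|略名 = イギリス" are not picked up because they do not begin with "=", and headings written with inner spaces such as "=== 婚姻 ===" come out with the spaces trimmed.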

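25. Template extraction

Extract the field names and values of the "基礎情報" (basic information) template contained in the article, and store them as a dictionary object.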

#coding: UTF-8
import re

dic = {}

with open("jawiki-uks.txt","r") as f:
  for line in f:
    # Field lines look like "|name = value". Values that continue onto the
    # following lines are not handled yet.
    searchtext = re.search(r"^\|(?P<fieldname>.+?)\s*=\s*(?P<value>.*)",line)
    if searchtext:
      dic[searchtext.group("fieldname")] = searchtext.group("value")

# Print the dictionary once, after every line has been read.
print "\n".join("%s: %s" % i for i in dic.items())
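
The script above looks at every line of the article and keeps only the first line of each value. Below is a rough sketch of a stricter version in Python 3 (not the Python 2.7 used in this post). It assumes the template starts on a line beginning with "{{基礎情報", ends at the first line beginning with "}}", and that each field starts on a line beginning with "|". Those are my assumptions about the markup, and `FIELD_RE` and `parse_basic_info` are hypothetical names.

```python
import re

# "|name = value": the name is matched lazily so it stops at the first "=".
FIELD_RE = re.compile(r"^\|\s*(.+?)\s*=\s*(.*)$")


def parse_basic_info(text):
    """Return a dict of field name -> value from the 基礎情報 template.

    Lines inside the template that do not start a new field are
    appended to the previous field's value, joined with newlines.
    """
    fields = {}
    in_template = False
    current = None
    for line in text.splitlines():
        if not in_template:
            if line.startswith("{{基礎情報"):
                in_template = True
            continue
        if line.startswith("}}"):
            break  # assumed end of the template
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current is not None:
            fields[current] += "\n" + line
    return fields
```

One known weakness: if a value contains a nested template whose closing "}}" sits at the start of its own line, this sketch stops too early. Counting the braces would be needed to handle that case properly.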

Today's summary

Solved today: 16, 19, 23. Today's progress: none.

Links for 言語処理100本ノック 2015:
言語処理100本ノック 2015, Day 1
言語処理100本ノック 2015, Day 2
言語処理100本ノック 2015, Day 3
言語処理100本ノック 2015, Day 4
言語処理100本ノック 2015, Day 5


言語処理100本ノック 2015, Day 6