Data Engineering

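First semester intensive; day and period: other. Credit(s): 1. Departments: Computer and Mathematical Sciences, System Information Sciences, Human-Social Information Sciences, Applied Information Sciences. Term: First semester intensive. Enrollment year: 2024. Language: English. Japanese materials might be provided.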

Academic Year

2024

Class Subject

Data Engineering

Course objectives, overview, and methods of achievement

This course covers data management and engineering techniques using the pandas and NumPy libraries. Students learn fundamental operations such as loading, cleaning, transforming, and merging data, with the aim of acquiring practical skills in data science and engineering. Each session consists of exercises and assignments based on real datasets, balancing theory and practice.

Goal of Study

Students will understand the basic techniques of data engineering and acquire practical problem-solving skills. They will also learn efficient methods for data preprocessing and analysis in data science projects.

Contents and progress schedule of the class

This course consists of five 3-hour sessions. The exercises use Python's pandas and NumPy libraries to teach the basics of data management and engineering through hands-on practice. Each session covers a different topic, so that students build the skills needed for more advanced topics in data science and engineering.

Session 1: Introduction to Pandas and Data Structures
Introduction to Pandas: Overview of pandas and its role in data science and machine learning.
Series and DataFrame: Understanding the basic data structures in pandas, including creation and basic operations.
Reading Data: How to read data from various sources (CSV, Excel, SQL databases) into DataFrames.
Data Inspection: Methods like head(), tail(), info(), and describe() to explore and inspect data.
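
As an illustration of the Session 1 topics, the following is a minimal sketch rather than course material. The toy item/price/qty data, the `items` table name, and all variable names are made up for this example. CSV text is read from an in-memory buffer and SQL from an in-memory SQLite database so that the snippet runs without external files; `pd.read_excel()` follows the same pattern but needs an actual workbook, so it is only mentioned in a comment.

```python
import io
import sqlite3

import pandas as pd

# A Series is a labelled one-dimensional array (toy data).
prices = pd.Series([1.5, 0.25, 4.0],
                   index=["apple", "banana", "cherry"], name="price")

# A DataFrame is a table whose columns share one index.
inventory = pd.DataFrame({"item": ["apple", "banana"], "qty": [10, 24]})

# Reading CSV: any file path or file-like object works; here an in-memory buffer.
csv_text = "item,price,qty\napple,1.5,10\nbanana,0.25,24\ncherry,4.0,3\n"
df = pd.read_csv(io.StringIO(csv_text))

# Reading SQL: write the table to an in-memory SQLite database, then query it.
conn = sqlite3.connect(":memory:")
df.to_sql("items", conn, index=False)
from_sql = pd.read_sql("SELECT item, qty FROM items WHERE qty > 5", conn)
conn.close()

# Reading Excel would be pd.read_excel("some_file.xlsx") (hypothetical path).

# Inspection helpers.
print(df.head(2))      # first two rows
print(df.tail(1))      # last row
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics of the numeric columns
```

Using `io.StringIO` and `sqlite3` keeps the example self-contained; in the exercises the same calls would point at real files or database connections.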

Session 2: Data Cleaning and Preparation
Handling Missing Data: Techniques to deal with missing data using methods like isnull(), notnull(), dropna(), and fillna().
Data Filtering: Using conditions to filter rows/columns and using query() for complex filtering.
Type Conversion: Changing data types with astype() for proper data manipulation and analysis.
Indexing and Selection: Advanced indexing options like .loc[], .iloc[], and boolean indexing.
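
The Session 2 operations could look roughly like the sketch below. The small table and its column names (`name`, `score`, `group`, `year`) are hypothetical, and the choice to fill missing scores with the column mean is only one possible strategy, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with missing values in two columns.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [82.0, np.nan, 67.0, 91.0],
    "group": ["x", "y", None, "x"],
    "year": ["2021", "2022", "2022", "2023"],  # stored as text on purpose
})

# Missing data: count, mask, drop, or fill.
missing_per_column = df.isnull().sum()
has_score = df["score"].notnull()
complete_rows = df.dropna()
filled = df.fillna({"score": df["score"].mean(), "group": "unknown"})

# Type conversion: text years become integers.
filled["year"] = filled["year"].astype(int)

# Filtering with a boolean condition.
high = filled[filled["score"] > 80]

# Filtering with query(); @min_year refers to the Python variable.
min_year = 2022
recent = filled.query("year >= @min_year and score < 90")

# Label-based (.loc) and position-based (.iloc) selection.
first_score = filled.loc[0, "score"]
last_name = filled.iloc[-1, 0]
group_x = filled.loc[filled["group"] == "x", ["name", "score"]]

print(missing_per_column)
print(recent)
```

Note that `fillna()` and `dropna()` return new objects here, so the original `df` keeps its missing values.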

Session 3: Data Transformation
Column Operations: Adding, removing, and modifying DataFrame columns.
Apply Functions: Using apply(), map(), and applymap() for element-wise and column- or row-wise operations.
Grouping and Aggregation: Using groupby() for grouping data and performing aggregate functions like sum(), mean(), count().
Pivot Tables and Crosstabs: Creating pivot tables with pivot_table() and cross-tabulations with crosstab() for data summarization.
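
A possible walk-through of the Session 3 topics is sketched below on a made-up `sales` table; all column and variable names are assumptions for this example. Because `DataFrame.applymap()` has been deprecated since pandas 2.1 in favour of `DataFrame.map()`, the element-wise step here applies `Series.map()` to each column instead, which behaves the same across versions.

```python
import pandas as pd

# Toy sales records.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "product": ["pen", "ink", "pen", "ink", "pen"],
    "units": [10, 4, 7, 3, 5],
    "unit_price": [1.5, 6.0, 1.5, 6.0, 1.5],
})

# Column operations: add a derived column, then remove one.
sales["revenue"] = sales["units"] * sales["unit_price"]
sales = sales.drop(columns=["unit_price"])

# Series.map(): element-wise on one column.
sales["size"] = sales["units"].map(lambda u: "bulk" if u >= 7 else "small")

# DataFrame.apply(axis=1): one call per row.
sales["label"] = sales.apply(
    lambda row: f"{row['region']}-{row['product']}", axis=1)

# DataFrame.apply() (default axis=0): one call per column.
numeric = sales[["units", "revenue"]]
spread = numeric.apply(lambda col: col.max() - col.min())

# Element-wise over a whole DataFrame, written as a per-column map.
doubled = numeric.apply(lambda col: col.map(lambda v: v * 2))

# Grouping and aggregation.
totals = sales.groupby("region")["revenue"].sum()
summary = sales.groupby("region")["units"].agg(["sum", "mean", "count"])

# Pivot table (aggregated values) and crosstab (frequency counts).
pivot = pd.pivot_table(sales, values="units", index="region",
                       columns="product", aggfunc="sum")
counts = pd.crosstab(sales["region"], sales["product"])

print(summary)
print(pivot)
```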

Session 4: Data Merging and Reshaping
Concatenation and Append: Combining multiple DataFrames vertically and horizontally using concat() and append().
Merge and Join: Database-style merging and joining of DataFrames using merge() and join().
Reshaping Data: Techniques for reshaping and pivoting DataFrames using melt(), pivot(), and stack()/unstack() methods.
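
The combining and reshaping operations of Session 4 might be exercised as in the following sketch, which uses invented tables (`q1`, `q2`, `customers`, `wide`). One caveat: `DataFrame.append()` was removed in pandas 2.0, so the example relies on `pd.concat()` for both directions, which also works on older versions.

```python
import pandas as pd

q1 = pd.DataFrame({"id": [1, 2], "sales": [100, 150]})
q2 = pd.DataFrame({"id": [3, 4], "sales": [120, 90]})

# Vertical concatenation: stack rows and renumber the index.
stacked = pd.concat([q1, q2], ignore_index=True)

# Horizontal concatenation: align on the index, place columns side by side.
side_by_side = pd.concat([q1, q2.add_suffix("_q2")], axis=1)

# Database-style merge on a key column.
customers = pd.DataFrame({"id": [1, 2, 3, 5],
                          "name": ["Ann", "Bo", "Cy", "Di"]})
inner = stacked.merge(customers, on="id", how="inner")  # matching ids only
left = stacked.merge(customers, on="id", how="left")    # keep every sale

# join() combines on the index (a left join by default).
joined = stacked.set_index("id").join(customers.set_index("id"))

# Reshaping: wide -> long with melt(), long -> wide with pivot().
wide = pd.DataFrame({"id": [1, 2], "jan": [5, 7], "feb": [6, 8]})
long = wide.melt(id_vars="id", var_name="month", value_name="value")
back = long.pivot(index="id", columns="month", values="value")

# stack() moves column labels into the index; unstack() reverses it.
as_series = back.stack()
round_trip = as_series.unstack()

print(left)
print(long)
```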

Session 5: Time Series and Advanced Topics
Time Series Data: Handling time series data in pandas, resampling techniques, and time-based indexing.
Categorical Data: Managing categorical data, using category type for memory optimization, and categorical manipulation.
Text Data: Basic text data manipulation techniques, including string methods and regular expressions.
Advanced Data Filtering: Using query() for complex data filtering scenarios and conditional logic.
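
For the Session 5 topics, a short sketch under stated assumptions follows. The daily series, the size labels, the name and code strings, and the `orders` table are all invented for illustration, and the regular expressions are examples rather than patterns prescribed by the course.

```python
import pandas as pd

# Time series: a daily index, resampling, and date-based slicing.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series(range(1, 7), index=idx)
two_day_totals = ts.resample("2D").sum()
window = ts.loc["2024-01-02":"2024-01-03"]  # both end points are included

# Categorical data: repeated labels are stored as small integer codes.
sizes = pd.Series(["s", "m", "l", "m", "s"] * 1000)
as_cat = sizes.astype("category")
saves_memory = as_cat.memory_usage(deep=True) < sizes.memory_usage(deep=True)

# An ordered categorical supports comparisons such as "at least medium".
size_type = pd.CategoricalDtype(["s", "m", "l"], ordered=True)
ordered = sizes.astype(size_type)
at_least_m = int((ordered >= "m").sum())

# Text data: vectorised string methods and regular expressions.
names = pd.Series(["  Alice Smith ", "bob JONES", "Carol  Lee"])
clean = (names.str.strip()
              .str.replace(r"\s+", " ", regex=True)
              .str.title())
codes = pd.Series(["AB-123", "cd-45", "EF-6789"])
parts = codes.str.extract(r"([A-Za-z]+)-(\d+)")
valid = codes.str.contains(r"^\w{2}-\d{3,}$")

# query() with combined conditions and a Python list referenced via @.
orders = pd.DataFrame({"item": ["pen", "ink", "pad", "cap"],
                       "qty": [10, 2, 7, 1],
                       "price": [1.5, 6.0, 3.0, 0.5]})
wanted = ["cap"]
picked = orders.query("(qty > 5 and price < 2) or item in @wanted")

print(two_day_totals)
print(clean.tolist())
print(picked)
```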

Evaluation Method

In-class exercises 50%, assignments after each session 50%.

Textbook and references

Study outside class

Review each session's materials and complete the assignments.

Office hours

There are no scheduled office hours for this course. For questions or clarifications about the course, please email samy.baladram@tohoku.ac.jp.

In addition

Please bring your laptop for the practice sessions. If you need a PC, let us know in advance.