CIT 594 Module 7 Programming Assignment: CSV Slicer

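Processing CSV files sounds like a dull programming chore, but the format hides more subtlety than it first appears. You have to work like a detective, scanning character by character and working out the logic behind each comma and quote. A file may contain a comma inside double quotes, for instance, and that comma is not a field separator; it is part of the data. It is like reading a novel dense with punctuation: lose track of the context and you will be misled.

So why go to the trouble? CSV has been around for decades; why not use a more efficient format? The answer is simple: it is both easy for people and easy for programs. Nearly every application can handle CSV files, and a person can read and edit them directly. It is like taking notes with pen and paper: slower than typing, but hard to beat for flexibility and universality.

CSV has its quirks, chief among them the lack of a single unified specification. RFC 4180 is widely followed, but in practice it is applied flexibly, and some of its rules are often relaxed. That means you must be ready for surprises such as empty fields and content that spans multiple lines. It is like giving an improvised speech: you adjust as you go, depending on how the audience reacts.

Handling CSV files is therefore as much craft as technique. Like a screenwriter, you script what happens for every line of data and make sure each character plays its part. Along the way you sharpen both your programming skills and your logical thinking. The next time you meet a CSV file, do not rush to complain; treat it as an improvised performance and enjoy it.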

CIT 594 Module 7 Programming Assignment

CSV Slicer

In this assignment you will read files in a format known as “comma separated values” (CSV), interpret the formatting and output the content in the structure represented by the file.

Learning Objectives

In completing this assignment, you will:

  • Implement a method to extract content from an input CSV
  • Read and understand a formal specification for the CSV format
  • Use a “state machine”

Background and Getting Started

Applications have long used delimited text files to store and transmit tables. The simplicity of the format means these files are human readable and editable as well as relatively easy to support in code. The breadth of support from nearly any general purpose application or tabular data loading library continues to make these files particularly popular for publishing data despite their limitations and inefficiencies. You will see more of this in later assignments where you will be required to support reading publicly available datasets.

One of the more common choices of delimiters is commas to separate fields within a row of the table, and line breaks to separate rows of the table. The common name for these comma-based file formats is “comma separated values” (CSV).

Supporting field values that include delimiter characters (e.g., a comma that is part of the data, not a field separator) requires some added complexity. There is no single, universally accepted specification for CSV files, so for this assignment we will focus on one specific, widely accepted format: RFC 4180.
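To see why the added complexity is needed, here is a minimal sketch of the naive approach that splits on every comma. The class and method names are hypothetical and exist only for illustration; the point is that a comma inside a quoted value is wrongly treated as a separator.

```java
/** Hypothetical demo class; not part of the assignment. */
final class NaiveSplitDemo {
    /**
     * Splits on every comma and ignores quoting entirely.
     * The limit of -1 keeps trailing empty strings.
     */
    static String[] naiveSplit(String line) {
        return line.split(",", -1);
    }
}
```

Calling `NaiveSplitDemo.naiveSplit("\"Smith, John\",42")` yields three pieces, with the quoted name cut in two and the quote characters left in place, even though the line is meant to hold two fields.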

To get started, carefully read sections 1 and 2 of RFC 4180, “Common Format and MIME Type for Comma-Separated Values (CSV) Files”.

That document uses a precise syntax to express an exact formal specification. If you are unsure of how to interpret the ABNF grammar, you can reference RFC 2234.

CSV Format For This Assignment

You will write a method to process CSV files character-by-character. The exact format is a relaxation of RFC 4180.

Adjustments and clarifications of the descriptive rules:

  1. Formatting characters describe structure; they are not part of the content. For instance, the comma that separates two fields should not be included in either field. For example (easy3.csv):

"example of using "" in a field",1

Should result in a single row that is the equivalent of an array constructed with the Java expression:

new String[]{ "example of using \" in a field", "1" }

  2. The same applies to the escape character in an escaped field. Two double quotes in the middle of an escaped field count as a single double quote character in the content.
  3. You may ignore rule number 4 (for this assignment). Instead, follow these rules:
    1. No special treatment should be applied to the first row. You should not treat a header any differently than a record.
    2. You will not need to check if all rows have the same number of fields.
    3. Commas at the end of a line signify empty fields. For example, "a,b,c," results in a row with four fields: ["a", "b", "c", ""] [1]
    4. An empty line terminated by a line break is a valid row if it’s outside an escaped field. Inside an escaped field, it is just part of the content of that field.
  4. Clarification of rule 2: There is no additional record if the file ends at the start of a line. That includes, but is not limited to, the example in rule 1 where the file ends with CRLF EOF.
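The rules above can be sketched as a small state machine. This is one possible reading, not the assignment's reference solution: the class name `CsvSketch`, the method `parse`, the state names, the use of `IllegalArgumentException` for malformed input, and the choice to return an empty line as a row holding one empty field are all assumptions made for illustration. It takes the whole input as a `String` for brevity, whereas the assignment asks for character-by-character reading.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; names and error handling are illustrative assumptions. */
final class CsvSketch {
    private enum State { ROW_START, FIELD_START, UNQUOTED, QUOTED, QUOTE_SEEN, AFTER_CR }

    static List<List<String>> parse(String input) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        State state = State.ROW_START;

        for (int i = 0; i <= input.length(); i++) {
            int c = i < input.length() ? input.charAt(i) : -1; // -1 marks end of input
            boolean endRow = false;
            switch (state) {
                case QUOTED: // inside an escaped field: everything but a quote is content
                    if (c == -1) throw new IllegalArgumentException("unterminated escaped field");
                    if (c == '"') state = State.QUOTE_SEEN;
                    else field.append((char) c);
                    break;
                case AFTER_CR: // CRLF = [CR] LF, so a CR must be followed by LF
                    if (c != '\n') throw new IllegalArgumentException("CR not followed by LF");
                    endRow = true;
                    break;
                default: // ROW_START, FIELD_START, UNQUOTED, QUOTE_SEEN
                    if (c == -1) {
                        // No extra record if the file ends at the start of a line.
                        endRow = state != State.ROW_START;
                    } else if (c == ',') {
                        row.add(field.toString());
                        field.setLength(0);
                        state = State.FIELD_START;
                    } else if (c == '\n') {
                        endRow = true;
                    } else if (c == '\r') {
                        state = State.AFTER_CR;
                    } else if (c == '"') {
                        if (state == State.UNQUOTED)
                            throw new IllegalArgumentException("quote inside non-escaped field");
                        // Two quotes in a row inside an escaped field mean one literal quote.
                        if (state == State.QUOTE_SEEN) field.append('"');
                        state = State.QUOTED;
                    } else {
                        if (state == State.QUOTE_SEEN)
                            throw new IllegalArgumentException("text after closing quote");
                        field.append((char) c);
                        state = State.UNQUOTED;
                    }
            }
            if (endRow) { // close the current field and row, then start a fresh row
                row.add(field.toString());
                field.setLength(0);
                rows.add(row);
                row = new ArrayList<>();
                state = State.ROW_START;
            }
        }
        return rows;
    }
}
```

Separating `ROW_START` from `FIELD_START` is what lets the sketch distinguish a file that ends right after a line break (no extra record) from one that ends right after a comma (one more empty field).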

Adjustments to the grammar:

CRLF = [CR] LF

TEXTDATA = %x00-09 / %x0B-0C / %x0E-21 / %x23-2B / %x2D-7F

The first rule change makes the carriage return (CR) optional everywhere CRLF is used. This is a relaxation seen in many places to adjust for the fact that many systems and applications choose to omit the carriage return and only use a single line feed character for line breaks.

The second rule change expands TEXTDATA [2] to all characters that are not a comma, double quote, carriage return, or line feed.

Note: these are relaxations, therefore all documents that are valid for the strict interpretation of RFC 4180 are valid for this format.
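As a sanity check on the adjusted TEXTDATA rule, the following sketch compares the hex ranges against the prose description for every code point up to x7F, and confirms that the strict RFC 4180 TEXTDATA (%x20-21 / %x23-2B / %x2D-7E) is a subset of the relaxed one. The class and method names are hypothetical.

```java
/** Hypothetical helper; checks the grammar ranges against the prose description. */
final class TextDataCheck {
    /** Relaxed TEXTDATA: x00-09 / x0B-0C / x0E-21 / x23-2B / x2D-7F. */
    static boolean relaxed(int c) {
        return (c >= 0x00 && c <= 0x09) || (c >= 0x0B && c <= 0x0C)
            || (c >= 0x0E && c <= 0x21) || (c >= 0x23 && c <= 0x2B)
            || (c >= 0x2D && c <= 0x7F);
    }

    /** Strict TEXTDATA from RFC 4180: x20-21 / x23-2B / x2D-7E. */
    static boolean strict(int c) {
        return (c >= 0x20 && c <= 0x21) || (c >= 0x23 && c <= 0x2B)
            || (c >= 0x2D && c <= 0x7E);
    }

    /** Prose description: not a comma, double quote, carriage return, or line feed. */
    static boolean notSpecial(int c) {
        return c != ',' && c != '"' && c != '\r' && c != '\n';
    }

    /** True if the relaxed ranges and the prose agree on every code point x00-x7F. */
    static boolean relaxedMatchesProse() {
        for (int c = 0x00; c <= 0x7F; c++) {
            if (relaxed(c) != notSpecial(c)) return false;
        }
        return true;
    }

    /** True if every strictly valid TEXTDATA character is also valid when relaxed. */
    static boolean strictImpliesRelaxed() {
        for (int c = 0x00; c <= 0x7F; c++) {
            if (strict(c) && !relaxed(c)) return false;
        }
        return true;
    }
}
```

Both checks return true: the only code points the ranges skip are x0A, x0D, x22, and x2C, which are exactly LF, CR, double quote, and comma.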

For the purposes of the assignment there are five classes of characters for you to consider:

| Common Name | RFC Name | RFC Code (hex) | Decimal | Java Character |
|---|---|---|---|---|
| Line Feed | LF | x0A | 10 | \n |
| Carriage Return | CR | x0D | 13 | \r |
| Double Quote | DQUOTE | x22 | 34 | " |
| Comma | COMMA | x2C | 44 | , |
| Anything Else | TEXTDATA | | | [^\r\n,"] [3] |
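One way to carry the five character classes into code is a small enum with a classifier method. This is a sketch only; the enum name `CharClass` and the method `of` are hypothetical, and the constants simply reuse the RFC names.

```java
/** Hypothetical enum mirroring the five character classes by their RFC names. */
enum CharClass {
    LF, CR, DQUOTE, COMMA, TEXTDATA;

    /** Classifies a single character code. */
    static CharClass of(int c) {
        switch (c) {
            case '\n': return LF;       // x0A, decimal 10
            case '\r': return CR;       // x0D, decimal 13
            case '"':  return DQUOTE;   // x22, decimal 34
            case ',':  return COMMA;    // x2C, decimal 44
            default:   return TEXTDATA; // anything else
        }
    }
}
```

A state machine can then switch on `CharClass.of(c)` instead of comparing against raw character literals in every state.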

 

[1] The line in rule 4 of the RFC stating that "The last field in the record must not be followed by a comma" refers to its rule that each line should contain the same number of fields. A comma at the end of a line would insert an additional, empty field. The statement indicates that an extra comma that would increase the number of fields beyond the expected limit should not be dropped or ignored.

[2] We will not test with characters above %x7F (127), but you are welcome to include %xFF 7FFFFFFF (Integer.MAX_VALUE) in TEXTD