How to Read Japanese Text Files in Python 3
Introduction
If you accept been office of the data science (or any information!) industry, you would know the claiming of working with different data types. Different formats, different compression, different parsing on different systems – y'all could be speedily pulling your hair! Oh and I have not talked nigh the unstructured information or semi-structured information nonetheless.
For any information scientist or data engineer, dealing with dissimilar formats can become a wearisome task. In real-world, people rarely get dandy tabular data. Thus, it is mandatory for any information scientist (or a data engineer) to exist aware of different file formats, common challenges in handling them and the best / efficient ways to handle this data in existent life.
This article provides common formats a data scientist or a data engineer must exist aware of. I will kickoff introduce yous to different mutual file formats used in the industry. Later, we'll run into how to read these file formats in Python.
P.South. In rest of this article, I will exist referring to a information scientist, only the same applies to a data engineer or any data science professional.
Table of Contents
- What is a file format?
- Why should a data scientist empathise different file formats?
- Dissimilar file formats and how to read them in Python?
- Comma-separated values
- XLSX
- ZIP
- Plain Text (txt)
- JSON
- XML
- HTML
- Images
- Hierarchical Information Format
- DOCX
- MP3
- MP4
1. What is a file format?
A file format is a standard way in which information is encoded for storage in a file. Starting time, the file format specifies whether the file is a binary or ASCII file. Second, information technology shows how the information is organized. For example, comma-separated values (CSV) file format stores tabular data in plain text.
To identify a file format, y'all can usually look at the file extension to get an idea. For example, a file saved with proper name "Data" in "CSV" format will appear as "Data.csv". By noticing ".csv" extension we can clearly identify that it is a "CSV" file and data is stored in a tabular format.
ii. Why should a information scientist understand unlike file formats?
Commonly, the files you will come across will depend on the application y'all are building. For case, in an image processing arrangement, you need image files as input and output. And then you will mostly see files in jpeg, gif or png format.
Every bit a data scientist, y'all demand to empathise the underlying construction of various file formats, their advantages and dis-advantages. Unless yous empathise the underlying construction of the data, you volition non exist able to explore it. Also, at times you need to make decisions about how to store data.
Choosing the optimal file format for storing information can ameliorate the performance of your models in data processing.
Now, we will await at the following file formats and how to read them in Python:
- Comma-separated values
- XLSX
- ZIP
- Apparently Text (txt)
- JSON
- XML
- HTML
- Images
- Hierarchical Information Format
- DOCX
- MP3
- MP4
iii. Different file formats and how to read them in Python
3.i Comma-separated values
Comma-separated values file format falls under spreadsheet file format.
What is Spreadsheet File Format?
In spreadsheet file format, data is stored in cells. Each cell is organized in rows and columns. A column in the spreadsheet file tin have unlike types. For example, a cavalcade tin exist of string type, a date type or an integer type. Some of the near popular spreadsheet file formats are Comma Separated Values ( CSV ), Microsoft Excel Spreadsheet ( xls ) and Microsoft Excel Open XML Spreadsheet ( xlsx ).
Each line in CSV file represents an ascertainment or commonly called a record. Each tape may contain one or more than fields which are separated by a comma.
Sometimes y'all may come beyond files where fields are non separated past using a comma merely they are separated using tab. This file format is known every bit TSV (Tab Separated Values) file format.
The beneath image shows a CSV file which is opened in Notepad.
Reading the data from CSV in Python
Let us look at how to read a CSV file in Python. For loading the data you tin can apply the "pandas" library in python.
import pandas as pd
df = pd.read_csv("/habitation/Loan_Prediction/railroad train.csv")
Above code will load the railroad train.csv file in DataFrame df.
3.2 XLSX files
XLSX is a Microsoft Excel Open up XML file format. It also comes under the Spreadsheet file format. It is an XML-based file format created past Microsoft Excel. The XLSX format was introduced with Microsoft Office 2007.
In XLSX data is organized nether the cells and columns in a sail. Each XLSX file may contain one or more sheets. And then a workbook can contain multiple sheets.
The below prototype shows a "xlsx" file which is opened in Microsoft Excel.
In above image, you tin can see that at that place are multiple sheets nowadays (lesser left) in this file, which are Customers, Employees, Invoice, Gild. The image shows the information of but one sheet – "Invoice".
Reading the data from XLSX file
Let'southward load the data from XLSX file and define the sheet proper noun. For loading the data you lot tin use the Pandas library in python.
import pandas as pd
df = pd.read_excel("/home/Loan_Prediction/railroad train.xlsx", sheetname = "Invoice")
Above code will load the canvass "Invoice" from "train.xlsx" file in DataFrame df.
3.3 Naught files
Aught format is an archive file format.
What is Archive File format?
In Archive file format, you create a file that contains multiple files along with metadata. An archive file format is used to collect multiple data files together into a unmarried file. This is washed for but compressing the files to use less storage infinite.
There are many popular computer data archive format for creating archive files. Zip, RAR and Tar existence the most popular archive file format for compressing the data.
So, a Aught file format is a lossless compression format, which means that if yous shrink the multiple files using ZIP format you can fully recover the information after decompressing the Cypher file. ZIP file format uses many compression algorithms for compressing the documents. You lot can easily identify a Null file by the .zilch extension.
Reading a .Aught file in Python
You lot tin read a aught file by importing the "zipfile" parcel. Beneath is the python lawmaking which tin read the "train.csv" file that is within the "T.zip".
import zipfile archive = zipfile.ZipFile('T.nada', 'r') df = archive.read('train.csv')
Here, I take discussed one of the famous archive format and how to open information technology in python. I am not mentioning other archive formats. If you want to read about different annal formats and their comparisons y'all can refer this link.
three.4 Apparently Text (txt) file format
In Apparently Text file format, everything is written in plainly text. Usually, this text is in unstructured form and in that location is no meta-data associated with information technology. The txt file format can easily exist read by whatever program. But interpreting this is very difficult by a computer program.
Let's take a uncomplicated example of a text File.
The following example shows text file data that incorporate text:
"In my previous article, I introduced you to the basics of Apache Spark, different data representations (RDD / DataFrame / Dataset) and nuts of operations (Transformation and Activeness). We even solved a motorcar learning problem from one of our past hackathons. In this commodity, I will go on from the place I left in my previous article. I will focus on manipulating RDD in PySpark by applying operations (Transformation and Actions)."
Suppose the above text written in a file called text.txt and you want to read this and then yous can refer the beneath lawmaking.
text_file = open("text.txt", "r") lines = text_file.read()
3.5 JSON file format
JavaScript Object Notation(JSON) is a text-based open standard designed for exchanging the data over web. JSON format is used for transmitting structured data over the web. The JSON file format can be hands read in any programming language because it is language-contained information format.
Permit's take an example of a JSON file
The post-obit example shows how a typical JSON file stores information of employees.
{ "Employee": [ { "id":"1", "Proper noun": "Ankit", "Sal": "1000", }, { "id":"2", "Name": "Faizy", "Sal": "2000", } ] }
Reading a JSON file
Let'south load the data from JSON file. For loading the data you can use the pandas library in python.
import pandas as pd
df = pd.read_json("/abode/kunal/Downloads/Loan_Prediction/train.json")
3.six XML file format
XML is also known as Extensible Markup Linguistic communication. As the proper name suggests, it is a markup linguistic communication. It has certain rules for encoding data. XML file format is a human-readable and car-readable file format. XML is a self-descriptive language designed for sending information over the net. XML is very similar to HTML, but has some differences. For example, XML does not use predefined tags every bit HTML.
Let's have the uncomplicated example of XML File format.
The following example shows an xml document that contains the information of an employee.
<?xml version="one.0"?> <contact-info> <name>Ankit</proper name> <visitor>Anlytics Vidhya</visitor> <telephone>+9187654321</phone> </contact-info>
The "<?xml version="1.0″?>" is a XML declaration at the start of the file (it is optional). In this deceleration, 5ersion specifies the XML version and encoding specifies the character encoding used in the document. <contact-info> is a tag in this document. Each XML-tag needs to be closed.
Reading XML in python
For reading the data from XML file you can import xml.etree. ElementTree library.
Let's import an xml file called railroad train and print its root tag.
import xml.etree.ElementTree every bit ET tree = ET.parse('/home/sunilray/Desktop/2 sigma/train.xml') root = tree.getroot() print root.tag
3.vii HTML files
HTML stands for Hyper Text Markup Linguistic communication. It is the standard markup language which is used for creating Web pages. HTML is used to describe construction of spider web pages using markup. HTML tags are same as XML merely these are predefined. You can easily identify HTML document subsection on basis of tags such as <head> represent the heading of HTML document. <p> "paragraph" paragraph in HTML. HTML is non case sensitive.
The following example shows an HTML document.
<!DOCTYPE html> <html> <head> <title>Page Title</championship> </head> <body><h1>My First Heading</h1> <p>My first paragraph.</p></body> </html>
Each tag in HTML is enclosed under the athwart bracket(<>). The <!DOCTYPE html> tag defines that certificate is in HTML format. <html> is the root tag of this document. The <head> element contains heading part of this document. The <title>, <body>, <h1>, <p> represent the title, body, heading and paragraph respectively in the HTML document.
Reading the HTML file
For reading the HTML file, you lot tin use BeautifulSoup library. Please refer to this tutorial, which will guide you how to parse HTML documents. Beginner'south guide to Web Scraping in Python (using BeautifulSoup)
3.8 Prototype files
Image files are probably the about fascinating file format used in information science. Any reckoner vision application is based on image processing. So it is necessary to know different image file formats.
Usual prototype files are 3-Dimensional, having RGB values. But, they tin can likewise exist 2-Dimensional (grayscale) or 4-Dimensional (having intensity) – an Image consisting of pixels and meta-data associated with information technology.
Each paradigm consists i or more frames of pixels. And each frame is made upwards of two-dimensional array of pixel values. Pixel values can be of whatever intensity. Meta-data associated with an image, tin can be an image type (.png) or pixel dimensions.
Permit's have the instance of an image past loading it.
from scipy import misc f = misc.face() misc.imsave('face.png', f) # uses the Paradigm module (PIL) import matplotlib.pyplot equally plt plt.imshow(f) plt.bear witness()
At present, let's check the type of this image and its shape.
type(f) , f.shape
numpy.ndarray,(768, 1024, iii)
If you desire to read about image processing you can refer this article. This commodity will teach you epitome processing with an case – Basics of Image Processing in Python
3.9 Hierarchical Data Format (HDF)
In Hierarchical Data Format ( HDF ), you can store a large amount of data easily. Information technology is non but used for storing high volumes or circuitous data but also used for storing small volumes or simple data.
The advantages of using HDF are as mentioned beneath:
- Information technology can exist used in every size and type of organization
- It has flexible, efficient storage and fast I/O.
- Many formats support HDF.
There are multiple HDF formats present. But, HDF5 is the latest version which is designed to address some of the limitations of the older HDF file formats. HDF5 format has some similarity with XML. Like XML, HDF5 files are cocky-describing and let users to specify complex data relationships and dependencies.
Let's take the case of an HDF5 file format which tin can be identified using .h5 extension.
Read the HDF5 file
Yous tin can read the HDF file using pandas. Beneath is the python code can load the train.h5 data into the "t".
t = pd.read_hdf('train.h5')
3.ten PDF file format
PDF (Portable Document Format) is an incredibly useful format used for interpretation and display of text documents along with incorporated graphics. A special feature of a PDF file is that information technology tin can be secured by a password.
Here's an instance of a pdf file.
Reading a PDF file
On the other hand, reading a PDF format through a program is a complex chore. Although there exists a library which do a good job in parsing PDF file, 1 of them is PDFMiner. To read a PDF file through PDFMiner, yous take to:
- Download PDFMiner and install it through the website
- Extract PDF file by the following code
pdf2txt.py <pdf_file>.pdf
3.11 DOCX file format
Microsoft word docx file is some other file format which is regularly used past organizations for text based data. It has many characteristics, like inline addition of tables, images, hyperlinks, etc. which helps in making docx an incredibly important file format.
The advantage of a docx file over a PDF file is that a docx file is editable. You can also change a docx file to whatsoever other format.
Here'southward an instance of a docx file:
Reading a docx file
Like to PDF format, python has a community contributed library to parse a docx file. Information technology is chosen python-docx2txt.
Installing this library is easy through pip past:
pip install docx2txt
To read a docx file in Python utilise the following code:
import docx2txt text = docx2txt.process("file.docx")
3.12 MP3 file format
MP3 file format comes nether the multimedia file formats. Multimedia file formats are similar to image file formats, just they happen to be one the well-nigh complex file formats.
In multimedia file formats, you can store variety of information such as text prototype, graphical, video and audio data. For instance, A multimedia format can allow text to be stored as Rich Text Format (RTF) information rather than ASCII data which is a obviously-text format.
MP3 is one of the most mutual audio coding formats for digital audio. A mp3 file format uses the MPEG-i (Moving Picture show Experts Grouping – 1) encoding format which is a standard for lossy compression of video and sound. In lossy compression, in one case you have compressed the original file, yous cannot recover the original data.
A mp3 file format compresses the quality of audio by filtering out the sound which can not be heard by humans. MP3 compression usually achieves 75 to 95% reduction in size, so it saves a lot of space.
mp3 File Format Structure
A mp3 file is made up of several frames. A frame can be further divided into a header and data block. Nosotros telephone call these sequence of frames an unproblematic stream.
A header in mp3 commonly, identify the beginning of a valid frame and a information blocks incorporate the (compressed) sound information in terms of frequencies and amplitudes. If you desire to know more about mp3 file structure you tin refer this link.
Reading the multimedia files in python
For reading or manipulating the multimedia files in Python you can use a library called PyMedia.
3.13 MP4 file format
MP4 file format is used to store videos and movies. It contains multiple images (called frames), which play in grade of a video as per a specific fourth dimension period. In that location are two methods for interpreting a mp4 file. One is a closed entity, in which the whole video is considered as a single entity. And other is mosaic of images, where each image in the video is considered every bit a different entity and these images are sampled from the video.
Here's is an example of mp4 video
Reading an mp4 file
MP4 also has a customs built library for reading and editing mp4 files, called MoviePy.
You lot can install the library from this link. To read a mp4 video clip, in Python use the following code.
from moviepy.editor import VideoFileClip clip = VideoFileClip('<video_file>.mp4')
You can and so display this in jupyter notebook as below
ipython_display(clip)
End Notes
In this article, I have introduced you to some of the basic file formats, which are used by data scientist on a solar day to day basis. There are many file formats I accept not covered. Good affair is that I don't need to cover all of them in one article.
I hope you institute this article helpful. I would encourage you lot to explore more file formats. Good luck! If you still have any difficulty in understanding a specific data format, I'd like to interact with you in comments. If you lot take whatsoever more doubts or queries experience free to drib in your comments below.
Learn, compete, hack and become hired
Source: https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/
0 Response to "How to Read Japanese Text Files in Python 3"
إرسال تعليق