Difference between revisions of "Python:Loading and Saving Data"

From PrattWiki

Revision as of 16:18, 30 January 2021

This page describes several ways to load and save data in Python, covering methods in both the numpy and pandas modules. Before choosing a method, you need to know exactly what type of file you want to read from or write to, since some methods only work with a limited subset of file types.

For the example code below, assume that the following has already run:

import numpy as np
import pandas as pd

Numpy loadtxt()

The Numpy module has a loadtxt() function that reads a rectangular set of (typically numerical) data from a text file and returns a Numpy array. By default, it splits values on whitespace, so it can read data separated by spaces or tabs. If the file uses another delimiter (e.g. commas), use the delimiter kwarg to specify it (e.g. delimiter=","). Whitespace around the values is still ignored, so if there are spaces after each delimiter you do not need to include them in the delimiter option.

There are other useful kwargs for this function. If your data file has one or more header rows that you want to ignore, you can supply the skiprows=N kwarg; the first N lines of the file are skipped, and comment lines count toward N. If your data file has comments of some kind - whether at the top or on subsequent lines - you can supply the comments kwarg with the string that marks the start of a comment. If the file uses more than one comment marker, you can give the kwarg a list or tuple of strings. Note that the function will ignore the comment marker as well as anything after it on that line. All of that is to say, if you were to have the following text in a file called spaces_comments.txt

This is at the top
and should be ignored
XXX this is a comment
1     2 3
# as is this
4 5   6
   7   8   9
XXX and another one!
10 11 12 $ what about now?

you could load it into an array called a with:

a = np.loadtxt("spaces_comments.txt", skiprows=2, comments=("#", "$", "XXX"))

and the result would be:

In [N]: print(a)
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]]
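
To try this without creating a file on disk, here is a minimal sketch that relies on loadtxt() also accepting a file-like object. The io.StringIO wrapper and the variable name text are assumptions made for illustration; they are not part of the original example.

```python
import io

import numpy as np

# Same contents as spaces_comments.txt, held in a string (illustrative only)
text = """This is at the top
and should be ignored
XXX this is a comment
1     2 3
# as is this
4 5   6
   7   8   9
XXX and another one!
10 11 12 $ what about now?
"""

# Skip the two header lines, then drop anything from a comment marker onward
a = np.loadtxt(io.StringIO(text), skiprows=2, comments=("#", "$", "XXX"))

print(a.shape)  # number of rows and columns that were read
print(a.dtype)  # loadtxt() returns floats unless told otherwise
```

Lines that contain only a comment contribute no row to the array, which is why the result has four rows.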

There are several other options that can be very useful if you want to specify the format of what you are reading (including storing strings) or to automatically split the data set into separate variables, one for each column. For the latter, if you have a data file called commas.txt as follows:

1, 2, 3
4, 5, 6
7, 8, 9
10, 11, 12

you could load each column into its own array with:

import numpy as np
x, y, z = np.loadtxt("commas.txt", delimiter=",", unpack=True)

after which you will have:

In []: x
Out[]: array([ 1.,  4.,  7., 10.])

In []: y
Out[]: array([ 2.,  5.,  8., 11.])

In []: z
Out[]: array([ 3.,  6.,  9., 12.])

Note that the results are 1-dimensional arrays: each one has shape (4,), so they are neither columns nor rows.
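
If you do need an explicit column or row, one option is to reshape the 1-dimensional result. The following is a sketch under the assumption that the data from commas.txt is supplied through io.StringIO instead of a file; the names text, col, and row are illustrative.

```python
import io

import numpy as np

# Same contents as commas.txt, held in a string (illustrative only)
text = "1, 2, 3\n4, 5, 6\n7, 8, 9\n10, 11, 12\n"

x, y, z = np.loadtxt(io.StringIO(text), delimiter=",", unpack=True)

print(x.shape)  # 1-dimensional: a single axis of length 4

# Reshape into explicit 2-dimensional column and row arrays
col = x.reshape(-1, 1)  # 4 rows, 1 column
row = x.reshape(1, -1)  # 1 row, 4 columns

print(col.shape, row.shape)
```

Using -1 in reshape() lets Numpy work out that dimension from the length of the array, so the same lines work for any number of data rows.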