Difference between revisions of "Python:Loading and Saving Data"

From PrattWiki

Revision as of 16:41, 30 January 2021

This page describes several ways to load and save data in Python, using functions from both the NumPy and pandas modules. Before choosing one, you need to know exactly what type of file you want to read from or write to, since some functions only work with a limited subset of file types.

If you are loading from and saving to text files, np.loadtxt() and np.savetxt() may be sufficient. They are fairly straightforward and can be told to work with headers, footers, and comments, as well as different delimiters.

On the other hand, if you are loading from Excel documents, or you want to load a data set into a dataframe, you will want to use methods from pandas. Depending on the form of the data file, you may use pd.read_table(), pd.read_csv(), or pd.read_excel(). Dataframes have their own built-in methods for saving to different file types.

For the code examples below, assume that the following has already run:

import numpy as np
import pandas as pd
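
As a rough illustration of the pandas route mentioned above, the sketch below reads comma-separated text into a DataFrame with pd.read_csv() and writes it back out with the DataFrame's own to_csv() method. The column names and values are made up for the example, and io.StringIO stands in for real files so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "time,voltage\n0.0,1.5\n0.1,1.7\n0.2,2.1\n"

# read_csv() uses the first row as column names by default
df = pd.read_csv(io.StringIO(csv_text))

# DataFrames have their own writers, such as to_csv();
# index=False leaves the row index out of the saved text
out = io.StringIO()
df.to_csv(out, index=False)

print(df)
print(out.getvalue())
```

With a real file, the same calls would take a file name in place of the StringIO objects.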

np.loadtxt()

The NumPy module has a loadtxt() function that reads a rectangular set of (typically numerical) data from a text file and returns a NumPy array. By default, it splits each line on whitespace, so it can read data separated by spaces or tabs. If the file uses another delimiter (e.g. commas), pass the delimiter kwarg to say what it is, for example delimiter=",". Whitespace around numerical values is still ignored, so if there are spaces after the delimiter you do not need to include them in the delimiter string.

There are other useful kwargs for this function. If your data file has one or more header rows that you want to ignore, you can supply the skiprows=N kwarg. If your data file has comments of some kind - whether at the top or on subsequent lines - you can supply the comments kwarg with the string that marks a comment (the default is "#"). If there is more than one comment marker, you can give the kwarg a list or tuple of strings. Note that the function ignores the comment marker as well as anything after it on that line. To put that together, suppose you have the following text in a file called spaces_comments.txt:

This is at the top
and should be ignored
XXX this is a comment
1     2 3
# as is this
4 5   6
   7   8   9
XXX and another one!
10 11 12 $ what about now?

you could load it into an array called a with:

a = np.loadtxt("spaces_comments.txt", skiprows=2, comments=("#", "$", "XXX"))

and the result would be:

In [N]: print(a)
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]]
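
To try the example without creating the file by hand, one possible approach is to write the same text to a temporary directory from Python and then load it. This is only a sketch; the use of tempfile is a convenience chosen here and is not part of the original example.

```python
import os
import tempfile
import numpy as np

# Same text as spaces_comments.txt above
file_text = """This is at the top
and should be ignored
XXX this is a comment
1     2 3
# as is this
4 5   6
   7   8   9
XXX and another one!
10 11 12 $ what about now?
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spaces_comments.txt")
    with open(path, "w") as f:
        f.write(file_text)
    # Skip the two header lines, then drop anything after #, $, or XXX
    a = np.loadtxt(path, skiprows=2, comments=("#", "$", "XXX"))

print(a.shape)
print(a.dtype)
```

The shape and dtype confirm that only the four data rows survive and that the values are stored as floating-point numbers.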

There are several other options that can be very useful if you want to specify the format of what you are reading (including storing strings) or to automatically split the data set into separate variables for the data in each column. For the latter, if you have a data file called commas.txt as follows:

1, 2, 3
4, 5, 6
7, 8, 9
10, 11, 12

you could load each column into its own array with:

import numpy as np
x, y, z = np.loadtxt("commas.txt", delimiter=",", unpack=True)

after which you will have:

In []: x
Out[]: array([ 1.,  4.,  7., 10.])

In []: y
Out[]: array([ 2.,  5.,  8., 11.])

In []: z
Out[]: array([ 3.,  6.,  9., 12.])

Note that the results are 1-dimensional arrays, each with shape (4,) - meaning they are neither columns nor rows but...1-dimensional arrays.
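
For the other option mentioned above, specifying the format of what is read, a minimal sketch might use the dtype and usecols kwargs. The data here is hypothetical (a text label followed by two numbers), and io.StringIO stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical data: a text label followed by two numeric columns
text = "alpha, 1, 2\nbeta, 3, 4\ngamma, 5, 6\n"

# dtype=str stores the fields as strings instead of converting to float;
# usecols=0 reads only the first column
labels = np.loadtxt(io.StringIO(text), delimiter=",", dtype=str, usecols=0)

# usecols=(1, 2) reads only the numeric columns, as floats by default
values = np.loadtxt(io.StringIO(text), delimiter=",", usecols=(1, 2))

print(labels)
print(values)
```

Reading the labels and the numbers in two passes keeps each result a plain array of a single type.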

np.savetxt()

If you want to save the values in an array to a text file, you can use NumPy's savetxt() function. It writes a single 1-D or 2-D array, so several arrays need to be combined into one (for example with np.column_stack()) before saving. The easiest use case is simply to give the function a file name and an array; it will create a file with the numerical values saved as floating-point numbers with 19 significant figures (!) separated by spaces, because the default format is "%.18e". Common kwargs include delimiter if you want to change the delimiter and fmt if you want to change the format. The format string is similar to the format specification used with print (in f-strings or str.format()), except that instead of beginning with a colon, it begins with a percent sign.
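
To make the colon-versus-percent comparison concrete, the short sketch below formats the same number both ways. The value and the 8.3f specification are arbitrary choices for illustration.

```python
value = 3.14159265

# Format specification as used with str.format() or f-strings (starts with a colon)
new_style = "{:8.3f}".format(value)

# Equivalent percent-style specification, the form savetxt's fmt kwarg expects
old_style = "%8.3f" % value

# The percent-style format "%.18e" applied to an integer-valued number
default_style = "%.18e" % 5

print(repr(new_style))
print(repr(old_style))
print(default_style)
```

Both styles produce the same eight-character string here, and the last line shows why integer data saved with "%.18e" looks so unwieldy.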

As an example, if we want to generate a variable called rolls containing a 3x5 set of random integers between 1 and 6, inclusive, and then save it to a text file called dice.out, the simplest way to do that would be:

rolls = np.random.randint(1, 7, (3,5))
np.savetxt("dice.out", rolls)

This works, but it produces a file containing numbers like:

5.000000000000000000e+00

It is a bit absurd to have 19 significant figures for integers! We can save the array as integers, with commas between the values, using:

np.savetxt("nice_dice.out", rolls, delimiter=",", fmt="%i")

The contents of that file will be:

5,4,1,3,1
1,6,4,4,4
6,2,5,6,2
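
Since savetxt() can also write headers and footers, a possible round trip is sketched below: save the dice values with a header and footer, then read them back with loadtxt(). The header and footer text and the use of a temporary directory are assumptions made for the example; the array reuses the values shown above rather than generating new random ones.

```python
import os
import tempfile
import numpy as np

# Fixed copy of the dice values shown above, so the result is repeatable
rolls = np.array([[5, 4, 1, 3, 1],
                  [1, 6, 4, 4, 4],
                  [6, 2, 5, 6, 2]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nice_dice.out")
    # header and footer lines are written with a leading "# " by default
    np.savetxt(path, rolls, delimiter=",", fmt="%i",
               header="five dice, three rounds", footer="end of data")
    with open(path) as f:
        contents = f.read()
    # loadtxt treats "#" as a comment marker by default, so both lines are skipped
    loaded = np.loadtxt(path, delimiter=",", dtype=int)

print(contents)
print(loaded)
```

Because the header and footer are saved as comment lines, the file can be loaded again without needing skiprows.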