10 Essential NumPy Functions for Data Science

Data science is a rapidly growing field that uses statistical and computational methods to extract insights from data. One of the most powerful tools for data science in Python is the NumPy library, which provides a wide range of functions for working with arrays of numerical data.

This article explores ten essential NumPy functions that every data scientist should know. These functions provide the foundation for many common data analysis tasks, from creating and manipulating arrays to performing statistical calculations. So whether you’re new to data science or an experienced practitioner, read on to discover how these essential NumPy functions can help you work more efficiently with your data.

Creating and Manipulating Arrays

One of the most fundamental operations in data science is creating and manipulating arrays of numerical data. Arrays are data structures that store multiple values of the same type in a fixed-size sequence. NumPy provides several functions that make it easy to create and manipulate arrays in Python.

np.array()

np.array() is used to create a NumPy array from a list or tuple. For example, we can create a one-dimensional array of five integers by passing a list to the function:

import numpy as np
a = np.array([1, 2, 3, 4, 5])
print(a)
# Output: [1 2 3 4 5]

We can also create a two-dimensional array of three rows and two columns by passing a nested list to the function:

b = np.array([[1, 2], [3, 4], [5, 6]])
print(b)
# Output: [[1 2]
#          [3 4]
#          [5 6]]
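
As a quick check on what np.array() produces, the sketch below inspects the shape, number of dimensions, and element type of the result. The names grid and mixed are illustrative only.

```python
import numpy as np

# Illustrative array; the name `grid` is not used elsewhere in the article.
grid = np.array([[1, 2], [3, 4], [5, 6]])

print(grid.shape)  # (3, 2): three rows, two columns
print(grid.ndim)   # 2

# Mixing integers and floats upcasts every element to a float type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```

Every element of an array shares one type, which is why the integers in mixed are stored as floats.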

np.arange()

np.arange() returns an array of evenly spaced values within a specified range. The function takes up to three arguments: the start value (0 by default), the stop value (which is excluded from the result), and the step size (1 by default). For example, we can create an array of ten values from zero to nine by passing zero as the start value, ten as the stop value, and one as the step size:

c = np.arange(0, 10, 1)
print(c)
# Output: [0 1 2 3 4 5 6 7 8 9]

We can also create an array of five values from zero up to, but not including, one by passing zero as the start value, one as the stop value, and 0.2 as the step size:

d = np.arange(0, 1, 0.2)
print(d)
# Output: [0.  0.2 0.4 0.6 0.8]
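
Two further call patterns are worth a short sketch: calling np.arange() with a single argument, and counting down with a negative step. The variable names here are illustrative only.

```python
import numpy as np

# With one argument, start defaults to 0 and step to 1.
count_up = np.arange(5)
print(count_up)    # [0 1 2 3 4]

# A negative step counts down; the stop value is still excluded.
count_down = np.arange(10, 0, -2)
print(count_down)  # [10  8  6  4  2]
```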

np.reshape()

np.reshape() changes the shape of an array without changing its data. The function takes two arguments: the array to be reshaped and the new shape as a tuple of integers. The new shape must hold exactly as many elements as the original array. For example, we can reshape a one-dimensional array a of six elements into a two-dimensional array of two rows and three columns by passing a and (2, 3) to the function:

a = np.array([1,2,3,4,5,6])
e = np.reshape(a, (2, 3))
print(e)
# Output: [[1 2 3]
#          [4 5 6]]

We can also reshape the two-dimensional array b into a one-dimensional array of six elements by passing b and (6,) to the function:

b = np.array([[1, 2], [3, 4], [5, 6]])
f = np.reshape(b, (6,))
print(f)
# Output: [1 2 3 4 5 6]
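
When only one dimension of the target shape is known, np.reshape() can infer the other. The sketch below, which uses illustrative names, shows the -1 placeholder and what happens when the requested shape does not fit the data.

```python
import numpy as np

values = np.arange(12)

# -1 asks NumPy to infer that dimension: 12 elements / 3 rows = 4 columns.
table = np.reshape(values, (3, -1))
print(table.shape)  # (3, 4)

# An incompatible shape raises ValueError instead of dropping or padding data.
try:
    np.reshape(values, (5, 3))
except ValueError as err:
    print("reshape failed:", err)
```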

Combining and Splitting Arrays

Besides creating and manipulating arrays, NumPy also provides functions for combining and splitting arrays. These functions allow you to join multiple arrays into a single array or split an array into multiple sub-arrays.

np.concatenate()

np.concatenate() joins two or more arrays along an axis. The function takes two arguments: a sequence of arrays to be concatenated and the axis along which the arrays will be joined. For example, we can concatenate two one-dimensional arrays a and b along the first axis by passing (a, b) and 0 to the function:

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.concatenate((a, b), axis=0)
print(c)
# Output: [1 2 3 4 5 6]

We can also concatenate two two-dimensional arrays d and e along the second axis by passing (d, e) and 1 to the function:

d = np.array([[1, 2], [3, 4]])
e = np.array([[5, 6], [7, 8]])
f = np.concatenate((d, e), axis=1)
print(f)
# Output: [[1 2 5 6]
#          [3 4 7 8]]
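
For comparison, here is a minimal sketch of joining two-dimensional arrays along the first axis instead. The names top, bottom, and stacked are illustrative only.

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

# axis=0 places the rows of `bottom` underneath the rows of `top`.
stacked = np.concatenate((top, bottom), axis=0)
print(stacked.shape)  # (4, 2)
print(stacked)
# [[1 2]
#  [3 4]
#  [5 6]
#  [7 8]]
```

The arrays must agree in size along every axis except the one being joined.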

np.split()

np.split() splits an array into multiple sub-arrays. The function takes three arguments: the array to be split, the number of equally sized sub-arrays to create, and the axis along which to split the array. For example, we can split the one-dimensional array c into three equally sized sub-arrays along the first axis by passing c, 3, and 0 to the function:

g = np.split(c, 3, axis=0)
print(g)
# Output: [array([1, 2]), array([3, 4]), array([5, 6])]

We can also split the two-dimensional array f into two equally sized sub-arrays along the second axis by passing f, 2, and 1 to the function:

h = np.split(f, 2, axis=1)
print(h)
# Output: [array([[1, 2],
#                 [3, 4]]),
#          array([[5, 6],
#                 [7, 8]])]
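
The second argument of np.split() can also be a list of indices at which to cut, which allows sub-arrays of different sizes. The sketch below uses illustrative names and also shows what happens when an integer count does not divide the array evenly.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])

# A list of indices cuts before positions 2 and 5: [0:2], [2:5], [5:].
parts = np.split(values, [2, 5])
print([p.tolist() for p in parts])  # [[1, 2], [3, 4, 5], [6]]

# An integer count must divide the axis length evenly, or ValueError is raised.
try:
    np.split(values, 4)
except ValueError as err:
    print("split failed:", err)
```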

These are some of the basic functions that NumPy provides for combining and splitting arrays. In the next section, we will learn how to sort and search arrays using NumPy functions.

Sorting and Searching Arrays

Another important aspect of data science is sorting and searching arrays. Sorting arrays means arranging the elements of an array in a certain order, such as ascending or descending. Searching arrays means finding the elements of an array that satisfy a certain condition, such as being equal to a given value or being within a given range. NumPy provides several functions that make it easy to sort and search arrays in Python.

np.sort()

np.sort() returns a sorted copy of an array. The function takes two arguments: the array to be sorted and the axis along which to sort the array. For example, we can sort the one-dimensional array a in ascending order by passing a to the function:

import numpy as np
a = np.array([5, 2, 7, 1, 4, 6, 3])
b = np.sort(a)
print(b)
# Output: [1 2 3 4 5 6 7]

We can sort the one-dimensional array a in descending order by sorting it and then reversing the result with [::-1]:

b = np.sort(a)[::-1]
print(b)
# Output: [7 6 5 4 3 2 1]

We can also sort the two-dimensional array c in descending order along the first axis by passing c and 0 to the function and then reversing the rows of the result with [::-1]:

c = np.array([[1, 2], [4, 3], [6, 5]])
d = np.sort(c, axis=0)[::-1]
print(d)
# Output: [[6 5]
#          [4 3]
#          [1 2]]
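
The same reversing idea applies along the second axis. The following sketch, with illustrative names, sorts each row in descending order and confirms that np.sort() leaves the original array untouched.

```python
import numpy as np

scores = np.array([[3, 1, 2], [9, 7, 8]])

# np.sort() returns a new array; `scores` itself is unchanged.
ascending = np.sort(scores, axis=1)

# Reversing along the same axis gives each row in descending order.
descending = ascending[:, ::-1]
print(descending)
# [[3 2 1]
#  [9 8 7]]
print(scores[0])  # [3 1 2]
```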

np.where()

np.where() has two forms. When only a condition is given, it returns the indices of the elements that satisfy the condition. When two more arguments are given, it returns an array whose values come from the first where the condition is true and from the second where it is false. For example, we can keep the elements of the one-dimensional array a that are greater than four and replace all the others with -1 by passing a > 4, a, and -1 to the function:

a = np.array([5, 2, 7, 1, 4, 6, 3])
e = np.where(a > 4, a, -1)
print(e)
# Output: [ 5 -1  7 -1 -1  6 -1]

The three-argument form works the same way on a two-dimensional array. For example, we can keep the elements of c that are equal to five and replace all the others with -1 by passing c == 5, c, and -1 to the function:

f = np.where(c == 5, c, -1)
print(f)
# Output: [[-1 -1]
#          [-1 -1]
#          [-1  5]]
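
The condition-only form of np.where() described earlier returns positions rather than values. A minimal sketch, using the illustrative names readings and idx:

```python
import numpy as np

readings = np.array([5, 2, 7, 1, 4, 6, 3])

# With only a condition, np.where() returns a tuple of index arrays,
# one per dimension. For a 1-D input, take element [0] of that tuple.
idx = np.where(readings > 4)[0]
print(idx)            # [0 2 5]

# The indices can be used to pull out the matching values.
print(readings[idx])  # [5 7 6]
```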

These are some of the basic functions that NumPy provides for sorting and searching arrays.

Statistical Functions

In data science, it is often necessary to compute basic statistics on arrays of numerical data. Statistics such as the mean, median, and standard deviation can provide valuable insights into the distribution and variability of the data. NumPy provides several functions for computing these and other statistical measures on arrays.

np.mean()

np.mean() computes the arithmetic mean along the specified axis. The function takes two arguments: the input array and the axis along which to compute the mean. For example, we can compute the mean of the one-dimensional array a by passing a and None to the function:

import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.mean(a, axis=None)
print(b)
# Output: 3.0

We can also compute the mean of each column of the two-dimensional array c by passing c and 0 to the function:

c = np.array([[1, 2], [3, 4], [5, 6]])
d = np.mean(c, axis=0)
print(d)
# Output: [3. 4.]
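
To contrast with the column means, this short sketch computes one mean per row and one mean over the whole array. The names are illustrative only.

```python
import numpy as np

table = np.array([[1, 2], [3, 4], [5, 6]])

# axis=1 averages across the columns, giving one mean per row.
row_means = np.mean(table, axis=1)
print(row_means)     # [1.5 3.5 5.5]

# Omitting axis is the same as axis=None: one mean over all elements.
overall_mean = np.mean(table)
print(overall_mean)  # 3.5
```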

np.median()

np.median() computes the median along the specified axis. The function takes two arguments: the input array and the axis along which to compute the median. For example, we can compute the median of the one-dimensional array a by passing a and None to the function:

e = np.median(a, axis=None)
print(e)
# Output: 3.0

We can also compute the median of each row of the two-dimensional array c by passing c and 1 to the function:

f = np.median(c, axis=1)
print(f)
# Output: [1.5 3.5 5.5]
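
Two properties of the median are easy to see in a small sketch: how an even number of values is handled, and how little one extreme value affects the result compared with the mean. The data and names below are made up for illustration.

```python
import numpy as np

# With an even number of values, the median is the mean of the two middle ones.
even_median = np.median(np.array([1, 2, 3, 4]))
print(even_median)    # 2.5

# One extreme value shifts the mean far more than the median.
skewed = np.array([1, 2, 3, 4, 100])
skewed_mean = np.mean(skewed)
skewed_median = np.median(skewed)
print(skewed_mean)    # 22.0
print(skewed_median)  # 3.0
```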

np.std()

np.std() computes the standard deviation along the specified axis. The standard deviation measures how spread out the values in an array are from their mean. The function takes two arguments: the input array and the axis along which to compute the standard deviation. For example, we can compute the standard deviation of the one-dimensional array a by passing a and None to the function:

g = np.std(a, axis=None)
print(g)
# Output: 1.4142135623730951

We can also compute the standard deviation of each column of the two-dimensional array c by passing c and 0 to the function:

h = np.std(c, axis=0)
print(h)
# Output: [1.63299316 1.63299316]
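
By default, np.std() divides by the number of elements N, which gives the population standard deviation. The ddof parameter changes the divisor to N - ddof, so ddof=1 yields the sample standard deviation. A minimal sketch with illustrative names:

```python
import numpy as np

sample = np.array([1, 2, 3, 4, 5])

# Default ddof=0 divides by N (population standard deviation).
population_std = np.std(sample)

# ddof=1 divides by N - 1 (sample standard deviation).
sample_std = np.std(sample, ddof=1)

print(population_std)  # about 1.4142
print(sample_std)      # about 1.5811
```

Which divisor is appropriate depends on whether the array holds the whole population or a sample drawn from it.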

These are some of the functions that NumPy provides for computing basic statistics on arrays. Using these functions, you can quickly summarize the center and spread of your data and make informed decisions based on your analysis.

Conclusion

In this article, we have explored 10 essential NumPy functions that every data scientist should know. These functions cover some of the most common and useful operations for creating, manipulating, combining, splitting, sorting, searching, and computing statistics on arrays of numerical data. Using these functions, you can work more efficiently and effectively with your data and perform various data analysis tasks in Python.

I hope you have enjoyed this article and learned something new and valuable from it. If you have questions or feedback, please leave a comment below. Thank you for reading, and happy coding!