Python Basics: Collected Notes

Mengkelyu

I put these Python basics together a while ago; the formatting is still rough in many places, so please bear with me~

change the working directory of anaconda

In the terminal, run

jupyter notebook --generate-config

Modify the config file and restart Anaconda Navigator:

Open the jupyter_notebook_config.py file in any suitable text editor and set the “c.NotebookApp.notebook_dir” entry to the desired working directory. On Windows, replace each “\” in the file path with “\\”. Make sure to uncomment the line by removing the “#”.

Save the file and restart the Anaconda Navigator.

get current working directory

import os
os.getcwd()

enumerate

loop through the items

lst = ["app", "banana", "gig"]
for thing in lst:
    print(thing)

index + item: use enumerate

lst = ["app", "banana", "gig"]
for idx, thing in enumerate(lst):
    print(idx)
    print(thing)
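
If the counter should start at 1 rather than 0, enumerate takes a start argument. A small sketch reusing the list above (the name numbered is only for illustration):

```python
lst = ["app", "banana", "gig"]

# start=1 makes the counter begin at 1 instead of the default 0
numbered = [(idx, thing) for idx, thing in enumerate(lst, start=1)]
print(numbered)  # [(1, 'app'), (2, 'banana'), (3, 'gig')]
```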

How to find a subset of a list

sublist = [i for i in lst if i > x]
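
A concrete run of the same pattern, with made-up numbers; threshold plays the role of x:

```python
values = [3, 8, 1, 12, 5]   # illustrative data
threshold = 4               # the cut-off, called x above

# keep only the items that satisfy the condition, in their original order
sublist = [v for v in values if v > threshold]
print(sublist)  # [8, 12, 5]
```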

Concatenating lists

Similar to UNION ALL in SQL: duplicates are kept

a = [1,2,3]
b = [3,3,4]
a+b
# [1, 2, 3, 3, 3, 4]
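
For comparison, here is a sketch of the duplicate-free version (closer to a plain UNION). Using dict.fromkeys to drop duplicates is just one possible approach, chosen here because it keeps the first-seen order:

```python
a = [1, 2, 3]
b = [3, 3, 4]

union_all = a + b                     # concatenation, duplicates kept
union = list(dict.fromkeys(a + b))    # duplicates dropped, first-seen order kept
print(union_all)  # [1, 2, 3, 3, 3, 4]
print(union)      # [1, 2, 3, 4]
```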

Differences of loc and iloc and []

df['col_name'].values returns a 1-D array, while df[['col_name']].values returns a 2-D array with a single column.
With a boolean mask, loc[] and [] select the same rows, but it is better to call loc[] explicitly.
Avoid chained indexing such as Ax['s']['as']. Use a single call instead: Ax.loc['as', 's'].
To index by row position and column name without chained indexing:

df.loc[df.index[0], 'NAME']
# or
df.iloc[0, df.columns.get_loc("a")]
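
A small sketch of the points above on a made-up frame (the column names and index labels are invented for the example):

```python
import pandas as pd

# Illustrative frame; labels and values are made up
df = pd.DataFrame({"NAME": ["Ann", "Bob", "Cy"], "a": [10, 20, 30]},
                  index=["x", "y", "z"])

one_d = df["a"].values      # 1-D array, shape (3,)
two_d = df[["a"]].values    # 2-D array, shape (3, 1)
print(one_d.shape, two_d.shape)

# first row (by position) combined with a column chosen by name
by_label = df.loc[df.index[0], "NAME"]
by_position = df.iloc[0, df.columns.get_loc("a")]
print(by_label, by_position)  # Ann 10
```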

loc is label-based, which means that we select rows and columns by their names (labels).
- For example, let’s say we ask for the rows whose index is 1, 2 or 100. We will not get the first, second or hundredth row. Instead, we get only the rows whose index label is 1, 2 or 100.

# select all rows with a condition
data.loc[data.age >= 15]
# select with multiple conditions
data.loc[(data.age >= 12) & (data.gender == 'M')]
# Select a range of rows using loc
#slice
data.loc[1:3]
# loc slices are label-based and include both endpoints. If the index is not sorted (but unique), the slice returns every row positioned between label 1 and label 3, whatever labels lie in between
# Select only required columns with a condition
data.loc[(data.age >= 12), ['city', 'gender']]
# update a column with condition
data.loc[(data.age >= 12), ['section']] = 'M'
# update multiple columns with condition
data.loc[(data.age >= 20), ['section', 'city']] = ['S','Pune']

# select a column (all rows)
data.loc[:, ['col_name']]

# select rows by condition + a single column
data.loc[data.age >= 12,'col_name']

On the other hand, iloc is integer index-based. So here, we have to specify rows and columns by their integer index.

# select rows with indexes
data.iloc[[0,2]]
# select rows with particular indexes and particular columns
data.iloc[[0,2],[1,3]]
# select a range of rows
data.iloc[1:3]
# select a range of rows and columns
data.iloc[1:3,2:4]
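
The label versus position distinction is easiest to see when the index labels do not match the row positions. A minimal sketch with invented data:

```python
import pandas as pd

# Index labels deliberately differ from row positions (made-up data)
data = pd.DataFrame({"age": [11, 15, 20, 30]}, index=[100, 2, 1, 3])

print(data.loc[1, "age"])    # label 1 -> 20
print(data.iloc[1, 0])       # position 1 -> 15

label_slice = data.loc[2:1]      # from label 2 through label 1, both ends included
position_slice = data.iloc[1:3]  # positions 1 and 2, end excluded
print(list(label_slice.index), list(position_slice.index))  # [2, 1] [2, 1]
```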

How to slice series

# df_temp is a pandas Series holding the number of missing values per column
df_temp = df_all.isnull().sum(axis=0)
df_temp[df_temp>0]

Select a particular column

df['label']

Basic plotting

# packages
import matplotlib.pyplot as plt
%matplotlib inline

df_train['label'].value_counts().plot(kind='bar')
# draw each selected image in its own subplot
# images is assumed to be a list of up to 6 row positions in df_train
fig = plt.figure(figsize=(18, 10))

for idx, row in enumerate(images):
    ax = fig.add_subplot(2, 3, idx + 1)
    ax.set_xticks([])
    ax.set_yticks([])
    # columns 1..784 hold the 28 x 28 = 784 pixel values
    pixels = df_train.iloc[row, 1:785].values.reshape((28, 28))
    ax.imshow(pixels, cmap="gray")
    ax.set_title(df_train['label'].iloc[row], fontsize=24)

Differences between index, array and list

Array / List

Lists and NumPy arrays both store data (lists can hold any mix of types, such as strings and integers), and both can be indexed and iterated over. They differ in the operations they support. For example, dividing an array by 4 divides every element and returns a new array, whereas dividing a list by 4 raises a TypeError.
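
The division example can be sketched like this (the numbers are arbitrary):

```python
import numpy as np

arr = np.array([4, 8, 12])
quarter = arr / 4          # element-wise division
print(quarter)             # [1. 2. 3.]

try:
    [4, 8, 12] / 4         # lists do not support division
    list_error = None
except TypeError as exc:
    list_error = type(exc).__name__
print(list_error)          # TypeError
```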

index

A pandas Index, on the other hand, is an immutable ndarray implementing an ordered, sliceable set.

Properties

Index.values	Return an array representing the data in the Index.
Index.is_monotonic	Alias for is_monotonic_increasing (deprecated in pandas 1.5 and removed in 2.0).
Index.is_monotonic_increasing	Return if the index is monotonic increasing (only equal or increasing) values.
Index.is_monotonic_decreasing	Return if the index is monotonic decreasing (only equal or decreasing) values.
Index.is_unique	Return if the index has unique values.

dataframe.sum

dataframe.sum(axis=0)
Sums down the rows, giving one total per column
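
A quick sketch of both axis values on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # made-up data

col_totals = df.sum(axis=0)   # one total per column: a=6, b=60
row_totals = df.sum(axis=1)   # one total per row: 11, 22, 33
print(col_totals.to_dict(), row_totals.tolist())
```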

drop / dropna / isna /fillna

boolean mask marking the missing values in a particular column

idx_missing = df[column].isna()

find the rows without missing values

df[~idx_missing]
df.loc[~idx_missing]

fill na with value

# fill NaN with "No College"
# inplace=True would modify the object in place and return None, but recent
# pandas versions warn about it on a selected column, so assign the result back
nba["College"] = nba["College"].fillna("No College")
# fill NaN with the mean; this needs a numeric column ("Salary" is assumed here)
nba["Salary"] = nba["Salary"].fillna(nba["Salary"].mean())
# when no fill value is given, a fill method can be used instead:
# ffill carries the previous valid value forward, bfill uses the next valid value
nba["College"] = nba["College"].ffill()
# ffill fills each missing value with the value that precedes it
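
To see these calls end to end, here is a sketch on a tiny stand-in for the nba frame. The values are invented, and the "Salary" column is an assumption added so that the mean makes sense:

```python
import numpy as np
import pandas as pd

# Small stand-in for the nba frame; all values are invented
nba = pd.DataFrame({"College": ["Duke", np.nan, "UCLA", np.nan],
                    "Salary": [100.0, np.nan, 300.0, 200.0]})

missing = nba["College"].isna()          # boolean mask of missing values
complete_rows = nba[~missing]            # rows where College is present

filled_const = nba["College"].fillna("No College")
filled_mean = nba["Salary"].fillna(nba["Salary"].mean())   # mean of 100, 300, 200
filled_ffill = nba["College"].ffill()    # previous valid value carried forward
print(filled_ffill.tolist())             # ['Duke', 'Duke', 'UCLA', 'UCLA']
```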

Matrix operation

NumPy provides functions for the common matrix manipulations and operations:

add() − adds the elements of two matrices.
subtract() − subtracts the elements of two matrices.
divide() − divides the elements of two matrices.
multiply() − multiplies the elements of two matrices (element-wise).
dot() − performs matrix multiplication, not element-wise multiplication.
sqrt() − takes the square root of each element of the matrix.
sum(x, axis) − adds up all the elements of the matrix. The second argument is optional: axis=0 gives the column sums and axis=1 gives the row sums.
“T” − transposes the matrix.

import numpy
# Two matrices are initialized by value
x = numpy.array([[1, 2], [4, 5]])
y = numpy.array([[7, 8], [9, 10]])
#  add()is used to add matrices
print ("Addition of two matrices: ")
print (numpy.add(x,y))
# subtract()is used to subtract matrices
print ("Subtraction of two matrices : ")
print (numpy.subtract(x,y))
# divide()is used to divide matrices
print ("Matrix Division : ")
print (numpy.divide(x,y))
print ("Multiplication of two matrices: ")
print (numpy.multiply(x,y))
print ("The product of two matrices : ")
print (numpy.dot(x,y))
print ("square root is : ")
print (numpy.sqrt(x))
print ("The summation of elements : ")
print (numpy.sum(y))
print ("The column wise summation  : ")
print (numpy.sum(y,axis=0))
print ("The row wise summation: ")
print (numpy.sum(y,axis=1))
# using "T" to transpose the matrix
print ("Matrix transposition : ")
print (x.T)
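
The difference between multiply() and dot() is the one that trips people up most often, so here is a short sketch with the same two matrices. The @ operator is shown as an equivalent spelling of dot() for 2-D arrays:

```python
import numpy as np

x = np.array([[1, 2], [4, 5]])
y = np.array([[7, 8], [9, 10]])

elementwise = np.multiply(x, y)   # same as x * y
matrix_product = np.dot(x, y)     # same as x @ y for 2-D arrays

print(elementwise.tolist())       # [[7, 16], [36, 50]]
print(matrix_product.tolist())    # [[25, 28], [73, 82]]
```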

lambda functions

x = lambda a: a+1
x = lambda a, b : a * b

# Apply lambda function in dataframe
df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float')
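
A runnable sketch of the same ideas without pandas. The names add_one, multiply and people are made up for the example:

```python
add_one = lambda a: a + 1
multiply = lambda a, b: a * b
print(add_one(2), multiply(3, 4))   # 3 12

# A lambda used as a sort key: order (name, age) pairs by age
people = [("Ann", 31), ("Bob", 25), ("Cy", 28)]
by_age = sorted(people, key=lambda person: person[1])
print(by_age)   # [('Bob', 25), ('Cy', 28), ('Ann', 31)]
```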

reshape numpy array

NumPy allows one dimension of the new shape to be given as -1 (e.g. (2, -1) or (-1, 3), but not (-1, -1)). It marks an unknown dimension that NumPy works out from the length of the array and the remaining dimensions, so that the new shape holds exactly the same number of elements as the original.
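
A short sketch of the inference, using a 12-element array picked for the example:

```python
import numpy as np

a = np.arange(12)          # 12 elements

b = a.reshape(2, -1)       # -1 is inferred as 12 / 2 = 6
c = a.reshape(-1, 3)       # -1 is inferred as 12 / 3 = 4
print(b.shape, c.shape)    # (2, 6) (4, 3)

try:
    a.reshape(5, -1)       # 12 is not divisible by 5, so no shape fits
    err = None
except ValueError as exc:
    err = type(exc).__name__
print(err)                 # ValueError
```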

String formats

The format() method formats the specified value(s) and inserts them into the string's placeholders.

The placeholder is defined using curly brackets: {}.

The format() method returns the formatted string.

txt1 = "My name is {fname}, I'm {age}".format(fname = "John", age = 36)
txt2 = "My name is {0}, I'm {1}".format("John",36)
txt3 = "My name is {}, I'm {}".format("John",36)

Inside the placeholders you can add a formatting type to format the result

:< Left aligns the result (within the available space)
:> Right aligns the result (within the available space)
:^ Center aligns the result (within the available space)
:= Places the sign to the left most position
:+ Use a plus sign to indicate if the result is positive or negative
:- Use a minus sign for negative values only
: (a space after the colon) Inserts an extra space before positive numbers (and a minus sign before negative numbers)
:, Use a comma as a thousands separator
:_ Use an underscore as a thousands separator
:b Binary format
:c Converts the value into the corresponding unicode character
:d Decimal format
:e Scientific format, with a lower case e
:E Scientific format, with an upper case E
:f Fixed-point number format; for example, :.2f keeps 2 digits after the decimal point
:F Fixed-point number format, in uppercase (shows inf and nan as INF and NAN)
:g General format
:G General format (using an upper case E for scientific notation)
:o Octal format
:x Hex format, lower case
:X Hex format, upper case
:n Number format
:% Percentage format
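
A few of these format types in action; the values are arbitrary and the dictionary keys are just labels for the example:

```python
price = 1234567.891

formatted = {
    "thousands": "{:,.2f}".format(price),    # 1,234,567.89
    "right":     "{:>8}".format("abc"),      # '     abc'
    "center":    "{:^7}".format("abc"),      # '  abc  '
    "sign":      "{:+d}".format(42),         # +42
    "binary":    "{:b}".format(10),          # 1010
    "hex":       "{:x}".format(255),         # ff
    "percent":   "{:.1%}".format(0.256),     # 25.6%
}
for name, text in formatted.items():
    print(name, repr(text))
```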

Open a file

The available modes are:

Character	Meaning
'r'	open for reading (default)
'w'	open for writing, truncating the file first
'x'	open for exclusive creation, failing if the file already exists
'a'	open for writing, appending to the end of the file if it exists
'b'	binary mode
't'	text mode (default)
'+'	open for updating (reading and writing)

The default mode is 'r' (open for reading text, synonym of 'rt'). Modes 'w+' and 'w+b' open and truncate the file (its contents are cleared first). Modes 'r+' and 'r+b' open the file with no truncation.

Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
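
A minimal sketch of the modes on a throwaway file. The file name demo.txt and the temporary directory are arbitrary choices for the example:

```python
import os
import tempfile

# A throwaway file in a temporary directory; the name is arbitrary
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w", encoding="utf-8") as f:   # create or truncate
    f.write("first\n")
with open(path, "a", encoding="utf-8") as f:   # append to the end
    f.write("second\n")

with open(path, "r", encoding="utf-8") as f:   # text mode returns str
    text = f.read()
with open(path, "rb") as f:                    # binary mode returns bytes
    raw = f.read()

print(type(text).__name__, type(raw).__name__)  # str bytes
```

Using with closes the file automatically, even if an error occurs inside the block.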

flatten

numpy.ndarray.flatten() function

The flatten() function returns a copy of the given array collapsed into one dimension.

The order parameter controls how the elements are read. ‘C’ means to flatten in row-major (C-style) order. ‘F’ means to flatten in column-major (Fortran-style) order. ‘A’ means to flatten in column-major order if the array is Fortran contiguous in memory, row-major order otherwise. ‘K’ means to flatten the array in the order the elements occur in memory. The default is ‘C’.

ndarray.flatten(order='C')
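
A small sketch comparing the two most common orders on a 2 x 2 array, and checking that the result really is a copy:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

row_major = m.flatten()             # default order='C': [1 2 3 4]
col_major = m.flatten(order="F")    # column by column: [1 3 2 4]
print(row_major.tolist(), col_major.tolist())

row_major[0] = 99                   # flatten returns a copy,
print(m[0, 0])                      # so the original still holds 1
```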

underscore in python

By convention, the underscore _ is used as an "I don't care" or throwaway variable in Python.

It is also used to ignore specific values: if a value is not needed or never used, just assign it to the underscore.

x, _, y = (1, 2, 3)

>>> x
1

>>> y 
3
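
Two more common uses of the throwaway name, sketched with arbitrary values:

```python
# ignore several values at once with a starred throwaway
first, *_, last = [1, 2, 3, 4, 5]
print(first, last)   # 1 5

# repeat something a fixed number of times without using the counter
greetings = ["hi" for _ in range(3)]
print(greetings)     # ['hi', 'hi', 'hi']
```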

.copy()

df_copy = df_all

Here df_copy and df_all are linked: they are two names for the same object, just like Set in VBA.

df_copy = df_all.copy()

This creates a new object, so the two are independent.
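
The difference between plain assignment and df_all.copy() can be checked on a made-up frame. The names alias and independent are only for illustration:

```python
import pandas as pd

df_all = pd.DataFrame({"a": [1, 2, 3]})  # made-up data

alias = df_all               # same object under a second name
independent = df_all.copy()  # new object with its own data

df_all.loc[0, "a"] = 100
print(alias is df_all)              # True
print(alias.loc[0, "a"])            # 100
print(independent.loc[0, "a"])      # 1
```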

Flatten a list

Given a list of lists l

flat_list = [item for sublist in l for item in sublist]
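
The comprehension reads left to right like two nested for loops. As a sketch, here it is next to itertools.chain.from_iterable, a standard-library alternative that gives the same result (the nested list is made up):

```python
from itertools import chain

l = [[1, 2], [3], [4, 5, 6]]   # illustrative nested list

flat_list = [item for sublist in l for item in sublist]
flat_chain = list(chain.from_iterable(l))   # same result via the standard library
print(flat_list)   # [1, 2, 3, 4, 5, 6]
```

Both versions flatten only one level of nesting.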

The equivalent of "in" in pandas

data.isin([])
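
isin returns a boolean mask, which can then be used to filter rows. A minimal sketch with an invented city column:

```python
import pandas as pd

data = pd.DataFrame({"city": ["Pune", "Delhi", "Goa", "Pune"]})  # made-up data

mask = data["city"].isin(["Pune", "Goa"])   # boolean Series
selected = data[mask]                       # rows whose city is in the list
excluded = data[~mask]                      # the "not in" case
print(mask.tolist())                        # [True, False, True, True]
```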