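Python Basics: A Collection of Notes
-
These are Python basics notes I put together a while ago. Much of the formatting has not been cleaned up, so please bear with me.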
change the working directory of anaconda
In the terminal, run
jupyter notebook --generate-config
Modify the config file and restart Anaconda Navigator:
Open the jupyter_notebook_config.py file in any suitable text editor and modify the “c.NotebookApp.notebook_dir” entry to point to the desired working directory. On Windows, change each “\” in the file path to “\\” (or use “/” instead). Make sure to uncomment the line by removing the “#”.
Save the file and restart the Anaconda Navigator.
get current working directory
os.getcwd()
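A minimal sketch of reading and changing the working directory from Python itself, assuming only the standard `os` and `tempfile` modules. The temporary directory is a stand-in for a real project folder.

```python
import os
import tempfile

start_dir = os.getcwd()                      # directory the interpreter started in

with tempfile.TemporaryDirectory() as tmp:   # stand-in for a real project folder
    os.chdir(tmp)                            # change the working directory
    inside = os.getcwd()
    os.chdir(start_dir)                      # switch back before the folder is removed

print(start_dir)
print(inside)
```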
enumerate
- loop through the items
    lst = ["app", "banana", "gig"]
    for thing in lst:
        print(thing)
- index + item: use enumerate
    lst = ["app", "banana", "gig"]
    for idx, thing in enumerate(lst):
        print(idx)
        print(thing)
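As a small extension of the loop above, `enumerate` also accepts a `start` argument, which is handy when the count should begin at 1. A quick sketch:

```python
lst = ["app", "banana", "gig"]

# enumerate yields (index, item) pairs; start=1 makes the count begin at 1
pairs = list(enumerate(lst, start=1))
print(pairs)  # [(1, 'app'), (2, 'banana'), (3, 'gig')]
```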
How to find a subset of a list
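    # lst is the source list and x the threshold; avoid naming a variable "list", which shadows the built-in
    sublist = [i for i in lst if i > x]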
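A concrete run of the same comprehension. The values and the threshold are made up for illustration.

```python
values = [3, 8, 1, 12, 5]
threshold = 4   # hypothetical cut-off

# keep only the items greater than the threshold, preserving their order
subset = [v for v in values if v > threshold]
print(subset)  # [8, 12, 5]
```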
Adding (concatenating) lists
Similar to SQL's UNION ALL: the + operator joins two lists and keeps duplicates.
    a = [1, 2, 3]
    b = [3, 3, 4]
    a + b  # [1, 2, 3, 3, 3, 4]
Differences between loc, iloc and []
- Difference between df['col_name'].values and df[['col_name']].values: the former gives a 1-D array and the latter gives a 2-D array.
- With a boolean mask, loc[] returns the same rows as [], but it is better to call loc explicitly. With a single label they differ: df['x'] selects a column, while df.loc['x'] selects a row.
- Avoid chained indexing such as Ax['s']['as']. It can be replaced by Ax.loc['as', 's'].
- To index by column name and row number without chained indexing:
    df.loc[df.index[0], 'NAME']
    # or
    df.iloc[0, df.columns.get_loc('NAME')]
- loc is label-based: rows and columns are selected by their names (labels).
- For example, let’s say we search for the rows whose index is 1, 2 or 100. We will not get the first, second or the hundredth row here. Instead, we will get the results only if the name of any index is 1, 2 or 100.
    # select all rows matching a condition
    data.loc[data.age >= 15]
    # select with multiple conditions
    data.loc[(data.age >= 12) & (data.gender == 'M')]
    # select a range of rows: loc slices by label and includes both endpoints.
    # If the index is not sorted, the slice runs from the position of label 1
    # to the position of label 3, so sort the index first for a true label range.
    data.loc[1:3]
    # select only the required columns with a condition
    data.loc[(data.age >= 12), ['city', 'gender']]
    # update a column with a condition
    data.loc[(data.age >= 12), ['section']] = 'M'
    # update multiple columns with a condition
    data.loc[(data.age >= 20), ['section', 'city']] = ['S', 'Pune']
    # select a column (all rows)
    data.loc[:, ['col_name']]
    # select rows by condition and a single column
    data.loc[data.age >= 12, 'col_name']
- On the other hand, iloc is integer position-based, so rows and columns are specified by their integer positions.
    # select rows by position
    data.iloc[[0, 2]]
    # select particular rows and particular columns by position
    data.iloc[[0, 2], [1, 3]]
    # select a range of rows (the end position is excluded)
    data.iloc[1:3]
    # select a range of rows and columns
    data.iloc[1:3, 2:4]
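To make the label versus position distinction concrete, here is a small sketch on a toy frame. The column names and index labels are invented, and the index is deliberately out of order.

```python
import pandas as pd

# Toy frame whose index labels are deliberately not 0, 1, 2
data = pd.DataFrame(
    {"age": [10, 15, 20], "city": ["Pune", "Delhi", "Agra"]},
    index=[100, 2, 1],
)

by_label = data.loc[1, "city"]        # row whose label is 1 (the third row)
by_position = data.iloc[1, 1]         # second row, second column
first_row_age = data.loc[data.index[0], "age"]  # row by position, column by name

one_d = data["age"].values            # 1-D array
two_d = data[["age"]].values          # 2-D array with a single column
print(by_label, by_position, first_row_age, one_d.shape, two_d.shape)
```

The same number, 1, picks different rows depending on whether it is treated as a label or as a position.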
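How to slice a Series
    # df_temp is a pandas Series holding the number of missing values per column
    df_temp = df_all.isnull().sum(axis=0)
    # keep only the columns that have at least one missing value
    df_temp[df_temp > 0]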
- Select a particular column
df['label']
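Basic plots
    # packages
    import matplotlib.pyplot as plt
    # Jupyter magic; remove this line outside a notebook
    %matplotlib inline

    # bar chart of the label counts
    df_train['label'].value_counts().plot(kind='bar')

    # draw one image per subplot
    # images is not defined in these notes; it is assumed to hold up to six row positions of df_train
    fig = plt.figure(figsize=(18, 10))
    for idx, row in enumerate(images):
        ax = fig.add_subplot(2, 3, idx + 1)
        ax.set_xticks([])
        ax.set_yticks([])
        # columns 1 to 784 are assumed to hold the 28 * 28 = 784 pixel values
        pixels = df_train.iloc[row, 1:785].values.reshape((28, 28))
        ax.imshow(pixels, cmap="gray")
        ax.set_title(df_train.iloc[row]['label'], fontsize=24)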
Differences between index, array and list
Array / List
Lists and arrays both store data (strings, integers and so on), and both can be indexed and iterated. They differ in the operations they support: dividing an array (for example a NumPy array) by 4 divides every element, whereas dividing a list by 4 raises a TypeError.
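The division example can be sketched as follows, assuming "array" means a NumPy array:

```python
import numpy as np

arr = np.array([4, 8, 12])
lst = [4, 8, 12]

print(arr / 4)                # element-wise division: [1. 2. 3.]

try:
    lst / 4                   # lists do not define division
except TypeError as err:
    list_error = str(err)
    print("TypeError:", list_error)

print([v / 4 for v in lst])   # a list needs an explicit loop or comprehension
```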
index
A pandas Index, on the other hand, is immutable: it is an immutable ndarray implementing an ordered, sliceable set.
Properties
- Index.values: return an array representing the data in the Index.
- Index.is_monotonic: alias for is_monotonic_increasing (deprecated, and removed in pandas 2.0).
- Index.is_monotonic_increasing: return whether the values are monotonically increasing (each value equal to or greater than the previous one).
- Index.is_monotonic_decreasing: return whether the values are monotonically decreasing (each value equal to or smaller than the previous one).
- Index.is_unique: return whether the index has unique values.
dataframe.sum
    dataframe.sum(axis=0)
With axis=0 the sum runs down the rows, giving one total per column.
drop / dropna / isna / fillna
- index of missing values in a particular column
idx_missing = df[column].isna()
- find the rows without missing values (negate a boolean mask with ~, not with -)
    df[~idx_missing]
    df.loc[~idx_missing]
- fill na with value
    # fill NaN with "No College" and assign the result back to the column
    # (inplace=True modifies the object in place and returns None, but it is unreliable on a selected column in recent pandas)
    nba["College"] = nba["College"].fillna("No College")
    # fill NaN with the mean; this only makes sense for a numeric column ("Salary" is an assumed example)
    nba["Salary"] = nba["Salary"].fillna(nba["Salary"].mean())
    # when no fill value is given, ffill copies the previous valid value forward and bfill copies the next valid value backward
    nba["College"] = nba["College"].ffill()  # fill each missing value with the value before it
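The steps above can be tried end to end on a toy frame. This is only a sketch: the column names and values are invented, and the results are assigned to new objects rather than modified in place.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are made up for illustration
df = pd.DataFrame({"college": ["Duke", None, "Texas"], "salary": [5.0, np.nan, 7.0]})

idx_missing = df["college"].isna()       # boolean mask: True where the value is missing
complete_rows = df.loc[~idx_missing]     # ~ negates the mask

filled = df.assign(
    college=df["college"].fillna("No College"),        # constant fill
    salary=df["salary"].fillna(df["salary"].mean()),   # mean fill for a numeric column
)
forward = df["salary"].ffill()           # copy the previous valid value forward
print(filled)
```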
Matrix operation
NumPy provides functions for the common matrix manipulations and operations:
- add(): adds two matrices element-wise.
- subtract(): subtracts two matrices element-wise.
- divide(): divides two matrices element-wise.
- multiply(): multiplies two matrices element-wise.
- dot(): performs matrix multiplication, not element-wise multiplication.
- sqrt(): takes the square root of each element of the matrix.
- sum(x, axis): sums the elements of the matrix. The second argument is optional: axis=0 gives the column sums and axis=1 gives the row sums.
- T: transposes the matrix.
    import numpy

    # Two matrices are initialized by value
    x = numpy.array([[1, 2], [4, 5]])
    y = numpy.array([[7, 8], [9, 10]])

    # add() is used to add matrices
    print("Addition of two matrices: ")
    print(numpy.add(x, y))

    # subtract() is used to subtract matrices
    print("Subtraction of two matrices : ")
    print(numpy.subtract(x, y))

    # divide() is used to divide matrices element-wise
    print("Matrix Division : ")
    print(numpy.divide(x, y))

    # multiply() is element-wise, dot() is the matrix product
    print("Multiplication of two matrices: ")
    print(numpy.multiply(x, y))
    print("The product of two matrices : ")
    print(numpy.dot(x, y))

    print("square root is : ")
    print(numpy.sqrt(x))

    print("The summation of elements : ")
    print(numpy.sum(y))
    print("The column wise summation : ")
    print(numpy.sum(y, axis=0))
    print("The row wise summation: ")
    print(numpy.sum(y, axis=1))

    # using "T" to transpose the matrix
    print("Matrix transposition : ")
    print(x.T)
lambda functions
    x = lambda a: a + 1
    x = lambda a, b: a * b
    # apply a lambda function to a DataFrame column
    df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float')
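A short runnable sketch of the same ideas without pandas. The percentage strings are invented sample data, and the lambda is used as a sort key, which is one of its most common jobs.

```python
add_one = lambda a: a + 1
multiply = lambda a, b: a * b

# Lambdas are handy as short throwaway functions, e.g. as a sort key
growth = ["12%", "3.5%", "7%"]
by_value = sorted(growth, key=lambda s: float(s.replace("%", "")))
print(add_one(2), multiply(3, 4), by_value)  # 3 12 ['3.5%', '7%', '12%']
```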
reshape numpy array
NumPy allows one of the new shape parameters to be -1 (e.g. (2, -1) or (-1, 3), but not (-1, -1)). It means that this dimension is unknown and NumPy should work it out. NumPy infers it from the length of the array and the remaining dimensions, so that the new shape holds exactly the same number of elements as the original array.
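A small sketch of how the -1 dimension is inferred, using a 12-element array chosen for illustration:

```python
import numpy as np

a = np.arange(12)                # 12 elements

two_rows = a.reshape(2, -1)      # -1 is inferred as 12 / 2 = 6
three_cols = a.reshape(-1, 3)    # -1 is inferred as 12 / 3 = 4
print(two_rows.shape, three_cols.shape)  # (2, 6) (4, 3)

try:
    a.reshape(5, -1)             # 12 is not divisible by 5, so no shape fits
except ValueError as err:
    print("ValueError:", err)
```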
String formats
The format() method formats the specified value(s) and inserts them into the string's placeholders.
A placeholder is defined using curly brackets: {}.
The format() method returns the formatted string.
    txt1 = "My name is {fname}, I'm {age}".format(fname="John", age=36)
    txt2 = "My name is {0}, I'm {1}".format("John", 36)
    txt3 = "My name is {}, I'm {}".format("John", 36)
Inside the placeholders you can add a formatting type to format the result:
- :< Left aligns the result (within the available space)
- :> Right aligns the result (within the available space)
- :^ Center aligns the result (within the available space)
- := Places the sign at the leftmost position
- :+ Use a plus sign to indicate if the result is positive or negative
- :- Use a minus sign for negative values only
- : (a space) Use a space to insert an extra space before positive numbers (and a minus sign before negative numbers)
- :, Use a comma as a thousands separator
- :_ Use an underscore as a thousands separator
- :b Binary format
- :c Converts the value into the corresponding Unicode character
- :d Decimal format
- :e Scientific format, with a lower case e
- :E Scientific format, with an upper case E
- :f Fixed-point number format (:.2f keeps two decimal places)
- :F Fixed-point number format, in uppercase (shows inf and nan as INF and NAN)
- :g General format
- :G General format (using an upper case E for scientific notation)
- :o Octal format
- :x Hex format, lower case
- :X Hex format, upper case
- :n Number format
- :% Percentage format
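A few of these format types in action. The values are arbitrary, and the square brackets are only there to make the padding visible.

```python
lines = [
    "[{:<8}]".format("ab"),        # left-aligned in 8 characters
    "[{:>8}]".format("ab"),        # right-aligned
    "[{:^8}]".format("ab"),        # centered
    "{:+d}".format(5),             # explicit plus sign
    "{:,}".format(1234567),        # comma as thousands separator
    "{:_}".format(1234567),        # underscore as thousands separator
    "{:b}".format(10),             # binary
    "{:x}".format(255),            # lower-case hex
    "{:.2f}".format(3.14159),      # two decimal places
    "{:.1%}".format(0.256),        # percentage with one decimal place
]
for line in lines:
    print(line)
```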
Open a file
The available modes are:
- 'r': open for reading (default)
- 'w': open for writing, truncating the file first
- 'x': open for exclusive creation, failing if the file already exists
- 'a': open for writing, appending to the end of the file if it exists
- 'b': binary mode
- 't': text mode (default)
- '+': open for updating (reading and writing)
The default mode is 'r' (open for reading text, a synonym of 'rt'). Modes 'w+' and 'w+b' open and truncate the file (its contents are emptied first). Modes 'r+' and 'r+b' open the file with no truncation.
Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
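A minimal sketch of the most common modes, written against a throwaway temporary directory so that no real file is touched. The file name is hypothetical.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")           # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:   # 'w' creates or truncates
        f.write("first\n")

    with open(path, "a", encoding="utf-8") as f:   # 'a' appends to the end
        f.write("second\n")

    with open(path, "r", encoding="utf-8") as f:   # text mode returns str
        text = f.read()

    with open(path, "rb") as f:                    # binary mode returns bytes
        raw = f.read()

print(repr(text))
print(type(raw).__name__)
```

Using `with` closes each file automatically, even if an error occurs inside the block.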
flatten
numpy.ndarray.flatten() function
The flatten() function returns a copy of a given array collapsed into one dimension. The order argument controls how the elements are read:
- ‘C’ means to flatten in row-major (C-style) order. This is the default.
- ‘F’ means to flatten in column-major (Fortran-style) order.
- ‘A’ means to flatten in column-major order if a is Fortran contiguous in memory, row-major order otherwise.
- ‘K’ means to flatten a in the order the elements occur in memory.
ndarray.flatten(order='C')
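A quick sketch of the two most common orders on a 2 x 2 array, which also shows that flatten() returns a copy:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

row_major = a.flatten()            # default order='C': row by row
col_major = a.flatten(order="F")   # column by column

row_major[0] = 99                  # flatten returns a copy, so a is unchanged
print(row_major, col_major, a[0, 0])
```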
underscore in python
Underscore _ is considered as "I don't Care" or "Throwaway" variable in Python
The underscore _ is also used for ignoring the specific values. If you don’t need the specific values or the values are not used, just assign the values to underscore.
    >>> x, _, y = (1, 2, 3)
    >>> x
    1
    >>> y
    3
.copy()
    df_copy = df_all
Here df_copy and df_all are linked: they are two names for the same object, much like Set in VBA.
    df_copy = df_all.copy()
This creates a new object, and the two are independent.
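The difference can be checked directly. This is a sketch on a toy frame, and the variable names `alias` and `independent` are made up for illustration.

```python
import pandas as pd

df_all = pd.DataFrame({"x": [1, 2, 3]})

alias = df_all                 # same object under a second name
independent = df_all.copy()    # new object with its own data

df_all.loc[0, "x"] = 100       # modify the original in place

print(alias is df_all)             # True
print(alias.loc[0, "x"])           # 100, the alias sees the change
print(independent.loc[0, "x"])     # 1, the copy does not
```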
Flatten a list
Given a list of lists l:
flat_list = [item for sublist in l for item in sublist]
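The comprehension reads in the same order as the equivalent nested loops: the outer `for` comes first. A sketch with sample data, alongside the standard-library alternative `itertools.chain.from_iterable`:

```python
from itertools import chain

l = [[1, 2], [3], [4, 5, 6]]

flat_list = [item for sublist in l for item in sublist]   # outer loop first, then inner
flat_chain = list(chain.from_iterable(l))                 # standard-library equivalent
print(flat_list)  # [1, 2, 3, 4, 5, 6]
```

Both approaches flatten only one level of nesting.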
Membership test ("in") in pandas
data.isin([])
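A small sketch of `isin` used as a row filter. The frame and its values are invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({"city": ["Pune", "Delhi", "Agra"], "age": [12, 20, 15]})

mask = data["city"].isin(["Pune", "Agra"])   # element-wise membership test
print(mask.tolist())                         # [True, False, True]
print(data.loc[mask, "age"].tolist())        # [12, 15]
print(data.loc[~mask, "city"].tolist())      # ['Delhi']
```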