Wednesday, 15 April 2015

python - Updating a dataframe column in Spark


Given the new Spark DataFrame API, it is not clear whether it is possible to modify a dataframe column.

How do I go about changing the value in row x of column y of a dataframe?

Columns cannot be changed in place; the best you can do is create a new dataframe with the desired modifications.

If you want to change the value in a column based on a condition, similar to np.where:

  import pyspark.sql.functions as F

  update_func = (F.when(F.col('update_col') == replace_val, new_value)
                  .otherwise(F.col('update_col')))
  df = df.withColumn('new_column_name', update_func)
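The when/otherwise logic above can be sketched in plain Python, without a Spark cluster. The sample rows and the values of replace_val and new_value below are hypothetical, chosen only to illustrate the conditional replacement:

```python
# Plain-Python analogue of F.when(condition, value).otherwise(fallback)
# applied row by row. replace_val / new_value are made-up example values.
replace_val, new_value = 'old', 'new'
rows = [{'update_col': 'old'}, {'update_col': 'other'}]

updated = [
    {**row,
     'new_column_name': new_value if row['update_col'] == replace_val
                        else row['update_col']}
    for row in rows
]
print(updated)
```

Each row either gets the replacement value (when the condition matches) or keeps the original column value, which is exactly what the otherwise branch provides in Spark.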

If you want to do some work on a column and create a new column that is added to the dataframe:

  import pyspark.sql.functions as F
  import pyspark.sql.types as T

  def my_func(col):
      # do stuff to the column here
      return transformed_value

  # if we assume that my_func returns a string
  my_udf = F.UserDefinedFunction(my_func, T.StringType())

  df = df.withColumn('new_column_name', my_udf('update_col'))
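What the UDF does, conceptually, is apply an ordinary Python function to each value of the chosen column. Here is a minimal plain-Python sketch of that idea; the upper-casing transformation and the sample rows are hypothetical stand-ins for whatever my_func would actually do:

```python
# Plain-Python analogue of applying a UDF to one column.
# my_func here just upper-cases a string (a made-up transformation).
def my_func(value):
    return value.upper()

rows = [{'update_col': 'spark'}, {'update_col': 'sql'}]

# Equivalent of df.withColumn('new_column_name', my_udf('update_col')):
updated = [{**row, 'new_column_name': my_func(row['update_col'])}
           for row in rows]
print(updated)
```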

If you want the new column to have the same name as the old column, then you have additional steps:

  df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')
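The drop-then-rename step can be sketched on a plain Python dict to show what happens to the schema; the column values used are hypothetical:

```python
# Plain-Python analogue of df.drop('update_col')
#                           .withColumnRenamed('new_column_name', 'update_col')
row = {'update_col': 'old', 'new_column_name': 'new'}

row.pop('update_col')                           # drop the old column
row['update_col'] = row.pop('new_column_name')  # rename the new column
print(row)  # {'update_col': 'new'}
```

The net effect is that the transformed values end up under the original column name.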

While you cannot modify a column as such, you can operate on a column and return a new dataframe reflecting that change. To do so, first create a UserDefinedFunction implementing the operation to apply, then selectively apply that function only to the targeted column. In Python:

  from pyspark.sql.functions import UserDefinedFunction
  from pyspark.sql.types import StringType

  name = 'target_column'
  udf = UserDefinedFunction(lambda x: 'new_value', StringType())
  new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                           for column in old_df.columns])
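The list comprehension in the select is the key trick: it rebuilds the full column list, swapping in the transformed column only where the name matches. A plain-Python sketch of the same selection logic, with made-up column names and values:

```python
# Plain-Python analogue of the selective select() above.
name = 'target_column'
udf = lambda x: 'new_value'          # stand-in for the Spark UDF

columns = ['a', 'target_column', 'b']
row = {'a': 1, 'target_column': 'old', 'b': 2}

# For each column, apply udf only when the name matches; pass the rest through.
new_row = {col: (udf(row[col]) if col == name else row[col])
           for col in columns}
print(new_row)
```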

new_df now has the same schema as old_df (assuming that old_df.target_column is of StringType as well), but all values in column target_column will be new_value.

