> For the complete documentation index, see [llms.txt](https://cswang.gitbook.io/spark-by-zeppelin/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://cswang.gitbook.io/spark-by-zeppelin/dataframe.md).

# DataFrame

To create a DataFrame from an RDD of Rows, there are two main options:

1\) As already pointed out, you could use toDF() which can be imported by import sqlContext.implicits.\_. However, this approach only works for the following types of RDDs:

RDD\[Int]\
RDD\[Long]\
RDD\[String]\
RDD\[T <: scala.Product]\
(source: Scaladoc of the SQLContext.implicits object)

The last signature actually means that it can work for an RDD of tuples or an RDD of case classes (because tuples and case classes are subclasses of scala.Product).

So, to use this approach for an RDD\[Row], you have to map it to an RDD\[T <: scala.Product]. This can be done by mapping each row to a custom case class or to a tuple, as in the following code snippets:

val df = rdd.map({\
case Row(val1: String, ..., valN: Long) => (val1, ..., valN)\
}).toDF("col1\_name", ..., "colN\_name")\
or

case class MyClass(val1: String, ..., valN: Long = 0L)\
val df = rdd.map({\
case Row(val1: String, ..., valN: Long) => MyClass(val1, ..., valN)\
}).toDF("col1\_name", ..., "colN\_name")\
The main drawback of this approach (in my opinion) is that you have to explicitly set the schema of the resulting DataFrame in the map function, column by column. Maybe this can be done programatically if you don't know the schema in advance, but things can get a little messy there. So, alternatively, there is another option:

2\) You can use createDataFrame(rowRDD: RDD\[Row], schema: StructType) as in the accepted answer, which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:

val rdd = oldDF.rdd\
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)\
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.

## Example of retrieving row values

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user\_id", "category\_id")

val transactions\_with\_counts = transactions\
.groupBy($"user\_id", $"category\_id")\
.count

transactions\_with\_counts.printSchema

// root\
// |-- user\_id: integer (nullable = false)\
// |-- category\_id: integer (nullable = false)\
// |-- count: long (nullable = false)\
There are a few ways to access Row values and keep expected types:

### Pattern matching

import org.apache.spark.sql.Row

transactions\_with\_counts.map{\
case Row(user\_id: Int, category\_id: Int, rating: Long) =>\
Rating(user\_id, category\_id, rating)\
}

### Typed get\* methods like getInt, getLong:

transactions\_with\_counts.map(\
r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))\
)

### getAs method which can use both names and indices:

transactions\_with\_counts.map(r => Rating(\
r.getAs\[Int]\("user\_id"), r.getAs\[Int]\("category\_id"), r.getAs[Long](https://github.com/cswang888/spark-by-zeppelin/tree/73ad63f6074159241fd6065871a876d2711c7abf/2/README.md)\
))\
It can be used to properly extract user defined types, including mllib.linalg.Vector. Obviously accessing by name requires a schema.

### Converting to statically typed Dataset (Spark 1.6+ / 2.0+):

transactions\_with\_counts.as\[(Int, Int, Long)]

## Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table.

## Programmatically Specifying the Schema

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

1.Create an RDD of Rows from the original RDD;\
2.Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.\
3.Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.

## ReadPreference in MongoDB

![Difference readPreference](/files/-M4imimnuflMK3Wi_FD3)


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://cswang.gitbook.io/spark-by-zeppelin/dataframe.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
