Hello,
Do not assume anything. Never. Ever. Especially with SparkR (Apache Spark 2.1.0).
When using the gapply function, you may want to return the key along with the results so that each row is labelled with its group, as in the following function:
countRows <- function(key, values) {
  df <- data.frame(key=key, nvalues=nrow(values))
  return(df)
}
count <- gapplyCollect(data, "keyAttribute", countRows)
SURPRISE. You can’t.
You should get this error:
Error in match.names(clabs, names(xi)): names do not match previous names
Well, that’s weird. Why is this happening?
Actually, key is a list, because you can group by more than one column. When a list is passed to data.frame(), it brings its own column name, which overrides the one you specify. That name depends on the key's value, so two different keys produce data frames with two different column names, and the per-group results cannot be combined. An easy way to fix this is to unlist the key:
countRows <- function(key, values) {
  df <- data.frame(key=unlist(key), nvalues=nrow(values))
  return(df)
}
count <- gapplyCollect(data, "keyAttribute", countRows)
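If you want to see the naming behaviour without a Spark cluster, the following plain R sketch reproduces it. The unnamed one-element lists keyA and keyB are stand-ins I made up for the key that gapply hands to the function, and the row counts are arbitrary.

```r
# Plain R, no Spark needed. keyA and keyB are hypothetical stand-ins for
# the key argument of two different groups.
keyA <- list("a")
keyB <- list("b")

dfA <- data.frame(key = keyA, nvalues = 2L)
dfB <- data.frame(key = keyB, nvalues = 3L)

# The first column is not called "key", and its name differs per group.
print(names(dfA))
print(names(dfB))

# Combining the per-group results therefore fails; keep the message.
combined <- tryCatch(rbind(dfA, dfB), error = function(e) conditionMessage(e))
print(combined)

# With unlist() the key is a plain vector and the name "key" is kept.
fixedA <- data.frame(key = unlist(keyA), nvalues = 2L)
fixedB <- data.frame(key = unlist(keyB), nvalues = 3L)
fixed <- rbind(fixedA, fixedB)
print(fixed)
```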
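MIND THAT THE unlist FIX DOES NOT WORK WHEN YOU GROUP BY MORE THAN ONE COLUMN! In that case unlist(key) returns a vector with one element per grouping column, so the data frame ends up with one row per column instead of one row per group.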
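One possible workaround for several grouping columns, as a minimal sketch: name the elements of the key yourself and let data.frame() expand the list into one column per grouping column. The column names "country" and "year" and the sample groups are hypothetical, and the sketch assumes the key arrives in the same order as the grouping columns. I have only tried it locally on plain lists, not on a cluster.

```r
countRows <- function(key, values) {
  # Hypothetical grouping columns; assumes key has the same order.
  names(key) <- c("country", "year")
  # A named list expands into one column per element, so every group
  # yields the same column names.
  data.frame(key, nvalues = nrow(values), stringsAsFactors = FALSE)
}

# Local stand-ins for two groups, so the function can be tried without Spark.
r1 <- countRows(list("ES", 2016L), data.frame(x = 1:3))
r2 <- countRows(list("FR", 2017L), data.frame(x = 1:5))
result <- rbind(r1, r2)
print(result)

# With SparkR the call would presumably look like this (untested):
# count <- gapplyCollect(data, c("country", "year"), countRows)
```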