How to unnest data and obtain the first element from an array in SparkR? - r

I am new to SparkR and am taking my first steps in data preparation. The dataset is something of this kind. I am trying to subset it and select the significant columns. My question is: how can I select a column from an array element? I tried something like the following, which let me select columns by un-nesting the data, but I couldn't unnest and flatten the array to get its first element.
select.col <- SparkR::select(data,c("parsed.nid","parsed.status","parsed.sections.element[0].name"))

I found a way to resolve this myself. It can be done in two steps:
First, use explode() in SparkR to get one row per element of the list in that column.
Next, use windowPartitionBy() in SparkR to create partitions, and then apply whichever window function fits the requirement, such as row_number(), dense_rank() or rank(). Here we want the first element of the list, so I used row_number().
Snippet:
data.select <- SparkR::select(data,c("parsed.nid","parsed.status","parsed.sections"))
names(data.select) <- c("nid","status","sections")
categories <- SparkR::select(data.select,data.select$nid,data.select$status,explode(data.select$sections))
ws <- SparkR::orderBy(SparkR::windowPartitionBy("nid","status","sections"),"nid")
data.final <- SparkR::mutate(categories,row_num = over(row_number(), ws))
##If we want to get the first element of the array.
data.final <- data.final[data.final$row_num==1,]
Please add your suggestions as well.

Related

Combine lapply and gsub to replace a list of values for another list of values

I am currently looking for a way to simplify searching a column of a dataframe for a vector of values and replacing each of those values with another value (contained in a separate vector). I can run a for loop for this, but it must be possible with the apply family; I'm just not seeing it yet. I am very new to the apply family and could use help.
So far, I've been able to have it replace all instances of the first value in my vector with the new first value in the new vector, it just isn't iterating past the first level. I hope this makes sense. Here is the code I have:
#standardize tank location
old_tank_list <- c("7.C.4","7.C.5","7.C.6","7.C.7","7.C.8","7.C.9","7.C.10","7.C.11")
new_tank_list <- c("7.B.3-4","7.C.3-4","7.C.1-2","7.C.5-6","7.C.7-8","7.C.9-10","7.E.9-10","7.C.11-12")
sapply(df_growth$Tank,function(y) gsub(old_tank_list,new_tank_list,y))
Tank is the name of the column I am trying to replace all of these values within. I haven't assigned it back yet, because I want to test the functionality first. Thanks for any help you can offer.
Hopefully, this image will help. The photo on the left is the column before my function is applied. The column on the right is after. Basically, I just want to batch change text values.
Before and After
library(dplyr)
df %>%
mutate(Tank = recode(Tank, !!!setNames(new_tank_list, old_tank_list)))
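
For a base R alternative, here is a minimal sketch. gsub() only uses the first element of a vector pattern (R warns about this), which is why only the first value was replaced in the question. Exact matching with match() sidesteps that, and it also avoids the dots in the tank names being treated as regex wildcards. The tank vector and the shortened lists below are made-up illustration data standing in for df_growth$Tank and the full lists.

```r
# Shortened versions of the lists from the question
old_tank_list <- c("7.C.4", "7.C.5")
new_tank_list <- c("7.B.3-4", "7.C.3-4")

# Hypothetical stand-in for df_growth$Tank
tank <- c("7.C.4", "7.C.5", "7.A.1", "7.C.4")

# Position of each value in the old list (NA when it is not listed)
idx <- match(tank, old_tank_list)

# Replace matched values, keep unmatched values unchanged
recoded <- ifelse(is.na(idx), tank, new_tank_list[idx])
print(recoded)
```

Values that are not in old_tank_list (here "7.A.1") pass through untouched, so the result can be assigned back to the column once it looks right.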

For loops in R iterates only the last entry

I want to repeat a column vector that has 300 rows about 241 times and concatenate the results. The data can be downloaded from this link.
https://1drv.ms/u/s!AiZLoqatH-p7rD0og-RufSi6fljB
I tried the following code.
d <- read.csv("stack_overflow.csv")
fund_name = d[,1]
fund_name_panel=c()
for (i in 1:300) {x1=rep(fund_name[i], 241); fund_name_Panel=append(x1,fund_name_panel)}
Result: unfortunately, my code repeats only the very last row of the data. How can I repeat each of the 300 rows rather than only the very last?
Any hint is appreciated.
From your description of the problem, you are making a very simple error that a lot of people make when first learning for loops. First, since you are making a new variable (fund_name_panel), you need to create an empty vector with the same length as the vector you will use in the for loop.
fund_name_panel <- numeric(length(fund_name))
Use nrow() instead of length() if fund_name is a data.frame and not a vector.
Secondly, you will need to index by row (i) both the new vector (fund_name_panel) and the vector you are referencing in the for loop (fund_name); see the code below.
fund_name_panel <- numeric(length(fund_name))
for (i in seq_along(fund_name)) {
  # write to position i of the result, read from position i of the input
  fund_name_panel[i] <- fund_name[i]
}
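
If the goal is only to repeat each name 241 times, a loop may not be needed at all. Note also that the loop in the question assigns to fund_name_Panel (capital P) while appending to fund_name_panel, so the accumulator stays empty and only the last iteration survives. Below is a minimal sketch using rep(), assuming fund_name is a plain vector; the three names and the repeat count of 4 are made up to keep the output small.

```r
# Hypothetical stand-in for the 300 fund names
fund_name <- c("A", "B", "C")
n_rep <- 4  # would be 241 for the data in the question

# Each name repeated n_rep times in a row: A A A A B B B B C C C C
panel_each <- rep(fund_name, each = n_rep)

# The whole vector repeated n_rep times: A B C A B C ...
panel_times <- rep(fund_name, times = n_rep)

print(panel_each)
print(panel_times)
```

Which of the two layouts is wanted depends on how the panel is ordered; each = keeps the rows of one fund together, times = cycles through all funds repeatedly.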

Name new dataframes from character vectors - loop

I think this one is easy but I still can't figure it out and I really need help with this. I've looked everywhere but still couldn't find it.
Let's say I have this vector:
filenames <- c("fn1", "fn2", "fn3")
And I want to associate each of them with a data frame that is created by a function at that time
df|name from filenames[i]| <- df
so it would return these dataframes
dffn1
dffn2
dffn3
I hope I made myself clear. My problem is how to create a new data frame and name it according to a list or similar, inside a for loop.
You can use assign to achieve what you want.
for (nms in filenames) {
  assign(paste('df', nms, sep = ''), df)
}
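
As an alternative to creating separate variables with assign(), the data frames can be kept in one named list, which is often easier to loop over later. This is only a sketch: make_df() is a hypothetical placeholder for whatever function builds each data frame.

```r
filenames <- c("fn1", "fn2", "fn3")

# Hypothetical builder; replace with the function that creates the real data
make_df <- function(name) {
  data.frame(source = name, value = seq_len(3), stringsAsFactors = FALSE)
}

# One data frame per file name, stored in a list
df_list <- lapply(filenames, make_df)

# Name the list elements dffn1, dffn2, dffn3
names(df_list) <- paste0("df", filenames)

print(names(df_list))
print(df_list[["dffn2"]])
```

Individual data frames are then available as df_list$dffn1 or df_list[["dffn1"]], and lapply() can process all of them in one call.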

Getting nested elements from a list

I am trying to get nested elements from a list. I can extract the elements using: unlist(pull_lists[[i]]$content[[n]]['sha']), however, it seems that I cannot insert them in a nested list. I have extracted a single element of the list in a gist, which creates the reproducible example below. Here is what I have so far:
library("devtools")
pull_lists <- list(source_gist("669dfeccad88cd4348f7"))
sha_list <- list()
for (i in length(pull_lists)){
for (n in length(pull_lists[[i]]$content)){
sha_list[i][n] <- unlist(pull_lists[[i]]$content[[n]]['sha'])
}
}
How can I insert the elements in a nested fashion?
When I download the content, I get a much more complicated structure than you do. For me, it's not pull_lists[[i]]$content, it's pull_lists[[i]]$value$content[[1 or 2]]$parents$sha. Nothing is populated because there is nothing there to populate (i.e., n = 0).
I've had to deal with similar data structures before. What I found was that it's much easier to search the naming structure after unlisting rather than to figure out the correct sequence of subsets.
Here's an example:
sha_locations <- grep("sha$", names(unlist(pull_lists[[1]])))
unlist(pull_lists[[1]])[sha_locations]
Cleaning up the for loop a bit, this would look like:
sha_list <- lapply(
  pull_lists,
  function(x) unlist(x)[grep("sha$", names(unlist(x)))]
)
Since there are multiple SHAs, and the question only asks for the SHAs at specific positions, you need to extract those SHAs:
sha_list <- sha_list[[1]][attr(sha_list[[1]], "names")=="value.content.sha"]
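
To make the unlist-then-search idea concrete without the gist, here is a small sketch on a made-up nested list. The structure and the SHA strings are invented for illustration and only loosely mimic the shape described above.

```r
# Hypothetical nested structure: one pull with two content entries
pull_lists <- list(
  list(value = list(content = list(
    list(sha = "abc123", parents = list(sha = "p1")),
    list(sha = "def456", parents = list(sha = "p2"))
  )))
)

# Flatten everything; the names record the path to each leaf
flat <- unlist(pull_lists[[1]])

# Keep every leaf whose name ends in "sha"
all_sha <- flat[grep("sha$", names(flat))]

# Drop the parent SHAs to keep only the top-level ones
top_sha <- unname(all_sha[!grepl("parents", names(all_sha))])

# The same result by direct subsetting, when the path is known
direct_sha <- sapply(pull_lists[[1]]$value$content, function(x) x$sha)

print(top_sha)
print(direct_sha)
```

Separately, for (i in length(x)) in the question visits only the last index; for (i in seq_along(x)) visits every element.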

get column from list of dataframes R

I am an R beginner and I am stuck on this problem. I had a dataframe and by using the split() function I have created a list of dataframes, e.g:
dfList <- split(mtcars, mtcars$cyl)
Now I want to retrieve a column of a specific dataframe, e.g. column 2 from dataframe 1, so something like
dfList[1][2]
What I can do right now is create for loops to get inside the data structure. But I can't find a oneliner to do it, if it exists. How can I do that? Thanks in advance!
I'm putting docendo's comment here to close out the question.
If you want to extract an element from a list (and treat it like a data.frame) rather than subset a list (to create a smaller list), you need to use the [[ ]] syntax. Plus, to get a column by index from a data.frame, you either need to use [[ idx ]] or [, idx ]. These are pretty basic indexing operations that you will probably want to review if you will be programming in R. So your "correct" call is probably
dfList[[1]][[2]]
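
A short sketch of the difference between [ ] and [[ ]] on the same list, using the built-in mtcars data from the question:

```r
dfList <- split(mtcars, mtcars$cyl)

# Single brackets return a shorter list, not a data frame
as_list <- dfList[1]

# Double brackets return the element itself, here a data frame
as_df <- dfList[[1]]

# Column 2 of the first data frame, by position and by name
col_by_pos <- dfList[[1]][[2]]
col_by_name <- dfList[["4"]][["cyl"]]

# The same column from every data frame in the list
mpg_by_group <- lapply(dfList, `[[`, "mpg")

print(class(as_list))
print(class(as_df))
print(col_by_pos)
```

The list names come from the split variable, so the groups can also be addressed as "4", "6" and "8" rather than by position.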
