Sparklyr how to view variables - r

Hi, I have a deeply nested JSON file. I used sparklyr to read this JSON file into an object I called "data".
First, I will show what the data structure looks like:
# Database: spark_connection
data
 -a : string
 -b : string
 -c : (struct)
      c1 : string
      c2 : (struct)
           c21 : string
           c22 : string
Something like this. So if I extract "a" using:
data %>% sdf_select(a)
I can view the data inside, like:
# Database: spark_connection
a
<chr>
1 Hello world
2 Stack overflow is epic
THE PROBLEM comes when I use sdf_select() on a deeper structure, i.e.
data %>% sdf_select(c.c2.c22)
Viewing the data inside, I get this
# Database: spark_connection
c22
<list>
1 <list [1]>
2 <list [1]>
3 <list [1]>
4 <lgl [1]>
So if I collect the data, turning the Spark data frame into an R data frame, and view it using:
View(collect(data %>% sdf_select(c.c2.c22)))
The data shows
1 list("Good")
2 list("Bad")
3 NA
How do I turn every entry in each list above into a plain value, so that the column shows Good, Bad, NA instead of the list(...) wrappers?

I was unable to reproduce this. I used
[{"a":"jkl","b":"mno","c":{"c1":"ghi","c2":{"c21":"abc","c22":"def"}}}]
written to a test.json, followed by
spk_df <- spark_read_json(sc, "tmp", "file:///path/to/test.json")
spk_df %>% sdf_schema_viewer()
This seems to match the schema that you provided. However, when I use sparklyr.nested::sdf_select() I get a different result.
spk_df %>% sdf_select(c.c2.c22)
# # Source: table<sparklyr_tmp_7431373dca00> [?? x 1]
# # Database: spark_connection
# c22
# <chr>
# 1 def
where c22 is a character column.
My guess is that in your real data, one of the levels is actually an array of structs. If this is the case, then indexing into an array forces a list wrapping (or else data would need to be dropped). You can resolve this on the Spark side using sdf_explode (a sketch is at the end of this answer), or you can resolve it locally in a variety of ways. For example, using purrr you would do something like:
df <- collect(spk_df)
df %>% mutate(c22 = purrr::map(c22, unlist))
It is possible that you will need to write a function wrapping unlist to deal with different data types in different rows (the NA values are logical).
unlist_and_cast <- function(x) {
  as.character(unlist(x))
}
df %>% mutate(c22 = purrr::map_chr(c22, unlist_and_cast))
would do the trick I think (untested); using map_chr() instead of map() gives you a plain character column rather than another list column.
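For completeness, here is a hedged sketch of the Spark-side route mentioned above. It assumes the array of structs sits at c; adjust the column passed to sdf_explode() to wherever the array actually lives in your schema (untested against your data).
library(sparklyr.nested)
spk_df %>%
  sdf_explode(c) %>%        # one row per array element, so no list wrapping
  sdf_select(c.c2.c22) %>%
  collect()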

Related

How to convert dataframe from Character to Numeric

I know this question may be a duplicate, but I tried all the solutions in:
How to convert entire dataframe to numeric while preserving decimals?
https://statisticsglobe.com/convert-data-frame-column-to-numeric-in-r
but they didn't work.
I imported the Excel data from my computer manually:
File > Import data > Excel, and I set the type of the data as numeric.
I checked my data using
View(Old_data)
and it is indeed of type numeric:
head(Old_data)
QC_G.F9_01_4768 QC_G.F9_01_4765
M95T834 70027.02 69578.19
M97T834 95774.14 81479.30
M105T541 75686.39 68455.65
M109T834 72093.07 70942.65
M111T834_2 77502.98 77527.54
M114T834 68132.06 70296.73
M121T834 52233.05 56074.64
M125T834 44559.99 35831.79
M128T834 59257.48 59574.73
M135T834 105136.55 105274.98
But then I converted rows into columns and columns into rows using R:
New_data <- as.data.frame(t(Old_data))
When I checked my new data using:
View(New_data)
I found that my columns are of type character and not numeric.
I tried to convert New_data to numeric:
New_data_B -> as.numeric(New_data)
I checked my data using
dim(New_data_B)
17 1091
Here's an example of my data:
New_data_B
#> Name MT95T843 MT95T756
#> 1 QC_G.F9_01_4768 70027.02132 95774.13597
#> 2 QC_G.F9_01_4765 69578.18634 81479.29575
#> 3 QC_G.F9_01_4762 69578.18634 87021.95427
#> 4 QC_G.F9_01_4759 68231.14338 95558.76738
#> 5 QC_G.F9_01_4756 64874.12936 96780.77245
#> 6 QC_G.F9_01_4753 63866.65780 91854.35304
#> 7 CtrF01R5_G.D1_01_4757 66954.38799 128861.36163
#> 8 CtrF01R4_G.D5_01_4763 97352.55229 101353.25927
#> 9 CtrF01R3_G.C8_01_4754 61311.78576 7603.60896
#> 10 CtrF01R2_G.D3_01_4760 85768.36117 109461.75445
#> 11 CtrF01R1_G.C9_01_4755 85302.81947 104253.84537
#> 12 BtiF01R5_G.D7_01_4766 61252.42545 115683.73755
#> 13 BtiF01R4_G.D6_01_4764 81873.96379 112164.14229
#> 14 BtiF01R3_G.D2_01_4758 84981.21914 0.00000
#> 15 BtiF01R2_G.D4_01_4761 36629.02462 124806.49101
#> 16 BtiF01R1_G.D8_01_4767 0.00000 109927.26425
#> 17 rt 13.90181 13.90586
I also converted my data to a CSV file and imported it:
Old_data <- as.data.frame(read.csv("data.csv" , sep="," , header=TRUE,stringsAsFactors=FALSE))
And also using :
#install.packages("readxl")
library("readxl")
Old_data <- read_excel("data.xlsx")
I tried the solution suggested by Mr sveer
New_data <- cbind(Name=Old_data[1,],as.data.frame(t(Old_data[-1,])))
It gives this result when I check it with:
head(New_data)
View(New_data)
Name.QC_G.F9_01_4768 Name.QC_G.F9_01_4765
70027.02 69578.19
95774.14 81479.30
75686.39 68455.65
72093.07 70942.65
77502.98 77527.54
68132.06 70296.73
52233.05 56074.64
4559.99 35831.79
59257.48 59574.73
105136.55 105274.98
It deletes the rownames!
I'm just confused by this problem; I think it is because I converted rows into columns and columns into rows.
Please tell me if you need any clarification, and I can also send the data to someone so they can try it.
Thank you very much
Reason why you get character type and not numeric:
Transposing the data will lead to a matrix. A matrix can hold only a single class, i.e. character when the classes are mixed.
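A minimal illustration of that point, using a tiny made-up data frame:
m <- t(data.frame(Name = c("a", "b"), x = c(1, 2)))
class(m)
# [1] "matrix" "array"
typeof(m)
# [1] "character"   <- the numeric column was coerced to character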
Solution:
I am still not sure about the structure of your data. It is always a good idea to add a reproducible example; if the data is large you could also use pastebin, or just reproduce it as described.
I assume that when you load the data via File > Import data > Excel, the first column is called "Name".
To get your desired output (especially rownames) you could try:
df <- setNames(as.data.frame(t(Old_data[,-1])), Old_data[[1]])
If you want to transform the rownames to a column:
tibble::rownames_to_column(df, "Name")
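If any column still comes back as character after the transpose (for example because a stray non-numeric value slipped into the Excel sheet), a hedged follow-up is to coerce every column explicitly. This assumes all columns of df are meant to be numeric:
df[] <- lapply(df, function(col) as.numeric(as.character(col)))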

Extract and match sets from list of filenames

I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the name prior to the -ch# as that information is important. What I want to do, however, is to filter this list of images and return only sets (i.e. r01c02f10p01) which have all four ch values (ch1-4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
Applying this list with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating new variables for each ch value fragments out each file. I am unsure how to use these to actually distinguish whether an individual image has all four ch values to meaningfully filter my images. I'm kind of at a loss, as the other sources I've seen have issues that don't quite match this problem.
Earlier, I was able to remove all images with ch5 from my image set like this. I was thinking this may be helpful in trying to filter only images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to go from a random assortment of files without complete ch values (i.e. maybe only ch1 and ch2, or ch3 and ch4) to an assortment which only contains files from complete sets (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files() or something similar, you can create a data frame of filenames and use a regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that all elements of an expected set (I put this in ch_set, but there might be another way you need to do this) occur in each group's set of ch values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
  tidyr::extract(filename, into = c("group", "ch"), regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
  mutate(ch = as.numeric(ch)) %>%
  group_by(group) %>%
  filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#> filename group ch
#> <chr> <chr> <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01 1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01 2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01 3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01 4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01 1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01 2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01 3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01 4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note: this worked even without the mutate line where I converted ch to numeric (i.e. comparing character versions of those numbers to regular numeric versions), because under the hood %in% converts to matching types. That didn't seem totally safe if you needed to scale this, so I converted them to matching types explicitly.
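If you would rather avoid the dplyr/tidyr dependency, here is a hedged base-R sketch of the same filtering, assuming the same files vector as above:
grp  <- sub("-ch[0-9]\\.tiff$", "", files)                      # set id, e.g. "r01c01f11p01"
ch   <- as.integer(sub("^.*-ch([0-9])\\.tiff$", "\\1", files))  # channel number
keep <- as.logical(ave(ch, grp, FUN = function(x) all(1:4 %in% x)))
files[keep]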

Can I filter out certain rows/records when retrieving data from Salesforce using the RForcecom function "rforcecom.retrieve"?

Thanks for helping me with my first Stack Overflow question. I am trying to retrieve all the data from several fields in an Object called "Applied Questionnaire"; however, I do not want to retrieve any records that have the name "Training Site".
Currently, this is my code, which works:
quarterly_site_scores = rforcecom.retrieve(session, "AppliedQuestionnaire__c",
                                           c("Site__c", "Site_Name__c", "Total_Score__c")) %>%
  rename(site_id = Site__c, site_name = Site_Name__c)
quarterly_site_scores = quarterly_site_scores[!(quarterly_site_scores$site_name == "TRAINING PARK SITE" |
                                                quarterly_site_scores$status != "Completed"), ]
However, I'm wondering if there's a more elegant, streamlined solution here. Can I filter at the same time I retrieve? Or is there a better way to filter here?
(I've simplified the code here - I'm actually pulling in about ten fields and filtering on about five or six criteria, just in this one example).
Thank you.
Adding what the OP discovered as an answer, using the salesforcer package, which returns the SOQL result set as a tbl_df.
library(salesforcer)
library(tidyverse)
sf_auth(username, password, security_token)
# list all object names in a Salesforce org
ped_objects <- sf_list_objects() %>% .$sobjects %>% map_chr(~pluck(., "name"))
# list all the fields on a particular object
fields <- sf_describe_object_fields('AppliedQuestionnaireBundle2__c')
# write a query to retrieve certain records from that object
site_scores_soql <- "SELECT Site__c,
                            Site_Name__c,
                            Total_Score__c
                     FROM AppliedQuestionnaireBundle2__c
                     WHERE Site_Name__c != 'GENERIC SITE'
                       AND Site_Name__c != 'TRAINING PARK SITE'
                       AND Status__c = 'Completed'"
# run the query
quarterly_site_scores <- sf_query(site_scores_soql)
quarterly_site_scores
#> # A tibble: 3 x 3
#> Site__c Site_Name__c Total_Score__c
#> <chr> <chr> <dbl>
#> 1 A Site Name1 78
#> 2 B Site Name2 52
#> 3 C Site Name3 83
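If you want to stay with RForcecom rather than switch packages, the same SOQL string can, if I remember the API correctly, be passed to rforcecom.query(), which also does the filtering server-side:
# hedged sketch: run an arbitrary SOQL query through RForcecom
quarterly_site_scores <- rforcecom.query(session, site_scores_soql)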

Need to use jsonlite to handle ndjson message list using stream_in() and stream_out()

I have an ndjson data source. For a simple example, consider a text file with three lines, each containing a valid json message. I want to extract 7 variables from the messages and put them in a dataframe.
Please use the following sample data in a text file. You can paste this data into a text editor and save it as "ndjson_sample.txt"
{"ts":"1","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":12353,\"Var5\":1,\"Var6\":\"abc\",\"Var7\":\"x\"}"}
{"ts":"2","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-68,\"Var4\":4528,\"Var5\":1,\"Var6\":\"def\",\"Var7\":\"y\"}"}
{"ts":"3","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":-5409,\"Var5\":1,\"Var6\":\"ghi\",\"Var7\":\"z\"}"}
The following three lines of code accomplish what I want to do:
file1 <- "ndjson_sample.txt"
json_data1 <- ndjson::stream_in(file1)
raw_df_temp1 <- as.data.frame(ndjson::flatten(json_data1$ct))
For reasons I won't get into, I cannot use the ndjson package. I must find a way to use the jsonlite package to do the same thing using the stream_in() and stream_out() functions. Here's what I tried:
con_in1 <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df){
  jsonlite::stream_out(df, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
close(con_out1)
con_in2 <- file(tmp, open = "rt")
raw_df_temp2 <- jsonlite::stream_in(con_in2)
This is not giving me the same data frame as a final output. Can you tell me what I'm doing wrong and what I have to change to make raw_df_temp1 equal raw_df_temp2?
I could potentially solve this with the fromJSON() function operating on each line of the file, but I'd like to find a way to do it with the stream functions. The files I will be dealing with are quite large, so efficiency will be key. I need this to be as fast as possible.
Thank you in advance.
Currently under ct you'll find a string that can (subsequently) be fed to fromJSON independently, but it will not be parsed as such. Setting your stream_out(stream_in(...),...) test aside for the moment (a sketch of fixing it is at the end of this answer), here are a couple of ways to read it in:
library(jsonlite)
json <- stream_in(file('ds_guy.ndjson'), simplifyDataFrame=FALSE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(
  ts = sapply(json, `[[`, "ts"),
  do.call(rbind.data.frame, lapply(json, function(a) fromJSON(a$ct)))
)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
Calling fromJSON on each string might be cumbersome, and with larger data that slow-down is exactly why stream_in exists, so if we can capture the "ct" component into a stream of its own, then ...
writeLines(sapply(json, `[[`, "ct"), 'ds_guy2.ndjson')
(There are far more efficient ways to do this with non-R tools, including perhaps a simple
sed -e 's/.*"ct":"\({.*\}\)"}$/\1/g' -e 's/\\"/"/g' ds_guy.ndjson > ds_guy2.ndjson
though this makes a few assumptions about the data that may not be perfectly safe. A better solution would be to use jq, which should "always" correctly parse proper json, then a quick sed to replace escaped quotes:
jq '.ct' ds_guy.ndjson | sed -e 's/\\"/"/g' > ds_guy2.ndjson
and you can do that with system(...) in R if needed.)
From there, under the assumption that each line will contain exactly one row of data.frame data:
json2 <- stream_in(file('ds_guy2.ndjson'), simplifyDataFrame=TRUE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(ts=sapply(json, `[[`, "ts"), json2)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
NB: in the first example, "ts" is a factor, all others are character because that's what fromJSON gives. In the second example, all strings are factor. This can easily be addressed through judicious use of stringsAsFactors=FALSE, depending on your needs.
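And since the question specifically asked about the stream_in()/stream_out() callback route, here is a hedged, untested sketch of that approach: parse the nested "ct" string inside the handler before re-streaming, so that the second stream_in() already sees flat records.
con_in1  <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df) {
  # df holds one page of raw records; parse each embedded "ct" JSON string
  parsed <- lapply(df$ct, function(s) as.data.frame(jsonlite::fromJSON(s),
                                                    stringsAsFactors = FALSE))
  flat <- cbind(ts = df$ts, do.call(rbind, parsed))
  jsonlite::stream_out(flat, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
close(con_out1)
raw_df_temp2 <- jsonlite::stream_in(file(tmp, open = "rt"))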

Subset by function's variable using $variable

I am having trouble subsetting a list using a variable of my function.
rankhospital <- function(state, outcome, num = "best") {
  # code here
  e3 <- dataframe(..., state.name, ...)
  if (num == "worst") {
    return(worst(state, outcome))
  } else if ((num %in% b == "TRUE" & outcome == "heart attack") == "TRUE") {
    sep <- split(e3, e3$state.name)
    hosp.estado <- sep$state
    hospital <- hosp.estado[num, 1]
    return(as.character(hospital))
I split my data frame by state (which is a variable of my function)
But hosp.estado<-sep$state doesn't work. I have also tried as.data.frame.
The function call rankhospital("NY", ...) returns character(0).
When I replace sep$state with sep$"NY" directly in the code it works perfectly, so I guess the problem is that I can't use a function's variable with $. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to a column with the name given by state using: sep[state] or sep[[state]]. The first produces a data frame with one column named based on the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to pull the data frame that matches your function's state parameter out of your sep list. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"
