bucket data in R Data frame - r

I have a code in python that creates bucket dataframe from a simple dataframe. I want to replicate in R. Till now I understand that I can use transform function but I am unable to do it. can anyone help me in this?
This is dataframe
Here is the bucketing code in python

I achieved this with below lines of code
bins = seq(0,max(df_s$wordCount)+input$bins,by = 5)
df_s <-transform(df_s,group = cut(df_s$wordCount,bins))
df <- aggregate(df_s$Freq, by=list(Category=df_s$group), FUN=sum)
#Ronak, thanks for your advice.

Related

R extract labels from a rda data frame

I am looking at some data downloaded from ICPSR and I am specifically using their R data file (.rda). Beneath the column name of each data file, there are some descriptions of the variables (a.k.a labels). An example is attached as well.
I tried various ways to get the label including base::label, Hmisc::label, labelled::var_label, sjlabelled::get_label and etc. But none worked.
So I am asking any ideas on how to extract the labels from this data file?
Thanks very much in advance!
this could work using purrr
#load library
library(purrr)
#get col n
n <- ncol(yourdata)
#extract labels as vector
labels <- map_chr(1:n, function(x) attr(yourdata[[x]], "label") )
This worked for me (I am working with ICPSR 35206):
attributes(yourdata)$variable.labels -> labels
Make sure that your attribute referring to the labels is actually called "variable.labels".

How to format data to automate table production?

I would be very grateful for any guidance on how to use the xltabr package to automatically format tables in r, please:
https://github.com/moj-analytical-services/xltabr
In SPSS for example, I would apply the relevant weight and then run a cross tab on the raw data e.g var1*var2.
How would you go about doing this in r so that the package recognises it to produce the table?
Much appreciated.
You need to create/ read in the dataframe which you want to use first.
dat <- read.spss("mydataframe.sav")
Then you need to put it in the format you want: As in your example of crosstables, you can do this:
library(reshape2)
ct <- reshape2::dcast(iris, variable1 ~ variable2, fun.aggregate = length)
#depending on what data you want, you can change the fun.aggreagte function (e.g. sum or mean).
Then you can use the xltabr package to prepare the excel file by creating a Workbook:
wb <- xltabr::auto_crosstab_to_wb(ct)
Then you can save it as .xlsx file:
library(openxlsx)
openxlsx::saveWorkbook(wb, file = "crosstable.xlsx", overwrite = T)
I hope this helps

Darksky api in R loop not working

I am a huge R fan, but it never seems to work out for me, I am trying to use an API to get weather data, but I cannot write the loop. I have all the codes in the right format, but when I import the file into r, the cells appear like
-33.86659241, 151.2081909, \"2014-10-01T02:00:00"\
and this is preventing me from running the code. So rather than using a loop I need to use a mailmerge to create 5000 lines of code. Any help would be really appreciated.
tmp <- get_forecast_for(-33.86659241, 151.2081909, "2014-10-01T02:00:00", add_headers=TRUE)
fdf <- as.data.frame(tmp)
fdf$ID <- "R_3nNli1Hj2mlvFVo"
fd <- rbind(fd,fdf)
Here is the code with loop -
df <- read.csv("~/Machine Learning/Darksky.csv", header=T,sep=",", fill = TRUE)
for(i in 1:length(df$DarkSky)){
fdf <- get_forecast_for(df$LocationLatitude[i], df$LocationLongitude[i], df$DarkSky[i], add_headers=TRUE)
fdf <- as.data.frame(fdf)
fdf <- fdf[1:2,]
fd <- rbind(fd,fdf)
}
I also wanted to rbind the retreived data onto a dataframe but it does not work. I also wanted to cbind the identifier, which would be the value in df$DarkSky[i], but it will not work.
CSV -
LocationLatitude LocationLongitude DarkSky
-33.86659241 151.2081909 "2014-10-01T02:00:00"
The get_forecast_for function takes three parameters, the latitude, longitude and the date, structured as above, I have the loop working for latitude and longitude, but the time/date is not working.

calculate percentage in R

I am a beginner in R and R is for me actually only the means to analyse my statistical data, so I am far from being a programmer. I need some help with Building percentages of my variables from an Excel sheet. I Need R.total with R.Max as 100% base. this is what I did:
DB <- read_excel("WechslerData.xlsx", sheet=1, col_names=TRUE,
col_types=NULL, na="", skip=0)
I wanted to to use prop.table
but this dose not work with me. than I tried to make data frame
R.total <- DB$R.total
R.max <- DB$R.max
DB.rus <- data.frame(R.total, R.max)
but prop.table still dose not work. Can somebody give me a hint?
Not really sure what you want, but for this mock data.
r.total <- runif(100,min=0, max=.6) # generate random variable
r.max <- runif(100,min=0.7, max=1) # generate random variable
df <- data.frame(r.total, r.max) # create mock data frame
You could try
# create a new column which is the r.total percentage of r.max
df$percentage <- df$r.total / df$r.max
Hope it helps.

Is there a way to parallelize summary functions running over loop?

For an input data frame
input<-data.frame(col1=seq(1,10000),col2=seq(1,10000),col3=seq(1,10000),col4=seq(1,10000))
I have to run the following summaries stored in another Data frame
summary<-data.frame(Summary_name=c('Col1_col2','Col3_Col4','Col2_Col3'),
ColIndex=c("1,2","3,4","2,3"))
#summary
Summary_name ColIndex
Col1_col2 1,2
Col3_Col4 3,4
Col2_Col3 2,3
I have the following function to run the aggregates
loopSum<-function(input,summary){
for(i in seq(1,nrow(summary))){
summary$aggregate[i]<-sum(input[,as.numeric(unlist(str_split(summary$ColIndex[i],',')))])}
return(summary)
}
My requirement is to run the sum as used in loopSum only in parallel, ie I would like to run all the summaries in one shot and thus reduce the total time taken for the function to create the summaries. Is there a way to do this?
My actual scenarios requires me to create summary statistics over hundreds of columns for each Summary_name in summary data.frame, I am looking for the most optimized way to do this. Any help is much appreciated.
Does it improve the running time?
library(tidyr)
input1 <- colSums(input)
summary1 <- separate(summary, "ColIndex", into=c("X1", "X2"), sep=",", convert = TRUE)
summary$aggregate <- input1[summary1$X1] + input1[summary1$X2]

Resources