Creating multiple rows in a dataframe with a single sampling command - r

I want to create a dataframe consisting of a hundred rows, each row representing the output of this code:
sample(seq(100, 1000),20,replace=T)
Is there a way to have R replicate the above code 100 times and create a dataframe from the results?

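We can use replicate. It returns a matrix with one column per replication, so transpose it with t() to get one row per replication, then wrap it in as.data.frame to get a dataframe rather than a matrix:
as.data.frame(t(replicate(100, sample(seq(100, 1000), 20, replace = TRUE))))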
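A quick way to check the shapes involved, as a minimal sketch (the names draws and df1 and the seed are only illustrative, not part of the original answer):

```r
set.seed(42)  # only so the example is reproducible

# replicate() simplifies its results to a matrix: one column per replication
draws <- replicate(100, sample(seq(100, 1000), 20, replace = TRUE))
dim(draws)    # 20 rows, 100 columns

# transpose so each replication becomes a row, then convert to a data frame
df1 <- as.data.frame(t(draws))
dim(df1)      # 100 rows, 20 columns
```

as.data.frame names the columns V1 to V20 by default; they can be renamed afterwards with names(df1) <- ... if needed.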

Related

R Beginner: how to combine two variables and merge them into a new dataframe

If I have the mean and the standard error of the mean (SE) for a particular set of numbers, how would I go about combining the two values into one dataframe? For example, I have the variable mean_boeing (for the average) and stde_boeing (for the standard error) and I want to combine these two into one dataframe. Ultimately, I will be doing this for several other variables, combining them all into one big dataframe so that I can graph them in ggplot.
Thanks
We can use data.frame() to combine the two variables into a single data.frame:
df1 <- data.frame(mean_boeing, stde_boeing)
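For the later goal of plotting several variables, one row per group tends to be easier to work with than one column per statistic. A minimal sketch, assuming made-up values and a hypothetical second group (mean_airbus, stde_airbus) that does not appear in the question:

```r
# Hypothetical values, for illustration only
mean_boeing <- 5.2; stde_boeing <- 0.4
mean_airbus <- 4.8; stde_airbus <- 0.3

# One row per group, with the mean and standard error as columns
summary_df <- data.frame(
  group = c("boeing", "airbus"),
  mean  = c(mean_boeing, mean_airbus),
  se    = c(stde_boeing, stde_airbus)
)
summary_df
```

A frame in this shape can be passed to ggplot with aes(x = group, y = mean), using geom_errorbar(aes(ymin = mean - se, ymax = mean + se)) for the error bars.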

ftable() fails because r can't generate a table with more than 2^31 elements

I encountered a problem using the ftable() function in R.
I have large data frames from which I want to delete all duplicated rows, which is simply done with:
distinct(my_df)
I also want to count how many times each row appears in the dataframe, which can be done with:
my_ftab <- as.data.frame(ftable(my_df))
my_ftab <- arrange(my_ftab[my_ftab$Freq>0,],desc(Freq))
This returns a data frame showing the distinct rows and how many times each occurs.
When the size of my_df exceeds approx. 1000 * 30, it stops working because R can't produce tables with more than 2^31 elements, which apparently would be necessary for some intermediate calculation step.
So my question is: is there a function that produces output similar to ftable, but without this limitation?
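ftable() builds a cell for every possible combination of values across all columns, so the table size is the product of the number of distinct values per column, which grows very quickly. Counting only the combinations that actually occur avoids this. A minimal base-R sketch, assuming a small made-up my_df (the Freq helper column is added only for counting, and rows containing NA are dropped by aggregate's default na.action):

```r
# Small made-up example data
my_df <- data.frame(a = c(1, 1, 2, 2, 2),
                    b = c("x", "x", "y", "y", "z"))

# Add a column of ones and sum it within each distinct row
counts <- aggregate(Freq ~ ., data = cbind(my_df, Freq = 1L), FUN = sum)

# Sort by descending frequency, as in the question
counts <- counts[order(-counts$Freq), ]
counts
```

The result has one row per distinct row of my_df plus a Freq column, so its size is bounded by nrow(my_df) rather than by the number of possible combinations.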

Adding a column with function values to Spark dataframes with SparkR

I am using SparkR on a project that includes R and Spark in its technology stack.
I have to create new columns with boolean values returned from validation functions. I can do this easily with Spark dataframes and a single expression like:
sdf1$result <- sdf1$value == sdf2$value
The problem arises when I have to compare two dataframes of different lengths.
What is the best way to apply a function to the sdf1 and sdf2 dataframes and assign the result to a new column of sdf1? Let's suppose that I want to generate a column with the minimum length between sdf1 and sdf2.
If you have dataframes of different lengths, I assume that you have some column(s) that determine how to line up the values between the two dataframes. You will have to perform a join between the two dataframes on these columns (see SparkR::merge / SparkR::join) and then do your comparison operation on the resulting dataframe to create your new column.

How to copy multiple columns to a new dataframe in R

I have a data set (df2) with 400 columns and thousands of rows. The columns all have different names, but each name ends in either 'typeP' or 'typeR'. They are not ordered sequentially (e.g. P,P,P,P,R,R,R,R) but randomly (P,P,R,R,R,P,R,P etc). I want to create a new data frame with just those columns that have 'typeP' in their names.
I'm very new to R and so far I have only managed to find the positions of those columns using: grep("typeP",colnames(df2)). Any help would be appreciated!
After we get the index, we can use it to subset the initial dataset:
df3 <- df2[grep("typeP",colnames(df2))]
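Because the question says the marker sits at the end of each name, anchoring the pattern with $ guards against names that merely contain 'typeP' somewhere else. A minimal sketch with made-up column names (all hypothetical, including the deliberately awkward typeP_x_typeR):

```r
# Made-up data; the last column contains "typeP" but ends in "typeR"
df2 <- data.frame(a_typeP = 1:3, b_typeR = 4:6,
                  c_typeP = 7:9, typeP_x_typeR = 10:12)

# "typeP$" matches only names that end in typeP;
# drop = FALSE keeps a data frame even if a single column matches
df3 <- df2[, grepl("typeP$", colnames(df2)), drop = FALSE]
names(df3)
```

With an unanchored grep("typeP", ...) the typeP_x_typeR column would also be selected.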

Split dataframe based on one column in r

I have a huge dataframe of around 1M rows and want to split it into ranges based on the values in one column.
Example dataframe:
length<-sample(rep(1:400),100)
var1<-rnorm(1:100)
var2<-sample(rep(letters[1:25],4))
test<-data.frame(length,var1,var2)
I want to split the dataframe based on length at different ranges (e.g. all rows with length between 1 and 50).
range_length<-list(1:50,51:100,101:150,151:200,201:250,251:300,301:350,351:400)
I can do this by subsetting the dataframe, e.g.: test1<-test[test$length>1 &test$length<50,]
But I am looking for a more efficient way using "split" (just one line).
range = seq(0,400,50)
split(test, cut(test$length, range))
But do heed Justin's suggestion and look into using data.table instead of data.frame. I'll also add that it's very unlikely that you actually need to split the data.frame/table.
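To see what the cut/split combination produces, here is a minimal sketch on data shaped like the question's example (the seed and the names breaks and pieces are illustrative; breaks is used instead of range to avoid masking base::range):

```r
set.seed(1)  # only so the example is reproducible
test <- data.frame(length = sample(1:400, 100),
                   var1   = rnorm(100),
                   var2   = sample(rep(letters[1:25], 4)))

# cut() assigns each length to a half-open interval (0,50], (50,100], ...
breaks <- seq(0, 400, 50)
pieces <- split(test, cut(test$length, breaks))

names(pieces)        # interval labels, one per piece
sapply(pieces, nrow) # number of rows that fell into each interval
```

Each element of pieces is an ordinary data frame, so a single range can be pulled out with pieces[["(0,50]"]]. Note that the intervals include their upper bound and exclude their lower bound, unlike the strict inequalities in the question's subsetting example.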
