Randomly select 10 percent of the data from a whole data set in R

For my project, I have a data set with 1296765 observations of 23 columns, and I want to take just 10% of this data at random. How can I do that in R?
I tried the code below, but it sampled only 10 rows, whereas I want to randomly select 10% of the data. I am a beginner, so please help.
library(dplyr)
x <- sample_n(train, 10)

Here is a function from dplyr that selects rows at random by a specified proportion:
dplyr::slice_sample(train,prop = .1)

In base R, you can subset by sampling a proportion of nrow():
set.seed(13)
train <- data.frame(id = 1:101, x = rnorm(101))
train[sample(nrow(train), nrow(train) / 10), ]
id x
69 69 1.14382456
101 101 -0.36917269
60 60 0.69967564
58 58 0.82651036
59 59 1.48369123
72 72 -0.06144699
12 12 0.46187091
89 89 1.60212039
8 8 0.23667967
49 49 0.27714729
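
The original attempt returned 10 rows because the second argument of sample_n() is a row count, not a percentage. A minimal base R sketch that makes the 10% explicit is shown here; the 1000-row train data frame and the names n_keep and sampled are made up for illustration:

```r
set.seed(13)  # only so the example is reproducible

# Hypothetical stand-in for the real 1296765-row data set
train <- data.frame(id = 1:1000, x = rnorm(1000))

# Turn the proportion into a whole number of rows
n_keep <- round(0.10 * nrow(train))

# Draw n_keep distinct row indices and subset the data frame
sampled <- train[sample.int(nrow(train), n_keep), ]

nrow(sampled)  # 100
```

Rounding explicitly avoids relying on how sample() treats a fractional size such as 10.1.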

Related

How to create a messy_impute() function that imputes NA values in messy data with mean or median?

I have the following data frame for a student with homework and exam scores.
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 NA 41 61 11 A
The Homework_10 score is missing, and I need to create a function to impute the NA value with mean or median.
The function messy_impute should have the following arguments:
data : data frame or tibble to be imputed.
center : whether to impute using mean or median.
margin : whether to use the row or the column to impute the value (1 = use row, 2 = use column).
For example,
messy_impute(student1,mean,1) should print out
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 **62** 41 61 11 A
since the mean of the rest of the homework is 62.
And, if the mean of the columns (other students) in section A for homework 10 is 50, then
messy_impute(student1,mean,2) should print out
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 **50** 41 61 11 A
since the mean of columns in section A is 50.
Please note that if the margin is 2, the calculation should be done within the same section.
I'm really stuck on defining this function.
Base R solution:
# Define function to Impute a row-wise mean (assumes one observation per student):
row_wise_mean_impute <- function(df){
grade_df <- df[,names(df) != "studid"]
return(cbind(df[,c("studid"), drop = FALSE],
replace(grade_df, is.na(grade_df), apply(grade_df, 1, mean, na.rm = TRUE))))
}
# Apply function:
row_wise_mean_impute(student1)
Data:
x <- c(rnorm(85, 50, 3), rnorm(15, 50, 15))
student1 <- cbind(studid = 1010101, data.frame(t(x)))
student1[, 10] <- NA_real_
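
The function above covers only the row-wise mean. A rough sketch of a messy_impute() with the three arguments from the question could look like the following. It assumes that only columns whose names start with "Homework_" are imputed and that a Section column exists; the scores data frame is invented for the example:

```r
messy_impute <- function(data, center = mean, margin = 1) {
  hw_cols <- grep("^Homework_", names(data), value = TRUE)
  orig <- data  # always compute from the original, un-imputed values

  for (col in hw_cols) {
    for (i in which(is.na(orig[[col]]))) {
      if (margin == 1) {
        # Row-wise: the same student's other homework scores
        vals <- unlist(orig[i, setdiff(hw_cols, col), drop = FALSE])
      } else {
        # Column-wise: the same homework, restricted to the student's section
        vals <- orig[[col]][orig$Section == orig$Section[i]]
      }
      data[i, col] <- center(vals, na.rm = TRUE)
    }
  }
  data
}

# Hypothetical data: student 1 is missing Homework_3
scores <- data.frame(
  UID        = c(1, 2, 3),
  Homework_1 = c(10, 20, 30),
  Homework_2 = c(30, 40, 50),
  Homework_3 = c(NA, 60, 90),
  Section    = c("A", "A", "B")
)

by_row <- messy_impute(scores, mean, 1)    # Homework_3 for student 1 becomes 20
by_col <- messy_impute(scores, median, 2)  # Homework_3 for student 1 becomes 60
```

Passing the function itself (mean or median) as center keeps the call in the same shape as messy_impute(student1, mean, 1) in the question.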

Sampling from an R Dataframe [duplicate]

I have a dataframe with various real estate listings similar to the following.
ADDRESS PRICE ZIP ...
123 Main St 400,000 45678
23 Green Ln 380,000 45670
29 Green Ln 385,000 45670
...
I want to perform a stratified random sample for a testing dataset. In other words, I want to take ~30% of the entries from each ZIP code and separate them into a new dataset. I am not familiar with R dataframes, so how would I perform such an operation?
I've used the sample function like so:
sample(c(1:103), size=31, replace = F)
8 85 5 83 66 46 39 75 101 94 10 68 63 74 22 86 42
59 52 97 62 11 44 96 88 28 9 36 2 78 49
But how do I put these specific rows into a new dataframe?
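For stratified sampling you can use the createDataPartition function from the caret package, passing the variable you want to stratify by (in your case ZIP). If ZIP is stored as a number, convert it to a factor first so that each ZIP code is treated as its own class. Using [[1]] selects the first element of the returned list, which contains the row indices needed for the split. Afterwards, you subset your original dataset by selecting only the rows given by train_index. With p = 0.7, about 70% of the rows go to the training set and the remaining 30% or so to the test set: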
train_index <- caret::createDataPartition(your_data$ZIP, p = 0.7)[[1]]
train_data <- your_data[train_index,]
test_data <- your_data[-train_index,]
The dplyr solution would be this one, I believe:
train_set <- df %>%
group_by(ZIP) %>%
sample_frac(0.3)
It will return a data frame containing a random 30% of the rows in each ZIP group.
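
To answer the literal question of how to put sampled row numbers into a new data frame: index the data frame with them. Here is a small base R sketch that also does the per-ZIP sampling without extra packages; the listings data and all variable names are invented for the example:

```r
set.seed(42)  # only so the example is reproducible

# Hypothetical listings: 10 rows in each of two ZIP codes
listings <- data.frame(
  ADDRESS = paste(1:20, "Main St"),
  PRICE   = seq(300000, 395000, by = 5000),
  ZIP     = rep(c("45670", "45678"), each = 10)
)

# Row numbers grouped by ZIP code
idx_by_zip <- split(seq_len(nrow(listings)), listings$ZIP)

# Sample about 30% of the row numbers within each ZIP code
test_idx <- unlist(lapply(idx_by_zip, function(idx) {
  idx[sample.int(length(idx), size = round(0.3 * length(idx)))]
}))

# Indexing by row number builds the new data frames
test_data  <- listings[test_idx, ]
train_data <- listings[-test_idx, ]
```

Sampling positions with sample.int() and then indexing idx sidesteps the surprise that sample(x, n) treats a single number x as 1:x.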

How do I plot from data frames?

From the following code, I got a data frame in R. I am trying to plot the data frame; however, I am only interested in the score they got on the Final. So I want the x-axis to be the number of students, which is 6, since that's how many data points there are, and I want the y-axis to be Final. Is there a way to do this from just the data frame?
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35), Final=c(63,87,89,45,99,18))
Output listed below:
Score1 Score2 Final
1 100 56 63
2 36 68 87
3 58 68 89
4 77 98 45
5 99 15 99
6 92 35 18
Or will I have to do something like this instead? But this gives me an error that the lengths are not the same.
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35))
Final=c(63,87,89,45,99,18)
f.data <- cbind(data,Final)
b <- 6
plot(b,Final)
Use the following:
library(ggplot2)
qplot( x = 1:6, y = data$Final)
The code below can do the trick.
plot(data$Final)
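
As for the error in the question: plot(b, Final) fails because b is the single value 6 while Final holds six values, and plot() needs one x value per y value. A short sketch that builds the x-axis from the data frame itself (the name student is just for illustration):

```r
data <- data.frame(Score1 = c(100, 36, 58, 77, 99, 92),
                   Score2 = c(56, 68, 68, 98, 15, 35),
                   Final  = c(63, 87, 89, 45, 99, 18))

# One x value per row: 1, 2, ..., 6
student <- seq_len(nrow(data))

# Same picture as plot(data$Final), but with explicit axis labels
plot(student, data$Final, xlab = "Student", ylab = "Final", type = "b")
```

Using seq_len(nrow(data)) rather than a hard-coded 1:6 keeps the plot correct if rows are added later.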

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine.
The first is akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
Alternatively, there is the package BBmisc, which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completeness, I include TUSHAr's solution as well; as always, there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
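
The question asked for a version that accepts the names of the columns to transform and returns the full data set. One way to sketch that in base R is below; normalize_cols and its cols argument are names chosen for this example, not part of any package:

```r
normalize_cols <- function(data, cols) {
  # Rescale each named column to [0, 1]; all other columns are left untouched.
  # A constant column would divide by zero and produce NaN.
  data[cols] <- lapply(data[cols], function(x) {
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1])
  })
  data
}

myData <- data.frame(
  RecNo = c(41, 43, 47, 49, 54, 58, 61, 64, 68),
  Var1  = c(800, 140, 60, 687, 570, 1250, 211, 96, 687),
  Var2  = c(201.8, 39, 20.24, 77, 135, 467, 52, 117.3, 77),
  Var3  = c("Y", "N", "N", "Y", "Y", "N", "N", "N", "Y")
)

myResult <- normalize_cols(myData, c("Var1", "Var2"))
```

RecNo and Var3 come back unchanged, so there is no need to convert the record numbers to factors first.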

How can I get column data to be added based on a group designation using R?

The data set that I'm working with is similar to the one below, although the example is much smaller (the real data has tens of thousands of rows). I haven't been able to figure out how to get R to add up column data based on the group number. Essentially, I want to get the number of green(s), blue(s), and red(s) added up for all of group 81 and all of group 66 separately, and then use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.
One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
Both of the answers above are perfect examples of how to address this type of problem. Two other options exist within reshape and plyr:
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that @Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
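
The question also mentions turning the group sums into percentages. A minimal sketch, assuming each colour should be expressed as a share of the group's Total (sums and pct are names made up here):

```r
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(text = txt, header = TRUE)

# Column sums per group, as in the aggregate() answer
sums <- aggregate(. ~ Group, data = dat, FUN = sum)

# Each colour as a percentage of the group's total
pct <- data.frame(
  Group = sums$Group,
  100 * sums[c("Green", "Blue", "Red")] / sums$Total
)
```

Dividing the three colour columns by sums$Total works row by row because the vector has one entry per group.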

Resources