How do I plot from data frames?

From the following code, I got a data frame in R. I am trying to plot the data frame; however, I am only interested in the score each student got on the Final. So I want the x-axis to be the student number, 1 through 6, since that's how many data points there are, and I want the y-axis to be Final. Is there a way to do this from just the data frame?
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35), Final=c(63,87,89,45,99,18))
Output listed below:
  Score1 Score2 Final
1    100     56    63
2     36     68    87
3     58     68    89
4     77     98    45
5     99     15    99
6     92     35    18
Or will I have to do something like this instead? But this gives me an error that the lengths are not the same.
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35))
Final=c(63,87,89,45,99,18)
f.data <- cbind(data,Final)
b <- 6
plot(b,Final)

Use the following:
library(ggplot2)
qplot(x = 1:6, y = data$Final)

The code below does the trick; when plot() is given a single vector, it plots the values against their index (1 to 6 here).
plot(data$Final)
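Either way, base R indexes the points 1 through 6 on the x-axis automatically. A minimal sketch with explicit axis labels (the label text is my own choice):

```r
# Data frame from the question
data <- data.frame(Score1 = c(100, 36, 58, 77, 99, 92),
                   Score2 = c(56, 68, 68, 98, 15, 35),
                   Final  = c(63, 87, 89, 45, 99, 18))

# x = student index (1..6), y = Final score
plot(seq_len(nrow(data)), data$Final,
     xlab = "Student", ylab = "Final score")
```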

Related

Randomly Select 10 percent of data from the whole data set in R

For my project, I have a data set with 1,296,765 observations of 23 columns, and I want to take just a random 10% of this data. How can I do that in R?
I tried the code below, but it only sampled out 10 rows, whereas I wanted to randomly select 10% of the data. I am a beginner, so please help.
library(dplyr)
x <- sample_n(train, 10)
Here is a function from dplyr that selects rows at random by a given proportion:
dplyr::slice_sample(train, prop = 0.1)
In base R, you can subset by sampling a proportion of nrow():
set.seed(13)
train <- data.frame(id = 1:101, x = rnorm(101))
train[sample(nrow(train), nrow(train) / 10), ]
     id           x
69   69  1.14382456
101 101 -0.36917269
60   60  0.69967564
58   58  0.82651036
59   59  1.48369123
72   72 -0.06144699
12   12  0.46187091
89   89  1.60212039
8     8  0.23667967
49   49  0.27714729
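To sanity-check the proportion on a frame closer to the question's size (the 10,000-row frame here is illustrative):

```r
set.seed(13)
big <- data.frame(id = seq_len(10000), x = rnorm(10000))

# sample 10% of the row indices without replacement, then subset
sub <- big[sample(nrow(big), nrow(big) * 0.10), ]
nrow(sub)  # 1000
```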

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in a CSV file):
RecNo  Var1   Var2 Var3
   41   800  201.8    Y
   43   140     39    N
   47    60  20.24    N
   49   687     77    Y
   54   570    135    Y
   58  1250    467    N
   61   211     52    N
   64    96  117.3    N
   68   687     77    Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
  x <- sweep(x, 2, apply(x, 2, min))
  sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData %>% mutate_at(2:3, funs((. - min(.)) / max(. - min(.))))
This produces the following:
  RecNo       Var1       Var2 Var3
1    41 0.62184874 0.40601834    Y
2    43 0.06722689 0.04195255    N
3    47 0.00000000 0.00000000    N
4    49 0.52689076 0.12693105    Y
5    54 0.42857143 0.25663508    Y
6    58 1.00000000 1.00000000    N
7    61 0.12689076 0.07102414    N
8    64 0.03025210 0.21718329    N
9    68 0.52689076 0.12693105    Y
Alternatively, there is the package BBmisc which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData, method = "range", range = c(0, 1), margin = 1)
> myNorm
  RecNo       Var1       Var2 Var3
1    41 0.62184874 0.40601834    Y
2    43 0.06722689 0.04195255    N
3    47 0.00000000 0.00000000    N
4    49 0.52689076 0.12693105    Y
5    54 0.42857143 0.25663508    Y
6    58 1.00000000 1.00000000    N
7    61 0.12689076 0.07102414    N
8    64 0.03025210 0.21718329    N
9    68 0.52689076 0.12693105    Y
EDIT: For completeness, I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize <- function(x) {
  minval <- apply(x[, c(2, 3)], 2, min)
  maxval <- apply(x[, c(2, 3)], 2, max)
  y <- sweep(x[, c(2, 3)], 2, minval)
  sweep(y, 2, (maxval - minval), "/")
}
df[, c(2, 3)] <- normalize(df)
Thank you for your help!
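For the record, the same range normalization can be written without sweep() at all; this base-R sketch (column indices 2:3, as in the question) rescales only the chosen columns and leaves RecNo and Var3 untouched:

```r
# A few rows of the question's data
df <- data.frame(RecNo = c(41, 43, 47, 49),
                 Var1  = c(800, 140, 60, 687),
                 Var2  = c(201.8, 39, 20.24, 77),
                 Var3  = c("Y", "N", "N", "Y"))

# rescale just columns 2 and 3 to [0, 1]
df[2:3] <- lapply(df[2:3], function(v) (v - min(v)) / (max(v) - min(v)))
```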

Passing variable names to mapply (using reshape)

I'm trying to take a long-format dataframe and create several wide-format dataframes from it according to a list of different variables.
My thought is to use mapply to pass the set of variables I want to filter by positionally to the dataset. But it doesn't look like mapply can read in the list of vars.
Data:
library(dplyr)
library(reshape2)
set.seed(1234)
data <- data.frame(
  region = sample(c("northeast", "midwest", "west"), 40, replace = TRUE),
  date = rep(seq(as.Date("2010-02-01"), length.out = 4, by = "1 day"), 10),
  employed = sample(50000:100000, 40, replace = TRUE),
  girls = sample(1:40),
  guys = sample(1:40)
)
For each of the quantitative variables (employed, girls, and guys), I want to create a wide-format dataframe with dates as rows, regions as columns.
Could I use mapply to do this more succinctly than running melt and dcast separately for each of {"employed","girls", "guys"}?
For example:
mapply(function(d, y) {
  melt(d[, c('region', 'date', y)], id.vars = c('region', 'date'))
},
data,
c('employed', 'girls', 'guys')
)
tells me:
Error in `[.default`(d, , c("region", "date", y)) :
  incorrect number of dimensions
What I'm looking to get is a list of the wide-format dataframes; I figured mapply would be the easiest way to pass multiple arguments, but if there's a better way to go about this, I'm all for it.
Example:
$employed
        date midwest northeast   west
1 2010-02-01   62196    513366 119070
2 2010-02-02  334849    271383 160552
3 2010-02-03  187070    320594 119721
4 2010-02-04  146575    311999 310009

$girls
        date midwest northeast west
1 2010-02-01      40       154   26
2 2010-02-02      88        76   61
3 2010-02-03      67        84   39
4 2010-02-04      48        95   42

$guys
        date midwest northeast west
1 2010-02-01      16       140   43
2 2010-02-02     115        70   43
3 2010-02-03      63        64   42
4 2010-02-04      54        94   76
The old standby of split/lapply:
d <- melt(data, id.vars = c("region", "date"))
lapply(split(d, d$variable), function(x) dcast(x, date ~ region, sum))
The example data has multiple matches per date/region cell, so I used sum as the aggregating function.
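If you'd rather skip reshape2 entirely, base R's tapply() can build the same date-by-region tables; a sketch under the question's setup (the output is a list of matrices rather than data frames):

```r
set.seed(1234)
data <- data.frame(
  region   = sample(c("northeast", "midwest", "west"), 40, replace = TRUE),
  date     = rep(seq(as.Date("2010-02-01"), length.out = 4, by = "1 day"), 10),
  employed = sample(50000:100000, 40, replace = TRUE),
  girls    = sample(1:40),
  guys     = sample(1:40)
)

# one table per quantitative variable: dates as rows, regions as columns
wide <- lapply(data[c("employed", "girls", "guys")], function(v)
  tapply(v, list(data$date, data$region), sum))
```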

R: How to divide a data frame by column values?

Suppose I have a data frame with 3 columns and 10 rows as follows.
#  V1 V2  V3
#  10 24  92
#  13 73 100
#  25 91 120
#  32 62  95
#  15 43 110
#  28 54  84
#  30 56  71
#  20 82  80
#  23 19  30
#  12 64  89
I want to create sub-dataframes that divide the original by the values of V1.
For example,
the first data frame will have the rows with values of V1 from 10-14,
the second will have the rows with values of V1 from 15-19,
the third from 20-24, etc.
What would be the simplest way to do this?
So if this is your data
dd <- data.frame(
  V1 = c(10, 13, 25, 32, 15, 38, 30, 20, 23, 13),
  V2 = c(24, 73, 91, 62, 43, 54, 56, 82, 19, 64),
  V3 = c(92, 100, 120, 95, 110, 84, 71, 80, 30, 89)
)
then the easiest way to split is with the split() command, and since you want to split into ranges, you can use the cut() command to create them. Note that the breaks must extend past the largest V1 value (38 here); otherwise cut() returns NA for that row and split() silently drops it. A simple split can be done with
ss <- split(dd, cut(dd$V1, breaks = seq(10, 40, by = 5) - 1)); ss
split returns a list where each item is the subsetted data.frame. So to get at the data.frame with the values for 10-14, use ss[[1]], and for 15-19, use ss[[2]] etc.
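A quick check of what that returns (the bin labels come from cut()):

```r
dd <- data.frame(
  V1 = c(10, 13, 25, 32, 15, 38, 30, 20, 23, 13),
  V2 = c(24, 73, 91, 62, 43, 54, 56, 82, 19, 64),
  V3 = c(92, 100, 120, 95, 110, 84, 71, 80, 30, 89)
)

# breaks 9, 14, 19, 24, 29, 34, 39 cover every V1 value, including 38
ss <- split(dd, cut(dd$V1, breaks = seq(10, 40, by = 5) - 1))
names(ss)      # "(9,14]" "(14,19]" "(19,24]" "(24,29]" "(29,34]" "(34,39]"
nrow(ss[[1]])  # 3 (the rows with V1 = 10, 13, 13)
```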

How can I get column data to be added based on a group designation using R?

The data set I'm working with is similar to the one below (although the example is of a much smaller scale; the real data runs to tens of thousands of rows), and I haven't been able to figure out how to get R to add up column data based on the group number. Essentially, I want to get the greens, blues, and reds added up separately for all of group 81 and all of group 66, and then use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.
One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
#   Group Green Blue Red Total
# 1    66    46   48  13   107
# 2    81    71   30  33   134
Both of these answers are perfect examples of how to address this type of problem. Two other options exist within reshape and plyr:
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that @Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
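To turn those group sums into the percentages the question asks for, divide each color column by the group total; agg is my own name for the aggregate() result:

```r
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), header = TRUE)

agg <- aggregate(. ~ Group, data = dat, FUN = sum)
# each color as a share of its group's total
agg[c("Green", "Blue", "Red")] <- agg[c("Green", "Blue", "Red")] / agg$Total
```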
