Converting different maximum scores to percentage out of 100 - r

I have three datasets, each with 3 students and 3 subjects, but with different maximum scores (125, 150, 200). How do I calculate the mean percentage (out of 100) per subject for a standard (not a section) when the three maximum scores differ and the raw scores are therefore not directly comparable?
Class2:
section1.csv
        english      maths        science
name    score(125)   score(125)   score(125)
sam     114          112          111
erm     89           91           97
asd     101          107          118
section2.csv
        english      maths        science
name    score(150)   score(150)   score(150)
wer     141          127          143
rahul   134          119          145
rohit   149          135          139
section3.csv
        english      maths        science
name    score(200)   score(200)   score(200)
vinod   178          186          176
manoj   189          191          185
deepak  191          178          187
P.S.: Expected columns in the output: class1, englishavg, mathsavg, scienceavg (the values being the sum of the mean percentages across the three sections).
Here is the piece of code I tried.
files <- list.files(pattern = "\\.csv$")  # all CSV file names in the folder
list_files <- lapply(files, read.csv, header = FALSE, stringsAsFactors = FALSE)
# The first two rows of each file are header text; row 2 embeds the maximum
# score, e.g. "score(125)". Extract it, drop the header rows, and convert the
# subject mean to a percentage out of 100 (mean / max * 100), instead of
# hard-coding a divisor of 2 (which is only correct for a 200-point maximum).
subject_avg <- function(x, col) {
  max_score <- as.numeric(gsub("\\D", "", x[2, col]))
  mean(as.numeric(x[-(1:2), col]), na.rm = TRUE) / max_score * 100
}
engav     <- sapply(list_files, subject_avg, col = 2)
mathav    <- sapply(list_files, subject_avg, col = 3)
scienceav <- sapply(list_files, subject_avg, col = 4)
result <- data.frame(files, engav, mathav, scienceav)
Looking forward to any assistance.
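A sketch of the final one-row output, assuming the corrected result above and reading "summation" literally as adding the three sections' mean percentages:
class1 <- data.frame(class      = "class1",
                     englishavg = sum(result$engav),   # sum of the three sections' english percentages
                     mathsavg   = sum(result$mathav),
                     scienceavg = sum(result$scienceav))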

Related

How to sum column based on value in another column in two dataframes?

I am trying to create a limit order book, and in one of its functions I want to return a list that sums the 'size' column for both the ask data frame and the bid data frame in the book.
The output should be...
$ask
oid price size
8 a 105 100
7 o 104 292
6 r 102 194
5 k 99 71
4 q 98 166
3 m 98 88
2 j 97 132
1 n 96 375
$bid
oid price size
1 b 95 100
2 l 95 29
3 p 94 87
4 s 91 102
Total volume: 318 1418
Where the input is...
oid,side,price,size
a,S,105,100
b,B,95,100
I have a function book.total_volumes <- function(book, path) { ... } that should return total volumes.
I tried to use aggregate but struggled with the fact that the limit order book contains both ask and bid sides.
I appreciate any help; I am clearly a complete beginner, only here to learn :)
If there is anything more I can add to make this question clearer, feel free to leave a comment!
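A minimal sketch of one approach, assuming the book is a list with $ask and $bid data frames built from the input's side column (B = bid, S = ask); orders.csv is a hypothetical file name, and the path argument from the original signature is omitted here:
orders <- read.csv("orders.csv", stringsAsFactors = FALSE)  # hypothetical file name
book <- list(ask = orders[orders$side == "S", ],
             bid = orders[orders$side == "B", ])
book.total_volumes <- function(book) {
  sapply(book, function(x) sum(x$size))  # named vector of total size per side
}
book.total_volumes(book)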

R One sample test for set of columns for each row

I have a data set with the Levels and Trends for, say, 50 cities under 3 scenarios. Below is the sample data:
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios, I find the minimum Level and minimum Trend using the code below:
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values differ significantly from the Levels and Trends respectively: compare L_min against columns 2-4 and T_min against columns 5-7. This needs to be done for each city (row), and, where the difference is significant, return which column it differs from.
It would help if someone could explain how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now you have the rows that are most different, based on columns 2:4, at the top. Columns 5:7 can be handled in an analogous way.
And some tips for statistical tests:
Prefer t.test (parametric, based on the mean) over the Wilcoxon / Mann-Whitney U test (non-parametric, based on the median); it has more power. HOWEVER:
-The data sets should be large. Example hypothesis: Montreal has taller citizens than Quebec; t.test will work fine when you take 100 people from each city, so we have height measurements for 200 people, 100 vs 100.
-The distribution should be close to normal in all samples, or both samples should have a similar distribution far from normal (it may be binomial, for example). Either way, we can't use this test when one sample is normally distributed and the other isn't.
-The sizes of the two samples should be equal, so 100 vs 100 is fine, but 87 vs 234 is not ideal; the p-value may fall below 0.05 yet be misleading.
If your data doesn't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
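As a sketch of the per-row test itself, under one interpretation of the question (a one-sample t-test of each city's three Levels against that city's minimum; with only three values per row the test has very little power, so treat the p-values with caution):
# One-sample t-test per row: mean of the three values vs. their minimum.
# Note: t.test errors if a row's three values are identical.
df$L_pval <- apply(df[, 2:4], 1, function(x) t.test(x, mu = min(x))$p.value)
df$T_pval <- apply(df[, 5:7], 1, function(x) t.test(x, mu = min(x))$p.value)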

R 2.15.0 lm() on Windows 6.1.7601 - Subsetting data frame receive error when labeling columns

Purpose: Subset a dataframe into 2 columns that are labeled with new names
for example:
Age Height
1 65 183
2 73 178
[data1[dataset1$Age>50 | dataset1$Height>140,], c("Age","Cm")]
# Error: unexpected ',' in "data1[data1$Age>50 | data1$Height>140,],"
What I've tried:
data1[dataset1$Age>50 | dataset1$Height>140,] #This doesn't organize results in columns
data1[dataset1$Age>50 | dataset1$Height>140,], c("Age","Cm") #Returns same error
I can't get the columns to be organized side-by-side with the labels in c("label1", "label2"). Thanks for your help! New to R and learning it alongside biostats.
If I understood correctly, the subset function may be of help:
dataset1 <- data.frame(
age=c(44,77,21,55,66,90,23,54,31),
height=c(144,177,121,155,166,190,123,154,131)
)
data1 <- subset(dataset1, age > 50 | height > 140)  # subset() already returns a data frame
colnames(data1) <- c("Age", "Height")
I may have missed what you were trying to do; I think a bit more reproducible data would help.
Nevertheless, I had a go:
dataset1 <- data.frame(Age = 35:75, Height = 135:175)
Age Height
35 135
36 136
37 137
38 138
39 139
40 140
41 141
42 142
43 143
44 144
and subset
data1 <- dataset1[dataset1$Age > 50 | dataset1$Height > 140, ]
colnames(data1) <- c("Age", "Cm")
Age Cm
41 141
42 142
43 143
44 144
45 145
46 146
47 147
48 148
49 149
50 150
My apologies if I missed what you wanted, but to me it wasn't very clear.
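Alternatively, a one-step sketch that subsets the rows, keeps the two columns, and renames them at once (setNames is base R):
data1 <- setNames(dataset1[dataset1$Age > 50 | dataset1$Height > 140,
                           c("Age", "Height")],
                  c("Age", "Cm"))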

How can I use a table in R for names.arg

I have a table in R that I am using to make a barplot:
   V1  V2  V3 V4
86 17 482 424  C
87 18 600 426  T
88 11 279 427  Q
89  X 399 436  B
I can make the plot with barplot(table$V3) but how do I use the values in V4 as the names for each V3 entry?
Just use
barplot(DF$V3, names.arg = DF$V4)
where DF is your data frame. (A table is something else in R; if you actually mean a table, please indicate as such.)
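For example, with made-up data in the same shape as the question's table:
DF <- data.frame(V3 = c(424, 426, 427, 436),
                 V4 = c("C", "T", "Q", "B"))
barplot(DF$V3, names.arg = DF$V4)  # one bar per V3 value, labeled with V4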

grep: How can I search through my data using a wildcard in R

I have recently started using R, and now I am trying to get some data out of it. However, the results I get are quite confusing. I have daily data from 1961 to 1963, with dates in the format 1961-04-25, and I created a vector called date.
To search for the period between April 10 and May 21 and display the dates, I used this command:
date[date >= grep("196.-04-10", date, value = TRUE) &
date <= grep("196.-05-21", date, value = TRUE)]
The results I get are somewhat confusing, as they advance in three-day steps instead of including every single day; see below.
[1] "1961-04-10" "1961-04-13" "1961-04-16" "1961-04-19" "1961-04-22" "1961-04-25" "1961-04-28" "1961-05-01" "1961-05-04" "1961-05-07" "1961-05-10"
[12] "1961-05-13" "1961-05-16" "1961-05-19" "1962-04-12" "1962-04-15" "1962-04-18" "1962-04-21" "1962-04-24" "1962-04-27" "1962-04-30" "1962-05-03"
[23] "1962-05-06" "1962-05-09" "1962-05-12" "1962-05-15" "1962-05-18" "1962-05-21" "1963-04-11" "1963-04-14" "1963-04-17" "1963-04-20" "1963-04-23"
[34] "1963-04-26" "1963-04-29" "1963-05-02" "1963-05-05" "1963-05-08" "1963-05-11" "1963-05-14" "1963-05-17" "1963-05-20"
I think the grep strategy is misguided: grep() returns one matched date per year (three values here), and comparing the full date vector against them recycles those three values, which is likely where the three-day steps come from. Maybe something like this will work instead ... basically, I'm computing the day-of-year (Julian date, yday()) and using that for comparison.
z <- as.Date(c("1961-04-10","1961-04-11","1961-04-12",
"1961-05-21","1961-05-22","1961-05-23",
"1963-04-09","1963-04-12","1963-05-21","1963-05-22"))
library(lubridate)
z[yday(z)>=yday(as.Date("1961-04-10")) & yday(z)<=yday(as.Date("1961-05-21"))]
## [1] "1961-04-10" "1961-04-11" "1961-04-12" "1961-05-21" "1963-04-12"
## [6] "1963-05-21"
Actually, this solution is fragile to leap-years ...
Better (?):
yz <- year(z)
z[z>=as.Date(paste0(yz,"-04-10")) & z<=as.Date(paste0(yz,"-05-21"))]
(You should definitely test this for yourself, I haven't tested carefully!)
Using a date format for your variable would be the best bet here.
## set up some test data
datevar <- seq.Date(as.Date("1961-01-01"),as.Date("1963-12-31"),by="day")
test <- data.frame(date=datevar,id=1:(length(datevar)))
head(test)
## which looks like:
> head(test)
date id
1 1961-01-01 1
2 1961-01-02 2
3 1961-01-03 3
4 1961-01-04 4
5 1961-01-05 5
6 1961-01-06 6
## find the date ranges you want
selectdates <-
(format(test$date,"%m") == "04" & as.numeric(format(test$date,"%d")) >= 10) |
(format(test$date,"%m") == "05" & as.numeric(format(test$date,"%d")) <= 21)
## subset the original data
result <- test[selectdates,]
## which looks as expected:
> result
date id
100 1961-04-10 100
101 1961-04-11 101
102 1961-04-12 102
103 1961-04-13 103
104 1961-04-14 104
105 1961-04-15 105
106 1961-04-16 106
107 1961-04-17 107
108 1961-04-18 108
109 1961-04-19 109
110 1961-04-20 110
111 1961-04-21 111
112 1961-04-22 112
113 1961-04-23 113
114 1961-04-24 114
115 1961-04-25 115
116 1961-04-26 116
117 1961-04-27 117
118 1961-04-28 118
119 1961-04-29 119
120 1961-04-30 120
121 1961-05-01 121
122 1961-05-02 122
123 1961-05-03 123
124 1961-05-04 124
125 1961-05-05 125
126 1961-05-06 126
127 1961-05-07 127
128 1961-05-08 128
129 1961-05-09 129
130 1961-05-10 130
131 1961-05-11 131
132 1961-05-12 132
133 1961-05-13 133
134 1961-05-14 134
135 1961-05-15 135
136 1961-05-16 136
137 1961-05-17 137
138 1961-05-18 138
139 1961-05-19 139
140 1961-05-20 140
141 1961-05-21 141
465 1962-04-10 465
...
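A variation on the same idea, as a reusable sketch: format(x, "%m%d") yields a zero-padded "MMDD" string, so string comparison matches chronological order within any year (using the test data frame from above; in_window is a hypothetical helper name):
in_window <- function(x, from = "0410", to = "0521") {
  md <- format(x, "%m%d")   # e.g. "0425" for April 25
  md >= from & md <= to
}
result2 <- test[in_window(test$date), ]  # same rows as `result` above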
