For example
df
Cars Male Female
Ford focus 23 64
vw golf 76 12
ford ka 34 55
renault megane 12 83
How do I find the cars for which the ratio of male to female is greater than 0.5?
Just subset your data frame using that ratio:
df[df$Male / df$Female > 0.5, ]
Cars Male Female
2 vw golf 76 12
3 ford ka 34 55
You might try the which() function:
df[which(df[,2]/df[,3]>0.5),1]
Good luck!
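The two answers above behave differently when a count is missing. A minimal sketch, assuming the question's data with one hypothetical NA added (the NA is not in the original data):

```r
# Data from the question, rebuilt as a data frame
df <- data.frame(
  Cars   = c("Ford focus", "vw golf", "ford ka", "renault megane"),
  Male   = c(23, 76, 34, 12),
  Female = c(64, 12, 55, 83)
)

ratio <- df$Male / df$Female
above <- df[ratio > 0.5, ]                    # plain logical indexing

# Hypothetical: pretend the Female count for "ford ka" is unknown
df_na <- df
df_na$Female[3] <- NA
ratio_na <- df_na$Male / df_na$Female

by_logical <- df_na[ratio_na > 0.5, ]         # an NA in the index yields an all-NA row
by_which   <- df_na[which(ratio_na > 0.5), ]  # which() silently drops the NA
```

With complete data both forms return the same rows, so the choice only matters when the ratio can be NA.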
I have a data.frame where most, but not all, data are recorded over a 12-month period. This is specified in the months column.
I need to transform the revenue and cost variables only (since they are flow data, compared to total_assets which is stock data) so I get the 12-month values.
In this example, for Micheal and Ravi I need to replace the values in revenue and cost with (12/months)*revenue and (12/months)*cost, respectively.
What would be a possible way to do this?
df1 = data.frame(name = c('George','Andrea', 'Micheal','Maggie','Ravi'),
months=c(12,12,4,12,9),
revenue=c(45,78,13,89,48),
cost=c(56,52,15,88,24),
total_asset=c(100,121,145,103,119))
df1
name months revenue cost total_asset
1 George 12 45 56 100
2 Andrea 12 78 52 121
3 Micheal 4 13 15 145
4 Maggie 12 89 88 103
5 Ravi 9 48 24 119
Using dplyr:
library(dplyr)
df1 %>%
mutate(cost = (12/months)*cost,
revenue = (12/months)*revenue)
An alternative if for any reason you have to use base R is:
df1$revenue <- 12/df1$months * df1$revenue
df1$cost <- 12/df1$months * df1$cost
df1
#> name months revenue cost total_asset
#> 1 George 12 45 56 100
#> 2 Andrea 12 78 52 121
#> 3 Micheal 4 39 45 145
#> 4 Maggie 12 89 88 103
#> 5 Ravi 9 64 32 119
Created on 2022-06-01 by the reprex package (v2.0.1)
Slightly different base R approach with with():
df1 = data.frame(name = c('George','Andrea', 'Micheal','Maggie','Ravi'),
months=c(12,12,4,12,9),
revenue=c(45,78,13,89,48),
cost=c(56,52,15,88,24),
total_asset=c(100,121,145,103,119))
df1$revenue <- with(df1, 12/months * revenue)
df1$cost <- with(df1, 12/months * cost)
head(df1)
#> name months revenue cost total_asset
#> 1 George 12 45 56 100
#> 2 Andrea 12 78 52 121
#> 3 Micheal 4 39 45 145
#> 4 Maggie 12 89 88 103
#> 5 Ravi 9 64 32 119
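If there are many flow columns, the same scaling can be applied to all of them in one assignment. This is only a sketch in base R; flow_cols is a hypothetical name introduced here, not something from the question:

```r
df1 <- data.frame(
  name        = c("George", "Andrea", "Micheal", "Maggie", "Ravi"),
  months      = c(12, 12, 4, 12, 9),
  revenue     = c(45, 78, 13, 89, 48),
  cost        = c(56, 52, 15, 88, 24),
  total_asset = c(100, 121, 145, 103, 119)
)

# flow_cols is a hypothetical name for the columns to annualise
flow_cols <- c("revenue", "cost")

# The scaling vector has one entry per row and is applied down each selected column
df1[flow_cols] <- df1[flow_cols] * (12 / df1$months)
```

Stock columns such as total_asset are left alone because they are not listed in flow_cols.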
Var1 <- 90:115
Var2 <- 1:26
Var3 <- 52:27
data <- data.frame(Var1, Var2, Var3)
Hi, I want to select the 10 largest values from each column and save them in a new data frame. I know that in my example the new data frame will contain 20 rows, but I don't understand the correct workflow.
That's what I'm expecting:
Var1 Var2 Var3
90 1 52
91 2 51
92 3 50
93 4 49
94 5 48
95 6 47
96 7 46
97 8 45
98 9 44
99 10 43
106 17 36
107 18 35
108 19 34
109 20 33
110 21 32
111 22 31
112 23 30
113 24 29
114 25 28
115 26 27
I can solve my problem for three columns with this approach:
df <- subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43)
but if I have to do that for 50+ columns it's not really the best solution.
This can be done by looping over the columns with lapply, sorting each one, and taking the first 10 values with head:
data.frame(lapply(data, function(x) head(sort(x, decreasing = TRUE), 10)))
If we need the first 10 rows, just use
head(data, 10)
Update
Based on the OP's edit
data[sort(Reduce(union,lapply(data, function(x)
order(x,decreasing=TRUE)[1:10]))),]
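For 50+ columns it may be convenient to wrap the updated approach in a function. The sketch below assumes the question's data; top_rows() is a hypothetical helper name, and head() is swapped in for [1:10] so that a data frame with fewer than n rows does not produce NA indices:

```r
Var1 <- 90:115
Var2 <- 1:26
Var3 <- 52:27
data <- data.frame(Var1, Var2, Var3)

# top_rows() is a hypothetical helper, not part of base R
top_rows <- function(d, n = 10) {
  # Row positions of the n largest values in each column
  idx <- lapply(d, function(x) head(order(x, decreasing = TRUE), n))
  # Combine positions across columns, drop duplicates, restore row order
  d[sort(Reduce(union, idx)), , drop = FALSE]
}

res <- top_rows(data, 10)

# Compare with the hand-written condition from the question
expected <- subset(data, Var1 >= 106 | Var2 >= 17 | Var3 >= 43)
all.equal(res, expected)
```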
I think this is what you want:
data[sort(unique(c(sapply(data, order, decreasing = TRUE)[1:10, ]))), ]
Basically, this finds the row indices of the top 10 elements of each column, merges them, removes duplicates, sorts them, and uses them to extract the rows from the original data.
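The one-liner can be unpacked into its steps, roughly like this (the intermediate names ord, top and rows are introduced here only for illustration):

```r
data <- data.frame(Var1 = 90:115, Var2 = 1:26, Var3 = 52:27)

# 1. Column-wise ordering: a 26 x 3 matrix of row positions, largest value first
ord <- sapply(data, order, decreasing = TRUE)

# 2. Keep the first 10 positions per column (a 10 x 3 matrix)
top <- ord[1:10, ]

# 3. Flatten, drop duplicates, and sort back into original row order
rows <- sort(unique(c(top)))

# 4. Extract those rows from the original data
result <- data[rows, ]
```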
A direct answer to your question:
nv1 <- sort(Var1,decreasing = TRUE)[1:10]
nv2 <- sort(Var2,decreasing = TRUE)[1:10]
nv3 <- sort(Var3,decreasing = TRUE)[1:10]
nd <- data.frame(nv1, nv2, nv3)
But why would you want to do such a thing? Sorting each column separately breaks the row order of the data: in the original data Var3 is decreasing while the others are increasing. Perhaps you want a list, rather than a data frame?
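A list version might look like the following sketch, which keeps one vector of top values per column and makes no claim that values in the same position belong to the same row:

```r
data <- data.frame(Var1 = 90:115, Var2 = 1:26, Var3 = 52:27)

# One vector of the 10 largest values per column, kept as a named list
top10 <- lapply(data, function(x) head(sort(x, decreasing = TRUE), 10))

top10$Var3
```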
This might help:
thresh <- sapply(data,sort,decreasing=T)[10,]
data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),]
First, a vector thresh is defined, which contains the tenth largest value of each column. Then we perform a loop over the columns to check if any of the values is larger than or equal to the corresponding threshold value. The !! is a shorthand notation for as.logical(), which (owing to the combination with rowSums) selects those rows where at least one of the values is above or equal to the threshold. In your example this yields the output:
# Var1 Var2 Var3
#1 90 1 52
#2 91 2 51
#3 92 3 50
#4 93 4 49
#5 94 5 48
#6 95 6 47
#7 96 7 46
#8 97 8 45
#9 98 9 44
#10 99 10 43
#17 106 17 36
#18 107 18 35
#19 108 19 34
#20 109 20 33
#21 110 21 32
#22 111 22 31
#23 112 23 30
#24 113 24 29
#25 114 25 28
#26 115 26 27
Which is equal to the output that you obtain with the command you posted:
#> identical(data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),], subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43))
[1] TRUE
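One thing to keep in mind with a threshold: if several values tie with the tenth largest, all of them pass the >= test, so a column can contribute more than 10 rows. A small sketch with a hypothetical vector (not from the question) and n = 2 instead of 10:

```r
# Hypothetical column with ties at the cut-off
x <- c(5, 9, 9, 9, 1, 7)
n <- 2

# n-th largest value, used as the threshold
thresh_x <- sort(x, decreasing = TRUE)[n]

# Threshold test: keeps every value tied with the cut-off
by_threshold <- which(x >= thresh_x)

# Position-based selection: keeps exactly n positions
by_order <- sort(order(x, decreasing = TRUE)[1:n])
```

Here by_threshold has three entries while by_order has two. On the question's data there are no ties, so both approaches agree.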
My data looks something like this
students<-data.table(studid=c(1:6) ,FACULTY= c("IT","SCIENCE", "LAW","IT","IT","IT"),
SEX=c("Male","Male","Male","Female","Female","Male"), WAM=c(65,35,98,55,20,80))
studid FACULTY SEX AVE_MARK (WAM)
1 IT Male 65
2 SCIENCE Male 35
3 LAW Male 98
4 IT Female 55
5 IT Female 20
6 IT Male 80
I have used the following code to calculate the averages
students[, mean(WAM, na.rm=T),by=FACULTY][order(-V1)]
So my headings are
FACULTY V1
IT 65
LAW 50
etc
I would like to break this up by sex also:
FACULTY V1 V1
Male Female
IT 65 11
LAW 50 11
Any advice on how to do this would be greatly appreciated.
You could try
dcast.data.table(students, FACULTY~SEX, fun.aggregate=mean, na.rm=TRUE,
value.var='WAM')
# FACULTY Female Male
#1: IT 37.5 72.5
#2: LAW NaN 98.0
#3: SCIENCE NaN 35.0
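If data.table is not available, a similar cross-tabulation can be sketched in base R with tapply(). This assumes the same data held in a plain data.frame, and it returns a matrix rather than a table with a FACULTY column:

```r
students <- data.frame(
  studid  = 1:6,
  FACULTY = c("IT", "SCIENCE", "LAW", "IT", "IT", "IT"),
  SEX     = c("Male", "Male", "Male", "Female", "Female", "Male"),
  WAM     = c(65, 35, 98, 55, 20, 80)
)

# Rows are faculties, columns are sexes; empty cells come back as NA
wam_by_group <- tapply(students$WAM, students[c("FACULTY", "SEX")], mean, na.rm = TRUE)

wam_by_group["IT", "Female"]
```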
Do you definitely need it in cross tabular format? If so, akrun's answer is the way to go.
Otherwise, here they are stacked:
> students[, mean(WAM, na.rm=T),by=c('FACULTY','SEX')]
FACULTY SEX V1
1: IT Male 72.5
2: SCIENCE Male 35.0
3: LAW Male 98.0
4: IT Female 37.5
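The V1 heading can be avoided by naming the result inside .(). A minimal sketch, where mean_WAM is a hypothetical column name chosen for this example:

```r
library(data.table)

students <- data.table(
  studid  = 1:6,
  FACULTY = c("IT", "SCIENCE", "LAW", "IT", "IT", "IT"),
  SEX     = c("Male", "Male", "Male", "Female", "Female", "Male"),
  WAM     = c(65, 35, 98, 55, 20, 80)
)

# mean_WAM is a hypothetical column name chosen for this sketch
by_group <- students[, .(mean_WAM = mean(WAM, na.rm = TRUE)), by = .(FACULTY, SEX)]

# Sort from highest to lowest average
by_group[order(-mean_WAM)]
```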
I'm currently learning R with the help of videos on Coursera. I am trying to exclude from a table all hospitals in states that have fewer than 20 hospitals, but I couldn't find the correct solution because of my limited R knowledge (I have programmed a lot in C, so the logic I tried to implement in R is also C-like).
The code I used is:
test <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
test[, 11] <- as.numeric(test[, 11])
test2 <- table(test$State)
From the table test2 I can get the value of a particular entry with test2[[2]], but I couldn't figure out how to use conditional logic to get the states with fewer than 20 hospitals (if I get the state names, I can use subset() to address the actual problem). I also looked at the dimnames() function but couldn't find a way to solve my problem with it. So my question is: in R, how can I check the values in a table against a threshold?
The values stored in test2 are:
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37
MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX
134 133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370
UT VA VI VT WA WI WV WY ##State Name
42 87 2 15 88 125 54 29 ##Count of Hospital
As Arun also pointed out in his comment, you can use names(test2[test2 >= 20]) to get the states with 20 or more hospitals. Here is a nice explanation of why you should avoid subset.
Or you can transform your table to a data.frame and use subset:
dat <- as.data.frame(test2)
subset(dat, Freq < 20)
nn Freq
1 AK 17
8 DC 8
9 DE 6
12 GU 1
13 HI 19
42 RI 12
49 VI 2
50 VT 15
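To go from the counts back to the hospital rows, the state names can be fed into %in%. The sketch below uses a hypothetical miniature data set and a threshold of 3 instead of 20, purely to keep the example small:

```r
# Hypothetical miniature data set standing in for the hospital file
hospitals <- data.frame(
  Name  = paste0("H", 1:6),
  State = c("AK", "AK", "AK", "DC", "TX", "TX")
)
min_count <- 3   # the question uses 20; 3 keeps the toy example small

counts <- table(hospitals$State)

# A table can be compared against a number directly, like a named vector
big_states <- names(counts)[counts >= min_count]

# Keep only hospitals in states that meet the threshold
kept <- hospitals[hospitals$State %in% big_states, ]
```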
I am working with two different data sets, and I'd like to move data from one to the other. I'm thinking of it this way: One contains the results, paired with the correct factor (HTm), and I want to spread those out over another frame. Here is the first frame:
head(five)
Week Game.ID VTm VPts HTm HPts HDifferential VDifferential
1 1 NFL_20050908_OAK#NE OAK 20 NE 30 10 -10
2 1 NFL_20050911_ARI#NYG ARI 19 NYG 42 23 -23
3 1 NFL_20050911_CHI#WAS CHI 7 WAS 9 2 -2
4 1 NFL_20050911_CIN#CLE CIN 27 CLE 13 -14 14
5 1 NFL_20050911_DAL#SD DAL 28 SD 24 -4 4
6 1 NFL_20050911_DEN#MIA DEN 10 MIA 34 24 -24
VTm.f HTm.f average
1 OAK NE 19.4375
2 ARI NYG 19.4375
3 CHI WAS 19.4375
4 CIN CLE 19.4375
5 DAL SD 19.4375
6 DEN MIA 19.4375
tail(five)
Week Game.ID VTm VPts HTm HPts HDifferential VDiff
262 19 NFL_20060114_WAS#SEA WAS 10 SEA 20 10 -10
263 19 NFL_20060115_CAR#CHI CAR 29 CHI 21 -8 8
264 19 NFL_20060115_PIT#IND PIT 21 IND 18 -3 3
265 20 NFL_20060122_CAR#SEA CAR 14 SEA 34 20 -20
266 20 NFL_20060122_PIT#DEN PIT 34 DEN 17 -17 17
267 21 NFL_20060205_SEA#PIT SEA 10 PIT 21 11 -11
VTm.f HTm.f average
262 WAS SEA 0
263 CAR CHI 0
264 PIT IND 0
265 CAR SEA 0
266 PIT DEN 0
267 SEA PIT 0
and here is the other (aggregated means from the first frame).
head(fiveINFO)
HTm HPts VPts average
1 ARI 19.87500 19.00000 19.43750
2 ATL 24.75000 19.12500 21.93750
3 BAL 19.37500 13.75000 16.56250
4 BUF 16.50000 17.37500 16.93750
5 CAR 25.12500 23.27273 24.19886
6 CHI 18.77778 14.00000 16.38889
tail(fiveINFO)
VTm HPts VPts average
27 SEA 21.00 25.000 23.0000
28 SF 30.75 12.625 21.6875
29 STL 28.00 22.000 25.0000
30 TB 15.75 15.375 15.5625
31 TEN 28.00 14.750 21.3750
32 WAS 20.60 18.800 19.7000
For reference, this data is looking at NFL scores. I want to take the averages in fiveINFO, frame two, and move them to the corresponding team in the first frame. five is 266 rows long, while fiveINFO is 32 rows; fiveINFO contains each HTm only once, while five contains each one 8-10 times, depending on the number of home games each team plays. I found several answers that seemed similar, but with much smaller data sets. I don't want to merge the two; I want the averages data from the second frame to be spread across the appropriate HTm values in the first frame.
I'm imagining I'll need to use some kind of for loop for this, but everything I'm doing is striking out. Help?
total<-merge(five, fiveINFO, by="HTm")
Here total is the data frame that holds the merged columns from five and fiveINFO, matched on the HTm column. By default, rows whose HTm value has no match in the other frame are dropped. If you want to keep them, with NA in the unmatched columns, set all = TRUE (or all.x = TRUE or all.y = TRUE) in the merge call.
You can also remove extra columns that you don't want after merging.
total <- subset(total, select = -c(HPts.y, VPts.y)) # removes the HPts and VPts columns that came from fiveINFO; merge adds the .x/.y suffixes because both frames have columns with these names
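If a merge is not wanted, a lookup with match() is one alternative that adds the column in place and keeps the row order of five. The sketch below uses hypothetical miniature versions of the two frames with made-up averages:

```r
# Hypothetical miniature versions of the two frames (values are made up)
five <- data.frame(
  Game.ID = c("g1", "g2", "g3", "g4"),
  HTm     = c("NE", "NYG", "NE", "WAS")
)
fiveINFO <- data.frame(
  HTm     = c("NE", "NYG", "WAS"),
  average = c(21.5, 19.0, 19.7)
)

# match() returns, for each game, the row of fiveINFO holding that home team
pos <- match(five$HTm, fiveINFO$HTm)

# Index the averages by position; a team missing from fiveINFO would get NA
five$average <- fiveINFO$average[pos]
```

No loop is needed, since match() is vectorised over all rows of five.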