R - function which() that is not bringing the right output

I have a table temp1 with 2 columns, "Hospital.Name" and "heart attack", and a variable called "colname":
colname <- "heart attack"
Hospital.Name heart attack
ROUND ROCK MEDICAL CENTER 14.9
CYPRESS FAIRBANKS MEDICAL CENTER 12.0
I am trying to fetch the record with the lowest "heart attack" number, but my formula returns nothing. This is what I have:
temp1[which(temp1[[colname1]] == min(as.numeric(temp1[[colname1]]))),]
[1] Hospital.Name heart attack
<0 rows> (or 0-length row.names)
It brings no results, but I know the right-hand part of the formula is correct, because when I use
min(as.numeric(temp1[[colname1]]))
[1] 12
I get the min result of the "heart attack" column
Please help me with my formula:
temp1[which(temp1[[colname1]] == min(as.numeric(temp1[[colname1]]))),]

If I understood you correctly, you want the whole row in which one of the variables takes its minimum value.
You can try which.min if this is what you want to do.
Using the mtcars data set that ships with R:
mtcars[which.min(mtcars$mpg),]
The above fetches the record (row) that has the minimum value of the mpg field in the mtcars data.
#> mtcars[which.min(mtcars$mpg),]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
Now, if you use which the way you did on your data set (here with colname1 <- "mpg"), you get something like this:
mtcars[which(mtcars[[colname1]] == min(mtcars[[colname1]])),]
This will produce two records like below:
#> mtcars[which(mtcars[[colname1]] == min(mtcars[[colname1]])),]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
#Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
Moral of the story: which.min returns only the first position of the minimum, while which combined with == returns every matching position when several records share the same minimum value.
From Documentation:
Determines the location, i.e., index of the (first) minimum or maximum
of a numeric (or logical) vector.
In your case it might be something like:
temp1[which.min(temp1[,colname]) ,]
If the column is not numeric, then rather than doing many things in one step, break it up for simplicity:
temp1[,colname] <- as.numeric(temp1[,colname]) ##numeric conversion
temp1[which.min(temp1[,colname]) ,]
where colname = "heart attack" as per your question
If you use the code below you can get multiple records. It also seems you wrote the right code; it is not working because of a typo: you defined colname but used colname1.
temp1[which(temp1[[colname]] == min(temp1[[colname]])),]
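Separately from the typo, one thing that may be worth checking is the column type. The sketch below uses a hypothetical two-row stand-in for temp1 and assumes, without confirmation from the question, that the "heart attack" column was read in as text. In that case == compares strings, so "12.0" never equals the number 12 (which is coerced to "12"), and which() returns nothing until the column is converted.

```r
# Hypothetical stand-in for temp1; storing the rates as text is an assumption
temp1 <- data.frame(
  Hospital.Name = c("ROUND ROCK MEDICAL CENTER",
                    "CYPRESS FAIRBANKS MEDICAL CENTER"),
  `heart attack` = c("14.9", "12.0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
colname <- "heart attack"

# Text compared with a number: the number is coerced to "12", so nothing matches
text_match <- which(temp1[[colname]] == min(as.numeric(temp1[[colname]])))

# Convert the column first so that both sides of == are numeric
temp1[[colname]] <- as.numeric(temp1[[colname]])
num_match <- which(temp1[[colname]] == min(temp1[[colname]]))
lowest <- temp1[num_match, ]
```

Here text_match is empty, while lowest holds the single row with the smallest rate.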

Related

How to find the range of a continuous variable where a count variable is non-zero in R

I am trying to find the range of the variable lat for each of the other columns, which contain occurrence records (e.g. 0, 1, 2, 3), restricted to rows where the occurrence is non-zero (the range of lat where occurrence > 0). I've tried to subset the data for each column to drop rows with 0 individuals recorded, but I can't get it to work.
I tried to extract the minimum and maximum of lat for each species column where the occurrence was > 0 using which.max/which.min:
allfreq$lat[which.min(allfreq$lat[allfreq$Fem.mad !=0])]
However, the results made no sense: the values were nowhere near the minimum and maximum I observed visually.
Using mtcars dataset
> sapply(mtcars,function(x){range(x[x!=0])})
mpg cyl disp hp drat wt qsec vs am gear carb
[1,] 10.4 4 71.1 52 2.76 1.513 14.5 1 1 3 1
[2,] 33.9 8 472.0 335 4.93 5.424 22.9 1 1 5 8
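The sapply call above gives the range of each column's own non-zero values. For the question as asked, the range of lat restricted to rows where a count column is non-zero, a sketch along the following lines may be closer. The allfreq data frame here is a made-up stand-in (only lat and Fem.mad appear in the question; Sp.two and all values are hypothetical). It also shows why the original attempt went astray: which.min returns a position within the filtered vector, and that position is then used to index the unfiltered one.

```r
# Hypothetical stand-in for allfreq: a latitude column plus count columns
allfreq <- data.frame(
  lat     = c(50.1, 51.3, 52.7, 53.9, 55.2),
  Fem.mad = c(0, 2, 1, 0, 0),
  Sp.two  = c(0, 0, 3, 1, 4)
)

# The original attempt: position 1 of the filtered vector indexes the full vector
wrong <- allfreq$lat[which.min(allfreq$lat[allfreq$Fem.mad != 0])]

# Filter lat by each count column, then take the range of what is left
species <- setdiff(names(allfreq), "lat")
lat_ranges <- sapply(allfreq[species],
                     function(counts) range(allfreq$lat[counts > 0]))
rownames(lat_ranges) <- c("min", "max")
```

With this toy data, wrong picks up the first latitude in the table even though Fem.mad is 0 there, whereas lat_ranges only considers rows with a non-zero count.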

Random sampling of parquet prior to collect

I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:
library(dplyr)
set.seed(-1)
mtcars %>% slice_sample(n = 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.3 0 0 3 2
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.0 1 0 4 2
But my dataset is stored as a parquet file. As an example, I'll create a parquet from mtcars:
library(arrow)
# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")
open_dataset("~/mtcars") %>%
slice_sample(n = 3) %>%
collect()
# Error in UseMethod("slice_sample") :
# no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Clearly, slice_sample isn't implemented for parquet files and neither is slice:
open_dataset("~/mtcars") %>% nrow() -> n
subsample <- sample(1:n, 3)
open_dataset("~/mtcars") %>%
slice(subsample) %>%
collect()
# Error in UseMethod("slice") :
# no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Now, I know filter is implemented, so I tried that:
open_dataset("~/mtcars") %>%
filter(row_number() %in% subsample) %>%
collect()
# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.
(This also doesn't work if I create a filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE and use that in filter.)
This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is ginormous. So much so, that it crashes my session.
Question: is there a way to randomly subsample a parquet file before loading it with collect?
It turns out that there is an example in the documentation that pretty much fulfils my goal. That example is a smidge dated, as it uses sample_frac, which has been superseded, rather than slice_sample, but the general principle holds, so I've updated it here. As I don't know how many batches there will be, I show how it can be done with proportions, like Pace suggested, instead of pulling a fixed number of rows.
One issue with this approach is that (as far as I understand) it still requires the entire dataset to be read in; it just does so in batches rather than in one go.
open_dataset("~/mtcars") %>%
map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.1))) %>%
collect()
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 2 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
# 3 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4

Selecting rows with partial matching where a column has a string not working for decimals

For example, if I want to keep only those rows of the data mtcars where the variable qsec contains this decimal .50, following the solutions given here, I use:
mtcars_stringed<-mtcars%>%filter(str_detect(qsec, ".50"))
mtcars_stringed<-mtcars[mtcars$qsec %like% ".50", ]
mtcars_stringed <- mtcars[grep(".50", mtcars$qsec), ]
View(mtcars_stringed)
Surprisingly, all these strategies fail, returning no rows, even though mtcars$qsec has values containing .50, such as 14.50 and 15.50.
Any alternative solution, or is there something I am missing? Thanks in advance.
When you treat a numeric as a string, it is converted with as.character(mtcars$qsec). If you look at that, you'll see that trailing 0s are dropped in the conversion, so we get, e.g., "14.5", "15.5".
It will work if you use the regex pattern "\\.5$": the \\ makes the . a literal dot rather than "any character", and the $ matches the end of the string.
mtcars %>% filter(str_detect(qsec, "\\.5$"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# 2 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
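The conversion can be checked directly in base R. This is a small sketch on three sample values rather than the full column; the variable names are only for illustration.

```r
# Three sample qsec-style values, two of them written with a trailing zero
qsec <- c(14.50, 15.50, 17.02)

# as.character() drops the trailing zero: "14.5" "15.5" "17.02"
as_text <- as.character(qsec)

# The pattern from the question finds nothing in the converted strings
hits_50 <- grepl(".50", as_text)

# An escaped dot anchored at the end of the string finds the two .5 values
hits_5 <- grepl("\\.5$", as_text)
```

hits_50 is FALSE for all three values, while hits_5 is TRUE for the first two only.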
However, in general, treating decimals as strings can be risky. A better approach might be to get rid of the integer part with %% 1 and then test for nearness to 0.5 within some tolerance, which avoids precision issues.
mtcars %>% filter(abs(qsec %% 1 - 0.5) < 1e-10)
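If matching on the text ".50" is still preferred, one option is to format the numbers with a fixed number of decimals first, so the trailing zero is kept. The sketch below, in base R, compares that with the tolerance test; it assumes two decimals is the right precision for the data, and note that sprintf rounds, so a value such as 14.504 would also be formatted as "14.50".

```r
data(mtcars)

# Tolerance test on the fractional part, as in the filter() call above
by_tolerance <- mtcars[abs(mtcars$qsec %% 1 - 0.5) < 1e-10, ]

# Fixed two-decimal formatting keeps the trailing zero, so ".50" can be matched
qsec_text <- sprintf("%.2f", mtcars$qsec)
by_format <- mtcars[grepl("\\.50$", qsec_text), ]
```

On mtcars both approaches select the same two rows, those with qsec 14.5 and 15.5.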
You are probably looking for:
mtcars %>%
filter(qsec %% 0.50 == 0 & qsec %% 1 != 0)
mpg cyl disp hp drat wt qsec vs am gear carb
1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
2 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

Fetch values of other columns which have max value in given column

I can find the max values of the columns disp and hp in the mtcars dataset using sapply, which gives 472 and 335 respectively:
sapply(list(mtcars$disp,mtcars$hp), max, na.rm=TRUE)
Now I want cyl for these values, i.e. the cyl of the cars where the maximum values returned by sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE) are found.
Which function should I be using? I tried unsuccessfully with which, rownames and colnames:
mtcars(which(sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE)))
rownames(which(sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE))))
mtcars$cyl(sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE))
library(dplyr)
filter(mtcars, hp==max(hp) | disp == max(disp))$cyl
And the data.table solution is:
require(data.table)
mtcars <- as.data.table(mtcars)
mtcars[hp==max(hp) | disp==max(disp)]
mpg cyl disp hp drat wt qsec vs am gear carb
1: 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
2: 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8
# if you want to get one column, e.g. 'cyl'
mtcars[hp==max(hp) | disp == max(disp), cyl]
[1] 8 8
# if you want to get several columns, do either of:
mtcars[hp==max(hp) | disp == max(disp), .(cyl,qsec)]
mtcars[hp==max(hp) | disp == max(disp), list(cyl,qsec)]
cyl qsec
1: 8 17.98
2: 8 14.60
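For completeness, the same lookup can be sketched in base R by staying close to the sapply call from the question. The names cols and cyl_at_max are only illustrative, and which.max returns just the first row if several cars tie for the maximum.

```r
data(mtcars)

# Columns whose maxima we want to trace back to a cyl value
cols <- c("disp", "hp")

# For each column, find the row of its maximum and read cyl from that row
cyl_at_max <- sapply(cols, function(v) mtcars$cyl[which.max(mtcars[[v]])])
```

The result is a named vector with one cyl value per column, which for mtcars is 8 in both cases, matching the output above.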

For loop over a List of Data frames

I'm using the mtcars dataset in R. I have a list of data frames (mtcars dataset split into number of cylinders). I need to:
Identify the car with the min value for miles per gallon (mpg) within each cylinder type (i.e. 4,6,8).
Create a vector that stores the values of horsepower (hp) for each of the cars found in step 1 (the length of the vector will be 3).
Steps I have performed so far, as follows:
# load the data
data(mtcars)
# split cars data.frame into a list of data frames by cylinder
cars <- split(mtcars, mtcars$cyl)
# find the position within each data frame for the min values of mpg (i.e. first
# column)
positions <- sapply(cars,function(x) which.min(x[,1]))
As I see it, the next step would be to make a loop over each data frame to find the horsepower value for each position. I have tried to make a For loop for this, but I haven't been able to make it work. Maybe there's even a better solution for this problem.
You don't need to split the data and then use sapply. There are much more efficient ways to reach that output. Here's a possible data.table solution:
mtcars$Cars <- rownames(mtcars)
library(data.table)
data.table(mtcars)[, list(Car = Cars[which.min(mpg)],
HP = hp[which.min(mpg)]),
by = cyl]
# cyl Car HP
# 1: 6 Merc 280C 123
# 2: 4 Volvo 142E 109
# 3: 8 Cadillac Fleetwood 205
Or maybe using dplyr
library(dplyr)
mtcars %>%
mutate(Cars = rownames(mtcars)) %>%
group_by(cyl) %>%
summarize(Car = Cars[which.min(mpg)], HP = hp[which.min(mpg)])
# Source: local data frame [3 x 3]
#
# cyl Car HP
# 1 4 Volvo 142E 109
# 2 6 Merc 280C 123
# 3 8 Cadillac Fleetwood 205
From the pre-split cars set, you can do it this way with Map and Reduce.
> Reduce(rbind,
Map(function(x) x[which.min(x$mpg), "hp", drop = FALSE],
cars, USE.NAMES = FALSE)
)
hp
# Volvo 142E 109
# Merc 280C 123
# Cadillac Fleetwood 205
If you wanted a vector, you can assign the above code to a variable, say rr, and do
> setNames(rr[,1], rownames(rr))
# Volvo 142E Merc 280C Cadillac Fleetwood
# 109 123 205
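Since the question asked specifically about a for loop over the list, here is a sketch that continues from the split step in the question, followed by the equivalent sapply one-liner. The names hp_loop and hp_sapply are only illustrative.

```r
data(mtcars)

# Split into a list of data frames by cylinder, as in the question
cars <- split(mtcars, mtcars$cyl)

# For loop: preallocate the result, then fill one element per data frame
hp_loop <- numeric(length(cars))
names(hp_loop) <- names(cars)
for (i in seq_along(cars)) {
  df <- cars[[i]]
  hp_loop[i] <- df$hp[which.min(df$mpg)]
}

# Same result without an explicit loop
hp_sapply <- sapply(cars, function(df) df$hp[which.min(df$mpg)])
```

Both give a vector of length 3 named by cylinder count. As with the other which.min answers, only the first car is used when several share the minimum mpg.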
This is really easy if you use the plyr library. Here ya go:
library(plyr)
data(mtcars)
mpMins <- ddply(mtcars, .(cyl),summarize, min = min(mpg), .drop = FALSE)
mpMins
cyl min
1 4 21.4
2 6 17.8
3 8 10.4
This only gives you the minimum value of mpg, though; you want the horsepower too:
hpMins <- (merge(mpMins, mtcars, by.x = c("min","cyl"), by.y = c("mpg","cyl" )))$hp
hpMins
[1] 205 215 123 109
Strange, there are four values. You said you wanted three. If you go back and check the data, though, there are two cars tied at the minimum of 10.4 in the 8-cylinder category. Remember to be careful when going from summary values (like minimums) back to individual observations.
