Extract multiple values from a dataset by subsetting with a vector - r

I have a data frame called "Navi", with 72 rows that describe all possible combinations of three variables f,g and h.
head(Navi)
          f  g    h
1  40.00000 80 0.05
2  57.14286 80 0.05
3  74.28571 80 0.05
4  91.42857 80 0.05
5 108.57143 80 0.05
6 125.71429 80 0.05
I have a dataset that also contains these 3 variables f,g and h along with several others.
head(dataset1[,7:14])
# A tibble: 6 x 8
      h     f     g    L1    L2   Ref1  Ref2    FR
  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
1  0.02    20   100    53    53 0.501   2.00   2
2  0.02    20   260    67    67 0.200   5.01   5.2
3  0.02    20   420    72    71 0.128   7.83   8.4
4  0.02    20   580    72    72 0.0956 10.5   11.6
5  0.02    20   740    73    73 0.0773 12.9   14.8
6  0.02    20   900    72    71 0.0655 15.3   18
What I'm trying to do is:
for each row in the combinations data frame, filter the dataset by the three variables f,g and h.
Then, if there are exact matches, take the matching rows of the dataset, extract the values in the columns "L1" and "FR", and calculate their averages. Save the average values in the vectors "L_M2" and "FR_M2".
If there aren't exact matches, take the rows where f, g and h in the dataset are closest to f, g and h from the data frame. Then extract all values for "L1" and "FR" in these rows and calculate the averages. Save the average values in the vectors "L_M2" and "FR_M2".
What I've already tried:
I created two empty vectors where the extracted values shall be saved later on.
Then I am looping over every row of the combinations data frame, filtering the dataset by f,g and h.
The result would be multiple rows, where the values for f,g and h are the same in the dataset as in the row of the combinations data frame.
L_M2 <- vector()
FR_M2 <- vector()
matchingRows <- list()
for (i in 1:nrow(Navi)) {
  matchingRows[[i]] <- dataset1[dataset1$P == input$varP
                                & dataset1$Las == input$varLas
                                & dataset1$Opt == input$varO
                                & dataset1$f == Navi[i, 1]
                                & dataset1$g == Navi[i, 2]
                                & dataset1$h == Navi[i, 3], ]
}
The thing is, I don't know what to do from here on. I don't know how to check for rows with closest values by multiple variables, if there are no exact matches...
I only did something more or less similar in the past, but then I only checked for the closest "g" value, like this:
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:nrow(Navi)) {
  matchingRows <- dataset1[dataset1$P == input$varP
                           & dataset1$Las == input$varLas
                           & dataset1$Opt == input$varO
                           & dataset1$f == Navi[i, 1]
                           & dataset1$h == Navi[i, 3], ]
  Differences <- abs(Navi[i, 2] - matchingRows$g)
  indexofMin <- which.min(Differences)
  L_M2 <- append(L_M2, matchingRows$L[[indexofMin]], after = length(L_M2))
  FR_M2 <- append(FR_M2, matchingRows$FR[[indexofMin]], after = length(FR_M2))
}
So can anybody tell me how to achieve this extraction process? I am still pretty new to R, so please tell me if I made a rookie mistake or forgot to include some crucial information. Thank you!

First convert your data into a data frame (if not done before).
Navi <- data.frame(Navi)
Savi <- data.frame(dataset1[,7:14])
Then use merge to filter your lines:
df1 <- merge(Navi, Savi, by = c("f","g","h"))
Save "L1" and "FR" average from df1:
Average1 <- ((df1$L1+df1$FR)/2)
This gets you a new Navi data frame containing the rows that do not have an exact match on the f, g, h columns:
Navi_new <- Navi[!duplicated(rbind(df1, Navi))[-seq_len(nrow(df1))], ]
To find the nearest match for each variable:
A1 <- vapply(Navi_new$f, function(x) x - Savi$f, numeric(nrow(Savi)))
A2 <- apply(abs(A1), 2, which.min)
B1 <- vapply(Navi_new$g, function(x) x - Savi$g, numeric(nrow(Savi)))
B2 <- apply(abs(B1), 2, which.min)
C1 <- vapply(Navi_new$h, function(x) x - Savi$h, numeric(nrow(Savi)))
C2 <- apply(abs(C1), 2, which.min)
A2, B2 and C2 now hold the row indices of the nearest matches in Savi; use them to pull "L1" and "FR" and average them as three steps back.
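Note that matching each variable separately can land on different rows for f, g and h. A minimal sketch that treats the three variables jointly is below; the Navi/Savi values are toy stand-ins (an assumption, not data from the question), and the variables are range-scaled so all three contribute comparably to the distance:

```r
# Hypothetical toy data standing in for Navi and Savi
Navi <- data.frame(f = c(40, 57.1), g = c(80, 80), h = c(0.05, 0.05))
Savi <- data.frame(f = c(20, 20), g = c(100, 260), h = c(0.02, 0.02),
                   L1 = c(53, 67), FR = c(2, 5.2))

# Range-scale against Savi; the epsilon guards against zero-range columns
scale_rng <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref) + 1e-9)

# For each Navi row, pick the Savi row with the smallest joint distance
nearest_idx <- sapply(seq_len(nrow(Navi)), function(i) {
  d <- (scale_rng(Savi$f, Savi$f) - scale_rng(Navi$f[i], Savi$f))^2 +
       (scale_rng(Savi$g, Savi$g) - scale_rng(Navi$g[i], Savi$g))^2 +
       (scale_rng(Savi$h, Savi$h) - scale_rng(Navi$h[i], Savi$h))^2
  which.min(d)
})

L_M2  <- Savi$L1[nearest_idx]
FR_M2 <- Savi$FR[nearest_idx]
```

Whether a range-scaled Euclidean distance is the right notion of "closest" for your data is a modelling choice; any other scaling can be swapped in.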

Related

sum function in R data frame not capturing correct data

I have a dataframe that looks like this:
channel start.time stop.time vp duration X id overlaps
1: 4_speech 14.183 16.554 CH1 2.371 NA 165 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825
2: 4_speech 21.196 22.259 CH1 1.063 NA 166 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485
3: 4_speech 28.001 31.518 CH1 3.517 NA 167 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519
4: 4_speech 34.867 36.549 CH1 1.682 NA 168 2_body_CH1_3.308
5: 4_speech 41.019 42.265 CH1 1.246 NA 169 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288
6: 4_speech 55.262 57.800 CH1 2.538 NA 170 2_body_CH1_2.494;2_body_N_6.571
The first 6 columns show information about a particular observation, the 7th column 'overlaps' shows a list of other observations from a different data-frame that co-occur with the observations in this data frame. Each of the observations in the overlaps column is structured like this: 'channel_vp_duration'. For example, the first observation in 'overlaps' in row 1 shows that '1_hands' is the channel, 'CH1' is vp (a kind of value), and the 1.145 the duration of that observation.
I want a sum of all the durations for a given observation type. I can sort of get this with the following code that was adapted from an answer provided by a stack user on a question I previously asked about how to get the overlaps data in the first place.
library(data.table)
library(stringr)
setDT(speech_rows)
speech_rows[, id := .I]
setkey(speech_rows, id)
#self join on subset by row
speech_rows[speech_rows, durs := {
  temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time & channel == "1_hands", ]
  sum(temp$duration)
}, by = .EACHI]
This adds another columns 'durs' which is supposed to show the total duration of all the numeric values attached to a '1_hands' string in the overlaps column. Thus producing the following (first 6 columns removed to save space):
overlaps durs
1: 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825 0.000
2: 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485 1.417
3: 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519 1.750
4: 2_body_CH1_3.308 0.000
5: 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288 5.559
6: 2_body_CH1_2.494;2_body_N_6.571 0.000
But there is a problem: the sum() function does not capture all of the relevant strings. In row 1, there is the string "1_hands_CH1_1.145"; it is the only '1_hands' string in that row, so the value under durs for row 1 should be 1.145. But the function ignores it for some reason. In row 2, the durs sum is correct. In row 3, it counts only one of the 1_hands values and ignores the other. In row 5, it correctly finds both of the 1_hands values and adds them together. Rows 4 and 6 have correct 'durs' values because there are no 1_hands observations in them.
This is very strange, and I don't know why it correctly detects the numeric values at some times but not at others. This is problem #1.
Problem #2: I cannot specify what I want beyond '1_hands', what I really want to do is get the sum of durations for all 1_hands_CH1 values, NOT all 1_hands values. To do this, I assume that you would just need to change the strings in 'channel == 1_hands'
temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time & channel == **"1_hands"**, ]
But if I change it to something like "1_hands_CH1", all of the durs values will be zero; it can't match anything past '1_hands'.
So in sum, I want to know why the math isn't working like I want it to, and why I can't select more specific strings.
Here is one way you could get durations out of your overlaps column using the tidyverse. You can set text_string equal to what you want durations for. I have provided some examples of how to enter your text string. The example below returns durations for all "1_hands" observations. If you wanted durations just for the "1_hands_CH1", then you would just set text_string <- "1_hands_CH1".
# Load tidyverse
library(tidyverse)
# Set text_string Equal To Specific String You Want Durations For
text_string <- "1_hands_[A-Z0-9]+"
# Examples For text_string
# text_string <- "1_hands_CH1" ## example for getting 1_hands_CH1
# text_string <- ""2_body_N" ## example for getting 2_body_N
# text_string <- "1_hands_[A-Z0-9]+" ## example for getting all 1_hands
# text_string <- "2_body_[A-Z0-9]+" ## example for getting all 2_body
# df With Durations
df_with_durs <- df %>%
  as_tibble() %>%
  mutate(str_matches = str_match_all(overlaps, str_glue("{text_string}_[0-9.]+")),
         durs = map(str_matches,
                    function(x) {
                      durs <- str_remove(x, str_glue("{text_string}_"))
                      num_durs <- as.numeric(durs)
                      sum_durs <- sum(num_durs)
                      return(sum_durs)
                    })) %>%
  unnest(cols = durs) %>%
  select(-str_matches)
# View Output
df_with_durs
# channel start.time stop.time vp duration X id overlaps durs
# <chr> <dbl> <dbl> <chr> <dbl> <lgl> <int> <chr> <dbl>
# 4_speech 14.2 16.6 CH1 2.37 NA 165 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825 1.14
# 4_speech 21.2 22.3 CH1 1.06 NA 166 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485 1.42
# 4_speech 28.0 31.5 CH1 3.52 NA 167 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519 5.31
# 4_speech 34.9 36.5 CH1 1.68 NA 168 2_body_CH1_3.308 0
# 4_speech 41.0 42.3 CH1 1.25 NA 169 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288 5.56
# 4_speech 55.3 57.8 CH1 2.54 NA 170 2_body_CH1_2.494;2_body_N_6.571 0
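For comparison, the same extraction can be sketched in base R with regmatches()/gregexpr(), without the tidyverse. The overlaps strings below are taken from the example rows above, and the prefix follows the same channel_vp naming:

```r
# Toy overlaps cells copied from the question's example rows
overlaps <- c("1_hands_CH1_1.145;2_body_CH1_1.883",
              "1_hands_CH1_3.557;1_hands_CH2_1.75")

# Sum all durations in one cell that start with the given prefix
sum_durs <- function(s, prefix) {
  m <- regmatches(s, gregexpr(paste0(prefix, "_[0-9.]+"), s))[[1]]
  sum(as.numeric(sub(paste0(prefix, "_"), "", m)))
}

sapply(overlaps, sum_durs, prefix = "1_hands_CH1")
```

Because the prefix is matched literally before the trailing number, "1_hands_CH1" will not pick up "1_hands_CH2" entries, which addresses problem #2; cells with no match sum to 0.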

Conditionally add values to a new column and replace values in the conditioning column in R

I am working on a project where I need to read files into my environment and afterwards, based on each row's name, change a value and add new values in new columns, i.e.:
 X1  Area   Mean Min Max file_row_name
 55 0.165 31.384   4  82 ./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv
 56 0.097 45.867   4 121 ./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv
168 0.042 28.252  20  49 ./Fluorescence Analysis/T0-25-50-10x-1-1.csv
So in the example, I want to look at each row's file_row_name and, for rows that share the same name, create two new variables (Conc and Rep) and replace the value in file_row_name, so it looks like this:
 X1  Area   Mean Min Max file_row_name Conc Rep
 55 0.165 31.384   4  82 T0            12.5   3
 56 0.097 45.867   4 121 T0            12.5   3
168 0.042 28.252  20  49 T0              25   1
So far what I've done is:
my_df$Conc[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- 12.5
my_df$Rep[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- 3
my_df$file_row_name[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- "T0"
my_df$Conc[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- 12.5
my_df$Rep[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- 3
my_df$file_row_name[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- "T0"
But this takes too long and when I try an if clause:
if (my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
                               "./Fluorescence Analysis/T0-12.5-150-10x-3.csv")) {
  my_df$Conc = "12.5"
  my_df$Rep = 3
  my_df$file_row_name = "T0"
}
it tells me that:
Warning message:
In if (my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv", :
the condition has length > 1 and only the first element will be used
And if I manage to bypass that warning message with another code piece, basically the columns file_row_name Conc and Rep get replaced with the same value and nothing is changed based on condition.
Instead of if (which is not vectorized), create a logical row index and use it to assign:
i1 <- my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
                                 "./Fluorescence Analysis/T0-12.5-150-10x-3.csv")
my_df[i1, c("Conc", "Rep", "file_row_name")] <- list("12.5", 3, "T0")
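Since Conc and Rep appear to be encoded in the file name itself, a more general sketch could parse them out instead of enumerating every file. This assumes (hypothetically; only the question's example names support it) that names always follow the Tn-Conc-...-Rep-... pattern, with the concentration in field 2 and the replicate in field 5:

```r
# Hypothetical sketch: recover the time label, Conc and Rep from one file name,
# assuming names like "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"
f <- "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"

# Drop the directory and ".csv", then split on "-"
parts <- strsplit(sub("\\.csv$", "", basename(f)), "-")[[1]]
time_label <- parts[1]              # "T0"
conc       <- as.numeric(parts[2])  # 12.5
rep_no     <- as.integer(parts[5])  # 3
```

Applied over the whole column (e.g. with vapply() or inside a mutate()), this avoids one assignment block per file; whether the field positions are stable across all your files is an assumption you would need to check.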

How to prevent R from rounding in frequency function?

I used the freq function of the frequency package to get frequency percentages on my dataset$MoriskyAdherence, but R gives me percent values rounded to one decimal place. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3 and 23.8, which sum to 100.1. The exact amounts should be 35.00, 41.25 and 23.75.
What should I do?
I used sprintf, as.data.frame, formatC, and some other functions to deal with it, but without success.
The function freq returns a character data frame and has no option to adjust the number of decimal places. However, it is easy to recreate the table in whatever form you want. For example, I have written this function, which will give you the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
  df <- frequency::freq(data_frame)
  lapply(df, function(x)
  {
    n <- suppressWarnings(as.numeric(x$Freq))
    sum_all <- as.numeric(x$Freq[nrow(x)])
    raw_percent <- suppressWarnings(100 * n / sum_all)
    t_row <- grep("Total", x[, 2])[1]
    valid_percent <- suppressWarnings(100 * n / as.numeric(x$Freq[t_row]))
    x$Percent <- format(round(raw_percent, 2), nsmall = 2)
    x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
    x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
    x$'Cumulative Percent'[t_row:nrow(x)] <- ""
    x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
    return(x)
  })
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.
Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without round brackets), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim) {
  makefreqs(x1, y, maxrow1, trim1)
})
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a dataframe with two variables, one numeric, one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
   Var1 Var2
1     B   97
2     D   51
3     B   71
4     D   62
5     D   19
6     A   91
7     C   32
8     D   13
9     C   39
10    B   96
Then, to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
  Var1     Var2
1    A 15.93695
2    B 46.23468
3    C 12.43433
4    D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(Var2)*100,3))
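If all you need is a percentage table with controlled decimals, base R's prop.table() over table() is another minimal sketch. The counts below mirror the 56/66/38 split shown in the freq output above:

```r
# Toy vector reproducing the counts from the question's freq output
x <- c(rep("High Adherence", 56),
       rep("Low Adherence", 66),
       rep("Medium Adherence", 38))

# Percentages with two decimals: 35, 41.25 and 23.75
pct <- round(100 * prop.table(table(x)), 2)
```

Use format(pct, nsmall = 2) if you also want trailing zeros (e.g. "35.00") in the printed output.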

How can I most efficiently set 0 vals to NA in a subset of columns? [duplicate]

This question already has answers here:
How to replace NA values in a table for selected columns
(12 answers)
Closed 6 years ago.
I have a book on statistics (using R) showing the following:
> pima$diastolic [pima$diastolic == 0] <- NA
> pima$glucose [pima$glucose == 0] <- NA
> pima$triceps [pima$triceps == 0] <- NA
> pima$insulin [pima$insulin == 0] <- NA
> pima$bmi [pima$bmi == 0] <- NA
Is there a way to do it in one line or more efficiently? I see there are functions such as with, apply and subset for doing similar things, but I could not figure out how to put them together.
Sample data (also: how do I read this in as a data frame, like Python's StringIO?):
  pregnant glucose diastolic triceps insulin  bmi diabetes age     test
1        6     148        72      35       0 33.6    0.627  50 positive
2        1      85        66      29       0 26.6    0.351  31 negative
3        8     183        64       0       0 23.3    0.672  32 positive
4        1      89        66      23      94 28.1    0.167  21 negative
5        0     137        40      35     168 43.1    2.288  33 positive
6        5     116        74       0       0 25.6    0.201  30 negative
Something like this: use lapply() to apply a function to every column. In the function, test whether the column is numeric; if it is, replace zeros with NA, then return the column:
pima[] <- lapply(pima, function(x) { if (is.numeric(x)) x[x == 0] <- NA; x })
Or for predefined columns
cols = c("diastolic", "glucose", "triceps", "insulin", "bmi")
pima[cols] <- lapply(pima[cols], function(x) {x[x==0] <- NA ; x})
Or using is.na<-
is.na(pima[cols]) <- pima[cols] == 0
Using data.table you can try
for (col in c("diastolic","glucose","triceps","insulin", "bmi")) pima[(get(col))==0, (col) := NA]
More details here: How to replace NA values in a table *for selected columns*? (data.frame, data.table)
Using dplyr, you could do:
# banal function definition
zero_to_NA <- function(col) {
  # any code that works here
  # I chose this because it is concise and efficient
  `is.na<-`(col, col == 0)
}
# Assuming you want to change 0 to NA only in these 3 columns
pima <- pima %>%
mutate_each(funs(zero_to_NA), diastolic, glucose, triceps)
Or you could skip the function definition and write directly:
pima <- pima %>%
  mutate_each(funs(`is.na<-`(., . == 0)),
              diastolic, glucose, triceps)
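Note that mutate_each() and funs() have since been deprecated; in current dplyr (1.0+) the same idea is written with across(). A minimal sketch under that assumption, using toy data in place of pima:

```r
library(dplyr)

# Toy data standing in for pima
pima <- data.frame(diastolic = c(72, 0), glucose = c(148, 0), bmi = c(33.6, 0))

# Replace zeros with NA only in the named columns; bmi is left untouched
pima <- pima %>%
  mutate(across(c(diastolic, glucose), ~ replace(.x, .x == 0, NA)))
```

The same call also accepts tidyselect helpers, e.g. across(where(is.numeric), ...) to hit every numeric column at once.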

Sorting a dataframe by multiple columns

I want to sort a data frame so that the team names in the "teams" column stay with their values in the "ratings" column. For example, ne has the highest rating, 13.62, so I need both "ne" and "13.62" to be sorted into the first position. Here is some of my code:
x <-t(nfl_data)
y <- solve(x)
myfun = function(i) round( (1/13)*(sum(x[,i])) + mean(y[,i]), digits=2 )
ratings = numeric(32)
for (i in 1:32){
ratings[i] = myfun(i)
}
teams <- c('ari','atl','bal','buf','car','chi','cin','cle','dal',
'den','det','grn','hou','ind','jac','kc','mia','min','ne',
'no','nyj','nyg','oak','phi','pit','sd','sea','sf','stl',
'tb','tn','was')
df <- data.frame(teams,ratings)
df[with(df, order(teams, -ratings)), ]
Here is the sample output of df:
  teams ratings
1   ari   -3.73
2   atl    9.46
3   bal    2.31
4   buf   -5.46
5   car   -0.69
6   chi    7.57
7   cin    6.69
8   cle   -4.23
I get the same results if I try running the ordered data frame code. What am I doing wrong?
Sort on the ratings column only. Putting teams first in order() sorts alphabetically by team, so -ratings only ever acts as a tie-breaker and never reorders anything:
df[with(df, order(-ratings)), ]
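A self-contained toy example of the fix (team names and ratings here are a three-row stand-in for the full 32-team df):

```r
# Toy data standing in for the full data frame
df <- data.frame(teams = c("ari", "atl", "ne"),
                 ratings = c(-3.73, 9.46, 13.62))

# Order rows by descending rating; each team stays with its rating
df_sorted <- df[order(-df$ratings), ]
df_sorted$teams[1]  # "ne", the highest-rated team, is now first
```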
