R limit output of dataframe?

I have a data frame of transactions.
I am using dplyr to filter the transactions by gender.
Gender in my case is 0 or 1.
I want to select 2 rows: one with Gender == 0 and one with Gender == 1.
The closest I got was this:
df %>% arrange(Gender)
and then manually picking 2 transactions in the middle, where one has Gender 1 and the other 0.
Please advise.

To randomly sample a value where a condition on another column is satisfied, you can use sample like this:
# Dummy data: X = value of interest, G = Gender (0,1)
df1 <- data.frame("X" = rnorm(10, 0, 1), "G" = sample(c(0,1), replace = TRUE, size = 10))
# Sampling
sample(df1[,'X'][df1[,'G'] == 1], size = 1)
sample(df1[,'X'][df1[,'G'] == 0], size = 1)
This takes one value of X for each gender (the condition on G is set by [df1[,'G'] == 1]).
Building on the comment by docendo discimus, you can also use the popular dplyr package with the script below. Note that in my test it ran considerably slower (about 5 times slower, with 3M rows and 1000 iterations) than the sample approach above:
library(dplyr)
pull(df1 %>% group_by(G) %>% sample_n(1), X)
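The question asks for whole rows rather than single values, so here is a minimal base R sketch of that variant. The data frame is a hypothetical stand-in, and the names picked and one_per_gender are only illustrative.

```r
set.seed(1)  # only so the sketch is repeatable
# Hypothetical stand-in for the transactions data
df1 <- data.frame(X = rnorm(10), G = rep(c(0, 1), each = 5))

# Split the row numbers by gender, then draw one row number per group.
# sample.int() on the group size avoids the sample() quirk where a single
# number n (n >= 1) is treated as the range 1:n.
picked <- vapply(
  split(seq_len(nrow(df1)), df1$G),
  function(rows) rows[sample.int(length(rows), 1)],
  integer(1)
)

one_per_gender <- df1[picked, ]  # one complete row per gender
one_per_gender
```

Indexing with row numbers keeps every column of the sampled transactions, which the value-only sample() calls above do not.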

Related

Loop only running for the last iteration in R - Looping over participants

I am very new to R and I am trying to run a loop, so any help is greatly appreciated.
I have longitudinal data with multiple timepoints for each participant (shown in the attached picture).
I need to replace the NA values with the value from the row where the Years variable is equal to 0, and I want to write a loop to do this for each participant. I have written some code which seems to work, however it only gives output for the last iteration of the loop (the last participant). This is the code I am using:
x <- c(1:4)
n = length(x)
for(i in 1:n)
{
  data <- subset(df, ID %in% c(x[i]))
  data$outcome <- ifelse(is.na(data$outcome),
                         data[1,3],
                         data$outcome)
}
Using this code, the output gives only the last iteration (i.e. in this case, ID 4). I need to complete this for all IDs.
Any help is much appreciated! Thank you.

I'm not 100% clear on your intent, but this will, within each ID, fill all missing outcome values with the (first) outcome value from a row where Years == 0.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(outcome = coalesce(outcome, first(outcome[Years == 0])))
Obviously untested, but if you provide some sample data I'll happily help debug.
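For checking the idea without dplyr, the same per-ID fill can be sketched in base R. The toy data and the names toy and baseline are made up for illustration, and the sketch assumes each ID has at least one row with Years == 0 (the first such row is used).

```r
# Hypothetical data: two participants, three timepoints each
toy <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2),
  Years   = c(0, 1.5, 3, 0, 1.5, 3),
  outcome = c(5, NA, NA, 7, NA, 8)
)

# Baseline rows are those with Years == 0; match() looks up, for every row,
# the position of its ID among the baseline rows (first match wins).
is_base  <- toy$Years == 0
baseline <- toy$outcome[is_base][match(toy$ID, toy$ID[is_base])]

# Keep observed outcomes, fill missing ones with the participant's baseline
toy$outcome <- ifelse(is.na(toy$outcome), baseline, toy$outcome)
toy
```

Because the lookup is vectorised, no loop over participants is needed.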

Your loop overwrites data (and with it data$outcome) on each iteration. That is why you only get the last result.
Here's my inelegant solution:
Making sample data to match yours (not including the unused column):
my_dat <- data.frame("years" = sample(c(0, 1.5, 3), 30, replace = TRUE),
                     "outcome" = as.numeric(sample(c("", 1, 2), 30, replace = TRUE)))
Find which rows have both years equal to 0 and a missing outcome (note the &: with * the expression would be parsed as years == (0 * is.na(outcome)) and select every row where years is 0):
my_index <- my_dat$years == 0 & is.na(my_dat$outcome)
Assign 0 to replace NA:
my_dat$outcome[my_index] <- 0
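If you would rather keep the loop from the question, the overwrite problem can be avoided by storing each participant's result in a list and binding the pieces once at the end. This is only a sketch on made-up data; the names ids, results, part and filled are illustrative, and it assumes every ID has a row with Years == 0.

```r
# Hypothetical data: four participants, three timepoints each
df <- data.frame(
  ID      = rep(1:4, each = 3),
  Years   = rep(c(0, 1.5, 3), times = 4),
  outcome = c(5, NA, NA,  7, NA, 8,  1, 2, NA,  4, NA, NA)
)

ids     <- unique(df$ID)
results <- vector("list", length(ids))  # one slot per participant

for (i in seq_along(ids)) {
  part     <- df[df$ID == ids[i], ]
  base_val <- part$outcome[part$Years == 0][1]  # outcome at Years == 0
  part$outcome[is.na(part$outcome)] <- base_val
  results[[i]] <- part                           # keep this iteration's result
}

filled <- do.call(rbind, results)  # combine all participants
filled
```

The only change from the original loop is that each iteration writes into its own list slot instead of reusing a single variable.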

A simpler tidyverse method:
library(tidyverse)
df %>%
  filter(ID %in% x) %>%
  mutate(outcome = ifelse(is.na(outcome), Years, outcome))

Your question could do with some clarification and a reproducible example. As I understand it from "I need to replace the NA values with the values from when the Years variable is equal to 0": if outcome is NA and Years equals 0, you want outcome to equal 0?
set.seed(1984) # set the seed so that my_dat is the same each time
# using a modified df from markhogue's answer...
my_dat <- data.frame(
  ID = 1:30,
  years = sample(c(0, 1.5, 3), 30, replace = TRUE),
  outcome = as.numeric(sample(c("", 1, 2), 30, replace = TRUE))
)
my_dat # have a look at rows 9 and 22
# ifelse with two conditions: years == 0 and is.na(outcome)
my_dat$outcome <- ifelse(my_dat$years == 0 & is.na(my_dat$outcome), my_dat$years, my_dat$outcome)
my_dat # have a look at rows 9 and 22
Let me know if this is what you need :)

fuzzy comparing names in R - how to find highest possible sum in a matrix (with boundary conditions)

I have a matrix with similarity scores that looks like this:
I need to find the highest possible sum of scores in this matrix. The sums have to fulfill a condition though:
If a number has been used for the sum, no numbers of its row or column or any previous rows or columns can be used for the sum anymore. This is because the order of the names matters.
I can start at any number, but all the values to the left and above that value will then be disqualified for the rest of that sum, as well as the values in the same row and column.
The highest possible sum with this system is 130 (10 + 100 + 10 + 10). That's the number I want in the end.
My strategy at the moment is to calculate all possible sums, and then simply select the highest. But how do I code the condition that I described above? Does it make sense?
Here are more examples of allowed (green) and not allowed (red) sums:
Another example of a valid sum:
I start at the upper left corner. I choose 10.
I cannot add the 12 or the 11 that are in the same column anymore.
I choose one of the remaining numbers: 12.
I can no longer choose 100, 11, 10, or 25 from that column and row (and the previous columns and rows). I can now only choose 22 or 10 for the last number to add.
If I pick 22, my total sum is 44. If I pick 10, my total sum is 32.
The reason I'm using this system is that I'm trying to create an algorithm that compares full names of people and assigns a probability that they're the same person, purely based on name information.
My current code looks like this:
library(tidyverse)
library(stringdist)
string.compare <- function(Var1, Var2){
  string1 <- Var1 %>% tolower() %>% trimws() %>% str_replace_all(pattern = "[[:punct:]]", replacement = "") %>% strsplit(" ") %>% unlist()
  string2 <- Var2 %>% tolower() %>% trimws() %>% str_replace_all(pattern = "[[:punct:]]", replacement = "") %>% strsplit(" ") %>% unlist()
  compare <- array(NA,
                   dim = c(length(string1), length(string2)),
                   dimnames = list(string1, string2))
  compare[] <- do.call(mapply,
                       c(list(FUN = string.score),
                         expand.grid(dimnames(compare), stringsAsFactors = FALSE)))
  sums <- func_calc_sums(compare) # This is where I need help. How to write this function?
  output(max(sums))
}
string.score <- function(Var1, Var2){
  phonetic.weight <- 50 # this is an important variable. it determines the weight of the phonetic comparison. 100 = no weight, 0 = phonetic is all that matters.
  if(is.null(Var1) | is.null(Var2) | is.na(Var1) | is.na(Var2) | Var1 == "" | Var2 == ""){ # if one of the entries is empty, score 0
    return(0)
  } else if(Var1 == substr(Var2, 1, 1)){ # if Var1 is an abbreviation of Var2, score 10
    return(10)
  } else if(nchar(Var1) == 1){ # if Var1 is an abbreviation but not of Var2, score 0
    return(0)
  } else if(Var2 == substr(Var1, 1, 1)){ # if Var2 is an abbreviation of Var1, score 10
    return(10)
  } else if(phonetic(Var1) == phonetic(Var2)){ # If Var1 and Var2 are phonetically similar, give score based on stringdist
    return(round(100 - (phonetic.weight * stringdist(Var1, Var2, method = "osa") / nchar(Var1)), 0))
  } else { # If Var1 and Var2 are not phonetically similar, give a score based on stringdist but lower
    return(round(100 - (100 * stringdist(Var1, Var2, method = "osa") / nchar(Var1)), 0))
  }
}
If you enter, for example, Var1 <- " a. michelle hernandes s. " and Var2 <- " Alexa michelle h. sanchez" and then run them through the function string.compare (the function is not finished, so you'll have to execute the code line by line), it will first clean up the strings and then split them into separate words.
Those words are assigned as the rownames and colnames of a matrix, and the scoring function string.score is run over every cell. You then end up with the matrix at the beginning of this post.

I have two suggestions that might be helpful:
First, there is an R implementation of the Needleman-Wunsch algorithm (which was mentioned by Roman Cheplyaka) on GitHub. You can find it here: https://gist.github.com/juliuskittler/ed53696ac1e590b413aac2dddf0457f6
Second, you could try to solve the problem recursively using, for example, the maximum path sum function described here: https://lucidmanager.org/data-science/project-euler-18/
I think you'd have to explicitly insert the constraints you mentioned to block ineligible paths. Here's a run of the default function on your dataset:
testmat <- matrix(data = c(10, 0, 0, 0, 12, 100, 12, 25, 11, 11, 10, 22, 0, 0, 0, 10),
                  ncol = 4,
                  nrow = 4,
                  byrow = TRUE)
path.sum <- function(triangle) {
  for (rij in nrow(triangle):2) {
    for (kol in 1:(ncol(triangle)-1)) {
      triangle[rij - 1,kol] <- max(triangle[rij,kol:(kol + 1)]) + triangle[rij - 1, kol]
    }
    triangle[rij,] <- NA
  }
  return(max(triangle, na.rm = TRUE))
}
> path.sum(testmat)
[1] 130

It sounds like you are looking for the Needleman–Wunsch dynamic programming algorithm. Just set the match score to the value of your similarity function, and the mismatch/insertion/deletion scores to 0.
The algorithm is not too hard to implement, and you can find a lot of code samples online.
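As a rough illustration of that suggestion, here is a minimal sketch of the dynamic programming recurrence with all gap and mismatch scores set to 0. The function name max_ordered_sum is made up, and the sketch assumes the rule means each chosen cell must lie strictly below and to the right of the previously chosen one. The matrix is the one from the other answer.

```r
# best[i + 1, j + 1] holds the best sum using only the first i rows and
# the first j columns; the extra first row and column of zeros are the base case.
max_ordered_sum <- function(scores) {
  n <- nrow(scores)
  m <- ncol(scores)
  best <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      best[i + 1, j + 1] <- max(
        best[i, j + 1],               # skip row i
        best[i + 1, j],               # skip column j
        best[i, j] + scores[i, j]     # use cell (i, j)
      )
    }
  }
  best[n + 1, m + 1]
}

testmat <- matrix(c(10,   0,  0,  0,
                    12, 100, 12, 25,
                    11,  11, 10, 22,
                     0,   0,  0, 10),
                  nrow = 4, byrow = TRUE)
max_ordered_sum(testmat)
# [1] 132
```

Under that reading of the rules, the result for this matrix is 132 (10 + 100 + 22) rather than the 130 quoted in the question, so it may be worth double-checking either the rules or the expected maximum before relying on this.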

Clustering rows by ID based on a column value condition multiple times

Some time ago I opened a related question in this post
Suppose I have the following df:
data <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,1, 1, 1,1,1,1,1,1,1,1,1,1),
Obs1 = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1,1,1,0,1),
Control = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48,24,20,21,10,10),
ClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,6))
And I want to obtain:
data <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,1, 1, 1,1,1,1,1,1,1,1,1,1),
Obs1 = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1,1,1,0,1),
Control = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48,24,20,21,10,10),
ClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,6),
DesiredResultClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,6,6,6,7,8,9,10,10,11))
The condition is:
If the value of 'Control' is higher than 12, and the current 'Obs1' value is equal to 1 and equal to the previous 'Obs1' value, 'DesiredResultClusterObs1' should increase by 1 (the main difference from the other question is that consecutive 'Control' values above 12 must each be counted).
Any idea how I can achieve the desired result?

I don't know much about how to use the with() and rle() functions, but I got to a solution to the problem using ifelse.
library(dplyr)
data <- data %>% mutate(aux = ifelse(Control > 12 & Obs1 == 1 & lag(Obs1) == 1, 1, 0),
                        DesiredResultClusterObs1 = ClusterObs1 + cumsum(aux))
The aux variable is not necessary; it just helps to see the steps one by one. You can also do the following:
data <- data %>% mutate(DesiredResultClusterObs1 =
                          ClusterObs1 +
                          cumsum(ifelse(Control > 12 & Obs1 == 1 & lag(Obs1) == 1, 1, 0)))
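The same logic can be checked against the desired column from the question in base R. This is only a sketch for a single ID (with several IDs the cumulative sum would need to restart per group); prev_obs, bump and result are illustrative names.

```r
Obs1        <- c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1,1,1,0,1)
Control     <- c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48,24,20,21,10,10)
ClusterObs1 <- c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,6)
desired     <- c(1,1,1,2,2,3,3,3,4,4,4,4,5,6,6,6,7,8,9,10,10,11)

# Previous Obs1 value; the first row has no predecessor
prev_obs <- c(NA, head(Obs1, -1))

# TRUE where a new cluster starts: Control above 12 while Obs1 stays at 1
bump <- Control > 12 & Obs1 == 1 & !is.na(prev_obs) & prev_obs == 1

# Each TRUE shifts all following cluster numbers up by one
result <- ClusterObs1 + cumsum(bump)
all(result == desired)
# [1] TRUE
```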

R function or loop that could go through a binary variable (1 and 0) in a dataframe and returns a third variable (y) value from a different column

I need some help. I am trying to build a function or a loop in R that goes through a binary variable (1 and 0) in a data frame so that, every time a 1 is followed by a 0, I save the value of a third variable (y) from the row where the 0 occurs. I tried a couple of options based on previous posts, but nothing gets me even close to that.
My data looks a bit like this:
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- cbind(ID, variable, y)
In this case, for example, the answer would give me a vector with the y values 30 and 50. Sorry if someone has already answered this; I could not find anything similar. Thanks a lot!

Here's a 'vectorial' solution. Basically, I paste together variable in positions i and i+1. Then I check whether the combination is "10". The position you want is actually the next one (i.e. i+1), so we add 1.
df <- data.frame(ID, variable, y)
idx <- which(paste0(df$variable[-nrow(df)], df$variable[-1]) == "10") + 1
df$y[idx]
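The same shifted comparison can be sketched without pasting strings, by comparing the vector with a copy of itself offset by one position. The names n and idx are illustrative; the data is the example from the question.

```r
variable <- c(1, 1, 0, 1, 0)
y        <- c(10, 20, 30, 40, 50)

n <- length(variable)
# variable[-n] is positions 1..n-1, variable[-1] is positions 2..n,
# so element i of the comparison describes the pair (i, i + 1).
idx <- which(variable[-n] == 1 & variable[-1] == 0) + 1  # row holding the 0
y[idx]
# [1] 30 50
```

A numeric comparison also keeps working if the column ever holds values other than single digits, where pasted strings could become ambiguous.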

Here is an approach with tidyverse:
library(tidyverse)
df %>%
  as_tibble() %>%
  mutate(y1 = ifelse(lag(variable) == 1 & variable == 0, y, NA)) %>%
  pull(y1)
#output
[1] NA NA 30 NA 50
and in base R:
ifelse(c(NA, df[-nrow(df),2]) == 1 & df[, 2] == 0, df[, 3], NA)
If the lag of variable is 1 and variable is 0, y is returned; otherwise NA is returned.
If you would like to remove the NAs, wrap the result in na.omit().

Join 2 data frames using data.table with conditions

I have these two data tables:
library(data.table)
set.seed(42)
A <- data.table(station = sample(1:10, 1000, replace=TRUE),
hash = sample(letters[1:5], 1000, replace=TRUE),
point = sample(1:24, 1000, replace=TRUE))
B <- data.table(station = sample(1:10, 100, replace=TRUE),
card = sample(letters[6:10], 100, replace=TRUE),
point = sample(1:24, 100, replace=TRUE))
In my real data, A contains more than 1M rows.
I am trying to find the hash (from A) for each card (from B), under some conditions: the station and point in A must lie within a range of the values in B (station within +-1, point within +2).
I group B by card and, for each group, run a function that binds together the rows of A meeting these conditions and then takes the most frequent hash.
detect <- function(x){
  am0 <- data.frame(station = 0,
                    hash = 0,
                    point = 0)
  for (i in 1:nrow(x)) {
    am1 <- A %>%
      filter(station %in% (B$station[i] - 1) : (B$station[i] + 1) &
               point > B$point[i] & point < B$point[i] + 2)
    am0 <- rbind(am0, am1)
  }
  t <- as.data.frame(table(am0$hash))
  t <- t %>%
    arrange(-Freq) %>%
    filter(row_number() == 1)
  return(t)
}
And then just:
library(dplyr)
B %>%
  group_by(card) %>%
  do(detect(.)) %>%
  ungroup
But I don't know how to apply the function to each group using the indices [i], so I actually get a wrong result.
# A tibble: 5 x 3
  card  Var1    Freq
  <chr> <fctr> <int>
1 f     c         46
2 g     c         75
3 h     c         41
4 i     c         64
5 j     c         62
I'm a beginner, but I know the best solution for big datasets is to use the data.table library to join 2 datasets like these. Can you help me find a solution?

I think what you want to do is:
#### Prepare join limits
B[, point_limit := as.integer(point + 2)]
B[, station_lower := as.integer(station - 1)]
B[, station_upper := as.integer(station + 1)]
## Join A on B, creates All combinations of points in A and B fulfilling the conditions
joined_table <- B[A,
                  ,
                  on = .(point_limit >= point, point <= point,
                         station_lower <= station, station_upper >= station),
                  nomatch = 0,
                  allow.cartesian = TRUE]
## Count the occurrences of the combinations
counted_table <- joined_table[,.N, by=.(card,hash)][order(card, -N)]
## Select the top for each group.
counted_table[, head(.SD, 1 ),by = .(card)][order(card)]
This will create a full table with all the information in it and then do the counting on that. It relies purely on data.table in order to take full advantage of the speed gains from that package. The data.table vignette is good if you are unfamiliar with the syntax. The nomatch condition ensures that we are doing an inner join.
This will probably be fine if A is only 1M rows and B is kept the same size, depending on your data's distribution. We can, however, also split B in a way similar to your do statement, using the package purrr. I'm not sure how this interacts with R's garbage collection, however.
frame_list <- purrr::map(unique(B$card),
                         ~ B[card == .x][A,
                                         ,
                                         on = .(point_limit >= point,
                                                point <= point,
                                                station_lower <= station,
                                                station_upper >= station),
                                         nomatch = 0,
                                         allow.cartesian = TRUE][, .N, by = .(card, hash)])
counted_table_mem <- rbindlist(frame_list)
Something to note here is that I use rbindlist instead of multiple rbind calls. Repeatedly calling rbind will be very slow, since new memory has to be allocated each time.
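To see which rows the non-equi join conditions keep, here is a toy check with one card and two candidate rows. The tiny tables are made up for illustration; the column names and join conditions follow the answer above, and the sketch assumes the data.table package is installed.

```r
library(data.table)

# One card at station 2, point 9
B <- data.table(station = 2L, card = "f", point = 9L)
# Two candidates: the first is within range, the second is too far away
A <- data.table(station = c(1L, 5L), hash = c("a", "b"), point = c(10L, 10L))

# Join limits: station within +-1, point within +2
B[, point_limit   := point + 2L]
B[, station_lower := station - 1L]
B[, station_upper := station + 1L]

# Inner non-equi join: keep rows of A that fall inside B's limits
joined <- B[A,
            on = .(point_limit >= point, point <= point,
                   station_lower <= station, station_upper >= station),
            nomatch = 0]

joined[, .N, by = .(card, hash)]  # only the pair (f, a) remains
```

Station 1 lies in [1, 3] and point 10 lies in [9, 11], so the first row of A matches; station 5 is outside the range, so the second row is dropped by nomatch = 0.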
