I have the following list of dataframes:
a <- data.frame(
  Data0 = c("Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N"),
  Data1 = c(16, 18, 19, 20, 21, 50, 16, 18, 19, 20, 21, 50),
  Data2 = c(2.2291, 2.0743, 1.9369, 1.8148, 1.7064, 1.6102, 2.2291, 2.0743, 1.9369, 1.8148, 1.7064, 1.6102)
)
b <- data.frame(
  Data0 = c(-2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  Data1 = c(0.8891, 0.8891, 0.9051, 1, 0.8891, 0.8891, 0.7907, 0.8891, 0.9929, 0.8891, 0.8891, 0.8891, 0.8891)
)
dfl <- list(a, b)
I would like to write a function that, for each data frame in the list, returns the number in the last column matching the values passed for the previous columns.
For a, if I send "Y" and 16, I want 2.2291; for b, if I send 5, I want 0.7907. The problem with a is that if I send a number smaller than 16, for either "Y" or "N", I want the value returned as if I had sent 16, without manipulating the input variable. I managed to do this for b, but for a, where there is a split between "Y" and "N", I don't know how to solve it.
So far:
get_value <- function(vector) {
  if (length(vector) == 1) {
    from <- append(head(unlist(dfl[[2]][1], use.names = FALSE), -1), -9999, 0)
    to   <- unlist(dfl[[2]][1], use.names = FALSE)
    match_from <- vector > from
    match_to   <- vector <= to
    pos <- which(match_from == match_to)
    return(unlist(dfl[[2]][pos, 2]))
  } else {
    print("not sure what to do for a")
  }
}
get_value(4)
get_value(-44)
The solution also needs to take into account that for a, the columns Data0 and Data1 sometimes have switched places.
EDIT:
input/output table:
INPUT        OUTPUT
c("Y", 13)   --> 2.2291
c("Y", 50)   --> 1.6102
c("N", 20)   --> 1.8148
c("N", 50)   --> 1.6102
c(-44)       --> 0.8891
An alternate approach using tidyverse functions could be as follows:
library(tidyverse)
library(magrittr)
get_value <- function(vector) {
  if (length(vector) == 1) {
    df <- dfl[[2]]
    k <- df %>%
      arrange(Data0) %>%
      filter(Data0 >= vector) %>%
      select(Data1) %>%
      head(1)
  } else {
    df <- dfl[[1]]
    k <- df %>%
      arrange(Data0) %>%
      filter(Data0 == vector[1]) %>%
      filter(Data1 >= vector[2]) %>%
      select(Data2) %>%
      head(1)
  }
  return(k)
}
and test some outputs:
> get_value(c(-44))
Data1
1 0.8891
> get_value(c('N',16))
Data2
1 2.2291
> get_value(c('Y',16))
Data2
1 2.2291
> get_value(c('N',12))
Data2
1 2.2291
> get_value(c('Y',11))
Data2
1 2.2291
> get_value(c('Y',18))
Data2
1 2.0743
Ok, I finally figured out a way, but it is not so elegant, so help would be appreciated.
get_value <- function(vector) {
  if (length(vector) == 1) {
    # b: build (from, to] interval bounds around each breakpoint in Data0
    from <- append(head(unlist(dfl[[2]][1], use.names = FALSE), -1), -9999, 0)
    to   <- unlist(dfl[[2]][1], use.names = FALSE)
    match_from <- vector > from
    match_to   <- vector <= to
    pos <- which(match_from == match_to)
    return(unlist(dfl[[2]][pos, 2]))
  } else {
    # a: restrict to the rows matching "Y"/"N", then do the same interval match
    pos_1 <- which(unlist(dfl[[1]][1], use.names = FALSE) == vector[1])
    from  <- append(head(as.numeric(unlist(dfl[[1]][2], use.names = FALSE))[pos_1], -1), -9999, 0)
    to    <- as.numeric(unlist(dfl[[1]][2], use.names = FALSE))[pos_1]
    match_from <- as.numeric(vector[2]) > from
    match_to   <- as.numeric(vector[2]) <= to
    pos <- pos_1[match_from == match_to]
    return(unlist(dfl[[1]][pos, 3]))
  }
}
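For what it's worth, here is a sketch of how the same lookup could be written with findInterval(), which locates a value among sorted breakpoints and so replaces the from/to bookkeeping. get_value2 and the detection of the character column by type (which also covers the case where Data0 and Data1 swap places) are my own additions, only checked against the examples above:
get_value2 <- function(vector) {
  if (length(vector) == 1) {
    df  <- dfl[[2]]
    # index of the first breakpoint >= vector; values below the minimum clamp to row 1
    pos <- findInterval(vector, df$Data0, left.open = TRUE) + 1
    return(df[min(pos, nrow(df)), 2])
  }
  df      <- dfl[[1]]
  chr_col <- which(!sapply(df[1:2], is.numeric))  # find the "Y"/"N" column even if swapped
  num_col <- setdiff(1:2, chr_col)
  sub     <- df[df[[chr_col]] == vector[1], ]
  pos     <- findInterval(as.numeric(vector[2]), sub[[num_col]], left.open = TRUE) + 1
  sub[min(pos, nrow(sub)), 3]  # assumes the value column stays last
}
get_value2(c("Y", 13))  # 2.2291
get_value2(-44)         # 0.8891
Note that min(pos, nrow(sub)) also clamps inputs above the largest breakpoint to the last row, which goes slightly beyond what was asked.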
I have three dataframes: one with the means of an original dataset, and two that represent the 2.75th and 97.5th percentiles of null distributions I created with a loop that resampled the original sample.
I have an if/else block that I want to replicate for each element, to generate a matrix in the same format as my dataframes but consisting only of "-", "+" or "0". The symbol depends on whether the mean from the first dataframe is greater than the corresponding 97.5th percentile value in the third dataframe, lower than the 2.75th percentile value in the second dataframe, or between the two. Any ideas about how to go about this?
if (mean > 97.5th percentile) {
  print("-")
} else if (mean < 2.75th percentile) {
  print("+")
} else {
  print("0")
}
One approach is to first initialize an empty data.frame before starting the loop.
Then you can use the if / else if / else structure you proposed.
set.seed(3)
df.mean <- data.frame(mean = runif(100, 0, 100))
df.2.75 <- data.frame(centil275 = runif(100, 1, 4))
df.97.5 <- data.frame(centil975 = runif(100, 97, 100))
df.result <- data.frame()
for (i in 1:nrow(df.mean)) {
  if (df.mean[i, 1] > df.97.5[i, 1]) df.result[i, 1] <- "-"
  else if (df.mean[i, 1] < df.2.75[i, 1]) df.result[i, 1] <- "+"
  else df.result[i, 1] <- "0"
}
df.final <- do.call(cbind, list(df.mean, df.2.75, df.97.5, df.result))
df.final
# mean centil275 centil975 V1
#1 16.8041526 3.299013 99.19449 0
#2 80.7516399 3.046397 99.58147 0
#3 38.4942351 1.627392 98.02813 0
#4 32.7734317 3.135831 98.47877 0
#5 60.2100675 2.815895 99.56156 0
#6 60.4394054 2.021678 98.16176 0
#7 12.4633444 1.123511 99.17952 0
#8 29.4600924 2.205258 99.05237 0
#9 57.7609919 1.237179 99.53660 0
#10 63.0979274 1.937658 97.09840 0
#...
table(df.final[,4])
# - + 0
# 1 3 96
Assuming the three dataframes are called dfm, df97 and df2, you can use nested ifelse:
result <- ifelse(dfm > df97, '-', ifelse(dfm < df2, '+', '0'))
Or using dplyr::case_when:
result <- dplyr::case_when(df.mean > df.97.5 ~ '-',
                           df.mean < df.2.75 ~ '+',
                           TRUE ~ '0')
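A usage sketch with the df.mean, df.2.75 and df.97.5 objects from the answer above. The columns are extracted as plain vectors, since comparing one-column data frames yields logical matrices while case_when expects logical vectors:
result <- dplyr::case_when(df.mean$mean > df.97.5$centil975 ~ '-',
                           df.mean$mean < df.2.75$centil275 ~ '+',
                           TRUE ~ '0')
table(result)  # should reproduce the loop's counts: 1 "-", 3 "+", 96 "0"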
Hi, I am trying to learn ways to avoid loops in my code.
I have an example data here:
options(warn=-1) #Turning warnings off here
Company=c("A","C","B","B","A","C","C","A","B","C","B","A")
CityID=as.character(c(1,1,1,2,2,2,3,3,3,4,4,4))
Value=c(120.5,123,125,122.5,122.1,121.7,123.2,123.7,120.7,122.3,120.1,122)
Sales=c(1,1,0,0,0,1,1,0,1,0,1,0)
df=data.frame(Company,CityID,Sales,Value)
df$new_value=0
I also created a custom function (simple example only for testing purposes) as below.
funcCity12 = function(data){
  data_new = data[which(data$CityID == '1' | data$CityID == '2'), ]
  for (i in 1:nrow(data_new)) {
    # use the data argument rather than the global df here
    data_company = data[data$Company == data_new[i, 'Company'] & !data$CityID == 1 & !data$CityID == 2, ]
    data_new[i, 'new_value'] = max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  }
  data_new
}
df2=funcCity12(data=df) # obtaining the result here
Now I am trying to write a function to avoid the for loop in the previous function.
funcCity12_no_loop = function(x, df){
  data_company = df[df$Company == x[, 'Company'] & !df$CityID == 1 & !df$CityID == 2, ]
  x[, 'new_value'] = max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  x
}
funcCity12_no_loop(x = df[1, ], df = df) # output for the first row of df
This seems to work when I input rows individually. Where I am stuck is how to run this function for all rows of the dataframe. I am not sure if the second function requires more changes for this purpose. Any help is appreciated. Thanks in advance.
P.S. For the second function, my initial reaction was to create a for loop and loop through the observations, but that defeats the whole purpose.
EDIT
This is based on @eonurk's answer:
zz = apply(df, 1, function(x){
  data_company = df[df$Company == x[1] & !df$CityID == 1 & !df$CityID == 2, ]
  x[5] = max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  x
})
The output is a character matrix with one column per row of df (apply coerces each row to character).
You can use the apply function to reach each individual row of your dataframe.
For instance, you can multiply the Value and Sales columns (for no reason at all) with the following:
apply(df,1, function(x){ as.numeric(x["Sales"])*as.numeric(x["Value"])})
Edit:
Now you just need the dplyr package (for the %>% pipe):
zz = apply(df, 1, function(x){
  data_company = df[df$Company == x[1] & !df$CityID == 1 & !df$CityID == 2, ]
  x[5] = max(data_company[data_company$Sales == 1, ]$Value) # note we take the maximum value here
  x
}) %>% as.data.frame %>% t
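Since apply() works on that character matrix (hence the positional x[1] and x[5] above), a sketch of an index-based alternative that keeps the column types, reusing funcCity12_no_loop from the question and, like zz, processing every row:
rows <- lapply(seq_len(nrow(df)), function(i) funcCity12_no_loop(x = df[i, ], df = df))
df3  <- do.call(rbind, rows)  # one-row data frames stacked back together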
Here is one way without a loop. First we filter based on your criteria, then we group by company and calculate the max, then we join the dataframe to the original dataset (also filtered based on your criteria). I didn't make it a function, but the building blocks are all there.
library(tidyverse)
list(
df %>%
filter(CityID %in% 1:2) %>%
select(-new_value),
df %>%
filter(! CityID %in% 1:2 & Sales == 1) %>%
group_by(Company) %>%
summarise(new_value = max(Value))
) %>%
reduce(full_join, by = "Company")
#> Company CityID Sales Value new_value
#> 1 A 1 1 120.5 NA
#> 2 C 1 1 123.0 123.2
#> 3 B 1 0 125.0 120.7
#> 4 B 2 0 122.5 120.7
#> 5 A 2 0 122.1 NA
#> 6 C 2 1 121.7 123.2
I'm trying to apply some basic "if" logic by group over a large dataset in R.
I tried to write a function and apply it over the groups using dplyr, but it's not working. What could be the problem?
#dataframe
db <- data.frame(ID = c(1, 1),
                 type = c("a", "b"),
                 qual = c("no", "OK"))
#if (no problem)
attach(db)
if (db[type == "a", "qual"] == "OK") {
  db[type == "a", "qual_fin"] <- "OK"
  db[type == "b", "qual_fin"] <- "no"
} else if (db[type == "b", "qual"] == "OK") {
  db[type == "b", "qual_fin"] <- "OK"
  db[type == "a", "qual_fin"] <- "no"
} else {
  db$qual_fin <- "no"
}
#dataframe with groups
db <- data.frame(ID = c(1, 1, 2, 2),
                 type = c("a", "b", "a", "b"),
                 qual = c("OK", "OK", "no", "OK"))
#function
quality <- function(a, b, qual_fin_a, qual_fin_b){
  if (a == "OK") {
    qual_fin_a <- "OK"
    qual_fin_b <- "no"
  } else if (b == "OK") {
    qual_fin_b <- "OK"
    qual_fin_a <- "no"
  } else {
    qual_fin_a <- "no"
    qual_fin_b <- "no"
  }
}
#if by group
library(dplyr)
db2 <- db %>%
  group_by(ID) %>%
  do(quality(a = db[db$type == "a", "qual"],
             b = db[db$type == "b", "qual"],
             qual_fin_a = db[db$type == "a", "qual_fin"],
             qual_fin_b = db[db$type == "b", "qual_fin"]))
I expect this result:
> db
ID type qual qual_fin
1 1 a OK OK
2 1 b OK no
3 2 a no no
4 2 b OK OK
I imagine that the solution is pretty simple but I'm struggling to find it!
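Not from the thread, but one possible fix as a sketch: do() expects a data frame back, while the function above only assigns to local variables. A grouped mutate() with case_when() avoids that, assuming each ID has exactly one "a" and one "b" row as in the example:
library(dplyr)
db2 <- db %>%
  group_by(ID) %>%
  mutate(qual_fin = case_when(
    any(type == "a" & qual == "OK") ~ if_else(type == "a", "OK", "no"),
    any(type == "b" & qual == "OK") ~ if_else(type == "b", "OK", "no"),
    TRUE ~ "no"
  )) %>%
  ungroup()
db2
# ID type qual qual_fin
#  1    a   OK       OK
#  1    b   OK       no
#  2    a   no       no
#  2    b   OK       OK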
I have a sequence of data frame subsetting operations. Some of them might fail because the rows to be replaced do not exist. I would still like the others to execute. Example:
source_data[source_data$abbr_d == "bdp",]$party_id <- 32
source_data[source_data$abbr_d == "svp",]$party_id <- 4
source_data[source_data$abbr_d == "cvp",]$party_id <- 2
source_data[source_data$abbr_d == "fdp",]$party_id <- 1
source_data[source_data$abbr_d == "gps",]$party_id <- 13
source_data[source_data$abbr_d == "sp",]$party_id <- 3
source_data[source_data$abbr_d == "csp",]$party_id <- 8
source_data[source_data$abbr_d == "pcs",]$party_id <- 8
Error in `$<-.data.frame`(`*tmp*`, "party_id", value = 13) :
replacement has 1 row, data has 0
source_data[source_data$abbr_d == "lega",]$party_id <- 18
source_data[source_data$abbr_d == "edu",]$party_id <- 16
source_data[source_data$abbr_d == "glp",]$party_id <- 31
I would like the script to continue after the error has been thrown. I've tried using tryCatch() but that doesn't really help because I don't know in advance at which point the replacement will fail.
Is there a way to tell R to just "not care" about those replacement errors? And still continue with the next replacement operations?
The only solution I came up with is to use if-statements like this, which is tedious:
if(nrow(source_data[source_data$abbr_d == "lega", 1]) > 0){
source_data[source_data$abbr_d == "lega",]$party_id <- 18
}
if(nrow(source_data[source_data$abbr_d == "edu", 1]) > 0){
source_data[source_data$abbr_d == "edu",]$party_id <- 16
}
etc...
That is quite verbose code. Luckily, there is a way to get this done in a fraction of the code while preventing your issue. My suggestion is to use a lookup table to build the party_id column:
df = data.frame(abbr_d = sample(LETTERS[1:8], 100, replace = TRUE))
lookup_table = 1:8
names(lookup_table) = LETTERS[1:8]
# A B C D E F G H
# 1 2 3 4 5 6 7 8
df$party_id = lookup_table[df$abbr_d]
So, you create the link between abbr_d and party_id once (here letters and simple numbers, but simply replace your values), and use the df$abbr_d column to subset the lookup table. This maps each label in abbr_d to its corresponding party_id value.
The error you saw is avoided because only abbr_d values that actually occur in the data are looked up; entries in the lookup table that are never needed do not pose an issue.
A dplyr approach as a bonus:
library(dplyr)
df %>% mutate(party_id = lookup_table[abbr_d])
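For instance, with the abbreviation-to-id pairs taken from the question itself, the whole block of replacements collapses to the following sketch (as.character() guards against abbr_d being a factor):
party_lookup <- c(bdp = 32, svp = 4, cvp = 2, fdp = 1, gps = 13, sp = 3,
                  csp = 8, pcs = 8, lega = 18, edu = 16, glp = 31)
source_data$party_id <- party_lookup[as.character(source_data$abbr_d)]
One caveat: abbr_d values missing from the table become NA rather than being left untouched, unlike the original per-value replacements.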
You can use the data.table library to mitigate the issue:
txt<-"
1,a,1
2,b,2
3,c,3
4,d,4
"
dat = read.delim(textConnection(txt),
header=FALSE,sep=",",strip.white=TRUE)
dat
dat[dat$V2=="e",]$V3<-4
# Error in `$<-.data.frame`(`*tmp*`, "V3", value = 4) :
#   replacement has 1 row, data has 0
library(data.table)
data=as.data.table(dat)
data[data$V2=="e",]$V3<-4
# no error thrown
data.table is often faster than data frame, afaik.
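For reference, the idiomatic data.table form is update-by-reference with :=, which is likewise a silent no-op when nothing matches (a sketch):
library(data.table)
setDT(dat)               # convert to data.table in place instead of copying
dat[V2 == "e", V3 := 4]  # no matching rows: nothing happens, no error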
A New Year's quandary for the stackoverflow community, which has been quite a help through posts and answers in the past (this is my first question). I've found a workaround, but I'm wondering what other approaches/solutions might be suggested.
I am attempting to remove trailing NAs from a large data.frame, but those NAs are only found in a few of the columns, and I would like to retain all columns in the output. Here is a representative data subset:
df <- data.frame(var1 = rep("A", 8),
                 var2 = c("a", "b", "c", "d", "e", "f", "g", "h"),
                 var3 = c(0, 1, NA, 2, 3, NA, NA, NA),
                 var4 = c(0, 0, NA, 4, 5, NA, NA, NA),
                 var5 = c(0, 0, NA, 0, 2, 4, NA, NA))
Goals of the process:
Trim trailing NAs based on NA presence in var3, var4 and var5
Retain all columns in the final output
Only remove trailing NAs (i.e. row 3 remains in the record as a placeholder)
Only trim a row if all three columns have an NA (i.e. rows 7 and 8, but not row 6)
Based on these goals, the solution should remove the last two rows of df:
df.output = df[-c(7,8),]
The behaviour of na.trim (in the zoo package) is ideal, as it limits removal to NAs at the end of the data.frame with sides = "right", and my workaround involved altering the na.trim.default function to include a subset term.
Any suggestions? Many thanks for any help.
EDIT: Just to complete this question, below is the function I created from the na.trim.default code which also works, but as noted, does require loading the zoo package.
na.trim.multiplecols <- function(object, colrange,
                                 sides = c("both", "left", "right"),
                                 is.na = c("any", "all"), ...) {
  is.na <- match.arg(is.na)
  nisna <- if (is.na == "any" || length(dim(object[, colrange])) < 1) {
    complete.cases(object[, colrange])
  } else {
    rowSums(!is.na(object[, colrange])) > 0
  }
  idx <- switch(match.arg(sides),
                left  = cumsum(nisna) > 0,
                right = rev(cumsum(rev(nisna) > 0) > 0),
                both  = (cumsum(nisna) > 0) & rev(cumsum(rev(nisna)) > 0))
  if (length(dim(object)) < 2)
    object[idx]
  else
    object[idx, , drop = FALSE]
}
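A usage sketch on the df above, matching the desired df.output (sides = "right" and is.na = "all" per the stated goals):
na.trim.multiplecols(df, colrange = 3:5, sides = "right", is.na = "all")
#   var1 var2 var3 var4 var5
# 1    A    a    0    0    0
# 2    A    b    1    0    0
# 3    A    c   NA   NA   NA
# 4    A    d    2    4    0
# 5    A    e    3    5    2
# 6    A    f   NA   NA    4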
Something based on max(which(!is.na())) will work. We use this to find the largest index of non-missing data from the columns of interest.
Using your df
ind <- max(max(which(!is.na(df$var3))),
max(which(!is.na(df$var4))),
max(which(!is.na(df$var5))))
df[1:ind, ]
var1 var2 var3 var4 var5
1 A a 0 0 0
2 A b 1 0 0
3 A c NA NA NA
4 A d 2 4 0
5 A e 3 5 2
6 A f NA NA 4
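The same index can also be computed in one expression (a sketch): take the last row in which at least one of the three columns is non-missing.
ind <- max(which(rowSums(!is.na(df[, c("var3", "var4", "var5")])) > 0))
df[1:ind, ]  # identical to the output above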
Edit: First solution using base rle and apply
t <- rle(apply(as.matrix(df[, 3:5]), 1, function(x) all(is.na(x))))
r <- ifelse(t$values[length(t$values)] == TRUE, t$lengths[length(t$lengths)], 0)
if (r > 0) head(df, -r) else df # guard: head(df, -0) would return zero rows
Second solution using Rle from package IRanges:
require(IRanges)
t <- min(sapply(df[,3:5], function(x) {
o <- Rle(x)
val <- runValue(o)
if (is.na(val[length(val)])) {
len <- runLength(o)
out <- len[length(len)]
} else {
out <- 0
}
}))
if (t > 0) head(df, -t) else df # same guard: head(df, -0) would return zero rows