I am trying to compare multiple columns in two different data frames in R. This has been addressed on the forum before (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in data frame 1 falls within the range defined by two columns in data frame 2. Functions like match, merge, join, and intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The data frames are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate(new_mpg = case_when(
  temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
Compare temp1.df$cyl and temp2.df$Cyl. If they match -->
check whether temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
if it is, create a new variable new_mpg with a value of 1.
It's hard to show the exact expected output here.
I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg <- apply(temp1.df, 1, function(x) {
  # subset temp2.df to the rows with the same cylinder count
  # (x[2] is cyl and x[1] is mpg; apply() drops the column names)
  temp <- temp2.df[temp2.df$Cyl == x[2], ]
  # check whether mpg falls inside any Start/End range in that subset
  ifelse(any(apply(temp, 1, function(y) {
    dplyr::between(as.numeric(x[1]), as.numeric(y[2]), as.numeric(y[3]))
  })), 1, 0)
})
Note that this makes some assumptions about the organization of your actual data. In particular, I can't call on the column names within apply, so I'm using indexes, which may very well change. You might want to rearrange your data between receiving it and calling apply, or change its organization within apply, e.g., apply(temp1.df[, c("mpg", "cyl")], ...).
At any rate, this breaks your data set into rows, and each row is compared to a subset of the second dataset with the same Cyl count. Within this subset, it checks whether the mpg for this row falls between (from dplyr) Start and End, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
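For what it's worth, here is a minimal sketch of how the rowwise version might look (assuming dplyr >= 1.0; untested, and it may still be slow against a 250,000-row temp2.df):
library(dplyr)
temp1.df <- temp1.df %>%
  rowwise() %>%
  # cyl and mpg are the current row's scalars; the temp2.df columns are full vectors
  mutate(new_mpg = as.integer(any(
    temp2.df$Cyl == cyl & temp2.df$Start <= mpg & temp2.df$End >= mpg
  ))) %>%
  ungroup()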
I have a data frame with a large number of columns containing numeric values.
I'd like to dynamically calculate the mean of each pair of consecutive columns (the mean of columns 1 and 2, of columns 3 and 4, of 5 and 6, etc.) and either store it in new columns or replace one of the two columns used in the calculation.
I tried creating a function that calculates the mean of two columns and stores it in the first of them, then looping that function over my whole data table.
However, I'm struggling with mutate: since I dynamically generate the column names I use (they all start with "PUISSANCE" followed by a number) through glue, mutate treats the name as a string and doesn't evaluate it.
mean_col <- function(data, k) {
  n <- 2 * k + 1
  m <- 2 * k + 2
  varname_even <- paste("PUISSANCE", m, sep = "")
  varname_odd <- paste("PUISSANCE", n, sep = "")
  # here is the issue: the argument on the right is considered non-numeric,
  # since it is the "sum" of two strings
  mutate(data, "{{varname_odd}}" := ({{varname_odd}} + {{varname_even}}) / 2)
  data
}
for (k in 0:24) {
  my_data_set <- mean_col(my_data_set, k)
}
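For reference, here is a minimal sketch of how string column names can be injected, assuming dplyr >= 1.0 (with glue support in := and the .data pronoun); untested. Note the original function also returns data unchanged because the mutate() result is never reassigned:
library(dplyr)
mean_col <- function(data, k) {
  varname_odd <- paste0("PUISSANCE", 2 * k + 1)
  varname_even <- paste0("PUISSANCE", 2 * k + 2)
  # .data[[...]] looks the columns up by their string names;
  # "{varname_odd}" := writes the result back under the odd name
  mutate(data, "{varname_odd}" := (.data[[varname_odd]] + .data[[varname_even]]) / 2)
}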
OK guys, just to let you know that I managed to solve it myself.
I did a pivot_longer transformation to put all the "PUISSANCEXX" names in one column and their values in another.
Then I used str_extract to pull just the number XX out of the string "PUISSANCEXX" and converted it to a numeric.
Dividing by 2 and subtracting 0.5 turns each successive pair of values into X and X.5, so a floor maps both to X. Then I just did a group_by/summarise to get the mean, and that's it!
library(dplyr)
library(tidyr)
library(stringr)
my_data_set %>%
  pivot_longer(starts_with("PUISSANCE"), names_to = "heure", values_to = "puissance") %>%
  mutate(time = floor(as.numeric(str_extract(heure, "\\d+")) / 2 - 0.5)) %>%
  select(-heure) %>%
  group_by(time) %>%
  summarise(power = mean(puissance))
I need to divide certain values in a column by 1000, but I do not know how to go about it.
I attempted to use this function initially:
test <- Updins(weight,)
test$weight <- as.numeric(test$weight) / 1000
head(test)
with Updins being the data frame and weight the column, just to see if it would at least divide the entire column by 1000, but no such luck: it did not recognise 'test' as a variable.
Can anyone provide any guidance? I'm very new to R :)
If Updins is the dataset object name, we select the columns with [ and not with (, as ( is used for function invocation:
test <- Updins['weight']
test$weight <- as.numeric(test$weight) / 1000
Here is a fake data set where all rows are divided by 1,000. I also included a for-loop as one potential way to do this only for certain rows. Since you didn't specify how you were selecting those rows, I did it for any row with a value greater than 1,005, and I did a second version that only divides by 1,000 if the ID is an odd number. If you have NAs, you may need an additional if statement to deal with them; the third/last for-loop gives an example of that.
ID <- 1:10
grams <- 1000:1009
df <- data.frame(ID, grams)
df$kg <- as.numeric(df$grams) / 1000
df[, "kg"] <- as.numeric(df[, "grams"]) / 1000  # does the same thing as the line above
# if the weight is greater than 1,005 grams
for (i in 1:nrow(df)) {
  if (df[i, "grams"] > 1005) { df[i, "kg3"] <- as.numeric(df[i, "grams"]) / 1000 }
}
# if the ID is an odd number
for (i in 1:nrow(df)) {
  if (df[i, "ID"] %in% seq(1, 101, by = 2)) { df[i, "kg4"] <- as.numeric(df[i, "grams"]) / 1000 }
}
df[3, "grams"] <- NA  # add an NA to the weight data to test the next loop
# same as above, but works with NAs
for (i in 1:nrow(df)) {
  if (is.na(df[i, "grams"]) & (df[i, "ID"] %in% seq(1, 101, by = 2))) {
    df[i, "kg4"] <- NA
  } else if (df[i, "ID"] %in% seq(1, 101, by = 2)) {
    df[i, "kg4"] <- as.numeric(df[i, "grams"]) / 1000
  }
}
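For what it's worth, the same conditions can usually be written without explicit loops; here is a hedged vectorized sketch of the examples above (same toy df assumed):
df$kg3 <- ifelse(df$grams > 1005, df$grams / 1000, NA)  # weight greater than 1,005 grams
df$kg4 <- ifelse(df$ID %% 2 == 1, df$grams / 1000, NA)  # odd IDs; NA grams simply stay NA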
Hard without data to work with or expected output, but here's a skeleton that you could probably use:
library(dplyr)  # the package you'll need, for the pipe (%>% passes objects from one line to the next)
test <- Updins %>%  # using the dataset Updins
  mutate(weight = ifelse(as.numeric(weight) > 199,  # CHANGING the weight variable. Where weight > 199...
                         as.character(as.numeric(weight) / 1000),  # ...divide a numeric version of weight by 1000, but keep it as a character...
                         weight))  # OTHERWISE, keep the weight variable as is
head(test)
I kept the new value as a character, because it seems that your weight variable is a character variable based on some of the warnings ('NAs introduced by coercion') that you're getting.
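If a numeric result is acceptable instead, a minimal sketch (an assumption about your intent, not a requirement) converts first and divides after:
library(dplyr)
test <- Updins %>%
  mutate(weight = as.numeric(weight),  # may warn: NAs introduced by coercion
         weight = ifelse(weight > 199, weight / 1000, weight))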
In writing code for a function, I selected the complete cases from the 2nd column of a 4-column data frame called "myData" and confirmed, by printing the values, that 117 of the >1,700 rows were selected into "mycases". The selection code is:
mycases <- myData[complete.cases(myData[,2]),2]
I can sum the values of these 117 cases successfully, but when I try to count them using the code:
fkount <- nrow(mycases)
R returns NULL. What am I doing wrong? Is there some easier way to get the number of cases?
mycases is, in your case, a vector. If you want to know its length, use length(mycases).
I guess you want something like this.
library(dplyr)
myData <- data.frame(A = c(1:3, NA), B = c(1,NA,2,NA))
myData %>% filter(complete.cases(.)) %>% nrow()
When you extract a single column from your data frame (or from a matrix), it is by default converted into a vector, and nrow does not work on vectors (since they don't have rows + columns).
You have (at least) 2 options:
use length() instead. This will work, but it has a risk: if you later use the same code to extract 2 (or more) columns, it will give a probably undesired result, namely either the total number of elements of an extracted matrix or the number of columns of an extracted data frame.
use the drop=FALSE argument of [ ]. This prevents the conversion of a single column into a vector; the result stays a 2-d object (with ncol equal to 1), and nrow then works as you intend.
Example:
mydata <- data.frame(matrix(1:100, ncol = 5))
# using length()
length(mydata[, 2])
# 20
# but watch out!
length(mydata[, 2:3])
# 2
# using drop = FALSE
nrow(mydata[, 2, drop = FALSE])
# 20
# safer:
nrow(mydata[, 2:3, drop = FALSE])
# 20
I am having an issue with the mutate function in dplyr.
I am trying to add a new column called state whose value depends on changes in the V column. (V repeats with a sequence, so each run of rep(seq(100, 2100, 100), each = 96) corresponds to one dataset in my df.)
Error: impossible to replicate vector of size 8064
Here is a reproducible example of my df:
df <- data.frame(
  No = (No <- rep(seq(0, 95, 1), times = 84)),  # also assigns No, so the later columns can use length(No)
  AC = rep(rep(c(78, 110), each = 1), times = length(No) / 2),
  AR = rep(rep(c(256, 320, 384), each = 2), times = length(No) / 6),
  AM = rep(1, times = length(No)),
  DQ = rep(rep(seq(0, 15, 1), each = 6), times = 84),
  V = rep(rep(seq(100, 2100, 100), each = 96), times = 4),
  R = sort(replicate(6, sample(5000:6000, 96))))
labels <- rep(c("CAP-CAP", "CP-CAP", "CAP-CP", "CP-CP"), each = 2016)
labels <- rep(c("CAP-CAP","CP-CAP","CAP-CP","CP-CP"),each=2016)
I added here 2016 value intentionally since I know the number of rows of each dataset.
But I want to assign these labels with automated function when the dataset changes. Because there is a possibility the total number of rows may change for each df for my real files. For this question think about its only one txt file and also think about there are plenty of them with different number of rows. But the format is the same.
I use dplyr to arrange my df
library("dplyr")
newdf<-df%>%mutate_each(funs(as.numeric))%>%
mutate(state = labels)
Is there an elegant way to do this?
Iff you know the number of datasets contained in df AND the column you're keying off (here, V) is ordered in df as it is in your toy data, then this works. It's pretty clunky, and there should be a way to make it even more efficient, but it produces what I take to be the desired result:
# You'll need dplyr for the lead() part
library(dplyr)
# Make a vector with the labels for your subsets of df
labels <- c("AP-AP", "P-AP", "AP-P", "P-P")
# This line (a) builds an indicator that marks the final row of each subset
# of df with a 1 and (b) extracts the row numbers of those 1s
endrows <- which(grepl(1, with(df, ifelse(lead(V) - V < 0, 1, 0))))
# Use those row numbers (and the differences between them) to tell rep()
# how many times to repeat each label
newdf$state <- c(rep(labels[1], endrows[1]),
                 rep(labels[2], endrows[2] - endrows[1]),
                 rep(labels[3], endrows[3] - endrows[2]),
                 rep(labels[4], nrow(newdf) - endrows[3]))
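A hedged, more compact variant of the same boundary-detection idea (assumes, as above, that V drops exactly at each dataset boundary):
# cumsum() turns the "V dropped here" indicator into a group index 1..4
grp <- cumsum(c(1, diff(df$V) < 0))
newdf$state <- labels[grp]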
I have two data frames of different lengths, and I want to add a new column to the first data frame holding the corresponding values from the second data frame.
The corresponding value is defined by the following condition: if DF1[i,1] == DF2[j,1] & DF1[i,2] == DF2[j,2] is TRUE, then the value of row j in DF2 should be written to DF1$newColumn[i].
The following data frames are used to illustrate the question:
DF1 <- data.frame(X = rep(c("A", "B", "C"), each = 3),
                  Y = rep(c("a", "b", "c"), each = 3))
DF2 <- data.frame(X = c("A", "B", "C"),
                  Y = c("a", "b", "c"),
                  Z = 1:3)
I tried to use if() statements as in the text above, but the condition returns a vector of TRUE/FALSE values, and that doesn't seem to work.
The code that works, and that I use now, is
for (i in 1:length(DF1[, 1])) {
  DF1$Z[i] <- subset(DF2, DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i])$Z
}
However, it is incredibly slow (user 115.498, system 12.341, elapsed 127.799 for my full data frame), and there must be a more efficient way to code this. Also, I have read repeatedly that vectorizing is more efficient than loops, but I don't know how to do that.
I do need to work with conditional statements, though, so something like
DF1$Zz<-rep(DF2$Z,each=3)
wouldn't work for my real dataset.
DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i]])
seems to take roughly a quarter of the time of your for loop.
I created DF1 with 300 reps of each value; my function took ~2 s to run, your loop with subset took ~8 s, and repackaging your loop into an sapply took ~5 s.
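For reference, this kind of exact-key lookup is also what a join does; a hedged sketch using dplyr's left_join (assuming the X/Y pairs in DF2 are unique, as in the example):
# left join on the shared key columns preserves DF1's row order;
# unmatched rows get NA in Z
DF1 <- dplyr::left_join(DF1, DF2, by = c("X", "Y"))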