Merging specific rows by summing certain columns on grouping variables - r

The following dataframe is a subset of a bigger df, which contains duplicated information
df <- data.frame(Caught    = c(92, 134, 92, 134),
                 Discarded = c(49, 47, 49, 47),
                 Units     = c(170, 170, 220, 220),
                 Hours     = c(72, 72, 72, 72),
                 Colour    = c("red", "red", "red", "red"))
In Base R, I would like to get the following:
df_result <- data.frame(Caught = 226,
                        Discarded = 96,
                        Units = 390,
                        Hours = 72,
                        Colour = "red")
So basically the result is the sum of the unique values in the columns Caught, Discarded and Units, while keeping the single value for Hours and Colour (Caught = 92+134, Discarded = 49+47, Units = 170+220, Hours = 72, Colour = "red").
However, I intend to do this in a much bigger data.frame with several columns. My idea was to apply a function based on column names as:
l <- lapply(df, function(x) {
  if (names(x) %in% c("Caught", "Discarded", "Units"))
    sum(unique(x))
  else
    unique(x)
})
as.data.frame(l)
However, this does not work: inside lapply() the function only receives the column's values, so names(x) is NULL and I am not sure how to recover the column name there. I have also tried by() and apply() without success.
Thanks

Asking for Base R:
l <- lapply(df, function(n) {
  if (is.numeric(n))
    sum(unique(n))
  else
    unique(n)
})
as.data.frame(l)
This solution takes advantage of the fact that data.frames are really just lists of vectors.
It produces this:
#   Caught Discarded Units Hours Colour
#      226        96   390    72    red
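If you still want to drive the choice by column name rather than by type, as in the original attempt, iterating over names(df) instead of df makes the column name available inside the function. A small sketch of that idea (sum_cols is my own helper name):

```r
df <- data.frame(Caught    = c(92, 134, 92, 134),
                 Discarded = c(49, 47, 49, 47),
                 Units     = c(170, 170, 220, 220),
                 Hours     = c(72, 72, 72, 72),
                 Colour    = c("red", "red", "red", "red"))

sum_cols <- c("Caught", "Discarded", "Units")   # columns to sum over unique values
l <- lapply(names(df), function(nm) {
  x <- df[[nm]]                                 # look the column up by its name
  if (nm %in% sum_cols) sum(unique(x)) else unique(x)
})
names(l) <- names(df)
as.data.frame(l)
```

This relies on every non-summed column collapsing to a single unique value; otherwise as.data.frame() will recycle or fail.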

A proposition:
df <- data.frame(Caught    = c(92, 134, 92, 134),
                 Discarded = c(49, 47, 49, 47),
                 Units     = c(170, 170, 220, 220),
                 Hours     = c(72, 72, 72, 72),
                 Colour    = c("red", "red", "red", "red"))
df
#> Caught Discarded Units Hours Colour
#> 1 92 49 170 72 red
#> 2 134 47 170 72 red
#> 3 92 49 220 72 red
#> 4 134 47 220 72 red
df_results <- data.frame(Caught    = sum(unique(df$Caught)),
                         Discarded = sum(unique(df$Discarded)),
                         Units     = sum(unique(df$Units)),
                         Hours     = unique(df$Hours),
                         Colour    = unique(df$Colour))
df_results
#> Caught Discarded Units Hours Colour
#> 1 226 96 390 72 red
# Created on 2021-02-23 by the reprex package (v0.3.0.9001)
Regards,

Related

Creating a loop with compare_means

I am trying to create a loop that runs compare_means (from the ggpubr package in R) across all columns of a data frame and then keeps only the significant adjusted p-values, but it does not work.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10       10  2.2     55   13     HIGH
gut12       20   44     67   12     HIGH
gut34      5.5    3     89   33      LOW
gut26        4   45     23    4      LOW
library(ggpubr)
data <- list()
for (i in 2:length(df3)) {
  data <- compare_means(df3[[i]] ~ Cluster2, data = df3, paired = FALSE,
                        p.adjust.method = "bonferroni", method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to create an output to convert in dataframe with all the information contained in compare_means output
Thanks a lot
Try this:
library(ggpubr)
data <- list()
for (i in 2:(length(df3) - 1)) {              # last column is Cluster2 itself
  new <- df3[, c(names(df3)[i], "Cluster2")]  # index by name: c(i, "Cluster2") would coerce i to "2"
  colnames(new) <- c("interest", "Cluster2")
  data[[i - 1]] <- compare_means(interest ~ Cluster2, data = new, paired = FALSE,
                                 p.adjust.method = "bonferroni", method = "wilcox.test")
}
do.call(rbind, data)  # one data frame with all comparisons

Double entries in dataframe after merge in R

My data
Hello, I have a problem with merging two dataframes with each other.
The goal is to merge them so that each date has the corresponding values. If there is no corresponding value, I want to replace NA with 0.
names(FiresNearLA.ab.03)[1] <- "Date.Local"
U.NO2.ab.03 <- unique(NO2.ab.03) # No2.ab.03 has all values multiplied
ind <- merge(FiresNearLA.ab.03,U.NO2.ab.03, all = TRUE, all.x=TRUE)
ind[is.na(ind)] <- 0
So far so good, and the first lines look as they should. But beginning with 2004-04-24, all dates are doubled and weird values appear in the NO2.Mean column.
U.NO2.Mean table:
    Date.Local   NO2.Mean
361 2004-03-31  30.217391
365 2004-04-24  50.000000
366 2004-04-25  47.304348
370 2004-04-26  50.913043
374 2004-04-27  41.157895
ind table:
    Date.Local FIRE_SIZE F.number.n_fires  NO2.Mean
113 2004-04-22     34.30               10 13.681818
114 2004-04-23     45.00               13 17.222222
115 2004-04-24     55.40               22 28.818182
116 2004-04-24     55.40               22 50.000000
117 2004-04-25   2306.85               15 47.304348
118 2004-04-25   2306.85               15 21.090909
Why are there values in NO2.Mean for 2004-04-22 and 2004-04-23 if they should be 0? And why are the dates doubled after the 24th, and where do the second values come from?
Thank you
So I managed to merge your data:
FiresNearLA.ab.03 <- dget("FiresNearLA.ab.03.txt", keep.source = FALSE)
U.NO2.ab.03 <- dget("NO2.ab.03.txt", keep.source = FALSE)
ind <- merge(FiresNearLA.ab.03,
U.NO2.ab.03,
all = TRUE,
by.x = "DISCOVERY_DATEymd",
by.y = "Date.Local")
As a side note: Usually, you share a small sample of your data on stackoverflow, not the whole thing. In your case, dput(FiresNearLA.ab.03[1:50, ]) and then copy and paste from the console to the question would have been sufficient.
Back to your problem: The duplication already happens in NO2.ab.03, where a number of dates and values occur twice or more. The easiest way to solve this (in my experience) is to use the data.table package, which has a duplicated() method that is more straightforward and also faster:
library(data.table)
# Test duplicated occurrences in U.NO2.ab.03
> table(duplicated(U.NO2.ab.03, by = c("DISCOVERY_DATEymd", "NO2.Mean")))
FALSE TRUE
7767 27308
>
> nrow(ind)
[1] 35229
# Remove duplicated rows from data frame
> ind <- ind[!duplicated(ind, by = c("DISCOVERY_DATEymd", "NO2.Mean")), ]
> nrow(ind)
[1] 7921
After these steps, you should be fine :)
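The same de-duplication works in plain base R too. Here is a minimal sketch with invented fires/no2 values (not the question's real data) showing that dropping duplicated rows before the merge prevents the fan-out:

```r
fires <- data.frame(Date.Local = as.Date(c("2004-04-24", "2004-04-25")),
                    FIRE_SIZE  = c(55.4, 2306.85))
no2 <- data.frame(Date.Local = as.Date(c("2004-04-24", "2004-04-24", "2004-04-25")),
                  NO2.Mean   = c(50, 50, 47.3))  # 2004-04-24 recorded twice
no2 <- no2[!duplicated(no2), ]  # base-R equivalent of the data.table step
ind <- merge(fires, no2, all = TRUE)
nrow(ind)  # one row per date, no doubling
```

Each duplicated key on the right-hand side would otherwise multiply the matching left-hand rows, which is exactly the doubling seen in the ind table above.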
I got the answer: the original data source of NO2.ab.03 was faulty.
As JonGrub suggested, the problem was within NO2.ab.03. For some days it had two different NO2.Mean values for the same date. I deleted these rows and now it works. Thank you again for the help and the great advice.

ifelse didn't work in dataframe in R

I have a question about ifelse in data.frame in R. I checked several SO posts about it, and unfortunately none of these solutions fitted my case.
My case is a conditional calculation in a data frame, but it returns the warning "the condition has length > 1 and only the first element will be used" even after I used the ifelse function in R, which should work perfectly according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
rate = function(yest, tod){
  value = tod/yest
  if (value > 1) {
    return(paste("+", percent(value - 1), sep = ""))
  } else {
    return(paste("-", percent(1 - value), sep = ""))
  }
}
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
So if I run the ifelse one, I will get following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is, how should I deal with it? I tried to assign 0 to the last column before I run ifelse, but it still failed.
Many thanks in advance!
Try the following two approaches; both do what you want. Maybe the second one is what you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
  yest <- temp$previous[i]
  tod <- temp$current[i]
  if (yest <= 0)
    return(tod)
  value = tod/yest
  if (value > 1) {
    return(paste("+", percent(value - 1), sep = ""))
  } else {
    return(paste("-", percent(1 - value), sep = ""))
  }
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind,
function(i) rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is same as you've coded it.
Here's another way to do the same
# 1: load dplyr
#if needed install.packages("dplyr")
library(dplyr)
# 2: I recreate your data
your_dataframe = as_tibble(cbind(c(0,50,92,84,30,162),
c(10,57,177,153,68,341))) %>%
rename(previous = V1, current = V2)
# 3: obtain the change using your conditions
your_dataframe %>%
  mutate(change = ifelse(previous > 0,
                         ifelse(current/previous > 1,
                                paste0("+%", (current/previous - 1)*100),
                                paste0("-%", (1 - current/previous)*100)),  # decrease: 1 - ratio
                         current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
Only the first element of value is evaluated, so the output of rate depends solely on the first row of temp.
Adopting the advice I received from warm-hearted SO users, I vectorized my calculation and it worked! Raise a glass to the SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0,
                   ifelse(temp$current/temp$previous > 1,
                          temp$current/temp$previous - 1,
                          1 - temp$current/temp$previous),
                   temp$current)
This will return rate with scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = F)
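For completeness, the formatted-string version can be vectorized the same way in base R. This sketch (using the question's numbers) builds signed percentage labels without calling if() on a vector; note that ifelse() coerces the whole result to character:

```r
previous <- c(0, 50, 92, 84, 30, 162)
current  <- c(10, 57, 177, 153, 68, 341)
value <- current / previous
change <- ifelse(previous > 0,
                 paste0(ifelse(value > 1, "+", "-"),
                        round(abs(value - 1) * 100, 1), "%"),
                 current)   # previous == 0: keep the current value
change
```

Both branches of ifelse() are evaluated in full (so value may contain Inf where previous is 0), but only the elements selected by the condition end up in the result.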

R: split a list of value ranges (in text) into two lists

I have a column of temperature ranges (or single temperature if only one reading was recorded) like this:
"117-118"
"117-118"
"117-122"
"122-128"
"123"
"118-124"
"118-124"
"118-124"
"123-128"
"91-101"
...
In R, how can I split this column into two columns (i.e., low temperature and high temperature columns)?
input <- c("117-118",
           "117-118",
           "117-122",
           "122-128",
           "123",
           "118-124",
           "118-124",
           "118-124",
           "123-128",
           "91-101")
Using read.table with a custom separator:
newData <- read.table(text = input, sep = "-", fill = TRUE)
Note the blanks will be filled in with NA. If you want the mean, you could do:
means <- rowMeans(newData, na.rm = TRUE)
You could also use strsplit and sapply as you mention in your comment like this:
newData <- t(sapply(strsplit(input, "-"), function(x) as.numeric(c(x, x)[1:2])))
# If you want NAs as before, change from c(x, x) to c(x, NA)
Thanks to some useful tips given by @akrun and others, I found a way to get this to work:
temp <- c("117-118", "117-118", "117-122", "122-128" ,"123", "118-124", "118-124", "118-124", "123-128", "91-101")
low <- as.numeric(lapply(strsplit(temp,"-"), function(x) x[1]))
high <- as.numeric(lapply(strsplit(temp,"-"), function(x) x[2]))
This gives results as:
> low
[1] 117 117 117 122 123 118 118 118 123 91
> high
[1] 118 118 122 128 NA 124 124 124 128 101
In the case of a single value, assigning that value to both vectors can be done as follows (may not be the best way, but it works):
high[is.na(high)] <- low[is.na(high)]
Which leads to this result:
> high
[1] 118 118 122 128 123 124 124 124 128 101
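Since the question asks for two columns, the read.table() route can also produce the named columns and fill the single readings in one step; a short sketch (the column names low/high are my own choice, and the input here is a shortened version of the question's data):

```r
input <- c("117-118", "123", "91-101")
newData <- read.table(text = input, sep = "-", fill = TRUE,
                      col.names = c("low", "high"))
# single readings were filled with NA in "high"; copy them over from "low"
newData$high[is.na(newData$high)] <- newData$low[is.na(newData$high)]
newData
```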

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the Vehicle velocity column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the table too and then combined them using cbind(). This works only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine all the per-vehicle data frames into one table, i.e. row-binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D/delta))
  r <- convolve(x, z, type = 'filter')/convolve(rep(1, length(x)), z, type = 'filter')
  r
}
for (i in unique(traj1$'Vehicle ID')) {
  veh <- subset(traj1, traj1$'Vehicle ID' == i)
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  svel <- data.frame(svel)
  veh <- head(tail(veh, -30), -30)
  fta <- cbind(veh, svel)
}
'fta' now only contains the data frame for the last vehicle, but I want the data frames for all vehicles combined by row. Maybe a for loop is not the right way to do it, but I don't know how to use tapply() (or another apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the 'Orange' data set in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is equivalent to my 'Vehicle ID' column):
for (i in unique(Orange$Tree)) {
  tre <- subset(Orange, Orange$'Tree' == i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that you could save all the results in a list and then do.call an rbind:
comb <- list() ### create list to save the results
length(comb) <- length(unique(Orange$Tree))
##Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))) {
  tre <- subset(Orange, Tree == unique(Orange$Tree)[i])
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb[[i]] <- cbind(tre, age2)  # save results in the list
}
final.data<-do.call("rbind", comb) ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n = -2)
data <- ddply(data, .(Tree), head, n = -2)
data <- cbind(data,
              age2 = matrix(sapply(split(Orange$age, Orange$Tree), smooth, D = 2, delta = 0.67),
                            ncol = 1, byrow = FALSE))
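The same split/apply/combine pattern also has a loop-free base-R form: split by group, transform each piece, then do.call(rbind, ...) the pieces back together. A sketch using the same Orange example and smoothing function (note that split() orders the pieces by the factor levels of Tree, so the row order may differ from the unique() loop):

```r
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D/delta))
  convolve(x, z, type = "filter")/convolve(rep(1, length(x)), z, type = "filter")
}
pieces <- lapply(split(Orange, Orange$Tree), function(tre) {
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  cbind(head(tail(tre, -2), -2), age2 = age2)  # trim the rows lost to the filter
})
final.data <- do.call(rbind, pieces)           # combine all groups by row
```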
