Select most frequent elements by a variable - r

If I have a data frame that looks like this:
id=c(1,1,1,2,2,3,3,3,3)
ans=c(1,1,3,3,3,4,3,1,4)
d=cbind(id,ans)
How can I select the most frequent answer per ID?
I would like to return a data frame that looks like this:
id=c(1,2,3)
ans=c(1,3,4)
d.out=cbind(id,ans)

You need a 2-way table, and then find the max count in each row:
tab <- table(id, ans)
data.frame(id=rownames(tab), ans=colnames(tab)[max.col(tab)])
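For the example data, the intermediate table counts each answer per id, and max.col picks the column holding the row maximum:

tab
#    ans
# id  1 3 4
#   1 2 1 0
#   2 0 2 0
#   3 1 1 2

Note that max.col breaks ties at random by default; pass ties.method = "first" if you need a deterministic result.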

What about this?
res <- sapply(split(ans, id), function(x) names(sort(table(x),decreasing=TRUE)[1]))
data.frame(id = names(res), ans = res)
  id ans
1  1   1
2  2   3
3  3   4


Find the mean of one variable subsetted by another variable

I have a list of identically structured data frames. Each data frame contains columns with unique variables (temp/DO) and columns with repeated variables (e.g. t1).
[[1]]
temp DO t1
   1  4  1
   3  9  1
   5  7  1
I want to find the mean of DO when the temperature is equal to t1.
t1 represents a specific temperature, but the value varies for each data frame in the list so I can't specify an actual value.
So far I've tried writing a function
hvod <- function(DO, temp, depth){
  hDO <- DO[which(temp==t1[1])]
  mHDO <- mean(hDO)
  htemp <- temp[which(temp=t1[1])]
  mhtemp <- mean(htemp)
}
hfit<-hvod(data$DO, data$temp, data$depth)
But for whatever reason t1 is not recognized. Any ideas on the function, OR a way to combine select (the dplyr function) and lapply to solve this?
I've seen similar posts but none that apply to the issue of a specific value (t1) that changes for each data frame.
I would just take the data frame as the argument and do the rest of the logic inside the function, as that gives the function more control. Something like this would work:
hvod <- function(data){
  temp <- data$temp
  t1 <- data$t1
  DO <- data$DO
  hDO <- DO[which(temp == t1[1])]
  mHDO <- mean(hDO)
  htemp <- temp[which(temp == t1[1])]
  mhtemp <- mean(htemp)
  list(meanDO = mHDO, meanTemp = mhtemp)  # return both means explicitly
}
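Assuming your data frames are collected in a list (like the ll defined below), you can then apply the function to each one:

results <- lapply(ll, hvod)  # one pair of means per data frame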
You can use the dplyr::bind_rows function to combine all the data frames from the list into one data frame.
Then group by the data frame number to find the mean of DO for the rows having temp == t1:
library(dplyr)
bind_rows(ll, .id = "DF_Name") %>%
  group_by(DF_Name) %>%
  filter(temp == t1) %>%
  summarise(MeanDO = mean(DO)) %>%
  as.data.frame()
#   DF_Name MeanDO
# 1       1    4.0
# 2       2    6.5
# 3       3    6.7
Data:
df1 <- read.table(text =
"temp DO t1
1 4 1
3 9 1
5 7 1",
header = TRUE)
df2 <- read.table(text =
"temp DO t1
3 4 3
3 9 3
5 7 1",
header = TRUE)
df3 <- read.table(text =
"temp DO t1
2 4 2
2 9 2
2 7 2",
header = TRUE)
ll <- list(df1, df2, df3)
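If you prefer base R, a minimal sketch of the same computation over the list (assuming the same ll as above):

data.frame(DF_Name = seq_along(ll),
           MeanDO  = sapply(ll, function(d) mean(d$DO[d$temp == d$t1])))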
Thank you Thiloshon and MKR for the help! I had initially combined the data I needed into one list of data frames, but to answer this I actually had my data in separate data frames (fitsObs and df1).
The variables I was working with in the code were 1 to 1, so by finding the range where depth and d2 were the same (I used temp and t1 in the example), I could find the mean over that range.
for(i in 1:1044){
  df1 <- GLNPOsurveyCTD$data[[i]]
  fitObs <- fitTp2(-df1$depth, df1$temp)
  deptho <- -abs(df1$depth)  # defining temp and depth in the loop
  to <- df1$temp
  do <- df1$DO
  xx <- which(deptho <= fitObs$d2)  # mean over range xx
  mhtemp <- mean(to[xx], na.rm = TRUE)
  mHDO <- mean(do[xx], na.rm = TRUE)
}
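As written, the loop overwrites mhtemp and mHDO on every iteration. To keep one value per profile, you could preallocate result vectors; a minimal sketch assuming the same GLNPOsurveyCTD data and fitTp2 function:

n <- 1044
mhtemp <- numeric(n)
mHDO <- numeric(n)
for(i in 1:n){
  df1 <- GLNPOsurveyCTD$data[[i]]
  fitObs <- fitTp2(-df1$depth, df1$temp)
  xx <- which(-abs(df1$depth) <= fitObs$d2)  # rows above the fitted depth d2
  mhtemp[i] <- mean(df1$temp[xx], na.rm = TRUE)
  mHDO[i] <- mean(df1$DO[xx], na.rm = TRUE)
}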

stack data frame by rows

I have a data set like this:
df <- data.frame(v1 = rnorm(12),
                 v2 = rnorm(12),
                 v3 = rnorm(12),
                 time = rep(1:3, 4))
It looks like this:
> head(df)
          v1         v2         v3 time
1  0.1462583 -1.1536425  3.0319594    1
2  1.4017828 -1.2532555 -0.1707027    2
3  0.3767506  0.2462661 -1.1279605    3
4 -0.8060311 -0.1794444  0.1616582    1
5 -0.6395198  0.4637165 -0.9608578    2
6 -1.6584524 -0.5872627  0.5359896    3
I now want to stack rows 1-3 in a new column, then rows 4-6, then 7-9 and so on.
This is my naive way to do it, but there must be a faster way that doesn't use so many helper variables and loops:
l <- list()
for(i in seq(1, nrow(df), by = 3)) {
  l[[length(l) + 1]] <- stack(df[i:(i+2), -4])$values  # time column is removed, was just for illustration
}
result <- do.call(cbind, l)
Only base R should be used.
We can use split on the 'time' column
sapply(split(df[-4], cumsum(df$time == 1)), function(x) stack(x)$values)
Or instead of stack, unlist could be faster
sapply(split(df[-4], cumsum(df$time == 1)), unlist)
Following the OP's row-block subsetting directly, with unlist:
sapply(seq(1, nrow(df), by = 3), function(i) unlist(df[i:(i+2), -4]))
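Either way, for this 12-row, 3-column example the result is a 9 x 4 matrix, one stacked block of 3 rows x 3 columns per column:

res <- sapply(split(df[-4], cumsum(df$time == 1)), unlist)
dim(res)
# [1] 9 4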

variable corresponding to a rowname in r

I have a data frame that looks like the one below:
v1 v2 v3
 1  2  3
 4  5  6
What should I do with the row names so that if the v2 value of the rows where rownames(df) %% 2 == 0 does not equal the v2 value of the rows where rownames(df) %% 2 == 1, both rows are deleted?
Thank you all.
Update:
For the df below, you can see that rows 1 and 2 have the same ID, so I want to keep these two rows as a pair (CODE shows 1 and 4).
Similarly, I want to keep rows 10 and 11 because they have the same ID and are a pair.
What should I do to get a new df?
1) Create a data frame with a column for the number of times each id occurs:
library(sqldf)
df2 <- sqldf("select count(id) as count, id from df group by id")
2) Merge it with the original:
df3 <- merge(df, df2)
3) Select only rows where the count is greater than 1:
df3[df3$count > 1, ]
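A base R one-liner with ave does the same filtering, assuming the column is named id (it counts the rows per id and keeps ids that appear more than once):

df[ave(seq_along(df$id), df$id, FUN = length) > 1, ]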
If what you are looking for is to keep paired IDs and delete the rest (I doubt it is as simple as this), then:
Extract your ids (I have written them out; you should extract them from your data):
id <- c(263733,263733,2913733,3243733,3723733,4493733,273733,393733,2953733,3583733,3583733)
Sort them:
id <- sort(id)
Find out which ones to keep by comparing each id with its neighbour:
id1 <- cbind(id[1:(length(id) - 1)], id[2:length(id)])
chosenID <- id1[which(id1[, 1] == id1[, 2]), 1]
And then extract from your df those rows that have chosenID.
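Putting that last step into code, assuming your data frame df has an id column:

df[df$id %in% chosenID, ]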

Edit column under condition

I have a table:
id <- c(1,1,2,2,2,2,2,3,3,4,4,5,5,5)
dist <- c(0,1,1,0,2,15,0,4,4,0,5,5,16,2)
data <- data.frame(id, dist)
I would like to edit the column id when dist is greater than a certain value (let's say 10). I am looking to add +1 when data$dist > 10.
The final output would be:
data$id_new <- c(1,1,2,2,2,3,3,4,4,5,5,6,7,7)
Is it possible to do something with an if loop? I tried something with a loop but was not successful.
Maybe using cumsum:
data$new_id <- data$id + cumsum(data$dist > 10)
Explanation:
cumsum(data$dist > 10) returns the running count of the values in data$dist that are greater than 10, since the logical vector is coerced to 0s and 1s. You can see how this works by taking the expression apart in R and looking at each piece.
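For the example data:

data$dist > 10
#  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
cumsum(data$dist > 10)
#  [1] 0 0 0 0 0 1 1 1 1 1 1 1 2 2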
We can use duplicated together with >:
with(data, cumsum(dist > 10| !duplicated(id)))
#[1] 1 1 2 2 2 3 3 4 4 5 5 6 7 7
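Here each new id starts either where dist > 10 or at the first row of an original id, so the cumulative sum of the combined condition rebuilds the counter from scratch:

with(data, dist > 10 | !duplicated(id))
#  [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE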

Random row selection in R

I have this dataframe
id <- c(1,1,1,2,2,3)
name <- c("A","A","A","B","B","C")
value <- c(7:12)
df<- data.frame(id=id, name=name, value=value)
df
This function selects a random row from it:
randomRows <- function(df, n){
  return(df[sample(nrow(df), n), ])
}
i.e.
randomRows(df,1)
But I want to randomly select one row per 'name' (or per 'id', which is the same) and concatenate each selected row into a new table, so in this case three rows. This has to work on a 2000+ row data frame. Please show me how?!
I think you can do this with the plyr package:
library("plyr")
ddply(df,.(name),randomRows,1)
which gives you for example:
  id name value
1  1    A     8
2  2    B    11
3  3    C    12
Is this what you are looking for?
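A dplyr alternative, if you are open to another package (sample_n draws n random rows per group):

library(dplyr)
df %>% group_by(name) %>% sample_n(1)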
Here's one way of doing it in base R.
> df.split <- split(df, df$name)
> df.sample <- lapply(df.split, randomRows, 1)
> df.final <- do.call("rbind", df.sample)
> df.final
  id name value
A  1    A     7
B  2    B    11
C  3    C    12
