I have a dataframe that looks like this:
df <- data.frame("CB_1.1"=c(0,5,6,2), "CB_1.16"=c(1,5,3,6), "HC_2.11"=c(3,3,4,5), "HC_1.12"=c(2,3,4,5), "HC_1.13"=c(1,0,0,5))
> df
  CB_1.1 CB_1.16 HC_2.11 HC_1.12 HC_1.13
1      0       1       3       2       1
2      5       5       3       3       0
3      6       3       4       4       0
4      2       6       5       5       5
I would like to take the mean across columns that share the part of the column name before the ".", resulting in a data frame like this:
  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0
You'll notice that the values of column HC_2.11 remain the same, because no other column in this data frame shares the HC_2 prefix.
Any help would be appreciated!
1) apply/tapply For each row, apply tapply to it with the name prefixes as the INDEX and mean as the function, then transpose the result. No packages are used.
prefix <- sub("\\..*", "", names(df))
t(apply(df, 1, tapply, prefix, mean))
giving this matrix (wrap it in data.frame(...) if you need a data frame result):
     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5
2) lm Run the regression shown below. The + 0 in the formula means that no intercept is added. The transpose of the coefficient matrix is the required matrix m; the next line then makes the names nicer. prefix is from (1). No packages are used.
m <- t(coef(lm(t(df) ~ prefix + 0)))
colnames(m) <- sub("prefix", "", colnames(m))
m
giving this matrix
     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5
This follows from the facts that (1) the model matrix X contains only ones and zeros and (2) distinct columns of it are orthogonal. The model matrix is shown here:
X <- model.matrix(~ prefix + 0) # model matrix
X
giving:
  prefixCB_1 prefixHC_1 prefixHC_2
1          1          0          0
2          1          0          0
3          0          0          1
4          0          1          0
5          0          1          0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$prefix
[1] "contr.treatment"
Because the columns of the model matrix X are orthogonal, the coefficient corresponding to any column x of X, for a particular row y of df (i.e., a column of t(df)), is just sum(x * y) / sum(x * x); and since x is a 0/1 vector, that equals the mean of the values of y at the positions where x is 1.
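A quick numeric check of this identity, using df, prefix, and X from above (the CB_1 coefficient for the first row of df):
y <- unlist(df[1, ])      # first row of df, i.e. first column of t(df)
x <- X[, "prefixCB_1"]    # the 0/1 indicator column for the CB_1 prefix
sum(x * y) / sum(x * x)   # 0.5
mean(y[x == 1])           # 0.5, the mean of the CB_1 values in that row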
3) stack/tapply Convert to long form, inserting an id column at the same time. Then use tapply to convert back to wide form, applying mean. No packages are used.
long <- transform(stack(df), ind = sub("\\..*", "", ind), id = c(row(df)))
with(long, tapply(values, long[c("id", "ind")], mean))
giving this table. Wrap it in as.data.frame.matrix if you want a data.frame.
   ind
id  CB_1 HC_1 HC_2
  1  0.5  1.5    3
  2  5.0  1.5    3
  3  4.5  2.0    4
  4  4.0  5.0    5
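For reference, the first rows of the intermediate long form look like this (id repeats the row numbers of df, and ind holds the stripped prefixes):
head(long)
#   values  ind id
# 1      0 CB_1  1
# 2      5 CB_1  2
# 3      6 CB_1  3
# 4      2 CB_1  4
# 5      1 CB_1  1
# 6      5 CB_1  2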
Here is a base R solution using rowMeans + split.default:
dfout <- as.data.frame(Map(rowMeans,
  split.default(df, factor(s <- gsub("\\..*$", "", names(df)), levels = unique(s)))))
such that
> dfout
  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0
If you do not mind the order of the column names, you can use the shorter code below:
dfout <- as.data.frame(Map(rowMeans,split.default(df,gsub("\\..*$","",names(df)))))
such that
> dfout
  CB_1 HC_1 HC_2
1  0.5  1.5    3
2  5.0  1.5    3
3  4.5  2.0    4
4  4.0  5.0    5
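The intermediate step may be easier to see on its own: split.default() groups the columns of df by prefix into a named list of data frames, and Map(rowMeans, ...) then averages each group:
str(split.default(df, gsub("\\..*$", "", names(df))), max.level = 1)
# List of 3
#  $ CB_1:'data.frame': 4 obs. of  2 variables:
#  $ HC_1:'data.frame': 4 obs. of  2 variables:
#  $ HC_2:'data.frame': 4 obs. of  1 variable: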
One option involving dplyr and purrr could be:
map_dfc(.x = unique(sub("\\..*$", "", names(df))),
        ~ df %>%
          transmute(!!.x := rowMeans(select(., starts_with(.x)))))
  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0
A base option could be:
# find the unique column-name prefixes by splitting on "."
cols <- unique(sapply(strsplit(names(df), ".", fixed = TRUE), `[`, 1))
# loop through each prefix and take the rowMeans of the matching columns
as.data.frame(sapply(cols, function(x) rowMeans(df[grep(x, names(df))])))
  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0
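One caveat: grep(x, names(df)) treats each prefix as a regular expression and matches it anywhere in the name, so a prefix that happens to occur inside another column name would match as well. Anchoring the pattern is a small defensive tweak (not needed for this particular data):
as.data.frame(sapply(cols, function(x) rowMeans(df[grep(paste0("^", x, "\\."), names(df))])))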
I have a data frame and I want to compare the values in row 3 using an if statement with a plus-or-minus tolerance.
Let's say I want to compare the values in the third row, treating them as equal if they are within 0.2 of each other.
> data
    NAME A   B   C   D
   first 3   2   4   5
  second 1   2   3   4
   third 7 7.1 7.5 6.9
    four 2   1   0   5
Here is a program that compares the exact values:
for (i in 1:3) {
  d <- i + 1
  for (j in d:4) {
    if (data[3, i] == data[3, j]) {
      print(paste("The columns", colnames(data)[i], "and", colnames(data)[j], "are equal"))
    }
  }
}
Here it returns nothing because the program compares the exact values, whereas I want to treat values as equal if they are within 0.2 of each other.
The result I want is:
the columns A and B are equal
the columns A and D are equal
This is because A (= 7) is within 0.2 of B (= 7.1), and likewise A (= 7) is within 0.2 of D (= 6.9).
Thank you
Take each combination of columns, then compare with a tolerance:
df1 <- read.table(text ="
NAME A B C D
first 3 2 4 5
second 1 2 3 4
third 7 7.1 7.5 6.9
four 2 1 0 5", header = TRUE)
tolerance <- 0.2
cbind(df1,
      combn(colnames(df1[, 2:5]), 2, FUN = function(x) {
        paste0(x[1],
               ifelse(abs(df1[, x[1]] - df1[, x[2]]) <= tolerance, "=", "!="),
               x[2])
      }))
# NAME A B C D 1 2 3 4 5 6
# 1 first 3 2.0 4.0 5.0 A!=B A!=C A!=D B!=C B!=D C!=D
# 2 second 1 2.0 3.0 4.0 A!=B A!=C A!=D B!=C B!=D C!=D
# 3 third 7 7.1 7.5 6.9 A=B A!=C A=D B!=C B=D C!=D
# 4 four 2 1.0 0.0 5.0 A!=B A!=C A!=D B!=C B!=D C!=D
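If you specifically want printed messages like in your loop, here is a small sketch over the same column pairs, using df1 and tolerance from above. Note that B and D differ by exactly 0.2, so they are reported as equal too, consistent with the B=D entry in the table above:
pairs <- combn(names(df1)[2:5], 2)
for (k in seq_len(ncol(pairs))) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  if (abs(df1[3, i] - df1[3, j]) <= tolerance)
    print(paste("The columns", i, "and", j, "are equal"))
}
# [1] "The columns A and B are equal"
# [1] "The columns A and D are equal"
# [1] "The columns B and D are equal"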
I have UTM coordinate values from GPS-collared leopards, and my analysis gets messed up if any points are identical. What I want to do is add a 1 to the end of the decimal string to make each value unique.
What I have:
> coords
         X       Y
1 623190.9 4980021
2 618876.6 4980729
3 618522.7 4980896
4 618522.7 4980096
5 618522.7 4980096
6 622674.1 4976161
I want something like this, or anything else that will make each number unique (it doesn't have to be +1):
> coords
          X         Y
1  623190.9   4980021
2  618876.6   4980729
3  618522.7   4980896
4 618522.71 4980096.1
5 618522.72 4977148.2
6  622674.1   4976161
I've looked at existing questions and got this to work for a simulated data set, but not for values that are duplicated more than once.
DF <- data.frame(A=c(5,5,6,6,7,7), B=c(1, 1, 2, 2, 2, 3))
> DF
  A B
1 5 1
2 5 1
3 6 2
4 6 2
5 7 2
6 7 3
DF <- do.call(rbind, lapply(split(DF, list(DF$A, DF$B)),
function(x) {
x$A <- x$A + seq(0, by=0.1, length.out=nrow(x))
x$B <- x$B + seq(0, by=0.1, length.out=nrow(x))
x
}))
> DF
        A   B
5.1.1 5.0 1.0
5.1.2 5.1 1.1
6.2.3 6.0 2.0
6.2.4 6.1 2.1
7.2   7.0 2.0
7.3   7.0 3.0
The '2's in column B don't keep gaining a decimal place when there are more than two of them. I also had problems when the number had more than four digits (i.e., XXXXX vs. XX). There's probably a better way to do this, but I would love help adding these decimals, ideally altering them in the original data frame, which has 12 columns of various data.
It is easier to use make.unique:
DF[] <- lapply(DF, function(x) as.numeric(make.unique(as.character(x))))
DF
#    A   B
#1 5.0 1.0
#2 5.1 1.1
#3 6.0 2.0
#4 6.1 2.1
#5 7.0 2.2
#6 7.1 3.0
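One caveat for the coordinate data in the question: make.unique() appends ".1", ".2", and so on, so the as.character/as.numeric round trip only works for values that do not already contain a decimal point; "618522.7" would become "618522.7.1", which as.numeric() turns into NA. For such data, a sketch that instead adds a small offset to each duplicate (the 0.01 step size is an arbitrary choice):
coords <- data.frame(X = c(623190.9, 618876.6, 618522.7, 618522.7, 618522.7, 622674.1),
                     Y = c(4980021, 4980729, 4980896, 4980096, 4980096, 4976161))
# within each group of identical values, add 0, 0.01, 0.02, ... in order
coords[] <- lapply(coords, function(x)
  x + ave(x, x, FUN = function(v) (seq_along(v) - 1) * 0.01))
# X now holds 618522.70, 618522.71, 618522.72 for the three duplicates;
# use print(coords, digits = 10) to see the full precision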
If we have this list with 100 objects:
List
[[1]]
      x   y   z
  379.0 0.6 1.0
  369.0 0.3 2.0

[[2]]
      x   y   z
  359.0 6.2 4.0
  379.0 4.6 1.0

[[3]]
      x   y   z
  379.0 6.0 1.0
  379.0 0.5 1.0
.
.
[[100]]
How could I obtain a matrix from the "y" column of each list element (all of them have the same length), getting something like:
X:
 y1  y2  y3 ... y100
0.6 6.2 6.0
0.3 4.6 0.5
Try below:
# example list
l <- list(cars[1:3, ], cars[4:6, ])
l
# [[1]]
#   speed dist
# 1     4    2
# 2     4   10
# 3     7    4
#
# [[2]]
#   speed dist
# 4     7   22
# 5     8   16
# 6     9   10
Subset the 2nd column using the "[" function, then bind the columns:
do.call(cbind, lapply(l, "[", 2))
#   dist dist
# 1    2   22
# 2   10   16
# 3    4   10
Or another similar approach: instead of binding columns, convert to a data.frame:
as.data.frame(lapply(l, "[", 2))
#   dist dist.1
# 1    2     22
# 2   10     16
# 3    4     10
Note: we could subset using column names, too. In my example data the 2nd column name is "dist", so replacing the number 2 above with "dist" works the same.
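Since the question asks for a matrix, sapply() with "[[" also gets there directly: it simplifies the extracted columns into a matrix, whose columns can then be renamed. Shown on the example list l; for the question's data you would extract "y" instead of "dist":
m <- sapply(l, `[[`, "dist")
colnames(m) <- paste0("y", seq_along(l))
m
#      y1 y2
# [1,]  2 22
# [2,] 10 16
# [3,]  4 10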
I am stuck on a probably simple question: how to sum consecutive duplicate rows and remove all but the first row. And if there is an NA between two duplicates (such as 2, NA, 2), also sum those and keep only the first entry.
So far so good; here is my sample data:
ia <- c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2)
time <- c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
a <- data.frame(ia, time)
Sample output:
a
   ia time
1   1  4.5
2   1  2.4
3   2  3.6
4  NA  1.5
5   2  1.2
6   1  4.9
7   1  6.4
8   1  4.4
9   1  4.7
10  2  7.3
11  1  2.3
12  2  4.3
Now I want to
1.) sum the "time" column over consecutive ia's, i.e., sum the times when the same ia value occurs two or more times in a row; in my case, sum the first and second rows of column time: 4.5 + 2.4.
2.) if there is an NA between two identical numbers in the ia column (i.e., ia = 2, NA, 2), then also sum all of those times.
3.) keep only the first occurrence of each run of ia, and delete the rest.
In the end, I would want to have something like this:
a
   ia time
1   1  6.9
3   2  6.3
6   1 20.4
10  2  7.3
11  1  2.3
12  2  4.3
I found this for summing, but it does not take consecutive runs into account:
aggregate(time~ia,data=a,FUN=sum)
and I found this for deleting
a[cumsum(rle(as.numeric(a[,1]))$lengths),]
although the rle approach keeps the last entry, and I want to keep the first. I also have no idea how to handle the NAs.
If I have a pattern of 1, NA, 2, the NA should NOT be counted with either number; in that case the NA row should simply be removed.
With data.table (and zoo's na.locf, as RHertel suggested):
library(data.table)
library(zoo)
setDT(a)[na.locf(ia, fromLast = TRUE) == na.locf(ia),
         sum(time),
         by = .(id = cumsum(c(TRUE, !!diff(na.locf(ia)))))]
#   id   V1
#1:  1  6.9
#2:  2  6.3
#3:  3 20.4
#4:  4  7.3
#5:  5  2.3
#6:  6  4.3
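The by= expression may be easier to read on its own: na.locf() fills the NA forward, !! turns the differences into logical change flags, and cumsum() turns the flags into run ids:
v <- na.locf(a$ia)           # 1 1 2 2 2 1 1 1 1 2 1 2
cumsum(c(TRUE, !!diff(v)))   # 1 1 2 2 2 3 3 3 3 4 5 6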
You first need to replace sequences of NAs with the values surrounding them (if they are the same). This answer shows zoo's na.locf function, which fills in NAs with the last observation. By testing whether it's the same when you carry values backwards or forwards, you can filter out the NAs you don't want, then do the carrying forward:
library(dplyr)
library(zoo)
a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia))
#>    ia time
#> 1   1  4.5
#> 2   1  2.4
#> 3   2  3.6
#> 4   2  1.5
#> 5   2  1.2
#> 6   1  4.9
#> 7   1  6.4
#> 8   1  4.4
#> 9   1  4.7
#> 10  2  7.3
#> 11  1  2.3
#> 12  2  4.3
Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:
result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  group_by(group = cumsum(change), ia) %>%
  summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#>
#>   group    ia  time
#>   (int) (dbl) (dbl)
#> 1     1     1   6.9
#> 2     2     2   6.3
#> 3     3     1  20.4
#> 4     4     2   7.3
#> 5     5     1   2.3
#> 6     6     2   4.3
If you want to get rid of the group column, use the additional lines:
result %>%
  ungroup() %>%
  select(-group)
nas <- which(is.na(a$ia))
add.index <- sapply(nas, function(x) {
  logi <- which(!is.na(a$ia))      # indices of the non-NA values
  aft  <- logi[logi > x][1]        # nearest non-NA position after the NA
  fore <- tail(logi[logi < x], 1)  # nearest non-NA position before the NA
  if (a$ia[aft] == a$ia[fore]) aft else NA
})
a$ia[nas] <- a$ia[add.index]
a <- a[complete.cases(a), ]
First we determine whether each NA value in the column is surrounded by the same value; if so, the surrounding value replaces the NA. There is no problem if the data has consecutive NA values.
Next we do a standard sum-by-group operation: cumsum creates a unique group id that increments at each change in the numbers.
a$grps <- cumsum(c(FALSE, a$ia[-length(a$ia)] != a$ia[-1])) + 1
# a$grps is now 1 1 2 2 2 3 3 3 3 4 5 6
aggregate(time ~ grps, a, sum)
#   grps time
# 1    1  6.9
# 2    2  6.3
# 3    3 20.4
# 4    4  7.3
# 5    5  2.3
# 6    6  4.3
This is a base R approach. With packages like dplyr, zoo, or data.table different options are available as they come built with specialized functions to do what we did here.
Let's say there is a vector sim that contains the following sequence of numbers:
1
2
4
7
5
3
2.5
4
6
How can I filter out all the segments of decreasing values so that sim is left with only the increasing runs? The expected result:
1
2
4
7
2.5
4
6
Based on @akrun's suggestion:
dif <- diff(sim) > 0
sim[ c(dif[1], dif) | c(dif, dif[length(dif)]) ]
[1] 1.0 2.0 4.0 7.0 2.5 4.0 6.0
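To see why this works: the first shifted copy of dif flags elements that are larger than their predecessor, the second flags elements that are smaller than their successor, and an element is kept if either holds (padding with dif[1] and dif[length(dif)] extends the test to the two endpoints):
sim <- c(1, 2, 4, 7, 5, 3, 2.5, 4, 6)
dif <- diff(sim) > 0
c(dif[1], dif)            # TRUE where sim[i] > sim[i-1]: T T T T F F F T T
c(dif, dif[length(dif)])  # TRUE where sim[i] < sim[i+1]: T T T F F F T T T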