Pairwise subtraction in a data frame with groups of different lengths - r

I have a data frame with 18528 rows and 3 columns like below:
Sample Target Value
100 A 21.5
100 A 20.5
100 B 19.5
100 B 19.75
100 B 18.15
100 B 21.95
200 A 21.1
200 A 21.6
200 B 23.5
200 B 20.75
100 C 21.25
100 C 22.0
100 C 18.33
100 C 21.84
I need to calculate the difference between values within each group:
Sample Target Value dif
100 A 21.5 1
100 A 20.5 1
100 B 19.5 0.25
100 B 19.75 1.6
100 B 18.15 3.8
100 B 21.95 2.45
200 A 21.1 0.5
200 A 21.6 0.5
200 B 23.5 2.75
200 B 20.75 2.75
100 C 21.25 0.75
100 C 22.0 3.67
100 C 18.33 3.51
100 C 21.84 0.59
Then, if the difference is more than 2, set that value to "NA", like:
Sample Target Value dif
100 A 21.5 1
100 A 20.5 1
100 B 19.5 0.25
100 B 19.75 1.6
100 B 18.15 3.8
100 B NA 2.45
200 A 21.1 0.5
200 A 21.6 0.5
200 B NA 2.75
200 B NA 2.75
100 C 21.25 0.75
100 C 22.0 3.67
100 C NA 3.51
100 C 21.84 0.59
I used combn to calculate the differences, but I got an error. I think the reason may be the different group lengths (2 and 4).
Thanks in advance.

You can get the desired output with the dplyr package. If you don't have it installed, first run install.packages("dplyr") or install it manually.
Then:
library(dplyr)
mydf <- read.table(text = "
Sample Target Value
100 A 21.5
100 A 20.5
100 B 19.5
100 B 19.75
100 B 18.15
100 B 21.95
200 A 21.1
200 A 21.6
200 B 23.5
200 B 20.75
100 C 21.25
100 C 22.0
100 C 18.33
100 C 21.84", header = TRUE)
mydf1 <- mydf %>%
  group_by(Sample, Target) %>%
  # shift each group's values up by one row, wrapping the first value to the end
  mutate(ValueShifted = c(Value[-1], Value[1]),
         dif = abs(Value - ValueShifted),
         # set values whose difference exceeds 2 to NA
         NewValue = ifelse(dif > 2, NA, Value))
> mydf1
Source: local data frame [14 x 6]
Groups: Sample, Target
Sample Target Value ValueShifted dif NewValue
1 100 A 21.50 20.50 1.00 21.50
2 100 A 20.50 21.50 1.00 20.50
3 100 B 19.50 19.75 0.25 19.50
4 100 B 19.75 18.15 1.60 19.75
5 100 B 18.15 21.95 3.80 NA
6 100 B 21.95 19.50 2.45 NA
7 200 A 21.10 21.60 0.50 21.10
8 200 A 21.60 21.10 0.50 21.60
9 200 B 23.50 20.75 2.75 NA
10 200 B 20.75 23.50 2.75 NA
11 100 C 21.25 22.00 0.75 21.25
12 100 C 22.00 18.33 3.67 NA
13 100 C 18.33 21.84 3.51 NA
14 100 C 21.84 21.25 0.59 21.84
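For readers who prefer to avoid a package dependency, the same shift-and-compare idea can be sketched in base R with ave(). This is only a sketch: the helper name cyclic_dif is made up for illustration, and it assumes (as the answer above does) that each value is compared with the next value in its group, with the last one wrapping around to the first.

```r
# Example rows copied from the question
mydf <- data.frame(
  Sample = c(100, 100, 100, 100, 100, 100, 200, 200, 200, 200, 100, 100, 100, 100),
  Target = c("A", "A", "B", "B", "B", "B", "A", "A", "B", "B", "C", "C", "C", "C"),
  Value  = c(21.5, 20.5, 19.5, 19.75, 18.15, 21.95, 21.1, 21.6,
             23.5, 20.75, 21.25, 22.0, 18.33, 21.84)
)

# Hypothetical helper: absolute difference to the next value in the group,
# wrapping the last value around to the first
cyclic_dif <- function(v) abs(v - c(v[-1], v[1]))

# ave() applies the helper within each Sample/Target group and keeps row order
mydf$dif <- ave(mydf$Value, mydf$Sample, mydf$Target, FUN = cyclic_dif)

# Values whose difference exceeds 2 become NA
mydf$NewValue <- ifelse(mydf$dif > 2, NA, mydf$Value)
mydf
```

Because ave() works group by group, groups of length 2 and length 4 are handled by the same call.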

Related

How to apply round() to odd or even rows only in R

Assume my original data frame is:
a b d e
1 1 1 2 1
2 20 30 40 30
3 1 2 6 2
4 40 50 40 50
5 5 5 3 5
6 60 60 60 60
I want to add a percentage row below each row.
a b d e
1 1.00 1.00 2.00 1.00
2 0.79 0.66 1.57 0.66
3 20.00 30.00 40.00 30.00
4 13.51 20.27 27.03 20.27
5 1.00 2.00 6.00 2.00
6 0.66 1.57 3.97 1.57
7 40.00 50.00 40.00 50.00
8 27.03 33.78 27.03 33.78
9 5.00 5.00 3.00 5.00
10 3.94 3.31 2.36 3.31
11 60.00 60.00 60.00 60.00
12 40.54 40.54 40.54 40.54
But as you can see, my odd rows get .00, which I do not want.
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df <- df %>% slice(rep(1:n(), each=2))
even <- seq_len(nrow(df)) %% 2 == 0
df[even, ] <- round(100 * df[even, ] / colSums(df[even, ]), 2)
How can I keep my odd rows without decimals?
The problem is that a column in a data frame can only hold one type of data. If some of the values in a column have decimals, then the whole column must be of type double. The only way to change how your data frame appears is via its print method.
Fortunately, you can easily turn your data frame into a tibble. This is a type of data frame, but prints in such a way that the integers don't have decimal points afterwards.
df
#> a b d e
#> 1 1.00 1.00 2.00 1.00
#> 2 0.79 0.66 1.57 0.66
#> 3 20.00 30.00 40.00 30.00
#> 4 13.51 20.27 27.03 20.27
#> 5 1.00 2.00 6.00 2.00
#> 6 0.66 1.57 3.97 1.57
#> 7 40.00 50.00 40.00 50.00
#> 8 27.03 33.78 27.03 33.78
#> 9 5.00 5.00 3.00 5.00
#> 10 3.94 3.31 2.36 3.31
#> 11 60.00 60.00 60.00 60.00
#> 12 40.54 40.54 40.54 40.54
dplyr::tibble(df)
#> # A tibble: 12 x 4
#> a b d e
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1
#> 2 0.79 0.66 1.57 0.66
#> 3 20 30 40 30
#> 4 13.5 20.3 27.0 20.3
#> 5 1 2 6 2
#> 6 0.66 1.57 3.97 1.57
#> 7 40 50 40 50
#> 8 27.0 33.8 27.0 33.8
#> 9 5 5 3 5
#> 10 3.94 3.31 2.36 3.31
#> 11 60 60 60 60
#> 12 40.5 40.5 40.5 40.5
Created on 2022-04-26 by the reprex package (v2.0.1)
Allan Cameron is right that a tibble prints better and does what you want. To offer another solution, though: if you're trying to print something that you might send to a text file (rather than just look at on the screen), you could format the values as character strings as follows:
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df %>%
mutate(obs = row_number(),
across(-obs, ~.x/sum(.x)),
type = "pct") %>%
bind_rows(df %>% mutate(obs = row_number(),
type = "raw")) %>%
mutate(type = factor(type, levels=c("raw", "pct"))) %>%
arrange(obs, type) %>%
mutate(across(a:e, ~case_when(
type == "raw" ~ sprintf("%.0f", .x),
TRUE ~ sprintf("%.2f%%", .x*100)))) %>%
select(-c(obs, type))
#> a b d e
#> 1 1 1 2 1
#> 2 0.79% 0.68% 1.32% 0.68%
#> 3 20 30 40 30
#> 4 15.75% 20.27% 26.49% 20.27%
#> 5 1 2 6 2
#> 6 0.79% 1.35% 3.97% 1.35%
#> 7 40 50 40 50
#> 8 31.50% 33.78% 26.49% 33.78%
#> 9 5 5 3 5
#> 10 3.94% 3.38% 1.99% 3.38%
#> 11 60 60 60 60
#> 12 47.24% 40.54% 39.74% 40.54%
Created on 2022-04-26 by the reprex package (v2.0.1)
Also note, I think the percentages you calculated are wrong. With your data, the percentages in column a do not add up to 100:
sum(df$a[c(2,4,6,8,10,12)])
#> [1] 86.47
With mine, which differ from yours, I get 100 (once the strings are turned back into numbers).
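A likely cause of the discrepancy is recycling: dividing a data frame by a plain vector reuses the vector down each column, so the four column totals are not matched to their own columns. The sketch below, which assumes the question's original six-row data, contrasts that with sweep(), which divides column j by the j-th total.

```r
df <- data.frame(a = c(1, 20, 1, 40, 5, 60),
                 b = c(1, 30, 2, 50, 5, 60),
                 d = c(2, 40, 6, 40, 3, 60),
                 e = c(1, 30, 2, 50, 5, 60))

# The four totals are recycled down the rows of every column, so most cells
# are divided by the total of a different column
wrong <- 100 * df / colSums(df)

# sweep() over margin 2 pairs each column with its own total
pct <- 100 * sweep(df, 2, colSums(df), "/")

colSums(wrong)  # column totals differ from 100
colSums(pct)    # each column adds up to 100
round(pct, 2)   # rounded for display only
```

Rounding is left to the last step, since rounded percentages need not add up to exactly 100.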

How to convert a list into a data.frame in R?

I've created a frequency table in R with the fdth package using this code
fdt(x, breaks = "Sturges")
The specific result was:
Class limits f rf rf(%) cf cf(%)
[-15.907,-11.817) 12 0.00 0.10 12 0.10
[-11.817,-7.7265) 8 0.00 0.07 20 0.16
[-7.7265,-3.636) 6 0.00 0.05 26 0.21
[-3.636,0.4545) 70 0.01 0.58 96 0.79
[0.4545,4.545) 58 0.00 0.48 154 1.27
[4.545,8.6355) 91 0.01 0.75 245 2.01
[8.6355,12.726) 311 0.03 2.55 556 4.57
[12.726,16.817) 648 0.05 5.32 1204 9.89
[16.817,20.907) 857 0.07 7.04 2061 16.93
[20.907,24.998) 1136 0.09 9.33 3197 26.26
[24.998,29.088) 1295 0.11 10.64 4492 36.90
[29.088,33.179) 1661 0.14 13.64 6153 50.55
[33.179,37.269) 2146 0.18 17.63 8299 68.18
[37.269,41.36) 2525 0.21 20.74 10824 88.92
[41.36,45.45) 1349 0.11 11.08 12173 100.00
It was given as a list:
> class(x)
[1] "fdt.multiple" "fdt" "list"
I need to convert it into a data frame object, so I can have a table. How can I do it?
I'm a beginner at using R :(
Since you did not provide a reproducible example of your data, I have used an example from the help page ?fdt that is close to what you have.
library(fdth)
mdf <- data.frame(c1=sample(LETTERS[1:3], 1e2, TRUE),
c2=as.factor(sample(1:10, 1e2, TRUE)),
n1=c(NA, NA, rnorm(96, 10, 1), NA, NA),
n2=rnorm(100, 60, 4),
n3=rnorm(100, 50, 4),
stringsAsFactors=TRUE)
fdt <- fdt(mdf,breaks='FD',by='c1')
class(fdt)
#[1] "fdt.multiple" "fdt" "list"
You can extract the table part from each list and bind them together.
result <- purrr::map_df(fdt, `[[`, 'table')
#In base R
#result <- do.call(rbind, lapply(fdt, `[[`, 'table'))
result
# Class limits f rf rf(%) cf cf(%)
#1 [8.1781,9.1041) 5 0.20833333 20.833333 5 20.833333
#2 [9.1041,10.03) 6 0.25000000 25.000000 11 45.833333
#3 [10.03,10.956) 10 0.41666667 41.666667 21 87.500000
#4 [10.956,11.882) 3 0.12500000 12.500000 24 100.000000
#5 [53.135,56.121) 4 0.16000000 16.000000 4 16.000000
#6 [56.121,59.107) 8 0.32000000 32.000000 12 48.000000
#7 [59.107,62.092) 8 0.32000000 32.000000 20 80.000000
#....
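Binding the tables this way drops the information about which list element each row came from. As a rough base-R sketch of keeping that label, here is the same extract-and-bind step on a hypothetical stand-in list. The list tabs, its element names and its values are invented for illustration and only mimic the shape used above (each element carrying a data frame called table); they are not real fdth output.

```r
# Hypothetical stand-in for the fdt.multiple object: a named list whose
# elements each hold a data frame called `table`
tabs <- list(
  A.n1 = list(table = data.frame(`Class limits` = c("[8,9)", "[9,10)"),
                                 f = c(5, 6), check.names = FALSE)),
  B.n1 = list(table = data.frame(`Class limits` = c("[7,8)", "[8,9)"),
                                 f = c(4, 8), check.names = FALSE))
)

# Pull out the `table` part of every element
pieces <- lapply(tabs, `[[`, "table")

# Prepend the element name as a `group` column, then stack the pieces
labelled <- Map(function(tab, id) cbind(group = id, tab), pieces, names(pieces))
result <- do.call(rbind, labelled)
rownames(result) <- NULL
result
```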

R moving average

As an example I use the Boston data with 3 columns (id (added), medv, lstat) and 506 observations.
I want to calculate a moving average for k-1 observations for the variable medv. This means that the mean value should be calculated over all observations except a certain row. For id 1, the mean value is calculated from line 2-506. For id 2, the mean value is calculated over line 1 + 3-506. For id 3, the mean value is calculated over the lines 1-2 + 4-506 and so on.
In a second step the calculation of the mean value should be conditional, e.g. above the median and below the median in two different columns. This means that we first check whether a value within each column (medv and lstat) is above or below the median. If the value in medv is above the median, we calculate the mean value of lstat from the values that are above the median in lstat. If the value in medv is below the median, we calculate the mean value of lstat from the values that are below the median. See example table below for the first 10 rows. The median for the first 10 rows is 25.55 for medv and 7.24 for lstat.
Here is the data:
library(mlbench)
data(BostonHousing)
df <- BostonHousing
df$id <- seq.int(nrow(df))
df <- subset(df, select = c(id, medv, lstat))
id medv lstat mean1out meancond
1 24.0 4.98 26.66667 4.50
2 21.6 9.14 26.93333 4.50
3 34.7 4.03 25.47778 17.55
4 33.4 2.94 25.62222 17.55
5 36.2 5.33 25.31111 17.55
6 28.7 5.21 26.14444 17.55
7 22.9 12.43 26.78889 4.50
8 27.1 19.15 26.32222 17.55
9 16.5 29.93 27.50000 4.50
10 18.9 17.10 27.23333 4.50
The first part of the problem is already solved by @r2evans.
For the second part, we can calculate the medians of lstat and medv, compare each medv value with its median, and assign the matching lstat mean.
# First part, from @r2evans's answer.
n <- nrow(df)
df$mean1out <- (mean(df$medv)*n - df$medv)/(n-1)
#Second part
med_lsat <- median(df$lstat)
med_medv <- median(df$medv)
higher_lsat <- mean(df$lstat[df$lstat > med_lsat])
lower_lsat <- mean(df$lstat[df$lstat < med_lsat])
df$meancond <- ifelse(df$medv > med_medv, higher_lsat, lower_lsat)
df
# id medv lstat mean1out meancond
#1 1 24.0 4.98 26.66667 4.498
#2 2 21.6 9.14 26.93333 4.498
#3 3 34.7 4.03 25.47778 17.550
#4 4 33.4 2.94 25.62222 17.550
#5 5 36.2 5.33 25.31111 17.550
#6 6 28.7 5.21 26.14444 17.550
#7 7 22.9 12.43 26.78889 4.498
#8 8 27.1 19.15 26.32222 17.550
#9 9 16.5 29.93 27.50000 4.498
#10 10 18.9 17.10 27.23333 4.498
data
df <- BostonHousing
df$id <- seq.int(nrow(df))
df <- subset(df, select = c(id, medv, lstat))
df <- head(df, 10)
# dat is the question's 10-row example table (the same rows as df above)
mean(dat$medv[-3])
# [1] 25.47778
sapply(seq_len(nrow(dat)), function(i) mean(dat$medv[-i]))
# [1] 26.66667 26.93333 25.47778 25.62222 25.31111 26.14444 26.78889 26.32222 27.50000 27.23333
Alternatively (mathematically), without the sapply, you can get the same numbers this way:
n <- nrow(dat)
(mean(dat$medv)*n - dat$medv)/(n-1)
# [1] 26.66667 26.93333 25.47778 25.62222 25.31111 26.14444 26.78889 26.32222 27.50000 27.23333
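The shortcut works because removing one observation from a sum is a single subtraction. Written out, with x_i standing for medv in row i and n for the number of rows (notation introduced here for illustration), the leave-one-out mean is:

```latex
\bar{x}_{-i}
  = \frac{1}{n-1} \sum_{j \neq i} x_j
  = \frac{\left(\sum_{j=1}^{n} x_j\right) - x_i}{n-1}
  = \frac{n\,\bar{x} - x_i}{n-1}
```

For the ten example rows the medv values add up to 264.0, so for row 3 this gives (264.0 - 34.7) / 9 = 25.47778, matching mean(dat$medv[-3]) above.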
For your conditional mean, a simple ifelse works:
n <- nrow(dat)
transform(
dat,
a = (mean(dat$medv)*n - dat$medv)/(n-1),
b = ifelse(medv <= median(medv),
mean(lstat[ lstat <= median(lstat) ]),
mean(lstat[ lstat > median(lstat) ]))
)
# id medv lstat mean1out meancond a b
# 1 1 24.0 4.98 26.66667 4.50 26.66667 4.498
# 2 2 21.6 9.14 26.93333 4.50 26.93333 4.498
# 3 3 34.7 4.03 25.47778 17.55 25.47778 17.550
# 4 4 33.4 2.94 25.62222 17.55 25.62222 17.550
# 5 5 36.2 5.33 25.31111 17.55 25.31111 17.550
# 6 6 28.7 5.21 26.14444 17.55 26.14444 17.550
# 7 7 22.9 12.43 26.78889 4.50 26.78889 4.498
# 8 8 27.1 19.15 26.32222 17.55 26.32222 17.550
# 9 9 16.5 29.93 27.50000 4.50 27.50000 4.498
# 10 10 18.9 17.10 27.23333 4.50 27.23333 4.498
(I'm inferring that the differences are rounding errors on data entry.)

How do I apply an ifelse function to all cells in a data frame?

I am trying to apply an ifelse statement to all the cells in my data frame. I'm pretty sure I am overthinking this but would appreciate some help/guidance!
I have a dataframe of (slightly modified) percent cover of vegetation from a number of sites where the site names and the vegetation types are the row names and column names, respectively (ie. the data frame should only consist of numeric values):
dwarf shrub equisetum forb fungi graminoid lichen moss shrub-forb tall shrub tree
site1 33.25 0 21.25 1.0 35.25 3.25 60.00 0.00 34.25 0.25
site2 30.25 0 15.00 0.0 25.75 7.50 62.25 1.50 26.75 0
site3 50.00 0 10.00 0.5 23.50 3.25 65.00 6.75 18.50 0
site4 46.00 0 7.75 0.0 32.75 2.25 33.75 4.50 11.25 0.75
site5 28.00 0 11.00 0.0 40.00 6.00 30.00 0.00 38.00 0
site6 40.25 0 10.50 0.0 5.75 6.25 7.25 3.25 8.75 1.25
I am trying to round the numbers to the nearest whole number such that the round() function is used when the value is greater than 1 and the ceiling() function is used when the value is less than 1.
Here is the code I have written to try do this:
new.df <- if(old.df > 1){
round(old.df, digits = 0)} else{
ceiling(old.df)
}
I have also tried without the ceiling function:
new.df <- if(old.df > 1){
round(old.df, digits = 0)} else{
old.df == 1
}
I have not been successful in applying the second half of the statement (ceiling()). I get this error:
Warning message:
In if (old.df > 1) { :
the condition has length > 1 and only the first element will be used
Any assistance would be much appreciated, thank you!
You mentioned ifelse; I think it's straightforward enough to apply it to each column using lapply. (I'll add the isnum check in case there are non-numeric columns in the data; feel free to ignore it if your data is always numeric.) Note that round() rounds halves to the even digit, so 4.50 becomes 4 and 7.50 becomes 8.
isnum <- sapply(dat, is.numeric)
# round() values above 1, ceiling() everything else, as described in the question
dat[isnum] <- lapply(dat[isnum], function(x) ifelse(x > 1, round(x, 0), ceiling(x)))
dat
# dwarf_shrub equisetum forb fungi graminoid lichen moss shrub_forb tall_shrub tree
# site1 33 0 21 1 35 3 60 0 34 1
# site2 30 0 15 0 26 8 62 2 27 0
# site3 50 0 10 1 24 3 65 7 18 0
# site4 46 0 8 0 33 2 34 4 11 1
# site5 28 0 11 0 40 6 30 0 38 0
# site6 40 0 10 0 6 6 7 3 9 1
Data: I had to rename some of the columns, since names with spaces or hyphens are not easy to read in.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
dwarf_shrub equisetum forb fungi graminoid lichen moss shrub_forb tall_shrub tree
site1 33.25 0 21.25 1.0 35.25 3.25 60.00 0.00 34.25 0.25
site2 30.25 0 15.00 0.0 25.75 7.50 62.25 1.50 26.75 0
site3 50.00 0 10.00 0.5 23.50 3.25 65.00 6.75 18.50 0
site4 46.00 0 7.75 0.0 32.75 2.25 33.75 4.50 11.25 0.75
site5 28.00 0 11.00 0.0 40.00 6.00 30.00 0.00 38.00 0
site6 40.25 0 10.50 0.0 5.75 6.25 7.25 3.25 8.75 1.25")
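When every column is numeric, another option is to treat the whole table as a matrix, so that a single ifelse() call covers all cells at once. The sketch below uses a small made-up subset of the question's values (three columns, two sites) and applies the rule as the question states it: round() above 1, ceiling() otherwise.

```r
# Small illustrative subset of the question's data; site names are row names
old.df <- data.frame(moss  = c(60.00, 7.25),
                     tree  = c(0.25, 0),
                     fungi = c(1.0, 0.5),
                     row.names = c("site1", "site6"))

# ifelse() is vectorised and keeps the shape of its test, so a logical
# matrix in gives a matrix of results out
m <- as.matrix(old.df)
new.df <- as.data.frame(ifelse(m > 1, round(m), ceiling(m)))
new.df
```

Unlike if (), which expects a single TRUE or FALSE, ifelse() evaluates the condition element by element, which is why the warning in the question does not appear here.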

R: Manipulating dataframes

Df_01a
Name re1 re2 re3 parameter
a 144 39.7 0.012 fed
b 223 31.2 5 fed
c 304 6.53 100 fed
d 187 51.3 25 fed
e 110 2.94 100 fed
f 151 4.23 75 fed
g 127 36.7 0.012 fed
Df_01b
Name re1 re2 re3 parameter
a 142 39.3 0.042 feh
b 221 31.0 4 feh
c 301 6.13 90 feh
d 185 41.3 15 feh
e 107 2.44 940 feh
f 143 2.23 75 feh
g 121 31.7 0.012 feh
Df_02
parameter c1 c2 c3
1 fed 5 4 3
2 feh 3 4 2
3 fea 5 4 3
4 few 2 4 3
Desired result:
c-value re-value name
5 142 a_fed
4 39.3 a_fed
3 0.042 a_fed
5 221 b_fed
4 31.0 b_fed
3 4 b_fed
5 304 c_fed
4 6.53 c_fed
3 100 c_fed
....
3 0.012 g_fed
3 142 a_feh
4 39.3 a_feh
2 0.042 a_feh
3 221 b_feh
4 31.0 b_feh
2 4 b_feh
....
I have Df_01a, Df_01b, Df_01c, Df_01d. These have a parameter in
column 5: fed, feh, fea, few, respectively (See Df_02).
Each parameter has 3 values, given by c1, c2 and c3 in Df_02.
How can I get the desired data.frame shown above?
code
library(dplyr)
library(tidyr)
rbind(Df_01a, Df_01b) %>%
  gather("re-col", "re-value", c("re1", "re2", "re3")) %>%
  # rename c1..c3 to re1..re3 so both long tables share the "re-col" key
  inner_join(Df_02 %>% rename(re1 = c1, re2 = c2, re3 = c3) %>%
               gather("re-col", "c-value", c("re1", "re2", "re3")),
             by = c("parameter", "re-col")) %>%
  arrange(parameter, Name) %>%
  unite(name, Name, parameter) %>%
  select(`c-value`, `re-value`, `name`)
result
# c-value re-value name
# 1 5 144.000 a_fed
# 2 4 39.700 a_fed
# 3 3 0.012 a_fed
# 4 5 223.000 b_fed
# 5 4 31.200 b_fed
# 6 3 5.000 b_fed
# 7 5 304.000 c_fed
# 8 4 6.530 c_fed
# 9 3 100.000 c_fed
# 10 5 187.000 d_fed
# 11 4 51.300 d_fed
# 12 3 25.000 d_fed
# 13 5 110.000 e_fed
# 14 4 2.940 e_fed
# 15 3 100.000 e_fed
# 16 5 151.000 f_fed
# 17 4 4.230 f_fed
# 18 3 75.000 f_fed
# 19 5 127.000 g_fed
# 20 4 36.700 g_fed
# 21 3 0.012 g_fed
# 22 3 142.000 a_feh
# 23 4 39.300 a_feh
# 24 2 0.042 a_feh
# 25 3 221.000 b_feh
# 26 4 31.000 b_feh
# 27 2 4.000 b_feh
# 28 3 301.000 c_feh
# 29 4 6.130 c_feh
# 30 2 90.000 c_feh
# 31 3 185.000 d_feh
# 32 4 41.300 d_feh
# 33 2 15.000 d_feh
# 34 3 107.000 e_feh
# 35 4 2.440 e_feh
# 36 2 940.000 e_feh
# 37 3 143.000 f_feh
# 38 4 2.230 f_feh
# 39 2 75.000 f_feh
# 40 3 121.000 g_feh
# 41 4 31.700 g_feh
# 42 2 0.012 g_feh
data
Df_01a <- read.table(text="Name re1 re2 re3 parameter
a 144 39.7 0.012 fed
b 223 31.2 5 fed
c 304 6.53 100 fed
d 187 51.3 25 fed
e 110 2.94 100 fed
f 151 4.23 75 fed
g 127 36.7 0.012 fed",header=T,stringsAsFactors=F)
Df_01b <- read.table(text="Name re1 re2 re3 parameter
a 142 39.3 0.042 feh
b 221 31.0 4 feh
c 301 6.13 90 feh
d 185 41.3 15 feh
e 107 2.44 940 feh
f 143 2.23 75 feh
g 121 31.7 0.012 feh",header=T,stringsAsFactors=F)
Df_02 <- read.table(text="parameter c1 c2 c3
1 fed 5 4 3
2 feh 3 4 2
3 fea 5 4 3
4 few 2 4 3",header=T,stringsAsFactors=F)
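In current tidyr, gather() is superseded by pivot_longer(), so the same reshape-and-join idea could be sketched as below. This is only an illustration on a two-row subset of the data above; the column names idx, re_value and c_value are chosen here and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Two rows of Df_01a and two rows of Df_02, copied from the data above
Df_01a <- data.frame(Name = c("a", "b"), re1 = c(144, 223), re2 = c(39.7, 31.2),
                     re3 = c(0.012, 5), parameter = "fed")
Df_02 <- data.frame(parameter = c("fed", "feh"),
                    c1 = c(5, 3), c2 = c(4, 4), c3 = c(3, 2))

# Long format: strip the "re" / "c" prefix so both tables share the index 1..3
re_long <- Df_01a %>%
  pivot_longer(re1:re3, names_to = "idx", names_prefix = "re", values_to = "re_value")
c_long <- Df_02 %>%
  pivot_longer(c1:c3, names_to = "idx", names_prefix = "c", values_to = "c_value")

# Match each re value with the c value of the same parameter and index
result <- re_long %>%
  inner_join(c_long, by = c("parameter", "idx")) %>%
  unite("name", Name, parameter) %>%
  select(c_value, re_value, name)
result
```

pivot_longer() emits the rows one original row at a time, so the re1, re2, re3 order within each name is kept without an explicit arrange().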
