Transforming columns based on a separate dataframe - R solution

I'm trying to write some more sophisticated code. I have a problem which I have a solution for, but I would like to increase the flexibility of the code.
Input data (d):
ID Height_cm Height_m Weight_kg Weight_lb
 1       180       NA        70        NA
 2       165       NA        NA       120
 3        NA      1.8        80        NA
 4       100       NA        NA        60
 5       190       NA        NA       200
 6        NA      1.7       100        NA
I want to convert height into cm and weight into kg, each collapsed into a single column, like this:
ID Height Weight
1 180 70
2 165 54
3 180 80
4 100 27
5 190 90
6 170 100
I have a solution, but it's hard-coded:
library(tidyverse)
d$Height <- ifelse(!is.na(d$Height_cm), d$Height_cm, d$Height_m * 100)  # 1.8 m = 180 cm
d$Weight <- ifelse(!is.na(d$Weight_kg), d$Weight_kg, d$Weight_lb * 0.45)
d <- d %>% select(ID, Height, Weight)
I want to be more sophisticated: take in an input file (below) and drive the transformation from that dataframe. The result will be the same, but it uses this transformation df to achieve it:
Transformation_d:
marker unit new_col_name transformation
Height_cm centimetre Height 1
Height_m metre Height 0.01
Weight_kg kilogram Weight 1
Weight_lb pounds Weight 0.45
This is where I'm stuck... I'd appreciate some guidance.
Go easy, I'm new to R!

Height_m should have a transformation value of 100, I guess (1.8 m × 100 = 180 cm), not 0.01.
You can use cur_column() to get the current column name, match it against the Transformation_d dataframe, and multiply by the corresponding transformation value.
library(dplyr)
d %>%
  # scale every non-ID column by its factor from Transformation_d
  mutate(across(-1, ~ . * Transformation_d$transformation[
                          match(cur_column(), Transformation_d$marker)])) %>%
  transmute(ID,
            Height = coalesce(Height_cm, Height_m),
            Weight = coalesce(Weight_kg, Weight_lb))
# ID Height Weight
#1 1 180 70
#2 2 165 54
#3 3 180 80
#4 4 100 27
#5 5 190 90
#6 6 170 100
A similar approach in base R:
cbind(d[1],
      transform(sweep(d[-1], 2,
                      Transformation_d$transformation[match(names(d)[-1],
                                                            Transformation_d$marker)], `*`),
                Height = ifelse(is.na(Height_cm), Height_m, Height_cm),
                Weight = ifelse(is.na(Weight_kg), Weight_lb, Weight_kg)))
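Note that both answers still hard-code the Height/Weight pairing in the coalesce/ifelse step. For full generality that pairing can be driven by Transformation_d$new_col_name too — a minimal sketch, assuming every marker in Transformation_d exists as a column of d (and using the corrected factor of 100 for Height_m):
library(dplyr)
# scale each marker column by its factor from the lookup table
scaled <- d %>%
  mutate(across(all_of(Transformation_d$marker),
                ~ . * Transformation_d$transformation[
                        match(cur_column(), Transformation_d$marker)]))
# group the markers by their target column name and coalesce each group
groups  <- split(Transformation_d$marker, Transformation_d$new_col_name)
new_col <- lapply(groups, function(cols) Reduce(coalesce, scaled[cols]))
result <- cbind(scaled["ID"], as.data.frame(new_col))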

Related

multiplying columns in R

I have a data frame like this.
> abc
ID 1.x 2.x 1.y 2.y
1 4 10 20 30 40
2 16 5 10 5 10
3 42 16 17 18 19
4 91 20 20 20 20
5 103 103 42 56 84
How do I create two additional columns '1' and '2' by multiplying 1.x * 1.y and 2.x * 2.y in a generalized way?
I am trying to find a generalized solution where the number of columns can be large, so I want to multiply every x column by its matching y column. While the x and y suffixes are fixed, the number of pairs has to be figured out from the data frame.
For simplicity let's assume that number is also fixed, but large.
One thing I can try is:
abc[, c(6, 7)] <- abc[, c(2, 3)] * abc[, c(4, 5)]
It will only work if the column positions are contiguous. This is good enough for me, but if anyone has a more generalized solution, it will benefit us all.
If there are only a couple of variables to multiply, we can do this with transform, multiplying the columns of interest:
transform(abc, new1 = `1.x`*`1.y`, new2 = `2.x`*`2.y`, check.names = FALSE)
# ID 1.x 2.x 1.y 2.y new1 new2
#1 4 10 20 30 40 300 800
#2 16 5 10 5 10 25 100
#3 42 16 17 18 19 288 323
#4 91 20 20 20 20 400 400
#5 103 103 42 56 84 5768 3528
If we have lots of columns, one approach is to split the dataset into a list of data.frames by taking the substring of the names, and then loop through the list and multiply the paired columns with do.call:
abc[paste0("new", 1:2)] <- lapply(split.default(abc[-1],
    sub("\\.[a-z]+$", "", names(abc)[-1])),
  function(x) do.call(`*`, x))
Or another option, based on pairwise column multiplication:
apply(aperm(array(unlist(abc[-1]), c(5, 2, 2)),
            c(3, 1, 2)), 3, matrixStats::colProds)
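The hard-coded column positions and dimensions above can also be derived from the data. A base R sketch, assuming every non-ID column follows the "<pair>.<suffix>" naming pattern used in the question:
# derive the pair names ("1", "2", ...) from the column names
pairs <- unique(sub("\\.[a-z]+$", "", names(abc)[-1]))
# multiply each <pair>.x column by its matching <pair>.y column
abc[pairs] <- Map(`*`, abc[paste0(pairs, ".x")], abc[paste0(pairs, ".y")])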
mutate() will preserve the original variables; mutate_all() (superseded by across() in current dplyr) will apply a function to every column in your dataframe.
abc %>%
  mutate(new_vary1 = `1.x` * `1.y`,
         new_vary2 = `2.x` * `2.y`) %>%
  mutate_all(funs(. * `1.x`))

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the number of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it, i.e.:
values site
1 29 Enlades.Bay
2 370 Enlades.Bay
3 44 Enlades.Bay
4 65 Enlades.Bay
I'm able to use the following:
library(plyr)
data <- ddply(df, c("site"), summarise,
              N = length(values),
              mean = mean(values),
              sd = sd(values),
              se = sd / sqrt(N),
              sum = sum(values))
data
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
  melt(variable.name = "site") %>%
  group_by(site) %>%
  summarise_each(funs(n(), mean, sd, se = sd(.)/sqrt(n()), sum), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The analogous function in the tidyr package is gather (now pivot_longer).
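For reference, a minimal modern equivalent, assuming current tidyr and dplyr (pivot_longer replaces melt, and plain summarise replaces the now-deprecated summarise_each):
library(tidyr)
library(dplyr)
DF %>%
  pivot_longer(everything(), names_to = "site", values_to = "values") %>%
  group_by(site) %>%
  summarise(N = n(),
            mean = mean(values),
            sd = sd(values),
            se = sd / sqrt(N),   # summarise can use columns created just above
            sum = sum(values))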

groups of different size randomly selected within different classes

I have such a difficult question (at least for me) that I spent two hours just writing it; it has been impossible for me to program it myself. I'll try to be very clear, and I'm sorry if I'm not. I'm doing this in a very crude way in Excel, but I really need to program it.
I have a data.frame like this:
id_pix id_lote clase f1 f2
45 4 Sg 2460 2401
46 4 Sg 2620 2422
47 4 Sg 2904 2627
48 5 M 2134 2044
49 5 M 2180 2104
50 5 M 2127 2069
83 11 S 2124 2062
84 11 S 2189 2336
85 11 S 2235 2162
86 11 S 2162 2153
87 11 S 2108 2124
with 17451 "id_pixel" (rows), 2080 "id_lote" and 9 "clase".
This is the "id_lote" count per "clase" (v1 is the id_lote count):
clase v1
1: S 1099
2: P 213
3: Sg 114
4: M 302
5: Alg 27
6: Az 77
7: Po 228
8: Cit 13
9: Ma 7
I need to split the "id_lote" randomly within each "clase". For example, I have 1099 "id_lote" for the "S" "clase", which amounts to 9339 "id_pixel" (rows), and I want to randomly select 50% of those "id_lote" together with all of their "id_pixel" (rows). I need to do this for every "clase", considering that the size (number of "id_lote") of each "clase" is different. I would also like to be able to change the size of the selection (50%, 30%, etc.), and to keep the not-selected set of "id_lote". I hope someone can help me with this!
Here is a reproducible example.
This is the data with 2 clase (S and Az), 6 id_lote and 13 id_pixel:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
30 2 S 3021 2985
31 2 S 3020 2596
71 9 S 4725 4404
72 9 S 4759 4943
75 11 S 2728 2225
218 21 Az 4830 3007
219 21 Az 4574 2761
220 21 Az 5441 3092
1155 126 Az 7209 2449
1156 126 Az 7035 2932
and one result could be:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
75 11 S 2728 2225
1155 126 Az 7209 2449
1156 126 Az 7035 2932
where 50% of the id_lote were randomly selected in clase "S" (2 of 4 id_lote), but all the id_pixel in the selected id_lote were kept. The same for clase "Az": one id_lote was randomly selected (1 of 2 in this case) and all the id_pixel in the selected id_lote were kept.
What colemand77 proposed helped a lot. I think the dplyr package is useful for this, but if I do
df %>%
  group_by(clase, id_lote) %>%
  sample_frac(.3, replace = FALSE)
I get 30% of the data of each clase, but not grouped by id_lote like I need! I mean, 30% of the rows (id_pixel) were selected instead of 30% of the id_lote.
I hope this example helps to explain what I want to do and makes it useful for everybody. I'm sorry if I wasn't clear enough the first time.
Thanks a lot!
At first glance I'd say the dplyr package is your friend here.
df %>%
  group_by(clase, id_lote) %>%
  sample_frac(.3, replace = FALSE)
You first use group_by() and include the grouping levels you want to sample from, then you use sample_frac() to sample the fraction of the results you want from each group.
As near as I can tell this is what you are asking for. If not, please consider re-stating your question to include either a reproducible example or further clarification. Cheers.
To "keep" the not-selected members, I would add a column of unique ids and use anti_join() (also from the dplyr package) to find the ids that are not in common between the two data.frames (the result of the sampling and the original).
## Update ##
I'm understanding better now, I believe. Think about this as a two-step process...
1) You want to select x% (50% in the example) of the id_lote from each clase and return those id_lote numbers (I'm assuming that a given id_lote does not exist for multiple clase?)
2) You want to see all of the id_pixels that correspond to each selected id_lote, all in one data.frame.
I've broken this down into multiple steps for illustration, not because it is the fastest / prettiest way.
Raw data (I couldn't read your data into R):
df <- data.frame(id_pix  = 1:200,
                 id_lote = sample(1:20, 200, replace = TRUE),
                 clase   = sample(letters[1:10], 200, replace = TRUE),
                 f1      = sample(1000:2000, 200, replace = TRUE),
                 f2      = sample(2000:3000, 200, replace = TRUE))
1) Figure out which id_lote correspond to which clase. For this we use the dplyr summarise() function and store the result in a variable:
summary <- df %>%
  ungroup() %>%
  group_by(clase, id_lote) %>%
  summarise()
returns:
Source: local data frame [125 x 2]
Groups: clase
clase id_lote
1 a 1
2 a 2
3 a 4
4 a 5
5 a 6
6 a 7
7 a 8
8 a 9
9 a 11
10 a 12
.. ... ...
Then we sample to get 30% of the id_lote for each clase:
sampled_summary <- summary %>%
  group_by(clase) %>%
  sample_frac(.3, replace = FALSE)
The result of this is a data frame with two columns (clase and id_lote), with 30% of the id_lotes shown for each clase.
2) OK, so now we have the id_lotes randomly selected from each clase, but not the id_pix that are associated with them. To get those we do a join to pull in the corresponding full data set, including id_pix, etc.:
result <- sampled_summary %>%
  left_join(df)
The above copies the data set around quite a bit, so if you have a substantial data set you could do it all in one go:
result <- df %>%
  ungroup() %>%
  group_by(clase, id_lote) %>%
  summarise() %>%
  group_by(clase) %>%
  sample_frac(.5, replace = FALSE) %>%
  left_join(df)
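Following the anti-join idea mentioned earlier, the not-selected rows can then be recovered — a sketch, assuming id_pix uniquely identifies rows (it does in the simulated data above):
# rows of df whose id_pix does not appear in the sampled result
not_selected <- anti_join(df, result, by = "id_pix")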
If this doesn't get you what you want, let me know and we'll take another crack at it.

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4, sample(1:200, 1000, rep = TRUE)))
addme <- data.frame(replicate(1, sample(0:1, 1000, rep = TRUE)))
df <- cbind(df, addme)
df <- df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..), ]
The data is currently sorted to show the instances belonging to the 0 group and then the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean, show a row that belongs to the 0 group, then a row belonging to the 1 group, then the 0 group, and so on.
All I can think of is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4, sample(1:200, 10, rep = TRUE)),
                 addme = sample(0:1, 10, rep = TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme == 0), which(addme == 1)))), ])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which makes it suitable for within-group ordering. It's also probably pretty quick. If there are unbalanced numbers of 0's and 1's, it will leave the extras at the end.
library(dplyr)
df %>%
  arrange(addme) %>%                 # all the 0's first, then the 1's
  mutate(n0 = sum(addme == 0),
         # 0's keep their position k; the i-th 1 gets i + 0.5, so the
         # sorted order interleaves 1, 1.5, 2, 2.5, ...
         orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
  arrange(orderme) %>%
  select(-n0, -orderme)
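For comparison, a compact base R sketch with the same interleaving behavior (extras end up at the end), using ave() to index rows within each group:
# within-group position: 1, 2, 3, ... separately for the 0's and the 1's
idx <- ave(df$addme, df$addme, FUN = seq_along)
# sort by position, breaking ties so the 0 of each pair comes first
df[order(idx, df$addme), ]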

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame with 20 columns. I need to filter / remove noise from one column. After filtering using the convolve function I get a new vector of values; the filtering process drops values, so the filtered vector is shorter than the original column. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values, but I can't bind the filtered column to the original table because the number of rows differs. Let me illustrate using the 'age' column in the 'Orange' data set in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
The convolve filter used:
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  r <- convolve(x, z, type = "filter") /
       convolve(rep(1, length(x)), z, type = "filter")
  r <- head(tail(r, -D), -D)
  r
}
Filtering the 'age' column:
age2 <- smooth(Orange$age, 5, 10)
data.frame(age2)
The age and age2 columns have 35 and 15 rows respectively. The original dataset has 2 more columns, and I'd like to work with them as well. I only need the 15 rows of each column corresponding to the 15 rows of the age2 column. The filter here removed the first and last ten values from the age column. How can I apply the filter so that I get a truncated dataset with all columns but only the filtered rows?
You would need to figure out how the variables line up. If you can pad age2 with NA's and then do Orange$age2 <- age2 followed by na.omit(Orange), you should have what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know the first and last x observations will be removed, then the following works:
x <- 10 # here the filter drops 10 observations from each end
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2
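And a sketch of the NA-padding alternative mentioned at the start of this answer, assuming (as here) the filter drops exactly ten values from each end:
pad <- rep(NA, 10)               # one NA per dropped value at each end
Orange$age2 <- c(pad, age2, pad) # lengths now match: 10 + 15 + 10 = 35
df <- na.omit(Orange)            # keeps only the rows where age2 has a value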
