R: Data frame column specific data manipulation

I have a data frame
x <- data.frame(id=letters[1:3],val0=c(100,200,300),val1=c(400,500,600),val2=c(700,800,900))
I want to divide the odd columns by a specific number, say n1, and the even columns by another number, say n2. So, the result I want is:
>n1<-2
>n2<-5
id val0 val1 val2
a 50 80 350
b 100 100 400
c 150 120 450
Can someone suggest how to do this?
Thanks.

You can use the function seq() to generate the column numbers and then subset those columns. For even columns start with 2, and for odd columns start with 3 (column 1 holds the id). Then replace the selected columns with the same columns divided by the number you are interested in.
x[,seq(2,ncol(x),2)]<-x[,seq(2,ncol(x),2)]/n1
x[,seq(3,ncol(x),2)]<-x[,seq(3,ncol(x),2)]/n2
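As a quick check, the two assignments above can be run end to end on the question's data. This is only a minimal sketch; the names even_pos and odd_pos are introduced here for illustration and are not part of the original answer.

```r
# Data and divisors taken from the question
x <- data.frame(id = letters[1:3],
                val0 = c(100, 200, 300),
                val1 = c(400, 500, 600),
                val2 = c(700, 800, 900))
n1 <- 2
n2 <- 5

# Column positions (hypothetical helper names); column 1 is the id column
even_pos <- seq(2, ncol(x), by = 2)  # columns 2 and 4: val0 and val2
odd_pos  <- seq(3, ncol(x), by = 2)  # column 3: val1

# Divide each set of columns by its own divisor
x[even_pos] <- x[even_pos] / n1
x[odd_pos]  <- x[odd_pos] / n2
x
```

Note that seq(3, ncol(x), by = 2) throws an error when the data frame has fewer than three columns, so a guard would be needed for narrower inputs.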

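A slightly disguised for loop (skipping the id column, so the first value column is divided by 2, the second by 5, and so on):
x[-1] <- lapply(seq_len(ncol(x) - 1), function(i) x[, i + 1]/ifelse(i%%2, 2, 5))
And just for kicks:
x[-1] <- lapply(seq_len(ncol(x) - 1), function(i) x[, i + 1]/if(i%%2) 2 else 5)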

Drawing equally-sized samples from differently-sized substrata of a dataframe in R [duplicate]

I have a dataframe with multiple columns containing, inter alia, words and their position in sentences. For some positions, there's more rows than for other positions. Here's a mock example:
df <- data.frame(
word = sample(LETTERS, 100, replace = T),
position = sample(1:5, 100, replace = T)
)
head(df)
word position
1 K 1
2 R 5
3 J 2
4 Y 5
5 Z 5
6 U 4
Obviously, the tranches of 'position' are differently sized:
table(df$position)
1 2 3 4 5
15 15 17 28 25
To make the different tranches more easily comparable I'd like to draw equally sized samples on the variable 'position' within one dataframe. This can theoretically be done in steps, such as these:
df_pos1 <- df[df$position==1,]
df_pos1_sample <- df_pos1[sample(1:nrow(df_pos1), 3),]
df_pos2 <- df[df$position==2,]
df_pos2_sample <- df_pos2[sample(1:nrow(df_pos2), 3),]
df_pos3 <- df[df$position==3,]
df_pos3_sample <- df_pos3[sample(1:nrow(df_pos3), 3),]
df_pos4 <- df[df$position==4,]
df_pos4_sample <- df_pos4[sample(1:nrow(df_pos4), 3),]
df_pos5 <- df[df$position==5,]
df_pos5_sample <- df_pos5[sample(1:nrow(df_pos5), 3),]
and so on, to finally combine the individual samples in a single dataframe:
df_samples <- rbind(df_pos1_sample, df_pos2_sample, df_pos3_sample, df_pos4_sample, df_pos5_sample)
but this procedure is cumbersome and error-prone. A more economical solution might be a for loop. I've tried this code so far, which, however, returns, not a combination of the individual samples for each position value but a single sample drawn from all values for 'position':
df_samples <-c()
for(i in unique(df$position)){
df_samples <- rbind(df[sample(1:nrow(df[df$position==i,]), 3),])
}
df_samples
word position
13 D 2
2 R 5
12 G 3
4 Y 5
16 Z 3
11 S 3
6 U 4
14 J 3
9 O 5
1 K 1
What's wrong with this code and how can it be improved?
Consider by() to split the data frame by position and sample within each subset. Then rbind all the resulting data frames together outside the loop with do.call().
df_list <- by(df, df$position, function(sub) sub[sample(1:nrow(sub), 3),])
final_df <- do.call(rbind, df_list)
Currently you index the entire (not subsetted) data frame in each iteration. Also, you are using rbind inside a for loop which is memory-intensive and not advised.
Specifically,
by is the object-oriented wrapper to tapply and essentially splits a data frame into subsets by factor(s) and passes each subset into a defined function. Here sub is just the name of subsetted variable (can be named anything). The result here is a list of data frames.
do.call essentially runs a compact version of an expanded call across multiple elements where rbind(df1, df2, df3) is equivalent to do.call(rbind, list(df1, df2, df3)). The key here to note is rbind is not called inside a loop (avoiding the danger of growing complex objects like a data frame inside an iteration) but once outside the loop.
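The same split, sample, and combine idea can be sketched with split() and lapply(), which makes the intermediate list explicit. This is an illustrative sketch, not the answer's own code: the seed and the names pieces and sampled are assumptions introduced here.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(word = sample(LETTERS, 100, replace = TRUE),
                 position = sample(1:5, 100, replace = TRUE))

# Split into one data frame per position, sample 3 rows from each,
# then bind the pieces together with a single rbind call
pieces <- split(df, df$position)
sampled <- lapply(pieces, function(sub) sub[sample(nrow(sub), 3), ])
final_df <- do.call(rbind, sampled)

table(final_df$position)  # each position should appear 3 times
```

Sampling 3 rows fails for any group with fewer than 3 rows, so very small groups would need separate handling.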
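Each time you run the loop you are overwriting the last entry, and the sampled row numbers index the full data frame rather than the subset. Try:
df_samples <- data.frame()
for(i in unique(df$position)){
df_samples <- rbind(df_samples, df[sample(which(df$position==i), 3),])
}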
We can use data.table: sample the row index .I within each group and use the result to subset the dataset. This tends to be efficient on large data.
library(data.table)
i1 <- setDT(df)[, sample(.I, 3), position]$V1
df[i1]
Or use sample_n from dplyr (part of the tidyverse)
library(tidyverse)
df %>%
group_by(position) %>%
sample_n(3)
Or as a function
f1 <- function(data) {
data <- as.data.table(data)
i1 <- data[, sample(.I, 3), by = position]$V1
data[i1]
}

Add column to dataframe which is the sd of rnorm from previous columns

I have a data frame of two columns
set.seed(120)
df <- data.frame(m1 = runif(500,1,30),n1 = round(runif(500,10,25),0))
and I wish to add a third column that uses column n1 and m1 to generate a normal distribution and then to get the standard deviation of that normal distribution. I mean to use the values in each row of the columns n1 as the number of replicates (n) and m1 as the mean.
How can I write a function to do this? I have tried to use apply
stdev <- function(x,y) sd(rnorm(n1,m1))
df$Sim <- apply(df,1,stdev)
But this does not work. Any pointers would be much appreciated.
Many thanks,
Matt
Your data frame input looks like:
# > head(df)
# m1 n1
# 1 12.365323 15
# 2 4.654487 15
# 3 10.993779 24
# 4 24.069388 22
# 5 6.684450 18
# 6 15.056766 16
I mean to use the values in each row of the columns n1 and m1 as the number of replicates (n) and as the mean.
First, here is how to use apply:
apply(df, 1, function(x) sd(rnorm(n = x[2], mean = x[1])))
But a better way is to use mapply:
mapply(function(x,y) sd(rnorm(n = x, mean = y)), df$n1, df$m1)
apply is designed for matrix input; given a data frame, it first converts it to a matrix, which adds overhead.
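To store the result as the third column the question asks for, the mapply call can be assigned directly. A minimal sketch, reusing the question's data and its column name Sim; the argument names n and m are chosen here for readability.

```r
set.seed(120)
df <- data.frame(m1 = runif(500, 1, 30), n1 = round(runif(500, 10, 25), 0))

# One standard deviation per row: n1 gives the number of draws, m1 the mean
df$Sim <- mapply(function(n, m) sd(rnorm(n = n, mean = m)), df$n1, df$m1)
head(df)
```

Because rnorm defaults to sd = 1, the simulated values scatter around 1 whatever the mean is; pass an sd argument if a different spread is intended.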
Another option, using Map (with sapply so that the result is a vector):
sapply(Map(rnorm,n=df$n1,mean=df$m1),sd)

Using aggregate to get the mean of duplicate rows in a data.frame in r

I have a matrix B that is 10 rows x 2 columns:
B = matrix(c(1:20), nrow=10, ncol=2)
Some of the rows are technical duplicates, and they correspond to the same number in a list of length 20 (list1).
list1 = c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)
list1 = as.list(list1)
I would like to use this list (list1) to take the mean of any duplicate values for all columns in B such that I end up with a matrix or data.frame with 8 rows and 2 columns (all the duplicates are averaged).
Here is my code:
aggregate.data.frame(B, by=list1, FUN=mean)
And it generates this error:
Error in aggregate.data.frame(B, by = list1, FUN = mean) :
arguments must have same length
What am I doing wrong?
Thank you!
Your data have 2 variables (2 columns), each with 10 observations (10 rows). The function aggregate.data.frame expects the elements in the list to have the same length as the number of observations in your variables. You are getting an error because the vector in your list has 20 values, while you only have 10 observations per variable. For example, the following works because there is now 1 variable with 20 observations, and list1 holds a vector with 20 elements.
B <- 1:20
list1 <- list(B=c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8))
aggregate.data.frame(B, by=list1, FUN=mean)
The code will also work if you give it a matrix with 2 columns and 20 rows.
aggregate.data.frame(cbind(B,B), by=list1, FUN=mean)
I think this answer addresses why you are getting an error. However, I am not sure that it addresses what you are actually trying to do. How do you expect to end up with 8 rows and 2 columns? What exactly would the cells in that matrix represent?
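If the goal is to average duplicate rows of the 10-row matrix, the grouping vector needs exactly one entry per row. The sketch below shows the length requirement with a made-up grouping; grp is hypothetical and does not come from the question, so the 5 resulting rows are only an illustration.

```r
B <- matrix(1:20, nrow = 10, ncol = 2)

# Hypothetical grouping: one label per row of B (10 labels, 5 groups)
grp <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)

# The by list holds a single vector whose length equals nrow(B)
res <- aggregate(as.data.frame(B), by = list(group = grp), FUN = mean)
res
```

Each row of res is the column-wise mean of the rows of B that share a label.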

For-loop and storing results in an array in R

Let's say I have a data frame consisting of one variable (x)
df <- data.frame(x=c(1,2,3,3,5,6,7,8,9,9,4,4))
I want to know how many numbers are less than 2,3,4,5,6,7.
I know how to do this manually using
# This will tell you how many numbers in df less than 4
xnew <- length(df[ which(df$x < 4), ])
My question is: how can I automate this with a for loop or some other method? I also need to store the results in an array as follows
i length
2 1
3 2
4 4
5 6
6 7
7 8
Thanks
One way is to loop over the numbers 2:7 with sapply, test which elements of df$x are less than each number, and sum the result. cbind with the numbers then gives the matrix output:
res <- cbind(i=2:7, length=sapply(2:7, function(y) sum(df$x <y)))
Or you can vectorize: create a matrix in which each number in 2:7 fills one column, with as many rows as df, and compare df$x against it with <. The comparison is recycled down each column, and colSums then gives the counts.
length <- colSums(df$x <matrix(2:7, nrow=nrow(df), ncol=6, byrow=TRUE))
#or
#length <- colSums(df$x < `dim<-`(rep(2:7,each=nrow(df)),c(12,6)))
cbind(i=2:7, length=length)
num = c(2,3,4,5,6,7)
res = sapply(num, function(u) length(df$x[df$x < u]))
data.frame(number=num,
numberBelow=res)
A vectorized solution:
findInterval(2:7*(1-.Machine$double.eps),sort(df$x))
The .Machine$double.eps factor ensures that you count only the numbers strictly lower than each threshold, not those lower than or equal to it.
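Two further vectorized variants can be checked against the expected table from the question. This is a sketch only: the names thresholds, counts and counts_fi are introduced here, and the left.open argument of findInterval assumes R 3.3.0 or later.

```r
df <- data.frame(x = c(1, 2, 3, 3, 5, 6, 7, 8, 9, 9, 4, 4))
thresholds <- 2:7

# outer() builds a 12 x 6 logical matrix: element [i, j] is x[i] < thresholds[j]
counts <- colSums(outer(df$x, thresholds, "<"))

# left.open = TRUE makes findInterval() count values strictly below each threshold
counts_fi <- findInterval(thresholds, sort(df$x), left.open = TRUE)

data.frame(i = thresholds, length = counts)
```

Both variants avoid the multiplication by 1 - .Machine$double.eps, which some readers may find easier to follow.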

Sample with constraint, vectorized

What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries: how do I select 3 such that their sum does not exceed some value? I'm currently using a while loop, but that seems pretty inefficient.
You could face a combinatorial explosion. The following simulates choosing 3 employees (EEs) from a set of 20 whose salaries have a mean of 60 and an sd of 20. Enumerating all 1140 combinations shows that only 263 of them have a salary sum of less than 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
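The sequential procedure described above can be sketched as follows. The function name sample_constrained, its arguments, the max_tries cap and the id column are all hypothetical choices made for this illustration; only the sals column name follows the earlier example.

```r
# Hypothetical helper: draw k rows whose salaries sum to less than `limit`
sample_constrained <- function(data, k, limit, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    chosen <- integer(0)
    budget <- limit
    for (step in seq_len(k)) {
      # Rows not yet chosen whose salary still fits in the remaining budget
      candidates <- setdiff(which(data$sals < budget), chosen)
      if (length(candidates) == 0) break  # dead end: reject and restart
      pick <- candidates[sample.int(length(candidates), 1)]
      chosen <- c(chosen, pick)
      budget <- budget - data$sals[pick]
    }
    if (length(chosen) == k) return(data[chosen, ])
  }
  stop("no valid sample found within max_tries")
}

set.seed(123)
salry <- data.frame(id = 1:20, sals = rnorm(20, 60, 20))
picked <- sample_constrained(salry, k = 3, limit = 150)
sum(picked$sals)  # below 150 by construction
```

Indexing with sample.int avoids the surprise that sample(n, 1) draws from 1:n when given a single number. As noted above, the inclusion probabilities under this scheme are generally not uniform.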
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## simulate N salaries (a plain numeric vector)
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum
sx <- colSums(x)
sxc <- sx[sx<1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
