How to re-arrange a data.frame in R

I am interested in re-arranging a data.frame in R. Bear with me as I stumble through a reproducible example.
I have a nominal variable which can take one of two values. Currently this nominal variable is a column. Instead, I would like two columns, one for each value the nominal variable can take. Here is an example data frame; s is the nominal variable with values t and c.
n <- c(1,1,2,2,3,3,4,4)
s <- c("t","c","t","c","t","c","t","c")
b <- c(11,23,6,5,12,16,41,3)
mydata <- data.frame(n, s, b)
I would rather have a data frame that looks like this:
n.n <- c(1,2,3,4)
trt <- c(11,6,23,41)
cnt <- c(23,5,16,3)
new.data <- data.frame(n.n, trt, cnt)
I am sure there is a way to do this with mutate or possibly tidyr, but I am not sure what the best route is, and the data frame I would like to re-arrange is quite large.

You want spread:
library(dplyr)
library(tidyr)
new.data <- mydata %>% spread(s,b)
  n  c  t
1 1 23 11
2 2  5  6
3 3 16 12
4 4  3 41
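In current tidyr, spread() has been superseded by pivot_wider(). A minimal sketch of the same reshape, with an extra rename() only to match the trt/cnt column names from the question:
library(dplyr)
library(tidyr)
new.data <- mydata %>%
  pivot_wider(names_from = s, values_from = b) %>%
  rename(trt = t, cnt = c)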

How about unstack(mydata, b~s):
   c  t
1 23 11
2  5  6
3 16 12
4  3 41
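unstack() drops the grouping column; a sketch (assuming the rows are ordered by n within each level of s, as in the example) that adds it back to match the desired n.n column:
new.data <- data.frame(n.n = unique(mydata$n), unstack(mydata, b ~ s))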

Related

Create a function for making an index out of different indicators across different datasets

I'm working with the European Social Survey, and have a different data frame for each country. All of these data frames are identical except for the values of each variable. What I would like to do is create a new variable in each dataset that is equal to the sum of several other variables. Is there a way to write a function that does this for every data frame?
What I have done before is simply creating a new column with:
Data$new <- Data$old1 + Data$old2...etc.
However, when working with several variables over several datasets this seems rather inefficient, and I'm quite sure that there must be an easier way. I just don't know what to google.
Example:
I have two dataframes, A and B:
A1 <- c(1,2,3,4,5)
A2 <- c(6,7,8,9,10)
A <- data.frame(A1, A2)
B1 <- c(10,12,13,15,24)
B2 <- c(23,24,25,45,65)
B <- data.frame(B1, B2)
What I want to do is for each dataframe create a new column which is equal to the sum of the other two. Usually I would do that like this
A$A3 <- A$A1 + A$A2
B$B3 <- B$B1 + B$B2
However, doing this across several data frames with a large number of variables seems like an inefficient way of doing it. Since the names of the variables are the same across the data frames, is there a way to make a function that looks for said variables and creates the new one in a better way?
We can create a helper auto_add:
auto_add <- function(df, col_a, col_b) {
  df$total <- rowSums(df[c(col_a, col_b)])
  df
}
auto_add(A, "A1", "A2")
For many data sets and if the target columns are known, we could do:
auto_add <- function(df, target_cols) {
  df$total <- rowSums(df[c(target_cols)])
  df
}
lapply(list(A, B), auto_add, target_cols = 1:2)
Result:
[[1]]
  A1 A2 total
1  1  6     7
2  2  7     9
3  3  8    11
4  4  9    13
5  5 10    15

[[2]]
  B1 B2 total
1 10 23    33
2 12 24    36
3 13 25    38
4 15 45    60
5 24 65    89
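If you want A and B updated in place rather than collected in a list, a sketch (using a named list and list2env; not part of the original answer):
results <- lapply(list(A = A, B = B), auto_add, target_cols = 1:2)
list2env(results, envir = .GlobalEnv)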
An option with map/dplyr:
library(tidyverse)
map(mget(c("A", "B")), ~ .x %>%
      mutate(Total = reduce(., `+`)))

reshape a matrix in R, converting every n rows into one row

I have a data frame df like:
year location
1 A
2 B
3 C
------------------
1 X
5 A
10 F
------------------
3 F
5 x
2 y
I would like to reshape it to
year_1 location_1 year_2 location_2 year_3 location_3
1 A 2 B 3 F
3 C 1 X 5 X
5 A 10 F 2 Y
I can do a hack: concatenate the first two columns, and do
d <- matrix(df, nrow = 70, byrow = FALSE)
But then later I have to split the concatenated columns again; is there a neat way of doing this?
How about splitting and recombining:
wide.df <- Reduce(cbind, split(df, cumsum(rep(c(1, 0, 0), nrow(df) / 3))))
This has the advantage over coercing to a matrix and back to a data frame that it has no difficulty with factors or characters messing up the column classes. Using a matrix as an intermediate would first lose all the factor levels, and if you had both characters and factors you would end up with a really confusing mess.
You might need to fiddle with the column names a bit to get the exact result, and I'd be happy to help with that if you posted a copy-pasteable [MCVE].
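For the column names, a sketch (assuming two columns per block and the year_1/location_1 naming from the question; not part of the original answer):
blocks <- ncol(wide.df) / 2
names(wide.df) <- paste(rep(c("year", "location"), times = blocks),
                        rep(seq_len(blocks), each = 2),
                        sep = "_")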

R: Sum each column by subgroup

I have a matrix of plant species occurrence data. The matrix is set up so that every column is a species and every row is a sampling location. I also have identifiers that group sampling locations based on certain environmental variables. I would like to create column sums for each species, subgrouped by the specific environmental variables.
An example data set:
library(vegan)
data("dune")
data("dune.env")
dune$plot <- c(1:20); dune.env$plot <- c(1:20)
merge(dune, dune.env)
So there are now 20 plots, with 30 species observed, and 5 associated environmental variables. I would like to generate the sum of the number of individuals observed per species, grouped by "Management". I have tried something like this:
library(tidyverse)
sums <- group_by(data, data$Management) %>% colSums(data[,(2:31)], na.rm = TRUE)
but I always get an error about incorrect dims. I am not sure how I would go about solving my problem. Ideally, the result would be a dataframe with 4 rows (1 for each management type) where all the species (cols 2:31) have been summed.
rowsum does what you need:
dat <- merge(dune, dune.env)
> rowsum(dat[,2:31], dat$Management)
   Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu ...
BF        7        0        0        2        4        5        8        0 ...
HF        6        7        0        8        9        2        4        0 ...
NM        2       13        5        0        8        2        0        0 ...
SF        1       28        0       26        0        4        3        1 ...
use data.table:
require(data.table)
a <- merge(dune, dune.env)
setDT(a)
a[, lapply(.SD, sum), by = Management, .SDcols = names(a)[2:31]]
Well, I was doing something very similar a few days ago:
How to obtain species richness and abundance for sites with multiple samples using dplyr
To modify the excellent answer given by @akrun:
df <- merge(dune, dune.env)
library(dplyr)
df2 <- df %>%
  group_by(Management) %>%
  summarise_at(.vars = vars(Achimill:Callcusp), .funs = sum)
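In current dplyr, summarise_at() is superseded by across(); a sketch of the same per-Management totals:
df2 <- df %>%
  group_by(Management) %>%
  summarise(across(Achimill:Callcusp, sum))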

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is take a random subset of the columns of a specific size, keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
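To make the group size configurable, a sketch wrapping the same split/Map idea in a helper (the name sample_cols is mine; sample.int avoids sample()'s special behaviour when a group contains a single index):
sample_cols <- function(df, n_per_group) {
  # column indices grouped by (duplicated) column name
  idx <- split(seq_along(df), names(df))
  # sample up to n_per_group indices from each group
  keep <- unlist(Map(function(i, k) i[sample.int(length(i), k)],
                     idx, pmin(n_per_group, lengths(idx))))
  df[, keep]
}
sample_cols(dframe, 7)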
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)),
                      function(x) {
                        cols <- which(colnames(dframe) == x)
                        if (length(cols) <= nc) dframe[, cols]
                        else dframe[, sample(cols, nc, replace = FALSE)]
                      }))
It might look complicated, but it really just takes all of a group's columns if there are nc or fewer of them, and samples nc random columns if there are more.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+', '', colnames(res))

sample function in R

I have just started learning R using RStudio and I have, perhaps, some basic questions.
One of them regards the "sample" function.
More specifically, my dataset consists of 402224 observations of 147 variables. My task is to take a sample of 50 observations and then produce a dataframe and so on.
But when the function sample is executed
y = sample(mydata, 50, replace = TRUE, prob = NULL)
the result is a dataset with 402224 observations of 50 variables. That is, the sampling is done on the variables and not the observations.
Do you have any idea why this happens?
Thank you in advance.
If you want to create a data frame of 50 observations with replacement from your data frame, you can try:
mydata[sample(nrow(mydata), 50, replace=TRUE), ]
Alternatively, you can use the sample_n function from the dplyr package:
sample_n(mydata, 50)
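In current dplyr, sample_n() is superseded by slice_sample(); a sketch that, with replace = TRUE, matches the base R line above:
library(dplyr)
slice_sample(mydata, n = 50, replace = TRUE)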
The other answers people have been giving are to select rows, but it looks like you are after columns. You can still accomplish this in a similar way.
Here's a sample df.
df = data.frame(a = 1:5, b = 6:10, c = 11:15)
> df
a b c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
Then, to randomly select 2 columns and all observations we could do this
> df[ , sample(1:ncol(df), 2)]
c a
1 11 1
2 12 2
3 13 3
4 14 4
5 15 5
So, what you'll want to do is something like this
y = mydata[ , sample(1:ncol(mydata), 50)]
That is because sample() samples the elements of its first argument, and the elements of a data frame are its columns, not its rows.
Try the following:
library(data.table)
set.seed(10)
df_sample <- data.table(mydata)
df_sample[sample(.N, 50)]
