Make data frame out of list - r

My problem is really simple: I have a data frame with 3 columns:
> head(subset_only_aster)
compound contrast sign_level
2 10 + 11 + 12 + 13 + 14-MeC30 Precocene.undeveloped - Acetone.undeveloped *
7 10 + 11 + 12 + 13 + 14-MeC30 Precocene.developed - Acetone.undeveloped **
From this I want to make a data frame where 'compound' gives the row names (there are 65 compounds altogether), the 'contrast' variable (a factor with 6 levels) gives the columns (6 columns), and 'sign_level' fills the cells.
I don't know where to begin, and I can't find the answer on the web either. Can anybody help?

Here is a base R solution:
# toy data: 3 compounds x 6 contrasts
dat <- expand.grid(compounds = letters[1:3], contrast = LETTERS[5:10])
dat[, "sgn"] <- sample(c("*", "**", "***"), nrow(dat), replace = TRUE)
# long -> wide: one row per compound, one column per contrast
reshape(dat, direction = "wide", idvar = "compounds", timevar = "contrast")
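Applied to the data in the question, this would be something like (a sketch, assuming the column names shown in the head() output):
wide <- reshape(subset_only_aster, direction = "wide",
                idvar = "compound", timevar = "contrast")
rownames(wide) <- wide$compound  # move the compounds into the row names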

You can use the spread() function in tidyr:
library(tidyr)
DF <- data.frame(compound = rep(LETTERS[1:2], 2),
                 contrast = c(rep(letters[1], 2), rep(letters[2], 2)),
                 signlevel = 1:4)
DF2 <- tidyr::spread(DF, contrast, signlevel)
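Note that spread() has since been superseded in tidyr; with the current pivot_wider() the equivalent call on the same toy data would be:
DF2 <- tidyr::pivot_wider(DF, names_from = contrast, values_from = signlevel)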

In R how to use an ifelse() with a vector or dataframe for classification [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right) (13 answers)
R ifelse statement (3 answers)
Closed 2 years ago.
I'm attempting to use a vector or a dataframe to classify the size of fish above or below a threshold size. If I want to use a standard threshold size, say "30" for all fish, I have no problems.
sp_data <- data.frame(
  X1 = c("fish1", "fish1", "fish2", "fish2", "fish3"),
  X2 = c(20, 30, 32, 21, 50)
)
sp_data$X3 <- ifelse(sp_data$X2 >= 30, "above", "below")
The above code works as intended; however, what I really want is to use a unique threshold for each fish. So I created this data frame where I can list each fish and its corresponding size threshold.
size_data <- data.frame(
  S1 = c("fish1", "fish2", "fish3"),
  S2 = c(25, 26, 30)
)
sp_data$X4 <- ifelse(sp_data$X1 == size_data$S1 &
                     size_data$S2 >= sp_data$X2, "above", "below")
This doesn't work, I think because it's comparing all of sp_data$X1 against size_data$S1 with recycling instead of matching row by row. Perhaps ifelse() is not the best way to solve this problem, but it's the closest I've found so far. I'm thinking I need a loop or an apply() to make this work, but I'm not sure where to go from here.
There is a dplyr way to do this.
library(dplyr)
sp_data %>%
  inner_join(size_data, by = c("X1" = "S1")) %>%
  mutate(X4 = case_when(X2 >= S2 ~ "above",
                        TRUE ~ "below")) %>%
  select(-S2)
X1 X2 X4
1 fish1 20 below
2 fish1 30 above
3 fish2 32 above
4 fish2 21 below
5 fish3 50 above
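Since there are only two outcomes here, dplyr::if_else() would do the same job as case_when():
mutate(X4 = if_else(X2 >= S2, "above", "below"))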
You could join the datasets together and then do the comparison. Here's the original data:
sp_data <- data.frame(
  X1 = c("fish1", "fish1", "fish2", "fish2", "fish3"),
  X2 = c(20, 30, 32, 21, 50)
)
Note that you've got to rename S1 to X1 in the size_data object so that the variable you're merging on has the same name in both datasets:
size_data <- data.frame(
  X1 = c("fish1", "fish2", "fish3"),
  S2 = c(25, 26, 30)
)
Then you can merge them with left_join() from dplyr:
library(dplyr)
sp_data <- left_join(sp_data, size_data)
Finally, you can make the calculation you want:
sp_data$X3 <- ifelse(sp_data$X2 > sp_data$S2, "above", "below")
> sp_data
X1 X2 S2 X3
1 fish1 20 25 below
2 fish1 30 25 above
3 fish2 32 26 above
4 fish2 21 26 below
5 fish3 50 30 above
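A base R alternative (a sketch, using the question's original size_data with columns S1 and S2): look up each row's threshold with match() and then compare the two vectors element by element:
# match() lines each fish up with its row in size_data
threshold <- size_data$S2[match(sp_data$X1, size_data$S1)]
sp_data$X4 <- ifelse(sp_data$X2 >= threshold, "above", "below")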

I want to calculate the maximum variance in sales by store type

I am a beginner in R. I have a data frame with 3000 observations and 11 columns. Six of the columns contain various sales figures, and I want to measure the variance in each sales column across store_Type:
table(s1$store_Type)
Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
          242              1226               200               350
I am not sure how to start on this problem.
To calculate variance on a data set you can use var(). To calculate by columns, use apply(). For example:
# create fake data
set.seed(123)  # for reproducibility
dat <- as.data.frame(matrix(runif(15, 100, 200), ncol = 3, nrow = 5))
colnames(dat) <- c("Store 1", "Store 2", "Store 3")
# generate variance by column
var.dat <- apply(dat, MARGIN = 2, FUN = var)
var.dat
Store 1 Store 2 Store 3
866.3951 914.2388 978.7129
I used tapply() to find the variance of each sales column by store type, then summed those variances for each store type to get the total. I thought of grouping by store type but couldn't get the result that way. Even though I got the answer, doing it this way felt mechanical; there must be a better way of doing this!
v1 = tapply(store_train$sales1, store_train$store_Type, var)
v2 = tapply(store_train$sales2, store_train$store_Type, var)
v3 = tapply(store_train$sales3, store_train$store_Type, var)
v4 = tapply(store_train$sales4, store_train$store_Type, var)
v1[1] + v2[1] + v3[1] + v4[1]
v1[2] + v2[2] + v3[2] + v4[2]
v1[3] + v2[3] + v3[3] + v4[3]
v1[4] + v2[4] + v3[4] + v4[4]
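A more compact route (a sketch, assuming the sales columns are named sales1 to sales4 as above) is to let aggregate() compute all per-column variances in one call and then sum across columns:
v <- aggregate(cbind(sales1, sales2, sales3, sales4) ~ store_Type,
               data = store_train, FUN = var)
setNames(rowSums(v[, -1]), v$store_Type)  # total variance per store type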

cbind 1:nrows of same ID variable value to original data.frame

I have a large data frame where an id variable (first column) recurs with different values in the second column. My idea is to order the data frame, split it into a list, and then lapply() a function that cbinds the sequence 1:nrow to each group. My code so far:
DF <- DF[order(DF[, 1]), ]
DF <- split(DF, DF[, 1])
DF <- lapply(1:length(DF), function(i) cbind(DF[[i]], 1:length(DF[[i]])))
But this gives me an error: arguments imply different number of rows.
Can you elaborate?
> head(DF, n=50)
cell area
1 1 121.2130
2 2 81.3555
3 3 81.5862
4 4 83.6345
...
33 1 121.3270
34 2 80.7832
35 3 81.1816
36 4 83.3340
DF <- DF[order(DF$cell),]
What I want is:
> head(DF, n=50)
cell area counter
1 1 121.213 1
33 1 121.327 2
65 1 122.171 3
97 1 122.913 4
129 1 123.697 5
161 1 124.474 6
...and so on.
This is my code:
cell.areas.t <- function(file) {
  dat <- paste(file)
  DF <- read.table(dat, col.names = c("cell", "area"))
  DF <- splitstackshape::getanID(DF, "cell")[]  # thanks to akrun's answer
  ggplot2::ggplot(data = DF, aes(x = .id, y = area, color = cell)) +
    geom_line(aes(group = cell)) + geom_point(size = 0.1)
}
The resulting plot (not shown) indicates that most cells increase in area and only some decrease. This is only a first try at visualizing my data, so what you can't see very well is that the areas periodically drop due to cell division.
Additional question:
There is a problem I didn't take into account beforehand: after a cell division, a new cell is added to the data.frame and receives the initial index 1 (you can see in the plot that all cells start from .id = 1, not at their creation time), which is not what I want; the new cell needs to inherit the index of its creation time. The first thing that comes to my mind is a parsing mechanism that does this job for a newly added cell variable:
DF$.id[DF$cell != temporary.cellindex] <- max(DF$.id[DF$cell != temporary.cellindex])
Do you have a better idea? Thanks.
There is a boundary condition which may ease the problem: a fixed number of cells (32) at the beginning. Another solution would be to cut away all data before the last daughter cell is created.
Update: Additional question solved, here's the code:
cell.areas.t <- function(file) {
  dat <- paste(file)
  DF <- read.table(dat, col.names = c("cell", "area"))
  DF$.id <- c(0, cumsum(diff(DF$cell) < 0)) + 1L  # index grows by 1 each time the cell numbers wrap around
  title <- getwd()
  myplot <- ggplot2::ggplot(data = DF, aes(x = .id, y = area, color = factor(cell))) +
    geom_line(aes(group = cell)) + geom_point(size = 0.1) +
    theme(legend.position = "none") + ggtitle(title)
  # save the plot
  ggsave(file = "cell_areas_time.svg", plot = myplot, width = 10, height = 8)
}
We can use getanID from splitstackshape
library(splitstackshape)
getanID(DF, "cell")[]
There's a much easier method to accomplish that goal. Use ave() with seq.int(); note that ave() expects a vector as its first argument, not a whole data frame:
DF$group_seq <- ave(DF$cell, DF$cell, FUN = function(x) seq.int(length(x)))
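For completeness, the lapply() approach in the question failed because length() on a data frame returns the number of columns, not the number of rows; using nrow() (and reassembling the list afterwards) makes it work:
DFL <- split(DF, DF$cell)
DFL <- lapply(DFL, function(d) cbind(d, counter = seq_len(nrow(d))))
DF2 <- do.call(rbind, DFL)  # back to a single data frame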

R: running t-test on one column of a dataframe versus all others [duplicate]

This question already has an answer here:
t-test in R between individuals columns and the rest of a given dataframe (1 answer)
Closed 9 years ago.
I have a dataframe of the basic form:
> head(raw.data)
NAC cOF3 APir Pu Tu V2.3 mOF3 DGpf
1 6.314770 6.181188 6.708971 6.052134 6.546938 6.079848 6.640716 6.263770
2 8.825595 8.740217 9.532026 8.919598 8.776969 8.843287 8.631505 9.053732
3 5.518933 5.982044 5.632379 5.712680 5.655525 5.580141 5.750969 6.119935
4 6.063098 6.700194 6.255736 5.124315 6.133631 5.891009 6.070467 6.062815
5 8.931570 9.048621 9.258875 8.681762 8.680993 9.040971 8.785271 9.122226
6 5.694149 5.356218 5.608698 5.894171 5.629965 5.759247 5.929289 6.092337
I would like to perform t-tests of each column versus all other columns and save the subsequent p-values to a variable in some variation of the following:
# run tests
test.result = mapply(t.test, one.column, other.columns)
# store p-values
p.values = stack(mapply(function(x, y) t.test(x, y)$p.value,
                        one.column, other.columns))
Or would aov() be a better option for such an analysis? In any case, I would like to know how to streamline doing it using t-tests.
Here's one solution:
Read in the data:
dat <- read.table(text='NAC cOF3 APir Pu Tu V2.3 mOF3 DGpf
1 6.314770 6.181188 6.708971 6.052134 6.546938 6.079848 6.640716 6.263770
2 8.825595 8.740217 9.532026 8.919598 8.776969 8.843287 8.631505 9.053732
3 5.518933 5.982044 5.632379 5.712680 5.655525 5.580141 5.750969 6.119935
4 6.063098 6.700194 6.255736 5.124315 6.133631 5.891009 6.070467 6.062815
5 8.931570 9.048621 9.258875 8.681762 8.680993 9.040971 8.785271 9.122226
6 5.694149 5.356218 5.608698 5.894171 5.629965 5.759247 5.929289 6.092337')
Get all possible pairwise combinations:
com <- combn(colnames(dat), 2)
Get the p-values
p <- apply(com, 2, function(x) t.test(dat[,x[1]], dat[,x[2]])$p.val)
Put into a data frame:
data.frame(comparison = paste(com[1,], com[2,], sep = ' vs. '), p.value = p)
An even better solution is to use melt() from the reshape package and pairwise.t.test():
library(reshape)
with(melt(dat), pairwise.t.test(value, variable, p.adjust.method = 'none'))
If you want to pair just the first column with all other columns, you can also use this:
x <- sapply(dat[,-1], function(x) t.test(x, dat[,1])$p.value)
data.frame(variable = names(x), p.value = as.numeric(x))
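Whichever variant you use, keep in mind that running many t-tests inflates the chance of false positives; the p-values can be corrected afterwards with p.adjust(), for example:
p.adjust(p, method = "holm")  # p is the vector of pairwise p-values from above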

Reshaping a complex dataset from long to wide using recast()

I am working with a dataset that comes with lme4, and am trying to learn how to apply reshape2 to convert it from long to wide [full code at the end of the post].
library(lme4)
data("VerbAgg") # load the dataset
The dataset has 9 variables; 'Anger', 'Gender', and 'id' don't vary with 'item', while 'resp', 'btype', 'situ', 'mode', and 'r2' do.
I have successfully been able to convert the dataset from long to wide format using reshape():
wide <- reshape(VerbAgg, timevar = c("item"),
                idvar = c("id", "Gender", "Anger"), dir = "wide")
This yields 316 observations of 123 variables and appears to be correctly transformed. However, I have had no success using reshape/reshape2 to reproduce the wide data frame.
wide2 <- recast(VerbAgg, id + Gender + Anger ~ item + variable)
Using Gender, item, resp, id, btype, situ, mode, r2 as id variables
Error: Casting formula contains variables not found in molten data: Anger
I may not be 100% clear on how recast() defines id variables, but I am very confused about why it does not see "Anger". Similarly,
wide3 <- recast(VerbAgg, id + Gender + Anger ~ item + variable,
id.var = c("id", "Gender", "Anger"))
Error: Casting formula contains variables not found in molten data: item
Can anyone see what I am doing wrong? I would love to obtain a better understanding of melt/cast!
Full code:
## load the lme4 package
library(lme4)
data("VerbAgg")
head(VerbAgg)
names(VerbAgg)
# Using base reshape()
wide <- reshape(VerbAgg, timevar = c("item"),
                idvar = c("id", "Gender", "Anger"), dir = "wide")
# Using recast
library(reshape2)
wide2 <- recast(VerbAgg, id + Gender + Anger ~ item + variable)
wide3 <- recast(VerbAgg, id + Gender + Anger ~ item + variable,
id.var = c("id", "Gender", "Anger"))
# Using melt/cast
m <- melt(VerbAgg, id=c("id", "Gender", "Anger"))
wide <- dcast(m, id + Gender + Anger ~ ...)
# Message: Aggregation requires fun.aggregate: length used as default
# Yields a list object with a length of 8?
m <- melt(VerbAgg, id=c("id", "Gender", "Anger"), measure.vars = c(4,6,7,8,9))
wide <- dcast(m, id ~ variable)
# Yields a data frame object with 6 variables.
I think the following code does what you want. The recast() calls failed because melt()'s default guessing treats the numeric Anger column as a measure variable rather than an id variable (hence the message 'Using Gender, item, resp, id, btype, situ, mode, r2 as id variables'), so Anger is melted into the value column and is no longer available to the casting formula; the same happens to item when you restrict id.var to the other three. Specifying the id and measure variables explicitly in melt() avoids this.
library(lme4)
data("VerbAgg")
# Using base reshape()
wide <- reshape(VerbAgg, timevar = c("item"),
                idvar = c("id", "Gender", "Anger"), dir = "wide")
dim(wide)  # 316 123
# Using melt/dcast with explicit id and measure variables
require(reshape2)
m1 <- melt(VerbAgg, id = c("id", "Gender", "Anger", "item"),
           measure = c("resp", "btype", "situ", "mode", "r2"))
wide4 <- dcast(m1, id + Gender + Anger ~ item + variable)
dim(wide4)  # 316 123
R> wide[1:5,1:6]
Anger Gender id resp.S1WantCurse btype.S1WantCurse situ.S1WantCurse
1 20 M 1 no curse other
2 11 M 2 no curse other
3 17 F 3 perhaps curse other
4 21 F 4 perhaps curse other
5 17 F 5 perhaps curse other
R> wide4[1:5,1:6]
id Gender Anger S1WantCurse_resp S1WantCurse_btype S1WantCurse_situ
1 1 M 20 no curse other
2 2 M 11 no curse other
3 3 F 17 perhaps curse other
4 4 F 21 perhaps curse other
5 5 F 17 perhaps curse other
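For what it's worth, the same wide layout can also be produced with the newer tidyr; a sketch (note that pivot_wider() pastes the names in the opposite order, giving columns like resp_S1WantCurse):
library(tidyr)
wide5 <- pivot_wider(VerbAgg, id_cols = c(id, Gender, Anger),
                     names_from = item,
                     values_from = c(resp, btype, situ, mode, r2))
dim(wide5)  # 316 123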
