Reversing a melting operation with reshape2 [duplicate]

This question already has an answer here:
Simpler way to reconstitute a melted data frame back to the original
(1 answer)
Closed 9 years ago.
Consider the following code.
library(reshape2)
x <- rnorm(20)
y <- x + rnorm(20, sd = 0.01)
dfr <- data.frame(x, y)
mlt <- melt(dfr)
When I try to reverse this operation with dcast,
dcast(mlt, value ~ variable)
I instead get a data frame with three columns (not suitable for scatter-plotting, for instance).
How can I reconstruct the original data frame with dcast?

How could R know the ordering that existed before the melt, i.e. that row one of x matches up with row one of y?
If you add an index column (since R will complain about duplicated row.names) you can do this operation simply:
dfr$idx <- seq_along(dfr$x)
mlt <- melt(dfr, id.var='idx')
dcast(mlt, idx ~ variable, value.var='value')
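As a quick check (reusing the objects from the snippet above), dropping the helper column again should recover the original data frame:
out <- dcast(mlt, idx ~ variable, value.var = 'value')
out$idx <- NULL                      # drop the helper column again
all.equal(out, dfr[, c('x', 'y')])   # should be TRUE: values are unchanged
plot(out$x, out$y)                   # scatter-plotting works as before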

apply a function to some columns of a data frame, while storing the result in the original data frame [duplicate]

This question already has answers here:
Coerce multiple columns to factors at once
(11 answers)
Closed 3 years ago.
I have a data frame, where I would like to render some of the columns as factor (at the moment they are numeric).
For example:
dd = data_frame(x = c(0, 0, 0, 1, 1, 1), y = c(1, 2, 3, 4, 5, 6))
I would like to make only the first column a factor:
lapply(dd[,1], as.factor)
But the result is a list (of a factor), and is not saved back to the original data frame.
Is there a way to achieve this?
We can use
library(dplyr)
dd <- dd %>%
  mutate(x = factor(x))
Or for multiple columns
nm1 <- names(dd)[1:2]
dd <- dd %>%
  mutate_at(vars(nm1), factor)
In the OP's code, the issue is that lapply() loops over the first column and returns a list rather than replacing the column in the data frame. Instead, we just need
dd[,1] <- factor(dd[,1])
Or
dd[[1]] <- factor(dd[[1]])
NOTE: For a single column, we don't need lapply at all.
If we want to apply to multiple columns
dd[nm1] <- lapply(dd[nm1], factor)
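If you are on dplyr 1.0 or later (an assumption about your setup), across() is the current replacement for the superseded mutate_at(); a minimal sketch with the same nm1 vector:
library(dplyr)
dd <- dd %>%
  mutate(across(all_of(nm1), factor))  # convert every column named in nm1 to factor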

Creating Subset data frames in R within For loop [duplicate]

This question already has answers here:
Split a large dataframe into a list of data frames based on common value in column
(3 answers)
Closed 4 years ago.
What I am trying to do is filter a larger data frame into 78 unique data frames based on the value of the first column in the larger data frame. The only way I can think of doing it properly is by applying the filter() function inside a for() loop:
library(dplyr)

for (i in 1:nrow(plantline)) {
  x1 <- filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])
}
The issue is I don't know how to create a new data frame, say x2, x3, x4... every time the loop runs.
Can someone tell me if that is possible or if I should be trying to do this some other way?
There must be many duplicates of this question.
split(plantline, plantline$Plant_Line)
will create a list of data.frames.
However, depending on your use case, splitting the large data.frame into pieces might not be necessary as grouping can be used.
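For instance, a small dplyr sketch of that idea; some_value is a hypothetical measurement column standing in for whatever you actually compute per plant line:
library(dplyr)
# summarise every plant line at once instead of creating 78 separate data frames
rawdta.df %>%
  group_by(Plant_Line) %>%
  summarise(n_rows = n(),
            mean_value = mean(some_value, na.rm = TRUE))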
You could use split -
# creates a list of data frames, one for each unique value of the
# first column in the larger data frame
lst <- split(large_data_frame, large_data_frame$first_column)

# takes the data frames out of the list into the global environment;
# this is not recommended, since 78 separate data frames are hard to work with
list2env(lst, envir = .GlobalEnv)
The names of the data frames will be the same as the values in the first column.
It would be easier if we could see the data frames, but I propose something nevertheless. You can create a list of data frames:
dataframes <- vector("list", nrow(plantline))
for (i in 1:nrow(plantline)) {
  dataframes[[i]] <- filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])
}
You can use assign():
for (i in 1:nrow(plantline)) {
  assign(paste0("x", i), filter(rawdta.df, Plant_Line == plantline$Plant_Line[i]))
}
Alternatively, you can save your results in a list:
X <- list()
for (i in 1:nrow(plantline)) {
  X[[i]] <- filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])
}
This would be easier with sample data. by() would be my favorite:
d <- data.frame(plantline = rep(LETTERS[1:3], 4),
                x = 1:12,
                stringsAsFactors = FALSE)
l <- by(d, d$plantline, data.frame)
print(l$A)
print(l$B)
Solution using plyr:
ma <- cbind(x = 1:10, y = (-4:5)^2, z = 1:2)
ma <- as.data.frame(ma)
library(plyr)
dlply(ma, "z")  # split ma by the column named z
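plyr has since been retired; if you prefer current tidyverse tools, dplyr's group_split() gives a comparable list of data frames (a sketch assuming a recent dplyr version):
library(dplyr)
ma %>%
  group_by(z) %>%
  group_split()  # list of data frames, one per value of z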

Alternative to FOR Loop for below [duplicate]

This question already has answers here:
Group Data in R for consecutive rows
(3 answers)
Closed 6 years ago.
I have written a for loop that takes a group of 5 rows from a data frame and passes it to a function; the function then returns a single row after doing some operations on those 5 rows. Below is the code:
start <- 1          # first row of the current block of 5
final_data <- NULL  # collects one row per block
for (i in 1:nrow(features_data1)) {
  if (i - start == 4) {
    group <- as.data.frame(features_data1[start:i, ])
    start <- i + 1
    sub_data <- feature_calculation(group)
    final_data <- rbind(final_data, sub_data)
  }
}
Can anyone suggest an alternative, as the for loop is taking a lot of time? The feature_calculation function is large.
Try this for a base R approach:
# convert features to data frame in advance so we only have to do this once
features_df <- as.data.frame(features_data1)
# assign each observation (row) to a group of 5 rows and split the data frame into a list of data frames
group_assignments <- as.factor(rep(1:ceiling(nrow(features_df) / 5), each = 5, length.out = nrow(features_df)))
groups <- split(features_df, group_assignments)
# apply your function to each group individually (i.e. to each element in the list)
sub_data <- lapply(X = groups, FUN = feature_calculation)
# bind your list of data frames into a single data frame
final_data <- do.call(rbind, sub_data)
You might be able to use the purrr and dplyr packages for a speed-up. The latter has a function bind_rows() that is much quicker than do.call(rbind, list_of_data_frames) if the result is likely to be very large.
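A sketch of that variant, reusing the groups list from above and assuming feature_calculation() returns one data frame per group:
library(dplyr)
library(purrr)
final_data <- groups %>%
  map(feature_calculation) %>%  # apply the function to each group
  bind_rows()                   # stack the per-group results into one data frame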

Selecting data based on different values in the same column and unique value in another column in R [duplicate]

This question already has answers here:
Select groups based on number of unique / distinct values
(4 answers)
Closed 6 years ago.
Sorry if the title is confusing; I wasn't sure how to describe this problem. I have a data frame with one column for sampling site (of which I have many) and one column for sampling method (of which there are only two). Here's a simplified version:
site <- c("X", "Y", "X","Z")
method <- c("A", "B", "B", "A")
data <- data.frame(site, method)
data
site method
1 X A
2 Y B
3 X B
4 Z A
Now some sites got sampled using both sampling method A and method B, and some got sampled by only method A or method B.
I am trying to select only those sites that got sampled using both methods. For example, the output for this data would look like this:
site method
1 X A
2 X B
I don't have sample code because I honestly do not know how to do this. Please help!
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(data); then, grouped by 'site', if the number of unique values of 'method' is greater than 1, return the subset of the data.table (.SD).
library(data.table)
setDT(data)[, if (uniqueN(method) > 1) .SD, by = site]
Or we can do it with dplyr:
library(dplyr)
data %>%
  group_by(site) %>%
  filter(n_distinct(method) > 1)
A possible base R option would be
data[as.logical(with(data, ave(as.character(method), site,
                               FUN = function(x) length(unique(x)) > 1))), ]

R: How to include lm residual back into the data.frame? [duplicate]

This question already has answers here:
Aligning Data frame with missing values
(4 answers)
Closed 6 years ago.
I am trying to put the residuals from lm back into the original data.frame:
fit <- lm(y ~ x, data = mydata, weights = ind)
mydata$resid <- fit$resid
The second line would normally work if the residuals had the same length as the number of rows of mydata. However, in my case, some of the elements of ind are NA, so the residual vector is usually shorter than the number of rows. Also, fit$resid is just a numeric vector, so there is no label for me to merge it back with the mydata data frame. Is there an elegant way to achieve this?
I think it should be pretty easy if ind is just a vector.
sel <- which(!is.na(ind))
mydata$resid <- NA
mydata$resid[sel] <- fit$resid
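An alternative sketch, assuming the missing values really do come only from ind: fit with na.action = na.exclude, so that residuals() is padded with NA for the dropped rows and lines up with mydata again.
fit <- lm(y ~ x, data = mydata, weights = ind, na.action = na.exclude)
mydata$resid <- residuals(fit)  # NA where ind was NA, fitted residuals elsewhere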
