How to reshape a data frame from wide to long format in R?

I am new to R. I am trying to read data from Excel in the following format:
x1 x2 x3 y1 y2 y3 Result
 1  2  3  7  8  9
 4  5  6 10 11 12
and the data.frame in R should hold the data from the 1st row in this format:
x y
1 7
2 8
3 9
Then I want to use lm() and export the result to the Result column.
I want to automate this for n rows, i.e. once the result for the 1st row has been exported to Excel, I want to import the data for the second row.
Please help.

library(gdata)
# this spreadsheet is exactly as in your question
df.original <- read.xls("test.xlsx", sheet="Sheet1", perl="C:/strawberry/perl/bin/perl.exe")
#
#
> df.original
x1 x2 x3 y1 y2 y3
1 1 2 3 7 8 9
2 4 5 6 10 11 12
#
# for the above code you'll just need to change the 'perl' argument to the
# path of your own perl executable
#
# now the example for the first row
#
library(reshape2)
df <- melt(df.original[1,])
df$variable <- substr(df$variable, 1, 1)
df <- as.data.frame(lapply(split(df, df$variable), `[[`, 2))
> df
x y
1 1 7
2 2 8
3 3 9
Now, at this stage we have automated the import/transformation process (for one row).
First question: how do you want the data to look once every row has been processed?
Second question: what exactly do you want to put in Result? Residuals, fitted values? What do you need from lm()?
EDIT:
OK @kapil, tell me if the final shape of df is what you had in mind:
library(reshape2)
library(plyr)
df <- adply(df.original, 1, melt, .expand=F)
names(df)[1] <- "rowID"
df$variable <- substr(df$variable, 1, 1)
rows <- df$rowID[df$variable == "x"] # with y it would be the same (they are expected to have the same length)
df <- as.data.frame(lapply(split(df, df$variable), `[[`, c("value")))
df$rowID <- rows
df <- df[c("rowID", "x", "y")]
> df
rowID x y
1 1 1 7
2 1 2 8
3 1 3 9
4 2 4 10
5 2 5 11
6 2 6 12
Regarding the coefficients, you can calculate them for each rowID (which refers to the actual row in the xls file) in this way:
model <- dlply(df, .(rowID), function(z) lm(y ~ x, data = z))
> sapply(model, `[`, "coefficients")
$`1.coefficients`
(Intercept) x
6 1
$`2.coefficients`
(Intercept) x
6 1
So, for each group (i.e. each row of the original spreadsheet) you have, as expected, two coefficients, intercept and slope, so I can't figure out how you want the coefficients to fit inside the data.frame (especially in the 'long' shape shown just above). But if you want the data.frame to stay in 'wide' mode, you can try this:
# obtained the object model, you can put the coeff in the df.original data.frame
#
> ldply(model, `[[`, "coefficients")
rowID (Intercept) x
1 1 6 1
2 2 6 1
df.modified <- cbind(df.original, ldply(model, `[[`, "coefficients"))
> df.modified
x1 x2 x3 y1 y2 y3 rowID (Intercept) x
1 1 2 3 7 8 9 1 6 1
2 4 5 6 10 11 12 2 6 1
# of course, if you don't like it, you can remove rowID with df.modified$rowID <- NULL
Hope this helps, and let me know if you wanted the 'long' version of df.
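If you are on a recent tidyr/dplyr, here is a rough sketch of the same reshape plus per-row regression using pivot_longer (untested against your actual spreadsheet; it assumes the x1..y3 column names from the example above):
library(dplyr)
library(tidyr)
# reshape: one row per (rowID, set), with x and y as value columns
long <- df.original %>%
  mutate(rowID = row_number()) %>%
  pivot_longer(cols = -rowID,
               names_to = c(".value", "set"),
               names_pattern = "([xy])(\\d+)")
# fit one lm() per original spreadsheet row and collect intercept/slope
coefs <- long %>%
  group_by(rowID) %>%
  group_modify(function(d, key) {
    fit <- lm(y ~ x, data = d)
    data.frame(intercept = coef(fit)[[1]], slope = coef(fit)[[2]])
  }) %>%
  ungroup()
# attach the coefficients back to the wide data, as in df.modified above
left_join(mutate(df.original, rowID = row_number()), coefs, by = "rowID")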

Related

variable names in for loop

x_names <- c("x1","x2","x3")
data <- c(1,2,3,4)
fake <- c(2,3,4,5)
for (i in x_names) {
  x = fake
  data = as.data.frame(cbind(data, x))
  #data <- data %>% rename(x_names = x)
}
I made a toy example. This code generates a data frame with one column called data and three columns called x. Instead of calling the columns x, I want them named x1, x2, x3 (stored in x_names). I put x_names in the code (commented out), but it does not work. Could you help me with it?
We can also use map_dfc from tidyverse:
library(tidyverse)
cbind(data, map_dfc(x_names, ~ tibble(!!.x := fake)))
Output:
data x1 x2 x3
1 1 2 2 2
2 2 3 3 3
3 3 4 4 4
4 4 5 5 5
We can avoid the for loop and use replicate to repeat the fake data, using setNames to name the data frame columns with x_names.
cbind(data, setNames(data.frame(replicate(length(x_names), fake)), x_names))
# data x1 x2 x3
#1 1 2 2 2
#2 2 3 3 3
#3 3 4 4 4
#4 4 5 5 5
Ideally one should avoid growing objects in a loop; however, one way to solve the OP's problem in a loop is
for (i in seq_along(x_names)) {
  data = cbind.data.frame(data, fake)
  names(data)[i + 1] <- x_names[i]
}
An option is just to assign the 'fake' to create the new columns in base R
data[x_names] <- fake
data
# data x1 x2 x3
#1 1 2 2 2
#2 2 3 3 3
#3 3 4 4 4
#4 4 5 5 5
EDIT: Based on comments from @avid_useR, first convert data to a data.frame:
data <- data.frame(data)
When you replace your commented-out line
#data <- data %>% rename(x_names = x)
with
colnames(data)[ncol(data)] <- i
it should set the right column names.
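Putting the answers together, a minimal sketch of the corrected loop (assuming data is converted to a data.frame holding the original data column) could be:
x_names <- c("x1", "x2", "x3")
data <- data.frame(data = c(1, 2, 3, 4))
fake <- c(2, 3, 4, 5)
for (i in x_names) {
  data <- cbind(data, fake)        # add the new column
  colnames(data)[ncol(data)] <- i  # name it after the current element of x_names
}
data
#   data x1 x2 x3
# 1    1  2  2  2
# 2    2  3  3  3
# 3    3  4  4  4
# 4    4  5  5  5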

Subset using loop over data frame in R

I have a data frame which has 50 variables with values 1-5, but some of them contain values greater than 5, like 18656. I need to remove all these values from the data frame. Is there a function which can do this?
I am using this code
func <- function(df_likert, col){
  df_likert <- subset(df_likert, col <= 5)
}
for (i in names(df_likert)) {
  func(df_likert, i)
}
library(dplyr)
# example dataset
dt = data.frame(x1 = c(1,2,3,4,5),
                x2 = c(3,3,4,5,10),
                x3 = c(10,1,1,2,3))
# original dataset
dt
# x1 x2 x3
# 1 1 3 10
# 2 2 3 1
# 3 3 4 1
# 4 4 5 2
# 5 5 10 3
# update dataset
dt %>%
  mutate_all(function(x) ifelse(x > 5, NA, x)) %>%
  na.omit()
# x1 x2 x3
# 2 2 3 1
# 3 3 4 1
# 4 4 5 2
This solution removes all rows with values more than 5, as you mentioned. If you exclude the na.omit part you can just replace those values with NA instead of removing the whole row.
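If you prefer base R, a short equivalent sketch (for the same dt, assuming no NAs in the input) keeps only the rows where every column is at most 5:
dt[rowSums(dt > 5) == 0, ]
#   x1 x2 x3
# 2  2  3  1
# 3  3  4  1
# 4  4  5  2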

Remove elements of list based on condition during "loop"

I'm dealing with a very large list of large data frames (~2GB). To save space and reduce file size, I want to remove some elements of the list that are all NA. As part of the operation I need to gather and then bind into a single data.frame.
Here's an example:
library(tidyr)
library(dplyr)
a <- data.frame(x=rep(1,3), y1=1:3, y2=1:3)
b <- data.frame(x=rep(2,3), y1=NA, y2=NA)
c <- data.frame(x=rep(3,3), y1=1:3, y2=NA)
l <- list(a,b,c)
t <- lapply(l, function(x){
  gather(x, key="type", value="value", -x) # %>%
  #remove list element here %>%
  #do other operations like mutate here
}) %>%
  bind_rows
The result of this includes some data.frames that are all NA for my values of y.
I would like to remove those elements from the list completely. If I remove all rows with NA, it still leaves me with an empty list element, which then crashes further calculations with mutate or other operations.
I'm trying to take care of this operation with the first call to lapply because I find that doing filtering after that requires a lot of memory (often crashing after maxing out the 16GB I have on this computer). In the title when I say "list" I'm referring to this apply statement.
In this example the result should look like:
> t[-(7:12),]
x type value
1 1 y1 1
2 1 y1 2
3 1 y1 3
4 1 y2 1
5 1 y2 2
6 1 y2 3
13 3 y1 1
14 3 y1 2
15 3 y1 3
16 3 y2 NA
17 3 y2 NA
18 3 y2 NA
So, I'm not 100% sure I understood the question, but assuming I did, a possible answer would be:
t <- lapply(l, function(x){
  gather(x, key="type", value="value", -x) %>%
    subset(!sum(!is.na(value)) == 0)  # drop the whole element when every value is NA
}) %>%
  bind_rows
t
x type value
1 1 y1 1
2 1 y1 2
3 1 y1 3
4 1 y2 1
5 1 y2 2
6 1 y2 3
7 3 y1 1
8 3 y1 2
9 3 y1 3
10 3 y2 NA
11 3 y2 NA
12 3 y2 NA
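If memory is the main concern, a rough alternative sketch (assuming the same l, with y columns named y1 and y2 as in the example) is to drop the all-NA frames before gathering, so the empty elements never enter the pipeline:
library(dplyr)
library(tidyr)
# keep only the data frames where at least one y column has a non-NA value
keep <- vapply(l, function(d) any(!is.na(d[, c("y1", "y2")])), logical(1))
t2 <- lapply(l[keep], function(d) {
  gather(d, key = "type", value = "value", -x)
}) %>%
  bind_rows()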

How to split a list and save objects individually?

I am trying to add a new column to multiple data frames, and then replace the original data frame with the new one. This is how I am creating the new data frames:
df1 <- data.frame(X1=c(1,2,3),X2=c(1,2,3))
df2 <- data.frame(X1=c(4,5,6),X2=c(4,5,6))
groups <- list(df1,df2)
groups <- lapply(groups,function(x) cbind(x,X3=x[,1]+x[,2]))
groups
[[1]]
X1 X2 X3
1 1 1 2
2 2 2 4
3 3 3 6
[[2]]
X1 X2 X3
1 4 4 8
2 5 5 10
3 6 6 12
I'm satisfied with how the new data frames have been created. What I'm stuck on is then breaking up my groups list and then saving the list elements back into their respective original data frames.
Desired Output
Essentially, I want to do something like df1,df2 <- groups[[1]],groups[[2]] but that is of course not syntactically valid. I have more than 2 data frames, which is why I'm hoping for a more programmatic approach than simply typing out N lines of code.
for (i in 1:length(groups)){
  assign(paste("df", i, sep=""), as.data.frame(groups[[i]]))
}
should do it. Try it out, please.
@Rockbar led me to a general solution as well:
for (i in 1:length(groups)){
  assign(names(groups)[i], as.data.frame(groups[[i]]))
}
> df1
X1 X2 X3
1 1 1 2
2 2 2 4
3 3 3 6
> df2
X1 X2 X3
1 4 4 8
2 5 5 10
3 6 6 12
I should note that this only works if the objects in the list are all named. Thank you again @Rockbar for guiding me to this.
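A related idiom (not from the answers above, just a sketch) is to name the list elements and push them all into the global environment in one call with list2env:
# name the list elements after the data frames they came from, then export them
names(groups) <- paste0("df", seq_along(groups))
list2env(groups, envir = .GlobalEnv)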

Variable Length Core Name Identification

I have a data set with the following row-naming scheme:
a.X.V
where:
a is a fixed-length core ID
X is a variable-length string that subsets a, which means I should keep X
V is a variable-length ID which specifies the individual elements of a.X to be averaged
. is one of {-,_}
What I am trying to do is take column averages of all the a.X's. A sample:
sampleList <- list("a.12.1"=c(1,2,3,4,5), "b.1.23"=c(3,4,1,4,5), "a.12.21"=c(5,7,2,8,9), "b.1.555"=c(6,8,9,0,6))
sampleList
$a.12.1
[1] 1 2 3 4 5
$b.1.23
[1] 3 4 1 4 5
$a.12.21
[1] 5 7 2 8 9
$b.1.555
[1] 6 8 9 0 6
Currently I am manually gsubbing out the .Vs to get a list of the general a.X names:
sampleList <- t(as.data.frame(sampleList))
y <- rownames(sampleList)
y <- gsub("(\\w\\.\\d+)\\.\\d+", "\\1", y)
Is there a faster way to do this?
This is one half of 2 issues I've encountered in a workflow. The other half was answered here.
You can use a vector of patterns to find the locations of the columns you want to group. I included a pattern I knew wouldn't match anything in order to show that the solution is robust to that situation.
# A *named* vector of patterns you want to group by
patterns <- c(a.12="^a.12",b.12="^b.12",c.12="^c.12")
# Find the locations of those patterns in your list
inds <- lapply(patterns, grep, x=names(sampleList))
# Calculate the mean of each list element that matches the pattern
out <- lapply(inds, function(i)
  if(l <- length(i)) Reduce("+", sampleList[i])/l else NULL)
# Set the names of the output
names(out) <- names(patterns)
Perhaps you could consider messing with your data structure to make it easier to apply some standard tools:
sampleList <- list("a.12.1"=c(1,2,3,4,5),
                   "b.1.23"=c(3,4,1,4,5), "a.12.21"=c(5,7,2,8,9),
                   "b.1.555"=c(6,8,9,0,6))
library(reshape2)
m1 <- melt(do.call(cbind,sampleList))
m2 <- cbind(m1,colsplit(m1$Var2,"\\.",c("coreID","val1","val2")))
The result looks like this:
head(m2)
Var1 Var2 value coreID val1 val2
1 1 a.12.1 1 a 12 1
2 2 a.12.1 2 a 12 1
3 3 a.12.1 3 a 12 1
Then you can more easily do something like this:
aggregate(value ~ val1, mean, data = subset(m2, coreID == "a"))
R is well suited to this kind of task if you move to data.frames instead of lists. Make your 'a', 'X', and 'V' into their own columns. Then you can use ave, by, aggregate, subset, etc.
data.frame(do.call(rbind, sampleList),
           do.call(rbind, strsplit(names(sampleList), '\\.')))
# X1 X2 X3 X4 X5 X1.1 X2.1 X3.1
# a.12.1 1 2 3 4 5 a 12 1
# b.1.23 3 4 1 4 5 b 1 23
# a.12.21 5 7 2 8 9 a 12 21
# b.1.555 6 8 9 0 6 b 1 555
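For the column averages themselves, a small base R sketch (assuming sampleList is still the original list of 5-element vectors shown above) strips the trailing .V and averages by the remaining a.X prefix:
# one row per a.X.V vector
mat <- t(as.data.frame(sampleList))
# drop the trailing .V so rows sharing an a.X prefix fall into the same group
grp <- sub("^(\\w+\\.\\d+)\\.\\d+$", "\\1", rownames(mat))
# column-wise means per a.X group
rowsum(mat, grp) / as.vector(table(grp))
# a.12: 3.0 4.5 2.5 6.0 7.0
# b.1 : 4.5 6.0 5.0 2.0 5.5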
