I have a list of dataframes:
set.seed(23)
date_list = seq(1:30)
testframe = data.frame(Date = date_list)
testframe$ABC = rnorm(30)
testframe$DEF = rnorm(30)
testframe$GHI = seq(from = 10, to = 25, length.out = 30)
testframe$JKL = seq(from = 5, to = 45, length.out = 30)
testlist = list(testframe, testframe, testframe)
names(testlist) = c("df1464", "df6355", "df94566")
I now want to extract the name of each dataframe and append it to that dataframe's column names. So the column names of the first dataframe in the list should be: Date_df1464, ABC_df1464, DEF_df1464, GHI_df1464 and JKL_df1464.
I created this loop, but it's not working:
for (a in names(testlist)) {
for(i in 1: length(testlist)){
allcolnames = colnames(testlist[[i]])
allcolnames = paste(allcolnames, a , sep = "_")
testlist[[i]] = colnames(allcolnames)
}
}
I get this error:
Error in testlist[[i]] : subscript out of bounds
I am pretty clueless as to why it doesn't work. Any ideas?
Two ways to accomplish this. The better, more encapsulated way would be to use Map, looping over the individual data frames and their corresponding names:
new.testlist <- Map(function(df, name) {
names(df) <- paste(names(df), name, sep = '_')
return(df)
}, testlist, names(testlist))
> str(new.testlist)
List of 3
$ df1464 :'data.frame': 30 obs. of 5 variables:
..$ Date_df1464: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
..$ ABC_df1464 : num [1:30] 0.193 -0.435 0.913 1.793 0.997 ...
..$ DEF_df1464 : num [1:30] -0.5532 0.0982 -1.1467 -1.2499 -0.2021 ...
..$ GHI_df1464 : num [1:30] 10 10.5 11 11.6 12.1 ...
..$ JKL_df1464 : num [1:30] 5 6.38 7.76 9.14 10.52 ...
$ df6355 :'data.frame': 30 obs. of 5 variables:
..$ Date_df6355: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
..$ ABC_df6355 : num [1:30] 0.193 -0.435 0.913 1.793 0.997 ...
..$ DEF_df6355 : num [1:30] -0.5532 0.0982 -1.1467 -1.2499 -0.2021 ...
..$ GHI_df6355 : num [1:30] 10 10.5 11 11.6 12.1 ...
..$ JKL_df6355 : num [1:30] 5 6.38 7.76 9.14 10.52 ...
$ df94566:'data.frame': 30 obs. of 5 variables:
..$ Date_df94566: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
..$ ABC_df94566 : num [1:30] 0.193 -0.435 0.913 1.793 0.997 ...
..$ DEF_df94566 : num [1:30] -0.5532 0.0982 -1.1467 -1.2499 -0.2021 ...
..$ GHI_df94566 : num [1:30] 10 10.5 11 11.6 12.1 ...
..$ JKL_df94566 : num [1:30] 5 6.38 7.76 9.14 10.52 ...
The riskier way would be to use the super assignment operator to loop over the names, trusting that testlist remains reliable in your global environment. Note that this second method changes the column names in testlist as a side effect, and is generally NOT considered good practice. Max Teflon's answer is somewhat similar, in that it relies on testlist existing in the global environment, without passing it explicitly to the modifying function.
sapply(names(testlist), function(x) {
names(testlist[[x]]) <<- paste(names(testlist[[x]]), x, sep = '_')
})
You could chain two Map calls: the inner Map builds the new names, and the outer Map assigns them to the names of each data frame in the list.
testlist <- Map(`names<-`, testlist,
Map(paste, lapply(testlist, names), names(testlist), sep="_"))
Result
lapply(testlist, names)
# $df1464
# [1] "Date_df1464" "ABC_df1464" "DEF_df1464" "GHI_df1464" "JKL_df1464"
#
# $df6355
# [1] "Date_df6355" "ABC_df6355" "DEF_df6355" "GHI_df6355" "JKL_df6355"
#
# $df94566
# [1] "Date_df94566" "ABC_df94566" "DEF_df94566" "GHI_df94566" "JKL_df94566"
Your solution was nearly right; you just do not need to loop twice.
Your colnames call was also the wrong way around: colnames(...) has to be on the left-hand side of the assignment.
This should work:
for(i in 1: length(testlist)){
allcolnames = colnames(testlist[[i]])
allcolnames = paste(allcolnames, names(testlist)[i] , sep = "_")
colnames(testlist[[i]]) = allcolnames
}
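As for why the original loop ended with "subscript out of bounds": calling colnames on a plain character vector returns NULL, and assigning NULL to a list element with [[ removes that element, so the list shrinks until index i no longer exists. A minimal sketch of that mechanism, using a made-up list x rather than your data:

```r
# Hypothetical toy list, only to illustrate the mechanism
x <- list(a = 1, b = 2, c = 3)

# colnames() of a character vector (which has no dim attribute) is NULL
new_names <- colnames(c("Date_df1464", "ABC_df1464"))
is.null(new_names)

# Assigning NULL with [[<- deletes the element instead of storing NULL
x[[1]] <- new_names
length(x)   # one element fewer than before
names(x)
```

With the double loop, this deletion is repeated until the list is too short for the current i.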
This also works, without any fors ;):
set.seed(23)
date_list = seq(1:30)
testframe = data.frame(Date = date_list)
testframe$ABC = rnorm(30)
testframe$DEF = rnorm(30)
testframe$GHI = seq(from = 10, to = 25, length.out = 30)
testframe$JKL = seq(from = 5, to = 45, length.out = 30)
testlist = list(testframe, testframe, testframe)
names(testlist) = c("df1464", "df6355", "df94566")
out <- lapply(names(testlist),function(name){
dummy <- testlist[[name]]
names(dummy) <- paste0(names(testlist[[name]]) ,'_',name)
dummy
})
str(out)
#> List of 3
#> $ :'data.frame': 30 obs. of 5 variables:
#> ..$ Date_df1464: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ ABC_df1464 : num [1:30] 0.193 -0.435 0.913 1.793 0.997 ...
#> ..$ DEF_df1464 : num [1:30] -0.5532 0.0982 -1.1467 -1.2499 -0.2021 ...
#> ..$ GHI_df1464 : num [1:30] 10 10.5 11 11.6 12.1 ...
#> ..$ JKL_df1464 : num [1:30] 5 6.38 7.76 9.14 10.52 ...
#> $ :'data.frame': 30 obs. of 5 variables:
#> ..$ Date_df6355: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ ABC_df6355 : num [1:30] 0.193 -0.435 0.913 1.793 0.997 ...
#> ..$ DEF_df6355 : num [1:30] -0.5532 0.0982 -1.1467 -1.2499 -0.2021 ...
#> ..$ GHI_df6355 : num [1:30] 10 10.5 11 11.6 12.1 ...
#> ..$ JKL_df6355 : num [1:30] 5 6.38 7.76 9.14 10.52 ...
#> $ :'data.frame': 30 obs. of 5 variables:
#> ..$ Date_df94566: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ ABC_df94566 : num [1:30] 0.193 -0.435 0.913 1.793 0.997 ...
#> ..$ DEF_df94566 : num [1:30] -0.5532 0.0982 -1.1467 -1.2499 -0.2021 ...
#> ..$ GHI_df94566 : num [1:30] 10 10.5 11 11.6 12.1 ...
#> ..$ JKL_df94566 : num [1:30] 5 6.38 7.76 9.14 10.52 ...
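Note that out has lost the list names, because lapply over a character vector does not name its result. If the names matter, one possible variation is sapply with simplify = FALSE, which uses the character vector itself as names. This is only a sketch on a smaller, made-up testlist:

```r
# Smaller, made-up stand-in for testlist
testlist <- list(
  df1464 = data.frame(Date = 1:3, ABC = c(0.1, 0.2, 0.3)),
  df6355 = data.frame(Date = 1:3, ABC = c(1.1, 1.2, 1.3))
)

# simplify = FALSE keeps a list; USE.NAMES = TRUE names it after the input
out <- sapply(names(testlist), function(name) {
  dummy <- testlist[[name]]
  names(dummy) <- paste0(names(dummy), "_", name)
  dummy
}, simplify = FALSE, USE.NAMES = TRUE)

names(out)
lapply(out, names)
```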
I want to reshape my data from wide to long format so that I can use ggplot to create graphs. I am having some problems arranging the data properly. So far I start my process with a list of 27 dataframes (just showing you the first 10):
> str(NDVI_stat)
List of 27
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 1 mean: num [1:10] 0.1796 0.3105 0.1422 0.0937 0.1711 ...
..$ NDVI 1 sd : num [1:10] 0.1117 0.05845 0.00743 0.02754 0.01506 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 2 mean: num [1:10] 0.0819 0.5954 0.1328 0.0953 0.1492 ...
..$ NDVI 2 sd : num [1:10] 0.00872 0.10508 0.00863 0.01878 0.02303 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 3 mean: num [1:10] 0.0634 0.681 0.2108 0.0151 0.179 ...
..$ NDVI 3 sd : num [1:10] 0.0344 0.076 0.0361 0.0638 0.0428 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 4 mean: num [1:10] 0.0971 0.6885 0.2326 0.1157 0.3219 ...
..$ NDVI 4 sd : num [1:10] 0.00991 0.07509 0.02054 0.02793 0.0303 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 5 mean: num [1:10] 0.0817 0.4825 0.2754 0.1003 0.4155 ...
..$ NDVI 5 sd : num [1:10] 0.00998 0.05034 0.02781 0.03248 0.04056 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 6 mean: num [1:10] 0.1119 0.7667 0.582 0.0997 0.4426 ...
..$ NDVI 6 sd : num [1:10] 0.023 0.0672 0.0649 0.0331 0.0557 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 7 mean: num [1:10] 0.1997 0.6567 0.5111 0.0988 0.3307 ...
..$ NDVI 7 sd : num [1:10] 0.0671 0.0756 0.0435 0.0288 0.0457 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 8 mean: num [1:10] 0.3626 0.7356 0.6304 0.0954 0.335 ...
..$ NDVI 8 sd : num [1:10] 0.1454 0.0888 0.0502 0.0298 0.038 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 9 mean: num [1:10] 0.541 0.748 0.637 0.089 0.577 ...
..$ NDVI 9 sd : num [1:10] 0.0968 0.0721 0.0396 0.0276 0.0656 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 10 mean: num [1:10] 0.6691 0.4377 0.6713 0.0942 0.6827 ...
..$ NDVI 10 sd : num [1:10] 0.088 0.0698 0.033 0.0316 0.0688 ...
$ :'data.frame': 10 obs. of 2 variables:
I am using rbindlist from the data.table package to merge everything into a single dataframe
newdf<-rbindlist(NDVI_stat, use.names = TRUE, fill = TRUE)
The code runs without errors, but it does not create the structure I really need. The output is a dataframe with 270 observations (27 dataframes * 10 rows in each one) and 54 variables (27 dataframes * 2 columns in each one).
image of newdf
As you can see in the image newdf it is creating 270 rows but what I want to obtain is 10 rows (so avoid the NA values)
Any help on that?
This question is similar to this one
Plot dataframe with ggplot2 - R
The difference is that I changed the way I produce my input, and now I don't know how to arrange the dataframe properly to later use
NDVIdf_forplot <- gather(NDVIdf, key = statistic, value = value, -ID)
and then use ggplot to create my graph
Any help on that?
I think you're asking how to column-bind the data frames. As far as I'm aware, data.table doesn't have a cbindlist function, so you could try do.call("cbind", NDVI_stat), though that's not quite the same and will fail if the data frames don't all have the same number of rows.
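A small sketch of that do.call approach, on a made-up two-element stand-in for NDVI_stat (the values and the name NDVI_stat_demo are invented for illustration), with an explicit check that all data frames have the same number of rows:

```r
# Made-up stand-in for NDVI_stat: two data frames with two rows each
NDVI_stat_demo <- list(
  data.frame(`NDVI 1 mean` = c(0.18, 0.31), `NDVI 1 sd` = c(0.11, 0.06),
             check.names = FALSE),
  data.frame(`NDVI 2 mean` = c(0.08, 0.60), `NDVI 2 sd` = c(0.01, 0.11),
             check.names = FALSE)
)

# Guard: cbind only makes sense if every data frame has the same row count
n_rows <- vapply(NDVI_stat_demo, nrow, integer(1))
stopifnot(length(unique(n_rows)) == 1)

# One row per observation, one column per statistic
wide <- do.call(cbind, NDVI_stat_demo)
dim(wide)
names(wide)
```

On the real list this should give 10 rows and 54 columns rather than 270 rows padded with NA.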
The problem is that the variable names are different in each df of the list. Once that is solved, the rest is as you imagine it to be.
An example with dplyr/tidyr:
df1<-data.frame(mean1=c(2,3),
sd1 = c(1,2))
df2<-data.frame(mean2=c(4,5),
sd2 = c(3,4))
listdf<-list(df1,df2)
str(listdf)
Gives
List of 2
$ :'data.frame': 2 obs. of 2 variables:
..$ mean1: num [1:2] 2 3
..$ sd1 : num [1:2] 1 2
$ :'data.frame': 2 obs. of 2 variables:
..$ mean2: num [1:2] 4 5
..$ sd2 : num [1:2] 3 4
To rename all data frames and bind them together row by row
library(tidyverse)
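# rename_() is deprecated in current dplyr; setNames() gives every
# data frame the same two column names before the rows are bound
listdf %>%
  map(function(x) setNames(x, c("mean", "sd"))) %>%
  bind_rows()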
gives
mean sd
2 1
3 2
4 3
5 4
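For plotting you will probably also want to know which data frame each row came from. A base R sketch of the same idea that adds such a column (the column name index is my own choice, not something from your data):

```r
df1 <- data.frame(mean1 = c(2, 3), sd1 = c(1, 2))
df2 <- data.frame(mean2 = c(4, 5), sd2 = c(3, 4))
listdf <- list(df1, df2)

# Rename each data frame, tag it with its position in the list, then stack
long <- do.call(rbind, lapply(seq_along(listdf), function(i) {
  x <- setNames(listdf[[i]], c("mean", "sd"))
  cbind(index = i, x)   # 'index' is a hypothetical column name
}))
long
```

The index column can then serve as the x variable or grouping variable in ggplot.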
I am running a simple one-way ANOVA across multiple groups within a single data frame.
Dataframe available here: https://www.dropbox.com/s/6nsjk4l1pgiwal3/cut1.csv?dl=0
>download.file('https://www.dropbox.com/s/6nsjk4l1pgiwal3/cut1.csv?raw=1', destfile = "cut1.csv", method = "auto")
> data <- read.csv("cut1.csv")
> cut1 <- data %>% mutate(Plot = as.factor(Plot), Block = as.factor(Block), Cut = as.factor(Cut))
> str(cut1)
'data.frame': 160 obs. of 6 variables:
$ Plot : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Block : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 2 3 3 ...
$ Treatment : Factor w/ 4 levels "AN","C","IU",..: 4 2 3 1 1 3 4 2 3 1 ...
$ Cut : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Measurement: Factor w/ 10 levels "ADF","Ash","Crude_Protein",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Value : num 956 965 961 963 955 ...
I used some code from this SO question to enable the aov function to be applied to every level of the Measurement factor:
anova_1<- sapply(unique(as.character(cut1$Measurement)),
function(meas)aov(Value~Treatment+Block,cut1,subset=(Measurement==meas)),
simplify=FALSE,USE.NAMES=TRUE)
summary_1 <- lapply(anova_1, summary)
I can look manually through summary_1 but ideally what I would like to do is extract the p values for each level of the Measurement factor into a dataframe which I could then filter so that I only see which ones are <0.5. I would then run TukeyHSD on these.
summary_1 looks like this (only first 2 lists shown):
> str(summary_1)
List of 10
$ Dry_matter :List of 1
..$ :Classes ‘anova’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ Df : num [1:3] 3 3 9
.. ..$ Sum Sq : num [1:3] 359 167 612
.. ..$ Mean Sq: num [1:3] 119.8 55.5 68
.. ..$ F value: num [1:3] 1.761 0.816 NA
.. ..$ Pr(>F) : num [1:3] 0.224 0.517 NA
..- attr(*, "class")= chr [1:2] "summary.aov" "listof"
$ Crude_Protein:List of 1
..$ :Classes ‘anova’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ Df : num [1:3] 3 3 9
.. ..$ Sum Sq : num [1:3] 306 721 1606
.. ..$ Mean Sq: num [1:3] 102 240 178
.. ..$ F value: num [1:3] 0.572 1.347 NA
.. ..$ Pr(>F) : num [1:3] 0.647 0.319 NA
..- attr(*, "class")= chr [1:2] "summary.aov" "listof"
I can extract the p value from one of the lists in summary_1 like this:
> summary_1$OAH[[1]][,5][1]
[1] 0.4734992
However, I don't know how to extract the p values from all the nested lists and place them in a dataframe.
Much obliged for any help.
You can use the package broom in combination with dplyr to apply Anova by Measurement, and assign the output to a data.frame in a tidy format.
library(broom)
library(dplyr)
summaries <- cut1 %>% group_by(Measurement) %>%
do(tidy(aov(Value ~ Treatment + Block, data = .)))
head(summaries)
# Measurement term df sumsq meansq statistic p.value
# (fctr) (chr) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 ADF Treatment 3 41.416875 13.805625 3.097871 0.07138437
#2 ADF Block 1 8.001125 8.001125 1.795388 0.20729351
#3 ADF Residuals 11 49.021375 4.456489 NA NA
#4 Ash Treatment 3 38.511875 12.837292 1.051787 0.40840601
#5 Ash Block 1 34.980125 34.980125 2.865998 0.11856463
#6 Ash Residuals 11 134.257375 12.205216 NA NA
Here's a solution in vanilla R:
# you can shorten your example -- download.file not necessary
# (dplyr is still needed here for %>% and mutate)
library(dplyr)
cut1 <- read.csv('https://www.dropbox.com/s/6nsjk4l1pgiwal3/cut1.csv?raw=1') %>%
mutate(Plot = as.factor(Plot), Block = as.factor(Block), Cut = as.factor(Cut))
# split-apply-combine strategy
do.call(rbind, lapply(split(cut1,cut1$Measurement),
function(x) with(x, summary(aov(Value ~ Treatment + Block)))[[1]]
)
)
returns:
Df Sum Sq Mean Sq F value Pr(>F)
ADF.Treatment 3 41.42 13.81 6.7088 0.01133 *
ADF.Block 3 38.50 12.83 6.2366 0.01405 *
ADF.Residuals 9 18.52 2.06
Ash.Treatment 3 38.51 12.84 0.9162 0.47115
Ash.Block 3 43.13 14.38 1.0261 0.42602
Ash.Residuals 9 126.11 14.01
Crude_Protein.Treatment 3 306.42 102.14 0.5723 0.64733
Crude_Protein.Block 3 721.42 240.47 1.3473 0.31946
Crude_Protein.Residuals 9 1606.39 178.49
D.Treatment 3 9.47 3.16 4.5530 0.03331 *
D.Block 3 7.57 2.52 3.6383 0.05751 .
D.Residuals 9 6.24 0.69
Dry_matter.Treatment 3 359.39 119.80 1.7609 0.22432
Dry_matter.Block 3 166.62 55.54 0.8164 0.51656
Dry_matter.Residuals 9 612.27 68.03
ME.Treatment 3 0.24 0.08 4.5530 0.03331 *
ME.Block 3 0.19 0.06 3.6383 0.05751 .
ME.Residuals 9 0.16 0.02
NCGD.Treatment 3 2777.57 925.86 4.5530 0.03331 *
NCGD.Block 3 2219.55 739.85 3.6383 0.05751 .
NCGD.Residuals 9 1830.17 203.35
NDF.Treatment 3 355.91 118.64 6.8809 0.01050 *
NDF.Block 3 336.70 112.23 6.5095 0.01239 *
NDF.Residuals 9 155.17 17.24
OAH.Treatment 3 1.41 0.47 0.9108 0.47350
OAH.Block 3 1.37 0.46 0.8850 0.48488
OAH.Residuals 9 4.65 0.52
Sugar.Treatment 3 86.18 28.73 5.0212 0.02577 *
Sugar.Block 3 51.64 17.21 3.0085 0.08720 .
Sugar.Residuals 9 51.49 5.72
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
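To get from there to the filterable data frame of p values asked for in the question, something along these lines might work. It is only a sketch: demo is simulated data with the same column names as cut1, the treatment labels T1 to T4 are placeholders, and p_treatment is a column name I made up.

```r
# Simulated stand-in for cut1 (random values, so the p values mean nothing)
set.seed(1)
demo <- expand.grid(Treatment   = factor(paste0("T", 1:4)),
                    Block       = factor(1:4),
                    Measurement = c("ADF", "Ash"))
demo$Value <- rnorm(nrow(demo))

# One Treatment p value per Measurement level
pvals <- sapply(split(demo, demo$Measurement), function(d) {
  tab <- summary(aov(Value ~ Treatment + Block, data = d))[[1]]
  tab[["Pr(>F)"]][1]   # first row of the table is the Treatment term
})

p_df <- data.frame(Measurement = names(pvals), p_treatment = unname(pvals))
p_df

# Keep only the measurements below the chosen threshold
subset(p_df, p_treatment < 0.05)
```

The row is picked by position because the row names of the summary table are padded with spaces, which makes matching on "Treatment" by name fragile.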
My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.
> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Could the standardising work by doing this? I want to standardise the columns 8,9,10,11 and 12 but I think I have the wrong code.
mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))
Thanks in advance
Here is one option to standardize
mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x)
You can use the dplyr package to do this:
mydata2%>%mutate_if(is.numeric,scale)
Here are some options to consider, although it is answered late:
# Working environment and Memory management
rm(list = ls(all.names = TRUE))
gc()
memory.limit(size = 64935)  # Windows only; a no-op stub since R 4.2.0
# Set working directory
setwd("path")  # replace "path" with your own directory
# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
"Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
"Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
"Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
"Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
"Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91))
Let us check the structure of df:
str(df)
'data.frame': 10 obs. of 6 variables:
$ Age : num 21 19 25 34 45 63 39 28 50 39
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num 2138 1516 2213 2500 2660 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num 172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num 60 70 88 48 71 51 65 44 53 91
We see that Age, Salary, Height and Weight are numeric and Name and Gender are categorical (factor variables).
Let us scale just the numeric variables using only base R:
1) Option: (slight modification of what akrun has proposed here)
start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
(x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1
Time difference of 0.02717805 secs
str(df1)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
2) Option: (akrun's approach)
start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2
Time difference of 0.02599907 secs
str(df2)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
3) Option:
start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3
str(df3)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
4) Option (using tidyverse and invoking dplyr):
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4
Time difference of 0.012043 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
Which option to pick depends on the output structure you need and on speed. Suppose your data is unbalanced, you want to balance it, and you then want to run a classification on the scaled numeric variables. In that case the one-column matrix structure that scale leaves on the numeric variables (Age, Salary, Height and Weight) will cause problems. I mean,
str(df4$Age)
num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
- attr(*, "scaled:center")= num 36.3
- attr(*, "scaled:scale")= num 13.8
Since, for example, the ROSE package (which balances data) doesn't accept column types other than int, factor and num, it will throw an error.
To avoid this issue, the numeric variables after scaling can be saved as vectors instead of a column matrix by:
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, ~scale (.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4
with
Time difference of 0.01400399 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
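The same as.vector trick also works for the base R variant in option 3, if you would rather not load the tidyverse. A minimal sketch on a cut-down version of df:

```r
# Cut-down version of the example data frame
df <- data.frame(Age    = c(21, 19, 25, 34),
                 Gender = factor(c("Female", "Female", "Male", "Female")),
                 Height = c(172, 166, 191, 169))

indices <- sapply(df, is.numeric)
df3 <- df
# as.vector() drops the matrix dimensions and the scaled:* attributes
df3[indices] <- lapply(df3[indices], function(x) as.vector(scale(x)))
str(df3)
```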
I have a list of 28 data frames inside a list, and I am trying to add another variable called ID to each individual data frame. I found this post, Dataframes in a list; adding a new variable with name of dataframe, to be very helpful. But when I tried its code, it didn't work in my case. I think it's because my list doesn't have clear labels ([1], [2], [3], etc.) that the code can recognize.
all$id <- rep(names(mylist), sapply(mylist, nrow))
>List of 1
$ :List of 28
..$ :'data.frame': 271 obs. of 12 variables:
.. ..$ Sample_ID : Factor w/ 271 levels "MC25",..: 19 27 2
.. ..$ Reported_Analyte : Factor w/ 10 levels "2-Butoxyethanol",..: 7 7 7
.. ..$ Date_Collected : Factor w/ 71 levels "2010-05-08","2010-05-09",..: 8 9 1
.. ..$ Result2 : num [1:271] 0.11 0.11 0.11 0.11
..$ :'data.frame': 6 obs. of 12 variables:
.. ..$ Sample_ID : Factor w/ 271 levels "MC25",..: 19 27 2
.. ..$ Reported_Analyte : Factor w/ 10 levels "2-Butoxyethanol",..: 7 7 7
.. ..$ Date_Collected : Factor w/ 71 levels "2010-05-08","2010-05-09",..: 8 9 1
.. ..$ Result2 : num [1:271] 0.11 0.11 0.11 0.11
It really isn't very clear what you want to achieve (the post you linked to was about collapsing over the list of data frames and adding into the collapsed version an ID variable indicating which original data frame each row in the collapsed data frame came from).
I see a complication with your data; you have a list of 28 data frames within a list. You can see that in the output from str() that is given in your Q. You can see this better with this example data set (here all the data frames are the same but that is just for expedience)
set.seed(42)
dat <- data.frame(Sample_ID = factor(sample(10)),
Reported_Analyte = factor(sample(LETTERS, 10)),
Date_Collected = Sys.Date() + 0:9,
Result2 = rnorm(10))
mylist <- list(lapply(1:28, function(x) dat))
If we look at mylist using str() we see the nature of the complication I mentioned:
R> str(mylist, max = 2)
List of 1
$ :List of 28
..$ Data_frame_ 1 :'data.frame': 10 obs. of 4 variables:
..$ Data_frame_ 2 :'data.frame': 10 obs. of 4 variables:
..$ Data_frame_ 3 :'data.frame': 10 obs. of 4 variables:
..$ Data_frame_ 4 :'data.frame': 10 obs. of 4 variables:
..$ Data_frame_ 5 :'data.frame': 10 obs. of 4 variables:
..$ Data_frame_ 6 :'data.frame': 10 obs. of 4 variables:
..$ Data_frame_ 7 :'data.frame': 10 obs. of 4 variables:
....<etc>
The post you linked to started from the equivalent of the list inside your outer list, and that list had named components. If you don't need the outer list, it is perhaps best to throw it away at this stage:
mylist2 <- mylist[[1]]
## the `[[` are important as we want the 1st component *inside* the list
## using `[` would just give us a list within a list again.
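If the difference between the two operators is unclear, here is a tiny sketch on a made-up nested list (outer, a and b are illustrative names only):

```r
# Made-up list within a list
outer <- list(list(a = 1:3, b = letters[1:2]))

inner_single <- outer[1]    # still a list of length 1 wrapping the inner list
inner_double <- outer[[1]]  # the inner list itself

length(inner_single)
length(inner_double)
names(inner_double)
```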
Names can then be added to this list
names(mylist2) <- paste("Data_frame_", seq_along(mylist2), sep = "")
which would result in
R> str(mylist2)
List of 28
$ Data_frame_1 :'data.frame': 10 obs. of 4 variables:
..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
$ Data_frame_2 :'data.frame': 10 obs. of 4 variables:
..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
....<etc>
Notice the List of 1 is no longer reported.
If the list of data frames within a list is important to you (not sure why it would be, but OK), then you can assign the names to the [[1]]st component directly.
names(mylist[[1]]) <- paste("Data_frame_", seq_along(mylist[[1]]), sep = "")
(Notice I'm using the original mylist and on both occasions I index that list with [[1]].)
The result is similar to the above though the list within a list structure is retained:
R> str(mylist)
List of 1
$ :List of 28
..$ Data_frame_1 :'data.frame': 10 obs. of 4 variables:
.. ..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
.. ..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
.. ..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
.. ..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
..$ Data_frame_2 :'data.frame': 10 obs. of 4 variables:
.. ..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
.. ..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
.. ..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
.. ..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
....<etc>
If you now wish to proceed with collapsing the individual data frames into a single data frame, but retaining the information about which data frame they came from, we would do this for mylist2:
all2 <- do.call("rbind", mylist2)
all2 <- transform(all2, id = rep(names(mylist2), sapply(mylist2, nrow)))
rownames(all2) <- seq_len(nrow(all2)) ## reset rownames for compactness
which gives:
R> head(all2)
Sample_ID Reported_Analyte Date_Collected Result2 id
1 10 L 2012-05-02 1.3048697 Data_frame_1
2 9 R 2012-05-03 2.2866454 Data_frame_1
3 3 W 2012-05-04 -1.3888607 Data_frame_1
4 6 F 2012-05-05 -0.2787888 Data_frame_1
5 4 K 2012-05-06 -0.1333213 Data_frame_1
6 8 T 2012-05-07 0.6359504 Data_frame_1
For mylist we use something very similar, but just index into mylist using [[1]]:
all1 <- do.call("rbind", mylist[[1]])
all1 <- transform(all1, id = rep(names(mylist[[1]]), sapply(mylist[[1]], nrow)))
rownames(all1) <- seq_len(nrow(all1)) ## reset rownames for compactness
R> head(all1)
Sample_ID Reported_Analyte Date_Collected Result2 id
1 10 L 2012-05-02 1.3048697 Data_frame_1
2 9 R 2012-05-03 2.2866454 Data_frame_1
3 3 W 2012-05-04 -1.3888607 Data_frame_1
4 6 F 2012-05-05 -0.2787888 Data_frame_1
5 4 K 2012-05-06 -0.1333213 Data_frame_1
6 8 T 2012-05-07 0.6359504 Data_frame_1
As you can see, repeatedly having to refer to your list of data frames as mylist[[1]] is a pain if you don't need the outer list.
Update:
If you don't want to collapse the list into a single data frame, see @Andrie's answer, but modify it to read:
ml2 <- ml
ml2[[1]] <- lapply(seq_along(ml[[1]]), function(x)cbind(ml[[1]][[x]], id=x))
so you account for the list within list structure.
I answer this using a constructed example of a list with samples from mtcars.
First, create a list of data frames. Do this by sampling 10 rows from mtcars for each element of the list:
ml <- lapply(1:3, function(x)mtcars[sample(1:32, 10), 1:3])
So, now you have an unnamed list of 3 data frames. Next you want to add an id column. The trick is to use lapply over a sequence of list items using seq_along(ml), and then to cbind your id to each data frame:
ml2 <- lapply(seq_along(ml), function(x)cbind(ml[[x]], id=x))
The results are what you required:
str(ml2)
List of 3
$ :'data.frame': 10 obs. of 4 variables:
..$ mpg : num [1:10] 15 24.4 26 15.8 22.8 21 32.4 17.3 17.8 30.4
..$ cyl : num [1:10] 8 4 4 8 4 6 4 8 6 4
..$ disp: num [1:10] 301 147 120 351 108 ...
..$ id : int [1:10] 1 1 1 1 1 1 1 1 1 1
$ :'data.frame': 10 obs. of 4 variables:
..$ mpg : num [1:10] 33.9 19.2 24.4 10.4 30.4 22.8 16.4 21.4 15.5 21.5
..$ cyl : num [1:10] 4 6 4 8 4 4 8 6 8 4
..$ disp: num [1:10] 71.1 167.6 146.7 460 75.7 ...
..$ id : int [1:10] 2 2 2 2 2 2 2 2 2 2
$ :'data.frame': 10 obs. of 4 variables:
..$ mpg : num [1:10] 15.5 21 13.3 21.5 21.4 30.4 21 18.1 30.4 15.2
..$ cyl : num [1:10] 8 6 8 4 4 4 6 6 4 8
..$ disp: num [1:10] 318 160 350 120 121 ...
..$ id : int [1:10] 3 3 3 3 3 3 3 3 3 3
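If your list does have names and you would rather carry those than the position, the same cbind idea can be written with Map. A sketch, assuming a named list (the names first and second are invented here):

```r
# Named list of two small data frames taken from mtcars
ml <- list(first  = mtcars[1:3, 1:3],
           second = mtcars[4:6, 1:3])

# Pair each data frame with its own name and bind the name on as id
ml2 <- Map(function(df, nm) cbind(df, id = nm), ml, names(ml))
str(ml2)
```

Map keeps the names of the list, so ml2 can still be indexed by name afterwards.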