Arrange dataframe format for ggplot - R

I want to reshape my data from wide to long format so that I can use ggplot to create graphs. I am having some problems arranging the data properly. I start with a list of 27 dataframes (showing just the first 10):
> str(NDVI_stat)
List of 27
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 1 mean: num [1:10] 0.1796 0.3105 0.1422 0.0937 0.1711 ...
..$ NDVI 1 sd : num [1:10] 0.1117 0.05845 0.00743 0.02754 0.01506 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 2 mean: num [1:10] 0.0819 0.5954 0.1328 0.0953 0.1492 ...
..$ NDVI 2 sd : num [1:10] 0.00872 0.10508 0.00863 0.01878 0.02303 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 3 mean: num [1:10] 0.0634 0.681 0.2108 0.0151 0.179 ...
..$ NDVI 3 sd : num [1:10] 0.0344 0.076 0.0361 0.0638 0.0428 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 4 mean: num [1:10] 0.0971 0.6885 0.2326 0.1157 0.3219 ...
..$ NDVI 4 sd : num [1:10] 0.00991 0.07509 0.02054 0.02793 0.0303 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 5 mean: num [1:10] 0.0817 0.4825 0.2754 0.1003 0.4155 ...
..$ NDVI 5 sd : num [1:10] 0.00998 0.05034 0.02781 0.03248 0.04056 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 6 mean: num [1:10] 0.1119 0.7667 0.582 0.0997 0.4426 ...
..$ NDVI 6 sd : num [1:10] 0.023 0.0672 0.0649 0.0331 0.0557 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 7 mean: num [1:10] 0.1997 0.6567 0.5111 0.0988 0.3307 ...
..$ NDVI 7 sd : num [1:10] 0.0671 0.0756 0.0435 0.0288 0.0457 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 8 mean: num [1:10] 0.3626 0.7356 0.6304 0.0954 0.335 ...
..$ NDVI 8 sd : num [1:10] 0.1454 0.0888 0.0502 0.0298 0.038 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 9 mean: num [1:10] 0.541 0.748 0.637 0.089 0.577 ...
..$ NDVI 9 sd : num [1:10] 0.0968 0.0721 0.0396 0.0276 0.0656 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ NDVI 10 mean: num [1:10] 0.6691 0.4377 0.6713 0.0942 0.6827 ...
..$ NDVI 10 sd : num [1:10] 0.088 0.0698 0.033 0.0316 0.0688 ...
$ :'data.frame': 10 obs. of 2 variables:
I am using rbindlist from the data.table package to merge everything into a single dataframe
newdf<-rbindlist(NDVI_stat, use.names = TRUE, fill = TRUE)
The code runs, but it does not create the structure I need. The output is a dataframe with 270 observations (27 dataframes * 10 rows each) and 54 variables (27 dataframes * 2 columns each)
image of newdf
As you can see in the image of newdf, it creates 270 rows, but what I want is 10 rows (so avoiding the NA values).
Any help on that?
This question is similar to this one
Plot dataframe with ggplot2 - R
The difference is that I changed the way I produce my input, and now I don't know how to arrange the dataframe properly so that I can later use
NDVIdf_forplot <- gather(NDVIdf, key = statistic, value = value, -ID)
and then use ggplot to create my graph
Any help on that?

I think you're asking how to column-bind the data frames. As far as I'm aware, data.table doesn't have a cbindlist function, so you could try do.call("cbind", NDVI_stat), though that's not quite the same and will fail if the data frames don't all have the same number of rows.
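A minimal sketch of that column-bind idea, using a two-element toy stand-in for NDVI_stat (the values are made up for illustration) and adding the ID column that the gather() call in the question expects:

```r
# Toy stand-in for the list in the question (hypothetical values, 2 rows instead of 10)
NDVI_stat <- list(
  data.frame("NDVI 1 mean" = c(0.18, 0.31), "NDVI 1 sd" = c(0.11, 0.06),
             check.names = FALSE),
  data.frame("NDVI 2 mean" = c(0.08, 0.60), "NDVI 2 sd" = c(0.01, 0.11),
             check.names = FALSE)
)

# Bind the data frames side by side: one row per observation
wide <- do.call(cbind, NDVI_stat)

# Add an ID column so the rows can be tracked after reshaping
wide$ID <- seq_len(nrow(wide))
```

With the real list this should give 10 rows and 54 statistic columns plus ID, which is the shape that gather(..., -ID) from the question works on.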

The problem is that the variable names are different in each df of the list. Once that is solved, the rest is as you imagine it to be.
An example with dplyr/tidyr:
df1<-data.frame(mean1=c(2,3),
sd1 = c(1,2))
df2<-data.frame(mean2=c(4,5),
sd2 = c(3,4))
listdf<-list(df1,df2)
str(listdf)
Gives
List of 2
$ :'data.frame': 2 obs. of 2 variables:
..$ mean1: num [1:2] 2 3
..$ sd1 : num [1:2] 1 2
$ :'data.frame': 2 obs. of 2 variables:
..$ mean2: num [1:2] 4 5
..$ sd2 : num [1:2] 3 4
To rename all data frames and bind them together row by row
library(tidyverse)
# Give every data frame the same column names, then stack them row by row
listdf %>%
map(~ setNames(.x, c("mean", "sd"))) %>%
bind_rows()
gives
mean sd
2 1
3 2
4 3
5 4
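The same idea can be sketched in base R for the list in the question, keeping track of which NDVI index and which row each value came from. The toy values and the column names ID and index are assumptions for illustration, not part of the original data:

```r
# Toy stand-in for NDVI_stat (hypothetical values, 2 rows instead of 10)
NDVI_stat <- list(
  data.frame("NDVI 1 mean" = c(0.18, 0.31), "NDVI 1 sd" = c(0.11, 0.06),
             check.names = FALSE),
  data.frame("NDVI 2 mean" = c(0.08, 0.60), "NDVI 2 sd" = c(0.01, 0.11),
             check.names = FALSE)
)

# For each element: unify the column names, record the row ID and the
# position in the list, then stack everything into one long data frame
long <- do.call(rbind, lapply(seq_along(NDVI_stat), function(i) {
  d <- setNames(NDVI_stat[[i]], c("mean", "sd"))
  data.frame(ID = seq_len(nrow(d)), index = i, d)
}))
```

The result has one row per ID and index, with mean and sd as ordinary columns, so it can be passed to ggplot with index on the x axis and ID as a grouping variable.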

Related

Change df columns from lists to vectors

I've been using R for a while, but lists perplex me.
For some reason in some cases my function outputs a data frame of lists:
str() returns something like:
'data.frame': 4683 obs. of 6 variables:
$ f1:List of 4683
..$ : num -0.196
..$ : num -0.205
..$ : num -0.209
..$ : num -0.218
..$ : num -0.197
..$ : num -0.136
..$ : num -0.22
instead of
'data.frame': 4683 obs. of 6 variables:
$ f1: num -0.197 -0.205 -0.208 -0.218 -0.197 ...
$ f2: num -0.13 -0.139 -0.136 -0.137 -0.126 ...
$ f3: num -0.216 -0.221 -0.214 -0.209 -0.203 ...
$ f4: num 0.00625 -0.04806 -0.04888 -0.02979 -0.03813 ...
$ f5: num -0.15 -0.178 -0.173 -0.207 -0.154 ...
$ f6: num -0.191 -0.224 -0.25 -0.183 -0.209 ...
...
like I'd expect. Is there some simple way to convert df from the first case to the second?
I have tried manually casting columns as vectors, which not only doesn't work, but also would not be very effective.
When we have a data frame like this
df
# 1 1, 2, 3 1, 2, 3
# 2 4, 5, 6 4, 5, 6
where
df |> str()
# 'data.frame': 2 obs. of 2 variables:
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
we can do
r <- do.call(data.frame, df)
r
# X1.3 X4.6 X1.3.1 X4.6.1
# 1 1 4 1 4
# 2 2 5 2 5
# 3 3 6 3 6
where
str(r)
# 'data.frame': 3 obs. of 4 variables:
# $ X1.3 : int 1 2 3
# $ X4.6 : int 4 5 6
# $ X1.3.1: int 1 2 3
# $ X4.6.1: int 4 5 6
Explanation: do.call constructs a data.frame() call with df (which is a "data.frame" as well as a "list") as ... arguments. So in our df with two lists of length 2, we get two data frames with 2 columns, i.e. a resulting data frame with 4 columns in this case.
By the way, you can use Reduce(.) just as well.
Data:
df <- structure(list(list(1:3, 4:6), list(1:3, 4:6)), names = c("",
""), class = "data.frame", row.names = c(NA, -2L))
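For the situation in the question, where every list element holds a single number, a different sketch may be closer to what is wanted: unlist each column in place. This assumes every cell has length one; the toy data frame below is made up for illustration.

```r
# Toy data frame with one ordinary column and one list column (hypothetical values)
df <- data.frame(id = 1:3)
df$f1 <- list(-0.196, -0.205, -0.209)

# Flatten every column; length-one list elements become a plain vector
df[] <- lapply(df, unlist)

str(df)
```

If some cells hold more than one value, unlist() returns a vector longer than the number of rows and the assignment fails, so this only applies to the all-scalar case.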

How can I merge a list of dataframes into one singular matrix?

str(txt_files_df)
List of 941
$ :'data.frame': 19 obs. of 2 variables:
..$ V1: num [1:19] -2.87 4.83 21.33 25.73 32.33 ...
..$ V2: num [1:19] 8274 8274 8279 8281 8286 ...
$ :'data.frame': 19 obs. of 2 variables:
..$ V1: num [1:19] -1.8 9.2 20.2 25.7 37.8 ...
..$ V2: num [1:19] 7199 7199 7200 7202 7213 ...
This is the list that I have. In each data frame, V1 holds the x values and V2 holds the y values of one polygon. I have 941 of these sets, and I would like to know if there is a way to combine all X values into one column and all Y values into another, giving a single large matrix. The merge function doesn't work here since the data is not in the format it expects, and I do not know which function I should use instead. Thank you!
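One possible approach, sketched on a two-element toy list (the values are made up): because every data frame has the same column names, they can be stacked with rbind and then converted to a matrix.

```r
# Toy stand-in for txt_files_df (hypothetical values)
txt_files_df <- list(
  data.frame(V1 = c(1, 2), V2 = c(3, 4)),
  data.frame(V1 = c(5, 6), V2 = c(7, 8))
)

# Stack all data frames row-wise, then convert to a numeric matrix
xy <- as.matrix(do.call(rbind, txt_files_df))

dim(xy)
```

Note that the result no longer records which polygon each row belongs to; if that matters, an extra column identifying the list element would have to be added before binding.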

How to manipulate data.frame object in different list more elegantly?

I have data.frame objects in lists that are the output of a function I implemented. I want to build new lists that group the matching data.frame objects from the different lists together. I have tried several ways to get my expected output, but none is very elegant. Does anyone know an efficient, elegant way to do this?
This is mini example:
savedList <- list(
foo_saved = data.frame(v1=c(1,6,16), v2=c(4,12,23)),
bar_saved = data.frame(v1=c(7,19,31), v2=c(16,28,41)),
cat_saved = data.frame(v1=c(5,13,26), v2=c(11,21,42))
)
dropedList <- list(
foo_droped = data.frame(v1=c(4,9,20), v2=c(7,15,29)),
bar_droped = data.frame(v1=c(14,26,35), v2=c(21,30,47)),
cat_droped = data.frame(v1=c(18,29,39), v2=c(25,36,48))
)
This is my expected output:
foo <- list(
foo_saved = data.frame(v1=c(1,6,16), v2=c(4,12,23)),
foo_droped = data.frame(v1=c(4,9,20), v2=c(7,15,29))
)
bar <- list(
bar_saved = data.frame(v1=c(7,19,31), v2=c(16,28,41)),
bar_droped = data.frame(v1=c(14,26,35), v2=c(21,30,47))
)
cat <- list(
cat_saved = data.frame(v1=c(5,13,26), v2=c(11,21,42)),
cat_droped = data.frame(v1=c(18,29,39), v2=c(25,36,48))
)
I tried some existing solutions but am not satisfied with them. How can I get my desired output easily? Is there an efficient solution for this? Thanks a lot.
You could combine the two lists, then split on the common part of the names. split() is not the most efficient function ever, but the code for this is very simple.
x <- c(savedList, dropedList)
split(x, sub("_.*", "", names(x)))
This gives the following:
List of 3
$ bar:List of 2
..$ bar_saved :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 7 19 31
.. ..$ v2: num [1:3] 16 28 41
..$ bar_droped:'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 14 26 35
.. ..$ v2: num [1:3] 21 30 47
$ cat:List of 2
..$ cat_saved :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 5 13 26
.. ..$ v2: num [1:3] 11 21 42
..$ cat_droped:'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 18 29 39
.. ..$ v2: num [1:3] 25 36 48
$ foo:List of 2
..$ foo_saved :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 1 6 16
.. ..$ v2: num [1:3] 4 12 23
..$ foo_droped:'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 4 9 20
.. ..$ v2: num [1:3] 7 15 29
You can use mapply for this; it iterates through both lists in parallel and makes a list from each pair of items:
res <- mapply(list, savedList, dropedList, SIMPLIFY = FALSE)
str(res)
List of 3
$ foo_saved:List of 2
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 1 6 16
.. ..$ v2: num [1:3] 4 12 23
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 4 9 20
.. ..$ v2: num [1:3] 7 15 29
$ bar_saved:List of 2
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 7 19 31
.. ..$ v2: num [1:3] 16 28 41
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 14 26 35
.. ..$ v2: num [1:3] 21 30 47
$ cat_saved:List of 2
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 5 13 26
.. ..$ v2: num [1:3] 11 21 42
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 18 29 39
.. ..$ v2: num [1:3] 25 36 48

How to standardize a data frame which contains both numeric and factor variables

My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.
> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Could the standardising work by doing this? I want to standardise the columns 8,9,10,11 and 12 but I think I have the wrong code.
mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))
Thanks in advance
Here is one option to standardize
mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x)
You can use the dplyr package to do this:
mydata2%>%mutate_if(is.numeric,scale)
Here are some options to consider, although it is answered late:
# Working environment and Memory management
rm(list = ls(all.names = TRUE))
gc()
memory.limit(size = 64935)
# Set working directory
setwd("path")
# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
"Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
"Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
"Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
"Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
"Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
stringsAsFactors = TRUE) # keep Name and Gender as factors (no longer the default since R 4.0)
Let us check the structure of df:
str(df)
'data.frame': 10 obs. of 6 variables:
$ Age : num 21 19 25 34 45 63 39 28 50 39
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num 2138 1516 2213 2500 2660 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num 172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num 60 70 88 48 71 51 65 44 53 91
We see that Age, Salary, Height and Weight are numeric and Name and Gender are categorical (factor variables).
Let us scale just the numeric variables using only base R:
1) Option: (slight modification of what akrun has proposed here)
start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
(x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1
Time difference of 0.02717805 secs
str(df1)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
2) Option: (akrun's approach)
start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2
Time difference of 0.02599907 secs
str(df2)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
3) Option:
start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3
str(df3)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
4) Option (using tidyverse and invoking dplyr):
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4
Time difference of 0.012043 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
Which option to choose depends on the output structure you need and on speed. Suppose your data is unbalanced and you want to balance it and then run a classification after scaling the numeric variables: the one-column matrix structure of the scaled numeric variables (Age, Salary, Height and Weight) will cause problems. I mean,
str(df4$Age)
num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
- attr(*, "scaled:center")= num 36.3
- attr(*, "scaled:scale")= num 13.8
Since, for example, ROSE package (which balances data) doesn't accept data structures apart from int, factor and num, it will throw an error.
To avoid this issue, the numeric variables after scaling can be saved as vectors instead of a column matrix by:
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, ~scale (.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4
with
Time difference of 0.01400399 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
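The same result can be sketched in base R by combining option 3 with as.vector(), which drops the matrix dimensions and the scaled:center and scaled:scale attributes. The toy data frame below is a shortened, made-up version of the example.

```r
# Shortened toy version of the example data (hypothetical subset)
df <- data.frame(
  Age    = c(21, 19, 25, 34),
  Gender = factor(c("Female", "Female", "Male", "Male")),
  Height = c(172, 166, 191, 169)
)

# Find the numeric columns and replace each with its scaled version as a plain vector
num <- vapply(df, is.numeric, logical(1))
df[num] <- lapply(df[num], function(x) as.vector(scale(x)))

str(df)
```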

convert list with unequal vector length to data frame by strata in R

I have the output from a coxph function, which is estimated by strata. I would like to transform this output from a list into a data frame. The code I ran for coxph is below:
k <- coxph(Surv(cum.goodp, dlq.next) ~ rpc.length + cluster(itemcode) + strata(sector), data = nr.sample)
m <- summary(survfit(k))
There are twenty different strata used to estimate the coxph. Here is the structure of the list
List of 16
$ n : int [1:20] 870 843 2278 603 6687 8618 15155 920 2598 654 ...
$ time : num [1:870] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:870] 870 592 448 361 320 286 232 214 196 186 ...
$ n.event : num [1:870] 246 126 77 34 33 25 18 18 8 6 ...
$ n.censor : num [1:870] 32 18 10 7 1 29 0 0 2 0 ...
$ strata : Factor w/ 20 levels "sector=11","sector=21",..: 1 1 1 1 1 1 1 1 1 1 ...
$ surv : num [1:870] 0.725 0.571 0.471 0.425 0.379 ...
$ type : chr "right"
$ cumhaz : num [1:870] 0.322 0.561 0.754 0.856 0.971 ...
$ std.err : num [1:870] 0.015 0.017 0.0174 0.0174 0.0173 ...
$ upper : num [1:870] 0.755 0.605 0.506 0.46 0.414 ...
$ lower : num [1:870] 0.696 0.538 0.438 0.392 0.347 ...
$ conf.type: chr "log"
$ conf.int : num 0.95
$ call : language survfit(formula = k)
$ table : num [1:20, 1:7] 870 843 2278 603 6687 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20] "sector=11" "sector=21" "sector=22" "sector=23" ...
.. ..$ : chr [1:7] "records" "n.max" "n.start" "events" ...
- attr(*, "class")= chr "summary.survfit"
I have done this before, but without strata. When I did not have strata I used the following approach:
col <- lapply(c(1 : 7), function(x) m[x])
tbl <- do.call(data.frame, col)
However, when I try that approach here, I get the familiar error:
cannot coerce class "c("survfit.cox", "survfit")" to a data.frame
All columns have the same name, but they are of different length. If possible, I would like to add a column to the final data frame that contains the particular strata that the results are for. Is there a way to do this? It doesn't have to be in base R. Any help would be much appreciated. Thanks so much.
This problem can be solved with the tidy function from the broom package. For the example above, the code is:
library(broom)
n <- survfit(k)
df <- tidy(n)
The tidy function produces a data frame with a variable "strata". It does not, however, provide the median and mean, but they can be estimated from the data frame df if one were so inclined. If the survfit object has multiple strata, glance() cannot provide the median or mean either.
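If broom is not available, a rough base R sketch of the same idea is to keep only those list elements that have one value per time point and bind them into a data frame. The list m below is a small hand-made stand-in for the summary object in the question, not real output, and for a real summary.survfit object it may be safer to apply this to unclass(m).

```r
# Hand-made stand-in for summary(survfit(k)) (hypothetical values)
m <- list(
  n      = c(3L, 2L),                      # one value per stratum
  time   = c(1, 2, 3, 1, 2),               # one value per time point
  n.risk = c(3, 2, 1, 2, 1),
  surv   = c(0.67, 0.33, 0.00, 0.50, 0.00),
  strata = factor(rep(c("sector=11", "sector=21"), times = c(3, 2))),
  type   = "right"
)

# Keep the elements whose length matches the number of time points
# (this would also keep per-stratum elements if the two counts happened to coincide)
keep <- lengths(m) == length(m$time)
tbl <- as.data.frame(m[keep])
```

The strata column then identifies which stratum each row belongs to, as requested in the question.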
