I have a large data frame called "df" (with some NA values inside)
dim(df)
[1] 2174 420
I would like to change the dimension of it into 32610 rows and 28 columns (by row), for example:
#df=
a b c d e f g ...
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ...
2 .........
3 .........
4 .........
5 .........
6 .........
...........
Into:
#new.df=
r1 r2 r3 r4 r5 r6 r7 ... ... r28
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
2 29 30 ...
3 .........
4 .........
5 .........
6 .........
...........
Therefore, new dimension:
dim(new.df)
[1] 32610 28
Can anyone help me with the code?
To re-flow the data row by row, we can transpose the original data.frame and refill a matrix with the new dimensions:
matrix(unlist(t(df)), byrow=TRUE, 32610, 28)
Reproducible Example
There is no reason not to include a reproducible example in your question. The problem is easy to shrink to a size where the solution can be checked by eye:
df <- as.data.frame(matrix(1:12,3, byrow=TRUE))
df
V1 V2 V3 V4
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
matrix(unlist(t(df)), byrow=TRUE, 6, 2)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
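The same idea can be wrapped in a small helper that first checks that the new width divides the total number of cells. This is only a sketch: `reshape_by_row` is a hypothetical name, not a base R function.

```r
# Hypothetical helper: re-flow a data frame row by row into `n_col` columns.
reshape_by_row <- function(df, n_col) {
  # t() turns the data frame into a matrix; as.vector() then reads it
  # column by column, which is the original data row by row.
  values <- as.vector(t(df))
  stopifnot(length(values) %% n_col == 0)
  as.data.frame(matrix(values, ncol = n_col, byrow = TRUE))
}

df <- as.data.frame(matrix(1:12, 3, byrow = TRUE))
new.df <- reshape_by_row(df, 2)
dim(new.df)
```

For the data in the question the call would be `reshape_by_row(df, 28)`, since 2174 * 420 = 32610 * 28.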
Related
In R, I have a very long dataframe in which there are two columns as follows:
up low
5 10
10 20
20 30
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
I would like to repeat the sequence of numbers in these two columns until the end of the dataframe. So, my desired table should look like this:
up low
5 10
10 20
20 30
5 10
10 20
20 30
5 10
10 20
20 30
How can I do it in R? What codes can be used to do this?
Please help me.
Thanks
Here is a tidyverse approach using purrr:
library(magrittr) # provides the %>% pipe
purrr::map_dfr(seq_len(3), ~df) %>%
na.omit()
up low
1 5 10
2 10 20
3 20 30
10 5 10
11 10 20
12 20 30
19 5 10
20 10 20
21 20 30
How about replicating the data frame without the NAs, i.e.
sapply(na.omit(df),rep.int,times=(nrow(df) / nrow(na.omit(df))))
# v1 v2
# [1,] 5 10
# [2,] 10 20
# [3,] 20 30
# [4,] 5 10
# [5,] 10 20
# [6,] 20 30
# [7,] 5 10
# [8,] 10 20
# [9,] 20 30
I would use rep and row.names:
> df[rep(row.names(na.omit(df)), nrow(df) / nrow(na.omit(df))),]
up low
1 5 10
2 10 20
3 20 30
1.1 5 10
2.1 10 20
3.1 20 30
1.2 5 10
2.2 10 20
3.2 20 30
>
To reset the index:
out <- df[rep(row.names(na.omit(df)), nrow(df) / nrow(na.omit(df))),]
row.names(out) <- NULL
> out
up low
1 5 10
2 10 20
3 20 30
4 5 10
5 10 20
6 20 30
7 5 10
8 10 20
9 20 30
>
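The answers above assume that the number of rows is an exact multiple of the pattern length. If it is not, `rep()` with `length.out` can cycle through the pattern until the frame is full. A minimal sketch on made-up data with 8 rows (the data frame below is an assumption, not the asker's data):

```r
# Example data: three filled rows followed by NA padding (8 rows in total,
# so the pattern does not fit a whole number of times).
df <- data.frame(up = c(5, 10, 20, rep(NA, 5)),
                 low = c(10, 20, 30, rep(NA, 5)))

pattern <- na.omit(df)
# Cycle through the pattern's row positions until nrow(df) rows are filled.
idx <- rep(seq_len(nrow(pattern)), length.out = nrow(df))
out <- pattern[idx, ]
row.names(out) <- NULL
out
```

The last repetition is simply cut off once `nrow(df)` rows have been produced.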
The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
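# subject is read as an integer; factor() makes the sampling stratified per subject rather than per quantile group
partition <- caret::createDataPartition(factor(df$subject), p = 0.5, list = FALSE)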
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
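If exact balance on subject is what matters most, another option is to draw half of the rows within each subject in base R. This is a sketch on made-up data (the `toy` data frame is an assumption, not the CSV from the question), and on its own it does nothing to balance item.

```r
set.seed(1)
# Made-up data: 40 subjects with 12 rows each.
toy <- data.frame(subject = rep(1:40, each = 12),
                  dv = rnorm(480))

# Row indices grouped by subject, then half of each group drawn at random.
rows_by_subject <- split(seq_len(nrow(toy)), toy$subject)
first <- unlist(
  lapply(rows_by_subject,
         function(r) r[sample.int(length(r), length(r) %/% 2)]),
  use.names = FALSE
)

toy_1 <- toy[first, ]
toy_2 <- toy[-first, ]
table(toy_1$subject)  # every subject appears exactly 6 times
```

The item counts in each half can then be inspected with `table()` and the split redrawn if they are too uneven.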
I have a dataset consisting of two variables, Contents and Time like so:
Time Contents
2017M01 123
2017M02 456
2017M03 789
. .
. .
. .
2018M12 789
Now I want to create a numeric vector that aggregates Contents for six months, that is I want to sum 2017M01 to 2017M06 to one number, 2017M07 to 2017M12 to another number and so on.
I'm able to do this by indexing but I want to be able to write: "From 2017M01 to 2017M06 sum contents corresponding to that sequence" in my code.
I would really appreciate some help!
You can create a grouping variable from the number of rows and the group size. In your case you want to group every 6 rows, so the number of rows should be divisible by 6. Using iris to demonstrate (it has 150 rows, so 150 / 6 = 25 groups):
rep(seq(nrow(iris)%/%6), each = 6)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
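There are plenty of ways to let you write the range the way you want. Here is a custom function that creates the grouping variable from a string such as "2017M01:2017M06". Note that it reads only the month numbers, so the range must lie within a single year: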
f1 <- function(x, df) {
v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
i1 <- (v2 - v1) + 1
return(rep(seq(nrow(df)%/%i1), each = i1))
}
f1("2017M01:2017M06", iris)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
EDIT: The function can be made to handle row counts that are not a multiple of the group size by appending one extra group id (max + 1), repeated once for each leftover row, i.e.
f1 <- function(x, df) {
v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
i1 <- (v2 - v1) + 1
final_v <- rep(seq(nrow(df) %/% i1), each = i1)
if (nrow(df) %% i1 == 0) {
return(final_v)
} else {
remainder = nrow(df) %% i1
final_v1 <- c(final_v, rep((max(final_v) + 1), remainder))
return(final_v1)
}
}
So for a data frame with 20 rows, doing groups of 6, the above function will yield the result:
f1("2017M01:2017M06", df)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4
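Once the grouping variable exists, the six-month sums the question asks for follow from `tapply()`. A minimal sketch on made-up data in the question's layout (the `dat` data frame and its values are assumptions):

```r
# Made-up monthly data in the question's layout: 2017M01 ... 2018M12.
dat <- data.frame(
  Time = sprintf("%dM%02d", rep(2017:2018, each = 12), rep(1:12, times = 2)),
  Contents = 1:24
)

# One group id per block of six consecutive rows.
grp <- rep(seq_len(nrow(dat) %/% 6), each = 6)

# Sum Contents within each block; as.vector() drops the group labels.
half_year_sums <- as.vector(tapply(dat$Contents, grp, sum))
half_year_sums
```

Here `grp` could equally be the result of `f1("2017M01:2017M06", dat)`.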
I have a huge data set. Data covers around 4000 regions.
I need to do a calculation like this: first, each number in a row should be multiplied by its column name (0, 1, 2, ...).
Then these products should be summed and divided by the total (totan) in that row.
For example, the data is like this:
region totan 0 1 2 3 4 5 6 7 .....
1 1346 5 7 3 9 23 24 34 54 .....
2 1256 7 8 4 10 34 2 14 30 .....
3 1125 83 43 23 11 16 4 67 21 .....
4 3211 43 21 67 12 13 12 98 12 .....
5 1111 21 8 9 3 23 13 11 0 .....
.... .... .. .. .. .. .. .. .. .. .....
4000 2345 21 9 11 45 67 89 28 7 .....
The calculation should be like this:
For example in region 1:
((5*0)+(7*1)+(3*2)+(9*3)+(23*4)+(24*5)+(34*6)+(54*7)+...) / 1346 = the result
I need to do such an analysis for all the regions.
I tried a couple of ways like use of "for" and "apply" but did not get the required result.
This can be done fully vectorized:
Data:
> df
region totan 0 1 2 3 4 5 6 7
1 1 1346 5 7 3 9 23 24 34 54
2 2 1256 7 8 4 10 34 2 14 30
3 3 1125 83 43 23 11 16 4 67 21
4 4 3211 43 21 67 12 13 12 98 12
5 5 1111 21 8 9 3 23 13 11 0
6 4000 2345 21 9 11 45 67 89 28 7
as.matrix(df[3:10]) %*% as.numeric(names(df)[3:10]) / df$totan
[,1]
[1,] 0.6196137
[2,] 0.3869427
[3,] 0.6711111
[4,] 0.3036437
[5,] 0.2322232
[6,] 0.4673774
This should be significantly faster on a huge dataset than any for or *apply loop.
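For a self-contained check, the first three regions can be typed in by hand. Note the assumption that the column names really are "0", "1", ...; with the default `check.names = TRUE`, R would rename them to "X0", "X1", ... and `as.numeric()` would return NA.

```r
# First three regions from the question; check.names = FALSE keeps the
# column names "0", "1", ... instead of turning them into "X0", "X1", ...
df <- data.frame(
  region = 1:3,
  totan = c(1346, 1256, 1125),
  `0` = c(5, 7, 83), `1` = c(7, 8, 43), `2` = c(3, 4, 23), `3` = c(9, 10, 11),
  `4` = c(23, 34, 16), `5` = c(24, 2, 4), `6` = c(34, 14, 67), `7` = c(54, 30, 21),
  check.names = FALSE
)

value_cols <- 3:ncol(df)
weights <- as.numeric(names(df)[value_cols])

# The matrix product gives the weighted sum per row; drop() turns the
# one-column matrix into a plain vector.
result <- drop(as.matrix(df[value_cols]) %*% weights) / df$totan
round(result, 7)
```

The three values agree with the first three rows of the output above.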
You could use the tidyverse:
library(tidyverse)
df %>% gather(k,v,-region,-totan) %>%
group_by(region,totan) %>% summarize(x=sum(as.numeric(k)*v)/first(totan))
## A tibble: 5 x 3
## Groups: region [?]
# region totan x
# <int> <int> <dbl>
#1 1 1346 0.620
#2 2 1256 0.387
#3 3 1125 0.671
#4 4 3211 0.304
#5 5 1111 0.232
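A loop works too, as long as the column names are converted to numbers and the results are stored:
res <- numeric(nrow(data))
for (i in seq_len(nrow(data))) {
res[i] <- sum(unlist(data[i, 3:ncol(data)]) * as.numeric(names(data)[3:ncol(data)])) / data[i, 2]
}
alternatively
apply(data, 1, function(x) {
sum(x[3:length(x)] * as.numeric(names(x)[3:length(x)])) / x[2]
})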
I have list of data with varying list length:
[[1]]
[1] "2009" "2010" "2011" "2012"
[[2]]
[1] "2010" "2011" "2012" "2013"
[[3]]
[1] "2008" "2009" "2010" "2011" "2012"
[[4]]
[1] "2011" "2012"
I would like to get one column data.frame like this:
2009
2010
2011
2012
2010
2011
....
I went on doing this unsuccessfully:
# transpose list of years
YearsDf <- lapply(GetYears, data.frame)
Remove colnames (since the list of dataframes gave some weird column names):
YearsOk <- lapply(YearsDf, function(x) "colnames<-"(x, NULL))
All this comes to:
[[1]]
NA
1 2009
2 2010
3 2011
4 2012
[[2]]
NA
1 2010
2 2011
3 2012
4 2013
......
Now just bind and get data.frame. This gave NA's
ldply(YearsOk, data.frame)
How do I get a data.frame with a single column?
Did you consider unlist?
myL <- list(as.character(2009:2012),
as.character(2010:2011),
as.character(2009:2014))
data.frame(year = unlist(myL))
# year
# 1 2009
# 2 2010
# 3 2011
# 4 2012
# 5 2010
# 6 2011
# 7 2009
# 8 2010
# 9 2011
# 10 2012
# 11 2013
# 12 2014
If you think it will be important for you to retain which list element the value came from, consider stack (which requires a named list) or melt from the "reshape2" package:
library(reshape2)
melt(myL)
# value L1
# 1 2009 1
# 2 2010 1
# ...SNIP...
# 11 2013 3
# 12 2014 3
## stack requires names, so add some in...
stack(setNames(myL, seq_along(myL)))
# values ind
# 1 2009 1
# 2 2010 1
# ...SNIP...
# 12 2014 3
Finally, this is absolutely not the approach I would take, but based on your example code, perhaps you were trying to do something like:
do.call(rbind, lapply(myL, function(x) data.frame(year = x)))
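If the source element should be kept without an extra package, `lengths()` and `rep()` give the same information in base R. A minimal sketch using the same `myL` as above (the column name `source` is just an illustrative choice):

```r
myL <- list(as.character(2009:2012),
            as.character(2010:2011),
            as.character(2009:2014))

# lengths() gives the size of each element, so rep() can label every value
# with the position of the list element it came from.
years <- data.frame(year = unlist(myL),
                    source = rep(seq_along(myL), lengths(myL)))
nrow(years)
```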
It's quite simple. This approach also works for vectors of different lengths:
a <- 1:11; b <- 2:30
Q<-list(a=a,b=b)
str(Q)
List of 2
$ a: int [1:11] 1 2 3 4 5 6 7 8 9 10 ...
$ b: int [1:29] 2 3 4 5 6 7 8 9 10 11 ...
Q$a
[1] 1 2 3 4 5 6 7 8 9 10 11
T<-c(Q$a,Q$b)
T
[1] 1 2 3 4 5 6 7 8 9 10 11 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[28] 18 19 20 21 22 23 24 25 26 27 28 29 30
TT<-data.frame(T)
TT
T
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 2
13 3
14 4
15 5
16 6
17 7
18 8
19 9
20 10
21 11
22 12
23 13
24 14
25 15
26 16
27 17
28 18
29 19
30 20
31 21
32 22
33 23
34 24
35 25
36 26
37 27
38 28
39 29
40 30