In R, I have a very long dataframe in which there are two columns as follows:
up   low
5    10
10   20
20   30
NA   NA
NA   NA
NA   NA
NA   NA
NA   NA
NA   NA
I would like to repeat the sequence of numbers in these two columns until the end of the dataframe. So, my desired table should look like this:
up   low
5    10
10   20
20   30
5    10
10   20
20   30
5    10
10   20
20   30
How can I do this in R? What code can I use?
Please help me.
Thanks
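Here is a tidyverse approach using purrr:
library(magrittr)  # provides %>%
# Stack three copies of df (3 = total rows / non-NA rows), then drop the NA rows
purrr::map_dfr(seq_len(3), ~df) %>%
  na.omit()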
up low
1 5 10
2 10 20
3 20 30
10 5 10
11 10 20
12 20 30
19 5 10
20 10 20
21 20 30
How about replicating the data frame without the NAs, i.e.
sapply(na.omit(df),rep.int,times=(nrow(df) / nrow(na.omit(df))))
# v1 v2
# [1,] 5 10
# [2,] 10 20
# [3,] 20 30
# [4,] 5 10
# [5,] 10 20
# [6,] 20 30
# [7,] 5 10
# [8,] 10 20
# [9,] 20 30
I would use rep and row.names:
> df[rep(row.names(na.omit(df)), nrow(df) / nrow(na.omit(df))),]
up low
1 5 10
2 10 20
3 20 30
1.1 5 10
2.1 10 20
3.1 20 30
1.2 5 10
2.2 10 20
3.2 20 30
>
To reset the index:
out <- df[rep(row.names(na.omit(df)), nrow(df) / nrow(na.omit(df))),]
row.names(out) <- NULL
> out
up low
1 5 10
2 10 20
3 20 30
4 5 10
5 10 20
6 20 30
7 5 10
8 10 20
9 20 30
>
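A minimal sketch of a variant that also works when the total number of rows is not an exact multiple of the non-NA block. It assumes the data frame from the question (rebuilt here as df) and recycles row indices with rep_len, which is a swapped-in alternative to the rep calls above; the names block, idx and out are only illustrative.

```r
# Example data from the question: 3 filled rows followed by 6 NA rows
df <- data.frame(up  = c(5, 10, 20, rep(NA, 6)),
                 low = c(10, 20, 30, rep(NA, 6)))

# The block of complete rows that should be repeated
block <- na.omit(df)

# Recycle the block's row indices until they cover every row of df
idx <- rep_len(seq_len(nrow(block)), nrow(df))

out <- block[idx, ]
row.names(out) <- NULL  # reset row names such as 1.1, 2.1, ...
out
```

Because rep_len truncates the last repetition when needed, the result always has exactly nrow(df) rows.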
I have a vector of numbers
a = c(1:100)
digits = c(0:9)
I want to know the frequency of each digit in the vector, with the output formatted exactly like the example below:
Digits Frequency
0 10
1 20
2 20
3 20
4 20
5 20
6 20
7 20
8 20
9 20
How to get this output using R?
You can convert the numbers to character with as.character then split to individual characters with strsplit and count the frequency with table.
table(unlist(strsplit(as.character(a), "")))
# 0 1 2 3 4 5 6 7 8 9
#11 21 20 20 20 20 20 20 20 20
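Or, to handle more varied input (for example, numbers that would otherwise print in scientific notation, or values containing non-digit characters):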
table(unlist(strsplit(gsub("[^[:digit:]]", "", format(a, scientific =FALSE)), "")))
# 0 1 2 3 4 5 6 7 8 9
#11 21 20 20 20 20 20 20 20 20
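To get the two-column layout asked for in the question, one possible sketch is to count against a factor with fixed levels, so that a digit that never occurs would still be listed with a frequency of 0. The names chars, counts and result are illustrative and not part of the answer above.

```r
a <- 1:100
digits <- 0:9

# Split every number into its individual digit characters
chars <- unlist(strsplit(as.character(a), ""))

# Count against a fixed set of levels so every digit gets a row
counts <- table(factor(chars, levels = digits))

result <- data.frame(Digits = digits, Frequency = as.vector(counts))
result
```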
The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a partition that is balanced on one variable.
The code below creates a balanced partition of the dataset according to the variable subject. The variable is converted to a factor because createDataPartition stratifies a numeric vector by percentile groups rather than by each distinct value:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
partition <- caret::createDataPartition(factor(df$subject), p = 0.5, list = FALSE)
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
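For comparison, a base R sketch of the same one-variable balancing: sample exactly half of the rows within each subject, then inspect how the items happen to be distributed. The toy data frame below is an assumption that only mimics the design described in the question (40 subjects, 24 items, 12 items per subject); it is not the real data, and this approach does not guarantee balance on item.

```r
set.seed(1)  # for a reproducible split

# Hypothetical design: odd subjects see items 1-12, even subjects see items 13-24
subject <- rep(1:40, each = 12)
toy <- data.frame(subject = subject,
                  item = rep(1:12, times = 40) + 12 * (subject %% 2 == 0))

# Row indices grouped by subject, then 6 rows sampled from each group
rows_by_subject <- split(seq_len(nrow(toy)), toy$subject)
first <- unlist(lapply(rows_by_subject, sample, size = 6))

toy_1 <- toy[first, ]
toy_2 <- toy[-first, ]

table(toy_1$subject)  # exactly 6 per subject
table(toy_1$item)     # item counts are not controlled; check them here
```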
I have a table with a column "Age" that has values from 1 to 10, and a column "Population" that gives the population for each age. I want to compute a reverse cumulative sum of the population, so that the results are the totals for ages 1 and above, 2 and above, and so on. That is, the resulting vector should be (203, 180, ...). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
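A self-contained sketch of the same computation, with the Population values copied from the question. It also shows an equivalent formula (the grand total minus the running sum of the younger ages) as a cross-check; the name alt is only illustrative.

```r
df <- data.frame(Age = 1:10,
                 Population = c(23, 12, 32, 33, 15, 10, 19, 18, 19, 22))

# Reverse, accumulate, reverse back: each row holds the sum for that age and above
df$sum_above <- rev(cumsum(rev(df$Population)))

# Cross-check: total minus the cumulative sum of strictly younger ages
alt <- sum(df$Population) - cumsum(df$Population) + df$Population
stopifnot(all(df$sum_above == alt))

df
```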
I have a large data frame called "df" (with some NA values inside)
dim(df)
[1] 2174 420
I would like to change the dimension of it into 32610 rows and 28 columns (by row), for example:
#df=
a b c d e f g ...
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ...
2 .........
3 .........
4 .........
5 .........
6 .........
...........
Into:
#new.df=
r1 r2 r3 r4 r5 r6 r7 ... ... r28
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
2 29 30 ...
3 .........
4 .........
5 .........
6 .........
...........
Therefore, new dimension:
dim(new.df)
[1] 32610 28
Can anyone help me with the code?
To reshape the data row by row, we can transpose the original data.frame and refill a matrix from its elements by row:
matrix(unlist(t(df)), nrow = 32610, ncol = 28, byrow = TRUE)
Reproducible Example
There is no reason to not have a reproducible example in your question. It is very easy to simplify the problem to understand the underlying solution:
df <- as.data.frame(matrix(1:12, 3, byrow = TRUE))
df
V1 V2 V3 V4
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
matrix(unlist(t(df)), byrow = TRUE, 6, 2)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
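If a data frame with r1, r2, ... column names is wanted, as in the question, the small example can be extended along these lines. This is a sketch: new_ncol and m are illustrative names, and the divisibility check is an added safeguard rather than part of the answer.

```r
df <- as.data.frame(matrix(1:12, nrow = 3, byrow = TRUE))
new_ncol <- 2  # would be 28 for the data in the question

# The total number of cells must divide evenly into the new column count
stopifnot((nrow(df) * ncol(df)) %% new_ncol == 0)

# t(df) lays the values out row by row; refill them by row into the new shape
m <- matrix(t(df), ncol = new_ncol, byrow = TRUE)

new.df <- as.data.frame(m)
names(new.df) <- paste0("r", seq_len(new_ncol))
dim(new.df)
```

Note that t() coerces all columns to a common type, so this suits data frames whose columns share one type, such as all numeric.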
I have a dataframe df
Reads Counts
aaaa 10
bbbb 20
cccc 25
and so on.
I want to calculate the number of reads whose count is at least a certain value, and plot that. For example, I want a data frame that looks like
Counts>= #reads with Counts>=
1 3
2 3
3 3
11 2
20 2
21 1
and so on. Can you suggest how I can get such a data frame and plot it?
Given the levels you want to plot at...
cutoffs <- 1:30
... you could do something like:
data.frame(cutoff=cutoffs, num.above=Reduce("+", lapply(df$Counts, ">=", cutoffs)))
# cutoff num.above
# 1 1 3
# 2 2 3
# 3 3 3
# 4 4 3
# 5 5 3
# 6 6 3
# 7 7 3
# 8 8 3
# 9 9 3
# 10 10 3
# 11 11 2
# 12 12 2
# 13 13 2
# 14 14 2
# 15 15 2
# 16 16 2
# 17 17 2
# 18 18 2
# 19 19 2
# 20 20 2
# 21 21 1
# 22 22 1
# 23 23 1
# 24 24 1
# 25 25 1
# 26 26 0
# 27 27 0
# 28 28 0
# 29 29 0
# 30 30 0
Basically for each value in the original data frame you compute a vector of whether it's greater than or equal to each cutoff (using lapply with >=). Then you add them up (using Reduce with +), getting the total number greater than or equal to each cutoff.
Another option would be to use outer with colSums:
cutoffs <- 1:30
data.frame(cutoff=cutoffs, num.above=colSums(outer(df$Counts, cutoffs, ">=")))
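The question also asks for a plot. A minimal sketch, assuming the three example rows from the question and base graphics; the name res and the axis labels are illustrative choices.

```r
df <- data.frame(Reads = c("aaaa", "bbbb", "cccc"),
                 Counts = c(10, 20, 25))
cutoffs <- 1:30

# One row of the logical matrix per read, one column per cutoff;
# column sums count the reads with Counts >= each cutoff
res <- data.frame(cutoff = cutoffs,
                  num.above = colSums(outer(df$Counts, cutoffs, ">=")))

# Step plot of the number of reads at or above each cutoff
plot(res$cutoff, res$num.above, type = "s",
     xlab = "Counts >=", ylab = "Number of reads")
```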