I have several files with the following structure:
data <- matrix(c(1:100000), nrow=1000, ncol=100)
The first 500 rows are X coordinates and the final 500 rows are Y coordinates of several object contours. Row # 1 (X) and row 501 (Y) correspond to coordinates of the same object. I need to:
transpose the whole matrix and arrange it so now row 1 is column 1 and row 501 is column 2 and have paired x, y coordinates in contiguous columns. Row 2 and row 502 should be in column 1 and column 2 below the data of previous object.
ideally, have an extra column with filename info.
thanks.
Simpler version:
Transpose the matrix, then create a vector with the column indices and subset with them:
mat <- matrix(1:100, nrow = 10)
mat2 <- t(mat)
# the columns of mat2 are the rows of mat, so count them with ncol(mat2)
cols <- unlist(lapply(1:(ncol(mat2)/2), function(i) c(i, i+ncol(mat2)/2)))
mat3 <- mat2[,cols]
Then just make it a dataframe as below.
You can subset pairs of rows separated by nrow/2, make them a 2-column matrix and then cbind them all:
df <- as.data.frame(do.call(cbind, lapply(1:(nrow(mat)/2), function(i) {
matrix(mat[c(i, nrow(mat)/2 + i),], ncol = 2, byrow = TRUE)
})))
Then add the new column as necessary, since it's now a data frame (random letters stand in for the filename here):
df$fname <- sample(letters, nrow(df), TRUE)
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 fname
# 1 1 6 2 7 3 8 4 9 5 10 a
# 2 11 16 12 17 13 18 14 19 15 20 e
# 3 21 26 22 27 23 28 24 29 25 30 e
# 4 31 36 32 37 33 38 34 39 35 40 o
# 5 41 46 42 47 43 48 44 49 45 50 y
# 6 51 56 52 57 53 58 54 59 55 60 q
# 7 61 66 62 67 63 68 64 69 65 70 v
# 8 71 76 72 77 73 78 74 79 75 80 b
# 9 81 86 82 87 83 88 84 89 85 90 v
# 10 91 96 92 97 93 98 94 99 95 100 y
What about
n <- 500
df <- data.frame(col1 = data[1:n, ],
col2 = data[(nrow(data) - 500):nrow(data), ],
fileinfo = "this is the name of the file...")
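Try David's answer, but with the off-by-one in the row range fixed. (nrow(data) - 500):nrow(data) selects 501 rows, so data.frame() fails on the mismatched row counts; subtract n - 1 instead: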
n <- 500
df <- data.frame(col1 = data[1:n, ],
col2 = data[(nrow(data) - (n-1)):nrow(data), ],
fileinfo = "this is the name of the file...")
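The answers above place the pairs side by side. If the goal is the long layout described in the question (one x column and one y column, with each object's points stacked below the previous object's), the following is a minimal sketch. The helper name contours_to_long, the object column, and the file name "file1.csv" are illustrative assumptions, not part of the original answers.

```r
# Hypothetical helper: reshape one contour matrix into long x/y format.
# Assumes the first half of the rows are X and the second half are Y.
contours_to_long <- function(m, fname) {
  n <- nrow(m) / 2                                   # number of objects
  data.frame(
    object = rep(seq_len(n), each = ncol(m)),        # which object a point belongs to
    x = as.vector(t(m[seq_len(n), , drop = FALSE])),     # row i of X, object by object
    y = as.vector(t(m[n + seq_len(n), , drop = FALSE])), # matching row n + i of Y
    fname = fname                                    # recycled to every row
  )
}

data <- matrix(1:100000, nrow = 1000, ncol = 100)
long <- contours_to_long(data, "file1.csv")          # placeholder file name
head(long)
```

Looping this over a vector of file names and combining the pieces with rbind() would give one table for all files.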
Related
I have a function like this
extract = function(x)
{
a = x$2007[6:18]
b = x$2007[30:42]
c = x$2007[54:66]
}
the subsetting needs to continue up to 744 in this way. I need to skip the first 6 data points, and then pull out every other 12 points into a new object or a list. Is there a more elegant way to do this with a for loop or apply?
Side note: if 2007 is truly a column name (you would have had to explicitly do this, R defaults to converting numbers to names starting with letters, see make.names("2007")), then x$"2007"[6:18] (etc) should work for column reference.
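A small sketch of that naming behaviour, using a made-up one-column data frame (the data here are an assumption for illustration only):

```r
make.names("2007")                                   # "X2007"

# By default data.frame() repairs the name ...
y <- data.frame(`2007` = 1:20)
names(y)                                             # "X2007"

# ... unless check.names = FALSE keeps it as given.
x <- data.frame(`2007` = 1:20, check.names = FALSE)
names(x)                                             # "2007"

# Both quoting styles reach the column:
x$"2007"[6:18]
x$`2007`[6:18]
```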
To generate that sequence of integers, let's try
nr <- 100
ind <- seq(6, nr, by = 12)
ind
# [1] 6 18 30 42 54 66 78 90
ind[ seq_along(ind) %% 2 == 1 ]
# [1] 6 30 54 78
ind[ seq_along(ind) %% 2 == 0 ]
# [1] 18 42 66 90
Map(seq, ind[ seq_along(ind) %% 2 == 1 ], ind[ seq_along(ind) %% 2 == 0 ])
# [[1]]
# [1] 6 7 8 9 10 11 12 13 14 15 16 17 18
# [[2]]
# [1] 30 31 32 33 34 35 36 37 38 39 40 41 42
# [[3]]
# [1] 54 55 56 57 58 59 60 61 62 63 64 65 66
# [[4]]
# [1] 78 79 80 81 82 83 84 85 86 87 88 89 90
So you can use this in your function to create a list of subsets:
nr <- nrow(x)
ind <- seq(6, nr, by = 12)
out <- lapply(Map(seq, ind[ seq_along(ind) %% 2 == 1 ], ind[ seq_along(ind) %% 2 == 0 ]),
function(i) x$"2007"[i])
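We could also use split() with cut(). Breaks starting at 6 put 7:18 in the first chunk (the default intervals are open on the left); indexing with c(TRUE, FALSE) then keeps every other chunk:
split( x$"2007"[7:744] , cut(7:744,seq(6,744,12)) )[c(TRUE, FALSE)]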
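A more direct sketch of the same idea, assuming the values sit in a plain vector (here a hypothetical v of length 744 standing in for the column) and that each block is 13 points long with starts 24 apart, as in the 6:18, 30:42, 54:66 pattern from the question:

```r
v <- 1:744                                           # stand-in for x$"2007"

# Block starts: 6, 30, 54, ... ; stop early enough that s + 12 stays in range.
starts <- seq(6, length(v) - 12, by = 24)

# One list element per block.
chunks <- lapply(starts, function(s) v[s:(s + 12)])

length(chunks)
chunks[[2]]
```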
I have a table with eighty columns and I want to create columns by multiplying var1*var41 var1*var42....var1*var80. var2*var41 var2*var42...var2*var80. How could I write a loop to multiply the columns and write the labeled product into a .csv? The result should have 1600 additional columns.
I took a stab at this with some fake data:
# Fake data (arbitrary 5 rows)
mtx <- sample(1:100, 5 * 80, replace = TRUE)
dim(mtx) <- c(5,80)
colnames(mtx) <- paste0("V", 1:ncol(mtx)) # Name the original columns
mtx[1:5,1:5]
# V1 V2 V3 V4 V5
#[1,] 8 10 69 84 92
#[2,] 59 34 36 96 86
#[3,] 51 26 78 63 8
#[4,] 74 93 73 70 49
#[5,] 62 30 20 43 9
Using a for loop, one might try something like this:
v <- expand.grid(1:40,41:80) # all combos
v[c(1:3,1598:1600),]
# Var1 Var2
#1 1 41
#2 2 41
#3 3 41
#1598 38 80
#1599 39 80
#1600 40 80
# Initialize matrix for multiplication results
newcols <- matrix(NA, nrow = nrow(mtx), ncol = nrow(v))
# Run the for loop
for(i in 1:nrow(v)) newcols[,i] <- mtx[,v[i,1]] * mtx[,v[i,2]]
# save the names as "V1xV41" format with apply over rows (Margin = 1)
# meaning, for each row in v, paste "V" in front and "x" between
colnames(newcols) <- apply(v, MARGIN = 1, function(eachv) paste0("V", eachv, collapse="x"))
# combine the additional 1600 columns
tocsv <- cbind(mtx, newcols)
tocsv[,78:83] # just to view old and new columns
# V78 V79 V80 V1xV41 V2xV41 V3xV41
#[1,] 17 92 13 429 741 1079
#[2,] 70 94 1 4836 4464 5115
#[3,] 6 77 93 3740 1020 3468
#[4,] 88 34 26 486 258 66
#[5,] 48 77 61 873 4365 970
# Write it
write.csv(tocsv, "C:/Users/Evan Friedland/Documents/NEWFILENAME.csv")
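The loop can also be replaced by a single matrix multiplication of two column selections, since indexing a matrix with a vector of column numbers repeats columns as needed. This is a sketch on fresh fake data; the seed and the column names first and second are assumptions made for the example.

```r
set.seed(1)                                          # arbitrary seed for the fake data
mtx <- matrix(sample(1:100, 5 * 80, replace = TRUE), nrow = 5)
colnames(mtx) <- paste0("V", seq_len(ncol(mtx)))

v <- expand.grid(first = 1:40, second = 41:80)       # all 1600 combinations

# Elementwise product of two 5 x 1600 matrices, one column per combination.
newcols <- mtx[, v$first] * mtx[, v$second]
colnames(newcols) <- paste0("V", v$first, "xV", v$second)

tocsv <- cbind(mtx, newcols)
dim(tocsv)
```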
I have a long list of numbers, e.g.
set.seed(123)
y<-round(runif(100, 0, 200))
And I would like to store in column y the number of values that exceed each value in column x of a data frame:
df <- data.frame(x=seq(0,200,20))
I can compute the numbers manually, like this:
length(which(y>=20)) #93 values exceed 20
length(which(y>=40)) #81 values exceed 40
etc. I know I can use a for-loop with all values of x, but is there a more elegant way?
I tried this:
df$y <- length(which(y>=df$x))
But this gives a warning and does not give me the desired output.
The data frame should look like this:
df
x y
1 0 100
2 20 93
3 40 81
4 60 70
5 80 61
6 100 47
7 120 40
8 140 29
9 160 19
10 180 8
11 200 0
You can compare each value of df$x against all values of y using sapply:
sapply(df$x, function(a) sum(y>a))
#[1] 99 93 81 70 61 47 40 29 18 6 0
#Looking at your output, maybe you want
sapply(df$x, function(a) sum(y>=a))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Here's another approach using outer that allows for element wise comparison of two vectors
rowSums(outer(df$x,y, "<="))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Yet one more (from alexis_laz's comment)
length(y) - findInterval(df$x, sort(y), left.open = TRUE)
# [1] 100 93 81 70 61 47 40 29 19 8 0
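The original attempt fails because y >= df$x compares a length-100 vector with a length-11 vector by recycling (hence the warning), and length() then returns a single number. A sketch that stores the counts directly in the data frame, using vapply as a type-checked variant of the sapply call above:

```r
set.seed(123)
y <- round(runif(100, 0, 200))
df <- data.frame(x = seq(0, 200, 20))

# One count per threshold; integer(1) makes vapply check each result's type.
df$y <- vapply(df$x, function(a) sum(y >= a), integer(1))
df
```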
I have a list which has multiple vectors (total 80) of various lengths. On the x-axis I want the names of these vectors. On the y-axis I want to plot the values corresponding to each vector. How can I do it in R?
One way to do this is to reshape the data using reshape2::melt or some other method. Please try and make a reproducible example. I think this is the gist of what you are after:
set.seed(4)
mylist <- list(a = sample(1:50, 10, T),
b = sample(25:40, 15, T),
c = sample(51:75, 20, T))
mylist
# $a
# [1] 30 1 15 14 41 14 37 46 48 4
#
# $b
# [1] 37 29 26 40 31 32 40 34 40 37 36 40 33 32 35
#
# $c
# [1] 71 63 72 63 64 65 56 72 67 63 75 62 66 60 51 74 57 65 55 73
library(ggplot2)
library(reshape2)
df <- melt(mylist)
head(df)
# value L1
# 1 30 a
# 2 1 a
# 3 15 a
# 4 14 a
# 5 41 a
# 6 14 a
ggplot(df, aes(x = factor(L1), y = value)) + geom_point()
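If reshape2 is not available, base R's stack() does a similar reshape for a named list of vectors. This is a sketch with a small made-up list; the column names values and ind are the ones stack() produces.

```r
mylist <- list(a = c(3, 1, 4), b = c(1, 5), c = c(9, 2, 6, 5))

# One row per value, with the list element name in the factor column `ind`.
long <- stack(mylist)
long

# The long table can then be plotted the same way, e.g.
# ggplot(long, aes(x = ind, y = values)) + geom_point()
```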
I have this data.frame:
x <- rnorm(1000, 3, 2)
groups <- rep(c("GroupA", "GroupB"), each = 500)
df <- data.frame(x, groups)
Using dplyr, I can sample 100 rows of df then calculate the difference between the means of GroupA and GroupB:
df_difference_means <- df %>%
add_rownames %>%
filter(rowname %in% sample(1:1000, 100)) %>%
group_by(groups) %>%
summarise(mean.x = mean(x)) %>%
as.data.frame %>%
summarise(difference.mean.x = mean.x[2] - mean.x[1]) %>%
mutate(.replicate = 1) %>%
as.data.frame
difference.mean.x .replicate
1 -0.7258672 1
How, using dplyr, can I repeat this process 100 times and output the results in a data.frame? The resultant data.frame should look like df_difference_means_100:
difference.mean.x <- rnorm(100, -0.72, 2)
.replicate <- 1:100
df_difference_means_100 <- data.frame(difference.mean.x, .replicate)
df_difference_means_100
difference.mean.x .replicate
1 -1.74745341 1
2 -1.60671744 2
3 -0.73216685 3
4 2.53595482 4
5 -2.13187162 5
6 0.42921334 6
7 -1.23031115 7
8 2.66900128 8
9 -0.26267355 9
10 0.97573805 10
11 4.38242693 11
12 -2.09175166 12
13 1.17403184 13
14 0.77553541 14
15 -3.61322099 15
16 1.85055915 16
17 0.06395296 17
18 -1.42459781 18
19 2.90383461 19
20 -1.79359430 20
21 -0.43856161 21
22 1.81433832 22
23 3.15741676 23
24 -1.14643453 24
25 -2.14220126 25
26 -0.32972133 26
27 -0.27037302 27
28 2.20310891 28
29 3.05937838 29
30 0.11348566 30
31 0.09080867 31
32 -2.11559132 32
33 -0.50134470 33
34 0.31628255 34
35 0.96801232 35
36 3.42165046 36
37 2.47089399 37
38 -1.34196912 38
39 -1.11181326 39
40 -3.48664556 40
41 -2.49013457 41
42 3.67952537 42
43 -3.80781570 43
44 0.68793508 44
45 0.05869912 45
46 5.25205269 46
47 -3.00920009 47
48 -2.48109066 48
49 -0.22790952 49
50 1.41952375 50
51 0.79675613 51
52 1.13585093 52
53 0.63646903 53
54 0.56779986 54
55 -1.48099201 55
56 -0.24586261 56
57 3.16075196 57
58 -0.55765459 58
59 1.78498217 59
60 3.38490948 60
61 -0.09666898 61
62 -2.38897557 62
63 -0.50976285 63
64 4.25219676 64
65 -1.57526334 65
66 0.58006652 66
67 0.89549514 67
68 -0.17842015 68
69 -2.57422568 69
70 4.14008849 70
71 -3.48424762 71
72 -3.48788857 72
73 -4.22862573 73
74 1.98098272 74
75 0.73889898 75
76 -2.78759887 76
77 -0.75359051 77
78 -0.24062074 78
79 -0.39441863 79
80 -0.58710463 80
81 -2.95208480 81
82 -0.18225793 82
83 0.98356501 83
84 0.77963590 84
85 -1.21736133 85
86 1.36733389 86
87 -0.41273956 87
88 4.58347146 88
89 0.37946472 89
90 -5.02405002 90
91 -0.09883054 91
92 -1.99874326 92
93 -0.77896124 93
94 -0.05878099 94
95 0.82023492 95
96 2.29944232 96
97 -2.24368129 97
98 1.39608682 98
99 -0.61909894 99
100 0.74170204 100
Here's a possible approach using dplyr combined with replicate and lapply:
# define a custom function:
my_func <- function(df) {
df %>%
summarise(difference.mean.x = mean(x[groups == "GroupA"]) -
mean(x[groups == "GroupB"]))
}
# sample repeatedly (100 times) 100 rows of df and store in a list
# apply the custom function to each sample in the list,
# bind rows together and create an index column, all in a "pipe":
replicate(100, sample_n(df, 100), simplify = FALSE) %>%
lapply(., my_func) %>%
bind_rows %>%
mutate(replicate = 1:n())
#Source: local data frame [100 x 2]
#
# difference.mean.x replicate
#1 0.2531246 1
#2 -0.1595892 2
#3 0.1759745 3
#4 -0.1119139 4
#5 -0.1332090 5
#6 -0.8790818 6
#7 0.2170683 7
#8 -0.3484234 8
#9 0.2238635 9
#10 -0.4445486 10
#.. ... ...
Here's a way to put all the logic in a single dplyr pipeline, but it comes at the expense of copying df 100 times at the outset:
set.seed(111)
rep_df <- lapply(1:100, function(rep) {
df[['replicate']]=rep
df
})
rep_df <- do.call(rbind, rep_df)
rep_df %>%
group_by(replicate) %>%
sample_n(100) %>%
group_by(replicate, groups) %>%
summarize(mean_x = mean(x)) %>%
summarize(mean_x_group_diff = diff(mean_x)) -> rep_df
str(rep_df)
PS. You could also use a similar pipeline within an lapply call. This is more compact and doesn't copy df 100 times, but may be less readable:
set.seed(111)
output <- lapply(1:100, function(rep) {
sample_n(df, 100) %>%
group_by(groups) %>%
summarize(mean_x = mean(x)) %>%
summarize(mean_x_group_diff = diff(mean_x)) %>%
mutate(replicate=rep)
})
rep_df <- do.call(rbind, output)
str(rep_df)
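For comparison, here is a base R sketch of the same resampling without dplyr. The helper name one_diff and the seed are assumptions; the sign follows the question's convention of GroupB minus GroupA.

```r
set.seed(42)                                         # arbitrary seed
x <- rnorm(1000, 3, 2)
groups <- rep(c("GroupA", "GroupB"), each = 500)
df <- data.frame(x, groups)

# Hypothetical helper: sample `size` rows, return the difference in group means.
one_diff <- function(d, size = 100) {
  s <- d[sample(nrow(d), size), ]
  mean(s$x[s$groups == "GroupB"]) - mean(s$x[s$groups == "GroupA"])
}

res <- data.frame(
  difference.mean.x = replicate(100, one_diff(df)),
  .replicate = 1:100
)
head(res)
```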
You can create a dplyr-friendly helper function that repeats the rows of a data frame n times:
row_rep <- function(df, n) {
df[rep(1:nrow(df), times = n),]
}
Sourced from https://gist.github.com/mdlincoln/528a53939538b07ade86
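A short usage sketch for row_rep, on a tiny made-up data frame. The replicate column is an assumption added here so the stacked copies can be grouped afterwards.

```r
row_rep <- function(df, n) {
  df[rep(1:nrow(df), times = n), ]
}

small <- data.frame(x = c(1.5, 2.5), groups = c("GroupA", "GroupB"))

# Three copies of `small`, stacked whole copy after whole copy.
stacked <- row_rep(small, 3)

# Label each copy so it can be used with group_by(replicate) or similar.
stacked$replicate <- rep(seq_len(3), each = nrow(small))
stacked
```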