How do I regroup data? - r

I am looking to change the structure of my dataframe, but I am not really sure how to do it. I am not even sure how to word the question either.
ID <- c(1,8,6,2,4)
a <- c(111,94,85,76,72)
b <- c(75,37,86,55,62)
dataframe <- data.frame(ID,a,b)
ID a b
1 1 111 75
2 8 94 37
3 6 85 86
4 2 76 55
5 4 72 62
Above is the code with the output. I want the output to look like the following, but the only way I know how to do this is to type it manually. Is there any other way besides changing the input by hand? I have quite a large data set that I would like to change, and doing it manually would take forever.
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62

We may use pivot_longer
library(dplyr)
library(tidyr)
dataframe %>%
pivot_longer(cols = a:b, names_to = 'letter')
Output
# A tibble: 10 × 3
ID letter value
<dbl> <chr> <dbl>
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
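The value column is named value by default; to set it explicitly you can add values_to, e.g.
dataframe %>%
pivot_longer(cols = a:b, names_to = 'letter', values_to = 'value')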

A base R option using reshape:
df <- reshape(dataframe, direction = "long",
v.names = "value",
varying = 2:3,
times = names(dataframe)[2:3],
timevar = "letter",
idvar = "ID")
df <- df[ order(match(df$ID, dataframe$ID)), ]
row.names(df) <- NULL
Output
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
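For larger data, an equivalent sketch with data.table (this assumes the data.table package is available; melt stacks all of column a first and then column b, so we reorder by ID afterwards to match the interleaved output above):
library(data.table)
out <- melt(as.data.table(dataframe), id.vars = "ID",
            variable.name = "letter", value.name = "value")
out[order(match(ID, dataframe$ID))]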

Related

R: How to merge a new data frame to several other data frames in a list

I have several separate data frames that I would like to keep separate, because merging them together would create a very large object.
However, there are variables from another data frame that I would like to merge with all of them now.
Here is an example of what I would like to do:
df1 <- data.frame(ID1 = c(1:10), Var1 = rep(c(1,0),5))
df2 <- data.frame(ID1 = c(1:10), Var2 = c(21:30))
dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
mergewith <- data.frame(ID1 = c(1:10), ID2 = c(41:50))
My goal is that df1 and df2 will look like this:
df1
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
df2
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
What I have tried so far is:
dat = lapply(dfs,function(x){
merge(names(x), mergewith, by = "ID1");x})
list2env(dat,.GlobalEnv)
However, then I get the following message:
"'by' must specify a uniquely valid column"
Is it possible to do this without using a loop?
You can try Map
> Map(function(x, y) merge(x, y, by = "ID1"), dfs, list(mergewith))
[[1]]
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
[[2]]
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
You can use lapply to merge all the dataframes in dfs with mergewith. Use list2env to get the changed dataframes in the global environment.
list2env(lapply(dfs, function(x) merge(x, mergewith, by = 'ID1')), .GlobalEnv)

Creating a new dataset for each combination of rows in groups

I'm trying to create a dataset for each combination of rows from separate groups. Ideally, one row from each group would be selected and there would be a dataset for every combination. I have a dataset that looks similar in structure to the sample below:
Name Group Stat1 Stat2
1 1 a 63 38
2 2 a 33 62
3 3 b 3 66
4 4 b 57 67
5 5 c 42 69
6 6 c 47 14
7 7 c 16 10
8 8 d 21 46
9 9 d 72 1
Trying to get the end result of the first dataset to look like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 8 d 21 46
With the second dataset looking like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 9 d 72 1
Until every combination has been exhausted. I've tried strategies using apply functions and combn but cannot seem to get the result I want. This does not seem too challenging to me conceptually, so I'm not sure what I'm missing.
Any help would be greatly appreciated! Thanks in advance!
Lots of ways to approach this. A simple solution is to just generate all 4-row combos, then subset to those with all distinct Group values. I named your data df and assumed Name is a unique row id. If that's not true, you could replace df$Name with 1:nrow(df)
# All 4 row combos of row ids
combs <- combn(df$Name, 4)
# Match group labels to row ids
g <- matrix(df$Group[combs], nrow = 4)
# 4 row combs filtered to all distinct group vals
combs <- combs[,apply(g, 2, function(i) all(!duplicated(i)))]
# For each 4 row combo, extract rows from the dataframe
final_list <- apply(combs, 2, function(i) df[i,])
final_list[1:3]
[[1]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
8 8 d 21 46
[[2]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
9 9 d 72 1
[[3]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
6 6 c 47 14
8 8 d 21 46
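An alternative sketch (assuming df is the sample data above): build the combinations group by group with expand.grid, which avoids generating combinations that would later be filtered out.
# row indices per group
idx <- split(seq_len(nrow(df)), df$Group)
# one row of grid per valid combination (one index drawn from each group)
grid <- expand.grid(idx)
# extract the corresponding rows for each combination
final_list <- apply(grid, 1, function(i) df[as.integer(i), ])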

Label columns with an ascending number [duplicate]

This question already has answers here: Make sequential numeric column names prefixed with a letter (3 answers). Closed 2 years ago.
I want to label columns with an ascending number. The reason is that in a bigger dataset I want to be able to sort the columns so they end up in the right order.
How do I code this? Thanks!
set.seed(8)
id <- 1:6
diet <- rep(c("A","B"),3)
period <- rep(c(1,2),3)
score1 <- sample(1:100,6)
score2 <- sample(1:100,6)
score3 <- sample(1:100,6)
df <- data.frame(id, diet, period, score1, score2,score3)
df
id diet period score1 score2 score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
It should look like:
x1id x2diet x3period x4score1 x5score2 x6score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
I was thinking something like this, but something is missing....
colnames(wellbeing) <- paste(1:ncol, colnames(wellbeing))
A couple of options:
colnames(df) <- paste0('x', 1:dim(df)[2], colnames(df))
or
df %>%
dplyr::rename_all(~ paste0('x', 1:ncol(df), .))
Both methods would yield the same output:
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
You can use:
names(df) <- paste0('x', seq_along(df), names(df))
df
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
Maybe add an underscore?
names(df) <- paste0('x', seq_along(df), "_", names(df))
names(df)
#[1] "x1_id" "x2_diet" "x3_period" "x4_score1" "x5_score2" "x6_score3"
Here is a mapply approach.
mapply(paste0, paste0("x", 1:ncol(df)), names(df))
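The mapply call returns a named character vector, so to apply it you would assign it back to the names, e.g.:
names(df) <- mapply(paste0, paste0("x", 1:ncol(df)), names(df))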

How to read such a file into R? [closed]

I have a file which contains data format like this:
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
A_row 17 16 10 12 9 15 10 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 3 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
71 72 73 74 75 76 77 78 80 81 83 84 85 86 87 88 89 90 94 97 103 104
A_row 1 6 0 2 9 5 1 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 2 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
Is there any way to read this format into R? Thanks! :>
library(stringi)
library(dplyr)
library(magrittr)
library(tidyr)
text =
"48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
A_row 17 16 10 12 9 15 10 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 3 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
71 72 73 74 75 76 77 78 80 81 83 84 85 86 87 88 89 90 94 97 103 104
A_row 1 6 0 2 9 5 1 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 2 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1"
df =
text %>%
# split over newlines (could also be accomplished by readLines)
stri_split_fixed(pattern = "\n") %>%
# need to take first list corresponding to text
extract2(1) %>%
# make the text a column in the dataframe
{data_frame(values = .)} %>%
# identify rows based on what type of data they contain
# assume a repeating pattern every 3 lines
mutate(variable = c("id", "A_row", "B_row") %>% rep(length.out = n())) %>%
# for each type of data
group_by(variable) %>%
summarize(value =
values %>%
# concatenate all values
paste(collapse = " ") %>%
# remove headers (might need to modify regex)
stri_replace_all_regex("[A-Z]_row ", "") %>%
# split as space separated data
stri_split_regex(pattern = " +")) %>%
# unnest the lists
unnest(value) %>%
# make values numeric
mutate(value = as.numeric(value)) %>%
# for each variable, number 1 through n() to guess new row ID's
group_by(variable) %>%
mutate(n = 1:n()) %>%
# reshape data
spread(variable, value)
As commented above, one approach would be to use read.delim (maybe in chunks using skip & nrows), and then cbind to reassemble them.
Depending on the file (as pasted it looks like it might need additional preprocessing to be used with read.delim), another approach would be to use readLines and strsplit
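A rough sketch of that readLines/strsplit route, assuming the file is named "data.txt" (a placeholder), repeats three-line blocks of header / A_row / B_row, and each line within a block has the same number of entries:
lines <- readLines("data.txt")
blocks <- split(lines, rep(seq_len(length(lines) / 3), each = 3))
pieces <- lapply(blocks, function(b) {
  ids <- strsplit(trimws(b[1]), " +")[[1]]
  a   <- strsplit(trimws(b[2]), " +")[[1]][-1]  # drop the "A_row" label
  bb  <- strsplit(trimws(b[3]), " +")[[1]][-1]  # drop the "B_row" label
  data.frame(id = as.numeric(ids), A_row = as.numeric(a), B_row = as.numeric(bb))
})
result <- do.call(rbind, pieces)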

How to efficiently find pattern in one column and assign a corresponding value to another column in a list of data frames?

I have a list of 15 data frames, each with 13 columns (time + 6 stations with 3 layers each) and 172 rows. I want to collapse those columns (observations at stations) into basically two columns (one for station and one for observation) by applying a function over the whole list. Here I use gather from tidyr. In addition, I want to find a pattern (upper, middle or lower layer) in one of the columns and assign a corresponding value (depth) in a new column. For this I use ddply from plyr and grepl. My problem is that it is very slow. I guess I created a bottleneck with my limited R knowledge. So where is the bottleneck, and how can I improve it?
An example:
data <- list(a = data.frame(time = 1:180, alpha.upper = sample(1:180),
                            beta.middle = sample(1:180), gamma.lower = sample(1:180)),
             b = data.frame(time = 1:180, alpha.upper = sample(1:180),
                            beta.middle = sample(1:180), gamma.lower = sample(1:180)))
> data
$a
time alpha.upper beta.middle gamma.lower
1 1 133 179 99
2 2 175 147 56
3 3 169 9 24
4 4 116 129 75
5 5 92 65 65
6 6 141 73 49
$b
time alpha.upper beta.middle gamma.lower
1 1 111 2 89
2 2 84 81 159
3 3 93 82 84
4 4 44 58 125
5 5 31 33 131
6 6 1 120 63
my code is:
> data2<-lapply(data, function(x) {
x<-gather(x,stn,value,-time)
x<-arrange(x,time)
x<-ddply(x,c("time","stn","value"), function(x) {
if (grepl(".upper",x$stn) == TRUE)
{
x$depth<-1
return(x)
}
if (grepl(".lower",x$stn) == TRUE)
{
x$depth<-3
return(x)
}
if (grepl(".middle",x$stn) == TRUE)
{
x$depth<-2
return(x)
}
})
return(x)
})
the result should be:
> data2
$a
time stn value depth
1 1 alpha.upper 111 1
2 1 beta.middle 2 2
3 1 gamma.lower 89 3
4 2 alpha.upper 84 1
5 2 beta.middle 81 2
6 2 gamma.lower 159 3
$b
1 1 alpha.upper 38 1
2 1 beta.middle 151 2
3 1 gamma.lower 93 3
4 2 alpha.upper 61 1
5 2 beta.middle 56 2
6 2 gamma.lower 66 3
First of all let's reproduce your data.
dataa <- read.table(text =
"time alpha.upper beta.middle gamma.lower
1 133 179 99
2 175 147 56
3 169 9 24
4 116 129 75
5 92 65 65
6 141 73 49", header = T, sep = " ")
datab <- read.table(text =
"time alpha.upper beta.middle gamma.lower
1 1 111 2 89
2 2 84 81 159
3 3 93 82 84
4 4 44 58 125
5 5 31 33 131
6 6 1 120 63", header = T, sep = " ")
mydata <- list(a = dataa, b = datab)
# $a
# time alpha.upper beta.middle gamma.lower
# 1 1 133 179 99
# 2 2 175 147 56
# 3 3 169 9 24
# 4 4 116 129 75
# 5 5 92 65 65
# 6 6 141 73 49
# $b
# time alpha.upper beta.middle gamma.lower
# 1 1 111 2 89
# 2 2 84 81 159
# 3 3 93 82 84
# 4 4 44 58 125
# 5 5 31 33 131
# 6 6 1 120 63
Here I named the variable mydata because there is a function data in the standard package utils, and it's better not to use that name for a variable.
As far as I understand it, you need to convert every data.frame in the list from "wide" to "long" form. You can use gather from the tidyr package, and in my opinion it's a fine choice, but for this situation I show how we can get the same result with base R tools.
rebuilddf <- function(df)
{ # first of all see the difference between rep(1:3, each = 3) and rep(1:3, times = 3)
res_df <- data.frame(
time = rep(df$time, each = 3),# first column of new data.frame -
# we repeat each time mark 3 times
# as we know that there are exactly 3
# observations for single time: upper, middle, lower
stn = rep(colnames(df)[-1], times = nrow(df)), # second column
# fill it with words "alpha.upper",
# "beta.middle", "gamma.lower" which are colnames(df)[-1]
# repeated nrow(df) times
value = as.vector(t(as.matrix(df[,-1]))) ) #
# the values in columns 2:4 of our data.frame are
# transposed and then flattened into a vector,
# so the result is like reading the data row by row
# to understand what's happening with the matrix you can try this code
# m <- matrix(1:20, nrow = 4)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 5 9 13 17
# [2,] 2 6 10 14 18
# [3,] 3 7 11 15 19
# [4,] 4 8 12 16 20
# as.vector(t(m))
# 1 5 9 13 17 2 6 10 14 18 3 7 11 15 19 4 8 12 16 20
# after that we add column "depth"
# as I understand it, we need 1 for "upper", 2 for "middle" and 3 for "lower"
# we make it with the help of two nested ifelse functions
res_df <- transform(res_df, depth = ifelse(stn == "alpha.upper", 1,
ifelse(stn == "beta.middle", 2, 3)) )
return(res_df)
}
If the column names are not always the same, and only the end of each name is invariant, we can modify the condition for depth as follows:
# split each column name on every dot and keep the last piece,
# which is "upper", "middle" or "lower"
layer <- sapply(strsplit(as.character(res_df$stn), "[.]"),
                function(parts) rev(parts)[1])
res_df$depth <- ifelse(layer == "upper", 1,
                ifelse(layer == "middle", 2, 3))
# here we split character strings of the form "some.name1.upper" or
# "some.other.colname.lower" on every dot, then take the last part
# of each (rev reverses the order of the pieces); sapply applies
# this element-wise, so the whole stn column is handled at once
You may also modify the condition and use grepl, but I believe it will be faster with strsplit.
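For reference, a grepl-based version of the same condition might look like this (assuming the column names always end in upper, middle or lower):
res_df <- transform(res_df,
                    depth = ifelse(grepl("upper$", stn), 1,
                            ifelse(grepl("middle$", stn), 2, 3)))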
Now that we've finished our rebuilddf function, let's see what it does.
lapply(mydata, rebuilddf)
# $a
# time stn value depth
# 1 1 alpha.upper 133 1
# 2 1 beta.middle 179 2
# 3 1 gamma.lower 99 3
# 4 2 alpha.upper 175 1
# 5 2 beta.middle 147 2
# 6 2 gamma.lower 56 3
# 7 3 alpha.upper 169 1
# 8 3 beta.middle 9 2
# 9 3 gamma.lower 24 3
# 10 4 alpha.upper 116 1
# 11 4 beta.middle 129 2
# 12 4 gamma.lower 75 3
# 13 5 alpha.upper 92 1
# 14 5 beta.middle 65 2
# 15 5 gamma.lower 65 3
# 16 6 alpha.upper 141 1
# 17 6 beta.middle 73 2
# 18 6 gamma.lower 49 3
#
# $b
# time stn value depth
# 1 1 alpha.upper 111 1
# 2 1 beta.middle 2 2
# 3 1 gamma.lower 89 3
# 4 2 alpha.upper 84 1
# 5 2 beta.middle 81 2
# 6 2 gamma.lower 159 3
# 7 3 alpha.upper 93 1
# 8 3 beta.middle 82 2
# 9 3 gamma.lower 84 3
# 10 4 alpha.upper 44 1
# 11 4 beta.middle 58 2
# 12 4 gamma.lower 125 3
# 13 5 alpha.upper 31 1
# 14 5 beta.middle 33 2
# 15 5 gamma.lower 131 3
# 16 6 alpha.upper 1 1
# 17 6 beta.middle 120 2
# 18 6 gamma.lower 63 3
I hope this is your desired output, though note that in the question your expected output for a shows the numbers from b, and vice versa.
