Hi, I am new to R, so please bear with me.
I have my data arranged like so:
Length Seq X
28 GTGCACCGCAAGTGCTTCTAAGAAGGAT 19
28 TGCACCGCAAGTGCTTCTAAGAAGGATC 18
29 GTGCACCGCAAGTGCTTCTAAGAAGGATC 19
29 GTGCACCGCAAGTGCTTCTAAGAAGGATC 19
and I used
count(dF, vars=c("Length", "X"))
to generate a freq table that looks like:
Length X freq
28 15 160
28 16 163
28 17 21
29 15 198
29 16 410
29 17 104
How can I rearrange the data so that it looks something like this?
Length 15 16 17 total
28 160 163 21 344
29 198 410 104 712
30 205 614 393 1212
Tot 2746 6564 2012 11322
(I know these values are wrong)
If you want it to look like your example:
# your data
df <- data.frame(Length = c(28, 28, 28, 29, 29, 29),
                 X = c(15, 16, 17, 15, 16, 17),
                 freq = c(160, 163, 21, 198, 410, 104))
Use this function:
require(reshape)
tabler <- function(a){
  b <- cast(a, Length~X)
  b <- cbind(b, rowSums(b))
  b <- rbind(b, colSums(b))
  colnames(b)[ncol(b)] <- b[nrow(b),1] <- "total"
  return(b)
}
tabler(df)
returns:
Length 15 16 17 total
1 28 160 163 21 344
2 29 198 410 104 712
3 total 358 573 125 1056
A base R option is
addmargins(xtabs(freq~Length+X, df1))
# X
#Length 15 16 17 Sum
# 28 160 163 21 344
# 29 198 410 104 712
# Sum 358 573 125 1056
data
df1 <- structure(list(Length = c(28L, 28L, 28L, 29L, 29L, 29L),
X = c(15L,
16L, 17L, 15L, 16L, 17L), freq = c(160L, 163L, 21L, 198L, 410L,
104L)), .Names = c("Length", "X", "freq"), class = "data.frame",
row.names = c(NA, -6L))
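If the raw rows (one per sequence) are still available, the intermediate count() step can be skipped. A minimal sketch, assuming the raw data frame is called dF as in the question and using only the four example rows shown there (Seq is left out because it does not affect the counts):

```r
# The four example rows from the question; Seq is omitted on purpose.
dF <- data.frame(Length = c(28, 28, 29, 29),
                 X      = c(19, 18, 19, 19))

# table() cross-tabulates the raw rows; addmargins() appends row and column sums.
tab <- addmargins(table(Length = dF$Length, X = dF$X))
print(tab)
```

Individual cells can then be read by label, for example tab["29", "19"] or tab["Sum", "Sum"].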
I have an R data frame with many columns. For each row, I want to sum only those columns (header: score) whose value in the row named "Matt" is greater than 25. The sum can be placed after the last column.
input (df1)
Name score score score score score
Alex 31 15 18 22 23
Pat 37 18 29 15 28
Matt 33 27 18 88 9
James 12 36 32 13 21
output (df2)
Name score score score score score Matt
Alex 31 15 18 22 23 68
Pat 37 18 59 55 28 110
Matt 33 27 18 88 9 148
James 12 36 32 13 21 61
Any thoughts are more than welcome,
Regards,
One option is to take the row where 'Name' is 'Matt', drop the first column, and create a logical vector ('i1') marking the values greater than 25. Use that vector to subset the columns, then take the rowSums:
i1 <- df1[df1$Name == "Matt",-1] > 25
df1$Matt <- rowSums(df1[-1][,i1], na.rm = TRUE)
Or using tidyverse
library(dplyr)
df1 %>%
  mutate(Matt = rowSums(select(cur_data(),
    where(~ is.numeric(.) && .[Name == 'Matt'] > 25))))
-output
# Name score score.1 score.2 score.3 score.4 Matt
#1 Alex 31 15 18 22 23 68
#2 Pat 37 18 29 15 28 70
#3 Matt 33 27 18 88 9 148
#4 James 12 36 32 13 21 61
data
df1 <- structure(list(Name = c("Alex", "Pat", "Matt", "James"), score = c(31L,
37L, 33L, 12L), score.1 = c(15L, 18L, 27L, 36L), score.2 = c(18L,
29L, 18L, 32L), score.3 = c(22L, 15L, 88L, 13L), score.4 = c(23L,
28L, 9L, 21L)), class = "data.frame", row.names = c(NA, -4L))
You can try the code below
df$Matt <- rowSums(df[-1] * (df[df$Name == "Matt", -1] > 25)[rep(1, nrow(df)), ])
which gives
> df
Name score score score score score Matt
1 Alex 31 15 18 22 23 68
2 Pat 37 18 29 15 28 70
3 Matt 33 27 18 88 9 148
4 James 12 36 32 13 21 61
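Both answers rely on the same idea: build a logical mask from Matt's row, then sum only the masked columns in every row. A minimal base R sketch of that idea, using the sample data from above with de-duplicated column names; the names scores and keep are illustrative and do not come from the original answers:

```r
df <- data.frame(
  Name    = c("Alex", "Pat", "Matt", "James"),
  score   = c(31L, 37L, 33L, 12L),
  score.1 = c(15L, 18L, 27L, 36L),
  score.2 = c(18L, 29L, 18L, 32L),
  score.3 = c(22L, 15L, 88L, 13L),
  score.4 = c(23L, 28L, 9L, 21L)
)

scores <- as.matrix(df[-1])                       # numeric part only
keep   <- unlist(df[df$Name == "Matt", -1]) > 25  # one TRUE/FALSE per score column

# sweep() multiplies each column by its mask entry (TRUE = 1, FALSE = 0),
# so columns where Matt scored 25 or less contribute nothing to the row sums.
df$Matt <- rowSums(sweep(scores, 2, keep, `*`))
print(df)
```

Printing keep on its own is a quick way to check which columns take part in the sum before trusting the totals.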
I have a data set that contains some missing values, which can be filled in by merging with another data set. My example:
This is the updated data set I am working with.
DF1
Name Paper Book Mug soap computer tablet coffee coupons
1 2 3 4 5 6 7 8 9
2 21 22 23 23 23 7 23 9
3 56 57 58 59 60 7 62 9
4 80.33333 81.33333 82.33333 83 83.66667 7 85 9
5 107.3333 108.3333 109.3333 110 110.6667 7 112 9
6 134.3333 135.3333 136.3333 137 137.6667 7 139 9
7 161.3333 162.3333 163.3333 164 164.6667
8 188.3333 189.3333 190.3333 191 191.6667 7 193 9
9 215.3333 216.3333 217.3333 218 218.6667 7 220 9
10 242.3333 243.3333 244.3333 245 245.6667 7 247 9
11 269.3333 270.3333 271.3333 272 272.6667 7 274 9
12 296.3333 297.3333 298.3333 299 299.6667
13 323.3333 324.3333 325.3333 326 326.6667 7 328 9
14 350.3333 351.3333 352.3333 353 353.6667 7 355 9
15 377.3333 378.3333 379.3333 380 380.6667
16 404.3333 405.3333 406.3333 407 407.6667 7 409 9
17 431.3333 432.3333 433.3333 434 434.6667 7 436 9
18 458.3333 459.3333 460.3333 461 461.6667 7 463 9
19 485.3333 486.3333 487.3333 488 488.6667
DF2
Name Paper Book Mug soap computer tablet coffee coupons
7 161.3333 162.3333 163.3333 164 164.6667 6 6 6
12 296.3333 297.3333 298.3333 299 299.6667 88 96 25
15 377.3333 378.3333 379.3333 380 380.6667 88 62 25
19 485.3333 486.3333 487.3333 488 488.6667 88 88 78
I want to get:
Name Paper Book Mug soap computer tablet coffee coupons
1 2 3 4 5 6 7 8 9
2 21 22 23 23 23 7 23 9
3 56 57 58 59 60 7 62 9
4 80.33333 81.33333 82.33333 83 83.66667 7 85 9
5 107.3333 108.3333 109.3333 110 110.6667 7 112 9
6 134.3333 135.3333 136.3333 137 137.6667 7 139 9
7 161.3333 162.3333 163.3333 164 164.6667 6 6 6
8 188.3333 189.3333 190.3333 191 191.6667 7 193 9
9 215.3333 216.3333 217.3333 218 218.6667 7 220 9
10 242.3333 243.3333 244.3333 245 245.6667 7 247 9
11 269.3333 270.3333 271.3333 272 272.6667 7 274 9
12 296.3333 297.3333 298.3333 299 299.6667 88 96 25
13 323.3333 324.3333 325.3333 326 326.6667 7 328 9
14 350.3333 351.3333 352.3333 353 353.6667 7 355 9
15 377.3333 378.3333 379.3333 380 380.6667 88 62 25
16 404.3333 405.3333 406.3333 407 407.6667 7 409 9
17 431.3333 432.3333 433.3333 434 434.6667 7 436 9
18 458.3333 459.3333 460.3333 461 461.6667 7 463 9
19 485.3333 486.3333 487.3333 488 488.6667 88 88 78
I have tried the following code:
DF1[,c(4:6)][is.na(DF1[,c(4:6)]<-DF2[,c(2:4)][match(DF1[,1],DF2[,1])]
[which(is.na(DF1[,c(4:6)]))]
One of the solutions using dplyr works only if I omit the columns that are already complete. I am not sure whether this is due to my version of dplyr, which I updated last week.
Any help is greatly appreciated! Thanks!
We can do a left join and then coalesce the columns
library(dplyr)
DF1 %>%
  left_join(DF2, by = c('NameVar')) %>%
  transmute(NameVar, Var1, Var2,
            Var3 = coalesce(Var3.x, Var3.y),
            Var4 = coalesce(Var4.x, Var4.y),
            Var5 = coalesce(Var5.x, Var5.y))
-output
# NameVar Var1 Var2 Var3 Var4 Var5
#1 Sub1 30 45 40 34 65
#2 Sub2 25 30 30 45 45
#3 Sub3 74 34 25 30 49
#4 Sub4 30 45 40 34 65
#5 Sub5 25 30 69 56 72
#6 Sub6 74 34 74 34 60
Or using data.table
library(data.table)
nm1 <- setdiff(intersect(names(DF1), names(DF2)), 'NameVar')
setDT(DF1)[DF2, (nm1) := Map(fcoalesce, mget(nm1),
mget(paste0("i.", nm1))), on = .(NameVar)]
data
DF1 <- structure(list(NameVar = c("Sub1", "Sub2", "Sub3", "Sub4", "Sub5",
"Sub6"), Var1 = c(30L, 25L, 74L, 30L, 25L, 74L), Var2 = c(45L,
30L, 34L, 45L, 30L, 34L), Var3 = c(40L, NA, NA, 40L, 69L, NA),
Var4 = c(34L, NA, NA, 34L, 56L, NA), Var5 = c(65L, NA, NA,
65L, 72L, NA)), class = "data.frame", row.names = c(NA, -6L
))
DF2 <- structure(list(NameVar = c("Sub2", "Sub3", "Sub6"), Var3 = c(30L,
25L, 74L), Var4 = c(45L, 30L, 34L), Var5 = c(45L, 49L, 60L)),
class = "data.frame", row.names = c(NA,
-3L))
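For comparison, the same fill can be sketched in base R with match(), which is close to what the attempt in the question was reaching for. This assumes the DF1/DF2 example data defined above and that NameVar is unique in DF2; the names shared, idx and gap are illustrative:

```r
DF1 <- data.frame(
  NameVar = c("Sub1", "Sub2", "Sub3", "Sub4", "Sub5", "Sub6"),
  Var1 = c(30L, 25L, 74L, 30L, 25L, 74L),
  Var2 = c(45L, 30L, 34L, 45L, 30L, 34L),
  Var3 = c(40L, NA, NA, 40L, 69L, NA),
  Var4 = c(34L, NA, NA, 34L, 56L, NA),
  Var5 = c(65L, NA, NA, 65L, 72L, NA)
)
DF2 <- data.frame(
  NameVar = c("Sub2", "Sub3", "Sub6"),
  Var3 = c(30L, 25L, 74L),
  Var4 = c(45L, 30L, 34L),
  Var5 = c(45L, 49L, 60L)
)

# Columns present in both data frames, excluding the key.
shared <- setdiff(intersect(names(DF1), names(DF2)), "NameVar")
# Row of DF2 that corresponds to each row of DF1 (NA when there is no match).
idx <- match(DF1$NameVar, DF2$NameVar)

for (col in shared) {
  gap <- is.na(DF1[[col]])
  # Only NA cells are overwritten; existing values in DF1 are kept.
  DF1[[col]][gap] <- DF2[[col]][idx[gap]]
}
print(DF1)
```

Because only the shared columns are touched, columns that are already complete (Var1 and Var2 here) do not need to be dropped first.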
structure(list(Date = c("KW 52 / 2016", "KW 1 / 2017", "KW 2 / 2017",
"KW 3 / 2017"), Sales_AT = c(150L, 169L, 143L, 170L), Sales_CH = c(150L,
169L, 143L, 170L), Sales_GER = c(150L, 169L, 143L, 170L), Sales_HUN = c(134L,
139L, NA, 125L), Sales_JP = c(134L, NA, 142L, 125L), Sales_POL = c(127L,
175L, 150L, 141L), Sales_SWE = c(125L, NA, 159L, 131L), Sales_USA = c(169L,
159L, NA, 132L), difference_AT = c(NA, 19L, -26L, 27L), difference_CH = c(NA,
19L, -26L, 27L), difference_GER = c(NA, 19L, -26L, 27L), difference_HUN = c(NA,
5L, NA, -14L), difference_JP = c(NA, NA, 8L, -17L), difference_POL = c(NA,
48L, -25L, -9L), difference_SWE = c(NA, NA, 34L, -28L), difference_USA = c(NA,
-10L, NA, -27L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
This is my dataset which looks like this:
A tibble: 4 x 17
Date Sales_AT Sales_CH Sales_GER Sales_HUN Sales_JP Sales_POL Sales_SWE Sales_USA difference_AT difference_CH difference_GER difference_HUN difference_JP difference_POL difference_SWE difference_USA
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 KW 52 / 2016 150 150 150 134 134 127 125 169 NA NA NA NA NA NA NA NA
2 KW 1 / 2017 169 169 169 139 NA 175 NA 159 19 19 19 5 NA 48 NA -10
3 KW 2 / 2017 143 143 143 NA 142 150 159 NA -26 -26 -26 NA 8 -25 34 NA
4 KW 3 / 2017 170 170 170 125 125 141 131 132 27 27 27 -14 -17 -9 -28 -27
I want to reorder the dataset to have the sales and difference column of each country next to each other.
I'm looking for a dplyr solution that works like this, but in a dynamic way:
wide_result %>%
select(contains("AT"), contains("CH"), contains("HUN"), contains("JP"), contains("USA"))
Can anyone help me?
Using base R:
df[c(1, order(sub(".*_", "", names(df)[-1])) + 1)]
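The one-liner is compact, so here it is unpacked step by step. This is a minimal sketch on a cut-down version of the data (two rows, two countries); the names cols, country, ord and result are illustrative:

```r
df <- data.frame(Date = c("KW 52 / 2016", "KW 1 / 2017"),
                 Sales_AT = c(150L, 169L),
                 Sales_CH = c(150L, 169L),
                 difference_AT = c(NA, 19L),
                 difference_CH = c(NA, 19L))

cols    <- names(df)[-1]          # every column except Date
country <- sub(".*_", "", cols)   # strip everything up to the last "_": "AT" "CH" "AT" "CH"
ord     <- order(country)         # ties keep their original order, so Sales_ stays before difference_

result <- df[c("Date", cols[ord])]
print(names(result))
```

Sorting on the country suffix rather than listing the countries by hand is what makes the approach work for any number of countries.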
Here's a way we can do it. Basically, we put the names of the data into a tibble, extract the part of the name after the _ (when possible), and then sort by that extracted text.
names_sort <- tibble(nn = names(dat)) %>%
  filter(nn != "Date") %>% # remove Date column, since we'll select that first
  # replace everything before and up to _ with ""
  mutate(names_fix = gsub(".*_", "", nn)) %>%
  arrange(names_fix) %>%
  pull(nn)
dat %>%
  # all_of() makes it explicit that names_sort is an external character vector
  select(Date, all_of(names_sort))
# Date Sales_AT difference_AT Sales_CH difference_CH
# <chr> <int> <int> <int> <int>
# 1 KW 52 / 2016 150 NA 150 NA
# 2 KW 1 / 2017 169 19 169 19
# 3 KW 2 / 2017 143 -26 143 -26
# 4 KW 3 / 2017 170 27 170 27
You can use dplyr select_at:
vars <- c("CH", "AT")
df %>%
  select_at(vars(one_of("Date",
                        paste0("Sales_", vars),
                        paste0("difference_", vars))))
# A tibble: 4 x 5
Date Sales_CH Sales_AT difference_CH difference_AT
<chr> <int> <int> <int> <int>
1 KW 52 / 2016 150 150 NA NA
2 KW 1 / 2017 169 169 19 19
3 KW 2 / 2017 143 143 -26 -26
4 KW 3 / 2017 170 170 27 27
I'm using R and I have a data frame called df which has (n*P) rows and N columns.
C1 C2 ... CN-1 CN
1-1 100 36 ... 136 76
1-2 120 -33 ... 87 42
1-3 150 14 ... 164 24
:
1-n 20 36 ... 136 76
2-1 109 26 ... 166 87
2-2 -33 87 ... 42 24
2-3 100 36 ... 136 76
:
2-n 100 36 ... 136 76
:
P-1 150 14 ... 164 24
P-2 100 36 ... 765 76
P-3 150 14 ... 164 94
:
P-n 10 26 ... 106 76
And I want to transform this data frame into a data frame with n rows and (N*P) columns. The new data frame, df.new, should look like
C1-1 C2-1 ... CN-1-1 CN-1 C1-2 C2-2 ... CN-1-2 CN-2 ... C1-P C2-P ... CN-1-P CN-P
R1 100 36 ... 136 76 20 36 ... 136 76 ... 150 14 ... 164 24
R2 120 -33 ... 87 42 109 26 ... 166 87 ... 100 36 ... 765 76
:
:
Rn 20 36 ... 136 76 100 36 ... 136 76 ... 10 26 ... 106 76
That is to say, the first N columns of df.new are the rbind of rows 1-1, 2-1, 3-1, ... , P-1 of df. The next N columns of df.new are the rbind of rows 1-2, 2-2, 3-2, ... , P-2 of df. This continues until the last N columns of df.new, which are composed of rows 1-n, 2-n, 3-n, ... , P-n of df. (R1 of df.new is the cbind of rows 1-1, 1-2,...,1-n. R2 of df.new is the cbind of rows 2-1, 2-2,...,2-n. Rn of df.new is the cbind of rows P-1, P-2,...,P-n.)
n, P and N are variables, so their values depend on the case. I tried to create df.new using for loops, but it doesn't work well.
Here is my attempt, which I more or less gave up on.
for (j in 1:n) {
  df.new <- data.frame(matrix(vector(), 1, dim(df)[2],
                              dimnames = list(c(), colnames(df))),
                       stringsAsFactors=F)
  for (i in 1:nrow(df)) {
    if (i %% n == 0) {
      df.new <- rbind(df.new, df[i,])
    } else if (i %% n == j) {
      df.new <- rbind(df.new, df[i,])
    }
  }
  assign(paste0("df.new", j), df.new)
}
library(dplyr)
library(tidyr)
library(tibble)
df %>%
  rownames_to_column("rowname") %>%
  separate(rowname, c("rowname_prefix", "rowname_suffix"), "-") %>%
  gather(col_name, value, -rowname_prefix, -rowname_suffix) %>%
  mutate(col_name = paste(col_name, rowname_prefix, sep="-")) %>%
  select(-rowname_prefix) %>%
  spread(col_name, value) %>%
  mutate(rowname_suffix = paste0("R", rowname_suffix)) %>%
  column_to_rownames("rowname_suffix")
Output is:
C1-1 C1-2 C1-3 C2-1 C2-2 C2-3 C3-1 C3-2 C3-3 C4-1 C4-2 C4-3
R1 100 109 150 36 26 14 136 166 164 76 87 24
R2 120 -33 100 -33 87 36 87 42 765 42 24 76
R3 150 100 150 14 36 14 164 136 164 24 76 94
R4 20 100 10 36 36 26 136 136 106 76 76 76
Sample data:
df <- structure(list(C1 = c(100L, 120L, 150L, 20L, 109L, -33L, 100L,
100L, 150L, 100L, 150L, 10L), C2 = c(36L, -33L, 14L, 36L, 26L,
87L, 36L, 36L, 14L, 36L, 14L, 26L), C3 = c(136L, 87L, 164L, 136L,
166L, 42L, 136L, 136L, 164L, 765L, 164L, 106L), C4 = c(76L, 42L,
24L, 76L, 87L, 24L, 76L, 76L, 24L, 76L, 94L, 76L)), .Names = c("C1",
"C2", "C3", "C4"), class = "data.frame", row.names = c("1-1",
"1-2", "1-3", "1-4", "2-1", "2-2", "2-3", "2-4", "3-1", "3-2",
"3-3", "3-4"))
# C1 C2 C3 C4
#1-1 100 36 136 76
#1-2 120 -33 87 42
#1-3 150 14 164 24
#1-4 20 36 136 76
#2-1 109 26 166 87
#2-2 -33 87 42 24
#2-3 100 36 136 76
#2-4 100 36 136 76
#3-1 150 14 164 24
#3-2 100 36 765 76
#3-3 150 14 164 94
#3-4 10 26 106 76
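The same reshaping can also be sketched in base R, without the long-format round trip. It yields the same cell values as the output above, but keeps the columns grouped by prefix (C1-1, C2-1, ..., then C1-2, ...). The sketch assumes every prefix has the same number of rows and that row names follow the "prefix-suffix" pattern; the names parts, prefix, suffix and blocks are illustrative:

```r
df <- data.frame(
  C1 = c(100L, 120L, 150L, 20L, 109L, -33L, 100L, 100L, 150L, 100L, 150L, 10L),
  C2 = c(36L, -33L, 14L, 36L, 26L, 87L, 36L, 36L, 14L, 36L, 14L, 26L),
  C3 = c(136L, 87L, 164L, 136L, 166L, 42L, 136L, 136L, 164L, 765L, 164L, 106L),
  C4 = c(76L, 42L, 24L, 76L, 87L, 24L, 76L, 76L, 24L, 76L, 94L, 76L),
  row.names = paste(rep(1:3, each = 4), rep(1:4, times = 3), sep = "-")
)

# Split the "prefix-suffix" row names into their two parts.
parts  <- do.call(rbind, strsplit(rownames(df), "-", fixed = TRUE))
prefix <- as.integer(parts[, 1])
suffix <- as.integer(parts[, 2])

# One block of rows per prefix; within each block, order rows by suffix,
# tag the column names with the prefix, then bind the blocks side by side.
blocks <- lapply(sort(unique(prefix)), function(p) {
  rows  <- which(prefix == p)
  block <- df[rows[order(suffix[rows])], , drop = FALSE]
  names(block)    <- paste(names(block), p, sep = "-")
  rownames(block) <- NULL
  block
})
df.new <- do.call(cbind, blocks)
rownames(df.new) <- paste0("R", seq_len(nrow(df.new)))
print(df.new)
```

Converting the prefix and suffix to integers keeps the blocks in numeric order even when there are ten or more of them.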
I have a dataframe with 10 variables, all of them numeric. One of the variables is age, and I want to group the observations based on it. For example, ages 17 to 18 form one group and ages 19 to 22 another, and each row should be assigned to its group. The result should be a dataframe for further manipulation.
Model of the dataframe:
A B AGE
25 50 17
30 42 22
50 60 19
65 105 17
355 400 21
68 47 20
115 98 18
25 75 19
And I want result like
17-18
A B AGE
25 50 17
65 105 17
115 98 18
19-22
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
I grouped the dataset by the AGE variable using the split function. Now my concern is how to manipulate the grouped data. E.g., the result looked like:
$1
A B AGE
25 50 17
65 105 17
115 98 18
$2
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
My question is: how can I access each group for further manipulation?
For example, what if I want to do a t-test for each group separately?
The split function will work with dataframes. Use either cut with 'breaks' or findInterval with an appropriate set of cutpoints (named 'vec' if you are using named parameters) as the criterion for grouping, the second argument to split. The default for cut is intervals closed on the right and default for findInterval is closed on the left.
> split(dat, findInterval(dat$AGE, c(17, 19.5, 22.5)))
$`1`
A B AGE
1 25 50 17
3 50 60 19
4 65 105 17
7 115 98 18
8 25 75 19
$`2`
A B AGE
2 30 42 22
5 355 400 21
6 68 47 20
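Note that the cutpoints above put age 19 in the first group. To get the 17-18 and 19-22 groups asked for in the question, the cutpoints can be moved. A minimal sketch, assuming the sample data from the question and integer ages; the group labels are added for readability and are not part of the original answer:

```r
dat <- data.frame(
  A   = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L),
  B   = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L),
  AGE = c(17L, 22L, 19L, 17L, 21L, 20L, 18L, 19L)
)

# findInterval() uses left-closed intervals: [17, 19) -> 1, [19, 23) -> 2.
grp    <- findInterval(dat$AGE, c(17, 19, 23))
groups <- split(dat, factor(grp, levels = 1:2, labels = c("17-18", "19-22")))
print(groups)
```

Each group is an ordinary data frame and can be pulled out by name, for example groups[["19-22"]].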
Here is the approach with cut
lst <- split(df1, cut(df1$AGE, breaks=c(16, 18, 22), labels=FALSE))
lst
# $`1`
# A B AGE
#1 25 50 17
#4 65 105 17
#7 115 98 18
#$`2`
# A B AGE
#2 30 42 22
#3 50 60 19
#5 355 400 21
#6 68 47 20
#8 25 75 19
Update
If you need to find the sum, mean of columns for each "list" element
lapply(lst, function(x) rbind(colSums(x[-3]),colMeans(x[-3])))
But, if the objective is to find the summary statistics based on the group, it can be done using any of the aggregating functions
library(dplyr)
df1 %>%
  group_by(grp=cut(AGE, breaks=c(16, 18, 22), labels=FALSE)) %>%
  summarise_each(funs(sum=sum(., na.rm=TRUE),
                      mean=mean(., na.rm=TRUE)), A:B)
# grp A_sum B_sum A_mean B_mean
#1 1 205 253 68.33333 84.33333
#2 2 528 624 105.60000 124.80000
Or using aggregate from base R
do.call(data.frame,
        aggregate(cbind(A,B)~cbind(grp=cut(AGE, breaks=c(16, 18, 22),
                  labels=FALSE)), df1, function(x) c(sum=sum(x), mean=mean(x))))
data
df1 <- structure(list(A = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L
), B = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L), AGE = c(17L,
22L, 19L, 17L, 21L, 20L, 18L, 19L)), .Names = c("A", "B", "AGE"
), class = "data.frame", row.names = c(NA, -8L))
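The question also asks how to run a t-test on each group separately, which none of the code above does directly. Since split() returns a list of data frames, lapply() can apply any function to each element. A minimal sketch, assuming the test of interest is a Welch two-sample t-test of A against B within each age group; that choice is an assumption, as the question does not say which comparison is wanted:

```r
df1 <- data.frame(
  A   = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L),
  B   = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L),
  AGE = c(17L, 22L, 19L, 17L, 21L, 20L, 18L, 19L)
)

# Same grouping as above, but with readable labels instead of integer codes.
lst <- split(df1, cut(df1$AGE, breaks = c(16, 18, 22),
                      labels = c("17-18", "19-22")))

# One t.test() result per group (assumed comparison: column A vs. column B).
tests <- lapply(lst, function(g) t.test(g$A, g$B))

# Pull a single statistic out of every result, here the p-value.
p_values <- sapply(tests, function(tt) tt$p.value)
print(p_values)
```

A single group's full result is available as tests[["17-18"]]. With only three and five rows per group, the sample data is far too small for the tests to be meaningful; the point is only the access pattern.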