Splitting columns of a dataframe to merge a repetitive variable - r

I normally find an answer in previous questions posted here, but I can't seem to find this one, so here is my maiden question:
I have a dataframe in which one column contains repeated values. I would like to keep each value only once in that column and spread the other columns out, so that the result has fewer rows and more columns than the original dataframe.
Example:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
The original dataframe has 3 columns and 15 rows.
It would turn into a dataframe with 5 rows and 7 columns: 'test', 'time1', 'time2', 'time3', 'score1', 'score2', 'score3'.
Does anyone have an idea how this could be done?

I think using dcast with rowid from the data.table-package is well suited for this task:
library(data.table)
dcast(setDT(df), test ~ rowid(test), value.var = c('time','score'), sep = '')
The result:
test time1 time2 time3 score1 score2 score3
1: 1 52 3 29 21 131 45
2: 2 79 44 6 119 1 186
3: 3 67 95 39 18 459 121
4: 4 83 50 40 493 466 497
5: 5 46 14 4 465 9 24
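To see what rowid contributes here, a minimal sketch (assuming data.table is installed; the variable name occurrence is only for illustration): it numbers each occurrence of a value in test, and that counter becomes the suffix of the new column names.

```r
library(data.table)

# Same 'test' column as in the question: 1..5 repeated three times
test <- rep(1:5, 3)

# rowid() counts the occurrences of each value in order of appearance
occurrence <- rowid(test)
print(occurrence)
# 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
```

The first block of five rows gets 1, the second 2 and the third 3, which is why dcast produces time1..time3 and score1..score3.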

Please try this:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
df$class <- c(rep('a', 5), rep('b', 5), rep('c', 5))
df <- split(x = df, f = df$class)
binded <- cbind(df[[1]], df[[2]], df[[3]])
binded <- binded[,-c(5,9)]
> binded
test time score class time.1 score.1 class.1 time.2 score.2 class.2
1 1 40 404 a 57 409 b 70 32 c
2 2 5 119 a 32 336 b 93 177 c
3 3 20 345 a 44 91 b 100 42 c
4 4 47 468 a 60 265 b 24 478 c
5 5 16 52 a 38 219 b 3 92 c
Let me know if it works for you!

Related

Replace value with the mean based on two classes

I have a dataset with 2 calendar variables (Week & Hour) and 1 Amount variable:
Week Hour Amount
35 1 367
35 2 912
36 1 813
36 2 482
37 1 112
37 2 155
35 1 182
35 2 912
36 1 551
36 2 928
37 1 125
37 2 676
I wish to replace each value of Amount with the mean of all observations that share the same Week/Hour pair. For instance, here there are 2 observations for (Week=35, Hour=1), with Amount values of 367 and 182. Hence, the 2 rows with (Week=35, Hour=1) should have Amount replaced with mean(c(367,182)). The final output should be:
Week Hour Amount
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
I have the following code that solves this issue. However, for the complete dataset with thousands of rows, it is very slow. Is there any way to fill in these paired means automatically, without the loop?
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))
means <- reshape2::dcast(dataset, Week~Hour, value.var="Amount", fun.aggregate=mean)
for (i in 1:nrow(dataset)) {
print(i)
dataset$Amount[i] <- means[means$Week==dataset$Week[i],which(colnames(means)==dataset$Hour[i])]
}
Possible solution with dplyr:
library(dplyr)
dataset %>%
group_by(Week, Hour) %>%
summarise(mean_amount = mean(Amount))
You group by Week and Hour and calculate the mean based on this condition.
EDIT
To keep the original structure (number of rows) alter the code to
dataset %>%
group_by(Week, Hour) %>%
mutate(Amount = mean(Amount))
If the idea is just to get the mean Amount by Week and Hour, this would work:
aggregate(Amount ~ ., dataset, mean)
Week Hour Amount
1 35 1 274.5
2 36 1 682.0
3 37 1 118.5
4 35 2 912.0
5 36 2 705.0
6 37 2 415.5
EDIT:
If, however, the idea is to put the averages back into the dataset, then this should work:
x <- aggregate(Amount ~ ., dataset, mean)
dataset$Amount <- x$Amount[match(apply(dataset[,1:2], 1, paste0, collapse = " "),
apply(x[,1:2], 1, paste0, collapse = " "))]
dataset
Week Hour Amount
1 35 1 274.5
2 35 2 912.0
3 36 1 682.0
4 36 2 705.0
5 37 1 118.5
6 37 2 415.5
7 35 1 274.5
8 35 2 912.0
9 36 1 682.0
10 36 2 705.0
11 37 1 118.5
12 37 2 415.5
Explanation:
This uses apply to paste the first two columns of each row into a single string, both in the means dataframe x and in dataset. It then uses match on these strings to assign each mean to the corresponding rows in dataset.
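A small sketch of the key-matching idea, using three hand-picked Week/Hour pairs (the names keys_data and keys_means are made up for this illustration):

```r
# Keys built from (Week, Hour) in the order the rows appear in the data
keys_data  <- paste(c(35, 35, 36), c(1, 2, 1))   # "35 1" "35 2" "36 1"

# Keys in the order aggregate() returns them (Week varies fastest)
keys_means <- paste(c(35, 36, 35), c(1, 1, 2))   # "35 1" "36 1" "35 2"

# For each data row, the position of its key in the means table
pos <- match(keys_data, keys_means)
print(pos)
# 1 3 2
```

Indexing the means with pos puts them back in the row order of the original data, which is what makes the assignment line up.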
EDIT 2:
Alternatively, you can build the keys with interaction instead of apply and paste0:
dataset$Amount <- x$Amount[match(interaction(dataset[,1:2]), interaction(x[,1:2]))]
# note: replacing match with %in% here does not work, because %in% only tests
# membership and returns the means in the row order of x, not of dataset
Base R solution:
dataset$Amount <- with(dataset, ave(Amount, Week, Hour, FUN = mean))
Data:
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))
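As a quick check, here is the ave approach run end to end on the data above (a sketch in base R only, no packages assumed):

```r
dataset <- data.frame(Week   = c(35,35,36,36,37,37,35,35,36,36,37,37),
                      Hour   = c(1,2,1,2,1,2,1,2,1,2,1,2),
                      Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))

# ave() returns a vector as long as Amount, holding the group mean for each row
dataset$Amount <- with(dataset, ave(Amount, Week, Hour, FUN = mean))

print(head(dataset$Amount, 6))
# 274.5 912.0 682.0 705.0 118.5 415.5
```

Because ave keeps the original row order, no join or matching step is needed afterwards.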

Replace column values based on column in another dataframe

I would like to replace some column values in a data frame based on a column in another data frame.
This is the head of the first df:
df1
A tibble: 253 x 2
id sum_correct
<int> <dbl>
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 16
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
Some sum_correct values need to be replaced by the correct values from another df, using the id to trigger the replacement:
df2
A tibble: 14 x 2
id sum_correct
<int> <dbl>
1 866103 61
2 866124 79
3 866152 85
4 867101 24
5 867140 76
6 867146 51
7 867152 56
8 867200 50
9 867209 97
10 879657 56
11 879680 61
12 879683 58
13 879693 77
14 881451 57
How can I achieve this in RStudio? Thanks for the help in advance.
You can make an update join using match to find where id matches and remove non matches (NA) with which:
idx <- match(df1$id, df2$id)
idxn <- which(!is.na(idx))
df1$sum_correct[idxn] <- df2$sum_correct[idx[idxn]]
df1
id sum_correct
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 61
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
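The intermediate values may be easier to follow on a cut-down example. This sketch uses three ids from df1 and two from df2 as toy data (hit is just a name chosen here for the logical index):

```r
# Toy subsets of the two tables from the question
df1 <- data.frame(id = c(866102L, 866103L, 866104L), sum_correct = c(65, 16, 72))
df2 <- data.frame(id = c(866103L, 866124L),          sum_correct = c(61, 79))

idx <- match(df1$id, df2$id)   # NA 1 NA: only 866103 is found in df2
hit <- !is.na(idx)             # rows of df1 that have a correction

# Overwrite only the matched rows; unmatched rows keep their value
df1$sum_correct[hit] <- df2$sum_correct[idx[hit]]
print(df1$sum_correct)
# 65 61 72
```

Ids that appear only in df2 (866124 here) are ignored, so df1 keeps its number of rows.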
You can do a left_join and then use coalesce:
library(dplyr)
left_join(df1, df2, by = "id", suffix = c("_1", "_2")) %>%
mutate(sum_correct_final = coalesce(sum_correct_2, sum_correct_1))
The new column sum_correct_final contains the value from df2 if it exists and from df1 if a corresponding entry from df2 does not exist.

Sum multiple columns that have specific name in columns

I would like to sum the values of Var1 and Var2 for each row and produce a new column titled Vars which gives the total of Var1 and Var2. I would then like to do the same for Col1 and Col2 and have a new column titled Cols which gives the sum of Col1 and Col2. How do I write the code for this? Thanks in advance.
df
ID Var1 Var2 Col1 Col2
1 34 22 34 24
2 3 25 54 65
3 87 68 14 78
4 66 98 98 100
5 55 13 77 2
Expected outcome would be the following:
df
ID Var1 Var2 Col1 Col2 Vars Cols
1 34 22 34 24 56 58
2 3 25 54 65 28 119
3 87 68 14 78 155 92
4 66 98 98 100 164 198
5 55 13 77 2 68 79
This assumes that column ID is irrelevant (no groups) and that you are happy to specify the column names (the solution is hard-coded, not generic).
A base R solution:
df$Vars <- rowSums(df[, c("Var1", "Var2")])
df$Cols <- rowSums(df[, c("Col1", "Col2")])
A tidyverse solution:
library(dplyr)
library(purrr)
df %>% mutate(Vars = map2_int(Var1, Var2, sum),
Cols = map2_int(Col1, Col2, sum))
# or just
df %>% mutate(Vars = Var1 + Var2,
Cols = Col1 + Col2)
There are many different ways to do this. With dplyr:
library(dplyr)
df = df %>% #input dataframe
group_by(ID) %>% #do it for every ID, so every row
mutate( #add columns to the data frame
Vars = Var1 + Var2, #do the calculation
Cols = Col1 + Col2
)
But there are many other ways, e.g. with the apply functions. I suggest reading about the tidyverse.
Another dplyr way is to use helper functions starts_with to select columns and then use rowSums to sum those columns.
library(dplyr)
df$Vars <- df %>% select(starts_with("Var")) %>% rowSums()
df$Cols <- df %>% select(starts_with("Col")) %>% rowSums()
df
# ID Var1 Var2 Col1 Col2 Vars Cols
#1 1 34 22 34 24 56 58
#2 2 3 25 54 65 28 119
#3 3 87 68 14 78 155 92
#4 4 66 98 98 100 164 198
#5 5 55 13 77 2 68 79
A solution that sums all columns which share the same name and end with numbers, using gsub in base R:
tt <- paste0(gsub('[[:digit:]]+', '', names(df)[-1]),"s")
df <- cbind(df, sapply(unique(tt), function(x) {rowSums(df[grep(x, tt)+1])}))
df
# ID Var1 Var2 Col1 Col2 Vars Cols
#1 1 34 22 34 24 56 58
#2 2 3 25 54 65 28 119
#3 3 87 68 14 78 155 92
#4 4 66 98 98 100 164 198
#5 5 55 13 77 2 68 79
Or an even more general solution:
idx <- grep('[[:digit:]]', names(df))
tt <- paste0(gsub('[[:digit:]]+', '', names(df)[idx]),"s")
df <- cbind(df, sapply(unique(tt), function(x) {rowSums(df[idx[grep(x, tt)]])}))
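To make the grouping step visible, here is a sketch on the first two rows of the example data. It selects each group with tt == x rather than grep(x, tt); that is a small variation on the code above, chosen here so that one group name cannot partially match another.

```r
df <- data.frame(ID = 1:2,
                 Var1 = c(34, 3),  Var2 = c(22, 25),
                 Col1 = c(34, 54), Col2 = c(24, 65))

idx <- grep("[[:digit:]]", names(df))                         # 2 3 4 5
tt  <- paste0(gsub("[[:digit:]]+", "", names(df)[idx]), "s")  # "Vars" "Vars" "Cols" "Cols"

# One column of row sums per distinct group name
sums <- sapply(unique(tt), function(x) rowSums(df[idx[tt == x]]))
print(sums)
#   Vars Cols
# 1   56   58
# 2   28  119
```

The matrix sums can then be attached with cbind(df, sums), as in the answer.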

MCA from dataframe

I have a dataframe
name a b c d e f
1 220-volt 1 8 12 17 22 8
2 aliexpress 7 133 317 372 358 349
3 bonprix 0 3 14 13 21 11
4 citilink 1 20 40 31 29 30
5 dns 1 16 37 34 39 38
6 ebay 3 32 65 50 55 58
7 eldorado 0 19 76 44 42 56
8 kupivip 0 8 17 24 11 18
9 labirint 0 15 30 34 36 32
10 lamoda 3 25 66 73 68 55
and I am trying to build an MCA plot.
I use FactoMineR with this code:
library(FactoMineR)
df <- read.table("info.csv", header = TRUE, sep=';')
row.names(df) = df$name
df = df[,-1]
res.mca <- MCA(df)
but it returns
Error in which(unlist(lapply(listModa, is.numeric))) : argument to 'which' is not logical
How can I avoid this error?
I downloaded the code and reproduced your data.frame (please use dput or another reproducible example) and got the same error.
When you ?MCA you will find that x has to be:
a data frame with n rows (individuals) and p columns (categorical variables)
After I changed the columns to factors the function runs.
Try this:
df[] <- lapply(df, factor)
Tip: use row.names = 1 to set the first column as row names for your data.frame when you read the data.
df <- read.table("info.csv", header = T, sep = ";", row.names = 1)
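A minimal sketch of the conversion step on its own, using three rows and two columns typed in by hand instead of info.csv (so FactoMineR is not needed to run it):

```r
# Small stand-in for the data read from the file
df <- data.frame(a = c(1, 7, 0), b = c(8, 133, 3),
                 row.names = c("220-volt", "aliexpress", "bonprix"))

# Assigning into df[] replaces the columns but keeps the data.frame and its row names
df[] <- lapply(df, factor)

print(sapply(df, is.factor))
#    a    b
# TRUE TRUE
```

After this, every column is categorical, which is what the ?MCA help text quoted above asks for.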

Transpose data in R by IDnumber

I have a question about transposing data in R. Basically, I am looking for an R alternative to proc transpose by id prefix = test and proc transpose by id prefix = score.
I have a set of data that looks like the following:
ID test date score
1 4/1/2001 98
1 5/9/2001 65
1 5/23/2001 85
2 3/21/2001 76
2 4/8/2001 58
2 5/22/2001 67
2 6/15/2001 53
3 1/15/2001 46
3 5/30/2001 55
4 1/8/2001 71
4 2/14/2001 95
4 7/15/2001 93
and I would love to transpose it into:
id testdate1 score1 testdate2 score2 testdate3 score3 testdate4 score4
1 4/1/2001 98 5/9/2001 65 5/23/2001 85 . .
2 3/21/2001 76 4/8/2001 58 5/22/2001 67 6/15/2001 53
3 1/15/2001 46 5/30/2001 55 . . . .
4 1/8/2001 71 2/14/2001 95 7/15/2001 93 . .
This is a basic "long" to "wide" reshaping task. In base R, you can use reshape, but only after adding a "time" variable, like this:
mydf$time <- with(mydf, ave(ID, ID, FUN = seq_along))
reshape(mydf, direction = "wide", idvar = "ID", timevar = "time")
# ID test.date.1 score.1 test.date.2 score.2 test.date.3 score.3
# 1 1 4/1/2001 98 5/9/2001 65 5/23/2001 85
# 4 2 3/21/2001 76 4/8/2001 58 5/22/2001 67
# 8 3 1/15/2001 46 5/30/2001 55 <NA> NA
# 10 4 1/8/2001 71 2/14/2001 95 7/15/2001 93
# test.date.4 score.4
# 1 <NA> NA
# 4 6/15/2001 53
# 8 <NA> NA
# 10 <NA> NA
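The "time" variable is the only non-obvious part, so here is a sketch of what ave with seq_along produces for the ID column of the example (typed in by hand):

```r
# ID column from the question: 3, 4, 2 and 3 rows per ID
ID <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4)

# Within each ID, number the rows 1, 2, 3, ...
time <- ave(ID, ID, FUN = seq_along)
print(time)
# 1 2 3 1 2 3 4 1 2 1 2 3
```

reshape then uses this counter as the suffix of the wide columns, so the longest group (ID 2, with four rows) determines how many test.date/score pairs appear.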

Resources