How to split a column in R? [duplicate] - r

Does anyone know how I can split a column into multiple columns?
For example: I want to spread the "score" column across new columns named after the values of the "grade" column, with one row per "class". My real data has 50 different values in "grade", not just the two in the example below.
In data frame 2, the row names are the values of column "class" in data frame 1.
data frame 1
class grade score
A     a     12
B     a     45
C     a     75
D     a     18
E     a      6
A     b     45
B     b     92
C     b     78
D     b     36
E     b     39
data frame 2
   a  b
A 12 45
B 45 92
C 75 78
D 18 36
E  6 39

Base R's unstack does this out of the box:
unstack(df, score ~ grade)
# a b
#1 12 45
#2 45 92
#3 75 78
#4 18 36
#5 6 39
As does xtabs:
as.data.frame.matrix(xtabs(score ~ class + grade, data=df))
# a b
#A 12 45
#B 45 92
#C 75 78
#D 18 36
#E 6 39

# sample data (class is coded 1-5 here instead of A-E)
df <- structure(list(class = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L), grade = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), .Label = c("a", "b"), class = "factor"), score = c(12L,
45L, 75L, 18L, 6L, 45L, 92L, 78L, 36L, 39L)), .Names = c("class",
"grade", "score"), class = "data.frame", row.names = c(NA, -10L
))
library(reshape2)
dcast(df, class ~ grade, value.var = "score")
#   class  a  b
# 1     1 12 45
# 2     2 45 92
# 3     3 75 78
# 4     4 18 36
# 5     5  6 39

Another option is spread() from the tidyr package:
library(tidyr)
spread(df, grade, score)
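In current tidyr releases, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape, assuming tidyr 1.0.0 or later; the data frame below is rebuilt from the question's example, and the name `wide` is only illustrative:

```r
library(tidyr)

# Example data from the question (data frame 1)
df <- data.frame(
  class = rep(c("A", "B", "C", "D", "E"), times = 2),
  grade = rep(c("a", "b"), each = 5),
  score = c(12, 45, 75, 18, 6, 45, 92, 78, 36, 39)
)

# One row per class, one column per distinct value of grade
wide <- pivot_wider(df, names_from = grade, values_from = score)
wide
```

With 50 distinct grades the call stays the same; pivot_wider() creates one column per grade value it finds.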

Related

Merging two datasets by an ID without adding new columns that say ".x" or ".y"

Suppose I have two datasets. One main dataset, with many columns of metadata, and one new dataset which will be used to fill in some of the gaps in concentrations in the main dataset:
Main dataset:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 NA NA
1 4 22 0 NA NA
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 NA NA
2 4 37 3 NA NA
New data set to merge:
study_id timepoint concentration1 concentration2
1 3 11 20
1 4 21 35
2 3 7 17
2 4 14 25
Whenever I merge by "study_id" and "timepoint", I get two new columns that are "concentration1.y" and "concentration2.y" while the original columns get renamed as "concentration1.x" and "concentration2.x". I don't want this.
This is what I want:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 11 20
1 4 22 0 21 35
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 7 17
2 4 37 3 14 25
In other words, I want to merge by "study_id" and "timepoint" AND combine the concentration columns so that the data end up in the same columns. Please note that the two datasets do not have identical columns (dataset 1 has 1000 columns of metadata, while dataset 2 has only study id, timepoint, and the concentration columns that match those in dataset 1).
Thanks so much in advance.
One option is coalesce() from the dplyr package. The join still adds the two concentration columns from the second data frame (with .x and .y suffixes), but they are dropped once the NA values have been filled in.
library(tidyverse)
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
mutate(concentration1 = coalesce(concentration1.x, concentration1.y),
concentration2 = coalesce(concentration2.x, concentration2.y)) %>%
select(-concentration1.x, -concentration1.y, -concentration2.x, -concentration2.y)
Or to generalize with multiple concentration columns:
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.[xy]$")) %>%
map_df(reduce, coalesce)
Edit: To prevent the resultant column names from being alphabetized from split.default, you can add an intermediate step of sorting the list based on the first data frame's column name order.
df3 <- df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.[xy]$"))
df3[names(df1)] %>%
map_df(reduce, coalesce)
Output
study_id timepoint age occupation concentration1 concentration2
1 1 1 21 0 3 7
2 1 2 21 0 4 6
3 1 3 22 0 11 20
4 1 4 22 0 21 35
5 2 1 36 3 0 4
6 2 2 36 3 2 11
7 2 3 37 3 7 17
8 2 4 37 3 14 25
Data
df1 <- structure(list(study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), age = c(21L,
21L, 22L, 22L, 36L, 36L, 37L, 37L), occupation = c(0L, 0L,
0L, 0L, 3L, 3L, 3L, 3L), concentration1 = c(3L, 4L, NA, NA,
0L, 2L, NA, NA), concentration2 = c(7L, 6L, NA, NA, 4L, 11L,
NA, NA)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(study_id = c(1L, 1L, 2L, 2L), timepoint = c(3L,
4L, 3L, 4L), concentration1 = c(11L, 21L, 7L, 14L), concentration2 = c(20L,
35L, 17L, 25L)), class = "data.frame", row.names = c(NA, -4L))
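If every row of the second data set matches exactly one row of the first, dplyr's rows_patch() may be a shorter route: it fills only the NA values in the first data frame and never creates .x/.y columns. A minimal sketch, assuming dplyr 1.0.0 or later; the data is rebuilt from the question and the name `patched` is illustrative:

```r
library(dplyr)

# Main data set with gaps (NA) in the concentration columns
df1 <- data.frame(
  study_id       = c(1, 1, 1, 1, 2, 2, 2, 2),
  timepoint      = c(1, 2, 3, 4, 1, 2, 3, 4),
  age            = c(21, 21, 22, 22, 36, 36, 37, 37),
  occupation     = c(0, 0, 0, 0, 3, 3, 3, 3),
  concentration1 = c(3, 4, NA, NA, 0, 2, NA, NA),
  concentration2 = c(7, 6, NA, NA, 4, 11, NA, NA)
)

# New data used to fill the gaps
df2 <- data.frame(
  study_id       = c(1, 1, 2, 2),
  timepoint      = c(3, 4, 3, 4),
  concentration1 = c(11, 21, 7, 14),
  concentration2 = c(20, 35, 17, 25)
)

# Overwrite only NA cells in df1 with the matching values from df2
patched <- rows_patch(df1, df2, by = c("study_id", "timepoint"))
patched
```

By default rows_patch() raises an error when df2 contains a key combination that is absent from df1, so it suits the "fill in the gaps" case rather than a general merge.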

Retrieve matched observation based on distance algorithm

What I am trying to do is close to propensity score matching (or causal matching, MatchIt) but not quite the same.
I am simply interested in finding and gathering together the closest (pairwise) observations from a dataset with mixed variables (categorical and numerical).
The dataset looks like this:
         id child age               edu   y
1  11011209     0  69      some college 495
2  11011212     0  44 secondary/primary 260
3  11011213     1  40      some college 175
4  11020208     1  47 secondary/primary   0
5  11020212     1  50 secondary/primary  25
6  11020310     0  65 secondary/primary 525
7  11020315     1  43           college   0
8  11020316     1  41 secondary/primary   5
9  11031111     0  49 secondary/primary 275
10 11031116     1  42 secondary/primary   0
11 11031119     0  32           college 425
12 11040801     1  38 secondary/primary   0
13 11040814     0  52      some college 260
14 11050109     0  59      some college 405
15 11050111     1  35 secondary/primary  20
16 11050113     0  51 secondary/primary  40
17 11051001     1  38           college 165
18 11051004     1  36           college  10
19 11051011     0  63 secondary/primary 455
20 11051018     0  44           college  40
What I want is to match on the variables {child, age, edu} but not on y (nor id).
Because the dataset has mixed variables, I can use the Gower distance:
library(cluster)   # daisy()
library(dplyr)     # %>%, group_by(), mutate(), ...
library(reshape2)  # melt()
# test on first ten observations
dt = dt[1:10, ]
# gower distance
ddmen = daisy(dt[,-c(1,5)], metric = 'gower')
Now, I want to retrieve the closest observations
mg = as.matrix(ddmen)
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
close = mgg %>% dplyr::select(Var2, closest, dis = m) %>% distinct()
close gives me
Var2 closest dis
1 1 6 0.37931034
2 2 9 0.05747126
3 3 8 0.34482759
4 4 5 0.03448276
5 5 4 0.03448276
6 6 9 0.18390805
7 7 10 0.34482759
8 8 10 0.01149425
9 9 2 0.05747126
10 10 8 0.01149425
I can merge close to my original data
dt$id = 1:10
dt2 = merge(dt, close, by.x = 'id', by.y = 'Var2', all = T)
Then, bind it
vlist = vector('list', 10)
for(i in 1:10){
vlist[[i]] = dt2[ c( which(dt2$id == i), dt2$closest[dt2$id == i] ), ] %>%
mutate(p = i)
}
bind_rows(vlist)
and get
id child age edu y closest dis p
1 1 0 69 some college 495 6 0.37931034 1
2 6 0 65 secondary/primary 525 9 0.18390805 1
3 2 0 44 secondary/primary 260 9 0.05747126 2
4 9 0 49 secondary/primary 275 2 0.05747126 2
...
p is then the identifier of the matched pair, based on id. Note that an individual can appear in several pairs, because closest matches are not necessarily symmetrical: if 2 is the closest match of 1, the closest match of 2 may still be someone other than 1.
Questions
First, there is a little bug in the code here:
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
With the full data I get this error message: Column closest must be length 19 (the group size) or one, not 2
The code works for the first 10 observations but not for all 20 (complete dataset provided below).
Why?
Second, is there a package available to do this automatically?
dt = structure(list(id = c(11011209L, 11011212L, 11011213L, 11020208L,
11020212L, 11020310L, 11020315L, 11020316L, 11031111L, 11031116L,
11031119L, 11040801L, 11040814L, 11050109L, 11050111L, 11050113L,
11051001L, 11051004L, 11051011L, 11051018L), child = structure(c(1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 1L), .Label = c("0", "1"), class = "factor"), age = c(69L,
44L, 40L, 47L, 50L, 65L, 43L, 41L, 49L, 42L, 32L, 38L, 52L, 59L,
35L, 51L, 38L, 36L, 63L, 44L), edu = structure(c(3L, 2L, 3L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 1L, 2L,
1L), .Label = c("college", "secondary/primary", "some college"
), class = "factor"), y = c(495, 260, 175, 0, 25, 525, 0, 5,
275, 0, 425, 0, 260, 405, 20, 40, 165, 10, 455, 40)), class = "data.frame",
.Names = c("id",
"child", "age", "edu", "y"), row.names = c(NA, -20L))
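On the first question, the error message is consistent with a tie: if two observations are at exactly the same minimum distance from a third, Var1[m == value] has length 2 and cannot fill a single cell. A minimal base R sketch that sidesteps this by keeping only the first minimum through which.min(); the toy data frame below is the first five rows of the question's data, and choosing the first of several tied neighbours is an assumption, not something the question specifies:

```r
library(cluster)  # daisy() computes the Gower distance

# First five observations of the question's data, matching variables only
toy <- data.frame(
  child = factor(c(0, 0, 1, 1, 1)),
  age   = c(69, 44, 40, 47, 50),
  edu   = factor(c("some college", "secondary/primary", "some college",
                   "secondary/primary", "secondary/primary"))
)

d <- as.matrix(daisy(toy, metric = "gower"))
diag(d) <- Inf  # an observation must not be matched to itself

# which.min() returns a single index even when several distances tie
closest <- unname(apply(d, 1, which.min))

close <- data.frame(
  obs     = seq_len(nrow(d)),
  closest = closest,
  dis     = d[cbind(seq_len(nrow(d)), closest)]
)
close
```

Setting the diagonal to Inf replaces the filter(value != 0) step, which would also discard genuine zero distances between two different observations with identical covariates.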

Convert long format with only unique comparisons to full square matrix with the diag

Say I have the following input data:
a file containing a data.frame in long format, with only the unique comparisons between Species_A and Species_B, as follows:
Species_A Species_B values
A B 58
A C 64
A D 78
A E 32
B C 10
B D 12
B E 54
C D 99
C E 84
D E 42
I wonder how I can easily convert the input file into this square matrix, with 100 on the diagonal:
    A   B   C   D   E
A 100  58  64  78  32
B  58 100  10  12  54
C  64  10 100  99  84
D  78  12  99 100  42
E  32  54  84  42 100
I think you can achieve your goal with matrix subsetting.
# get row/column names of new matrix from columns 1 and 2 of data.frame
myNames <- sort(unique(as.character(unlist(df[1:2]))))
# build matrix of 0s, one row and one column per name
myMat <- matrix(0, length(myNames), length(myNames), dimnames = list(myNames, myNames))
# fill in upper triangle
myMat[as.matrix(df[c(1,2)])] <- df$values
# fill in the lower triangle
myMat[as.matrix(df[c(2,1)])] <- df$values
# fill in the diagonal
diag(myMat) <- 100
which returns
myMat
    A   B   C   D   E
A 100  58  64  78  32
B  58 100  10  12  54
C  64  10 100  99  84
D  78  12  99 100  42
E  32  54  84  42 100
Note
The lower triangle can also be filled in by mirroring the upper triangle:
myMat[lower.tri(myMat)] <- t(myMat)[lower.tri(myMat)]
data
df <-
structure(list(Species_A = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Species_B = structure(c(1L, 2L, 3L, 4L, 2L, 3L, 4L, 3L, 4L,
4L), .Label = c("B", "C", "D", "E"), class = "factor"), values = c(58L,
64L, 78L, 32L, 10L, 12L, 54L, 99L, 84L, 42L)), .Names = c("Species_A",
"Species_B", "values"), class = "data.frame", row.names = c(NA,
-10L))
A solution using tidyverse functions:
library(tidyverse)
cor_data <- tribble(
~Species_A, ~Species_B, ~values,
"A","B",58,
"A","C",64,
"A","D",78,
"A","E",32,
"B","C",10,
"B","D",12,
"B","E",54,
"C","D",99,
"C","E",84,
"D","E",42)
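# take species from both columns, otherwise "E" is missing from the grid
species <- union(cor_data[["Species_A"]], cor_data[["Species_B"]])
expand.grid(Var1 = species, Var2 = species, stringsAsFactors = FALSE) %>%
left_join(cor_data, by = c("Var1" = "Species_A", "Var2" = "Species_B")) %>%
left_join(cor_data, by = c("Var1" = "Species_B", "Var2" = "Species_A")) %>%
# self-comparisons match neither join, so they fall through to 100
transmute(Species_A = Var1, Species_B = Var2, values = coalesce(values.x, values.y, 100)) %>%
spread(Species_B, values)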
OK, I finally did the trick:
1/ Add the self-comparisons to the data table
2/ Use reshape(df, idvar = "Species_A", timevar = "Species_B", direction = "wide") to construct a square matrix with NA as missing values
3/ Reorder the rows and columns of the matrix by their NA counts (to retrieve the lower or upper triangular matrix); this gives half_matrix
4/ Fill the missing part of the matrix by summing half_matrix and its transpose (the missing cells have to be 0 rather than NA for the sum to work)
square_matrix_full = t(half_matrix) + half_matrix
5/ diag(square_matrix_full) = 100
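The steps above can be sketched in base R as follows. This is only an illustration under two assumptions: the half matrix is built by character-matrix indexing instead of reshape(), and missing comparisons are stored as 0 so that adding the transpose works. The names `species`, `half` and `full` are hypothetical.

```r
# Long-format input from the question
df <- data.frame(
  Species_A = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  Species_B = c("B", "C", "D", "E", "C", "D", "E", "D", "E", "E"),
  values    = c(58, 64, 78, 32, 10, 12, 54, 99, 84, 42),
  stringsAsFactors = FALSE
)

species <- sort(union(df$Species_A, df$Species_B))
n <- length(species)

# Half matrix: 0 everywhere except the comparisons present in df
half <- matrix(0, n, n, dimnames = list(species, species))
half[cbind(df$Species_A, df$Species_B)] <- df$values

# Mirror the filled triangle, then set the self-comparisons
full <- half + t(half)
diag(full) <- 100
full
```

Because each pair occurs only once in the input, every off-diagonal cell receives exactly one non-zero contribution from the sum.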

how to calculate a specific subset in dataframe in r and save the calculation in another list

I have two lists:
list 1:
id name age
1 jake 21
2 ashly 19
45 lana 18
51 james 23
5675 eric 25
list 2 (tv watch):
id hours
1 1.1
1 3
1 2.5
45 5.6
45 3
51 2
51 1
51 2
This is just an example; the real lists are very big: list 1 has 5000 ids, and lists 2/3/4 each have more than 1 million rows (ids are not unique).
For each of lists 2 and up, I need to calculate the average/sum/count per id and add the value to list 1.
Note that the result has to be saved in another list, which has a different number of rows.
example:
list 1:
id name age tv_average
1 jake 21 2.2
2 ashly 19 n/a
45 lana 18 4.3
51 james 23 1.6667
5675 eric 25 n/a
This is what I tried:
for (i in 1:nrow(list2)) {
p <- subset(list2,list2$id==i)
list2$tv_average[i==list2$id] <- sum(p$hours)/(nrow(p))
}
Problem:
Out of 22999 rows, it only works on 21713 rows.
Try this
#Sample Data
data1 = structure(list(id = c(1L, 2L, 45L, 51L, 5675L), name = structure(c(3L,
1L, 5L, 4L, 2L), .Label = c("ashly", "eric", "jake", "james",
"lana"), class = "factor"), age = c(21L, 19L, 18L, 23L, 25L)
), .Names = c("id",
"name", "age"), row.names = c(NA, -5L), class = "data.frame")
data2 = structure(list(id = c(1L, 1L, 1L, 3L, 45L, 45L, 51L, 51L, 51L,
53L), hours = c(1.1, 3, 2.5, 10, 5.6, 3, 2, 1, 2, 6)), .Names = c("id",
"hours"), class = "data.frame", row.names = c(NA, -10L))
# Use aggregate to calculate Average, Sum, and Count and Merge
merge(x = data1,
y = aggregate(hours~id, data2, function(x)
c(mean = mean(x),
sum = sum(x),
count = length(x))),
by = "id",
all.x = TRUE)
# id name age hours.mean hours.sum hours.count
#1 1 jake 21 2.200000 6.600000 3.000000
#2 2 ashly 19 NA NA NA
#3 45 lana 18 4.300000 8.600000 2.000000
#4 51 james 23 1.666667 5.000000 3.000000
#5 5675 eric 25 NA NA NA
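One likely reason the loop in the question misses rows is that i runs over row numbers (1 to nrow(list2)) while it is compared against id values, so any id larger than the row count is never visited. A loop-free base R sketch that computes the per-id mean once and looks it up by id; the data is rebuilt from the question, and the column name tv_average follows the desired output:

```r
# List 1: one row per person
data1 <- data.frame(
  id   = c(1, 2, 45, 51, 5675),
  name = c("jake", "ashly", "lana", "james", "eric"),
  age  = c(21, 19, 18, 23, 25)
)

# List 2: many rows per id
data2 <- data.frame(
  id    = c(1, 1, 1, 45, 45, 51, 51, 51),
  hours = c(1.1, 3, 2.5, 5.6, 3, 2, 1, 2)
)

# Mean hours per id; the result is named by id
tv_mean <- tapply(data2$hours, data2$id, mean)

# Look up each id of data1; ids without tv data get NA
idx <- match(as.character(data1$id), names(tv_mean))
data1$tv_average <- as.vector(tv_mean)[idx]
data1
```

Swapping mean for sum or length in the tapply() call gives the sum or the count in the same way.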

Subset of data with criteria of two columns

I would like to create a subset of data that consists of Units that have a higher score in QTR 4 than QTR 1 (upward trend). Doesn't matter if QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
If the dataframe is named "d", then this succeeds on your test set:
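up <- sapply(split(d, d["Unit"]), function(dd) {
# the difference is empty when a Unit lacks QTR 1 or QTR 4
delta <- dd[dd$QTR == 4, "Score"] - dd[dd$QTR == 1, "Score"]
length(delta) == 1 && delta > 0
})
# keep the rows whose Unit is flagged TRUE (names(up) are the Unit values)
d[d$Unit %in% names(up)[up], ]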
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
An alternative in two steps:
result <- unlist(
by(
test,
test$Unit,
function(x) x$Score[x$QTR==4] > x$Score[x$QTR==1])
)
test[test$Unit %in% names(result[result==TRUE]),]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (there are probably better versions than this one).
Note: this assumes each QTR value occurs at most once per Unit.
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
library(data.table)
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
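For comparison, the same filter can be sketched in base R with an inner join between the QTR 1 and QTR 4 rows, which drops any Unit that lacks either quarter before the comparison is made. The intermediate names (q1, q4, both, up_units, upward) are illustrative only:

```r
df <- data.frame(
  Unit  = c(5, 1, 5, 2, 3, 5, 1, 5, 1, 1, 3),
  QTR   = c(4, 1, 3, 4, 2, 2, 2, 1, 3, 4, 4),
  Score = c(34, 22, 67, 78, 39, 34, 34, 67, 70, 89, 19)
)

q1 <- df[df$QTR == 1, c("Unit", "Score")]
q4 <- df[df$QTR == 4, c("Unit", "Score")]

# Inner join: only Units present in both quarters survive
both <- merge(q1, q4, by = "Unit", suffixes = c("_q1", "_q4"))

# Units whose QTR 4 score exceeds their QTR 1 score
up_units <- both$Unit[both$Score_q4 > both$Score_q1]

# All rows (any quarter) of those Units, ordered by quarter
upward <- df[df$Unit %in% up_units, ]
upward <- upward[order(upward$Unit, upward$QTR), ]
upward
```

Unlike the data.table result, this keeps the QTR column, so the output has the same shape as the subset requested in the question.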

Resources