I am trying to add this table:
#        [,1] [,2]
# [1,] -0.870    8
# [2,] -0.750    7
# [3,]  2.290    2
# [4,] -0.050    5
# [5,]  0.355    4
# [6,] -0.895    9
# [7,]  3.290    1
# [8,] -0.510    6
# [9,]  0.430    3
#[10,] -3.290   10
into the respective "predAwayScore" and "predHomeScore" columns in my data frame.
I want to insert the left-hand column of the first data set (-0.87, -0.75, etc.) into the appropriate cells. The right-hand column of that data set (8, 7, 2, etc.) gives the letter in the data frame that the value belongs to (for instance, AwayTeam E = 5 = -0.05).
I am unsure how to insert one column into another data frame, and how to refer to the corresponding letter guide.
I appreciate any help.
One option is to create a named vector, use it to match the 'AwayTeam' and 'HomeTeam' values to their corresponding scores, and assign those values to the 'predAwayScore' and 'predHomeScore' columns:
nm1 <- setNames(m1[, 1], LETTERS[m1[, 2]])  # scores named by team letter
df1$predAwayScore <- nm1[df1[['AwayTeam']]]
df1$predHomeScore <- nm1[df1[['HomeTeam']]]
df1
# Week AwayTeam HomeTeam predAwayScore predHomeScore
#1 3 E A -0.050 3.29
#2 3 A F 3.290 -0.51
#3 4 H E -0.870 -0.05
#4 4 I A -0.895 3.29
#5 5 F C -0.510 0.43
#6 5 F J -0.510 -3.29
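An equivalent base R route (a sketch of mine, not part of the original answer) is to match() each team letter against the letter guide and index into the score matrix directly:
# LETTERS[m1[, 2]] gives the letter for each row of m1;
# match() finds that row for each team, and column 1 holds the score
df1$predAwayScore <- m1[match(df1$AwayTeam, LETTERS[m1[, 2]]), 1]
df1$predHomeScore <- m1[match(df1$HomeTeam, LETTERS[m1[, 2]]), 1]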
data
m1 <- structure(c(-0.87, -0.75, 2.29, -0.05, 0.355, -0.895, 3.29,
-0.51, 0.43, -3.29, 8, 7, 2, 5, 4, 9, 1, 6, 3, 10), .Dim = c(10L,
2L))
df1 <- structure(list(Week = c(3, 3, 4, 4, 5, 5), AwayTeam = c("E",
"A", "H", "I", "F", "F"), HomeTeam = c("A", "F", "E", "A", "C",
"J")), class = "data.frame", row.names = c(NA, -6L))
I want to replace the NA values for observations within a particular sub-group, but the observations in that group are not ordered properly. Is there a dplyr or plyr command that would let me replace the missing values in a column of one data frame with the values from the same column of another data frame, matching on a "key" column?
Here's what I have. I hope someone can shed light on this. Thanks.
## data frame that contains missing values in "diff" column
df <- data.frame(type = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3),
                 diff = c(0.1, 0.3, NA, NA, NA, NA, NA, 0.2, 0.7, NA, 0.5, NA),
                 name = c("A", "B", "C", "D", "E", "A", "B", "C", "F", "A", "B", "C"))
## replace with values from this smaller data frame
df2 <- data.frame(diff_rep = c(0.3, 0.2, 0.4), name = c("A", "B", "C"))
## replace using ifelse
df$diff <- ifelse(is.na(df$diff) & (df$type == 2), df2$diff_rep, df$diff)
df
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 0.3 D
5 2 0.2 E
6 2 0.4 A
7 2 0.3 B
8 2 0.2 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C
## desired output
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 NA D
5 2 NA E
6 2 0.3 A
7 2 0.2 B
8 2 0.4 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C
The ifelse() attempt misbehaves because df2$diff_rep (length 3) is recycled across all 12 rows, so the replacement values land in the wrong positions. Assuming row 9 of the desired output is a mistake, you can use a left join first and then ifelse() and coalesce() to get your desired result; coalesce() returns the first non-missing value at each position.
library(dplyr)

left_join(df, df2, by = "name") %>%
  mutate(diff_wanted = if_else(type == 2,
                               coalesce(diff, diff_rep),
                               diff),
         diff_wanted = ifelse(name %in% df2$name,
                              diff_wanted,
                              NA)) %>%
  select(type, diff_wanted, name)
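As a quick illustration of coalesce() (my example, not from the original answer): it works element-wise and keeps the first non-NA value found across its arguments.
coalesce(c(NA, 2, NA), c(1, 3, NA))
# [1]  1  2 NA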
The data frame I am working with is coded in dyadic format, where each observation (i.e., row) contains a source node (from) and a target node (to) along with some other dyadic covariates (such as the dyadic correlation, corr).
For simplicity's sake, I want to treat each dyad as un-ordered and generate a unique identifier for each dyad, like the one (i.e., df1) below:
# original data
df <- data.frame(
  from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
  to = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
  corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5))
from to corr
1 A B 0.50
2 A C 0.70
3 A D 0.20
4 B C 0.15
5 C B 0.15
6 A B 0.50
7 D A 0.20
8 E A 0.45
9 F A 0.54
10 B A 0.50
# desired format
df1 <- data.frame(
  from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
  to = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
  corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5),
  dyad = c(1, 2, 3, 4, 4, 1, 3, 5, 6, 1))
from to corr dyad
1 A B 0.50 1
2 A C 0.70 2
3 A D 0.20 3
4 B C 0.15 4
5 C B 0.15 4
6 A B 0.50 1
7 D A 0.20 3
8 E A 0.45 5
9 F A 0.54 6
10 B A 0.50 1
where the dyads A-B/B-A and A-D/D-A are treated as identical pairs and are assigned the same dyad identifier.
While it's easy to extract a list of un-ordered pairs from the original data, it's hard to map them back onto the original data frame to generate un-ordered dyad identifiers. Could anyone offer some insight on this?
One dplyr option could be:
library(dplyr)

df %>%
  mutate(dyad = group_indices(., paste0(pmax(from, to), pmin(from, to))))
from to corr dyad
1 A B 0.50 1
2 A C 0.70 2
3 A D 0.20 4
4 B C 0.15 3
5 C B 0.15 3
6 A B 0.50 1
7 D A 0.20 4
8 E A 0.45 5
9 F A 0.54 6
10 B A 0.50 1
Or:
df %>%
mutate(dyad = dense_rank(paste0(pmax(from, to), pmin(from, to))))
However, if you need to assign the identifiers in a specific order (meaning that the identifiers hold some information on their own), then the solution from @Ronak Shah could be better for you.
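For readers unfamiliar with them, pmax() and pmin() also work element-wise on character vectors, which is what makes the un-ordered key possible (my illustration, not from the answer):
pmax(c("A", "C"), c("B", "B"))
# [1] "B" "C"
pmin(c("A", "C"), c("B", "B"))
# [1] "A" "B"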
One way using apply() is to sort and paste the values in the two columns, then convert the result to a factor and to integer to get a unique number for each combination:
df$temp <- apply(df[1:2], 1, function(x) paste(sort(x), collapse = "_"))  # order-independent key
df$dyad <- as.integer(factor(df$temp, levels = unique(df$temp)))  # id in order of first appearance
df$temp <- NULL
df
# from to corr dyad
#1 A B 0.50 1
#2 A C 0.70 2
#3 A D 0.20 3
#4 B C 0.15 4
#5 C B 0.15 4
#6 A B 0.50 1
#7 D A 0.20 3
#8 E A 0.45 5
#9 F A 0.54 6
#10 B A 0.50 1
I have a data.table similar to the one below, but with around 3 million rows and a lot more columns.
key1 price qty status category
1: 1 9.26 3 5 B
2: 1 14.64 1 5 B
3: 1 16.66 3 5 A
4: 1 18.27 1 5 A
5: 2 2.48 1 7 A
6: 2 0.15 2 7 C
7: 2 6.29 1 7 B
8: 3 7.06 1 2 A
9: 3 24.42 1 2 A
10: 3 9.16 2 2 C
11: 3 32.21 2 2 B
12: 4 20.00 2 9 B
Here's the dput() string:
dados = structure(list(key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
    price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06,
    24.42, 9.16, 32.21, 20), qty = c(3, 1, 3, 1, 1, 2, 1, 1,
    1, 2, 2, 2), status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2,
    9), category = c("B", "B", "A", "A", "A", "C", "B", "A",
    "A", "C", "B", "B")), .Names = c("key1", "price", "qty",
    "status", "category"), row.names = c(NA, -12L), class = c("data.table",
    "data.frame"))
I need to transform this data so that I have one entry for each key, and in the process I need to create some additional variables. So far I have been using this:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

key.aggregate = function(x){
  return(data.table(
    key1 = Mode(x$key1),
    perc.A = sum(x$price[x$category == "A"], na.rm = TRUE)/sum(x$price),
    perc.B = sum(x$price[x$category == "B"], na.rm = TRUE)/sum(x$price),
    perc.C = sum(x$price[x$category == "C"], na.rm = TRUE)/sum(x$price),
    status = Mode(x$status),
    qty = sum(x$qty),
    price = sum(x$price)
  ))
}
new_data = split(dados, by = "key1")  # Runs out of RAM here
results = rbindlist(lapply(new_data, key.aggregate))
And expecting the following output:
> results
key1 perc.A perc.B perc.C status qty price
1: 1 0.5937447 0.4062553 0.00000000 5 8 58.83
2: 2 0.2780269 0.7051570 0.01681614 7 4 8.92
3: 3 0.4321208 0.4421414 0.12573782 2 6 72.85
4: 4 0.0000000 1.0000000 0.00000000 9 2 20.00
But I always run out of RAM when splitting the data by keys. I've tried using only a third of the data, and then only a sixth of it, but it still gives the same Error: cannot allocate vector of size 593 Kb.
I'm thinking this approach is very inefficient; what would be the best way to get this result?
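One common remedy (a sketch, not from the original thread) is to aggregate by group directly with data.table's by=, which computes the summaries in one pass and avoids materializing the list of sub-tables that split() creates:
library(data.table)

# grouped aggregation; Mode() as defined above
results <- dados[, .(
  perc.A = sum(price[category == "A"], na.rm = TRUE) / sum(price),
  perc.B = sum(price[category == "B"], na.rm = TRUE) / sum(price),
  perc.C = sum(price[category == "C"], na.rm = TRUE) / sum(price),
  status = Mode(status),
  qty = sum(qty),
  price = sum(price)
), by = key1]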
Hello, I have a question about matching two data frames.
Consider these two datasets:
Dataframe 1:
"A" "B"
91 1
92 3
93 11
94 4
95 10
96 6
97 7
98 8
99 9
100 2
structure(list(A = 91:100, B = c(1, 3, 11, 4, 10, 6, 7, 8, 9,
2)), .Names = c("A", "B"), row.names = c(NA, -10L), class = "data.frame")
Dataframe 2:
"C" "D"
91.12 1
92.34 3
93.65 11
94.23 4
92.14 10
96.98 6
97.22 7
98.11 8
93.15 9
100.67 2
91.25 1
96.45 3
83.78 11
84.66 4
100 10
structure(list(C = c(91.12, 92.34, 93.65, 94.23, 92.14, 96.98,
97.22, 98.11, 93.15, 100.67, 91.25, 96.45, 83.78, 84.66, 100),
D = c(1, 3, 11, 4, 10, 6, 7, 8, 9, 2, 1, 3, 11, 4, 10)), .Names = c("C",
"D"), row.names = c(NA, -15L), class = "data.frame")
Now I want to find the rounded matches between columns A and C and replace column D with the respective value from column B of Dataframe 1. Where there is no corresponding value (by rounded match between A and C), I want NaN in the replaced column D.
result:
"C" "newD"
91.12 1
92.34 3
93.65 4
94.23 4
92.14 3
96.98 7
97.22 7
98.11 8
93.15 11
100.67 NaN
91.25 1
96.45 6
83.78 NaN
84.66 NaN
100 2
structure(list(C = c(91.12, 92.34, 93.65, 94.23, 92.14, 96.98,
97.22, 98.11, 93.15, 100.67, 91.25, 96.45, 83.78, 84.66, 100),
newD = c(1, 3, 4, 4, 3, 7, 7, 8, 11, NaN, 1, 6, NaN, NaN, 2)), .Names = c("C",
"newD"), row.names = c(NA, -15L), class = "data.frame")
Does anybody know how to do that, especially for large datasets?
Thanks a lot!
Making an update join with data.table:
library(data.table)
setDT(DF1); setDT(DF2)
DF2[, A := round(C)]
DF2[, D := DF1[DF2, on=.(A), x.B] ]
# alternately, chain together in one step:
DF2[, A := round(C)][, D := DF1[DF2, on=.(A), x.B] ]
This gives NA in unmatched rows. To switch those to NaN: DF2[is.na(D), D := NaN].
To drop the new DF2$A column, use DF2[, A := NULL].
Does anybody know how to do that, especially for large datasets?
This modifies DF2 in place (instead of making a new table like a vanilla join as in Mike's answer), so it should be fairly efficient for large tables. It might perform better if A is stored as an integer instead of a float in both tables.
On data.table 1.9.6, use on="A", B instead of on=.(A), x.B. Thanks to Mike H for checking this.
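If you want to try the integer-key idea mentioned above, one way (my sketch, same names as the answer) is to coerce the rounded key explicitly, since round() returns a double:
DF2[, A := as.integer(round(C))]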
You can create a lookup table where the values in A are used to look up the values in B.
Lookup = df1$B
names(Lookup) = df1$A  # names are the A values, so they act as lookup keys
df3 = data.frame(C = df2$C, newD = Lookup[as.character(round(df2$C))])
df3$newD[is.na(df3$newD)] = NaN  # unmatched values become NaN
For these types of merges I like SQL:
library(sqldf)
res <- sqldf("SELECT l.C, r.B
              FROM df2 AS l
              LEFT JOIN df1 AS r
              ON round(l.C) = round(r.A)")
res
# C B
#1 91.12 1
#2 92.34 3
#3 93.65 4
#4 94.23 4
#5 92.14 3
#6 96.98 7
#7 97.22 7
#8 98.11 8
#9 93.15 11
#10 100.67 NA
#11 91.25 1
#12 96.45 6
#13 83.78 NA
#14 84.66 NA
#15 100.00 2
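If you need NaN rather than NA in the unmatched rows, as in the desired output, you can convert afterwards (my addition):
res$B[is.na(res$B)] <- NaN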
I am a beginner in R. I have a table that looks like this:
> means
as er op rt
a 34.66667 3.5 87 4
b 22.66667 4.5 9 5
c 5.00000 7.5 6 9
d 6.00000 0.5 6 3
e 3.00000 8.0 7 89
and another one that looks like this:
> table
exp ctrl
1 as er
2 rt op
I want to extract the values from the columns in "means" that are indicated in column "exp" of "table", like this:
> means_exp <- means[, table$exp]
In the real situation both tables would be much bigger, so I don't want to just specify the names of the columns to extract one by one.
However, with that command I am getting this:
> means_exp
as er
a 34.66667 3.5
b 22.66667 4.5
c 5.00000 7.5
d 6.00000 0.5
e 3.00000 8.0
but I am supposed to get columns "as" and "rt", not "as" and "er".
Any idea why the wrong columns are extracted?
Thank you!
Here is the dput of the first table:
structure(c(34.6666666666667, 22.6666666666667, 5, 6, 3, 3.5,
4.5, 7.5, 0.5, 8, 87, 9, 6, 6, 7, 4, 5, 9, 3, 89), .Dim = c(5L,
4L), .Dimnames = list(c("a", "b", "c", "d", "e"), c("as", "er",
"op", "rt")))
and that of the second:
structure(list(exp = structure(1:2, .Label = c("as", "rt"), class = "factor"),
ctrl = structure(1:2, .Label = c("er", "op"), class = "factor")), .Names = c("exp",
"ctrl"), class = "data.frame", row.names = c(NA, -2L))
The reason the OP got the wrong columns from the 'exp' column of 'table' is the class of 'exp': it is a factor, so converting it to character is the fix.
means[, as.character(table$exp)]
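which returns the intended columns (output based on the dput data above):
#         as rt
# a 34.66667  4
# b 22.66667  5
# c  5.00000  9
# d  6.00000  3
# e  3.00000 89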
Without the conversion, the factor gets coerced to its integer codes:
as.integer(factor(table$exp))
#[1] 1 2
means[,factor(table$exp)]
# as er
#a 34.66667 3.5
#b 22.66667 4.5
#c 5.00000 7.5
#d 6.00000 0.5
#e 3.00000 8.0
So it selects the first two columns instead of 'as' and 'rt'.