This question already has answers here: How to join (merge) data frames (inner, outer, left, right) (13 answers). Closed 1 year ago.
This is something that I come across fairly often and this is the solution that I typically land on. I'm wondering if anyone has suggestions on a less verbose way to accomplish this task. Here I create an example dataframe that contains three columns. One of these columns is a parameter code rather than a parameter name.
Second I create an example of a reference dataframe that contains unique parameter codes and the associated parameter names.
My solution has been to use a 'for' loop to match the parameter names from real_param_names with the associated parameter codes in dat. I have tried to use match() and replace() but haven't quite found a way to make those work. None of the examples I've come across in old questions have quite hit the mark either, but I'd happily be referred to one that does. Thanks in advance.
dat <- data.frame(site = c(1,1,2,2,3,3,4,4),
                  param_code = c('a','b','c','d','a','b','c','d'),
                  param_name = NA)
dat
real_param_names <- data.frame(param_code = c('a','b','c','d'),
                               param_name = c('gold', 'silver', 'mercury', 'lead'))
real_param_names
for (i in unique(dat$param_code)) {
  dat$param_name[dat$param_code==i] <- real_param_names$param_name[real_param_names$param_code==i]
}
dat
This is a merge/join operation:
First, let's get rid of the unnecessary dat$param_name, since it'll be brought over in the merge:
dat$param_name <- NULL
dat
# site param_code
# 1 1 a
# 2 1 b
# 3 2 c
# 4 2 d
# 5 3 a
# 6 3 b
# 7 4 c
# 8 4 d
Now the merge:
merge(dat, real_param_names, by = "param_code", all.x = TRUE)
# param_code site param_name
# 1 a 1 gold
# 2 a 3 gold
# 3 b 1 silver
# 4 b 3 silver
# 5 c 2 mercury
# 6 c 4 mercury
# 7 d 2 lead
# 8 d 4 lead
Some good links for the concepts of joins/merges: How to join (merge) data frames (inner, outer, left, right), https://stackoverflow.com/a/6188334/3358272
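For completeness, the same left join can be written with dplyr; this is just a sketch for reference (it assumes you're willing to pull in an extra package), and unlike merge() it preserves the original row order of dat:
library(dplyr)
# keep all rows of dat, bringing param_name over from the lookup table
left_join(dat, real_param_names, by = "param_code")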
This question already has answers here: How to join (merge) data frames (inner, outer, left, right) (13 answers). Closed 1 year ago.
I have a df with two columns where the elements in them are codes:
> head(listaNombres)
ocupacion1 ocupacion2
1 11-2020 11-9190
2 11-2020 41-1010
3 11-2020 41-2030
4 11-2020 41-3090
5 11-2020 41-4010
6 11-3030 11-9190
And then a separate df with the meaning for each code:
> head(descripcion)
# A tibble: 6 x 2
broadGroup Desc
<chr> <chr>
1 11-1010 Chief Executives
2 11-1020 General and Operations Managers
3 11-1030 Legislators
4 11-2010 Advertising and Promotions Managers
5 11-2020 Marketing and Sales Managers
6 11-2030 Public Relations and Fundraising Managers
How can I replace the codes in the first df with the corresponding Desc values from the second?
This question has been answered a few times, but none of the answers seems to be both simple and in base R. I'm not a fan of making people use unnecessary packages, so I'll write this up, since there is an easy and straightforward solution that requires no extra packages.
Using the match() function we can recode both columns:
# the values we wish to change from
oldvalues <- descripcion$broadGroup
# the values we want to change to
newvalues <- descripcion$Desc

# overwrite the current ocupacion1 values with the desired recode
listaNombres$ocupacion1 = newvalues[ match(listaNombres$ocupacion1, oldvalues) ]
# overwrite the current ocupacion2 values with the desired recode
listaNombres$ocupacion2 = newvalues[ match(listaNombres$ocupacion2, oldvalues) ]
To see how this works in general, say we have a vector v1 of old values, a vector v2 of the corresponding new values, and a column v3$codes that we want to recode:
v3$codes = v2[ match(v3$codes, v1) ]
match(v3$codes, v1) returns a vector of positions: for each element of v3$codes, the position in v1 where its first match occurs. Selecting elements of v2 at those positions gives us the recoded version of v3$codes, and we feed that recoded vector straight back into v3$codes, overwriting the old values.
Edit: I've since had a look in R, and this solution works with the following mock-up dataset:
ocupacion1 = c(1,2,3,4)
ocupacion2 = c(3,4,4,2)
listaNombres = data.frame(ocupacion1,ocupacion2)
broadGroup = c(1,2,3,4)
Desc = c("one","two","three","four")
descripcion = data.frame(broadGroup,Desc)
Combining everything gives the following:
> ocupacion1 = c(1,2,3,4)
> ocupacion2 = c(3,4,4,2)
> listaNombres = data.frame(ocupacion1,ocupacion2)
> head(listaNombres)
ocupacion1 ocupacion2
1 1 3
2 2 4
3 3 4
4 4 2
> broadGroup = c(1,2,3,4)
> Desc = c("one","two","three","four")
> descripcion = data.frame(broadGroup,Desc)
> head(descripcion)
broadGroup Desc
1 1 one
2 2 two
3 3 three
4 4 four
> oldvalues <- descripcion$broadGroup
> newvalues <- descripcion$Desc
> listaNombres$ocupacion1 = newvalues[ match(listaNombres$ocupacion1, oldvalues) ]
> listaNombres$ocupacion2 = newvalues[ match(listaNombres$ocupacion2, oldvalues) ]
> # Overwrite current ocupacion2 values with desired recode
> head(listaNombres)
ocupacion1 ocupacion2
1 one three
2 two four
3 three four
4 four two
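For reference, the same recode can also be written with a named lookup vector; a base R sketch, assuming the same descripcion and listaNombres objects as above:
# build a lookup vector: names are the codes, values are the descriptions
lookup <- setNames(as.character(descripcion$Desc), descripcion$broadGroup)
# index the lookup by code; unname() drops the names carried along
listaNombres$ocupacion1 <- unname(lookup[as.character(listaNombres$ocupacion1)])
listaNombres$ocupacion2 <- unname(lookup[as.character(listaNombres$ocupacion2)])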
This question already has answers here: R - Change column name using get() (3 answers). Closed 2 years ago.
My goal is to set the column names on a data frame. The name of this data frame is stored in a variable name_of_table.
name_of_table<-"table_13"
assign(name_of_table,read.csv("table_13_air_vehicle_risks_likelihood_and_cost_effects.csv", header=FALSE))
# This works fine, like table_13 <- read.csv(...)
first_level_header <- c("one","two","three","four","five")
colnames(get(name_of_table)) <- first_level_header
# Throws error:
#Error in colnames(get(name_of_table)) <- first_level_header :
# could not find function "get<-"
Obviously if I substitute table_13 for get(name_of_table) this works.
If instead I try:
colnames(names(name_of_table)) <- first_level_header
#Throws error: Error in `colnames<-`(`*tmp*`, value = c("one", "two", "three", "four", : attempt to set
#'colnames' on an object with less than two dimensions
I was pointed to this post earlier: R using get() inside colnames
But eval(parse(paste0("colnames(",name_of_table,")<- first_level_header"))), besides being hideous, does not work either: Error in file(filename, "r") : cannot open the connection
I don't understand the suggestion involving setNames.
I apologize if get/assign is not the right approach; of course I want to do this the "right" way, and I appreciate the guidance.
You could use setnames() from data.table:
library(data.table)
table_13 = data.table(1:5, 1:5, 1:5, 1:5, 1:5)
setnames(get(name_of_table), first_level_header) # N.B. also works for a data.frame
# one two three four five
# 1: 1 1 1 1 1
# 2: 2 2 2 2 2
# 3: 3 3 3 3 3
# 4: 4 4 4 4 4
# 5: 5 5 5 5 5
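If you'd rather stay in base R, a minimal sketch (assuming name_of_table names an existing data frame, as in the question) is to get() the object, rename the local copy, and assign() it back. Incidentally, the eval(parse(...)) attempt in the question fails only because parse() reads from a file by default; it would need its text = argument.
df <- get(name_of_table)            # fetch the data frame by its name
colnames(df) <- first_level_header  # rename the local copy
assign(name_of_table, df)           # write it back under the same name
colnames(get(name_of_table))
# [1] "one"   "two"   "three" "four"  "five"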
This question already has answers here: Use `j` to select the join column of `x` and all its non-join columns (2 answers) and What does < stand for in data.table joins with on= (2 answers). Closed 3 years ago.
Morning everyone
In data.table I found that, in a left join, referring to a column implicitly, i.e. without prefixing the table it resides in, produces unexpected results even though the column names are unique.
dummy data
x <- data.table(a = 1:2); x
# a
# 1: 1
# 2: 2
y <- data.table(c = 1, d = 2); y
# c d
# 1: 1 2
left join, retrieving column c without prefixing the table name
z <- y[x, on=.(c=a), .(a,c,d)]; z
# a c d
# 1: 1 1 2
# 2: 2 2 NA
The problem arises when looking at the results above: row 2 of column c is supposed to be NA, yet it shows 2.
This is only rectified when the user explicitly mentions the table:
z <- y[x, on=.(c=a), .(a,x.c,d)]; z
# a x.c d
# 1: 1 1 2
# 2: 2 NA NA
It is perhaps worth mentioning that the x in x.c refers to the x in the x[i] syntax, which in this case is the table y.
My question is: why is explicitly mentioning the table necessary for such a seemingly basic task? Or am I missing something? Thank you.
This question already has answers here: Numbering rows within groups in a data frame (10 answers). Closed 3 years ago.
I am looking to add a column to my data that gives the running count of each observation within its group. I have data on NBA teams and each of their games. The games are listed by date, and I want to create a column that gives, for each team, which game of the season each game is.
My data looks like this:
# gmDate teamAbbr opptAbbr id
# 2012-10-30 WAS CLE 2012-10-30WAS
# 2012-10-30 CLE WAS 2012-10-30CLE
# 2012-10-30 BOS MIA 2012-10-30BOS
Commas separate each column
I've tried to use "add_count" but this has provided me with the total # of games each team has played in the dataset.
Prior attempts:
nba_box %>% add_count()
I expect the added column to display the # game for each team (1-82), but instead it now shows the total number of games in the dataset (82).
Here is a base R example that approaches the problem from a for-loop standpoint. Given that a team can appear in either column, we keep track of each team's running game count by unlisting the data and using the table function to add up the counts from the previous rows.
# initialize some fake data
test <- as.data.frame(t(replicate(6, sample(LETTERS[1:3], 2))),
                      stringsAsFactors = F)
colnames(test) <- c("team1","team2")
# initialize two new columns
test$team2_gamenum <- test$team1_gamenum <- NA
count <- NULL
for(i in 1:nrow(test)){
  out <- c(count, table(unlist(test[i, c("team1","team2")])))
  count <- table(rep(names(out), out)) # prob not optimum way of combining two table results
  test$team1_gamenum[i] <- count[which(names(count) == test[i,1])]
  test$team2_gamenum[i] <- count[which(names(count) == test[i,2])]
}
test
# team1 team2 team1_gamenum team2_gamenum
#1 B A 1 1
#2 A C 2 1
#3 C B 2 2
#4 C B 3 3
#5 A C 3 4
#6 A C 4 5
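For comparison, here is a shorter base R sketch on the same mock-up test data, using ave() to number each team's appearances in chronological (row) order; team1/team2 are the mock-up's column names, not the NBA data's:
# interleave the teams in row order: row 1 team1, row 1 team2, row 2 team1, ...
teams_long <- c(rbind(test$team1, test$team2))
# running game number per team, in that same order
gamenum <- ave(seq_along(teams_long), teams_long, FUN = seq_along)
# odd positions belong to team1, even positions to team2
test$team1_gamenum <- gamenum[seq(1, length(gamenum), by = 2)]
test$team2_gamenum <- gamenum[seq(2, length(gamenum), by = 2)]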
If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of DT. I know I can join to both columns of the key like this :
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column of the key (e.g. B==2), but still using binary search not vector scan? I'm aware that's a duplicate of :
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows and the solution provided in the dup question linked above doesn't seem to be optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5),
                C = letters[1:20], key=c('A','B','C'))
make.key <- function(ddd, what){
  # the names of the key columns
  zzz <- key(ddd)
  # the key columns for which you wish to keep all unique values
  whichUnique <- setdiff(zzz, names(what))
  ## a list of the unique values of those columns; `..` means "look up one level"
  ud <- lapply(ddd[, ..whichUnique], unique)
  ## append the `what` columns and do a Cross Join of the new
  ## key columns
  do.call(CJ, c(ud, what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
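For comparison, the plain vector-scan equivalent on the same X would be a one-liner sketch:
X[C %in% c('a','b')] # scans column C instead of binary-searching the key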
Adding secondary keys is on the feature request list :
FR#1007 Build in secondary keys
In the meantime we are stuck with either a vector scan, or the approach used in the answer to the n=2 case linked in the question (which @mnel generalises nicely in his answer).
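For reference: secondary keys were later added to data.table as "secondary indices". A minimal sketch of how that looks, assuming a data.table version recent enough to have setindex() and on=, applied to the X defined in the question:
library(data.table)
setindex(X, B)                    # build a secondary index on B; the primary key stays intact
indices(X)                        # "B"
X[.(2), on = "B", nomatch = NULL] # binary search on B (e.g. B == 2) via the index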