Matching elements of two data frames in R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have two data frames. The first looks like this
  name
1    a
2    b
3    c
4    d
5    f
and the second like this
  name value
1    b     3
2    d     4
3    f     5
4    a     1
5    c     2
6    k     7
7    m     6
Now I want to add a second column to the first data frame containing the matching values from the second data frame. It should look like this
  name value
1    a     1
2    b     3
3    c     2
4    d     4
5    f     5
Can somebody help me with this?

You can use merge to do this. If your first data frame is called df1 and the second one df2:
merge(df1, df2, by='name')
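A minimal sketch of that call, assuming df1 and df2 are built from the data shown in the question (the construction below is a reconstruction, not the asker's code). By default merge keeps only the names present in both data frames and sorts the result by the key; adding all.x = TRUE would keep every row of df1 instead.

```r
# Reconstruction of the question's data (assumed, not from the asker)
df1 <- data.frame(name = c("a", "b", "c", "d", "f"))
df2 <- data.frame(name = c("b", "d", "f", "a", "c", "k", "m"),
                  value = c(3, 4, 5, 1, 2, 7, 6))

# Inner join on 'name': rows k and m of df2 have no match in df1 and are dropped
result <- merge(df1, df2, by = "name")
print(result)
```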

What you want to do is an inner join. You might try with the dplyr package.
library(dplyr)
x <- data.frame(name = c("a", "b", "c", "d", "f"), stringsAsFactors = FALSE)
y <- data.frame(name = c("b", "d", "f", "a", "c", "k", "m"),
                value = c(3, 4, 5, 1, 2, 7, 6),
                stringsAsFactors = FALSE)
joined <- dplyr::inner_join(x, y, by = "name")

Related

How to compare two variable and different length data frames to add values from one data frame to the other, repeating values where necessary

I apologize as I'm not sure how to word this title exactly.
I have two data frames. df1 is a series of paths with columns "source" and "destination". df2 stores values associated with the destinations. Below is some sample data:
df1
row source destination
1   A      B
2   C      B
3   H      F
4   G      B
df2
row destination n
1   B           26
2   F           44
3   L           12
I would like to compare the two data frames and add the n column to df1 so that df1 has the correct n value for each destination. df1 should look like:
row source destination n
1   A      B           26
2   C      B           26
3   H      F           44
4   G      B           26
The data that I'm actually working with is much larger, and the number of rows changes every time I run the program. The furthest I've gotten is using which to get the right values, but it returns each value only once.
df2[ which(df2$destination %in% df1$destination), ]$n
[1] 26 44
What I need is the vector (26, 26, 44, 26) so I can save it to df1$n.
We can use a merge or left_join. With data.table, an update join adds the column by reference:
library(data.table)
setDT(df1)[df2, n := i.n, on = .(destination)]
A base R option using match
transform(
  df1,
  n = df2$n[match(destination, df2$destination)]
)
which gives
row source destination n
1 1 A B 26
2 2 C B 26
3 3 H F 44
4 4 G B 26
Data
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"), destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))
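The merge route mentioned above can be sketched in base R as follows. This is one possible way to do it, not the answerer's code: only the key and value columns of df2 are passed to merge so that its row column does not clash with the one in df1, and because merge sorts by the key, the original order is restored afterwards.

```r
df1 <- data.frame(row = 1:4,
                  source = c("A", "C", "H", "G"),
                  destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3,
                  destination = c("B", "F", "L"),
                  n = c(26, 44, 12))

# Left join: keep every row of df1, repeat n wherever the destination repeats
merged <- merge(df1, df2[, c("destination", "n")],
                by = "destination", all.x = TRUE)

# merge() sorts by the key, so put rows and columns back in df1's order
merged <- merged[order(merged$row), c("row", "source", "destination", "n")]
print(merged)
```

Destinations that exist only in df2 (here L) are dropped, and a destination in df1 with no match in df2 would get NA for n.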

How to sum rows and keep their name in a dataframe [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have a data frame with some variables with the same name but different values. I need to sum the values and keep the original values as a separate column.
data <- data.frame(cod = c("A", "B", "C", "A", "A", "B"),
                   Values = c(3, 4, 5, 1, 2, 5))
data
cod Values
A 3
B 4
C 5
A 1
A 2
B 5
I expect the following, where the original Values column is kept the same and the group sum is added as a new column, Values2:
> data2
cod Values Values2
A 3 6
B 4 9
C 5 5
A 1 6
A 2 6
B 5 9
An option with base R would be
data$Values2 <- with(data, ave(Values, cod, FUN = sum))
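A self-contained sketch of the ave approach on the question's data, assuming the value column is named Values with a capital V as in the printed output. ave returns a vector as long as its input, with each element replaced by the result of FUN over its group, which is why the sum lands on every row. The tapply call is added only to show the per-group totals for comparison.

```r
data <- data.frame(cod = c("A", "B", "C", "A", "A", "B"),
                   Values = c(3, 4, 5, 1, 2, 5))

# Group sum repeated on every row of the group
data$Values2 <- ave(data$Values, data$cod, FUN = sum)
print(data)

# One total per group, for comparison
totals <- tapply(data$Values, data$cod, sum)
print(totals)
```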

How to return 2 specific rows from a dataframe?

firstVector <- c("A", "B", "C", "D", "E")
secondVector <- c(1, 2, 3, 4, 5)
thirdVector <- c("a", "b", "c", "d", "e")
myDataFrame <- data.frame(firstVector, secondVector, thirdVector)
How do I extract rows 3 and 4 from my data frame? I want to print rows 3 and 4 so that the output looks like this:
firstVector secondVector thirdVector
3 C 3 c
4 D 4 d
You can subset your dataframe like this [rows,columns]:
myDataFrame[c(3,4),]
In your case you want a vector containing rows 3 and 4, hence c(3,4). You can add more row indices to the vector to select more rows, for example c(1,2,3,12).
If you don't provide an argument, the whole dimension is returned. In your example you subset the rows and return all the columns.
It's the same for columns:
myDataFrame[c(3,4),c(1,2)]
Here you subset rows 3 and 4 and columns 1 and 2.
Another way to do this is using the : operator:
c(1:4) means from 1 to 4, so myDataFrame[3:4,] also returns rows 3 and 4.
Hope this helps
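The indexing forms described above can be tried side by side. This is a small sketch on the question's data frame; the variable names on the left-hand side are made up for illustration. The last two lines show one detail worth knowing: selecting a single column returns a plain vector unless drop = FALSE is given.

```r
myDataFrame <- data.frame(
  firstVector  = c("A", "B", "C", "D", "E"),
  secondVector = c(1, 2, 3, 4, 5),
  thirdVector  = c("a", "b", "c", "d", "e")
)

rows_3_4 <- myDataFrame[3:4, ]              # rows 3 and 4, all columns
sub_block <- myDataFrame[c(3, 4), c(1, 2)]  # rows 3 and 4, columns 1 and 2

one_col_vec <- myDataFrame[3:4, 2]                # plain numeric vector
one_col_df  <- myDataFrame[3:4, 2, drop = FALSE]  # still a data frame

print(rows_3_4)
```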

Trying to count categories variables in R using rowSums

I'm trying to get a count for each of the observation categories per row.
In the example of the data below, the top line containing photo, 2, 3, 4, 5, 6 is the headers and the line beneath that contains the observations.
I would do it in Excel with COUNTIF, but the dataset is huge and this is only a tiny sample. Plus screw excel :)
photo 2 3 4 5 6
30001004501 SINV_SPO_V SINV_HYD LSUB_SAND Unc SINV_SPO_V
I want a new column for each observation category I count; for example, if I were counting "Unc", it would have its own column showing how many times "Unc" appears in each row.
The code below is one of the things I've tried over the last couple of days, along with variations of the count and length commands, but with no success:
data$Unc <-rowSums(data[,3:52] == "Unc", na.rm = F)
I'm trying to get R to count only columns 3 to 52.
Thanks in advance for any help; this is getting incredibly frustrating, as I know it should be really easy.
I hope this makes sense.
If I understood your request correctly, here is a data.table solution. You can use 3:52 in measure.vars for your task. Note that this only works if photo is a unique id variable; if it isn't, create one yourself and use that instead.
library(data.table)
# create example data.table
dt <- data.table(photo = 1:6,
                 x1 = c("a", "b", "a", "c", "a", "d"),
                 x2 = c("c", "c", "a", "c", "a", "d"),
                 x3 = c("c", "c", "a", "c", "a", "d"))
# Melt data.table, select which columns you need
# (melt() dispatches to the data.table method for a data.table input)
dt_melt <- melt(dt, id.vars = 'photo', measure.vars = 2:3, variable.name = 'column')
# Get a resulting data.table with pairs of photo and observation
result_dt <- dt_melt[, .N, by = c('photo', 'value')]
photo value N
1: 1 a 1
2: 2 b 1
3: 3 a 2
4: 4 c 2
5: 5 a 2
6: 6 d 2
7: 1 c 1
8: 2 c 1
# For wide representation
dcast(result_dt, photo ~ value, value.var = 'N', fill = 0)
photo a b c d
1: 1 1 0 1 0
2: 2 0 1 1 0
3: 3 2 0 0 0
4: 4 0 0 2 0
5: 5 2 0 0 0
6: 6 0 0 0 2
I think that a way to solve your problem is to use the table function:
col1 <- c('a','b','b','b','a','c','b','a','c')
col2 <- c('d','e','d','d','d','d','d','d','e')
data = data.frame(col1,col2)
table(col1)
table(col2)
tab = table(data)
tab
margin.table(tab,1)
margin.table(tab,2)
table(col1) will give you the frequencies for the categorical variables of col1, and this gives the same result as margin.table(tab,1). So it depends if you prefer to work on the data.frame or on the columns directly.
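The rowSums idea from the question can also be carried through in base R, one count column per category. The sketch below is only an illustration: the first row is the sample row from the question, the second row and the column names X2 to X6 are invented, and obs_cols stands in for the asker's 3:52.

```r
data <- data.frame(
  photo = c(30001004501, 30001004502),   # second photo id is hypothetical
  X2 = c("SINV_SPO_V", "Unc"),
  X3 = c("SINV_HYD", "Unc"),
  X4 = c("LSUB_SAND", "LSUB_SAND"),
  X5 = c("Unc", "SINV_HYD"),
  X6 = c("SINV_SPO_V", "Unc"),
  stringsAsFactors = FALSE
)

obs_cols <- 2:6  # columns holding the observations

# Every category that occurs anywhere in the observation columns
categories <- sort(unique(unlist(data[obs_cols])))

# For each category, count per row how many observation columns equal it
counts <- sapply(categories, function(k) {
  rowSums(data[obs_cols] == k, na.rm = TRUE)
})

result <- cbind(data, counts)
print(result[, c("photo", categories)])
```

With a single-row data frame sapply would return a vector rather than a matrix, so this sketch assumes at least two rows. na.rm = TRUE makes missing observations count as non-matches instead of turning the whole row count into NA.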

Observation number by group [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
In R I have a data frame whose observations are described by several values, one of which is a factor. I have sorted the dataset by this factor and would like to add a column numbering the observations within each level of the factor, e.g.
factor obsnum
a 1
a 2
a 3
b 1
b 2
b 3
b 4
c 1
c 2
...
In SAS I do it with something like:
data logs.full;
set logs.full;
count + 1;
by cookie;
if first.cookie then count = 1;
run;
How can I achieve that in R?
Thanks,
Use rle (run length encoding) and sequence:
x <- c("a", "a", "a", "b", "b", "b", "b", "c", "c")
data.frame(
  x = x,
  obsnum = sequence(rle(x)$lengths)
)
x obsnum
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
Here is the ddply() solution
dataset <- data.frame(x = c("a", "a", "a", "b", "b", "b", "b", "c", "c"))
library(plyr)
ddply(dataset, .(x), function(z){
  data.frame(obsnum = seq_along(z$x))
})
One solution using base R, assuming your data is in a data.frame named dfr:
dfr$cnt <- do.call(c, lapply(unique(dfr$factor), function(curf){
  seq(sum(dfr$factor == curf))
}))
There are likely better solutions (e.g. employing package plyr and its ddply), but it should work.
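Another base R sketch, not taken from the answers above, uses ave with seq_along. It works directly on a factor column and does not rely on the data being sorted by group. The data frame dfr below is a reconstruction of the question's example, and the vector g is made up to show the unsorted case.

```r
dfr <- data.frame(factor = factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c")))

# Number the observations within each level of the factor
dfr$obsnum <- ave(seq_along(dfr$factor), dfr$factor, FUN = seq_along)
print(dfr)

# The same call on unsorted groups (hypothetical data)
g <- c("b", "a", "b", "a")
unsorted_num <- ave(seq_along(g), g, FUN = seq_along)
print(unsorted_num)
```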
