Hello I have been looking for a solution for quite some time. I'm sure the answer is easy but I've been pulling my hair out here!
I have two data frames that are similar (in fact one represents a more complete dataset). They both have two columns, one containing string values as a factor and one containing numerical values.
df.A looks like this:
Category Number
A 1
B 2
C 3
D 4
and df.B looks like this
Category Number
A 5
B 6
C 7
These categories (ABCD) are common between the two dataframes. In trying to get df.B to have a category D with a NA or 0 value (I am working with percentages so either NA or 0 is fine), my code looks like this:
proto <- df.A
proto$number <- NULL
df.B <- rbind.fill(proto,df.B)
My thought is this would add the fourth row for category D and give NA value but instead results in
Category Number
A NA
B NA
C NA
D NA
NA 5
NA 6
NA 7
I tried removing the factor class from category on both df.A and df.B, tried using rbind.fill.matrix instead...to be honest I am pretty new to R and this is giving me a lot of trouble. How do I get R to recognize that ABCD are the same factor across dataframes?
You can achieve the desired result by using merge:
merge(df.A,df.B,by='Category',all=TRUE)
which will produce the following output:
# Category Number.x Number.y
#1 A 1 5
#2 B 2 6
#3 C 3 7
#4 D 4 NA
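If the goal is specifically a version of df.B that has a row for D, one possible variation is to merge df.B onto just the Category column of df.A. This is only a sketch: the two data frames are retyped from the question, and the names df.B.full and df.B.zero are made up for illustration.

```r
# Sample data retyped from the question
df.A <- data.frame(Category = c("A", "B", "C", "D"), Number = 1:4)
df.B <- data.frame(Category = c("A", "B", "C"), Number = 5:7)

# Keep only the key column of df.A, then keep every row of it in the merge
df.B.full <- merge(df.A["Category"], df.B, by = "Category", all.x = TRUE)

# Optional: use 0 instead of NA for the missing category
df.B.zero <- df.B.full
df.B.zero$Number[is.na(df.B.zero$Number)] <- 0
```

Here all.x = TRUE keeps every Category from df.A, so D shows up with NA because df.B has no matching row.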
I have tried various functions including compare and all.equal but I am having difficulty finding a test to see if variables are the same.
For context, I have a data.frame which in some cases has a duplicate result. I have tried copying the data.frame so I can compare it with itself. I would like to remove the duplicates.
One approach I considered was to look at row A from dataframe 1 and subtract it from row B from dataframe 2. If the difference was zero, I planned to remove one of them.
Is there an approach I can use to do this without copying my data?
Any help would be great, I'm new to R coding.
Suppose I had a data.frame named data:
data
Col1 Col2
A 1 3
B 2 7
C 2 7
D 2 8
E 4 9
F 5 12
I can use the duplicated function to identify duplicated rows and not select them:
data[!duplicated(data),]
Col1 Col2
A 1 3
B 2 7
D 2 8
E 4 9
F 5 12
I can also perform the same action on a single column:
data[!duplicated(data$Col1),]
Col1 Col2
A 1 3
B 2 7
E 4 9
F 5 12
Sample Data
data <- data.frame(Col1 = c(1,2,2,2,4,5), Col2 = c(3,7,7,8,9,12))
rownames(data) <- LETTERS[1:6]
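As a small addition, base R's unique() should give the same result as data[!duplicated(data),] for a data frame, and duplicated() with fromLast = TRUE can flag every copy of a repeated row rather than only the later ones. A sketch using the sample data above (the names deduped and all_copies are just for illustration):

```r
data <- data.frame(Col1 = c(1, 2, 2, 2, 4, 5), Col2 = c(3, 7, 7, 8, 9, 12))
rownames(data) <- LETTERS[1:6]

# Drop repeated rows, keeping the first occurrence of each
deduped <- unique(data)

# Flag every row that has a twin, including the first occurrence
all_copies <- duplicated(data) | duplicated(data, fromLast = TRUE)
data[all_copies, ]
```

Neither call needs a second copy of the data frame.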
I am trying to loop over many data frames in R and I feel like this is a rather basic question. However, I only found similar questions that were solved with specific functions that don't match my problem (like calculating means or medians, changing column names, ...). I hope to find a more general solution that can be applied for any change or calculation in various data frames here.
I have a lot (about 500) of data frames that look somewhat like this (very simplified):
df0100
a b c d
1 4 3 5 NA
2 2 5 4 NA
3 4 4 3 NA
...
df0130
a b c d
1 3 2 3 NA
2 4 5 3 NA
3 4 3 2 NA
...
For each of them, I want to calculate a new value (also simplified here) from the values of a and c in the first row and insert that value into every row of column d. It works fine like this for a single data frame:
df0100$d <- (df0100[1,1]*(df0100[1,3]+13.5)/3*exp(df0100[1,3]))/100
which leads to
df0100
a b c d
1 4 3 5 36.60858
2 2 5 4 36.60858
3 4 4 3 36.60858
....
Since I don't want to do this for every single one of the 500 data frames, I saved them as a list and tried to loop over them as follows. I thought the easiest way would be to replace the former 'df0100' with each data frame name, but neither version worked. Can anyone tell me what I have to change?
my_files <- list.files(pattern=".csv")
my_data <- lapply(my_files, read.csv)
Version 1:
for (n in my_data)
{
n$d <- (n[1,1]*(n[1,3]+13.5)/3*exp(n[1,3]))/100
}
Version 2:
my_data <- lapply(my_data, function(n){
n$d <- (n[1,1]*(n[1,3]+13.5)/3*exp(n[1,3]))/100
})
This is my first question here, I hope it makes sense to you.
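A likely reason neither version works: in Version 1, n is a copy of each list element, so changing it never touches my_data; in Version 2, the function's last expression is the assignment, so lapply gets back the computed number instead of the data frame. A minimal sketch of a fix is to return n explicitly. The two small data frames below stand in for the CSV files, the helper name add_d is made up, and the formula is grouped so that it reproduces the 36.60858 shown for df0100 (an assumption, since the exact grouping is not certain from the post).

```r
# Stand-ins for the data frames read from the CSV files (assumed sample data)
my_data <- list(
  df0100 = data.frame(a = c(4, 2, 4), b = c(3, 5, 4), c = c(5, 4, 3), d = NA),
  df0130 = data.frame(a = c(3, 4, 4), b = c(2, 5, 3), c = c(3, 3, 2), d = NA)
)

# Hypothetical helper: compute the value from row 1 and fill column d with it
add_d <- function(n) {
  n$d <- (n[1, 1] * (n[1, 3] + 13.5) / 3 * exp(n[1, 3])) / 100
  n  # return the modified data frame, not just the assigned value
}

my_data <- lapply(my_data, add_d)
my_data$df0100
```

With a for loop, the equivalent would be to loop over seq_along(my_data) and assign back into my_data[[i]], since changes to the loop variable itself are not stored in the list.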
Suppose I have a data frame where some columns are of factor type and there is a column named "index" which is not a factor. I want to extract the columns
which are of factor type and
the "index" column.
For example, let
df<-data.frame(a=runif(10),b=as.factor(sample(10)),index=as.numeric(1:10))
So df is:
a b index
0.16187501 5 1
0.75214741 8 2
0.08741729 3 3
0.58871514 2 4
0.18464752 9 5
0.98392420 1 6
0.73771960 10 7
0.97141474 6 8
0.15768011 7 9
0.10171931 4 10
The desired output is (let it be a data frame called df1):
df1:
b index
5 1
8 2
3 3
2 4
9 5
1 6
10 7
6 8
7 9
4 10
which consists of the factor column and the column named "index".
I use this code:
vars<-apply(df,2,function(x) {(is.factor(x)) || (names(x)=="index")})
df1<-df[,vars]
However, this code does not work. How can I return df1 using apply types function in R? I will be very glad for any help. Thanks a lot.
You could do:
df[ , sapply(df, is.factor) | grepl("index", names(df))]
I think two things went wrong with your method: First, apply converts the data frame to a matrix, which doesn't store values as factors (see here for more on this). Also, in a matrix, every value has to be of the same mode (character, numeric, etc.). In this case, everything gets coerced to character, so there's no factor to find.
Second, the column name isn't accessible within apply (AFAIK), so names(x) returns NULL and names(x)=="index" returns logical(0).
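For completeness, here is a self-contained sketch of the same idea with an exact name match instead of grepl, which would also pick up names such as "index2". The set.seed call and the name keep are additions for illustration only.

```r
set.seed(1)  # only to make the sample reproducible
df <- data.frame(a = runif(10), b = as.factor(sample(10)), index = as.numeric(1:10))

# sapply works column by column on the data frame itself, so factors stay factors
keep <- sapply(df, is.factor) | names(df) == "index"
df1 <- df[, keep, drop = FALSE]
str(df1)
```

drop = FALSE keeps the result a data frame even if only one column is selected.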
I'm trying to build a pivot table from this data frame below. "VisitID" is the unique ID for a user who came to visit a website, "PageName" is the page they visited, and "Order" is the sequence of the page they visited. For example, the first row of this data frame means "user 001 visited Homepage, which is the 1st page he/she visited".
VisitID PageName Order
001 Homepage 1
001 ContactUs 2
001 News 3
002 Homepage 1
002 Careers 2
002 News 3
The desired output should cast "VisitID" as rows and "Order" as columns, and fill the table with the "PageName":
1 2 3
001 Homepage ContactUs News
002 Homepage Careers News
I've thought about using reshape::cast to do the task, but I believe it only works when you give it an aggregation function. I might be wrong though. Thanks in advance to anyone who can offer help.
You don't need to aggregate. As long as there's only one row for each combination of columns in the casting formula, you'll get the value of value.var inserted in the output.
library(reshape2)
dcast(mydata, VisitID ~ Order, value.var="PageName")
Here's an example:
# Fake data
dat = data.frame(group1=rep(LETTERS[c(1,1:3)],each=2), group2=rep(letters[c(1,1:3)]),
values=1:8)
dat
group1 group2 values
1 A a 1
2 A a 2
3 A b 3
4 A c 4
5 B a 5
6 B a 6
7 C b 7
8 C c 8
Note that rows 1 and 2 have the same values of the group columns, as do rows 5 and 6. As a result, dcast aggregates by counting the number of values in each cell.
dcast(dat, group1 ~ group2, value.var="values")
Aggregation function missing: defaulting to length
group1 a b c
1 A 2 1 1
2 B 2 0 0
3 C 0 1 1
Now let's remove rows 1 and 5 to get rid of the duplicated group combinations. Since there's now only one value per cell, dcast returns the actual value, rather than a count of the number of values.
dcast(dat[-c(1,5),], group1 ~ group2, value.var="values")
group1 a b c
1 A 2 3 4
2 B 6 NA NA
3 C NA 7 8
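If installing reshape2 is not an option, base R's reshape() can produce a similar wide table. This is a different technique from the dcast call above, shown only as a sketch; mydata is retyped from the question, and the resulting column names (PageName.1 and so on) differ from the plain 1, 2, 3 that dcast would give.

```r
# Data retyped from the question; VisitID kept as character to preserve leading zeros
mydata <- data.frame(
  VisitID  = c("001", "001", "001", "002", "002", "002"),
  PageName = c("Homepage", "ContactUs", "News", "Homepage", "Careers", "News"),
  Order    = c(1, 2, 3, 1, 2, 3),
  stringsAsFactors = FALSE
)

# One row per VisitID, one PageName column per value of Order
wide <- reshape(mydata, idvar = "VisitID", timevar = "Order", direction = "wide")
wide
```

As with dcast, this relies on there being only one row per combination of VisitID and Order.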
How can one merge two data frames, one column-wise and other one row-wise? For example, I have two data frames like this:
A: add1 add2 add3 add4
1 k NA NA NA
2 l k NA NA
3 j NA NA NA
4 j l NA NA
B: age size name
1 5 6 x
2 8 2 y
3 1 3 x
4 5 4 z
I want to merge the two data.frames by row.name. However, I want to merge the data.frame A column-wise, instead of row-wise. So, I'm looking for a data.frame like this for result:
C:id age size name add
1 5 6 x k
2 8 2 y l
2 8 2 y k
3 1 3 x j
4 5 4 z j
4 5 4 z l
For example, suppose you have information about people in table B, including name, size, etc. This information is unique per person, so there is one row per person in B. Then suppose that table A holds up to 5 past addresses for each person. The first column is the most recent address, the second is the second most recent, and so on. If someone has fewer than 5 addresses (e.g. 3), the 4th and 5th columns are NA for that person.
What I want to achieve is one data frame (C) that includes all of this information together. So, for a person with two addresses, I'll need two rows in table C, repeating the unique values and only different in the column address.
I was thinking of repeating the rows of data frame A by the number of non-NA values while keeping the row.names as they were (like data frame D) and then merging the new data frame with B. But I'm not sure how to do this.
D: address
1 k
2 l
2 k
3 j
4 j
4 l
Thank you!
Change the first data.frame to long format, then it's easy. df1 is A and df2 is B. I also name the numbers id.
require(tidyr)
# wide to long (your example D)
df1tidy <- gather(df1,addname,addval,-id)
# don't need the original add* vars or NA's
df1tidy$addname <- NULL
df1tidy <- df1tidy[!is.na(df1tidy$addval), ]
# merge them into the second data.frame
merge(df2,df1tidy,by = 'id',all.x = TRUE)
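The same wide-to-long-then-merge idea can also be sketched in base R without tidyr. This is not the code above, just an alternative under stated assumptions: df1 and df2 are retyped from tables A and B with the row numbers stored in an id column, and the names add_cols, long and C are chosen for illustration.

```r
# Table A with the row numbers as an explicit id column
df1 <- data.frame(id = 1:4,
                  add1 = c("k", "l", "j", "j"),
                  add2 = c(NA, "k", NA, "l"),
                  add3 = NA_character_,
                  add4 = NA_character_,
                  stringsAsFactors = FALSE)
# Table B
df2 <- data.frame(id = 1:4, age = c(5, 8, 1, 5), size = c(6, 2, 3, 4),
                  name = c("x", "y", "x", "z"), stringsAsFactors = FALSE)

# Stack the add* columns on top of each other, repeating the ids to match
add_cols <- setdiff(names(df1), "id")
long <- data.frame(id  = rep(df1$id, times = length(add_cols)),
                   add = unlist(df1[add_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)
long <- long[!is.na(long$add), ]  # drop the empty address slots

# One row per person and address
C <- merge(df2, long, by = "id", all.x = TRUE)
C
```

all.x = TRUE means a person with no address at all would still appear once, with NA in add.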