How can one merge two data frames when one of them has to be combined column-wise and the other row-wise? For example, I have two data frames like this:
A: add1 add2 add3 add4
1 k NA NA NA
2 l k NA NA
3 j NA NA NA
4 j l NA NA
B: age size name
1 5 6 x
2 8 2 y
3 1 3 x
4 5 4 z
I want to merge the two data.frames by row.name. However, I want to merge the data.frame A column-wise, instead of row-wise. So, I'm looking for a data.frame like this as the result:
C:id age size name add
1 5 6 x k
2 8 2 y l
2 8 2 y k
3 1 3 x j
4 5 4 z j
4 5 4 z l
For example, suppose you have information about people in table B, including name, size, etc. This information is unique per person, so you have one row per person in B. Then, suppose that in table A you have up to 5 past addresses of each person. The first column is the most recent address, the second is the second most recent address, and so on. If someone has fewer than 5 addresses (e.g. 3), you have NA in the 4th and 5th columns for that person.
What I want to achieve is one data frame (C) that includes all of this information together. So, for a person with two addresses, I'll need two rows in table C, repeating the unique values and differing only in the address column.
I was thinking of repeating the rows of data frame A by the number of non-NA values while keeping the row.names the same as they were (like data frame D) and then merging the new data frame with B. But I'm not sure how to do this.
D: address
1 k
2 l
2 k
3 j
4 j
4 l
Thank you!
Change the first data.frame to long format, then it's easy. Here df1 is A and df2 is B, and I assume the row numbers are stored in a column named id in both.
require(tidyr)
# wide to long (your example D)
df1tidy <- gather(df1,addname,addval,-id)
# don't need the original add* vars or NA's
df1tidy$addname <- NULL
df1tidy <- df1tidy[!is.na(df1tidy$addval), ]
# merge them into the second data.frame
merge(df2, df1tidy, by = 'id', all.x = TRUE)
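For readers without tidyr, the same wide-to-long-then-merge idea can be sketched in base R. This is only an illustration under assumptions: the data frames are rebuilt from the question's example, and the `id` column and the `long` helper are names introduced here, not part of the original answer.

```r
# Example data from the question (A: addresses, B: one row per person)
A <- data.frame(add1 = c("k", "l", "j", "j"),
                add2 = c(NA, "k", NA, "l"),
                add3 = NA_character_,
                add4 = NA_character_,
                stringsAsFactors = FALSE)
B <- data.frame(age = c(5, 8, 1, 5), size = c(6, 2, 3, 4),
                name = c("x", "y", "x", "z"),
                stringsAsFactors = FALSE)

# Wide to long: unlist() walks A column by column, so the row ids
# repeat once per column.
long <- data.frame(id  = rep(seq_len(nrow(A)), times = ncol(A)),
                   add = unlist(A, use.names = FALSE),
                   stringsAsFactors = FALSE)
long <- long[!is.na(long$add), ]   # drop the NA addresses

# Merge the unique per-person values onto every address row
B$id <- seq_len(nrow(B))
C <- merge(B, long, by = "id")
C
```

People with two addresses (ids 2 and 4) appear twice in `C`, with the values from B repeated on each row.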
I have a data frame that contains duplicate column names. I'm aware that it's non-standard to use duplicated column names, but these names are actually being reassigned downstream using user inputs. For now, I'm attempting to positionally subset a data frame, but the column names become deduplicated. Here's an example.
> df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = (2+(2:5)), check.names = F)
> df
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
4 4 5 E 7
However, when I attempt to subset, the names change...
> df[, 1:3]
x y y.1
1 1 2 B
2 2 3 C
3 3 4 D
4 4 5 E
Is there any way to prevent this from happening? It only occurs when I subset on columns, not rows.
> df[1:3,]
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
Edit for others noticing this behavior:
I've done some digging into the behavior and found the explanation in the help page for Extract.data.frame (type ?"[.data.frame").
The relevant section states:
If [ returns a data frame it will have unique (and non-missing) row
names, if necessary transforming the row names using make.unique.
Similarly, if columns are selected column names will be transformed to
be unique if necessary (e.g., if columns are selected more than once,
or if more than one column of a given name is selected if the data
frame has duplicate column names).
This explains the why. I appreciate the comments so far on how best to work around it.
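The renaming described in the help text can be seen directly by calling make.unique on a vector of duplicated names. This is a small illustration only; the `nms` variable is introduced here.

```r
# make.unique() appends .1, .2, ... to repeated entries
nms <- make.unique(c("x", "y", "y", "y"))
nms
```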
Here is an option, although I think it is not a good idea to have duplicated column names.
as.data.frame(as.list(df)[1:3], check.names = FALSE)
# x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E
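Another possible workaround, sketched here as an assumption rather than a recommendation, is to subset as usual and then put the original names back by position. The `keep` and `sub` names are introduced for illustration.

```r
df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = (2 + (2:5)),
                 check.names = FALSE)

keep <- 1:3                      # columns to select, by position
sub  <- df[, keep]               # names become x, y, y.1 here
names(sub) <- names(df)[keep]    # restore the duplicated names
names(sub)
```

Later column subsetting of `sub` will deduplicate the names again, so the restore step has to be repeated after each such subset.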
I am working in R and I have two data frames. My goal is to merge them based on two columns and then remove whichever rows were merged from the first data frame. So, for example, if I started with something like the following:
A: x y z
1 2 3
4 5 6
B: q x y
7 1 2
3 8 9
After merging based on (x,y) and removing matching rows from A, I would want to end up with:
A: x y z
4 5 6
C: q x y z
7 1 2 3
Is there a way to add a "flag" or "remove" column to A that evaluates to true wherever the rows match with a row in B? What is an efficient way to do this other than looping through A and B?
The dplyr library offers several joining options.
If you use inner_join, you get the rows of A that have a match in B, together with the columns of both:
inner_join(A,B,by=c("x","y"))
x y z q
1 2 3 7
If you use anti_join, you get only the rows of A that have no match in B:
anti_join(A,B,by=c("x","y"))
x y z
4 5 6
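The question also asks for a "flag" column. A minimal base R sketch of that idea follows, assuming the example data from the question. The `key` helper, the `matched` column and `A_rest` are hypothetical names, and the `"|"` separator assumes that character never appears in the key values.

```r
A <- data.frame(x = c(1, 4), y = c(2, 5), z = c(3, 6))
B <- data.frame(q = c(7, 3), x = c(1, 8), y = c(2, 9))

# Build a composite key from the two join columns (hypothetical helper)
key <- function(d) paste(d$x, d$y, sep = "|")

# Flag rows of A whose (x, y) pair also occurs in B
A$matched <- key(A) %in% key(B)

# Merged rows, and what is left of A after removing them
C      <- merge(B, A[A$matched, c("x", "y", "z")], by = c("x", "y"))
A_rest <- A[!A$matched, c("x", "y", "z")]
```

Both steps are vectorised, so no explicit loop over A and B is needed.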
How does one go about swapping the contents of a data frame, based on column names, between two tables with a lookup table in between?
Orig
A B C
1 2 3
2 2 2
4 5 6
Ret
D E
7 8
8 9
2 4
lookup <- data.frame(Orig=c('A','B','C'),Ret=c('D','D','E'))
Orig Ret
1 A D
2 B D
3 C E
So that the final data frame would be
A B C
7 7 8
8 8 9
2 2 4
We can match the column names of 'Orig' against the 'Orig' column in 'lookup' to find the numeric index (here they are in the same order, but that could differ in other cases) and get the corresponding 'Ret' elements based on that. We use these to subset the 'Ret' dataset and assign the output back to the original dataset. Here I made a copy of "Orig".
OrigN <- Orig
OrigN[] <- Ret[as.character(lookup$Ret[match(colnames(Orig),
as.character(lookup$Orig))])]
OrigN
# A B C
#1 7 7 8
#2 8 8 9
#3 2 2 4
NOTE: as.character was used as the columns in 'lookup' were factor class.
I believe that the following will work as well.
OrigN <- Orig
OrigN[, as.character(lookup$Orig)] <- Ret[, as.character(lookup$Ret)]
This method applies a column shuffle to Orig (actually to a copy, OrigN, following @akrun) and then fills these columns with the appropriately ordered columns of Ret using the lookup.
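The lookup-based replacement can be tried end to end with the question's data. This is a sketch under assumptions: the data frames are rebuilt from the example, `stringsAsFactors = FALSE` is set so no as.character calls are needed, and `src` is a name introduced here.

```r
Orig   <- data.frame(A = c(1, 2, 4), B = c(2, 2, 5), C = c(3, 2, 6))
Ret    <- data.frame(D = c(7, 8, 2), E = c(8, 9, 4))
lookup <- data.frame(Orig = c("A", "B", "C"), Ret = c("D", "D", "E"),
                     stringsAsFactors = FALSE)

# For every column of Orig, find the name of the Ret column that feeds it
src <- lookup$Ret[match(colnames(Orig), lookup$Orig)]

# Fill a copy of Orig column by column, keeping Orig's column names
OrigN   <- Orig
OrigN[] <- lapply(src, function(nm) Ret[[nm]])
OrigN
```

Because the match is driven by `colnames(Orig)`, the rows of `lookup` can be in any order.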
How can I find the 5 highest values of a column in a data frame?
I tried the order() function, but it gives me only the indices of the rows, whereas I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame( column = 1:10,
names = letters[1:10])
order(DF$column)
# 1 2 3 4 5 6 7 8 9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.
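Applied to the original question (the 5 highest values), the same indexing idea might look like the sketch below. The example data and the `top5` name are made up for illustration.

```r
DF <- data.frame(column = c(3, 10, 7, 1, 9, 4, 8),
                 names  = letters[1:7])

# Sort the rows from largest to smallest, then keep the first five
top5 <- head(DF[order(DF$column, decreasing = TRUE), ], 5)
top5$column
```

Use `top5$column` for just the values, or `top5` for the full rows.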
I have a number of large datasets with ~10 columns and ~200000 rows. Not all columns contain values for each row, although at least one column must contain a value for the row to be present. I would like to set a threshold for how many NAs are allowed in a row.
My Dataframe looks something like this:
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
C NA 9 4 NA 4 8 4 NA 5 NA
D 2 2 6 8 4 NA 3 7 1 32
And I would like to be able to delete the rows that contain more than 2 cells containing NA to get
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
D 2 2 6 8 4 NA 3 7 1 32
complete.cases removes all rows containing any NA, and I know one can delete rows that contain NA in certain columns but is there a way to modify it so that it is non-specific about which columns contain NA, but how many of the total do?
Alternatively, this dataframe is generated by merging several dataframes using
file1<-read.delim("~/file1.txt")
file2<-read.delim(file=args[1])
file1<-merge(file1,file2,by="chr.pos",all=TRUE)
Perhaps the merge function could be altered?
Thanks
Use rowSums. To remove rows from a data frame (df) that contain precisely n NA values:
df <- df[rowSums(is.na(df)) != n, ]
or to remove rows that contain n or more NA values:
df <- df[rowSums(is.na(df)) < n, ]
In both cases, replace n with the required number.
If dat is the name of your data.frame the following will return what you're looking for:
keep <- rowSums(is.na(dat)) <= 2
dat <- dat[keep, ]
What this is doing:
is.na(dat)
# returns a matrix of TRUE/FALSE
# note that when adding logicals
# TRUE == 1, and FALSE == 0
rowSums(.)
# quickly computes the total per row
# since your task is to identify the
# rows with a certain number of NA's
rowSums(.) <= 2
# for each row, determine if the sum
# (which is the number of NAs) is at most
# 2 or not. Returns TRUE/FALSE accordingly
We use the output of this last statement to
identify which rows to keep. Note that it is not necessary to actually store this last logical.
If d is your data frame, try this:
d <- d[rowSums(is.na(d)) <= 2,]
This will return a dataset where at most two values per row are missing:
dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ) , ]
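The threshold approach can be checked against the example from the question. This is a sketch only: the data frame is rebuilt by hand from the table above, and `na_count` and `kept` are names introduced here.

```r
dat <- data.frame(ID = c("A", "B", "C", "D"),
                  q = c(1, 5, NA, 2),  r = c(5, NA, 9, 2),
                  s = c(NA, 4, 4, 6),  t = c(3, 6, NA, 8),
                  u = c(8, 1, 4, 4),   v = c(9, 9, 8, NA),
                  w = c(NA, 7, 4, 3),  x = c(8, 4, NA, 7),
                  y = c(6, 9, 5, 1),   z = c(4, 3, NA, 32),
                  stringsAsFactors = FALSE)

# Number of NA cells in each row
na_count <- rowSums(is.na(dat))

# Keep rows with at most 2 NAs, i.e. drop rows with more than 2
kept <- dat[na_count <= 2, ]
kept$ID
```

Row C has four NAs and is dropped. Row A has exactly two and is kept, which is why the comparison is `<= 2` rather than `< 2`.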