Finding common rows in R

While getting my data ready for analysis, I can't get this step to work correctly. Suppose I have datasets in this form:
df1
V1 V2df1
a H
b Y
c Y
df2
V1 V2df2
a Y
j H
b Y
and three more (5 datasets of different lengths altogether). What I am trying to do is the following. First, I must find all elements of the first column (V1) that are common to every dataset; in this case those are a and b. Then, based on those common elements, I want to build a joined dataset in which the V1 values are common to all five datasets and the values from the other columns are appended in the same row. To explain with an example,
my result should look something like:
V1 V2df1 V2df2
a H Y
b Y Y
I managed to get some code working, but apparently the results are not correct. What I did:
I read the first column of every dataset into a variable (for example a<-df1[,1], and so on) and found the common values like this:
red<-Reduce(intersect, list(a,b,c,d,e))
then I filtered specific datasets like:
df1 <- unique(filter(df1, V1 %in% red))
I ordered every dataset according to row:
df1<-data.frame(df1[with(df1, order(V1)),])
and deleted duplicates (of elements in the first column):
df1<- df1[unique(df1$V1),]
I then created a new dataset with:
newdata<-data.frame(V1common=df1[,1], V2df1=df1[,2],V2df2=df2[,2]...)
Here ... stands for the columns of all five datasets. Every dataset ended up with the same number of rows (a good sign, since that matches the number of rows in the intersection), and I then appended the other sorted columns, but something doesn't add up. Thanks for any advice. (I omitted the library calls and such; the code is for illustrative purposes.)

You can use join_all from the plyr package:
require(plyr)
df <- join_all(list(df1,df2,df3,df4, df5), by = 'V1', type = 'inner')
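If you would rather not depend on plyr, a similar inner join can be sketched in base R with Reduce() and merge(). This is only a sketch under assumptions: df3 below is made-up data standing in for the datasets not shown in the question, and dedupe is a hypothetical helper name.

```r
# Sample data from the question; df3 is invented for illustration only.
df1 <- data.frame(V1 = c("a", "b", "c"), V2df1 = c("H", "Y", "Y"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("a", "j", "b"), V2df2 = c("Y", "H", "Y"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(V1 = c("b", "a", "z"), V2df3 = c("H", "H", "Y"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: keep only the first row for each V1 value.
dedupe <- function(df) df[!duplicated(df$V1), , drop = FALSE]

# merge() performs an inner join by default, so folding it over the list
# keeps only the V1 values present in every dataset.
common <- Reduce(function(x, y) merge(x, y, by = "V1"),
                 lapply(list(df1, df2, df3), dedupe))
print(common)
```

Because merge() matches rows by key rather than by position, there is no need to sort each dataset first, which may be where the manual approach in the question went wrong.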

Related

Selecting and arranging column data in R

I have this data set that requires some cleaning up. Is there a way to code in R so that it picks out the columns with more than 3 different levels? E.g. column C holds the different education levels, and I would like it to be selected along with columns D and F, while columns E and G won't be picked up because they don't meet the more-than-3-levels requirement.
At the same time, I need one of the columns to be arranged in a specific way. E.g. for Education, I would like PHD to be at the top. The other levels of education do not need to be in any particular order.
Sorry, I am really new to R. I attached a snapshot of sample data that I replicated from the original.
All help is greatly appreciated.
It is a bit complicated to replicate the data as it is an image, but you could use this function to select those columns of your data frame that have more than 3 levels.
The function looks only at the columns from column C (the third column) onwards. It counts the distinct values in each of them, keeps the names of the columns with more than 3 levels, and then filters the original data set down to the first two columns plus those columns.
library(dplyr)
select_columns <- function(df) {
  # Consider every column after the first two
  rest <- df[-c(1, 2)]
  # Count the distinct values in each of those columns
  n_levels <- sapply(rest, function(col) length(unique(col)))
  # Names of the columns with more than 3 distinct levels
  keep <- names(rest)[n_levels > 3]
  # Return the first two columns plus the selected ones
  df %>% select(1:2, all_of(keep))
}
select_columns(your_data_frame)
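The second part of the question, putting PHD at the top, could be handled roughly as follows. This is a minimal sketch that assumes the column is called Education and the value is spelled exactly "PHD"; both are guesses, since the original data is only available as an image, and the edu data frame is invented.

```r
# Invented sample data; column names and values are assumptions.
edu <- data.frame(Name = c("Ann", "Bob", "Cy", "Di"),
                  Education = c("Bachelor", "PHD", "Diploma", "PHD"),
                  stringsAsFactors = FALSE)

# The comparison is FALSE for PHD rows and TRUE otherwise. order() sorts
# FALSE before TRUE and keeps ties in their original order, so PHD rows
# move to the top and the rest stay as they were.
edu_sorted <- edu[order(edu$Education != "PHD"), ]
print(edu_sorted)
```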

In R I need to add two elements from two different columns into a third column with a for loop

I have three columns, two of which have numerical values for 10 rows and one open column for the sum of the other two.
I am trying to add c1 + c2 = c3 using a for-loop
Here is one suggestion. First, let's assume that df is your data frame, whose first two columns contain some noise (random numbers from a normal distribution) and whose third column contains only zeros:
df <- data.frame(var1=rnorm(10),var2=rnorm(10),var3=numeric(length = 10))
To get the result you want, you can simply add the columns directly (as mentioned in the comments):
df$var3<-df$var1+df$var2
However, if you want it with the for loop try the following:
for(j in 1:10)
{
df$var3[j] <- df$var1[j]+df$var2[j]
}
print(df)
Hopefully, it will help.
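If the number of rows is not always 10, the loop bound can be taken from the data frame itself. A small sketch of that variation, which also checks that the loop and the vectorised sum agree (the set.seed call is only there to make the run repeatable):

```r
set.seed(42)  # arbitrary seed, for repeatability only
df <- data.frame(var1 = rnorm(10), var2 = rnorm(10), var3 = numeric(10))

# seq_len(nrow(df)) adapts to any number of rows, including zero.
for (j in seq_len(nrow(df))) {
  df$var3[j] <- df$var1[j] + df$var2[j]
}

# The loop should give the same values as the vectorised addition.
same <- isTRUE(all.equal(df$var3, df$var1 + df$var2))
print(same)
```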

match common rows between different dataframes in a new organized df

Can someone help me match three or more different ranked data frames so that the final one contains only the rows common to all of them? I have been trying the match and merge functions, but I cannot get any further.
Here is what the data may look like:
A <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
B <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
C <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
In my real data, however, "letter" is stored as the row.names, and each df has only one column, the numerical "x" holding the ranked values.
There are not many details, but I will try to suggest a basic approach. The function below tests whether the two arguments, taken from dataFrame1 and dataFrame2, match each other. If the test is TRUE, it stores the common value in a new dataFrame3. The index in the square brackets represents the rows that you would like to test.
matching_row <- function(x, y) {
  # Return the shared value when both inputs are identical, otherwise NULL
  if (identical(x, y)) x else NULL
}
dataFrame3 <- matching_row(dataFrame1$x[row], dataFrame2$x[row])
You can modify the function according to the characteristics of your data, for example by adding a loop if the data frames are quite big, or stricter or more flexible logical conditions to test the identity between data frames.
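As an alternative to testing row by row, the data frames can be joined on the shared key. The following is a hedged sketch, not the answer's method: it assumes "letter" is an ordinary column as in the sample data (if it lives in the row names, it would first have to be copied into a column), it draws only 7 of the 10 letters so that the three data frames actually differ, and the x_A, x_B, x_C column names are invented.

```r
set.seed(1)  # arbitrary seed, for repeatability only
A <- data.frame(letter = LETTERS[sample(10, 7)], x = runif(7),
                stringsAsFactors = FALSE)
B <- data.frame(letter = LETTERS[sample(10, 7)], x = runif(7),
                stringsAsFactors = FALSE)
C <- data.frame(letter = LETTERS[sample(10, 7)], x = runif(7),
                stringsAsFactors = FALSE)

dfs <- list(A = A, B = B, C = C)

# Give each "x" column a distinct name so the joined columns stay
# distinguishable.
renamed <- Map(function(df, nm) setNames(df, c("letter", paste0("x_", nm))),
               dfs, names(dfs))

# Fold an inner join over the list: only letters present in every data
# frame survive.
common <- Reduce(function(l, r) merge(l, r, by = "letter"), renamed)
print(common)
```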

Find values of DataFrame A in DataFrame B and replace them with other column's values (create unique identifier for panel study)

I want to find values from one DataFrame A in another DataFrame B and replace them with values from another column of B.
My problem in detail:
I have ten datasets that together make up one panel study. That means that some people have been interviewed more than once and, accordingly, have rows in more than one of those datasets. Each dataset also contains new study members who had not been interviewed before.
Unfortunately, the unique identifier that represents study members has to be deleted. I have to replace it with another unique identifier without losing the panel quality (that is, the same person has to be associated with the same identifier in all datasets they appear in).
My idea was to:
a) load all datasets,
b) create a DataFrame holding an allocation table that maps each old identifier to a newly created identifier, and then
c) "search and replace" identifiers in the original datasets.
The last step doesn't work.
I have two questions:
1) Does anyone know how to do step c)?
2) My way seems cumbersome. Does anyone know a simpler solution?
a) Loading
library(foreign)
#df1 <- read.spss("D1.sav", to.data.frame=TRUE, use.value.labels=FALSE)
#df2 <- read.spss("D2.sav", to.data.frame=TRUE, use.value.labels=FALSE)
#... (for all 10 datasets)
#For the sake of the example two random datasets that also
#include some NA and "overlap"
df1 <- c(NA,NA,NA,seq(100,200))
df2 <- c(NA,NA,NA,seq(150,250))
b) Creating Allocation
(only for unique ids, because people who were interviewed more than once should receive the same allocation in all datasets (panel study))
df <- data.frame(id=c(df1,df2),
alloc=c(df1,df2))
df <- subset(df, !duplicated(df$id))
df$alloc <- 1:dim(df)[1]
c) Overwriting the old identifier with a new one (doesn't work)
Here for the example datasets:
df1 <- ifelse(df1 %in% df[,1], df[,2], df1)
df2 <- ifelse(df2 %in% df[,1], df[,2], df2)
#With the real datasets in this form:
#df1$identifer <- ifelse(df1$identifer %in% df[,1],
#df[,2], df1$identifer)
#df2$identifer <- ifelse(df2$identifer %in% df[,1],
#df[,2], df2$identifer)
#... (for all 10 datasets)
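One likely reason step c) fails is that ifelse() picks df[,2] by position, not by matching id, so element i of df1 receives allocation i rather than the allocation belonging to its own id. A lookup with match() avoids this. The sketch below uses the example vectors from the question under new, hypothetical names (ids1, ids2, lookup, new1, new2), and it assumes that missing ids should stay NA rather than receive an allocation of their own.

```r
# Example id vectors from the question.
ids1 <- c(NA, NA, NA, seq(100, 200))
ids2 <- c(NA, NA, NA, seq(150, 250))

# Allocation table: one row per distinct, non-missing old id.
all_ids <- unique(c(ids1, ids2))
all_ids <- all_ids[!is.na(all_ids)]
lookup <- data.frame(id = all_ids, alloc = seq_along(all_ids))

# match() returns the row of the lookup table for each old id, so every
# element gets the allocation of its own id; NA ids stay NA.
new1 <- lookup$alloc[match(ids1, lookup$id)]
new2 <- lookup$alloc[match(ids2, lookup$id)]

# A person who appears in both datasets (old id 150) gets the same new id.
print(c(new1[ids1 %in% 150], new2[ids2 %in% 150]))
```

With real data frames the same line would be applied to the identifier column of each dataset, for example inside lapply() over a list of the ten datasets.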

Mean of triplicate

I've just cleaned up a data frame that I scraped from an Excel spreadsheet by, among other things, removing percentage signs from some of the numbers (see Removing Percentages from a Data Frame).
The data has twenty-four rows representing the parameters and results from eight experiments done in triplicate, e.g. what one would get from:
DF1 <- data.frame(X = 1:24, Y = 2 * (1:24), Z = 3 * (1:24))
I want to find the mean of each of the triplicates (which, fortunately, are in sequential order) and create a new data frame with eight rows and the same number of columns.
I tried to do this using,
DF2 <- data.frame(replicate(3,sapply(DF1, mean)))
which gave me the mean of each column as rows three times. I wanted to get a dataframe that would give me,
data.frame(X = c(2,5,8,11,14,17,20,23), Y = c(4,10,16,22,28,34,40,46), Z = c(6,15,24,33,42,51,60,69))
which I worked out by hand; it's supposed to be the reduced result.
Thanks, ...
Any help would be gratefully received.
Nice task for codegolf!
aggregate(DF1, list(rep(1:8, each=3)), mean)[,-1]
To be more general, you should replace 8 with nrow(DF1)/3.
... or, my favorite, using matrix multiplication:
t(t(DF1) %*% diag(8)[rep(1:8,each=3),]/3)
This works:
foo <- matrix(unlist(by(data=DF1,INDICES=rep(1:8,each=3),FUN=colMeans)),
nrow=8,byrow=TRUE)
colnames(foo) <- colnames(DF1)
Look at ?by.
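For reference, the aggregate() approach can be written without hard-coding the number of groups. A minimal sketch, assuming the number of rows is an exact multiple of the replicate count; k and groups are names introduced here for illustration:

```r
DF1 <- data.frame(X = 1:24, Y = 2 * (1:24), Z = 3 * (1:24))

k <- 3  # replicates per experiment (assumed to divide nrow(DF1) exactly)
groups <- rep(seq_len(nrow(DF1) / k), each = k)

# Average each block of k consecutive rows, then drop the grouping column.
DF2 <- aggregate(DF1, list(group = groups), mean)[, -1]
print(DF2)
```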
