Merge dataframes with unequal rows and no matching column names in R

I am trying to take df1 (a summary table), and merge it into df2 (master summary table).
This is a snapshot of df2, ignore the random 42, just the answer to the ultimate question.
This is an example of what df1, looks like.
Lastly, I have a vector called Dates. This matches the dates that are the column names for df2.
I am trying to cycle through 20 files and gather the summary statistics of each file. I then want to enter that data into df2 to be stored permanently. I only need to enter the Earned column.
I have tried to use merge but since they do not have shared column names, I am unable to.
My next attempt was to try this. But it gave an error, because of unequal row numbers.
df2[,paste(Dates[i])] <- cbind(df2,df1)
Then I thought that maybe if I specified the exact location, it might work.
df2[1:length(df1$Earned),Dates[i]] <- df1$Earned
But that gave an error: "New columns would leave holes after existing columns"
So then I thought of trying that again, but with cbind.
df2[1:length(df1$Earned),Dates[i]] <- cbind(df2, df1$Earned)
##This gave an error for differing row numbers
df2 <- cbind(df2[1:length(df1$Earned),Dates[i]],df1$earned)
## This "worked" but it replaced all of df2 with df1$earned, so I basically lost the rest of the master table
Any ideas would be greatly appreciated. Thank you.

Something like this might work:
df2[df2$TreatyYear %in% df1$TreatyYear, Dates[i]] <- df1$Earned
Example
df <- data.frame(matrix(NA,4,4))
df$X1 <- 1:4
df[df$X1 %in% c(1,2),c("X3","X4")] <- c(1,2)

The only solution that I have found so far is to force df1$Earned into a vector, then pad the vector with zeros to the exact length of df2. After that I am able to insert the values into df2 by the specific column.
temp_values <- append(df1$Earned,rep(0,(length(df2$TreatyYear)-length(df1$TreatyYear))),after=length(df1$Earned))
df2[,paste(Dates[i])] <- temp_values
This is kind of a roundabout way to fix it, but not a very pleasant way. Any better ideas would be appreciated.
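A possible alternative to padding, sketched under the assumption that both tables carry a TreatyYear key (as the earlier suggestion implies): match() lines the values up by key rather than by position, so rows of df2 with no counterpart in df1 get NA. The data below are hypothetical stand-ins, not the tables from the question.

```r
# Hypothetical stand-ins for the tables in the question
df2 <- data.frame(TreatyYear = 2010:2015)
df1 <- data.frame(TreatyYear = c(2012, 2010, 2014), Earned = c(30, 10, 50))
Dates <- as.Date(c("2020-01-31", "2020-02-29"))
i <- 1

# A Date is numeric underneath, so convert it to a character column name
col <- as.character(Dates[i])

# For each row of df2, find the position of its TreatyYear in df1 (NA if absent)
idx <- match(df2$TreatyYear, df1$TreatyYear)

# Indexing with NA yields NA, so unmatched years are left as NA
df2[[col]] <- df1$Earned[idx]
```

If zeros are preferred over NA, as in the padding approach, the NA entries can be replaced afterwards with df2[[col]][is.na(df2[[col]])] <- 0.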

Related

How to split 1 row into multiple rows where columns hold same value types

I have a wide dataset which makes it really difficult to manipulate the data in the way I need. It looks like the dummy table below:
Dummy_table_unsorted
Essentially, as seen in the table, the information held in one row is at user level: there is a user id, and then all the animals owned by that user are in the same row. I would like the data at animal level instead, so that a user can have multiple entries, one for each of their animals. I have pasted a table below of what I would like it to look like:
Dummy_table_sorted
Is there a simple way to do this? I have an idea as to how, but it is very long-winded. I thought to maybe subset by selected columns relating to one animal only and merge the datasets back together. The problem is that in my data one person can have up to 100 animals, which makes this very long-winded.
Please can someone offer a suggestion or a package/command that would allow me to change this wide dataset into a long one?
Thank You.
First, you should provide data that someone can easily insert into R. Screenshots are not helpful and increase the amount of work a person needs to perform to help you.
The data as you have it should be able to be split, and recombined with bind_rows or rbind. I would subset the data into three dataframes, rename columns, and bind. Assuming your original data is called df
library(dplyr)  # provides bind_rows()
# one data frame per animal, each keeping the user id column
df1 <- df[, 1:4]
df2 <- df[, c(1, 5:7)]
df3 <- df[, c(1, 8:10)]
# rename columns to match
names(df1) <- c('user id', 'animal', 'colour', 'legs')
names(df2) <- c('user id', 'animal', 'colour', 'legs')
names(df3) <- c('user id', 'animal', 'colour', 'legs')
# stack the three pieces into one long data frame
remade <- bind_rows(df1, df2, df3)
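With up to 100 animals per user, writing one subset per animal by hand gets tedious. The same split, rename and bind idea can be looped in base R. This is only a sketch: the column names and the layout (one id column followed by three columns per animal) are assumptions, not the asker's real data.

```r
# Hypothetical wide data: one row per user, three columns per animal
df <- data.frame(
  user_id = c(1, 2),
  animal1 = c("dog", "cat"), colour1 = c("brown", "black"), legs1 = c(4, 4),
  animal2 = c("bird", NA),   colour2 = c("green", NA),      legs2 = c(2, NA),
  stringsAsFactors = FALSE
)

n_fields  <- 3                           # columns per animal
n_animals <- (ncol(df) - 1) / n_fields   # number of animal groups

# Build one data frame per animal group, all with the same column names
pieces <- lapply(seq_len(n_animals), function(k) {
  cols  <- 1 + (k - 1) * n_fields + seq_len(n_fields)  # columns of the k-th animal
  piece <- df[, c(1, cols)]
  names(piece) <- c("user_id", "animal", "colour", "legs")
  piece
})

# Stack the pieces and drop the empty animal slots
long <- do.call(rbind, pieces)
long <- long[!is.na(long$animal), ]
```

Each user now appears once per animal they own; users with fewer animals than the maximum simply contribute fewer rows.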

How can I get the column/variable names of a dataframe that fit certain parameters?

I came across a problem in my DataCamp exercise that basically asked "Remove the column names in this vector that are not factors." I know what they -wanted- me to do, and that was to simply do glimpse(df) and manually delete elements of the vector containing the column names, but that wasn't satisfying for me. I figured there was a simple way to store the column names of the dataframe that are factors into a vector. So, I tried two things that ended up working, but I worry they might be inefficient.
Example data Frame:
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))
My first solution was this:
vector1 <- names(select_if(df1, is.factor))
This worked, but select_if returns an entire tibble of a filtered dataframe and then gets the column names. Surely there's an easier way...
Next, I tried this:
vector2 <- colnames(df1)[sapply(df1,is.factor)]
This also worked, but I wanted to know if there's a quicker, more efficient way of filtering column names based on their type and then storing the results as a vector.
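For comparison, here are two base R variants of the second approach. This is a sketch on the example data above, not a benchmark: vapply() is like sapply() but declares the expected result type, and Filter() keeps only the columns for which the predicate is TRUE.

```r
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))

# vapply() guarantees a logical vector with one entry per column
factor_cols <- names(df1)[vapply(df1, is.factor, logical(1))]

# Filter() returns the data frame restricted to the factor columns
factor_cols2 <- names(Filter(is.factor, df1))
```

Both return a character vector of the factor column names without loading dplyr.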

R: Leaving out undefined columns when subsetting a dataframe

I have a dataframe df1 with 300+ Columns, and I am trying to subset it to df2 by column names based on a couple dozen inputs in list1. However, some of the items in list1 are not actually column names in df1. So, I get an "undefined columns" error:
df2 <- df1[,list1]
Error in `[.data.frame`(df1, , list1) :
undefined columns selected
I realize the reason for this error. BUT I can't go through list1 and see which ones occur in df1 and which ones don't. Because I have to do this many many times with many many different lists. Is there a way to simply ignore those (and not include those null column name values from list1 in the subset df2)?
Thanks for your kind help.
How about something like: df2 <- df1[,list1[list1 %in% names(df1)]]
(I was hoping for a nicer/shorter solution for my own use.)
Here is another:
df1[,intersect(names(df1), list1)]
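A small worked example of the intersect() idea, using made-up data. Putting list1 first keeps the columns in the order requested, and setdiff() reports which names were skipped, which may help when repeating this over many lists.

```r
# Hypothetical data: "zzz" is requested but does not exist in df1
df1   <- data.frame(a = 1:3, b = 4:6, c = 7:9)
list1 <- c("c", "zzz", "a")

# Keep only the requested names that are real columns, in list1's order
keep <- intersect(list1, names(df1))
df2  <- df1[, keep, drop = FALSE]   # drop = FALSE keeps a data frame even for one column

# Names that were asked for but are not columns of df1
missing_cols <- setdiff(list1, names(df1))
```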

Replace only a part of a subsetted vector

Suppose I've got a data frame called someMatrix. In it I want to replace only the first three rows of the fourth column.
I came up with this idea.
(someMatrix[,4])[1:3] <- replacement
but I get the following error: could not find function "(<-"
Any idea how I could solve this?
Thanks!
You may subset with brackets as many times as you want, without bothering with parentheses:
a <- cbind(rnorm(10), rnorm(10))
a[1:5, ][2:3, ][, 2][1]
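Applied to the assignment in the question, a minimal sketch (the data frame contents and the replacement values are made up): give both the row and the column index in a single bracket, or chain the brackets without the parentheses.

```r
# Hypothetical 5 x 4 data frame of zeros and three replacement values
someMatrix  <- data.frame(matrix(0, nrow = 5, ncol = 4))
replacement <- c(10, 20, 30)

# Rows 1 to 3 of column 4 in one indexing step
someMatrix[1:3, 4] <- replacement

# The chained form also works as an assignment target once the parentheses are gone
other <- data.frame(matrix(0, nrow = 5, ncol = 4))
other[, 4][1:3] <- replacement
```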

How can I use the row.names attribute to order the rows of my dataframe in R?

I created a random forest and predicted the classes of my test set, which are living happily in a dataframe:
row.names class
564028 1
275747 1
601137 0
922930 1
481988 1
...
The row.names attribute tells me which row is which, before I did various operations that scrambled the order of the rows during the process. So far so good.
Now I would like to get a general feel for the accuracy of my predictions. To do this, I need to take this dataframe and reorder it in ascending order according to the row.names attribute. This way, I can compare the observations, row-wise, to the labels, which I already know.
Forgive me for asking such a basic question, but for the life of me, I can't find a good source of information regarding how to do such a trivial task.
The documentation implores me to:
use attr(x, "row.names") if you need to retrieve an integer-valued set of row names.
but this leaves me with nothing but NULL.
My question is, how can I use row.names which has been loyally following me around in the various incarnations of dataframes throughout my workflow? Isn't this what it is there for?
None of the other solutions would actually work.
It should be:
# Assuming the data frame is called df
df[ order(as.numeric(row.names(df))), ]
because row names in R are character strings. Without the as.numeric part, the rows are arranged lexically: 1, 10, 11, ... and so on.
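To see the difference, here is a small sketch with made-up row names:

```r
# Hypothetical predictions whose row names are numbers stored as text
df <- data.frame(class = c(1, 1, 0, 1), row.names = c("10", "2", "100", "1"))

# Character ordering compares the names as text
lexical <- df[order(row.names(df)), , drop = FALSE]

# Converting to numeric first gives the ascending numeric order
numeric_order <- df[order(as.numeric(row.names(df))), , drop = FALSE]

row.names(lexical)        # "1" "10" "100" "2"
row.names(numeric_order)  # "1" "2" "10" "100"
```

The drop = FALSE is only there because this example has a single column.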
This worked for me:
new_df <- df[ order(row.names(df)), ]
If you have only one column in your dataframe like in my case you have to add drop=F:
df[ order(rownames(df)) , ,drop=F]
For completeness:
@BondedDust's answer works perfectly for the rownames attribute, but your example does not use the rownames attribute. The output provided in your question indicates use of a column named "row.names", which isn't the same thing (all listed in @BondedDust's comment). Here would be the answer if you wished to sort by the "row.names" column in the example given in your question (there is another posting on this, located here). This answer assumes you are using a dataframe named "df", with one column named "row.names":
ordered.df <- df[order(df$row.names),] #this orders the df by the "row.names" column
Alternatively, to order by the first column (same thing if you're still using your example):
ordered.df <- df[order(df[,1]),] #this orders the df by the first column
Hope this is helpful!
This will be done almost automatically since the "[" function will display in lexical order of any vector that can be matched to rownames():
df[ rownames(df) , ]
You might have thought it would be necessary to use:
df[ order(rownames(df)) , ]
But that would have given you an ordering of 1:100 of 1,10,100, 12,13, ...,2,20,21, ... , because the argument to "[" gets coerced to character.
Assuming your data frame is named 'df', you can create a new ordered data frame 'ord.df' that will contain the row names of df as well as its values in the following one line of code:
ord.df <- cbind(rownames(df)[order(rownames(df))], df[order(rownames(df)), ])
new_df <- df[ order(row.names(df)), ]
or something similar won't work. After this statement, the new_df does not have a rowname any more. I guess a better solution is to add a column as rowname, sort by it, and set it as the rowname
You can simply sort your df by using this:
df <- df[sort(rownames(df)),]
and then do what you want!
