Merge is duplicating rows in R

I have two data sets with country names in common.
As the screenshots show, both data sets have a two-letter country code formatted the same way.
After running this code:
merged <- merge(aggdata, Trade, by = "Group.1", all.x = TRUE, all.y = TRUE)
I get the following result
Rather than having two rows with the same country code, I'd like them to be combined.
Thanks!

I strongly suspect that the Group.1 strings in one or the other of your data frames have one or more trailing spaces, so they appear identical when viewed but are not. An easy way of visually checking whether they are the same:
levels(as.factor(Trade$Group.1))
levels(as.factor(aggdata$Group.1))
If the problem does turn out to be trailing spaces and you are using R 3.2.0 or higher, try:
Trade$Group.1 <- trimws(Trade$Group.1)
aggdata$Group.1 <- trimws(aggdata$Group.1)
Even better, if you are using read.table (or similar) to read in your data, pass the parameter strip.white = TRUE.
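A minimal sketch of the failure mode and the trimws fix, with invented data (a stray trailing space on "DE"):

```r
# "DE " and "DE" print identically in many views but are different merge keys
aggdata <- data.frame(Group.1 = c("AT", "DE "), CASEID = c(1, 2))
Trade   <- data.frame(Group.1 = c("AT", "DE"),  trade  = c(10, 20))

# The full outer join yields 3 rows: "DE " and "DE" each match nothing
nrow(merge(aggdata, Trade, by = "Group.1", all.x = TRUE, all.y = TRUE))  # 3

# After trimming, the keys line up and we get one row per country
aggdata$Group.1 <- trimws(aggdata$Group.1)
nrow(merge(aggdata, Trade, by = "Group.1", all.x = TRUE, all.y = TRUE))  # 2
```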
For future reference, it would be better to post at least a sample of your data rather than a screenshot.

The following works for me:
aggdata <- data.frame(Group.1 = c("AT", "BE"), CASEID = c(1587.6551, 506.5),
                      ISOCNTRY = c(NA, NA), QC17_2 = c(2.0, 1.972332),
                      D70 = c(1.787440, 1.800395))
Trade <- data.frame(Group.1 = c("AT", "BE"), trade = c(99.77201, 100.10685))
merged <- merge(aggdata, Trade, by = "Group.1", all.x = TRUE, all.y = TRUE)
I had to transcribe your data by hand from your screenshots, so I only did the first two rows. If you could paste in a full sample of your data, that would be helpful. See here for some guidelines on producing a reproducible example: https://stackoverflow.com/a/5963610/236541

Related

Issue with duplicate last names/cannot find object

Recent Excel graduate trying to transition to R, so I am very new to this.
I am trying to create a player-based sports model. However, when I print the result of the code I have written so far, R conflates players with the same last name (using dplyr). Essentially it has created two columns (player_last_name.x and player_last_name.y) and has merged those players' stats. My first thought was to merge the first- and last-name columns into one, but I'm not sure how R handles merging categorical data.
Also, R seems unable to find my third variable, season_TOG.
Any help would be appreciated.
Thanks.
disp <- playerdata %>%
group_by(player_first_name, player_last_name)%>%
summarise(season_disposals = sum(disposals))%>%
games <- playerdata %>%
group_by(player_first_name, player_last_name) %>%
summarise(season_game_count = n_distinct(match_round))%>%
TOG <- playerdata %>%
group_by(player_first_name, player_last_name)%>%
summarise(season_TOG = sum(time_on_ground_percentage))%>%
PropModel_df <- merge(disp, games, TOG, by="player_first_name", "player_last_name")%>%
PropModel_df <- transform(PropModel_df, avg_disp = season_disposals/season_game_count)%>%
PropModel_df <- transform(PropModel_df, avg_TOG = season_TOG/season_game_count)%>%
print(PropModel_df)

Error in eval(substitute(list(...)), `_data`, parent.frame()) :
  object 'season_TOG' not found
There are at least three clear issues here.
Your code does not parse: you have extra %>% at several points. It may be that this is just an artifact of your question, and you trimmed some otherwise unnecessary portions of your code but didn't clean up your pipes. In that case, thank you for reducing your code, but please run the reduced code before posting it in the question.
merge accepts exactly two frames to join, so your
PropModel_df <- merge(disp, games, TOG, by="player_first_name", "player_last_name")
will fail for that reason. You'll need to merge the first two (merge(disp, games, by = ...)) and then merge that result with TOG.
When you join on multiple fields, you need to pass them as a single vector. Your code (adjusted for #2):
PropModel_df <- merge(disp, games, by="player_first_name", "player_last_name")
should be
PropModel_df <- merge(disp, games, by = c("player_first_name", "player_last_name"))
Further detail: when arguments are provided without names, they are assigned by position. Because merge arguments are
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
incomparables = NULL, ...)
these are the effective argument names for your call:
merge(x = disp, y = games, by = "player_first_name", by.x = "player_last_name")
which is (I believe) not what you intend.
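Putting #2 and #3 together, here is a sketch of the corrected merge pipeline. The column names are taken from the question; the one-player data frames are invented stand-ins for the summarised frames:

```r
keys <- c("player_first_name", "player_last_name")

# Invented stand-ins for the summarised frames in the question
disp  <- data.frame(player_first_name = "John", player_last_name = "Smith",
                    season_disposals = 50)
games <- data.frame(player_first_name = "John", player_last_name = "Smith",
                    season_game_count = 5)
TOG   <- data.frame(player_first_name = "John", player_last_name = "Smith",
                    season_TOG = 400)

# merge() takes two frames at a time, so chain two calls on the same key vector
PropModel_df <- merge(merge(disp, games, by = keys), TOG, by = keys)
PropModel_df <- transform(PropModel_df,
                          avg_disp = season_disposals / season_game_count,
                          avg_TOG  = season_TOG / season_game_count)
```

Because both frames share the full key vector, no .x/.y suffix columns are produced and players with the same last name stay distinct.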

Why am I getting "Error: Can't bind data because some arguments have the same name"? [duplicate]

# use read.table to create data frames from the unzipped files below
x.train <- read.table("UCI HAR Dataset/train/X_train.txt")
subject.train <- read.table("UCI HAR Dataset/train/subject_train.txt")
y.train <- read.table("UCI HAR Dataset/train/y_train.txt")
x.test <- read.table("UCI HAR Dataset/test/X_test.txt")
subject.test <- read.table("UCI HAR Dataset/test/subject_test.txt")
y.test <- read.table("UCI HAR Dataset/test/y_test.txt")
features <- read.table("UCI HAR Dataset/features.txt")
activity.labels <- read.table("UCI HAR Dataset/activity_labels.txt")
colnames(x.test) <- features[,2]
dataset_test <- cbind(subject.test,y.test,x.test)
colnames(dataset_test)[1] <- "subject"
colnames(dataset_test)[2] <- "activity"
test <- select(features, V2)
dataset_test <- select(dataset_test,subject,activity)
Error: Can't bind data because some arguments have the same name
features is a two column dataframe with the second columns containing
the names for x.test
subject.test is a single column data frame
y.test is a single column data frame
x.test is a wide data frame
After naming and binding these data frames I tried to use dplyr::select to pick out certain columns. However, the dataset_test line gives an error:
"Error: Can't bind data because some arguments have the same name"
However, test does not return an error and properly filters. Why is there the difference in behaviour?
The data I am using can be downloaded online. The data sources correspond to the variable names, except that "_" is used instead of ".".
dput
> dput(head(x.test[,1:5],2))
structure(list(V1 = c(0.25717778, 0.28602671), V2 = c(-0.02328523,
-0.013163359), V3 = c(-0.014653762, -0.11908252), V4 = c(-0.938404,
-0.97541469), V5 = c(-0.92009078, -0.9674579)), row.names = 1:2, class = "data.frame")
> dput(head(subject.test,2))
structure(list(V1 = c(2L, 2L)), row.names = 1:2, class = "data.frame")
> dput(head(y.test,2))
structure(list(V1 = c(5L, 5L)), row.names = 1:2, class = "data.frame")
> dput(head(features,2))
structure(list(V1 = 1:2, V2 = c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y"
)), row.names = 1:2, class = "data.frame")
I had exactly the same problem and I think I'm looking at the same dataset as you. It's motion sensor data from a smart phone, isn't it?
The problem is exactly what the error message says! That dang data set has duplicate column names. Here's how I explored it. I couldn't run your dput commands, so I couldn't try out your data; I'm showing my own code and results. I suggest you substitute your variable, dataset_test, where I have samsungData.
Here's the error. If you just select the dataset, but don't indicate the columns, the error message identifies the duplicates.
select(samsungData)
That gave me this error, which is just what your own dplyr error was trying to tell you.
Error: Columns "fBodyAcc-bandsEnergy()-1,8", "fBodyAcc-bandsEnergy()-9,16", "fBodyAcc-bandsEnergy()-17,24", "fBodyAcc-bandsEnergy()-25,32", "fBodyAcc-bandsEnergy()-33,40", ... must have a unique name
Then I wanted to see where that first column was duplicated. (I don't think I'll ever work well with regular expressions, but this one made me mad and I wanted to find it.)
has_dupe_col <- grep("fBodyAcc\\-bandsEnergy\\(\\)\\-1,8", names(samsungData))
names(samsungData)[has_dupe_col]
Results:
[1] "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8"
That showed me that the same column name appears in three positions. That won't play nicely in dplyr.
Then I wanted to see a frequency table for all the column names and call out the duplicates.
names_freq <- as.data.frame(table(names(samsungData)))
names_freq[names_freq$Freq > 1, ]
A bunch of them appear three times! Here are just a few.
Var1 Freq
9 fBodyAcc-bandsEnergy()-1,16 3
10 fBodyAcc-bandsEnergy()-1,24 3
11 fBodyAcc-bandsEnergy()-1,8 3
Conclusion:
The tool (dplyr) isn't broken, the data is defective. If you want to use dplyr to select from this dataset, you're going to have to locate those duplicate column names and do something about them. Maybe you change the column name (dplyr's mutate will do it for you without grief). On the other hand, maybe they're supposed to be duplicated and they're there because they're a time series or some iteration of experimental observations. Maybe then what you need to do is merge those columns into one and provide another dimension (variable) to distinguish them.
That's the analysis part of data analysis. You'll have to dig into the data to see what the right answer is. Either that, or the question you're trying to answer need not even include those duplicate columns, in which case you throw them away and sleep peacefully.
Welcome to data science! At best, it's just 10% cool math and machine learning. 90% is putting on gloves and a mask and wiping up crap like this in your data.
I recently ran into this same problem with a different data set. My tidyverse solution to identifying duplicate column names in the dataframe (df) was:
tibble::enframe(names(df)) %>% count(value) %>% filter(n > 1)
This error is often caused by a data frame having columns with identical names; that should be the first thing to check. I tried to inspect my own data frame with dplyr's select helpers (starts_with, contains, etc.), but even those won't work on duplicated names, so you may need to export to a CSV and check in Excel or another program, or use base functions to check for duplicate column names.
Another possibility to find duplicate column names using Base R would be using duplicated:
colnames(df)[which(duplicated(colnames(df)))]
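If you decide the duplicated columns should survive as separate variables, one low-tech sketch is to make the names unique with base R's make.unique(), after which dplyr will accept the frame (toy data here, not the Samsung set):

```r
# Toy frame with a deliberately duplicated column name
df <- data.frame(a = 1, a = 2, b = 3, check.names = FALSE)

colnames(df)[duplicated(colnames(df))]  # "a" -- the offending duplicate

# Rename duplicates by appending .1, .2, ... suffixes
names(df) <- make.unique(names(df))
names(df)                               # "a" "a.1" "b" -- now selectable
```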

How to append rows of data frame to the first one in R?

I have a data frame that I am trying to condense from multiple rows into one row. The data set is fairly large, but I am starting with a small subset. Here I want to turn 2 rows into 1, with the second row's information following the first row's.
The original problem was that I had a column of data that I need to "flatten" so that I can use the bits and pieces. The column is in JSON format.
"[{\"task\":\"T0\",\"task_label\":\"Did any birds visit the feeding platform or bird feeders?\",\"value\":\"**Yes**—but there were no displacements. Next, enter all of the birds you see at the feeders. \"},{\"task\":\"T1\",\"value\":[{\"choice\":\"EUROPEANSTARLING\",\"answers\":{\"WHATISTHELARGESTNUMBEROFINDIVIDUALSTHATYOUSAWSIMULTANEOUSLY\":\"4\"},\"filters\":{}},{\"choice\":\"MOURNINGDOVE\",\"answers\":{\"WHATISTHELARGESTNUMBEROFINDIVIDUALSTHATYOUSAWSIMULTANEOUSLY\":\"2\"},\"filters\":{}}]},{\"task\":\"T6\",\"task_label\":\"Is it actively precipitating (rain or snow)?\",\"value\":[\"Yes.\"]}]"
So I used code developed by another coder to "flatten" this out by task. Then, I want to join it back up so that I have one line of information for each classification.
Currently I have merged tasks T0 and T4, but I still need to merge the result with another task, T5. To do that, I first need to reduce the merged T0/T4 data to one row. So right now I'm working with a small subset of the data and have a table that essentially looks like this:
x <- data.frame("subject_ids" = c(19232716, 19232716), "classification_id" = c(120545061,120545061), "task_index.x" = c(1,1),
"task.x" = c("TO","TO"), "value" = c("Displacement","Displacement"), "task_index.y"=c(2,5), "task.y"= c("T4, T4","T4"),
"total.species"=c("2,2","1"), "choice" = c("MOURNINGDOVE, COMMONGRACKLE","MOURNINGDOVE"), "S_T"=c("Target,Target","Target,Source"))
but I want it to look like this:
y <- data.frame("subject_ids" = c(19232716), "classification_id" = c(120545061), "task_index.x" = c(1),
"task.x" = c("TO"), "value" = c("Displacement"), "task_index.y"=c(2), "task.y"= "T4, T4",
"total.species"=c("2,2"), "choice" = c("MOURNINGDOVE, COMMONGRACKLE"), "S_T"=c("Target,Target"),
"task_index.y"=c(5), "task.y"= "T4",
"total.species"=c("1"), "choice" = c("MOURNINGDOVE"), "S_T"=c("Target,Source"))
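For what it's worth, one base-R sketch that gets from x to that one-row shape: number the repeated rows within each classification_id, then reshape() to wide. The resulting column names come out as task_index.y.1, task.y.2, etc., rather than exact duplicates, which is generally easier to work with downstream:

```r
x <- data.frame("subject_ids" = c(19232716, 19232716),
                "classification_id" = c(120545061, 120545061),
                "task_index.x" = c(1, 1), "task.x" = c("TO", "TO"),
                "value" = c("Displacement", "Displacement"),
                "task_index.y" = c(2, 5), "task.y" = c("T4, T4", "T4"),
                "total.species" = c("2,2", "1"),
                "choice" = c("MOURNINGDOVE, COMMONGRACKLE", "MOURNINGDOVE"),
                "S_T" = c("Target,Target", "Target,Source"))

# Number each row within its classification, then spread to one wide row.
# Columns listed in idvar stay constant; everything else becomes
# <name>.1, <name>.2, ... per value of timevar.
x$seq <- ave(seq_len(nrow(x)), x$classification_id, FUN = seq_along)
y_wide <- reshape(x, direction = "wide", timevar = "seq",
                  idvar = c("subject_ids", "classification_id",
                            "task_index.x", "task.x", "value"))
```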

Convert a dataframe to a character array?

I am trying to convert a dataframe to a character array in R.
This works, but the text file only contains about 83 records:
data <- readLines("https://www.r-bloggers.com/wp-content/uploads/2016/01/vent.txt")
df <- data.frame(data)
textdata <- df[df$data, ]
This does not work. Maybe because it has 3k records?
trump_posts <- read.csv(file="C:\\Users\\TAFer\\Documents\\R\\TrumpFBStatus1.csv",
sep = ",", stringsAsFactors = TRUE)
trump_text <- trump_posts[trump_posts$Facebook.Status, ]
All I know is I have a dataframe called trump posts. The frame has a single column called Facebook.Status. I just wanted to turn it into a character array so I can run an analysis on it.
Any help would be very much appreciated.
Thanks
If Facebook.Status is a character vector you can directly perform your analysis on it.
Or you can try:
trump_text <- as.character(trump_posts$Facebook.Status)
I think you are somehow confusing data.frame syntax with data.table syntax. For a data.frame you'd reference a column as df$col; for a data.table it is closer to what you wrote, dt[, col] or dt[, dt$col]. Also, if you want a character vector right away, set stringsAsFactors = FALSE in your read.csv. Otherwise you'll need an extra conversion, for example dt[, as.character(col)] or as.character(df$col).
And on a side note, size of vector is almost never an issue, unless you hit the limits of your hardware.

cca per groups and row.names

I am somewhat new to R, so forgive my basic questions.
I perform a CCA on a full dataset (358 sites, 40 abiotic parameters, 100 species observation).
library(vegan)
env <- read.table("env.txt", header = TRUE, sep = "\t", dec = ",")
otu <- read.table("otu.txt", header = TRUE, sep = "\t", dec = ",")
cca <- cca(otu~., data=env)
cca.plot <- plot(cca, choices=c(1,2))
vif.cca(cca)
ccared <- cca(otu ~ EnvPar1 + EnvPar2, data = env)  # reduced set of environmental parameters
ccared.plot <- plot(ccared, choices=c(1,2))
orditorp(ccared.plot, display="sites")
This works without sample names in the first column (initially, the first column containing numeric sample names got interpreted as a variable, so I used tables without that information; when I add site names to the plot via orditorp, it prints "row.name=n" in the plot).
I want to use my actual sample names, however. I tried row.names = 1 on both tables that contain the sample-name information:
envnames <- read.table("envwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
otunames <- read.table("otuwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
, and every combination of env/otu/envnames/otunames. cca() worked in every case, but any plot command yielded:
plot.ccarownames <- plot(cca(ccarownames, choices=c(1,2)))
Error in rowSums(X) : 'x' must be numeric
My second problem is connected to that: the 358 sites are grouped into 6 groups (4x60, 2x59). The complete matrix includes this information as an extra column.
Since I couldn't work out the row-name problem, I am even more stuck on the nominal data.
The original matrix contains a first column (sample names, numeric, but easily transformed to nominal) and a second column (group identity, nominal), followed by the biological observations.
What i would like to have:
A CCA containing all six groups that is coloring sites per group.
A CCA containing only data for one group (without manually constructing individual input tables)
CCA plots that are using my original sample names.
Any help is appreciated! Really, I have been stuck on this since yesterday morning :/
I'm using cca() from vegan myself and have run into some of the same problems, but I've at least been able to solve your original "row names" problem. I'm doing a CCA analysis on data from 41 soils, with 334 species and 39 environmental factors.
In my case I used
rownames(MyDataSet) <- MyDataSet$ObservationNamesColumn
(I used default names such as MyDataSet for the sake of example here)
However, I still had environmental factors which weren't numerical (such as soil texture). You could check for non-numerical columns in case you have a mistake in your original dataset, or an abiotic factor which is not interpreted as numerical for some other reason. To do this you can use str(MyDataSet), which tells you the class of each of your variables, or lapply(MyDataSet, class), which gives the same information in a different output format.
In case you have abiotic factors which are not numerical (again, such as texture) and you want to remove them, you can do so by creating a whole new dataset using only the numerical variables (you will still keep your observation names as they were defined as row names), this is rather easy to do and can be done using something similar to this:
MyDataSet.num <- MyDataSet[,sapply(MyDataSet, is.numeric)]
This creates a new data set which has the same rows as the original but only columns (variables) with numeric values. You should be able then to continue your work using this new data set.
I am very new to both R programming and statistics (I'm a microbiologist) but I hope this helps!
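For the colour-by-group part of the question, which isn't covered above: vegan's plot() can draw an empty ordination frame and points() can then colour sites by a factor. A sketch using vegan's bundled example data; the grouping factor here is invented, so substitute your real group column:

```r
library(vegan)

data(varespec)   # example community matrix shipped with vegan
data(varechem)   # matching environmental data

ord <- cca(varespec ~ ., data = varechem)

# Invented grouping factor -- replace with your own group column
grp <- factor(rep(c("A", "B", "C"), length.out = nrow(varespec)))

plot(ord, choices = c(1, 2), type = "n")                 # empty frame
points(ord, display = "sites", col = as.integer(grp), pch = 16)
legend("topright", legend = levels(grp),
       col = seq_along(levels(grp)), pch = 16)
```

For a CCA restricted to a single group without hand-building input tables, the same logical index can subset both tables, e.g. cca(varespec[grp == "A", ] ~ ., data = varechem[grp == "A", ]).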
