I would like to know how to combine duplicated/repeated rows into one. I am currently working with an OTU table with taxonomy, which is the result of merging five different sequencing runs. Each taxon (rows) has a number of counts per sample (columns).
The problem is that now I have multiple duplicated/repeated taxa by merging and keeping all the results together. For example:
What I would like to do is combine/summarize those repeated taxa into a single row. I do not want to remove the data, I just want to combine it. I have seen similar posts in the forum, but most just want to delete the duplicates. I am not sure how to proceed; any help will be appreciated!
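One way to combine repeated taxa without losing counts (a minimal sketch, assuming the OTU table is a data frame with a `Taxonomy` column and numeric per-sample count columns; all names and values here are made up) is to sum the counts within each taxonomy group:

```r
# Toy OTU table: two rows share the same taxonomy (hypothetical data)
otu <- data.frame(
  Taxonomy = c("Bacteria;Firmicutes", "Bacteria;Firmicutes",
               "Bacteria;Proteobacteria"),
  Sample1 = c(10, 5, 3),
  Sample2 = c(2, 8, 1)
)

# Sum every numeric column within each Taxonomy value (base R)
combined <- aggregate(. ~ Taxonomy, data = otu, FUN = sum)
```

After this, `combined` has one row per taxon, with the Firmicutes counts summed (15 and 10 here) rather than deleted. With dplyr the same idea would be `group_by(Taxonomy) %>% summarise(across(everything(), sum))`.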
I have recently received output from an online survey (ESRI Survey123) that stores each recorded attribute as a new column of the table. The survey reports characteristics of single trees located on the study site, e.g. beech1, beech2, etc. For each beech, several attributes are recorded, such as height, shape, etc.
This is what the output table looks like in Excel. ID simply represents the site number:
Now I wonder: how can I read those data into R and make sure that columns 1:3 belong to beech1, columns 4:6 represent beech2, etc.? I am looking for something that would paste beech1 into the names of the following columns: beech1.height, beech1.shape. But I am not sure how to do it.
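One approach (a sketch, assuming the sheet has a first header row where the tree name appears only above its first attribute and a second row with the attribute names; the names below are hypothetical) is to carry each tree name forward and paste it onto the attribute names:

```r
# The two header rows as read from the file (made-up example values)
tree  <- c("ID", "beech1", "",      "",       "beech2", "",      "")
attrb <- c("",   "height", "shape", "damage", "height", "shape", "damage")

# Carry each tree name forward over the blank cells that follow it
# (assumes the first cell is never blank)
for (i in seq_along(tree)) {
  if (tree[i] == "") tree[i] <- tree[i - 1]
}

# Combine tree name and attribute into one column name, e.g. "beech1.height"
new_names <- ifelse(attrb == "", tree, paste(tree, attrb, sep = "."))
```

You would then read the data with `header = FALSE, skip = 2` and assign `names(dat) <- new_names`.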
I'm looking for help with a problem I'm trying to solve in R.
I have DNA alignment results in a big dataset (more than 200,000 rows and 20 columns), and I want to clean it: delete the non-specific sequences and end up with just one DNA sequence name per species.
I've tried the unique(), duplicated(), and distinct() functions, but they always keep the first of the duplicate rows, and I don't want that; I would like to delete ALL the duplicate rows.
Do you have an idea how to solve my problem?
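One way to drop every row whose key appears more than once, rather than keeping the first occurrence (a sketch with made-up column names and data), is to flag duplicates in both directions with duplicated():

```r
# Toy data: species A and C each have two sequences, B and D have one
df <- data.frame(
  species = c("A", "A", "B", "C", "C", "D"),
  seq     = c("s1", "s2", "s3", "s4", "s5", "s6")
)

# TRUE for every row whose species occurs more than once:
# duplicated() flags repeats scanning forward, fromLast = TRUE flags
# them scanning backward, so the OR catches the first occurrence too
dups <- duplicated(df$species) | duplicated(df$species, fromLast = TRUE)

clean <- df[!dups, ]  # keeps only species B and D
```

This differs from `unique()`/`distinct()`, which keep one representative of each duplicated group instead of removing the whole group.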
Turns out I shouldn't have trusted the source of my data. They left duplicate observations and didn't clean the data as well as I assumed. So this question is moot.
I am attempting to merge two data frames. I've done this many times in the past with great success (after weeding out typos). I've been beating my head against the wall with this one. I cannot find the issue. One file has only 6 columns, 4 of which are repeated in the larger file. I need to merge by unique combinations of these 4 columns. For instance, Plant 1 at Transect A at Site X in year 2014 should have only 1 row. Each Transect and Site have unique prefixes assigned to each plant, but I need to subset out by these 4 columns later, so I want to maintain them.
I've tried both cbind() and merge(). In merge() I've also used all = TRUE or all = FALSE, since I know some of the rows are populated by NAs only and don't add anything to my analyses.
dat = cbind(dens, df)
dat = merge(dens, df, by = c("Year", "site", "transect", "PlantID"))
or
dat = merge(dens, df, by = c("PlantID", "Year", "site", "transect"), all = FALSE)
These data files are both just over 7,000 observations long. But when I cbind or merge, I get the same result: a data frame well over 10,000 observations. I've looked at the output, and a good number of the individuals have been quadrupled. I'm sure it's something very simple that I've missed, but at this point I need fresh and knowledgeable eyes.
Here is a link to the two data files on Google Drive.
https://drive.google.com/drive/folders/1JQXSadqxQBOXM5AAOFAr-BmuoX9TXKXh?usp=sharing
A couple of things: when you merge, you usually use only one primary key, as merging on multiple keys can be prone to issues. From your description it sounds like the keys you are using are not the same. For instance, one dataset has column Col1 and the other has col1, or worse, they are different data types but appear the same on screen. Maybe try taking a small subset of your datasets and merging those before throwing the whole process at it and being surprised it doesn't work.
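A quick way to check whether non-unique keys are multiplying the rows (a sketch with made-up data; the real key columns are PlantID, Year, site, transect) is to count duplicated key combinations before merging:

```r
# Toy frames where the key (PlantID, Year) is NOT unique in either
dens <- data.frame(PlantID = c("X1", "X1", "X2"),
                   Year    = c(2014, 2014, 2014),
                   dens_val = c(0.1, 0.2, 0.3))
df   <- data.frame(PlantID = c("X1", "X1", "X2"),
                   Year    = c(2014, 2014, 2014),
                   height  = c(5, 6, 7))

keys <- c("PlantID", "Year")

# If either count is > 0, merge() will cross-multiply the matching rows
sum(duplicated(dens[keys]))  # 1 here: X1/2014 appears twice
sum(duplicated(df[keys]))    # 1 here as well

# X1/2014 occurs 2 x 2 times, so the merge yields 4 + 1 = 5 rows
nrow(merge(dens, df, by = keys))
```

Two duplicated keys on each side quadrupling a row is exactly the symptom described in the question.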
I hope this has not been answered, but when I search for a solution to my problem I am not getting any results.
I have a data.frame of 2000+ observations and 20+ columns. Each row represents a different observation and each column represents a different facet of data for that observation. My objective is to iterate through the data.frame and select observations that match criteria (e.g., I am trying to pick out observations that are in certain states). After this, I need to subtract or add time to convert each observation to its appropriate time zone (all of the times are in CST). What I have so far is an exorbitant number of subsetting commands that pick out the rows for the state being checked against. When I try to write a for loop, I can only get one value returned, not the whole row.
I was wondering if anyone had any suggestions or knew of any functions that could help me. I've tried just about everything, but I really don't want to have to go through each state of observations and modify the time. I would prefer a loop that could easily go through the data, select rows based on their state, subtract or add time, and then place the row back into its original data.frame (replacing the old value).
I appreciate any help.
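This kind of per-state adjustment usually needs no loop at all: a named lookup vector can be indexed by the state column in one vectorized step. A minimal sketch, assuming a `state` column and a POSIXct `time` column in CST (column names, states, and offsets here are illustrative):

```r
# Toy data: all timestamps recorded in CST
obs <- data.frame(
  state = c("TX", "NY", "CA"),
  time  = as.POSIXct(rep("2020-01-01 12:00:00", 3),
                     tz = "America/Chicago")
)

# Hours to add to CST to get each state's local standard time
# (hypothetical subset; a real table would cover all states observed)
offset <- c(TX = 0, NY = 1, CA = -2)

# Vectorized: look up each row's offset by state and shift the time
obs$local_time <- obs$time + offset[obs$state] * 3600
```

Each row gets its own shift in one assignment, so there is no need to subset by state, modify, and splice rows back in. (For real time-zone handling, including daylight saving, converting via `format(time, tz = ...)` with proper Olson zone names per state would be more robust than fixed offsets.)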
Probably a pretty basic question, and hopefully one not repeated elsewhere. I’m looking at some ISSP survey data in R, and I made a separate data frame for respondents who answered “Government agencies” on one of the questions:
gov.child<-data[data$"V33"=="Government agencies",]
Then I used the table function to see how many total respondents answered that way in each country (C_ALPHAN is the variable name for country):
table(gov.child$C_ALPHAN)
Then I made a matrix of this table:
gov.child.matrix<-as.matrix(table(gov.child$C_ALPHAN))
So I now have a two-column matrix with just the two-letter country code (the C_ALPHAN code) and the number of people who answered “Government agencies.” But I want to know what percentage of respondents in those countries answered that way, so I need to divide this number by the total number of respondents for that country.
Is there some way (a function, maybe?) to, after adding a new column, tell R that for each row, it has to divide the number in column two by the total number of rows in the original data set that correspond to the country code in column one (i.e., the n for that country)? Or should I just manually make a vector with the n for each country, which is available on the ISSP website, and add it to the matrix? I'm loath to do that because of the possibility of making a data entry error, but maybe that's the best way.
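There is no need to type the per-country n by hand: a second table() on the full data gives the denominators, and two tables with matching names divide element-wise. A sketch using the column names from the question, with made-up data:

```r
# Toy stand-in for the ISSP data (real data has many more rows/countries)
data <- data.frame(
  C_ALPHAN = c("DE", "DE", "DE", "US", "US"),
  V33 = c("Government agencies", "Family", "Government agencies",
          "Government agencies", "Family")
)

# Numerator: respondents per country who answered "Government agencies"
n_gov <- table(data$C_ALPHAN[data$V33 == "Government agencies"])

# Denominator: all respondents per country, from the same data set
n_total <- table(data$C_ALPHAN)

# Element-wise division, aligned by country code
pct <- 100 * n_gov / n_total[names(n_gov)]
```

Here `pct` holds one percentage per country (DE: 2 of 3, US: 1 of 2), computed entirely from the original data frame, so there is no manual entry to get wrong.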