Simple Merge Quadrupling Observation Numbers - r

Turns out I shouldn't have trusted the source of my data. They left duplicate observations and didn't clean the data as well as I assumed. So this question is moot.
I am attempting to merge two data frames. I've done this many times in the past with great success (after weeding out typos). I've been beating my head against the wall with this one. I cannot find the issue. One file has only 6 columns, 4 of which are repeated in the larger file. I need to merge by unique combinations of these 4 columns. For instance, Plant 1 at Transect A at Site X in year 2014 should have only 1 row. Each Transect and Site have unique prefixes assigned to each plant, but I need to subset out by these 4 columns later, so I want to maintain them.
I've tried both cbind() and merge(). In merge() I've also used all=TRUE and all=FALSE, since I know some of the rows are basically populated by NAs only and don't add anything to my analyses.
dat=cbind(dens, df)
dat=merge(dens, df, by=c("Year", "site", "transect", "PlantID"))
or
dat=merge(dens, df, by=c("PlantID", "Year", "site", "transect"), all=FALSE)
These data files are both only just over 7000 observations in length. But when I cbind or merge, I get the same df, which is well over 10,000 observations. I've looked at the output and a good number of the individuals have been quadrupled. I'm sure it's something very simple that I've missed but at this point I need fresh and knowledgeable eyes.
Here is a link to the two data files on Google Drive.
https://drive.google.com/drive/folders/1JQXSadqxQBOXM5AAOFAr-BmuoX9TXKXh?usp=sharing

A couple of things. When you merge, you usually use only one primary key, as merging on several can be prone to issues. From your description it sounds like the keys you are using are not the same in both datasets. For instance, one dataset has a column Col1 and the other has col1, or worse, the columns have different data types but appear to be the same on screen. Try taking a small subset of each dataset and merging those first, before throwing the whole process at it and being surprised it doesn't work.
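Given the asker's follow-up that the source files contained duplicate observations, a quick check of the key columns before merging would have shown the problem. The following is a minimal sketch on made-up data: the column names come from the question, but the values and the `density` and `height` columns are assumptions for illustration only.

```r
# Toy stand-ins for the two files; values are invented for illustration.
keys <- c("PlantID", "Year", "site", "transect")

dens <- data.frame(
  PlantID  = c("A1", "A1", "A2"),
  Year     = c(2014, 2014, 2014),
  site     = c("X", "X", "X"),
  transect = c("A", "A", "A"),
  density  = c(3, 3, 5)            # hypothetical measurement column
)
df <- data.frame(
  PlantID  = c("A1", "A1", "A2"),
  Year     = c(2014, 2014, 2014),
  site     = c("X", "X", "X"),
  transect = c("A", "A", "A"),
  height   = c(10, 10, 12)         # hypothetical measurement column
)

# Count rows whose key combination has already appeared earlier in the table
n_dup_dens <- sum(duplicated(dens[, keys]))
n_dup_df   <- sum(duplicated(df[, keys]))

# A key that occurs twice on each side produces 2 x 2 = 4 rows after merging
merged_raw <- merge(dens, df, by = keys)

# Dropping exact duplicate rows first restores one row per key combination
merged_clean <- merge(unique(dens), unique(df), by = keys)
```

If the duplicated rows differ in their non-key columns, `unique()` will not remove them, and deciding which row to keep becomes a data-cleaning question rather than a merge question.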

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured- for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data. Whilst writing the code I have restricted myself to just 3 years, and the illustration of the structure is based on this test. The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column, and the grouped data would become the row. Then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer but this appears to be a summarise tool and I am not trying to summarise the data rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
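One possible approach, sketched here in base R, is to stack the per-policy tables into a long table and then spread the metrics into columns. tidyr's pivot_longer() and pivot_wider() do the same kind of reshaping and do not summarise anything. Because the original tables are not shown, everything below is an assumption: the list name `policies`, the column names, the metric names and the values are invented, and the target layout (one row per policy and year, one column per metric) is a guess at what is needed.

```r
# Hypothetical stand-in for the imported list: one data frame per policy,
# one row per metric, one column per year (names and values are invented).
policies <- list(
  policy1 = data.frame(metric = c("metric_a", "metric_b"),
                       y2024 = c(10, 5), y2030 = c(12, 6), y2035 = c(15, 7)),
  policy2 = data.frame(metric = c("metric_a", "metric_b"),
                       y2024 = c(20, 8), y2030 = c(22, 9), y2035 = c(25, 11))
)
years <- c(2024, 2030, 2035)

# Step 1: stack everything into long form, one row per policy/metric/year
long <- do.call(rbind, lapply(names(policies), function(p) {
  d <- policies[[p]]
  data.frame(
    policy = p,
    metric = rep(d$metric, times = length(years)),
    year   = rep(years, each = nrow(d)),
    value  = c(d$y2024, d$y2030, d$y2035)
  )
}))

# Step 2: spread the metrics into columns, one row per policy/year
wide <- reshape(long, idvar = c("policy", "year"), timevar = "metric",
                direction = "wide")
rownames(wide) <- NULL
```

The long table is often the more convenient one to store, since new years or metrics only add rows; the wide table can be regenerated from it whenever it is needed.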

I split a very large dataframe into a number of smaller ones but I am unable to merge them back together

So I am working with some Census API data. Sadly there are a number of NAs in some of the columns. I am replacing the NAs with the column means, but I figured I would get more accurate information if I split the data by county first. Here is where my problem lies; I am unable to merge them back to a single dataframe correctly. I know that unsplit doesn't work for lists of dataframes so instead as per other posts I am using
do.call("rbind", hdi_tract$county)
instead. It seems to have worked but now I am getting more NA values than what I started with before splitting the data. Why is this the case?
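The post does not show enough to say where the extra NAs come from, but counting them before and after each step usually narrows it down. Below is a minimal sketch of the split, fill and recombine cycle on invented data; `tracts`, `county` and `income` are placeholder names, not the asker's objects.

```r
# Invented example data; `county` and `income` are placeholder column names.
tracts <- data.frame(
  county = c("A", "A", "A", "B", "B", "C"),
  income = c(10, NA, 30, 5, NA, NA)
)

# Replace missing values in one county's piece with that county's mean
fill_mean <- function(d) {
  m <- mean(d$income, na.rm = TRUE)  # NaN when every value in the group is NA
  d$income[is.na(d$income)] <- m
  d
}

pieces  <- split(tracts, tracts$county)                 # list of data frames
rebuilt <- do.call(rbind, lapply(pieces, fill_mean))    # back to one data frame

na_before <- sum(is.na(tracts$income))
na_after  <- sum(is.na(rebuilt$income))  # is.na() is TRUE for NaN as well
```

In this toy case county C has no observed values, so its gap cannot be filled from within the county and it still counts as missing afterwards. Note also that `do.call(rbind, ...)` has to be given the list of per-county data frames; comparing `nrow()` before and after is a quick way to confirm that the right object was recombined.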

Querying UniProt and RefSeq databases with FASTA headers

This is my first post to StackOverflow. I've been playing with this problem for over a week, and I haven't found a solution using the search function or my limited computational skills. I have a dataset comprising columns of FASTA headers, sequences, and count numbers. The FASTA headers technically contain all the info I need, but that's where I'm running into problems...
Different formats
Some of the entries come from UniProt:
tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1
Some of the entries come from RefSeq:
gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]
Synonyms
I'd like to make graphs using count number vs virus or gene, and I thought it would be easy enough to split up the headers and go from there. However, what I'm discovering is that there's a seemingly endless number of variations on virus and gene names. In the example above, EBV goes by no fewer than 4 names, and each individual gene has several different formattings.
I used a lengthy ifelse statement to create a column for virus family name. I shortened the following below to just include EBV, but you can imagine it stretching on for all common viruses.
EBV <- c("EBVG", "Human_herpesvirus_4", "Epstein", "Human_gammaherpesvirus_4")
joint.virus <- joint.virus %>% mutate(Virus_Family =
ifelse(grepl(paste(EBV, collapse = "|"), name), "EBV", NA))
This isn't so bad, but I had to do something similar for all of EBV's ~85 genes. Not only was this tedious, but it isn't feasible to do this for all the viruses I want to look at.
I looked into querying the databases using the UniProt.ws package to pull out organism name and gene name, but you need to start from the taxID (which isn't included in the UniProt header). I feel like there should be some way to use the FASTA header to get the organism name and gene name.
I am presently using R. I would greatly appreciate any advice going forward. Is there a package that I'm overlooking? Should I be using a different tool to do this?
Thanks!
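As a starting point, the two header layouts shown above can be split with regular expressions so that the accession, organism and gene land in separate columns. The accession (V5RFN6, YP_001129441.1) is the stable identifier that database lookups are normally keyed on. This is a sketch based only on the two example headers; the function name `parse_header` is made up here, and real data may contain header variants that these patterns do not cover.

```r
headers <- c(
  "tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1",
  "gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]"
)

# Hypothetical helper: returns a one-row data frame per header
parse_header <- function(h) {
  parts <- strsplit(h, "|", fixed = TRUE)[[1]]
  if (parts[1] %in% c("sp", "tr")) {
    # UniProt layout: db|accession|entry_description_OS=organism_GN=gene_PE=.._SV=..
    organism <- sub(".*_OS=(.*?)_(GN|PE|SV)=.*", "\\1", h, perl = TRUE)
    gene <- if (grepl("_GN=", h, fixed = TRUE)) {
      sub(".*_GN=(.*?)_(PE|SV)=.*", "\\1", h, perl = TRUE)
    } else {
      NA_character_
    }
    data.frame(source = "UniProt", accession = parts[2],
               organism = organism, gene = gene)
  } else if (parts[1] == "gi") {
    # RefSeq layout: gi|number|ref|accession|description[organism]
    organism <- sub(".*\\[(.*)\\]$", "\\1", h)
    # The text before the bracket is a protein description, not a gene symbol
    label <- sub("\\[.*$", "", parts[5])
    data.frame(source = "RefSeq", accession = parts[4],
               organism = organism, gene = label)
  } else {
    data.frame(source = NA_character_, accession = NA_character_,
               organism = NA_character_, gene = NA_character_)
  }
}

parsed <- do.call(rbind, lapply(headers, parse_header))
```

This does not solve the synonym problem by itself (EBNA2 vs EBNA-2, or the four names for EBV), but it reduces it to a lookup table on two short columns instead of pattern matching on the whole header.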

Column means over a finite range of rows

I am working with climate data in New Mexico and I am an R novice. I am trying to replace NAs with means, but there are 37 different sites in my df. I want the column means computed separately for each unique DF$STATION.NAME (in column 1), since I can't use data from one location to find the mean of another. So really I should have a mean for each month, for each station.
My data is organized by station.name vertically in column 1, with readings for the months Jan-Dec in the following columns and a total column at the end (right). Readings are for each station for each month, over several years (the station name is listed in a new row for each new year).
I need to replace the NAs with the sums of the CLDD for the given month within the given station.name, how do I do this?
Try asking that question on https://stats.stackexchange.com/ (as suggested by the statistics tag), there are probably more R users there than on the general programming site. I also added the r tag to your question.
There is nothing wrong with splitting your data into station-month subsets, filling the missing values there, then reassembling them into one big matrix!
See also:
Replace mean or mode for missing values in R
Note that filling missing values with means, medians or modes is popular, but it may dilute your results since it reduces variance. Unless you have a strong physical argument why and how the missing values can be interpolated, it would be more elegant to find a method that can deal with missing values directly.
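The station-month filling described above does not need an explicit split and reassemble step; base R's ave() applies a function within groups and returns the result in the original row order. A minimal sketch on invented data follows. STATION.NAME is taken from the question, while the month column names, the values and the helper `fill_with_group_mean` are assumptions.

```r
# Invented data: one row per station per year, one column per month
# (only two months shown to keep the example short).
clim <- data.frame(
  STATION.NAME = c("S1", "S1", "S1", "S2", "S2"),
  JAN = c(1, NA, 3, 10, NA),
  FEB = c(NA, 4, 6, NA, 8)
)
month_cols <- c("JAN", "FEB")

# Hypothetical helper: replace NAs in a vector with the mean of its non-NA values
fill_with_group_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# For each month column, fill NAs using only rows from the same station
filled <- clim
for (m in month_cols) {
  filled[[m]] <- ave(clim[[m]], clim$STATION.NAME, FUN = fill_with_group_mean)
}
```

A total column, if present, would need to be recomputed from the filled months afterwards rather than filled the same way.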

Trouble getting my data into wide form with the reshape package

I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2 for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape).
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
User ID, Qid1, ..., Qid255, Time, with the probabilities for each question in that question's corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want to have probability be the value of each, and time be added onto the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
Which is simply reshaped wide.
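One way to reach that layout in base R is to separate the "time" rows from the "probability" rows, rename the question columns, and merge the two pieces on the user ID. The linked CSV is not reproduced here, so the data below is invented: the column names `UID`, `variable` and `X1` to `X3` are assumptions based on the description above.

```r
# Invented stand-in for dtf: one "probability" row and one "time" row per user
dtf <- data.frame(
  UID      = rep(c(101, 102, 103), each = 2),
  variable = rep(c("probability", "time"), times = 3),
  X1 = c(0.2, 11, 0.5, 14, 0.9,  9),
  X2 = c(0.4, 12, 0.6, 15, 0.1,  8),
  X3 = c(0.7, 13, 0.3, 16, 0.8,  7)
)
q_cols <- c("X1", "X2", "X3")

# Separate the two kinds of rows, keeping the ID and the question columns
time_part <- dtf[dtf$variable == "time",        c("UID", q_cols)]
prob_part <- dtf[dtf$variable == "probability", c("UID", q_cols)]

# Rename the question columns so the two kinds can sit side by side
names(time_part)[-1] <- paste0("Time", seq_along(q_cols))
names(prob_part)[-1] <- paste0("Prob", seq_along(q_cols))

# One row per user: UID Time1 Time2 Time3 Prob1 Prob2 Prob3
wide <- merge(time_part, prob_part, by = "UID")
```

This assumes each user has exactly one row of each kind; if a user has several, the merge will multiply rows for that user.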

Resources