How to combine rows based off of duplicate values? - r

Basically what we have is several columns as follows:
Household ID, restaurantspend, groceryspend, foodtruckspend
We have duplicate household IDs because each spend is in its own column, so our data looks like this:
[Image: data example]
We want only one row per Household ID, with the numerical values of the other columns combined.

aggdata = aggregate(mydata, by=list(mydata$HouseHoldID), FUN=sum)
I recreated the table above and saved it as "mydata". Run the code above and view the output "aggdata": it has an extra column, "Group.1", which holds the grouping value taken from "HouseHoldID". You can ignore the second column, "HouseHoldID", which now holds the sum of the IDs in each group; the ID itself is in "Group.1".
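The data example in the question is an image, so the sketch below uses invented values and assumes the empty cells are NA. It aggregates only the spend columns, which avoids the summed ID column, and passes na.rm = TRUE through to sum() so that missing cells do not turn the totals into NA.

```r
# Hypothetical data: column names follow the question, values are invented,
# and empty cells are assumed to be NA.
mydata <- data.frame(
  HouseHoldID     = c(1, 1, 1, 2, 2),
  restaurantspend = c(10, NA, NA, 5, NA),
  groceryspend    = c(NA, 20, NA, NA, 7),
  foodtruckspend  = c(NA, NA, 30, NA, NA)
)

# Aggregate only the spend columns so the ID column is not summed as well.
spend_cols <- c("restaurantspend", "groceryspend", "foodtruckspend")

# Naming the grouping list gives the key column a readable name instead of
# "Group.1"; na.rm = TRUE is forwarded to sum().
aggdata <- aggregate(
  mydata[spend_cols],
  by    = list(HouseHoldID = mydata$HouseHoldID),
  FUN   = sum,
  na.rm = TRUE
)
print(aggdata)
```

With na.rm = TRUE, a household that has no value at all in a column gets a total of 0 for that column rather than NA; drop the argument if NA is the preferred result there.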

Related

Merging two data-frame with different observation which are including same column in r

I have two data frames with different observations that share a column, say "id". The values of "id" are duplicated in the bigger data frame. The small data frame also has a column called "names", which does not exist in the bigger one. I want to add a new column called "name" to my bigger data frame like this:

How to Split one row into two

I want to create multiple columns based on Subjects and Marks. The data in the Name and Age columns will remain the same for different subjects, as shown.
My first table is input data and the second one is desired output.

Find closest datapoint to a date in another dataframe

I have two data frames. One data frame is called Measurements and has 500 rows. The columns are PatientID, Value and M_Date. The other data frame is called Patients and has 80 rows and the columns are PatientID, P_Date.
Each patient ID in Patients is unique. For each row in Patients, I want to look at the set of measurements in Measurements with the same PatientID (there are maybe 6-7 per patient).
From this set of measurements, I want to identify the one with M_Date closest to P_Date. I want to append this value to Patients in a new column. How do I do this? I tried using ddplyr but can't figure out how to access two data frames at once within this function.
You probably want to install the survival package with install.packages("survival") and use its neardate function to solve your problem.
The documentation has a good example.
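As a rough sketch of that suggestion, with invented data that reuses the column names from the question: neardate() returns, for each row of the first data set, the index of the matching row in the second, looking either forward (best = "after") or backward (best = "prior"). To get the closest date in either direction, one option is to run it both ways and keep the smaller gap. The ClosestValue column name is made up for this example.

```r
library(survival)  # provides neardate()

# Hypothetical data using the column names from the question.
Patients <- data.frame(
  PatientID = c(1, 2),
  P_Date    = as.Date(c("2020-01-10", "2020-03-01"))
)
Measurements <- data.frame(
  PatientID = c(1, 1, 1, 2, 2),
  Value     = c(5, 6, 7, 8, 9),
  M_Date    = as.Date(c("2020-01-01", "2020-01-12", "2020-02-01",
                        "2020-02-20", "2020-03-15"))
)

# Index of the nearest measurement on or after P_Date, and on or before it.
after <- neardate(Patients$PatientID, Measurements$PatientID,
                  Patients$P_Date, Measurements$M_Date, best = "after")
prior <- neardate(Patients$PatientID, Measurements$PatientID,
                  Patients$P_Date, Measurements$M_Date, best = "prior")

# Gap in days for each candidate (NA where no candidate exists).
gap_after <- abs(as.numeric(Measurements$M_Date[after] - Patients$P_Date))
gap_prior <- abs(as.numeric(Measurements$M_Date[prior] - Patients$P_Date))

# Keep whichever candidate is closer; fall back to the other if one is missing.
closest <- ifelse(is.na(after), prior,
           ifelse(is.na(prior), after,
           ifelse(gap_prior < gap_after, prior, after)))

# ClosestValue is a hypothetical name for the appended column.
Patients$ClosestValue <- Measurements$Value[closest]
print(Patients)
```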

How to sort the first 20 rows in first column in alphabetical order in a data frame

I'm new to R and got stuck on an exercise. In my data frame, the first row holds the patients (e.g. patient 1, patient 2) and the first column holds the gene names (e.g. gene abc123, gene def456). What I want to know is how to sort the first 20 rows in column 1 in alphabetical order. Thanks
EDIT
I have put up a screenshot of the file in Excel, and I am trying to extract the entries in the red box in alphabetical order. I am unsure what to call column 1 in the console, as it doesn't have a heading. In the file provided, each row represents expression values for a single gene, and each column represents expression values for a single sample (patient).
The first column of each row is the gene identifier: (gene-symbol|entrez ID)
e.g. "A2M|2" (A2M is the gene-symbol and 2 is the entrez database identifier for alpha 2 macroglobulin)
Each sample identifier is formatted as: TCGA-ID_Tissue
where the Tissue is either "TissueA" or "TissueB" e.g. "TCGA-AA-3548_TissueA"
The question is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
[Image: screenshot of the table]
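One possible reading of the exercise is: sort all gene names, then print the first 20 of the sorted result. A minimal sketch under that assumption follows. The data frame expr and its values are invented stand-ins for the file, and the first column is selected by position so its missing heading does not matter.

```r
# Hypothetical stand-in for the expression file; the real data would be read
# in with something like read.delim(). Values are invented.
expr <- data.frame(
  gene = c("TP53|7157", "A2M|2", "BRCA1|672", "A1BG|1"),
  `TCGA-AA-3548_TissueA` = c(1.2, 3.4, 0.5, 2.1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Take the first column by position, whatever it is called.
gene_names <- expr[[1]]

# Sort A-Z, then keep at most the first 20 names.
first20 <- head(sort(gene_names), 20)
print(first20)
```

If the exercise instead means "take the first 20 rows, then sort them", swapping the order to sort(head(gene_names, 20)) would do that.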

Count of columns with filters

I have a dataframe with multiple columns and I want to apply different functions on each column.
An example of my dataset -
I want to calculate the count of column pq110a for each country in the qcountry2 column (me = Mexico, br = Brazil, ar = Argentina). The problem is that I have to filter on these columns. For example, for sample patients I want:
Count of pq110 when the values are 1 and 2 (for some patients)
Count of pq110 when the value is 3 (for other patients)
Similarly when the value is 6.
For total patients I want the total count of pq110.
The output I am expecting is: [Image: Output]
Similarly, I want this output for each country.
Please suggest how I can do this for other columns as well, country-wise.
Thanks !!
I guess what you want is to count the rows of 'pq110' that share the same value within each 'qcountry2'.
So I'd use 'tapply' to split the data into subsets by country and then 'table' to count the rows for each distinct value.
tapply(my_data[,"pq110"], INDEX = as.factor(my_data[,"qcountry2"]), function(x)table(x))
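To make this concrete, here is a small sketch on invented data (the original dataset is only shown as an image, so the values below are assumptions). It builds a country-by-value cross-tab with table() and then uses tapply() to count the rows matching a filter, such as values 1 and 2, within each country.

```r
# Hypothetical data: column names follow the question, values are invented.
my_data <- data.frame(
  qcountry2 = c("me", "me", "br", "br", "br", "ar"),
  pq110     = c(1, 2, 1, 3, 3, 6)
)

# Cross-tab: one row per country, one column per distinct pq110 value.
counts <- table(my_data$qcountry2, my_data$pq110)
print(counts)

# Rows per country where pq110 is 1 or 2 (summing a logical counts the TRUEs).
low <- tapply(my_data$pq110 %in% c(1, 2), my_data$qcountry2, sum)
print(low)

# Total rows per country, regardless of value.
total <- table(my_data$qcountry2)
print(total)
```

The same pattern can be repeated for other columns by swapping pq110 for another column name and adjusting the filter values.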
