merge rows by subject number - r

What I am trying to do is merge my dataframe by rows. For instance let's say my data.frame is called data and it looks like this: I have 5 columns- subject contains 5s and 6s, Phase contains Post-Lure and Pre-Lure, Type contains Visual and Auditory and Memory contains a list of scores. Ex:
Subject Phase Type Memory
1 5 Post-Lure Visual 0.80000000
2 5 Post-Lure Auditory 0.70666667
3 5 Pre-Lure Visual 0.40000000
4 5 Pre-Lure Auditory 0.61333333
5 6 Post-Lure Visual 0.80000000
6 6 Post-Lure Auditory 0.54666667
As you can see from the code above, the subject is repeated (subject 5 is the same person but the phase and/or type are now different). Thus, I am looking for a code that will make all of the data for each subject on the same row. Hence, the memory scores, and the different types and phases each subject were exposed to will just now become additional columns on the same row. I feel aggregate may do the trick but is it possible to use that code without applying a function to each of the numbers. Any help would be greatly appreciated. Thank you.

As mentioned in the comment, you need to add an "indicator" variable of some sort (for example, how many "times" there are for each subject).
That can be done with ave and seq_along:
mydf$time <- with(mydf, ave(Subject, Subject, FUN=seq_along))
Next, you can use reshape() to go from "long" to "wide".
reshape(mydf, direction = "wide",
idvar="Subject", timevar="time")
# Subject Phase.1 Type.1 Memory.1 Phase.2 Type.2 Memory.2
# 1 5 Post-Lure Visual 0.8 Post-Lure Auditory 0.7066667
# 5 6 Post-Lure Visual 0.8 Post-Lure Auditory 0.5466667
# Phase.3 Type.3 Memory.3 Phase.4 Type.4 Memory.4
# 1 Pre-Lure Visual 0.4 Pre-Lure Auditory 0.6133333
# 5 <NA> <NA> NA <NA> <NA> NA
If you wanted to use the "reshape2" or "tidyr" packages, you would first have to get the data into a "long" form using melt or gather, but note that in the process, your variable types would be converted since a single column would be containing several data types.

Do you just want to reshape your data? The question isn't clear. Let's call your dataframe df. Then
library(reshape2)
dcast(df, Subject ~ Phase + Type)
will produce
Subject Post-Lure_Auditory Post-Lure_Visual Pre-Lure_Auditory Pre-Lure_Visual
1 5 0.7066667 0.8 0.6133333 0.4
2 6 0.5466667 0.8 NA NA

Related

How do I change the order of multiple grouped values in a row dependent on another variable in that row in R?

I need some help conditionally sorting/switching data based on a factor variable.
I'm not sure if it's a typical use case I just can't formulate properly enough for a search engine to show me a solution or if it is that niche but I haven't found anything yet.
I currently have a dataframe like this:
id group a1 a2 a3 a4 b1 b2 b3 b4
1 1 2 6 6 3 4 4 6 4
2 2 5 2 2 2 2 5 2 3
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 3 1 1 4 2 1 1 7
For context this is from a psychological experiment where people went through two variations of a task and the order of those conditions was determined by the experimental group they were assigned to. The columns represent different measurements from different trials and are currently grouped together for the same variable and in chronological order, meaning a1,a2,a3,a4 are essentially the same variable at consecutive time points, same with b1,b2,b3,b4.
I want to split them up for the different conditions so regardless of which group (=which order of tasks) someone went through, data from one condition should come first in the dataframe and columns should still be grouped together for the same variables and in chronological order within that condition. It should essentially look like this:
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c2b1 c2b2
1 1 2 6 6 3 4 4 6 4
2 2 2 2 5 2 2 3 2 5
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 1 4 3 1 1 7 2 1
So essentially for group 1 everything stays the same since they happened to go through the conditions in the same order that I want to have in the new dataframe while for group 2 values are being switched where the originally second half of values for each variable is put in front of the originally first one.
I hope I formulated the problem in a way, people can understand it.
My real dataset is a bit more complicated it has 180 columns minus id and group so 178.
I have 13 variables some of which were measured over two conditions with 5 trials for each of those and some which have those 5 trials for each of the 2 main condition but which also have 2 adittional measurements for each condition where the order was determined by the same group variable.
(We essentially asked participants to do the task again in two certain ways, which allowed us to see if they were capable of doing them like that if they wanted to under the circumstences of both main conditions).
So there are an adittional 4 columns for some variables which need to be treated seperately. It should look like this when transformed (x and y are the 2 extra tasks where only b was measured once):
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c1bx c1by c2b1 c2b2 c2bx c2by
1 1 2 6 6 3 4 4 3 7 6 4 4 2
2 2 2 2 5 2 2 3 4 3 2 5 2 2
3 1 6 3 3 1 3 6 2 2 4 1 1 1
4 1 4 8 4 2 7 8 1 1 8 9 5 8
5 2 1 4 3 1 1 7 8 9 2 1 3 4
What I want to say with this is, I need a pretty general solution.
I already tried formulating a function for creation of two seperate datasets for the groups and then merging them by id but got stuck with the automatic creation and naming of columns which I can't seem to wrap my head around. dplyr is currently loaded and used for some other transformations but since I'm not really good with it, I need to ask for your help regarding a solution with or without it. I'm still pretty new to R and this is for my bachelor thesis.
Thanks in advance!
Your question leaves a few things unclear that make this hard to answer, but here is maybe a start that could help, or at least help clarify your problem.
It would really help if you could clarify 2 pieces of info, what types of column rearrangements you need, and how you distinguish what indicates that a row needs to have this transformation.
I'm also wondering if instead of trying to manipulate your data in its current shape, if it not might be more practical to figure out how to change the shape of your data to better represent your data, perhaps using something like pivot_longer(), I don't know how this data will ultimately be used or what the actual values indicate, but it doesn't seem to be very tidy in its current form, and instead having a "longer" table might be more meaningful, but I'll still provide what I think is a solution to your listed problem.
This creates some example data that looks like it reflects yours in the example table.
ID=seq(1:10)
group=sample(1:2,10,replace=T)
Data=matrix(sample(1:10,80,replace=T),nrow=10,ncol=8)
DataFrame=data.frame('ID'=ID,'Group'=group,Data)
You then define the groups of columns that need to be kept together. I can't tell if there is an automated way for you to indicate which columns are grouped, but this might get bulky if done manually. Some more information on what your column names actually are, and how they are distributed in groups would help.
ColumnGroups=list('One'=c('X1','X2'),'Two'=c('X3','X4'),'Three'=c('X5','X6'),'Four'=c('X7','X8'))
You can then figure out which rows need to have rearranged done by using some conditional. Based on your example, I'm assuming when the group variable equals 2, then the rearranging needs to be done, which is what I've used here.
FlipRows=DataFrame$Group==2
You can then have R only apply the rearrangement needed to those rows that need it, and define the rearrangement based on the ordering of the different column groups. I know you ask for a general solution, but is hard to identify the general solution you need without knowing what types of column rearrangements you need. If it is always flipping two sets of consecutive column groups, that would be easier to define without having to type it all out. What I have done here would require you to manually type out the order of the different column groups that you would like the rows to be rearranged as. The SortedDataFrame object seems to be what you are looking for, but might not actually reflect your real data. I removed columns 1 and 2 in this operation since those are ID and group which you don't want overridden.
SortedDataFrame=DataFrame
SortedDataFrame[FlipRows,-c(1,2)]=DataFrame[FlipRows,c(ColumnGroups$Two,ColumnGroups$One,ColumnGroups$Four,ColumnGroups$Three)]
This solution won't work if you need to rearrange each row differently, but it is unclear if that is the case. Try to provide any of the other info requested here, and let me know where this solution doesn't work for you, and that.

R: how to merge two columns (column addition) while ignoring rows with same value

I have a data.frame like this
I want to add Sample_Intensity_RTC and Sample_Intensity_nRTC's values and then create a new column, however in cases of Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition operation is done.
Please not that these columns are not rounded in the same way, so many numbers are same with different nsmall.
It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper perhaps. Or two roads merging into one.
The two columns seem to have been created by two separate processes, the first looks to have more accuracy. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002 NA NA
2 191017QNXP008 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46
6 191017USXP002 NA 76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
test$Sample_Intensity_nRTC,
test$Sample_Intensity_RTC)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
1 191017QMXP002 NA NA NA
2 191017QNXP008 41293681.00 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46 76693308.46
6 191017USXP002 NA 76984658.00 76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.

Sum-product in R for specific conditions

I'm looking to do sumproduct in r as we do in excel.
It's a little challenging as i have to apply some logical conditions meanwhile.
Excel code looks like this
SUMPRODUCT(--(ID=A2),--(INDIRECT(A1)<>"-"),INDIRECT(B1),C1)
here ID, A1 ,B1 are name ranges on other sheet of same workbook.
ID $ Quantity
1 23 34
2 4 55
3 NA 6
4 6 45
5 7 NA
6 8 NA
I want logical operators because some values are NA and i don't want to take them in consideration. I want this process to be automated without much manual work.
I've done this upto some extent using deplyr but it's not giving satisfactory results.

R Merge By adding rows

I have 2 data sets, 1 with information about travels, and the other one containing the cost per travel depending on where I'm leaving from. I need to get the total cost of the travel and it's easy to do with a merge by where I'm leaving from, but when I do this it adds 1,500 rows to my 100,000 row dataset..
Anyone knows why is this happening? The biggest dataset is 100,000 rows, the other one is about 10,000
EDITThis is a subset of df1
x Poste Locat V3
1 905916 Mixco 0.3
2 905818 Mixco 0.6
3 905818 Mixco 0.6
4 905338 Castellana 0.5
5 904876 Mixco 0.3
This is a subset of df2
x Vehiculo Poste
1 Camion 340592
2 Camion 262776
3 Camion 340622
4 Camion 243254
5 Camion 258505
I need to merge both datasets using "Poste", as I will get the cost based on both "Locat"(location) and "Vehiculo"(vehicle) from another dataset.
sol <- merge(sol, df[,c(5,16)], by="Poste")

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by would like to avoid having to individually specify each subset. I think that subset is probably not flexible enough for me to ask it to subset by a list of names (or at least not to my current knowledge of R, which is growing, but still in infancy), is there another command which I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
I would use the plyr package.
library(plyr)
ll <- dlply(df,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create vector with site names, assuming reefdata$site is a factor
sites <- as.character( unique( reefdata$site ) )
# create empty list to take dive data per site
dives <- list( NULL )
# collect data per site into the list
for( i in 1:length( sites ) )
{
# subset
dive <- reefdata[ reefdata$site == sites[ i ] , ]
# add resulting data.frame to the list
dives[[ i ]] <- dive
# name the list element
names( dives )[ i ] <- sites[ i ]
}

Resources