How to remove duplicates right beneath original response? [duplicate]

Background: I have a survey attached to an Excel sheet, and at times a response gets duplicated due to user interaction. The duplication takes place right beneath the original response. I would like R to delete the duplicates that appear right beneath the original response, keeping the original. Is there a way to target the duplicated responses right beneath the original one?
If my dataframe looks like this:
Area Year Course Tested Grade
1 Git 1 Material Y A
2 Ort 3 Fabric Y B
3 Pinst 2 Pattern N NA
4 Coker 1 Fashion Y B+
5 Coker 1 Fashion Y B+
6 South 4 Business N NA
This is what I would want:
Area Year Course Tested Grade
1 Git 1 Material Y A
2 Ort 3 Fabric Y B
3 Pinst 2 Pattern N NA
4 Coker 1 Fashion Y B+
5 South 4 Business N NA
Thank you in advance

Assuming you want to delete the duplicates only when they occur in consecutive rows, and keep them if they occur elsewhere, you can use data.table::rleidv along with duplicated:
df[!duplicated(data.table::rleidv(df)),]
# Area Year Course Tested Grade
#1 Git 1 Material Y A
#2 Ort 3 Fabric Y B
#3 Pinst 2 Pattern N <NA>
#4 Coker 1 Fashion Y B+
#6 South 4 Business N <NA>
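If you'd rather avoid the data.table dependency, here is a base-R sketch of the same consecutive-duplicate idea: build one key string per row (which also treats NA consistently) and keep a row only when its key differs from the row above it.
key <- do.call(paste, c(df, sep = "\r"))      # one string per row
df[c(TRUE, key[-1] != key[-length(key)]), ]   # row 1 is always kept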

Related

Get rate of change from messy data

I have a database that looks like this:
Id session q1 q2 q3 ...
1 1 4 5 5
1 2 4 5 6
1 3 5 5 6
2 1 4 4 5
2 2 5 4 5
2 3 5 5 6
Basically, different subjects with 3 different measurements of the same questions. What I want to do is measure the rate of change and check whether every observation improved over time, or whether there were observations that got worse results in session 3 than in session 1 or 2.
The only thing I have managed to do is tidy it up a bit with pivot_wider, like this:
pivot_wider(id_cols = Id, names_from = session, values_from = c(q1:q4))
The problem is that I have more than 70 questions, and I haven't figured out a way to automate this instead of writing hundreds of lines of mutate in the form of:
mutate(q1change = q1_3 - q1_1)
I was wondering if anyone could come up with a better and simpler solution so I can check this rate of change for each variable.
Ideally, I would also like to plot it after I have computed the rate of change, so I can show graphically whether there were observations that got worse.
Thanks
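In case it helps, here is one possible approach (a sketch assuming the tidyverse; the data below is a small hypothetical version of the question's table): reshape to long form so a single rule covers all 70+ questions, then compute each Id's session-3 minus session-1 change per question.
library(dplyr)
library(tidyr)
# Hypothetical data in the question's layout
df <- tibble(Id      = rep(1:2, each = 3),
             session = rep(1:3, times = 2),
             q1 = c(4, 4, 5, 4, 5, 5),
             q2 = c(5, 5, 5, 4, 4, 5),
             q3 = c(5, 6, 6, 5, 5, 6))
changes <- df %>%
  pivot_longer(starts_with("q"), names_to = "question", values_to = "score") %>%
  group_by(Id, question) %>%
  summarise(change = score[session == 3] - score[session == 1], .groups = "drop")
filter(changes, change < 0)   # observations that got worse by session 3
# The long `changes` table also plots easily, e.g. with ggplot2:
# ggplot(changes, aes(question, change)) + geom_boxplot()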

How do I change the order of multiple grouped values in a row dependent on another variable in that row in R?

I need some help conditionally sorting/switching data based on a factor variable.
I'm not sure if this is a typical use case that I just can't formulate well enough for a search engine to show me a solution, or if it is genuinely niche, but I haven't found anything yet.
I currently have a dataframe like this:
id group a1 a2 a3 a4 b1 b2 b3 b4
1 1 2 6 6 3 4 4 6 4
2 2 5 2 2 2 2 5 2 3
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 3 1 1 4 2 1 1 7
For context this is from a psychological experiment where people went through two variations of a task and the order of those conditions was determined by the experimental group they were assigned to. The columns represent different measurements from different trials and are currently grouped together for the same variable and in chronological order, meaning a1,a2,a3,a4 are essentially the same variable at consecutive time points, same with b1,b2,b3,b4.
I want to split them up for the different conditions so regardless of which group (=which order of tasks) someone went through, data from one condition should come first in the dataframe and columns should still be grouped together for the same variables and in chronological order within that condition. It should essentially look like this:
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c2b1 c2b2
1 1 2 6 6 3 4 4 6 4
2 2 2 2 5 2 2 3 2 5
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 1 4 3 1 1 7 2 1
So essentially, for group 1 everything stays the same, since they happened to go through the conditions in the order I want in the new dataframe, while for group 2 the values are switched: the originally second half of values for each variable is put in front of the originally first half.
I hope I have formulated the problem in a way people can understand.
My real dataset is a bit more complicated: it has 180 columns, minus id and group, so 178.
I have 13 variables, some of which were measured over two conditions with 5 trials each; some have those 5 trials for each of the 2 main conditions but also have 2 additional measurements per condition, where the order was determined by the same group variable.
(We essentially asked participants to do the task again in two specific ways, which allowed us to see whether they were capable of doing it like that if they wanted to, under the circumstances of both main conditions.)
So there are an additional 4 columns for some variables, which need to be treated separately. It should look like this when transformed (x and y are the 2 extra tasks where b was measured only once):
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c1bx c1by c2b1 c2b2 c2bx c2by
1 1 2 6 6 3 4 4 3 7 6 4 4 2
2 2 2 2 5 2 2 3 4 3 2 5 2 2
3 1 6 3 3 1 3 6 2 2 4 1 1 1
4 1 4 8 4 2 7 8 1 1 8 9 5 8
5 2 1 4 3 1 1 7 8 9 2 1 3 4
What I want to say with this is, I need a pretty general solution.
I already tried writing a function to create two separate datasets for the groups and then merge them by id, but I got stuck on the automatic creation and naming of columns, which I can't seem to wrap my head around. dplyr is currently loaded and used for some other transformations, but since I'm not really good with it, I need to ask for your help regarding a solution with or without it. I'm still pretty new to R, and this is for my bachelor thesis.
Thanks in advance!
Your question leaves a few things unclear that make this hard to answer, but here is a start that may help, or at least help clarify your problem.
It would really help if you could clarify two pieces of info: what types of column rearrangements you need, and what indicates that a row needs this transformation.
I'm also wondering whether, instead of manipulating your data in its current shape, it might be more practical to change the shape of your data to better represent it, perhaps using something like pivot_longer(). I don't know how this data will ultimately be used or what the actual values indicate, but it doesn't seem very tidy in its current form; a "longer" table might be more meaningful. Still, I'll provide what I think is a solution to your stated problem.
This creates some example data that looks like it reflects the example table in your question:
ID <- 1:10
group <- sample(1:2, 10, replace = TRUE)
Data <- matrix(sample(1:10, 80, replace = TRUE), nrow = 10, ncol = 8)
DataFrame <- data.frame(ID = ID, Group = group, Data)
You then define the groups of columns that need to be kept together. I can't tell if there is an automated way for you to indicate which columns are grouped, but this might get bulky if done manually. Some more information on what your column names actually are, and how they are distributed in groups would help.
ColumnGroups <- list(One = c('X1', 'X2'), Two = c('X3', 'X4'), Three = c('X5', 'X6'), Four = c('X7', 'X8'))
You can then figure out which rows need to be rearranged by using a conditional. Based on your example, I'm assuming the rearranging needs to be done when the group variable equals 2, which is what I've used here.
FlipRows <- DataFrame$Group == 2
You can then have R apply the rearrangement only to the rows that need it, and define the rearrangement based on the ordering of the different column groups. I know you asked for a general solution, but it is hard to identify one without knowing what types of column rearrangements you need; if it is always flipping two sets of consecutive column groups, that would be easier to define without typing everything out. What I have done here requires you to manually type out the order in which the column groups should be rearranged. The SortedDataFrame object seems to be what you are looking for, but might not actually reflect your real data. I excluded columns 1 and 2 from the operation since those are ID and group, which you don't want overridden.
SortedDataFrame <- DataFrame
SortedDataFrame[FlipRows, -c(1, 2)] <- DataFrame[FlipRows, c(ColumnGroups$Two, ColumnGroups$One, ColumnGroups$Four, ColumnGroups$Three)]
This solution won't work if you need to rearrange each row differently, but it is unclear whether that is the case. Try to provide the other info requested here, and let me know where this solution doesn't work for you.
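For the column-naming part that caused trouble, a minimal sketch of one way to finish (the target names below are taken from your desired output and are otherwise hypothetical; with 178 real columns you would build this vector programmatically, e.g. with paste0()):
NewNames <- c('c1a1', 'c1a2', 'c2a1', 'c2a2', 'c1b1', 'c1b2', 'c2b1', 'c2b2')   # hypothetical target scheme
names(SortedDataFrame)[-c(1, 2)] <- NewNames   # leave ID and Group untouched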

Is there an R function to redefine a variable so I can use the spread function?

I'm new to R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons; each person is assigned a study number (SN). Each SN has one or more tests performed, and each test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 6 NA
2 2 6 NA NA
3 3 9 NA 9
So I think I need to renumber the testnumbers into test 1, test 2, etc. for each SN in order to use the spread function. But I don't know how.
I did manage to renumber testnumber into a list running from 1 to the last unique testnumber, but the wide dataframe still looks awful.
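In case it helps, here is a sketch assuming dplyr and tidyr (pivot_wider is the successor to spread; the helper columns test and result_no are invented here): number the tests within each SN, number the results within each test, then widen.
library(dplyr)
library(tidyr)
df <- data.frame(SN = c(1, 1, 2, 3, 3),
                 testnumbers = c(1234, 1234, 4567, 5678, 8790),
                 result = c(6, 9, 6, 9, 9))
df %>%
  group_by(SN) %>%
  mutate(test = match(testnumbers, unique(testnumbers))) %>%   # test 1, 2, ... per SN
  group_by(SN, test) %>%
  mutate(result_no = row_number()) %>%                         # result 1, 2, ... per test
  ungroup() %>%
  pivot_wider(id_cols = SN, names_from = c(test, result_no),
              values_from = result,
              names_glue = "test{test}result{result_no}")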

Is there an efficient algorithm to create this type of schedule? [closed]

I am creating a schedule for a sports league with several dozen teams. I already have all of the games in a set order and now I just need to assign one team to be the "home" team and one to be "away" for each game.
The problem has two constraints:
1) Each pair of teams must play an equal number of home and away games against each other. For example, if team A and team B play 4 games, then 2 must be hosted by A and 2 by B. Assume that each pair of teams plays an even number of games against each other.
2) No team should have more than three consecutive home games or three consecutive away games at any point in the schedule.
I have been trying to use brute force in R to solve this problem but I can't get any of my code blocks to solve the issue in a timely fashion. Does anyone have any advice on how to deal with either (or both) of the above constraints algorithmically?
You need to do more research on simple scheduling.
There are a lot of references online for these things.
Here are the basics for your application. Let's assume a league of 6 teams; the process is the same for any number.
Match 1: Simply write down the team numbers in order, in pairs, in a ring. Flatten the ring into two lines. Matches are upper (home) and lower (away).
1 2 3
6 5 4
Matches 2-5: Team 1 stays in place; the others rotate around the ring.
1 6 2
5 4 3
1 5 6
4 3 2
1 4 5
3 2 6
1 3 4
2 6 5
That's one full cycle. To balance the home-away schedule, simply invert the fixtures every other match:
1 2 3 5 4 3 1 5 6 3 2 6 1 3 4
6 5 4 1 6 2 4 3 2 1 4 5 2 6 5
There's your first full round. Simply replicate this, again switching home-away fixtures in alternate rounds. Thus, the second round would be:
6 5 4 1 6 2 4 3 2 1 4 5 2 6 5
1 2 3 5 4 3 1 5 6 3 2 6 1 3 4
Repeat this pair of rounds as many times as needed to get the length of schedule you need.
If you have an odd quantity of teams, simply declare one of the numbers to be the "bye" in the schedule. I find it easiest to follow if I use the non-rotating team -- team 1 in this example.
Note that this home-switching process guarantees that no team has three consecutive matches either home or away: a team only gets two in a row when rounding the end of a row. And even that two-in-a-row doesn't carry over at the end of a round: both of those teams break the streak in the first match of the next round.
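For completeness, here is a small R sketch of the ring-rotation construction described above (assuming an even number of teams; each round is returned as a matrix with the home row on top):
round_robin <- function(n_teams) {
  ring <- 1:n_teams                          # team 1 stays fixed; the rest rotate
  rounds <- vector("list", n_teams - 1)
  for (r in seq_len(n_teams - 1)) {
    top    <- ring[1:(n_teams / 2)]
    bottom <- rev(ring[(n_teams / 2 + 1):n_teams])
    if (r %% 2 == 0) {                       # invert fixtures every other match
      tmp <- top; top <- bottom; bottom <- tmp
    }
    rounds[[r]] <- rbind(home = top, away = bottom)
    ring <- c(ring[1], ring[n_teams], ring[2:(n_teams - 1)])   # rotate all but team 1
  }
  rounds
}
round_robin(6)   # reproduces the 6-team matches worked through above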
Unfortunately, for an arbitrary existing schedule, you are stuck with a brute-force search with backtracking. You can employ some limits and heuristics, such as balancing partial home-away fixtures as the first option at each juncture. Still, the better approach is to make your original schedule correct by design.
There's also a slight problem that you cannot guarantee that your existing schedule will fulfill the given requirements. For instance, given the 8-team fixtures in this order:
1 2 3 4
5 6 7 8
1 2 5 6
3 4 7 8
1 3 5 7
2 4 6 8
It is not possible to avoid having at least two teams playing three consecutive home or away matches.

Stacking two data frame columns into a single separate data frame column in R

I will present my question in two ways: first, as a request for a solution to a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a <- c(1, 5, 3, 4, 5)
b <- c(6, 6, 5, 6, 8)
c <- c(7, 8, 8, 10, 9)
d <- c(8, 10, 9, 11, 12)
df <- data.frame(a, b, c, d)
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price <- rbind(df$a, df$b, df$c, df$d)
price <- unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot something that looks like this: [plot of the four cumulative-share curves omitted]
Above, I accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
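which, for the mock data above, returns:
#[1] 1 3 4 5 6 7 8 9 10 11 12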
When we do an rbind, it creates a matrix, and calling unique on a matrix dispatches to the unique.matrix method:
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which loops through the rows (the default MARGIN is 1) and looks for unique rows. Instead, if we convert 'price' into a vector with either as.vector or c(price):
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
If we use unique.default:
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
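If the end goal is a one-column data frame rather than a vector, base R's stack() offers another route (a sketch using the mock df above):
data.frame(price = sort(unique(stack(df)$values)))
# one column named price, containing the same 11 unique values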
