Add new column to dataframe, based on values at specific rows within that dataframe [duplicate] - r

This question already has answers here:
Matching up two vectors in R
(2 answers)
Closed 4 years ago.
Suppose I have a dataframe such as the one below:
people.dat <- data.frame("ID" = c(2001, 1001, 2005, 2001, 5000),
                         "Data" = c(100, 300, 500, 900, 200))
Which looks something like this
+------+------+
| ID | Data |
+------+------+
| 2001 | 100 |
| 1001 | 300 |
| 2005 | 500 |
| 2001 | 900 |
| 5000 | 200 |
+------+------+
Suppose the first thing I do is work out how many unique ID values are in the dataframe (this is necessary, due to the size of the real dataset in question)
unique_ids <- sort(c(unique(people.dat$ID)))
Which gives
[1] 1001 2001 2005 5000
Where I get stuck is that I would like to add a new column, say "new_id", which looks at the "ID" value in each row, finds its position in unique_ids, and assigns that position (so the column "new_id" consists of values ranging from 1 to length(unique_ids)).
An example of the output would be as follows
+------+------+--------+
| ID | Data | new_id |
+------+------+--------+
| 2001 | 100 | 2 |
| 1001 | 300 | 1 |
| 2005 | 500 | 3 |
| 2001 | 900 |      2 |
| 5000 | 200 | 4 |
+------+------+--------+
I thought about using a for loop with if statements, but my first attempts didn't quite hit the mark. If I just wanted to replace "ID" with a sequential value, the following code would work, but I want to keep "ID" and add a separate "new_id" column:
for (i in seq_along(unique_ids)) {
  people.dat$ID[people.dat$ID == unique_ids[i]] <- i
}
Thank you for any help. I hope I have made the question as clear as possible (I struggled to phrase some of it, so please let me know if anything needs clarifying).

This is more like a 'rank' problem
people.dat$rank <- as.numeric(factor(people.dat$ID))
people.dat
    ID Data rank
1 2001  100    2
2 1001  300    1
3 2005  500    3
4 2001  900    2
5 5000  200    4
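As an aside, since the question already computes unique_ids, match() is an equivalent sketch that returns the same positional index directly and keeps the original ID column:
# Position of each ID within the sorted unique IDs (same result as the factor trick)
people.dat$new_id <- match(people.dat$ID, unique_ids)
people.dat
#     ID Data new_id
# 1 2001  100      2
# 2 1001  300      1
# 3 2005  500      3
# 4 2001  900      2
# 5 5000  200      4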

Related

How to convert comma-separated multiple responses into dummy coded columns in R

In a survey, there was a question that asked "what aspect of the course helped you learn concepts the most? Select all that apply"
Here is what the list of responses looked like:
Student_ID = c(1,2,3)
Responses = c("lectures,tutorials","tutorials,assignments,lectures", "assignments,presentations,tutorials")
Grades = c(1.1,1.2,1.3)
Data = data.frame(Student_ID,Responses,Grades);Data
Student_ID | Responses | Grades
1 | lectures,tutorials | 1.1
2 | tutorials,assignments,lectures | 1.2
3 | assignments,presentations,tutorials | 1.3
Now I want to create a data frame that looks something like this
Student_ID | Lectures | Tutorials | Assignments | Presentations | Grades
1          | 1        | 1         | 0           | 0             | 1.1
2          | 1        | 1         | 1           | 0             | 1.2
3          | 0        | 1         | 1           | 1             | 1.3
I managed to separate the comma separated responses into columns, using the splitstackshape package. So currently my data looks like this:
Student ID | Response 1 | Response 2 | Response 3 | Response 4 | Grades
1 | lectures | tutorials | NA | NA | 1.1
2 | tutorials | assignments | lectures | NA | 1.2
3 | assignments| presentation| tutorials | NA | 1.3
But as I stated earlier, I would like my table to look like the way I presented above, in dummy codes. I am stuck on how to proceed. Perhaps an idea is to go through each observation in the columns and append 1 or 0 to a new data frame with lectures,tutorials,assignments,presentation as the headers?
First the Response column is converted from factor to character class. Each element of that column is then split on comma. I don't know what all the possible responses are, so I used all that are present. Next the split Response column is tabulated, specifying the possible levels. The resulting list is converted into a matrix before being mixed into the old data.frame.
Data$Responses <- as.character(Data$Responses)
resp.split <- strsplit(Data$Responses, ",")
lev <- unique(unlist(resp.split))
resp.dummy <- lapply(resp.split, function(x) table(factor(x, levels=lev)))
Data2 <- with(Data, data.frame(Student_ID, do.call(rbind, resp.dummy), Grades))
Data2
# Student_ID lectures tutorials assignments presentations Grades
# 1 1 1 1 0 0 1.1
# 2 2 1 1 1 0 1.2
# 3 3 0 1 1 1 1.3
I found a solution to my own question. I initially did
library(splitstackshape)
TA <- cSplit(Data, "Responses", ",")
Then I added the following lines:
library(qdapTools)
TA <- mtabulate(as.data.frame(t(TA)))
It worked for me.

Remove duplicates where values are swapped across 2 columns in R [duplicate]

This question already has answers here:
pair-wise duplicate removal from dataframe [duplicate]
(4 answers)
Closed 6 years ago.
I have a simple dataframe like this:
| id1 | id2 | location | comment |
|-----|-----|------------|-----------|
| 1 | 2 | Alaska | cold |
| 2 | 1 | Alaska | freezing! |
| 3 | 4 | California | nice |
| 4 | 5 | Kansas | boring |
| 9 | 10 | Alaska | cold |
The first two rows are duplicates because the same pair of ids (1 and 2) went to Alaska. It doesn't matter that their comments are different.
How can I remove one of these duplicates -- either one would be fine to remove.
I was first trying to sort id1 and id2, then get the index where they are duplicated, then go back and use the index to subset the original df. But I can't seem to pull this off.
df <- data.frame(id1 = c(1, 2, 3, 4, 9),
                 id2 = c(2, 1, 4, 5, 10),
                 location = c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'),
                 comment = c('cold', 'freezing!', 'nice', 'boring', 'cold'))
We can use apply with MARGIN=1 to sort by row for the 'id' columns, cbind with 'location' and then use duplicated to get a logical index that can be used for removing/keeping the rows.
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
# id1 id2 location comment
#1 1 2 Alaska cold
#3 3 4 California nice
#4 4 5 Kansas boring
#5 9 10 Alaska cold
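An alternative sketch (not part of the answer above): pmin() and pmax() build the sorted id pairs without apply(), which tends to be faster on large data frames:
# Lower and higher id per row, plus location, define the duplicate key
key <- data.frame(pmin(df$id1, df$id2), pmax(df$id1, df$id2), df$location)
df[!duplicated(key), ]
# returns the same four rows as above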

Fetch min and max values in the same row using group by

First of all, my English is basic. Sorry.
Second, and most important: I can't find the way to write what should be a simple query. My table is like this:
------------------------------------------
id_det_iti | id_iti | orden_iti| id_ciudad
--------------------------------------------
1 | 1 | 1 | 374
2 | 1 | 2 | 25
3 | 1 | 3 | 241
4 | 2 | 1 | 34
5 | 2 | 2 | 22
6 | 2 | 3 | 352
7 | 2 | 4 | 17
--------------------------------------------
Then I want to get results like this:
------------------------------------------
id_iti | min | id_ciudad | max | id_ciudad
------------------------------------------
1 | 1 | 374 | 3 | 241
2 | 1 | 34 | 4 | 17
------------------------------------------
I need to show the max and the min value in the same row, grouped by id_iti.
I have tried to use a full join, but I'm working with SQLite, so that's not an option. I spent a long day trying different options but couldn't find the solution. I hope you guys can help me.
Thanks in advance!
Edit:
SELECT a.id_iti, c.id_ciudad, d.id_ciudad
FROM detalle_itinerario as a,
(SELECT MAX(orden_iti),id_ciudad, id_iti FROM detalle_itinerario) AS c
INNER JOIN
(SELECT MIN(orden_iti),id_ciudad, id_iti FROM detalle_itinerario) AS d
ON c.id_iti=d.id_iti
GROUP BY a.id_iti;
That's only one of my attempts, but I only get the values of the first match.
First, use a simple query to get the min/max values for each group:
SELECT id_iti,
       MIN(orden_iti) AS min,
       MAX(orden_iti) AS max
FROM detalle_itinerario
GROUP BY id_iti;
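For the sample table above, this first query returns one row per id_iti:
id_iti | min | max
   1   |  1  |  3
   2   |  1  |  4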
You can then use these values to join back to the original table:
SELECT a.id_iti,
a.min,
a2.id_ciudad,
a.max,
a3.id_ciudad
FROM (SELECT id_iti,
MIN(orden_iti) AS min,
MAX(orden_iti) AS max
FROM detalle_itinerario
GROUP BY id_iti) AS a
JOIN detalle_itinerario AS a2 ON a.id_iti = a2.id_iti AND a.min = a2.orden_iti
JOIN detalle_itinerario AS a3 ON a.id_iti = a3.id_iti AND a.max = a3.orden_iti;
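Joining back to detalle_itinerario twice picks up the city at each end, which for the sample data gives the requested result:
id_iti | min | id_ciudad | max | id_ciudad
   1   |  1  |    374    |  3  |    241
   2   |  1  |     34    |  4  |     17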

Creating variables by conditional command in R

I have a longitudinal dataset in which people turn 40 in different years, and I need to do an analysis (propensity score matching) with the 40-year-olds. I want to create an income variable that uses Income1992 for people who turn forty in 1998, Income1994 for people who turn forty in 2000, and so on.
My data looks like this (and I want Incomenew to look like this):
| ID | SourceYear | Income1992 | Income1994 | Incomenew |
|----|------------|------------|------------|-----------|
| 1  | 1998       | 10000      | 12000      | 10000     |
| 2  | 2000       | 20000      | 15000      | 15000     |
| 3  | 1998       | 17000      | 16000      | 17000     |
| 4  | 2000       | 18000      | 20000      | 20000     |
I am interested in their income 6 years before they turn 40. I already adjusted all income variables for the purchasing power of a certain year. I tried this:
Incomenew<-NA
Incomenew[SourceYear=="1998"]<-Income1992[SourceYear=="1998"]
Incomenew[SourceYear=="2000"]<-Income1994[SourceYear=="2000"]
I get all NAs
I also tried this:
Incomenew <- if (SourceYear == "1998") {
  Income1992
} else if (SourceYear == "2000") {
  Income1994
}
I get the following error
Error in if (SourceYear== "1998") { : argument is of length zero
It would be great if someone could help with this; I would really appreciate it.
In my original dataset I had some NA's for the SourceYear. I didn't realize that it was important for this command.
The first command actually works, if a subset without NA's in the SourceYear is used. An example is:
ID <- c(1, 2, 3, 4, 5, 6)
SourceYear <- c("1998", "2000", "1998", "2002", "2000", NA)
Income92 <- c(100000, 120000, 170000, 180000, 190000, NA)
Income94 <- c(120000, 150000, 160000, 20000, NA, 120000)
Income96 <- c(130000, 110000, NA, 180000, 190000, 180000)
incomedata <- data.frame(ID, SourceYear, Income92, Income94, Income96)
summary(incomedata)
incomedata1<-subset(incomedata, !is.na(incomedata$SourceYear))
incomedata1$Incomenew<-rep(NA, length(incomedata1$SourceYear))
incomedata1$Incomenew[incomedata1$SourceYear=="1998"]<-
incomedata1$Income92[incomedata1$SourceYear=="1998"]
incomedata1$Incomenew[incomedata1$SourceYear=="2000"]<-
incomedata1$Income94[incomedata1$SourceYear=="2000"]
incomedata1$Incomenew[incomedata1$SourceYear=="2002"]<-
incomedata1$Income96[incomedata1$SourceYear=="2002"]
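A vectorized alternative (a sketch assuming the incomedata frame above): nested ifelse() propagates NA for rows whose SourceYear is NA, so no subsetting is needed.
# Rows with SourceYear NA get Incomenew NA automatically
incomedata$Incomenew <- ifelse(incomedata$SourceYear == "1998", incomedata$Income92,
                        ifelse(incomedata$SourceYear == "2000", incomedata$Income94,
                        ifelse(incomedata$SourceYear == "2002", incomedata$Income96, NA)))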

How to add certain elements of column in Matrix in R?

     [,1] [,2] [,3]
[1,]    3   20    3
[2,]    2   10    3
[3,]    3   25    3
[4,]    1   15    3
[5,]    3   30    3
Can you help me get the sum of the second column, but only for the elements that have 3 in the first column? For example, in this matrix it is 20 + 25 + 30 = 75. It needs to be fast (it's actually a big matrix).
P.S. I tried something like this: with(Train, sum(Column2[,"Date"] == i))
As you can see, I need the sum of Column2 where the date has a certain value (from 1 to 12).
We can create a logical index with the first column and use that to subset the second column and get the sum
sum(m1[m1[,1]==3,2])
EDIT: Based on #Richard Scriven's comment.
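To make this reproducible, here is a sketch that rebuilds the example matrix from the question as m1 (the name used in the answer), and also shows tapply() as one way to get the sum for every value in the first column at once (e.g. dates 1 to 12):
m1 <- matrix(c(3, 20, 3,
               2, 10, 3,
               3, 25, 3,
               1, 15, 3,
               3, 30, 3), ncol = 3, byrow = TRUE)

# Sum of column 2 where column 1 equals 3
sum(m1[m1[, 1] == 3, 2])
# [1] 75

# Sums of column 2 for every distinct value in column 1
tapply(m1[, 2], m1[, 1], sum)
#  1  2  3
# 15 10 75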
