I have the table below, with two columns, in Teradata:
ID1 ID2
10 12
10 13
13 15
15 17
19 21
21 23
In the first row there are two ids, 10 and 12. This means 12 is a duplicate of 10, so 12 must be replaced by 10. Similarly, in the next row 13 must be replaced by 10. In the third row, however, there are 13 and 15, which means 15 should be replaced by 13. But since 13 has already been replaced by 10, both 13 and 15 must be replaced by 10.
The output I expect is:
id original_id
10 10
12 10
13 10
15 10
17 10
19 19
21 19
23 19
Can anyone please help me out with this? Thanks in advance.
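This is not a Teradata query, but the replacement rule described above can be sketched in R to make the logic concrete. It is only a minimal sketch, assuming that each ID2 value appears in at most one row and that the links contain no cycles (a cycle would make the loop run forever). The names `find_root` and `original_id` are made up for the illustration.

```r
# Sample rows from the question: each ID2 is a duplicate of its ID1.
pairs <- data.frame(ID1 = c(10, 10, 13, 15, 19, 21),
                    ID2 = c(12, 13, 15, 17, 21, 23))

# Hypothetical helper: follow the ID2 -> ID1 links until reaching an id
# that is not listed as anybody's duplicate.
find_root <- function(id, pairs) {
  repeat {
    parent <- pairs$ID1[match(id, pairs$ID2)]
    if (is.na(parent)) return(id)  # id never appears in ID2: it is a root
    id <- parent
  }
}

ids <- sort(unique(c(pairs$ID1, pairs$ID2)))
result <- data.frame(
  id = ids,
  original_id = vapply(ids, find_root, numeric(1), pairs = pairs)
)
result
```

On the sample rows this maps 10, 12, 13, 15 and 17 to 10, and 19, 21 and 23 to 19, which is the expected output listed in the question.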
I have a list of integers that represent years of education:
education= 12 14 17 15 12 19 16 12 16 14 12 18 12 13 18 18 10 13 12 18
22 16 13 22 12 15 12 16 18 18 18 20 18 16 13 12 16 13 18 20 20 20 14 18
18 12 18 16 20 18 14 16 19 12 12 11 13 13
I am trying to categorize the years into 3 different levels:
9-12
13-17
18+
I have tried to use the cut function:
edulevels=cut(education,c(9,12,13,17,18,22))
but it creates 2 additional levels for 12-13 and 17-18:
Levels: (9,12] (12,13] (13,17] (17,18] (18,22]
How do I get it to only create these three levels?
The simplest solution:
edulevels= cut(education,c(9,12.5,17.5,22), labels = c("9-12", "13-17", "18+"))
Intervals defined by the cut() function are closed on the right. To see what that means, try this:
cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]
As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up.
edulevels <- cut(education,
                 breaks = c(8.5, 12.5, 17.5, Inf),
                 labels = c('9-12', '13-17', '18+'))
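Another option, if you would rather keep integer break points, is to flip which side of each interval is closed with the right argument of cut(). The sketch below uses a short vector of boundary values in place of the real education data, purely to show where the edges fall.

```r
# Boundary values only, standing in for the full education vector
education <- c(9, 12, 13, 17, 18, 22)

# right = FALSE makes the intervals [9,13), [13,18), [18,Inf),
# so 12 falls in the first level and 13 in the second
edulevels <- cut(education,
                 breaks = c(9, 13, 18, Inf),
                 right = FALSE,
                 labels = c("9-12", "13-17", "18+"))
as.character(edulevels)
# [1] "9-12"  "9-12"  "13-17" "13-17" "18+"   "18+"
```

Both approaches give the same three levels for integer data; half-integer breaks just make the intent visible without having to remember which side is closed.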
Simple question, I think. Basically, I want to use the concept "less than or equal to a number" as the condition to select the row of one column, and then find the value on the same row in another column. But what happens if the number stated in the condition isn't found in the first column?
Let's assume this is my data frame:
df <- as.data.frame(matrix(1:20, nrow = 10, ncol = 2))
df
V1 V2
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Let's assume I want to use the condition <=5 in df$V1 to obtain the row that is used to find the value of the same row in df$V2.
df[which(df$V1 <= 5),2]
15
But what happens if the number used in the condition isn't found? Let's assume this is my new data.frame
V1 V2
1 1 11
2 2 12
3 3 13
4 4 14
5 6 15
6 7 16
7 8 17
8 9 18
9 10 19
10 11 20
Using the same command as above, df[which(df$V1 <= 5),2], I obtain a different answer. For some reason I obtain the entire column instead of one number.
11 12 13 14 15 16 17 18 19 20
Any suggestions?
Use the subset operator:
df[df[,1] <= 5, 2]
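For what it's worth, which(df$V1 <= 5) returns the index of every row that meets the condition, so the subset has one value per matching row. If the goal is the single row with the largest V1 that does not exceed 5, one option is to keep only the last matching index. This is a minimal sketch that assumes V1 is sorted in ascending order, using the second data frame from the question:

```r
# Second data frame from the question: V1 skips the value 5
df <- data.frame(V1 = c(1:4, 6:11), V2 = 11:20)

idx <- which(df$V1 <= 5)   # rows 1 to 4 satisfy the condition
df$V2[idx]                 # one value per matching row
# [1] 11 12 13 14

last_match <- df$V2[tail(idx, 1)]   # only the last matching row
last_match
# [1] 14
```

If no row meets the condition, idx is empty and last_match is a zero-length vector, so that case may need a separate check.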
I am having a bit of trouble combining two columns into one. I have two columns of ages, which are split into child and adolescent columns. For example:
child adolescent
1 NA 12
2 NA 15
3 NA 12
4 NA 12
5 NA 13
6 NA 13
7 NA 13
8 NA 14
9 14 15
10 NA 12
11 12 13
12 NA 12
13 NA 13
14 NA 14
15 NA 14
16 12 13
17 NA 14
18 NA 13
19 NA 13
20 NA 14
21 NA 12
22 NA 13
23 12 15
24 NA 13
25 NA 15
26 NA 12
27 NA 15
28 NA 15
29 NA 13
30 NA 12
31 13 15
Now what I would like to do is combine them into one column called "age" and remove all the NA values. However, when I try the following code, I encounter a problem:
age <- c(na.omit(data$child), na.omit(data$adolescent))
The problem is that my original data has 514 rows, yet when I combine the two columns, removing the NAs, I somehow end up with 543 values, not 514, and I don't know why.
So, if possible, could someone explain firstly why I am getting more values than I expected, and secondly what might be a better way to combine the two columns?
EDIT: I am looking for something like this
age
1 12
2 15
3 12
4 12
5 13
6 13
7 13
8 14
9 14
10 12
11 12
12 12
13 13
14 14
15 14
16 12
17 14
18 13
19 13
20 14
21 12
22 13
23 12
24 13
25 15
26 12
27 15
28 15
29 13
30 12
31 13
32 14
33 13
34 11
35 15
36 13
Thanks in advance
This line:
age<- c(na.omit(data$child),na.omit(data$adolescent))
concatenates all the non-missing values from the child field with all the non-missing values from the adolescent field. I suspect you want to use one of these solutions:
# youngest age
age <- pmin(data$child, data$adolescent, na.rm = TRUE)
# oldest age
age <- pmax(data$child, data$adolescent, na.rm = TRUE)
# child age, replaced with adolescent if missing
age<- data$child
age[is.na(age)] <- data$adolescent[is.na(age)]
# ^ notice same logical index ^
# |_______________________________|
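To see where the extra values come from: every row that has both a child and an adolescent age contributes two values to the concatenation. A minimal sketch on a made-up four-row data frame (the values are illustrative only, not the original data):

```r
# Made-up data: rows 2 and 4 have both ages filled in
data <- data.frame(child      = c(NA, 14, NA, 12),
                   adolescent = c(12, 15, 13, 13))

combined <- c(na.omit(data$child), na.omit(data$adolescent))

both    <- sum(!is.na(data$child) & !is.na(data$adolescent))  # rows counted twice
neither <- sum(is.na(data$child) & is.na(data$adolescent))    # rows not counted

length(combined)                # 6 values from 4 rows
nrow(data) + both - neither     # same count: 4 + 2 - 0
```

By the same arithmetic, 543 values from 514 rows would mean that rows with both ages outnumber rows with neither by 29.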
Your code works on the example data, but you could try this:
age <- c(data$child, data$adolescent)
age <- age[!is.na(age)]
This combines the two columns from the data frame into a vector and removes all NA elements.
df$age <- ifelse(!is.na(df$child), df$child, df$adolescent)
I used the igraph package to detect communities. When I used the membership(community) function, the result was:
1 2 3 4 5 6 7 13 17 18 19 20 22 23 24 25
12 9 1 10 12 6 12 16 1 11 6 6 3 13 16 1
29 30 31 33 34 37 38 39 40 41 42 43 44 45 46 47
9 5 11 14 13 6 13 11 12 13 1 16 11 6 12 7
...
The first line is node ID and the second line is its corresponding community ID.
Suppose the name of the above result is X. I used Y=data.frame(X). The result is:
community
1 12
2 9
3 1
4 10
5 12
6 6
7 12
13 16
...
I want to index by the first column (1, 2, 3, ...), so that, for instance, Y[13,] gives 16. But as it stands, it is Y[8,] that gives 16. How can I do this?
This question may be very simple, but I do not know how to google it. Thanks.
The function as.data.frame() converts a named vector to a data frame, using the names of the vector elements as row names.
In other words, what looks like the first column is actually the row names; use a construct like rownames(Y)[8] to access them.
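Since the node IDs end up as row names, they can also be used directly as a character index. A minimal sketch, using a plain named vector as a stand-in for the membership() result (the real object comes from igraph and may need converting first, which is an assumption here):

```r
# Stand-in for the first eight entries of the membership result:
# names are node IDs, values are community IDs
X <- c("1" = 12, "2" = 9, "3" = 1, "4" = 10,
       "5" = 12, "6" = 6, "7" = 12, "13" = 16)
Y <- data.frame(community = X)

Y["13", "community"]   # look up by row name (node ID), note the quotes
# [1] 16
Y[8, "community"]      # look up by position, same row
# [1] 16
```

The quotes matter: Y[13, ] asks for the thirteenth row by position, while Y["13", ] asks for the row whose name is "13".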
I'm working on an R project where I'm trying to compare frequencies and their respective values. Essentially I have an 11852 x 3 data frame with the position number in column 1, unique values ranging from 1 to 11852 in the second column, and the same set of unique values, just in different positions, in column 3.
Because the values in columns 2 and 3 overlap, I want to find the difference between the positions of each value, based on the position number in the first column on the far left, and store it in another data frame. So if the second column has the value 2017 in position one and the third column also has 2017 in position one, the new data frame would have an entry of 2017 and then a value of 0, since they have the same position. If column 2 has a value of 5276 in the second position, and column 3 has the value 5276 in position 73, then the new data frame would have a value of 71.
I would love some guidance as to the way on how to do this. Thanks.
Let me know if the code below works for you. It generates negative values if a number occurs higher up in the 3rd column than in the 2nd column.
# Generate simulated data
n <- 20
x <- data.frame(c1 = 1:n, c2 = sample(n), c3 = sample(n))
# order(x$c3)[v] is the row of value v in c3, and likewise for c2,
# so this difference is indexed by value, not by row
x$diff <- order(x$c3) - order(x$c2)
# Move each difference to the row where its value sits in c2
x$diff[order(x$c2)] <- x$diff
x
c1 c2 c3 diff
1 1 12 8 4
2 2 11 5 9
3 3 7 4 6
4 4 15 3 12
5 5 19 12 12
6 6 13 1 12
7 7 9 14 12
8 8 18 16 7
9 9 8 7 -8
10 10 16 20 -2
11 11 6 11 1
12 12 4 6 -9
13 13 14 10 -6
14 14 5 17 -12
15 15 10 18 -2
16 16 1 15 -10
17 17 3 19 -13
18 18 2 13 2
19 19 17 9 -5
20 20 20 2 -10
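The same differences can also be computed with match(), which looks up the row of each c2 value in c3 directly. This is a minimal sketch, assuming every value in c2 occurs exactly once in c3; the four-row data frame below is made up for the illustration and only borrows 2017 and 5276 from the question.

```r
# Made-up example: 2017 sits in row 1 of both columns,
# 5276 sits in row 2 of c2 and row 4 of c3
x <- data.frame(c1 = 1:4,
                c2 = c(2017, 5276, 31, 8),
                c3 = c(2017, 8, 31, 5276))

# match(x$c2, x$c3) gives, for each row, where that row's c2 value sits in c3
x$diff <- match(x$c2, x$c3) - x$c1
x$diff
# [1]  0  2  0 -2
```

Unlike the order() trick, this does not rely on the values being exactly 1 to n, only on each value being present once in both columns.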