I have the following data:
x y id
1 2
2 2 1
3 4
5 6 2
3 4
2 1 3
The blanks in column id should take the value of the next non-blank id, so my data should actually look like this:
x y id
1 2 1
2 2 1
3 4 2
5 6 2
3 4 3
2 1 3
I also have a list:
list[[1]] = 1 3 2
Or alternatively a column:
c(1,3,2) = 1, 3, 2
Now I would like to reorder my data based on column id according to the order in the list. My data should then look like this:
x y id
1 2 1
2 2 1
3 4 3
2 1 3
3 4 2
5 6 2
Is there an efficient way to do this?
EDIT: I don't think this is a duplicate of "R: Sorting by absolute value without changing the data", because I do not want to sort by absolute value but by a specific order that is given in a list.
A base R option (assuming that the blanks in the 'id' column are NA) that reorders the rows with a non-NA id in place:
i1 <- !is.na(df1$id)
df1[i1,][match(df1$id[i1], list[[1]]),] <- df1[i1, ]
df1
# x y id
#1 1 2 NA
#2 2 2 1
#3 3 4 NA
#4 2 1 3
#5 3 4 NA
#6 5 6 2
If we need to replace each NA with the succeeding non-NA element:
library(zoo)
df1$id <- na.locf(df1$id, fromLast = TRUE)
data
df1 <- structure(list(x = c(1L, 2L, 3L, 5L, 3L, 2L), y = c(2L, 2L, 4L,
6L, 4L, 1L), id = c(NA, 1L, NA, 2L, NA, 3L)), class = "data.frame",
row.names = c(NA, -6L))
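Putting the two steps together, here is a minimal base R sketch (not the answerer's exact code) that first back-fills the blank ids and then orders whole rows by the position of their id in the target order. The names `fill_next` and `ord` are made up for this illustration, and the blanks are again assumed to be NA.

```r
# Example data from the question; blanks in id are assumed to be NA
df1 <- data.frame(x = c(1L, 2L, 3L, 5L, 3L, 2L),
                  y = c(2L, 2L, 4L, 6L, 4L, 1L),
                  id = c(NA, 1L, NA, 2L, NA, 3L))
ord <- c(1, 3, 2)  # desired order of the ids (hypothetical name)

# Hypothetical helper: replace each NA with the next non-NA value.
# Reversing the vector turns "next value" into "previous value", so each
# run started by a non-NA value can be filled with that run's first element.
fill_next <- function(v) {
  r <- rev(v)
  rev(ave(r, cumsum(!is.na(r)), FUN = function(g) g[1]))
}

df1$id <- fill_next(df1$id)

# match() gives each id's position in ord; order() sorts the rows by that position
res <- df1[order(match(df1$id, ord)), ]
res
```

Because `order()` breaks ties by original position, rows that share an id keep their relative order.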
I have a list of Likert values ranging from 1 to 5. Each possible response may occur once, more than once or not at all per column. I have several columns and rows; each row corresponds to a participant, each column to a question. There is no NA data.
Example:
c1 c2 c3
1 1 5
2 2 5
3 3 4
3 4 3
2 5 1
1 3 1
1 5 1
The goal is to count the frequencies of the answer options column-wise, so that the columns can then be compared.
So the resulting table should look like this:
answer c1 c2 c3
1 3 1 3
2 2 1 0
3 2 2 1
4 0 1 1
5 0 2 2
I know how to do this for one column, and I can look at the frequencies with apply(ds, 1, table), but I cannot get the result into a table that I can work with further.
Thanks!
This should do it, using plyr:
count_df = setNames(data.frame(t(plyr::ldply(apply(df, 2, table), rbind)[2:6])), colnames(df))
count_df[is.na(count_df)] = 0
You may use table in sapply -
sapply(df, function(x) table(factor(x, 1:5)))
# c1 c2 c3
#1 3 1 3
#2 2 1 0
#3 2 2 1
#4 0 1 1
#5 0 2 2
This approach can also be used in dplyr if you prefer that.
library(dplyr)
df %>% summarise(across(.fns = ~table(factor(., 1:5))))
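The factor(x, 1:5) step is what keeps answer options that never occur in a column: table() on the raw values only reports values that are present, so the per-column results would have different lengths and could not be combined into one matrix. A small sketch of the difference, using the example columns (the object names `plain`, `fixed`, `counts` and `counts_df` are illustrative):

```r
df <- data.frame(c1 = c(1L, 2L, 3L, 3L, 2L, 1L, 1L),
                 c2 = c(1L, 2L, 3L, 4L, 5L, 3L, 5L),
                 c3 = c(5L, 5L, 4L, 3L, 1L, 1L, 1L))

# Without fixed levels, the values 4 and 5 are simply missing for c1
plain <- table(df$c1)

# With fixed levels, every option from 1 to 5 gets a count, possibly 0
fixed <- table(factor(df$c1, levels = 1:5))

# Equal-length results let sapply() simplify to a matrix,
# which converts directly into a data frame for further work
counts <- sapply(df, function(col) table(factor(col, levels = 1:5)))
counts_df <- as.data.frame(counts)
counts_df
```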
We may use a vectorized option in base R
table(data.frame(v1 = unlist(df1), v2 = names(df1)[col(df1)]))
#    v2
# v1  c1 c2 c3
#   1  3  1  3
#   2  2  1  0
#   3  2  2  1
#   4  0  1  1
#   5  0  2  2
data
df1 <- structure(list(c1 = c(1L, 2L, 3L, 3L, 2L, 1L, 1L), c2 = c(1L,
2L, 3L, 4L, 5L, 3L, 5L), c3 = c(5L, 5L, 4L, 3L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))
Suppose I want to find duplicate rows based on these columns:
cols<-c("col1", "col2")
I know that for the data df4 the duplicate rows are:
Jo<-df4[duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE), ]
and that removing these duplicate rows from the data set is done with:
No<-df4[!(duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE)), ]
I want to modify the above code. Suppose there is a column called mode that takes integer values from 1 to 4. A group of duplicate rows should not be treated as duplicates if all of its rows have mode == 2.
example
col1 col2 mode
1 3 5
5 3 9
1 2 1
1 2 1
3 2 2
3 2 2
4 1 3
4 1 2
4 1 2
output
Jo:
col1 col2 mode
1 2 1
1 2 1
4 1 3
4 1 2
4 1 2
No:
col1 col2 mode
1 3 5
5 3 9
3 2 2
3 2 2
In the above example, the two rows with col1 = 3 and col2 = 2 both have mode == 2, so they are not treated as duplicates. In the last three rows one mode value is not 2, so they are duplicates.
Based on the updated dataset,
library(dplyr)
out1 <- df2 %>%
group_by_at(vars(cols)) %>%
filter(n() > 1, !all(mode ==2))
out2 <- anti_join(df2, out1)
out1
# A tibble: 5 x 3
# Groups: col1, col2 [2]
# col1 col2 mode
# <int> <int> <int>
#1 1 2 1
#2 1 2 1
#3 4 1 3
#4 4 1 2
#5 4 1 2
out2
# col1 col2 mode
#1 1 3 5
#2 5 3 9
#3 3 2 2
#4 3 2 2
Or with data.table
library(data.table)
i1 <- setDT(df2)[ , .I[.N > 1 & !all(mode == 2)], by = cols]$V1
df2[i1]
# col1 col2 mode
#1: 1 2 1
#2: 1 2 1
#3: 4 1 3
#4: 4 1 2
#5: 4 1 2
df2[!i1]
# col1 col2 mode
#1: 1 3 5
#2: 5 3 9
#3: 3 2 2
#4: 3 2 2
Or using base R
i1 <- duplicated(df2[1:2])|duplicated(df2[1:2], fromLast = TRUE)
out11 <- df2[i1 & with(df2, !ave(mode==2, col1, col2, FUN = all)),]
out22 <- df2[setdiff(row.names(df2), row.names(out11)),]
data
df2 <- structure(list(col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L),
col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), mode = c(5L,
9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-9L))
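The base R one-liner above can be unpacked into separate steps, which makes the two conditions (group has more than one row, and not every row in the group has mode == 2) easier to check. This is only a sketch on the example data; `key`, `grp_size`, `all_mode2` and `is_dup` are names chosen for this illustration.

```r
df2 <- data.frame(col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L),
                  col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L),
                  mode = c(5L, 9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L))

# One group label per (col1, col2) combination
key <- interaction(df2$col1, df2$col2, drop = TRUE)

# Number of rows in each row's group
grp_size <- ave(seq_len(nrow(df2)), key, FUN = length)

# TRUE for rows whose whole group has mode == 2
all_mode2 <- ave(df2$mode == 2, key, FUN = all)

# Duplicate = group has more than one row and is not entirely mode 2
is_dup <- grp_size > 1 & !all_mode2

Jo <- df2[is_dup, ]   # the rows treated as duplicates
No <- df2[!is_dup, ]  # everything else
```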
ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
Each person has exactly one record whose Number is 1; for all the others, Number is 2.
The variable Var has different values for the same person.
The Var value in the row where Number equals 1 (call it P) differs from person to person.
Now, for every person, I want to delete the rows where Var > P.
At the end, I want this
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first on the rows where Number == 1 to get the threshold Var value for each ID:
library(dplyr)
df %>% group_by(ID) %>% mutate(Flag=first(Var[Number==1])) %>%
filter(Var <= Flag) %>% select(-Flag)
# Shorter version, if you are sure each ID has exactly one row with Number == 1
df %>% group_by(ID) %>% filter(Var <= Var[Number==1])
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option would be
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x!=0])),]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")
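All of the solutions above rely on each ID having exactly one row with Number == 1. A minimal base R sketch that makes this assumption explicit and builds the per-person threshold as a lookup vector could look like this (`p_rows`, `p_by_id` and `res` are names invented for the example):

```r
df1 <- data.frame(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                  Number = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L),
                  Var = c(6L, 7L, 8L, 9L, 10L, 3L, 4L, 5L, 6L))

# Rows that define the threshold P for each person
p_rows <- df1[df1$Number == 1, ]

# Guard the assumption: exactly one Number == 1 row per ID
stopifnot(!anyDuplicated(p_rows$ID), all(df1$ID %in% p_rows$ID))

# Lookup vector: names are the IDs, values are the thresholds
p_by_id <- setNames(p_rows$Var, p_rows$ID)

# Keep rows whose Var does not exceed their person's threshold
res <- df1[df1$Var <= p_by_id[as.character(df1$ID)], ]
res
```

If the guard fails, the data need cleaning first, because the shorter filters would then either recycle or silently compare against the wrong value.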
Following my earlier question:
R: reshape/gather function to create dataset ready for multilevel analysis
I discovered it is a bit more complicated. My dataset is actually 'messier' than I hoped. So here's the full story:
I have a big dataset with 240 cases. Each row is a case (a breast cancer patient). Somewhere at the end of the dataset (say from column 417 onwards) I have data from the patients' partners, who also filled in a questionnaire.
In the beginning, there are demographic variables for both patients and partners, followed by test outcomes of patients only, then followed by the partner data.
I want to create a dataset where I 'split' the patient and partner data, but keep them coupled. Thus: I want to duplicate the subject ID and create a new column with 1s and 2s (1 corresponding to patient and 2 to partner).
Then, I want my data essentially as it is now, except that some variables can be matched. For example, I now have "date of birth" separately for the patient [pgebdat] and for the partner [prgebdat]. Of course, I could turn this into a single 'gebdat' column with the two birth dates below each other.
This code worked for me for a small subset of my data:
mydf_long <- mydf4 %>%
unite(bb1:bb50rec, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
unite(pbb1:pbb50recM, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
gather(couple, value, `1`:`2`) %>% # Form into long data
separate(value, sep = ";", into = c(paste0("bb", seq(1:104),"", sep = ','))) %>% # Separate and retrieve original answers
arrange(id)
results in:
id groep_MNC zkhs fbeh pgebdat couple bb1,
1 3 1 1 1 1955-12-01 1 4
2 3 1 1 1 1955-12-01 2 5
3 5 1 1 1 1943-04-09 1 2
4 5 1 1 1 1943-04-09 2 2
But now it copies and pastes the date of birth of the patient also to 'partner' row.
I'm stuck, and don't even quite know what data you would need to be able to answer my question, so please do ask. I'll provide something of an example below:
Example of data
id groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age prgesl relpnst
1 3 1 1 1 1955-12-01 42.50000 1 <NA> NA 2 1
2 5 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.50000 1 2
3 7 1 1 1 1958-04-10 40.25000 1 <NA> NA 2 1
4 10 1 1 1 1958-04-17 40.25000 1 1957-07-31 41.33333 2 1
5 12 1 1 2 1947-11-01 50.66667 1 1944-06-08 54.58333 2 1
And then, after couple of hundred variables for only patients, this partner data comes along:
pbb1 pbb2 pbb3 pbb4 pbb5 pbb6 pbb7 pbb8 pbb9
1 5 5 5 5 2 5 4 2 3
2 2 1 4 1 3 4 3 3 4
3 5 3 4 4 4 3 5 3 4
4 5 3 5 5 5 5 4 4 4
5 5 5 5 5 5 4 4 3 4
note, I didn't create this dataset myself - I'm just here to tidy up the mess :)
Edit: The dataset is in dutch. Pgesl = gender for patient, prgesl = gender for partner... etc.
Using the melt function from the data.table package, you can melt several sets of measure columns at once with patterns, which creates more than one value column:
library(data.table)
melt(setDT(df), measure.vars = patterns('_age','gesl','gebdat'),
value.name = c('age','geslacht','geboortedatum')
)[, variable := c('patient','partner')[variable]][]
you get:
id groep_MNC zkhs fbeh relpnst pbb1 pbb2 variable age geslacht geboortedatum
1: 3 1 1 1 1 5 5 patient 42.50000 1 1955-12-01
2: 5 1 1 1 2 2 1 patient 55.16667 1 1943-04-09
3: 7 1 1 1 1 5 3 patient 40.25000 1 1958-04-10
4: 10 1 1 1 1 5 3 patient 40.25000 1 1958-04-17
5: 12 1 1 2 1 5 5 patient 50.66667 1 1947-11-01
6: 3 1 1 1 1 5 5 partner NA 2 <NA>
7: 5 1 1 1 2 2 1 partner 36.50000 1 1962-04-18
8: 7 1 1 1 1 5 3 partner NA 2 <NA>
9: 10 1 1 1 1 5 3 partner 41.33333 2 1957-07-31
10: 12 1 1 2 1 5 5 partner 54.58333 2 1944-06-08
Instead of patterns you could also use a list of column indexes or column names.
HTH
Used data:
df <- structure(list(id = c(3L, 5L, 7L, 10L, 12L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 2L),
pgebdat = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
p_age = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
pgesl = c(1L, 1L, 1L, 1L, 1L),
prgebdat = c("<NA>", "1962-04-18", "<NA>", "1957-07-31", "1944-06-08"),
pr_age = c(NA, 36.5, NA, 41.33333, 54.58333),
prgesl = c(2L, 1L, 2L, 2L, 2L),
relpnst = c(1L, 2L, 1L, 1L, 1L),
pbb1 = c(5L, 2L, 5L, 5L, 5L),
pbb2 = c(5L, 1L, 3L, 3L, 5L)),
.Names = c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl", "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
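For comparison, the same wide-to-long step can be sketched with base R's reshape(), passing explicit lists of column names instead of patterns. This swaps in a different function from the melt answer above; it uses a reduced version of the example data, with missing birth dates stored as real NA rather than the string "<NA>" (an assumption made for this sketch), and the name `long` is illustrative.

```r
df <- data.frame(
  id       = c(3L, 5L, 7L, 10L, 12L),
  pgebdat  = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
  p_age    = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
  pgesl    = c(1L, 1L, 1L, 1L, 1L),
  prgebdat = c(NA, "1962-04-18", NA, "1957-07-31", "1944-06-08"),
  pr_age   = c(NA, 36.5, NA, 41.33333, 54.58333),
  prgesl   = c(2L, 1L, 2L, 2L, 2L),
  pbb1     = c(5L, 2L, 5L, 5L, 5L),
  stringsAsFactors = FALSE
)

# Each element of `varying` pairs the patient column with the partner column;
# `times` becomes the couple indicator (1 = patient, 2 = partner)
long <- reshape(
  df,
  direction = "long",
  varying = list(c("p_age", "pr_age"),
                 c("pgesl", "prgesl"),
                 c("pgebdat", "prgebdat")),
  v.names = c("age", "geslacht", "geboortedatum"),
  timevar = "couple",
  times = 1:2,
  idvar = "id"
)

# Put each patient directly above his or her partner
long <- long[order(long$id, long$couple), ]
long
```

Columns that are not listed in `varying` (here pbb1) are repeated on both rows of a couple, just as in the melt output.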
Apologies if this has been asked already, but I searched and could not find an exact example of what I am trying to do. I'm trying to subset a dataframe to exclude rows that have matching numerical values across five columns. For example, for the following dataframe, df, I'd want to return a new dataframe only with rows 1:2, 5:6, and 8:10:
Row A B C D E
1 1 1 2 3 1
2 4 1 2 3 5
3 2 2 2 2 2
4 5 5 5 5 5
5 4 4 2 3 4
6 2 1 3 5 2
7 3 3 3 3 3
8 3 2 5 3 3
9 2 1 2 2 4
10 3 3 3 2 3
I'm having trouble figuring out how to do this for more than two columns. I've tried the following and know they are not right.
df2 <- df[!duplicated(df, c("A", "B", "C", "D", "E"))]
and
df2 <- df[df$A==df$B==df$C==df$D==df$E,]
Thanks in advance.
Data frames are usually operated on column-wise rather than row-wise, which is why your duplicated attempt doesn't work. (It's checking for duplicate rows within those columns.) Your == attempt doesn't work because == is a binary operator: df$A == df$B returns a logical vector of TRUE/FALSE values, so even with explicit parentheses, (df$A == df$B) == df$C would test whether df$C equals TRUE or FALSE. (Without parentheses, R does not parse a chain of == at all.)
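A tiny illustration of that pitfall, using made-up vectors `a`, `b` and `c3` that are equal in every position:

```r
a  <- c(2, 1)
b  <- c(2, 1)
c3 <- c(2, 1)

# (a == b) is c(TRUE, TRUE); comparing it with c3 coerces TRUE to 1,
# so the result only says whether c3 equals 1
chained <- (a == b) == c3

# Comparing pairs and combining them with & tests what was intended
pairwise <- a == b & b == c3

chained
pairwise
```

Here `chained` is FALSE for the first position even though all three values agree, while `pairwise` is TRUE in both positions.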
apply is a good way to run a function on each row. It will convert your data frame to a matrix to run the function, but in this case that's fine because columns A through E are all numeric. Here's one way:
df[apply(df[, -1], 1, function(x) length(unique(x))) > 1, ]
# Row A B C D E
# 1 1 1 1 2 3 1
# 2 2 4 1 2 3 5
# 5 5 4 4 2 3 4
# 6 6 2 1 3 5 2
# 8 8 3 2 5 3 3
# 9 9 2 1 2 2 4
# 10 10 3 3 3 2 3
You could come up with all sorts of different functions to apply to test for all the elements being the same.
I assumed you actually have a column named Row. If that isn't the case, leave out the -1 in my code above.
Using this data, reproducibly shared with dput().
df = structure(list(Row = 1:10, A = c(1L, 4L, 2L, 5L, 4L, 2L, 3L,
3L, 2L, 3L), B = c(1L, 1L, 2L, 5L, 4L, 1L, 3L, 2L, 1L, 3L), C = c(2L,
2L, 2L, 5L, 2L, 3L, 3L, 5L, 2L, 3L), D = c(3L, 3L, 2L, 5L, 3L,
5L, 3L, 3L, 2L, 2L), E = c(1L, 5L, 2L, 5L, 4L, 2L, 3L, 3L, 4L,
3L)), .Names = c("Row", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-10L))
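As one example of a different test for "all elements the same", a row is constant exactly when its minimum equals its maximum. The sketch below uses pmin and pmax across the columns, which avoids the per-row function call; it is an alternative to the apply approach, not part of the original answer, and `vals`, `all_same` and `res` are illustrative names.

```r
df <- data.frame(Row = 1:10,
                 A = c(1L, 4L, 2L, 5L, 4L, 2L, 3L, 3L, 2L, 3L),
                 B = c(1L, 1L, 2L, 5L, 4L, 1L, 3L, 2L, 1L, 3L),
                 C = c(2L, 2L, 2L, 5L, 2L, 3L, 3L, 5L, 2L, 3L),
                 D = c(3L, 3L, 2L, 5L, 3L, 5L, 3L, 3L, 2L, 2L),
                 E = c(1L, 5L, 2L, 5L, 4L, 2L, 3L, 3L, 4L, 3L))

# The five value columns as an unnamed list, one element per column
vals <- unname(as.list(df[c("A", "B", "C", "D", "E")]))

# Row-wise minimum and maximum; equal means every value in the row is the same
all_same <- do.call(pmin, vals) == do.call(pmax, vals)

# Keep only the rows that are not constant
res <- df[!all_same, ]
res
```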
You can simply compare all the columns against a single column and check whether they are all the same (the output below assumes df holds only columns A through E):
df[rowSums(df[-1] == df[, 1]) < (ncol(df) - 1), ]
# A B C D E
# 1 1 1 2 3 1
# 2 4 1 2 3 5
# 5 4 4 2 3 4
# 6 2 1 3 5 2
# 8 3 2 5 3 3
# 9 2 1 2 2 4
# 10 3 3 3 2 3
Or just df[rowSums(df == df[, 1]) < (ncol(df)), ]
Or similarly, you can avoid matrix conversions altogether and combine Reduce and lapply:
df[!Reduce("&" , lapply(df, `==`, df[, 1])), ]
# A B C D E
# 1 1 1 2 3 1
# 2 4 1 2 3 5
# 5 4 4 2 3 4
# 6 2 1 3 5 2
# 8 3 2 5 3 3
# 9 2 1 2 2 4
# 10 3 3 3 2 3