I have two files and want to transfer data from one to the other after running a check.
File1:
ID, X1, X2, X3
2000, 1, 2, 3
2001, 3, 4, 5
1999, 2, 5, 6
2003, 3, 5, 4
File2:
ID, X1, X2, X3,
2000,
2001,
2002,
2003,
Result file will be like:
1999 "There is an error"
File2:
ID, X1, X2, X3
2000, 1, 2, 3
2001, 3, 4, 5
2002, Na, Na, Na
2003, 3, 5, 4
I tried to use a for loop with an if statement. Unfortunately, it doesn't work:
for(j in length(1: nrows(file1){
for(i in length(1: nrows(file2){
if( file1&ID[j]>= file2&ID[j+1]){
print(j, ' wrong value')
esle
file2[i,]<- file1[j,]
break
It would be very helpful to get some ideas or code showing how to produce something like the result file.
I hope I can find the right code to solve this problem
There is no need to iterate with loops; you can simply use right_join from the dplyr package:
library(dplyr)
df1 %>%
right_join(df2, by="ID") %>%
arrange(ID)
ID X1 X2 X3
1 2000 1 2 3
2 2001 3 4 5
3 2002 NA NA NA
4 2003 3 5 4
Sample data
df1 <- structure(list(ID = c(2000L, 2001L, 1999L, 2003L), X1 = c(1L,
3L, 2L, 3L), X2 = c(2L, 4L, 5L, 5L), X3 = c(3L, 5L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ID = 2000:2003), class = "data.frame", row.names = c(NA,
-4L))
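The join above silently drops ID 1999, while the question also asks for that ID to be reported. A minimal base R sketch of one way to do both, assuming the two files have already been read into data frames (the names file1, file2, bad_ids, idx and result are illustrative, not from the original posts):

```r
# Sample data mirroring File1 and File2 from the question
file1 <- data.frame(ID = c(2000L, 2001L, 1999L, 2003L),
                    X1 = c(1L, 3L, 2L, 3L),
                    X2 = c(2L, 4L, 5L, 5L),
                    X3 = c(3L, 5L, 6L, 4L))
file2 <- data.frame(ID = 2000:2003)

# IDs that occur in file1 but not in file2 are reported as errors
bad_ids <- setdiff(file1$ID, file2$ID)
for (id in bad_ids) message(id, " \"There is an error\"")

# match() returns, for each file2 ID, its row number in file1 (NA if absent);
# indexing rows with NA yields a row of NA values
idx <- match(file2$ID, file1$ID)
result <- cbind(file2["ID"], file1[idx, -1])
rownames(result) <- NULL
result
```

Because match() keeps the order of file2, no separate sorting step is needed.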
Using data.table
library(data.table)
setDT(df2)[df1, names(df1)[-1] := mget(paste0("i.", names(df1)[-1])), on = .(ID)]
-output
> df2
ID X1 X2 X3
1: 2000 1 2 3
2: 2001 3 4 5
3: 2002 NA NA NA
4: 2003 3 5 4
Here is a slightly different approach that does not give the exact expected output; note that the year 1999 is kept in the data frame:
library(dplyr)
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
bind_rows(df1, df2) %>%
group_by(ID) %>%
summarise_all(coalesce_by_column)
ID X1 X2 X3
<int> <int> <int> <int>
1 1999 2 5 6
2 2000 1 2 3
3 2001 3 4 5
4 2002 NA NA NA
5 2003 3 5 4
Related
I have a dataset with one variable with participant IDs and several variables with peer-nominations (in form of IDs).
I need to replace all numbers in the peer-nomination variables, that are not among the participant IDs, with NA.
Example: I have
ID PN1 PN2
1 2 5
2 3 4
4 6 2
5 2 7
I need
ID PN1 PN2
1 2 5
2 NA 4
4 NA 2
5 2 NA
Would be great if someone can help! Thank you so much in advance.
An alternative with Base R,
df[,-1][matrix(!(unlist(df[,-1]) %in% df[,1]),nrow(df))] <- NA
df
gives,
ID PN1 PN2
1 1 2 5
2 2 NA 4
3 4 NA 2
4 5 2 NA
library(tidyverse)
df %>%
mutate(across(-ID, ~if_else(. %in% ID, ., NA_real_)))
which gives:
# ID PN1 PN2
# 1 1 2 5
# 2 2 NA 4
# 3 4 NA 2
# 4 5 2 NA
Data used:
df <- data.frame(ID = c(1, 2, 4, 5),
PN1 = c(2, 3, 6, 2),
PN2 = c(5, 4, 2, 7))
Here is a base R way.
The lapply loop runs over all columns except the ID column and uses the replacement function is.na<- to assign NA to the vector elements that are not in df1[[1]]. It then returns the changed vector.
df1[-1] <- lapply(df1[-1], function(x){
is.na(x) <- !x %in% df1[[1]]
x
})
df1
# ID PN1 PN2
#1 1 2 5
#2 2 NA 4
#3 4 NA 2
#4 5 2 NA
Data in dput format
df1 <-
structure(list(ID = c(1L, 2L, 4L, 5L),
PN1 = c(2L, 3L, 6L, 2L), PN2 = c(5L, 4L, 2L, 7L)),
row.names = c(NA, -4L), class = "data.frame")
We could use mutate with case_when:
library(dplyr)
df %>%
mutate(across(starts_with("PN"), ~case_when(!(. %in% ID) ~ NA_real_,
TRUE ~ as.numeric(.))))
Output:
# A tibble: 4 x 3
ID PN1 PN2
<int> <dbl> <dbl>
1 1 2 5
2 2 NA 4
3 4 NA 2
4 5 2 NA
With data.table you can (l)apply the function fifelse() to every column
you have selected with .SD & .SDcols.
require(data.table)
cols = grep('PN', names(df)) # column indices (or names)
df[ , lapply(.SD, function(x) fifelse(!x %in% ID, NA_real_, x)),
.SDcols = cols ]
Data from @deschen (run this, including setDT(), before the code above):
df = data.frame(ID = c(1, 2, 4, 5),
PN1 = c(2, 3, 6, 2),
PN2 = c(5, 4, 2, 7))
setDT(df)
I have two data frames, df1 and df2, that look as follows:
year <- c(2010, 2010, 2011)
week <- c(1, 2, 1)
X1 <- c(2, 8, 7)
X2 <- c(3, 6, 5)
df1 <- data.frame(year, week, X1, X2)
df1
year week X1 X2
1 2010 1 2 3
2 2010 2 8 6
3 2011 1 7 5
firm <- c("X1", "X1", "X2")
cost <- c(10, 30, 20)
df2 <- data.frame(firm, year, week, cost)
df2
firm year week cost
1 X1 2010 1 10
2 X1 2010 2 30
3 X2 2011 1 20
I'd like to merge these so the final result (i.e. df3) looks as follows:
df3
firm year week cost Y
1 X1 2010 1 10 2
2 X1 2010 2 30 8
3 X2 2011 1 20 5
Where "Y" is a new variable that reflects the values of X1 and X2 for a particular year and week found in df1.
Is there a way to do this in R? Thank you in advance for your reply.
We can reshape the first dataset to 'long' format and then join it with the second dataset:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = X1:X2, values_to = 'Y', names_to = 'firm') %>%
right_join(df2)
-output
# A tibble: 3 x 5
# year week firm Y cost
# <dbl> <dbl> <chr> <int> <dbl>
#1 2010 1 X1 2 10
#2 2010 2 X1 8 30
#3 2011 1 X2 5 20
data
df1 <- structure(list(year = c(2010L, 2010L, 2011L), week = c(1L, 2L,
1L), X1 = c(2L, 8L, 7L), X2 = c(3L, 6L, 5L)), class = "data.frame",
row.names = c("1",
"2", "3"))
df2 <- structure(list(firm = c("X1", "X1", "X2"), year = c(2010, 2010,
2011), week = c(1, 2, 1), cost = c(10, 30, 20)), class = "data.frame",
row.names = c(NA,
-3L))
Here is a base R option (borrowing data from @akrun, thanks!)
q <- startsWith(names(df1),"X")
v <- cbind(df1[!q],stack(df1[q]),row.names = NULL)
df3 <- merge(setNames(v,c(names(df1)[!q],"Y","firm")),df2)
which gives
> df3
year week firm Y cost
1 2010 1 X1 2 10
2 2010 2 X1 8 30
3 2011 1 X2 5 20
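As an alternative to reshaping, the value of Y can also be looked up directly. The following is only a sketch in base R, assuming each (year, week) pair occurs once in df1; row_idx and col_idx are illustrative names, not from the original posts:

```r
df1 <- data.frame(year = c(2010, 2010, 2011), week = c(1, 2, 1),
                  X1 = c(2, 8, 7), X2 = c(3, 6, 5))
df2 <- data.frame(firm = c("X1", "X1", "X2"), year = c(2010, 2010, 2011),
                  week = c(1, 2, 1), cost = c(10, 30, 20))

# Row of df1 that matches each df2 row on the (year, week) key
row_idx <- match(paste(df2$year, df2$week), paste(df1$year, df1$week))
# Column of df1 whose name equals the firm in each df2 row
col_idx <- match(df2$firm, names(df1))

# Indexing a matrix with a two-column (row, column) matrix picks one cell per row
df3 <- df2
df3$Y <- as.matrix(df1)[cbind(row_idx, col_idx)]
df3
```

This keeps the row and column order of df2, which matches the layout requested in the question.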
I have a data set that looks something like this
A B 1960 1970 1980
x a 1 2 3
x b 1.1 2.1 NA
y a 2 3 4
y b 1 NA 1
I want to transform the columns based on column B so that it looks something like this
A year a b
x 1960 1 1.1
x 1970 2 2.1
x 1980 3 NA
y 1960 2 1
y 1970 3 NA
y 1980 4 1
I am not sure how to do this. I know that I can do a full transformation using t() or using row_to_columns() from tidyverse, but the result is not what I want.
The initial data has about 60 columns and 165 distinct values in column B.
You can use pivot_longer() and then pivot_wider(), although it might be a bad idea to name the resulting column "B" again:
library(dplyr)
library(tidyr)
df %>% pivot_longer(-c(A,B)) %>%
pivot_wider(names_from=B) %>% rename(B=name)
# A tibble: 6 x 4
A B a b
<fct> <chr> <dbl> <dbl>
1 x 1960 1 1.1
2 x 1970 2 2.1
3 x 1980 3 NA
4 y 1960 2 1
5 y 1970 3 NA
6 y 1980 4 1
df = structure(list(A = structure(c(1L, 1L, 2L, 2L), .Label = c("x",
"y"), class = "factor"), B = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), `1960` = c(1, 1.1, 2, 1), `1970` = c(2,
2.1, 3, NA), `1980` = c(3L, NA, 4L, 1L)), class = "data.frame", row.names = c(NA,
-4L))
library(data.table)
dt <- fread('A B 1960 1970 1980
x a 1 2 3
x b 1.1 2.1 NA
y a 2 3 4
y b 1 NA 1')
names(dt) <- as.character(dt[1,])
dt <- dt[-1,]
dt[,(3:5):=lapply(.SD,as.numeric),.SDcols=3:5]
dcast(melt(dt,measure.vars = 3:5),...~B,value.var = "value")
#> A variable a b
#> 1: x 1960 1 1.1
#> 2: x 1970 2 2.1
#> 3: x 1980 3 NA
#> 4: y 1960 2 1.0
#> 5: y 1970 3 NA
#> 6: y 1980 4 1.0
Created on 2020-05-05 by the reprex package (v0.3.0)
Base R solution:
long_df <- reshape(df, direction = "long",
varying = which(!names(df) %in% c("A", "B")),
v.names = "value",
timevar = "year",
times = names(df)[!(names(df) %in% c("A", "B"))],
ids = NULL,
new.row.names = 1:(length(which(!names(df) %in% c("A", "B"))) * nrow(df)))
wide_df <- setNames(reshape(long_df, direction = "wide",
idvar = c("A", "year"),
timevar = "B"), c("A", "B", unique(df$B)))
Data:
df <- structure(list(A = c("x", "x", "y", "y"), B = c("a", "b", "a",
"b"), `1960` = c(1, 1.1, 2, 1), `1970` = c(2, 2.1, 3, NA), `1980` = c(3L,
NA, 4L, 1L)), row.names = 2:5, class = "data.frame")
I'm trying to accomplish something like what is illustrated in this question.
However, in my situation there might be multiple cases where two columns evaluate to true:
year cat1 cat2 cat3 ... catN
2000 0 1 1 0
2001 1 0 0 0
2002 0 1 0 1
....
2018 0 1 0 0
In the data frame above, year 2000 has both the cat2 and cat3 categories. In this case, how do I create a new row that holds the second category? Something like this:
year category
2000 cat2
2000 cat3
2001 cat1
2002 cat2
2002 catN
....
2018 cat2
You can use gather from the Tidyverse
library(tidyverse)
data = tribble(
~year,~ cat1, ~cat2, ~cat3, ~catN,
2000, 0, 1, 1, 0,
2001, 1, 0, 0 , 0,
2002, 0, 1, 0, 1
)
data %>%
gather(key = "cat", value = "bool", 2:ncol(.)) %>%
filter(bool == 1)
One way would be to get row/column indices of all the values which are 1, subset the year values from row indices and column names from column indices to create a new dataframe.
mat <- which(df[-1] == 1, arr.ind = TRUE)
df1 <- data.frame(year = df$year[mat[, 1]], category = names(df)[-1][mat[, 2]])
df1[order(df1$year), ]
# year category
#2 2000 cat2
#5 2000 cat3
#1 2001 cat1
#3 2002 cat2
#6 2002 catN
#4 2018 cat2
data
df <- structure(list(year = c(2000L, 2001L, 2002L, 2018L), cat1 = c(0L,
1L, 0L, 0L), cat2 = c(1L, 0L, 1L, 1L), cat3 = c(1L, 0L, 0L, 0L
), catN = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -4L))
You can also use melt from reshape2:
library(reshape2)
new_df = melt(df, id.vars='year')
new_df[new_df$value==1, c('year','variable')]
Data
df = data.frame(year=c(2000,2001),
cat1=c(0,1),
cat2=c(1,0),
cat3=c(1,0))
Output:
year variable
2 2001 cat1
3 2000 cat2
5 2000 cat3
Here is another variation with gather: mutate the columns so that 0 becomes NA, then gather while removing the NA elements with na.rm = TRUE
library(dplyr)
library(tidyr)
data %>%
mutate_at(-1, na_if, y = 0) %>%
gather(category, val, -year, na.rm = TRUE) %>%
select(-val)
# A tibble: 5 x 2
# year category
# <dbl> <chr>
#1 2001 cat1
#2 2000 cat2
#3 2002 cat2
#4 2000 cat3
#5 2002 catN
data
data <- structure(list(year = c(2000, 2001, 2002), cat1 = c(0, 1, 0),
cat2 = c(1, 0, 1), cat3 = c(1, 0, 0), catN = c(0, 0, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I am having an issue rearranging some data.
The original data is:
structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b",
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c",
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"),
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L,
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA,
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3",
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3",
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA,
-3L))
# id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1 1 a b c <NA> <NA> 3 2 3 NA NA
# 2 2 a <NA> <NA> <NA> <NA> 2 NA NA NA NA
# 3 3 b c d e f 1 2 3 4 5
I would like to get to this:
structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L),
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L),
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d",
"e", "f"), class = "data.frame", row.names = c(NA, -3L))
# id a b c d e f
# 1 1 3 2 3 NA NA NA
# 2 2 2 NA NA NA NA NA
# 3 3 NA 1 2 3 4 5
Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.
Is there a neat way to sort this dataframe out?
Modern tidyr makes the solution even more succinct via the pivot_ functions:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>%
filter(!is.na(artery)) %>%
pivot_wider(names_from = artery, values_from = diameter)
id a b c d e f
<int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Here is the older solution, which uses the superseded gather and spread functions:
library(dplyr)
library(tidyr)
new.df <- gather(df, variable, value, artery.1:diameter.5) %>%
separate(variable, c('variable', 'num')) %>%
spread(variable, value) %>%
subset(!is.na(artery)) %>%
mutate(diameter = as.numeric(diameter)) %>%
select(-num) %>%
spread(artery, diameter)
Output:
id a b c d e f
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Or using melt/dcast combination with data.table while selecting variables using regex in the patterns function
library(data.table) #v>=1.9.6
dcast(melt(setDT(df),
id = "id",
measure = patterns("artery", "diameter")),
id ~ value1,
sum,
value.var = "value2",
subset = .(!is.na(value2)),
fill = NA)
# id a b c d e f
# 1: 1 3 2 3 NA NA NA
# 2: 2 2 NA NA NA NA NA
# 3: 3 NA 1 2 3 4 5
As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.
You can use xtabs with reshape from base R. Use the latter to transform the data to long format and the former to cross-tabulate the diameters by id and artery. Note that combinations that do not occur show up as 0 rather than NA:
xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))
# artery
#id a b c d e f
# 1 3 2 3 0 0 0
# 2 2 0 0 0 0 0
# 3 0 1 2 3 4 5
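The table above shows 0 where the requested output has NA. A minimal sketch of one way to get from the xtabs result to a data frame with NA, under the loudly stated assumption that a real diameter is never exactly 0 (the arteries are stored as plain character vectors here, and long, tab and wide are illustrative names):

```r
# Data from the question, with arteries as character vectors instead of factors
df <- data.frame(
  id = 1:3,
  artery.1 = c("a", "a", "b"), artery.2 = c("b", NA, "c"),
  artery.3 = c("c", NA, "d"), artery.4 = c(NA, NA, "e"),
  artery.5 = c(NA, NA, "f"),
  diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L),
  diameter.3 = c(3L, NA, 3L), diameter.4 = c(NA, NA, 4L),
  diameter.5 = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

long <- reshape(df, varying = 2:11, sep = ".", direction = "long")
tab  <- xtabs(diameter ~ id + artery, long)

# Turn the table into a data frame and mark never-observed cells as NA.
# Assumption: a measured diameter is never exactly 0.
wide <- as.data.frame.matrix(tab)
wide[wide == 0] <- NA
wide <- cbind(id = as.integer(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```

If a diameter of 0 can legitimately occur, this conversion would wrongly turn it into NA, and one of the reshape-based answers is the safer choice.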
This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.
reshape(
subset(
reshape(df, dir='l', varying=setdiff(names(df),'id'), timevar=NULL),
!is.na(artery)
),
dir='w', timevar='artery'
)
## id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1 1 3 2 3 NA NA NA
## 2.1 2 2 NA NA NA NA NA
## 3.1 3 NA 1 2 3 4 5
The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.
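Removing the prefixes can be sketched with sub(). Here wide is a hypothetical stand-in that holds only the first few columns of the result shown above, so the snippet runs on its own:

```r
# Stand-in for the reshaped result (first three columns only)
wide <- data.frame(id = 1:3,
                   diameter.a = c(3L, 2L, NA),
                   diameter.b = c(2L, NA, 1L))

# Strip a leading "diameter." from every column name; other names are untouched
names(wide) <- sub("^diameter\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

With several column sets, the prefixes should stay in place, since they are what tells the sets apart.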