I am trying to standardize responses from an API in R. However, in some cases the API returns a different format, which prevents me from standardizing and automating the process. The solution I have in mind is as follows:
if dataframe has more than 1 variable, keep dataframe as it is
if dataframe has 1 variable then transpose
This is what I have tried so far:
col <- ncol(df)
df <- ifelse(col > 1, as.data.frame(df), as.data.frame(t(df)))
However, this returns a list and breaks the rest of the process. Thanks in advance for any help; links to relevant documentation would be welcome too.
Maybe you need something like this:
# some simple data frames
df1 <- data.frame(col1 = c("a", "b"))
df2 <- data.frame(col1 = c("a", "b"),
                  col2 = c("c", "d"))

func <- function(df) {
  if (ncol(df) == 1) {
    as.data.frame(t(df))
  } else {
    df
  }
}
func(df1)
     V1 V2
col1  a  b

func(df2)
  col1 col2
1    a    c
2    b    d
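As a side note, the reason the original ifelse() attempt returns a list is that ifelse() is vectorized: it builds its result element by element from the (recycled) condition, so it cannot hand back a whole data frame from one branch. A scalar if () ... else ... is the right construct here. A minimal sketch of the difference:

df <- data.frame(col1 = c("a", "b"))

# ifelse() works element-wise: with a length-1 condition it returns a
# length-1 list, not a data frame
bad <- ifelse(ncol(df) > 1, as.data.frame(df), as.data.frame(t(df)))
class(bad)    # "list"

# if/else evaluates exactly one branch and returns its value unchanged
good <- if (ncol(df) > 1) df else as.data.frame(t(df))
class(good)   # "data.frame"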
I would like to create a function like this (obviously not proper code):
forEach ID in DATAFRAME1, look at each row with that ID in DATAFRAME2 {
    if DATAFRAME2$VARIABLE1 = something {
        DATAFRAME1$VARIABLE1 = TRUE;
        DATAFRAME1$VARIABLE2 = DATAFRAME2$VARIABLE2
    }
}
In plain text: I've got a list of individuals and a database with mixed information on these individuals. Let's say DATAFRAME2 contains information on books read, c(id, title, author, date). I want to create a new variable in DATAFRAME1 with a boolean for whether the individual has read a specific book (VARIABLE1 above) and the date they first read it (VARIABLE2 above). Adding a third variable with the number of times read would also be interesting, but is not necessary.
I haven't really done this in R before, having mostly done basic statistics and basic wrangling with dplyr. I guess I could use dplyr and a join, but a function like the above feels like a better approach. Any help to get me started would be much appreciated.
The following function does what the question asks for. Its arguments are
DF1 and DF2 have an obvious meaning;
var1 and var2 are VARIABLE1 and VARIABLE2 in the question;
value is the value of something.
The test data is at the end.
fun <- function(DF1, DF2, ID = 'ID', var1, var2, value){
  DF1[[var1]] <- NA
  DF1[[var2]] <- NA
  k <- DF2[[var1]] == value          # rows of DF2 where var1 has the target value
  for(id in DF1[[ID]]){
    i <- DF1[[ID]] == id
    j <- DF2[[ID]] == id
    if(any(j & k)){
      DF1[[var1]][i] <- TRUE
      DF1[[var2]][i] <- DF2[[var2]][j & k]
    }
  }
  DF1
}
fun(df1, df2, value = 4, var1 = 'X', var2 = 'Y')
#  ID    X  Y
#1  a   NA NA
#2  d TRUE 19
Test data.

set.seed(1234)
df1 <- data.frame(ID = c("a", "d"))
df2 <- data.frame(ID = rep(letters[1:5], 4),
                  X = sample(20, 20, TRUE),
                  Y = sample(20))
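Since the question mentions dplyr, here is a hedged sketch of how the same result could be obtained with a filter, a grouped summary and a left join. It uses the test data above (columns ID, X, Y and the value 4); the names matches and times_read are just illustrative:

library(dplyr)

# one row per ID that has at least one X == 4, with the first matching Y
# and the number of matches (the optional "times read" count)
matches <- df2 %>%
  filter(X == 4) %>%
  group_by(ID) %>%
  summarise(X = TRUE, Y = first(Y), times_read = n())

# IDs in df1 with no match get NA in the new columns
df1 %>% left_join(matches, by = "ID")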
I have a data set that has 313 columns, ~52000 rows of information. I need to remove each column that contains the word "PERMISSIONS". I've tried grep and dplyr but I can't seem to get it to work.
I've read the file in,
testSet <- read.csv("/Users/.../data.csv")
Other examples show how to remove columns by name but I don't know how to handle wildcards. Not quite sure where to go from here.
If you just want to remove columns whose names contain PERMISSIONS, you can use the select() function from the dplyr package.
df <- data.frame("PERMISSIONS" = c(1,2), "Col2" = c(1,4), "Col3" = c(1,2))
PERMISSIONS Col2 Col3
1 1 1
2 4 2
df_sub <- select(df, -contains("PERMISSIONS"))
Col2 Col3
1 1
4 2
From what I could understand from the question, the OP has a data frame like this:
df <- read.table(text = '
a b c d
e f PERMISSIONS g
h i j k
PERMISSIONS l m n',
stringsAsFactors = F)
The goal is to remove every column that has any 'PERMISSIONS' entry. Assuming that there's no variability in 'PERMISSIONS', this code should work:
# count, for each column, how many entries equal 'PERMISSIONS'
cols <- colSums(mapply('==', 'PERMISSIONS', df))
# keep only the columns with zero matches
new.df <- df[, which(cols == 0)]
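For the example data frame above, this keeps only the columns that never contain 'PERMISSIONS':

new.df

  V2 V4
1  b  d
2  f  g
3  i  k
4  l  n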
Try this,
New.testSet <- testSet[,!grepl("PERMISSIONS", colnames(testSet))]
EDIT: changed the script as per the comment. We can use grepl with ! to negate the match, filtering both rows and columns:

New.testSet <- testSet[!grepl("PERMISSIONS", row.names(testSet)),
                       !grepl("PERMISSIONS", colnames(testSet))]
It looks like these answers only do part of what you want. I think this is what you're looking for. There is probably a better way to write this though.
library(data.table)

df <- data.frame("PERMISSIONS" = c(1, 2), "Col2" = c("PERMISSIONS", "A"), "Col3" = c(1, 2))

  PERMISSIONS        Col2 Col3
1           1 PERMISSIONS    1
2           2           A    2

# first drop columns whose name contains "PERMISSIONS"
df <- df[, !grepl("PERMISSIONS", colnames(df))]

# then drop columns that contain the value "PERMISSIONS"
setDT(df)
ind <- df[, lapply(.SD, function(x) grepl("PERMISSIONS", x, perl = TRUE))]
df[, which(colSums(ind) == 0), with = FALSE]

   Col3
1:    1
2:    2
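If you prefer to stay in base R, here is a sketch that combines both checks (column names and column values) in one step; testSet is the data frame read in from the question:

# drop columns whose name OR any of whose values contain "PERMISSIONS"
name_hit  <- grepl("PERMISSIONS", colnames(testSet))
value_hit <- sapply(testSet, function(x) any(grepl("PERMISSIONS", x)))
New.testSet <- testSet[, !(name_hit | value_hit)]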
I have a data frame and I would like to split the first column into two columns, but the separator ("-") is the same throughout the string and I only want to split at its fourth occurrence.
data frame:
TCGA-TS-A7P1-01A-41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A-11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A-11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A-11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A-11D-A34E-05 0.8882558
desired output:
TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
TCGA-LK-A4O2-01A 11D-A34E-05 0.6942843
TCGA-MQ-A4LI-01A 11D-A34E-05 0.8882558
I tried:
sapply(strsplit(as.character(df$ID), "-"), '[', 1:4)
However, that does not give the desired output shown above. Thank you very much.
It seems all the elements of your first column have the same length, so one simple way could be:
df <- data.frame(col1 = c("TCGA-TS-A7P1-01A-41D-A39S-05","TCGA-NQ-A57I-01A-11D-A34E-05","TCGA-3H-AB3O-01A-11D-A39S-05"),
col2 = c(0.8637304,0.7812147,0.8963944), stringsAsFactors = FALSE)
df$col1bis <- substr(df$col1,18,28)
df$col1 <- substr(df$col1,1,16)
Then I rearrange the order of the columns:
df <- df[, c(1,3,2)]
resulting in:
> df
              col1     col1bis      col2
1 TCGA-TS-A7P1-01A 41D-A39S-05 0.8637304
2 TCGA-NQ-A57I-01A 11D-A34E-05 0.7812147
3 TCGA-3H-AB3O-01A 11D-A39S-05 0.8963944
I tried this and it worked well:

df <- cbind(df[, 1], df)            # duplicate the first column
df[, 1] <- substr(df[, 1], 1, 16)   # keep the part before the 4th "-"
df[, 2] <- substr(df[, 2], 18, 28)  # keep the part after the 4th "-"
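Both answers above rely on every ID being exactly 28 characters long. As a length-independent alternative, here is a sketch that splits at the fourth hyphen with a regular expression (using the original df with the full IDs in col1, as defined in the first answer):

# group 1 captures everything before the 4th "-", group 2 everything after it
pat <- "^([^-]+-[^-]+-[^-]+-[^-]+)-(.*)$"
df$col1bis <- sub(pat, "\\2", df$col1)  # part after the 4th "-"
df$col1    <- sub(pat, "\\1", df$col1)  # part before the 4th "-"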
This seems like it should be really simple. I've got 2 data frames of unequal length in R; one is simply a random subset of the larger one, so they share exactly the same data and the same UniqueID. What I would like to do is add an indicator, say a 0 or 1, to the larger data set that says "this row is in the smaller data set".
I can use which(long$UniqID %in% short$UniqID), but I can't seem to figure out how to match this indicator back to the long data set.
I made some sample data:
long<-data.frame(UniqID=sample(letters[1:20],20))
short<-data.frame(UniqID=sample(letters[1:20],10))
You can use %in% without which() to get TRUE/FALSE values, and then convert them to 0/1 with as.numeric():
long$sh<-as.numeric(long$UniqID %in% short$UniqID)
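If you would rather keep the which() call from the question, you can also assign by position; this sketch gives the same 0/1 column as the line above:

long$sh <- 0
long$sh[which(long$UniqID %in% short$UniqID)] <- 1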
I'll use @AnandaMahto's data to illustrate another way, using duplicated(), which works whether or not you have a unique ID column. The idea is to stack df2 on top of df1 and mark the rows of df1 that are duplicates of rows coming from df2.
Case 1: Has unique id column
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1)[, "ID",
drop=FALSE])[-seq_len(nrow(df2))])
Case 2: Has no unique id column
set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1))[-seq_len(nrow(df2))])
The answers so far are good. However, a question was raised: what if there isn't a "UniqID" column?
At that point, perhaps merge can be of assistance:
Here's an example using merge and %in% where an ID is available:
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
temp <- merge(df1, df2, by = "ID")$ID
df1$matches <- as.integer(df1$ID %in% temp)
And, a similar example where an ID isn't available.
set.seed(1)
df1_NoID <- data.frame(A = rnorm(10), B = rnorm(10))
df2_NoID <- df1_NoID[sample(10, 4), ]
temp <- merge(df1_NoID, df2_NoID, by = "row.names")$Row.names
df1_NoID$matches <- as.integer(rownames(df1_NoID) %in% temp)
You can directly use the logical vector as a new column:
long$Indicator <- 1*(long$UniqID %in% short$UniqID)
See if this can get you started:
long <- data.frame(UniqID=sample(1:100)) #creating a long data frame
short <- data.frame(UniqID=long[sample(1:100, 30), ]) #creating a short one with the same ids.
long$indicator <- long$UniqID %in% short$UniqID #creating an indicator column in long.
> head(long)
  UniqID indicator
1     87      TRUE
2     15      TRUE
3    100      TRUE
4     40     FALSE
5     89     FALSE
6     21     FALSE
Every week I receive an incomplete dataset for an analysis. It looks like this:
df1 <- data.frame(var1 = c("a","","","b",""),
var2 = c("x","y","z","x","z"))
Some var1 values are missing. The dataset should end up looking like this:
df2 <- data.frame(var1 = c("a","a","a","b","b"),
var2 = c("x","y","z","x","z"))
Currently I use an Excel macro to do this, but that makes it harder to automate the analysis. From now on I would like to do it in R, but I have no idea how.
Thanks for your help.
QUESTION UPDATE AFTER COMMENT
var2 is not relevant to my question. The only thing I am trying to do is get from df1 to df2:
df1 <- data.frame(var1 = c("a","","","b",""))
df2 <- data.frame(var1 = c("a","a","a","b","b"))
Here is one way of doing it, making use of run-length encoding (rle) and its inverse, inverse.rle:
fillTheBlanks <- function(x, missing = ""){
  enc <- rle(as.character(x))
  empty <- which(enc$values == missing)
  enc$values[empty] <- enc$values[empty - 1]  # copy the value of the preceding run
  inverse.rle(enc)
}
df1$var1 <- fillTheBlanks(df1$var1)
The results:
df1
  var1 var2
1    a    x
2    a    y
3    a    z
4    b    x
5    b    z
Here is a simpler way:
library(zoo)
df1$var1[df1$var1 == ""] <- NA   # treat blanks as missing
df1$var1 <- na.locf(df1$var1)    # last observation carried forward
The tidyr package has the fill() function, which does the trick:
df1 <- data.frame(var1 = c("a",NA,NA,"b",NA), stringsAsFactors = FALSE)
df1 %>% fill(var1)
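Note that fill() only fills NA values, while in the question the blanks are empty strings. A small sketch that converts them first, assuming a character column and using dplyr::na_if:

library(dplyr)
library(tidyr)

df1 <- data.frame(var1 = c("a", "", "", "b", ""),
                  var2 = c("x", "y", "z", "x", "z"),
                  stringsAsFactors = FALSE)

df1 %>%
  mutate(var1 = na_if(var1, "")) %>%  # turn "" into NA
  fill(var1)                          # then carry the last value down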
Here is another way, which is slightly shorter and doesn't coerce to character: keep only the non-missing values, then index them with the running count of non-missing entries seen so far.
Fill <- function(x, missing = "")
{
  Log <- x != missing   # TRUE where a value is present
  y <- x[Log]           # the present values, in order
  y[cumsum(Log)]        # repeat each one until the next present value
}
Results:
# For factor:
Fill(df1$var1)
[1] a a a b b
Levels: a b
# For character:
Fill(as.character(df1$var1))
[1] "a" "a" "a" "b" "b"
Below is my unfill function, which does the reverse (it blanks out values repeated from the row above). I encountered the same kind of problem; I hope it helps.
library(dplyr)
library(purrr)

unfill <- function(df, cols){
  col_names <- names(df)
  unchanged <- df[!(names(df) %in% cols)]
  changed <- df[names(df) %in% cols] %>%
    map_df(function(col){
      col[col == lag(col)] <- NA   # blank out values repeated from the row above
      col
    })
  unchanged %>% bind_cols(changed) %>% select(one_of(col_names))
}
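For completeness, a quick usage sketch of unfill on the filled data from this question (df2 as in the question; cols names the columns to blank out):

df2 <- data.frame(var1 = c("a", "a", "a", "b", "b"),
                  var2 = c("x", "y", "z", "x", "z"),
                  stringsAsFactors = FALSE)

unfill(df2, cols = "var1")
# var1 becomes a, NA, NA, b, NA; var2 is left untouched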