Data frame reshape based on conditions in R

I have the following problem that I quite struggle to solve:
I have a data frame looking like:
row1 = c(55.7, NA, NA, "inf", 4.19, 99, 4, 15, 16, NA, 13, 0.1, 0.8, 51, NA, 44)
row2 = c(13, 1, 81, 6, NA, 0.3, NA, NA, 1.4, 89, NA, NA, 2.1, 99, 0.5, NA)
df = data.frame(row1, row2)
df = as.data.frame(t(df))
The first problem is that I need to change the values "inf" to the numeric value 100. Nothing I have tried helps.
This creates additional NAs:
df[df == "inf"] = 100
And this just doesn't work:
df[is.na(df)] = "Skip"
I suspect it is because of the data types, but I cannot figure out how to fix it.
The second problem is more complex. I need to transform the data frame so that, within each row, the columns holding the highest values are paired with the columns holding the lowest values, to get something like this:
row3 = c("row1","V4", "V12")
row4 = c("row1", "V6", "V13")
df2 = data.frame(row3, row4)
df2 = t(df2)
And so on for all rows and columns.
The problem is that I cannot even find an approach to this task; if you can point me in a direction, that would be extremely valuable.
Thanks a lot.

For your first problem, try converting your values to character:
df[]<-lapply(df, as.character)
df[df =="inf"] = "100"
Then convert it back to factor:
df[]<-lapply(df, as.factor)
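Since the second part of the question involves ranking values, a numeric conversion is probably more useful than a factor one. A minimal sketch along the same lines (character first, replace "inf", then numeric), reusing the df built above:
df[] <- lapply(df, function(x) {
  x <- as.character(x)
  x[x == "inf"] <- "100"   # replace the literal "inf" entries
  as.numeric(x)            # genuine NAs stay NA
})
The second problem is not covered by the answer above, but one possible direction (only a sketch, assuming the numeric df from the previous step) is to sort each row and pair the column name of the k-th largest value with the column name of the k-th smallest:
pairs_per_row <- lapply(rownames(df), function(r) {
  v  <- unlist(df[r, ])
  v  <- v[!is.na(v)]                         # drop missing values
  hi <- names(sort(v, decreasing = TRUE))    # columns from highest to lowest
  lo <- names(sort(v))                       # columns from lowest to highest
  k  <- floor(length(v) / 2)
  data.frame(row = r, high = hi[seq_len(k)], low = lo[seq_len(k)])
})
df2 <- do.call(rbind, pairs_per_row)
head(df2, 2)
#    row high low
# 1 row1   V4 V12
# 2 row1   V6 V13
This reproduces the row1/V4/V12 and row1/V6/V13 pairs from the expected output; how to handle the middle values and ties is up to you.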

Related

counting observations greater than 20 in R

I have a dataset df in R and am trying to get the number of observations greater than 20.
sample input df:
df <- data.frame(
  Ensembl_ID = c("ENSG00000284662", "ENSG00000186827", "ENSG00000186891", "ENSG00000160072", "ENSG00000041988"),
  FS_glm_1_L_Ad_N1_233_233 = c(NA, "11.0704011098281", "18.5580644869131", NA, NA),
  FS_glm_1_L_Ad_N10_36_36 = c("25.5660669439994", NA, "17.7371918093936", "17.15620204154", NA),
  FS_glm_1_L_Ad_N2_115_115 = c("26.5660644083686", NA, "11.4006170885388", "17.9862691299736", "9.83546459757003"),
  FS_glm_1_L_Ad_N3_84_84 = c("26.5660644053515", NA, "10.9591563938286", NA, NA),
  FS_glm_1_L_Ad_N4_65_65 = c("26.5660642078305", NA, "11.1498422647029", "10.5876449860129", "9.84781577969005"),
  FS_glm_1_L_Ad_N5_64_64 = c("26.5660688201853", NA, "18.613395947125", "10.5753792680759", "11.059101026016"),
  FS_glm_1_L_Ad_N6_55_55 = c("26.5660644039101", NA, "18.478237966938", "10.543187719545", NA),
  FS_glm_1_L_Ad_N7_32_32 = c("25.5660669436648", NA, "17.9467280294446", "10.0328888122706", NA),
  FS_glm_1_L_Ad_N8_31_31 = c("25.566069252448", NA, "17.6805603365895", "17.3419854603055", "9.81610669984747"))
class(df)
[1] "data.frame"
I tried
length(which(as.vector(df[,-1]) > 20))
[1] 11
and
sum(df[,-1] > 20, na.rm=TRUE)
[1] 11
However, the true count is only 8, not 11. Why is that?
The same script works correctly on another data frame but not on this df.
The data in this data frame is character, not numeric. When numbers are stored as characters, comparisons are lexicographic and odd things happen:
"2" > "13"
#[1] TRUE
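In the sample data above, for instance, the three values starting with "9.8" compare as greater than "20" when treated as strings, which is where the three extra counts come from:
"9.83546459757003" > "20"
#[1] TRUE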
Change the data to numeric before using sum.
df[-1] <- lapply(df[-1], as.numeric)
sum(df[,-1] > 20, na.rm=TRUE)
#[1] 8

Removing duplicate rows from one column without losing data from another column

I apologize if this has already been asked, but the solutions I came across didn't seem to work for me.
I have a data set that was initially multiple Excel sheets containing different variables for the same subjects. I was able to import the data into R and combine it into a single data frame using:
library(readxl)   # excel_sheets(), read_excel()
library(plyr)     # rbind.fill()
library(tibble)   # as_tibble()
x1_data <- "/data.xlsx"
excel_sheets(path = x1_data)
tab_names <- excel_sheets(path = x1_data)
list_all <- lapply(tab_names, function(x)
  read_excel(path = x1_data, sheet = x))
str(list_all)
df <- rbind.fill(list_all)
df <- as_tibble(df)
However, I now have many duplicate rows for each subject, as each sheet was essentially added beneath the preceding sheet. Something like this:
Sheet 1
ID: 1,2
Age: 32, 29
Sex: M, F
Sheet 2
ID: 1, 2
Weight: 75, 89
Height: 157, 146
Combined
ID: 1, 2, 1, 2
Age: 32, 29, NA, NA
Sex: M, F, NA, NA
Weight: NA, NA, 75, 89
Height: NA, NA, 157, 146
I can't seem to figure out how to delete the duplicate ID rows without losing the data in the columns that belong to those rows. I tried aggregate and group_by without success. What I am after is this:
Combined
ID: 1, 2
Age: 32, 29
Sex: M, F
Weight: 75, 89
Height: 157, 146
Any help would be appreciated. Thanks.
Here's a possible solution:
library(tidyverse)
df <- tibble(ID = c(1, 2, 1, 2),
Age = c(32, 29, NA, NA),
Sex = c("M", "F", NA, NA),
Weight = c(NA, NA, 75, 89),
Height = c(NA, NA, 157, 146))
df1 <- df %>% filter(is.na(Age)) %>% select(ID, Weight, Height)
df2 <- df %>% filter(!is.na(Age)) %>% select(ID, Age, Sex)
df.merged <- df2 %>% left_join(df1, by = "ID")
For future questions, please provide an already formatted sample of your data, as it makes it much easier to work with.

Returning values from a column based on the last value of another column

I have a dataset like this:
data <- data.frame(Time = c(1,4,6,9,11,13,16, 25, 32, 65),
A = c(10, NA, 13, 2, 32, 19, 32, 34, 93, 12),
B = c(1, 99, 32, 31, 12, 13, NA, 13, NA, NA),
C = c(2, 32, NA, NA, NA, NA, NA, NA, NA, NA))
What I want to retrieve are the values in Time that correspond to the last non-missing (numerical) value in A, B, and C.
For example, the last numerical values for A, B, and C are 12, 13, and 32 respectively.
So, the Time values that correspond are 65, 25, and 4.
I've tried something like data[which(data$Time== max(data$A)), ], but this doesn't work.
We can multiply the row index by the logical matrix and take the column-wise maxima with colMaxs (from matrixStats) to subset the 'Time' column:
library(matrixStats)
data$Time[colMaxs((!is.na(data[-1])) * row(data[-1]))]
#[1] 65 25 4
Or, using base R, we get the indices with which(..., arr.ind = TRUE), take the maximum row index per column with a group-by operation (tapply), and use that to extract the 'Time' values:
m1 <- which(!is.na(data[-1]), arr.ind = TRUE)
data$Time[tapply(m1[,1], m1[,2], FUN = max)]
#[1] 65 25 4
Or with summarise/across (available since dplyr 1.0.0):
library(dplyr)
data %>%
summarise(across(A:C, ~ tail(Time[!is.na(.)], 1)))
# A B C
#1 65 25 4
Or using summarise_at, which also works with older versions of dplyr:
data %>%
summarise_at(vars(A:C), ~ tail(Time[!is.na(.)], 1))
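A compact base R alternative (not from the original answer, just a sketch on the same data) loops over the value columns and takes the last Time at which each is non-missing:
sapply(data[-1], function(x) tail(data$Time[!is.na(x)], 1))
# A  B  C
#65 25  4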

How can I estimate a function in a group?

I have a data frame with 1530 obs. of 6 variables. In this data frame there are 51 assets with 30 obs. each. I tried to apply the MACD function to obtain two values, macd and signal, but an error shows up. This is an example:
macdusdt <- filtusdt %>% group_by(symbol) %>% do(tail(., n = 30))
macd1m <- macdusdt %>%
mutate (signals = MACD(macdusdt$lastPrice,
nFast = 12, nSlow = 26, nSig = 9, maType = "EMA", percent = T))
Error: Column signals must be length 30 (the group size) or one, not 3060
I want to apply the MACD function to every asset in the data frame. The dataset is here: https://www.dropbox.com/s/ww8stgsspqi8tef/macdusdt.xlsx?dl=0
Based on the data provided, the code gives an error when applied:
Error in EMA(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, :
n > number of non-NA values in column(s) 1
To prevent that, we can compute the MACD only for groups that have enough non-missing prices (with these settings, at least nSlow + nSig - 1 = 34 of them):
library(dplyr)
library(TTR)
filtusdt %>%
  group_by(symbol) %>%
  slice(tail(row_number(), 30)) %>%
  mutate(signals = if (sum(!is.na(lastPrice)) >= 34) MACD(lastPrice,
    nFast = 12, nSlow = 26, nSig = 9, maType = "EMA", percent = TRUE) else NA)
This could simply be an issue with the subset of the dataset provided; with only 30 observations per asset, there are not enough prices to fill the 26-period slow EMA and the 9-period signal line.
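Since MACD() returns a two-column matrix (macd and signal), and the question asks for both values, another route (not in the original answer; a sketch that assumes the same symbol and lastPrice columns) is to bind the indicator back onto each group so both series become ordinary columns:
library(dplyr)
library(TTR)
macd1m <- filtusdt %>%
  group_by(symbol) %>%
  group_modify(~ {
    if (sum(!is.na(.x$lastPrice)) >= 34) {            # enough prices for MACD(12, 26, 9)
      ind <- MACD(.x$lastPrice, nFast = 12, nSlow = 26, nSig = 9,
                  maType = "EMA", percent = TRUE)
      bind_cols(.x, as.data.frame(ind))               # adds macd and signal columns
    } else {
      mutate(.x, macd = NA_real_, signal = NA_real_)  # too few prices: return NA columns
    }
  }) %>%
  ungroup()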

Write a function in R containing an if/else statement and rowSums(), defining how to handle NAs

I've been searching for answers and trying all I can think of, but nothing works:
I want to write a function to add up the values across rows in a data frame. A function is easiest since I have so many columns and don't always add the same ones. Here is an example data frame:
ExampleData <- data.frame(Participant = 1:7,
Var1 = c(2, NA, 13, 15, 0, 2, NA),
Var2 = c(NA, NA, 1, 0, NA, 4, 2),
Var3 = c(6, NA, 1, 0, 1, 5, 3),
Var4 = c(12, NA, NA, 4, 10, 1, 4),
Var5 = c(10, NA, 3, 5, NA, 4, 4))
The conditions: If all values across a row are NA, the sum should be NA. If there is at least one value across a row that is a number (>= 0, or not NA), then rowSums should ignore NA's and add the values.
The best solution I've reached so far is:
addition <- function(x) {
  if (all(is.na(x))) {
    NA
  } else {
    rowSums(x, na.rm = TRUE)
  }
}
addition(ExampleData[, c("Var1", "Var2", "Var3")])
The output is: [1] 8 0 15 15 1 11 5
But it should be: [1] 8 NA 15 15 1 11 5
Does anyone know how to do this?
Thank you.
The reason is that all(is.na(x)) checks whether all the elements of the whole dataset are NA, not whether each row is all NA. If we check the output of is.na(data), it is a logical matrix, and a matrix is basically a vector with dimension attributes, so wrapping it in all() tests every element at once.
For example,
all(is.na(matrix(c(1:9, NA), 5, 2)))
#[1] FALSE
We can change the function to
addition <- function(x) {
  # NA^0 is 1 and NA^1 is NA, so rows with no non-missing values become NA
  rowSums(x, na.rm = TRUE) * NA^(rowSums(!is.na(x)) == 0)
}
addition(ExampleData[, c("Var1", "Var2", "Var3")])
#[1] 8 NA 15 15 1 11 5
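An equivalent but perhaps more explicit variant (not from the original answer, just a sketch) computes the sums first and then blanks out the all-NA rows:
addition2 <- function(x) {
  out <- rowSums(x, na.rm = TRUE)
  out[rowSums(!is.na(x)) == 0] <- NA   # rows with no non-missing values
  out
}
addition2(ExampleData[, c("Var1", "Var2", "Var3")])
#[1]  8 NA 15 15  1 11  5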
