Transpose a data frame - r

I need to transpose a large data frame and so I used:
df.aree <- t(df.aree)
df.aree <- as.data.frame(df.aree)
This is what I obtain:
df.aree[c(1:5),c(1:5)]
10428 10760 12148 11865
name M231T3 M961T5 M960T6 M231T19
GS04.A 5.847557e+03 0.000000e+00 3.165891e+04 2.119232e+04
GS16.A 5.248690e+04 4.047780e+03 3.763850e+04 1.187454e+04
GS20.A 5.370910e+03 9.518396e+03 3.552036e+04 1.497956e+04
GS40.A 3.640794e+03 1.084391e+04 4.651735e+04 4.120606e+04
My problem is the new column names(10428, 10760, 12148, 11865) that I need to eliminate because I need to use the first row as column names.
I tried with col.names() function but I haven't obtain what I need.
Do you have any suggestion?
EDIT
Thanks for your suggestion!!! Using it I obtain:
df.aree[c(1:5),c(1:5)]
M231T3 M961T5 M960T6 M231T19
GS04.A 5.847557e+03 0.000000e+00 3.165891e+04 2.119232e+04
GS16.A 5.248690e+04 4.047780e+03 3.763850e+04 1.187454e+04
GS20.A 5.370910e+03 9.518396e+03 3.552036e+04 1.497956e+04
GS40.A 3.640794e+03 1.084391e+04 4.651735e+04 4.120606e+04
GS44.A 1.225938e+04 2.681887e+03 1.154924e+04 4.202394e+04
Now I need to transform the row names(GS..) in a factor column....

You'd better not transpose the data.frame while the name column is in it - all numeric values will then be turned into strings!
Here's a solution that keeps numbers as numbers:
# first remember the names
n <- df.aree$name
# transpose all but the first column (name)
df.aree <- as.data.frame(t(df.aree[,-1]))
colnames(df.aree) <- n
df.aree$myfactor <- factor(row.names(df.aree))
str(df.aree) # Check the column types

You can use the transpose function from the data.table library. Simple and fast solution that keeps numeric values as numeric.
library(data.table)
# get data
data("mtcars")
# transpose
t_mtcars <- transpose(mtcars)
# get row and colnames in order
colnames(t_mtcars) <- rownames(mtcars)
rownames(t_mtcars) <- colnames(mtcars)

df.aree <- as.data.frame(t(df.aree))
colnames(df.aree) <- df.aree[1, ]
df.aree <- df.aree[-1, ]
df.aree$myfactor <- factor(row.names(df.aree))

Take advantage of as.matrix:
# keep the first column
names <- df.aree[,1]
# Transpose everything other than the first column
df.aree.T <- as.data.frame(as.matrix(t(df.aree[,-1])))
# Assign first column as the column names of the transposed dataframe
colnames(df.aree.T) <- names

With tidyr, one can transpose a dataframe with "pivot_longer" and then "pivot_wider".
To transpose the widely used mtcars dataset, you should first transform rownames to a column (the function rownames_to_column creates a new column, named "rowname").
library(tidyverse)
mtcars %>%
rownames_to_column() %>%
pivot_longer(!rowname, names_to = "col1", values_to = "col2") %>%
pivot_wider(names_from = "rowname", values_from = "col2")

You can give another name for transpose matrix
df.aree1 <- t(df.aree)
df.aree1 <- as.data.frame(df.aree1)

Related

Select specific columns, where the column names are in another df in r

I couldn't find a solution in stack, so here's my issue:
I have a df with 342 columns.
I want to make a new df with only specific columns
The list of columns to keep is in another df, listed in 3 columns titled X,Y,Z for 3 new dataframes
Here's my code right now:
# Read the data:
data <- data.table::fread("data_30_9.csv")
# Import variable names #
variable.names.full = openxlsx::read.xlsx("variables2.xlsx")
Y.variable.names = na.omit(variable.names.full[1])
X.variable.names = na.omit(variable.names.full[2])
Z.variable.names = na.omit(variable.names.full[3])
# Make new DF with only specific columns:
X.Data = data %>% select(as.character(X.variable.names)) # This works as X has only 1 variable
Y.Data = data %>% select(as.character(Y.variable.names)) # This give an error: Error:
# # Can't subset columns that don't exist.
Help?
the data is available here:
https://github.com/amirnakar/TammyA/blob/main/data_30_9.csv
https://github.com/amirnakar/TammyA/blob/main/Variables2.xlsx
The problem is that Y.variable.names is a data.frame which you cannot use to subset another data.frame.
You can check by typing class(Y.variable.names).
So the solution to your problem is subsetting Y.variable.names:
Y.Data = data %>% select(Y.variable.names[,1])
Use lapply on variable.names.full and select the columns from data.
list_data <- lapply(variable.names.full, function(x)
data[, na.omit(x), drop = FALSE])

Read data set from website in numeric format not character

With the code below I read data from a website.
The problem is it reads the data as character not in numeric format especially some columns such as "Enlem(N) and Boylam(E).
How can I fix this?
library(rvest)
widths <- c(11,10,10,10,14,5,5,5,48,100)
dat <- "http://www.koeri.boun.edu.tr/scripts/lst5.asp" %>%
read_html %>%
html_nodes("pre") %>%
html_text %>%
textConnection %>%
read.fwf(widths = widths, stringsAsFactors = FALSE) %>%
setNames(nm = .[6,]) %>%
tail(-7) %>%
head(-2)
If you know what specific columns should be a number, you can convert those columns to be a number. If you do not know what columns should be a number, you can create a function to look at the data and if a large enough percentage of the cases in the column are a number change that column to be a number. I have used the function below for this purpose:
NumericColumns <- function(x, AllowedPercentNumeric =.95, PreserveDate=TRUE, PreserveColumns){
# find the counts of NA values in input data frame's columns
param_originalNA <- apply(x, 2, function(z){sum(is.na(z))})
# blindly coerce data.frame to numeric
df_JustNumbers <- suppressWarnings(as.data.frame(lapply(x, as.numeric)))
# Percent Non-NA values in each column
PercentNumeric <- (apply(df_JustNumbers, 2, function(x)sum(!is.na(x))))/(nrow(x)-param_originalNA)
rm(param_originalNA)
# identify columns which have a greater than or equal percentage of numeric as specified
param_numeric <- names(PercentNumeric)[PercentNumeric >= AllowedPercentNumeric]
# Remove columns from list to convert to numeric that are specified as to preserve
if (!missing(PreserveColumns)){param_numeric <- setdiff(param_numeric, PreserveColumns)}
# Identify columns that are dates initially
IsDateColumns <- lapply(x, function(y)(is(y, "Date")|is(y, "POSIXct")))
param_dates <- names(IsDateColumns)[IsDateColumns==TRUE]
# Remove dates from list if specified to preserve dates
if (PreserveDate){param_numeric <- setdiff(param_numeric, param_dates)}
# returns column position of numeric columns in target data frame
param_numeric <- match(param_numeric, colnames(df_JustNumbers))
# removes NA's from column list
param_numeric <- param_numeric[complete.cases(param_numeric)]
# coerces columns in param_numeric to numeric and inserts numeric columns into target data.frame
if(length(param_numeric)==1){
suppressWarnings(x[, param_numeric] <- as.numeric(x[, param_numeric]))
}
if(length(param_numeric)>1){
suppressWarnings(x[, param_numeric] <- apply(x[, param_numeric],2, function(x)as.numeric(x)))
}
return(x)
}
Once the function is created, you can use it on you data such as:
# Use function to convert to numeric
dat <- NumericColumns(dat)

as.numeric and as.character across different columns

I am trying to set some variables as character and others as numeric, what I currently have is;
colschar <- c(1:2, 68:72)
colsnum <- c(3:67)
subset <- as.data.frame(lapply(data[, colschar], as.character), (data[, colsnum], as.numeric))
which returns an error.
I am trying to set columns 1:2 and 68:72 as a character and columns 3:67 all as numeric.
I suggest:
data[colschar] <- lapply(data[colschar], as.character)
data[colsnum] <- lapply(data[colsnum], as.numeric)
It should be better if you share an extract of your data. In any case you may try with tidiverse approach:
library(dplyr)
mydf_molt <- mydf %>%
mutate_at(.vars=c(1:2, 68:72),.funs=funs(as.character(.))) %>%
mutate_at(.vars=c(3:67),.funs=funs(as.numeric(.)))

Recoding a large number of variables using another data frame in R

I'd like to use a data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame that contains text like local/international rather than 1s/2s (Df3). Missingness is present in the Df1 data frame, and I'd like to make sure it's represented as NA.
This is a minimal working example, the actual data set contains more than a hundred variables (all of which are of the character class) with between one and fifteen levels. Any help would be much appreciated.
Starting point (dfs)
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Desired outcome (df)
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
Thoughts, not really code, so far: (If there's a match between a row of the df2 NameOfVariable and a df1 variable name, as well as a match between a row of df2 VariableLevel and a df1 observation, then paste the corresponding row of df2 VariableDef into df1. Wondering if you can use if statements for it.)
if (Df2["NameOfVariable"]==names(Df1))
{
if (Df2["VariableLevel"]==Df1[ ])
{
Df1[ ] <- paste0("VariableDef")
}
}
Here is on method in base R using match and Map. Map applies a function to corresponding list elements. Here, there are two list elements: Df1 and a list that is composed of the second and third columns of Df2, split by column 1. The second list is reordered to match the order of the names in Df1.
The applied function matches elements in a column Df1 to the corresponding column in the second argument and uses it as an index to return the corresponding name of the Df2 argument. Map returns a list, which is converted to a data.frame with the function of the same name.
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
this returns
buyer_Q1 seller_Q2 price_Q1_2
1 local internat 50-100K
2 internat local 100-200K
3 local NA 200+K
4 local internat 100-200K
Solution using loop and factors. Be careful. Results seem equivalent but they are not. The function fun return data frame with factors. If needed you can convert them to characters.
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
fun <- function(df, mdf) {
for (varn in names(df)) {
dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
}
return(df)
}
fun(Df1, Df2)
Df3
A solution from dplyr and tidyr. The code will work fine even with warning messages because the columns are in factor. If you don't want to see any warning messages, set stringsAsFactors = FALSE when creating the data frame like the example I provided.
library(dplyr)
library(tidyr)
Df3 <- Df1 %>%
mutate(ID = 1:n()) %>%
gather(NameOfVariable, VariableLevel, -ID) %>%
left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
select(-VariableLevel) %>%
spread(NameOfVariable, VariableDef) %>%
select(-ID)
Df3
buyer_Q1 price_Q1_2 seller_Q2
1 local 50-100K internat
2 internat 100-200K local
3 local 200+K NA
4 local 100-200K internat
DATA
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
"seller_Q2"=c(2,1,3,2),
"price_Q1_2"=c(2,5,7,5),
stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
"VariableLevel"=c(1,2,1,2,3,2,5,7),
"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
stringsAsFactors = FALSE)

adding values in certain columns of a data frame in R

I'm new to R. In a data frame, I wanted to create a new column #21 that is equal to the sum of column #1 to #20,row by row.
I knew I could do
df$Col21<-df$Col1+df$Col2+.....+df$Col20
But is there a more concise expression?
Also, can I achieve this if using column names not numbers? Thanks!
There is rowSums:
df$Col21 = rowSums(df[,1:20])
should do the trick, and with names:
df$Col21 = rowSums(df[,paste("Col", 1:20, sep="")])
With leading zeros and 3 digits, try:
df$Col21 = rowSums(df[,sprintf("Col%03d", 1:20, sep="")])
I find the dplyr functions for column selection very intuitive, like starts_with(), ends_with(), contains(), matches() and num_range():
df <- as.data.frame(replicate(20, runif(10)))
names(df) <- paste0("Col", 1:20)
library(dplyr)
# e.g.
summarise_each(df, funs(sum), starts_with("Col"))
# or
rowSums(select(df, contains("8")))

Resources