I couldn't find a solution on Stack Overflow, so here's my issue:
I have a df with 342 columns.
I want to make a new df with only specific columns.
The columns to keep are listed in another df, in 3 columns titled X, Y, Z, one for each of 3 new data frames.
Here's my code right now:
# Read the data:
data <- data.table::fread("data_30_9.csv")
# Import variable names #
variable.names.full = openxlsx::read.xlsx("variables2.xlsx")
Y.variable.names = na.omit(variable.names.full[1])
X.variable.names = na.omit(variable.names.full[2])
Z.variable.names = na.omit(variable.names.full[3])
# Make new DF with only specific columns:
X.Data = data %>% select(as.character(X.variable.names)) # This works, as X has only 1 variable
Y.Data = data %>% select(as.character(Y.variable.names)) # This gives an error:
# Error: Can't subset columns that don't exist.
Help?
The data is available here:
https://github.com/amirnakar/TammyA/blob/main/data_30_9.csv
https://github.com/amirnakar/TammyA/blob/main/Variables2.xlsx
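The problem is that Y.variable.names is a data.frame, which you cannot use to subset another data.frame. Calling as.character() on it collapses the whole column into a single string such as c("a", "b"), which matches no column name; it only appeared to work for X because X holds a single name.
You can check by typing class(Y.variable.names).
So the solution to your problem is to extract the column from Y.variable.names as a vector: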
Y.Data = data %>% select(Y.variable.names[,1])
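To see the difference between the one-column data.frame and the vector inside it, here is a minimal base R sketch. The toy `data` and `variable.names.full` objects below are made-up stand-ins for the real files, not the actual data.

```r
# Toy stand-ins for the real objects (column names a, b, c are hypothetical)
data <- data.frame(a = 1:3, b = 4:6, c = 7:9)
variable.names.full <- data.frame(Y = c("a", "b"), X = c("c", NA),
                                  stringsAsFactors = FALSE)

# Single-bracket indexing keeps the data.frame wrapper
y_df <- na.omit(variable.names.full[1])
class(y_df)                    # "data.frame"
length(as.character(y_df))     # 1: the whole column collapsed into one string

# Double-bracket indexing returns the underlying character vector
y_vec <- as.character(na.omit(variable.names.full[[1]]))
class(y_vec)                   # "character"

# A character vector of names can subset columns directly
Y.Data <- data[, y_vec, drop = FALSE]
names(Y.Data)                  # "a" "b"
```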
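Alternatively, use lapply on variable.names.full to select each set of columns from data and get all the new data frames in one list. Because data was read with fread() it is a data.table, so a character vector of column names needs with = FALSE:
list_data <- lapply(variable.names.full, function(x)
  data[, as.character(na.omit(x)), with = FALSE])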
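As a rough illustration of what the lapply approach returns, the sketch below runs it on a plain data.frame (not a data.table) with made-up column names. The result is a list with one data frame per column of the names table, named after that column.

```r
# Toy stand-ins; a, b, c and the Y/X layout are assumptions for illustration
data <- data.frame(a = 1:3, b = 4:6, c = 7:9)
variable.names.full <- data.frame(Y = c("a", "b"), X = c("c", NA),
                                  stringsAsFactors = FALSE)

# One subset per column of variable.names.full, NAs dropped first
list_data <- lapply(variable.names.full, function(x)
  data[, as.character(na.omit(x)), drop = FALSE])

names(list_data)      # "Y" "X"
names(list_data$Y)    # "a" "b"
names(list_data$X)    # "c"
```

Each element can then be pulled out by name, for example `Y.Data <- list_data$Y`.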
With the code below I read data from a website.
The problem is that it reads the data as character rather than numeric, in particular columns such as "Enlem(N)" and "Boylam(E)".
How can I fix this?
library(rvest)
widths <- c(11,10,10,10,14,5,5,5,48,100)
dat <- "http://www.koeri.boun.edu.tr/scripts/lst5.asp" %>%
read_html %>%
html_nodes("pre") %>%
html_text %>%
textConnection %>%
read.fwf(widths = widths, stringsAsFactors = FALSE) %>%
setNames(nm = .[6,]) %>%
tail(-7) %>%
head(-2)
If you know which columns should be numeric, you can convert just those columns. If you do not, you can write a function that inspects the data and converts a column to numeric when a large enough share of its non-missing values parse as numbers. I have used the function below for this purpose:
NumericColumns <- function(x, AllowedPercentNumeric =.95, PreserveDate=TRUE, PreserveColumns){
# find the counts of NA values in input data frame's columns
param_originalNA <- apply(x, 2, function(z){sum(is.na(z))})
# blindly coerce data.frame to numeric
df_JustNumbers <- suppressWarnings(as.data.frame(lapply(x, as.numeric)))
# Percent Non-NA values in each column
PercentNumeric <- (apply(df_JustNumbers, 2, function(x)sum(!is.na(x))))/(nrow(x)-param_originalNA)
rm(param_originalNA)
# identify columns which have a greater than or equal percentage of numeric as specified
param_numeric <- names(PercentNumeric)[PercentNumeric >= AllowedPercentNumeric]
# Remove columns from list to convert to numeric that are specified as to preserve
if (!missing(PreserveColumns)){param_numeric <- setdiff(param_numeric, PreserveColumns)}
# Identify columns that are dates initially
IsDateColumns <- lapply(x, function(y)(is(y, "Date")|is(y, "POSIXct")))
param_dates <- names(IsDateColumns)[IsDateColumns==TRUE]
# Remove dates from list if specified to preserve dates
if (PreserveDate){param_numeric <- setdiff(param_numeric, param_dates)}
# returns column position of numeric columns in target data frame
param_numeric <- match(param_numeric, colnames(df_JustNumbers))
# removes NA's from column list
param_numeric <- param_numeric[complete.cases(param_numeric)]
# coerces columns in param_numeric to numeric and inserts numeric columns into target data.frame
if(length(param_numeric)==1){
suppressWarnings(x[, param_numeric] <- as.numeric(x[, param_numeric]))
}
if(length(param_numeric)>1){
suppressWarnings(x[, param_numeric] <- apply(x[, param_numeric],2, function(x)as.numeric(x)))
}
return(x)
}
Once the function is created, you can use it on your data like this:
# Use function to convert to numeric
dat <- NumericColumns(dat)
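For the simpler case where the numeric columns are already known, a minimal sketch might look like the following. The toy values are invented for illustration and are not taken from the website; only the column names "Enlem(N)" and "Boylam(E)" come from the question, and "Yer" is a hypothetical text column.

```r
# Toy data frame with made-up values, all read in as character
dat <- data.frame(`Enlem(N)`  = c("39.1023", " 38.5000"),
                  `Boylam(E)` = c("27.9000", "26.1000 "),
                  Yer         = c("A", "B"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Convert only the columns known to hold numbers
num_cols <- c("Enlem(N)", "Boylam(E)")
dat[num_cols] <- lapply(dat[num_cols], as.numeric)

sapply(dat, class)
```

as.numeric() tolerates the leading and trailing blanks that fixed-width parsing tends to leave behind; values that are not numbers at all become NA with a warning.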
I am trying to set some variables as character and others as numeric. What I currently have is:
colschar <- c(1:2, 68:72)
colsnum <- c(3:67)
subset <- as.data.frame(lapply(data[, colschar], as.character), (data[, colsnum], as.numeric))
which returns an error.
I am trying to set columns 1:2 and 68:72 as a character and columns 3:67 all as numeric.
I suggest:
data[colschar] <- lapply(data[colschar], as.character)
data[colsnum] <- lapply(data[colsnum], as.numeric)
It would be better if you shared an extract of your data. In any case, you may try a tidyverse approach (across() replaces the deprecated funs() and mutate_at()):
library(dplyr)
mydf_molt <- mydf %>%
  mutate(across(c(1:2, 68:72), as.character)) %>%
  mutate(across(3:67, as.numeric))
I'd like to use a data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame that contains text like local/international rather than 1s/2s (Df3). Missingness is present in the Df1 data frame, and I'd like to make sure it's represented as NA.
This is a minimal working example, the actual data set contains more than a hundred variables (all of which are of the character class) with between one and fifteen levels. Any help would be much appreciated.
Starting point (dfs)
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Desired outcome (df)
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
Thoughts, not really code, so far: (If there's a match between a row of the df2 NameOfVariable and a df1 variable name, as well as a match between a row of df2 VariableLevel and a df1 observation, then paste the corresponding row of df2 VariableDef into df1. Wondering if you can use if statements for it.)
if (Df2["NameOfVariable"]==names(Df1))
{
if (Df2["VariableLevel"]==Df1[ ])
{
Df1[ ] <- paste0("VariableDef")
}
}
Here is one method in base R using match and Map. Map applies a function to corresponding list elements. Here, there are two lists: Df1 and a list composed of the second and third columns of Df2, split by column 1. The second list is reordered to match the order of the names in Df1.
The applied function matches the elements of a Df1 column against the VariableLevel values in the corresponding piece of Df2 and uses the result as an index to return the corresponding VariableDef values. Map returns a list, which is converted to a data.frame with the function of the same name.
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
This returns
  buyer_Q1 seller_Q2 price_Q1_2
1    local  internat    50-100K
2 internat     local   100-200K
3    local        NA      200+K
4    local  internat   100-200K
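To make the match step concrete, here is a small sketch that does the lookup for a single column, seller_Q2, using the Df1 and Df2 from the question. The last step, turning the text "NA" into a real missing value, is an addition not in the answer above, since the question asks for missingness to be represented as NA.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1), seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel  = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef    = c("local", "internat", "local", "internat", "NA",
                     "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

# Rows of the lookup table that describe seller_Q2
lookup <- Df2[Df2$NameOfVariable == "seller_Q2", ]

# Position of each observed code within the lookup's levels
idx <- match(Df1$seller_Q2, lookup$VariableLevel)
idx                       # 2 1 3 2

# Use those positions to pull the text definitions
res <- lookup$VariableDef[idx]
res                       # "internat" "local" "NA" "internat"

# Optional: turn the literal string "NA" into a real missing value
res[res == "NA"] <- NA
is.na(res)                # FALSE FALSE TRUE FALSE
```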
A solution using a loop and factors. Be careful: the results look equivalent to Df3 but they are not, because the function fun returns a data frame of factors. If needed, you can convert them to character.
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
fun <- function(df, mdf) {
for (varn in names(df)) {
dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
}
return(df)
}
fun(Df1, Df2)
Df3
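If character columns are wanted rather than factors, one possible way to convert every column in place is sketched below. The small `recoded` data frame is a hypothetical stand-in for the output of a factor-based recode, built by hand so the example runs on its own.

```r
# Hypothetical stand-in for a factor-based result
recoded <- data.frame(
  buyer_Q1 = factor(c(1, 2, 1, 1), levels = c(1, 2),
                    labels = c("local", "internat")))

class(recoded$buyer_Q1)      # "factor"

# Assigning into recoded[] keeps the data.frame structure
# while replacing each column with its character version
recoded[] <- lapply(recoded, as.character)

class(recoded$buyer_Q1)      # "character"
recoded$buyer_Q1             # "local" "internat" "local" "local"
```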
A solution using dplyr and tidyr. The code works even if it emits warning messages, which appear when the columns are factors. If you don't want to see any warnings, set stringsAsFactors = FALSE when creating the data frames, as in the example data I provide below.
library(dplyr)
library(tidyr)
Df3 <- Df1 %>%
mutate(ID = 1:n()) %>%
gather(NameOfVariable, VariableLevel, -ID) %>%
left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
select(-VariableLevel) %>%
spread(NameOfVariable, VariableDef) %>%
select(-ID)
Df3
  buyer_Q1 price_Q1_2 seller_Q2
1    local    50-100K  internat
2 internat   100-200K     local
3    local      200+K        NA
4    local   100-200K  internat
DATA
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
"seller_Q2"=c(2,1,3,2),
"price_Q1_2"=c(2,5,7,5),
stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
"VariableLevel"=c(1,2,1,2,3,2,5,7),
"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
stringsAsFactors = FALSE)
I'm new to R. In a data frame, I want to create a new column #21 that is equal to the sum of columns #1 to #20, row by row.
I know I could do
df$Col21<-df$Col1+df$Col2+.....+df$Col20
But is there a more concise expression?
Also, can I achieve this if using column names not numbers? Thanks!
There is rowSums:
df$Col21 = rowSums(df[,1:20])
should do the trick, and with names:
df$Col21 = rowSums(df[,paste("Col", 1:20, sep="")])
If the names use leading zeros and 3 digits (Col001, Col002, ...), try:
df$Col21 = rowSums(df[,sprintf("Col%03d", 1:20)])
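A short sketch on a three-column toy data frame shows the name-based version and how missing values behave. The column names Total and TotalNoNA are made up for the example.

```r
df <- data.frame(Col1 = c(1, 2), Col2 = c(3, NA), Col3 = c(5, 6))

# Build the column names instead of typing them out
cols <- paste0("Col", 1:3)

# Row-wise sum; any NA in a row makes that row's sum NA
df$Total <- rowSums(df[, cols])
df$Total                      # 9 NA

# na.rm = TRUE skips missing values instead
df$TotalNoNA <- rowSums(df[, cols], na.rm = TRUE)
df$TotalNoNA                  # 9 8

# Zero-padded names, in case the columns are called Col001, Col002, ...
sprintf("Col%03d", 1:3)       # "Col001" "Col002" "Col003"
```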
I find the dplyr functions for column selection very intuitive, like starts_with(), ends_with(), contains(), matches() and num_range():
df <- as.data.frame(replicate(20, runif(10)))
names(df) <- paste0("Col", 1:20)
library(dplyr)
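# e.g. column totals for every column whose name starts with "Col"
# (across() replaces the deprecated summarise_each() and funs())
summarise(df, across(starts_with("Col"), sum))
# or row sums over the columns whose names contain "8" (here Col8 and Col18)
rowSums(select(df, contains("8")))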