Coming from Python + Pandas, I tried to convert a column in an R data.frame.
In Python/Pandas I'd do it like so: df[['weight']] = df[['weight']] / 1000
In R I came up with this:
convertWeight <- function(w) {
  return(w / 1000)
}
df$weight <- lapply(df$weight, convertWeight)
I know of the library dplyr which has the function mutate. That would allow me to transform columns as well.
Is there a different way to mutate a column without using the dplyr library? Something that comes close to Pandas way of doing it?
EDIT
Inspecting the values in df$weight I see this:
> df[1,]
date weight
1 1.552954e+12 84500.01
> typeof(df[1,1])
[1] "list"
> df$weight[1]
[[1]]
[1] 84500.01
> typeof(df$weight[1])
[1] "list"
This is neither a number nor a character. Why a list?
By the way, I have the data from a JSON import, like so:
library(rjson)
data <- fromJSON(file = "~/Desktop/weight.json")
# convert json data to data.frame
df <- as.data.frame(do.call("rbind", data))
# select only two columns
df <- df[c("date", "weight")]
# now I converted the `date` value from POSIX to Date
# and the weight value from milligrams to kg
# ...
Obviously I have a lot to learn about R.
df$weight = as.numeric(df$weight)
df$weight = df$weight / 1000
# If you would like to eliminate the display of scientific notation:
options(scipen = 999)
# If the list structure is still causing trouble, flatten the column first:
df$weight = unlist(df$weight)
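As for the "Why list?" edit: do.call("rbind", ...) on a list of per-record lists builds a matrix whose cells are themselves lists, and as.data.frame() keeps those as list columns. A minimal sketch of that effect (with two made-up records standing in for the JSON data) and of the vectorised fix:
# two fake records shaped like the JSON import (assumption: each record is a named list)
records <- list(list(date = 1552954e6, weight = 84500.01),
                list(date = 1553040e6, weight = 84123.45))
df <- as.data.frame(do.call("rbind", records))
typeof(df$weight)                    # "list" -- every cell is still a length-1 list
df$weight <- as.numeric(df$weight)   # flatten to a plain numeric vector
df$weight <- df$weight / 1000        # vectorised, no lapply needed
typeof(df$weight)                    # "double"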
Related
I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
              "ABC_E1", "ABC_E2", "ABC_E3",
              "ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
These names in column 1 are a bunch of named results of the cor.test function. The second column should consist of the correlation coefficients I get by writing ABC_D1$estimate, ABC_D2$estimate, and so on.
My problem is that I don't want to add $estimate manually to every single name in the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesn't work; it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character"
How can I get the numeric result for ABC_D1$estimate in my dataframe? How can I convert these characters into Named num? The 3rd column should consist of the results of $p.value.
As pointed out by @DSGym there are several problems, including that it is not very convenient to have a list of character names; it would be better to have a list of objects instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
  dat_l <- get(dat)
  dat_l[["estimate"]]
})
cbind(name_list, estimates)
This is not really advisable but given those premises...
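If the goal is the full three-column data frame from the question, the same get() idea extends to the p-values. A sketch, assuming the ABC_* objects really are cor.test() results available in the workspace:
df1$C2 <- sapply(name_list, function(nm) get(nm)$estimate)  # correlation coefficients
df1$C3 <- sapply(name_list, function(nm) get(nm)$p.value)   # p-values
str(df1)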
OK, I think now I know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You connect the two strings and use the functions parse and eval to get your results.
This is how to do it for your whole data.frame:
library(purrr)

name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
              "ABC_E1", "ABC_E2", "ABC_E3",
              "ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))
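The same thing works in base R without purrr via sapply(), under the same assumption that the ABC_* objects exist in the workspace:
# names of the result are the expression strings; wrap in unname() if that bothers you
df1$C2 <- sapply(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))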
I have created a list of intervals with the code below.
library(lubridate)
date1 <- ymd_hms("2000-01-01 05:30:00",tz = "US/Eastern")
shifts <- lapply(0:14, function(x){
mapply(function(y,z){
interval((date1+days(x)+minutes(y)), (date1+days(x)+minutes(y+z)))
}, y = c(0,150,390,570,690,810,1050), z = c(600,570,600,600,600,600,600), SIMPLIFY = FALSE)
})
I have another data set, df, with 105 columns.
I am trying to use the shifts intervals as the column names, but the format changes. I want my column names to look exactly like the shifts. I am trying this:
list <- unlist(shifts, recursive = FALSE)
colnames(df) <- as.Date(list)
The reason this fails is that list still holds Interval objects. If you want to use the contents of these intervals as colnames, you need to convert them to a list of characters like this:
list <- unlist(shifts, recursive = FALSE)
dmy <- list()
for (i in seq_along(list)) {
  foo <- c(list[[i]])
  foo <- as.character(foo)
  dmy <- append(dmy, foo)
}
colnames(df) <- dmy # list of characters
Output:
> class(list[[1]])
[1] "Interval"
attr(,"package")
[1] "lubridate"
> class(dmy[[1]])
[1] "character"
Now you should be able to rename columns of df :)
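A more compact alternative to the loop, assuming format() on a lubridate Interval produces the same label that as.character() builds in the loop above:
ivals <- unlist(shifts, recursive = FALSE)           # 105 Interval objects
colnames(df) <- vapply(ivals, format, character(1))  # one character label per interval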
I have a CSV data file called test_20171122.
Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file.
I am looking into the optimal way to clean data from an accounting format "$##,###" to a number "####" in R using gsub().
My trouble is in iterating gsub() across all columns of the dataset. My first instinct was to run gsub() on the whole data frame (below), but it seems to alter the data in a counterproductive way.
gsub("\\$", "", test_20171122)
The following code is a for loop that seems to get the job done.
for (i in 1:length(test_20171122)) {
  clean1 <- gsub("\\$", "", test_20171122[[i]])  # [[i]], not [[1]], so each column is cleaned
  clean2 <- gsub("\\,", "", clean1)
  test_20171122[, i] <- clean2
  # no manual i = i + 1 needed; the for loop advances i itself
}
I am trying to figure out the optimal way of cleaning a dataframe using gsub(). I feel like sapply() would work but it seems to break the structure of the dataframe when I run the following code:
test_20171122 <- sapply(test_20171122,function(x) gsub("\\$","",x))
test_20171122 <- sapply(test_20171122,function(x) gsub("\\,","",x))
You can use the following pattern in gsub: "[$,]"
Example:
df <- data.frame(
V1 = c("$1,234.56", " $ 23,456.70"),
V2 = c("$89,101,124", "15,234")
)
df
# V1 V2
# 1 $1,234.56 $89,101,124
# 2 $ 23,456.70 15,234
df[] <- lapply(df, function(x) as.numeric(gsub("[$,]", "", x)))
df
# V1 V2
# 1 1234.56 89101124
# 2 23456.70 15234
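On why the sapply() attempt from the question "breaks the structure": sapply() simplifies its result to a character matrix, so the data.frame class is lost. A small sketch, re-creating the example data from above:
df <- data.frame(
  V1 = c("$1,234.56", " $ 23,456.70"),
  V2 = c("$89,101,124", "15,234")
)
m <- sapply(df, function(x) gsub("[$,]", "", x))
class(m)                                            # "matrix" ("array" too in recent R) -- no longer a data.frame
df2 <- as.data.frame(m, stringsAsFactors = FALSE)   # convert back if you go this route
class(df2)                                          # "data.frame"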
A solution using the purrr function map_df:
library(purrr)
clean_df <- map_df(test_20171122, ~ gsub("[$,]", "", .x))
I got a text file with data I want to read, but one of the columns is a messy "code" which contains the same character used as the separator. Take the following set as an example:
number:string
1:abc?][
2:def:{+
The second data line splits into 3 fields, even though there are only 2 column names.
Is there any strategy to read this dataset?
Read the file a line at a time, split each line into two parts on the ":", and bind into a data frame. The column names get lost, but you can put them back on again easily enough. You need the stringr and readr packages:
> do.call(rbind.data.frame,stringr::str_split(readr::read_lines("seps.csv",skip=1),":",2))
c..1....2.. c..abc.......def.....
1 1 abc?][
2 2 def:{+
Here with stringr and readr attached for readability, with the names fixed:
> library(stringr)
> library(readr)
> d = do.call(rbind.data.frame,str_split(read_lines("seps.csv",skip=1),":",2))
> names(d) = str_split(read_lines("seps.csv",n_max=1),":",2)[[1]]
> d
number string
1 1 abc?][
2 2 def:{+
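A base-R variant of the same idea, assuming none of the fields contain a literal tab: replace only the first ":" on each line with a tab, then let read.table handle the rest.
lines <- readLines("seps.csv")
d <- read.table(text = sub(":", "\t", lines),   # sub() replaces only the first match per line
                sep = "\t", header = TRUE, stringsAsFactors = FALSE)
d
#   number string
# 1      1 abc?][
# 2      2 def:{+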
Good old regular expressions should help you with this.
Read the txt file:
df <- read.table("pathToFile/fileName.txt", header = TRUE)
The data.frame will have a single column, so we need to split it based on a pattern.
Create the columns
df$number <- sub("([0-9]+):.*", "\\1", df[, 1])   # keep the digits before the first ":"
df$string <- sub("[0-9]+:(.*)", "\\1", df[, 1])   # keep everything after the first ":"
df <- df[, c("number", "string")]
View(df)
I need to efficiently parse one of my data frame columns (a URL string) with a function (strsplit), e.g.:
url <- c("www.google.com/nir1/nir2/nir3/index.asp")
unlist(strsplit(url,"/"))
My data frame spark.data.url.clean looks like this:
classes url
[107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3
This df has 100k rows and I don't want to loop/iterate over it, parsing each url separately and writing the results to a new data frame.
What I DO need/want is to create a new six-column data frame:
df.result <- data.frame(fullurl = as.character(), baseurl = as.character(),
                        firstlevel = as.character(), secondlevel = as.character(),
                        thirdlevel = as.character(), classification = as.character())
I want to call one of the "apply" family functions over spark.data.url.clean$url
and write the results to the new data frame df.result, such that the first column (fullurl) is populated with the relevant spark.data.url.clean$url, and the 2nd to 5th columns are populated with the results of applying
unlist(strsplit(url, "/"))
taking only the first, 2nd, 3rd and 4th elements of the resulting vector and putting them into the 2nd, 3rd, 4th and 5th columns of df.result, and finally putting spark.data.url.clean$classes into the new data frame column df.result$classification.
Sorry for the complication, and let me know if anything needs to be further cleared up.
There is no need for apply, as far as I see it.
Try this:
spark.data.url.clean <- data.frame(classes = c(107,662,685,508,111,654,509),
url = c("drudgereport.com/level1/level2/level3", "drudgeddddreport.com/levelfe1/lefvel2/leveel3",
"drudgeaasreport2.com/lefvel13/lffvel244/fel223", "otherurl.com/level1/second/level3",
"whateversite.com/level13/level244/level223", "esportsnow.com/first/level2/level3",
"reeport2.com/level13/level244/third"), stringsAsFactors = FALSE)
df.result <- spark.data.url.clean
names(df.result) <- c("classification", "fullurl")
df.result[c("baseurl", "firstlevel", "secondlevel", "thirdlevel")] <- do.call(rbind, strsplit(df.result$fullurl, "/"))
You could consider using the splitstackshape package for this; we can use its cSplit function. Setting drop to FALSE ensures that the original column is preserved. Note that it returns a data.table, not a data.frame.
library(splitstackshape)
output <- cSplit(dat, 2, sep = "/", drop = FALSE)
data used:
dat <- data.frame(classes="[107,662,685,508,111,654,509]",
url="drudgereport.com/level1/level2/level3")
Here's an option with data.table which should be pretty fast. If your data looks like this:
> df
# classes url
#1 [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3
You can do the following:
library(data.table)
setDT(df) # convert to data.table
cols <- c("baseurl", "firstlevel", "secondlevel", "thirdlevel") # define new column names
df[, (cols) := tstrsplit(url, "/", fixed = TRUE)[1:4]] # assign new columns
Now, the data looks like this:
> df
# classes url baseurl firstlevel secondlevel thirdlevel
#1: [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3 drudgereport.com level1 level2 level3
The simple solution is to use:
apply(df, 2, function(col) {})