Observations appear in the RStudio environment but the data frames look empty - r

I am working with precipitation data in R and have hit a problem I cannot solve. I have precipitation data (mm/h) at one-minute resolution for 47 meteorological stations, but I need hourly data and a single file with all the stations so that I can interpolate. Right now the problem is creating the data frames for the 47 stations; each of them should normally have 47 observations and 3 variables.
In the environment pane the process appears to have worked, but when I open a data frame there is only one value, as you can see in the image.
Look at the data frame 071212.
This is the code I used to generate the data frames.
setwd("D:/Escritorio/ohiiunam/estaciones")
temp <- list.files(pattern="*.csv")
lista = lapply(temp, read.csv)
lista<-data.table::rbindlist(lista)
n_last <- 6
lista$id2<- substr(lista$id, nchar(lista$id) - n_last + 1, nchar(lista$id))
unicos <- unique(lista$id2)
fun <- function(i) {
i<-lista %>% select(id, intensidad.mm.h, id2) %>% filter(lista$id2==i)
}
for (i in unicos) {
i <- as.data.frame(fun(i))
}

Well, you have a slight problem that you've named your data.frames with names that aren't syntactically valid (variable names cannot start with numbers). To work with them, you'll need to surround the name in backticks.
View(`071208`)
It's not clear how you loaded those data.frames, but it might be better to change that import routine to prefix the names with some character value.
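As a minimal sketch of both options (the toy station data and the `st_` prefix are made up for illustration), a name that starts with a digit needs backticks or get(), while a prefixed name can be used directly:

```r
# Toy data frame standing in for one station (hypothetical values)
station <- data.frame(id = "2019071208", intensidad.mm.h = c(0.2, 1.4))

# assign() accepts a name that starts with a digit, but the name is not syntactic
assign("071208", station)
nrow(`071208`)        # works: backticks quote the name
nrow(get("071208"))   # works: look the object up by its name as a string

# Prefixing the name makes it syntactic, so no quoting is needed
assign(paste0("st_", "071208"), station)
nrow(st_071208)
```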

I have solved the problem with modifications in the assign function.
for (i in unicos) {
  # paste0() has no sep argument, so the dot goes into the prefix
  assign(paste0("jul.", i), data.frame(lista %>% select(id, id2, intensidad.mm.h) %>% filter(id2 == i)))
}
Thank you for the help.
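An alternative sketch that avoids creating 47 separate objects is to keep the stations in a named list built with split(). The toy `lista` below is invented for illustration; only the column names are taken from the question, and `estaciones` is a hypothetical name.

```r
# Toy stand-in for the combined station table (values are made up)
lista <- data.frame(
  id = c("2019071208", "2019071208", "2019071212"),
  intensidad.mm.h = c(0.2, 1.4, 0.0),
  stringsAsFactors = FALSE
)

# Last six characters of the id, as in the question
n_last <- 6
lista$id2 <- substr(lista$id, nchar(lista$id) - n_last + 1, nchar(lista$id))

# One data frame per station, stored in a list named by station code
estaciones <- split(lista[, c("id", "intensidad.mm.h", "id2")], lista$id2)

names(estaciones)              # the station codes
nrow(estaciones[["071208"]])   # rows for one station
```

Because list names are plain strings, codes that start with a digit cause no trouble, and lapply() can then run the same step on every station.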

Related

How to apply a loop to filter a dataset and then bind the outcome with other dataset?

I am new at programming and bad at using loops in R. I'm facing a situation in which if I do not use a loop, I think I'm going to spend lots of time achieving my goal.
I have a big csv file in my working directory, that contains data related to 64 animal species (this csv is represented by the object "df2", created below). In the same directory, I have 64 smaller csv files, each one related to an animal species that is also present in the bigger csv. These 64 smaller files have the same number of columns (6), but different numbers of rows. I'll create some toy data to illustrate it and divide my question into four parts to make it as clear as I can.
library(tidyverse)
#Creating a df just to split it
df <- data.frame(animal=c(c("dog", "DOG"),
rep("cat", 4),
"frog",
rep("bird", 7),
rep("snake", 5),
rep("lizard", 3),
c("cow","cOW","COW","coww"),
rep("worm",6),
"lion",
rep("shark",9)),
var1=rnorm(42),
var2=rnorm(42),
var3=rnorm(42),
var4=rnorm(42),
var5=rnorm(42))
#The following steps are just to make a reproducible example. I'm filtering the toy data just to save it as csv files and import them.
da1 <- df %>%
filter(animal=="dog" | animal=="DOG")
da2 <- df %>%
filter(animal=="cat")
da3 <- df %>%
filter(animal=="frog")
da4 <- df %>%
filter(animal=="bird")
da5 <- df %>%
filter(animal=="snake")
da6 <- df %>%
filter(animal=="lizard")
da7 <- df %>%
filter(animal=="cow" | animal=="cOW"|
animal=="COW" | animal=="coww")
da8 <- df %>%
filter(animal=="worm")
da9 <- df %>%
filter(animal=="lion")
da10 <- df %>%
filter(animal=="shark")
readr::write_csv(da1, "da1.csv")
readr::write_csv(da2, "da2.csv")
readr::write_csv(da3, "da3.csv")
readr::write_csv(da4, "da4.csv")
readr::write_csv(da5, "da5.csv")
readr::write_csv(da6, "da6.csv")
readr::write_csv(da7, "da7.csv")
readr::write_csv(da8, "da8.csv")
readr::write_csv(da9, "da9.csv")
readr::write_csv(da10, "da10.csv")
#Those 10 csv files correspond to the 64 ones that I have in my directory
Part 1:
As you can see, I had to filter one species at a time. So, my first question is: how can I pass those filters and the "readr::write_csv" function inside of a loop so that I can do it all at once? (Instead of doing it individually). Note that some species such as "dog" and "cow" have several spellings. That's a problem I have to deal with since I downloaded my actual data from databases online and the files have such issues.
To load the small csv files I do the following:
library(rio)
data <- import_list(dir("path_to_directory", pattern = ".csv"), rbind = FALSE)
Part 2
Once I've imported them as above, they are stored in the object "data". This changes their order so that they are listed as da1, da10, da2, da3, da4, and so on, instead of sequentially as da1, da2, da3, da4, da5... What I want to do now is to reorder them from 1 to 10. After that, I would like to select the same three columns (animal, var1, var2) from each of the datasets. I was able to do that for each of the datasets individually:
ba1 <- data$da1 %>%
dplyr::select(animal, var1, var2)
ba2 <- data$da2 %>%
dplyr::select(animal, var1, var2)
.
.
.
Again, I would like to do it all at once using a loop or something like that.
Part 3
Once I've selected the columns and saved them in objects, I want to bind the resulting objects with subsets of the big csv file I cited above. Here are some toy data for it:
df2 <- data.frame(animal=c(rep("dog2", 2),
rep("cat2", 4),
"frog2",
rep("bird2", 7),
rep("snake2", 5),
rep("lizard2", 3),
rep("cow2",4),
rep("worm2",6),
"lion2",
rep("shark2",9)),
var1=rnorm(42),
var2=rnorm(42))
#This time all animals have the same spelling since I tabulated those data manually.
The subsets that I refer to are made by filtering this data frame by animal species. I was able to do that using dplyr::filter:
ca1 <- df2 %>%
filter(animal=="dog2")
ca2 <- df2 %>%
filter(animal=="cat2")
.
.
.
And so on until I've done it with all the animals. As my actual data contains several (64) animal species, filtering the df2 that way takes a lot of time, so I would like to do so using a faster way. I think a for loop can be useful, but I suck at this kind of programming and did not manage to write the code for it. Could anyone provide the code for it, please?
Part 4
Finally, once the species in the df2 are filtered, I want to use a loop to bind (rbind) the objects that refer to the same species, such as ba1 and ca1 in this example, and then save the objects as new csv files:
readr::write_csv(rbind(ba1, ca1), "ga1.csv")
readr::write_csv(rbind(ba2, ca2), "ga2.csv")
.
.
.
By doing that I should have 64 new csv files, containing a combination of the data of the 64 old ones and part of my big csv file. Could anyone help me? I would really appreciate it if you could answer my question stepwise.
I appreciate your time and your attention in reading all of this. Thanks so much in advance!
This is a bit confusing. You refer to "da#", "a#", "ba#", and "c1" but only "da#" and "ba#" are actually defined in your code. Here is a start on what you seem to be trying to do. First creating the files you are using as an example:
animals <- split(df2, df2$animal)
fnames <- paste0("da", formatC(1:10, digits=2, width=2, flag="0"), ".csv")
invisible(lapply(1:10, function(x) write_csv(animals[[x]], fnames[x])))
dir(pattern=".csv")
# [1] "da01.csv" "da02.csv" "da03.csv" "da04.csv" "da05.csv" "da06.csv" "da07.csv" "da08.csv" "da09.csv" "da10.csv"
First we split df2 into the different kinds of animals and then use lapply to create 10 .csv files but label them so they will appear in the correct numeric order.
Since splitting a data frame is easy, why not combine all of the files into a single data frame (alldata <- do.call(rbind, animals)), extract the columns you want, and then use split to separate them by animal type? You can then keep the list and extract the parts you want from it, which is usually the simpler approach if you plan to do similar analyses on all of them, or extract them as separate objects.
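A rough sketch of that list-based workflow, covering the four parts of the question, might look like the following. It uses base R only. The `species_key()` helper, the spelling fix inside it, the temporary folder, the output file names, and the toy data (which use the same species names in both tables) are assumptions made for illustration and are not from the question.

```r
set.seed(1)
out_dir <- tempfile("animals_")   # temporary folder so nothing is overwritten
dir.create(out_dir)

# Toy versions of the two data sets (fewer species than the question)
df <- data.frame(animal = c("dog", "DOG", "cat", "cat", "cow", "coww"),
                 var1 = rnorm(6), var2 = rnorm(6), var3 = rnorm(6))
df2 <- data.frame(animal = c("dog", "cat", "cat", "cow"),
                  var1 = rnorm(4), var2 = rnorm(4))

# Hypothetical helper: map messy spellings to one key per species
species_key <- function(x) {
  x <- tolower(x)
  x[x == "coww"] <- "cow"   # known misspellings have to be listed by hand
  x
}

# Part 1: one csv per species, written in a single lapply() call
small <- split(df, species_key(df$animal))
invisible(lapply(names(small), function(sp) {
  write.csv(small[[sp]], file.path(out_dir, paste0(sp, ".csv")), row.names = FALSE)
}))

# Part 2: read the files back into a named list and keep three columns
files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
data <- lapply(files, read.csv, stringsAsFactors = FALSE)
names(data) <- sub("\\.csv$", "", basename(files))
ba <- lapply(data, function(d) d[, c("animal", "var1", "var2")])

# Part 3: split the big table by species
ca <- split(df2, species_key(df2$animal))

# Part 4: bind matching species by name and write one file per species
ga <- lapply(names(ba), function(sp) rbind(ba[[sp]], ca[[sp]]))
names(ga) <- names(ba)
invisible(lapply(names(ga), function(sp) {
  write.csv(ga[[sp]], file.path(out_dir, paste0("ga_", sp, ".csv")), row.names = FALSE)
}))
```

Matching the pieces by species name rather than by position sidesteps the da1, da10, da2 ordering problem, since the order in which files are listed no longer matters.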

Changing dataframes in bulk? How to apply a list of operations to multiple dataframes?

So, I have 6 data frames, all look like this (with different values):
Now I want to create a new column in all the data frames for the country. Then I want to convert it into a long df. This is how I am going about it.
dlist<- list(child_mortality,fertility,income_capita,life_expectancy,population)
convertlong <- function(trial){
trial$country <- rownames(trial)
trial <- melt(trial)
colnames(trial)<- c("country","year",trial)
}
for(i in dlist){
convertlong(i)
}
After running this I get:
Using country as id variables
Error in names(x) <- value :
'names' attribute [5] must be the same length as the vector [3]
That's all; it doesn't perform the operations on the data frames. I am pretty sure I'm making a stupid mistake, but I looked on online forums and cannot figure it out.
Maybe you can replace
trial$country <- rownames(trial)
by
trial <- cbind(trial, rownames(trial))
Here's a tidyverse attempt -
library(tidyverse)
#Put the dataframes in a named list.
dlist<- dplyr::lst(child_mortality, fertility,
income_capita, life_expectancy,population)
#lst is not a typo!!
#Write a function which creates a new column from the rownames
#and gets the data in long format.
#The column name for the 3rd column is passed separately (`col`).
convertlong <- function(trial, col){
trial %>%
rownames_to_column('country') %>%
pivot_longer(cols = -country, names_to = 'year', values_to = col)
}
#Use `imap` to pass the dataframe as well as its name to the function.
dlist <- imap(dlist, convertlong)
#If you want the changes to be reflected for dataframes in global environment.
list2env(dlist, .GlobalEnv)
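For reference, the loop in the question seems to fail for two reasons: `colnames(trial) <- c("country", "year", trial)` passes the data frame itself where a column name is expected, and a `for` loop over a list works on copies whose results are never stored. A base R sketch of the same idea follows; the two toy tables and their values are invented, and `value_name` is a hypothetical argument name.

```r
# Toy wide tables: one row per country, one column per year (made-up values)
child_mortality <- data.frame(`2000` = c(10, 20), `2001` = c(9, 18),
                              row.names = c("A", "B"), check.names = FALSE)
fertility <- data.frame(`2000` = c(2.1, 3.4), `2001` = c(2.0, 3.3),
                        row.names = c("A", "B"), check.names = FALSE)

# Reshape one wide table to long format without extra packages
convertlong <- function(df, value_name) {
  long <- data.frame(
    country = rep(rownames(df), times = ncol(df)),
    year    = rep(colnames(df), each = nrow(df)),
    value   = unlist(df, use.names = FALSE),   # values column by column
    stringsAsFactors = FALSE
  )
  names(long)[3] <- value_name   # a string, not the data frame
  long                           # return the result explicitly
}

# Store the results: Map() passes each table together with its name
dlist <- list(child_mortality = child_mortality, fertility = fertility)
dlist <- Map(convertlong, dlist, names(dlist))
dlist$fertility
```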

Apply series of changes to multiple similar datasets in R

I have 20 csv files of data that are formatted exactly the same, about 40 columns of different numbers, but with different values in each column. I want to apply a series of changes to each data frame in order to extract specific information from every one of them.
Specifically I want to extract four columns from each data frame, find the maximum value of each column in each data frame and then add all of these maximum values together, so I get one final number for each data frame. Something like this:
str(data)
Extract<-data[c(1,2,3,4)]
Max<-apply(Extract,2,max)
Add<-Max[1] + Max[2] + Max[3] + Max[4]
I have the code written above to do all these steps for every data frame individually, but is it possible to apply this code to all of them at once?
If you put all 20 filenames into a vector called files
Maxes <- numeric(length(files))
i <- 1
for (file in files) {
data <- read.csv(file)
str(data)
Extract<-data[c(1,2,3,4)]
Max<-apply(Extract,2,max)
Add<-Max[1] + Max[2] + Max[3] + Max[4]
Maxes[i] <- Add
i <- i+1
}
Though that str(data) will just cause a lot of stuff to print to the terminal 20 times. I'm not sure the value of that, but it was in your question so I included.
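The same loop can also be sketched with vapply(), which avoids the manual counter. The temporary folder, the toy csv files, and the `sum_of_maxes()` helper below are made up for illustration.

```r
# Write three small toy csv files to a temporary folder (made-up data)
tmp <- tempfile("maxes_")
dir.create(tmp)
for (k in 1:3) {
  toy <- data.frame(a = k * 1:3, b = k * 4:6, c = k * 7:9, d = k * 10:12, e = 100)
  write.csv(toy, file.path(tmp, paste0("f", k, ".csv")), row.names = FALSE)
}

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Sum of the column maxima of the first four columns of one file
sum_of_maxes <- function(file) {
  data <- read.csv(file)
  sum(vapply(data[1:4], max, numeric(1), na.rm = TRUE))
}

# One number per file, labelled with the file name
Maxes <- vapply(files, sum_of_maxes, numeric(1))
names(Maxes) <- basename(files)
Maxes
```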
Put all your files into a common folder such as /path/temp/
csvs <- list.files("/path/temp", pattern = "\\.csv$", full.names = TRUE) # vector of csv file paths
Use custom function for colMax
colMax <- function(data) sapply(data, max, na.rm = TRUE)
Using foreach, dplyr, and readr
library(foreach)
library(dplyr)
library(readr) # provides read_csv()
foreach(i = seq_along(csvs), .combine = "c") %do% {
  read_csv(csvs[i]) %>%
    select(1:4) %>%
    colMax() %>%
    sum()
} # returns a vector

Use R to add a column to multiple dataframes using lapply

I would like to add a column containing the year (found in the file name) to each data frame. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "*.shp")
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names. I just want to extract the entry for a column from the file name. This thread has a question, but no answers. This person tried to use [[ instead of $, with no luck. This seems to imply the fault may be in cbind vs. rbind, but I'm not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
# Add a column with the year
dat$Year = substr(x,8,11)
return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this seems to work:
lapply(points,
function(x) {
dat=get(x)
dat$year = sub('.*_(.*)$','\\1',x)
return(dat)
})
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
names(x) <- c("a", "b")
return(x)})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2<-do.call(rbind,lapply(ls(pattern='points_*'),
function(x) {
dat=get(x)
dat$year = substr(x,8,11)
dat
}))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again and use a cbind, so I can subtract two columns from each other.
I use the following code:
for (i in names.of.objects){
temp <- get(i)
# do transformations on temp
assign(i, temp)
}
This works, but it is definitely not efficient, since the whole data set is assigned twice in a call-by-value manner.
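One likely reason the earlier attempts print NULL is that lapply() returns modified copies and leaves the original objects untouched, so the new column has to be read from the returned list. A sketch with a named list, using plain data frames as stand-ins for the spatial objects (the `id` and `jobs` columns and their values are invented), could look like this:

```r
# Toy stand-ins for the objects read from the points_2013 and points_2014 files
points_2013 <- data.frame(id = 1:2, jobs = c(5, 7))
points_2014 <- data.frame(id = 1:2, jobs = c(6, 9))

# Collect the objects into a named list by looking them up by name
obj_names <- c("points_2013", "points_2014")
pts <- mget(obj_names, envir = environment())

# Add the year taken from each name; Map() returns a new list
pts <- Map(function(d, nm) {
  d$year <- as.integer(sub(".*_", "", nm))
  d
}, pts, names(pts))

# The originals are unchanged; the modified copies live in the list
is.null(points_2014$year)   # TRUE
pts$points_2014$year

# Side-by-side layout for subtracting one year's column from the other
wide <- merge(pts$points_2013, pts$points_2014, by = "id",
              suffixes = c("_2013", "_2014"))
wide$change <- wide$jobs_2014 - wide$jobs_2013
```

Keeping the list also covers the later step of subtracting columns: merge() by a shared key is used here in place of cbind so that rows are matched explicitly.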

Removing rows in data frame using get function

Suppose I have following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to get following data frame at the end:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So, I decided to assign a name to mydataframe, and variable of interest.
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign to a new data frame, even though it seems only solution right now. I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since colnames are not first class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[[varname]] ), ] )
You also do not want to use -which(.). It works here because some rows match the condition. It will bite you, however, whenever no rows match: instead of dropping nothing and returning every row, as it should, it returns no rows at all, since -integer(0) is still integer(0) and vec[integer(0)] is empty. Only use which for "positive" choices.
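The behaviour of -which() when nothing matches can be checked directly. In this small sketch the `clean` data frame is a made-up example with no NA values, so the correct result is to keep all three rows:

```r
clean <- data.frame(ID = c(1, 2, 3), score = 11:13)   # no NA in ID

idx <- which(is.na(clean$ID))        # integer(0): nothing matched
nrow(clean[-idx, ])                  # 0 rows: everything was dropped
nrow(clean[!is.na(clean$ID), ])      # 3 rows: nothing was dropped
```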
As @Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe,mydataframe,mydataframe)
# apply your function to all the data.frames in the list
# have replaced original function in line with @DWin and @flodel's comments
# pointing out issues with using -which(...)
lapply(alldfs, function(x) x[!is.na(x$ID),])
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, eg if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
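If several data frames are loaded at once, the same helper can be combined with the list approach. In this sketch the second data frame and the `vars` lookup vector are hypothetical; Map() pairs each data frame with the column to check.

```r
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]

# Named list of data frames (the second one is made up)
dfs <- list(
  mydataframe = data.frame(ID = c(1, 2, NA, 4, 5, NA), score = 11:16),
  other       = data.frame(key = c("a", NA, "c"), value = 1:3)
)

# Which column to check in each data frame, looked up by name
vars <- c(mydataframe = "ID", other = "key")

cleaned <- Map(stripNAs, dfs, vars[names(dfs)])
sapply(cleaned, nrow)   # rows remaining in each data frame
```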
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname) #get df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname) #define the column of focus
tmp.df <- tmp.df[which(!is.na(tmp.df[,col.focus])),] #cut out the subset of the df where the column of focus is not NA.
#Result
#   ID score
# 1  1    11
# 2  2    12
# 4  4    14
# 5  5    15
