somosnlp/bertin_base_climate_detection_spa

@@ -54,7 +54,7 @@ Future steps:
 - **Fine-tuned from model:** [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)
 - **Dataset used:** [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection)
-### Fuentes de modelos
 - **Repository:** [somosnlp/bertin_base_climate_detection_spa](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/tree/main) <!-- Enlace al `main` del repo donde tengáis los scripts, i.e.: o del mismo repo del modelo en HuggingFace o a GitHub. -->
 - **Demo:** [identificacion de textos sobre cambio climatico y sustentabilidad](https://huggingface.co/spaces/somosnlp/Identificacion_de_textos_sobre_sustentabilidad_cambio_climatico)
@@ -75,14 +75,14 @@ Future steps:
 - The use for text classification of unverifiable or unreliable sources and their dissemination, e.g., fake news or disinformation.
 ## Bias, Risks, and Limitations
-En este punto no se han realizados estudios concretos sobre los sesgos y limitaciones, sin embargo hacemos los siguientes apuntes en base a experiencia previa y pruebas del modelo:
-- Hereda los sesgos y limitaciones del modelo base con el que fue entrenado, para mas detalles véase: [BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403). Sin embargo, no son tan evidentes de encontrar por el tipo de tarea en el que se esta implementando el modelo como lo es la clasificacion de texto.
-- Sesgos directos como por ejemplo el mayoritario uso de lenguaje de alto nivel en el dataset debido a que se utilizan textos extraidos de noticias, documentación legal de empresas que pueden complicar la identificación de textos con lenguajes de bajo nivel (ejemplo: coloquial). Para mitigar estos sesgos, se incluyeron en el dataset opiniones diversas sobre temas de cambio climatico extraidas de fuentes como redes sociales, adicional se hizo un rebalanceo de las etiquetas.
-- El dataset nos hereda otras limitaciones como por ejemplo: el modelo pierde rendimiento en textos cortos, esto es debido a que la mayoria de los textos utilizados en el dataset tienen una longitud larga de entre 200 - 500 palabras. Nuevamente se intentó mitigar estas limitaciones con la inclusión de textos cortos.
 ### Recommendations
-- Como hemos mencionado, el modelo tiende a bajar el rendimiento en textos cortos, por lo que lo recomendable es establecer un criterio de selección de textos largos a los cuales se necesita identificar su temática.
 ## How to Get Started with the Model
@@ -158,8 +158,8 @@ The following hyperparameters were used during training:
 - num_epochs: 2
 #### Speeds, Sizes, Times
-El modelo fue entrenado en 2 epocas con una duración total de 14.22 minutos de entrenamiento, 'train_runtime': 853.6759.
-Como dato adicional: No se utilizó precision mixta (FP16 ó BF16)
 #### Resultados del entrenamiento:
@@ -211,12 +211,12 @@ Recall 0.99
 F1 score 0.951
 ## Environmental Impact
-Utilizando la herramienta de [ML CO2 IMPACT](https://mlco2.github.io/impact/#co2eq) calculamos que el siguiente impacto ambiental debido al entrenamiento:
--  **Tipo de hardware:** T4
--  **Horas utilizadas (incluye pruebas e iteraciones para mejorar el modelo):** 4 horas
--  **Proveedor de nube:** Google Cloud (colab)
--  **Región computacional:** us-east
--  **Huella de carbono emitida:** 0.1kg CO2
 ## Technical Specifications

 - **Fine-tuned from model:** [bertin-project/bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)
 - **Dataset used:** [somosnlp/spa_climate_detection](https://huggingface.co/datasets/somosnlp/spa_climate_detection)
+### Model resources
 - **Repository:** [somosnlp/bertin_base_climate_detection_spa](https://huggingface.co/somosnlp/bertin_base_climate_detection_spa/tree/main) <!-- Enlace al `main` del repo donde tengáis los scripts, i.e.: o del mismo repo del modelo en HuggingFace o a GitHub. -->
 - **Demo:** [identificacion de textos sobre cambio climatico y sustentabilidad](https://huggingface.co/spaces/somosnlp/Identificacion_de_textos_sobre_sustentabilidad_cambio_climatico)
 - The use for text classification of unverifiable or unreliable sources and their dissemination, e.g., fake news or disinformation.
 ## Bias, Risks, and Limitations
+No specific studies on biases and limitations have been carried out so far. The following observations are based on previous experience and on tests of the model:
+- It inherits the biases and limitations of the base model it was fine-tuned from; for details see [BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403). These biases are, however, harder to detect in a text classification task such as this one.
+- Direct biases, such as the predominance of formal language in the dataset, which draws mostly on news articles and corporate legal documents. This can make texts written in informal registers (e.g. colloquial language) harder to identify. To mitigate this, diverse opinions on climate change from sources such as social networks were added to the dataset, and the labels were rebalanced.
+- The model inherits other limitations from the dataset. For example, it loses performance on short texts because most texts in the dataset are long, between 200 and 500 words. We again tried to mitigate this by including short texts.
 ### Recommendations
+- As noted above, the model tends to perform worse on short texts, so it is advisable to apply a selection criterion that restricts classification to longer texts.
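
The selection criterion recommended above could look like the following minimal sketch. The helper name `split_by_length` and the 200-word threshold are assumptions for illustration (the threshold mirrors the lower end of the 200 to 500 word range reported for the dataset); the card itself does not prescribe a specific cutoff.

```python
def split_by_length(texts, min_words=200):
    """Split texts into (long_enough, too_short) by whitespace word count.

    min_words=200 is an assumed cutoff, not a value prescribed by the model card.
    """
    long_enough, too_short = [], []
    for text in texts:
        # Count words by splitting on whitespace; adequate for a rough filter.
        if len(text.split()) >= min_words:
            long_enough.append(text)
        else:
            too_short.append(text)
    return long_enough, too_short


texts = ["palabra " * 250, "texto corto"]
long_enough, too_short = split_by_length(texts)
```

Only the texts in `long_enough` would then be passed to the classifier, while those in `too_short` could be reviewed manually or concatenated with surrounding context first.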
 ## How to Get Started with the Model
 - num_epochs: 2
 #### Speeds, Sizes, Times
+The model was trained for 2 epochs, with a total training time of 14.22 minutes (`'train_runtime': 853.6759`, in seconds).
+Additional information: mixed precision (FP16 or BF16) was not used.
 #### Resultados del entrenamiento:
 F1 score 0.951
 ## Environmental Impact
+Using the [ML CO2 IMPACT](https://mlco2.github.io/impact/#co2eq) calculator, we estimate the following environmental impact from training:
+-  **Hardware type:** T4
+-  **Total hours (including iterations and tests):** 4 hours
+-  **Cloud provider:** Google Cloud (Colab)
+-  **Compute region:** us-east
+-  **Carbon footprint:** 0.1 kg CO2
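
As a rough sanity check on the footprint figure, the usual estimate multiplies power draw, runtime, and grid carbon intensity. The sketch below assumes the T4's 70 W board power as the draw and an illustrative carbon intensity of 0.37 kg CO2eq per kWh for the region; both the intensity value and the idea that the calculator applies exactly this formula are assumptions, not statements from the card.

```python
def estimate_co2_kg(power_watts, hours, kg_co2_per_kwh):
    """Estimate emissions as energy used (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


T4_POWER_WATTS = 70          # NVIDIA T4 board power, used as an upper bound on draw
HOURS_USED = 4               # from the card: total hours including tests
ASSUMED_INTENSITY = 0.37     # kg CO2eq per kWh; illustrative assumption for us-east

estimate = estimate_co2_kg(T4_POWER_WATTS, HOURS_USED, ASSUMED_INTENSITY)
print(f"{estimate:.2f} kg CO2eq")
```

Under these assumptions the estimate lands at roughly 0.1 kg, in line with the reported value.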
 ## Technical Specifications